Language documentation encompasses translation, typically into the dominant high-resource language in the region where the target language is spoken. To make data accessible to a broader audience, additional translation into other high-resource languages might be needed. Working within a project documenting Kotiria, we explore the extent to which state-of-the-art machine translation (MT) systems can support this second translation – in our case from Portuguese to English. This translation task is challenging for multiple reasons: (1) the data is out-of-domain with respect to the MT system’s training data, (2) much of the data is conversational, (3) existing translations include non-standard and uncommon expressions, often reflecting properties of the documented language, and (4) the data includes borrowings from other regional languages. Despite these challenges, existing MT systems perform at a usable level, though there is still room for improvement. We then conduct a qualitative analysis and suggest ways to improve MT between high-resource languages in a language documentation setting. 
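The abstract does not name an evaluation metric, but a character n-gram F-score in the spirit of chrF is one common way a documentation team might gauge whether MT output is "at a usable level". A minimal pure-Python sketch (an illustration only, not the reference sacreBLEU implementation):

```python
from collections import Counter

def char_ngrams(text, n):
    """Counter of character n-grams, ignoring spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_like(hyp, ref, max_n=3, beta=2.0):
    """Rough character n-gram F-score in the spirit of chrF:
    average n-gram precision and recall over n = 1..max_n,
    then combine them with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / max_n
    rec = sum(recalls) / max_n
    if p + rec == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * rec / (b2 * p + rec)
```

Character-level matching is forgiving of the non-standard spellings and borrowings the abstract describes, which is why chrF-style scores are often preferred over word-level BLEU in such settings.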
                    
                            
                            Towards Domain-Agnostic and Domain-Adaptive Dementia Detection from Spoken Language
                        
                    
    
Health-related speech datasets are often small and varied in focus, which makes it difficult to leverage them effectively in support of healthcare goals. Robust transfer of linguistic features across different datasets oriented toward the same goal has the potential to address this concern. To test this hypothesis, we experiment with domain adaptation (DA) techniques on heterogeneous spoken language data to evaluate generalizability across diverse datasets for a common task: dementia detection. We find that adapted models perform better across both conversational and task-oriented datasets. The feature-augmented DA method achieves a 22% increase in accuracy when adapting from a conversational to a task-specific dataset, compared to a jointly trained baseline. This suggests that these techniques have promising capacity to make productive use of disparate data for a complex spoken-language healthcare task.
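The abstract does not spell out the feature-augmented DA method; a widely used instance of feature augmentation for DA is the "frustratingly easy" scheme of Daumé III (2007), in which each feature vector is copied into a shared block plus one domain-specific block. A sketch under that assumption:

```python
import numpy as np

def augment_features(X, domain, n_domains):
    """EasyAdapt-style feature augmentation (in the spirit of
    Daume III, 2007): copy each feature vector into a shared block
    plus one domain-specific block; the blocks belonging to the
    other domains stay zero."""
    n, d = X.shape
    out = np.zeros((n, d * (n_domains + 1)))
    out[:, :d] = X               # shared copy, seen by all domains
    start = d * (domain + 1)
    out[:, start:start + d] = X  # copy only this domain "sees"
    return out
```

A linear classifier trained on the augmented space can learn shared weights from all datasets and domain-specific corrections from each, which is one plausible reading of "feature-augmented DA"; the paper's actual variant may differ.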
        
    
- Award ID(s): 2125411
- PAR ID: 10544281
- Publisher / Repository: Association for Computational Linguistics
- Date Published:
- Edition / Version: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Page Range / eLocation ID: 11965 to 11978
- Format(s): Medium: X
- Location: Toronto, Canada
- Sponsoring Org: National Science Foundation
More Like this
- Conversational Recommendation Systems (CRS) are a rapidly growing research area that has gained significant attention alongside advances in language modelling techniques. However, the current state of conversational recommendation faces numerous challenges due to its relative novelty and the limited body of existing work. In this study, we examine benchmark datasets for developing CRS models and address potential biases arising from the feedback loop inherent in multi-turn interactions, including selection bias and multiple variants of popularity bias. Drawing inspiration from the success of generative data produced by language models and of data augmentation techniques, we present two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model performance while mitigating biases. Through extensive experiments on the ReDial and TG-ReDial benchmark datasets, we show consistent improvement of CRS techniques with our data augmentation approaches and offer additional insights on addressing multiple newly formulated biases.
- Physical systems are extending their monitoring capacities to edge areas with low-cost, low-power sensors and advanced data mining and machine learning techniques. However, new systems often have limited data for training the model, calling for effective knowledge transfer from other relevant grids. Specifically, Domain Adaptation (DA) seeks domain-invariant features to boost model performance in the target domain. Nonetheless, existing DA techniques face significant challenges due to the unique characteristics of physical datasets: (1) complex spatial-temporal correlations, (2) diverse data sources including node/edge measurements and labels, and (3) large-scale data sizes. In this paper, we propose a novel cross-graph DA method based on two core designs: graph kernels and graph coarsening. The former handles spatial-temporal correlations and can conveniently incorporate networked measurements and labels. The spatial structures, temporal trends, measurement similarity, and label information together determine the similarity of two graphs, guiding the DA to find domain-invariant features. Mathematically, we construct a Graph kerNel-based distribution Adaptation (GNA) method with a specially designed graph kernel. We then prove that the proposed kernel is positive definite and universal, which strictly guarantees the feasibility of the DA measure used. However, the computational cost of the kernel is prohibitive for large systems. In response, we propose a novel coarsening process that yields much smaller graphs for GNA. Finally, we demonstrate the superiority of GNA across diverse systems, including power systems, mass-damper systems, and human-activity sensing systems.
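The paper's GNA relies on a bespoke graph kernel that is not reproduced here; as a hedged illustration of the underlying idea of kernel-based distribution adaptation, the squared Maximum Mean Discrepancy (MMD) under a standard RBF kernel measures how far apart two domains' samples are in the kernel's feature space:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(Xs, Xt, gamma=1.0):
    """Squared Maximum Mean Discrepancy between source and target
    samples under an RBF kernel: near zero when the two sample sets
    come from the same distribution, large when they differ."""
    k_ss = rbf_kernel(Xs, Xs, gamma).mean()
    k_tt = rbf_kernel(Xt, Xt, gamma).mean()
    k_st = rbf_kernel(Xs, Xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

Minimizing such a discrepancy while training drives the learned features toward domain invariance; GNA's contribution, per the abstract, is replacing the vector kernel with a graph kernel that also sees spatial structure, temporal trends, and labels.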
- Paaßen, Benjamin; Demmans Epp, Carrie (Eds.)
  One of the areas where Large Language Models (LLMs) show promise is automated qualitative coding, typically framed as a text classification task in natural language processing (NLP). Their demonstrated ability to leverage in-context learning to operate well even in data-scarce settings raises the question of whether collecting and annotating large-scale data for training qualitative coding models is still beneficial. In this paper, we empirically investigate the performance of LLMs designed for prompting-based in-context learning settings and compare them to models trained under the traditional pretraining--finetuning paradigm with task-specific annotated data, specifically for tasks involving qualitative coding of classroom dialog. Compared to other domains where NLP studies are typically situated, classroom dialog is much more natural and therefore messier. Moreover, tasks in this domain are nuanced, theoretically grounded, and require a deep understanding of the conversational context. We provide a comprehensive evaluation across five datasets, including tasks such as talk move prediction and collaborative problem solving skill identification. Our findings show that task-specific finetuning strongly outperforms in-context learning, underscoring the continuing need for high-quality annotated training datasets.
- Personalized Federated Learning via Domain Adaptation with an Application to Distributed 3D Printing
  Over the years, Internet of Things (IoT) devices have become more powerful. This presents a unique opportunity to exploit local computing resources to distribute model learning and circumvent the need to share raw data. The underlying distributed, privacy-preserving data analytics approach is often termed federated learning (FL). A key challenge in FL is the heterogeneity across local datasets. In this article, we propose a new personalized FL model, PFL-DA, that adopts the philosophy of domain adaptation. PFL-DA tackles two sources of data heterogeneity at the same time: covariate and concept shift across local devices. We show, both theoretically and empirically, that PFL-DA overcomes intrinsic shortcomings of state-of-the-art FL approaches and is able to borrow strength across devices while allowing each to retain its own personalized model. As a case study, we apply PFL-DA to distributed desktop 3D printing, where we obtain more accurate predictions of printing speed, which can help improve printer efficiency.
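PFL-DA's exact formulation is not given in the abstract; the following sketch combines two standard ingredients of personalized FL (a FedAvg aggregation step, and a local personalization step with a proximal pull back toward the global model) purely as an illustration. The `personalize` function and its `lam`, `lr`, `steps` parameters are assumptions for a linear model, not the paper's method:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation step: size-weighted average of the
    clients' parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

def personalize(global_w, X, y, lam=0.1, lr=0.1, steps=300):
    """Hypothetical personalization step for a linear model: local
    least-squares gradient descent with a proximal penalty
    lam/2 * ||w - global_w||^2 pulling the client back toward the
    shared model, so each device keeps its own weights while still
    borrowing strength from the federation."""
    w = global_w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + lam * (w - global_w)
        w -= lr * grad
    return w
```

Each client would contribute to the shared `fedavg` model but deploy its own `personalize` output, the general shape of a personalized-FL scheme; PFL-DA additionally models covariate and concept shift explicitly.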
 An official website of the United States government