Huge amounts of data generated on social media during emergency situations is regarded as a trove of critical information. The use of supervised machine learning techniques in the early stages of a crisis is challenged by the lack of labeled data for that event. Furthermore, supervised models trained on labeled data from a prior crisis may not produce accurate results, due to inherent crisis variations. To address these challenges, the authors propose a hybrid feature-instance-parameter adaptation approach based on matrix factorization, k-nearest neighbors, and self-training. The proposed feature-instance adaptation selects a subset of the source crisis data that is representative for the target crisis data. The selected labeled source data, together with unlabeled target data, are used to learn self-training domain adaptation classifiers for the target crisis. Experimental results have shown that overall the hybrid domain adaptation classifiers perform better than the supervised classifiers learned from the original source data. 
                        more » 
                        « less   
                    
                            
                            Domain Adaptation for Crisis Data Using Correlation Alignment and Self-Training
                        
                    
    
            Domain adaptation methods have been introduced for auto-filtering disaster tweets to address the issue of lacking labeled data for an emerging disaster. In this article, the authors present and compare two simple, yet effective approaches for the task of classifying disaster-related tweets. The first approach leverages the unlabeled target disaster data to align the source disaster distribution to the target distribution, and, subsequently, learns a supervised classifier from the modified source data. The second approach uses the strategy of self-training to iteratively label the available unlabeled target data, and then builds a classifier as a weighted combination of source and target-specific classifiers. Experimental results using Naïve Bayes as the base classifier show that both approaches generally improve performance as compared to baseline. Overall, the self-training approach gives better results than the alignment-based approach. Furthermore, combining correlation alignment with self-training leads to better result, but the results of self-training are still better. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1741345
- PAR ID:
- 10204866
- Date Published:
- Journal Name:
- International Journal of Information Systems for Crisis Response and Management
- Volume:
- 10
- Issue:
- 4
- ISSN:
- 1937-9390
- Page Range / eLocation ID:
- 1 to 20
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Domain adaptation addresses the challenge where the distribution of target inference data differs from that of the source training data. Recently, data privacy has become a significant constraint, limiting access to the source domain. To mitigate this issue, Source-Free Domain Adaptation (SFDA) methods bypass source domain data by generating source-like data or pseudo-labeling the unlabeled target domain. However, these approaches often lack theoretical grounding. In this work, we provide a theoretical analysis of the SFDA problem, focusing on the general empirical risk of the unlabeled target domain. Our analysis offers a comprehensive understanding of how representativeness, generalization, and variety contribute to controlling the upper bound of target domain empirical risk in SFDA settings. We further explore how to balance this trade-off from three perspectives: sample selection, semantic domain alignment, and a progressive learning framework. These insights inform the design of novel algorithms. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on three benchmark datasets--Office-Home, DomainNet, and VisDA-C--yielding relative improvements of 3.2%, 9.1%, and 7.5%, respectively, over the representative SFDA method, SHOT.more » « less
- 
            Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. We systematically benchmark state-of-the-art methods that use unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development, we provide an open-source package that automates data loading and contains the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.more » « less
- 
            Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. We systematically benchmark state-of-the-art methods that use unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development, we provide an open-source package that automates data loading and contains the model architectures and methods used in this paper.more » « less
- 
            Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as the evaluation metrics. On these datasets, we systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at this https URL.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    