Title: Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features
Machine learning has demonstrated remarkable prediction accuracy on i.i.d. data, but accuracy often drops when models are tested on data from another distribution. In this paper, we offer another view of this problem, assuming that the accuracy drop is caused by models relying on features that are not well aligned with what a data annotator considers similar across the two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup, using knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in the robust machine learning literature. We also compare the empirical strength of these methods and demonstrate the performance when these previous techniques are combined. An implementation is available.
Award ID(s):
2204808 2150012
PAR ID:
10385714
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Conference on Uncertainty in Artificial Intelligence (UAI)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease with which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that applying these α-molecule techniques to extract features from image representations of haplotype alignments yields a high true positive rate and accuracy in detecting hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling that of contemporary deep learning approaches for detecting sweeps.
  2. Existing studies have demonstrated that, using traditional machine learning techniques, phishing detection based simply on the features of URLs can be very effective. In this paper, we explore the deep learning approach and build four RNN (Recurrent Neural Network) models that use only lexical features of URLs for detecting phishing attacks. We collect 1.5 million URLs as the dataset and show that our RNN models can achieve a detection accuracy higher than 99% without any expert knowledge to manually identify the features. However, it is well known that RNNs and other deep learning techniques are still largely black boxes. Understanding the internals of deep learning models is important and highly desirable for improving and properly applying the models. Therefore, in this work, we further develop several unique visualization techniques to interpret in depth how RNN models work internally in achieving the outstanding phishing detection performance. In particular, we identify and answer six important research questions, showing that our four RNN models (1) are complementary to each other and can be combined into an ensemble model with even better accuracy, (2) can capture well the relevant features that were manually extracted and used in the traditional machine learning approach for phishing detection, and (3) can help identify useful new features to enhance the accuracy of the traditional machine learning approach. Our techniques and experience in this work could help researchers effectively apply deep learning techniques to other real-world security or privacy problems.
  3. X-ray CT imaging provides a 3D view of a sample and is a powerful tool for investigating the internal features of porous rock. Reliable phase segmentation in these images is highly necessary but, like any other digital rock imaging technique, is time-consuming, labor-intensive, and subjective. Combining 3D X-ray CT imaging with machine learning methods that can simultaneously consider several extracted features in addition to color attenuation is a promising and powerful method for reliable phase segmentation. Machine learning-based phase segmentation of X-ray CT images enables faster data collection and interpretation than traditional methods. This study investigates the performance of several filtering techniques with three machine learning methods and a deep learning method to assess the potential for reliable feature extraction and pixel-level phase segmentation of X-ray CT images. Features were first extracted from images using well-known filters and from the second convolutional layer of the pre-trained VGG16 architecture. Then, K-means clustering, Random Forest, and Feed Forward Artificial Neural Network methods, as well as the modified U-Net model, were applied to the extracted input features. The models' performances were then compared and contrasted to determine the influence of the machine learning method and input features on reliable phase segmentation. The results showed that considering more dimensions yields promising results and that all classification algorithms achieve high accuracy, ranging from 0.87 to 0.94. Feature-based Random Forest demonstrated the best performance among the machine learning models, with an accuracy of 0.88 for Mancos and 0.94 for Marcellus. The U-Net model with the linear combination of focal and dice loss also performed well, with accuracies of 0.91 and 0.93 for Mancos and Marcellus, respectively. In general, considering more features provided promising and reliable segmentation results that are valuable for analyzing the composition of dense samples, such as shales, which are significant unconventional reservoirs in oil recovery.
  4. Zelinski, Michael E.; Taha, Tarek M.; Howe, Jonathan (Ed.)
    Image classification forms an important class of problems in machine learning and is widely used in many real-world applications, such as medicine, ecology, astronomy, and defense. Convolutional neural networks (CNNs) are machine learning techniques designed for inputs with grid structures, e.g., images, whose features are spatially correlated. As such, CNNs have been demonstrated to be highly effective approaches for many image classification problems and have consistently outperformed other approaches in many image classification and object detection competitions. A particular challenge in using machine learning for classifying images is measurement data loss in the form of missing pixels, which occurs in settings where scene occlusions are present or where the photodetectors in the imaging system are partially damaged. In such cases, the performance of CNN models tends to deteriorate or become unreliable even when the perturbations to the input image are small. In this work, we investigate techniques for improving the performance of CNN models for image classification with missing data. In particular, we explore training on a variety of data alterations that mimic data loss to produce more robust classifiers. By optimizing the categorical cross-entropy loss function, we demonstrate through numerical experiments on the MNIST dataset that training with these synthetic alterations can enhance the classification accuracy of our CNN models.
  5. Algorithms for the estimation of gaze direction from mobile and video-based eye trackers typically involve tracking a feature of the eye that moves through the eye camera image in a way that covaries with the shifting gaze direction, such as the center or boundaries of the pupil. Tracking these features using traditional computer vision techniques can be difficult due to partial occlusion and environmental reflections. Although recent efforts to use machine learning (ML) for pupil tracking have demonstrated superior results when evaluated using standard measures of segmentation performance, little is known of how these networks may affect the quality of the final gaze estimate. This work provides an objective assessment of the impact of several contemporary ML-based methods for eye feature tracking when the subsequent gaze estimate is produced using either feature-based or model-based methods. Metrics include the accuracy and precision of the gaze estimate, as well as drop-out rate. 