Title: Optimal Data Collection for Machine Learning*
Machine learning refers to a collection of computational techniques for identifying or learning patterns in data. Although existing techniques are most effective on large data sets, there is growing interest in applying them to smaller ones. We consider the application of machine learning to predicting ambient sound levels in the contiguous United States from GIS data. The challenge is the limited availability of training data from which to construct a model, since data collection in this case is both costly and time-consuming. This leads us to consider two questions: first, how best to validate a machine learning model with limited training data, and second, whether additional data can measurably improve the accuracy of the model. We create an ensemble of models that perform equally well as measured by leave-one-out cross validation on our initial training set. However, these models give wildly different predictions for areas in the central region of the country. By collecting additional data in cropland areas in Utah, we were able to improve the predictions of our machine learning model for other, geographically similar regions of the country.
*National Science Foundation Grant 1557998, Brigham Young University Physics and Astronomy
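The validation idea described in the abstract (models that score alike under leave-one-out cross validation yet disagree away from the training data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three toy models, the one-dimensional inputs, and the function names `loocv_rmse` and `ensemble_spread` are hypothetical and are not the authors' models, features, or code.

```python
import math
from statistics import mean, pstdev


def fit_mean(xs, ys):
    """Toy model 1: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Toy model 2: ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Toy model 3: predict the target of the nearest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def loocv_rmse(fit, xs, ys):
    """Leave-one-out CV: hold out each point, refit on the rest, score it."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        squared_errors.append((model(xs[i]) - ys[i]) ** 2)
    return math.sqrt(mean(squared_errors))


def ensemble_spread(fits, xs, ys, x_new):
    """Population std. dev. of the ensemble's predictions at x_new."""
    predictions = [fit(xs, ys)(x_new) for fit in fits]
    return pstdev(predictions)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]   # made-up inputs
    ys = [0.1, 0.9, 2.2, 2.8]   # made-up targets
    fits = [fit_mean, fit_line, fit_nearest]
    for fit in fits:
        print(fit.__name__, "LOOCV RMSE:", loocv_rmse(fit, xs, ys))
    # Disagreement inside the sampled range versus far outside it.
    print("spread at x=1.5:", ensemble_spread(fits, xs, ys, 1.5))
    print("spread at x=10: ", ensemble_spread(fits, xs, ys, 10.0))
```

In a sketch like this, a large ensemble spread at an unsampled input is one plausible signal for where an additional measurement would be most informative, which is the spirit of the targeted data collection the abstract describes.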
Award ID(s):
1757998
PAR ID:
10106071
Journal Name:
Bulletin of the American Physical Society
Volume:
63
Issue:
16
ISSN:
0003-0503
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Coarse-graining techniques play an essential role in accelerating molecular simulations of systems with large length and time scales. Theoretically grounded bottom-up models are appealing due to their thermodynamic consistency with the underlying all-atom models. In this direction, machine learning approaches hold great promise for fitting complex many-body data. However, training models may require collection of large amounts of expensive data. Moreover, quantifying trained model accuracy is challenging, especially in cases of non-trivial free energy configurations, where training data may be sparse. We demonstrate a path towards uncertainty-aware models of coarse-grained free energy surfaces. Specifically, we show that principled Bayesian model uncertainty allows for efficient data collection through an on-the-fly active learning framework and opens the possibility of adaptive transfer of models across different chemical systems. Uncertainties also characterize models’ accuracy of free energy predictions, even when training is performed only on forces. This work helps pave the way towards efficient autonomous training of reliable and uncertainty-aware many-body machine-learned coarse-grained models.
  2. Andreas Krause, Emma Brunskill (Ed.)
    Differentially private (DP) machine learning techniques are notorious for their degradation of model utility (e.g., they degrade classification accuracy). A recent line of work has demonstrated that leveraging public data can improve the trade-off between privacy and utility when training models with DP guarantees. In this work, we further explore the potential of using public data in DP models, showing that utility gains can in fact be significantly higher than what was shown in prior works. Specifically, we introduce DOPE-SGD, a modified DP-SGD algorithm that leverages public data during its training. DOPE-SGD uses public data in two complementary ways: (1) it uses advanced augmentation techniques that leverage public data to generate synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a modified gradient clipping mechanism (gradient clipping is a standard technique in DP training) to change the origin of gradient vectors using the information inferred from available public and synthetic data, thereby boosting utility. We also introduce a technique to ensemble intermediate DP models by leveraging the post-processing property of differential privacy to further improve the accuracy of the predictions. Our experimental results demonstrate the effectiveness of our approach in improving the state of the art in DP machine learning across multiple datasets, network architectures, and application domains. For instance, assuming access to 2,000 public images, and for a privacy budget of ε = 2, δ = 10^-5, our technique achieves an accuracy of 75.1 on CIFAR10, significantly higher than the 68.1 achieved by the state of the art.
  3. This paper studies how to improve the accuracy of hydrologic models using machine-learning models as post-processors and presents possibilities to reduce the workload to create an accurate hydrologic model by removing the calibration step. It is often challenging to develop an accurate hydrologic model due to the time-consuming model calibration procedure and the nonstationarity of hydrologic data. Our findings show that the errors of hydrologic models are correlated with model inputs. Thus motivated, we propose a modeling-error-learning-based post-processor framework by leveraging this correlation to improve the accuracy of a hydrologic model. The key idea is to predict the differences (errors) between the observed values and the hydrologic model predictions by using machine-learning techniques. To tackle the nonstationarity issue of hydrologic data, a moving-window-based machine-learning approach is proposed to enhance the machine-learning error predictions by identifying the local stationarity of the data using a stationarity measure developed based on the Hilbert–Huang transform. Two hydrologic models, the Precipitation–Runoff Modeling System (PRMS) and the Hydrologic Modeling System (HEC-HMS), are used to evaluate the proposed framework. Two case studies are provided to exhibit the improved performance over the original model using multiple statistical metrics.
  4. Monitoring and managing groundwater resources is critical for sustaining livelihoods and supporting various human activities, including irrigation and drinking water supply. The most common method of monitoring groundwater is well water level measurements. These records can be difficult to collect and maintain, especially in countries with limited infrastructure and resources. However, long-term data collection is required to characterize and evaluate trends. To address these challenges, we propose a framework that uses data from the Gravity Recovery and Climate Experiment (GRACE) mission and downscaling models to generate higher-resolution (1 km) groundwater predictions. The framework is designed to be flexible, allowing users to implement any machine learning model of interest. We selected four models: a deep learning model, gradient tree boosting, a multi-layer perceptron, and a k-nearest neighbors regressor. To evaluate the effectiveness of the framework, we offer a case study of Sunflower County, Mississippi, using well data to validate the predictions. Overall, this paper contributes to the field of groundwater resource management by demonstrating a framework that uses remote sensing data and machine learning techniques to improve monitoring and management of this critical resource, especially for those who seek a faster way to begin using these datasets and applications.
  5. Wang, N. (Ed.)
    In education, intelligent learning environments allow students to choose how to tackle open-ended tasks while monitoring performance and behavior, allowing for the creation of adaptive support to help students overcome challenges. Timely feedback is critical to aid students’ progression toward learning and improved problem-solving. Feedback on text-based student responses can be delayed when teachers are overloaded with work. Automated evaluation can provide quick student feedback while easing the manual evaluation burden for teachers in areas with a high student-to-teacher ratio. Current methods of evaluating student essay responses to questions have included transformer-based natural language processing models with varying degrees of success. One main challenge in training these models is the scarcity of student-generated data. Larger volumes of training data are needed to create models that perform at a sufficient level of accuracy. Some studies have vast data, but large quantities are difficult to obtain when educational studies involve student-generated text. To overcome this data scarcity issue, text augmentation techniques have been employed to balance and expand the data set so that models can be trained with higher accuracy, leading to more reliable evaluation and categorization of student answers to aid teachers in the student’s learning progression. This paper examines the text-generating AI model, GPT-3.5, to determine if prompt-based text-generation methods are viable for generating additional text to supplement small sets of student responses for machine learning model training. We augmented student responses across two domains using GPT-3.5 completions and used that data to train a multilingual BERT model. Our results show that text generation can improve model performance on small data sets over simple self-augmentation.