Optimal Data Collection for Machine Learning*

Gaza, Casie; Transtrum, Mark; Gee, Kent; Pedersen, Katrina; Butler, Brooks

Machine learning refers to a collection of computational techniques for identifying or learning patterns in data. Although existing techniques are most effective on large data sets, there is growing interest in applying these methods to smaller ones. We consider the application of machine learning to predicting ambient sound levels in the contiguous United States from GIS data. The challenge is the limited availability of training data from which to construct a model: data collection in this case is both costly and time-consuming. This leads us to consider two questions: first, how best to validate a machine learning model with limited training data, and second, whether additional data can measurably improve the accuracy of the model. We create an ensemble of models that perform equally well as measured by leave-one-out cross validation on our initial training set. However, these models give wildly different predictions for areas in the central region of the country. By collecting additional data in cropland areas in Utah, we were able to improve the predictions of our machine learning model for other, geographically similar regions of the country. *National Science Foundation Grant 1557998, Brigham Young University Physics and Astronomy
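The validation idea in the abstract (models that score alike under leave-one-out cross validation yet disagree where no data exist) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which model types were used, so ridge regression with several regularization strengths stands in for the ensemble, the data are synthetic rather than real GIS or acoustic measurements, and names such as `loo_rmse`, `X_new`, and the 1.5x tolerance are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; returns the weight vector."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def loo_rmse(X, y, alpha):
    """Leave-one-out RMSE: refit with each sample held out in turn."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w = fit_ridge(X[keep], y[keep], alpha)
        errors.append(X[i] @ w - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

rng = np.random.default_rng(0)

# Synthetic stand-in for (GIS features, sound level) pairs; not real data.
X_train = rng.normal(size=(30, 5))
y_train = X_train @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

# Hypothetical unsampled sites whose features lie away from the training set.
X_new = rng.normal(loc=3.0, size=(10, 5))

# Score each candidate model by leave-one-out cross validation.
alphas = [0.01, 0.1, 1.0, 10.0]
scores = {a: loo_rmse(X_train, y_train, a) for a in alphas}

# Keep the models whose score is close to the best (assumed tolerance).
best = min(scores.values())
ensemble = [a for a in alphas if scores[a] <= 1.5 * best]

# Disagreement among ensemble members at each unsampled site.
preds = np.array([X_new @ fit_ridge(X_train, y_train, a) for a in ensemble])
spread = preds.std(axis=0)

# The site with the largest spread is one plausible place to measure next.
candidate = int(np.argmax(spread))
```

Under these assumptions, a large value in `spread` flags a site where similarly validated models diverge, which is one possible way to prioritize new measurements; the abstract does not state that the authors chose their sites this way.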

Award ID(s): 1757998

PAR ID: 10106071

Author(s) / Creator(s): Gaza, Casie; Transtrum, Mark; Gee, Kent; Pedersen, Katrina; Butler, Brooks

Date Published: 2018-10-01

Journal Name: Bulletin of the American Physical Society

Volume: 63

Issue: 16

ISSN: 0003-0503

Format(s): Medium: X

Sponsoring Org: National Science Foundation
