While machine learning approaches to visual emotion recognition offer great promise, current methods consider training and testing models on small scale datasets covering limited visual emotion concepts. Our analysis identifies an important but long overlooked issue of existing visual emotion benchmarks in the form of dataset biases. We design a series of tests to show and measure how such dataset biases obstruct learning a generalizable emotion recognition model. Based on our analysis, we propose a webly supervised approach by leveraging a large quantity of stock image data. Our approach uses a simple yet effective curriculum guided training strategy for learning discriminative emotion features. We discover that the models learned using our large scale stock image dataset exhibit significantly better generalization ability than the existing datasets without the manual collection of even a single label. Moreover, visual representation learned using our approach holds a lot of promise across a variety of tasks on different image and video datasets.
Clustering analysis of inputs to a geospatial model of outdoor ambient sound
Outdoor ambient acoustical environments may be predicted through supervised machine learning using geospatial features as inputs. However, collecting sufficient training data is an expensive process, particularly when attempting to improve the accuracy of models based on supervised learning methods over large, geospatially diverse regions. Unsupervised machine learning methods, such as K-Means clustering analysis, enable a statistical comparison between the geospatial diversity represented in the current training dataset versus the predictor locations. In this case, the geospatial features that represent the regions of western North Carolina and Utah have been simultaneously clustered to examine the common clusters between the two locations. Initial results show that most geospatial clusters group themselves according to a relatively small number of prominent geospatial features, and that Utah requires appreciably more clusters to represent its geospace. Additionally, the training dataset has a relatively low geospatial diversity because most of the current training data sites reside in a small number of clusters. This analysis informs a choice of new site locations for data acquisition that maximize the statistical similarity of the training and input datasets.
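The comparison described in this abstract can be illustrated with a minimal sketch. It assumes synthetic two-feature data and a plain Lloyd's k-means written with the Python standard library; the feature values, the cluster count `K`, and helper names such as `blob` and `cluster_shares` are hypothetical and are not taken from the authors' data or code. Training sites and predictor locations are clustered jointly, and the share of each dataset that falls in each cluster is then compared.

```python
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


def cluster_shares(labels, k):
    """Fraction of the given labels that fall in each cluster 0..k-1."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(k)]


# Synthetic, normalized features (hypothetical stand-ins for geospatial layers).
data_rng = random.Random(42)


def blob(center, n, spread=0.05):
    return [tuple(data_rng.gauss(m, spread) for m in center) for _ in range(n)]


# Predictor locations cover three kinds of terrain; training sites mostly one.
predictor = blob((0.2, 0.8), 100) + blob((0.8, 0.2), 100) + blob((0.8, 0.8), 100)
training = blob((0.2, 0.8), 18) + blob((0.8, 0.2), 2)

K = 3
centroids, labels = kmeans(training + predictor, K)
train_share = cluster_shares(labels[:len(training)], K)
pred_share = cluster_shares(labels[len(training):], K)

# A positive gap marks a cluster that is under-sampled by the training sites.
gaps = [p - t for p, t in zip(pred_share, train_share)]
for c in sorted(range(K), key=lambda c: -gaps[c]):
    print(f"cluster {c}: training {train_share[c]:.2f}, "
          f"predictor {pred_share[c]:.2f}, gap {gaps[c]:+.2f}")
```

Under these assumptions, clusters with the largest positive gap would be candidates for new data-acquisition sites, since they hold a larger share of the predictor locations than of the training sites. Real geospatial features would need scaling and a principled choice of cluster count before such a comparison is meaningful.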
- Award ID(s): 1757998
- PAR ID: 10106067
- Date Published:
- Journal Name: Bulletin of the American Physical Society
- Volume: 63
- Issue: 16
- ISSN: 0003-0503
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Knowing whether a published research result can be replicated is important, but carrying out direct replication of published research incurs a high cost. There have been efforts to use machine-learning-aided methods to predict the replicability of scientific claims. However, existing approaches use only hand-extracted statistical features such as p-value and sample size, without utilizing the text of research papers, and they train on only a very small amount of annotated data without making use of the large number of unlabeled articles. Therefore, it is desirable to develop effective methods that automatically extract text information as features, so that prediction can benefit from Natural Language Processing techniques. We also aim for an approach that benefits from both labeled data and the large amount of unlabeled data. In this paper, we propose two weakly supervised learning approaches that use automatically extracted text information from research papers to improve the prediction accuracy of research replication using both labeled and unlabeled datasets. Our experiments on real-world datasets show that our approaches obtain much better prediction performance than supervised models that use only statistical features and a small labeled dataset. Further, we are able to achieve an accuracy of 75.76% for predicting the replicability of research.
-
According to the World Health Organization, healthy communities rely on well-functioning ecosystems. Clean air, fresh water, and nutritious food are inextricably linked to ecosystem health. Changes in biological activity convey important information about ecosystem dynamics, and understanding such changes is crucial for the survival of our species. Scientific edge cyberinfrastructures collect distributed data and process it in situ, often using machine learning algorithms. Most current machine learning algorithms deployed on edge cyberinfrastructures, however, are trained on data that does not accurately represent the real stream of data collected at the edge. In this work we explore the applicability of two new self-supervised learning algorithms for characterizing an insufficiently curated, imbalanced, and unlabeled dataset collected by using a set of nine microphones at different locations at the Morton Arboretum, an internationally recognized tree-focused botanical garden and research center in Lisle, IL. Our implementations showed completely autonomous characterization capabilities, such as the separation of spectrograms by recording location, month, week, and hour of the day. The models also showed the ability to discriminate spectrograms by biological and atmospheric activity, including rain, insects, and bird activity, in a completely unsupervised fashion. We validated our findings using a supervised deep learning approach and with a dataset labeled by experts, confirming competitive performance in several features. Toward explainability of our self-supervised learning approach, we used acoustic indices and false color spectrograms, showing that the topology and orientation of the clouds of points in the output space over a 24-h period are strongly linked to the unfolding of biological activity. 
Our findings show that self-supervised learning has the potential to learn from and process data collected at the edge, characterizing it with minimal human intervention. We believe that further research is crucial to extending this approach for complete autonomous characterization of raw data collected on distributed sensors at the edge.
-
This study employs supervised machine learning algorithms to test whether locomotive features during exploratory activity in open-field arenas can serve as predictors of the genotype of fruit flies. Because of the nonlinearity in locomotive trajectories, traditional statistical methods used to compare exploratory activity between genotypes of fruit flies may not reveal all insights. Ten-minute-long trajectories of four different genotypes of fruit flies in an open-field arena were captured. Turn-angle and step-size features extracted from the trajectories were used to train supervised learning models to predict the genotype of the fruit flies. Using the first five minutes of each trajectory, an accuracy of 83% was achieved in differentiating wild-type flies from three other mutant genotypes. Using the final five minutes or the entire ten-minute duration decreased performance, indicating that most of the variation between genotypes in exploratory activity is exhibited in the first few minutes. Feature importance analysis revealed that turn angle is a better predictor of fruit fly genotype than step size. Overall, this study demonstrates that features of trajectories can be used to predict the genotype of fruit flies through supervised machine learning methods.
-
In the field of materials science, microscopy is the first and often only accessible method for structural characterization. There is a growing interest in the development of machine learning methods that can automate the analysis and interpretation of microscopy images. Typically, training machine learning models requires large numbers of images with associated structural labels; however, manual labeling of images requires domain knowledge and is prone to human error and subjectivity. To overcome these limitations, we present a semi-supervised transfer learning approach that uses a small number of labeled microscopy images for training and performs as effectively as methods trained on significantly larger image datasets. Specifically, we train an image encoder with unlabeled images using self-supervised learning methods and use that encoder for transfer learning of different downstream image tasks (classification and segmentation) with a minimal number of labeled images for training. We test the transfer learning ability of two self-supervised learning methods, SimCLR and Barlow-Twins, on transmission electron microscopy (TEM) images. We demonstrate in detail how this machine learning workflow applied to TEM images of protein nanowires enables automated classification of nanowire morphologies (e.g., single nanowires, nanowire bundles, phase separated) as well as segmentation tasks that can serve as groundwork for quantification of nanowire domain sizes and shape analysis. We also extend the application of the machine learning workflow to classification of nanoparticle morphologies and identification of different types of viruses from TEM images.

