
Title: Using Machine Learning to Aid in Data Classification: Classifying Occupation Compatibility with Highly Automated Vehicles
Data classification is central to human factors research, and manual data classification is tedious and error prone. Supervised learning enables analysts to train an algorithm by manually classifying a few cases and then have that algorithm classify many cases. However, algorithms often fail to leverage human insight. To address this, we augment supervised learning with unsupervised learning and data visualization. Unsupervised learning highlights potential classification errors, explains the underlying classification, and identifies additional cases that merit manual classification. We illustrate this using the Occupational Information Network database to classify occupations as having tasks that might be performed in an automated vehicle.
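The workflow described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: the feature vectors, case names, and labels are invented toy data (not Occupational Information Network variables), the supervised step is a simple nearest-centroid classifier, and the unsupervised step is a two-cluster k-means with a deterministic start. All function names are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical toy cases: name -> invented 2-D feature vector.
cases = {
    "A": (0.90, 0.80), "B": (0.80, 0.90), "C": (0.85, 0.70), "D": (0.10, 0.20),
    "E": (0.20, 0.10), "F": (0.15, 0.30), "G": (0.70, 0.75), "H": (0.25, 0.20),
}
# A few manual classifications; "C" is deliberately mislabelled for illustration.
manual = {"A": "compatible", "B": "compatible", "C": "incompatible",
          "D": "incompatible", "E": "incompatible"}

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(p[i] for p in points) / len(points) for i in range(len(points[0])))

def nearest(point, centroids):
    """Key of the centroid closest to point (Euclidean distance)."""
    return min(centroids, key=lambda k: dist(point, centroids[k]))

def fit_nearest_centroid(features, labels):
    """Supervised step: one centroid per manually assigned label."""
    by_label = {}
    for name, label in labels.items():
        by_label.setdefault(label, []).append(features[name])
    return {label: mean(pts) for label, pts in by_label.items()}

def two_means(features, iterations=10):
    """Unsupervised step: k-means with k=2, seeded with the first case
    and the case farthest from it so the result is deterministic."""
    names = list(features)
    first = features[names[0]]
    far = max(features.values(), key=lambda p: dist(p, first))
    centroids = {0: first, 1: far}
    assignment = {}
    for _ in range(iterations):
        assignment = {n: nearest(features[n], centroids) for n in names}
        for c in (0, 1):
            members = [features[n] for n in names if assignment[n] == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return assignment

def flag_suspects(assignment, labels):
    """Manual labels that disagree with the majority label in their cluster."""
    flagged = []
    for cluster in sorted(set(assignment.values())):
        in_cluster = [n for n in labels if assignment[n] == cluster]
        if not in_cluster:
            continue
        majority = Counter(labels[n] for n in in_cluster).most_common(1)[0][0]
        flagged += [n for n in in_cluster if labels[n] != majority]
    return sorted(flagged)

centroids = fit_nearest_centroid(cases, manual)
predictions = {n: nearest(cases[n], centroids) for n in cases if n not in manual}
flagged = flag_suspects(two_means(cases), manual)
print(predictions)
print(flagged)
```

In this toy setup the clustering step flags the manual label that sits inside a cluster dominated by the other class, which is the kind of case an analyst might revisit before retraining.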
Journal Name: Ergonomics in Design: The Quarterly of Human Factors Applications
Sponsoring Org: National Science Foundation
More Like this
  1. Background: Cryo-electron microscopy (Cryo-EM) is widely used in the determination of the three-dimensional (3D) structures of macromolecules. Particle picking from 2D micrographs remains a challenging early step in the Cryo-EM pipeline due to the diversity of particle shapes and the extremely low signal-to-noise ratio of micrographs. Because of these issues, significant human intervention is often required to generate a high-quality set of particles for input to the downstream structure determination steps. Results: Here we propose a fully automated approach (DeepCryoPicker) for single particle picking based on deep learning. It first uses automated unsupervised learning to generate particle training datasets. Then it trains a deep neural network to classify particles automatically. Results indicate that DeepCryoPicker compares favorably with semi-automated methods such as DeepEM, DeepPicker, and RELION, with the significant advantage of not requiring human intervention. Conclusions: Our framework combining supervised deep learning classification with automated unsupervised clustering for generating training data provides an effective approach to pick particles in cryo-EM images automatically and accurately.
  2. Producing high-quality labeled data is a challenge in any supervised learning problem, where in many cases, human involvement is necessary to ensure the label quality. However, human annotations are not flawless, especially in the case of a challenging problem. In nontrivial problems, the high disagreement among annotators results in noisy labels, which affect the performance of any machine learning model. In this work, we consider three noise reduction strategies to improve the label quality in the Article-Comment Alignment Problem (ACAP), where the main task is to classify article-comment pairs according to their relevancy level. The first labeling disagreement reduction strategy utilizes annotators' background knowledge during the label aggregation step. The second strategy utilizes user disagreement during the training process. In the third and final strategy, we ask annotators to perform corrections and relabel the examples with noisy labels. We deploy these strategies and compare them to a resampling strategy for addressing class imbalance, another common supervised learning challenge. These alternatives were evaluated on ACAP, a multiclass text-pair classification problem with highly imbalanced data, where one of the classes represents at most 15% of the dataset's entire population. Our results provide evidence that the considered strategies can reduce disagreement between annotators. However, data quality improvement is insufficient to enhance classification accuracy in the article-comment alignment problem, which exhibits a high class imbalance. The model performance is enhanced for the same problem by addressing the imbalance issue with a weight loss-based class distribution resampling. We show that allowing the model to pay more attention to the minority class during the training process in the presence of noisy examples improves the test accuracy by 3%.

  3. Galaxy morphology is a fundamental quantity, which is essential not only for the full spectrum of galaxy-evolution studies, but also for a plethora of science in observational cosmology (e.g. as a prior for photometric-redshift measurements and as contextual data for transient light-curve classifications). While a rich literature exists on morphological-classification techniques, the unprecedented data volumes, coupled, in some cases, with the short cadences of forthcoming ‘Big-Data’ surveys (e.g. from the LSST), present novel challenges for this field. Large data volumes make such data sets intractable for visual inspection (even via massively distributed platforms like Galaxy Zoo), while short cadences make it difficult to employ techniques like supervised machine learning, since it may be impractical to repeatedly produce training sets on short time-scales. Unsupervised machine learning, which does not require training sets, is ideally suited to the morphological analysis of new and forthcoming surveys. Here, we employ an algorithm that performs clustering of graph representations, in order to group image patches with similar visual properties and objects constructed from those patches, like galaxies. We implement the algorithm on the Hyper-Suprime-Cam Subaru-Strategic-Program Ultra-Deep survey, to autonomously reduce the galaxy population to a small number (160) of ‘morphological clusters’, populated by galaxies with similar morphologies, which are then benchmarked using visual inspection. The morphological classifications (which we release publicly) exhibit a high level of purity, and reproduce known trends in key galaxy properties as a function of morphological type at z < 1 (e.g. stellar-mass functions, rest-frame colours, and the position of galaxies on the star-formation main sequence). Our study demonstrates the power of unsupervised machine learning in performing accurate morphological analysis, which will become indispensable in this new era of deep-wide surveys.
  4. We report a comprehensive computational study of unsupervised machine learning for extraction of chemically relevant information in X-ray absorption near edge structure (XANES) and in valence-to-core X-ray emission spectra (VtC-XES) for classification of a broad ensemble of sulphorganic molecules. By progressively decreasing the constraining assumptions of the unsupervised machine learning algorithm, moving from principal component analysis (PCA) to a variational autoencoder (VAE) to t-distributed stochastic neighbour embedding (t-SNE), we find improved sensitivity to steadily more refined chemical information. Surprisingly, when embedding the ensemble of spectra in merely two dimensions, t-SNE distinguishes not just oxidation state and general sulphur bonding environment but also the aromaticity of the bonding radical group with 87% accuracy as well as identifying even finer details in electronic structure within aromatic or aliphatic sub-classes. We find that the chemical information in XANES and VtC-XES is very similar in character and content, although they unexpectedly have different sensitivity within a given molecular class. We also discuss likely benefits from further effort with unsupervised machine learning and from the interplay between supervised and unsupervised machine learning for X-ray spectroscopies. Our overall results, i.e., the ability to reliably classify without user bias and to discover unexpected chemical signatures for XANES and VtC-XES, likely generalize to other systems as well as to other one-dimensional chemical spectroscopies.
  5. Grilli, Jacopo (Ed.)
    Collective behavior is an emergent property of numerous complex systems, from financial markets to cancer cells to predator-prey ecological systems. Characterizing modes of collective behavior is often done through human observation, training generative models, or other supervised learning techniques. Each of these cases requires knowledge of and a method for characterizing the macro-state(s) of the system. This presents a challenge for studying novel systems where there may be little prior knowledge. Here, we present a new unsupervised method of detecting emergent behavior in complex systems, and discerning between distinct collective behaviors. We require only metrics, $d^{(1)}$, $d^{(2)}$, defined on the set of agents, $X$, which measure agents’ nearness in variables of interest. We apply the method of diffusion maps to the systems $(X, d^{(i)})$ to recover efficient embeddings of their interaction networks. Comparing these geometries, we formulate a measure of similarity between two networks, called the map alignment statistic (MAS). A large MAS is evidence that the two networks are codetermined in some fashion, indicating an emergent relationship between the metrics $d^{(1)}$ and $d^{(2)}$. Additionally, the form of the macro-scale organization is encoded in the covariances among the two sets of diffusion map components. Using these covariances we discern between different modes of collective behavior in a data-driven, unsupervised manner. This method is demonstrated on a synthetic flocking model as well as empirical fish schooling data. We show that our state classification subdivides the known behaviors of the school in a meaningful manner, leading to a finer description of the system’s behavior.