skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on December 1, 2025

Title: Clustering of Global Magnetospheric Observations
Abstract The use of supervised methods in space science have demonstrated powerful capability in classification tasks, but purely unsupervised methods have been less utilized for the classification of spacecraft observations. We use a combination of unsupervised methods, being principal component analysis, Self‐Organizing Maps, and hierarchical agglomerative clustering, to classify THEMIS and MMS observations as having occurred in the magnetosphere, magnetosheath, or the solar wind. The resulting classification are validated visually by analyzing the distribution of classifications and studying individual time series as well as by comparison to the labeled data set of a previous model, against which ours has an accuracy of 99.4. The model has a variety of applications beyond region classification such as deeper hierarchical analysis, magnetopause and bow shock crossing identification, and identification of bursty bulk flows, hot flow anomalies, and foreshock bubbles.  more » « less
Award ID(s):
1919310
PAR ID:
10561367
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
AGU
Date Published:
Journal Name:
Journal of Geophysical Research: Machine Learning and Computation
Volume:
1
Issue:
4
ISSN:
2993-5210
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Facing the continuous emergence of new psychoactive substances (NPS) and their threat to public health, more effective methods for NPS prediction and identification are critical. In this study, the pharmacological affinity fingerprints (Ph-fp) of NPS compounds were predicted by Random Forest classification models using bioactivity data from the ChEMBL database. The binaryPh-fpis the vector consisting of a compound’s activity against a list of molecular targets reported to be responsible for the pharmacological effects of NPS. Their performance in similarity searching and unsupervised clustering was assessed and compared to 2D structure fingerprints Morgan and MACCS (1024-bits ECFP4 and 166-bits SMARTS-based MACCS implementation of RDKit). The performance in retrieving compounds according to their pharmacological categorizations is influenced by the predicted active assay counts inPh-fpand the choice of similarity metric. Overall, the comparative unsupervised clustering analysis suggests the use of a classification model with Morgan fingerprints as input for the construction ofPh-fp. This combination gives satisfactory clustering performance based on external and internal clustering validation indices. 
    more » « less
  2. Abstract Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently.Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) identifying ground beetles from the species to subfamily level. All models were trained using pre‐extracted feature vectors, not raw image data. Our methodology allows for data to be extracted from multiple individuals within the same image thus enhancing time efficiency, utilizes relatively simple models that allow for direct assessment of model performance, and can be performed on relatively small datasets.The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels species, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single‐level classification method at higher taxonomic levels.The general methodology outlined here serves as a proof‐of‐concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may be used for nonmachine learning uses. We propose that integration of machine learning in large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other noninsect taxa. 
    more » « less
  3. Summary Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high-dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this article, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets. 
    more » « less
  4. Abstract MotivationCryo-Electron Tomography (cryo-ET) is a 3D imaging technology that enables the visualization of subcellular structures in situ at near-atomic resolution. Cellular cryo-ET images help in resolving the structures of macromolecules and determining their spatial relationship in a single cell, which has broad significance in cell and structural biology. Subtomogram classification and recognition constitute a primary step in the systematic recovery of these macromolecular structures. Supervised deep learning methods have been proven to be highly accurate and efficient for subtomogram classification, but suffer from limited applicability due to scarcity of annotated data. While generating simulated data for training supervised models is a potential solution, a sizeable difference in the image intensity distribution in generated data as compared with real experimental data will cause the trained models to perform poorly in predicting classes on real subtomograms. ResultsIn this work, we present Cryo-Shift, a fully unsupervised domain adaptation and randomization framework for deep learning-based cross-domain subtomogram classification. We use unsupervised multi-adversarial domain adaption to reduce the domain shift between features of simulated and experimental data. We develop a network-driven domain randomization procedure with ‘warp’ modules to alter the simulated data and help the classifier generalize better on experimental data. We do not use any labeled experimental data to train our model, whereas some of the existing alternative approaches require labeled experimental samples for cross-domain classification. Nevertheless, Cryo-Shift outperforms the existing alternative approaches in cross-domain subtomogram classification in extensive evaluation studies demonstrated herein using both simulated and experimental data. Availabilityand implementationhttps://github.com/xulabs/aitom. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Surface defect identification is a crucial task in many manufacturing systems, including automotive, aircraft, steel rolling, and precast concrete. Although image-based surface defect identification methods have been proposed, these methods usually have two limitations: images may lose partial information, such as depths of surface defects, and their precision is vulnerable to many factors, such as the inspection angle, light, color, noise, etc. Given that a three-dimensional (3D) point cloud can precisely represent the multidimensional structure of surface defects, we aim to detect and classify surface defects using a 3D point cloud. This has two major challenges: (i) the defects are often sparsely distributed over the surface, which makes their features prone to be hidden by the normal surface and (ii) different permutations and transformations of 3D point cloud may represent the same surface, so the proposed model needs to be permutation and transformation invariant. In this paper, a two-step surface defect identification approach is developed to investigate the defects’ patterns in 3D point cloud data. The proposed approach consists of an unsupervised method for defect detection and a multi-view deep learning model for defect classification, which can keep track of the features from both defective and non-defective regions. We prove that the proposed approach is invariant to different permutations and transformations. Two case studies are conducted for defect identification on the surfaces of synthetic aircraft fuselage and the real precast concrete specimen, respectively. The results show that our approach receives the best defect detection and classification accuracy compared with other benchmark methods. 
    more » « less