Title: Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species
In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) on the DNA sequence is key to understanding the mechanisms of translation regulation and mRNA metabolism. Although machine learning methods have been widely used to identify PAS computationally, their need for large amounts of annotated data hinders the application of existing methods to species without experimental PAS data. Cross-species PAS identification, which makes it possible to predict PAS in species that were not part of training, is therefore a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolutional Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species, build cross-species training sets from two of them, and evaluate performance on the remaining ones. Moreover, we test our method under insufficient-data and imbalanced-data conditions and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when the training set is smaller or imbalanced.
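The evaluation protocol described in the abstract (train on two of four species, test on the rest) can be sketched as a simple enumeration of splits. This is only an illustration under stated assumptions: the abstract does not name the four species, so the labels below are placeholders, and it does not say whether every pairing was used, so enumerating all pairs is an assumption. The function name `cross_species_splits` is hypothetical.

```python
from itertools import combinations

# Placeholder labels: the abstract does not name the four species.
SPECIES = ["species_A", "species_B", "species_C", "species_D"]


def cross_species_splits(species, n_train=2):
    """Yield (training species, held-out target species) pairs.

    Assumption: every combination of n_train species is used for training;
    the abstract does not state which pairings were actually evaluated.
    """
    for train in combinations(species, n_train):
        # Target species are never seen during training.
        targets = tuple(s for s in species if s not in train)
        yield train, targets


splits = list(cross_species_splits(SPECIES))
for train, targets in splits:
    print(f"train on {train}, evaluate on {targets}")
```

Keeping the target species entirely out of the training set is what distinguishes this setting from ordinary within-species validation: the model is scored only on species it has never seen.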
Award ID(s):
2007595 1949629
PAR ID:
10231123
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
Zhang, Zhaolei
Date Published:
Journal Name:
PLOS Computational Biology
Volume:
16
Issue:
11
ISSN:
1553-7358
Page Range / eLocation ID:
e1008297
Sponsoring Org:
National Science Foundation
More Like this
  1. Access to large image volumes through camera traps and crowdsourcing provides novel possibilities for animal monitoring and conservation. It calls for automatic methods of analysis, in particular for re-identifying individual animals from the images. Most existing re-identification methods rely on either hand-crafted local features or end-to-end learning of fur pattern similarity. The former does not need labeled training data, while the latter, although very data-hungry, typically outperforms the former when enough training data is available. We propose a novel re-identification pipeline that combines the strengths of both approaches by utilizing modern learnable local features and feature aggregation. This creates representative pattern feature embeddings that provide high re-identification accuracy while allowing us to apply the method to small datasets by using pre-trained feature descriptors. We report a comprehensive comparison of different modern local features and demonstrate the advantages of the proposed pipeline on two very different species.
  2. Data-driven approaches are increasingly popular for identifying dynamical systems due to improved accuracy and the availability of sensor data. However, relying solely on data for identification does not guarantee that the identified systems will maintain their physical properties or that the predicted models will generalize well. In this paper, we propose a novel method for data-driven system identification that integrates a neural network as the first-order derivative of the learned dynamics in a Taylor series, instead of learning the dynamical function directly. In addition, for dynamical systems with known monotonic properties, our approach can ensure monotonicity by constraining the neural network derivative to be non-positive or non-negative with respect to the corresponding inputs, resulting in Monotonic Taylor Neural Networks (MTNN). Such constraints are enforced either by a specialized neural network architecture or by regularization in the training loss function. The proposed method demonstrates better performance than methods without the physics-based monotonicity constraints when tested on experimental data from an HVAC system and a temperature control testbed. Furthermore, MTNN shows good performance in a control application of a model predictive controller for a nonlinear MIMO system, illustrating the practical application of our method.
  3. While prior federated learning (FL) methods mainly consider client heterogeneity, we focus on the Federated Domain Generalization (DG) task, which introduces train-test heterogeneity in the FL context. Existing evaluations in this field are limited in terms of the scale of the clients and dataset diversity. Thus, we propose a Federated DG benchmark that aims to test the limits of current methods with high client heterogeneity, large numbers of clients, and diverse datasets. Towards this objective, we introduce a novel data partition method that allows us to distribute any domain dataset among few or many clients while controlling client heterogeneity. We then introduce and apply our methodology to evaluate 14 DG methods on 7 datasets; these include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG. Our results suggest that, despite some progress, significant performance gaps remain in Federated DG, especially when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Furthermore, our extendable benchmark code will be publicly released to aid in benchmarking future Federated DG approaches.
  4. Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently. Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental‐scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) in identifying ground beetles from the species to the subfamily level. All models were trained using pre‐extracted feature vectors, not raw image data. Our methodology allows data to be extracted from multiple individuals within the same image, thus enhancing time efficiency; utilizes relatively simple models that allow for direct assessment of model performance; and can be performed on relatively small datasets. The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single‐level classification method at higher taxonomic levels. The general methodology outlined here serves as a proof‐of‐concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may be used for purposes other than machine learning. We propose that integration of machine learning in large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other noninsect taxa.
  5. An organ segmentation method that can generalize to unseen contrasts and scanner settings can significantly reduce the need for retraining deep learning models. Domain Generalization (DG) aims to achieve this goal. However, most DG methods for segmentation require training data from multiple domains. We propose a novel adversarial domain generalization method for organ segmentation trained on data from a single domain. We synthesize new domains by learning an adversarial domain synthesizer (ADS) and presume that the synthetic domains cover a large enough area of plausible distributions that unseen domains can be interpolated from them. We propose a mutual information regularizer to enforce semantic consistency between images from the synthetic domains; the mutual information can be estimated by patch-level contrastive learning. We evaluate our method on various organ segmentation tasks for unseen modalities, scanning protocols, and scanner sites.