Abstract Mobile element insertions (MEIs) are repetitive genomic sequences that contribute to genetic variation and can lead to genetic disorders. Targeted and whole-genome approaches using short-read sequencing have been developed to identify reference and non-reference MEIs; however, the read length hampers detection of these elements in complex genomic regions. Here, we pair Cas9-targeted nanopore sequencing with computational methodologies to capture active MEIs in human genomes. We demonstrate parallel enrichment for distinct classes of MEIs, averaging 44% of reads on-targeted signals and exhibiting a 13.4-54x enrichment over whole-genome approaches. We show an individual flow cell can recover most MEIs (97% L1Hs, 93% Alu Yb, 51% Alu Ya, 99% SVA_F, and 65% SVA_E). We identify seventeen non-reference MEIs in GM12878 overlooked by modern, long-read analysis pipelines, primarily in repetitive genomic regions. This work introduces the utility of nanopore sequencing for MEI enrichment and lays the foundation for rapid discovery of elusive, repetitive genetic elements.
This content will become publicly available on June 1, 2026
Genomic Anomaly Detection with Functional Data Analysis
Background: Genetic variation provides a foundation for understanding evolution. With the rise of artificial intelligence, machine learning has emerged as a powerful tool for identifying genomic footprints of evolutionary processes through simulation-based predictive modeling. However, existing approaches require prior knowledge of the factors shaping genetic variation, whereas uncovering anomalous genomic regions regardless of their causes remains an equally important and complementary endeavor. Methods: To address this problem, we introduce ANDES (ANomaly DEtection using Summary statistics), a suite of algorithms that apply statistical techniques to extract features for unsupervised anomaly detection. A key innovation of ANDES is its ability to account for autocovariation due to linkage disequilibrium by fitting curves to contiguous windows and computing their first and second derivatives, thereby capturing the “velocity” and “acceleration” of genetic variation. These features are then used to train models that flag biologically significant or artifactual regions. Results: Application to human genomic data demonstrates that ANDES successfully detects anomalous regions that colocalize with genes under positive or balancing selection. Moreover, these analyses reveal a non-uniform distribution of anomalies, which are enriched in specific autosomes, intergenic regions, introns, and regions with low GC content, repetitive sequences, and poor mappability. Conclusions: ANDES thus offers a novel, model-agnostic framework for uncovering anomalous genomic regions in both model and non-model organisms.
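The feature-extraction idea described in the Methods (fit a curve to contiguous windows, then take its first and second derivatives) can be illustrated with a minimal sketch. This is not the ANDES implementation: the function names, the quadratic fit, the window size, and the z-score threshold are all assumptions chosen for illustration, and a plain z-score stands in for the unsupervised models mentioned in the abstract.

```python
from statistics import mean, pstdev


def window_derivatives(values, half_width=2):
    """Least-squares quadratic fit over each centered window.

    Returns a list of (first_derivative, second_derivative) estimates
    evaluated at the window center (hypothetical helper, not from ANDES).
    """
    offsets = range(-half_width, half_width + 1)
    n = 2 * half_width + 1
    s2 = sum(k ** 2 for k in offsets)
    s4 = sum(k ** 4 for k in offsets)
    feats = []
    for i in range(half_width, len(values) - half_width):
        window = [values[i + k] for k in offsets]
        sy = sum(window)
        sky = sum(k * y for k, y in zip(offsets, window))
        sk2y = sum(k * k * y for k, y in zip(offsets, window))
        slope = sky / s2  # "velocity" of the statistic
        curvature = 2 * (n * sk2y - s2 * sy) / (n * s4 - s2 ** 2)  # "acceleration"
        feats.append((slope, curvature))
    return feats


def z_scores(xs):
    """Standard z-scores; all zeros when the feature has no spread."""
    mu, sd = mean(xs), pstdev(xs)
    return [0.0 if sd == 0 else (x - mu) / sd for x in xs]


def flag_anomalies(values, half_width=2, threshold=3.5):
    """Return indices of window centers whose slope or curvature is extreme."""
    feats = window_derivatives(values, half_width)
    if not feats:
        return []
    z_slope = z_scores([f[0] for f in feats])
    z_curv = z_scores([f[1] for f in feats])
    return [
        i + half_width
        for i, (a, b) in enumerate(zip(z_slope, z_curv))
        if max(abs(a), abs(b)) > threshold
    ]
```

Here `values` would be a per-window summary statistic along a chromosome (an assumption about the input layout). Because neighboring windows share data points, the derivative features carry information about local context rather than treating each window as independent.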
- Award ID(s):
- 2302258
- PAR ID:
- 10610227
- Publisher / Repository:
- Genes
- Date Published:
- Journal Name:
- Genes
- Volume:
- 16
- Issue:
- 6
- ISSN:
- 2073-4425
- Page Range / eLocation ID:
- 710
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract A thermodynamic energy budget analysis is applied to the lowest model level of the ERA5 dataset to investigate the mechanisms that drive the growth and decay of extreme positive surface air temperature (SAT) events. Regional and seasonal variations in these mechanisms are also investigated. For each grid point on Earth’s surface, a separate composite analysis is performed for extreme SAT events, which are days when the temperature anomaly exceeds the 95th percentile. Among the dynamical terms, horizontal temperature advection of the climatological temperature by the anomalous wind dominates SAT anomaly growth over the extratropics, while nonlinear horizontal temperature advection is a major factor over high-latitude regions and adiabatic warming is important over major mountainous regions. During the decay period, advection of the climatological temperature by the anomalous wind sustains the warming while nonlinear advection becomes the dominant decay mechanism. Among diabatic heating processes, vertical mixing contributes to SAT anomaly growth over most locations while longwave radiative cooling hinders SAT anomaly growth, especially over the ocean. However, over arid regions during summer, longwave heating largely contributes to SAT anomaly growth while vertical mixing dampens it. During the decay period, both longwave cooling and vertical mixing contribute to SAT anomaly decay, with more pronounced effects over the ocean and land, respectively. These regional and seasonal characteristics of the processes that drive extreme SAT events can serve as a benchmark for understanding the future behavior of extreme weather.
-
Human mobility anomaly detection based on location is essential in areas such as public health, safety, welfare, and urban planning. Developing models and approaches for location-based anomaly detection requires a comprehensive dataset. However, privacy concerns and the absence of ground truth hinder the availability of public datasets. With this paper, we provide extensive simulated human mobility datasets featuring various anomaly types created using an existing Urban Patterns of Life Simulation. To create these datasets, we inject changes into the logic of individual agents to change their behavior. Specifically, we create four types of anomalous agent behavior by (1) changing the agents’ appetite (causing agents to have meals more frequently), (2) changing their group of interest (causing agents to interact with different agents from another group), (3) changing their social place selection (causing agents to visit different recreational places), and (4) changing their work schedule (causing agents to skip work). For each type of anomaly, we use three degrees of behavioral change to tune the difficulty of detecting the anomalous agents. To select agents to inject anomalous behavior into, we employ three methods: (1) random selection using a centralized manipulation mechanism, (2) spread-based selection using an infectious disease model, and (3) selection through exposure of agents to a specific location. All datasets are split into normal and anomalous phases. The normal phase, which can be used for training models of normalcy, exhibits no anomalous behavior. The anomalous phase, which can be used for testing anomaly detection algorithms, includes ground truth labels that indicate, for each five-minute simulation step, which agents are anomalous at that time. Datasets are generated using the maps (roads and buildings) for Atlanta and Berlin, with 1k agents in each simulation. All datasets are openly available at https://osf.io/dg6t3/. Additionally, we provide instructions to regenerate the data for other locations and numbers of agents.
-
Vehicles can utilize their sensors or receive messages from other vehicles to acquire information about the surrounding environment. However, the information may be inaccurate, faulty, or maliciously compromised due to sensor failures, communication faults, or security attacks. The goal of this work is to detect whether a lane-changing decision and the sensed or received information are anomalous. We develop three anomaly detection approaches based on deep learning: a classifier approach, a predictor approach, and a hybrid approach combining the classifier and the predictor. None of them requires anomalous data or lateral features, so they can generally consider lane-changing decisions before the vehicles start moving along the lateral axis. They achieve F1 scores of at least 82% and up to 93% for anomaly detection on data from Simulation of Urban MObility (SUMO) and HighD. We also examine system properties and verify that the detected anomalies include more dangerous scenarios.
-
This paper presents GeoDMA, which processes the GPS data from multiple vehicles to detect anomalous driving maneuvers, such as rapid acceleration, sudden braking, and rapid swerving. First, an unsupervised deep auto-encoder is designed to learn a set of unique features from the normal historical GPS data of all drivers. We consider the temporal dependency of the driving data for individual drivers and the spatial correlation among different drivers. Second, to incorporate the peer dependency of drivers in local regions, we develop a geographical partitioning algorithm that partitions a city into several sub-regions for driving anomaly detection. Specifically, we extend the vehicle-vehicle dependency to road-road dependency and formulate the geographical partitioning problem as an optimization problem. The objective of the optimization problem is to maximize the dependency of roads within each sub-region and minimize the dependency of roads between any two different sub-regions. Finally, we train a specific driving anomaly detection model for each sub-region and perform in-situ updating of these models by incremental training. We implement GeoDMA in PyTorch and evaluate its performance using a large set of real-world GPS trajectories. The experimental results demonstrate that GeoDMA achieves up to 8.5% higher detection accuracy than the baseline methods.
