


Search for: All records

Creators/Authors contains: "Li, Xiang"


  1. Abstract

    Spike sorting is a commonly used analysis method for identifying single units and multi-units from extracellular recordings. Extracellular recordings contain a mixture of signal components, including neural and non-neural events, the latter arising from sources such as motion and breathing artifacts or electrical interference. Identifying single- and multi-unit spikes with a simple threshold-crossing method can make it difficult to distinguish actual neural spikes from non-neural events. The traditional way to classify neural and non-neural units from spike sorting results is manual curation by a trained person. This subjective process suffers from human error and variability, and is further complicated by the absence of ground truth in experimental extracellular recordings. Moreover, manual curation is time-consuming and is becoming intractable as extracellular datasets grow in size and complexity. To address these challenges, we present, for the first time, an automatic curation method based on an autoencoder model, which is trained on features of simulated extracellular spike waveforms. The model is then applied to experimental electrophysiology datasets, where the reconstruction error serves as the metric for classifying neural and non-neural spikes. As an alternative to traditional frequency-domain and statistical techniques, the proposed method offers a time-domain evaluation model that automates the analysis of extracellular recordings based on learned time-domain features. The model exhibits excellent performance and throughput when applied to real-world extracellular datasets without any retraining, highlighting its generalizability. The method can be integrated into spike sorting pipelines as a pre-processing filtering step or as a post-processing curation step.
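    The classification step this abstract describes — score each waveform by its autoencoder reconstruction error and threshold — can be sketched with a linear (PCA-based) autoencoder standing in for the trained model. The spike template, latent dimension, noise levels, and 99th-percentile threshold below are illustrative assumptions, not the paper's actual architecture or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_template(n=48):
    # Illustrative biphasic extracellular spike: sharp trough, slow rebound
    t = np.linspace(0, 1, n)
    return -np.exp(-((t - 0.3) ** 2) / 0.002) + 0.4 * np.exp(-((t - 0.55) ** 2) / 0.01)

def make_spikes(k, n=48, noise=0.05):
    # Simulated neural waveforms: amplitude jitter plus additive noise
    base = spike_template(n)
    return base * rng.uniform(0.8, 1.2, (k, 1)) + noise * rng.standard_normal((k, n))

class LinearAutoencoder:
    """PCA-based stand-in for the trained autoencoder: encode = project onto
    the top-d principal components, decode = project back."""
    def __init__(self, d=3):
        self.d = d
    def fit(self, X):
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.W = vt[:self.d]            # d x n encoder/decoder weights
        return self
    def reconstruction_error(self, X):
        Z = (X - self.mean) @ self.W.T  # encode
        Xr = Z @ self.W + self.mean     # decode
        return np.mean((X - Xr) ** 2, axis=1)

# "Train" on simulated neural waveforms only, as in the abstract
ae = LinearAutoencoder(d=3).fit(make_spikes(500))

neural = make_spikes(100)                          # held-out neural-like events
artifact = 0.5 * rng.standard_normal((100, 48))    # noise-like non-neural events

err_neural = ae.reconstruction_error(neural)
err_artifact = ae.reconstruction_error(artifact)

# Threshold chosen from the training error distribution (assumed heuristic)
thresh = np.percentile(ae.reconstruction_error(make_spikes(500)), 99)
is_neural = err_neural < thresh
```

    Waveforms drawn from the training distribution reconstruct well and fall below the threshold, while noise-like artifacts reconstruct poorly, which is the basis of the classification.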

  2. Free, publicly-accessible full text available December 1, 2025
  3. Abstract

    Given a monotone submodular set function with a knapsack constraint, its maximization problem has two types of approximation algorithms, with running times $O(n^2)$ and $O(n^5)$, respectively. With running time $O(n^5)$, the best performance ratio is $1-1/e$. With running time $O(n^2)$, the well-known performance ratio is $(1-1/e)/2$, and an improved ratio of $(1-1/e^2)/2$ was claimed recently. In this paper, we design an algorithm with running time $O(n^2)$ and performance ratio $1-1/e^{2/3}$, and an algorithm with running time $O(n^3)$ and performance ratio $1/2$.
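    For context, the classic $O(n^2)$-type scheme the abstract compares against can be sketched as a cost-scaled greedy combined with the best feasible singleton; weighted coverage below serves as the monotone submodular function. This illustrates the setting only — it is not the paper's new $1-1/e^{2/3}$ algorithm.

```python
def coverage(S, sets, weights):
    """Monotone submodular example: total weight of elements covered by S."""
    covered = set().union(*(sets[i] for i in S)) if S else set()
    return sum(weights[e] for e in covered)

def greedy_knapsack(sets, weights, cost, B):
    """Cost-scaled greedy plus best single element, the classic scheme with a
    constant-factor guarantee under a knapsack (budget) constraint."""
    S, spent = [], 0.0
    remaining = set(range(len(sets)))
    while remaining:
        best, best_ratio = None, 0.0
        base = coverage(S, sets, weights)
        for i in remaining:
            if spent + cost[i] <= B:
                gain = coverage(S + [i], sets, weights) - base
                if gain / cost[i] > best_ratio:
                    best, best_ratio = i, gain / cost[i]
        if best is None:          # nothing feasible with positive gain
            break
        S.append(best); spent += cost[best]; remaining.discard(best)
    # The guarantee requires comparing against the best feasible singleton
    singles = [i for i in range(len(sets)) if cost[i] <= B]
    best_single = max(singles, key=lambda i: coverage([i], sets, weights))
    if coverage([best_single], sets, weights) > coverage(S, sets, weights):
        return [best_single]
    return S
```

    The singleton comparison guards against the case where one expensive, high-value element beats everything the density-greedy can afford.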

  4. Pissis, Solon P.; Sung, Wing-Kin (Eds.)
    Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing the full-length sequence from these anchor-enabled, ultra-high-coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a k-mer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. Optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.
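    The objective in this abstract — a path between the two anchor nodes that maximizes the weight of its smallest node while matching a target length — can be sketched as a dynamic program over (node, path-length) states. This is an illustrative sketch of the bottleneck objective on a toy weighted graph, not Anchorage's actual algorithm on a compact de Bruijn graph.

```python
from collections import defaultdict

def anchor_guided_path(edges, weight, src, dst, target_len):
    """Find a path from src to dst with exactly target_len nodes that
    maximizes the weight of its smallest node (bottleneck objective).
    dp[(v, l)] = best bottleneck over paths src -> v using l nodes;
    bounding l also keeps the search finite on cyclic graphs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    dp = {(src, 1): weight[src]}
    parent = {}
    for l in range(2, target_len + 1):
        for (u, lu), b in list(dp.items()):
            if lu != l - 1:
                continue
            for v in adj[u]:
                cand = min(b, weight[v])
                if cand > dp.get((v, l), float("-inf")):
                    dp[(v, l)] = cand
                    parent[(v, l)] = u
    if (dst, target_len) not in dp:
        return None, None
    # Reconstruct the path by walking parents back to the source
    path, node = [], (dst, target_len)
    while node:
        path.append(node[0])
        node = (parent[node], node[1] - 1) if node in parent else None
    return path[::-1], dp[(dst, target_len)]
```

    On a small diamond graph the DP correctly prefers the branch whose weakest node is heavier, which is the sense in which the path is "optimal" in the abstract.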
    Free, publicly-accessible full text available August 26, 2025
  5. Abstract

    Catastrophic landslides are often preceded by slow, progressive, accelerating deformation that differs from the persistent motion of slow‐moving landslides. Here, we investigate the motion of a landslide that damaged 12 homes in Rolling Hills Estates (RHE), Los Angeles, California on 8 July 2023, using satellite‐based synthetic aperture radar interferometry (InSAR) and pixel tracking of satellite‐based optical images. To better understand the precursory motion of the RHE landslide, we compared its behavior with local precipitation and with several slow‐moving landslides nearby. Unlike the slow‐moving landslides, we found that RHE was a first‐time progressive failure that failed after one of the wettest years on record. We then applied a progressive failure model to interpret the failure mechanisms and further predict the failure time from the pre‐failure movement of RHE. Our work highlights the importance of monitoring incipient slow motion of landslides, particularly where no discernible historical displacement has been observed.

    Free, publicly-accessible full text available July 16, 2025
    We consider the problem of constructing embeddings of large attributed graphs that support multiple downstream learning tasks. We develop a graph embedding method that extends deep metric and unbiased contrastive learning techniques to 1) work with attributed graphs, 2) enable a mini-batch-based approach, and 3) achieve scalability. Based on a multi-class tuplet loss function, we present two algorithms: DMT for semi-supervised learning and DMAT-i for the unsupervised case. In analyzing our methods, we provide a generalization bound for the downstream node classification task and, for the first time, relate the tuplet loss to contrastive learning. Through extensive experiments, we show high scalability of representation construction and, across three downstream tasks (node clustering, node classification, and link prediction), better consistency than any single existing method.
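    The multi-class tuplet loss the abstract builds on can be sketched in its generic N-pair form: one anchor, one positive, and several negatives scored in a single softmax-style term. This is a standard formulation for illustration, not necessarily the exact loss used by DMT/DMAT-i.

```python
import numpy as np

def tuplet_loss(anchor, positive, negatives):
    """Multi-class tuplet (N-pair style) loss: pull the anchor toward its
    positive and push it away from every negative in one term.
    log(1 + sum_j exp(s_neg_j - s_pos)) is the negative log-softmax of the
    positive score against all negative scores."""
    pos = anchor @ positive        # similarity to the positive
    negs = negatives @ anchor      # similarities to each negative
    return np.log1p(np.sum(np.exp(negs - pos)))
```

    When the anchor aligns with its positive the loss is small; swapping in a mismatched positive raises it, which is the contrastive behavior the paper's analysis relates to.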
    Free, publicly-accessible full text available July 3, 2025
  7. Abstract

    The modification of seed shattering has been a recurring theme in rice evolution. The wild ancestor of cultivated rice disperses its seeds, but reduced shattering was selected during multiple domestication events to facilitate harvesting. Conversely, selection for increased shattering occurred during the evolution of weedy rice, a weed invading cultivated rice fields that has originated multiple times from domesticated ancestors. Shattering requires formation of a tissue known as the abscission zone (AZ), but how the AZ has been modified throughout rice evolution is unclear. We quantitatively characterized the AZ characteristics of relative length, discontinuity, and intensity in 86 cultivated and weedy rice accessions. We reconstructed AZ evolutionary trajectories and determined the degree of convergence among different cultivated varieties and among independent weedy rice populations. AZ relative length emerged as the best feature to distinguish high and low shattering rice. Cultivated varieties differed in average AZ morphology, revealing lack of convergence in how shattering reduction was achieved during domestication. In contrast, weedy rice populations typically converged on complete AZs, irrespective of origin. By examining AZ population-level morphology, our study reveals its evolutionary plasticity, and suggests that the genetic potential to modify the ecologically and agronomically important trait of shattering is plentiful in rice lineages.

  8. Existing object recognition models have been shown to lack robustness in diverse geographical scenarios due to domain shifts in design and context. Class representations need to be adapted to more accurately reflect an object concept under these shifts. In the absence of training data from target geographies, we hypothesize that geographically diverse descriptive knowledge of categories can enhance robustness. For this purpose, we explore the feasibility of probing a large language model for geography-based object knowledge, and we examine the effects of integrating knowledge into zero-shot and learnable soft prompting with CLIP. Within this exploration, we propose geography knowledge regularization to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set. Accuracy gains over prompting baselines on DollarStreet while training only on Europe data are up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas, and +4.6 overall on the hardest classes. Competitive performance is shown vs. few-shot target training, and analysis is provided to direct future study of geographical robustness. 
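    The geography knowledge regularization described here can be sketched schematically: soft-prompt class embeddings are penalized for drifting away from LLM-derived descriptive embeddings of the same classes across a set of geographies. The shapes and the squared-distance form below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def geo_knowledge_regularizer(soft_prompt_emb, knowledge_embs):
    """Mean squared distance between each class's soft-prompt embedding and
    its geography-specific descriptive-knowledge embeddings.
    soft_prompt_emb: (num_classes, dim)
    knowledge_embs:  (num_geos, num_classes, dim)"""
    diffs = knowledge_embs - soft_prompt_emb[None, :, :]
    return np.mean(np.sum(diffs ** 2, axis=-1))
```

    Adding such a term to the prompt-tuning objective discourages prompts trained on one source geography (e.g., Europe) from specializing away from descriptions that hold in unseen target geographies.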
    Free, publicly-accessible full text available June 17, 2025
  9. Abstract

    Motivation

    Many tasks in sequence analysis ask to identify biologically related sequences in a large set. The edit distance, being a sensible model for both evolution and sequencing error, is widely used in these tasks as a measure. The resulting computational problem—to recognize all pairs of sequences within a small edit distance—turns out to be exceedingly difficult, since the edit distance is notoriously expensive to compute and all-versus-all comparison is simply not feasible with millions or billions of sequences. Among many attempts, we recently proposed locality-sensitive bucketing (LSB) functions to meet this challenge. Formally, a (d1,d2)-LSB function sends sequences into multiple buckets with the guarantee that pairs of sequences of edit distance at most d1 can be found within the same bucket, while those of edit distance at least d2 do not share any bucket. LSB functions generalize locality-sensitive hashing (LSH) functions and admit favorable properties, a notable highlight being that optimal LSB functions exist for certain (d1,d2). LSB functions hold the potential of solving the above problems optimally, but the existence of LSB functions for more general (d1,d2) remains unclear, let alone constructing them for practical use.

    Results

    In this work, we aim to utilize machine learning techniques to train LSB functions. With the development of a novel loss function, and with insights into neural network structures that can potentially extend beyond this specific task, we obtained LSB functions that exhibit nearly perfect accuracy for certain (d1,d2), matching our theoretical results, and high accuracy for many others. Compared to the state-of-the-art LSH method Order Min Hash, the trained LSB functions achieve a 2- to 5-fold improvement in the sensitivity of recognizing similar sequences. An experiment on analyzing erroneous cell barcode data is also included to demonstrate an application of the trained LSB functions.

    Availability and implementation

    The code for the training process and the structure of trained models are freely available at https://github.com/Shao-Group/lsb-learn.
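    How bucketing replaces all-versus-all comparison can be illustrated with a toy scheme: key each sequence by itself plus every single-character deletion, so any pair at edit distance at most 1 shares a key, and only pairs sharing a bucket are checked exactly. This illustrates the LSB idea only — it is not one of the trained (d1,d2)-LSB functions from the paper, and it offers no strict guarantee for distant pairs.

```python
from collections import defaultdict
from itertools import combinations

def deletion_buckets(s):
    """Toy bucket keys: the sequence itself plus every single-character
    deletion. Two sequences at edit distance <= 1 (one substitution,
    insertion, or deletion) always share at least one key."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def candidate_pairs(seqs):
    """Group sequence indices by bucket key; pairs sharing a bucket become
    candidates for an exact edit-distance check, avoiding all-versus-all."""
    buckets = defaultdict(set)
    for idx, s in enumerate(seqs):
        for key in deletion_buckets(s):
            buckets[key].add(idx)
    return {tuple(sorted(p))
            for b in buckets.values()
            for p in combinations(sorted(b), 2)}
```

    For a substitution at position i, deleting position i from both sequences yields the same key; for an insertion, deleting the inserted character recovers the shorter sequence, which is also its own key.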

  10. Free, publicly-accessible full text available April 14, 2025