skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Performing Selection on a Monotonic Function in Lieu of Sorting Using Layer-Ordered Heaps
Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, this requires Ω(n log(n)) time to sort n values. Thus given enough data, sorting becomes the computational bottleneck, even with very optimized implementations such as the C++ standard library routine, std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p value when investigating the family-wise error rate and q values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p values or q values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.  more » « less
Award ID(s):
1845465
PAR ID:
10232268
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Journal of proteome research
ISSN:
1535-3907
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent advances in extracellular electrophysiology now facilitate the recording of spikes from hundreds or thousands of neurons simultaneously. This has necessitated both the development of new computational methods for spike sorting and better methods to determine spike-sorting accuracy. One long-standing method of assessing the false discovery rate (FDR) of spike sorting—the rate at which spikes are assigned to the wrong cluster—has been the rate of interspike interval (ISI) violations. Despite their near ubiquitous usage in spike sorting, our understanding of how exactly ISI violations relate to FDR, as well as best practices for using ISI violations as a quality metric, remains limited. Here, we describe an analytical solution that can be used to predict FDR from the ISI violation rate (ISIv). We test this model in silico through Monte Carlo simulation and apply it to publicly available spike-sorted electrophysiology datasets. We find that the relationship between ISIvand FDR is highly nonlinear, with additional dependencies on firing frequency, the correlation in activity between neurons, and contaminant neuron count. Predicted median FDRs in public datasets recorded in mice were found to range from 3.1 to 50.0%. We found that stochasticity in the occurrence of ISI violations as well as uncertainty in cluster-specific parameters make it difficult to predict FDR for single clusters with high confidence but that FDR can be estimated accurately across a population of clusters. Our findings will help the growing community of researchers using extracellular electrophysiology assess spike-sorting accuracy in a principled manner. 
    more » « less
  2. Fu, Yan (Ed.)
    Abstract Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide–spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR. 
    more » « less
  3. Guruswami, Venkatesan (Ed.)
    The problem of sorting with priced information was introduced by [Charikar, Fagin, Guruswami, Kleinberg, Raghavan, Sahai (CFGKRS), STOC 2000]. In this setting, different comparisons have different (potentially infinite) costs. The goal is to find a sorting algorithm with small competitive ratio, defined as the (worst-case) ratio of the algorithm’s cost to the cost of the cheapest proof of the sorted order. The simple case of bichromatic sorting posed by [CFGKRS] remains open: We are given two sets A and B of total size N, and the cost of an A-A comparison or a B-B comparison is higher than an A-B comparison. The goal is to sort A ∪ B. An Ω(log N) lower bound on competitive ratio follows from unit-cost sorting. Note that this is a generalization of the famous nuts and bolts problem, where A-A and B-B comparisons have infinite cost, and elements of A and B are guaranteed to alternate in the final sorted order. In this paper we give a randomized algorithm InversionSort with an almost-optimal w.h.p. competitive ratio of O(log³ N). This is the first algorithm for bichromatic sorting with a o(N) competitive ratio. 
    more » « less
  4. Shotgun proteomics has been widely used to identify histone marks. Conventional database search methods rely on the “target-decoy” strategy to calculate the false discovery rate (FDR) and distinguish true peptide-spectrum matches (PSMs) from false ones. This strategy has a caveat of inaccurate FDR caused by the small data size of histone marks. To address this challenge, we developed a tailored database search strategy, named “Comprehensive Histone Mark Analysis (CHiMA).” Instead of target-decoy–based FDR, this method uses “50% matched fragment ions” as the key criterion to identify high-confidence PSMs. CHiMA identified twice as many histone modification sites as the conventional method in benchmark datasets. Reanalysis of our previous proteomics data using CHiMA led to the identification of 113 new histone marks for four types of lysine acylations, almost doubling the number of previously reported marks. This tool not only offers a valuable approach for identifying histone modifications but also greatly expands the repertoire of histone marks. 
    more » « less
  5. We present here a passive and label-free droplet microfluidic platform to sort cells stepwise by lactate and proton secretion from glycolysis. A technology developed in our lab, Sorting by Interfacial Tension (SIFT), sorts droplets containing single cells into two populations based on pH by using interfacial tension. Cellular glycolysis lowers the pH of droplets through proton secretion, enabling passive selection based on interfacial tension and hence single-cell glycolysis. The SIFT technique is expanded here by exploiting the dynamic droplet acidification from surfactant adsorption that leads to a concurrent increase in interfacial tension. This allows multiple microfabricated rails at different downstream positions to isolate cells with distinct glycolytic levels. The device is used to correlate sorted cells with three levels of glycolysis with a conventional surface marker for T-cell activation. As glycolysis is associated with both disease and cell state, this technology facilitates the sorting and analysis of crucial cell subpopulations for applications in oncology, immunology and immunotherapy. 
    more » « less