NSF PAR Search | NSF Public Access Repository

Abstract FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates count queries over all state-of-the-art methods with a moderate increase in memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm.
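To illustrate why a standard FM-index needs roughly one random access per pattern character, here is a minimal sketch of a character-based FM-index with backward-search counting. It is not the PFP-FM implementation: the function names `build_fm_index` and `count`, the naive suffix sorting, and the dense occurrence table are assumptions chosen for readability, and the sketch assumes the text does not contain the sentinel `$`.

```python
def build_fm_index(text):
    """Build a toy character-based FM-index (naive suffix sort; short texts only)."""
    text += "$"  # sentinel, assumed absent from the input and smaller than A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count(pattern, C, occ, n):
    """Return (number of occurrences, number of backward-search steps)."""
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    steps = 0
    for c in reversed(pattern):
        if c not in C:
            return 0, steps
        # One step per pattern character; each step looks up occ at lo and hi
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        steps += 1
        if lo >= hi:
            return 0, steps
    return hi - lo, steps


C, occ, n = build_fm_index("GATTACAGATTACA")
occurrences, steps = count("ATTA", C, occ, n)
```

The step counter makes the cost visible: the loop runs once per pattern character. A word-based index, as described in the abstract, would instead advance one phrase at a time, which is where the reduction in random accesses comes from.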

ONeSAMP 3.0: estimation of effective population size via single nucleotide polymorphism data from one population

https://doi.org/10.1093/g3journal/jkae153

Hong, Aaron; Cheek, Rebecca G; De Silva, Suhashi Nihara; Mukherjee, Kingshuk; Yooseph, Isha; Oliva, Marco; Heim, Mark; Funk, Chris W; Tallmon, David; Boucher, Christina; et al. (July 2024, G3: Genes, Genomes, Genetics)

Abstract The genetic effective size (Ne) is arguably one of the most important characteristics of a population as it impacts the rate of loss of genetic diversity. Methods that estimate Ne are important in population and conservation genetic studies as they quantify the risk of a population being inbred or lacking genetic diversity. Yet there are very few methods that can estimate the Ne from data from a single population and without extensive information about the genetics of the population, such as a linkage map, or a reference genome of the species of interest. We present ONeSAMP 3.0, an algorithm for estimating Ne from single nucleotide polymorphism data collected from a single population sample using approximate Bayesian computation and local linear regression. We demonstrate the utility of this approach using simulated Wright–Fisher populations, and empirical data from five endangered Channel Island fox (Urocyon littoralis) populations to evaluate the performance of ONeSAMP 3.0 compared to a commonly used Ne estimator. Our results show that ONeSAMP 3.0 is broadly applicable to natural populations and is flexible enough that future versions could easily include summary statistics appropriate for a suite of biological and sampling conditions. ONeSAMP 3.0 is publicly available under the GNU General Public License at https://github.com/AaronHong1024/ONeSAMP_3.
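As a rough illustration of the approximate Bayesian computation idea mentioned in the abstract, the sketch below uses plain rejection sampling: draw candidate Ne values from a prior, simulate Wright–Fisher drift, and keep the candidates whose simulated summary statistic lies closest to the observed one. This is not ONeSAMP 3.0's algorithm. It omits the local linear regression step, and the single summary statistic (mean expected heterozygosity), the uniform prior, and all function names and parameter values are assumptions made for this example.

```python
import random


def simulate_het(ne, generations, n_loci, rng):
    """Mean expected heterozygosity after Wright-Fisher drift at biallelic loci."""
    total = 0.0
    for _ in range(n_loci):
        p = 0.5  # assumed starting allele frequency
        for _ in range(generations):
            # Binomial sampling of 2 * ne gene copies for the next generation
            copies = sum(rng.random() < p for _ in range(2 * ne))
            p = copies / (2 * ne)
        total += 2 * p * (1 - p)
    return total / n_loci


def abc_rejection(observed, prior_lo, prior_hi, draws, keep,
                  generations, n_loci, seed=0):
    """Return the `keep` prior draws of Ne whose simulations best match `observed`."""
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        ne = rng.randint(prior_lo, prior_hi)  # uniform prior on Ne (assumption)
        stat = simulate_het(ne, generations, n_loci, rng)
        sims.append((abs(stat - observed), ne))
    sims.sort()  # closest simulated statistics first
    return [ne for _, ne in sims[:keep]]


accepted = abc_rejection(observed=0.4, prior_lo=5, prior_hi=50,
                         draws=100, keep=10, generations=5, n_loci=10)
point_estimate = sorted(accepted)[len(accepted) // 2]  # median of accepted draws
```

The accepted draws approximate the posterior distribution of Ne under these assumptions. A regression adjustment, as named in the abstract, would then correct the accepted values for the remaining gap between simulated and observed statistics.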
