Genome search and/or classification typically involves finding the best-match database (reference) genomes. It has become increasingly challenging because the number of available database genomes keeps growing and traditional methods do not scale well to large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) for estimating genomic distance with a graph-based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and, with a database splitting strategy, will scale well to billions of genomes. Further, GSearch implements a three-step search strategy that depends on the degree of novelty of the query genomes to maximize specificity and sensitivity. GSearch therefore addresses a major bottleneck of microbiome studies that require genome search and/or classification.
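The k-mer hashing idea in this abstract can be illustrated with a minimal sketch of classic MinHash. This is not GSearch's implementation: it uses plain MinHash rather than ProbMinHash, SuperMinHash, Densified MinHash or SetSketch, it omits the HNSW graph entirely, and the function names, the k-mer length and the number of hash functions are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items, num_hashes=128):
    """Classic MinHash: for each seeded hash function, keep the minimum hash value.

    items must be a non-empty collection of strings.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical toy sequences; real genomes are millions of bases long.
genome_a = "ACGTACGTTGCAAGCTTAGGCTA"
genome_b = "ACGTACGTTGCAAGCTTAGGCTT"
sig_a = minhash_signature(kmers(genome_a, 5))
sig_b = minhash_signature(kmers(genome_b, 5))
similarity = estimate_jaccard(sig_a, sig_b)
```

Because each signature has a fixed length regardless of genome size, comparing two genomes costs the same small amount of work, which is what makes sketch-based distances practical as the metric inside a nearest neighbor index.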
Area-Optimal Simple Polygonalizations: The CG Challenge 2019
We give an overview of theoretical and practical aspects of finding a simple polygon of minimum (Min-Area) or maximum (Max-Area) possible area for a given set of n points in the plane. Both problems are known to be NP-hard and were the subject of the 2019 Computational Geometry Challenge, which presented the quest of finding good solutions to more than 200 instances, ranging from n = 10 all the way to n = 1,000,000.
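The objective being optimized here is the area of a simple polygon whose vertices are exactly the given points. As a minimal sketch, the area of one candidate polygonalization can be evaluated with the shoelace formula. The function name and the five-point instance are assumptions for illustration, and the code does not check that the vertex order describes a simple (non-self-intersecting) polygon.

```python
def polygon_area(vertices):
    """Area of a polygon given as an ordered list of (x, y) vertices (shoelace formula).

    The vertex order is assumed to trace a simple polygon; this is not verified.
    """
    n = len(vertices)
    twice_area = 0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


# Two simple polygonalizations of the same five points (hypothetical instance):
# the corners of a 4 x 4 square plus the interior point (2, 1).
larger = [(0, 0), (2, 1), (4, 0), (4, 4), (0, 4)]
smaller = [(0, 0), (4, 0), (4, 4), (2, 1), (0, 4)]
area_larger = polygon_area(larger)    # 14.0
area_smaller = polygon_area(smaller)  # 10.0
```

The same point set yields different areas depending on where the interior point is spliced into the boundary, which is the choice that Min-Area and Max-Area ask to optimize over all simple polygonalizations.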
- Award ID(s): 2007275
- PAR ID: 10340406
- Date Published:
- Journal Name: ACM Journal of Experimental Algorithmics
- Volume: 27
- ISSN: 1084-6654
- Page Range / eLocation ID: 1 to 12
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The uncertainties in sea ice extent (total area covered by sea ice with concentration > 15%) derived from passive microwave sensors are assessed in two ways. Absolute uncertainty (accuracy) is evaluated based on the comparison of the extent between several products. There are clear biases between the extent from the different products that are of the order of 500 000 to 1 × 10⁶ km² depending on the season and hemisphere. These biases are due to differences in the algorithm sensitivity to ice edge conditions and the spatial resolution of different sensors. Relative uncertainty is assessed by examining extents from the National Snow and Ice Data Center Sea Ice Index product. The largest source of uncertainty, ∼100 000 km², is between near-real-time and final products due to different input source data and different processing and quality control. For consistent processing, the uncertainty is assessed using different input source data and by varying concentration algorithm parameters. This yields a relative uncertainty of 30 000–70 000 km². The Arctic minimum extent uncertainty is ∼40 000 km². Uncertainties in comparing with earlier parts of the record may be higher due to sensor transitions. For the first time, this study provides a quantitative estimate of sea ice extent uncertainty.
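The definition of extent used in this abstract (total area of cells whose ice concentration exceeds 15%) can be written as a short sketch. The function name, the representation of concentration as a fraction between 0 and 1, and the uniform 625 km² cell area are assumptions for illustration; they are not taken from the Sea Ice Index processing.

```python
def sea_ice_extent(concentration, cell_area_km2, threshold=0.15):
    """Sum the area of every grid cell whose ice concentration exceeds the threshold.

    concentration: ice concentration per cell, as a fraction in [0, 1]
    cell_area_km2: area of each cell in km^2, in the same order
    """
    return sum(
        area
        for conc, area in zip(concentration, cell_area_km2)
        if conc > threshold
    )


# Hypothetical five-cell grid with 25 km x 25 km cells (625 km^2 each).
concentration = [0.0, 0.10, 0.15, 0.16, 0.90]
cell_area_km2 = [625.0] * 5
extent = sea_ice_extent(concentration, cell_area_km2)  # 1250.0
```

A cell counts in full once it crosses the threshold, so small concentration differences near the ice edge can move entire cells in or out of the total. That is consistent with the abstract's point that ice edge sensitivity and sensor resolution drive the differences between products.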
-
The Chaos Canyon landslide, which collapsed on the afternoon of 28 June 2022 in Rocky Mountain National Park, presents an opportunity to evaluate instabilities within alpine regions faced with a warming and dynamic climate. Video documentation of the landslide was captured by several eyewitnesses and motivated a rapid field campaign. Initial estimates put the failure area at 66 630 m², with an average elevation of 3555 m above sea level. We undertook an investigation of previous movement of this landslide, measured the volume of material involved, evaluated the potential presence of interstitial ice and snow within the failed deposit, and examined potential climatological impacts on the collapse of the slope. Satellite radar and optical measurements were used to calculate deformation of the landslide in the 5 years leading up to collapse. From 2017 to 2019, the landslide moved ∼5 m yr⁻¹, accelerating to 17 m yr⁻¹ in 2019. Movement took place through both internal deformation and basal sliding. Climate analysis reveals that the collapse took place during peak snowmelt, and 2022 followed 10 years of higher than average positive degree day sums. We also made use of slope stability modeling to test what factors controlled the stability of the area. Models indicate that even a small increase in the water table reduces the factor of safety to <1, leading to failure. We posit that a combination of permafrost thaw from increasing average temperatures, progressive weakening of the basal shear zone from several years of movement, and an increase in pore-fluid pressure from snowmelt led to the 28 June collapse. Material volumes were estimated using structure from motion (SfM) models incorporating photographs from two field expeditions on 8 July 2022, 10 d after the slide. Detailed mapping and SfM models indicate that ∼1 258 000 ± 150 000 m³ of material was deposited at the slide toe and ∼1 340 000 ± 133 000 m³ of material was evacuated from the source area. The Chaos Canyon landslide may be representative of future dynamic alpine topography, wherein slope failures become more common in a warming climate.
-
Hyperbranched polymethacrylates were synthesized by green‐light‐induced atom transfer radical polymerization (ATRP) under biologically relevant conditions in the open air. Sodium 2‐bromoacrylate (SBA) was prepared in situ from commercially available 2‐bromoacrylic acid and used as a water‐soluble inibramer to induce branching during the copolymerization of methacrylate monomers. As a result, well‐defined branched polymethacrylates were obtained in less than 30 min with predetermined molecular weights (36 000 < Mn < 170 000), tunable degree of branching, and low dispersity values (1.14 ≤ Đ ≤ 1.33). Moreover, the use of SBA inibramer enabled the synthesis of bioconjugates with a well‐controlled branched architecture.
-
The Australia Telescope Large Area Survey (ATLAS) and the VLA survey in the XMM-LSS/VIDEO deep field provide deep (≈15 μJy beam⁻¹) and high-resolution (≈4.5–8 arcsec) radio coverage of the three XMM-SERVS fields (W-CDF-S, ELAIS-S1, and XMM-LSS). These data cover a total sky area of 11.3 deg² and contain ≈11 000 radio components. Furthermore, about 3 deg² of the XMM-LSS field also has deeper MIGHTEE data that achieve a median RMS of 5.6 μJy beam⁻¹ and detect more than 20 000 radio sources. We analyse all these radio data and find source counterparts at other wavebands utilizing deep optical and infrared (IR) surveys. The nature of these radio sources is studied using radio-band properties (spectral slope and morphology) and the IR–radio correlation. Radio AGNs are selected and compared with those selected using other methods (e.g. X-ray). We found 1656 new AGNs that were not selected using X-ray and/or MIR methods. We constrain the FIR-to-UV SEDs of radio AGNs using cigale and investigate the dependence of radio AGN fraction upon galaxy stellar mass and star formation rate.

