Title: Efficiently Merging r-indexes
Large sequencing projects, such as GenomeTrakr and MetaSub, are updated frequently (sometimes daily, in the case of GenomeTrakr) with new data. Therefore, it is imperative that any data structure indexing such data supports efficient updates. Toward this goal, Bannai et al. (TCS, 2020) proposed a data structure named dynamic r-index which is suitable for large genome collections and supports incremental construction; however, it is still not powerful enough to support substantial updates. Here, we develop a novel algorithm for updating the r-index, which we refer to as RIMERGE. Fundamental to our algorithm is the combination of the basics of the dynamic r-index with a known algorithm for merging Burrows-Wheeler Transforms (BWTs). As a result, RIMERGE is capable of performing batch updates in a manner that exploits parallelism while keeping the memory overhead small. We compare our method to the dynamic r-index of Bannai et al. using two different datasets, and show that RIMERGE is between 1.88 and 5.34 times faster on reasonably large inputs.
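The abstract names BWT merging as the core primitive behind RIMERGE but does not spell it out. As a minimal sketch of that primitive (assuming a Holt/McMillan-style interleave refinement on tiny Python strings, not the paper's genome-scale, parallel implementation), the following merges the BWTs of two '$'-terminated strings into the BWT of the two-string collection:

```python
from collections import defaultdict

def bwt(text):
    """BWT of text + unique sentinel '$' (assumed smallest character)."""
    s = text + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix sort; demo only
    return "".join(s[i - 1] for i in order)             # s[-1] wraps to the sentinel

def merge_bwts(b1, b2):
    """Merge two single-string BWTs into the BWT of the collection {T1, T2}.
    An interleave vector Z records which source each merged position comes
    from; after k refinement rounds, Z orders the merged suffixes by their
    first k characters. Ties (equal suffixes) keep string 1 first."""
    bs = (b1, b2)
    Z = [0] * len(b1) + [1] * len(b2)     # round 0: sorted by zero characters
    while True:
        buckets = defaultdict(list)
        ptr = [0, 0]
        for src in Z:                      # read the merged BWT in current order
            c = bs[src][ptr[src]]
            ptr[src] += 1
            buckets[c].append(src)         # stable distribution by preceding char
        new_Z = [s for c in sorted(buckets) for s in buckets[c]]
        if new_Z == Z:                     # fixed point: suffixes fully sorted
            break
        Z = new_Z
    ptr, out = [0, 0], []
    for src in Z:                          # emit the interleaved characters
        out.append(bs[src][ptr[src]])
        ptr[src] += 1
    return "".join(out)
```

Each round sorts the interleave by one more character of context, so the loop terminates after at most max-text-length rounds; what this sketch does not capture is RIMERGE's contribution of batching updates, exploiting parallelism, and keeping memory overhead small.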
Award ID(s):
2029552
NSF-PAR ID:
10276036
Date Published:
Journal Name:
2021 Data Compression Conference (DCC)
Page Range / eLocation ID:
203 to 212
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Computing a dense subgraph is a fundamental problem in graph mining, with a diverse set of applications ranging from electronic commerce to community detection in social networks. In many of these applications, the underlying context is better modelled as a weighted hypergraph that keeps evolving with time. This motivates the problem of maintaining the densest subhypergraph of a weighted hypergraph in a dynamic setting, where the input keeps changing via a sequence of updates (hyperedge insertions/deletions). Previously, the only known algorithm for this problem was due to Hu et al. [19]. This algorithm worked only on unweighted hypergraphs, and had an approximation ratio of (1+ϵ)r² and an update time of O(poly(r, log n)), where r denotes the maximum rank of the input across all the updates. We obtain a new algorithm for this problem, which works even when the input hypergraph is weighted. Our algorithm has a significantly improved (near-optimal) approximation ratio of (1+ϵ) that is independent of r, and a similar update time of O(poly(r, log n)). It is the first (1+ϵ)-approximation algorithm even for the special case of weighted simple graphs. To complement our theoretical analysis, we perform experiments with our dynamic algorithm on large-scale, real-world datasets. Our algorithm significantly outperforms the state of the art [19] both in terms of accuracy and efficiency.
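For concreteness, the objective this abstract maintains can be stated in a few lines. The sketch below (function names ours, exponential brute force for illustration only, not the paper's dynamic algorithm) takes the density of a node set S to be the total weight of hyperedges fully contained in S divided by |S|:

```python
from itertools import combinations

def density(S, edges):
    """Density of node set S: total weight of hyperedges fully inside S, over |S|.
    edges is a list of (frozenset_of_nodes, weight) pairs."""
    S = set(S)
    return sum(w for e, w in edges if e <= S) / len(S)

def densest_subhypergraph(nodes, edges):
    """Exact densest subhypergraph by exhaustive search over all node subsets.
    Exponential time; only meant to pin down the objective on tiny inputs."""
    best_d, best_S = 0.0, set()
    for r in range(1, len(nodes) + 1):
        for S in combinations(nodes, r):
            d = density(S, edges)
            if d > best_d:
                best_d, best_S = d, set(S)
    return best_d, best_S
```

A dynamic algorithm must maintain (an approximation of) this argmax while `edges` changes under insertions and deletions, which is what makes the (1+ϵ) ratio with O(poly(r, log n)) update time notable.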
  2. Tauman Kalai, Yael (Ed.)
    Over the last two decades, a significant line of work in theoretical algorithms has made progress in solving linear systems of the form 𝐋𝐱 = 𝐛, where 𝐋 is the Laplacian matrix of a weighted graph with weights w(i,j) > 0 on the edges. The solution 𝐱 of the linear system can be interpreted as the potentials of an electrical flow in which the resistance on edge (i,j) is 1/w(i,j). Kelner, Orecchia, Sidford, and Zhu [Kelner et al., 2013] give a combinatorial, near-linear time algorithm that maintains the Kirchhoff Current Law, and gradually enforces the Kirchhoff Potential Law by updating flows around cycles (cycle toggling). In this paper, we consider a dual version of the algorithm that maintains the Kirchhoff Potential Law, and gradually enforces the Kirchhoff Current Law by cut toggling: each iteration updates all potentials on one side of a fundamental cut of a spanning tree by the same amount. We prove that this dual algorithm also runs in a near-linear number of iterations. We show, however, that if we abstract cut toggling as a natural data structure problem, this problem can be reduced to the online vector-matrix-vector problem (OMv), which has been conjectured to be difficult for dynamic algorithms [Henzinger et al., 2015]. The conjecture implies that the data structure does not have an O(n^{1-ε}) time algorithm for any ε > 0, and thus a straightforward implementation of the cut-toggling algorithm requires essentially linear time per iteration. To circumvent the lower bound, we batch update steps, and perform them simultaneously instead of sequentially. An appropriate choice of batching leads to an Õ(m^{1.5}) time cut-toggling algorithm for solving Laplacian systems. Furthermore, we show that if we sparsify the graph and call our algorithm recursively on the Laplacian system implied by batching and sparsifying, we can reduce the running time to O(m^{1 + ε}) for any ε > 0. 
Thus, the dual cut-toggling algorithm can achieve (almost) the same running time as its primal cycle-toggling counterpart. 
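The electrical interpretation above can be made concrete with a toy dense solver (plain Gaussian elimination; emphatically not the near-linear cycle- or cut-toggling algorithms the abstract discusses): build the Laplacian of a weighted graph, ground one vertex at potential 0, and solve the reduced system for the remaining potentials.

```python
def laplacian(n, edges):
    """Graph Laplacian: L[i][i] sums incident weights, L[i][j] = -w(i,j)."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    return L

def solve_potentials(n, edges, b):
    """Solve L x = b for a connected graph with sum(b) == 0 (a valid current
    injection). Vertex n-1 is grounded at potential 0, making the reduced
    (n-1)x(n-1) system nonsingular; solved by Gaussian elimination."""
    L = laplacian(n, edges)
    m = n - 1
    A = [L[i][:m] + [b[i]] for i in range(m)]   # augmented reduced system
    for col in range(m):                         # forward elimination, partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for k in range(col, m + 1):
                A[r][k] -= f * A[col][k]
    x = [0.0] * n                                # grounded vertex stays at 0
    for r in range(m - 1, -1, -1):               # back substitution
        s = A[r][m] - sum(A[r][k] * x[k] for k in range(r + 1, m))
        x[r] = s / A[r][r]
    return x
```

For the path 0–1–2 with unit weights and one unit of current injected at vertex 0 and extracted at vertex 2, the potentials come out to [2, 1, 0], matching the series-resistance picture (resistance 1/w per edge); the point of the toggling algorithms is to reach such solutions without ever forming or factoring 𝐋.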
  3. The phenology of critical biological events in aquatic ecosystems is rapidly shifting due to climate change. Growing variability in phenological cues can increase the likelihood of trophic mismatches, causing recruitment failures in commercially, culturally, and recreationally important fisheries. We tested for changes in spawning phenology of regionally important walleye (Sander vitreus) populations in 194 Midwest US lakes in Minnesota, Michigan, and Wisconsin spanning 1939-2019 to investigate factors influencing walleye phenological responses to climate change and associated climate variability, including ice-off timing, lake physical characteristics, and population stocking history. Data from Wisconsin and Michigan lakes (185 and 5 out of 194 total lakes, respectively) were collected by the Wisconsin Department of Natural Resources (WDNR) and the Great Lakes Indian Fish and Wildlife Commission (GLIFWC) through standardized spring walleye mark-recapture surveys and spring tribal harvest season records. Standardized spring mark-recapture population estimates are performed shortly after ice-off: following a marking event, a recapture sampling event is conducted using nighttime electrofishing (typically AC for WDNR, pulsed-DC for GLIFWC) of the entire shoreline, including islands, for small lakes, and of index stations for large lakes (Hansen et al. 2015), timed to coincide with peak walleye spawning activity (G. Hatzenbeler, WDNR, personal communication; M. Luehring, GLIFWC, personal communication; Beard et al. 1997). Data for four additional Minnesota lakes were collected by the Minnesota Department of Natural Resources (MNDNR) beginning in 1939 during annual collections of walleye eggs and broodstock (Schneider et al. 2010), where date of peak egg take was used to index peak spawning activity. 
For lakes where the spawning location did not match the lake for which ice-off data were collected, the spawning location either flowed into a lake where ice-off data were available (Pike River) or was within 50 km of one (Pine River), and those ice-off data were used. Following the affirmation of off-reservation Ojibwe tribal fishing rights in the Ceded Territories of Wisconsin and the Upper Peninsula of Michigan in 1987, tribal spearfishers have targeted walleye during spring spawning (Mrnak et al. 2018). Nightly harvests are recorded as part of a compulsory creel survey (US Department of the Interior 1991). Using these records, we calculated the date of peak spawning activity in a given lake-year as the day of maximum tribal harvest. Although we were unable to account for varying effort in these data, a preliminary analysis comparing spawning dates estimated using tribal harvest to those determined from standardized agency surveys in the same lake and year showed that they were highly correlated (Pearson's correlation: r = 0.91, P < 0.001). For lakes that had walleye spawning data from both agency surveys and tribal harvest, we used the data source with the greater number of observation years. Ice-off phenology data were collected from two sources: either observed, from the Global Lake and River Ice Phenology database (Benson et al. 2000), or modeled, from a USGS region-wide machine-learning model which used North American Land Data Assimilation System (NLDAS) meteorological inputs combined with lake characteristics (lake position, clarity, size, depth, hypsography, etc.) to predict daily water column temperatures from 1979 to 2022, from which ice-off dates could be derived (https://www.sciencebase.gov/catalog/item/6206d3c2d34ec05caca53071; see Corson-Dosch et al. 2023 for details). Modeled ice-off dates for our study lakes (see Read et al. 2021 for modeling details) reflected ice phenology well when compared to observed data (i.e., highly significant correlation between observed and modeled ice-off dates when both were available; r = 0.71, p < 0.001). Lake surface area (ha), latitude, and maximum depth (m) were acquired from agency databases and lake reports. Lake class was based on a WDNR lakes classification system (Rypel et al. 2019) that categorized lakes based on temperature, water clarity, depth, and fish community. Walleye stocking history was defined using the walleye stocking classification system developed by the Wisconsin Technical Working Group (see also Sass et al. 2021), which categorized lakes based on relative contributions of naturally-produced and stocked fish to adult recruitment, relying heavily on historic records of age-0 and age-1 catch rates and stocking histories. Wisconsin lakes were divided into three groups: natural recruitment (NR), a combination of stocking and natural recruitment (C-ST), and stocked only (ST). Walleye natural recruitment was indexed as age-0 walleye CPE (number of age-0 walleye captured per km of shoreline electrofished) from WDNR and GLIFWC fall electrofishing surveys (see Hansen et al. 2015 for details). We excluded lake-years where stocking of age-0 fish occurred before age-0 surveys, to include only measurements of naturally-reproduced fish. 
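The two quantitative steps in these methods (taking peak spawning as the day of maximum nightly harvest, and checking agreement between the two phenology estimates with a Pearson correlation) are simple to sketch; the harvest numbers below are invented for illustration.

```python
def peak_spawn_day(nightly_harvest):
    """Peak spawning date for a lake-year: the day with maximum harvest count.
    nightly_harvest maps day-of-year -> number of walleye harvested."""
    return max(nightly_harvest, key=nightly_harvest.get)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```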
  4. While a reinvigoration of ocean circulation and CO2 outgassing is the leading explanation for atmospheric CO2 rise since the Last Glacial Maximum (LGM), there is also evidence of marine geologic carbon release over the last 20,000 years. Much of this evidence points to regions of the mid-depth Pacific Ocean, where multiple radiocarbon (14C) records show anomalously low 14C/C values, potentially caused by the addition of 14C-free geologic carbon [1,2]. To better constrain this geologic carbon release hypothesis, we aim to place an upper bound on the amount of carbon that may have been added, in addition to the geochemical pathway of that carbon. To do so, we numerically invert a carbon cycle model based on observational atmospheric CO2 and 14C records. Given these observational constraints, we use data assimilation techniques and an optimization algorithm to calculate the rate of carbon addition and its alkalinity-to-carbon ratio (RA/C) over the last 20,000 years. Using the modeled planetary radiocarbon budget calculated in Hain et al. [3], we find observations allow for only ~300 Pg of carbon to be added, as a majority of the deglacial atmospheric 14C decline is already explained by magnetic field strength changes and ocean circulation changes [3]. However, when we adjust the initial state of the model by increasing 14C by 75‰ to match the observational 14C records, we find that observations allow for ~3500 Pg of carbon addition with an average RA/C of ~1.4. These results allow for the possibility of a large release of 14C-free geologic carbon, which could produce local and regional 14C anomalies such as those recorded in the Pacific [1,2]. As this geologic carbon was added with an RA/C of ~1.4, these results also imply that 14C evidence for significant geologic carbon release since the LGM may not be taken as contributing to deglacial CO2 rise, unless there is evidence for significant local acidification and corrosion of seafloor sediments. 
If the geologic carbon cycle is indeed more dynamic than previously thought, we may also need to rethink the approach to estimate the land/ocean carbon repartitioning from the deglacial stable carbon isotope budget. [1] Rafter et al. (2019), GRL 46(23), 13950–13960. [2] Ronge et al. (2016), Nature Communications 7(1), 11487. [3] Hain et al. (2014), EPSL 394, 198–208. 
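The actual inversion above uses a full carbon-cycle model with data assimilation; as a minimal illustration of the underlying idea (fitting a source rate to an observed record by least squares, with all numbers hypothetical and no real carbon-cycle physics), one can solve for the constant addition rate that best explains a linear trend:

```python
def fit_constant_rate(times, obs, baseline):
    """Closed-form least squares for obs ≈ baseline + rate * t.
    Toy stand-in for inverting a model against an observational record:
    times in years, obs and baseline in the same units (e.g., ppm CO2)."""
    num = sum(t * (o - baseline) for t, o in zip(times, obs))
    den = sum(t * t for t in times)
    return num / den
```

The real problem is far more constrained (two tracers, CO2 and 14C, plus an alkalinity-to-carbon ratio as a second unknown), which is why the abstract's answer changes so much when the model's initial 14C state is shifted.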
  5. The Global Biodiversity Information Facility (GBIF 2022a) has indexed more than 2 billion occurrence records from 70,147 datasets. These datasets often include "hidden" biotic interaction data because biodiversity communities use the Darwin Core standard (DwC, Wieczorek et al. 2012) in different ways to document biotic interactions. In this study, we extracted biotic interactions from GBIF data using an approach similar to that employed in the Global Biotic Interactions (GloBI; Poelen et al. 2014) and summarized the results. Here we aim to present an estimation of the interaction data available in GBIF, showing that biotic interaction claims can be automatically found and extracted from GBIF. Our results suggest that much can be gained by an increased focus on development of tools that help to index and curate biotic interaction data in existing datasets. Combined with data standardization and best practices for sharing biotic interactions, such as the initiative on plant-pollinators interaction (Salim 2022), this approach can rapidly contribute to and meet open data principles (Wilkinson 2016). We used Preston (Elliott et al. 2020), open-source software that versions biodiversity datasets, to copy all GBIF-indexed datasets. The biodiversity data graph version (Poelen 2020) of the GBIF-indexed datasets used during this study contains 58,504 datasets in Darwin Core Archive (DwC-A) format, totaling 574,715,196 records. After retrieval and verification, the datasets were processed using Elton. Elton extracts biotic interaction data and supports 20+ existing file formats, including various types of data elements in DwC records. Elton also helps align interaction claims (e.g., host of, parasite of, associated with) to the Relations Ontology (RO, Mungall 2022), making it easier to discover datasets across a heterogeneous collection of datasets. 
Using a specific mapping from interaction claims found in the DwC records to terms in RO, Elton found 30,167,984 potential records (with non-empty values for the scanned DwC terms) and 15,248,478 records with recognized interaction types. Taxonomic name validation was performed using Nomer, which maps input names to names found in a variety of taxonomic catalogs. We considered an interaction record valid only where the interaction type could be mapped to a term in RO and where Nomer found a valid name for source and target taxa. Based on the workflow described in Fig. 1, we found 7,947,822 interaction records (52% of the potential interactions). Most of them were generic interactions (interacts_with, 87.5%), but the remaining 12.5% (993,477 records) included host-parasite and plant-animal interactions. The majority of the interaction records found involved plants (78%), animals (14%) and fungi (6%). In conclusion, there are many biotic interactions embedded in existing datasets registered in large biodiversity data indexers and aggregators like iDigBio, GBIF, and BioCASE. We exposed these biotic interaction claims using the combined functionality of biodiversity data tools Elton (for interaction data extraction), Preston (for reliable dataset tracking) and Nomer (for taxonomic name alignment). Nonetheless, the development of new vocabularies, standards and best practice guides would facilitate aggregation of interaction data, including the diversification of the GBIF data model (GBIF 2022b) for sharing biodiversity data beyond occurrence data. That is the aim of the TDWG Interest Group on Biological Interactions Data (TDWG 2022). 
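The validity filter described here (map an interaction claim to an RO term; keep the record only if both taxon names validate) can be sketched as follows. The record fields, the mapping, and the name check are illustrative stand-ins for what Elton and Nomer actually do, and the RO identifiers in the example are for illustration only.

```python
def extract_interactions(records, ro_map, is_valid_name):
    """Keep a record only if its interaction claim maps to an RO term and
    both source and target taxon names validate; annotate with the RO term."""
    kept = []
    for rec in records:
        ro_term = ro_map.get(rec.get("interactionType", "").strip().lower())
        if (ro_term
                and is_valid_name(rec.get("sourceName"))
                and is_valid_name(rec.get("targetName"))):
            kept.append({**rec, "roTerm": ro_term})
    return kept
```

In the real pipeline the mapping covers many verbatim phrasings across 20+ file formats, and name validation is a lookup against multiple taxonomic catalogs rather than a simple predicate; this is only the shape of the filter that yields the 52% figure above.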