skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: PaC-trees: supporting parallel and compressed purely-functional collections
Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on bal- anced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks us- ing arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC- trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that im- plements the full functionality of PAM, while offering signifi- cant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1ś7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3ś2.6x less space than the state-of-the-art graph processing system Aspen.  more » « less
Award ID(s):
1901381 2119352 1910030 2103483 1919223
PAR ID:
10338921
Author(s) / Creator(s):
; ; ;
Editor(s):
Ranjit Jhala and Isil Dillig
Date Published:
Journal Name:
ACM SIGPLAN International Conference on Programming Language Design and Implementation
Page Range / eLocation ID:
108 to 121
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. There has been a growing interest in the graph-streaming setting where a continuous stream of graph updates is mixed with graph queries. In principle, purely-functional trees are an ideal fit for this setting as they enable safe parallelism, lightweight snapshots, and strict serializability for queries. However, directly using them for graph processing leads to significant space overhead and poor cache locality. This paper presents C-trees, a compressed purely-functional search tree data structure that significantly improves on the space usage and locality of purely-functional trees. We design theoretically-efficient and practical algorithms for performing batch updates to C-trees, and also show that we can store massive dynamic real-world graphs using only a few bytes per edge, thereby achieving space usage close to that of the best static graph processing frameworks. To study the applicability of our data structure, we designed Aspen, a graph-streaming framework that extends the interface of Ligra with operations for updating graphs. We show that Aspen is faster than two state-of-the-art graph-streaming systems, Stinger and LLAMA, while requiring less memory, and is competitive in performance with the state-of-the-art static graph frameworks, Galois, GAP, and Ligra+. With Aspen, we are able to efficiently process the largest publicly-available graph with over two hundred billion edges in the graph-streaming setting using a single commodity multicore server with 1TB of memory. 
    more » « less
  2. Phylogenetic placement, used widely in ecological analyses, seeks to add a new species to an existing tree. A deep learning approach was previously proposed to estimate the distance between query and backbone species by building a map from gene sequences to a high-dimensional space that preserves species tree distances. They then use a distance-based placement method to place the queries on that species tree. In this paper, we examine the appropriate geometry for faithfully representing tree distances while embedding gene sequences. Theory predicts that hyperbolic spaces should provide a drastic reduction in distance distortion compared to the conventional Euclidean space. Nevertheless, hyperbolic embedding imposes its own unique challenges related to arithmetic operations, exponentially-growing functions, and limited bit precision, and we address these challenges. Our results confirm that hyperbolic embeddings have substantially lower distance errors than Euclidean space. However, these better-estimated distances do not always lead to better phylogenetic placement. We then show that the deep learning framework can be used not just to place on a backbone tree but to update it to obtain a fully resolved tree. With our hyperbolic embedding framework, species trees can be updated remarkably accurately with only a handful of genes. 
    more » « less
  3. Abstract Despite the availability of Cas9 variants with varied protospacer-adjacent motif (PAM) compatibilities, some genomic loci—especially those with pyrimidine-rich PAM sequences—remain inaccessible by high-activity Cas9 proteins. Moreover, broadening PAM sequence compatibility through engineering can increase off-target activity. With directed evolution, we generated four Cas9 variants that together enable targeting of most pyrimidine-rich PAM sequences in the human genome. Using phage-assisted noncontinuous evolution and eVOLVER-supported phage-assisted continuous evolution, we evolved Nme2Cas9, a compact Cas9 variant, into variants that recognize single-nucleotide pyrimidine-PAM sequences. We developed a general selection strategy that requires functional editing with fully specified target protospacers and PAMs. We applied this selection to evolve high-activity variants eNme2-T.1, eNme2-T.2, eNme2-C and eNme2-C.NR. Variants eNme2-T.1 and eNme2-T.2 offer access to N4TN PAM sequences with comparable editing efficiencies as existing variants, while eNme2-C and eNme2-C.NR offer less restrictive PAM requirements, comparable or higher activity in a variety of human cell types and lower off-target activity at N4CN PAM sequences. 
    more » « less
  4. xCas9 is an evolved variant of the CRISPR-Cas9 genome editing system, engineered to improve specificity and reduce undesired off-target effects. How xCas9 expands the DNA targeting capability of Cas9 by recognising a series of alternative protospacer adjacent motif (PAM) sequences while ignoring others is unknown. Here, we elucidate the molecular mechanism underlying xCas9’s expanded PAM recognition and provide critical insights for expanding DNA targeting. We demonstrate that while wild-type Cas9 enforces stringent guanine selection through the rigidity of its interacting arginine dyad, xCas9 introduces flexibility in R1335, enabling selective recognition of specific PAM sequences. This increased flexibility confers a pronounced entropic preference, which also improves recognition of the canonical TGG PAM. Furthermore, xCas9 enhances DNA binding to alternative PAM sequences during the early evolution cycles, while favouring binding to the canonical PAM in the final evolution cycle. This dual functionality highlights how xCas9 broadens PAM recognition and underscores the importance of fine-tuning the flexibility of the PAM-interacting cleft as a key strategy for expanding the DNA targeting potential of CRISPR-Cas systems. These findings deepen our understanding of DNA recognition in xCas9 and may apply to other CRISPR-Cas systems with similar PAM recognition requirements. 
    more » « less
  5. Random forests use ensembles of decision trees to boost accuracy for machine learning tasks. However, large ensembles slow down inference on platforms that process each tree in an ensemble individually. We present Bolt, a platform that restructures whole random forests, not just individual trees, to speed up inference. Conceptually, Bolt maps every path in each tree to a lookup table which, if cache were large enough, would allow inference with just one memory access. When the size of the lookup table exceeds cache capacity, Bolt employs a novel combination of lossless compression, parameter selection, and bloom filters to shrink the table while preserving fast inference. We compared inference speed in Bolt to three state-of-the-art platforms: Python Scikit-Learn, Ranger, and Forest Packing. We evaluated these platforms using datasets with vision, natural language processing and categorical applications. We observed that on ensembles of shallow decision trees Bolt can run 2-14X faster than competing platforms and that Bolt's speedups persist as the number of decision trees in an ensemble increases. 
    more » « less