skip to main content


Title: Rhyme: A Data-Centric Expressive Query Language for Nested Data Structures
We present Rhyme, an expressive language designed for high-level data manipulation, with a primary focus on querying and transforming nested structures such as JSON and tensors, while yielding nested structures as output. Rhyme draws inspiration from a diverse range of declarative languages, including Datalog, JQ, JSONiq, Einstein summation (Einsum), GraphQL, and more recent functional logic programming languages like Verse. It has a syntax that closely resembles existing object notation, is compositional, and has the ability to perform query optimization and code generation through the construction of an intermediate representation (IR). Our IR comprises loop-free and branch-free code with program structure implicitly captured via dependencies. To demonstrate Rhyme’s versatility, we implement Rhyme in JavaScript (as an embedded DSL) and illustrate its application across various domains, showcasing its ability to express common data manipulation queries, tensor expressions (à la Einsum), and more.  more » « less
Award ID(s):
1918483
PAR ID:
10494806
Author(s) / Creator(s):
;
Publisher / Repository:
ACM (https://dl.acm.org/doi/10.1007/978-3-031-52038-9_5)
Date Published:
Journal Name:
Practical Aspects of Declarative Languages: 26th International Symposium
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The fast-and-loose, permissive semantics of dynamic programming languages limit the power of static analyses. For that reason, soundness is often traded for precision through dynamic program analysis. Dynamic analysis is only as good as the available runnable code, and relying solely on test suites is fraught as they do not cover the full gamut of possible behaviors. Fuzzing is an approach for automatically exercising code, and could be used to obtain more runnable code. However, the shape of user-defined data in dynamic languages is difficult to intuit, limiting a fuzzer's reach. We propose a feedback-driven blackbox fuzzing approach which draws inputs from a database of values recorded from existing code. We implement this approach in a tool called signatr for the R language. We present the insights of its design and implementation, and assess signatr's ability to uncover new behaviors by fuzzing 4,829 R functions from 100 R packages, revealing 1,195,184 new signatures. 
    more » « less
  2. null (Ed.)
    Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets.We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functional-equivalent transformation programs could be converted to other functionally equivalent programs, in the same or different language, permitting code reuse and result reproducibility, We also illustrate the possibility of using of SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations. 
    more » « less
  3. Rowhammer is an increasingly threatening vulnerability that grants an attacker the ability to flip bits in memory without directly accessing them. Despite efforts to mitigate Rowhammer via software and defenses built directly into DRAM modules, more recent generations of DRAM are actually more susceptible to malicious bit-flips than their predecessors. This phenomenon has spawned numerous exploits, showing how Rowhammer acts as the basis for various vulnerabilities that target sensitive structures, such as Page Table Entries (PTEs) or opcodes, to grant control over a victim machine. However, in this paper, we consider Rowhammer as a more general vulnerability, presenting a novel exploit vector for Rowhammer that targets particular code patterns. We show that if victim code is designed to return benign data to an unprivileged user, and uses nested pointer dereferences, Rowhammer can flip these pointers to gain arbitrary read access in the victim's address space. Furthermore, we identify gadgets present in the Linux kernel, and demonstrate an end-to-end attack that precisely flips a targeted pointer. To do so we developed a number of improved Rowhammer primitives, including kernel memory massaging, Rowhammer synchronization, and testing for kernel flips, which may be of broader interest to the Rowhammer community. Compared to prior works' leakage rate of .3 bits/s, we show that such gadgets can be used to read out kernel data at a rate of 82.6 bits/s. By targeting code gadgets, this work expands the scope and attack surface exposed by Rowhammer. It is no longer sufficient for software defenses to selectively pad previously exploited memory structures in flip-safe memory, as any victim code that follows the pattern in question must be protected. 
    more » « less
  4. null (Ed.)
    Linear nested codes, where two or more sub-codes are nested in a global code, have been proposed as candidates for reliable multi-terminal communication. In this paper, we consider nested array-based spatially coupled low-density parity-check (SC-LDPC) codes and propose a line-counting based optimization scheme for minimizing the number of dominant absorbing sets in order to improve its performance in the high signal-to-noise ratio regime. Since the parity-check matrices of different nested sub-codes partially overlap, the optimization of one nested sub-code imposes constraints on the optimization of the other sub-codes. To tackle these constraints, a multi-step optimization process is applied first to one of the nested codes, then sequential optimization of the remaining nested codes is carried out based on the constraints imposed by the previously optimized sub-codes. Results show that the order of optimization has a significant impact on the number of dominant absorbing sets in the Tanner graph of the code, resulting in a trade-off between the performance of a nested code structure and its optimization sequence: the code which is optimized without constraints has fewer harmful structures than the code which is optimized with constraints. We also show that for certain code parameters, dominant absorbing sets in the Tanner graphs of all nested codes are completely removed using our proposed optimization strategy. 
    more » « less
  5. Random data generators can be thought of as parsers of streams of randomness. This perspective on generators for random data structures is established folklore in the programming languages community, but it has never been formalized, nor have its consequences been deeply explored. We build on the idea of freer monads to develop free generators, which unify parsing and generation using a common structure that makes the relationship between the two concepts precise. Free generators lead naturally to a proof that a monadic generator can be factored into a parser plus a distribution over choice sequences. Free generators also support a notion of derivative, analogous to the familiar Brzozowski derivatives of formal languages, allowing analysis tools to “preview” the effect of a particular generator choice. This gives rise to a novel algorithm for generating data structures satisfying user-specified preconditions. 
    more » « less