NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Toward a Typed Intermediate Language for R (Extended Abstract)

https://doi.org/10.4230/oasics.programming.2025.24

Laurent, Mickaël; Hain, Jakob; Krikava, Filip; Krynski, Sebastián; Vitek, Jan (January 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Edwards, Jonathan; Perera, Roly; Petricek, Tomas (Ed.)
Compilers for dynamic languages often rely on intermediate representations with explicit type annotations to facilitate writing program transformations. This paper documents the design of a new typed intermediate representation for a just-in-time compiler for the R programming language called FIŘ. Type annotations, in FIŘ, capture properties such as sharing, the potential for effects, and compiler speculations. In this extended abstract, we focus on the sharing properties that may be used to optimize away some copies of values.
more » « less
Full Text Available
Comparing R Bytecode Compilers Written in R, Java, and Rust (Extended Abstract)

https://doi.org/10.4230/oasics.programming.2025.1

Donat-Bouillud, Pierre; Křikava, Filip; Hain, Jakob; Plodek, Adam; Vitek, Jan (January 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Edwards, Jonathan; Perera, Roly; Petricek, Tomas (Ed.)
This paper presents a comparative analysis of three implementations of the R bytecode compiler: the official R implementation, a Java-based compiler, and a Rust-based compiler. The R compiler, written in R itself, poses challenges in terms of performance and maintainability. We evaluate designs of the compilers, their trade-offs, and performance characteristics. The Rust version outperforms the Java version, which itself outperforms the R version.
more » « less
Full Text Available
Decidable Subtyping of Existential Types for Julia

https://doi.org/10.1145/3656421

Belyakova, Julia; Chung, Benjamin; Tate, Ross; Vitek, Jan (June 2024, Proceedings of the ACM on Programming Languages)

Julia is a modern scientific-computing language that relies on multiple dispatch to implement generic libraries. While the language does not have a static type system, method declarations are decorated with expressive type annotations to determine when they are applicable. To find applicable methods, the implementation uses subtyping at run-time. We show that Julia’s subtyping is undecidable, and we propose a restriction on types to recover decidability by stratifying types into method signatures over value types—where the former can freely use bounded existential types but the latter are restricted to use-site variance. A corpus analysis suggests that nearly all Julia programs written in practice already conform to this restriction.
more » « less
The Fault in Our Stars: Designing Reproducible Large-scale Code Analysis Experiments

https://doi.org/10.4230/LIPIcs.ECOOP.2024.27

Maj, Petr; Muroya, Stefanie; Siek, Konrad; Di_Grazia, Luca; Vitek, Jan (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Aldrich, Jonathan; Salvaneschi, Guido (Ed.)
Large-scale software repositories are a source of insights for software engineering. They offer an unmatched window into the software development process at scale. Their sheer number and size holds the promise of broadly applicable results. At the same time, that very size presents practical challenges for scaling tools and algorithms to millions of projects. A reasonable approach is to limit studies to representative samples of the population of interest. Broadly applicable conclusions can then be obtained by generalizing to the entire population. The contribution of this paper is a standardized experimental design methodology for choosing the inputs of studies working with large-scale repositories. We advocate for a methodology that clearly lays out what the population of interest is, how to sample it, and that fosters reproducibility. Along the way, we discourage researchers from using extrinsic attributes of projects such as stars, that measure some unclear notion of popularity.
more » « less
Full Text Available
Reusing Just-in-Time Compiled Code

https://doi.org/10.1145/3622839

Mehta, Meetesh Kalpesh; Krynski, Sebastián; Gualandi, Hugo Musso; Thakur, Manas; Vitek, Jan (October 2023, Proceedings of the ACM on Programming Languages)

Most code is executed more than once. If not entire programs then libraries remain unchanged from one run to the next. Just-in-time compilers expend considerable effort gathering insights about code they compiled many times, and often end up generating the same binary over and over again. We explore how to reuse compiled code across runs of different programs to reduce warm-up costs of dynamic languages. We propose to usespeculative contextual dispatchto select versions of functions from anoff-line curated code repository. That repository is a persistent database of previously compiled functions indexed by the context under which they were compiled. The repository is curated to remove redundant code and to optimize dispatch. We assess practicality by extending Ř, a compiler for the R language, and evaluating its performance. Our results suggest that the approach improves warmup times while preserving peak performance.
more » « less
Full Text Available
signatr: A Data-Driven Fuzzing Tool for R

https://doi.org/10.1145/3567512.3567530

Turcotte, Alexi; Donat-Bouillud, Pierre; Křikava, Filip; Vitek, Jan (November 2022, SLE)

The fast-and-loose, permissive semantics of dynamic programming languages limit the power of static analyses. For that reason, soundness is often traded for precision through dynamic program analysis. Dynamic analysis is only as good as the available runnable code, and relying solely on test suites is fraught as they do not cover the full gamut of possible behaviors. Fuzzing is an approach for automatically exercising code, and could be used to obtain more runnable code. However, the shape of user-defined data in dynamic languages is difficult to intuit, limiting a fuzzer's reach. We propose a feedback-driven blackbox fuzzing approach which draws inputs from a database of values recorded from existing code. We implement this approach in a tool called signatr for the R language. We present the insights of its design and implementation, and assess signatr's ability to uncover new behaviors by fuzzing 4,829 R functions from 100 R packages, revealing 1,195,184 new signatures.
more » « less
Full Text Available
Deoptless: speculation with dispatched on-stack replacement and specialized continuations

https://doi.org/10.1145/3519939.3523729

Flückiger, Olivier; Ječmen, Jan; Krynski, Sebastián; Vitek, Jan (June 2022, ACM SIGPLAN International Conference on Programming Language Design and Implementation)

Just-in-time compilation provides significant performance improvements for programs written in dynamic languages. These benefits come from the ability of the compiler to spec- ulate about likely cases and generate optimized code for these. Unavoidably, speculations sometimes fail and the opti- mizations must be reverted. In some pathological cases, this can leave the program stuck with suboptimal code. In this paper we propose deoptless, a technique that replaces deopti- mization points with dispatched specialized continuations. The goal of deoptless is to take a step towards providing users with a more transparent performance model in which mysterious slowdowns are less frequent and grave.
more » « less
Full Text Available
First-class environments in R

https://doi.org/10.1145/3486602.3486768

Goel, Aviral; Vitek, Jan (October 2021, ACM SIGPLAN International Symposium on Dynamic Languages)

The R programming language is widely used for statistical computing. To enable interactive data exploration and rapid prototyping, R encourages a dynamic programming style. This programming style is supported by features such as first-class environments. Amongst widely used languages, R has the richest interface for programmatically manipulating environments. With the flexibility afforded by reflective operations on first-class environments, come significant challenges for reasoning and optimizing user-defined code. This paper documents the reflective interface used to operate over first-class environment. We explain the rationale behind its design and conduct a large-scale study of how the interface is used in popular libraries.
more » « less
Full Text Available
CodeDJ: Reproducible Queries over Large-Scale Software Repositories

Maj, P; Siek, K; Kovalenko, A; Vitek, Jan (July 2021, European Conference on Object Oriented Programming)
Moller, A; Sridharan, M (Ed.)
Analyzing massive code bases is a staple of modern software engineering research – a welcome side-effect of the advent of large-scale software repositories such as GitHub. Selecting which projects one should analyze is a labor-intensive process, and a process that can lead to biased results if the selection is not representative of the population of interest. One issue faced by researchers is that the interface exposed by software repositories only allows the most basic of queries. CodeDJ is an infrastructure for querying repositories composed of a persistent datastore, constantly updated with data acquired from GitHub, and an in-memory database with a Rust query interface. CodeDJ supports reproducibility, historical queries are answered deterministically using past states of the datastore; thus researchers can reproduce published results. To illustrate the benefits of CodeDJ, we identify biases in the data of a published study and, by repeating the analysis with new data, we demonstrate that the study’s conclusions were sensitive to the choice of projects.
more » « less
Full Text Available
Sampling optimized code for type feedback

https://doi.org/10.1145/3426422.3426984

Flückiger, Olivier; Wälchli, Andreas; Krynski, Sebastián; Vitek, Jan (November 2020, Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages)

To efficiently execute dynamically typed languages, many language implementations have adopted a two-tier architecture. The first tier aims for low-latency startup times and collects dynamic profiles, such as the dynamic types of variables. The second tier provides high-throughput using an optimizing compiler that specializes code to the recorded type information. If the program behavior changes to the point that not previously seen types occur in specialized code, that specialized code becomes invalid, it is deoptimized, and control is transferred back to the first tier execution engine which will start specializing anew. However, if the program behavior becomes more specific, for instance, if a polymorphic variable becomes monomorphic, nothing changes. Once the program is running optimized code, there are no means to notice that an opportunity for optimization has been missed. We propose to employ a sampling-based profiler to monitor native code without any instrumentation. The absence of instrumentation means that when the profiler is not active, no overhead is incurred. We present an implementation is in the context of the Ř just-in-time, optimizing compiler for the R language. Based on the sampled profiles, we are able to detect when the native code produced by Ř is specialized for stale type feedback and recompile it to more type-specific code. We show that sampling adds an overhead of less than 3% in most cases and up to 9% in few cases and that it reliably detects stale type feedback within milliseconds.
more » « less
Full Text Available

« Prev Next »

Search for: All records