NSF PAR Search | NSF Public Access Repository

Hyperspecialized Compilation for Serverless Data Analytics

Spiegelberg, Leonhard; Kraska, Tim; Schwarzkopf, Malte (August 2023, Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023))

Serverless functions can be spun up in milliseconds and scaled out quickly, forming an ideal platform for quick, interactive parallel queries over large data sets. Modern databases use code generation to produce efficient physical plans, but compiling such a plan on each serverless function is costly: every millisecond spent executing on serverless functions multiplies in cost by the number of functions running. Existing serverless data science frameworks therefore generate and compile code on the client, which precludes specializing this code to patterns that may exist in the input data of individual serverless functions. This paper argues for exploring a trade-off space between one-off code generation on the client, and hyperspecialized compilation that generates bespoke code on each serverless function. Our preliminary experiments show that hyperspecialization outperforms client-based compilation on typical heterogeneous datasets in both cost and performance by 2–4×.
K9db: Privacy-Compliant Storage For Web Applications By Construction

Dak Albab, Kinan; Sharma, Ishan; Adam, Justus; Kilimnik, Benjamin; Jeyaraj, Aaron; Paul, Raj; Agvanian, Artem; Spiegelberg, Leonhard; Schwarzkopf, Malte (July 2023, Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI))

Data privacy laws like the EU’s GDPR grant users new rights, such as the right to request access to and deletion of their data. Manual compliance with these requests is error-prone and imposes costly burdens especially on smaller organizations, as non-compliance risks steep fines. K9db is a new, MySQL-compatible database that complies with privacy laws by construction. The key idea is to make the data ownership and sharing semantics explicit in the storage system. This requires K9db to capture and enforce applications’ complex data ownership and sharing semantics, but in exchange simplifies privacy compliance. Using a small set of schema annotations, K9db infers storage organization, generates procedures for data retrieval and deletion, and reports compliance errors if an application risks violating the GDPR. Our K9db prototype successfully expresses the data sharing semantics of real web applications, and guides developers to getting privacy compliance right. K9db also matches or exceeds the performance of existing storage systems, at the cost of a modest increase in state size.
Towards Increased Datacenter Efficiency with Soft Memory

https://doi.org/10.1145/3593856.3595902

Frisella, Megan; Sanchez, Shirley Loayza; Schwarzkopf, Malte (June 2023, Proceedings of the 19th ACM SIGOPS Workshop on Hot Topics in Operating Systems)

Memory is the bottleneck resource in today’s datacenters because it is inflexible: low-priority processes are routinely killed to free up resources during memory pressure. This wastes CPU cycles upon re-running killed jobs and incentivizes datacenter operators to run at low memory utilization for safety. This paper introduces soft memory, a software-level abstraction on top of standard primary storage that, under memory pressure, makes memory revocable for reallocation elsewhere. We prototype soft memory with the Redis key-value store, and find that it has low overhead.
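The abstract does not describe soft memory's interface, so the following is only a toy illustration of the general idea of revocable memory: a cache that gives entries back on request instead of the whole process being killed. The class name `SoftCache`, its methods, and the oldest-first reclamation order are all assumptions made for this sketch and are not taken from the paper or from Redis.

```python
from collections import OrderedDict
from typing import Optional


class SoftCache:
    """Toy cache whose entries can be revoked under memory pressure.

    Hypothetical illustration only; not the paper's API.
    """

    def __init__(self) -> None:
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.used = 0  # bytes currently held by cached values

    def put(self, key: str, value: bytes) -> None:
        # Replace any existing value so the byte count stays accurate.
        if key in self._entries:
            self.used -= len(self._entries.pop(key))
        self._entries[key] = value
        self.used += len(value)

    def get(self, key: str) -> Optional[bytes]:
        # A revoked entry simply looks like a cache miss to the caller.
        return self._entries.get(key)

    def reclaim(self, nbytes: int) -> int:
        """Drop oldest entries until at least nbytes are freed (assumed policy)."""
        freed = 0
        while self._entries and freed < nbytes:
            _, value = self._entries.popitem(last=False)
            freed += len(value)
        self.used -= freed
        return freed
```

In this sketch the application must tolerate a `get` returning `None` after a reclaim, which is the trade-off that makes the memory safe to take away.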
Applying cognitive principles to model-finding output: the positive value of negative information

https://doi.org/10.1145/3527323

Dyer, Tristan; Nelson, Tim; Fisler, Kathi; Krishnamurthi, Shriram (April 2022, Proceedings of the ACM on Programming Languages)

Model-finders, such as SAT/SMT-solvers and Alloy, are used widely both directly and embedded in domain-specific tools. They support both conventional verification and, unlike other verification tools, property-free exploration. To do this effectively, they must produce output that helps users with these tasks. Unfortunately, the output of model-finders has seen relatively little rigorous human-factors study. Conventionally, these tools tend to show one satisfying instance at a time. Drawing inspiration from the cognitive science literature, we investigate two aspects of model-finder output: how many instances to show at once, and whether all instances must actually satisfy the input constraints. Using both controlled studies and open-ended talk-alouds, we show that there is benefit to showing negative instances in certain settings; the impact of multiple instances is less clear. Our work is a first step in a theoretically grounded approach to understanding how users engage cognitively with model-finder output, and how those tools might better support users in doing so.
Tuplex: Data Science in Python at Native Code Speed

https://doi.org/10.1145/3448016.3457244

Spiegelberg, Leonhard; Yesantharao, Rahul; Schwarzkopf, Malte; Kraska, Tim (June 2021, Proceedings of the 2021 International Conference on Management of Data)
Today's data science pipelines often rely on user-defined functions (UDFs) written in Python. But interpreted Python code is slow, and Python UDFs cannot be compiled to machine code easily. We present Tuplex, a new data analytics framework that just-in-time compiles developers' natural Python UDFs into efficient, end-to-end optimized native code. Tuplex introduces a novel dual-mode execution model that compiles an optimized fast path for the common case, and falls back on slower exception code paths for data that fail to match the fast path's assumptions. Dual-mode execution is crucial to making end-to-end optimizing compilation tractable: by focusing on the common case, Tuplex keeps the code simple enough to apply aggressive optimizations. Thanks to dual-mode execution, Tuplex pipelines always complete even if exceptions occur, and Tuplex's post-facto exception handling simplifies debugging. We evaluate Tuplex with data science pipelines over real-world datasets. Compared to Spark and Dask, Tuplex improves end-to-end pipeline runtime by 5–91× and comes within 1.1–1.7× of a hand-optimized C++ baseline. Tuplex outperforms other Python compilers by 6× and competes with prior, more limited query compilers. Optimizations enabled by dual-mode processing improve runtime by up to 3×, and Tuplex performs well in a distributed setting on serverless functions.
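The dual-mode idea described in this abstract can be illustrated with a small sketch in plain Python. This is not Tuplex's API or implementation: the function names (`fast_path`, `general_path`, `run_dual_mode`) and the doubling UDF are hypothetical, and nothing here is compiled. It only shows the control flow of a strict common-case path, a slower fallback for rows that violate its assumptions, and failures collected for inspection after the run.

```python
from typing import Iterable, List, Tuple


def fast_path(row: str) -> int:
    # Specialized for the assumed common case: a plain run of digits.
    # Anything else is rejected rather than handled here.
    if not row.isdigit():
        raise ValueError(f"row violates fast-path assumption: {row!r}")
    return int(row) * 2


def general_path(row: str) -> int:
    # Slower, more permissive handling for rows the fast path rejected.
    return int(float(row.strip())) * 2


def run_dual_mode(rows: Iterable[str]) -> Tuple[List[int], List[Tuple[int, str, str]]]:
    results: List[Tuple[int, int]] = []
    deferred: List[Tuple[int, str]] = []
    failures: List[Tuple[int, str, str]] = []
    for i, row in enumerate(rows):
        try:
            results.append((i, fast_path(row)))
        except ValueError:
            deferred.append((i, row))  # set aside instead of aborting
    for i, row in deferred:
        try:
            results.append((i, general_path(row)))
        except (ValueError, OverflowError) as exc:
            failures.append((i, row, str(exc)))  # reported after the run
    results.sort()  # restore input order
    return [value for _, value in results], failures


values, failures = run_dual_mode(["1", " 2.0 ", "3", "oops"])
print(values)    # [2, 4, 6]
print(failures)  # one entry, for row index 3
```

The pipeline completes even though one row cannot be processed at all; that row is returned alongside the results rather than raising mid-run.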
A New Model for Weaving Responsible Computing Into Courses Across the CS Curriculum

https://doi.org/10.1145/3408877.3432456

Cohen, Lena; Precel, Heila; Triedman, Harold; Fisler, Kathi (March 2021, Proceedings of the 52nd ACM Technical Symposium on Computer Science Education)
Early Post-Secondary Student Performance of Adversarial Thinking

https://doi.org/10.1145/3446871.3469743

Young, Nick; Krishnamurthi, Shriram (January 2021, Proceedings of the 17th ACM Conference on International Computing Education Research)
