Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage are widely used to store raw data for machine learning applications. When the data is later processed, the analysis predominantly focuses on regions of interest (such as a small bounding box in a larger image) and discards uninteresting regions. Machine learning applications can significantly accelerate their I/O if they push this data filtering step to the cloud. Prior work has proposed different methods to partially read array (tensor) objects, such as chunking, reading a contiguous byte range, and evaluating a lambda function. No method is optimal; estimating the total time and cost of a data retrieval requires an understanding of the data serialization order, the chunk size and platform-specific properties. This paper introduces ArrayMorph, a cloud-based array data storage system that automatically determines which is the best method to use to retrieve regions of interest from data on the cloud. ArrayMorph formulates data accesses as hyperslab queries, and optimizes them using a multi-phase cost-based approach. ArrayMorph seamlessly integrates with Python/PyTorch-based ML applications, and is experimentally shown to transfer up to 9.8X less data than existing systems. This makes ML applications run up to 1.7X faster and 9X cheaper than prior solutions.more » « lessFree, publicly-accessible full text available May 1, 2026
-
Distributed data stores typically provide weak isolation levels, which are efficient but can lead to unserializable behaviors, which are hard for programmers to understand and often result in errors. This paper presents the first dynamic predictive analysis for data store applications under weak isolation levels, called IsoPredict. Given an observed serializable execution of a data store application, IsoPredict generates and solves SMT constraints to find an unserializable execution that is a feasible execution of the application. IsoPredict introduces novel techniques to handle divergent application behavior; to solve mutually recursive sets of constraints; and to balance coverage, precision, and performance. An evaluation shows IsoPredict finds unserializable behaviors in four data store benchmarks, and that more than 99% of its predicted executions are feasible.more » « less
-
Developer’s Responsibility or Database’s Responsibility? Rethinking Concurrency Control in DatabasesMany database applications execute transactions under a weaker isolation level, such as READ COMMITTED. This often leads to concurrency bugs that look like race conditions in multi-threaded programs. While this problem is well known, philosophies of how to address this problem vary a lot, ranging from making a SERIALIZABLE database faster to living with weaker isolation and the consequence of concurrency bugs. This paper studies the consequences, root causes, and how developers fix 93 real-world concurrency bugs in database applications. We observe that, on the one hand, developers still prefer preventing these bugs from happening. On the other hand, database systems are not providing sufficient support for this task, so developers often fix these bugs using ad-hoc solutions, which are often complicated and not fully correct. We further discuss research opportunities to improve concurrency control in database implementations.more » « less
-
Public opinion surveys constitute a widespread, powerful tool to study peoples’ attitudes and behaviors from comparative perspectives. However, even global surveys can have limited geographic and temporal coverage, which can hinder the production of comprehensive knowledge. To expand the scope of comparison, social scientists turn to ex-post harmonization of variables from datasets that cover similar topics but in different populations and/or at different times. These harmonized datasets can be analyzed as a single source and accessed through various data portals. However, the Survey Data Recycling (SDR) research project has identified three challenges faced by social scientists when using data portals: the lack of capability to explore data in-depth or query data based on customized needs, the difficulty in efficiently identifying related data for studies, and the incapability to evaluate theoretical models using sliced data. To address these issues, the SDR research project has developed the SDR Querier, which is applied to the harmonized SDR database. The SDR Querier includes a BERT-based model that allows for customized data queries through research questions or keywords (Query-by-Question), a visual design that helps users determine the availability of harmonized data for a given research question (Query-by-Condition), and the ability to reveal the underlying relational patterns among substantive and methodological variables in the database (Query-by-Relation), aiding in the rigorous evaluation or improvement of regression models. Case studies with multiple social scientists have demonstrated the usefulness and effectiveness of the SDR Querier in addressing daily challenges.more » « less
-
Database applications frequently use weaker isolation levels, such as Read Committed, for better performance, which may lead to bugs that do not happen under Serializable. Although a number of works have proposed methods to identify such isolation-related bugs, the difficulty of analyzing reported bugs is often underestimated, since these bugs often involve multiple complicated transactions interleaved in a specific order and they often require users' feedback to improve the accuracy of bug analysis. This paper presents IsoBugView, a tool to visualize isolation bugs and incorporate users' feedback: to address the challenge that a complicated bug may include much information and thus is hard to present, IsoBugView displays a high-level overview of the bug first and displays further information of individual pieces if the developer needs further investigation. To incorporate users' feedback, IsoBugView embeds hook functions into the backend analysis tool to preprocess a dependency graph and postprocess a found cycle and further allows a user to apply predefined hook functions in its graphic user interface. Our experience shows that IsoBugView has greatly improved our productivity of analyzing isolation bugs.more » « less
An official website of the United States government

Full Text Available