NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Towards an optimized GROUP by abstraction for large-scale machine learning

https://doi.org/10.14778/3476249.3476284

Li, Side; Kumar, Arun (July 2021, Proceedings of the VLDB Endowment)

Many applications that use large-scale machine learning (ML) increasingly prefer different models for subgroups (e.g., countries) to improve accuracy, fairness, or other desiderata. We call this emerging popular practice learning over groups , analogizing to GROUP BY in SQL, albeit for ML training instead of SQL aggregates. From the systems standpoint, this practice compounds the already data-intensive workload of ML model selection (e.g., hyperparameter tuning). Often, thousands of models may need to be trained, necessitating high-throughput parallel execution. Alas, most ML systems today focus on training one model at a time or at best, parallelizing hyperparameter tuning. This status quo leads to resource wastage, low throughput, and high runtimes. In this work, we take the first step towards enabling and optimizing learning over groups from the data systems standpoint for three popular classes of ML: linear models, neural networks, and gradient-boosted decision trees. Analytically and empirically, we compare standard approaches to execute this workload today: task-parallelism and data-parallelism. We find neither is universally dominant. We put forth a novel hybrid approach we call grouped learning that avoids redundancy in communications and I/O using a novel form of parallel gradient descent we call Gradient Accumulation Parallelism (GAP). We prototype our ideas into a system we call Kingpin built on top of existing ML tools and the flexible massively-parallel runtime Ray. An extensive empirical evaluation on large ML benchmark datasets shows that Kingpin matches or is 4x to 14x faster than state-of-the-art ML systems, including Ray's native execution and PyTorch DDP.
more » « less
Full Text Available
Cerebro: A Layered Data Platform for Scalable Deep Learning

Kumar, Arun; Nakandala, Supun; Zhang, Yuhao; Li, Side; Gemawat, Advitya; Nagrecha, Kabir (January 2021, 11th Annual Conference on Innovative Data Systems Research (CIDR '21))
null (Ed.)
Full Text Available
SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data

https://doi.org/10.1145/3318464.3389777

Shah, Vraj; Li, Side; Kumar, Arun; Saul, Lawrence (July 2020, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data)

Speech-driven querying is becoming popular in new device environments such as smartphones, tablets, and even conversational assistants. However, such querying is largely restricted to natural language. Typed SQL remains the gold standard for sophisticated structured querying although it is painful in many environments, which restricts when and how users consume their data. In this work, we propose to bridge this gap by designing a speech-driven querying system and interface for structured data we call SpeakQL. We support a practically useful subset of regular SQL and allow users to query in any domain with novel touch/speech based human-in-the-loop correction mechanisms. Automatic speech recognition (ASR) introduces myriad forms of errors in transcriptions, presenting us with a technical challenge. We exploit our observations of SQL's properties, its grammar, and the queried database to build a modular architecture. We present the first dataset of spoken SQL queries and a generic approach to generate them for any arbitrary schema. Our experiments show that SpeakQL can automatically correct a large fraction of errors in ASR transcriptions. User studies show that SpeakQL can help users specify SQL queries significantly faster with a speedup of average 2.7x and up to 6.7x compared to typing on a tablet device. SpeakQL also reduces the user effort in specifying queries by a factor of average 10x and up to 60x compared to raw typing effort.
more » « less
Full Text Available
Demonstration of SpeakQL: Speech-driven Multimodal Querying of Structured Data

https://doi.org/10.1145/3299869.3320224

Shah, Vraj; Li, Side; Yang, Kevin; Kumar, Arun; Saul, Lawrence (January 2019, Proceedings of the 2019 International Conference on Management of Data)

In this demonstration, we present SpeakQL, a speech-driven query system and interface for structured data. SpeakQL supports a tractable and practically useful subset of regular SQL, allowing users to query in any domain with unbounded vocabulary with the help of speech/touch based user-in-the-loop mechanisms for correction. When querying in such domains, automatic speech recognition introduces countless forms of errors in transcriptions, presenting us with a technical challenge. We characterize such errors and leverage our observations along with SQL's unambiguous context-free grammar to first correct the query structure. We then exploit phonetic representation of the queried database to identify the correct Literals, hence delivering the corrected transcribed query. In this demo, we show that SpeakQL helps users reduce time and effort in specifying SQL queries significantly. In addition, we show that SpeakQL, unlike Natural Language Interfaces and conversational assistants, allows users to query over any arbitrary database schema. We allow the audience to explore SpeakQL using an easy-to-use web-based interface to compose SQL queries.
more » « less
Full Text Available

Search for: All records