NSF PAR Search | NSF Public Access Repository

Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems

Chen, Lingjiao; Davis, Jared; Hanin, Boris; Bailis, Peter; Zaharia, Matei; Stoica, Ion; Zou, James (September 2024, Advances in neural information processing systems)

Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls -- e.g., when asking the LM to answer each question multiple times and taking a majority vote -- affects such a compound system's performance. In this paper, we initiate the study of scaling properties of compound inference systems. We analyze, theoretically and empirically, how the number of LM calls affects the performance of Vote and Filter-Vote, two of the simplest compound system designs, which aggregate LM responses via majority voting, optionally applying LM filters. We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior can emerge when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LM calls that maximizes system performance, and define an analytical scaling model for both systems. Experiments show that our scaling model can accurately predict the performance of Vote and Filter-Vote systems and thus find the optimal number of LM calls to make.
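The non-monotone behavior described in this abstract can be illustrated with a minimal sketch. It is not the paper's scaling model: it assumes binary answers, independent LM calls, and a task made of exactly two query types, and the names `vote_accuracy` and `task_accuracy` and the values 0.6, 0.85, and 0.4 are hypothetical choices made for illustration only.

```python
from math import comb


def vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k independent calls is correct.

    Assumes binary answers, odd k, and per-call correctness probability p.
    """
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


def task_accuracy(k: int, easy_fraction: float = 0.6,
                  p_easy: float = 0.85, p_hard: float = 0.4) -> float:
    """Expected accuracy on a task mixing easy (p > 0.5) and hard (p < 0.5) queries.

    All three default values are illustrative assumptions.
    """
    return (easy_fraction * vote_accuracy(p_easy, k)
            + (1 - easy_fraction) * vote_accuracy(p_hard, k))


# Evaluate odd numbers of calls so that a strict majority always exists.
ks = list(range(1, 22, 2))
accuracies = {k: task_accuracy(k) for k in ks}
best_k = max(accuracies, key=accuracies.get)

for k in ks:
    print(f"K={k:2d}  accuracy={accuracies[k]:.4f}")
print("best K:", best_k)
```

Under these assumed parameters, accuracy rises for the first few values of K and then falls, because majority voting converges to the right answer on easy queries and to the wrong answer on hard ones.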
Full Text Available
ScenicNL: Generating Probabilistic Scenario Programs from Natural Language

Elmaaroufi, Karim; Shanker, Devan; Cismaru, Ana; Vazquez-Chanlatte, Marcell; Sangiovanni-Vincentelli, Alberto; Zaharia, Matei; Seshia, Sanjit A (September 2024, Proceedings of the First Conference on Language Models (COLM))

Full Text Available
Memory-Efficient Pipeline-Parallel DNN Training

Narayanan, Deepak; Phanishayee, Amar; Shi, Kaiyu; Chen, Xie; Zaharia, Matei (May 2022, Proceedings of Machine Learning Research)

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20x with similar final model accuracy.
Full Text Available
Clamor: Extending Functional Cluster Computing Frameworks with Fine-Grained Remote Memory Access

https://doi.org/10.1145/3472883.3486996

Thaker, Pratiksha; Ayers, Hudson; Raghavan, Deepti; Niu, Ning; Levis, Philip; Zaharia, Matei (November 2021, SoCC '21: Proceedings of the ACM Symposium on Cloud Computing)

We propose Clamor, a functional cluster computing framework that adds support for fine-grained, transparent access to global variables for distributed, data-parallel tasks. Clamor targets workloads that perform sparse accesses and updates within the bulk synchronous parallel execution model, a setting where the standard technique of broadcasting global variables is highly inefficient. Clamor implements a novel dynamic replication mechanism in order to enable efficient access to popular data regions on the fly, and tracks fine-grained dependencies in order to retain the lineage-based fault tolerance model of systems like Spark. Clamor can integrate with existing Rust and C++ libraries to transparently distribute programs on the cluster. We show that Clamor is competitive with Spark in simple functional workloads and can improve performance significantly compared to custom systems on workloads that sparsely access large global variables: from 5x for sparse logistic regression to over 100x on distributed geospatial queries.
Full Text Available
Breakfast of champions: towards zero-copy serialization with NIC scatter-gather

https://doi.org/10.1145/3458336.3465287

Raghavan, Deepti; Levis, Philip; Zaharia, Matei; Zhang, Irene (June 2021, HotOS '21: Proceedings of the Workshop on Hot Topics in Operating Systems)

Microsecond I/O will make data serialization a major bottleneck for datacenter applications. Serialization is fundamentally about data movement: serialization libraries coalesce and flatten in-memory data structures into a single transmittable buffer. CPU-based serialization approaches will hit a performance limit due to data movement overheads and be unable to keep up with modern networks. We observe that widely deployed NICs possess scatter-gather capabilities that can be re-purposed to accelerate serialization's core task of coalescing and flattening in-memory data structures. It is possible to build a completely zero-copy, zero-allocation serialization library with commodity NICs. Doing so introduces many research challenges, including using the hardware capabilities efficiently for a wide variety of non-uniform data structures, making application memory available for zero-copy I/O, and ensuring memory safety.
Full Text Available
Exploiting Proximity Search and Easy Examples to Select Rare Events

Kang, Daniel; Derhacobian, Alex; Tsuji, Kaoru; Hebert, Trevor; Bailis, Peter; Fukami, Tadashi; Hashimoto, Tatsunori; Sun, Yi; Zaharia, Matei (December 2021, NeurIPS Data-Centric AI Workshop 2021)

A common problem practitioners face is to select rare events in a large dataset. Unfortunately, standard techniques ranging from pre-trained models to active learning do not leverage proximity structure present in many datasets and can lead to worse-than-random results. To address this, we propose EZMODE, an algorithm for iterative selection of rare events in large, unlabeled datasets. EZMODE leverages active learning to iteratively train classifiers, but chooses the easiest positive examples to label in contrast to standard uncertainty techniques. EZMODE also leverages proximity structure (e.g., temporal sampling) to find difficult positive examples. We show that EZMODE can outperform baselines by up to 130× on a novel, real-world, 9,000 GB video dataset.
Full Text Available
Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP

https://doi.org/10.1145/3477132.3483588

Narayanan, Deepak; Kazhamiaka, Fiodar; Abuzaid, Firas; Kraft, Peter; Agrawal, Akshay; Kandula, Srikanth; Boyd, Stephen; Zaharia, Matei (October 2021, SOSP 2021)

Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.
Full Text Available
Similarity Search for Efficient Active Learning and Search of Rare Concepts

Coleman, Cody; Chou, Edward; Katz-Samuels, Julian; Culatana, Sean; Bailis, Peter; Berg, Alexander C.; Nowak, Robert; Sumbaly, Roshan; Zaharia, Matei; Yalniz, I. Zeki (January 2022, Proceedings of the AAAI Conference on Artificial Intelligence)

Many active learning and search approaches are intractable for large-scale industrial settings with billions of unlabeled examples. Existing approaches search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In this paper, we improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set instead of scanning over all of the unlabeled data. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a de-identified and aggregated dataset of 10 billion publicly shared images provided by a large internet company. Our approach achieved similar mean average precision and recall as the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, enabling web-scale active learning.
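The candidate-pool restriction described in this abstract can be sketched as follows. This is a toy illustration and not the authors' implementation: the function name `candidate_pool`, the 2-D points, and the brute-force neighbor search are all assumptions. A web-scale system would presumably query a prebuilt similarity index instead of sorting every unlabeled point, which is what this sketch does for simplicity.

```python
from math import dist


def candidate_pool(points, labeled, k):
    """Return indices of unlabeled points among the k nearest neighbors
    of at least one labeled point (brute-force, for illustration only)."""
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(points)) if i not in labeled_set]
    pool = set()
    for j in labeled:
        # Sort unlabeled points by Euclidean distance to labeled point j.
        nearest = sorted(unlabeled, key=lambda i: dist(points[i], points[j]))[:k]
        pool.update(nearest)
    return sorted(pool)


# Hypothetical embeddings: two tight clusters and one distant outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
pool = candidate_pool(points, labeled=[0], k=2)
print(pool)  # [1, 2]
```

A selection strategy would then score only the points in `pool` rather than every unlabeled example, and the pool grows as newly labeled points bring in their own neighbors.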
Full Text Available
Spectral Lower Bounds on the I/O Complexity of Computation Graphs

https://doi.org/10.1145/3350755.3400210

Jain, Saachi; Zaharia, Matei (July 2020, SPAA 2020)
We consider the problem of finding lower bounds on the I/O complexity of arbitrary computations in a two-level memory hierarchy. Executions of complex computations can be formalized as an evaluation order over the underlying computation graph. However, prior methods for finding I/O lower bounds leverage the graph structures for specific problems (e.g., matrix multiplication) which cannot be applied to arbitrary graphs. In this paper, we first present a novel method to bound the I/O of any computation graph using the first few eigenvalues of the graph’s Laplacian. We further extend this bound to the parallel setting. This spectral bound is not only efficiently computable by power iteration, but can also be computed in closed form for graphs with known spectra. We apply our spectral method to compute closed-form analytical bounds on two computation graphs (the Bellman-Held-Karp algorithm for the traveling salesman problem and the Fast Fourier Transform), as well as provide a probabilistic bound for random Erdős–Rényi graphs. We empirically validate our bound on four computation graphs, and find that our method provides tighter bounds than current empirical methods and behaves similarly to previously published I/O bounds.
Full Text Available
Efficient large-scale language model training on GPU clusters using Megatron-LM

https://doi.org/10.1145/3458817.3476209

Narayanan, Deepak; Shoeybi, Mohammad; Casper, Jared; LeGresley, Patrick; Patwary, Mostofa; Korthikanti, Vijay; Vainbrand, Dmitri; Kashinkunti, Prethvi; Bernauer, Julie; Catanzaro, Bryan; et al (November 2021, Supercomputing 2021)

Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).
Full Text Available
