Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
A common problem practitioners face is to select rare events in a large dataset. Unfortunately, standard techniques ranging from pre-trained models to active learning do not leverage proximity structure present in many datasets and can lead to worse-than-random results. To address this, we propose EZMODE, an algorithm for iterative selection of rare events in large, unlabeled datasets. EZMODE leverages active learning to iteratively train classifiers, but chooses the easiest positive examples to label in contrast to standard uncertainty techniques. EZMODE also leverages proximity structure (e.g., temporal sampling) to find difficult positive examples. We show that EZMODE can outperform baselines bymore »Free, publicly-accessible full text available December 14, 2022
-
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multiuser objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems and then systematically transforms thesemore »
-
We consider the problem of finding lower bounds on the I/O complexity of arbitrary computations in a two level memory hierarchy. Executions of complex computations can be formalized as an evaluation order over the underlying computation graph. However, prior methods for finding I/O lower bounds leverage the graph structures for specific problems (e.g matrix multiplication) which cannot be applied to arbitrary graphs. In this paper, we first present a novel method to bound the I/O of any computation graph using the first few eigenvalues of the graph’s Laplacian. We further extend this bound to the parallel setting. This spectral boundmore »
-
Cloud providers offer instances with similar compute capabilities (for example, instances with different generations of GPUs like K80s, P100s, V100s) across many regions, availability zones, and on-demand and spot markets, with prices governed independently by individual supplies and demands. In this paper, using machine learning model training as an example application, we explore the potential cost reductions possible by leveraging this cross-cloud instance market. We present quantitative results on how the prices of cloud instances change with time, and how total costs can be decreased by considering this dynamic pricing market. Our preliminary experiments show that a) the optimal instancemore »
-
We present POSH, a framework that accelerates shell applications with I/O-heavy components, such as data analytics with command-line utilities. Remote storage such as networked filesystems can severely limit the performance of these applications: data makes a round trip over the network for relatively little computation at the client. Reducing the data movement by moving the code to the data can improve performance. POSH automatically optimizes unmodified I/O-intensive shell applications running over remote storage by offloading the I/O-intensive portions to proxy servers closer to the data. A proxy can run directly on a storage server, or on a machine closer tomore »
-
As specialized hardware accelerators such as GPUs become increasingly popular, developers are looking for ways to target these platforms with high-level APIs. One promising approach is kernel libraries such as PyTorch or cuML, which provide interfaces that mirror CPU-only counterparts such as NumPy or Scikit-Learn. Unfortunately, these libraries are hard to develop and to adopt incrementally: they only support a subset of their CPU equivalents, only work with datasets that fit in device memory, and require developers to reason about data placement and transfers manually. To address these shortcomings, we present a new approach called offload annotations (OAs) that enablesmore »
-
We present Fleet, a framework that offers a massively parallel streaming model for FPGAs and is effective in a number of domains well-suited for FPGA acceleration, including parsing, compression, and machine learning. Fleet requires the user to specify RTL for a processing unit that serially processes every input token in a stream, a far simpler task than writing a parallel processing unit. It then takes the user’s processing unit and generates a hardware design with many copies of the unit as well as memory controllers to feed the units with separate streams and drain their outputs. Fleet includes a Chisel-basedmore »
-
Systems for ML inference are widely deployed today, but they typically optimize ML inference workloads using techniques designed for conventional data serving workloads and miss critical opportunities to leverage the statistical nature of ML. In this paper, we present WILLUMP, an optimizer for ML inference that introduces two statistically-motivated optimizations targeting ML applications whose performance bottleneck is feature computation. First, WILLUMP automatically cascades feature computation for classification queries: WILLUMP classifies most data inputs using only high-value, low-cost features selected through empirical observations of ML model performance, improving query performance by up to 5× without statistically significant accuracy loss. Second, WILLUMPmore »
-
Data movement between main memory and the CPU is a major bottleneck in parallel data-intensive applications. In response, researchers have proposed using compilers and intermediate representations (IRs) that apply optimizations such as loop fusion under existing high-level APIs such as NumPy and TensorFlow. Even though these techniques generally do not require changes to user applications, they require intrusive changes to the library itself: often, library developers must rewrite each function using a new IR. In this paper, we propose a new technique called split annotations (SAs) that enables key data movement optimizations over unmodified library functions. SAs only require developersmore »
-
Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach misses possible graph optimizations and is difficult to scale, as new DNN operators are introduced on a regular basis. We propose TASO, the first DNN computation graph optimizer that automatically generates graph substitutions. TASO takes as input a list of operator specifications and generates candidate substitutions using the given operators as basic building blocks. All generated substitutions are formally verified against the operator specifications using an automated theorem prover. To optimize a given DNN computation graph,more »