Recent trends towards large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction can provide many optimizations to improve the performance of distributed workloads. However, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone. Therefore, we present CoCoNet, which contains (i) a domain-specific language to express a distributed machine learning program in the form of computation and communication operations, (ii) a set of semantics-preserving transformations to optimize the program, and (iii) a compiler to generate jointly optimized communication and computation GPU kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNet enabled us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Our experiments show that CoCoNet significantly outperforms state-of-the-art distributed machine learning implementations.
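For illustration, here is a minimal sketch of the kind of communication/computation overlap that CoCoNet derives automatically. It is written against plain PyTorch primitives rather than CoCoNet's own DSL; the function and parameter names (overlapped_allreduce_step, grad_chunks, lr) are illustrative only, and a torch.distributed process group is assumed to be initialized already.

```python
# Minimal sketch (not CoCoNet's API): overlap gradient all-reduce with the
# parameter update, chunk by chunk, using asynchronous collectives.
import torch.distributed as dist

def overlapped_allreduce_step(grad_chunks, param_chunks, lr=1e-3):
    # Launch asynchronous all-reduces for every gradient chunk up front.
    handles = [dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
               for g in grad_chunks]
    world = dist.get_world_size()
    # While later chunks are still in flight, update parameters whose
    # gradients have already been reduced (communication/computation overlap).
    for g, p, h in zip(grad_chunks, param_chunks, handles):
        h.wait()                   # blocks only on this chunk's reduction
        g.div_(world)              # average across ranks
        p.data.add_(g, alpha=-lr)  # elementwise SGD-style update
```

In CoCoNet, such interleavings, along with fusion across the computation/communication boundary, are generated by the compiler from a high-level program rather than hand-written as above.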
Fast and Scalable 2D Convolutions and Cross-correlations for Processing Image Databases and Videos on CPUs
The dominant use of Convolutional Neural Networks (CNNs) in several image and video analysis tasks necessitates a careful re-evaluation of the underlying software libraries for computing them for large-scale image and video databases. We focus our attention on developing methods that can be applied to large image databases or videos of large image sizes. We develop a method that maximizes throughput through the use of vector-based memory I/O and optimized 2D FFT libraries that run on all available physical cores. We also show how to decompose arbitrarily large images into smaller, optimal blocks that can be effectively processed through the use of overlap-and-add. Our approach outperforms TensorFlow for 5x5 kernels and significantly outperforms TensorFlow for 11x11 kernels.
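As a rough illustration of the block-wise FFT strategy (using stock SciPy routines, not the authors' optimized library), the following sketch compares a single whole-image FFT convolution against an overlap-and-add implementation; the image and kernel sizes are arbitrary.

```python
# Sketch of FFT-based 2D convolution vs. overlap-and-add block processing,
# using off-the-shelf NumPy/SciPy routines. Sizes below are illustrative only.
import numpy as np
from scipy.signal import fftconvolve, oaconvolve

rng = np.random.default_rng(0)
image = rng.random((4096, 4096))   # large greyscale image
kernel = rng.random((11, 11))      # e.g. an 11x11 filter

# One large FFT over the entire image.
whole = fftconvolve(image, kernel, mode="same")

# Overlap-and-add: the image is split into smaller blocks, each block is
# convolved with an appropriately sized FFT, and overlapping tails are summed.
# This keeps transform sizes small for arbitrarily large inputs.
blocked = oaconvolve(image, kernel, mode="same")

print(np.max(np.abs(whole - blocked)))   # agreement up to floating-point error
```

The paper's contribution lies in choosing the block decomposition, vectorized memory I/O, and threading across all physical cores; oaconvolve here only illustrates the overlap-and-add decomposition itself.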
- Award ID(s): 1842220
- PAR ID: 10184957
- Date Published:
- Journal Name: 2020 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI)
- Page Range / eLocation ID: 70 to 73
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Modern scientific workflows couple simulations with AI-powered analytics by frequently exchanging data, both to accelerate time-to-science and to reduce the complexity of the simulation planes. However, this data exchange is limited in performance and portability due to a lack of support for scientific data formats in AI frameworks. We need a cohesive mechanism to effectively integrate, at scale, complex scientific data formats such as HDF5, PnetCDF, ADIOS2, GNCF, and Silo into popular AI frameworks such as TensorFlow, PyTorch, and Caffe. To this end, we designed Stimulus, a data management library for ingesting scientific data effectively into the popular AI frameworks. We utilize the StimOps functions along with the StimPack abstraction to enable the integration of scientific data formats with any AI framework. The evaluations show that Stimulus outperforms the existing data ingestion of several large-scale applications with different use cases, such as Cosmic Tagger (consuming an HDF5 dataset in PyTorch), Distributed FFN (consuming an HDF5 dataset in TensorFlow), and CosmoFlow (converting HDF5 into TFRecord and then consuming that in TensorFlow), by 5.3x, 2.9x, and 1.9x respectively, with ideal I/O scalability up to 768 GPUs on the Summit supercomputer. Through Stimulus, we can portably extend existing popular AI frameworks to cohesively support any complex scientific data format and efficiently scale the applications on large-scale supercomputers.
-
We present Atos, a dynamic scheduling framework for multi-node-GPU systems that supports PGAS-style lightweight one-sided memory operations within and between nodes. Atos's lightweight GPU-to-GPU communication enables latency hiding and can smooth interconnect usage for bisection-limited problems. These benefits are significant for dynamic, irregular applications that often involve fine-grained communication at unpredictable times and without predetermined patterns. Atos follows several principles for high performance: (1) do not involve the CPU in the communication control path; (2) allow GPU communication within kernels, addressing memory consistency directly rather than relying on synchronization with the CPU; (3) perform dynamic communication aggregation when interconnects have limited bandwidth. By lowering the overhead of communication and allowing it within GPU kernels, we support large, high-utilization GPU kernels with more frequent communication. We evaluate Atos on two irregular problems, Breadth-First Search and PageRank, and show that it outperforms the state-of-the-art graph libraries Gunrock, Groute, and Galois in both single-node multi-GPU and multi-node-GPU settings.
-
It is a common practice to think of a video as a sequence of images (frames) and to re-use deep neural network models trained only on images for similar analytics tasks on videos. In this paper, we show that this “leap of faith”, that deep learning models which work well on images will also work well on videos, is actually flawed. We show that even when a video camera is viewing a scene that is not changing in any human-perceptible way, and we control for external factors like video compression and environment (lighting), the accuracy of video analytics applications fluctuates noticeably. These fluctuations occur because successive frames produced by the video camera may look similar visually but are perceived quite differently by the video analytics applications. We observed that the root cause of these fluctuations is the dynamic camera parameter changes that a video camera automatically makes in order to capture and produce a visually pleasing video. The camera inadvertently acts as an “unintentional adversary” because these slight changes in the image pixel values in consecutive frames, as we show, have a noticeably adverse impact on the accuracy of insights from video analytics tasks that re-use image-trained deep learning models. To address this inadvertent adversarial effect from the camera, we explore the use of transfer learning techniques to improve learning in video analytics tasks through the transfer of knowledge from learning on image analytics tasks. Our experiments with a number of different cameras, and a variety of different video analytics tasks, show that the inadvertent adversarial effect from the camera can be noticeably offset by quickly re-training the deep learning models using transfer learning. In particular, we show that our newly trained Yolov5 model reduces fluctuation in object detection across frames, which leads to better tracking of objects (∼40% fewer mistakes in tracking). Our paper also provides new directions and techniques to mitigate the camera's adversarial effect on deep learning models used for video analytics applications.
-
Chemical mixtures can be leveraged to store large amounts of data in a highly compact form and have the potential for massive scalability owing to the use of large-scale molecular libraries. With the parallelism that comes from having many species available, chemical-based memory can also provide the physical substrate for computation with increased throughput. Here, we represent non-binary matrices in chemical solutions and perform multiple matrix multiplications and additions, in parallel, using chemical reactions. As a case study, we demonstrate image processing, in which small greyscale images are encoded in chemical mixtures and kernel-based convolutions are performed using phenol acetylation reactions. In these experiments, we use the measured concentrations of reaction products (phenyl acetates) to reconstruct the output image. In addition, we establish the chemical criteria required to realize chemical image processing and validate reaction-based multiplication. Most importantly, this work shows that fundamental arithmetic operations can be reliably carried out with chemical reactions. Our approach could serve as a basis for developing more advanced chemical computing architectures.
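To make the arithmetic concrete, here is a small numerical sketch (ours, not the authors' code or data) of the multiply-accumulate operations that the reaction-based scheme carries out in chemistry; the pixel intensities and kernel weights are invented for demonstration.

```python
# Illustrative sketch of the multiply-accumulate arithmetic behind
# reaction-based convolution: each output pixel is a weighted sum of input
# "concentrations". All values below are made up.
import numpy as np

image = np.array([[0.1, 0.4, 0.2],
                  [0.3, 0.9, 0.5],
                  [0.0, 0.6, 0.8]])      # greyscale intensities encoded as concentrations
kernel = np.array([[0.00, 0.25, 0.00],
                   [0.25, 0.00, 0.25],
                   [0.00, 0.25, 0.00]])  # non-binary kernel weights

# One output pixel of the convolution is one weighted sum, i.e. the value the
# measured reaction-product concentrations would encode.
output_pixel = float(np.sum(image * kernel))
print(output_pixel)   # 0.25 * (0.4 + 0.3 + 0.5 + 0.6) = 0.45
```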