skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 10:00 PM ET on Friday, February 6 until 10:00 AM ET on Saturday, February 7 due to maintenance. We apologize for the inconvenience.


Title: Telekine: Secure Computing with Cloud GPUs
GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real side-channel concerns. To illustrate the severity of the problem, we demonstrate a novel attack that enables an attacker to correctly classify images from ImageNet by observing only the timing of GPU kernel execution, rather than the images themselves. Telekine enables applications to use GPU acceleration in the cloud securely, based on a novel GPU stream abstraction that ensures execution and interaction through untrusted components are independent of any secret data. Given a GPU with support for a TEE, Telekine employs a novel variant of API remoting to partition application-level software into components to ensure secret-dependent behaviors occur only on trusted components. Telekine can securely train modern image recognition models on MXNet with 10%–22% performance penalty relative to an insecure baseline with a locally attached GPU. It runs graph algorithms using Galois on one and two GPUs with 18%–41% overhead.  more » « less
Award ID(s):
1900457
PAR ID:
10188585
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
17th USENIX Symposium on Networked Systems Design and Implementation
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Because FPGAs outperform traditional processing cores like CPUs and GPUs in terms of performance per watt and flexibility, they are being used more and more in cloud and data center applications. There are growing worries about the security risks posed by multi-tenant sharing as the demand for hardware acceleration increases and gradually gives way to FPGA multi-tenancy in the cloud. The confidentiality, integrity, and availability of FPGA-accelerated applications may be compromised if space-shared FPGAs are made available to many cloud tenants. We propose a root of trust-based trusted execution mechanism called TrustToken to prevent harmful software-level attackers from getting unauthorized access and jeopardizing security. With safe key creation and truly random sources, TrustToken creates a security block that serves as the foundation of trust-based IP security. By offering crucial security characteristics, such as secure, isolated execution and trusted user interaction, TrustToken only permits trustworthy connection between the non-trusted third-party IP and the rest of the SoC environment. The suggested approach does this by connecting the third-party IP interface to the TrustToken Controller and running run-time checks on the correctness of the IP authorization(Token) signals. With an emphasis on software-based assaults targeting unauthorized access and information leakage, we offer a noble hardware/software architecture for trusted execution in FPGA-accelerated clouds and data centers. 
    more » « less
  2. The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile system-on-a-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (e.g., Qualcomm Adreno GPUs) usually have a 2.5D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a non-well-documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern(s), tiling size(s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on a mobile GPU, and compared against both popular mobile DNN frameworks and another GPU performance model. Evaluation results demonstrate that TMModel outperforms all baselines, achieving 1.48 − 3.61× speedup on individual kernels and 1.83 − 66.1× speedup for end-to-end on-device training with only 0.25% − 18.5% the tuning cost of the baselines. 
    more » « less
  3. The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile system-ona-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (eg, Qualcomm Adreno GPUs) usually have a 2.5 D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5 D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a non-well-documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern (s), tiling size (s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on mobile GPU. 
    more » « less
  4. The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile system-ona-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (eg, Qualcomm Adreno GPUs) usually have a 2.5 D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5 D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a non-well-documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern (s), tiling size (s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on mobile GPU. 
    more » « less
  5. Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work. 
    more » « less