Edge clouds can provide highly responsive services to end-user devices that require greater compute capability than they possess. However, edge cloud resources such as CPUs and accelerators such as GPUs are limited and must be shared across multiple concurrently running clients, and multiplexing GPUs across applications is challenging. Further, edge servers are likely to process considerable amounts of streaming data. Moving that data from the network stream to the GPU can become a bottleneck, limiting the amount of work the GPU can do. Finally, the lack of prompt notification of job completion from the GPU also leads to ineffective GPU utilization. We propose
a framework that addresses these challenges as follows. We use spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing can increase GPU utilization, the uncontrolled spatial sharing currently available in state-of-the-art systems such as CUDA MPS can cause interference between applications, resulting in unpredictable latency. Our framework instead uses controlled spatial sharing, which limits interference across applications. It offloads data transfer to the GPU's DMA engine, preventing the CPU from becoming a bottleneck when moving data from the network to the GPU. Finally, it uses the CUDA event library to obtain timely, low-overhead notifications of GPU job completion.
Preliminary experiments show that we can achieve low DNN
inference latency and improve DNN inference throughput by a
factor of ∼1.4.
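Two of the mechanisms the abstract relies on, DMA-engine data transfers and event-based completion notification, are standard CUDA runtime features. The following is a minimal sketch of how they compose, not the authors' actual framework; the `infer` kernel and buffer sizes are placeholders for a real DNN workload:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for DNN inference work.
__global__ void infer(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned host buffers let the GPU's DMA engine perform the copies
    // without staging through pageable memory or occupying the CPU.
    cudaMallocHost(&h_in,  n * sizeof(float));
    cudaMallocHost(&h_out, n * sizeof(float));
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // Asynchronous copies are offloaded to the DMA engine; the CPU
    // returns immediately and can handle other clients.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    infer<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // Record an event behind the work and poll it: a timely,
    // low-overhead completion notification without blocking the CPU.
    cudaEventRecord(done, stream);
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        /* CPU is free to service other requests here */
    }

    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
}
```

For the controlled spatial sharing mentioned above, CUDA MPS exposes a per-client limit on the fraction of SMs a process may use (the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable), which is one existing knob for bounding cross-application interference.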