EIF: A Mediated Pass-Through Framework for Inference as a Service

Gao, Yiming; Wang, Zhen; Wu, Weili; Lam, Herman

doi:10.1109/ISPA-BDCloud-SocialCom-SustainCom59178.2023.00111

Description / Abstract: To provide Inference-as-a-Service (INaaS) effectively for AI applications in resource-limited cloud environments, two major challenges must be overcome: achieving low latency and providing multi-tenancy. This paper presents EIF (Efficient INaaS Framework), which uses a heterogeneous CPU-FPGA architecture to address these challenges with three methods: (1) spatial multiplexing via software-hardware co-design virtualization techniques, (2) temporal multiplexing that exploits the sparsity of neural-network models, and (3) streaming-mode inference, which overlaps data transfer and computation. The EIF prototype is implemented on an Intel PAC (shared-memory CPU-FPGA) platform. For evaluation, 12 types of DNN models of different sizes and sparsity levels were used as benchmarks. Based on these experiments, we show that in EIF the temporal multiplexing technique can improve the user density of an AI Accelerator Unit from 2× to 6×, with marginal performance degradation. In the prototype system, the spatial multiplexing technique supports eight AI Accelerator Units on one FPGA. By using a streaming mode based on a Mediated Pass-Through architecture, EIF can overcome the FPGA on-chip memory limitation to improve multi-tenancy and optimize the latency of INaaS. To further enhance INaaS, EIF uses the MapReduce function to provide more flexible QoS. Together with the temporal and spatial multiplexing techniques, EIF can support 48 users simultaneously on a single FPGA board in our prototype system. In all tested benchmarks, cold-start latency accounts for only approximately 5% of the total response time.
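The 48-user figure is consistent with combining the two multiplexing results quoted in the abstract. The short sketch below only illustrates that arithmetic. It assumes that the upper end of the temporal multiplexing range corresponds to six users sharing one AI Accelerator Unit, and that all eight units are shared at that same density; the abstract does not state this breakdown explicitly, and the function and constant names are hypothetical.

```python
# Figures quoted in the abstract; combining them by multiplication is an
# assumption made for illustration, not a statement of the authors' method.
ACCELERATOR_UNITS_PER_FPGA = 8  # spatial multiplexing in the prototype
MAX_USERS_PER_UNIT = 6          # assumed upper end of temporal multiplexing


def max_concurrent_users(units: int, users_per_unit: int) -> int:
    """Users served at once if every unit is time-shared at the same density."""
    return units * users_per_unit


if __name__ == "__main__":
    total = max_concurrent_users(ACCELERATOR_UNITS_PER_FPGA, MAX_USERS_PER_UNIT)
    print(total)  # 48
```

Under these assumptions the product matches the 48 simultaneous users reported for a single FPGA board.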
