Data‐driven applications are essential to handle the ever‐increasing volume, velocity, and veracity of data generated by sources such as the Web and Internet of Things (IoT) devices. Simultaneously, an event‐driven computational paradigm is emerging as the core of modern systems designed for database queries, data analytics, and on‐demand applications. Modern big data processing runtimes and asynchronous many‐task (AMT) systems from the high performance computing (HPC) community have adopted a dataflow event‐driven model. Services are also increasingly moving to an event‐driven model, composed in the form of Function as a Service (FaaS). An event‐driven runtime designed for data processing consists of well‐understood components such as communication, scheduling, and fault tolerance. The design choices adopted by these components determine the types of applications a system can support efficiently. We find that modern systems are limited to specific sets of applications because they have been designed with fixed choices that cannot be changed easily. In this paper, we present a loosely coupled, component‐based design of a big data toolkit in which each component can have different implementations to support various applications. Such a polymorphic design would allow services and data analytics to be integrated seamlessly and to expand from edge to cloud to HPC environments.
Characterization and Implication of Edge WebAssembly Runtimes
WebAssembly, an emerging bytecode format that was initially developed to partially replace JavaScript and speed up browser applications, has been extended to the server side because of its promise of speed and security. It has been considered a promising alternative to the widely deployed container technique for isolating lightweight applications. To run WebAssembly on the server side, aside from the NodeJS runtime, several WebAssembly native runtimes have been proposed. We characterize major WebAssembly runtimes through extensive applications and metrics. Our results show that different runtimes fit different application scenarios. Based on that, we provide a framework for reducing the startup latency of a WebAssembly service while keeping maximum performance. To identify the root causes of the performance gap, we report a detailed analysis of the emerging Cranelift compiler against LLVM. In addition, this paper gives suggestions and architectural proposals for designing an efficient WebAssembly runtime. Our work provides insights on both WebAssembly runtime enhancement and WebAssembly-based cloud service exploitation.
- PAR ID: 10337553
- Date Published:
- Journal Name: 2021 IEEE 23rd Int Conf on High Performance Computing & Communications
- Page Range / eLocation ID: 71 to 80
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Asynchronous many-task runtimes look promising for the next generation of high performance computing systems. But these runtimes are usually based on new programming models, requiring extensive programmer effort to port existing applications to them. An alternative approach is to reimagine the execution model of widely used programming APIs, such as MPI, in order to execute them more asynchronously. Virtualization is a powerful technique that can be used to execute a bulk synchronous parallel program in an asynchronous manner. Moreover, if the virtualized entities can be migrated between address spaces, the runtime can optimize execution with dynamic load balancing, fault tolerance, and other adaptive techniques. Previous work on automating process virtualization has explored compiler approaches, source-to-source refactoring tools, and runtime methods. These approaches achieve virtualization with different tradeoffs in terms of portability (across different architectures, operating systems, compilers, and linkers), programmer effort required, and the ability to handle all different kinds of global state and programming languages. We implement support for three different related runtime methods, discuss shortcomings and their applicability to user-level virtualized process migration, and compare performance to existing approaches. Compared to existing approaches, one of our new methods achieves what we consider the best overall functionality in terms of portability, automation, support for migration, and runtime performance.
- Serverless, or function-as-a-service, runtimes have shown significant efficiency and cost benefits for event-driven cloud applications. Although serverless runtimes are limited to applications requiring lightweight computation and memory, such as machine learning prediction and inference, they have shown improvements on these applications beyond other cloud runtimes. Training deep learning models can be both compute and memory intensive. We investigate the use of serverless runtimes while leveraging data parallelism for large models, show the challenges and limitations due to the tightly coupled nature of such models, and propose modifications to the underlying runtime implementations that would mitigate them. For hyper-parameter optimization of smaller deep learning models, we show that serverless runtimes can provide significant benefit
- Large-scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented by strings, and suffix arrays are highly efficient data structures for string analysis. However, personal devices and the corresponding exploratory data analysis (EDA) tools cannot handle big data sets that exceed local memory. Arkouda is a framework under early development that brings together the productivity of Python on the user side with the high performance of Chapel on the server side. In this paper, an efficient suffix array data structure design and integration method are given first. Rather than a single suffix array algorithm, a method for integrating a library of suffix array algorithms is presented to enable runtime performance optimization in Arkouda, since different suffix array algorithms may have very different practical performance for strings in various applications. A parallel suffix array construction algorithm framework is given to further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate the feasibility of the provided suffix array integration method and measure the end-to-end performance. Experimental results show that the proposed solution can provide data scientists an easy and efficient method to build suffix arrays with high performance in Python. All our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda/tree/string-suffix-array-functionality).
- Runtimes and applications that rely heavily on asynchronous event notifications suffer when such notifications must traverse several layers of processing in software. Many of these layers necessarily exist in order to support a general-purpose, portable kernel architecture, but they introduce considerable overheads for demanding, high-performance parallel runtimes and applications. Other overheads can arise from a mismatched event programming or system call interface. Whatever the case, the average latency and variance in latency of commonly used software mechanisms for event notifications are abysmal compared to the capabilities of the hardware, which can exhibit orders of magnitude lower latency. We leverage the flexibility and freedom of the previously proposed Hybrid Runtime (HRT) model to explore the construction of low-latency, asynchronous software events uninhibited by interfaces and execution models commonly imposed by general-purpose OSes. We propose several mechanisms in a system we call Nemo, which employs kernel mode-only features to accelerate event notifications by up to 4,000 times, and we provide a detailed evaluation of our implementation using extensive microbenchmarks. We carry out our evaluation both on a modern x64 server and the Intel Xeon Phi. Finally, we propose a small addition to existing interrupt controllers (APICs) that could push the limit of asynchronous events closer to the latency of the hardware cache coherence network.