Data‐driven applications are essential to handle the ever‐increasing volume, velocity, and veracity of data generated by sources such as the Web and Internet of Things (IoT) devices. Simultaneously, an event‐driven computational paradigm is emerging as the core of modern systems designed for database queries, data analytics, and on‐demand applications. Modern big data processing runtimes and asynchronous many task (AMT) systems from high performance computing (HPC) community have adopted dataflow event‐driven model. The services are increasingly moving to an event‐driven model in the form of Function as a Service (FaaS) to compose services. An event‐driven runtime designed for data processing consists of well‐understood components such as communication, scheduling, and fault tolerance. Different design choices adopted by these components determine the type of applications a system can support efficiently. We find that modern systems are limited to specific sets of applications because they have been designed with fixed choices that cannot be changed easily. In this paper, we present a loosely coupled component‐based design of a big data toolkit where each component can have different implementations to support various applications. Such a polymorphic design would allow services and data analytics to be integrated seamlessly and expand from edge to cloud to HPC environments.
more » « less- PAR ID:
- 10453261
- Publisher / Repository:
- Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name:
- Concurrency and Computation: Practice and Experience
- Volume:
- 32
- Issue:
- 3
- ISSN:
- 1532-0626
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
As data analytics applications become increasingly important in a wide range of domains, the ability to develop large-scale and sustainable platforms and software infrastructure to support these applications has significant potential to drive research and innovation in both science and business domains. This paper characterizes performance and power-related behavior trends and tradeoffs of the two predominant frameworks for Big Data analytics (i.e., Apache Hadoop and Spark) for a range of representative applications. It also evaluates system design knobs, such as storage and network technologies and power capping techniques. Experimental results from empirical executions provide meaningful data points for exploring the potential of software-defined infrastructure for Big Data processing systems through simulation. The results provide better understanding of the design space to build multi-criteria application-centric models as well as show significant advantages of software-defined infrastructure in terms of execution time, energy and cost. It motivates further research focused on in-memory processing formulations regarding systems with deeper memory hierarchies and software-defined infrastructure.more » « less
-
null (Ed.)The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is to be able to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach for data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail, and reveal how it can be imported as a library to existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, which includes popular AI tools such as PyTorch, Tensorflow, and Jupyter notebooks.more » « less
-
Nesmachnow, S. ; Castro, H. ; Tchernykh, A. (Ed.)Artificial intelligence (AI) is transforming research through analysis of massive datasets and accelerating simulations by factors of up to a billion. Such acceleration eclipses the speedups that were made possible though improvements in CPU process and design and other kinds of algorithmic advances. It sets the stage for a new era of discovery in which previously intractable challenges will become surmountable, with applications in fields such as discovering the causes of cancer and rare diseases, developing effective, affordable drugs, improving food sustainability, developing detailed understanding of environmental factors to support protection of biodiversity, and developing alternative energy sources as a step toward reversing climate change. To succeed, the research community requires a high-performance computational ecosystem that seamlessly and efficiently brings together scalable AI, general-purpose computing, and large-scale data management. The authors, at the Pittsburgh Supercomputing Center (PSC), launched a second-generation computational ecosystem to enable AI-enabled research, bringing together carefully designed systems and groundbreaking technologies to provide at no cost a uniquely capable platform to the research community. It consists of two major systems: Neocortex and Bridges-2. Neocortex embodies a revolutionary processor architecture to vastly shorten the time required for deep learning training, foster greater integration of artificial deep learning with scientific workflows, and accelerate graph analytics. Bridges-2 integrates additional scalable AI, high-performance computing (HPC), and high-performance parallel file systems for simulation, data pre- and post-processing, visualization, and Big Data as a Service. Neocortex and Bridges-2 are integrated to form a tightly coupled and highly flexible ecosystem for AI- and data-driven research.more » « less
-
As large-scale scientific simulations and big data analyses become more popular, it is increasingly more expensive to store huge amounts of raw simulation results to perform post-analysis. To minimize the expensive data I/O, “in-situ” analysis is a promising approach, where data analysis applications analyze the simulation generated data on the fly without storing it first. However, it is challenging to organize, transform, and transport data at scales between two semantically different ecosystems due to the distinct software and hardware difference. To tackle these challenges, we design and implement the X-Composer framework. X-Composer connects cross-ecosystem applications to form an “in-situ” scientific workflow, and provides a unified approach and recipe for supporting such hybrid in-situ workflows on distributed heterogeneous resources. X-Composer reorganizes simulation data as continuous data streams and feeds them seamlessly into the Cloud-based stream processing services to minimize I/O overheads. For evaluation, we use X-Composer to set up and execute a cross-ecosystem workflow, which consists of a parallel Computational Fluid Dynamics simulation running on HPC, and a distributed Dynamic Mode Decomposition analysis application running on Cloud. Our experimental results show that X-Composer can seamlessly couple HPC and Big Data jobs in their own native environments, achieve good scalability, and provide high-fidelity analytics for ongoing simulations in real-time.more » « less
-
Public transit is a critical component of a smart and connected community. As such, citizens expect and require accurate information about real-time arrival/departures of transportation assets. As transit agencies enable large-scale integration of real-time sensors and support back-end data-driven decision support systems, the dynamic data-driven applications systems (DDDAS) paradigm becomes a promising approach to make the system smarter by providing online model learning and multi-time scale analytics as part of the decision support system that is used in the DDDAS feedback loop. In this paper, we describe a system in use in Nashville and illustrate the analytic methods developed by our team. These methods use both historical as well as real-time streaming data for online bus arrival prediction. The historical data is used to build classifiers that enable us to create expected performance models as well as identify anomalies. These classifiers can be used to provide schedule adjustment feedback to the metro transit authority. We also show how these analytics services can be packaged into modular, distributed and resilient micro-services that can be deployed on both cloud back ends as well as edge computing resources.more » « less