Abstract Twister2 is an open‐source big data hosting environment designed to process both batch and streaming data at scale. Twister2 runs jobs in both high‐performance computing (HPC) and big data clusters. It provides a cross‐platform resource scheduler to run jobs in diverse environments. Twister2 is designed with a layered architecture to support various clusters and big data problems. In this paper, we present the cross‐platform resource scheduler of Twister2. We identify required services and explain implementation details. We present job startup delays for single jobs and multiple concurrent jobs in Kubernetes and OpenMPI clusters. We compare job startup delays for Twister2 and Spark at a Kubernetes cluster. In addition, we compare the performance of terasort algorithm on Kubernetes and bare metal clusters at AWS cloud.
more »
« less
Graph Analytics: Complexity, Scalability, and Architectures
Big Data as expressed as “Big Graphs” are growing in importance. Looking forward, there is also increasing interest in streaming versions of the associated analytics. This paper develops an initial template for the relationship between “traditional” batch graph problems, and streaming forms. Variations of streaming problems are discussed, along with their relationship to existing benchmarks. Also included is a discussion of classes of parallel architectures (including newly emerging ones) and how such kernels are liable to scale on them. Preliminary projections for some of these systems is presented.
more »
« less
- Award ID(s):
- 1642280
- PAR ID:
- 10026001
- Date Published:
- Journal Name:
- HPBDC Workshop at Int. Parallel and Dist. Processing Conf.
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
A longstanding observation, which was partially proven in \cite{LNW14,AHLW16}, is that any turnstile streaming algorithm can be implemented as a linear sketch (the reverse is trivially true). We study the relationship between turnstile streaming and linear sketching algorithms in more detail, giving both new separations and new equivalences between the two models. It was shown in \cite{LNW14} that, if a turnstile algorithm works for arbitrarily long streams with arbitrarily large coordinates at intermediate stages of the stream, then the turnstile algorithm is equivalent to a linear sketch. We show separations of the opposite form: if either the stream length or the maximum value of the stream are substantially restricted, there exist problems where linear sketching is exponentially harder than turnstile streaming. A further limitation of the \cite{LNW14} equivalence is that the turnstile sketching algorithm is neither explicit nor uniform, but requires an exponentially long advice string. We show how to remove this limitation for deterministic streaming algorithms: we give an explicit small-space algorithm that takes the streaming algorithm and computes an equivalent module.more » « less
-
Santhanam, Rahul (Ed.)The following question arises naturally in the study of graph streaming algorithms: Is there any graph problem which is "not too hard", in that it can be solved efficiently with total communication (nearly) linear in the number n of vertices, and for which, nonetheless, any streaming algorithm with Õ(n) space (i.e., a semi-streaming algorithm) needs a polynomial n^Ω(1) number of passes? Assadi, Chen, and Khanna [STOC 2019] were the first to prove that this is indeed the case. However, the lower bounds that they obtained are for rather non-standard graph problems. Our first main contribution is to present the first polynomial-pass lower bounds for natural "not too hard" graph problems studied previously in the streaming model: k-cores and degeneracy. We devise a novel communication protocol for both problems with near-linear communication, thus showing that k-cores and degeneracy are natural examples of "not too hard" problems. Indeed, previous work have developed single-pass semi-streaming algorithms for approximating these problems. In contrast, we prove that any semi-streaming algorithm for exactly solving these problems requires (almost) Ω(n^{1/3}) passes. The lower bound follows by a reduction from a generalization of the hidden pointer chasing (HPC) problem of Assadi, Chen, and Khanna, which is also the basis of their earlier semi-streaming lower bounds. Our second main contribution is improved round-communication lower bounds for the underlying communication problems at the basis of these reductions: - We improve the previous lower bound of Assadi, Chen, and Khanna for HPC to achieve optimal bounds for this problem. - We further observe that all current reductions from HPC can also work with a generalized version of this problem that we call MultiHPC, and prove an even stronger and optimal lower bound for this generalization. These two results collectively allow us to improve the resulting pass lower bounds for semi-streaming algorithms by a polynomial factor, namely, from n^{1/5} to n^{1/3} passes.more » « less
-
For a growing class of prediction problems, big data and machine learning analyses can greatly enhance our understanding of the effectiveness of public investments and public policy. However, the outputs of many machine learning models are often abstract and inaccessible to policy communities or the general public. In this article, we describe a hands-on teaching case that is suitable for use in a graduate or advanced undergraduate public policy, public affairs or environmental studies classroom. Students will engage on the use of increasingly popular machine learning classification algorithms and cloud-based data visualization tools to support policy and planning on the theme of electric vehicle mobility and connected infrastructure. By using these tools, students will critically evaluate and convert large and complex datasets into human understandable visualization for communication and decision-making. The tools also enable user flexibility to engage with streaming data sources in a new creative design with little technical background.more » « less
-
Abstract In large-scale applications including medical imaging, collocation differential equation solvers, and estimation with differential privacy, the underlying linear inverse problem can be reformulated as a streaming problem. In theory, the streaming problem can be effectively solved using memory-efficient, exponentially-converging streaming solvers. In special cases when the underlying linear inverse problem is finite-dimensional, streaming solvers can periodically evaluate the residual norm at a substantial computational cost. When the underlying system is infinite dimensional, streaming solver can only access noisy estimates of the residual. While such noisy estimates are computationally efficient, they are useful only when their accuracy is known. In this work, we rigorously develop a general family of computationally-practical residual estimators and their uncertainty sets for streaming solvers, and we demonstrate the accuracy of our methods on a number of large-scale linear problems. Thus, we further enable the practical use of streaming solvers for important classes of linear inverse problems.more » « less
An official website of the United States government

