 Award ID(s):
 1816577
 NSFPAR ID:
 10178062
 Date Published:
 Journal Name:
10th Annual Conference on Innovative Data Systems Research (CIDR '20)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation

The analysis of massive datasets requires a large number of processors. Prior research has largely assumed that tracking the actual data distribution and the underlying network structure of a cluster, which we collectively refer to as the topology, comes with a high cost and has little practical benefit. As a result, theoretical models, algorithms, and systems often assume a uniform topology; however, this assumption rarely holds in practice. This necessitates an end-to-end investigation of how one can model, design, and deploy topology-aware algorithms for fundamental data processing tasks at large scale. To achieve this goal, we first develop a theoretical parallel model that can jointly capture the cost of computation and communication. Using this model, we explore algorithms with theoretical guarantees for three basic tasks: aggregation, join, and sorting. Finally, we consider the practical aspects of implementing topology-aware algorithms at scale, and show that they have the potential to be orders of magnitude faster than their topology-oblivious counterparts.
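The gap between topology-aware and topology-oblivious aggregation can be illustrated with a toy cost model. The sketch below is not the paper's model; the two-level rack topology, the link costs, and the cost formulas are invented for illustration. It contrasts shipping every partial aggregate to one coordinator against combining within each rack first.

```python
# Hypothetical sketch: communication cost of flat (topology-oblivious)
# aggregation vs. rack-local pre-aggregation on a two-level tree topology.
# Link costs and the cost model are illustrative assumptions, not the paper's.

def flat_cost(racks, per_node_bytes, cross_rack_cost=10, intra_rack_cost=1):
    """Every node ships its partial aggregate straight to one coordinator."""
    total = 0
    for rack_id, nodes in enumerate(racks):
        for _ in range(nodes):
            # nodes in rack 0 (the coordinator's rack) pay the cheap link
            link = intra_rack_cost if rack_id == 0 else cross_rack_cost
            total += per_node_bytes * link
    return total

def topology_aware_cost(racks, per_node_bytes,
                        cross_rack_cost=10, intra_rack_cost=1):
    """Combine within each rack first, then send one value per rack."""
    total = 0
    for rack_id, nodes in enumerate(racks):
        total += (nodes - 1) * per_node_bytes * intra_rack_cost  # local combine
        if rack_id != 0:
            total += per_node_bytes * cross_rack_cost  # one cross-rack transfer
    return total

racks = [16, 16, 16, 16]  # four racks of 16 nodes
print(flat_cost(racks, 1), topology_aware_cost(racks, 1))  # 496 vs 90
```

Under this made-up cost model the rack-aware plan crosses each expensive link once per rack instead of once per node, which is the flavor of saving the abstract refers to.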

Chambers, Erin W.; Gudmundsson, Joachim (Eds.)
Datasets with nontrivial large-scale topology can be hard to embed in low-dimensional Euclidean space with existing dimensionality reduction algorithms. We propose to model topologically complex datasets using vector bundles, in such a way that the base space accounts for the large-scale topology, while the fibers account for the local geometry. This allows one to reduce the dimensionality of the fibers while preserving the large-scale topology. We formalize this point of view and, as an application, describe a dimensionality reduction algorithm based on topological inference for vector bundles. The algorithm takes as input a dataset together with an initial representation in Euclidean space, assumed to recover part of its large-scale topology, and outputs a new representation that integrates local representations obtained through local linear dimensionality reduction. We demonstrate this algorithm on examples coming from dynamical systems and chemistry. In these examples, our algorithm is able to learn topologically faithful embeddings of the data in a lower target dimension than various well-known metric-based dimensionality reduction algorithms.
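The "local linear dimensionality reduction" step the abstract mentions can be sketched with per-neighborhood PCA. This is not the authors' algorithm; the dataset (a circle in R^3), the neighborhood size, and the SVD-based reduction are illustrative assumptions showing how a locally one-dimensional but globally circular dataset looks linear in each neighborhood.

```python
# Illustrative sketch (not the paper's method): local linear dimensionality
# reduction via per-neighborhood PCA, the kind of local step a vector-bundle
# embedding would integrate. Data and parameters are made up.
import numpy as np

def local_pca(points, center_idx, k=8, target_dim=1):
    """Project a point's k-nearest neighborhood onto its top principal axes."""
    diffs = points - points[center_idx]
    nbrs = points[np.argsort((diffs ** 2).sum(axis=1))[:k]]
    centered = nbrs - nbrs.mean(axis=0)
    # SVD of the centered neighborhood gives the local tangent directions
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T, s

# A circle in the xy-plane of R^3: locally one-dimensional, globally a loop.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)
coords, spectrum = local_pca(circle, center_idx=0)
# Nearly all local variance sits in the first principal direction
print(spectrum[0] / spectrum.sum())
```

The local coordinates are faithful fiber-wise, while the loop itself is exactly the large-scale structure a global, topology-preserving step would have to retain.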

Inspired by the recent achievements of machine learning in diverse domains, data-driven metamaterials design has emerged as a compelling paradigm that can unlock the potential of multiscale architectures. The model-centric research trend, however, lacks principled frameworks dedicated to data acquisition, whose quality propagates into the downstream tasks. Built by naive space-filling design in shape-descriptor space, metamaterial datasets suffer from property distributions that are either highly imbalanced or at odds with the design tasks of interest. To this end, we present t-METASET: an active-learning-based data acquisition framework aiming to guide both balanced and task-aware data generation. Uniquely, we seek a solution to a commonplace yet frequently overlooked scenario at the early stages of data-driven design: when a massive shape-only library has been prepared with no properties evaluated. The key idea is to harness a data-driven shape descriptor learned from generative models, fit a sparse regressor as a start-up agent, and leverage diversity-related metrics to drive data acquisition toward areas that help designers fulfill design goals. We validate the proposed framework in three deployment cases, which encompass general use, task-specific use, and tailorable use. Two large-scale mechanical metamaterial datasets (~O(10^4)) are used to demonstrate the efficacy. Applicable to general design representations, t-METASET can boost future advancements in data-driven design.
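One common way to "leverage metrics related to diversity" in acquisition is greedy max-min selection in descriptor space. The sketch below is a generic illustration of that idea, not t-METASET itself; the one-dimensional descriptor library and the greedy rule are invented.

```python
# Hedged sketch of diversity-driven acquisition (generic, not t-METASET):
# given a shape-only library with no properties evaluated yet, greedily pick
# the next design farthest in descriptor space from everything chosen so far.

def greedy_diverse_batch(descriptors, batch_size):
    """Greedy max-min selection: each pick maximizes distance to chosen set."""
    chosen = [0]  # seed with an arbitrary first design
    while len(chosen) < batch_size:
        best, best_gap = None, -1.0
        for i, d in enumerate(descriptors):
            if i in chosen:
                continue
            gap = min(abs(d - descriptors[j]) for j in chosen)  # 1-D distance
            if gap > best_gap:
                best, best_gap = i, gap
        chosen.append(best)
    return chosen

# A 1-D descriptor library clustered near 0.0, with a lone outlier at 1.0:
library = [0.0, 0.05, 0.1, 0.52, 1.0]
print(greedy_diverse_batch(library, 3))  # picks the outlier, then the middle
```

The outlier is selected immediately, which is how diversity-driven acquisition counteracts the imbalanced coverage a naive space-filling sample can produce.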

In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance, and can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large-scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel, independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow-level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On a large-scale network where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes with accuracy within 9% for tail flow completion times.
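The decomposition idea can be sketched in miniature: obtain a delay distribution for each link in isolation, then combine the links along a path to estimate an end-to-end tail. The per-link delay model below is an invented stand-in (base delay plus load-dependent exponential queueing), not output of ns-3 or the paper's technique; the combination is done by simple Monte Carlo sampling rather than the paper's careful convolution.

```python
# Minimal sketch of link-level decomposition: simulate each link separately,
# then combine independent per-link samples along a path to estimate an
# end-to-end tail. The delay model is an invented illustrative assumption.
import random

random.seed(0)

def sample_link_delay(base_us, load):
    """Toy per-link model: base delay plus load-dependent exponential queueing."""
    return base_us + random.expovariate(1.0 / (base_us * load))

def end_to_end_p99(path, trials=20000):
    """Sum one sample per hop, repeat, and read off the 99th percentile."""
    samples = sorted(sum(sample_link_delay(b, l) for b, l in path)
                     for _ in range(trials))
    return samples[int(0.99 * trials)]

path = [(10.0, 0.5), (10.0, 0.9), (10.0, 0.5)]  # (base_us, load) per hop
p99 = end_to_end_p99(path)
print(round(p99, 1))
```

Because each link's distribution can be produced independently, the expensive part parallelizes trivially, which is the source of the speedup the abstract claims.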

Graph Neural Networks (GNNs) have seen significant success in tasks such as node classification, largely contingent upon the availability of sufficient labeled nodes. Yet the excessive cost of labeling large-scale graphs has led to a focus on active learning on graphs, which aims for effective data selection to maximize downstream model performance. Notably, most existing methods assume reliable graph topology, while real-world scenarios often present noisy graphs. Given this, designing a successful active learning framework for noisy graphs is highly needed but challenging, as selecting data for labeling and obtaining a clean graph are two naturally interdependent tasks: selecting high-quality data requires a clean graph structure, while cleaning a noisy graph structure requires sufficient labeled data. Considering this complexity, we propose an active learning framework, GALClean, specifically designed to take an iterative approach that conducts data selection and graph purification simultaneously, using the best information learned from the prior iteration. Importantly, we summarize GALClean as an instance of the Expectation-Maximization algorithm, which provides a theoretical understanding of its design and mechanisms. This theory naturally leads to an enhanced version, GALClean+. Extensive experiments have demonstrated the effectiveness and robustness of our proposed method across various types and levels of noisy graphs.
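The alternating structure the abstract describes can be sketched as a toy loop. This is a hedged illustration, not GALClean: the neighbor-vote classifier, the edge-pruning rule, the tiny graph, and the `oracle` labeler are all invented stand-ins for the framework's learned model, purification step, and human annotator.

```python
# Toy sketch of an alternating select-and-clean loop (not the authors'
# implementation): each round predicts labels with a simple neighbor vote,
# prunes edges whose predicted endpoint labels disagree, then labels one
# still-uncertain node via an invented oracle.

def neighbor_vote(node, edges, labels):
    """Predict a node's label as the majority label of its labeled neighbors."""
    votes = [labels[v] for u, v in edges if u == node and v in labels]
    votes += [labels[u] for u, v in edges if v == node and u in labels]
    return max(set(votes), key=votes.count) if votes else None

def clean_and_select(nodes, edges, labels, rounds=2):
    for _ in range(rounds):
        preds = {n: labels.get(n, neighbor_vote(n, edges, labels))
                 for n in nodes}
        # purification step: drop edges whose predicted endpoints disagree
        edges = [(u, v) for u, v in edges
                 if preds[u] is None or preds[v] is None or preds[u] == preds[v]]
        # selection step: query the oracle for one still-unpredictable node
        unknown = [n for n in nodes if preds[n] is None]
        if unknown:
            n = unknown[0]
            labels[n] = oracle[n]
    return edges, labels

oracle = {0: "a", 1: "a", 2: "b", 3: "b"}              # hidden ground truth
nodes, edges = [0, 1, 2, 3], [(0, 1), (2, 3), (1, 2)]  # (1, 2) is a noisy edge
edges, labels = clean_and_select(nodes, edges, {0: "a", 3: "b"})
print(edges, labels)
```

Each round's cleaner graph improves the next round's predictions, and vice versa, which is the interdependence the EM framing makes precise.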