Title: Scalable Score Computation for Learning Multinomial Bayesian Networks over Distributed Data
In this paper, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in a scalable manner, which is a fundamental task during learning. We propose a novel approach designed to achieve: (a) scalable score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores using the properties of a Markov chain; and (c) effective distribution of tasks during score computation (on large datasets) by synergistically combining well-known hashing techniques. Through theoretical analysis, we show that our approach is superior to a MapReduce-style computation in terms of communication and width. Further, it is superior to the batch-style processing of MapReduce for recomputing scores when new data are available.
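The abstract above describes the approach only at a high level, so the following is an illustrative sketch rather than the authors' algorithm: it shows how randomized pairwise gossip can aggregate the local sufficient-statistic counts from which a decomposable multinomial score (such as BDeu or log-likelihood) is then computed. The function name `gossip_counts`, the pairwise-averaging schedule, and the toy data are assumptions made for illustration.

```python
import random
import numpy as np

def gossip_counts(local_counts, rounds=200, seed=0):
    """Illustrative sketch (not the paper's algorithm): randomized
    pairwise-averaging gossip over local sufficient statistics.

    local_counts: list of 1-D numpy arrays, one per cluster node, holding
    that node's local counts (e.g., the flattened N_ijk table for one
    family of the Bayesian network).
    Returns each node's estimate of the global counts: the pairwise
    averages converge to the mean of the local vectors, which is then
    rescaled by the number of nodes.
    """
    estimates = [c.astype(float).copy() for c in local_counts]
    n = len(estimates)
    rng = random.Random(seed)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)             # a random pair of nodes talks
        avg = (estimates[i] + estimates[j]) / 2.0  # they exchange and average
        estimates[i], estimates[j] = avg, avg.copy()
    return [n * est for est in estimates]          # rescale mean -> global sum

# toy usage: three nodes, each holding local counts for a 2x3 family table
nodes = [np.array([4, 1, 0, 2, 3, 1]),
         np.array([0, 2, 5, 1, 0, 2]),
         np.array([3, 3, 1, 0, 4, 0])]
print(gossip_counts(nodes)[0])  # approaches the true global counts [7 6 6 3 7 3]
```

The paper's Markov-chain analysis of score maintenance and its hashing-based task distribution are not reflected in this toy; it only illustrates the gossip-style aggregation step.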
Award ID(s): 1740858
PAR ID: 10111668
Journal Name: The AAAI-17 Workshop on Distributed Machine Learning
Page Range / eLocation ID: 498-504
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The C+ grade for US bridges on the 2017 infrastructure report card underscores the need for improved data-driven methods to understand bridge performance. There is substantial interest and prior work in using inspection records to determine bridge health scores. However, aggregating, cleaning, and analyzing bridge inspection records from all states and all past years is a challenging task, limiting the accessibility and reproducibility of findings. This research introduces a new score computed from inspection records in the National Bridge Inventory (NBI) data set. Differences between the time series of condition ratings for a bridge and a time series of average national condition ratings by age are used to develop a health score for that bridge (a minimal sketch of this baseline-difference computation appears after this list). This baseline difference score complements NBI condition ratings in further understanding a bridge's performance over time. Moreover, the role of bridge attributes and environmental factors can be analyzed using the score. Such analysis shows that bridge material type has the highest association with the baseline difference score, followed by snowfall and maintenance. This research also makes a methodological contribution by outlining a data-driven approach to repeatable and scalable analysis of the NBI data set.
  2. Traditional tests of concept knowledge generate scores to assess how well a learner understands a concept. Here, we investigated whether patterns of brain activity collected during a concept knowledge task could be used to compute a neural 'score' that complements traditional scores of an individual's conceptual understanding. Using a novel data-driven multivariate neuroimaging approach, informational network analysis, we successfully derived a neural score from patterns of activity across the brain that predicted individual differences in multiple concept knowledge tasks in the physics and engineering domain. These tasks include an fMRI paradigm as well as two previously validated concept inventories. The informational network score outperformed alternative neural scores computed using other data-driven neuroimaging methods, including multivariate representational similarity analysis. This technique could be applied to quantify concept knowledge in a wide range of domains, including classroom-based education research, machine learning, and other areas of cognitive science.
  3. Propensity score methods account for selection bias in observational studies. However, the consistency of propensity score estimators strongly depends on a correct specification of the propensity score model. Logistic regression and, with increasing popularity, machine learning tools are used to estimate propensity scores. We introduce a stacked generalization ensemble learning approach to improve propensity score estimation by fitting a meta learner on the predictions of a suitable set of diverse base learners. We perform a comprehensive Monte Carlo simulation study, implementing a broad range of scenarios that mimic characteristics of typical data sets in educational studies. The population average treatment effect is estimated using the propensity score in Inverse Probability of Treatment Weighting (IPTW); a minimal sketch of this estimation pipeline appears after this list. Our proposed stacked ensembles, especially those using gradient boosting machines as a meta learner trained on the predictions of 12 base learners, led to a greater reduction of bias than the current state of the art in propensity score estimation. Further, our simulations imply that commonly used balance measures (averaged standardized absolute mean differences) might be misleading as propensity score model selection criteria. We apply our proposed model, which we call GBM-Stack, to assess the population average treatment effect of a Supplemental Instruction (SI) program in an introductory psychology (PSY 101) course at San Diego State University. Our analysis provides evidence that moving the whole population to SI attendance would on average lead to 1.69 times higher odds of passing the PSY 101 class compared to not offering SI, with a 95% bootstrap confidence interval of (1.31, 2.20).
  4. We propose a flexible low complexity design (FLCD) of coded distributed computing (CDC) with empirical evaluation on Amazon Elastic Compute Cloud (Amazon EC2). CDC can expedite MapReduce-like computation by trading increased map computation for reduced communication load and shuffle time. A main novelty of FLCD is to use the design freedom in defining map and reduce functions to develop asymptotically homogeneous systems that support varying intermediate value (IV) sizes under a general MapReduce framework. Compared to existing designs with constant IV sizes, FLCD offers greater flexibility in adapting to network parameters and significantly reduces implementation complexity by requiring fewer input files and shuffle groups. The FLCD scheme is the first proposed low-complexity CDC design that can operate on a network with an arbitrary number of nodes and an arbitrary computation load. We perform empirical evaluations of FLCD by executing the TeraSort algorithm on an Amazon EC2 cluster. This is the first time that theoretical predictions of CDC shuffle time have been validated by empirical evaluations. The evaluations demonstrate a speedup of 2.0 to 4.24 over conventional uncoded MapReduce, a 12% to 52% reduction in total time, and a wider range of operating network parameters compared to existing CDC schemes.
  5. Distributed machine learning is primarily motivated by the promise of increased computation power for accelerating training and by the ability to mitigate privacy concerns. Unlike machine learning on a single device, distributed machine learning requires collaboration and communication among the devices. This creates several new challenges: (1) the heavy communication overhead can be a bottleneck that slows down training, and (2) the unreliable communication and weaker control over the remote entities make the distributed system vulnerable to systematic failures and malicious attacks. This paper presents a variant of stochastic gradient descent (SGD) with improved communication efficiency and security in distributed environments. Our contributions include (1) a new technique called error reset, which adapts both infrequent synchronization and message compression for communication reduction in both synchronous and asynchronous training (a minimal sketch of the related error-feedback pattern appears after this list), (2) new score-based approaches for validating the updates, and (3) the integration of error reset with score-based validation. The proposed system provides communication reduction, support for both synchronous and asynchronous training, Byzantine tolerance, and local privacy preservation. We evaluate our techniques both theoretically and empirically.
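For the baseline difference score described in item 1, the abstract states the idea (compare a bridge's condition-rating time series with the national average rating at the same ages) but not an exact formula. Below is a minimal sketch under that reading; the function name `baseline_difference_score`, the mean aggregation, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def baseline_difference_score(bridge_ratings, national_avg_by_age):
    """Illustrative sketch (not the paper's exact formula).

    bridge_ratings: dict mapping bridge age in years -> NBI condition
    rating (0-9 scale) observed at that age.
    national_avg_by_age: dict mapping age -> national average condition
    rating for bridges of that age.
    Returns the mean rating difference: positive values suggest the
    bridge is rated above the national baseline for its age, negative
    values below it.
    """
    diffs = [rating - national_avg_by_age[age]
             for age, rating in bridge_ratings.items()
             if age in national_avg_by_age]
    return float(np.mean(diffs)) if diffs else float("nan")

# toy usage with made-up numbers
national = {10: 7.1, 20: 6.4, 30: 5.8}
bridge = {10: 7, 20: 6, 30: 5}
print(baseline_difference_score(bridge, national))  # about -0.43
```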
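For item 3, the abstract describes GBM-Stack as a stacked ensemble with a gradient boosting machine as the meta learner, whose propensity scores are used in Inverse Probability of Treatment Weighting. A minimal sketch of that pipeline, assuming scikit-learn and only two of the paper's 12 base learners, might look as follows; the helper name `iptw_ate`, the propensity trimming bounds, and the simple weighted mean difference (the paper reports an odds-based effect for the binary pass outcome) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, StackingClassifier)

def iptw_ate(X, treatment, outcome):
    """Illustrative sketch: stacked propensity model feeding an IPTW
    estimate of the population average treatment effect.

    X: 2-D numpy array of covariates; treatment: 0/1 numpy array;
    outcome: numpy array of outcomes.
    """
    # Stack of base learners with a GBM meta learner (a small stand-in
    # for the paper's 12 base learners).
    stack = StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=200, random_state=0))],
        final_estimator=GradientBoostingClassifier(random_state=0),
        stack_method="predict_proba", cv=5)
    stack.fit(X, treatment)
    e = stack.predict_proba(X)[:, 1]                 # estimated propensity scores
    e = np.clip(e, 0.01, 0.99)                       # trim extreme propensities
    w = treatment / e + (1 - treatment) / (1 - e)    # IPTW weights
    treated = np.average(outcome[treatment == 1], weights=w[treatment == 1])
    control = np.average(outcome[treatment == 0], weights=w[treatment == 0])
    return treated - control                         # weighted mean difference
```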
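For item 5, the abstract names 'error reset' but does not spell out the algorithm. The sketch below shows only the classic error-feedback pattern that compressed-SGD methods of this kind build on, not the paper's exact technique; the class name `ErrorFeedbackWorker` and the top-k compressor are assumptions for illustration.

```python
import numpy as np

def topk_compress(vec, k):
    """Keep the k largest-magnitude entries of vec, zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out

class ErrorFeedbackWorker:
    """Illustrative error-feedback worker: whatever the compressor drops
    is carried over to the next round instead of being lost."""

    def __init__(self, dim, k):
        self.residual = np.zeros(dim)  # accumulated compression error
        self.k = k

    def compress_update(self, local_gradient, lr):
        corrected = lr * local_gradient + self.residual  # add back past error
        message = topk_compress(corrected, self.k)       # what is communicated
        self.residual = corrected - message              # remember what was dropped
        return message

# toy usage: only the two largest entries of the corrected update survive
worker = ErrorFeedbackWorker(dim=6, k=2)
g = np.array([0.5, -0.1, 0.02, 0.9, -0.3, 0.05])
print(worker.compress_update(g, lr=0.1))  # -> [0.05 0. 0. 0.09 0. 0.]
```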