skip to main content

Title: Partitioned hybrid learning of Bayesian network structures

We develop a novel hybrid method for Bayesian network structure learning called partitioned hybrid greedy search (pHGS), composed of three distinct yet compatible new algorithms: Partitioned PC (pPC) accelerates skeleton learning via a divide-and-conquer strategy,p-value adjacency thresholding (PATH) effectively accomplishes parameter tuning with a single execution, and hybrid greedy initialization (HGI) maximally utilizes constraint-based information to obtain a high-scoring and well-performing initial graph for greedy search. We establish structure learning consistency of our algorithms in the large-sample limit, and empirically validate our methods individually and collectively through extensive numerical comparisons. The combined merits of pPC and PATH achieve significant computational reductions compared to the PC algorithm without sacrificing the accuracy of estimated structures, and our generally applicable HGI strategy reliably improves the estimation structural accuracy of popular hybrid algorithms with negligible additional computational expense. Our empirical results demonstrate the competitive empirical performance of pHGS against many state-of-the-art structure learning algorithms.

more » « less
Award ID(s):
Author(s) / Creator(s):
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Machine Learning
Page Range / eLocation ID:
p. 1695-1738
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Location-Based Services are often used to find proximal Points of Interest PoI - e.g., nearby restaurants and museums, police stations, hospitals, etc. - in a plethora of applications. An important recently addressed variant of the problem not only considers the distance/proximity aspect, but also desires semantically diverse locations in the answer-set. For instance, rather than picking several close-by attractions with similar features - e.g., restaurants with similar menus; museums with similar art exhibitions - a tourist may be more interested in a result set that could potentially provide more diverse types of experiences, for as long as they are within an acceptable distance from a given (current) location. Towards that goal, in this work we propose a novel approach to efficiently retrieve a path that will maximize the semantic diversity of the visited PoIs that are within distance limits along a given road network. We introduce a novel indexing structure - the Diversity Aggregated R-tree, based on which we devise efficient algorithms to generate the answer-set - i.e., the recommended locations among a set of given PoIs - relying on a greedy search strategy. Our experimental evaluations conducted on real datasets demonstrate the benefits of proposed methodology over the baseline alternative approaches. 
    more » « less
  2. Peng, Jie (Ed.)
    Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there lacks a principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at implements ITCA and the search strategies. 
    more » « less
  3. A current challenge for data management systems is to support the construction and maintenance of machine learning models over data that is large, multi-dimensional, and evolving. While systems that could support these tasks are emerging, the need to scale to distributed, streaming data requires new models and algorithms. In this setting, as well as computational scalability and model accuracy, we also need to minimize the amount of communication between distributed processors, which is the chief component of latency. We study Bayesian Networks, the workhorse of graphical models, and present a communication-efficient method for continuously learning and maintaining a Bayesian network model over data that is arriving as a distributed stream partitioned across multiple processors. We show a strategy for maintaining model parameters that leads to an exponential reduction in communication when compared with baseline approaches to maintain the exact MLE (maximum likelihood estimation). Meanwhile, our strategy provides similar prediction errors for the target distribution and for classification tasks. 
    more » « less
  4. Abstract

    During a design process, designers iteratively go back and forth between different design stages to explore the design space and search for the best design solution that satisfies all design constraints. For complex design problems, human has shown surprising capability in effectively reducing the dimensionality of design space and quickly converging it to a reasonable range for algorithms to step in and continue the search process. Therefore, modeling how human designers make decisions in such a sequential design process can help discover beneficial design patterns, strategies, and heuristics, which are important to the development of new algorithms embedded with human intelligence to augment computational design. In this paper, we develop a deep learning based approach to model and predict designers’ sequential decisions in a system design context. The core of this approach is an integration of the function-behavior-structure model for design process characterization and the long short term memory unit model for deep leaning. This approach is demonstrated in a solar energy system design case study, and its prediction accuracy is evaluated benchmarked on several commonly used models for sequential design decisions, such as Markov Chain model, Hidden Markov Chain model, and random sequence generation model. The results indicate that the proposed approach outperforms the other traditional models. This implies that during a system design task, designers are very likely to reply on both short-term and long-term memory of past design decisions in guiding their decision making in future design process. Our approach is general to be applied in many other design contexts as long as the sequential design action data is available.

    more » « less
  5. Consider deploying a team of robots in order to visit sites in a risky environment (i.e., where a robot might be lost during a traversal), subject to team-based operational constraints such as limits on team composition, traffic throughputs, and launch constraints. We formalize this problem using a graph to represent the environment, enforcing probabilistic survival constraints for each robot, and using a matroid (which generalizes linear independence to sets) to capture the team-based operational constraints. The resulting “Matroid Team Surviving Orienteers” (MTSO) problem has broad applications for robotics such as informative path planning, resource delivery, and search and rescue. We demonstrate that the objective for the MTSO problem has submodular structure, which leads us to develop two polynomial time algorithms which are guaranteed to find a solution with value within a constant factor of the optimum. The second of our algorithms is an extension of the accelerated continuous greedy algorithm, and can be applied to much broader classes of constraints while maintaining bounds on suboptimality. In addition to in-depth analysis, we demonstrate the efficiency of our approaches by applying them to a scenario where a team of robots must gather information while avoiding dangers in the Coral Triangle and characterize scaling and parameter selection using a synthetic dataset.

    more » « less