skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: DAG-Structured Clustering by Nearest Neighbors
Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structures are not only more flexible than trees, but also allow for points to be members of multiple clusters. We describe a scalable algorithm, Llama, which simply merges nearest neighbor substructures to form a DAG structure. Llama discovers structures that are more accurate than state-of-the-art tree-based techniques while remaining scalable to large-scale clustering benchmarks. Additionally, we support the proposed algorithm with theoretical guarantees on separated data, including types of data that cannot be correctly clustered by tree-based algorithms.  more » « less
Award ID(s):
1763618
PAR ID:
10290562
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Compound graphs are networks in which vertices can be grouped into larger subsets, with these subsets capable of further grouping, resulting in a nesting that can be many levels deep. In several applications, including biological workflows, chemical equations, and computational data flow analysis, these graphs often exhibit a tree-like nesting structure, where sibling clusters are disjoint. Common compound graph layouts prioritize the lowest level of the grouping, down to the individual ungrouped vertices, which can make the higher level grouped structures more difficult to discern, especially in deeply nested networks. Leveraging the additional structure of the tree-like nesting, we contribute an overview+detail layout for this class of compound graphs that preserves the saliency of the higher level network structure when groups are expanded to show internal nested structure. Our layout draws inner structures adjacent to their parents, using a modified tree layout to place substructures. We describe our algorithm and then present case studies demonstrating the layout's utility to a domain expert working on data flow analysis. Finally, we discuss network parameters and analysis situations in which our layout is well suited. 
    more » « less
  2. We present Centaurus a scalable, open source, clustering service for K-means clustering of correlated, multidimensional data. Centaurus provides users with automatic deployment via public or private cloud resources, model selection (using Bayesian information criterion), and data visualisation. We apply Centaurus to a real-world, agricultural analytics application and compare its results to the industry standard clustering approach. The application uses soil electrical conductivity (EC) measurements, GPS coordinates, and elevation data from a field to produce a map of differing soil zones (so that management can be specialised for each). We use Centaurus and these datasets to empirically evaluate the impact of considering multiple K-means variants and large numbers of experiments. We show that Centaurus yields more consistent and useful clusterings than the competitive approach for use in zone-based soil decision-support applications where a hard decision is required. 
    more » « less
  3. null (Ed.)
    Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel \emph{trellis} data structure. This combination results in an exact algorithm that scales beyond previous state of the art, from a search space with 10^12 trees to 10^15 trees, and an approximate algorithm that improves over baselines, even in enormous search spaces that contain more than 10^1000 trees. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering. 
    more » « less
  4. In this paper, we present a network-based template for analyzing large-scale dynamic data. Specifically, we present a novel shared-memory parallel algorithm for updating treebased structures, including connected components (CC) and the minimum spanning tree (MST) on dynamic networks. We propose a rooted tree-based data structure to store the edges that are most relevant to the analysis. Our algorithm is based on updating the information stored in this rooted tree.In this paper, we present a network-based template for analyzing large-scale dynamic data. Specifically, we present a novel shared-memory parallel algorithm for updating tree-based structures, including connected components (CC) and the minimum spanning tree (MST) on dynamic networks. We propose a rooted tree-based data structure to store the edges that are most relevant to the analysis. Our algorithm is based on updating the information stored in this rooted tree. 
    more » « less
  5. Large tree structures are ubiquitous and real-world relational datasets often have information associated with nodes (e.g., labels or other attributes) and edges (e.g., weights or distances) that need to be communicated to the viewers. Yet, scalable, easy to read tree layouts are difficult to achieve. We consider tree layouts to be readable if they meet some basic requirements: node labels should not overlap, edges should not cross, edge lengths should be preserved, and the output should be compact. There are many algorithms for drawing trees, although very few take node labels or edge lengths into account, and none optimizes all requirements above. With this in mind, we propose a new scalable method for readable tree layouts. The algorithm guarantees that the layout has no edge crossings and no label overlaps, and optimizing one of the remaining aspects: desired edge lengths and compactness. We evaluate the performance of the new algorithm by comparison with related earlier approaches using several real-world datasets, ranging from a few thousand nodes to hundreds of thousands of nodes. Tree layout algorithms can be used to visualize large general graphs, by extracting a hierarchy of progressively larger trees. We illustrate this functionality by presenting several map-like visualizations generated by the new tree layout algorithm. 
    more » « less