For many classic structured prediction problems, probability distributions over the dependent variables can be efficiently computed using widely-known algorithms and data structures (such as forward-backward, and its corresponding trellis for exact probability distributions in Markov models). However, we know of no previ- ous work studying efficient representations of exact distributions over clusterings. This paper presents definitions and proofs for a dynamic-programming inference procedure that computes the partition function, the marginal probability of a cluster, and the MAP clustering—all exactly. Rather than the N th Bell number, these exact solutions take time and space proportional to the substantially smaller powerset of N . Indeed, we improve upon the time complexity of the algorithm introduced by Kohonen and Corander [11] for this problem by a factor of N. While still large, this previously unknown result is intellectually interesting in its own right, makes feasible exact inference for important real-world small data applications (such as medicine), and provides a natural stepping stone towards sparse-trellis approximations that enable further scalability (which we also explore). In experi- ments, we demonstrate the superiority of our approach over approximate methods in analyzing real-world gene expression data used in cancer treatment.
more »
« less
An Algorithm for the Separation-Preserving Transition of Clusterings
The separability of clusters is one of the most desired properties in clustering. There is a wide range of settings in which different clusterings of the same data set appear. We are interested in applications for which there is a need for an explicit, gradual transition of one separable clustering into another one. This transition should be a sequence of simple, natural steps that upholds separability of the clusters throughout. We design an algorithm for such a transition. We exploit the intimate connection of separability and linear programming over bounded-shape partition and transportation polytopes: separable clusterings lie on the boundary of partition polytopes and form a subset of the vertices of the corresponding transportation polytopes, and circuits of both polytopes are readily interpreted as sequential or cyclical exchanges of items between clusters. This allows for a natural approach to achieve the desired transition through a combination of two walks: an edge walk between two so-called radial clusterings in a transportation polytope, computed through an adaptation of classical tools of sensitivity analysis and parametric programming, and a walk from a separable clustering to a corresponding radial clustering, computed through a tailored, iterative routine updating cluster sizes and reoptimizing the cluster assignment of items. Funding: Borgwardt gratefully acknowledges support of this work through National Science Foundation [Grant 2006183] Circuit Walks in Optimization, Algorithmic Foundations, Division of Computing and Communication Foundations; through Air Force Office of Scientific Research [Grant FA9550-21-1-0233] The Hirsch Conjecture for Totally-Unimodular Polyhedra; and through Simons Collaboration [Grant 524210] Polyhedral Theory in Data Analytics. Happach has been supported by the Alexander von Humboldt Foundation with funds from the German Federal Ministry of Education and Research.
more »
« less
- Award ID(s):
- 2006183
- PAR ID:
- 10420455
- Date Published:
- Journal Name:
- INFORMS Journal on Optimization
- Volume:
- 5
- Issue:
- 1
- ISSN:
- 2575-1484
- Page Range / eLocation ID:
- 1 to 26
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We consider the problem of clustering with user feedback. Existing methods express constraints about the input data points, most commonly through must-link and cannot-link constraints on data point pairs. In this paper, we introduce existential cluster constraints: a new form of feedback where users indicate the features of desired clusters. Specifically, users make statements about the existence of a cluster having (and not having) particular features. Our approach has multiple advantages: (1) constraints on clusters can express user intent more efficiently than point pairs; (2) in cases where the users’ mental model is of the desired clusters, it is more natural for users to express cluster-wise preferences; (3) it functions even when privacy restrictions prohibit users from seeing raw data. In addition to introducing existential cluster constraints, we provide an inference algorithm for incorporating our constraints into the output clustering. Finally, we demonstrate empirically that our proposed framework facilitates more accurate clustering with dramatically fewer user feedback inputs.more » « less
-
We study quotients of the Toeplitz C*-algebra of a random walk, similar to those studied by the author and Markiewicz for finite stochastic matrices. We introduce a new Cuntz-type quotient C*-algebra for random walks that have convergent ratios of transition probabilities. These C*-algebras give rise to new notions of ratio limit space and boundary for such random walks, which are computed by appealing to a companion paper by Woess. Our combined results are leveraged to identify a unique symmetry-equivariant quotient C*-algebra for any symmetric random walk on a hyperbolic group, shedding light on a question of Viselter on C*-algebras of subproduct systems.more » « less
-
Finding the reduced-dimensional structure is critical to understanding complex networks. Existing approaches such as spectral clustering are applicable only when the full network is explicitly observed. In this paper, we focus on the online factorization and partition of implicit large lumpable networks based on observations from an associated random walk. We formulate this into a nonconvex stochastic factorization problem and propose an efficient and scalable stochastic generalized Hebbian algorithm (GHA). The algorithm is able to process random walk data in a streaming fashion and learn a low-dimensional representation for each vertex. By applying a diffusion approximation analysis, we show that the continuous-time limiting process of the stochastic algorithm converges globally to the “principal components” of the Markov chain. We also establish a finite-sample error bound that matches the nonimprovable state-of-art result for online factorization. Once learned the low-dimensional state representations, we further apply clustering techniques to recover the network partition. We show that when the associated Markov process is lumpable, one can recover the partition exactly with high probability given sufficient data. We apply the proposed approach to model the traffic flow of Manhattan as city-wide random walks. By using our algorithm to analyze the taxi trip data, we discover a latent partition of the Manhattan city that closely matches the traffic dynamics.more » « less
-
null (Ed.)The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art.more » « less
An official website of the United States government

