In this paper, we address clustering problems in scenarios where accurate direct access to the full dataset is impractical or impossible. Instead, we leverage oracle-based methods, which are particularly valuable in real-world applications where the data may be noisy, restricted due to privacy concerns or sheer volume. We utilize two oracles, the quadruplet and the distance oracle. The quadruplet oracle is a weaker oracle that only approximately compares the distances of two pairs of vertices. In practice, these oracles can be implemented using crowdsourcing or training classifiers or other predictive models. On the other hand, the distance oracle returns exactly the distance of two vertices, so it is a stronger and more expensive oracle to implement. We consider two noise models for the quadruplet oracle. In the adversarial noise model, if two pairs have similar distances, the response is chosen by an adversary. In the probabilistic noise model, the pair with the smaller distance is returned with a constant probability. We consider a set V of n vertices in a metric space that supports the quadruplet and the distance oracle. For each of the k-center, k-median, and k-means clustering problem on V, we design constant approximation algorithms that perform roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle in both noise models. When the dataset has low intrinsic dimension, we significantly improve the approximation factors of our algorithms by performing a few additional calls to the distance oracle. We also show that for k-median and k-means clustering there is no hope to return any sublinear approximation using only the quadruplet oracle. Finally, we give constant approximation algorithms for estimating the clustering cost induced by any set of k vertices, performing roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle.
more »
« less
FPDclustering: a comprehensive R package for probabilistic distance clustering based methods
Abstract Data clustering has a long history and refers to a vast range of models and methods that exploit the ever-more-performing numerical optimization algorithms and are designed to find homogeneous groups of observations in data. In this framework, the probability distance clustering (PDC) family methods offer a numerically effective alternative to model-based clustering methods and a more flexible opportunity in the framework of geometric data clustering. GivennJ-dimensional data vectors arranged in a data matrix and the numberKof clusters, PDC maximizes the joint density function that is defined as the sum of the products between the distance and the probability, both of which are measured for each data vector from each center. This article shows the capabilities of the PDC family, illustrating the package .
more »
« less
- Award ID(s):
- 2209974
- PAR ID:
- 10507342
- Publisher / Repository:
- Springer Science + Business Media
- Date Published:
- Journal Name:
- Computational Statistics
- Volume:
- 40
- Issue:
- 2
- ISSN:
- 0943-4062
- Format(s):
- Medium: X Size: p. 1123-1146
- Size(s):
- p. 1123-1146
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
This paper proposes a classification scheme for categorization of PDC educational resources. We have also proposed an evaluation framework for assessing the PDC resources. Under the proposed framework, each resource type has a set of criteria and an associated score. A PDC resource will obtain a score if evaluated under our proposed framework that is the sum of the scores of the criteria that the resource satisfies. The evaluation of whether a resource met a criterion is subjective. We have also presented our evaluation of PDC educational resources appropriate for CS1, CS2 (Computer Science 1 and 2), and DS/A (Data Structures and Algorithms) available on the web using our proposed framework.more » « less
-
Abstract The apparent clustering in longitude of perihelionϖand ascending node Ω of extreme trans-Neptunian objects (ETNOs) has been attributed to the gravitational effects of an unseen 5–10 Earth-mass planet in the outer solar system. To investigate how selection bias may contribute to this clustering, we consider 14 ETNOs discovered by the Dark Energy Survey, the Outer Solar System Origins Survey, and the survey of Sheppard and Trujillo. Using each survey's published pointing history, depth, and TNO tracking selections, we calculate the joint probability that these objects are consistent with an underlying parent population with uniform distributions inϖand Ω. We find that the mean scaled longitude of perihelion and orbital poles of the detected ETNOs are consistent with a uniform population at a level between 17% and 94% and thus conclude that this sample provides no evidence for angular clustering.more » « less
-
Abstract Several methods have been developed to computationally predict cell-types for single cell RNA sequencing (scRNAseq) data. As methods are developed, a common problem for investigators has been identifying the best method they should apply to their specific use-case. To address this challenge, we present CHAI (consensus Clustering tHrough similArIty matrix integratIon for single cell-type identification), a wisdom of crowds approach for scRNAseq clustering. CHAI presents two competing methods which aggregate the clustering results from seven state-of-the-art clustering methods: CHAI-AvgSim and CHAI-SNF. CHAI-AvgSim and CHAI-SNF demonstrate superior performance across several benchmarking datasets. Furthermore, both CHAI methods outperform the most recent consensus clustering method, SAME-clustering. We demonstrate CHAI’s practical use case by identifying a leader tumor cell cluster enriched with CDH3. CHAI provides a platform for multiomic integration, and we demonstrate CHAI-SNF to have improved performance when including spatial transcriptomics data. CHAI overcomes previous limitations by incorporating the most recent and top performing scRNAseq clustering algorithms into the aggregation framework. It is also an intuitive and easily customizable R package where users may add their own clustering methods to the pipeline, or down-select just the ones they want to use for the clustering aggregation. This ensures that as more advanced clustering algorithms are developed, CHAI will remain useful to the community as a generalized framework. CHAI is available as an open source R package on GitHub: https://github.com/lodimk2/chai.more » « less
-
Abstract Estimating spatiotemporal patterns of population density is a primary objective of wildlife monitoring programs. However, estimating density is challenging for species that are elusive and/or occur in habitats with limited visibility. In such situations, indirect measures (e.g., nests, dung) can serve as proxies for counts of individuals. Scientists have developed approaches to estimate population density using these “indirect count” data, although current methods do not adequately account for variation in sign production and spatial patterns of animal density. In this study, we describe a modified hierarchical distance sampling model that maximizes the information content of indirect count data using Bayesian inference. We apply our model to assess the status of chimpanzee and elephant populations using counts of nests and dung, respectively, which were collected along transects in 2007 and 2021 in western Uganda. Compared with conventional methods, our modeling framework produced more precise estimates of covariate effects on expected animal density by accounting for both long‐term and recent variations in animal abundance and enabled the estimation of the number of days that animal signs remained visible. We estimated a 0.98 probability that chimpanzee density in the region had declined by at least 10% and a 0.99 probability that elephant density had increased by 50% from 2007 to 2021. We recommend applying our modified hierarchical distance sampling model in the analysis of indirect count data to account for spatial variation in animal density, assess population change between survey periods, estimate the decay rate of animal signs, and obtain more precise density estimates than achievable with traditional methods.more » « less
An official website of the United States government
