skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: FPDclustering: a comprehensive R package for probabilistic distance clustering based methods
Abstract Data clustering has a long history and refers to a vast range of models and methods that exploit the ever-more-performing numerical optimization algorithms and are designed to find homogeneous groups of observations in data. In this framework, the probability distance clustering (PDC) family methods offer a numerically effective alternative to model-based clustering methods and a more flexible opportunity in the framework of geometric data clustering. GivennJ-dimensional data vectors arranged in a data matrix and the numberKof clusters, PDC maximizes the joint density function that is defined as the sum of the products between the distance and the probability, both of which are measured for each data vector from each center. This article shows the capabilities of the PDC family, illustrating the package .  more » « less
Award ID(s):
2209974
PAR ID:
10507342
Author(s) / Creator(s):
;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Computational Statistics
Volume:
40
Issue:
2
ISSN:
0943-4062
Format(s):
Medium: X Size: p. 1123-1146
Size(s):
p. 1123-1146
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we address clustering problems in scenarios where accurate direct access to the full dataset is impractical or impossible. Instead, we leverage oracle-based methods, which are particularly valuable in real-world applications where the data may be noisy, restricted due to privacy concerns or sheer volume. We utilize two oracles, the quadruplet and the distance oracle. The quadruplet oracle is a weaker oracle that only approximately compares the distances of two pairs of vertices. In practice, these oracles can be implemented using crowdsourcing or training classifiers or other predictive models. On the other hand, the distance oracle returns exactly the distance of two vertices, so it is a stronger and more expensive oracle to implement. We consider two noise models for the quadruplet oracle. In the adversarial noise model, if two pairs have similar distances, the response is chosen by an adversary. In the probabilistic noise model, the pair with the smaller distance is returned with a constant probability. We consider a set V of n vertices in a metric space that supports the quadruplet and the distance oracle. For each of the k-center, k-median, and k-means clustering problem on V, we design constant approximation algorithms that perform roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle in both noise models. When the dataset has low intrinsic dimension, we significantly improve the approximation factors of our algorithms by performing a few additional calls to the distance oracle. We also show that for k-median and k-means clustering there is no hope to return any sublinear approximation using only the quadruplet oracle. Finally, we give constant approximation algorithms for estimating the clustering cost induced by any set of k vertices, performing roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle. 
    more » « less
  2. This paper proposes a classification scheme for categorization of PDC educational resources. We have also proposed an evaluation framework for assessing the PDC resources. Under the proposed framework, each resource type has a set of criteria and an associated score. A PDC resource will obtain a score if evaluated under our proposed framework that is the sum of the scores of the criteria that the resource satisfies. The evaluation of whether a resource met a criterion is subjective. We have also presented our evaluation of PDC educational resources appropriate for CS1, CS2 (Computer Science 1 and 2), and DS/A (Data Structures and Algorithms) available on the web using our proposed framework. 
    more » « less
  3. Abstract The apparent clustering in longitude of perihelionϖand ascending node Ω of extreme trans-Neptunian objects (ETNOs) has been attributed to the gravitational effects of an unseen 5–10 Earth-mass planet in the outer solar system. To investigate how selection bias may contribute to this clustering, we consider 14 ETNOs discovered by the Dark Energy Survey, the Outer Solar System Origins Survey, and the survey of Sheppard and Trujillo. Using each survey's published pointing history, depth, and TNO tracking selections, we calculate the joint probability that these objects are consistent with an underlying parent population with uniform distributions inϖand Ω. We find that the mean scaled longitude of perihelion and orbital poles of the detected ETNOs are consistent with a uniform population at a level between 17% and 94% and thus conclude that this sample provides no evidence for angular clustering. 
    more » « less
  4. Abstract Key science questions, such as galaxy distance estimation and weather forecasting, often require knowing the full predictive distribution of a target variableYgiven complex inputsX. Despite recent advances in machine learning and physics-based models, it remains challenging to assess whether an initial model is calibrated for allx, and when needed, to reshape the densities ofytoward ‘instance-wise’ calibration. This paper introduces the local amortized diagnostics and reshaping of conditional densities (LADaR) framework and proposes a new computationally efficient algorithm (Cal-PIT) that produces interpretable local diagnostics and provides a mechanism for adjusting conditional density estimates (CDEs).Cal-PITlearns a single interpretable local probability–probability map from calibration data that identifies where and how the initial model is miscalibrated across feature space, which can be used to morph CDEs such that they are well-calibrated. We illustrate the LADaR framework on synthetic examples, including probabilistic forecasting from image sequences, akin to predicting storm wind speed from satellite imagery. Our main science application involves estimating the probability density functions of galaxy distances given photometric data, whereCal-PITachieves better instance-wise calibration than all 11 other literature methods in a benchmark data challenge, demonstrating its utility for next-generation cosmological analyzes99Code available as a Python package here:https://github.com/lee-group-cmu/Cal-PIT.. 
    more » « less
  5. Abstract Several methods have been developed to computationally predict cell-types for single cell RNA sequencing (scRNAseq) data. As methods are developed, a common problem for investigators has been identifying the best method they should apply to their specific use-case. To address this challenge, we present CHAI (consensus Clustering tHrough similArIty matrix integratIon for single cell-type identification), a wisdom of crowds approach for scRNAseq clustering. CHAI presents two competing methods which aggregate the clustering results from seven state-of-the-art clustering methods: CHAI-AvgSim and CHAI-SNF. CHAI-AvgSim and CHAI-SNF demonstrate superior performance across several benchmarking datasets. Furthermore, both CHAI methods outperform the most recent consensus clustering method, SAME-clustering. We demonstrate CHAI’s practical use case by identifying a leader tumor cell cluster enriched with CDH3. CHAI provides a platform for multiomic integration, and we demonstrate CHAI-SNF to have improved performance when including spatial transcriptomics data. CHAI overcomes previous limitations by incorporating the most recent and top performing scRNAseq clustering algorithms into the aggregation framework. It is also an intuitive and easily customizable R package where users may add their own clustering methods to the pipeline, or down-select just the ones they want to use for the clustering aggregation. This ensures that as more advanced clustering algorithms are developed, CHAI will remain useful to the community as a generalized framework. CHAI is available as an open source R package on GitHub: https://github.com/lodimk2/chai. 
    more » « less