NSF PAR Search | NSF Public Access Repository


Designing unsupervised mixed‐type feature selection techniques using the heterogeneous correlation matrix

https://doi.org/10.1111/insr.70016

Tortora, C.; Madhvani, S.; Punzo, A. (November 2025, International Statistical Review)

Abstract: Real‐life data often include both numerical and categorical features. When categorical features are ordinal, the Pearson correlation matrix (CM) can be extended to a heterogeneous CM (HCM), which combines Pearson's correlations (numerical‐numerical), polyserial correlations (numerical‐ordinal) and polychoric correlations (ordinal‐ordinal). HCM entries are comparable, enabling assessment of pairwise‐linear dependencies. An added benefit is the computation of p‐values for pairwise uncorrelation tests, forming a heterogeneous p‐values matrix (HPM). While the HCM has been used for unsupervised feature extraction (UFE), that is, transforming features into informative representations (e.g., PCA), its application to unsupervised feature selection (UFS), that is, selecting relevant features, remains unexplored. This paper proposes two HCM‐based UFS methods for mixed‐type features. These, called UFS‐rHCM and UFS‐cHCM, iteratively remove redundant features using the HCM, either row‐wise (UFS‐rHCM) or cell‐wise (UFS‐cHCM). The HPM determines the stopping point, enabling a statistically grounded approach to selecting the number of features. We also introduce a visualization tool for assessing feature importance and ranking. The performance of our methods is evaluated on simulated and real datasets.
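The general idea of correlation-driven feature removal with a p-value stopping rule can be illustrated with a minimal sketch. It is not the authors' UFS-rHCM or UFS-cHCM algorithm: it assumes purely numerical features, so plain Pearson correlation stands in for the HCM, and a Fisher z approximation stands in for the HPM. The redundancy score (sum of absolute significant correlations per row), the function names, and the default `alpha` are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value for H0: no correlation (Fisher z approximation, n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def select_features(columns, alpha=0.05):
    """Drop one feature per iteration until no pair is significantly correlated.

    Assumption: the dropped feature is the one whose row has the largest
    sum of absolute significant correlations (a hypothetical row-wise score).
    """
    kept = list(columns)
    n = len(next(iter(columns.values())))
    while len(kept) > 1:
        significant = {}
        for i, a in enumerate(kept):
            for b in kept[i + 1:]:
                r = pearson(columns[a], columns[b])
                if p_value(r, n) < alpha:
                    significant[(a, b)] = r
        if not significant:
            break  # stopping rule: no significant pairwise correlation left
        score = {f: 0.0 for f in kept}
        for (a, b), r in significant.items():
            score[a] += abs(r)
            score[b] += abs(r)
        kept.remove(max(kept, key=score.get))
    return kept


data = {
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "y": [2, 4, 6, 8, 10, 12, 14, 16, 18, 21],   # nearly a multiple of x
    "z": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],    # weakly related to both
}
selected = select_features(data)
```

In this toy data, `x` and `y` are almost collinear, so one of them is dropped and `z` is retained. For ordinal features, the Pearson step would need to be replaced by polyserial and polychoric correlations, as the abstract describes.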
FPDclustering: a comprehensive R package for probabilistic distance clustering based methods

https://doi.org/10.1007/s00180-024-01490-5

Tortora, Cristina; Palumbo, Francesco (May 2024, Computational Statistics)

Abstract: Data clustering has a long history and refers to a vast range of models and methods that exploit increasingly powerful numerical optimization algorithms and are designed to find homogeneous groups of observations in data. In this framework, the probability distance clustering (PDC) family of methods offers a numerically effective alternative to model-based clustering methods and a more flexible option within geometric data clustering. Given n J-dimensional data vectors arranged in a data matrix and the number K of clusters, PDC maximizes the joint density function that is defined as the sum of the products between the distance and the probability, both of which are measured for each data vector from each center. This article shows the capabilities of the PDC family, illustrating the package FPDclustering.
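The quantity mentioned in the abstract, a sum of distance-times-probability products over all data vectors and centers, can be computed with a short sketch. This is not the FPDclustering implementation (which is an R package). It assumes Euclidean distances and membership probabilities inversely proportional to distance, an assumption taken from the general PDC literature rather than from this abstract. The function names are hypothetical.

```python
import math


def pdc_memberships(point, centers):
    """Return (distances, probabilities) of one point with respect to each center.

    Assumption: probabilities are inversely proportional to the Euclidean
    distance, so that distance * probability is the same for every center.
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # A point lying exactly on a center belongs to it with probability 1.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return dists, [h / total for h in hits]
    inv = [1.0 / d for d in dists]
    total = sum(inv)
    return dists, [v / total for v in inv]


def sum_distance_probability(points, centers):
    """Sum of distance * probability over all points and all centers."""
    total = 0.0
    for x in points:
        dists, probs = pdc_memberships(x, centers)
        total += sum(d * p for d, p in zip(dists, probs))
    return total


centers = [(1.0, 0.0), (3.0, 0.0)]
dists, probs = pdc_memberships((0.0, 0.0), centers)
objective = sum_distance_probability([(0.0, 0.0)], centers)
```

For the point at the origin, the distances are 1 and 3, so the probabilities are 0.75 and 0.25, and each distance-probability product equals 0.75. A full PDC fit would also update the centers iteratively, which this sketch leaves out.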
Handling skewness and directional tails in model-based clustering

https://doi.org/10.1007/s00362-025-01723-9

Tortora, Cristina; Punzo, Antonio; Franczak, Brian C. (July 2025, Statistical Papers)
Evaluating methods for addressing skewness in clustering: a focus on generalized hyperbolic mixture models

https://doi.org/10.1080/00949655.2025.2502535

Tortora, Cristina (May 2025, Journal of Statistical Computation and Simulation)

In model-based clustering, the population is assumed to be a combination of sub-populations. Typically, each sub-population is modeled by a mixture model component, distributed according to a known probability distribution. Each component is considered a cluster. Two primary approaches have been used in the literature when clusters are skewed: (1) transforming the data within each cluster and applying a mixture of symmetric distributions to the transformed data, and (2) directly modeling each cluster using a skewed distribution. Among skewed distributions, the generalized hyperbolic distribution is notably flexible and includes many other known distributions as special or limiting cases. This paper achieves two goals. First, it extends the flexibility of transformation-based methods as outlined in approach (1) by employing a flexible symmetric generalized hyperbolic distribution to model each transformed cluster. This innovation results in the introduction of two new models, each derived from distinct within-cluster data transformations. Second, the paper benchmarks the approaches listed in (1) and (2) for handling skewness using both simulated and real data. The findings highlight the necessity of both approaches in varying contexts.
Free, publicly-accessible full text available May 27, 2026
A Laplace-based model with flexible tail behavior

https://doi.org/10.1016/j.csda.2023.107909

Tortora, Cristina; Franczak, Brian C.; Bagnato, Luca; Punzo, Antonio (April 2024, Computational Statistics & Data Analysis)

Full Text Available
Missing Values and Directional Outlier Detection in Model-Based Clustering

https://doi.org/10.1007/s00357-023-09450-2

Tong, Hung; Tortora, Cristina (October 2023, Journal of Classification)

Full Text Available
