This content will become publicly available on June 4, 2024

Title: Inv-Senet: Invariant Self Expression Network for Clustering Under Biased Data
Subspace clustering algorithms are used to uncover the cluster structure that best explains the patterns prevalent in a dataset. These methods are extensively used for data-exploration tasks in various areas of the natural sciences. However, most of these methods fail to handle confounding attributes in the dataset. For datasets where a data sample represents multiple attributes, naively applying any clustering approach can result in undesired output. To this end, we propose a novel framework for jointly removing confounding attributes while learning to cluster data points in individual subspaces. Assuming we have label information about these confounding attributes, we regularize the clustering method by adversarially learning to minimize the mutual information between the data representation and the confounding attribute labels. Our experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach.
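To make the "self expression" idea in the abstract concrete, here is a minimal sketch of the plain self-expressive model that subspace clustering builds on: each sample is reconstructed as a linear combination of the other samples, and the resulting coefficients define an affinity between points. This is not the Inv-Senet method itself. The ridge penalty, the parameter `lam`, and the function names are assumptions chosen for illustration, and the adversarial mutual-information regularizer described in the abstract is deliberately left out.

```python
import numpy as np


def self_expressive_coefficients(X, lam=1e-3):
    """Ridge-regularized self-expression with a zero diagonal (illustrative).

    X has shape (n_samples, n_features). Row i of the returned matrix C
    reconstructs sample i from the other samples only:
    X[i] ~= C[i] @ X, with C[i, i] = 0.
    """
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        idx = np.delete(np.arange(n), i)   # every sample except i
        others = X[idx]                    # shape (n - 1, n_features)
        # Normal equations of: min_c ||X[i] - c @ others||^2 + lam * ||c||^2
        gram = others @ others.T + lam * np.eye(n - 1)
        C[i, idx] = np.linalg.solve(gram, others @ X[i])
    return C


def affinity_from_coefficients(C):
    """Symmetric, non-negative affinity built from the coefficients."""
    return np.abs(C) + np.abs(C).T


def two_subspace_demo(n_per=20, dim=6, seed=0):
    """Hypothetical toy data: points drawn from two random 2-D subspaces."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(2):
        basis = rng.standard_normal((2, dim))       # spans one subspace
        coeffs = rng.standard_normal((n_per, 2))    # coordinates within it
        blocks.append(coeffs @ basis)
    X = np.vstack(blocks)
    labels = np.repeat([0, 1], n_per)
    return X, labels
```

On toy data such as `two_subspace_demo()`, most of the affinity mass should fall between points of the same subspace. The affinity matrix could then be handed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering(affinity="precomputed")`.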
Award ID(s):
1947972
NSF-PAR ID:
10444738
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Page Range / eLocation ID:
1 to 5
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As one of the most important research topics in the unsupervised learning field, Multi-View Clustering (MVC) has been widely studied in the past decade and numerous MVC methods have been developed. Among these methods, the recently emerged Graph Neural Networks (GNN) shine a light on modeling both topological structure and node attributes in the form of graphs, to guide unified embedding learning and clustering. However, the effectiveness of existing GNN-based MVC methods is still limited due to the insufficient consideration in utilizing the self-supervised information and graph information, which can be reflected from the following two aspects: 1) most of these models merely use the self-supervised information to guide the feature learning and fail to realize that such information can be also applied in graph learning and sample weighting; 2) the usage of graph information is generally limited to the feature aggregation in these models, yet it also provides valuable evidence in detecting noisy samples. To this end, in this paper we propose Self-Supervised Graph Attention Networks for Deep Weighted Multi-View Clustering (SGDMC), which promotes the performance of GNN-based deep MVC models by making full use of the self-supervised information and graph information. Specifically, a novel attention-allocating approach that considers both the similarity of node attributes and the self-supervised information is developed to comprehensively evaluate the relevance among different nodes. Meanwhile, to alleviate the negative impact caused by noisy samples and the discrepancy of cluster structures, we further design a sample-weighting strategy based on the attention graph as well as the discrepancy between the global pseudo-labels and the local cluster assignment. Experimental results on multiple real-world datasets demonstrate the effectiveness of our method over existing approaches. 
  2. In many applications, multiple parties have private data regarding the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information for training the model, instead of the private data. However, it is challenging to ensure that the shared information maintains privacy while learning accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. It builds a weighted grid as the synopsis of the global dataset based on the received information. Final centers are generated by running any k-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, light-weight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weights estimation algorithm and a parameter tuning strategy to reduce the final k-means loss to be close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm and show that our approach performs better both theoretically and empirically than the two baselines based on existing techniques.
  3. Abstract

    Citizen and community science datasets are typically collected using flexible protocols. These protocols enable large volumes of data to be collected globally every year; however, the consequence is that these protocols typically lack the structure necessary to maintain consistent sampling across years. This can result in complex and pronounced interannual changes in the observation process, which can complicate the estimation of population trends because population changes over time are confounded with changes in the observation process.

    Here we describe a novel modelling approach designed to estimate spatially explicit species population trends while controlling for the interannual confounding common in citizen science data. The approach is based on Double machine learning, a statistical framework that uses machine learning (ML) methods to estimate population change and the propensity scores used to adjust for confounding discovered in the data. ML makes it possible to use large sets of features to control for confounding and to model spatial heterogeneity in trends. Additionally, we present a simulation method to identify and adjust for residual confounding missed by the propensity scores.

    To illustrate the approach, we estimated species trends using data from the citizen science project eBird. We used a simulation study to assess the ability of the method to estimate spatially varying trends when faced with realistic confounding and temporal correlation. Results demonstrated the ability to distinguish between spatially constant and spatially varying trends. There were low error rates on the estimated direction of population change (increasing/decreasing) at each location and high correlations on the estimated magnitude of population change.

    The ability to estimate spatially explicit trends while accounting for confounding inherent in citizen science data has the potential to fill important information gaps, helping to estimate population trends for species and/or regions lacking rigorous monitoring data.

  4. Multi-View Clustering (MVC) aims to find the cluster structure shared by multiple views of a particular dataset. Existing MVC methods mainly integrate the raw data from different views, while ignoring the high-level information. Thus, their performance may degrade due to the conflict between heterogeneous features and the noise existing in each individual view. To overcome this problem, we propose a novel Multi-View Ensemble Clustering (MVEC) framework to solve MVC in an Ensemble Clustering (EC) way, which generates Basic Partitions (BPs) for each view individually and seeks a consensus partition among all the BPs. By this means, we naturally leverage the complementary information of multi-view data in the same partition space. Instead of directly fusing BPs, we employ the low-rank and sparse decomposition to explicitly consider the connection between different views and detect the noise in each view. Moreover, the spectral ensemble clustering task is also incorporated into our framework with a carefully designed constraint, making MVEC a unified optimization framework to achieve the final consensus partition. Experimental results on six real-world datasets show the efficacy of our approach compared with both MVC and EC methods.

  5. To alleviate the cost of collecting and annotating large-scale point cloud datasets for 3D scene understanding tasks, we propose an unsupervised learning approach to learn features from an unlabeled point cloud "3D object" dataset by using part contrasting and object clustering with deep graph neural networks (GNNs). In the contrast learning step, all the samples in the 3D object dataset are cut into two parts and put into a "part" dataset. Then a contrast learning GNN (ContrastNet) is trained to verify whether two randomly sampled parts from the part dataset belong to the same object. In the cluster learning step, the trained ContrastNet is applied to all the samples in the original 3D object dataset to extract features, which are used to group the samples into clusters. Then another GNN for clustering learning (ClusterNet) is trained to predict the cluster IDs of all the training samples. The contrasting learning forces the ContrastNet to learn high-level semantic features of objects but probably ignores low-level features, while the ClusterNet improves the quality of learned features by being trained to discover objects that belong to the same semantic categories by using cluster IDs. We have conducted extensive experiments to evaluate the proposed framework on point cloud classification tasks. The proposed unsupervised learning approach obtained comparable performance to the state-of-the-art unsupervised learning methods that used much more complicated network structures. The code and an extended version of this work are publicly available via: https://github.com/lingzhang1/ContrastNet