skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e. , the degree to which an individual cluster includes only related sequences, and completeness, the degree to which related sequences are broken up into multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. Remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ proves to effectively merge closely-related clusters of protein that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate achieves an increased V-measure of 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with respect to the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information is available on Github: https://github.com/EESI/Complet-Plus .  more » « less
Award ID(s):
1936782 1936791
PAR ID:
10451100
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
PeerJ
Volume:
11
ISSN:
2167-8359
Page Range / eLocation ID:
e14779
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Modern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for complex representations that can capture, for instance, edges of different types as well as so-called “higher-order interactions” that involve more than two nodes at a time. However, we have fewer rigorous methods that can provide insight from such representations. Here, we develop a computational framework for the problem of clustering hypergraphs with categorical edge labels — or different interaction types — where clusters corresponds to groups of nodes that frequently participate in the same type of interaction. Our methodology is based on a combinatorial objective function that is related to correlation clustering on graphs but enables the design of much more efficient algorithms that also seamlessly generalize to hypergraphs. When there are only two label types, our objective can be optimized in polynomial time, using an algorithm based on minimum cuts. Minimizing our objective becomes NP-hard with more than two label types, but we develop fast approximation algorithms based on linear programming relaxations that have theoretical cluster quality guarantees. We demonstrate the efficacy of our algorithms and the scope of the model through problems in edge-label community detection, clustering with temporal data, and exploratory data analysis. 
    more » « less
  2. ABSTRACT The rapid expansion of food and nutrition information requires new ways of data sharing and dissemination. Interactive platforms integrating data portals and visualization dashboards have been effectively utilized to describe, monitor, and track information related to food and nutrition; however, a comprehensive evaluation of emerging interactive systems is lacking. We conducted a systematic review on publicly available dashboards using a set of 48 evaluation metrics for data integrity, completeness, granularity, visualization quality, and interactivity based on 4 major principles: evidence, efficiency, emphasis, and ethics. We evaluated 13 dashboards, summarized their characteristics, strengths, and limitations, and provided guidelines for developing nutrition dashboards. We applied mixed effects models to summarize evaluation results adjusted for interrater variability. The proposed metrics and evaluation principles help to improve data standardization and harmonization, dashboard performance and usability, broaden information and knowledge sharing among researchers, practitioners, and decision makers in the field of food and nutrition, and accelerate data literacy and communication. 
    more » « less
  3. Clustering is a fundamental unsupervised learning problem where a data-set is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the “price of fairness,” can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms. 
    more » « less
  4. Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the `''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms. 
    more » « less
  5. Abstract Clustering is a fundamental feature of earthquakes that impacts basic and applied analyses of seismicity. Events included in the existing short-duration instrumental catalogs are concentrated strongly within a very small fraction of the space–time volume, which is highly amplified by activity associated with the largest recorded events. The earthquakes that are included in instrumental catalogs are unlikely to be fully representative of the long-term behavior of regional seismicity. We illustrate this and other aspects of space–time earthquake clustering, and propose a quantitative clustering measure based on the receiver operating characteristic diagram. The proposed approach allows eliminating effects of marginal space and time inhomogeneities related to the geometry of the fault network and regionwide changes in earthquake rates, and quantifying coupled space–time variations that include aftershocks, swarms, and other forms of clusters. The proposed measure is used to quantify and compare earthquake clustering in southern California, western United States, central and eastern United States, Alaska, Japan, and epidemic-type aftershock sequence model results. All examined cases show a high degree of coupled space–time clustering, with the marginal space clustering dominating the marginal time clustering. Declustering earthquake catalogs can help clarify long-term aspects of regional seismicity and increase the signal-to-noise ratio of effects that are subtler than the strong clustering signatures. We illustrate how the high coupled space–time clustering can be decreased or removed using a data-adaptive parsimonious nearest-neighbor declustering approach, and emphasize basic unresolved issues on the proper outcome and quality metrics of declustering. At present, declustering remains an exploratory tool, rather than a rigorous optimization problem, and selecting an appropriate declustering method should depend on the data and problem at hand. 
    more » « less