Title: Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are grouped into the same cluster rather than split across several. Most algorithms are conservative in grouping sequences with other sequences: remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters, yielding clusters with high homogeneity but needlessly low completeness. We propose Complet+, a computationally scalable post-processing method that increases the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by substantial increases in the Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus
Award ID(s):
1936782 1936791
NSF-PAR ID:
10451100
Date Published:
Journal Name:
PeerJ
Volume:
11
ISSN:
2167-8359
Page Range / eLocation ID:
e14779
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
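
The abstract above leans on standard external clustering metrics: homogeneity, completeness, V-measure, AMI, and ARI. As a minimal sketch of how such an evaluation is typically computed, the snippet below scores a toy over-split clustering and a merged one with scikit-learn; the label arrays are illustrative stand-ins, not data from the paper, and Complet+ itself is not shown.

```python
# Toy evaluation with scikit-learn's external clustering metrics; the labels
# here are hypothetical, standing in for SCOPe classes vs. predicted clusters.
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

true_labels = [0, 0, 0, 1, 1, 2, 2, 2]   # e.g., SCOPe families
pred_split  = [0, 0, 3, 1, 4, 2, 2, 5]   # over-split: homogeneous but incomplete
pred_merged = [0, 0, 0, 1, 1, 2, 2, 2]   # after merging closely related clusters

for name, pred in [("over-split", pred_split), ("merged", pred_merged)]:
    print(
        f"{name}: "
        f"hom={homogeneity_score(true_labels, pred):.2f} "
        f"com={completeness_score(true_labels, pred):.2f} "
        f"V={v_measure_score(true_labels, pred):.2f} "
        f"AMI={adjusted_mutual_info_score(true_labels, pred):.2f} "
        f"ARI={adjusted_rand_score(true_labels, pred):.2f}"
    )
```

In this toy, merging the spurious singleton clusters raises completeness (and with it V-measure, AMI, and ARI) while leaving homogeneity at 1.0, which is exactly the trade-off the abstract describes.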
More Like this
  1. Abstract Clustering is a fundamental feature of earthquakes that impacts basic and applied analyses of seismicity. Events included in the existing short-duration instrumental catalogs are concentrated strongly within a very small fraction of the space–time volume, a concentration further amplified by activity associated with the largest recorded events. The earthquakes included in instrumental catalogs are therefore unlikely to be fully representative of the long-term behavior of regional seismicity. We illustrate this and other aspects of space–time earthquake clustering, and propose a quantitative clustering measure based on the receiver operating characteristic (ROC) diagram. The proposed approach makes it possible to eliminate the effects of marginal space and time inhomogeneities, related to the geometry of the fault network and region-wide changes in earthquake rates, and to quantify coupled space–time variations that include aftershocks, swarms, and other forms of clusters. The proposed measure is used to quantify and compare earthquake clustering in southern California, the western United States, the central and eastern United States, Alaska, Japan, and epidemic-type aftershock sequence (ETAS) model results. All examined cases show a high degree of coupled space–time clustering, with the marginal space clustering dominating the marginal time clustering. Declustering earthquake catalogs can help clarify long-term aspects of regional seismicity and increase the signal-to-noise ratio of effects that are subtler than the strong clustering signatures. We illustrate how the high coupled space–time clustering can be decreased or removed using a data-adaptive, parsimonious nearest-neighbor declustering approach, and emphasize basic unresolved issues concerning the proper outcome and quality metrics of declustering. At present, declustering remains an exploratory tool rather than a rigorous optimization problem, and the choice of declustering method should depend on the data and problem at hand.
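
One plausible way to realize an ROC-style clustering measure, sketched below under stated assumptions (this is an illustrative construction, not necessarily the paper's exact one): partition the region into equal space–time cells, rank cells by event count, and trace the cumulative fraction of events captured against the cumulative fraction of cells inspected. A uniform catalog tracks the diagonal, while clustered seismicity bows the curve upward, so the area under the curve serves as a scalar clustering score.

```python
# Hedged sketch of an ROC-style clustering score over equal space-time cells.
# The cell counts are synthetic; real use would bin an earthquake catalog first.
import numpy as np

def clustering_curve(cell_counts):
    """Cumulative event share vs. cell share, busiest cells first."""
    counts = np.sort(np.asarray(cell_counts, dtype=float))[::-1]
    frac_events = np.cumsum(counts) / counts.sum()
    frac_cells = np.arange(1, counts.size + 1) / counts.size
    return frac_cells, frac_events

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=100)   # near-uniform background...
counts[:5] += 200                     # ...plus five cells holding most events
x, y = clustering_curve(counts)
auc = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))  # trapezoidal area
print(f"clustering score (0.5 = uniform, -> 1 = clustered): {auc:.2f}")
```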
  2. Abstract

    Foreshocks are the only currently widely identified precursory seismic behavior, yet their utility and even their identifiability are problematic, in part because of extreme variation in behavior. Here, we establish some global trends that help identify the expected frequency of foreshocks as well as the type of earthquake most prone to foreshocks. We establish these tendencies using the global earthquake catalog of the U.S. Geological Survey National Earthquake Information Center, with a completeness level of magnitude 5 and mainshocks with Mw≥7.0. Foreshocks are identified using three clustering algorithms to address the challenge of distinguishing foreshocks from background activity. The methods give a range of 15%–43% of large mainshocks having at least one foreshock, but a narrower range of 13%–26% having at least one foreshock with magnitude within two units of the mainshock magnitude. These observed global foreshock rates are similar to regional values for a completeness level of magnitude 3 using the same detection conditions. The foreshock sequences have distinctive characteristics: the global composite population b-values are lower for foreshocks than for aftershocks, an attribute also manifested in synthetic catalogs computed with the epidemic-type aftershock sequence (ETAS) model, which intrinsically involves only cascading processes. Focal-mechanism similarity of foreshocks relative to mainshocks is more pronounced than for aftershocks. Despite these distinguishing characteristics of foreshock sequences, the conditions that promote high foreshock productivity are similar to those that promote high aftershock productivity. For instance, a modestly higher percentage of interplate mainshocks have foreshocks than intraplate mainshocks, and reverse-faulting mainshocks have foreshocks slightly more commonly than normal- or strike-slip-faulting mainshocks. The western circum-Pacific is prone to slightly more foreshock activity than the eastern circum-Pacific.
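
The b-value contrast mentioned above is conventionally estimated with Aki's (1965) maximum-likelihood formula, b = log10(e) / (M̄ − Mc), where M̄ is the mean magnitude of events at or above the completeness magnitude Mc; for magnitudes binned at width Δm, Utsu's correction replaces Mc with Mc − Δm/2. A short sketch on a synthetic Gutenberg–Richter catalog follows (the catalog is illustrative, not the abstract's data).

```python
# Aki (1965) maximum-likelihood b-value estimate on a toy catalog.
import numpy as np

def b_value(mags, mc, dm=0.0):
    """b = log10(e) / (mean(M) - Mc); set dm to the bin width for binned
    catalogs to apply Utsu's half-bin correction."""
    mags = np.asarray(mags, dtype=float)
    mags = mags[mags >= mc]
    return np.log10(np.e) / (mags.mean() - (mc - dm / 2.0))

# Continuous Gutenberg-Richter magnitudes with true b = 1.0 above Mc = 5.0:
# exceedances above Mc are exponential with scale log10(e) / b.
rng = np.random.default_rng(1)
mags = 5.0 + rng.exponential(scale=np.log10(np.e), size=5000)
print(f"estimated b-value: {b_value(mags, mc=5.0):.2f}")  # expect ~1.0
```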

     
  3. Modern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for complex representations that can capture, for instance, edges of different types as well as so-called “higher-order interactions” that involve more than two nodes at a time. However, we have comparatively few rigorous methods that can provide insight from such representations. Here, we develop a computational framework for the problem of clustering hypergraphs with categorical edge labels (i.e., different interaction types), where clusters correspond to groups of nodes that frequently participate in the same type of interaction. Our methodology is based on a combinatorial objective function that is related to correlation clustering on graphs but enables the design of much more efficient algorithms that also seamlessly generalize to hypergraphs. When there are only two label types, our objective can be optimized in polynomial time using an algorithm based on minimum cuts. Minimizing our objective becomes NP-hard with more than two label types, but we develop fast approximation algorithms based on linear programming relaxations that come with theoretical cluster-quality guarantees. We demonstrate the efficacy of our algorithms and the scope of the model through problems in edge-label community detection, clustering with temporal data, and exploratory data analysis.
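
As one hedged reading of that objective (an illustration, not the authors' code): assign every node to a cluster named after a label type, and charge a unit penalty for each hyperedge whose label is not shared by the clusters of all of its nodes; the clustering problem is then to pick the assignment minimizing the total penalty.

```python
# Count "mistake" hyperedges for a candidate assignment under the objective
# sketched above; the data structures and names are illustrative.
def edge_label_mistakes(hyperedges, assignment):
    """hyperedges: iterable of (nodes, label); assignment: node -> label."""
    return sum(
        1
        for nodes, label in hyperedges
        if any(assignment[v] != label for v in nodes)
    )

# Toy hypergraph with two interaction types, "a" and "b".
edges = [
    (frozenset({1, 2, 3}), "a"),
    (frozenset({3, 4}), "a"),
    (frozenset({4, 5, 6}), "b"),
]
assignment = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
print(edge_label_mistakes(edges, assignment))  # node 4 breaks the ({3, 4}, "a") edge -> 1
```

With only two label types, this kind of minimization can be solved via minimum cuts, which is why it stays polynomial there, per the abstract.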
  4. Clustering is a fundamental unsupervised learning problem in which a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster in order to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the “price of fairness,” can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives, the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective, which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.
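
To make the flavor of those objectives concrete, here is a toy scoring sketch (a construction for illustration only; the paper's formal definitions differ in detail): measure, for each cluster, the worst deviation of any color's local share from its global share, then either sum the per-cluster gaps (a utilitarian-style aggregate) or take the maximum (an egalitarian-style one).

```python
# Toy equality-of-representation scores for a fixed clustering; illustrative only.
from collections import Counter

def representation_gaps(clusters, colors):
    """clusters: list of lists of point ids; colors: point id -> group label."""
    n = sum(len(c) for c in clusters)
    global_share = {g: k / n for g, k in Counter(colors.values()).items()}
    gaps = []
    for cluster in clusters:
        local = Counter(colors[p] for p in cluster)
        gaps.append(
            max(abs(local.get(g, 0) / len(cluster) - s) for g, s in global_share.items())
        )
    return gaps

clusters = [[1, 2, 3, 4], [5, 6, 7, 8]]
colors = {1: "red", 2: "red", 3: "blue", 4: "blue",
          5: "red", 6: "blue", 7: "blue", 8: "blue"}
gaps = representation_gaps(clusters, colors)
print(f"utilitarian-style (sum of gaps): {sum(gaps):.2f}")
print(f"egalitarian-style (max gap):     {max(gaps):.2f}")
```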