Title: Cryo-EM heterogeneity analysis using regularized covariance estimation and kernel regression
Proteins and the complexes they form are central to nearly all cellular processes. Their flexibility, expressed through a continuum of states, provides a window into their biological functions. Cryogenic electron microscopy (cryo-EM) is an ideal tool to study these dynamic states as it captures specimens in noncrystalline conditions and enables high-resolution reconstructions. However, analyzing the heterogeneous distributions of conformations from cryo-EM data is challenging. We present RECOVAR, a method for analyzing these distributions based on principal component analysis (PCA) computed using a REgularized COVARiance estimator. RECOVAR is fast, robust, interpretable, expressive, and competitive with state-of-the-art neural network methods on heterogeneous cryo-EM datasets. The regularized covariance method efficiently computes a large number of high-resolution principal components that can encode rich heterogeneous distributions of conformations and does so robustly thanks to an automatic regularization scheme. The reconstruction method based on adaptive kernel regression resolves conformational states to a higher resolution than all other tested methods on extensive independent benchmarks while remaining highly interpretable. Additionally, we exploit favorable properties of the PCA embedding to estimate the conformational density accurately. This density allows for better interpretability of the latent space by identifying stable states and low free-energy motions. Finally, we present a scheme to navigate the high-dimensional latent space by automatically identifying these low free-energy trajectories. We make the code freely available at https://github.com/ma-gilles/recovar.
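The general idea of PCA computed from a regularized covariance, followed by density estimation in the resulting embedding, can be illustrated on toy data. The sketch below is not the RECOVAR implementation: it assumes plain vectors rather than cryo-EM images, replaces the paper's automatic regularization with a fixed shrinkage weight toward a scaled identity, and uses a fixed-bandwidth Gaussian kernel density estimate. The function names and the parameters alpha and bandwidth are hypothetical, chosen only for illustration.

```python
import numpy as np

def shrinkage_covariance(X, alpha):
    """Sample covariance shrunk toward a scaled identity (0 <= alpha <= 1).

    Illustrative stand-in for a regularized covariance estimator;
    alpha is fixed by hand here, not chosen automatically.
    """
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (X.shape[0] - 1)
    mu = np.trace(S) / S.shape[0]  # average variance per dimension
    return (1.0 - alpha) * S + alpha * mu * np.eye(S.shape[0])

def pca_embed(X, cov, k):
    """Project centered data onto the top-k eigenvectors of cov."""
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest
    components = evecs[:, order]
    Z = (X - X.mean(axis=0)) @ components
    return Z, components, evals[order]

def gaussian_kde(Z, query, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate at the query points."""
    d = Z.shape[1]
    sq_dist = ((query[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * bandwidth**2) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * bandwidth**2)).mean(axis=1) / norm

# Toy data: two "states" separated along the first coordinate, plus noise.
rng = np.random.default_rng(0)
n, p = 400, 20
direction = np.zeros(p)
direction[0] = 1.0
labels = rng.integers(0, 2, size=n)
X = (labels[:, None] * 6.0 - 3.0) * direction + rng.normal(size=(n, p))

cov = shrinkage_covariance(X, alpha=0.1)
Z, components, evals = pca_embed(X, cov, k=2)

# Density in the 2-D embedding at the two state centers and at the midpoint.
query = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 0.0]])
density = gaussian_kde(Z, query, bandwidth=0.5)
```

In this toy setting the first principal component aligns with the direction separating the two states, and the estimated density is higher near the two state centers than at the midpoint between them, which is the sense in which a density over the embedding can point to stable states.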
Award ID(s):
2009753
PAR ID:
10573937
Author(s) / Creator(s):
;
Publisher / Repository:
Proceedings of the National Academy of Sciences
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
122
Issue:
9
ISSN:
0027-8424
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. New X‐ray crystallography and cryo‐electron microscopy (cryo‐EM) approaches yield vast amounts of structural data from dynamic proteins and their complexes. Modeling the full conformational ensemble can provide important biological insights, but identifying and modeling an internally consistent set of alternate conformations remains a formidable challenge. qFit efficiently automates this process by generating a parsimonious multiconformer model. We refactored qFit from a distributed application into software that runs efficiently on a small server, desktop, or laptop. We describe the new qFit 3 software and provide some examples. qFit 3 is open‐source under the MIT license, and is available at https://github.com/ExcitedStates/qfit-3.0.
  2. Background: Estimating and accounting for hidden variables is widely practiced as an important step in molecular quantitative trait locus (molecular QTL, henceforth “QTL”) analysis for improving the power of QTL identification. However, few benchmark studies have been performed to evaluate the efficacy of the various methods developed for this purpose. Results: Here we benchmark popular hidden variable inference methods including surrogate variable analysis (SVA), probabilistic estimation of expression residuals (PEER), and hidden covariates with prior (HCP) against principal component analysis (PCA), a well-established dimension reduction and factor discovery method, via 362 synthetic and 110 real data sets. We show that PCA not only underlies the statistical methodology behind the popular methods but is also orders of magnitude faster, better-performing, and much easier to interpret and use. Conclusions: To help researchers use PCA in their QTL analysis, we provide an R package along with a detailed guide, both of which are freely available at https://github.com/heatherjzhou/PCAForQTL. We believe that using PCA rather than SVA, PEER, or HCP will substantially improve and simplify hidden variable inference in QTL mapping as well as increase the transparency and reproducibility of QTL research.
  3. Principal component analysis (PCA) plays an important role in the analysis of cryo-electron microscopy (cryo-EM) images for various tasks such as classification, denoising, compression, and ab initio modeling. We introduce a fast method for estimating a compressed representation of the 2-D covariance matrix of noisy cryo-EM projection images affected by radial point spread functions that enables fast PCA computation. Our method is based on a new algorithm for expanding images in the Fourier–Bessel basis (the harmonics on the disk), which provides a convenient way to handle the effect of the contrast transfer functions. For $N$ images of size $L \times L$, our method has time complexity $O(NL^3 + L^4)$ and space complexity $O(NL^2 + L^3)$. In contrast to previous work, these complexities are independent of the number of different contrast transfer functions of the images. We demonstrate our approach on synthetic and experimental data and show acceleration by factors of up to two orders of magnitude.
  4. Single-particle cryogenic electron microscopy (cryo-EM) has revolutionized the field of structural biology, providing access to atomic-resolution structures of large biomolecular complexes in their near-native environment. Today’s cryo-EM maps frequently reach atomic-level resolution, while often containing a range of resolutions, with conformationally variable regions obtained at 6 Å or worse. Low-resolution density maps obtained for flexible protein domains, as well as the ensemble of coexisting conformational states arising from cryo-EM, pose new challenges and opportunities for Molecular Dynamics (MD) simulations. With the ability to describe biomolecular dynamics at the atomic level, MD can extend the capabilities of cryo-EM, capturing the conformational variability and predicting biologically relevant short-lived conformational states. Here, we report on the state-of-the-art MD procedures that are currently used to refine, reconstruct and interpret cryo-EM maps. We show the capability of MD to predict short-lived conformational states, which were remarkably confirmed by subsequently solved cryo-EM structures. This was the case for the CRISPR-Cas9 genome editing machinery, whose catalytically active structure was predicted through both long-timescale MD and enhanced sampling techniques 2 years before it was resolved by cryo-EM. In summary, this contribution highlights the ability of MD to complement cryo-EM, describing conformational landscapes and relating structural transitions to function, ultimately discerning relevant short-lived conformational states and providing mechanistic knowledge of biological function.
  5. Principal Component Analysis (PCA) is a standard dimensionality reduction technique, but it treats all samples uniformly, making it suboptimal for heterogeneous data that are increasingly common in modern settings. This paper proposes a PCA variant for samples with heterogeneous noise levels, i.e., heteroscedastic noise, that naturally arise when some of the data come from higher quality sources than others. The technique handles heteroscedasticity by incorporating it in the statistical model of a probabilistic PCA. The resulting optimization problem is an interesting nonconvex problem related to but not seemingly solved by singular value decomposition, and this paper derives an expectation maximization (EM) algorithm. Numerical experiments illustrate the benefits of using the proposed method to combine samples with heteroscedastic noise in a single analysis, as well as benefits of careful initialization for the EM algorithm. Index Terms— Principal component analysis, heterogeneous data, maximum likelihood estimation, latent factors 