Many proteins are composed of several domains that pack together into a complex tertiary structure. Multidomain proteins can be challenging for protein structure modeling, particularly those for which templates can be found for individual domains but not for the entire sequence. In such cases, homology modeling can generate high quality models of the domains but not for the orientations between domains. Small‐angle X‐ray scattering (SAXS) reports the structural properties of entire proteins and has the potential for guiding homology modeling of multidomain proteins. In this article, we describe a novel multidomain protein assembly modeling method, SAXSDom that integrates experimental knowledge from SAXS with probabilistic Input‐Output Hidden Markov model to assemble the structures of individual domains together. Four SAXS‐based scoring functions were developed and tested, and the method was evaluated on multidomain proteins from two public datasets. Incorporation of SAXS information improved the accuracy of domain assembly for 40 out of 46 critical assessment of protein structure prediction multidomain protein targets and 45 out of 73 multidomain protein targets from the ab initio domain assembly dataset. The results demonstrate that SAXS data can provide useful information to improve the accuracy of domain‐domain assembly. The source code and tool packages are available at
- Award ID(s):
- 1755377
- PAR ID:
- 10383757
- Editor(s):
- Rosenberg, Michael
- Date Published:
- Journal Name:
- Molecular Biology and Evolution
- Volume:
- 38
- Issue:
- 12
- ISSN:
- 1537-1719
- Page Range / eLocation ID:
- 5806 to 5818
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract https://github.com/jianlin-cheng/SAXSDom . -
null (Ed.)Abstract Background Hidden Markov models (HMM) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. The standard method used for HMM training is either by maximum likelihood using counting when sequences are labelled or by expectation maximization, such as the Baum–Welch algorithm, when sequences are unlabelled. However, increasingly there are situations where sequences are just partially labelled. In this paper, we designed a new training method based on the Baum–Welch algorithm to train HMMs for situations in which only partial labeling is available for certain biological problems. Results Compared with a similar method previously reported that is designed for the purpose of active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher accuracy when the trained models are tested for decoding with both synthetic data and real data. Conclusions A novel training method is developed to improve the training of hidden Markov models by utilizing partial labelled data. The method will impact on detecting de novo motifs and signals in biological sequence data. In particular, the method will be deployed in active learning mode to the ongoing research in detecting plasmodesmata targeting signals and assess the performance with validations from wet-lab experiments.more » « less
-
Abstract Comparative genomics and molecular phylogenetics are foundational for understanding biological evolution. Although many studies have been made with the aim of understanding the genomic contents of early life, uncertainty remains. A study by Weiss et al. (Weiss MC, Sousa FL, Mrnjavac N, Neukirchen S, Roettger M, Nelson-Sathi S, Martin WF. 2016. The physiology and habitat of the last universal common ancestor. Nat Microbiol. 1(9):16116.) identified a number of protein families in the last universal common ancestor of archaea and bacteria (LUCA) which were not found in previous works. Here, we report new research that suggests the clustering approaches used in this previous study undersampled protein families, resulting in incomplete phylogenetic trees which do not reflect protein family evolution. Phylogenetic analysis of protein families which include more sequence homologs rejects a simple LUCA hypothesis based on phylogenetic separation of the bacterial and archaeal domains for a majority of the previously identified LUCA proteins (∼82%). To supplement limitations of phylogenetic inference derived from incompletely populated orthologous groups and to test the hypothesis of a period of rapid evolution preceding the separation of the domains, we compared phylogenetic distances both within and between domains, for thousands of orthologous groups. We find a substantial diversity of interdomain versus intradomain branch lengths, even among protein families which exhibit a single domain separating branch and are thought to be associated with the LUCA. Additionally, phylogenetic trees with long interdomain branches relative to intradomain branches are enriched in information categories of protein families in comparison to those associated with metabolic functions. These results provide a new view of protein family evolution and temper claims about the phenotype and habitat of the LUCA.more » « less
-
Abstract STI1‐domains are present in a variety of co‐chaperone proteins and are required for the transfer of hydrophobic clients in various cellular processes. The domains were first identified in the yeast Sti1 protein where they were referred to as DP1 and DP2. Based on hidden Markov model searches, this domain had previously been found in other proteins including the mammalian co‐chaperone SGTA, the DNA damage response protein Rad23, and the chloroplast import protein Tic40. Here, we refine the domain definition and carry out structure‐based sequence alignment of STI1‐domains showing conservation of five amphipathic helices. Upon examinations of these identified domains, we identify a preceding helix 0 and unifying sequence properties, determine new molecular models, and recognize that STI1‐domains nearly always occur in pairs. The similarity at the sequence, structure, and molecular levels likely supports a unified functional role.
-
Abstract Summary Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences.
Availability and implementation The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.