<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/18/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10167313</idno>
					<idno type="doi">10.1093/bioinformatics/btz863</idno>
					<title level='j'>Bioinformatics</title>
<idno>1367-4803</idno>
<biblScope unit="volume">36</biblScope>
<biblScope unit="issue">7</biblScope>					

					<author>Chengxin Zhang</author><author>Wei Zheng</author><author>S M Mortuza</author><author>Yang Li</author><author>Yang Zhang</author><author>Alfonso Valencia</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Motivation: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved.
Results: We developed DeepMSA, a new open-source method for sensitive MSA construction, in which homologous sequences and alignments are created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts of up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. All these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library.
Availability and implementation: https://zhanglab.ccmb.med.umich.edu/DeepMSA/.
Supplementary information: Supplementary data are available at Bioinformatics online.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Multiple sequence alignment (MSA), also called 'sequence profile', is designed to collect and align multiple homologous sequences of a query protein of interest. Since it contains rich information about the evolutionarily conserved positions and motifs, which cannot be derived from the query sequence alone, it has found fundamental usefulness in various bioinformatics studies. In protein structure prediction, e.g. the MSA is the primary component to derive local secondary structure (SS) features <ref type="bibr">(Jones, 1999;</ref><ref type="bibr">Wu and Zhang, 2008)</ref>, residue-residue contacts <ref type="bibr">(Adhikari et al., 2018;</ref><ref type="bibr">Hanson et al., 2018;</ref><ref type="bibr">He et al., 2017;</ref><ref type="bibr">Wang et al., 2017)</ref> and homologous structural templates <ref type="bibr">(Soding, 2005;</ref><ref type="bibr">Wu and Zhang, 2008;</ref><ref type="bibr">Zheng et al., 2019)</ref>; these are of critical importance for the full-length 3D structure constructions <ref type="bibr">(Ovchinnikov et al., 2018;</ref><ref type="bibr">Zhang et al., 2018)</ref>. In protein function annotations, the use of MSAs also has major impacts on the accuracy of Gene Ontology <ref type="bibr">(Cozzetto et al., 2016;</ref><ref type="bibr">Zhang et al., 2017)</ref> and ligand-binding site <ref type="bibr">(Gil and Fiser, 2019;</ref><ref type="bibr">Yang et al., 2013)</ref> predictions.</p><p>Due to the critical importance of MSA, much attention has been paid to the development of various MSA and sequence profile construction methods. 
While PSI-BLAST is one of the most widely used approaches to query-specific sequence profile generation <ref type="bibr">(Altschul et al., 1997)</ref>, HHblits <ref type="bibr">(Remmert et al., 2012)</ref> from the HH-suite <ref type="bibr">(Steinegger et al., 2019)</ref> has recently become popular for profile hidden Markov model (HMM) construction. Meanwhile, the Jackhmmer and HMMsearch tools from the HMMER suite <ref type="bibr">(Eddy, 1998)</ref> are common alternatives for the same applications. Both lines of programs have been heavily used, especially for contact prediction, which has recently been found critical for template-free (or ab initio) protein structure prediction <ref type="bibr">(Ovchinnikov et al., 2017;</ref><ref type="bibr">Schaarschmidt et al., 2018;</ref><ref type="bibr">Wu et al., 2011)</ref>. Most recently, a hybrid MSA generation approach combining HHblits and Jackhmmer searches was shown to improve contact prediction by MetaPSICOV2 <ref type="bibr">(Buchan and Jones, 2018)</ref>. There is also evidence that MSAs collected from metagenome protein sequences can increase the coverage of sequence homologies and be useful for contact-assisted de novo structure prediction <ref type="bibr">(Ovchinnikov et al., 2017;</ref><ref type="bibr">Wang et al., 2019)</ref>.</p><p>Despite the importance of MSA construction, few standalone pipelines exist that can efficiently generate sensitive MSAs from a query input sequence, especially when multiple large sequence databases are involved. To address this urgent need, we developed and released DeepMSA, a new open-source program that constructs deep (in the sense of more sequences with a high diversity) and sensitive MSAs by merging sequences from three whole-genome and metagenome databases through a hybrid homology-detection approach. 
In this approach, HHblits from HH-suite 2.0.16 <ref type="bibr">(Steinegger et al., 2019)</ref> and Jackhmmer/HMMsearch, which were modified from the HMMER 3.1b2 <ref type="bibr">(Eddy, 1998)</ref> package to make the output format more compact and thereby reduce file input/output, are used to perform the homologous sequence search, and the alignments are further refined by a custom HHblits database reconstruction step. Large-scale benchmark experiments have shown that, compared to the widely used HHblits, PSI-BLAST and Jackhmmer programs, DeepMSA consistently improves the accuracy of contact prediction, SS prediction and threading, which is particularly important for distant-homology proteins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Materials and methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Counting the number of effective sequences in MSAs</head><p>A common approach to quantifying the homologous sequence coverage and/or alignment depth of an MSA is to count the normalized number of effective sequences (Nf):</p><p><formula notation="TeX">Nf = \frac{1}{\sqrt{L}} \sum_{n=1}^{N} \frac{1}{1 + \sum_{m=1,\, m \neq n}^{N} I\left[ S_{m,n} \geq 0.8 \right]}</formula></p><p>where L is the length of the query protein, N is the number of sequences in the MSA, S_{m,n} is the sequence identity between the mth and nth sequences and I[·] is an Iverson bracket, i.e. I[S_{m,n} ≥ 0.8] equals 1 if S_{m,n} ≥ 0.8, and 0 otherwise. While the current literature lacks consensus on the ideal Nf for contact prediction, we set the Nf cutoff to 128, a value optimized for accurate contact prediction, as discussed later. An example illustrating the mathematical meaning of Nf is shown in Supplementary Figure <ref type="figure">S1</ref>.</p></div>
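<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Nf definition above can be illustrated with a minimal Python sketch. It is not the DeepMSA implementation: the function names are hypothetical, the MSA is assumed to be a list of equal-length aligned strings with the query as the first row, and sequence identity is assumed to be the number of identical non-gap columns divided by the alignment length, which the text does not specify.</p>
<eg xml:space="preserve">
```python
import math

GAP = "-"


def sequence_identity(a, b):
    """Assumed identity measure: identical non-gap columns / alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    return matches / len(a)


def normalized_neff(msa, identity_cutoff=0.8):
    """Nf = (1/sqrt(L)) * sum over n of 1 / (1 + number of other
    sequences m whose identity to sequence n is >= identity_cutoff)."""
    if not msa:
        raise ValueError("empty MSA")
    query_length = len(msa[0])  # L: the first row is assumed to be the query
    total = 0.0
    for n, seq_n in enumerate(msa):
        # Inner sum of the Iverson bracket over all m != n.
        neighbours = sum(
            1
            for m, seq_m in enumerate(msa)
            if m != n and sequence_identity(seq_m, seq_n) >= identity_cutoff
        )
        total += 1.0 / (1 + neighbours)
    return total / math.sqrt(query_length)


if __name__ == "__main__":
    toy_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"]
    print(normalized_neff(toy_msa))
```
</eg>
<p>In the toy alignment, the two identical rows each contribute 1/2 and the unrelated row contributes 1, so the weighted count is 2 and Nf is 2 divided by the square root of 10. The pairwise loop is quadratic in N, so a production tool would need a faster scheme for deep alignments.</p></div>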
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">DeepMSA pipeline for MSA construction</head><p>The MSA construction process in DeepMSA can be divided into three stages, which correspond to searches of three sequence databases [Uniclust30 <ref type="bibr">(Mirdita et al., 2017)</ref>, UniRef90 <ref type="bibr">(Suzek et al., 2015)</ref> and Metaclust <ref type="bibr">(Steinegger and Söding, 2018)</ref>] through a combination of the HH-suite and HMMER programs (Fig. <ref type="figure">1</ref>).</p><p>In Stage 1 (Fig. <ref type="figure">1</ref> first column), HHblits from HH-suite 2.0.16 is used to search Uniclust30 with the parameters '-diff inf -id 99 -cov 50 -n 3'. After testing HHblits MSAs generated using the last version of UniProt20 (2016_02), the latest Uniboost30 (2016_09) and three recent versions of Uniclust30 (2017_04, 2017_07, 2017_10), we found that the three versions of Uniclust30 generate MSAs of comparable quality, all with a higher contact prediction accuracy than MSAs generated from either UniProt20 or Uniboost30. Therefore, an arbitrarily chosen Uniclust30 version (2017_10) is used for this study.</p><p>If Stage 1 does not generate enough sequences, i.e. Nf &lt; 128, Stage 2 is performed (Fig. <ref type="figure">1</ref> second column), where Jackhmmer is used to search against UniRef90 with parameters '-N 3 -E 10 -incE 1e-3'. We choose '-E 10' because lowering this e-value cutoff occasionally results in the inclusion of an excessive number of non-homologous multi-domain hits in edge cases, although the final number of significant hits in the Jackhmmer alignment is determined by '-incE'. Instead of directly using the alignment generated by the Jackhmmer search, esl-sfetch from the HMMER package is used to extract full-length sequences according to the list of Jackhmmer hits. These full-length sequences are converted into a custom HHblits format database by the 'hhblitsdb.pl' script from HH-suite. 
After the construction of the custom database, HHblits is again applied to search this custom database using the same search parameters as in Stage 1 but jump-starting the search from the Stage 1 MSA. If the MSA from Stage 2 has a higher Nf than the Stage 1 MSA, it replaces the Stage 1 MSA for subsequent computation.</p><p>DeepMSA implements two time-saving heuristics to reduce the time needed to construct an HHblits format database, which, unlike conventional sequence databases, consists of sequence profiles. Each profile can be either one sequence or one MSA within a family of protein sequences clustered by sequence identity. The time required to construct a profile database is proportional to the number of profiles and the average number of positions per profile. It may take many hours to construct a custom HHblits database if the sequences are very long or if there are too many sequences. To shorten the time for database construction, we trim the Jackhmmer hits and perform sequence clustering. In particular, instead of using the full-length Jackhmmer hit, we trim the hit to extract the local region aligned to the query in the Jackhmmer alignment, as well as the L flanking residues on both sides of the aligned region. Moreover, all trimmed hits from the previous step are further clustered by kClust <ref type="bibr">(Hauser et al., 2013)</ref> into sequence clusters with a 30% sequence identity cutoff. Clustal Omega <ref type="bibr">(Sievers et al., 2011)</ref> is then used to align the sequences within each cluster into aligned sequence profiles. These profiles are fed into hhblitsdb.pl to construct the custom HHblits database. 
As kClust and Clustal Omega usually take only a few minutes, and the number of sequences is approximately 10 times larger than the number of kClust sequence clusters, it takes less than half an hour to construct the custom database.</p><p>If the MSA from the previous stages still has Nf &lt; 128, Stage 3 is performed (Fig. <ref type="figure">1</ref> third column), where the MSA from the previous stage is converted into an HMM by HMMbuild from the HMMER package. This HMM is searched against the Metaclust metagenome sequence database by HMMsearch, using parameters '-E 10 -incE 1e-3'. Similar to Stage 2, sequence hits from HMMsearch are built into a custom HHblits database. The MSA from the previous stages is used to jump-start an HHblits search against this new custom HHblits database to derive the final Stage 3 MSA.</p></div>
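<div xmlns="http://www.tei-c.org/ns/1.0"><p>The staged control flow and the hit-trimming heuristic described above can be sketched as follows. This is an illustrative outline, not the DeepMSA source: the names trim_hit and run_cascade are hypothetical, each stage is modeled as a caller-supplied function that wraps the actual HHblits, Jackhmmer or HMMsearch runs, alignment coordinates are assumed to be 0-based and end-exclusive, and keeping a later MSA only when its Nf is higher is stated in the text for Stage 2 and assumed here for Stage 3.</p>
<eg xml:space="preserve">
```python
NF_CUTOFF = 128  # Nf threshold quoted in the text


def trim_hit(full_sequence, ali_start, ali_end, query_length):
    """Keep the region aligned to the query plus up to L flanking
    residues on each side (L = query_length).

    ali_start and ali_end are assumed 0-based, end-exclusive coordinates
    of the aligned region within the full-length hit.
    """
    start = max(0, ali_start - query_length)
    end = min(len(full_sequence), ali_end + query_length)
    return full_sequence[start:end]


def run_cascade(stages, nf_cutoff=NF_CUTOFF):
    """Run stages in order until the best MSA reaches nf_cutoff.

    Each stage is a callable taking the current best MSA (None for the
    first stage) and returning a tuple (msa, nf). A later MSA replaces
    the current one only if its Nf is higher (assumption for Stage 3).
    """
    best_msa, best_nf = None, float("-inf")
    for stage in stages:
        msa, nf = stage(best_msa)
        if nf > best_nf:
            best_msa, best_nf = msa, nf
        if best_nf >= nf_cutoff:
            break  # deep enough: skip the remaining, costlier stages
    return best_msa, best_nf


if __name__ == "__main__":
    # Dummy stages standing in for the three database searches.
    demo = [lambda prev: ("stage1", 50.0), lambda prev: ("stage2", 200.0)]
    print(run_cascade(demo))
    print(trim_hit("AAAAACDEGGGGG", 5, 8, 2))
```
</eg>
<p>Passing the stages as callables keeps the early-exit logic separate from the external tools, so the same loop can be exercised with dummy stages before the real database searches are wired in.</p></div>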
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dataset</head><p>DeepMSA is tested on a set of 614 non-redundant proteins curated from the SCOPe database <ref type="bibr">(Hubbard et al., 2010)</ref> according to the following criteria: (i) any target coming from a fold with only one superfamily is excluded, because such a target is unlikely to have any remote structure analog; (ii) redundant sequences with a 30% pair-wise sequence identity are removed; (iii) each query should have at least one template structure, detectable by TM-align <ref type="bibr">(Zhang and Skolnick, 2005)</ref>, from the PDB which has a TM-score &gt;0.5 with the sequence identity &lt;0.3 to the query. These resulted in 614 proteins, which are classified into 403 'Easy' and 211 'Hard' targets by the meta-threading program, LOMETS <ref type="bibr">(Wu and Zhang, 2007)</ref>, based on the significance of threading alignments between query and template sequences. While our discussions are mainly focused on the 'Hard' targets which DeepMSA aims to address, the results for the 'Easy' targets are listed in the Supplementary Material for the completeness of comparisons.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Coverage and depth of MSAs by DeepMSA</head><p>Since one of the initial motivations for combining sequences from different sequence databases in DeepMSA is to collect more diverse sequences, it is instructive to examine the coverage and depth of the MSAs produced by DeepMSA. To this end, Table <ref type="table">1</ref> lists the depth results of MSAs generated by six different schemes, including DeepMSA, its three stages and three baseline methods. Here, to obtain data for the different stages, we force DeepMSA to perform all three stages regardless of the Nf cutoff. Nevertheless, the final MSA in DeepMSA is selected by the normal procedure, i.e. the MSA is taken from Stage 1 if its Nf ≥ 128; or from Stage 2 if Stage 1 has Nf &lt; 128 but Stage 2 has Nf ≥ 128; or from Stage 3 otherwise. Two of the baseline methods generate MSAs by Jackhmmer or PSI-BLAST search against the same UniRef90 database as used by DeepMSA. For the last baseline method, denoted as 'No custom db' in Table <ref type="table">1</ref>, the custom HHblits database construction and HHblits search in Stages 2 and 3 are replaced by direct concatenation of HMMER (Jackhmmer and HMMsearch) MSAs to the MSA from the previous stage, similar to the approach reported earlier <ref type="bibr">(Ovchinnikov et al., 2017)</ref>.</p><p>As expected, the alignment depth, when measured by Nf and the total number of detected sequences, gradually increases from Stage 1 to Stage 3. The increase is particularly large for 'Hard' targets, where the final MSAs from DeepMSA are on average 1.5 and 1.8 times deeper than Stage 1 in terms of Nf and number of sequences, respectively. On the other hand, the alignment depth of DeepMSA is significantly smaller than that of 'No custom db' and 'Jackhmmer'. 
This is because all HMMER hits are included in the 'No custom db' and 'Jackhmmer' alignments, while many HMMER hits are discarded by DeepMSA during HHblits search through custom databases.</p><p>It should be noted that the full-length MSA constructions often cost more memory and slow down the computing processes. Moreover, due to the composite profile construction and alignment algorithms, MSAs with greater Nf and sequence numbers do not necessarily indicate better MSA quality, as shown in later sections. In fact, there is no single index which can directly assess the performance of MSA collection programs. To more objectively assess the quality of MSA builders, below we apply these MSAs to three protein structure modeling experiments, i.e. residue contact prediction, SS prediction and protein fold-recognition (i.e. threading).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">DeepMSA increases contact prediction accuracy</head><p>The utility of DeepMSA for contact prediction is assessed using six state-of-the-art programs: CCMpred <ref type="bibr">(Seemayer et al., 2014)</ref>, MetaPSICOV2 <ref type="bibr">(Buchan and Jones, 2018)</ref>, DeepContact <ref type="bibr">(Liu et al., 2018)</ref>, DeepCov <ref type="bibr">(Jones and Kandathil, 2018)</ref>, PConsC4 <ref type="bibr">(Michel et al., 2018)</ref> and TripletRes <ref type="bibr">(Li et al., 2019)</ref>. Here, CCMpred is a representative coevolution-only contact predictor. MetaPSICOV2 is based on traditional (shallow and fully-connected) neural networks. The rest of the programs are based on deep convolutional neural networks. While other predictors with good performance also exist, we selected these six programs partly because of the availability of standalone packages, which facilitates large-scale implementation and comparison of the results.</p><p>In Table <ref type="table">2</ref>, we list the results of contact predictions by the six predictors, each using the MSAs collected by the six schemes listed in Table <ref type="table">1</ref>. Since MetaPSICOV2 and DeepContact have their own built-in MSA generation protocols, both of which combine HHblits and Jackhmmer, contact precisions from the built-in MSAs are listed as 'default' in Table <ref type="table">2</ref>.</p><p>Here, as in the community-wide Critical Assessment of protein Structure Prediction (CASP) challenges <ref type="bibr">(Schaarschmidt et al., 2018)</ref>, a contact is defined as a pair of residues, i and j, whose Cβ atoms (Cα atoms for glycine) are &lt;8 Å apart. Contact prediction accuracies of different methods are evaluated by the precisions of the top L, L/2 and L/5 medium-range (12 ≤ |i − j| ≤ 23) and long-range (|i − j| ≥ 24) predicted contacts. 
In accordance with CASP convention, Table <ref type="table">2</ref> only lists the long-range contacts of 'Hard' targets. For completeness, the results for medium-range contacts for all targets ('Hard' and 'Easy') are listed in Supplementary Table <ref type="table">S1</ref>. We also provide a spreadsheet file with per-target assessment results in Supplementary Table <ref type="table">S4</ref>.</p><p>The MSA from DeepMSA outperforms the default MSA for contact prediction in all six contact predictors. For instance, the precisions for the top L contacts generated by TripletRes and CCMpred increased by 2.7 and 24.4%, respectively, when they use the MSA from DeepMSA instead of the default MSA. Furthermore, contact precision improves progressively from Stage 1 to Stage 3 for all the programs, indicating that deeper MSAs benefit contact prediction. Contact precisions from DeepMSA are also consistently higher than those from HHblits (i.e. Stage 1), Jackhmmer and PSI-BLAST alone.</p><p>We note that the output MSA of DeepMSA is not created from Stage 3 if either of the previous two stages achieves Nf ≥ 128, which helps to save the memory and running time of DeepMSA. Interestingly, this setting does not degrade contact precision significantly for most predictors. In fact, for TripletRes and DeepCov, the MSA from DeepMSA yields slightly better contact precision than the MSA from DeepMSA Stage 3. Figure <ref type="figure">2</ref> shows the effect of the Nf cutoff in DeepMSA on the precision of contact prediction, where, for all but one program (CCMpred), increasing the Nf cutoff beyond 128 brings no obvious improvement in contact precision. In other words, when the alignment is already deep (Nf ≥ 128), including more sequences is not beneficial for any of the five neural network-based contact predictors. 
This might be because deeper MSAs are more prone to contain alignment errors and false positive hits, so that the cutoff of Nf = 128 might be the result of a tradeoff between sequence coverage and alignment noise. Moreover, this result may also suggest that the sequence datasets from the standard Uniclust30 utilized in Stage 1 are more reliable than the UniRef90 and metagenomic databases, and thus the addition of more sequences from the latter datasets might tend to introduce more noise.</p><p>Note to Table <ref type="table">2</ref>: Bold font indicates the highest value in each category. The standard deviation of the average precision is presented in Supplementary Table <ref type="table">S4</ref>. *Each P-value is calculated by a one-tailed paired t-test to test whether DeepMSA has significantly higher contact prediction accuracy than the respective MSA.</p><p>We note that the high quality of the MSA from DeepMSA is not merely the result of combining multiple sequence databases. In particular, apart from the lack of the custom HHblits database construction and search step, 'No custom db' uses identical sequence databases, with the same HHblits and HMMER programs as DeepMSA. Despite an approximately 4 times greater alignment depth (Table <ref type="table">1</ref>), 'No custom db' is worse than DeepMSA by 1.0% (CCMpred) to 4.2% (TripletRes) in terms of top L contact precision (Table <ref type="table">2</ref>). These data suggest again that deeper alignments (with more sequence homologs) do not necessarily guarantee better contact prediction. They also indicate that although diverse sequence databases contribute to DeepMSA performance, it is also essential to combine multiple sequence search and alignment algorithms, especially the custom HHblits database construction subroutines in our case.</p><p>DeepMSA also outperforms the default MSAs in DeepContact and MetaPSICOV. 
In particular, the Stage 2 MSA yields slightly more precise (0.3%) top L contact prediction by MetaPSICOV than its default MSA, even though both kinds of MSAs come from an HHblits search through a custom HHblits database constructed from Jackhmmer hits. This shows that our time-saving heuristics (HMMER hit trimming and kClust clustering, which result in an overall average DeepMSA running time of 0.7 h per protein, Supplementary Fig. <ref type="figure">S2</ref>) introduce little compromise to the final alignment quality.</p><p>Apart from the benchmark data discussed herein, DeepMSA was also blindly tested in CASP13 as the MSA generation pipeline for our TripletRes server <ref type="bibr">(Li et al., 2019)</ref>, whose average top L contact precision on all 31 FM targets increased from 0.332 with HHblits MSAs to 0.409 with DeepMSA.</p></div>
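<div xmlns="http://www.tei-c.org/ns/1.0"><p>The precision metric used in this section can be made concrete with a short sketch. It assumes predictions are given as (i, j, score) triples and native contacts as a set of (i, j) pairs with i smaller than j; the function name top_precision is hypothetical, and dividing by the number of predictions actually selected (rather than by L/k when fewer are available) is an assumption, since evaluation scripts differ on this point.</p>
<eg xml:space="preserve">
```python
def top_precision(predicted, native_contacts, seq_len, divisor=1,
                  min_sep=24, max_sep=None):
    """Precision of the top L/divisor predicted contacts in a separation range.

    predicted: iterable of (i, j, score) triples.
    native_contacts: set of (i, j) pairs with i smaller than j.
    Defaults select long-range pairs (separation >= 24); use
    min_sep=12, max_sep=23 for medium-range pairs.
    """
    in_range = []
    for i, j, score in predicted:
        lo, hi = min(i, j), max(i, j)
        sep = hi - lo
        if sep >= min_sep and (max_sep is None or max_sep >= sep):
            in_range.append((score, lo, hi))
    # Highest-scoring predictions first.
    in_range.sort(key=lambda item: item[0], reverse=True)
    top = in_range[: max(1, seq_len // divisor)]
    if not top:
        return 0.0
    hits = sum(1 for _, lo, hi in top if (lo, hi) in native_contacts)
    # Assumption: normalize by the number of predictions actually selected.
    return hits / len(top)


if __name__ == "__main__":
    preds = [(0, 30, 0.9), (1, 40, 0.8), (2, 50, 0.7), (3, 10, 0.99)]
    native = {(0, 30), (2, 50)}
    print(top_precision(preds, native, seq_len=4, divisor=2))
```
</eg>
<p>In the toy call, the pair (3, 10) is discarded because its separation is below 24, and of the two highest-scoring long-range pairs only (0, 30) is a native contact, so the top L/2 precision is 0.5.</p></div>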
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">DeepMSA enables more accurate threading</head><p>Threading is an important approach to template-based protein structure prediction, which recognizes proteins with folds similar to that of the query protein. Since most state-of-the-art methods use profiles, in the form of either an HMM or a position-specific scoring matrix, to deduce query-template alignments, we examine whether and how DeepMSA can impact the performance of two typical threading programs, HHsearch <ref type="bibr">(Soding, 2005)</ref> and MUSTER <ref type="bibr">(Wu and Zhang, 2008)</ref>, which by default use HHblits and PSI-BLAST, respectively, to construct sequence profiles.</p><p>The HHsearch and MUSTER template database is constructed from the 71 684 non-redundant (pair-wise sequence identity &lt;70%) protein structures from the I-TASSER <ref type="bibr">(Yang et al., 2015)</ref> template library at <ref type="url">https://zhanglab.ccmb.med.umich.edu/library/</ref>. To generate the HHsearch library with the default profiles and with our new profiles, we first build MSAs for all templates by HHblits search against the Uniclust30 database and by DeepMSA, respectively. The hhmake program from HH-suite is then used to convert the MSAs to an HHsearch style HMM library.</p><p>In MUSTER, the default sequence profiles are constructed by searching the NR database with blastpgp, i.e. the legacy PSI-BLAST program <ref type="bibr">(Altschul et al., 1997)</ref>. Checkpoint files from the PSI-BLAST search are then converted to MTX format sequence profiles. Conversion of DeepMSA alignments to MTX format is implemented by the 'a3m2mtx.pl' script in the DeepMSA package. This script jump-starts a PSI-BLAST search using the MSA of DeepMSA against a dummy BLAST format database. The MTX file can then be recovered from the checkpoint file of the jump-start search. 
Similarly, for query proteins, we also construct both DeepMSA profiles and default profiles.</p><p>In Table <ref type="table">3</ref>, we list a comparison of template alignments obtained by HHsearch and MUSTER using different MSAs. The results are presented only for 'Hard' targets in terms of the average TM-score <ref type="bibr">(Zhang and Skolnick, 2004)</ref>, alignment coverage (number of aligned residues divided by query length) and RMSD of aligned regions, where all templates with a sequence identity &gt;30% to the query have been excluded. The results for 'Easy' and all targets are listed in Supplementary Table <ref type="table">S2</ref>. For 'Hard' threading targets, the TM-score of the first template by MUSTER and HHsearch is increased by 10.9 and 7.5%, respectively, if the DeepMSA profiles are used instead of the default PSI-BLAST/HHblits profiles. Of note, the number of 'Hard' targets with correctly identified templates (TM-score &gt;0.5) is increased by 64.0 and 39.4% for MUSTER and HHsearch, respectively.</p><p>The observation that DeepMSA significantly boosts threading performance for 'Hard' targets can be partially explained by the improved quality of query-template alignments. To examine this point, we curate a subset of 143 'enriched' 'Hard' targets, each having at least 30 templates of the correct fold (TM-score &gt;0.5) detectable by TM-align with &lt;30% sequence identity to the query. For each of these targets, we calculate the average TM-score over all the templates aligned by HHsearch using the DeepMSA sequence profile and compare it to that obtained using the default HHblits profile of HHsearch. Figure <ref type="figure">3A</ref> shows the average TM-score difference on the top 30 templates for each of the 143 targets. The data show that DeepMSA had a positive impact on the query-template alignments in 68.5% (= 98/143) of the cases. 
Among the 98 cases, 69 (70.4%) have a TM-score difference with P-value &lt;0.05 in the paired t-test (dark bars in Fig. <ref type="figure">3A</ref>), showing that the difference is statistically significant although only about 30 data points are involved in the paired t-test calculation for each target.</p><p>To further illustrate the importance of the DeepMSA profile in threading, we show a case study on query d1hx6a2 and its template 2bbdA. HHsearch threading based on the DeepMSA profile correctly aligns the query to the C-terminal region (residues 167 to 319) of the template and achieves a TM-score = 0.61 (Fig. <ref type="figure">3B</ref>); the aligned region is similar to that of the structure alignment from TM-align, although TM-align has an even higher TM-score (= 0.82, see Supplementary Fig. <ref type="figure">S3</ref>). On the other hand, HHsearch threading with the default HHblits profile only gets a TM-score = 0.15 due to complete mis-alignment of the query to the N-terminal region (residues 27 to 188) of the template (Fig. <ref type="figure">3C</ref>). Such differences can be explained by the depths of the MSAs for both query and template: the default HHblits run only detects 133 homologs for the template and no homolog for the query. On the other hand, the DeepMSA profile is much deeper, with 624 and 118 homologs for the template (Fig. <ref type="figure">3D</ref>) and the query (Fig. <ref type="figure">3E</ref>), respectively. The lack of template homologs in the default run is particularly severe at the C-terminal of the template, driving HHsearch to align the query to the template N-terminal instead.</p><p>In addition to the creation of correct alignments, another reason for the performance improvement by DeepMSA on threading is that better MSA profiles can help improve the ranking of the template alignments. In Figure <ref type="figure">4</ref>, we show an example in which the query protein (d1yvua1) is aligned to the template 3f73A2 using HHsearch. 
Although both the default and DeepMSA profiles resulted in reasonable query-template alignments with a TM-score &gt;0.5, their alignment scores are very different. While the HMM probability with the DeepMSA profile is 77.5%, which ranks the template No. 1, the probability score with the default profile is 0.2%, which ranks it at the 19 825th position among all templates. Thus, although the default profile can generate a correct alignment for this query-template pair, the correct template cannot be selected by the threading program due to the poor alignment score. In this case, an unrelated protein (3iz6D3, TM-score = 0.08) was selected as the first template when using the default HMM profile alignments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">DeepMSA profiles improve SS prediction over traditional PSI-BLAST profiles</head><p>In this section, we further test the performance of DeepMSA in SS prediction by PSIPRED 4.0 <ref type="bibr">(Jones, 1999)</ref> and PSSpred <ref type="bibr">(Yan et al., 2013)</ref>. By default, PSIPRED and PSSpred construct MTX format sequence profiles by searching the UniRef90 or NR database with the PSI-BLAST program <ref type="bibr">(Altschul et al., 1997)</ref>. MTX format DeepMSA profiles for these two programs can also be obtained by a3m2mtx.pl.</p><p>The accuracy of the SS predictions by PSSpred (Table <ref type="table">4</ref>) and PSIPRED (Supplementary Table <ref type="table">S3</ref>) is evaluated by the Q3 accuracy and the SOV segment overlap measure <ref type="bibr">(Zemla et al., 1999)</ref>. Compared to the default profiles, sequence profiles from DeepMSA improve the Q3 accuracy by 1.2 and 1.0% for PSSpred and PSIPRED, respectively. Similarly, SOV scores by PSSpred and PSIPRED are improved by 1.8 and 1.5%, respectively, when MSAs from DeepMSA are used. The differences are statistically significant, since the P-values in Student's t-test are all below 0.002.</p><p>Here, it is important to note that the original models of PSSpred and PSIPRED were trained on 2011 and 2016 sequence databases, respectively. Although SS predictors, as well as the contact and threading programs studied in previous sections, are usually sensitive to the sequence databases and MSAs that the models were originally trained on, we do not attempt to re-train the models using the new DeepMSA profiles. In this context, the performance improvement should be mainly attributed to the sensitive and comprehensive information that DeepMSA provides, compared to the MSAs generated by other default programs.</p></div>
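<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the Q3 measure used above, the following sketch shows the standard definition: the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed one. The function name q3_accuracy and the string-based input format are illustrative assumptions; the SOV measure is more involved and is not reproduced here.</p>
<eg xml:space="preserve">
```python
def q3_accuracy(predicted, observed):
    """Percentage of positions where the predicted three-state label
    (H, E or C) equals the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings differ in length")
    if not observed:
        raise ValueError("empty secondary structure string")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    # Eight of the nine positions agree in this toy example.
    print(q3_accuracy("HHHEEECCC", "HHHEECCCC"))
```
</eg>
<p>Per-target Q3 values computed in this way for two profile types form the paired samples on which a significance test, such as the t-test mentioned above, can be run.</p></div>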
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>We developed an open-source pipeline, DeepMSA, which aims to collect deep and sensitive MSAs from whole-genome and metagenome sequence databases. Large-scale benchmark experiments show that DeepMSA consistently improves protein contact prediction, fold-recognition and SS prediction, compared to the widely used HHblits, Jackhmmer and PSI-BLAST sequence searching programs. For example, the use of MSAs from DeepMSA improves the top L long-range contact prediction precision of CCMpred by 24.4% compared to the program's default use of HHblits MSAs. Similarly, MUSTER threading identifies correct templates for 64.0% more 'Hard' targets when the default PSI-BLAST profiles are replaced by the DeepMSA profiles. Notably, all improvements in contact prediction, SS prediction and threading have been achieved without re-training the predictor models or the parameters of the neural networks and dynamic programming alignments.</p><p>The high quality of the MSAs from DeepMSA is partly due to the greater coverage and alignment depth resulting from the combination of diverse sequence databases. However, the benchmark study shows that a deeper MSA with more sequence homologs does not always lead to better contact prediction, since the final effect of an MSA is often a tradeoff between sequence coverage and alignment accuracy. Further analysis reveals that appropriate incorporation of multiple sequence search and alignment algorithms is the key to generating high-quality MSAs by DeepMSA. 
In particular, HMMER alignment reconstruction by custom HHblits database generation is found to be especially helpful: a baseline method ('No custom db' in Tables <ref type="table">1</ref> and <ref type="table">2</ref>) without the custom HHblits database generation step results in 1.0-4.2% worse top L long-range contact prediction accuracies than DeepMSA, even when both methods use identical sequence databases.</p><p>The on-line server and the standalone program of DeepMSA have been made freely available at <ref type="url">https://zhanglab.ccmb.med.umich.edu/DeepMSA/</ref>. The continued development of robust MSA and profile construction methods should help enhance the usefulness and impact of the whole-genome and metagenomics initiatives on the community's structure and function prediction studies. For example, the current DeepMSA program works only with monomeric proteins; an extension of the program to MSA construction for protein-protein complexes is important and in progress.</p></div>
		</body>
		</text>
</TEI>
