Title: A new algorithm to train hidden Markov models for biological sequences with partial labels
Abstract
Background: Hidden Markov models (HMMs) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. HMMs are typically trained either by maximum likelihood, using counts, when sequences are labelled, or by expectation maximization, such as the Baum–Welch algorithm, when sequences are unlabelled. Increasingly, however, there are situations where sequences are only partially labelled. In this paper, we designed a new training method based on the Baum–Welch algorithm to train HMMs for certain biological problems in which only partial labelling is available.
Results: Compared with a similar, previously reported method designed for active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher accuracy when the trained models are tested for decoding on both synthetic and real data.
Conclusions: A novel training method is developed to improve the training of hidden Markov models by utilizing partially labelled data. The method will have an impact on detecting de novo motifs and signals in biological sequence data. In particular, the method will be deployed in active learning mode in ongoing research on detecting plasmodesmata targeting signals, and its performance will be assessed with validation from wet-lab experiments.
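To make the idea of partial labelling concrete, the following is a minimal sketch of one generic way to fold known labels into the forward-backward step that Baum–Welch relies on: at labelled positions, every state other than the labelled one is masked out. This is an illustration under stated assumptions, not the algorithm proposed in the paper. The function name `constrained_posteriors`, the use of `None` for unlabelled positions, and the discrete emission matrix are all hypothetical choices made for the example.

```python
import numpy as np

def constrained_posteriors(obs, labels, pi, A, B):
    """Posterior state probabilities given observations and partial labels.

    obs:    sequence of symbol indices, length T
    labels: length-T sequence; labels[t] is a state index, or None if unlabelled
    pi:     (N,) initial state distribution
    A:      (N, N) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B:      (N, M) emissions, B[i, k] = P(symbol k | state i)

    Assumes the labelled path has non-zero probability under the model.
    """
    T, N = len(obs), len(pi)

    # mask[t, i] is 1 if state i is allowed at position t, else 0.
    mask = np.ones((T, N))
    for t, lab in enumerate(labels):
        if lab is not None:
            mask[t] = 0.0
            mask[t, lab] = 1.0

    # Emission likelihoods with disallowed states zeroed out, shape (T, N).
    e = B[:, obs].T * mask

    # Scaled forward pass.
    alpha = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi * e[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * e[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Scaled backward pass, reusing the forward scaling factors.
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (e[t + 1] * beta[t + 1])) / scale[t + 1]

    # Per-position posteriors; labelled positions put all mass on the label.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

With every entry of `labels` set to `None` this reduces to the ordinary forward-backward computation for unlabelled sequences, and with every position labelled the posteriors become one-hot, which corresponds to the counting case for fully labelled data. In an EM loop, these posteriors (together with the analogous pairwise quantities) would feed the usual parameter re-estimation step.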
Award ID(s):
1820103
NSF-PAR ID:
10229102
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
BMC Bioinformatics
Volume:
22
Issue:
1
ISSN:
1471-2105
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we present a novel training method for hidden Markov models (HMMs) based on the Baum–Welch algorithm, named Comprehensive HMM (CompHMM), which extends the traditional approach of training HMMs from positive examples only so that both positive and negative examples can be used in training. In comparisons on both synthetic and real data, our method significantly outperformed the standard Baum–Welch method and another discriminative HMM training method in a membership prediction task.
  2. Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers “full-length,” represents this “backbone alignment” using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k > 1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
  3. Abstract

    Understanding animal movement often relies upon telemetry and biologging devices. These data are frequently used to estimate latent behavioural states to help understand why animals move across the landscape. While there are a variety of methods that make behavioural inferences from biotelemetry data, some features of these methods (e.g. analysis of a single data stream, use of parametric distributions) may limit their generality to reliably discriminate among behavioural states.

    To address some of the limitations of existing behavioural state estimation models, we introduce a nonparametric Bayesian framework called the mixed‐membership method for movement (M4), which is available within the open‐source bayesmoveR package. This framework can analyse multiple data streams (e.g. step length, turning angle, acceleration) without relying on parametric distributions, which may capture complex behaviours more successfully than current methods. We tested our Bayesian framework using simulated trajectories and compared model performance against two segmentation methods [behavioural change point analysis (BCPA) and segclust2d], one machine learning method [expectation‐maximization binary clustering (EMbC)] and one type of state‐space model [hidden Markov model (HMM)]. We also illustrated this Bayesian framework using movements of juvenile snail kites (Rostrhamus sociabilis) in Florida, USA.

    The Bayesian framework estimated breakpoints more accurately than the other segmentation methods for tracks of different lengths. Likewise, the Bayesian framework provided more accurate estimates of behaviour than the other state estimation methods when simulations were generated from less frequently considered distributions (e.g. truncated normal, beta, uniform). Three behavioural states were estimated from snail kite movements, which were labelled as ‘encamped’, ‘area‐restricted search’ and ‘transit’. Changes in these behaviours over time were associated with known dispersal events from the nest site, as well as movements to and from possible breeding locations.

    Our nonparametric Bayesian framework estimated behavioural states with comparable or superior accuracy compared to the other methods when step lengths and turning angles of simulations were generated from less frequently considered distributions. Since the most appropriate parametric distributions may not be obvious a priori, methods (such as M4) that are agnostic to the underlying distributions can provide powerful alternatives to address questions in movement ecology.

     
  4. Abstract

    We report the results of residue‐residue contact prediction of a new pipeline built purely on the learning of coevolutionary features in the CASP13 experiment. For a query sequence, the pipeline starts with the collection of multiple sequence alignments (MSAs) from multiple genome and metagenome sequence databases using two complementary Hidden Markov Model (HMM)‐based searching tools. Three profile matrices, built on covariance, precision, and pseudolikelihood maximization respectively, are then created from the MSAs, which are used as the input features of a deep residual convolutional neural network architecture for contact‐map training and prediction. Two ensembling strategies have been proposed to integrate the matrix features through end‐to‐end training and stacking, resulting in two complementary programs called TripletRes and ResTriplet, respectively. For the 31 free‐modeling domains that do not have homologous templates in the PDB, TripletRes and ResTriplet generated comparable results with an average accuracy of 0.640 and 0.646, respectively, for the top L/5 long‐range predictions, where 71% and 74% of the cases have an accuracy above 0.5. Detailed data analyses showed that the strength of the pipeline is due to the sensitive MSA construction and the advanced strategies for coevolutionary feature ensembling. Domain splitting was also found to help enhance the contact prediction performance. Nevertheless, contact models for tail regions, which often involve a high number of alignment gaps, and for targets with few homologous sequences are still suboptimal. Development of new approaches where the model is specifically trained on these regions and targets might help address these problems.

     
  5. Abstract

    Context. Large multi-site neuroimaging datasets have significantly advanced our quest to understand brain-behavior relationships and to develop biomarkers of psychiatric and neurodegenerative disorders. Yet, such data collections come at a cost, as the inevitable differences across samples may lead to biased or erroneous conclusions. Objective. We aim to validate the estimation of individual brain network dynamics fingerprints and appraise sources of variability in large resting-state functional magnetic resonance imaging (rs-fMRI) datasets by providing a novel point of view based on data-driven dynamical models. Approach. Previous work has investigated this critical issue in terms of effects on static measures, such as functional connectivity and brain parcellations. Here, we utilize dynamical models (hidden Markov models, HMMs) to examine how diverse scanning factors in multi-site fMRI recordings affect our ability to infer the brain’s spatiotemporal wandering between large-scale networks of activity. Specifically, we leverage a stable HMM trained on the (homogeneous) Human Connectome Project dataset, which we then apply to a heterogeneous dataset of traveling subjects scanned under a multitude of conditions. Main Results. Building upon this premise, we first replicate previous work on the emergence of non-random sequences of brain states. We next highlight how these time-varying brain activity patterns are robust subject-specific fingerprints. Finally, we suggest these fingerprints may be used to assess which scanning factors induce high variability in the data. Significance. These results demonstrate that we can (i) use large-scale datasets to train models that can then be used to interrogate subject-specific data and (ii) recover the unique trajectories of brain activity changes in each individual, but they also (iii) urge caution, as our ability to infer such patterns is affected by how, where and when we do so.

     