skip to main content


Title: A Novel Algorithm for Training Hidden Markov Models with Positive and Negative Examples
In this paper, we present a novel training method based on Baum-Welch algorithm for hidden Markov models (HMM), named as Comprehensive HMM (CompHMM), which changes the traditional approach of training HMM from positive examples only to be able to utilize both positive and negative examples in training HMMs. By comparison, our method outperformed the standard Baum-Welch method and another HMM discriminative training method significantly through both synthetic and real data in membership prediction task.  more » « less
Award ID(s):
1820103
NSF-PAR ID:
10229100
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Page Range / eLocation ID:
305 to 310
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Abstract Background Hidden Markov models (HMM) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. The standard method used for HMM training is either by maximum likelihood using counting when sequences are labelled or by expectation maximization, such as the Baum–Welch algorithm, when sequences are unlabelled. However, increasingly there are situations where sequences are just partially labelled. In this paper, we designed a new training method based on the Baum–Welch algorithm to train HMMs for situations in which only partial labeling is available for certain biological problems. Results Compared with a similar method previously reported that is designed for the purpose of active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher accuracy when the trained models are tested for decoding with both synthetic data and real data. Conclusions A novel training method is developed to improve the training of hidden Markov models by utilizing partial labelled data. The method will impact on detecting de novo motifs and signals in biological sequence data. In particular, the method will be deployed in active learning mode to the ongoing research in detecting plasmodesmata targeting signals and assess the performance with validations from wet-lab experiments. 
    more » « less
  2. The application of deep learning to automatic speech recognition (ASR) has yielded dramatic accuracy increases for languages with abundant training data, but languages with limited training resources have yet to see accuracy improvements on this scale. In this paper, we compare a fully convolutional approach for acoustic modelling in ASR with a variety of established acoustic modeling approaches. We evaluate our method on Seneca, a low-resource endangered language spoken in North America. Our method yields word error rates up to 40% lower than those reported using both standard GMM-HMM approaches and established deep neural methods, with a substantial reduction in training time. These results show particular promise for languages like Seneca that are both endangered and lack extensive documentation. 
    more » « less
  3. This paper proposes a novel linear prediction coding-based data augmentation method for children’s low and zero resource dialect ASR. The data augmentation procedure consists of perturbing the formant peaks of the LPC spectrum during LPC analysis and reconstruction. The method is evaluated on two novel children’s speech datasets with one containing California English from the Southern California Area and the other containing a mix of Southern American English and African American English from the Atlanta, Georgia area. We test the proposed method in training both an HMM-DNN system and an end-to-end system to show model-robustness and demonstrate that the algorithm improves ASR performance, especially for zero resource dialect children’s task, as compared to common data augmentation methods such as VTLP, Speed Perturbation, and SpecAugment. 
    more » « less
  4. Abstract

    Untargeted metabolomics based on liquid chromatography-mass spectrometry technology is quickly gaining widespread application, given its ability to depict the global metabolic pattern in biological samples. However, the data are noisy and plagued by the lack of clear identity of data features measured from samples. Multiple potential matchings exist between data features and known metabolites, while the truth can only be one-to-one matches. Some existing methods attempt to reduce the matching uncertainty, but are far from being able to remove the uncertainty for most features. The existence of the uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach for Bayesian Analysis of Untargeted Metabolomics data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels of feature-metabolite matching, the method is applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with other existing methods, BAUM achieves better accuracy in selecting important metabolites that tend to be functionally consistent and assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible.

     
    more » « less
  5. null (Ed.)
    Abstract One of the core challenges in applying machine learning and artificial intelligence to medicine is the limited availability of annotated medical data. Unlike in other applications of machine learning, where an abundance of labeled data is available, the labeling and annotation of medical data and images require a major effort of manual work by expert clinicians who do not have the time to annotate manually. In this work, we propose a new deep learning technique (SLIVER-net), to predict clinical features from 3-dimensional volumes using a limited number of manually annotated examples. SLIVER-net is based on transfer learning, where we borrow information about the structure and parameters of the network from publicly available large datasets. Since public volume data are scarce, we use 2D images and account for the 3-dimensional structure using a novel deep learning method which tiles the volume scans, and then adds layers that leverage the 3D structure. In order to illustrate its utility, we apply SLIVER-net to predict risk factors for progression of age-related macular degeneration (AMD), a leading cause of blindness, from optical coherence tomography (OCT) volumes acquired from multiple sites. SLIVER-net successfully predicts these factors despite being trained with a relatively small number of annotated volumes (hundreds) and only dozens of positive training examples. Our empirical evaluation demonstrates that SLIVER-net significantly outperforms standard state-of-the-art deep learning techniques used for medical volumes, and its performance is generalizable as it was validated on an external testing set. In a direct comparison with a clinician panel, we find that SLIVER-net also outperforms junior specialists, and identifies AMD progression risk factors similarly to expert retina specialists. 
    more » « less