

Title: Applications of Machine Learning for the Classification of Porcine Reproductive and Respiratory Syndrome Virus Sublineages Using Amino Acid Scores of ORF5 Gene
Porcine reproductive and respiratory syndrome (PRRS) is an infectious disease of pigs caused by PRRS virus (PRRSV). A modified live-attenuated vaccine has been widely used to control the spread of PRRSV, and the classification of field strains is key to successful control and prevention. Restriction fragment length polymorphism targeting the open reading frame 5 (ORF5) gene is widely used to classify PRRSV strains but has shown unstable accuracy. Phylogenetic analysis is a powerful tool for PRRSV classification with consistent accuracy, but it demands large computational power as the number of sequences increases. Our study aimed to apply four machine learning (ML) algorithms (random forest, k-nearest neighbor, support vector machine, and multilayer perceptron) to classify field PRRSV strains into four clades using amino acid scores based on the ORF5 gene sequence. Our study used amino acid sequences of the ORF5 gene in 1931 field PRRSV strains collected in the US from 2012 to 2020. Phylogenetic analysis was used to label field PRRSV strains into one of four clades: Lineage 5 or three clades in Lineage 1. We measured the accuracy and time consumption of classification using the four ML approaches for different numbers of gene sequences. We found that all four ML algorithms classify a large number of field strains in a very short time (<2.5 s) with very high accuracy (>0.99 area under the receiver operating characteristic curve). Furthermore, the random forest approach detected a total of 4 key amino acid positions for the classification of field PRRSV strains into four clades. Our findings provide insight for developing a rapid and accurate classification model using genetic information, which also enables us to handle large genome datasets in real time or semi-real time for data-driven decision-making and more timely surveillance.
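As a rough illustration of one of the four algorithm families named in the abstract, the following is a minimal k-nearest neighbor sketch over aligned amino acid sequences. It is not the study's pipeline: the Hamming distance stands in for the amino acid scores (whose definition is not given here), and the sequences, the clade labels `clade_A` and `clade_B`, and the function names are all hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_predict(query: str, train: list, k: int = 3) -> str:
    """Label `query` by majority vote among its k nearest training sequences."""
    nearest = sorted(train, key=lambda item: hamming(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up aligned fragments with hypothetical clade labels (assumption:
# real input would be full-length aligned ORF5 amino acid sequences
# labelled by phylogenetic analysis).
TRAIN = [
    ("MLGKCLTAGC", "clade_A"),
    ("MLGKCLTAGY", "clade_A"),
    ("MLGRCLTAGC", "clade_A"),
    ("MLERCLTVGC", "clade_B"),
    ("MLEKCLTVGC", "clade_B"),
    ("MLERCLTVGY", "clade_B"),
]

print(knn_predict("MLGKCLTAGS", TRAIN))
```

Sorting the whole training set is fine for a toy example; for thousands of strains, a library implementation with an indexed neighbor search would be the more practical choice.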
Award ID(s):
1838207 2040680
NSF-PAR ID:
10313522
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Frontiers in Veterinary Science
Volume:
8
ISSN:
2297-1769
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Objective

    Body mass index (BMI) is the primary criterion differentiating anorexia nervosa (AN) and atypical anorexia nervosa despite prior literature indicating few differences between disorders. Machine learning (ML) classification provides an efficient means of accurately distinguishing between two meaningful classes given any number of features. The aim of the present study was to determine if ML algorithms can accurately distinguish AN and atypical AN given an ensemble of features excluding BMI, and if not, if the inclusion of BMI enables ML to accurately classify between the two.

    Methods

    Using an aggregate sample from seven studies consisting of individuals with AN and atypical AN who completed baseline questionnaires (N = 448), we used logistic regression, decision tree, and random forest ML classification models each trained on two datasets, one containing demographic, eating disorder, and comorbid features without BMI, and one retaining all features and BMI.

    Results

    Model performance for all algorithms trained with BMI as a feature was deemed acceptable (mean accuracy = 74.98%, mean area under the receiver operating characteristic curve [AUC] = 74.75%), whereas model performance diminished without BMI (mean accuracy = 59.37%, mean AUC = 59.98%).

    Discussion

    Model performance was acceptable, but not strong, if BMI was included as a feature; no other features meaningfully improved classification. When BMI was excluded, ML algorithms performed poorly at classifying cases of AN and atypical AN when considering other demographic and clinical characteristics. Results suggest a reconceptualization of atypical AN should be considered.

    Public Significance

    There is a growing debate about the differences between anorexia nervosa and atypical anorexia nervosa as their diagnostic differentiation relies on BMI despite being similar otherwise. We aimed to see if machine learning could distinguish between the two disorders and found accurate classification only if BMI was used as a feature. This finding calls into question the need to differentiate between the two disorders.

     
  2. Abstract

    Advances in genome sequencing and annotation have eased the difficulty of identifying new gene sequences. Predicting the functions of these newly identified genes remains challenging. Genes descended from a common ancestral sequence are likely to have common functions. As a result, homology is widely used for gene function prediction. This means functional annotation errors also propagate from one species to another. Several approaches based on machine learning classification algorithms were evaluated for their ability to accurately predict gene function from non‐homology gene features. Among the eight supervised classification algorithms evaluated, random‐forest‐based prediction consistently provided the most accurate gene function prediction. Non‐homology‐based functional annotation provides complementary strengths to homology‐based annotation, with higher average performance in Biological Process GO terms, the domain where homology‐based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology‐based functional annotation is highest. GO prediction models trained with homology‐based annotations were able to successfully predict annotations from a manually curated “gold standard” GO annotation set. Non‐homology‐based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes which lack functionally characterized homologs, and to identify and correct functional annotation errors which were propagated through homology‐based functional annotations.

     
  3. Introduction: Alzheimer’s disease (AD) causes progressive irreversible cognitive decline and is the leading cause of dementia. Therefore, a timely diagnosis is imperative to maximize neurological preservation. However, current treatments are either too costly or limited in availability. In this project, we explored using retinal vasculature as a potential biomarker for early AD diagnosis. This project focuses on stage 3 of a three-stage modular machine learning pipeline consisting of image quality selection, vessel map generation, and classification [1]. The previous model used only a support vector machine (SVM) to classify AD labels, which limited its accuracy to 82%. In this project, random forest and gradient boosting were added and, along with SVM, combined into an ensemble classifier, raising the classification accuracy to 89%.

    Materials and Methods: Subjects classified as AD were those who were diagnosed with dementia in “Dementia Outcome: Alzheimer’s disease” from the UK Biobank Electronic Health Records. Five control groups were chosen with a 5:1 ratio of control to AD patients, where the control patients had the same age, gender, and eye side image as the AD patient. In total, 122 vessel images from each group (AD and control) were used. The vessel maps were then segmented from fundus images through U-net. A t-test feature selection was first done on the training folds, and the features selected at a p-value threshold of 0.01 were fed into the classifiers. Next, 20 repetitions of 5-fold cross validation were performed, where the hyperparameters were tuned solely on the training data. An ensemble classifier consisting of SVM, gradient boosting tree, and random forests was built, and the final prediction was made through majority voting and evaluated on the test set.

    Results and Discussion: Through ensemble classification, accuracy increased by 4-12% relative to the individual classifiers, precision by 9-15%, sensitivity by 2-9%, specificity by at least 9-16%, and F1 score by 7-12%.

    Conclusions: Overall, a relatively high classification accuracy was achieved using machine learning ensemble classification with SVM, random forest, and gradient boosting. Although the results are very promising, a limitation of this study is that the requirement for images of sufficient quality decreased the number of control parameters that could be implemented. However, through retinal vasculature analysis, this project shows machine learning’s high potential to be an efficient, more cost-effective alternative for diagnosing Alzheimer’s disease.

    Clinical Application: Using machine learning for AD diagnosis through retinal images will make screening available for a broader population by being more accessible and cost-efficient. Mobile device based screening can also be enabled at primary screening in resource-deprived regions. It can provide a pathway for future understanding of the association between biomarkers in the eye and brain.
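    The majority-voting step described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `majority_vote` and the three prediction lists are hypothetical stand-ins for the outputs of the SVM, random forest, and gradient boosting classifiers, and the labels use 1 for AD and 0 for control.

```python
from collections import Counter


def majority_vote(predictions: list) -> list:
    """Combine per-classifier label lists into one label per sample."""
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every classifier must predict the same samples")
    combined = []
    # zip(*predictions) groups the votes cast for each sample.
    for sample_votes in zip(*predictions):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Made-up predictions for four test samples (1 = AD, 0 = control).
svm_pred = [1, 0, 1, 1]
rf_pred = [1, 1, 0, 1]
gb_pred = [0, 0, 1, 1]

print(majority_vote([svm_pred, rf_pred, gb_pred]))
```

    With three binary classifiers a tie cannot occur, which is one practical reason to use an odd number of voters.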
  4. ABSTRACT In recent years, considerable progress has been made in topologically and functionally characterizing integral outer membrane proteins (OMPs) of Treponema pallidum subspecies pallidum, the syphilis spirochete, and identifying its surface-exposed β-barrel domains. Extracellular loops in OMPs of Gram-negative bacteria are known to be highly variable. We examined the sequence diversity of β-barrel-encoding regions of tprC, tprD, and bamA in 31 specimens from Cali, Colombia; San Francisco, California; and the Czech Republic and compared them to allelic variants in the 41 reference genomes in the NCBI database. To establish a phylogenetic framework, we used T. pallidum 0548 (tp0548) genotyping and tp0558 sequences to assign strains to the Nichols or SS14 clades. We found that (i) β-barrels in clinical strains could be grouped according to allelic variants in T. pallidum subsp. pallidum reference genomes; (ii) for all three OMP loci, clinical strains within the Nichols or SS14 clades often harbored β-barrel variants that differed from the Nichols and SS14 reference strains; and (iii) OMP variable regions often reside in predicted extracellular loops containing B-cell epitopes. On the basis of structural models, nonconservative amino acid substitutions in predicted transmembrane β-strands of T. pallidum repeat C (TprC) and TprD2 could give rise to functional differences in their porin channels. OMP profiles of some clinical strains were mosaics of different reference strains and did not correlate with results from enhanced molecular typing. Our observations suggest that human host selection pressures drive T. pallidum subsp. pallidum OMP diversity and that genetic exchange contributes to the evolutionary biology of T. pallidum subsp. pallidum. They also set the stage for topology-based analysis of antibody responses to OMPs and help frame strategies for syphilis vaccine development.
    IMPORTANCE Despite recent progress characterizing outer membrane proteins (OMPs) of Treponema pallidum, little is known about how their surface-exposed, β-barrel-forming domains vary among strains circulating within high-risk populations. In this study, sequences for the β-barrel-encoding regions of three OMP loci, tprC, tprD, and bamA, in T. pallidum subsp. pallidum isolates from a large number of patient specimens from geographically disparate sites were examined. Structural models predict that sequence variation within β-barrel domains occurs predominantly within predicted extracellular loops. Amino acid substitutions in predicted transmembrane strands that could potentially affect porin channel function were also noted. Our findings suggest that selection pressures exerted within human populations drive T. pallidum subsp. pallidum OMP diversity and that recombination at OMP loci contributes to the evolutionary biology of syphilis spirochetes. These results also set the stage for topology-based analysis of antibody responses that promote clearance of T. pallidum subsp. pallidum and frame strategies for vaccine development based upon conserved OMP extracellular loops.
  5. Comprising 501 genera and around 14,000 species, Papilionoideae is not only the largest subfamily of Fabaceae (Leguminosae; legumes), but also one of the most extraordinarily diverse clades among angiosperms. Papilionoids are a major source of food and forage, are ecologically successful in all major biomes, and display dramatic variation in both floral architecture and plastid genome (plastome) structure. Plastid DNA-based phylogenetic analyses have greatly improved our understanding of relationships among the major groups of Papilionoideae, yet the backbone of the subfamily phylogeny remains unresolved. In this study, we sequenced and assembled 39 new plastomes covering key genera that represent the morphological diversity in the subfamily. From 244 total taxa, we produced eight datasets for maximum likelihood (ML) analyses based on entire plastomes and/or concatenated sequences of 77 protein-coding sequences (CDS) and two datasets for multispecies coalescent (MSC) analyses based on individual gene trees. We additionally produced a combined nucleotide dataset comprising CDS plus matK gene sequences only, in which most papilionoid genera were sampled. An ML tree based on the entire plastome maximally supported all of the deep and most recent divergences of papilionoids (223 out of 236 nodes). The Swartzieae, ADA (Angylocalyceae, Dipterygeae, and Amburaneae), Cladrastis, Andira, and Exostyleae clades formed a grade to the remainder of the Papilionoideae, concordant with nine ML and two MSC trees. Phylogenetic relationships among the remaining five papilionoid lineages (Vataireoid, Dermatophyllum, Genistoid s.l., Dalbergioid s.l., and Baphieae + Non-Protein Amino Acid Accumulating or NPAAA clade) remained uncertain because of insufficient support and/or conflicting relationships among trees. Our study fully resolved most of the deep nodes of Papilionoideae; however, some relationships require further exploration. More genome-scale data and rigorous analyses are needed to disentangle phylogenetic relationships among the five remaining lineages.