Title: Machine learning-based approaches for identifying human blood cells harboring CRISPR-mediated fetal chromatin domain ablations
Abstract Two common hemoglobinopathies, sickle cell disease (SCD) and β-thalassemia, arise from genetic mutations within the β-globin gene. In this work, we identified a 500-bp motif (Fetal Chromatin Domain, FCD) upstream of the human γ-globin locus and showed that the removal of this motif using CRISPR technology reactivates the expression of γ-globin. Next, we present two different cell morphology-based machine learning approaches that can be used to identify human blood cells (KU-812) that harbor CRISPR-mediated FCD genetic modifications. Three candidate models from the first approach, which uses a multilayer perceptron algorithm (MLP 20-26, MLP 26-18, and MLP 30-26) and flow cytometry-derived cellular data, yielded 0.83 precision, 0.80 recall, 0.82 accuracy, and 0.90 area under the ROC (receiver operating characteristic) curve when predicting the edited cells. In comparison, the candidate model from the second approach, which uses deep learning (T2D5) and DIC microscopy-derived imaging data, performed with lower accuracy (0.80) and ROC AUC (0.87). We envision that equivalent machine learning-based models can complement currently available genotyping protocols for specific genetic modifications which result in morphological changes in human cells.
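The four metrics quoted in the abstract (precision, recall, accuracy, and ROC AUC) can be illustrated with a minimal sketch. The labels and scores below are toy values invented for illustration, not data from the study, and the convention that `1` means "edited cell" is an assumption. The helper names are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and accuracy for binary labels (1 = edited, assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: classifier scores thresholded at 0.5 (threshold is an assumption).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Precision, recall, and accuracy depend on the chosen decision threshold, whereas ROC AUC summarizes ranking quality across all thresholds, which is why the abstract reports it separately.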
Award ID(s): 2029121
PAR ID: 10362139
Publisher / Repository: Nature Publishing Group
Journal Name: Scientific Reports
Volume: 12
Issue: 1
ISSN: 2045-2322
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Background: In the CRISPR-Cas9 system, the efficiency of genetic modifications has been found to vary depending on the single guide RNA (sgRNA) used. A variety of sgRNA properties have been found to be predictive of CRISPR cleavage efficiency, including the position-specific sequence composition of sgRNAs, global sgRNA sequence properties, and thermodynamic features. While prevalent existing deep learning-based approaches provide competitive prediction accuracy, a more interpretable model is desirable to help understand how different features may contribute to CRISPR-Cas9 cleavage efficiency. Results: We propose a gradient boosting approach, utilizing LightGBM to develop an integrated tool, BoostMEC (Boosting Model for Efficient CRISPR), for the prediction of wild-type CRISPR-Cas9 editing efficiency. We benchmark BoostMEC against 10 popular models on 13 external datasets and show its competitive performance. Conclusions: BoostMEC can provide state-of-the-art predictions of CRISPR-Cas9 cleavage efficiency for sgRNA design and selection. Relying on direct and derived sequence features of sgRNA sequences and based on conventional machine learning, BoostMEC maintains an advantage over other state-of-the-art CRISPR efficiency prediction models that are based on deep learning through its ability to produce more interpretable feature insights and predictions.
  2. Abstract A genetic knockout can be lethal to one human cell type while increasing growth rate in another. This context specificity confounds genetic analysis and prevents reproducible genome engineering. Genome-wide CRISPR compendia across most common human cell lines offer the largest opportunity to understand the biology of cell specificity. The prevailing viewpoint, synthetic lethality, occurs when a genetic alteration creates a unique CRISPR dependency. Here, we use machine learning for an unbiased investigation of cell type specificity. Quantifying model accuracy, we find that most cell type specific phenotypes are predicted by the function of related genes of wild-type sequence, not synthetic lethal relationships. These models then identify unexpected sets of 100-300 genes where reduced CRISPR measurements can produce genome-scale loss-of-function predictions across >18,000 genes. Thus, it is possible to reduce in vitro CRISPR libraries by orders of magnitude—with some information loss—when we remove redundant genes and not redundant sgRNAs. 
  3. We report a novel method based on atomic force microscopy (AFM) working in Ringing mode (RM) to distinguish between two similar human colon epithelial cancer cell lines that exhibit different degrees of neoplastic aggressiveness. The classification accuracy in identifying the cell line based on the images of a single cell can be as high as 94% (the area under the receiver operating characteristic [ROC] curve is 0.99). Comparing the accuracy obtained with the RM channels and with the regular imaging channels shows that the RM channels are responsible for the high accuracy. The cells are also studied with a traditional AFM indentation method, which gives information about cell mechanics and the pericellular coat. Although the indentation method also shows a statistically significant difference between the two cell lines, it identifies the cell line at the single-cell level with an accuracy below 68% (the area under the ROC curve is 0.73). Thus, AFM cell imaging is substantially more accurate in identifying the cell phenotype than the traditional AFM indentation method. All the cell data are collected on fixed cells and analyzed using machine learning methods. The biophysical reasons for the observed classification are discussed.
  4. Machine learning models can streamline the filtering of YouTube video comments into legitimate comments (ham) and spam. To bring such models into regular use on media-sharing platforms, recent approaches have developed models trained on YouTube comments; these have emerged as valuable classification tools that identify spam content and enhance user experience. In this paper, eight machine learning approaches are applied to spam detection for YouTube comments: Gaussian Naive Bayes, logistic regression, K-nearest neighbors (KNN) classifier, multi-layer perceptron (MLP), support vector machine (SVM) classifier, random forest classifier, decision tree classifier, and voting classifier. All eight models perform very well; in particular, the random forest approach achieves almost perfect performance, with an average precision of 100% and an AUC-ROC of 0.9841. The computational complexity of the eight approaches is also compared.
  5. Abstract A key question in evolutionary biology concerns the relative importance of different sources of adaptive genetic variation, such as de novo mutations, standing variation, and introgressive hybridization. A corollary question concerns how allelic variants derived from these different sources may influence the molecular basis of phenotypic adaptation. Here, we use a protein-engineering approach to examine the phenotypic effect of putatively adaptive hemoglobin (Hb) mutations in the high-altitude Tibetan wolf that were selectively introgressed into the Tibetan mastiff, a high-altitude dog breed that is renowned for its hypoxia tolerance. Experiments revealed that the introgressed coding variants confer an increased Hb–O2 affinity in conjunction with an enhanced Bohr effect. We also document that affinity-enhancing mutations in the β-globin gene of Tibetan wolf were originally derived via interparalog gene conversion from a tandemly linked β-globin pseudogene. Thus, affinity-enhancing mutations were introduced into the β-globin gene of Tibetan wolf via one form of intragenomic lateral transfer (ectopic gene conversion) and were subsequently introduced into the Tibetan mastiff genome via a second form of lateral transfer (introgression). Site-directed mutagenesis experiments revealed that the increased Hb–O2 affinity requires a specific two-site combination of amino acid replacements, suggesting that the molecular underpinnings of Hb adaptation in Tibetan mastiff (involving mutations that arose in a nonexpressed gene and which originally fixed in Tibetan wolf) may be qualitatively distinct from functionally similar changes in protein function that could have evolved via sequential fixation of de novo mutations during the breed’s relatively short duration of residency at high altitude. 