

Search for: All records

Award ID contains: 2107215


  1. Abstract

    Natural language processing (NLP) techniques can enhance our ability to interpret plant science literature. Many state-of-the-art algorithms for NLP tasks require high-quality labelled data in the target domain, in which entities like genes and proteins, as well as the relationships between entities, are labelled according to a set of annotation guidelines. While there exist such datasets for other domains, these resources need development in the plant sciences. Here, we present the Plant ScIenCe KnowLedgE Graph (PICKLE) corpus, a collection of 250 plant science abstracts annotated with entities and relations, along with its annotation guidelines. The annotation guidelines were refined by iterative rounds of overlapping annotations, in which inter-annotator agreement was leveraged to improve the guidelines. To demonstrate PICKLE’s utility, we evaluated the performance of pretrained models from other domains and trained a new, PICKLE-based model for entity and relation extraction (RE). The PICKLE-trained models exhibit the second-highest in-domain entity performance of all models evaluated, as well as a RE performance that is on par with other models. Additionally, we found that computer science-domain models outperformed models trained on a biomedical corpus (GENIA) in entity extraction, which was unexpected given the intuition that biomedical literature is more similar to PICKLE than computer science. Upon further exploration, we established that the inclusion of new types on which the models were not trained substantially impacts performance. The PICKLE corpus is, therefore, an important contribution to training resources for entity and RE in the plant sciences.

     
  2. The plant science corpus consists of the titles and abstracts of plant science articles in PubMed published before 2021, plus a small number of 2021 records due to record modifications. The columns are:

      Index: integer index serving as identifier
      PMID: PubMed identifier
      Date: publication date
      Journal: journal where the article was published
      Title: title of the article
      Abstract: abstract of the article
      Corpus: title and abstract combined
      Text classification score: plant science record prediction model score
      Preprocessed corpus: corpus after lower-casing, stop word removal, removal of non-alphanumeric and non-white space characters, and lemmatisation
      Topic: index of topics after topic modeling
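The preprocessing steps listed for the "Preprocessed corpus" column can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the `lemmatise` placeholder, the function names, and the exact order and handling of each step are assumptions made for the example.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the source does not
# say which list was used.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def lemmatise(token: str) -> str:
    """Placeholder lemmatiser (assumption): returns the token unchanged.

    A real pipeline would call a lemmatiser from an NLP library here.
    """
    return token


def build_corpus(title: str, abstract: str) -> str:
    """Combine title and abstract, as in the 'Corpus' column."""
    return f"{title} {abstract}"


def preprocess(text: str) -> str:
    """Apply the listed steps in the order the column description gives them."""
    # 1. Lower-casing.
    lowered = text.lower()
    # 2. Stop word removal on whitespace-separated tokens.
    tokens = [t for t in lowered.split() if t not in STOP_WORDS]
    # 3. Remove characters that are neither ASCII alphanumeric nor whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", "", " ".join(tokens))
    # 4. Lemmatisation (identity placeholder in this sketch).
    return " ".join(lemmatise(t) for t in cleaned.split())


if __name__ == "__main__":
    print(preprocess(build_corpus("The Roots of Arabidopsis,", "in Light!")))
```

Because stop words are removed before punctuation in this sketch, a token such as "the," would survive the stop-word filter; a pipeline that strips punctuation first would behave differently.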
  3. Abstract

    Motivation

    The rapid development of scRNA-seq technologies enables us to explore the transcriptome at the cell level on a large scale. Recently, various computational methods have been developed to analyze scRNA-seq data, for tasks such as clustering and visualization. However, current visualization methods, including t-SNE and UMAP, have limited accuracy in rendering the geometric relationships among populations with distinct functional states. Most visualization methods are unsupervised, leaving out information from the clustering results or given labels. This leads to inaccurate depiction of the distances between the bona fide functional states. In particular, UMAP and t-SNE are not optimal for preserving the global geometric structure. They may produce a contradiction in which clusters that are close in the embedded dimensions are in fact far apart in the original dimensions. In addition, UMAP and t-SNE cannot track the variance of clusters: in t-SNE and UMAP embeddings, the apparent variance of a cluster reflects not only the true variance but is also proportional to the sample size.

    Results

    We present supCPM, a robust supervised visualization method, which separates different clusters, preserves the global structure and tracks the cluster variance. Compared with six visualization methods on synthetic and real datasets, supCPM performs better at preserving the global geometric structure and data variance. Overall, supCPM provides an enhanced visualization pipeline to assist the interpretation of functional transition and accurately depict population segregation.

    Availability and implementation

    The R package and source code are available at https://zenodo.org/record/5975977#.YgqR1PXMJjM.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract

    A signaling complex comprising members of the LORELEI (LRE)-LIKE GPI-anchored protein (LLG) and Catharanthus roseus RECEPTOR-LIKE KINASE 1-LIKE (CrRLK1L) families perceives RAPID ALKALINIZATION FACTOR (RALF) peptides and regulates growth, reproduction, immunity, and stress responses in Arabidopsis (Arabidopsis thaliana). Genes encoding these proteins are members of multigene families in most angiosperms and could generate thousands of signaling complex variants. However, the links between expansion of these gene families and the functional diversification of this critical signaling complex, as well as the evolutionary factors underlying the maintenance of gene duplicates, remain unknown. Here, we investigated LLG gene family evolution by sampling land plant genomes and explored the function and expression of angiosperm LLGs. We found that LLG diversity within major land plant lineages is primarily due to lineage-specific duplication events, and that these duplications occurred both early in the history of these lineages and more recently. Our complementation and expression analyses showed that expression divergence (i.e. regulatory subfunctionalization), rather than functional divergence, explains the retention of LLG paralogs. Interestingly, all but one monocot and all eudicot species examined had an LLG copy with preferential expression in male reproductive tissues, while the other duplicate copies showed their highest levels of expression in female or vegetative tissues. The single LLG copy in Amborella trichopoda is expressed at much higher levels in male than in female reproductive or vegetative tissues. We propose that expression divergence plays an important role in retention of LLG duplicates in angiosperms.

     
  5. Abstract

    Plants respond to wounding stress by changing gene expression patterns and inducing the production of hormones including jasmonic acid. This wounding transcriptional response activates specialized metabolism pathways such as the glucosinolate pathways in Arabidopsis thaliana. While the regulatory factors and sequences controlling a subset of wound-response genes are known, it remains unclear how wound response is regulated globally. Here, we investigate how these responses are regulated by incorporating putative cis-regulatory elements, known transcription factor binding sites, in vitro DNA affinity purification sequencing, and DNase I hypersensitive sites to predict genes with different wound-response patterns using machine learning. We observed that regulatory sites and regions of open chromatin differed between genes upregulated at early and late wounding time-points as well as between genes induced by jasmonic acid and those not induced. Expanding on what we currently know, we identified cis-elements that improved model predictions of expression clusters over known binding sites. Using a combination of genome editing, in vitro DNA-binding assays, and transient expression assays using native and mutated cis-regulatory elements, we experimentally validated four of the predicted elements, three of which were not previously known to function in wound-response regulation. Our study provides a global model predictive of wound response and identifies new regulatory sequences important for wounding without requiring prior knowledge of the transcriptional regulators.

     
  6. Summary

    Revealing the contributions of genes to plant phenotype is frequently challenging because loss‐of‐function effects may be subtle or masked by varying degrees of genetic redundancy. Such effects can potentially be detected by measuring plant fitness, which reflects the cumulative effects of genetic changes over the lifetime of a plant. However, fitness is challenging to measure accurately, particularly in species with high fecundity and relatively small propagule sizes such as Arabidopsis thaliana.

    An image segmentation‐based method using the software ImageJ and an object detection‐based method using the Faster Region‐based Convolutional Neural Network (R‐CNN) algorithm were used for measuring two Arabidopsis fitness traits: seed and fruit counts.

    The segmentation‐based method was error‐prone (correlation between true and predicted seed counts, r² = 0.849) because seeds touching each other were undercounted. By contrast, the object detection‐based algorithm yielded near perfect seed counts (r² = 0.9996) and highly accurate fruit counts (r² = 0.980). Comparing seed counts for wild‐type and 12 mutant lines revealed fitness effects for three genes; fruit counts revealed the same effects for two genes.

    Our study provides analysis pipelines and models to facilitate the investigation of Arabidopsis fitness traits and demonstrates the importance of examining fitness traits when studying gene functions.
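The agreement figures quoted in this summary are correlations between true and predicted counts. As a rough sketch of how such a value could be computed, the function below returns the squared Pearson correlation of two count lists. Reading the reported statistic as a squared Pearson correlation is an assumption, and the function name and the example counts are hypothetical, not data from the study.

```python
def r_squared(true_counts, predicted_counts):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(true_counts)
    if n != len(predicted_counts) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_true = sum(true_counts) / n
    mean_pred = sum(predicted_counts) / n

    # Sums of cross-products and squared deviations from the means.
    cov = sum((t - mean_true) * (p - mean_pred)
              for t, p in zip(true_counts, predicted_counts))
    var_true = sum((t - mean_true) ** 2 for t in true_counts)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted_counts)

    if var_true == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant input")

    return (cov * cov) / (var_true * var_pred)


if __name__ == "__main__":
    # Made-up counts: every prediction is 10% below the true value.
    true_counts = [10, 20, 30, 40]
    predicted_counts = [9, 18, 27, 36]
    print(r_squared(true_counts, predicted_counts))
```

Note that a squared correlation measures linear association rather than exact agreement: in the made-up example, a method that undercounts every sample by the same proportion still scores 1.0.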

     
  7. Basic helix–loop–helix (bHLH) proteins are one of the largest families of transcription factors (TFs) in eukaryotes, and ~30% of all flowering plants’ bHLH TFs contain the aspartate kinase, chorismate mutase, and TyrA (ACT)-like domain at variable distances C-terminal from the bHLH. However, the evolutionary history and functional consequences of the bHLH/ACT-like domain association remain unknown. Here, we show that this domain association is unique to the Plantae kingdom, with green algae (chlorophytes) harboring a small number of bHLH genes in which the ACT-like domain is present at variable frequency. bHLH-associated ACT-like domains form a monophyletic group, indicating a common origin. Indeed, phylogenetic analysis results suggest that the association of ACT-like and bHLH domains occurred early in Plantae by recruitment of an ACT-like domain in a common ancestor with widely distributed ACT DOMAIN REPEAT (ACR) genes by an ancestral bHLH gene. We determined the functional significance of this association by showing that Chlamydomonas reinhardtii ACT-like domains mediate homodimer formation and negatively affect DNA binding of the associated bHLH domains. We show that, while ACT-like domains have experienced faster selection than the associated bHLH domain, their rates of evolution are strongly and positively correlated, suggesting that the evolution of the ACT-like domains was constrained by the bHLH domains. This study proposes an evolutionary trajectory for the association of ACT-like and bHLH domains with the experimental characterization of the functional consequence in the regulation of plant-specific processes, highlighting the impacts of functional domain coevolution.
    Free, publicly-accessible full text available May 9, 2024
  8. Abstract

    New graduate students in biology programs may lack the quantitative skills necessary for their research and professional careers. The acquisition of these skills may be impeded by teaching and mentoring experiences that decrease rather than increase students’ beliefs in their ability to learn and apply quantitative approaches. In this opinion piece, we argue that revising instructional experiences to ensure that both student confidence and quantitative skills are enhanced may improve both educational outcomes and professional success. A few studies suggest that explicitly addressing productive failure in an instructional setting and ensuring effective mentoring may be the most effective routes to simultaneously increasing both quantitative self-efficacy and quantitative skills. However, there is little work that specifically addresses graduate student needs, and more research is required to reach evidence-backed conclusions.
    Free, publicly-accessible full text available April 29, 2024