
Title: PaleoRec: A sequential recommender system for the annotation of paleoclimate datasets
Abstract: Studying past climate variability is fundamental to our understanding of current changes. In the era of Big Data, the value of paleoclimate information critically depends on our ability to analyze large volumes of data, which itself hinges on standardization. Standardization also ensures that these datasets are more Findable, Accessible, Interoperable, and Reusable. Building upon efforts from the paleoclimate community to standardize the format, terminology, and reporting of paleoclimate data, this article describes PaleoRec, a recommender system for the annotation of such datasets. The goal is to assist scientists in the annotation task by reducing and ranking relevant entries in a drop-down menu. Scientists can either choose the best option for their metadata or enter the appropriate information manually. PaleoRec aims to reduce the time to science while ensuring adherence to community standards. PaleoRec is a sequential recommender system based on a recurrent neural network that takes into account the short-term interest of a user in a particular dataset. The model was developed using 1,996 expert-annotated datasets, resulting in 6,512 sequences. The performance of the algorithm, as measured by the Hit Ratio, varies between 0.7 and 1.0. PaleoRec is currently deployed on a web interface used for the annotation of paleoclimate datasets using emerging community standards.
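The sequential next-item task and the Hit Ratio metric described in the abstract can be sketched with a much simpler frequency-based model. This is not PaleoRec's method (PaleoRec uses a recurrent neural network), and the annotation field values below are hypothetical; the sketch only illustrates ranking drop-down candidates from past annotation sequences and evaluating with Hit Ratio@k:

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count which annotation follows which: a first-order stand-in
    for the sequential model (PaleoRec itself uses an RNN)."""
    follows = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            follows[prev][nxt] += 1
    return follows

def recommend(follows, prev, k=3):
    """Rank the k most likely next annotations for the drop-down menu."""
    return [item for item, _ in follows[prev].most_common(k)]

def hit_ratio_at_k(follows, test_pairs, k=3):
    """Fraction of held-out (prev, true_next) pairs whose true next
    annotation appears in the top-k recommendations."""
    hits = sum(true in recommend(follows, prev, k) for prev, true in test_pairs)
    return hits / len(test_pairs)

# Hypothetical annotation sequences (archive type -> observation -> units).
seqs = [
    ["Coral", "d18O", "permil"],
    ["Coral", "Sr/Ca", "mmol/mol"],
    ["Coral", "d18O", "permil"],
    ["GlacierIce", "d18O", "permil"],
]
model = train_bigram(seqs)
print(recommend(model, "Coral"))                        # ['d18O', 'Sr/Ca']
print(hit_ratio_at_k(model, [("Coral", "d18O")], k=1))  # 1.0
```

A real deployment replaces the bigram counts with an RNN over the full annotation sequence, but the interface (ranked top-k list feeding a drop-down, Hit Ratio evaluation) is the same.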
Award ID(s):
1948822, 1948746
Publication Date:
Journal Name:
Environmental Data Science
Sponsoring Org:
National Science Foundation
More Like this
  1.
    SARS-CoV-2 RNA detection in wastewater is being rapidly developed and adopted as a public health monitoring tool worldwide. With wastewater surveillance programs being implemented across many different scales and by many different stakeholders, it is critical that data collected and shared are accompanied by an appropriate minimal amount of meta-information to enable meaningful interpretation and use of this new information source and intercomparison across datasets. While some databases are being developed for specific surveillance programs locally, regionally, nationally, and internationally, common globally-adopted data standards have not yet been established within the research community. Establishing such standards will require national and international consensus on what meta-information should accompany SARS-CoV-2 wastewater measurements. To establish a recommendation on minimum information to accompany reporting of SARS-CoV-2 occurrence in wastewater for the research community, the United States National Science Foundation (NSF) Research Coordination Network on Wastewater Surveillance for SARS-CoV-2 hosted a workshop in February 2021 with participants from academia, government agencies, private companies, wastewater utilities, public health laboratories, and research institutes. This report presents the primary two outcomes of the workshop: (i) a recommendation on the set of minimum meta-information that is needed to confidently interpret wastewater SARS-CoV-2 data, and (ii) insights from workshop discussions on how to improve standardization of data reporting.
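One way a "minimum meta-information" recommendation gets operationalized is a completeness check before data are shared. The field names below are assumptions for illustration only, not the workshop's actual recommended set:

```python
# Hypothetical minimum meta-information fields for a wastewater
# SARS-CoV-2 measurement; the workshop report defines the real set.
REQUIRED_FIELDS = {
    "sample_collection_datetime", "sample_location", "sample_matrix",
    "concentration_method", "quantification_method", "target_gene",
    "units", "recovery_control",
}

def missing_metadata(record):
    """Return the required meta-information fields absent from a record,
    so incomplete submissions can be flagged before sharing."""
    return sorted(REQUIRED_FIELDS - record.keys())

record = {"sample_location": "WWTP-01", "target_gene": "N1", "units": "gc/L"}
print(missing_metadata(record))  # fields still needed for this record
```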
  2. Natural history collections are often considered remote and inaccessible without special permission from curators. Digitization of these collections can make them much more accessible to researchers, educators, and general enthusiasts alike, thereby removing the stigma of a lonely specimen on a dusty shelf in the back room of a museum that will never again see the light of day. We are in the process of digitizing the microfossils of the Indiana University Paleontology collection using the GIGAmacro Magnify2 Robotic Imaging System. This suite of software and hardware allows us to automate photography and post-production of high-resolution images, thereby greatly reducing the amount of time and labor needed to serve the data. Our hardware includes a Canon T6i 24-megapixel DSLR, a Canon MP-E 65mm 1X to 5X lens, and a Canon MT26EX Dual Flash, all mounted on a lead system made with high-performance precision IGUS DryLin anodized aluminum. The camera and its mount move over the tray of microfossil slides using bearings and rails. The software includes the GIGAmacro Capture Software (photography), GIGAmacro Viewer Software (display and annotation), Zerene Stacker (focus stacking), and Autopano GIGA (stitching). All of the metadata is kept in association with the images, uploaded to Notes from Nature, and transcribed by community scientists; everything is then stored in the image archive, Imago. In ~460 hours we have photographed ~10,500 slides and have completed ~65% of our microfossil collection. Using the GIGAmacro system we are able to update and store collection information in a more secure and longer-lasting digital form. The advantages of this system are numerous, and we highly recommend it for museums looking to bring their collections out of the shadows and back into the light.
  3. We are now over four decades into digitally managing the names of Earth's species. As the number of federating (i.e., software that brings together previously disparate projects under a common infrastructure, for example TaxonWorks) and aggregating (e.g., International Plant Name Index, Catalog of Life (CoL)) efforts increases, there remains an unmet need both for the migration forward of old data and for the production of new, precise, and comprehensive nomenclatural catalogs. Given this context, we provide an overview of how TaxonWorks seeks to contribute to this effort, and where it might evolve in the future. In TaxonWorks, when we talk about governed names and relationships, we mean it in the sense of existing international codes of nomenclature (e.g., the International Code of Zoological Nomenclature (ICZN)). More technically, nomenclature is defined as a set of objective assertions that describe the relationships between the names given to biological taxa and the rules that determine how those names are governed. It is critical to note that this is not the same thing as the relationship between a name and a biological entity; rather, nomenclature in TaxonWorks represents the details of the (governed) relationships between names. Rather than thinking of nomenclature as changing (a verb commonly used to express frustration with biological nomenclature), it is useful to think of nomenclature as a set of data points, which grows over time. For example, when synonymy happens, we do not erase the past, but rather record a new context for the name(s) in question. The biological concept changes, but the nomenclature (names) simply keeps adding up. Behind the scenes, nomenclature in TaxonWorks is represented by a set of nodes and edges, i.e., a mathematical graph, or network (e.g., Fig. 1).
Most names (i.e., nodes in the network) are what TaxonWorks calls "protonyms," monomial epithets that are used to construct, for example, binomial names (not to be confused with "protonym" sensu the ICZN). Protonyms are linked to other protonyms via relationships defined in NOMEN, an ontology that encodes governed rules of nomenclature. Within the system, all data, nodes and edges, can be cited, i.e., linked to a source and therefore anchored in time and tied to authorship, and annotated with a variety of annotation types (e.g., notes, confidence levels, tags). The actual building of the graphs is greatly simplified by multiple user interfaces that allow scientists to review (e.g., Fig. 2), create, filter, and add to (again, not "change") the nomenclatural history. As in any complex knowledge-representation model, there are outlying scenarios, or edge cases, that emerge, making certain human tasks more complex than others. TaxonWorks is no exception: it has limitations in terms of what and how some things can be represented. While many complex representations are hidden by simplified user interfaces, some, for example the handling of the ICZN's family-group names, batch-loading of invalid relationships, and comparative syncing against external resources, need more work to simplify the processes presently required to meet catalogers' needs. The depth at which TaxonWorks can capture nomenclature is only really valuable if it can be used by others. This is facilitated by the application programming interface (API) serving its data, by serving text files, and by exports to standards like the emerging Catalog of Life Data Package. With reference to real-world problems, we illustrate different ways in which the API can be used, for example, as integrated into spreadsheets, through the use of command line scripts, and in the generation of public-facing websites.
Behind all this effort are an increasing number of people recording help videos, developing documentation, and troubleshooting software and technical issues. Major contributions have come from developers at many skill levels, from high school to senior software engineers, illustrating that TaxonWorks leads in enabling both technical and domain-based contributions. The health and growth of this community is a key factor in TaxonWorks' potential long-term impact in the effort to unify the names of Earth's species.
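The "nomenclature as an append-only graph" idea above can be sketched in a few lines. This is a minimal illustration, not the TaxonWorks schema: protonyms are nodes, governed relationships are citable edges, and a new synonymy adds an edge rather than overwriting earlier assertions (the names, relationship types, and citations below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Edge:
    subject: str   # protonym the assertion is about
    relation: str  # e.g. a NOMEN-style relationship type
    object: str    # protonym it relates to
    source: str    # citation anchoring the assertion in time

class NomenclatureGraph:
    """Append-only graph of names: history is never erased."""
    def __init__(self):
        self.nodes = set()
        self.edges = []

    def assert_relationship(self, subject, relation, obj, source):
        """Record a new cited assertion; earlier edges are kept."""
        self.nodes.update({subject, obj})
        self.edges.append(Edge(subject, relation, obj, source))

    def history(self, protonym):
        """All assertions ever made about a protonym, in order."""
        return [e for e in self.edges if e.subject == protonym]

g = NomenclatureGraph()
g.assert_relationship("aurata", "original_combination", "Sparus", "Linnaeus 1758")
g.assert_relationship("aurata", "synonym_of", "Chrysophrys", "Hypothetical 1850")
print(len(g.history("aurata")))  # 2 -- both assertions retained
```

The second assertion does not delete the first; a query for the protonym's history returns both, which is the "data points that grow over time" model the text describes.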
  4. Abstract Motivation

    MicroRNAs (miRNAs) are small RNA molecules (∼22 nucleotides long) involved in post-transcriptional gene regulation. Advances in high-throughput sequencing technologies led to the discovery of isomiRs, which are miRNA sequence variants. While many miRNA-seq analysis tools exist, the diversity of output formats hinders accurate comparisons between tools and precludes data sharing and the development of common downstream analysis methods.


    To overcome this situation, we present here a community-based project, miRNA Transcriptomic Open Project (miRTOP), working towards the optimization of miRNA analyses. The aim of miRTOP is to promote the development of downstream isomiR analysis tools that are compatible with existing detection and quantification tools. Based on the existing GFF3 format, we first created a new standard format, mirGFF3, for the output of miRNA/isomiR detection and quantification results from small RNA-seq data. Additionally, we developed a command line Python tool, mirtop, to create and manage the mirGFF3 format. Currently, mirtop can convert into mirGFF3 the outputs of commonly used pipelines, such as seqbuster, isomiR-SEA, sRNAbench, and Prost!, as well as BAM files. Some tools have also incorporated the mirGFF3 format directly into their code, such as miRge2.0, IsoMIRmap, and OptimiR. Its open architecture enables any tool or pipeline to output or convert results into mirGFF3. Collectively, this isomiR categorization system, along with the accompanying mirGFF3 and mirtop API, provides a comprehensive solution for the standardization of miRNA and isomiR annotation, enabling data sharing, reporting, comparative analyses, and benchmarking, while promoting the development of common miRNA methods focusing on downstream steps of miRNA detection, annotation, and quantification.

    Availability and implementation

    Contact

    Supplementary information

    Supplementary data are available at Bioinformatics online.

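Because mirGFF3 builds on GFF3, a record is nine tab-separated columns with a final key=value attributes column. A minimal parser for that column layout is sketched below; the attribute names and values are illustrative, not taken from a real dataset:

```python
def parse_gff3_line(line):
    """Parse one 9-column GFF3-style record into a dict.
    mirGFF3 follows the same column layout as GFF3."""
    cols = line.rstrip("\n").split("\t")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Attributes column: semicolon-separated key=value pairs.
    attributes = dict(kv.split("=", 1) for kv in attrs.split(";") if kv)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }

# Illustrative isomiR record (values are made up for the example).
line = ("hsa-let-7a-1\tmirtop\tisomiR\t4\t25\t.\t+\t.\t"
        "UID=iso-22-EXAMPLE;Variant=iso_5p:+1;Expression=10")
rec = parse_gff3_line(line)
print(rec["type"], rec["attributes"]["Variant"])  # isomiR iso_5p:+1
```

A standardized column layout like this is what lets downstream tools consume output from any pipeline without per-tool parsers.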
  5. The LabelBee system is a web application designed to facilitate the collection, annotation, and analysis of large amounts of honeybee behavior data from video monitoring. It is developed as part of the NSF BIGDATA project “Large-scale multi-parameter analysis of honeybee behavior in their natural habitat”, where we analyze continuous video of the entrance of bee colonies. Due to the large volume of data and its complexity, LabelBee provides advanced Artificial Intelligence and visualization capabilities to enable the construction of good-quality datasets necessary for the discovery of complex behavior patterns. It integrates several levels of information: raw video, honeybee positions, decoded tags, individual trajectories, and behavior events (entrance/exit, presence of pollen, fanning, etc.). This integration enables the combination of manual and automatic processing by the biologist end-users, who also share and correct their annotations through a centralized server. These annotations are used by the computer scientists to create new automatic models and improve the quality of the automatic modules. The data constructed by this semi-automated approach can then be exported for the analytic stage, which takes place on the same server using Jupyter notebooks for the extraction and exploration of behavior patterns.
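The "behavior events" level of information described above lends itself to simple tabular analysis once exported. The record layout below is an assumption for illustration, not the actual LabelBee export schema:

```python
from collections import Counter

# Illustrative per-event records (field names are hypothetical):
# one row per behavior event for a decoded-tag individual.
events = [
    {"bee_id": 101, "t": 12.4,  "event": "entrance", "pollen": True},
    {"bee_id": 101, "t": 98.0,  "event": "exit",     "pollen": False},
    {"bee_id": 207, "t": 15.1,  "event": "entrance", "pollen": False},
    {"bee_id": 101, "t": 130.2, "event": "entrance", "pollen": True},
]

def pollen_trips_per_bee(events):
    """Count entrances with pollen for each individual, the kind of
    per-bee summary a notebook-based analysis might start from."""
    return Counter(e["bee_id"] for e in events
                   if e["event"] == "entrance" and e["pollen"])

print(pollen_trips_per_bee(events))  # Counter({101: 2})
```

Summaries like this, computed in the Jupyter notebooks mentioned in the text, are the bridge from annotated events to behavior-pattern analysis.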