TaxonWorks is an open-source workbench for biodiversity researchers. TaxonWorks has several years of development behind it; we highlight its present status and discuss if and when it makes sense to release a version 1.0, i.e. software completed to a specific stage. TaxonWorks' scope is broad; it seeks to touch nearly all areas that might be of interest to taxonomists, i.e. those who integrate everything that is known about a taxon into a single resource. Its role as a software platform is placed in a broader context, where many instances of TaxonWorks can each support multiple research projects. Instances may be supported by individuals or organizations. A suite of technical tools, including containerization and unit tests, facilitates collaboration at many different levels. Because TaxonWorks is a research tool, mechanisms for analyzing the results of data curation, including its application programming interface, are described. The long-term development of TaxonWorks is supported by an endowment to the Species File Group. Its source is available on GitHub.
Specimens, Databases, and Accession Books: Using TaxonWorks to Integrate Multiple Sources of Modern and Historical Data in the INHS Insect Collection
Grant-supported digitization projects over the past 20 years at the Illinois Natural History Survey (INHS) have yielded over 1,000,000 occurrence records (representing over 2.7 million specimens), making this one of the most successful digitization efforts within the United States. However, receiving multiple grants at the cutting edge has led to numerous projects left at various stages of completeness, several relational databases, orphaned data, and specimens at various stages of curation. TaxonWorks (taxonworks.org), an integrated web-based workbench developed by the Species File Group and supported by the INHS and the National Science Foundation, has provided the digital infrastructure to unify multiple workflows, projects, databases, and even historical accession books into one easy-to-access, open-source platform. We demonstrate the practical utility of this platform and summarize past, present, and future efforts at the INHS towards integrating all our data within TaxonWorks.
- Award ID(s): 1639601
- PAR ID: 10079958
- Date Published:
- Journal Name: Biodiversity Information Science and Standards
- Volume: 2
- ISSN: 2535-0897
- Page Range / eLocation ID: e25896
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
TaxonWorks (http://taxonworks.org) is an integrated, open-source, cybertaxonomic web application serving taxonomists and biodiversity scientists. It is designed to facilitate efficient data capture, storage, manipulation, and retrieval. It integrates a wide variety of data types used by biodiversity scientists, including, but not limited to, taxonomy (with validation based on the codes of zoological, botanical, bacterial, and viral nomenclature), specimen data, bibliographies, media (images, PDFs, sounds, videos), morphology (character/trait matrices), distribution, and biological associations. Available TaxonWorks web interfaces currently provide various data entry forms for simple and advanced querying of the database. TaxonWorks has integrated batch uploader functionality, but for larger datasets, specialized migration scripts were used. Several projects, historically built in 3i (http://dmitriev.speciesfile.org), MX (http://mx.phenomix.org), SpeciesFiles (http://software.speciesfile.org), and other databases, have been or are being migrated into TaxonWorks. Among the projects moving into TaxonWorks are the 3i World Auchenorrhyncha Database, LepIndex, Universal Chalcidoidea Database, Orthoptera SpeciesFile, Plecoptera SpeciesFile, the Illinois Natural History Survey Insect Collection database, and several others. Experience with the data migration will be shared during the presentation.
Over 300 million arthropod specimens are housed in North American natural history collections. These collections represent a “vast hidden treasure trove” of biodiversity: 95% of the specimen label data have yet to be transcribed for research, and less than 2% of the specimens have been imaged. Specimen labels contain crucial information to determine species distributions over time and are essential for understanding patterns of ecology and evolution, which will help assess the growing biodiversity crisis driven by global change impacts. Specimen images offer indispensable insight and data for analyses of traits, and ecological and phylogenetic patterns of biodiversity. Here, we review North American arthropod collections using two key metrics, specimen holdings and digitization efforts, to assess the potential for collections to provide needed biodiversity data. We include data from 223 arthropod collections in North America, with an emphasis on the United States. Our specific findings are as follows: (1) The majority of North American natural history collections (88%) and specimens (89%) are located in the United States. Canada has comparable holdings to the United States relative to its estimated biodiversity. Mexico has made the furthest progress in terms of digitization, but its specimen holdings should be increased to reflect the estimated higher Mexican arthropod diversity. The proportion of North American collections that has been digitized, and the number of digital records available per species, are both much lower for arthropods when compared to chordates and plants. (2) The National Science Foundation’s decade-long ADBC program (Advancing Digitization of Biological Collections) has been transformational in promoting arthropod digitization. However, even if this program became permanent, at current rates, by the year 2050 only 38% of the existing arthropod specimens would be digitized, and less than 1% would have associated digital images.
(3) The number of specimens in collections has increased by approximately 1% per year over the past 30 years. We propose that this rate of increase is insufficient to provide enough data to address biodiversity research needs, and that arthropod collections should aim to triple their rate of new specimen acquisition. (4) The collections we surveyed in the United States vary broadly in a number of indicators. Collectively, there is depth and breadth, with smaller collections providing regional depth and larger collections providing greater global coverage. (5) Increased coordination across museums is needed for digitization efforts to target taxa for research and conservation goals and address long-term data needs. Two key recommendations emerge: collections should significantly increase both their specimen holdings and their digitization efforts to empower continental and global biodiversity data pipelines, and stimulate downstream research.
Natural history collections (NHCs) are the foundation of historical baselines for assessing anthropogenic impacts on biodiversity. Along these lines, the online mobilization of specimens via digitization (the conversion of specimen data into accessible digital content) has greatly expanded the use of NHCs across a diversity of disciplines. We broaden the current vision of digitization (Digitization 1.0), whereby specimens are digitized within NHCs, to include new approaches that rely on digitized products rather than the physical specimen (Digitization 2.0). Digitization 2.0 builds on the data, workflows, and infrastructure produced by Digitization 1.0 to create digital-only workflows that facilitate digitization, curation, and data links, thus returning value to physical specimens by creating new layers of annotation, empowering a global community, and developing automated approaches to advance biodiversity discovery and conservation. These efforts will transform large-scale biodiversity assessments to address fundamental questions, including those pertaining to critical issues of global change.
INTRODUCTION: CRSS-UTDallas initiated and oversaw the efforts to recover Apollo mission communications by re-engineering the NASA SoundScriber playback system and digitizing 30-channel analog audio tapes, covering the entire Apollo-11, Apollo-13, and Gemini-8 missions, during 2011-17 [1,6]. This vast data resource was made publicly available along with supplemental speech and language technologies (SLT) metadata based on CRSS pipeline diarization transcripts and conversational speaker time-stamps for the Apollo team at the NASA Mission Control Center [2,4]. Renewed efforts over the past year (2021) have resulted in the digitization of an additional 50,000+ hours of audio from the Apollo 7, 8, 9, 10, and 12 missions and the remaining Apollo-13 tapes. Cumulative digitization efforts have enabled the development of the largest publicly available speech data resource with unprompted, real conversations recorded in naturalistic environments. Deployment of this massive corpus has inspired multiple collaborative initiatives, such as the web resources ExploreApollo (https://app.exploreapollo.org) and LanguageARC (https://languagearc.com/projects/21) [3]. ExploreApollo.org serves as the visualization and playback tool, and LanguageARC as the crowd-sourced subject content tagging resource developed by undergraduate and graduate students, intended as an educational resource for K-12 students and STEM/Apollo enthusiasts. Significant algorithmic advancements have included advanced deep learning models that can now improve automatic transcript generation quality and even extract high-level knowledge, such as ID labels of topics being spoken across different mission stages. Efficient transcript generation and topic extraction tools for this naturalistic audio have wide applications, including content archival and retrieval, speaker indexing, education, group dynamics, and team cohesion analysis. Some of these applications have been deployed in our online portals to provide a more immersive experience for students and researchers.
Continued worldwide outreach in the form of the Fearless Steps Challenges has proven successful with the most recent Phase-4 of the Challenge series. This challenge has motivated research in low-level tasks such as speaker diarization and high-level tasks like topic identification. IMPACT: Distribution and visualization of the Apollo audio corpus through the above-mentioned online portals and Fearless Steps Challenges have produced significant impact as a STEM education resource for K-12 students as well as an SLT development resource with real-world applications for research organizations globally. The speech technologies developed by CRSS-UTDallas using the Fearless Steps Apollo corpus have improved previous benchmarks on multiple tasks [1, 5]. The continued initiative will extend the current digitization efforts to include over 150,000 hours of audio recorded during all Apollo missions. ILLUSTRATION: We will demonstrate the WebExploreApollo and LanguageARC online portals with newly digitized audio playback, in addition to improved SLT baseline systems and the results from ASR and Topic Identification systems, which will include research performed on the conversational corpus. Performance analysis visualizations will also be illustrated. We will also display results from the past challenges and their state-of-the-art system improvements.

