
Title: Collections Management and High-Throughput Digitization using Distributed Cyberinfrastructure Resources
Collections digitization relies increasingly upon computational and data management resources that occasionally exceed the capacity of natural history collections and their managers and curators. Digitization of many tens of thousands of micropaleontological specimen slides, as evidenced by the effort presented here by the Indiana University Paleontology Collection, has been a concerted effort in adherence to the recommended practices of multifaceted aspects of collections management for both physical and digital collections resources. This presentation highlights the contributions of distributed cyberinfrastructure from the National Science Foundation-supported Extreme Science and Engineering Discovery Environment (XSEDE) for web-hosting of collections management system resources and distributed processing of millions of digital images and metadata records of specimens from our collections. The Indiana University Center for Biological Research Collections is currently hosting its instance of the Specify collections management system (CMS) on a virtual server hosted on Jetstream, the cloud service for on-demand computational resources as provisioned by XSEDE. This web-service allows the CMS to be flexibly hosted on the cloud with additional services that can be provisioned on an as-needed basis for generating and integrating digitized collections objects in both web-friendly and digital preservation contexts. On-demand computing resources can be used for the manipulation of digital images for automated file I/O, scripted renaming of files for adherence to file naming conventions, derivative generation, and backup to our local tape archive for digital disaster preparedness and long-term storage.
Here, we will present our strategies for facilitating reproducible workflows for general collections digitization of the IUPC nomenclatorial types and figured specimens in addition to the gigapixel resolution photographs of our large collection of microfossils using our GIGAmacro system (e.g., this slide of conodonts). We aim to demonstrate the flexibility and nimbleness of cloud computing resources for replicating this, and other, workflows to enhance the findability, accessibility, interoperability, and reproducibility of the data and metadata contained within our collections.
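The abstract mentions scripted renaming of files for adherence to file naming conventions but does not state the convention itself. As a minimal sketch only, the Python below assumes a hypothetical convention, `IUPC_<7-digit catalog number>_<2-digit view>.<ext>`, and hypothetical raw capture names of the form `<catalog number>-<view>`. The prefix, field widths, and the function names `conventional_name` and `plan_renames` are illustrative assumptions, not the authors' workflow.

```python
import re
from pathlib import Path

# Hypothetical convention (an assumption, not taken from the abstract):
#   IUPC_<7-digit catalog number>_<2-digit view>.<lowercase extension>
PREFIX = "IUPC"

# Hypothetical raw capture names: "<catalog><separator><view>", e.g. "12345-2"
RAW_NAME = re.compile(r"^(?P<catalog>\d+)[ _-]+(?P<view>\d+)$")


def conventional_name(raw_filename: str) -> str:
    """Map a raw capture name to the assumed naming convention."""
    path = Path(raw_filename)
    match = RAW_NAME.match(path.stem)
    if match is None:
        raise ValueError(f"unrecognized raw file name: {raw_filename!r}")
    catalog = int(match.group("catalog"))
    view = int(match.group("view"))
    return f"{PREFIX}_{catalog:07d}_{view:02d}{path.suffix.lower()}"


def plan_renames(raw_filenames):
    """Build an {old: new} mapping without touching the disk.

    Raises ValueError if two inputs would map to the same new name,
    so that a later rename step cannot overwrite a file.
    """
    plan = {}
    seen = set()
    for old in raw_filenames:
        new = conventional_name(old)
        if new in seen:
            raise ValueError(f"two files map to {new!r}")
        seen.add(new)
        plan[old] = new
    return plan
```

Separating the planning step from the actual rename keeps the mapping reviewable and repeatable before any file is changed; a later step could apply the plan with `Path.rename`.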
Journal Name:
Biodiversity Information Science and Standards
Sponsoring Org:
National Science Foundation
More Like this
  1. Natural history collections are often considered remote and inaccessible without special permission from curators. Digitization of these collections can make them much more accessible to researchers, educators, and general enthusiasts alike, thereby removing the stigma of a lonely specimen on a dusty shelf in the back room of a museum that will never again see the light of day. We are in the process of digitizing the microfossils of the Indiana University Paleontology Collection using the GIGAmacro Magnify2 Robotic Imaging System. This suite of software and hardware allows us to automate photography and post-production of high-resolution images, thereby greatly reducing the amount of time and labor needed to serve the data. Our hardware includes a Canon T6i 24 megapixel DSLR, a Canon MPE 65mm 1X to 5X lens, and a Canon MT26EX Dual Flash, all mounted on a lead system made with high performance precision IGUS Drylin anodized aluminum. The camera and its mount move over the tray of microfossil slides using bearings and rails. The software includes the GIGAmacro Capture Software (photography), GIGAmacro Viewer Software (display and annotation), Zerene Stacker (focus stacking), and Autopano GIGA (stitching). All of the metadata is kept in association with the images, uploaded to Notes from Nature, and transcribed by community scientists; everything is then stored in the image archive, Imago. In ~460 hours we have photographed ~10,500 slides and have completed ~65% of our microfossil collection. Using the GIGAmacro system we are able to update and store collection information in a more secure and longer lasting digital form. The advantages of this system are numerous, and we highly recommend it for museums that are looking to bring their collections out of the shadows and back into the light.
  2. The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish another major goal, supporting modern search and browse capabilities for a large collection of tweets from the Twitter social media platform, web pages, and electronic theses and dissertations (ETDs). The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization is accomplished with containers in a pipelined fashion, whether in the cluster or on virtual machines, for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system supports text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by five teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. The teams on this project include three collection management groups -- Electronic Theses and Dissertations (ETD), Tweets (TWT), and Web-Pages (WP) -- as well as the Front-end (FE) group and the Integration (INT) group to help provide the overarching structure for the application. This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure.
Each container is a customized application environment that is specific to the needs of the corresponding team. Each team will have several of these containers set up in a pipeline formation to allow scaling and extension of the current system. The INT team also contributes to a cross-team effort for exploring the use of Elasticsearch and its internally associated database. The INT team administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and the Ceph filesystem. During formative stages of development, the INT team also has a role in guiding team evaluations of prospective container components and workflows. The INT team is responsible for the overall project architecture and facilitating the tools and tutorials that assist the other teams in deploying containers in a development environment according to mutual specifications agreed upon with each team. The INT team maintains the status of the Kubernetes cluster, deploying new containers and pods as needed by the collection management teams as they expand their workflows. This team is responsible for utilizing a continuous integration process to update existing containers. During the development stage the INT team collaborates specifically with the collection management teams to create the pipeline for the ingestion and processing of new collection documents, crossing services between those teams as needed. The INT team develops a reasoner engine to construct workflows with information goal as input, which are then programmatically authored, scheduled, and monitored using Apache Airflow. The INT team is responsible for the flow, management, and logging of system performance data and making any adjustments necessary based on the analysis of testing results. 
The INT team has established a Gitlab repository for archival code related to the entire project and has provided the other groups with the documentation to deposit their code in the repository. This repository will be expanded using Gitlab CI in order to provide continuous integration and testing once it is available. Finally, the INT team will provide a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. The INT team will archive this distribution on the Virginia Tech Docker Container Registry and deploy it on the Virginia Tech CS Cloud. The INT-2020 team owes a sincere debt of gratitude to the work of the INT-2019 team. This is a very large undertaking and the wrangling of all of the products and processes would not have been possible without their guidance in both direct and written form. We have relied heavily on the foundation they and their predecessors have provided for us. We continue their work with systematic improvements, but also want to acknowledge their efforts Ibid. Without them, our progress to date would not have been possible.
  3. The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally, index these tweets into Elasticsearch to browse and query. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work encompassed using a sample collection, provided via Google Drive. An NFS-based persistent storage was later involved to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services to allow for parsing and cleaning of raw API data, and backup of data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster to integrate services into the full search engine workflow. A service is created to convert WARC files to JSON for reading archive files into the application. Unit testing of services is complete and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index. Future work could involve parallelization of the extraction of metadata, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format.
Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system.
  4. Abstract. As cloud-based web services get more and more capable, available, and powerful (CAP), data science and engineering is pulled toward the frontline because DATA means almost anything-as-a-service (XaaS) via Digital Archiving and Transformed Analytics. In general, a web service (via a website) serves customers with web documents in HTML, JSON, XML, and multimedia via interactive (request) and responsive (reply) ways for specific domain problem solving over the Internet. In particular, a web service is deeply involved with UI & UX (user interface and user experience) plus considerate regulations on QoS (Quality of Service) as well, which refers to both information synthesis and security, namely availability and reliability for providential web services. This paper, based on the novel wiseCIO as a Platform-as-a-Service (PaaS), presents digital archiving and transformed analytics (DATA) via machine learning, one of the most practical aspects of artificial intelligence. Machine learning is the science of data analysis that automates analytical model building and online analytical processing (OLAP) that enables computers to act without being explicitly programmed through CTMP. Computational thinking combined with manageable processing is thoroughly discussed and utilized for FAST solutions in a feasible, analytical, scalable and testable approach. DATA is central to information synthesis and analytics (ISA), and digitized archives play a key role in transformed analytics on intelligence for business, education and entertainment (iBEE).
Case studies as applicable examples are discussed over broad fields where archival digitization is required for analytical transformation via machine learning, such as scalable ARM (archival repository for manageable accessibility), visual BUS (biological understanding from STEM), schooling DIGIA (digital intelligence governing instruction and administering), viewable HARP (historical archives & religious preachings), vivid MATH (mathematical apps in teaching and hands-on exercise), and SHARE (studies via hands-on assignment, revision and evaluation). As a result, wiseCIO promotes DATA service by providing ubiquitous web services of analytical processing via universal interface and user-centric experience in favor of logical organization of web content and relational information groupings that are vital steps in the ability of an archivist or librarian to recommend and retrieve information for a researcher. More important, wiseCIO also plays a key role as a content management system and delivery platform with capacity of hosting 10,000+ traditional web pages with great ease.
  5. Elmer Ottis Wooton (1865–1945) was one of the most important early botanists to work in the Southwestern United States, contributing a great deal of natural history knowledge and botanical research on the flora of New Mexico that shaped many naturalists and scientists for generations. The extensive Wooton legacy includes herbarium collections that he and his famous student Paul Carpenter Standley (1884–1963), prolific botanist and explorer, used for the first Flora of New Mexico (Wooton and Standley 1915), along with resources covering botany and range management strategies for the northern Chihuahuan Desert, and an extensive, yet to be digitized, historical archive of correspondence, field notes, vegetation sketches, photographs, and lantern slides, all from his travels and field work in the region. Starting in 1890, the most complete set of Wooton’s herbarium collections were deposited in the NMC herbarium at New Mexico State University (NMSU), and his archives, now stored in a campus library, have together been underutilized, offline resources. The goals of this ongoing project are to secure, preserve, and promote Wooton’s important historical resources, by fleshing out the botanical history of the region, raising appreciation of herbarium collections within the community, and emphasizing their unique role in facilitating contemporary research aimed at addressing pressing scientific questions such as vegetation responses to global climate change. Students and the general public involved in this project are engaged through hands-on activities including cataloging, databasing and digitization of nearly 10,000 herbarium specimens and Wooton’s archives. These outputs, combined with contemporary data collection and computational biology techniques from an ecological perspective, are being used to document vegetation changes in iconic, climate-sensitive, high-elevation mountainous ecosystems present in southwestern New Mexico.
In a later phase of the project, a variety of public audiences will participate through interactive online story maps and citizen science programs such as iNaturalist, Notes from Nature, and BioBlitz. Images of herbarium specimens will be shared via an online database and other relevant biodiversity portals (Symbiota, iDigBio, JSTOR). Community members reached through this project will be better-informed citizens, who may go on to become new stewards of natural history collections, with the potential to influence policies safeguarding the future of our planet’s biodiversity. More locally, the project will support the management of Organ Mountains Desert Peaks National Monument, which was established in 2014 to protect the area's human and environmental resources, and for which knowledge and data are currently limited.