skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Automatic Metadata Generation for Fish Specimen Image Collections
Conference Title: 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Conference Start Date: 2021, Sept. 27 Conference End Date: 2021, Sept. 30 Conference Location: Champaign, IL, USAMetadata are key descriptors of research data, particularly for researchers seeking to apply machine learning (ML) to the vast collections of digitized specimens. Unfortunately, the available metadata is often sparse and, at times, erroneous. Additionally, it is prohibitively expensive to address these limitations through traditional, manual means. This paper reports on research that applies machine-driven approaches to analyzing digitized fish images and extracting various important features from them. The digitized fish specimens are being analyzed as part of the Biology Guided Neural Networks (BGNN) initiative, which is developing a novel class of artificial neural networks using phylogenies and anatomy ontologies. Automatically generated metadata is crucial for identifying the high-quality images needed for the neural network's predictive analytics. Methods that combine ML and image informatics techniques allow us to rapidly enrich the existing metadata associated with the 7,244 images from the Illinois Natural History Survey (INHS) used in our study. Results show we can accurately generate many key metadata properties relevant to the BGNN project, as well as general image quality metrics (e.g. brightness and contrast). Results also show that we can accurately generate bounding boxes and segmentation masks for fish, which are needed for subsequent machine learning analyses. The automatic process outperforms humans in terms of time and accuracy, and provides a novel solution for leveraging digitized specimens in ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories worldwide.  more » « less
Award ID(s):
1940233 2118240
PAR ID:
10383343
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings/2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
Page Range / eLocation ID:
31 to 40
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A new science discipline has emerged within the last decade at the intersection of informatics, computer science and biology:Imageomics. Like most other -omics fields, Imageomics also uses emerging technologies to analyze biological data but from the images. One of the most applied data analysis methods for image datasets is Machine Learning (ML). In 2019, we started working on a United States National Science Foundation (NSF) funded project, known as Biology Guided Neural Networks (BGNN) with the purpose of extracting information about biology by using neural networks and biological guidance such as species descriptions, identifications, phylogenetic trees and morphological annotations (Bart et al. 2021). Even though the variety and abundance of biological data is satisfactory for some ML analysis and the data are openly accessible, researchers still spend up to 80% of their time preparing data into a usable, AI-ready format, leaving only 20% for exploration and modeling (Long and Romanoff 2023). For this reason, we have built a dataset composed of digitized fish specimens, taken either directly from collections or from specialized repositories. The range of digital representations we cover is broad and growing, from photographs and radiographs, to CT scans, and even illustrations. We have added new groups of vocabularies to the dataset management system including image quality metadata, extended image metadata and batch metadata. With the image quality metadata and extended image metadata, we aimed to extract information from the digital objects that can possibly help ML scientists in their research with filtering, image processing and object recognition routines. Image quality metadata provides information about objects contained in the image, features and condition of the specimen, and some basic visual properties of the image, while extended image metadata provides information about technical properties of the digital file and the digital multimedia object (Bakış et al. 2021, Karnani et al. 2022, Leipzig et al. 2021, Pepper et al. 2021, Wang et al. 2021) (see details on Fish-AIR vocabulary web page). Batch metadata is used for separating different datasets and facilitates downloading and uploading data in batches with additional batch information and supplementary files. Additional flexibility, built into the database infrastructure using an RDF framework, will enable the system to host different taxonomic groups, which might require new metadata features (Jebbia et al. 2023). By the combination of these features, along with FAIR (Findable, Accessable, Interoperable, Reusable) principles, and reproducibility, we provide Artificial Intelligence Readiness (AIR; Long and Romanoff 2023) to the dataset. Fish-AIR provides an easy-to-access, filtered, annotated and cleaned biological dataset for researchers from different backgrounds and facilitates the integration of biological knowledge based on digitized preserved specimens into ML pipelines. Because of the flexible database infrastructure and addition of new datasets, researchers will also be able to access additional types of data—such as landmarks, specimen outlines, annotated parts, and quality scores—in the near future. Already, the dataset is the largest and most detailed AI-ready fish image dataset with integrated Image Quality Management System (Jebbia et al. 2023, Wang et al. 2021). 
    more » « less
  2. Adam, N.; Neuhold, E.; Furuta, R. (Ed.)
    Metadata is a key data source for researchers seeking to apply machine learning (ML) to the vast collections of digitized biological specimens that can be found online. Unfortunately, the associated metadata is often sparse and, at times, erroneous. This paper extends previous research conducted with the Illinois Natural History Survey (INHS) collection (7244 specimen images) that uses computational approaches to analyze image quality, and then automatically generates 22 metadata properties representing the image quality and morphological features of the specimens. In the research reported here, we demonstrate the extension of our initial work to University of the Wisconsin Zoological Museum (UWZM) collection (4155 specimen images). Further, we enhance our computational methods in four ways: (1) augmenting the training set, (2) applying contrast enhancement, (3) upscaling small objects, and (4) refining our processing logic. Together these new methods improved our overall error rates from 4.6 to 1.1%. These enhancements also allowed us to compute an additional set of 17 image-based metadata properties. The new metadata properties provide supplemental features and information that may also be used to analyze and classify the fish specimens. Examples of these new features include convex area, eccentricity, perimeter, skew, etc. The newly refined process further outperforms humans in terms of time and labor cost, as well as accuracy, providing a novel solution for leveraging digitized specimens with ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories world-wide by generating accurate and valuable metadata for those repositories. 
    more » « less
  3. Garoufallou, E. (Ed.)
    Flexible metadata pipelines are crucial for supporting the FAIR data principles. Despite this need, researchers seldom report their approaches for identifying metadata standards and protocols that sup-port optimal flexibility. This paper reports on an initiative targeting the development of a flexible metadata pipeline for a collection contain-ing over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving au-tomated species identification, segmentation and trait extraction. The paper provides contextual background, followed by the presentation of a four-phased approach involving: 1. Assessment of the Problem, 2. Inves-tigation of Solutions, 3. Implementation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) project and the HDR Imageomics Institute. An RDF graph prototype pipeline is presented, followed by a discussion of research implications and conclusion summarizing the re-sults.ite this need, researchers seldom report their approaches for identi-fying metadata standards and protocols that support optimal flexibility. This paper reports on an initiative targeting the development of a flex-ible metadata pipeline for a collection containing over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving automated species identification, segmentation and trait extraction. The paper provides contextual back-ground, followed by the presentation of a four-phased approach involving: 1. Assessment of the Problem, 2. Investigation of Solutions, 3. Implemen-tation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) 
    more » « less
  4. Garoufallou E., Ovalle-Perandones MA. (Ed.)
    Biodiversity image repositories are crucial sources for training machine learning approaches to support biological research. Metadata about object (e.g. image) quality is a putatively important prerequisite to selecting samples for these experiments. This paper reports on a study demonstrating the importance of image quality metadata for a species classification experiment involving a corpus of 1935 fish specimen images which were annotated with 22 metadata quality properties. A small subset of high quality images produced an F1 accuracy of 0.41 compared to 0.35 for a taxonomically matched subset low quality images when used by a convolutional neural network approach to species identification. Using the full corpus of images revealed that image quality differed between correctly classified and misclassified images. We found anatomical feature visibility was the most important quality feature for classification accuracy. We suggest biodiversity image repositories consider adopting a minimal set of image quality metadata to support machine learning. 
    more » « less
  5. null; null (Ed.)
    Biodiversity image repositories are crucial sources of training data for machine learning approaches to biological research. Metadata, specifically metadata about object quality, is putatively an important prerequisite to selecting sample subsets for these experiments. This study demonstrates the importance of image quality metadata to a species classification experiment involving a corpus of 1935 fish specimen images which were annotated with 22 metadata quality properties. A small subset of high quality images produced an F1 accuracy of 0.41 compared to 0.35 for a taxonomically matched subset of low quality images when used by a convolutional neural network approach to species identification. Using the full corpus of images revealed that image quality differed between correctly classified and misclassified images. We found the visibility of all anatomical features was the most important quality feature for classification accuracy. We suggest biodiversity image repositories consider adopting a minimal set of image quality metadata to support future machine learning projects. 
    more » « less