skip to main content


Title: Portals for Interactive Steering of HPC Workflows
High performance computing workloads often benefit from human in the loop interactions. Steps in complex pipelines ranging from quality control to parameter adjustments are critical to the successful and efficient completion of modern problems. We give several example workflows in bioinformatics and deep learning where computing decisions are made throughout the processing pipelines ultimately changing the course of the compute. We also show how users can interact with the pipeline using Open OnDemand plus XDMoD or Plot.ly.  more » « less
Award ID(s):
1835725
NSF-PAR ID:
10300449
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
Communications in computer and information science
Volume:
1190
Issue:
1
ISSN:
1865-0929
Page Range / eLocation ID:
179-189
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As phenomics data volume and dimensionality increase due to advancements in sensor technology, there is an urgent need to develop and implement scalable data processing pipelines. Current phenomics data processing pipelines lack modularity, extensibility, and processing distribution across sensor modalities and phenotyping platforms. To address these challenges, we developed PhytoOracle (PO), a suite of modular, scalable pipelines for processing large volumes of field phenomics RGB, thermal, PSII chlorophyll fluorescence 2D images, and 3D point clouds. PhytoOracle aims to ( i ) improve data processing efficiency; ( ii ) provide an extensible, reproducible computing framework; and ( iii ) enable data fusion of multi-modal phenomics data. PhytoOracle integrates open-source distributed computing frameworks for parallel processing on high-performance computing, cloud, and local computing environments. Each pipeline component is available as a standalone container, providing transferability, extensibility, and reproducibility. The PO pipeline extracts and associates individual plant traits across sensor modalities and collection time points, representing a unique multi-system approach to addressing the genotype-phenotype gap. To date, PO supports lettuce and sorghum phenotypic trait extraction, with a goal of widening the range of supported species in the future. At the maximum number of cores tested in this study (1,024 cores), PO processing times were: 235 minutes for 9,270 RGB images (140.7 GB), 235 minutes for 9,270 thermal images (5.4 GB), and 13 minutes for 39,678 PSII images (86.2 GB). These processing times represent end-to-end processing, from raw data to fully processed numerical phenotypic trait data. Repeatability values of 0.39-0.95 (bounding area), 0.81-0.95 (axis-aligned bounding volume), 0.79-0.94 (oriented bounding volume), 0.83-0.95 (plant height), and 0.81-0.95 (number of points) were observed in Field Scanalyzer data. We also show the ability of PO to process drone data with a repeatability of 0.55-0.95 (bounding area). 
    more » « less
  2. null (Ed.)
    Abstract Real-time execution of machine learning (ML) pipelines on radiology images is difficult due to limited computing resources in clinical environments, whereas running them in research clusters requires efficient data transfer capabilities. We developed Niffler, an open-source Digital Imaging and Communications in Medicine (DICOM) framework that enables ML and processing pipelines in research clusters by efficiently retrieving images from the hospitals’ PACS and extracting the metadata from the images. We deployed Niffler at our institution (Emory Healthcare, the largest healthcare network in the state of Georgia) and retrieved data from 715 scanners spanning 12 sites, up to 350 GB/day continuously in real-time as a DICOM data stream over the past 2 years. We also used Niffler to retrieve images bulk on-demand based on user-provided filters to facilitate several research projects. This paper presents the architecture and three such use cases of Niffler. First, we executed an IVC filter detection and segmentation pipeline on abdominal radiographs in real-time, which was able to classify 989 test images with an accuracy of 96.0%. Second, we applied the Niffler Metadata Extractor to understand the operational efficiency of individual MRI systems based on calculated metrics. We benchmarked the accuracy of the calculated exam time windows by comparing Niffler against the Clinical Data Warehouse (CDW). Niffler accurately identified the scanners’ examination timeframes and idling times, whereas CDW falsely depicted several exam overlaps due to human errors. Third, with metadata extracted from the images by Niffler, we identified scanners with misconfigured time and reconfigured five scanners. Our evaluations highlight how Niffler enables real-time ML and processing pipelines in a research cluster. 
    more » « less
  3. null (Ed.)
    Abstract Background High-throughput sequencing has increased the number of available microbial genomes recovered from isolates, single cells, and metagenomes. Accordingly, fast and comprehensive functional gene annotation pipelines are needed to analyze and compare these genomes. Although several approaches exist for genome annotation, these are typically not designed for easy incorporation into analysis pipelines, do not combine results from different annotation databases or offer easy-to-use summaries of metabolic reconstructions, and typically require large amounts of computing power for high-throughput analysis not available to the average user. Results Here, we introduce MicrobeAnnotator, a fully automated, easy-to-use pipeline for the comprehensive functional annotation of microbial genomes that combines results from several reference protein databases and returns the matching annotations together with key metadata such as the interlinked identifiers of matching reference proteins from multiple databases [KEGG Orthology (KO), Enzyme Commission (E.C.), Gene Ontology (GO), Pfam, and InterPro]. Further, the functional annotations are summarized into Kyoto Encyclopedia of Genes and Genomes (KEGG) modules as part of a graphical output (heatmap) that allows the user to quickly detect differences among (multiple) query genomes and cluster the genomes based on their metabolic similarity. MicrobeAnnotator is implemented in Python 3 and is freely available under an open-source Artistic License 2.0 from https://github.com/cruizperez/MicrobeAnnotator . Conclusions We demonstrated the capabilities of MicrobeAnnotator by annotating 100 Escherichia coli and 78 environmental Candidate Phyla Radiation (CPR) bacterial genomes and comparing the results to those of other popular tools. We showed that the use of multiple annotation databases allows MicrobeAnnotator to recover more annotations per genome compared to faster tools that use reduced databases and is computationally efficient for use in personal computers. The output of MicrobeAnnotator can be easily incorporated into other analysis pipelines while the results of other annotation tools can be seemingly incorporated into MicrobeAnnotator to generate summary plots. 
    more » « less
  4. Abstract

    Pressing environmental research questions demand the integration of increasingly diverse and large‐scale ecological datasets as well as complex analytical methods, which require specialized tools and resources.

    Computational training for ecological and evolutionary sciences has become more abundant and accessible over the past decade, but tool development has outpaced the availability of specialized training. Most training for scripted analyses focuses on individual analysis steps in one script rather than creating a scripted pipeline, where modular functions comprise an ecosystem of interdependent steps. Although current computational training creates an excellent starting place, linear styles of scripting can risk becoming labor‐ and time‐intensive and less reproducible by often requiring manual execution. Pipelines, however, can be easily automated or tracked by software to increase efficiency and reduce potential errors. Ecology and evolution would benefit from techniques that reduce these risks by managing analytical pipelines in a modular, readily parallelizable format with clear documentation of dependencies.

    Workflow management software (WMS) can aid in the reproducibility, intelligibility and computational efficiency of complex pipelines. To date, WMS adoption in ecology and evolutionary research has been slow. We discuss the benefits and challenges of implementing WMS and illustrate its use through a case study with thetargets rpackage to further highlight WMS benefits through workflow automation, dependency tracking and improved clarity for reviewers.

    Although WMS requires familiarity with function‐oriented programming and careful planning for more advanced applications and pipeline sharing, investment in training will enable access to the benefits of WMS and impart transferable computing skills that can facilitate ecological and evolutionary data science at large scales.

     
    more » « less
  5. Moreno-Hagelsieb, Gabriel (Ed.)
    Advances in the analysis of amplicon sequence datasets have introduced a methodological shift in how research teams investigate microbial biodiversity, away from sequence identity-based clustering (producing Operational Taxonomic Units, OTUs) to denoising methods (producing amplicon sequence variants, ASVs). While denoising methods have several inherent properties that make them desirable compared to clustering-based methods, questions remain as to the influence that these pipelines have on the ecological patterns being assessed, especially when compared to other methodological choices made when processing data (e.g. rarefaction) and computing diversity indices. We compared the respective influences of two widely used methods, namely DADA2 (a denoising method) vs. Mothur (a clustering method) on 16S rRNA gene amplicon datasets (hypervariable region v4), and compared such effects to the rarefaction of the community table and OTU identity threshold (97% vs. 99%) on the ecological signals detected. We used a dataset comprising freshwater invertebrate (three Unionidae species) gut and environmental (sediment, seston) communities sampled in six rivers in the southeastern USA. We ranked the respective effects of each methodological choice on alpha and beta diversity, and taxonomic composition. The choice of the pipeline significantly influenced alpha and beta diversities and changed the ecological signal detected, especially on presence/absence indices such as the richness index and unweighted Unifrac. Interestingly, the discrepancy between OTU and ASV-based diversity metrics could be attenuated by the use of rarefaction. The identification of major classes and genera also revealed significant discrepancies across pipelines. Compared to the pipeline’s effect, OTU threshold and rarefaction had a minimal impact on all measurements. 
    more » « less