Title: Challenges and Opportunities in Big Data Research: Outcomes from the Second Annual Joint PI Meeting of the NSF BIGDATA Research Program and the NSF Big Data Regional Innovation Hubs and Spokes Programs 2018
Award ID(s):
1834405
NSF-PAR ID:
10113364
Author(s) / Creator(s):
Date Published:
Journal Name:
NSF Workshop Reports
ISSN:
9999-999X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This special session will report on the updated NSF/IEEE-TCPP Curriculum on Parallel and Distributed Computing released in Nov 2020 by the Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER). The purpose of the special session is to obtain SIGCSE community feedback on this curriculum in a highly interactive manner employing the hybrid modality and supported by a full-time CDER booth for the duration of SIGCSE. In this era of big data, cloud, and multi- and many-core systems, it is essential that computer science (CS) and computer engineering (CE) graduates have basic skills in parallel and distributed computing (PDC). The topics are primarily organized into the areas of architecture, programming, and algorithms. A set of pervasive concepts that percolate across area boundaries is also identified. Version 1 of this curriculum was released in December 2012. That curriculum guideline has over 140 early adopter institutions worldwide and has been incorporated into the 2013 ACM/IEEE Computer Science curricula. This Version-II represents a major revision. The updates have focused on enhancing coverage related to the topical aspects of Big Data, Energy, and Distributed Computing. The session will also report on related CDER activities including a workshop series on a PDC institute conceptualization, developing a CE-oriented version of the curriculum, and identifying a minimal set of PDC topics aligned with ABET’s exposure-level PDC requirements. The interested SIGCSE audience includes educators, authors, publishers, curriculum committee members, department chairs and administrators, professional societies, and the computing industry.
  2. ABSTRACT Artificial Intelligence (AI) methods are valued for their ability to predict outcomes from dynamically complex data. Despite this virtue, AI is widely criticized as a “black box,” i.e., lacking mechanistic explanations to accompany predictions. We introduce a novel interdisciplinary approach that balances the predictive power of data-driven methods with theory-driven explanatory power by presenting a shared use case from four disciplinary perspectives. The use case examines scientific career trajectories through temporally complex, heterogeneous bibliographic big data. Topics addressed include: data representation in complex problems; trade-offs between theoretical, hypothesis-driven, and data-driven approaches; AI trustworthiness; model fairness; algorithm explainability; and AI adoption/usability. Panelists and audience members will be prompted to discuss the value of the approach presented versus other ways to address the challenges raised by the panel, and to consider their limitations and remaining challenges.
  3. Abstract

    A key remit of the NSF‐funded “Arabidopsis Research and Training for the 21st Century” (ART‐21) Research Coordination Network has been to convene a series of workshops with community members to explore issues concerning research and training in plant biology, including the role that research using Arabidopsis thaliana can play in addressing those issues. A first workshop focused on training needs for bioinformatic and computational approaches in plant biology was held in 2016, and recommendations from that workshop have been published (Friesner et al., Plant Physiology, 175, 2017, 1499). In this white paper, we provide a summary of the discussions and insights arising from the second ART‐21 workshop. The second workshop focused on experimental aspects of omics data acquisition and analysis and involved a broad spectrum of participants from academics and industry, ranging from graduate students through post‐doctorates, early career and established investigators. Our hope is that this article will inspire beginning and established scientists, corporations, and funding agencies to pursue directions in research and training identified by this workshop, capitalizing on the reference species Arabidopsis thaliana and other valuable plant systems.

  4. Abstract

    Omics research inevitably involves the collection and analysis of big data, which can only be handled by automated approaches. Here we point out that the analysis of big data in the field of genomics dictates certain requirements, such as specialized software, quality control of input data, and simplification for visualization of the results. The latter results in a loss of information, as is exemplified for phylogenetic trees. Clear communication of big data analyses can be enhanced by novel visualization strategies. The interpretation of findings is sometimes hampered when dedicated analytical tools are not fully understood by microbiologists, while the researchers performing these analyses may not have a full overview of the biology of the microbes under study. These issues are illustrated here, using SARS-CoV-2 and Salmonella enterica as zoonotic examples. Whereas in scientific communications jargon should be avoided or explained, nomenclature to group similar organisms and distinguish these from more distant relatives is not only essential, but also influences the interpretation of results. Unfortunately, changes in taxonomically accepted names are now so frequent that they hamper rather than assist research, as is illustrated with difficulties of microbiome studies. Nomenclature to group viral isolates, as is done for SARS-CoV-2, is also not without difficulties. Some weaknesses in current omics research stem from poor quality of data or biased databases, and problems can be magnified by machine learning approaches. Moreover, the overall opus of scientific publications can now be considered “big data”, as is illustrated by the avalanche of COVID-19-related publications. The peer-review model of scientific publishing is only barely coping with this novel situation, resulting in retractions and the publication of bogus works. The avalanche of scientific publications that originated from the current pandemic can obstruct literature searches, and this will unfortunately continue over time.
