Big data, the “new oil” of the modern data science era, has attracted much attention in the GIScience community. Yet in this modern gold rush we have overlooked the role of code in enabling the big data revolution. What attention code has received has focused on computational efficiency and scalability, and we have missed the opportunities that the more transformative aspects of code afford as ways to organize our science. These “big code” practices hold the potential for addressing some of the ill effects of big data that have been rightly criticized, such as algorithmic bias, lack of representation, gatekeeping, and power imbalances in our communities. In this article, I consider areas where lessons from the open source community can help us evolve a more inclusive, generative, and expansive GIScience. These concern best practices for codes of conduct, data pipelines and reproducibility, refactoring our attribution and reward systems, and a reinvention of our pedagogy.
Reviews and syntheses: The promise of big diverse soil data, moving current practices towards future potential
Abstract. In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication. We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices. Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.
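As a concrete illustration of the harmonization stage, the sketch below maps contributor-specific column names and units onto a shared vocabulary before curation and publication. It is a minimal example assuming hypothetical headers, terms, and unit conversions; it does not draw on an actual soil ontology or the vocabulary lists proposed in the paper.

```python
# Minimal sketch of the "harmonization" stage: mapping heterogeneous column
# names onto a shared controlled vocabulary and converting units. All column
# names, vocabulary terms, and conversion factors here are hypothetical
# illustrations, not terms from an actual soil ontology.
import pandas as pd

# Hypothetical controlled vocabulary: contributor-specific headers -> standard terms
VOCABULARY = {
    "soc_percent": "soil_organic_carbon",
    "SOC (g/kg)": "soil_organic_carbon",
    "pH_H2O": "ph",
    "soil_pH": "ph",
}

# Hypothetical unit conversions into a common basis (g C per kg soil)
UNIT_CONVERSIONS = {
    "soc_percent": lambda v: v * 10.0,   # percent -> g/kg
    "SOC (g/kg)": lambda v: v,           # already g/kg
}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to vocabulary terms and convert units where defined."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        term = VOCABULARY.get(col)
        if term is None:
            continue  # in practice, flag for curation rather than silently dropping
        values = df[col]
        if col in UNIT_CONVERSIONS:
            values = values.apply(UNIT_CONVERSIONS[col])
        out[term] = values
    return out

if __name__ == "__main__":
    contributed = pd.DataFrame({"soc_percent": [1.2, 3.4], "pH_H2O": [5.6, 6.1]})
    print(harmonize(contributed))
```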
- Award ID(s): 1655622
- PAR ID: 10352560
- Author(s) / Creator(s):
- Date Published:
- Journal Name: Biogeosciences
- Volume: 19
- Issue: 14
- ISSN: 1726-4189
- Page Range / eLocation ID: 3505 to 3522
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the reliability of the algorithm and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data. Although not yet fully developed, a CBDA mathematical framework will enable the study of the ergodic properties and asymptotics of the specific statistical inference approaches used within CBDA. We implemented the high-throughput CBDA method in pure R as well as in a graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer’s disease. The CBDA approach may be customized to provide a generic representation of complex multimodal datasets and stable scientific inference for large, incomplete, and multi-source datasets.
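The iterative subsampling-and-prediction loop described above can be sketched in a few lines. The example below is a minimal illustration assuming placeholder data and a generic sparse classifier; the abstract states the actual implementation was done in pure R and a graphical pipeline environment, so this is not the authors' code.

```python
# Minimal sketch of a CBDA-style loop: repeated subsampling of cases and
# features, fitting a predictor on each subsample, and counting how often
# each feature is retained. Placeholder data and a generic scikit-learn
# model stand in for the authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data: 500 cases, 200 features, binary outcome
X = rng.normal(size=(500, 200))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

n_iter, case_frac, n_feats = 200, 0.6, 20
selection_counts = np.zeros(X.shape[1])

for _ in range(n_iter):
    # Subsample cases (with replacement) and a random subset of features
    cases = rng.choice(X.shape[0], size=int(case_frac * X.shape[0]), replace=True)
    feats = rng.choice(X.shape[1], size=n_feats, replace=False)

    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    model.fit(X[np.ix_(cases, feats)], y[cases])

    # Record which of the sampled features the sparse model kept
    kept = feats[np.abs(model.coef_).ravel() > 1e-8]
    selection_counts[kept] += 1

# Features selected most often across iterations are candidates for a final,
# controlled variable-selection step, as described in the abstract.
top = np.argsort(selection_counts)[::-1][:10]
print("Most frequently selected features:", top)
```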
-
The Non-Clinical Tomography Users Research Network (NoCTURN) was established in 2022 to advance Findability, Accessibility, Interoperability, and Reuse (FAIR) and Open Science (OS) practices in the computed tomographic (CT) imaging community. CT specialists utilize a shared pipeline to create digital representations of real-world objects for research, education, and outreach, and we face a shared set of challenges and limitations imposed by the siloing of current workflows, best practices, and expertise. Mirroring the U.S. National Science Foundation’s “10 Big Ideas” of Convergence Research (2016), and in consideration of the White House Office of Science and Technology Policy's Nelson Memorandum (2020), NoCTURN is leveraging input from a broad community of more than 100 CT educators, researchers, curators, and industry stakeholders to propose improvements to data handling, management, and sharing that cut across, and extend beyond, scientific disciplines. Our primary goal is to develop practical recommendations and tools that link today's CT data to tomorrow's CT discoveries. NoCTURN is working toward this goal by providing a platform to: 1) engage the international scientific CT community via participant recruitment from imaging facilities, academic departments and museums, and data repositories across the globe; 2) stimulate improvements for CT imaging and data management standards that focus on FAIR and OS principles; and 3) work directly with private companies that manufacture the hardware and software used in CT imaging, visualization, and analysis to find common ground in documentation and interoperability that better reflects the OS standards championed by federal funding agencies. The planned deliverables from this three-year grant include a ‘Rosetta Stone’ for CT terminology, an interactive world map of CT facilities, a guide to CT repositories, and ‘Good, Better, Best’ guidelines for metadata and long-term data management. We aim to reduce the barriers to entry that isolate individuals and research labs, and we anticipate that developing community standards and improving methodological reporting will enable the long-term, systemic changes necessary to aid those at all levels of experience in furthering their access to and use of CT imaging.
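To make the metadata guidance concrete, the sketch below checks a CT scan record against tiered (‘Good, Better, Best’) field requirements. The field names and tiers are hypothetical illustrations, not NoCTURN's published guidelines.

```python
# Hypothetical illustration of tiered ("Good, Better, Best") metadata checks
# for a CT scan record. The field names and tiers are invented for this
# sketch and are not NoCTURN's actual guidelines.
GOOD = {"specimen_id", "facility", "scan_date", "voltage_kv", "current_ua"}
BETTER = GOOD | {"voxel_size_um", "filter_material", "exposure_ms"}
BEST = BETTER | {"reconstruction_software", "repository_doi", "license"}

def metadata_tier(record: dict) -> str:
    """Return the highest tier whose required fields are all present and non-empty."""
    present = {k for k, v in record.items() if v not in (None, "")}
    for name, required in (("Best", BEST), ("Better", BETTER), ("Good", GOOD)):
        if required <= present:
            return name
    return "Incomplete"

if __name__ == "__main__":
    scan = {"specimen_id": "UF-12345", "facility": "Example Lab",
            "scan_date": "2023-05-01", "voltage_kv": 120, "current_ua": 200}
    print(metadata_tier(scan))  # -> "Good"
```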
-
The adoption of big data analytics in healthcare applications is challenging not only because of the huge volume of data being analyzed, but also because of the heterogeneity and sensitivity of the data. Effective and efficient analysis and visualization of secure patient health records are needed to, for example, find new trends in disease management, determine risk factors for diseases, and support personalized medicine. In this paper, we propose a novel community cloud architecture to help clinicians and researchers gain easy and increased access to data sets from multiple sources, while ensuring that the security compliance of data providers is not compromised. Our cloud-based system design, configured with cloudlet principles, provides high-speed application processing and sufficiently scalable data analytics while adhering to security standards (e.g., HIPAA, NIST). Through an ophthalmology case study involving health big data (i.e., the Health Facts database, I2B2, Millennium) hosted in a campus cloud infrastructure featuring virtual desktop thin clients and relevant Data Classification Levels in storage, we show how our community cloud architecture can be implemented along with best practices.
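As a rough illustration of how data classification levels might drive storage decisions in such an architecture, the sketch below tags records with a classification level and routes them to a storage tier. The level names, rules, and storage targets are invented for this example and are not the system's actual configuration.

```python
# Hypothetical sketch of routing health records to storage tiers by data
# classification level. The level names, rules, and storage targets are
# invented for illustration, not the paper's actual configuration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    patient_id: Optional[str]   # None if de-identified
    diagnosis_code: str

def classification_level(record: Record) -> str:
    """Very simplified rule: identified records are treated as restricted."""
    return "restricted" if record.patient_id else "internal"

STORAGE_TARGETS = {
    "restricted": "encrypted-community-cloud",   # e.g., a HIPAA-aligned enclave
    "internal": "campus-cloud-research-store",
}

def route(record: Record) -> str:
    return STORAGE_TARGETS[classification_level(record)]

if __name__ == "__main__":
    print(route(Record(patient_id="P-001", diagnosis_code="H40.9")))  # restricted tier
    print(route(Record(patient_id=None, diagnosis_code="H40.9")))     # internal tier
```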
-
With the growing availability and accessibility of big data in ecology, we face an urgent need to train the next generation of scientists in data science practices and tools. One of the biggest barriers to implementing a data-driven curriculum in undergraduate classrooms is the lack of training and support for educators to develop their own skills, and of time to incorporate these principles into existing courses or develop new ones. Alongside the research goals of the National Ecological Observatory Network (NEON), providing education and training is a key component of building a community of scientists and users equipped to utilize large-scale ecological and environmental data. To address this need, the NEON Data Education Fellows program formed as a collaborative Faculty Mentoring Network (FMN) between scientists from NEON and university faculty interested in using NEON data and resources in their ecology classrooms. Like other FMNs, this group has two main goals: (1) to provide tools, resources, and support for faculty interested in developing data-driven curricula, and (2) to make teaching materials that have been implemented and tested in the classroom available as open educational resources for other educators. We hosted this program using an open education and collaboration platform from the Quantitative Undergraduate Biology Education and Synthesis (QUBES) project. Here, we share lessons learned from facilitating five FMN cohorts and emphasize the successes, pitfalls, and opportunities for developing open education resources through community-driven collaborations.