Big data, the “new oil” of the modern data science era, has attracted much attention in the GIScience community. In this modern gold rush, however, we have ignored the role of code in enabling the big data revolution. What attention code has received has focused on computational efficiency and scalability issues, and we have missed the opportunities that the more transformative aspects of code afford as ways to organize our science. These “big code” practices hold the potential to address some of the rightly criticized ill effects of big data, such as algorithmic bias, lack of representation, gatekeeping, and power imbalances in our communities. In this article, I consider areas where lessons from the open source community can help us evolve a more inclusive, generative, and expansive GIScience. These concern best practices for codes of conduct, data pipelines and reproducibility, refactoring our attribution and reward systems, and a reinvention of our pedagogy.
Reviews and syntheses: The promise of big diverse soil data, moving current practices towards future potential
Abstract. In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication. We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices. Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.
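The harmonization stage mentioned in the abstract can be pictured with a small sketch. This is a minimal example of vocabulary-based column renaming and unit conversion; the column names, vocabulary entries, and unit factors are hypothetical and are not taken from the paper:

```python
# Minimal sketch of a vocabulary-driven harmonization step; the column
# names, vocabulary, and unit factors below are hypothetical.
import pandas as pd

# Hypothetical vocabulary list: study-specific column name -> (community term, unit).
VOCAB = {
    "soc_pct":    ("soil_organic_carbon", "percent"),
    "SOC (g/kg)": ("soil_organic_carbon", "g_per_kg"),
    "depth_cm":   ("sample_depth_cm", "cm"),
}

# Conversion factors into a single target unit for organic carbon (g per kg).
TO_G_PER_KG = {"percent": 10.0, "g_per_kg": 1.0}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename mapped columns to community terms and convert carbon to g/kg."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        if col not in VOCAB:
            continue  # unmapped columns would be flagged for manual curation
        term, unit = VOCAB[col]
        values = df[col] * TO_G_PER_KG[unit] if term == "soil_organic_carbon" else df[col]
        out[term] = values
    return out

# Two studies reporting organic carbon in different units end up on one scale.
study_a = pd.DataFrame({"soc_pct": [1.2, 2.5], "depth_cm": [10, 30]})
study_b = pd.DataFrame({"SOC (g/kg)": [14.0, 22.3]})
print(harmonize(study_a))
print(harmonize(study_b))
```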
- Award ID(s): 1655622
- PAR ID: 10352560
- Author(s) / Creator(s):
- Date Published:
- Journal Name: Biogeosciences
- Volume: 19
- Issue: 14
- ISSN: 1726-4189
- Page Range / eLocation ID: 3505 to 3522
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the reliability of the algorithm and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and analytics. Although not yet fully developed, a CBDA mathematical framework will enable the study of the ergodic properties and asymptotics of the specific statistical inference approaches used within CBDA. We implemented the high-throughput CBDA method in pure R as well as within a graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer’s disease. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.
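A minimal sketch of the iterated case/feature subsampling idea behind CBDA is shown below. The sample sizes, base learner, and importance threshold are invented for illustration and are not the parameters used in the study:

```python
# Toy illustration of repeated case/feature subsampling with an ensemble of
# simple base learners; all sizes and thresholds here are arbitrary choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data: 500 cases, 200 features, only the first 5 carry signal.
X = rng.normal(size=(500, 200))
y = (X[:, :5].sum(axis=1) + rng.normal(size=500) > 0).astype(int)

n_iter, n_cases, n_feats = 200, 100, 20
selection_counts = np.zeros(X.shape[1])

for _ in range(n_iter):
    # Draw cases with replacement and a small random subset of features.
    cases = rng.choice(X.shape[0], size=n_cases, replace=True)
    feats = rng.choice(X.shape[1], size=n_feats, replace=False)
    model = LogisticRegression(max_iter=500).fit(X[np.ix_(cases, feats)], y[cases])
    # Tally the sampled features that this base learner weights heavily.
    selection_counts[feats[np.abs(model.coef_[0]) > 0.5]] += 1

# Features selected most often across iterations are candidates for the
# controlled variable selection step described in the abstract.
print(np.argsort(selection_counts)[-5:])
```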
The Non-Clinical Tomography Users Research Network (NoCTURN) was established in 2022 to advance Findability, Accessibility, Interoperability, and Reuse (FAIR) and Open Science (OS) practices in the computed tomographic (CT) imaging community. CT specialists utilize a shared pipeline to create digital representations of real-world objects for research, education, and outreach, and we face a shared set of challenges and limitations imposed by the siloing of current workflows, best practices, and expertise. Mirroring the U.S. National Science Foundation’s “10 Big Ideas” of Convergence Research (2016), and in consideration of the White House Office of Science and Technology Policy's Nelson Memorandum (2020), NoCTURN is leveraging input from a broad community of more than 100 CT educators, researchers, curators, and industry stakeholders to propose improvements to data handling, management, and sharing that cut across and extend beyond scientific disciplines. Our primary goal is to develop practical recommendations and tools that link today's CT data to tomorrow's CT discoveries. NoCTURN is working toward this goal by providing a platform to: 1) engage the international scientific CT community via participant recruitment from imaging facilities, academic departments and museums, and data repositories across the globe; 2) stimulate improvements for CT imaging and data management standards that focus on FAIR and OS principles; and 3) work directly with private companies that manufacture the hardware and software used in CT imaging, visualization, and analysis to find common ground in documentation and interoperability that better reflects the OS standards championed by federal funding agencies. The planned deliverables from this three-year grant include a ‘Rosetta Stone’ for CT terminology, an interactive world map of CT facilities, a guide to CT repositories, and ‘Good, Better, Best’ guidelines for metadata and long-term data management. We aim to reduce the barriers to entry that isolate individuals and research labs, and we anticipate that developing community standards and improving methodological reporting will enable the long-term, systemic changes necessary to aid those at all levels of experience in furthering their access to and use of CT imaging.
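The planned ‘Rosetta Stone’ for CT terminology suggests a simple crosswalk structure. The sketch below shows one possible shape for such a mapping; the vendor-specific field names and canonical terms are invented for illustration and are not NoCTURN deliverables:

```python
# Hypothetical terminology crosswalk for CT acquisition metadata; every
# term below is made up to illustrate the idea, not taken from NoCTURN.
CROSSWALK = {
    "kVp":          "peak_tube_voltage_kv",
    "Voltage (kV)": "peak_tube_voltage_kv",
    "uA":           "tube_current_ua",
    "Current":      "tube_current_ua",
    "Voxel size":   "voxel_size_um",
    "Resolution":   "voxel_size_um",
}

def normalize_metadata(raw: dict) -> dict:
    """Map vendor-specific field names onto a shared vocabulary."""
    normalized, unmapped = {}, {}
    for key, value in raw.items():
        canonical = CROSSWALK.get(key)
        if canonical is None:
            unmapped[key] = value  # kept aside for manual review
        else:
            normalized[canonical] = value
    normalized["unmapped"] = unmapped
    return normalized

print(normalize_metadata({"kVp": 120, "Current": 150, "Detector": "flat panel"}))
```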
The adoption of big data analytics in healthcare applications is challenging not only because of the huge volume of data being analyzed, but also because of the heterogeneity and sensitivity of the data. Effective and efficient analysis and visualization of secure patient health records are needed to, for example, find new trends in disease management, determine risk factors for diseases, and support personalized medicine. In this paper, we propose a novel community cloud architecture to give clinicians and researchers easy, increased access to data sets from multiple sources, while also ensuring that the security compliance of data providers is not compromised. Our cloud-based system design, configured with cloudlet principles, ensures high-speed application processing and sufficiently scalable data analytics while adhering to security standards (e.g., HIPAA, NIST). Through an ophthalmology case study, we show how our community cloud architecture and best practices can be implemented with health big data (i.e., the Health Facts database, I2B2, Millennium) hosted in a campus cloud infrastructure featuring virtual desktop thin clients and relevant Data Classification Levels in storage.
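The abstract's pairing of data sets with Data Classification Levels in storage can be pictured with a short sketch. The levels, roles, and access rule below are hypothetical and are not the policy described in the paper:

```python
# Minimal sketch of gating access by data classification level in a
# community cloud; levels, roles, and the comparison rule are invented.
from enum import IntEnum

class DataClassification(IntEnum):
    PUBLIC = 1      # de-identified, openly shareable data
    RESTRICTED = 2  # limited data sets requiring a data-use agreement
    PHI = 3         # protected health information (HIPAA-covered)

# Hypothetical clearance level granted to each user role.
ROLE_CLEARANCE = {
    "student": DataClassification.PUBLIC,
    "researcher": DataClassification.RESTRICTED,
    "clinician": DataClassification.PHI,
}

def can_access(role: str, dataset_level: DataClassification) -> bool:
    """Allow access only when the role's clearance meets the dataset's level."""
    return ROLE_CLEARANCE.get(role, DataClassification.PUBLIC) >= dataset_level

print(can_access("researcher", DataClassification.PHI))        # False
print(can_access("clinician", DataClassification.RESTRICTED))  # True
```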
Recent advances in big data and deep learning technologies have enabled researchers across many disciplines to gain new insight into large and complex data. For example, deep neural networks are being widely used to analyze various types of data, including images, videos, texts, and time-series data. In another example, disciplines such as sociology, social work, and criminology are analyzing crowd-sourced and online social network data using big data technologies to gain new insight from a plethora of data. Even though many different types of data are being generated and analyzed in various domains, the development of distributed city-level cyberinfrastructure for effectively integrating such data to generate more value and gain insight is still not well addressed in the research literature. In this paper, we present our current efforts and ultimate vision to build distributed cyberinfrastructure that integrates big data and deep learning technologies with a variety of data for enhancing public safety and livability in cities. We also introduce several methodologies and applications that we are developing on top of the cyberinfrastructure to support diverse community stakeholders in cities.