Title: Beyond Open Data: A Model for Linking Digital Artifacts to Enable Reproducibility of Scientific Claims
The last few years have seen a substantial push toward “Open Data” by policy makers, researchers, archivists, and even the public. This article postulates that the value of data is not intrinsic but instead derives from its ability to produce knowledge, the extraction of which from data is not deterministic. The value of data is realized through a focus on the reproducibility of the findings derived from the data, which acknowledges the complexity of the leap from data to knowledge and the inextricable interrelationships between data, software, computational environments and cyberinfrastructure, and knowledge. Modern information archiving practices have a long history and were shaped in a pre-digital world composed of physical objects such as books, monographs, film, and paper. This article argues that “data,” the modern collection of digital bits representing empirical measurements, is a wholly new entity and not a digital analog of any physical object. It further argues that a focus on the interrelationships between digital artifacts and their unique properties, rather than on Open Data alone, will produce an augmented and more useful understanding of knowledge derived from digital data. Data-derived knowledge, represented by claims in the scholarly record, must persistently link to immutable versions of the digital artifacts from which it was derived, including 1) any data, 2) software that allows access to the data and the regeneration of the claims that rely on that version of the data, and 3) computational environment information such as input parameters, function invocation sequences, and resource details. In this sense, the epistemological gap between data and extracted knowledge can be closed. Datasets and software are often subject to change and revision, sometimes with high velocity, and such changes imply new versions with new unique identifiers. We propose considering knowledge, rather than data in isolation, with a schematic model representing the interconnectedness of the datasets, software, and computational information upon which its derivation depends. Capturing the interconnectedness of these digital artifacts, and their relationship to the knowledge they generate, is essential for supporting the reproducibility, transparency, and cognitive tractability of scientific claims derived from digital data.
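As a concrete illustration of the linking model described in the abstract, the following Python sketch shows one way a scholarly claim could persistently reference immutable, uniquely identified versions of the data, software, and computational environment it depends on. This is not the authors' schema; every class name, field, and identifier below is an illustrative assumption.

```python
# Minimal sketch of claim-to-artifact linkage; all names and identifiers are
# illustrative assumptions, not the article's schema.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ArtifactVersion:
    """An immutable, uniquely identified version of a digital artifact."""
    kind: str         # "dataset", "software", or "environment"
    identifier: str   # persistent identifier, e.g. a DOI or content hash
    version: str      # a revision implies a new version and a new identifier


@dataclass(frozen=True)
class EnvironmentVersion(ArtifactVersion):
    """Computational environment: parameters, invocation sequence, resources."""
    parameters: Dict[str, str] = field(default_factory=dict)
    invocation_sequence: List[str] = field(default_factory=list)
    resources: str = ""


@dataclass(frozen=True)
class Claim:
    """A claim in the scholarly record, linked to the artifacts it derives from."""
    statement: str
    artifacts: tuple  # ArtifactVersion instances, fixed at publication time


# Hypothetical example of a claim bound to specific artifact versions.
claim = Claim(
    statement="Hypothetical result derived from the linked dataset",
    artifacts=(
        ArtifactVersion("dataset", "doi:10.0000/example-dataset", "v2.1"),
        ArtifactVersion("software", "doi:10.0000/example-code", "v1.4.0"),
        EnvironmentVersion(
            "environment", "sha256:placeholder", "run-2020-06-01",
            parameters={"resolution": "1deg"},
            invocation_sequence=["load", "fit_model", "summarize"],
            resources="16-core VM, 64 GB RAM",
        ),
    ),
)
```

Because each artifact record is immutable, any revision of the data or software yields a new `ArtifactVersion` with a new identifier, and the claim continues to point at exactly the versions from which it was derived.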
Award ID(s):
2138776
PAR ID:
10388163
Author(s) / Creator(s):
Date Published:
Journal Name:
Third International Workshop on Practical Reproducible Evaluation of Computer Systems (P-RECS'20)
Issue:
2020
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Computer vision and other biometrics data science applications have commenced a new project of profiling people. Rather than using 'transaction generated information', these systems measure the 'real world' and produce an assessment of the 'world state' - in this case an assessment of some individual trait. Instead of using proxies or scores to evaluate people, they increasingly deploy a logic of revealing the truth about reality and the people within it. While these profiling knowledge claims are sometimes tentative, they increasingly suggest that only through computation can these excesses of reality be captured and understood. This article explores the bases of those claims in the systems of measurement, representation, and classification deployed in computer vision. It asks if there is something new in this type of knowledge claim, sketches an account of a new form of computational empiricism being operationalised, and questions what kind of human subject is being constructed by these technological systems and practices. Finally, the article explores legal mechanisms for contesting the emergence of computational empiricism as the dominant knowledge platform for understanding the world and the people within it. 
  2. Clouds are shareable scientific instruments that create the potential for reproducibility by ensuring that all investigators have access to a common execution platform on which computational experiments can be repeated and compared. By virtue of the interface they present, they also lead to the creation of digital artifacts compatible with the cloud, such as images or orchestration templates, that go a long way—and sometimes all the way—to representing an experiment in a digital, repeatable form. In this article, I describe how we developed these natural advantages of clouds in the Chameleon testbed and argue that we should leverage them to create a digital research marketplace that would make repeating experiments as natural and viable a part of research as sharing ideas via papers is today.
  3. Continuous integration (CI) is a well-established technique in commercial and open-source software projects, although not routinely used in scientific publishing. In the scientific software context, CI can serve two functions to increase reproducibility of scientific results: providing an established platform for testing the reproducibility of these results, and demonstrating to other scientists how the code and data generate the published results. We explore scientific software testing and CI strategies using two articles published in the areas of applied mathematics and computational physics. We discuss lessons learned from reproducing these articles as well as examine and discuss existing tests. We introduce the notion of a scientific test as one that produces computational results from a published article. We then consider full result reproduction within a CI environment. If authors find their work too time or resource intensive to easily adapt to a CI context, we recommend the inclusion of results from reduced versions of their work (e.g., run at lower resolution, with shorter time scales, with smaller data sets) alongside their primary results within their article. While these smaller versions may be less interesting scientifically, they can serve to verify that published code and data are working properly. We demonstrate such reduction tests on the two articles studied. 
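To make the reduced-test recommendation above concrete, here is a minimal, self-contained sketch of a CI-sized scientific test. It is not code from the reproduced articles: the solver is a hypothetical stand-in, and the resolution, step count, and assertions are illustrative assumptions.

```python
# Sketch of a "reduced" scientific test suitable for CI: re-run a published-style
# computation at a smaller scale and assert basic invariants still hold.
# simulate() is a hypothetical stand-in, not code from the studied articles.
import numpy as np


def simulate(grid_points: int, steps: int) -> np.ndarray:
    """Hypothetical stand-in solver: explicit 1-D diffusion of a point source."""
    u = np.zeros(grid_points)
    u[grid_points // 2] = 1.0                 # initial point source
    for _ in range(steps):
        u = u + 0.25 * (np.roll(u, 1) - 2 * u + np.roll(u, -1))
    return u


def test_reduced_resolution_run():
    """CI-sized check: lower resolution and fewer steps than the full study."""
    u = simulate(grid_points=64, steps=100)   # full study might use far more
    assert np.isfinite(u).all()               # solver did not blow up
    assert np.isclose(u.sum(), 1.0)           # conserved quantity preserved
    assert u.max() < 1.0                      # source has actually diffused
```

Run under pytest in a CI workflow, such a test finishes in seconds yet exercises the same code path as the full published computation, which is the verification role the authors describe.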
  4. Abstract: Objective. Intan Technologies’ integrated circuits (ICs) are valuable tools for neurophysiological data acquisition, providing signal amplification, filtering, and digitization from many channels (up to 64 channels/chip) at high sampling rates (up to 30 kSPS) within a compact package (⩽9 × 7 mm). However, we found that the analog-to-digital converters (ADCs) in the Intan RHD2000 series ICs can produce artifacts in recorded signals. Here, we examine the effects of these ADC artifacts on neural signal quality and describe a method to detect them in recorded data. Approach. We identified two types of ADC artifacts produced by Intan ICs: 1) jumps, resulting from missing output codes, and 2) flatlines, resulting from overrepresented output codes. We identified ADC artifacts in neural recordings acquired with Intan RHD2000 ICs and tested the repeated performance of 17 ICs in vitro. With the on-chip digital-signal-processing disabled, we detected the ADC artifacts in each test recording by examining the distribution of unfiltered ADC output codes. Main Results. We found larger ADC artifacts in recordings using the Intan RHX data acquisition software versions 3.0–3.2, which did not run the necessary ADC calibration command when the inputs to the Intan recording controller were rescanned. This has been corrected in the Intan RHX software version 3.3. We found that the ADC calibration routine significantly reduced, but did not fully eliminate, the occurrence and size of ADC artifacts as compared with recordings acquired when the calibration routine was not run (p < 0.0001). When the ADC calibration routine was run, we found that the artifacts produced by each ADC were consistent over time, enabling us to sort ICs by performance. Significance. Our findings call attention to the importance of evaluating signal quality when acquiring electrophysiological data using Intan Technologies ICs and offer a method for detecting ADC artifacts in recorded data.
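The distribution-based detection idea above lends itself to a short script. The sketch below is not the authors' published method: the function name, the overrepresentation threshold, and the synthetic demo are assumptions meant only to illustrate flagging gaps ("jumps") and spikes ("flatlines") in a histogram of unfiltered ADC output codes.

```python
# Illustrative (not the authors' code) detection of the two ADC artifact types:
# missing output codes appear as gaps in the code histogram ("jumps"), and
# overrepresented codes appear as spikes ("flatlines"). Thresholds are assumptions.
import numpy as np


def adc_code_artifacts(codes, n_bits=16, overrep_factor=10.0):
    """Flag missing and overrepresented codes in unfiltered ADC samples."""
    codes = np.asarray(codes, dtype=np.int64)
    counts = np.bincount(codes, minlength=2 ** n_bits)

    # Only judge codes inside the range the signal actually exercised;
    # codes outside that range are legitimately absent.
    lo, hi = codes.min(), codes.max()
    in_range = counts[lo:hi + 1]

    missing = np.flatnonzero(in_range == 0) + lo                 # candidate "jumps"
    expected = in_range[in_range > 0].mean()
    overrep = np.flatnonzero(in_range > overrep_factor * expected) + lo  # "flatlines"
    return missing, overrep


# Tiny synthetic demo: uniform codes with code 1000 removed (a "jump") and
# code 2000 heavily duplicated (a "flatline"). Real recordings are far longer.
rng = np.random.default_rng(0)
codes = rng.integers(0, 4096, size=2_000_000)
codes = codes[codes != 1000]
codes = np.concatenate([codes, np.full(20_000, 2000)])
missing, overrep = adc_code_artifacts(codes)
print(missing, overrep)   # expect [1000] and [2000]
```

On real recordings, short captures will leave some codes unvisited by chance, so flagged codes are candidates to inspect rather than definitive faults; the authors' in vitro tests across 17 ICs provide the controlled setting this sketch only gestures at.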
  5. Reusable software libraries, frameworks, and components, such as those provided by open source ecosystems and third-party suppliers, accelerate digital innovation. However, recent years have shown almost exponential growth in attackers leveraging these software artifacts to launch software supply chain attacks. Past well-known software supply chain attacks include the SolarWinds, log4j, and xz utils incidents. Supply chain attacks are considered to have three major attack vectors: through vulnerabilities and malware accidentally or intentionally injected into open source and third-party dependencies/components/containers; by infiltrating the build infrastructure during the build and deployment processes; and through targeted techniques aimed at the humans involved in software development, such as through social engineering. Plummeting trust in the software supply chain could decelerate digital innovation if the software industry reduces its use of open source and third-party artifacts to reduce risks. This article contains perspectives and knowledge obtained from intentional outreach with practitioners to understand their practical challenges and from extensive research efforts. We then provide an overview of current research efforts to secure the software supply chain. Finally, we propose a future research agenda to close software supply chain attack vectors and support the software industry.