- Award ID(s):
- 1930645
- PAR ID:
- 10304156
- Date Published:
- Journal Name:
- eScience 2021
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Data curation is the process of making a dataset fit-for-use and archivable. It is critical to data-intensive science because it makes complex data pipelines possible, studies reproducible, and data reusable. Yet the complexities of the hands-on, technical, and intellectual work of data curation is frequently overlooked or downplayed. Obscuring the work of data curation not only renders the labor and contributions of data curators invisible but also hides the impact that curators' work has on the later usability, reliability, and reproducibility of data. To better understand the work and impact of data curation, we conducted a close examination of data curation at a large social science data repository, the Inter-university Consortium for Political and Social Research (ICPSR). We asked: What does curatorial work entail at ICPSR, and what work is more or less visible to different stakeholders and in different contexts? And, how is that curatorial work coordinated across the organization? We triangulated accounts of data curation from interviews and records of curation in Jira tickets to develop a rich and detailed account of curatorial work. While we identified numerous curatorial actions performed by ICPSR curators, we also found that curators rely on a number of craft practices to perform their jobs. The reality of their work practices defies the rote sequence of events implied by many life cycle or workflow models. Further, we show that craft practices are needed to enact data curation best practices and standards. The craft that goes into data curation is often invisible to end users, but it is well recognized by ICPSR curators and their supervisors. Explicitly acknowledging and supporting data curators as craftspeople is important in creating sustainable and successful curatorial infrastructures.more » « less
-
Abstract Data archives are an important source of high-quality data in many fields, making them ideal sites to study data reuse. By studying data reuse through citation networks, we are able to learn how hidden research communities—those that use the same scientific data sets—are organized. This paper analyzes the community structure of an authoritative network of data sets cited in academic publications, which have been collected by a large, social science data archive: the Interuniversity Consortium for Political and Social Research (ICPSR). Through network analysis, we identified communities of social science data sets and fields of research connected through shared data use. We argue that communities of exclusive data reuse form “subdivisions” that contain valuable disciplinary resources, while data sets at a “crossroads” broadly connect research communities. Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around data sets as shared scientific inputs. These findings contribute new ways of describing scientific communities to understand the impacts of research data reuse.
-
Abstract University libraries are partnering with disciplinary data producers to provide long‐term digital curation of research data sets. Managing data set producer expectations and guiding future development of library services requires understanding the decisions libraries make about curatorial activities, why they make these decisions, and the effects on future data reuse. We present a study, comprising interviews (
n = 43) and ethnographic observation, of two university libraries who partnered with the Sloan Digital Sky Survey (SDSS) collaboration to curate a significant astronomy data set. The two libraries made different choices of the materials to curate and associated services, which resulted in different reuse possibilities. Each of the libraries offered partial solutions to the SDSS leaders' objectives. The libraries' approaches to curation diverged due to contextual factors, notably the extant infrastructure at their disposal (including technical infrastructure, staff expertise, values and internal culture, and organizational structure). The Data Transfer Process case offers lessons in understanding how libraries choose curation paths and how these choices influence possibilities for data reuse. Outcomes may not match data producers' initial expectations but may create opportunities for reusing data in unexpected and beneficial ways. -
Workflow reconstruction through logs is crucial for troubleshooting targeted distributed systems. It is also challenging to extract enough information from logs and keep a concise view, which makes manual log analysis hard to practice. However, currently popular tools rely on identifier-based log parsing, leaving a large amount of workflow information unexploited. In this paper, we propose a log extraction approach NLog, which utilizes a natural language processing based approach to obtain the key information from log messages and identify the same object in logs generated by different statements without any domain knowledge. We propose to use keyed message, a new log storage structure to store the parsed logs. We implement NLog and apply it to distributed data analytics frameworks Spark and MapReduce. Evaluation results show that NLog can accurately identify the objects in log messages even without explicit identifiers. By using keyed messages, users can have a concise as well as flexible view of the workflows.more » « less
-
The sharing of research data is essential to ensure reproducibility and maximize the impact of public investments in scientific research. Here, we describe OpenNeuro, a BRAIN Initiative data archive that provides the ability to openly share data from a broad range of brain imaging data types following the FAIR principles for data sharing. We highlight the importance of the Brain Imaging Data Structure standard for enabling effective curation, sharing, and reuse of data. The archive presently shares more than 600 datasets including data from more than 20,000 participants, comprising multiple species and measurement modalities and a broad range of phenotypes. The impact of the shared data is evident in a growing number of published reuses, currently totalling more than 150 publications. We conclude by describing plans for future development and integration with other ongoing open science efforts.more » « less