NSF PAR Search | NSF Public Access Repository

The Valuable, Vulnerable, Long Tail of Earth Science Databases

https://doi.org/10.1029/2025EO250107

Thomer, Andrea; Williams, John; Goring, Simon; Blois, Jessica (March 2025, Eos)

Community-curated data resources in the Earth sciences, highly valuable but systematically underfunded, are vital to research on a changing planet.
Free, publicly-accessible full text available March 20, 2026

Metadata Enhancement Using Large Language Models

Song, Hyunju; Bethard, Steven; Thomer, Andrea K (August 2024, Association for Computational Linguistics)

Full Text Available

Automated Metadata Enhancement for Physical Sample Record Aggregation in the iSamples Project

https://doi.org/10.1002/pra2.968

Song, Hyunju; Cui, Hong; Vieglais, Dave; Mandel, Danny; Thomer, Andrea K (October 2023, Proceedings of the Association for Information Science and Technology)

Large numbers of samples have been collected and stored by different institutions and collections across the world. However, even the most carefully curated collections can appear incomplete when aggregated. To solve this problem and support the increasingly multidisciplinary science conducted on these samples, we propose a method to support the FAIRness of the aggregation by augmenting the metadata of source records. Using a pipeline that combines rule‐based and machine learning‐based procedures, we predict the missing values of the metadata fields of 4,388,514 samples. We use these inferred fields in our user interface to improve reusability.
Full Text Available

Opening Doors to Physical Sample Data Discovery, Integration, and Credit

https://doi.org/10.31223/X5ST2K

Damerow, Joan; Raia, Natalie; Stanley, Val; Choe, Saebyul; Borton, Mikayla; Byers, Neil; Cassidy, Ellen; Cholia, Shreyas; Edmunds, Rorie; Forbes, Brieanne; et al (June 2024, Nature Scientific Data)

Physical samples and their associated (meta)data underpin scientific discoveries across disciplines, and can enable new science when appropriately archived. However, there are significant gaps in community practices and infrastructure that currently prevent accurate provenance tracking, reproducibility, and attribution. For the vast majority of samples, descriptive metadata is sparse, inaccessible, or absent. Samples and associated (meta)data may also be scattered across numerous physical collections, data repositories, laboratories, data files, and papers with no clear linkages or provenance tracking as new information is generated over time. The Physical Samples Curation Cluster has therefore developed ‘A Scientific Author Guide for Publishing Open Research Using Physical Samples.’ This involved synthesizing existing practices and community feedback, and assessing real-world examples to identify community and infrastructure needs. We identified areas of work needed to enable authors to efficiently reference samples and related data, link related samples and data, and track their use. Our goal is to help improve the discoverability, interoperability, and use of physical samples and associated (meta)data into the future.
Full Text Available

The Craft and Coordination of Data Curation: Complicating Workflow Views of Data Science

https://doi.org/10.1145/3555139

Thomer, Andrea K.; Akmon, Dharma; York, Jeremy J.; Tyler, Allison R.; Polasek, Faye; Lafia, Sara; Hemphill, Libby; Yakel, Elizabeth (November 2022, Proceedings of the ACM on Human-Computer Interaction)

Data curation is the process of making a dataset fit-for-use and archivable. It is critical to data-intensive science because it makes complex data pipelines possible, studies reproducible, and data reusable. Yet the complexities of the hands-on, technical, and intellectual work of data curation are frequently overlooked or downplayed. Obscuring the work of data curation not only renders the labor and contributions of data curators invisible but also hides the impact that curators' work has on the later usability, reliability, and reproducibility of data. To better understand the work and impact of data curation, we conducted a close examination of data curation at a large social science data repository, the Inter-university Consortium for Political and Social Research (ICPSR). We asked: What does curatorial work entail at ICPSR, and what work is more or less visible to different stakeholders and in different contexts? And how is that curatorial work coordinated across the organization? We triangulated accounts of data curation from interviews and records of curation in Jira tickets to develop a rich and detailed account of curatorial work. While we identified numerous curatorial actions performed by ICPSR curators, we also found that curators rely on a number of craft practices to perform their jobs. The reality of their work practices defies the rote sequence of events implied by many life cycle or workflow models. Further, we show that craft practices are needed to enact data curation best practices and standards. The craft that goes into data curation is often invisible to end users, but it is well recognized by ICPSR curators and their supervisors. Explicitly acknowledging and supporting data curators as craftspeople is important in creating sustainable and successful curatorial infrastructures.
Full Text Available

Leveraging Machine Learning to Detect Data Curation Activities

https://doi.org/10.1109/eScience51609.2021.00025

Lafia, Sara; Thomer, Andrea; Bleckley, David; Akmon, Dharma; Hemphill, Libby (September 2021, eScience 2021)

This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR, a large social sciences data archive. The systems we studied track curation work and coordinate team decision-making at ICPSR. Archive staff use these systems to organize, prioritize, and document curation work done on datasets, making them promising resources for studying curation work and its impact on data reuse, especially in combination with data usage analytics. A key challenge, however, is classifying similar activities so that they can be measured and associated with impact metrics. This paper contributes: 1) a set of data curation activities; 2) a computational model for identifying curation actions in work log descriptions; and 3) an analysis of frequent data curation activities at ICPSR over time. We first propose a set of data curation actions to help us analyze the impact of curation work. We then use this set to annotate a set of data curation logs, which contain records of data transformations and project management decisions completed by archive staff. Finally, we train a text classifier to detect the frequency of curation actions in a large set of work logs. Our approach supports the analysis of curation work documented in work log systems as an important step toward studying the relationship between research data curation and data reuse.
Full Text Available

Privacy Impact Assessments for Digital Repositories

https://doi.org/10.2218/ijdc.v15i1.692

Mhaidli, Abraham; Hemphill, Libby; Schaub, Florian; Jordan, Cundiff; Thomer, Andrea K. (2020, International Journal of Digital Curation)

Trustworthy data repositories ensure the security of their collections. We argue they should also ensure the security of researcher and human subject data. Here we demonstrate the use of a privacy impact assessment (PIA) to evaluate potential privacy risks to researchers using the ICPSR’s Open Badges Research Credential System as a case study. We present our workflow and discuss potential privacy risks and mitigations for those risks. [This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]
Full Text Available
