NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

How do we share data in COVID-19 research? A systematic review of COVID-19 datasets in PubMed Central Articles

https://doi.org/10.1093/bib/bbaa331

Zuo, Xu; Chen, Yong; Ohno-Machado, Lucila; Xu, Hua (December 2020, Briefings in Bioinformatics)
null (Ed.)
Abstract Objective This study aims at reviewing novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles, thus providing quantitative analysis to answer questions related to dataset contents, accessibility and citations. Methods We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these seven variables were reported for all extracted datasets. Results We found that 28.5% of 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique dataset links were mentioned in 12 324 COVID-19 full text articles in PubMed Central. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) in the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns of COVID-19 datasets varied depending on specific datasets. Conclusion PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.
more » « less
Full Text Available
Coronavirus: indexed data speed up solutions

https://doi.org/10.1038/d41586-020-02331-3

Ohno-Machado, Lucila; Xu, Hua (August 2020, Nature)
null (Ed.)
Full Text Available
Intake / Pangeo Catalog: Making It Easier To Consume Earth’s Climate and Weather Data

Banihirwe, A.; Blackmon-Luca, C.; Abernathey, R.; Hamman, J. (July 2020, 2020 EarthCube Annual Meeting)
null (Ed.)
Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets of a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it. In this notebook, we demonstrate the integration of data discovery tools such as intake and intake-esm (an intake plugin) with data stored in cloud optimized formats (zarr). We highlight (1) how these tools provide transparent access to local and remote catalogs and data, (2) the API for exploring arbitrary metadata associated with data, loading data sets into data array containers. We also showcase the Pangeo catalog, an open source project to enumerate and organize cloud optimized climate data stored across a variety of providers, and a place where several intake-esm collections are now publicly available. We use one of these public collections as an example to show how an end user would explore and interact with the data, and conclude with a short overview of the catalog's online presence.
more » « less
Full Text Available
Scikit-downscale: an open source Python package for scalable climate downscaling

https://doi.org/10.1002/essoar.10507604.1

Hamman, Joseph; Kent, Julia (June 2020, 2020 EarthCube Annual Meeting)
null (Ed.)
Climate data from Earth System Models are increasingly being used to study the impacts of climate change on a broad range of biogeophysical (forest fires, fisheries, etc.) and human systems (reservoir operations, urban heat waves, etc.). Before this data can be used to study many of these systems, post-processing steps commonly referred to as bias correction and statistical downscaling must be performed. “Bias correction” is used to correct persistent biases in climate model output and “statistical downscaling” is used to increase the spatiotemporal resolution of the model output (i.e. 1 deg to 1/16th deg grid boxes). For our purposes, we’ll refer to both parts as “downscaling”. In the past few decades, the applications community has developed a plethora of downscaling methods. Many of these methods are ad-hoc collections of post processing routines while others target very specific applications. The proliferation of downscaling methods has left the climate applications community with an overwhelming body of research to sort through without much in the form of synthesis guiding method selection or applicability. Motivated by the pressing socio-environmental challenges of climate change – and with the learnings from previous downscaling efforts in mind – we have begun working on a community-centered open framework for climate downscaling: scikit-downscale. We believe that the community will benefit from the presence of a well-designed open source downscaling toolbox with standard interfaces alongside a repository of benchmark data to test and evaluate new and existing downscaling methods. In this notebook, we provide an overview of the scikit-downscale project, detailing how it can be used to downscale a range of surface climate variables such as air temperature and precipitation. We also highlight how scikit-downscale framework is being used to compare existing methods and how it can be extended to support the development of new downscaling methods.
more » « less
Full Text Available
Multi-Cloud workflows with Pangeo and Dask Gateway

Augspurger, T.; Durant, M.; Abernathey, R.; Hamman, J. (June 2020, 2020 EarthCube Annual Meeting)
null (Ed.)
As more analysis-ready datasets are provided on the cloud, we need to consider how researchers access data. To maximize performance and minimize costs, we move the analysis to the data. This notebook demonstrates a Pangeo deployment connected to multiple Dask Gateways to enable analysis, regardless of where the data is stored. Public clouds are partitioned into regions, a geographic location with a cluster of data centers. A dataset like the National Water Model Short-Range Forecast is provided in a single region of some cloud provider (e.g. AWS’s us-east-1). To analyze that dataset efficiently, we do the analysis in the same region as the dataset. That’s especially true for very large datasets. Making local “dark replicas” of the datasets is slow and expensive. In this notebook we demonstrate a few open source tools to compute “close” to cloud data. We use Intake as a data catalog, to discover the datasets we have available and load them as an xarray Dataset. With xarray, we’re able to write the necessary transformations, filtering, and reductions that compose our analysis. To process the large amounts of data in parallel, we use Dask. Behind the scenes, we’ve configured this Pangeo deployment with multiple Dask Gateways, which provide a secure, multi-tenant server for managing Dask clusters. Each Gateway is provisioned with the necessary permissions to access the data. By placing compute (the Dask workers) in the same region as the dataset, we achieve the highest performance: these worker machines are physically close to the machines storing the data and have the highest bandwidth. We minimize cost by avoiding egress costs: fees charged to the data provider when data leaves a cloud region.
more » « less
Full Text Available

Search for: All records