Title: SciSciNet: A large-scale open data lake for the science of science research
Abstract

The science of science has attracted growing research interest, partly due to the increasing availability of large-scale datasets capturing the inner workings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of effort in data processing and measurement, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.

 
NSF-PAR ID:
10418049
Publisher / Repository:
Nature Publishing Group
Journal Name:
Scientific Data
Volume:
10
Issue:
1
ISSN:
2052-4463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In recent years, the availability of airborne imaging spectroscopy (hyperspectral) data has expanded dramatically. The high spatial and spectral resolution of these data uniquely enable spatially explicit ecological studies including species mapping, assessment of drought mortality and foliar trait distributions. However, we have barely begun to unlock the potential of these data to use direct mapping of vegetation characteristics to infer subsurface properties of the critical zone. To assess their utility for Earth systems research, imaging spectroscopy data acquisitions require integration with large, coincident ground‐based datasets collected by experts in ecology and environmental and Earth science. Without coordinated, well‐planned field campaigns, potential knowledge leveraged from advanced airborne data collections could be lost. Despite the growing importance of this field, documented methods to couple such a wide variety of disciplines remain sparse.

    We coordinated the first National Ecological Observatory Network Airborne Observation Platform (AOP) survey performed outside of their core sites, which took place in the Upper East River watershed, Colorado. Extensive planning for sample tracking and organization allowed field and flight teams to update the ground‐based sampling strategy daily. This enabled collection of an extensive set of physical samples to support a wide range of ecological, microbiological, biogeochemical and hydrological studies.

    We present a framework for integrating airborne and field campaigns to obtain high‐quality data for foliar trait prediction and document an archive of coincident physical samples collected to support a systems approach to ecological research in the critical zone. This detailed methodological account provides an example of how a multi‐disciplinary and multi‐institutional team can coordinate to maximize knowledge gained from an airborne survey, an approach that could be extended to other studies.

    The coordination of imaging spectroscopy surveys with appropriately timed and extensive field surveys, along with high‐quality processing of these data, presents a unique opportunity to reveal new insights into the structure and dynamics of the critical zone. To our knowledge, this level of co‐aligned sampling has never been undertaken in tandem with AOP surveys, and subsequent studies utilizing this archive will shed considerable light on the breadth of applications for which imaging spectroscopy data can be leveraged.

     
  2. Abstract

    For wildlife inhabiting snowy environments, snow properties such as onset date, depth, strength, and distribution can influence many aspects of ecology, including movement, community dynamics, energy expenditure, and forage accessibility. As a result, snow plays a considerable role in individual fitness and ultimately population dynamics, and its evaluation is, therefore, important for comprehensive understanding of ecosystem processes in regions experiencing snow. Such understanding, and particularly study of how wildlife–snow relationships may be changing, grows more urgent as winter processes become less predictable and often more extreme under global climate change. However, studying and monitoring wildlife–snow relationships continue to be challenging because characterizing snow, an inherently complex and constantly changing environmental feature, and identifying, accessing, and applying relevant snow information at appropriate spatial and temporal scales, often require a detailed understanding of physical snow science and technologies that typically lie outside the expertise of wildlife researchers and managers. We argue that thoroughly assessing the role of snow in wildlife ecology requires substantive collaboration between researchers with expertise in each of these two fields, leveraging the discipline‐specific knowledge brought by both wildlife and snow professionals. To facilitate this collaboration and encourage more effective exploration of wildlife–snow questions, we provide a five‐step protocol: (1) identify relevant snow property information; (2) specify spatial, temporal, and informational requirements; (3) build the necessary datasets; (4) implement quality control procedures; and (5) incorporate snow information into wildlife analyses. Additionally, we explore the types of snow information that can be used within this collaborative framework. 
    We illustrate, in the context of two examples, field observations, remote‐sensing datasets, and four example modeling tools that simulate spatiotemporal snow property distributions and, in some cases, evolutions. For each type of snow data, we highlight the collaborative opportunities for wildlife and snow professionals when designing snow data collection efforts, processing snow remote sensing products, producing tailored snow datasets, and applying the resulting snow information in wildlife analyses. We seek to provide a clear path for wildlife professionals to address wildlife–snow questions and improve ecological inference by integrating the best available snow science through collaboration with snow professionals.

     
  3. Summary

    The explosion of IoT devices and sensors in recent years has led to a demand for efficiently storing, processing and analyzing time‐series data. Geoscience researchers use time‐series data stores such as Hydroserver, Virtual Observatory and Ecological Informatics System (VOEIS), and Cloud‐Hosted Real‐time Data Service (CHORDS). Many of these tools require a great deal of infrastructure to deploy and expertise to manage and scale. The Tapis framework, an NSF‐funded project, provides science‐as‐a‐service APIs that allow researchers to achieve faster scientific results by eliminating the need to set up a complex infrastructure stack. The University of Hawai'i (UH) and Texas Advanced Computing Center (TACC) have collaborated to develop an open‐source Tapis Streams API that builds on the concepts of the CHORDS time‐series data service to support research. This new hosted service allows storing, processing, annotating, archiving, and querying time‐series data in the Tapis multi‐user and multi‐tenant collaborative platform. The Streams API provides a hosted, production‐level middleware service that enables new data‐driven event workflow capabilities that may be leveraged by researchers and Tapis‐powered science gateways for handling spatially indexed time‐series datasets.

     
  4. Summary

    We are in the midst of a scientific data explosion in which the rate of data growth is rapidly increasing. While large‐scale research projects have developed sophisticated data distribution networks to share their data with researchers globally, there is no such support for the many millions of research projects generating data of interest to much smaller audiences (as exemplified by the long tail scientist). In data‐oriented research, every aspect of the research process is influenced by data access. However, sharing and accessing data efficiently as well as lowering access barriers are difficult. In the absence of dedicated large‐scale storage, many have noted that there is an enormous storage capacity available via connected peers, none more so than the storage resources of many research groups. With widespread usage of the content delivery network model for disseminating web content, we believe a similar model can be applied to distributing, sharing, and accessing long tail research data in an e‐Science context. We describe the vision and architecture of a social content delivery network – a model that leverages the social networks of researchers to automatically share and replicate data on peers' resources based upon shared interests and trust. Using this model, we describe a simulator and investigate how aspects such as user activity, geographic distribution, trust, and replica selection algorithms affect data access and storage performance. From these results, we show that socially informed replication strategies are comparable with more general strategies in terms of availability and outperform them in terms of spatial efficiency. Copyright © 2016 John Wiley & Sons, Ltd.

     
  5. Abstract

    Growth of macroscale limnological research has been accompanied by an increase in secondary datasets compiled from multiple sources. We examined patterns of data availability in LAGOS‐NE, a dataset derived from 87 sources, to identify biases in availability of lake water quality data and to consider how such biases might affect perceived patterns at a subcontinental scale. Of eight common water quality parameters, variables indicative of trophic state (Secchi, chlorophyll, and total P) were most abundant in terms of total observations, lakes sampled, and long‐term records, whereas carbon variables (true color and dissolved organic carbon) were scarcest. Most data were collected during summer from larger (≥ 20 ha) lakes over 1–3 yr. Approximately 80% of data for each variable is derived from ~ 20% of sampled lakes. Long‐term (≥ 20 yr) records were rare and spatially clustered. Data availability is linked to major management challenges (eutrophication and acid rain), citizen science, and a few programs that quantify C and N variables. Resampling exercises suggested that correcting for the surface area sampling bias did not substantially change statistical distributions of the eight variables. Further, estimating a lake's long‐term median Secchi, chlorophyll, and total P using average record lengths had high uncertainty, but modest increases in sample size to > 5 yr yielded estimates with manageable error. Although the specific nature of sampling biases may vary among regions, we expect that they are widespread. Thus, large integrated datasets can and should be used to identify tendencies in how lakes are studied and to address these biases as part of broad‐scale limnological investigations.

     