
Title: SciSciNet: A large-scale open data lake for the science of science research

The science of science has attracted growing research interest, partly due to the increasing availability of large-scale datasets capturing the inner workings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of efforts in data processing and measurements, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.
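The linkage structure described above can be sketched as a simple table join. The miniature tables and column names below are illustrative assumptions for exposition, not SciSciNet's published schema; the actual tables and fields are documented in the data descriptor.

```python
import pandas as pd

# Hypothetical miniature versions of two linked SciSciNet-style tables.
# Real releases are distributed as flat files; the column names here are
# illustrative assumptions, not the published schema.
papers = pd.DataFrame({
    "PaperID": [1, 2, 3],
    "Year": [2019, 2020, 2020],
    "CitationCount": [10, 4, 0],
})
paper_funding = pd.DataFrame({
    "PaperID": [1, 1, 3],
    "GrantID": ["NSF-001", "NIH-042", "NSF-007"],
})

# Link publications to funding records (a left join keeps unfunded papers).
linked = papers.merge(paper_funding, on="PaperID", how="left")

# One example of a derived measure: distinct grants per paper.
grants_per_paper = (
    linked.groupby("PaperID")["GrantID"].nunique().rename("NumGrants")
)
print(grants_per_paper.to_dict())  # → {1: 2, 2: 0, 3: 1}
```

The same pattern extends to the other external linkages the abstract mentions (e.g., public uses), each joined on the publication identifier.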

Journal Name: Scientific Data (Nature Publishing Group)
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In recent years, the availability of airborne imaging spectroscopy (hyperspectral) data has expanded dramatically. The high spatial and spectral resolution of these data uniquely enable spatially explicit ecological studies including species mapping, assessment of drought mortality and foliar trait distributions. However, we have barely begun to unlock the potential of these data to use direct mapping of vegetation characteristics to infer subsurface properties of the critical zone. To assess their utility for Earth systems research, imaging spectroscopy data acquisitions require integration with large, coincident ground‐based datasets collected by experts in ecology and environmental and Earth science. Without coordinated, well‐planned field campaigns, potential knowledge leveraged from advanced airborne data collections could be lost. Despite the growing importance of this field, documented methods to couple such a wide variety of disciplines remain sparse.

    We coordinated the first National Ecological Observatory Network Airborne Observation Platform (AOP) survey performed outside of their core sites, which took place in the Upper East River watershed, Colorado. Extensive planning for sample tracking and organization allowed field and flight teams to update the ground‐based sampling strategy daily. This enabled collection of an extensive set of physical samples to support a wide range of ecological, microbiological, biogeochemical and hydrological studies.

    We present a framework for integrating airborne and field campaigns to obtain high‐quality data for foliar trait prediction and document an archive of coincident physical samples collected to support a systems approach to ecological research in the critical zone. This detailed methodological account provides an example of how a multi‐disciplinary and multi‐institutional team can coordinate to maximize knowledge gained from an airborne survey, an approach that could be extended to other studies.

    The coordination of imaging spectroscopy surveys with appropriately timed and extensive field surveys, along with high‐quality processing of these data, presents a unique opportunity to reveal new insights into the structure and dynamics of the critical zone. To our knowledge, this level of co‐aligned sampling has never been undertaken in tandem with AOP surveys and subsequent studies utilizing this archive will shed considerable light on the breadth of applications for which imaging spectroscopy data can be leveraged.

  2. Abstract

    For wildlife inhabiting snowy environments, snow properties such as onset date, depth, strength, and distribution can influence many aspects of ecology, including movement, community dynamics, energy expenditure, and forage accessibility. As a result, snow plays a considerable role in individual fitness and ultimately population dynamics, and its evaluation is, therefore, important for comprehensive understanding of ecosystem processes in regions experiencing snow. Such understanding, and particularly study of how wildlife–snow relationships may be changing, grows more urgent as winter processes become less predictable and often more extreme under global climate change. However, studying and monitoring wildlife–snow relationships continue to be challenging because characterizing snow, an inherently complex and constantly changing environmental feature, and identifying, accessing, and applying relevant snow information at appropriate spatial and temporal scales, often require a detailed understanding of physical snow science and technologies that typically lie outside the expertise of wildlife researchers and managers. We argue that thoroughly assessing the role of snow in wildlife ecology requires substantive collaboration between researchers with expertise in each of these two fields, leveraging the discipline‐specific knowledge brought by both wildlife and snow professionals. To facilitate this collaboration and encourage more effective exploration of wildlife–snow questions, we provide a five‐step protocol: (1) identify relevant snow property information; (2) specify spatial, temporal, and informational requirements; (3) build the necessary datasets; (4) implement quality control procedures; and (5) incorporate snow information into wildlife analyses. Additionally, we explore the types of snow information that can be used within this collaborative framework.
We illustrate, in the context of two examples, field observations, remote‐sensing datasets, and four example modeling tools that simulate spatiotemporal snow property distributions and, in some cases, their evolution. For each type of snow data, we highlight the collaborative opportunities for wildlife and snow professionals when designing snow data collection efforts, processing snow remote sensing products, producing tailored snow datasets, and applying the resulting snow information in wildlife analyses. We seek to provide a clear path for wildlife professionals to address wildlife–snow questions and improve ecological inference by integrating the best available snow science through collaboration with snow professionals.

  3. Summary

    The explosion of IoT devices and sensors in recent years has led to a demand for efficiently storing, processing and analyzing time‐series data. Geoscience researchers use time‐series data stores such as Hydroserver, Virtual Observatory and Ecological Informatics System (VOEIS), and Cloud‐Hosted Real‐time Data Service (CHORDS). Many of these tools require a great deal of infrastructure to deploy and expertise to manage and scale. The Tapis framework, an NSF‐funded project, provides science-as-a-service APIs that allow researchers to achieve scientific results faster by eliminating the need to set up a complex infrastructure stack. The University of Hawai'i (UH) and the Texas Advanced Computing Center (TACC) have collaborated to develop an open-source Tapis Streams API that builds on the concepts of the CHORDS time‐series data service to support research. This new hosted service allows storing, processing, annotating, archiving, and querying time‐series data in the Tapis multi‐user and multi‐tenant collaborative platform. The Streams API provides a hosted, production-level middleware service that enables new data‐driven event workflow capabilities that may be leveraged by researchers and Tapis-powered science gateways for handling spatially indexed time‐series datasets.
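    The kind of time-series data model CHORDS popularized (instruments reporting timestamped values for named variables, queried over time windows) can be sketched in memory. The class and field names below are illustrative assumptions, not the Streams API's actual schema; real use goes through the hosted service's REST endpoints.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative in-memory stand-in for a CHORDS-style measurement record.
# Field names are assumptions for exposition, not the real Streams schema.
@dataclass
class Measurement:
    instrument: str
    variable: str
    timestamp: datetime
    value: float

def query_window(measurements, variable, start, end):
    """Return measurements of one variable inside the half-open window [start, end)."""
    return [m for m in measurements
            if m.variable == variable and start <= m.timestamp < end]

data = [
    Measurement("rain_gauge_1", "precip_mm",
                datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc), 0.2),
    Measurement("rain_gauge_1", "precip_mm",
                datetime(2023, 1, 1, 6, 0, tzinfo=timezone.utc), 1.4),
    Measurement("rain_gauge_1", "temp_c",
                datetime(2023, 1, 1, 6, 0, tzinfo=timezone.utc), -3.0),
]

morning = query_window(
    data, "precip_mm",
    datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(len(morning))  # → 2 (the two precip_mm readings fall in the window)
```

    In the hosted service, the same query shape is expressed as an API request rather than a list comprehension; the point here is only the data model, not the transport.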

  4. Summary

    We are in the midst of a scientific data explosion in which the rate of data growth is rapidly increasing. While large‐scale research projects have developed sophisticated data distribution networks to share their data with researchers globally, there is no such support for the many millions of research projects generating data of interest to much smaller audiences (as exemplified by the long tail scientist). In data‐oriented research, every aspect of the research process is influenced by data access. However, sharing and accessing data efficiently as well as lowering access barriers are difficult. In the absence of dedicated large‐scale storage, many have noted that there is an enormous storage capacity available via connected peers, none more so than the storage resources of many research groups. With widespread usage of the content delivery network model for disseminating web content, we believe a similar model can be applied to distributing, sharing, and accessing long tail research data in an e‐Science context. We describe the vision and architecture of a social content delivery network – a model that leverages the social networks of researchers to automatically share and replicate data on peers' resources based upon shared interests and trust. Using this model, we describe a simulator and investigate how aspects such as user activity, geographic distribution, trust, and replica selection algorithms affect data access and storage performance. From these results, we show that socially informed replication strategies are comparable with more general strategies in terms of availability and outperform them in terms of spatial efficiency. Copyright © 2016 John Wiley & Sons, Ltd.

  5. Abstract

    Why the new findings matter

    The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.

    Implications for educational researchers and policy makers

    The progressing digitisation of societies around the globe and the impact of the SARS‐COV‐2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have shown the necessity of providing effective educational processes that can support sometimes overwhelmed teachers in digitally imparting knowledge, a goal on the agenda of many governments and policy makers. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for the individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi‐)automated assessments, algorithmic grading, personalised feedback and adaptive learning approaches. However, these promises are strongly tied to at least a basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences.

    Demonstrating both the promises and the challenges that are inherent to the collection and the analysis of large educational data with machine learning, this paper covers the essential topics that their application requires and provides easy‐to‐follow resources and code to facilitate the process of adoption.
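    One of the applications named above, (semi‐)automated assessment, is commonly framed as text classification. The sketch below is a generic scikit-learn pipeline on toy data; the answers, labels, and model choice are illustrative assumptions, not the paper's actual setup or code.

```python
# Toy sketch of (semi-)automated assessment as text classification.
# Data and model choice are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of hand-labeled student answers (the training data).
answers = [
    "photosynthesis converts light energy into chemical energy",
    "plants use sunlight to make glucose from co2 and water",
    "the moon orbits the earth once a month",
    "i do not know the answer",
]
labels = ["correct", "correct", "incorrect", "incorrect"]

# Bag-of-words features + a linear classifier, fit end to end.
grader = make_pipeline(TfidfVectorizer(), LogisticRegression())
grader.fit(answers, labels)

# Grade a new, unseen answer (in practice a human would review this).
new_answer = ["plants turn sunlight into chemical energy"]
print(grader.predict(new_answer)[0])
```

    In a semi-automated workflow, such a model would flag or pre-grade answers for human review rather than assign final marks, which is where the data-literacy demands discussed above come in.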
