Title: An Architecture for Cell-Centric Indexing of Datasets
Increasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally-sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufficient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a first-class object that is indexed, thereby making it less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply become metadata rather than constraining concepts. In this paper we define cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.
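To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each cell becomes one indexed record that carries its value, its column name, and its dataset title, and it uses a plain in-memory inverted index in place of a real information retrieval engine. The names `CellDoc`, `cells_from_table`, `build_index`, and `column_histogram` are hypothetical, as is the sample data.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CellDoc:
    """One indexed unit: a single table cell plus its metadata (hypothetical schema)."""
    value: str
    column: str
    dataset: str


def cells_from_table(dataset, header, rows):
    """Flatten a table into one CellDoc per cell."""
    return [
        CellDoc(value=str(v), column=col, dataset=dataset)
        for row in rows
        for col, v in zip(header, row)
    ]


def build_index(cells):
    """Inverted index: (field, lowercase term) -> set of positions in `cells`."""
    index = defaultdict(set)
    for i, cell in enumerate(cells):
        for field in ("value", "column", "dataset"):
            for term in getattr(cell, field).lower().split():
                index[(field, term)].add(i)
    return index


def column_histogram(cells, index, field, term):
    """Count, per column name, the cells whose `field` contains `term`."""
    hits = index.get((field, term.lower()), set())
    return Counter(cells[i].column for i in hits)


# Illustrative data only: two small made-up tables.
cells = cells_from_table(
    "City Parks",
    ["name", "borough"],
    [["Central Park", "Manhattan"], ["Prospect Park", "Brooklyn"]],
)
cells += cells_from_table(
    "Parking Lots",
    ["lot", "borough"],
    [["Park Avenue Garage", "Manhattan"]],
)
index = build_index(cells)
print(column_histogram(cells, index, "value", "park"))
```

Because the dataset title and column name are stored as fields of each cell record, the same query function can match on a cell's value, its column, or its dataset without treating any of them as a required container for the search.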
Johnson, Drake; Register, Keith; Davison, Brian D.; Heflin, Jeff (IEEE International Conference on Big Data (IEEE BigData 2020))
Large collections of datasets are being published on the Web at an increasing rate. This poses a problem to researchers and data journalists who must sift through these large quantities of data to find datasets that meet their needs. Our solution to this problem is cell-centric indexing, a novel approach that treats the individual cell of a dataset as the fundamental unit of search and indexes the metadata associated with each cell. This facilitates a new style of user interface that allows users to explore the collection via histograms that show the distributions of various terms organized by how they are used in the dataset.
Heflin, Jeff; Davison, Brian D.; Jia, Haiyan (Proceedings of DESIRES 2021: Design of Experimental Search and Information REtrieval Systems, CEUR Workshop Proceedings)
We present a novel approach to dataset search and exploration. Cell-centric indexing is a unique indexing strategy that enables a powerful, new interface. The strategy treats individual cells of a table as the indexed unit, and combining this with a number of structure-specific fields enables queries that cannot be answered by a traditional indexing approach. Our interface provides users with an overview of a dataset repository, and allows them to efficiently use various facets to explore the collection and identify datasets that match their interests.
Scientific data, along with its analysis, accuracy, completeness, and reproducibility, plays a vital role in advancing science and engineering. Open Science Chain (OSC) provides a cyberinfrastructure platform, built using distributed ledger technologies, where verification information about scientific datasets is stored and managed in a consortium blockchain. Researchers have the ability to independently verify the authenticity of scientific results using the information stored with OSC. Researchers can also build research workflows by linking data entries in the ledger and external repositories such as GitHub that will allow for detailed provenance tracking. OSC enables answers to questions such as: how can we ensure research integrity when different research groups share and work on the same datasets across the world? Is it possible to enable quick verification of the exact datasets that were used for particular published research? Can we check the provenance of the data used in the research? In this poster, we highlight our work in building a secure, scalable architecture for OSC, including developing a security module for storing identities that can be used by the researchers of science gateway communities to increase the confidence of their scientific results.
Jian Wu, Bharath Kandimalla (IEEE International Conference on Big Data)
We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences.
Juozas Gaigalas, Liping Di (ISPRS International Journal of Geo-Information)
Understanding the past, present, and changing behavior of the climate requires close collaboration of a large number of researchers from many scientific domains. At present, the necessary interdisciplinary collaboration is greatly limited by the difficulties in discovering, sharing, and integrating climatic data due to the tremendously increasing data size. This paper discusses the methods and techniques for solving the inter-related problems encountered when transmitting, processing, and serving metadata for heterogeneous Earth System Observation and Modeling (ESOM) data. A cyberinfrastructure-based solution is proposed to enable effective cataloging and two-step search on big climatic datasets by leveraging state-of-the-art web service technologies and crawling the existing data centers. To validate its feasibility, the big dataset served by UCAR THREDDS Data Server (TDS), which provides Petabyte-level ESOM data and updates hundreds of terabytes of data every day, is used as the case study dataset. A complete workflow is designed to analyze the metadata structure in TDS and create an index for data parameters. A simplified registration model which defines constant information, delimits secondary information, and exploits spatial and temporal coherence in metadata is constructed. The model derives a sampling strategy for a high-performance concurrent web crawler bot which is used to mirror the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standard-compliant catalog service form an incremental search cyberinfrastructure, allowing scientists to search the big climatic datasets in near real-time. 
The proposed approach has been tested on UCAR TDS, and the results show that it achieves its design goal, boosting the crawling speed by at least 10 times and reducing the redundant metadata from 1.85 gigabytes to 2.2 megabytes, a significant step toward making today's largely unsearchable climate data servers searchable.
Qiu, Lixuan, Jia, Haiyan, Davison, Brian D., and Heflin, Jeff. An Architecture for Cell-Centric Indexing of Datasets. Retrieved from https://par.nsf.gov/biblio/10254054. CEUR workshop proceedings 2722.
Qiu, Lixuan, Jia, Haiyan, Davison, Brian D., & Heflin, Jeff. An Architecture for Cell-Centric Indexing of Datasets. CEUR workshop proceedings, 2722. Retrieved from https://par.nsf.gov/biblio/10254054.
Qiu, Lixuan, Jia, Haiyan, Davison, Brian D., and Heflin, Jeff. "An Architecture for Cell-Centric Indexing of Datasets". CEUR workshop proceedings 2722. https://par.nsf.gov/biblio/10254054.
@article{osti_10254054,
title = {An Architecture for Cell-Centric Indexing of Datasets},
url = {https://par.nsf.gov/biblio/10254054},
abstractNote = {Increasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally-sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufficient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a first-class object that is indexed, thereby making it less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply become metadata rather than constraining concepts. In this paper we define cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.},
journal = {CEUR workshop proceedings},
volume = {2722},
author = {Qiu, Lixuan and Jia, Haiyan and Davison, Brian D. and Heflin, Jeff},
}