Title: An Architecture for Cell-Centric Indexing of Datasets
Increasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally-sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufficient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a first-class object that is indexed, thereby making it less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply becomes metadata rather than a constraining concept. In this paper we define cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.
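The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`content`, `column`, `dataset`), the `CellIndex` class, and the sample tables are all hypothetical, and a small in-memory inverted index stands in for whatever information retrieval system the paper actually uses. Each cell becomes its own document, so a term is found whether it appears in a cell value, a column name, or a dataset title.

```python
from collections import Counter, defaultdict

FIELDS = ("content", "column", "dataset")  # hypothetical field names


def cell_documents(dataset_title, header, rows):
    """Yield one document per cell; dataset and column are plain metadata fields."""
    for row_idx, row in enumerate(rows):
        for col_name, value in zip(header, row):
            yield {
                "content": str(value),
                "column": col_name,
                "dataset": dataset_title,
                "row": row_idx,
            }


class CellIndex:
    """Toy inverted index keyed by (field, term); a stand-in for a real IR engine."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field in FIELDS:
            for term in doc[field].lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, term, field=None):
        """Return cell documents matching term in one field, or in any field."""
        term = term.lower()
        fields = [field] if field else FIELDS
        ids = set()
        for f in fields:
            ids |= self.postings.get((f, term), set())
        return [self.docs[i] for i in sorted(ids)]


# Made-up sample tables, for illustration only.
tables = [
    ("City Parks", ["name", "borough"],
     [["Central Park", "Manhattan"], ["Prospect Park", "Brooklyn"]]),
    ("Parking Permits", ["permit holder", "zone"], [["A. Smith", "Park Slope"]]),
    ("Tree Census", ["park", "species"], [["Riverside", "oak"]]),
]

index = CellIndex()
for title, header, rows in tables:
    for doc in cell_documents(title, header, rows):
        index.add(doc)

hits = index.search("park")                        # match in any field
hits_by_dataset = Counter(d["dataset"] for d in hits)
column_hits = index.search("park", field="column")  # match in column names only
```

Counting hits per dataset or per field, as `hits_by_dataset` does, gives the kind of per-term distribution a user could browse without knowing in advance how any one dataset is structured.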
Award ID(s):
1816325
PAR ID:
10254054
Journal Name:
CEUR workshop proceedings
Volume:
2722
ISSN:
1613-0073
Page Range / eLocation ID:
82-96
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large collections of datasets are being published on the Web at an increasing rate. This poses a problem to researchers and data journalists who must sift through these large quantities of data to find datasets that meet their needs. Our solution to this problem is cell-centric indexing, a novel approach which considers the individual cell of a dataset to be the fundamental unit of search, indexing the corresponding metadata to each individual cell. This facilitates a new style of user interface that allows users to explore the collection via histograms that show the distributions of various terms organized by how they are used in the dataset.
  2. Alonso, Omar; Marchesin, Stefano; Najork, Mark; Silvello, Gianmaria (Ed.)
    We present a novel approach to dataset search and exploration. Cell-centric indexing is a unique indexing strategy that enables a powerful, new interface. The strategy treats individual cells of a table as the indexed unit, and combining this with a number of structure-specific fields enables queries that cannot be answered by a traditional indexing approach. Our interface provides users with an overview of a dataset repository, and allows them to efficiently use various facets to explore the collection and identify datasets that match their interests. 
  3. Psychometric measures reflecting people’s knowledge, ability, attitudes, and personality traits are critical for many real-world applications, such as e-commerce, health care, and cybersecurity. However, traditional methods cannot collect and measure rich psychometric dimensions in a timely and unobtrusive manner. Consequently, despite their importance, psychometric dimensions have received limited attention from the natural language processing and information retrieval communities. In this article, we propose a deep learning architecture, PyNDA, to extract psychometric dimensions from user-generated texts. PyNDA contains a novel representation embedding, a demographic embedding, a structural equation model (SEM) encoder, and a multitask learning mechanism designed to work in unison to address the unique challenges associated with extracting rich, sophisticated, and user-centric psychometric dimensions. Our experiments on three real-world datasets encompassing 11 psychometric dimensions, including trust, anxiety, and literacy, show that PyNDA markedly outperforms traditional feature-based classifiers as well as the state-of-the-art deep learning architectures. Ablation analysis reveals that each component of PyNDA significantly contributes to its overall performance. Collectively, the results demonstrate the efficacy of the proposed architecture for facilitating rich psychometric analysis. Our results have important implications for user-centric information extraction and retrieval systems looking to measure and incorporate psychometric dimensions. 
  4. Scientific data, along with its analysis, accuracy, completeness, and reproducibility, plays a vital role in advancing science and engineering. Open Science Chain (OSC) provides a cyberinfrastructure platform, built using distributed ledger technologies, where verification information about scientific datasets is stored and managed in a consortium blockchain. Researchers have the ability to independently verify the authenticity of scientific results using the information stored with OSC. Researchers can also build research workflows by linking data entries in the ledger and external repositories such as GitHub that will allow for detailed provenance tracking. OSC enables answers to questions such as: how can we ensure research integrity when different research groups share and work on the same datasets across the world? Is it possible to enable quick verification of the exact datasets that were used for particular published research? Can we check the provenance of the data used in the research? In this poster, we highlight our work in building a secure, scalable architecture for OSC, including developing a security module for storing identities that can be used by the researchers of science gateway communities to increase the confidence of their scientific results.
  5. We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences. 