Large collections of datasets are being published on the Web at an increasing rate. This poses a problem for researchers and data journalists, who must sift through large quantities of data to find datasets that meet their needs. Our solution is cell-centric indexing, a novel approach that treats the individual cell of a dataset as the fundamental unit of search and indexes the metadata associated with each cell. This facilitates a new style of user interface that lets users explore the collection via histograms showing the distributions of various terms, organized by how they are used in the dataset.
An Architecture for Cell-Centric Indexing of Datasets
Increasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufficient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a first-class object that is indexed, thereby making search less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply becomes metadata rather than a constraining concept. In this paper we define cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.
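The idea of indexing each cell with its dataset and column as metadata can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`content`, `column_name`, `dataset_title`, `row_context`), the two helper functions, and the sample table are all hypothetical, chosen only to show one document per cell and a term histogram computed over a chosen field.

```python
from collections import Counter
from typing import Dict, List


def cell_documents(dataset_title: str, header: List[str],
                   rows: List[List[str]]) -> List[Dict[str, str]]:
    """Emit one document per cell; dataset and column context become plain fields.

    Field names here are hypothetical, not taken from the paper.
    """
    docs = []
    for row in rows:
        for col_index, value in enumerate(row):
            docs.append({
                "content": value,
                "column_name": header[col_index],
                "dataset_title": dataset_title,
                # Other values in the same row, kept as searchable context.
                "row_context": " ".join(
                    v for i, v in enumerate(row) if i != col_index
                ),
            })
    return docs


def field_histogram(docs: List[Dict[str, str]], query_field: str,
                    query_term: str, facet_field: str) -> Counter:
    """Count facet_field terms over cells whose query_field contains query_term."""
    counts: Counter = Counter()
    needle = query_term.lower()
    for doc in docs:
        if needle in doc[query_field].lower().split():
            counts.update(doc[facet_field].lower().split())
    return counts


# Made-up sample table, for illustration only.
docs = cell_documents(
    "City populations",
    ["city", "state", "population"],
    [["Bethlehem", "PA", "75781"], ["Allentown", "PA", "125845"]],
)
hist = field_histogram(docs, "content", "pa", "column_name")
```

Here `hist` reports which column names the term "pa" appears under, which is the kind of per-field distribution a histogram interface could display. A real system would presumably delegate storage and counting to an information retrieval engine rather than scanning a list in memory.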
- Journal Name: CEUR Workshop Proceedings
- Sponsoring Org: National Science Foundation
More Like this
Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events table, which has the following schema.
CREATE TABLE events( version INTEGER, timestamp TEXT, provider TEXT, spec TEXT, origin TEXT, ref TEXT, guessed_ref TEXT );
CREATE INDEX idx_timestamp ON events(timestamp);
- version indicates the version of the record as assigned by Binder. The origin field became available with version 3, and the ref field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.
- timestamp is the ISO timestamp of the launch
- provider gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the
Alonso, Omar; Marchesin, Stefano; Najork, Mark; Silvello, Gianmaria (Ed.)
We present a novel approach to dataset search and exploration. Cell-centric indexing is a unique indexing strategy that enables a powerful new interface. The strategy treats individual cells of a table as the indexed unit; combining this with a number of structure-specific fields enables queries that cannot be answered by a traditional indexing approach. Our interface provides users with an overview of a dataset repository and allows them to efficiently use various facets to explore the collection and identify datasets that match their interests.
Since their selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in the data collected from single-cell profiling, resulting in computational challenges in processing these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative to traditional machine learning approaches for single-cell analyses. Here, we survey a total of 25 DL algorithms and their applicability to specific steps in the single-cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network, and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in…
Budak, Ceren; Cha, Meeyoung; Quercia, Daniele; Xie, Lexing (Ed.)
Parler is an "alternative" social network promoting itself as a service that allows users to "speak freely and express yourself openly, without fear of being deplatformed for your views." Because of this promise, the platform became popular among users who were suspended on mainstream social networks for violating their terms of service, as well as those fearing censorship. In particular, the service was endorsed by several conservative public figures, who encouraged people to migrate from traditional social networks. After the storming of the US Capitol on January 6, 2021, Parler was progressively deplatformed: its app was removed from the Apple and Google Play stores and its website was taken down by the hosting provider. This paper presents a dataset of 183M Parler posts made by 4M users between August 2018 and January 2021, as well as metadata from 13.25M user profiles. We also present a basic characterization of the dataset, which shows that the platform witnessed large influxes of new users after being endorsed by popular figures, as well as a reaction to the 2020 US Presidential Election. We also show that discussion on the platform is dominated by conservative topics, President Trump, and conspiracy theories like QAnon.