ncreasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally-sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufficient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a first-class object that is indexed, thereby making it less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply become metadata rather than constraining concepts. In this paper we define cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.
An Exploratory Interface for Dataset Repositories Using Cell-Centric Indexing
Large collections of datasets are being published on the Web at an increasing rate. This poses a problem to researchers and data journalists who must sift through these large quantities of data to find datasets that meet their needs. Our solution to this problem is cell-centric indexing, a novel approach which considers the individual cell of a dataset to be the fundamental unit of search, indexing the corresponding metadata to each individual cell. This facilitates a new style of user interface that allows users to explore the collection via histograms that show the distributions of various terms organized by how they are used in the dataset.
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- IEEE International Conference on Big Data (IEEE BigData 2020)
- Page Range or eLocation-ID:
- 5716 - 5718
- Sponsoring Org:
- National Science Foundation
More Like this
Abstract Purpose The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in name spelling of individual scholars further complicates matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors containing the same surname and first initial of their first name. We illustrate our approach using three case studies. Design/methodology/approach The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net—i.e., a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his ormore »
ONe Index for All Kernels (ONIAK): A Zero Re-Indexing LSH Solution to ANNS-ALT (After Linear Transformation)In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problems called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual that we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension 𝑑 is not too large (say 𝑑 ≤ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when 𝑑 is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performances comparable to those of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers frommore »
null (Ed.)Large, comprehensive collections of single-cell RNA sequencing (scRNA-seq) datasets have been generated that allow for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets or transfer knowledge from one to the other to better understand cellular identity and functions. Here, we present a simple yet surprisingly effective method named common factor integration and transfer learning (cFIT) for capturing various batch effects across experiments, technologies, subjects, and even species. The proposed method models the shared information between various datasets by a common factor space while allowing for unique distortions and shifts in genewise expression in each batch. The model parameters are learned under an iterative nonnegative matrix factorization (NMF) framework and then used for synchronized integration from across-domain assays. In addition, the model enables transferring via low-rank matrix from more informative data to allow for precise identification in data of lower quality. Compared with existing approaches, our method imposes weaker assumptions on the cell composition of each individual dataset; however, it is shown to be more reliable in preserving biological variations. We apply cFIT to multiplemore »
Abstract Background Drug sensitivity prediction and drug responsive biomarker selection on high-throughput genomic data is a critical step in drug discovery. Many computational methods have been developed to serve this purpose including several deep neural network models. However, the modular relations among genomic features have been largely ignored in these methods. To overcome this limitation, the role of the gene co-expression network on drug sensitivity prediction is investigated in this study. Methods In this paper, we first introduce a network-based method to identify representative features for drug response prediction by using the gene co-expression network. Then, two graph-based neural network models are proposed and both models integrate gene network information directly into neural network for outcome prediction. Next, we present a large-scale comparative study among the proposed network-based methods, canonical prediction algorithms (i.e., Elastic Net, Random Forest, Partial Least Squares Regression, and Support Vector Regression), and deep neural network models for drug sensitivity prediction. All the source code and processed datasets in this study are available at https://github.com/compbiolabucf/drug-sensitivity-prediction . Results In the comparison of different feature selection methods and prediction methods on a non-small cell lung cancer (NSCLC) cell line RNA-seq gene expression dataset with 50 different drug treatments, wemore »