

Search for: All records

Creators/Authors contains: "Dasgupta, S."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Abelló, A ; Vassiliadis, P ; Romero, O ; Wrembel, R ; Bugiotti, F ; Gamper, J ; Vargas-Solar, G ; Zumpano, E (Ed.)
    Constructing knowledge graphs from heterogeneous data sources and evaluating their quality and consistency are important research questions in the field of knowledge graphs. We propose mapping rules that guide users in translating data from relational and graph sources into a meaningful knowledge graph, and we design a user-friendly language for specifying the mapping rules. Given the mapping rules and constraints on the source data, equivalent constraints on the target graph can be inferred; these are referred to as data source constraints. Besides this type of constraint, we design two other types: user-specified constraints and general rules that a high-quality knowledge graph should adhere to. We translate the three types of constraints into uniform expressions in the form of graph functional dependencies and extended graph dependencies, which can be used for consistency checking. Our approach provides a systematic way to build and evaluate knowledge graphs from diverse data sources.
    Free, publicly-accessible full text available September 4, 2024
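The core idea can be sketched in a few lines of Python. The relation schema, the mapping rule, and the property names below are invented for illustration; the paper's rule language is richer:

```python
def map_rows(rows):
    """Hypothetical mapping rule: relational row (emp_id, name, dept)
    becomes the triples (emp:ID, hasName, name) and (emp:ID, worksIn, dept)."""
    triples = []
    for emp_id, name, dept in rows:
        node = f"emp:{emp_id}"
        triples.append((node, "hasName", name))
        triples.append((node, "worksIn", dept))
    return triples

def violates_fd(triples, prop="hasName"):
    """Graph functional dependency inferred from a key on the source
    relation: each node may carry at most one value for `prop`."""
    seen = {}
    for s, p, o in triples:
        if p == prop:
            if s in seen and seen[s] != o:
                return True
            seen[s] = o
    return False
```

Here a key constraint on the source relation (emp_id determines name) induces an equivalent constraint on the target graph, which can then be checked uniformly alongside user-specified constraints.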
  2. We consider k-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream X is to achieve a total loss that is not too much larger than L(X, OPT), the best possible loss using k fixed centers in hindsight. We give the first algorithm to achieve polynomial space and time complexity in the online setting. 
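The loss model in this setting can be made concrete with the classic per-center-count heuristic. This is a minimal sketch of the problem setup, not the paper's polynomial-space algorithm:

```python
def online_kmeans(stream, k):
    """Online k-means sketch: each arriving point is assigned to its
    nearest center and pays the squared distance to it; that center is
    then nudged toward the point with a decaying 1/count learning rate.
    The first k points seed the centers and pay no loss."""
    centers, counts, total_loss = [], [], 0.0
    for x in stream:
        if len(centers) < k:
            centers.append(list(x))
            counts.append(1)
            continue
        # squared distance to each current center
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        j = min(range(k), key=dists.__getitem__)
        total_loss += dists[j]
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [ci + eta * (xi - ci) for ci, xi in zip(centers[j], x)]
    return centers, total_loss
```

The competitive goal is then to keep `total_loss` close to the best loss achievable with k centers fixed in hindsight.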
    We consider k-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream X is to achieve a total loss that is not too much larger than the best possible loss using k fixed centers in hindsight. Ours is the first algorithm to achieve polynomial space and time complexity in the online setting. We note that our results have implications for the related streaming setting, where one final clustering is output, and the no-substitution setting, where center selections are permanent. We show a general reduction between the no-substitution cost of a blackbox algorithm and its online cost. Finally, we translate our algorithm to the no-substitution and streaming settings, where it competes with and can outperform existing work.
    Abstract A discrete degree of freedom can be engineered to match the Hamiltonian of particles moving in a real-space lattice potential. Such synthetic dimensions are powerful tools for quantum simulation because of the control they offer and the ability to create configurations difficult to access in real space. Here, in an ultracold 84Sr atom, we demonstrate a synthetic dimension based on Rydberg levels coupled with millimeter waves. Tunneling amplitudes between synthetic lattice sites and on-site potentials are set by the millimeter-wave amplitudes and detunings, respectively. Alternating weak and strong tunneling in a one-dimensional configuration realizes the single-particle Su-Schrieffer-Heeger (SSH) Hamiltonian, a paradigmatic model of topological matter. The band structure is probed through optical excitation from the ground state to Rydberg levels, revealing symmetry-protected topological edge states at zero energy. Edge-state energies are robust to perturbations of the tunneling rates that preserve chiral symmetry, but can be shifted by the introduction of on-site potentials.
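The SSH band-structure claim can be checked numerically with a generic open-chain model. This reproduces only the textbook physics; the 20 unit cells and the tunneling amplitudes are arbitrary choices, not the experiment's parameters:

```python
import numpy as np

def ssh_hamiltonian(n_cells, t_weak, t_strong):
    """Open SSH chain with 2 * n_cells sites and alternating tunneling:
    weak bonds inside a unit cell, strong bonds between cells."""
    n = 2 * n_cells
    H = np.zeros((n, n))
    for bond in range(n - 1):
        t = t_weak if bond % 2 == 0 else t_strong
        H[bond, bond + 1] = H[bond + 1, bond] = t
    return H

# Topological termination: the chain begins and ends on a weak bond,
# so two symmetry-protected edge states sit in the middle of the gap.
H = ssh_hamiltonian(20, t_weak=0.5, t_strong=1.0)
energies = np.sort(np.abs(np.linalg.eigvalsh(H)))
# energies[0] and energies[1] are exponentially close to zero, while
# the bulk bands are separated by a gap of order |t_strong - t_weak|.
```

Swapping the roles of the weak and strong bonds removes the zero-energy pair, which is the finite-chain signature of the topological transition.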
  5. Oshima, J. ; Mochizuki, T. ; Hayashi, Y. (Ed.)
    Within machine learning, active learning studies the gains in performance made possible by adaptively selecting data points to label. In this work, we show through upper and lower bounds that, for a simple benign setting of well-specified logistic regression on a uniform distribution over a sphere, the expected excess error of both active learning and random sampling has the same inversely proportional dependence on the number of samples. Importantly, due to the nature of lower bounds, any more general setting does not allow a better dependence on the number of samples. Additionally, we show that a variant of uncertainty sampling can achieve a faster rate of convergence than random sampling by a factor of the Bayes error, a recent empirical observation made in other work. Qualitatively, this work is pessimistic with respect to the asymptotic dependence on the number of samples, but optimistic with respect to finding performance gains in the constants.
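The uncertainty-sampling idea can be sketched as margin-based querying for logistic regression. The toy data and plain SGD update below are illustrative choices, not the paper's exact variant or analysis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def uncertainty_query(w, pool):
    """Uncertainty sampling: query the unlabeled point whose predicted
    probability is closest to 1/2, i.e. the smallest margin |w . x|."""
    return int(np.argmin(np.abs(pool @ w)))

def sgd_update(w, x, y, lr=0.5):
    """One logistic-loss gradient step on a freshly labeled point."""
    return w - lr * (sigmoid(w @ x) - y) * x

# Toy run: 2-D points labeled by the sign of the first coordinate.
rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 2))
labels = (pool[:, 0] > 0).astype(float)
w = np.zeros(2)
for _ in range(50):
    i = uncertainty_query(w, pool)
    w = sgd_update(w, pool[i], labels[i])
acc = ((pool @ w > 0) == (labels == 1)).mean()
```

Each query lands near the current decision boundary, which is exactly where a label is most informative for this model class.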
    We prove asymptotic convergence for a general class of k-means algorithms performed over streaming data from a distribution: the centers asymptotically converge to the set of stationary points of the k-means objective function. To do so, we show that online k-means over a distribution can be interpreted as stochastic gradient descent with a stochastic learning rate schedule. Then, we prove convergence by extending techniques used in optimization literature to handle settings where center-specific learning rates may depend on the past trajectory of the centers.
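The SGD interpretation can be verified term by term: moving a center toward a point with step size eta is exactly one gradient step on f(c) = 0.5 * ||x - c||^2, whose gradient is c - x. The numbers below are arbitrary:

```python
import numpy as np

# One online k-means center move equals one SGD step on
# f(c) = 0.5 * ||x - c||^2, since grad f(c) = c - x:
#     c - eta * (c - x)  ==  c + eta * (x - c)
x = np.array([2.0, -1.0])          # arriving data point
c = np.array([0.0, 0.0])           # current center
eta = 0.3                          # stochastic learning rate for this step
sgd_step = c - eta * (c - x)       # gradient descent on f
kmeans_step = c + eta * (x - c)    # online k-means center update
```

Because eta here depends on which center the random point is assigned to, the effective learning rate is itself stochastic, which is what the proof has to accommodate.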
    Recent work introduced the model of learning from discriminative feature feedback, in which a human annotator not only provides labels of instances, but also identifies discriminative features that highlight important differences between pairs of instances. It was shown that such feedback can be conducive to learning, and makes it possible to efficiently learn some concept classes that would otherwise be intractable. However, these results all relied upon perfect annotator feedback. In this paper, we introduce a more realistic, robust version of the framework, in which the annotator is allowed to make mistakes. We show how such errors can be handled algorithmically, in both an adversarial and a stochastic setting. In particular, we derive regret bounds in both settings that, as in the case of a perfect annotator, are independent of the number of features. We show that this result cannot be obtained by a naive reduction from the robust setting to the non-robust setting.
    Detecting overfitting in generative models is an important challenge in machine learning. In this work, we formalize a form of overfitting that we call data-copying, in which the generative model memorizes and outputs training samples or small variations thereof. We provide a three-sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model, and study the performance of our test on several canonical models and datasets.
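A drastically simplified stand-in for such a three-sample test compares nearest-neighbor distances to the training set with a rank statistic. The paper's actual test statistic differs, and the function and variable names here are invented:

```python
import numpy as np

def copying_score(train, held_out, generated):
    """Compare nearest-neighbor distances to the training set.
    A score near 0.5 is consistent with no data-copying; a score far
    below 0.5 means generated samples sit suspiciously closer to the
    training data than fresh held-out samples do."""
    def nn_dists(pts):
        d = np.linalg.norm(pts[:, None, :] - train[None, :, :], axis=2)
        return d.min(axis=1)
    g, h = nn_dists(generated), nn_dists(held_out)
    # rank statistic: fraction of pairs where the generated sample is
    # at least as far from the training set as the held-out sample
    return (g[:, None] >= h[None, :]).mean()
```

A model that emits exact training points scores 0, while a model that truly samples the target distribution scores near 0.5, matching the intuition the abstract describes.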
  10. We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed. 
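One embedding type common in the knowledge-graph literature is translation-based (TransE-style). The toy graph below, with invented node names, illustrates a relation such an embedding can capture and one it cannot:

```python
# A directed edge (u, v) is captured when f(v) = f(u) + t for one shared
# translation vector t. A directed path embeds in a single dimension,
# while a directed cycle cannot: summing t around the cycle forces t = 0.
f = {"a": 0.0, "b": 1.0, "c": 2.0}   # 1-D embedding of the path a -> b -> c
t = 1.0

def captures(edges):
    """True when every edge is realized by the shared translation t."""
    return all(abs(f[v] - (f[u] + t)) < 1e-9 for u, v in edges)

path = [("a", "b"), ("b", "c")]
cycle = path + [("c", "a")]
```

Arguments of this shape, applied to each embedding type, are what drive characterizations of capturable relations and lower bounds on dimension and precision.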