skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees
Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptoticguarantees,makingthem unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how they accelerate queries involving aggregates on real datasets with typical speedups on the order of 10× over both traditional AQP-with-guarantees and exact methods, all while obeying accuracy constraints.  more » « less
Award ID(s):
2006664 2022448 1740751
PAR ID:
10275604
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
IEEE International Conference on Data Engineering workshop
ISSN:
1943-2895
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A classic problem in statistics is the estimation of the expectation of random variables from samples. This gives rise to the tightly connected problems of deriving concentration inequalities and confidence sequences, i.e., confidence intervals that hold uniformly over time. Previous studies have shown that it is possible to convert the regret guarantee of an online learning algorithm into concentration inequalities, but these concentration results were not tight. In this paper, we show regret guarantees of universal portfolio algorithms applied to the online learning problem of betting give rise to new implicit time-uniform concentration inequalities for bounded random variables. The key feature of our concentration results is that they are centered around the maximum log wealth of the best fixed betting strategy in hindsight. We propose numerical methods to solve these implicit inequalities, which results in confidence sequences that enjoy the empirical Bernstein rate with the optimal asymptotic behavior while being never worse than Bernoulli-KL confidence bounds. We further show that our confidence sequences are never vacuous with even one sample, for any given target failure rate δ∈(0,1). Our empirical study shows that our confidence bounds achieve the state-of-the-art performance, especially in the small sample regime. 
    more » « less
  2. Predicting the minimum operating voltage Vmin of chips is one of the important techniques for improving the manufacturing testing flow, as well as ensuring the long-term reliability and safety of in-field systems. Current Vmin prediction methods often provide only point estimates, necessitating additional techniques for constructing prediction confidence intervals to cover uncertainties caused by different sources of variations. While some existing techniques offer region predictions, but they rely on certain distributional assumptions and/or provide no coverage guarantees. In response to these limitations, we propose a novel distribution-free Vmin interval estimation methodology possessing a theoretical guarantee of coverage. Our approach leverages conformalized quantile regression and on-chip monitors to generate reliable prediction intervals. We demonstrate the effectiveness of the proposed method on an industrial 5nm automotive chip dataset. Moreover, we show that the use of on-chip monitors can reduce the interval length significantly for Vmin prediction. 
    more » « less
  3. High-level data types are often associated with semantic invariants that must be preserved by any correct implementation. While having implementations enforce strong guarantees such as linearizability or serializability can often be used to prevent invariant violations in concurrent settings, such mechanisms are impractical in geo-distributed replicated environments, the platform of choice for many scalable Web services. To achieve high-availability essential to this domain, these environments admit various forms of weak consistency that do not guarantee all replicas have a consistent view of an application's state. Consequently, they often admit difficult-to-understand anomalous behaviors that violate a data type's invariants, but which are extremely challenging, even for experts, to understand and debug. In this paper, we propose a novel programming framework for replicated data types (RDTs) equipped with an automatic (bounded) verification technique that discovers and fixes weak consistency anomalies. Our approach, implemented in a tool called Q9, involves systematically exploring the state space of an application executing on top of an eventually consistent data store, under an unrestricted consistency model but with a finite concurrency bound. Q9 uncovers anomalies (i.e., invariant violations) that manifest as finite counterexamples, and automatically generates repairs for such anamolies by selectively strengthening consistency guarantees for specific operations. Using Q9, we have uncovered a range of subtle anomalies in implementations of well-known benchmarks, and have been able to apply the repairs it mandates to effectively eliminate them. Notably, these benchmarks were written adopting best practices suggested to manage distributed replicated state (e.g., they are composed of provably convergent RDTs (CRDTs), avoid mutable state, etc.). While the safety guarantees offered by our technique are constrained by the concurrency bound, we show that in practice, proving bounded safety guarantees typically generalize to the unbounded case. 
    more » « less
  4. In critical infrastructure (CI) sectors such as emergency management or healthcare, researchers can analyze and detect useful patterns in data and help emergency management personnel efficaciously allocate limited resources or detect epidemiology spread patterns. However, all of this data contains personally identifiable information (PII) that needs to be safeguarded for legal and ethical reasons. Traditional techniques for safeguarding, such as anonymization, have shown to be ineffective. Differential privacy is a technique that supports individual privacy while allowing the analysis of datasets for societal benefit. This paper motivates the use of differential privacy to answer a wide range of queries about CI data containing PII with better privacy guarantees than is possible with traditional techniques. Moreover, it introduces a new technique based on Multipleattribute Workload Partitioning, which does not depend on the nature of the underlying dataset and provides better protection for privacy than current differential privacy approaches. 
    more » « less
  5. Conditional independence (CI) tests play a central role in statistical inference, machine learning, and causal discovery. Most existing CI tests assume that the samples are indepen- dently and identically distributed (i.i.d.). How- ever, this assumption often does not hold in the case of relational data. We define Relational Conditional Independence (RCI), a generaliza- tion of CI to the relational setting. We show how, under a set of structural assumptions, we can test for RCI by reducing the task of test- ing for RCI on non-i.i.d. data to the problem of testing for CI on several data sets each of which consists of i.i.d. samples. We develop Kernel Relational CI test (KRCIT), a nonpara- metric test as a practical approach to testing for RCI by relaxing the structural assumptions used in our analysis of RCI. We describe re- sults of experiments with synthetic relational data that show the benefits of KRCIT relative to traditional CI tests that don’t account for the non-i.i.d. nature of relational data. 
    more » « less