


Creators/Authors contains: "Chen, Justin"


  1. Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically bound the error of the estimated frequencies for any possible input. The work of Hsu et al. (2019) introduced the idea of using machine learning to tailor sketching algorithms to the specific data distribution they are being run on. In particular, their learning-augmented frequency estimation algorithm uses a learned heavy-hitter oracle that predicts which elements will appear many times in the stream. We give a novel algorithm which, in some parameter regimes, already theoretically outperforms the learning-based algorithm of Hsu et al. without the use of any predictions. Augmenting our algorithm with heavy-hitter predictions further reduces the error and improves upon the state of the art. Empirically, our algorithms achieve superior performance in all experiments compared to prior approaches.
    Free, publicly-accessible full text available September 21, 2024
  2. We study statistical/computational tradeoffs for the following density estimation problem: given k distributions v_1, ..., v_k over a discrete domain of size n, and sampling access to a distribution p, identify v_i that is “close” to p. Our main result is the first data structure that, given a sublinear (in n) number of samples from p, identifies v_i in time sublinear in k. We also give an improved version of the algorithm of Acharya et al. (2018) that reports v_i in time linear in k. The experimental evaluation of the latter algorithm shows that it achieves a significant reduction in the number of operations needed to achieve a given accuracy compared to prior work.
  3. An ε-approximate quantile sketch over a stream of n inputs approximates the rank of any query point q (that is, the number of input points less than q) up to an additive error of εn, generally with probability at least 1 − 1/poly(n), while consuming o(n) space. While the celebrated KLL sketch of Karnin, Lang, and Liberty is a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning’s t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem to attempt to achieve better approximations on real-world data sets than KLL while maintaining similar guarantees in the worst case.
  4. We propose a model for online graph problems where algorithms are given access to an oracle that predicts (e.g., based on modeling assumptions or on past data) the degrees of nodes in the graph. Within this model, we study the classic problem of online bipartite matching, and a natural greedy matching algorithm called MinPredictedDegree, which uses predictions of the degrees of offline nodes. For the bipartite version of a stochastic graph model due to Chung, Lu, and Vu where the expected values of the offline degrees are known and used as predictions, we show that MinPredictedDegree stochastically dominates any other online algorithm, i.e., it is optimal for graphs drawn from this model. Since the “symmetric” version of the model, where all online nodes are identical, is a special case of the well-studied “known i.i.d. model”, it follows that the competitive ratio of MinPredictedDegree on such inputs is at least 0.7299. For the special case of graphs with power law degree distributions, we show that MinPredictedDegree frequently produces matchings almost as large as the true maximum matching on such graphs. We complement these results with an extensive empirical evaluation showing that MinPredictedDegree compares favorably to state-of-the-art online algorithms for online matching. 
  5. As the most lethal major cancer, pancreatic cancer is a global healthcare challenge. Personalized medicine utilizing cutting-edge multi-omics data holds potential for major breakthroughs in tackling this critical problem. Radiomics and deep learning, two trendy quantitative imaging methods that take advantage of data science and modern medical imaging, have shown increasing promise in advancing the precision management of pancreatic cancer via diagnosis of precursor diseases, early detection, accurate diagnosis, and treatment personalization and optimization. Radiomics employs manually-crafted features, while deep learning applies computer-generated automatic features. These two methods aim to mine hidden information in medical images that is missed by conventional radiology and gain insights by systematically comparing the quantitative image information across different patients in order to characterize unique imaging phenotypes. Both methods have been studied and applied in various pancreatic cancer clinical applications. In this review, we begin with an introduction to the clinical problems and the technology. After providing technical overviews of the two methods, this review focuses on the current progress of clinical applications in precancerous lesion diagnosis, pancreatic cancer detection and diagnosis, prognosis prediction, treatment stratification, and radiogenomics. The limitations of current studies and methods are discussed, along with future directions. With better standardization and optimization of the workflow from image acquisition to analysis and with larger and especially prospective high-quality datasets, radiomics and deep learning methods could show real hope in the battle against pancreatic cancer through big data-based high-precision personalization.
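Illustration for record 1: the abstract describes a learned heavy-hitter oracle paired with a sketch such as CountMin. The following is a minimal sketch of that general idea only, not the algorithm of the paper or the exact construction of Hsu et al. The class names, the `is_heavy` callback (standing in for a learned oracle), and the salted use of Python's built-in `hash` are all assumptions made for illustration.

```python
import random
from collections import Counter


class CountMin:
    """Plain CountMin sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row, salt in zip(self.rows, self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, bucket in self._buckets(item):
            row[bucket] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum never underestimates.
        return min(row[bucket] for row, bucket in self._buckets(item))


class OracleCountMin:
    """Hypothetical wrapper: predicted heavy hitters get exact counters,
    everything else goes through the shared sketch."""

    def __init__(self, is_heavy, width, depth, seed=0):
        self.is_heavy = is_heavy  # assumed oracle: item -> bool
        self.exact = Counter()
        self.sketch = CountMin(width, depth, seed)

    def add(self, item, count=1):
        if self.is_heavy(item):
            self.exact[item] += count
        else:
            self.sketch.add(item, count)

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact[item]
        return self.sketch.estimate(item)
```

Routing predicted heavy hitters to exact counters keeps them from colliding with rare items in the sketch, which is the intuition behind the error reduction the abstract refers to. A wrong oracle prediction costs accuracy but not correctness, since the fallback is still an ordinary CountMin sketch.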
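Illustration for record 4: the abstract names a greedy algorithm, MinPredictedDegree, that uses predicted degrees of offline nodes. The sketch below assumes the rule the name suggests: each arriving online node is matched to its still-unmatched offline neighbor with the smallest predicted degree. The function name, the input format, and the tie-breaking (first neighbor listed wins) are assumptions, since the abstract does not spell them out.

```python
def min_predicted_degree(arrivals, predicted_degree):
    """Greedy online bipartite matching guided by predicted offline degrees.

    arrivals: iterable of (online_node, offline_neighbors) in arrival order.
    predicted_degree: dict mapping each offline node to its predicted degree.
    Returns a dict mapping matched offline nodes to their online partners.
    """
    matched = {}
    for online, neighbors in arrivals:
        free = [v for v in neighbors if v not in matched]
        if not free:
            continue  # every neighbor is already taken; this node stays unmatched
        # Prefer the neighbor expected to have the fewest future chances.
        best = min(free, key=lambda v: predicted_degree[v])
        matched[best] = online
    return matched


# Offline node "a" is predicted to have one neighbor, "b" to have two.
arrivals = [("u1", ["b", "a"]), ("u2", ["b"])]
matching = min_predicted_degree(arrivals, {"a": 1, "b": 2})
```

In this toy instance, a greedy rule that took the first listed neighbor would match u1 to b and leave u2 unmatched, whereas preferring the low-degree node a keeps b free for u2.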