NSF PAR Search | NSF Public Access Repository


AutoOD: Automatic Outlier Detection

https://doi.org/10.1145/3588700

Cao, Lei; Yan, Yizhou; Wang, Yu; Madden, Samuel; Rundensteiner, Elke A. (May 2023, Proceedings of the ACM on Management of Data)

Outlier detection is critical in the real world. Because many outlier detection techniques exist and often return different results for the same data set, users must determine which of these techniques is best suited to their task and tune its parameters. This is particularly challenging in the unsupervised setting, where no labels are available for the cross-validation needed for such method and parameter optimization. In this work, we propose AutoOD, which uses existing unsupervised detection techniques to automatically produce high-quality outliers without any human tuning. AutoOD's fundamentally new strategy unifies the merits of unsupervised outlier detection and supervised classification within one integrated solution. It automatically tests a diverse set of unsupervised outlier detectors on a target data set and extracts useful signals from their combined detection results to reliably capture key differences between outliers and inliers. It then uses these signals to produce a "custom outlier classifier" to classify outliers, with accuracy comparable to supervised outlier classification models trained with ground truth labels, yet without access to those labels. On a diverse set of benchmark outlier detection datasets, AutoOD consistently outperforms the best unsupervised outlier detector selected from hundreds of detectors. It also outperforms other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score.
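The general idea in this abstract (run several unsupervised detectors, keep the points on which they agree as pseudo-labels, then fit a classifier on those) can be illustrated with a toy sketch. This is not AutoOD's actual algorithm: the z-score and MAD detectors, the unanimous-vote rule, and the one-dimensional threshold classifier are all assumptions chosen only to keep the example small.

```python
from statistics import mean, median, pstdev

def zscore_detector(threshold):
    """Hypothetical detector: flag values far from the mean in std units."""
    def detect(values):
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            return [False] * len(values)
        return [abs(v - mu) / sigma > threshold for v in values]
    return detect

def mad_detector(threshold):
    """Hypothetical detector: flag values far from the median in MAD units."""
    def detect(values):
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            return [False] * len(values)
        return [abs(v - med) / mad > threshold for v in values]
    return detect

def pseudo_labels(values, detectors):
    """1 if every detector flags the point, 0 if none does, None otherwise."""
    votes = [detect(values) for detect in detectors]
    labels = []
    for i in range(len(values)):
        flagged = sum(v[i] for v in votes)
        if flagged == len(detectors):
            labels.append(1)
        elif flagged == 0:
            labels.append(0)
        else:
            labels.append(None)  # detectors disagree: leave unlabeled
    return labels

def fit_threshold_classifier(values, labels):
    """Fit a 1-D classifier on distance from the median of pseudo-inliers."""
    inliers = [v for v, lab in zip(values, labels) if lab == 0]
    outliers = [v for v, lab in zip(values, labels) if lab == 1]
    if not inliers or not outliers:
        raise ValueError("need at least one pseudo-inlier and one pseudo-outlier")
    centre = median(inliers)
    far_inlier = max(abs(v - centre) for v in inliers)
    near_outlier = min(abs(v - centre) for v in outliers)
    cut = (far_inlier + near_outlier) / 2  # midpoint between the two groups
    return lambda v: abs(v - centre) > cut

# Made-up data for illustration only.
values = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 13.0, 25.0, 9.7]
detectors = [zscore_detector(1.5), zscore_detector(2.5),
             mad_detector(3.0), mad_detector(6.0)]
labels = pseudo_labels(values, detectors)
is_outlier = fit_threshold_classifier(values, labels)
```

In this sketch the detectors disagree on one point, which therefore gets no pseudo-label; the fitted classifier is then what decides it, without any ground truth labels being supplied.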
Full Text Available
Tile-based Lightweight Integer Compression in GPU

https://doi.org/10.1145/3514221.3526132

Shanbhag, Anil; Yogatama, Bobbi; Yu, Xiangyao; Madden, Samuel (June 2022, Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22))

Full Text Available
A demonstration of AutoOD: a self-tuning anomaly detection system

https://doi.org/10.14778/3554821.3554880

Hofmann, Dennis; VanNostrand, Peter; Zhang, Huayi; Yan, Yizhou; Cao, Lei; Madden, Samuel; Rundensteiner, Elke (August 2022, Proceedings of the VLDB Endowment)

Anomaly detection is a critical task in applications like preventing financial fraud, system malfunctions, and cybersecurity attacks. While previous research has offered a plethora of anomaly detection algorithms, effective anomaly detection remains challenging for users due to the tedious manual tuning process. Currently, model developers must determine which of these numerous algorithms is best suited for their particular domain and then must tune many parameters by hand to make the chosen algorithm perform well. This demonstration showcases AutoOD, the first unsupervised self-tuning anomaly detection system, which frees users from this tedious manual tuning process. AutoOD outperforms the best unsupervised anomaly detection methods it deploys, with performance similar to that of supervised anomaly classification models, yet without requiring ground truth labels. Our easy-to-use visual interface allows users to gain insights into AutoOD's self-tuning process and explore the underlying patterns within their datasets.
Full Text Available
LANCET: labeling complex data at scale

https://doi.org/10.14778/3476249.3476269

Zhang, Huayi; Cao, Lei; Madden, Samuel; Rundensteiner, Elke (July 2021, Proceedings of the VLDB Endowment)

Cutting-edge machine learning techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following research questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. These three questions are not only each challenging in their own right, but they also correspond to tightly interdependent problems. Yet existing techniques provide at best isolated solutions to a subset of these challenges. In this work, we propose the first approach, called LANCET, that successfully addresses all three challenges in an integrated framework. LANCET is based on a theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model, namely the Covariate-shift and the Continuity conditions. First, guided by the Covariate-shift condition, LANCET maps raw input data into a semantic feature space, where an unlabeled object is expected to share the same label with its nearby labeled neighbor. Next, guided by the Continuity condition, LANCET selects objects for labeling, aiming to ensure that unlabeled objects always have some sufficiently close labeled neighbors. These two strategies jointly maximize the accuracy of the automatically produced labels and the prediction accuracy of the machine learning models trained on these labels. Lastly, LANCET uses a distribution matching network to verify whether both the Covariate-shift and Continuity conditions hold, in which case it is safe to terminate the labeling process. Our experiments on diverse public data sets demonstrate that LANCET consistently outperforms state-of-the-art methods, from Snuba to GOGGLES, and other baselines by a large margin, with an accuracy increase of up to 30 percentage points.
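The three questions in this abstract (which objects to label, how to propagate labels, when to stop) can be made concrete with a toy sketch. It is not LANCET's method: the sketch assumes the points already live in a feature space where nearby points share a label, and it replaces the distribution matching network with a much simpler stand-in stopping rule (stop once every point has a labeled neighbor within a hypothetical `radius`). The names `label_by_coverage` and `oracle` are illustrative.

```python
import math

def nearest_labeled(point, labeled):
    """Return (distance, label) for the closest labeled point."""
    return min((math.dist(point, p), lab) for p, lab in labeled.items())

def label_by_coverage(points, oracle, radius):
    """Query the oracle only for points that lack a close labeled neighbor.

    points: list of coordinate tuples in an assumed feature space
    oracle: callable standing in for a human annotator
    radius: assumed distance within which neighbors share a label
    Returns (labels for all points, number of oracle queries).
    """
    labeled = {}
    queries = 0
    while True:
        # Distance from each unlabeled point to its nearest labeled point.
        gaps = []
        for p in points:
            if p in labeled:
                continue
            d = nearest_labeled(p, labeled)[0] if labeled else math.inf
            if d > radius:
                gaps.append((d, p))
        if not gaps:
            break  # stand-in stopping rule: every point is covered
        # Ask about the point that is worst covered by existing labels.
        _, query = max(gaps)
        labeled[query] = oracle(query)
        queries += 1
    # Propagate: each remaining point copies its nearest labeled neighbor.
    result = {}
    for p in points:
        result[p] = labeled[p] if p in labeled else nearest_labeled(p, labeled)[1]
    return result, queries

# Made-up data: two well-separated clusters.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
oracle = lambda p: "a" if p[0] < 2.5 else "b"
auto_labels, num_queries = label_by_coverage(points, oracle, radius=1.0)
```

With two tight clusters and a radius of 1.0, one query per cluster is enough, and the remaining four points receive propagated labels.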
Full Text Available
ELITE: Robust Deep Anomaly Detection with Meta Gradient

https://doi.org/10.1145/3447548.3467320

Zhang, Huayi; Cao, Lei; VanNostrand, Peter; Madden, Samuel; Rundensteiner, Elke A. (August 2021, KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining)

Full Text Available
ATLANTIC: Making Database Differentially Private and Faster with Accuracy Guarantee

https://doi.org/10.14778/3476311.3476337

Cao, Lei; Xiao, Dongqing; Yan, Yizhou; Madden, Samuel; Li, Guoliang (January 2021, Proceedings of the International Conference on Very Large Data Bases)

Full Text Available
A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics

https://doi.org/10.1145/3318464.3380595

Shanbhag, Anil; Madden, Samuel; Yu, Xiangyao (May 2020, 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD’20))

Full Text Available
Large-scale in-memory analytics on Intel® Optane™ DC persistent memory

https://doi.org/10.1145/3399666.3399933

Shanbhag, Anil; Tatbul, Nesime; Cohen, David; Madden, Samuel (June 2020, DaMoN '20: Proceedings of the 16th International Workshop on Data Management on New Hardware)

Full Text Available
Continuously Adaptive Similarity Search

https://doi.org/10.1145/3318464.3380601

Zhang, Huayi; Cao, Lei; Yan, Yizhou; Madden, Samuel; Rundensteiner, Elke A. (June 2020, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data)

Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, to instantaneously adapt the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property, meaning that an LSH index built on the original distance metric is equally effective at supporting similarity search using an updated distance metric as long as the transform matrix learned for the new distance metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaptation of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs are received. Our experimental study using real-world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving a speedup of up to 3 orders of magnitude over state-of-the-art techniques.
Full Text Available
Kaskade: Graph Views for Efficient Graph Analytics

https://doi.org/10.1109/ICDE48307.2020.00024

da Trindade, Joana M.; Karanasos, Konstantinos; Curino, Carlo; Madden, Samuel; Shun, Julian (April 2020, IEEE International Conference on Data Engineering)
Full Text Available
