Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
Cutting-edge machine learning techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following research questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. These three questions are not only each challenging in their own right, but they also correspond to tightly interdependent problems. Yet existing techniques provide at best isolated solutions to a subset of these challenges. In this work, we propose the first approach, called LANCET, that successfully addresses all three challenges in an integrated framework. LANCET is based on a theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model, namely the Covariate-shift and the Continuity conditions. First, guided by the Covariate-shift condition, LANCET maps raw input data into a semantic feature space, where an unlabeled object is expected to share the same label with its near-by labeled neighbor. Next, guided by the Continuity condition, LANCET selects objects for labeling, aimingmore »
Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, to instantaneously adapt the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property, meaning that an LSH index built on the original distance metric is equally effective at supporting similarity search using an updated distance metric as long as the transform matrix learned for the new distance metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaption of the LSH index is shown to be necessary, we designmore »