

Title: Combining Deep Learning and Verification for Precise Object Instance Detection
Deep learning object detectors often return false positives with very high confidence. Although they optimize generic detection performance, such as mean average precision (mAP), they are not designed for reliability. For a reliable detection system, if a high confidence detection is made, we would want high certainty that the object has indeed been detected. To achieve this, we have developed a set of verification tests which a proposed detection must pass to be accepted. We develop a theoretical framework which proves that, under certain assumptions, our verification tests will not accept any false positives. Based on an approximation to this framework, we present a practical detection system that can verify, with high precision, whether each detection of a machine-learning based object detector is correct. We show that these tests can improve the overall accuracy of a base detector and that accepted examples are highly likely to be correct. This allows the detector to operate in a high precision regime and can thus be used for robotic perception systems as a reliable instance detection method.
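As a rough sketch of the accept/reject pattern described above, the following Python fragment gates a base detector's outputs behind a battery of verification tests. The `Detection` type, the test interface, and all names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2), pixel coordinates

def passes_all_tests(detection, image, tests):
    """Accept a detection only if every verification test passes.

    `tests` is a list of callables returning True/False; the paper's
    actual tests are not specified here, so these are placeholders.
    """
    return all(test(detection, image) for test in tests)

def verified_detections(detections, image, tests):
    # Keep only detections that survive the full battery of tests,
    # deliberately trading recall for high precision.
    return [d for d in detections if passes_all_tests(d, image, tests)]
```

The design choice mirrors the abstract: rather than trusting the detector's confidence score, each proposed detection must independently pass every test before it is accepted.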
Award ID(s):
1849154
NSF-PAR ID:
10159659
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
100
ISSN:
2640-3498
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Identifying people in photographs is a critical task in a wide variety of domains, from national security [7] to journalism [14] to human rights investigations [1]. The task is also fundamentally complex and challenging. With the world population at 7.6 billion and growing, the candidate pool is large. Studies of human face recognition ability show that the average person incorrectly identifies two people as similar 20–30% of the time, and trained police detectives do not perform significantly better [11]. Computer vision-based face recognition tools have gained considerable ground and are now widely available commercially, but comparisons to human performance show mixed results at best [2,10,16]. Automated face recognition techniques, while powerful, also have constraints that may be impractical for many real-world contexts. For example, face recognition systems tend to suffer when the target image or reference images have poor quality or resolution, as blemishes or discolorations may be incorrectly recognized as false positives for facial landmarks. Additionally, most face recognition systems ignore some salient facial features, like scars or other skin characteristics, as well as distinctive non-facial features, like ear shape or hair or facial hair styles. This project investigates how we can overcome these limitations to support person identification tasks. By adjusting confidence thresholds, users of face recognition can generally expect high recall (few false negatives) at the cost of low precision (many false positives). Therefore, we focus our work on the "last mile" of person identification, i.e., helping a user find the correct match among a large set of similar-looking candidates suggested by face recognition. Our approach leverages the powerful capabilities of the human vision system and collaborative sensemaking via crowdsourcing to augment the complementary strengths of automatic face recognition. The result is a novel technology pipeline combining collective intelligence and computer vision. We scope this project to focus on identifying soldiers in photos from the American Civil War era (1861–1865). An estimated 4,000,000 soldiers fought in the war, and most were photographed at least once, due to decreasing costs, the increasing robustness of the format, and the critical events separating friends and family [17]. Over 150 years later, the identities of most of these portraits have been lost, but as museums and archives increasingly digitize and publish their collections online, the pool of reference photos and information has never been more accessible. Historians, genealogists, and collectors work tirelessly to connect names with faces, using largely manual identification methods [3,9]. Identifying people in historical photos is important for preserving material culture [9], correcting the historical record [13], and recognizing contributions of marginalized groups [4], among other reasons.
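The recall/precision tradeoff mentioned above can be pictured with a small sketch: lowering the match threshold keeps the true match in the candidate pool but also admits many look-alikes. The scores and identifiers below are hypothetical, not from any real face recognition engine.

```python
def candidate_matches(scores, threshold):
    """Return reference IDs whose similarity score exceeds the threshold.

    Lowering the threshold raises recall (the true match is more likely
    to be kept) but floods the user with look-alike false positives --
    the "last mile" problem described above.
    """
    return [ref_id for ref_id, s in scores.items() if s >= threshold]

# Hypothetical similarity scores from a face recognition engine.
scores = {"soldier_017": 0.91, "soldier_203": 0.88, "soldier_412": 0.52}
print(candidate_matches(scores, threshold=0.5))  # high recall, low precision
print(candidate_matches(scores, threshold=0.9))  # fewer, higher-confidence hits
```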
  2. Long-term object detection requires the integration of frame-based results over several seconds. For non-deformable objects, long-term detection is often addressed using object detection followed by video tracking. Unfortunately, tracking is inapplicable to objects that undergo dramatic changes in appearance from frame to frame. As a related example, we study hand detection over long video recordings in collaborative learning environments. More specifically, we develop long-term hand detection methods that can deal with partial occlusions and dramatic changes in appearance. Our approach integrates object detection followed by time projections, clustering, and small region removal to provide effective hand detection over long videos. The hand detector achieved an average precision (AP) of 72% at 0.5 intersection over union (IoU). The detection results were improved to 81% AP by using our optimized approach for data augmentation. The method runs at 4.7× real-time with an AP of 81% at 0.5 IoU. Our method reduced the number of false-positive hand detections by 80% by improving IoU ratios from 0.2 to 0.5. The overall hand detection system runs at 4× real-time.
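For readers unfamiliar with the 0.5 IoU criterion used above, here is the standard intersection-over-union computation, sketched in Python (this is the textbook definition, not code from the paper).

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct "at 0.5 IoU" when it overlaps the
# ground-truth box by at least half, by this measure.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... -> rejected at 0.5
```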
  3. Short-term precipitation forecasts are critical to regional water management, particularly in the Western U.S., where atmospheric rivers can be predicted reliably days in advance. However, spatial error in these forecasts may reduce their utility when the costs of false positives and negatives differ greatly. Here we investigate whether deep learning methods can leverage spatial patterns in precipitation forecasts to (a) improve the skill of predicting the occurrence of precipitation events at lead times from 1 to 14 days, and (b) balance the tradeoff between the rate of false negatives and false positives by modifying the discrimination threshold of the classifiers. This approach is demonstrated for the Sacramento River Basin, California, using the Global Ensemble Forecast System (GEFS) v2 precipitation fields as input to convolutional neural network (CNN) and multi-layer perceptron models. Results show that the deep learning models do not significantly improve the overall skill (F1 score) relative to the ensemble mean GEFS forecast with a bias-corrected threshold. However, additional analysis of the CNN models suggests they often correct missed predictions from GEFS by compensating for spatial error at longer lead times. Additionally, the deep learning models provide the ability to adjust the rate of false positives and negatives based on the ratio of costs. Finally, analysis of the network activations (saliency) indicates spatial patterns consistent with physical understanding of atmospheric river events in this region, lending additional confidence in the ability of the method to support water management applications.
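The cost-ratio thresholding idea can be made concrete with a small sketch: given predicted probabilities of a precipitation event and unequal costs for misses and false alarms, pick the discrimination threshold that minimizes expected cost on a validation set. The variable names and cost values below are illustrative assumptions, not taken from the study.

```python
import numpy as np

def best_threshold(probs, labels, cost_fn, cost_fp):
    """Scan thresholds; return the one minimizing total misclassification cost."""
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        preds = probs >= t
        fn = np.sum(labels & ~preds)   # missed events (costly for reservoir ops)
        fp = np.sum(~labels & preds)   # false alarms
        costs.append(cost_fn * fn + cost_fp * fp)
    return thresholds[int(np.argmin(costs))]

# Toy validation data: predicted event probabilities and observed outcomes.
probs = np.array([0.1, 0.4, 0.65, 0.8, 0.3, 0.9])
labels = np.array([0, 1, 1, 1, 0, 1], dtype=bool)
print(best_threshold(probs, labels, cost_fn=5.0, cost_fp=1.0))  # misses cost 5x more
```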

     
  4. In Video Analytics Pipelines (VAP), Analytics Units (AUs) such as object detection and face recognition running on remote servers critically rely on surveillance cameras to capture high-quality video streams in order to achieve high accuracy. Modern IP cameras come with a large number of camera parameters that directly affect the quality of the video stream capture. While a few such parameters, e.g., exposure, focus, and white balance, are automatically adjusted by the camera internally, the remaining ones are not. We denote such camera parameters as non-automated (NAUTO) parameters. In this paper, we first show that environmental condition changes can have a significant adverse effect on the accuracy of insights from the AUs, but such adverse impact can potentially be mitigated by dynamically adjusting NAUTO camera parameters in response to changes in environmental conditions. We then present CamTuner, to our knowledge the first framework that dynamically adapts NAUTO camera parameters to optimize the accuracy of AUs in a VAP in response to adverse changes in environmental conditions. CamTuner is based on SARSA reinforcement learning and incorporates two novel components: a lightweight analytics quality estimator and a virtual camera, which together drastically speed up offline RL training. Our controlled experiments and real-world VAP deployment show that, compared to a VAP using the default camera settings, CamTuner enhances VAP accuracy by detecting 15.9% additional persons and 2.6%–4.2% additional cars (without any false positives) in a large enterprise parking lot, and 9.7% additional cars in a 5G smart traffic intersection scenario, enabling a new use case of accurate and reliable automatic vehicle collision prediction (AVCP). CamTuner opens doors for new ways to significantly enhance video analytics accuracy beyond the incremental improvements obtained from refining deep-learning models.
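CamTuner's learning core is SARSA; the standard on-policy update it relies on looks like the following sketch. The state/action encoding, reward, and hyperparameters here are placeholders for illustration, not CamTuner's actual design.

```python
import random
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy SARSA step: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Hypothetical setup: states are coarse environmental conditions, actions
# nudge one NAUTO parameter up or down, and the reward would come from an
# analytics quality estimate of the resulting frames.
Q = defaultdict(float)
actions = ["brightness+", "brightness-", "contrast+", "contrast-"]
```

In the paper's framing, a virtual camera lets many such updates run offline against recorded footage instead of live hardware, which is what makes the training fast.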
  5. Usage of drones has increased substantially in both recreational and commercial applications and is projected to proliferate in the near future. As this demand rises, the threat they pose to both privacy and safety also increases. Delivering contraband and conducting unauthorized surveillance are new risks that accompany the growth of this technology. Prisons and other commercial settings where venue managers are concerned about public safety need cost-effective detection solutions in light of their increasingly strained budgets. Hence, there arises a need to design a drone detection system that is low cost, easy to maintain, and free of the need for expensive real-time human monitoring and supervision. To this end, this paper presents a low-cost drone detection system that employs a Convolutional Neural Network (CNN) algorithm making use of acoustic features. The Mel Frequency Cepstral Coefficients (MFCC) derived from audio signatures are fed as features to the CNN, which then predicts the presence of a drone. We compare field test results with an earlier Support Vector Machine (SVM) detection algorithm. Using the CNN yielded a decrease in false positives and an increase in the correct detection rate. Previous tests showed that the SVM was particularly susceptible to false alarms for lawn equipment and helicopters, which were significantly reduced when using the CNN. Also, in order to determine how well such a system compares to human performance and to explore including the end user in the detection loop, a human performance experiment was conducted. With a sample of 35 participants, the human classification accuracy was 92.47%. These preliminary results clearly indicate that humans are very good at distinguishing drones' acoustic signatures from other sounds and can augment the CNN's performance.
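A minimal sketch of the MFCC front end such a detector might use, written with librosa; the file name, normalization, and the downstream CNN are illustrative assumptions rather than the paper's pipeline.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, n_mfcc=13):
    """Load an audio clip and compute its MFCC matrix (n_mfcc x frames)."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Normalize each coefficient so the CNN sees zero-mean, unit-variance input.
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True)
    return (mfcc - mean) / (std + 1e-8)

# Hypothetical usage: feed the MFCC "image" to a trained binary CNN.
# features = mfcc_features("clip.wav")
# p_drone = cnn_model.predict(features[np.newaxis, ..., np.newaxis])
```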