This paper studies the evaluation of learning-based object detection models in conjunction with model-checking of formal specifications defined on an abstract model of an autonomous system and its environment. In particular, we define two metrics – proposition-labeled and class-labeled confusion matrices – for evaluating object detection, and we incorporate these metrics to compute the satisfaction probability of system-level safety requirements. While confusion matrices have been effective for comparative evaluation of classification and object detection models, our framework fills two key gaps. First, we relate the performance of object detection to formal requirements defined over downstream high-level planning tasks. In particular, we provide empirical results that show that the choice of a good object detection algorithm, with respect to formal requirements on the overall system, significantly depends on the downstream planning and control design. Secondly, unlike the traditional confusion matrix, our metrics account for variations in performance with respect to the distance between the ego and the object being detected. We demonstrate this framework on a car-pedestrian example by computing the satisfaction probabilities for safety requirements formalized in Linear Temporal Logic (LTL).
more »
« less
This content will become publicly available on October 15, 2026
Task-Relevant Evaluation Metrics of Object Detection for Quantitative System-Level Analysis of Safety-Critical Autonomous Systems
In safety-critical robotic systems, perception is tasked with representing the environment to effectively guide decision-making and plays a crucial role in ensuring that the overall system meets its requirements. To quantitatively assess the impact of object detection and classification errors on system-level performance, we present a rigorous formalism for a model of detection error, and probabilistically reason about the satisfaction of regular-safety temporal logic requirements at the system level. We also show how standard evaluation metrics for object detection, such as confusion matrices, can be represented as models of detection error, which enables the computation of probabilistic satisfaction of system-level specifications. However, traditional confusion matrices treat all detections equally, without considering their relevance to the system-level task. To address this limitation, we propose novel evaluation metrics for object detection that are informed by both the system-level task and the downstream control logic, enabling a more context-appropriate evaluation of detection models. We identify logic-based formulas relevant to the downstream control and system-level specifications and use these formulas to define a logic-based evaluation metric for object detection and classification. These logic-based metrics result in less conservative assessments of system-level performance. Finally, we demonstrate our approach on a car-pedestrian example with a leaderboard PointPillars model evaluated on the nuScenes dataset, and validate probabilistic system-level guarantees in simulation.
more »
« less
- Award ID(s):
- 2141153
- PAR ID:
- 10645813
- Publisher / Repository:
- Association for Computing Machinery
- Date Published:
- Journal Name:
- ACM Transactions on Cyber-Physical Systems
- ISSN:
- 2378-962X
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Object classification is a key element that enables effective decision-making in many autonomous systems. A more sophisticated system may also utilize the probability distribution over the classes instead of basing its decision only on the most likely class. This paper introduces new performance metrics: the absolute class error (ACE), expectation of absolute class error (EACE) and variance of absolute class error (VACE) for evaluating the accuracy of such probabilities. We test this metric using different neural network architectures and datasets. Furthermore, we present a new task-based neural network for object classification and compare its performance with a typical probabilistic classification model to show the improvement with threshold-based probabilistic decision-making.more » « less
-
After the 2017 TuSimple Lane Detection Challenge, its dataset and evaluation based on accuracy and F1 score have become the de facto standard to measure the performance of lane detection methods. While they have played a major role in improving the performance of lane detection methods, the validity of this evaluation method in downstream tasks has not been adequately researched. In this study, we design 2 new driving-oriented metrics for lane detection: End-to-End Lateral Deviation metric (E2E-LD) is directly formulated based on the requirements of autonomous driving, a core downstream task of lane detection; Per-frame Simulated Lateral Deviation metric (PSLD) is a lightweight surrogate metric of E2E-LD. To evaluate the validity of the metrics, we conduct a large-scale empirical study with 4 major types of lane detection approaches on the TuSimple dataset and our newly constructed dataset Comma2k19-LD. Our results show that the conventional metrics have strongly negative correlations (≤-0.55) with E2E-LD, meaning that some recent improvements purely targeting the conventional metrics may not have led to meaningful improvements in autonomous driving, but rather may actually have made it worse by overfitting to the conventional metrics. As autonomous driving is a security/safety-critical system, the underestimation of robustness hinders the sound development of practical lane detection models. We hope that our study will help the community achieve more downstream task-aware evaluations for lane detection.more » « less
-
Any safety issues or cyber attacks on an Industrial Control Systems (ICS) may have catastrophic consequences on human lives and the environment. Hence, it is imperative to have resilient tools and mechanisms to protect ICS. To verify the safety and security of the control logic, complete and consistent specifications should be defined to guide the testing process. Second, it is vital to ensure that those requirements are met by the program control algorithm. In this paper, we proposed an approach to formally define the system specifications, safety, and security requirements to build an ontology that is used further to verify the control logic of the PLC software. The use of ontology allowed us to reason about semantic concepts, check the consistency of concepts, and extract specifications by inference. For the proof of concept, we studied part of an industrial chemical process to implement the proposed approach. The experimental results in this work showed that the proposed approach detects inconsistencies in the formally defined requirements and is capable of verifying the correctness and completeness of the control logic. The tools and algorithms designed and developed as part of this work will help technicians and engineers create safer and more secure control logic for ICS processes.more » « less
-
We propose a deductive synthesis framework for construct- ing reinforcement learning (RL) agents that provably satisfy temporal reach-avoid specifications over infinite horizons. Our approach decomposes these temporal specifications into a sequence of finite-horizon subtasks, for which we synthesize individual RL policies. Using formal verification techniques, we ensure that the composition of a finite number of subtask policies guarantees satisfaction of the overall specification over infinite horizons. Experimental results on a suite of benchmarks show that our synthesized agents outperform standard RL methods in both task performance and compliance with safety and temporal requirements.more » « less
An official website of the United States government
