Rethink reporting of evaluation results in AI

Burnell, Ryan; Schellaert, Wout; Burden, John; Ullman, Tomer D.; Martinez-Plumed, Fernando; Tenenbaum, Joshua B.; Rutar, Danaja; Cheke, Lucy G.; Sohl-Dickstein, Jascha; Mitchell, Melanie; Kiela, Douwe; Shanahan, Murray; Voorhees, Ellen M.; Cohn, Anthony G.; Leibo, Joel Z.; Hernandez-Orallo, Jose

doi:10.1126/science.adf6369

Citation Details

Artificial intelligence (AI) systems have begun to be deployed in high-stakes contexts, including autonomous driving and medical diagnosis. In contexts such as these, the consequences of system failures can be devastating. It is therefore vital that researchers and policy-makers have a full understanding of the capabilities and weaknesses of AI systems so that they can make informed decisions about where these systems are safe to use and how they might be improved. Unfortunately, current approaches to AI evaluation make it exceedingly difficult to build such an understanding, for two key reasons. First, aggregate metrics make it hard to predict how a system will perform in a particular situation. Second, the instance-by-instance evaluation results that could be used to unpack these aggregate metrics are rarely made available (1). Here, we propose a path forward in which results are presented in more nuanced ways and instance-by-instance evaluation results are made publicly available.

Award ID(s): 2139983

PAR ID: 10448744

Author(s) / Creator(s): Burnell, Ryan; Schellaert, Wout; Burden, John; Ullman, Tomer D.; Martinez-Plumed, Fernando; Tenenbaum, Joshua B.; Rutar, Danaja; Cheke, Lucy G.; Sohl-Dickstein, Jascha; Mitchell, Melanie; Kiela, Douwe; Shanahan, Murray; Voorhees, Ellen M.; Cohn, Anthony G.; Leibo, Joel Z.; Hernandez-Orallo, Jose

Date Published: 2023-04-14

Journal Name: Science

Volume: 380

Issue: 6641

ISSN: 0036-8075

Page Range / eLocation ID: 136 to 138

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article:
https://doi.org/10.1126/science.adf6369
