As privacy concerns escalate in machine learning, recent legislation allows data owners to have their data removed from trained models via machine unlearning. To make unlearning transparent and to deter dishonesty by model providers, various verification strategies have been proposed that let data owners check whether their target data has actually been unlearned from the model. However, the safety of these verification strategies themselves remains poorly understood. In this paper, we explore the novel research question of whether model providers can circumvent verification strategies while retaining the information of the data supposedly unlearned. Our investigation leads to a pessimistic answer: the verification of machine unlearning is fragile. Specifically, we categorize the current verification strategies, with respect to potential dishonesty among model providers, into two types. We then introduce two novel adversarial unlearning processes capable of circumventing both types. We validate the efficacy of our methods through theoretical analysis and empirical experiments on real-world datasets. This study highlights the vulnerabilities and limitations of machine unlearning verification, paving the way for further research into the safety of machine unlearning.
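To make the threat model concrete, here is a minimal, hypothetical sketch of one common verification style, backdoor-based verification, in which the data owner plants a trigger in contributed data and later probes whether the trigger still drives predictions. The trigger pattern, labels, threshold, and the dishonest_model stand-in are illustrative assumptions, not the paper's constructions.

```python
# Hypothetical sketch of backdoor-style unlearning verification.
import numpy as np

rng = np.random.default_rng(0)

def add_trigger(x, trigger_value=1.0, n_features=3):
    # Stamp a fixed trigger pattern onto the last few features.
    x = x.copy()
    x[:, -n_features:] = trigger_value
    return x

def verify_unlearning(predict_fn, owner_inputs, backdoor_label, threshold=0.1):
    # If the model still maps triggered inputs to the owner's backdoor label
    # at a high rate, the owner's data was likely not unlearned.
    preds = predict_fn(add_trigger(owner_inputs))
    attack_success_rate = float(np.mean(preds == backdoor_label))
    return attack_success_rate <= threshold, attack_success_rate

# Toy stand-in for a provider that never unlearned: the trigger still works.
def dishonest_model(x):
    return np.where(x[:, -1] == 1.0, 7, rng.integers(0, 10, size=len(x)))

owner_x = rng.normal(size=(100, 32))
passed, asr = verify_unlearning(dishonest_model, owner_x, backdoor_label=7)
print(f"attack success rate = {asr:.2f}; verification passed: {passed}")
```

In this toy run the trigger still fires after the claimed unlearning, so verification fails; the paper's adversarial unlearning processes show how a provider could instead evade such probes while retaining the data's information.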
This content will become publicly available on December 7, 2026
RippleBench: Capturing Ripple Effects by Leveraging Existing Knowledge Repositories
The ability to make targeted updates to models, whether for unlearning, debiasing, model editing, or safety alignment, is central to AI safety. While these interventions aim to modify specific knowledge (e.g., removing virology content), their effects often propagate to related but unintended areas (e.g., allergies). Due to a lack of standardized tools, existing evaluations typically compare performance on targeted versus unrelated general tasks, overlooking this broader collateral impact, called the "ripple effect". We introduce RippleBench, a benchmark for systematically measuring how interventions affect semantically related knowledge. Using RippleBench, built on top of a Wikipedia-RAG pipeline for generating multiple-choice questions, we evaluate eight state-of-the-art unlearning methods. We find that all methods exhibit non-trivial accuracy drops on topics increasingly distant from the unlearned knowledge, each with a distinct propagation profile. We release our codebase for on-the-fly ripple evaluation as well as RippleBench-WMDP-Bio, a dataset derived from WMDP biology, containing 9,888 unique topics and 49,247 questions.
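The full benchmark builds questions with a Wikipedia-RAG pipeline; the sketch below only illustrates the core measurement, accuracy as a function of semantic distance from the unlearned topic. The distance buckets, questions, and toy model are invented for illustration.

```python
# Toy sketch of ripple-effect measurement: accuracy per distance bucket.
from collections import defaultdict

def ripple_profile(questions, answer_fn):
    # questions: iterable of (distance_bucket, prompt, correct_choice).
    correct, total = defaultdict(int), defaultdict(int)
    for bucket, prompt, gold in questions:
        total[bucket] += 1
        correct[bucket] += int(answer_fn(prompt) == gold)
    return {b: correct[b] / total[b] for b in sorted(total)}

# Toy example: unlearning broke nearby knowledge but spared distant topics.
questions = [
    (0, "virology question", "A"),    # the unlearned topic itself
    (1, "immunology question", "B"),  # semantically nearby
    (3, "geography question", "C"),   # distant
]
toy_model = {"virology question": "D", "immunology question": "D",
             "geography question": "C"}.get
print(ripple_profile(questions, toy_model))  # {0: 0.0, 1: 0.0, 3: 1.0}
```

A flat profile would indicate a surgical intervention; the paper reports that real unlearning methods instead show accuracy drops that persist on increasingly distant topics.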
- Award ID(s): 2218803
- PAR ID: 10642780
- Publisher / Repository: Mechanistic Interpretability Workshop at NeurIPS 2025 (https://mechinterpworkshop.com/)
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model's responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.
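As a rough sketch of the evaluation loop this abstract describes, the snippet below scores a model's true/false verdicts on labeled climate claims; the claims and the stand-in classifier are illustrative assumptions, not the paper's dataset or models.

```python
# Toy sketch: accuracy of a claim classifier on labeled climate statements.
def accuracy_on_claims(claims, classify_fn):
    # claims: list of (statement, label) pairs with label in {True, False}.
    hits = sum(classify_fn(s) == y for s, y in claims)
    return hits / len(claims)

claims = [
    ("Atmospheric CO2 is higher now than in pre-industrial times.", True),
    ("Global warming stopped in 1998.", False),
]

# Stand-in classifier; in practice this would wrap an LLM with a prompt such
# as "Is the following claim true or false? ..." and parse the response.
def naive_classifier(statement):
    return "CO2" in statement

print(f"accuracy = {accuracy_on_claims(claims, naive_classifier):.2f}")
```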
-
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging, due both to the potential collateral damage caused by the fuzzy boundary between retention and forgetting and to the large computational cost of optimizing state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models that addresses both knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth-order optimization toward the unlearning objective offline, and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that was never trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases. We have made our code publicly available at this https URL.
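Below is a minimal sketch of the inference-time gating ECO describes: a prompt classifier flags forget-set prompts, and a corruption (learned offline in the paper via zeroth-order optimization) is added to their embeddings before the frozen model runs. The dimensions, classifier, and corruption here are toy assumptions.

```python
# Toy sketch of ECO-style inference gating with a frozen model.
import torch

torch.manual_seed(0)
d_model = 16
frozen_llm = torch.nn.Linear(d_model, d_model)  # stand-in for the frozen LLM
corruption = 0.5 * torch.randn(d_model)         # in ECO, learned offline

def should_forget(prompt_embedding):
    # Stand-in prompt classifier; a real one is trained to flag forget-set prompts.
    return prompt_embedding.mean().item() > 0.0

def eco_forward(prompt_embedding):
    # Corrupt only flagged prompts; all others pass through unchanged.
    if should_forget(prompt_embedding):
        prompt_embedding = prompt_embedding + corruption
    return frozen_llm(prompt_embedding)

print(eco_forward(torch.randn(d_model)).shape)  # torch.Size([16])
```

Because the base model's weights are never touched, the approach adds no per-parameter unlearning cost, which is what the abstract's 0.5B-to-236B scalability result highlights.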
-
Given the availability of abundant data, deep learning models have advanced and become ubiquitous in the past decade. In practice, for many reasons (e.g., privacy, usability, and fidelity), individuals also want trained deep models to forget some specific data. Motivated by this, machine unlearning (also known as selective data forgetting) has been intensively studied; it aims to remove the influence that any particular training sample had on the trained model. However, people usually employ machine unlearning methods as trusted basic tools and rarely doubt their reliability. In fact, the increasingly critical role of machine unlearning makes deep learning models susceptible to malicious attacks. To understand how deep learning models perform in malicious environments, we believe it is critical to study their robustness to malicious unlearning attacks, which occur during the unlearning process. To bridge this gap, in this paper we first demonstrate that malicious unlearning attacks pose immense threats to the security of deep learning systems. Specifically, we present a broad class of malicious unlearning attacks wherein maliciously crafted unlearning requests trigger deep learning models to misbehave on target samples in a highly controllable and predictable manner. In addition, to improve the robustness of deep learning models, we present a general defense mechanism that aims to identify, and unlearn the effects of, malicious unlearning requests based on their gradient influence on the unlearned models. We further conduct theoretical analyses of the proposed methods. Extensive experiments on real-world datasets validate both the vulnerability of deep learning models to malicious unlearning attacks and the effectiveness of the introduced defense mechanism.
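A minimal sketch, under invented assumptions about the model, loss, and threshold, of the gradient-influence screening idea behind the defense described above: unlearning requests whose gradients would move the model unusually far are flagged as potentially malicious.

```python
# Toy sketch of gradient-influence screening for unlearning requests.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()

def request_influence(x, y):
    # Norm of the loss gradient the requested sample would induce on the model.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()]).norm().item()

def flag_requests(requests, k=3.0):
    # Flag requests whose gradient influence is far above the median.
    scores = [request_influence(x, y) for x, y in requests]
    median = sorted(scores)[len(scores) // 2]
    return [i for i, s in enumerate(scores) if s > k * median]

requests = [(torch.randn(1, 10), torch.tensor([0])) for _ in range(4)]
requests.append((50.0 * torch.randn(1, 10), torch.tensor([1])))  # crafted outlier
print(flag_requests(requests))  # the crafted request typically stands out
```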
-
Next generation sequencing technologies have facilitated a shift from a few targeted loci in population genetic studies to whole genome approaches. Here, we review the types of questions and inferences regarding the population biology and evolution of parasitic helminths being addressed within the field of population genomics. Topics include parabiome, hybridization, population structure, loci under selection and linkage mapping. We highlight various advances, and note the current trends in the field, particularly a focus on human-related parasites despite the inherent biodiversity of helminth species. We conclude by advocating for a broader application of population genomics to reflect the taxonomic and life history breadth displayed by helminth parasites. As such, our basic knowledge about helminth population biology and evolution would be enhanced while the diversity of helminths in itself would facilitate population genomic comparative studies to address broader ecological and evolutionary concepts.