CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation

Chen, Tong; Asai, Akari; Mireshghallah, Niloofar; Min, Sewon; Grimmelmann, James; Choi, Yejin; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Koh, Pang Wei

doi:10.18653/v1/2024.emnlp-main.844

We present a lightweight reencoding technique that augments propositional formulas containing implicit or explicit exactly-one constraints with auxiliary variables derived from the order encoding. Our approach is based on the observation that many formulas contain clauses where each literal appears only in that clause, and that these unique literal clauses can be replaced by the corresponding sequential counter encoding of exactly-one constraints, which introduces the same variables as the order encoding. We implemented the reencoding in the state-of-the-art SAT solver CaDiCaL with support for proof logging and solution reconstruction. Experiments on SAT Competition benchmarks demonstrate that our technique enables solving dozens of additional formulas. We found that shuffling a formula before reencoding harms performance. To mitigate this issue, we introduce a method that sorts literals within clauses based on the formula structure before applying our reencoding. The same technique also predicts whether reencoding is likely to yield improvements.

More Like this