Towards Understanding Self-play for LLM Reasoning

Chae, Justin Yang; Alam, Md Tanvirul; Rastogi, Nidhi

This content will become publicly available on December 6, 2026

Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.

Award ID(s): 2447631

PAR ID: 10646784

Author(s) / Creator(s): Chae, Justin Yang; Alam, Md Tanvirul; Rastogi, Nidhi

Publisher / Repository: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Math-AI.

Date Published: 2025-12-06

Format(s): Medium: X

Location: San Diego, California

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Proceeding:
The DOI is not currently available.
