Recent results in supervised learning suggest that while overparameterized models have the capacity to overfit, they in fact generalize quite well. We ask whether the same phenomenon occurs for offline contextual bandits. Our results are mixed. Value-based algorithms benefit from the same generalization behavior as overparameterized supervised learning, but policy-based algorithms do not. We show that this discrepancy is due to the action-stability of their objectives. An objective is action-stable if there exists a prediction (action-value vector or action distribution) which is optimal no matter which action is observed. While value-based objectives are action-stable, policy-based objectives are unstable. We formally prove upper bounds on the regret of overparameterized value-based learning and lower bounds on the regret for policy-based algorithms. In our experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
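The following is a minimal sketch, not from the paper, illustrating the action-stability distinction on a hypothetical two-action bandit with made-up rewards: the value-based squared-error objective admits one action-value vector that is optimal regardless of which action was logged, while the maximizer of a policy-based (inverse-propensity) objective changes with the observed action.

```python
import numpy as np

# Hypothetical toy setup (values are illustrative only): one context,
# K = 2 actions, true rewards, and a uniform logging policy.
true_rewards = np.array([0.6, 0.5])
mu = np.array([0.5, 0.5])  # logging propensities

def value_loss(q, observed_action):
    """Value-based objective: squared error on the logged action's reward."""
    return (q[observed_action] - true_rewards[observed_action]) ** 2

def policy_objective(pi, observed_action):
    """Policy-based (inverse-propensity) objective for the logged sample."""
    return pi[observed_action] * true_rewards[observed_action] / mu[observed_action]

# Action-stability check: is there a single prediction that is optimal
# no matter which action the logging policy happened to take?

# Value-based: the true reward vector attains zero loss for either logged action.
q_star = true_rewards.copy()
for a in (0, 1):
    assert value_loss(q_star, a) == 0.0  # optimal under both observations -> stable

# Policy-based: the maximizing distribution depends on the observed action
# (observing action 0 favors pi = [1, 0]; observing action 1 favors pi = [0, 1]),
# so no single action distribution is optimal for both -> unstable.
for a in (0, 1):
    best_pi = np.eye(2)[a]
    other_pi = np.eye(2)[1 - a]
    assert policy_objective(best_pi, a) > policy_objective(other_pi, a)
```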