Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are commonly used estimators in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which the learned policy may essentially avoid making decisions aligned with the logged data for part of the initial state space. We propose an algorithm that avoids this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification for it. We also show the limitations of previous attempts at this approach. We test our algorithm on a healthcare-inspired simulator, on a logged dataset collected from real hospitals, and on continuous control tasks. These experiments show that the proposed method overfits less and achieves better test performance than state-of-the-art batch reinforcement learning algorithms.
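The overfitting phenomenon can be illustrated with a minimal one-step (contextual-bandit) sketch. Everything in it is hypothetical: the four logged tuples, the four-action space with a uniform behavior policy, and both target policies. It computes the ordinary importance-sampling (IS) estimate, the weighted (self-normalized) IS estimate, and the per-state average importance weight. Because E over a ~ b(.|s) of pi(a|s)/b(a|s) equals sum over a of pi(a|s) = 1 whenever b covers pi, a per-state average far from 1 signals that a policy has effectively dropped states from the estimate. This is only a diagnostic in the spirit of per-state normalization, not the paper's algorithm or its neighborhood-based constraint.

```python
from collections import defaultdict

# Hypothetical logged data: (state, action, reward, behavior probability).
# The behavior policy is assumed uniform over four actions a0..a3, so b(a|s) = 0.25.
LOG = [
    ("s0", "a0", 1.0, 0.25),
    ("s0", "a1", 0.0, 0.25),
    ("s1", "a0", -1.0, 0.25),
    ("s1", "a1", -1.0, 0.25),
]

def importance_weights(data, target_prob):
    """Per-sample ratio pi(a|s) / b(a|s)."""
    return [target_prob(s, a) / b for (s, a, _, b) in data]

def is_estimate(data, weights):
    """Ordinary importance sampling: mean of w * r."""
    return sum(w * r for w, (_, _, r, _) in zip(weights, data)) / len(data)

def wis_estimate(data, weights):
    """Weighted (self-normalized) importance sampling: sum(w * r) / sum(w)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * r for w, (_, _, r, _) in zip(weights, data)) / total

def per_state_mean_weight(data, weights):
    """Average importance weight per logged state (equal to 1 in expectation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for w, (s, _, _, _) in zip(weights, data):
        sums[s] += w
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

def behavior_like(s, a):
    return 0.25  # hypothetical target policy identical to the behavior policy

def overfit(s, a):
    # Hypothetical policy: in s0 pick the logged high-reward action; in s1 pick a3,
    # which never appears in the log, so every s1 sample gets weight 0.
    best = "a0" if s == "s0" else "a3"
    return 1.0 if a == best else 0.0

for name, policy in [("behavior_like", behavior_like), ("overfit", overfit)]:
    w = importance_weights(LOG, policy)
    print(name, is_estimate(LOG, w), wis_estimate(LOG, w), per_state_mean_weight(LOG, w))
```

With these numbers both IS and WIS rate `overfit` at 1.0 versus -0.25 for `behavior_like`, even though the log contains no evidence about what `overfit` does in s1; its per-state mean weights (2.0 in s0, 0.0 in s1) expose that s1 has silently dropped out of the estimate.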
Optimization’s Neglected Normative Commitments
Optimization is offered as an objective approach to resolving complex, real-world decisions involving uncertainty and conflicting interests. It drives business strategies as well as public policies and, increasingly, lies at the heart of sophisticated machine learning systems. A paradigm used to approach potentially high-stakes decisions, optimization relies on abstracting the real world into a set of decision(s), objective(s) and constraint(s). Drawing from the modeling process and a range of actual cases, this paper describes the normative choices and assumptions that are necessarily part of using optimization. It then identifies six emergent problems that may be neglected: 1) Misspecified values can yield optimizations that omit certain imperatives altogether or incorporate them incorrectly as a constraint or as part of the objective, 2) Problematic decision boundaries can lead to faulty modularity assumptions and feedback loops, 3) Failing to account for multiple agents’ divergent goals and decisions can lead to policies that serve only certain narrow interests, 4) Mislabeling and mismeasurement can introduce bias and imprecision, 5) Faulty use of relaxation and approximation methods, unaccompanied by formal characterizations and guarantees, can severely impede applicability, and 6) Treating optimization as a justification for action, without specifying the necessary contextual information, can lead to ethically dubious or faulty decisions. Suggestions are given to further understand and curb the harms that can arise when optimization is used wrongfully.
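Problem 1 can be made concrete with a deliberately tiny, hypothetical example; the budget, per-unit benefits, minimum share, and penalty weights are invented for illustration and do not come from the paper. The same imperative ("each group receives at least three units") is omitted, imposed as a hard constraint, or folded into the objective as a weighted penalty, and a brute-force search over all splits shows that these modeling choices lead to different decisions.

```python
# Hypothetical toy problem: split an integer budget between groups A and B.
BUDGET = 10
BENEFIT_A, BENEFIT_B = 3, 1   # assumed benefit per unit allocated to each group
MIN_SHARE = 3                 # the imperative: each group gets at least 3 units

def benefit(a, b):
    return BENEFIT_A * a + BENEFIT_B * b

def shortfall(a, b):
    """Total units by which the allocation misses the minimum share."""
    return max(0, MIN_SHARE - a) + max(0, MIN_SHARE - b)

def best_allocation(score, feasible=lambda a, b: True):
    """Brute-force every split a + b == BUDGET and return the best feasible one."""
    splits = [(a, BUDGET - a) for a in range(BUDGET + 1)]
    return max((s for s in splits if feasible(*s)), key=lambda s: score(*s))

def penalized(weight):
    """Imperative folded into the objective as a weighted shortfall penalty."""
    return lambda a, b: benefit(a, b) - weight * shortfall(a, b)

omitted = best_allocation(benefit)                                       # imperative left out
as_constraint = best_allocation(benefit, lambda a, b: shortfall(a, b) == 0)  # hard constraint
weak_penalty = best_allocation(penalized(1))                             # part of objective, low weight
strong_penalty = best_allocation(penalized(5))                           # part of objective, high weight
print(omitted, as_constraint, weak_penalty, strong_penalty)
```

With these numbers, omitting the imperative or penalizing it weakly both give everything to group A, (10, 0), while the hard constraint and the stronger penalty give (7, 3): where a value enters the formulation, and with what weight, is itself a normative choice.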
- Award ID(s): 1704527
- PAR ID: 10437744
- Journal Name: ACM Conference on Fairness, Accountability, and Transparency (FAccT)
- Page Range / eLocation ID: 50 to 63
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
A significant body of research in the data sciences considers unfair discrimination against social categories such as race or gender that could occur or be amplified as a result of algorithmic decisions. Simultaneously, real-world disparities continue to exist, even before algorithmic decisions are made. In this work, we draw on insights from the social sciences brought into the realm of causal modeling and constrained optimization, and develop a novel algorithmic framework for tackling pre-existing real-world disparities. The purpose of our framework, which we call the “impact remediation framework,” is to measure real-world disparities and discover the optimal intervention policies that could help improve equity or access to opportunity for those who are underserved with respect to an outcome of interest. We develop a disaggregated approach to tackling pre-existing disparities that relaxes the typical set of assumptions required for the use of social categories in structural causal models. Our approach flexibly incorporates counterfactuals and is compatible with various ontological assumptions about the nature of social categories. We demonstrate impact remediation with a hypothetical case study and compare our disaggregated approach with an existing state-of-the-art approach in terms of structure and resulting policy recommendations. In contrast to most work on optimal policy learning, we explore disparity reduction itself as an objective, explicitly focusing the power of algorithms on reducing inequality.
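Treating disparity reduction itself as the objective can be sketched as a brute-force search over intervention sets. This is a hypothetical toy, not the paper's disaggregated structural-causal-model approach: the baseline outcome scores and per-unit intervention effects are invented and simply assumed to be known counterfactual effects, and disparity is measured here as the gap between the best-off and worst-off units. A conventional "maximize total gain" policy is included for contrast.

```python
from itertools import combinations

# Hypothetical data: a baseline outcome score per unit (e.g. per neighborhood)
# and the assumed counterfactual effect of intervening on that unit.
BASELINE = [0.9, 0.5, 0.4, 0.8]
EFFECT = [0.4, 0.3, 0.2, 0.1]

def disparity(outcomes):
    """Gap between the best-off and worst-off units."""
    return max(outcomes) - min(outcomes)

def outcomes_after(chosen):
    """Outcome scores after intervening on the units in `chosen`."""
    return [y + (EFFECT[i] if i in chosen else 0.0) for i, y in enumerate(BASELINE)]

def min_disparity_policy(budget):
    """Choose `budget` units to treat so that the post-intervention gap is smallest."""
    return min(combinations(range(len(BASELINE)), budget),
               key=lambda chosen: disparity(outcomes_after(chosen)))

def max_total_policy(budget):
    """Conventional alternative: maximize the total gain, ignoring disparity."""
    return max(combinations(range(len(BASELINE)), budget),
               key=lambda chosen: sum(EFFECT[i] for i in chosen))

equity = min_disparity_policy(2)
utility = max_total_policy(2)
print(equity, disparity(outcomes_after(equity)))
print(utility, disparity(outcomes_after(utility)))
```

Here the disparity objective treats units 1 and 2 (the worst-off), while maximizing total gain treats units 0 and 1 and leaves a much larger gap; exhaustive search is only viable for toy sizes, which is one reason a real framework would use a constrained-optimization solver instead.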
Many real-life scenarios require humans to make difficult trade-offs: do we always follow all the traffic rules, or do we violate the speed limit in an emergency? In general, how should we account for and balance ethical values, safety recommendations, and societal norms when we are trying to achieve a certain objective? To enable effective AI-human collaboration, we must equip AI agents with a model of how humans make such trade-offs in environments where there is not only a goal to be reached, but also ethical constraints to consider and possibly align with. These ethical constraints could be deontological rules on actions that should not be performed, or consequentialist policies that recommend avoiding certain states of the world. Our purpose is to build AI agents that can mimic human behavior in these ethically constrained decision environments, with the long-term research goal of using AI to help humans make better moral judgments and take better actions. To this end, we propose a computational approach where competing objectives and ethical constraints are orchestrated through a method that leverages a cognitive model of human decision making called multi-alternative decision field theory (MDFT). Using MDFT, we build an orchestrator, called MDFT-Orchestrator (MDFT-O), that is both general and flexible. We also show experimentally that MDFT-O not only generates better decisions than a heuristic that takes a weighted average of competing policies (WA-O), but also better mimics human decisions as collected through Amazon Mechanical Turk (AMT). Our methodology is therefore able to faithfully model human decisions in ethically constrained decision environments.
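The MDFT dynamics behind an orchestrator of this kind can be sketched in a few lines. This is a simplified MDFT: at each step attention falls stochastically on one attribute, the attended values pass through a contrast step, and the preference state is updated through a feedback matrix with self-decay and uniform lateral inhibition (the original model uses distance-dependent inhibition), for a fixed number of deliberation steps. Treating each competing policy (goal-seeking vs. ethical constraint) as an attribute and each action as an alternative is an assumption about how such an orchestrator could be wired, not a description of MDFT-O; the value matrix, attention probabilities, and decay and inhibition parameters are all hypothetical.

```python
import random

def mdft_choice(M, attention, steps=200, decay=0.05, inhibition=0.01, seed=0):
    """Simplified MDFT. M[i][j] is alternative i's value on attribute j;
    attention[j] is the probability of attending to attribute j at each step."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    P = [0.0] * n                                        # preference state
    for _ in range(steps):
        j = rng.choices(range(m), weights=attention)[0]  # attend to one attribute
        column = [M[i][j] for i in range(n)]
        mean = sum(column) / n
        V = [v - mean for v in column]                   # contrast: C = I - (1/n) * ones
        # Feedback: self-decay plus uniform lateral inhibition from the other alternatives.
        total = sum(P)
        P = [(1 - decay) * P[i] - inhibition * (total - P[i]) + V[i] for i in range(n)]
    return max(range(n), key=lambda i: P[i]), P

def weighted_average_choice(M, weights):
    """WA-style baseline: pick the alternative with the best weighted attribute sum."""
    return max(range(len(M)), key=lambda i: sum(w * v for w, v in zip(weights, M[i])))

# Hypothetical example: three actions scored by a goal-seeking policy (attribute 0)
# and by an ethical-constraint policy (attribute 1).
M = [[1.0, 0.0], [0.6, 0.8], [0.2, 1.0]]
print(mdft_choice(M, attention=[0.5, 0.5])[0], weighted_average_choice(M, [0.5, 0.5]))
```

When attention is concentrated on a single attribute, the choice is that attribute's favorite (action 0 for the goal policy, action 2 for the ethical policy in this matrix); with mixed attention the result depends on the sampled attention sequence, unlike the deterministic weighted-average baseline.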
Feng, B.; Pedrielli, G.; Peng, Y.; Shashaani, S.; Song, E.; Corlu, C.; Lee, L.; Chew, E.; Roeder, T.; Lendermann, P. (Eds.)
Ranking & selection (R&S) procedures are simulation-optimization algorithms for making one-time decisions among a finite set of alternative system designs or feasible solutions with a statistical assurance of a good selection. R&S with covariates (R&S+C) extends the paradigm to allow the optimal selection to depend on contextual information that is obtained just before a decision is needed. The dominant approach for solving such problems is to employ offline simulation to create metamodels that predict the performance of each system or feasible solution as a function of the covariate. This paper introduces a fundamentally different approach that solves individual R&S problems offline for various values of the covariate, and then treats the real-time decision as a classification problem: given the covariate information, which system is a good solution? Our approach exploits the availability of efficient R&S procedures, requires milder assumptions than the metamodeling paradigm to provide strong guarantees, and can be more efficient.
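The "solve R&S offline, then classify online" idea can be sketched as follows. Several pieces are stand-ins chosen for brevity rather than the paper's procedures: the two-system simulation model and its noise level are hypothetical, the per-covariate selection step is naive equal allocation (pick the largest sample mean), which carries no statistical guarantee, and the classifier is 1-nearest-neighbor on a grid of covariate values.

```python
import random

def simulate(system, x, rng):
    """Hypothetical simulation output: system 0 is better for small x, system 1 for large x."""
    means = [1.0 - x, x]
    return means[system] + rng.gauss(0.0, 0.1)

def select_best(x, n_systems, reps, rng):
    """Naive equal-allocation selection (largest sample mean); a stand-in for a real R&S procedure."""
    averages = [sum(simulate(k, x, rng) for _ in range(reps)) / reps for k in range(n_systems)]
    return max(range(n_systems), key=lambda k: averages[k])

def offline_phase(grid, n_systems=2, reps=50, seed=1):
    """Solve one selection problem per covariate value; returns (covariate, label) pairs."""
    rng = random.Random(seed)
    return [(x, select_best(x, n_systems, reps, rng)) for x in grid]

def online_decision(x, training):
    """Real-time step as classification: 1-nearest-neighbor on the covariate."""
    return min(training, key=lambda pair: abs(pair[0] - x))[1]

training = offline_phase([i / 10 for i in range(11)])
print(online_decision(0.05, training), online_decision(0.95, training))
```

All simulation effort is spent offline, so the real-time decision is a cheap lookup. Near x = 0.5 the two systems are nearly tied, which is exactly where a statistically valid R&S procedure, rather than this naive one, would matter.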
In this paper, we study reinforcement learning (RL) algorithms to solve real-world decision problems with the objective of maximizing the long-term reward while satisfying cumulative constraints. We propose a novel first-order policy optimization method, Interior-point Policy Optimization (IPO), which augments the objective with logarithmic barrier functions, inspired by the interior-point method. Our proposed method is easy to implement, comes with performance guarantees, and can handle general types of cumulative multi-constraint settings. We conduct extensive evaluations to compare our approach with state-of-the-art baselines. Our algorithm outperforms the baseline algorithms in terms of both reward maximization and constraint satisfaction.
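The log-barrier idea behind IPO can be illustrated on a one-dimensional toy problem. This is not IPO itself, which adds barrier terms on estimated cumulative costs to a policy-gradient objective; here the "reward" is J(θ) = θ, the "cost" is C(θ) = θ², the limit is d = 1, and the barrier parameter t and step size are hypothetical. Gradient ascent on J(θ) + (1/t)·log(d - C(θ)) stays strictly feasible, and setting the gradient to zero gives θ² + (2/t)θ - d = 0, so the barrier optimum θ* = -1/t + sqrt(1/t² + d) approaches the constraint boundary as t grows.

```python
import math

def barrier_gradient(theta, t, d=1.0):
    """Derivative of theta + (1/t) * log(d - theta**2)."""
    return 1.0 - (2.0 * theta / t) / (d - theta ** 2)

def maximize_with_barrier(t, d=1.0, lr=0.01, steps=2000):
    theta = 0.0  # strictly feasible start: C(0) = 0 < d
    for _ in range(steps):
        step = lr * barrier_gradient(theta, t, d)
        while d - (theta + step) ** 2 <= 0:  # halve the step if it would leave the feasible set
            step /= 2
        theta += step
    return theta

def analytic_optimum(t, d=1.0):
    """Positive root of theta**2 + (2/t) * theta - d = 0 (zero-gradient condition)."""
    return -1.0 / t + math.sqrt(1.0 / t ** 2 + d)

for t in (5, 20, 50):
    print(t, maximize_with_barrier(t), analytic_optimum(t))
```

A larger t weakens the barrier, so the solution moves closer to the constraint boundary (θ = 1) while never crossing it; this trade-off between constraint satisfaction and reward is the knob an interior-point method tunes.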

