Title: Minimax Model Learning
We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective, with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach allows for greater robustness under model misspecification or under the distribution shift induced by learning or evaluating policies that are distinct from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. We further show that our loss can be used for off-policy optimization (OPO) and demonstrate its integration with recent improvements in OPO.
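A rough way to picture the idea (a generic sketch under assumptions, not the paper's exact loss) is a minimax objective in which a learned transition model \widehat{P} must be indistinguishable from the observed transitions under the worst-case test function f drawn from a chosen class \mathcal{F}:

    \min_{\widehat{P}} \; \max_{f \in \mathcal{F}} \; \Big| \, \mathbb{E}_{(s,a,s') \sim D}\big[ f(s,a,s') \big] \;-\; \mathbb{E}_{(s,a) \sim D,\ \tilde{s}' \sim \widehat{P}(\cdot \mid s,a)}\big[ f(s,a,\tilde{s}') \big] \, \Big|

Here D denotes the off-policy data distribution. The paper's actual loss is derived from the off-policy policy evaluation objective and explicitly corrects for the shift between the data-generating policy and the target policy, which this generic sketch does not capture.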
Award ID(s): 1645832
NSF-PAR ID: 10329359
Author(s) / Creator(s): ; ;
Editor(s): Banerjee, A; Fukumizu, K
Date Published:
Journal Name: 24th International Conference on Artificial Intelligence and Statistics (AISTATS)
Volume: 130
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. BACKGROUND

Electromagnetic (EM) waves underpin modern society in profound ways. They are used to carry information, enabling broadcast radio and television, mobile telecommunications, and ubiquitous access to data networks through Wi-Fi, and they form the backbone of our modern broadband internet through optical fibers. In fundamental physics, EM waves serve as an invaluable tool to probe objects from cosmic to atomic scales. For example, the Laser Interferometer Gravitational-Wave Observatory and atomic clocks, which are some of the most precise human-made instruments in the world, rely on EM waves to reach unprecedented accuracies. This has motivated decades of research to develop coherent EM sources over broad spectral ranges, with impressive results: frequencies in the range of tens of gigahertz (radio and microwave regimes) can readily be generated by electronic oscillators. Resonant tunneling diodes enable the generation of millimeter (mm) and terahertz (THz) waves, which span from tens of gigahertz to a few terahertz. At even higher frequencies, up to the petahertz level, which are usually defined as optical frequencies, coherent waves can be generated by solid-state and gas lasers. However, these approaches often suffer from narrow spectral bandwidths, because they usually rely on well-defined energy states of specific materials, which results in a rather limited spectral coverage. To overcome this limitation, nonlinear frequency-mixing strategies have been developed. These approaches shift the complexity from the EM source to nonresonant-based material effects. Particularly in the optical regime, a wealth of materials exist that support effects that are suitable for frequency mixing. Over the past two decades, the idea of manipulating these materials to form guiding structures (waveguides) has provided improvements in efficiency, miniaturization, and production scale and cost, and has been widely implemented for diverse applications.

ADVANCES

Lithium niobate, a crystal that was first grown in 1949, is a particularly attractive photonic material for frequency mixing because of its favorable material properties. Bulk lithium niobate crystals and weakly confining waveguides have been used for decades for accessing different parts of the EM spectrum, from gigahertz to petahertz frequencies. Now, this material is experiencing renewed interest owing to the commercial availability of thin-film lithium niobate (TFLN). This integrated photonic material platform enables tight mode confinement, which results in frequency-mixing efficiency improvements by orders of magnitude while at the same time offering additional degrees of freedom for engineering the optical properties by using approaches such as dispersion engineering. Importantly, the large refractive index contrast of TFLN enables, for the first time, the realization of lithium niobate–based photonic integrated circuits on a wafer scale.

OUTLOOK

The broad spectral coverage, ultralow power requirements, and flexibility of lithium niobate photonics in EM wave generation provide a large toolset to explore new device functionalities. Furthermore, the adoption of lithium niobate–integrated photonics in foundries is a promising approach to miniaturize essential bench-top optical systems using wafer-scale production. Heterogeneous integration of active materials with lithium niobate has the potential to create integrated photonic circuits with rich functionalities. Applications such as high-speed communications, scalable quantum computing, artificial intelligence and neuromorphic computing, and compact optical clocks for satellites and precision sensing are expected to particularly benefit from these advances and provide a wealth of opportunities for commercial exploration. Also, bulk crystals and weakly confining waveguides in lithium niobate are expected to keep playing a crucial role in the near future because of their advantages in high-power and loss-sensitive quantum optics applications. As such, lithium niobate photonics holds great promise for unlocking the EM spectrum and reshaping information technologies for our society in the future.

Lithium niobate spectral coverage (figure caption): The EM spectral range and processes for generating EM frequencies when using lithium niobate (LN) for frequency mixing. AO, acousto-optic; AOM, acousto-optic modulation; χ(2), second-order nonlinearity; χ(3), third-order nonlinearity; EO, electro-optic; EOM, electro-optic modulation; HHG, high-harmonic generation; IR, infrared; OFC, optical frequency comb; OPO, optical parametric oscillator; OR, optical rectification; SCG, supercontinuum generation; SHG, second-harmonic generation; UV, ultraviolet.
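As a concrete back-of-the-envelope illustration of second-order (χ(2)) frequency mixing (generic numbers, not values from the text), second-harmonic generation simply doubles the pump frequency:

    \nu_{\mathrm{SHG}} = 2\,\nu_{\mathrm{pump}}, \qquad \nu_{\mathrm{pump}} = \frac{c}{\lambda_{\mathrm{pump}}} \approx \frac{3\times 10^{8}\ \mathrm{m/s}}{1550\ \mathrm{nm}} \approx 193\ \mathrm{THz} \ \Rightarrow\ \nu_{\mathrm{SHG}} \approx 387\ \mathrm{THz}\ (\lambda \approx 775\ \mathrm{nm}),

i.e., a telecom-band pump is converted to near-infrared light at half the wavelength.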
  2. Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL can be outperformed by offline single-task RL methods on tasks with high-quality datasets, indicating that the right balance has to be delicately calibrated between "exploring" out-of-distribution state-actions by following the meta-policy and "exploiting" the offline dataset by staying close to the behavior policy. Motivated by this empirical analysis, we propose model-based offline Meta-RL with regularized policy optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using both conservative policy evaluation and regularized policy improvement; the intrinsic tradeoff therein is handled by striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learned policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring performance improvement on new tasks via offline Meta-RL. Our experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.
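One generic way to picture the "two regularizers" idea (an illustrative sketch with hypothetical weights λ1 and λ2, not MerPO's exact update) is a policy-improvement step that trades off estimated value against closeness to both the behavior policy π_b and the meta-policy π_meta:

    \max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D}}\big[ Q(s, \pi(s)) \big] \;-\; \lambda_{1}\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{b}\big) \;-\; \lambda_{2}\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{meta}}\big)

Shifting weight between the two penalties moves the learned policy between staying close to the offline data (exploitation) and following the meta-policy into out-of-distribution state-actions (exploration), which is the balance the abstract refers to.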
  3. Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what representational and distributional conditions are necessary to permit provably sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if (i) we have realizability, in that the true value function of \emph{every} policy is linear in a given set of features, and (ii) our off-policy data has good coverage over all features (under a strong spectral condition), any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon to non-trivially estimate the value of \emph{any} given policy. Our results highlight that sample-efficient offline policy evaluation is not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
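For intuition, conditions (i) and (ii) are often formalized roughly as follows (generic notation, not necessarily the paper's exact statement):

    \text{(i) realizability: } Q^{\pi}(s,a) = \phi(s,a)^{\top}\theta_{\pi} \ \text{ for every policy } \pi, \qquad \text{(ii) coverage: } \sigma_{\min}\Big( \mathbb{E}_{(s,a) \sim \mu}\big[ \phi(s,a)\,\phi(s,a)^{\top} \big] \Big) \ \geq\ \kappa > 0,

where \phi is the given feature map, \mu is the offline data distribution, and \kappa is a coverage constant. The lower bound says that even under both conditions, estimating the value of a policy to nontrivial accuracy can require a number of samples exponential in the horizon.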
  4. Why the new findings matter

    The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.

    Implications for educational researchers and policy makers

    The progressing digitisation of societies around the globe and the impact of the SARS-CoV-2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have shown the necessity of providing effective educational processes that can support sometimes-overwhelmed teachers in imparting knowledge digitally, a goal that is on the agenda of many governments and policy makers. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for the individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi-)automated assessments, algorithmic grading, personalised feedback and adaptive learning approaches. However, these promises are strongly tied to at least a basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences.

    Demonstrating both the promises and the challenges inherent in the collection and analysis of large educational datasets with machine learning, this paper covers the essential topics that their application requires and provides easy-to-follow resources and code to facilitate the process of adoption.

     
  5. We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality for our policy optimization procedure in terms of a trade-off between the value and the uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from on-policy to off-policy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold.
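As a rough rendering of the principle (a generic sketch, not the paper's exact formulation), enforcing the Bellman equations only against a test-function class \mathcal{F} replaces the exact fixed-point condition with a family of moment conditions on the offline data:

    \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[ f(s,a)\,\big( r + \gamma\, Q(s', \pi(s')) - Q(s,a) \big) \Big] \;=\; 0 \qquad \text{for all } f \in \mathcal{F}.

Larger test-function classes enforce the Bellman equations more strictly, while smaller classes give weaker constraints that are easier to certify from off-policy data; this is the sense in which different choices of \mathcal{F} address different problems within one framework.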