Free, publicly accessible full text available December 10, 2024

Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iterations, we consider the problem of off-policy evaluation (OPE), that is, evaluating a new policy using historical data obtained by different behavior policies, under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of [ ] where $\mu$ and $\pi$ are the logging and target policies, $d_t^\mu(s_t)$ and $d_t^\pi(s_t)$ are the marginal distributions of the state at the $t$-th step, $H$ is the horizon, $n$ is the sample size, and $V^\pi$ is the value function of the MDP under $\pi$. The result matches the Cramer-Rao lower bound in Jiang and Li [2016] up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
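
To make the recursive idea concrete, here is a minimal sketch of a tabular marginalized importance sampling estimate. It is an illustration under stated assumptions, not the authors' implementation: states and actions are small integer indices, the logging policy `mu` is known, and every name (`mis_estimate`, `episodes`, the array shapes) is hypothetical. At each step it reweights only the single-step action probabilities, then pushes the estimated state marginal forward through the reweighted empirical transitions.

```python
import numpy as np

def mis_estimate(episodes, pi, mu, n_states, horizon):
    """Tabular MIS value estimate (illustrative sketch).

    episodes: list of trajectories, each a list of `horizon` tuples
              (state, action, reward, next_state) logged under mu.
    pi, mu:   arrays of shape (horizon, n_states, n_actions) holding the
              target and logging action probabilities (assumed known).
    """
    n = len(episodes)

    # Estimated marginal state distribution under pi at the first step:
    # the initial state does not depend on the policy, so use frequencies.
    d = np.zeros(n_states)
    for ep in episodes:
        d[ep[0][0]] += 1.0 / n

    value = 0.0
    for t in range(horizon):
        counts = np.zeros(n_states)             # visits to each state at step t
        trans = np.zeros((n_states, n_states))  # reweighted transition counts
        rew = np.zeros(n_states)                # reweighted reward sums
        for ep in episodes:
            s, a, r, s_next = ep[t]
            rho = pi[t, s, a] / mu[t, s, a]     # single-step importance weight
            counts[s] += 1
            trans[s, s_next] += rho
            rew[s] += rho * r

        # Normalize by visit counts; unvisited states keep zero estimates.
        seen = counts > 0
        trans[seen] /= counts[seen, None]
        rew[seen] /= counts[seen]

        value += d @ rew   # expected reward at step t under the estimated marginal
        d = d @ trans      # recursive update of the state marginal
    return value
```

One sanity check on the sketch: when `pi` and `mu` coincide, every weight is 1 and the estimate reduces to the plain average return of the logged episodes.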

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is O(H^3 SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.
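
The abstract does not spell out how local switching cost is counted. The sketch below assumes the usual definition: for each pair of consecutive episodes, count the (step, state) pairs on which the deployed deterministic policy changes, then sum over episodes. The function name and the nested-list policy representation are hypothetical and chosen only for illustration.

```python
def local_switching_cost(policies):
    """Sum, over consecutive episodes, of the number of (step, state)
    pairs where the deployed action differs (assumed definition).

    policies: list of K deterministic policies; policies[k][h][s] is the
              action taken at step h in state s during episode k.
    """
    total = 0
    for prev, nxt in zip(policies, policies[1:]):
        total += sum(
            1
            for h in range(len(prev))
            for s in range(len(prev[h]))
            if prev[h][s] != nxt[h][s]
        )
    return total
```

For example, with H = 2 steps and 2 states, the sequence `[[0, 0], [0, 0]]`, `[[0, 1], [0, 0]]`, `[[0, 1], [1, 0]]` changes one (step, state) entry at each update, for a cost of 2.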

In integrated photonics, specific wavelengths such as 1,550 nm are preferred due to low-loss transmission and the availability of optical gain in this spectral region. For chip-based photodetectors, two-dimensional materials bear scientifically and technologically relevant properties such as electrostatic tunability and strong light–matter interactions. However, no efficient photodetector in the telecommunication C-band has been realized with two-dimensional transition metal dichalcogenide materials due to their large optical bandgaps. Here we demonstrate a MoTe2-based photodetector featuring a strong photoresponse (responsivity 0.5 A W^-1) operating at 1,550 nm in silicon photonics enabled by strain engineering the two-dimensional material. Non-planarized waveguide structures show a bandgap modulation of 0.2 eV, resulting in a large photoresponse in an otherwise photoinactive medium when unstrained. Unlike graphene-based photodetectors that rely on a gapless band structure, this photodetector shows an approximately 100-fold reduction in dark current, enabling an efficient noise-equivalent power of 90 pW Hz^-0.5. Such a strain-engineered integrated photodetector provides new opportunities for integrated optoelectronic systems.