

Search results: all records where Creators/Authors contains "Xie, T"


  1. Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iterations, we consider the problem of off-policy evaluation (OPE) — the problem of evaluating a new policy using the historical data obtained by different behavior policies — under the model of nonstationary episodic Markov Decision Processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon H. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of

     (1/n) Σ_{t=1}^{H} E_μ[ (d_t^π(s_t)² / d_t^μ(s_t)²) · Var_μ[ (π_t(a_t|s_t) / μ_t(a_t|s_t)) (V_{t+1}^π(s_{t+1}) + r_t) | s_t ] ] + Õ(n^{-1.5}),

     where μ and π are the logging and target policies, d_t^μ(s_t) and d_t^π(s_t) are the marginal distributions of the state at the t-th step, H is the horizon, n is the sample size, and V_{t+1}^π is the value function of the MDP under π. The result matches the Cramér-Rao lower bound in Jiang and Li [2016] up to a multiplicative factor of H. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on H. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments. (A minimal numerical sketch of the MIS recursion appears after this list.)
  2. We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret and whose local switching cost in K episodes is O(H³SA log K); we also provide a lower bound of Ω(HSA) on the local switching cost of any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects. (A toy sketch of the delayed-switching idea appears after this list.)
  3. In integrated photonics, specific wavelengths such as 1,550 nm are preferred due to low-loss transmission and the availability of optical gain in this spectral region. For chip-based photodetectors, two-dimensional materials bear scientifically and technologically relevant properties such as electrostatic tunability and strong light–matter interactions. However, no efficient photodetector in the telecommunication C-band has been realized with two-dimensional transition metal dichalcogenide materials due to their large optical bandgaps. Here we demonstrate a MoTe₂-based photodetector featuring a strong photoresponse (responsivity 0.5 A W⁻¹) operating at 1,550 nm in silicon photonics, enabled by strain engineering the two-dimensional material. Non-planarized waveguide structures show a bandgap modulation of 0.2 eV, resulting in a large photoresponse in an otherwise photoinactive medium when unstrained. Unlike graphene-based photodetectors that rely on a gapless band structure, this photodetector shows an approximately 100-fold reduction in dark current, enabling an efficient noise-equivalent power of 90 pW Hz⁻⁰·⁵. Such a strain-engineered integrated photodetector provides new opportunities for integrated optoelectronic systems.
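To make item 1 concrete, here is a minimal tabular sketch of the MIS recursion, assuming episodes are logged as (state, action, reward) triples and that both policies are known as arrays; the function name mis_estimate, the array shapes, and the data layout are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def mis_estimate(episodes, pi, mu, S, H):
    """Marginalized importance sampling (MIS) value estimate (sketch).

    episodes: array of shape (n, H, 3) holding (s_t, a_t, r_t) per step
    pi, mu:   arrays of shape (H, S, A); target / behavior policies
    S, H:     number of states, horizon
    """
    n = len(episodes)

    # Empirical state marginals d_t^mu under the behavior policy.
    d_mu = np.zeros((H, S))
    for ep in episodes:
        for t in range(H):
            d_mu[t, int(ep[t, 0])] += 1.0 / n

    # Recursively propagate the estimated target-policy marginal d_t^pi,
    # reweighting each transition by the state-marginal ratio times the
    # one-step action ratio. The initial distribution is policy-independent.
    d_pi = np.zeros((H, S))
    d_pi[0] = d_mu[0]
    value = 0.0
    for t in range(H):
        next_d = np.zeros(S)
        for ep in episodes:
            s, a, r = int(ep[t, 0]), int(ep[t, 1]), ep[t, 2]
            w = (d_pi[t, s] / max(d_mu[t, s], 1e-12)) * (pi[t, s, a] / mu[t, s, a])
            value += w * r / n
            if t + 1 < H:
                next_d[int(ep[t + 1, 0])] += w / n
        if t + 1 < H:
            d_pi[t + 1] = next_d
    return value
```

Note that, unlike vanilla IS, the weight at step t depends on the estimated marginal ratio d_t^π(s)/d_t^μ(s) rather than the product of all previous action ratios, which is what avoids the exponential-in-H variance.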
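For item 2, a toy sketch of the delayed-switching idea: online Q-learning acts from a frozen copy of Q that is refreshed only when some visit count crosses a doubling threshold, so policy switches grow logarithmically in K rather than linearly. The paper's UCB2 schedule uses a finer sequence of trigger counts and a specific bonus; the learning rate below follows Jin et al. [2018], while the env interface, the bonus constant c, the doubling trigger, and rewards in [0, 1] are simplifying assumptions.

```python
import numpy as np

def low_switching_q_learning(env, S, A, H, K):
    """Episodic Q-learning with delayed policy switches (UCB2-style sketch).

    env is a hypothetical interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done).
    """
    Q = np.full((H, S, A), float(H))      # optimistic init (rewards in [0, 1])
    Q_act = Q.copy()                      # frozen copy used for acting
    N = np.zeros((H, S, A))
    switch_at = np.ones((H, S, A))        # next visit count that triggers a switch
    switches = 0
    c = 1.0                               # exploration-bonus scale (illustrative)

    for k in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q_act[h, s]))
            s2, r, _ = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)     # step size from Jin et al. [2018]
            bonus = c * np.sqrt(H**3 * np.log(K + 1) / t)
            target = r + (np.max(Q[h + 1, s2]) if h + 1 < H else 0.0)
            Q[h, s, a] = min(float(H),
                             (1 - alpha) * Q[h, s, a] + alpha * (target + bonus))
            if t >= switch_at[h, s, a]:   # doubling schedule: refresh acting policy
                switch_at[h, s, a] = 2 * t
                Q_act = Q.copy()
                switches += 1
            s = s2
    return Q, switches
```

Since each (h, s, a) triple can trigger at most log₂ K refreshes, the total number of policy switches is O(HSA log K), which is the mechanism behind the logarithmic-in-K switching cost the abstract describes.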