skip to main content

Title: Hoeffding's inequality for Markov chains and its applications to statistical learning
This paper establishes Hoeffding’s lemma and inequality for bounded functions of general-state-space and not necessarily reversible Markov chains. The sharpness of these results is characterized by the optimality of the ratio between variance prox- ies in the Markov-dependent and independent settings. The boundedness of functions is shown necessary for such results to hold in general. To showcase the usefulness of the new results, we apply them for non-asymptotic analyses of MCMC estima- tion, respondent-driven sampling and high-dimensional covariance matrix estimation on time series data with a Markovian nature. In addition to statistical problems, we also apply them to study the time-discounted rewards in econometric models and the multi-armed bandit problem with Markovian rewards arising from the field of machine learning.
Award ID(s):
1662139 1712591
Publication Date:
Journal Name:
Journal of machine learning research
Page Range or eLocation-ID:
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome.


    We show that finding the P-values under the typically used ‘gold’ null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool MCDP that computes the P-values under the Markovian null hypothesis in O(m2+n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations,more »respectively. Notably, MCDP runtime and memory usage are independent from the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy.

    Availability and implementation

    The software is available at All data for reproducibility are available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    « less
  2. We study ergodic properties of Markovian multiclass many-server queues that are uniform over scheduling policies and the size of the system. The system is heavily loaded in the Halfin–Whitt regime, and the scheduling policies are work conserving and preemptive. We provide a unified approach via a Lyapunov function method that establishes Foster–Lyapunov equations for both the limiting diffusion and the prelimit diffusion-scaled queuing processes simultaneously. We first study the limiting controlled diffusion and show that if the spare capacity (safety staffing) parameter is positive, the diffusion is exponentially ergodic uniformly over all stationary Markov controls, and the invariant probability measures have uniform exponential tails. This result is sharp because when there is no abandonment and the spare capacity parameter is negative, the controlled diffusion is transient under any Markov control. In addition, we show that if all the abandonment rates are positive, the invariant probability measures have sub-Gaussian tails regardless whether the spare capacity parameter is positive or negative. Using these results, we proceed to establish the corresponding ergodic properties for the diffusion-scaled queuing processes. In addition to providing a simpler proof of previous results in Gamarnik and Stolyar [Gamarnik D, Stolyar AL (2012) Multiclass multiserver queueing system in themore »Halfin-Whitt heavy traffic regime: asymptotics of the stationary distribution. Queueing Systems 71(1–2):25–51], we extend these results to multiclass models with renewal arrival processes, albeit under the assumption that the mean residual life functions are bounded. For the Markovian model with Poisson arrivals, we obtain stronger results and show that the convergence to the stationary distribution is at an exponential rate uniformly over all work-conserving stationary Markov scheduling policies.« less
  3. Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds and efficient influence functions for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on double reinforcement learning (DRL) that leverages this structure for OPE. Our DRL estimator simultaneously uses estimated stationary density ratios and q-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
  4. Cancer screening is a large, population-based intervention that would benefit from tools enabling individually-tailored decision making to decrease unintended consequences such as overdiagnosis. The heterogeneity of cancer screening participants advocates the need for more personalized approaches. Partially observable Markov decision processes (POMDPs) can be used to suggest optimal, individualized screening policies. However, determining an appropriate reward function can be challenging. Here, we propose the use of inverse reinforcement learning (IRL) to form rewards functions for lung and breast cancer screening POMDP models. Using data from the National Lung Screening Trial and our institution's breast screening registry, we developed two POMDP models with corresponding reward functions. Specifically, the maximum entropy (MaxEnt) IRL algorithm with an adaptive step size was used to learn rewards more efficiently; and combined with a multiplicative model to learn state-action pair rewards in the POMDP. The lung and breast cancer screening models were evaluated based on their ability to recommend appropriate screening decisions before the diagnosis of cancer. Results are comparable with experts' decisions. The lung POMDP demonstrated an improved performance in terms of recall and false positive rate in the second screening and post-screening stages. Precision (0.02-0.05) was comparable to experts' (0.02-0.06). The breast POMDP hasmore »excellent recall (0.97-1.00), matching the physicians and a satisfactory false positive rate (<0.03). The reward functions learned with the MaxEnt IRL algorithm, when combined with POMDP models in lung and breast cancer screening, demonstrate performance comparable to experts.« less
  5. We consider an ultra-dense wireless network with N channels and M = N devices. Messages with fresh information are generated at each device according to a random process and need to be transmitted to an access point. The value of a message decreases as it ages, so each device searches for an idle channel to transmit the message as soon as it can. However, each channel probing is associated with a fixed cost (energy), so a device needs to adapt its probing rate based on the "age" of the message. At each device, the design of the optimal probing strategy can be formulated as an infinite horizon Markov Decision Process (MDP) where the devices compete with each other to find idle channels. While it is natural to view the system as a Bayesian game, it is often intractable to analyze such a system. Thus, we use the Mean Field Game (MFG) approach to analyze the system in a large-system regime, where the number of devices is very large, to understand the structure of the problem and to find efficient probing strategies. We present an analysis based on the MFG perspective. We begin by characterizing the space of valid policies andmore »use this to show the existence of a Mean Field Nash Equilibrium (MFNE) in a constrained set for any general increasing cost functions with diminishing rewards. Further we provide an algorithm for computing the equilibrium for any given device, and the corresponding age-dependent channel probing policy.« less