Title: Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning
We propose a new method for count-based exploration in high-dimensional state spaces. Unlike previous work, which relies on density models, we show that counts can be derived by averaging samples from the Rademacher distribution (or coin flips). This insight is used to set up a simple supervised learning objective which, when optimized, yields a state’s visitation count. We show that our method is significantly more effective at deducing ground-truth visitation counts than previous work; when used as an exploration bonus for a model-free reinforcement learning algorithm, it outperforms existing approaches on most of 9 challenging exploration tasks, including the Atari game MONTEZUMA’S REVENGE.
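The statistical fact behind the coin-flip idea can be sketched numerically. If a state has been visited n times and each visit is paired with an independent fair coin flip in {-1, +1}, the squared average of those flips has expectation 1/n, so averaging that quantity over many independent coins recovers the inverse visitation count. The snippet below is only an illustration of this identity, not the authors' implementation: the function name `estimate_inverse_count` and the parameters `n_coins` and `seed` are hypothetical, and the paper's learned regression model is replaced here by a direct Monte Carlo average.

```python
import random


def estimate_inverse_count(n_visits, n_coins=20000, seed=0):
    """Monte Carlo estimate of 1 / n_visits from Rademacher (coin-flip) samples.

    Hypothetical helper for illustration only; not the paper's code.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_coins):
        # One independent "coin": flip it once per visit and average the flips.
        flips_sum = sum(rng.choice((-1, 1)) for _ in range(n_visits))
        mean = flips_sum / n_visits
        # E[mean^2] = 1 / n_visits, because the flips are independent with
        # zero mean and unit variance.
        total += mean * mean
    # Averaging over many coins reduces the variance of the estimate.
    return total / n_coins


if __name__ == "__main__":
    for n in (1, 4, 10):
        est = estimate_inverse_count(n)
        print(f"visits={n:2d}  estimate of 1/n={est:.4f}  exact={1 / n:.4f}")
```

An exploration bonus of the usual count-based form, proportional to 1/sqrt(n), could then be taken as the square root of this estimate; that choice is an assumption here, since the abstract does not specify the bonus.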
Award ID(s):
1955361 1844960
PAR ID:
10467319
Author(s) / Creator(s):
; ;
Publisher / Repository:
Proceedings of the 40th International Conference on Machine Learning
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Segata, Nicola (Ed.)
    The understanding of bacterial gene function has been greatly enhanced by recent advancements in the deep sequencing of microbial genomes. Transposon insertion sequencing methods combine next-generation sequencing techniques with transposon mutagenesis for the exploration of the essentiality of genes under different environmental conditions. We propose a model-based method that uses regularized negative binomial regression to estimate the change in transposon insertions attributable to gene-environment changes in this genetic interaction study without transformations or uniform normalization. An empirical Bayes model for estimating the local false discovery rate combines unique and total count information to test for genes that show a statistically significant change in transposon counts. When applied to RB-TnSeq (randomized barcode transposon sequencing) and Tn-seq (transposon sequencing) libraries made in strains of Caulobacter crescentus, using both total and unique count data, the model was able to identify a set of conditionally beneficial or conditionally detrimental genes for each target condition that shed light on their functions and roles during various stress conditions.
  2. Objective. Dynamic positron emission tomography (PET) imaging, which can provide information on dynamic changes in physiological metabolism, is now widely used in clinical diagnosis and cancer treatment. However, reconstruction from dynamic data is extremely challenging due to the limited counts received in individual frames, especially in ultra-short frames. Recently, unrolled model-based deep learning methods have shown inspiring results for low-count PET image reconstruction with good interpretability. Nevertheless, the existing model-based deep learning methods mainly focus on spatial correlations while ignoring the temporal domain. Approach. In this paper, inspired by the learned primal dual (LPD) algorithm, we propose the spatio-temporal primal dual network (STPDnet) for dynamic low-count PET image reconstruction. Both spatial and temporal correlations are encoded by 3D convolution operators. The physical projection of PET is embedded in the iterative learning process of the network, which provides physical constraints and enhances interpretability. Main results. Experiments on both simulation data and real rat scan data have shown that the proposed method can achieve substantial noise reduction in both temporal and spatial domains and outperform the maximum likelihood expectation maximization, spatio-temporal kernel method, LPD and FBPnet. Significance. Experimental results show that STPDnet achieves better reconstruction performance in the low-count situation, which makes the proposed method particularly suitable for whole-body dynamic imaging and parametric PET imaging, which require extremely short frames and usually suffer from high levels of noise.
  3. Across basic research studies, cell counting requires significant human time and expertise. Trained experts use thin focal plane scanning to count (click) cells in stained biological tissue. This computer-assisted process (optical disector) requires a well-trained human to select a unique best z-plane of focus for counting cells of interest. Though accurate, this approach typically requires an hour per case and is prone to inter- and intra-rater errors. Our group has previously proposed deep learning (DL)-based methods to automate these counts using cell segmentation at high magnification. Here we propose a novel You Only Look Once (YOLO) model that performs cell detection on multi-channel z-plane images (disector stack). This automated Multiple Input Multiple Output (MIMO) version of the optical disector method uses an entire z-stack of microscopy images as its input, and outputs cell detections (counts) with a bounding box for each cell and a class corresponding to the z-plane where the cell appears in best focus. Compared to the previous segmentation methods, the proposed method does not require time- and labor-intensive ground truth segmentation masks for training, while producing accuracy comparable to current segmentation-based automatic counts. The MIMO-YOLO method was evaluated on systematic-random samples of NeuN-stained tissue sections through the neocortex of mouse brains (n=7). Using a cross-validation scheme, this method showed the ability to correctly count total neuron numbers with accuracy close to that of human experts and with 100% repeatability (Test-Retest).
  4. Ruiz, Francisco and (Ed.)
    Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $$\epsilon$$-JDP algorithm with a regret of $$\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$$ which matches the information-theoretic lower bound of non-private learning for all choices of $$\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$$. In the above, $$S$$, $$A$$ denote the number of states and actions, $$H$$ denotes the planning horizon, and $$T$$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as $$T\rightarrow \infty$$. Our techniques — which could be of independent interest — include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case. 
  5. Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an ϵ-JDP algorithm with a regret of Õ(sqrt(SAH^2T) + S^2AH^3/ϵ) which matches the information-theoretic lower bound of non-private learning for all choices of ϵ > S^1.5 A^0.5 H^2/sqrt(T). In the above, S, A denote the number of states and actions, H denotes the planning horizon, and T is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as T→∞. Our techniques, which could be of independent interest, include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.