 NSFPAR ID:
 10417916
 Editor(s):
Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A.
 Journal Name:
 Conference on Neural Information Processing Systems (NeurIPS 2022)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation

Recursive calls over recursive data are useful for generating probability distributions, and probabilistic programming allows computations over these distributions to be expressed in a modular and intuitive way. Exact inference is also useful, but unfortunately, existing probabilistic programming languages do not perform exact inference on recursive calls over recursive data, forcing programmers to code many applications manually. We introduce a probabilistic language in which a wide variety of recursion can be expressed naturally and inference carried out exactly. For instance, probabilistic pushdown automata and their generalizations are easy to express, and polynomial-time parsing algorithms for them are derived automatically. We eliminate recursive data types using program transformations related to defunctionalization and refunctionalization. These transformations are assured correct by a linear type system, and a successful choice of transformations, if there is one, is guaranteed to be found by a greedy algorithm.
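As a toy illustration of why exact inference over unbounded recursion is delicate: a recursive program that flips a coin and either stops or conses another element defines a geometric distribution over list lengths, and its expectation is obtained by solving a fixed-point equation rather than enumerating infinitely many traces. The sketch below is hypothetical (it is not the paper's language) and checks the closed form against a truncated enumeration.

```python
from fractions import Fraction

# Recursive probabilistic program over a recursive data type (lists):
# with probability p produce Nil, otherwise Cons one element and
# recurse. The list length is Geometric(p). Exact inference solves
#   E[len] = p * 0 + (1 - p) * (1 + E[len])
# giving the closed form below instead of enumerating all traces.
def expected_length(p):
    return (1 - p) / p

# Truncated enumeration of the same distribution, for comparison:
# P(len = n) = (1 - p)^n * p.
def truncated_expectation(p, depth):
    return sum((1 - p) ** n * p * n for n in range(depth))
```

With exact rationals the fixed-point answer is exact, while any enumeration of the infinitely many execution traces can only approximate it — which is the gap exact-inference support has to close.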

We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of the state-action space and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results are either suboptimal in $T$ or require strong assumptions of ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\,sp(h^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{mix}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and that $t_{mix}$ can be arbitrarily large for certain MDPs.
On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is that it maintains estimates of value differences between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
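The discounted-approximation idea can be seen in a minimal tabular sketch. This is a hypothetical simplification, not the paper's UCB-AVG: optimistic initialization plus epsilon-greedy stands in for the UCB bonus, and the reference-advantage variance reduction is omitted entirely.

```python
import random

def q_learning_discounted(P, R, gamma=0.9, eps=0.1, steps=20000, seed=0):
    # Tabular Q-learning on a gamma-discounted approximation of an
    # average-reward MDP. Optimistic initialization (Q = 1/(1 - gamma))
    # encourages exploration of untried actions.
    rng = random.Random(seed)
    S, A = len(P), len(P[0])
    Q = [[1.0 / (1.0 - gamma)] * A for _ in range(S)]
    N = [[0] * A for _ in range(S)]
    s = 0
    for _ in range(steps):
        greedy = max(range(A), key=lambda a: Q[s][a])
        a = rng.randrange(A) if rng.random() < eps else greedy
        N[s][a] += 1
        alpha = 1.0 / N[s][a]                    # 1/n step size
        s2 = rng.choices(range(S), weights=P[s][a])[0]
        Q[s][a] += alpha * (R[s][a] + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q

# Toy 2-state MDP: in state 0, action 0 stays put for reward 1,
# action 1 moves to state 1 for reward 0; state 1 always returns.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 0.0]]
```

Pushing $\gamma$ toward 1 makes the discounted problem a better proxy for the average reward but slows learning — the approximation-versus-estimation trade-off the abstract describes.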

Abbott, Derek (Ed.)
Human vision, thought, and planning involve parsing and representing objects and scenes using structured representations based on part-whole hierarchies. Computer vision and machine learning researchers have recently sought to emulate this capability using neural networks, but a generative model formulation has been lacking. Generative models that leverage compositionality, recursion, and part-whole hierarchies are thought to underlie human concept learning and the ability to construct and represent flexible mental concepts. We introduce Recursive Neural Programs (RNPs), a neural generative model that addresses the part-whole hierarchy learning problem by modeling images as hierarchical trees of probabilistic sensory-motor programs. These programs recursively reuse learned sensory-motor primitives to model an image within different spatial reference frames, enabling hierarchical composition of objects from parts and implementing a grammar for images. We show that RNPs can learn part-whole hierarchies for a variety of image datasets, allowing rich compositionality and intuitive parts-based explanations of objects. Our model also suggests a cognitive framework for understanding how human brains can potentially learn and represent concepts in terms of recursively defined primitives and their relations with each other.
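A deterministic skeleton can make the part-whole idea concrete: each internal node places its parts at offsets within its own spatial reference frame, and each leaf draws a primitive patch. In the actual model every choice below is probabilistic and learned; the tree structure and names here are purely illustrative.

```python
# Recursively render a part-whole tree onto a grid. Offsets are
# interpreted in the parent's reference frame, so the same subtree
# can be reused at different positions.
def render(node, canvas, ox=0, oy=0):
    if "patch" in node:                          # leaf: primitive
        for dy, dx in node["patch"]:
            canvas[oy + dy][ox + dx] = 1
    else:                                        # internal: compose parts
        for (dy, dx), child in node["parts"]:
            render(child, canvas, ox + dx, oy + dy)

# The same primitive and the same part are reused at different
# positions -- the "grammar for images" idea in miniature.
dot = {"patch": [(0, 0)]}
pair = {"parts": [((0, 0), dot), ((0, 2), dot)]}
scene = {"parts": [((0, 0), pair), ((2, 0), pair)]}
```

Rendering `scene` onto a 4×4 grid lights one pixel at each corner of a square: two copies of `pair`, each itself two copies of `dot`.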

Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement Learning
Piotr Faliszewski; Viviana Mascardi (Ed.)
Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logic requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner, whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case because the infinite-horizon objective poses challenges to the convergence of discounted RL solutions. We restrict our attention to the omega-regular languages that correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average-reward objective for RL. Our reduction can be done on the fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms.
Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically on various benchmarks that our proposed method of using average-reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.
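The flavor of the translation can be sketched for the simplest absolute-liveness spec, "eventually reach the goal" (LTL: F goal): a deterministic automaton tracks progress, and reward 1 is emitted while it sits in its accepting component, so a run satisfies the spec exactly when its long-run average reward is positive. The automaton and labels below are illustrative, not the paper's construction.

```python
# Two-state automaton for "F goal": q0 waits, q1 is an accepting
# sink entered the first time 'goal' is observed. No finite prefix
# can invalidate the spec -- the automaton can always still accept.
DELTA = {("q0", False): "q0", ("q0", True): "q1",
         ("q1", False): "q1", ("q1", True): "q1"}
ACCEPTING = {"q1"}

def product_rewards(labels):
    """Run the automaton alongside an MDP trace (labels[i] says
    whether 'goal' held at step i) and emit the shaped rewards,
    an on-the-fly product in miniature."""
    q, rewards = "q0", []
    for saw_goal in labels:
        q = DELTA[(q, saw_goal)]
        rewards.append(1.0 if q in ACCEPTING else 0.0)
    return rewards
```

A ten-step trace that first sees the goal at its third step earns average reward 0.8, while a trace that never sees it earns 0 — the separation that lets an average-reward learner optimize the specification.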

Krause, Andreas (Ed.)
General function approximation is a powerful tool for handling large state and action spaces in a broad range of reinforcement learning (RL) scenarios. However, the theoretical understanding of nonstationary MDPs with general function approximation is still limited. In this paper, we make the first such attempt. We first propose a new complexity metric called the dynamic Bellman Eluder (DBE) dimension for nonstationary MDPs, which subsumes the majority of existing tractable RL problems in static MDPs as well as nonstationary MDPs. Based on the proposed complexity metric, we propose a novel confidence-set-based model-free algorithm called SW-OPEA, which features a sliding window mechanism and a new confidence set design for nonstationary MDPs. We then establish an upper bound on the dynamic regret of the proposed algorithm, and show that SW-OPEA is provably efficient as long as the variation budget is not significantly large. We further demonstrate via examples of nonstationary linear and tabular MDPs that our algorithm performs better in the small-variation-budget scenario than existing UCB-type algorithms. To the best of our knowledge, this is the first dynamic regret analysis of nonstationary MDPs with general function approximation.
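The sliding-window mechanism can be shown in isolation: keep only the most recent samples, so estimates track a drifting environment instead of averaging over stale data. This is only an illustration — the actual algorithm maintains confidence sets over value functions, not a scalar mean.

```python
from collections import deque

class SlidingWindowEstimator:
    # Mean estimator over the last `window` samples; older samples
    # fall out of the deque automatically via maxlen.
    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def update(self, x):
        self.buf.append(x)

    def mean(self):
        return sum(self.buf) / len(self.buf) if self.buf else 0.0
```

After the environment shifts from emitting 0s to emitting 1s, a window-50 estimate converges to the new mean 1.0, whereas a full-history average would remain biased at 1/3 — the reason discarding old data helps when the variation budget is small but nonzero.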