VIME: Variational Information Maximizing Exploration


VIME: Variational Information Maximizing Exploration
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel
Reviewed by Zhao Song, January 13, 2017

1 Exploration in Reinforcement Learning
The exploration-exploitation dilemma:
- Exploration: the agent experiments with novel strategies that may improve returns in the long run.
- Exploitation: the agent maximizes rewards through behavior that is known to be successful.
This paper focuses on exploration.

2 Exploration in Reinforcement Learning (cont.)
Typical exploration strategies:
- Bayesian RL and PAC-MDP methods: provide theoretical guarantees, but do not scale to continuous control.
- Heuristic exploration: data inefficient.

3 Notations
Finite-horizon discounted Markov decision process (MDP):
- $\mathcal{S} \subseteq \mathbb{R}^n$: the state set
- $\mathcal{A} \subseteq \mathbb{R}^m$: the action set
- $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$: the transition probability
- $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: the reward function
- $\gamma \in (0, 1]$: the discount factor
- $T$: the horizon
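For concreteness, a minimal container for this tuple might look as follows; this is a sketch of my own, and the field names do not come from the slides.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MDP:
    """Finite-horizon discounted MDP (S, A, P, r, gamma, T); names are illustrative."""
    state_dim: int                                       # S is a subset of R^n
    action_dim: int                                      # A is a subset of R^m
    transition: Callable[[np.ndarray, np.ndarray, np.ndarray], float]  # density P(s' | s, a)
    reward: Callable[[np.ndarray, np.ndarray], float]    # r(s, a)
    gamma: float = 0.99                                  # discount factor in (0, 1]
    horizon: int = 500                                   # T
```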

4 Curiosity-driven Exploration
Basic idea: seek out state-action regions that are relatively unexplored [Schmidhuber 1991].
- Environment dynamics modeled as $p(s_{t+1} \mid s_t, a_t; \theta)$.
- Take actions that maximize the reduction in uncertainty about the dynamics:
$$\sum_t \big[ H(\Theta \mid \xi_t, a_t) - H(\Theta \mid s_{t+1}, \xi_t, a_t) \big],$$
where $\xi_t = \{s_1, a_1, \ldots, s_t\}$ is the history up to time $t$.
- Interpretation via the mutual information between $s_{t+1}$ and $\Theta$:
$$I(s_{t+1}; \Theta \mid \xi_t, a_t) = \mathbb{E}_{s_{t+1} \sim \mathcal{P}(\cdot \mid \xi_t, a_t)} \Big[ \underbrace{D_{\mathrm{KL}}\big[ p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \big]}_{\text{information gain}} \Big].$$
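As a sanity check on this identity, the toy example below (my own construction, not from the slides) uses a discrete dynamics model with two candidate parameter values and a binary next state, and verifies that the expected KL from posterior to prior is exactly the mutual information.

```python
import numpy as np

# Toy model: two candidate dynamics parameters with a uniform prior,
# and a binary next state s'.  All numbers are purely illustrative.
prior = np.array([0.5, 0.5])           # p(theta | xi_t)
lik = np.array([[0.9, 0.1],            # p(s' | xi_t, a_t; theta_1)
                [0.4, 0.6]])           # p(s' | xi_t, a_t; theta_2)

marginal = prior @ lik                 # p(s' | xi_t, a_t)
info_gain = 0.0
for s_next in range(2):
    posterior = prior * lik[:, s_next] / marginal[s_next]   # Bayes' rule
    kl = np.sum(posterior * np.log(posterior / prior))      # D_KL[posterior || prior]
    info_gain += marginal[s_next] * kl                       # expectation over s'
print(info_gain)  # equals I(s'; Theta | xi_t, a_t) for this toy model
```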

5 Curiosity-driven Exploration (cont.)
- Explicitly maximizing the mutual information is intractable.
- An approximate approach within RL:
  - Take action $a_t \sim \pi_\alpha(s_t)$.
  - Sample $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$.
  - Obtain the new reward
$$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \eta\, D_{\mathrm{KL}}\big[ p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \big],$$
where $\eta \in \mathbb{R}_+$ is a hyperparameter.

6 The VIME Algorithm
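The algorithm box on this slide did not survive the transcription. As a rough sketch of the loop implied by slides 4-9 (not the authors' reference implementation; env, policy, bnn, and trpo_update are hypothetical helpers), one iteration interleaves acting, reward augmentation, a dynamics-model update, and a TRPO policy update:

```python
def vime_iteration(env, policy, bnn, trpo_update, eta=0.1, horizon=500):
    """One VIME-style iteration; all helper objects are hypothetical stand-ins."""
    s = env.reset()
    transitions = []
    for t in range(horizon):
        a = policy.sample_action(s)
        s_next, r, done = env.step(a)
        # Intrinsic bonus: KL between the BNN weight posterior after and before
        # assimilating (s, a, s_next), i.e. the approximate information gain.
        info_gain = bnn.kl_after_update(s, a, s_next)
        transitions.append((s, a, s_next, r + eta * info_gain))
        s = env.reset() if done else s_next
    bnn.fit([(s, a, s_n) for s, a, s_n, _ in transitions])  # update the dynamics model
    policy = trpo_update(policy, transitions)                # update the policy on augmented rewards
    return policy
```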

7 Variational Bayes
Posterior via Bayes' rule:
$$p(\theta \mid \xi_t, a_t, s_{t+1}) = \frac{p(\theta \mid \xi_t)\, p(s_{t+1} \mid \xi_t, a_t; \theta)}{p(s_{t+1} \mid \xi_t, a_t)},
\quad\text{where}\quad
p(s_{t+1} \mid \xi_t, a_t) = \int_{\Theta} p(s_{t+1} \mid \xi_t, a_t; \theta)\, p(\theta \mid \xi_t)\, d\theta.$$
Approximate $p(\theta \mid \mathcal{D})$ with $q(\theta; \phi)$. The total reward is then approximated as
$$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \eta\, D_{\mathrm{KL}}\big[ q(\theta; \phi_{t+1}) \,\|\, q(\theta; \phi_t) \big].$$
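In code, and assuming the fully factorized Gaussian form of q introduced on the next slide, this approximated reward is cheap to evaluate because the KL between two diagonal Gaussians has a closed form (a sketch; the function names are mine):

```python
import numpy as np

def kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old):
    """Closed-form D_KL[ N(mu_new, diag(sigma_new^2)) || N(mu_old, diag(sigma_old^2)) ]."""
    return np.sum(np.log(sigma_old / sigma_new)
                  + (sigma_new**2 + (mu_new - mu_old)**2) / (2.0 * sigma_old**2)
                  - 0.5)

def augmented_reward(r, phi_new, phi_old, eta=0.1):
    """r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + eta * D_KL[q(.; phi_{t+1}) || q(.; phi_t)]."""
    (mu_new, sigma_new), (mu_old, sigma_old) = phi_new, phi_old
    return r + eta * kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old)
```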

8 Neural Network Implementation
The Bayesian neural network (BNN) [Blundell et al. 2015]:
- Models the dynamics $p(s_{t+1} \mid s_t, a_t; \theta)$.
- Weight distribution parameterized by $\phi$ as a fully factorized Gaussian:
$$q(\theta; \phi) = \prod_{i=1}^{|\Theta|} \mathcal{N}(\theta_i \mid \mu_i, \sigma_i^2).$$
- Feedforward network structure with ReLU nonlinearities between hidden layers.
- Trained by maximizing the variational lower bound via backpropagation:
$$\mathcal{L}[q(\theta; \phi), \mathcal{D}] = \mathbb{E}_{\theta \sim q(\cdot; \phi)}\big[ \log p(\mathcal{D} \mid \theta) \big] - D_{\mathrm{KL}}\big[ q(\theta; \phi) \,\|\, p(\theta) \big].$$
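A minimal sketch of this objective for one training step, assuming a standard normal prior $p(\theta) = \mathcal{N}(0, I)$ and a user-supplied log-likelihood for the dynamics model (both assumptions on my part, not details stated on the slide); the expectation is estimated with a single reparameterized sample, as in Blundell et al.:

```python
import numpy as np

def elbo_one_sample(mu, sigma, data, log_likelihood, rng=np.random.default_rng(0)):
    """Single-sample estimate of L[q(theta; phi), D] with phi = (mu, sigma)."""
    eps = rng.standard_normal(mu.shape)
    theta = mu + sigma * eps                       # reparameterized draw theta ~ q(.; phi)
    expected_loglik = log_likelihood(theta, data)  # Monte Carlo estimate of E_q[log p(D | theta)]
    # Closed-form KL between the factorized Gaussian q and the N(0, I) prior.
    kl_to_prior = np.sum(-np.log(sigma) + 0.5 * (sigma**2 + mu**2) - 0.5)
    return expected_loglik - kl_to_prior
```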

9 Efficient Computation of Information Gain
Approximated with $D_{\mathrm{KL}}\big[ q(\theta; \phi + \lambda \Delta\phi) \,\|\, q(\theta; \phi) \big]$:
1. Newton's method
   - Step $\Delta\phi = H^{-1}(\ell)\, \nabla_\phi \ell\big(q(\theta; \phi), s_t\big)$
   - Diagonal Hessian
2. Taylor expansion
   - Only the second-order term is left:
$$D_{\mathrm{KL}}\big[ q(\theta; \phi + \lambda \Delta\phi) \,\|\, q(\theta; \phi) \big] \approx \tfrac{1}{2} \lambda^2\, \nabla_\phi \ell^{\top} H^{-1} \nabla_\phi \ell$$
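The approximation is easy to check numerically. The sketch below assumes q is parameterized directly by phi = (mu, sigma), for which the Hessian of the KL at phi is diagonal with entries 1/sigma^2 (mu block) and 2/sigma^2 (sigma block); the gradient g is an arbitrary stand-in for grad_phi ell.

```python
import numpy as np

def exact_kl(mu_new, sig_new, mu_old, sig_old):
    return np.sum(np.log(sig_old / sig_new)
                  + (sig_new**2 + (mu_new - mu_old)**2) / (2.0 * sig_old**2) - 0.5)

mu, sig = np.array([0.3, -1.2]), np.array([0.5, 0.8])
g = np.array([0.4, -0.1, 0.2, 0.05])          # stand-in for grad_phi ell, ordered (mu, sigma)
H = np.concatenate([1 / sig**2, 2 / sig**2])  # diagonal Hessian of the KL at phi
step = g / H                                  # Newton step: Delta phi = H^{-1} grad_phi ell
lam = 0.01

mu_new, sig_new = mu + lam * step[:2], sig + lam * step[2:]
approx = 0.5 * lam**2 * np.sum(g**2 / H)      # (1/2) lambda^2 g^T H^{-1} g
print(exact_kl(mu_new, sig_new, mu, sig), approx)  # nearly identical for small lambda
```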

10 Experiments Setup
- Continuous control problems.
- Policies learned by TRPO [Schulman et al. 2015].
- Performance measured in terms of average return.

11 Experimental Results

12 Experimental Results (cont.)

13 Experimental Results (cont.)

14 References I
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Network. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
Jürgen Schmidhuber. Curious Model-Building Control Systems. In: Proceedings of the International Joint Conference on Neural Networks. 1991.
John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.