VIME: Variational Information Maximizing Exploration
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel
Reviewed by Zhao Song
January 13, 2017
Exploration in Reinforcement Learning
The exploration-exploitation dilemma:
- Exploration: the agent experiments with novel strategies that may improve returns in the long run.
- Exploitation: the agent maximizes rewards through behavior that is known to be successful.
This paper focuses on exploration.
Exploration in Reinforcement Learning (cont.)
Typical exploration strategies:
- Bayesian RL and PAC-MDP methods: theoretical guarantees, but not scalable to continuous control.
- Heuristic exploration: data inefficient.
Notation
Finite-horizon discounted Markov decision process (MDP):
- $\mathcal{S} \subseteq \mathbb{R}^n$: the state set
- $\mathcal{A} \subseteq \mathbb{R}^m$: the action set
- $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$: the transition probability
- $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: the reward function
- $\gamma \in (0, 1]$: the discount factor
- $T$: the horizon
Curiosity-driven Exploration
Basic idea: seek out state-action regions that are relatively unexplored [Schmidhuber 1991].
- Environment dynamics are modeled as $p(s_{t+1} \mid s_t, a_t; \theta)$.
- Take actions that maximize the reduction in uncertainty about the dynamics,
  $$\sum_t \big[ H(\Theta \mid \xi_t, a_t) - H(\Theta \mid s_{t+1}, \xi_t, a_t) \big],$$
  where $\xi_t = \{s_1, a_1, \ldots, s_t\}$ is the history up to time $t$.
- Interpretation as the mutual information between $s_{t+1}$ and $\Theta$:
  $$I(s_{t+1}; \Theta \mid \xi_t, a_t) = \mathbb{E}_{s_{t+1} \sim \mathcal{P}(\cdot \mid \xi_t, a_t)}\Big[ \underbrace{D_{\mathrm{KL}}\!\left[ p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \right]}_{\text{information gain}} \Big]$$
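To make the information-gain term concrete, here is a toy, self-contained Python example (an illustrative assumption, not the paper's model): a single unknown dynamics parameter $\theta$ with a conjugate Gaussian prior, where the exact posterior, and hence the exact KL, is available after one observed transition.

    import numpy as np

    def gaussian_kl(mu1, var1, mu0, var0):
        # D_KL[N(mu1, var1) || N(mu0, var0)] for scalar Gaussians (exact)
        return 0.5 * (np.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

    # Toy dynamics: s_{t+1} = theta + noise, noise ~ N(0, obs_var), prior theta ~ N(0, 1).
    prior_mu, prior_var, obs_var = 0.0, 1.0, 0.1
    s_next = 0.7                                    # one newly observed next state
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + s_next / obs_var)
    info_gain = gaussian_kl(post_mu, post_var, prior_mu, prior_var)
    print(info_gain)   # KL[p(theta | s_next) || p(theta)]: the information gain of this transition

Transitions that move the posterior far from the prior yield a large KL, and thus a large exploration bonus.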
Curiosity-driven Exploration (cont.)
Explicitly maximizing the mutual information is intractable. An approximate approach within RL:
- Take action $a_t \sim \pi_\alpha(s_t)$
- Sample $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$
- Obtain the new reward
  $$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \eta\, D_{\mathrm{KL}}\!\left[ p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \right],$$
  where $\eta \in \mathbb{R}_+$ is a hyperparameter.
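A minimal sketch of where this bonus enters the interaction loop; `policy`, `env`, and `dynamics_info_gain` are placeholder names (and the `env.step` signature an assumption), not the paper's code, and the default $\eta$ is illustrative.

    # Hypothetical interaction loop showing the reward shaping r' = r + eta * info_gain.
    def rollout_with_bonus(policy, env, dynamics_info_gain, horizon, eta=1e-3):
        s = env.reset()
        rewards = []
        for t in range(horizon):
            a = policy(s)                              # a_t ~ pi_alpha(s_t)
            s_next, r = env.step(a)                    # s_{t+1} ~ P(. | s_t, a_t), r(s_t, a_t)
            bonus = dynamics_info_gain(s, a, s_next)   # KL-based information gain
            rewards.append(r + eta * bonus)            # r'(s_t, a_t, s_{t+1})
            s = s_next
        return rewards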
The VIME Algorithm
Variational Bayes
Posterior via Bayes' rule:
$$p(\theta \mid \xi_t, a_t, s_{t+1}) = \frac{p(\theta \mid \xi_t)\, p(s_{t+1} \mid \xi_t, a_t; \theta)}{p(s_{t+1} \mid \xi_t, a_t)},$$
where
$$p(s_{t+1} \mid \xi_t, a_t) = \int_\Theta p(s_{t+1} \mid \xi_t, a_t; \theta)\, p(\theta \mid \xi_t)\, d\theta.$$
Approximating $p(\theta \mid \mathcal{D})$ with $q(\theta; \phi)$, the total reward is then approximated as
$$r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \eta\, D_{\mathrm{KL}}\!\left[ q(\theta; \phi_{t+1}) \,\|\, q(\theta; \phi_t) \right].$$
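Since the variational posterior will be a fully factorized Gaussian (next slide), the KL between successive posteriors has a closed form. A small numpy sketch of the approximated bonus; the function and variable names and the default $\eta$ are illustrative, not from the paper.

    import numpy as np

    def factorized_gaussian_kl(mu_new, sig_new, mu_old, sig_old):
        # D_KL[q(theta; phi_{t+1}) || q(theta; phi_t)] for fully factorized Gaussians,
        # with elementwise arrays of means and standard deviations.
        return np.sum(np.log(sig_old / sig_new)
                      + (sig_new ** 2 + (mu_new - mu_old) ** 2) / (2.0 * sig_old ** 2)
                      - 0.5)

    def approx_total_reward(r_extrinsic, phi_new, phi_old, eta=1e-3):
        # r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + eta * D_KL[q(.; phi_{t+1}) || q(.; phi_t)]
        mu_new, sig_new = phi_new
        mu_old, sig_old = phi_old
        return r_extrinsic + eta * factorized_gaussian_kl(mu_new, sig_new, mu_old, sig_old)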
Neural Network Implementation
The Bayesian neural network (BNN) [Blundell et al. 2015]:
- Models $p(s_{t+1} \mid s_t, a_t; \theta)$
- Weight distribution parameterized by $\phi$ as a fully factorized Gaussian:
  $$q(\theta; \phi) = \prod_{i=1}^{|\Theta|} \mathcal{N}(\theta_i \mid \mu_i, \sigma_i^2)$$
- Feedforward network structure with ReLU nonlinearities between hidden layers
- Trained by maximizing the variational lower bound via backpropagation (Bayes by Backprop):
  $$\mathcal{L}[q(\theta; \phi), \mathcal{D}] = \mathbb{E}_{\theta \sim q(\cdot; \phi)}\big[\log p(\mathcal{D} \mid \theta)\big] - D_{\mathrm{KL}}\!\left[ q(\theta; \phi) \,\|\, p(\theta) \right]$$
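A minimal numpy sketch of the Bayes-by-Backprop objective for a toy linear "dynamics" model; the layer shape, softplus parameterization of $\sigma$, unit-variance Gaussian likelihood, and standard-normal prior are illustrative assumptions, not details from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 4, 3                      # dim(s_t, a_t) and dim(s_{t+1}); illustrative sizes
    mu  = np.zeros((n_in, n_out))           # variational means
    rho = -3.0 * np.ones((n_in, n_out))     # sigma = log(1 + exp(rho)) keeps std devs positive

    def elbo_estimate(mu, rho, X, Y, n_samples=5):
        # Monte-Carlo estimate of L[q(theta; phi), D] = E_q[log p(D | theta)] - KL[q || p]
        sigma = np.log1p(np.exp(rho))
        log_lik = 0.0
        for _ in range(n_samples):
            eps = rng.standard_normal(mu.shape)
            W = mu + sigma * eps                          # reparameterized weight sample
            pred = X @ W                                  # toy linear prediction of s_{t+1}
            log_lik += -0.5 * np.sum((Y - pred) ** 2)     # unit-variance Gaussian log-likelihood (up to a constant)
        log_lik /= n_samples
        # Closed-form KL[q(theta; phi) || N(0, I)] for a fully factorized Gaussian q
        kl = np.sum(-np.log(sigma) + (sigma ** 2 + mu ** 2) / 2.0 - 0.5)
        return log_lik - kl

    # Usage with random data: X has shape (N, n_in), Y has shape (N, n_out)
    X = rng.standard_normal((32, n_in))
    Y = rng.standard_normal((32, n_out))
    print(elbo_estimate(mu, rho, X, Y))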
Efficient Computation of Information Gain
Approximate $D_{\mathrm{KL}}\!\left[ q(\theta; \phi + \lambda \Delta\phi) \,\|\, q(\theta; \phi) \right]$:
1. Newton's method
   - Step $\Delta\phi = H^{-1}(\ell)\, \nabla_\phi \ell\big(q(\theta; \phi), s_t\big)$
   - Diagonal Hessian
2. Taylor expansion
   - Only the second-order term remains:
     $$D_{\mathrm{KL}}\!\left[ q(\theta; \phi + \lambda \Delta\phi) \,\|\, q(\theta; \phi) \right] \approx \tfrac{1}{2} \lambda^2\, \nabla_\phi \ell^{\top} H^{-1} \nabla_\phi \ell$$
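With a diagonal Hessian, the second-order approximation on this slide reduces to a sum over parameters. A small numpy sketch; the gradient and Hessian diagonal are assumed to be computed elsewhere from the variational objective $\ell$.

    import numpy as np

    def approx_info_gain(grad, hess_diag, lam=1.0):
        # D_KL[q(theta; phi + lam * dphi) || q(theta; phi)] with dphi = H^{-1} grad and
        # diagonal H:  ~ 0.5 * lam^2 * grad^T H^{-1} grad = 0.5 * lam^2 * sum(grad_i^2 / h_ii)
        return 0.5 * lam ** 2 * np.sum(grad ** 2 / hess_diag)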
Experiments Setup
- Continuous control problems
- Policies learned by TRPO [Schulman et al. 2015]
- Performance measured in terms of average return
Experimental Results
[Result plots from the original slides are not reproduced here.]
References
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Network. In Proceedings of the 32nd International Conference on Machine Learning (ICML). 2015.
Jürgen Schmidhuber. Curious Model-Building Control Systems. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). 1991.
John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML). 2015.