Reinforcement Learning as Variational Inference: Two Recent Approaches
Rohith Kuditipudi, Duke University
11 August 2017
Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways
Background: Reinforcement Learning

A Markov Decision Process (MDP) is a tuple (S, A, p, r, γ), where:
- S is the state space
- A is the action space
- p(s_{t+1} | s_t, a_t) is the probability of the next state s_{t+1} ∈ S given the current state s_t ∈ S and the action a_t ∈ A taken by the agent
- r : S × A → [r_min, r_max] is the reward function
- γ ∈ [0, 1] is the discount factor

A policy π(· | s_t) is a distribution over A conditioned on the current state s_t.

The goal of reinforcement learning is to learn a policy π that maximizes the total expected (discounted) reward J(π), where
$$J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right]$$
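To make the objective concrete, here is a minimal Python sketch (my own illustration, not from the slides) that evaluates the discounted return of a single sampled trajectory; J(π) is the expectation of this quantity over trajectories drawn from π.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return: sum_t gamma^t * r_t for one sampled trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: a three-step trajectory of per-step rewards
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.9**2 * 2.0 = 2.62
```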
Background: Stein Variational Gradient Descent (SVGD)

SVGD [1] is a variational inference algorithm that iteratively transports a set of particles {x_i}_{i=1}^n to match a target distribution.

At each step, particles are updated as follows:
$$x_i \leftarrow x_i + \epsilon\, \phi(x_i)$$
where ε is the step size and φ is a perturbation direction chosen to greedily minimize the KL divergence to the target distribution.

By restricting φ to lie in the unit ball of an RKHS with corresponding kernel k, we can obtain a closed-form solution for the optimal perturbation direction...

[1] Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.
Background: Stein Variational Gradient Descent (SVGD)

Input:  a target distribution with density function p(x) and a set of initial particles {x_i^0}_{i=1}^n
Output: a set of particles {x_i}_{i=1}^n that approximates the target distribution

for iteration l do
    x_i^{l+1} ← x_i^l + ε_l φ̂*(x_i^l)
    where φ̂*(x) = (1/n) Σ_{j=1}^n [ k(x_j^l, x) ∇_{x_j^l} log p(x_j^l) + ∇_{x_j^l} k(x_j^l, x) ]
    and ε_l is the step size at the l-th iteration
end

Algorithm 1: Stein Variational Gradient Descent (SVGD)
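The update above is easy to check on a toy problem. Below is a minimal NumPy sketch of Algorithm 1 (my own illustration, not code from the SVGD paper), using an RBF kernel with a fixed bandwidth and a one-dimensional Gaussian target whose score function ∇ log p is known in closed form:

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Pairwise RBF kernel k(x_j, x_i) and its gradient w.r.t. x_j, for 1-D particles."""
    diff = x[:, None] - x[None, :]        # diff[j, i] = x_j - x_i
    K = np.exp(-diff**2 / (2 * h**2))     # K[j, i] = k(x_j, x_i)
    dK = -diff / h**2 * K                 # dK[j, i] = d k(x_j, x_i) / d x_j
    return K, dK

def svgd_step(x, grad_logp, eps=0.1, h=1.0):
    """One SVGD update: x_i <- x_i + eps * phi_hat(x_i)."""
    n = len(x)
    K, dK = rbf_kernel(x, h)
    phi = (K * grad_logp[:, None] + dK).sum(axis=0) / n   # average over particles j
    return x + eps * phi

# Toy target: N(3, 1), so grad log p(x) = -(x - 3)
x = np.random.randn(50)
for _ in range(500):
    x = svgd_step(x, grad_logp=-(x - 3.0))
print(x.mean(), x.std())   # should approach 3 and 1
```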
Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways
Stein Variational Policy Gradient [2]: Preliminaries

Policy iteration: parametrize the policy as π(a_t | s_t; θ) and iteratively update θ to maximize J(π(a_t | s_t; θ)) (a.k.a. J(θ)).

MaxEnt Policy Optimization: instead of searching for a single policy θ, optimize a distribution q over θ as follows:
$$\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] - \alpha\, D_{\mathrm{KL}}(q \,\|\, q_0) \right\}$$
where q_0 is a prior on θ. If q_0 is constant, the objective simplifies to
$$\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] + \alpha H(q) \right\}$$
where H(q) is the entropy of q.

The optimal distribution is
$$q(\theta) \propto \exp\!\left(\tfrac{1}{\alpha} J(\theta)\right) q_0(\theta)$$

[2] Liu et al. Stein Variational Policy Gradient.
Stein Variational Policy Gradient: Algorithm

Input: learning rate ε, kernel k(x, x'), temperature α, initial particles {θ_i}

for t = 0, 1, ..., T do
    for i = 0, 1, ..., n do
        Compute ∇_{θ_i} J(θ_i) (using RL method of choice)
    end
    for i = 0, 1, ..., n do
        Δθ_i = (1/n) Σ_{j=1}^n [ ∇_{θ_j}( (1/α) J(θ_j) + log q_0(θ_j) ) k(θ_j, θ_i) + ∇_{θ_j} k(θ_j, θ_i) ]
        θ_i ← θ_i + ε Δθ_i
    end
end

Algorithm 2: Stein Variational Policy Gradient (SVPG)
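As an illustration (my own sketch, not the authors' released code), the SVPG particle update can be written in a few lines of NumPy, assuming the per-particle policy gradients ∇_{θ_i} J(θ_i) and prior gradients ∇_{θ_i} log q_0(θ_i) have already been computed by an RL method of choice:

```python
import numpy as np

def rbf(thetas, h=1.0):
    """Pairwise RBF kernel over flattened policy parameters and its gradient w.r.t. theta_j."""
    diff = thetas[:, None, :] - thetas[None, :, :]     # diff[j, i] = theta_j - theta_i
    K = np.exp(-(diff**2).sum(-1) / (2 * h**2))        # K[j, i] = k(theta_j, theta_i)
    dK = -diff / h**2 * K[:, :, None]                  # dK[j, i] = grad_{theta_j} k(theta_j, theta_i)
    return K, dK

def svpg_update(thetas, grad_J, grad_logq0, alpha=10.0, eps=1e-3, h=1.0):
    """One SVPG step. thetas: (n, d) particles; grad_J, grad_logq0: (n, d) per-particle gradients."""
    n = thetas.shape[0]
    K, dK = rbf(thetas, h)
    driving = grad_J / alpha + grad_logq0              # grad_{theta_j}( J(theta_j)/alpha + log q_0(theta_j) )
    delta = (K[:, :, None] * driving[:, None, :] + dK).sum(axis=0) / n
    return thetas + eps * delta

# Hypothetical usage: grad_J comes from any policy-gradient estimator (e.g. REINFORCE or A2C),
# and grad_logq0 = 0 for a flat prior q_0.
```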
Stein Variational Policy Gradient: Results
[Result figures from the original slides are not reproduced in this text version.]
Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways
Soft Q-Learning [3]: Preliminaries

Recall the standard RL objective (assuming γ = 1):
$$\pi^*_{\mathrm{std}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}[r(s_t, a_t)]$$

In MaxEnt RL, we augment the reward with an entropy term that encourages exploration:
$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\left[r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right]$$

Note: when γ ≠ 1, the MaxEnt RL objective becomes:
$$\arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\left[\sum_{l=t}^{\infty} \gamma^{\,l-t}\, \mathbb{E}_{(s_l, a_l)}\left[r(s_l, a_l) + \alpha H(\pi(\cdot \mid s_l)) \mid s_t, a_t\right]\right]$$

[3] Haarnoja et al. Reinforcement Learning with Deep Energy-Based Policies.
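For a discrete action space, the entropy bonus is straightforward to compute; a small sketch of my own (hypothetical numbers, just to illustrate the augmented reward):

```python
import numpy as np

def entropy_augmented_reward(reward, action_probs, alpha=1.0):
    """r(s_t, a_t) + alpha * H(pi(.|s_t)) for a discrete policy pi(.|s_t)."""
    entropy = -(action_probs * np.log(action_probs)).sum()
    return reward + alpha * entropy

print(entropy_augmented_reward(1.0, np.array([0.7, 0.2, 0.1]), alpha=0.5))  # ~1.40
```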
Soft Q-Learning: Preliminaries

Theorem. Let the soft Q-function be defined by
$$Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \mathbb{E}_{\pi^*_{\mathrm{MaxEnt}}}\left[\sum_{l=1}^{\infty} \gamma^l \left(r_{t+l} + \alpha H(\pi^*_{\mathrm{MaxEnt}}(\cdot \mid s_{t+l}))\right)\right]$$
and the soft value function by
$$V^*_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\!\left(\tfrac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a')\right) da'$$
Then
$$\pi^*_{\mathrm{MaxEnt}}(a_t \mid s_t) = \exp\!\left(\tfrac{1}{\alpha}\left(Q^*_{\mathrm{soft}}(s_t, a_t) - V^*_{\mathrm{soft}}(s_t)\right)\right)$$
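The theorem is stated for a continuous action space (the integral over A), but the discrete-action analogue, with the integral replaced by a sum, makes the relationships easy to verify numerically. A sketch of my own (not from the paper):

```python
import numpy as np
from scipy.special import logsumexp

def soft_value(q_values, alpha=1.0):
    """Discrete analogue of V_soft(s) = alpha * log sum_a' exp(Q_soft(s, a') / alpha)."""
    return alpha * logsumexp(q_values / alpha)

def maxent_policy(q_values, alpha=1.0):
    """pi(a|s) = exp((Q_soft(s, a) - V_soft(s)) / alpha), i.e. a softmax with temperature alpha."""
    return np.exp((q_values - soft_value(q_values, alpha)) / alpha)

q = np.array([1.0, 2.0, 0.5])
print(maxent_policy(q, alpha=1.0))            # Boltzmann distribution over the three actions
print(maxent_policy(q, alpha=0.01).round(3))  # alpha -> 0 recovers the greedy (argmax) policy
```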
Soft Q-Learning: Preliminaries

Theorem. The soft Q-function satisfies the soft Bellman equation
$$Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V^*_{\mathrm{soft}}(s_{t+1})\right]$$

Theorem (Soft Q-iteration). Let Q_soft(·, ·) and V_soft(·) be bounded and assume that ∫_A exp((1/α) Q_soft(·, a')) da' < ∞ and that Q*_soft < ∞ exists. Then the fixed-point iteration
$$Q_{\mathrm{soft}}(s_t, a_t) \leftarrow r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\mathrm{soft}}(s_{t+1})\right], \quad \forall\, s_t, a_t$$
$$V_{\mathrm{soft}}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\!\left(\tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a')\right) da', \quad \forall\, s_t$$
converges to Q*_soft and V*_soft, respectively.
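For a small tabular MDP the two fixed-point updates can be run exactly; the following NumPy sketch (my own, with a hypothetical 2-state, 2-action MDP) illustrates the iteration:

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_iteration(P, R, gamma=0.9, alpha=1.0, iters=200):
    """Tabular soft Q-iteration. P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = alpha * logsumexp(Q / alpha, axis=1)   # V_soft(s) <- alpha log sum_a' exp(Q(s, a') / alpha)
        Q = R + gamma * (P @ V)                    # Q_soft(s, a) <- r + gamma E_{s'}[V_soft(s')]
    return Q, V

# Hypothetical 2-state, 2-action MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
Q, V = soft_q_iteration(P, R)
print(Q)
print(V)
```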
Soft Q-Learning: Soft Q-iteration in Practice

In practice, we model the soft Q-function using a function approximator with parameters θ, denoted Q^θ_soft.

To convert soft Q-iteration into a stochastic optimization problem, we can re-express V_soft(s_t) as an expectation via importance sampling (instead of an integral) as follows:
$$V^\theta_{\mathrm{soft}}(s_t) = \alpha \log \mathbb{E}_{a' \sim q}\left[\frac{\exp\!\left(\tfrac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a')\right)}{q(a')}\right]$$
where q is the sampling distribution (e.g. the current policy).

Furthermore, we can update Q^θ_soft to minimize
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim q}\left[\tfrac{1}{2}\left(\hat{Q}^{\bar\theta}_{\mathrm{soft}}(s_t, a_t) - Q^\theta_{\mathrm{soft}}(s_t, a_t)\right)^2\right]$$
where $\hat{Q}^{\bar\theta}_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V^{\bar\theta}_{\mathrm{soft}}(s_{t+1})]$ and $\bar\theta$ are the target network parameters updated later in Algorithm 3.
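Because the expectation sits inside a logarithm, a numerically stable way to form the importance-sampled estimate of V^θ_soft(s_t) is via log-sum-exp. A sketch of my own (hypothetical inputs):

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_estimate(q_vals, log_q_probs, alpha=1.0):
    """Monte Carlo estimate of V_soft(s) = alpha * log E_{a'~q}[ exp(Q(s, a')/alpha) / q(a') ].

    q_vals:      Q_soft(s, a'^(j)) for M actions a'^(j) sampled from the proposal q
    log_q_probs: log q(a'^(j)) for the same samples
    """
    m = len(q_vals)
    # alpha * log( (1/M) * sum_j exp(q_j/alpha - log q(a_j)) ), computed stably
    return alpha * (logsumexp(q_vals / alpha - log_q_probs) - np.log(m))

# Hypothetical usage with M = 16 actions from a uniform proposal over a volume-4 action set
q_vals = np.random.randn(16)
log_q_probs = np.full(16, -np.log(4.0))
print(soft_value_estimate(q_vals, log_q_probs))
```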
Soft Q-Learning: Sampling from the Soft Q-function

Goal: learn a state-conditioned stochastic neural network a_t = f^φ(ξ; s_t), with parameters φ, that maps Gaussian noise ξ to unbiased action samples from the EBM specified by Q^θ_soft.

Specifically, we want to minimize the following loss:
$$J_\pi(\phi; s_t) = D_{\mathrm{KL}}\!\left(\pi^\phi(\cdot \mid s_t) \,\Big\|\, \exp\!\left(\tfrac{1}{\alpha}\left(Q^\theta_{\mathrm{soft}}(s_t, \cdot) - V^\theta_{\mathrm{soft}}(s_t)\right)\right)\right)$$
where π^φ(· | s_t) is the action distribution induced by φ.

Strategy: sample actions a_t^{(i)} = f^φ(ξ^{(i)}; s_t) and use SVGD to compute optimal greedy perturbations Δf^φ(ξ^{(i)}; s_t) that minimize J_π(φ; s_t), where
$$\Delta f^\phi(\xi^{(i)}; s_t) = \mathbb{E}_{a_t \sim \pi^\phi}\left[\kappa\!\left(a_t, f^\phi(\xi^{(i)}; s_t)\right) \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a')\big|_{a'=a_t} + \alpha\, \nabla_{a'} \kappa\!\left(a', f^\phi(\xi^{(i)}; s_t)\right)\big|_{a'=a_t}\right]$$
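The expectation in Δf^φ can be approximated with the empirical mean over the sampled actions themselves. Here is a NumPy sketch of that computation for a single state (my own illustration; the RBF kernel choice and bandwidth are assumptions):

```python
import numpy as np

def action_svgd_direction(actions, grad_q, alpha=1.0, h=1.0):
    """SVGD perturbations Delta f for a set of sampled actions at one state.

    actions: (M, d) sampled actions a_t^(i) = f^phi(xi^(i); s_t)
    grad_q:  (M, d) gradients of Q^theta_soft(s_t, a) evaluated at each a_t^(i)
    Returns: (M, d) perturbation for each sampled action.
    """
    M = actions.shape[0]
    diff = actions[:, None, :] - actions[None, :, :]   # diff[i, m] = a^(i) - a^(m)
    K = np.exp(-(diff**2).sum(-1) / (2 * h**2))        # K[i, m] = kappa(a^(i), a^(m))
    dK = -diff / h**2 * K[:, :, None]                  # dK[i, m] = grad_{a^(i)} kappa(a^(i), a^(m))
    # E_{a_t ~ pi^phi}[...] approximated by the mean over the M samples (index i)
    return (K[:, :, None] * grad_q[:, None, :] + alpha * dK).sum(axis=0) / M
```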
Soft Q-Learning: Sampling from the Soft Q-function

The Stein variational gradient can be backpropagated into the sampling network via the chain rule:
$$\hat\nabla_\phi J_\pi(\phi; s_t) \propto \mathbb{E}_{\xi}\left[\Delta f^\phi(\xi; s_t)\, \frac{\partial f^\phi(\xi; s_t)}{\partial \phi}\right]$$

And so we have:
$$\hat\nabla_\phi J_\pi(\phi; s_t) = \frac{1}{KM} \sum_{j=1}^{K} \sum_{i=1}^{M} \left(\kappa\!\left(a_t^{(i)}, \tilde a_t^{(j)}\right) \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a')\big|_{a'=a_t^{(i)}} + \alpha\, \nabla_{a'} \kappa\!\left(a', \tilde a_t^{(j)}\right)\big|_{a'=a_t^{(i)}}\right) \nabla_\phi f^\phi\!\left(\tilde\xi^{(j)}; s_t\right)$$

The ultimate update direction ∇̂_φ J_π(φ) is the average of ∇̂_φ J_π(φ; s_t) over a mini-batch sampled from replay memory.
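In an autodiff framework, one common way to realize this chain rule (my own sketch, not necessarily the authors' implementation) is to treat Δf as a constant and form an inner product with the sampler's output, so that backpropagation supplies the ∂f^φ/∂φ factor automatically:

```python
import torch

def policy_surrogate_loss(actions, delta_f):
    """Surrogate whose gradient w.r.t. phi matches E[ Delta f * d f^phi / d phi ] up to a constant.

    actions: (M, d) sampler outputs a = f^phi(xi; s_t), still attached to the autograd graph
    delta_f: (M, d) SVGD perturbations, treated as constants
    """
    # Minimizing the negative inner product pushes f^phi along the Stein variational gradient;
    # an optimizer such as Adam then applies the resulting update to phi.
    return -(delta_f.detach() * actions).sum()
```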
Soft Q-Learning: Algorithm

for each epoch do
    for each t do
        Collect experience:
            a_t ← f^φ(ξ, s_t) where ξ ~ N(0, I)
            s_{t+1} ~ p(s_{t+1} | s_t, a_t)
            D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
        Sample a minibatch from replay memory: {(s_t^{(i)}, a_t^{(i)}, r_t^{(i)}, s_{t+1}^{(i)})}_{i=0}^{N} ⊂ D
        Update soft Q-function parameters:
            Sample {a^{(i,j)}}_{j=0}^{M} ~ q for each s_{t+1}^{(i)}
            Compute V̂^θ̄_soft(s_{t+1}^{(i)}) and ∇̂_θ J_Q; update θ using ADAM
        Update policy:
            Sample {ξ^{(i,j)}}_{j=0}^{M} ~ N(0, I) for each s_t^{(i)}
            Compute actions a_t^{(i,j)} = f^φ(ξ^{(i,j)}, s_t^{(i)})
            Compute Δf^φ and ∇̂_φ J_π; update φ using ADAM
    end
    if epoch mod update_interval = 0 then
        Update target parameters: θ̄ ← θ, φ̄ ← φ
    end
end

Algorithm 3: Soft Q-Learning
Soft Q-Learning: Results
[Result figures from the original slides are not reproduced in this text version.]
Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways