Reinforcement Learning as Variational Inference: Two Recent Approaches


Reinforcement Learning as Variational Inference: Two Recent Approaches
Rohith Kuditipudi, Duke University
11 August 2017

Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways


Background: Reinforcement Learning

A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where:
- $\mathcal{S}$ is the state space
- $\mathcal{A}$ is the action space
- $p(s_{t+1} \mid s_t, a_t)$ is the probability of the next state $s_{t+1} \in \mathcal{S}$ given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$ taken by the agent
- $r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ is the reward function
- $\gamma \in [0, 1]$ is the discount factor

A policy $\pi(\cdot \mid s_t)$ is a distribution over $\mathcal{A}$ conditioned on the current state $s_t$.

The goal of reinforcement learning is to learn a policy $\pi$ that maximizes the total expected (discounted) reward $J(\pi)$, where
$$J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$
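As a quick numerical illustration of the objective $J(\pi)$, here is a minimal NumPy sketch computing the discounted return of a single trajectory; the rewards in the example are made-up numbers, not from the slides.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory of rewards."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: 1 + 0 + 0.5*0.81 + 2*0.729 = 2.863
print(discounted_return([1.0, 0.0, 0.5, 2.0], gamma=0.9))
```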

Background: Stein Variational Gradient Descent (SVGD)

SVGD [1] is a variational inference algorithm that iteratively transports a set of particles $\{x_i\}_{i=1}^n$ to match a target distribution. At each step, the particles are updated as
$$x_i \leftarrow x_i + \epsilon \, \phi(x_i)$$
where $\epsilon$ is the step size and $\phi$ is a perturbation direction chosen to greedily minimize the KL divergence to the target distribution.

By restricting $\phi$ to lie in the unit ball of an RKHS with corresponding kernel $k$, we obtain a closed-form solution for the optimal perturbation direction.

[1] Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.

Background: Stein Variational Gradient Descent (SVGD)

Algorithm 1: Stein Variational Gradient Descent (SVGD)

Input: a target distribution with density function $p(x)$ and a set of initial particles $\{x_i^0\}_{i=1}^n$
Output: a set of particles $\{x_i\}_{i=1}^n$ that approximates the target distribution

for iteration $l$ do
    $x_i^{l+1} \leftarrow x_i^l + \epsilon_l \, \hat{\phi}^*(x_i^l)$,
    where $\hat{\phi}^*(x) = \frac{1}{n} \sum_{j=1}^{n} \left[ k(x_j^l, x)\, \nabla_{x_j^l} \log p(x_j^l) + \nabla_{x_j^l} k(x_j^l, x) \right]$
    and $\epsilon_l$ is the step size at the $l$-th iteration
end
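As a concrete illustration, here is a minimal NumPy sketch of the update in Algorithm 1 with an RBF kernel; the target distribution, kernel bandwidth, step size, and particle count are illustrative choices, not part of the slides.

```python
import numpy as np

def rbf_kernel(X, h=1.0):
    """Return the RBF kernel matrix K and the gradients grad_{x_j} k(x_j, x_i)."""
    diffs = X[:, None, :] - X[None, :, :]           # (n, n, d), [j, i] = x_j - x_i
    sq_dists = np.sum(diffs ** 2, axis=-1)          # (n, n)
    K = np.exp(-sq_dists / (2 * h ** 2))
    grad_K = -(diffs * K[:, :, None]) / h ** 2      # grad_{x_j} k(x_j, x_i), indexed [j, i]
    return K, grad_K

def svgd_step(X, grad_log_p, eps=0.1, h=1.0):
    """One SVGD update: x_i <- x_i + eps * phi(x_i)."""
    n = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    scores = grad_log_p(X)                          # (n, d), grad_x log p evaluated at each x_j
    # phi(x_i) = 1/n sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ scores + grad_K.sum(axis=0)) / n
    return X + eps * phi

# Example target: standard 2-D Gaussian, so grad log p(x) = -x.
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, size=(100, 2))      # deliberately initialized far from the target
for _ in range(500):
    particles = svgd_step(particles, lambda X: -X, eps=0.05)
print(particles.mean(axis=0), particles.std(axis=0))  # should approach roughly (0, 0) and (1, 1)
```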

Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways

Stein Variational Policy Gradient [2]: Preliminaries

Policy optimization: parametrize the policy as $\pi(a_t \mid s_t; \theta)$ and iteratively update $\theta$ to maximize $J(\pi(a_t \mid s_t; \theta))$, abbreviated $J(\theta)$.

MaxEnt policy optimization: instead of searching for a single parameter vector $\theta$, optimize a distribution $q$ over $\theta$:
$$\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] - \alpha D_{KL}(q \,\|\, q_0) \right\}$$
where $q_0$ is a prior on $\theta$. If $q_0$ is constant, the objective simplifies to
$$\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] + \alpha H(q) \right\}$$
where $H(q)$ is the entropy of $q$.

The optimal distribution is
$$q(\theta) \propto \exp\!\left(\frac{1}{\alpha} J(\theta)\right) q_0(\theta)$$

[2] Liu et al. Stein Variational Policy Gradient.

Stein Variational Policy Gradient: Algorithm

Algorithm 2: Stein Variational Policy Gradient (SVPG)

Input: learning rate $\epsilon$, kernel $k(x, x')$, temperature $\alpha$, initial particles $\{\theta_i\}$
for $t = 0, 1, \ldots, T$ do
    for $i = 0, 1, \ldots, n$ do
        Compute $\nabla_{\theta_i} J(\theta_i)$ (using the RL method of choice)
    end
    for $i = 0, 1, \ldots, n$ do
        $\Delta\theta_i = \frac{1}{n} \sum_{j=1}^{n} \left[ \nabla_{\theta_j}\!\left(\frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j)\right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right]$
        $\theta_i \leftarrow \theta_i + \epsilon \, \Delta\theta_i$
    end
end
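The inner update above is the SVGD update applied to the distribution $q(\theta) \propto \exp(\frac{1}{\alpha} J(\theta))\, q_0(\theta)$. Below is a minimal NumPy sketch of one such update, assuming the per-particle policy-gradient estimates and prior scores are supplied by the RL method of choice; the toy $J(\theta)$ in the usage example is made up for illustration.

```python
import numpy as np

def svpg_update(thetas, grad_J, grad_log_q0, alpha=10.0, eps=1e-3, h=1.0):
    """One SVPG step on flattened policy parameters.

    thetas:      (n, d) array, one row per policy particle
    grad_J:      (n, d) array of policy-gradient estimates, one per particle
    grad_log_q0: (n, d) array of prior score terms (zeros for a flat prior)
    """
    n = thetas.shape[0]
    diffs = thetas[:, None, :] - thetas[None, :, :]        # (n, n, d), [j, i] = theta_j - theta_i
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))
    grad_K = -(diffs * K[:, :, None]) / h ** 2             # grad_{theta_j} k(theta_j, theta_i)
    scores = grad_J / alpha + grad_log_q0                  # gradient of log[exp(J/alpha) q_0]
    delta = (K.T @ scores + grad_K.sum(axis=0)) / n        # Delta theta_i for each particle
    return thetas + eps * delta

# Toy usage with a flat prior: pretend J(theta) = -||theta||^2, so grad_J = -2 * thetas.
thetas = np.random.default_rng(0).normal(size=(50, 2))
for _ in range(1000):
    thetas = svpg_update(thetas, grad_J=-2 * thetas,
                         grad_log_q0=np.zeros_like(thetas), alpha=1.0, eps=0.05)
print(np.abs(thetas.mean(axis=0)))  # should be near 0: particles approximate exp(J/alpha)
```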

Stein Variational Policy Gradient: Results [result figures not reproduced in the transcription]

Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways

Soft Q-Learning [3]: Preliminaries

Recall the standard RL objective (assuming $\gamma = 1$):
$$\pi^*_{\text{std}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}[r(s_t, a_t)]$$

In MaxEnt RL, we augment the reward with an entropy term that encourages exploration:
$$\pi^*_{\text{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\big[r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\big]$$

Note: when $\gamma \neq 1$, the MaxEnt RL objective becomes
$$\arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\!\left[ \sum_{l=t}^{\infty} \gamma^{l-t} \, \mathbb{E}_{(s_l, a_l)}\big[ r(s_l, a_l) + \alpha H(\pi(\cdot \mid s_l)) \,\big|\, s_t, a_t \big] \right]$$

[3] Haarnoja et al. Reinforcement Learning with Deep Energy-Based Policies.

Soft Q-Learning: Preliminaries

Theorem. Let the soft Q-function be defined by
$$Q^*_{\text{soft}}(s_t, a_t) = r_t + \mathbb{E}_{\pi^*_{\text{MaxEnt}}}\!\left[ \sum_{l=1}^{\infty} \gamma^l \big( r_{t+l} + \alpha H(\pi^*_{\text{MaxEnt}}(\cdot \mid s_{t+l})) \big) \right]$$
and the soft value function by
$$V^*_{\text{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\!\left( \frac{1}{\alpha} Q^*_{\text{soft}}(s_t, a') \right) da'$$
Then
$$\pi^*_{\text{MaxEnt}}(a_t \mid s_t) = \exp\!\left( \frac{1}{\alpha} \big( Q^*_{\text{soft}}(s_t, a_t) - V^*_{\text{soft}}(s_t) \big) \right)$$

Soft Q-Learning: Preliminaries

Theorem. The soft Q-function satisfies the soft Bellman equation
$$Q^*_{\text{soft}}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[V^*_{\text{soft}}(s_{t+1})\big]$$

Theorem (Soft Q-iteration). Let $Q_{\text{soft}}(\cdot, \cdot)$ and $V_{\text{soft}}(\cdot)$ be bounded, and assume that $\int_{\mathcal{A}} \exp\!\big(\frac{1}{\alpha} Q_{\text{soft}}(\cdot, a')\big)\, da' < \infty$ and that $Q^*_{\text{soft}} < \infty$ exists. Then the fixed-point iteration
$$Q_{\text{soft}}(s_t, a_t) \leftarrow r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[V_{\text{soft}}(s_{t+1})\big], \quad \forall s_t, a_t$$
$$V_{\text{soft}}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\!\left( \frac{1}{\alpha} Q_{\text{soft}}(s_t, a') \right) da', \quad \forall s_t$$
converges to $Q^*_{\text{soft}}$ and $V^*_{\text{soft}}$, respectively.
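To make the fixed-point iteration concrete, here is a minimal NumPy sketch of tabular soft Q-iteration on a small, made-up finite MDP (not from the slides); with a discrete action set the soft value integral reduces to a log-sum-exp over actions.

```python
import numpy as np

def soft_q_iteration(P, R, gamma=0.95, alpha=1.0, n_iters=500):
    """Tabular soft Q-iteration.

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns the soft Q-function, soft value function, and MaxEnt-optimal policy.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha)   (discrete-action case)
        V = alpha * np.log(np.sum(np.exp(Q / alpha), axis=1))
        # Q_soft(s, a) <- r(s, a) + gamma * E_{s' ~ p}[ V_soft(s') ]
        Q = R + gamma * P @ V
    V = alpha * np.log(np.sum(np.exp(Q / alpha), axis=1))
    # pi*(a | s) = exp( (Q_soft(s, a) - V_soft(s)) / alpha )
    pi = np.exp((Q - V[:, None]) / alpha)
    return Q, V, pi

# Random toy MDP with 5 states and 3 actions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=(5, 3))   # P[s, a, :] is a distribution over next states
R = rng.uniform(0, 1, size=(5, 3))
Q, V, pi = soft_q_iteration(P, R)
print(pi.sum(axis=1))  # each row of the soft-optimal policy sums to 1
```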

Soft Q-Learning: Soft Q-iteration in Practice

In practice, we model the soft Q-function with a function approximator with parameters $\theta$, denoted $Q^\theta_{\text{soft}}$.

To convert soft Q-iteration into a stochastic optimization problem, we can re-express $V_{\text{soft}}(s_t)$ as an expectation via importance sampling (instead of an integral):
$$V^\theta_{\text{soft}}(s_t) = \alpha \log \mathbb{E}_{a' \sim q}\!\left[ \frac{\exp\!\big(\frac{1}{\alpha} Q^\theta_{\text{soft}}(s_t, a')\big)}{q(a')} \right]$$
where $q$ is the sampling distribution (e.g. the current policy).

Furthermore, we can update $Q_{\text{soft}}$ to minimize
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim q}\!\left[ \frac{1}{2}\Big( \hat{Q}^{\bar\theta}_{\text{soft}}(s_t, a_t) - Q^\theta_{\text{soft}}(s_t, a_t) \Big)^{\!2} \right]$$
where $\hat{Q}^{\bar\theta}_{\text{soft}}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[V^{\bar\theta}_{\text{soft}}(s_{t+1})\big]$ and $\bar\theta$ denotes the target parameters.
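Below is a minimal NumPy sketch of the importance-sampling estimate of $V^\theta_{\text{soft}}(s_t)$, assuming a unit-Gaussian proposal $q$ over actions and a toy callable standing in for the Q-network; both are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def soft_value_estimate(q_soft, state, alpha=1.0, n_samples=64, action_dim=2, rng=None):
    """Monte Carlo estimate of V_soft(s) = alpha * log E_{a'~q}[ exp(Q_soft(s,a')/alpha) / q(a') ]."""
    rng = rng or np.random.default_rng()
    # Proposal q(a'): unit Gaussian over actions (illustrative choice).
    actions = rng.normal(size=(n_samples, action_dim))
    log_q = -0.5 * np.sum(actions ** 2, axis=1) - 0.5 * action_dim * np.log(2 * np.pi)
    q_values = q_soft(state, actions)               # Q_soft(s, a') for each sampled action
    # alpha * log-mean-exp of (Q/alpha - log q), computed with the usual max-shift for stability
    log_w = q_values / alpha - log_q
    m = log_w.max()
    return alpha * (m + np.log(np.mean(np.exp(log_w - m))))

# Toy stand-in Q-function: a quadratic bowl in the action, ignoring the state.
toy_q = lambda s, a: -0.5 * np.sum(a ** 2, axis=-1)
# For this toy Q with alpha = 1, the exact soft value is log(2*pi) ~= 1.84.
print(soft_value_estimate(toy_q, state=None, rng=np.random.default_rng(0)))
```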

Soft Q-Learning: Sampling from the Soft Q-function

Goal: learn a state-conditioned stochastic neural network $a_t = f^\phi(\xi; s_t)$, with parameters $\phi$, that maps Gaussian noise $\xi$ to unbiased action samples from the energy-based model specified by $Q^\theta_{\text{soft}}$.

Specifically, we want to minimize the loss
$$J_\pi(\phi; s_t) = D_{KL}\!\left( \pi^\phi(\cdot \mid s_t) \,\Big\|\, \exp\!\Big( \frac{1}{\alpha}\big( Q^\theta_{\text{soft}}(s_t, \cdot) - V^\theta_{\text{soft}}(s_t) \big) \Big) \right)$$
where $\pi^\phi(\cdot \mid s_t)$ is the action distribution induced by $\phi$.

Strategy: sample actions $a_t^{(i)} = f^\phi(\xi^{(i)}; s_t)$ and use SVGD to compute optimal greedy perturbations $\Delta f^\phi(\xi^{(i)}; s_t)$ that minimize $J_\pi(\phi; s_t)$, where
$$\Delta f^\phi(\xi^{(i)}; s_t) = \mathbb{E}_{a_t \sim \pi^\phi}\!\left[ \kappa\big(a_t, f^\phi(\xi^{(i)}; s_t)\big)\, \nabla_{a'} Q^\theta_{\text{soft}}(s_t, a') \big|_{a' = a_t} + \alpha \, \nabla_{a'} \kappa\big(a', f^\phi(\xi^{(i)}; s_t)\big) \big|_{a' = a_t} \right]$$
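Here is a minimal NumPy sketch of the perturbation $\Delta f^\phi$ above, assuming the sampled actions and the action-gradients of $Q^\theta_{\text{soft}}$ at those actions are already available as arrays; in the actual method these would come from the sampling network and the Q-network via automatic differentiation, and an RBF kernel $\kappa$ is assumed here for illustration.

```python
import numpy as np

def soft_q_svgd_perturbation(actions, fixed_actions, grad_q, alpha=1.0, h=1.0):
    """SVGD perturbation Delta f for the Soft Q-Learning sampling network.

    actions:       (M, d) actions a_t^(i) sampled from the current policy pi^phi
    fixed_actions: (K, d) actions f^phi(xi^(j); s_t) at which perturbations are evaluated
    grad_q:        (M, d) gradients of Q_soft(s_t, a') w.r.t. a' at a' = a_t^(i)
    Returns a (K, d) array: one perturbation per fixed action.
    """
    M = actions.shape[0]
    diffs = actions[:, None, :] - fixed_actions[None, :, :]      # (M, K, d), a_t^(i) - a~^(j)
    K_mat = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))  # kappa(a_t^(i), a~^(j))
    grad_kappa = -(diffs * K_mat[:, :, None]) / h ** 2           # grad_{a'} kappa(a', a~^(j)) at a' = a_t^(i)
    # Delta f(a~^(j)) = E_{a_t}[ kappa(a_t, a~) grad_a Q + alpha grad_a kappa(a, a~) ]
    return (K_mat.T @ grad_q + alpha * grad_kappa.sum(axis=0)) / M

# Toy usage with random arrays standing in for network outputs.
rng = np.random.default_rng(0)
a, a_fixed = rng.normal(size=(32, 2)), rng.normal(size=(16, 2))
delta_f = soft_q_svgd_perturbation(a, a_fixed, grad_q=-a)        # e.g. Q(s, a) = -||a||^2 / 2
print(delta_f.shape)                                             # (16, 2): one perturbation per fixed action
```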

Soft Q-Learning: Sampling from the Soft Q-function

The Stein variational gradient can be backpropagated into the sampling network using
$$\frac{\partial J_\pi(\phi; s_t)}{\partial \phi} \propto \mathbb{E}_{\xi}\!\left[ \Delta f^\phi(\xi; s_t) \, \frac{\partial f^\phi(\xi; s_t)}{\partial \phi} \right]$$

And so we have the empirical estimate
$$\hat{\nabla}_\phi J_\pi(\phi; s_t) = \frac{1}{KM} \sum_{j=1}^{K} \sum_{i=1}^{M} \left( \kappa\big(a_t^{(i)}, \tilde{a}_t^{(j)}\big)\, \nabla_{a'} Q^\theta_{\text{soft}}(s_t, a') \big|_{a' = a_t^{(i)}} + \alpha \, \nabla_{a'} \kappa\big(a', \tilde{a}_t^{(j)}\big) \big|_{a' = a_t^{(i)}} \right) \nabla_\phi f^\phi\big(\tilde{\xi}^{(j)}; s_t\big)$$

The ultimate update direction $\hat{\nabla}_\phi J_\pi(\phi)$ is the average of $\hat{\nabla}_\phi J_\pi(\phi; s_t)$ over a mini-batch sampled from replay memory.

Soft Q-Learning: Algorithm

Algorithm 3: Soft Q-Learning

for each epoch do
    for each t do
        Collect experience:
            $a_t \leftarrow f^\phi(\xi, s_t)$ where $\xi \sim \mathcal{N}(0, I)$
            $s_{t+1} \sim p(s_{t+1} \mid a_t, s_t)$
            $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$
        Sample a minibatch from replay memory: $\{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)}, s_{t+1}^{(i)})\}_{i=0}^{N} \sim \mathcal{D}$
        Update soft Q-function parameters:
            Sample $\{a^{(i,j)}\}_{j=0}^{M} \sim q$ for each $s_{t+1}^{(i)}$
            Compute $\hat{V}^{\bar\theta}_{\text{soft}}(s_{t+1}^{(i)})$ and $\hat{\nabla}_\theta J_Q$; update $\theta$ using ADAM
        Update policy:
            Sample $\{\xi^{(i,j)}\}_{j=0}^{M} \sim \mathcal{N}(0, I)$ for each $s_t^{(i)}$
            Compute actions $a_t^{(i,j)} = f^\phi(\xi^{(i,j)}, s_t^{(i)})$
            Compute $\Delta f^\phi$ and $\hat{\nabla}_\phi J_\pi$; update $\phi$ using ADAM
    end
    if epoch mod update_interval = 0 then
        Update target parameters: $\bar\theta \leftarrow \theta$, $\bar\phi \leftarrow \phi$
    end
end

Soft Q-Learning: Results [result figures not reproduced in the transcription]

Outline
1. Background
2. Stein Variational Policy Gradient
3. Soft Q-Learning
4. Closing Thoughts/Takeaways