EM-based Reinforcement Learning

1 EM-based Reinforcement Learning
Gerhard Neumann, TU Darmstadt, Intelligent Autonomous Systems
December 21, 2011

2 Outline: Expectation Maximization (EM)-based Reinforcement Learning
- Recap: Modelling data with Maximum Likelihood
- Expectation Maximization
- EM for RL
- Applications

3 Why should we use probabilities for RL?
- Reinforcement learning in continuous state and action spaces is a hard problem
- Value functions are hard to estimate in continuous spaces
- Many RL methods rely on discretizations of the state space, the action space, or both

4 Why should we use probabilities for RL?
However:
- Many probabilistic inference algorithms can be used in continuous spaces: Gaussians, Mixtures of Gaussians, Linear Gaussian Models, Gaussian Processes
- We know how to estimate these distributions from data
- Can we use probabilistic inference for inferring a policy?

5 Quick Recap: Fun from high school...
Definitions:
- Marginal distribution: $P(X) = \sum_Y P(X, Y)$
- Conditional distribution: $P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$

6 Quick Recap: Fun from high school...
Implications:
- Product rule: $P(X, Y) = P(X \mid Y)\, P(Y) = P(Y \mid X)\, P(X)$
- Chain rule: $P(X_1, \dots, X_n) = \prod_i P(X_i \mid X_1, \dots, X_{i-1})$
- Bayes rule: $P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$

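7 Quick Recap
Gaussian distribution:
  $P(x \mid \theta) = \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Parameters $\theta$: $\mu$ ... mean, $\Sigma$ ... covariance matrix ($k$ is the dimension of $x$)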

8 Recap: Modelling our data
We are given a set of data points $y_i$ and we want to estimate a generative model $P(y_i; \theta)$ for these data points.

9 Recap: Modelling our data
Maximum Likelihood Solutions
- We want to find the parameters $\theta$ that maximize the likelihood $P(Y; \theta)$ of the data $y_i$ (assuming independent data points):
  $\arg\max_\theta P(y_{1:N}; \theta) = \arg\max_\theta \prod_{i=1}^{N} P(y_i; \theta)$
- This is often easier in log-space:
  $\arg\max_\theta \log P(y_{1:N}; \theta) = \arg\max_\theta \sum_{i=1}^{N} \log P(y_i; \theta)$
- A piece of cake for all distributions from the exponential family (e.g. Gaussian)
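A practical reason to prefer log-space, beyond easier derivatives, is numerical: a product of many small densities underflows in floating point, while the sum of log-densities stays finite. The following is a minimal sketch; the data (`ys`) and the standard-normal model are made up for illustration and do not come from the lecture.

```python
import math

def gauss_log_pdf(y, mu, var):
    """Log-density of a 1-D Gaussian N(y | mu, var)."""
    return -0.5 * (y - mu) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)

# Hypothetical data set: 2000 deterministic points in [-1, 1].
ys = [math.sin(i) for i in range(2000)]

# Likelihood as a plain product: every factor is < 1, so the product underflows.
likelihood = 1.0
for y in ys:
    likelihood *= math.exp(gauss_log_pdf(y, 0.0, 1.0))

# Log-likelihood as a sum: stays finite and usable for optimization.
log_likelihood = sum(gauss_log_pdf(y, 0.0, 1.0) for y in ys)

print(likelihood, log_likelihood)
```

The product collapses to zero even though no single factor is zero, so comparing two parameter settings by their raw likelihoods would be impossible here.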

10 Recap: Modelling our data
E.g. Gaussian distribution
- Given: set of data points $\{x_i\}_{i=1 \dots N}$
- Estimate parameters: $\mu = \frac{\sum_i x_i}{N}$, $\Sigma = \frac{\sum_i (x_i - \mu)(x_i - \mu)^T}{N}$
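As a small worked instance of these estimates, here is a sketch for one-dimensional data, where the covariance matrix reduces to a scalar variance. The function name `gaussian_ml` and the four data points are illustrative assumptions.

```python
def gaussian_ml(xs):
    """Maximum-likelihood mean and variance of 1-D data.

    Note the division by N (not N - 1): this is the ML estimate,
    not the unbiased sample variance.
    """
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

mu, var = gaussian_ml([1.0, 2.0, 3.0, 4.0])
print(mu, var)  # mean 2.5, variance 1.25
```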

11 Recap: Modelling our data with hidden variables
Often we are not given all information...
- E.g. missing data
- Mixture modelling / clustering: Which mixture component created the data?
- Reinforcement learning: Which trajectories create high reward?

12 Recap: Modelling our data with hidden variables
Maximum Likelihood Solutions with hidden variables $z$
- Given a model $P(y, z; \theta)$, find the $\theta$ which maximizes the likelihood of the data $y_i$:
  $\arg\max_\theta L(\theta)$, with $L(\theta) = \log P(y_{1:N}; \theta) = \sum_i \log P(y_i; \theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
- Since the data for the hidden variables $z$ is missing, we need to marginalize it out!

13 Recap: Modelling our data with hidden variables
Maximum Likelihood Solutions with hidden variables $z$
  $\arg\max_\theta L(\theta)$, with $L(\theta) = \log P(y_{1:N}; \theta) = \sum_i \log P(y_i; \theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
- Oohh... the log of a sum... are we doomed?!
- At least, no closed-form solution exists any more...

14 Outline: EM-based Reinforcement Learning
- Recap: Modelling data with Maximum Likelihood
- Expectation Maximization (EM)
- EM for RL
- Applications

15 Iterative Solution: Expectation-Maximization
Expectation-Maximization based algorithms alternate between two steps:
- (E)xpectation Step
- (M)aximization Step

16 Expectation Step
- Use a proposal distribution $P_i(z)$ over the hidden variables
- What is my belief over the hidden variables given the current model $\theta^{(t-1)}$ and the observation $y_i$?
- Calculate $P_i(z) = P(z \mid y_i; \theta^{(t-1)})$

17 Maximization Step
- Weight the log-likelihood of the joint by the proposal distribution:
  $Q(\theta) = \sum_i \sum_z P_i(z) \log P(y_i, z; \theta)$
- Set $\theta^{(t)}$ to $\arg\max_\theta Q(\theta)$

18 Iterative Solution: EM
Comparison:
- Standard ML solution: $\arg\max_\theta L(\theta)$, with $L(\theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
- M-Step: $\arg\max_\theta Q(\theta)$, with $Q(\theta) = \sum_i \sum_z P_i(z) \log P(y_i, z; \theta)$
Magic of EM:
- Transforms the log of a sum into a sum of logs
- For many models (e.g. Gaussian mixtures), the E- and the M-step can be solved in closed form!
- Each EM iteration is guaranteed to increase the log-likelihood $L(\theta)$ or leave it unchanged
- Thus the algorithm converges, in general only to a local maximum
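One standard way to see why an EM iteration cannot decrease $L(\theta)$ is Jensen's inequality. The derivation below is a sketch in the notation of the slides; it is the usual textbook argument, not something taken from the lecture itself.

```latex
\begin{align*}
L(\theta) &= \sum_i \log \sum_z P(y_i, z; \theta)
           = \sum_i \log \sum_z P_i(z)\, \frac{P(y_i, z; \theta)}{P_i(z)} \\
          &\ge \sum_i \sum_z P_i(z) \log \frac{P(y_i, z; \theta)}{P_i(z)}
           = Q(\theta) + H,
  \qquad H = -\sum_i \sum_z P_i(z) \log P_i(z).
\end{align*}
% H does not depend on \theta. With the E-step choice
% P_i(z) = P(z \mid y_i; \theta^{(t-1)}) the bound is tight at \theta^{(t-1)}, so
\begin{align*}
L(\theta^{(t)}) \;\ge\; Q(\theta^{(t)}) + H \;\ge\; Q(\theta^{(t-1)}) + H \;=\; L(\theta^{(t-1)}).
\end{align*}
```

The E-step makes the lower bound touch $L$ at the current parameters, and the M-step then maximizes that bound, which is why no step size is involved.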

19 Example: Gaussian Mixture Models
The distribution is composed of $K$ Gaussian components:
  $P(y) = \sum_{k=1}^{K} P(k)\, P(y \mid k) = \sum_{k=1}^{K} c_k\, \mathcal{N}(y \mid \mu_k, \Sigma_k)$
Parameters $\theta$: $c_k$ ... mixture coefficients, $\mu_k$ ... mean, $\Sigma_k$ ... covariance

20 Hidden variable k
- We do not know which component $k$ created our data
- Joint distribution: $P(y, k) = c_k\, \mathcal{N}(y \mid \mu_k, \Sigma_k)$
- If we knew $k$, the task would be easy...

21 EM for Gaussian Mixture Models
Expectation Step: Calculate the probability that component $k$ created data point $y_i$
  $P_i(k) = P(k \mid y_i) = \frac{P(y_i, k; \theta)}{\sum_{k'} P(y_i, k'; \theta)}$
  Called responsibilities...
Maximization Step:
  $\arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_i \sum_k P_i(k) \log P(y_i, k)$
  Each mixture component can be optimized independently!

22 EM for Gaussian Mixture Models
Expectation Step: Calculate the probability that component $k$ created data point $y_i$
  $P_i(k) = P(k \mid y_i) = \frac{P(y_i, k; \theta)}{\sum_{k'} P(y_i, k'; \theta)}$
  Called responsibilities...
Maximization Step:
  $\arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_i \sum_k P_i(k) \log P(y_i, k) = \arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_k \sum_i P_i(k) \log P(y_i, k)$
  Each mixture component can be optimized independently!

23 EM for Gaussian Mixture Models
Each mixture component can be optimized independently!
  $\arg\max_{\{c_k, \mu_k, \Sigma_k\}} \sum_i P_i(k) \big(\log \mathcal{N}(y_i \mid \mu_k, \Sigma_k) + \log c_k\big)$
Comparison: Maximum-Likelihood (ML) problem of a single Gaussian
  $\arg\max_{\{\mu, \Sigma\}} \sum_i \log \mathcal{N}(y_i \mid \mu, \Sigma)$
Weighted ML solution: $P_i(k)$ defines a weighting of each data point

24 EM for Gaussian Mixture Models
Comparison: ML solution for a single Gaussian
  $\mu = \frac{\sum_j y_j}{N}$, $\Sigma = \frac{\sum_j (y_j - \mu)(y_j - \mu)^T}{N}$
M-Step: Weighted ML solution
  $\mu_k = \frac{\sum_j P_j(k)\, y_j}{\sum_j P_j(k)}$, $\Sigma_k = \frac{\sum_j P_j(k)\, (y_j - \mu_k)(y_j - \mu_k)^T}{\sum_j P_j(k)}$
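The E- and M-step for a Gaussian mixture can be sketched in a few lines for one-dimensional data. This is a minimal illustration under stated assumptions, not the lecture's demo code: the function names, the synthetic data and the initial values are made up, and the mixture-coefficient update $c_k = \frac{1}{N} \sum_j P_j(k)$ is the standard result, which the slides do not show explicitly.

```python
import math
import random

def normal_pdf(y, mu, var):
    """Density of a 1-D Gaussian N(y | mu, var)."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(ys, mus, variances, coeffs, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    mus, variances, coeffs = list(mus), list(variances), list(coeffs)
    K = len(mus)
    for _ in range(iterations):
        # E-step: responsibilities P_i(k) = P(y_i, k) / sum_k' P(y_i, k')
        resp = []
        for y in ys:
            joint = [coeffs[k] * normal_pdf(y, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted ML solution, one component at a time
        for k in range(K):
            n_k = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * y for r, y in zip(resp, ys)) / n_k
            variances[k] = sum(r[k] * (y - mus[k]) ** 2 for r, y in zip(resp, ys)) / n_k
            coeffs[k] = n_k / len(ys)  # standard update, assumed here
    return mus, variances, coeffs

# Hypothetical data: two well-separated clusters around -2 and 3.
rng = random.Random(0)
ys = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
mus, variances, coeffs = em_gmm_1d(ys, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(mus, variances, coeffs)
```

A more careful implementation would compute the responsibilities in log-space and guard against a component collapsing onto a single data point; both are left out here to keep the correspondence with the slide formulas visible.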

25 EM for Gaussian Mixture Models Example: From Bishop book

26 EM in a nutshell
- EM can be used whenever we need to deal with hidden/unobserved variables
- Iteratively apply the E- and the M-step
  - Both can often be carried out in closed form
  - No learning rates or similar tuning parameters are needed!
- Uses a proposal distribution over the hidden variables
  - Belief over the hidden variables using the current model
  - Used as a weighting in the joint log-likelihood

27 Outline: Expectation Maximization (EM)-based Reinforcement Learning
- Recap: Modelling data with Maximum Likelihood
- Expectation Maximization
- EM for RL
- Applications

28 EM for Reinforcement Learning
OK, nice, but how can I use that for robot learning?
- Model RL as a Maximum Likelihood problem!
- Observed variable: we want to observe a reward event
  $P(R = 1 \mid \tau) \propto \exp(\beta R_\tau)$
- $R$ ... binary event of observing a reward, $\beta$ ... (inverse) temperature of the distribution
- $\tau$ ... trajectory, $R_\tau$ ... reward of the trajectory
- A common approach to transform a reward into a distribution
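To get a feeling for the role of $\beta$, the sketch below turns a handful of trajectory returns into normalized weights proportional to $\exp(\beta R_\tau)$. The function name, the example returns and the normalization to sum one are assumptions made for illustration; the slide only specifies the weights up to proportionality.

```python
import math

def reward_weights(returns, beta):
    """Normalized weights w_j proportional to exp(beta * R_j)."""
    # Subtracting the maximum avoids overflow and does not change the
    # normalized weights.
    r_max = max(returns)
    unnormalized = [math.exp(beta * (r - r_max)) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

returns = [-3.0, -1.0, -0.5]  # hypothetical returns of three trajectories
print(reward_weights(returns, beta=0.1))   # nearly uniform weights
print(reward_weights(returns, beta=10.0))  # almost all mass on the best one
```

A small $\beta$ treats all trajectories almost equally, while a large $\beta$ concentrates nearly all weight on the best sample.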

29 EM for Reinforcement Learning Example for reward distribution : Matlab...

30 EM for Reinforcement Learning
- Observed variable: reward event, $P(R = 1 \mid \tau) \propto \exp(\beta R_\tau)$
- Hidden variable: we do not know which trajectories generated the reward event
- Model for trajectories: $P(\tau; \theta)$. It contains our policy:
  $P(\tau; \theta) = P(s_0) \prod_t P(s_t \mid a_{t-1}, s_{t-1})\, \pi(a_{t-1} \mid s_{t-1}; \theta)$
- We want to find a $\theta$ which gives high reward!

31 EM for Reinforcement Learning
We want to find a $\theta$ which gives high reward!
- Joint distribution: $P(R, \tau; \theta) = P(R \mid \tau)\, P(\tau; \theta)$
- We want to maximize the log-likelihood of our observation (getting a reward):
  $\arg\max_\theta \log P(R = 1; \theta) = \arg\max_\theta \log \int_\tau P(R = 1 \mid \tau)\, P(\tau; \theta)\, d\tau$


33 EM for Reinforcement Learning
We want to maximize the log-likelihood of our observation (getting a reward):
  $\log P(R) = \log \int_\tau P(R \mid \tau)\, P(\tau; \theta)\, d\tau$
- High-dimensional trajectory space: the integral over all trajectories is intractable
- Are we doomed again?
- No... EM can help us out

34 EM for Reinforcement Learning
EM can help us out
- Use a proposal distribution $P(\tau)$ over trajectories
- E-step: Estimate the probability that trajectory $\tau$ has created the reward event:
  $P(\tau) = P(\tau \mid R; \theta^{(t-1)}) = \frac{P(R \mid \tau)\, P(\tau; \theta^{(t-1)})}{P(R; \theta^{(t-1)})} \propto P(R \mid \tau)\, P(\tau; \theta^{(t-1)})$
- $P(\tau)$ is also called the reward-weighted model distribution.

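35 EM for Reinforcement Learning
M-step: $\theta^{(t)} = \arg\max_\theta Q(\theta)$, with
  $Q(\theta) = \int_\tau P(\tau) \log\big(P(R \mid \tau)\, P(\tau; \theta)\big)\, d\tau = \frac{1}{P(R; \theta^{(t-1)})} \int_\tau P(R \mid \tau)\, P(\tau; \theta^{(t-1)}) \log P(\tau; \theta)\, d\tau + \text{const}$
- The term $\log P(R \mid \tau)$ does not depend on $\theta$ and goes into the constant; the normalizer $P(R; \theta^{(t-1)})$ only rescales the objective
- If we use samples $\tau_j \sim P(\tau; \theta^{(t-1)})$, this integral can be efficiently approximated (up to a constant factor):
  $Q(\theta) \approx \sum_{\tau_j} P(R \mid \tau_j) \log P(\tau_j; \theta)$
- This is again just the weighted maximum likelihood solution
- Each trajectory is weighted by its reward probability $\exp(\beta R_\tau)$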

36 Summary: EM for Reinforcement Learning
- Start with an initial distribution $P(\tau; \theta^{(0)})$
- For $t = 1 \dots L$:
  - Sample $N$ trajectories from $P(\tau; \theta^{(t-1)})$
  - Weight each trajectory by the probability $w_i \propto \exp(\beta R_{\tau_i})$ that it created the reward event
  - Estimate the new model parameters $\theta^{(t)}$ by a weighted maximum likelihood estimate

37 Illustration: 1-step RL Problem
- 2-dimensional action space, no states
- Reward function: $r(a) = -(a - a^*)^T D\, (a - a^*)$
- Show Matlab demo...
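The EM loop (sample, weight by $\exp(\beta r)$, refit by weighted maximum likelihood) can be tried on a stateless problem of this kind with a short script. This is a sketch under several assumptions and not the Matlab demo from the lecture: the target action, the diagonal matrix $D$, the value of $\beta$, the diagonal Gaussian policy and the small variance floor are all made-up choices.

```python
import math
import random

def reward(a, target=(1.0, -2.0), d=(1.0, 4.0)):
    """r(a) = -(a - a*)^T D (a - a*) with a diagonal D (hypothetical values)."""
    return -sum(di * (ai - ti) ** 2 for ai, ti, di in zip(a, target, d))

def em_policy_search(iterations=30, n_samples=500, beta=1.0, seed=0):
    """Reward-weighted ML updates of a diagonal Gaussian policy over actions."""
    rng = random.Random(seed)
    mu = [0.0, 0.0]   # initial policy mean
    var = [4.0, 4.0]  # initial policy variance per action dimension
    for _ in range(iterations):
        # Sample actions from the current policy.
        actions = [[rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]
                   for _ in range(n_samples)]
        # Weight each sample by exp(beta * r); shifting by the maximum reward
        # avoids overflow and cancels in the normalization below.
        rewards = [reward(a) for a in actions]
        r_max = max(rewards)
        w = [math.exp(beta * (r - r_max)) for r in rewards]
        w_sum = sum(w)
        # Weighted maximum-likelihood estimate of mean and variance.
        for dim in range(2):
            mu[dim] = sum(wi * a[dim] for wi, a in zip(w, actions)) / w_sum
            var[dim] = sum(wi * (a[dim] - mu[dim]) ** 2
                           for wi, a in zip(w, actions)) / w_sum
            var[dim] += 1e-6  # small floor against collapse (assumption)
    return mu, var

mu, var = em_policy_search()
print(mu, var)
```

With these settings the policy mean should move towards the assumed target $(1, -2)$ while the variance shrinks. A large $\beta$ combined with few samples lets a handful of samples dominate the update, and the policy can then converge prematurely.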

38 Problems: 1-step RL Problem
- 2-dimensional action space, multi-modal solution space
- Reward function: $r(a) = \max\big(-(a - a^*_1)^T D\, (a - a^*_1),\; -(a - a^*_2)^T D\, (a - a^*_2)\big)$
- Show Matlab demo...
- Current master thesis of Chris

39 Using Linear Features...
2 different models have been used, both with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$:
- Reward-Weighted Regression (RWR): $a = \theta^T \phi(s) + \epsilon$. Add noise to the action vector...
- Policy-learning by Weighting Exploration with Returns (PoWER): $a = (\theta + \epsilon)^T \phi(s)$. Add noise to the parameter vector...

40 Linear Feature Representations
2 different models have been used, both with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$:
- Reward-Weighted Regression (RWR): $a = \theta^T \phi(s) + \epsilon$. Add noise to the action vector...
- Policy-learning by Weighting Exploration with Returns (PoWER): $a = (\theta + \epsilon)^T \phi(s)$. Add noise to the parameter vector...
- Both will be covered in more detail by Jan...
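The difference between the two exploration schemes is only where the Gaussian noise enters. The sketch below makes this explicit for a scalar action; the polynomial feature function and all names are hypothetical, and how often the parameter perturbation is resampled (per step or per episode) is deliberately left open.

```python
import random

def features(s):
    """Hypothetical feature vector phi(s) for a scalar state."""
    return [1.0, s, s * s]

def linear_policy(theta, s):
    """Noise-free action theta^T phi(s)."""
    return sum(t * p for t, p in zip(theta, features(s)))

def rwr_action(theta, s, sigma, rng):
    """RWR-style exploration: noise is added to the action."""
    return linear_policy(theta, s) + rng.gauss(0.0, sigma)

def power_params(theta, sigma, rng):
    """PoWER-style exploration: noise is added to the parameter vector."""
    return [t + rng.gauss(0.0, sigma) for t in theta]

rng = random.Random(0)
theta = [1.0, 2.0, 3.0]
a_rwr = rwr_action(theta, 2.0, 0.1, rng)
a_power = linear_policy(power_params(theta, 0.1, rng), 2.0)
print(a_rwr, a_power)
```

With parameter noise, the effective perturbation of the action scales with the features $\phi(s)$, whereas action noise has the same magnitude in every state.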

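41 Reward Weighted Regression
- Model for the policy: $a = \theta^T \phi(s) + \epsilon$, i.e. $\pi(a \mid s; \theta) = \mathcal{N}(a \mid \theta^T \phi(s), \sigma^2 I)$
- In the M-step, maximizing the reward-weighted log-likelihood is equivalent to minimizing the reward-weighted squared error:
  $\arg\min_\theta \sum_j \exp(\beta r_j)\, \big(a_j - \theta^T \phi(s_j)\big)^2$
- Looks familiar...?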

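42 Reward Weighted Regression
- In the M-step, we minimize the reward-weighted squared error:
  $\arg\min_\theta \sum_j \exp(\beta r_j)\, \big(a_j - \theta^T \phi(s_j)\big)^2$
- This is just a weighted linear regression problem!
  $\theta = (\Phi^T R\, \Phi)^{-1} \Phi^T R\, A$
  with $\Phi = [\phi(s_1), \phi(s_2), \dots, \phi(s_N)]^T$, $R = \mathrm{diag}(\exp(\beta r_j))$ (the reward weights), $A = [a_1, a_2, \dots, a_N]^T$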
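For a scalar state with features $\phi(s) = [1, s]^T$ and scalar actions, the weighted regression solution reduces to a 2x2 linear system that can be solved by hand. The sketch below does exactly that; the feature choice, the function name and the toy data are assumptions for illustration, and the diagonal weights are taken to be $\exp(\beta r_j)$.

```python
import math

def rwr_update(states, actions, rewards, beta):
    """Solve theta = (Phi^T W Phi)^{-1} Phi^T W A for phi(s) = [1, s]."""
    # Reward weights; the shift by the maximum reward only rescales all
    # weights by a common factor and leaves the solution unchanged.
    r_max = max(rewards)
    w = [math.exp(beta * (r - r_max)) for r in rewards]
    # Entries of the 2x2 matrix Phi^T W Phi ...
    s0 = sum(w)
    s1 = sum(wi * s for wi, s in zip(w, states))
    s2 = sum(wi * s * s for wi, s in zip(w, states))
    # ... and of the vector Phi^T W A.
    b0 = sum(wi * a for wi, a in zip(w, actions))
    b1 = sum(wi * s * a for wi, s, a in zip(w, states, actions))
    # Closed-form inverse of the 2x2 system.
    det = s0 * s2 - s1 * s1
    theta0 = (s2 * b0 - s1 * b1) / det
    theta1 = (s0 * b1 - s1 * b0) / det
    return theta0, theta1

# Toy data: actions lie exactly on the line a = 0.5 + 2 s.
states = [0.0, 1.0, 2.0, 3.0]
actions = [0.5, 2.5, 4.5, 6.5]
rewards = [-1.0, -0.2, -0.5, -3.0]
print(rwr_update(states, actions, rewards, beta=1.0))
```

Because the toy actions are exactly linear in the state, any positive weighting recovers the same line; with noisy data the weights pull the fit towards the high-reward samples.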

43 Things you can do... Ball in the Cup EM-based Reinforcement Learning Robot Learning, WS 2011

44 Things you can do... Dart : Playing around the clock

45 Things you can do... Robot Balancing for different forces...

46 Extensions / Not covered...
- Similar EM-based approach to estimate the V-function (Neumann & Peters, 2009)
- Variational inference approach which has better properties in case of a multi-modal solution space (Neumann, 2011)
- How to choose $\beta$?
- Similar, but better: Relative Entropy Policy Search (REPS) (Peters et al., 2010), which bounds the distance between two subsequent policies

47 Possible Projects / Bachelor Thesis...
Let's play table tennis...!
- Final setup: 2 robots playing against each other...
- We will also get the real robots...

48 Let's play table tennis...!
Use EM-based algorithms for...
- Learning when to intercept the ball
- Learning to smash
- Learning to stop the ball
- Learning to play the ball with spin

49 The end Thanks for your attention!

50 Bibliography I
Neumann, G., & Peters, J. (2009). Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). MA: MIT Press.
Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations. In: Getoor, L., & Scheffer, T. (eds), Proceedings of the 28th International Conference on Machine Learning (ICML-11). New York, NY, USA: ACM.

51 Bibliography II
Peters, J., Mülling, K., & Altun, Y. (2010). Relative Entropy Policy Search. In: AAAI.
