Using Expectation-Maximization for Reinforcement Learning

NOTE
Communicated by Andrew Barto and Michael Jordan

Peter Dayan
Department of Brain and Cognitive Sciences, Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Geoffrey E. Hinton
Department of Computer Science, University of Toronto, Toronto M5S 1A4, Canada

Neural Computation 9, 271-278 (1997). © 1997 Massachusetts Institute of Technology

We discuss Hinton's (1989) relative payoff procedure (RPP), a static reinforcement learning algorithm whose foundation is not stochastic gradient ascent. We show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters. The proof is based on a mapping between the RPP and a form of the expectation-maximization procedure of Dempster, Laird, and Rubin (1977).

1 Introduction

Consider a stochastic learning automaton (e.g., Narendra & Thatachar 1989) whose actions $y$ are drawn from a set $\mathcal{Y}$. This could, for instance, be the set of $2^n$ choices over $n$ separate binary decisions. If the automaton maintains a probability distribution $p(y \mid \theta)$ over these possible actions, where $\theta$ is a set of parameters, then its task is to learn values of the parameters $\theta$ that maximize the expected payoff:

$$\rho(\theta) = \big\langle E[r \mid y] \big\rangle_{p(y \mid \theta)} = \sum_{y \in \mathcal{Y}} p(y \mid \theta)\, E[r \mid y]. \qquad (1.1)$$

Here, $E[r \mid y]$ is the expected reward for performing action $y$. Apart from random search, simulated annealing, and genetic algorithms, almost all the methods with which we are familiar for attempting to choose appropriate $\theta$ in such domains are local, in the sense that they make small steps, usually in a direction that bears some relation to the gradient. Examples include the Kiefer-Wolfowitz procedure (Wasan 1969), the $A_{R\text{-}P}$ algorithm (Barto & Anandan 1985), and the REINFORCE framework (Williams 1992).

There are two reasons to make small steps. One is that there is a noisy estimation problem: typically the automaton will emit single actions $y_1, y_2, \ldots$ according to $p(y \mid \theta)$, will receive single samples of the reward from the distributions $p(r \mid y_m)$, and will have to average the gradient over these noisy values. The other reason is that $\theta$ might have a complicated effect on which actions are chosen, or the relationship between actions and rewards might be obscure.
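For a finite action set, the expectation in equation 1.1 can be evaluated exactly by enumeration. The minimal sketch below (Python; the names and the softmax parameterization of $p(y \mid \theta)$ are our own illustrative choices, not the paper's) sets up such an automaton with known expected rewards:

```python
import numpy as np

# A tiny stochastic learning automaton: a finite action set Y, a parameterized
# distribution p(y|theta), and known expected rewards E[r|y]. The softmax form
# of p(y|theta) is an assumption made only for this illustration.
actions = ["y0", "y1", "y2"]
expected_reward = {"y0": 0.2, "y1": 1.0, "y2": 0.5}   # E[r|y], assumed known

def policy(theta):
    """p(y|theta): softmax over one real-valued score per action."""
    z = np.exp(theta - np.max(theta))
    return dict(zip(actions, z / z.sum()))

def expected_payoff(theta):
    """rho(theta) = sum_y p(y|theta) E[r|y], as in equation 1.1."""
    p = policy(theta)
    return sum(p[y] * expected_reward[y] for y in actions)

print(expected_payoff(np.zeros(len(actions))))   # about 0.567 with equal probabilities
```

The rest of the note works with a factored binary parameterization, introduced next, for which the same enumeration is still possible but the dependence of $\rho$ on the parameters is more tangled.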

For instance, if $\mathcal{Y}$ is the set of $2^n$ choices over $n$ separate binary actions $a_1, \ldots, a_n$, and $\theta = \{p_1, \ldots, p_n\}$ is the collection of probabilities of choosing $a_i = 1$, so

$$p(y = \{a_1, \ldots, a_n\} \mid \theta) = \prod_{i=1}^{n} p_i^{a_i} (1 - p_i)^{1 - a_i}, \qquad (1.2)$$

then the average reward $\rho(\theta)$ depends in a complicated manner on the collection of $p_i$. Gradient ascent in $\theta$ would seem the only option, because taking large steps might lead to decreases in the average reward.

The sampling problem is indeed present, and we will evade it by assuming a large batch size. However, we show that in circumstances such as the $n$ binary action task, it is possible to make large, well-founded changes to the parameters without explicitly estimating the curvature of the space of expected payoffs, by a mapping onto a maximum likelihood probability density estimation problem. In effect, we maximize reward by solving a sequence of probability matching problems, where $\theta$ is chosen at each step to match as best it can a fictitious distribution determined by the average rewards experienced on the previous step. Although there can be large changes in $\theta$ from one step to the next, we are guaranteed that the average reward is monotonically increasing. The guarantee comes for exactly the same reason as in the expectation-maximization (EM) algorithm (Baum et al. 1970; Dempster et al. 1977) and, as with EM, there can be local optima. The relative payoff procedure (RPP) (Hinton 1989) is a particular reinforcement learning algorithm for the $n$ binary action task with positive $r$, which makes large moves in the $p_i$. Our proof demonstrates that the RPP is well founded.

2 Theory

The RPP operates to improve the parameters $p_1, \ldots, p_n$ of equation 1.2 in a synchronous manner based on substantial sampling. It suggests updating the probability of choosing $a_i = 1$ to

$$p_i' = \frac{\big\langle a_i\, E[r \mid y] \big\rangle_{p(y \mid \theta)}}{\big\langle E[r \mid y] \big\rangle_{p(y \mid \theta)}}, \qquad (2.1)$$

which is the ratio of the mean reward that accrues when $a_i = 1$ to the net mean reward. If all the rewards $r$ are positive, then $0 \le p_i' \le 1$. This note proves that when using the RPP, the expected reinforcement increases; that is, $\big\langle E[r \mid y] \big\rangle_{p(y \mid \theta')} \ge \big\langle E[r \mid y] \big\rangle_{p(y \mid \theta)}$, where $\theta' = \{p_1', \ldots, p_n'\}$.
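The following sketch (our code; names are hypothetical) makes these definitions concrete under the assumption that the expected rewards $E[r \mid y]$ are known exactly and positive: it evaluates $\rho(\theta)$ for the factored policy of equation 1.2 by enumerating all $2^n$ actions and then applies the synchronous RPP update of equation 2.1 repeatedly.

```python
import itertools
import numpy as np

def action_prob(y, p):
    """p(y|theta) for the factored binary policy of equation 1.2."""
    return float(np.prod(np.where(y == 1, p, 1.0 - p)))

def expected_payoff(p, reward):
    """rho(theta) of equation 1.1, by enumerating all 2^n actions."""
    return sum(action_prob(np.array(y), p) * reward[y]
               for y in itertools.product((0, 1), repeat=len(p)))

def rpp_update(p, reward):
    """One synchronous RPP step (equation 2.1): the new p_i is the fraction
    of the mean reward that arrives on actions with a_i = 1."""
    numer, denom = np.zeros(len(p)), 0.0
    for y in itertools.product((0, 1), repeat=len(p)):
        w = action_prob(np.array(y), p) * reward[y]
        numer += np.array(y) * w
        denom += w
    return numer / denom

# Toy problem: n = 3 binary actions with arbitrary positive expected rewards.
rng = np.random.default_rng(0)
n = 3
reward = {y: float(rng.exponential()) + 0.1
          for y in itertools.product((0, 1), repeat=n)}
p = np.full(n, 0.5)
for _ in range(10):
    print(round(expected_payoff(p, reward), 4))
    p = rpp_update(p, reward)
```

The printed sequence of expected payoffs should never decrease, which is exactly the property proved next; with stochastic rewards the exact expectations would be replaced by averages over a large batch of trials.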

The proof rests on the following observation: given a current value of $\theta$, if one could arrange that

$$\alpha\, E[r \mid y] = \frac{p(y \mid \theta')}{p(y \mid \theta)} \qquad (2.2)$$

for some $\alpha$, then $\theta'$ would lead to higher average returns than $\theta$. We prove this formally below, but the intuition is that if $E[r \mid y_1] > E[r \mid y_2]$, then

$$\frac{p(y_1 \mid \theta')}{p(y_1 \mid \theta)} > \frac{p(y_2 \mid \theta')}{p(y_2 \mid \theta)},$$

so $\theta'$ will put more weight on $y_1$ than $\theta$ does. We therefore pick $\theta'$ so that $p(y \mid \theta')$ matches the distribution $\alpha E[r \mid y]\, p(y \mid \theta)$ as closely as possible (using a Kullback-Leibler penalty). Note that this target distribution moves with $\theta$. Matching just $\beta E[r \mid y]$, something that animals can be observed to do under some circumstances (Gallistel 1990), does not result in maximizing average rewards (Sabes & Jordan 1995).

If the rewards $r$ are stochastic, then our method (and the RPP) does not eliminate the need for repeated sampling to work out the mean return. We assume knowledge of $E[r \mid y]$. Defining the distribution in equation 2.2 correctly requires $E[r \mid y] > 0$. Since maximizing $\rho(\theta)$ and $\rho(\theta) + \omega$ has the same consequences for $\theta$, we can add arbitrary constants to the rewards so that they are all positive. However, this can affect the rate of convergence.

We now show how an improvement in the expected reinforcement can be guaranteed:

$$\log \frac{\rho(\theta')}{\rho(\theta)} = \log \sum_{y \in \mathcal{Y}} \frac{p(y \mid \theta')\, E[r \mid y]}{\rho(\theta)} = \log \sum_{y \in \mathcal{Y}} \left[ \frac{p(y \mid \theta)\, E[r \mid y]}{\rho(\theta)} \right] \frac{p(y \mid \theta')}{p(y \mid \theta)} \qquad (2.3)$$

$$\ge \sum_{y \in \mathcal{Y}} \left[ \frac{p(y \mid \theta)\, E[r \mid y]}{\rho(\theta)} \right] \log \frac{p(y \mid \theta')}{p(y \mid \theta)}, \quad \text{by Jensen's inequality,}$$

$$= \frac{1}{\rho(\theta)} \big[ Q(\theta, \theta') - Q(\theta, \theta) \big], \qquad (2.4)$$

where

$$Q(\theta, \theta') = \sum_{y \in \mathcal{Y}} p(y \mid \theta)\, E[r \mid y] \log p(y \mid \theta'),$$

so if $Q(\theta, \theta') \ge Q(\theta, \theta)$, then $\rho(\theta') \ge \rho(\theta)$. The normalization step in equation 2.3 creates the matching distribution from equation 2.2. Given $\theta$, if $\theta'$ is chosen to maximize $Q(\theta, \theta')$, then we are guaranteed that $Q(\theta, \theta') \ge Q(\theta, \theta)$ and therefore that the average reward is nondecreasing.
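The bound in equations 2.3 and 2.4 is easy to check numerically. The sketch below (our code, hypothetical names) compares $\log(\rho(\theta')/\rho(\theta))$ with $[Q(\theta, \theta') - Q(\theta, \theta)]/\rho(\theta)$ for random pairs of parameter vectors of the factored policy, using positive rewards:

```python
import itertools
import numpy as np

def action_prob(y, p):
    return float(np.prod(np.where(y == 1, p, 1.0 - p)))

def rho(p, reward):
    return sum(action_prob(np.array(y), p) * reward[y]
               for y in itertools.product((0, 1), repeat=len(p)))

def Q(p, p_new, reward):
    """Q(theta, theta') = sum_y p(y|theta) E[r|y] log p(y|theta')."""
    return sum(action_prob(np.array(y), p) * reward[y]
               * np.log(action_prob(np.array(y), p_new))
               for y in itertools.product((0, 1), repeat=len(p)))

rng = np.random.default_rng(1)
n = 3
reward = {y: float(rng.exponential()) + 0.1
          for y in itertools.product((0, 1), repeat=n)}
for _ in range(1000):
    theta = rng.uniform(0.05, 0.95, n)
    theta_new = rng.uniform(0.05, 0.95, n)
    lhs = np.log(rho(theta_new, reward) / rho(theta, reward))
    rhs = (Q(theta, theta_new, reward) - Q(theta, theta, reward)) / rho(theta, reward)
    assert lhs >= rhs - 1e-9   # the inequality of equations 2.3-2.4
```

Because the inequality is an instance of Jensen's inequality, it holds for every such pair, so any $\theta'$ with $Q(\theta, \theta') \ge Q(\theta, \theta)$ indeed satisfies $\rho(\theta') \ge \rho(\theta)$.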

In the RPP, the new probability $p_i'$ for choosing $a_i = 1$ is given by

$$p_i' = \frac{\big\langle a_i\, E[r \mid y] \big\rangle_{p(y \mid \theta)}}{\big\langle E[r \mid y] \big\rangle_{p(y \mid \theta)}}, \qquad (2.5)$$

so it is the fraction of the average reward that arrives when $a_i = 1$. Using equation 1.2,

$$\frac{\partial Q(\theta, \theta')}{\partial p_i'} = \frac{1}{p_i'(1 - p_i')} \left[ \sum_{y \in \mathcal{Y}:\, a_i = 1} p(y \mid \theta)\, E[r \mid y] \; - \; p_i' \sum_{y \in \mathcal{Y}} p(y \mid \theta)\, E[r \mid y] \right].$$

So, if

$$p_i' = \frac{\sum_{y \in \mathcal{Y}:\, a_i = 1} p(y \mid \theta)\, E[r \mid y]}{\sum_{y \in \mathcal{Y}} p(y \mid \theta)\, E[r \mid y]},$$

then $\partial Q(\theta, \theta')/\partial p_i' = 0$, and it is readily seen that $Q(\theta, \theta')$ is maximized. But this condition is just that of equation 2.5. Therefore the RPP is monotonic in the average return.

Figure 1 shows the consequence of employing the RPP. Figure 1a shows the case in which $n = 2$; the two lines and associated points show how $p_1$ and $p_2$ change on successive steps using the RPP. The terminal value $p_1 = p_2 = 0$, reached by the left-hand line, is a local optimum. Note that the RPP always changes the parameters in the direction of the gradient of the expected amount of reinforcement (this is generally true) but by a variable amount. Figure 1b compares the RPP with (a deterministic version of) Williams's (1992) stochastic gradient ascent REINFORCE algorithm for a case with $n = 12$ and rewards $E[r \mid y]$ drawn from an exponential distribution. The RPP and REINFORCE were started at the same point; the graph shows the difference between the maximum possible reward and the expected reward after given numbers of iterations. To make a fair comparison between the two algorithms, we chose $n$ small enough that the exact averages in equation 2.1 and (for REINFORCE) the exact gradients,

$$\Delta p_i = \alpha\, \frac{\partial}{\partial p_i} \big\langle E[r \mid y] \big\rangle_{p(y \mid \theta)},$$

could be calculated. Figure 1b shows the (consequently smooth) course of learning for various values of the learning rate $\alpha$. We observe that for both algorithms the expected return never decreases (as guaranteed for the RPP but not for REINFORCE), that the course of learning is not completely smooth, with a large plateau in the middle, and that both algorithms get stuck in local minima.
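An experiment in the spirit of Figure 1b can be reproduced with a short script. The sketch below (our illustration, not the authors' original code; all names are hypothetical) runs the RPP against exact-gradient ascent on a random problem with exponentially distributed expected rewards, using a smaller $n = 8$ so that enumeration stays cheap:

```python
import itertools
import numpy as np

def all_actions(n):
    return [np.array(y) for y in itertools.product((0, 1), repeat=n)]

def probs(actions, p):
    return np.array([np.prod(np.where(y == 1, p, 1.0 - p)) for y in actions])

def rho(actions, p, r):
    return float(probs(actions, p) @ r)

def rpp_step(actions, p, r):
    w = probs(actions, p) * r                       # p(y|theta) E[r|y]
    A = np.array(actions, dtype=float)              # rows are action vectors
    return (w @ A) / w.sum()                        # equation 2.1 / 2.5

def gradient_step(actions, p, r, alpha):
    # Exact gradient of rho w.r.t. each p_i (no sampling noise, as in the text).
    grad = np.zeros(len(p))
    for y, w in zip(actions, probs(actions, p) * r):
        grad += w * (y / p - (1 - y) / (1 - p))
    return np.clip(p + alpha * grad, 1e-6, 1 - 1e-6)

rng = np.random.default_rng(2)
n = 8
actions = all_actions(n)
r = rng.exponential(size=2 ** n)                    # E[r|y] > 0
p_rpp = np.full(n, 0.5)
p_grad = np.full(n, 0.5)
for t in range(100):
    p_rpp = rpp_step(actions, p_rpp, r)
    p_grad = gradient_step(actions, p_grad, r, alpha=0.05)
print(r.max() - rho(actions, p_rpp, r), r.max() - rho(actions, p_grad, r))
```

Since the expectations are exact, both trajectories are deterministic; the final gaps to the maximum reward depend on the local optimum each method settles into and, for gradient ascent, on the choice of $\alpha$.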

The problem in Figure 1b is a best case for the RPP: only for a small learning rate, and consequently slow learning, does REINFORCE not get stuck in a worse local minimum. In other cases, there are values of $\alpha$ for which REINFORCE beats the RPP. However, there are no free parameters in the RPP, and it performs well across a variety of such problems.

3 Discussion

The analogy to EM can be made quite precise. EM is a maximum likelihood method for probability density estimation for a collection $\mathcal{X}$ of observed data, where underlying each point $x \in \mathcal{X}$ there can be a hidden variable $y \in \mathcal{Y}$. The density has the form

$$p(x \mid \theta) = \sum_{y \in \mathcal{Y}} p(y \mid \theta)\, p(x \mid y, \theta),$$

and we seek to choose $\theta$ to maximize

$$\sum_{x \in \mathcal{X}} \log p(x \mid \theta).$$

The E phase of EM calculates the posterior responsibilities $p(y \mid x, \theta)$ for each $y \in \mathcal{Y}$ for each $x$:

$$p(y \mid x, \theta) = \frac{p(y \mid \theta)\, p(x \mid y, \theta)}{\sum_{z \in \mathcal{Y}} p(z \mid \theta)\, p(x \mid z, \theta)}.$$

In our case there is no $x$, but the equivalent of this posterior distribution, which comes from equation 2.2, is

$$P_y(\theta) \equiv \frac{p(y \mid \theta)\, E[r \mid y]}{\sum_{z \in \mathcal{Y}} p(z \mid \theta)\, E[r \mid z]}.$$

The M phase of EM chooses parameters $\theta'$ in the light of this posterior distribution to maximize

$$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(y \mid x, \theta) \log p(x, y \mid \theta').$$

In our case this is exactly equivalent to minimizing the Kullback-Leibler divergence

$$\mathrm{KL}\big[ P_y(\theta),\, p(y \mid \theta') \big] = \sum_{y \in \mathcal{Y}} P_y(\theta) \log \frac{P_y(\theta)}{p(y \mid \theta')}.$$

Up to terms that do not affect $\theta'$, this is $Q(\theta, \theta')$. The Kullback-Leibler divergence between two distributions is a measure of the distance between them, and therefore minimizing it is a form of probability matching.
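The correspondence can be verified directly for the factored policy. In the sketch below (our code, hypothetical names), the "E step" forms the reward-weighted distribution $P_y(\theta)$ and the "M step" fits the product-of-Bernoullis policy to it by matching its marginals, which minimizes the Kullback-Leibler divergence within that family; the result coincides with the RPP update of equation 2.1:

```python
import itertools
import numpy as np

def action_prob(y, p):
    return float(np.prod(np.where(y == 1, p, 1.0 - p)))

def e_step(p, reward, actions):
    """The reward-weighted 'posterior' P_y(theta) over actions."""
    w = np.array([action_prob(np.array(y), p) * reward[y] for y in actions])
    return w / w.sum()

def m_step(P, actions):
    """Fit the product-of-Bernoullis policy to P_y(theta). Within that family,
    minimizing KL[P_y(theta), p(y|theta')] amounts to matching the marginals."""
    A = np.array(actions, dtype=float)              # rows are action vectors
    return P @ A                                    # p_i' = sum_y P_y(theta) a_i

def rpp_update(p, reward, actions):
    w = np.array([action_prob(np.array(y), p) * reward[y] for y in actions])
    A = np.array(actions, dtype=float)
    return (w @ A) / w.sum()                        # equation 2.1

rng = np.random.default_rng(3)
n = 4
actions = list(itertools.product((0, 1), repeat=n))
reward = {y: float(rng.exponential()) + 0.1 for y in actions}
p = rng.uniform(0.1, 0.9, n)
assert np.allclose(m_step(e_step(p, reward, actions), actions),
                   rpp_update(p, reward, actions))
```

In this sense each RPP step is an E step (form $P_y(\theta)$) followed by an M step (project it back onto the factored family), which is why the EM monotonicity argument carries over.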

Figure 1: Performance of the RPP. (a) Adaptation of $p_1$ and $p_2$ using the RPP from two different start points on the given $\rho(p_1, p_2)$. The points are successive values; the lines are joined for graphical convenience. (b) Comparison of the RPP with Williams's (1992) REINFORCE for a particular problem with $n = 12$. See the text for comments and details.

Our result is weak: the requirement for sampling from the distribution is rather restrictive, and we have not proved anything about the actual rate of convergence. The algorithm performs best (Sutton, personal communication) if the differences between the rewards are of the same order of magnitude as the rewards themselves (as a result of the normalization in equation 2.3). It uses multiplicative comparison rather than the subtractive comparison of Sutton (1984), Williams (1992), and Dayan (1990).

The link to the EM algorithm suggests that there may be reinforcement learning algorithms other than the RPP that make large changes to the values of the parameters for which similar guarantees about nondecreasing average rewards can be given. The most interesting extension would be to dynamic programming (Bellman 1957), where techniques for choosing (single-component) actions to optimize return in sequential decision tasks include two algorithms that make large changes on each step: value and policy iteration (Howard 1960). As various people have noted, the latter explicitly involves both estimation (of the value of a policy) and maximization (choosing a new policy in the light of the value of each state under the old one), although its theory is not at all described in terms of density modeling.

Acknowledgments

Support came from the Natural Sciences and Engineering Research Council and the Canadian Institute for Advanced Research (CIAR). GEH is the Nesbitt-Burns Fellow of the CIAR. We are grateful to Philip Sabes and Mike Jordan for the spur and to Mike Jordan and the referees for comments on an earlier version of this paper.

References

Barto, A. G., & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-374.

Baum, L. E., Petrie, E., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171.

Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.

Dayan, P. (1990). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 45-51). San Mateo, CA: Morgan Kaufmann.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society, 1-38.

Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185-234.

Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.

Narendra, K. S., & Thatachar, M. A. L. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice-Hall.

Sabes, P. N., & Jordan, M. I. (1995). Reinforcement learning by probability matching. In Advances in Neural Information Processing Systems, 8. Cambridge, MA: MIT Press.

Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished doctoral dissertation, University of Massachusetts, Amherst.

Wasan, M. T. (1969). Stochastic approximation. Cambridge: Cambridge University Press.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

Received October 2, 1995; accepted May 30, 1996.