Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference
Dr Sean Holden
Computer Laboratory, Room FC6
Telephone extension 6372
Email: sbh11@cl.cam.ac.uk
www.cl.cam.ac.uk/~sbh11/

Part VI
In a nutshell...
Unsupervised learning
Reinforcement learning

Copyright © Sean Holden 2002-17.

Unsupervised learning

Can we find regularity in data without the aid of labels?

[Figure: a scatter plot of unlabelled two-dimensional data.]

Is this one cluster? Or three? Or some other number?

The K-means algorithm

The example on the last slide was obtained using the classical K-means algorithm. Given a set $\{x_i\}$ of $m$ points, guess that there are $K$ clusters. Here $K = 3$.

Choose at random $K$ centre points $c_j$ for the clusters. Then iterate as follows:

1. Divide $\{x_i\}$ into $K$ clusters, so each point is associated with the closest centre:
$x_i \in C_j \iff \forall k \;\; \|x_i - c_j\| \leq \|x_i - c_k\|.$
Call these clusters $C_1, \ldots, C_K$.

2. Update the cluster centres to be the average of the associated points:
$c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i.$

[Figure: the actual data for 3 clusters, and the clustering obtained after 1, 2 and 3 iterations.]
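A minimal NumPy sketch of the two-step iteration just described; the synthetic three-cluster data, the fixed iteration count and all variable names are assumptions for illustration, not the example from the slides.

```python
import numpy as np

def k_means(X, K, n_iters=10, rng=np.random.default_rng(0)):
    """Classical K-means: X is an (m, d) array of points, K the guessed number of clusters."""
    # Choose K centres at random from the data.
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: associate each point with the closest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)  # (m, K)
        labels = dists.argmin(axis=1)
        # Step 2: update each centre to the mean of its associated points.
        for j in range(K):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels

# Example: three synthetic clusters, K = 3 as on the slide.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in [(-3, 0), (0, 3), (3, 0)]])
centres, labels = k_means(X, K=3)
print(centres)
```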

Clustering as maximum-likelihood

The modern approach is once again probabilistic. Data from $K$ clusters can be modelled probabilistically as
$p(x|\theta) = \sum_{k=1}^{K} \pi_k \, p(x|\mu_k, \Sigma_k)$
where $\theta = \{\pi, \mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K\}$ and typically $p(x|\mu, \Sigma) = \mathcal{N}(\mu, \Sigma)$.

This leads to a log-likelihood for $m$ points of
$\log p(X|\theta) = \log \prod_{i=1}^{m} p(x_i|\theta) = \sum_{i=1}^{m} \log p(x_i|\theta) = \sum_{i=1}^{m} \log \sum_{k=1}^{K} \pi_k \, p(x_i|\mu_k, \Sigma_k).$

This tends to be hard to maximise directly to choose $\theta$. (You can find stationary points but they depend on one another.)

Clustering as maximum-likelihood

We can however introduce some latent variables. For each $x_i$ introduce the latent variable $z_i$ where
$z_i^T = \begin{bmatrix} z_i^{(1)} & \cdots & z_i^{(K)} \end{bmatrix}$
and
$z_i^{(j)} = 1$ if $x_i$ was generated by cluster $j$, and $0$ otherwise.

[Figure: graphical model with parameters $\pi$, $\mu$, $\Sigma$, latent variable $z_i$ and observed $x_i$, in a plate over the $m$ points.]

Clustering as maximum-likelihood

Having introduced the $z_i$ we can use the marginalization trick and write
$\log p(X|\theta) = \log \sum_{Z} p(X, Z|\theta) = \log \sum_{Z} p(X|Z, \theta) \, p(Z|\theta)$
where the final step has given us probabilities that are reasonably tractable.

Why is this? First, if I know which cluster generated $x_i$ then its probability is just that for the corresponding Gaussian
$p(x_i|z_i, \theta) = \prod_{k=1}^{K} [p(x_i|\mu_k, \Sigma_k)]^{z_i^{(k)}}$
and similarly
$p(z_i|\theta) = \prod_{k=1}^{K} [\pi_k]^{z_i^{(k)}}.$
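To make the difficulty concrete, here is a short sketch (assuming SciPy is available, with made-up parameters) that evaluates the mixture log-likelihood $\sum_{i} \log \sum_{k} \pi_k \, p(x_i|\mu_k, \Sigma_k)$; the log of a sum over components is what prevents a simple closed-form maximisation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(X | theta) = sum_i log sum_k pi_k N(x_i; mu_k, Sigma_k)."""
    # (m, K) matrix of component densities p(x_i | mu_k, Sigma_k).
    comp = np.column_stack([multivariate_normal.pdf(X, mean=mu, cov=S)
                            for mu, S in zip(mus, Sigmas)])
    return np.sum(np.log(comp @ np.asarray(pis)))

# Illustrative parameters (assumptions, not values from the lecture).
pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
X = np.random.default_rng(0).normal(size=(5, 2))
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```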

Clustering as maximum-likelihood

In other words, if you treat the $z_i$ as observed rather than latent then you can write
$p(x_i, z_i|\theta) = \prod_{k=1}^{K} [p(x_i|\mu_k, \Sigma_k) \, \pi_k]^{z_i^{(k)}}.$
Consequently
$\log p(X, Z|\theta) = \log \prod_{i=1}^{m} p(x_i, z_i|\theta) = \log \prod_{i=1}^{m} \prod_{k=1}^{K} [p(x_i|\mu_k, \Sigma_k) \, \pi_k]^{z_i^{(k)}} = \sum_{i=1}^{m} \sum_{k=1}^{K} z_i^{(k)} \left( \log p(x_i|\mu_k, \Sigma_k) + \log \pi_k \right).$

Clustering as maximum-likelihood

What have we achieved so far?
1. We want to maximize the log-likelihood $\log p(X|\theta)$ but this is intractable.
2. We introduce some latent variables $Z$.
3. That gives us a tractable log-likelihood $\log p(X, Z|\theta)$.
But how do we link them together?

The EM algorithm

The Expectation Maximization (EM) algorithm provides a general way of maximizing likelihood for problems like this. Here we apply it to unsupervised learning, but it can also be applied to learning Hidden Markov Models (HMMs) and many other things.

Let $q(Z)$ be any distribution on the latent variables. Write
$\sum_{Z} q(Z) \log \frac{p(X, Z|\theta)}{q(Z)} = \sum_{Z} q(Z) \log \frac{p(Z|X, \theta) \, p(X|\theta)}{q(Z)}$
$= \sum_{Z} q(Z) \left( \log \frac{p(Z|X, \theta)}{q(Z)} + \log p(X|\theta) \right)$
$= -D_{KL}[q(Z) \,\|\, p(Z|X, \theta)] + \sum_{Z} q(Z) \log p(X|\theta)$
$= -D_{KL}[q(Z) \,\|\, p(Z|X, \theta)] + \log p(X|\theta).$
$D_{KL}$ is the Kullback-Leibler (KL) distance.

The Kullback-Leibler (KL) distance

The Kullback-Leibler (KL) distance measures the distance between two probability distributions. For discrete distributions $p$ and $q$ it is
$D_{KL}[p \,\|\, q] = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$
It has the important properties that:
1. It is non-negative: $D_{KL}[p \,\|\, q] \geq 0$.
2. It is precisely $0$ when the distributions are equal: $D_{KL}[p \,\|\, q] = 0$ if and only if $p = q$.
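A small sketch of the discrete KL distance just defined, with hypothetical distributions used to illustrate the two properties.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p || q] = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))   # positive for different distributions
print(kl_divergence(p, p))   # 0.0 when the distributions are equal
```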

The EM algorithm

If we also define
$L[q, \theta] = \sum_{Z} q(Z) \log \frac{p(X, Z|\theta)}{q(Z)}$
then we can re-arrange the last expression to get
$\log p(X|\theta) = L[q, \theta] + D_{KL}[q \,\|\, p]$
and we know that $D_{KL}[q \,\|\, p] \geq 0$, so that gives us an upper bound
$L[q, \theta] \leq \log p(X|\theta).$

The EM algorithm works as follows:
We iteratively maximize $L[q, \theta]$.
We do this by alternately maximizing with respect to $q$ and $\theta$ while keeping the other fixed.
Maximizing with respect to $q$ is the E step.
Maximizing with respect to $\theta$ is the M step.

The EM algorithm

Let's look at the two steps separately. Say we have $\theta_t$ at time $t$ in the iteration. For the E step, we have $\theta_t$ fixed and
$\log p(X|\theta_t) = L[q, \theta_t] + D_{KL}[q \,\|\, p]$
so this is easy!
1. As $\theta_t$ is fixed, so is $\log p(X|\theta_t)$.
2. So to maximize $L[q, \theta_t]$ we must minimize $D_{KL}[q \,\|\, p]$.
3. And we know that $D_{KL}[q \,\|\, p]$ is minimized and equal to $0$ when $q = p$.
So in the E step we just choose $q(Z) = p(Z|X, \theta_t)$.

The EM algorithm

The M step is a little more involved, but we end up with
$\theta_{t+1} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \sum_{k=1}^{K} \gamma_i^{(k)} \left( \log p(x_i|\mu_k, \Sigma_k) + \log \pi_k \right)$
where
$\gamma_i^{(k)} = \frac{\pi_k \, p(x_i|\mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} \, p(x_i|\mu_{k'}, \Sigma_{k'})}$
and this maximization is tractable. (Though you will need a Lagrange multiplier...)

The EM algorithm

The EM algorithm for a mixture model summarized:
We want to find $\theta$ to maximize $\log p(X|\theta)$. But that's not tractable.
So we introduce an arbitrary distribution $q$ and obtain a lower bound $L[q, \theta] \leq \log p(X|\theta)$.
We maximize the lower bound iteratively in two steps:
1. E step: keep $\theta$ fixed and maximize with respect to $q$. This always results in $q(Z) = p(Z|X, \theta)$.
2. M step: keep $q$ fixed and maximize with respect to $\theta$. For the mixture model the M step is
$\theta_{t+1} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \sum_{k=1}^{K} \gamma_i^{(k)} \left( \log p(x_i|\mu_k, \Sigma_k) + \log \pi_k \right).$
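A compact sketch of one EM pass for the Gaussian mixture, assuming SciPy: the E step computes the responsibilities $\gamma_i^{(k)}$ and the M step applies the standard closed-form updates that come out of the maximisation above (the constraint $\sum_k \pi_k = 1$ being where the Lagrange multiplier enters). The function name, initial parameters and data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One E step + M step for a Gaussian mixture model."""
    m, d = X.shape
    K = len(pis)
    # E step: responsibilities gamma_i^(k) = pi_k N(x_i; mu_k, Sigma_k) / sum_k' (...).
    comp = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                            for k in range(K)])                     # (m, K)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    # M step: closed-form maximisers of sum_i sum_k gamma_i^(k)(log N + log pi_k).
    Nk = gamma.sum(axis=0)                                          # effective counts
    new_pis = Nk / m
    new_mus = [gamma[:, k] @ X / Nk[k] for k in range(K)]
    new_Sigmas = []
    for k in range(K):
        diff = X - new_mus[k]
        new_Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return new_pis, new_mus, new_Sigmas

# Toy usage: two well-separated clusters, iterated a fixed number of times.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)), rng.normal(2.0, 1.0, size=(50, 2))])
pis, mus, Sigmas = [0.5, 0.5], [X[0], X[-1]], [np.eye(2), np.eye(2)]
for _ in range(20):
    pis, mus, Sigmas = em_step(X, pis, mus, Sigmas)
print(pis, mus)
```

Iterating em_step until the log-likelihood stops improving gives the fitted mixture.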

Reinforcement learning and HMMs

Hidden Markov Models are appropriate when our agent models the world as follows:

[Figure: an HMM as a graphical model, with hidden states $S_1, S_2, \ldots, S_{t-1}, S_t$, evidence $E_1, E_2, \ldots, E_{t-1}, E_t$, a prior $\Pr(S_0)$, a transition model $\Pr(S_t|S_{t-1})$ and a sensor model $\Pr(E_t|S_t)$.]

and only wants to infer information about the state of the world on the basis of observing the available evidence.

This might be criticised as unnecessarily restricted, although it is very effective for the right kind of problem.

Reinforcement learning and supervised learning

Supervised learners learn from specifically labelled chunks of information:
$(x_1, 1), (x_2, 1), (x_3, 0), \ldots$
This might also be criticised as unnecessarily restricted: there are other ways to learn.

Reinforcement learning: the basic case

Modelling the world in a more realistic way:

[Figure: states $S_1$, $S_2$, $S_3$ connected by actions.]

The agent can perform actions in order to change the world's state. If the agent performs an action in a particular state, then it gains a corresponding reward. In any state:
Perform an action $a$ to move to a new state. (There may be many possibilities.)
Receive a reward $r$ depending on the start state and action.

Deterministic Markov Decision Processes

Formally, we have a set of states
$S = \{s_1, s_2, \ldots, s_n\}$
and in each state we can perform one of a set of actions
$A = \{a_1, a_2, \ldots, a_m\}.$
We also have a function
$S : S \times A \to S$
such that $S(s, a)$ is the new state resulting from performing action $a$ in state $s$, and a function
$R : S \times A \to \mathbb{R}$
such that $R(s, a)$ is the reward obtained by executing action $a$ in state $s$.
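A minimal sketch of how such a deterministic MDP might be written down in code: the states, the actions, the next-state function $S(s, a)$ and the reward function $R(s, a)$ are simply tables. The particular three-state example is an assumption for illustration, not the one drawn on the slide.

```python
# A tiny deterministic MDP: states, actions, next-state function S and reward function R.
states = ["s1", "s2", "s3"]
actions = ["left", "right"]

# S(s, a): the state reached by doing action a in state s.
S = {
    ("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "s3",
    ("s1", "left"):  "s1", ("s2", "left"):  "s1", ("s3", "left"):  "s2",
}

# R(s, a): the reward obtained by doing action a in state s.
R = {(s, a): 0.0 for s in states for a in actions}
R[("s2", "right")] = 1.0   # reaching s3 from s2 pays a reward

def step(s, a):
    """Apply action a in state s, returning (new state, reward)."""
    return S[(s, a)], R[(s, a)]

print(step("s1", "right"))   # ('s2', 0.0)
```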

Deterministic Markov Decision Processes

From the point of view of the agent, there is a matter of considerable importance:
The agent does not have access to the functions $S$ and $R$.
It therefore has to learn a policy, which is a function
$p : S \to A$
such that $p(s)$ provides the action $a$ that should be executed in state $s$.
What might the agent use as its criterion for learning a policy?

Measuring the quality of a policy

Say we start in a state at time $t$, denoted $s_t$, and we follow a policy $p$. At each future step in time we get a reward. Denote the rewards $r_t, r_{t+1}, \ldots$ and so on.

A common measure of the quality of a policy $p$ is the discounted cumulative reward
$V^p(s_t) = \sum_{i=0}^{\infty} \epsilon^i r_{t+i} = r_t + \epsilon r_{t+1} + \epsilon^2 r_{t+2} + \cdots$
where $0 \leq \epsilon < 1$ is a constant, which defines a trade-off for how much we value immediate rewards against future rewards.

The intuition for this measure is that, on the whole, we should like our agent to prefer rewards gained quickly.

Two important issues

Note that in this kind of problem we need to address two particularly relevant issues:
The temporal credit assignment problem: that is, how do we decide which specific actions are important in obtaining a reward?
The exploration/exploitation problem: how do we decide between exploiting the knowledge we already have, and exploring the environment in order to possibly obtain new (and more useful) knowledge?
We will see later how to deal with these.

The optimal policy

Ultimately, our learner's aim is to learn the optimal policy
$p_{\text{opt}} = \underset{p}{\operatorname{argmax}} \, V^p(s)$
for some initial state $s$. Define the optimal discounted cumulative reward
$V_{\text{opt}}(s) = V^{p_{\text{opt}}}(s).$
How might we go about learning the optimal policy?
The only information we have during learning is the individual rewards obtained from the environment.
We could try to learn $V_{\text{opt}}(s)$ directly, so that states can be compared: consider $s$ as better than $s'$ if $V_{\text{opt}}(s) > V_{\text{opt}}(s')$.
However we actually want to compare actions, not states. Learning $V_{\text{opt}}(s)$ might help as
$p_{\text{opt}}(s) = \underset{a}{\operatorname{argmax}} \, \left[ R(s, a) + \epsilon V_{\text{opt}}(S(s, a)) \right]$
but only if we know $S$ and $R$. As we are interested in the case where these functions are not known, we need something slightly different.
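A short sketch of the discounted cumulative reward over a truncated reward sequence; the reward values and the choice $\epsilon = 0.9$ are assumptions for illustration.

```python
def discounted_return(rewards, eps=0.9):
    """V^p(s_t) ~ r_t + eps*r_{t+1} + eps^2*r_{t+2} + ... over a finite horizon."""
    return sum((eps ** i) * r for i, r in enumerate(rewards))

# Immediate rewards are worth more than the same rewards arriving later.
print(discounted_return([1, 0, 0, 0]))   # 1.0
print(discounted_return([0, 0, 0, 1]))   # about 0.729
```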

The Q function

The trick is to define the following function:
$Q(s, a) = R(s, a) + \epsilon V_{\text{opt}}(S(s, a)).$
This function specifies the discounted cumulative reward obtained if you do action $a$ in state $s$ and then follow the optimal policy. As
$p_{\text{opt}}(s) = \underset{a}{\operatorname{argmax}} \, Q(s, a)$
then provided one can learn $Q$ it is not necessary to have knowledge of $S$ and $R$ to obtain the optimal policy.

The Q function

Note also that
$V_{\text{opt}}(s) = \max_{\alpha} Q(s, \alpha)$
and so
$Q(s, a) = R(s, a) + \epsilon \max_{\alpha} Q(S(s, a), \alpha)$
which suggests a simple learning algorithm.

Let $\hat{Q}$ be our learner's estimate of what the exact $Q$ function is. That is, in the current scenario $\hat{Q}$ is a table containing the estimated values of $Q(s, a)$ for all pairs $(s, a)$.

Q-learning

Start with all entries in $\hat{Q}$ set to $0$. (In fact random entries will do.) Repeat the following:
1. Look at the current state $s$ and choose an action $a$. (We will see how to do this in a moment.)
2. Do the action $a$ and obtain some reward $R(s, a)$.
3. Observe the new state $S(s, a)$.
4. Perform the update
$\hat{Q}(s, a) = R(s, a) + \epsilon \max_{\alpha} \hat{Q}(S(s, a), \alpha).$
Note that this can be done in episodes. For example, in learning to play games, we can play multiple games, each being a single episode. The procedure converges under some simple conditions.

Choosing actions to perform

We have not yet answered the question of how to choose actions to perform during learning. One approach is to choose actions based on our current estimate $\hat{Q}$. For instance
action chosen in current state $s$ = $\underset{a}{\operatorname{argmax}} \, \hat{Q}(s, a).$
However we have already noted the trade-off between exploration and exploitation. It makes more sense to:
Explore during the early stages of training.
Exploit during the later stages of training.
(This also turns out to be sensible to guarantee convergence.)
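A sketch of the tabular Q-learning loop just described, run on a hypothetical three-state deterministic MDP of the same shape as the earlier sketch; actions are chosen greedily from the current estimate with occasional random exploration, a simpler stand-in for the probabilistic scheme introduced on the next slide.

```python
import random

# A tiny deterministic MDP, assumed purely for illustration: moving right from s2 pays 1.
states, actions = ["s1", "s2", "s3"], ["left", "right"]
S = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "s3",
     ("s1", "left"): "s1", ("s2", "left"): "s1", ("s3", "left"): "s2"}
R = {sa: (1.0 if sa == ("s2", "right") else 0.0) for sa in S}

eps = 0.9                                               # discount factor
Q = {(s, a): 0.0 for s in states for a in actions}      # the estimate Qhat, all entries 0

random.seed(0)
for episode in range(200):
    s = "s1"
    for _ in range(10):                                 # a short episode
        # Choose an action: mostly greedy in the current estimate, sometimes random.
        if random.random() < 0.2:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        r, s_next = R[(s, a)], S[(s, a)]
        # The deterministic Q-learning update from the slide.
        Q[(s, a)] = r + eps * max(Q[(s_next, alpha)] for alpha in actions)
        s = s_next

# Both estimates climb towards their optimal values (roughly 4.7 and 5.3 here).
print(Q[("s1", "right")], Q[("s2", "right")])
```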

Choosing actions to perform

One way in which to choose actions that incorporates these requirements is to introduce a constant $\lambda$ and choose actions probabilistically according to
$\Pr(\text{action } a \mid \text{state } s) = \frac{\lambda^{\hat{Q}(s, a)}}{\sum_{a} \lambda^{\hat{Q}(s, a)}}.$
Note that:
If $\lambda$ is small this promotes exploration.
If $\lambda$ is large this promotes exploitation.
We can vary $\lambda$ as training progresses.

There are two further simple ways in which the process can be improved:
1. If training is episodic, we can store the rewards obtained during an episode and update backwards at the end. This allows better updating at the expense of requiring more memory.
2. We can remember information about rewards and occasionally re-use it by re-training.

Nondeterministic MDPs

The Q-learning algorithm generalises easily to a more realistic situation, where the outcomes of actions are probabilistic. Instead of the functions $S$ and $R$ we have probability distributions
$\Pr(\text{new state} \mid \text{current state}, \text{action})$
and
$\Pr(\text{reward} \mid \text{current state}, \text{action})$
and we now use $S(s, a)$ and $R(s, a)$ to denote the corresponding random variables. We now have
$V^p = E\left( \sum_{i=0}^{\infty} \epsilon^i r_{t+i} \right)$
and the best policy $p_{\text{opt}}$ maximises $V^p$.

Q-learning for nondeterministic MDPs

We now have
$Q(s, a) = E(R(s, a)) + \epsilon \sum_{\sigma} \Pr(\sigma \mid s, a) \, V_{\text{opt}}(\sigma) = E(R(s, a)) + \epsilon \sum_{\sigma} \Pr(\sigma \mid s, a) \max_{\alpha} Q(\sigma, \alpha)$
and the rule for learning becomes
$\hat{Q}_{n+1}(s, a) = (1 - \theta_{n+1}) \hat{Q}_n(s, a) + \theta_{n+1} \left[ R(s, a) + \epsilon \max_{\alpha} \hat{Q}_n(S(s, a), \alpha) \right]$
with
$\theta_{n+1} = \frac{1}{1 + v_{n+1}(s, a)}$
where $v_{n+1}(s, a)$ is the number of times the pair $s$ and $a$ has been visited so far.

Alternative representation for the Q table

But there's always a catch... We have to store the table for $\hat{Q}$: even for quite straightforward problems it is HUGE!!! Certainly big enough that it can't be stored.

A standard approach to this problem is, for example, to represent it as a neural network. One way might be to make $s$ and $a$ the inputs to the network and train it to produce $\hat{Q}(s, a)$ as its output. This, of course, introduces its own problems, although it has been used very successfully in practice.
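Two small sketches of the pieces just introduced: the $\lambda$-based action probabilities, and the visit-count step size $\theta_{n+1}$ used in the nondeterministic update. The $\hat{Q}$ values, $\lambda$ settings and visit count are assumptions for illustration.

```python
import numpy as np

def action_probabilities(q_values, lam):
    """Pr(a | s) proportional to lam ** Qhat(s, a)."""
    weights = lam ** np.asarray(q_values, dtype=float)
    return weights / weights.sum()

# Small lambda spreads probability (exploration); large lambda concentrates it (exploitation).
q_row = [1.0, 2.0, 4.0]                       # hypothetical Qhat(s, a) for three actions
print(action_probabilities(q_row, lam=1.1))   # nearly uniform
print(action_probabilities(q_row, lam=10.0))  # almost all mass on the best action

def nondeterministic_update(q_sa, reward, max_q_next, visits, eps=0.9):
    """One update Qhat_{n+1}(s,a) = (1-theta)Qhat_n(s,a) + theta[r + eps*max_a' Qhat_n(s',a')]."""
    theta = 1.0 / (1.0 + visits)              # decays as (s, a) is visited more often
    return (1.0 - theta) * q_sa + theta * (reward + eps * max_q_next)

print(nondeterministic_update(q_sa=0.0, reward=1.0, max_q_next=2.0, visits=3))  # 0.7
```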