Bits of Machine Learning Part 2: Unsupervised Learning

1 Bits of Machine Learning Part 2: Unsupervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology)

2 Outline of the Course 1. Supervised Learning Regression and Classification Neural Networks (Deep Learning) 2. Unsupervised Learning Clustering Reinforcement Learning 1

4 Clustering Clustering extracts structures from unlabelled data $\{X_i,\ i = 1, \dots, n\}$, i.e., groups similar data points into a small set of clusters 3

5 Clustering: Applications Historical application: find geographical correlations in cholera deaths in London (John Snow, 1854) Image segmentation: identify regions of interest in an image Marketing: detect groups of customers to design targeted ads Insurance: identify groups of users with high risk Online services and web: e.g. cluster webpages based on their content Bioinformatics: group proteins w.r.t. their functionality and chemical structure... 4

6 Algorithms Commonly used clustering algorithms Center based approaches (K-means, K-medians, Gaussian mixture models, affinity propagation) Hierarchical/agglomerative approaches (single linkage, complete linkage, average linkage, Ward's method, BIRCH) Graph based approaches (spectral clustering) Density based approaches (DBSCAN, OPTICS) Bayesian nonparametric approaches (Dirichlet processes, Pitman-Yor processes)... 5

7 Algorithms 6

8 Clustering evaluation 1. Performance benchmarks: artificially generate (clustered) data and evaluate the number of misclassified points 2. From the data only: a hard problem. What is a good clustering? - Intra-cluster cohesion: measures how near data points in the same cluster are to each other. Example: the cohesion of cluster $C$ could be $\sum_{i,j \in C} \|X_i - X_j\|^2$. The K-means algorithm finds, over possible clusters $C_1, \dots, C_K$, local minima of $\sum_{k=1}^K \sum_{i \in C_k} \|X_i - \mu_k\|^2$ where $\mu_k$ is the center of cluster $k$ - Inter-cluster separation: distance between points of different clusters 7

9 Clustering evaluation 2. From the data only: Graph-based approach, $V = \{1, \dots, n\}$ Define a complete weighted graph from the data: the weight on edge $(i, j)$ is $w_{ij} = f(\|X_i - X_j\|)$ Example: Normalized cut. The Ncut of $(C_1, C_2)$ is $\mathrm{Ncut}(C_1, C_2) = \frac{\mathrm{cut}(C_1, C_2)}{\mathrm{cut}(C_1, V)} + \frac{\mathrm{cut}(C_2, C_1)}{\mathrm{cut}(C_2, V)}$, where $\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$ We may wish to find clusters minimising the Ncut (NP-hard) 8
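
As an illustration, here is a small NumPy function computing the Ncut of a given two-way partition from a weight matrix W (the function name and interface are our choices, not from the lecture):

```python
import numpy as np

def ncut(W, C1, C2):
    """Normalized cut of the partition (C1, C2), with V = C1 ∪ C2.
    W is the symmetric weight matrix; C1, C2 are index lists."""
    cut = W[np.ix_(C1, C2)].sum()   # cut(C1, C2): total weight across the partition
    vol1 = W[C1, :].sum()           # cut(C1, V)
    vol2 = W[C2, :].sum()           # cut(C2, V)
    return cut / vol1 + cut / vol2

# Example on 4 points forming two tight pairs
W = np.array([[0, 1, .1, .1],
              [1, 0, .1, .1],
              [.1, .1, 0, 1],
              [.1, .1, 1, 0]], dtype=float)
print(ncut(W, [0, 1], [2, 3]))  # small Ncut: the natural partition
print(ncut(W, [0, 2], [1, 3]))  # large Ncut: a bad partition
```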

10 K-means (Lloyd 1957, MacQueen 1967) K-means aims at finding clusters $C = \{C_1, \dots, C_K\}$ solving: $\min_C \sum_{k=1}^K \sum_{i \in C_k} \|X_i - \mu_k\|^2$ where $K$ is the number of clusters, and $\mu_k$ is the center of the $k$-th cluster. Solving the K-means clustering problem in $\mathbb{R}^d$ is: 1. NP-hard in general Euclidean space (dimension $d$) even for 2 clusters 2. NP-hard for a general number of clusters $K$ even in the plane 3. When $K$ and $d$ are fixed, the problem can be solved in time $O(n^{dK+1} \log n)$ 9

11 K-means 1. Initialisation: pick K random points as initial cluster centers 2. Cluster improvement: while the cluster assignment changes do a. Assign each data point to the cluster with the closest center b. Change the cluster centers (the average position of points within the cluster) 10
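
A minimal NumPy sketch of this loop (Lloyd's algorithm). The initialisation from random data points and the stopping test on unchanged assignments follow the two steps above; the function name and the toy example are our choices:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignments and center updates."""
    rng = np.random.default_rng(seed)
    # 1. Initialisation: pick K random data points as the initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # 2a. Assign each point to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged: converged
        labels = new_labels
        # 2b. Move each center to the average position of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Example: two well-separated blobs in the plane
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, K=2)
```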

12 Example 1 11

13 Example 2: Image Segmentation 12

14 K-means algorithm: Properties Guaranteed to converge in a finite number of iterations Running time per iteration: O(nKd) Issues 1. Can get stuck in a bad local minimum 2. Different initialisations yield different solutions 3. The number of clusters K needs to be specified in advance Example of an approach addressing both the initialisation and the choice of the number of clusters: V. Petrosyan and A. Proutiere, Viral Clustering: A Robust Method to Extract Structures in Heterogeneous Datasets, AAAI

15 Gaussian Mixture Models (GMM) and the EM Algorithm Approach: assume that the data points are generated according to a GMM in an i.i.d. manner, and estimate the parameters of the GMM GMM: each point is distributed according to the distribution with density $p(x \mid \theta)$, where $\theta = (\theta_k)_{k=1,\dots,K}$ and $\theta_k = (\alpha_k, \mu_k, \Sigma_k)$: $p(x \mid \theta) = \sum_{k=1}^K \alpha_k\, p_G(x \mid \theta_k)$, where $p_G(x \mid \theta_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right)$ $\alpha_k$: probability that a point is attached to cluster $k$ $\mu_k$: center of cluster $k$ $\Sigma_k$: covariance matrix of the $k$-th component of the GMM 14

16 GMM Example The density of a GMM with three components in R 2 15

17 The Expectation-Maximisation algorithm Dempster-Laird-Rubin, 1977 Aims at maximising the log-likelihood of the model given the data Two alternating steps: 1. E-step: for a given estimate of $\theta$, evaluate the probability that a data point belongs to the various clusters. The membership weight of $X_i$ in cluster $k$ is: $w_{ik} = \frac{\alpha_k\, p_G(X_i \mid \theta_k)}{\sum_{m=1}^K \alpha_m\, p_G(X_i \mid \theta_m)}$ 2. M-step: update $\theta$ so as to increase the log-likelihood 16

18 The Expectation-Maximisation algorithm Initialisation: select $\alpha_k^1, \mu_k^1, \Sigma_k^1$ arbitrarily, $t = 1$ Until convergence, do: 1. E-step: compute for all $i$ and $k$: $w_{ik} = \frac{\alpha_k^t\, p_G(X_i \mid \mu_k^t, \Sigma_k^t)}{\sum_{m=1}^K \alpha_m^t\, p_G(X_i \mid \mu_m^t, \Sigma_m^t)}$, $N_k = \sum_{i=1}^n w_{ik}$ 2. M-step: update the parameters as follows: for all $k$, $\alpha_k^{t+1} = \frac{N_k}{n}$, $\mu_k^{t+1} = \frac{1}{N_k} \sum_{i=1}^n w_{ik} X_i$, $\Sigma_k^{t+1} = \frac{1}{N_k} \sum_{i=1}^n w_{ik} (X_i - \mu_k^{t+1})(X_i - \mu_k^{t+1})^\top$ 3. $t := t + 1$ 17
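
A minimal NumPy sketch of these E- and M-steps. The function names, the initialisation from random data points, and the small diagonal regularisation of the covariances are our choices, not part of the lecture:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density p_G(x | mu, Sigma), evaluated row-wise."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: alternate the E- and M-steps above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(K, 1.0 / K)                      # mixing weights alpha_k
    mu = X[rng.choice(n, K, replace=False)].copy()   # initial centers
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: membership weights w_ik ∝ alpha_k * p_G(X_i | mu_k, Sigma_k)
        W = np.column_stack([alpha[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)])
        W /= W.sum(axis=1, keepdims=True)
        N = W.sum(axis=0)                            # N_k = sum_i w_ik
        # M-step: re-estimate alpha_k, mu_k, Sigma_k
        alpha = N / n
        mu = (W.T @ X) / N[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (W[:, k, None] * diff).T @ diff / N[k] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, W
```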

19 EM algorithm: Properties Convergence: On the Convergence Properties of the EM Algorithm, C.F.J. Wu, Annals of Statistics, 1983 Issues 1. Can get stuck in a bad local minimum 2. Suffers from the curse of dimensionality (not suitable for high dimensional data) 18

20 Spectral Clustering K-means and the EM algorithm for GMM are center-based algorithms. How to deal with clusters that do not look like potatoes?... apply spectral methods on similarity matrices or graphs! Spectral method: work with eigenvalues and eigenvectors of the similarity matrix 19

21 Spectral Clustering Data: $X_1, \dots, X_n \in \mathbb{R}^d$ Similarity matrix: $W = (w_{ij}) \in \mathbb{R}^{n \times n}$ with, for example (Gaussian kernel), $w_{ij} = 1_{i \neq j} \exp(-\|X_i - X_j\|^2 / \sigma)$ Spectral analysis: we get the top-$K$ eigenvectors of $W$ (those corresponding to the highest singular values) The spectral idea: the data points are well represented by their projections on the linear span of these eigenvectors 20

22 Spectral Clustering On spectral clustering: Analysis and an algorithm, A. Ng, M. Jordan, Y. Weiss, NIPS 1. Input: $W \in \mathbb{R}^{n \times n}$, $K \in \{1, 2, \dots, n\}$ 2. Let $D$ be a diagonal matrix s.t. $D_{ii} := \sum_{j=1}^n w_{ij}$ 3. $L := D^{-1/2} W D^{-1/2}$ 4. $E = [e_1, e_2, \dots, e_K] := \mathrm{eigs}(L, K) \in \mathbb{R}^{n \times K}$ 5. Normalize each row of $E$ 6. $y := \text{K-means}(E, K)$ 7. Output: $y$ $\mathrm{eigs}(L, K)$ returns the top $K$ eigenvectors of $L$ 21
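
A sketch of steps 1-7 in NumPy, using scikit-learn's KMeans for the final step; the function name and the use of `eigh` on the symmetric matrix L are our choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, K):
    """Ng-Jordan-Weiss spectral clustering: normalise, embed, run K-means."""
    # Step 2: degree matrix D with D_ii = sum_j w_ij
    d = W.sum(axis=1)
    # Step 3: normalised similarity L = D^{-1/2} W D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W @ D_inv_sqrt
    # Step 4: top-K eigenvectors of L (L is symmetric, so eigh applies)
    eigvals, eigvecs = np.linalg.eigh(L)
    E = eigvecs[:, np.argsort(eigvals)[-K:]]   # columns = K leading eigenvectors
    # Step 5: normalise each row of E to unit length
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    # Step 6: cluster the rows of the embedding with K-means
    return KMeans(n_clusters=K, n_init=10).fit_predict(E)
```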

23 Example 22

24 Tuning Spectral Clustering We may consider different notions of similarity: 1. Gaussian kernel: $w_{ij} = e^{-\|X_i - X_j\|^2 / \sigma}$ for $i \neq j$ and $w_{ii} = 0$. Choice of $\sigma$? 2. k-nearest neighbour similarity: $w_{ij} = 1$ if $i$ is one of the $k$ nearest neighbours of $j$ or if $j$ is one of the $k$ nearest neighbours of $i$. Choice of $k$? 3. Self-tuning kernel: $w_{ij} = e^{-\frac{\|X_i - X_j\|^2}{\|X_i - X_{i_m}\| \cdot \|X_j - X_{j_m}\|}}$ for $i \neq j$ and $w_{ii} = 0$, where $i_m$ is the index of the $m$-th nearest neighbour of $i$. Choice of $m$? [Play with these parameters in the lab... :-) ] 23
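
For the lab, one possible NumPy construction of these three similarity matrices (the function names and the vectorised distance computations are ours):

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """w_ij = exp(-||X_i - X_j||^2 / sigma), with w_ii = 0."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
    W = np.exp(-D2 / sigma)
    np.fill_diagonal(W, 0.0)
    return W

def knn_similarity(X, k):
    """w_ij = 1 if i is among the k nearest neighbours of j, or vice versa."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(D2, np.inf)                 # exclude self-distances
    nn = np.argsort(D2, axis=1)[:, :k]           # indices of the k nearest neighbours
    W = np.zeros_like(D2)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)                    # symmetrise (the "or" in the definition)

def self_tuning_kernel(X, m):
    """w_ij = exp(-||X_i - X_j||^2 / (||X_i - X_{i_m}|| * ||X_j - X_{j_m}||)), w_ii = 0."""
    D = np.sqrt(np.square(X[:, None, :] - X[None, :, :]).sum(axis=2))
    D_no_self = D + np.diag(np.full(len(X), np.inf))
    local = np.sort(D_no_self, axis=1)[:, m - 1]  # distance to the m-th nearest neighbour
    W = np.exp(-D ** 2 / np.outer(local, local))
    np.fill_diagonal(W, 0.0)
    return W
```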

25 Outline of the Course 1. Supervised Learning Regression and Classification Neural Networks (Deep Learning) 2. Unsupervised Learning Clustering Reinforcement Learning 24

26 Reinforcement learning Learning optimal sequential behaviour / control from interacting with the environment State dynamics: $s^\pi_{t+1} = F_t(s^\pi_t, a^\pi_t)$ 25

27 Reinforcement learning Learning optimal sequential behaviour / control from interacting with the environment [...] By the time we learn to live It's already too late Our hearts cry in unison at night [...] Louis Aragon 26

28 Reinforcement learning: Applications Making a robot walk Portfolio optimisation Playing games better than humans Helicopter stunt manoeuvres Optimal communication protocols in radio networks Display ads Search engines... 27

29 1. Bandit Optimisation State dynamics: $s^\pi_{t+1} = F_t(s^\pi_t, a^\pi_t)$ Interact with an i.i.d. or adversarial environment The reward is independent of the state and is the only feedback: - i.i.d. environment: $r_t(a, s) = r_t(a)$, a random variable with mean $\theta_a$ - adversarial environment: $r_t(a, s) = r_t(a)$ is arbitrary! 28

30 2. Markov Decision Process (MDP) State dynamics: $s^\pi_{t+1} = F_t(s^\pi_t, a^\pi_t)$ History at $t$: $h^\pi_t = (s^\pi_1, a^\pi_1, \dots, s^\pi_{t-1}, a^\pi_{t-1}, s^\pi_t)$ Markovian environment: $\mathbb{P}[s^\pi_{t+1} = s' \mid h^\pi_t, s^\pi_t = s, a^\pi_t = a] = p(s' \mid s, a)$ Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$ 29

31 What is to be learnt and optimised? Bandit optimisation: the average rewards of actions are unknown Information available at time $t$ under $\pi$: $a^\pi_1, r_1(a^\pi_1), \dots, a^\pi_{t-1}, r_{t-1}(a^\pi_{t-1})$ MDP: the state dynamics $p(\cdot \mid s, a)$ and the reward function $r(a, s)$ are unknown Information available at time $t$ under $\pi$: $s^\pi_1, a^\pi_1, r_1(a^\pi_1, s^\pi_1), \dots, s^\pi_{t-1}, a^\pi_{t-1}, r_{t-1}(a^\pi_{t-1}, s^\pi_{t-1}), s^\pi_t$ Objective: maximise the cumulative reward $\sum_{t=1}^T \mathbb{E}[r_t(a^\pi_t, s^\pi_t)]$ or $\sum_{t=1}^\infty \lambda^t \mathbb{E}[r_t(a^\pi_t, s^\pi_t)]$ 30

32 Regret Difference between the cumulative reward of an Oracle policy and that of agent π Regret quantifies the price to pay for learning! Exploration vs. exploitation trade-off: we need to probe all actions to play the best later... 31

33 1. Bandit Optimisation First application: Clinical trial, Thompson A set of possible actions at each step - Unknown sequence of rewards for each action - Bandit feedback: only rewards of chosen actions are observed - Goal: maximise the cumulative reward (up to step T ) Two examples: a. Finite number of actions, stochastic rewards b. Continuous actions, concave adversarial rewards 32

34 a. Stochastic bandits Robbins 1952 Finite set of actions $\mathcal{A}$ (Unknown) rewards of action $a \in \mathcal{A}$: $(r_t(a),\ t \geq 0)$ i.i.d. Bernoulli with $\mathbb{E}[r_t(a)] = \theta_a$ Optimal action: $a^\star \in \arg\max_a \theta_a$ Online policy $\pi$: select action $a^\pi_t$ at time $t$ depending on $a^\pi_1, r_1(a^\pi_1), \dots, a^\pi_{t-1}, r_{t-1}(a^\pi_{t-1})$ Regret up to time $T$: $R^\pi(T) = T\theta_{a^\star} - \sum_{t=1}^T \theta_{a^\pi_t}$ 33

35 a. Stochastic bandits Fundamental performance limits (Lai-Robbins 1985): for any reasonable $\pi$, $\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \geq \sum_{a \neq a^\star} \frac{\theta_{a^\star} - \theta_a}{KL(\theta_a, \theta_{a^\star})}$ where $KL(a, b) = a \log(\frac{a}{b}) + (1-a) \log(\frac{1-a}{1-b})$ (KL divergence) Algorithms: (i) $\epsilon$-greedy: linear regret (ii) $\epsilon_t$-greedy: logarithmic regret ($\epsilon_t = 1/t$) (iii) Upper Confidence Bound algorithm: select the action with the largest index $b_a(t) = \hat\theta_a(t) + \sqrt{\frac{2 \log(t)}{n_a(t)}}$ $\hat\theta_a(t)$: empirical reward of $a$ up to $t$ $n_a(t)$: nb of times $a$ played up to $t$ 34
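
A hedged NumPy sketch of the UCB index policy above on Bernoulli arms; the simulated environment, the arm means `theta`, and the regret bookkeeping are our illustration, not part of the slides:

```python
import numpy as np

def ucb(theta, T, seed=0):
    """UCB index policy on Bernoulli arms with (unknown to the policy) means theta."""
    rng = np.random.default_rng(seed)
    K = len(theta)
    counts = np.zeros(K)       # n_a(t): number of times arm a was played
    sums = np.zeros(K)         # cumulative reward of arm a
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1          # play each arm once to initialise the indices
        else:
            means = sums / counts                          # empirical rewards
            b = means + np.sqrt(2 * np.log(t) / counts)    # index b_a(t)
            a = int(np.argmax(b))
        r = rng.random() < theta[a]    # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        regret += max(theta) - theta[a]
    return regret

# Example: the regret should grow roughly like log(T)
print(ucb(theta=[0.3, 0.5, 0.7], T=10000))
```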

36 b. Adversarial Convex Bandits At the beginning of each year, Volvo has to select a vector x (in a convex set) representing the relative efforts in producing various models (S60, V70, V90,...). The reward is an arbitrarily varying and unknown concave function of x. How to maximise reward over say 50 years? 35

42 b. Adversarial Convex Bandits Continuous set of actions $\mathcal{A} = [0, 1]$ (Unknown) arbitrary but concave rewards of action $x \in \mathcal{A}$: $r_t(x)$ Online policy $\pi$: select action $x^\pi_t$ at time $t$ depending on $x^\pi_1, r_1(x^\pi_1), \dots, x^\pi_{t-1}, r_{t-1}(x^\pi_{t-1})$ Regret up to time $T$ (defined w.r.t. the best empirical action up to time $T$): $R^\pi(T) = \max_{x \in [0,1]} \sum_{t=1}^T r_t(x) - \sum_{t=1}^T r_t(x^\pi_t)$ Can we do something smart at all? Achieve a sublinear regret? 41

43 b. Adversarial Convex Bandits If $r_t(\cdot) = r(\cdot)$, and if $r(\cdot)$ were known, we could apply a gradient ascent algorithm One-point gradient estimate: with $\hat f(x) = \mathbb{E}_{v \in B}[f(x + \delta v)]$, $B = \{x : \|x\|_2 \leq 1\}$, we have $\mathbb{E}_{u \in S}[f(x + \delta u)\, u] = \delta \nabla \hat f(x)$, $S = \{x : \|x\|_2 = 1\}$ Simulated Gradient Ascent algorithm: at each step $t$, do - $u_t$ uniformly chosen in $S$ - play $y_t = x_t + \delta u_t$ - $x_{t+1} = x_t + \alpha\, r_t(y_t)\, u_t$ Regret: $R(T) = O(T^{5/6})$ 42
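
A one-dimensional sketch of this simulated gradient ascent on A = [0, 1]; the clipping that keeps the iterate feasible and the toy reward sequence are our additions:

```python
import numpy as np

def simulated_gradient_ascent(reward_fns, delta=0.05, alpha=0.01, seed=0):
    """One-point bandit gradient ascent on A = [0, 1] (1-D sketch):
    keep an internal point x, play y_t = x_t + delta*u_t, and move x
    in the direction r_t(y_t) * u_t."""
    rng = np.random.default_rng(seed)
    x = 0.5
    total = 0.0
    for r_t in reward_fns:
        u = rng.choice([-1.0, 1.0])   # uniform on the unit sphere of R, i.e. {-1, +1}
        y = x + delta * u             # action actually played
        reward = r_t(y)
        total += reward
        x = np.clip(x + alpha * reward * u, delta, 1.0 - delta)
        # clipping keeps x + delta*u inside [0, 1]; this projection is our simplification
    return total

# Example: a fixed concave reward, maximised at x = 0.3
rewards = [lambda x: 1.0 - (x - 0.3) ** 2] * 5000
print(simulated_gradient_ascent(rewards))
```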

44 2. Markov Decision Process (MDP) State dynamics: $s^\pi_{t+1} = F_t(s^\pi_t, a^\pi_t)$ Markovian environment: $\mathbb{P}[s^\pi_{t+1} = s' \mid h^\pi_t, s^\pi_t = s, a^\pi_t = a] = p(s' \mid s, a)$ Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$ $p(\cdot \mid s, a)$ and $r(\cdot, \cdot)$ are unknown initially 43

45 Example Playing Pacman (Google DeepMind experiment, 2015) State: the current displayed image Action: right, left, down, up Feedback: the score and its increments + state 44

46 Bellman's equation Objective: maximise the average discounted reward $\sum_{t=1}^\infty \lambda^t \mathbb{E}[r(a^\pi_t, s^\pi_t)]$ Assume the transition probabilities and the reward function are known Value function: maps the initial state $s$ to the corresponding maximum reward $v(s)$ Bellman's equation: $v(s) = \max_{a \in \mathcal{A}} \left[ r(a, s) + \lambda \sum_j p(j \mid s, a)\, v(j) \right]$ Solve Bellman's equation. The optimal policy is given by: $a^\star(s) = \arg\max_{a \in \mathcal{A}} \left[ r(a, s) + \lambda \sum_j p(j \mid s, a)\, v(j) \right]$ 45
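
The slide does not fix a solution method; one standard way to solve Bellman's equation when p and r are known is value iteration, sketched below for a small finite MDP stored as arrays (the array layout is our assumption):

```python
import numpy as np

def value_iteration(p, r, lam=0.9, tol=1e-8):
    """Solve v(s) = max_a [ r(a,s) + lam * sum_j p(j|s,a) v(j) ] by fixed-point iteration.
    p has shape (S, A, S) with p[s, a, j] = p(j | s, a); r has shape (S, A) with r[s, a] = r(a, s)."""
    S, A, _ = p.shape
    v = np.zeros(S)
    while True:
        # Q(s, a) = r(a, s) + lam * expected value of the next state
        q = r + lam * np.einsum('saj,j->sa', p, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=1)   # a*(s) = argmax_a [ r(a,s) + lam * sum_j p(j|s,a) v(j) ]
    return v, policy
```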

47 Q-learning What if the transition probabilities and the reward function are unknown? Q-value function: the max expected reward starting from state $s$ and playing action $a$: $Q(s, a) = r(a, s) + \lambda \sum_j p(j \mid s, a) \max_{b \in \mathcal{A}} Q(j, b)$ Note that $v(s) = \max_{a \in \mathcal{A}} Q(s, a)$ Algorithm: update the Q-value estimate sequentially so that it converges to the true Q-value 46

48 Q-learning 1. Initialisation: select $Q \in \mathbb{R}^{S \times A}$ arbitrarily, and $s_0$ 2. Q-value iteration: at each step $t$, select action $a_t$ (each state-action pair must be selected infinitely often) Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$ Update $Q(s_t, a_t)$: $Q(s_t, a_t) := Q(s_t, a_t) + \alpha_t \left[ r(s_t, a_t) + \lambda \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$ It converges to $Q$ if $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ 47
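
A tabular sketch of this update with epsilon-greedy action selection. The environment interface `step(s, a)`, the epsilon-greedy rule and the global step-size schedule are our simplifications, not prescribed by the slide:

```python
import numpy as np

def q_learning(step, S, A, episodes=500, horizon=200, lam=0.95,
               epsilon=0.1, seed=0):
    """Tabular Q-learning. `step(s, a)` is assumed to return (next_state, reward);
    states and actions are integers in range(S) and range(A)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    t = 0
    for _ in range(episodes):
        s = 0                                      # start state (assumption)
        for _ in range(horizon):
            t += 1
            # step sizes 1/t^0.6 satisfy sum a_t = inf, sum a_t^2 < inf
            # (a single global schedule is used here for simplicity)
            alpha = 1.0 / t ** 0.6
            # epsilon-greedy: keeps every state-action pair being explored
            a = rng.integers(A) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r = step(s, a)
            # Q-value update from the slide
            Q[s, a] += alpha * (r + lam * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```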

49 Q-learning: demo The crawling robot

50 Scaling up Q-learning Q-learning converges very slowly, especially when the state and action spaces are large... State-of-the-art algorithms (optimal exploration, ideas from bandit optimisation): regret $O(S\sqrt{AT})$ What if the action and state are continuous variables? Example: Mountain car demo (see Sutton tutorial, NIPS)

51 Q-learning with function approximation Idea: restrict our attention to Q-value functions belonging to a family of functions $\mathcal{Q}$ Examples: 1. Linear functions: $\mathcal{Q} = \{Q_\theta, \theta \in \mathbb{R}^M\}$, $Q_\theta(s, a) = \sum_{i=1}^M \phi_i(s, a)\, \theta_i = \phi(s, a)^\top \theta$ where, for all $i$, $\phi_i$ is a fixed feature function, and the $\phi_i$'s are linearly independent 2. Deep networks: $\mathcal{Q} = \{Q_w, w \in \mathbb{R}^M\}$, $Q_w(s, a)$ given as the output of a neural network with weights $w$ and inputs $(s, a)$ 50

52 Q-learning with linear function approximation 1. Initialisation: select $\theta \in \mathbb{R}^M$ arbitrarily, and $s_0$ 2. Q-value iteration: at each step $t$, select action $a_t$ (each state-action pair must be selected infinitely often) Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$ Update $\theta$: $\theta := \theta + \alpha_t\, \delta_t\, \nabla_\theta Q_\theta(s_t, a_t) = \theta + \alpha_t\, \delta_t\, \phi(s_t, a_t)$ where $\delta_t = r(s_t, a_t) + \lambda \max_{a \in \mathcal{A}} Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)$ For convergence results, see An analysis of Reinforcement Learning with Function Approximation, Melo et al., ICML 51
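
A sketch of this update, with the feature map `phi(s, a)` and the environment `step(s, a)` supplied by the caller; both interfaces, the start state, and the epsilon-greedy exploration are assumptions on our part:

```python
import numpy as np

def q_learning_linear(step, phi, M, A, T=10000, lam=0.95, alpha=0.01,
                      epsilon=0.1, seed=0):
    """Q-learning with linear function approximation Q_theta(s, a) = phi(s, a) . theta.
    `phi(s, a)` returns a feature vector of length M; `step(s, a)` returns
    (next_state, reward)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(M)
    q = lambda s, a: phi(s, a) @ theta
    s = 0                                          # start state (assumption)
    for _ in range(T):
        a = rng.integers(A) if rng.random() < epsilon else \
            int(np.argmax([q(s, b) for b in range(A)]))
        s_next, r = step(s, a)
        # TD error: delta = r + lam * max_a Q_theta(s', a) - Q_theta(s, a)
        delta = r + lam * max(q(s_next, b) for b in range(A)) - q(s, a)
        # gradient step: theta := theta + alpha * delta * phi(s, a)
        theta = theta + alpha * delta * phi(s, a)
        s = s_next
    return theta
```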

53 Q-learning with function approximation Success stories: TD-Gammon (Backgammon), Tesauro 1995 (neural nets) Acrobatic helicopter autopilots, Ng et al. Jeopardy, IBM Watson Atari games, pixel-level visual inputs, Google DeepMind 52

54 Questions? Alexandre Proutiere 53
