Clustering. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms. March 8, 2017.



Outline
1 Administration
2 Review of last lecture
3 Clustering

Schedule and Final
Upcoming schedule:
Today: last day of class
Next Monday (3/13): no class; I will hold office hours in my office (BH 4531F)
Next Wednesday (3/15): Final Exam, HW6 due
Final Exam: cumulative, but with more emphasis on new material. 8 short questions (recall that the midterm had 6 short and 3 long questions). Should be much shorter than the midterm, but of equal difficulty. Focus is on major concepts (e.g., MLE, primal and dual formulations of SVM, gradient descent, etc.).


Neural Networks
Basic idea: learn nonlinear basis functions and classifiers. Hidden layers are nonlinear mappings from the input features to a new representation; output layers use the new representation for classification or regression.
Learning parameters: backpropagation is an efficient algorithm for (stochastic) gradient descent; we can write down explicit updates via the chain rule of calculus.

Single node (with bias unit x_0 = 1): given input x = [x_0, x_1, x_2, x_3]^T and weights θ = [θ_0, θ_1, θ_2, θ_3]^T, the node outputs
h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})
where g(z) = 1 / (1 + e^{-z}) is the sigmoid (logistic) activation function. (Based on slide by Andrew Ng.)

Neural network: [diagram] input x with bias units, hidden activations a^(2), output h_Θ(x); Layer 1 (input layer), Layer 2 (hidden layer), Layer 3 (output layer). (Slide by Andrew Ng.)

Vectorization: with hidden-layer weights Θ^(1) and output-layer weights Θ^(2),
a_1^(2) = g(z_1^(2)) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3)
a_2^(2) = g(z_2^(2)) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3)
a_3^(2) = g(z_3^(2)) = g(Θ_30^(1) x_0 + Θ_31^(1) x_1 + Θ_32^(1) x_2 + Θ_33^(1) x_3)
h_Θ(x) = g(Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2) + Θ_13^(2) a_3^(2)) = g(z_1^(3))
Feed-forward steps:
z^(2) = Θ^(1) x
a^(2) = g(z^(2)), then add a_0^(2) = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))
(Based on slide by Andrew Ng.)
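To make the feed-forward steps concrete, here is a minimal NumPy sketch of the same computation for one hidden layer. The function and variable names (sigmoid, forward, Theta1, Theta2) and the random example weights are illustrative choices, not part of the original slides.

import numpy as np

def sigmoid(z):
    # Logistic activation g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # Feed-forward pass for a network with one hidden layer.
    # x:      input vector of shape (d,)
    # Theta1: hidden-layer weights, shape (h, d + 1); column 0 multiplies the bias
    # Theta2: output-layer weights, shape (m, h + 1)
    a1 = np.concatenate(([1.0], x))            # add bias unit x_0 = 1
    z2 = Theta1 @ a1                           # z^(2) = Theta^(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) = g(z^(2)), add a^(2)_0 = 1
    z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                         # h_Theta(x) = a^(3) = g(z^(3))

# Example with the slide's dimensions: 3 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
Theta1 = rng.normal(scale=0.1, size=(3, 4))
Theta2 = rng.normal(scale=0.1, size=(1, 4))
print(forward(np.array([1.0, 2.0, 3.0]), Theta1, Theta2))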

Learning in NN: Backpropagation. Similar to the perceptron learning algorithm, we cycle through our examples: if the output of the network is correct, no changes are made; if there is an error, the weights are adjusted to reduce the error. We are just performing (stochastic) gradient descent! (Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par.)

Optimizing the Neural Network. The objective is
J(Θ) = -(1/n) Σ_{i=1}^n Σ_{k=1}^K [ y_ik log(h_Θ(x_i))_k + (1 - y_ik) log(1 - (h_Θ(x_i))_k) ] + (λ/2n) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_ji^(l))^2
Unlike before, J(Θ) is not convex, so gradient descent on a neural net yields a local optimum. Solve via forward propagation and backpropagation, using
∂J(Θ)/∂Θ_ij^(l) = a_j^(l) δ_i^(l+1)   (ignoring λ, i.e., taking λ = 0)
(Based on slide by Andrew Ng.)

Forward Propagation. Given one labeled training instance (x, y):
a^(1) = x
z^(2) = Θ^(1) a^(1),  a^(2) = g(z^(2))   [add a_0^(2)]
z^(3) = Θ^(2) a^(2),  a^(3) = g(z^(3))   [add a_0^(3)]
z^(4) = Θ^(3) a^(3),  a^(4) = h_Θ(x) = g(z^(4))
(Based on slide by Andrew Ng.)

Backpropagation: Gradient Computation. Let δ_j^(l) = "error" of node j in layer l (number of layers L = 4). Backpropagation, with .* denoting the element-wise product:
δ^(4) = a^(4) - y
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
(There is no δ^(1).) For the sigmoid, g'(z^(3)) = a^(3) .* (1 - a^(3)) and g'(z^(2)) = a^(2) .* (1 - a^(2)).
(Based on slide by Andrew Ng.)

Training a Neural Network via Gradient Descent with Backprop
Given: training set {(x_1, y_1), ..., (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop   // each iteration is called an epoch
    Set Δ_ij^(l) = 0 for all l, i, j   (used to accumulate the gradient)
    For each training instance (x_i, y_i):
        Set a^(1) = x_i
        Compute {a^(2), ..., a^(L)} via forward propagation
        Compute δ^(L) = a^(L) - y_i
        Compute the errors {δ^(L-1), ..., δ^(2)} via backpropagation
        Accumulate the gradients: Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1)
    Compute the average regularized gradient:
        D_ij^(l) = (1/n) Δ_ij^(l) + λ Θ_ij^(l)   if j ≠ 0
        D_ij^(l) = (1/n) Δ_ij^(l)                otherwise
    Update the weights via a gradient step on D_ij^(l)
Until the weights converge or the max number of epochs is reached
(Based on slide by Andrew Ng.)
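As a concrete illustration of this training loop, here is a minimal NumPy sketch for a network with one hidden layer of sigmoid units. The function name train, the learning rate lr, the fixed number of epochs as a stopping rule, and the XOR toy data are illustrative assumptions rather than details from the slides; the per-example forward pass, δ computation, gradient accumulation, and regularized update follow the steps above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, hidden=3, lam=0.0, lr=0.5, epochs=500, seed=0):
    # Batch gradient descent with backprop, one hidden layer of sigmoid units.
    # X: (n, d) inputs, Y: (n, m) targets in [0, 1].
    n, d = X.shape
    m = Y.shape[1]
    rng = np.random.default_rng(seed)
    Theta1 = rng.normal(scale=0.1, size=(hidden, d + 1))   # NOT initialized to 0
    Theta2 = rng.normal(scale=0.1, size=(m, hidden + 1))

    for _ in range(epochs):                      # each iteration is an epoch
        D1 = np.zeros_like(Theta1)               # gradient accumulators (Delta)
        D2 = np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            a1 = np.concatenate(([1.0], x))      # forward propagation
            z2 = Theta1 @ a1
            a2 = np.concatenate(([1.0], sigmoid(z2)))
            a3 = sigmoid(Theta2 @ a2)

            d3 = a3 - y                          # delta^(3) = a^(3) - y
            d2 = (Theta2[:, 1:].T @ d3) * a2[1:] * (1 - a2[1:])  # delta^(2)

            D2 += np.outer(d3, a2)               # accumulate a_j^(l) * delta_i^(l+1)
            D1 += np.outer(d2, a1)

        # Average gradient, with L2 regularization on the non-bias weights
        G1, G2 = D1 / n, D2 / n
        G1[:, 1:] += lam * Theta1[:, 1:]
        G2[:, 1:] += lam * Theta2[:, 1:]
        Theta1 -= lr * G1                        # gradient step
        Theta2 -= lr * G2
    return Theta1, Theta2

# Tiny usage example: fit XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
T1, T2 = train(X, Y, hidden=4, lr=1.0, epochs=2000)

In practice one would also monitor J(Θ) across epochs to decide convergence rather than fixing the number of epochs in advance.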

Summary of the course so far
Supervised learning has been our focus. Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x's true value y (i.e., regression or classification).
Linear vs. nonlinear features:
1 Linear: h(x) depends on w^T x
2 Nonlinear: h(x) depends on w^T φ(x), where φ is either explicit or defined implicitly through a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)
Loss functions:
1 Squared loss: least squares for regression (minimizing the residual sum of errors)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines
Principles of estimation:
1 Point estimate: maximum likelihood, regularized likelihood

(cont'd)
Optimization:
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations
Learning theory:
1 Difference between training error and generalization error
2 Overfitting, bias-variance tradeoff
3 Regularization: various regularized models

Supervised versus Unsupervised Learning
Supervised: learning from labeled observations. The labels teach the algorithm to learn a mapping from observations to labels. Examples: classification, regression.
Unsupervised: learning from unlabeled observations. The learning algorithm must find latent structure from the features alone. This can be a goal in itself (discovering hidden patterns, exploratory analysis) or a means to an end (preprocessing for a supervised task). Example: clustering (today).

Outline
1 Administration
2 Review of last lecture
3 Clustering: K-means; Gaussian mixture models

Clustering Setup
Given D = {x_n}_{n=1}^N and K, we want to output:
{µ_k}_{k=1}^K: the prototypes of the clusters
A(x_n) ∈ {1, 2, ..., K}: the cluster membership
Toy example: cluster the data into two clusters. [Figure: a toy 2-D dataset, shown unclustered in panel (a) and clustered in panel (i).]
Applications: identify communities within social networks; find topics in news stories; group similar sequences into gene families.

K-means example. [Figure: panels (a)-(i) show successive iterations of K-means on the toy dataset, alternating between assigning points to the nearest prototype and re-estimating the prototypes.]

K-means clustering
Intuition: data points assigned to cluster k should be near prototype µ_k.
Distortion measure (the clustering objective, or cost function):
J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n - µ_k||_2^2
where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k.

Algorithm
Minimize the distortion by alternating optimization between {r_nk} and {µ_k}:
Step 0: Initialize {µ_k} to some values.
Step 1: Fix {µ_k} and minimize over {r_nk}, which gives the assignment
r_nk = 1 if k = arg min_j ||x_n - µ_j||_2^2, and r_nk = 0 otherwise.
Step 2: Fix {r_nk} and minimize over {µ_k}, which gives the update
µ_k = (Σ_n r_nk x_n) / (Σ_n r_nk)
Step 3: Return to Step 1 unless the stopping criterion is met.
A minimal sketch of this procedure appears below.
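Here is a minimal NumPy sketch of the alternating procedure (Lloyd's algorithm). The function name kmeans, the random initialization from the data, the empty-cluster handling, and the stopping rule (assignments stop changing, or max_iter iterations) are illustrative assumptions rather than details specified on the slides.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    # Minimize J = sum_n sum_k r_nk ||x_n - mu_k||_2^2 by alternating steps.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()   # Step 0: initialize {mu_k}
    prev = None
    for _ in range(max_iter):
        # Step 1: fix {mu_k}; set r_nk by assigning each point to its nearest prototype
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        assign = dists.argmin(axis=1)
        if prev is not None and np.array_equal(assign, prev):
            break                                               # assignments stable: stop
        prev = assign
        # Step 2: fix r_nk; set mu_k to the mean of the points assigned to cluster k
        for k in range(K):
            members = X[assign == k]
            if len(members):                                    # keep old mu_k if cluster is empty
                mu[k] = members.mean(axis=0)
    return mu, assign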

Remarks
Prototype µ_k is the mean of the points assigned to cluster k, hence the name K-means.
The procedure reduces J in both Step 1 and Step 2, and thus improves the objective on each iteration.
There is no guarantee that we find the global optimum; the quality of the local optimum depends on the initial values chosen at Step 0. k-means++ is a principled approximation algorithm for choosing this initialization (a sketch of its seeding rule follows).
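The slide only names k-means++; for concreteness, here is a sketch of its standard D^2 seeding rule: the first center is chosen uniformly at random, and each later center is sampled with probability proportional to its squared distance to the nearest center chosen so far. The function name kmeanspp_init and the NumPy details are illustrative.

import numpy as np

def kmeanspp_init(X, K, seed=0):
    # k-means++ seeding (D^2 sampling).
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                      # first center: uniform at random
    for _ in range(1, K):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)  # squared distance to nearest center
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])       # sample next center proportionally to d2
    return np.array(centers)

One could then replace the Step 0 initialization in the K-means sketch above with these centers.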

Probabilistic interpretation of clustering?
How can we model p(x) to reflect our intuition that points stay close to their cluster centers?
[Figure (b): the points seem to form 3 clusters.]
We cannot model p(x) with a simple, known distribution. E.g., the data is not a single Gaussian, because we have 3 distinct concentrated regions.

Gaussian mixture models: intuition
[Figure (a).]
Model each region with a distinct distribution. We can use Gaussians, which gives Gaussian mixture models (GMMs).
But we don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components! We must learn them from the unlabeled data D = {x_n}_{n=1}^N.

Gaussian mixture models: formal definition
A GMM has the following density function for x:
p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)
K: the number of Gaussians; they are called the mixture components.
µ_k and Σ_k: the mean and covariance matrix of the k-th component.
ω_k: the mixture weights (or priors); they represent how much each component contributes to the final distribution. They satisfy two properties: ω_k > 0 for all k, and Σ_k ω_k = 1. These properties ensure that p(x) is in fact a probability density function.
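As a small illustration (not from the slides), the density can be evaluated directly from this definition; the helper name gmm_density and the two-component example parameters below are made up for the sketch, which uses SciPy's multivariate normal pdf for N(x | µ_k, Σ_k).

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, omegas, mus, Sigmas):
    # Evaluate p(x) = sum_k omega_k N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=S)
               for w, mu, S in zip(omegas, mus, Sigmas))

# Example: a two-component mixture in 2-D; the weights are positive and sum to 1
omegas = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), omegas, mus, Sigmas))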

GMM as the marginal distribution of a joint distribution
Consider the joint distribution
p(x, z) = p(z) p(x | z)
where z is a discrete random variable taking values between 1 and K. Denote ω_k = p(z = k), and assume the conditional distributions are Gaussian:
p(x | z = k) = N(x | µ_k, Σ_k)
Then the marginal distribution of x is
p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)
namely, the Gaussian mixture model.
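This generative view also says how to sample from a GMM: draw z from the categorical distribution with weights ω, then draw x from the corresponding Gaussian; discarding z leaves samples from the marginal p(x). A minimal sketch, where the function name sample_gmm is illustrative and components are indexed from 0 for convenience:

import numpy as np

def sample_gmm(n, omegas, mus, Sigmas, seed=0):
    # Draw n samples from the joint p(x, z) = p(z) p(x | z).
    rng = np.random.default_rng(seed)
    K = len(omegas)
    z = rng.choice(K, size=n, p=omegas)                                    # z ~ Categorical(omega)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x | z = k ~ N(mu_k, Sigma_k)
    return x, z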

GMMs: example
[Figure (a): data points colored red, blue, and green.]
The conditional distributions of x given z (where z represents the color) are
p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)
[Figure (b): the resulting mixture density.]
The marginal distribution is thus
p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)

Parameter estimation for Gaussian mixture models
The parameters in GMMs are θ = {ω_k, µ_k, Σ_k}_{k=1}^K.
Let's first consider the simple (unrealistic) case where we have the labels z. Define D' = {x_n, z_n}_{n=1}^N; D' is the complete data, and D the incomplete data. How can we learn our parameters?
Given D', the maximum likelihood estimate of θ is given by
θ = arg max_θ log p(D') = arg max_θ Σ_n log p(x_n, z_n)

Parameter estimation for GMMs: complete data
The complete-data likelihood is decomposable:
Σ_n log p(x_n, z_n) = Σ_n log p(z_n) p(x_n | z_n) = Σ_k Σ_{n: z_n = k} log p(z_n) p(x_n | z_n)
where we have grouped the data by the values of z_n.
Let γ_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k. Then
Σ_n log p(x_n, z_n) = Σ_n Σ_k γ_nk log p(z = k) p(x_n | z = k)
                    = Σ_k Σ_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Parameter estimation for GMMs: complete data
From the previous discussion, we have
Σ_n log p(x_n, z_n) = Σ_k Σ_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]
Regrouping, we have
Σ_n log p(x_n, z_n) = Σ_k Σ_n γ_nk log ω_k + Σ_k { Σ_n γ_nk log N(x_n | µ_k, Σ_k) }
The term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:
ω_k = (Σ_n γ_nk) / (Σ_k Σ_n γ_nk),   µ_k = (1 / Σ_n γ_nk) Σ_n γ_nk x_n
Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n - µ_k)(x_n - µ_k)^T
What's the intuition?

Intuition
Since γ_nk is binary, the previous solution is nothing but:
ω_k: the fraction of the total data points whose z_n is k (note that Σ_k Σ_n γ_nk = N)
µ_k: the mean of all data points whose z_n is k
Σ_k: the covariance of all data points whose z_n is k
This intuition will help us develop an algorithm for estimating θ when we do not know z_n (incomplete data). A small sketch of the complete-data MLE follows.
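To make the complete-data estimates concrete, here is a minimal NumPy sketch; the function name mle_complete_data and the convention that labels take values 0, ..., K-1 are illustrative assumptions.

import numpy as np

def mle_complete_data(X, z, K):
    # MLE of {omega_k, mu_k, Sigma_k} when the labels z_n are observed.
    # X: (N, d) data; z: (N,) integer labels in {0, ..., K-1}, each value appearing at least once.
    N, d = X.shape
    omegas, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                      # all points whose z_n is k
        Nk = len(Xk)                        # sum_n gamma_nk
        omegas.append(Nk / N)               # fraction of points in component k
        mu = Xk.mean(axis=0)                # mean of points in component k
        mus.append(mu)
        diff = Xk - mu
        Sigmas.append(diff.T @ diff / Nk)   # covariance of points in component k
    return np.array(omegas), np.array(mus), np.array(Sigmas)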

Parameter estimation for GMMs: incomplete data
When z_n is not given, we can guess it via the posterior probability
p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                 = p(x_n | z_n = k) p(z_n = k) / Σ_{k'=1}^K p(x_n | z_n = k') p(z_n = k')
To compute the posterior probability, we need to know the parameters θ! Let's pretend we know the values of the parameters, so that we can compute the posterior probability. How is that going to help us?
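A minimal sketch of this posterior computation, assuming SciPy for the Gaussian densities; the function name responsibilities and the vectorized layout are illustrative choices.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omegas, mus, Sigmas):
    # Posterior gamma_nk = p(z_n = k | x_n) for each point, given parameters theta.
    N, K = len(X), len(omegas)
    gamma = np.empty((N, K))
    for k in range(K):
        # numerator: p(x_n | z_n = k) p(z_n = k)
        gamma[:, k] = omegas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # divide by p(x_n) = sum over k'
    return gamma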

Estimation with soft γ_nk
We define γ_nk = p(z_n = k | x_n).
Recall that γ_nk was previously binary. Now it is a soft assignment of x_n to the k-th component: each x_n is assigned to a component fractionally, according to p(z_n = k | x_n).
We now get the same expressions for the MLE as before!
ω_k = (Σ_n γ_nk) / (Σ_k Σ_n γ_nk),   µ_k = (1 / Σ_n γ_nk) Σ_n γ_nk x_n
Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n - µ_k)(x_n - µ_k)^T
But remember, we're cheating by using θ to compute γ_nk!

Iterative procedure
Alternate between estimating γ_nk and computing the parameters:
Step 0: initialize θ with some values (random or otherwise)
Step 1: compute γ_nk using the current θ
Step 2: update θ using the just-computed γ_nk
Step 3: go back to Step 1
This is an example of the EM algorithm, a powerful procedure for model estimation with hidden/latent variables. (A sketch of the full loop follows below.)
Connection with K-means? GMMs provide a probabilistic interpretation of K-means: K-means is a "hard" GMM, or equivalently, a GMM is a "soft" K-means. The posterior γ_nk provides a probabilistic assignment of x_n to cluster k.
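Putting the E-step and M-step together, here is a minimal sketch of the full iterative procedure for a GMM; the function name em_gmm, the initialization choices (means drawn from the data, identity covariances, uniform weights), and the fixed number of iterations are illustrative assumptions, not details from the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    # EM for a GMM: alternate E-step (soft gamma_nk) and M-step (update theta).
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: initialize theta
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.eye(d) for _ in range(K)])
    omegas = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # Step 1 (E-step): gamma_nk = p(z_n = k | x_n) under the current theta
        gamma = np.column_stack([
            omegas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # Step 2 (M-step): same formulas as the complete-data MLE, with soft gamma
        Nk = gamma.sum(axis=0)                      # sum_n gamma_nk
        omegas = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return omegas, mus, Sigmas, gamma

A practical implementation would also monitor the log-likelihood to decide convergence and add a small ridge to each Σ_k to keep it well conditioned; replacing the soft γ_nk with a hard arg-max assignment recovers a K-means-like update, which is the "hard GMM" connection mentioned above.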