Algorithms for Clustering

CR2: Statistical Learning & Applications. Lecturer: J. Salmon. Scribe: A. Alcolei.

Setting: given a data set $X \subset \mathbb{R}^p$, where $n$ is the number of observations and $p$ is the number of features, we want to separate these data into $K$ classes (clusters), i.e. we want to learn:

1. the centroid (center) of each cluster,
2. an assignation function $A : \{1, \ldots, n\} \to \{1, \ldots, K\}$, meaning sample $x_i$ belongs to class $A(i)$.

Figure 1: A simple representation of the situation ($n = 25$, $p = 2$, $K = 3$).

1 The K-means Algorithm

1.1 The Algorithm

The K-means algorithm (Algorithm 1) computes $K$ clusters of an input data set such that the average (squared) distance from a point to the centre of its cluster, i.e. the inertia, is minimized.

Algorithm 1: The K-means Algorithm
Input: a data set $X = \{x_1, \ldots, x_n\}$ ($x_i \in \mathbb{R}^p$).
Output: a partition $M = \{X_1, \ldots, X_K\}$ of $X$ together with the centroids $c_1, \ldots, c_K$ of each cluster.
Initialization: choose $c_1, \ldots, c_K$ in $X$ at random.
Repeat until convergence:
    for $j = 1 \ldots K$ do $X_j \leftarrow \emptyset$ done
    assignment step: for $i = 1 \ldots n$ do
        $A(x_i) \leftarrow \arg\min_{j \in \{1, \ldots, K\}} \|x_i - c_j\|^2$
        $X_{A(x_i)} \leftarrow X_{A(x_i)} \cup \{x_i\}$
    done
    re-estimation step: for $j = 1 \ldots K$ do
        $n_j \leftarrow \#\{x_i \in X_j\}$
        $c_j \leftarrow \frac{1}{n_j} \sum_{x_i \in X_j} x_i$
    done
return $M$, $c_1, \ldots, c_K$

Theorem 1. K-means monotonically decreases the inertia
$$\sum_{j=1}^{K} \sum_{x_i \in X_j} \|x_i - c_j\|^2.$$

Proof. Let $\psi(X^{(t)}) = \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t)}} \|x_i - c_j^{(t)}\|^2$, where $X^{(t)}$ is the current partition $X_1^{(t)}, \ldots, X_K^{(t)}$ with centroids $c_1^{(t)}, \ldots, c_K^{(t)}$ and assignation function $A^{(t)}$. Then
$$\psi(X^{(t)}) = \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t)}} \|x_i - c_j^{(t)}\|^2 \;\geq\; \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t)}} \|x_i - c_{A^{(t+1)}(x_i)}^{(t)}\|^2 \quad \text{(since $A^{(t+1)}(x_i)$ minimizes $\|x_i - c_j^{(t)}\|^2$ over all $j \in \{1, \ldots, K\}$)}$$
$$\geq\; \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t+1)}} \|x_i - c_j^{(t+1)}\|^2 = \psi(X^{(t+1)}) \quad \text{(since $c_j^{(t+1)}$ minimizes $\sum_{x_i \in X_j^{(t+1)}} \|x_i - c_j\|^2$ over all $c_j$)}.$$

Corollary 1. K-means stops after a finite number of steps.

Proof. There is no infinite sequence of partitions along which the inertia decreases strictly, since there is only a finite number of partitions (at most $K^n$). Thus the sequence $(\psi(X^{(t)}))_{t \in \mathbb{N}}$ takes finitely many values, i.e. there exists $t$ such that $\psi(X^{(t+1)}) = \psi(X^{(t)})$. This implies that at step $t$, $X^{(t+1)} = X^{(t)}$, otherwise some elements would be wrongly classified.

Remark. The above corollary does not tell us anything about how quickly the algorithm converges; we only have an exponential bound (of the order of the number of possible partitions). The time needed for the algorithm to converge depends on the initialization, and some heuristics can be found in the literature to get better results. Similarly, the solution found by the algorithm is only a local optimum, since in general the inertia over all partitions is not a convex function. The result depends on the initialization, so it might be useful to run the algorithm several times and pick the best result as a final answer.
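To make the pseudocode concrete, here is a minimal NumPy sketch of Algorithm 1. It is not part of the original notes: the function name `kmeans`, the iteration cap, the seed handling and the naive treatment of empty clusters are our own choices.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=None):
    """Minimal K-means sketch: X has shape (n, p); returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Initialization: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(n, size=K, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # Assignment step: A(x_i) = argmin_j ||x_i - c_j||^2.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # the partition did not change, so the inertia cannot decrease further
        assign = new_assign
        # Re-estimation step: each centroid becomes the mean of its cluster.
        for j in range(K):
            members = X[assign == j]
            if len(members) > 0:  # keep the previous centroid if a cluster becomes empty
                centroids[j] = members.mean(axis=0)
    return assign, centroids

# Toy usage with three well-separated blobs in R^2, in the spirit of Figure 1 (K = 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(25, 2))
               for c in [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]])
labels, centers = kmeans(X, K=3, seed=0)
```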

It is possible to parametrize the K-means algorithm, for example by changing the way the distance between two points is measured, or by projecting points onto random coordinates if the feature space is of high dimension.

1.2 Kernelised K-means

We change the previous algorithm so as to minimize in the reproducing kernel Hilbert space $\mathcal{H}$ associated to $\mathbb{R}^p$ instead of minimizing in $\mathbb{R}^p$. Using the feature map $\varphi : \mathbb{R}^p \to \mathcal{H}$, the algorithm remains the same except for:

- The initialization step: we choose $c_1, \ldots, c_K$ in $\mathcal{H}$ instead of $\mathbb{R}^p$.
- The assignment step: we compute $A(x_i) \leftarrow \arg\min_{j \in \{1, \ldots, K\}} \|\varphi(x_i) - c_j\|^2$ instead of $A(x_i) \leftarrow \arg\min_{j \in \{1, \ldots, K\}} \|x_i - c_j\|^2$.

Remark. We do not need to compute $\varphi(x_i)$ explicitly for each $x_i \in X$; all we need to know are the inner products $\langle \varphi(x_i), \varphi(x_j) \rangle$ for every pair $x_i, x_j \in X$.
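As an illustration of this remark (not from the lecture), here is a sketch of the kernelised assignment step computed from the Gram matrix alone, assuming each centroid $c_j$ is the mean of the mapped points currently in cluster $j$; the RBF kernel and the helper names are our own choices.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix G[i, l] = exp(-gamma * ||x_i - x_l||^2) (one possible kernel choice)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_assignment_step(G, assign, K):
    """One assignment step of kernelised K-means, using only the Gram matrix G."""
    n = G.shape[0]
    dists = np.zeros((n, K))
    for j in range(K):
        members = np.where(assign == j)[0]
        if len(members) == 0:
            dists[:, j] = np.inf  # empty cluster: never the arg min
            continue
        cross = G[:, members].mean(axis=1)           # mean over l of <phi(x_i), phi(x_l)>
        within = G[np.ix_(members, members)].mean()  # ||c_j||^2 in the feature space
        dists[:, j] = np.diag(G) - 2.0 * cross + within
    return dists.argmin(axis=1)
```

The computation relies on expanding $\|\varphi(x_i) - c_j\|^2 = \langle \varphi(x_i), \varphi(x_i)\rangle - \frac{2}{|X_j|}\sum_{x_l \in X_j}\langle \varphi(x_i), \varphi(x_l)\rangle + \frac{1}{|X_j|^2}\sum_{x_l, x_m \in X_j}\langle \varphi(x_l), \varphi(x_m)\rangle$, so the images $\varphi(x_i)$ and the centroids in $\mathcal{H}$ are never formed explicitly.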

2 Gaussian Mixture and EM Algorithm

2.1 Gaussian maximum likelihood

The density of a Gaussian random variable over $\mathbb{R}^p$ is given by
$$\varphi_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right),$$
where $\mu \in \mathbb{R}^p$ is the mean of the variable and $\Sigma \in \mathbb{R}^{p \times p}$ is the covariance matrix. $\Sigma$ is positive definite, so $\mathrm{rk}(\Sigma) = p$. This formula satisfies the conditions for being a probability distribution:

1. $\forall x \in \mathbb{R}^p$, $\varphi_{\mu,\Sigma}(x) \geq 0$;
2. $\int_{\mathbb{R}^p} \varphi_{\mu,\Sigma}(x)\, dx = 1$.

Example. For $p = 1$, $\Sigma = \sigma^2$, $\mu = 0$:
$$\varphi_{0,\sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{x^2}{2\sigma^2} \right)$$
(cf. the figure below for different values of $\sigma^2$).

Example. For $p = 2$, $\Sigma \in \mathbb{R}^{2 \times 2}$, $\mu = (0, 0)^\top$, the contour lines are described, for all $c \in \mathbb{R}$, by
$$\{x \in \mathbb{R}^p \mid \varphi_{\mu,\Sigma}(x) = c\} = \{x \in \mathbb{R}^p \mid \ln(\varphi_{\mu,\Sigma}(x)) = c'\} \quad (\text{for } c' = \ln(c))$$
$$= \left\{x \in \mathbb{R}^p \;\middle|\; -\ln\!\left(\sqrt{(2\pi)^p \det(\Sigma)}\right) - \frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) = c'\right\}$$
$$= \left\{x \in \mathbb{R}^p \;\middle|\; \sum_{i=1}^{p} \sum_{j=1}^{p} x_i x_j \alpha_{ij} + c'' = 0\right\} \quad (\text{for some } \alpha_{ij}, c'' \text{ depending on } \Sigma \text{ and } c)$$
(cf. the figures below for $\Sigma = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$ and for general $\Sigma \in \mathbb{R}^{2 \times 2}$, for different values of $c$).

Figure: Gaussian densities for $\sigma^2 = 1$, $\sigma^2 = 1/2$ and $\sigma^2 = 4$, each centred at $\mu$.

In statistical machine learning we are interested in the following problem: suppose you observe $(X_1, X_2, \ldots, X_n) \overset{iid}{\sim} \varphi_{\mu,\Sigma}$; can you estimate $\mu$ and $\Sigma$? (iid stands for independent and identically distributed.)

Idea: let $\varphi_{\mu,\Sigma}(X_1, \ldots, X_n) := \prod_{i=1}^{n} \varphi_{\mu,\Sigma}(X_i)$; we want to find $(\hat{\mu}, \hat{\Sigma}) \in \arg\max_{\mu, \Sigma} \varphi_{\mu,\Sigma}(X_1, \ldots, X_n)$. The quantity $\varphi_{\mu,\Sigma}(X_1, \ldots, X_n)$, seen as a function of $\mu$ and $\Sigma$, is called the likelihood. The pair $(\hat{\mu}, \hat{\Sigma})$ is called the maximum likelihood estimator.

Example. For $p = 1$, $\Sigma = 1$, we have $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$ (cf. the figure comparing $\varphi_{\mu,1}$ and $\varphi_{\hat{\mu},1}$).

Proposition 1. The empirical mean and the empirical covariance are good estimators, i.e.
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})(X_i - \hat{\mu})^\top.$$

Proof. We only show the first equality. Finding $(\hat{\mu}, \hat{\Sigma}) \in \arg\max \varphi_{\mu,\Sigma}(X_1, \ldots, X_n)$ is equivalent to finding
$$(\hat{\mu}, \hat{\Sigma}) \in \arg\min \left[ -\ln\left( \varphi_{\mu,\Sigma}(X_1, \ldots, X_n) \right) \right]. \quad (1)$$
Yet (1) is easier to solve since it involves minimizing over a sum rather than maximizing over a product:
$$(1) = \arg\min \left[ c + \frac{1}{2} \sum_{i=1}^{n} \mathrm{tr}\left( (X_i - \mu)^\top \Sigma^{-1} (X_i - \mu) \right) + \frac{n}{2} \ln(\det(\Sigma)) \right],$$
where $c$ is some constant that does not depend on $\mu$ or $\Sigma$. Thus, fixing $\Sigma$, we get
$$(1) = \arg\min_{\mu} \left[ \frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^\top \Sigma^{-1} (X_i - \mu) \right].$$
$\sum_{i=1}^{n} (X_i - \mu)^\top \Sigma^{-1} (X_i - \mu)$ is a convex function of $\mu$, so its global minimum $\hat{\mu}$ is the unique point that satisfies
$$\frac{\partial}{\partial \mu} \left( \sum_{i=1}^{n} (X_i - \hat{\mu})^\top \Sigma^{-1} (X_i - \hat{\mu}) \right) = 0.$$
This implies that $\sum_{i=1}^{n} \Sigma^{-1} (X_i - \hat{\mu}) = 0$, that is $\sum_{i=1}^{n} X_i = n \hat{\mu}$, and so $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
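As a quick numerical illustration of Proposition 1 (our own sketch, not part of the notes), the estimators can be computed in closed form; note the $\frac{1}{n}$ normalization of $\hat{\Sigma}$, which is the maximum likelihood choice rather than the unbiased $\frac{1}{n-1}$ variant. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# True parameters of a 2-dimensional Gaussian (illustrative values).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Draw an iid sample X_1, ..., X_n ~ N(mu, Sigma).
n = 5000
X = rng.multivariate_normal(mu, Sigma, size=n)

# Maximum likelihood estimators from Proposition 1.
mu_hat = X.mean(axis=0)               # (1/n) sum_i X_i
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n # (1/n) sum_i (X_i - mu_hat)(X_i - mu_hat)^T

print(mu_hat)     # close to mu for large n
print(Sigma_hat)  # close to Sigma for large n
```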

2.2 Mixture

We refine the model presented above by regarding the density of $(X_1, \ldots, X_n)$ as a mixture of $K$ weighted Gaussian densities $\varphi_{\mu_k,\Sigma_k}$ over $\mathbb{R}^p$:
$$(X_1, \ldots, X_n) \overset{iid}{\sim} f(x) = \sum_{k=1}^{K} \pi_k \varphi_{\mu_k,\Sigma_k}(x),$$
where $\pi_k$ is the weight associated to $\varphi_{\mu_k,\Sigma_k}$.

Example. In $\mathbb{R}^2$, for $K = 3$ and $\pi_k = \frac{1}{3}$, we could have a distribution like the one in the figure below (three components centred at $\mu_1$, $\mu_2$, $\mu_3$).

Drawing $x \in \mathbb{R}^p$ according to the distribution of the Gaussian mixture $f$ is equivalent to drawing $x$ as follows (hierarchical way; a small sampling sketch is given below):

1. draw $k$ with probability $\{\pi_1, \ldots, \pi_K\}$ over the elements of $\{1, \ldots, K\}$;
2. draw $x \in \mathbb{R}^p$ according to the distribution associated to $k$, i.e. according to $\varphi_{\mu_k,\Sigma_k}$.
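The hierarchical sampling scheme above takes only a few lines of NumPy (an illustrative sketch with made-up weights, means and covariances):

```python
import numpy as np

def sample_gaussian_mixture(n, pis, mus, Sigmas, seed=None):
    """Draw n points from f(x) = sum_k pi_k * N(mu_k, Sigma_k), hierarchically."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    # Step 1: draw the component index k for each sample, with probabilities pi_1, ..., pi_K.
    ks = rng.choice(K, size=n, p=pis)
    # Step 2: draw each x from the Gaussian density of its component.
    samples = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return samples, ks

# Example in R^2 with K = 3 equal weights, as in the figure above (parameters are arbitrary).
pis = [1 / 3, 1 / 3, 1 / 3]
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2) * s for s in (0.5, 1.0, 0.25)]
X, ks = sample_gaussian_mixture(500, pis, mus, Sigmas, seed=2)
```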

The problem of finding the mixture of $K$ Gaussian distributions from a given set of samples $(X_1, \ldots, X_n)$ can be seen as a generalization of the K-means problem where the distance to the centre of a cluster changes according to the index of the cluster. The Expectation-Maximization algorithm (EM, Algorithm 2) can thus be viewed as a generalization of the K-means algorithm, where the value to maximize is
$$\varphi(\theta) = f_\theta(X_1, \ldots, X_n) = \prod_{i=1}^{n} f_\theta(X_i).$$

We have the same kind of termination property:

Proposition 2. Let $\theta^{(t)}$ be the iterates of the EM algorithm and $\varphi(\theta^{(t)})$ their corresponding likelihood values. Then for all $t$, $\varphi(\theta^{(t+1)}) \geq \varphi(\theta^{(t)})$.

Proof. We do not give a complete proof here. The idea is the following: since maximizing the likelihood $\varphi(\theta) = f_\theta(X_1, \ldots, X_n)$ is hard, we instead maximize the log-likelihood $L(\theta) = \ln(\varphi(\theta)) = \sum_{i=1}^{n} \ln(f_\theta(X_i))$. This is still hard to evaluate, except if we knew from which Gaussian density inside the Gaussian mixture each $X_i$ was drawn. Thus for each $i \in \{1, \ldots, n\}$ we define $z_i$ to be the hidden random variable that indicates whether $X_i$ is drawn from the $j$-th Gaussian density, with probability $p_{ij}$ ($\sum_{j=1}^{K} p_{ij} = 1$), and we try to maximize the parametrized log-likelihood
$$L\big(\theta, (p_{ij})_{1 \leq i \leq n,\, 1 \leq j \leq K}\big) = \sum_{i=1}^{n} \ln\left( \sum_{j=1}^{K} \mathbb{1}_{z_i = j}\, f_{\theta_j}(X_i) \right).$$

Remark. Once again, the answer provided by the EM algorithm is only a local optimum and depends on the initialization. In practice, the EM algorithm is used for recovering missing or incomplete data.

Algorithm 2: The Expectation-Maximization Algorithm
Input: a data set $X = \{x_1, \ldots, x_n\}$ ($x_i \in \mathbb{R}^p$).
Output: $\theta := (\pi_1, \ldots, \pi_K,\ \mu_1, \ldots, \mu_K,\ \Sigma_1, \ldots, \Sigma_K)$, a set of weights and Gaussian densities that locally maximize the probability of the $x_i$'s being drawn from the corresponding Gaussian mixture $f_\theta(x) = \sum_{k=1}^{K} \pi_k \varphi_{\mu_k,\Sigma_k}(x)$.
Initialization: choose $\theta := (\pi_1, \ldots, \pi_K,\ \mu_1, \ldots, \mu_K,\ \Sigma_1, \ldots, \Sigma_K)$ at random.
Let $p_{i,j}$ denote the probability that $x_i$ comes from the $j$-th class.
Repeat until convergence:
    estimation step: for $i = 1 \ldots n$, for $j = 1 \ldots K$ do
        $p_{i,j} \leftarrow \dfrac{\pi_j \varphi_{\mu_j,\Sigma_j}(x_i)}{f_\theta(x_i)} = \dfrac{\pi_j \varphi_{\mu_j,\Sigma_j}(x_i)}{\sum_{k=1}^{K} \pi_k \varphi_{\mu_k,\Sigma_k}(x_i)}$
    done
    maximization step: for $j = 1 \ldots K$ do
        $\pi_j \leftarrow \dfrac{1}{n} \sum_{i=1}^{n} p_{i,j}$
        $\mu_j \leftarrow \dfrac{\sum_{i=1}^{n} p_{i,j}\, x_i}{\sum_{i=1}^{n} p_{i,j}}$
        $\Sigma_j \leftarrow \dfrac{\sum_{i=1}^{n} p_{i,j}\, (x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^{n} p_{i,j}}$
    done
return $\theta$
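Below is a compact NumPy sketch of Algorithm 2. It is our own illustrative implementation rather than the lecture's reference code: the log-likelihood stopping test, the small regularization term added to each $\Sigma_j$, and the use of random data points to initialize the means are choices made here for numerical robustness.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density phi_{mu,Sigma} from Section 2.1, evaluated at each row of X."""
    p = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=None, reg=1e-6):
    """EM for a Gaussian mixture: returns weights, means, covariances and responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform weights, means drawn from the data, data covariance for each class.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X, rowvar=False) + reg * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Estimation (E) step: p_{i,j} = pi_j phi_j(x_i) / sum_k pi_k phi_k(x_i).
        dens = np.stack([pis[j] * gaussian_pdf(X, mus[j], Sigmas[j]) for j in range(K)], axis=1)
        p = dens / dens.sum(axis=1, keepdims=True)
        # Maximization (M) step: update weights, means and covariances.
        Nj = p.sum(axis=0)                  # effective number of points per component
        pis = Nj / n
        mus = (p.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mus[j]
            Sigmas[j] = (p[:, j, None] * diff).T @ diff / Nj[j] + reg * np.eye(d)
        # Stop when the log-likelihood (the quantity EM increases) no longer improves.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, p

# Usage on the mixture sample drawn in the previous sketch (assuming X is still in scope):
# pis_hat, mus_hat, Sigmas_hat, resp = em_gmm(X, K=3, seed=3)
```

Taking, for each $x_i$, the class $j$ with the largest responsibility $p_{i,j}$ recovers a hard clustering, which is how EM degenerates to K-means when all covariances are a common, vanishing multiple of the identity.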