CR2: Statistical Learning & Applications
Algorithms for Clustering
Lecturer: J. Salmon    Scribe: A. Alcolei

Setting: given a data set $X \subset \mathbb{R}^p$, where $n$ is the number of observations and $p$ is the number of features, we want to separate these data into $K$ classes (clusters), i.e. we want to learn:
1. the centroid (center) of each cluster;
2. an assignation function $A : \{1,\dots,n\} \to \{1,\dots,K\}$, meaning sample $x_i$ belongs to class $A(i)$.

Figure 1: A simple representation of the situation ($n = 25$, $p = 2$, $K = 3$)

1 The K-means Algorithm

1.1 The Algorithm

The K-means algorithm (Algorithm 1) computes $K$ clusters of an input data set, such that the average (squared) distance from a point to the centre of its cluster, i.e. the inertia, is minimized.

Theorem. K-means monotonically decreases the inertia
$$\sum_{j=1}^{K} \sum_{x_i \in X_j} \|x_i - c_j\|^2.$$

Proof. Let $\psi(X^{(t)}) = \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t)}} \|x_i - c_j^{(t)}\|^2$, where $X^{(t)}$ is the current partition $X_1^{(t)}, \dots, X_K^{(t)}$ with centroids
Algorithm 1 The K-means Algorithm
Input: a data set $X = \{x_1, \dots, x_n\}$ ($x_i \in \mathbb{R}^p$).
Output: a partition $M = \{X_1, \dots, X_K\}$ of $X$ together with the centroids $c_1, \dots, c_K$ of each cluster.
Initialization: choose $c_1, \dots, c_K$ in $X$ at random.
Repeat until convergence:
  for $j = 1 \dots K$ do $X_j \leftarrow \emptyset$ done
  assignment step: for $i = 1 \dots n$ do
    $A(x_i) \leftarrow \arg\min_{j \in \{1,\dots,K\}} \|x_i - c_j\|^2$
    $X_{A(x_i)} \leftarrow X_{A(x_i)} \cup \{x_i\}$
  done
  re-estimation step: for $j = 1 \dots K$ do
    $c_j \leftarrow \frac{1}{|X_j|} \sum_{x_i \in X_j} x_i$
  done
return $M, c_1, \dots, c_K$

$c_1^{(t)}, \dots, c_K^{(t)}$ and assignation function $A^{(t)}$. Then

$$\psi(X^{(t)}) = \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t)}} \|x_i - c_j^{(t)}\|^2 \;\geq\; \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t+1)}} \|x_i - c^{(t)}_{A^{(t+1)}(x_i)}\|^2$$
(since $A^{(t+1)}(x_i)$ minimizes the quantity $\|x_i - c_j^{(t)}\|^2$ over all $j \in \{1,\dots,K\}$)
$$\geq\; \sum_{j=1}^{K} \sum_{x_i \in X_j^{(t+1)}} \|x_i - c_j^{(t+1)}\|^2 \;=\; \psi(X^{(t+1)})$$
(since $c_j^{(t+1)}$ minimizes the quantity $\sum_{x_i \in X_j^{(t+1)}} \|x_i - c_j\|^2$ over all choices of $c_j$).

Corollary. K-means stops after a finite number of steps.

Proof. There is no infinite sequence of partitions along which the inertia strictly decreases, since there is only a finite number of partitions (at most $K^n$). Thus the sequence $(\psi(X^{(t)}))_{t \in \mathbb{N}}$ takes only finitely many values, i.e. there exists $t$ such that $\psi(X^{(t+1)}) = \psi(X^{(t)})$. This implies that at step $t$, $X^{(t+1)} = X^{(t)}$; otherwise some elements would be wrongly classified.

Remark. The above corollary does not say anything about how quickly the algorithm converges; we only have an exponential bound ($O(K^n)$ iterations). The time needed for the algorithm to converge depends on the initialization; some heuristics can be found in the literature to get better results. Similarly, the solution found by the algorithm is only a local optimum, since in general the inertia over all partitions is not a convex function. The result depends on the initialization, so it can be useful to run the algorithm several times and pick the best result as the final answer.
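As a concrete illustration, Algorithm 1 can be sketched in a few lines of NumPy (a minimal sketch; the function name and the handling of empty clusters, which simply keep the old centroid, are our own choices, not part of the algorithm as stated):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is an (n, p) array; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Initialization: choose c_1, ..., c_K among the data points at random.
    centroids = X[rng.choice(n, size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: send each x_i to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        A = dists.argmin(axis=1)
        # Re-estimation step: each centroid becomes the mean of its cluster.
        new_centroids = centroids.copy()
        for j in range(K):
            if (A == j).any():
                new_centroids[j] = X[A == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # partition is stable: the inertia can no longer decrease
        centroids = new_centroids
    return A, centroids
```

As the remark suggests, running this several times with different seeds and keeping the run with the lowest inertia mitigates the local-optimum issue.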
It is possible to parametrize the K-means algorithm, for example by changing the way the distance between two points is measured, or by projecting points on random coordinates if the feature space is of high dimension.

1.2 Kernelised K-means

We change the previous algorithm so as to minimize in the reproducing kernel Hilbert space $H$ associated to $\mathbb{R}^p$ instead of minimizing in $\mathbb{R}^p$. Using $\varphi : \mathbb{R}^p \to H$, the algorithm remains the same except for:
- the initialization step: we choose $c_1, \dots, c_K$ in $H$ instead of $\mathbb{R}^p$;
- the assignment step: we compute $A(x_i) \leftarrow \arg\min_{j \in \{1,\dots,K\}} \|\varphi(x_i) - c_j\|^2$ instead of $A(x_i) \leftarrow \arg\min_{j \in \{1,\dots,K\}} \|x_i - c_j\|^2$.

Remark. We do not need to compute $\varphi(x_i)$ explicitly for each $x_i \in X$; all we need to know are the values $\langle \varphi(x_i), \varphi(x_j) \rangle$ for every pair $x_i, x_j \in X$.

2 Gaussian Mixture and EM Algorithm

2.1 Gaussian maximum likelihood

The density of a Gaussian random variable over $\mathbb{R}^p$ is given by
$$\varphi_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\Big(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\Big),$$
where $\mu$ is the mean of the variable ($\mu \in \mathbb{R}^p$) and $\Sigma$ is the covariance matrix ($\Sigma \in \mathbb{R}^{p \times p}$). $\Sigma$ is positive definite, so $\mathrm{rk}(\Sigma) = p$. This formula satisfies the conditions for being a probability distribution:
1. $\forall x \in \mathbb{R}^p$, $\varphi_{\mu,\Sigma}(x) \geq 0$;
2. $\int_{\mathbb{R}^p} \varphi_{\mu,\Sigma}(x)\,dx = 1$.

Example. For $p = 1$, $\Sigma = \sigma^2$, $\mu = 0$:
$$\varphi_{\mu,\Sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{x^2}{2\sigma^2}\Big)$$
(cf. figure below for different values of $\sigma^2$). For $p = 2$, $\Sigma \in \mathbb{R}^{2\times 2}$, $\mu = \binom{0}{0}$, the contour lines are described for all $c \in \mathbb{R}$ by
$$\{x \in \mathbb{R}^p \mid \varphi_{\mu,\Sigma}(x) = c\} = \{x \in \mathbb{R}^p \mid \ln(\varphi_{\mu,\Sigma}(x)) = c'\} \quad (\text{for } c' = \ln(c))$$
$$= \Big\{x \in \mathbb{R}^p \mid -\ln\big(\sqrt{(2\pi)^p \det(\Sigma)}\big) - \tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu) = c'\Big\}$$
$$= \Big\{x \in \mathbb{R}^p \mid \sum_{i=1}^{p} \sum_{j=1}^{p} x_i x_j \alpha_{ij} + c'' = 0\Big\} \quad (\text{for some } \alpha_{ij}, c'' \text{ depending on } \Sigma \text{ and } c)$$
(cf. figures below for $\Sigma = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$ and for general $\Sigma \in \mathbb{R}^{2\times 2}$, for different values of $c$).
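The density formula above translates directly into code (a small sketch with a function name of our own; the quadratic form is computed with a linear solve rather than an explicit inverse):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, for x, mu in R^p and Sigma positive definite."""
    p = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm
```

For $p = 1$, $\sigma^2 = 1$, $\mu = 0$, this recovers the familiar value $\varphi(0) = 1/\sqrt{2\pi}$.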
[Figure: the one-dimensional density $\varphi_{\mu,\sigma^2}$ centered at $\mu$, for $\sigma^2 = 1$, $\sigma^2 = 1/2$ and $\sigma^2 = 4$.]

In statistical machine learning we are interested in the following problem: suppose you observe $(X_1, X_2, \dots, X_n) \overset{iid}{\sim} \varphi_{\mu,\Sigma}$; can you estimate $\mu$ and $\Sigma$? (iid stands for independent and identically distributed.)

Idea: let $\varphi_{\mu,\Sigma}(X_1, \dots, X_n) := \prod_{i=1}^{n} \varphi_{\mu,\Sigma}(X_i)$; we want to find $(\hat\mu, \hat\Sigma) \in \arg\max \varphi_{\mu,\Sigma}(X_1, \dots, X_n)$. The quantity $\varphi_{\mu,\Sigma}(X_1, \dots, X_n)$, seen as a function of $\mu$ and $\Sigma$, is called the likelihood. The pair $(\hat\mu, \hat\Sigma)$ is called the maximum likelihood estimator.

Example. For $p = 1$, $\Sigma = 1$, we have $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$. [Figure: the densities $\varphi_{\mu,1}$ and $\varphi_{\hat\mu,1}$.]

Proposition. The empirical mean and the empirical covariance are good estimators, i.e.
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad \hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat\mu)(X_i - \hat\mu)^\top.$$

Proof. We only show the first equality. Finding $(\hat\mu, \hat\Sigma) \in \arg\max \varphi_{\mu,\Sigma}(X_1,\dots,X_n)$ is equivalent to finding
$$(\hat\mu, \hat\Sigma) \in \arg\min \big[-\ln\big(\varphi_{\mu,\Sigma}(X_1,\dots,X_n)\big)\big] \quad (1).$$
Yet, (1) is easier to solve since it involves minimizing over a sum rather than maximizing over a product:
$$(1) = \arg\min \Big[ c + \sum_{i=1}^{n} \frac{1}{2} \operatorname{tr}\big((X_i-\mu)^\top \Sigma^{-1} (X_i-\mu)\big) + \frac{n}{2}\ln(\det(\Sigma)) \Big],$$
where $c$ is some constant that does not depend on $\mu$ or $\Sigma$.
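The two estimators of the proposition can be computed in a couple of NumPy lines (a sketch; note the $1/n$ normalization of the MLE, as opposed to the unbiased $1/(n-1)$ that `np.cov` uses by default):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum-likelihood estimates (mu_hat, Sigma_hat) from an (n, p) sample X."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                # empirical mean
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n  # empirical covariance, 1/n normalization
    return mu_hat, Sigma_hat
```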
Thus, fixing $\Sigma$, we get:
$$(1) = \arg\min_{\mu} \Big[ \sum_{i=1}^{n} \frac{1}{2} \operatorname{tr}\big((X_i-\mu)^\top \Sigma^{-1} (X_i-\mu)\big) \Big].$$
$\sum_{i=1}^{n} (X_i-\mu)^\top \Sigma^{-1} (X_i-\mu)$ is a convex function of $\mu$, so its global minimum $\hat\mu$ is the unique point that satisfies
$$\frac{\partial}{\partial \mu}\Big(\sum_{i=1}^{n} (X_i-\hat\mu)^\top \Sigma^{-1} (X_i-\hat\mu)\Big) = 0.$$
This implies that $\sum_{i=1}^{n} \Sigma^{-1}(X_i - \hat\mu) = 0$, that is $\sum_{i=1}^{n} X_i = n\hat\mu$, and so $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$.

2.2 Mixture

We refine the model presented above by regarding the density of $(X_1,\dots,X_n)$ as a mixture of $K$ weighted Gaussian densities $\varphi_{\mu_k,\Sigma_k}$ over $\mathbb{R}^p$:
$$(X_1,\dots,X_n) \overset{iid}{\sim} f(x) = \sum_{k=1}^{K} \pi_k \varphi_{\mu_k,\Sigma_k}(x),$$
where $\pi_k$ is the weight associated to $\varphi_{\mu_k,\Sigma_k}$.

Example. In $\mathbb{R}^2$ for $K = 3$, $\pi_k = \frac{1}{3}$, we could have a distribution like the following: [Figure: a mixture of three Gaussian densities with means $\mu_1$, $\mu_2$, $\mu_3$.]

Drawing $x \in \mathbb{R}^p$ according to the distribution of the Gaussian mixture $f$ is equivalent to drawing $x$ as follows (hierarchical way):
1. draw $k$ with probability $\{\pi_1,\dots,\pi_K\}$ over the elements of $\{1,\dots,K\}$;
2. draw $x \in \mathbb{R}^p$ according to the distribution associated to $k$, i.e. according to $\varphi_{\mu_k,\Sigma_k}$.

The problem of finding the mixture of $K$ Gaussian distributions from a given set of samples $(X_1,\dots,X_n)$ can be seen as a generalization of the K-means problem, where the distance to the centre of a cluster changes according to the index of the cluster. The Expectation-Maximization algorithm (EM, Algorithm 2) can thus be viewed as a generalization of the K-means algorithm, where the value to maximize is $\varphi(\theta) = f_\theta(X_1,\dots,X_n) = \prod_{i=1}^{n} f_\theta(X_i)$.
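The two-step hierarchical sampling scheme above can be sketched directly (a minimal sketch; the function name is ours):

```python
import numpy as np

def sample_mixture(n, pis, mus, Sigmas, seed=0):
    """Draw n points from the mixture sum_k pi_k N(mu_k, Sigma_k), hierarchically."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    # step 1: draw the component index k with probabilities pi_1, ..., pi_K
    ks = rng.choice(K, size=n, p=pis)
    # step 2: draw x from the Gaussian density of the chosen component
    xs = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return xs, ks
```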
We have the same kind of termination property:

Proposition. Let $\theta^{(t)}$ be the iterates of the EM algorithm and $\varphi(\theta^{(t)})$ their corresponding likelihoods; then $\forall t$, $\varphi(\theta^{(t+1)}) \geq \varphi(\theta^{(t)})$.

Proof. We do not give a complete proof here. The idea is the following: since maximizing the likelihood $\varphi(\theta) = f_\theta(X_1,\dots,X_n)$ is hard, we instead maximize the log-likelihood $L(\theta) = \ln(\varphi(\theta)) = \sum_{i=1}^{n} \ln(f_\theta(X_i))$. This is still hard to evaluate, except if we knew from which Gaussian density inside the Gaussian mixture each $X_i$ was drawn. Thus, for each $i \in \{1,\dots,n\}$, we define $z_i$ to be the hidden random variable that indicates whether $X_i$ is drawn from the $j$-th Gaussian density, with probability $p_{ij}$ ($\sum_{j=1}^{K} p_{ij} = 1$), and we try to maximize the parametrized log-likelihood
$$L\big(\theta, (p_{ij})_{1 \leq i \leq n,\, 1 \leq j \leq K}\big) = \sum_{i=1}^{n} \ln\Big(\sum_{j=1}^{K} \mathbb{1}_{z_i = j}\, f_{\theta_j}(X_i)\Big).$$

Remark. Once again, the answer provided by the EM algorithm is only a local optimum and depends on the initialization. In practice, the EM algorithm is used for recovering missing or incomplete data.

Algorithm 2 The Expectation-Maximization Algorithm
Input: a data set $X = \{x_1,\dots,x_n\}$ ($x_i \in \mathbb{R}^p$).
Output: $\theta := (\pi_1,\dots,\pi_K,\ \mu_1,\dots,\mu_K,\ \Sigma_1,\dots,\Sigma_K)$, a set of weights and Gaussian densities that locally maximize the probability of the $x_i$'s being drawn from the corresponding Gaussian mixture $f_\theta(x) = \sum_{k=1}^{K} \pi_k \varphi_{\mu_k,\Sigma_k}(x)$.
Initialization: choose $\theta := (\pi_1,\dots,\pi_K,\ \mu_1,\dots,\mu_K,\ \Sigma_1,\dots,\Sigma_K)$ at random. Let $p_{i,k}$ denote the probability that $x_i$ comes from the $k$-th class.
Repeat until convergence:
  estimation step: for $i = 1 \dots n$, for $j = 1 \dots K$ do
    $p_{i,j} \leftarrow \dfrac{\pi_j \varphi_{\mu_j,\Sigma_j}(x_i)}{f_\theta(x_i)} = \dfrac{\pi_j \varphi_{\mu_j,\Sigma_j}(x_i)}{\sum_{k=1}^{K} \pi_k \varphi_{\mu_k,\Sigma_k}(x_i)}$
  done
  maximization step: for $j = 1 \dots K$ do
    $\pi_j \leftarrow \frac{1}{n} \sum_{i=1}^{n} p_{i,j}$
    $\mu_j \leftarrow \dfrac{\sum_{i=1}^{n} p_{i,j}\, x_i}{\sum_{i=1}^{n} p_{i,j}}$
    $\Sigma_j \leftarrow \dfrac{\sum_{i=1}^{n} p_{i,j}\,(x_i-\mu_j)(x_i-\mu_j)^\top}{\sum_{i=1}^{n} p_{i,j}}$
  done
return $\theta$
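Algorithm 2 can be sketched in NumPy as follows (a minimal sketch under our own naming; the small $10^{-6} I_p$ ridge added to each covariance is a numerical safeguard of ours, not part of the algorithm as stated):

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture on an (n, p) data set X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigmas = np.array([np.cov(X.T, bias=True) + 1e-6 * np.eye(p) for _ in range(K)])
    for _ in range(n_iter):
        # estimation step: p_{i,j} = pi_j phi_j(x_i) / sum_k pi_k phi_k(x_i)
        dens = np.empty((n, K))
        for j in range(K):
            diff = X - mus[j]
            inv = np.linalg.inv(Sigmas[j])
            quad = np.einsum('ip,pq,iq->i', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigmas[j]))
            dens[:, j] = pis[j] * np.exp(-0.5 * quad) / norm
        P = dens / dens.sum(axis=1, keepdims=True)
        # maximization step: reweighted weights, means and covariances
        Nj = P.sum(axis=0)
        pis = Nj / n
        mus = (P.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mus[j]
            Sigmas[j] = (P[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(p)
    return pis, mus, Sigmas
```

As with K-means, the result is only a local optimum, so several restarts with different seeds are advisable.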