LECTURE NOTE #12 PROF. ALAN YUILLE
1. Clustering, K-means, and EM

Task: given a set of unlabeled data D = {x_1, ..., x_n}:
- Decompose it into classes w_1, ..., w_M, where M is unknown.
- Learn the class models p(x|w).
- Discover new classes (concepts).

This is a form of unsupervised learning. How do people learn grammar? Is it innate, coded in DNA? Or is it learned in an unsupervised manner?

Practical motivations for unsupervised learning: labeling large datasets is hard, costly, and time consuming. In addition, the data may be generated by a complex model which involves hidden variables.

Why is this called unsupervised? It means that there are some hidden/latent/missing variables that are not specified by the training data. I.e., the training data is {x_1, ..., x_n} and we assume that it is generated by a model P(x|w)P(w), where the w are unknown and can be called latent/hidden/missing (statisticians call them "missing data", neural network researchers call them "hidden units", machine learning researchers call them "latent variables"). By contrast, in previous lectures, we assumed a model p(x) which was either an exponential distribution or a non-parametric distribution. We can think of this as learning a model P(x, w), and we can assume that this model is of exponential form P(x, w) = (1/Z[λ]) exp{λ · φ(x, w)}, i.e. an exponential distribution
defined over the observed units x and the unobserved units w (latent/hidden/missing). This model can capture extremely complex models: stochastic grammars of natural language, hierarchical models of objects (e.g. my talk in the BCE symposium), probabilistic models of reasoning. But these models are harder to deal with than models without hidden variables. The models require: (i) learning the structure, i.e. how the hidden and observed units interact (see later lectures), which is called structure induction and which is very difficult; (ii) learning the parameters λ of the model if the structure is known; this is a non-convex optimization problem and so it will have multiple minima. We will describe the basic algorithm for doing this, the EM algorithm, later in this lecture. (Note: there is also a machine learning algorithm called latent SVM which is like an approximation to EM.) Note that there are also non-parametric methods for learning P(x, w).

Note that there is also semi-supervised learning (e.g. Prof. Jin's talk in the BCE Symposium). This means that some information is supplied for some of the hidden states, i.e. you have some data {x_1, ..., x_n} which is unsupervised, and some data which is supervised {(x_{n+1}, h_{n+1}), ..., (x_{n+m}, h_{n+m})}.

2. K-means algorithm

Basic assumption: the data D is clustered around (unknown) mean values m_1, ..., m_k, see figure (1). This is an assumption about the structure of the model.

Figure 1. The k-means algorithm assumes that the data can be clustered around (unknown) mean values m_1, ..., m_k. Equivalently, that the data is generated by a mixture of Gaussians with fixed covariance and with these means.
We seek to associate the data to these means, and to simultaneously estimate the means. For now, we assume that the number of means is known, i.e. k is fixed.

Idea: decompose D = ∪_{a=1}^k D_a, where D_a is the data associated with mean m_a.

Measure of fit: F({D_a}) = Σ_{a=1}^k Σ_{x ∈ D_a} (x - m_a)^2

We want to select the {m_a}, and the assignment of each x to a D_a, to minimize the fit F(·).

3. Assignment variables

V_{ia} = 1 if datapoint x_i is assigned to m_a, and V_{ia} = 0 otherwise.

Constraint: Σ_{a=1}^k V_{ia} = 1 for all i (i.e. each datapoint x_i is assigned to exactly one class m_a). Note: we do not need the assignment variables yet.

Deterministic K-means:
1. Initialize a partition {D_a^0 : a = 1, ..., k} of the data (e.g. randomly choose points x and put them into the sets D_1^0, D_2^0, ..., D_k^0 so that all datapoints are in exactly one set).
2. Compute the mean of each cluster D_a: m_a = (1/|D_a|) Σ_{x ∈ D_a} x.
3. For i = 1 to N, compute d_a(x_i) = |x_i - m_a|^2 and assign x_i to the cluster D_â such that â = arg min_a {d_1(x_i), ..., d_k(x_i)}.
4. Repeat steps 2 & 3 until convergence.
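The four steps above can be sketched in numpy as follows. This is a minimal illustration, not part of the original notes: the function name, the random-partition initialization, and the stopping test (assignments no longer change) are my own choices.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Deterministic k-means sketch following the four steps above."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: a random partition D_1^0, ..., D_k^0 of the data.
    assign = rng.permutation(n) % k
    means = np.zeros((k, X.shape[1]))
    for _ in range(n_iters):
        # Step 2: m_a = (1/|D_a|) * sum of the points in cluster a.
        for a in range(k):
            if np.any(assign == a):
                means[a] = X[assign == a].mean(axis=0)
        # Step 3: d_a(x_i) = |x_i - m_a|^2; reassign each x_i to the nearest mean.
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = d.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # Step 4: assignments stopped changing, so we have converged
        assign = new_assign
    fit = ((X - means[assign]) ** 2).sum()  # the measure of fit F({D_a})
    return means, assign, fit
```

As the next section notes, each run converges only to a fixed point, so in practice one would rerun this from several initializations and keep the result with the smallest fit.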
4. Comments

The k-means algorithm is not guaranteed to converge to the best fit of F(·). It will almost always converge to a fixed point. Typically, you can improve k-means by giving it different initial conditions, and then selecting the solution which has the best fit.

Evaluate solutions with different values of k. If the number of means (clusters) is too large, then you will usually find that some of the means are close together, see figure (2); i.e. do post-processing to eliminate means which are too close to each other.

Figure 2. There are three clusters and four means. In practice, two means will usually be assigned to one cluster.
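The post-processing suggested above (eliminate means which are too close to each other) can be sketched as follows. The midpoint-merge rule and the distance threshold `tol` are illustrative choices of mine; the notes do not specify a particular merging procedure.

```python
import numpy as np

def merge_close_means(means, tol=1.0):
    """Greedily merge any pair of means closer than `tol`, replacing the
    pair by its midpoint, so redundant means on one cluster collapse."""
    means = [np.asarray(m, dtype=float) for m in means]
    merged = True
    while merged and len(means) > 1:
        merged = False
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                if np.linalg.norm(means[i] - means[j]) < tol:
                    means[i] = 0.5 * (means[i] + means[j])
                    del means[j]
                    merged = True
                    break
            if merged:
                break
    return np.array(means)
```

A sensible `tol` depends on the scale of the data, e.g. a fraction of the typical inter-cluster distance.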
5. A softer version of k-means: the Expectation-Maximization (EM) algorithm

Assign each datapoint x_i to the clusters with probabilities (P_1, ..., P_k):
1. Initialize a partition, e.g. randomly choose k points as centres m_1, m_2, ..., m_k.
2. For j = 1 to N, compute the distances d_a(x_j) = |x_j - m_a|^2 and the probability that x_j belongs to ω_a:
P_a(x_j) = (1/(2πσ^2)^{d/2}) exp{-(1/(2σ^2)) (x_j - m_a)^2}
3. Compute the mean and variance for each cluster:
m_a = (1/|D|) Σ_{x ∈ D} x P_a(x),   σ_a^2 = (1/|D|) Σ_{x ∈ D} (x - m_a)^2 P_a(x)
Repeat steps 2 & 3 until convergence.

6. Deeper understanding

Modeling data with hidden variables: P(x, h, θ)
* x : the observed data
* h : the hidden variable
* θ : the model parameters

Example: the data is generated by a mixture of Gaussian distributions with unknown means and covariances. The variable h indicates which Gaussian generated the data.

P(x | h = a, θ_a) = (1/((2π)^{d/2} |Σ_a|^{1/2})) exp{-(1/2) (x - μ_a)^T Σ_a^{-1} (x - μ_a)}

Assume P(h = a | θ) = e^{φ_a} / Σ_c e^{φ_c}, and a prior probability P(θ) on {φ_a, θ_a}.

7. Mixture of Gaussians example

P(x, h, θ) = p(x|h, θ) p(h|θ) p(θ), see figure (3).

Now perform MAP estimation of θ from the data examples D = {x_1, ..., x_n}:

p(D | h, θ) = p({x_1, ..., x_n} | h, θ) = Π_{i=1}^N p(x_i | h, θ)   (i.i.d. assumption)
Figure 3. The data x is generated by a hidden variable h by a probability model with unknown parameters θ.

p(θ | D) = Σ_h p(θ, h | D)

How do we estimate θ̂ from D? The most popular answer is the Expectation-Maximization (EM) algorithm.

Comment: the standard k-means algorithm can be thought of as a limiting case of EM for a mixture of Gaussians, where the covariance is fixed to be the identity matrix.

8. The EM Algorithm

p(θ | D) = Σ_h p(θ, h | D)

Define a new distribution q(h) and minimize the free energy

F(θ, q) = -log p(θ | D) + Σ_h q(h) log [q(h) / p(h | θ, D)]
* Kullback-Leibler divergence: Σ_h q(h) log [q(h) / p(h | θ, D)] ≥ 0.

Note that the minimum occurs at θ̂ = arg min_θ {-log p(θ | D)} = arg max_θ p(θ | D) and at q̂(h) = p(h | θ̂, D) (because the Kullback-Leibler divergence attains its minimum of zero at q̂(h) = p(h | θ̂, D)).

9. We can re-express the free energy

F(θ, q) = -log p(θ | D) + Σ_h q(h) log [q(h) / p(h | θ, D)]
= -Σ_h q(h) log p(θ | D) + Σ_h q(h) log q(h) - Σ_h q(h) log p(h | θ, D)
= -Σ_h q(h) log {p(θ | D) p(h | θ, D)} + Σ_h q(h) log q(h)

F(θ, q) = -Σ_h q(h) log p(h, θ | D) + Σ_h q(h) log q(h)

EM minimizes F(θ, q) with respect to θ and q alternately. This is called coordinate descent, see figure (4).

Here is an intuition for this algorithm. You live in a city with hills, like Seoul, and the streets are arranged in a horizontal grid of blocks. You start on a hill (at high altitude) and want to decrease your altitude. You can either walk in the North-South direction or in the East-West direction. You look in all directions and must choose to walk North, South, East, or West. You see that North and East are uphill, so you do not choose them. You see that you can walk South and decrease your altitude by only 10 meters before the street starts going uphill. You look West and see that you can decrease your altitude by 100 meters by walking in that direction (until that street also starts going uphill). So you (i.e. coordinate descent) choose to walk West and stop when the street starts going uphill (i.e. you have lost 100 meters). Then you stop, look again in the directions North, South, East, and West, and walk in the direction which maximizes your decrease in altitude. You repeat this until you are at a place where all directions are uphill.

Note: coordinate descent will be important when the dimension of the space is very big. In two dimensions, like the city example, the number of choices is small: if you moved East-West last time then you have to move North-South the next time. In high dimensions, there is an enormous number of choices. So the algorithm will often choose to move in only a small number of possible directions, and the other directions will be irrelevant.
This will be important for AdaBoost learning.
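The alternating exact 1-D minimizations described in the city intuition can be demonstrated on a toy smooth function. The quadratic f and the starting point below are arbitrary illustrative choices of mine, not from the notes; the point is only that each step solves one coordinate exactly while holding the other fixed, just as EM alternates over θ and q.

```python
def coordinate_descent_demo(n_steps=50):
    """Coordinate descent on f(x, y) = x^2 + y^2 + x*y - 3x - 3y.
    Setting df/dx = 0 with y fixed gives x = (3 - y)/2, and symmetrically
    for y, so each coordinate step is an exact 1-D minimization."""
    f = lambda x, y: x * x + y * y + x * y - 3 * x - 3 * y
    x, y = 5.0, -4.0  # arbitrary starting point "on a hill"
    for _ in range(n_steps):
        x = (3.0 - y) / 2.0  # walk "East-West" to the 1-D minimum
        y = (3.0 - x) / 2.0  # then walk "North-South" to the 1-D minimum
    return x, y, f(x, y)
```

For this convex quadratic the iteration contracts toward the global minimum at (1, 1); for the non-convex free energies EM faces, the same scheme only reaches a local minimum.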
Note: you may not be able to calculate how much you will decrease your altitude by walking in one direction. This will depend on the specific problem.

Figure 4. The coordinate descent algorithm. You start at θ^t, q^t. You fix q^t and decrease F(θ, q) by changing θ until you reach θ^{t+1}, where ∂F/∂θ (θ, q^t) = 0. Then you fix θ^{t+1} and decrease F(θ, q) by changing q until you reach q^{t+1}, where ∂F/∂q (θ^{t+1}, q) = 0.

Step 1: Fix q^t, set θ^{t+1} = arg min_θ {-Σ_h q^t(h) log p(h, θ | D)}.
Step 2: Fix θ^{t+1}, set q^{t+1}(h) = p(h | θ^{t+1}, D).
Iterate steps 1 & 2 until convergence.

Note: this algorithm is guaranteed to converge to a local minimum of F(θ, q), but there is no guarantee that it will converge to the global minimum. You can use multiple starting points and pick the solution which has the lowest value of F(·, ·), or you can use extra knowledge about the problem to choose a good starting point, or starting points. Or you can use a stochastic algorithm.

10. Exponential EM example & decision trees

EM takes a simple form for exponential distributions:

P(x, h | λ) = e^{λ · φ(x, h)} / Z[λ]
This is the general exponential form, where the h are hidden/latent variables. The observed data distribution is

P(x | λ) = Σ_h P(x, h | λ)

The two steps of EM are:

Step 1: q^{t+1}(h) = P(h | x, λ^t) = P(x, h | λ^t) / Σ_{h'} P(x, h' | λ^t)

Step 2: λ^{t+1} = arg min_λ {-Σ_h q^{t+1}(h) log P(x, h | λ)}
              = arg min_λ {-Σ_h q^{t+1}(h) λ · φ(x, h) + log Z[λ]}

Differentiating gives: Σ_{x,h} P(x, h | λ^{t+1}) φ(x, h) = Σ_h q^{t+1}(h) φ(x, h)

11. Mixture of Gaussians example

P(y_i | V_{ia} = 1, θ) = (1/((2π)^{d/2} |Σ_a|^{1/2})) exp{-(1/2) (y_i - μ_a)^T Σ_a^{-1} (y_i - μ_a)}

i.e. datapoint y_i is generated by the a-th model. We can write:

P(y_i | {V_{ia}}, θ) = (1/(2π)^{d/2}) exp{-Σ_a V_{ia} [(1/2) (y_i - μ_a)^T Σ_a^{-1} (y_i - μ_a) + (1/2) log |Σ_a|]}

By the i.i.d. assumption,

P({y_i} | {V_{ia}}, θ) = Π_i P(y_i | {V_{ia}}, θ) = (1/(2π)^{nd/2}) exp{-Σ_{i,a} V_{ia} [(1/2) (y_i - μ_a)^T Σ_a^{-1} (y_i - μ_a) + (1/2) log |Σ_a|]}
Let the prior probability on the {V_{ia}} be uniform, so P({V_{ia}}) = constant. Then

P({y_i}, {V_{ia}} | θ) = (1/Z) exp{-Σ_{i,a} V_{ia} [(1/2) (y_i - μ_a)^T Σ_a^{-1} (y_i - μ_a) + (1/2) log |Σ_a|]}

where Z is a normalization constant.

12. Applying EM to this problem

E-step: Q(θ | θ^{(t)}) = Σ_V {log P(y, V | θ)} P(V | θ^{(t)})
M-step: determine θ^{(t+1)} by maximizing Q(θ | θ^{(t)}).

The distribution over the assignments factorizes: P({V_{ia}} | θ) = Π_i P_i(V_{ia} | θ), with

P_i(V_{ia} | θ) = exp{-Σ_a V_{ia} [(1/2) (y_i - μ_a)^T Σ_a^{-1} (y_i - μ_a) + (1/2) log |Σ_a|]} / Σ_b exp{-[(1/2) (y_i - μ_b)^T Σ_b^{-1} (y_i - μ_b) + (1/2) log |Σ_b|]}

Note: Σ_a P_i(V_{ia} = 1 | θ) = 1. Also, P_i(V_{ia} | θ) is biggest for the model μ_a, Σ_a closest to the datapoint.

Let q_{ia} = Σ_{V_{ia}} V_{ia} P_i(V_{ia} | θ). Then the E-step gives

Q(θ | θ^{(t)}) = -Σ_{i,a} q_{ia} [(1/2) (y_i - μ_a)^T Σ_a^{-1} (y_i - μ_a) + (1/2) log |Σ_a|]
13. M-step

Minimizing with respect to μ_a and Σ_a gives:

μ_a = (1/Σ_i q_{ia}) (Σ_i q_{ia} y_i)   (a weighted sum)

Σ_a = (1/Σ_i q_{ia}) Σ_i q_{ia} (y_i - μ_a)(y_i - μ_a)^T

Hence, for this case (a mixture of Gaussians) we can perform the E and M steps analytically.

We can extend the probability model by letting the number of classes be a random variable k, with p(k) = exp{-λk} / Z[λ], and considering p(y | v, μ, Σ, k) p(k).
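The E-step (the responsibilities q_ia) and the closed-form M-step (μ_a, Σ_a) derived above can be sketched in numpy as follows. This is an illustrative sketch, not the notes' own code: the function name, the optional `mu_init` argument, and the small ridge added to keep each Σ_a invertible are my own choices, and the mixing prior on the assignments is uniform, as assumed in section 11.

```python
import numpy as np

def em_gmm(Y, k, n_iters=100, mu_init=None, seed=0):
    """EM for a mixture of k Gaussians with a uniform prior on the
    assignment variables, following the E-step and M-step above."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    if mu_init is None:
        mu = Y[rng.choice(n, size=k, replace=False)].astype(float)
    else:
        mu = np.array(mu_init, dtype=float)
    Sigma = np.stack([np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: q_ia proportional to
        #   exp{-1/2 (y_i - mu_a)^T Sigma_a^{-1} (y_i - mu_a) - 1/2 log|Sigma_a|}
        logp = np.zeros((n, k))
        for a in range(k):
            diff = Y - mu[a]
            inv = np.linalg.inv(Sigma[a])
            maha = np.einsum('ij,jk,ik->i', diff, inv, diff)
            logp[:, a] = -0.5 * maha - 0.5 * np.log(np.linalg.det(Sigma[a]))
        logp -= logp.max(axis=1, keepdims=True)  # for numerical stability
        q = np.exp(logp)
        q /= q.sum(axis=1, keepdims=True)        # so that sum_a q_ia = 1
        # M-step: the weighted mean and covariance formulas of section 13.
        for a in range(k):
            w = q[:, a]
            mu[a] = (w[:, None] * Y).sum(axis=0) / w.sum()
            diff = Y - mu[a]
            Sigma[a] = (w[:, None, None] *
                        np.einsum('ij,ik->ijk', diff, diff)).sum(axis=0) / w.sum()
            Sigma[a] += 1e-6 * np.eye(d)  # keep the covariance invertible
    return mu, Sigma, q
```

Fixing every Sigma[a] to the identity and replacing the soft q with a hard argmax recovers k-means, the limiting case mentioned in section 7.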
Mth 360: A primitive integrl nd elementry functions D. DeTurck University of Pennsylvni October 16, 2017 D. DeTurck Mth 360 001 2017C: Integrl/functions 1 / 32 Setup for the integrl prtitions Definition:
More information4.4 Areas, Integrals and Antiderivatives
. res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order
More informationNumerical integration
2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter
More informationMath 520 Final Exam Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008
Mth 520 Finl Exm Topic Outline Sections 1 3 (Xio/Dums/Liw) Spring 2008 The finl exm will be held on Tuesdy, My 13, 2-5pm in 117 McMilln Wht will be covered The finl exm will cover the mteril from ll of
More informationTheoretical foundations of Gaussian quadrature
Theoreticl foundtions of Gussin qudrture 1 Inner product vector spce Definition 1. A vector spce (or liner spce) is set V = {u, v, w,...} in which the following two opertions re defined: (A) Addition of
More information1B40 Practical Skills
B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need
More informationPhysics 201 Lab 3: Measurement of Earth s local gravitational field I Data Acquisition and Preliminary Analysis Dr. Timothy C. Black Summer I, 2018
Physics 201 Lb 3: Mesurement of Erth s locl grvittionl field I Dt Acquisition nd Preliminry Anlysis Dr. Timothy C. Blck Summer I, 2018 Theoreticl Discussion Grvity is one of the four known fundmentl forces.
More information{ } = E! & $ " k r t +k +1
Chpter 4: Dynmic Progrmming Objectives of this chpter: Overview of collection of clssicl solution methods for MDPs known s dynmic progrmming (DP) Show how DP cn be used to compute vlue functions, nd hence,
More information20 MATHEMATICS POLYNOMIALS
0 MATHEMATICS POLYNOMIALS.1 Introduction In Clss IX, you hve studied polynomils in one vrible nd their degrees. Recll tht if p(x) is polynomil in x, the highest power of x in p(x) is clled the degree of
More informationCS 188: Artificial Intelligence Fall Announcements
CS 188: Artificil Intelligence Fll 2009 Lecture 20: Prticle Filtering 11/5/2009 Dn Klein UC Berkeley Announcements Written 3 out: due 10/12 Project 4 out: due 10/19 Written 4 proly xed, Project 5 moving
More informationSpectral Regularization for Max-Margin Sequence Tagging
Spectrl Regulriztion for Mx-Mrgin Sequence Tgging Aridn Quttoni Borj Blle Xvier Crrers Amir Globerson pq Universitt Politècnic de Ctluny now t p q McGill University p q The Hebrew University of Jeruslem
More informationChapter 4: Dynamic Programming
Chpter 4: Dynmic Progrmming Objectives of this chpter: Overview of collection of clssicl solution methods for MDPs known s dynmic progrmming (DP) Show how DP cn be used to compute vlue functions, nd hence,
More informationp(t) dt + i 1 re it ireit dt =
Note: This mteril is contined in Kreyszig, Chpter 13. Complex integrtion We will define integrls of complex functions long curves in C. (This is bit similr to [relvlued] line integrls P dx + Q dy in R2.)
More informationdt. However, we might also be curious about dy
Section 0. The Clculus of Prmetric Curves Even though curve defined prmetricly my not be function, we cn still consider concepts such s rtes of chnge. However, the concepts will need specil tretment. For
More informationTopic 1 Notes Jeremy Orloff
Topic 1 Notes Jerem Orloff 1 Introduction to differentil equtions 1.1 Gols 1. Know the definition of differentil eqution. 2. Know our first nd second most importnt equtions nd their solutions. 3. Be ble
More informationThe Wave Equation I. MA 436 Kurt Bryan
1 Introduction The Wve Eqution I MA 436 Kurt Bryn Consider string stretching long the x xis, of indeterminte (or even infinite!) length. We wnt to derive n eqution which models the motion of the string
More information