C4B Machine Learning Answers II
A. Zisserman, Hilary Term 2011

1.(a) Show that for the logistic sigmoid function
$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$$

Start from the definition of $\sigma(z)$:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Note that
$$1 - \sigma = \frac{e^{-z}}{1 + e^{-z}}$$
Then
$$\frac{d\sigma(z)}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(1 - \sigma)$$

(b) If $y_i \in \{0, 1\}$, then the negative log-likelihood for logistic regression training is
$$L(w) = -\sum_{i=1}^N \left[\, y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\left(1 - \sigma(w^\top x_i)\right) \right]$$
Show that its gradient has the simple form
$$\frac{dL}{dw} = -\sum_{i=1}^N (y_i - \sigma_i)\, x_i$$
and hence derive the update equation for learning $w$ using a steepest descent algorithm.
Reminder: write this more compactly as
$$p(y = 1 \mid x; w) = \sigma(w^\top x), \qquad p(y = 0 \mid x; w) = 1 - \sigma(w^\top x)$$
so that
$$p(y \mid x; w) = \sigma(w^\top x)^{y} \left(1 - \sigma(w^\top x)\right)^{1 - y}$$
Then the likelihood (assuming independence) is
$$p(\mathbf{y} \mid \mathbf{x}; w) = \prod_{i=1}^N \sigma(w^\top x_i)^{y_i} \left(1 - \sigma(w^\top x_i)\right)^{1 - y_i}$$
and the negative log-likelihood is
$$L(w) = -\sum_{i=1}^N \left[\, y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\left(1 - \sigma(w^\top x_i)\right) \right]$$
We need to compute (writing $\sigma_i = \sigma(w^\top x_i)$)
$$\frac{dL}{dw} = -\sum_{i=1}^N \left[\, y_i \frac{d \log \sigma_i}{dw} + (1 - y_i) \frac{d \log(1 - \sigma_i)}{dw} \right] = -\sum_{i=1}^N \left[ \frac{y_i}{\sigma_i} \frac{d\sigma_i}{dw} - \frac{1 - y_i}{1 - \sigma_i} \frac{d\sigma_i}{dw} \right]$$
Using $\frac{d\sigma_i}{dw} = \sigma_i (1 - \sigma_i)\, x_i$ from part (a),
$$\frac{dL}{dw} = -\sum_{i=1}^N \left[ \frac{y_i}{\sigma_i} \sigma_i (1 - \sigma_i) x_i - \frac{1 - y_i}{1 - \sigma_i} \sigma_i (1 - \sigma_i) x_i \right] = -\sum_{i=1}^N \left[\, y_i (1 - \sigma_i) x_i - (1 - y_i) \sigma_i x_i \right] = -\sum_{i=1}^N (y_i - \sigma_i)\, x_i$$
To minimize a cost function $C(w)$ with steepest descent, the iterative update is
$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w C(w_t)$$
where $\eta$ is the learning rate. So in this case, for each data point $x_i$,
$$w \leftarrow w + \eta \left(y_i - \sigma(w^\top x_i)\right) x_i$$
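This per-point update is easy to exercise in code. Below is a minimal sketch of the stochastic steepest-descent rule on synthetic data; the data, learning rate $\eta$, and epoch count are illustrative assumptions, not part of the question.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic data with labels y in {0, 1} generated from a known direction.
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
y = (X @ w_true > 0).astype(float)

# Stochastic steepest descent: w <- w + eta * (y_i - sigma(w^T x_i)) * x_i
w = np.zeros(D)
eta = 0.1
for epoch in range(50):
    for i in rng.permutation(N):
        w += eta * (y[i] - sigmoid(w @ X[i])) * X[i]

print("learned direction:", w / np.linalg.norm(w))
print("true direction:   ", w_true / np.linalg.norm(w_true))
```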
2.(a) Show that if the SVM cost function is written as
$$C(w) = \frac{1}{N} \sum_{i=1}^N \left[ \frac{\lambda}{2} \|w\|^2 + \max\left(0,\, 1 - y_i f(x_i)\right) \right]$$
where $f(x_i) = w^\top x_i$, then using steepest descent optimization, $w_{t+1}$ may be learnt from $w_t$ by cycling through the data with the following update rule:
$$w_{t+1} \leftarrow \begin{cases} (1 - \eta\lambda)\, w_t + \eta\, y_i x_i & \text{if } y_i w^\top x_i < 1 \\ (1 - \eta\lambda)\, w_t & \text{otherwise} \end{cases}$$
where $\eta$ is the learning rate.

First, start from the standard form of the SVM:
$$\min_w \ \|w\|^2 + C \sum_{i=1}^N \max\left(0,\, 1 - y_i f(x_i)\right)$$
Then write this as an average:
$$\min_w \ C(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{N} \sum_{i=1}^N \max\left(0,\, 1 - y_i f(x_i)\right) = \frac{1}{N} \sum_{i=1}^N \left[ \frac{\lambda}{2} \|w\|^2 + \max\left(0,\, 1 - y_i f(x_i)\right) \right]$$
(with $\lambda = 2/(NC)$, up to an overall scale of the problem). Now compute the gradient w.r.t. $w$. For the hinge loss the sub-gradient is
$$\begin{cases} -y_i x_i & \text{if } y_i w^\top x_i < 1 \\ 0 & \text{otherwise} \end{cases}$$
and for $\lambda \|w\|^2 / 2$ the gradient is $\lambda w$. Putting this together with the iterative update rule
$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w C(w_t)$$
gives the iterative update
$$w_{t+1} \leftarrow \begin{cases} (1 - \eta\lambda)\, w_t + \eta\, y_i x_i & \text{if } y_i w^\top x_i < 1 \\ (1 - \eta\lambda)\, w_t & \text{otherwise} \end{cases}$$
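The same rule can be tried out directly. A minimal sketch on synthetic data with labels $y_i \in \{-1, +1\}$; the data, $\lambda$, $\eta$, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data with labels in {-1, +1}.
N, D = 200, 2
X = rng.normal(size=(N, D))
y = np.where(X @ np.array([1.0, 1.0]) > 0, 1.0, -1.0)

w = np.zeros(D)
lam, eta = 0.01, 0.1
for epoch in range(50):
    for i in rng.permutation(N):
        if y[i] * (w @ X[i]) < 1:          # margin violated
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                              # only shrink (regularization term)
            w = (1 - eta * lam) * w

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```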
(b) Contrast the SVM update rule with that of the perceptron
$$w \leftarrow w - \eta\, \mathrm{sign}(w^\top x_i)\, x_i \qquad \text{(for misclassified } x_i\text{)}$$
What are the differences, and how do they influence the margin?

There are two main differences: (i) the condition for the SVM is on whether the data point violates the margin ($y_i w^\top x_i < 1$), whereas for the perceptron the condition is on whether the point is incorrectly classified ($y_i w^\top x_i < 0$); (ii) for the perceptron there is no regularization, and so no $-\eta\lambda w_t$ term resulting from this. Note that for the SVM, the $-\eta\lambda w_t$ term, which is added even if the point is outside the margin, can decrease $\|w\|$ and hence increase the margin $2/\|w\|$. For the perceptron, nothing is added if the point is correctly classified.

(c) The perceptron learning rule can be derived as steepest descent optimization of a loss function. What is the loss function?
$$\max\left(0,\, -y_i f(x_i)\right)$$
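For side-by-side comparison, here is a perceptron loop under the same illustrative setup as the SVM sketch above; note that the test is on the sign of $y_i w^\top x_i$ rather than the margin, and there is no $(1 - \eta\lambda)$ shrinkage step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of synthetic data; labels in {-1, +1}.
N, D = 200, 2
X = rng.normal(size=(N, D))
y = np.where(X @ np.array([1.0, 1.0]) > 0, 1.0, -1.0)

w = np.zeros(D)
eta = 0.1
for epoch in range(50):
    for i in rng.permutation(N):
        if y[i] * (w @ X[i]) <= 0:     # misclassified: condition on the sign,
            w += eta * y[i] * X[i]     # not the margin; no shrinkage term

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```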
3. A K-class discriminant is obtained by training $K$ linear classifiers of the form $f_k(x) = w_k^\top x + b_k$ and assigning a point to class $C_k$ if $f_k(x) > f_j(x)$ for all $j \neq k$.

(a) Write the equation of the hyperplane separating classes $j$ and $k$.

Points on the hyperplane satisfy
$$w_j^\top x + b_j = w_k^\top x + b_k$$
Thus, the equation is
$$(w_j - w_k)^\top x + (b_j - b_k) = 0$$

(b) If $x_A$ and $x_B$ are both in the decision region $R_j$ (i.e. classified as class $j$), then show that any point on the line
$$x = \lambda x_A + (1 - \lambda) x_B, \qquad 0 \leq \lambda \leq 1,$$
is also classified as class $j$.

For points $x$ on this line,
$$f_j(x) = w_j^\top \left(\lambda x_A + (1 - \lambda) x_B\right) + b_j$$
and using the linearity of the classifier,
$$f_j(x) = \lambda f_j(x_A) + (1 - \lambda) f_j(x_B)$$
As $x_A$ and $x_B$ are in region $R_j$, it follows that $f_j(x_A) > f_k(x_A)$ and $f_j(x_B) > f_k(x_B)$ for all $k \neq j$. Hence, since $\lambda$ and $1 - \lambda$ are non-negative,
$$f_j(x) = \lambda f_j(x_A) + (1 - \lambda) f_j(x_B) > \lambda f_k(x_A) + (1 - \lambda) f_k(x_B) = f_k(x)$$
for all $k \neq j$, and the result follows.
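A quick numerical illustration of this convexity property, with arbitrary (assumed) weights and points: find two points classified into the same region, then check every point sampled on the segment between them.

```python
import numpy as np

rng = np.random.default_rng(3)

# K linear classifiers f_k(x) = w_k^T x + b_k; class = argmax_k f_k(x).
K, D = 4, 3
W = rng.normal(size=(K, D))
b = rng.normal(size=K)

def classify(x):
    return int(np.argmax(W @ x + b))

# Draw two points until they fall in the same decision region.
xA, xB = rng.normal(size=D), rng.normal(size=D)
while classify(xA) != classify(xB):
    xA, xB = rng.normal(size=D), rng.normal(size=D)

j = classify(xA)
assert all(classify(t * xA + (1 - t) * xB) == j
           for t in np.linspace(0.0, 1.0, 101))
print(f"all points on the segment are classified as class {j}")
```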
4. A student uses the regression function
$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \dots + w_M \phi_M(x) = w^\top \Phi(x)$$
(where $x$ is a scalar and $f$ a scalar valued function) for two possible data sources:
(a) a periodic source which oscillates with a known period $p$;
(b) a polynomial of second degree.
What are suitable basis functions for each of these sources? Can the student save time and design a single set of basis functions $\phi_i(x)$ that will allow him/her to model observations from either source?

(a) For the periodic source, suitable basis functions are
$$\phi_1(x) = \cos\left(\frac{2\pi x}{p}\right), \qquad \phi_2(x) = \sin\left(\frac{2\pi x}{p}\right)$$
Recall the trigonometric identity
$$\cos(A - B) = \cos A \cos B + \sin A \sin B$$
so that $\cos\left(2\pi(x - \theta)/p\right)$ may be written as a linear combination
$$\cos\left(\frac{2\pi(x - \theta)}{p}\right) = \cos\left(\frac{2\pi\theta}{p}\right) \cos\left(\frac{2\pi x}{p}\right) + \sin\left(\frac{2\pi\theta}{p}\right) \sin\left(\frac{2\pi x}{p}\right)$$
for any phase $\theta$.

(b) For the second-degree polynomial, suitable basis functions are
$$\phi_1(x) = x, \qquad \phi_2(x) = x^2$$

If the student simply combines the two basis sets then, given sufficient data, the coefficients of the basis functions that are not relevant for that source should be close to zero.
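A sketch of the combined-basis idea: fit the same five-column design matrix to data from each source and observe that the coefficients of the irrelevant basis functions are driven towards zero. The period, data, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 2.0                                  # known period (illustrative)
x = np.linspace(0.0, 6.0, 100)

def design(x):
    # Combined basis: [1, cos(2*pi*x/p), sin(2*pi*x/p), x, x^2]
    return np.column_stack([np.ones_like(x),
                            np.cos(2 * np.pi * x / p),
                            np.sin(2 * np.pi * x / p),
                            x, x ** 2])

Phi = design(x)

# Source (a): periodic with unknown phase; source (b): quadratic.
y_a = 1.5 * np.cos(2 * np.pi * (x - 0.3) / p) + 0.05 * rng.normal(size=x.size)
y_b = 0.5 + 2.0 * x - 0.7 * x ** 2 + 0.05 * rng.normal(size=x.size)

for name, y in [("periodic ", y_a), ("quadratic", y_b)]:
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(name, np.round(w, 3))          # irrelevant coefficients are ~ 0
```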
5. The cost function for ridge regression is
$$E(w) = \frac{1}{2} \sum_{i=1}^N \left( y_i - w^\top \Phi(x_i) \right)^2 + \frac{\lambda}{2} \|w\|^2$$
This has the dual representation
$$\tilde{E}(a) = \frac{1}{2} \|y - Ka\|^2 + \frac{\lambda}{2} a^\top K a$$
where $K$ is the $N \times N$ kernel Gram matrix with entries $K_{ij} = k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$. Show that the vector $a$ that minimizes $\tilde{E}(a)$ is given by
$$a = (K + \lambda I)^{-1} y$$

Differentiate w.r.t. $a$:
$$\frac{d\tilde{E}(a)}{da} = -K (y - Ka) + \lambda K a = 0$$
and rearranging, assuming $K$ is full rank (so we may multiply through by $K^{-1}$),
$$(K + \lambda I)\, a = y$$
Hence
$$a = (K + \lambda I)^{-1} y$$
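The closed form translates directly into code. A minimal kernel ridge regression sketch; the RBF kernel, its bandwidth, and the data are illustrative assumptions (any positive-definite kernel would do).

```python
import numpy as np

rng = np.random.default_rng(5)

# Training data: noisy sine (illustrative).
X = np.sort(rng.uniform(0, 2 * np.pi, 40))
y = np.sin(X) + 0.1 * rng.normal(size=X.size)

def rbf(a, b, gamma=1.0):
    # Gram matrix with entries k(a_i, b_j) = exp(-gamma * (a_i - b_j)^2)
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

lam = 0.1
K = rbf(X, X)
a = np.linalg.solve(K + lam * np.eye(X.size), y)   # a = (K + lambda*I)^{-1} y

# Predictions at new points: f(x) = sum_i a_i k(x, x_i)
X_test = np.linspace(0, 2 * np.pi, 5)
print("predicted:", np.round(rbf(X_test, X) @ a, 3))
print("truth:    ", np.round(np.sin(X_test), 3))
```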
6. Consider the following 3-dimensional datapoints:
$$(1.3, 1.6, 2.8),\ (4.3, -1.4, 5.8),\ (-0.6, 3.7, 0.7),\ (-0.4, 3.2, 5.8),\ (3.3, -0.4, 4.3),\ (-0.4, 3.1, 0.9)$$
The mean and covariance matrix of this data are
$$c = (1.2500,\, 1.6333,\, 3.3833)^\top, \qquad S = \begin{pmatrix} 3.7292 & -3.7083 & 2.3825 \\ -3.7083 & 3.7022 & -2.4294 \\ 2.3825 & -2.4294 & 4.3714 \end{pmatrix}$$
and the eigenvector corresponding to the largest eigenvalue is
$$u_1 = (0.5931,\, -0.5941,\, 0.5435)^\top$$

(a) Verify that $S u_1 = \lambda_1 u_1$ where $\lambda_1 = 9.6269$.

(b) The sum of the eigenvalues is $\sum_{i=1}^{3} \lambda_i = 11.8028$. What fraction of the variance is explained by the first principal component?

The variance is (the trace of the covariance matrix)
$$\frac{1}{N} \sum_i \|x_i - c\|^2 = \sum_d \lambda_d = 11.8028$$
The first principal component is $y_1 = u_1^\top (x - c)$, and its variance is
$$\frac{1}{N}\, u_1^\top \sum_i (x_i - c)(x_i - c)^\top u_1 = u_1^\top S u_1 = u_1^\top \lambda_1 u_1 = \lambda_1 = 9.6269$$
Hence the proportion of variance is
$$\frac{9.6269}{11.8028} = 0.8156 = 81.56\%$$

(c) The projection of a datapoint $x$ onto the first principal component is given by $y_1 = u_1^\top (x - c)$, and similarly $y_2 = u_2^\top (x - c)$ for the second. If $u_2 = (-0.3958,\, 0.3727,\, 0.8393)^\top$, calculate the projection of the first datapoint $(1.3, 1.6, 2.8)$ onto the first two principal components.

$$(y_1, y_2) = (-0.2676,\, -0.5218)$$
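These quantities can be checked directly with numpy:

```python
import numpy as np

X = np.array([[ 1.3,  1.6, 2.8],
              [ 4.3, -1.4, 5.8],
              [-0.6,  3.7, 0.7],
              [-0.4,  3.2, 5.8],
              [ 3.3, -0.4, 4.3],
              [-0.4,  3.1, 0.9]])

c = X.mean(axis=0)
D = X - c
S = D.T @ D / len(X)                    # covariance with 1/N normalization

vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
lam1, u1, u2 = vals[-1], vecs[:, -1], vecs[:, -2]
# Eigenvector signs are arbitrary; flip to match the quoted convention.
if u1[0] < 0: u1 = -u1
if u2[0] > 0: u2 = -u2

print("lambda_1 =", round(lam1, 4))                  # 9.6269
print("fraction =", round(lam1 / vals.sum(), 4))     # 0.8156
print("(y1, y2) =", np.round([u1 @ D[0], u2 @ D[0]], 4))
```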
7. Given the following 2D data:
$$x_1 = \begin{pmatrix} -1 \\ 3 \end{pmatrix}, \quad x_2 = \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \quad x_3 = \begin{pmatrix} 1 \\ -3 \end{pmatrix}, \quad x_4 = \begin{pmatrix} -1 \\ -3 \end{pmatrix}$$
determine the clusters obtained by running the K-means algorithm, with $K = 2$ and the clusters initialized as
(a) $c_1 = x_1$, $c_2 = x_4$
(b) $c_1 = x_1$, $c_2 = x_3$
(c) $c_1 = x_1$, $c_2 = x_2$
(d) $c_1 = \frac{x_1 + x_4}{2}$, $c_2 = \frac{x_2 + x_3}{2}$

[Figure: resulting clusters for initializations (a)–(d)]
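A minimal K-means loop for experimenting with the four initializations (a sketch, not an optimized implementation); the final assignments and centers for each case can be read off from the printout.

```python
import numpy as np

X = np.array([[-1.0, 3.0], [1.0, 3.0], [1.0, -3.0], [-1.0, -3.0]])

def kmeans(X, centers, iters=20):
    c = np.array(centers, dtype=float)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        for k in range(len(c)):
            if np.any(z == k):
                c[k] = X[z == k].mean(axis=0)
    return z, c

for name, init in [("a", [X[0], X[3]]),
                   ("b", [X[0], X[2]]),
                   ("c", [X[0], X[1]]),
                   ("d", [(X[0] + X[3]) / 2, (X[1] + X[2]) / 2])]:
    z, c = kmeans(X, init)
    print(f"({name}) assignments: {z}, centers: {c.tolist()}")
```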
8. Consider a GMM in which all the $K$ mixture components have the same covariance matrix $\Sigma = \epsilon I$, where $I$ is the identity matrix. Show that if this model is fitted using the EM algorithm, then in the limit $\epsilon \to 0$ the algorithm is equivalent to K-means clustering. (Hint: compute the responsibilities, $\gamma_{ik}$, for this limit.)

If $\Sigma = \epsilon I$, then $\Sigma^{-1} = \frac{1}{\epsilon} I$ and
$$\mathcal{N}(x \mid \mu, \Sigma) \propto e^{-\frac{1}{2\epsilon} \|x - \mu\|^2}$$
In the Expectation step of the EM algorithm, the responsibilities are computed as
$$\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} = \frac{\pi_k\, e^{-\frac{1}{2\epsilon} \|x_i - \mu_k\|^2}}{\sum_{j=1}^K \pi_j\, e^{-\frac{1}{2\epsilon} \|x_i - \mu_j\|^2}}$$
As $\epsilon \to 0$, the term for which $\|x_i - \mu_k\|^2$ is smallest will go to zero more slowly than the rest, and in the limit $\gamma_{ik} \to 1$ for this $k$, and zero for all other $k$ (assuming $\pi_k \neq 0$). This is a softmax. Thus, the responsibilities become the hard assignment variables $r_{ik}$ of K-means. Similarly, in the Maximization step
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^N \gamma_{ik}\, x_i, \qquad N_k = \sum_{i=1}^N \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}$$
becomes, with $\gamma_{ik} \to r_{ik}$,
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^N r_{ik}\, x_i, \qquad N_k = \sum_{i=1}^N r_{ik}$$
which is exactly the K-means update of each cluster center to the mean of its assigned points.
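A quick numerical illustration of how the responsibilities harden into K-means assignments as $\epsilon$ shrinks; the data point, means, and mixing weights are arbitrary assumptions.

```python
import numpy as np

x = np.array([0.9, 0.2])                      # one data point
mus = np.array([[1.0, 0.0], [0.0, 1.0]])      # two component means
pis = np.array([0.5, 0.5])

for eps in [1.0, 0.1, 0.01]:
    logits = -np.sum((x - mus) ** 2, axis=1) / (2 * eps) + np.log(pis)
    logits -= logits.max()                    # stabilize the softmax
    gamma = np.exp(logits) / np.exp(logits).sum()
    print(f"eps={eps:5.2f}  responsibilities = {np.round(gamma, 4)}")
# As eps -> 0 the responsibilities approach the hard assignment [1, 0]
# for the nearest component, exactly as in K-means.
```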
9. Describe what happens to an EM update if the mean of one of the Gaussian mixture components exactly coincides with one of the data points.

Consider 1D Gaussians
$$\mathcal{N}(x \mid \mu_k, \sigma_k) = \frac{1}{\sqrt{2\pi}\, \sigma_k}\, e^{-\frac{1}{2\sigma_k^2}(x - \mu_k)^2}$$
If $x_i$ coincides with $\mu_k$ then
$$\mathcal{N}(x_i \mid \mu_k, \sigma_k) = \frac{1}{\sqrt{2\pi}\, \sigma_k}$$
Suppose $\sigma_k$ is small. Then $\gamma_{ik}$ for this point will approach unity (from question 8) in the Expectation step, and the contributions of other points to this component will also be small. In the Maximization step,
$$\sigma_k^2 = \frac{1}{N_k} \sum_{i=1}^N \gamma_{ik}\, (x_i - \mu_k)^2$$
so $\sigma_k$ can become smaller still. As the iterations proceed, $\sigma_k \to 0$ and the (negative) log-likelihood
$$L(\theta) = -\sum_{i=1}^N \ln \sum_{k=1}^K \pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k)$$
diverges, since the component collapses onto the single data point.
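A small demonstration of this collapse, seeding one component's mean exactly on a data point with a small initial $\sigma$; the data and initialization are illustrative assumptions. The component's variance shrinks to zero within a few iterations, at which point the likelihood is unbounded.

```python
import numpy as np

x = np.arange(20, dtype=float)           # well-separated 1D data (illustrative)
mu = np.array([x[0], x.mean()])          # component 0 sits exactly on a data point
sig = np.array([0.1, x.std()])           # and starts with a small sigma
pi = np.array([0.5, 0.5])

def npdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (np.sqrt(2 * np.pi) * sig)

for it in range(5):
    p = pi * npdf(x[:, None], mu, sig)            # shape (N, K)
    gamma = p / p.sum(axis=1, keepdims=True)      # E-step: responsibilities
    nll = -np.log(p.sum(axis=1)).sum()
    print(f"iter {it}: sigma_0 = {sig[0]:.3e},  NLL = {nll:.2f}")
    Nk = gamma.sum(axis=0)                        # M-step
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sig = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)
    if sig[0] == 0.0:                             # numerical collapse
        print("component 0 collapsed onto the data point; likelihood unbounded")
        break
```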