Machine Learning for Data Science (CS 4786)
Lectures 2 & 3: Principal Component Analysis

The text in black outlines the high-level ideas. The text in blue provides simple mathematical details to derive or arrive at the algorithm or method. The text in red gives mathematical details for those who are interested.

1 Dimensionality Reduction and Linear Projection

We are provided with data points x_1, ..., x_n, where each x_t \in \mathbb{R}^d is a d-dimensional vector. The goal in dimensionality reduction is to compress these points into vectors y_1, ..., y_n \in \mathbb{R}^K, where K is smaller than d. In this lecture we will consider dimensionality reduction through linear transformations, meaning that the low-dimensional representation y_t of each data point x_t is obtained by setting

    y_t = W^\top x_t

where W is a d \times K matrix defining the linear transformation. Notice that one can represent the matrix W as W = [w_1, ..., w_K], where w_1, ..., w_K are each d-dimensional vectors.

2 PCA to One Dimension (K = 1)

For the case when K = 1, y_t[1] = w_1^\top x_t. Now note that arbitrarily scaling w_1 to, say, \alpha w_1 for some \alpha \in \mathbb{R} simply scales the y_t[1]'s by \alpha. Such scaling does not really affect our projections, however. Hence, without loss of generality, we can simply set \|w_1\| = 1, that is, find a vector of unit norm.

The figure below illustrates the basic idea when we consider the projection down to only one dimension. [Figure: the points in blue are the original x_1, ..., x_n; the first direction of projection w_1 is illustrated by the red line; the points in green represent the reconstructions \hat{x}_1, ..., \hat{x}_n; the one-dimensional representations y_1[1], ..., y_n[1] are the lengths shown in the figure.]
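In code, projecting all n points at once is a single matrix multiplication. Below is a minimal NumPy sketch (not from the notes); the matrix W here is an arbitrary linear map, not yet the PCA choice:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, K = 100, 5, 2
X = rng.normal(size=(n, d))      # rows are the data points x_1, ..., x_n
W = rng.normal(size=(d, K))      # W = [w_1, ..., w_K], an arbitrary d x K map

# y_t = W^T x_t for every point at once; row t of Y is y_t.
Y = X @ W
print(Y.shape)                   # (100, 2)
```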
Now the main question at hand boils down to: how do we pick the right linear transformation W so as to retain as much information about the original data points as possible?

2.1 Maximizing Spread (Variance)

Basic idea: pick the directions along which the data is maximally spread (i.e., along which the variance is high). As an example, in the illustration below we would like to pick the second option, since the spread of the points along the chosen direction is larger in the second figure. [Figure: two candidate projection directions for the same point cloud.]

How do we formalize this (for K = 1)? Let us first consider how to pick the first direction w_1. We want to pick the direction along which the variance of y_1[1], ..., y_n[1] is largest. That is, we want to pick the direction w (a unit-norm vector) that maximizes the sample variance in the projected direction, which is given by:

    \frac{1}{n} \sum_{t=1}^n \Big( y_t[1] - \frac{1}{n} \sum_{s=1}^n y_s[1] \Big)^2 = \frac{1}{n} \sum_{t=1}^n \Big( w^\top x_t - \frac{1}{n} \sum_{s=1}^n w^\top x_s \Big)^2

Thus the solution w_1 is given by

    w_1 = \argmax_{w : \|w\| = 1} \frac{1}{n} \sum_{t=1}^n \Big( w^\top x_t - \frac{1}{n} \sum_{s=1}^n w^\top x_s \Big)^2
Let \mu = \frac{1}{n} \sum_{t=1}^n x_t. Now let us simplify the above expression further:

    w_1 = \argmax_{w:\|w\|=1} \frac{1}{n} \sum_{t=1}^n \Big( w^\top x_t - \frac{1}{n} \sum_{s=1}^n w^\top x_s \Big)^2
        = \argmax_{w:\|w\|=1} \frac{1}{n} \sum_{t=1}^n \big( w^\top (x_t - \mu) \big)^2
        = \argmax_{w:\|w\|=1} \frac{1}{n} \sum_{t=1}^n w^\top (x_t - \mu)(x_t - \mu)^\top w
        = \argmax_{w:\|w\|=1} w^\top \Big( \frac{1}{n} \sum_{t=1}^n (x_t - \mu)(x_t - \mu)^\top \Big) w
        = \argmax_{w:\|w\|=1} w^\top \Sigma w

where \Sigma is the sample covariance matrix. The first direction we pick will be the unit vector w_1 that maximizes w^\top \Sigma w.

(Roughly speaking) whenever we want to maximize (or minimize) a function subject to a constraint, we can use the idea of Lagrange multipliers. What the result says is that there exists \lambda \in \mathbb{R} such that the solution w_1 can alternatively be written as:

    w_1 = \argmax_{w \in \mathbb{R}^d} \; w^\top \Sigma w - \lambda \|w\|^2

To optimize the above we simply take the derivative and equate it to 0. This gives us

    \Sigma w - \lambda w = 0

The direction w_1 we obtain by maximizing the variance is therefore some unit vector that satisfies

    \Sigma w_1 = \lambda w_1

But this is exactly the definition of an eigenvector of the matrix \Sigma. (In Dutch, the word "eigen" means "self" or "own": an eigenvector of a matrix, multiplied by the matrix, results in a vector that is just a scaled version of the eigenvector itself.) So we see that w_1 is an eigenvector of \Sigma. The next question is which eigenvector to choose. To this end, note that we want to maximize w_1^\top \Sigma w_1, and we just saw that \Sigma w_1 = \lambda w_1. Hence

    w_1^\top \Sigma w_1 = \lambda \|w_1\|^2 = \lambda

Since we want to maximize this quantity, it stands to reason that we pick w_1 to be the eigenvector corresponding to the largest eigenvalue of \Sigma.
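The conclusion above is easy to check numerically: the unit eigenvector of the sample covariance with the largest eigenvalue should achieve projected variance equal to that eigenvalue, and no other unit direction should do better. A minimal NumPy sketch (illustrative, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-d data: one direction clearly carries more variance.
n = 1000
X = rng.normal(size=(n, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n          # sample covariance (1/n convention)

# For a symmetric matrix, eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
w1 = eigvecs[:, -1]                        # unit eigenvector, largest eigenvalue

def projected_variance(w):
    """Sample variance of the data projected onto direction w."""
    return ((X - mu) @ w).var()

# w1^T Sigma w1 equals the largest eigenvalue, as derived above ...
print(np.isclose(projected_variance(w1), eigvals[-1]))       # True

# ... and no random unit direction beats it.
dirs = rng.normal(size=(100, 2))
best_random = max(projected_variance(v / np.linalg.norm(v)) for v in dirs)
print(projected_variance(w1) >= best_random)                 # True
```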
3 What about K > 1?

For simplicity, in this section we shall assume that \frac{1}{n}\sum_{t=1}^n x_t = 0, because if not we can simply center the points by subtracting the mean from each one.

A simple fact from linear algebra is that, if we consider any orthonormal basis of \mathbb{R}^d given by w_1, ..., w_d \in \mathbb{R}^d, then any vector in \mathbb{R}^d can be represented as a linear combination of these d basis vectors. Recall that orthonormal vectors w_1, ..., w_d are such that each vector has unit length, that is,

    \forall i \le d, \quad \|w_i\|^2 = \sum_{j=1}^d w_i[j]^2 = 1

and the vectors are orthogonal to each other, that is,

    \forall i, j \text{ s.t. } i \ne j, \quad w_i^\top w_j = 0

The key idea we are going to use to produce the K-dimensional representation of the points is that we shall first find an orthonormal basis w_1, ..., w_d in which to represent each point x_t. However, we shall pick only K of the d basis vectors w_1, ..., w_d and approximate the data points in the chosen K-dimensional subspace spanned by w_1, ..., w_K. Thus the matrix W will be obtained by taking only these K basis vectors.

Since we can write any vector in d dimensions as a linear combination of the orthonormal basis, let us write each

    x_t = \sum_{j=1}^d y_t[j] w_j    (1)

where, for each x_t, y_t[j] represents the coefficient on the j-th basis vector w_j. Now the K-dimensional representation of the point x_t is given by the K numbers y_t[1], ..., y_t[K]. What this means is that we can view the data point x_t as being approximated by the reconstruction

    \hat{x}_t = \sum_{j=1}^K y_t[j] w_j    (2)

Now further note that, since x_t = \sum_{j=1}^d y_t[j] w_j and since w_1, ..., w_d are orthonormal, if we consider W = [w_1, ..., w_K], then

    W^\top x_t = W^\top \sum_{j=1}^d y_t[j] w_j = (y_t[1], ..., y_t[K])^\top    (3)

which coincides with our definition of the linear transformation in the first section.
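The two facts above (exact representation in a full orthonormal basis, approximation when only K basis vectors are kept) can be verified numerically. A small NumPy sketch, not from the notes, using a random orthonormal basis obtained via QR decomposition (any orthonormal basis works here; it need not be the eigenbasis yet):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 4, 2

# Columns of Wfull form a random orthonormal basis w_1, ..., w_d of R^d.
Wfull, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)
y = Wfull.T @ x                  # coefficients y[j] = w_j^T x on each basis vector

# All d coefficients reconstruct x exactly ...
print(np.allclose(Wfull @ y, x))          # True

# ... while keeping only the first K gives the approximation x_hat.
W = Wfull[:, :K]
x_hat = W @ (W.T @ x)
print(np.linalg.norm(x - x_hat) > 0)      # True: the part we threw away
```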
3.1 View I: Maximize Total Variance

Basic idea: pick the orthogonal directions along which the data is maximally spread in each of the coordinates.

How do we formalize this? We want to maximize the sum of the variances of the coordinates of the y_t's. That is, we want to maximize

    \sum_{j=1}^K \frac{1}{n} \sum_{t=1}^n \Big( y_t[j] - \frac{1}{n}\sum_{s=1}^n y_s[j] \Big)^2 = \sum_{j=1}^K \frac{1}{n} \sum_{t=1}^n \Big( w_j^\top x_t - \frac{1}{n}\sum_{s=1}^n w_j^\top x_s \Big)^2

That is, we want to solve the optimization problem:

    (w_1, ..., w_K) = \argmax_{\text{orthonormal } W} \sum_{j=1}^K \frac{1}{n}\sum_{t=1}^n \big( w_j^\top x_t \big)^2 = \argmax_{\text{orthonormal } W} \sum_{j=1}^K w_j^\top \Sigma w_j

(the inner means drop out because the data is centered). To solve the above, first note that we can solve for w_1 just as in the K = 1 case, which yields the top eigenvector as the solution for w_1. Next we can solve for w_2 subject to it being orthogonal to w_1, and this yields exactly the eigenvector with the second-largest eigenvalue. Next, w_3 is the third largest, and so forth. Thus the solution we get is the K eigenvectors with the K largest eigenvalues.

3.2 View II: Minimizing Reconstruction Error

Basic idea: another way of thinking about which orthonormal basis to choose is to pick the basis such that the reconstruction error is minimized. That is, pick the orthonormal basis w_1, ..., w_K such that

    \frac{1}{n}\sum_{t=1}^n \|x_t - \hat{x}_t\|^2

is minimized. (See the figure on page 1.)

Let us simplify the above term:

    \frac{1}{n}\sum_{t=1}^n \|\hat{x}_t - x_t\|^2
      = \frac{1}{n}\sum_{t=1}^n \Big\| \sum_{j=1}^K y_t[j] w_j - x_t \Big\|^2
      = \frac{1}{n}\sum_{t=1}^n \Big\| \sum_{j=1}^K y_t[j] w_j - \sum_{j=1}^d y_t[j] w_j \Big\|^2
      = \frac{1}{n}\sum_{t=1}^n \Big\| \sum_{j=K+1}^d y_t[j] w_j \Big\|^2
      = \frac{1}{n}\sum_{t=1}^n \Big\| \sum_{j=K+1}^d \big( w_j^\top (x_t - \mu) \big) w_j \Big\|^2

using the fact that y_t[j] = w_j^\top x_t, which follows from Eq. (1) and orthonormality.
Going from the above to the next equation is not hard but just takes a bit of staring. Note that for any vector v, \|v\|^2 = v^\top v. Expanding the above and noticing that the w_j's are orthonormal yields the following. (It's OK if this step seems hard; just take it as given.)

      = \frac{1}{n}\sum_{t=1}^n \sum_{j=K+1}^d \big( w_j^\top x_t \big)^2
      = \sum_{j=K+1}^d w_j^\top \Big( \frac{1}{n}\sum_{t=1}^n x_t x_t^\top \Big) w_j
      = \sum_{j=K+1}^d w_j^\top \Sigma w_j

since the x_t's are centered.

The orthonormal basis w_1, ..., w_d that we shall pick is the one that minimizes the reconstruction error, and is hence the orthonormal basis that minimizes \sum_{j=K+1}^d w_j^\top \Sigma w_j. We again use Lagrange multipliers to rewrite the constrained minimization problem (with the unit-norm constraints) as an unconstrained one. Specifically, we see that there exist \lambda_1, ..., \lambda_d such that the orthonormal basis w_1, ..., w_d minimizes

    \sum_{j=K+1}^d \big( w_j^\top \Sigma w_j - \lambda_j \|w_j\|^2 \big)

Taking the derivative and equating it to 0, we find that for any index K+1 \le j \le d,

    \Sigma w_j - \lambda_j w_j = 0

Thus again we find that the w_j's are eigenvectors. To minimize the reconstruction error we simply pick the eigenbasis of \Sigma, retain K of the eigenvectors, and throw away the rest. Since each w_j is an eigenvector, we have \Sigma w_j = \lambda_j w_j, where \lambda_j is the corresponding eigenvalue. Now the question remains: which eigenvectors to keep and which to throw away? To this end, recall that we want to minimize

    \sum_{j=K+1}^d w_j^\top \Sigma w_j = \sum_{j=K+1}^d \lambda_j w_j^\top w_j = \sum_{j=K+1}^d \lambda_j

Thus it stands to reason that, to minimize the above, we pick w_{K+1}, ..., w_d to be the eigenvectors with the d - K smallest eigenvalues.
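Both views can be checked numerically: with centered data, the top-K eigenvectors of \Sigma simultaneously make the total projected variance equal the sum of the K largest eigenvalues (View I) and leave a mean squared reconstruction error equal to the sum of the d - K smallest eigenvalues (View II). A minimal NumPy sketch (illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 500, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated data
X = X - X.mean(axis=0)                                 # centered, as assumed

Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
W = eigvecs[:, -K:]                        # top-K eigenvectors

# View I: total projected variance = sum of the K largest eigenvalues.
total_var = sum(W[:, j] @ Sigma @ W[:, j] for j in range(K))
print(np.isclose(total_var, eigvals[-K:].sum()))   # True

# View II: mean squared reconstruction error = sum of discarded eigenvalues.
X_hat = (X @ W) @ W.T                      # reconstructions x_hat_t
recon_err = ((X - X_hat) ** 2).sum(axis=1).mean()
print(np.isclose(recon_err, eigvals[:-K].sum()))   # True
```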
4 PCA Algorithm

What both views tell us is that the matrix W we shall use for PCA is obtained by taking the K eigenvectors corresponding to the top K eigenvalues. As for the PCA algorithm, one way to implement it is to first compute the covariance matrix from the data. This can be done by calculating the mean vector and the covariance matrix as

    \mu = \frac{1}{n}\sum_{t=1}^n x_t, \qquad \Sigma = \frac{1}{n}\sum_{t=1}^n (x_t - \mu)(x_t - \mu)^\top

Next we perform an eigendecomposition of this matrix, take the top K eigenvectors, and set

    W = [w_1, ..., w_K] = \text{eigs}(\Sigma, K)

4.1 Projection to Lower Dimension

The projection is simply given by

    y_t = W^\top (x_t - \mu)

The above is the version where we center the x_t's, which is reflected in the fact that \mu is subtracted from each x_t.

4.2 Reconstruction

Reconstruction of the data points based on the low-dimensional representation is given by

    \hat{x}_t = W y_t + \mu
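The full procedure, directly following the steps above, might be sketched in NumPy as follows. The function names (pca_fit, pca_project, pca_reconstruct) are our own illustrative choices, not from the notes:

```python
import numpy as np

def pca_fit(X, K):
    """Return (mu, W): the mean and the d x K matrix of top-K eigenvectors."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / n          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
    W = eigvecs[:, ::-1][:, :K]                # top-K, largest eigenvalue first
    return mu, W

def pca_project(X, mu, W):
    """y_t = W^T (x_t - mu), applied to every row of X."""
    return (X - mu) @ W

def pca_reconstruct(Y, mu, W):
    """x_hat_t = W y_t + mu, applied to every row of Y."""
    return Y @ W.T + mu

# Usage sketch:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
mu, W = pca_fit(X, K=2)
Y = pca_project(X, mu, W)
X_hat = pca_reconstruct(Y, mu, W)
print(Y.shape, X_hat.shape)   # (200, 2) (200, 6)
```

Note that with K = d the reconstruction is exact, since in that case W is a full orthonormal basis.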