STATS 306B: Unsupervised Learning                                  Spring 2014

Lecture 8: April 23
Lecturer: Lester Mackey                        Scribes: Kexi Nie, Na Bi

8.1 Principal Component Analysis

Last time we introduced the mathematical framework underlying Principal Component Analysis (PCA); next we will consider some of its applications. Please refer to the accompanying slides.

8.1.1 Examples

Example 1. Digit data

(Slide 2:) Here is an example taken from the textbook. This set of handwritten digit images contains 130 threes, and each three is a 16 x 16 greyscale image. Hence we may represent each datapoint as a vector of 256 greyscale pixels.

(Slide 3:) The figure on the left shows the first two principal component scores of these images. The rectangular grid is computed from selected quantiles of the two principal components. The circled points are the images whose projected coordinates are closest to the vertices of the grid. The figure on the right displays the threes corresponding to the circled points. The vertical component appears to capture changes in line thickness/darkness, while the horizontal component appears to capture changes in the length of the bottom of the three.

(Slide 4:) This is a visual representation of the learned two-component PCA model. The first term is the mean of all images, and the following v_1 and v_2 are the two visualized principal directions (the loadings), which can also be called "eigen-threes."

Example 2. Eigen-faces

(Slide 5:) PCA is widely used in face recognition. Suppose X is the d x n pixel-image matrix, where each column is a face image, d is the number of pixels, and x_{ji} is the intensity of the j-th pixel in image i. The loadings returned by PCA are linear combinations of faces, which can be called eigen-faces. The working assumption is that the PC scores z_i, obtained by projecting the original image onto the eigen-face space, represent a more meaningful and compact representation of the i-th face than the raw pixel representation. The z_i can be used in place of x_i for nearest-neighbor classification.
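The two-component model above (mean image plus two loadings, with 2-D scores per image) can be sketched directly via the SVD. This is a minimal illustration on synthetic data standing in for the 130 threes; the shapes (130 images of 256 pixels) match the example, but the data here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 256))          # 130 "images", 16 x 16 = 256 pixels each

# Center the data, then take the top-2 principal directions via the SVD.
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V2 = Vt[:2].T                            # loadings v_1, v_2 ("eigen-threes"), shape (256, 2)
Z = Xc @ V2                              # scores: 2-D coordinates for each image

# Rank-2 reconstruction of image i: mean + z_{i1} v_1 + z_{i2} v_2
X_hat = mean + Z @ V2.T
```

The columns of `V2` play the role of the eigen-threes (or eigen-faces), and the rows of `Z` are the compact representations z_i used in place of the raw pixels.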
Since the dimension of face-space has decreased from d to k, the computational complexity becomes O(dk + kn) instead of O(dn). This is a great gain in efficiency when d, n >> k.

Example 3. Latent semantic analysis

(Slide 6:) Another application of PCA is in text analysis. Let d be the total number of words in the vocabulary; then each document x_i in R^d is a vector of word counts, and x_{ji} is the frequency of word j in document i. After we apply PCA, the similarity between two documents is z_i^T z_j, which is often more informative than the raw measure x_i^T x_j. Notice that there may not be significant computational savings, since the original word-document matrix was sparse, while the reduced representation is typically dense.
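The document-similarity computation above can be sketched as follows. The data here are synthetic Poisson counts standing in for a real term-document matrix, and the vocabulary size, document count, and number of components are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 1000, 50, 10                   # vocabulary size, documents, components
# Sparse-ish word counts; documents are rows here (x_i is row i).
X = rng.poisson(0.05, size=(n, d)).astype(float)

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                        # k-dimensional document representations z_i

raw_sim = X[0] @ X[1]                    # raw inner product  x_i^T x_j
lsa_sim = Z[0] @ Z[1]                    # reduced similarity z_i^T z_j
```

Note that `X` is sparse (mostly zeros) while `Z` is dense, which is the storage caveat mentioned above.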
Example 4. Anomaly detection

(Slide 7:) PCA can be used in network anomaly detection. In the time-link matrix X, x_{ji} represents the amount of traffic on link j in the network during time interval i. In the two pictures on the left, traffic appears periodic and reasonably deterministic along the selected principal component, which suggests that these two flows exhibit normal behavior. In contrast, traffic spikes in the pictures on the right, which indicates anomalous behavior in this flow.

Example 5. Part-of-speech tagging

(Slide 8:) Unsupervised part-of-speech (POS) tagging is a common task in natural language processing, as manually tagging a large corpus is expensive and time-consuming. Here it is common to model each word in a vocabulary by its context distribution, i.e., x_{ji} is the number of times that word i appears in context j. The key idea of unsupervised POS tagging is that words appearing in similar contexts tend to have the same POS tags. Hence, a typical tagging technique is to cluster words according to their contexts. However, in any given corpus, any given context may occur rather infrequently (the vectors x_i are too sparse), so PCA has been used to find a more suitable, comparable representation for each word before clustering is applied.

Example 6. Multi-task learning

(Slide 9:) In multi-task learning, one attempts to solve related learning tasks simultaneously, e.g., classifying documents as relevant or not for n users. Often task i reduces to learning a weight vector x_i which produces, for example, the classification rule. Our goal is to exploit the similarities among these tasks to do more effective learning overall. One way to accomplish this is to use PCA to identify a small set of eigen-classifiers among the learned rules x_1, ..., x_n. Then the classifiers can be retrained with an added regularization term encouraging each x_i to lie near the subspace spanned by the principal directions. These two steps of PCA and retraining are iterated until convergence. In this way, a low-dimensional representation of classifiers can help to detect the structure shared across otherwise independent tasks.
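The PCA-then-retrain loop of Example 6 can be sketched on a toy problem. This is only an illustration under assumed specifics: the tasks are ridge regressions, the subspace penalty is tau * ||(I - P) w||^2 for a projector P onto the top-k eigen-classifiers, and all sizes and penalty weights are arbitrary choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(6)
n_tasks, p, m = 20, 10, 30              # tasks, features, examples per task
V_true = rng.normal(size=(p, 2))        # tasks truly share a 2-D subspace
W_true = V_true @ rng.normal(size=(2, n_tasks))
A = [rng.normal(size=(m, p)) for _ in range(n_tasks)]
y = [A[i] @ W_true[:, i] + 0.1 * rng.normal(size=m) for i in range(n_tasks)]

k, tau, lam = 2, 1.0, 0.1
# Step 0: learn each task's weight vector independently (ridge regression).
W = np.column_stack([
    np.linalg.solve(A[i].T @ A[i] + lam * np.eye(p), A[i].T @ y[i])
    for i in range(n_tasks)])

for _ in range(10):
    # PCA step: top-k "eigen-classifiers" of the current weight vectors.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :k] @ U[:, :k].T            # projector onto the shared subspace
    # Retraining step: minimize ||A_i w - y_i||^2 + tau ||(I - P) w||^2,
    # pulling each task's weights toward the shared subspace.
    for i in range(n_tasks):
        W[:, i] = np.linalg.solve(A[i].T @ A[i] + tau * (np.eye(p) - P),
                                  A[i].T @ y[i])
```

Each pass alternates a PCA fit over the current classifiers with a regularized refit of every task, as described above.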
8.1.2 Choosing the number of components

(Slide 10:) As in the clustering setting, we face a model selection question: how do we choose the number of principal components? While there is no agreed-upon solution to this problem, here are some guidelines. The number of principal components might be constrained by the problem goal, by your computational or storage resources, or by the minimum fraction of variance to be explained. For example, it is common to choose 3 or fewer principal components for visualization. Recall that the eigenvalue magnitudes determine the explained variance. In the accompanying figure, the first 5 principal components already explain nearly all of the variance, so a small number of principal components may be sufficient (although one must use care in drawing this conclusion, since small differences in reconstruction error may still be semantically significant; consider face recognition for example). Furthermore, we may look for an elbow in the plot of eigenvalues or compare the explained variance with that obtained under a reference distribution.
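The explained-variance criterion can be computed directly from the singular values of the centered data. A small sketch on synthetic data with a genuinely 5-dimensional signal (the 95% threshold is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
# 200 points in R^50 with a rank-5 signal plus small noise.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50)) \
    + 0.1 * rng.normal(size=(200, 50))

Xc = X - X.mean(axis=0)
eigvals = np.linalg.svd(Xc, compute_uv=False) ** 2   # eigenvalues of Xc^T Xc
frac = np.cumsum(eigvals) / eigvals.sum()            # cumulative explained variance

k = int(np.searchsorted(frac, 0.95)) + 1             # smallest k explaining 95%
```

Plotting `eigvals` also exposes the elbow: the drop after the fifth eigenvalue marks where the signal ends and the noise floor begins.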
8.1.3 PCA limitations and extensions

While PCA has a great number of applications, it has its limitations as well:

- Squared Euclidean reconstruction error is not appropriate for all data types. Various extensions, such as exponential family PCA, have been developed for binary, categorical, count, and nonnegative data.
- PCA can only find linear compressions of the data. Kernel PCA is an important generalization designed for non-linear dimensionality reduction.

8.2 Non-linear dimensionality reduction with kernel PCA

8.2.1 Intuition

Figure 8.1: Data lying near a linear subspace.  Figure 8.2: Data lying near a parabola.

Figure 8.1 displays a 2D example in which PCA is effective because the data lie near a linear subspace. However, in Figure 8.2 PCA is ineffective, because the data lie near a parabola. In this case, the PCA compression of the data might project all points onto the orange line, which is far from ideal. Let us consider the differences between these two settings mathematically.

Linear subspace (Figure 8.1): In this example we have ambient dimension p = 2 and component dimension k = 1. Since the blue line is a k-dimensional linear subspace of R^p, we know that there is some matrix U in R^{p x k} such that the subspace S takes the form

    S = {x in R^p : x = Uz, z in R^k}
      = {(x_1, x_2) : x_1 = u_1 z, x_2 = u_2 z}
      = {(x_1, x_2) : x_2 = (u_2 / u_1) x_1},

where U = [u_1; u_2], since (p, k) = (2, 1) in our example.
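The contrast between Figures 8.1 and 8.2 can be reproduced numerically: one-component PCA reconstructs near-linear data almost perfectly but incurs a large error on data near a parabola. A small sketch with synthetic stand-ins for the two figures:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
t = rng.uniform(-1, 1, size=n)

# Stand-ins for Figure 8.1 (near a line) and Figure 8.2 (near a parabola).
X_line = np.column_stack([t, 0.5 * t]) + 0.01 * rng.normal(size=(n, 2))
X_par = np.column_stack([t, 2.0 * t ** 2]) + 0.01 * rng.normal(size=(n, 2))

def pca_recon_err(X, k=1):
    """Relative error of the best rank-k reconstruction of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.linalg.norm(Xc - Xk) / np.linalg.norm(Xc)

err_line = pca_recon_err(X_line)   # tiny: the data lie near a 1-D subspace
err_par = pca_recon_err(X_par)     # large: no 1-D subspace fits a parabola
```

For the parabola, the single principal direction plays the role of the orange line above: every point is projected onto it, and the curvature is lost.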
Parabola (Figure 8.2): In this example we again have ambient dimension p = 2 and component dimension k = 1. Moreover, there is some fixed matrix U in R^{p x k} such that the underlying blue parabola takes the form

    S = {(x_1, x_2) : x_2 = (u_2 / u_1) x_1^2},

which is similar to the representation derived in the linear model. Indeed, if we introduce an auxiliary variable z, we get

    S = {(x_1, x_2) : x_1^2 = u_1 z, x_2 = u_2 z, for z in R}
      = {x in R^p : phi(x) = Uz, z in R^k},

where phi(x) = [x_1^2; x_2] is a non-linear function of x. In this final representation, U is still a linear mapping of the latent components z, but the representation being reconstructed linearly is no longer x itself but rather a potentially non-linear mapping phi of x.

8.2.2 Take-away

We should be able to capture non-linear dimensionality reduction in x space by performing linear dimensionality reduction in phi(x) space (we often call phi(x) the feature space). Of course, we still need to find the right feature space in which to perform dimensionality reduction. One option is to hand-design the feature mapping phi explicitly, coordinate by coordinate, e.g., phi(x) = (x_1, x_2^2, x_1 x_2, sin(x_1), ...). However, this process quickly becomes tedious and ad hoc. Moreover, working in feature space becomes expensive if phi(x) is very large. For example, the number of all quadratic terms x_i x_j is O(p^2). An alternative, which we will explore next, is to encode phi implicitly via its inner products using the kernel trick.

8.2.3 The kernel trick

Our path to the kernel trick begins with an interesting claim: the PCA solution depends on the data matrix

    X = [x_1^T; x_2^T; ...; x_n^T] in R^{n x p}

only through the Gram matrix (a.k.a. the kernel matrix) K = XX^T in R^{n x n}, the matrix of inner products K_ij = <x_i, x_j>.

Proof. Each principal component loading u_j is an eigenvector of X^T X:

    X^T X u_j = lambda_j u_j for some lambda_j
    ==>  u_j = (1 / lambda_j) X^T X u_j = X^T alpha_j = sum_{i=1}^n alpha_{ji} x_i

for some weights alpha_j. That is, u_j is a linear combination of the datapoints. This is called a representer theorem for the PCA solution. It is analogous
to representer theorems you may have seen for Support Vector Machines or ridge regression. Therefore one can restrict attention to candidate loadings u_j of this form. Now consider the PCA objective

    max_{u_j} u_j^T X^T X u_j   s.t. ||u_j||^2 = 1,  u_j^T X^T X u_l = 0, l < j.

Substituting u_j = X^T alpha_j gives

    max_{alpha_j} alpha_j^T X X^T X X^T alpha_j
        s.t. alpha_j^T X X^T alpha_j = 1,  alpha_j^T X X^T X X^T alpha_l = 0, l < j

    = max_{alpha_j} alpha_j^T K^2 alpha_j
        s.t. alpha_j^T K alpha_j = 1,  alpha_j^T K^2 alpha_l = 0, l < j,    (8.1)

which only depends on the data through K!

The final representation of PCA in kernel form (8.1) is an example of a generalized eigenvalue problem, so we know how to compute its solution. However, we will give a more explicit derivation of its solution by converting this problem into an equivalent eigenvalue problem. Hereafter we will assume K is non-singular. Let beta_j = K^{1/2} alpha_j so that alpha_j = K^{-1/2} beta_j. Now the problem (8.1) becomes

    max_{beta_j} beta_j^T K beta_j   s.t. beta_j^T beta_j = 1,  beta_j^T K beta_l = 0, l < j.

This is an eigenvalue problem whose solution is given by beta_j = the j-th leading eigenvector of K, and hence alpha_j = K^{-1/2} beta_j = beta_j / sqrt(lambda_j(K)). Furthermore, we can recover the principal component scores from this representation via

    Z = [u_1, ..., u_k]^T X^T = [alpha_1, ..., alpha_k]^T X X^T = [alpha_1, ..., alpha_k]^T K.

The punchline is that we can solve PCA by finding the eigenvectors and eigenvalues of K; this is kernel PCA, the kernelized form of the PCA algorithm (note that the solution is equivalent to the original PCA solution when K = XX^T). Hence, the inner products of X are sufficient, and we do not need additional access to the explicit datapoints. Why is this relevant? Suppose we want to run kernel PCA on a non-linear mapping of the data,

    Phi = [phi(x_1)^T; phi(x_2)^T; ...; phi(x_n)^T].

Then we do not need to compute or store Phi explicitly; K^phi = Phi Phi^T suffices to run kernel PCA. Moreover, we can often compute the entries K^phi_ij = <phi(x_i), phi(x_j)> via a kernel function K(x_i, x_j) without forming phi(x_i) explicitly. This is the kernel trick. Here are a few common examples:
Kernel trick examples

    Kernel                           K(x_i, x_j)                       phi(x)
    Linear                           <x_i, x_j>                        x
    Quadratic                        (1 + <x_i, x_j>)^2                (1, x_1, ..., x_p, x_1^2, ..., x_p^2, x_1 x_2, ..., x_{p-1} x_p)
    Polynomial                       (1 + <x_i, x_j>)^d                all monomials of order d or less
    Gaussian / radial basis function exp(-||x_i - x_j||^2 / (2 sigma^2))  infinite-dimensional feature vector

A principal advantage of the kernel trick is that one can carry out non-linear dimensionality reduction with little dependence on the dimension of the non-linear feature space. However, one has to form and operate on an n x n matrix (which can be quite expensive). It is common to approximate the kernel matrix when n is large using random (e.g., the Nystrom method of Williams & Seeger, 2000) or deterministic (e.g., the incomplete Cholesky decomposition of Fine & Scheinberg, 2001) low-rank approximations.
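The kernel-PCA recipe of Section 8.2.3 can be checked numerically: with the linear kernel K = XX^T, the eigenvectors of K and the scaling alpha_j = beta_j / sqrt(lambda_j) reproduce the ordinary PCA scores up to a per-component sign. A minimal sketch on random centered data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 6))
Xc = X - X.mean(axis=0)

# Linear-kernel Gram matrix on the centered data.
K = Xc @ Xc.T
lam, B = np.linalg.eigh(K)               # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, B = lam[order], B[:, order]

k = 2
# alpha_j = beta_j / sqrt(lambda_j); scores are Z = [alpha_1, ..., alpha_k]^T K.
A = B[:, :k] / np.sqrt(lam[:k])
Z_kpca = K @ A                           # n x k kernel-PCA scores

# Ordinary PCA scores from the SVD of Xc, for comparison.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_pca = Xc @ Vt[:k].T
# The two agree up to an arbitrary sign per component.
```

For a non-linear kernel one would simply replace `K` with the matrix of kernel evaluations K(x_i, x_j), e.g., the Gaussian kernel from the table above, without ever forming phi explicitly.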