Principal Components Analysis

Cheng Li, Bingyu Wang

November 3, 2014

1 What is PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. In other words, PCA tries to identify the subspace in which the data approximately lies.

Here are two examples given by Andrew Ng's notes:

1) Given a dataset $\{x^{(i)}; i = 1, \dots, m\}$ of attributes of $m$ different types of automobiles, such as their maximum speed, turn radius, and so on, let $x^{(i)} \in \mathbb{R}^n$ for each $i$. Unknown to us, two different features, some $x_i$ and $x_j$, respectively give a car's maximum speed measured in miles per hour and its maximum speed measured in kilometers per hour. These two features are therefore almost linearly dependent, up to only small differences introduced by rounding off to the nearest mph and kph. Thus the data really lies approximately on an $(n-1)$-dimensional subspace. How can we automatically detect, and perhaps remove, this redundancy?

2) For a less contrived example, consider a dataset resulting from a survey of pilots for radio-controlled helicopters, where $x_1^{(i)}$ is a measure of the piloting skill of pilot $i$, and $x_2^{(i)}$ captures how much he/she enjoys flying. Because RC helicopters are very difficult to fly, only the most committed students, the ones who truly enjoy flying, become good pilots. So the two features $x_1$ and $x_2$ are strongly correlated. Indeed, we might posit that the data actually lies along some diagonal axis (the $u_1$ direction) capturing the intrinsic piloting "karma" of a person, with only a small amount of noise lying off this axis. (See Figure 1.) How can we automatically compute this $u_1$ direction?

Figure 1: Pilots' skill and enjoyment relationship.

2 Pre-processing Before PCA

Before running PCA, we typically first pre-process the data to normalize its mean and variance, as follows (a code sketch of all four steps is given after the list):

1. Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$.

2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.

3. Let $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)})^2$.

4. Replace each $x_j^{(i)}$ with $x_j^{(i)}/\sigma_j$.
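The following is a minimal NumPy sketch of these four steps, assuming the data is stored as an $m \times n$ array with one example per row (the function and variable names are illustrative, not from the original notes):

    import numpy as np

    def normalize(X):
        # Steps 1-2: subtract the per-feature mean so the data has zero mean.
        mu = X.mean(axis=0)
        X = X - mu
        # Steps 3-4: rescale each coordinate to unit variance
        # (the mean is already zero, so the variance is the mean of squares).
        sigma = np.sqrt((X ** 2).mean(axis=0))
        return X / sigma

    # Example: maximum speed in mph and number of seats live on very
    # different scales; after normalization they become comparable.
    X = np.array([[120.0, 4.0], [150.0, 2.0], [180.0, 5.0], [90.0, 8.0]])
    X_norm = normalize(X)
    print(X_norm.mean(axis=0))         # approximately [0, 0]
    print((X_norm ** 2).mean(axis=0))  # approximately [1, 1]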
Step 2 zeroes out the mean of the data, and may be omitted for data known to have zero mean (for instance, time series corresponding to speech or other acoustic signals). Steps 3-4 rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same scale. For instance, if $x_1$ was a car's maximum speed in mph (taking values in {10-200}) and $x_2$ was the number of seats (taking values in {2-8}), then this renormalization rescales the different features to make them more comparable. Steps 3-4 may be omitted if we have a priori knowledge that the different features are all on the same scale (for instance, the MNIST digits dataset).

3 PCA Theory

After the pre-processing steps 1-4 described above, all that remains is an eigenvector calculation: we simply choose the top $k$ eigenvectors as the new subspace. Why does this work, and what is the theory behind PCA? In fact, there are many theories that can explain PCA; here we choose one of them, called the Maximum Variance theory.

3.1 Maximum Variance Theory

Having carried out the normalization, how do we compute the "major axis of variation" $u$, that is, the direction on which the data approximately lies? One way to pose this problem is as finding the unit vector $u$ so that when the data is projected onto the direction corresponding to $u$, the variance of the projected data is maximized. Intuitively, the data starts off with some amount of variance/information in it. We would like to choose a direction $u$ so that if we were to approximate the data as lying in the direction/subspace corresponding to $u$, as much of this variance as possible is still retained.
3.2 Analysis

Consider the following dataset (see Figure 2), on which we have already carried out the normalization steps.

Figure 2: Sample data in two dimensions.

Now suppose we pick $u$ to correspond to the direction shown in Figure 3. The circles denote the projections of the original data onto this line.

Figure 3: Sample data with maximum variance.

We see that in Figure 3 the projected data still has a fairly large variance, and the points tend to be far from zero. In contrast, suppose we had instead picked the direction shown in Figure 4.

Figure 4: Sample data with minimum variance.

In Figure 4 the projections have a significantly smaller variance and are much closer to the origin.

We would like to automatically select the direction $u$ corresponding to Figure 3, the direction of maximum variance. First we need to know how to calculate the distance of a point's projection onto $u$ from the origin: given a unit vector $u$ and a point $x$, the length of the projection of $x$ onto $u$ is given by $x^T u$. So the problem can be transformed into the following mathematical problem: choose $u$ so that

\[
\max_{u:\|u\|=1} \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)T} u\big)^2.
\]
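Before deriving the solution, here is a small sketch that evaluates this objective numerically for two candidate directions, using made-up correlated data in the spirit of Figures 3 and 4 (the data and names are illustrative, not the figures' actual points):

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up, mean-centered data lying roughly along the diagonal y = x.
    t = rng.normal(size=200)
    X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])
    X = X - X.mean(axis=0)

    def projected_variance(X, u):
        # (1/m) * sum_i (x^(i)T u)^2 for a unit vector u.
        u = u / np.linalg.norm(u)
        return np.mean((X @ u) ** 2)

    print(projected_variance(X, np.array([1.0, 1.0])))   # diagonal: large, as in Figure 3
    print(projected_variance(X, np.array([1.0, -1.0])))  # perpendicular: small, as in Figure 4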
Next, we derive this maximization as follows:
\[
\begin{aligned}
\max_{u:\|u\|=1} \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)T}u\big)^2
&= \max_{u:\|u\|=1} \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)T}u\big)^T \big(x^{(i)T}u\big) \\
&= \max_{u:\|u\|=1} \frac{1}{m}\sum_{i=1}^{m} \big(u^T x^{(i)}\big)\big(x^{(i)T}u\big) \\
&= \max_{u:\|u\|=1} \frac{1}{m}\sum_{i=1}^{m} u^T x^{(i)} x^{(i)T} u \\
&= \max_{u:\|u\|=1} u^T \Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}\Big) u \\
&= \max_{u:\|u\|=1} u^T \Sigma u, \quad \text{where } \Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}.
\end{aligned}
\]

Now we can use a Lagrange multiplier to continue. We want to maximize $u^T \Sigma u$ subject to $u^T u = 1$ (because $\|u\| = 1$), so we form the Lagrangian

\[
L(u, \lambda) = u^T \Sigma u + \lambda(u^T u - 1)
\]

and set its gradient with respect to $u$ to zero:

\[
\frac{\partial L(u, \lambda)}{\partial u} = \frac{\partial (u^T \Sigma u)}{\partial u} + \frac{\partial (\lambda u^T u)}{\partial u}.
\]

Since $u \in \mathbb{R}^n$ is a column vector, matrix calculus (see http://en.wikipedia.org/wiki/Matrix_calculus) gives

\[
\frac{\partial (x^T A x)}{\partial x} = 2Ax, \qquad \frac{\partial (x^T x)}{\partial x} = 2x,
\]

where $A$ is symmetric and not a function of $x$, and $x$ is a column vector. Then we get

\[
\frac{\partial L(u, \lambda)}{\partial u} = 2\Sigma u + 2\lambda u = 0.
\]

This gives $\Sigma u = -\lambda u$; since $\lambda$ is just a scalar multiplier, we can rename $-\lambda$ as $\lambda$. Finally, we get

\[
\Sigma u = \lambda u,
\]

where $u$ is an eigenvector of $\Sigma$ and $\lambda$ is the corresponding eigenvalue.

To summarize, we have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose $u$ to be the principal eigenvector of $\Sigma$.
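Here is a minimal NumPy sketch of this result, assuming X is the normalized $m \times n$ data array from the earlier sketch; it covers both the 1-dimensional case just derived and the top-$k$ generalization described next:

    import numpy as np

    def pca(X, k=1):
        # Sigma = (1/m) * sum_i x^(i) x^(i)T, the matrix from the derivation above.
        m = X.shape[0]
        Sigma = (X.T @ X) / m
        # eigh is for symmetric matrices and returns eigenvalues in ascending
        # order; reverse the columns to get the top-k eigenvectors.
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        U = eigvecs[:, ::-1][:, :k]
        # Row i of X_hat is (u_1^T x^(i), ..., u_k^T x^(i)).
        X_hat = X @ U
        return U, X_hat

    # On the correlated 2-D data from the earlier sketch, the principal
    # eigenvector comes out close to the diagonal (1, 1)/sqrt(2), up to sign.
    rng = np.random.default_rng(0)
    t = rng.normal(size=200)
    X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])
    X = X - X.mean(axis=0)
    U, X_hat = pca(X, k=1)
    print(U.ravel())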
More generally, if we wish to project our data into a $k$-dimensional subspace ($k < n$), we should choose $u_1, u_2, \dots, u_k$ to be the top $k$ eigenvectors of $\Sigma$. The $u_i$'s now form a new, orthogonal basis for the data. (Because $\Sigma$ is symmetric, the $u_i$'s will be, or can always be chosen to be, orthogonal to each other.)

Then, to represent $x^{(i)}$ in this basis, we need only compute the corresponding vector

\[
\hat{x}^{(i)} = \begin{bmatrix} u_1^T x^{(i)} \\ u_2^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{bmatrix} \in \mathbb{R}^k.
\]

Thus, whereas $x^{(i)} \in \mathbb{R}^n$, the vector $\hat{x}^{(i)}$ now gives a lower, $k$-dimensional approximation/representation for $x^{(i)}$. That is to say, $x^{(i)}$ is the original datapoint and $\hat{x}^{(i)}$ is the new datapoint after PCA. PCA is therefore also referred to as a dimensionality reduction algorithm. The vectors $u_1, \dots, u_k$ are called the first $k$ principal components of the data.

4 References

PCA Lecture Notes by Andrew Ng (Stanford University).