Introduction to Machine Learning — Recitation, Fall Semester
Lecturer: Regev Schweiger
Scribe: Regev Schweiger

1.1 Kernel Ridge Regression

We now take on the task of kernelizing ridge regression. Let $x_1, \ldots, x_n \in \mathbb{R}^d$, and $y \in \mathbb{R}^n$. Recall that ridge regression solves the following problem:
$$\hat{a} = \arg\min_{a \in \mathbb{R}^d} \|y - Xa\|^2 + \lambda \|a\|^2$$
where $\lambda$ is the penalty coefficient. Equating the gradient to $0$ results in the solution we have seen in class:
$$\hat{a} = (X^T X + \lambda I_d)^{-1} X^T y$$
Note that
$$(X^T X + \lambda I_d) X^T = X^T X X^T + \lambda X^T = X^T (X X^T + \lambda I_n)$$
Multiplying by $(X^T X + \lambda I_d)^{-1}$ on the left and by $(X X^T + \lambda I_n)^{-1}$ on the right, we get
$$X^T (X X^T + \lambda I_n)^{-1} = (X^T X + \lambda I_d)^{-1} X^T$$
Therefore, the optimal solution is, equivalently,
$$\hat{a} = X^T (X X^T + \lambda I_n)^{-1} y$$
Given a new point $x$, our regression estimate will be
$$x^T \hat{a} = x^T X^T (X X^T + \lambda I_n)^{-1} y$$
We would now like to embed our points into a space $\mathcal{H}$, with $x_i \mapsto \varphi(x_i)$, and perform ridge regression after the transformation. It is easy to see that, given the formulation above, we can replace all the expressions involving $X$ with kernel expressions. First define $K$ as the matrix for which $K_{i,j} = K(x_i, x_j)$. Similarly, define $k$ as the vector for which $k_i = \varphi(x)^T \varphi(x_i) = K(x, x_i)$. Thus, given a new point $x$, our regression estimate will be
$$\varphi(x)^T \hat{a} = k^T (K + \lambda I_n)^{-1} y$$
Note that, as usual, we cannot write down $\hat{a}$ explicitly, but we can apply it to the transformation of new points.
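The kernelized estimate above can be sketched in a few lines of numpy. This is only an illustrative sketch, not code from the lecture; the function names and the RBF kernel (with its `gamma` parameter) are our own choices:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2); gamma is an illustrative choice
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam, kernel=rbf_kernel):
    # Solve (K + lam * I_n) alpha = y; predictions then take the form k^T alpha
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, kernel=rbf_kernel):
    # Each row of kernel(X_new, X_train) is the vector k with k_i = K(x, x_i)
    return kernel(X_new, X_train) @ alpha
```

A useful sanity check is that with the linear kernel $K(x, x') = x^T x'$, the kernelized predictions coincide with those of plain ridge regression, exactly as the identity derived above promises.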
1.2 PCA as maximizing variance

We have seen how the PCA algorithm can be derived in the context of minimizing the reconstruction error. More formally, assume we have a set of input vectors $x_1, \ldots, x_n$, where $x_i \in \mathbb{R}^d$. Denote the principal components by the columns of $V$, as $v_1, \ldots, v_r$; the orthonormality constraints imply that $V^T V = I_r$. The PCA problem was:
$$V = \arg\min_{V \in \mathbb{R}^{d \times r}} \sum_{i=1}^n \|x_i - V V^T x_i\|^2. \tag{1.1}$$
We now consider another possible criterion. Let's assume $r = 1$; that is, we would like to find the best line in some sense. One intuitive criterion is the line which, if we project all points onto it, gives maximal empirical variance. The empirical variance of a set of measurements $a_1, \ldots, a_n$ is
$$\frac{1}{n} \sum_{i=1}^n \Big(a_i - \frac{1}{n} \sum_{j=1}^n a_j\Big)^2$$
Assume without loss of generality that the data points are centered at zero, that is,
$$\sum_{i=1}^n x_i = 0$$
If that is not the case, we mean-center the data. Therefore, $\sum_{i=1}^n v^T x_i = 0$ for each $v$, so the empirical variance of the set of projections is simply the mean of squares; since the constant factor $\frac{1}{n}$ does not affect the maximizer, the criterion we would like for the first direction is:
$$v_1 = \arg\max_{\|v\|=1} \sum_{i=1}^n (v^T x_i)^2$$
For the next directions, we would like to capture the variance in directions we have not yet seen. Formally, we would like directions orthogonal to the previous ones. Assume we have already found $v_1, \ldots, v_{r-1}$. Then the $r$-th direction is:
$$v_r = \arg\max_{\|v\|=1,\; v \perp v_1, \ldots, v_{r-1}} \sum_{i=1}^n (v^T x_i)^2$$
We can instead formulate this so as to find all $r$ directions together:
$$\arg\max_{V \in \mathbb{R}^{d \times r},\; V^T V = I_r} \sum_{i=1}^n \sum_{j=1}^r (v_j^T x_i)^2$$
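The variance-maximization criterion can be checked numerically. The sketch below (synthetic data; function names are our own, not from the lecture) obtains the top-$r$ directions as eigenvectors of the empirical covariance and compares the projected variance they achieve against an arbitrary orthonormal $V$:

```python
import numpy as np

def top_directions(X, r):
    # Eigenvectors of the empirical covariance, largest eigenvalues first
    Xc = X - X.mean(axis=0)        # mean-center, as the text assumes
    C = Xc.T @ Xc / len(Xc)        # empirical covariance matrix
    _, V = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    return V[:, ::-1][:, :r]       # columns v_1, ..., v_r

def projected_variance(X, V):
    # (1/n) * sum_i ||V^T x_i||^2, computed on mean-centered data
    Xc = X - X.mean(axis=0)
    return (np.linalg.norm(Xc @ V, axis=1) ** 2).mean()
```

On random data, `projected_variance` evaluated at the eigenvector solution should never be smaller than at a random orthonormal matrix (e.g., one obtained via a QR decomposition).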
It is easy to see that the objective function satisfies:
$$\sum_{i=1}^n \sum_{j=1}^r (v_j^T x_i)^2 = \sum_{i=1}^n \|V^T x_i\|^2$$
To summarize, a sensible criterion for dimensionality reduction would be to choose $V$ so that the variance of the projections is maximized, i.e., intuitively, so that the structure of the data is preserved as much as possible:
$$\arg\max_{V \in \mathbb{R}^{d \times r},\; V^T V = I_r} \sum_{i=1}^n \|V^T x_i\|^2. \tag{1.2}$$
We note, however, the following equality, based on the Pythagorean theorem:
$$\|x_i\|^2 = \|V V^T x_i\|^2 + \|x_i - V V^T x_i\|^2.$$
It is also easy to see that $\|V V^T x_i\|^2 = \|V^T x_i\|^2$, due to the orthonormality of $V$. Since $\|x_i\|^2$ does not depend on $V$, we see that minimizing the reconstruction error is equivalent to maximizing the variance. The goal in principal component analysis (PCA) is therefore to minimize the reconstruction error (Equation 1.1), or, equivalently, to maximize the projected variance (Equation 1.2).

Eigenvalues. An important observation is the following. We know that the solution of PCA is given by the eigenvectors of the empirical covariance matrix. What are the eigenvalues? The variance-maximization criterion gives an intuitive interpretation. Let $\lambda_1, \ldots, \lambda_d$ be the eigenvalues of $C = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$; i.e., $C v_j = \lambda_j v_j$. We sought to maximize $\sum_{i=1}^n (v^T x_i)^2$. Plugging in $v = v_j$, we get
$$\frac{1}{n} \sum_{i=1}^n (v_j^T x_i)^2 = v_j^T \Big(\frac{1}{n} \sum_{i=1}^n x_i x_i^T\Big) v_j = v_j^T C v_j = \lambda_j$$
That is, $\lambda_j$, the $j$-th eigenvalue, is the empirical variance of the projection on the $j$-th principal axis.

1.3 PCA example

1.3.1 Background

The DNA in our cells contains long chains of four chemical building blocks: adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these
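The eigenvalue interpretation is easy to verify numerically. The following sketch (synthetic data, not from the lecture) checks that each eigenvalue of the empirical covariance equals the empirical variance of the projections onto the corresponding eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                  # mean-center the data
C = X.T @ X / len(X)                    # C = (1/n) * sum_i x_i x_i^T
lams, V = np.linalg.eigh(C)             # C v_j = lambda_j v_j
proj_var = ((X @ V) ** 2).mean(axis=0)  # (1/n) * sum_i (v_j^T x_i)^2, per j
print(np.allclose(proj_var, lams))      # prints True
```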
chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.

The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ. Differences in individual bases are by far the most common type of genetic variation. One person might have an A at that location, while another person has a G. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). There are approximately 10 million SNPs estimated to occur commonly in the human genome. Each distinct spelling of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype. In the most common case, there are only two alleles present in the population at each SNP position.

Data describing the genotypes of individuals often does not specify the bases explicitly. Instead, one allele (per position) is selected as a reference allele. Then, at that position, the number of non-reference alleles is recorded: 0 if both alleles at that position, in the chromosome pair, were identical to the reference allele for that position; 1 if only one of them was the reference allele; and 2 if neither was the reference allele.

1.3.2 Novembre et al., 2008

In the work of Novembre et al. (2008, Nature), 3,192 European individuals were genotyped at 500,568 positions (some details are omitted for simplicity). They applied PCA with r = 2 and presented the projections of all genomes on these two principal axes:
Each individual is represented by a colored two-letter label indicating their country of origin. It can be seen that the projections reflect the geography of Europe well.
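The 0/1/2 encoding described above can be illustrated with a small sketch. The alleles and reference choices below are a hypothetical toy example, not data from the paper:

```python
import numpy as np

def encode_genotypes(pairs, reference):
    # Count the non-reference alleles at each SNP position: 0, 1, or 2.
    # `pairs` holds one allele pair per position; `reference` holds the
    # reference allele chosen for each position.
    return np.array([sum(allele != ref for allele in pair)
                     for pair, ref in zip(pairs, reference)])

counts = encode_genotypes([("A", "A"), ("A", "G"), ("G", "G")],
                          reference=["A", "A", "A"])
print(counts)  # prints [0 1 2]
```

Stacking one such vector per individual yields the individuals-by-SNPs matrix to which PCA is then applied.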