CS 189 Introduction to Machine Learning
Spring 2018   Note 11

1 Canonical Correlation Analysis

The Pearson Correlation Coefficient $\rho(X, Y)$ is a way to measure how linearly related (in other words, how well a linear model captures the relationship between) random variables $X$ and $Y$:
$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$
Here are some important facts about it:

- It is commutative: $\rho(X, Y) = \rho(Y, X)$.
- It always lies between $-1$ and $1$: $-1 \le \rho(X, Y) \le 1$.
- It is completely invariant to affine transformations: for any $a, b, c, d \in \mathbb{R}$ with $ac > 0$,
$$\rho(aX + b, cY + d) = \frac{\operatorname{Cov}(aX + b, cY + d)}{\sqrt{\operatorname{Var}(aX + b)\operatorname{Var}(cY + d)}} = \frac{\operatorname{Cov}(aX, cY)}{\sqrt{\operatorname{Var}(aX)\operatorname{Var}(cY)}} = \frac{ac\operatorname{Cov}(X, Y)}{\sqrt{a^2\operatorname{Var}(X)\, c^2\operatorname{Var}(Y)}} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \rho(X, Y)$$
(If $ac < 0$, the same computation shows that only the sign of $\rho$ flips.)

The correlation is defined in terms of random variables rather than observed data. Assume now that $x, y \in \mathbb{R}^n$ are vectors containing $n$ independent observations of $X$ and $Y$, respectively. Recall the law of large numbers, which states that for i.i.d. $X_i$ with mean $\mu$,
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty$$
We can use this law to justify a sample-based approximation to the mean:
$$\operatorname{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] \approx \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where the bar indicates the sample average, i.e. $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Then as a special case we have
$$\operatorname{Var}(X) = \operatorname{Cov}(X, X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \approx \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \operatorname{Var}(Y) = \operatorname{Cov}(Y, Y) = \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big] \approx \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$
Plugging these estimates into the definition for correlation and canceling the factor of $1/n$ leads us to the sample Pearson Correlation Coefficient $\hat{\rho}$:
$$\hat{\rho}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\tilde{x}^\top \tilde{y}}{\|\tilde{x}\|\,\|\tilde{y}\|} \quad \text{where } \tilde{x} = x - \bar{x},\ \tilde{y} = y - \bar{y}$$
Looking at 2-D scatterplots of data with various correlation coefficients, you should notice that:

- The magnitude of $\hat{\rho}$ increases as $X$ and $Y$ become more linearly correlated.
- The sign of $\hat{\rho}$ tells whether $X$ and $Y$ have a positive or negative relationship.
- The correlation coefficient is undefined if either $X$ or $Y$ has variance 0 (a horizontal or vertical line of points).
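To make the estimate concrete, here is a minimal NumPy sketch of $\hat{\rho}$; the function name and the synthetic data are illustrative and not part of the note.

```python
import numpy as np

def sample_pearson(x, y):
    """Sample Pearson correlation coefficient rho_hat(x, y) for paired 1-D data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_tilde = x - x.mean()   # centered observations of X
    y_tilde = y - y.mean()   # centered observations of Y
    # rho_hat = x_tilde . y_tilde / (||x_tilde|| * ||y_tilde||)
    return (x_tilde @ y_tilde) / (np.linalg.norm(x_tilde) * np.linalg.norm(y_tilde))

# Strongly (positively) linearly related data should give rho_hat close to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + 0.1 * rng.normal(size=500)
print(sample_pearson(x, y))      # close to 1
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in Pearson correlation agrees
```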

1.1 Correlation and Gaussians

Here's a neat fact: if $X$ and $Y$ are jointly Gaussian, i.e.
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}(0, \Sigma)$$
then we can define a distribution on normalized $X$ and $Y$ and have their relationship entirely captured by $\rho(X, Y)$. First write
$$\rho(X, Y) = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Then
$$\Sigma = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}$$
so
$$\begin{bmatrix} \sigma_x^{-1} X \\ \sigma_y^{-1} Y \end{bmatrix} \sim \mathcal{N}\!\left(0,\ \begin{bmatrix} \sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1} \end{bmatrix} \Sigma \begin{bmatrix} \sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1} \end{bmatrix}\right) = \mathcal{N}\!\left(0,\ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)$$

1.2 Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a method of modeling the relationship between two point sets by making use of the correlation coefficient. Formally, given zero-mean random vectors $X_{\mathrm{rv}} \in \mathbb{R}^p$ and $Y_{\mathrm{rv}} \in \mathbb{R}^q$, we want to find projection vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ that maximize the correlation between $X_{\mathrm{rv}}^\top u$ and $Y_{\mathrm{rv}}^\top v$:
$$\max_{u, v}\ \rho(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v) = \frac{\operatorname{Cov}(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v)}{\sqrt{\operatorname{Var}(X_{\mathrm{rv}}^\top u)\operatorname{Var}(Y_{\mathrm{rv}}^\top v)}}$$
Observe that
$$\operatorname{Cov}(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v) = \mathbb{E}\big[(X_{\mathrm{rv}}^\top u - \mathbb{E}[X_{\mathrm{rv}}^\top u])(Y_{\mathrm{rv}}^\top v - \mathbb{E}[Y_{\mathrm{rv}}^\top v])\big] = \mathbb{E}\big[u^\top (X_{\mathrm{rv}} - \mathbb{E}[X_{\mathrm{rv}}])(Y_{\mathrm{rv}} - \mathbb{E}[Y_{\mathrm{rv}}])^\top v\big] = u^\top \mathbb{E}\big[(X_{\mathrm{rv}} - \mathbb{E}[X_{\mathrm{rv}}])(Y_{\mathrm{rv}} - \mathbb{E}[Y_{\mathrm{rv}}])^\top\big] v = u^\top \operatorname{Cov}(X_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v$$
which also implies (since $\operatorname{Var}(Z) = \operatorname{Cov}(Z, Z)$ for any random variable $Z$) that
$$\operatorname{Var}(X_{\mathrm{rv}}^\top u) = u^\top \operatorname{Cov}(X_{\mathrm{rv}}, X_{\mathrm{rv}})\, u \qquad \operatorname{Var}(Y_{\mathrm{rv}}^\top v) = v^\top \operatorname{Cov}(Y_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v$$
so the correlation writes
$$\rho(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v) = \frac{u^\top \operatorname{Cov}(X_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v}{\sqrt{u^\top \operatorname{Cov}(X_{\mathrm{rv}}, X_{\mathrm{rv}})\, u \cdot v^\top \operatorname{Cov}(Y_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v}}$$
Unfortunately, we do not have access to the true distributions of $X_{\mathrm{rv}}$ and $Y_{\mathrm{rv}}$, so we cannot compute these covariance matrices. However, we can estimate them from data. Assume now that we are given zero-mean data matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, where the rows of the matrix $X$ are i.i.d. samples $x_i \in \mathbb{R}^p$ from the random variable $X_{\mathrm{rv}}$, and correspondingly for $Y_{\mathrm{rv}}$. Then
$$\operatorname{Cov}(X_{\mathrm{rv}}, Y_{\mathrm{rv}}) = \mathbb{E}\big[(X_{\mathrm{rv}} - \mathbb{E}[X_{\mathrm{rv}}])(Y_{\mathrm{rv}} - \mathbb{E}[Y_{\mathrm{rv}}])^\top\big] = \mathbb{E}[X_{\mathrm{rv}} Y_{\mathrm{rv}}^\top] \approx \frac{1}{n}\sum_{i=1}^{n} x_i y_i^\top = \frac{1}{n} X^\top Y$$
where again the sample-based approximation is justified by the law of large numbers. Similarly,
$$\operatorname{Cov}(X_{\mathrm{rv}}, X_{\mathrm{rv}}) = \mathbb{E}[X_{\mathrm{rv}} X_{\mathrm{rv}}^\top] \approx \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = \frac{1}{n} X^\top X \qquad \operatorname{Cov}(Y_{\mathrm{rv}}, Y_{\mathrm{rv}}) = \mathbb{E}[Y_{\mathrm{rv}} Y_{\mathrm{rv}}^\top] \approx \frac{1}{n}\sum_{i=1}^{n} y_i y_i^\top = \frac{1}{n} Y^\top Y$$
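As a quick illustration of these plug-in estimates, here is a small sketch (all names are illustrative) that forms the three sample covariance matrices from zero-mean data matrices and evaluates $\hat{\rho}(Xu, Yv)$ for given directions $u$ and $v$:

```python
import numpy as np

def sample_covariances(X, Y):
    """Plug-in covariance estimates from zero-mean data matrices.

    X: (n, p) matrix whose rows are samples of X_rv.
    Y: (n, q) matrix whose rows are samples of Y_rv.
    """
    n = X.shape[0]
    Sxy = X.T @ Y / n   # estimate of Cov(X_rv, Y_rv), shape (p, q)
    Sxx = X.T @ X / n   # estimate of Cov(X_rv, X_rv), shape (p, p)
    Syy = Y.T @ Y / n   # estimate of Cov(Y_rv, Y_rv), shape (q, q)
    return Sxy, Sxx, Syy

def rho_hat(X, Y, u, v):
    """Sample correlation of the projections Xu and Yv (the CCA objective)."""
    Sxy, Sxx, Syy = sample_covariances(X, Y)
    return (u @ Sxy @ v) / np.sqrt((u @ Sxx @ u) * (v @ Syy @ v))
```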

Plugging these estimates in for the true covariance matrices, we arrive at the problem
$$\max_{u, v} \frac{u^\top \big(\tfrac{1}{n} X^\top Y\big) v}{\sqrt{u^\top \big(\tfrac{1}{n} X^\top X\big) u \cdot v^\top \big(\tfrac{1}{n} Y^\top Y\big) v}} = \max_{u, v} \underbrace{\frac{u^\top X^\top Y v}{\sqrt{u^\top X^\top X u \cdot v^\top Y^\top Y v}}}_{\hat{\rho}(Xu, Yv)}$$
Let's try to massage the maximization problem into a form that we can reason with more easily. Our strategy is to choose matrices to transform $X$ and $Y$ such that the maximization problem is equivalent but easier to understand; both transformations are illustrated in the short code sketch after step 2 below.

1. First, let's choose matrices $W_x$, $W_y$ to whiten $X$ and $Y$. This will make the (co)variance matrices $(XW_x)^\top(XW_x)$ and $(YW_y)^\top(YW_y)$ become identity matrices and simplify our expression. To do this, note that $X^\top X$ is positive definite (and hence symmetric), so we can employ the eigendecomposition
$$X^\top X = U_x S_x U_x^\top$$
Since $S_x = \operatorname{diag}(\lambda_1(X^\top X), \dots, \lambda_p(X^\top X))$, where all the eigenvalues are positive, we can define the square root of this matrix by taking the square root of every diagonal entry:
$$S_x^{1/2} = \operatorname{diag}\!\left(\sqrt{\lambda_1(X^\top X)}, \dots, \sqrt{\lambda_p(X^\top X)}\right)$$
Then, defining $W_x = U_x S_x^{-1/2} U_x^\top$, we have
$$(XW_x)^\top(XW_x) = W_x^\top X^\top X W_x = U_x S_x^{-1/2} U_x^\top\, U_x S_x U_x^\top\, U_x S_x^{-1/2} U_x^\top = U_x S_x^{-1/2} S_x S_x^{-1/2} U_x^\top = U_x U_x^\top = I$$
which shows that $W_x$ is a whitening matrix for $X$. The same process can be repeated to produce a whitening matrix $W_y = U_y S_y^{-1/2} U_y^\top$ for $Y$.

Let's denote the whitened data $X_w = XW_x$ and $Y_w = YW_y$. Then by the change of variables $u_w = W_x^{-1} u$, $v_w = W_y^{-1} v$,
$$\begin{aligned}
\max_{u, v}\ \hat{\rho}(Xu, Yv) &= \max_{u, v} \frac{(Xu)^\top Yv}{\sqrt{(Xu)^\top Xu \,(Yv)^\top Yv}} \\
&= \max_{u, v} \frac{(X W_x W_x^{-1} u)^\top Y W_y W_y^{-1} v}{\sqrt{(X W_x W_x^{-1} u)^\top X W_x W_x^{-1} u \,(Y W_y W_y^{-1} v)^\top Y W_y W_y^{-1} v}} \\
&= \max_{u_w, v_w} \frac{(X_w u_w)^\top Y_w v_w}{\sqrt{(X_w u_w)^\top X_w u_w \,(Y_w v_w)^\top Y_w v_w}} \\
&= \max_{u_w, v_w} \frac{u_w^\top X_w^\top Y_w v_w}{\sqrt{u_w^\top X_w^\top X_w u_w \, v_w^\top Y_w^\top Y_w v_w}} \\
&= \max_{u_w, v_w} \underbrace{\frac{u_w^\top X_w^\top Y_w v_w}{\sqrt{u_w^\top u_w \, v_w^\top v_w}}}_{\hat{\rho}(X_w u_w, Y_w v_w)}
\end{aligned}$$
Note we have used the fact that $X_w^\top X_w$ and $Y_w^\top Y_w$ are identity matrices by construction.

2. Second, let's choose matrices $D_x$, $D_y$ to decorrelate $X_w$ and $Y_w$. This will let us simplify the covariance matrix $(X_w D_x)^\top(Y_w D_y)$ into a diagonal matrix. To do this, we'll make use of the SVD:
$$X_w^\top Y_w = U S V^\top$$
The choice of $U$ for $D_x$ and $V$ for $D_y$ accomplishes our goal, since
$$(X_w U)^\top (Y_w V) = U^\top X_w^\top Y_w V = U^\top (U S V^\top) V = S$$
Let's denote the decorrelated data $X_d = X_w D_x$ and $Y_d = Y_w D_y$. Then by the change of variables $u_d = D_x^{-1} u_w = D_x^\top u_w$, $v_d = D_y^{-1} v_w = D_y^\top v_w$,
$$\begin{aligned}
\max_{u_w, v_w}\ \hat{\rho}(X_w u_w, Y_w v_w) &= \max_{u_w, v_w} \frac{(X_w u_w)^\top Y_w v_w}{\sqrt{u_w^\top u_w \, v_w^\top v_w}} \\
&= \max_{u_d, v_d} \frac{(X_w D_x u_d)^\top Y_w D_y v_d}{\sqrt{(D_x u_d)^\top D_x u_d \,(D_y v_d)^\top D_y v_d}} \\
&= \max_{u_d, v_d} \frac{(X_d u_d)^\top Y_d v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}} \\
&= \max_{u_d, v_d} \underbrace{\frac{u_d^\top X_d^\top Y_d v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}}}_{\hat{\rho}(X_d u_d, Y_d v_d)} = \max_{u_d, v_d} \frac{u_d^\top S v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}}
\end{aligned}$$
Without loss of generality, suppose $u_d$ and $v_d$ are unit vectors¹ so that the denominator becomes 1, and we can ignore it:
$$\max_{u_d, v_d} \frac{u_d^\top S v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}} = \max_{\|u_d\| = 1,\, \|v_d\| = 1} \frac{u_d^\top S v_d}{\|u_d\|\,\|v_d\|} = \max_{\|u_d\| = 1,\, \|v_d\| = 1} u_d^\top S v_d$$

¹ Why can we assume this? Observe that the value of the objective does not change if we replace $u_d$ by $\alpha u_d$ and $v_d$ by $\beta v_d$, where $\alpha$ and $\beta$ are any positive constants. Thus if there are maximizers $u_d, v_d$ which are not unit vectors, then $u_d/\|u_d\|$ and $v_d/\|v_d\|$ (which are unit vectors) are also maximizers.
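Putting the two transformations together, here is a short NumPy sketch, assuming zero-mean data matrices with full column rank, that builds the whitening matrices from the eigendecompositions of $X^\top X$ and $Y^\top Y$ and checks the two facts used above, namely $X_w^\top X_w = I$ and $U^\top X_w^\top Y_w V = S$. All variable names are illustrative.

```python
import numpy as np

def whitening_matrix(M):
    """W such that (M W).T @ (M W) = I, assuming M.T @ M is positive definite."""
    lam, U = np.linalg.eigh(M.T @ M)              # M.T M = U diag(lam) U.T
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

rng = np.random.default_rng(0)
n, p, q = 200, 4, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)   # zero-mean, as the derivation assumes
Y = rng.normal(size=(n, q))
Y = Y - Y.mean(axis=0)

Wx, Wy = whitening_matrix(X), whitening_matrix(Y)
Xw, Yw = X @ Wx, Y @ Wy                                      # step 1: whiten
U, s, Vt = np.linalg.svd(Xw.T @ Yw, full_matrices=False)     # step 2: Xw.T Yw = U diag(s) Vt

print(np.allclose(Xw.T @ Xw, np.eye(p)))                 # whitened data: identity covariance
print(np.allclose(U.T @ Xw.T @ Yw @ Vt.T, np.diag(s)))   # decorrelated data: diagonal S
```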

The diagonal nature of $S$ implies $S_{ij} = 0$ for $i \neq j$, so our simplified objective expands as
$$u_d^\top S v_d = \sum_i \sum_j (u_d)_i S_{ij} (v_d)_j = \sum_i S_{ii} (u_d)_i (v_d)_i$$
where $S_{ii}$, the singular values of $X_w^\top Y_w$, are arranged in descending order. Thus we have a weighted sum of these singular values, where the weights are given by the entries of $u_d$ and $v_d$, which are constrained to have unit norm. To maximize the sum, we put all our eggs in one basket and extract $S_{11}$ by setting the first components of $u_d$ and $v_d$ to 1, and the rest to 0:
$$u_d = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^p \qquad v_d = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^q$$
Any other arrangement would put weight on some $S_{ii}$ at the expense of taking that weight away from $S_{11}$, which is the largest, thus reducing the value of the sum.

Finally we have an analytical solution, but it is in a different coordinate system than our original problem! In particular, $u_d$ and $v_d$ are the best weights in a coordinate system where the data has been whitened and decorrelated. To bring it back to our original coordinate system and find the vectors we actually care about ($u$ and $v$), we must invert the changes of variables we made:
$$u = W_x u_w = W_x D_x u_d \qquad v = W_y v_w = W_y D_y v_d$$
More generally, to get the best $k$ directions, we choose
$$U_d = \begin{bmatrix} I_k \\ 0_{p-k,\,k} \end{bmatrix} \in \mathbb{R}^{p \times k} \qquad V_d = \begin{bmatrix} I_k \\ 0_{q-k,\,k} \end{bmatrix} \in \mathbb{R}^{q \times k}$$
where $I_k$ denotes the $k \times k$ identity matrix. Then
$$U = W_x D_x U_d \qquad V = W_y D_y V_d$$
Note that $U_d$ and $V_d$ have orthogonal columns. The columns of $U$ and $V$, which are the projection directions we seek, will in general not be orthogonal, but they will be linearly independent (since they come from the application of invertible matrices to the columns of $U_d$, $V_d$).

1.3 Comparison with PCA

An advantage of CCA over PCA is that it is invariant to scalings and affine transformations of $X$ and $Y$. Consider a simplified scenario in which two matrix-valued random variables $X, Y$ satisfy $Y = X + \epsilon$, where the noise $\epsilon$ has huge variance. What happens when we run PCA on $Y$? Since PCA maximizes variance, it will actually project $Y$ (largely) into the column space of $\epsilon$! However, we're interested in $Y$'s relationship to $X$, not its dependence on noise. How can we fix this? As it turns out, CCA solves this issue. Instead of maximizing the variance of $Y$, we maximize the correlation between $X$ and $Y$. In some sense, we want to maximize the predictive power of the information we have.
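Collecting the derivation of Section 1.2, a minimal `cca` routine might look like the sketch below; it assumes zero-mean data matrices with full column rank and $k \le \min(p, q)$, and the interface is illustrative rather than a reference implementation.

```python
import numpy as np

def cca(X, Y, k):
    """Top-k CCA directions via the whiten-then-SVD derivation above.

    X: (n, p) zero-mean data matrix; Y: (n, q) zero-mean data matrix; k <= min(p, q).
    Returns U (p, k) and V (q, k), whose columns are the projection directions,
    along with the corresponding singular values.
    """
    def whitening_matrix(M):
        lam, E = np.linalg.eigh(M.T @ M)
        return E @ np.diag(1.0 / np.sqrt(lam)) @ E.T

    Wx, Wy = whitening_matrix(X), whitening_matrix(Y)           # step 1: whitening matrices
    Xw, Yw = X @ Wx, Y @ Wy
    Dx, s, DyT = np.linalg.svd(Xw.T @ Yw, full_matrices=False)  # step 2: Xw.T Yw = Dx diag(s) Dy.T
    # Invert the changes of variables: U = Wx Dx U_d, V = Wy Dy V_d with U_d, V_d = [I_k; 0].
    U = Wx @ Dx[:, :k]
    V = Wy @ DyT.T[:, :k]
    return U, V, s[:k]
```

By the derivation above, the entries of `s[:k]` are the values of $\hat{\rho}$ attained by the successive column pairs of `U` and `V`.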

1.4 CCA regression

Once we've computed the CCA coefficients, one application is to use them for regression tasks: predicting $Y$ from $X$ (or vice-versa). Recall that the correlation coefficient attains a greater value when the two sets of data are more linearly correlated. Thus, it makes sense to find the $k \times k$ weight matrix $A$ that linearly relates $XU$ and $YV$. We can accomplish this with ordinary least squares.

Denote the projected data matrices by $X_c = XU$ and $Y_c = YV$. Observe that $X_c$ and $Y_c$ are zero-mean because they are linear transformations of $X$ and $Y$, which are zero-mean. Thus we can fit a linear model relating the two:
$$Y_c \approx X_c A$$
The least-squares solution is given by
$$A = (X_c^\top X_c)^{-1} X_c^\top Y_c = (U^\top X^\top X U)^{-1} U^\top X^\top Y V$$
However, since what we really want is an estimate of $Y$ given new (zero-mean) observations $X$ (or vice-versa), it's useful to have the entire series of transformations that relates the two. The predicted canonical variables are given by
$$\hat{Y}_c = X_c A = X U (U^\top X^\top X U)^{-1} U^\top X^\top Y V$$
Then we use the canonical variables to compute the actual values:
$$\hat{Y} = \hat{Y}_c (V^\top V)^{-1} V^\top = X U (U^\top X^\top X U)^{-1} (U^\top X^\top Y V)(V^\top V)^{-1} V^\top$$
We can collapse all these terms into a single matrix $A_{\mathrm{eq}}$ that gives the prediction $\hat{Y}$ from $X$:
$$A_{\mathrm{eq}} = \underbrace{U}_{\text{projection}}\ \underbrace{(U^\top X^\top X U)^{-1}}_{\text{whitening}}\ \underbrace{(U^\top X^\top Y V)}_{\text{decorrelation}}\ \underbrace{(V^\top V)^{-1} V^\top}_{\text{projection back}}$$
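A small sketch of this prediction pipeline, assuming zero-mean data and the projection matrices `U`, `V` returned by a CCA routine (names illustrative):

```python
import numpy as np

def cca_regression(X, Y, U, V):
    """Fit the k x k least-squares map A between the canonical variables and
    collapse the whole pipeline into A_eq, so that Y_hat = X @ A_eq."""
    Xc, Yc = X @ U, Y @ V                        # projected (canonical) variables
    A = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)    # A = (Xc.T Xc)^{-1} Xc.T Yc
    back = np.linalg.solve(V.T @ V, V.T)         # (V.T V)^{-1} V.T, maps Yc back to Y coordinates
    A_eq = U @ A @ back                          # equals U (U.T X.T X U)^{-1} (U.T X.T Y V) (V.T V)^{-1} V.T
    return A_eq

# Usage: Y_hat = X_new @ cca_regression(X, Y, U, V) for new zero-mean observations X_new.
```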