CS 189 Introduction to Machine Learning
Spring 2018   Note 11

1 Canonical Correlation Analysis

The Pearson Correlation Coefficient $\rho(X, Y)$ is a way to measure how linearly related (in other words, how well a linear model captures the relationship between) random variables $X$ and $Y$:
$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$
Here are some important facts about it:

- It is commutative: $\rho(X, Y) = \rho(Y, X)$.
- It always lies between $-1$ and $1$: $-1 \le \rho(X, Y) \le 1$.
- It is completely invariant to affine transformations: for any $a, b, c, d \in \mathbb{R}$ with $ac > 0$,
$$\rho(aX + b, cY + d) = \frac{\operatorname{Cov}(aX + b, cY + d)}{\sqrt{\operatorname{Var}(aX + b)\operatorname{Var}(cY + d)}} = \frac{\operatorname{Cov}(aX, cY)}{\sqrt{\operatorname{Var}(aX)\operatorname{Var}(cY)}} = \frac{ac\operatorname{Cov}(X, Y)}{\sqrt{a^2\operatorname{Var}(X)\, c^2\operatorname{Var}(Y)}} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \rho(X, Y)$$
(If $ac < 0$, the same computation shows that only the sign of $\rho$ flips.)

The correlation is defined in terms of random variables rather than observed data. Assume now that $x, y \in \mathbb{R}^n$ are vectors containing $n$ independent observations of $X$ and $Y$, respectively. Recall the law of large numbers, which states that for i.i.d. $X_i$ with mean $\mu$,
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty$$
We can use this law to justify a sample-based approximation to the mean:
$$\operatorname{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] \approx \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where the bar indicates the sample average, i.e. $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Then as a special case we have
$$\operatorname{Var}(X) = \operatorname{Cov}(X, X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \approx \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \operatorname{Var}(Y) = \operatorname{Cov}(Y, Y) = \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big] \approx \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$
Plugging these estimates into the definition for correlation and canceling the factor of $1/n$ leads us to the sample Pearson Correlation Coefficient $\hat{\rho}$:
$$\hat{\rho}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\tilde{x}^\top \tilde{y}}{\|\tilde{x}\|\,\|\tilde{y}\|} \quad \text{where } \tilde{x} = x - \bar{x},\ \tilde{y} = y - \bar{y}$$
Looking at 2-D scatterplots of data with various correlation coefficients, you should notice that:

- The magnitude of $\hat{\rho}$ increases as $X$ and $Y$ become more linearly correlated.
- The sign of $\hat{\rho}$ tells whether $X$ and $Y$ have a positive or negative relationship.
- The correlation coefficient is undefined if either $X$ or $Y$ has variance 0 (a horizontal or vertical line of points).
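To make the estimate concrete, here is a minimal NumPy sketch of $\hat{\rho}$; the function name and the synthetic data are illustrative and not part of the note.

```python
import numpy as np

def sample_pearson(x, y):
    """Sample Pearson correlation coefficient rho_hat(x, y) for paired 1-D data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_tilde = x - x.mean()   # centered observations of X
    y_tilde = y - y.mean()   # centered observations of Y
    # rho_hat = x_tilde . y_tilde / (||x_tilde|| * ||y_tilde||)
    return (x_tilde @ y_tilde) / (np.linalg.norm(x_tilde) * np.linalg.norm(y_tilde))

# Strongly (positively) linearly related data should give rho_hat close to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + 0.1 * rng.normal(size=500)
print(sample_pearson(x, y))      # close to 1
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in Pearson correlation agrees
```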

1.1 Correlation and Gaussians

Here's a neat fact: if $X$ and $Y$ are jointly Gaussian, i.e.
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}(0, \Sigma)$$
then we can define a distribution on normalized $X$ and $Y$ and have their relationship entirely captured by $\rho(X, Y)$. First write
$$\rho(X, Y) = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Then
$$\Sigma = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}$$
so
$$\begin{bmatrix} \sigma_x^{-1} X \\ \sigma_y^{-1} Y \end{bmatrix} \sim \mathcal{N}\!\left(0,\ \begin{bmatrix} \sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1} \end{bmatrix} \Sigma \begin{bmatrix} \sigma_x^{-1} & 0 \\ 0 & \sigma_y^{-1} \end{bmatrix}\right) = \mathcal{N}\!\left(0,\ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)$$

1.2 Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a method of modeling the relationship between two point sets by making use of the correlation coefficient. Formally, given zero-mean random vectors $X_{\mathrm{rv}} \in \mathbb{R}^p$ and $Y_{\mathrm{rv}} \in \mathbb{R}^q$, we want to find projection vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ that maximize the correlation between $X_{\mathrm{rv}}^\top u$ and $Y_{\mathrm{rv}}^\top v$:
$$\max_{u, v}\ \rho(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v) = \frac{\operatorname{Cov}(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v)}{\sqrt{\operatorname{Var}(X_{\mathrm{rv}}^\top u)\operatorname{Var}(Y_{\mathrm{rv}}^\top v)}}$$
Observe that
$$\operatorname{Cov}(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v) = \mathbb{E}\big[(X_{\mathrm{rv}}^\top u - \mathbb{E}[X_{\mathrm{rv}}^\top u])(Y_{\mathrm{rv}}^\top v - \mathbb{E}[Y_{\mathrm{rv}}^\top v])\big] = \mathbb{E}\big[u^\top (X_{\mathrm{rv}} - \mathbb{E}[X_{\mathrm{rv}}])(Y_{\mathrm{rv}} - \mathbb{E}[Y_{\mathrm{rv}}])^\top v\big] = u^\top \mathbb{E}\big[(X_{\mathrm{rv}} - \mathbb{E}[X_{\mathrm{rv}}])(Y_{\mathrm{rv}} - \mathbb{E}[Y_{\mathrm{rv}}])^\top\big] v = u^\top \operatorname{Cov}(X_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v$$
which also implies (since $\operatorname{Var}(Z) = \operatorname{Cov}(Z, Z)$ for any random variable $Z$) that
$$\operatorname{Var}(X_{\mathrm{rv}}^\top u) = u^\top \operatorname{Cov}(X_{\mathrm{rv}}, X_{\mathrm{rv}})\, u \qquad \operatorname{Var}(Y_{\mathrm{rv}}^\top v) = v^\top \operatorname{Cov}(Y_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v$$
so the correlation writes
$$\rho(X_{\mathrm{rv}}^\top u, Y_{\mathrm{rv}}^\top v) = \frac{u^\top \operatorname{Cov}(X_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v}{\sqrt{u^\top \operatorname{Cov}(X_{\mathrm{rv}}, X_{\mathrm{rv}})\, u \cdot v^\top \operatorname{Cov}(Y_{\mathrm{rv}}, Y_{\mathrm{rv}})\, v}}$$
Unfortunately, we do not have access to the true distributions of $X_{\mathrm{rv}}$ and $Y_{\mathrm{rv}}$, so we cannot compute these covariance matrices. However, we can estimate them from data. Assume now that we are given zero-mean data matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, where the rows of the matrix $X$ are i.i.d. samples $x_i \in \mathbb{R}^p$ from the random variable $X_{\mathrm{rv}}$, and correspondingly for $Y_{\mathrm{rv}}$. Then
$$\operatorname{Cov}(X_{\mathrm{rv}}, Y_{\mathrm{rv}}) = \mathbb{E}\big[(X_{\mathrm{rv}} - \mathbb{E}[X_{\mathrm{rv}}])(Y_{\mathrm{rv}} - \mathbb{E}[Y_{\mathrm{rv}}])^\top\big] = \mathbb{E}[X_{\mathrm{rv}} Y_{\mathrm{rv}}^\top] \approx \frac{1}{n}\sum_{i=1}^{n} x_i y_i^\top = \frac{1}{n} X^\top Y$$
where again the sample-based approximation is justified by the law of large numbers. Similarly,
$$\operatorname{Cov}(X_{\mathrm{rv}}, X_{\mathrm{rv}}) = \mathbb{E}[X_{\mathrm{rv}} X_{\mathrm{rv}}^\top] \approx \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = \frac{1}{n} X^\top X \qquad \operatorname{Cov}(Y_{\mathrm{rv}}, Y_{\mathrm{rv}}) = \mathbb{E}[Y_{\mathrm{rv}} Y_{\mathrm{rv}}^\top] \approx \frac{1}{n}\sum_{i=1}^{n} y_i y_i^\top = \frac{1}{n} Y^\top Y$$
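As a quick illustration of these plug-in estimates, here is a small sketch (all names are illustrative) that forms the three sample covariance matrices from zero-mean data matrices and evaluates $\hat{\rho}(Xu, Yv)$ for given directions $u$ and $v$:

```python
import numpy as np

def sample_covariances(X, Y):
    """Plug-in covariance estimates from zero-mean data matrices.

    X: (n, p) matrix whose rows are samples of X_rv.
    Y: (n, q) matrix whose rows are samples of Y_rv.
    """
    n = X.shape[0]
    Sxy = X.T @ Y / n   # estimate of Cov(X_rv, Y_rv), shape (p, q)
    Sxx = X.T @ X / n   # estimate of Cov(X_rv, X_rv), shape (p, p)
    Syy = Y.T @ Y / n   # estimate of Cov(Y_rv, Y_rv), shape (q, q)
    return Sxy, Sxx, Syy

def rho_hat(X, Y, u, v):
    """Sample correlation of the projections Xu and Yv (the CCA objective)."""
    Sxy, Sxx, Syy = sample_covariances(X, Y)
    return (u @ Sxy @ v) / np.sqrt((u @ Sxx @ u) * (v @ Syy @ v))
```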

Plugging these estimates in for the true covariance matrices, we arrive at the problem
$$\max_{u, v} \frac{u^\top \big(\tfrac{1}{n} X^\top Y\big) v}{\sqrt{u^\top \big(\tfrac{1}{n} X^\top X\big) u \cdot v^\top \big(\tfrac{1}{n} Y^\top Y\big) v}} = \max_{u, v} \underbrace{\frac{u^\top X^\top Y v}{\sqrt{u^\top X^\top X u \cdot v^\top Y^\top Y v}}}_{\hat{\rho}(Xu, Yv)}$$
Let's try to massage the maximization problem into a form that we can reason with more easily. Our strategy is to choose matrices to transform $X$ and $Y$ such that the maximization problem is equivalent but easier to understand; both transformations are illustrated in the short code sketch after step 2 below.

1. First, let's choose matrices $W_x$, $W_y$ to whiten $X$ and $Y$. This will make the (co)variance matrices $(XW_x)^\top(XW_x)$ and $(YW_y)^\top(YW_y)$ become identity matrices and simplify our expression. To do this, note that $X^\top X$ is positive definite (and hence symmetric), so we can employ the eigendecomposition
$$X^\top X = U_x S_x U_x^\top$$
Since $S_x = \operatorname{diag}(\lambda_1(X^\top X), \dots, \lambda_p(X^\top X))$, where all the eigenvalues are positive, we can define the square root of this matrix by taking the square root of every diagonal entry:
$$S_x^{1/2} = \operatorname{diag}\!\left(\sqrt{\lambda_1(X^\top X)}, \dots, \sqrt{\lambda_p(X^\top X)}\right)$$
Then, defining $W_x = U_x S_x^{-1/2} U_x^\top$, we have
$$(XW_x)^\top(XW_x) = W_x^\top X^\top X W_x = U_x S_x^{-1/2} U_x^\top\, U_x S_x U_x^\top\, U_x S_x^{-1/2} U_x^\top = U_x S_x^{-1/2} S_x S_x^{-1/2} U_x^\top = U_x U_x^\top = I$$
which shows that $W_x$ is a whitening matrix for $X$. The same process can be repeated to produce a whitening matrix $W_y = U_y S_y^{-1/2} U_y^\top$ for $Y$.

Let's denote the whitened data $X_w = XW_x$ and $Y_w = YW_y$. Then by the change of variables $u_w = W_x^{-1} u$, $v_w = W_y^{-1} v$,
$$\begin{aligned}
\max_{u, v}\ \hat{\rho}(Xu, Yv) &= \max_{u, v} \frac{(Xu)^\top Yv}{\sqrt{(Xu)^\top Xu \,(Yv)^\top Yv}} \\
&= \max_{u, v} \frac{(X W_x W_x^{-1} u)^\top Y W_y W_y^{-1} v}{\sqrt{(X W_x W_x^{-1} u)^\top X W_x W_x^{-1} u \,(Y W_y W_y^{-1} v)^\top Y W_y W_y^{-1} v}} \\
&= \max_{u_w, v_w} \frac{(X_w u_w)^\top Y_w v_w}{\sqrt{(X_w u_w)^\top X_w u_w \,(Y_w v_w)^\top Y_w v_w}} \\
&= \max_{u_w, v_w} \frac{u_w^\top X_w^\top Y_w v_w}{\sqrt{u_w^\top X_w^\top X_w u_w \, v_w^\top Y_w^\top Y_w v_w}} \\
&= \max_{u_w, v_w} \underbrace{\frac{u_w^\top X_w^\top Y_w v_w}{\sqrt{u_w^\top u_w \, v_w^\top v_w}}}_{\hat{\rho}(X_w u_w, Y_w v_w)}
\end{aligned}$$
Note we have used the fact that $X_w^\top X_w$ and $Y_w^\top Y_w$ are identity matrices by construction.

2. Second, let's choose matrices $D_x$, $D_y$ to decorrelate $X_w$ and $Y_w$. This will let us simplify the covariance matrix $(X_w D_x)^\top(Y_w D_y)$ into a diagonal matrix. To do this, we'll make use of the SVD:
$$X_w^\top Y_w = U S V^\top$$
The choice of $U$ for $D_x$ and $V$ for $D_y$ accomplishes our goal, since
$$(X_w U)^\top (Y_w V) = U^\top X_w^\top Y_w V = U^\top (U S V^\top) V = S$$
Let's denote the decorrelated data $X_d = X_w D_x$ and $Y_d = Y_w D_y$. Then by the change of variables $u_d = D_x^{-1} u_w = D_x^\top u_w$, $v_d = D_y^{-1} v_w = D_y^\top v_w$,
$$\begin{aligned}
\max_{u_w, v_w}\ \hat{\rho}(X_w u_w, Y_w v_w) &= \max_{u_w, v_w} \frac{(X_w u_w)^\top Y_w v_w}{\sqrt{u_w^\top u_w \, v_w^\top v_w}} \\
&= \max_{u_d, v_d} \frac{(X_w D_x u_d)^\top Y_w D_y v_d}{\sqrt{(D_x u_d)^\top D_x u_d \,(D_y v_d)^\top D_y v_d}} \\
&= \max_{u_d, v_d} \frac{(X_d u_d)^\top Y_d v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}} \\
&= \max_{u_d, v_d} \underbrace{\frac{u_d^\top X_d^\top Y_d v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}}}_{\hat{\rho}(X_d u_d, Y_d v_d)} = \max_{u_d, v_d} \frac{u_d^\top S v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}}
\end{aligned}$$
Without loss of generality, suppose $u_d$ and $v_d$ are unit vectors¹ so that the denominator becomes 1, and we can ignore it:
$$\max_{u_d, v_d} \frac{u_d^\top S v_d}{\sqrt{u_d^\top u_d \, v_d^\top v_d}} = \max_{\|u_d\| = 1,\, \|v_d\| = 1} \frac{u_d^\top S v_d}{\|u_d\|\,\|v_d\|} = \max_{\|u_d\| = 1,\, \|v_d\| = 1} u_d^\top S v_d$$

¹ Why can we assume this? Observe that the value of the objective does not change if we replace $u_d$ by $\alpha u_d$ and $v_d$ by $\beta v_d$, where $\alpha$ and $\beta$ are any positive constants. Thus if there are maximizers $u_d, v_d$ which are not unit vectors, then $u_d/\|u_d\|$ and $v_d/\|v_d\|$ (which are unit vectors) are also maximizers.
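Putting the two transformations together, here is a short NumPy sketch, assuming zero-mean data matrices with full column rank, that builds the whitening matrices from the eigendecompositions of $X^\top X$ and $Y^\top Y$ and checks the two facts used above, namely $X_w^\top X_w = I$ and $U^\top X_w^\top Y_w V = S$. All variable names are illustrative.

```python
import numpy as np

def whitening_matrix(M):
    """W such that (M W).T @ (M W) = I, assuming M.T @ M is positive definite."""
    lam, U = np.linalg.eigh(M.T @ M)              # M.T M = U diag(lam) U.T
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

rng = np.random.default_rng(0)
n, p, q = 200, 4, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)   # zero-mean, as the derivation assumes
Y = rng.normal(size=(n, q))
Y = Y - Y.mean(axis=0)

Wx, Wy = whitening_matrix(X), whitening_matrix(Y)
Xw, Yw = X @ Wx, Y @ Wy                                      # step 1: whiten
U, s, Vt = np.linalg.svd(Xw.T @ Yw, full_matrices=False)     # step 2: Xw.T Yw = U diag(s) Vt

print(np.allclose(Xw.T @ Xw, np.eye(p)))                 # whitened data: identity covariance
print(np.allclose(U.T @ Xw.T @ Yw @ Vt.T, np.diag(s)))   # decorrelated data: diagonal S
```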

The diagonal nature of $S$ implies $S_{ij} = 0$ for $i \neq j$, so our simplified objective expands as
$$u_d^\top S v_d = \sum_i \sum_j (u_d)_i S_{ij} (v_d)_j = \sum_i S_{ii} (u_d)_i (v_d)_i$$
where $S_{ii}$, the singular values of $X_w^\top Y_w$, are arranged in descending order. Thus we have a weighted sum of these singular values, where the weights are given by the entries of $u_d$ and $v_d$, which are constrained to have unit norm. To maximize the sum, we put all our eggs in one basket and extract $S_{11}$ by setting the first components of $u_d$ and $v_d$ to 1, and the rest to 0:
$$u_d = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^p \qquad v_d = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^q$$
Any other arrangement would put weight on some $S_{ii}$ at the expense of taking that weight away from $S_{11}$, which is the largest, thus reducing the value of the sum.

Finally we have an analytical solution, but it is in a different coordinate system than our original problem! In particular, $u_d$ and $v_d$ are the best weights in a coordinate system where the data has been whitened and decorrelated. To bring it back to our original coordinate system and find the vectors we actually care about ($u$ and $v$), we must invert the changes of variables we made:
$$u = W_x u_w = W_x D_x u_d \qquad v = W_y v_w = W_y D_y v_d$$
More generally, to get the best $k$ directions, we choose
$$U_d = \begin{bmatrix} I_k \\ 0_{p-k,\,k} \end{bmatrix} \in \mathbb{R}^{p \times k} \qquad V_d = \begin{bmatrix} I_k \\ 0_{q-k,\,k} \end{bmatrix} \in \mathbb{R}^{q \times k}$$
where $I_k$ denotes the $k \times k$ identity matrix. Then
$$U = W_x D_x U_d \qquad V = W_y D_y V_d$$
Note that $U_d$ and $V_d$ have orthogonal columns. The columns of $U$ and $V$, which are the projection directions we seek, will in general not be orthogonal, but they will be linearly independent (since they come from the application of invertible matrices to the columns of $U_d$, $V_d$).

1.3 Comparison with PCA

An advantage of CCA over PCA is that it is invariant to scalings and affine transformations of $X$ and $Y$. Consider a simplified scenario in which two matrix-valued random variables $X, Y$ satisfy $Y = X + \epsilon$, where the noise $\epsilon$ has huge variance. What happens when we run PCA on $Y$? Since PCA maximizes variance, it will actually project $Y$ (largely) into the column space of $\epsilon$! However, we're interested in $Y$'s relationship to $X$, not its dependence on noise. How can we fix this? As it turns out, CCA solves this issue. Instead of maximizing the variance of $Y$, we maximize the correlation between $X$ and $Y$. In some sense, we want to maximize the predictive power of the information we have.
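Collecting the derivation of Section 1.2, a minimal `cca` routine might look like the sketch below; it assumes zero-mean data matrices with full column rank and $k \le \min(p, q)$, and the interface is illustrative rather than a reference implementation.

```python
import numpy as np

def cca(X, Y, k):
    """Top-k CCA directions via the whiten-then-SVD derivation above.

    X: (n, p) zero-mean data matrix; Y: (n, q) zero-mean data matrix; k <= min(p, q).
    Returns U (p, k) and V (q, k), whose columns are the projection directions,
    along with the corresponding singular values.
    """
    def whitening_matrix(M):
        lam, E = np.linalg.eigh(M.T @ M)
        return E @ np.diag(1.0 / np.sqrt(lam)) @ E.T

    Wx, Wy = whitening_matrix(X), whitening_matrix(Y)           # step 1: whitening matrices
    Xw, Yw = X @ Wx, Y @ Wy
    Dx, s, DyT = np.linalg.svd(Xw.T @ Yw, full_matrices=False)  # step 2: Xw.T Yw = Dx diag(s) Dy.T
    # Invert the changes of variables: U = Wx Dx U_d, V = Wy Dy V_d with U_d, V_d = [I_k; 0].
    U = Wx @ Dx[:, :k]
    V = Wy @ DyT.T[:, :k]
    return U, V, s[:k]
```

By the derivation above, the entries of `s[:k]` are the values of $\hat{\rho}$ attained by the successive column pairs of `U` and `V`.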

1.4 CCA regression

Once we've computed the CCA coefficients, one application is to use them for regression tasks: predicting $Y$ from $X$ (or vice-versa). Recall that the correlation coefficient attains a greater value when the two sets of data are more linearly correlated. Thus, it makes sense to find the $k \times k$ weight matrix $A$ that linearly relates $XU$ and $YV$. We can accomplish this with ordinary least squares.

Denote the projected data matrices by $X_c = XU$ and $Y_c = YV$. Observe that $X_c$ and $Y_c$ are zero-mean because they are linear transformations of $X$ and $Y$, which are zero-mean. Thus we can fit a linear model relating the two:
$$Y_c \approx X_c A$$
The least-squares solution is given by
$$A = (X_c^\top X_c)^{-1} X_c^\top Y_c = (U^\top X^\top X U)^{-1} U^\top X^\top Y V$$
However, since what we really want is an estimate of $Y$ given new (zero-mean) observations $X$ (or vice-versa), it's useful to have the entire series of transformations that relates the two. The predicted canonical variables are given by
$$\hat{Y}_c = X_c A = X U (U^\top X^\top X U)^{-1} U^\top X^\top Y V$$
Then we use the canonical variables to compute the actual values:
$$\hat{Y} = \hat{Y}_c (V^\top V)^{-1} V^\top = X U (U^\top X^\top X U)^{-1} (U^\top X^\top Y V)(V^\top V)^{-1} V^\top$$
We can collapse all these terms into a single matrix $A_{\mathrm{eq}}$ that gives the prediction $\hat{Y}$ from $X$:
$$A_{\mathrm{eq}} = \underbrace{U}_{\text{projection}}\ \underbrace{(U^\top X^\top X U)^{-1}}_{\text{whitening}}\ \underbrace{(U^\top X^\top Y V)}_{\text{decorrelation}}\ \underbrace{(V^\top V)^{-1} V^\top}_{\text{projection back}}$$
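A small sketch of this prediction pipeline, assuming zero-mean data and the projection matrices `U`, `V` returned by a CCA routine (names illustrative):

```python
import numpy as np

def cca_regression(X, Y, U, V):
    """Fit the k x k least-squares map A between the canonical variables and
    collapse the whole pipeline into A_eq, so that Y_hat = X @ A_eq."""
    Xc, Yc = X @ U, Y @ V                        # projected (canonical) variables
    A = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)    # A = (Xc.T Xc)^{-1} Xc.T Yc
    back = np.linalg.solve(V.T @ V, V.T)         # (V.T V)^{-1} V.T, maps Yc back to Y coordinates
    A_eq = U @ A @ back                          # equals U (U.T X.T X U)^{-1} (U.T X.T Y V) (V.T V)^{-1} V.T
    return A_eq

# Usage: Y_hat = X_new @ cca_regression(X, Y, U, V) for new zero-mean observations X_new.
```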