Machine Learning for Data Science (CS4786) Lecture 4: Canonical Correlation Analysis (CCA). Course webpage: http://www.cs.cornell.edu/courses/cs4786/2016fa/
Announcements: We are grading HW0 and you will be added to CMS by Monday. HW1 will be posted tonight on the webpage (homework tab). HW1 is on CCA and PCA (due in a week).
QUIZ
Let X be the n × d matrix whose rows are x_1^T, ..., x_n^T. Assume the points are centered. Which of the following are equal to the covariance matrix?
A. Σ = (1/n) sum_{t=1}^n x_t^T x_t
B. Σ = (1/n) sum_{t=1}^n x_t x_t^T
C. Σ = (1/n) X X^T
D. Σ = (1/n) X^T X
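The four options can be checked numerically (a sketch, assuming the rows of X are the centered points x_t^T; the sizes n = 100, d = 5 are made up for the check — which options match is the quiz answer):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                        # made-up sizes for the check
X = rng.standard_normal((n, d))
X = X - X.mean(axis=0)               # assume the points are centered

# Reference covariance matrix (bias=True divides by n, not n-1)
Sigma = np.cov(X, rowvar=False, bias=True)

A = sum(x @ x for x in X) / n        # x_t^T x_t is a scalar
B = sum(np.outer(x, x) for x in X) / n   # x_t x_t^T is d x d
C = (X @ X.T) / n                    # n x n
D = (X.T @ X) / n                    # d x d

for name, M in [("A", A), ("B", B), ("C", C), ("D", D)]:
    M = np.atleast_2d(M)
    print(name, "matches:", M.shape == Sigma.shape and np.allclose(M, Sigma))
```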
Example: Students in a classroom (figure: data plotted on x, y, z axes)
Maximize Spread / Minimize Reconstruction Error
PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components; the top K principal components are the eigenvectors with the K largest eigenvalues. Independently discovered by Pearson in 1901 and by Hotelling in 1933.
1. Σ = cov(X)
2. W = eigs(Σ, K)
3. Y = (X − μ) W   (Projection = centered data × top K eigenvectors)
RECONSTRUCTION
4. X̂ = Y W^T + μ   (Reconstruction = projection × transpose of top K eigenvectors, plus the mean)
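The PCA steps above (covariance, top-K eigenvectors, projection, reconstruction) can be sketched in NumPy (a sketch; the variable names are mine, and eigs is realized with `numpy.linalg.eigh` since the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, K):
    """PCA as on the slides: Sigma = cov(X), W = eigs(Sigma, K),
    Y = (X - mu) W, reconstruction Xhat = Y W^T + mu."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X - mu, rowvar=False, bias=True)    # 1. Sigma = cov(X)
    evals, evecs = np.linalg.eigh(Sigma)               # ascending eigenvalues
    W = evecs[:, np.argsort(evals)[::-1][:K]]          # 2. top-K eigenvectors
    Y = (X - mu) @ W                                   # 3. projection
    Xhat = Y @ W.T + mu                                # 4. reconstruction
    return W, Y, Xhat

# Synthetic data that truly lies in a 2-D linear subspace of R^10
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 10))
W, Y, Xhat = pca(X, K=2)
print(np.abs(X - Xhat).max())   # near zero: the data lies in a 2-D subspace
```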
WHEN d ≫ n
If d ≫ n then Σ is a large d × d matrix, but we only need its top K eigenvectors. Idea: use the SVD.
X − μ = U D V^T, with V^T V = I and U^T U = I.
Then note that
n Σ = (X − μ)^T (X − μ) = V D^2 V^T.
Hence the matrix V is the same as the matrix W obtained from the eigendecomposition of Σ, and the eigenvalues of Σ are the diagonal elements of D^2 (up to the factor 1/n).
Alternative algorithm: [U, D, V] = SVD(X − μ, K), W = V.
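The SVD shortcut can be checked numerically (a sketch; I compare the projectors onto the two K-dimensional subspaces rather than the vectors themselves, since eigenvectors are only defined up to sign):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K = 20, 500, 3                 # the d >> n regime
X = rng.standard_normal((n, d))
Xc = X - X.mean(axis=0)

# Direct route: eigenvectors of the d x d covariance matrix (costly for large d)
Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
W_eig = evecs[:, np.argsort(evals)[::-1][:K]]

# SVD route: X - mu = U D V^T, so n * Sigma = V D^2 V^T and W = top columns of V
U, Dvals, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:K].T

# Same subspace: the orthogonal projectors onto the two spans agree
print(np.allclose(W_eig @ W_eig.T, W_svd @ W_svd.T))
```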
WHEN TO USE PCA?
When data naturally lies in a low dimensional linear subspace.
To minimize reconstruction error.
To find directions where the data is maximally spread.
Canonical Correlation Analysis (figure: data on x, y, z axes, with an "Age + Gender" direction and an "Angle" direction)
TWO VIEW DIMENSIONALITY REDUCTION
Data comes in pairs (x_1, x′_1), ..., (x_n, x′_n), where the x_t's are d dimensional and the x′_t's are d′ dimensional.
Goal: Compress, say, view one into y_1, ..., y_n, which are K dimensional vectors.
Retain information redundant between the two views.
Eliminate noise specific to only one of the views.
EXAMPLE I: SPEECH RECOGNITION (audio + video)
Audio might have background sounds uncorrelated with the video.
Video might have lighting changes uncorrelated with the audio.
Redundant information between the two views: the speech.
EXAMPLE II: COMBINING FEATURE EXTRACTIONS
Method A and Method B are both equally good feature extraction techniques.
Concatenating the two features blindly yields a large dimensional feature vector with redundancy.
Applying techniques like CCA extracts the key information shared between the two methods and removes extra unwanted information.
How do we get the right direction? (say K = 1) (figure: "Age + Gender" vs. "Angle" directions)
WHICH DIRECTION TO PICK? (figure: View I and View II)
WHICH DIRECTION TO PICK? (figure: the PCA direction in each view)
WHICH DIRECTION TO PICK? (figure: a direction that has large covariance across the two views)
How do we pick the right direction to project to?
MAXIMIZING CORRELATION COEFFICIENT
Say w_1 and v_1 are the directions we choose to project onto in views 1 and 2 respectively. We want these directions to maximize

(1/n) sum_{t=1}^n (y_t[1] − ȳ[1]) (y′_t[1] − ȳ′[1])

subject to

(1/n) sum_{t=1}^n (y_t[1] − ȳ[1])^2 = (1/n) sum_{t=1}^n (y′_t[1] − ȳ′[1])^2 = 1,

where y_t[1] = w_1^T x_t, y′_t[1] = v_1^T x′_t, and ȳ[1] = (1/n) sum_{t=1}^n y_t[1], ȳ′[1] = (1/n) sum_{t=1}^n y′_t[1].
What is the problem with the above?
WHY NOT MAXIMIZE COVARIANCE
Say (1/n) sum_{t=1}^n x_t[2] x′_t[2] > 0 (the relevant information). Scaling up this coordinate, we can blow up the covariance.
MAXIMIZING CORRELATION COEFFICIENT
Say w_1 and v_1 are the directions we choose to project onto in views 1 and 2 respectively. We want these directions to maximize the correlation coefficient

[ (1/n) sum_{t=1}^n (y_t[1] − ȳ[1]) (y′_t[1] − ȳ′[1]) ] / [ sqrt( (1/n) sum_{t=1}^n (y_t[1] − ȳ[1])^2 ) · sqrt( (1/n) sum_{t=1}^n (y′_t[1] − ȳ′[1])^2 ) ]

where ȳ[1] = (1/n) sum_{t=1}^n y_t[1] and ȳ′[1] = (1/n) sum_{t=1}^n y′_t[1].
BASIC IDEA OF CCA
Normalize the variance in the chosen direction to be constant (say 1), then maximize the covariance. This is the same as maximizing the correlation coefficient (recall from last class).
COVARIANCE VS CORRELATION
Covariance(A, B) = E[(A − E[A]) (B − E[B])]
Depends on the scale of A and B: if B is rescaled, the covariance rescales with it.
Correlation(A, B) = E[(A − E[A]) (B − E[B])] / (sqrt(Var(A)) · sqrt(Var(B)))
Scale free.
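A tiny numerical check of the scale (in)sensitivity claimed above (a sketch; the rescaling factor 100 and the synthetic data are arbitrary):

```python
import numpy as np

def covariance(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

def correlation(a, b):
    return covariance(a, b) / (a.std() * b.std())

rng = np.random.default_rng(3)
a = rng.standard_normal(1000)
b = a + 0.5 * rng.standard_normal(1000)   # correlated with a

print(covariance(a, b), covariance(a, 100 * b))    # covariance scales by 100
print(correlation(a, b), correlation(a, 100 * b))  # correlation is unchanged
```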
MAXIMIZING CORRELATION COEFFICIENT
Say w_1 and v_1 are the directions we choose to project onto in views 1 and 2 respectively. We want these directions to maximize

(1/n) sum_{t=1}^n (y_t[1] − ȳ[1]) (y′_t[1] − ȳ′[1])

subject to

(1/n) sum_{t=1}^n (y_t[1] − ȳ[1])^2 = (1/n) sum_{t=1}^n (y′_t[1] − ȳ′[1])^2 = 1,

where y_t[1] = w_1^T x_t, y′_t[1] = v_1^T x′_t, and ȳ[1] = (1/n) sum_{t=1}^n y_t[1], ȳ′[1] = (1/n) sum_{t=1}^n y′_t[1].
CANONICAL CORRELATION ANALYSIS
Hence we want to solve for projection vectors w_1 and v_1 that maximize

(1/n) sum_{t=1}^n (w_1^T (x_t − μ)) (v_1^T (x′_t − μ′))

subject to

(1/n) sum_{t=1}^n (w_1^T (x_t − μ))^2 = (1/n) sum_{t=1}^n (v_1^T (x′_t − μ′))^2 = 1,

where μ = (1/n) sum_{t=1}^n x_t and μ′ = (1/n) sum_{t=1}^n x′_t.
CANONICAL CORRELATION ANALYSIS
Hence we want to solve for projection vectors w_1 and v_1 that maximize w_1^T Σ_{1,2} v_1 subject to w_1^T Σ_{1,1} w_1 = v_1^T Σ_{2,2} v_1 = 1.
Writing the Lagrangian, taking derivatives, and equating them to 0, we get

Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1} w_1 = λ^2 Σ_{1,1} w_1   and   Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2} v_1 = λ^2 Σ_{2,2} v_1,

or equivalently

Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1} w_1 = λ^2 w_1   and   Σ_{2,2}^{-1} Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2} v_1 = λ^2 v_1.
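The Lagrangian step on this slide can be filled in as follows (a sketch of the derivation; the multipliers λ_1, λ_2 are my notation and turn out to be equal):

```latex
\mathcal{L}(w_1, v_1, \lambda_1, \lambda_2)
  = w_1^\top \Sigma_{1,2}\, v_1
  - \tfrac{\lambda_1}{2}\bigl(w_1^\top \Sigma_{1,1}\, w_1 - 1\bigr)
  - \tfrac{\lambda_2}{2}\bigl(v_1^\top \Sigma_{2,2}\, v_1 - 1\bigr)
```

Setting the gradients with respect to w_1 and v_1 to zero gives Σ_{1,2} v_1 = λ_1 Σ_{1,1} w_1 and Σ_{2,1} w_1 = λ_2 Σ_{2,2} v_1. Multiplying the first by w_1^T and the second by v_1^T and using the constraints shows λ_1 = λ_2 = λ. Solving the second equation for v_1 = (1/λ) Σ_{2,2}^{-1} Σ_{2,1} w_1 and substituting into the first yields Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1} w_1 = λ^2 Σ_{1,1} w_1, the eigenvalue problem on this slide.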
SOLUTION
W_1 = eigs(Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1}, K)
W_2 = eigs(Σ_{2,2}^{-1} Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2}, K)
CCA ALGORITHM
1. Write x̃_t = (x_t, x′_t), the (d + d′) dimensional concatenated vectors, and let X̃ be the matrix with rows x̃_t^T. Calculate the covariance matrix of the joint data points:
   Σ = cov(X̃) = [ Σ_{1,1}  Σ_{1,2} ; Σ_{2,1}  Σ_{2,2} ]
2. Calculate Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1}. The top K eigenvectors of this matrix give us the projection matrix for view I: W_1 = eigs(Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1}, K)
3. Calculate Σ_{2,2}^{-1} Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2}. The top K eigenvectors of this matrix give us the projection matrix for view II: W_2 = eigs(Σ_{2,2}^{-1} Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2}, K)
4. Y = (X − μ) W_1
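The algorithm above can be sketched in NumPy (a sketch; variable names are mine, eigs is realized with `numpy.linalg.eig` since the products are not symmetric in general, and the small ridge term is an assumption I add for numerical stability, not part of the slides):

```python
import numpy as np

def cca(X1, X2, K, ridge=1e-8):
    """CCA as on the slides, for views X1 (n x d) and X2 (n x d')."""
    n, d = X1.shape
    Xc1 = X1 - X1.mean(axis=0)
    Xc2 = X2 - X2.mean(axis=0)

    # 1. Covariance of the concatenated vectors, read off in blocks
    Z = np.hstack([Xc1, Xc2])
    Sigma = (Z.T @ Z) / n
    S11, S12 = Sigma[:d, :d].copy(), Sigma[:d, d:]
    S21, S22 = Sigma[d:, :d], Sigma[d:, d:].copy()
    S11 += ridge * np.eye(d)                 # ridge: my addition for stability
    S22 += ridge * np.eye(S22.shape[0])

    def top_k_eigvecs(M, K):
        vals, vecs = np.linalg.eig(M)        # M is not symmetric in general
        order = np.argsort(-vals.real)[:K]
        return vecs[:, order].real

    # 2. W1 = eigs(S11^-1 S12 S22^-1 S21, K)
    W1 = top_k_eigvecs(np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21), K)
    # 3. W2 = eigs(S22^-1 S21 S11^-1 S12, K)
    W2 = top_k_eigvecs(np.linalg.solve(S22, S21) @ np.linalg.solve(S11, S12), K)
    # 4. Project view I
    return W1, W2, Xc1 @ W1

# Two noisy 2-D views sharing a 1-D signal s in their first coordinate
rng = np.random.default_rng(4)
s = rng.standard_normal(500)
X1 = np.column_stack([s, rng.standard_normal(500)]) + 0.1 * rng.standard_normal((500, 2))
X2 = np.column_stack([-s, rng.standard_normal(500)]) + 0.1 * rng.standard_normal((500, 2))
W1, W2, Y = cca(X1, X2, K=1)
r = np.corrcoef(Y[:, 0], (X2 - X2.mean(0)) @ W2[:, 0])[0, 1]
print(abs(r))   # close to 1: CCA recovers the shared signal, not the noise
```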