Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4 Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)
Kernel Regression
Basis Function Regression
Linear regression: f(x; w) = w^⊤ x
Basis function regression: f(x; w) = w^⊤ φ(x)
For N samples: y ≈ Φ w, where Φ := φ(X) is the design matrix with rows φ(x_n)^⊤
Polynomial regression: φ(x) = (1, x, x², …, x^M)^⊤
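A minimal numpy sketch of fitting basis function regression with the polynomial basis above; the helper names, the toy data, and the optional ridge term lam are illustrative, not from the slides:

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs x (shape (N,)) to polynomial features (1, x, ..., x^M)."""
    return np.vander(x, M + 1, increasing=True)    # shape (N, M+1)

def fit_basis_regression(x, t, M, lam=0.0):
    """Least-squares weights for t ~ Phi w, with optional ridge penalty lam."""
    Phi = poly_features(x, M)                      # design matrix, N x D
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# toy usage
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
w = fit_basis_regression(x, t, M=3)
t_hat = poly_features(x, 3) @ w                    # predictions f(x) = w^T phi(x)
```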
Basis Function Regression
[Figure: polynomial basis fit with M = 3; target t plotted against input x]
The Kernel Trick
Define a kernel function k(x, x′) := ⟨φ(x), φ(x′)⟩ such that k can be cheaper to evaluate than φ!
Kernel Ridge Regression

MAP / expected value for weights (requires inversion of a D×D matrix):
E[w | y] = A^{-1} Φ^⊤ y,   A := (Φ^⊤ Φ + λ I),   Φ := φ(X)

Alternate representation (requires inversion of an N×N matrix):
A^{-1} Φ^⊤ = Φ^⊤ (K + λ I)^{-1},   K := Φ Φ^⊤

Predictive posterior (using the kernel function):
E[f(x*) | y] = φ(x*)^⊤ E[w | y] = φ(x*)^⊤ Φ^⊤ (K + λ I)^{-1} y
             = Σ_{n,m} k(x*, x_n) [(K + λ I)^{-1}]_{nm} y_m
Kernel Ridge Regression

Closed-form solution:
f* = argmin_{f ∈ H}  Σ_{i=1}^{n} (y_i − ⟨f, φ(x_i)⟩_H)² + λ ‖f‖²_H

[Figure: fits on the same data for λ = 0.1, λ = 10, and λ = 10^{-7}, each with σ = 0.6]
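A minimal numpy sketch of the closed-form kernel ridge predictor E[f(x*) | y] = k(x*, X)(K + λI)^{-1} y, using a Gaussian kernel; the kernel choice, σ, λ, and the toy data are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.6):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit_predict(X, y, X_star, lam=0.1, sigma=0.6):
    """Predictive mean E[f(x*) | y] = k(x*, X) (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, sigma)                      # N x N Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return rbf_kernel(X_star, X, sigma) @ alpha      # one prediction per test point

# toy usage
X = np.random.rand(20, 1)
y = np.sin(4 * X[:, 0]) + 0.1 * np.random.randn(20)
X_star = np.linspace(0, 1, 5)[:, None]
print(kernel_ridge_fit_predict(X, y, X_star))
```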
Gaussian Processes (a.k.a. Kernel Ridge Regression with Variance Estimates)

p(y* | x*, x, y) = N( k(x, x*)^⊤ [K + σ²_noise I]^{-1} y ,  k(x*, x*) + σ²_noise − k(x, x*)^⊤ [K + σ²_noise I]^{-1} k(x, x*) )

[Figure: two panels of GP output f(x) vs. input x]

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
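A sketch of the predictive mean and variance formula above (the kernel, length scale, and noise level are illustrative choices):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * ell**2))

def gp_predict(X, y, X_star, ell=1.0, sigma_noise=0.1):
    """Posterior mean and variance of y* given (X, y), following the slide's formula."""
    K = rbf(X, X, ell)
    Ks = rbf(X, X_star, ell)              # k(x, x*) for each test point (N x M)
    Kss = rbf(X_star, X_star, ell)
    A = K + sigma_noise**2 * np.eye(len(y))
    mean = Ks.T @ np.linalg.solve(A, y)
    cov = Kss + sigma_noise**2 * np.eye(len(X_star)) - Ks.T @ np.linalg.solve(A, Ks)
    return mean, np.diag(cov)
```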
Choosing Kernel Hyperparameters

k(x, x′) = v² exp( −(x − x′)² / (2ℓ²) ) + σ²_noise δ_{x x′}

[Figure: the mean posterior predictive function, y vs. input x, plotted for 3 different length scales ℓ: too long, about right, too short]

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Intermezzo: Kernels Borrowing from: Arthur Gretton (Gatsby, UCL)
Hilbert Spaces

Definition (Inner product). Let H be a vector space over R. A function ⟨·, ·⟩_H : H × H → R is an inner product on H if
1. Linear: ⟨α₁ f₁ + α₂ f₂, g⟩_H = α₁ ⟨f₁, g⟩_H + α₂ ⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0, and ⟨f, f⟩_H = 0 if and only if f = 0.

Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H
Example: Fourier Bases Fourier modes define a vector space
Kernels

Definition. Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X,
k(x, x′) := ⟨φ(x), φ(x′)⟩_H.

Almost no conditions on X (e.g., X itself doesn't need an inner product, e.g. documents).

A single kernel can correspond to several possible feature maps. A trivial example for X := R:
φ₁(x) = x   and   φ₂(x) = (x/√2, x/√2)^⊤
Sums, Transformations, Products

Theorem (Sums of kernels are kernels). Given α > 0 and k, k₁, k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!)
A difference of kernels may not be a kernel (why?)

Theorem (Mappings between spaces). Let X and X̃ be sets, and define a map A : X → X̃. Define the kernel k on X̃. Then k(A(x), A(x′)) is a kernel on X.
Example: k(x, x′) = x² (x′)².

Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X.
Polynomial Kernels

Theorem (Polynomial kernels). Let x, x′ ∈ R^d for d ≥ 1, let m ≥ 1 be an integer and c > 0 a positive real. Then
k(x, x′) := ( ⟨x, x′⟩ + c )^m
is a valid kernel.

To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid kernels by the product rule.
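A small numerical check (illustrative, not from the slides) that the polynomial kernel with m = 2, d = 2 really is an inner product of explicit monomial features, which is exactly the expansion the proof relies on:

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, m=2):
    return (x @ xp + c) ** m

def poly_features_m2(x, c=1.0):
    """Explicit features for (<x, x'> + c)^2 in d = 2:
    (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2c) x1, sqrt(2c) x2, c)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
assert np.isclose(poly_kernel(x, xp), poly_features_m2(x) @ poly_features_m2(xp))
```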
Infinite Sequences

Definition. The space ℓ² (square-summable sequences) comprises all sequences a := (a_i)_{i≥1} for which
‖a‖²_{ℓ²} = Σ_{i=1}^∞ a_i² < ∞.

Definition. Given a sequence of functions (φ_i(x))_{i≥1} in ℓ², where φ_i : X → R is the i-th coordinate of φ(x), then
k(x, x′) := Σ_{i=1}^∞ φ_i(x) φ_i(x′)    (1)
is a kernel.
Infinite Sequences

Why square summable? By Cauchy-Schwarz,
Σ_{i=1}^∞ φ_i(x) φ_i(x′) ≤ ‖φ(x)‖_{ℓ²} ‖φ(x′)‖_{ℓ²},
so the sum defining the inner product converges for all x, x′ ∈ X.
Taylor Series Kernels

Definition (Taylor series kernel). For r ∈ (0, ∞], with a_n ≥ 0 for all n ≥ 0,
f(z) = Σ_{n=0}^∞ a_n z^n,   |z| < r, z ∈ R.
Define X to be the √r-ball in R^d, so ‖x‖ < √r. Then
k(x, x′) = f(⟨x, x′⟩) = Σ_{n=0}^∞ a_n ⟨x, x′⟩^n
is a kernel.

Example (Exponential kernel): k(x, x′) := exp(⟨x, x′⟩).
Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel)

Example (Gaussian kernel). The Gaussian kernel on R^d is defined as
k(x, x′) := exp( −γ ‖x − x′‖² ).

Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.

Variants: Squared Exponential (SE), Automatic Relevance Determination (ARD)
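A minimal sketch of the squared exponential kernel and its ARD variant, which uses one length scale per input dimension; the parameterization shown is a common convention, not taken verbatim from the slides:

```python
import numpy as np

def se_kernel(x, xp, ell=1.0):
    """Squared exponential: k(x, x') = exp(-||x - x'||^2 / (2 ell^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * ell**2))

def ard_kernel(x, xp, ells):
    """ARD variant: a separate length scale ell_j per dimension, so an
    irrelevant dimension can be 'switched off' by a large ell_j."""
    return np.exp(-0.5 * np.sum(((x - xp) / ells) ** 2))
```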
Products of Kernels

Squared-exp (SE): k(x, x′) = σ_f² exp( −(x − x′)² / (2ℓ²) )
Periodic (Per):   k(x, x′) = σ_f² exp( −(2/ℓ²) sin²( π (x − x′) / p ) )
Linear (Lin):     k(x, x′) = σ_f² (x − c)(x′ − c)

[Figure: plots of the base kernels and of the products Lin × Lin, SE × Per, Lin × SE, Lin × Per, each as a function of x (with x′ = 1)]

source: David Duvenaud (PhD Thesis)
Positive Definiteness

Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if for all n ≥ 1, all (a₁, …, a_n) ∈ R^n and all (x₁, …, x_n) ∈ X^n,
Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0.

The function k(·, ·) is strictly positive definite if, for mutually distinct x_i, equality holds only when all the a_i are zero.
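A quick numerical sanity check of this condition (a sketch, not from the slides): build the Gram matrix of a kernel on a few random points and confirm its eigenvalues are non-negative up to floating-point error. The RBF kernel and point set are illustrative choices.

```python
import numpy as np

def gram(k, xs):
    """Gram matrix K[i, j] = k(x_i, x_j) for a list of points xs."""
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
xs = [np.random.randn(3) for _ in range(8)]
K = gram(rbf, xs)
eigvals = np.linalg.eigvalsh(K)            # symmetric matrix, so use eigvalsh
print("min eigenvalue:", eigvals.min())    # should be >= 0 up to ~1e-10 error
```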
Mercer's Theorem

Theorem. Let H be a Hilbert space, X a non-empty set and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.

Proof.
Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) = Σ_{i=1}^n Σ_{j=1}^n ⟨a_i φ(x_i), a_j φ(x_j)⟩_H = ‖ Σ_{i=1}^n a_i φ(x_i) ‖²_H ≥ 0.

The reverse also holds: a positive definite k(x, x′) is an inner product in a unique H (Moore-Aronszajn: coming later!).
DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford)
Linear Dimensionality Reduction

Idea: project a high-dimensional vector onto a lower-dimensional space
x ∈ R^{361}  →  z ∈ R^{10},   z = U^⊤ x
Problem Setup

Given n data points in d dimensions: x₁, …, x_n ∈ R^d
X = ( x₁ ⋯ x_n ) ∈ R^{d×n}   (transpose of the X used in regression!)

Want to reduce dimensionality from d to k:
- Choose k directions u₁, …, u_k:  U = ( u₁ ⋯ u_k ) ∈ R^{d×k}
- For each u_j, compute the similarity z_j = u_j^⊤ x
- Project x down to z = (z₁, …, z_k)^⊤ = U^⊤ x

How to choose U?
Principal Component Analysis

x ∈ R^{361}  →  z ∈ R^{10},   z = U^⊤ x

Optimize two equivalent objectives:
1. Minimize the reconstruction error
2. Maximize the projected variance
PCA Objective 1: Reconstruction Error

U serves two functions:
- Encode: z = U^⊤ x,  z_j = u_j^⊤ x
- Decode: x̃ = U z = Σ_{j=1}^k z_j u_j

Want the reconstruction error ‖x − x̃‖ to be small.

Objective: minimize the total squared reconstruction error
min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − U U^⊤ x_i‖²
PCA Objective 2: Projected Variance

Empirical distribution: uniform over x₁, …, x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)²

Assume the data is centered: Ê[x] = 0 (what's Ê[U^⊤ x]?)

Objective: maximize the variance of the projected data
max_{U ∈ R^{d×k}, U^⊤ U = I} Ê[‖U^⊤ x‖²]
Equivalence of two objectives

Key intuition:
variance of data (fixed) = captured variance (want large) + reconstruction error (want small)

Pythagorean decomposition: x = U U^⊤ x + (I − U U^⊤) x, with orthogonal components of lengths ‖U U^⊤ x‖ and ‖(I − U U^⊤) x‖.

Take expectations; note that the rotation U doesn't affect length:
Ê[‖x‖²] = Ê[‖U^⊤ x‖²] + Ê[‖x − U U^⊤ x‖²]

Minimize reconstruction error ⟺ Maximize captured variance
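A quick numerical check of this identity (a sketch; the random data, dimensions, and the orthonormal U are all illustrative):

```python
import numpy as np

d, n, k = 5, 100, 2
X = np.random.randn(d, n)
X -= X.mean(axis=1, keepdims=True)                 # center the data
U, _ = np.linalg.qr(np.random.randn(d, k))         # any orthonormal d x k basis

total     = np.mean(np.sum(X**2, axis=0))          # E[||x||^2]
captured  = np.mean(np.sum((U.T @ X)**2, axis=0))  # E[||U^T x||^2]
recon_err = np.mean(np.sum((X - U @ U.T @ X)**2, axis=0))
assert np.isclose(total, captured + recon_err)
```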
Changes of Basis

Data: X = ( x₁ ⋯ x_n ) ∈ R^{d×n}
Orthonormal basis: U = ( u₁ ⋯ u_d ) ∈ R^{d×d}

Change of basis: z = (z₁, …, z_d)^⊤ = U^⊤ x,  z_j = u_j^⊤ x
Inverse change of basis: x = U z
Principal Component Analysis

Data: X = ( x₁ ⋯ x_n ) ∈ R^{d×n}
Orthonormal basis: U = ( u₁ ⋯ u_d ) ∈ R^{d×d}, the eigenvectors of the covariance C = (1/n) X X^⊤

Eigendecomposition: C = U Λ U^⊤, with Λ = diag(λ₁, λ₂, …, λ_d)

Claim: eigenvectors of a symmetric matrix are orthogonal.
Principal Component Analysis

[Figure: 2D point cloud with its principal axes (from Stack Exchange)]
Principal Component Analysis

Idea: take the top-k eigenvectors (those with the largest eigenvalues) to maximize the captured variance.

Truncated basis: U = ( u₁ ⋯ u_k ) ∈ R^{d×k}
Truncated decomposition: C ≈ U Λ^{(k)} U^⊤, with Λ^{(k)} = diag(λ₁, …, λ_k)
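A minimal numpy sketch of this recipe (center, form the covariance, eigendecompose, keep the top-k eigenvectors); the function name and toy data are illustrative:

```python
import numpy as np

def pca_eig(X, k):
    """X is d x n with data points as columns (as in the slides).
    Returns the top-k eigenvectors U (d x k) and projections Z = U^T X (k x n)."""
    Xc = X - X.mean(axis=1, keepdims=True)         # center the data
    C = (Xc @ Xc.T) / Xc.shape[1]                  # d x d covariance
    evals, evecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]              # indices of the top-k eigenvalues
    U = evecs[:, idx]
    return U, U.T @ Xc

# toy usage
X = np.random.randn(4, 200)
U, Z = pca_eig(X, k=2)
```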
Principal Component Analysis

[Figure: projections of the data onto the top 2 components vs. the bottom 2 components]

Data: three varieties of wheat (Kama, Rosa, Canadian)
Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove
PCA: Complexity

Using the eigenvalue decomposition:
- Computation of the covariance C: O(n d²)
- Eigenvalue decomposition: O(d³)
- Total complexity: O(n d² + d³)
PCA: Complexity

Using the singular value decomposition:
- Full decomposition: O(min{n d², n² d})
- Rank-k decomposition: O(k d n log(n)) (with the power method)
Singular Value Decomposition

Idea: Decompose a d × d matrix M into
1. A change of basis V (unitary matrix)
2. A scaling Σ (diagonal matrix)
3. A change of basis U (unitary matrix)
Singular Value Decomposition

Idea: Decompose the d × n matrix X into
1. An n × n basis V (unitary matrix)
2. A d × n matrix Σ (diagonal projection)
3. A d × d basis U (unitary matrix)

X = U Σ V^⊤,  with U ∈ R^{d×d}, Σ ∈ R^{d×n}, V ∈ R^{n×n} and U^⊤ U = I, V^⊤ V = I
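The same principal directions can be read off the SVD of the centered data matrix, since the left singular vectors of X are the eigenvectors of X X^⊤ and the eigenvalues satisfy λ_i = s_i²/n. A sketch, with illustrative names:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered d x n data matrix: X = U S V^T.
    The columns of U are the principal directions; eigenvalues are s_i^2 / n."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k], U[:, :k].T @ Xc, s[:k] ** 2 / X.shape[1]

X = np.random.randn(4, 200)
U_k, Z, evals = pca_svd(X, k=2)
```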
Eigen-faces [Turk & Pentland 1991]

d = number of pixels
Each x_i ∈ R^d is a face image; x_{ji} = intensity of the j-th pixel in image i

X (d × n)  ≈  U (d × k) Z (k × n),   Z = ( z₁ ⋯ z_n )

Idea: z_i is a more meaningful representation of the i-th face than x_i
- Can use z_i for nearest-neighbor classification
- Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k
- Why no time savings for a linear classifier?
Aside: How many components?

The magnitudes of the eigenvalues indicate the fraction of variance captured.

[Figure: eigenvalue spectrum λ_i vs. i on a face image dataset, dropping from 1353.2 through 1086.7, 820.1, 553.6, 287.1, …]

Eigenvalues typically drop off sharply, so you don't need that many components.
Of course, variance isn't everything...
Latent Semantic Analysis [Deerwester 1990]

d = number of words in the vocabulary
Each x_i ∈ R^d is a vector of word counts; x_{ji} = frequency of word j in document i
(e.g. stocks: 2, chairman: 4, the: 8, …, wins: 0, game: 1)

X (d × n)  ≈  U (d × k) Z (k × n),   Z = ( z₁ ⋯ z_n )

How to measure similarity between two documents? z₁^⊤ z₂ is probably better than x₁^⊤ x₂
Applications: information retrieval
Note: no computational savings; the original x is already sparse
PCA Summary

Intuition: capture the variance of the data, or equivalently minimize the reconstruction error
Algorithm: find the eigendecomposition of the covariance matrix, or the SVD
Impact: reduce storage (from O(nd) to O(nk)), reduce time complexity
Advantages: simple, fast
Applications: eigen-faces, eigen-documents, network anomaly detection, etc.
Probabilistic Interpretation

Generative model [Tipping and Bishop, 1999]: for each data point i = 1, …, n:
- Draw the latent vector: z_i ~ N(0, I_{k×k})
- Create the data point: x_i ~ N(U z_i, σ² I_{d×d})

PCA finds the U that maximizes the likelihood of the data: max_U p(X | U)

Advantages:
- Handles missing data (important for collaborative filtering)
- Extension to factor analysis: allows non-isotropic noise (replace σ² I_{d×d} with an arbitrary diagonal matrix)
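A short sketch of sampling from this generative model (the particular U, σ, and sizes are illustrative):

```python
import numpy as np

d, k, n = 5, 2, 100
sigma = 0.1
U = np.linalg.qr(np.random.randn(d, k))[0]        # illustrative d x k loading matrix

Z = np.random.randn(k, n)                         # z_i ~ N(0, I_k)
X = U @ Z + sigma * np.random.randn(d, n)         # x_i ~ N(U z_i, sigma^2 I_d)
```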
Limitations of Linearity

[Figure: a dataset where PCA is effective vs. one where PCA is ineffective]

The problem is that the PCA subspace is linear:
S = { x = U z : z ∈ R^k }
In this example: S = { (x₁, x₂) : x₂ = (u₂/u₁) x₁ }
Nonlinear PCA

[Figure: broken (linear) solution vs. desired (curved) solution]

We want the desired solution: S = { (x₁, x₂) : x₂ = (u₂/u₁) x₁² }
We can get this with: S = { φ(x) = U z } with φ(x) = (x₁², x₂)^⊤
(other nonlinear feature maps, e.g. involving sin(·), work the same way)

Linear dimensionality reduction in φ(x) space = nonlinear dimensionality reduction in x space

Idea: use kernels!
Kernel PCA

Representer theorem: the solution of X X^⊤ u = λ u can be written as u = X α = Σ_{i=1}^n α_i x_i

Kernel function: k(x₁, x₂) such that K, the kernel matrix formed by K_{ij} = k(x_i, x_j), is positive semi-definite

max_{‖u‖=1} u^⊤ X X^⊤ u  =  max_{α^⊤ X^⊤ X α = 1} α^⊤ (X^⊤ X)(X^⊤ X) α  =  max_{α^⊤ K α = 1} α^⊤ K² α
Kernel PCA

Direct method: the kernel PCA objective
max_{α^⊤ K α = 1} α^⊤ K² α
⟹ kernel PCA eigenvalue problem: X^⊤ X α = λ′ α, i.e. K α = λ′ α

Modular method (if you don't want to think about kernels):
- Find vectors x′₁, …, x′_n such that x′_i^⊤ x′_j = K_{ij} = φ(x_i)^⊤ φ(x_j)
- Key: use any vectors that preserve inner products
- One possibility is the Cholesky decomposition K = X′^⊤ X′
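A hedged numpy sketch of the direct method. The feature-space centering of K is the usual kernel PCA recipe rather than something spelled out on the slide, and all names are illustrative:

```python
import numpy as np

def kernel_pca(K, k):
    """K: n x n kernel matrix, k: number of components.
    Returns alphas (n x k), scaled so the feature-space directions have unit norm,
    and the projections Z of the training points onto those directions."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering in feature space
    Kc = H @ K @ H
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:k]              # top-k eigenpairs of Kc
    evals, alphas = evals[idx], evecs[:, idx]
    alphas = alphas / np.sqrt(np.maximum(evals, 1e-12))
    Z = Kc @ alphas                                # projections of training points (n x k)
    return alphas, Z

# toy usage with an RBF kernel
X = np.random.randn(50, 2)
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-sq)
alphas, Z = kernel_pca(K, k=2)
```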
Kernel PCA
Canonical Correlation Analysis (CCA)
Motivation for CCA [Hotelling 1936]

Often, each data point consists of two views:
- Image retrieval: for each image we have x: pixels (or other visual features), and y: text around the image
- Time series: x: signal at time t, y: signal at time t + 1
- Two-view learning: divide the features into two sets; x: features of a word/object, etc., y: features of the context in which it appears

Goal: reduce the dimensionality of the two views jointly
CCA Example

Setup:
- Input data: (x₁, y₁), …, (x_n, y_n) (matrices X, Y)
- Goal: find a pair of projections (u, v)

Dimensionality reduction solutions: independent vs. joint
[Figure: x and y are paired by brightness]
CCA Definition

Definitions:
- Variance: var̂(u^⊤ x) = u^⊤ X X^⊤ u
- Covariance: cov̂(u^⊤ x, v^⊤ y) = u^⊤ X Y^⊤ v
- Correlation: cov̂(u^⊤ x, v^⊤ y) / ( √var̂(u^⊤ x) √var̂(v^⊤ y) )

Objective: maximize the correlation between the projected views
max_{u,v} corr̂(u^⊤ x, v^⊤ y)

Properties:
- Focuses on how the variables are related, not on how much they vary
- Invariant to any rotation and scaling of the data
From PCA to CCA

PCA on the views separately: no covariance term
max_{u,v}  u^⊤ X X^⊤ u / (u^⊤ u)  +  v^⊤ Y Y^⊤ v / (v^⊤ v)

PCA on the concatenation (X^⊤, Y^⊤)^⊤: includes the covariance term
max_{u,v}  ( u^⊤ X X^⊤ u + 2 u^⊤ X Y^⊤ v + v^⊤ Y Y^⊤ v ) / ( u^⊤ u + v^⊤ v )

Maximum covariance: drop the variance terms
max_{u,v}  u^⊤ X Y^⊤ v / ( √(u^⊤ u) √(v^⊤ v) )

Maximum correlation (CCA): divide out the variance terms
max_{u,v}  u^⊤ X Y^⊤ v / ( √(u^⊤ X X^⊤ u) √(v^⊤ Y Y^⊤ v) )
Importance of Regularization

Extreme examples of degeneracy:
- If x = A y, then any (u, v) with u = A v is optimal (correlation 1)
- If x and y are independent, then any (u, v) is optimal (correlation 0)

Problem: if X or Y has rank n, then any (u, v) with u = (X^⊤)^† Y^⊤ v is optimal (correlation 1) ⟹ CCA is meaningless!

Solution: regularization (interpolate between maximum covariance and maximum correlation)
max_{u,v}  u^⊤ X Y^⊤ v / ( √(u^⊤ (X X^⊤ + λ I) u) √(v^⊤ (Y Y^⊤ + λ I) v) )
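A hedged numpy sketch of regularized CCA for the first pair of directions. Solving via whitening plus an SVD is one standard approach, not something prescribed by the slides; lam plays the role of λ above, and all other names and the toy data are illustrative:

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

def cca_first_pair(X, Y, lam=1e-3):
    """X: dx x n, Y: dy x n, both centered. Returns (u, v) maximizing the
    regularized correlation u^T X Y^T v / sqrt(u^T(XX^T+lam I)u  v^T(YY^T+lam I)v)."""
    Cxx = X @ X.T + lam * np.eye(X.shape[0])
    Cyy = Y @ Y.T + lam * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)          # whitening transforms
    A, s, Bt = np.linalg.svd(Wx @ Cxy @ Wy)        # top singular pair = max correlation
    return Wx @ A[:, 0], Wy @ Bt[0, :], s[0]

# toy usage: two views sharing a 2-dimensional latent signal
Z = np.random.randn(2, 200)
X = np.vstack([Z, np.random.randn(1, 200)]);              X -= X.mean(1, keepdims=True)
Y = np.vstack([Z + 0.1 * np.random.randn(2, 200),
               np.random.randn(1, 200)]);                 Y -= Y.mean(1, keepdims=True)
u, v, corr = cca_first_pair(X, Y)
```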