Data Mining Techniques

Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang)

DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford)

Linear Dimensionality Reduction
Idea: project a high-dimensional vector onto a lower-dimensional space, e.g. $x \in \mathbb{R}^{361} \mapsto z \in \mathbb{R}^{10}$ via $z = U^\top x$.

Problem Setup
Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$, collected as the columns of $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$.
We want to reduce the dimensionality from $d$ to $k$. Choose $k$ directions $u_1, \ldots, u_k$, collected as $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$.
For each $u_j$, compute the similarity $z_j = u_j^\top x$, and project $x$ down to $z = (z_1, \ldots, z_k)^\top = U^\top x$. How do we choose $U$?

Principal Component Analysis
$x \in \mathbb{R}^{361} \mapsto z \in \mathbb{R}^{10}$, $z = U^\top x$. How do we choose $U$? Two objectives:
1. Minimize the reconstruction error
2. Maximize the projected variance

PCA Objective 1: Reconstruction Error
$U$ serves two functions:
Encode: $z = U^\top x$, with $z_j = u_j^\top x$.
Decode: $\tilde{x} = U z = \sum_{j=1}^k z_j u_j$.
We want the reconstruction error $\|x - \tilde{x}\|$ to be small.
Objective: minimize the total squared reconstruction error
$$\min_{U \in \mathbb{R}^{d \times k}} \sum_{i=1}^n \|x_i - U U^\top x_i\|^2$$
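A minimal numpy sketch of the encode/decode picture and the squared reconstruction error (the data matrix, dimensions, and the choice of $U$ from an SVD are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 100, 2
X = rng.normal(size=(d, n))             # columns are data points x_i (toy data)
X -= X.mean(axis=1, keepdims=True)      # center the data

# Any U with orthonormal columns can encode/decode; the top-k left singular
# vectors are the choice that minimizes the total squared reconstruction error.
U = np.linalg.svd(X, full_matrices=False)[0][:, :k]   # d x k

Z = U.T @ X                             # encode: z_i = U^T x_i
X_hat = U @ Z                           # decode: x_hat_i = U z_i
print(np.sum((X - X_hat) ** 2))         # total squared reconstruction error
```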

PCA Objective 2: Projected Variance
Empirical distribution: uniform over $x_1, \ldots, x_n$.
Expectation (think: average over the data points): $\hat{E}[f(x)] = \frac{1}{n} \sum_{i=1}^n f(x_i)$.
Variance (think: mean of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{E}[f(x)])^2 = \hat{E}[f(x)^2] = \frac{1}{n} \sum_{i=1}^n f(x_i)^2$.
Assume the data is centered: $\hat{E}[x] = 0$ (what is $\hat{E}[U^\top x]$?).
Objective: maximize the variance of the projected data
$$\max_{U \in \mathbb{R}^{d \times k},\; U^\top U = I} \hat{E}\big[\|U^\top x\|^2\big]$$

Equivalence of the Two Objectives
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small).
Pythagorean decomposition: $x = U U^\top x + (I - U U^\top) x$, giving a right triangle with sides $\|U U^\top x\|$ and $\|(I - U U^\top) x\|$ and hypotenuse $\|x\|$.
Take expectations; note that the rotation $U$ does not affect length:
$$\hat{E}[\|x\|^2] = \hat{E}[\|U^\top x\|^2] + \hat{E}[\|x - U U^\top x\|^2]$$
Minimizing the reconstruction error $\Leftrightarrow$ maximizing the captured variance.
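The Pythagorean identity above is easy to check numerically; a small sketch with toy data and a random orthonormal $U$ (all values here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 6, 200, 3
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

U = np.linalg.qr(rng.normal(size=(d, k)))[0]    # random d x k orthonormal basis

total_var = np.mean(np.sum(X**2, axis=0))                     # E[||x||^2]
captured  = np.mean(np.sum((U.T @ X)**2, axis=0))             # E[||U^T x||^2]
residual  = np.mean(np.sum((X - U @ (U.T @ X))**2, axis=0))   # E[||x - U U^T x||^2]
print(np.isclose(total_var, captured + residual))             # True
```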

Finding One Principal Component
Input data: $X = (x_1 \cdots x_n)$. Objective: maximize the variance of the projected data:
$$\max_{\|u\|=1} \hat{E}[(u^\top x)^2]
= \max_{\|u\|=1} \frac{1}{n} \sum_{i=1}^n (u^\top x_i)^2
= \max_{\|u\|=1} \frac{1}{n} \|u^\top X\|^2
= \max_{\|u\|=1} u^\top \Big(\frac{1}{n} X X^\top\Big) u
= \text{largest eigenvalue of } C \overset{\text{def}}{=} \frac{1}{n} X X^\top$$
($C$ is the covariance matrix of the data.)
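One common way to find just the top principal component, without a full eigendecomposition, is power iteration on $C = \frac{1}{n} X X^\top$; a sketch under that assumption (this is not necessarily how the lecture computes it):

```python
import numpy as np

def top_principal_component(X, n_iter=500, seed=2):
    """Power iteration for the leading eigenvector of C = (1/n) X X^T."""
    d, n = X.shape
    C = (X @ X.T) / n
    u = np.random.default_rng(seed).normal(size=d)
    for _ in range(n_iter):
        u = C @ u                      # multiply by C ...
        u /= np.linalg.norm(u)         # ... and renormalize
    return u, u @ C @ u                # direction and captured variance (eigenvalue)

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 300))
X -= X.mean(axis=1, keepdims=True)
u, lam = top_principal_component(X)
print(lam, np.linalg.eigvalsh((X @ X.T) / X.shape[1])[-1])    # the two should match
```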

How Many Components?
Similar to the question "How many clusters?". The magnitudes of the eigenvalues indicate the fraction of variance captured.
Eigenvalues $\lambda_i$ on a face image dataset (plotted against $i$): 1353.2, 1086.7, 820.1, 553.6, 287.1, ...
Eigenvalues typically drop off sharply, so we don't need that many components. Of course, variance isn't everything...
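A sketch of the usual recipe for turning the eigenvalue spectrum into a choice of $k$: keep the smallest number of components whose eigenvalues capture some target fraction of the variance (the 90% threshold here is an assumption, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 500))                               # toy data, columns are points
X -= X.mean(axis=1, keepdims=True)

eigvals = np.linalg.eigvalsh((X @ X.T) / X.shape[1])[::-1]   # eigenvalues, descending
frac = np.cumsum(eigvals) / np.sum(eigvals)                  # cumulative fraction of variance
k = int(np.searchsorted(frac, 0.90)) + 1                     # smallest k reaching 90%
print(k, frac[k - 1])
```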

Computing PCA
Method 1: eigendecomposition. $U$ holds the eigenvectors of the covariance matrix $C = \frac{1}{n} X X^\top$. Computing $C$ already takes $O(n d^2)$ time (very expensive).
Method 2: singular value decomposition (SVD). Find $X = U_{d \times d}\, \Sigma_{d \times n}\, V^\top_{n \times n}$ where $U^\top U = I_{d \times d}$, $V^\top V = I_{n \times n}$, and $\Sigma$ is diagonal. Computing the top $k$ singular vectors takes only $O(n d k)$.
Relationship between the eigendecomposition and the SVD: the left singular vectors are the principal components, since $C = \frac{1}{n} U \Sigma \Sigma^\top U^\top$.
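A sketch checking that the two methods agree on toy data: the top eigenvectors of $C$ and the top left singular vectors of $X$ span the same directions, and $\lambda_i = \sigma_i^2 / n$:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 5, 400, 2
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Method 1: eigendecomposition of the covariance matrix (forming C costs O(n d^2))
C = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(C)
U_eig = eigvecs[:, ::-1][:, :k]                  # top-k eigenvectors

# Method 2: SVD of the data matrix directly
U_svd, S, Vt = np.linalg.svd(X, full_matrices=False)
U_svd = U_svd[:, :k]

print(np.abs(np.sum(U_eig * U_svd, axis=0)))     # ~[1, 1]: same directions up to sign
print(eigvals[::-1][:k], (S[:k] ** 2) / n)       # lambda_i = sigma_i^2 / n
```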

Eigen-faces [Turk & Pentland 1991]
$d$ = number of pixels. Each $x_i \in \mathbb{R}^d$ is a face image, and $x_{ji}$ = intensity of the $j$-th pixel in image $i$.
Factorize $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, i.e. $(x_1 \cdots x_n) \approx U\,(z_1 \cdots z_n)$.
Idea: $z_i$ is a more meaningful representation of the $i$-th face than $x_i$. We can use $z_i$ for nearest-neighbor classification.
Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$. Why are there no time savings for a linear classifier?
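A sketch of the eigen-faces pipeline on stand-in data (the random "images", label set, and shapes are all made up for illustration): project every face to $k$ dimensions, then classify a new face by its nearest neighbor in $z$-space.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, k = 19 * 19, 200, 10                    # d pixels per "face", n training faces
X_train = rng.normal(size=(d, n))             # stand-in for face images (columns)
y_train = rng.integers(0, 5, size=n)          # made-up identity labels

mean_face = X_train.mean(axis=1, keepdims=True)
U = np.linalg.svd(X_train - mean_face, full_matrices=False)[0][:, :k]
Z_train = U.T @ (X_train - mean_face)         # k x n codes for the training faces

def classify(x_new):
    z = U.T @ (x_new - mean_face[:, 0])                   # encode the query face
    dists = np.sum((Z_train - z[:, None]) ** 2, axis=0)   # distances in z-space: O(nk)
    return y_train[np.argmin(dists)]

print(classify(X_train[:, 0]))                # recovers the label of training face 0
```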

Latent Semantic Analysis [Deerwester 1990]
$d$ = number of words in the vocabulary. Each $x_i \in \mathbb{R}^d$ is a vector of word counts, and $x_{ji}$ = frequency of word $j$ in document $i$ (e.g. counts of stocks, chairman, the, wins, game, ... in each document).
Factorize $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$; the columns of $U$ are weighted combinations of words, and $z_i$ is the latent representation of document $i$.
How do we measure the similarity between two documents? $z_1^\top z_2$ is probably better than $x_1^\top x_2$.
Applications: information retrieval. Note: no computational savings here; the original $x$ is already sparse.
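A sketch of the LSA similarity comparison on a tiny invented term-document matrix (the vocabulary and counts below are made up; in practice one would also weight counts, e.g. with tf-idf):

```python
import numpy as np

# Rows: words (stocks, chairman, the, wins, game); columns: three documents.
X = np.array([[2, 0, 3],
              [4, 1, 2],
              [8, 7, 9],
              [0, 2, 0],
              [1, 3, 0]], dtype=float)        # d x n word-count matrix (made up)

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = np.diag(S[:k]) @ Vt[:k]                   # k x n latent document representations

print(X[:, 0] @ X[:, 1])                      # raw similarity x_1^T x_2
print(Z[:, 0] @ Z[:, 1])                      # latent similarity z_1^T z_2
```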

Network Anomaly Detection [Lakhina 2005]
$x_{ji}$ = amount of traffic on link $j$ in the network during time interval $i$.
Model assumption: total traffic is a sum of flows along a few paths.
Apply PCA: each principal component intuitively represents a path. Flag an anomaly when the traffic deviates from the first few principal components (normal vs. anomalous traffic patterns).
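A sketch of the residual idea on synthetic traffic (the path/flow model, noise level, and threshold rule are all assumptions made for illustration): fit the top components on normal traffic, then flag intervals with a large residual $\|(I - U U^\top)(x - \bar{x})\|$.

```python
import numpy as np

rng = np.random.default_rng(7)
n_links, n_paths, T = 30, 3, 500
paths = (rng.random((n_links, n_paths)) < 0.3).astype(float)  # links used by each path
flows = rng.gamma(2.0, 1.0, size=(n_paths, T))                # per-path traffic over time
X = paths @ flows + 0.05 * rng.normal(size=(n_links, T))      # link traffic = sum of flows

mean = X.mean(axis=1, keepdims=True)
U = np.linalg.svd(X - mean, full_matrices=False)[0][:, :n_paths]

def residual(x):
    xc = x - mean[:, 0]
    return np.linalg.norm(xc - U @ (U.T @ xc))    # distance to the "normal" subspace

x_anomalous = X[:, 0].copy()
x_anomalous[:3] += 5.0                            # inject unusual traffic on three links
threshold = 3 * np.median([residual(X[:, t]) for t in range(T)])
print(residual(X[:, 0]) < threshold, residual(x_anomalous) > threshold)   # True True
```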

Multi-task Learning [Ando & Zhang 2005]
We have $n$ related tasks (e.g. classify documents for various users). Each task has a linear classifier with weights $x_i$. We want to share structure between the classifiers.
One step of their procedure: given $n$ linear classifiers $x_1, \ldots, x_n$, run PCA to identify shared structure: $X = (x_1 \cdots x_n) \approx U Z$. Each principal component is an "eigen-classifier".
The other step of their procedure: retrain the classifiers, regularizing towards the subspace $U$.

PCA Summary
Intuition: capture the variance of the data, or equivalently minimize the reconstruction error.
Algorithm: find the eigendecomposition of the covariance matrix, or use the SVD.
Impact: reduce storage (from $O(nd)$ to $O(nk)$), reduce time complexity.
Advantages: simple, fast.
Applications: eigen-faces, eigen-documents, network anomaly detection, etc.

Probabilistic Interpretation
Generative model [Tipping and Bishop, 1999]: for each data point $i = 1, \ldots, n$:
Draw the latent vector: $z_i \sim \mathcal{N}(0, I_{k \times k})$.
Create the data point: $x_i \sim \mathcal{N}(U z_i, \sigma^2 I_{d \times d})$.
PCA finds the $U$ that maximizes the likelihood of the data: $\max_U p(x \mid U)$.
Advantages:
Handles missing data (important for collaborative filtering).
Extension to factor analysis: allow non-isotropic noise (replace $\sigma^2 I_{d \times d}$ with an arbitrary diagonal matrix).
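A sketch of sampling from this generative model and checking that, with small noise, ordinary PCA on the samples recovers the subspace spanned by $U$; all dimensions and the noise level are toy choices:

```python
import numpy as np

rng = np.random.default_rng(8)
d, k, n, sigma = 10, 2, 1000, 0.1
U = rng.normal(size=(d, k))                    # loading matrix (columns span the latent subspace)

Z = rng.normal(size=(k, n))                    # z_i ~ N(0, I_k)
X = U @ Z + sigma * rng.normal(size=(d, n))    # x_i ~ N(U z_i, sigma^2 I_d)

# The top-k left singular vectors of the (centered) samples span roughly the same
# subspace as the columns of U when sigma is small.
U_hat = np.linalg.svd(X - X.mean(axis=1, keepdims=True), full_matrices=False)[0][:, :k]
proj = U_hat @ (U_hat.T @ U)                   # project U's columns onto the estimate
print(np.linalg.norm(proj - U) / np.linalg.norm(U))   # small relative error
```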

Limitations of Linearity
(Examples: data where PCA is effective vs. data where PCA is ineffective.)
The problem is that the PCA subspace is linear: $S = \{x = U z : z \in \mathbb{R}^k\}$. In this two-dimensional example ($k = 1$): $S = \{(x_1, x_2) : x_2 = \frac{u_2}{u_1} x_1\}$.

Nonlinear PCA
(Figure: a broken linear solution vs. the desired solution.)
We want the desired solution $S = \{(x_1, x_2) : x_2 = \frac{u_2}{u_1} x_1^2\}$. We can get it as $S = \{x : \phi(x) = U z\}$ with the feature map $\phi(x) = (x_1^2, x_2)^\top$.
Linear dimensionality reduction in $\phi(x)$ space corresponds to nonlinear dimensionality reduction in $x$ space; richer feature maps (e.g. ones including $\sin(\cdot)$ components) give other nonlinear subspaces.
Idea: use kernels.
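A sketch of the explicit-feature-map idea on the parabola example: points near $x_2 \propto x_1^2$ are not well captured by one linear component in $x$ space, but after mapping to $\phi(x) = (x_1^2, x_2)$ a single component captures essentially all of the variance (the data-generating constants are made up):

```python
import numpy as np

rng = np.random.default_rng(9)
x1 = rng.uniform(-1, 1, size=500)
x2 = 2.0 * x1**2 + 0.01 * rng.normal(size=500)   # points near the parabola x2 = 2 x1^2
X = np.vstack([x1, x2])                          # 2 x n data in the original space

def top_variance_fraction(M):
    Mc = M - M.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Mc, compute_uv=False)
    return (s[0] ** 2) / np.sum(s ** 2)          # fraction captured by the first component

Phi = np.vstack([x1**2, x2])                     # phi(x) = (x1^2, x2)
print(top_variance_fraction(X))                  # well below 1 in x space (~0.5)
print(top_variance_fraction(Phi))                # ~1 in phi(x) space
```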

Kernel PCA
Representer theorem: since $X X^\top u = \lambda u$, the solution can be written as $u = X\alpha = \sum_{i=1}^n \alpha_i x_i$, i.e. as a linear combination of the data points.
Kernel function: $k(x_1, x_2)$ such that $K$, the kernel matrix formed by $K_{ij} = k(x_i, x_j)$, is positive semi-definite.
Rewriting the PCA objective in terms of $\alpha$:
$$\max_{\|u\|=1} u^\top X X^\top u
= \max_{\alpha^\top X^\top X \alpha = 1} \alpha^\top (X^\top X)(X^\top X)\alpha
= \max_{\alpha^\top K \alpha = 1} \alpha^\top K^2 \alpha$$

Kernel PCA
Direct method: the kernel PCA objective $\max_{\alpha^\top K \alpha = 1} \alpha^\top K^2 \alpha$ leads to the kernel PCA eigenvalue problem $X^\top X \alpha = \lambda \alpha$, i.e. $K \alpha = \lambda \alpha$.
Modular method (if you don't want to think about kernels): find vectors $x'_1, \ldots, x'_n$ such that $x_i'^\top x_j' = K_{ij} = \phi(x_i)^\top \phi(x_j)$. Key: use any vectors that preserve inner products. One possibility is the Cholesky decomposition $K = X'^\top X'$.

Kernel PCA
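A sketch of kernel PCA through the kernel matrix (the RBF kernel, its bandwidth, and the toy data are assumptions; the centering step accounts for $\phi(x)$ not being mean-zero, which the slides gloss over):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K_ij = exp(-gamma * ||x_i - x_j||^2) for the columns x_i of X."""
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X
    return np.exp(-gamma * d2)

def kernel_pca(K, k):
    """Top-k kernel principal component scores from an n x n kernel matrix K."""
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones    # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)             # solve K alpha = lambda alpha
    eigvals, eigvecs = eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]
    return eigvecs * np.sqrt(np.maximum(eigvals, 0))  # n x k projected coordinates

rng = np.random.default_rng(10)
X = rng.normal(size=(2, 100))                 # toy 2-D data, columns are points
Z = kernel_pca(rbf_kernel(X, gamma=0.5), k=2)
print(Z.shape)                                # (100, 2)
```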

Canonical Correlation Analysis (CCA)

Motivation for CCA [Hotelling 1936]
Often, each data point consists of two views:
Image retrieval: for each image we have x: the pixels (or other visual features), and y: the text around the image.
Time series: x: the signal at time $t$; y: the signal at time $t+1$.
Two-view learning: divide the features into two sets; x: features of a word/object, etc.; y: features of the context in which it appears.
Goal: reduce the dimensionality of the two views jointly.

CCA Example
Setup: input data $(x_1, y_1), \ldots, (x_n, y_n)$ (matrices $X$, $Y$). Goal: find a pair of projections $(u, v)$.
Dimensionality reduction solutions: reducing each view independently vs. reducing them jointly (in the example figure, x and y are paired by brightness).

CCA Definition
Definitions:
Variance: $\widehat{\mathrm{var}}(u^\top x) = u^\top X X^\top u$
Covariance: $\widehat{\mathrm{cov}}(u^\top x, v^\top y) = u^\top X Y^\top v$
Correlation: $\dfrac{\widehat{\mathrm{cov}}(u^\top x, v^\top y)}{\sqrt{\widehat{\mathrm{var}}(u^\top x)}\,\sqrt{\widehat{\mathrm{var}}(v^\top y)}}$
Objective: maximize the correlation between the projected views: $\max_{u,v} \widehat{\mathrm{corr}}(u^\top x, v^\top y)$
Properties:
Focuses on how the variables are related, not on how much they vary.
Invariant to any rotation and scaling of the data.

From PCA to CCA
PCA on the views separately: no covariance term
$$\max_{u,v} \frac{u^\top X X^\top u}{u^\top u} + \frac{v^\top Y Y^\top v}{v^\top v}$$
PCA on the concatenation $(X^\top, Y^\top)^\top$: includes a covariance term
$$\max_{u,v} \frac{u^\top X X^\top u + 2 u^\top X Y^\top v + v^\top Y Y^\top v}{u^\top u + v^\top v}$$
Maximum covariance: drop the variance terms
$$\max_{u,v} \frac{u^\top X Y^\top v}{\sqrt{u^\top u}\,\sqrt{v^\top v}}$$
Maximum correlation (CCA): divide out the variance terms
$$\max_{u,v} \frac{u^\top X Y^\top v}{\sqrt{u^\top X X^\top u}\,\sqrt{v^\top Y Y^\top v}}$$

Importance of Regularization
Extreme examples of degeneracy:
If $x = A y$ (the views are linearly related), then perfectly correlated projections exist: any $(u, v)$ with $v = A^\top u$ achieves correlation 1.
If $x$ and $y$ are independent, then any $(u, v)$ is optimal (correlation 0).
Problem: if $X$ or $Y$ has rank $n$, then any $(u, v)$ is optimal (correlation 1) by choosing $u$ so that $X^\top u = Y^\top v$ (e.g. $u = (X^\top)^\dagger Y^\top v$) $\Rightarrow$ CCA is meaningless!
Solution: regularization (interpolate between maximum covariance and maximum correlation)
$$\max_{u,v} \frac{u^\top X Y^\top v}{\sqrt{u^\top (X X^\top + \lambda I) u}\,\sqrt{v^\top (Y Y^\top + \lambda I) v}}$$
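A sketch of regularized CCA for the top pair of directions, solved by whitening each view and taking an SVD of the whitened cross-covariance (an equivalent route to the generalized eigenvalue problem; the toy two-view data and $\lambda$ are made up):

```python
import numpy as np

def inv_sqrt(A):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_top_pair(X, Y, lam=1e-3):
    """Top regularized CCA directions for column-wise data matrices X, Y."""
    Cxx = X @ X.T + lam * np.eye(X.shape[0])
    Cyy = Y @ Y.T + lam * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    A, S, Bt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ A[:, 0], Wy @ Bt[0], S[0]     # u, v, and the top (regularized) correlation

rng = np.random.default_rng(11)
shared = rng.normal(size=(1, 500))            # latent signal shared by the two views
X = np.vstack([shared, rng.normal(size=(2, 500))]) + 0.1 * rng.normal(size=(3, 500))
Y = np.vstack([-shared, rng.normal(size=(3, 500))]) + 0.1 * rng.normal(size=(4, 500))
u, v, corr = cca_top_pair(X, Y)
print(corr)                                   # close to 1: the shared signal is found
```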

Canonical Correlation Forests [Rainforth and Wood]
Example: a random forest that uses CCA to determine the axis (projection direction) for the split at each node.
(Figure 1 of the paper: decision surfaces on an artificial spirals dataset for (a) a single unpruned CART, (b) a random forest with 200 trees, (c) a single unpruned canonical correlation tree (CCT), and (d) a canonical correlation forest (CCF) with 200 trees. Averaging over CCTs gives smoother decision surfaces that represent the data better than the axis-aligned equivalents.)

Summary
Framework: $z = U^\top x$, $x \approx U z$.
Criteria for choosing $U$:
PCA: maximize the projected variance.
CCA: maximize the projected correlation.
Algorithm: a generalized eigenvalue problem.
Extensions: non-linear using kernels (within the same linear framework); probabilistic, sparse, robust (harder optimization).