Data Mining Techniques, CS 6220 - Section 2 - Spring 2017, Lecture 4. Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)

Kernel Regression

Basis Function Regression.
Linear regression: $y = w^\top x$. Basis function regression: $y = w^\top \phi(x)$.
For $N$ samples: $y_n = w^\top \phi(x_n)$, $n = 1, \dots, N$.
Polynomial regression: $\phi(x) = (1, x, x^2, \dots, x^M)^\top$.

Basis Function Regression. [Figure: polynomial fit with $M = 3$; axes $t$ versus $x$.]

The Kernel Trick. Define a kernel function $k(x, x') = \phi(x)^\top \phi(x')$ such that $k$ can be cheaper to evaluate than $\phi$!

Kernel Ridge Regression.
MAP / expected value for the weights (requires inversion of a $D \times D$ matrix):
$\mathbb{E}[w \mid y] = A^{-1} \Phi^\top y$, where $A := (\Phi^\top \Phi + \lambda I)$ and $\Phi := \phi(X)$.
Alternate representation (requires inversion of an $N \times N$ matrix):
$A^{-1} \Phi^\top = \Phi^\top (K + \lambda I)^{-1}$, where $K := \Phi \Phi^\top$.
Predictive posterior (using the kernel function):
$\mathbb{E}[f(x_*) \mid y] = \phi(x_*)^\top \mathbb{E}[w \mid y] = \phi(x_*)^\top \Phi^\top (K + \lambda I)^{-1} y = \sum_{n,m} k(x_*, x_n)\, [(K + \lambda I)^{-1}]_{nm}\, y_m$
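To make the kernel form of the predictor concrete, here is a minimal NumPy sketch of kernel ridge regression with an RBF kernel. The toy data, the bandwidth `sigma`, and the regularizer `lam` are illustrative choices, not values from the lecture.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.6):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Toy 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1                                             # ridge regularizer lambda
K = rbf_kernel(X, X)                                  # N x N kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # (K + lam I)^{-1} y

# Predictive mean: E[f(x*) | y] = sum_n k(x*, x_n) [(K + lam I)^{-1} y]_n
X_star = np.linspace(-1, 1, 100)[:, None]
f_star = rbf_kernel(X_star, X) @ alpha
```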

Kernel Ridge Regression.
$f^* = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \left( y_i - \langle f, \phi(x_i) \rangle_{\mathcal{H}} \right)^2 + \lambda \|f\|_{\mathcal{H}}^2$
This has a closed-form solution. [Figure: fits for $\lambda = 0.1$, $\lambda = 10$, and $\lambda = 10^{-7}$, each with $\sigma = 0.6$.]

Gaussian Processes (a.k.a. Kernel Ridge Regression with Variance Estimates).
$p(y_* \mid x_*, X, y) = \mathcal{N}\!\left( k(x_*, X)^\top [K + \sigma^2_{\text{noise}} I]^{-1} y,\;\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, X)^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, X) \right)$
[Figure: output $f(x)$ versus input $x$ for two settings.]
Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
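A minimal sketch of the predictive equations above, implemented with an RBF kernel; the length scale `ell`, noise level `sigma_noise`, and toy data are assumed values, not taken from the lecture.

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ell ** 2))

def gp_predict(X, y, X_star, ell=1.0, sigma_noise=0.1):
    """Posterior mean and variance of y* at test inputs X_star."""
    K = rbf_kernel(X, X, ell)
    K_s = rbf_kernel(X_star, X, ell)         # k(x*, X)
    K_ss = rbf_kernel(X_star, X_star, ell)   # k(x*, x*)
    Ky = K + sigma_noise ** 2 * np.eye(len(X))
    mean = K_s @ np.linalg.solve(Ky, y)
    cov = K_ss + sigma_noise ** 2 * np.eye(len(X_star)) - K_s @ np.linalg.solve(Ky, K_s.T)
    return mean, np.diag(cov)

# Toy usage (illustrative data only).
rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mean, var = gp_predict(X, y, np.linspace(-5, 5, 50)[:, None])
```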

Choosing Kernel Hyperparameters.
$k(x, x') = v^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right) + \sigma^2_{\text{noise}}\, \delta_{xx'}$
[Figure: the posterior predictive mean for three different length scales: too long, about right, too short.]
Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Intermezzo: Kernels Borrowing from: Arthur Gretton (Gatsby, UCL)

Hilbert Spaces.
Definition (Inner product). Let $\mathcal{H}$ be a vector space over $\mathbb{R}$. A function $\langle \cdot, \cdot \rangle_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is an inner product on $\mathcal{H}$ if
1. Linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal{H}} = \alpha_1 \langle f_1, g \rangle_{\mathcal{H}} + \alpha_2 \langle f_2, g \rangle_{\mathcal{H}}$
2. Symmetric: $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$
3. $\langle f, f \rangle_{\mathcal{H}} \ge 0$, and $\langle f, f \rangle_{\mathcal{H}} = 0$ if and only if $f = 0$.
Norm induced by the inner product: $\|f\|_{\mathcal{H}} := \sqrt{\langle f, f \rangle_{\mathcal{H}}}$

Example: Fourier Bases Fourier modes define a vector space

Kernels.
Definition. Let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\forall x, x' \in \mathcal{X}$, $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$.
Almost no conditions on $\mathcal{X}$ (e.g., $\mathcal{X}$ itself doesn't need an inner product; e.g., documents).
A single kernel can correspond to several possible features. A trivial example for $\mathcal{X} := \mathbb{R}$: $\phi_1(x) = x$ and $\phi_2(x) = \begin{bmatrix} x/\sqrt{2} \\ x/\sqrt{2} \end{bmatrix}$.

Sums, Transformations, Products.
Theorem (Sums of kernels are kernels). Given $\alpha > 0$ and $k$, $k_1$, $k_2$ all kernels on $\mathcal{X}$, then $\alpha k$ and $k_1 + k_2$ are kernels on $\mathcal{X}$. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?)
Theorem (Mappings between spaces). Let $\mathcal{X}$ and $\tilde{\mathcal{X}}$ be sets, and define a map $A : \mathcal{X} \to \tilde{\mathcal{X}}$. Define the kernel $k$ on $\tilde{\mathcal{X}}$. Then $k(A(x), A(x'))$ is a kernel on $\mathcal{X}$. Example: $k(x, x') = x^2 (x')^2$.
Theorem (Products of kernels are kernels). Given $k_1$ on $\mathcal{X}_1$ and $k_2$ on $\mathcal{X}_2$, then $k_1 \times k_2$ is a kernel on $\mathcal{X}_1 \times \mathcal{X}_2$. If $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}$, then $k := k_1 k_2$ is a kernel on $\mathcal{X}$.

Polynomial Kernels.
Theorem (Polynomial kernels). Let $x, x' \in \mathbb{R}^d$ for $d \ge 1$, let $m \ge 1$ be an integer and $c \ge 0$ a real. Then
$k(x, x') := \left( \langle x, x' \rangle + c \right)^m$
is a valid kernel.
To prove: expand into a sum (with non-negative scalars) of kernels $\langle x, x' \rangle$ raised to integer powers. These individual terms are valid kernels by the product rule.
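As a numerical sanity check on this theorem, the sketch below compares $(\langle x, x' \rangle + c)^2$ against the inner product of an explicit quadratic feature map for $d = 2$; the feature map is one standard expansion, written out by hand for illustration.

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, m=2):
    return (x @ xp + c) ** m

def phi(x, c=1.0):
    """Explicit feature map for d = 2, m = 2 (one of several valid choices)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([0.3, -1.2])
xp = np.array([0.7, 0.5])
print(poly_kernel(x, xp), phi(x) @ phi(xp))  # both print the same value (0.3721)
```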

Infinite Sequences.
Definition. The space $\ell^2$ (square-summable sequences) comprises all sequences $a := (a_i)_{i \ge 1}$ for which $\|a\|^2_{\ell^2} = \sum_{i=1}^{\infty} a_i^2 < \infty$.
Definition. Given a sequence of functions $(\phi_i(x))_{i \ge 1}$ in $\ell^2$, where $\phi_i : \mathcal{X} \to \mathbb{R}$ is the $i$-th coordinate of $\phi(x)$, define
$k(x, x') := \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(x') \quad (1)$

Infinite Sequences.
Why square summable? By Cauchy-Schwarz,
$\sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(x') \le \|\phi(x)\|_{\ell^2}\, \|\phi(x')\|_{\ell^2}$,
so the sequence defining the inner product converges for all $x, x' \in \mathcal{X}$.

Taylor Series Kernels.
Definition (Taylor series kernel). For $r \in (0, \infty]$, with $a_n \ge 0$ for all $n \ge 0$, let
$f(z) = \sum_{n=0}^{\infty} a_n z^n, \quad |z| < r,\; z \in \mathbb{R}.$
Define $\mathcal{X}$ to be the $\sqrt{r}$-ball in $\mathbb{R}^d$, so $\|x\| < \sqrt{r}$; then
$k(x, x') = f\!\left( \langle x, x' \rangle \right) = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n.$
Example (Exponential kernel): $k(x, x') := \exp\!\left( \langle x, x' \rangle \right)$.

Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel).
Example (Gaussian kernel). The Gaussian kernel on $\mathbb{R}^d$ is defined as
$k(x, x') := \exp\!\left( -\gamma \|x - x'\|^2 \right).$
Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
Related variants: Squared Exponential (SE); Automatic Relevance Determination (ARD).

Products of Kernels.
Base kernels (one-dimensional):
Squared-exp (SE): $\sigma_f^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right)$
Periodic (Per): $\sigma_f^2 \exp\!\left( -\frac{2}{\ell^2} \sin^2\!\left( \pi \frac{x - x'}{p} \right) \right)$
Linear (Lin): $\sigma_f^2 (x - c)(x' - c)$
[Figures: the base kernels as functions of $x - x'$ (or of $x$ with $x' = 1$), and the products Lin × Lin, SE × Per, Lin × SE, Lin × Per.]
Source: David Duvenaud (PhD thesis).

Positive Definiteness.
Definition (Positive definite functions). A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if $\forall n \ge 1$, $\forall (a_1, \dots, a_n) \in \mathbb{R}^n$, $\forall (x_1, \dots, x_n) \in \mathcal{X}^n$,
$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) \ge 0.$
The function $k(\cdot, \cdot)$ is strictly positive definite if, for mutually distinct $x_i$, the equality holds only when all the $a_i$ are zero.
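This definition can be checked numerically on any finite sample: the Gram matrix of a valid kernel should have no significantly negative eigenvalues. A minimal sketch using the Gaussian kernel on random data (the sample size and bandwidth `gamma` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))   # 50 arbitrary points in R^3
gamma = 0.5

# Gaussian (RBF) Gram matrix K[i, j] = exp(-gamma ||x_i - x_j||^2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

eigvals = np.linalg.eigvalsh(K)    # real eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-10)     # True: K is positive semi-definite (up to round-off)
```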

Mercer's Theorem.
Theorem. Let $\mathcal{H}$ be a Hilbert space, $\mathcal{X}$ a non-empty set and $\phi : \mathcal{X} \to \mathcal{H}$. Then $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} =: k(x, y)$ is positive definite.
Proof.
$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \langle a_i \phi(x_i), a_j \phi(x_j) \rangle_{\mathcal{H}} = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|_{\mathcal{H}}^2 \ge 0.$
The reverse also holds: a positive definite $k(x, x')$ is an inner product in a unique $\mathcal{H}$ (Moore-Aronszajn theorem: coming later!).

DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford)

Linear Dimensionality Reduction.
Idea: project a high-dimensional vector onto a lower-dimensional space:
$x \in \mathbb{R}^{361} \;\mapsto\; z \in \mathbb{R}^{10}, \qquad z = U^\top x$

Problem Setup.
Given $n$ data points in $d$ dimensions: $x_1, \dots, x_n \in \mathbb{R}^d$, collected as $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ (the transpose of the $X$ used in regression!).
Want to reduce dimensionality from $d$ to $k$. Choose $k$ directions $u_1, \dots, u_k$, collected as $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$.
For each $u_j$, compute the similarity $z_j = u_j^\top x$. Project $x$ down to $z = (z_1, \dots, z_k)^\top = U^\top x$.
How do we choose $U$?

Principal Component Analysis.
$x \in \mathbb{R}^{361} \;\mapsto\; z \in \mathbb{R}^{10}, \qquad z = U^\top x$
Optimize two equivalent objectives: 1. minimize the reconstruction error; 2. maximize the projected variance.

PCA Objective 1: Reconstruction Error.
$U$ serves two functions:
Encode: $z = U^\top x$, i.e. $z_j = u_j^\top x$.
Decode: $\tilde{x} = U z = \sum_{j=1}^{k} z_j u_j$.
Want the reconstruction error $\|x - \tilde{x}\|$ to be small.
Objective: minimize the total squared reconstruction error
$\min_{U \in \mathbb{R}^{d \times k}} \sum_{i=1}^{n} \|x_i - U U^\top x_i\|^2$

PCA Objective 2: Projected Variance.
Empirical distribution: uniform over $x_1, \dots, x_n$.
Expectation (think: sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$.
Variance (think: sum of squares if centered): $\widehat{\operatorname{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$.
Assume the data is centered: $\hat{\mathbb{E}}[x] = 0$ (what is $\hat{\mathbb{E}}[U^\top x]$?).
Objective: maximize the variance of the projected data
$\max_{U \in \mathbb{R}^{d \times k},\; U^\top U = I} \hat{\mathbb{E}}\!\left[ \|U^\top x\|^2 \right]$

Equivalence of the Two Objectives.
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small).
Pythagorean decomposition: $x = U U^\top x + (I - U U^\top) x$, with $\|x\|^2 = \|U U^\top x\|^2 + \|(I - U U^\top) x\|^2$.
Take expectations; note that the rotation $U$ doesn't affect length:
$\hat{\mathbb{E}}[\|x\|^2] = \hat{\mathbb{E}}[\|U^\top x\|^2] + \hat{\mathbb{E}}[\|x - U U^\top x\|^2]$
Minimizing the reconstruction error $\Leftrightarrow$ maximizing the captured variance.
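The Pythagorean identity above is easy to verify numerically for any orthonormal $U$; the sketch below uses a random orthonormal basis obtained from a QR decomposition (an illustrative choice, not the PCA solution).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 500
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)                 # center the data

# Random orthonormal U in R^{d x k}; any orthonormal U satisfies the identity.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

total = (X ** 2).sum() / n                         # E[||x||^2]
captured = ((U.T @ X) ** 2).sum() / n              # E[||U^T x||^2]
residual = ((X - U @ (U.T @ X)) ** 2).sum() / n    # E[||x - U U^T x||^2]
print(np.isclose(total, captured + residual))      # True
```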

Changes of Basis.
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis: $U = (u_1 \cdots u_d) \in \mathbb{R}^{d \times d}$.
Change of basis: $z = (z_1, \dots, z_d)^\top = U^\top x$, with $z_j = u_j^\top x$.
Inverse change of basis: $x = U z$.

Principal Component Analysis.
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis $U \in \mathbb{R}^{d \times d}$: the eigenvectors of the covariance.
Eigen-decomposition: $C = U \Lambda U^\top$ with $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$.
Claim: the eigenvectors of a symmetric matrix are orthogonal.

Principal Component Analysis. [Figure: illustration of PCA projections (from Stack Exchange).]

Principal Component Analysis.
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis $U \in \mathbb{R}^{d \times d}$: the eigenvectors of the covariance, $C = U \Lambda U^\top$, $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$.
Idea: take the top-$k$ eigenvectors to maximize variance.

Principal Component Analysis.
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, the top-$k$ eigenvectors of the covariance.
Truncated decomposition: $C \approx U \Lambda^{(k)} U^\top$ with $\Lambda^{(k)} = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)$.

Principal Component Analysis.
[Figure: projections onto the top 2 components versus the bottom 2 components.]
Data: three varieties of wheat (Kama, Rosa, Canadian). Attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of groove.

PCA: Complexity.
Using the eigenvalue decomposition: computing the covariance $C$ costs $O(n d^2)$; the eigenvalue decomposition costs $O(d^3)$; total complexity $O(n d^2 + d^3)$.

PCA: Complexity.
Using the singular value decomposition: a full decomposition costs $O(\min\{n d^2, n^2 d\})$; a rank-$k$ decomposition costs $O(k\, d\, n \log n)$ (with the power method).
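A minimal sketch of the two computational routes just described: eigendecomposition of the $d \times d$ covariance versus an SVD of $X$. Both recover the same top-$k$ subspace (up to sign); the data dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 20, 200, 3
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)          # PCA assumes centered data

# Route 1: eigendecomposition of the covariance, O(n d^2 + d^3).
C = X @ X.T / n
evals, evecs = np.linalg.eigh(C)            # eigenvalues in ascending order
U_eig = evecs[:, ::-1][:, :k]               # top-k eigenvectors

# Route 2: SVD of X; the left singular vectors give the same directions.
U_svd, s, _ = np.linalg.svd(X, full_matrices=False)
U_svd = U_svd[:, :k]

# The two bases agree up to sign: |U_eig^T U_svd| should be the identity.
print(np.allclose(np.abs(U_eig.T @ U_svd), np.eye(k), atol=1e-6))
```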

Singular Value Decomposition. Idea: decompose a d x d matrix M into 1. a change of basis V (unitary matrix), 2. a scaling Σ (diagonal matrix), 3. a change of basis U (unitary matrix): $M = U \Sigma V^\top$.

Singular Value Decomposition. Idea: decompose the d x n matrix X into 1. an n x n basis V (unitary matrix), 2. a d x n matrix Σ (diagonal projection), 3. a d x d basis U (unitary matrix):
$X = U\, \Sigma\, V^\top$, with $U \in \mathbb{R}^{d \times d}$, $\Sigma \in \mathbb{R}^{d \times n}$, $V \in \mathbb{R}^{n \times n}$, and $U^\top U = I = V^\top V$.

Eigen-faces [Turk & Pentland 1991].
$d$ = number of pixels. Each $x_i \in \mathbb{R}^d$ is a face image; $x_{ji}$ = intensity of the $j$-th pixel in image $i$.
Approximate $X$ ($d \times n$) by $U$ ($d \times k$) times $Z = (z_1 \cdots z_n)$ ($k \times n$).
Idea: $z_i$ is a more meaningful representation of the $i$-th face than $x_i$. Can use $z_i$ for nearest-neighbor classification.
Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$ (see the sketch below).
Why no time savings for a linear classifier?
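To make the $O(dk + nk)$ claim concrete, here is a sketch of nearest-neighbor lookup in the reduced space: project the query once ($O(dk)$), then compare against $n$ stored $k$-dimensional codes ($O(nk)$). The "face" data here are random stand-ins, not real images.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 4096, 1000, 50                     # e.g. 64x64 images, 1000 faces, 50 components
X = rng.standard_normal((d, n))              # stand-in for face images (one per column)
mean = X.mean(axis=1, keepdims=True)

U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
U = U[:, :k]                                 # top-k "eigenfaces"
Z = U.T @ (X - mean)                         # k x n codes, computed once

def nearest_face(x_query):
    z = U.T @ (x_query - mean[:, 0])         # O(dk): project the query
    dists = ((Z - z[:, None]) ** 2).sum(0)   # O(nk): compare in code space
    return int(np.argmin(dists))

print(nearest_face(X[:, 42]))                # recovers index 42
```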

Aside: How many components?
The magnitude of the eigenvalues indicates the fraction of variance captured.
[Figure: eigenvalues $\lambda_i$ on a face image dataset, dropping from 1353.2, 1086.7, 820.1, 553.6, to 287.1 over the first components.]
Eigenvalues typically drop off sharply, so you don't need that many components. Of course, variance isn't everything...

Latent Semantic Analysis [Deerwester et al. 1990].
$d$ = number of words in the vocabulary. Each $x_i \in \mathbb{R}^d$ is a vector of word counts; $x_{ji}$ = frequency of word $j$ in document $i$.
Approximate $X$ ($d \times n$) by $U$ ($d \times k$) times $Z = (z_1 \cdots z_n)$ ($k \times n$). [Example: term-document count matrix with rows for the terms stocks, chairman, the, wins, game.]
How to measure the similarity between two documents? $z_1^\top z_2$ is probably better than $x_1^\top x_2$.
Applications: information retrieval.
Note: no computational savings; the original $x$ is already sparse.

PCA Summary.
Intuition: capture the variance of the data, or minimize the reconstruction error.
Algorithm: find the eigendecomposition of the covariance matrix, or the SVD.
Impact: reduce storage (from $O(nd)$ to $O(nk)$), reduce time complexity.
Advantages: simple, fast.
Applications: eigen-faces, eigen-documents, network anomaly detection, etc.

Probabilistic Interpretation.
Generative model [Tipping and Bishop, 1999]: for each data point $i = 1, \dots, n$:
draw the latent vector $z_i \sim \mathcal{N}(0, I_{k \times k})$;
create the data point $x_i \sim \mathcal{N}(U z_i, \sigma^2 I_{d \times d})$.
PCA finds the $U$ that maximizes the likelihood of the data: $\max_U p(x \mid U)$.
Advantages: handles missing data (important for collaborative filtering); extension to factor analysis: allow non-isotropic noise (replace $\sigma^2 I_{d \times d}$ with an arbitrary diagonal matrix).
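A small sketch of sampling from the generative model above; the loading matrix `U` and noise level `sigma` are arbitrary illustrative values, and fitting $U$ by maximum likelihood is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 100_000
sigma = 0.3
U = rng.standard_normal((d, k))                    # arbitrary loading matrix (not fitted)

Z = rng.standard_normal((k, n))                    # z_i ~ N(0, I_k)
X = U @ Z + sigma * rng.standard_normal((d, n))    # x_i ~ N(U z_i, sigma^2 I_d)

# The sample covariance should approach U U^T + sigma^2 I as n grows.
C_hat = X @ X.T / n
print(np.round(C_hat - (U @ U.T + sigma ** 2 * np.eye(d)), 2))
```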

Limitations of Linearity.
[Figure: a dataset where PCA is effective and one where PCA is ineffective.]
The problem is that the PCA subspace is linear: $S = \{x = Uz : z \in \mathbb{R}^k\}$. In this example: $S = \{(x_1, x_2) : x_2 = \frac{u_2}{u_1} x_1\}$.

Nonlinear PCA.
[Figure: broken solution versus desired solution.]
We want the desired solution: $S = \{(x_1, x_2) : x_2 = \frac{u_2}{u_1} x_1^2\}$.
We can get this as $S = \{\phi(x) = Uz\}$ with $\phi(x) = (x_1^2, x_2)^\top$.
Linear dimensionality reduction in $\phi(x)$ space is nonlinear dimensionality reduction in $x$ space (the same holds for other feature maps, e.g. ones involving $\sin(\cdot)$).
Idea: use kernels.

Kernel PCA.
Representer theorem: the solution of $X X^\top u = \lambda u$ can be written as $u = X\alpha = \sum_{i=1}^{n} \alpha_i x_i$.
Kernel function: $k(x_1, x_2)$ such that $K$, the kernel matrix formed by $K_{ij} = k(x_i, x_j)$, is positive semi-definite.
$\max_{\|u\|=1} u^\top X X^\top u \;=\; \max_{\alpha^\top X^\top X \alpha = 1} \alpha^\top (X^\top X)(X^\top X)\, \alpha \;=\; \max_{\alpha^\top K \alpha = 1} \alpha^\top K^2 \alpha$

Kernel PCA.
Direct method: the kernel PCA objective $\max_{\alpha^\top K \alpha = 1} \alpha^\top K^2 \alpha$ leads to the kernel PCA eigenvalue problem $X^\top X\, \alpha = \lambda' \alpha$, i.e. $K\alpha = \lambda'\alpha$.
Modular method (if you don't want to think about kernels): find vectors $x'_1, \dots, x'_n$ such that $x'^\top_i x'_j = K_{ij} = \phi(x_i)^\top \phi(x_j)$. Key: use any vectors that preserve the inner products. One possibility is the Cholesky decomposition $K = X'^\top X'$.
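A sketch of the direct method in practice: form $K$, solve the eigenproblem on $K$, and use the eigenvectors as projection coefficients. This follows the common recipe, which also centers the kernel matrix in feature space (a detail not spelled out on the slide); the toy data and RBF bandwidth `gamma` are assumed choices.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
# Toy nonlinear data: a noisy circle, where linear PCA is ineffective.
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))

K = rbf_kernel(X, gamma=2.0)
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                            # center the kernel matrix in feature space

evals, alphas = np.linalg.eigh(Kc)        # ascending eigenvalues
evals, alphas = evals[::-1], alphas[:, ::-1]

k = 2
# Projections of the training points onto the top-k kernel principal components.
Z = alphas[:, :k] * np.sqrt(np.maximum(evals[:k], 0))
```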

Kernel PCA

Canonical Correlation Analysis (CCA)

Motivation for CCA [Hotelling 1936].
Often, each data point consists of two views.
Image retrieval: for each image, x: pixels (or other visual features); y: text around the image.
Time series: x: signal at time $t$; y: signal at time $t+1$.
Two-view learning: divide the features into two sets; x: features of a word/object, etc.; y: features of the context in which it appears.
Goal: reduce the dimensionality of the two views jointly.

CCA Example.
Setup: input data $(x_1, y_1), \dots, (x_n, y_n)$ (matrices $X$, $Y$). Goal: find a pair of projections $(u, v)$.
[Figure: dimensionality reduction solutions, independent versus joint; x and y are paired by brightness.]

CCA Definition.
Variance: $\widehat{\operatorname{var}}(u^\top x) = u^\top X X^\top u$.
Covariance: $\widehat{\operatorname{cov}}(u^\top x, v^\top y) = u^\top X Y^\top v$.
Correlation: $\dfrac{\widehat{\operatorname{cov}}(u^\top x, v^\top y)}{\sqrt{\widehat{\operatorname{var}}(u^\top x)}\,\sqrt{\widehat{\operatorname{var}}(v^\top y)}}$.
Objective: maximize the correlation between the projected views: $\max_{u,v} \widehat{\operatorname{corr}}(u^\top x, v^\top y)$.
Properties: focuses on how the variables are related, not how much they vary; invariant to any rotation and scaling of the data.

From PCA to CCA.
PCA on the views separately: no covariance term:
$\max_{u,v} \dfrac{u^\top X X^\top u}{u^\top u} + \dfrac{v^\top Y Y^\top v}{v^\top v}$
PCA on the concatenation $(X^\top, Y^\top)^\top$: includes a covariance term:
$\max_{u,v} \dfrac{u^\top X X^\top u + 2 u^\top X Y^\top v + v^\top Y Y^\top v}{u^\top u + v^\top v}$
Maximum covariance: drop the variance terms:
$\max_{u,v} \dfrac{u^\top X Y^\top v}{\sqrt{u^\top u}\,\sqrt{v^\top v}}$
Maximum correlation (CCA): divide out the variance terms:
$\max_{u,v} \dfrac{u^\top X Y^\top v}{\sqrt{u^\top X X^\top u}\,\sqrt{v^\top Y Y^\top v}}$

Importance of Regularization.
Extreme examples of degeneracy:
If $x = A y$, then any $(u, v)$ with $u = A v$ is optimal (correlation 1).
If $x$ and $y$ are independent, then any $(u, v)$ is optimal (correlation 0).
Problem: if $X$ or $Y$ has rank $n$, then any $(u, v)$ with $u = (X^\top)^\dagger Y^\top v$ is optimal (correlation 1), so CCA is meaningless!
Solution: regularization (interpolate between maximum covariance and maximum correlation):
$\max_{u,v} \dfrac{u^\top X Y^\top v}{\sqrt{u^\top (X X^\top + \lambda I) u}\,\sqrt{v^\top (Y Y^\top + \lambda I) v}}$
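The regularized objective above can be solved as a symmetric generalized eigenvalue problem; below is a minimal sketch of one standard formulation (not necessarily the one used in the lecture), with illustrative two-view data that share a latent signal.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, dx, dy, lam = 500, 10, 8, 0.1

# Two views sharing a latent signal s (illustrative data).
s = rng.standard_normal(n)
X = np.outer(rng.standard_normal(dx), s) + 0.5 * rng.standard_normal((dx, n))
Y = np.outer(rng.standard_normal(dy), s) + 0.5 * rng.standard_normal((dy, n))
X -= X.mean(1, keepdims=True)
Y -= Y.mean(1, keepdims=True)

Cxx, Cyy, Cxy = X @ X.T, Y @ Y.T, X @ Y.T

# Symmetric generalized eigenproblem A w = rho B w, with w = (u, v).
A = np.block([[np.zeros((dx, dx)), Cxy], [Cxy.T, np.zeros((dy, dy))]])
B = np.block([[Cxx + lam * np.eye(dx), np.zeros((dx, dy))],
              [np.zeros((dy, dx)), Cyy + lam * np.eye(dy)]])

rho, W = eigh(A, B)                   # generalized eigenvalues in ascending order
u, v = W[:dx, -1], W[dx:, -1]         # top canonical pair

corr = (u @ Cxy @ v) / np.sqrt((u @ Cxx @ u) * (v @ Cyy @ v))
print(round(float(corr), 3))          # high correlation when the shared signal is strong
```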