Kernel methods for comparing distributions, measuring dependence
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012
Principal component analysis
Given a set of $M$ centered observations $x_k \in \mathbb{R}^d$, with $X = (x_1, x_2, \ldots, x_M)$, PCA finds the direction that maximizes the variance:
$$w^* = \arg\max_{\|w\|=1} \frac{1}{M} \sum_k (w^\top x_k)^2 = \arg\max_{\|w\|=1} w^\top C w, \qquad C = \frac{1}{M} X X^\top.$$
$w^*$ can be found by solving the eigenvalue problem $C w = \lambda w$.
Alternative expression for PCA
The principal component lies in the span of the data: $w = \sum_k \alpha_k x_k = X\alpha$. Plugging this in gives $Cw = \frac{1}{M} X X^\top X \alpha = \lambda X\alpha$. Furthermore, for each data point $x_k$ the relation $x_k^\top C w = \frac{1}{M} x_k^\top X X^\top X \alpha = \lambda\, x_k^\top X \alpha$ holds. In matrix form, $\frac{1}{M} X^\top X X^\top X \alpha = \lambda X^\top X \alpha$, which depends on the data only through the inner product matrix $X^\top X$.
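As a quick sanity check of this dual view, here is a minimal numpy sketch (synthetic data, arbitrary sizes) showing that the top eigenvector of $C = \frac{1}{M} X X^\top$ and the direction $w = X\alpha$ recovered from the Gram matrix $X^\top X$ agree up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 200, 5
X = rng.standard_normal((d, M))           # columns are observations x_k
X = X - X.mean(axis=1, keepdims=True)     # center the data

# Primal PCA: eigendecompose C = (1/M) X X^T  (d x d)
C = X @ X.T / M
_, eigvec_C = np.linalg.eigh(C)
w_primal = eigvec_C[:, -1]                # top principal direction

# Dual form: eigendecompose the Gram matrix X^T X  (M x M)
G = X.T @ X
_, alpha = np.linalg.eigh(G)
w_dual = X @ alpha[:, -1]                 # w = X alpha lies in the span of the data
w_dual /= np.linalg.norm(w_dual)

# the two directions agree up to sign
print(min(np.linalg.norm(w_primal - w_dual),
          np.linalg.norm(w_primal + w_dual)))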
Kernel PCA
Key idea: replace the inner product matrix by the kernel matrix.
PCA: $\frac{1}{M} X^\top X X^\top X \alpha = \lambda X^\top X \alpha$.
Map $x_k \mapsto \varphi(x_k)$, let $\Phi = (\varphi(x_1), \ldots, \varphi(x_M))$ and $K = \Phi^\top \Phi$; the nonlinear component is $w = \Phi\alpha$.
Kernel PCA: $\frac{1}{M} K K \alpha = \lambda K \alpha$, which is equivalent to $\frac{1}{M} K \alpha = \lambda \alpha$.
First form the $M \times M$ kernel matrix $K$, then perform an eigendecomposition of $K$.
Kernel PCA example
Use the Gaussian RBF kernel $k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ over a 2-dimensional space. The eigenvector evaluated at a test point $x$ is a function: $w^\top \varphi(x) = \sum_k \alpha_k\, k(x_k, x)$.
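A minimal numpy sketch of this recipe, assuming a Gaussian RBF kernel with an arbitrary bandwidth $\sigma = 1$ and synthetic 2-D data; centering of the kernel matrix, which a full kernel PCA implementation would also include, is omitted here to keep the correspondence with the formulas above direct:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # A: (m, d), B: (n, d) -> (m, n) Gaussian RBF kernel matrix
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))          # rows are observations here

K = rbf_kernel(X, X)                       # M x M kernel matrix
eigval, eigvec = np.linalg.eigh(K)         # eigendecomposition of K
alpha = eigvec[:, -1] / np.sqrt(eigval[-1])  # scale so that w = Phi alpha has unit norm

# evaluate the nonlinear component at a test point:
#   w^T phi(x) = sum_k alpha_k k(x_k, x)
x_test = rng.standard_normal((1, 2))
projection = alpha @ rbf_kernel(X, x_test)
print(projection)
```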
Spectral clustering
Spectral clustering
Form a kernel matrix $K$ with the Gaussian RBF kernel.
Treat $K$ as the adjacency matrix of a graph (set the diagonal of $K$ to 0).
Construct the graph Laplacian $L = D^{-1/2} K D^{-1/2}$, where $D = \mathrm{diag}(K \mathbf{1})$.
Compute the top $k$ eigenvectors $V = (v_1, v_2, \ldots, v_k)$ of $L$.
Use $V$ as the input to $k$-means for clustering.
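A sketch of this procedure in numpy/scipy, assuming an arbitrary RBF bandwidth and two synthetic Gaussian blobs; the final $k$-means step uses scipy's kmeans2 for brevity:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def rbf_kernel(X, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def spectral_clustering(X, k, sigma=1.0):
    K = rbf_kernel(X, sigma)
    np.fill_diagonal(K, 0.0)                   # treat K as a graph adjacency matrix
    d = K.sum(axis=1)                          # degrees, D = diag(K 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ K @ D_inv_sqrt            # normalized Laplacian D^{-1/2} K D^{-1/2}
    _, eigvec = np.linalg.eigh(L)
    V = eigvec[:, -k:]                         # top-k eigenvectors as new features
    _, labels = kmeans2(V, k, minit="++")
    return labels

# two well-separated Gaussian blobs as a toy example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(spectral_clustering(X, 2))
```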
Canonical correlation analysis
Canonical correlation analysis
Given paired observations $\{(x_i, y_i)\}$, estimate two basis vectors $w_x$ and $w_y$ such that the correlation between the projections $w_x^\top x$ and $w_y^\top y$ is maximized.
CCA derivation II
Define the covariance matrices $C_{xx}$, $C_{yy}$ and the cross-covariance matrix $C_{xy}$ of $x$ and $y$. The optimization problem is equivalent to
$$\max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}.$$
We can require the normalization $w_x^\top C_{xx} w_x = 1$ and $w_y^\top C_{yy} w_y = 1$, and just maximize the numerator $w_x^\top C_{xy} w_y$.
CCA as a generalized eigenvalue problem
The optimality conditions are $C_{xy} w_y = \lambda\, C_{xx} w_x$ and $C_{yx} w_x = \lambda\, C_{yy} w_y$. Putting these conditions into matrix form gives
$$\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix},$$
a generalized eigenvalue problem $A w = \lambda B w$.
CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data: with $X = (x_1, \ldots, x_m)$ and $Y = (y_1, \ldots, y_m)$, write $w_x = X\alpha$, $w_y = Y\beta$, and $C_{xy} = \frac{1}{m} X Y^\top$, $C_{xx} = \frac{1}{m} X X^\top$, $C_{yy} = \frac{1}{m} Y Y^\top$. Plugging $w_x = X\alpha$, $w_y = Y\beta$ into the earlier objective gives
$$\max_{\alpha, \beta} \frac{\alpha^\top X^\top X\, Y^\top Y \beta}{\sqrt{\alpha^\top X^\top X X^\top X \alpha}\,\sqrt{\beta^\top Y^\top Y Y^\top Y \beta}}.$$
The data only appear through inner products.
Kernel CCA
Replace the inner product matrices by kernel matrices, where $K_x$ is the kernel matrix for the data $X$ with entries $K_x(i, j) = k(x_i, x_j)$ (and similarly $K_y$ for $Y$):
$$\max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top K_x K_x \alpha}\,\sqrt{\beta^\top K_y K_y \beta}}.$$
Solve the resulting generalized eigenvalue problem
$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}.$$
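A sketch of kernel CCA as the generalized eigenvalue problem above, on synthetic paired data. The small ridge term reg is an assumption added here for numerical stability (the kernel problem is degenerate without some regularization); the slide itself does not specify a regularizer, and kernel centering is also omitted:

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_cca(Kx, Ky, reg=1e-3):
    m = Kx.shape[0]
    Z = np.zeros((m, m))
    # generalized eigenvalue problem  A z = lambda B z  with z = [alpha; beta];
    # reg * I keeps B positive definite
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + reg * np.eye(m), Z],
                  [Z, Ky @ Ky + reg * np.eye(m)]])
    eigval, eigvec = eigh(A, B)
    alpha, beta = eigvec[:m, -1], eigvec[m:, -1]   # top canonical pair
    return eigval[-1], alpha, beta

rng = np.random.default_rng(0)
t = rng.uniform(-np.pi, np.pi, 200)
X = np.column_stack([np.sin(t), t]) + 0.05 * rng.standard_normal((200, 2))
Y = np.column_stack([np.cos(t), t**2]) + 0.05 * rng.standard_normal((200, 2))
corr, alpha, beta = kernel_cca(rbf_kernel(X), rbf_kernel(Y))
print(corr)   # close to 1: X and Y are nonlinearly related through t
```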
Comparing two distributions
For two Gaussian distributions $P(X)$ and $Q(X)$ with unit variance, simply test $H_0: \mu_1 = \mu_2$. Given samples $x_1, \ldots, x_m \sim P(X)$ and $x'_1, \ldots, x'_n \sim Q(X)$, estimate $\mu_1 \approx \frac{1}{m}\sum_i x_i$ (and similarly $\mu_2$).
For general distributions we test $H_0: P(X) = Q(X)$; one option is the KL divergence
$$KL(P \,\|\, Q) = \int_X P(X) \log \frac{P(X)}{Q(X)}\, dX \approx \frac{1}{m} \sum_i \log \frac{P(x_i)}{Q(x_i)},$$
but this requires estimating the density functions first.
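To see why the KL route is awkward, here is a hedged sketch of the plug-in estimator: the densities are first estimated with a kernel density estimator (scipy's gaussian_kde, an arbitrary choice), and only then is the log ratio averaged over the $P$-samples:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
m = 1000
x_p = rng.normal(0.0, 1.0, m)        # samples from P = N(0, 1)
x_q = rng.normal(0.5, 1.0, m)        # samples from Q = N(0.5, 1)

# plug-in estimate: fit the densities first, then average log P(x)/Q(x) over P-samples
p_hat = gaussian_kde(x_p)
q_hat = gaussian_kde(x_q)
kl_est = np.mean(np.log(p_hat(x_p) / q_hat(x_p)))

# true KL between unit-variance Gaussians is (mu1 - mu2)^2 / 2 = 0.125
print(kl_est)
```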
Embedding distributions into feature space
Summary statistics for distributions: the mean, the covariance, and more generally expected features. Picking a different kernel generates a different summary statistic.
Pictorial view of embedding a distribution
Transform the entire distribution to its expected features: the feature map $\varphi(x)$ carries $P(X)$ to the point $\mu_X = \mathbb{E}_{X \sim P(X)}[\varphi(X)]$ in feature space.
Finite sample approximation of the embedding
For certain kernels (e.g. the Gaussian RBF kernel) the mapping from $P(X)$ to $\mu_X$ is one-to-one. The finite sample approximation is $\hat{\mu}_X = \frac{1}{m}\sum_i \varphi(x_i)$; this sample average converges to the true mean at rate $O(1/\sqrt{m})$.
Embedding distributions: mean
The mean reduces the entire distribution to a single number (a 1D feature space), so its representation power is very restricted.
Embedding distributions: mean + variance
The mean and variance reduce the entire distribution to two numbers (a 2D feature space). This is a richer representation, but still not enough.
Embedding with kernel features
Transform the distribution to an infinite dimensional vector in feature space, capturing the mean, variance, and higher order moments: a rich representation.
Estimating embedding distances
Given samples $x_1, \ldots, x_m \sim P(X)$ and $x'_1, \ldots, x'_m \sim Q(X)$, the distance between embeddings can be expressed in terms of inner products:
$$\|\mu_X - \mu_{X'}\|^2 = \langle \mu_X, \mu_X \rangle - 2\langle \mu_X, \mu_{X'} \rangle + \langle \mu_{X'}, \mu_{X'} \rangle.$$
Estimating embedding distance: finite sample estimator
Form a kernel matrix over the pooled samples; it has four blocks ($P$ vs $P$, $P$ vs $Q$, $Q$ vs $P$, $Q$ vs $Q$). Each inner product above is estimated by averaging the entries of the corresponding block:
$$\widehat{\|\mu_X - \mu_{X'}\|^2} = \frac{1}{m^2}\sum_{i,j} k(x_i, x_j) - \frac{2}{m^2}\sum_{i,j} k(x_i, x'_j) + \frac{1}{m^2}\sum_{i,j} k(x'_i, x'_j).$$
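A minimal numpy sketch of this block-averaging estimator (the biased version that keeps the diagonal terms), using a Gaussian-versus-Laplace pair with the same mean and variance as in the witness-function example below; the bandwidth and sample sizes are arbitrary:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Xp, sigma=1.0):
    # ||mu_X - mu_X'||^2 estimated by averaging the four blocks of the joint kernel matrix
    Kxx = rbf_kernel(X, X, sigma)     # P vs P block
    Kyy = rbf_kernel(Xp, Xp, sigma)   # Q vs Q block
    Kxy = rbf_kernel(X, Xp, sigma)    # P vs Q block (appears twice)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
X  = rng.normal(0, 1, (500, 1))                    # samples from P: standard Gaussian
Xp = rng.laplace(0, 1 / np.sqrt(2), (500, 1))      # samples from Q: Laplace, same mean/variance
print(mmd2(X, Xp))                                 # clearly positive: the embedding separates them
print(mmd2(X, rng.normal(0, 1, (500, 1))))         # close to zero for two samples from P
```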
Optimization view of embedding distance
The squared distance can be written as an optimization problem:
$$\|\mu_X - \mu_{X'}\|^2 = \sup_{\|w\| \le 1} \langle w, \mu_X - \mu_{X'} \rangle^2 = \sup_{\|w\| \le 1} \langle w,\ \mathbb{E}_{X \sim P}[\varphi(X)] - \mathbb{E}_{X \sim Q}[\varphi(X)] \rangle^2.$$
The witness function is $w \propto \mu_X - \mu_{X'} = \mathbb{E}_{X \sim P}[\varphi(X)] - \mathbb{E}_{X \sim Q}[\varphi(X)]$, estimated by $\frac{1}{m}\sum_i \varphi(x_i) - \frac{1}{m}\sum_i \varphi(x'_i)$.
Plotting the witness function values
$$w(x) = w^\top \varphi(x) \propto \frac{1}{m}\sum_i k(x_i, x) - \frac{1}{m}\sum_i k(x'_i, x).$$
Example: a Gaussian and a Laplace distribution with the same mean and variance (using a Gaussian RBF kernel).
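A short sketch that evaluates this empirical witness function on a grid for the Gaussian-versus-Laplace example; the bandwidth and grid are arbitrary choices, and the values are printed rather than plotted:

```python
import numpy as np

def rbf(a, b, sigma=0.5):
    # a: (g,) grid points, b: (m,) samples -> (g, m) kernel values
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x_p = rng.normal(0, 1, 500)                    # Gaussian samples
x_q = rng.laplace(0, 1 / np.sqrt(2), 500)      # Laplace samples, same mean and variance

grid = np.linspace(-4, 4, 9)
witness = rbf(grid, x_p).mean(axis=1) - rbf(grid, x_q).mean(axis=1)
for t, w in zip(grid, witness):
    print(f"{t:+.1f}  {w:+.4f}")   # negative near 0 and in the tails, positive in between
```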
Applications of the kernel distance measure
Covariate shift correction
The training and test data are not from the same distribution. We want to reweight the training points so that their weighted distribution matches that of the test points:
$$\min_{\alpha_i \ge 0,\ \sum_i \alpha_i = 1}\ \Big\| \sum_i \alpha_i \varphi(x_i) - \frac{1}{m}\sum_i \varphi(y_i) \Big\|^2,$$
where the $x_i$ are training points and the $y_i$ are test points.
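A sketch of this reweighting problem, expanding the objective into a quadratic form in the weights and handing it to scipy's general-purpose SLSQP solver; a dedicated QP solver would be the more standard choice, and the data, kernel, and bandwidth here are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_mean_matching(X_train, X_test, sigma=1.0):
    m = len(X_train)
    Kxx = rbf_kernel(X_train, X_train, sigma)
    kappa = rbf_kernel(X_train, X_test, sigma).mean(axis=1)   # (1/n) sum_j k(x_i, y_j)

    # || sum_i alpha_i phi(x_i) - (1/n) sum_j phi(y_j) ||^2
    #   = alpha^T Kxx alpha - 2 alpha^T kappa + const
    def objective(alpha):
        return alpha @ Kxx @ alpha - 2 * alpha @ kappa

    alpha0 = np.full(m, 1.0 / m)
    res = minimize(objective, alpha0, method="SLSQP",
                   bounds=[(0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    return res.x   # importance weights over the training points

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, (100, 1))   # training distribution
X_test  = rng.normal(1.0, 1.0, (100, 1))   # shifted test distribution
alpha = kernel_mean_matching(X_train, X_test)
print(alpha[:5])   # training points closer to the test distribution get larger weight
```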
Embedding joint distributions
Transform the entire joint distribution to expected features: $P(X, Y)$ maps to the cross covariance $\mu_{XY} = \mathbb{E}_{XY}[\varphi(X) \otimes \psi(Y)]$. With simple features this captures $1$, the mean of $X$, the mean of $Y$, and their covariance; richer feature maps add higher order features.
Embedding joint distributions: finite sample estimate
In feature space, the empirical embedding is a weighted sum of the feature-mapped data points: $\hat{\mu}_{XY} = \frac{1}{m}\sum_i \varphi(x_i) \otimes \psi(y_i)$. [Smola, Gretton, Song and Scholkopf, 2007]
Measuring dependence via embeddings
Use the squared distance $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2$ in feature space to measure the dependence between $X$ and $Y$. [Smola, Gretton, Song and Scholkopf, 2007]
The dependence measure is useful for dimensionality reduction, clustering, and matching.
Estimating the dependence measure
Given samples $(x_1, y_1), \ldots, (x_m, y_m) \sim P(X, Y)$, the dependence measure can be expressed in terms of inner products:
$$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|\mathbb{E}_{XY}[\varphi(X) \otimes \psi(Y)] - \mathbb{E}_X[\varphi(X)] \otimes \mathbb{E}_Y[\psi(Y)]\|^2$$
$$= \langle \mu_{XY}, \mu_{XY} \rangle - 2\langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle.$$
This reduces to a kernel matrix operation: with $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ and the $X$ and $Y$ data ordered in the same way, the empirical estimate is $\frac{1}{m^2}\,\mathrm{trace}(K_x H K_y H)$, where $K_x(i,j) = k(x_i, x_j)$ and $K_y(i,j) = k(y_i, y_j)$.
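A minimal numpy sketch of this trace estimator (often called HSIC), with Gaussian RBF kernels on both variables and an arbitrary bandwidth; the example contrasts a nonlinearly dependent pair with an independent pair:

```python
import numpy as np

def rbf_kernel(A, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    m = len(X)
    Kx, Ky = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix H = I - (1/m) 1 1^T
    return np.trace(Kx @ H @ Ky @ H) / m**2      # biased empirical dependence measure

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (300, 1))
Y_dep   = X**2 + 0.05 * rng.standard_normal((300, 1))   # nonlinearly dependent on X
Y_indep = rng.uniform(-1, 1, (300, 1))                   # independent of X
print(hsic(X, Y_dep), hsic(X, Y_indep))   # the dependent pair scores clearly higher
```

Note that $X$ and $X^2$ are essentially uncorrelated on $[-1, 1]$, so this is a case where a nonlinear dependence measure detects structure that a correlation coefficient would miss.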
Optimization view of the dependence measure
$$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \sup_{\|w\| \le 1} \langle w,\ \mu_{XY} - \mu_X \otimes \mu_Y \rangle^2.$$
The witness function is $w(x, y) = w^\top (\varphi(x) \otimes \psi(y))$. Example: a distribution with two stripes versus the uniform distribution over $[-1,1] \times [-1,1]$.
Applications of the kernel dependence measure
Applications of the dependence measure
Independent component analysis: transform the time series so that the resulting signals are as independent as possible (minimize kernel dependence).
Feature selection: choose a set of features such that its dependence with the labels is as large as possible (maximize kernel dependence); see the sketch below.
Clustering: generate labels for each data point such that the dependence between the labels and the data is maximized (maximize kernel dependence).
Supervised dimensionality reduction: reduce the dimension of the data such that its dependence with side information is maximized (maximize kernel dependence).
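As an illustration of the feature-selection use, here is a hedged sketch of a greedy forward-selection scheme that repeatedly adds the feature maximizing the (biased) empirical dependence with the labels; this is only one simple variant of dependence-based feature selection, and the data, kernels, and bandwidth are arbitrary:

```python
import numpy as np

def rbf(A, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf(X, sigma) @ H @ rbf(Y, sigma) @ H) / m**2

def select_features(X, y, k, sigma=1.0):
    # greedy forward selection: grow the feature set that maximizes dependence with the labels
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k:
        scores = [hsic(X[:, chosen + [j]], y, sigma) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 5))
y = np.sign(X[:, [2]]) + 0.1 * rng.standard_normal((n, 1))  # labels depend only on feature 2
print(select_features(X, y, 2))   # feature index 2 should come out first
```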
PCA vs. supervised dimensionality reduction on the 20 Newsgroups data
Supervised dimensionality reduction: 10 years of NIPS papers (text + coauthor networks)
Visual map of LabelMe images
Imposing structures on image collections
Layout (sort/organize) images according to high dimensional image features (color, texture, SIFT, and composition descriptors) while maximizing the dependence with an external structure, so that adjacent points on the grid are similar.
Comparison to other methods
Other layout algorithms do not have exact control over what structure to impose.
Kernel embedding method [Quadrianto, Song and Smola, 2009]; Generative Topographic Map (GTM) [Bishop et al., 1998]; Self-Organizing Map (SOM) [Kohonen, 1990].