Kernel methods for comparing distributions, measuring dependence


Kernel methods for comparing distributions, measuring dependence. Le Song. Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012.

Principal component analysis. Given a set of $M$ centered observations $x_k \in \mathbb{R}^d$, collected as $X = (x_1, x_2, \ldots, x_M)$, PCA finds the direction that maximizes the variance:
$$w^\ast = \operatorname*{argmax}_{\|w\| = 1} \frac{1}{M} \sum_k (w^\top x_k)^2 = \operatorname*{argmax}_{\|w\| = 1} \frac{1}{M} w^\top X X^\top w$$
With $C = \frac{1}{M} X X^\top$, $w$ can be found by solving the eigenvalue problem $Cw = \lambda w$.
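As a concrete illustration, here is a minimal numpy sketch of this eigendecomposition view of PCA; the toy data matrix and its dimensions are assumptions made up for the example.

```python
import numpy as np

# Toy data: d = 5 dimensions, M = 100 observations as columns of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))
X = X - X.mean(axis=1, keepdims=True)   # center the observations

M = X.shape[1]
C = X @ X.T / M                         # C = (1/M) X X^T

# Solve the eigenvalue problem C w = lambda w and take the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns eigenvalues in ascending order
w = eigvecs[:, -1]                      # direction of maximum variance
projections = w @ X                     # 1-D projections of all observations
```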

Alternative expression for PCA. The principal component lies in the span of the data: $w = \sum_k \alpha_k x_k = X\alpha$. Plugging this in gives $Cw = \frac{1}{M} X X^\top X \alpha = \lambda X \alpha$. Furthermore, for each data point $x_k$ the relation $x_k^\top C w = \frac{1}{M} x_k^\top X X^\top X \alpha = \lambda\, x_k^\top X \alpha$ holds. In matrix form, $\frac{1}{M} X^\top X X^\top X \alpha = \lambda\, X^\top X \alpha$, which depends only on the inner product matrix $X^\top X$.

Kernel PCA. Key idea: replace the inner product matrix by the kernel matrix. PCA: $\frac{1}{M} X^\top X X^\top X \alpha = \lambda\, X^\top X \alpha$. Map $x_k \mapsto \phi(x_k)$, let $\Phi = (\phi(x_1), \ldots, \phi(x_M))$ and $K = \Phi^\top \Phi$; the nonlinear component is $w = \Phi\alpha$. Kernel PCA: $\frac{1}{M} K K \alpha = \lambda K \alpha$, equivalent to $\frac{1}{M} K \alpha = \lambda \alpha$. First form the $M \times M$ kernel matrix $K$, then perform an eigendecomposition of $K$.
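A minimal sketch of kernel PCA along these lines, under assumptions not on the slide: a Gaussian RBF kernel with an arbitrary bandwidth, toy data, and explicit centering of the kernel matrix (the slides take the feature-mapped data to be centered already).

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); rows of X and Y are points.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # M = 100 points in 2-D

M = X.shape[0]
K = rbf_kernel(X, X)                             # M x M kernel matrix

# Center the kernel matrix in feature space (assumption, see lead-in).
H = np.eye(M) - np.ones((M, M)) / M
Kc = H @ K @ H

# Eigendecomposition of K: (1/M) K alpha = lambda alpha, up to scaling.
eigvals, alphas = np.linalg.eigh(Kc)
alpha = alphas[:, -1]                            # coefficients of the top component
alpha = alpha / np.sqrt(eigvals[-1])             # so that w = Phi alpha has unit norm

# Evaluate the component at a test point x: w^T phi(x) = sum_k alpha_k k(x_k, x).
x_test = np.array([[0.5, -0.2]])
score = rbf_kernel(X, x_test)[:, 0] @ alpha
```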

Kernel PCA example. Gaussian RBF kernel $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$ over a 2-dimensional space. The eigenvector evaluated at a test point $x$ is a function: $w^\top \phi(x) = \sum_k \alpha_k\, k(x_k, x)$.

Spectral clustering

Spectral clustering:
- Form the kernel matrix $K$ with a Gaussian RBF kernel.
- Treat $K$ as the adjacency matrix of a graph (set the diagonal of $K$ to 0).
- Construct the normalized graph Laplacian $L = D^{-1/2} K D^{-1/2}$, where $D = \mathrm{diag}(K\mathbf{1})$.
- Compute the top $k$ eigenvectors $V = (v_1, v_2, \ldots, v_k)$ of $L$.
- Use $V$ as the input to k-means for clustering.
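A minimal numpy/scipy sketch of this recipe; the RBF bandwidth, the number of clusters, and the toy data are assumptions for the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def rbf_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Two well-separated toy clusters in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])

K = rbf_kernel(X)
np.fill_diagonal(K, 0.0)                     # use K as a graph adjacency matrix

d = K.sum(axis=1)                            # D = diag(K 1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = D_inv_sqrt @ K @ D_inv_sqrt              # L = D^{-1/2} K D^{-1/2}

k = 2
eigvals, eigvecs = np.linalg.eigh(L)
V = eigvecs[:, -k:]                          # top-k eigenvectors as columns

_, labels = kmeans2(V, k, minit='++', seed=0)   # k-means on the rows of V
```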

Canonical correlation analysis

Canonical correlation analysis. Given paired observations $\{(x_i, y_i)\}_{i=1}^m$, estimate two basis vectors $w_x$ and $w_y$ such that the correlation between the projections $w_x^\top x$ and $w_y^\top y$ is maximized.

CCA derivation II. Define the covariance matrices of $x$ and $y$: $C_{xx} = \mathrm{cov}(x, x)$, $C_{yy} = \mathrm{cov}(y, y)$, $C_{xy} = \mathrm{cov}(x, y)$. The optimization problem is equal to
$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}$$
We can require the normalization $w_x^\top C_{xx} w_x = 1$, $w_y^\top C_{yy} w_y = 1$, and just maximize the numerator $w_x^\top C_{xy} w_y$.

CCA as a generalized eigenvalue problem. The optimality conditions are
$$C_{xy} w_y = \lambda\, C_{xx} w_x, \qquad C_{yx} w_x = \lambda\, C_{yy} w_y$$
Putting these conditions into matrix form gives a generalized eigenvalue problem $Aw = \lambda B w$:
$$\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}$$
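A small numpy/scipy sketch of solving this generalized eigenvalue problem directly; the toy paired data and the small ridge term added to $B$ for numerical stability are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

# Toy paired data: a shared latent signal plus noise dimensions.
rng = np.random.default_rng(0)
m = 200
z = rng.normal(size=(m, 1))
X = np.hstack([z, rng.normal(size=(m, 2))]).T    # d_x = 3, columns are samples
Y = np.hstack([-z, rng.normal(size=(m, 2))]).T   # d_y = 3
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)

Cxx, Cyy, Cxy = X @ X.T / m, Y @ Y.T / m, X @ Y.T / m
dx, dy = X.shape[0], Y.shape[0]

A = np.block([[np.zeros((dx, dx)), Cxy],
              [Cxy.T,              np.zeros((dy, dy))]])
B = np.block([[Cxx,                np.zeros((dx, dy))],
              [np.zeros((dy, dx)), Cyy]])
B += 1e-8 * np.eye(dx + dy)        # small ridge so B is strictly positive definite

# Generalized eigenvalue problem A w = lambda B w; the top eigenvector stacks
# (w_x, w_y) and the top eigenvalue is the canonical correlation.
eigvals, eigvecs = eigh(A, B)
w_x, w_y = eigvecs[:dx, -1], eigvecs[dx:, -1]
rho = eigvals[-1]
```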

CCA in inner product format. Similar to PCA, the directions of projection lie in the span of the data: $X = (x_1, \ldots, x_m)$, $Y = (y_1, \ldots, y_m)$, $w_x = X\alpha$, $w_y = Y\beta$, with $C_{xy} = \frac{1}{m} X Y^\top$, $C_{xx} = \frac{1}{m} X X^\top$, $C_{yy} = \frac{1}{m} Y Y^\top$. Plugging $w_x = X\alpha$, $w_y = Y\beta$ into the earlier objective gives
$$\max_{\alpha, \beta} \frac{\alpha^\top X^\top X\, Y^\top Y\, \beta}{\sqrt{\alpha^\top X^\top X\, X^\top X\, \alpha}\,\sqrt{\beta^\top Y^\top Y\, Y^\top Y\, \beta}}$$
The data appear only through inner products.

Kernel CCA. Replace the inner product matrices by kernel matrices, where $K_x$ is the kernel matrix for data $X$ with entries $K_x(i,j) = k(x_i, x_j)$ (and similarly $K_y$):
$$\max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top K_x K_x \alpha}\,\sqrt{\beta^\top K_y K_y \beta}}$$
Solve the generalized eigenvalue problem
$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$
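A sketch of kernel CCA under assumptions beyond the slide: the unregularized problem as stated is degenerate (it attains perfect correlation), so the sketch adds the usual ridge regularization, replacing $K_x K_x$ with $(K_x + \kappa I)^2$ (and likewise for $K_y$); the toy data, bandwidth, and $\kappa$ are also made up.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Toy data: Y is a nonlinear (angular) function of X, plus noise.
rng = np.random.default_rng(0)
m = 100
t = rng.uniform(0.0, 2 * np.pi, size=(m, 1))
X = np.hstack([np.cos(t), np.sin(t)]) + 0.05 * rng.normal(size=(m, 2))
Y = t + 0.05 * rng.normal(size=(m, 1))

H = np.eye(m) - np.ones((m, m)) / m     # center in feature space
Kx = H @ rbf_kernel(X) @ H
Ky = H @ rbf_kernel(Y) @ H

kappa = 1e-2                             # regularization (assumption, see lead-in)
Rx = Kx + kappa * np.eye(m)
Ry = Ky + kappa * np.eye(m)

A = np.block([[np.zeros((m, m)), Kx @ Ky],
              [Ky @ Kx,          np.zeros((m, m))]])
B = np.block([[Rx @ Rx,          np.zeros((m, m))],
              [np.zeros((m, m)), Ry @ Ry]])

eigvals, eigvecs = eigh(A, B)            # generalized eigenvalue problem
alpha, beta = eigvecs[:m, -1], eigvecs[m:, -1]
rho = eigvals[-1]                        # regularized kernel canonical correlation
```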

Comparing two distributions. For two Gaussian distributions $P(X)$ and $Q(X)$ with unit variance, we can simply test $H_0: \mu_1 = \mu_2$. For general distributions we test $H_0: P(X) = Q(X)$, and one option is the KL-divergence
$$KL(P \,\|\, Q) = \int_X P(X) \log\frac{P(X)}{Q(X)}\, dX$$
Given samples $x_1, \ldots, x_m \sim P(X)$ and $x'_1, \ldots, x'_n \sim Q(X)$, the mean is easy to estimate, $\hat\mu_1 = \frac{1}{m}\sum_i x_i$, but estimating
$$\int_X P(X) \log\frac{P(X)}{Q(X)}\, dX \approx \frac{1}{m}\sum_i \log\frac{P(x_i)}{Q(x_i)}$$
requires estimating the density functions first.

Embedding distributions into feature space. Summary statistics for distributions: mean, covariance, expected features. Picking a different kernel generates a different summary statistic.

Pictorial view of embedding a distribution. Transform the entire distribution to expected features. [Figure: the feature map $\phi$ sends the distribution $P(X)$ to a point $\mu_X$ in the feature space.]

Finite sample approximation of the embedding. The mapping from a distribution $P(X)$ to its embedding $\mu_X = E_X[\phi(X)]$ is one-to-one for certain kernels (e.g. the Gaussian RBF kernel). The sample average $\hat\mu_X = \frac{1}{m}\sum_i \phi(x_i)$ converges to the true mean embedding at rate $O(1/\sqrt{m})$.

Embedding distributions: mean. The mean reduces the entire distribution to a single number, so the representation power is very restricted (a 1D feature space).

Embedding distributions: mean + variance. Mean and variance reduce the entire distribution to two numbers (a 2D feature space): a richer representation, but still not enough.

Embedding with kernel features. Transform the distribution to an infinite-dimensional feature vector capturing the mean, variance, and higher-order moments: a rich representation in the feature space.

Estimating embedding distances. Given samples $x_1, \ldots, x_m \sim P(X)$ and $x'_1, \ldots, x'_m \sim Q(X)$, the distance between the embeddings can be expressed in terms of inner products:
$$\|\mu_X - \mu_{X'}\|^2 = \langle \mu_X, \mu_X \rangle - 2\langle \mu_X, \mu_{X'} \rangle + \langle \mu_{X'}, \mu_{X'} \rangle$$

Estimating the embedding distance: finite sample estimator. Form a kernel matrix over the combined sample, which has 4 blocks ($K_{XX}$, $K_{XX'}$, $K_{X'X}$, $K_{X'X'}$), and average each block:
$$\|\hat\mu_X - \hat\mu_{X'}\|^2 = \frac{1}{m^2}\sum_{i,j} k(x_i, x_j) - \frac{2}{m^2}\sum_{i,j} k(x_i, x'_j) + \frac{1}{m^2}\sum_{i,j} k(x'_i, x'_j)$$
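A minimal sketch of this block-averaging estimator (the biased estimator of the squared embedding distance, i.e. MMD²); the toy samples and the kernel bandwidth are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def embedding_distance2(X, Xp, sigma=1.0):
    # ||mu_X - mu_X'||^2 estimated by averaging the four kernel blocks.
    Kxx = rbf_kernel(X, X, sigma)
    Kpp = rbf_kernel(Xp, Xp, sigma)
    Kxp = rbf_kernel(X, Xp, sigma)
    return Kxx.mean() + Kpp.mean() - 2.0 * Kxp.mean()

rng = np.random.default_rng(0)
P1 = rng.normal(0.0, 1.0, size=(500, 1))                 # samples from P
P2 = rng.normal(0.0, 1.0, size=(500, 1))                 # more samples from P
Q = rng.laplace(0.0, 1.0 / np.sqrt(2), size=(500, 1))    # same mean and variance as P

print(embedding_distance2(P1, P2))   # close to 0: same distribution
print(embedding_distance2(P1, Q))    # noticeably larger: different distributions
```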

Optimization view of the embedding distance. The distance can be written as an optimization problem:
$$\|\mu_X - \mu_{X'}\|^2 = \sup_{\|w\| \le 1} \langle w, \mu_X - \mu_{X'} \rangle^2 = \sup_{\|w\| \le 1} \langle w, E_{X \sim P}[\phi(X)] - E_{X \sim Q}[\phi(X)] \rangle^2$$
The witness function (the maximizing direction, up to normalization) is
$$w^\ast = E_{X \sim P}[\phi(X)] - E_{X \sim Q}[\phi(X)] \approx \frac{1}{m}\sum_i \phi(x_i) - \frac{1}{m}\sum_i \phi(x'_i) = \hat\mu_X - \hat\mu_{X'}$$

Plot the witness function values
$$w^\ast(x) = \langle w^\ast, \phi(x)\rangle = \frac{1}{m}\sum_i k(x_i, x) - \frac{1}{m}\sum_i k(x'_i, x)$$
Example: a Gaussian and a Laplace distribution with the same mean and variance (using a Gaussian RBF kernel).
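A sketch that evaluates this empirical witness function on a grid for the Gaussian-vs-Laplace example; the sample sizes, bandwidth, and grid are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(2000, 1))                  # P: Gaussian
xps = rng.laplace(0.0, 1.0 / np.sqrt(2), size=(2000, 1))   # Q: Laplace, same mean/variance

grid = np.linspace(-4.0, 4.0, 200).reshape(-1, 1)

# w*(x) = (1/m) sum_i k(x_i, x) - (1/m) sum_i k(x'_i, x)
witness = rbf_kernel(xs, grid).mean(axis=0) - rbf_kernel(xps, grid).mean(axis=0)

# The witness is negative near the center and in the far tails (where the Laplace
# has more mass) and positive on the shoulders (where the Gaussian has more mass).
```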

Applications of the kernel distance measure

Covariate shift correction. Training and test data are not from the same distribution; we want to reweight the training points $x_i$ so that their weighted distribution matches that of the test points $y_i$:
$$\min_{\alpha \ge 0,\ \|\alpha\|_1 = 1} \Big\| \sum_i \alpha_i \phi(x_i) - \frac{1}{m}\sum_i \phi(y_i) \Big\|^2$$
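A hedged sketch of this reweighting objective, solved here with scipy's general-purpose SLSQP optimizer for brevity (in practice a dedicated quadratic programming solver would be used); the shifted toy data and bandwidth are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(80, 1))   # training inputs
X_test = rng.normal(1.0, 1.0, size=(80, 1))    # test inputs from a shifted distribution

n = len(X_train)
Kxx = rbf_kernel(X_train, X_train)
b = rbf_kernel(X_train, X_test).mean(axis=1)   # b_i = (1/m) sum_j k(x_i, y_j)

# || sum_i a_i phi(x_i) - (1/m) sum_j phi(y_j) ||^2 = a^T Kxx a - 2 a^T b + const
def objective(a):
    return a @ Kxx @ a - 2.0 * a @ b

constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]  # weights sum to 1
bounds = [(0.0, 1.0)] * n                                        # weights >= 0

res = minimize(objective, np.full(n, 1.0 / n), method="SLSQP",
               bounds=bounds, constraints=constraints)
weights = res.x   # importance weights for the training points
```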

Embedding joint distributions. Transform the entire joint distribution to expected features of pairs, $\mu_{XY} = E_{XY}[\phi(X) \otimes \psi(Y)]$. Depending on the chosen features, $P(X, Y)$ maps to:
- the cross-covariance (Cov.),
- (1, X mean, Y mean, Cov.),
- (1, X mean, Y mean, Cov., higher-order features).

Embedding a joint distribution: finite sample. $\hat\mu_{XY} = \frac{1}{m}\sum_i \phi(x_i) \otimes \psi(y_i)$, a weighted combination of the feature-mapped data points in the feature space. [Smola, Gretton, Song and Scholkopf, 2007]

Measuring dependence via embeddings. Use the squared distance $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2$ between the embedding of the joint distribution and the product of the marginal embeddings to measure the dependence between X and Y. [Smola, Gretton, Song and Scholkopf, 2007] The dependence measure is useful for dimensionality reduction, clustering, and matching.

Estimating the dependence measure. Given samples $(x_1, y_1), \ldots, (x_m, y_m) \sim P(X, Y)$, the dependence measure can be expressed as inner products:
$$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|E_{XY}[\phi(X) \otimes \psi(Y)] - E_X[\phi(X)] \otimes E_Y[\psi(Y)]\|^2 = \langle \mu_{XY}, \mu_{XY} \rangle - 2\langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle$$
In terms of kernel matrices, with the centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ and the X and Y data ordered in the same way, this becomes
$$\mathrm{trace}(H K_x H K_y), \qquad K_x(i,j) = k(x_i, x_j),\quad K_y(i,j) = k(y_i, y_j)$$
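A minimal sketch of this kernel-matrix form of the dependence measure (HSIC); the toy data, bandwidth, and the $1/m^2$ normalization are assumptions for the example.

```python
import numpy as np

def rbf_kernel(Z, sigma=1.0):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_dependence(X, Y, sigma=1.0):
    # trace(H Kx H Ky), normalized by m^2 (the usual HSIC estimator).
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    Kx, Ky = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    return np.trace(H @ Kx @ H @ Ky) / m ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
Y_dep = X ** 2 + 0.05 * rng.normal(size=(200, 1))   # nonlinearly dependent on X
Y_ind = rng.uniform(-1.0, 1.0, size=(200, 1))       # independent of X

print(kernel_dependence(X, Y_dep))   # clearly larger
print(kernel_dependence(X, Y_ind))   # close to 0
```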

Optimization view of the dependence measure.
$$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \sup_{\|w\| \le 1} \langle w, \mu_{XY} - \mu_X \otimes \mu_Y \rangle^2$$
Witness function: $w^\ast(x, y) = \langle w^\ast, \phi(x) \otimes \psi(y) \rangle$. Example: a distribution with two stripes vs. the uniform distribution over $[-1, 1] \times [-1, 1]$.

Applications of the kernel dependence measure

Applications of the dependence measure:
- Independent component analysis: transform the time series so that the resulting signals are as independent as possible (minimize kernel dependence).
- Feature selection: choose a set of features whose dependence with the labels is as large as possible (maximize kernel dependence).
- Clustering: generate labels for each data point such that the dependence between the labels and the data is maximized (maximize kernel dependence).
- Supervised dimensionality reduction: reduce the dimension of the data such that its dependence with side information is maximized (maximize kernel dependence).

PCA vs. supervised dimensionality reduction: 20 Newsgroups dataset.

Supervised dimensionality reduction: 10 years of NIPS papers (text + coauthor networks).

Visual map of LabelMe images

Imposing structure on image collections. Lay out (sort/organize) images according to high-dimensional image features (color, texture, SIFT, and composition descriptors) while maximizing the dependence with an external structure, so that adjacent points on the grid are similar.

Comparison to other methods. Other layout algorithms do not give exact control over what structure to impose:
- Kernel embedding method [Quadrianto, Song and Smola, 2009]
- Generative Topographic Map (GTM) [Bishop et al., 1998]
- Self-Organizing Map (SOM) [Kohonen, 1990]

