MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA


1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

2 Practicals Next Week Next week, the practical session on computer takes place in Room GR C0 02!

3 Why reducing the data dimensionality? Reducing the dimensionality of the dataset at hand so that computation afterwards is more tractable. Idea: only a few of the dimensions matter; the projections of the data along the remaining dimensions do not contain informative structure of the data (already a form of generalization).

4 Why reducing the data dimensionality? The curse of dimensionality refers to the exponential growth of the volume covered by the parameter values to be tested as the dimensionality increases. In ML, analyzing high dimensional data is made particularly difficult because: often one does not have enough observations to get good estimates (i.e. to sample the parameter space sufficiently); adding more dimensions (hence more features) can increase the noise, and hence the error.

5 Principal Component Analysis Principal Component Analysis is a method widely used in Engineering and Science. Its principle is based on a statistical analysis of the correlations underpinning the dataset at hand. Its popularity is due to the fact that: its computation is simple and tractable, with an analytical solution; its result can be easily visualized, usually through a 2- or 3-dimensional graph.

6 Co-Variance, Correlation The covariance and correlation are measures of the dependency between two variables. Given two variables x and y (assuming that x and y are both zero mean): cov(x, y) = E[xy] - E[x]E[y], corr(x, y) = cov(x, y) / sqrt(var(x) var(y)). x and y are said to be uncorrelated if their covariance is zero: corr(x, y) = 0 and cov(x, y) = 0.

7 Co-Variance Matrix If X = {x_j^i}, i = 1...M, j = 1...N, is a multidimensional dataset containing M N-dimensional datapoints, the covariance matrix C of X is given by: C = E[X X^T] = [ cov(X_1, X_1) ... cov(X_1, X_N) ; ... ; cov(X_N, X_1) ... cov(X_N, X_N) ]. C is diagonal when the data X are decorrelated along each dimension. The rows X_j, j = 1...N, represent the coordinates of the datapoints with respect to the j-th basis vector; the columns of X contain the M datapoints. C = E[X X^T] ~ X X^T, since the expectation is only a normalization factor.
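
A minimal numpy sketch of these quantities, following the slide's convention that the M datapoints are stored as the columns of X; the data and variable names here are illustrative, not part of the course material.

import numpy as np

# X holds M datapoints as columns, each of dimension N (illustrative random data).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))          # N = 3, M = 100

Xc = X - X.mean(axis=1, keepdims=True)     # remove the mean of each dimension
C = (Xc @ Xc.T) / Xc.shape[1]              # covariance matrix, N x N

# Correlation of two dimensions: covariance normalized by the standard deviations.
corr_01 = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])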

8 Purpose of PCA Goal: to find a better representation of the dataset at hand so as to simplify computation afterwards. Assumes a linear transformation. Assumes maximal variance is a criterion. [Figure: raw 2D dataset and its projection onto the first two principal components.]

9 PCA: principle PRINCIPLE: Define a low dimensional manifold in the original space. Represent each data point X by its projection Y onto this manifold. FORMALISM: Consider a data set of M N-dimensional data points X = {x^i}, i = 1,...,M, with x^i in R^N. PCA aims at finding a linear map A: R^N -> R^q, q <= N, such that Y = AX, with Y = {y^1,...,y^M} and each y^i in R^q.

10 PCA: principle There are three equivalent methods for performing PCA: 1. Maximize the variance of the projection (Hotelling 1933); in other words, this method tries to maximize the spread of the projected data. 2. Minimize the reconstruction error (Pearson 1901), i.e. minimize the squared distance between the original data and its estimate in a low dimensional manifold. 3. Maximum likelihood estimation of the parameters of a latent variable model (Tipping and Bishop 1996).

11 Standard PCA: Variance Maximization through Eigenvalue Decomposition Algorithm: 1. Determine the direction (vector) along which the variance of the data is maximal. 2. Determine an orthonormal basis composed of the directions obtained by iterating step 1. The projections of the data onto these axes are uncorrelated.

12 Standard PCA: Variance Maximization through Eigenvalue Decomposition Algorithm: 1) Zero mean: X' = X - E[X]. 2) Compute the covariance matrix: C = E[X' X'^T]. 3) Compute the eigenvalues lambda_i, i = 1...N, from det(C - lambda I) = 0. 4) Compute the eigenvectors from C e^i = lambda_i e^i. 5) Choose the first q <= N eigenvectors e^1,...,e^q, with lambda_1 >= lambda_2 >= ... >= lambda_q. 6) Project the data onto the new basis: Y = A_q X', with A_q = [e^1 ... e^q]^T, the q x N matrix whose rows are the chosen eigenvectors.
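
The six steps above can be sketched in a few lines of numpy; this is a hedged illustration, not the course's reference implementation (the function name is an assumption, and datapoints are stored as columns as on the previous slides).

import numpy as np

def pca_eig(X, q):
    """Standard PCA by eigendecomposition; X has M datapoints as columns (N x M)."""
    Xc = X - X.mean(axis=1, keepdims=True)        # 1) zero mean
    C = (Xc @ Xc.T) / Xc.shape[1]                 # 2) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # 3)-4) eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
    A_q = eigvecs[:, order[:q]].T                 # 5) first q eigenvectors as rows
    Y = A_q @ Xc                                  # 6) project onto the new basis
    return Y, A_q, eigvals[order]

For example, Y, A_q, lam = pca_eig(X, 2) gives the projection of the (centered) data onto the first two principal components.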

13 Standard PCA: Example Demo: PCA for Face Classification. Two classes with 20 and 16 examples in each class; separating line shown on the projection of the image datapoints onto the first and second PCs. By projecting a set of images of two classes (two different persons) onto the first two principal components, one can extract features particular to each class, which can then be used for classification.

14 Principal Component Analysis LIMITATIONS OF STANDARD and MSQ (mean-square reconstruction error) PCA: The variance-covariance matrix needs to be calculated: this can be very computation-intensive for large datasets with a high number of dimensions. It does not deal properly with missing data: incomplete data must either be discarded or imputed using ad-hoc methods. Outliers can unduly affect the analysis. Probabilistic PCA addresses some of the above limitations.

15 Probabilistic PCA The data X = {x^i}, i = 1...M, are samples of the distribution of the random variable x. x is generated by the latent variable z following: x = W z + mu + eps. The latent variable z corresponds to the unobserved variable. It is a lower dimensional representation of the data and their dependencies. In Probabilistic PCA, the latent variable model consists then of: x: observed variables (dimension N); z: latent variables (dimension q), with q < N. Fewer dimensions result in more parsimonious models.

16 Probabilistic PCA The data X = {x^i}, i = 1...M, are samples of the distribution of the random variable x. x is generated by the latent variable z following: x = W z + mu + eps. Assumptions: - The latent variable z is centered and white, i.e. z ~ N(0, I). - mu is a parameter (usually the mean of the data x). - The noise eps follows a zero-mean Gaussian distribution with diagonal covariance. - W is an N x q matrix. The variance of the noise being diagonal implies conditional independence of the observables given the latent variables; z encapsulates all correlations across the original dimensions.
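
To make the generative model concrete, here is a small sketch that samples data from x = W z + mu + eps under the stated assumptions (with isotropic noise, as introduced two slides below); W, mu, sigma2 and all dimensions are arbitrary illustrative choices, not values from the course.

import numpy as np

rng = np.random.default_rng(0)
N, q, M = 5, 2, 1000                      # observed dim., latent dim., number of samples
W = rng.standard_normal((N, q))           # illustrative loading matrix (N x q)
mu = rng.standard_normal(N)               # illustrative mean parameter
sigma2 = 0.1                              # isotropic noise variance

Z = rng.standard_normal((q, M))                        # z ~ N(0, I)
E = np.sqrt(sigma2) * rng.standard_normal((N, M))      # eps ~ N(0, sigma^2 I)
X = W @ Z + mu[:, None] + E                            # x = W z + mu + eps

# For large M the empirical covariance approaches the model covariance W W^T + sigma^2 I.
C_emp = np.cov(X)
C_model = W @ W.T + sigma2 * np.eye(N)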

17 Probabilistic PCA [Figure: the Gaussian prior p(z) over the latent variable z.] Assumptions (as on the previous slide): the latent variable z is centered and white, i.e. z ~ N(0, I); mu is a parameter (usually the mean of the data x); the noise eps follows a zero-mean Gaussian distribution with diagonal covariance; W is an N x q matrix. The variance of the noise being diagonal implies conditional independence of the observables given the latent variables; z encapsulates all correlations across the original dimensions.

18 Probabilistic PCA [Figure: the prior p(z) in latent space and the conditional p(x | z_1) in the observed space (x_1, x_2), centered at W z_1 + mu.] Assuming further an isotropic Gaussian noise model eps ~ N(0, sigma^2 I), the conditional probability distribution of the observables given the latent variables, p(x | z), is given by: p(x | z) = N(W z + mu, sigma^2 I).

19 Probabilistic PCA [Figure: the prior p(z), the conditional p(x | z_1) and the marginal p(x) in the observed space (x_1, x_2).] The axes of the ellipse correspond to the columns of W, i.e. to the eigenvectors of the covariance matrix X X^T. The marginal distribution can be computed by integrating out the latent variable and is then: p(x) = N(mu, W W^T + sigma^2 I). W, mu and sigma^2 are open parameters; they can be learned through maximum likelihood.

20 Probabilistic PCA through Maximum Likelihood If we set B = W W^T + sigma^2 I, one can then compute the log-likelihood of the complete set of M datapoints X = {x^1,...,x^M}: ln L(B, mu) = -(M/2) [ N ln(2 pi) + ln |B| + tr(B^{-1} C) ], where C is the sample covariance matrix of the dataset. The maximum likelihood estimate of mu is the mean of the dataset X. The parameters W and sigma^2 are estimated through E-M. See the lecture notes for the resulting estimates and the exercises for the derivation.
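
As a sketch, the log-likelihood above can be evaluated directly in numpy (datapoints stored as columns; the function name is an assumption for illustration).

import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """Log-likelihood of the data under the PPCA model; X is N x M (datapoints as columns)."""
    N, M = X.shape
    Xc = X - mu[:, None]
    C = (Xc @ Xc.T) / M                           # sample covariance of the centred data
    B = W @ W.T + sigma2 * np.eye(N)              # model covariance B = W W^T + sigma^2 I
    _, logdetB = np.linalg.slogdet(B)
    trace_term = np.trace(np.linalg.solve(B, C))
    return -0.5 * M * (N * np.log(2 * np.pi) + logdetB + trace_term)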

21 Probabilistic PCA through Maximum Likelihood The use of E-M to estimate the parameters of PPCA offers a natural approach to the estimation of the principal axes when some of the data vectors X exhibit one or more missing values. Exploit the E-M approach to estimate the latent variables: - Compute the complete likelihood of the dataset (the complete dataset is X and Z): log p(X, Z | mu, W, sigma^2), treating the latent variables Z (and any missing entries of X) as the missing data. - E-step: compute the expectation of the complete log-likelihood using the estimate of p(z | x) and the current parameters. - M-step: maximize with respect to the parameters W, sigma^2. Iterate until the likelihood no longer increases. Allows on-line learning (incremental updates).
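
A minimal EM sketch for PPCA along these lines (standard Tipping-and-Bishop-style E and M updates); the initialization, iteration count and names are assumptions, and the handling of missing entries of X is not shown.

import numpy as np

def ppca_em(X, q, n_iter=100, seed=0):
    """A minimal EM sketch for PPCA; X is N x M (datapoints as columns)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mu = X.mean(axis=1, keepdims=True)            # ML estimate of the mean
    Xc = X - mu
    W = rng.standard_normal((N, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables, using B = W^T W + sigma^2 I
        B = W.T @ W + sigma2 * np.eye(q)
        Binv = np.linalg.inv(B)
        Ez = Binv @ W.T @ Xc                      # q x M, posterior means E[z_n]
        EzzT = M * sigma2 * Binv + Ez @ Ez.T      # sum over n of E[z_n z_n^T]
        # M-step: update W and sigma^2
        W_new = (Xc @ Ez.T) @ np.linalg.inv(EzzT)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (W_new.T @ Xc))
                  + np.trace(EzzT @ W_new.T @ W_new)) / (M * N)
        W = W_new
    return W, mu.ravel(), sigma2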

23 Probabilistic PCA [Figure: the posterior p(z | x) in latent space and the marginal p(x) in the observed space (x_1, x_2); the posterior mean is a linear function of (x - mu).] The conditional distribution of the latent variable given the data is: p(z | x) = N(B^{-1} W^T (x - mu), sigma^2 B^{-1}), with B = W^T W + sigma^2 I. It is again Gaussian! The axes of the ellipse correspond to the columns of W, i.e. to the eigenvectors of the covariance matrix X X^T. In the absence of noise (sigma^2 -> 0), one recovers standard PCA, as (W^T W)^{-1} W^T (x - mu) is an orthogonal projection of x onto the latent space.

24 Probabilistic PCA: Dimensionality Reduction Reduction of the dimensionality is obtained by looking at the latent variable and estimating its distribution. Reduce dimensionality by projecting onto a subset of q dimensions: p(z | x) = N(B_q^{-1} W_q^T (x - mu), sigma^2 B_q^{-1}), with B_q = W_q^T W_q + sigma^2 I. In the absence of noise (sigma^2 -> 0), one recovers standard PCA, as (W^T W)^{-1} W^T (x - mu) is an orthogonal projection of x onto the latent space.
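
A sketch of this dimensionality reduction step, mapping each datapoint to the posterior mean of its latent variable; names and shapes are assumptions (datapoints as columns), and W, mu, sigma2 would come from the ML/EM estimation above.

import numpy as np

def ppca_project(X, W, mu, sigma2):
    """Posterior-mean projection of the data onto the q latent dimensions (minimal sketch)."""
    q = W.shape[1]
    B = W.T @ W + sigma2 * np.eye(q)
    Z = np.linalg.solve(B, W.T @ (X - mu[:, None]))   # q x M latent coordinates
    return Z

# As sigma2 -> 0 this tends to the orthogonal projection (W^T W)^{-1} W^T (x - mu),
# i.e. the standard PCA projection onto the latent space (up to rotation/scaling of the axes).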

25 Probabilistic PCA: Summary Idea: assume that the data X were generated by a Gaussian latent variable model. Probabilistic PCA then consists in estimating the density of the latent variable through maximum likelihood; it is PCA through projection onto a latent space. Advantages of expressing PCA in probabilistic form: it can easily be extended to estimation from mixtures of PCA models; the estimated density can easily be used for classification and other Bayesian computations afterwards.

26 Probabilistic PCA: Summary [Figure: the prior p(z), the conditional p(x | z_1) and the marginal p(x).] Assumptions: the underlying latent variable has a Gaussian distribution; linear relationship between latent and observed variables; isotropic Gaussian noise in the observed dimensions.

27 Revisiting the hypotheses of PCA PCA assumed a linear transformation Non-linear PCA (Kernel PCA): to find a non-linear embedding of the data

28 Going back to linearity Find a non-linear transformation that sends the data into a space where linear computation is again feasible.

29 Kernel-Induced Feature Space Idea: Send the data X into a feature space H through a nonlinear map f: the dataset X = {x^i}, i = 1...M, with x^i in R^N, is mapped to f(X) = {f(x^1),...,f(x^M)} in H. Then perform linear PCA in feature space. [Figure: original space with coordinates x_1, x_2 and basis vectors e_1, e_2.]

30 Kernel-Induced Feature Space Idea: Send the data X into a feature space H through a nonlinear map f: X = {x^i}, i = 1...M, x^i in R^N, is mapped to f(X) = {f(x^1),...,f(x^M)}. [Figure: the original space is lifted onto H through f.] While the dimension of the original space is N, the dimension of the feature space may be greater than N! X is lifted onto H. Determining f is difficult -> Kernel Trick.

31 The Kernel Trick In most cases, determining the transformation f may be difficult. Linear PCA computes an inner product across pairs of observations: <x^i, x^j>. There is no need to compute the transformation f if one expresses everything as a function of the inner product in feature space, the kernel function: k: X x X -> R, k(x^i, x^j) = <f(x^i), f(x^j)>. It is a metric of similarity across datapoints and may extract some features.

32 Popular Kernels Gaussian / RBF kernel (translation-invariant): k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)). Homogeneous polynomial kernels: k(x, x') = <x, x'>^p, p in N. Inhomogeneous polynomial kernels: k(x, x') = (<x, x'> + c)^p, p in N, c > 0.
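
These kernels can be written down directly; a small numpy sketch follows (datapoints stored as columns, parameter and function names illustrative).

import numpy as np

def rbf_kernel(X, Xp, sigma=1.0):
    """Gaussian / RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); columns are datapoints."""
    d2 = (np.sum(X**2, axis=0)[:, None] + np.sum(Xp**2, axis=0)[None, :]
          - 2.0 * X.T @ Xp)                       # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma**2))

def poly_kernel(X, Xp, p=2, c=0.0):
    """Polynomial kernel k(x, x') = (<x, x'> + c)^p; c = 0 gives the homogeneous case."""
    return (X.T @ Xp + c) ** p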

33 From Linear PCA to Kernel PCA Rewriting PCA in terms of dot products: each eigenvector e^1,...,e^N found by linear PCA can be expressed as a linear combination of the datapoints. Using C e^i = lambda_i e^i with C = (1/M) sum_j x^j (x^j)^T, we obtain: e^i = 1/(lambda_i M) sum_j <x^j, e^i> x^j.

34 From Linear PCA to Kernel PCA Rewriting PCA in terms of dot products: each eigenvector e^1,...,e^N found by linear PCA can be expressed as a linear combination of the datapoints. Using C e^i = lambda_i e^i with C = (1/M) sum_j x^j (x^j)^T, we obtain: e^i = 1/(lambda_i M) sum_j <x^j, e^i> x^j = sum_j alpha_j^i x^j, where alpha_j^i = 1/(lambda_i M) <x^j, e^i> is a scalar.

35 Linear PCA in Feature Space Sending the data into feature space through f: f: X -> H, x -> f(x). Assume that, in feature space H, the data are centered: sum_{i=1}^{M} f(x^i) = 0. The covariance matrix in the feature space is: C_f = (1/M) F F^T, where the columns of F, i = 1...M, are the f(x^i).

36 Linear PCA in Feature Space As in the original space, in feature space the covariance matrix can be diagonalized, and we now have to find the eigenvalues lambda_i > 0 and eigenvectors v^i satisfying: C_f v^i = lambda_i v^i, or equivalently <f(x^j), C_f v^i> = lambda_i <f(x^j), v^i>, for all i, j = 1,...,M. All solutions v^i with lambda_i different from zero lie in the span of f(x^1),...,f(x^M), and we can thus write: v^i = sum_{j=1}^{M} alpha_j^i f(x^j), with alpha^i = [alpha_1^i ... alpha_M^i].

37 Linear PCA in Feature Space Substituting v^i = sum_j alpha_j^i f(x^j) into <f(x^l), C_f v^i> = lambda_i <f(x^l), v^i> gives, for l = 1,...,M: (1/M) sum_j alpha_j^i <f(x^l), sum_m f(x^m) <f(x^m), f(x^j)>> = lambda_i sum_j alpha_j^i <f(x^l), f(x^j)>. Given that K_{ij} = <f(x^i), f(x^j)> = k(x^i, x^j) (Kernel Trick), this reduces to an eigenvalue problem of the form: K alpha^i = lambda_i M alpha^i, with M the number of datapoints. This is the dual eigenvalue problem of finding the eigenvectors v^i of C_f.

38 Linear PCA in Feature Space The solutions to the dual eigenvalue problem K alpha^i = lambda_i M alpha^i are given by all the eigenvectors alpha^1,...,alpha^M with non-zero eigenvalues lambda_1,...,lambda_M. Asking that the eigenvectors v^i of C_f be normalized, i.e. <v^i, v^i> = 1 for i = 1,...,M, is equivalent to asking that the dual eigenvectors alpha^1,...,alpha^M satisfy <alpha^i, alpha^i> = 1/(lambda_i M). Kernel PCA finds at most M eigenvectors (M: number of datapoints), and typically M >> N, the dimension of each datapoint.

39 Constructing the kPCA projections We cannot see the projection in feature space! We can only compute the projection of each point onto each eigenvector. Projection of a query point x onto eigenvector v^i: <v^i, f(x)> = sum_{j=1}^{M} alpha_j^i <f(x^j), f(x)> = sum_{j=1}^{M} alpha_j^i k(x^j, x), a sum over all training points. Contour lines group points with equal projection: all points x such that <v^i, f(x)> = const.
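
Putting the previous slides together, a minimal kernel-PCA sketch starting from the M x M kernel matrix. It additionally centres the kernel matrix, which corresponds to the centring-in-feature-space assumption made earlier; the function name is an assumption, and the top q eigenvalues are assumed strictly positive.

import numpy as np

def kernel_pca(K, q):
    """Kernel PCA from an M x M kernel matrix K (minimal sketch)."""
    M = K.shape[0]
    one = np.ones((M, M)) / M
    Kc = K - one @ K - K @ one + one @ K @ one     # centre the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)          # eigenvalues of Kc correspond to M * lambda_i
    order = np.argsort(eigvals)[::-1][:q]
    lam, alpha = eigvals[order], eigvecs[:, order]
    alpha = alpha / np.sqrt(lam)                   # normalize so that the v^i have unit norm
    return alpha, lam

# Projections of the training points onto the first q eigenvectors:
#   Y = Kc @ alpha   (row m is [<v^1, f(x^m)>, ..., <v^q, f(x^m)>])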

40 [Figure: from Scholkopf & Smola, 2002, showing the feature space H.] Contour lines in linear PCA are straight lines. In kPCA, they appear curved in the original space, while straight in feature space.

41 Popular Kernels Gaussian / RBF kernel (translation-invariant): k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)). Homogeneous polynomial kernels: k(x, x') = <x, x'>^p, p in N. Inhomogeneous polynomial kernels: k(x, x') = (<x, x'> + c)^p, p in N, c > 0.

42 Kernel PCA: Examples From Scholkopf & Smola, 2002

43 Kernel PCA: Examples From Scholkopf & Smola, 2002

44 Kernel PCA: Examples MLDEMOS Two sets of circle datapoints Original Data

45 Kernel PCA: Examples MLDEMOS Gaussian Kernel Projections onto first two eigenvectors

46 Kernel PCA: Examples MLDEMOS Pair of Glasses datapoints Original Data

47 Kernel PCA: Examples MLDEMOS Gaussian Kernel, kernel width=0.9 Projections onto first two eigenvectors

48 Kernel PCA: Examples MLDEMOS Two sets of circle datapoints Original Data

49 Kernel PCA: Examples MLDEMOS Polynomial Kernel order p=20 Projections onto first two eigenvectors

50 Kernel PCA: Examples MLDEMOS Polynomial Kernel, order p=20. Points cluster here. Projections onto first two eigenvectors

51 Curse of Dimensionality Kernel PCA is computationally very intensive. Computation of the eigenvectors requires an eigenvalue decomposition of the Gram matrix (the kernel matrix is M x M), whose size grows quadratically with the number of data points M. Computation of each projection in the original space grows linearly with M, too. A variety of sparse methods have been proposed in the literature.

54 How to choose kernels? There is no rule for choosing the right kernel; each kernel must be adapted to a particular problem. Do a grid search over values of the kernel parameters and perform cross-validation for each choice. Some considerations are important: kernel parameters are often related to geometrical properties of the data; e.g. the kernel width in the RBF kernel relates to the variance of the data. Experimentally, there is some robustness in the choice if the chosen kernel provides an acceptable trade-off between: a simpler and more efficient structure (e.g. linear separability), which requires some expansion of the data into the feature space; and preserving the information structure, which requires that this expansion is not too strong.
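
One possible way to organise such a sweep is sketched below. The "spectrum concentration" score used here is only an illustrative stand-in for whatever cross-validation criterion the downstream task provides; the data, names and candidate widths are assumptions, and the widths are tied to the data scale as suggested above.

import numpy as np

def rbf_gram(X, sigma):
    """RBF Gram matrix; columns of X are datapoints."""
    d2 = (np.sum(X**2, axis=0)[:, None] + np.sum(X**2, axis=0)[None, :]
          - 2.0 * X.T @ X)
    return np.exp(-d2 / (2.0 * sigma**2))

def spectrum_score(K, q=2):
    """Fraction of the centred kernel spectrum in the top q components (illustrative proxy only)."""
    M = K.shape[0]
    one = np.ones((M, M)) / M
    Kc = K - one @ K - K @ one + one @ K @ one
    ev = np.sort(np.linalg.eigvalsh(Kc))[::-1]
    return ev[:q].sum() / ev.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))                  # illustrative 2-D data, columns are points
base = np.sqrt(np.mean(np.var(X, axis=1)))         # data scale, as the slide suggests
for sigma in base * np.array([0.1, 0.5, 1.0, 2.0, 10.0]):
    print(sigma, spectrum_score(rbf_gram(X, sigma)))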

55 Coding Mini-Projects List of topics: - Co-Inertia Analysis (CIA): equivalent to CCA - ISOMAP - Locally Linear Embedding - Laplacian Eigenmaps - Chirp classifier - Learning Vector Quantization Visualization

56 Coding Mini-Projects Instructions - Code should be embedded in mldemos and properly tested for compilation - Implement the method with real-world or simulated data and analyze its performance, when possible comparing it to other equivalent available methods - In your report, describe briefly the method, its implementation and its evaluation.

57 Surveys of Literature Topics Real-world applications of kernel learning methods Application of continuous RL methods for robot control Inverse reinforcement learning methods Semi-supervised clustering with application to finance, Classification with SVM with application to finance Applications of Gaussian Process regression for robot control

58 Caveats when Conducting Surveys of Literature Surveys are done by teams of two people. Count 25-30 hours of work, including redaction. Each member of the team reads about 10-20 articles. Do not use Google! Rather use Google Scholar, IEEEXplore and other known search engines (see http://library.epfl.ch/db/). Jot down notes as you read a paper! Be critical in your survey of the literature. Report on contradictory findings, or spot claims unsubstantiated by data!

59 Reports Format Pick your projects (first come, first served basis); see the doodle poll at: http://lasa.epfl.ch/teaching/lectures/ml_phd/miniprojects.html Mini-Projects and Lit. Surveys are evaluated in two ways: 1. Written reports: a report on the coding project (10 pages maximum, 10pt minimum, single column; code from the mini-project must be submitted together with the report), and a report on the lit. survey (if you do the lit. survey as a team of two, the maximum length of the survey is 20 pages, 10pt minimum, single column). Reports (+ code) are due on May 17 2012, 6pm, and should be submitted electronically to basilio.noris@epfl.ch. 2. Oral presentation in class (10-minute presentation) on May 31.