Course 495: Advanced Statistical Machine Learning/Pattern Recognition


Deterministic Component Analysis

Goal (Lecture): To present standard and modern Component Analysis (CA) techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and graph/neighbourhood-based Component Analysis.

Goal (Tutorials): To provide the students with the necessary mathematical tools for deeply understanding the CA techniques.

Materials: Pattern Recognition and Machine Learning by C. Bishop, Chapter 12.
Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
He, Xiaofei, et al. "Face recognition using Laplacianfaces." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.3 (2005): 328-340.
Belkin, Mikhail, and Partha Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation." Neural Computation 15.6 (2003): 1373-1396.

Deterministic Component Analysis. Problem: Given a population of data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^F$ (i.e., observations), find a latent space $\mathbf{y}_1, \dots, \mathbf{y}_N \in \mathbb{R}^d$ (usually $d \ll F$) which is relevant to a task. [Slide figure: each observation $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ is mapped to a latent representation $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N$.]

Latent Variable Models. The mapping from the observations to the latent space can be linear (with $d \ll F$) or non-linear. [Slide figure: diagrams of the linear and non-linear latent variable models with their parameter counts.]

What to have in mind? What are the properties of my latent space? How do I find it (linear or non-linear)? What is the cost function? How do I solve the problem?

A first example. Let's assume that we want to find a descriptive latent space, i.e., one that best describes my population as a whole. How do I define it mathematically? Idea! This is a statistical machine learning course, isn't it? Hence, I will try to preserve global statistical properties.

A first example. What are the data statistics that can be used to describe the variability of my observations? One such statistic is the variance of the population (how much the data deviate around a mean). Attempt: I want to find a low-dimensional latent space where the majority of the variance is preserved (or, in other words, maximized).

PCA (one dimension). Variance of the latent space:
$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i.$$
We want to find the latent values $\{y_1, y_2, \dots, y_N\}$ that maximize this variance,
$$\{y_1^o, \dots, y_N^o\} = \arg\max_{\{y_1, \dots, y_N\}} \sigma_y^2,$$
via a linear projection $\mathbf{w}$, i.e., $y_i = \mathbf{w}^T\mathbf{x}_i$. But we are missing something: the way to do it.

PCA (geometric interpretation of projection). With $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ and $\|\mathbf{w}\| = \sqrt{\mathbf{w}^T\mathbf{w}}$, the angle $\theta$ between $\mathbf{x}$ and $\mathbf{w}$ satisfies
$$\cos(\theta) = \frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{x}\|\,\|\mathbf{w}\|},$$
so the (signed) length of the projection of $\mathbf{x}$ onto $\mathbf{w}$ is
$$\|\mathbf{x}\|\cos(\theta) = \frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{w}\|}.$$
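As a quick numerical illustration of the projection formula (the vectors below are made up for this example, not taken from the slides):

```python
import numpy as np

# Signed length of the projection of x onto w: x^T w / ||w||.
x = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])
cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
proj_length = np.linalg.norm(x) * cos_theta
print(cos_theta, proj_length)   # approximately 0.6 and 3.0
```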


PCA (one dimension). Assuming $y_i = \mathbf{w}^T\mathbf{x}_i$, the latent space is $\{y_1, \dots, y_N\} = \{\mathbf{w}^T\mathbf{x}_1, \dots, \mathbf{w}^T\mathbf{x}_N\}$ and, with $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$,
$$\mathbf{w}^o = \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\boldsymbol{\mu})^2 = \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})\big)^2$$
$$= \arg\max_{\mathbf{w}} \mathbf{w}^T\left[\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\right]\mathbf{w} = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w}.$$

PCA (one dimension).
$$\mathbf{w}^o = \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w}, \qquad \mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T.$$
This has a trivial, unbounded solution: scaling $\mathbf{w}$ scales the variance. We can avoid it by adding an extra constraint, a fixed magnitude for $\mathbf{w}$ ($\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = 1$):
$$\mathbf{w}^o = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{w} = 1.$$

PCA (one dimension). Formulate the Lagrangian
$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_t\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1).$$
Using $\frac{\partial\, \mathbf{w}^T\mathbf{S}_t\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{S}_t\mathbf{w}$ and $\frac{\partial\, \mathbf{w}^T\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{w}$, setting $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{0}$ gives
$$\mathbf{S}_t\mathbf{w} = \lambda\mathbf{w},$$
i.e., $\mathbf{w}$ is the eigenvector of $\mathbf{S}_t$ corresponding to its largest eigenvalue.
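To make the solution concrete, here is a minimal NumPy sketch of one-dimensional PCA as derived above (the column-wise data layout and the function name are illustrative choices, not from the lecture):

```python
import numpy as np

def first_principal_component(X):
    """X holds one F-dimensional observation per column, shape (F, N)."""
    mu = X.mean(axis=1, keepdims=True)        # sample mean
    Xc = X - mu                               # centred data
    St = (Xc @ Xc.T) / X.shape[1]             # total scatter S_t
    eigvals, eigvecs = np.linalg.eigh(St)     # eigh: S_t is symmetric
    w = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    y = w @ Xc                                # 1-D latent representation
    return w, y

# Example usage on random data with F = 5 features and N = 100 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))
w, y = first_principal_component(X)
print(w.shape, y.shape)   # (5,) (100,)
```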

PCA: properties of $\mathbf{S}_t$.
$$\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T, \qquad \mathbf{X} = [\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_N - \boldsymbol{\mu}].$$
- $\mathbf{S}_t$ is a symmetric matrix, hence all its eigenvalues are real.
- $\mathbf{S}_t$ is a positive semi-definite matrix, i.e., $\mathbf{w}^T\mathbf{S}_t\mathbf{w} \geq 0$ for all $\mathbf{w}$ (all eigenvalues are non-negative).
- $\operatorname{rank}(\mathbf{S}_t) = \min(N-1, F)$.
- Eigendecomposition: $\mathbf{S}_t = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ with $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and $\mathbf{U}\mathbf{U}^T = \mathbf{I}$.
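These properties can be checked numerically; the snippet below is an illustrative sketch on random data (the sizes are arbitrary and chosen so that $N - 1 < F$):

```python
import numpy as np

rng = np.random.default_rng(1)
F, N = 10, 6                                      # fewer samples than features
X = rng.normal(size=(F, N))
Xc = X - X.mean(axis=1, keepdims=True)
St = (Xc @ Xc.T) / N                              # total scatter matrix

eigvals = np.linalg.eigvalsh(St)
print(np.allclose(St, St.T))                      # True: symmetric
print(np.all(eigvals >= -1e-10))                  # True: positive semi-definite
print(np.linalg.matrix_rank(St), min(N - 1, F))   # 5 5: rank equals min(N-1, F)
```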

PCA (more dimensions). How can we find a latent space with more than one dimension? We need a set of projections $\{\mathbf{w}_1, \dots, \mathbf{w}_d\}$, collected as columns of $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$, so that
$$\mathbf{y}_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}_d^T\mathbf{x}_i \end{bmatrix} = \mathbf{W}^T\mathbf{x}_i.$$


PCA (more dimensions). Maximize the variance in each dimension:
$$\mathbf{W}^o = \arg\max_{\mathbf{W}} \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}(y_{ik} - \mu_{y_k})^2 = \arg\max_{\mathbf{W}} \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}\mathbf{w}_k^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w}_k$$
$$= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\mathbf{w}_k^T\mathbf{S}_t\mathbf{w}_k = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}].$$

PCA (more dimensions).
$$\mathbf{W}^o = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}.$$
Lagrangian:
$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})].$$
With $\frac{\partial\, \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]}{\partial \mathbf{W}} = 2\mathbf{S}_t\mathbf{W}$ and $\frac{\partial\, \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]}{\partial \mathbf{W}} = 2\mathbf{W}\boldsymbol{\Lambda}$, setting $\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial \mathbf{W}} = \mathbf{0}$ gives
$$\mathbf{S}_t\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}.$$
Does it ring a bell?

PCA (more dimensions). Hence, $\mathbf{W}$ has as columns the $d$ eigenvectors of $\mathbf{S}_t$ that correspond to its $d$ largest nonzero eigenvalues, $\mathbf{W} = \mathbf{U}_d$, and
$$\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] = \operatorname{tr}[\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W}] = \operatorname{tr}[\boldsymbol{\Lambda}_d].$$
Example: let $\mathbf{U}$ be $5 \times 5$ and $\mathbf{W} = [\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3]$ be $5 \times 3$. Then
$$\mathbf{W}^T\mathbf{U} = \begin{bmatrix} \mathbf{u}_1^T\mathbf{u}_1 & \mathbf{u}_1^T\mathbf{u}_2 & \mathbf{u}_1^T\mathbf{u}_3 & \mathbf{u}_1^T\mathbf{u}_4 & \mathbf{u}_1^T\mathbf{u}_5 \\ \mathbf{u}_2^T\mathbf{u}_1 & \mathbf{u}_2^T\mathbf{u}_2 & \mathbf{u}_2^T\mathbf{u}_3 & \mathbf{u}_2^T\mathbf{u}_4 & \mathbf{u}_2^T\mathbf{u}_5 \\ \mathbf{u}_3^T\mathbf{u}_1 & \mathbf{u}_3^T\mathbf{u}_2 & \mathbf{u}_3^T\mathbf{u}_3 & \mathbf{u}_3^T\mathbf{u}_4 & \mathbf{u}_3^T\mathbf{u}_5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}.$$

PCA (more dimensions). Continuing the example,
$$\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 & 0 \end{bmatrix} \quad \text{and} \quad \mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}.$$
Hence the maximum is
$$\operatorname{tr}[\boldsymbol{\Lambda}_d] = \sum_{i=1}^{d}\lambda_i.$$
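A short NumPy sketch of the $d$-dimensional case follows (again, the data layout and names are assumptions for illustration); it also checks that the maximized objective equals the sum of the $d$ largest eigenvalues:

```python
import numpy as np

def pca(X, d):
    """X holds one observation per column, shape (F, N); returns W = U_d."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    St = (Xc @ Xc.T) / X.shape[1]
    eigvals, U = np.linalg.eigh(St)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort descending
    return U[:, order[:d]], eigvals[order[:d]], mu

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 200))
W, lam, mu = pca(X, d=3)
St = ((X - mu) @ (X - mu).T) / X.shape[1]
# tr[W^T S_t W] equals the sum of the d largest eigenvalues of S_t.
print(np.allclose(np.trace(W.T @ St @ W), lam.sum()))   # True
```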

PCA (another perspective). We want to find a set of bases $\mathbf{W}$ that best reconstructs the data after projection: $\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i$, with reconstruction $\tilde{\mathbf{x}}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}_i$.

PCA (another perspective). Let us assume, for simplicity, centred data (zero mean). The reconstructed data are $\tilde{\mathbf{X}} = \mathbf{W}\mathbf{Y} = \mathbf{W}\mathbf{W}^T\mathbf{X}$, and
$$\mathbf{W} = \arg\min_{\mathbf{W}} \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{F}(x_{ij} - \tilde{x}_{ij})^2 = \arg\min_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 \quad (1) \qquad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}.$$

PCA (another perspective).
$$\|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 = \operatorname{tr}\big[(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})^T(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})\big]$$
$$= \operatorname{tr}[\mathbf{X}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} + \mathbf{X}^T\mathbf{W}\underbrace{\mathbf{W}^T\mathbf{W}}_{\mathbf{I}}\mathbf{W}^T\mathbf{X}]$$
$$= \operatorname{tr}[\mathbf{X}^T\mathbf{X}] - \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}].$$
Since $\operatorname{tr}[\mathbf{X}^T\mathbf{X}]$ is constant, minimizing (1) is equivalent to
$$\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W}].$$
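The equivalence between minimizing the reconstruction error and maximizing the projected variance can be verified numerically; the snippet below is an illustrative check on random, pre-centred data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 150))
X = X - X.mean(axis=1, keepdims=True)             # centred data

eigvals, U = np.linalg.eigh(X @ X.T)
W = U[:, ::-1][:, :2]                             # top-2 eigenvectors of X X^T

recon_error = np.linalg.norm(X - W @ W.T @ X, 'fro') ** 2
equivalent = np.trace(X.T @ X) - np.trace(W.T @ X @ X.T @ W)
print(np.allclose(recon_error, equivalent))       # True: the two expressions agree
```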

PCA (applications). TEXTURE. [Figure: the mean texture and the 1st, 2nd, 3rd and 4th principal components.]

PCA (applications). SHAPE. [Figure: the mean shape and the 1st, 2nd, 3rd and 4th principal components.]

PCA (applications). [Figure: principal components illustrating identity and expression variation.]

Linear Discriminant Analysis. PCA is an unsupervised approach: good for data compression and data reconstruction, and a good statistical prior. However, PCA is not explicitly designed for classification problems (i.e., the case where the data come with labels). How do we define a latent space in this case, i.e., one that helps in data classification?

Linear Discriminant Analysis. We need to properly define statistical properties which may help us in classification. Intuition: we want to find a space in which (a) the data within each class look more like each other, while (b) the data of separate classes look more dissimilar.

Linear Discriminant Analysis. How do I make my data in each class look more similar? Minimize the variability in each class (minimize the per-class variances $\sigma_y^2(c_1)$ and $\sigma_y^2(c_2)$). [Figure: the space of $\mathbf{x}$ and the projected space of $y$.]

Linear Discriminant Analysis. How do I make the data between classes look dissimilar? I move the data from different classes further away from each other (increase the distance between their means, $(\mu_y(c_1) - \mu_y(c_2))^2$). [Figure: the space of $\mathbf{x}$ and the projected space of $y$.]

Linear Discriminant Analysis. A bit more formally, I want a latent space $y$ such that $\sigma_y^2(c_1) + \sigma_y^2(c_2)$ is minimum and $(\mu_y(c_1) - \mu_y(c_2))^2$ is maximum. How do I combine them together? Minimize
$$\frac{\sigma_y^2(c_1) + \sigma_y^2(c_2)}{(\mu_y(c_1) - \mu_y(c_2))^2}$$
or, equivalently, maximize
$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)}.$$

Linear Discriminant Analysis. How can I find my latent space? Via a linear projection: $y_i = \mathbf{w}^T\mathbf{x}_i$.

Linear Discriminant Analysis. With $\boldsymbol{\mu}(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\mathbf{x}_i$,
$$\sigma_y^2(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(y_i - \mu_y(c_1))^2 = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))\big)^2$$
$$= \mathbf{w}^T\left[\frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}.$$
Similarly, $\sigma_y^2(c_2) = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$.

Linear Discriminant Analysis.
$$\sigma_y^2(c_1) + \sigma_y^2(c_2) = \mathbf{w}^T\underbrace{(\mathbf{S}_1 + \mathbf{S}_2)}_{\mathbf{S}_w,\ \text{within-class scatter matrix}}\mathbf{w},$$
$$(\mu_y(c_1) - \mu_y(c_2))^2 = \mathbf{w}^T\underbrace{(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))^T}_{\mathbf{S}_b,\ \text{between-class scatter matrix}}\mathbf{w},$$
so
$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}.$$

Linear Discriminant Analysis.
$$\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1.$$
Lagrangian:
$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1).$$
Using $\frac{\partial\, \mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{S}_b\mathbf{w}$ and $\frac{\partial\, \mathbf{w}^T\mathbf{S}_w\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{S}_w\mathbf{w}$, setting $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{0}$ gives
$$\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w},$$
so $\mathbf{w}$ is the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with the largest eigenvalue; in the two-class case, $\mathbf{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$.
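A minimal sketch of the two-class solution (illustrative; the sample-per-row layout and the function name are my own assumptions, not the lecture's code):

```python
import numpy as np

def lda_direction(X1, X2):
    """X1, X2 hold one F-dimensional sample per row for classes c1 and c2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)    # scatter of class c1
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)    # scatter of class c2
    Sw = S1 + S2                                # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)          # w ∝ S_w^{-1}(mu1 - mu2)
    return w / np.linalg.norm(w)
```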

Linear Discriminant Analysis. Example: compute the LDA projection for the following 2D dataset:
$$c_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}, \qquad c_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}.$$

Linear Discriminant Analysis. Solution (by hand). The class statistics are
$$\mathbf{S}_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix}, \quad \mathbf{S}_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}, \quad \boldsymbol{\mu}_1 = [3.0 \;\; 3.6]^T, \quad \boldsymbol{\mu}_2 = [8.4 \;\; 7.6]^T.$$
The within- and between-class scatter matrices are
$$\mathbf{S}_w = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}, \qquad \mathbf{S}_b = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}.$$

Linear Discriminant Analysis. The LDA projection is then obtained as the solution of the generalized eigenvalue problem $\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$:
$$\mathbf{S}_w^{-1}\mathbf{S}_b = \begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}, \qquad \left|\mathbf{S}_w^{-1}\mathbf{S}_b - \lambda\mathbf{I}\right| = \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65,$$
$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}.$$
Or directly: $\mathbf{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, which after normalization gives the same $[0.91 \;\; 0.39]^T$ (up to sign).
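The worked example can be reproduced with a few lines of NumPy (an illustrative script, not the lecture's code; the printed values differ from the slide only in the last digit because of rounding):

```python
import numpy as np

c1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
c2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
S1 = (c1 - mu1).T @ (c1 - mu1) / len(c1)
S2 = (c2 - mu2).T @ (c2 - mu2) / len(c2)
Sw = S1 + S2                                  # within-class scatter
Sb = np.outer(mu1 - mu2, mu1 - mu2)           # between-class scatter

M = np.linalg.solve(Sw, Sb)                   # S_w^{-1} S_b
print(np.round(M, 2))                         # [[11.89  8.81] [ 5.08  3.76]]

eigvals, eigvecs = np.linalg.eig(M)
k = int(np.argmax(eigvals.real))
print(round(float(eigvals.real[k]), 2))       # 15.66

w = eigvecs[:, k].real
w = w / np.linalg.norm(w) * np.sign(w[0])     # unit norm, positive first entry
print(np.round(w, 2))                         # [0.92 0.39]
```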

LDA (multiclass & multidimensional case). We now seek a set of projections $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$.

LDA (multiclass & multidimensional case). Within-class scatter matrix:
$$\mathbf{S}_w = \sum_{j=1}^{C}\mathbf{S}_j = \sum_{j=1}^{C}\frac{1}{N_{c_j}}\sum_{\mathbf{x}_i \in c_j}(\mathbf{x}_i - \boldsymbol{\mu}(c_j))(\mathbf{x}_i - \boldsymbol{\mu}(c_j))^T.$$
Between-class scatter matrix:
$$\mathbf{S}_b = \sum_{j=1}^{C}(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})^T.$$

LDA (multiclass & multidimensional case).
$$\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{S}_w\mathbf{W} = \mathbf{I}.$$
Lagrangian:
$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})].$$
With $\frac{\partial\, \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}]}{\partial \mathbf{W}} = 2\mathbf{S}_b\mathbf{W}$ and $\frac{\partial\, \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]}{\partial \mathbf{W}} = 2\mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$, setting $\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial \mathbf{W}} = \mathbf{0}$ gives
$$\mathbf{S}_b\mathbf{W} = \mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda},$$
so the columns of $\mathbf{W}$ are the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ that correspond to the largest eigenvalues.
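A minimal sketch of the multiclass case (illustrative; the sample-per-row layout, the integer label encoding and the names are assumptions, not the lecture's code):

```python
import numpy as np

def lda(X, y, d):
    """X: one F-dimensional sample per row; y: integer class labels."""
    mu = X.mean(axis=0)
    F = X.shape[1]
    Sw = np.zeros((F, F))
    Sb = np.zeros((F, F))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)   # per-class scatter S_j
        Sb += np.outer(mu_c - mu, mu_c - mu)          # between-class scatter
    # Solve S_b W = S_w W Λ via the eigenvectors of S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:d]].real                 # d leading eigenvectors

# Example usage on random 3-class data; S_b has rank at most C-1 = 2.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)
W = lda(X, y, d=2)
print(W.shape)   # (4, 2)
```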