Course 495: Advanced Statistical Machine Learning/Pattern Recognition
Deterministic Component Analysis
Goal (Lecture): To present standard and modern Component Analysis (CA) techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Graph/Neighbourhood-based Component Analysis.
Goal (Tutorials): To provide students with the mathematical tools needed for a deep understanding of the CA techniques.
Materials
Pattern Recognition & Machine Learning by C. Bishop, Chapter 12.
Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
He, Xiaofei, et al. "Face recognition using Laplacianfaces." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.3 (2005): 328-340.
Belkin, Mikhail, and Partha Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation." Neural Computation 15.6 (2003): 1373-1396.
Deterministic Component Analysis
Problem: Given a population of data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^F$ (i.e., observations), find a latent space $\mathbf{y}_1, \dots, \mathbf{y}_N \in \mathbb{R}^d$ (usually $d \ll F$) which is relevant to a task.
[Figure: observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ mapped to latent variables $\mathbf{y}_1, \dots, \mathbf{y}_N$.]
Latent Variable Models
[Figure: comparison of linear and non-linear latent variable models mapping the $F$ observed dimensions to a $d$-dimensional latent space, together with the number of parameters each requires.]
What to have in mind?
What are the properties of my latent space?
How do I find it (linear or non-linear mapping)?
What is the cost function?
How do I solve the problem?
A first example
Let's assume that we want to find a descriptive latent space, i.e., one that best describes the population as a whole.
How do I define this mathematically?
Idea: this is a statistical machine learning course, isn't it? Hence, I will try to preserve global statistical properties.
A first example
What data statistics can be used to describe the variability of my observations?
One such statistic is the variance of the population (how much the data deviate around the mean).
Attempt: I want to find a low-dimensional latent space in which the majority of the variance is preserved (or, in other words, maximised).
PCA (one dimension)
Variance of the latent space: $\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2$, with $\mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i$.
We want to find $\{y_1^o, \dots, y_N^o\} = \arg\max_{\{y_1,\dots,y_N\}} \sigma_y^2$ via a linear projection $\mathbf{w}$, i.e., $y_i = \mathbf{w}^T \mathbf{x}_i$.
But we are missing something: the way to do it.
PCA (geometric interpretation of projection)
$\cos(\theta) = \dfrac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{x}\|\,\|\mathbf{w}\|}$, where $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ and $\|\mathbf{w}\| = \sqrt{\mathbf{w}^T\mathbf{w}}$.
The length of the projection of $\mathbf{x}$ onto the direction of $\mathbf{w}$ is $\|\mathbf{x}\|\cos(\theta) = \dfrac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{w}\|}$.
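A quick numerical check of the projection formulas above; a minimal numpy sketch with made-up 2-D vectors (all names are illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])          # observation
w = np.array([1.0, 1.0])          # projection direction (not yet unit length)

cos_theta = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
proj_length = x @ w / np.linalg.norm(w)        # |x| cos(theta)

# With a unit-norm w the projection length reduces to the inner product x^T w
w_unit = w / np.linalg.norm(w)
assert np.isclose(proj_length, x @ w_unit)
```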
PCA (one dimension)
Assuming a latent space $y_i = \mathbf{w}^T\mathbf{x}_i$, i.e. $\{y_1,\dots,y_N\} = \{\mathbf{w}^T\mathbf{x}_1,\dots,\mathbf{w}^T\mathbf{x}_N\}$, and $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$:
$\mathbf{w}_o = \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\boldsymbol{\mu})^2 = \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})\right)^2$
$= \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w} = \arg\max_{\mathbf{w}} \mathbf{w}^T\left[\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\right]\mathbf{w} = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w}$
PCA (one dimension)
$\mathbf{w}_o = \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w}$, where $\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$.
There is a trivial way to increase the objective: scale $\mathbf{w}$ up without bound. We can avoid it by adding an extra constraint, a fixed magnitude on $\mathbf{w}$ ($\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = 1$):
$\mathbf{w}_o = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w} \quad \text{s.t. } \mathbf{w}^T\mathbf{w} = 1$
PCA (one dimension)
Formulate the Lagrangian: $L(\mathbf{w},\lambda) = \mathbf{w}^T\mathbf{S}_t\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$
Using $\dfrac{\partial\, \mathbf{w}^T\mathbf{S}_t\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_t\mathbf{w}$ and $\dfrac{\partial\, \mathbf{w}^T\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{w}$:
$\dfrac{\partial L}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{S}_t\mathbf{w} = \lambda\mathbf{w}$
Hence $\mathbf{w}$ is the eigenvector of $\mathbf{S}_t$ corresponding to its largest eigenvalue.
PCA: Properties of $\mathbf{S}_t$
$\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T$, where $\mathbf{X} = [\mathbf{x}_1 - \boldsymbol{\mu},\dots,\mathbf{x}_N - \boldsymbol{\mu}]$.
$\mathbf{S}_t$ is a symmetric matrix, hence all its eigenvalues are real.
$\mathbf{S}_t$ is a positive semi-definite matrix, i.e. $\mathbf{w}^T\mathbf{S}_t\mathbf{w} \ge 0$ for all $\mathbf{w}$ (all eigenvalues are non-negative).
$\operatorname{rank}(\mathbf{S}_t) = \min(N-1, F)$.
Eigendecomposition: $\mathbf{S}_t = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, with $\mathbf{U}^T\mathbf{U} = \mathbf{U}\mathbf{U}^T = \mathbf{I}$.
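A minimal one-dimensional PCA sketch following the derivation above, assuming numpy and synthetic data (variable names are illustrative, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 100, 5
X = rng.standard_normal((N, F)) @ rng.standard_normal((F, F))   # synthetic correlated observations

mu = X.mean(axis=0)
Xc = (X - mu).T                      # F x N matrix [x_1 - mu, ..., x_N - mu]
St = Xc @ Xc.T / N                   # total scatter matrix S_t = (1/N) X X^T

# S_t is symmetric and positive semi-definite, so np.linalg.eigh applies
assert np.allclose(St, St.T)
evals, evecs = np.linalg.eigh(St)    # eigenvalues in ascending order, all >= 0 up to round-off

w = evecs[:, -1]                     # unit-norm eigenvector of the largest eigenvalue
y = w @ Xc                           # 1-D latent space y_i = w^T (x_i - mu)
print(np.var(y), evals[-1])          # projected variance equals the largest eigenvalue
```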
PCA (more dimensions)
How can we find a latent space with more than one dimension?
We need a set of projections $\{\mathbf{w}_1,\dots,\mathbf{w}_d\}$:
$\mathbf{y}_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}_d^T\mathbf{x}_i \end{bmatrix} = \mathbf{W}^T\mathbf{x}_i, \qquad \mathbf{W} = [\mathbf{w}_1,\dots,\mathbf{w}_d]$
PCA (more dimensions)
Maximise the variance in each dimension:
$\mathbf{W}_o = \arg\max_{\mathbf{W}} \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}(y_{ik} - \mu_{y_k})^2 = \arg\max_{\mathbf{W}} \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}\mathbf{w}_k^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w}_k$
$= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\mathbf{w}_k^T\mathbf{S}_t\mathbf{w}_k = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]$
PCA (more dimensions)
$\mathbf{W}_o = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}$
Lagrangian: $L(\mathbf{W},\boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]$
Using $\dfrac{\partial\,\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]}{\partial\mathbf{W}} = 2\mathbf{S}_t\mathbf{W}$ and $\dfrac{\partial\,\operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]}{\partial\mathbf{W}} = 2\mathbf{W}\boldsymbol{\Lambda}$:
$\dfrac{\partial L(\mathbf{W},\boldsymbol{\Lambda})}{\partial\mathbf{W}} = 0 \;\Rightarrow\; \mathbf{S}_t\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$
Does it ring a bell?
PCA (more dimensions)
Hence, $\mathbf{W}$ has as columns the $d$ eigenvectors of $\mathbf{S}_t$ that correspond to its $d$ largest non-zero eigenvalues: $\mathbf{W} = \mathbf{U}_d$, and
$\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] = \operatorname{tr}[\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W}] = \operatorname{tr}[\boldsymbol{\Lambda}_d]$
Example: let $\mathbf{U}$ be $5\times 5$ and $\mathbf{W} = [\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3]$ be $5\times 3$. Then
$\mathbf{W}^T\mathbf{U} = \begin{bmatrix} \mathbf{u}_1\cdot\mathbf{u}_1 & \mathbf{u}_1\cdot\mathbf{u}_2 & \mathbf{u}_1\cdot\mathbf{u}_3 & \mathbf{u}_1\cdot\mathbf{u}_4 & \mathbf{u}_1\cdot\mathbf{u}_5 \\ \mathbf{u}_2\cdot\mathbf{u}_1 & \mathbf{u}_2\cdot\mathbf{u}_2 & \mathbf{u}_2\cdot\mathbf{u}_3 & \mathbf{u}_2\cdot\mathbf{u}_4 & \mathbf{u}_2\cdot\mathbf{u}_5 \\ \mathbf{u}_3\cdot\mathbf{u}_1 & \mathbf{u}_3\cdot\mathbf{u}_2 & \mathbf{u}_3\cdot\mathbf{u}_3 & \mathbf{u}_3\cdot\mathbf{u}_4 & \mathbf{u}_3\cdot\mathbf{u}_5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$
PCA (more dimensions)
$\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 & 0 \end{bmatrix}$ and $\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}$
Hence the maximum is $\operatorname{tr}[\boldsymbol{\Lambda}_d] = \sum_{i=1}^{d}\lambda_i$.
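A sketch of the $d$-dimensional case: keep the $d$ leading eigenvectors and check that the attained value of $\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]$ equals the sum of the $d$ largest eigenvalues (numpy, synthetic data; the helper name `pca` is illustrative):

```python
import numpy as np

def pca(X, d):
    """Top-d PCA via eigendecomposition of the total scatter matrix S_t.
    X is N x F (rows = observations); returns the mean, the basis W and the latent Y."""
    mu = X.mean(axis=0)
    Xc = (X - mu).T                              # F x N matrix of centred data
    St = Xc @ Xc.T / X.shape[0]                  # S_t = (1/N) X X^T
    evals, evecs = np.linalg.eigh(St)            # ascending eigenvalues
    W = evecs[:, ::-1][:, :d]                    # d leading eigenvectors as columns
    return mu, W, W.T @ Xc, evals[::-1][:d]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))   # synthetic correlated data
mu, W, Y, lam = pca(X, d=3)

St = (X - mu).T @ (X - mu) / X.shape[0]
print(np.trace(W.T @ St @ W), lam.sum())         # maximised trace = sum of the d largest eigenvalues
```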
PCA (another perspective)
We want to find a set of bases $\mathbf{W}$ that best reconstructs the data after the projection $\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i$:
$\tilde{\mathbf{x}}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}_i \approx \mathbf{x}_i$
PCA (another perspective)
Let us assume, for simplicity, centred data (zero mean). The reconstructed data are $\tilde{\mathbf{X}} = \mathbf{W}\mathbf{Y} = \mathbf{W}\mathbf{W}^T\mathbf{X}$.
$\mathbf{W}_o = \arg\min_{\mathbf{W}} \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{F}(x_{ij} - \tilde{x}_{ij})^2 = \arg\min_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 \quad (1)$
$\text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}$
PCA (another perspective)
$\|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 = \operatorname{tr}\!\left[(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})^T(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})\right]$
$= \operatorname{tr}[\mathbf{X}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} + \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{W}\mathbf{W}^T\mathbf{X}]$
$= \operatorname{tr}[\mathbf{X}^T\mathbf{X}] - \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}]$ (using $\mathbf{W}^T\mathbf{W} = \mathbf{I}$)
Since $\operatorname{tr}[\mathbf{X}^T\mathbf{X}]$ is constant, minimising (1) is equivalent to $\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W}]$.
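A numerical check that the reconstruction view agrees with the variance view: the per-sample reconstruction error equals the sum of the discarded eigenvalues. A sketch on centred synthetic data (numpy, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))
X = (X - X.mean(axis=0)).T                    # F x N, centred as in the slide

St = X @ X.T / X.shape[1]
evals, evecs = np.linalg.eigh(St)             # ascending order
d = 2
W = evecs[:, ::-1][:, :d]                     # top-d eigenvectors

X_hat = W @ W.T @ X                           # reconstruction from the latent space
recon_err = np.linalg.norm(X - X_hat, 'fro')**2 / X.shape[1]
discarded = evals[::-1][d:].sum()             # eigenvalues not kept by W
print(recon_err, discarded)                   # the two quantities coincide
```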
PCA (applications)
[Figure: TEXTURE model — the mean texture and the 1st to 4th principal components.]
PCA (applications)
[Figure: SHAPE model — the mean shape and the 1st to 4th principal components.]
PCA (applications)
[Figure: principal components corresponding to identity and expression variation.]
Linear Discriminant Analysis
PCA is an unsupervised approach: good for compression and reconstruction of data, and a good statistical prior.
PCA is not explicitly designed for classification problems (i.e., the case where the data come with labels).
How do we define a latent space in this case (i.e., one that helps in data classification)?
Linear Discriminant Analysis
We need to properly define statistical properties that may help us in classification.
Intuition: we want to find a space in which (a) the data belonging to each class look more like each other, while (b) the data of separate classes look more dissimilar.
Linear Discriminant Analysis
How do I make my data in each class look more similar? Minimise the variability in each class (minimise the within-class variances).
[Figure: projecting from the space of $\mathbf{x}$ to the space of $y$ shrinks the class variances $\sigma_y^2(c_1)$ and $\sigma_y^2(c_2)$.]
Linear Discriminant Analysis
How do I make the data between classes look dissimilar? I move the data from different classes further away from each other (increase the distance between their means).
[Figure: projecting from the space of $\mathbf{x}$ to the space of $y$ increases the distance $(\mu_y(c_1) - \mu_y(c_2))^2$ between the class means.]
Linear Discriminant Analysis
A bit more formally, I want a latent space $y$ such that:
$\sigma_y^2(c_1) + \sigma_y^2(c_2)$ is minimum, and
$(\mu_y(c_1) - \mu_y(c_2))^2$ is maximum.
How do I combine them? Either minimise
$\dfrac{\sigma_y^2(c_1) + \sigma_y^2(c_2)}{(\mu_y(c_1) - \mu_y(c_2))^2}$
or maximise
$\dfrac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)}$
Linear Discriminant Analysis
How can I find my latent space? Again via a linear projection: $y_i = \mathbf{w}^T\mathbf{x}_i$.
Linear Discriminant Analysis
$\sigma_y^2(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i\in c_1}(y_i - \mu_y(c_1))^2 = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i\in c_1}\left(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))\right)^2$
$= \mathbf{w}^T\left[\frac{1}{N_{c_1}}\sum_{\mathbf{x}_i\in c_1}(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}$
Similarly, $\sigma_y^2(c_2) = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$, where $\boldsymbol{\mu}(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i\in c_1}\mathbf{x}_i$.
Linear Discriminant Analysis
$\sigma_y^2(c_1) + \sigma_y^2(c_2) = \mathbf{w}^T(\mathbf{S}_1 + \mathbf{S}_2)\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}$, where $\mathbf{S}_w$ is the within-class scatter matrix.
$(\mu_y(c_1) - \mu_y(c_2))^2 = \mathbf{w}^T(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_b\mathbf{w}$, where $\mathbf{S}_b$ is the between-class scatter matrix.
Hence
$\dfrac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)} = \dfrac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$
Linear Discriminant Analysis
$\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t. } \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$
Lagrangian: $L(\mathbf{w},\lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1)$
Using $\dfrac{\partial\,\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_b\mathbf{w}$ and $\dfrac{\partial\,\mathbf{w}^T\mathbf{S}_w\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_w\mathbf{w}$:
$\dfrac{\partial L}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$
Hence $\mathbf{w}$ is the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ corresponding to its largest eigenvalue. Since $\mathbf{S}_b\mathbf{w}$ always points in the direction of $\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2)$, in the two-class case $\mathbf{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$.
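A minimal two-class LDA sketch using the closed form $\mathbf{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$ derived above (numpy; the synthetic Gaussian data and the helper name are illustrative):

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher discriminant direction for two classes (rows = observations)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / X1.shape[0]
    S2 = (X2 - mu2).T @ (X2 - mu2) / X2.shape[0]
    Sw = S1 + S2                                   # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)             # S_w^{-1} (mu_1 - mu_2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=100)
X2 = rng.multivariate_normal([3, 3], [[1, .8], [.8, 1]], size=100)
w = lda_direction(X1, X2)
print(w)        # projecting onto w separates the two classes along a single axis
```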
Linear Discriminant Analysis
Compute the LDA projection for the following 2D dataset:
$c_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$
$c_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
Linear Discriminant Analysis
Solution (by hand). The class statistics are:
$\boldsymbol{\mu}_1 = [3.0 \;\; 3.6]^T, \qquad \boldsymbol{\mu}_2 = [8.4 \;\; 7.6]^T$
$\mathbf{S}_1 = \begin{bmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$
The between- and within-class scatter matrices are:
$\mathbf{S}_b = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}, \qquad \mathbf{S}_w = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$
Linear Discriminant Analysis
The LDA projection is then obtained as the solution of the generalized eigenvalue problem $\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$, with
$\mathbf{S}_w^{-1}\mathbf{S}_b = \begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}$
$\left|\mathbf{S}_w^{-1}\mathbf{S}_b - \lambda\mathbf{I}\right| = 0 \;\Rightarrow\; \lambda = 15.65$
$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$
Or directly: $\mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \propto [0.91 \;\; 0.39]^T$.
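The worked example can be checked numerically; a short numpy script that reproduces the scatter matrices and the projection (up to sign and rounding):

```python
import numpy as np

c1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
c2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
S1 = (c1 - mu1).T @ (c1 - mu1) / len(c1)
S2 = (c2 - mu2).T @ (c2 - mu2) / len(c2)
Sw = S1 + S2                              # within-class scatter
Sb = np.outer(mu1 - mu2, mu1 - mu2)       # between-class scatter

w = np.linalg.solve(Sw, mu1 - mu2)        # S_w^{-1} (mu_1 - mu_2)
w /= np.linalg.norm(w)
print(Sw)          # [[ 2.64 -0.44] [-0.44  5.28]]
print(Sb)          # [[29.16 21.6 ] [21.6  16.  ]]
print(w)           # ~ [-0.92 -0.39], the slide's solution up to sign and rounding
```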
LDA (Multiclass & Multidimensional case)
We now look for several projections at once: $\mathbf{W} = [\mathbf{w}_1,\dots,\mathbf{w}_d]$.
LDA (Multiclass & Multidimensional case)
Within-class scatter matrix:
$\mathbf{S}_w = \sum_{j=1}^{C}\mathbf{S}_j = \sum_{j=1}^{C}\frac{1}{N_{c_j}}\sum_{\mathbf{x}_i\in c_j}(\mathbf{x}_i - \boldsymbol{\mu}(c_j))(\mathbf{x}_i - \boldsymbol{\mu}(c_j))^T$
Between-class scatter matrix:
$\mathbf{S}_b = \sum_{j=1}^{C}(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})^T$
LDA (Multiclass & Multidimensional case)
$\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] \quad \text{s.t. } \mathbf{W}^T\mathbf{S}_w\mathbf{W} = \mathbf{I}$
Lagrangian: $L(\mathbf{W},\boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]$
Using $\dfrac{\partial\,\operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}]}{\partial\mathbf{W}} = 2\mathbf{S}_b\mathbf{W}$ and $\dfrac{\partial\,\operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]}{\partial\mathbf{W}} = 2\mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$:
$\dfrac{\partial L(\mathbf{W},\boldsymbol{\Lambda})}{\partial\mathbf{W}} = 0 \;\Rightarrow\; \mathbf{S}_b\mathbf{W} = \mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$
Hence the columns of $\mathbf{W}$ are the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ that correspond to its largest eigenvalues.
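A sketch of multiclass LDA following the result above: the columns of $\mathbf{W}$ are the leading eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ (numpy, synthetic three-class data; the helper name `lda` is illustrative, and a generalized symmetric eigensolver such as scipy.linalg.eigh(Sb, Sw) could be used instead of the explicit inverse):

```python
import numpy as np

def lda(X, labels, d):
    """Multiclass LDA: columns of W are the d leading eigenvectors of S_w^{-1} S_b.
    X is N x F (rows = observations); labels holds the class of each row."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    F = X.shape[1]
    Sw = np.zeros((F, F))
    Sb = np.zeros((F, F))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)       # within-class scatter
        Sb += np.outer(mu_c - mu, mu_c - mu)              # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)  # S_w^{-1} S_b is not symmetric in general
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:d]].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in ([0, 0, 0], [4, 0, 0], [0, 4, 0])])
labels = np.repeat([0, 1, 2], 50)
W = lda(X, labels, d=2)          # S_b has rank at most C-1 = 2, so at most 2 useful directions
Y = X @ W                        # latent representation of every observation
```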