
Computation. For QDA we need to calculate:

$$\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) + \log(\pi_k)$$

Let's first consider the case that $\Sigma_k = I$. This is the case where each distribution is spherical around its mean point.

Case 1: When $\Sigma_k = I$. We have:

$$\delta_k(x) = -\frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log(\pi_k)$$

but $\log(|I|) = \log(1) = 0$, and $(x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k)$ is the squared Euclidean distance between the two points $x$ and $\mu_k$.

Thus under this condition (i.e. $\Sigma_k = I$), a new point can be classified by its distance from the center of each class, adjusted by the prior. Further, for a two-class problem with equal priors, the discriminating function would be the perpendicular bisector of the segment joining the two class means.
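As a quick illustration of this rule (not from the original notes), here is a minimal NumPy sketch; the class means, priors, and test point are hypothetical placeholders.

```python
import numpy as np

# Hypothetical class means and priors for a 3-class, 2-D problem with Sigma_k = I.
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # one row per class mean mu_k
priors = np.array([0.5, 0.25, 0.25])                      # class priors pi_k

def classify_spherical(x, means, priors):
    """Case 1: delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k); pick the largest delta_k."""
    sq_dists = np.sum((means - x) ** 2, axis=1)           # squared Euclidean distances to each mean
    deltas = -0.5 * sq_dists + np.log(priors)
    return int(np.argmax(deltas))

print(classify_spherical(np.array([2.5, 0.5]), means, priors))  # expected: class 1
```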

Case 2: When $\Sigma_k \neq I$. Using the singular value decomposition (SVD) of $\Sigma_k$ we get $\Sigma_k = U_k S_k V_k^\top$. Note: since $\Sigma_k$ is a symmetric matrix ($\Sigma_k = \Sigma_k^\top$), this becomes $\Sigma_k = U_k S_k U_k^\top$, so we have

$$\begin{aligned}
(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) &= (x-\mu_k)^\top U_k S_k^{-1} U_k^\top (x-\mu_k) \\
&= (U_k^\top x - U_k^\top\mu_k)^\top S_k^{-1} (U_k^\top x - U_k^\top\mu_k) \\
&= (U_k^\top x - U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}} S_k^{-\frac{1}{2}} (U_k^\top x - U_k^\top\mu_k) \\
&= (S_k^{-\frac{1}{2}} U_k^\top x - S_k^{-\frac{1}{2}} U_k^\top\mu_k)^\top I (S_k^{-\frac{1}{2}} U_k^\top x - S_k^{-\frac{1}{2}} U_k^\top\mu_k) \\
&= (S_k^{-\frac{1}{2}} U_k^\top x - S_k^{-\frac{1}{2}} U_k^\top\mu_k)^\top (S_k^{-\frac{1}{2}} U_k^\top x - S_k^{-\frac{1}{2}} U_k^\top\mu_k)
\end{aligned}$$

For $\delta_k$, the second term is thus what is also known as the Mahalanobis distance, $(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)$.
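A short numeric sanity check of this chain of equalities, using a made-up symmetric positive definite $\Sigma_k$, $x$, and $\mu_k$ (not values from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)        # a made-up symmetric positive definite Sigma_k
x = rng.standard_normal(3)
mu = rng.standard_normal(3)

# Direct quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# Via SVD: Sigma = U S U^T, so the same quantity is || S^{-1/2} U^T (x - mu) ||^2
U, S, _ = np.linalg.svd(Sigma)
z = np.diag(S ** -0.5) @ U.T @ (x - mu)
via_svd = z @ z

print(np.isclose(direct, via_svd))     # True: both give the squared Mahalanobis distance
```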

Think of $S_k^{-\frac{1}{2}} U_k^\top$ as a linear transformation that takes points in class $k$ and distributes them spherically around a point, like in Case 1. Thus when we are given a new point, we can apply the modified $\delta_k$ values to calculate $h(x)$. After applying the singular value decomposition, $\Sigma_k^{-1}$ is considered to be an identity matrix such that

$$\delta_k(x) = -\frac{1}{2}\log(|I|) - \frac{1}{2}\left[(S_k^{-\frac{1}{2}} U_k^\top x - S_k^{-\frac{1}{2}} U_k^\top\mu_k)^\top (S_k^{-\frac{1}{2}} U_k^\top x - S_k^{-\frac{1}{2}} U_k^\top\mu_k)\right] + \log(\pi_k)$$

and $\log(|I|) = \log(1) = 0$.

To apply the above method with classes that have different covariance matrices (for example, the covariance matrices $\Sigma_0$ and $\Sigma_1$ in the two-class case), each of the covariance matrices has to be decomposed using SVD to find the corresponding transformation. Then each new data point has to be transformed using each class's transformation so that its distance to the mean of each class can be compared (for example, in the two-class case, the new data point would be transformed by the class-0 transformation and compared to $\mu_0$, and it would also be transformed by the class-1 transformation and compared to $\mu_1$).
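A minimal sketch of this two-class procedure, assuming hypothetical means, covariances, and priors; the $-\frac{1}{2}\log(|\Sigma_k|)$ term from the original $\delta_k$ is kept alongside the whitened distance:

```python
import numpy as np

def whitener(Sigma):
    """Return W = S^{-1/2} U^T, which maps class-k points to a spherical cloud."""
    U, S, _ = np.linalg.svd(Sigma)
    return np.diag(S ** -0.5) @ U.T

def qda_two_class(x, mu0, Sigma0, pi0, mu1, Sigma1, pi1):
    deltas = []
    for mu, Sigma, pi in [(mu0, Sigma0, pi0), (mu1, Sigma1, pi1)]:
        W = whitener(Sigma)
        diff = W @ x - W @ mu          # compare x and mu_k in that class's whitened space
        delta = -0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * (diff @ diff) + np.log(pi)
        deltas.append(delta)
    return int(np.argmax(deltas))

# Hypothetical parameters for the two classes
mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu1, Sigma1 = np.array([2.0, 2.0]), np.array([[1.0, -0.3], [-0.3, 1.5]])
print(qda_two_class(np.array([1.8, 1.5]), mu0, Sigma0, 0.5, mu1, Sigma1, 0.5))  # expected: class 1
```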

The difference between Case 1 and Case 2 (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below.

Alternative approach to QDA. There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the kernel trick that will be discussed later in the course.

In this approach the feature vector is augmented with quadratic terms (i.e. new dimensions are introduced). We then apply LDA on the new, higher-dimensional data. The motivation behind this approach is to take advantage of the fact that fewer parameters have to be calculated in LDA, which gives a more robust classifier when we have fewer data points.

Using this trick, we introduce two new vectors, $\hat{w}$ and $\hat{x}$, such that:

$$\hat{w} = [w_1, w_2, \ldots, w_d, v_1, v_2, \ldots, v_d]^\top \quad \text{and} \quad \hat{x} = [x_1, x_2, \ldots, x_d, x_1^2, x_2^2, \ldots, x_d^2]^\top$$

We can then apply LDA to estimate the new function: $\hat{g}(x, x^2) = \hat{y} = \hat{w}^\top \hat{x}$. Note that we can do this for any $x$ and in any dimension; we could extend a $d \times n$ matrix to a quadratic dimension by appending another $d \times n$ matrix containing the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a $\sin(x)$ dimension. Note that we are not applying QDA, but instead extending LDA to calculate a non-linear boundary, which will in general be different from the QDA boundary.
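As a rough sketch of this augmentation trick: the data below are synthetic (labels depend only on a quadratic function of the features), and scikit-learn's LinearDiscriminantAnalysis is used as one convenient LDA implementation, not the one from the notes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 2))                  # hypothetical 2-D data
y = (np.sum(X ** 2, axis=1) > 1.5).astype(int)     # separable only by a quadratic boundary

X_hat = np.hstack([X, X ** 2])                     # augment: x_hat = [x_1, x_2, x_1^2, x_2^2]

plain = LinearDiscriminantAnalysis().fit(X, y)
augmented = LinearDiscriminantAnalysis().fit(X_hat, y)

print("LDA on raw features:      ", plain.score(X, y))         # near chance level
print("LDA on augmented features:", augmented.score(X_hat, y)) # much higher
```

The linear boundary LDA finds in the augmented space corresponds to a quadratic boundary in the original space, which is exactly the point of the trick.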

Principal Component Analysis (PCA). Principal Component Analysis (PCA) is a method of dimensionality reduction/feature extraction that transforms the data from a $d$-dimensional space into a new coordinate system of dimension $p$, where $p \leq d$ (the worst case would be to have $p = d$).

The goal is to preserve as much of the variance in the original data as possible in the new coordinate system. Given data on $d$ variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than $d$. In practice, the data will usually not lie precisely in some lower-dimensional subspace. The new variables that form the new coordinate system are called principal components (PCs).

PCs are denoted by $u_1, u_2, \ldots, u_d$. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most $d$ PCs. Normally, not all $d$ PCs are used; rather, a subset of $p$ PCs, $u_1, u_2, \ldots, u_p$, is used to approximate the space spanned by the original data points $x = [x_1, \ldots, x_d]^\top$. We can choose $p$ based on what percentage of the variance of the original data we would like to maintain.

The first PC, $u_1$, is called the first principal component and has the maximum variance, so it accounts for the most significant variance in the data. The second PC, $u_2$, is called the second principal component and has the second-highest variance, and so on, up to the PC $u_d$, which has the minimum variance.

The most common definition of PCA, due to Hotelling, is that, for a given set of data vectors $x_i$, $i \in 1, \ldots, n$, the $p$ principal axes are those orthonormal axes onto which the variance retained under projection is maximal.

In order to capture as much of the variability as possible, let us choose the first principal component, denoted by $u_1$, to capture the maximum variance. Suppose that all centred observations are stacked into the columns of a $d \times n$ matrix $X$, where each column corresponds to a $d$-dimensional observation and there are $n$ observations. The projection of the $n$ $d$-dimensional observations onto the first principal component $u_1$ is $u_1^\top X$.

We want the projection onto this first dimension to have maximum variance:

$$\operatorname{var}(u_1^\top X) = u_1^\top S u_1$$

where $S$ is the $d \times d$ sample covariance matrix of $X$.
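A quick numeric check of this identity on made-up data, with the $n$ observations stored in the columns of $X$ as above:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 500
X = rng.standard_normal((d, n)) * np.array([[3.0], [1.0], [0.5]])  # hypothetical d x n data
X = X - X.mean(axis=1, keepdims=True)                               # centre each variable

u1 = rng.standard_normal(d)
u1 = u1 / np.linalg.norm(u1)          # a unit-length direction

S = np.cov(X)                         # d x d sample covariance matrix
lhs = np.var(u1 @ X, ddof=1)          # var(u1^T X), same normalisation as np.cov
rhs = u1 @ S @ u1                     # u1^T S u1
print(np.isclose(lhs, rhs))           # True
```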

Clearly $\operatorname{var}(u_1^\top X) = u_1^\top S u_1$ can be made arbitrarily large by increasing the magnitude of $u_1$. This means that the variance stated above has no upper limit and so we cannot find the maximum. To solve this problem, we choose $u_1$ to maximize $u_1^\top S u_1$ while constraining $u_1$ to have unit length. Therefore, we can rewrite the above optimization problem as:

$$\max_{u_1} \; u_1^\top S u_1 \quad \text{subject to} \quad u_1^\top u_1 = 1$$

To solve this optimization problem a Lagrange multiplier $\lambda$ is introduced:

$$L(u_1, \lambda) = u_1^\top S u_1 - \lambda(u_1^\top u_1 - 1)$$

Review of Lagrange multipliers. Lagrange multipliers are used to find the maximum or minimum of a function $f(x, y)$ subject to a constraint $g(x, y) = c$.

The red line shows the constraint $g(x, y) = c$. The blue lines are contours of $f(x, y)$. The point where the red line tangentially touches a blue contour is our solution. [Lagrange Multipliers, Wikipedia]

We define a new constant $\lambda$, called a Lagrange multiplier, and form the Lagrangian:

$$L(x, y, \lambda) = f(x, y) - \lambda(g(x, y) - c)$$

If $f(x^*, y^*)$ is the maximum of $f(x, y)$, there exists a $\lambda^*$ such that $(x^*, y^*, \lambda^*)$ is a stationary point of $L$ (all partial derivatives are 0). In addition, $(x^*, y^*)$ is a point at which the functions $f$ and $g$ touch but do not cross. At this point the tangents of $f$ and $g$ are parallel, or equivalently the gradients of $f$ and $g$ are parallel, such that:

$$\nabla_{x,y} f = \lambda \, \nabla_{x,y} g$$

where $\nabla_{x,y} f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$ is the gradient of $f$ and $\nabla_{x,y} g = \left(\frac{\partial g}{\partial x}, \frac{\partial g}{\partial y}\right)$ is the gradient of $g$.

Example: Suppose we want to maximize the function $f(x, y) = x + y$ subject to the constraint $x^2 + y^2 = 1$.

We can apply the Lagrange multiplier method to find the maximum value of the function $f$; the Lagrangian is:

$$L(x, y, \lambda) = x + y - \lambda(x^2 + y^2 - 1)$$

We want the partial derivatives to equal zero:

$$\frac{\partial L}{\partial x} = 1 - 2\lambda x = 0 \qquad \frac{\partial L}{\partial y} = 1 - 2\lambda y = 0 \qquad \frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0$$

Solving the system we obtain two stationary points: $(\sqrt{2}/2, \sqrt{2}/2)$ and $(-\sqrt{2}/2, -\sqrt{2}/2)$. In order to understand which one is the maximum, we just need to substitute each into $f(x, y)$ and see which gives the bigger value. In this case the maximum is at $(\sqrt{2}/2, \sqrt{2}/2)$.
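For completeness, the same system can be solved symbolically; a small sketch using SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
L = x + y - lam * (x**2 + y**2 - 1)          # Lagrangian for f = x + y, constraint x^2 + y^2 = 1

# Set all partial derivatives of L to zero and solve
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for point in stationary:
    print(point, "f =", (x + y).subs(point))
# Two stationary points: (sqrt(2)/2, sqrt(2)/2) with f = sqrt(2)  (the maximum)
# and (-sqrt(2)/2, -sqrt(2)/2) with f = -sqrt(2)
```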

$$L(u_1, \lambda) = u_1^\top S u_1 - \lambda(u_1^\top u_1 - 1) \qquad (1)$$

Differentiating with respect to $u_1$ gives $d$ equations:

$$S u_1 = \lambda u_1$$

Premultiplying both sides by $u_1^\top$ we have:

$$u_1^\top S u_1 = \lambda u_1^\top u_1 = \lambda$$

so $u_1^\top S u_1$ is maximized if $\lambda$ is the largest eigenvalue of $S$.

Clearly $\lambda$ and $u_1$ are an eigenvalue and an eigenvector of $S$. Differentiating (1) with respect to the Lagrange multiplier $\lambda$ gives us back the constraint:

$$u_1^\top u_1 = 1$$

This shows that the first principal component is given by the eigenvector associated with the largest eigenvalue of the sample covariance matrix $S$. A similar argument shows that the $p$ dominant eigenvectors of the covariance matrix $S$ determine the first $p$ principal components.
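Putting the derivation together, here is a minimal PCA sketch on hypothetical data: the PCs are taken as eigenvectors of the sample covariance matrix sorted by eigenvalue, and $p$ is chosen by the fraction of variance retained (the 95% threshold is an arbitrary example, not from the notes).

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical centred data: d = 4 variables, n = 300 observations in the columns of X
X = rng.standard_normal((4, 300)) * np.array([[5.0], [2.0], [1.0], [0.2]])
X = X - X.mean(axis=1, keepdims=True)

S = np.cov(X)                                  # d x d sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)           # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]              # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

retained = np.cumsum(eigvals) / np.sum(eigvals)
p = int(np.searchsorted(retained, 0.95)) + 1   # keep enough PCs for ~95% of the variance

U_p = eigvecs[:, :p]                           # first p principal components u_1, ..., u_p
Y = U_p.T @ X                                  # p x n projection of the data
print("p =", p, "retained variance =", retained[p - 1])
```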