Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chap. 20: Linear Discriminant Analysis

Linear Discriminant Analysis

Given labeled data consisting of $d$-dimensional points $x_i$ along with their classes $y_i$, the goal of linear discriminant analysis (LDA) is to find a vector $w$ that maximizes the separation between the classes after projection onto $w$. The key difference between principal component analysis and LDA is that the former deals with unlabeled data and tries to maximize variance, whereas the latter deals with labeled data and tries to maximize the discrimination between the classes.

Projection onto a Line

Let $D_i$ denote the subset of points labeled with class $c_i$, i.e., $D_i = \{x_j \mid y_j = c_i\}$, and let $|D_i| = n_i$ denote the number of points with class $c_i$. We assume that there are only $k = 2$ classes.

The projection of any $d$-dimensional point $x_i$ onto a unit vector $w$ is given as
$$x_i' = \left(\frac{w^T x_i}{w^T w}\right) w = \left(w^T x_i\right) w = a_i\, w$$
where $a_i$ specifies the offset or coordinate of $x_i$ along the line $w$:
$$a_i = w^T x_i$$
The set of $n$ scalars $\{a_1, a_2, \ldots, a_n\}$ represents the mapping from $\mathbb{R}^d$ to $\mathbb{R}$, that is, from the original $d$-dimensional space to a 1-dimensional space (along $w$).
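As a quick illustration (not from the slides), the offsets $a_i$ for a whole data matrix can be computed in one line with NumPy; the arrays below are made-up examples:

```python
import numpy as np

X = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 2.5]])  # hypothetical 2-D points, one per row
w = np.array([0.6, -0.8])                           # an assumed unit direction vector
a = X @ w                        # offsets a_i = w^T x_i along the line w
X_proj = np.outer(a, w)          # projected points a_i * w, back in the original space
```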

Projection onto w: Iris 2D Data

[Figure: Iris 2D data projected onto a direction $w$, with iris-setosa as class $c_1$ (circles) and the other two Iris types as class $c_2$ (triangles).]

Optimal Linear Discriminant Direction

[Figure: the optimal linear discriminant direction $w$ for the Iris 2D data.]

Optimal Linear Discriminant

The means of the projected points for the two classes are given as
$$m_1 = w^T \mu_1 \qquad m_2 = w^T \mu_2$$
To maximize the separation between the classes, we maximize the difference between the projected means, $|m_1 - m_2|$. However, for good separation, the variance of the projected points for each class should also not be too large. LDA maximizes the separation by ensuring that the scatter $s_i^2$ of the projected points within each class is small, where the scatter is defined as
$$s_i^2 = \sum_{x_j \in D_i} (a_j - m_i)^2 = n_i\,\sigma_i^2$$
where $\sigma_i^2$ is the variance for class $c_i$.
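The identity $s_i^2 = n_i\,\sigma_i^2$ is easy to check numerically; the sketch below uses made-up projected offsets and relies on `np.var` computing the biased ($1/n_i$) variance:

```python
import numpy as np

a = np.array([2.1, 2.4, 1.9, 2.6])               # hypothetical projected offsets for one class
scatter = np.sum((a - a.mean()) ** 2)            # s_i^2 as defined above
assert np.isclose(scatter, a.size * np.var(a))   # equals n_i * sigma_i^2
```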

Linear Discriminant Analysis: Fisher Objective

We incorporate the two LDA criteria, namely, maximizing the distance between the projected means and minimizing the sum of the projected scatters, into a single maximization criterion called the Fisher LDA objective:
$$\max_{w} J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
In matrix terms, we can rewrite $(m_1 - m_2)^2$ as follows:
$$(m_1 - m_2)^2 = \left(w^T(\mu_1 - \mu_2)\right)^2 = w^T B w$$
where $B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is a $d \times d$ rank-one matrix called the between-class scatter matrix.

The projected scatter for class $c_i$ is given as
$$s_i^2 = \sum_{x_j \in D_i} \left(w^T x_j - w^T \mu_i\right)^2 = w^T \left(\sum_{x_j \in D_i} (x_j - \mu_i)(x_j - \mu_i)^T\right) w = w^T S_i w$$
where $S_i$ is the scatter matrix for $D_i$.

Linear Discriminant Analysis: Fisher Objective

The combined scatter for both classes is given as
$$s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S w$$
where the symmetric positive semidefinite matrix $S = S_1 + S_2$ denotes the within-class scatter matrix for the pooled data. The LDA objective function in matrix form is
$$\max_{w} J(w) = \frac{w^T B w}{w^T S w}$$
To solve for the best direction $w$, we differentiate the objective function with respect to $w$; after simplification this yields the generalized eigenvalue problem
$$B w = \lambda S w$$
where $\lambda = J(w)$ is a generalized eigenvalue of $B$ and $S$. To maximize the objective, $\lambda$ should be chosen to be the largest generalized eigenvalue, and $w$ to be the corresponding eigenvector.
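In practice the generalized eigenvalue problem $Bw = \lambda Sw$ can be handed to a standard solver, since $B$ and $S$ are symmetric and $S$ is typically positive definite. A minimal sketch with SciPy, using made-up scatter matrices:

```python
import numpy as np
from scipy.linalg import eigh

B = np.array([[2.0, -1.0], [-1.0, 0.5]])   # made-up rank-one between-class scatter
S = np.array([[3.0, 1.0], [1.0, 2.0]])     # made-up positive definite within-class scatter
evals, evecs = eigh(B, S)                  # solves B v = lambda S v; eigenvalues ascending
lam, w = evals[-1], evecs[:, -1]           # largest generalized eigenvalue and its eigenvector
w = w / np.linalg.norm(w)                  # rescale to a unit vector
```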

Linear Discriminant Algorithm

LINEARDISCRIMINANT (D = {(x_i, y_i)}_{i=1}^n):
1. D_i ← {x_j | y_j = c_i, j = 1, ..., n}, i = 1, 2   // class-specific subsets
2. µ_i ← mean(D_i), i = 1, 2                          // class means
3. B ← (µ_1 − µ_2)(µ_1 − µ_2)^T                       // between-class scatter matrix
4. Z_i ← D_i − 1_{n_i} µ_i^T, i = 1, 2                // center class matrices
5. S_i ← Z_i^T Z_i, i = 1, 2                          // class scatter matrices
6. S ← S_1 + S_2                                      // within-class scatter matrix
7. λ_1, w ← eigen(S^{−1} B)                           // compute dominant eigenvector
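A minimal NumPy sketch of this pseudocode (the function and variable names are my own; it assumes `X` is an $n \times d$ array and `y` holds exactly two class labels):

```python
import numpy as np

def linear_discriminant(X, y):
    """Two-class Fisher LDA direction, following the pseudocode above (a sketch)."""
    c1, c2 = np.unique(y)[:2]
    D1, D2 = X[y == c1], X[y == c2]                 # class-specific subsets
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)     # class means
    B = np.outer(mu1 - mu2, mu1 - mu2)              # between-class scatter matrix
    Z1, Z2 = D1 - mu1, D2 - mu2                     # centered class matrices
    S = Z1.T @ Z1 + Z2.T @ Z2                       # within-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.inv(S) @ B)    # eigenpairs of S^{-1} B
    w = np.real(evecs[:, np.argmax(np.real(evals))])      # dominant eigenvector
    return w / np.linalg.norm(w)                    # unit-length discriminant direction
```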

Linear Discriminant Direction: Iris 2D Data

The between-class scatter matrix is
$$B = \begin{pmatrix} 1.587 & -0.693 \\ -0.693 & 0.303 \end{pmatrix}$$
and the within-class scatter matrices are
$$S_1 = \begin{pmatrix} 6.09 & 4.91 \\ 4.91 & 7.11 \end{pmatrix} \qquad S_2 = \begin{pmatrix} 43.5 & 12.09 \\ 12.09 & 10.96 \end{pmatrix} \qquad S = S_1 + S_2 = \begin{pmatrix} 49.58 & 17.01 \\ 17.01 & 18.08 \end{pmatrix}$$
The direction of most separation between $c_1$ and $c_2$ is the dominant eigenvector corresponding to the largest eigenvalue of the matrix $S^{-1}B$. The solution is
$$J(w) = \lambda_1 = 0.11 \qquad w = \begin{pmatrix} 0.551 \\ -0.834 \end{pmatrix}$$

[Figure: Iris 2D data with the optimal linear discriminant direction $w$.]

Linear Discriminant Analysis: Two Classes

For the two-class scenario, if $S$ is nonsingular, we can directly solve for $w$ without computing eigenvalues and eigenvectors. Because $B$ is the rank-one matrix $(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, the vector $Bw$ always points in the same direction as $(\mu_1 - \mu_2)$:
$$B w = \left((\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\right) w = (\mu_1 - \mu_2)\left((\mu_1 - \mu_2)^T w\right) = b\,(\mu_1 - \mu_2)$$
where $b = (\mu_1 - \mu_2)^T w$ is a scalar. The generalized eigenvector equation $Bw = \lambda Sw$ can then be rewritten as
$$w = \frac{b}{\lambda}\, S^{-1}(\mu_1 - \mu_2)$$
Because $\frac{b}{\lambda}$ is just a scalar, we can solve for the best linear discriminant as
$$w = S^{-1}(\mu_1 - \mu_2)$$
We can finally normalize $w$ to be a unit vector.
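A sketch of this closed-form solution in NumPy (names are my own); `np.linalg.solve` is used instead of explicitly inverting $S$:

```python
import numpy as np

def lda_two_class_direct(X, y):
    """Closed-form two-class LDA direction w = S^{-1}(mu_1 - mu_2) (a sketch)."""
    c1, c2 = np.unique(y)[:2]
    D1, D2 = X[y == c1], X[y == c2]
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
    S = (D1 - mu1).T @ (D1 - mu1) + (D2 - mu2).T @ (D2 - mu2)   # within-class scatter
    w = np.linalg.solve(S, mu1 - mu2)                           # solves S w = mu_1 - mu_2
    return w / np.linalg.norm(w)                                # normalize to a unit vector
```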

Kernel Discriminant Analysis

The goal of kernel LDA is to find the direction vector $w$ in feature space that maximizes
$$\max_{w} J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
It is well known that $w$ can be expressed as a linear combination of the points in feature space:
$$w = \sum_{j=1}^{n} a_j\, \phi(x_j)$$
The mean for class $c_i$ in feature space is given as
$$\mu_i^\phi = \frac{1}{n_i} \sum_{x_j \in D_i} \phi(x_j)$$
and the covariance matrix for class $c_i$ in feature space is
$$\Sigma_i^\phi = \frac{1}{n_i} \sum_{x_j \in D_i} \left(\phi(x_j) - \mu_i^\phi\right)\left(\phi(x_j) - \mu_i^\phi\right)^T$$

Kernel Discriminant Analysis

The between-class scatter matrix in feature space is
$$B_\phi = \left(\mu_1^\phi - \mu_2^\phi\right)\left(\mu_1^\phi - \mu_2^\phi\right)^T$$
and the within-class scatter matrix in feature space is
$$S_\phi = n_1 \Sigma_1^\phi + n_2 \Sigma_2^\phi$$
$S_\phi$ is a $d \times d$ symmetric, positive semidefinite matrix, where $d$ is the dimensionality of the feature space. The best linear discriminant vector $w$ in feature space is the dominant eigenvector, which satisfies
$$\left(S_\phi^{-1} B_\phi\right) w = \lambda w$$
where we assume that $S_\phi$ is nonsingular.

LDA Objective via Kernel Matrix: Between-class Scatter

The projected mean for class $c_i$ is given as
$$m_i = w^T \mu_i^\phi = \frac{1}{n_i} \sum_{j=1}^{n} \sum_{x_k \in D_i} a_j\, K(x_j, x_k) = a^T \mathbf{m}_i$$
where $a = (a_1, a_2, \ldots, a_n)^T$ is the weight vector, and
$$\mathbf{m}_i = \frac{1}{n_i} \begin{pmatrix} \sum_{x_k \in D_i} K(x_1, x_k) \\ \sum_{x_k \in D_i} K(x_2, x_k) \\ \vdots \\ \sum_{x_k \in D_i} K(x_n, x_k) \end{pmatrix} = \frac{1}{n_i}\, K^{c_i} \mathbf{1}_{n_i}$$
where $K^{c_i}$ is the $n \times n_i$ subset of the kernel matrix, restricted to columns belonging to points only in $D_i$, and $\mathbf{1}_{n_i}$ is the $n_i$-dimensional vector all of whose entries are one. The separation between the projected means is thus
$$(m_1 - m_2)^2 = \left(a^T \mathbf{m}_1 - a^T \mathbf{m}_2\right)^2 = a^T M a$$
where $M = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$ is the between-class scatter matrix.
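The vector $\mathbf{m}_i = \frac{1}{n_i} K^{c_i}\mathbf{1}_{n_i}$ is just a column-block average of the kernel matrix; a one-function sketch (assuming `K` is the precomputed $n \times n$ kernel matrix and `y` the label array):

```python
import numpy as np

def kernel_class_mean(K, y, c):
    """m_i = (1/n_i) * K^{c_i} 1_{n_i}: average the kernel columns of class c (a sketch)."""
    Kc = K[:, y == c]           # n x n_i block of the kernel matrix
    return Kc.mean(axis=1)      # n-dimensional vector m_i
```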

LDA Objective via Kernel Matrix: Within-class Scatter

We can compute the projected scatter for each class, $s_1^2$ and $s_2^2$, purely in terms of the kernel function, as follows:
$$s_1^2 = \sum_{x_i \in D_1} \left(w^T \phi(x_i) - w^T \mu_1^\phi\right)^2 = a^T \left(\sum_{x_i \in D_1} K_i K_i^T - n_1 \mathbf{m}_1 \mathbf{m}_1^T\right) a = a^T N_1 a$$
where $K_i$ is the $i$th column of the kernel matrix, and $N_1$ is the class scatter matrix for $c_1$. The sum of the projected scatter values is then given as
$$s_1^2 + s_2^2 = a^T (N_1 + N_2)\, a = a^T N a$$
where $N$ is the $n \times n$ within-class scatter matrix.

Kernel LDA

The kernel LDA maximization condition is
$$\max_{w} J(w) = \max_{a} J(a) = \frac{a^T M a}{a^T N a}$$
The weight vector $a$ is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem
$$M a = \lambda_1 N a$$
When there are only two classes, $a$ can be obtained directly:
$$a = N^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
To normalize $w$ to be a unit vector we scale $a$ by $\frac{1}{\sqrt{a^T K a}}$. We can project any point $x$ onto the discriminant direction as follows:
$$w^T \phi(x) = \sum_{j=1}^{n} a_j\, \phi(x_j)^T \phi(x) = \sum_{j=1}^{n} a_j\, K(x_j, x)$$
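Projecting a new point never requires forming $\phi(x)$ explicitly; a sketch (the function and argument names are placeholders, with the homogeneous quadratic kernel from the examples that follow shown as one choice):

```python
import numpy as np

def project_kernel_lda(x, X_train, a, kernel):
    """w^T phi(x) = sum_j a_j K(x_j, x), evaluated via the kernel only (a sketch)."""
    k = np.array([kernel(xj, x) for xj in X_train])   # kernel values K(x_j, x)
    return float(a @ k)

quadratic = lambda u, v: float(u @ v) ** 2            # homogeneous quadratic kernel (x^T y)^2
```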

Kernel Discriminant Analysis Algorithm

KERNELDISCRIMINANT (D = {(x_i, y_i)}_{i=1}^n, K):
1. K ← {K(x_i, x_j)}_{i,j=1,...,n}                           // compute n × n kernel matrix
2. K^{c_i} ← {K(j, k) | y_k = c_i, 1 ≤ j, k ≤ n}, i = 1, 2   // class kernel matrices
3. m_i ← (1/n_i) K^{c_i} 1_{n_i}, i = 1, 2                   // class means
4. M ← (m_1 − m_2)(m_1 − m_2)^T                              // between-class scatter matrix
5. N_i ← K^{c_i} (I_{n_i} − (1/n_i) 1_{n_i × n_i}) (K^{c_i})^T, i = 1, 2   // class scatter matrices
6. N ← N_1 + N_2                                             // within-class scatter matrix
7. λ_1, a ← eigen(N^{−1} M)                                  // compute weight vector a
8. a ← a / sqrt(a^T K a)                                     // normalize w to be a unit vector
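A NumPy sketch of this algorithm for the two-class case (names are my own). Instead of the eigendecomposition in step 7 it uses the two-class shortcut $a = N^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ from the previous slide, and it adds a small ridge to $N$ (my addition, a common practical safeguard since $N$ can be singular):

```python
import numpy as np

def kernel_discriminant(K, y):
    """Two-class kernel LDA weight vector a from a precomputed n x n kernel matrix K (a sketch)."""
    c1, c2 = np.unique(y)[:2]
    n = K.shape[0]
    means, scatters = [], []
    for c in (c1, c2):
        Kc = K[:, y == c]                                # n x n_i class kernel block K^{c_i}
        ni = Kc.shape[1]
        means.append(Kc.sum(axis=1) / ni)                # class mean vector m_i
        C = np.eye(ni) - np.full((ni, ni), 1.0 / ni)     # centering matrix I - (1/n_i) 1 1^T
        scatters.append(Kc @ C @ Kc.T)                   # class scatter matrix N_i
    N = scatters[0] + scatters[1] + 1e-8 * np.eye(n)     # within-class scatter, with ridge
    a = np.linalg.solve(N, means[0] - means[1])          # two-class shortcut a = N^{-1}(m_1 - m_2)
    return a / np.sqrt(a @ K @ a)                        # scale so that ||w|| = 1 in feature space
```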

Kernel Discriminant Analysis: Quadratic Homogeneous Kernel

Iris 2D data: $c_1$ (circles; iris-virginica) and $c_2$ (triangles; the other two Iris types). Kernel function: $K(x_i, x_j) = (x_i^T x_j)^2$.

[Figure: contours of constant projection onto the optimal kernel discriminant, shown in the input space with axes $u_1$ and $u_2$.]

Kernel Discriminant Analysis: Quadratic Homogeneous Kernel

Iris 2D data: $c_1$ (circles; iris-virginica) and $c_2$ (triangles; the other two Iris types). Kernel function: $K(x_i, x_j) = (x_i^T x_j)^2$.

[Figure: projection of the points onto the optimal kernel discriminant direction $w$.]

Kernel Feature Space and Optimal Discriminant

Iris 2D data: $c_1$ (circles; iris-virginica) and $c_2$ (triangles; the other two Iris types). Kernel function: $K(x_i, x_j) = (x_i^T x_j)^2$.

[Figure: the data mapped into the feature space with coordinates $X_1^2$, $X_2^2$, and $X_1 X_2$, together with the optimal discriminant direction $w$.]