Adaptive Discriminant Analysis by Minimum Squared Errors

Size: px

Start display at page:

Download "Adaptive Discriminant Analysis by Minimum Squared Errors"

Nathaniel Price
5 years ago
Views:

1 Adaptive Discriminant Analysis by Minimum Squared Errors Haesun Park Div. of Computational Science and Engineering Georgia Institute of Technology Atlanta, GA, U.S.A. (Joint work with Barry L. Drake and Hyunsoo Kim) Stanford, June 2006

2 Adaptive Dimension Reduction for Clustered Data Linear Discriminant Analysis (LDA) and its Generalizations for undersampled problems, LDA/GSVD Extension to kernel-based nonlinear method KDA/GSVD Relationship to Classifier design by MSE Adaptive feature subspace tracking method Test results: Facial recognition, efficient cross-validation by downdating, etc.

3 Clustered Data: Facial Recognition AT&T (ORL ) Face Database The 1st sample 400 frontal images = 40 person x 1o images each, variations in pose, facial expression image size :92 x 112 Severely Undersampled: x 400 The 35th sample

4 Adaptive Dimension Reduction of Clustered Data Original images/data Known Data 1 2 i r New Data Data preprocessing Adaptive Form Data Matrix m x n m x 1 Dimension Reducing Transformation Lower Dimensional Representation 1 2 i r q x n q x 1 Classification + update transf. / classifier Want: Adaptive Dimension Reducing Transformation that can be effectively applied across many application areas

5 Measure for Cluster Quality A = [a 1...a n ] :mxn, clustered data N i = items in class i, N i = n i, total r classes c i = average of data items in class i, centroid c = global average, global centroid (1)Within-class scatter matrix S w = 1 i r j Ni (a j c i ) (a j c i ) T (2) Between-class scatter matrix S b = 1 i r j Ni (c i c) (c i c) T (3)Total scatter matrix S t = 1 i n (a i c ) (a i c ) T NOTE: S w + S b = S t

6 Trace of Scatter Matrix trace (S w ) = 1 i r j Ni a j c i 2 2 trace (S b ) = 1 i r j Ni c i - c 2 2 trace (S t ) = 1 i r j Ni a j c 2 2 trace (S w ) trace (S b ) Dimension Reducing Transformation

7 Optimal Dimension Reducing Transformation y :mx1 G T :qxm G T y : qx1, q << m High quality clusters have small trace(s w ) & large trace(s b ) Want: G s.t. min trace(g T S w G) & max trace(g T S b G) max trace ((G T S w G) -1 (G T S b G)) LDA (Fisher 36, Rao 48) max trace (G T S b G) Orthogonal Centroid (Park et al. 03) G T G=I max trace (G T (S w +S b )G) PCA (Hotelling 33) G T G=I max trace (G T A A T G) LSI (Deerwester et al. 90) G T G=I

8 Classical LDA (Fisher 36, R ao 48) max trace ((G T S w G) -1 (G T S b G)) G : leading (r-1) e.vectors of S w -1 S b Fails when m>n (undersampled), S w singular x S H T w = H w w S w =H w H wt, H w =[a 1 -c 1, a 2 -c 1,, a n -c r ] : mxn S b =H b H bt, H b =[1/ n 1 (c 1 -c),,1/ n r (c r - c)] : mxr

9 LDA based on GSVD (LDA/GSVD) (Howland, Jeon, Park, SIMAX03, Howland and Park, IEEE TPAMI 04) Works regardless of singularity of scatter matrices S w -1 S b x = l x S b x=ls w x 2 H b H bt x = g 2 H w H wt x G comes from leading (r-1) generalized singular vectors of H bt and H w T U T H b T X V T H w T X = (S b 0) = = (S w 0) = 0 0 X T H b H bt X = X T S b X and X T H w H wt X = X T S w X Classical LDA is a special case of LDA/GSVD

10 Generalized SVD (Paige and Saunders 81) S b x=ls w x 2 H b H bt x = g 2 H w H wt x I 0 D b D w X T S b X = 0 0 X T S w X = I 0 Want G s.t. max trace (G T S b G) and min trace (G T S w G) X = [ X1 X2 X3 X4 ] g δ x i belongs to X null(s b ) c null(s w ) X 2 1 > β >0 0 < δ <1 null(s b ) c null(s w ) c X null(s b ) null(s w ) c X 4 any any null(s b ) null(s w )

11 Generalization of LDA for Undersampled Problems Regularized LDA (F ried m an 89, Z h ao et al. 99 ) LDA/GSVD : Solution G = [ X 1 X 2 ] (H o w lan d, Jeo n, P ark 03) Solutions based on Null(S w ) and Range(S b ) (C h en et al. 00, Y u & Y an g 01, P ark & P ark 03 ) Two-stage methods: Face Recognition: PCA + LDA (S w ets & W en g 96, Z h ao et al. 99 ) Information Retrieval: LSI + LDA (T o rkko la 01) Mathematical Equivalence: (H o w lan d an d P ark 03) PCA+ LDA/GSVD = LDA/GSVD LSI +LDA/GSVD = LDA/GSVD More efficient = QRD + LDA/GSVD

12 Nonlinear Dimension Reduction by Ex. Feature mapping F Kernel Functions x = x 1 x 2 F (x) = (a polynomial kernel function) x x 1 x 2 x 2 2 k (x, y) = < F (x), F (y) >= < x, y > 2, F 2D

13 Nonlinear Dimension Reduction by Kernel Functions If k(x,y) satisfies M ercer s condition, then there is a mapping F to an inner product space, k(x,y) = < F (x), F (y) > A < x, y > F F (A) k(x,y) = < F (x), F (y) > M ercer s C o n d itio n for A=[a 1,,a n ]: kernel matrix K = [ k(a i, a j ) ] 1 i, j n is positive semi-definite. Ex) RBF Kernel Function: k(a i, a j ) =exp(-s a i a j 2 )

14 Nonlinear Discriminant Analysis based on Kernel Functions (KDA/GSVD) (C. Park and H. Park, SIMAX 04) Assume a feature mapping: f : a: mx1 f(a): px1, m<<p and apply LDA/GSVD to S w and S b in feature space G : leading (r-1) generalized singular vectors of ( H w f, H b f ) S w f = H w f x H w f T S f w =H f w H ft w, H f w =[a f 1 -c f 1, a f 2 -c f 1,, a f n -c f r ] : pxn S f b =H f b H f T b, H bf =[a 1 (c f 1 -c f ),,a r (c f r - c f )] : pxr f unknown but problem can be formulated to utilize kernel fcn.

15 Classifier by MSE and Dimension Reduction by LDA/GSVD (Binary Case) (Duda et al. 01, C. Park and H. Park 04 ) MSE f(a) = a T w + b n/n 1 if a class 1 = -n/n 2 if a class 2 { LDA/GSVD d 2 S b x = g 2 S w x min 1 a 1 T 1 a n T b W _ n/n 1 -n/n 2 2 w T a +b = w T (a-c) = a x T (a-c) * Extended to (non)linear multi-class relationship (C. Park and H. Park, SIMAX, 05)

16 Relationship between Kernelized MSE and KDA/GSVD (Binary Case) (Billings and Lee 02, C. Park and H. Park 05 ) MSE f(a) = f(a) T w + b n/n 1 if a class 1 = -n/n 2 if a class 2 { LDA/GSVD d 2 S b f x = g 2 S w f x min 1 f(a 1 ) T 1 f(a n ) T b W _ n/n 1 -n/n 2 2 w T f(a) +b = w T (f(a)-c f ) = a x T (f(a)-c f ) = min e f(a) T b W _ y 2 However, f is not known.

17 Formulation of Kernelized MSE f is unknown but nonlinearization is possible using kernel functions and the fact that w = f(a) z for some z min e f(a) T b w _ y = min 2 e, f(a) T f(a) b z _ y 2 = min e K b z _ y 2 Let G = [ e K] : nx(n+1) K : symmetric positive semidefinite Solution related to KDA/GSVD is b z = G + y If rank(g) = n, then G + can be obtained from QRD of G T Let G T = Q, then G + = G T (GG T ) -1 = Q R 0 R -T 0

18 Adaptive KDA by Regularized MSE (KDA/RMSE) Replace min e K b z _ y 2 by min e K + l I b z _ y 2 G l = [ e K + l I ] : nx(n+1), rank(g l ) = n for l > 0 Solution can be obtained by QRD of G T l Updated and downdated sol. can be obtained by QRD updating/downdating. At least an order of magnitude faster than GSVD updates.

19 Adaptive Kernel MSE (Kim, Drake, and Park 05) Kernel MSE G l = [e K + l I ] = 1 k 1,1 + l k 1,n 1 k n,1 k n,n + l Appending a data point a : apply QRD updating twice G l = 1 k a,a + l k a,1 k a,n 1 k 1,a K + l I 1 k n,a Removing a data point a k : apply QRD downdating twice G l = 1 k 1,1 + l k 1,k-1 k 1,k+1 k 1,n 1 k k-1,1 k k-1,k-1 + l k k-1,k+1 k k-1,n 1 k k+1,1 k k+1,k-1 k k+1,k+1 + l k k+1,n 1 k n,1 k n,k-1 k n,k+1 k n,n + l

20 New Decision Boundaries after Deletion and Addition of Data Points New decision boundaries after deleting the 12 th point and inserting (5,6) Dash-dotted contour : a decision boundary of the adaptive KDA/RMSE Dashed contour : a decision boundary obtained by computing sol. from scratch by the RMSE.

21 Comparison Between KDA and KDA/RMSE Method Thyroid Diabetes Heart Titanic KDA KDA(75%) + adaptive KDA(25%) Average and standard deviation of test set classification errors in % for 100 partitions

22 Err min = 2.0% Face Recognition Training Dataset (AT&T) 10 persons x 5 images/person = 50 images total Each image: 46x56 Five-fold CV using KDA SVD of G = (e K): 40x41 is computed for each fold K(a 1, a 2 ) = exp(-g a 1 - a 2 2 ) Err min = 4.0% Five-fold CV using KDA/RMSE QRD of G l = (e K+lI): 50x51, and block downdate.

23 Face Recognition Testing 1. Cross validation on training dataset 2. Classification of test dataset using optimal parameters obtained from CV.

24 Updated Face Recognition Training Dataset Target: 1 st image Efficient Computing 1. Removed an old image by deckda 2. Appended a new image by inckda Result:

25 Computation time for leave-one-out cross validation (LOOCV): 2001 KDD cup drug design data, 8000 features Solid black line : computation time of ordinary LOOCV Dashed blue line : computation time of LOOCV using deckda/rmse

26 Summary / Future Research Effective Algorithm for Adaptive Disc. Analysis Utilized the relationship between LDA/GSVD and MSE Replaced SVD up/down-dating by QRD up/down-dating Applicable to a wide range of problems (Facial recognition, text classification, faster cross validation algorithm s ) Current and Future Research * Development of recursive feature tracking system based on recursive KDA/RMSE / parallel implementation * Utilization of other efficient methods such as complete orthogonal decomposition * Establish Mathematical Relationships among D im ension R eduction, C lassifier D esign, D ata R eduction, Thank you!

Multiclass Classifiers Based on Dimension Reduction. with Generalized LDA

Multiclass Classifiers Based on Dimension Reduction with Generalized LDA Hyunsoo Kim a Barry L Drake a Haesun Park a a College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA Abstract