for Classification Problems Sergios Petridis
1 for Classification Problems — National and Kapodistrian University of Athens, School of Sciences, Department of Informatics and Telecommunications; NCSR Demokritos, Institute of Informatics and Telecommunications, Computational Intelligence Laboratory. March 2006
2 Outline LDA HDA MMI relation
3 Feature Vector: Observation → Feature Generation → $x \in \mathbb{R}^n$ → Feature Transform → $y \in \mathbb{R}^m$
4 Feature transforms: $y = f(x)$, $f : \mathbb{R}^n \to \mathbb{R}^m$. Roles: adjusting dimensionality, changing axes, reforming the space. Transform families: scaling, selection, linear, non-linear scaling, non-linear. Transform properties: linear, feature-oriented, invertible.
5 Supervised transform learning — Criterion: suitability for a classification problem. Features: more useful, less redundant, well balanced. Final goal: assist the classifier, facilitate visualization. [Figure: samples without class labels in the $(x_1, x_2)$ plane, with a projection direction $y$]
6 Supervised transform learning — Criterion: suitability for a classification problem. Features: more useful, less redundant, well balanced. Final goal: assist the classifier, facilitate visualization. [Figure: the same samples with class labels in the $(x_1, x_2)$ plane, with a projection direction $y$]
7 Thesis aims: (1) formulate a unifying framework for supervised learning of feature transforms, associating known criteria and methods; (2) engineer new learning algorithms based on general criteria, adapted to the transform type and properties.
8 Outline LDA HDA MMI relation
9 Outline LDA HDA MMI relation
10 Supervised Linear Feature Extraction (SLFE): $y = A\,x$, mapping $\mathbb{R}^n \to \mathbb{R}^m$ with $m < n$. Equivalent expressions: extraction of $m$ linear features; searching for $m$ projections; searching for the $n \times m$ matrix $A = [a_1\; a_2\; \cdots\; a_m]$. [Figure: network diagram with inputs $x_1, \ldots, x_n$, weights $a_{ij}$ and outputs $y_1, \ldots, y_m$]
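To make the equivalence of these views concrete, here is a minimal NumPy sketch (illustrative only; the data and the matrix are made up, and the column-stacked convention $y = A^{\top} x$ is an assumption about orientation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                      # original and reduced dimensions, m < n
X = rng.normal(size=(100, n))    # 100 samples in R^n

# A stacks m column projections a_1 ... a_m; each feature y_j = a_j^T x
A = rng.normal(size=(n, m))
Y = X @ A                        # every row y_t lives in R^m
print(Y.shape)                   # (100, 2)
```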
11 Minimum Bayes risk: $B_R(C \mid X) \triangleq E\big[r(\hat{h}(x) \mid x)\big]$. The minimum risk depends on: the feature vector, i.e. the joint distribution $p(X = x, C = c)$; and the risk matrix $R = \begin{bmatrix} 0 & \alpha \\ \beta & 0 \end{bmatrix}$, here with $\alpha = \beta$. [Figure: class-conditional densities $p(x \mid c_1)$, $p(x \mid c_2)$ and the decision $h(x)$]
12 Minimum Bayes risk: $B_R(C \mid X) \triangleq E\big[r(\hat{h}(x) \mid x)\big]$. As before, the minimum risk depends on the joint distribution $p(X = x, C = c)$ and on the risk matrix, here $R = \begin{bmatrix} 0 & \alpha \\ \beta & 0 \end{bmatrix}$ with $\alpha > \beta$. [Figure: class-conditional densities $p(x \mid c_1)$, $p(x \mid c_2)$ and the decision $h(x)$]
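As a worked illustration (made-up densities, not from the slides), the sketch below computes the Bayes risk numerically for a two-class, one-dimensional problem with the asymmetric risk matrix above: at each $x$ the decision with the smaller expected risk is taken, and the residual risk is integrated over $p(x)$. The reading of the off-diagonal entries as the two misclassification costs is one conventional choice.

```python
import numpy as np
from scipy.stats import norm

alpha, beta = 2.0, 1.0            # R = [[0, alpha], [beta, 0]], alpha > beta
p_c = np.array([0.5, 0.5])        # class priors (assumed)

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
lik = np.vstack([norm.pdf(x, -1, 1), norm.pdf(x, +1, 1)])   # p(x | c_k)
joint = p_c[:, None] * lik                                   # p(x, c_k)

# expected risk of each decision: deciding c1 costs alpha when the truth is c2,
# deciding c2 costs beta when the truth is c1
risk_decide_c1 = alpha * joint[1]
risk_decide_c2 = beta * joint[0]
bayes_risk = np.sum(np.minimum(risk_decide_c1, risk_decide_c2)) * dx
print(f"Bayes risk: {bayes_risk:.4f}")
```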
13 The Bayes optimal transform — The Bayes criterion: the optimal transform is the one extracting features that imply minimum Bayes risk, $\hat{A} = \arg\min_{A} B_R(C \mid A X)$. Subspace optimality: optimality is not associated with individual features but with the subspace they span.
14 The number of features — Arguments in favor of dimensionality reduction: visualisation, computational complexity, the classifier's performance. However, dimensionality reduction will not lead to a reduction of the Bayes risk. [Figure: Bayes risk $B_R$ versus dimension, between its maximum at 0 and its minimum at $n$]
15 Goals — Three distinct goals: no loss with a known risk matrix; no loss with an unknown risk matrix; minimum Bayes risk under constraints on the dimension. [Figure: Bayes risk versus dimension for risk matrices $R_1$, $R_2$, with candidate dimensions $m_1$, $m_2$, $\hat{m}$ between 0 and $n$]
16 Zero Information / Risk Loss — We define two models: the Zero Information Loss model (ZIL) and the Zero Risk Loss model (ZRL), the latter requiring $\forall x,\; r(\hat{c} \mid x) = r(\hat{c} \mid s)$ for the reduced features $s$. ZRL together with a two-class problem implies ZIL. [Figure: decomposition of $x$ into an informative part $s$ and a complement $\zeta$]
17 Lossless extraction under constraints — We prove that: under ZIL, lossless SLFE with an unknown risk matrix leads to exactly d features; under ZRL, lossless SLFE with a known risk matrix leads to a subspace of dimension at most d. [Figure, Example 1 (ZIL): decision regions $R_1$, $R_2$, $R_3$ in the $(x_1, x_2)$ plane with directions $s$, $\zeta$ and angle $\varphi$]
18 Lossless extraction under constraints — We prove that: under ZIL, lossless SLFE with an unknown risk matrix leads to exactly d features; under ZRL, lossless SLFE with a known risk matrix leads to a subspace of dimension at most d. [Figure, Example 2 (ZIL, ZRL): decision regions $R_{1a}$, $R_{1b}$, $R_{1c}$, $R_2$, $R_3$ in the $(x_1, x_2)$ plane with direction $s$ and complement $\zeta$]
19 Lossless extraction under constraints — We prove that: under ZIL, lossless SLFE with an unknown risk matrix leads to exactly d features; under ZRL, lossless SLFE with a known risk matrix leads to a subspace of dimension at most d. [Figure, Example 3: decision regions $R_1$, $R_{1b}$, $R_2$, $R_{2b}$, $R_3$ in the $(x_1, x_2)$ plane with directions $s$, $s_2$, complements $\zeta$, $\zeta_2$ and angle $\varphi$]
20 SLFE goals: Conclusions — Extracting linear features versus extracting a feature subspace; knowing the risk matrix is significant for optimal reduction; for 2-class problems, ZRL = ZIL.
21 Outline LDA HDA MMI relation
22 LDA HDA MMI relation — Three distinct criteria, all different from the Bayes criterion. Experimental evidence: LDA has small complexity and is efficient; HDA has moderate complexity and is superior to LDA; MMI has high complexity and is superior to LDA. Structural relation: what is the connection between these criteria? Relation in form, relation in optimality, derivation of one from the other.
23 Linear Discriminant Analysis (LDA), Fisher: large inter-class (between-means) variance, small within-class variance. Optimality: the Homoscedastic Gaussian model (HOG). [Figure: two classes with means $\mu_1$, $\mu_2$ in the $(x_1, x_2)$ plane and the LDA direction $A$]
24 Heteroscedastic Discriminant Analysis (HDA), Kumar and Andreou 1996: there is a subspace where both the class means and the covariance matrices are identical. Optimality: the Kumar–Andreou Heteroscedastic model (KAH). [Figure: two classes in the $(x_1, x_2)$ plane with $\mu_1 = \mu_2$ along one direction and the HDA direction $A$]
25 Maximization of Mutual Information (MMI) (Shannon; Lewis 1962): knowledge of the feature vector reduces the class uncertainty, $I(C ; X) = H(C) - H(C \mid X)$. Optimality: the ZIL model. [Figure: samples in the $(x_1, x_2)$ plane]
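A small sketch (toy data, plug-in histogram estimates, all parameters placeholders) of how $I(C; X) = H(C) - H(C \mid X)$ can be estimated for a one-dimensional projected feature, which is the quantity the MMI criterion maximizes over the projection:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(y, c, bins=20):
    """Plug-in estimate of I(C; Y) for a 1-D feature y and discrete labels c."""
    joint, _, _ = np.histogram2d(y, c, bins=[bins, len(np.unique(c))])
    joint /= joint.sum()
    p_y, p_c = joint.sum(axis=1), joint.sum(axis=0)
    # H(Y) + H(C) - H(Y, C) = H(C) - H(C | Y)
    return entropy(p_y) + entropy(p_c) - entropy(joint.ravel())

rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=2000)
y = c + 0.8 * rng.normal(size=2000)       # an informative feature
print(mutual_information(y, c))
```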
26 Original criteria formulation, $\hat{A} = \arg\max_A$ of — LDA: $\log\dfrac{|\mathrm{cov}(A X)|}{|\mathrm{cov}(A X \mid C)|}$; HDA: $-\sum_{k=1}^{K} p(c_k)\,\log|\mathrm{cov}(A X \mid c_k)| + 2\log|\tilde{A}|$; MMI: $I(C ; A X)$.
27 Criteria reformulation, $\hat{A} = \arg\max_A$ of — LDA: $\log\dfrac{|\mathrm{cov}(A X)|}{\big|\sum_{k=1}^{K} p(c_k)\,\mathrm{cov}(A X \mid c_k)\big|}$; HDA: $\log\dfrac{|\mathrm{cov}(A X)|}{\prod_{k=1}^{K} |\mathrm{cov}(A X \mid c_k)|^{p(c_k)}}$; MMI: $\log\dfrac{|\mathrm{cov}(A X)|}{\prod_{k=1}^{K} |\mathrm{cov}(A X \mid c_k)|^{p(c_k)}} - 2\big(J(A X) - J(A X \mid C)\big)$.
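As a concrete reference point for the LDA criterion above, a compact sketch (toy data; the standard Fisher formulation with between- and within-class covariances, rather than the thesis notation) of solving it as a generalized eigenproblem:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, c, m):
    """Return the n x m LDA projection maximizing between- vs within-class scatter."""
    classes, counts = np.unique(c, return_counts=True)
    priors = counts / len(c)
    mu = X.mean(axis=0)
    Sw = sum(p * np.cov(X[c == k].T) for k, p in zip(classes, priors))       # within-class
    Sb = sum(p * np.outer(X[c == k].mean(0) - mu, X[c == k].mean(0) - mu)
             for k, p in zip(classes, priors))                               # between-class
    # generalized eigenproblem Sb a = lambda Sw a; keep the m leading eigenvectors
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, np.argsort(vals)[::-1][:m]]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal([2, 0, 0], 1, (100, 3))])
c = np.repeat([0, 1], 100)
A = lda_transform(X, c, 1)
Y = X @ A                      # the extracted feature(s)
```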
28 Model hierarchy — We prove that: when a problem conforms to HOG, it also conforms to KAH; when a problem conforms to KAH, it also conforms to ZIL. A common source–noise view covers all three models: HOG ⊂ KAH ⊂ ZIL.
29 Criteria equivalence — Under ZIL: Bayes = MMI. Under KAH: Bayes = MMI = HDA. Under HOG: Bayes = MMI = HDA = LDA. Hence a derivation of LDA through information theory.
30 Simulations (1): success of all three criteria. [Figure: two-class data in the $(x_1, x_2)$ plane with the directions found by LDA, HDA and MMI]
31 Simulations (2): success of HDA and MMI only. [Figure: two-class data in the $(x_1, x_2)$ plane with the directions found by LDA, HDA and MMI]
32 Simulations (3): success of MMI only. [Figure: two-class data in the $(x_1, x_2)$ plane with the directions found by LDA, HDA and MMI]
33 LDA HDA MMI: Conclusions — Hierarchical connection between the criteria; the MMI criterion is optimal over a wider range of problems; derivation of the LDA criterion from information theory.
34 Outline LDA HDA MMI relation
35 Motivation — A change in the Bayes risk within a subspace implies usefulness of the subspace. Local sensitivity: $\dfrac{\partial r(\hat{c} \mid x)}{\partial A}$. Subspace risk sensitivity: $B_R^{\mathrm{var}}(A \mid C, X) = \displaystyle\int_{\mathbb{R}^n} \left\| \frac{\partial r(\hat{c} \mid x)}{\partial A} \right\|^2 p(x)\, dx$.
36 Subspace risk sensitivity — Maximization of Bayes Risk Sensitivity (MBRS): the optimal transform is the one that extracts features implying maximum Bayes risk sensitivity, $\hat{A}_{\mathrm{MBRS}} = \arg\max_A B_R^{\mathrm{var}}(A \mid C, X)$. Optimality: the Zero Risk Sensitivity Loss model (ZRSL); ZRL plus differentiability implies ZRSL.
37 The SPCA algorithm — Supervised Principal Component Analysis (SPCA): define the feature risk sensitivity matrix; express the local sensitivity as the trace of a quadratic form; solve an eigenvector–eigenvalue problem; evaluate the sensitivities as derivative estimates via the Parzen method. SPCA reduces to Principal Component Analysis (PCA) by replacing each sample with the corresponding sample sensitivity $\left[\dfrac{\partial r(\hat{c}\mid x_t)}{\partial x_1},\, \dfrac{\partial r(\hat{c}\mid x_t)}{\partial x_2},\, \ldots,\, \dfrac{\partial r(\hat{c}\mid x_t)}{\partial x_n}\right]$.
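A rough sketch of that pipeline follows, under explicit assumptions that are not from the slides: a Gaussian Parzen window, a two-class problem, and numerical gradients of the estimated posterior standing in for the thesis's derivative estimators of the risk.

```python
import numpy as np

def posterior_c1(x, X, c, h=0.5):
    """Parzen estimate of p(c=1 | x) with a Gaussian window of width h."""
    w = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h**2)
    return np.sum(w[c == 1]) / np.sum(w)

def spca(X, c, m, h=0.5, eps=1e-3):
    n = X.shape[1]
    grads = np.zeros_like(X)
    for t, x in enumerate(X):                       # per-sample sensitivity (risk proxy)
        for i in range(n):
            e = np.zeros(n)
            e[i] = eps
            grads[t, i] = (posterior_c1(x + e, X, c, h) -
                           posterior_c1(x - e, X, c, h)) / (2 * eps)
    # "PCA on sensitivities": eigenvectors of the sensitivity scatter matrix
    S = grads.T @ grads / len(X)
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1][:m]]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal([1.5, 0, 0], 1, (60, 3))])
c = np.repeat([0, 1], 60)
A = spca(X, c, m=2)
```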
38 Evaluation (1): methodology — Goals: no loss; 1, 2 and 3 features. Framework: comparison against LDA; classifier: k-NN; 30-fold cross-validation; paired t-test (0.95). [Table: features, dimension, classes and samples for the benchmark problems cancer, card, diabetes, glass, heart, horse, iris, sonar, soybean and xorrot]
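The comparison protocol can be sketched with scikit-learn and SciPy; here a k-NN classifier on raw features is compared against k-NN after an LDA transform, as a stand-in for the LDA-versus-SPCA comparison. The 30 repetitions and the 0.95 level follow the slide; the dataset (iris), the value of k and the 5-fold splits are placeholders.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)                  # k is a placeholder
lda_knn = make_pipeline(LinearDiscriminantAnalysis(n_components=1), knn)

scores_raw, scores_lda = [], []
for seed in range(30):                                     # 30 repetitions, as on the slide
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores_raw.append(cross_val_score(knn, X, y, cv=cv).mean())
    scores_lda.append(cross_val_score(lda_knn, X, y, cv=cv).mean())

t, p = ttest_rel(scores_raw, scores_lda)                   # paired t-test at the 0.95 level
print(np.mean(scores_raw), np.mean(scores_lda), p < 0.05)
```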
39 Evaluation (2): artificial data — rotated XOR (XORROT). [Figure: the best 2 dimensions]
40 Evaluation (2): optimal dimensionality — problem (dimension, performance): cancer (1, 90%), card (2, 96%), diabetes (2, 75%), glass (6, 67%), heart (3, 91%), horse (2, 96%), iris (1, 75%), sonar (14, 77%), soybean (20, 76%), xorrot (2, 75%).
41 Evaluation (3): constrained number of dimensions. [Table: performance difference and t-test result for 1, 2 and 3 extracted features on cancer, card, diabetes, glass, heart, horse, iris, sonar, soybean and xorrot]
42 Evaluation (4): performance against dimension. [Figure: classification performance versus number of features for LDA and SPCA on soybean and sonar]
43 The MBRS criterion: Conclusions — Equivalence between MBRS and the Bayes criterion under ZRSL; formulation as an eigenvector–eigenvalue problem; Parzen-based estimation; SPCA wins over LDA.
44 Outline LDA HDA MMI relation
45 Outline LDA HDA MMI relation
46 Motivation — The Bayes and MMI criteria don't account for the increase in generalization ability (the curse of dimensionality problem). For invertible transforms, the Bayes and MMI criteria don't apply: $B_R(C \mid f(X)) = B_R(C \mid X)$ and $I(f(X); C) = I(X; C)$.
47 Experiment: impact of sphering. [Table: comparison of the original versus the sphered feature space (difference, t-test) on cancer, card, diabetes, gene, glass, heart, horse, iris, sonar, soybean and xorrot]
48 Experiment: scaling vs rejection — Invertible linear transform $A = D\,Q$ (a diagonal scaling $D$ times a matrix $Q$); rejection as the limit of scaling, $A = \lim_{d_l \to 0,\ l \in L} D\,Q$. [Figure: performance curves for LDA and SPCA on card, diabetes and cancer]
49 Focusing in feature space — Focalization kernel $K(s \mid z, \Phi) \propto e^{-z \|\Phi s\|^2}$, with focus magnitude $z$ and focus angle matrix $\Phi$. $P(X = x \mid z, \Phi) = \displaystyle\int_{\mathbb{R}^n} K(s \mid z(x), \Phi(x))\, p(X = x + s)\, ds$. [Figure: focusing regions around a point $x$ in the $(X_1, X_2)$ plane]
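A minimal numerical sketch of the focalization idea (one dimension, isotropic $\Phi = 1$; the density and the grid are made up): the effective distribution is the original density smoothed by the focus kernel, and a larger focus magnitude $z$ means a narrower kernel, i.e. less smoothing.

```python
import numpy as np

x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]
p = 0.5 * np.exp(-0.5 * (x + 1) ** 2) + 0.5 * np.exp(-0.5 * (x - 1) ** 2)
p /= np.sum(p) * dx                                   # a toy density p(x)

def effective_density(p, x, z):
    """P(X = x | z) = integral over s of K(s|z) p(x + s) ds, with K(s|z) ~ exp(-z s^2)."""
    s = x - x[:, None]                                # displacement s = x' - x for every pair
    K = np.exp(-z * s ** 2)
    K /= K.sum(axis=1, keepdims=True) * dx            # normalize the kernel per point
    q = (K * p[None, :]).sum(axis=1) * dx
    return q / (q.sum() * dx)

p_eff = effective_density(p, x, z=2.0)                # finite focus: a blurred version of p
p_sharp = effective_density(p, x, z=200.0)            # large z approaches the original p
```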
50 Effective random variable $\tilde{X}_{z,\Phi}$ — Defined by $p(\tilde{X}_{z,\Phi} = x) \triangleq \dfrac{P(X = x \mid z, \Phi)}{\int_{\mathbb{R}^n} P(X = x \mid z, \Phi)\, dx}$. Effective functionals: effective Bayes risk $\tilde{B}_R(C \mid X) \triangleq B_R(C \mid \tilde{X}_{z,\Phi})$ and effective information $\tilde{I}_{z,\Phi}(X ; C) \triangleq I(\tilde{X}_{z,\Phi} ; C)$. Limit of infinite accuracy: $\lim_{z \to \infty} \tilde{X}_{z,\Phi} = X$.
51 Generalized learning criteria — Space suitability: stationarity, $\Phi(x) = \Phi$; isotropy, $\Phi = I_n$. Minimization of Isotropic Risk (MIR): $\hat{f} = \arg\min_{f \in \mathcal{F}} \tilde{B}_R(C \mid f(X))$. Maximization of Isotropic Information (MII): $\hat{f} = \arg\max_{f \in \mathcal{F}} \tilde{I}(f(X) ; C)$.
52 Generalized learning criteria — Finding the optimal transform by approximation: finding locally optimal transforms, then aggregating them into a global transform. [Figure: local kernels $K_1$, $K_2$, $K_3$ around a point $x_0$ in the $(X_1, X_2)$ plane, classes $\omega_1$, $\omega_2$]
53 Generalized learning criteria — Finding the optimal transform by approximation: finding locally optimal transforms, then aggregating them into a global transform.
54 Generalized learning criteria — Finding the optimal transform by approximation: finding locally optimal transforms, then aggregating them into a global transform. [Figure: the transformed point $f(x_0)$ and kernel $K_2$ in the $(Y_1, Y_2)$ plane, classes $\omega_1$, $\omega_2$]
55 Generalized criteria: Conclusions — Integrating partial probability knowledge as a new random variable; space isotropy as a desired property; accounting for classifiers' limitations regarding local adaptation and accuracy; dimensionality is not a problem on its own; bridging invertible and non-invertible learning criteria at the infinite accuracy limit.
56 Outline LDA HDA MMI relation
57 Motivation — When is dimensionality reduction required? How can a transform be interpretable? When do we need non-linear transforms? [Figure: two classes $c_1$, $c_2$ in the $(x_1, x_2)$ plane]
58 Motivation — When is dimensionality reduction required? How can a transform be interpretable? When do we need non-linear transforms? [Figure: a second two-class configuration $c_1$, $c_2$ in the $(x_1, x_2)$ plane]
59 The Non-Linear Feature Scaling (NLFS) transform: $f : \mathbb{R}^n \to \mathbb{R}^n$, $f(x) = [f_1(x_1), f_2(x_2), \ldots, f_n(x_n)]$, with stretching function $\dfrac{\partial f_i(x_i)}{\partial x_i} > 0$. Properties: preserves dimensionality, feature-oriented, no information loss.
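A small sketch of what such a feature-oriented, invertible stretching transform looks like in code (the particular stretching functions are arbitrary examples, not the learned ones): each coordinate is mapped by its own strictly increasing $f_i$, so dimensionality, ordering and information are preserved.

```python
import numpy as np

# one strictly increasing stretching function per feature (illustrative choices only)
stretchers = [
    lambda x: np.log1p(np.exp(x)),      # softplus: compresses negative values, near-linear above
    lambda x: x ** 3 + x,               # stretches the tails, keeps ordering
]

def nlfs(X):
    """Apply f(x) = [f_1(x_1), ..., f_n(x_n)] column by column."""
    return np.column_stack([f(X[:, i]) for i, f in enumerate(stretchers)])

X = np.random.default_rng(0).normal(size=(5, 2))
Y = nlfs(X)
assert np.all(np.argsort(X[:, 0]) == np.argsort(Y[:, 0]))   # sample ordering preserved
```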
60 Interpretability — Preserving sample ordering; changing distances between samples; shrinking and expanding the space. [Figure: densities $p(x_i)$ and $p(y_i)$ before and after the transform $y_i = f_i(x_i)$, with $p(y_i) \propto 1/g(x_i)$; the distances satisfy $d_{AB} > d_{BC}$ before and $d_{AB} < d_{BC}$ after]
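The density relation hinted at in the figure follows from the usual change-of-variables argument; a short derivation (a standard result, stated under the assumption that each $f_i$ is differentiable and strictly increasing):

```latex
% For y_i = f_i(x_i) with f_i'(x_i) > 0, equating probability mass over
% corresponding intervals gives p_{Y_i}(y_i)\, dy_i = p_{X_i}(x_i)\, dx_i,
% and since dy_i = f_i'(x_i)\, dx_i:
\[
  p_{Y_i}(y_i) \;=\; \frac{p_{X_i}(x_i)}{f_i'(x_i)}, \qquad y_i = f_i(x_i).
\]
% Hence expanding a region (f_i' > 1) lowers its density, while shrinking it
% (f_i' < 1) raises it.
```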
61 Grid transform. [Figure: a grid of $d = 10$ cells per feature over the $(x_1, x_2)$ plane, shown before and after the transform; cell lengths $l_1$, $l_2$ (e.g. $l_{2,7}$), slope $s_{1,4}$ and reference points A, B, G]
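One plausible concrete reading of such a grid transform, sketched under explicit assumptions ($d = 10$ equal-width cells per feature as on the slide; the per-cell slopes are placeholders standing in for learned values): each feature is mapped piecewise-linearly over its grid, so the stretching function is piecewise-constant and the map stays monotone and invertible.

```python
import numpy as np

def grid_transform(x, lo, hi, slopes):
    """Piecewise-linear monotone map of one feature over a grid of len(slopes) cells."""
    d = len(slopes)
    edges = np.linspace(lo, hi, d + 1)
    # output grid: cumulative cell lengths scaled by the (positive) per-cell slopes
    out_edges = np.concatenate([[0.0], np.cumsum(slopes * np.diff(edges))])
    return np.interp(x, edges, out_edges)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
slopes = rng.uniform(0.2, 2.0, size=10)          # d = 10 cells, placeholder slopes > 0
y = grid_transform(x, 0.0, 1.0, slopes)
assert np.all(np.diff(y[np.argsort(x)]) >= 0)    # monotonicity preserved
```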
62 Transform learning — The Isotropic NLFS (ISONLFS) algorithm: application of the MII criterion; the locally optimal matrices are diagonal; definition of a near-projection matrix; weighted matrix summation with respect to the local near-projected effective information; evaluation as effective feature information.
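The algorithm is only summarized on the slide; the sketch below shows one way the aggregation step could be organized and is a loose structural illustration rather than the thesis algorithm. The local per-feature information estimate is a hypothetical stand-in (a simple neighborhood histogram estimate), and the neighborhood radius is a placeholder.

```python
import numpy as np

def _plugin_mi(y, c, bins=8):
    """Plug-in MI between a 1-D feature and discrete labels (as in the MMI sketch above)."""
    joint, _, _ = np.histogram2d(y, c, bins=[bins, max(len(np.unique(c)), 2)])
    joint = joint / joint.sum()
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(joint.sum(1)) + h(joint.sum(0)) - h(joint.ravel())

def isonlfs_feature_weights(X, c, radius=1.0):
    """Loose structural sketch: aggregate local per-feature information into global weights."""
    n = X.shape[1]
    weights, total = np.zeros(n), 0.0
    for x0 in X:                                             # one neighborhood per sample
        mask = np.linalg.norm(X - x0, axis=1) < radius
        if mask.sum() < 10:                                  # skip near-empty neighborhoods
            continue
        local = np.array([_plugin_mi(X[mask, i], c[mask]) for i in range(n)])
        w = local.sum()                                      # weight by total local information
        weights += w * local
        total += w
    return weights / max(total, 1e-12)                       # one scaling weight per feature
```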
63 Evaluation (1) - Artificial data
64 Evaluation (2) - Real data. [Table: comparison of SPCA and ISONLFS (difference, t-test) on cancer, card, diabetes, glass, heart, horse, iris, sonar and soybean]
65 Conclusions — NLFS: reversibility, interpretability, low complexity; application of the maximization of isotropic information; a low-complexity learning algorithm; improved classification performance.
66 Contributions (1/2) — 1. Formulation of a concise framework of goals and models for comparing supervised feature-transform learning criteria. 2. Establishment of a relation between the linear discriminant analysis, heteroscedastic discriminant analysis and mutual information criteria. 3. Formulation of the new Bayes risk sensitivity criterion and derivation of the SPCA algorithm.
67 Contributions (2/2) — 1. Generalization of the Bayes and MMI criteria, unifying invertible and non-invertible learning criteria. 2. Study of non-linear feature scaling transforms and derivation of the ISONLFS algorithm.
68 Conclusions. [Diagram: the criteria (Bayes, MBRS, MMI, MIR, MII, HDA, LDA), the models (ZRL, ZRSL, ZIL, KAH, HOG) and the algorithms (SPCA, ISONLFS) arranged along the non-invertibility and linearity axes]
69 Future directions — Enhance the feature space models; a goal-independent relation between MMI and LDA; the impact of generalized entropies; mutual information sensitivity analysis; effective information: maximization with respect to the focalization magnitude; adaptive grid resolution for ISONLFS; integrate isotropicalization into classifiers.
70 Publications 1/4
S. Petridis and S. J. Perantonis. On the relation between discriminant analysis and mutual information for supervised linear feature extraction. Pattern Recognition, 37, 2004.
S. Petridis and S. J. Perantonis. Isotropic information maximization for non-linear feature scaling. IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted, 2004.
S. Petridis and S. J. Perantonis. Feature deforming for improved similarity based learning. In Proceedings of the 3rd Hellenic Conference on Artificial Intelligence, volume 3025 of Lecture Notes in Artificial Intelligence, 2004.
71 Publications 2/4
S. Petridis, E. Charou, and S. J. Perantonis. Non-redundant feature selection of multi-band remotely sensed images for land cover classification. In Proceedings of the 2003 Tyrrhenian International Workshop on Remote Sensing, 2003.
S. J. Perantonis, S. Petridis, and V. Virvilis. Supervised principal component analysis using a smooth classifier paradigm. In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, 2000.
E. Charou, S. Petridis, M. Stefouli, O. D. Mavrantza, and S. J. Perantonis. Innovative feature selection used in multispectral imagery classification for water quality monitoring.
72 Publications 3/4
B. Gatos, K. Ntzios, I. Pratikakis, S. Petridis, T. Konidaris, and S. J. Perantonis. An efficient segmentation-free approach to assist old Greek handwritten manuscript OCR. Pattern Analysis and Applications, 8(4), 2006.
A. Agogino, J. Gosh, S. J. Perantonis, V. Virvilis, S. Petridis, and P. J. G. Lisboa. The role of multiple, linear-projection based visualization techniques in RBF-based classification of high dimensional data. In Proceedings of the International Joint Conference on Neural Networks, volume 3, 2000.
S. J. Perantonis, N. Ampazis, V. Virvilis, and S. Petridis. Open issues in feedforward neural network training. In LFTNC (Siena, Italy), 2001.
73 Publications 4/4
G. Petasis, S. Petridis, and G. Paliouras. Symbolic and neural learning for named-entity recognition. In G. Tselentis and H.-J. Zimmermann, editors, Advances in Computational Intelligence and Learning: Methods and Applications. Kluwer Academic Publishers, 2001.
G. Petasis, S. Petridis, G. Paliouras, and V. Karkaletsis. Symbolic and neural learning for named-entity recognition. In Proceedings of the European Best Practice Workshops and Symposium on Computational Intelligence and Learning, 2000.
B. Gatos, K. Ntzios, I. Pratikakis, S. Petridis, T. Konidaris, and S. J. Perantonis. A segmentation-free recognition technique to assist old Greek handwritten manuscript OCR. In Proceedings of Document Analysis Systems, volume 3163 of