THE SCIENCE AND ENGINEERING REVIEW OF DOSHISHA UNIVERSITY, VOL. 50, NO. 3 October 2009

Towards Maximum Geometric Margin Minimum Error Classification

Kouta YAMADA*, Shigeru KATAGIRI*, Erik MCDERMOTT**, Hideyuki WATANABE***, Atsushi NAKAMURA**, Shinji WATANABE**, Miho OHSAKI*

(Received July 28, 2009)

The recent dramatic growth of computation power and data availability has increased research interest in discriminative training methodologies for pattern classifier design. Minimum Classification Error (MCE) training and Support Vector Machine (SVM) training methods are attracting especially great attention. The former has been widely used as a general framework for discriminatively designing various types of speech and text classifiers; the latter has become the standard technology for the effective classification of fixed-dimensional vectors. In principle, MCE aims to achieve minimum error classification, whereas SVM aims to increase the robustness of classification decisions. The simultaneous achievement of these two different goals would definitely be valuable. Motivated by this concern, in this paper we elaborate the MCE and SVM methodologies and develop a new MCE training method that leads in practice to the ideal condition of maximum geometric margin and minimum error classification.

Key words: minimum classification error training, geometric margin, functional margin, support vector machine

Representative discriminative training approaches include Minimum Squared Error (MSE) training, Minimum Classification Error (MCE) training, Support Vector Machine (SVM) training, Conditional Random Field (CRF) training 4), and Boosting 5).

*Graduate School of Engineering, Doshisha University, Kyoto, E-mail: skatagir@mail.doshisha.ac.jp, Telephone: +81-774-65-7567
**NTT Communication Science Laboratories, NTT Corporation, Kyoto
***MASTAR Project, National Institute of Information and Communications Technology, Kyoto
MCE training directly pursues the minimization of classification errors, while SVM training pursues decision robustness through margin maximization. To compare the two, we first review the common classification framework on which both methods are built.

A classifier assigns an input pattern $x$ to one of $J$ classes $C_1, \ldots, C_J$, and is designed from a finite training set $\Omega_N = \{x_1, \ldots, x_N\}$. Classification quality is evaluated by the 0-1 loss

$$\ell(C(x), C_j) = \begin{cases} 1 & \text{if } C(x) \neq C_j \text{ for } x \in C_j, \\ 0 & \text{otherwise}, \end{cases}$$

where $C(x)$ denotes the class to which $x$ is assigned. The expected loss, i.e., the classification risk, is

$$R = \sum_{j=1}^{J} \int 1\left(C(x) \neq C_j\right) p(C_j, x)\, dx,$$

where $1(\cdot)$ is the indicator function. The risk is minimized by the Bayes decision rule

$$\text{decide } C_i \text{ if } P(C_i \mid x) \geq P(C_j \mid x) \text{ for all } j \neq i.$$

Since the true posterior probabilities are unknown in practice, this rule is implemented with trainable discriminant functions:

$$\text{decide } C_i \text{ if } g_i(x; \Lambda) \geq g_j(x; \Lambda) \text{ for all } j \neq i,$$

where $g_j(x; \Lambda)$ is the discriminant function of class $C_j$ and $\Lambda$ is the set of trainable classifier parameters.
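As a concrete illustration of the discriminant-function decision rule above, the following is a minimal Python sketch; the discriminant functions themselves are placeholders (here hypothetical prototype-based scorers), and any parametric scorer could be substituted:

```python
import numpy as np

def decide(x, discriminants):
    """Discriminant-function decision rule:
    decide C_i if g_i(x) >= g_j(x) for all j != i.
    discriminants: list of callables, one g_j per class.
    Returns the 0-based index of the winning class."""
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))

# Example with two illustrative prototype-based discriminants
p1, p2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
g = [lambda x, p=p1: -np.sum((x - p) ** 2),
     lambda x, p=p2: -np.sum((x - p) ** 2)]
print(decide(np.array([0.4, 0.1]), g))  # -> 0 (closer to p1)
```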
Given the discriminant functions $g_j(x;\Lambda)$, $j = 1, \ldots, J$, MCE training embeds the decision rule in a single functional form. For a sample $x$ of class $C_j$, the misclassification measure is defined as

$$d_j(x;\Lambda) = -g_j(x;\Lambda) + \max_{i,\, i \neq j} g_i(x;\Lambda),$$

or, in a differentiable form that replaces the max operation by a log-sum-exp,

$$d_j(x;\Lambda) = -g_j(x;\Lambda) + \frac{1}{\psi}\log\left[\frac{1}{J-1}\sum_{i,\, i \neq j}\exp\left(\psi\, g_i(x;\Lambda)\right)\right] \quad (\psi > 0),$$

which approaches the max-based form as $\psi \to \infty$. Clearly, $d_j(x;\Lambda) > 0$ corresponds to a misclassification of $x$, while $d_j(x;\Lambda) \leq 0$ corresponds to a correct decision whose robustness grows with $|d_j(x;\Lambda)|$. The classification risk can thus be rewritten as

$$R(\Lambda) = \sum_{j=1}^{J} \int 1\left(d_j(x;\Lambda) > 0\right) p(C_j, x)\, dx.$$

Because the indicator $1(d_j(x;\Lambda) > 0)$ is discontinuous and unsuitable for gradient-based optimization, MCE replaces it with a smooth loss, typically the sigmoid

$$\ell_j(x;\Lambda) = \ell\left(d_j(x;\Lambda)\right) = \frac{1}{1 + \exp\left(-a\, d_j(x;\Lambda) + b\right)},$$

where $a\ (>0)$ determines the slope of the sigmoid and $b$ its offset; as $a$ grows, the smooth loss approaches the 0-1 loss.
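The following Python sketch computes the smoothed misclassification measure and the sigmoid loss for one sample. The parameter names psi, a, and b mirror the symbols above; the numerical stabilization of the log-sum-exp is an implementation choice, not part of the formulation:

```python
import numpy as np

def misclassification_measure(scores, j, psi=1.0):
    """Smoothed MCE measure:
    d_j = -g_j + (1/psi) * log( (1/(J-1)) * sum_{i != j} exp(psi * g_i) ).
    scores: length-J array of discriminant values for one sample."""
    rivals = psi * np.delete(scores, j)
    m = rivals.max()                              # stabilize the log-sum-exp
    lse = m + np.log(np.mean(np.exp(rivals - m)))
    return -scores[j] + lse / psi

def sigmoid_loss(d, a=1.0, b=0.0):
    """Smooth MCE loss l(d) = 1 / (1 + exp(-a*d + b)), a > 0."""
    return 1.0 / (1.0 + np.exp(-a * d + b))

scores = np.array([2.0, 1.2, -0.5])               # g_1, g_2, g_3 for one sample
d = misclassification_measure(scores, j=0, psi=2.0)
print(d, sigmoid_loss(d, a=4.0))                  # d < 0: correct, low loss
```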
Following the derivation in 7),8), the MCE smooth loss can be linked directly to the theoretical classification risk. First rewrite the risk as

$$R(\Lambda) = \sum_{j=1}^{J} \int 1\left(d_j(x;\Lambda) > 0\right) p(C_j, x)\, dx = \sum_{j=1}^{J} P(C_j) \int_{\{x\,:\, d_j(x,\Lambda) > 0\}} p(x \mid C_j)\, dx.$$

The set of discriminant function values $g_1(x,\Lambda), \ldots, g_j(x,\Lambda), \ldots, g_J(x,\Lambda)$ for $x$ maps, through $d_j(\cdot)$, to the scalar variable $m = d_j(x,\Lambda)$. For each class $C_j$,

$$\int_{\{x\,:\, d_j(x,\Lambda) > 0\}} p(x \mid C_j)\, dx = P\left[d_j(x;\Lambda) > 0 \mid C_j\right] = \int_{0}^{\infty} p(m \mid C_j)\, dm,$$

where $p(m \mid C_j)$ is the class-conditional probability density of the misclassification measure. Hence

$$R(\Lambda) = \sum_{j=1}^{J} P(C_j) \int_{0}^{\infty} p(m \mid C_j)\, dm.$$

The unknown density $p(m \mid C_j)$ can be estimated from the $N_j$ training samples $x_k$ of class $C_j$ by the Parzen method:

$$\hat{p}(m \mid C_j) = \frac{1}{N_j} \sum_{k=1}^{N_j} \frac{1}{h}\, W\!\left(\frac{m - d_j(x_k;\Lambda)}{h}\right),$$

where $W(\cdot)$ is a one-dimensional kernel function and $h\ (>0)$ is its bandwidth. Replacing $P(C_j)$ by its empirical estimate $N_j/N$ and $p(m \mid C_j)$ by $\hat{p}(m \mid C_j)$ yields the estimated risk

$$R_N(\Lambda) = \sum_{j=1}^{J} \frac{N_j}{N} \int_{0}^{\infty} \hat{p}(m \mid C_j)\, dm = \frac{1}{N} \sum_{j=1}^{J} \sum_{k=1}^{N_j} \int_{0}^{\infty} \frac{1}{h}\, W\!\left(\frac{m - d_j(x_k;\Lambda)}{h}\right) dm.$$
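A minimal Python sketch of the Parzen estimate $\hat{p}(m \mid C_j)$ follows; the logistic kernel $W(u) = e^{-u}/(1+e^{-u})^2$ used here is one admissible choice of kernel, assumed for illustration rather than prescribed by the derivation:

```python
import numpy as np

def logistic_kernel(u):
    """One-dimensional logistic kernel, a valid (symmetric) Parzen window."""
    e = np.exp(-np.abs(u))            # symmetric, numerically stable form
    return e / (1.0 + e) ** 2

def parzen_pdf(m, d_values, h=1.0):
    """Parzen estimate of p(m | C_j):
    (1/N_j) * sum_k (1/h) * W((m - d_k)/h).
    d_values: measures d_j(x_k) of the N_j class-C_j training samples."""
    d = np.asarray(d_values)
    return np.mean(logistic_kernel((m - d) / h)) / h

print(parzen_pdf(0.0, [-1.2, -0.4, 0.3, 0.9], h=0.5))
```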
The key observation of 7),8) is that the inner integral is itself a smooth loss:

$$\int_{0}^{\infty} \frac{1}{h}\, W\!\left(\frac{m - d_j(x_k;\Lambda)}{h}\right) dm = \ell\left(d_j(x_k;\Lambda)\right),$$

that is, integrating each kernel over the misclassification region $(0, \infty)$ produces a smooth, sigmoid-like loss of the MCE type, whose shape is determined by the kernel $W$ and whose smoothness is controlled by the bandwidth $h$. The estimated risk $R_N(\Lambda)$ is therefore exactly an MCE empirical loss. Moreover, as $N \to \infty$ and $h \to 0$, $R_N(\Lambda)$ converges to the theoretical risk $R(\Lambda)$, provided the kernels behave regularly, e.g., do not cross over to each other, as illustrated in Fig. 1.

Fig. 1. Schematic explanation of kernels that do not cross over to each other.

In this sense, MCE training with a sufficiently small smoothing constant provides a consistent, estimation-based route to minimum error classification.
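Continuing the sketch above: for the logistic kernel the integral has the closed form $1/(1 + e^{-d/h})$, i.e., precisely the MCE sigmoid loss with slope $a = 1/h$ and offset $b = 0$, so the estimated risk reduces to a simple average. The closed form below is a property of the assumed logistic kernel, not of kernels in general:

```python
import numpy as np

def estimated_risk(d_values, h=1.0):
    """R_N = (1/N) * sum_k integral_0^inf (1/h) W((m - d_k)/h) dm.
    For the logistic kernel W, the integral over the misclassification
    region (0, inf) equals the sigmoid loss 1 / (1 + exp(-d_k / h)).
    d_values: measures d_j(x_k) for all N training samples."""
    d = np.asarray(d_values)
    return np.mean(1.0 / (1.0 + np.exp(-d / h)))

# Shrinking h sharpens the loss toward the 0-1 loss:
d = np.array([-2.0, -0.3, 0.4, 1.5])
print(estimated_risk(d, h=1.0), estimated_risk(d, h=0.05))
# with h = 0.05 the risk approaches the raw error rate (2/4 = 0.5)
```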
In contrast to MCE, we next review SVM training 3),9). Consider the two-class case, in which an input pattern $x$ is classified into $C_+$ or $C_-$ by the linear discriminant function

$$g(x) = w \cdot x + b,$$

where $w$ is the weight vector and $b$ the bias: $x$ is assigned to $C_+$ if $g(x) > 0$ and to $C_-$ if $g(x) < 0$. Representing the correct class of $x$ by the label $y \in \{+1, -1\}$, the decision is correct if and only if $y\, g(x) > 0$. The quantity $y\, g(x)$ is called the functional margin; the larger it is, the more robust the decision appears. However, the functional margin can be made arbitrarily large simply by rescaling $(w, b)$, so by itself it does not measure the true robustness of the decision. A scale-invariant alternative is the geometric margin, namely the signed Euclidean distance $r$ from $x$ to the separating hyperplane $w \cdot x + b = 0$:

$$r = \frac{y\,(w \cdot x + b)}{\|w\|_2},$$

where $\|w\|_2$ denotes the $L_2$ norm of $w$.
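A small sketch contrasting the two margins; note how the functional margin doubles under rescaling of $(w, b)$ while the geometric margin is invariant:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """y * (w.x + b): the sign indicates correctness, but the
    magnitude changes under rescaling of (w, b)."""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Signed Euclidean distance from x to the hyperplane
    w.x + b = 0; invariant to rescaling of (w, b)."""
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([3.0, -4.0]), 1.0
x, y = np.array([2.0, 1.0]), +1
print(functional_margin(w, b, x, y), functional_margin(2*w, 2*b, x, y))  # 3.0, 6.0
print(geometric_margin(w, b, x, y), geometric_margin(2*w, 2*b, x, y))    # 0.6, 0.6
```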
Because the geometric margin is invariant to the scale of $(w, b)$, the hyperplane can be put in canonical form by rescaling so that the samples closest to the boundary satisfy $w \cdot x + b = \pm 1$, i.e., have functional margin 1. The $L_2$-norm geometric margin between the two classes is then

$$r = \frac{2}{\|w\|_2},$$

so maximizing the geometric margin is equivalent to minimizing $\|w\|_2$. SVM training formalizes this while tolerating training errors through slack variables. Given the training set $S = \{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$, where $x_n$ is the $n$-th training sample and $y_n \in \{+1, -1\}$ its label, the soft-margin SVM solves

$$\underset{w,\, b,\, \xi_1, \ldots, \xi_N}{\text{minimize}} \quad \frac{1}{2}\, w \cdot w + C \sum_{n=1}^{N} \xi_n$$
$$\text{subject to} \quad y_n\,(w \cdot x_n + b) \geq 1 - \xi_n, \quad \xi_n \geq 0 \quad (n = 1, \ldots, N).$$

At the optimum, each slack variable equals the hinge loss

$$\xi_n = \max\left(0,\; 1 - y_n\,(w \cdot x_n + b)\right),$$

and the constant $C\ (>0)$ trades off margin maximization against the total hinge loss: samples with $y_n(w \cdot x_n + b) \geq 1$ incur no loss, while margin violations are penalized linearly in $1 - y_n(w \cdot x_n + b)$.
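The optimization problem above is usually solved by quadratic programming; as a lightweight illustration only, the following sketch minimizes the equivalent unconstrained objective $(1/2)\|w\|^2 + C \sum_n \max(0, 1 - y_n(w \cdot x_n + b))$ by plain subgradient descent. The learning rate and epoch count are arbitrary choices for the sketch:

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on the soft-margin SVM objective
    (1/2)||w||^2 + C * sum_n max(0, 1 - y_n * (w.x_n + b)).
    X: (N, d) sample matrix; y: (N,) array of +/-1 labels."""
    N, dim = X.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1.0              # samples with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy separable data: class +1 near (2, 2), class -1 near (-2, -2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = train_soft_margin_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))           # training accuracy
```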
Fig. 2. Schematic explanation of hinge, logistic, and smooth logistic losses.

Viewed through their losses, MCE and SVM training are closely related: both replace the 0-1 loss with a tractable surrogate, the smooth (sigmoidal) logistic loss in MCE and the hinge loss in SVM (Fig. 2). The two frameworks nevertheless pursue different goals. MCE directly approximates the minimum classification error condition, but its standard formulation controls only the functional margin, which is not scale-invariant. SVM maximizes the geometric margin, and thus the robustness of the decision, but its hinge loss is only an upper bound on the classification error count. Neither framework alone therefore guarantees both minimum error and maximum geometric margin, and this gap motivates the combination developed in this paper.
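For reference, the three losses of Fig. 2 can be written as functions of the functional margin $m = y\, g(x)$. The MCE sigmoid is expressed here through $d = -m$, which holds under the common two-class convention $g_\pm(x) = \pm g(x)$ (an assumption of this sketch), and $a$, $b$ are the sigmoid constants introduced earlier:

```python
import numpy as np

def hinge_loss(m):
    """SVM hinge loss: max(0, 1 - m)."""
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    """Logistic loss: log(1 + exp(-m))."""
    return np.logaddexp(0.0, -m)

def smooth_logistic_loss(m, a=1.0, b=0.0):
    """MCE sigmoid loss with d = -m: 1 / (1 + exp(-a*d + b))."""
    return 1.0 / (1.0 + np.exp(a * m + b))

margins = np.linspace(-2, 2, 5)
for f in (hinge_loss, logistic_loss, smooth_logistic_loss):
    print(f.__name__, np.round(f(margins), 3))
```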
The comparison also clarifies the role of the $L_2$ regularizer: in SVM the term $\|w\|^2/2$ arises directly from geometric margin maximization, whereas in MCE the relation between the loss and the geometric margin depends on the form of the discriminant function of $w$ and $x$. The notion of geometric margin, however, is not limited to linear discriminants; it extends naturally to the distance classifier, the structure on which this paper focuses.

Fig. 3 illustrates the case of two classes represented by prototypes, $p$ of class $C$ and $\bar{p}$ of the rival class $\bar{C}$, for which the decision boundary is the perpendicular bisector of the segment joining the prototypes. Projecting a sample $x$ onto the boundary at the foot point $\hat{x}$, the geometric margin of $x$, i.e., its signed Euclidean distance $d$ to the boundary, is

$$d = \frac{\|x - \bar{p}\|^2 - \|x - p\|^2}{2\,\|p - \bar{p}\|},$$

which is positive when $x$ is closer to its own prototype $p$ (correct classification) and negative otherwise.

Fig. 3. Schematic explanation of geometric margin for distance classifier.
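The geometric margin of the distance classifier admits a direct computation; the following sketch evaluates the formula above (the prototype values are illustrative only):

```python
import numpy as np

def prototype_margin(x, p_own, p_rival):
    """Geometric margin of x for a nearest-prototype classifier:
    signed distance from x to the perpendicular bisector of the
    own-class prototype p_own and the rival prototype p_rival,
    d = (||x - p_rival||^2 - ||x - p_own||^2) / (2 ||p_own - p_rival||).
    Positive iff x is closer to p_own (correctly classified)."""
    num = np.sum((x - p_rival) ** 2) - np.sum((x - p_own) ** 2)
    return num / (2.0 * np.linalg.norm(p_own - p_rival))

p, p_bar = np.array([0.0, 0.0]), np.array([4.0, 0.0])
print(prototype_margin(np.array([1.0, 3.0]), p, p_bar))  # 1.0 (boundary at x1 = 2)
```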
In summary, we elaborated the MCE and SVM training methodologies and, building on the geometric margin of the distance classifier, in which the prototype distance $\|p - \bar{p}\|$ plays the role of the SVM regularizer $\|w\|^2/2$, developed a new MCE training method that leads in practice to the ideal condition of maximum geometric margin and minimum error classification. Related large margin approaches include the large margin HMM 10) and large-margin MCE training viewed from a theoretical risk minimization perspective 12).

References

1) R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, (Wiley-Interscience, 1973).
2) S. Katagiri, B.-H. Juang, and C.-H. Lee, Pattern Recognition Using a Family of Design Algorithms Based Upon the Generalized Probabilistic Descent Method, Proc. IEEE, vol. 86, no. 11, pp. 2345-2373 (1998).
3) V. N. Vapnik, The Nature of Statistical Learning Theory, (Springer-Verlag, 1995).
4) J. Lafferty, A. McCallum, and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proc. ICML 2001, pp. 282-289 (2001).
5) J. Friedman, T. Hastie, and R. Tibshirani, Additive Logistic Regression: A Statistical View of Boosting, The Annals of Statistics, vol. 28, no. 2, pp. 337-407 (2000).
6) S. Katagiri, A Unified Approach to Pattern Recognition, Proc. ISANN '94, pp. 561-570 (1994).
7) E. McDermott and S. Katagiri, A Derivation of Minimum Classification Error from the Theoretical Classification Risk Using Parzen Estimation, Computer Speech and Language, vol. 18, pp. 107-122 (2004).
8) E. McDermott and S. Katagiri, Discriminative Training via Minimization of Risk Estimates Based on Parzen Smoothing, Appl. Intell., vol. 25, pp. 35-57 (2006).
9) N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, (Cambridge University Press, Cambridge, 2000).
10) H. Jiang, X. Li, and C. Liu, Large Margin Hidden Markov Models for Speech Recognition, IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1584-1595 (2006).
11) T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, vol. 247, pp. 978-982 (1990).
12) D. Yu, L. Deng, X. He, and A. Acero, Large-margin Minimum Classification Error Training: A Theoretical Risk Minimization Perspective, Computer Speech and Language, vol. 22, pp. 415-429 (2008).