Towards Maximum Geometric Margin Minimum Error Classification


THE SCIENCE AND ENGINEERING REVIEW OF DOSHISHA UNIVERSITY, VOL. 50, NO. 3, October 2009

Towards Maximum Geometric Margin Minimum Error Classification

Kouta YAMADA*, Shigeru KATAGIRI*, Erik MCDERMOTT**, Hideyuki WATANABE***, Atsushi NAKAMURA**, Shinji WATANABE**, Miho OHSAKI*

(Received July 28, 2009)

The recent dramatic growth of computation power and data availability has increased research interest in discriminative training methodologies for pattern classifier design. Minimum Classification Error (MCE) training and Support Vector Machine (SVM) training methods are attracting an especially great deal of attention. The former has been widely used as a general framework for discriminatively designing various types of speech and text classifiers; the latter has become the standard technology for the effective classification of fixed-dimensional vectors. In principle, MCE aims to achieve minimum error classification, whereas SVM aims to increase the robustness of classification decisions. The simultaneous achievement of these two different goals would definitely be valuable. Motivated by this concern, in this paper we elaborate the MCE and SVM methodologies and develop a new MCE training method that leads in practice to the best condition of maximum geometric margin and minimum error classification.

Key words: minimum classification error training, geometric margin, functional margin, support vector machine

Minimum Squared Error: MSE; Minimum Classification Error: MCE; Support Vector Machine: SVM; Conditional Random Field: CRF 4); Boosting

*Graduate School of Engineering, Doshisha University, Kyoto. E-mail: skatagir@mail.doshisha.ac.jp, Telephone: +81-774-65-7567
**NTT Communication Science Laboratories, NTT Corporation, Kyoto
***MASTAR Project, National Institute of Information and Communications Technology, Kyoto

[p. 44: Formulation of the classification problem; only the mathematical content of this page was recoverable.]

A pattern x is to be assigned to one of C classes C_1, ..., C_C, given a training set of N samples {x_1, ..., x_N}. With the 0-1 loss, which equals 1 when the classifier's decision C(x) differs from the true class and 0 otherwise, the expected classification risk is

  R(\Lambda) = \sum_{j=1}^{C} \int 1\bigl(C(x) \ne C_j\bigr) \, p(C_j, x) \, dx .

The risk-minimizing Bayes decision rule is

  decide C_i if P(C_i \mid x) \ge P(C_j \mid x) for all j \ne i,

and its trainable counterpart, based on discriminant functions g_j(x; \Lambda) with parameter set \Lambda, is

  decide C_i if g_i(x; \Lambda) \ge g_j(x; \Lambda) for all j \ne i.
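As a concrete illustration of the discriminant-function decision rule above, the following Python sketch assigns a pattern to the class whose discriminant score is largest; the discriminant functions and their parameter values are illustrative only and are not taken from the paper.

import numpy as np

def decide(x, discriminants):
    """Assign x to the class whose discriminant g_j(x) is largest.

    discriminants: list of callables, one g_j per class, each mapping a
    pattern vector to a scalar score (larger means stronger evidence for C_j).
    """
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))  # index i with g_i(x) >= g_j(x) for all j

# Example with two linear discriminants (hypothetical parameters).
g0 = lambda x: np.dot([1.0, -0.5], x) + 0.2
g1 = lambda x: np.dot([-0.3, 0.8], x) - 0.1
print(decide(np.array([0.5, 1.0]), [g0, g1]))  # -> 1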

[p. 45: The MCE misclassification measure and smoothed loss.]

For a sample x of class C_j, the misclassification measure compares the true-class discriminant with its strongest competitor,

  d_j(x; \Lambda) = - g_j(x; \Lambda) + \max_{i \ne j} g_i(x; \Lambda),

or, in the differentiable form used by MCE, with the maximum replaced by a smooth maximum,

  d_j(x; \Lambda) = - g_j(x; \Lambda) + \frac{1}{\psi} \log \left[ \frac{1}{C-1} \sum_{i \ne j} \exp\bigl(\psi \, g_i(x; \Lambda)\bigr) \right], \quad \psi > 0 .

A sample is misclassified when d_j(x; \Lambda) > 0 and correctly classified when d_j(x; \Lambda) \le 0, so the classification risk can be written with the indicator function 1(\cdot) as

  R(\Lambda) = \sum_{j=1}^{C} \int 1\bigl(d_j(x; \Lambda) > 0\bigr) \, p(C_j, x) \, dx .

Because the indicator 1(d_j(x; \Lambda) > 0) is not differentiable, MCE replaces it with a smooth sigmoid loss

  \ell_j(x; \Lambda) = \ell\bigl(d_j(x; \Lambda)\bigr) = \frac{1}{1 + \exp\bigl(-a \, d_j(x; \Lambda) + b\bigr)},

where a (> 0) controls the smoothness of the loss and b is an offset.
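A minimal Python sketch of the misclassification measure and the sigmoid loss defined above; the function names and the parameter values psi, a, and b are illustrative assumptions, not values from the paper.

import numpy as np

def misclassification_measure(g, j, psi=2.0):
    """d_j = -g_j + (1/psi) * log[ (1/(C-1)) * sum_{i != j} exp(psi * g_i) ].

    g: array of discriminant scores g_1..g_C for one pattern; j: true class index.
    """
    others = np.delete(g, j)
    soft_max = np.log(np.mean(np.exp(psi * others))) / psi  # smooth max over rivals
    return -g[j] + soft_max

def sigmoid_loss(d, a=1.0, b=0.0):
    """MCE smooth 0-1 loss: l(d) = 1 / (1 + exp(-a*d + b))."""
    return 1.0 / (1.0 + np.exp(-a * d + b))

g = np.array([2.0, 1.2, -0.5])         # scores for classes 0..2 (illustrative)
d = misclassification_measure(g, j=0)  # negative: class 0 wins
print(d, sigmoid_loss(d))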

[p. 46: The MCE loss as a Parzen estimate of the classification risk.]

Rewriting the risk in terms of the class priors P(C_j) and the class-conditional densities,

  R(\Lambda) = \sum_{j=1}^{C} \int 1\bigl(d_j(x; \Lambda) > 0\bigr) \, p(C_j, x) \, dx
             = \sum_{j=1}^{C} P(C_j) \int_{\{x : d_j(x; \Lambda) > 0\}} p(x \mid C_j) \, dx .

Treating the scalar misclassification measure m = d_j(x; \Lambda) as a random variable with class-conditional density p(m \mid C_j), the inner integral becomes

  \int_{\{x : d_j(x; \Lambda) > 0\}} p(x \mid C_j) \, dx = P\bigl[d_j(x; \Lambda) > 0 \mid C_j\bigr] = \int_0^{\infty} p(m \mid C_j) \, dm ,

so that

  R(\Lambda) = \sum_{j=1}^{C} P(C_j) \int_0^{\infty} p(m \mid C_j) \, dm .

The unknown density p(m \mid C_j) is estimated with a Parzen (kernel) estimator built from the N_j training samples x_k belonging to C_j,

  \hat{p}(m \mid C_j) = \frac{1}{N_j} \sum_{k : x_k \in C_j} \frac{1}{h} K\!\left(\frac{m - d_j(x_k; \Lambda)}{h}\right),

with kernel K and bandwidth h. Using \hat{P}(C_j) = N_j / N, the resulting estimate of the risk is

  \hat{R}_N(\Lambda) = \sum_{j=1}^{C} \hat{P}(C_j) \int_0^{\infty} \hat{p}(m \mid C_j) \, dm
                     = \frac{1}{N} \sum_{k=1}^{N} \int_0^{\infty} \frac{1}{h} K\!\left(\frac{m - d_{j(k)}(x_k; \Lambda)}{h}\right) dm ,

where j(k) denotes the class of the k-th training sample.
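The Parzen estimate of p(m | C_j) can be sketched in Python as follows; the Gaussian kernel and the bandwidth value are assumptions made for illustration, since the kernel actually used in the paper is not recoverable from the transcription.

import numpy as np

def parzen_density(m, d_class, h=0.5):
    """p_hat(m | C_j) = (1/N_j) * sum_k (1/h) * K((m - d_j(x_k))/h).

    d_class: misclassification measures d_j(x_k) of the class-C_j samples.
    A standard Gaussian kernel K is assumed here.
    """
    u = (m - np.asarray(d_class, dtype=float)) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)) / h

# Density of the misclassification measure at m = 0 for a few toy samples.
print(parzen_density(0.0, [-1.3, -0.4, 0.2, -2.1]))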

[p. 47: Equivalence of the Parzen-smoothed risk and the MCE loss; Fig. 1.]

Each term of the estimate \hat{R}_N(\Lambda) is an integral of the kernel over the positive axis,

  \int_0^{\infty} \frac{1}{h} K\!\left(\frac{m - d_j(x_k; \Lambda)}{h}\right) dm = \ell\bigl(d_j(x_k; \Lambda)\bigr),

i.e., a smooth loss applied to the misclassification measure. The MCE empirical loss is therefore itself a Parzen estimate of the theoretical classification risk, and as the number of samples N grows and the bandwidth h is reduced appropriately, the estimate approaches the true risk R(\Lambda).

Fig. 1. Schematic explanation of kernels that do not cross over to each other.
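As one concrete instance of this equivalence (the specific kernel used in the paper is not recoverable from the transcription; the logistic kernel below is an illustrative choice), the kernel integral reduces exactly to the MCE sigmoid loss:

  K(u) = \frac{e^{-u}}{(1 + e^{-u})^2}, \qquad \int_{-\infty}^{v} K(u) \, du = \frac{1}{1 + e^{-v}} .

Substituting u = (m - d)/h,

  \int_0^{\infty} \frac{1}{h} K\!\left(\frac{m - d}{h}\right) dm
    = \int_{-d/h}^{\infty} K(u) \, du
    = 1 - \frac{1}{1 + e^{d/h}}
    = \frac{1}{1 + e^{-d/h}} ,

which is the sigmoid loss \ell(d) with a = 1/h and b = 0, so shrinking the bandwidth h corresponds to sharpening the MCE loss toward the 0-1 indicator.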

[p. 48: Linear discriminant, functional margin, and geometric margin.]

For a two-class problem with classes C_1 and C_2, consider the linear discriminant

  g(x) = w \cdot x + b,

with weight vector w and bias b: x is assigned to C_1 when g(x) > 0 and to C_2 when g(x) < 0. Encoding the class of x by a label y \in \{+1, -1\}, the functional margin of a sample is y \, g(x); the sample is correctly classified when y \, g(x) > 0 and misclassified when y \, g(x) \le 0, and a larger functional margin corresponds to a more robust decision. The functional margin, however, can be made arbitrarily large simply by rescaling (w, b); the scale-invariant quantity is the geometric margin, the Euclidean distance r from x to the separating hyperplane g(x) = 0,

  r = \frac{y \, (w \cdot x + b)}{\|w\|_2},

where \|w\|_2 is the L_2 norm of w.
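A short Python sketch of the functional and geometric margins of a linear discriminant; the weight vector, bias, and sample values are illustrative only.

import numpy as np

def functional_margin(w, b, x, y):
    """y * g(x) with g(x) = w.x + b; positive iff x is correctly classified."""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0:
    r = y * (w.x + b) / ||w||_2, which is invariant to rescaling of (w, b)."""
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -2.0
x, y = np.array([1.0, 1.0]), +1
print(functional_margin(w, b, x, y))  # 5.0
print(geometric_margin(w, b, x, y))   # 1.0, since ||w||_2 = 5

Doubling (w, b) doubles the functional margin but leaves the geometric margin unchanged, which is why the geometric margin is the meaningful measure of decision robustness.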

[p. 49: The soft-margin SVM formulation.]

Normalizing (w, b) so that the samples closest to the separating hyperplane satisfy |w \cdot x + b| = 1, the geometric margin becomes r = 1 / \|w\|_2, so maximizing the margin is equivalent to minimizing \|w\|_2. Given a training set

  S = \{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\},

where x_n is a pattern and y_n \in \{+1, -1\} its label, the soft-margin SVM determines (w, b) by solving

  minimize_{w, b, \xi_1, \ldots, \xi_N} \quad \frac{1}{2} w \cdot w + C \sum_{n=1}^{N} \xi_n
  subject to \quad y_n (w \cdot x_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0 \quad (n = 1, \ldots, N).

At the optimum each slack variable equals the hinge loss of its sample,

  \xi_n = \max\bigl(0, \; 1 - y_n (w \cdot x_n + b)\bigr),

and the constant C (> 0) controls the trade-off between the margin term and the total hinge loss incurred by samples that lie inside the margin or on the wrong side of the hyperplane.
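The soft-margin objective above can be minimized directly in its hinge-loss form; the following Python sketch applies plain subgradient descent to (1/2)||w||^2 + C * sum_n max(0, 1 - y_n(w.x_n + b)). The step size, iteration count, and toy data are arbitrary illustrative choices, not a reproduction of any experiment in the paper.

import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the primal soft-margin SVM objective.

    X: (N, d) array of patterns; y: (N,) array of labels in {-1, +1}.
    Returns the weight vector w and bias b.
    """
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)   # functional margins y_n (w.x_n + b)
        violating = margins < 1.0   # samples with nonzero hinge loss
        # Subgradient of (1/2)||w||^2 + C * sum_n max(0, 1 - margin_n)
        grad_w = w - C * (y[violating][:, None] * X[violating]).sum(axis=0)
        grad_b = -C * y[violating].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny linearly separable toy problem (illustrative data only).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
w, b = train_soft_margin_svm(X, y)
print(w, b, np.sign(X @ w + b))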

[p. 50: Comparison of the MCE and SVM training losses; apart from the figure caption, the text of this page was not recoverable.]

Fig. 2. Schematic explanation of hinge, logistic, and smooth logistic losses.

[p. 51: Geometric margin of a distance classifier; Fig. 3.]

The SVM geometric margin above is defined through the L_2 norm \|w\|_2 of the weight vector of a linear discriminant. The geometric margin can also be defined for a distance classifier that assigns x to the class C_p whose prototype \hat{x}_p is closest, i.e., for which the squared Euclidean distance

  d_p^2 = \|x - \hat{x}_p\|^2

is smallest; the geometric margin of a sample is then its distance to the decision boundary between the true-class prototype and the closest rival prototype.

Fig. 3. Schematic explanation of geometric margin for distance classifier.
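For the nearest-prototype classifier illustrated in Fig. 3, the decision boundary between two prototypes is their perpendicular bisector, so the geometric margin has a simple closed form. The following Python sketch uses hypothetical prototype positions chosen only for illustration.

import numpy as np

def prototype_margin(x, p_true, p_rival):
    """Signed distance from x to the boundary between two class prototypes.

    The nearest-prototype boundary is the perpendicular bisector of p_true
    and p_rival; the margin is positive when x is closer to p_true:
        r = (||x - p_rival||^2 - ||x - p_true||^2) / (2 * ||p_true - p_rival||)
    """
    num = np.sum((x - p_rival) ** 2) - np.sum((x - p_true) ** 2)
    return num / (2.0 * np.linalg.norm(p_true - p_rival))

x = np.array([1.0, 0.0])
p_true, p_rival = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
print(prototype_margin(x, p_true, p_rival))  # 1.0: distance to the bisector x1 = 0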

[p. 52: Concluding remarks; only fragments were recoverable, mentioning the L_2 norm \|w\|^2, the relation between MCE and SVM training, and large margin HMM training. The reference list follows.]

References

1) R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, (Wiley Interscience Publishers, 1973).
2) S. Katagiri, B. Juang, and C. Lee, Pattern Recognition Using a Family of Design Algorithms Based Upon the Generalized Probabilistic Descent Method, Proc. IEEE, vol. 86, no. 11, pp. 2345-2373 (1998).
3) V. N. Vapnik, The Nature of Statistical Learning Theory, (Springer-Verlag, 1995).
4) J. Lafferty, A. McCallum, and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proc. ICML 2001, pp. 282-289 (2001).
5) J. Friedman, T. Hastie, and R. Tibshirani, Additive Logistic Regression: A Statistical View of Boosting, The Annals of Statistics, vol. 28, no. 2, pp. 337-407 (2000).
6) S. Katagiri, A Unified Approach to Pattern Recognition, Proc. ISANN '94, pp. 561-570 (1994).
7) E. McDermott and S. Katagiri, A Derivation of Minimum Classification Error from the Theoretical Classification Risk Using Parzen Estimation, Computer Speech and Language, vol. 18, pp. 107-122 (2004).
8) E. McDermott and S. Katagiri, Discriminative Training via Minimization of Risk Estimates Based on Parzen Smoothing, Appl. Intell., vol. 25, pp. 35-57 (2006).
9) N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, (Cambridge University Press, Cambridge, 2000).
10) H. Jiang, X. Li, and C. Liu, Large Margin Hidden Markov Models for Speech Recognition, IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1584-1595 (2006).
11) T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multi-Layer Networks, Science, vol. 247, pp. 978-982 (1990).
12) D. Yu, L. Deng, X. He, and A. Acero, Large-margin Minimum Classification Error Training: A Theoretical Risk Minimization Perspective, Computer Speech and Language, vol. 22, pp. 415-429 (2008).