A Study of Relative Efficiency and Robustness of Classification Methods

Save this PDF as:

Size: px
Start display at page:

Download "A Study of Relative Efficiency and Robustness of Classification Methods"


1 A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics Seoul National University

2 Iris Data (red=setosa,green=versicolor,blue=virginica) Sepal.Length Sepal.Width Petal.Length Petal.Width

3 Figure: courtesy of Hastie, Tibshirani, & Friedman (2001) Handwritten digit recognition Cancer diagnosis with gene expression profiles Text categorization

4 Classification x = (x 1,..., x d ) R d y Y = {1,..., k} Learn a rule φ : R d Y from the training data {(x i, y i ), i = 1,..., n}, where (x i, y i ) are i.i.d. with P(X, Y). The 0-1 loss function: ρ(y,φ(x)) = I(y φ(x)) The Bayes decision rule φ B minimizing the error rate R(φ) = P(Y φ(x)) is φ B (x) = arg max P(Y = k X = x). k

5 Statistical Modeling: The Two Cultures One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. Breiman (2001) Model-based methods in statistics: LDA, QDA, logistic regression, kernel density classification Algorithmic methods in machine learning: Support vector machine (SVM), boosting, decision trees, neural network Less is required in pattern recognition. Devroye, Györfi and Lugosi (1996) If you possess a restricted information for solving some problem, try to solve the problem directly and never solve a general problem as an intermediate step. Vapnik (1998)

6 Questions Is modeling necessary for classification? Does modeling lead to more accurate classification? How to quantify the relative efficiency? How do the two approaches compare?

7 Convex Risk Minimization In the binary case (k = 2), suppose that y = 1 or -1. Typically obtain a discriminant function f : R d R, which induces a classifier φ(x) = sign(f(x)), by minimizing the risk under a convex surrogate loss of the 0-1 loss ρ(y, f(x)) = I(yf(x) 0). Logistic regression: binomial deviance (- log likelihood) Support vector machine: hinge loss Boosting: exponential loss

8 Logistic Regression Agresti (2002), Categorical Data Analysis Model the conditional distribution p k (x) = P(Y = k X = x) directly. p 1 (x) log 1 p 1 (x) = f(x) Then Y X = x Bernoulli distribution with p 1 (x) = exp(f(x)) 1+exp(f(x)) and p 1(x) = 1 1+exp(f(x)). Maximizing the conditional likelihood of (y 1,...,y n ) given (x 1,..., x n ) (or minimizing the negative log likelihood) amounts to min f n log ( 1+exp( y i f(x i )) ). i=1

9 Support Vector Machine Vapnik (1996), The Nature of Statistical Learning Theory Find f with a large margin minimizing 1 n n (1 y i f(x i )) + +λ f 2. i=1 x margin=2/ β β t x + β 0 = 1 β t x + β 0 = 0 β t x + β 0 = x 1

10 Boosting Freund and Schapire (1997), A decision-theoretic generalization of on-line learning and an application to boosting A meta-algorithm that combines the outputs of many weak classifiers to form a powerful committee Sequentially apply a weak learner to produce a sequence of classifiers f m (x), m = 1, 2,..., M and take a weighted majority vote for the final prediction. AdaBoost minimizes the exponential risk function with a stagewise gradient descent algorithm: min f n exp( y i f(x i )). i=1 Friedman, Hastie, and Tibshirani (2000), Additive Logistic Regression: A Statistical View of Boosting

11 Loss Functions ÄÓ Misclassification Exponential Binomial Deviance Squared Error Support Vector Ý Figure: courtesy of HTF (2001)

12 Classification Consistency The population minimizer f of ρ is defined as f with the minimum risk R(f) = Eρ(Y, f(x)). Negative log-likelihood (deviance) Hinge loss (SVM) f (x) = log Exponential loss (boosting) p 1 (x) 1 p 1 (x) f (x) = sign(p 1 (x) 1/2) f (x) = 1 2 log p 1 (x) 1 p 1 (x) sign(f ) yields the Bayes rule. Both modeling and algorithmic approaches are consistent. Lin (2000), Zhang (AOS 2004), Bartlett, Jordan, and McAuliffe (JASA 2006)

13 Outline Efron s comparison of LDA with logistic regression Efficiency of algorithmic approach (support vector machine and boosting) Simulation studies for comparison of efficiency and robustness Discussion

14 Normal Distribution Setting Two multivariate normal distributions in R d with mean vectors µ 1 and µ 2 and a common covariance matrix Σ π + = P(Y = 1) and π = P(Y = 1). For example, when π + = π, Fisher s LDA boundary is { } { Σ 1 (µ 1 µ 2 ) x 1 } 2 (µ 1 +µ 2 ) = 0. µ 1 µ 2

15 Canonical LDA setting Efron (JASA 1975), The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis X N(( /2)e 1, I) for Y = 1 with probability π + X N( ( /2)e 1, I) for Y = 1 with probability π where = {(µ 1 µ 2 ) Σ 1 (µ 1 µ 2 )} 1/2 Fisher s linear discriminant function is l(x) = log(π + /π )+ x 1. Let β 0 = log(π +/π ), (β 1,...,β d ) = e 1, and β = (β 0,...,β d ).

16 Excess Error For a linear discriminant method ˆl with coefficient vector ˆβ n, if n(ˆβ n β ) N(0,Σ β ), the expected increased error rate of ˆl, E(R(ˆl) R(φ B )) = π [ +φ(d 1 ) σ 00 2β 0 2 n σ 01+ β 2 0 where D 1 = /2+(1/ ) log(π + /π ). 2σ 11+σ σ dd ] +o( 1 n ), In particular, when π + = π, E(R(ˆl) R(φ B )) = φ( /2) ] [σ 00 +σ σ dd + o( 1 4 n n ).

17 Relative Efficiency Efron (1975) studied the Asymptotic Relative Efficiency (ARE) of logistic regression (LR) to normal discrimination (LDA) defined as E(R(ˆl lim LDA ) R(φ B )) n E(R(ˆl LR ) R(φ B )). Logistic regression is shown to be between one half and two thirds as effective as normal discrimination typically.

18 General Framework for Comparison Identify the limiting distribution of ˆβ n for other classification procedures (SVM, boosting, etc.) under the canonical LDA setting: 1 n ˆβ n = arg min ρ(y i, x i ;β) β n i=1 Need large sample theory for M-estimators. Find the excess error of each method and compute the efficiency relative to LDA.

19 M-estimator Asymptotics Pollard (ET 1991), Hjort and Pollard (1993), Geyer (AOS 1994), Knight and Fu (AOS 2000), Rocha, Wang and Yu (2009) Convexity of the loss ρ is the key. Let L(β) = Eρ(Y, X;β), β = arg min L(β), H(β) = 2 L(β), and G(β) = E β β Under some regularity conditions, ( )( ρ(y, X;β) ρ(y, X;β) β n(ˆβn β ) N(0, H(β ) 1 G(β )H(β ) 1 ) β ) in distribution.

20 Linear SVM Koo, Lee, Kim, and Park (JMLR 2008), A Bahadur Representation of the Linear Support Vector Machine With β = (β 0, w ), l(x;β) = w x +β { 0 } 1 n βλ,n = arg min (1 y i l(x i ;β)) + +λ w 2 β n i=1 Under the canonical LDA setting with π + = π, for λ = o(n 1/2 ), n ( βλ,n β SVM ) N(0,Σ β SVM ), where β SVM = 2 (2a + ) β LDA and a is a constant such that φ(a )/Φ(a ) = /2. If π + π, ŵ n w LDA but ˆβ 0 is inconsistent.

21 Relative Efficiency of SVM to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of the linear SVM to LDA is Eff = 2 2 (1+ 4 )φ(a ). R(φ B ) a SVM LR

22 Boosting 1 βn = arg min β n n exp( y i l(x i ;β)) i=1 Under the canonical LDA setting with π + = π, where β boost = 1 2 β LDA. n ( βn β boost ) N(0,Σ β boost ), In general, β n is a consistent estimator of (1/2)β LDA.

23 Relative Efficiency of Boosting to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of Boosting to LDA is Eff = 1+ 2 /4 exp( 2 /4). R(φ B ) Boosting SVM LR

24 Smooth SVM Lee and Mangasarian (2001), SSVM: A Smooth Support Vector Machine { } 1 n βλ,n = arg min (1 y i l(x i ;β)) 2 + β n +λ w 2 i=1 Under the canonical LDA setting with π + = π, for λ = o(n 1/2 ), n ( βλ,n β SSVM ) N(0,Σ β SSVM ), where β SSVM = 2 (2a + ) β LDA and a is a constant such that {a Φ(a )+φ(a )} = 2Φ(a ).

25 Relative Efficiency of SSVM to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of the Smooth SVM to LDA is Eff = (4+ 2 )Φ(a ) (2a + ). R(φ B ) a SSVM SVM LR

26 Possible Explanation for Increased Efficiency Hastie, Tibshirani, and Friedman (2001), Elements of Statistical Learning There is a close connection between Fisher s LDA and regression approach to classification with class indicators: n min (y i β 0 w x i ) 2 = i=1 n (1 y i (β 0 + w x i )) 2 i=1 The least squares coefficient is identical up to a scalar multiple to the LDA coefficient: ŵ ˆΣ 1 (ˆµ 1 ˆµ 2 )

27 Finite-Sample Excess Error Excess Error LDA (1.2578) LR (1.5021) SVM (1.7028) SSVM (1.4674) Sample Size (inverse scale) Figure: = 2, d = 5, R(φ B ) = , and π + = π. The results are based on 1000 replicates.

28 What If Model is Mis-specified? All models are wrong, but some are useful. George Box Compare methods under A mixture of two Gaussian distributions Mislabeling in LDA setting Quadratic discriminant analysis setting

29 A Mixture of Two Gaussian Distributions µ f2 W µ f1 µ g2 B µ g1 Figure: W and B indicate the mean difference between two Gaussian components within each class and the mean difference between two classes.

30 As W Varies Error Rate Bayes Rule LDA LR SVM SSVM Figure: B = 2, d = 5, π + = π, π 1 = π 2, and n = 100 w

31 As B Varies Increased Error Rate LDA LR SVM SSVM Figure: W = 1, d = 5, π + = π, π 1 = π 2, and n = 100 B

32 As Dimension d Varies Error Rate Bayes Rule LDA LR SVM SSVM Dimension Figure: W = 1, B = 2, π + = π, π 1 = π 2, and n = 100

33 Mislabeling in LDA excess error SVM Huberized SVM Smooth SVM perturbation fraction Figure: Mean excess errors of SVM and its variants from 400 replicates as the mislabeling proportion varies. = 2.7, d = 5, R(φ B ) = , π + = π, and n = 100.

34 QDA Setting C= 1 C= µ µ µ µ C= 3 C= µ 2 µ µ 2 µ Figure: X Y = 1 N(µ 1,Σ) and X Y = 1 N(µ 2, CΣ)

35 Decomposition of Error For a rule φ F, R(φ) R(φ B ) = R(φ) R(φ F ) + R(φ }{{} F ) R(φ B ) }{{} estimation error approximation error where φ F = arg min φ F R(φ). When a method M is used to choose φ from F, R(φ) R(φ F ) = R(φ) R(φ M ) }{{} + R(φ M ) R(φ F ) }{{} M-specific est.error M-specific approx.error where φ M is the method-specific limiting rule within F.

36 Approximation Error Error Rate Bayes Error Approximation Error Figure: QDA setting with = 2, Σ = I, d = 10, and π + = π C

37 Method-Specific Approximation Error Approximation Error LDA SVM SSVM Boosting Figure: Method-specific approximation error of linear classifiers in the QDA setting C

38 Extensions For high dimensional data, study double asymptotics where d also grows with n. Compare methods in a regularization framework. Investigate consistency and relative efficiency under other models. Also consider potential model mis-specification and compare methods in terms of robustness.

39 Concluding Remarks Compared modeling approach with algorithmic approach in the efficiency of reducing error rates. Under the normal setting, modeling leads to more efficient use of data. Linear SVM is shown to be between 40% and 67% as effective as LDA when the Bayes error rate is between 4% and 10%. A loss function plays an important role in determining the efficiency of the corresponding procedure. Squared hinge loss could yield more effective procedure than logistic regression. There is a trade-off between efficiency and robustness. The theoretical comparisons can be extended in many directions.

40 References P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101: , B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352): , Dec N. L. Hjort and D. Pollard. Asymptotics for minimisers of convex processes. Statistical Research Report, May Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, and Changyi Park. A Bahadur representation of the linear Support Vector Machine. Journal of Machine Learning Research, 9: , 2008.

41 Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56 85, Y.-J. Lee and O.L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5 22, Y. Lin. A note on margin-based loss functions in classification. Statististics and Probability Letters, 68:73 82, D. Pollard. Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7: , G. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l 1 penalized parametric M-estimators, with applications to linear SVM and logistic regression. arxiv, v1:1 55, Aug 2009.