Does Modeling Lead to More Accurate Classification?

Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods
Yoonkyung Lee*, Department of Statistics, The Ohio State University
*joint work with Rui Wang
August 26-27, 2010, Summer School at CIMAT, Guanajuato, Mexico

Figure: Scatterplot matrix of the iris data (red = setosa, green = versicolor, blue = virginica) over Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

Figure: courtesy of Hastie, Tibshirani, & Friedman (2001).
Further examples of classification problems: handwritten digit recognition, cancer diagnosis with gene expression profiles, and text categorization.

Classification
$x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, $y \in \mathcal{Y} = \{1, \ldots, k\}$
Learn a rule $\phi: \mathbb{R}^d \rightarrow \mathcal{Y}$ from the training data $\{(x_i, y_i),\ i = 1, \ldots, n\}$, where the $(x_i, y_i)$ are i.i.d. with distribution $P(X, Y)$.
The 0-1 loss function: $\rho(y, \phi(x)) = I(y \neq \phi(x))$
The Bayes decision rule $\phi_B$ minimizing the error rate $R(\phi) = P(Y \neq \phi(X))$ is $\phi_B(x) = \arg\max_k P(Y = k \mid X = x)$.
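
As a minimal illustration of the Bayes rule under 0-1 loss (a sketch, not from the talk; the two univariate Gaussian classes and all parameter values are illustrative choices), the following code classifies by the larger class posterior and checks the resulting error rate by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two univariate Gaussian classes with equal priors (illustrative choice).
mu1, mu2, sigma = 1.0, -1.0, 1.0
n = 100_000
y = rng.choice([1, 2], size=n)                      # true labels
x = rng.normal(np.where(y == 1, mu1, mu2), sigma)   # features

# Bayes rule: pick the class with the larger posterior P(Y = k | X = x);
# with equal priors this reduces to comparing class-conditional densities.
dens1 = norm.pdf(x, mu1, sigma)
dens2 = norm.pdf(x, mu2, sigma)
phi_B = np.where(dens1 >= dens2, 1, 2)

print("Monte Carlo error of the Bayes rule:", np.mean(phi_B != y))
print("Theoretical Bayes error:", norm.cdf(-abs(mu1 - mu2) / (2 * sigma)))
```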

Statistical Modeling: The Two Cultures
"One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown." Breiman (2001)
Model-based methods in statistics: LDA, QDA, logistic regression, kernel density classification
Algorithmic methods in machine learning: support vector machine (SVM), boosting, decision trees, neural networks
"Less is required in pattern recognition." Devroye, Györfi and Lugosi (1996)
"If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a general problem as an intermediate step." Vapnik (1998)

Questions Is modeling necessary for classification? Does modeling lead to more accurate classification? How to quantify the relative efficiency? How do the two approaches compare?

Convex Risk Minimization
In the binary case ($k = 2$), suppose that $y = 1$ or $-1$. Typically one obtains a discriminant function $f: \mathbb{R}^d \rightarrow \mathbb{R}$, which induces a classifier $\phi(x) = \mathrm{sign}(f(x))$, by minimizing the risk under a convex surrogate of the 0-1 loss $\rho(y, f(x)) = I(yf(x) \le 0)$.
Logistic regression: binomial deviance (negative log-likelihood)
Support vector machine: hinge loss
Boosting: exponential loss

Logistic Regression
Agresti (2002), Categorical Data Analysis
Model the conditional distribution $p_k(x) = P(Y = k \mid X = x)$ directly:
$$\log \frac{p_1(x)}{1 - p_1(x)} = f(x).$$
Then $Y \mid X = x$ follows a Bernoulli distribution with
$$p_1(x) = \frac{\exp(f(x))}{1 + \exp(f(x))} \quad\text{and}\quad p_{-1}(x) = \frac{1}{1 + \exp(f(x))}.$$
Maximizing the conditional likelihood of $(y_1, \ldots, y_n)$ given $(x_1, \ldots, x_n)$ (or minimizing the negative log-likelihood) amounts to
$$\min_f \sum_{i=1}^n \log\bigl(1 + \exp(-y_i f(x_i))\bigr).$$
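
A hedged sketch (not the speaker's code) of fitting a linear logistic model by directly minimizing the negative log-likelihood above with a general-purpose optimizer; the synthetic data and coefficient values are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic data: y in {-1, +1}, linear score f(x) = b0 + w'x.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ w_true)))
y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)

def neg_loglik(beta):
    # beta = (b0, w); objective is sum_i log(1 + exp(-y_i f(x_i)))
    f = beta[0] + X @ beta[1:]
    return np.sum(np.logaddexp(0.0, -y * f))

fit = minimize(neg_loglik, x0=np.zeros(d + 1), method="BFGS")
print("estimated (b0, w):", np.round(fit.x, 3))
```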

Support Vector Machine
Vapnik (1996), The Nature of Statistical Learning Theory
Find $f$ with a large margin by minimizing
$$\frac{1}{n}\sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|f\|^2.$$
Figure: A separating hyperplane $\beta^\top x + \beta_0 = 0$ with margin boundaries $\beta^\top x + \beta_0 = \pm 1$; margin $= 2/\|\beta\|$.
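
Because the hinge loss is not differentiable, a simple way to minimize the objective above is subgradient descent; the sketch below is my own illustration (step sizes, regularization level, and data are assumptions, not from the talk).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-class Gaussian data with labels y in {-1, +1}.
n, d, lam = 200, 2, 0.01
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d)) + np.outer(y, [1.0, 0.5])   # class-dependent mean shift

w, b = np.zeros(d), 0.0
for t in range(1, 2001):
    margin = y * (X @ w + b)
    active = margin < 1.0                    # points with nonzero hinge loss
    # Subgradient of (1/n) sum_i (1 - y_i f(x_i))_+ + lam * ||w||^2
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
    grad_b = -y[active].sum() / n
    step = 1.0 / t                           # diminishing step size (heuristic)
    w -= step * grad_w
    b -= step * grad_b

print("w:", np.round(w, 3), "b:", round(b, 3))
print("training error:", np.mean(np.sign(X @ w + b) != y))
```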

Boosting
Freund and Schapire (1997), A decision-theoretic generalization of on-line learning and an application to boosting
A meta-algorithm that combines the outputs of many weak classifiers to form a powerful committee: sequentially apply a weak learner to produce a sequence of classifiers $f_m(x)$, $m = 1, 2, \ldots, M$, and take a weighted majority vote for the final prediction.
AdaBoost minimizes the exponential risk function with a stagewise gradient descent algorithm:
$$\min_f \sum_{i=1}^n \exp(-y_i f(x_i)).$$
Friedman, Hastie, and Tibshirani (2000), Additive Logistic Regression: A Statistical View of Boosting
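
A compact sketch of AdaBoost with decision stumps as the weak learner (the standard reweighting scheme of Freund and Schapire), intended only to make the "weighted majority vote" description concrete; the synthetic data, the number of rounds, and the stump search are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data, y in {-1, +1}.
n = 300
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

def best_stump(X, y, w):
    """Pick the stump sign(s * (x_j - thr)) with the smallest weighted error."""
    best = (None, None, None, np.inf)        # (feature, threshold, sign, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.sign(X[:, j] - thr)
                pred[pred == 0] = s
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

w = np.full(n, 1.0 / n)
stumps, alphas = [], []
for m in range(20):                                   # M = 20 boosting rounds
    j, thr, s, err = best_stump(X, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1.0 - err) / err)           # weight of the m-th classifier
    pred = s * np.sign(X[:, j] - thr)
    pred[pred == 0] = s
    w = w * np.exp(-alpha * y * pred)                 # exponential reweighting
    w /= w.sum()
    stumps.append((j, thr, s))
    alphas.append(alpha)

# Weighted majority vote of the M stumps.
F = np.zeros(n)
for (j, thr, s), alpha in zip(stumps, alphas):
    pred = s * np.sign(X[:, j] - thr)
    pred[pred == 0] = s
    F += alpha * pred
print("training error of the committee:", np.mean(np.sign(F) != y))
```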

Loss Functions
Figure: Margin-based loss functions (misclassification, exponential, binomial deviance, squared error, support vector) plotted against the margin $yf(x)$; courtesy of HTF (2001).
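
The sketch below evaluates these losses on a grid of margin values $yf(x)$; the exact scalings in HTF's figure may differ, so the forms here are the standard ones and are meant only as a quick numerical comparison.

```python
import numpy as np

# Margin values y * f(x) on a grid.
m = np.linspace(-2.0, 2.0, 9)

losses = {
    "misclassification": (m <= 0).astype(float),    # 0-1 loss I(yf <= 0)
    "exponential":       np.exp(-m),                # boosting
    "binomial deviance": np.log1p(np.exp(-m)),      # logistic regression
    "squared error":     (1.0 - m) ** 2,            # regression on +/-1 labels
    "hinge (SVM)":       np.maximum(0.0, 1.0 - m),  # support vector loss
}

print("margin:", np.round(m, 2))
for name, val in losses.items():
    print(f"{name:>18}:", np.round(val, 2))
```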

Classification Consistency
The population minimizer $f^*$ of $\rho$ is defined as the $f$ with the minimum risk $R(f) = E\rho(Y, f(X))$.
Negative log-likelihood (deviance): $f^*(x) = \log \dfrac{p_1(x)}{1 - p_1(x)}$
Hinge loss (SVM): $f^*(x) = \mathrm{sign}\bigl(p_1(x) - 1/2\bigr)$
Exponential loss (boosting): $f^*(x) = \dfrac{1}{2}\log \dfrac{p_1(x)}{1 - p_1(x)}$
In each case $\mathrm{sign}(f^*)$ yields the Bayes rule, so both the modeling and the algorithmic approaches are consistent.
Lin (2000), Zhang (AOS 2004), Bartlett, Jordan, and McAuliffe (JASA 2006)
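
The exponential-loss entry can be checked by pointwise minimization; this is a standard argument added here for completeness rather than taken from the slides. Conditioning on $X = x$,
$$E\!\left[e^{-Yf(x)} \mid X = x\right] = p_1(x)\,e^{-f(x)} + \bigl(1 - p_1(x)\bigr)\,e^{f(x)},$$
$$\frac{\partial}{\partial f}\Bigl\{ p_1(x)\,e^{-f} + (1 - p_1(x))\,e^{f} \Bigr\} = 0
\;\Longrightarrow\;
e^{2f} = \frac{p_1(x)}{1 - p_1(x)}
\;\Longrightarrow\;
f^*(x) = \tfrac{1}{2}\log\frac{p_1(x)}{1 - p_1(x)},$$
so $\mathrm{sign}(f^*)$ agrees with the Bayes rule; the deviance and hinge cases follow by analogous pointwise minimization.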

Outline
Efron's comparison of LDA with logistic regression
Efficiency of the algorithmic approach (support vector machine and boosting)
Simulation study
Discussion

Normal Distribution Setting
Two multivariate normal distributions in $\mathbb{R}^d$ with mean vectors $\mu_1$ and $\mu_2$ and a common covariance matrix $\Sigma$; $\pi_+ = P(Y = 1)$ and $\pi_- = P(Y = -1)$.
For example, when $\pi_+ = \pi_-$, Fisher's LDA boundary is
$$\Bigl\{x : (\mu_1 - \mu_2)^\top \Sigma^{-1}\Bigl(x - \tfrac{1}{2}(\mu_1 + \mu_2)\Bigr) = 0\Bigr\}.$$

Canonical LDA setting
Efron (JASA 1975), The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis
$X \sim N((\Delta/2)e_1, I)$ for $Y = 1$, with probability $\pi_+$
$X \sim N(-(\Delta/2)e_1, I)$ for $Y = -1$, with probability $\pi_-$
where $\Delta = \{(\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)\}^{1/2}$.
Fisher's linear discriminant function is $l(x) = \log(\pi_+/\pi_-) + \Delta x_1$.
Let $\beta_0^* = \log(\pi_+/\pi_-)$, $(\beta_1^*, \ldots, \beta_d^*)^\top = \Delta e_1$, and $\beta^* = (\beta_0^*, \ldots, \beta_d^*)^\top$.
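
A hedged simulation sketch of this canonical setting ($\Delta$, $d$, and $n$ are illustrative values I chose): generate the two spherical Gaussians, fit LDA by plugging in the sample means and pooled covariance, and compare the fitted rule's true error with the Bayes error $\Phi(-\Delta/2)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

Delta, d, n = 2.0, 5, 200          # illustrative values
mu = np.zeros(d); mu[0] = Delta / 2.0

# Canonical setting with pi_+ = pi_-: X | Y=+1 ~ N(+mu, I), X | Y=-1 ~ N(-mu, I).
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d)) + y[:, None] * mu

# Plug-in LDA: w = Sigma_hat^{-1} (mu_hat_+ - mu_hat_-), intercept from the midpoint.
mu_p, mu_m = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
Xc = np.vstack([X[y == 1] - mu_p, X[y == -1] - mu_m])
Sigma_hat = Xc.T @ Xc / (n - 2)
w = np.linalg.solve(Sigma_hat, mu_p - mu_m)
b = -0.5 * w @ (mu_p + mu_m)

# True error of the fitted linear rule sign(w'x + b) under the generating model.
s = np.sqrt(w @ w)
err = 0.5 * (norm.cdf(-(w @ mu + b) / s) + norm.cdf(-(w @ mu - b) / s))
print("error of fitted LDA rule:", round(err, 4))
print("Bayes error Phi(-Delta/2):", round(norm.cdf(-Delta / 2.0), 4))
```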

Excess Error
For a linear discriminant method $\hat l$ with coefficient vector $\hat\beta_n$, if $\sqrt{n}(\hat\beta_n - \beta^*) \rightarrow N(0, \Sigma_{\beta^*})$, the expected increase in error rate of $\hat l$ is
$$E\bigl(R(\hat l) - R(\phi_B)\bigr) = \frac{\pi_+ \phi(D_1)}{2\Delta n}\Bigl[\sigma_{00} - \frac{2\beta_0^*}{\Delta}\sigma_{01} + \frac{\beta_0^{*2}}{\Delta^2}\sigma_{11} + \sigma_{22} + \cdots + \sigma_{dd}\Bigr] + o\Bigl(\frac{1}{n}\Bigr),$$
where $D_1 = \Delta/2 + (1/\Delta)\log(\pi_+/\pi_-)$.
In particular, when $\pi_+ = \pi_-$,
$$E\bigl(R(\hat l) - R(\phi_B)\bigr) = \frac{\phi(\Delta/2)}{4\Delta n}\Bigl[\sigma_{00} + \sigma_{22} + \cdots + \sigma_{dd}\Bigr] + o\Bigl(\frac{1}{n}\Bigr).$$

Relative Efficiency
Efron (1975) studied the asymptotic relative efficiency (ARE) of logistic regression (LR) to normal discrimination (LDA), defined as
$$\lim_{n\rightarrow\infty} \frac{E\bigl(R(\hat l_{LDA}) - R(\phi_B)\bigr)}{E\bigl(R(\hat l_{LR}) - R(\phi_B)\bigr)}.$$
Logistic regression is shown to be typically between one half and two thirds as effective as normal discrimination.

General Framework for Comparison
Identify the limiting distribution of $\hat\beta_n$ for other classification procedures (SVM, boosting, etc.) under the canonical LDA setting, where
$$\hat\beta_n = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n \rho(y_i, x_i; \beta).$$
This requires large-sample theory for M-estimators.
Find the excess error of each method and compute its efficiency relative to LDA.

M-estimator Asymptotics
Pollard (ET 1991), Hjort and Pollard (1993), Geyer (AOS 1994), Knight and Fu (AOS 2000), Rocha, Wang and Yu (2009)
Convexity of the loss $\rho$ is the key. Let $L(\beta) = E\rho(Y, X; \beta)$, $\beta^* = \arg\min L(\beta)$,
$$H(\beta) = \frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^\top}, \quad\text{and}\quad G(\beta) = E\left[\Bigl(\frac{\partial\rho(Y, X; \beta)}{\partial\beta}\Bigr)\Bigl(\frac{\partial\rho(Y, X; \beta)}{\partial\beta}\Bigr)^{\!\top}\right].$$
Under some regularity conditions,
$$\sqrt{n}\bigl(\hat\beta_n - \beta^*\bigr) \rightarrow N\bigl(0,\; H(\beta^*)^{-1} G(\beta^*) H(\beta^*)^{-1}\bigr) \quad\text{in distribution.}$$
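
As a concrete instance of the sandwich formula (an illustrative sketch, with the logistic loss and simulated data standing in for a general $\rho$), one can estimate $H$ and $G$ by their empirical counterparts at $\hat\beta_n$ and form $H^{-1}GH^{-1}$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Logistic loss rho(y, x; beta) = log(1 + exp(-y * (b0 + w'x))) on simulated data.
n, d = 500, 2
X = rng.normal(size=(n, d))
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0])), 1.0, -1.0)
Z = np.hstack([np.ones((n, 1)), X])                  # design with intercept

def loss(beta):
    return np.mean(np.logaddexp(0.0, -y * (Z @ beta)))

beta_hat = minimize(loss, np.zeros(d + 1), method="BFGS").x

# Per-observation gradient of rho: -y * sigmoid(-y * f) * z
f = Z @ beta_hat
s = 1.0 / (1.0 + np.exp(y * f))                      # sigmoid(-y * f)
grad_i = -(y * s)[:, None] * Z                       # n x (d+1)

G_hat = grad_i.T @ grad_i / n                        # estimate of E[grad grad']
p = 1.0 / (1.0 + np.exp(-f))
H_hat = (Z * (p * (1 - p))[:, None]).T @ Z / n       # estimate of the Hessian of L

H_inv = np.linalg.inv(H_hat)
Sigma_hat = H_inv @ G_hat @ H_inv                    # sandwich covariance estimate
print("sandwich covariance of sqrt(n)(beta_hat - beta*):\n", np.round(Sigma_hat, 3))
```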

Linear SVM
Koo, Lee, Kim, and Park (JMLR 2008), A Bahadur Representation of the Linear Support Vector Machine
With $\beta = (\beta_0, w^\top)^\top$ and $l(x; \beta) = w^\top x + \beta_0$,
$$\hat\beta_{\lambda,n} = \arg\min_\beta \Bigl\{\frac{1}{n}\sum_{i=1}^n \bigl(1 - y_i\, l(x_i; \beta)\bigr)_+ + \lambda\|w\|^2\Bigr\}.$$
Under the canonical LDA setting with $\pi_+ = \pi_-$, for $\lambda = o(n^{-1/2})$,
$$\sqrt{n}\bigl(\hat\beta_{\lambda,n} - \beta^*_{SVM}\bigr) \rightarrow N\bigl(0, \Sigma_{\beta^*_{SVM}}\bigr),$$
where $\beta^*_{SVM} = \dfrac{2}{2a^* + \Delta}\,\beta^*_{LDA}$ and $a^*$ is the constant such that $\phi(a^*)/\Phi(a^*) = \Delta/2$.
If $\pi_+ \neq \pi_-$, $\hat w_n$ still converges to a multiple of $w^*_{LDA}$, but $\hat\beta_0$ is inconsistent.

Relative Efficiency of SVM to LDA
Under the canonical LDA setting with $\pi_+ = \pi_- = 0.5$, the ARE of the linear SVM to LDA is
$$\mathrm{Eff} = \frac{2}{\Delta}\Bigl(1 + \frac{\Delta^2}{4}\Bigr)\phi(a^*).$$

Δ | R(φ_B) | a* | SVM | LR
2.0 | 0.1587 | -0.3026 | 0.7622 | 0.899
2.5 | 0.1056 | -0.6466 | 0.6636 | 0.786
3.0 | 0.0668 | -0.9685 | 0.5408 | 0.641
3.5 | 0.0401 | -1.2756 | 0.4105 | 0.486
4.0 | 0.0228 | -1.5718 | 0.2899 | 0.343
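
A short numerical check (a sketch; the helper names `a_star` and `eff_svm` are mine) that solves $\phi(a)/\Phi(a) = \Delta/2$ for $a^*$ and evaluates the efficiency formula as reconstructed above; the printed values should match the SVM column of the table.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def a_star(Delta):
    """Solve phi(a)/Phi(a) = Delta/2 for a (the constant in beta*_SVM)."""
    return brentq(lambda a: norm.pdf(a) / norm.cdf(a) - Delta / 2.0, -10.0, 10.0)

def eff_svm(Delta):
    """ARE of the linear SVM to LDA under the canonical setting."""
    a = a_star(Delta)
    return (2.0 / Delta) * (1.0 + Delta**2 / 4.0) * norm.pdf(a)

for Delta in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"Delta={Delta}: Bayes error={norm.cdf(-Delta / 2):.4f}, "
          f"a*={a_star(Delta):.4f}, Eff(SVM)={eff_svm(Delta):.4f}")
```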

Boosting
$$\hat\beta_n = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n \exp\bigl(-y_i\, l(x_i; \beta)\bigr)$$
Under the canonical LDA setting with $\pi_+ = \pi_-$,
$$\sqrt{n}\bigl(\hat\beta_n - \beta^*_{boost}\bigr) \rightarrow N\bigl(0, \Sigma_{\beta^*_{boost}}\bigr),$$
where $\beta^*_{boost} = \frac{1}{2}\beta^*_{LDA}$.
In general, $\hat\beta_n$ is a consistent estimator of $\frac{1}{2}\beta^*_{LDA}$.

Relative Efficiency of Boosting to LDA
Under the canonical LDA setting with $\pi_+ = \pi_- = 0.5$, the ARE of boosting to LDA is
$$\mathrm{Eff} = \Bigl(1 + \frac{\Delta^2}{4}\Bigr)\exp\Bigl(-\frac{\Delta^2}{4}\Bigr).$$

Δ | R(φ_B) | Boosting | SVM | LR
2.0 | 0.1587 | 0.7358 | 0.7622 | 0.899
2.5 | 0.1056 | 0.5371 | 0.6636 | 0.786
3.0 | 0.0668 | 0.3425 | 0.5408 | 0.641
3.5 | 0.0401 | 0.1900 | 0.4105 | 0.486
4.0 | 0.0228 | 0.0916 | 0.2899 | 0.343

Smooth SVM
Lee and Mangasarian (2001), SSVM: A Smooth Support Vector Machine
$$\hat\beta_{\lambda,n} = \arg\min_\beta \Bigl\{\frac{1}{n}\sum_{i=1}^n \bigl[(1 - y_i\, l(x_i; \beta))_+\bigr]^2 + \lambda\|w\|^2\Bigr\}$$
Under the canonical LDA setting with $\pi_+ = \pi_-$, for $\lambda = o(n^{-1/2})$,
$$\sqrt{n}\bigl(\hat\beta_{\lambda,n} - \beta^*_{SSVM}\bigr) \rightarrow N\bigl(0, \Sigma_{\beta^*_{SSVM}}\bigr),$$
where $\beta^*_{SSVM} = \dfrac{2}{2a^* + \Delta}\,\beta^*_{LDA}$ and $a^*$ is the constant such that $\Delta\{a^*\Phi(a^*) + \phi(a^*)\} = 2\Phi(a^*)$.

Relative Efficiency of SSVM to LDA
Under the canonical LDA setting with $\pi_+ = \pi_- = 0.5$, the ARE of the smooth SVM to LDA is
$$\mathrm{Eff} = \frac{(4 + \Delta^2)\,\Phi(a^*)}{\Delta(2a^* + \Delta)}.$$

Δ | R(φ_B) | a* | SSVM | SVM | LR
2.0 | 0.1587 | 0.4811 | 0.9247 | 0.7622 | 0.899
2.5 | 0.1056 | 0.0058 | 0.8200 | 0.6636 | 0.786
3.0 | 0.0668 | -0.4073 | 0.6779 | 0.5408 | 0.641
3.5 | 0.0401 | -0.7821 | 0.5206 | 0.4105 | 0.486
4.0 | 0.0228 | -1.1312 | 0.3712 | 0.2899 | 0.343
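
The analogous check for the smooth SVM (again a sketch with helper names of my own choosing): solve $\Delta\{a\Phi(a) + \phi(a)\} = 2\Phi(a)$ for $a^*$ and evaluate the efficiency formula as reconstructed above; the output should match the SSVM column of the table.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def a_star_ssvm(Delta):
    """Solve Delta * (a*Phi(a) + phi(a)) = 2*Phi(a) for a."""
    g = lambda a: Delta * (a * norm.cdf(a) + norm.pdf(a)) - 2.0 * norm.cdf(a)
    return brentq(g, -10.0, 10.0)

def eff_ssvm(Delta):
    """ARE of the smooth SVM to LDA under the canonical setting."""
    a = a_star_ssvm(Delta)
    return (4.0 + Delta**2) * norm.cdf(a) / (Delta * (2.0 * a + Delta))

for Delta in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"Delta={Delta}: a*={a_star_ssvm(Delta):.4f}, Eff(SSVM)={eff_ssvm(Delta):.4f}")
```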

Possible Explanation for Increased Efficiency
Hastie, Tibshirani, and Friedman (2001), Elements of Statistical Learning
There is a close connection between Fisher's LDA and the regression approach to classification with class indicators:
$$\min_{\beta_0, w} \sum_{i=1}^n (y_i - \beta_0 - w^\top x_i)^2.$$
The least squares coefficient vector is identical, up to a scalar multiple, to the LDA coefficient vector:
$$\hat w \propto \hat\Sigma^{-1}(\hat\mu_1 - \hat\mu_2).$$
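
A quick numerical check of this proportionality (a sketch with simulated data, not from the talk): regress the $\pm 1$ class indicator on $x$ by least squares and compare the slope vector with the pooled-covariance LDA direction; the elementwise ratio should be constant.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two Gaussian classes with a shared covariance; y is the +/-1 class indicator.
n, d = 500, 4
y = rng.choice([-1.0, 1.0], size=n)
mu = np.array([1.0, 0.5, 0.0, -0.5])
A = rng.normal(size=(d, d)); Sigma = A @ A.T / d + np.eye(d)
X = y[:, None] * mu + rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Least squares regression of the class indicator on x (with intercept).
Z = np.hstack([np.ones((n, 1)), X])
w_ls = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

# LDA direction: pooled-covariance inverse times the mean difference.
mu_p, mu_m = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
Xc = np.vstack([X[y == 1] - mu_p, X[y == -1] - mu_m])
w_lda = np.linalg.solve(Xc.T @ Xc / (n - 2), mu_p - mu_m)

# The two coefficient vectors should agree up to a scalar multiple.
print("elementwise ratio w_ls / w_lda:", np.round(w_ls / w_lda, 4))
```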

Finite-Sample Excess Error
Figure: Excess error versus sample size (n from 50 to 1000) for LDA (1.2578), LR (1.5021), SVM (1.7028), and SSVM (1.4674); $\Delta = 2$, $d = 5$, $R(\phi_B) = 0.1587$, and $\pi_+ = \pi_-$. The results are based on 1000 replicates.

A Mixture of Two Gaussian Distributions
Figure: $\Delta_W$ and $\Delta_B$ indicate the mean difference between the two Gaussian components within each class and the mean difference between the two classes, respectively.

As $\Delta_W$ Varies
Figure: Error rate of the Bayes rule, LDA, LR, SVM, and SSVM as $\Delta_W$ varies from 1 to 4; $\Delta_B = 2$, $d = 5$, $\pi_+ = \pi_-$, $\pi_1 = \pi_2$, and $n = 100$.

As $\Delta_B$ Varies
Figure: Error rate of the Bayes rule, LDA, LR, SVM, and SSVM as $\Delta_B$ varies from 1 to 4; $\Delta_W = 1$, $d = 5$, $\pi_+ = \pi_-$, $\pi_1 = \pi_2$, and $n = 100$.

As $\Delta_B$ Varies
Figure: Increase in error rate over the Bayes rule for LDA, LR, SVM, and SSVM as $\Delta_B$ varies from 1 to 4; $\Delta_W = 1$, $d = 5$, $\pi_+ = \pi_-$, $\pi_1 = \pi_2$, and $n = 100$.

As Dimension $d$ Varies
Figure: Error rate of the Bayes rule, LDA, LR, SVM, and SSVM as the dimension varies from 5 to 85; $\Delta_W = 1$, $\Delta_B = 2$, $\pi_+ = \pi_-$, $\pi_1 = \pi_2$, and $n = 100$.

QDA Setting
Figure: Density contours of the two classes for $C = 1, 2, 3, 4$, where $X \mid Y = 1 \sim N(\mu_1, \Sigma)$ and $X \mid Y = -1 \sim N(\mu_2, C\Sigma)$.

Decomposition of Error
For a rule $\phi \in \mathcal{F}$,
$$R(\phi) - R(\phi_B) = \underbrace{R(\phi) - R(\phi_{\mathcal{F}})}_{\text{estimation error}} + \underbrace{R(\phi_{\mathcal{F}}) - R(\phi_B)}_{\text{approximation error}},$$
where $\phi_{\mathcal{F}} = \arg\min_{\phi \in \mathcal{F}} R(\phi)$.
When a method $M$ is used to choose $\phi$ from $\mathcal{F}$,
$$R(\phi) - R(\phi_{\mathcal{F}}) = \underbrace{R(\phi) - R(\phi_M)}_{M\text{-specific est. error}} + \underbrace{R(\phi_M) - R(\phi_{\mathcal{F}})}_{M\text{-specific approx. error}},$$
where $\phi_M$ is the method-specific limiting rule within $\mathcal{F}$.

Approximation Error
Figure: Bayes error and approximation error as $C$ varies from 1 to 4 in the QDA setting with $\Delta = 2$, $\Sigma = I$, $d = 10$, and $\pi_+ = \pi_-$.

Method-Specific Approximation Error
Figure: Method-specific approximation error of the linear classifiers (LDA, SVM, SSVM, Boosting) as $C$ varies from 1 to 4 in the QDA setting.

Extensions
For high-dimensional data, study double asymptotics where $d$ also grows with $n$.
Compare methods in a regularization framework.
Investigate consistency and relative efficiency under other models.
Consider potential mis-specification of a model: "All models are wrong, but some are useful." George Box
Compare methods in terms of robustness.

Concluding Remarks
We compared the modeling approach with the algorithmic approach in terms of the efficiency of reducing error rates.
Under the normal setting, modeling leads to more efficient use of the data: the linear SVM is shown to be between 40% and 67% as effective as LDA when the Bayes error rate is between 4% and 10%.
The loss function plays an important role in determining the efficiency of the corresponding procedure; the squared hinge loss could yield a more effective procedure than logistic regression.
The theoretical comparisons can be extended in many directions.

References
P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.
B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352):892-898, 1975.
N. L. Hjort and D. Pollard. Asymptotics for minimisers of convex processes. Statistical Research Report, May 1993.
Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, and Changyi Park. A Bahadur representation of the linear support vector machine. Journal of Machine Learning Research, 9:1343-1368, 2008.

Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5-22, 2001.
Y. Lin. A note on margin-based loss functions in classification. Statistics and Probability Letters, 68:73-82, 2002.
D. Pollard. Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7:186-199, 1991.
G. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l1-penalized parametric M-estimators, with applications to linear SVM and logistic regression. arXiv:0908.1940v1, August 2009.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004.