# A Study of Relative Efficiency and Robustness of Classification Methods

Size: px
Start display at page:

## Transcription

1 A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics Seoul National University

2 Iris Data (red=setosa,green=versicolor,blue=virginica) Sepal.Length Sepal.Width Petal.Length Petal.Width

3 Figure: courtesy of Hastie, Tibshirani, & Friedman (2001) Handwritten digit recognition Cancer diagnosis with gene expression profiles Text categorization

4 Classification x = (x 1,..., x d ) R d y Y = {1,..., k} Learn a rule φ : R d Y from the training data {(x i, y i ), i = 1,..., n}, where (x i, y i ) are i.i.d. with P(X, Y). The 0-1 loss function: ρ(y,φ(x)) = I(y φ(x)) The Bayes decision rule φ B minimizing the error rate R(φ) = P(Y φ(x)) is φ B (x) = arg max P(Y = k X = x). k

5 Statistical Modeling: The Two Cultures One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. Breiman (2001) Model-based methods in statistics: LDA, QDA, logistic regression, kernel density classification Algorithmic methods in machine learning: Support vector machine (SVM), boosting, decision trees, neural network Less is required in pattern recognition. Devroye, Györfi and Lugosi (1996) If you possess a restricted information for solving some problem, try to solve the problem directly and never solve a general problem as an intermediate step. Vapnik (1998)

6 Questions Is modeling necessary for classification? Does modeling lead to more accurate classification? How to quantify the relative efficiency? How do the two approaches compare?

7 Convex Risk Minimization In the binary case (k = 2), suppose that y = 1 or -1. Typically obtain a discriminant function f : R d R, which induces a classifier φ(x) = sign(f(x)), by minimizing the risk under a convex surrogate loss of the 0-1 loss ρ(y, f(x)) = I(yf(x) 0). Logistic regression: binomial deviance (- log likelihood) Support vector machine: hinge loss Boosting: exponential loss

8 Logistic Regression Agresti (2002), Categorical Data Analysis Model the conditional distribution p k (x) = P(Y = k X = x) directly. p 1 (x) log 1 p 1 (x) = f(x) Then Y X = x Bernoulli distribution with p 1 (x) = exp(f(x)) 1+exp(f(x)) and p 1(x) = 1 1+exp(f(x)). Maximizing the conditional likelihood of (y 1,...,y n ) given (x 1,..., x n ) (or minimizing the negative log likelihood) amounts to min f n log ( 1+exp( y i f(x i )) ). i=1

9 Support Vector Machine Vapnik (1996), The Nature of Statistical Learning Theory Find f with a large margin minimizing 1 n n (1 y i f(x i )) + +λ f 2. i=1 x margin=2/ β β t x + β 0 = 1 β t x + β 0 = 0 β t x + β 0 = x 1

10 Boosting Freund and Schapire (1997), A decision-theoretic generalization of on-line learning and an application to boosting A meta-algorithm that combines the outputs of many weak classifiers to form a powerful committee Sequentially apply a weak learner to produce a sequence of classifiers f m (x), m = 1, 2,..., M and take a weighted majority vote for the final prediction. AdaBoost minimizes the exponential risk function with a stagewise gradient descent algorithm: min f n exp( y i f(x i )). i=1 Friedman, Hastie, and Tibshirani (2000), Additive Logistic Regression: A Statistical View of Boosting

11 Loss Functions ÄÓ Misclassification Exponential Binomial Deviance Squared Error Support Vector Ý Figure: courtesy of HTF (2001)

12 Classification Consistency The population minimizer f of ρ is defined as f with the minimum risk R(f) = Eρ(Y, f(x)). Negative log-likelihood (deviance) Hinge loss (SVM) f (x) = log Exponential loss (boosting) p 1 (x) 1 p 1 (x) f (x) = sign(p 1 (x) 1/2) f (x) = 1 2 log p 1 (x) 1 p 1 (x) sign(f ) yields the Bayes rule. Both modeling and algorithmic approaches are consistent. Lin (2000), Zhang (AOS 2004), Bartlett, Jordan, and McAuliffe (JASA 2006)

13 Outline Efron s comparison of LDA with logistic regression Efficiency of algorithmic approach (support vector machine and boosting) Simulation studies for comparison of efficiency and robustness Discussion

14 Normal Distribution Setting Two multivariate normal distributions in R d with mean vectors µ 1 and µ 2 and a common covariance matrix Σ π + = P(Y = 1) and π = P(Y = 1). For example, when π + = π, Fisher s LDA boundary is { } { Σ 1 (µ 1 µ 2 ) x 1 } 2 (µ 1 +µ 2 ) = 0. µ 1 µ 2

15 Canonical LDA setting Efron (JASA 1975), The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis X N(( /2)e 1, I) for Y = 1 with probability π + X N( ( /2)e 1, I) for Y = 1 with probability π where = {(µ 1 µ 2 ) Σ 1 (µ 1 µ 2 )} 1/2 Fisher s linear discriminant function is l(x) = log(π + /π )+ x 1. Let β 0 = log(π +/π ), (β 1,...,β d ) = e 1, and β = (β 0,...,β d ).

16 Excess Error For a linear discriminant method ˆl with coefficient vector ˆβ n, if n(ˆβ n β ) N(0,Σ β ), the expected increased error rate of ˆl, E(R(ˆl) R(φ B )) = π [ +φ(d 1 ) σ 00 2β 0 2 n σ 01+ β 2 0 where D 1 = /2+(1/ ) log(π + /π ). 2σ 11+σ σ dd ] +o( 1 n ), In particular, when π + = π, E(R(ˆl) R(φ B )) = φ( /2) ] [σ 00 +σ σ dd + o( 1 4 n n ).

17 Relative Efficiency Efron (1975) studied the Asymptotic Relative Efficiency (ARE) of logistic regression (LR) to normal discrimination (LDA) defined as E(R(ˆl lim LDA ) R(φ B )) n E(R(ˆl LR ) R(φ B )). Logistic regression is shown to be between one half and two thirds as effective as normal discrimination typically.

18 General Framework for Comparison Identify the limiting distribution of ˆβ n for other classification procedures (SVM, boosting, etc.) under the canonical LDA setting: 1 n ˆβ n = arg min ρ(y i, x i ;β) β n i=1 Need large sample theory for M-estimators. Find the excess error of each method and compute the efficiency relative to LDA.

19 M-estimator Asymptotics Pollard (ET 1991), Hjort and Pollard (1993), Geyer (AOS 1994), Knight and Fu (AOS 2000), Rocha, Wang and Yu (2009) Convexity of the loss ρ is the key. Let L(β) = Eρ(Y, X;β), β = arg min L(β), H(β) = 2 L(β), and G(β) = E β β Under some regularity conditions, ( )( ρ(y, X;β) ρ(y, X;β) β n(ˆβn β ) N(0, H(β ) 1 G(β )H(β ) 1 ) β ) in distribution.

20 Linear SVM Koo, Lee, Kim, and Park (JMLR 2008), A Bahadur Representation of the Linear Support Vector Machine With β = (β 0, w ), l(x;β) = w x +β { 0 } 1 n βλ,n = arg min (1 y i l(x i ;β)) + +λ w 2 β n i=1 Under the canonical LDA setting with π + = π, for λ = o(n 1/2 ), n ( βλ,n β SVM ) N(0,Σ β SVM ), where β SVM = 2 (2a + ) β LDA and a is a constant such that φ(a )/Φ(a ) = /2. If π + π, ŵ n w LDA but ˆβ 0 is inconsistent.

21 Relative Efficiency of SVM to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of the linear SVM to LDA is Eff = 2 2 (1+ 4 )φ(a ). R(φ B ) a SVM LR

22 Boosting 1 βn = arg min β n n exp( y i l(x i ;β)) i=1 Under the canonical LDA setting with π + = π, where β boost = 1 2 β LDA. n ( βn β boost ) N(0,Σ β boost ), In general, β n is a consistent estimator of (1/2)β LDA.

23 Relative Efficiency of Boosting to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of Boosting to LDA is Eff = 1+ 2 /4 exp( 2 /4). R(φ B ) Boosting SVM LR

24 Smooth SVM Lee and Mangasarian (2001), SSVM: A Smooth Support Vector Machine { } 1 n βλ,n = arg min (1 y i l(x i ;β)) 2 + β n +λ w 2 i=1 Under the canonical LDA setting with π + = π, for λ = o(n 1/2 ), n ( βλ,n β SSVM ) N(0,Σ β SSVM ), where β SSVM = 2 (2a + ) β LDA and a is a constant such that {a Φ(a )+φ(a )} = 2Φ(a ).

25 Relative Efficiency of SSVM to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of the Smooth SVM to LDA is Eff = (4+ 2 )Φ(a ) (2a + ). R(φ B ) a SSVM SVM LR

26 Possible Explanation for Increased Efficiency Hastie, Tibshirani, and Friedman (2001), Elements of Statistical Learning There is a close connection between Fisher s LDA and regression approach to classification with class indicators: n min (y i β 0 w x i ) 2 = i=1 n (1 y i (β 0 + w x i )) 2 i=1 The least squares coefficient is identical up to a scalar multiple to the LDA coefficient: ŵ ˆΣ 1 (ˆµ 1 ˆµ 2 )

27 Finite-Sample Excess Error Excess Error LDA (1.2578) LR (1.5021) SVM (1.7028) SSVM (1.4674) Sample Size (inverse scale) Figure: = 2, d = 5, R(φ B ) = , and π + = π. The results are based on 1000 replicates.

28 What If Model is Mis-specified? All models are wrong, but some are useful. George Box Compare methods under A mixture of two Gaussian distributions Mislabeling in LDA setting Quadratic discriminant analysis setting

29 A Mixture of Two Gaussian Distributions µ f2 W µ f1 µ g2 B µ g1 Figure: W and B indicate the mean difference between two Gaussian components within each class and the mean difference between two classes.

30 As W Varies Error Rate Bayes Rule LDA LR SVM SSVM Figure: B = 2, d = 5, π + = π, π 1 = π 2, and n = 100 w

31 As B Varies Increased Error Rate LDA LR SVM SSVM Figure: W = 1, d = 5, π + = π, π 1 = π 2, and n = 100 B

32 As Dimension d Varies Error Rate Bayes Rule LDA LR SVM SSVM Dimension Figure: W = 1, B = 2, π + = π, π 1 = π 2, and n = 100

33 Mislabeling in LDA excess error SVM Huberized SVM Smooth SVM perturbation fraction Figure: Mean excess errors of SVM and its variants from 400 replicates as the mislabeling proportion varies. = 2.7, d = 5, R(φ B ) = , π + = π, and n = 100.

34 QDA Setting C= 1 C= µ µ µ µ C= 3 C= µ 2 µ µ 2 µ Figure: X Y = 1 N(µ 1,Σ) and X Y = 1 N(µ 2, CΣ)

35 Decomposition of Error For a rule φ F, R(φ) R(φ B ) = R(φ) R(φ F ) + R(φ }{{} F ) R(φ B ) }{{} estimation error approximation error where φ F = arg min φ F R(φ). When a method M is used to choose φ from F, R(φ) R(φ F ) = R(φ) R(φ M ) }{{} + R(φ M ) R(φ F ) }{{} M-specific est.error M-specific approx.error where φ M is the method-specific limiting rule within F.

36 Approximation Error Error Rate Bayes Error Approximation Error Figure: QDA setting with = 2, Σ = I, d = 10, and π + = π C

37 Method-Specific Approximation Error Approximation Error LDA SVM SSVM Boosting Figure: Method-specific approximation error of linear classifiers in the QDA setting C

38 Extensions For high dimensional data, study double asymptotics where d also grows with n. Compare methods in a regularization framework. Investigate consistency and relative efficiency under other models. Also consider potential model mis-specification and compare methods in terms of robustness.

39 Concluding Remarks Compared modeling approach with algorithmic approach in the efficiency of reducing error rates. Under the normal setting, modeling leads to more efficient use of data. Linear SVM is shown to be between 40% and 67% as effective as LDA when the Bayes error rate is between 4% and 10%. A loss function plays an important role in determining the efficiency of the corresponding procedure. Squared hinge loss could yield more effective procedure than logistic regression. There is a trade-off between efficiency and robustness. The theoretical comparisons can be extended in many directions.

40 References P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101: , B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352): , Dec N. L. Hjort and D. Pollard. Asymptotics for minimisers of convex processes. Statistical Research Report, May Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, and Changyi Park. A Bahadur representation of the linear Support Vector Machine. Journal of Machine Learning Research, 9: , 2008.

41 Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56 85, Y.-J. Lee and O.L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5 22, Y. Lin. A note on margin-based loss functions in classification. Statististics and Probability Letters, 68:73 82, D. Pollard. Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7: , G. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l 1 penalized parametric M-estimators, with applications to linear SVM and logistic regression. arxiv, v1:1 55, Aug 2009.

### Does Modeling Lead to More Accurate Classification?

Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang

### A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

### Does Modeling Lead to More Accurate Classification?: A Study of Relative Efficiency

Does Modeling Lead to More Accurate Classification?: A Study of Relative Efficiency Yoonkyung Lee, The Ohio State University Rui Wang, The Ohio State University Technical Report No. 869 July, 01 Department

### Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

### Regularization Paths

December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

### Chap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University

Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider two-class classification problems

### Regularization Paths. Theme

June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

### Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

### Machine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber

Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to

### Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

### Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation

### AdaBoost and other Large Margin Classifiers: Convexity in Classification

AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at

### EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

### Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

### Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC)

Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC) Eunsik Park 1 and Y-c Ivan Chang 2 1 Chonnam National University, Gwangju, Korea 2 Academia Sinica, Taipei,

### Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier

### EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

### Boosting Methods: Why They Can Be Useful for High-Dimensional Data

New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

### Lecture 5: LDA and Logistic Regression

Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

### Introduction to Machine Learning

1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,

### Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

### Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

### 10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

### A Bahadur Representation of the Linear Support Vector Machine

Journal of Machine Learning Research 9 2008 1343-1368 Submitted 12/06; Revised 2/08; Published 7/08 A Bahadur Representation of the Linear Support Vector Machine Ja-Yong Koo Department of Statistics Korea

### Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

### AdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague

AdaBoost Lecturer: Jan Šochman Authors: Jan Šochman, Jiří Matas Center for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Motivation Presentation 2/17 AdaBoost with trees

### Linear Regression and Discrimination

Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian

### Logistic Regression. Machine Learning Fall 2018

Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

### Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

### Chapter 6. Ensemble Methods

Chapter 6. Ensemble Methods Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Introduction

### Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction

Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee Predictive

### LEAST ANGLE REGRESSION 469

LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of Lasso

### The Bayes classifier

The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal

### Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

### The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers

The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers Hui Zou University of Minnesota Ji Zhu University of Michigan Trevor Hastie Stanford University Abstract We propose a new framework

### Classification. Chapter Introduction. 6.2 The Bayes classifier

Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

### Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

### Introduction to Machine Learning

1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

### Lecture 5: Classification

Lecture 5: Classification Advanced Applied Multivariate Analysis STAT 2221, Spring 2015 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department of Mathematical Sciences Binghamton

### Random projection ensemble classification

Random projection ensemble classification Timothy I. Cannings Statistics for Big Data Workshop, Brunel Joint work with Richard Samworth Introduction to classification Observe data from two classes, pairs

### Machine Learning for NLP

Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

### Support Vector Machines

Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

### Introduction to Machine Learning Spring 2018 Note 18

CS 189 Introduction to Machine Learning Spring 2018 Note 18 1 Gaussian Discriminant Analysis Recall the idea of generative models: we classify an arbitrary datapoint x with the class label that maximizes

### A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

### Lecture 18: Multiclass Support Vector Machines

Fall, 2017 Outlines Overview of Multiclass Learning Traditional Methods for Multiclass Problems One-vs-rest approaches Pairwise approaches Recent development for Multiclass Problems Simultaneous Classification

### Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing

Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen

Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

### Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

### Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way

### Logistic Regression Trained with Different Loss Functions. Discussion

Logistic Regression Trained with Different Loss Functions Discussion CS640 Notations We restrict our discussions to the binary case. g(z) = g (z) = g(z) z h w (x) = g(wx) = + e z = g(z)( g(z)) + e wx =

### Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:

### STK-IN4300 Statistical Learning Methods in Data Science

Outline of the lecture STK-IN4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no AdaBoost Introduction algorithm Statistical Boosting Boosting as a forward stagewise additive

### Why does boosting work from a statistical view

Why does boosting work from a statistical view Jialin Yi Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 939 jialinyi@sas.upenn.edu Abstract We review boosting

### A ROBUST ENSEMBLE LEARNING USING ZERO-ONE LOSS FUNCTION

Journal of the Operations Research Society of Japan 008, Vol. 51, No. 1, 95-110 A ROBUST ENSEMBLE LEARNING USING ZERO-ONE LOSS FUNCTION Natsuki Sano Hideo Suzuki Masato Koda Central Research Institute

### CS 195-5: Machine Learning Problem Set 1

CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

### Introduction to Machine Learning

Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

### Linear Models in Machine Learning

CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

### Gaussian discriminant analysis Naive Bayes

DM825 Introduction to Machine Learning Lecture 7 Gaussian discriminant analysis Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. is 2. Multi-variate

### Ensemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12

Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose

### The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is

### The classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA

The Bayes classifier Linear discriminant analysis (LDA) Theorem The classifier satisfies In linear discriminant analysis (LDA), we make the (strong) assumption that where the min is over all possible classifiers.

### Boosting. March 30, 2009

Boosting Peter Bühlmann buhlmann@stat.math.ethz.ch Seminar für Statistik ETH Zürich Zürich, CH-8092, Switzerland Bin Yu binyu@stat.berkeley.edu Department of Statistics University of California Berkeley,

### Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

### A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

### Pattern Recognition and Machine Learning

Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

### Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

### Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

### ISyE 6416: Computational Statistics Spring Lecture 5: Discriminant analysis and classification

ISyE 6416: Computational Statistics Spring 2017 Lecture 5: Discriminant analysis and classification Prof. Yao Xie H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology

### Linear Methods for Prediction

Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

### Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

### σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

### The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

### Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.

### Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

### Learning with multiple models. Boosting.

CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

### Kernelized Perceptron Support Vector Machines

Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:

### > DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

### Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

### ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

### A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University

A Gentle Introduction to Gradient Boosting Cheng Li chengli@ccs.neu.edu College of Computer and Information Science Northeastern University Gradient Boosting a powerful machine learning algorithm it can

### CMU-Q Lecture 24:

CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

### ECE-271B. Nuno Vasconcelos ECE Department, UCSD

ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative

### Classification Methods II: Linear and Quadratic Discrimminant Analysis

Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear

### TUM 2016 Class 1 Statistical learning theory

TUM 2016 Class 1 Statistical learning theory Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Machine learning applications Texts Images Data: (x 1, y 1 ),..., (x n, y n ) Note: x i s huge dimensional! All

### A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

### Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

### On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost Hamed Masnadi-Shirazi Statistical Visual Computing Laboratory, University of California, San Diego La

### Machine Learning 1. Linear Classifiers. Marius Kloft. Humboldt University of Berlin Summer Term Machine Learning 1 Linear Classifiers 1

Machine Learning 1 Linear Classifiers Marius Kloft Humboldt University of Berlin Summer Term 2014 Machine Learning 1 Linear Classifiers 1 Recap Past lectures: Machine Learning 1 Linear Classifiers 2 Recap

### Chapter 14 Combining Models

Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

### Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary

### Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

### Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

### Voting (Ensemble Methods)

1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers

### An Introduction to Statistical and Probabilistic Linear Models

An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning

### Discriminative Models

No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models