On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong

The goal of the work presented here is to illustrate that classification error responds to error in the target probability estimates in a much different (and perhaps less intuitive) way than squared estimation error.

Overview
1. Function Estimation and Estimation Error
2. Classification and Classification Error
3. Discussion

Function Estimation
Input: x. Output: $y = f(x) + \varepsilon$, where $f(x)$ (the "target function") is a single-valued deterministic function of x and $\varepsilon$ is a random variable with $E(\varepsilon \mid x) = 0$. The goal is to obtain an estimate $\hat f(x; T)$ using a training data set T.

Estimation Error
The goal is to obtain an estimate $\hat f(x; T)$ using a training data set T.
Mean squared error:
$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
1. Squared bias
2. Variance
3. Irreducible prediction error

$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
1. Squared bias: the extent to which the average prediction over all data sets differs from the desired regression function.
2. Variance: the extent to which the solutions for individual data sets vary around their average (sensitivity to the particular choice of data set).
3. Irreducible prediction error.
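
For completeness, a short derivation of this decomposition (writing $\hat f$ for $\hat f(x;T)$); it uses only $E[\varepsilon \mid x] = 0$, the independence of $\varepsilon$ from T, and adding and subtracting $E_T \hat f$:

```latex
\begin{aligned}
E_{T,\varepsilon}\big[(y-\hat f)^2\big]
  &= E_{T,\varepsilon}\big[\big((f(x)-\hat f)+\varepsilon\big)^2\big]
   = E_T\big[(f(x)-\hat f)^2\big] + E\big[\varepsilon^2 \mid x\big]
   \quad\text{(the cross term vanishes since } E[\varepsilon\mid x]=0\text{),}\\[4pt]
E_T\big[(f(x)-\hat f)^2\big]
  &= E_T\big[\big((f(x)-E_T\hat f)+(E_T\hat f-\hat f)\big)^2\big]
   = \big(f(x)-E_T\hat f\big)^2 + E_T\big[(\hat f-E_T\hat f)^2\big]
   \quad\text{(since } E_T\big[E_T\hat f-\hat f\big]=0\text{).}
\end{aligned}
```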

Bias-Variance Trade-off
$y = f(x) + \varepsilon$, with $f(x) = \sin(2\pi x)$ and x uniformly distributed. The data sets T were obtained by first computing the corresponding values of the function $\sin(2\pi x)$ and then adding a small level of random noise. (Christopher Bishop, Pattern Recognition and Machine Learning, 2006)

Bias-Variance Trade-off
$y = f(x) + \varepsilon$, with $f(x) = \sin(2\pi x)$. We generate 25 data sets from f(x), each containing 25 data points. For each data set we fit a polynomial function to the data. (Christopher Bishop, Pattern Recognition and Machine Learning, 2006)

The left column shows the result of fitting the model to each of the 25 data sets; the right column shows the corresponding average of the 25 fits. (Christopher Bishop, Pattern Recognition and Machine Learning, 2006)
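
A minimal Python sketch of this experiment; the polynomial degree and the noise level are assumptions, since the slides only say "a polynomial function" and "a small level of random noise":

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_points, degree, noise_sd = 25, 25, 3, 0.3   # degree and noise level are assumed

def f(x):
    return np.sin(2 * np.pi * x)                           # target function

x_grid = np.linspace(0, 1, 200)
fits = []
for _ in range(n_datasets):
    x = rng.uniform(0, 1, n_points)                        # x uniformly distributed
    y = f(x) + rng.normal(0, noise_sd, n_points)           # y = f(x) + eps
    coef = np.polyfit(x, y, degree)                        # fit a polynomial to this data set
    fits.append(np.polyval(coef, x_grid))

fits = np.array(fits)                                      # one curve per data set ("left column")
avg_fit = fits.mean(axis=0)                                # average of the 25 fits ("right column")
bias_sq = np.mean((avg_fit - f(x_grid)) ** 2)              # (integrated) squared bias
variance = np.mean(fits.var(axis=0))                       # (integrated) variance
print(f"squared bias ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

Increasing the polynomial degree (or, in Bishop's version of the experiment, weakening the regularization) typically lowers the squared bias while raising the variance, which is the trade-off discussed on the next slide.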

Bias-Variance Trade-off
$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
1. Squared bias 2. Variance 3. Irreducible prediction error
It is desirable to have both low squared bias and low variance, since both contribute to the squared estimation error in equal measure. However, there is a natural bias-variance trade-off associated with function approximation.

Classification
Input: $x = (x_1, \ldots, x_n)$. Output: $y \in \{0, 1\}$. Prediction: $\hat y \in \{0, 1\}$.
The goal is to choose $\hat y(x)$ to minimize inaccuracy as characterized by the misclassification risk.

The goal is to choose $\hat y(x)$ to minimize inaccuracy as characterized by the misclassification risk
$r(x) = \ell_1\, f(x)\, 1[\hat y(x) = 0] + \ell_0\, (1 - f(x))\, 1[\hat y(x) = 1]$   (2.2)
Here $\ell_0$ and $\ell_1$ are the losses incurred for the respective misclassifications, $1[\cdot]$ is an indicator function, and $f(x)$ is given by $f(x) = \Pr(y = 1 \mid x) = E[y \mid x]$.

The misclassification risk (2.2) is minimized by the ("Bayes") rule
$y_B(x) = 1\big[f(x) \ge \ell_0/(\ell_0 + \ell_1)\big]$,
which achieves the lowest possible risk
$r_B(x) = \min\big[\ell_1 f(x),\ \ell_0 (1 - f(x))\big]$.

The training data set T is used to learn a classification rule $\hat y(x; T)$ for (future) prediction. The usual paradigm for accomplishing this is to use the training data T to form an approximation (estimate) $\hat f(x; T)$ of $f(x)$. Regular function estimation technology, e.g., neural networks, decision tree induction methods, or nearest neighbor methods, can be applied to obtain the estimate $\hat f(x; T)$, which is then plugged into (2.6) to form a classification rule.
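
As a hedged illustration of this plug-in paradigm (not the paper's specific construction): any estimate $\hat f$ of $f(x) = \Pr(y = 1 \mid x)$ is turned into a classification rule by thresholding it at $\ell_0/(\ell_0 + \ell_1)$. The K-nearest-neighbor probability estimate on a one-dimensional input below is just one illustrative choice of estimator.

```python
import numpy as np

def knn_prob_estimate(x_train, y_train, x, k=25):
    """Estimate f(x) = Pr(y = 1 | x) as the fraction of 1s among the k nearest neighbors."""
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

def plug_in_rule(f_hat, l0=1.0, l1=1.0):
    """Turn a probability estimate f_hat into the rule y_hat(x) = 1[f_hat(x) >= l0/(l0+l1)]."""
    threshold = l0 / (l0 + l1)
    return lambda x: int(f_hat(x) >= threshold)

# Toy usage on synthetic data (the two-region example of the later slides):
rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 200)
y_tr = (rng.uniform(size=200) < np.where(x_tr <= 0.5, 0.9, 0.1)).astype(float)
rule = plug_in_rule(lambda x: knn_prob_estimate(x_tr, y_tr, x))
print(rule(0.48))   # agrees with the Bayes rule (predict 1) with high probability
```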

Classification Error
Let $\ell_0 = \ell_1 = 1$. The misclassification risk is then minimized by the ("Bayes") rule
$y_B(x) = 1[f(x) \ge 1/2]$,
which achieves the lowest possible risk. The prediction is
$\hat y(x; T) = 1[\hat f(x; T) \ge 1/2]$.
If the prediction agrees with that of the Bayes rule:
$\Pr(\hat y(x; T) \ne y) = \Pr(y_B(x) \ne y) = \min[f(x),\ 1 - f(x)]$
If not:
$\Pr(\hat y(x; T) \ne y) = \max[f(x),\ 1 - f(x)] = |2f(x) - 1| + \Pr(y_B(x) \ne y)$

Classification Error
Therefore one has
$\Pr(\hat y(x; T) \ne y) = |2f(x) - 1|\, 1[\hat y(x; T) \ne y_B(x)] + \Pr(y_B(x) \ne y)$.
Averaging over all training samples T, under the assumption that they are drawn independently of the future data to be predicted (so the expectation of the indicator becomes the probability of disagreement), one has
$\Pr(\hat y(x) \ne y) = |2f(x) - 1|\, \Pr(\hat y(x) \ne y_B(x)) + \Pr(y_B(x) \ne y)$.

Classification Error
$y_B(x) = 1[f(x) \ge 1/2]$, $\hat y(x) = 1[\hat f(x) \ge 1/2]$.
In this decomposition, $\Pr(\hat y \ne y_B)$ is the only quantity that involves the probability estimate $\hat f$.

Classification Error
$\Pr(Y = 1 \mid x) = 0.9$, $\Pr(Y = 0 \mid x) = 0.1$ for $0 \le x \le 0.5$;
$\Pr(Y = 1 \mid x) = 0.1$, $\Pr(Y = 0 \mid x) = 0.9$ for $0.5 < x \le 1$.
Sample 100 observations from each class, fit a linear regression model, and evaluate the estimate at $x = 0.48$, with $\hat y(x) = 1[\hat f(x) \ge 1/2]$.
Here $\Pr(y \ne y_B) = 0.1$ and $f = \Pr(y = 1 \mid x = 0.48) = 0.9$.

Classification Error
$\Pr(Y = 1 \mid x) = 0.9$, $\Pr(Y = 0 \mid x) = 0.1$ for $0 \le x \le 0.5$;
$\Pr(Y = 1 \mid x) = 0.1$, $\Pr(Y = 0 \mid x) = 0.9$ for $0.5 < x \le 1$.
Sample 100 observations from each class, fit a linear regression model, and evaluate the estimate at $x = 0.48$.
$\Pr(y \ne y_B) = 0.1$ and $f = \Pr(y = 1 \mid x = 0.48) = 0.9$.
Over repeated training samples, $\operatorname{mean}(\hat f) = 0.5337$ and $\Pr(\hat f < 0.5) = 0.0602$, so
$\Pr(\hat y \ne y) = |2f - 1| \cdot 0.0602 + \Pr(y_B \ne y) = |2(0.9) - 1| \cdot 0.0602 + 0.1 = 0.1482$.
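
A minimal Monte Carlo sketch of this calculation. The slides' sampling scheme ("100 observations from each class") is ambiguous, so this sketch draws 100 x-values uniformly within each half of [0, 1] and labels them with the stated probabilities; the reproduced numbers are therefore only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, x0 = 10_000, 0.48

f_hat = np.empty(n_reps)
for r in range(n_reps):
    x = np.concatenate([rng.uniform(0.0, 0.5, 100), rng.uniform(0.5, 1.0, 100)])
    p1 = np.where(x <= 0.5, 0.9, 0.1)                 # Pr(Y = 1 | x)
    y = (rng.uniform(size=200) < p1).astype(float)
    slope, intercept = np.polyfit(x, y, 1)            # linear regression of y on x
    f_hat[r] = slope * x0 + intercept                 # estimate evaluated at x = 0.48

p_disagree = np.mean(f_hat < 0.5)                     # Pr(y_hat != y_B) at x = 0.48
f, bayes_err = 0.9, 0.1
error = abs(2 * f - 1) * p_disagree + bayes_err       # classification error at x = 0.48
print(f"mean f_hat = {f_hat.mean():.4f}, Pr(f_hat < 0.5) = {p_disagree:.4f}, error = {error:.4f}")
```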

Classification Error
In order to gain some intuition, we approximate the sampling distribution $p(\hat f)$ of the estimate by a normal distribution.

Classification Error
$\hat y(x) = 1[\hat f(x) \ge 1/2]$.
Under the normal approximation,
$\Pr(\hat y \ne y_B) = \Phi\!\left(\dfrac{b(f, E\hat f)}{\sqrt{\operatorname{var}\hat f}}\right)$,
where the boundary bias is $b(f, E\hat f) = \operatorname{sign}(1/2 - f)\,(E\hat f - 1/2)$. No estimation-bias term $E\hat f - f$ appears in this expression.

Classification Error
$\Pr(\hat y \ne y_B) = \Phi\!\left(\dfrac{b(f, E\hat f)}{\sqrt{\operatorname{var}\hat f}}\right)$; there is no $E\hat f - f$ term.
For a given $\operatorname{var}\hat f$, so long as the boundary bias remains negative, the classification error decreases as $E\hat f$ moves further from 1/2, irrespective of the estimation bias $E\hat f - f$. For a positive boundary bias, the classification error increases with the distance of $E\hat f$ from 1/2.
For a given $E\hat f$, so long as the boundary bias remains negative, the classification error decreases as the variance decreases. For a positive boundary bias, the error increases as the variance decreases.
What is needed is therefore (1) a negative boundary bias and (2) a small enough variance.

Classification Error
The key thing to note is that $E\hat f$ may be off from $f$ by a huge margin. This does not matter as long as the estimate lies on the appropriate side of 1/2 (negative boundary bias $b$) and its variance is kept small.
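
A hedged numerical illustration of this point using the normal approximation above; the values of $f$, $E\hat f$, and $\operatorname{var}\hat f$ below are hypothetical. The estimate is badly biased ($E\hat f$ far from $f = 0.9$), yet as long as it stays on the correct side of 1/2 (negative boundary bias), pushing $E\hat f$ further from 1/2 or shrinking the variance only reduces the classification error.

```python
import numpy as np
from scipy.stats import norm

def disagreement_prob(f, mean_fhat, var_fhat):
    """Normal approximation: Pr(y_hat != y_B) = Phi(b / sqrt(var f_hat)),
    with boundary bias b = sign(1/2 - f) * (E f_hat - 1/2)."""
    b = np.sign(0.5 - f) * (mean_fhat - 0.5)
    return norm.cdf(b / np.sqrt(var_fhat))

f = 0.9                                  # true Pr(y = 1 | x); the Bayes rule predicts 1
for mean_fhat in (0.55, 0.70):           # heavily biased estimates, but on the right side of 1/2
    for var_fhat in (0.04, 0.01):
        p = disagreement_prob(f, mean_fhat, var_fhat)
        error = abs(2 * f - 1) * p + min(f, 1 - f)
        print(f"E[f_hat]={mean_fhat:.2f}  var={var_fhat:.2f}  "
              f"Pr(y_hat != y_B)={p:.3f}  classification error={error:.3f}")
```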

Estimation Error:
$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
(1. squared bias, 2. variance, 3. irreducible prediction error)
Classification Error:
$\Pr(\hat y(x) \ne y) = |2f(x) - 1|\,\Phi\!\left(\dfrac{b(f, E\hat f)}{\sqrt{\operatorname{var}\hat f}}\right) + \Pr(y_B(x) \ne y)$
The bias-variance trade-off is clearly very different for classification error than for estimation error on the probability function f itself. The dependence of squared estimation error on $E\hat f$ and $\operatorname{var}\hat f$ is additive, whereas for classification error there is a strong multiplicative interaction effect. Certain methods that are inappropriate for function estimation because of their very high bias may perform well for classification when their estimates are used in the context of a classification rule. All that is required is a negative boundary bias and a small enough variance. Procedures whose bias is caused by over-smoothing tend to have a negative boundary bias, e.g., the naive Bayes method and K-nearest neighbors.

Table 1 shows the average squared estimation error (column 2) and classification error (column 4) as a function of the training sample size N (column 1), along with the corresponding optimal numbers of nearest neighbors, $K_e$ and $K_c$ respectively (columns 3 and 5), at n = 20 dimensions. One sees that classification error decreases at a much faster rate than squared estimation error as N increases. The optimal value of K for squared estimation error (column 3) is seen to increase only very slowly with N.

One sees that classification error is not completely immune to the tendency of K-nearest neighbor methods to degrade as irrelevant inputs are included. But whereas the squared estimation error degrades by over a factor of 35 as the number of irrelevant inputs is increased by a factor of 20, the corresponding increase in classification error is less than a factor of six. Squared estimation error thus increases at a much faster rate than classification error as n increases.

Squared estimation error (upper panel) and classification error (lower panel) as a function of the number of nearest neighbors K, for n = 20 dimensions and training sample size N = 3200. One sees that the choice of the number of nearest neighbors is less critical for classification error, so long as K is neither too small nor too large (here 500 ≤ K ≤ 2000). Quite often, when K-nearest neighbors is compared to other classification methods, a small value of K is used. The simple example examined here suggests that, at least in some situations, this may underestimate the performance achievable with the K-nearest neighbor approach.
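
A rough sketch of this kind of experiment. The slides do not restate the paper's simulation design, so the target below (only the first of the 20 inputs is relevant; the rest are irrelevant noise dimensions) is a hypothetical stand-in. It is meant to show the qualitative pattern described above: large K yields a heavily over-smoothed, badly biased but low-variance $\hat f$, so squared estimation error remains high while classification error stays close to the Bayes rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_train, n_test = 20, 3200, 1000

def f_true(X):
    """Hypothetical target: Pr(y = 1 | x) depends only on the first coordinate."""
    return np.where(X[:, 0] <= 0.5, 0.9, 0.1)

def sample(n):
    X = rng.uniform(0, 1, size=(n, n_dims))
    y = (rng.uniform(size=n) < f_true(X)).astype(float)
    return X, y

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)
f_te = f_true(X_te)

# Squared Euclidean distances from every test point to every training point;
# neighbors are sorted once and reused for every K.
d2 = (X_te**2).sum(1)[:, None] - 2 * X_te @ X_tr.T + (X_tr**2).sum(1)[None, :]
order = np.argsort(d2, axis=1)

print(f"Bayes error on this test set: {np.mean((f_te >= 0.5) != y_te):.4f}")
for K in (5, 50, 500, 2000):
    f_hat = y_tr[order[:, :K]].mean(axis=1)           # K-NN estimate of f(x)
    y_hat = (f_hat >= 0.5).astype(float)
    sq_err = np.mean((f_hat - f_te) ** 2)             # squared estimation error
    cls_err = np.mean(y_hat != y_te)                  # 0/1 classification error
    print(f"K={K:5d}  squared error={sq_err:.4f}  classification error={cls_err:.4f}")
```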

Much research in classification has been devoted to achieving more accurate probability estimates, under the presumption that this will generally lead to more accurate predictions. This need not always be the case.