Lecture 3: Statistical Decision Theory (Part II)


Hao Helen Zhang

Outline of This Note

Part I: Statistical Decision Theory (Classical Statistical Perspective - Estimation):
- loss and risk
- MSE and bias-variance tradeoff
- Bayes risk and minimax risk

Part II: Statistical Decision Theory (Machine Learning Perspective - Prediction):
- optimal learner
- empirical risk minimization
- restricted estimators

Supervised Learning

Any supervised learning problem has three components:
- Input vector $X \in \mathbb{R}^d$.
- Output $Y$, either discrete- or continuous-valued.
- Probability framework: $(X, Y) \sim P(X, Y)$.

The goal is to estimate the relationship between $X$ and $Y$, described by $f(X)$, for future prediction and decision-making.
- For regression, $f: \mathbb{R}^d \to \mathbb{R}$.
- For $K$-class classification, $f: \mathbb{R}^d \to \{1, \ldots, K\}$.

Learning Loss Function

As in learning theory, we use a loss function $L$ to measure the discrepancy between $Y$ and $f(X)$ and to penalize errors in predicting $Y$. Examples:
- squared error loss (used in mean regression): $L(Y, f(X)) = (Y - f(X))^2$
- absolute error loss (used in median regression): $L(Y, f(X)) = |Y - f(X)|$
- 0-1 loss (used in binary classification): $L(Y, f(X)) = I(Y \neq f(X))$
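As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of these three losses:

```python
import numpy as np

def squared_error(y, fx):
    """Squared error loss (Y - f(X))^2, used in mean regression."""
    return (np.asarray(y) - np.asarray(fx)) ** 2

def absolute_error(y, fx):
    """Absolute error loss |Y - f(X)|, used in median regression."""
    return np.abs(np.asarray(y) - np.asarray(fx))

def zero_one(y, fx):
    """0-1 loss I(Y != f(X)), used in binary classification."""
    return np.asarray(np.asarray(y) != np.asarray(fx), dtype=float)
```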

Risk, Expected Prediction Error (EPE)

The risk of $f$ is
$$R(f) = E_{X,Y} L(Y, f(X)) = \int L(y, f(x)) \, dP(x, y).$$
In machine learning, $R(f)$ is called the Expected Prediction Error (EPE).
- For squared error loss, $R(f) = EPE(f) = E_{X,Y} [Y - f(X)]^2$.
- For 0-1 loss, $R(f) = EPE(f) = E_{X,Y} I(Y \neq f(X)) = \Pr(Y \neq f(X))$, the probability of making a mistake.
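Since the risk is a population average, it can be approximated by Monte Carlo whenever we can simulate from $P(X, Y)$. A sketch under an assumed model $Y = \sin(X) + \varepsilon$ (both the model and the candidate learner are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # An assumed candidate learner, for illustration only.
    return 0.9 * np.sin(x)

# Draw a large sample from the assumed P(X, Y); the empirical average of the
# loss approximates R(f) = E L(Y, f(X)) by the law of large numbers.
n = 100_000
x = rng.uniform(0, 2 * np.pi, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

epe = np.mean((y - f(x)) ** 2)
print(f"Monte Carlo EPE under squared error loss: {epe:.4f}")
```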

Optimal Learner

We seek the $f$ that minimizes the average (long-term) loss over the entire population.
- Optimal Learner: the minimizer of $R(f)$,
$$f^* = \arg\min_f R(f).$$
- The solution $f^*$ depends on the loss $L$.
- $f^*$ is not achievable without knowing $P(x, y)$, i.e., the entire population.
- Bayes Risk: the optimal risk, i.e., the risk of the optimal learner, $R(f^*)$; under squared error loss, $R(f^*) = E_{X,Y} (Y - f^*(X))^2$.

How to Find the Optimal Learner

Note that
$$R(f) = E_{X,Y} L(Y, f(X)) = \int L(y, f(x)) \, dP(x, y) = E_X \left[ E_{Y|X} L(Y, f(X)) \right].$$
One can therefore minimize the risk pointwise: for each fixed $x$, find $f^*(x)$ by solving
$$f^*(x) = \arg\min_c E_{Y|X=x} L(Y, c).$$

Example 1: Regression with Squared Error Loss

For regression, consider $L(Y, f(X)) = (Y - f(X))^2$. The risk is
$$R(f) = E_{X,Y} (Y - f(X))^2 = E_X E_{Y|X} \left[ (Y - f(X))^2 \mid X \right].$$
For each fixed $x$, the best learner solves
$$\min_c E_{Y|X} \left[ (Y - c)^2 \mid X = x \right].$$
The solution is $E(Y \mid X = x)$, so
$$f^*(x) = E(Y \mid X = x) = \int y \, dP(y \mid x).$$
Thus the optimal learner is $f^*(X) = E(Y \mid X)$.
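A quick numeric check of this fact (the conditional distribution below is assumed purely for illustration): among all constants $c$, the conditional mean minimizes $E[(Y - c)^2 \mid X = x]$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose Y | X = x ~ N(2.0, 1), i.e., E(Y | X = x) = 2.0 (an assumed law).
mu_x = 2.0
y = rng.normal(loc=mu_x, scale=1.0, size=200_000)

# Scan candidate constants c and compare the Monte Carlo risks.
cs = np.linspace(0.0, 4.0, 401)
risks = [np.mean((y - c) ** 2) for c in cs]
best_c = cs[int(np.argmin(risks))]
print(f"arg min_c E[(Y - c)^2 | X = x] is about {best_c:.2f} (conditional mean = {mu_x})")
```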

Decomposition of EPE Under Squared Error Loss

For any learner $f$, its EPE under squared error loss decomposes as
$$\begin{aligned}
EPE(f) &= E_{X,Y} [Y - f(X)]^2 \\
&= E_X E_{Y|X} \left( [Y - f(X)]^2 \mid X \right) \\
&= E_X E_{Y|X} \left( [Y - E(Y|X) + E(Y|X) - f(X)]^2 \mid X \right) \\
&= E_X E_{Y|X} \left( [Y - E(Y|X)]^2 \mid X \right) + E_X [f(X) - E(Y|X)]^2 \\
&= E_{X,Y} [Y - f^*(X)]^2 + E_X [f(X) - f^*(X)]^2 \\
&= EPE(f^*) + \text{MSE} = \text{Bayes Risk} + \text{MSE},
\end{aligned}$$
where the cross term vanishes because $E_{Y|X} [Y - E(Y|X) \mid X] = 0$. The Bayes risk (the gold standard) gives a lower bound on the risk of any learner.
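The decomposition can be verified numerically. A sketch under an assumed model $X \sim U(0,1)$, $Y = X^2 + \varepsilon$ with $\varepsilon \sim N(0, 0.25)$, so that $f^*(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x = rng.uniform(size=n)
y = x ** 2 + rng.normal(scale=0.5, size=n)

f = lambda t: 0.5 * t          # an arbitrary, suboptimal learner
f_star = lambda t: t ** 2      # the optimal learner E(Y | X)

epe = np.mean((y - f(x)) ** 2)            # EPE(f)
bayes = np.mean((y - f_star(x)) ** 2)     # Bayes risk, ~ Var(eps) = 0.25
mse = np.mean((f(x) - f_star(x)) ** 2)    # E_X [f(X) - f*(X)]^2
print(f"EPE = {epe:.4f}  vs  Bayes risk + MSE = {bayes + mse:.4f}")
```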

Example 2: Classification with 0-1 Loss

In binary classification, assume $Y \in \{0, 1\}$, so the learning rule is $f: \mathbb{R}^d \to \{0, 1\}$. Using the 0-1 loss $L(Y, f(X)) = I(Y \neq f(X))$, the expected prediction error is
$$\begin{aligned}
EPE(f) &= E\, I(Y \neq f(X)) = \Pr(Y \neq f(X)) \\
&= E_X \left[ P(Y = 1 \mid X) I(f(X) = 0) + P(Y = 0 \mid X) I(f(X) = 1) \right].
\end{aligned}$$

Bayes Classification Rule for Binary Problems

For each fixed $x$, the minimizer of $EPE(f)$ is given by
$$f^*(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x), \\ 0 & \text{if } P(Y = 1 \mid X = x) < P(Y = 0 \mid X = x). \end{cases}$$
This is called the Bayes classifier, denoted by $\phi_B$. The Bayes rule can also be written as
$$\phi_B(x) = f^*(x) = I[P(Y = 1 \mid X = x) > 0.5].$$
To obtain the Bayes rule, we need to know $P(Y = 1 \mid X = x)$, or at least whether $P(Y = 1 \mid X = x)$ is larger than 0.5 or not.
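A minimal sketch of the Bayes classifier when $P(Y = 1 \mid X = x)$ is known; the particular conditional probability function below is assumed for illustration:

```python
import numpy as np

def eta(x):
    """An assumed P(Y = 1 | X = x): a logistic curve crossing 0.5 at x = 0.5."""
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))

def bayes_rule(x):
    """phi_B(x) = I[P(Y = 1 | X = x) > 0.5]."""
    return (eta(np.asarray(x)) > 0.5).astype(int)

x = np.linspace(0, 1, 5)
print(bayes_rule(x))   # labels flip from 0 to 1 where eta(x) crosses 0.5
```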

Probability Framework

Assume the training set $(X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} P(X, Y)$. We aim to find the best function (optimal solution) from the model class $\mathcal{F}$ for decision making.
- The optimal learner $f^*$ is not attainable without knowledge of $P(x, y)$.
- The training samples $(x_1, y_1), \ldots, (x_n, y_n)$ generated from $P(x, y)$ carry information about $P(x, y)$.
- We therefore approximate the integral over $P(x, y)$ by the average over the training samples.

Empirical Risk Minimization

Key idea: we cannot compute $R(f) = E_{X,Y} L(Y, f(X))$, but we can compute its empirical version from the data.
Empirical Risk (also known as the training error):
$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
A reasonable estimator $\hat{f}$ should be close to $f^*$, or converge to $f^*$ as more data are collected. By the law of large numbers,
$$R_{emp}(f) \to EPE(f) \quad \text{as } n \to \infty.$$

Learning Based on ERM

We construct the estimator of $f^*$ by
$$\hat{f} = \arg\min_f R_{emp}(f).$$
This principle is called Empirical Risk Minimization (ERM). For regression with squared error loss, the empirical risk is known (up to the $1/n$ factor, which does not change the minimizer) as the Residual Sum of Squares (RSS):
$$R_{emp}(f) = RSS(f) = \sum_{i=1}^n [y_i - f(x_i)]^2.$$
ERM then yields the least squares estimator:
$$\hat{f}_{ols} = \arg\min_f RSS(f) = \arg\min_f \sum_{i=1}^n [y_i - f(x_i)]^2.$$
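For the linear class $f(x) = \beta_0 + \beta_1 x$, minimizing RSS is exactly ordinary least squares. An illustrative sketch with an assumed data-generating model:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)   # assumed true model

X = np.column_stack([np.ones(n), x])                # design matrix [1, x]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # minimizes sum (y_i - f(x_i))^2
rss = np.sum((y - X @ beta_hat) ** 2)
print(f"beta_hat = {np.round(beta_hat, 3)},  RSS = {rss:.3f}")
```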

Issues with ERM

Minimizing $R_{emp}(f)$ over all functions is ill-posed: any function passing through all of the points $(x_i, y_i)$ has $RSS = 0$. Example:
$$f_1(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i = 1, \ldots, n, \\ 1 & \text{otherwise.} \end{cases}$$
It is easy to check that $RSS(f_1) = 0$. However, this does not amount to any form of learning: for future points not equal to any $x_i$, it always predicts $y = 1$. This phenomenon is called overfitting. Overfitting is undesirable, since a small $R_{emp}(f)$ does not imply a small $R(f) = \int L(y, f(x)) \, dP(x, y)$.
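The $f_1$ example in code (the data-generating model is assumed): the training error is exactly zero, yet the test error is large.

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = rng.uniform(0, 1, size=20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=20)

def f1(x):
    """Memorizes the training points exactly; predicts 1 everywhere else."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.ones_like(x)
    for xi, yi in zip(x_train, y_train):
        out[np.isclose(x, xi)] = yi
    return out

print("training RSS:", np.sum((y_train - f1(x_train)) ** 2))   # exactly 0
x_test = rng.uniform(0, 1, size=10_000)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.1, size=10_000)
print("test EPE   :", np.mean((y_test - f1(x_test)) ** 2))     # large
```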

Class of Restricted Estimators

We need to restrict the estimation process to a set of functions $\mathcal{F}$, to control the complexity of the functions and hence avoid overfitting. Two strategies:
- Choose a simple model class $\mathcal{F}$:
$$\hat{f}_{ols} = \arg\min_{f \in \mathcal{F}} RSS(f) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^n [y_i - f(x_i)]^2.$$
- Choose a large model class $\mathcal{F}$, but use a roughness penalty to control model complexity: find $f \in \mathcal{F}$ to minimize the penalized RSS,
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \left\{ RSS(f) + \lambda J(f) \right\}.$$
The solution is called a regularized empirical risk minimizer.
How should we choose the model class $\mathcal{F}$: a simple one or a complex one?
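One familiar instance of the penalized strategy (used here for illustration; the slides do not single it out) is ridge regression, where $J(f) = \|\beta\|_2^2$ for linear $f$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 0.5]          # assumed truth, for illustration
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed form: beta = (X'X + lam I)^{-1} X'y minimizes ||y - Xb||^2 + lam ||b||^2.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.round(beta_ridge, 2))
```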

Parametric Learners

Some model classes can be parametrized as
$$\mathcal{F} = \{ f(X; \alpha) : \alpha \in \Lambda \},$$
where $\alpha$ is the index parameter of the class.
Parametric Model Class: the form of $f$ is known except for finitely many unknown parameters.
- Linear models (linear regression, linear SVM, LDA):
$$f(X; \beta) = \beta_0 + X^\top \beta_1, \quad \beta_1 \in \mathbb{R}^d,$$
$$f(X; \beta) = \beta_0 + \beta_1 \sin(X_1) + \beta_2 X_2^2, \quad X \in \mathbb{R}^2.$$
- Nonlinear models:
$$f(x; \beta) = \beta_1 x^{\beta_2},$$
$$f(X; \beta) = \beta_1 + \beta_2 \exp(\beta_3 X^{\beta_4}).$$

Advantages of Linear Models

Why are linear models in wide use?
- If the linear assumptions are correct, they are more efficient than nonparametric fits.
- They are convenient and easy to fit.
- They are easy to interpret.
- They provide a first-order Taylor approximation to $f(X)$.

Example: in linear regression, take $\mathcal{F} = \{\text{all linear functions of } x\}$. ERM is then the same as solving the ordinary least squares problem,
$$(\hat{\beta}_0^{ols}, \hat{\beta}^{ols}) = \arg\min_{(\beta_0, \beta)} \sum_{i=1}^n [y_i - \beta_0 - x_i^\top \beta]^2.$$

Drawbacks of Parametric Models

- We never know whether the linear assumptions are appropriate.
- Not flexible in practice: a parametric fit has all orders of derivatives and the same functional form everywhere.
- Individual observations can have a large influence on remote parts of the fitted curve.
- The polynomial degree cannot be controlled continuously.

More Flexible Learners

Nonparametric Model Class $\mathcal{F}$: the functional form of $f$ is unspecified, and the parameter set may be infinite-dimensional. Examples:
- $\mathcal{F} = \{\text{all continuous functions of } x\}$.
- $\mathcal{F}$ = the second-order Sobolev space over $[0, 1]$.
- $\mathcal{F}$ is the space spanned by a set of basis functions (dictionary functions):
$$\mathcal{F} = \left\{ f : f_\theta(x) = \sum_{m=1}^\infty \theta_m h_m(x) \right\}.$$
- Kernel methods and local regression: $\mathcal{F}$ contains all piecewise constant functions, or all piecewise linear functions.
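In practice one truncates the dictionary to finitely many terms and fits the coefficients by least squares. A sketch with an assumed polynomial dictionary $h_m(x) = x^m$ and an assumed target function:

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 100, 5
x = rng.uniform(-1, 1, size=n)
y = np.cos(2 * x) + rng.normal(scale=0.1, size=n)   # assumed target

H = np.vander(x, M + 1, increasing=True)        # columns h_m(x) = x^m, m = 0..M
theta, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit f_theta = sum_m theta_m h_m
print(np.round(theta, 2))
```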

Examples of Nonparametric Methods

- For regression problems: splines, wavelets, kernel estimators, locally weighted polynomial regression, GAM, projection pursuit, regression trees, k-nearest neighbors.
- For classification problems: kernel support vector machines, k-nearest neighbors, classification trees, ...
- For density estimation: histograms, kernel density estimation.

Semiparametric Models

A compromise between linear models and nonparametric models, e.g., partially linear models.

Nonparametric Models

Motivation: the underlying regression function is so complicated that no reasonable parametric model would be adequate.
Advantages:
- Do NOT assume any specific functional form: assume less, thus discover more.
- Infinitely many parameters, not no parameters.
- Local: rely more on the neighboring observations.
- Outliers and influential points do not affect the results away from their neighborhood.
- Let the data speak for themselves and show us the appropriate functional form!
Price we pay: more computational effort, lower efficiency, and a slower convergence rate (if linear models happen to be reasonable).

Nonparametric vs Parametric Models

- They are not mutually exclusive competitors.
- A nonparametric regression estimate may suggest a simple parametric model.
- In practice, parametric models are preferred for simplicity.
- A linear regression line is infinitely smooth.

fit method      # parameters   estimate   outliers
parametric      finite         global     sensitive
nonparametric   infinite       local      robust
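The table's global-vs-local contrast in a small sketch (the data and tuning choices are assumed): an OLS line uses all points to fit two global parameters, while a k-nearest-neighbor estimate averages locally.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

# Parametric: least squares line (finitely many parameters, global fit).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Nonparametric: k-NN regression (local averaging).
def knn_predict(x0, k=15):
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

x0 = 0.25
print("linear fit at x0:", np.array([1.0, x0]) @ beta)
print("k-NN fit at x0  :", knn_predict(x0))
print("truth at x0     :", np.sin(2 * np.pi * x0))
```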

[Figure: four fits to the same scatterplot of y versus x on [0, 1]: a linear fit, a 2nd-order polynomial fit, a 3rd-order polynomial fit, and a smoothing spline (lambda = 0.85).]

Interpretation of Risk

Let $\mathcal{F}$ denote the restricted model space.
- Denote by $f^* = \arg\min_f EPE(f)$ the optimal learner, where the minimization is taken over all measurable functions.
- Denote by $\tilde{f} = \arg\min_{f \in \mathcal{F}} EPE(f)$ the best learner in the restricted space $\mathcal{F}$.
- Denote by $\hat{f} = \arg\min_{f \in \mathcal{F}} R_{emp}(f)$ the learner based on limited data in the restricted space $\mathcal{F}$.
Then
$$EPE(\hat{f}) - EPE(f^*) = [EPE(\hat{f}) - EPE(\tilde{f})] + [EPE(\tilde{f}) - EPE(f^*)] = \text{estimation error} + \text{approximation error}.$$
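A Monte Carlo sketch of the two gaps under an assumed model ($Y = \sin(2\pi X) + \varepsilon$, restricted class $\mathcal{F}$ = linear functions, so $EPE(f^*) = \sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(8)
sigma = 0.3
f_star = lambda t: np.sin(2 * np.pi * t)       # optimal learner E(Y | X)

def epe(pred, m=400_000):
    """EPE(f) by Monte Carlo under the assumed model."""
    x = rng.uniform(size=m)
    y = f_star(x) + rng.normal(scale=sigma, size=m)
    return np.mean((y - pred(x)) ** 2)

# f_tilde: best linear learner in F, approximated by huge-sample least squares.
xb = rng.uniform(size=200_000)
B = np.column_stack([np.ones_like(xb), xb])
b_tilde, *_ = np.linalg.lstsq(B, f_star(xb), rcond=None)
f_tilde = lambda t: b_tilde[0] + b_tilde[1] * t

# f_hat: ERM over F using only n = 30 training points.
x_tr = rng.uniform(size=30)
y_tr = f_star(x_tr) + rng.normal(scale=sigma, size=30)
A = np.column_stack([np.ones_like(x_tr), x_tr])
b_hat, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
f_hat = lambda t: b_hat[0] + b_hat[1] * t

est_err = epe(f_hat) - epe(f_tilde)     # estimation error: limited data
app_err = epe(f_tilde) - sigma ** 2     # approximation error: F too small
print(f"estimation error ~ {est_err:.4f},  approximation error ~ {app_err:.4f}")
```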

Estimation Error and Approximation Error

- The gap $EPE(\hat{f}) - EPE(\tilde{f})$ is called the estimation error; it is due to the limited training data.
- The gap $EPE(\tilde{f}) - EPE(f^*)$ is called the approximation error; it is incurred by approximating $f^*$ with a restricted model space or via regularization, since it is possible that $f^* \notin \mathcal{F}$.
This decomposition is similar to the bias-variance trade-off, with
- the estimation error playing the role of the variance,
- the approximation error playing the role of the bias.