Recap from previous lecture

Learning is using past experience to improve future performance. Different types of learning: supervised, unsupervised, reinforcement, active, online, ... For a machine, experience is encoded in data.

Supervised learning: example

Goal: teach a machine to distinguish handwritten 4's from handwritten 9's.

Desired outcome: a rule (function) for deciding whether a given $10 \times 10$ pattern of black-and-white pixels is a 4 or a 9:
$$x \in \{0,1\}^{100} \;\mapsto\; \hat{y} \in \{-1,+1\}$$

How to get there? Assuming that the training data are representative of a stable probabilistic relationship between the input ($x$) and the output ($y$), do what you would do if you actually knew this relationship and could find an optimal predictor.

Figure taken from G. Shakhnarovich.

Statistical decision theory: general framework

Given: an input (or feature) $X$, an output (or label) $Y$, and their joint probability distribution $P_{XY}$.

Goal: find a function $f$ that predicts $Y$ given the value of $X$.

Evaluating predictors: the loss function $\ell(Y, f(X))$ is the loss (or penalty) for predicting the true $Y$ with $\hat{Y} = f(X)$.

Expected prediction error of a candidate predictor $f$:
$$\mathcal{E}(f) = \mathbb{E}[\ell(Y, f(X))] = \int \ell(y, f(x))\, P_{XY}(dx, dy)$$

Optimal predictor: $f^* = \arg\min_f \mathcal{E}(f)$, with $\mathcal{E}^* \triangleq \mathcal{E}(f^*)$.

Statistical decision theory: classification

Given: a feature $X \in \mathbb{R}^p$, a label $Y \in \{0, 1\}$, and their joint probability distribution $P_{XY}$.

Goal: find a function $f : \mathbb{R}^p \to \{0, 1\}$ that accurately predicts the label $Y$ given the feature $X$.

Evaluating predictors: the loss function $\ell(Y, f(X))$ is the loss (or penalty) for predicting the true $Y$ with $\hat{Y} = f(X)$. For example, use the 0-1 loss: $\ell(Y, f(X)) = \mathbf{1}\{Y \neq f(X)\}$.

Expected prediction error of a candidate predictor $f$:
$$\mathcal{E}(f) = \mathbb{P}[Y \neq f(X)] = \int \mathbf{1}\{y \neq f(x)\}\, P_{XY}(dx, dy)$$

Minimizing expected classification error

How to pick $f$ to minimize $\mathcal{E}(f)$? Conditioning:
$$\mathcal{E}(f) = \int \mathbf{1}\{y \neq f(x)\}\, P_{XY}(dx, dy)
= \int \left( \int \mathbf{1}\{y \neq f(x)\}\, P_{Y|X}(dy \mid x) \right) P_X(dx)$$
$$= 1 - \int \left( \int \mathbf{1}\{y = f(x)\}\, P_{Y|X}(dy \mid x) \right) P_X(dx)$$
$$= 1 - \int \Big( \mathbb{P}[Y = 0 \mid X = x]\, \mathbf{1}\{f(x) = 0\} + \mathbb{P}[Y = 1 \mid X = x]\, \mathbf{1}\{f(x) = 1\} \Big)\, P_X(dx)$$

Maximize the inner integral for each fixed $x$ by setting
$$f^*(x) = \arg\max_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x].$$
This is known as the Bayes classifier: classify to the most probable label given the observation $X = x$. $\mathcal{E}(f^*)$ is known as the Bayes rate.

The Bayes classifier

Classify to the most probable label:
$$f^*(x) = \arg\max_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x]$$

Define the regression function $\eta(x) = \mathbb{P}[Y = 1 \mid X = x]$. This leads to the following:

Bayes classifier
$$f^*(x) = \begin{cases} 1, & \text{if } \eta(x) \ge 1/2 \\ 0, & \text{if } \eta(x) < 1/2 \end{cases}
\qquad
\mathcal{E}^* = \frac{1}{2} - \frac{1}{2} \int |1 - 2\eta(x)|\, P_X(dx)$$

[Figure: plot of $\eta(x)$ against $x$, with the decision threshold at $1/2$.]
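To make the rule concrete, here is a minimal numerical sketch (my own illustration, not from the lecture) of the Bayes classifier for a toy one-dimensional problem in which $P_{XY}$ is known, so $\eta(x)$ can be computed exactly by Bayes' rule; the Gaussian class-conditionals, equal priors, and sample size below are assumptions.

# Bayes classifier for a toy 1-D problem with known P_XY (illustrative sketch).
import numpy as np
from scipy.stats import norm

p1 = 0.5                                            # prior P[Y = 1]
f0 = lambda x: norm.pdf(x, loc=-1.0, scale=1.0)     # class-conditional density for Y = 0
f1 = lambda x: norm.pdf(x, loc=+1.0, scale=1.0)     # class-conditional density for Y = 1

def eta(x):
    """Regression function eta(x) = P[Y = 1 | X = x], via Bayes' rule."""
    return p1 * f1(x) / (p1 * f1(x) + (1 - p1) * f0(x))

def bayes_classifier(x):
    """f*(x) = 1 if eta(x) >= 1/2, else 0."""
    return (eta(x) >= 0.5).astype(int)

# Monte Carlo estimate of the Bayes rate E* = P[Y != f*(X)].
rng = np.random.default_rng(0)
n = 200_000
y = rng.binomial(1, p1, size=n)
x = np.where(y == 1, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))
print("estimated Bayes rate:", np.mean(bayes_classifier(x) != y))   # roughly 0.159 = Phi(-1)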

Statistical decision theory: regression

Given: an input variable $X \in \mathbb{R}^p$, an output variable $Y \in \mathbb{R}$, and their joint probability distribution $P_{XY}$.

Goal: find a function $f : \mathbb{R}^p \to \mathbb{R}$ that predicts $Y$ given the value of $X$.

Evaluating predictors: the loss function $\ell(Y, f(X))$ is the loss (or penalty) for predicting the true $Y$ with $\hat{Y} = f(X)$. For example, use the squared error loss: $\ell(Y, f(X)) = (Y - f(X))^2$.

Expected prediction error of a candidate predictor $f$:
$$\mathcal{E}(f) = \mathbb{E}[(Y - f(X))^2] = \int (y - f(x))^2\, P_{XY}(dx, dy)$$

Minimizing mean squared error

How to pick $f$ to minimize $\mathcal{E}(f)$? Conditioning:
$$\mathcal{E}(f) = \int (y - f(x))^2\, P_{XY}(dx, dy)
= \int \underbrace{\left( \int (y - f(x))^2\, P_{Y|X}(dy \mid x) \right)}_{\mathbb{E}[(Y - f(x))^2 \mid X = x]} P_X(dx)$$

Minimize the inner integral for each fixed $x$ by setting
$$f^*(x) = \arg\min_c \mathbb{E}[(Y - c)^2 \mid X = x].$$
The solution is the regression function: $f^*(x) = \mathbb{E}[Y \mid X = x]$.
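For completeness, here is the one-line argument behind the inner minimization (a standard calculation, not spelled out on the slide): expand the square around the conditional mean; the cross term has zero conditional expectation, so
$$\mathbb{E}[(Y - c)^2 \mid X = x]
= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X = x])^2 \,\big|\, X = x\big]
+ \big(\mathbb{E}[Y \mid X = x] - c\big)^2,$$
which is minimized uniquely at $c = \mathbb{E}[Y \mid X = x]$.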

Proof that the regression function is optimal

For any predictor $f$, write
$$\mathcal{E}(f) = \mathbb{E}[(Y - f(X))^2] = \mathbb{E}[(Y - f^*(X) + f^*(X) - f(X))^2]$$
$$= \mathbb{E}[(Y - f^*(X))^2] + 2\, \mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))] + \mathbb{E}[(f(X) - f^*(X))^2]$$

We now show that the middle term is zero:
$$\mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))]
= \mathbb{E}\Big[ \mathbb{E}\big[(Y - f^*(X))(f^*(X) - f(X)) \,\big|\, X\big] \Big]
= \mathbb{E}\Big[ \mathbb{E}[Y - f^*(X) \mid X]\, (f^*(X) - f(X)) \Big]$$
$$= \mathbb{E}\Big[ \big(\underbrace{\mathbb{E}[Y \mid X]}_{f^*(X)} - f^*(X)\big)\, (f^*(X) - f(X)) \Big] = 0$$

Therefore, for any $f$ we have
$$\mathcal{E}(f) = \mathcal{E}(f^*) + \mathbb{E}[(f(X) - f^*(X))^2] \ge \mathcal{E}(f^*),$$
and the term $\mathbb{E}[(f(X) - f^*(X))^2] = 0$ if and only if $f = f^*$ ($P_X$-almost everywhere).

Restricting the predictor

Sometimes, even if we know the distribution $P_{XY}$, computing the optimal regression function $f^*$ may not be worth it. In such cases, we may want to settle for the best predictor in some restricted class. This is one way to introduce inductive bias.

Example: best linear predictor

Suppose we are only interested in predictors of the form
$$f(x) = \beta^T x = \sum_{i=1}^p \beta(i)\, x(i),$$
where $\beta \in \mathbb{R}^p$ is some fixed vector of weights. Then
$$\mathcal{E}(f) = \mathcal{E}(\beta) = \mathbb{E}[(Y - \beta^T X)^2] = \mathbb{E}[Y^2] - 2\beta^T \mathbb{E}[XY] + \mathbb{E}[(\beta^T X)^2].$$
Optimize by differentiating w.r.t. $\beta$ to get the optimal linear predictor:
$$\beta^* = (\mathbb{E}[XX^T])^{-1}\, \mathbb{E}[XY]$$
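A small numerical sketch (my own illustration, not from the slides): on a toy joint distribution, approximate the population moments $\mathbb{E}[XX^T]$ and $\mathbb{E}[XY]$ by averages over a very large sample and plug them into the formula for $\beta^*$. The particular data-generating choices below are assumptions.

# Best linear predictor beta* = (E[X X^T])^{-1} E[X Y], with moments
# approximated by sample averages over a large synthetic sample.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000_000, 3
X = rng.normal(size=(n, p))                 # features X in R^3
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)      # Y = beta^T X + noise

ExxT = X.T @ X / n                          # ~ E[X X^T]   (law of large numbers)
Exy = X.T @ Y / n                           # ~ E[X Y]

beta_star = np.linalg.solve(ExxT, Exy)      # beta* = (E[X X^T])^{-1} E[X Y]
print(beta_star)                            # close to [2, -1, 0.5]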

Decomposition of the expected prediction error

What happens when we use restricted predictors? Let $\mathcal{F}$ be a restricted class of candidate predictors (e.g., linear or quadratic ones), and define the best prediction error over $\mathcal{F}$:
$$\mathcal{E}_{\mathcal{F}} = \min_{f \in \mathcal{F}} \mathcal{E}(f).$$
Then for any $f \in \mathcal{F}$ we have
$$\mathcal{E}(f) - \mathcal{E}^* = \underbrace{\big( \mathcal{E}(f) - \mathcal{E}_{\mathcal{F}} \big)}_{= T_1} + \underbrace{\big( \mathcal{E}_{\mathcal{F}} - \mathcal{E}^* \big)}_{= T_2}$$
where $T_1$ can be set to zero by letting $f$ achieve the minimum $\min_{f \in \mathcal{F}} \mathcal{E}(f)$, and $T_2$ is large or small depending on how well the predictors in $\mathcal{F}$ can do.

Statistical learning: unknown $P_{XY}$

To compute the best predictor $f^*$, you need to know the underlying distribution $P_{XY}$! What should we do when we don't know $P_{XY}$?

Empirical risk minimization

If we can draw a large number $n$ of independent samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ from $P_{XY}$, then we can find the $f$ that minimizes the empirical prediction error
$$\hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i))$$

Why does it work? Without getting into details, the reason is the Law of Large Numbers: for any $f$,
$$\hat{\mathcal{E}}(f) \approx \mathcal{E}(f) \quad \text{with high probability.}$$
It can then be proved that if $\hat{f}$ minimizes the empirical prediction error, then
$$\mathcal{E}(\hat{f}) \approx \mathcal{E}(f^*) \quad \text{with high probability,}$$
where $f^*$ is the true optimal predictor (for $P_{XY}$).
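A tiny sketch of empirical risk minimization over a finite class (my own toy illustration; the threshold classifiers, label noise, and grid below are assumptions): with 0-1 loss, simply pick the candidate with the smallest average loss on the sample.

# ERM over a finite class of threshold classifiers f_t(x) = 1{x >= t}.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(-2, 2, size=n)
Y = (X > 0.3).astype(int)                       # true labels: threshold at 0.3
Y = np.where(rng.random(n) < 0.1, 1 - Y, Y)     # flip 10% of labels (noise)

thresholds = np.linspace(-2, 2, 81)             # finite candidate class F

def emp_error(t):
    return np.mean((X >= t).astype(int) != Y)   # empirical 0-1 prediction error

t_hat = thresholds[np.argmin([emp_error(t) for t in thresholds])]
print("ERM threshold:", t_hat)                  # close to the true threshold 0.3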

Example: least-squares linear regression

Given training data $(X_1, Y_1), \ldots, (X_n, Y_n)$, find $\hat{\beta} \in \mathbb{R}^p$ to minimize
$$\hat{\mathcal{E}}(\beta) = \frac{1}{n} \sum_{i=1}^n (Y_i - \beta^T X_i)^2$$

Least-squares linear regression

Define the $n \times p$ design matrix $\mathbf{X} = [X_{ij}]$, where $X_{ij}$ is the $j$th component of $X_i$, and the response vector $\mathbf{Y} = [Y_1\; Y_2\; \ldots\; Y_n]^T$. Then
$$\hat{\mathcal{E}}(\beta) = \frac{1}{n} (\mathbf{Y} - \mathbf{X}\beta)^T (\mathbf{Y} - \mathbf{X}\beta)$$
Differentiate to get the normal equations:
$$\mathbf{X}^T (\mathbf{Y} - \mathbf{X}\beta) = 0$$
If $\mathbf{X}^T \mathbf{X}$ is invertible, solve for the optimum:
$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
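A minimal numpy sketch of the closed-form estimator above on synthetic data (the design and noise level are assumptions). In practice a least-squares solver is preferred over forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly; both are shown.

# Least-squares linear regression via the normal equations.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))                       # n x p design matrix
beta_true = np.array([1.0, 0.0, -2.0, 0.5])
Y = X @ beta_true + 0.3 * rng.normal(size=n)

# Normal equations X^T (Y - X beta) = 0  =>  beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically safer equivalent:
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
print(beta_lstsq)                                 # both close to beta_true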

Classification using linear regression

If we have training data $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $X_i \in \mathbb{R}^p$ are features and $Y_i \in \{0, 1\}$ are binary labels, we can attempt to find the best linear classifier.

Classification via regression

Find the least-squares linear regressor:
$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$
Use it to build the classifier
$$\hat{y}(x) = \begin{cases} 1, & \text{if } \hat{\beta}^T x \ge 1/2 \\ 0, & \text{if } \hat{\beta}^T x < 1/2 \end{cases}$$

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.

This will work well if the two classes are linearly separable (or almost so).
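A sketch of the "regress on 0/1 labels, then threshold at 1/2" rule on synthetic two-class data (the data-generating choices are mine, and an intercept column is added so the linear score is affine in $x$).

# Classification via least-squares regression on 0/1 labels.
import numpy as np

rng = np.random.default_rng(3)
n = 400
Y = rng.integers(0, 2, size=n)                          # labels in {0, 1}
X = rng.normal(size=(n, 2)) + 2.0 * Y[:, None]          # class 1 shifted by (2, 2)
Xa = np.c_[np.ones(n), X]                               # add intercept column

beta_hat = np.linalg.lstsq(Xa, Y.astype(float), rcond=None)[0]

def classify(x):                                        # x: array of shape (m, 2)
    scores = np.c_[np.ones(len(x)), x] @ beta_hat
    return (scores >= 0.5).astype(int)                  # threshold fitted value at 1/2

print("training error:", np.mean(classify(X) != Y))     # small: classes nearly separable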

Nearest-neighbor methods

Nearest-neighbor methods use the training samples closest to the test point $x$ to form $\hat{Y}$. These methods work well if the regression function is continuous.

k-nearest-neighbor regression:
$$\hat{Y}(x) = \frac{1}{k} \sum_{X_i \in N_k(x)} Y_i$$

k-NN classification:
$$\hat{Y}(x) = \begin{cases} 1, & \text{if } \sum_{X_i \in N_k(x)} Y_i \ge k/2 \\ 0, & \text{if } \sum_{X_i \in N_k(x)} Y_i < k/2 \end{cases}$$

Here $N_k(x)$ is the set of $k$ nearest neighbors of $x$ in the training sample.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
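A plain-numpy sketch of the two rules above (no library k-NN routine is used; the toy data, distance choice, and k are assumptions): $N_k(x)$ is found by sorting Euclidean distances to the training points.

# k-NN regression and classification from scratch.
import numpy as np

def knn_average(x, X_train, Y_train, k):
    """Average of Y over the k nearest training points to x (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]                     # indices of N_k(x)
    return Y_train[idx].mean()

rng = np.random.default_rng(4)
X_train = rng.normal(size=(300, 2))
Y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

x0 = np.array([0.5, 0.5])
avg = knn_average(x0, X_train, Y_train, k=15)
# Regression estimate is the local average itself; classification thresholds it
# at 1/2, i.e., majority vote among the k neighbors.
print("regression estimate:", avg, " classification:", int(avg >= 0.5))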

Issues and pitfalls

Curse of dimensionality: local methods (such as k-nearest-neighbors) require fairly dense training samples, which are hard to obtain when the problem dimension is very high.

Bias-variance trade-off: using more complex models may lead to overfitting; using models that are overly simple may lead to underfitting. In both cases, we will have poor generalization.

The curse of dimensionality

For nearest-neighbor methods, we need to capture a fixed fraction of the data (say, 10%) to form local averages. In high dimensions, the data will be spread very thinly, and we will need to use almost the entire sample, which conflicts with the local nature of the method.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
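A one-formula illustration of the point above (the specific numbers are my own): for data uniform on the unit cube $[0,1]^p$, a subcube that captures a fraction $r$ of the data must have edge length $r^{1/p}$, which rapidly approaches 1 as $p$ grows, so the "local" neighborhood stops being local.

# Edge length of a subcube of [0,1]^p needed to capture a fraction r of
# uniformly distributed data: e_p(r) = r**(1/p).
r = 0.10
for p in (1, 2, 10, 100):
    print(p, round(r ** (1 / p), 3))
# p=1: 0.1,  p=2: 0.316,  p=10: 0.794,  p=100: 0.977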

Bias and variance

Suppose the true model is $Y = f^*(X) + \epsilon$, where $\epsilon$ is noise that is zero-mean ($\mathbb{E}\,\epsilon = 0$) and independent of $X$. Let's see how well a predictor $\hat{f}$ learned from a training sample $T = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ does on an independent test sample $(X, Y)$:
$$\mathbb{E}_T[(Y - \hat{f}(X))^2] = \mathbb{E}_T[(Y - \mathbb{E}_T \hat{f}(X) + \mathbb{E}_T \hat{f}(X) - \hat{f}(X))^2]$$
$$= \mathbb{E}_T[(Y - \mathbb{E}_T \hat{f}(X))^2] + 2\, \underbrace{\mathbb{E}_T[(Y - \mathbb{E}_T \hat{f}(X))(\mathbb{E}_T \hat{f}(X) - \hat{f}(X))]}_{=0} + \mathbb{E}_T[(\hat{f}(X) - \mathbb{E}_T \hat{f}(X))^2]$$
$$= [Y - \mathbb{E}_T \hat{f}(X)]^2 + \mathbb{E}_T\big[(\hat{f}(X) - \mathbb{E}_T \hat{f}(X))^2\big]$$
$$= O(\epsilon) + \underbrace{[f^*(X) - \mathbb{E}_T \hat{f}(X)]^2}_{= \text{Bias}^2} + \underbrace{\mathbb{E}_T\big[(\hat{f}(X) - \mathbb{E}_T \hat{f}(X))^2\big]}_{= \text{Variance}}$$

Bias: how well $\hat{f}$ approximates the true regression function $f^*$. Variance: how much $\hat{f}(X)$ wiggles around its mean.
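A Monte Carlo sketch of the decomposition at a fixed test point $x_0$ (the true function, estimator, and all constants below are my own assumptions): fit a degree-3 polynomial by least squares on many independent training sets and estimate the Bias$^2$ and Variance of $\hat{f}(x_0)$ empirically.

# Empirical Bias^2 and Variance of a polynomial least-squares fit at x0.
import numpy as np

rng = np.random.default_rng(5)
f_star = lambda x: np.sin(2 * np.pi * x)        # true regression function (assumed)
x0, n, sigma, degree = 0.3, 30, 0.3, 3          # test point, sample size, noise, model degree

preds = []
for _ in range(2000):                           # 2000 independent training sets T
    X = rng.uniform(0, 1, n)
    Y = f_star(X) + sigma * rng.normal(size=n)
    coef = np.polyfit(X, Y, degree)             # degree-3 polynomial least squares
    preds.append(np.polyval(coef, x0))          # f_hat(x0) for this training set

preds = np.array(preds)
bias2 = (f_star(x0) - preds.mean()) ** 2        # [f*(x0) - E_T f_hat(x0)]^2
var = preds.var()                               # E_T[(f_hat(x0) - E_T f_hat(x0))^2]
print("Bias^2:", bias2, "Variance:", var)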

Bias-variance trade-off

There is a trade-off between the bias and the variance: complex predictors will generally have low bias and high variance, while simple predictors will generally have high bias and low variance.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.

Regularization

One way to mitigate the bias-variance trade-off is to explicitly penalize predictors that are too complex. Instead of finding the minimizer of the empirical prediction error
$$\hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)),$$
minimize the complexity-regularized empirical prediction error
$$\hat{\mathcal{E}}_{\mathrm{cr}}(f) = \hat{\mathcal{E}}(f) + \lambda J(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)) + \lambda J(f),$$
where the regularizer $J(f)$ is high for complicated predictors $f$ and low for simple ones, and the regularization constant $\lambda > 0$ controls the balance between the data-fit term $\hat{\mathcal{E}}(f)$ and the complexity term $J(f)$.
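As a concrete sketch of this template (ridge regression, my own example and not one of the penalties listed on the next slide): take squared-error loss, linear predictors, and $J(\beta) = \|\beta\|_2^2$; the regularized objective then has a closed-form minimizer.

# Complexity-regularized ERM with J(beta) = ||beta||^2 (ridge regression).
import numpy as np

def ridge(X, Y, lam):
    """Minimize (1/n)||Y - X beta||^2 + lam * ||beta||^2  (closed form)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)

rng = np.random.default_rng(6)
n, p = 50, 20                                   # few samples relative to the dimension
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = (3.0, -2.0, 1.0)
Y = X @ beta_true + rng.normal(size=n)

# Larger lambda shrinks the fitted coefficients toward zero (simpler predictor);
# lam = 0 recovers ordinary least squares when X^T X is invertible.
for lam in (0.0, 0.1, 10.0):
    print(lam, round(float(np.linalg.norm(ridge(X, Y, lam))), 2))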

Examples of regularization

Roughness penalty: $\mathcal{F}$ is the class of twice-differentiable functions, and
$$J(f) = \int f''(x)^2\, dx.$$
This penalizes functions that wiggle around too much. The complexity-regularized solution is the smoothing spline.

Subset selection: $\mathcal{F}$ is the class of linear regression functions $f(x) = \beta^T x$, and
$$J(\beta) = \|\beta\|_0 = \text{the number of nonzero components of } \beta.$$
This is effective, but may be computationally expensive.

$\ell_1$-penalty: $\mathcal{F}$ is again the class of linear regression functions, and
$$J(\beta) = \|\beta\|_1 = \sum_{i=1}^p |\beta(i)|.$$
This often serves as a reasonable, tractable relaxation of the subset-selection penalty. This approach is known as the LASSO, and is very popular right now.
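A small self-contained sketch of the $\ell_1$-penalized least-squares problem, solved by proximal gradient descent (ISTA) with soft-thresholding. This is my own illustration of the idea, not the lecture's algorithm, and the data, penalty level, and iteration count are assumptions; in practice one would use a library solver.

# LASSO via ISTA: minimize (1/(2n))||Y - X beta||^2 + lam * ||beta||_1.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / n         # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)   # proximal step
    return beta

rng = np.random.default_rng(7)
n, p = 100, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = (3.0, -2.0, 1.0)
Y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_hat = lasso_ista(X, Y, lam=0.1)
print("nonzero coefficients:", np.flatnonzero(np.abs(beta_hat) > 1e-8))   # mostly {0, 1, 2}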

Next lecture: a sneak preview

Next time, we will focus on linear methods for classification: logistic regression, separating hyperplanes, and the Perceptron.