Linear regression COMS 4771

1. Old Faithful and prediction functions

Prediction problem: Old Faithful geyser (Yellowstone)

Task: Predict the time of the next eruption.

Statistical model for time between eruptions

[Figure: timeline of historical eruption records, with times a_i, b_i marked and the gaps between eruptions labeled Y_1, Y_2, Y_3, ...; a later time t, after the recorded data, is where the next prediction is made.]

Galton board iid model: Y_1, ..., Y_n, Y iid ∼ N(µ, σ²).
Data: Y_i := a_i − b_{i−1}, i = 1, ..., n.
(Admittedly not a great model, since durations are non-negative. Better: take Y_i from an exponential distribution.)

Task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On Old Faithful data:
Using 136 past observations, we form the estimate ˆµ = 70.7941.
Mean squared loss of ˆµ on the next 136 observations is 187.1894.
(Easier to interpret the square root, 13.6817, which has the same units as Y.)
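
A minimal numpy sketch of this baseline computation. The array below is a placeholder standing in for the real Old Faithful waiting times (the actual data, file format, and split are not part of the lecture text):

```python
import numpy as np

# Placeholder for the 272 Old Faithful inter-eruption waiting times (minutes),
# oldest first; with the real data the printed values are ~70.79, ~187.19, ~13.68.
rng = np.random.default_rng(0)
waiting = rng.normal(loc=71.0, scale=13.6, size=272)

train, test = waiting[:136], waiting[136:]      # past vs. future observations

mu_hat = train.mean()                           # MLE of mu under the iid normal model
mse = np.mean((test - mu_hat) ** 2)             # mean squared loss of the constant prediction
print(mu_hat, mse, np.sqrt(mse))
```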

Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[Scatter plot: time until next eruption (roughly 50-90 minutes) vs. duration of last eruption (roughly 1.5-5.5 minutes).]

Using side-information

At prediction time t, the duration of the last eruption is available as side-information.

[Timeline figure: recorded eruption times ..., a_{n−1}, b_{n−1}, a_n, b_n, ..., up to time t, with the last duration/gap pair labeled X_n, Y_n.]

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples); X takes values in 𝒳 (e.g., 𝒳 = R), and Y takes values in R.

1. We observe (X_1, Y_1), ..., (X_n, Y_n) and then choose a prediction function (a.k.a. predictor) ˆf : 𝒳 → R. This is called learning or training.
2. At prediction time, we observe X and form the prediction ˆf(X).
3. The outcome is Y, and the squared loss is (ˆf(X) − Y)².

How should we choose ˆf based on data?

2. Optimal predictors and linear regression models

Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
1. The marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. The conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.

This lecture: 𝒴 = R (regression problems).

Optimal predictor

What function f : 𝒳 → R has the smallest (squared loss) risk R(f) := E[(f(X) − Y)²]?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ E[(ŷ − Y)² | X = x] is the conditional mean E[Y | X = x]. (Recall the Galton board!)

Therefore, the function f* : 𝒳 → R with

  f*(x) = E[Y | X = x], x ∈ 𝒳,

has the smallest risk. f* is called the regression function or conditional mean function.
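
One line of algebra makes this explicit (a standard decomposition, not spelled out on the slide): completing the square around E[Y | X = x] gives

```latex
\mathbb{E}\!\left[(\hat{y} - Y)^2 \mid X = x\right]
  = \bigl(\hat{y} - \mathbb{E}[Y \mid X = x]\bigr)^2 + \operatorname{Var}(Y \mid X = x).
```

The second term does not depend on ŷ, so the conditional risk is minimized by choosing ŷ = E[Y | X = x].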

Linear regression models

When side-information is encoded as a vector of real numbers x = (x_1, ..., x_d) (the coordinates are called features or variables), it is common to use a linear regression model, such as the following:

  Y | X = x ∼ N(x^T β, σ²), x ∈ R^d.

Parameters: β = (β_1, ..., β_d) ∈ R^d and σ² > 0.
X = (X_1, ..., X_d) is a random vector (i.e., a vector of random variables).
The conditional distribution of Y given X is normal; the marginal distribution of X is not specified.

In this model, the regression function f* is the linear function f_β : R^d → R,

  f_β(x) = x^T β = Σ_{i=1}^d x_i β_i, x ∈ R^d.

(We'll often refer to f_β just by β.)

[Plot: an example of a linear regression function f* for d = 1, over x ∈ [−1, 1].]

Enhancing linear regression models

Linear functions might sound rather restricted, but they can actually be quite powerful if you are creative about the side-information.

Examples:
1. Non-linear transformation of an existing variable: for x ∈ R, φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x_1, ..., x_d) ∈ {0, 1}^d, φ(x) = (x_1 ∧ x_5 ∧ x_{10}) ∨ (¬x_2 ∧ x_7).
3. Trigonometric expansion: for x ∈ R, φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), ...).
4. Polynomial expansion: for x = (x_1, ..., x_d) ∈ R^d, φ(x) = (1, x_1, ..., x_d, x_1², ..., x_d², x_1x_2, ..., x_1x_d, ..., x_{d−1}x_d).

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome. A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

  φ(x) = (x_temp − 98.6)².

What if you didn't know about this magic constant 98.6? Instead, use

  φ(x) = (1, x_temp, x_temp²).

We can then learn coefficients β such that β^T φ(x) = (x_temp − 98.6)², or any other quadratic polynomial in x_temp (which may work even better!).

Quadratic expansion

A quadratic function f : R → R,

  f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R, can be written as a linear function of φ(x), where φ(x) := (1, x, x²), since

  f(x) = β^T φ(x) with β = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

  φ(x) := (1, x_1, ..., x_d [linear terms], x_1², ..., x_d² [squared terms], x_1x_2, ..., x_1x_d, ..., x_{d−1}x_d [cross terms]).
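
A small sketch of these feature maps in numpy (the function names are mine, not from the lecture):

```python
import numpy as np
from itertools import combinations

def quad_features_1d(x):
    """phi(x) = (1, x, x^2) for a scalar x."""
    return np.array([1.0, x, x * x])

def quad_features(x):
    """phi(x) = (1, linear terms, squared terms, cross terms) for a vector x."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], x, x ** 2, cross))

# f(x) = a x^2 + b x + c equals beta^T phi(x) with beta = (c, b, a).
a, b, c = 2.0, -1.0, 0.5
beta = np.array([c, b, a])
x0 = 3.0
print(beta @ quad_features_1d(x0), a * x0**2 + b * x0 + c)   # same value twice
print(quad_features([1.0, 2.0, 3.0]).shape)                  # 1 + d + d + d(d-1)/2 = 10 for d = 3
```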

Affine expansion and Old Faithful

Woodward needed an affine expansion for the Old Faithful data: φ(x) := (1, x).

[Plot: time until next eruption (minutes) vs. duration of last eruption (minutes), with a fitted affine function overlaid.]

An affine function f_{a,b} : R → R, f_{a,b}(x) = a + bx for a, b ∈ R, is a linear function f_β of φ(x) with β = (a, b). (This easily generalizes to multivariate affine functions.)

Why linear regression models?

1. Linear regression models benefit from a good choice of features.
2. The structure of linear functions is very well understood.
3. There are many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.

3. From data to prediction functions

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, with

  Y | X = x ∼ N(x^T β, σ²), x ∈ R^d.

(It is traditional to study linear regression in the context of this model.)

Log-likelihood of (β, σ²), given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:

  Σ_{i=1}^n { −(1/(2σ²)) (x_i^T β − y_i)² + (1/2) ln(1/(2πσ²)) } + {terms not involving (β, σ²)}.

The β that maximizes the log-likelihood is also the β that minimizes

  (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model...

Empirical distribution and empirical risk

The empirical distribution P_n on (x_1, y_1), ..., (x_n, y_n) has probability mass function p_n given by

  p_n((x, y)) := (1/n) Σ_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk R(f) = E[(f(X) − Y)²]. But we don't know the distribution P of (X, Y).

Replace P with P_n to get the empirical (squared loss) risk ˆR(f):

  ˆR(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss, ERM returns

  ˆβ ∈ arg min_{β ∈ R^d} ˆR(β),

which coincides with the MLE under the basic linear regression model.

In general:
- MLE makes sense in the context of the statistical model for which it is derived.
- ERM makes sense in the context of the general iid model for supervised learning.

Empirical risk minimization in pictures

[Figure: red dots are data points; an affine hyperplane represents a linear function β (via the affine expansion (x_1, x_2) ↦ (1, x_1, x_2)); ERM minimizes the sum of squared vertical lengths from the hyperplane to the points.]

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

  A := (1/√n) [x_1^T; ...; x_n^T],   b := (1/√n) (y_1, ..., y_n)^T.

Then the empirical risk can be written as ˆR(β) = ||Aβ − b||²_2.

A necessary condition for β to be a minimizer of ˆR: ∇ˆR(β) = 0, i.e., β is a critical point of ˆR. This translates to

  (A^T A) β = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of ˆR is a minimizer of ˆR:

Aside: Convexity

Let f : R^d → R be a differentiable function. Suppose we find x ∈ R^d such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:

  f((1−t)x + tx′) ≤ (1−t) f(x) + t f(x′)

for any 0 ≤ t ≤ 1 and any x, x′ ∈ R^d.

Convexity of empirical risk

Checking convexity of g(x) = ||Ax − b||²_2:

  g((1−t)x + tx′)
    = ||(1−t)(Ax − b) + t(Ax′ − b)||²_2
    = (1−t)² ||Ax − b||²_2 + t² ||Ax′ − b||²_2 + 2(1−t)t (Ax − b)^T (Ax′ − b)
    = (1−t) ||Ax − b||²_2 + t ||Ax′ − b||²_2
        − (1−t)t [ ||Ax − b||²_2 + ||Ax′ − b||²_2 ] + 2(1−t)t (Ax − b)^T (Ax′ − b)
    ≤ (1−t) ||Ax − b||²_2 + t ||Ax′ − b||²_2,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.

Convexity of empirical risk, another way (preview of convex analysis)

Recall ˆR(β) = (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

- The scalar function g(z) = cz² is convex for any c ≥ 0.
- The composition (g ∘ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex.
- Therefore, each function β ↦ (1/n)(x_i^T β − y_i)² is convex.
- A sum of convex functions is convex.
- Therefore ˆR is convex.

Convexity is a useful mathematical property to understand! (We'll study more convex analysis in a few weeks.)

Algorithm for ERM

ERM with linear functions and squared loss
  input: data (x_1, y_1), ..., (x_n, y_n) from R^d × R.
  output: linear function ˆβ ∈ R^d.
  1: Find a solution ˆβ to the normal equations defined by the data (using, e.g., Gaussian elimination).
  2: return ˆβ.

Also called ordinary least squares in this context.

Running time (dominated by Gaussian elimination): O(nd²). Note: there are many approximate solvers that run in nearly linear time!
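
A sketch of this procedure in numpy on synthetic data. Instead of literal Gaussian elimination it uses `numpy.linalg.solve` on the normal equations and, equivalently, `numpy.linalg.lstsq` (more robust when A^T A is ill-conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                       # rows are feature vectors x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                                # A and b as defined on the slide
b = y / np.sqrt(n)

beta_normal = np.linalg.solve(A.T @ A, A.T @ b)   # solve (A^T A) beta = A^T b
beta_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))       # True: same ERM solution
```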

Geometric interpretation of ERM

Let a_j ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so A = [a_1 | ... | a_d].

Minimizing ||Aβ − b||²_2 is the same as finding the vector ˆb ∈ range(A) closest to b.

[Figure: b is projected orthogonally onto the plane spanned by a_1 and a_2, giving ˆb.]

- The solution ˆb is the orthogonal projection of b onto range(A) = {Aβ : β ∈ R^d}, and ˆb is uniquely determined.
- If rank(A) < d, then there is more than one way to write ˆb as a linear combination of a_1, ..., a_d; in that case the ERM solution is not unique.
- To get β from ˆb: solve the system of linear equations Aβ = ˆb.
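
A quick numeric check of this picture on assumed synthetic data: the fitted vector ˆb = Aˆβ is the orthogonal projection of b onto range(A), so the residual b − ˆb is orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))
b = rng.normal(size=50)

beta_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_hat = A @ beta_hat                              # orthogonal projection of b onto range(A)

print(np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-10))   # residual is orthogonal to columns of A
```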

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R. Which β have the smallest risk R(β) = E[(X^T β − Y)²]?

A necessary condition for β to be a minimizer of R: ∇R(β) = 0, i.e., β is a critical point of R. This translates to

  E[XX^T] β = E[Y X],

a system of linear equations called the population normal equations. It can be proved that every critical point of R is a minimizer of R.

Looks familiar? If (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, then E[A^T A] = E[XX^T] and E[A^T b] = E[Y X], so ERM can be regarded as a plug-in estimator for a minimizer of R.

4. Risk, empirical risk, and estimating risk

Risk of ERM

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, taking values in R^d × R. Let β* be a minimizer of R over all β ∈ R^d, i.e., β* satisfies the population normal equations E[XX^T] β* = E[Y X].

If the ERM solution ˆβ is not unique (e.g., if n < d), then R(ˆβ) can be arbitrarily worse than R(β*).

What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

  R(ˆβ) − R(β*) = O( tr(cov(εW)) / n )

asymptotically, where W := E[XX^T]^{−1/2} X and ε := Y − X^T β*.

Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T β* for each i = 1, ..., n, so

  E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] β* = 0

and

  √n (ˆβ − β*) = ( (1/n) Σ_{i=1}^n X_i X_i^T )^{−1} · (1/√n) Σ_{i=1}^n ε_i X_i.

1. By the LLN: (1/n) Σ_{i=1}^n X_i X_i^T →_p E[XX^T].
2. By the CLT: (1/√n) Σ_{i=1}^n ε_i X_i →_d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ˆβ − β*) is

  √n (ˆβ − β*) →_d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

  n ( E[(X^T ˆβ − Y)²] − E[(X^T β* − Y)²] ) →_d || E[XX^T]^{−1/2} cov(εX)^{1/2} Z ||²_2.

The random variable on the right-hand side is concentrated around its mean, tr(cov(εW)).

Risk of ERM: postscript

- The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model. The only assumptions are those needed for the LLN and CLT to hold.
- However, if the normal linear regression model does hold, i.e., Y | X = x ∼ N(x^T β, σ²), then the bound from the theorem becomes

    R(ˆβ) − R(β*) = O(σ² d / n),

  which is familiar to those who have taken introductory statistics.
- With more work, one can also prove a non-asymptotic risk bound of similar form.
- In the homework/reading, we look at a simpler setting for studying ERM for linear regression, called fixed design.
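
A small simulation sketch consistent with this bound, under the normal linear regression model with X ∼ N(0, I_d); in that special case the excess risk of the ERM is exactly ||ˆβ − β*||²_2. The dimensions, noise level, and number of trials are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 10, 2.0
beta_star = rng.normal(size=d)

def excess_risk_once():
    X = rng.normal(size=(n, d))                      # X ~ N(0, I_d)
    y = X @ beta_star + sigma * rng.normal(size=n)   # normal linear regression model
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((beta_hat - beta_star) ** 2)       # = R(beta_hat) - R(beta_star) here

trials = [excess_risk_once() for _ in range(200)]
print(np.mean(trials), sigma**2 * d / n)             # both come out close to 0.08
```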

Risk vs empirical risk

Let ˆβ be the ERM solution.
1. Empirical risk of ERM: ˆR(ˆβ).
2. True risk of ERM: R(ˆβ).

Theorem. E[ˆR(ˆβ)] ≤ E[R(ˆβ)].
(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: the empirical risk is small, but the true risk is much higher.

Overfitting example

(X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R. Suppose we use the degree-k polynomial expansion

  φ(x) = (1, x, x², ..., x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: a degree-k polynomial passing exactly through a handful of training points on [0, 1].]

Conclusion: If n ≤ k + 1 = d, the ERM solution ˆβ with this feature expansion has ˆR(ˆβ) = 0 always, regardless of its true risk (which can be ≫ 0).
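
A sketch of this phenomenon on a synthetic problem of my own choosing: with a degree-(n−1) polynomial expansion and n training points, the ERM fit interpolates the training data (empirical risk ≈ 0) while a large held-out sample shows the true risk is much bigger.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                             # training points
x_train = rng.uniform(size=n)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=n)

def poly_features(x, k):
    """phi(x) = (1, x, x^2, ..., x^k) for each entry of x."""
    return np.vander(np.asarray(x), N=k + 1, increasing=True)

k = n - 1                                          # d = k + 1 = n, so ERM interpolates
beta = np.linalg.lstsq(poly_features(x_train, k), y_train, rcond=None)[0]

x_test = rng.uniform(size=10000)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=10000)

train_risk = np.mean((poly_features(x_train, k) @ beta - y_train) ** 2)
test_risk = np.mean((poly_features(x_test, k) @ beta - y_test) ** 2)
print(train_risk, test_risk)                       # ~0 vs. much larger
```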

Estimating risk

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) are iid ∼ P.
- Training data (X_1, Y_1), ..., (X_n, Y_n) is used to learn ˆf.
- Test data (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) is used to estimate the risk, via the test risk

    R_test(f) := (1/m) Σ_{i=1}^m (f(X̃_i) − Ỹ_i)².

The training data is independent of the test data, so ˆf is independent of the test data.

Let L_i := (ˆf(X̃_i) − Ỹ_i)² for each i = 1, ..., m, so

  E[R_test(ˆf) | ˆf] = (1/m) Σ_{i=1}^m E[L_i | ˆf] = R(ˆf).

Moreover, L_1, ..., L_m are conditionally iid given ˆf, and hence by the Law of Large Numbers,

  R_test(ˆf) →_p R(ˆf) as m → ∞.

By the CLT, the rate of convergence is m^{−1/2}.

Rates for risk minimization vs. rates for risk estimation

One may think that ERM works because, somehow, the training risk is a good plug-in estimate of the true risk. Sometimes this is partially true; we'll revisit this when we discuss generalization theory. Roughly speaking, under some assumptions, one can expect that

  |ˆR(β) − R(β)| ≤ O(√(d/n)) for all β ∈ R^d.

However...
- By the CLT, we know the following holds for a fixed β: ˆR(β) →_p R(β) at an n^{−1/2} rate. (Here, we ignore the dependence on d.)
- Yet, for the ERM ˆβ: R(ˆβ) →_p R(β*) at an n^{−1} rate. (Also ignoring the dependence on d.)

Implication: Selecting a good predictor can be easier than estimating how good predictors are!

Old Faithful example

Linear regression model + affine expansion on the duration of the last eruption.
- Learn ˆβ = (35.0929, 10.3258) from 136 past observations.
- Mean squared loss of ˆβ on the next 136 observations is 35.9404. (Recall: the mean squared loss of ˆµ = 70.7941 was 187.1894.)

[Plot: time until next eruption vs. duration of last eruption, with the fitted linear model and the constant prediction overlaid.]

(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
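
The corresponding computation, sketched in numpy. The arrays below are placeholders standing in for the Old Faithful columns; with the real data this should reproduce ˆβ ≈ (35.09, 10.33) and mean squared loss ≈ 35.94:

```python
import numpy as np

# Placeholders for the Old Faithful columns: duration of last eruption (x)
# and time until the next eruption (y), both in minutes.
rng = np.random.default_rng(0)
duration = rng.uniform(1.5, 5.5, size=272)
waiting = 35.0 + 10.3 * duration + 6.0 * rng.normal(size=272)

x_tr, y_tr = duration[:136], waiting[:136]
x_te, y_te = duration[136:], waiting[136:]

Phi_tr = np.column_stack([np.ones_like(x_tr), x_tr])   # affine expansion (1, x)
beta = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]

Phi_te = np.column_stack([np.ones_like(x_te), x_te])
mse = np.mean((Phi_te @ beta - y_te) ** 2)
print(beta, mse)
```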

5. Regularization

Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the β of shortest length.
- Fact: The shortest solution ˆβ to (A^T A)β = A^T b is always unique.
- Obtain ˆβ via ˆβ = A†b, where A† is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?
- The data does not give a reason to choose a shorter β over a longer β.
- The preference for shorter β is an inductive bias: it will work well for some problems (e.g., when the true β is short), and not for others.
- All learning algorithms encode some kind of inductive bias.
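
A sketch of the minimum-norm ERM in an underdetermined case (n < d, synthetic data): `numpy.linalg.pinv` gives the pseudoinverse, and `numpy.linalg.lstsq` returns the same minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                                    # fewer equations than unknowns
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

beta_pinv = np.linalg.pinv(A) @ b               # shortest beta among the ERM solutions
beta_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(beta_pinv, beta_lstsq))       # True: both are the least-norm ERM
print(np.allclose(A @ beta_pinv, b))            # True here, since A has full row rank
```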

Example

ERM with the scaled trigonometric feature expansion

  φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), ...).

[Figures: the same training data shown (i) alone, (ii) with some arbitrary ERM fit, and (iii) with the least ℓ2-norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find a minimizer of

  ˆR(β) + λ ||β||²_2

over β ∈ R^d.

- Fact: If λ > 0, then the solution is always unique (even if n < d)!
- This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
- The parameter λ controls how much attention is paid to the regularizer ||β||²_2 relative to the data-fitting term ˆR(β).
- Choose λ using cross-validation.
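
A sketch of ridge regression via its closed-form solution: setting the gradient of ˆR(β) + λ||β||²_2 to zero gives (X^T X / n + λ I) β = X^T y / n, which is solvable even when n < d (synthetic data; λ would normally be chosen by cross-validation).

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize (1/n)||X beta - y||^2 + lam * ||beta||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
n, d = 30, 50                                   # n < d: unregularized ERM is not unique
X = rng.normal(size=(n, d))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=n)

beta = ridge(X, y, lam=0.1)
print(np.linalg.norm(beta), np.mean((X @ beta - y) ** 2))
```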

Another interpretation of ridge regression

Define the (n + d) × d matrix Ã and the (n + d) × 1 column vector b̃ by stacking d extra rows below the original data:

  Ã := (1/√n) [ x_1^T; ...; x_n^T; √(nλ) e_1^T; ...; √(nλ) e_d^T ],   b̃ := (1/√n) (y_1, ..., y_n, 0, ..., 0)^T,

where e_1, ..., e_d are the standard basis vectors in R^d. Then

  ||Ãβ − b̃||²_2 = ˆR(β) + λ ||β||²_2.

Interpretation:
- d "fake" data points; they ensure that the augmented data matrix Ã has rank d.
- The squared length of each fake feature vector is nλ, and all of the corresponding labels are 0.
- The prediction of β on the i-th fake feature vector is √(nλ) β_i.
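
A quick numeric check of this identity, following the construction on the slide with assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 30, 5, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
beta = rng.normal(size=d)

A_tilde = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b_tilde = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

lhs = np.sum((A_tilde @ beta - b_tilde) ** 2)            # ||A~ beta - b~||^2
rhs = np.mean((X @ beta - y) ** 2) + lam * np.sum(beta ** 2)   # R-hat(beta) + lam ||beta||^2
print(np.isclose(lhs, rhs))                              # True
```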

Regularization with a different norm

Lasso: For a given λ ≥ 0, find a minimizer of

  ˆR(β) + λ ||β||_1

over β ∈ R^d. Here, ||v||_1 = Σ_{i=1}^d |v_i| is the ℓ1-norm.

- Like ridge, the lasso prefers shorter β, but using a different notion of length.
- It tends to produce β that are sparse, i.e., have few non-zero coordinates, or at least are well-approximated by sparse vectors.
- Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors. If β̃ keeps just the 1/ε² largest coefficients of β (by magnitude) and zeroes out the rest, then ||β̃ − β||_2 ≤ ε ||β||_1.
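
For completeness, a sketch of the lasso on synthetic data using scikit-learn (using that library is my own choice, not part of the lecture); note how many coordinates of the lasso solution are exactly zero compared with ridge.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]                 # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)               # alpha plays the role of lambda
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ != 0))                  # few non-zero coordinates
print(np.sum(np.abs(ridge.coef_) > 1e-8))        # typically all d coordinates are non-zero
```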