Regression I: Mean Squared Error and Measuring Quality of Fit


Applied Multivariate Analysis. Lecturer: Darren Homrighausen, PhD

The Setup

Suppose there is a scientific problem we are interested in solving. (This could be estimating a relationship between height and weight in humans.) The perspective of this class is to define these quantities as random, say

$X$ = height, $Y$ = weight.

We want to know about the joint distribution of $X$ and $Y$. This joint distribution is unknown, and hence we

1. gather data
2. estimate it

The Setup

Now we have data $D_n = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\}$, where

- $X_i \in \mathbb{R}^p$ are the covariates, explanatory variables, or predictors (NOT INDEPENDENT VARIABLES!)
- $Y_i \in \mathbb{R}$ are the response or supervisor variables (NOT DEPENDENT VARIABLE!)

Finding the joint distribution of $X$ and $Y$ is usually too ambitious. A good first start is to try and get at the mean, say

$Y = \mu(X) + \epsilon,$

where $\epsilon$ describes the random fluctuation of $Y$ around its mean.

Parameterizing this relationship

Even estimating $\mu(X)$ is often too much. A good simplification is to assume that it is linear. This means we suppose that there is a $\beta \in \mathbb{R}^p$ such that

$Y = \mu(X) + \epsilon = \underbrace{X^{\top}\beta}_{\text{simplification}} + \epsilon \in \mathbb{R}$

(The notation $\in$ indicates "in": if I say $x \in \mathbb{R}^q$, that means that $x$ is a vector with $q$ entries.)

Parameterizing this relationship

Translated into using our data, we get

$Y = X\beta + \epsilon \in \mathbb{R}^n$, where $X = \begin{bmatrix} X_1^{\top} \\ X_2^{\top} \\ \vdots \\ X_n^{\top} \end{bmatrix}$.

Commonly, $X_{i1} = 1$, which encodes an intercept term in the model.

$X$ is known as the design or feature matrix.

Parameterizing this relationship: In detail

Back to the height/weight example:

$X_1 = [1, 62],\; X_2 = [1, 68],\; \dots,\; X_{10} = [1, 65]$

Estimating

$\hat\beta = \operatorname*{argmin}_{\beta} \|X\beta - Y\|_2^2 = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (Y_i - X_i^{\top}\beta)^2$

finds the usual simple linear regression estimators: $\hat\beta = [\hat\beta_0, \hat\beta_1]$.
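As a concrete illustration, here is a minimal sketch of this least-squares fit in Python. The slide only lists a few of the $X_i$, so the remaining heights and all of the weight values below are hypothetical:

```python
import numpy as np

# Hypothetical data: the slide gives X_1 = [1, 62], X_2 = [1, 68], ..., X_10 = [1, 65];
# the remaining heights and all weights are made up for illustration.
heights = np.array([62, 68, 70, 58, 65, 73, 66, 61, 69, 65], dtype=float)
weights = np.array([120, 155, 165, 110, 140, 185, 150, 118, 160, 142], dtype=float)

# Design matrix with an intercept column, so X_i = [1, height_i]
X = np.column_stack([np.ones_like(heights), heights])

# beta_hat = argmin_beta ||X beta - Y||_2^2 (ordinary least squares)
beta_hat, *_ = np.linalg.lstsq(X, weights, rcond=None)
print("beta_hat = [beta_0, beta_1] =", beta_hat)
```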

Some terminology and overarching concepts

Core terminology

Statistics is fundamentally about two goals:

- Formulating an estimator of unknown, but desired, quantities. (This is known as (statistical) learning.)
- Answering the question: how good is that estimator? (For this class, we will focus on whether the estimator is sensible and whether it can make good predictions.)

Let's address some relevant terminology for each of these goals.

Types of learning

- Supervised: regression, classification
- Semi-supervised: PCA regression
- Unsupervised: clustering, PCA

Some comments:

- For supervised methods, comparing to the response (aka supervisor) $Y$ gives a natural notion of prediction accuracy.
- Unsupervised methods are much more heuristic: it is unclear what a good solution would be. We'll return to this later in the semester.

How good is that estimator?

Suppose we are attempting to estimate a quantity $\beta$ with an estimator $\hat\beta$. (We don't need to be too careful as to what this means in this class. Think of it as a procedure that takes data and produces an answer.)

How good is this estimator? We can look at how far $\hat\beta$ is from $\beta$ through some function $\ell$. Any distance function will do... (or even a topology...)

Risky (and lossy) business

We refer to the quantity $\ell(\hat\beta, \beta)$ as the loss function. As $\ell$ is random (it is a function of the data, after all), we usually want to average it over the probability distribution of the data. This produces

$R(\hat\beta, \beta) = \mathbb{E}\,\ell(\hat\beta, \beta),$

which is called the risk function.

Risky business

To be concrete, however, let's go through an important example:

$\ell(\hat\beta, \beta) = \|\hat\beta - \beta\|_2^2$

(Note that we tend to square the 2-norm to get rid of the pesky square root.) Also,

$R(\hat\beta, \beta) = \mathbb{E}\,\|\hat\beta - \beta\|_2^2$

Risky business

Back to the original question: What Makes a Good Estimator of β? Answer: one that has small risk!

It turns out that we can form the following decomposition:

$$
\begin{aligned}
R(\hat\beta, \beta) &= \mathbb{E}\,\|\hat\beta - \beta\|_2^2 \\
&= \mathbb{E}\,\|\hat\beta - \mathbb{E}\hat\beta + \mathbb{E}\hat\beta - \beta\|_2^2 \\
&= \mathbb{E}\,\|\hat\beta - \mathbb{E}\hat\beta\|_2^2 + \|\mathbb{E}\hat\beta - \beta\|_2^2 + 2\,\mathbb{E}(\hat\beta - \mathbb{E}\hat\beta)^{\top}(\mathbb{E}\hat\beta - \beta) \\
&= \mathbb{E}\,\|\hat\beta - \mathbb{E}\hat\beta\|_2^2 + \|\mathbb{E}\hat\beta - \beta\|_2^2 + 0 \\
&= \operatorname{trace}\bigl(\mathbb{V}\hat\beta\bigr) + \|\mathbb{E}\hat\beta - \beta\|_2^2 \\
&= \text{Variance} + \text{Bias}^2
\end{aligned}
$$
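A quick way to convince yourself of this decomposition is a small simulation. The sketch below is not from the slides: the design, the true $\beta$, and the ridge-style estimator are all made up for illustration. It estimates the risk, the variance, and the squared bias by Monte Carlo and checks that the first equals the sum of the other two:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 50, 3, 20000
beta = np.array([1.0, -2.0, 0.5])           # "true" parameter (made up)
X = rng.normal(size=(n, p))                 # fixed design

def estimator(y):
    # A deliberately biased (ridge-like) estimator so the bias term is nonzero
    return np.linalg.solve(X.T @ X + 5.0 * np.eye(p), X.T @ y)

# Repeatedly draw new noise, refit, and record beta_hat
betas = np.array([estimator(X @ beta + rng.normal(size=n)) for _ in range(reps)])

risk = np.mean(np.sum((betas - beta) ** 2, axis=1))                    # E||beta_hat - beta||^2
variance = np.mean(np.sum((betas - betas.mean(axis=0)) ** 2, axis=1))  # E||beta_hat - E beta_hat||^2
bias_sq = np.sum((betas.mean(axis=0) - beta) ** 2)                     # ||E beta_hat - beta||^2
print(risk, variance + bias_sq)   # these two numbers should agree closely
```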

Bias and variance

Two important concepts in statistics: bias and variance.

$R(\hat\beta, \beta) = \text{Variance} + \text{Bias}^2$

Bias and variance

So, what makes a good estimator? If...

1. $R(\hat\beta, \beta) = \text{Variance} + \text{Bias}^2$, and
2. we want $R(\hat\beta, \beta)$ to be small,

then we want a $\hat\beta$ that optimally trades off bias and variance. (Note, crucially, that this implies a biased estimator can have smaller risk than an unbiased one.)

This is great, but... we don't know $\mathbb{E}$, and hence we don't know $R(\hat\beta, \beta)$!

We don't know the risk

Since the risk is unknown, we need to estimate it. The risk is by definition an average, so perhaps we should use the data...

This means translating risk in terms of just $\beta$ into risk in terms of $X\beta$ (this might seem strange; I'm omitting some details to limit complexity):

$\mathbb{E}\,\|\hat\beta - \beta\|_2^2 \quad \text{into} \quad \mathbb{E}\,\|X\hat\beta - X\beta\|_2^2$

(i.e., using the data to compute an expectation. You've done this before!)

What Makes a Good Estimator of β?

An intuitive and well-explored criterion is known variously as

- Mean Squared Error (MSE)
- Residual Sums of Squares (RSS)
- Training error (we'll get back to this last one)

which for an arbitrary estimator $\hat\beta$ has the form

$\mathrm{MSE}(\hat\beta) = \sum_{i=1}^{n} (Y_i - X_i^{\top}\hat\beta)^2 \;\overset{?}{\approx}\; \mathbb{E}\,\|X\hat\beta - X\beta\|_2^2$

Here, we see that if $\hat\beta$ is such that $X_i^{\top}\hat\beta \approx Y_i$ for all $i$, then $\mathrm{MSE}(\hat\beta) \approx 0$. But there's a problem... we can make MSE arbitrarily small...

What Makes a Good Estimator of β?

Here's an example: let's suppose we have 20 observations with one explanatory variable and one response. Now, let's fit some polynomials to this data. Let $\mu_i$ be the conditional mean of the response $Y_i$ (that is, $\mu_i = \mu(X_i)$). We consider the following models:

- Model 1: $\mu_i = \beta_0 + \beta_1 X_i$
- Model 2: $\mu_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3$
- Model 3: $\mu_i = \beta_0 + \sum_{k=1}^{10} \beta_k X_i^k$
- Model 4: $\mu_i = \beta_0 + \sum_{k=1}^{100} \beta_k X_i^k$

Let's look at what happens...
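The sketch below mimics this example in Python. The 20 observations from the slide are not reproduced here, so the data-generating curve and noise level are hypothetical; the point is only how the training MSE behaves as the polynomial degree grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)   # made-up truth + noise

def training_mse(degree):
    # Columns [1, x, x^2, ..., x^degree]; lstsq returns a least-squares solution
    # (minimum-norm when degree + 1 > n, as in Model 4)
    X = np.vander(x, N=degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2)   # the slide's MSE: sum of squared residuals

for name, degree in [("Model 1", 1), ("Model 2", 3), ("Model 3", 10), ("Model 4", 100)]:
    print(name, training_mse(degree))
# The training MSE can only decrease as the degree grows; Model 4 fits the
# 20 points essentially exactly.
```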

What Makes a Good Estimator of β?

[Figure: Fitting Polynomials Example — the 20 observations with the fits from Models 1–4, plus new observations shown as red crosses.]

The MSEs are:

- MSE(Model 1) = 0.98
- MSE(Model 2) = 0.86
- MSE(Model 3) = 0.68
- MSE(Model 4) = 0

What about predicting new observations (red crosses)?

Example of this Inverse Relationship

[Figure: Fitting Polynomials Example — fits from Models 1–4.]

- The black model has low variance, high bias.
- The green model has low bias, but high variance.
- The red model and blue model have intermediate bias and variance.

We want to balance these two quantities.

Bias vs. Variance

[Figures: the Fitting Polynomials Example (Models 1–4) alongside a plot of squared bias, variance, and risk against model complexity.]

The best estimator is at the vertical black line (the minimum of the risk).

Training and test error

A better notion of risk

What we are missing is that the same data that goes into training $\hat\beta$ goes into testing $\hat\beta$ via $\mathrm{MSE}(\hat\beta)$. What we really want is to be able to predict a new observation well.

Let $(X_0, Y_0)$ be a new observation that has the same properties as our original sample $D_n$, but is independent of it.

A better notion of risk

It turns out that $\mathbb{E}\,\|X\hat\beta - X\beta\|_2^2$ is the "same" as $\mathbb{E}(Y_0 - X_0^{\top}\hat\beta)^2$. ("Same" means that it isn't equal, but behaves the same.)

Of course,

$\mathrm{pred}(\hat\beta) = \mathbb{E}(Y_0 - X_0^{\top}\hat\beta)^2$

still depends on information that is unavailable to the data analyst (in particular, the joint distribution of $X_0$ and $Y_0$).

Training and test split

We can mimic $\mathrm{pred}(\hat\beta) = \mathbb{E}(Y_0 - X_0^{\top}\hat\beta)^2$ by splitting the sample $D_n$ into two parts: a training set $D_{\mathrm{train}}$ and a test set $D_{\mathrm{test}}$, where

- every observation from $D_n$ is in $D_{\mathrm{train}}$ or $D_{\mathrm{test}}$
- no observation is in both

Training and test split

Now, instead of trying to compute $\mathrm{pred}(\hat\beta) = \mathbb{E}(Y_0 - X_0^{\top}\hat\beta)^2$, we can instead

- train $\hat\beta$ on the observations in $D_{\mathrm{train}}$
- compute the MSE using the observations in $D_{\mathrm{test}}$ to test

Example: commonly, this might be 90% of the data in $D_{\mathrm{train}}$ and 10% of the data in $D_{\mathrm{test}}$.

Question: where does the terminology "training error" come from?
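Here is a minimal sketch of that recipe, using the same kind of hypothetical polynomial data as before and a 90%/10% split:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)   # hypothetical data

# Random 90% / 10% split into D_train and D_test
perm = rng.permutation(n)
train, test = perm[: int(0.9 * n)], perm[int(0.9 * n):]

def test_mse(degree):
    X_train = np.vander(x[train], N=degree + 1, increasing=True)
    X_test = np.vander(x[test], N=degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)   # fit on D_train only
    return np.mean((y[test] - X_test @ beta_hat) ** 2)              # evaluate on D_test

for degree in [1, 3, 10]:
    print(degree, test_mse(degree))
# Unlike the training MSE, the test MSE eventually gets worse as the model
# becomes too flexible.
```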

Training and test split

This approach has major pros and cons.

- Pro: this is a fair estimator of risk, as the training and test observations are independent.
- Con: we are sacrificing power by using only a subset of the data for training.

Other estimates of risk

There are many candidates for estimating $\mathrm{pred}(\hat\beta)$:

- AIC (Akaike information criterion)
- AICc (AIC with a correction)
- BIC (Bayesian information criterion)
- Mallows' Cp

Don't worry overly much about the differences between each of these criteria.

What Makes a Good Estimator of β?

For example:

$\mathrm{pred}(\hat\beta) \approx \mathrm{AIC}(\hat\beta) = -2\ell(\hat\beta) + 2|\hat\beta|$

Here,

- $\ell$ is the log-likelihood under Gaussian errors (this term is effectively the MSE)
- $|\hat\beta|$ is the length of the estimator (the number of parameters)

For the polynomial example: $|\hat\beta| = 2$ for Model 1, $4$ for Model 2, $11$ for Model 3, $101$ for Model 4.

What Makes a Good Estimator of β?

[Figure: Fitting Polynomials Example — fits from Models 1–4.]

With MSE(Model 1) = 0.98, MSE(Model 2) = 0.86, MSE(Model 3) = 0.68, MSE(Model 4) = 0, and $|\hat\beta| = 2, 4, 11, 101$ for Models 1–4 respectively:

- AIC(Model 1) = 0.98 + 2·2 = 4.98
- AIC(Model 2) = 0.86 + 2·4 = 8.86
- AIC(Model 3) = 0.68 + 2·11 = 22.68
- AIC(Model 4) = 0 + 2·101 = 202

What Makes a Good Estimator of β?

I'll include the form of each of the criteria for future reference, formulated for regression:

$$
\begin{aligned}
\mathrm{AIC}(\hat\beta) &= \mathrm{MSE}(\hat\beta) + 2|\hat\beta| \\
\mathrm{AICc}(\hat\beta) &= \mathrm{AIC}(\hat\beta) + \frac{2|\hat\beta|(|\hat\beta| + 1)}{n - |\hat\beta| - 1} \\
\mathrm{BIC}(\hat\beta) &= \mathrm{MSE}(\hat\beta) + |\hat\beta| \log n \\
C_p(\hat\beta) &= \frac{\mathrm{MSE}(\hat\beta)}{\hat\sigma^2} - n + 2|\hat\beta|
\end{aligned}
$$

Note:

- As long as $\log n \geq 2$ (which is effectively always), BIC picks a smaller model than AIC.
- As long as $n > |\hat\beta| + 1$, AICc always picks a smaller model than AIC.
- AICc is a correction for AIC. You should use AICc in practice/research if available. In this class we'll just use AIC so as not to get bogged down in details.
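For reference, here is a direct transcription of these formulas into Python (a sketch: `mse` is the training criterion defined earlier, `p` is the number of parameters $|\hat\beta|$, and `sigma2_hat` is an assumed estimate of the noise variance):

```python
import numpy as np

def aic(mse, p):
    return mse + 2 * p

def aicc(mse, p, n):
    return aic(mse, p) + 2 * p * (p + 1) / (n - p - 1)

def bic(mse, p, n):
    return mse + p * np.log(n)

def mallows_cp(mse, p, n, sigma2_hat):
    return mse / sigma2_hat - n + 2 * p

# Reproducing the AIC numbers from the polynomial example:
for name, mse, p in [("Model 1", 0.98, 2), ("Model 2", 0.86, 4),
                     ("Model 3", 0.68, 11), ("Model 4", 0.0, 101)]:
    print(name, aic(mse, p))
```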

Finally, we can answer: What Makes a Good Estimator of β?

A good estimator is one that minimizes one of those criteria. For example:

$\hat\beta_{\mathrm{good}} = \operatorname*{argmin}_{\hat\beta} \mathrm{AIC}(\hat\beta)$

Algorithmically, how can we compute $\hat\beta_{\mathrm{good}}$? One way is to compute all possible models and their associated AIC scores. However, this is a lot of models...
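For a small number of candidate predictors, "all possible models" can be searched exhaustively. The sketch below, with made-up data and the slide's $\mathrm{MSE}(\hat\beta) + 2|\hat\beta|$ form of AIC, shows the idea and why it does not scale:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 8
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + 8 predictors
beta_true = np.array([1.0, 2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

best_score, best_cols = np.inf, None
for k in range(p + 1):
    for subset in itertools.combinations(range(1, p + 1), k):
        cols = (0,) + subset                                  # always keep the intercept
        beta_hat, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        mse = np.sum((y - X[:, cols] @ beta_hat) ** 2)
        score = mse + 2 * len(cols)                           # AIC as on the earlier slide
        if score < best_score:
            best_score, best_cols = score, cols
print("best AIC:", best_score, "columns:", best_cols)
# With p candidate predictors there are 2^p subsets (here 256); for even a few
# dozen predictors this exhaustive search becomes infeasible.
```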

Caveat

In order to avoid some complications about AIC and related measures, I've over-simplified a bit. Sometimes you'll see AIC written in three ways:

- $\mathrm{AIC} = \log(\mathrm{MSE}/n) + 2|\hat\beta|$
- $\mathrm{AIC} = \mathrm{MSE} + 2|\hat\beta|$
- $\mathrm{AIC} = \mathrm{MSE} + 2|\hat\beta| \cdot \mathrm{MSE}/n$

The reasons to prefer one over another are too much of a distraction for this class. If you're curious, I can set up a special discussion outside of class to discuss this.

(The following few slides are optional and get at the motivation behind information-criterion-based methods.)

Comparing probability measures

Information criteria come from many different places. Suppose we have data $Y$ that comes from the probability density function $f$. What happens if we use the probability density function $g$ instead?

One central idea is the Kullback-Leibler discrepancy (this has many features of a distance, but is not a true distance, as $\mathrm{KL}(f, g) \neq \mathrm{KL}(g, f)$):

$$
\mathrm{KL}(f, g) = \int \log\!\left(\frac{f(y)}{g(y)}\right) f(y)\,dy
$$

Ignoring the term that does not involve $g$, this leaves $-\int \log(g(y))\,f(y)\,dy = -\mathbb{E}[\log(g(Y))]$.

This gives us a sense of the loss incurred by using $g$ instead of $f$.
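As a small numerical illustration (not from the slides), the sketch below takes $f$ and $g$ to be two arbitrarily chosen Gaussian densities and checks that a Monte Carlo estimate of $\mathbb{E}_f[\log f(Y) - \log g(Y)]$ matches the closed-form KL value:

```python
import numpy as np

def log_normal_pdf(y, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (y - mu) ** 2 / (2 * sd**2)

mu_f, sd_f = 0.0, 1.0   # "true" density f (arbitrary choice)
mu_g, sd_g = 0.5, 1.5   # working density g (arbitrary choice)

# Closed form for KL(N(mu_f, sd_f^2) || N(mu_g, sd_g^2))
kl_exact = np.log(sd_g / sd_f) + (sd_f**2 + (mu_f - mu_g) ** 2) / (2 * sd_g**2) - 0.5

# Monte Carlo estimate: average of log f(Y) - log g(Y) with Y drawn from f
y = np.random.default_rng(4).normal(mu_f, sd_f, size=200_000)
kl_mc = np.mean(log_normal_pdf(y, mu_f, sd_f) - log_normal_pdf(y, mu_g, sd_g))
print(kl_exact, kl_mc)   # the two values should agree closely
```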

Kullback-Leibler discrepancy

Usually, $g$ will depend on some parameters, call them $\theta$; write $g(y; \theta)$.

Example: in regression, we can specify $f \sim N(X^{\top}\beta, \sigma^2)$ for a fixed (true) $\beta$, and let $g \sim N(X^{\top}\theta, \sigma^2)$ over all $\theta \in \mathbb{R}^p$. (We actually don't need to assume anything about a true model, nor have it be nested in the alternative models.)

As $\mathrm{KL}(f, g)$ equals $-\mathbb{E}[\log(g(Y; \theta))]$ up to a term not involving $g$, we want to minimize this over $\theta$. Again, $f$ is unknown, so we minimize $-\log(g(Y; \theta))$ instead. This gives the maximum likelihood value

$\hat\theta_{ML} = \operatorname*{arg\,max}_{\theta}\, g(Y; \theta)$

Kullback-Leibler discrepancy

Now, to get an operational characterization of the KL divergence at the ML solution,

$-\mathbb{E}[\log(g(Y; \hat\theta_{ML}))],$

we need an approximation (we still don't know $f$). This approximation is exactly AIC:

$\mathrm{AIC} = -2\log(g(Y; \hat\theta_{ML})) + 2|\hat\theta_{ML}|$

Example: if

$-\log(g(y; \theta)) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\,\|y - X\theta\|_2^2,$

as in regression, and $\sigma^2$ is known, then using $\hat\theta = (X^{\top}X)^{-1}X^{\top}y$,

$\mathrm{AIC} \approx \mathrm{MSE}/\sigma^2 + 2p$
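To see where that last line comes from, here is the dropped algebra (a sketch assuming $\sigma^2$ is known and writing $\mathrm{MSE}(\hat\theta) = \|y - X\hat\theta\|_2^2$, as earlier):

$$
\mathrm{AIC} = -2\log(g(y; \hat\theta)) + 2p
= n\log(2\pi\sigma^2) + \frac{\|y - X\hat\theta\|_2^2}{\sigma^2} + 2p
= n\log(2\pi\sigma^2) + \frac{\mathrm{MSE}(\hat\theta)}{\sigma^2} + 2p.
$$

Since $\sigma^2$ is known, the first term is the same for every candidate model and does not affect which model minimizes AIC, leaving $\mathrm{AIC} \approx \mathrm{MSE}/\sigma^2 + 2p$.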