Chapter 3: Other Issues in Multiple regression (Part 1)


1 Model (variable) selection

The difficulty with model selection: for p predictors there are 2^p different candidate models. When we have many predictors (with many possible interactions), it can be difficult to find a good model. Model selection tries to simplify this task.

Suppose we have P predictors X_1, ..., X_P, but the true model only depends on a subset of X_1, ..., X_P. In other words, in the model

Y = β_0 + β_1 X_1 + ... + β_P X_P + ε

some of the coefficients are zero. We need to find the predictors with nonzero coefficients. We call the set of predictors with nonzero coefficients the best subset, and the predictors in the best subset important variables.

Criteria: statistical tests; indices of the model; predictability. (Note the distinction between predictive and explanatory research.)

Example 1.1 (Surgical Unit example) X_1: blood clotting score; X_2: prognostic index; X_3: enzyme function test score; X_4: liver function test score; X_5: age in years; X_6: indicator of gender (0 = male, 1 = female); X_7, X_8: indicators for alcohol use; Y: survival time. If we only consider the first 4 predictors, we have the following calculation for the possible models:

variables selected   p   SSE      R^2     R^2_a    C_p        AIC        SBC (BIC)   PRESS (CV)
None                 1   12.808   0       0        151.4      -75.7      -73.7       13.3
X1                   2   12.0     0.06    0.043    141        -77        -73         13.5
X2                   2   9.98     0.21    0.21     108.5      -87.17     -83.2       10.74
X3                   2   7.3      0.428   0.417    66.49      -103.8     -99.84      8.32
X4                   2   7.4      0.422   0.410    67.715     -103.26    -99.28      8.025
X1, X2               3   9.44     0.26    0.23     102.037    -88.16     -82.19      11.06
X1, X3               3   5.71     0.549   0.531    43.85      -114.65    -108.69     6.98
X1, X4               3   7.29     0.43    0.408    67.97      -102.067   -96.1       8.472
X2, X3               3   4.312    0.663   0.65     20.52      -130.48    -124.5      5.065
X2, X4               3   6.62     0.483   0.463    57.21      -107.32    -101.357    7.476
X3, X4               3   5.13     0.6     0.58     33.5       -121.1     -115.146    6.12
X1, X2, X3           4   3.109    0.757   0.743    3.391      -146.161   -138.2      3.91
X1, X2, X4           4   6.57     0.487   0.456    58.39      -105.74    -97.79      7.9
X1, X3, X4           4   4.9      0.61    0.589    32.93      -120.8     -112.88     6.2
X2, X3, X4           4   3.6      0.718   0.7      11.42      -138.023   -130.067    4.597
X1, X2, X3, X4       5   3.08     0.759   0.74     5.00       -144.59    -134.65     4.07

where p is the number of coefficients included in the model.

2 R^2 and R^2_a Criterion

1. R^2: can be used to compare models with the same number of parameters/coefficients.

2. R^2_a: can be used to compare models with different numbers of parameters/coefficients. We choose the model with the largest R^2_a.

3 Mallows C_p Criterion

Suppose we select p predictors, p ≤ P, and fit a model with the selected predictors; denote its SSE by SSE_p. The criterion is

C_p = SSE_p / MSE(X_1, ..., X_P) - (n - 2p)

where p is the number of coefficients, including the intercept (if there is one).

Criterion: we seek subsets of X for which (1) the C_p value is small and (2) the C_p value is near p.
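For concreteness, the R^2, R^2_a and C_p columns above can be reproduced in a few lines of R. This is a minimal sketch, not the original computation: it assumes the Surgical Unit data sit in a data frame named surg with columns Y, X1, ..., X4 (the name surg is a hypothetical placeholder).

# Full model (all P = 4 predictors) and one candidate subset model
full <- lm(Y ~ X1 + X2 + X3 + X4, data = surg)
sub  <- lm(Y ~ X2 + X3, data = surg)

n       <- nrow(surg)
p       <- length(coef(sub))                              # number of coefficients, incl. intercept
SSEp    <- sum(resid(sub)^2)                              # SSE_p for the subset model
MSEfull <- sum(resid(full)^2) / (n - length(coef(full)))  # MSE(X1, ..., XP)
SSTO    <- sum((surg$Y - mean(surg$Y))^2)                 # total sum of squares

R2  <- 1 - SSEp / SSTO                                    # coefficient of determination
R2a <- 1 - (SSEp / (n - p)) / (SSTO / (n - 1))            # adjusted R^2
Cp  <- SSEp / MSEfull - (n - 2 * p)                       # Mallows C_p as defined above

c(R2 = R2, R2a = R2a, Cp = Cp)

In practice, the regsubsets() function in the leaps package automates this computation over all subsets.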

If a selected model includes all the important variables (but possibly some unimportant ones as well), the model is still correct. Then we have

E{SSE_p} = (n - p)σ².

On the other hand,

E{MSE(X_1, ..., X_P)} = σ².

Roughly speaking, we then have

C_p ≈ (n - p) - (n - 2p) = p.

Question: are the estimators still unbiased?

If a selected model does not include all the important variables, the model is wrong. Then SSE_p ≫ SSE_P, and consequently

C_p ≫ (n - p) - (n - 2p) = p.

Question: are the estimators still unbiased?

4 Akaike's information criterion (AIC)

We cannot use SSE alone for the selection: as p increases, SSE_p decreases. AIC tries to balance the number of parameters against SSE_p:

AIC_p = n log(SSE_p/n) + 2p    or    AIC_p = log(SSE_p/n) + 2p/n.
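The first form of AIC_p is what base R's extractAIC() reports for a fitted lm object, so the AIC column of the table can be checked by hand. A minimal sketch, again assuming the hypothetical data frame surg:

sub  <- lm(Y ~ X2 + X3, data = surg)
n    <- nrow(surg)
p    <- length(coef(sub))
SSEp <- sum(resid(sub)^2)

n * log(SSEp / n) + 2 * p    # AIC_p = n log(SSE_p/n) + 2p, computed by hand
extractAIC(sub)[2]           # the same value from base R (penalty k = 2 by default)

The same function with k = log(n) gives the Schwarz criterion of the next section.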

5 Schwarz Bayesian criterion (BIC or SBC)

Theoretically, it has been found that AIC does not select the right number of variables. Schwarz proposed the BIC:

BIC_p = n log(SSE_p/n) + log(n) p    or    BIC_p = log(SSE_p/n) + log(n) p/n.

BIC gives a bigger penalty to the number of parameters.

6 Prediction sum of squares (PRESS) or Cross-validation criterion (CV)

A better model should have better prediction. Most of the time, we do not have new data on which to test predictions. A simple remedy is to partition the data into two parts: a training set and a prediction set (or validation set). Use the training set to estimate the model and the prediction set to check its predictability. A simple special case uses one observation at a time, in turn, as the prediction set. There are many such partitions; using all of them is the idea of cross-validation (CV). The idea was proposed by M. Stone (1974).

If we use 1 observation for validation and the other n - 1 for model estimation, it is leave-one-observation-out cross-validation. If we use m observations for validation and the other n - m for model estimation, it is leave-m-observations-out cross-validation.

We need to select variables from X_1, ..., X_p to be included in the model. There are many candidate models. For example,

model 1: Y = a_0 + a_1 X_1 + ε
model 2: Y = b_0 + b_1 X_1 + b_2 X_4 + ε
model 3: Y = c_0 + c_1 X_2 + ε

Suppose we have n samples. For each i = 1, ..., n, we use the data (Y_1, X_1), ..., (Y_{i-1}, X_{i-1}), (Y_{i+1}, X_{i+1}), ..., (Y_n, X_n), where X_i = (X_{i1}, ..., X_{ip}), to estimate the models. The estimated models are, say,

model 1: Ŷ = â_0^{-i} + â_1^{-i} X_{i1}
model 2: Ŷ = b̂_0^{-i} + b̂_1^{-i} X_{i1} + b̂_2^{-i} X_{i4}
model 3: Ŷ = ĉ_0^{-i} + ĉ_1^{-i} X_{i2}

The prediction errors for (Y_i, X_i) are respectively

model 1: err_1(i) = {Y_i - â_0^{-i} - â_1^{-i} X_{i,1}}²
model 2: err_2(i) = {Y_i - b̂_0^{-i} - b̂_1^{-i} X_{i,1} - b̂_2^{-i} X_{i,4}}²
model 3: err_3(i) = {Y_i - ĉ_0^{-i} - ĉ_1^{-i} X_{i,2}}²

The overall prediction errors (also called cross-validation values) are respectively

CV_1 = (1/n) Σ_{i=1}^{n} err_1(i),   CV_2 = (1/n) Σ_{i=1}^{n} err_2(i),   CV_3 = (1/n) Σ_{i=1}^{n} err_3(i).

The model with the smallest CV value is the model we prefer.
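This leave-one-out computation can be written as a short loop over models and observations. A minimal sketch, under the same assumption of a hypothetical data frame surg containing Y, X1, X2, X4:

# Formulas for model 1, model 2 and model 3 above
forms <- list(Y ~ X1, Y ~ X1 + X4, Y ~ X2)

n  <- nrow(surg)
CV <- sapply(forms, function(f) {
  err <- numeric(n)
  for (i in 1:n) {
    fit    <- lm(f, data = surg[-i, ])                            # estimate without observation i
    err[i] <- (surg$Y[i] - predict(fit, newdata = surg[i, ]))^2   # squared prediction error at i
  }
  mean(err)                                                       # CV = (1/n) * sum of err(i)
})
CV   # the model with the smallest CV value is preferred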

Example 6.1 For the same data above (data), our candidate models are

model 0: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + β_5 X_5 + ε
model 1: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + ε
model 2: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_5 X_5 + ε
model 3: Y = β_0 + β_1 X_1 + β_2 X_2 + β_4 X_4 + β_5 X_5 + ε
model 4: Y = β_0 + β_1 X_1 + β_3 X_3 + β_4 X_4 + β_5 X_5 + ε
model 5: Y = β_0 + β_2 X_2 + β_3 X_3 + β_4 X_4 + β_5 X_5 + ε

The CV values for the above models are respectively

CV(model 0) = 0.3633548, CV(model 1) = 0.333161, CV(model 2) = 1.216745,
CV(model 3) = 0.3922781, CV(model 4) = 1.400237, CV(model 5) = 0.4589498.

Thus model 1 is selected (and variable X_5 is deleted). (R code for the calculation)

K-fold cross-validation

In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K - 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, as in the sketch below.
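A minimal 10-fold sketch of this procedure, once more assuming the hypothetical data frame surg (the model formula is just one of the candidates above):

set.seed(1)                                 # reproducible fold assignment
K     <- 10
n     <- nrow(surg)
folds <- sample(rep(1:K, length.out = n))   # randomly assign each observation to a fold

fold_err <- numeric(K)
for (k in 1:K) {
  test        <- which(folds == k)
  fit         <- lm(Y ~ X1 + X2 + X3 + X4, data = surg[-test, ])  # train on the other K - 1 folds
  pred        <- predict(fit, newdata = surg[test, ])
  fold_err[k] <- mean((surg$Y[test] - pred)^2)                    # validate on fold k
}
mean(fold_err)   # average the K fold results into a single CV estimate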

7 Searching for the best subset

Forward selection: starting with no variables in the model, try the candidate variables one by one and include any that are statistically significant or that increase the predictability.

Backward elimination: starting with all candidate variables, test them one by one for statistical significance and delete any that are not significant or whose removal increases the predictability.

Stepwise: a combination of the above, testing at each stage for variables to be included or excluded.

8 R code

step(object, direction = c("both", "backward", "forward"), steps = 1000, k = ??)

where k can be any positive value: k = 2 gives AIC, and k = log(n) gives BIC (SBC).

Example 8.1 For the first example above with the data, the selected model variables are

based on AIC: X1 + X2 + X3 + X5 + X6 + X8, or
based on BIC: X1 + X2 + X3 + X8. (code)
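As an illustration of the two searches behind Example 8.1, a sketch with the hypothetical data frame surg and all eight predictors (step() and its k argument are base R):

full <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = surg)
n    <- nrow(surg)

step(full, direction = "both")               # k = 2: AIC-based stepwise search
step(full, direction = "both", k = log(n))   # k = log(n): BIC/SBC-based search

Each call prints the models visited and returns the selected lm fit.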