Lecture 22: Review for Exam 2

1 Basic Model Assumptions (without Gaussian Noise)

We model one continuous response variable Y as a linear function of p numerical predictors, plus noise:

Y = β_0 + β_1 X_1 + ... + β_p X_p + ε.    (1)

Linearity is an assumption, which can be wrong. Further assumptions take the form of restrictions on the noise: E[ε|X] = 0, Var[ε|X] = σ^2. Moreover, we assume ε is uncorrelated across observations.

We convert this to matrix form:

Y = Xβ + ε    (2)

Y is an n × 1 matrix of random variables (n being the number of observations); X is an n × (p+1) matrix, with an extra column of all 1s; ε is an n × 1 matrix. Beyond linearity, the assumptions translate to

E[ε|X] = 0, Var[ε|X] = σ^2 I.    (3)

We don't know β. If we guess it is b, we will make a vector of predictions Xb and have a vector of errors Y − Xb. The mean squared error, as a function of b, is then

MSE(b) = (1/n) (Y − Xb)^T (Y − Xb).    (4)

2 Least Squares Estimation and Its Properties

The least squares estimate of the coefficients is the one which minimizes the MSE:

β̂ ≡ argmin_b MSE(b).    (5)

To find this, we need the derivatives:

∇_b MSE = −(2/n) (X^T Y − X^T X b).    (6)

We set the derivative to zero at the optimum:

(1/n) X^T (Y − X β̂) = 0.    (7)

The term in parentheses is the vector of errors when we use the least-squares estimate. This is the vector of residuals,

e ≡ Y − X β̂    (8)

so we have the normal, estimating or score equations,

(1/n) X^T e = 0.    (9)
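
To make this concrete, here is a minimal R sketch on simulated data (all object names below are illustrative, not part of the notes): it builds the n × (p+1) design matrix with its column of 1s, evaluates MSE(b), and solves the normal equations for the least-squares estimate, checking the answer against lm().

# Minimal sketch on simulated data; names are illustrative.
set.seed(1)
n <- 100; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 2)            # noise: E[eps|X] = 0, Var[eps|X] = sigma^2 = 4
Y <- 5 + 3 * x1 - 2 * x2 + eps               # "true" coefficients, used only to simulate
X <- cbind(1, x1, x2)                        # n x (p+1) matrix with the column of 1s
mse <- function(b) mean((Y - X %*% b)^2)     # MSE(b), eq. (4)
beta.hat <- solve(t(X) %*% X, t(X) %*% Y)    # solves the normal equations (9)
cbind(beta.hat, coef(lm(Y ~ x1 + x2)))       # the two fits agree up to rounding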

We say equations, plural, because this is equivalent to the set of p + 1 equations

(1/n) Σ_{i=1}^n e_i = 0    (10)

(1/n) Σ_{i=1}^n e_i X_ij = 0    (11)

with one equation of the form (11) for each predictor j = 1, ..., p. (Many people omit the factor of 1/n.) This tells us that while e is an n-dimensional vector, it is subject to p + 1 linear constraints, so it is confined to a linear subspace of dimension n − p − 1. Thus n − p − 1 is the number of residual degrees of freedom.

The solution to the estimating equations is

β̂ = (X^T X)^{-1} X^T Y.    (12)

This is one of the two most important equations in the whole subject. It says that the coefficients are a linear function of the response vector Y.

The least squares estimate is a constant plus noise:

β̂ = (X^T X)^{-1} X^T Y    (13)
  = (X^T X)^{-1} X^T (Xβ + ε)    (14)
  = (X^T X)^{-1} X^T X β + (X^T X)^{-1} X^T ε    (15)
  = β + (X^T X)^{-1} X^T ε.    (16)

The least squares estimate is unbiased:

E[β̂] = β + (X^T X)^{-1} X^T E[ε] = β.    (17)

Its variance is

Var[β̂] = σ^2 (X^T X)^{-1}.    (18)

Since the entries in X^T X are usually proportional to n, it can be helpful to write this as

Var[β̂] = (σ^2/n) (X^T X / n)^{-1}.    (19)

The variance of any one coefficient estimator is

Var[β̂_i] = (σ^2/n) [(X^T X / n)^{-1}]_{(i+1),(i+1)}.    (20)

The vector of fitted values, or estimated conditional means, is

Ŷ ≡ X β̂.    (21)

This is more conveniently expressed in terms of the original matrices:

Ŷ = X (X^T X)^{-1} X^T Y = HY.    (22)
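
Continuing the illustrative simulation above, the hat matrix, the fitted values, and the sampling variance of β̂ in equation (18) can all be computed directly (sigma2 below is the known simulation value, not an estimate):

# Continuing the simulated example from above.
XtX.inv <- solve(t(X) %*% X)
H <- X %*% XtX.inv %*% t(X)                  # hat matrix, eq. (22)
Y.hat <- H %*% Y                             # fitted values; identical to X %*% beta.hat
round(t(X) %*% (Y - Y.hat), 10)              # normal equations: X^T e = 0, eq. (9)
sigma2 <- 4                                  # true noise variance used in the simulation
var.beta.hat <- sigma2 * XtX.inv             # Var[beta.hat | X], eq. (18)
sqrt(diag(var.beta.hat))                     # exact standard errors of the coefficients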

The fitted values are thus linear in Y: set the responses all to zero and all the fitted values will be zero; double all the responses and all the fitted values will double. The hat matrix H ≡ X (X^T X)^{-1} X^T, also called the influence, projection or prediction matrix, controls the fitted values. It is a function of X alone, ignoring the response variable totally. It is an n × n matrix with several important properties:

- It is symmetric, H^T = H.
- It is idempotent, H^2 = H.
- Its trace tr H = Σ_i H_ii = p + 1, the number of degrees of freedom for the fitted values.

The variance-covariance matrix of the fitted values is

Var[Ŷ] = H σ^2 I H^T = σ^2 H.    (23)

To make a prediction at a new point, not in the data used for estimation, we take its predictor coordinates and group them into a 1 × (p + 1) matrix X_new (including the 1 for the intercept). The point prediction for Y is then X_new β̂. The expected value is X_new β, and the variance is

Var[X_new β̂] = X_new Var[β̂] X_new^T = σ^2 X_new (X^T X)^{-1} X_new^T.

The residuals are also linear in the response:

e ≡ Y − Ŷ = (I − H) Y.    (24)

The trace of I − H is n − p − 1. The variance-covariance matrix of the residuals is

Var[e] = σ^2 (I − H).    (25)

The mean squared error (training error) is

MSE = (1/n) Σ_{i=1}^n e_i^2 = (1/n) e^T e.    (26)

Its expectation value is slightly below σ^2:

E[MSE] = σ^2 (n − p − 1)/n.    (27)

(This may be proved using the trace of I − H.) An unbiased estimate of σ^2, which I will call σ̂^2 throughout the rest of this, is

σ̂^2 ≡ (n/(n − p − 1)) MSE.    (28)

The leverage of data point i is H_ii. This has several interpretations:

1. Var[Ŷ_i] = σ^2 H_ii; the leverage controls how much variance there is in the fitted value.
2. ∂Ŷ_i/∂Y_i = H_ii; the leverage says how much changing the response value for point i changes the fitted value there.
3. Cov[Ŷ_i, Y_i] = σ^2 H_ii; the leverage says how much covariance there is between the i-th response and the i-th fitted value.
4. Var[e_i] = σ^2 (1 − H_ii); the leverage controls how big the i-th residual is.
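
The claims about H, the residual degrees of freedom, and σ̂^2 are easy to verify numerically on the simulated fit (hatvalues() and lm() are standard R functions; everything else is named just for this sketch):

# Numerical checks on the simulated example.
all.equal(H, t(H))                           # symmetric
all.equal(H %*% H, H)                        # idempotent
sum(diag(H))                                 # trace = p + 1 = 3
e <- as.vector((diag(n) - H) %*% Y)          # residuals, eq. (24)
sigma2.hat <- sum(e^2) / (n - p - 1)         # unbiased estimate of sigma^2, eq. (28)
fit <- lm(Y ~ x1 + x2)
all.equal(unname(hatvalues(fit)), diag(H))   # leverages H_ii
x.new <- c(1, 0.5, -1)                       # a hypothetical new point (leading 1 = intercept)
sigma2.hat * t(x.new) %*% XtX.inv %*% x.new  # estimated variance of the point prediction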

The standardized residual is

r_i = e_i / (σ̂ √(1 − H_ii)).    (29)

The only restriction we have to impose on the predictor variables is that (X^T X)^{-1} needs to exist. This is equivalent to saying that X is not collinear (none of its columns is a linear combination of the other columns), which is also equivalent to saying that the eigenvalues of X^T X are all > 0. (If there are zero eigenvalues, the corresponding eigenvectors indicate linearly-dependent combinations of predictor variables.) Nearly-collinear predictor variables tend to lead to large variances for coefficient estimates, with high levels of correlation among the estimates.

It is perfectly OK for one column of X to be a function of another, provided it is a nonlinear function. Thus in polynomial regression we add extra columns for powers of one or more of the predictor variables. (Any other nonlinear function is, however, also legitimate.) This complicates the interpretation of coefficients as slopes, just as though we had done a transformation of a column. Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

One column of X could be a (nonlinear) function of two or more of the other columns; this is how we represent interactions. Usually the interaction column is just a product of two other columns, for a product or multiplicative interaction; this also complicates the interpretation of coefficients as slopes. (See the notes on interactions.) Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

We can include qualitative predictor variables with k discrete categories or levels by introducing binary indicator variables for k − 1 of the levels, and adding them to X. The coefficients on these indicators tell us about amounts that are added (or subtracted) to the response for every individual who is a member of that category or level, compared to what would be predicted for an otherwise-identical individual in the baseline category. Equivalently, every category gets its own intercept. Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

Interacting the indicator variables for categories with other variables gives coefficients which say what amount is added to the slope used for each member of that category (compared to the slope for members of the baseline level). Equivalently, each category gets its own slope. Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.
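
In R's formula notation these constructions look as follows (a sketch; group is a hypothetical factor invented for the example, and the response is still the simulated Y from above):

# Illustrative R formulas for the constructions above.
rstandard(fit)                               # standardized residuals r_i, eq. (29)
eigen(t(X) %*% X)$values                     # all > 0 exactly when X is not collinear
group <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # hypothetical 3-level factor
lm(Y ~ x1 + I(x1^2) + x2)                    # polynomial regression: extra column x1^2
lm(Y ~ x1 * x2)                              # product interaction: columns x1, x2, and x1*x2
lm(Y ~ x1 + group)                           # k - 1 = 2 indicator columns: per-category intercepts
lm(Y ~ x1 * group)                           # indicators interacted with x1: per-category slopes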

Model selection for prediction aims at picking a model which will predict well on new data drawn from the same distribution as the data we've seen. One way to estimate this out-of-sample performance is to look at what the expected squared error would be on new data with the same X matrix, but a new, independent realization of Y (call it Y′). In the notes on model selection, we showed that

E[(1/n)(Y′ − Ŷ)^T (Y′ − Ŷ)] = E[(1/n)(Y − Ŷ)^T (Y − Ŷ)] + (2/n) Σ_{i=1}^n Cov[Y_i, Ŷ_i]    (30)
  = E[(1/n)(Y − Ŷ)^T (Y − Ŷ)] + (2 σ^2/n) tr H    (31)
  = E[(1/n)(Y − Ŷ)^T (Y − Ŷ)] + (2 σ^2/n) (p + 1).    (32)

Mallows' C_p estimates this by

MSE + (2 σ̂^2/n)(p + 1)    (33)

using the σ̂^2 from the largest model being selected among (which includes all the other models as special cases). An alternative is leave-one-out cross-validation, which amounts to

(1/n) Σ_{i=1}^n ( e_i / (1 − H_ii) )^2.    (34)

We also considered K-fold cross-validation, AIC and BIC.

3 Gaussian Noise

The Gaussian noise assumption is added on to the other assumptions already made. It is that ε_i ∼ N(0, σ^2), independent of the predictor variables and all other ε_j. In other words, ε has a multivariate Gaussian distribution,

ε ∼ MVN(0, σ^2 I).    (35)

Under this assumption, it follows that, since β̂ is a linear function of ε, it also has a multivariate Gaussian distribution:

β̂ ∼ MVN(β, σ^2 (X^T X)^{-1})    (36)

and

Ŷ ∼ MVN(Xβ, σ^2 H).    (37)

It follows from this that

β̂_i ∼ N(β_i, σ^2 (X^T X)^{-1}_{(i+1),(i+1)})    (38)

and

Ŷ_i ∼ N(X_i β, σ^2 H_ii).    (39)

The sampling distribution of the estimated conditional mean at a new point X_new is N(X_new β, σ^2 X_new (X^T X)^{-1} X_new^T). The mean squared error follows a χ^2 distribution:

n MSE / σ^2 ∼ χ^2_{n−p−1}.    (40)
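
Returning to the model-selection quantities a few paragraphs back: for the simulated fit, the C_p score of (33) and the leave-one-out CV score of (34) come straight from the residuals and leverages. A sketch only; with a single model, sigma2.hat here is that model's own estimate rather than the largest candidate model's, as the text prescribes.

# Cp and leave-one-out CV for the simulated fit (illustrative).
h <- hatvalues(fit)
e <- residuals(fit)
mse.train <- mean(e^2)                           # training MSE, eq. (26)
cp <- mse.train + 2 * sigma2.hat * (p + 1) / n   # Mallows' Cp, eq. (33)
loocv <- mean((e / (1 - h))^2)                   # leave-one-out CV, eq. (34)
c(Cp = cp, LOOCV = loocv)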

Moreover, the MSE is statistically independent of β̂. We may therefore define the estimated standard errors

ŝe[β̂_i] = σ̂ √((X^T X)^{-1}_{(i+1),(i+1)})    (41)

and

ŝe[Ŷ_i] = σ̂ √H_ii    (42)

and get t distributions:

(β̂_i − β_i) / ŝe[β̂_i] ∼ t_{n−p−1} ≈ N(0, 1)    (43)

and

(Ŷ_i − m(X_i)) / ŝe[Ŷ_i] ∼ t_{n−p−1} ≈ N(0, 1).    (44)

The Wald test for the hypothesis that β_i = β_i^* therefore forms the test statistic

(β̂_i − β_i^*) / ŝe[β̂_i]    (45)

and rejects the hypothesis if it is too large (above or below zero) compared to the quantiles of a t_{n−p−1} distribution. The summary function of R runs such a test of the hypothesis that β_i = 0. There is nothing magic or even especially important about testing for a 0 coefficient, and the same test works for testing whether a slope = 42 (for example).

Important! The null hypothesis being tested is "Y is a linear function of X_1, ..., X_p, and of no other predictor variables, with independent, constant-variance Gaussian noise, and the coefficient β_i = 0 exactly", and the alternative hypothesis is "Y is a linear function of X_1, ..., X_p, and of no other predictor variables, with independent, constant-variance Gaussian noise, and the coefficient β_i ≠ 0". The Wald test does not test any of the model assumptions (it presumes them all), and it cannot say whether in an absolute sense X_i matters for Y; adding or removing other predictors can change whether the true β_i = 0.

Warning! Retaining the null hypothesis β_i = 0 can happen either if the parameter is precisely estimated, and confidently known to be close to zero, or if it is imprecisely estimated, and might as well be zero or something huge on either side. Saying "We can ignore this because we can be quite sure it's small" can make sense; saying "We can ignore this because we have no idea what it is" is preposterous.

To test whether several coefficients (β_j : j ∈ S) are all simultaneously zero, use an F test. The null hypothesis is H_0: β_j = 0 for all j ∈ S and the alternative is H_1: β_j ≠ 0 for at least one j ∈ S.
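
In the simulated example, the t statistics printed by summary() are exactly the Wald statistics of (45) with null value 0; testing a different null value, say slope = 42, just recenters the numerator (a sketch, names illustrative):

# Wald statistics by hand, compared with summary(), for the simulated fit.
co <- summary(fit)$coefficients
est <- co[, "Estimate"]
se  <- co[, "Std. Error"]
est / se                                     # matches co[, "t value"]: tests beta_i = 0
t42 <- (est["x1"] - 42) / se["x1"]           # Wald statistic for the null "slope on x1 = 42"
2 * pt(-abs(t42), df = n - p - 1)            # two-sided p-value from the t_{n-p-1} distribution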

The F statistic is

F_stat = [ (σ̂^2_null − σ̂^2_full)/s ] / [ σ̂^2_full/(n − p − 1) ]    (46)

where s is the number of elements in S. Under that null hypothesis,

F_stat ∼ F_{s, n−p−1}.    (47)

If we are testing a subset of coefficients, we have a partial F test. A full F test sets s = p, i.e., it tests the null hypothesis of an intercept-only model (with independent, constant-variance Gaussian noise) against the alternative of the linear model on X_1, ..., X_p (and only those variables, with independent, constant-variance Gaussian noise). This is only of interest under very unusual circumstances. Once again, no F test is capable of checking any modeling assumptions. This is because both the null hypothesis and the alternative hypothesis presume that all of the modeling assumptions are exactly correct.

A 1 − α confidence interval for β_i is

β̂_i ± ŝe[β̂_i] t_{n−p−1}(α/2) ≈ β̂_i ± ŝe[β̂_i] z_{α/2}.    (48)

We saw how to create a confidence ellipsoid for several coefficients. These make a simultaneous guarantee: all the parameters are trapped inside the confidence region with probability 1 − α. A simpler way to get a simultaneous confidence region for all p + 1 parameters is to use 1 − α/(p + 1) confidence intervals for each one ("Bonferroni correction"). This gives a confidence hyper-rectangle.

A 1 − α confidence interval for the regression function at a point is

m̂(x_i) ± ŝe[m̂(x_i)] t_{n−p−1}(α/2).    (49)

Residuals. The cross-validated or studentized residuals are:

1. Temporarily hold out data point i.
2. Re-estimate the coefficients to get β̂^{(−i)} and σ̂^{(−i)}.
3. Make a prediction for Y_i, namely Ŷ_{i(i)} = m̂^{(−i)}(X_i).
4. Calculate

t_i = (Y_i − Ŷ_{i(i)}) / √( (σ̂^{(−i)})^2 + ŝe[m̂^{(−i)}(X_i)]^2 ).    (50)

This can be done without recourse to actually re-fitting the model:

t_i = r_i √( (n − p − 2) / (n − p − 1 − r_i^2) )    (51)

(Note that for large n, this is typically extremely close to r_i.) Also,

t_i ∼ t_{n−p−2}    (52)

(The −2 is because we're using n − 1 data points to estimate p + 1 coefficients.)

Cook's distance for point i is the sum of the (squared) changes to all the fitted values if i were omitted; it is

D_i = (1/(p + 1)) e_i^2 H_ii / ( σ̂^2 (1 − H_ii)^2 ).    (53)
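
All of the quantities in this last part have standard accessors in R; a brief sketch for the simulated fit (the nested comparison below takes S to be the single coefficient on x2):

# Standard R accessors for the simulated fit (illustrative).
confint(fit, level = 0.95)                   # confidence intervals for coefficients, eq. (48)
fit.null <- lm(Y ~ x1)                       # null model with x2 dropped
anova(fit.null, fit)                         # partial F test of the x2 coefficient, eqs. (46)-(47)
rstudent(fit)                                # cross-validated (studentized) residuals t_i
cooks.distance(fit)                          # Cook's distances D_i, eq. (53)
predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1),
        interval = "confidence")             # CI for the regression function at a point, eq. (49)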