Introduction to Statistical Inference Lecture 8: Linear regression, Tests and confidence intervals


Consider $n$ observations (pairs) $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ of $(x, y)$. Assume that the values $y_i$ are observed values of a random variable $y$, and that the values $x_i$ are observed non-random values of $x$. Assume that the values $y_i$ depend linearly on the values $x_i$. The simple (one explanatory variable) linear model can then be presented in the following way:
$$y_i = b_0 + b_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the regression coefficients $b_0$ and $b_1$ are unknown constants and the expected value of the residuals $\varepsilon_i$ is $E[\varepsilon_i] = 0$.
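
A minimal numpy sketch, with made-up data, of fitting this model by least squares (the estimators $\hat{b}_0$ and $\hat{b}_1$ are as in Lecture 7); the names x, y, b0_hat, b1_hat and residuals are reused in the later sketches:

```python
import numpy as np

# A made-up toy data set; any paired observations (x_i, y_i) would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Least squares estimates: slope = sample covariance / sample variance,
# intercept from the means.
b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_hat = y.mean() - b1_hat * x.mean()

# Estimated residuals eps_hat_i = y_i - (b0_hat + b1_hat * x_i).
residuals = y - (b0_hat + b1_hat * x)
print(b0_hat, b1_hat)
```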

Assumptions for parametric tests and confidence intervals. We now consider testing the parameters of a linear regression model and calculating confidence intervals for the estimated parameters under the classical assumptions:

- Measurement of the values $x_i$ is error-free.
- The residuals are independent of the values $x_i$.
- The residuals are independently and identically distributed (iid).
- The expected value of the residuals is $E[\varepsilon_i] = 0$, $i \in \{1, \ldots, n\}$.
- The residuals have the same variance $E[\varepsilon_i^2] = \sigma^2$, $i \in \{1, \ldots, n\}$.
- The residuals are uncorrelated, i.e. $\rho(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.
- The residuals are normally distributed.


Testing the slope of the regression. The null hypothesis is $H_0: b_1 = b_1^0$. (Typically the null hypothesis $b_1 = 0$ is tested.) Possible alternative hypotheses: $H_1: b_1 > b_1^0$ (one-tailed), $H_1: b_1 < b_1^0$ (one-tailed) or $H_1: b_1 \neq b_1^0$ (two-tailed).

Testing the slope of the regression. The $t$ test statistic is
$$t = \frac{\hat{b}_1 - b_1^0}{s / (\sqrt{n-1}\, s_x)},$$
where $s^2 = \operatorname{var}(\hat{\varepsilon})$ (see Lecture 7) and $s_x^2$ is the sample variance of the variable $x$. Under the null hypothesis $H_0$, the test statistic follows Student's $t$-distribution with $n-2$ degrees of freedom, and its expected value is $E[t] = 0$. Large absolute values of the test statistic suggest that the null hypothesis $H_0$ does not hold. The null hypothesis $H_0$ is rejected if the $p$-value is small enough.
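
A sketch of the test, continuing the code above; note the assumption (flagged in the comment) that $s^2 = \operatorname{var}(\hat{\varepsilon})$ is computed with the $n-2$ divisor, which is consistent with the statistic being exactly $t(n-2)$:

```python
import numpy as np
from scipy import stats

# Slope t-test, reusing x, b1_hat and residuals from the sketch above.
# Assumption: s^2 = var(eps_hat) uses the n - 2 divisor.
n = len(x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))
s_x = np.std(x, ddof=1)

b1_null = 0.0                                    # H0: b1 = 0
t_stat = (b1_hat - b1_null) / (s / (np.sqrt(n - 1) * s_x))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed p-value
print(t_stat, p_value)
```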

Testing the slope of the regression. Note that if we have a simple linear model (one response variable and one explanatory variable) and we wish to test whether the coefficient of determination is 0, we can do that by testing the null hypothesis $H_0: b_1 = 0$.

Confidence interval for the slope. Under the normality assumption, a $(1-\alpha) \cdot 100\%$ confidence interval for the slope of the regression can be given as
$$\left(\hat{b}_1 - t_{n-2,\alpha/2}\, \frac{s}{\sqrt{n-1}\, s_x},\ \hat{b}_1 + t_{n-2,\alpha/2}\, \frac{s}{\sqrt{n-1}\, s_x}\right),$$
where $s^2 = \operatorname{var}(\hat{\varepsilon})$, $s_x^2$ is the sample variance of the variable $x$, $t_{n-2}$ denotes the Student's $t$-distribution with $n-2$ degrees of freedom, and $t_{n-2,\alpha/2}$ is the $(1-\alpha/2) \cdot 100$ percentile of the $t(n-2)$-distribution.
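
Continuing the sketch above, the interval endpoints follow directly:

```python
# Confidence interval for the slope at level 1 - alpha, reusing n, s,
# s_x and b1_hat from the sketches above.
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # (1 - alpha/2) percentile of t(n-2)
half_width = t_crit * s / (np.sqrt(n - 1) * s_x)
ci_slope = (b1_hat - half_width, b1_hat + half_width)
```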


Testing the constant term of the regression. The null hypothesis is $H_0: b_0 = b_0^0$. Possible alternative hypotheses: $H_1: b_0 > b_0^0$ (one-tailed), $H_1: b_0 < b_0^0$ (one-tailed) or $H_1: b_0 \neq b_0^0$ (two-tailed).

Testing the constant term of the regression. The $t$ test statistic is
$$t = \frac{\hat{b}_0 - b_0^0}{s \sqrt{\sum_{i=1}^{n} x_i^2} \,/\, (\sqrt{n(n-1)}\, s_x)},$$
where $s^2 = \operatorname{var}(\hat{\varepsilon})$ and $s_x^2$ is the sample variance of the variable $x$. Under the null hypothesis $H_0$, the test statistic follows Student's $t$-distribution with $n-2$ degrees of freedom, and its expected value is $E[t] = 0$. Large absolute values of the test statistic suggest that the null hypothesis $H_0$ does not hold. The null hypothesis $H_0$ is rejected if the $p$-value is small enough.

Confidence interval for the constant term. Under the normality assumption, a $(1-\alpha) \cdot 100\%$ confidence interval for the constant term of the regression can be given as
$$\left(\hat{b}_0 - t_{n-2,\alpha/2}\, \frac{s \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n(n-1)}\, s_x},\ \hat{b}_0 + t_{n-2,\alpha/2}\, \frac{s \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n(n-1)}\, s_x}\right),$$
where $s^2 = \operatorname{var}(\hat{\varepsilon})$, $s_x^2$ is the sample variance of the variable $x$, $t_{n-2}$ denotes the Student's $t$-distribution with $n-2$ degrees of freedom, and $t_{n-2,\alpha/2}$ is the $(1-\alpha/2) \cdot 100$ percentile of the $t(n-2)$-distribution.
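
The test statistic and the interval share the same standard error; a continuation of the earlier sketch, with $b_0^0 = 0$ as an illustrative null value:

```python
# Test and 95% confidence interval for the constant term, reusing x,
# n, s, s_x and b0_hat from the sketches above.
se_b0 = s * np.sqrt(np.sum(x**2)) / (np.sqrt(n * (n - 1)) * s_x)

t_stat = (b0_hat - 0.0) / se_b0                  # H0: b0 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

t_crit = stats.t.ppf(0.975, df=n - 2)
ci_b0 = (b0_hat - t_crit * se_b0, b0_hat + t_crit * se_b0)
```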


Predicting the values of the variable $y$. A prediction $\tilde{y}$ for the value of the variable $y$, when $x$ has the value $x$, can be given as
$$\tilde{y}_x = \hat{b}_0 + \hat{b}_1 x.$$
The more observations there are, the smaller the variance $\sigma^2$ is, and the closer $x$ is to the sample mean of $x$, the better (more accurate) the prediction is. Note that $x$ should be within the range of the observed values of the variable $x$.

Predicting the values of the variable $y$. Under the normality assumption, a $(1-\alpha) \cdot 100\%$ confidence interval for the value of $y$, when $x$ has the value $x$, can be given as
$$\hat{b}_0 + \hat{b}_1 x \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1)\, s_x^2}},$$
where $s^2 = \operatorname{var}(\hat{\varepsilon})$, $s_x^2$ is the sample variance of the variable $x$, $t_{n-2}$ denotes the Student's $t$-distribution with $n-2$ degrees of freedom, and $t_{n-2,\alpha/2}$ is the $(1-\alpha/2) \cdot 100$ percentile of the $t(n-2)$-distribution.
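
A sketch of this interval at a hypothetical point x_new, reusing the earlier quantities:

```python
# 95% prediction interval for a new value of y at x = x_new; x_new is
# an arbitrary point chosen inside the observed range of x.
x_new = 3.5
y_pred = b0_hat + b1_hat * x_new

t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = t_crit * s * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / ((n - 1) * s_x**2))
pred_interval = (y_pred - half_width, y_pred + half_width)
```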

Predicting the expected value of the variable $y$. A prediction $\hat{\mu}_y$ for the expected value $E[y]$, when $x$ has the value $x$, can be given as
$$\hat{\mu}_{y|x} = \hat{b}_0 + \hat{b}_1 x.$$

Note that $\tilde{y}_x$ estimates the value of a random variable, whereas $\hat{\mu}_{y|x}$ estimates an expected value (a constant). The estimate $\tilde{y}_x$ estimates the values of the variable $y$ at the individual level, when $x$ has the value $x$. The estimate $\hat{\mu}_{y|x}$ estimates the mean value of the variable $y$, when $x$ has the value $x$. Even though the estimates are the same, the corresponding confidence intervals are not! The confidence interval for the value of $y$ is wider: it is easier to predict average behaviour than to predict individual values.

Predicting the expected value of the variable $y$. Under the normality assumption, a $(1-\alpha) \cdot 100\%$ confidence interval for $E[y]$, when $x$ has the value $x$, can be given as
$$\hat{b}_0 + \hat{b}_1 x \pm t_{n-2,\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1)\, s_x^2}},$$
where $s^2 = \operatorname{var}(\hat{\varepsilon})$, $s_x^2$ is the sample variance of the variable $x$, $t_{n-2}$ denotes the Student's $t$-distribution with $n-2$ degrees of freedom, and $t_{n-2,\alpha/2}$ is the $(1-\alpha/2) \cdot 100$ percentile of the $t(n-2)$-distribution.
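
Compared with the prediction-interval sketch above, the only change is dropping the leading 1 inside the square root:

```python
# 95% confidence interval for E[y] at x = x_new: the "1 +" term of the
# prediction interval is dropped, so this interval is narrower.
half_width_mean = t_crit * s * np.sqrt(1/n + (x_new - x.mean())**2 / ((n - 1) * s_x**2))
mean_interval = (y_pred - half_width_mean, y_pred + half_width_mean)
```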

Numerical example. The numerical example from the Lecture 7 slides continues. A summer trainee is asked to predict the sales of Jack's cookies when 5500 units of Charles' cookies are sold. She should also calculate a 95% confidence interval for the sales.

Using the regression coefficients $\hat{b}_0 = 10723.87$ and $\hat{b}_1 = -0.9386$, the prediction of the sales of Jack's cookies, on the condition that $c = 5500$ units of Charles' cookies are sold, can be given as
$$\tilde{j}_c = \hat{b}_0 + \hat{b}_1 c = 10723.87 - 0.9386 \cdot 5500 = 5561.57.$$
The corresponding confidence interval can be given as
$$\hat{b}_0 + \hat{b}_1 c \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(c - \bar{c})^2}{(n-1)\, s_c^2}},$$
where $t_{n-2,\alpha/2} = t_{10,0.025} = 2.228$, $\bar{c} = 5567.833$, $s_c = 302.95$ and $s^2 = 11948.42$.

The 95% confidence interval is
$$\hat{b}_0 + \hat{b}_1 c \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(c - \bar{c})^2}{(n-1)\, s_c^2}} = 5561.57 \pm 2.228 \cdot \sqrt{11948.42} \cdot \sqrt{1 + \frac{1}{12} + \frac{(5500 - 5567.833)^2}{11 \cdot 302.95^2}} = (5308.093, 5815.047).$$
If 5500 units of Charles' cookies are sold, then the prediction for the sales of Jack's cookies is 5562 units, and a 95% confidence interval for the prediction is (5308, 5816). The prediction seems reasonable. What about the confidence interval? Is there anything suspicious in the calculated confidence interval?
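
A quick numerical check of the slide's figures, self-contained and using only the quoted values:

```python
import numpy as np
from scipy import stats

# Checking the cookie example against the values quoted on the slides.
n = 12
b0, b1 = 10723.87, -0.9386
c_new, c_bar, s_c = 5500.0, 5567.833, 302.95
s = np.sqrt(11948.42)                            # s^2 = 11948.42

pred = b0 + b1 * c_new                           # 5561.57
t_crit = stats.t.ppf(0.975, df=n - 2)            # about 2.228
half = t_crit * s * np.sqrt(1 + 1/n + (c_new - c_bar)**2 / ((n - 1) * s_c**2))
print(pred - half, pred + half)                  # roughly (5308, 5815)
```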


Bootstrap confidence intervals for the regression coefficients. Consider the estimated residuals $\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n$ and the fitted values $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ of the regression model. Collect a new sample $\check{\varepsilon}_1, \check{\varepsilon}_2, \ldots, \check{\varepsilon}_n$ by picking $n$ data points randomly with replacement from $\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n$. Form a bootstrap sample
$$(x_1, \check{y}_1), (x_2, \check{y}_2), \ldots, (x_n, \check{y}_n), \quad \text{where } \check{y}_i = \hat{y}_i + \check{\varepsilon}_i.$$
Calculate estimates for the regression coefficients $b_0$ and $b_1$ from the bootstrap sample. Repeat this several times, for example 999 times. Then order all the estimates (the original ones and the 999 bootstrap estimates) from the smallest to the largest. An estimate for the 90% confidence interval $(l, u)$ is obtained by choosing the 50th ordered estimate as $l$ and the 951st as $u$. An estimate for the 95% confidence interval $(l, u)$ is obtained by choosing the 25th ordered estimate as $l$ and the 976th as $u$.
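
A sketch of this residual bootstrap, reusing x, b0_hat, b1_hat and residuals from the first sketch; the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)                   # seeded for reproducibility

# Residual bootstrap for the regression coefficients.
fitted = b0_hat + b1_hat * x
boot_b0, boot_b1 = [b0_hat], [b1_hat]            # include the original estimates
for _ in range(999):
    eps_check = rng.choice(residuals, size=len(x), replace=True)
    y_check = fitted + eps_check
    b1_star = np.cov(x, y_check, ddof=1)[0, 1] / np.var(x, ddof=1)
    boot_b1.append(b1_star)
    boot_b0.append(y_check.mean() - b1_star * x.mean())

b1_sorted = np.sort(boot_b1)
ci_90 = (b1_sorted[49], b1_sorted[950])          # 50th and 951st of the 1000 ordered values
ci_95 = (b1_sorted[24], b1_sorted[975])          # 25th and 976th
```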

Prediction, bootstrap confidence intervals. A prediction $\hat{\mu}_y$ for the expected value $E[y]$, when $x$ has the value $x$, was given as $\hat{\mu}_{y|x} = \hat{b}_0 + \hat{b}_1 x$. Consider bootstrap estimates for the regression coefficients $b_0$ and $b_1$. One can calculate bootstrap confidence intervals for $\hat{\mu}_{y|x}$ by replacing $\hat{b}_0$ and $\hat{b}_1$ with bootstrap estimates in the formula above. This is then repeated, for example, 999 times. After that, all the 1000 predictions are ordered and the bootstrap confidence intervals are obtained.

Coefficient of determination, bootstrap confidence intervals. The bootstrap samples $(x_1, \check{y}_1), (x_2, \check{y}_2), \ldots, (x_n, \check{y}_n)$ can also be used for calculating bootstrap confidence intervals for the coefficient of determination of the model. The coefficient of determination is estimated (separately) from every bootstrap sample. One can use, for example, 999 or 9999 bootstrap samples. After that, all the 1000 or 10000 estimates are ordered and the bootstrap confidence intervals are obtained.
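
The same resampling scheme, with the coefficient of determination computed from each bootstrap sample; this continues the residual-bootstrap sketch above:

```python
# Bootstrap confidence interval for the coefficient of determination,
# reusing x, y, fitted, residuals and rng from the sketches above.
def r_squared(x_vals, y_vals):
    b1 = np.cov(x_vals, y_vals, ddof=1)[0, 1] / np.var(x_vals, ddof=1)
    b0 = y_vals.mean() - b1 * x_vals.mean()
    resid = y_vals - (b0 + b1 * x_vals)
    return 1.0 - np.sum(resid**2) / np.sum((y_vals - y_vals.mean())**2)

r2_estimates = [r_squared(x, y)]                 # the original estimate
for _ in range(999):
    y_check = fitted + rng.choice(residuals, size=len(x), replace=True)
    r2_estimates.append(r_squared(x, y_check))

r2_sorted = np.sort(r2_estimates)
ci_95_r2 = (r2_sorted[24], r2_sorted[975])       # 25th and 976th of 1000
```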

Bootstrap, alternative approach. Instead of bootstrapping from the estimated residuals, one may take bootstrap samples directly from the original observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Parameter estimates are then calculated from the bootstrap samples, the estimates are ordered, and the bootstrap confidence intervals are obtained.
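
A sketch of one such resampling step (the pairs, or case, bootstrap), continuing the code above:

```python
# One resampling step of the pairs bootstrap: the observations
# (x_i, y_i) are resampled directly instead of the residuals.
idx = rng.integers(0, len(x), size=len(x))       # indices drawn with replacement
x_check, y_check = x[idx], y[idx]
b1_star = np.cov(x_check, y_check, ddof=1)[0, 1] / np.var(x_check, ddof=1)
b0_star = y_check.mean() - b1_star * x_check.mean()
# Repeat, order the estimates, and read off the percentile intervals
# exactly as in the residual bootstrap above.
```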


Non-linear relationships. What should be done if the relationship between two variables is non-linear?

- Try to linearise the variables.
- Try a piecewise analysis.
- Try to fit some other shape, for example a parabola (see the sketch below).
- If the dependent variable is binary, use logistic regression.
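
For the parabola case, a minimal sketch using numpy's polynomial fit on the toy data from the first sketch:

```python
# Fitting a parabola y = a2*x^2 + a1*x + a0 to the toy data from the
# first sketch, via numpy's polynomial least squares fit.
coeffs = np.polyfit(x, y, deg=2)                 # returns [a2, a1, a0]
y_quad = np.polyval(coeffs, x)                   # fitted values
```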


What should one do if...

- the residuals are not normally distributed?
- the residuals are not independent of the values of the variable $x$?
- the residuals are heteroscedastic?
- the residuals are correlated?

J. S. Milton, J. C. Arnold: Introduction to Probability and Statistics, McGraw-Hill, 1995.
R. V. Hogg, J. W. McKean, A. T. Craig: Introduction to Mathematical Statistics, Pearson Education, 2005.
A. C. Davison, D. V. Hinkley: Bootstrap Methods and their Applications, Cambridge University Press, 2009.
Pertti Laininen: Todennäköisyys ja sen tilastollinen soveltaminen, Otatieto, 1998, no. 586.
Ilkka Mellin: Tilastolliset menetelmät, http://math.aalto.fi/opetus/sovtoda/materiaali.html.