ECO375 Tutorial 4 Introduction to Statistical Inference
1 ECO375 Tutorial 4: Introduction to Statistical Inference
Matt Tudball
University of Toronto Mississauga
October 19, 2017
Matt Tudball (University of Toronto) ECO375H5 October 19, 2017 1 / 26
2 Statistical Inference
First of all, what is statistical inference? Recall that our objective in doing statistics is to use data to estimate the parameters of the underlying distribution which is producing that data. Since we may only work with a random sample of a fixed size n, there is always going to be some variation in our estimators. As we showed when we did Monte Carlo simulations, estimates of the mean can differ greatly across samples. The fundamental question of statistical inference is: given this variation, how can we make statements about the relationship between our estimators and the true population parameters we are trying to estimate?
Reference: Chapter 4 of Wooldridge
3 Distribution of the Mean
As a motivating example, let's consider a set of random variables {X_1, X_2, ..., X_n} which are independent and identically distributed (IID) with X_i ~ N(µ, σ²), denoting the normal distribution with mean µ and variance σ². Given the n data points drawn from this distribution, suppose we want to estimate the mean of the distribution. Naturally, the estimator for the mean is simply µ̂ = (1/n) Σ_{i=1}^n X_i. Let's now derive the variance of the estimator:
Var(µ̂) = Var((1/n) Σ_{i=1}^n X_i)
= (1/n²) Var(Σ_{i=1}^n X_i)
= (1/n²) Σ_{i=1}^n Var(X_i)   (by independence of the X_i)
= (1/n²) Σ_{i=1}^n σ²   (by identical distribution of the X_i)
= (1/n²) nσ²
= σ²/n
4 Monte Carlo Simulation of the Mean
It is possible to show, by the properties of the normal distribution, that since µ̂ has mean µ and variance σ²/n, and since all of the X_i are normally distributed, µ̂ ~ N(µ, σ²/n). For example, let's suppose that X_i ~ N(5, 2). I went ahead and ran a Monte Carlo simulation in which I produced 1000 samples of size 100 from this distribution and calculated their means. As expected, the average µ̂ was very close to our µ of 5, and the variance of µ̂ was roughly equal to σ²/n = 2/100 = 0.02.
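The simulation described above can be sketched in Python rather than Stata; NumPy and the particular seed are my choices, not part of the original tutorial:

```python
import numpy as np

# 1000 samples of size 100 from N(5, 2); note the scale argument is the
# standard deviation, so a variance of 2 means sqrt(2)
rng = np.random.default_rng(42)
mu, sigma2, n, reps = 5.0, 2.0, 100, 1000

means = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

print(means.mean())        # close to mu = 5
print(means.var(ddof=1))   # close to sigma2 / n = 0.02
```

The variance of the 1000 sample means lands near σ²/n, matching the derivation on the previous slide.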
5 Confidence Intervals
Most of the time, however, we will only have a single µ̂ coming from a single sample of size n. In this more common case, how can we use the known distribution of µ̂ to help us make claims about the relationship between µ̂ and µ? A very useful thing to note is that, since µ̂ ~ N(µ, σ²/n), by the properties of the normal distribution (µ̂ − µ)/(σ/√n) ~ N(0, 1). This is called the z-statistic. What we have done here is subtracted the mean and divided by the square root of the variance. This is known as normalising the distribution. It is useful because this expression has a known distribution that no longer depends on unknown parameters.
6 Confidence Interval with Known σ²
Using this fact, we can now begin to answer questions about the probability that the true mean falls within some interval. Since we know how (µ̂ − µ)/(σ/√n) is distributed, we can very easily find an interval [−z_{α/2}, z_{α/2}] such that the statistic falls inside that interval 100(1 − α)% of the time, where we call α the significance level. z_{α/2} is called the critical value, and it is subscripted by α/2 to emphasise that its value depends on the significance level we specify. By our definitions so far,
1 − α = P(−z_{α/2} ≤ (µ̂ − µ)/(σ/√n) ≤ z_{α/2})
= P(−z_{α/2} σ/√n ≤ µ̂ − µ ≤ z_{α/2} σ/√n)
= P(−z_{α/2} σ/√n ≤ µ − µ̂ ≤ z_{α/2} σ/√n)
= P(µ̂ − z_{α/2} σ/√n ≤ µ ≤ µ̂ + z_{α/2} σ/√n)
7 Confidence Interval with Known σ²
Assuming that we know σ, this interval [µ̂ − z_{α/2} σ/√n, µ̂ + z_{α/2} σ/√n] is calculable from the data. This is what's called a confidence interval. It's very important to keep in mind that, for a single estimate µ̂, the true value µ is either inside or outside of the confidence interval associated with µ̂. It is not correct to interpret the confidence interval as an interval such that µ is inside it with probability 1 − α. The correct way to interpret it is that, if we took a hypothetical infinite number of samples of size n, then µ would fall inside their confidence intervals 100(1 − α)% of the time.
8 Finding Critical Values
You might be wondering how we find the critical value z_{α/2}. As a reminder, z_{α/2} is a point on the standard normal distribution N(0, 1) such that, for Z drawn from that distribution, P(−z_{α/2} ≤ Z ≤ z_{α/2}) = P(Z ≤ z_{α/2}) − P(Z ≤ −z_{α/2}) = 1 − α. We can find these values using the standard normal table, which shows values along the cumulative distribution function of the standard normal. Using the table on the next slide, you can verify that for 1 − α = 95%, z_{α/2} = z_{0.025} = 1.96. So our 95% confidence interval for the mean of the normal distribution is [µ̂ − 1.96 σ/√n, µ̂ + 1.96 σ/√n].
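Instead of a table lookup, the critical value and interval can be computed with Python's standard library, which exposes the inverse CDF of the normal distribution. The numbers plugged in here (µ̂ = 5.1, σ² = 2, n = 20) are the ones from Exercise 1 below:

```python
from math import sqrt
from statistics import NormalDist

mu_hat, sigma2, n, alpha = 5.1, 2.0, 20, 0.05

# z_{alpha/2}: the point with probability 1 - alpha/2 below it on N(0, 1)
z = NormalDist().inv_cdf(1 - alpha / 2)   # roughly 1.96

half_width = z * sqrt(sigma2) / sqrt(n)
ci = (mu_hat - half_width, mu_hat + half_width)
print(ci)   # roughly (4.48, 5.72)
```

Swapping alpha = 0.10 gives the 90% critical value (roughly 1.645) and a correspondingly narrower interval.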
9 Finding Critical Values
[standard normal table omitted]
10 In-Class Exercise 1
Consider the following code for running a Monte Carlo simulation to calculate the proportion of the time that the 95% confidence interval for a sample mean contains the true mean.
postfile buffer inconf using "$path/conf.dta", replace
forvalues i = 1/1000 {
    qui set seed `i'
    qui drop _all
    qui set obs 20
    qui generate x = rnormal(5,sqrt(2))
    qui mean x
    qui local lower = _b[x] - 1.96*sqrt(2)/sqrt(20)
    qui local upper = _b[x] + 1.96*sqrt(2)/sqrt(20)
    qui local inconf = (`lower' <= 5) & (5 <= `upper')
    post buffer (`inconf')
}
postclose buffer
use "$path/conf.dta", clear
11 In-Class Exercise 1
Suppose you have data coming from a normal distribution with known variance σ² = 2. You have 20 observations in your sample and you calculate a sample mean of 5.1. What is the 95% confidence interval for this dataset?
Run the code in the previous slide in Stata. You will end up with a dataset containing a variable which takes values of 0 or 1. It will be 1 if the true mean fell inside of that sample mean's confidence interval and 0 otherwise. Calculate the proportion of the time that the confidence interval contained the true mean.
Suppose we now wanted to calculate a 90% confidence interval for the sample mean. Re-run this code using critical values for the 90% confidence interval. What proportion of the time does the 90% confidence interval contain the true mean? Is this to be expected?
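For readers without Stata, the same coverage experiment can be sketched in Python; the variable names and seed are mine:

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 5.0, 2.0, 20, 1000
half = 1.96 * sqrt(sigma2) / sqrt(n)   # known-variance 95% half-width

# Count how often the interval around the sample mean catches the true mean
hits = 0
for _ in range(reps):
    x_bar = rng.normal(mu, sqrt(sigma2), n).mean()
    if x_bar - half <= mu <= x_bar + half:
        hits += 1

print(hits / reps)   # should be close to 0.95
```

Replacing 1.96 with the 90% critical value should push the proportion down toward 0.90.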
12 Confidence Interval with Unknown σ²
Up until now we have assumed that σ² is known. However, this will rarely be the case in practice. A good estimator for σ² is just the Bessel-corrected sample variance, σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − µ̂)². We can plug this back into our original statistic and obtain the t-statistic t = (µ̂ − µ)/(σ̂/√n). Since σ̂² is now an estimator for σ² and therefore has sampling variation associated with it, this statistic will no longer be distributed N(0, 1). It will instead follow a t-distribution on n − 1 degrees of freedom:
t = (µ̂ − µ)/(σ̂/√n) ~ t_{n−1}
13 Confidence Interval with Unknown σ²
The consequence of this is that, when we pick the critical values for our confidence intervals, we must now look at the table for the t-distribution. Therefore the form of our 100(1 − α)% confidence interval with unknown σ² is
[µ̂ − t_{n−1,α/2} σ̂/√n, µ̂ + t_{n−1,α/2} σ̂/√n]
where t_{n−1,α/2} is a critical value from the t-distribution on n − 1 degrees of freedom. You can calculate these using the table on the next slide.
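One way to see where the t critical value comes from, without any table, is to simulate many t-statistics under the model and read off the empirical quantile. This simulation approach is my illustration, not part of the tutorial:

```python
import numpy as np

# Simulate many t-statistics t = (mean - mu) / (s / sqrt(n)) with n = 20
rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 1.0, 20, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
t_stats = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

# Empirical two-sided 5% critical value; the t-table gives t_{19, 0.025} = 2.093
crit = np.quantile(np.abs(t_stats), 0.95)
print(crit)   # about 2.09, noticeably larger than z_{0.025} = 1.96
```

The extra width relative to 1.96 is the price paid for estimating σ² from only 20 observations.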
14 Confidence Interval with Unknown σ²
[t-distribution table omitted]
15 In-Class Exercise 2
Consider the following code for running a Monte Carlo simulation to calculate the proportion of the time that the 95% confidence interval for a sample mean with unknown variance contains the true mean.
postfile buffer inconf using "$path/conf.dta", replace
forvalues i = 1/1000 {
    qui set seed `i'
    qui drop _all
    qui set obs 20
    qui generate x = rnormal(5,sqrt(2))
    qui mean x
    qui local lower = _b[x] - 1.96*_se[x]
    qui local upper = _b[x] + 1.96*_se[x]
    qui local inconf = (`lower' <= 5) & (5 <= `upper')
    post buffer (`inconf')
}
postclose buffer
use "$path/conf.dta", clear
16 In-Class Exercise 2
Run this code in Stata and calculate the proportion as before. What do you notice about the proportion in this case?
Notice that I use critical values from the z-table in the above code. I should be using critical values from the t-table. Input the correct critical values and re-run the code. What do you notice about the proportion now?
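A Python sketch of the point of this exercise: with σ² estimated from only 20 observations, the z critical value 1.96 under-covers, while the t-table value t_{19, 0.025} = 2.093 restores roughly 95% coverage. The simulation setup is my own:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, np.sqrt(2.0), 20, 20_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)   # estimated standard error

# Coverage using the z critical value versus t_{19, 0.025} from the t-table
cover_z = np.mean((xbar - 1.96 * se <= mu) & (mu <= xbar + 1.96 * se))
cover_t = np.mean((xbar - 2.093 * se <= mu) & (mu <= xbar + 2.093 * se))
print(cover_z, cover_t)   # roughly 0.935 versus 0.95
```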
17 Hypothesis Testing
Another common application of statistical inference is hypothesis testing. Hypothesis testing allows us to ask questions of the form: assume that the true value of µ is µ_0; under this assumption, what is the probability that we would have observed the t-statistic that we calculated? If that probability is low, then it seems unlikely that µ_0 is a good candidate for the true value µ. In the language of hypothesis testing, the hypothesis H_0: µ = µ_0 is known as the null hypothesis and the hypothesis we are testing against, H_1: µ ≠ µ_0, is known as the alternate hypothesis. A test of this form is called a two-sided hypothesis test. This is in contrast to the tests H_0: µ ≥ µ_0, H_1: µ < µ_0 and H_0: µ ≤ µ_0, H_1: µ > µ_0, which are known as one-sided hypothesis tests.
18 Hypothesis Testing: Two-Sided Test
Let's return again to our t-statistic t = (µ̂ − µ)/(σ̂/√n). Under the null hypothesis of the two-sided test, H_0: µ = µ_0, we calculate t = (µ̂ − µ_0)/(σ̂/√n). This is the t-statistic under the assumption that µ = µ_0. If µ̂ is very different from µ_0 (after normalising with respect to the standard deviation), then our t-statistic is going to be fairly large in absolute value. Picking a significance level α as before and looking at the t-table, we can figure out a critical value such that the probability, under the null hypothesis, of observing a t-statistic larger in absolute value than that critical value is α. This critical value is t_{n−1,α/2} for the two-sided test. We reject the null hypothesis if the absolute value of our t-statistic exceeds it: |t| > t_{n−1,α/2}.
19 Hypothesis Testing: One-Sided Test
The approach is going to be similar for the one-sided test. For H_0: µ ≤ µ_0, H_1: µ > µ_0, we calculate a critical value t_{n−1,α} and we reject the null hypothesis if t > t_{n−1,α}. For H_0: µ ≥ µ_0, H_1: µ < µ_0, we reject the null hypothesis if t < −t_{n−1,α}. It is important to keep in mind that we either reject the null hypothesis or fail to reject the null hypothesis. With hypothesis testing we are only able to find evidence against the null hypothesis. A t-statistic which induces us to reject the null hypothesis does not necessarily provide evidence in favour of accepting the alternate hypothesis.
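The two-sided and one-sided decision rules above can be sketched for a hypothetical sample; every number here (the sample summary and the df = 19 critical values, read off a standard t-table) is for illustration only:

```python
from math import sqrt

# Hypothetical sample summary: made-up numbers for illustration
n, mu0 = 20, 5.0
mu_hat, sigma_hat = 5.9, 1.6

t = (mu_hat - mu0) / (sigma_hat / sqrt(n))
print(round(t, 3))   # 2.516

# Critical values for df = n - 1 = 19 at alpha = 0.05 (from a t-table)
two_sided = abs(t) > 2.093    # H1: mu != mu0  -> reject
upper_tail = t > 1.729        # H1: mu > mu0   -> reject
lower_tail = t < -1.729       # H1: mu < mu0   -> fail to reject
print(two_sided, upper_tail, lower_tail)
```

Note that the one-sided test at the same α uses a smaller critical value, since all of α sits in one tail.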
20 Hypothesis Testing: t-test
[table omitted]
21 Inference on Regression Estimates
Up until now we have been performing inference on a simple sample mean. We can also, however, perform inference on more complicated estimators such as the OLS estimators. Recall assumption MLR.6: the error u_i is independent of the explanatory variables x_{i1}, x_{i2}, ..., x_{ik} and is normally distributed with mean 0 and variance σ², i.e. u_i ~ N(0, σ²). This is the key assumption for allowing us to construct confidence intervals and perform hypothesis tests with the OLS estimators. Consider the simple regression model y_i = β_0 + β_1 x_i + u_i. We know the estimator β̂_1 can be written as
β̂_1 = Σ_{i=1}^n (x_i − x̄) y_i / Σ_{i=1}^n (x_i − x̄)²
= Σ_{i=1}^n (x_i − x̄)(β_0 + β_1 x_i + u_i) / Σ_{i=1}^n (x_i − x̄)²
= β_1 + Σ_{i=1}^n (x_i − x̄) u_i / Σ_{i=1}^n (x_i − x̄)²
22 Inference on Regression Estimates
Since we know that u_i ~ N(0, σ²), it follows from the properties of the normal distribution that
β̂_1 ~ N(β_1, Var(Σ_{i=1}^n (x_i − x̄) u_i / Σ_{i=1}^n (x_i − x̄)²))
where Var(Σ_{i=1}^n (x_i − x̄) u_i / Σ_{i=1}^n (x_i − x̄)²) = σ² / Σ_{i=1}^n (x_i − x̄)² from MLR.5. Unfortunately the variance of β̂_1 is a function of the unknown σ². Similar to the case of the mean, we will simply replace σ² with an estimator. A natural approach is to take a degrees-of-freedom-corrected average of the squared sample residuals û_i: σ̂² = (1/(n − k − 1)) Σ_{i=1}^n û_i², which here equals (1/(n − 2)) Σ_{i=1}^n û_i² since k = 1. Our t-statistic is going to be
t = (β̂_1 − β_1)/ŝe(β̂_1) ~ t_{n−2}
which is the t-distribution on n − 2 degrees of freedom.
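The formulas on this slide can be checked by hand in Python: simulate a simple regression, compute β̂_1, its standard error, and the t-statistic directly from the expressions above. The true coefficients, sample size, and seed are illustrative choices of mine:

```python
import numpy as np

# Simulate y = b0 + b1*x + u with normal errors (MLR.6 holds by construction)
rng = np.random.default_rng(3)
n, b0, b1 = 200, 1.0, 0.5
x = rng.normal(0, 1, n)
u = rng.normal(0, 1, n)
y = b0 + b1 * x + u

xd = x - x.mean()
b1_hat = (xd * y).sum() / (xd ** 2).sum()     # slope formula from the slides
b0_hat = y.mean() - b1_hat * x.mean()

resid = y - b0_hat - b1_hat * x
sigma2_hat = (resid ** 2).sum() / (n - 2)     # df correction with k = 1
se_b1 = np.sqrt(sigma2_hat / (xd ** 2).sum())

t = (b1_hat - b1) / se_b1   # ~ t_{n-2} when centred at the true b1
print(b1_hat, se_b1, t)
```

Centring at the true β_1 as here, |t| should rarely exceed the t_{198} critical values; centring at a hypothesised value γ instead gives the test statistic of the next slides.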
23 Inference on Regression Estimates: Confidence Interval
This generalises very easily to multiple regression with k > 1. The t-statistic on some β̂_j in multiple regression will be
t = (β̂_j − β_j)/ŝe(β̂_j) ~ t_{n−k−1}
which is the t-distribution on n − k − 1 degrees of freedom. We can therefore very easily apply the formulas we derived previously for confidence intervals and hypothesis tests to inference on regression estimates. The 100(1 − α)% confidence interval for β_j is going to be
[β̂_j − t_{n−k−1,α/2} ŝe(β̂_j), β̂_j + t_{n−k−1,α/2} ŝe(β̂_j)]
where t_{n−k−1,α/2} is the critical value from the t-distribution on n − k − 1 degrees of freedom.
24 Inference on Regression Estimates: Hypothesis Testing
Hypothesis testing also generalises very easily to the multiple regression model. In doing t-tests, we state a null hypothesis for a single coefficient β_j. The two-sided hypothesis test is of the form H_0: β_j = γ against the alternate hypothesis H_1: β_j ≠ γ. As before, we simply calculate our t-statistic t = (β̂_j − γ)/ŝe(β̂_j) and find the critical value t_{n−k−1,α/2} for a given significance level α and degrees of freedom n − k − 1. If |t| > t_{n−k−1,α/2} we reject the null hypothesis. For the one-sided hypothesis test of the form H_0: β_j ≤ γ, H_1: β_j > γ, we calculate t_{n−k−1,α} and reject the null if t > t_{n−k−1,α}. For the one-sided hypothesis test of the form H_0: β_j ≥ γ, H_1: β_j < γ, we reject the null if t < −t_{n−k−1,α}.
25 Hypothesis Testing: F-test
What if we want to test hypotheses over multiple estimates? For example, H_0: β_1 = β_2 = β_3 = 0. For hypotheses of this form we use an F-test. The F-statistic takes the form
F = ((SSR_r − SSR_ur)/q) / (SSR_ur/(n − k − 1)) = ((R²_ur − R²_r)/q) / ((1 − R²_ur)/(n − k − 1)) ~ F_{q, n−k−1}
SSR_r denotes the sum of squared residuals of the restricted model, in which we assume the null hypothesis is true; R²_r is the R² from this restricted model. SSR_ur denotes the sum of squared residuals of the unrestricted model, in which we make no assumptions about the coefficients in the null hypothesis; R²_ur is the R² from this unrestricted model. q is the number of restrictions being tested; in the example above, it is 3. n − k − 1 is the degrees of freedom of the unrestricted model.
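The SSR form of the F-statistic for the example H_0: β_1 = β_2 = β_3 = 0 can be sketched as follows; the simulated data and true coefficients are my own illustrative choices:

```python
import numpy as np

# y = 1 + 0.6*x1 + 0*x2 - 0.5*x3 + u, so the joint null should be rejected
rng = np.random.default_rng(4)
n, q, k = 100, 3, 3
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([0.6, 0.0, -0.5]) + rng.normal(size=n)

def ssr(design, y):
    """Sum of squared residuals from an OLS fit of y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return (resid ** 2).sum()

ones = np.ones((n, 1))
ssr_ur = ssr(np.hstack([ones, X]), y)   # unrestricted: intercept + x1..x3
ssr_r = ssr(ones, y)                    # restricted: intercept only (H0 imposed)

F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
print(F)   # compare with the F_{3, 96} critical value from an F-table
```

Because the restricted model is nested in the unrestricted one, SSR_r ≥ SSR_ur always, so F is non-negative by construction.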
26 Hypothesis Testing: F-test
[table omitted]
More informationCorrelation and the Analysis of Variance Approach to Simple Linear Regression
Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation
More informationThe Simple Linear Regression Model
The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate
More informationChapter 9 Inferences from Two Samples
Chapter 9 Inferences from Two Samples 9-1 Review and Preview 9-2 Two Proportions 9-3 Two Means: Independent Samples 9-4 Two Dependent Samples (Matched Pairs) 9-5 Two Variances or Standard Deviations Review
More informationMultiple Regression Analysis: Inference ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD
Multiple Regression Analysis: Inference ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD Introduction When you perform statistical inference, you are primarily doing one of two things: Estimating the boundaries
More informationSpace Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses
Space Telescope Science Institute statistics mini-course October 2011 Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses James L Rosenberger Acknowledgements: Donald Richards, William
More informationCorrelation and Regression
Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class
More informationIntroduction to Statistical Inference Lecture 8: Linear regression, Tests and confidence intervals
Introduction to Statistical Inference Lecture 8: Linear regression, Tests and confidence la Non-ar Contents Non-ar Non-ar Non-ar Consider n observations (pairs) (x 1, y 1 ), (x 2, y 2 ),..., (x n, y n
More informationDrawing Inferences from Statistics Based on Multiyear Asset Returns
Drawing Inferences from Statistics Based on Multiyear Asset Returns Matthew Richardson ames H. Stock FE 1989 1 Motivation Fama and French (1988, Poterba and Summer (1988 document significant negative correlations
More informationHomework Set 2, ECO 311, Fall 2014
Homework Set 2, ECO 311, Fall 2014 Due Date: At the beginning of class on October 21, 2014 Instruction: There are twelve questions. Each question is worth 2 points. You need to submit the answers of only
More informationMotivation for multiple regression
Motivation for multiple regression 1. Simple regression puts all factors other than X in u, and treats them as unobserved. Effectively the simple regression does not account for other factors. 2. The slope
More informationStatistics and econometrics
1 / 36 Slides for the course Statistics and econometrics Part 10: Asymptotic hypothesis testing European University Institute Andrea Ichino September 8, 2014 2 / 36 Outline Why do we need large sample
More informationECON The Simple Regression Model
ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In
More informationLecture 5: Hypothesis testing with the classical linear model
Lecture 5: Hypothesis testing with the classical linear model Assumption MLR6: Normality MLR6 is not one of the Gauss-Markov assumptions. It s not necessary to assume the error is normally distributed
More informationWooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics
Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).
More informationContest Quiz 3. Question Sheet. In this quiz we will review concepts of linear regression covered in lecture 2.
Updated: November 17, 2011 Lecturer: Thilo Klein Contact: tk375@cam.ac.uk Contest Quiz 3 Question Sheet In this quiz we will review concepts of linear regression covered in lecture 2. NOTE: Please round
More informationregression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist
regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist sales $ (y - dependent variable) advertising $ (x - independent variable)
More informationECON 312 FINAL PROJECT
ECON 312 FINAL PROJECT JACOB MENICK 1. Introduction When doing statistics with cross-sectional data, it is common to encounter heteroskedasticity. The cross-sectional econometrician can detect heteroskedasticity
More informationUQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables
UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1
More informationWeek 2: Pooling Cross Section across Time (Wooldridge Chapter 13)
Week 2: Pooling Cross Section across Time (Wooldridge Chapter 13) Tsun-Feng Chiang* *School of Economics, Henan University, Kaifeng, China March 3, 2014 1 / 30 Pooling Cross Sections across Time Pooled
More informationSTAT 135 Lab 5 Bootstrapping and Hypothesis Testing
STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,
More informationHypothesis testing. Data to decisions
Hypothesis testing Data to decisions The idea Null hypothesis: H 0 : the DGP/population has property P Under the null, a sample statistic has a known distribution If, under that that distribution, the
More informationMultiple Linear Regression
Multiple Linear Regression ST 430/514 Recall: a regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates).
More informationEconometrics - 30C00200
Econometrics - 30C00200 Lecture 11: Heteroskedasticity Antti Saastamoinen VATT Institute for Economic Research Fall 2015 30C00200 Lecture 11: Heteroskedasticity 12.10.2015 Aalto University School of Business
More informationThe Standard Linear Model: Hypothesis Testing
Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 25: The Standard Linear Model: Hypothesis Testing Relevant textbook passages: Larsen Marx [4]:
More informationScatter plot of data from the study. Linear Regression
1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25
More informationInferences About Two Population Proportions
Inferences About Two Population Proportions MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2018 Background Recall: for a single population the sampling proportion
More informationEconometrics Multiple Regression Analysis: Heteroskedasticity
Econometrics Multiple Regression Analysis: João Valle e Azevedo Faculdade de Economia Universidade Nova de Lisboa Spring Semester João Valle e Azevedo (FEUNL) Econometrics Lisbon, April 2011 1 / 19 Properties
More informationInference in Regression Analysis
ECNS 561 Inference Inference in Regression Analysis Up to this point 1.) OLS is unbiased 2.) OLS is BLUE (best linear unbiased estimator i.e., the variance is smallest among linear unbiased estimators)
More informationChapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania
Chapter 10 Regression Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Scatter Diagrams A graph in which pairs of points, (x, y), are
More informationInference for Regression
Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu
More informationTopic 16 Interval Estimation
Topic 16 Interval Estimation Additional Topics 1 / 9 Outline Linear Regression Interpretation of the Confidence Interval 2 / 9 Linear Regression For ordinary linear regression, we have given least squares
More informationQuantitative Techniques - Lecture 8: Estimation
Quantitative Techniques - Lecture 8: Estimation Key words: Estimation, hypothesis testing, bias, e ciency, least squares Hypothesis testing when the population variance is not known roperties of estimates
More informationIEOR 165 Lecture 7 1 Bias-Variance Tradeoff
IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to
More informationHarvard University. Rigorous Research in Engineering Education
Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected
More informationApplied Econometrics (QEM)
Applied Econometrics (QEM) based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #3 1 / 42 Outline 1 2 3 t-test P-value Linear
More information
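The Monte Carlo simulation described on slide 4 (1,000 samples of size n = 100 drawn from N(5, 2), recording the sample mean of each) can be sketched as follows. This is an illustrative reconstruction using Python's standard library, not the tutorial's original code; the seed and variable names are my own choices.

```python
import random
import statistics

random.seed(0)

mu, sigma2 = 5.0, 2.0   # population parameters: X_i ~ N(mu = 5, sigma^2 = 2)
n, reps = 100, 1000     # sample size and number of Monte Carlo replications
sigma = sigma2 ** 0.5   # random.gauss takes the standard deviation

# For each replication, draw n observations and record the sample mean mu_hat
mu_hats = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

# The average of the mu_hats should be close to mu = 5, and their variance
# close to sigma^2 / n = 2 / 100 = 0.02, matching the derivation on slide 3
print(statistics.fmean(mu_hats))      # close to 5
print(statistics.pvariance(mu_hats))  # close to 0.02
```

Increasing `reps` tightens both approximations, since the Monte Carlo averages themselves converge to the theoretical mean and variance of the estimator.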