Lecture 4: Testing Stuff


1. Testing Hypotheses usually has three steps.
a. First specify a Null Hypothesis, usually denoted H_0, which describes a model of interest. Usually, we express H_0 as a restricted version of a more general model.
i. For example, given X characteristics, Z ethnic origin and Y earnings, and a model Y = Xβ + ZΓ + ε, we might wish to test the null hypothesis that ethnic origin is irrelevant to earnings, as predicted by competitive models of discriminatory preferences.
ii. Here, H_0: Γ = 0, which implies a model Y = Xβ + ε.
iii. The Null hypothesis always has an alternative, which may or may not be specified. Here, we could imagine at least 3 kinds of alternative hypotheses: H_1: Γ < 0; H_1: Γ > 0; or H_1: Γ ≠ 0. Note that the last alternative is equivalent to H_1: Γ < 0 or Γ > 0.
iv. Competitive models of discrimination also predict that ethnic origin, if correlated with preferences, should affect occupational choice, and so we model occupation W as W = Xδ + Zα + η. In this case, we might wish to test the hypothesis H_0: α = 0.
b. Then, construct a test statistic, which (typically) is a random variable (because it is a function of other random variables) with two features:
i. it has a known distribution under the Null Hypothesis (usually normal, chi-square or t);
ii. this known distribution may depend on data, but not on parameters (this is called pivotality: a test statistic is pivotal if it satisfies this condition).
iii. It is typical to express the test statistic as a standardised variate.
(1) That is, express the asymptotic distribution above as the standardized coefficient estimate going to a standard normal: T_1 = Γ̂/√V(Γ̂) ~ N(0,1). (The subscript on T_1 is just to keep track.)
(2) With normally distributed disturbances, we have small-sample results. We know that for a sample of size N in a model with k parameters, T_2 = Γ̂/√V̂(Γ̂) ~ t_{N−k}. As N−k gets large, this t distribution goes to a standard normal.
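
A minimal sketch in Python of this standardization (using hypothetical numbers that match the worked example in part c below: Γ̂ = 0.30, standard error 0.17, N = 100, k = 10) shows how the two reference distributions are used:

    from scipy import stats

    gamma_hat, se_gamma = 0.30, 0.17      # hypothetical coefficient estimate and standard error
    N, k = 100, 10                        # hypothetical sample size and number of parameters
    T = gamma_hat / se_gamma              # standardized variate under H0: Gamma = 0

    # Asymptotic case: treat T as N(0,1).  Small-sample (normal errors) case: treat T as t(N-k).
    p_asymptotic = 2 * (1 - stats.norm.cdf(abs(T)))
    p_small_sample = 2 * (1 - stats.t.cdf(abs(T), df=N - k))
    print(T, p_asymptotic, p_small_sample)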

2. The above test statistic is sometimes distributed as a t distribution.
a. Think for a moment about the asymptotic case. We know that, given everything in the classical linear model except normality of the error terms, asymptotically OLS coefficients are distributed normally: β̂ ~ N(β, σ²(X'X)⁻¹), where E[εε'] = σ²I_N.
i. Given normality of the disturbances, this holds in the small sample as well.
ii. Here, V(β̂) = σ²(X'X)⁻¹ is based on data, X, and a parameter, σ². We have to estimate the parameter, so we use the estimated variance V̂(β̂) = s²(X'X)⁻¹, with s² = (1/(N−k)) Σ_{i=1..N} e_i², which is comprised of an observable nonstochastic part, (X'X)⁻¹, and a stochastic part, which is a sum of squared normals.
b. What is in the test statistic? It has a numerator and a denominator, two elements: β̂ and V̂(β̂).
i. Asymptotic. We consider the asymptotic case, when we do not know the exact distribution of the disturbance term, but only know that it is well-behaved enough that it behaves asymptotically normal.
(1) The numerator goes to a normal.
(2) The denominator goes to the square root of a sum of squared normals.
(3) We do not know how such a ratio behaves exactly. However, we can approximate it. The 2nd-order Taylor approximation of the ratio of the numerator divided by the square root of the estimated variance (which is random) behaves just like the 2nd-order Taylor approximation of the ratio of the numerator divided by the square root of the true variance (which is not random).
(4) So, asymptotically, the ratio behaves approximately like the numerator alone, that is, as a normal.
ii. Small Sample. We consider the finite-sample case, when we do know the exact distribution of the disturbance term. We assume that it is normal.
(1) In this case, the numerator is distributed normally.
(2) The denominator is distributed as the square root of a sum of squared normals.
(3) So, the ratio is distributed as the ratio of a normal to the square root of a sum of squared normals.
(4) We call this distribution a t distribution, and it differs given how many normals are summed in the denominator. There are always N−k normals there, so we use a t with N−k degrees of freedom.
iii. Why only N−k squared normals rather than N?
(1) OLS can fit k data points exactly, so their sample errors can be zero. There are only N−k sample errors that need to be nonzero.
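
A short sketch of these pieces, with simulated data (all names and numbers below are illustrative, not from the notes), computes β̂, s², V̂(β̂) = s²(X'X)⁻¹ and the resulting t-statistics:

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])   # regressors incl. constant
    beta_true = np.array([1.0, 0.5, -0.2])
    y = X @ beta_true + rng.normal(scale=2.0, size=N)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y          # OLS estimates
    e = y - X @ beta_hat                  # residuals
    s2 = e @ e / (N - k)                  # s^2 = SSR/(N-k), the estimate of sigma^2
    V_hat = s2 * XtX_inv                  # estimated covariance of beta_hat
    se = np.sqrt(np.diag(V_hat))          # standard errors
    t_stats = beta_hat / se               # each is t with N-k df under H0: beta_j = 0
    print(beta_hat, se, t_stats)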

c. Then, compare the value of the test statistic to its known distribution.
i. For example, if ethnicity is univariate (e.g., an aboriginal dummy) and Γ̂ = 0.30, and its estimated standard error (√V̂(Γ̂)) is 0.17, then T = 0.30/0.17 = 1.76.
ii. If the sample size is 100 and 10 parameters are estimated, there are 90 degrees of freedom, so the appropriate distribution to compare this test statistic to is a t with 90 degrees of freedom.
(1) 8.1% of the distribution of this t is larger than 1.76 in absolute value. This is the p-value for the 2-sided test.
(2) The 2-sided test is appropriate if we are asking "what is the probability under the Null that I would see a deviation from zero for my test statistic that is as large as what I saw?"
(3) 4.1% of the distribution of this t lies beyond 1.76 in one direction. This is the p-value for the 1-sided test.
(4) The 1-sided test is appropriate if we are asking "what is the probability under the Null that I would see a test statistic as far out in the hypothesized direction as the one I saw?"
(5) Whether you use a 1-sided or 2-sided test depends upon your priors. If your prior was that the deviation, if any, has to be negative, then it is a 1-sided test. This is because in this case, you are really only thinking about the negativeness of the parameter.
(6) If your prior was diffuse, in the sense that you didn't know which way the violation of the Null might go, you'd use a 2-sided test.
(7) 1-sided tests are for testing inequality restrictions (here, the 1-sided test has the alternative Γ < 0), and 2-sided tests are for equality restrictions (here, the 2-sided test has the alternative Γ ≠ 0).
(8) Is 8.1% a big number? Is 4.1% a big number? Usually, we try to have the significance level in our head a priori. Common significance levels are 10%, 5% and 1%. If you are using 5%, then the 2-sided test of equality against a nonzero alternative does not reject, but the 1-sided test (with its 4.1% p-value) does reject.
d. The significance level chosen (e.g., 5%) determines the probability of a Type I error. A Type I error is when we reject the Null even though it is true. The probability of a Type I error is equal to the significance level (also known as the size of the test).
e. A Type II error is when we fail to reject the Null even though it is false.
i. For example, if in the example above the true parameter were Γ = −0.51, then the sampling distribution of the test statistic would be centered around −3 (= −0.51/0.17) and not around 0. The 5% critical value for the 2-sided test given the Null is 1.96. The probability of failing to reject is the probability that the test statistic would lie in [−1.96, 1.96] when its sampling distribution is centered on −3. This probability is about 15%.
f. The power of a test is one minus the probability of making a Type II error, that is, the probability of rejecting the Null when it is false. The power of a test varies with the true value of the parameter(s).
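
The p-values and the Type II error probability above can be reproduced with a few lines of Python (taking the numbers 0.30, 0.17, 90 degrees of freedom, and a true Γ of −0.51 as given):

    from scipy import stats

    T, df = 0.30 / 0.17, 90
    p_two_sided = 2 * (1 - stats.t.cdf(abs(T), df))   # roughly 0.08
    p_one_sided = 1 - stats.t.cdf(abs(T), df)         # roughly 0.04

    # Type II error: the test statistic is centered on -3 = -0.51/0.17 rather than 0;
    # probability of landing inside [-1.96, 1.96] (approximating the t by a normal).
    p_type2 = stats.norm.cdf(1.96, loc=-3) - stats.norm.cdf(-1.96, loc=-3)   # roughly 0.15
    print(p_two_sided, p_one_sided, p_type2)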

3. Confidence Regions are statements about the distribution of a random variable.
a. An α% confidence region for a single random variable r with point estimate r̂ is the set of values centered on r̂ such that there is an α% chance in repeated samples that the point estimate would lie in the set.
i. For example, we can construct a 90% confidence region for the coefficient Γ above, whose point estimate is Γ̂ = 0.30 and whose standard error is 0.17. Since the standardised, demeaned coefficient is distributed as a t with 90 degrees of freedom (that is, (Γ̂ − Γ)/se(Γ̂) ~ t_{90}), and we know that the cdf of that t is 5% at a value of −1.66 and is 95% at 1.66, we can compute the endpoints of the confidence band.
(1) The 5% cutoff is given by solving (Γ − 0.30)/0.17 = −1.66 for Γ, which yields Γ = 0.02.
(2) The 95% cutoff is given by solving (Γ − 0.30)/0.17 = 1.66 for Γ, which yields Γ = 0.58.
(3) There is a 90% probability that the coefficient lies in [0.02, 0.58].
b. One can construct confidence regions for several random variables jointly as well.
i. Kennedy (pg 64-66) has 3 good pictures for this problem.
ii. The joint confidence region for two random variables r_1, r_2 is the set of (r_1, r_2) values centered on (r̂_1, r̂_2) such that there is an α% chance in repeated samples that the point estimate would lie in the set.
iii. If the two random variables are independent, then the joint confidence region is an untilted oval whose two axis lengths are in ratio to the standard errors of the two random variables.
iv. If the two random variables covary, then the joint confidence region is a tilted oval.
(1) Imagine that the two random variables covary positively. Then, if one is a high value, we expect the other to be a high value. Thus, the confidence region must be tilted, so that the region where both random variables take on high values is shown as probable.
4. Computer programs often spit out univariate confidence regions and t-statistics for each variable. From the preceding discussion, you should be able to tell that if you know the confidence region, then you know the t-statistic, and vice versa, for any coefficient. Thus, the computer is just giving you two descriptions of the same features of the distribution of each coefficient.
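
A sketch of the 90% confidence band computed in part 3.a, and of the duality with the t-test noted in part 4 (the band excludes 0 exactly when |T| exceeds the same cutoff):

    from scipy import stats

    gamma_hat, se, df = 0.30, 0.17, 90
    t_crit = stats.t.ppf(0.95, df)                            # roughly 1.66, the 95% point of t(90)
    ci = (gamma_hat - t_crit * se, gamma_hat + t_crit * se)   # roughly (0.02, 0.58)
    rejects_at_10pct = abs(gamma_hat / se) > t_crit           # same information as the band
    print(ci, rejects_at_10pct)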

5. Tests have some common themes. For hypotheses on single variables, we often use test statistics that are distributed as t or normally.
a. Most often these tests are just the standardized value of the coefficient ("standardized" meaning "divided by standard error"). These standardized values are called t-tests if they are distributed t (as in OLS regression coefficients), or z-tests if they are distributed normally (as in cases where we don't have small-sample results, e.g., 2-stage least squares and FGLS).
b. A coefficient is called "significant" if its t- or z-test exceeds a critical value (often the 5% critical value for a 2-sided test of a standard normal variable, which is about 1.96).
6. Often, we want to test joint hypotheses, where we want to know the probability that several hypotheses are true at once. Chapter 6 of Greene develops these ideas.
a. E.g.: developing a test of the overidentifying restrictions Z'e = 0.
b. Consider the model Y = Xβ + ZΓ + ε where Z is a matrix with 2 columns, one aboriginal dummy and one visible-minority dummy (with white being the left-out category). We might be interested in the joint hypothesis that both of the coefficients on these variables are zero.
c. H_0: Γ_1 = 0 & Γ_2 = 0 and H_1: Γ_1 ≠ 0 or Γ_2 ≠ 0 represent the null and alternative hypotheses.
d. Consider first the asymptotic case, where we let N get really big, which implies that the estimated coefficients go to a normal distribution:
e. [Γ̂_1, Γ̂_2]' ~ N( [Γ_1, Γ_2]', [ V(Γ̂_1), cov(Γ̂_1, Γ̂_2); cov(Γ̂_1, Γ̂_2), V(Γ̂_2) ] )
f. ===> under the Null, [Γ̂_1, Γ̂_2]' ~ N( [0, 0]', [ V(Γ̂_1), cov(Γ̂_1, Γ̂_2); cov(Γ̂_1, Γ̂_2), V(Γ̂_2) ] )
g. ===> [Γ̂_1, Γ̂_2] [ V(Γ̂_1), cov(Γ̂_1, Γ̂_2); cov(Γ̂_1, Γ̂_2), V(Γ̂_2) ]⁻¹ [Γ̂_1, Γ̂_2]' ~ χ²_2
7. This is called a Wald Test. The Wald Test asks whether the discrepancy vector is big.
a. It measures the squared distance of the unrestricted estimates from the Null Hypothesis (aka the "discrepancy vector") in the metric of the covariance of the estimates. For random variables that are normal, this distance is a chi-square. (Chi-squares are sums of squared standard normals.)
i. Don't over-worry about this issue of "a chi-square is a sum of squared standard normals". Think about it this way.
(1) The Wald formulation of a joint hypothesis asks: how far away are our parameters from the Null Hypothesis? If you just added the distances up (without squaring them), they could cancel each other out even if both were large. Thus we square them.
(2) Since we are squaring standard normals, it would be helpful to write a table of values for, and give a name to, the distribution of sums of squared normals. We name it χ².
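
A sketch of this two-coefficient Wald statistic (the coefficient values and covariance matrix below are made up for illustration):

    import numpy as np
    from scipy import stats

    gamma_hat = np.array([0.30, -0.10])            # hypothetical estimates of Gamma_1, Gamma_2
    V = np.array([[0.17**2, 0.005],
                  [0.005,   0.12**2]])             # hypothetical estimated covariance matrix

    # Squared distance of the estimates from the Null (0, 0), in the metric of their covariance.
    W = gamma_hat @ np.linalg.inv(V) @ gamma_hat
    p_value = 1 - stats.chi2.cdf(W, df=2)          # compare to a chi-square with 2 df
    print(W, p_value)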

out even if both were large. hus we square them. () Since we are squaring standard normals, it would be helpful to write a table of values for, and give a name to, the distribution of sums of squared normals. We name it χ. b. he general form of a Wald est for a linear hypothesis H : Rβ r = 0 0, where β is a vector of normal random variables (eg, coefficient estimates) is Wald = ( Rβ r) '( RV ' ( β) R) ( Rβ r) ~ χ i. where J is the number of restrictions in (rank of) R. c. Wald ests for nonlinear hypotheses are similar. For a nonlinear hypothesis H : c ( β ) = 0, we need the value of the hypothesis given estimated coefficients 0 (the analogue to the discrepancy vector Rβ r ) and the slope of this with respect to β (the analogue to R). If β is a vector of normal random variables (eg, coefficient estimates), the Wald est is i. c( β ) ( ( )) c( β) ' ' ( ) Wald = c β V β ( c( β) ) ~ χ β β ii. clearly, if c is linear, this results in the linear hypothesis Wald est above. d. In small samples, these wald test statistics do not go to the chi-square distribution. he reason is that V is an estimate, the variance of which can only be ignored asymptotically. o take care of the fact that V is an estimated covariance matrix which itself is a chi-square we need access to a distribution that is a ratio of chi-squares. i. he F distribution with degrees of freedom J and N-k (two numbers which will be clarified in a moment) is the distribution of the ratio of two chisquares, one with J degrees of freedom and the other with N-k degrees of freedom, each divided by their degrees of freedom. ii. hus, the linear Wald test above is a chi-square, which is connected to the F distribution by () s / J / σ Wald ~ F J, N k () he numerator is a chi-square divided by its degrees of freedom (3) he demoninator is the scale factor that adjusts for the fact that V is estimated. It goes to as N goes to infinity, but in finite samples, it is not. (4) he denominator is itself a random variable, and since s is a sum of squared regression errors divided by N-k, it too is a chi-square (under normality of the error terms) divided by its degrees of freedom. J J

8. The discrepancy vector
a. For a hypothesis H_0: c(β) = 0, the discrepancy vector is the value of this function at the estimates:
b. d = c(β̂).
c. If we are thinking of a linear hypothesis where we can write c as c(β) = Rβ − r = 0, the discrepancy vector is d = Rβ̂ − r.
d. We may think of the Wald test as asking whether the discrepancy vector is far from zero.
e. To do this, we need to know the sampling distribution of d. If β̂ is distributed asymptotically normally, then under the Null d = c(β̂) ~ N( 0, (∂c(β̂)/∂β') V(β̂) (∂c(β̂)/∂β')' ).
i. Its expectation under the Null is zero, and its variance is given by the quadratic form of the Jacobian of c and the variance of β̂.
f. The discrepancy vector may have just one element, and in this case we could use a univariate test and ask how far out we are in its sampling distribution. If β̂ is asymptotically normally distributed, then the scalar d is asymptotically normal with variance given by the scalar V(d) = (∂c(β̂)/∂β') V(β̂) (∂c(β̂)/∂β')'.
i. So, we could construct the z-test z = d/√V(d) ~ N(0,1). If z is bigger in absolute value than about 2, we reject the hypothesis.
g. If the discrepancy vector has many elements, then we need to find a way to aggregate the distance of each element from zero without allowing positives and negatives to cancel each other out. We square them.
i. The idea of the z-test is to convert the normally distributed discrepancy into a standard-normally distributed test statistic by standardising by the standard error (the square root of the variance).
ii. The idea with a many-element discrepancy vector is to convert the jointly normally distributed discrepancy vector into a vector of independent standard normals by dividing by its root-variance matrix:
iii. t_D = V(d)^(−1/2) d ~ N(0, I_J), where J is the length of d.
iv. To stop positives from cancelling negatives, we square and add up: W = t_D' t_D = d' V(d)⁻¹ d ~ χ²_J.
v. This is the Wald test right back at ya. (Plug in the formulae and you'll see.)
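
A sketch of the discrepancy-vector logic for a nonlinear hypothesis, with the variance of d obtained from the Jacobian as above (the example restriction c(β) = β_1·β_2 − 1 = 0 and all numbers are purely illustrative):

    import numpy as np
    from scipy import stats

    beta_hat = np.array([2.1, 0.45])                 # hypothetical estimates
    V_hat = np.array([[0.040, 0.001],
                      [0.001, 0.010]])               # hypothetical covariance of beta_hat

    def c(b):                                        # restriction: c(beta) = 0 under H0
        return np.array([b[0] * b[1] - 1.0])

    C = np.array([[beta_hat[1], beta_hat[0]]])       # Jacobian dc/dbeta' evaluated at beta_hat
    d = c(beta_hat)                                  # discrepancy vector
    V_d = C @ V_hat @ C.T                            # its variance, the quadratic form above
    z = d[0] / np.sqrt(V_d[0, 0])                    # univariate z-test (single restriction)
    W = d @ np.linalg.inv(V_d) @ d                   # Wald form; here W = z^2, chi-square 1 df
    print(z, W, 1 - stats.chi2.cdf(W, df=1))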

9. Estimating subject to Restrictions
a. The above t, Wald, and F tests have you estimate an unrestricted model and test restrictions, which is known as "testing down", because you are testing down from an unrestricted model to see if restrictions hold.
b. One might wish to know what the model looks like under the restrictions. Restricted estimation is often pretty easy. Consider a linear model where the restriction you wish to impose is that some element of the parameter vector is zero. In this case, we may frame the restriction as an "exclusion restriction", because it implies that some variable may be excluded from the model.
i. Exclusion Restrictions
(1) The discussion above shows how you might test the exclusion restriction. E.g., look at the value of the t-stat for that coefficient.
(2) To estimate subject to the restriction, just exclude the variable from the regression.
c. Single equality restrictions. Suppose Y = Xβ + ZΓ + ε where Z is univariate.
i. Consider the restriction Γ = 1.
(1) We could rewrite the model as Y − Z = Xβ + ε, so regressing Y − Z on X yields estimates satisfying the restriction.
ii. Assume X and Z are univariate, e.g., capital and labour ratios to production output Y. In Cobb-Douglas production environments, the restriction β + Γ = 1 would be satisfied. In this case, Γ = 1 − β, so Y = Xβ + Z(1 − β) + ε, which implies Y − Z = (X − Z)β + ε.
(1) So, one could regress Y − Z on X − Z, yielding estimates satisfying the restriction (a code sketch at the end of this section illustrates this).
iii. Consider the restriction β = Γ. If X and Z are two types of human capital, we might assume they have the same effect on earnings.
(1) β = Γ implies Y = Xβ + Zβ + ε = (X + Z)β + ε, so regress Y on X + Z.
d. When there are multiple restrictions, sometimes they interact in funny ways, and it can be difficult or impossible to write out a regression formulation on transformed variables that does the trick. However, one can always write out a Lagrangean for the restricted regression problem as
min_β Σ_{i=1..N} (Y_i − X_iβ)² − λ'R(β),
i. where R is a set of restrictions (and λ is the vector of Lagrange multipliers).
ii. If R is a set of linear restrictions, then the solution for the restricted coefficients is also linear.
(1) If either R or the regression function is nonlinear, then the solution is typically a nonlinear function of the data.
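
A sketch of restricted estimation by transforming variables, for the single restriction β + Γ = 1 in part c.ii (data simulated, all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 200
    X = rng.normal(size=N)
    Z = rng.normal(size=N)
    beta_true, gamma_true = 0.7, 0.3                 # chosen so that beta + Gamma = 1 holds
    Y = X * beta_true + Z * gamma_true + rng.normal(scale=0.5, size=N)

    # Restricted estimation: regress (Y - Z) on (X - Z), then recover Gamma from the restriction.
    beta_r = np.linalg.lstsq(np.column_stack([X - Z]), Y - Z, rcond=None)[0][0]
    gamma_r = 1.0 - beta_r
    print(beta_r, gamma_r)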

10. Goodness of Fit
a. Since errors are random variables, sums of squared errors (SSR) are random variables. So, can't we use the fit of a regression (its SSR) as a test statistic?
i. The model is Y = Xβ + ε, and now, instead of worrying about the sampling distribution of β̂, we try to figure out the sampling distribution of SSR = Σ_{i=1..N} e_i², where e_i = Y_i − X_iβ̂.
b. Goodness of fit could be compared by comparing the SSR when we impose the Null to the SSR when we don't impose the Null.
c. This is different from the spirit of a Wald Test, because to do a Wald Test you don't have to estimate under the restriction that the Null is true. Rather, you estimate a general model and ask how large the discrepancy from the Null is.
d. So, you estimate under the Null, and call the sum of squared errors from this SSR(restricted), or SSR_R. Then, you estimate under the alternative, and call the sum of squared errors from this SSR(unrestricted), or SSR_U.
e. First, notice that under the Null, SSR(unrestricted) and SSR(restricted) should have approximately the same distribution, because the restrictions are not binding under the Null.
f. This means that we might consider using SSR_R − SSR_U as part of a test statistic, because under the Null it is small (asymptotically negligible relative to SSR_U). We also know that it must be weakly positive, because the unrestricted model contains the restricted model as a possibility. How is this thing distributed?
g. SSR(unrestricted) is related to a chi-square with N−k degrees of freedom (because k perfect fits can be had from the k parameters). However, chi-squares are sums of squared standard normals, and e_i is not a standard normal, because its variance goes to σ², which we can estimate by s².
i. So, SSR_U/σ² ~ χ²_{N−k}.
ii. And, asymptotically, SSR_U/s² ~ χ²_{N−k}. Even though s² is a random variable, we can ignore its variation asymptotically.
h. SSR(restricted) is a chi-square with N−k+J degrees of freedom (because k perfect fits can be had from the k parameters, but J of these parameters are determined by the restrictions).
i. So, SSR_R/σ² ~ χ²_{N−k+J}.
ii. And, asymptotically, SSR_R/s² ~ χ²_{N−k+J}.
i. Recall also that s² = (1/(N−k)) Σ_{i=1..N} e_i² = SSR_U/(N−k).
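
A sketch computing SSR(restricted) and SSR(unrestricted) for the exclusion restriction H_0: Γ = 0 in Y = Xβ + ZΓ + ε (data simulated with the Null true; all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    N = 100
    X = np.column_stack([np.ones(N), rng.normal(size=N)])    # included regressors
    Z = rng.normal(size=(N, 2))                              # regressors excluded under H0
    Y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)        # generated with Gamma = 0

    def ssr(design, y):
        b = np.linalg.lstsq(design, y, rcond=None)[0]
        e = y - design @ b
        return e @ e

    SSR_U = ssr(np.column_stack([X, Z]), Y)   # unrestricted: include Z
    SSR_R = ssr(X, Y)                         # restricted: impose Gamma = 0 by excluding Z
    print(SSR_R, SSR_U, SSR_R - SSR_U)        # the difference is always weakly positive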

j. So, it must be that (SSR_R − SSR_U)/s² = (SSR_R − SSR_U)/(SSR_U/(N−k)) ~ χ²_J, because we are subtracting the sum of N−k squared standard normals from the sum of N−k+J squared standard normals.
i. This ignores the variation in the denominator, treating it like a constant asymptotically. We have sums of squared normals on top, and we ignore the sampling variation of the bottom.
k. If we wanted to turn this into a small-sample statistic, we would have to add the assumption Y = Xβ + ε, ε ~ N(0, σ²I_N). Given this, we can model the sampling distribution of the denominator:
l. SSR_U/(N−k), divided by σ², is a chi-square divided by its degrees of freedom.
m. The numerator is a chi-square not divided by its degrees of freedom, so if we divide it by its degrees of freedom, we get a ratio of chi-squares each divided by their degrees of freedom, also known as an F variate:
i. ((SSR_R − SSR_U)/J) / (SSR_U/(N−k)) = ((SSR_R − SSR_U)/J) / s² ~ F_{J, N−k}.
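
A sketch of the F statistic in part m, built from the two sums of squared errors (the SSR values, sample size, and counts below are made-up numbers):

    from scipy import stats

    SSR_R, SSR_U = 130.0, 118.0            # hypothetical restricted and unrestricted SSRs
    N, k, J = 100, 4, 2                    # sample size, unrestricted parameters, restrictions
    s2 = SSR_U / (N - k)                   # estimate of sigma^2 from the unrestricted model
    F = ((SSR_R - SSR_U) / J) / s2         # = [(SSR_R - SSR_U)/J] / [SSR_U/(N-k)]
    p_value = 1 - stats.f.cdf(F, J, N - k)
    print(F, p_value)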