Introduction. The Linear Regression Model


Introduction

Definition. From Wikipedia: "The two main purposes of econometrics are to give empirical content to economic theory and to subject economic theory to potentially falsifying tests." Another popular definition: the application of statistical and mathematical methods to the analysis of economic data, with the purpose of giving empirical content to economic theories and verifying or refuting them.

Example: labor demand theory tells us that increasing the minimum wage could reduce labor demand. Econometrics may help to assess whether this is true and what the order of magnitude is.

The definition implies that there are three key aspects in econometrics:
1. What is the question we want to answer? What is the underlying economic theory?
2. What are the available data?
3. Which method should we use?

The Linear Regression Model

One popular model is the linear regression model. It writes as:

y = α + β_1 x_1 + ... + β_K x_K + u = xb + u

where y is the dependent variable, x_1, ..., x_K are K explanatory variables, u is a random term, and b = (α, β_1, ..., β_K)' is the parameter vector to be estimated. y and x are observed; u is unobserved; b is unknown.

Examples:

Return to schooling (Mincer equation):

log(w) = a_0 + a_s s + a_e e + u

where w is the wage, s is the number of years of schooling, and e is experience. a_s is the return to schooling: it measures the wage gain associated with an increase of one year of schooling. Data for this model could be at the individual level.

Keynesian consumption function: C = C(I), where I is income and C consumption. The econometric model turns this relation into

C = α + βI + U

where U is an error term, both because planned consumption and realized consumption are not the same, and because the model is an approximation of the true relationship. Data for this model could be aggregate information at different dates.

Data and Sample

Three types of data:
1. Cross-sectional data
2. Time series
3. Panel data

Cross-sectional data are formed of observations of economic variables for N different units at the same point in time. Units can be individuals, households, cities, countries... Individuals are identified with a subscript i: y_i (1 × 1), x_i = (1, x_1i, ..., x_Ki) (1 × (K + 1)). Data are organized as vectors and matrices:

y = (y_1, ..., y_N)', of size N × 1
x = [1 x_11 ... x_K1 ; ... ; 1 x_1N ... x_KN], of size N × (K + 1)

The way the data are ordered does not matter. Pooled cross-sectional data are formed of observations of different individuals for different cross sections.

[Tab. 1 Cross-sectional data set on wages, schooling and experience (columns: Obs, Ident, lw, sco, exp)]

Time series are formed of observations of one same unit (country, currency, asset, ...) followed through T periods. The frequency of observations is an important feature of time series: annual, quarterly, weekly, daily, ... One observation is indexed by a subscript t: y_t (1 × 1), x_t = (1, x_1t, ..., x_Kt). Data are also organized as matrices and vectors:

y = (y_1, ..., y_T)', of size T × 1
x = [1 x_11 ... x_K1 ; ... ; 1 x_1T ... x_KT], of size T × (K + 1)

Here the way the data are ordered is of great importance: there is a natural order. An important difference between cross sections and time series is that observations for two individuals can be seen as independent from one another, while this is not the case for time series: consumption in a given year is not independent from consumption the past year.

Panel data mix the two types of data. They are composed of data for N individuals followed during T identical periods. Data are indexed by a double index i, t: y_it (1 × 1), x_it = (1, x_1it, ..., x_Kit).

Causality and Econometrics

We are interested in measuring the strength and existence of causal relationships. An observed outcome results from many causes operating simultaneously and not independently. One purpose of econometrics is to measure the effect of one of these causes ceteris paribus, i.e. maintaining all other possible causes fixed.

Examples: Return to schooling. When assessing the effect of school attendance on wages, one wants to measure the causal effect of school attendance on wages. The problem is that general ability is also a cause of wages, but it also affects the decision to attend school. Econometrics tries to measure the effect of school attendance given ability.

Counseling programs for the unemployed. Counseling the unemployed is a frequent policy, and one may be interested in measuring its effect on employment. However, program participation is a decision of individuals that may be related to unobserved abilities that also affect the ability to find a job. Econometrics would like to measure the separate effect of program

[Tab. 2 Time series (columns: Obs, Date, Income, Consumption)]

[Tab. 3 Panel data (columns: Obs, Ident, Date, VA, L, Capital; individuals A through H, each observed over four periods)]

participation. One way to solve this problem is to organize experiments: randomly assign individuals to the program.

Road Map of the Course

We will focus on the linear regression model

y = b_0 + x_1 b_1 + ... + x_K b_K + u

We will examine:
- Ordinary Least Squares estimation (OLS)
- Variance of the estimator and goodness of fit
- Confidence intervals
- Hypothesis testing
- A distinction will be made between finite-sample properties (Student test, Fisher test), when u is normal, and asymptotic properties, when the number of independent observations is large
- Specification issues

We will then examine some extensions of basic OLS:
- Heteroskedasticity
- Panel data
- Endogeneity of explanatory variables: instrumental variable estimation
- Explaining dummy variables

Random Variables: Their Distribution and the Features of their Distribution

Distribution

Def. (Random variable): a random variable is a variable whose outcome is the result of some random event.

Examples:
- flip of a coin: y = 1 if heads, else y = 0. A discrete variable with 2 possible values.
- sum of the results of 10 flips: a discrete variable with 11 possible values.
- unemployment duration: a continuous variable on R+.

Def. (Distribution of a discrete variable): the distribution of a discrete variable is the probability of each outcome.

Examples:
- flip of a coin: typically P(y = 1) = 1/2. More generally it can be P(y = 1) = θ with 0 < θ < 1. Such a variable is a Bernoulli variable of parameter θ.
- sum of n flips, each flip of probability θ: P(y = k) = C_n^k θ^k (1 − θ)^(n−k).

Def. (Cumulative distribution): the cumulative distribution is defined by F(x) = P(X < x). It is an increasing positive function. It allows to quantify the probability of various sets, P(a < X < b) = F(b) − F(a), and to define the distribution for continuous variables. When F is smooth one can define, for b = a + da,

P(a < X < a + da) = F(a + da) − F(a) = f(a) da

f is the distribution (or density) of X. F and f are also related through:

F(t) = ∫_{−∞}^{t} f(u) du, with f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1
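As a quick numerical check of the binomial formula above, one can simulate coin flips and compare the empirical frequencies of the sum of n flips with P(y = k) = C_n^k θ^k (1 − θ)^(n−k). This is an illustrative Python sketch, not part of the original notes; it assumes numpy and scipy are available.

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(0)
    theta, n, draws = 0.3, 10, 100_000

    # Each row is n Bernoulli(theta) flips; the row sum is Binomial(n, theta)
    flips = rng.random((draws, n)) < theta
    sums = flips.sum(axis=1)

    for k in range(n + 1):
        empirical = (sums == k).mean()
        exact = binom.pmf(k, n, theta)   # C_n^k theta^k (1-theta)^(n-k)
        print(f"k={k:2d}  empirical={empirical:.4f}  exact={exact:.4f}")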

Examples:
- unemployment duration: F(t) = 1 − e^(−ηt), f(t) = η e^(−ηt). f is an exponential distribution of parameter η.
- normal distribution: φ(t) = (1/√(2π)) exp(−t²/2). Cumulative distribution: Φ(t) = ∫_{−∞}^{t} φ(x) dx.

Joint and conditional distributions

Distribution of two random variables: for X and Y two real random variables,

P(X ∈ [x, x + dx], Y ∈ [y, y + dy]) = f_{X,Y}(x, y) dx dy

f_{X,Y}(x, y) is the joint distribution of X and Y.

Marginal distribution: f_X(x), the distribution of X, is related to the joint distribution f_{X,Y} by

f_X(x) = ∫_y f_{X,Y}(x, y) dy

Conditional distribution: holding Y = y, X is still a random variable and has a distribution. The distribution of X conditional on Y = y is given by:

f_X(x | Y = y) = f_{X,Y}(x, y) / f_Y(y)

This comes from the Bayes rule: P(A | B) = P(A ∩ B) / P(B).

Independence: X and Y are two independent variables if:

f_{X,Y}(x, y) = f_X(x) f_Y(y)

Remark: independence implies f(x | Y) = f(x): knowledge of Y doesn't matter for X.

Random sample

A sample of N observations is a random sample if (X_1, ..., X_N) are independent and identically distributed (IID):
- Independence: f(x_1, ..., x_N) = f_1(x_1) f_2(x_2) ... f_N(x_N)
- Identical distribution: f(x_1, ..., x_N) = f(x_1) ... f(x_N)

Example: sample of N coin flips of parameter θ, with P(X_i = 1) = θ, i.e. P(X) = θ^X (1 − θ)^(1−X):

P(x_1, ..., x_N) = Π_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i} if IID
P(x_1, ..., x_N) = Π_{i=1}^{N} θ_i^{x_i} (1 − θ_i)^{1−x_i} if just independent

Sample of N unemployment durations, f(t) = η e^(−ηt):

f(t_1, ..., t_N) = Π_{i=1}^{N} η e^{−η t_i} if IID
f(t_1, ..., t_N) = Π_{i=1}^{N} η_i e^{−η_i t_i} if just independent

Moments

Expected value: moment of order 1. Definition: X, a random variable of distribution f, has a moment of order 1,

E(X) = ∫ x f_X(x) dx

when it exists.

In the case of a discrete distribution, X ∈ {a_1, ..., a_J} with P(X = a_j) = P_j:

E(X) = Σ_j P_j a_j

Examples:
- flip of a coin, P(X = 1) = θ: E(X) = θ.
- sum of n coin flips:

E(X_1 + ... + X_n) = Σ_{j=0}^{n} j C_n^j θ^j (1 − θ)^{n−j} = nθ Σ_{j=1}^{n} C_{n−1}^{j−1} θ^{j−1} (1 − θ)^{n−j} = nθ

- expected duration of unemployment: ∫_0^∞ t η e^{−ηt} dt = 1/η.
- normal distribution, f(x) = (1/√(2π)) exp(−x²/2): E(X) = ∫ x f(x) dx = 0, because f'(x) = −x f(x).
- more generally, if f(t) = f(−t) (symmetric distribution), then, if E(X) exists, E(X) = 0:

∫_{−A}^{+A} t f(t) dt = ∫_{−A}^{0} t f(t) dt + ∫_{0}^{A} t f(t) dt = −∫_{0}^{A} t f(t) dt + ∫_{0}^{A} t f(t) dt = 0

The expectation may not exist: f(x) = (1/π) · 1/(1 + x²) is the Cauchy distribution. One can easily see that ∫ x f(x) dx diverges, since

∫ x/(1 + x²) dx = +∞
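A small simulation makes the last two examples concrete: the sample mean of exponential draws settles near 1/η, while the sample mean of Cauchy draws does not settle anywhere, since the expectation does not exist. This is an illustrative sketch, not part of the original notes.

    import numpy as np

    rng = np.random.default_rng(1)
    eta = 2.0

    for n in (100, 10_000, 1_000_000):
        expo = rng.exponential(scale=1 / eta, size=n)   # mean should approach 1/eta = 0.5
        cauchy = rng.standard_cauchy(size=n)            # mean has no target to approach
        print(f"n={n:>9}  exp mean={expo.mean():.4f}  cauchy mean={cauchy.mean():+.2f}")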

Properties of expectation

Linearity: E(aX_1 + bX_2) = a E(X_1) + b E(X_2).

Example: sum of n coin flips, E(X_1 + ... + X_n) = Σ_{j=1}^{n} E(X_j) = nθ.

Conditional expectation

Holding X = x constant, Y is still a random variable, with distribution f(y | X = x) = f(y, x)/f(x). It may have an expectation:

E(Y | X = x) = ∫ y f(y | X = x) dy

It is noted E(Y | X) as a random variable, or E(Y | X = x) for the value it takes when X = x.

Properties of the conditional expectation:
- E(E(Y | X)) = E(Y)
- E(a(X) Y + b(X) | X) = a(X) E(Y | X) + b(X)
- if Y is independent of X, then E(Y | X) = E(Y)
- E(Y | X) = E(E(Y | X, Z) | X)

Example: duration of unemployment, f_Y(y | X) = exp(Xb) exp(−exp(Xb) y), so E(Y | X) = 1/exp(Xb).

The basis of model specification in linear regression: E(Y | X) = Xb.

Variance: moment of order 2

Moments of order 2 are defined by m_2 = ∫ x² f(x) dx. The variance is defined as the centered moment of order 2:

Var(X) = ∫ (x − E(X))² f_X(x) dx = ∫ x² f_X(x) dx − E(X)²

Var(X) is a measure of the dispersion of the variable. One also considers the standard deviation Std(X) = √Var(X).

Property of the variance: Var(aX + b) = a² Var(X).

Covariance

E(XY) is the moment of order 2 of X and Y. The covariance of X and Y, Cov(X, Y), is the centered second-order moment of X and Y:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y)

If X and Y are independent then Cov(X, Y) = 0. But zero correlation does not imply independence: the reverse implication is not true.

Covariance matrix of X and Y:

Σ(X, Y) = [ Var(X)     Cov(X, Y) ;
            Cov(X, Y)  Var(Y)    ]

Variance of a linear combination:

Var(a_1 X_1 + a_2 X_2) = a_1² Var(X_1) + a_2² Var(X_2) + 2 a_1 a_2 Cov(X_1, X_2)
                       = (a_1, a_2) Σ(X_1, X_2) (a_1, a_2)'

For 2 independent variables:

Var(X_1 + X_2) = Var(X_1) + Var(X_2)
Var(X_1 − X_2) = Var(X_1) + Var(X_2)

For N uncorrelated variables: Var(Σ a_i X_i) = Σ a_i² Var(X_i).

Example: if Var(X_i) = V,

Var(Σ X_i / N) = Σ (1/N²) Var(X_i) = V/N

Correlation: ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

Property: ρ(X, Y) ∈ [−1, 1], by the Cauchy-Schwarz inequality

Cov(X, Y)² ≤ Var(X) Var(Y)

Conditional variance

The variable Y | X = x also has a variance:

V(Y | X) = ∫ (y − E(Y | X))² f(y | X) dy

One can show that E(V(Y | X)) + V(E(Y | X)) = V(Y).

Other features of the distribution of a variable

Definition of quantiles: the quantile of order α of a distribution F(x), with 0 < α < 1, is the value q_α of x such that F(q_α) = α.

The median is one popular quantile, associated with α = 0.5. It is such that the probability of observing a realization of the variable below the median is 0.5.

Quantiles are important to test hypotheses or to obtain confidence intervals for parameters. Quantiles of the normal distribution are especially important. One important quantile is the quantile of order 97.5%: q_0.975(normal) = 1.96, defined by Φ(q_0.975(normal)) = 0.975.

It is possible to show that when f is symmetric and the mean exists, then the median and the mean are equal. For symmetric variables of cumulative distribution F, we also have the property that F(−t) = 1 − F(t). Thus q_α(F) = −q_{1−α}(F); for example, for a normal variable we have q_0.025(normal) = −q_0.975(normal) = −1.96.

Estimation and Inference

Estimation

We consider a random variable Y. We are interested in a parameter θ_0 of its distribution f(y). We have a sample of N IID draws of Y, (Y_1, ..., Y_N). An estimator of θ_0, noted θ̂, is a function of the observations, θ̂ = θ̂(Y_1, ..., Y_N), that brings knowledge about θ_0.

Example: θ_0 = expectation of the distribution, θ_0 = E(Y), θ̂ = Σ_{i=1}^{N} α_i Y_i.

Example: assume you are looking for a job and send CVs in response to job offers found in newspapers. You are interested in obtaining interviews. The variable I, obtaining an interview, is a Bernoulli variable of parameter p: B(p). Assume you have sent N CVs. The outcomes of these answers are independent variables Y_i, all distributed as B(p). You are interested in estimating p from the sample of observations (Y_i). Here p is a parameter of the distribution; it is also the mean of the distribution.

Remark: in most cases θ_0 is a parameter of the distribution. It is also the case for a normal variable: the mean is a key parameter, f(y; µ, σ²) = exp(−(y − µ)²/(2σ²))/√(2πσ²). We will frequently write the distribution f as a function of the parameter of interest: f(y; θ).

Properties of an estimator

Bias: θ̂ is unbiased if E(θ̂) = θ_0, i.e. ∫ θ̂(y_1, ..., y_N) f(y_1, ..., y_N) dy = θ_0.

Example: estimation of the mean.

E(Σ α_i Y_i) = Σ α_i E(Y_i) = E(Y) Σ α_i

θ̂ is unbiased if Σ α_i = 1: a weighted mean of observations, with weights summing to one, is an unbiased estimator of the expectation. In particular,

Ȳ = (1/N) Σ Y_i

is an unbiased estimator of E(Y).

Estimation of the variance:

V̂ = (1/(N − 1)) Σ (Y_i − Ȳ)²

is an unbiased estimator of the variance: E(V̂) = V.

In the case of job seeking, the expectation of obtaining an interview is E(I) = p, and it can be estimated without bias by Ê(I) = Ī = (1/N) Σ I_i. The variance of the outcome variable is V(Y) = E(Y²) − E(Y)² = E(Y) − E(Y)² = p − p² = p(1 − p). It can be estimated without bias by:

V̂(Y) = (1/(N − 1)) Σ (Y_i − Ȳ)² = (N/(N − 1)) Ȳ (1 − Ȳ)

Variance of an estimator: V(θ̂) = V(θ̂(Y_1, ..., Y_N)). The variance is important as it gives information on the accuracy of the estimation. A desirable property is that V(θ̂) is minimal. This is related to the efficiency issue.

Efficiency: mean squared error.

MSE(θ̂) = E(θ̂ − θ_0)² = E(θ̂ − E(θ̂))² + (E(θ̂) − θ_0)²
MSE(θ̂) = V(θ̂) + Bias²
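The role of the 1/(N − 1) factor can be checked by Monte Carlo (an illustrative sketch, with an arbitrary choice of p and N): averaged over many samples, the estimator dividing by N − 1 matches the true variance p(1 − p), while dividing by N underestimates it.

    import numpy as np

    rng = np.random.default_rng(2)
    p, N, reps = 0.3, 20, 200_000
    true_var = p * (1 - p)                       # variance of a Bernoulli(p)

    samples = (rng.random((reps, N)) < p).astype(float)
    v_unbiased = samples.var(axis=1, ddof=1)     # divide by N-1
    v_biased = samples.var(axis=1, ddof=0)       # divide by N

    print(f"true variance        : {true_var:.4f}")
    print(f"mean of 1/(N-1) est. : {v_unbiased.mean():.4f}")  # close to 0.21
    print(f"mean of 1/N est.     : {v_biased.mean():.4f}")    # biased downward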

An efficient estimator is an estimator that has the minimal mean squared error in the class of considered estimators. There is a possibility to balance the variance and the bias; in most cases, however, the bias term can be neglected or is zero. In the class of unbiased estimators, efficiency means minimal variance.

Example: estimation of the mean.

V(Σ α_i Y_i) = Σ α_i² V(Y_i) = V Σ α_i²

Minimizing Σ α_i² subject to Σ α_i = 1 gives α_i = 1/N:

Ȳ = (1/N) Σ Y_i

is an efficient estimator in the class of linear unbiased estimators. Its variance is V(Ȳ) = V(Y)/N.

In the case of the probability of obtaining an interview after sending a CV, the best estimator is the simple mean of the observed results, p̂ = (1/N) Σ I_i. The variance of this estimator is related to the variance V = p(1 − p):

V(p̂) = V(I)/N = p(1 − p)/N

It can be estimated without bias by replacing V by an unbiased estimator of V(I):

V̂(p̂) = (1/N) · (N/(N − 1)) p̂(1 − p̂) = p̂(1 − p̂)/(N − 1)

Assume your name is Julie, you have sent 400 CVs and you have obtained 112 interviews. The probability of obtaining an interview can be estimated as p̂_F = 112/400 = 28%. The variance of this estimator is p_F(1 − p_F)/(N − 1), and it can be estimated without bias by p̂_F(1 − p̂_F)/(N − 1) = 0.28 × (1 − 0.28)/399 = 0.000505. The corresponding standard error is 0.0225.
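These numbers are pure arithmetic and can be reproduced in a few lines (a minimal sketch):

    import math

    n, interviews = 400, 112
    p_hat = interviews / n                       # 0.28
    var_hat = p_hat * (1 - p_hat) / (n - 1)      # unbiased estimate of Var(p_hat)
    print(p_hat, var_hat, math.sqrt(var_hat))    # 0.28  ~0.000505  ~0.0225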

[Tab. 4 Interviews for Julie and Jules (columns: Julie, Jules, D; rows: Interviews, p̂, V̂, Std)]

Assume you have also sent 400 CVs, just replacing your name by Jules, and you have obtained 130 interviews. The probability of obtaining an interview can be estimated as p̂_H = 130/400 = 32.5%. The variance of this estimator is p_H(1 − p_H)/(N − 1), and it can be estimated without bias by p̂_H(1 − p̂_H)/(N − 1) = 0.325 × (1 − 0.325)/399 = 0.00055. The corresponding standard error is 0.0234.

Assume you have in fact selected 800 job offers and made 400 pairs of job offers. You randomly choose within each pair the offer you will answer under the name Jules and the offer you will answer under the name Julie. You consider for each pair the results y_i^H, y_i^F and the difference d_i = y_i^H − y_i^F. This variable takes three possible values: −1, 0 and 1. Because the job offers have been randomly selected into pairs, y_i^H, y_i^F are independent draws from the distributions of Y^H and Y^F. The expectation d of D = Y^H − Y^F is the difference of the expectations of Y^H and Y^F:

E(D) = d = p_H − p_F

Because Jules and Julie never answer the same job offer, the two variables are independent, therefore the variance of the difference is the sum of the variances:

V(D) = V(Y^H) + V(Y^F)

We can deduce from this an unbiased estimator of d, d̂ = p̂_H − p̂_F = 0.325 − 0.28 = 4.5%; the expression of its variance, V(d̂) = (p_H(1 − p_H) + p_F(1 − p_F))/(N − 1); and an estimate of the variance of this estimator, V̂(d̂) = (p̂_H(1 − p̂_H) + p̂_F(1 − p̂_F))/(N − 1) = 0.00106, to which corresponds a standard error of 0.0325.
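The same computation for the difference (a minimal sketch; the 130 interview count for Jules is the one consistent with the 32.5% rate used in the text):

    import math

    n = 400
    p_f, p_h = 112 / n, 130 / n                  # Julie, Jules
    d_hat = p_h - p_f                            # 0.045
    var_d = (p_h * (1 - p_h) + p_f * (1 - p_f)) / (n - 1)
    print(d_hat, var_d, math.sqrt(var_d))        # 0.045  ~0.00106  ~0.0325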

Distribution of an estimator

Mean and variance are important, but if we want more precise statements about the parameter we need to know the distribution of the estimator. If the distribution f of Y is known, then the distribution of the estimator is known:

P(θ̂ < θ) = ∫_{θ̂(y_1, ..., y_N) < θ} Π_i f(y_i) dy_1 ... dy_N

Example: normal distribution N(µ, σ²), f(y; µ, σ²) = exp(−(y − µ)²/(2σ²))/√(2πσ²). We consider an IID sample (Y_i).

1) Estimation of the mean: µ̂ = Ȳ = (1/N) Σ Y_i. µ̂ is also a normal variable, with first and second moments E(µ̂) = µ and V(µ̂) = σ²/N:

µ̂ ~ N(µ, σ²/N)

2) Estimation of the variance: σ̂² = (1/(N − 1)) Σ (Y_i − Ȳ)². We already know that σ̂² is an unbiased estimator. In the normal case the distribution of σ̂² is also known. One can show that

(N − 1) σ̂²/σ² ~ χ²(N − 1)

Definition: χ²(K) distribution. If Z_1, ..., Z_K are N(0, 1) independent variables, then

Σ_{k=1}^{K} Z_k² ~ χ²(K)

with E(χ²(K)) = K and V(χ²(K)) = 2K.

3) Moreover, µ̂ and σ̂² are two independent variables.

Usually, however, the distribution of the estimator is not easy to know in finite samples. One has to look at asymptotic distributions and properties of estimators, i.e. approximations of the distribution when the number of observations tends to infinity.

Two central asymptotic results

Consistency (convergence in probability): θ̂ →^P θ if P(|θ̂_N − θ| > δ) → 0 for all δ. Noted plim θ̂ = θ.

Convergence in mean square: θ̂ →^MS θ if E(θ̂ − θ)² = ∫ (θ̂ − θ)² f(θ̂) dθ̂ → 0.

θ̂ →^P θ means that the probability that a draw falls outside a given interval around θ tends to zero when the number of observations increases: the estimator becomes closer and closer to the value θ. (This is different from unbiasedness.)

Convergence in mean square implies convergence in probability:

E(θ̂ − θ)² = ∫_{|θ̂−θ|>δ} (θ̂ − θ)² f(θ̂) dθ̂ + ∫_{|θ̂−θ|≤δ} (θ̂ − θ)² f(θ̂) dθ̂ ≥ δ² P(|θ̂ − θ| > δ)

Unbiased estimators with a variance that shrinks to zero are therefore consistent estimators.

Application: law of large numbers. For a sample of IID observations with E(Y) = µ and V(Y) = V < +∞,

Ȳ →^P µ as N → +∞

This is just because E(Ȳ) = µ (Ȳ is unbiased) and V(Ȳ) = E(Ȳ − µ)² = V/N → 0, thus Ȳ →^MS µ.

Example: estimation of the probability of an answer from CVs. As V(Y) = p(1 − p) < +∞, we know that Ȳ is a consistent estimator of p, as is the estimator of the difference between the answer probabilities of Jules and Julie.
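The law of large numbers is easy to visualize by simulation (an illustrative sketch, using the interview probability p = 0.28 from the running example):

    import numpy as np

    rng = np.random.default_rng(3)
    p = 0.28                                    # true interview probability

    for n in (10, 100, 10_000, 1_000_000):
        y_bar = (rng.random(n) < p).mean()      # sample mean of Bernoulli(p) draws
        print(f"N={n:>9}  Y_bar={y_bar:.4f}")   # drifts toward 0.28 as N grows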

Property of consistency

Main property: plim θ̂ = θ implies plim g(θ̂) = g(θ) if g is continuous at θ. This is true for a parameter or a vector of parameters.

Example: Y distributed as a Bernoulli variable of parameter q. We have Ȳ →^P q, and we also have

Ȳ/(1 − Ȳ) →^P q/(1 − q)

the odds ratio.

Asymptotic normality

Convergence in distribution: θ̂ →^D Z, where Z has distribution F_Z, if P(θ̂ < z) → F_Z(z). This is the key notion that will let us approximate the distribution of the estimator in large samples.

Central result: asymptotic distribution of a mean estimator.

√N (Ȳ − µ)/σ →^D N(0, 1)

This means that P(√N (Ȳ − µ)/σ < t) → Φ(t). The result is also true if the unknown parameter σ is replaced by a consistent estimator σ̂ →^P σ:

P(√N (Ȳ − µ)/σ̂ < t) → Φ(t)

Application: (Y_i) a sample of IID variables of unknown distribution but with first and second moments. We already know E(Ȳ) = µ and V(Ȳ) = σ²/N. We now also know the approximate distribution of the variable:

Ȳ ≈ N(µ, σ²/N) = N(µ, σ²(Ȳ))

as if Ȳ had been drawn from a normal distribution.

Example: probability of obtaining an interview from CVs. We can apply the previous result to the estimators of the answer probabilities of Julie and Jules and to their difference D. We now know that when the number of experimental CVs sent to job offers tends to infinity, the distribution of the proportion of interviews is normal:

√N (p̂_H − p_H)/√(p_H(1 − p_H)) = (p̂_H − p_H)/σ(p̂_H) →^D N(0, 1)
√N (p̂_F − p_F)/√(p_F(1 − p_F)) = (p̂_F − p_F)/σ(p̂_F) →^D N(0, 1)
√N (d̂ − d)/√(V(D)) = (d̂ − d)/σ(d̂) →^D N(0, 1)

Confidence interval for a parameter

Definition: a confidence interval for a parameter is an interval, with lower and upper bounds computed from the data, such that the true value of the parameter lies in the interval with a given probability.

Example: estimation of the mean. (Y_i) an IID sample of N(µ, 1); then √N (Ȳ − µ) ~ N(0, 1). So for a given probability α, and bounds B and b such that Φ(B) − Φ(b) = α:

P(b < √N (Ȳ − µ) < B) = Φ(B) − Φ(b) = α

We can derive from this:

P(Ȳ − B/√N < µ < Ȳ − b/√N) = α

[Ȳ − B/√N ; Ȳ − b/√N] is a confidence interval for µ of probability α. The width of the interval is (B − b)/√N. It is minimal for B and b such that b = −B. This implies that Φ(B) = (1 + α)/2: B is the quantile of order (1 + α)/2 of a standardized normal variable.

Typically a confidence interval is obtained for α = 0.95: we want an interval that has probability 0.95 of containing the true value. The corresponding value of B is such that Φ(B) = 0.975. Such a value is B = 1.96.

The 95% confidence interval is therefore

[Ȳ − 1.96/√N ; Ȳ + 1.96/√N]

What does it mean? Assume the true value is zero and N = 100. The confidence interval is then [Ȳ − 0.196 ; Ȳ + 0.196]. If we consider a first sample S_1, we can compute ȳ_{S_1} and the interval [ȳ_{S_1} − 0.196 ; ȳ_{S_1} + 0.196]. There are two possibilities:

- 0 ∈ [ȳ_{S_1} − 0.196 ; ȳ_{S_1} + 0.196], then R_{S_1} = 1
- 0 ∉ [ȳ_{S_1} − 0.196 ; ȳ_{S_1} + 0.196], then R_{S_1} = 0

If you draw a large number H of similar samples, you can look at the proportion of successes (1/H) Σ_{h=1}^{H} R_{S_h}. This proportion should be close to 0.95. This means that in 5% of the cases the true value does not lie in the computed confidence interval.

There is a trade-off between the failure rate and the width of the interval. A small failure rate is desirable, as is a small width. In the previous case the width is 0.39. Confidence intervals can be computed for other α, for example α = 0.90 or α = 0.99. In those cases the confidence intervals for 100 observations are:

[ȳ − 0.164 ; ȳ + 0.164] and [ȳ − 0.258 ; ȳ + 0.258]

The failure rate in the first case is 10% and the width of the interval is 0.33, while the failure rate is 1% in the second case and the width is 0.52.
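The thought experiment described above, drawing many samples and counting how often the interval covers the true value, can be run directly (an illustrative sketch for N(0, 1) data with N = 100):

    import numpy as np

    rng = np.random.default_rng(5)
    N, H = 100, 50_000                      # sample size, number of replicated samples

    samples = rng.normal(loc=0.0, scale=1.0, size=(H, N))
    y_bar = samples.mean(axis=1)
    half_width = 1.96 / np.sqrt(N)          # sigma = 1 is known here

    covered = (y_bar - half_width < 0) & (0 < y_bar + half_width)
    print(f"coverage = {covered.mean():.3f}")   # close to 0.95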

Confidence intervals for the mean of normal variables

In the general case, normal variables have two parameters, Y_i ~ N(µ, σ²). We just know that √N (Ȳ − µ)/σ ~ N(0, 1). This is not sufficient to compute a confidence interval for µ because σ is unknown. However, we also know that

(N − 1) σ̂²/σ² ~ χ²(N − 1)

and that σ̂² and Ȳ are independent.

Definition: with Y_1 ~ N(0, 1) and Y_2 ~ χ²(K), Y_1 and Y_2 independent,

Y_1 / √(Y_2/K) ~ Student(K)

Using this, we get a result that will be of large use:

√N (Ȳ − µ)/σ̂ ~ Student(N − 1)

Remark: V(Ȳ) = σ²/N, so the result states that (Ȳ − µ)/σ̂(Ȳ) ~ Student(N − 1).

Confidence interval of probability α. Let t_{β,K} be the quantile of order β of Student(K): P(S < t_{β,K}) = β with S ~ Student(K). Then

[Ȳ − σ̂(Ȳ) t_{(1+α)/2, N−1} ; Ȳ + σ̂(Ȳ) t_{(1+α)/2, N−1}]

is a confidence interval for µ with probability α:

P(−t_{(1+α)/2, N−1} < (Ȳ − µ)/σ̂(Ȳ) < t_{(1+α)/2, N−1}) = α
P(Ȳ − σ̂(Ȳ) t_{(1+α)/2, N−1} < µ < Ȳ + σ̂(Ȳ) t_{(1+α)/2, N−1}) = α

The quantile t_{(1+α)/2, N−1} is a decreasing function of N, with the normal quantile t_{(1+α)/2} as its limit. When N increases, the width narrows for two reasons:

1. σ̂(Ȳ) decreases: σ̂(Ȳ) = σ̂/√N

2. t_{(1+α)/2, N−1} decreases.

[Tab. 5 Point estimates and confidence intervals for the answering rates and their difference (columns: p̂, Std, Lower bound, Upper bound; rows: Julie, Jules, D)]

Asymptotic confidence intervals for the mean of non-normal variables

We have seen that √N (Ȳ − µ)/σ →^D N(0, 1), and that this remains true when σ is replaced by a consistent estimator: √N (Ȳ − µ)/σ̂ ≈ N(0, 1). σ̂(Ȳ) = σ̂/√N is an estimator of the standard error of the estimated mean. This can be used to construct asymptotic confidence intervals:

[Ȳ − σ̂(Ȳ) t_{(1+α)/2} ; Ȳ + σ̂(Ȳ) t_{(1+α)/2}]

where t_{(1+α)/2} is the quantile of order (1 + α)/2 of the normal distribution. For α = 95%, we have t_{0.975} = 1.96.

This can be used to compute confidence intervals for the probability of an interview from CVs for Julie and Jules, as well as a confidence interval for the difference of their answering rates at 95%; the resulting confidence intervals are given in Table 5.

Hypothesis testing

We frequently want to know whether a parameter takes a given value. For example, we may want to know whether the difference between the answering rates of Jules and Julie is zero or not: d = p_H − p_F = 0? To deal with this issue, it is usual to introduce two hypotheses, called the null hypothesis H_0 and the alternative hypothesis H_1.
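Before turning to tests, the entries of Table 5 can be reproduced from the counts given earlier (a minimal sketch; the 130 count for Jules is the one consistent with the 32.5% rate in the text):

    import math

    n, z = 400, 1.96                       # z = t_0.975 of the normal distribution
    for name, k in [("Julie", 112), ("Jules", 130)]:
        p = k / n
        se = math.sqrt(p * (1 - p) / (n - 1))
        print(f"{name}: p={p:.3f}  95% CI=[{p - z*se:.3f}, {p + z*se:.3f}]")

    p_f, p_h = 112 / n, 130 / n
    d = p_h - p_f
    se_d = math.sqrt((p_h * (1 - p_h) + p_f * (1 - p_f)) / (n - 1))
    print(f"D: d={d:.3f}  95% CI=[{d - z*se_d:.3f}, {d + z*se_d:.3f}]")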

Here, H_0 = {d = 0} and H_1 = {d ≠ 0}. A test of H_0 against H_1 is a decision rule δ(y_1, ..., y_N), a function of the observations which associates to the observations a decision: H_0 is true, or H_1 is true. Usually this decision rule is constructed from a statistic S(y_1, ..., y_N), and there is a critical region W such that the decision rule is: if S ∈ W then reject H_0 and decide H_1 is true; otherwise reject H_1 and accept H_0.

Other alternative hypotheses could be: H_1 = {d > 0} or H_1 = {d < 0}. Tests of H_0 = {d = 0} against H_1 = {d ≠ 0} are called bilateral (two-sided) tests. Tests of H_0 = {d = 0} against H_1 = {d > 0}, or against H_1 = {d < 0}, are called unilateral (one-sided) tests.

Different errors are possible. The first is that we may decide that H_0 is wrong while it is true: this is called an error of type I. The second is that we may decide that H_0 is true while H_1 is true: this is called an error of type II (see Tab. 6).

             Decide H_0 true    Decide H_1 true
Truth H_0    No error           Error of type I
Truth H_1    Error of type II   No error

Tab. 6 Types of error

Any decision rule induces a probability for each of these errors. Two risks are associated with a test: α = P(reject H_0 | H_0 true), the probability of an error of type I, and β = P(accept H_0 | H_1 true), the probability of an error of type II. Tests are designed so that, for a given value of α, a parameter of the test called its level, the decision rule minimizes the probability of an error of type II.

Test of equality of the mean to a given value

We consider a sample of IID observations (y_i) drawn from Y. The expectation of Y is µ and we want to test the hypothesis H_0 = {µ = µ_0}. We consider either a unilateral test, for which the alternative hypothesis is one-sided,

H_1 = {µ > µ_0}

or a bilateral test, for which the alternative is two-sided:

H_1 = {µ ≠ µ_0}

In both cases, if the estimated mean µ̂ is far from the value considered, µ_0, we will reject H_0. "Far" will be determined on the basis of the estimated variance of the estimator: if the estimator is very imprecise, then "far" will be very far, but when the estimation is accurate, "far" may become not so far. The precise meaning of "far" is a function of the risk of a type I error, α, and of the distribution of the estimator of the mean.

Case of normal variables: Y ~ N(µ, σ²). We know that

(Ȳ − µ)/σ̂(Ȳ) ~ Student(N − 1), with σ̂(Ȳ)² = σ̂²/N = (1/(N(N − 1))) Σ (y_i − ȳ)²

Under H_0, µ = µ_0, we therefore have T = (Ȳ − µ_0)/σ̂(Ȳ) ~ Student(N − 1), so T takes extreme values only with small probability. The principle of the test is to define the critical region as formed of the least probable values of T: {|t| > c}. c is determined as a function of α and of the distribution of T, Student(N − 1).

Case of a two-sided test: under the null hypothesis, T ~ Student(N − 1), and the probability that |T| > c is

P(|T| > c) = P(T > c) + P(T < −c) = 1 − F_Student(c) + F_Student(−c) = 2(1 − F_Student(c)) = α

Thus for c = t_{1−α/2, N−1}, the quantile of order 1 − α/2 of the Student distribution with N − 1 degrees of freedom, we have P(|T| > c) = α. The critical region is defined by {|t| > t_{1−α/2, N−1}}.

Case of a one-sided test: the alternative hypothesis is H_1: µ > µ_0. Under the null hypothesis, we still have T ~ Student(N − 1), but the rejection region only involves positive values of T: t > c. The corresponding probability under the null hypothesis is

P(T > c) = 1 − F_Student(c) = α

Thus for c = t_{1−α, N−1}, the quantile of order 1 − α of the Student distribution with N − 1 degrees of freedom, we have P(T > c) = α. The critical region is defined by {t > t_{1−α, N−1}}. Table 7 summarizes the critical regions and p-values.

H_1        Critical region (level α)   N = 20       N = 50       N large      p-value
µ ≠ µ_0    |t| > t_{1−α/2, N−1}        |t| > 2.09   |t| > 2.01   |t| > 1.96   2(1 − F(|t|))
µ > µ_0    t > t_{1−α, N−1}            t > 1.72     t > 1.68     t > 1.65     1 − F(t)
µ < µ_0    t < −t_{1−α, N−1}           t < −1.72    t < −1.68    t < −1.65    F(t)

Tab. 7 Critical regions and p-values for one- and two-sided tests in the normal case, for α = 0.05; t = (ȳ − µ_0)/σ̂(ȳ)

p-values: once the statistic t has been computed, it is also possible to work with p-values. The p-value is a transformation of the t statistic that gives the level of the test at which you would just reject the null hypothesis. Consider a two-sided test, and let F be the distribution of the test statistic under the null: F is the cumulative distribution of a Student(N − 1). Rejection regions are of the form {|T| > critical value(α)} for a test at level α, constructed such that P(|T| > critical value(α)) = α under the null. One can compute similarly the probability that the absolute value of a Student variable is beyond the t computed from the observations: p = P(|T| > |t|) = 2(1 − F(|t|)). p is the p-value. It provides the same information as the t statistic, but in a more directly interpretable way: if you want to perform a test at the α level, you reject H_0 when p < α.
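For a concrete computation, here is a sketch of the two-sided Student test on a toy sample (illustrative only; scipy's built-in ttest_1samp gives the same t and p-value):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    y = rng.normal(loc=0.3, scale=1.0, size=20)   # toy sample; we test H0: mu = 0

    t = (y.mean() - 0.0) / (y.std(ddof=1) / np.sqrt(len(y)))
    p_two_sided = 2 * (1 - stats.t.cdf(abs(t), df=len(y) - 1))
    print(f"t = {t:.3f}, p = {p_two_sided:.3f}")  # reject H0 at 5% when p < 0.05

    # the built-in test gives the same statistic and p-value
    print(stats.ttest_1samp(y, popmean=0.0))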

Asymptotic test of equality of the mean

The previous method extends directly to the case of a large sample, even when the distribution of the underlying variable is unknown. The key result is that

(Ȳ − µ)/σ̂(Ȳ) ≈ N(0, 1), with σ̂(Ȳ)² = σ̂²/N = (1/(N(N − 1))) Σ (y_i − ȳ)²

(dividing by N or N − 1 does not change anything here, because the sample size is assumed to be large). Under H_0: µ = µ_0, the t statistic satisfies

T = (Ȳ − µ_0)/σ̂(Ȳ) ≈ N(0, 1)

This can be used to obtain rejection regions and critical values for tests at the α level (see Tab. 8).

Case of a two-sided test: under the null hypothesis, T ≈ N(0, 1), and the probability that |T| > c is

P(|T| > c) = P(T > c) + P(T < −c) = 1 − Φ(c) + Φ(−c) = 2(1 − Φ(c)) = α

Thus for c = t_{1−α/2}, the quantile of order 1 − α/2 of the normal distribution, we have P(|T| > c) = α. The critical region is defined by {|t| > t_{1−α/2}}.

Case of a one-sided test: the alternative hypothesis is H_1: µ > µ_0. Under the null hypothesis, we still have T ≈ N(0, 1), but the rejection region only involves positive values of T: t > c. The corresponding probability under the null hypothesis is P(T > c) = 1 − Φ(c) = α. Thus for c = t_{1−α}, the quantile of order 1 − α of the normal distribution, we have P(T > c) = α. The critical region is defined by {t > t_{1−α}}.

H_1        Critical region (level α)   N large      p-value
µ ≠ µ_0    |t| > t_{1−α/2}             |t| > 1.96   2(1 − Φ(|t|))
µ > µ_0    t > t_{1−α}                 t > 1.65     1 − Φ(t)
µ < µ_0    t < −t_{1−α}                t < −1.65    Φ(t)

Tab. 8 Critical regions and p-values for one- and two-sided asymptotic tests, for α = 0.05; t = (ȳ − µ_0)/σ̂(ȳ)

Example: CV interviews and gender. In the case of the experiment on CVs, we may want to test that there is no difference between the interview rates for Julie and Jules. Formally, we want to test H_0: d = 0 against H_1: d ≠ 0. We could also consider a test of no discrimination against women, i.e. H_0 against H_1: d > 0. The asymptotic t is t = 0.045/0.0325 = 1.39.

Compared with the critical value for a two-sided test at 5%, 1.96, or with the critical value for a one-sided test at 5%, 1.65, we see that we cannot reject the null hypothesis in either case. We can also compute the p-values for these tests. For the two-sided test we have p = 2(1 − Φ(|t|)) = 2(1 − Φ(1.39)) = 16.4%, and for the one-sided test p = 1 − Φ(t) = 1 − Φ(1.39) = 8.2%. For the test of no discrimination (one-sided), this means that for tests at a level below 8.2% we will not reject the assumption that there is no discrimination.

Power of a test

The power of a test is the probability of rejection when H_1 is the truth. The power is π = 1 − β, where β is the risk of an error of type II. This power can be computed for different values of µ in H_1. Consider testing H_0: µ = µ_0 and assume the truth is µ = µ_1 = µ_0 + δ ∈ H_1. Then the asymptotic behavior of the mean estimator is such that

(Ȳ − µ_1)/σ̂(Ȳ) ≈ N(0, 1)

but we consider T = (Ȳ − µ_0)/σ̂(Ȳ) instead, and thus

T − δ/σ̂(Ȳ) ≈ N(0, 1)

from which we compute, for the two-sided test,

π(δ) = 1 − Φ(t_{1−α/2} − δ/σ̂(Ȳ)) + Φ(−t_{1−α/2} − δ/σ̂(Ȳ))
     = Φ(−t_{1−α/2} + √N δ/σ) + Φ(−t_{1−α/2} − √N δ/σ)

because Φ(−t) = 1 − Φ(t) and σ̂(Ȳ) = σ̂/√N. It is clear that the power of the test tends to 1 when the number of observations tends to infinity. It is also clear that for δ close to zero, although the power will tend to 1 in the end, the power may be small even for large N, depending on σ.
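The power function π(δ) derived above can be evaluated directly (an illustrative sketch; at δ = 0 it returns the level α, and it increases toward 1 with |δ| and N):

    import numpy as np
    from scipy.stats import norm

    def power_two_sided(delta, sigma, n, alpha=0.05):
        """pi(delta) = Phi(-z + sqrt(n) d/s) + Phi(-z - sqrt(n) d/s)."""
        z = norm.ppf(1 - alpha / 2)
        shift = np.sqrt(n) * delta / sigma
        return norm.cdf(-z + shift) + norm.cdf(-z - shift)

    for delta in (0.0, 0.05, 0.1, 0.2):
        print(f"delta={delta:.2f}  power(N=400)={power_two_sided(delta, 1.0, 400):.3f}")
    # power rises from alpha (at delta = 0) toward 1 as |delta| or N grows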

[Fig. 1 Power of the test]

The Simple Linear Regression Model

We consider the simple linear model

y = a + xb + u

where x is of dimension 1. We have a random sample of N IID draws. This is a special case of a more general model in which there is more than one x variable. The model with one variable is of interest because it makes it easy to understand the basics of the more general models. Frequently, also, one is interested in the effect of one single variable x on an output variable y. The simple linear regression model is a good starting point.

Examples:
- effect of job training x ∈ {0, 1} on wages
- number of years of schooling and wage
- capital and labor productivity
- duration of unemployment and unemployment benefits
- corn yield and fertilizer use
- employment status and job subsidies
- success of students and class size

Meaning of the parameter b

In writing the model under the linear form y = a + xb + u, we see that the specification implies that b measures the variation of y associated with a variation of x:

b = ∂y/∂x

In this expression there is a causal interpretation: b is the change that we would observe in y if we decided a mandatory change in x. b is related to economic parameters: preferences, technology, or a mix of them.

Examples:

- production function: Y = A K^α L^(1−α), so log(Y/L) = α log(K/L) + a. α is the elasticity of productivity with respect to capital intensity.
- corn yield: effect of an increase in fertilizer use on corn yield; this is a technical relationship.
- years of schooling and wages: this is a parameter measuring the extent to which schooling increases human capital and how the labor market prices it.
- duration of unemployment and unemployment benefits: economic theory tells us that increasing unemployment benefits increases the reservation wage of the unemployed, makes the unemployed more selective, and thus increases the duration of unemployment.

The meaning of b depends on the way the variables enter the equation. Consider Y and X two variables:
- in levels, y = Y and x = X: b is the effect of a marginal increase in X on Y.
- y = log(Y) and x = log(X): b is the log derivative of Y with respect to X: an increase of 1% in X increases Y by b%. b is called the elasticity of Y with respect to X.
- y = log(Y) and x = X: a change of 1 in X leads to an increase in Y of 100·b%. b is called the semi-elasticity of Y with respect to X.
- y = Y and x = log(X): a change of 1% in X leads to an increase in Y of b/100. b is also called a semi-elasticity of Y with respect to X.

Under specific assumptions we will obtain an estimator of the parameter that has good statistical properties, i.e. that allows us to compute confidence intervals for the parameter and to perform tests. Mainly these conditions are:

E(u | x) = 0 and E(u) = 0

E(u) = 0 is not important: it is just a matter of interpretation of the parameter a. We could rewrite the model introducing v = u − E(u) and ã = a + E(u). The key assumption is E(u | x) = 0. It means that for different values of x the average value of the unobserved term is unchanged. It is usually difficult to prove such a strong property, and there are usually loads of reasons for it to fail.

Examples: Ability bias and return to schooling. There is an unobserved factor, ability, that conditions the schooling decision and also labor market productivity.

Corn yield and fertilizer use: in the corn yield equation, the unobserved term involves the quality of the land, and the quality of the land is also a determinant of fertilizer use. However, we may perform field experiments: we randomly choose in which fields fertilizers will be used, and randomly choose the amount of fertilizer used. In such a case, fertilizer use is uncorrelated by construction with the unobserved term.

Matrix form: the model can be expressed in matrix form, from the vectors and matrices of observations and residuals:

Y = (y_1, ..., y_N)',  X = [1 x_1 ; ... ; 1 x_N],  U = (u_1, ..., u_N)'

where Y is the N × 1 vector of observations of the dependent variable, X is the N × 2 matrix of explanatory variables, including the intercept, and U is the vector of residuals for the N observations. The model writes as

Y = (a + x_1 b + u_1, ..., a + x_N b + u_N)' = X (a, b)' + U = Xβ + U

where we introduce β, the vector of parameters.

The ordinary least squares estimator

Definition: it is the pair of parameters (â, b̂) that solves:

min_{ã, b̃} Σ (y_i − ã − x_i b̃)²

The idea of the estimator is to choose the parameter values so that the sum of squares of the residual values y_i − ã − x_i b̃ is as small as possible. We are

looking for the linear combination of the observations (1, x) that best fits the data (y). It is straightforward to solve this problem. The first-order conditions are:

Σ (y_i − â − x_i b̂) = 0
Σ (y_i − â − x_i b̂) x_i = 0

This gives the following expressions:

b̂ = Σ (y_i − ȳ)(x_i − x̄) / Σ (x_i − x̄)² = côv(x, y) / V̂(x)
â = ȳ − x̄ b̂

The slope parameter b̂ is simply the ratio of the covariance of the two variables to the variance of the explanatory variable. Notice that the estimator of b is a linear combination of the observations y_i:

b̂ = Σ (y_i − ȳ)(x_i − x̄) / Σ (x_i − x̄)² = Σ y_i (x_i − x̄) / Σ (x_i − x̄)²

Examples:

1. Consider the case of an explanatory variable X which is a Bernoulli variable of probability p, X ∈ {0, 1}. We can give explicit expressions for the parameters of the model:

b̂ = ȳ_1 − ȳ_0,  â = ȳ_0

where ȳ_k is the empirical mean of y on the subsample where X = k.

2. Consider the model of return to schooling:

log(wage) = α + β educ + u

Using data from the Wooldridge data set WAGE1, which can be found at http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html, we obtain the characteristics of the sample reported in Table 9. The empirical covariance between log(wage) and education is 0.63, while the empirical variance of education is 7.65. We deduce from these moments the estimated values of the coefficients (see Table 10). The semi-elasticity of wages with respect to education is 0.63/7.65 = 0.08.
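The covariance/variance formulas can be checked on simulated data (an illustrative sketch; the true values a = 0.58 and b = 0.08 are borrowed from the Mincer example, the rest of the design is arbitrary):

    import numpy as np

    rng = np.random.default_rng(8)
    N = 500
    educ = rng.integers(8, 19, size=N).astype(float)      # simulated schooling years
    u = rng.normal(0, 0.4, size=N)
    log_wage = 0.58 + 0.08 * educ + u                     # true a = 0.58, b = 0.08

    # OLS via the covariance/variance formulas
    b_hat = np.cov(educ, log_wage, ddof=1)[0, 1] / np.var(educ, ddof=1)
    a_hat = log_wage.mean() - educ.mean() * b_hat
    print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")    # close to 0.58 and 0.08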

[Tab. 9 Mean and variance of log(wage) and education (columns: Mean, Var; rows: log wage, educ)]

     Expression           Estimate
â    ȳ − x̄ b̂              0.58
b̂    côv(x, y) / V̂(x)     0.08

Tab. 10 Estimation of a Mincer equation

Predicted values and estimated residuals

Each observation can be decomposed into a component explained by the model and a residual component:

ŷ = â + x b̂
û = y − ŷ = y − â − x b̂

Properties:
- Estimated residuals are orthogonal to the space generated by x and 1: Σ û_i x_i = 0 and Σ û_i = 0.
- Estimated residuals and predicted values are orthogonal: Σ ŷ_i û_i = 0.
- The sample average of the observations and the sample average of the fitted values are identical: y = ŷ + û and Σ û_i = 0 imply that the mean of ŷ equals ȳ.

The total empirical variance of y decomposes into the variance explained by the model, V̂(ŷ), and the residual variance, V̂(û):

y − ȳ = (ŷ − ȳ) + û
Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + 2 Σ (ŷ_i − ȳ) û_i + Σ û_i²
(1/(N − 1)) Σ (y_i − ȳ)² = (1/(N − 1)) Σ (ŷ_i − ȳ)² + (1/(N − 1)) Σ û_i²

since the cross term Σ (ŷ_i − ȳ) û_i = 0.

[Fig. 2 Mincer equation: observed sample and regression line]

[Tab. 11 Total variance, explained variance and residual variance of log(wage)]

Goodness of fit

The R² is defined by R² = Σ (ŷ_i − ȳ)² / Σ (y_i − ȳ)². It is the share of the variance explained by the model, with R² ∈ [0, 1]. The closer the R² is to one, the better the fit of the model. Usually, however, the R² is low.

Example: Mincer equation. Table 11 presents the mean squared error of log(wage), (1/(N − 1)) Σ (y_i − ȳ)², and its decomposition into the mean squared error of the predicted values, (1/(N − 1)) Σ (ŷ_i − ȳ)², and the mean squared error of the residuals. From these we deduce the value of the R² as the ratio of explained to total variance.

Matrix expressions

We can define the vector of predicted values

Ŷ = (â + x_1 b̂, ..., â + x_N b̂)' = X β̂

Similarly, one can define the vector of estimated residuals

Û = Y − Ŷ = (y_1 − â − x_1 b̂, ..., y_N − â − x_N b̂)' = Y − X β̂

The equations defining the estimators can be written as

X'(Y − X β̂) = 0

Thus the estimator is also defined in matrix form as

β̂ = (X'X)^(−1) X'Y
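A minimal matrix-form implementation (an illustrative sketch on simulated data, not the WAGE1 file):

    import numpy as np

    rng = np.random.default_rng(9)
    N = 500
    x = rng.normal(size=N)
    y = 0.5 + 2.0 * x + rng.normal(size=N)

    X = np.column_stack([np.ones(N), x])            # N x 2 matrix with intercept
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'Y

    y_fit = X @ beta_hat
    r2 = ((y_fit - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    print(beta_hat, r2)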

We have the orthogonality relations X'Û = 0 and Ŷ'Û = 0.

Bias and variance of the OLS estimator

Consider the model y = a + xb + u, in which b has a causal economic meaning. The parameters â_OLS and b̂_OLS can always be defined. The question is: what is the relationship between the parameter b we are interested in and the estimated parameter?

First, the value of the estimated parameter is not the true value b. It is a function of the observations in the sample: if we use another sample drawn from the same distribution of observations, the estimated value will change. The estimated parameter must be seen as a random variable, and we are interested in the statistical properties of this estimator.

Unbiasedness of the OLS estimator

Under the two conditions
- Cond. 1: the sample is random (observations are IID)
- Cond. 2: E(u | x) = 0
(in matrix form the two conditions imply E(U | X) = 0, and hence E(Y | X) = Xβ), the parameter b̂_OLS is an unbiased estimator of the parameter b:

E(b̂_OLS) = b

This means that if we had a large number of independent samples, and had computed for each of these samples an estimate of the parameter b, the empirical distribution of the obtained estimates would be centered around the true value b of the parameter.

Proof: the OLS estimator is a linear combination of the observations y_i, with weights that are functions of the (x_i). We therefore have

E(b̂_OLS | x_1, ..., x_N) = E( Σ (y_i − ȳ)(x_i − x̄) / Σ (x_i − x̄)² | x_1, ..., x_N )
                         = Σ (E(y_i | x_1, ..., x_N) − E(ȳ | x_1, ..., x_N)) (x_i − x̄) / Σ (x_i − x̄)²

Now E(y_i | x_1, ..., x_N) = a + x_i b, because E(u_i | x_1, ..., x_N) = E(u_i | x_i) = 0 by independence and Cond. 2. Similarly E(ȳ | x_1, ..., x_N) = a + x̄ b. Thus

E(b̂_OLS | x_1, ..., x_N) = Σ (x_i b − x̄ b)(x_i − x̄) / Σ (x_i − x̄)² = b

and finally E(b̂_OLS) = E(E(b̂_OLS | x_1, ..., x_N)) = E(b) = b.

Remark, in matrix form:

E(β̂_OLS | X) = E((X'X)^(−1) X'Y | X) = (X'X)^(−1) X' E(Y | X) = (X'X)^(−1) X'X β = β

Variance of the OLS estimator

The variance of the OLS estimator conditional on the explanatory variables is a function of the conditional variance of the residuals. Since the y_i are conditionally uncorrelated (Cond. 1),

V(b̂_OLS | x_1, ..., x_N) = Σ V(y_i | x_1, ..., x_N)(x_i − x̄)² / (Σ (x_i − x̄)²)²
                         = (1/Σ (x_i − x̄)²) · Σ (x_i − x̄)² V(u_i | x_i) / Σ (x_i − x̄)²

Cond. 3: V(u_i | x_i) = σ².

Remark: Cond. 1 implies that the covariance of two residuals conditional on the explanatory variables is zero, Cov(u_i, u_j | x_1, ..., x_N) = 0. Thus Cond. 3 and Cond. 1 imply that the variance matrix of the residuals is diagonal:

V(U | X) = σ² I_N

where I_N is the identity matrix of size N × N. We also directly have V(Y | X) = σ² I_N.

This condition means that the variance of the residual does not depend on the explanatory variables.

There are models of interest that do not satisfy this condition. For example, if the dependent variable D_i is a dummy variable that takes its values in {0, 1}, we may write a regression model as D_i = x_i b + u_i. We have E(D_i | X) = x_i b, and also P(D_i = 1 | X) = x_i b; we can then easily compute the conditional variance, V(D_i | X) = x_i b (1 − x_i b), which depends on x_i.

Under Cond. 3, the variance of the estimator conditional on the explanatory variables is

V(b̂_OLS | x_1, ..., x_N) = σ² / Σ (x_i − x̄)² = σ² / (N V̂(x))

There are three key aspects in this expression:

1. When the number of observations increases, the variance of the estimator decreases rapidly, at order 1/N.
2. When the variance of x increases, the variance of the estimator decreases. If you think of the data as the result of experiments, it is because the explanatory variable changes across repeated situations that we can detect the effect of x. The more the experimental setups vary, the more there is to learn from the experiments (given the linear model).
3. When the residual variance increases, the variance of the estimator increases. The relationship between the explanatory variable and the output variable is noisy: the more important this noise, the harder it is to detect the specific changes in the output variable related to changes in the values of the explanatory variable.

The previous expression is conditional on the values of the explanatory variables. For a large sample, the empirical variance of the sampled x will tend to the variance of the distribution of x, and the dependence with respect to x no longer matters. In finite samples, however, the variance of the estimator depends on the values taken by x.

Remark: you can already notice that, as the estimator is unbiased and as its variance shrinks to zero when the number of observations gets large, the estimator converges to b when the number of observations is large.

Matrix expression:

Var(β̂ | X) = Var((X'X)^(−1) X'Y | X) = (X'X)^(−1) X' Var(Y | X) X (X'X)^(−1)

and because Var(Y | X) = σ² I_N,

Var(β̂ | X) = (X'X)^(−1) X' σ² I_N X (X'X)^(−1) = σ² (X'X)^(−1)

We directly obtain from this expression the variances and covariance of the parameters (noting m_2 = (1/N) Σ x_i² the empirical mean of the squares and x̄ = (1/N) Σ x_i):

V(â | X) = (σ²/N) m_2 / (m_2 − x̄²)
Cov(â, b̂ | X) = −(σ²/N) x̄ / (m_2 − x̄²)
V(b̂ | X) = (σ²/N) · 1 / (m_2 − x̄²)

Estimation of the variance

To estimate the variance of the OLS estimator, we need an estimator σ̂² of the variance σ² of the residuals. A natural estimator would be σ̂² = Σ (u_i − ū)²/(N − 1); we know this would be an unbiased estimator of the variance of u_i. However, we do not observe the residuals; we only have estimated residuals. What we can show is that it is nevertheless possible to obtain an unbiased estimator of the variance using these estimated residuals:

σ̂² = Σ û_i² / (N − 2)

One difference is that the sum of squares is divided by N − 2 rather than by N − 1. This is because we have lost degrees of freedom: we impose two constraints on the estimated residuals,

Σ û_i = 0 and Σ û_i x_i = 0

while for the estimation of the variance of a variable z_i we consider z_i = E(z_i) + ε_i, and the estimated variance σ̂_z² = Σ ε̂_i²/(N − 1), with ε̂_i = z_i − z̄, satisfies only one constraint, Σ ε̂_i = 0.

Formally, u_i = û_i + (â − a) + x_i (b̂ − b). Taking the square and summing yields

Σ u_i² = Σ û_i² + N (â − a)² + 2 N x̄ (â − a)(b̂ − b) + N m_2 (b̂ − b)²

where we account for the fact that Σ û_i = 0 and Σ û_i x_i = 0. Taking expectations and using the expressions for the variances of the estimators,

N σ² = E(Σ û_i² | X) + (m_2 − 2x̄² + m_2) σ² / (m_2 − x̄²) = E(Σ û_i² | X) + 2σ²

Thus

E( Σ û_i² / (N − 2) | X ) = σ²

Estimation of the variance of the estimated parameters

We can estimate the variance of the parameters without bias, replacing σ² by its unbiased estimator σ̂²:

V̂(â | X) = (σ̂²/N) m_2 / (m_2 − x̄²)
Côv(â, b̂ | X) = −(σ̂²/N) x̄ / (m_2 − x̄²)
V̂(b̂ | X) = (σ̂²/N) · 1 / (m_2 − x̄²)

Example: Mincer equation. We can compute the sum of squared residuals as well as the empirical mean and variance of schooling. We find an estimated value of 0.23 for the variance of the residuals, and a much larger variance of the explanatory variable, 7.65. In the end, the variance of the slope parameter b̂ is very small, and the corresponding standard error shows that the estimation is very accurate. We will soon see how to use standard errors, with a more precise probabilistic statement, to build confidence intervals.
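Putting the pieces together, here is a sketch of the full computation of σ̂² and of the standard errors on simulated data (the sample size and design are arbitrary choices, not the WAGE1 values):

    import numpy as np

    rng = np.random.default_rng(10)
    N = 526                                   # a WAGE1-sized sample (assumption)
    x = rng.normal(12.5, 2.8, size=N)
    y = 0.58 + 0.08 * x + rng.normal(0, 0.48, size=N)

    X = np.column_stack([np.ones(N), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    u_hat = y - X @ beta

    sigma2_hat = (u_hat @ u_hat) / (N - 2)            # divide by N-2: two constraints
    V_beta = sigma2_hat * np.linalg.inv(X.T @ X)      # sigma^2 (X'X)^{-1}
    se = np.sqrt(np.diag(V_beta))
    print(f"b_hat = {beta[1]:.4f}, se(b_hat) = {se[1]:.4f}")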


Econometrics I KS. Module 1: Bivariate Linear Regression. Alexander Ahammer. This version: March 12, 2018 Econometrics I KS Module 1: Bivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: March 12, 2018 Alexander Ahammer (JKU) Module 1: Bivariate

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data

Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data Panel data Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data - possible to control for some unobserved heterogeneity - possible

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

Applied Econometrics - QEM Theme 1: Introduction to Econometrics Chapter 1 + Probability Primer + Appendix B in PoE

Applied Econometrics - QEM Theme 1: Introduction to Econometrics Chapter 1 + Probability Primer + Appendix B in PoE Applied Econometrics - QEM Theme 1: Introduction to Econometrics Chapter 1 + Probability Primer + Appendix B in PoE Warsaw School of Economics Outline 1. Introduction to econometrics 2. Denition of econometrics

More information

Applied Quantitative Methods II

Applied Quantitative Methods II Applied Quantitative Methods II Lecture 4: OLS and Statistics revision Klára Kaĺıšková Klára Kaĺıšková AQM II - Lecture 4 VŠE, SS 2016/17 1 / 68 Outline 1 Econometric analysis Properties of an estimator

More information

Quantitative Analysis of Financial Markets. Summary of Part II. Key Concepts & Formulas. Christopher Ting. November 11, 2017

Quantitative Analysis of Financial Markets. Summary of Part II. Key Concepts & Formulas. Christopher Ting. November 11, 2017 Summary of Part II Key Concepts & Formulas Christopher Ting November 11, 2017 christopherting@smu.edu.sg http://www.mysmu.edu/faculty/christophert/ Christopher Ting 1 of 16 Why Regression Analysis? Understand

More information

The Simple Regression Model. Part II. The Simple Regression Model

The Simple Regression Model. Part II. The Simple Regression Model Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square

More information

Limited Dependent Variables and Panel Data

Limited Dependent Variables and Panel Data and Panel Data June 24 th, 2009 Structure 1 2 Many economic questions involve the explanation of binary variables, e.g.: explaining the participation of women in the labor market explaining retirement

More information

1. The OLS Estimator. 1.1 Population model and notation

1. The OLS Estimator. 1.1 Population model and notation 1. The OLS Estimator OLS stands for Ordinary Least Squares. There are 6 assumptions ordinarily made, and the method of fitting a line through data is by least-squares. OLS is a common estimation methodology

More information

Problem 1 (20) Log-normal. f(x) Cauchy

Problem 1 (20) Log-normal. f(x) Cauchy ORF 245. Rigollet Date: 11/21/2008 Problem 1 (20) f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 4 2 0 2 4 Normal (with mean -1) 4 2 0 2 4 Negative-exponential x x f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.5

More information

Lecture 3: Multiple Regression

Lecture 3: Multiple Regression Lecture 3: Multiple Regression R.G. Pierse 1 The General Linear Model Suppose that we have k explanatory variables Y i = β 1 + β X i + β 3 X 3i + + β k X ki + u i, i = 1,, n (1.1) or Y i = β j X ji + u

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information

ECON3150/4150 Spring 2015

ECON3150/4150 Spring 2015 ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2

More information

Economics 113. Simple Regression Assumptions. Simple Regression Derivation. Changing Units of Measurement. Nonlinear effects

Economics 113. Simple Regression Assumptions. Simple Regression Derivation. Changing Units of Measurement. Nonlinear effects Economics 113 Simple Regression Models Simple Regression Assumptions Simple Regression Derivation Changing Units of Measurement Nonlinear effects OLS and unbiased estimates Variance of the OLS estimates

More information

Chapter 2: simple regression model

Chapter 2: simple regression model Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 16, 2013 Outline Introduction Simple

More information

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Lecture 25: Review. Statistics 104. April 23, Colin Rundel Lecture 25: Review Statistics 104 Colin Rundel April 23, 2012 Joint CDF F (x, y) = P [X x, Y y] = P [(X, Y ) lies south-west of the point (x, y)] Y (x,y) X Statistics 104 (Colin Rundel) Lecture 25 April

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Chapter 6. Panel Data. Joan Llull. Quantitative Statistical Methods II Barcelona GSE

Chapter 6. Panel Data. Joan Llull. Quantitative Statistical Methods II Barcelona GSE Chapter 6. Panel Data Joan Llull Quantitative Statistical Methods II Barcelona GSE Introduction Chapter 6. Panel Data 2 Panel data The term panel data refers to data sets with repeated observations over

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Instrumental Variables

Instrumental Variables Instrumental Variables Department of Economics University of Wisconsin-Madison September 27, 2016 Treatment Effects Throughout the course we will focus on the Treatment Effect Model For now take that to

More information

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, 2016-17 Academic Year Exam Version: A INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This

More information

Preliminary Statistics Lecture 3: Probability Models and Distributions (Outline) prelimsoas.webs.com

Preliminary Statistics Lecture 3: Probability Models and Distributions (Outline) prelimsoas.webs.com 1 School of Oriental and African Studies September 2015 Department of Economics Preliminary Statistics Lecture 3: Probability Models and Distributions (Outline) prelimsoas.webs.com Gujarati D. Basic Econometrics,

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

Using regression to study economic relationships is called econometrics. econo = of or pertaining to the economy. metrics = measurement

Using regression to study economic relationships is called econometrics. econo = of or pertaining to the economy. metrics = measurement EconS 450 Forecasting part 3 Forecasting with Regression Using regression to study economic relationships is called econometrics econo = of or pertaining to the economy metrics = measurement Econometrics

More information

Introduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017

Introduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017 Introduction to Regression Analysis Dr. Devlina Chatterjee 11 th August, 2017 What is regression analysis? Regression analysis is a statistical technique for studying linear relationships. One dependent

More information

3. Probability and Statistics

3. Probability and Statistics FE661 - Statistical Methods for Financial Engineering 3. Probability and Statistics Jitkomut Songsiri definitions, probability measures conditional expectations correlation and covariance some important

More information

ECON Introductory Econometrics. Lecture 2: Review of Statistics

ECON Introductory Econometrics. Lecture 2: Review of Statistics ECON415 - Introductory Econometrics Lecture 2: Review of Statistics Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 2-3 Lecture outline 2 Simple random sampling Distribution of the sample

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Panel Data Models. Chapter 5. Financial Econometrics. Michael Hauser WS17/18 1 / 63

Panel Data Models. Chapter 5. Financial Econometrics. Michael Hauser WS17/18 1 / 63 1 / 63 Panel Data Models Chapter 5 Financial Econometrics Michael Hauser WS17/18 2 / 63 Content Data structures: Times series, cross sectional, panel data, pooled data Static linear panel data models:

More information

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators Thilo Klein University of Cambridge Judge Business School Session 4: Linear regression,

More information

LECTURE 5. Introduction to Econometrics. Hypothesis testing

LECTURE 5. Introduction to Econometrics. Hypothesis testing LECTURE 5 Introduction to Econometrics Hypothesis testing October 18, 2016 1 / 26 ON TODAY S LECTURE We are going to discuss how hypotheses about coefficients can be tested in regression models We will

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

Making sense of Econometrics: Basics

Making sense of Econometrics: Basics Making sense of Econometrics: Basics Lecture 2: Simple Regression Egypt Scholars Economic Society Happy Eid Eid present! enter classroom at http://b.socrative.com/login/student/ room name c28efb78 Outline

More information

Introduction to Panel Data Analysis

Introduction to Panel Data Analysis Introduction to Panel Data Analysis Youngki Shin Department of Economics Email: yshin29@uwo.ca Statistics and Data Series at Western November 21, 2012 1 / 40 Motivation More observations mean more information.

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han Econometrics Honor s Exam Review Session Spring 2012 Eunice Han Topics 1. OLS The Assumptions Omitted Variable Bias Conditional Mean Independence Hypothesis Testing and Confidence Intervals Homoskedasticity

More information

Simple Linear Regression: The Model

Simple Linear Regression: The Model Simple Linear Regression: The Model task: quantifying the effect of change X in X on Y, with some constant β 1 : Y = β 1 X, linear relationship between X and Y, however, relationship subject to a random

More information

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Fall, 2013 Page 1 Random Variable and Probability Distribution Discrete random variable Y : Finite possible values {y

More information

BASICS OF PROBABILITY

BASICS OF PROBABILITY October 10, 2018 BASICS OF PROBABILITY Randomness, sample space and probability Probability is concerned with random experiments. That is, an experiment, the outcome of which cannot be predicted with certainty,

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions

More information

WISE International Masters

WISE International Masters WISE International Masters ECONOMETRICS Instructor: Brett Graham INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This examination paper contains 32 questions. You are

More information

The general linear regression with k explanatory variables is just an extension of the simple regression as follows

The general linear regression with k explanatory variables is just an extension of the simple regression as follows 3. Multiple Regression Analysis The general linear regression with k explanatory variables is just an extension of the simple regression as follows (1) y i = β 0 + β 1 x i1 + + β k x ik + u i. Because

More information

x i = 1 yi 2 = 55 with N = 30. Use the above sample information to answer all the following questions. Show explicitly all formulas and calculations.

x i = 1 yi 2 = 55 with N = 30. Use the above sample information to answer all the following questions. Show explicitly all formulas and calculations. Exercises for the course of Econometrics Introduction 1. () A researcher is using data for a sample of 30 observations to investigate the relationship between some dependent variable y i and independent

More information

Eco517 Fall 2014 C. Sims FINAL EXAM

Eco517 Fall 2014 C. Sims FINAL EXAM Eco517 Fall 2014 C. Sims FINAL EXAM This is a three hour exam. You may refer to books, notes, or computer equipment during the exam. You may not communicate, either electronically or in any other way,

More information

Empirical approaches in public economics

Empirical approaches in public economics Empirical approaches in public economics ECON4624 Empirical Public Economics Fall 2016 Gaute Torsvik Outline for today The canonical problem Basic concepts of causal inference Randomized experiments Non-experimental

More information

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 8 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 25 Recommended Reading For the today Instrumental Variables Estimation and Two Stage

More information

ECONOMETRICS HONOR S EXAM REVIEW SESSION

ECONOMETRICS HONOR S EXAM REVIEW SESSION ECONOMETRICS HONOR S EXAM REVIEW SESSION Eunice Han ehan@fas.harvard.edu March 26 th, 2013 Harvard University Information 2 Exam: April 3 rd 3-6pm @ Emerson 105 Bring a calculator and extra pens. Notes

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

An Introduction to Parameter Estimation

An Introduction to Parameter Estimation Introduction Introduction to Econometrics An Introduction to Parameter Estimation This document combines several important econometric foundations and corresponds to other documents such as the Introduction

More information

Basic Linear Model. Chapters 4 and 4: Part II. Basic Linear Model

Basic Linear Model. Chapters 4 and 4: Part II. Basic Linear Model Basic Linear Model Chapters 4 and 4: Part II Statistical Properties of Least Square Estimates Y i = α+βx i + ε I Want to chooses estimates for α and β that best fit the data Objective minimize the sum

More information

Applied Econometrics (QEM)

Applied Econometrics (QEM) Applied Econometrics (QEM) The Simple Linear Regression Model based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #2 The Simple

More information

The cover page of the Encyclopedia of Health Economics (2014) Introduction to Econometric Application in Health Economics

The cover page of the Encyclopedia of Health Economics (2014) Introduction to Econometric Application in Health Economics PHPM110062 Teaching Demo The cover page of the Encyclopedia of Health Economics (2014) Introduction to Econometric Application in Health Economics Instructor: Mengcen Qian School of Public Health What

More information

Inference in Regression Model

Inference in Regression Model Inference in Regression Model Christopher Taber Department of Economics University of Wisconsin-Madison March 25, 2009 Outline 1 Final Step of Classical Linear Regression Model 2 Confidence Intervals 3

More information