EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)


Taisuke Otsu, London School of Economics, Summer 2018

A.1. Summation operator (Wooldridge, App. A.1)

Summation operator
For a sequence $\{x_1, x_2, \ldots, x_n\}$, denote the summation as
$$\sum_{i=1}^n x_i = x_1 + x_2 + \cdots + x_n$$
Since data are collections of numbers, the summation operator $\sum_{i=1}^n$ plays a key role in econometrics and statistics

Properties
1. For any constant c, $\sum_{i=1}^n c = nc$
2. For any constant c, $\sum_{i=1}^n c x_i = c \sum_{i=1}^n x_i$
3. For a sequence $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and constants a and b,
$$\sum_{i=1}^n (a x_i + b y_i) = a \sum_{i=1}^n x_i + b \sum_{i=1}^n y_i$$
If you get confused, try the case of n = 2 or 3

Average
For $\{x_1, x_2, \ldots, x_n\}$, the average or mean is defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$$
$(x_i - \bar{x})$ is called the deviation from the average

Properties of $x_i - \bar{x}$
1. The sum of deviations is always zero:
$$\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - n\bar{x} = \sum_{i=1}^n x_i - \sum_{i=1}^n x_i = 0$$
2. Sum of squared deviations:
$$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n(\bar{x})^2$$
3. Cross-product version:
$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}$$
These are shown by the properties of the summation operator

Derive Property 2
Expand the square and apply the properties of the summation operator:
$$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n \left(x_i^2 - 2x_i\bar{x} + (\bar{x})^2\right) = \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n(\bar{x})^2 = \sum_{i=1}^n x_i^2 - 2\bar{x}(n\bar{x}) + n(\bar{x})^2 = \sum_{i=1}^n x_i^2 - n(\bar{x})^2$$
Property 3 is shown similarly
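A quick numerical check of Properties 1 and 2 (a minimal Python sketch; the data vector is an arbitrary made-up example):

```python
import numpy as np

# Verify the deviation identities on a small made-up data vector
x = np.array([2.0, 5.0, 7.0, 10.0])
n, xbar = len(x), x.mean()

print(np.sum(x - xbar))                                   # Property 1: 0 (up to rounding)
print(np.sum((x - xbar)**2), np.sum(x**2) - n * xbar**2)  # Property 2: the two sides agree
```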

A.2. Linear function (Wooldridge, App. A.2)

Linear function
Linear functions play an important role in specifying econometric models. If x and y are related by
$$y = \beta_0 + \beta_1 x$$
then we say that y is a linear function of x. This relation is described by two parameters: the intercept $\beta_0$ and the slope $\beta_1$

Property of linear function
Let $\Delta$ denote change. The key feature of the linear function $y = \beta_0 + \beta_1 x$ is that the change in y is given by the slope $\beta_1$ times the change in x, i.e.
$$\Delta y = \beta_1 \Delta x$$
In other words, the marginal effect of x on y is constant and equal to $\beta_1$

Two variable case
If we have $x_1$ and $x_2$, the linear function is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
The change in y given changes in $x_1$ and $x_2$ is
$$\Delta y = \beta_1 \Delta x_1 + \beta_2 \Delta x_2$$
If $x_2$ does not change, then
$$\Delta y = \beta_1 \Delta x_1 \text{ if } \Delta x_2 = 0 \qquad \text{or} \qquad \beta_1 = \frac{\Delta y}{\Delta x_1} \text{ if } \Delta x_2 = 0$$
So $\beta_1$ measures how y changes with $x_1$ holding $x_2$ fixed (called the partial effect). This is closely related to ceteris paribus

A.4. Some special functions (Wooldridge, App. A.4)

Quadratic function
One way to capture diminishing returns is to add a quadratic term:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
When $\beta_1 > 0$ and $\beta_2 < 0$, the graph has a parabolic mountain shape. By applying calculus, the slope of the quadratic function is approximated by
$$\text{slope} = \frac{\Delta y}{\Delta x} \approx \beta_1 + 2\beta_2 x$$
Caution: the quadratic function is not monotone

Natural logarithm
Perhaps the most important nonlinear function in econometrics. Denote it by log(x) (but ln(x) is also common). log(x) is defined only for x > 0
[Figure: graph of log(x) for 0 < x ≤ 20, increasing and concave, crossing zero at x = 1]

It is not very important how the values of log(x) are obtained. log(x) is monotone increasing and displays diminishing marginal returns (the slope gets closer to 0 as x increases). We can also see some properties:
log(x) < 0 for 0 < x < 1
log(1) = 0
log(x) > 0 for x > 1
$\log(x_1 x_2) = \log(x_1) + \log(x_2)$
$\log\left(\frac{x_1}{x_2}\right) = \log(x_1) - \log(x_2)$
$\log(x^c) = c\log(x)$ for any c

Key property: Relationship with percent change
(By using calculus) we can see that if $x_1 - x_0$ is small,
$$\log(x_1) - \log(x_0) \approx \frac{x_1 - x_0}{x_0}$$
The right-hand side multiplied by 100 gives us the percent change in x. So this can be written as
$$100 \cdot \Delta\log(x) \approx \%\Delta x$$
i.e. the log change times 100 approximates the percent change
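A quick numerical illustration of this approximation (a minimal sketch; the numbers are arbitrary):

```python
import numpy as np

# Compare 100 * (log(x1) - log(x0)) with the exact percent change
# for a small change and a large change
for x0, x1 in [(100.0, 102.0), (100.0, 150.0)]:
    log_approx = 100 * (np.log(x1) - np.log(x0))
    exact_pct = 100 * (x1 - x0) / x0
    print(f"x0={x0}, x1={x1}: log approx = {log_approx:.2f}%, exact = {exact_pct:.2f}%")
# Close for the 2% change, noticeably off for the 50% change,
# consistent with the requirement that x1 - x0 be small
```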

Elasticity
Thus the log is useful for approximating elasticities. The elasticity of y with respect to x is defined as
$$\frac{\%\Delta y}{\%\Delta x} = \frac{\Delta y / y}{\Delta x / x} = \frac{\Delta y}{\Delta x}\cdot\frac{x}{y}$$
i.e. the percentage change in y when x increases by 1% (a familiar concept in economics). By logs, the elasticity is approximated as
$$\frac{\%\Delta y}{\%\Delta x} \approx \frac{\Delta\log(y)}{\Delta\log(x)}$$

B.1. Random variables and their probability distributions (Wooldridge, App. B.1)

Definition
An experiment is any procedure that can yield outcomes with uncertainty
E.g. tossing a coin (head or tail)
A random variable is one that takes numerical values and has its outcome determined by an experiment
E.g. the number of heads from tossing 10 coins

Notation for appendix
In the Appendix, random variables are denoted by uppercase letters, like X, Y, Z. Particular outcomes are denoted by the corresponding lowercase letters, like x, y, z. In the main body of the textbook, both are denoted by lowercase x, y, z (it should be clear from context)
X is not associated with any particular value, but x is, say x = 3
Typical example to keep in mind: X is your exam score at this point (random and not yet realized). Once you take the exam, it is realized and you get a particular value x, say x = 80
So the expression P(X = x) = 0.2 means that the probability that the random variable X takes the particular number x is 0.2

Discrete random variables
If X takes on only a finite (like {1, 2, ..., 10}) or countably infinite (like {1, 2, 3, ...}) number of values, then X is called a discrete random variable
Suppose X can take on k possible values $\{x_1, \ldots, x_k\}$. Since X is random, we never know for sure which number X takes. So we need to talk about the probability of X taking each value:
$$p_j = P(X = x_j) \quad \text{for } j = 1, 2, \ldots, k$$
Note: each $p_j$ is between 0 and 1 and satisfies $p_1 + p_2 + \cdots + p_k = 1$

Probability density function (pdf)
The distribution of X is summarized by the probability density function (pdf)
$$f(x_j) = p_j \quad \text{for } j = 1, 2, \ldots, k$$
with f(x) = 0 for any x not equal to one of the $x_j$'s. The probability of any event involving X can be computed from the $p_j$'s

Continuous random variable
If X takes values on an interval or the real line, then X is called a continuous random variable
A continuous random variable takes on any particular real value with zero probability, i.e. if X is continuous, then P(X = x) = 0 for every value of x. Since X can take on too many possible values, we cannot allocate positive probability to each value of x
For continuous X, it only makes sense to talk about the probability of intervals, such as P(a ≤ X ≤ b) and P(X > c)

Cumulative distribution function (cdf)
To compute probabilities for a continuous random variable, it is useful to work with the cumulative distribution function (cdf)
$$F(x) = P(X \le x) \quad \text{for any } x$$
F(x) is an increasing (more precisely, non-decreasing) function that starts from 0 and increases to 1. Using F(x), we can compute
$$P(X > c) = 1 - F(c), \qquad P(a \le X \le b) = F(b) - F(a)$$
For the continuous case, a pdf f(x) is also available, which gives the probability of any interval by integrating over the interval
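These cdf formulas are easy to try out numerically. A minimal sketch using scipy; the choice of distribution (Normal with mean 50 and sd 10) is just an example:

```python
from scipy.stats import norm

# Example distribution (my choice): X ~ Normal(50, 10^2)
X = norm(loc=50, scale=10)

print(1 - X.cdf(60))          # P(X > 60) = 1 - F(60)
print(X.cdf(60) - X.cdf(40))  # P(40 <= X <= 60) = F(60) - F(40)
```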

B.2. Joint distributions, conditional distributions and independence (Wooldridge, App. B.2)

Joint distribution
Let X and Y be discrete random variables. Then (X, Y) has a joint distribution, which is fully described by the joint pdf
$$f_{X,Y}(x, y) = P(X = x, Y = y)$$
where the right-hand side is the probability that X takes x and Y takes y. The pdf of a single variable, such as the pdf $f_X(x)$ of X, is called the marginal pdf
E.g. Y = wage, X = years of education

Independence
We say X and Y are independent if
$$f_{X,Y}(x, y) = f_X(x) f_Y(y) \quad \text{for all } x \text{ and } y$$
where $f_X(x)$ is the marginal pdf of X and $f_Y(y)$ is the marginal pdf of Y. Otherwise, we say X and Y are dependent
As we will see soon, if X and Y are independent, knowing the outcome of X does not change the probabilities of the outcomes of Y, and vice versa

Conditional distribution
To talk about how X affects Y, we look at the conditional distribution of Y given X, which is summarized by the conditional pdf
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$
for all values of x such that $f_X(x) > 0$. Note that by definition
$$f_{Y|X}(y|x) = \frac{P(X = x, Y = y)}{P(X = x)} = P(Y = y \mid X = x)$$
so the conditional pdf $f_{Y|X}(y|x)$ gives us the (conditional) probability of Y = y given that X = x
E.g. Y = wage and X = years of education. $f_{Y|X}(y|12)$ is the pdf of wage for all people in the population with 12 years of education
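These definitions are mechanical to compute from a joint pmf table. A minimal sketch with a made-up joint distribution (values and probabilities are arbitrary):

```python
import numpy as np

# Toy joint pmf for discrete (X, Y); rows index x in {0, 1},
# columns index y in {10, 20, 30}; entries sum to 1
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])

f_X = joint.sum(axis=1)              # marginal pdf of X
f_Y = joint.sum(axis=0)              # marginal pdf of Y
cond_Y_given_x0 = joint[0] / f_X[0]  # f_{Y|X}(y | x = 0)
print(f_X, f_Y, cond_Y_given_x0)

# Independence check: is the joint the outer product of the marginals?
print(np.allclose(joint, np.outer(f_X, f_Y)))  # False here, so X and Y are dependent
```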

Relationship with independence
If X and Y are independent (i.e. $f_{X,Y}(x, y) = f_X(x) f_Y(y)$), then the conditional pdf of Y given X reduces to
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{f_X(x) f_Y(y)}{f_X(x)} = f_Y(y)$$
i.e. knowledge of the value taken by X tells us nothing about the distribution of Y

B.3. Features of probability distributions (Wooldridge, App. B.3)

Features of distribution
Knowing the pdf is great, but for many purposes we are interested in only a few aspects of the distribution of a random variable, such as
a measure of central tendency
a measure of variability or spread
a measure of association between two random variables

Measure of central tendency: Expected value
One of the most important concepts in this course
The expected value (or expectation) of a random variable X (denoted E(X) or sometimes µ) is the weighted average of all possible values of X, with weights determined by the pdf
If X takes values in $\{x_1, \ldots, x_k\}$ with pdf f(x), then the expected value is
$$E(X) = x_1 f(x_1) + \cdots + x_k f(x_k)$$
If X is continuous, the expected value is given by the integral
$$E(X) = \int x f(x)\,dx$$

Expected value of function of X
Consider g(X), a function of X. Its expected value is
$$E[g(X)] = g(x_1) f(x_1) + \cdots + g(x_k) f(x_k)$$
i.e. the weighted average of all possible values of g(X). For example, if $g(X) = X^2$, then
$$E[X^2] = x_1^2 f(x_1) + \cdots + x_k^2 f(x_k)$$
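Computing these weighted averages is a one-liner. A minimal sketch; the support and pmf are made up:

```python
import numpy as np

# Discrete pmf: support x and probabilities f (must sum to 1); both made up
x = np.array([1.0, 2.0, 3.0])
f = np.array([0.2, 0.5, 0.3])

EX = np.sum(x * f)       # E(X)   = sum of x_j * f(x_j)
EX2 = np.sum(x**2 * f)   # E(X^2), the g(X) = X^2 case
print(EX, EX2)
```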

Properties of E(·)
Used very frequently in this course
Property E.1: For any (nonrandom) constant c, E(c) = c
E.g. E(3) = 3. Since c (or 3 in this case) never takes any other value, this makes sense

Property E.2: For any constants a and b,
$$E(aX + b) = aE(X) + b$$
Intuitively, constants can be taken outside of E(·). This can be seen by expressing E(·) as a weighted average

Property E.3: If $\{a_1, \ldots, a_n\}$ are constants and $\{X_1, \ldots, X_n\}$ are random variables, then
$$E(a_1 X_1 + \cdots + a_n X_n) = a_1 E(X_1) + \cdots + a_n E(X_n)$$
This is a generalization of Property E.2: the expectation of a sum can be split into the sum of expectations, and the constant coefficients $a_i$ can be taken outside of E(·)

Measure of variability: Variance and standard deviation
Once we have described the central tendency of the distribution of X by the expected value µ = E(X), the next step is to characterize the variability or spread of the distribution around µ
The common measure of variability is the variance
$$Var(X) = E[(X - \mu)^2]$$
i.e. we measure variability by the squared deviation $(X - \mu)^2$ and summarize it by its expected value. The standard deviation is defined as
$$sd(X) = \sqrt{Var(X)}$$

Properties of variance
Property VAR.1: For any (nonrandom) constant c, Var(c) = 0. A constant has no variability
Property VAR.2: For any constants a and b,
$$Var(aX + b) = a^2 Var(X)$$
b does not change the variance. When a is taken outside of Var(·), it becomes $a^2$ (because variance is defined as the expected squared deviation $E[(X - \mu)^2]$); see the one-line derivation below
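Filling in the step behind Property VAR.2: by Property E.2, aX + b has expected value aµ + b, so
$$Var(aX + b) = E\big[(aX + b - (a\mu + b))^2\big] = E\big[a^2(X - \mu)^2\big] = a^2 E\big[(X - \mu)^2\big] = a^2 Var(X)$$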

B.4. Features of joint and conditional distributions (Wooldridge, App. B.4)

Covariance
Consider two random variables X and Y. Let $\mu_X = E(X)$ and $\mu_Y = E(Y)$. To measure the association of X and Y, we look at the product of deviations from the means:
$$(X - \mu_X)(Y - \mu_Y)$$
If $X > \mu_X, Y > \mu_Y$ or $X < \mu_X, Y < \mu_Y$ (i.e. same signs), this product is positive. If $X > \mu_X, Y < \mu_Y$ or $X < \mu_X, Y > \mu_Y$ (i.e. different signs), this product is negative
The covariance is the expected value of this product:
$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
Property COV.1: If X and Y are independent, then Cov(X, Y) = 0 (but the converse is not true in general)

Correlation coefficient
A drawback of covariance is that it depends on the units of measurement. This can be overcome by the correlation coefficient
$$Corr(X, Y) = \frac{Cov(X, Y)}{sd(X)\,sd(Y)}$$
Property CORR.1: $-1 \le Corr(X, Y) \le 1$
If Cov(X, Y) > 0 (or Corr(X, Y) > 0), we say X and Y are positively correlated
If Cov(X, Y) < 0 (or Corr(X, Y) < 0), we say X and Y are negatively correlated

Variance of sum of random variables
Property VAR.3: For constants a and b,
$$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X, Y)$$
If X and Y are uncorrelated (i.e. Cov(X, Y) = 0), then
$$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)$$
Property VAR.4: Suppose $\{X_1, \ldots, X_n\}$ are mutually uncorrelated (i.e. $Cov(X_i, X_j) = 0$ for any $i \ne j$). Then for constants $\{a_1, \ldots, a_n\}$,
$$Var(a_1 X_1 + \cdots + a_n X_n) = a_1^2 Var(X_1) + \cdots + a_n^2 Var(X_n)$$

Conditional expectation
Let X and Y be discrete random variables. Recall the conditional pdf
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = P(Y = y \mid X = x)$$
i.e. the probability of Y = y given that X = x
E.g. Y = wage and X = years of education. $f_{Y|X}(y|12)$ is the pdf of wage for all people in the population with 12 years of education. Similarly, we can define $f_{Y|X}(y|13)$, $f_{Y|X}(y|14)$, $f_{Y|X}(y|16)$, and so on. In general, these distributions are all different
The conditional expectation (or conditional mean) looks at the expected values of these conditional pdfs

Suppose Y takes on values $\{y_1, \ldots, y_m\}$. The conditional expectation of Y given X = x is
$$E(Y \mid X = x) = y_1 f_{Y|X}(y_1|x) + \cdots + y_m f_{Y|X}(y_m|x)$$
If Y is continuous, $E(Y \mid X = x)$ is defined by an integral over y
E.g. Y = wage and X = years of education. $E(Y \mid X = 12)$ is the average wage for all people in the population with 12 years of education, and $E(Y \mid X = x)$ is that average for x years of education
Note: $E(Y \mid X = x)$ typically varies with x. In other words, $E(Y \mid X = x)$ is a function of x (say, $m(x) = E(Y \mid X = x)$). It is a very useful summary of how Y and X are related
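A minimal sketch computing $E(Y \mid X = x)$ from a joint pmf (the same made-up table as the earlier joint-distribution sketch):

```python
import numpy as np

# Toy joint pmf: rows index x in {0, 1}; columns index y in {10, 20, 30}
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])
y_vals = np.array([10.0, 20.0, 30.0])
f_X = joint.sum(axis=1)

for i, x in enumerate([0, 1]):
    cond_pmf = joint[i] / f_X[i]                          # f_{Y|X}(y | x)
    print(f"E(Y | X = {x}) =", np.sum(y_vals * cond_pmf))
# The two conditional means differ, so E(Y | X = x) is a nontrivial function m(x)
```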

Properties of conditional expectation
Used frequently in this course
Property CE.1: For any function c(X), $E[c(X) \mid X] = c(X)$
Intuitively, if we know X, then we also know c(X). When computing an expectation conditional on X, the function c(X) is treated like a constant
E.g. for $c(X) = X^2$, $E[X^2 \mid X] = X^2$

Property CE.2: For any functions a(X) and b(X),
$$E[a(X)Y + b(X) \mid X] = a(X)E(Y \mid X) + b(X)$$
Intuitively, functions of X can be taken outside of the conditional expectation $E(\cdot \mid X)$. When computing an expectation conditional on X, the functions a(X) and b(X) are treated like constants

Property CE.5: If $E(Y \mid X) = E(Y)$, then
$$Cov(X, Y) = 0 \quad (\text{and also } Corr(X, Y) = 0)$$
If knowledge of X does not change the expected value of Y, then X and Y must be uncorrelated
The converse is not true in general: even if X and Y are uncorrelated, $E(Y \mid X)$ could still depend on X

Conditional variance
The conditional variance of the conditional distribution of Y given X = x is
$$Var(Y \mid X = x) = E[(Y - E(Y \mid x))^2 \mid x]$$
A formula often used:
$$Var(Y \mid X) = E(Y^2 \mid X) - [E(Y \mid X)]^2$$
Property CV.1: If X and Y are independent, then $Var(Y \mid X) = Var(Y)$

B.5. Normal and related distributions (Wooldridge, App. B.5)

Normal distribution
The most widely used distribution in econometrics and statistics. Other distributions such as the t- and F-distributions (explained later) are obtained as functions of normally distributed random variables
A normal random variable is continuous and can take any value on the real line. Although the mathematical expression of its pdf is a bit complicated, the pdf is bell-shaped and symmetric around its expected value
We say X has a normal distribution with expected value µ = E(X) and variance $\sigma^2 = Var(X)$, written as
$$X \sim Normal(\mu, \sigma^2)$$
If $Z \sim Normal(0, 1)$, we say Z has the standard normal distribution

[Figure: pdfs of Normal(0, 1) and $t_6$]

Property of normal random variable
Property Normal.1: If $X \sim Normal(\mu, \sigma^2)$, then
$$\frac{X - \mu}{\sigma} \sim Normal(0, 1)$$
This transformation (subtract the expected value µ, then divide by the standard deviation σ) is called standardization
Property Normal.4: Any linear combination of independent normal random variables (e.g. $a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$) is also normally distributed

Chi-square distribution
Consider n independent standard normal random variables $Z_1, \ldots, Z_n$ (i.e. $Z_i \sim Normal(0, 1)$). Based on them, consider the sum of squares
$$X = \sum_{i=1}^n Z_i^2$$
Since this object appears very often (it is closely related to the sample variance), people have given it a name: the distribution of X is called the chi-square distribution with n degrees of freedom, written as
$$X \sim \chi^2_n$$
Its pdf is complicated

t distribution
Let
$$Z \sim Normal(0, 1), \qquad X \sim \chi^2_n, \qquad Z \text{ and } X \text{ independent}$$
Then consider the ratio
$$T = \frac{Z}{\sqrt{X/n}}$$
Since this object appears very often, people have given it a name: the distribution of T is called the t distribution with n degrees of freedom, written as
$$T \sim t_n$$

The $t_n$ distribution depends on n (called the degrees of freedom). The pdf of the t distribution has a similar bell shape to the standard normal Normal(0, 1) but is more spread out (intuitively, Z is normal but T has extra variation due to the random denominator $\sqrt{X/n}$). Indeed, the $t_n$ distribution converges to Normal(0, 1) as $n \to \infty$
The mathematical expression of the t distribution is complicated. Use Table G in the Appendix or a computer

F distribution
Let
$$X_1 \sim \chi^2_{k_1}, \qquad X_2 \sim \chi^2_{k_2}, \qquad X_1 \text{ and } X_2 \text{ independent}$$
Based on them, consider
$$F = \frac{X_1/k_1}{X_2/k_2}$$
Again, since this object appears very often, people have given it a name: the distribution of F is called the $F_{k_1,k_2}$ distribution with $(k_1, k_2)$ degrees of freedom, written as
$$F \sim F_{k_1,k_2}$$
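The constructions above can be verified by simulation: build chi-square, t, and F draws from standard normals and compare with the theoretical quantiles. A sketch; the degrees of freedom and seed are my choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n, k1, k2 = 100_000, 6, 3, 8

Z = rng.standard_normal((reps, n))
chi2_n = np.sum(Z**2, axis=1)                          # X ~ chi^2_n: sum of squared normals
T = rng.standard_normal(reps) / np.sqrt(chi2_n / n)    # T = Z / sqrt(X/n) ~ t_n
X1 = np.sum(rng.standard_normal((reps, k1))**2, axis=1)
X2 = np.sum(rng.standard_normal((reps, k2))**2, axis=1)
F = (X1 / k1) / (X2 / k2)                              # F ~ F_{k1,k2}

print(np.quantile(T, 0.95), stats.t.ppf(0.95, df=n))   # simulated vs. theoretical: close
print(np.quantile(F, 0.95), stats.f.ppf(0.95, k1, k2)) # close
```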

C.1. & C.2. Concepts for point estimation (Wooldridge, App. C.1 & C.2)

Random sampling
Consider n independent random variables $Y_1, \ldots, Y_n$ with common pdf f(y; θ). Then $\{Y_1, \ldots, Y_n\}$ is called a random sample from the population f(y; θ) with parameter θ
Example: $Y_i = 0$ or 1 (say, tail or head) with pdf
$$P(Y_i = 1) = \theta, \qquad P(Y_i = 0) = 1 - \theta$$
We want to estimate θ from the random sample $\{Y_1, \ldots, Y_n\}$

Estimator & Estimate
In principle, any method to estimate θ should be some function of the sample $\{Y_1, \ldots, Y_n\}$, say
$$\hat{\theta} = g(Y_1, \ldots, Y_n)$$
Such an object is called an estimator of θ. Note that an estimator is a function of random variables, so $\hat{\theta}$ is random, too
What we report is its outcome based on the realized values $\{y_1, \ldots, y_n\}$ of $\{Y_1, \ldots, Y_n\}$,
$$\hat{\theta}_{\text{estimate}} = g(y_1, \ldots, y_n)$$
which is called an estimate of θ. An estimator is random; an estimate is non-random (just some number)

For example, to estimate the population mean $\mu = E(Y_i)$, the sample mean
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$$
is an estimator of µ. From the data $\{y_1, \ldots, y_n\}$ (i.e. particular outcomes of the sample), we report
$$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i \quad (\text{say, } \bar{y} = 75)$$
The properties of an estimator are described by the sampling distribution of the estimator $\bar{Y}$ ($\bar{y}$ is a constant, so it does not have a distribution)

Unbiasedness
The first property we focus on is the expected value $E(\hat{\theta})$ of an estimator. $\hat{\theta}$ is an unbiased estimator for θ if
$$E(\hat{\theta}) = \theta$$
If not, the estimator is biased, and $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$
For example, $\bar{Y}$ is unbiased for $\mu = E(Y_i)$ because
$$E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n}\sum_{i=1}^n E(Y_i) = \frac{1}{n}\sum_{i=1}^n \mu = \mu$$

Sampling variance
The second property is the sampling variance $Var(\hat{\theta})$ of an estimator
If we have two unbiased estimators (say $\hat{\theta}$ and $\tilde{\theta}$), we often compare their variances $Var(\hat{\theta})$ and $Var(\tilde{\theta})$ and prefer the smaller-variance estimator; the one with smaller variance is called more efficient
For example, the sampling variance of $\bar{Y}$ is
$$Var(\bar{Y}) = Var\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n^2} Var\left(\sum_{i=1}^n Y_i\right) = \frac{1}{n^2}\sum_{i=1}^n Var(Y_i) \;\;(\text{because the } Y_i\text{'s are independent}) = \frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{\sigma^2}{n}$$
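Both results, $E(\bar{Y}) = \mu$ and $Var(\bar{Y}) = \sigma^2/n$, are easy to confirm by Monte Carlo. A sketch; the population (Uniform(0, 100), so µ = 50 and σ² = 10000/12), n, and seed are my choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 25, 50_000

# reps independent samples of size n; one sample mean per row
ybars = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)

print(ybars.mean())                  # close to mu = 50 (unbiasedness)
print(ybars.var(), (10000/12) / n)   # close to sigma^2 / n
```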

C.3. Asymptotic properties of estimators (Wooldridge, App. C.3)

Consistency
The first asymptotic property of an estimator concerns how far the estimator is likely to be from the parameter it is supposed to be estimating as the sample size increases to infinity
Intuitively, we want convergence of the estimator, say $\hat{\theta}_n$, to the unknown parameter, say θ, as $n \to \infty$
Recall the convergence of a non-random sequence, $c_n \to c$. For example, $c_n = 2 + \frac{3}{n^2} \to 2$ as $n \to \infty$; we write $\lim_{n\to\infty} c_n = 2$

Convergence in probability
We want an analog of convergence for $\hat{\theta}_n$, which is random
We say: a sequence of random variables $W_n$ converges in probability to c if for any ε > 0,
$$P(|W_n - c| > \varepsilon) \to 0 \quad \text{as } n \to \infty$$
This is denoted by $plim(W_n) = c$, and c is called the probability limit

Consistency of estimator
An estimator $\hat{\theta}_n$ is consistent for a parameter θ if
$$plim(\hat{\theta}_n) = \theta$$
It means the distribution of $\hat{\theta}_n$ becomes more and more concentrated around θ and collapses to the constant θ in the limit
In particular, we want consistency of the OLS estimator, $plim(\hat{\beta}_j) = \beta_j$ (note that $\hat{\beta}_j$ depends on the sample size n)

Law of large numbers (LLN)
The basic tool for establishing consistency is the law of large numbers (LLN)
LLN: Let $Y_1, \ldots, Y_n$ be independent and identically distributed random variables with mean $\mu = E(Y_i)$. Then
$$plim(\bar{Y}_n) = \mu$$
i.e. the sample average converges in probability to the population mean. In other words, $\bar{Y}_n$ is a consistent estimator for µ

Simulation
Let $Y_1, \ldots, Y_n$ be independent with, for i = 1, ..., n,
$$Y_i \sim Uniform(0, 100)$$
The population mean is $E(Y_i) = 50$
Fix n. Then simulate $\bar{Y}_n$ 10,000 times by computer and draw the histogram (a code sketch follows below)
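A sketch of this simulation; the seed, bin count, and figure layout are my choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# 10,000 draws of Y-bar_n for Y_i ~ Uniform(0, 100), at several sample sizes
rng = np.random.default_rng(2)
reps = 10_000

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, n in zip(axes, [1, 2, 5, 10, 100]):
    ybar = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)
    ax.hist(ybar, bins=40)
    ax.set_title(f"n = {n}")
plt.show()  # the histograms concentrate around the population mean 50
```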

[Figures: histograms of $\bar{Y}_n$ over 10,000 simulations for n = 1, 2, 5, 10, 100. The distribution concentrates around 50 as n grows.]

Intuition for LLN
Key: look at the variance of $\bar{Y}_n$. Let $Var(Y_i) = \sigma^2$. Recall that
$$Var(\bar{Y}_n) = \frac{\sigma^2}{n} \to 0$$
i.e. the variance of $\bar{Y}_n$ shrinks to zero at rate n, so the distribution of $\bar{Y}_n$ collapses

Consistency of sample moments
We saw that the sample mean $\bar{Y}_n$ is consistent for the population mean $E(Y_i)$
The LLN also gives us consistency of other sample moment estimators, e.g. the sample variance
$$plim\left(\frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y}_n)^2\right) = Var(Y_i)$$
and the sample covariance
$$plim\left(\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y}_n)(Z_i - \bar{Z}_n)\right) = Cov(Y_i, Z_i)$$

Property of plim
Property PLIM.2: If $plim(Z_n) = a$ and $plim(W_n) = b$, then
$$plim(Z_n + W_n) = a + b, \qquad plim(Z_n W_n) = ab, \qquad plim(Z_n / W_n) = a/b \text{ provided } b \ne 0$$

Asymptotic distribution
Consistency is a desirable property of an estimator: if the estimator $\hat{\theta}_n$ is consistent, it eventually converges to the unknown parameter θ of interest
However, if we wish to conduct statistical inference (hypothesis testing or confidence intervals), we need more information about $\hat{\theta}_n$, namely its distribution
Unless we impose restrictive assumptions (e.g. MLR.6), it is not easy to get the finite-sample distribution of $\hat{\theta}_n$ for a given n. However, under mild conditions it is easy to get an approximate distribution for $\hat{\theta}_n$ as n increases to infinity. Indeed, most estimators in econometrics are well approximated by the normal distribution

Asymptotic normal distribution
We say: a sequence of random variables $\{Z_n\}$ has an asymptotic standard normal distribution if for each a,
$$P(Z_n \le a) \to \Phi(a) \quad \text{as } n \to \infty$$
where Φ(a) is the cumulative distribution function (cdf) of the standard normal Normal(0, 1)
In words, for each a, the cdf of $Z_n$ evaluated at a converges to the cdf of Normal(0, 1) evaluated at a. We often write $Z_n \overset{a}{\sim} Normal(0, 1)$

Central limit theorem (CLT)
The basic tool for establishing asymptotic normality is the central limit theorem (CLT)
Let $Y_1, \ldots, Y_n$ be independent and identically distributed random variables with mean $\mu = E(Y_i)$ and variance $\sigma^2 = Var(Y_i)$
Consider the sample average $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$ again
Note: $\bar{Y}_n$ itself does not have a non-degenerate asymptotic distribution (it collapses to µ by the LLN)

Key: Look at the standardized version of $\bar{Y}_n$. Note that
$$E(\bar{Y}_n) = \mu, \qquad Var(\bar{Y}_n) = \frac{\sigma^2}{n}$$
which implies that
$$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$$
satisfies $E(Z_n) = 0$ and $Var(Z_n) = 1$. Therefore, the distribution of $Z_n$ will not collapse even as $n \to \infty$

CLT: Let $Y_1, \ldots, Y_n$ be independent and identically distributed random variables with mean $\mu = E(Y_i)$ and variance $\sigma^2 = Var(Y_i)$. Then
$$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} \overset{a}{\sim} Normal(0, 1)$$
Remarkably, regardless of the distribution of $Y_i$, the distribution of $Z_n$ gets arbitrarily close to the standard normal

Simulation
Again, let $Y_1, \ldots, Y_n$ be independent with, for i = 1, ..., n,
$$Y_i \sim Uniform(0, 100)$$
The population mean is µ = 50 and the variance is $\sigma^2 = 10000/12$
Fix n. Then simulate $Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$ 10,000 times by computer and draw the histogram (a code sketch follows below)
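A sketch of this CLT simulation, parallel to the LLN sketch above; seed and layout are my choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# 10,000 draws of the standardized mean Z_n for Y_i ~ Uniform(0, 100)
rng = np.random.default_rng(3)
reps, mu, sigma = 10_000, 50.0, np.sqrt(10000 / 12)

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, n in zip(axes, [1, 2, 5, 10, 100]):
    ybar = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)
    z = (ybar - mu) / (sigma / np.sqrt(n))
    ax.hist(z, bins=40)
    ax.set_title(f"n = {n}")
plt.show()  # the histograms approach the standard normal bell shape
```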

[Figures: histograms of $Z_n$ over 10,000 simulations for n = 1, 2, 5, 10, 100. The distribution approaches the standard normal bell shape as n grows.]

C.6. Hypothesis testing (Wooldridge, App. C.6)

Hypothesis testing
Let θ be the parameter of interest. An estimator $\hat{\theta}$ gives us an estimate for θ, i.e. we report some number
E.g. θ = E(X) (population mean), $\hat{\theta} = \bar{X}$ (sample mean)
Hypothesis testing is interested in answering a yes/no question about θ, i.e. we report yes or no
Typical question: is some regression coefficient zero or not?

Example: Testing hypotheses about the mean in a normal population
To illustrate the basic idea, consider a Normal(µ, σ²) population and hypothesis testing about the mean µ based on a random sample $\{Y_1, \ldots, Y_n\}$ (so $Y_i \sim Normal(\mu, \sigma^2)$ for all i = 1, ..., n)
Consider the null hypothesis
$$H_0: \mu = \mu_0$$
where $\mu_0$ is a value we specify (e.g. $\mu_0 = 0$, so $H_0: \mu = 0$)

To set up the yes/no question, we need to specify the alternative hypothesis. Popular examples are
$$H_1: \mu > \mu_0, \qquad H_1: \mu < \mu_0, \qquad H_1: \mu \ne \mu_0$$
The first and second are called one-sided alternative hypotheses. The third is called a two-sided alternative hypothesis
Here let us consider
$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu > \mu_0$$
We report: reject $H_0$ (in favor of $H_1$), or do not reject $H_0$

Idea for testing
Intuitively, we should reject $H_0$ if $\bar{y}$ is sufficiently greater than $\mu_0$. But how large? $\bar{y} - \mu_0 > 10$? $> 500$?
The meaning of, say, $\bar{y} - \mu_0 = 10$ is case-by-case. So consider the standardized version obtained by dividing by the standard error:
$$t = \frac{\bar{y} - \mu_0}{se(\bar{y})} = \frac{\bar{y} - \mu_0}{s/\sqrt{n}}$$
where $se(\bar{y}) = s/\sqrt{n}$ and
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2}$$
Now the meaning of, say, t = 2 is universal for any data

Find critical value
Based on the standardized statistic t, a reasonable test would be: reject $H_0: \mu = \mu_0$ (in favor of $H_1: \mu > \mu_0$) if t > c, and do not reject $H_0$ if t ≤ c
So what we have to do is find the critical value c. To pin down c, we need some rule

Rule for critical value
In testing, there are two kinds of mistakes:

             Reject    Not reject
$H_0$ true   Type I    correct
$H_1$ true   correct   Type II

Type I error probability: P(Reject ; $H_0$ true)
Type II error probability: P(Do not reject ; $H_1$ true)
Rule: find c to control the Type I error probability

Let us find c in the current example. To compute the probability, consider the random variable counterpart of $t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}}$, that is,
$$T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$$
We want to find c such that
$$P(\text{Reject} ; H_0 \text{ true}) = P(T > c ; H_0 \text{ true}) = \alpha$$
where α (called the significance level) is specified by us; typically α = .01, .05, or .10
To find c, we need to know the distribution of T under $H_0: \mu = \mu_0$. Indeed, T follows the $t_{n-1}$ distribution under $H_0$

Then look up the t distribution table (Table G.2). For example, if n = 29 and α = .05, the critical value is c = 1.701

Test for mean in normal population
Hypotheses: $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$
Significance level: α = .05
Test statistic & distribution under $H_0$:
$$T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0$$
Find the critical value c = 1.701 from the $t_{29-1}$ distribution table (for n = 29)
Test: reject $H_0$ if t > 1.701; do not reject if t ≤ 1.701
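A sketch of this one-sided test in Python; the data are a hypothetical sample I generated (n = 29 to match the critical value 1.701 above):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of size 29; true mean 53 so H0: mu = 50 is false here
rng = np.random.default_rng(4)
y = rng.normal(loc=53, scale=10, size=29)
mu0, alpha, n = 50.0, 0.05, len(y)

t_stat = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(n))  # t = (ybar - mu0) / (s / sqrt(n))
c = stats.t.ppf(1 - alpha, df=n - 1)                      # critical value, ~1.701 for n = 29

print(f"t = {t_stat:.3f}, c = {c:.3f}")
print("Reject H0" if t_stat > c else "Do not reject H0")
```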

Test for another one-sided alternative
If the alternative hypothesis is
$$H_1: \mu < \mu_0$$
we reject $H_0$ if t < −c
c can be found in the same way by looking at the left tail of the $t_{n-1}$ distribution. For example, if n = 29, reject if t < −1.701

Test for two-sided alternative
If the alternative hypothesis is two-sided,
$$H_1: \mu \ne \mu_0$$
we reject $H_0$ if |t| > c
We should reject for both positive and negative large values of t. The distribution of T under $H_0$ remains the same, i.e. $T \sim t_{n-1}$, but we have to allocate the significance level α to the left and right tails. So if we look at the right tail, its area should be α/2

Look up the t distribution table (Table G.2). For example, if n = 26 and α = .05, the critical value is 2.06 (the t distribution is symmetric)

Summary: Basic steps for testing
State the null and alternative hypotheses, $H_0$ and $H_1$
Declare the significance level α
Find the test statistic & its distribution under $H_0$ (e.g. $T \sim t_{n-1}$ under $H_0$)
Find the critical value c from the distribution table (or by software)
State the testing procedure: reject $H_0$ if ... and do not reject $H_0$ if ...
Implement the test with the data and report the result: reject (or do not reject) $H_0$ at the 100α% significance level