1 Ordinary Least Squares (OLS)

1.1 Single Linear Regression Model

assumptions of Classical Linear Regression Model (CLRM)
(i) true relationship: y_i = α + β x_i + ε_i, i = 1, ..., N
    where α, β = population parameters; α̂, β̂ = parameter estimates
    ε_i = idiosyncratic error term (reflects randomness, unobserved factors); ε̂_i = estimated residual
(ii) E[ε_i] = 0 (error is mean zero)
(iii) E[ε_i²] = σ² ∀ i (equal to the variance of ε given (ii); variance identical across observations)
(iv) E[ε_i ε_j] = 0 ∀ i ≠ j (zero covariance between any two observations)
(v) E[x_i ε_i] = 0 (x is independent of the error)
(vi) ε_i ~ N(0, σ²) (errors distributed normally; not needed for unbiasedness or consistency, only for standard errors)

estimation: given a random sample {y_i, x_i}_{i=1}^N, OLS minimizes the sum of the squared residuals
    α̂, β̂ = argmin_{α,β} Σ_{i=1}^N (y_i − α − β x_i)²
solution implies
    β̂_OLS = Cov(y_i, x_i) / Var(x_i) = Σ_{i=1}^N (y_i − ȳ)(x_i − x̄) / Σ_{i=1}^N (x_i − x̄)² = Σ_{i=1}^N y_i (x_i − x̄) / Σ_{i=1}^N x_i (x_i − x̄)
    α̂_OLS = ȳ − β̂_OLS x̄
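the closed-form solution above is easy to verify numerically; a minimal Python sketch (all data invented for illustration):

```python
# Illustration of the OLS formulas above: beta_hat is the ratio of the sample
# covariance to the sample variance, then alpha_hat = ybar - beta_hat * xbar.

def ols_single(y, x):
    """Return (alpha_hat, beta_hat) for the single-regressor OLS problem."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # beta_hat = sum (y_i - ybar)(x_i - xbar) / sum (x_i - xbar)^2
    beta = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) \
         / sum((xi - xbar) ** 2 for xi in x)
    alpha = ybar - beta * xbar
    return alpha, beta

# exact linear data y = 1 + 2x: residuals are zero, so OLS recovers (1, 2)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * xi for xi in x]
print(ols_single(y, x))  # → (1.0, 2.0)
```

with exact linear data the residuals are zero, so the estimator recovers the true parameters exactly.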
properties
α̂, β̂ are unbiased (finite-sample property) and consistent (asymptotic property)
α̂, β̂ are efficient: Var(β̂) = σ² / Σ_{i=1}^N (x_i − x̄)² = σ² / [N·Var(x)]
    smallest variance of any linear, unbiased estimator (Gauss-Markov Theorem), where linear estimators are those in the class β̂ = Σ_i ω_i y_i, with ω_i some weighting function
1.2 Multiple Regression Model

assumptions
(i) true relationship: y_i = α + Σ_{k=1}^K β_k x_ki + ε_i = α + x_i β + ε_i, i = 1, ..., N
    where x_i is a 1×K vector, β is a K×1 vector, and K = # of independent variables (also known as regressors or covariates)
(ii) E[x_ki ε_i] = 0 ∀ k
(iii) the x's are linearly independent (no perfect multicollinearity)
other assumptions follow from the CLRM
estimation: given a random sample {y_i, x_i}_{i=1}^N, OLS minimizes the sum of the squared residuals
    α̂, β̂ = argmin_{α,β} Σ_{i=1}^N (y_i − α − x_i β)²
solution implies
    β̂_OLS = (x′x)^{−1}(x′y)
    α̂_OLS = ȳ − x̄ β̂_OLS
where x is an N×K matrix and y is an N×1 vector
estimators retain the same properties as in the CLRM
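the matrix solution can be sketched the same way; the tiny linear solver and the data below are illustrative only (intercept absorbed into x as a column of ones):

```python
# Sketch of the matrix OLS solution beta_hat = (x'x)^{-1}(x'y).

def solve(A, b):
    """Solve A u = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """beta_hat = (X'X)^{-1} X'y for X stored as a list of rows."""
    K = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(K)] for a in range(K)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(K)]
    return solve(XtX, Xty)

# exact data y = 1 + 2*x1 - 3*x2, so OLS recovers [1, 2, -3]
X = [[1.0, x1, x2] for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for _, x1, x2 in X]
print(ols(X, y))  # ≈ [1.0, 2.0, -3.0]
```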
2 Maximum Likelihood Estimation (MLE)

alternative estimation technique to OLS
useful in nonlinear models; equivalent to OLS in the classical linear regression model
intuition: y depends on x and θ (e.g., θ = {β, σ}) ... choose θ̂_ML to maximize the probability of the realized data {y_i, x_i}_{i=1}^N
the likelihood function gives the total probability of observing the realized data as a function of θ
general model
    y_i = f(x_i, β) + ε_i, ε_i ~ iid N(0, σ²), θ = {β, σ}, data = {y_i, x_i}_{i=1}^N
implies
    L(θ) = Pr(y_1, ..., y_N | x_1, ..., x_N, θ) = Pr(y_1 | x_1, θ) ··· Pr(y_N | x_N, θ) = Π_i Pr(y_i | x_i, θ)
taking logs implies
    ln[L(θ)] = Σ_i ln[Pr(y_i | x_i, θ)]
and
    θ̂_ML = argmax_θ ln[L(θ)]
example: Classical Linear Regression Model
    y_i = x_i β + ε_i, ε_i ~ iid N(0, σ²)
what is Pr(y_i | x_i, θ)?
    Pr(y_i | x_i, θ) = Pr(ε_i = y_i − x_i β | x_i, θ) = (1/√(2πσ²)) exp{−½ [(y_i − x_i β)/σ]²}  (normal pdf)
this implies
    θ̂_ML = argmax_θ Σ_{i=1}^N ln[(1/√(2πσ²)) exp{−½ [(y_i − x_i β)/σ]²}]
          = argmax_θ −(N/2) ln(2π) − N ln σ − ½ Σ_{i=1}^N [(y_i − x_i β)/σ]²
which is maximized (over β) by minimizing the sum of squared residuals (thus, identical to OLS)
alternative notation
    Pr(y_i | x_i, θ) = (1/σ) φ(ε_i/σ)
where φ(·) is the std normal pdf and 1/σ is the Jacobian
this implies
    θ̂_ML = argmax_θ Σ_{i=1}^N ln[(1/σ) φ(ε_i/σ)] = argmax_θ −N ln σ + Σ_{i=1}^N ln[φ(ε_i/σ)]
which is algebraically equivalent to the above representation
properties
consistent: plim θ̂_ML = θ
asymptotically normal: θ̂_ML ~ N(θ, Σ_θ)
asymptotically efficient
hypothesis testing
Wald test
    equivalent to the F-test in OLS
    requires estimation of only the unrestricted model
Likelihood ratio test
    intuition: imposing restrictions can never improve the value of the likelihood fn; the amount by which imposing the restriction(s) lowers the likelihood fn provides some indication of the validity of the restriction(s) (the bigger the decrease, the more likely the restriction(s) do not hold)
    requires estimation of both the unrestricted and restricted models
    test statistic: LR = 2{ln[L(θ̂_UR)] − ln[L(θ̂_R)]} ≥ 0, where LR ~ χ²_q and q = # of restrictions being tested
maximization typically by numerical methods, as analytical derivatives are messy
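a worked LR example helps fix the mechanics; here the restriction is H_0: µ = 0 in an iid N(µ, σ²) model, both models estimated by ML (data made up for illustration):

```python
import math

# LR = 2 { ln L(unrestricted) - ln L(restricted) } >= 0, compared to chi^2_1.

def normal_loglik(y, mu):
    """Normal log-likelihood at mean mu, maximized over sigma."""
    n = len(y)
    s2 = sum((yi - mu) ** 2 for yi in y) / n  # ML variance estimate
    return -0.5 * n * (math.log(2 * math.pi) + math.log(s2) + 1.0)

y = [0.8, 1.3, -0.2, 1.9, 0.7, 1.1]
mu_hat = sum(y) / len(y)                 # unrestricted MLE of mu
lr = 2 * (normal_loglik(y, mu_hat) - normal_loglik(y, 0.0))
print(lr)  # nonnegative by construction; compare to chi^2 with q = 1
```

imposing µ = 0 can only lower the maximized likelihood, so the statistic is nonnegative, exactly as the intuition above states.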
3 Limited Dependent Variable (LDV) Models

class of models where y depends on x, but y is not continuous
common cases
y ∈ {0, 1}: binary model
    estimation: probit, logit, LPM
y ∈ [a, b]: censored model
    a, b are known and may vary by i; if a = −∞, b = ∞, then continuous
    estimation: censored regression, tobit
y ∈ {0, 1, 2, ...}: count model
    estimation: poisson, negative binomial
y ∈ {0, 1, 2, ...}: qualitative response (QR) model
    values correspond to choices with no natural ordering
    examples: brand choice, mode of transportation
    estimation: multinomial logit, multinomial probit, conditional logit, nested logit
y ∈ {0, 1, 2, ...}: ordered QR model
    values correspond to choices with a natural ordering; distance between values may vary between choices
    examples: OLF, PT, FT; <HS, HS, College, >College
    estimation: ordered logit, ordered probit
3.1 Binary Models

applicable to problems where the dependent variable is binary
examples: LFP, default on a loan, belonging to a FTA, etc.
Linear Probability Model (LPM)
    y_i = x_i β + ε_i, ε_i ~ N(0, σ_i²)
    estimated by OLS
problems
    heteroskedasticity, since y_i ∈ {0, 1} implies
        ε_i = −x_i β if y_i = 0; ε_i = 1 − x_i β if y_i = 1
        Var(ε_i | x_i) = x_i β (1 − x_i β)
    predictions not bounded by [0, 1]: fitted values x_i β̂ ∉ [0, 1] are possible and therefore do not correspond to probabilities
solution: model Pr(y_i = 1 | x_i) using a proper functional form
    Pr(y_i = 1 | x_i) = F(x_i β)
where F(·) satisfies
    lim_{x_i β → ∞} F(x_i β) = 1 and lim_{x_i β → −∞} F(x_i β) = 0
obvious candidates are CDFs, since these map numbers from the entire real line to the unit interval
two parametric solutions
    F(·) = Φ(·) = ∫_{−∞}^{x_i β} φ(u) du (probit; std normal CDF)
    F(·) = Λ(·) = exp(x_i β) / [1 + exp(x_i β)] (logit; logistic CDF)
interpretation of β
    ∂Pr(y_i = 1)/∂x_j = [∂F(x_i β)/∂(x_i β)] β_j  (the marginal effect)
where
    ∂F(x_i β)/∂(x_i β) = φ(x_i β) (probit) or Λ(x_i β)[1 − Λ(x_i β)] (logit)
notes
    since β_j is more difficult to interpret (it's not the change in the probability given a unit change in x), one typically reports marginal effects
    marginal effects are observation-specific
common reporting options
(i) marginal effects evaluated at the sample mean: φ(x̄ β̂) β̂_j OR Λ(x̄ β̂)[1 − Λ(x̄ β̂)] β̂_j
(ii) the sample mean of the marginal effects: (1/N) Σ_{i=1}^N φ(x_i β̂) β̂_j OR (1/N) Σ_{i=1}^N Λ(x_i β̂)[1 − Λ(x_i β̂)] β̂_j
(iii) marginal effects evaluated at some combination of values of interest: φ(x^o β̂) β̂_j OR Λ(x^o β̂)[1 − Λ(x^o β̂)] β̂_j, where x^o is a vector of values of interest (e.g., white, male, 40 years old, with a college diploma)
caution needed for marginal effects of dummy variables and interaction effects
estimation by MLE
    ln[L(β)] = Σ_i ln[Pr(y_i | x_i, β)]
where
    Pr(y_i | x_i, β) = F(x_i β) if y_i = 1; 1 − F(x_i β) if y_i = 0
which implies
    ln[L(β)] = Σ_{i: y_i = 1} ln[F(x_i β)] + Σ_{i: y_i = 0} ln[1 − F(x_i β)]
             = Σ_i ln{F(x_i β)^{y_i} [1 − F(x_i β)]^{1 − y_i}}
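the logit log likelihood above is globally concave in β, so even naive gradient ascent finds the MLE; a hedged sketch with invented, non-separated data, also computing reporting option (ii), the sample mean of the marginal effects:

```python
import math

# Logit MLE: maximize ln L(b) = sum_i y_i ln Λ(x_i b) + (1 - y_i) ln[1 - Λ(x_i b)];
# the gradient is sum_i (y_i - Λ(x_i b)) x_i.

def lam(z):
    """logistic CDF"""
    return 1.0 / (1.0 + math.exp(-z))

def logit_mle(X, y, steps=20000, rate=0.05):
    b = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(b)
        for row, yi in zip(X, y):
            resid = yi - lam(sum(bk * xk for bk, xk in zip(b, row)))
            for k, xk in enumerate(row):
                grad[k] += resid * xk
        b = [bk + rate * gk for bk, gk in zip(b, grad)]
    return b

X = [[1.0, xi] for xi in [0, 1, 2, 3, 4, 5]]   # intercept + one regressor
y = [0, 0, 1, 0, 1, 1]                          # not perfectly separated
b = logit_mle(X, y)
# sample mean of the marginal effects of x (reporting option (ii))
ame = sum(lam(b[0] + b[1] * r[1]) * (1 - lam(b[0] + b[1] * r[1])) * b[1]
          for r in X) / len(X)
print(b, ame)
```

note that with perfectly separated data the MLE does not exist (the likelihood keeps rising as ‖β‖ → ∞), which is why the toy data above mix the outcomes.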
latent variable framework
probit/logit model can be recast in a latent (unobserved) variable framework
model
    y*_i = x_i β + ε_i, ε_i ~ iid N/L  (y*_i unobserved)
    y_i = 1 if y*_i > 0; y_i = 0 if y*_i ≤ 0  (y_i observed)
given data {y_i, x_i}_{i=1}^N, estimate β via MLE
again, what is Pr(y_i | x_i, θ)?
    if y_i = 1: Pr(y*_i > 0 | x_i, θ) = Pr(ε_i > −x_i β) = Pr(ε_i < x_i β) = F(x_i β / σ)
    if y_i = 0: Pr(y*_i ≤ 0 | x_i, θ) = Pr(ε_i ≤ −x_i β) = F(−x_i β / σ) = 1 − F(x_i β / σ)
which implies an identical form of L(θ)
note: σ is not identified since it would appear everywhere as β/σ, so it is normalized to one
STATA: -probit-, -logit-, -dprobit-, -margin-, -mfx-
3.2 Censored Regression Models

applicable to problems where the dependent variable is censored (potentially from above and below) at certain thresholds
note: while the dependent variable is censored, the x's are always observed
examples: income/wealth may be top-coded, age at first birth for nonmothers, duration of unemployment spells for the currently unemployed
latent variable setup
    y*_i = x_i β + ε_i, ε_i ~ iid N(0, σ²)  (y*_i unobserved)
    y_i = b_i if y*_i ≥ b_i; y_i = y*_i if y*_i ∈ (a_i, b_i); y_i = a_i if y*_i ≤ a_i  (y_i observed)
terminology
    right-censoring: b_i < ∞ for some i
    left-censoring: a_i > −∞ for some i
given data {y_i, x_i}_{i=1}^N, estimate β via MLE
again, what is Pr(y_i | x_i, θ)?
    if y_i = b_i: Pr(y*_i ≥ b_i | x_i, θ) = Pr(ε_i ≥ b_i − x_i β) = 1 − Pr(ε_i < b_i − x_i β) = 1 − Φ((b_i − x_i β)/σ)
    if y_i ∈ (a_i, b_i): Pr(y*_i | x_i, θ) = (1/σ) φ((y_i − x_i β)/σ)
    if y_i = a_i: Pr(y*_i ≤ a_i | x_i, θ) = Pr(ε_i ≤ a_i − x_i β) = Φ((a_i − x_i β)/σ)
log likelihood fn given by
    ln[L(θ)] = Σ_{i: y_i = a_i} ln[Φ((a_i − x_i β)/σ)] + Σ_{i: y_i = b_i} ln[1 − Φ((b_i − x_i β)/σ)] + Σ_{i: y_i ∈ (a_i, b_i)} ln[(1/σ) φ((y_i − x_i β)/σ)]
notes
    interpretation of β is the impact of Δx on y* (the latent variable)
    σ is identified as long as y_i ∈ (a_i, b_i) for some i
special case: tobit model
    a_i = 0, b_i = ∞ ∀ i
examples: labor supply, R&D expenditures by a firm
implies the following setup
    y*_i = x_i β + ε_i, ε_i ~ iid N(0, σ²)  (y*_i unobserved)
    y_i = y*_i if y*_i > 0; y_i = 0 if y*_i ≤ 0  (y_i observed)
estimate via MLE, where the log likelihood fn is given by
    ln[L(θ)] = Σ_{i: y_i = 0} ln[Φ(−x_i β / σ)]  (as in probit)  +  Σ_{i: y_i > 0} ln[(1/σ) φ((y_i − x_i β)/σ)]  (as in CLRM)
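evaluating the tobit log likelihood at a candidate θ is mechanical once Φ and φ are available (math.erf gives Φ); a sketch with invented data and index values, intercept absorbed into x_i β:

```python
import math

# ln L = sum_{y=0} ln Phi(-x_i b / sigma) + sum_{y>0} ln[(1/sigma) phi((y_i - x_i b)/sigma)]

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def tobit_loglik(y, xb, sigma):
    """Tobit log likelihood given fitted indices xb = [x_i * beta]."""
    ll = 0.0
    for yi, xbi in zip(y, xb):
        if yi == 0.0:                     # censored at zero: probability mass
            ll += math.log(norm_cdf(-xbi / sigma))
        else:                             # uncensored: density contribution
            ll += math.log(norm_pdf((yi - xbi) / sigma) / sigma)
    return ll

y  = [0.0, 0.0, 1.2, 2.5, 4.1]           # zeros are censored observations
xb = [-1.0, -0.3, 1.0, 2.0, 3.8]         # hypothetical fitted index values
print(tobit_loglik(y, xb, sigma=1.0))
```

a numerical optimizer would maximize this over (β, σ); the point of the sketch is only the two-part structure of the likelihood.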
interpretation of β is the impact of Δx on y* (the latent variable)
    not directly comparable to OLS, which estimates ∂E[y_i | x_i]/∂x_j
marginal effects (for comparison to OLS)
    ∂E[y_i | x_i]/∂x_j = β_j Φ(x_i β / σ)
    these are observation-specific
[Figure: tobit illustration with one regressor (true values alpha = −2, beta = 2); plot compares the data points, the OLS fit, and the tobit fit]
STATA: -tobit-, -cnreg-
3.3 Count Models

applicable to situations where the dependent variable is a non-negative integer count of events
dependent variable typically takes on only a few values
examples: # of children, # of patents held by a firm, # of doctor visits per year, # of cigarettes smoked per day

3.3.1 Poisson Model

setup
    model the expected number of events conditional on x: E[y_i | x_i] = F(x_i β), where F(·) ≥ 0
    functional form: F(x_i β) = exp{x_i β}
estimation via MLE
    need a distribution for y_i, which cannot be normal
    assume the Poisson distribution, which depends only on the mean, given by exp{x_i β}
again, what is Pr(y_i | x_i, θ)?
    Pr(y_i | x_i, θ) = exp{−λ_i} λ_i^{y_i} / y_i!  (Poisson pdf)
where λ_i = exp{x_i β}
can show that E[y_i | x_i] = Var(y_i | x_i) = λ_i
log likelihood fn given by
    ln[L(θ)] = Σ_i [−λ_i + y_i ln λ_i − ln(y_i!)]
interpretation of β: as if ln(y) were the dependent variable; implies a %Δ, or an elasticity if ln(x) is used
marginal effects
    ∂E[y_i | x_i]/∂x_i = λ_i β
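the first-order condition of the Poisson log likelihood illustrates its mean-fitting property: in the intercept-only model (λ_i = exp{β_0} for all i), the MLE of λ is the sample mean; counts below are invented for illustration:

```python
import math

# ln L(lambda) = sum_i [ -lambda + y_i ln(lambda) - ln(y_i!) ]
# FOC: -n + (sum_i y_i)/lambda = 0  =>  lambda_hat = ybar

def poisson_loglik(y, lam):
    # math.lgamma(yi + 1) = ln(y_i!)
    return sum(-lam + yi * math.log(lam) - math.lgamma(yi + 1) for yi in y)

y = [0, 2, 1, 3, 0, 1, 2, 1]
ybar = sum(y) / len(y)

# the log likelihood at the sample mean beats any other candidate
best = poisson_loglik(y, ybar)
assert all(poisson_loglik(y, lam) <= best for lam in [0.5, 1.0, 1.5, 2.0])
print(ybar, best)
```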
shortcoming
    Poisson assumes the mean and variance are identical
regression-based test (Cameron & Trivedi 1990)
    H_0: Var(y_i) = E[y_i]
    H_1: Var(y_i) = E[y_i] + α g(E[y_i])
    regress z_i = [(y_i − λ̂_i)² − y_i] / (λ̂_i √2) on either (i) a constant or (ii) λ̂_i and no constant; the t-statistic is a test of the null hypothesis
Wooldridge proposes a correction to the standard errors when there is over- or under-dispersion
    assumes proportionality: Var(y | x) = σ² E[y | x], where σ² is unknown
    σ² > 1 ⇒ overdispersion; σ² < 1 ⇒ underdispersion
    define û_i = y_i − λ̂_i
    obtain σ̂² = [1/(n − k − 1)] Σ_i û_i² / λ̂_i
    multiply the usual standard errors by σ̂ = √σ̂² > 0
alternative estimation: negative binomial model
STATA: -poisson-, -nbreg-
3.3.2 Zero-Inflated Poisson Model (skip)

applicable to count models with a mass at zero, or where the decision between zero and some positive count differs from the decision among positive counts
notation
    observations are assumed to be drawn from either of two regimes
    observations in regime 1, given by z_i = 1, always have a count of zero
    outcomes for observations in regime 2, given by z_i = 2, follow a Poisson process, with possible outcomes y_i ≥ 0
MLE set-up entails
    Pr(y = 0 | x) = Pr(z = 1 | x) + Pr(y = 0 | x, z = 2) Pr(z = 2 | x)
    Pr(y = j | x) = Pr(y = j | x, z = 2) Pr(z = 2 | x), j = 1, 2, ...
log likelihood fn given by
    ln[L(θ)] = Σ_{i: y_i = 0} ln{F(x_i γ) + [1 − F(x_i γ)] exp(−λ_i)} + Σ_{i: y_i > 0} {ln[1 − F(x_i γ)] − λ_i + y_i ln λ_i − ln(y_i!)}
where F(·) is either the logistic or normal CDF
notes
    as in the poisson model, λ_i = exp{x_i β}
    β is not constrained to be equal to γ
    some elements of β or γ may be constrained to be zero (i.e., the regressors in the two parts of the model need not be the same)
STATA: -zip-, -zinb-
3.3.3 Application: Talley et al. (2005)

question: determinants of crew injuries in ship accidents
data structure
    sample includes accidents investigated by the US Coast Guard
    includes US ships anywhere, and foreign ships in US waters
    outcomes: # deaths, # injuries, # missing (small, non-negative integer count data)
    lots of covariates; mix of discrete and continuous data
    data over 11 years, 1991-2001
model
    a bit structural, then arrives at reduced form eqn (5)
    count data ⇒ Poisson, negative binomial
    separate models by dependent variable and type of ship (freight, tanker, tugboat)
results: interpreted as %Δ; marginal effects also reported
shortcoming
    non-random sample selection; only accidents are included
    does not account for the impact of the x's on the probability of an accident, only severity conditional on an accident occurring
3.4 Qualitative Response Models

applicable to analyses of choice by agents among many (unordered) alternatives
    e.g., brand choice, mode of transportation, type of mortgage (fixed-30 yr, fixed-15 yr, adjustable rate, ...), type of school (public, charter, private-nonreligious, private-religious, ...), occupation
dependent variable typically coded as (positive) integers corresponding to specific choices; the value/order of the actual numbers is irrelevant
3.4.1 Multinomial Logit

setup
    choose among J + 1 alternatives: y_i ∈ {0, 1, ..., J}
    x's are observation-specific attributes, not choice-specific attributes
again, what is Pr(y_i = j | x_i, θ), j = 0, 1, ..., J?
    Pr(y_i = j | x_i, θ) = F(x_i, θ)
F(·) should satisfy two criteria
    F(·) ∈ (0, 1)
    Σ_j Pr(y_i = j | x_i, θ) = 1 ∀ i
functional form of F(·)
    Pr(y_i = j | x_i, θ) = exp(x_i β_j) / Σ_{k=0}^J exp(x_i β_k)
known as multinomial logit
notes
    β's are choice-specific (subscripted by j)
    MLE based on the above model is not identified, since replacing every β_j with β_j + c leaves L(θ) unchanged; this implies an infinite # of solutions
    to identify the model, normalize β_0 = 0
    identification also requires that the x's be observation-specific, not choice-specific; the model cannot identify choice-specific effects of choice-specific variables
likelihood function
    ln[L(θ)] = Σ_{i: y_i = 0} ln[1 / (1 + Σ_{k=1}^J exp(x_i β_k))] + Σ_{i: y_i ≠ 0} Σ_{j=1}^J d_ij ln[exp(x_i β_j) / (1 + Σ_{k=1}^J exp(x_i β_k))]
where d_ij = 1 if y_i = j, 0 otherwise
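the identification point can be checked numerically: multinomial-logit probabilities depend only on differences of the linear indices, so shifting every x_i β_j by the same constant c leaves Pr(y_i = j) unchanged, which is why one β can be normalized to zero (index values below are arbitrary):

```python
import math

def mnl_probs(v):
    """softmax: Pr(j) = exp(v_j) / sum_k exp(v_k)"""
    e = [math.exp(vj) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

v = [0.0, 1.2, -0.4, 0.7]        # index for the base choice normalized to 0
p = mnl_probs(v)
p_shift = mnl_probs([vj + 5.0 for vj in v])   # add the same constant c = 5
print(p)
assert abs(sum(p) - 1.0) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_shift))
```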
interpretation
log odds ratio (relative to the base choice)
    ln(P_ij / P_i0) = x_i β_j, where P_ij = Pr(y_i = j)
    implies β_j is the %Δ in the odds ratio from a unit change in x
log odds ratio (relative to any other choice)
    ln(P_ij / P_ij′) = x_i (β_j − β_j′)
shortcoming: Independence of Irrelevant Alternatives (IIA)
    due to the assumption of independent, homoskedastic errors
    expressed as: ln(P_ij / P_ij′) does not depend on P_ik, k ≠ j, j′
Hausman test
    estimate the full model ⇒ β̂_j, j = 1, ..., J
    estimate a partial model on only a subset of choices ⇒ β̃_j, j = 1, ..., J_P, where J_P < J
    if IIA holds: β̃_j ≈ β̂_j, j = 1, ..., J_P
    thus, if the partial model and full model yield estimates that are too different, then we reject IIA
multinomial probit does not impose IIA, but estimation is more complex since it involves multiple integration
nested logit is also a possible alternative
random utility model (RUM)
model can be derived from a RUM
    utility of observation i from choice j given by U_ij = V_ij + ε_ij = x_i β_j + ε_ij
    where ε is iid from a Type I extreme value dbn, with G(ε) = exp(−e^{−ε})
    observations choose to maximize utility
implies
    Pr(y_i = j) = Pr(U_ij ≥ U_ij′ ∀ j′ ≠ j) = exp(V_ij) / Σ_{k=0}^J exp(V_ik)
which is identical to the above
STATA: -mlogit-, -mprobit-
3.4.2 Conditional Logit

multinomial logit does not identify the effects of choice-specific x's
the usual model with such data is a conditional logit
    results analogous to many pairwise binary logit models
MLE
again, what is Pr(y_i = j | x_j, θ), j = 0, 1, ..., J?
    Pr(y_i = j | x_j, θ) = F(x_j, θ)
F(·) should satisfy two criteria
    F(·) ∈ (0, 1)
    Σ_j Pr(y_i = j | x_j, θ) = 1 ∀ i
functional form of F(·)
    Pr(y_i = j | x_j, θ) = exp(x_j β) / Σ_{k=0}^J exp(x_k β)
known as conditional logit
likelihood function
    ln[L(θ)] = Σ_i Σ_j d_ij ln[exp(x_j β) / Σ_{k=0}^J exp(x_k β)]
log odds ratio
    ln(P_ij / P_ij′) = (x_j − x_j′) β
which still implies IIA
interpretation
    marginal effects, own continuous attributes: ∂Pr(y_i = j)/∂x_j = Pr(y_i = j)[1 − Pr(y_i = j)] β
    marginal effects, other continuous attributes: ∂Pr(y_i = j)/∂x_j′ = −Pr(y_i = j) Pr(y_i = j′) β, j′ ≠ j
notes
    model allows attributes of all other choices to affect the probability of each option being chosen
    however, the two derivatives must necessarily be of opposite sign
        may not be a good property (e.g., FDI is increasing in own market size, and in the market size of neighbors)
    can similarly be derived from a random utility model
STATA: -clogit-
3.4.3 Nested Logit (skip)

reframes the problem as a sequential, multi-stage decision
    e.g., a business may first decide on a state in which to build a new plant, and then choose a county within the state
notation
    J = total number of potential choices (e.g., the number of counties in the US)
    L = number of subgroups (e.g., the number of states in the US)
    J_l = number of choices within subgroup l, l = 1, ..., L (e.g., the number of counties in state l)
    implies the identity J = Σ_{l=1}^L J_l
    x_{j|l} = attributes of choice j in subgroup l
    z_l = attributes of subgroup l
MLE
probability of a given choice (the unconditional probability) given by
    Pr(y_i = j, l) = exp(x_{j|l} β + z_l γ) / Σ_{h=1}^L Σ_{k=1}^{J_h} exp(x_{k|h} β + z_h γ)
the unconditional probability is equal to the conditional times the marginal probability
    Pr(y_i = j, l) = Pr(y_i = j | l) Pr(y_i = l)
define the inclusive value as
    I_l = ln[Σ_{k=1}^{J_l} exp(x_{k|l} β)]
re-arranging, we get
    Pr(y_i = j, l) = Pr(y_i = j | l) Pr(y_i = l) = [exp(x_{j|l} β) / Σ_{k=1}^{J_l} exp(x_{k|l} β)] × [exp(z_l γ + τ_l I_l) / Σ_{h=1}^L exp(z_h γ + τ_h I_h)]
restricting τ_l = 1, l = 1, ..., L, yields the conditional logit model
error term is homoskedastic within each subgroup, heteroskedastic across subgroups
log odds ratios
any two choices
    ln[Pr(y_i = j, l) / Pr(y_i = j′, l′)] = (x_{j|l} − x_{j′|l′}) β + (z_l − z_l′) γ + (τ_l − 1) I_l − (τ_l′ − 1) I_l′
    which incorporates all attributes of each subgroup l, l′
any two choices within a subgroup
    ln[Pr(y_i = j, l) / Pr(y_i = j′, l)] = (x_{j|l} − x_{j′|l}) β
    which implies IIA
any two subgroups
    ln[Pr(y_i = l) / Pr(y_i = l′)] = (z_l − z_l′) γ + (τ_l I_l − τ_l′ I_l′)
    which incorporates all attributes of each subgroup l, l′
interpretation
marginal effects, own continuous attributes
    ∂Pr(y_i = j, l)/∂x_{j|l} = Pr(y_i = j, l) {[1 − Pr(y_i = j | l)] + τ_l Pr(y_i = j | l) [1 − Pr(y_i = l)]} β
marginal effects, continuous attributes of j′, l (same subgroup)
    ∂Pr(y_i = j, l)/∂x_{j′|l} = Pr(y_i = j, l) {−Pr(y_i = j′ | l) + τ_l Pr(y_i = j′ | l) [1 − Pr(y_i = l)]} β
    which may or may not be of the same sign as β
marginal effects, continuous attributes of j′, l′ (different subgroup)
    ∂Pr(y_i = j, l)/∂x_{j′|l′} = −Pr(y_i = j, l) τ_l′ Pr(y_i = j′ | l′) Pr(y_i = l′) β
    which is of the opposite sign of β
STATA: -nlogit-
3.4.4 Application: Co & List (2004)

question: analyze determinants of location choice of FDI in the US; in particular, are knowledge spillovers important?
data structure
    outcome is the state chosen by new foreign-owned plants
    plants entered the US between 1986 and 1993
    x's are state-specific attributes in 1986
    of particular interest is the x measuring knowledge spillovers, measured by patent counts
model: discrete choice data, with choice-specific x's ⇒ conditional logit model
results
    min, median, and max elasticity estimates reported
    elasticity given by ∂ln(P_j)/∂ln(x_jm) = x_jm β_m (1 − P_j)
    knowledge spillovers, unionization rates positively associated with the probability of receiving a new plant
    energy expenditures, pollution costs negatively associated with the probability of receiving a new plant
3.5 Ordered Response Models

applicable to analyses of choice by agents among many ordered alternatives
    e.g., labor force status (OLF, PT, FT), schooling (<HS, HS, some coll, BA/BS, MA, PhD), bond ratings (AAA, ..., D)
outcomes coded as discrete integers: 0, 1, 2, ...
relative to other models
    MNL, MNP fail to account for the ordinal nature of the data
    OLS treats the distance between choices as identical (e.g., moving from OLF to PT is equivalent to moving from PT to FT)
latent variable framework
similar to the probit/logit framework
model
    y*_i = x_i β + ε_i, ε_i ~ iid N/L  (y*_i unobserved)
    y_i = 0 if y*_i ≤ 0; 1 if y*_i ∈ (0, µ_1]; 2 if y*_i ∈ (µ_1, µ_2]; ... ; J if y*_i > µ_{J−1}  (y_i observed)
where the µ's are cutoff parameters that are unknown and to be estimated
given data {y_i, x_i}_{i=1}^N, estimate β via MLE
again, what is Pr(y_i | x_i, θ)?
    if y_i = 0: Pr(y*_i ≤ 0 | x_i, θ) = Pr(ε_i ≤ −x_i β) = F(−x_i β) = 1 − F(x_i β)
    if y_i = j, j = 1, ..., J − 1: Pr(µ_{j−1} < y*_i ≤ µ_j | x_i, θ) = Pr(ε_i ≤ µ_j − x_i β) − Pr(ε_i ≤ µ_{j−1} − x_i β) = F(µ_j − x_i β) − F(µ_{j−1} − x_i β)
    if y_i = J: Pr(y*_i > µ_{J−1} | x_i, θ) = 1 − Pr(ε_i ≤ µ_{J−1} − x_i β) = 1 − F(µ_{J−1} − x_i β)
where µ_0 = 0, σ is normalized to one, and 0 < µ_1 < ··· < µ_{J−1}
the form of F(·) defines the ordered probit/logit models
interpretation
    marginal effects are outcome-specific; in general
    ∂Pr(y = j | x)/∂x = −f(xβ) β if j = 0; [f(µ_{j−1} − xβ) − f(µ_j − xβ)] β if j = 1, ..., J − 1; f(µ_{J−1} − xβ) β if j = J
where f(·) is the corresponding pdf and µ_0 = 0
STATA: -oprobit-, -ologit-
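the category probabilities above partition the real line, so they sum to one by construction; a sketch with the logistic CDF (ordered logit), with invented cutoffs and index value:

```python
import math

def F(z):
    """logistic CDF"""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_probs(xb, cuts):
    """Pr(y=0), ..., Pr(y=J) given cutoffs 0 = mu_0 < mu_1 < ... < mu_{J-1}."""
    mus = [0.0] + list(cuts)            # mu_0 = 0 normalization as in the text
    p = [F(mus[0] - xb)]                                    # Pr(y = 0)
    p += [F(mus[j] - xb) - F(mus[j - 1] - xb) for j in range(1, len(mus))]
    p += [1.0 - F(mus[-1] - xb)]                            # Pr(y = J)
    return p

p = ordered_probs(xb=0.8, cuts=[1.0, 2.5])
print(p)                                # four categories: 0, 1, 2, 3
assert abs(sum(p) - 1.0) < 1e-12
assert all(0.0 < pj < 1.0 for pj in p)
```

swapping F for the standard normal CDF turns the same sketch into ordered probit.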
3.5.1 Application: Kalb & Williams (2003)

question: analyze the differential effects of attributes on juvenile criminal behavior by gender
data structure
    sample of individuals born in 1958 and residing in Philadelphia at some point between the ages of 10 and 18
    complete criminal history
    retrospective data on other variables collected in 1988
    dependent variable = # of arrests (0, 1, or 2+)
    other variables represent family background
    issues with respect to sample weighting and missing data can be ignored for our purposes
model: ordered, integer data ⇒ ordered probit model
results
    coefficient estimates and marginal effects are reported
    results fairly similar across genders
        non-whites and those leaving school early are more likely to commit criminal acts
        no impact of past physical/sexual abuse
    some differences by gender
        maternal labor supply has no effect on boys, decreases the probability of criminal behavior for girls
        a father with a HS diploma and fewer siblings have no effect on girls, decrease the probability of criminal behavior for boys