AN OVERVIEW OF INSTRUMENTAL VARIABLES* KENNETH A BOLLEN CAROLINA POPULATION CENTER UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL *Based On Bollen, K.A. (2012). Instrumental Variables In Sociology and the Social Sciences. Annual Review Of Sociology 38:37-72.
OUTLINE
I. INTRODUCTION
II. WHAT ARE INSTRUMENTAL VARIABLES (IVs)?
III. ORIGINS OF INSTRUMENTAL VARIABLE METHODS
IV. APPLICATION AREAS
V. FINDING INSTRUMENTAL VARIABLES
VI. EVALUATING INSTRUMENTAL VARIABLES
VII. HETEROGENEOUS CAUSAL EFFECTS
VIII. CONCLUSIONS
INTRODUCTION
- Equation error can correlate with a covariate for many reasons
  - This can create a biased and inconsistent estimator
- Instrumental Variables (IVs) can help
  - IV methods appear in many social sciences
  - Spreading through even more disciplines
- Purpose of presentation
  - Broad overview of IVs
  - Give advantages & disadvantages
  - Present methods to assess quality of IVs
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
The Problem: $\mathrm{COV}(X, \varepsilon_Y) \neq 0$
$Y_i = \alpha + \beta X_i + \varepsilon_{Yi}$
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
Ordinary Least Squares applied (simple regression)
$\beta_{OLS} = \dfrac{\mathrm{COV}(X, Y)}{\mathrm{VAR}(X)} \neq \beta$
Biased estimator if the correlation of the error with X is ignored
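WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
The Instrumental Variable Solution (simple regression)
For variable Z to be an IV:
$\mathrm{COV}(Z, \varepsilon_Y) = 0$
$\mathrm{COV}(Z, X) \neq 0$
$\mathrm{COV}(Y, Z) = \mathrm{COV}(\alpha + \beta X + \varepsilon_Y, Z) = \beta\,\mathrm{COV}(X, Z)$
$\beta_{IV} = \dfrac{\mathrm{COV}(Y, Z)}{\mathrm{COV}(X, Z)} = \beta$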
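The covariance-ratio formulas for the simple regression case can be checked on simulated data. The sketch below is only an illustration: the data-generating values (true slope 2.0, the 0.8 and 0.6 coefficients, the seed) are assumptions made up for the example, not values from the presentation.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def simulate(n=20000, beta=2.0, seed=1):
    """Hypothetical data: X is correlated with the error, Z is not."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # instrument, independent of the error
        e = rng.gauss(0, 1)                      # equation error
        x = 0.8 * z + 0.6 * e + rng.gauss(0, 1)  # X depends on e, so COV(X, e) != 0
        y = 1.0 + beta * x + e
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

xs, ys, zs = simulate()
beta_ols = cov(xs, ys) / cov(xs, xs)  # COV(X,Y) / VAR(X)
beta_iv = cov(ys, zs) / cov(xs, zs)   # COV(Y,Z) / COV(X,Z)
print(f"OLS: {beta_ols:.3f}   IV: {beta_iv:.3f}   (true beta = 2.0)")
```

Under these assumed values the OLS ratio converges to about 2.3 rather than 2.0, while the IV ratio stays close to the true slope.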
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
Ordinary Least Squares applied (multiple regression)
$Y = X\beta + \varepsilon_Y$
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon_Y$
Separates X into 2 parts, $X_1$ and $X_2$, where
- $X_1$ correlates with $\varepsilon_Y$ (the problem!)
- $X_2$ does not correlate with $\varepsilon_Y$
$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$
$\hat{\beta}_{OLS}$ is a biased (& inconsistent) estimator of $\beta$
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
Ordinary Least Squares applied (multiple regression)
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon_Y$
- $Y$ = birth weight
- $X_1$ = smoking, drinking
- $X_2$ = age, race, first child
$\hat{\beta}_{OLS}$ is a biased (& inconsistent) estimator of $\beta$
(estimates of the effects of all variables are biased to an unknown degree)
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
The Instrumental Variable Solution (multiple regression)
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon_Y$
[recall $C(X_1, \varepsilon_Y) \neq 0$; $C(X_2, \varepsilon_Y) = 0$]
$Z = [X_2 \;\; X_3]$ where $C(X_3, \varepsilon_Y) = 0$
$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$ where $P_Z = Z(Z'Z)^{-1}Z'$
$\hat{\beta}_{IV}$ is a consistent (asymptotically unbiased) estimator of $\beta$
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
Instrumental Variable Solution (multiple regression)
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon_Y$
- $Y$ = birth weight
- $X_1$ = smoking, drinking
- $X_2$ = age, race, 1st child
IVs are in $Z = [X_2 \;\; X_3]$. Need $X_3$.
- $X_3$ = tobacco & alcohol receipts, smoke & drink history, spouse reports, DUI tickets
$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$ where $P_Z = Z(Z'Z)^{-1}Z'$
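The matrix formula for the IV estimator can be sketched in a few lines of NumPy. This is an illustration on simulated data, not the birth weight analysis: the sample size, coefficients, and the helper name `iv_estimate` are assumptions chosen for the example. The projection P_Z is never formed explicitly; P_Z X is obtained from a least-squares fit of X on Z.

```python
import numpy as np

def iv_estimate(X, Z, y):
    """beta_IV = (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # X_hat = P_Z X
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)   # P_Z is symmetric and idempotent

rng = np.random.default_rng(0)
n = 5000
const = np.ones((n, 1))
X2 = rng.normal(size=(n, 2))                  # covariates uncorrelated with the error
X3 = rng.normal(size=(n, 3))                  # instruments excluded from the Y equation
u = rng.normal(size=n)                        # omitted factor driving the endogeneity
A = np.array([[0.7, 0.1], [0.2, 0.6], [0.3, 0.3]])
X1 = X3 @ A + np.outer(u, [0.8, 0.5]) + rng.normal(size=(n, 2))
eps = u + rng.normal(size=n)                  # error correlated with X1 through u

beta_true = np.array([1.0, -1.0, -0.5, 0.3, 0.2])   # hypothetical coefficients
X = np.hstack([const, X1, X2])
Z = np.hstack([const, X2, X3])
y = X @ beta_true + eps

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_iv = iv_estimate(X, Z, y)
print("true:", beta_true)
print("OLS: ", beta_ols.round(3))
print("IV:  ", beta_iv.round(3))
```

With three instruments in X3 for two variables in X1, the equation is overidentified by one instrument.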
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
IV procedures in Stata, SAS, etc.
Stata:
    ivregress 2sls brthwght age race frstchld (smoke drink = hissmok hisdrnk spsmoke spdrink DUI)
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon_Y$
- $Y$ = birth weight
- $X_1$ = smoking, drinking
- $X_2$ = age, race, 1st child
- $X_3$ = smoke & drink history, spouse reports, DUI tickets
WHAT ARE INSTRUMENTAL VARIABLES (IVS)?
IV procedures in Stata, SAS, etc.
SAS:
    proc syslin 2sls;
      endogenous brthwght smoke drink;
      instruments age race frstchld hissmok hisdrnk spsmoke spdrink DUI;
      model brthwght = age race frstchld smoke drink;
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon_Y$
- $Y$ = birth weight
- $X_1$ = smoking, drinking
- $X_2$ = age, race, 1st child
- $X_3$ = smoke & drink history, spouse reports, DUI tickets
ORIGINS OF INSTRUMENTAL VARIABLE METHODS
- Sewall Wright (1925, Corn and Hog Correlations, US Dept Agric Bull.)
  - Goldberger (1972) credits Sewall
- Philip Wright (1928, The Tariff on Animal and Vegetable Oils)
  - Appendix B has IVs in supply & demand problem
- Which Wright is right? Controversy
- Prior to Goldberger (1972) many gave credit to Reiersøl (1941, 1945)
APPLICATION AREAS
Simultaneous Equation Models
- Early ones in path analysis
- Highly developed in econometrics
- Two or more dependent variables
- Assume no measurement error
$Y = \alpha + BY + \Gamma X + \varepsilon$
where $Y$ = endogenous vars., $X$ = exogenous vars., $\alpha$ = intercepts, $B$ = coefficients, $\Gamma$ = coefficients, $\varepsilon$ = errors
APPLICATION AREAS
Simultaneous Equation Models
Felson & Bohrnstedt (1979)
[Path diagram: variables GPA, height, weight, rating, academic, attract; disturbances $\zeta_1$, $\zeta_2$]
APPLICATION AREAS
Simultaneous Equation Models
Consider academic equation:
$\text{academic} = \alpha_1 + \beta_1\,\text{attract} + \beta_2\,\text{GPA} + \varepsilon_1$
$X = [X_1 \;\; X_2]$, $X_1$ = attract, $X_2$ = GPA
$Z = [X_2 \;\; X_3]$, $X_3$ = height, weight, rating
$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$ where $P_Z = Z(Z'Z)^{-1}Z'$
APPLICATION AREAS
Simultaneous Equation Models
Consider academic equation:
$\text{academic} = \alpha_1 + \beta_1\,\text{attract} + \beta_2\,\text{GPA} + \varepsilon_1$
- Feedback implies $\mathrm{COV}(\text{attract}, \varepsilon_1) \neq 0$
- Need at least 1 IV in $X_3$
- 2 or more IVs: overidentified
- $X_3$ = height, weight, rating: overidentified
$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$
APPLICATION AREAS
Factor Analysis (less common application)
- Madansky (1964, Psychometrika) first to suggest
  - Exploratory factor analysis
  - No correlated errors
- Approach here based on Bollen (1996, Psychometrika)
  - Confirmatory (or exploratory) factor analysis
  - Allows correlated errors
$Z = \alpha + \Lambda L + \varepsilon$
$Z$ = indicators, $L$ = latent vars (factors), $\alpha$ = intercepts, $\Lambda$ = coefficients (loadings), $\varepsilon$ = errors
APPLICATION AREAS
Factor Analysis (less common application)
1 factor, 4 indicators
[Path diagram: latent variable $L_1$ (Subjective Air Quality) with indicators $Z_1$ (Overall), $Z_2$ (Clarity), $Z_3$ (Color), $Z_4$ (Odor); errors $\varepsilon_1$ to $\varepsilon_4$]
APPLICATION AREAS
Factor Analysis (less common application)
1 factor, 4 indicators
$Z_1 = L_1 + \varepsilon_1$ (sets scale of $L_1$)
$Z_2 = \alpha_2 + \Lambda_{21} L_1 + \varepsilon_2$
$Z_3 = \alpha_3 + \Lambda_{31} L_1 + \varepsilon_3$
$Z_4 = \alpha_4 + \Lambda_{41} L_1 + \varepsilon_4$
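APPLICATION AREAS
Factor Analysis (less common application)
1 factor, 4 indicators
$Z_1 = L_1 + \varepsilon_1$, so $L_1 = Z_1 - \varepsilon_1$
Consider 2nd indicator equation:
$Z_2 = \alpha_2 + \Lambda_{21} L_1 + \varepsilon_2$
$\quad = \alpha_2 + \Lambda_{21}(Z_1 - \varepsilon_1) + \varepsilon_2$
$\quad = \alpha_2 + \Lambda_{21} Z_1 - \Lambda_{21}\varepsilon_1 + \varepsilon_2$
$\mathrm{COV}(Z_1, \varepsilon_1) \neq 0$, so IVs are needed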
APPLICATION AREAS
Factor Analysis (less common application)
1 factor, 4 indicators
$Z_2 = \alpha_2 + \Lambda_{21} Z_1 - \Lambda_{21}\varepsilon_1 + \varepsilon_2$
$\mathrm{COV}(Z_1, \varepsilon_1) \neq 0$, so IVs are needed
IVs must:
(1) correlate with $Z_1$
(2) be uncorrelated with $\varepsilon_1$, $\varepsilon_2$
$Z_3$ & $Z_4$ meet these conditions
APPLICATION AREAS
Factor Analysis (less common application)
1 factor, 4 indicators
$Z_2 = \alpha_2 + \Lambda_{21} Z_1 - \Lambda_{21}\varepsilon_1 + \varepsilon_2$
IV formula: $\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$
For the $Z_2$ equation: $\hat{\Lambda}_{21}$ is $\hat{\beta}_{IV}$, $Z_1$ is $X$, $Z_2$ is $Y$
$Z_3$, $Z_4$ form $Z$, and $P_Z = Z(Z'Z)^{-1}Z'$
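The estimation of the second indicator's loading with the other indicators as instruments can be sketched on simulated data. Everything numeric here is an assumption for illustration (loadings 1.0, 0.8, 0.7, 0.6, the intercepts, error standard deviation 0.7, sample size); the air quality data are not used, and `iv_estimate` is a hypothetical helper name.

```python
import numpy as np

def iv_estimate(X, Z, y):
    """beta_IV = (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(42)
n = 5000
L1 = rng.normal(size=n)                 # latent factor
loadings = [1.0, 0.8, 0.7, 0.6]         # hypothetical; the first is fixed to 1 to set the scale
intercepts = [0.0, 0.5, -0.2, 0.3]      # hypothetical
Z1, Z2, Z3, Z4 = (a + lam * L1 + rng.normal(scale=0.7, size=n)
                  for a, lam in zip(intercepts, loadings))

ones = np.ones(n)
X = np.column_stack([ones, Z1])         # scaling indicator replaces the latent variable
Z = np.column_stack([ones, Z3, Z4])     # instruments: the remaining indicators

iv_coef = iv_estimate(X, Z, Z2)         # [alpha_2, Lambda_21]
ols_coef = np.linalg.lstsq(X, Z2, rcond=None)[0]
print("IV  alpha_2, Lambda_21:", iv_coef.round(3))
print("OLS alpha_2, Lambda_21:", ols_coef.round(3))
```

Under these assumed values the OLS slope of Z2 on Z1 is attenuated toward roughly 0.54 because Z1 contains its own error, while the IV estimate recovers a loading near 0.8.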
APPLICATION AREAS
Factor Analysis (less common application)
General Procedure for IV estimation:
- Replace each latent variable with its scaling indicator minus its error
  - Transforms latent variable model into observed variable model
- For each equation, find those indicators from other equations that are uncorrelated with the error
- Apply usual IV formula
- Because suitable IVs are dictated by the model, I refer to these as Model Implied Instrumental Variables (MIIVs)
- Tests for overidentified equations are tests of the model
APPLICATION AREAS
Latent Variable SEM (less common application)
Bollen (1996, Psychometrika)
$L = \alpha_L + BL + \varepsilon_L$
$Y = \alpha_Y + \Lambda L + \varepsilon_Y$
- $Y$ = indicators, $L$ = latent vars (factors)
- $\varepsilon_L$, $\varepsilon_Y$ = errors for $L$ & $Y$ eqs., respectively
- $\alpha_L$, $\alpha_Y$ = intercepts for $L$ & $Y$ eqs., respectively
- $B$, $\Lambda$ = coefficients for $L$ & $Y$ eqs., respectively
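APPLICATION AREAS
Latent Variable SEM (less common application)
Robins & West (1977, JASA)
[Path diagram: latent variable $L$ with disturbance $\varepsilon_L$; observed variables $Y_1$, $Y_2$, $Y_3$, ..., $Y_{N_Y-3}$, $Y_{N_Y-2}$, $Y_{N_Y-1}$, $Y_{N_Y}$; errors $\varepsilon_{Y,N_Y-2}$, $\varepsilon_{Y,N_Y-1}$, $\varepsilon_{Y,N_Y}$]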
APPLICATION AREAS
Latent Variable SEM (less common application)
Robins & West (1977, JASA)
- $L_1$ = value of home
- $Y_1$ = lot size
- $Y_2$ = square footage
- $Y_3$ = number of rooms
- $Y_4$ to $Y_{N_Y-3}$ = other causal indicators
- $Y_{N_Y-2}$ = appraised value
- $Y_{N_Y-1}$ = owner estimate
- $Y_{N_Y}$ = assessed value
- $\varepsilon_L$, $\varepsilon_{Y,N_Y-2}$, $\varepsilon_{Y,N_Y-1}$, $\varepsilon_{Y,N_Y}$ = disturbances (errors)
- $N_Y$ = # of observed variables
APPLICATION AREAS
Latent Variable SEM (less common application)
General Procedure for IV estimation:
- Replace each latent variable with its scaling indicator minus its error
  - Transforms latent variable model into observed variable model
- For each equation, find those indicators from other equations that are uncorrelated with the error
- Apply usual IV formula
- Model Implied Instrumental Variables (MIIVs)
- Tests for overidentified equations are tests of the model
APPLICATION AREAS
Dichotomous/ordinal dependent variable
$Y^* = X\beta + \varepsilon$
Dichotomous outcome: $Y = 1$ if $Y^* > 0$; $Y = 0$ if $Y^* \le 0$
- e.g., $Y = 1$ HIV positive, 0 not
Ordinal outcome: $Y = c$ if $\tau_c \le Y^* < \tau_{c+1}$
- $\tau$s are thresholds crossed by $Y^*$
- e.g., abortion attitude, $Y$ = 0 to 5
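The mapping from the latent Y* to the observed dichotomous or ordinal Y can be written as a pair of small functions. The threshold values below are hypothetical, picked only so that the ordinal outcome runs from 0 to 5; the function names are likewise illustrative.

```python
from bisect import bisect_right

# Hypothetical thresholds tau_1 < ... < tau_5 for an ordinal outcome coded 0 to 5.
THRESHOLDS = [-1.5, -0.5, 0.0, 0.5, 1.5]

def dichotomize(y_star):
    """Y = 1 if Y* > 0, and Y = 0 if Y* <= 0."""
    return 1 if y_star > 0 else 0

def categorize(y_star, thresholds=THRESHOLDS):
    """Y = c when tau_c <= Y* < tau_{c+1}, taking tau_0 = -inf and tau_6 = +inf.

    bisect_right counts the thresholds that are <= Y*, which is the category c.
    """
    return bisect_right(thresholds, y_star)

print([dichotomize(v) for v in (-0.3, 0.0, 0.4)])      # [0, 0, 1]
print([categorize(v) for v in (-2.0, -1.5, 0.2, 2.0)])  # [0, 1, 3, 5]
```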
APPLICATION AREAS
Dichotomous/ordinal dependent variable (probit/logistic)
$Y^* = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$
Separates X into 2 parts, $X_1$ and $X_2$, where
- $X_1$ correlates with $\varepsilon$ (the problem!)
- $X_2$ does not correlate with $\varepsilon$
Find $Z = [X_2 \;\; X_3]$ where $C(X_3, \varepsilon) = 0$
$Z$ are IVs
APPLICATION AREAS
Dichotomous/ordinal dependent variable (probit/logistic)
Approaches
Treat Y as if continuous
$Y^* = X_1\beta_1 + X_2\beta_2 + \varepsilon$
- Same procedure as illustrated for usual regression
- Need heteroscedasticity-consistent standard errors
- Suggested for 7 or more ordinal categories or for exploratory research
- Some are highly critical of this approach
APPLICATION AREAS
Dichotomous/ordinal dependent variable (probit/logistic)
Approaches
$Y^* = X_1\beta_1 + X_2\beta_2 + \varepsilon$
Instrumental variable probit/logit method (Lee, 1981; Rivers & Vuong, 1988)
- Use $\hat{X}_1$ in place of $X_1$ in the equation above
- Do probit/logistic
Problems:
1. Standard errors might not be good
2. Scaling differs from original equation
APPLICATION AREAS
Dichotomous/ordinal dependent variable (probit/logistic)
Other Approaches
$Y^* = X_1\beta_1 + X_2\beta_2 + \varepsilon$
- Two-stage conditional probit (Vuong, 1984; Rivers & Vuong, 1988; Smith & Blundell, 1986)
- Polychoric instrumental variables (Bollen & Maydeu-Olivares, 2007)
- Limited evidence on which approach is best
FINDING INSTRUMENTAL VARIABLES
Three main strategies:
(1) Auxiliary Instrumental Variables (AIVs)
(2) Model Implied Instrumental Variables (MIIVs)
(3) Randomization Instrumental Variables (RIVs)
(My classification. These distinctions are usually not made.)
FINDING INSTRUMENTAL VARIABLES
AUXILIARY INSTRUMENTAL VARIABLES (AIVs)
$Y^* = X_1\beta_1 + X_2\beta_2 + \varepsilon$
- $X_1$ correlates with $\varepsilon$ (the problem!)
- $X_2$ does not correlate with $\varepsilon$
- Find $X_3$
Get into trouble and look for a way out
FINDING INSTRUMENTAL VARIABLES
AUXILIARY INSTRUMENTAL VARIABLES (AIVs)
$Y^* = X_1\beta_1 + X_2\beta_2 + \varepsilon$
Find $X_3$ as IVs
- Get into trouble and look for a way out
- You have an endogeneity problem
- Look for IVs that are not part of the original model
- Earlier example on birth weight needed IVs for smoking and drinking during pregnancy
- Suggested pre-pregnancy smoking and drinking, spousal reports on the mother's drinking and smoking, and cigarette & alcohol receipts as possible IVs
FINDING INSTRUMENTAL VARIABLES
AUXILIARY INSTRUMENTAL VARIABLES (AIVs)
Advantages
- Helps permit asymptotically unbiased estimation of effects
- Exact relation of AIV to endogenous variable need not be specified
- If there are more than the minimum number of IVs, overidentification tests are possible
Disadvantages
- Ad hoc selection raises doubts about whether IV conditions are met
- Less systematic thought about the role of the IV in the model
- Tendency to seek just enough IVs to permit estimation
  - Overidentification test of the IVs is then not possible
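FINDING INSTRUMENTAL VARIABLES
MODEL IMPLIED INSTRUMENTAL VARIABLES (MIIVs)
- Approach in Bollen (1996)
- Build an identified model, which implies sufficient instruments
[Path diagram: latent variable $L_1$ (Subjective Air Quality) with indicators $Z_1$ (Overall), $Z_2$ (Clarity), $Z_3$ (Color), $Z_4$ (Odor); errors $\varepsilon_1$ to $\varepsilon_4$]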
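FINDING INSTRUMENTAL VARIABLES
$Z_1 = L_1 + \varepsilon_1$, so $L_1 = Z_1 - \varepsilon_1$
Consider 2nd indicator equation:
$Z_2 = \alpha_2 + \Lambda_{21} L_1 + \varepsilon_2$
$\quad = \alpha_2 + \Lambda_{21}(Z_1 - \varepsilon_1) + \varepsilon_2$
$\quad = \alpha_2 + \Lambda_{21} Z_1 - \Lambda_{21}\varepsilon_1 + \varepsilon_2$
$\mathrm{COV}(Z_1, \varepsilon_1) \neq 0$, so IVs are needed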
FINDING INSTRUMENTAL VARIABLES
Model Implied Instrumental Variables: $Z_2$ equation
$Z_2 = \alpha_2 + \Lambda_{21} Z_1 - \Lambda_{21}\varepsilon_1 + \varepsilon_2$
$\mathrm{COV}(Z_1, \varepsilon_1) \neq 0$, so IVs are needed
IVs must:
(1) correlate with $Z_1$
(2) be uncorrelated with $\varepsilon_1$, $\varepsilon_2$
$Z_3$ & $Z_4$ meet these conditions
FINDING INSTRUMENTAL VARIABLES
MODEL IMPLIED INSTRUMENTAL VARIABLES (MIIVs)
MIIVs found for each equation
- SAS macro: Bollen & Bauer (2004)
- Stata macro: Bauldry (2013)
- R package: Fisher (in progress)
Sources of MIIVs
- Exogenous observed variables
- Multiple indicators
- Sometimes endogenous observed variables
FINDING INSTRUMENTAL VARIABLES
MODEL IMPLIED INSTRUMENTAL VARIABLES (MIIVs)
Advantages
- More sustained effort & thought goes into building the model, rather than a post hoc search for IVs
- Assumptions about variables are explicit in the model
- Overidentification tests check the assumption that all MIIVs are uncorrelated with the equation error
Disadvantages
- Approximate nature of models implies that MIIVs are not exactly uncorrelated with the error
  - Excess power could reject reasonable approximate IVs
- Exactly identified equations have no test for MIIVs
  - Problem shared with AIVs or any method that creates an exactly identified equation
FINDING INSTRUMENTAL VARIABLES
(Quasi) Randomization Instrumental Variables (RIVs)
- Intervention or treatment randomized
  - One group randomly assigned to job training program, others form control group
- Natural experiments (quasi-experiments)
  - Twin births, weather events, random assignments of roommates at college
- "Intention to treat" variable is the IV for the treatment variable
  - Acknowledges difference between assignment and actual treatment
FINDING INSTRUMENTAL VARIABLES
(Quasi) Randomization Instrumental Variables (RIVs)
Advantages
- Randomization or natural experiment makes correlation with omitted variables less likely
- Intention-to-treat variable is highly correlated with actually taking the treatment
- Models can be simpler
Disadvantages
- Assumes that all effects of the intention-to-treat variable go through the treatment variable
  - Job training selection gives hope & confidence to those selected, the opposite for controls
  - Hope & confidence, rather than job training per se, might affect the job search outcome
  - False confidence & decreased motivation to search act as confounders
- Experimental context might not generalize to real world conditions
- Exact identification, so no overidentification tests
EVALUATING INSTRUMENTAL VARIABLES
Three main criteria for IVs:
(1) IVs are uncorrelated with equation error
(2) IVs are associated with $X_1$ (vars that correlate with error)
(3) No perfect collinearity among Zs
EVALUATING INSTRUMENTAL VARIABLES
$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$
[recall $C(X_1, \varepsilon) \neq 0$; $C(X_2, \varepsilon) = 0$]
$Z = [X_2 \;\; X_3]$ where $C(X_3, \varepsilon) = 0$
$Z$ contains IVs
$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$ where $P_Z = Z(Z'Z)^{-1}Z'$
EVALUATING INSTRUMENTAL VARIABLES
(1) IVs are uncorrelated with equation error
Are the IVs uncorrelated with the error [$C(Z, \varepsilon) = 0$]?
Sargan (1958) test:
$T_S = \dfrac{\hat{\varepsilon}'Z(Z'Z)^{-1}Z'\hat{\varepsilon}}{\hat{\varepsilon}'\hat{\varepsilon}/N} \sim \chi^2$
Simple way to calculate:
1) Regress $\hat{\varepsilon}$ on $Z$
2) Get $R^2$
3) Form $T_S = NR^2$
Degrees of freedom = # of IVs above the minimum
- e.g., $X_1$ has 3 vars., $X_3$ has 5, df = 2
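The regress-residuals-on-Z recipe for the Sargan statistic can be sketched as follows. The data are simulated under assumed values (one variable in X1, three instruments in X3, so df = 2), and the function names are hypothetical. One instrument is optionally given a direct effect on Y so that it violates the IV condition.

```python
import numpy as np

def two_sls(X, Z, y):
    """beta_IV = (X' P_Z X)^{-1} X' P_Z y."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

def sargan(X, Z, y):
    """Return (T_S, df), where T_S = N * R^2 from regressing IV residuals on Z."""
    resid = y - X @ two_sls(X, Z, y)
    fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]  # P_Z applied to the residuals
    stat = len(y) * (fitted @ fitted) / (resid @ resid)
    df = Z.shape[1] - X.shape[1]                           # IVs above the minimum
    return stat, df

def make_data(direct_effect, n=2000, seed=3):
    """Hypothetical data; direct_effect != 0 makes the first instrument invalid."""
    rng = np.random.default_rng(seed)
    x3 = rng.normal(size=(n, 3))
    u = rng.normal(size=n)
    x1 = x3 @ np.array([0.6, 0.5, 0.4]) + 0.7 * u + rng.normal(size=n)
    eps = u + rng.normal(size=n)
    y = 1.0 + 1.5 * x1 + direct_effect * x3[:, 0] + eps
    X = np.column_stack([np.ones(n), x1])
    Z = np.column_stack([np.ones(n), x3])
    return X, Z, y

stat_ok, df = sargan(*make_data(0.0))
stat_bad, _ = sargan(*make_data(0.8))
print(f"valid IVs:   T_S = {stat_ok:.2f}, df = {df}")
print(f"invalid IV:  T_S = {stat_bad:.2f}, df = {df}")
```

Each statistic is compared with a chi-square distribution on df degrees of freedom; for df = 2 the 5% critical value is about 5.99.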
EVALUATING INSTRUMENTAL VARIABLES
Are the IVs uncorrelated with the error [$C(Z, \varepsilon) = 0$]?
Sargan (1958) test:
- $H_0$: All IVs uncorrelated with error [$C(Z, \varepsilon) = 0$]
- $H_a$: 1 or more IVs correlate with error [$C(Z, \varepsilon) \neq 0$]
- Rejection means there is a problem with the IVs
  - Does not say which IV is the problem
- Substantive vs. statistical significance: this is a statistical significance test
- Test not applicable if the equation is exactly identified
EVALUATING INSTRUMENTAL VARIABLES
Are the IVs uncorrelated with the error [$C(Z, \varepsilon) = 0$]?
Sargan (1958) test
- Other IV tests available
- Kirby & Bollen (2009, SM) show that Sargan has best performance
- Sargan or Basmann tests widely available
  - Stata, SAS, etc.
EVALUATING INSTRUMENTAL VARIABLES
Three main criteria for IVs:
(1) IVs are uncorrelated with equation error
(2) IVs are associated with $X_1$ (vars that correlate with error)
(3) No perfect collinearity among Zs
  - Check for nonsingular covariance (or correlation) matrix
EVALUATING INSTRUMENTAL VARIABLES
IVs associated with $X_1$ (vars that correlate with error)
Check for WEAK IVs
- Insufficient association increases standard errors
- Problem made worse if there is even a small association of the error with the IV
- Simple regression example (Bound et al., 1995)
  - Show that the IV estimator can be worse than OLS if the IV is weakly correlated with $X_1$ and there is a small correlation of IV and error
EVALUATING INSTRUMENTAL VARIABLES
IVs associated with $X_1$ (vars that correlate with error)
Check for WEAK IVs
- Simple regression diagnostic
  - Check correlation of $Z$ and $X_1$
- Multiple regression diagnostics more complicated
  - Shea (1997) proposes partial $R^2$ measure
  - See Bollen (2012) for review and references
- Growing interest in weak IV diagnostics over the last 20 years
- Tests available, though there is not yet consensus on the best method
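One simple way to gauge instrument strength when X1 contains a single variable is a first-stage regression of X1 on the instruments. The sketch below computes the partial R-squared and the F statistic for the excluded instruments X3, given X2. It is an assumption-laden illustration: the data are simulated, the function names are hypothetical, and this is the basic single-regressor diagnostic, not Shea's (1997) measure for several endogenous variables.

```python
import numpy as np

def rss(A, b):
    """Residual sum of squares from a least-squares fit of b on A."""
    resid = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]
    return resid @ resid

def first_stage_strength(x1, X2, X3):
    """Partial R^2 and F statistic for X3 in the regression of x1 on [X2, X3].

    X2 is assumed to include the constant column.
    """
    full = np.column_stack([X2, X3])
    rss_r, rss_u = rss(X2, x1), rss(full, x1)   # without and with the excluded IVs
    q = X3.shape[1]
    partial_r2 = (rss_r - rss_u) / rss_r
    f_stat = ((rss_r - rss_u) / q) / (rss_u / (len(x1) - full.shape[1]))
    return partial_r2, f_stat

rng = np.random.default_rng(11)
n = 1000
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X3 = rng.normal(size=(n, 2))
noise = rng.normal(size=n)
x1_strong = 0.3 * X2[:, 1] + X3 @ np.array([0.5, 0.5]) + noise    # hypothetical strong IVs
x1_weak = 0.3 * X2[:, 1] + X3 @ np.array([0.02, 0.02]) + noise    # hypothetical weak IVs

r2_strong, f_strong = first_stage_strength(x1_strong, X2, X3)
r2_weak, f_weak = first_stage_strength(x1_weak, X2, X3)
print(f"strong IVs: partial R^2 = {r2_strong:.3f}, F = {f_strong:.1f}")
print(f"weak IVs:   partial R^2 = {r2_weak:.4f}, F = {f_weak:.2f}")
```

A common rule of thumb from the weak-instrument literature, not stated in the slides, treats a first-stage F below roughly 10 as a warning sign.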
EVALUATING INSTRUMENTAL VARIABLES
How many IVs should we use?
- Sometimes we have many more IVs than the minimum needed
- Should we use all available IVs?
- Based on analytic results for special cases & simulation results (e.g., Bollen et al., 2007), my recommendations:
  - Small N: use 1 or 2 more than the required minimum # of IVs
    - E.g., N = 50, $X_1$ has 2 vars, use 3 or 4 IVs from $X_3$
  - Big N: matters less
HETEROGENEOUS CAUSAL EFFECTS
So far, assumed same causal effect for each case:
$Y_i = \alpha + \beta X_i + \varepsilon_{Yi}$
- $\beta$ same for all $i$
Suppose the effect of $X_i$ on $Y_i$ differs by $i$:
$Y_i = \alpha + \beta_i X_i + \varepsilon_{Yi}$
- $\beta_i$ allows effects to differ
HETEROGENEOUS CAUSAL EFFECTS
$Y_i = \alpha + \beta_i X_i + \varepsilon_{Yi}$
IVs for heterogeneous causal effects
- Merging Neyman (1923)-Rubin (1974) potential outcome literature with IV literature
- Much of the literature assumes dichotomous $X_i$
  - Catholic school or not on academic achievement
  - Job training attendance on wages
- IV ($Z_i$) often dichotomous
  - E.g., Angrist (1990): military service ($X_i$) impact on wages, IV is draft-eligible lottery number ($Z_i$)
HETEROGENEOUS CAUSAL EFFECTS
$Y_i = \alpha + \beta_i X_i + \varepsilon_{Yi}$
IVs for heterogeneous causal effects
Intention-to-treat mean effect of $Z_i$ on $Y_i$:
$E(Y_i \mid Z_i = 1) - E(Y_i \mid Z_i = 0)$
IV causal effect of $X_i$ on $Y_i$:
$\dfrac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[X_i \mid Z_i = 1] - E[X_i \mid Z_i = 0]}$
Local Average Treatment Effect (LATE)
- Treatment effect of X for those whose treatment can be changed by Z
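The ratio of mean differences can be illustrated with a small simulation. All values here are assumptions invented for the sketch: half the cases take treatment only when assigned (effect 2.0), 20% always take it (effect 5.0), and 30% never take it. The ratio should recover the effect for the group whose treatment is changed by Z.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n=40000, seed=7):
    """Hypothetical population with heterogeneous treatment effects."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.randint(0, 1)            # randomized assignment (intention to treat)
        r = rng.random()
        if r < 0.5:                      # treatment follows assignment
            x, effect, base = z, 2.0, 0.0
        elif r < 0.7:                    # always treated, regardless of assignment
            x, effect, base = 1, 5.0, 1.0
        else:                            # never treated, regardless of assignment
            x, effect, base = 0, 0.0, -1.0
        y = base + effect * x + rng.gauss(0, 1)
        rows.append((z, x, y))
    return rows

rows = simulate()
y1 = mean([y for z, x, y in rows if z == 1])
y0 = mean([y for z, x, y in rows if z == 0])
x1 = mean([x for z, x, y in rows if z == 1])
x0 = mean([x for z, x, y in rows if z == 0])

itt = y1 - y0                 # E[Y | Z=1] - E[Y | Z=0]
first_stage = x1 - x0         # E[X | Z=1] - E[X | Z=0]
late = itt / first_stage
print(f"ITT = {itt:.3f}, first stage = {first_stage:.3f}, LATE = {late:.3f}")
```

In this setup the intention-to-treat effect is about 1.0 and the share whose treatment responds to Z is about 0.5, so the ratio lands near 2.0 and not near the 5.0 effect of those who are always treated.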
HETEROGENEOUS CAUSAL EFFECTS
$Y_i = \alpha + \beta_i X_i + \varepsilon_{Yi}$
IVs for heterogeneous causal effects
- More assumptions than I have time to go over
- More complicated than models where homogeneous effects are assumed
- Vast developing literature on this approach
INSTRUMENTAL VARIABLES IN PRACTICE
- Varies by discipline and field
- Correlation of error and Xs typically ignored
  - Many sources of correlation present but not treated
  - Say nothing about it and hope others do the same
- When IVs are used, it is common not to apply diagnostics for correlation of IVs with the error or for weak IVs
CONCLUSIONS
- Measurement error, omitted variables, feedback loops, spatial correlation, etc. are common in the social and health sciences
  - Creates correlation of error and covariates
  - Biases usual estimates
  - Problems largely ignored
- Instrumental variables help provide corrected estimates
  - Diagnostic checks available on IVs
  - Widely available in statistical software
- Right to be concerned with current use of IVs, but the bigger problem is that IVs are not used when they could help