Specification Errors, Measurement Errors, Confounding


Kerby Shedden, Department of Statistics, University of Michigan
October 10, 2018

An unobserved covariate

Suppose we have a data generating model of the form

$$y = \alpha + \beta x + \gamma z + \epsilon.$$

The usual conditions $E[\epsilon \mid X = x, Z = z] = 0$ and $\mathrm{var}[\epsilon \mid X = x, Z = z] = \sigma^2$ hold. The covariate $X$ is observed, but $Z$ is not observable. If we regress $y$ on $x$, the model we are fitting differs from the data generating model. What are the implications of this? Does the fitted regression model $\hat y = \hat\alpha + \hat\beta x$ estimate $E[Y \mid X = x]$, and does the MSE $\hat\sigma^2$ estimate $\mathrm{var}[Y \mid X = x]$?

An unobserved independent covariate

The simplest case is where $X$ and $Z$ are independent (and for simplicity $E[Z] = 0$). The slope estimate $\hat\beta$ has the form

$$\hat\beta = \frac{\sum_i y_i (x_i - \bar x)}{\sum_i (x_i - \bar x)^2} = \frac{\sum_i (\alpha + \beta x_i + \gamma z_i + \epsilon_i)(x_i - \bar x)}{\sum_i (x_i - \bar x)^2} = \beta + \gamma \frac{\sum_i z_i (x_i - \bar x)}{\sum_i (x_i - \bar x)^2} + \frac{\sum_i \epsilon_i (x_i - \bar x)}{\sum_i (x_i - \bar x)^2}.$$

By the double expectation theorem, $E[\epsilon \mid X = x] = E_{Z \mid X}\, E[\epsilon \mid X = x, Z] = 0$, and since $Z$ and $X$ are independent,

$$E\Big[\sum_i z_i (x_i - \bar x) \,\Big|\, x\Big] = \sum_i (x_i - \bar x) E[z_i \mid x] = E[Z] \sum_i (x_i - \bar x) = 0.$$

An unobserved independent covariate

Therefore $\hat\beta$ remains unbiased when there is an unmeasured covariate $Z$ that is independent of $X$. What about $\hat\sigma^2$? What does it estimate in this case?

An unobserved independent covariate

The residuals are

$$(I - P)y = (I - P)(\gamma z + \epsilon),$$

so the residual sum of squares is

$$y'(I-P)y = \gamma^2 z'(I-P)z + \epsilon'(I-P)\epsilon + 2\gamma z'(I-P)\epsilon.$$

The expected value is therefore

$$E[y'(I-P)y \mid x] = \gamma^2 \mathrm{var}[Z]\,\mathrm{rank}(I-P) + \sigma^2\,\mathrm{rank}(I-P) = (\gamma^2 \mathrm{var}[Z] + \sigma^2)(n-2).$$

Hence the MSE $\hat\sigma^2$ has expected value $\gamma^2 \mathrm{var}[Z] + \sigma^2$.
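Both conclusions are easy to check by simulation. Below is a minimal sketch (the parameter values are arbitrary choices, not from the slides) showing that the slope estimate stays unbiased while the MSE converges to $\gamma^2 \mathrm{var}[Z] + \sigma^2$ rather than $\sigma^2$.

```python
import numpy as np

# Simulation sketch: Z is independent of X but omitted from the fitted model.
# All parameter values are arbitrary illustrative choices.
rng = np.random.default_rng(0)
alpha, beta, gamma, sigma = 1.0, 2.0, 1.5, 1.0
n, nrep = 200, 2000

beta_hats, mses = [], []
for _ in range(nrep):
    x = rng.normal(size=n)
    z = rng.normal(size=n)              # unobserved, independent of x, var 1
    y = alpha + beta * x + gamma * z + sigma * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    beta_hats.append(coef[1])
    mses.append(resid @ resid / (n - 2))

print(np.mean(beta_hats))  # ~ beta = 2.0 (unbiased)
print(np.mean(mses))       # ~ gamma^2 * var(Z) + sigma^2 = 2.25 + 1 = 3.25
```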

An unobserved independent covariate

Are our inferences correct? We can take $\tilde\epsilon = \gamma z + \epsilon$ to be the error term of the model. Since

$$E[\tilde\epsilon \mid X = x] = 0, \qquad \mathrm{cov}[\tilde\epsilon \mid X = x] = (\gamma^2 \mathrm{var}[Z] + \sigma^2) I \propto I,$$

all the results about estimation of $\beta$ in a correctly-specified model hold in this setting. In general, we may wish to view any unobserved covariate as simply being another source of error, like $\epsilon$. But we will see next that this cannot be done if $Z$ and $X$ are dependent.

Confounding

As above, continue to take the data generating model to be

$$y = \alpha + \beta x + \gamma z + \epsilon,$$

but now suppose that $X$ and $Z$ are correlated. As before, $Z$ is not observed, so our analysis will be based on $Y$ and $X$. A variable such as $Z$ that is associated with both the dependent and independent variables in a regression model is called a confounder.

Confounding

Suppose $X$ and $Z$ are standardized, and $\mathrm{cor}[X, Z] = r$. Further suppose that $E[Z \mid X] = rX$. Due to the linearity of $E[Y \mid X, Z]$:

If $X$ increases by one unit and $Z$ remains fixed, the expected response increases by $\beta$ units.

If $Z$ increases by one unit and $X$ remains fixed, the expected response increases by $\gamma$ units.

However, if we select at random a pair of cases whose $X$ values differ by one unit (without controlling $Z$), their $Z$ values will differ on average by $r$ units. Therefore the expected responses for these two cases differ by $\beta + r\gamma$ units.

Known confounders

Suppose we are mainly interested in the relationship between a particular variable $X$ and an outcome $Y$. A measured confounder is a variable $Z$ that can be measured and included in a regression model along with $X$. A measured confounder generally does not pose a problem for estimating the effect of $X$, unless it is highly collinear with $X$.

For example, suppose we are studying the health effects of second-hand smoke exposure. We measure the health outcome ($Y$) directly. Subjects who smoke themselves are of course at risk for many of the same bad outcomes that may be associated with second-hand smoke exposure. Thus, it would be very important to determine which subjects smoke, and include that information as a covariate (a measured confounder) in a regression model used to assess the effects of second-hand smoke exposure.

Unmeasured confounders

An unmeasured confounder is a variable that we know about, and whose statistical properties we may partly know, but that is not measured in a particular data set. For example, we may know that certain occupations (like working in certain types of factories) may produce risks similar to the risks of exposure to second-hand smoke. If occupation data are not collected in a particular study, occupation is an unmeasured confounder.

Since we do not have data for unmeasured confounders, their omission may produce bias in the estimated effects for the variables of interest. If we have some understanding of how a certain unmeasured confounder operates, we may be able to use a sensitivity analysis to get a rough idea of how much bias is present.

Unknown confounders

An unknown confounder is a variable that affects the outcome of interest, but is unknown to us. For example, there may be genetic factors that produce health effects similar to those of second-hand smoke, but we have no knowledge of which specific genes are involved.

Randomization and confounding

Unknown confounders and unmeasured confounders place major limits on our ability to interpret regression models causally or mechanistically.

Randomization: One way to substantially reduce the risk of confounding is to randomly assign the values of $X$. In this case, there can be no systematic association between $X$ and $Z$, and in large enough samples the actual (sample-level) association between $X$ and $Z$ will be very low, so very little confounding is possible. In small samples, however, randomization cannot guarantee approximate orthogonality against unmeasured confounders.

Confounding

For simplicity, suppose that $Z$ has mean 0 and variance 1, and we use least squares to fit a working model $\hat y = \hat\alpha + \hat\beta x$. We can work out the limiting value of the slope estimate as follows:

$$\hat\beta = \frac{\sum_i y_i (x_i - \bar x)}{\sum_i (x_i - \bar x)^2} = \frac{\sum_i (\alpha + \beta x_i + \gamma z_i + \epsilon_i)(x_i - \bar x)/n}{\sum_i (x_i - \bar x)^2/n} \to \beta + \gamma r.$$

Note that if either $\gamma = 0$ ($Z$ is independent of $Y$ given $X$) or $r = 0$ ($Z$ is uncorrelated with $X$), then $\beta$ is estimated correctly.

Confounding

Since $\hat\beta \to \beta + \gamma r$, and it is easy to show that $\hat\alpha \to \alpha$, the fitted model is approximately

$$\hat y \approx \alpha + \beta x + \gamma r x = \alpha + (\beta + \gamma r) x.$$

How does the fitted model relate to the marginal model $E[Y \mid X]$? It is easy to get

$$E[Y \mid X = x] = \alpha + \beta x + \gamma E[Z \mid X = x],$$

so the fitted regression model agrees with $E[Y \mid X = x]$ as long as $E[Z \mid X = x] = rx$.

Confounding

Turning now to the variance structure of the fitted model, the limiting value of $\hat\sigma^2$ is

$$\hat\sigma^2 = \sum_i (y_i - \hat\alpha - \hat\beta x_i)^2/(n-2) \approx \sum_i (\gamma z_i + \epsilon_i - \gamma r x_i)^2/n \to \sigma^2 + \gamma^2(1 - r^2).$$

Ideally this should estimate the variance function of the marginal model $\mathrm{var}[Y \mid X]$.
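Both limits can be verified numerically. The sketch below (arbitrary illustrative parameters) generates standardized, correlated $X$ and $Z$, omits $Z$ from the fit, and compares $\hat\beta$ and $\hat\sigma^2$ with $\beta + \gamma r$ and $\sigma^2 + \gamma^2(1 - r^2)$.

```python
import numpy as np

# Simulation sketch of the two limits: beta_hat -> beta + gamma*r and
# sigma2_hat -> sigma^2 + gamma^2*(1 - r^2). Parameter values are arbitrary.
rng = np.random.default_rng(1)
beta, gamma, sigma, r = 2.0, 1.5, 1.0, 0.6
n = 200_000

x = rng.normal(size=n)
z = r * x + np.sqrt(1 - r**2) * rng.normal(size=n)  # var(z)=1, cor(x,z)=r
y = beta * x + gamma * z + sigma * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

print(coef[1])                  # ~ beta + gamma*r = 2.9
print(resid @ resid / (n - 2))  # ~ sigma^2 + gamma^2*(1 - r^2) = 2.44
```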

Confounding

By the law of total variation,

$$\mathrm{var}[Y \mid X = x] = E_{Z \mid X = x}\,\mathrm{var}[Y \mid X = x, Z] + \mathrm{var}_{Z \mid X = x}\,E[Y \mid X = x, Z] = \sigma^2 + \mathrm{var}_{Z \mid X = x}(\alpha + \beta x + \gamma Z) = \sigma^2 + \gamma^2 \mathrm{var}[Z \mid X = x].$$

So for $\hat\sigma^2$ to estimate $\mathrm{var}[Y \mid X = x]$ we need $\mathrm{var}[Z \mid X = x] = 1 - r^2$.

The Gaussian case

Suppose

$$Y = \begin{pmatrix} A \\ B \end{pmatrix}$$

is a Gaussian random vector, where $Y \in \mathbb{R}^n$, $A \in \mathbb{R}^q$, and $B \in \mathbb{R}^{n-q}$. Let $\mu = EY$ and $\Sigma = \mathrm{cov}[Y]$. We can partition $\mu$ and $\Sigma$ as

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

where $\mu_1 \in \mathbb{R}^q$, $\mu_2 \in \mathbb{R}^{n-q}$, $\Sigma_{11} \in \mathbb{R}^{q \times q}$, $\Sigma_{12} \in \mathbb{R}^{q \times (n-q)}$, $\Sigma_{22} \in \mathbb{R}^{(n-q) \times (n-q)}$, and $\Sigma_{21} = \Sigma_{12}'$.

The Gaussian case

It is a fact that $A \mid B$ is Gaussian with mean

$$E[A \mid B = b] = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (b - \mu_2)$$

and covariance matrix

$$\mathrm{cov}[A \mid B = b] = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.$$

The Gaussian case

Now we apply these results to our model, taking $X$ and $Z$ to be jointly Gaussian. The mean vector and covariance matrix are

$$E\begin{pmatrix} Z \\ X \end{pmatrix} = 0, \qquad \mathrm{cov}\begin{pmatrix} Z \\ X \end{pmatrix} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},$$

so we get

$$E[Z \mid X = x] = rx, \qquad \mathrm{cov}[Z \mid X = x] = 1 - r^2.$$

These are exactly the conditions stated earlier that guarantee the fitted mean model converges to the marginal regression function $E[Y \mid X]$, and the fitted variance model converges to the marginal regression variance $\mathrm{var}[Y \mid X]$.
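As a quick numerical sanity check of these conditional moments, the sketch below (the values of $r$ and the conditioning point $x_0$ are arbitrary choices) approximates conditioning on $X = x_0$ by windowing simulated draws.

```python
import numpy as np

# Numerical check of E[Z|X=x0] = r*x0 and var[Z|X=x0] = 1 - r^2 for (Z, X)
# jointly Gaussian, standardized, with correlation r.
rng = np.random.default_rng(2)
r, x0 = 0.6, 0.8
Sigma = np.array([[1.0, r],
                  [r, 1.0]])
zx = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=1_000_000)

# Approximate conditioning on X = x0 by keeping draws with X near x0.
sel = np.abs(zx[:, 1] - x0) < 0.02
print(zx[sel, 0].mean())  # ~ r * x0 = 0.48
print(zx[sel, 0].var())   # ~ 1 - r^2 = 0.64
```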

Consequences of confounding

How does the presence of unmeasured confounders affect our ability to interpret regression models?

Population average covariate effect

Suppose we specify a value $x^*$ in the covariate space and randomly select two subjects $i$ and $j$ having $X$ values $X_i = x^* + 1$ and $X_j = x^*$. The inter-individual difference is

$$y_i - y_j = \beta + \gamma(z_i - z_j) + \epsilon_i - \epsilon_j,$$

which has mean value (marginal effect)

$$E[Y_i - Y_j \mid X_i = x^* + 1, X_j = x^*] = \beta + \gamma\big(E[Z \mid X = x^* + 1] - E[Z \mid X = x^*]\big),$$

which agrees with what would be obtained by least squares analysis as long as $E[Z \mid X = x] = rx$.

Population average covariate effect

The variance of $y_i - y_j$ is

$$2\sigma^2 + 2\gamma^2 \mathrm{var}[Z \mid X],$$

which also agrees with the results of least squares analysis as long as $\mathrm{var}[Z \mid X = x] = 1 - r^2$.

Individual treatment effect

Now suppose we match two subjects $i$ and $j$ having $X$ values differing by one unit, and who also have the same value of $Z$. This is what one would expect to see as the pre-treatment and post-treatment measurements following a treatment that changes an individual's $X$ value by one unit, if the treatment does not affect $Z$ (the within-subject treatment effect).

Individual treatment effect

The mean difference (individual treatment effect) is

$$E[Y_i - Y_j \mid X_i = x + 1, X_j = x, Z_i = Z_j] = \beta,$$

and the variance is

$$\mathrm{var}[Y_i - Y_j \mid X_i = x + 1, X_j = x, Z_i = Z_j] = 2\sigma^2.$$

These do not in general agree with the estimates obtained by using least squares to analyze the observable data for $X$ and $Y$. Depending on the sign of $\gamma r$, we may either overstate or understate the individual treatment effect $\beta$, and the population variance of the treatment effect will always be overstated.
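The contrast between the population-average effect $\beta + \gamma r$ and the individual effect $\beta$ can be seen directly in simulation. The sketch below (arbitrary parameters; conditioning is approximated by windowing) compares response differences across pairs that do, and do not, match on $Z$.

```python
import numpy as np

# Sketch (arbitrary parameters): pairs differing by one unit in X have mean
# response difference beta + gamma*r; pairs also matched on Z differ by beta.
rng = np.random.default_rng(6)
beta, gamma, sigma, r, x0 = 2.0, 1.5, 1.0, 0.6, 0.0
n = 2_000_000

x = rng.normal(size=n)
z = r * x + np.sqrt(1 - r**2) * rng.normal(size=n)
y = beta * x + gamma * z + sigma * rng.normal(size=n)

def near(v, t, w=0.05):
    return np.abs(v - t) < w

# Population-average contrast: do not control Z.
d_pop = y[near(x, x0 + 1)].mean() - y[near(x, x0)].mean()
# "Matched" contrast: also require Z near a common value.
d_ind = (y[near(x, x0 + 1) & near(z, 0.3)].mean()
         - y[near(x, x0) & near(z, 0.3)].mean())
print(d_pop)  # ~ beta + gamma*r = 2.9
print(d_ind)  # ~ beta = 2.0
```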

Measurement error for linear models

Suppose the data generating model is $Y = Z\beta + \epsilon$, with the usual linear model assumptions, but we do not observe $Z$. Rather, we observe $X = Z + \tau$, where $\tau$ is a random array of covariate measurement errors with $E[\tau] = 0$. Assuming the first column $X_1 = 1$ is the intercept, it is natural to set the first column of $\tau$ equal to zero.

This is called an errors-in-variables model, or a measurement error model.

Measurement error for linear models

When covariates are measured with error, least squares point estimates may be biased and inferences may be incorrect. Intuitively, it seems that slope estimates should be attenuated (biased toward zero). The reasoning is that as the measurement error grows very large, the observed covariate $X$ becomes equivalent to noise, so the slope estimate should go to zero.

Measurement error for linear models

Let $X$ and $Z$ now represent the $n \times (p+1)$ observed and ideal design matrices. The least squares estimate of the model coefficients is

$$\hat\beta = (X'X)^{-1}X'Y = (Z'Z + Z'\tau + \tau'Z + \tau'\tau)^{-1}(Z'Y + \tau'Y) = (Z'Z/n + Z'\tau/n + \tau'Z/n + \tau'\tau/n)^{-1}(Z'Y/n + \tau'Z\beta/n + \tau'\epsilon/n).$$

We will make the simplifying assumption that the covariate measurement error is uncorrelated with the covariate levels, so $Z'\tau/n \to 0$, and that the covariate measurement error $\tau$ and observation error $\epsilon$ are uncorrelated, so $\tau'\epsilon/n \to 0$.

Measurement error for linear models

Under these circumstances,

$$\hat\beta \approx (Z'Z/n + \tau'\tau/n)^{-1} Z'Y/n.$$

Let $M_z$ be the limiting value of $Z'Z/n$, and let $M_\tau$ be the limiting value of $\tau'\tau/n$. Thus the limit of $\hat\beta$ is

$$(M_z + M_\tau)^{-1} Z'Y/n = (I + M_z^{-1}M_\tau)^{-1} M_z^{-1} Z'Y/n \to (I + M_z^{-1}M_\tau)^{-1}\beta \equiv \beta_0,$$

and hence the limiting bias is

$$\beta_0 - \beta = \big((I + M_z^{-1}M_\tau)^{-1} - I\big)\beta.$$
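In the single-covariate case the limit above reduces to the classical attenuation factor $\mathrm{var}(Z)/(\mathrm{var}(Z) + \mathrm{var}(\tau))$, which the following sketch (arbitrary illustrative parameters) verifies by simulation.

```python
import numpy as np

# Simulation sketch of attenuation in the single-covariate case, where
# (I + Mz^{-1} Mtau)^{-1} beta reduces to beta * var(Z)/(var(Z) + var(tau)).
rng = np.random.default_rng(3)
beta, var_z, var_tau = 2.0, 1.0, 0.5
n = 500_000

z = rng.normal(scale=np.sqrt(var_z), size=n)         # ideal covariate
x = z + rng.normal(scale=np.sqrt(var_tau), size=n)   # observed with error
y = beta * z + rng.normal(size=n)

print(np.polyfit(x, y, 1)[0])            # ~ 1.33 (attenuated)
print(beta * var_z / (var_z + var_tau))  # theoretical limit 1.333...
```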

Measurement error for linear models

What can we say about the bias? Note that the matrix $M_z^{-1}M_\tau$ has non-negative eigenvalues, since it shares its eigenvalues with the positive semi-definite matrix $M_z^{-1/2} M_\tau M_z^{-1/2}$. It follows that all eigenvalues of $I + M_z^{-1}M_\tau$ are greater than or equal to 1, so all eigenvalues of $(I + M_z^{-1}M_\tau)^{-1}$ are less than or equal to 1.

This means that $(I + M_z^{-1}M_\tau)^{-1}$ is a contraction, so $\|\beta_0\| \le \|\beta\|$. Therefore the sum of squares of the fitted slopes is smaller on average than the sum of squares of the actual slopes, due to measurement error.

Types of measurement error

The classical measurement error model $X = Z + \tau$, where $Z$ is the true value and $X$ is the observed value, is the one most commonly considered. Alternatively, in the case of an experiment it may make more sense to use the Berkson error model: $Z = X + \tau$.

For example, suppose we aim to study a chemical reaction when a given concentration $X$ of substrate is present. However, due to our inability to completely control the process, the actual concentration of substrate $Z$ differs randomly from $X$ by an unknown amount $\tau$.

Types of measurement error

You cannot simply rearrange $Z = X + \tau$ to $X = Z - \tau$ and claim that the two situations are equivalent. In the first case, $\tau$ is independent of $Z$ but dependent on $X$. In the second case, $\tau$ is independent of $X$ but dependent on $Z$.
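The asymmetry is easy to see in simulation. In the sketch below (arbitrary parameters), OLS applied to data with classical error gives an attenuated slope, while under Berkson error the slope is estimated without bias, since then $y = \beta x + (\beta\tau + \epsilon)$ with $\tau$ independent of $x$.

```python
import numpy as np

# Simulation sketch contrasting the two error models (arbitrary parameters).
rng = np.random.default_rng(4)
beta, n = 2.0, 500_000

# Classical: X = Z + tau, regress y on x.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.7, size=n)
y = beta * z + rng.normal(size=n)
print(np.polyfit(x, y, 1)[0])   # attenuated: ~ 2/(1 + 0.49) = 1.34

# Berkson: Z = X + tau, regress y on x.
x = rng.normal(size=n)
z = x + rng.normal(scale=0.7, size=n)
y = beta * z + rng.normal(size=n)
print(np.polyfit(x, y, 1)[0])   # ~ beta = 2.0 (unbiased)
```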

SIMEX

SIMEX (simulation-extrapolation) is a relatively straightforward way to adjust for the effects of measurement error if the variances and covariances among the measurement errors can be considered known.

Regress $Y$ on $X + \lambda^{1/2} E$, where $E$ is simulated noise having the same covariance as the assumed measurement error. Denote the coefficient vector of this fit as $\hat\beta_\lambda$. Repeat this for several values of $\lambda \ge 0$, leading to a set of $\hat\beta_\lambda$ vectors.

Ideally, $\hat\beta_{-1}$ would approximate the coefficient estimates under no measurement error. By fitting a line or smooth curve to the $\hat\beta_\lambda$ values (separately for each component of $\beta$), it becomes possible to extrapolate back to $\hat\beta_{-1}$.
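Below is a minimal SIMEX sketch for a single mismeasured covariate, assuming the measurement error variance is known; the parameter names and values are illustrative, and the quadratic extrapolant used here is one common choice of "smooth curve".

```python
import numpy as np

# A minimal SIMEX sketch for simple regression; assumes the measurement
# error variance var_tau is known. All names and values are illustrative.
rng = np.random.default_rng(5)
beta, var_tau, n = 2.0, 0.5, 200_000

z = rng.normal(size=n)                                # true covariate
x = z + rng.normal(scale=np.sqrt(var_tau), size=n)    # observed covariate
y = beta * z + rng.normal(size=n)

lambdas = np.linspace(0.0, 2.0, 9)
slopes = []
for lam in lambdas:
    # Extra noise with variance lam*var_tau gives total error variance
    # (1 + lam)*var_tau; average replications to smooth simulation noise.
    reps = [np.polyfit(x + rng.normal(scale=np.sqrt(lam * var_tau), size=n),
                       y, 1)[0]
            for _ in range(10)]
    slopes.append(np.mean(reps))

# Fit a smooth curve in lambda and extrapolate to lambda = -1 (zero error).
quad = np.polyfit(lambdas, slopes, 2)
print(slopes[0])               # naive estimate: ~ 1.33
print(np.polyval(quad, -1.0))  # ~ 1.8, closer to beta = 2; a quadratic
                               # extrapolant is only approximate here
```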