Measurement Error in Covariates

1 Measurement Error in Covariates. Raymond J. Carroll, Department of Statistics, Faculty of Nutrition, Institute for Applied Mathematics and Computational Science, Texas A&M University

2 My Goal Today: Introduce the basic ideas. Effects of measurement error. Data needed. Simplest models.

3 Later Lectures: Measurement error analysis: regression calibration, SIMEX, corrected scores, maximum likelihood, Bayes.

4 Today: Start thinking about when a measurement error analysis is a good thing.

5 Big Theme: We always have a risk model relating disease to true exposure. We always have a measurement error model relating observed exposure and true exposure. The literature is split into methods that have an exposure model (structural) versus those that have no exposure model (functional).

6 Part 1: Overview

7 The Triple Whammy of Measurement Error: (1) Bias in parameter estimation, which with multiple covariates can lead to invalid hypothesis tests. (2) Loss of power for detecting signals. (3) Masking of the features of the data: measurement error causes the model of the observed data to change.

8 Notation: Response: Y. True covariates subject to measurement error: X. Covariates measured exactly: Z. Observed proxies for X: W.

9 Emphasis: I will focus on continuous exposures. Measurement error with a categorical exposure X subject to misclassification is an even harder topic. In those problems, the real issue is to get data to estimate the sensitivity and specificity of the surrogate exposure W, e.g., $\Pr(W = 1 \mid X = 1)$ and $\Pr(W = 0 \mid X = 0)$. The impact of misclassification is often profound.

10 Nondifferential Measurement Error: Definition: If you could observe X, you would not bother collecting W. Technically, the proxies W are statistically independent of Y given (X, Z). All my lectures assume nondifferential measurement error. The analysis of data subject to differential measurement error is a very different subject.

11 Simple Linear Regression: The linear regression model is $Y = \beta_0 + \beta_1 X + \epsilon$, with $\epsilon \sim \mathrm{Normal}(0, \sigma_\epsilon^2)$. The interest here is in estimating $\beta_1$. We also want to form confidence intervals for $\beta_1$. We also want to test the null hypothesis that $\beta_1 = 0$.

12 Classical Measurement Error: The classical measurement error model is that instead of observing X, we observe W, where $W = X + U$, with $U \sim \mathrm{Normal}(0, \sigma_u^2)$. We will shortly go into more detail, but the first two parts of the Triple Whammy can best be seen graphically.

13 Simulation Study: Generate $X_1, \ldots, X_{50}$, which are Normal(0, 1). Generate $Y_i = \beta_0 + \beta_x X_i + \epsilon_i$, with $\epsilon_i \sim \mathrm{Normal}(0, 1/9)$, $\beta_0 = 0$, $\beta_x = 1$. Generate $U_1, \ldots, U_{50} \sim \mathrm{Normal}(0, 1)$. Set $W_i = X_i + U_i$. Regress Y on X and Y on W and compare.
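
The simulation described on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the author's code: the function name `simulate_slopes`, the seed, and the use of NumPy's `polyfit` for the least-squares fits are assumptions made here.

```python
import numpy as np

def simulate_slopes(n=50, beta0=0.0, beta_x=1.0, var_eps=1.0 / 9.0,
                    var_u=1.0, seed=0):
    """Return (slope of Y on X, slope of Y on W) for one simulated data set."""
    rng = np.random.default_rng(seed)         # seed is an arbitrary choice
    x = rng.normal(0.0, 1.0, size=n)          # true covariate, Normal(0, 1)
    eps = rng.normal(0.0, np.sqrt(var_eps), size=n)
    y = beta0 + beta_x * x + eps              # response generated from true X
    u = rng.normal(0.0, np.sqrt(var_u), size=n)
    w = x + u                                 # classical error: W = X + U
    slope_true = np.polyfit(x, y, 1)[0]       # least-squares slope, Y on X
    slope_naive = np.polyfit(w, y, 1)[0]      # least-squares slope, Y on W
    return slope_true, slope_naive

slope_true, slope_naive = simulate_slopes()
print(f"Y on X: {slope_true:.2f}   Y on W: {slope_naive:.2f}")
```

With these settings var(X) = var(U) = 1, so in large samples the slope of Y on W should settle near 0.5 rather than 1. With only 50 points, a single run will scatter around that value.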

14 Effects of Measurement Error [Figure: True Data Without Measurement Error; legend: Reliable Data]

15 Effects of Measurement Error [Figure: Observed Data With Measurement Error; legend: Error-prone Data, Reliable Data]

16 Linear Regression: We saw bias: the slope is too small in absolute value. We saw excess variance: the variability about the line is much larger. Of course, excess variance = loss of power.

17 [Figure: Sample Size for 80% Power; sample size plotted against measurement error variance.] True slope $\beta_x =$ [value not recoverable]. Variances $\sigma_x^2 = \sigma_\epsilon^2 = 1$.

18 Loss of Features: In many problems, we might have a more complex regression model. This could include a change point or a threshold. Or it could include a nonlinear regression: $Y = \sin(X) + \epsilon$, $X \sim \mathrm{Uniform}(0, 4\pi)$, $U \sim \mathrm{Normal}\{0, \mathrm{var}(X)/2\}$, $\sigma_\epsilon^2 = 0.09$.
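
The sine example can be simulated to see how much of the feature survives the measurement error. The sketch below is an illustration under stated assumptions: var(X) is taken to be the population variance of Uniform(0, 4π), and the correlation of Y with sin(X) versus sin(W) is used here as a crude summary of how visible the sine shape remains. The function name `sine_correlations` is hypothetical.

```python
import numpy as np

def sine_correlations(n=100_000, seed=0):
    """Correlation of Y with sin(X) and with sin(W) in the sine example."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 4.0 * np.pi, size=n)
    var_x = (4.0 * np.pi) ** 2 / 12.0            # variance of Uniform(0, 4*pi)
    u = rng.normal(0.0, np.sqrt(var_x / 2.0), size=n)
    w = x + u                                    # classical error: W = X + U
    y = np.sin(x) + rng.normal(0.0, np.sqrt(0.09), size=n)
    corr_x = np.corrcoef(y, np.sin(x))[0, 1]     # sine shape with X observed
    corr_w = np.corrcoef(y, np.sin(w))[0, 1]     # sine shape with only W observed
    return corr_x, corr_w

corr_x, corr_w = sine_correlations()
print(f"corr(Y, sin X) = {corr_x:.2f}   corr(Y, sin W) = {corr_w:.2f}")
```

A strong correlation with sin(X) alongside a near-zero correlation with sin(W) is one way to see that the sinusoidal feature is essentially masked once only W is available.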

19 [Figure: Y = Sine(X) + Noise, with X observed]

20 [Figure: Y = Sine(X) + Noise, with X not observed]

21 [Figure: Y = Sine(X) + Noise, with X observed (blue) or not (red)]

22 Examples of Measurement Error Problems: Measuring usual nutrient intake. Measuring systolic blood pressure. Measuring radiation dose from the Nevada Test Site or in Chernobyl. Measuring exposure to arsenic in drinking water, dust in the workplace, radon gas in the home, and other environmental hazards.

23 Measures Of Nutrient Intake: Response: Y = average daily % calories from fat by a Food Frequency Questionnaire (FFQ). True unobserved predictor: X = true long-term average daily percentage of calories from fat. Regression model: Assume $Y = \beta_0 + \beta_x X + \epsilon$. Measurement error: X is never observable.

24 Measures Of Nutrient Intake, cont.: Surrogate for X: On 6 days over the course of a year, women are interviewed by phone and asked to recall their food intake over the past 24 hours (24-hour recalls). Their average is recorded and denoted by W. The analysis of 24-hour recalls introduces some error = analysis error. Measurement error = sampling error + analysis error. Classical measurement error model: $W_i = X_i + U_i$, where the $U_i$ are measurement errors.

25 Discussion: I will do "theory" for linear regression, because the calculations are exact. Everything, though, applies to logistic regression, survival analysis, Poisson regression, etc.

26 General Structure Of A Measurement Error Problem: Y = response, Z = error-free predictor, X = error-prone predictor. Outcome model: $E(Y \mid Z, X) = f(Z, X, \beta)$. Observed data: (Y, Z, W). Model misspecification: $E(Y \mid Z, W) \neq f(Z, W, \beta)$. Measurement error model: relating W and X. Classical model: $W = X + U$.

27 Theory Behind The Pictures: The Naive Analysis. Attenuation factor, a.k.a. reliability ratio: $\lambda = \mathrm{var}(X)/\mathrm{var}(W) = \sigma_x^2/(\sigma_x^2 + \sigma_u^2) = \sigma_x^2/\sigma_w^2$. Note how $\lambda \le 1$.

28 Theory Behind The Pictures: The Naive Analysis, cont. The observed data model is a linear regression. Intercept: $\beta_0 + (1 - \lambda)\beta_x \mu_x$. Slope: $\lambda \beta_x$. The slope is biased, too small. Residual variance: $\sigma_\epsilon^2 + (1 - \lambda)\beta_x^2 \sigma_x^2$. The variance is increased.
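
As a small worked example, the naive-analysis formulas can be evaluated for the settings of the earlier simulation ($\beta_0 = 0$, $\beta_x = 1$, $\sigma_x^2 = \sigma_u^2 = 1$, $\sigma_\epsilon^2 = 1/9$, and $\mu_x = 0$). The helper name `naive_regression_limits` is hypothetical. It simply codes the intercept, slope, and residual variance expressions stated on the slides.

```python
def naive_regression_limits(beta0, beta_x, mu_x, var_x, var_u, var_eps):
    """Large-sample parameters of the regression of Y on W under W = X + U."""
    lam = var_x / (var_x + var_u)                      # attenuation factor
    intercept = beta0 + (1.0 - lam) * beta_x * mu_x    # naive intercept
    slope = lam * beta_x                               # attenuated slope
    resid_var = var_eps + (1.0 - lam) * beta_x ** 2 * var_x  # inflated variance
    return lam, intercept, slope, resid_var

lam, intercept, slope, resid_var = naive_regression_limits(
    beta0=0.0, beta_x=1.0, mu_x=0.0, var_x=1.0, var_u=1.0, var_eps=1.0 / 9.0)
print(lam, intercept, slope, resid_var)
```

Here the attenuation factor is 0.5, so the slope is halved, and the residual variance rises from 1/9 to 1/9 + 1/2.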

29 Implications For Testing Hypotheses: The observed data slope is $\lambda \beta_x$. Because $\beta_x = 0$ iff $\lambda \beta_x = 0$, it follows that $[H_0: \beta_x = 0] \Longleftrightarrow [H_0: \lambda \beta_x = 0]$, so the naive test of $\beta_x = 0$ is valid (correct Type I error rate). Because the residual variance is increased, the power is decreased.

30 Multiple Linear Regression: Triple Whammy: bias, increased variance and loss of power, loss of features. In multiple linear regression, the bias can take unusual forms. This can lead to invalidity of hypothesis testing.

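31 Multiple Linear Regression With Error: Model: $Y = \beta_0 + \beta_z^T Z + \beta_x^T X + \epsilon$, $W = X + U$. Bias: Regressing Y on Z and W estimates $\begin{pmatrix} \beta_z^* \\ \beta_x^* \end{pmatrix} = \Lambda \begin{pmatrix} \beta_z \\ \beta_x \end{pmatrix} \neq \begin{pmatrix} \beta_z \\ \beta_x \end{pmatrix}$.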

32 Multiple Linear Regression With Error, cont.: $\Lambda$ is the attenuation matrix or reliability matrix: $\Lambda = \begin{pmatrix} \sigma_{zz} & \sigma_{zx} \\ \sigma_{xz} & \sigma_{xx} + \sigma_{uu} \end{pmatrix}^{-1} \begin{pmatrix} \sigma_{zz} & \sigma_{zx} \\ \sigma_{xz} & \sigma_{xx} \end{pmatrix}$. The terms $\sigma_{zz}$, etc. are now covariance matrices. Biases can take many forms, including reversal of signs! Global null test is OK: the naive test of $H_0: \beta_x = 0$ and $\beta_z = 0$ is valid. No effect for any component of X is OK: the naive test of $H_0: \beta_x = 0$ is valid.
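
The reliability matrix is easy to compute numerically. The sketch below assumes the block form given on this slide. The function name `reliability_matrix` and the numerical inputs (scalar Z and X with unit variances, covariance 0.5, and error variance 1) are hypothetical values chosen only for illustration.

```python
import numpy as np

def reliability_matrix(sigma_zz, sigma_zx, sigma_xx, sigma_uu):
    """Lambda = cov(Z, W)^{-1} cov(Z, X), built from covariance blocks.

    Scalars are accepted; matrix inputs are expected to be 2-D arrays.
    """
    s_zz = np.atleast_2d(np.asarray(sigma_zz, dtype=float))
    s_zx = np.atleast_2d(np.asarray(sigma_zx, dtype=float))
    s_xx = np.atleast_2d(np.asarray(sigma_xx, dtype=float))
    s_uu = np.atleast_2d(np.asarray(sigma_uu, dtype=float))
    true_cov = np.block([[s_zz, s_zx], [s_zx.T, s_xx]])         # cov of (Z, X)
    obs_cov = np.block([[s_zz, s_zx], [s_zx.T, s_xx + s_uu]])   # cov of (Z, W)
    return np.linalg.solve(obs_cov, true_cov)

# Hypothetical inputs: Z has no true effect, X has slope 1.
lam_mat = reliability_matrix(1.0, 0.5, 1.0, 1.0)
beta = np.array([0.0, 1.0])        # (beta_z, beta_x)
naive = lam_mat @ beta             # what the naive regression estimates
print(lam_mat)
print(naive)
```

In this illustration the naive coefficient of Z is nonzero even though the true coefficient of Z is 0, while the coefficient of X is attenuated.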

33 What Naive Tests are Invalid? Tests for Z: surprisingly, tests about the variables Z measured without error are not valid. The exception is when X and Z are independent. When X is multivariate: naive tests for individual components of X are not valid.

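34 Bivariate X and Bias: Example: One component of X has no effect: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$, $\beta_2 = 0$. Correlation: The Xs are negatively correlated: $\mathrm{cov}(X_1, X_2) =$ [matrix entries not recoverable].

35 Bivariate X and Bias: $\mathrm{cov}(X_1, X_2) =$ [matrix entries not recoverable]. Measurement errors are positively correlated: $W_1 = X_1 + U_1$, $W_2 = X_2 + U_2$, $\mathrm{cov}(U_1, U_2) =$ [matrix entries not recoverable].

36 Bivariate X and Bias: $\mathrm{cov}(X_1, X_2) =$ [matrix entries not recoverable], $\mathrm{cov}(U_1, U_2) =$ [matrix entries not recoverable]. Reliability matrix: $\Lambda =$ [matrix entries not recoverable].

37 Bivariate X and Bias: $\mathrm{cov}(X_1, X_2)$, $\mathrm{cov}(U_1, U_2)$, and $\Lambda$ as on the previous slide [matrix entries not recoverable]. True $\beta = (1, 0)^T$. Observed $\beta = (0.5, 0.25)^T$: naive test invalid!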

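38 Multiple Linear Regression With Error: For X scalar, the attenuation factor in $\beta_x$ is $\lambda_1 = \sigma_{x|z}^2/(\sigma_{x|z}^2 + \sigma_u^2)$, where $\sigma_{x|z}^2$ = residual variance in the regression of X on Z. Since $\sigma_{x|z}^2 \le \sigma_x^2$, it follows that $\lambda_1 = \sigma_{x|z}^2/(\sigma_{x|z}^2 + \sigma_u^2) \le \sigma_x^2/(\sigma_x^2 + \sigma_u^2) = \lambda$. Hence collinearity accentuates attenuation.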

39 Bias for Inferences About Error-Free Covariates (when X and Z are related): Regression of X on Z: $E(X \mid Z) = \Gamma_1 + \Gamma_z^T Z$. Effects of error: You do not estimate $\beta_z$, but instead you estimate $\beta_z^* = \beta_z + (1 - \lambda_1)\beta_x \Gamma_z$.
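
The bias formula for the error-free covariate can be checked by simulation. The sketch below assumes scalar Z and X, normal errors, and hypothetical parameter values. It compares the stated limits ($\lambda_1 \beta_x$ for the slope on W, and $\beta_z + (1 - \lambda_1)\beta_x \Gamma_z$ for the coefficient of Z) with a naive least-squares fit of Y on (Z, W). The function names are illustrative only.

```python
import numpy as np

def naive_limits_with_z(beta_z, beta_x, gamma_z, var_x_given_z, var_u):
    """Closed-form limits of the naive coefficients for scalar Z and X."""
    lam1 = var_x_given_z / (var_x_given_z + var_u)
    beta_z_star = beta_z + (1.0 - lam1) * beta_x * gamma_z
    beta_x_star = lam1 * beta_x
    return lam1, beta_z_star, beta_x_star

def simulate_naive_fit(n, beta_z, beta_x, gamma_z, var_x_given_z, var_u,
                       var_eps=1.0, seed=0):
    """Fit Y on (1, Z, W) by least squares; return the Z and W coefficients."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = gamma_z * z + rng.normal(0.0, np.sqrt(var_x_given_z), size=n)
    w = x + rng.normal(0.0, np.sqrt(var_u), size=n)      # W = X + U
    y = beta_z * z + beta_x * x + rng.normal(0.0, np.sqrt(var_eps), size=n)
    design = np.column_stack([np.ones(n), z, w])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1], coef[2]

# Hypothetical setting: no true Z effect, but X depends on Z.
params = dict(beta_z=0.0, beta_x=1.0, gamma_z=0.5, var_x_given_z=0.75, var_u=1.0)
print(naive_limits_with_z(**params))
print(simulate_naive_fit(n=100_000, **params))
```

Even with $\beta_z = 0$, the naive fit reports a clearly nonzero coefficient for Z, which is the mechanism behind the invalid tests for error-free covariates.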

40 Analysis Of Covariance: These results have implications for the two-group ANCOVA. True predictor: X. Group assignment: Z = dummy indicator of group 1, say. Imbalance: If X has a different mean for the two groups, then the estimated effect of Z is biased. Illustration: The next slides illustrate that even when there is no Z effect in truth, the observed data may indicate, falsely, that there is an apparent effect.

41 [Figure: 2-Group ANCOVA, True X Data. Note no effect.]

42 [Figure: 2-Group ANCOVA, Observed W Data. Note apparent effect.]

43 Part 2: Effects of Corrections for Measurement Error

44 What can a Measurement Error Analysis Do? Response: Y. True covariate: X. Surrogate: W. Other exactly measured covariates: Z.

45 What can a Measurement Error Analysis Do? With a single exposure and classical measurement error, the usual effect is that the observed data underestimate the relative risks, sometimes profoundly. Measurement error analysis can correct this underestimation.

46 What else can a Measurement Error Analysis Do? We have seen that there are cases in which hypothesis tests that ignore measurement error are invalid. A measurement error analysis can result in valid tests with real Type I errors of near 5%.

47 What can a Measurement Error Analysis Not Do? A measurement error analysis cannot ever be as statistically efficient and powerful as an analysis in which true exposure X is observed. Measurement error lowers power, and no fancy analysis can alleviate this fact.

48 Prices of a Measurement Error Analysis: Somewhere, somehow, you have to provide a model for the relationship of the true exposure X and the surrogate exposure W, even though you do not observe X. Usually, this means that you need additional data to get at the measurement error model. This requires planning, and costs more.

49 Prices of a Measurement Error Analysis: Almost without exception, for a variable measured with error, a measurement error analysis leads to increased variability in the estimate of risk.

50 The NIH-AARP Diet and Health Study: Survival analysis of colorectal cancer (Y). True exposures are X, consisting of usual intake of energy and the usual Healthy Eating Index 2005 (HEI-2005) total score. Other variables Z measured "exactly" were a long list (age group, etc.).

51 The NIH-AARP Diet and Health Study: $n \approx 220{,}000$. Instead of usual energy and HEI-2005, we have them (mis)measured by a food frequency questionnaire. Also, in a sub-study, we have 1,000 people who each contributed two 24-hour recalls. I will not go into the entire complex modeling process.

52 The NIH-AARP Diet and Health Study: Men. Using the FFQ: log relative risk estimate = 0.33, standard error = 0.07. Measurement error analysis: log relative risk estimate = 0.45 (greater in absolute value), standard error = 0.09 (larger standard error).

53 The NIH-AARP Diet and Health Study: Women. Using the FFQ: log relative risk estimate = 0.22, standard error = 0.09, p = 0.02. Measurement error analysis: log relative risk estimate = 0.49 (greater in absolute value), standard error = 0.16 (larger standard error), p = 0.00.

54 The NIH-AARP Diet and Health Study: Women. The lower p-values for the MEM analysis for women can happen. True exposure X is bivariate, and while the components (energy, HEI-2005) are not highly correlated, they are correlated nonetheless.

55 Part 3: Needed Data for a Measurement Error Analysis

56 Overview of Classical Models: Response: Y. True covariate: X. Surrogate: W. Other exactly measured covariates: Z.

57 Conundrum: The classical model says that $W = X + U$, $U \sim \mathrm{Normal}(0, \sigma_u^2)$. In general, the measurement error variance $\sigma_u^2$ cannot be estimated from just (Y, W, Z) data. Question: what data are needed to estimate $\sigma_u^2$?

58 Solution #1: Validation Data. At least in principle, in some cases, one can effectively observe X in a sub-study. This is called a validation study. Validation studies are beautiful things. They are rare, especially if X is a long-term exposure. Of course, if such data exist, $\sigma_u^2 = \mathrm{var}(W - X)$.

59 Solution #1: Validation Data. Validation data, which include X, also allow us to estimate the distribution of true exposure. They also allow us to understand whether the classical error model actually holds! Validation study data are really data with missing values in X, although they are not typical missing data problems because most of the Xs are missing.

60 Solution #2: Replication Data. In many cases, it is possible to observe replicated W data. Thus, for the i-th person, we observe $(W_{i1}, \ldots, W_{im})$ with $W_{ij} = X_i + U_{ij}$. Replication data allow easy estimation of $\sigma_u^2$ through ANOVA calculations. They also allow data checking to see if the additive model with homoscedastic error holds (details not given, this is an overview).
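
The ANOVA calculation mentioned here can be sketched as follows. This is one standard way to do it, assuming a balanced design (every person has the same number m of replicates) and the additive model $W_{ij} = X_i + U_{ij}$. The function name `estimate_variances` and the simulated inputs are hypothetical.

```python
import numpy as np

def estimate_variances(w):
    """Estimate var(U) and var(X) from an n-by-m array of replicates."""
    w = np.asarray(w, dtype=float)
    n, m = w.shape
    person_means = w.mean(axis=1)
    # Pooled within-person variance estimates the error variance.
    within_ss = ((w - person_means[:, None]) ** 2).sum()
    var_u_hat = within_ss / (n * (m - 1))
    # var(mean of m replicates) = var(X) + var(U) / m, so subtract the error part.
    var_x_hat = person_means.var(ddof=1) - var_u_hat / m
    return var_u_hat, var_x_hat

# Simulated check with var(X) = var(U) = 1 and m = 2 replicates.
rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 1))
w = x + rng.normal(size=(5000, 2))
print(estimate_variances(w))
```

In small samples the moment estimate of var(X) can come out negative, so it is usually worth inspecting both estimates before plugging them into a correction.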

61 Solution #2: Replication Data. Replicated biomarkers or 24-hour recalls. Replicated blood pressure measurements. Replicated monitoring equipment.

62 Solution #2: Replication Data. There are subtleties with replication data. There is debate as to whether they measure long-term exposure unbiasedly, or short-term exposure only. Everyone agrees that replication data are a good thing.

63 Solution #3: Instrumental Variables. Often forgotten, but widely used in econometrics. These are variables T which have the following properties (hopefully): They are correlated with true exposure X. They are nondifferential. They are independent of the measurement error U. Convincing oneself (or referees) that T is a proper instrument is hard, because it cannot be verified numerically.

64 Part 4: Pure Berkson Error

65 Overview: In radiation epidemiology, and presumably also occupational epidemiology, the calculated exposure comes from two sources: error-prone estimates from an individual, and error-prone estimates assigned to groups with similar characteristics. The second kind of error-prone estimate is generally designated as Berkson measurement error. I want to introduce the Berkson error model.

66 The Berkson Model and the Nevada Test Site: Genesis: In the 1950s, the U.S. did above-ground nuclear testing. At least twice, they set off atomic bombs when the winds were high and in the direction of Utah. The radiation fell on the ground. Cows ate the grass from the ground. Kids drank the milk.

67 The Berkson Model and the Nevada Test Site: Concerns about radiation-caused thyroid disease for these "down-winders" led to a major epidemiologic study at the University of Utah. Radiation and biological experts, including NCI's Andre Bouville, built a dosimetry system based on physical transport models. Every person with certain shared characteristics was assigned the same dose, along with a value for the uncertainty.

68 The Berkson Model and the Nevada Test Site: Example: all girls aged 6 living in Washington County who got their milk from their own cows and drank 3 glasses of milk per day were assigned the same dose and an uncertainty. Example: all boys aged 3 living in Lincoln County who got their milk from stores and drank 2 glasses of milk per day were assigned the same dose and an uncertainty. In real life, classical errors come from estimates of the amount of milk drunk.

69 The Berkson Model and the Nevada Test Site: Thought experiment: Ignore the uncertainty (measurement error) in milk consumption. From the dosimetry system, each individual gets an assigned/calculated dose W. Crucial point: Children with the same characteristics are given the same assigned dose W. Direct measurements of thyroid exposure for the individuals were not done.

70 The Berkson Model and the Nevada Test Site: Fact: Two people with similar characteristics might get the same assigned dose W. However, their true radiation exposures X would be different. Model (in log scale): $X = W + U$, where U is the assigned uncertainty. This is the Berkson measurement error model.

71 The Berkson Model: The classical Berkson model says that True Exposure = Assigned Exposure + Mean Zero Error. In symbols: $X = W + U_{berk}$ (or $X = W U_{berk}$). Assumption: W and $U_{berk}$ are independent, and $E(U_{berk}) = 0$ (additive error) or $E(U_{berk}) = 1$ (multiplicative error), so that $E(X \mid W) = W$. Compare with the classical measurement error model, where $W = X + U$ and $E(X \mid W) = \lambda W + (1 - \lambda)\mu_x$.

72 The Berkson Model: From the previous page, $X = W + U_{berk}$. In the linear regression model, $Y = \beta_0 + \beta_x X + \epsilon$. Substituting, $Y = \beta_0 + \beta_x (W + U_{berk}) + \epsilon = \beta_0 + \beta_x W + (\beta_x U_{berk} + \epsilon)$. No bias: The slope of the regression of Y on W is $\beta_x$! Increased variance/loss of power: However, the variance of the regression in W is increased: it is $\mathrm{var}(\epsilon) + \beta_x^2 \mathrm{var}(U_{berk})$.
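
The two Berkson conclusions (no slope bias, larger residual variance) can be checked with a short simulation. The sketch below is an illustration under assumed values: assigned doses W are drawn from Normal(0, 1), $\mathrm{var}(U_{berk}) = 1$, $\beta_x = 1$, and $\mathrm{var}(\epsilon) = 1/9$. The function name `berkson_fit` is hypothetical.

```python
import numpy as np

def berkson_fit(n=100_000, beta0=0.0, beta_x=1.0, var_eps=1.0 / 9.0,
                var_berk=1.0, seed=0):
    """Regress Y on the assigned dose W under X = W + U_berk."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)                                   # assigned dose
    x = w + rng.normal(0.0, np.sqrt(var_berk), size=n)       # true exposure
    y = beta0 + beta_x * x + rng.normal(0.0, np.sqrt(var_eps), size=n)
    slope, intercept = np.polyfit(w, y, 1)                   # least squares, Y on W
    resid = y - (intercept + slope * w)
    resid_var = resid.var(ddof=2)                            # two fitted parameters
    return slope, resid_var

slope, resid_var = berkson_fit()
print(f"slope = {slope:.3f}   residual variance = {resid_var:.3f}")
```

Under these assumptions the slope should sit close to $\beta_x = 1$, while the residual variance should be close to 1/9 + 1 instead of 1/9.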
