Working with Stata: Inference on the mean
Nicola Orsini, Biostatistics Team, Department of Public Health Sciences, Karolinska Institutet

Motivating example
Dataset: hyponatremia.dta
Outcome: serum sodium concentration, mmol/liter

Descriptive abstract: Hyponatremia has emerged as an important cause of race-related death and life-threatening illness among marathon runners. We studied a cohort of marathon runners to estimate the incidence of hyponatremia and to identify the principal risk factors.

Hyponatremia among Runners in the Boston Marathon. New England Journal of Medicine, 2005, Volume 352:1550-1556. 2

Arithmetic mean
Suppose we pick a sample of 5 observations of serum sodium concentration (mmol/liter):

             na        mean        error         res
  1.   137.5658   138.52805   -.96230469    .9260303
  2.     132.62   138.52805    -5.908075    34.90535
  3.   139.9454   138.52805     1.417395    2.009009
  4.   139.9312   138.52805    1.4031433    1.968811
  5.   142.5779   138.52805    4.0498413    16.40121
Sum    692.6403                 2.842e-14    56.21041

Arithmetic mean = sum of the values divided by the number of observations = 692.6403/5 = 138.5 mmol/liter 3
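The same numbers can be reproduced in Stata; a minimal sketch, where na5 is an illustrative variable name holding the five values above:

clear
input na5
137.5658
132.62
139.9454
139.9312
142.5779
end
summarize na5            // reports the mean, about 138.528
display r(sum)/r(N)      // sum of the values divided by the number of observations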

Central, Gaussian, or Normal distribution

f(y | μ, σ²) = (1 / (σ√(2π))) · exp( −(1/2) · ((y − μ)/σ)² )

μ = mean
σ = standard deviation

The distribution of the continuous random variable y is characterized by the parameters μ and σ. 4

[Figure: normal density centered at µ, horizontal axis marked from −4σ to +4σ] 5
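A curve like the one in the figure can be drawn with Stata's normalden() function; a minimal sketch (the axis titles are illustrative):

twoway function y = normalden(x), range(-4 4) ///
    xtitle("Standard deviations from the mean") ytitle("Density")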

Historical importance of Gauss's result
Gauss (1809) proved that the condition (maximum likelihood estimate of location) = (arithmetic mean of the observations) uniquely determines the normal distribution for the observations (independent, identically distributed). 6

We are trying to estimate a location parameter μ, and our data consist of n observations D = {y1, y2, ..., yn}. Our model is

y_i = μ + e_i,   1 ≤ i ≤ n

where e_i is the actual error in the i-th measurement. 7

If we assign an independent Gaussian distribution to the errors e_i = y_i − μ, then

p(D | μ, σ²) = (1/(2πσ²))^(n/2) · exp( −Σ (y_i − μ)² / (2σ²) )

Only the first two moments of the data (the sample means of e and e²) are going to be used for inferences about the location parameter μ. 8

When we assign an independent Gaussian distribution to the errors, what we achieve is not that the error frequencies are correctly represented, but that those frequencies are made irrelevant to the inference, in two respects:
1) All other aspects of the noise beyond the sample means of e and e² contribute nothing to the numerical value or the accuracy of our estimate.
2) Our estimate is more accurate than that from any other distribution that estimates a location parameter by a linear combination of the observations, because it has the maximum possible error cancellation.
Jaynes ET. Probability theory: The logic of science. Cambridge University Press, 2003. Chapter 7, page 214. 9

Simulations
1. Fix a sample size n
2. Draw i.i.d. observations y_i from a non-normal χ²(3) distribution
3. Estimate the mean of y_i in the sample
Repeat Steps 1 to 3 a large number of times, for example s = 1000. 10
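A minimal sketch of Steps 1 to 3 in Stata, using the rchi2() random-number function; the program name simmean and the variable m are illustrative:

capture program drop simmean
program define simmean, rclass
    drop _all
    set obs 1000                  // Step 1: fix the sample size n
    generate y = rchi2(3)         // Step 2: draw i.i.d. observations from a chi-squared(3)
    summarize y
    return scalar m = r(mean)     // Step 3: estimate the mean in the sample
end

set seed 1234
simulate m = r(m), reps(1000) nodots: simmean   // repeat s = 1000 times
summarize m, detail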

Inference on one population

[Figure: distribution of the health outcome variable]

    variable |       N    mean     sd    p25    p50    p75
-------------+------------------------------------------------------------
           y |  10,000     3.0    2.5    1.2    2.4    4.2
11

The estimated population mean outcome is 3.03 units. We are 95% confident that the population mean is between 2.98 and 3.08. Question: For the above inference to be valid, do we need to assume normality of the outcome? 12

[Figure: estimated mean 3.03 with 95% confidence limits 2.98 and 3.08] 13

µ=3, n=1,000, s=1,000

[Figure: distribution of the sample mean across simulations]

    variable |    N        mean        sd   p2.5  p97.5
-------------+-------------------------------------------------
       m1000 | 1000    3.002205  .0766879   2.85   3.15
---------------------------------------------------------------
14

µ=3, n=10,000, s=1,000

[Figure: distribution of the sample mean across simulations]

    variable |     N        mean        sd   p2.5  p97.5
-------------+-------------------------------------------------
      m10000 | 10000    3.001409  .0240927   2.95   3.05
---------------------------------------------------------------
15

µ=3, n=100,000, s=1,000

[Figure: distribution of the sample mean across simulations]

    variable |    N        mean        sd   p2.5  p97.5
-------------+-------------------------------------------------
     m100000 | 1000    3.000018  .0078112   2.98   3.02
---------------------------------------------------------------
16

[Figure: distributions of the sample mean for n=1,000, n=10,000, and n=100,000 (µ=3)] 17

Consistency A consistent estimator gets arbitrarily close in probability to the true value μ as you increase the sample size n. The probability that a consistent estimator is outside a neighborhood of the true value goes to zero as the sample size increases. 18

Asymptotically normal
Estimators for which a recentered and rescaled version converges to a normal distribution are said to be asymptotically normal.

√n(ȳ − μ) gets arbitrarily close to a N(0, σ²) distribution.

In the case of i.i.d. draws from a χ²(3) distribution, μ = 3 and σ² = 2 × 3 = 6. 19

[Figure: distributions of the centered and rescaled sample means for n=1,000, n=10,000, and n=100,000, overlaid with the N(0, 6) density] 20

The densities of the recentered and rescaled sample means are very similar and look close to a normal density.

√n(ȳ − μ) → N(0, σ²)

This convergence in distribution justifies our use of the distribution ȳ ~ N(μ, σ²/n). 21

Suppose I got my sample of n = 10,000 with sample mean ȳ = 3.03 and sample standard deviation σ = 2.5.

The population mean μ is estimated to be 3.03.

A 95% confidence interval for the population mean μ is obtained as

ȳ ± 1.96·σ/√n = 3.03 ± 1.96 × 2.5/√10,000 = 2.98, 3.08
22
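The interval arithmetic can be checked directly in Stata:

display 3.03 - 1.96*2.5/sqrt(10000)    // lower limit, about 2.98
display 3.03 + 1.96*2.5/sqrt(10000)    // upper limit, about 3.08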

Central Limit Theorem
When a sample of size n is selected from a population with mean μ and standard deviation σ, the sampling distribution of the sample mean has the following properties:
The mean is equal to the population mean μ.
The standard deviation, also called the standard error, is σ/√n.
The above properties always hold, regardless of the population distribution. 23

Back to hyponatremia We want to investigate the relation between wtdiff = quantitative predictor (weight change, kg) and na = quantitative outcome (serum sodium concentration, mmol/liter) 24

Univariable analysis We want to investigate the relation between wtdiff = quantitative predictor (weight change, kg) and na = quantitative outcome (serum sodium concentration, mmol/liter) 25

[Histogram: frequency distribution of serum sodium concentration (mmol/liter), range about 110 to 160] 26

[Histogram: frequency distribution of weight change (kg) pre/post race, range about −7 to 5] 27

[Scatterplot: serum sodium concentration (mmol/liter) vs. weight change (kg)] 28

[Figure: mean serum sodium concentration (mmol/liter) by weight change (kg)] 29

Regression model for the mean
We assume a statistical model to make inference about the population mean outcome as a linear function of (conditioning on) a quantitative covariate.

Mean(y | x) = β0 + β1·x

y represents individual values of independent outcomes
x represents individual values of a quantitative covariate

Basic assumptions of the model 30

A sample of n independent observations.
The response is equal to a fixed part that depends on the value of the predictor plus a random error:

y = β0 + β1·x + ε

The response, conditionally on the value of the predictor, is assumed to have a constant variance:

Var(y | x) = σ²

The population mean outcome among individuals with a covariate x equal to 0 is given by 31

Mean(y | x = 0) = β0

The difference in population mean outcome comparing individuals with a covariate value x1 with individuals with a covariate value x2 is given by

Mean(y | x = x1) = β0 + β1·x1
Mean(y | x = x2) = β0 + β1·x2

Given the specified model, one could explore variation in the population mean outcome. 32

The difference or contrast in population mean outcomes comparing individuals with a covariate value x1 with individuals with a covariate value x2 is given by

Mean(y | x = x1) − Mean(y | x = x2) = β1·(x1 − x2)

Every (x1 − x2) unit increase in the predictor is associated with a β1·(x1 − x2) unit change in the mean response, regardless of where one begins the increase (x2). This is the linear-response assumption.

We specify a simple linear regression model for the mean sodium concentration with weight change as the only predictor. 33

Mean(na | wtdiff) = β0 + β1·wtdiff

Estimation procedures such as ordinary least squares or maximum likelihood provide estimates of the unknown population parameters β0 and β1. 34

. regress na wtdiff

      Source |       SS       df       MS              Number of obs =     455
-------------+------------------------------           F(  1,   453) =   91.82
       Model |  1765.61492     1  1765.61492           Prob > F      =  0.0000
    Residual |  8710.96529   453   19.229504           R-squared     =  0.1685
-------------+------------------------------           Adj R-squared =  0.1667
       Total |  10476.5802   454  23.0761679           Root MSE      =  4.3851

------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      wtdiff |    -1.2181   .1271215    -9.58   0.000    -1.467921   -.9682791
       _cons |   139.5535   .2234862   624.44   0.000     139.1143    139.9927
------------------------------------------------------------------------------

The first variable name is the response, followed by a list of covariates or predictors. 35

Mean(na | wtdiff) = 140 − 1.2·wtdiff

The mean serum sodium concentration significantly decreases by 1.2 mmol/liter (95% CI = −1.5 to −1.0) for every 1 kg increase in weight change during the race.

The intercept, _cons, is the estimated mean response when the predictor is set to zero. The estimated population mean sodium concentration is 140 mmol/liter for those runners who did not change weight during the race.

1. What is the population mean serum sodium concentration among those runners who gained 3 kg during the marathon? 36

Mean(na | wtdiff = 3) = 140 − 1.2 × 3

. lincom _b[_cons] + _b[wtdiff]*3
------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   135.8992   .5120958   265.38   0.000     134.8928    136.9056
------------------------------------------------------------------------------ 37

[Figure: fitted line Mean(na) = 140 − 1.2·wtdiff, mean serum sodium concentration (mmol/liter) vs. weight change (kg) from −7 to 4] 38

2. What is the change in the population mean serum sodium concentration associated with a 2 kg increment?

Mean(na | wtdiff = x + 2) − Mean(na | wtdiff = x) = −1.2 × (x + 2 − x) = −1.2 × 2

. lincom _b[wtdiff]*2
------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    -2.4362    .254243    -9.58   0.000    -2.935842   -1.936558
------------------------------------------------------------------------------

The mean serum sodium concentration decreases by 2.4 mmol/liter for every 2 kg increment in weight change. 39

3. What are the differences in the population mean serum sodium concentration comparing runners with any value of weight change (x1 = x) relative to runners who did not change weight (x2 = 0)?

x1 represents any sub-population defined by x
x2 represents the reference (or baseline) sub-population

Mean(na | wtdiff = x) − Mean(na | wtdiff = 0) = −1.2 × (x − 0) 40

Tabulate mean differences

Weight change, kg    Contrast        Mean difference (95% CI)
x1 = -3              β1(-3 - 0)       3.7 (2.9 to 4.4)
x1 = -1              β1(-1 - 0)       1.2 (1.0 to 1.5)
x2 = 0               Ref              0
x1 = 1               β1(1 - 0)       -1.2 (-1.5 to -1.0)
x1 = 2               β1(2 - 0)       -2.4 (-2.9 to -1.9)

In our example of weight change predicting mean sodium concentration, we can estimate differences for any value x1 relative to x2 using the lincom postestimation command. 41

β1(-3 - 0)

. lincom _b[wtdiff]*(-3-0), cformat(%2.1fc)
------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |        3.7        0.4     9.58   0.000          2.9         4.4
------------------------------------------------------------------------------

β1(2 - 0)

. lincom _b[wtdiff]*(2-0), cformat(%2.1fc)
------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |       -2.4        0.3    -9.58   0.000         -2.9        -1.9
------------------------------------------------------------------------------ 42

Plot mean differences
To present graphically the quantity

β1(x1 − x2) = β1(x − x_ref)

the post-estimation command predictnl is very useful to obtain the above quantity for any value of x with a 95% confidence interval. Any covariate value x2 can be used as referent. 43
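For the linear model fit above (regress na wtdiff), a minimal sketch of this approach with x2 = 0 as referent; the variable names md0, lb0, and ub0 are illustrative:

predictnl md0 = _b[wtdiff]*(wtdiff - 0), ci(lb0 ub0)
twoway (line md0 lb0 ub0 wtdiff, sort lp(l - -)), legend(off) ///
    ytitle("Mean difference, mmol/liter") xtitle("Weight change, kg")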

[Figure: mean difference, MD = β1(x − 0), in serum sodium concentration (mmol/liter) vs. weight change (kg); P-value < 0.001] 44

[Figure: mean difference, MD = β1(x − 4), in serum sodium concentration (mmol/liter) vs. weight change (kg); P-value < 0.001] 45

[Figure: mean difference in serum sodium concentration (mmol/liter) vs. weight change (kg), shown with another choice of referent; P-value < 0.001] 46

Confidence intervals for the mean outcome

Var(Mean(y | x)) = Var(η) = Var(β0 + β1·x)
Var(η) = Var(β0) + Var(β1)·x² + 2·Cov(β0, β1)·x
SE(η) = √Var(η)

By the central limit theorem, we know that

Pr( −1.96 < (η̂ − η)/SE(η̂) < 1.96 ) ≈ 95%
47

Rearranging the terms,

Pr[ η̂ − 1.96·SE(η̂) < η < η̂ + 1.96·SE(η̂) ] ≈ 95%

Note: before the sample is selected we can say there is a 95% probability that η is included; after the sample is selected we can only say that there is 95% confidence that η is included.

A 95% confidence interval using the standard normal distribution is computed using the constant 1.96. 48

Using probability functions

. display invnormal(.025)
-1.959964

. display invnormal(.975)
1.959964

. display normal(1.959964) - normal(-1.959964)
.95

. mat list e(V)

symmetric e(V)[2,2]
             wtdiff       _cons
wtdiff    .01615988
 _cons    .01114286   .04994607
49

Var(β0) = .04994607
Var(β1) = .01615988
Cov(β0, β1) = .0111428

Var(η) = .04994607 + .01615988·x² + 2 × .0111428·x

Var(Mean(na | wtdiff = 0)) = .04994607
SE(Mean(na | wtdiff = 0)) = √.04994607 = .22348618

The 95% CI for the mean serum sodium concentration among those who did not change weight is given by

Mean(na | wtdiff = 0) = 140 50

Lower Limit = 140 − 1.96 × .22348618 = 139 mmol/liter
Upper Limit = 140 + 1.96 × .22348618 = 140 mmol/liter

. lincom _b[_cons]
------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   139.5535   .2234862   624.44   0.000     139.1143    139.9927
------------------------------------------------------------------------------

The 95% CI for the mean serum sodium concentration among those who gained 4 kg is given by

Mean(na | wtdiff = 4) = 140 − 1.2 × 4 = 135 51

SE(η) = √( .04994607 + .01615988 × 4² + 2 × .0111428 × 4 ) = .6305925

Lower Limit = 135 − 1.96 × .6305925 = 133 mmol/liter
Upper Limit = 135 + 1.96 × .6305925 = 136 mmol/liter

. lincom _b[_cons] + _b[wtdiff]*4
------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   134.6811   .6305925   213.58   0.000     133.4418    135.9203
------------------------------------------------------------------------------ 52
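Instead of retyping the variance estimates, the same standard errors can be pulled from e(V) after regress na wtdiff; a minimal sketch, where the matrix and scalar names are illustrative:

matrix V = e(V)
scalar varb1 = V[1,1]                          // Var(b[wtdiff])
scalar varb0 = V[2,2]                          // Var(b[_cons])
scalar covb  = V[2,1]                          // Cov(b[_cons], b[wtdiff])
display sqrt(varb0)                            // SE of the mean at wtdiff = 0, about .2235
display sqrt(varb0 + varb1*4^2 + 2*covb*4)     // SE of the mean at wtdiff = 4, about .6306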

Confidence intervals for the difference in mean outcomes

MD = Mean(y | x = x1) − Mean(y | x = x2) = β1·(x1 − x2)

Var(β1·(x1 − x2)) = Var(β1)·(x1 − x2)²
SE(MD) = √( Var(β1)·(x1 − x2)² )

95% CI = β1·(x1 − x2) ± 1.96·√( Var(β1)·(x1 − x2)² )
95% CI = MD ± 1.96·SE(MD) 53

What is the 95% CI for the mean difference in sodium concentration comparing those who lost 3 kg (x1 = −3) with those runners who did not change weight (x2 = 0)?

MD = −1.2181 × (−3 − 0) = 3.65

Var(β1) = .01615988
SE(MD) = √( .01615988 × (−3 − 0)² ) = .38136451

95% CI = 3.65 ± 1.96 × .38136451
95% CI = 2.9 to 4.4 54
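The same numbers can be reproduced with display:

display -1.2181*(-3 - 0)                          // MD = 3.65
display sqrt(.01615988*(-3 - 0)^2)                // SE(MD), about .38
display -1.2181*(-3) - 1.96*sqrt(.01615988*9)     // lower limit, about 2.9
display -1.2181*(-3) + 1.96*sqrt(.01615988*9)     // upper limit, about 4.4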

Notes on Confidence Intervals
The width of the 95% confidence interval for the mean outcome is smallest at the mean value of the quantitative predictor. The width of the 95% CI for the mean outcome increases moving away from the mean value of the predictor. 55

The width of the 95% CI for the difference in mean outcomes is zero when the two values of the quantitative predictor being compared are the same (x1 = x2): MD = 0 and SE(MD) = 0. The width of the 95% CI for the difference in mean outcomes increases with the distance between the two values of the predictor being compared (x1 − x2). 56

Dichotomous predictor
Consider now a binary or dichotomous predictor, for example an indicator variable of whether a runner gained or lost weight during the marathon.

. codebook gainweight

type: numeric (float)
label: gw
range: [0,1]  units: 1
unique values: 2  missing .: 33/488

tabulation:  Freq.   Numeric   Label
               320         0   Post<=Pre
               135         1   Post>Pre
                33         .
57

A linear regression with a single binary (0/1) predictor provides a comparison of the mean response across the two subpopulations defined by the predictor. This is equivalent to a comparison of two means for independent populations (help ttest). Let's assume a piecewise constant association between weight change and mean serum sodium concentration with a knot at zero. 58
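A minimal sketch of the equivalent two-sample comparison (assuming equal variances, which matches the constant-variance assumption of the regression model):

ttest na, by(gainweight)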

. regress na gainweight

      Source |       SS       df       MS              Number of obs =     455
-------------+------------------------------           F(  1,   453) =   70.55
       Model |  1411.72652     1  1411.72652           Prob > F      =  0.0000
    Residual |   9064.8537   453  20.0107146           R-squared     =  0.1348
-------------+------------------------------           Adj R-squared =  0.1328
       Total |  10476.5802   454  23.0761679           Root MSE      =  4.4733

------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  gainweight |  -3.856019   .4590871    -8.40   0.000    -4.758223   -2.953814
       _cons |   141.5375    .250067   566.00   0.000     141.0461    142.0289
------------------------------------------------------------------------------

Mean(na | gainweight) = 142 − 4 × gainweight

The intercept (_cons), 142 mmol/liter, is the mean sodium concentration at the referent value of gainweight, that is, individuals who lost or did not change weight (Post<=Pre). 59

The mean sodium concentration among those who gained weight was 4 mmol/liter significantly lower (95% CI = −5, −3) compared with those who lost or did not change weight.

Both linearity and dichotomization of a continuous covariate make strong assumptions about the dose-response relationship. Let's compare the two approaches.

// Linear trend
reg na wtdiff
predict fit1

// Dichotomization
reg na gainweight
predict fit2

tw (line fit1 fit2 wtdiff, sort c(l J) lp(- l) ), ///
   scheme(s1mono) legend(off)
60

[Figure: fitted values from the linear-trend and dichotomized (step) models vs. weight change (kg) pre/post race] 61

More than 2 categories
A popular strategy among epidemiologists is to categorize the continuous covariate into 3 to 5 categories. It is commonly used to present the data in tabular form and to avoid the assumption of linearity. Let's consider a categorized version of weight change as a predictor of serum sodium concentration. 62

. table wtdiffc, c(freq mean na sd na) f(%3.0f)

---------------------------------------------------
Categorization of |
weight change     |   Freq.   mean(na)   sd(na)
------------------+--------------------------------
       3.0 to 4.9 |       7        133        4
       2.0 to 2.9 |      16        135        6
       1.0 to 1.9 |      39        138        5
       0.0 to 0.9 |      96        139        5
     -1.0 to -0.1 |     109        141        5
     -2.0 to -1.1 |      99        142        4
     -5.0 to -2.1 |      84        143        4
---------------------------------------------------
63

To correctly interpret the regression coefficients of indicator variables, we need to know how the variable is coded (the meaning of the numbers).

. codebook wtdiffc

range: [1,7]  units: 1
unique values: 7  missing .: 38/488

tabulation:  Freq.   Numeric   Label
                 7         1   3.0 to 4.9
                16         2   2.0 to 2.9
                39         3   1.0 to 1.9
                96         4   0.0 to 0.9
               109         5   -1.0 to -0.1
                99         6   -2.0 to -1.1
                84         7   -5.0 to -2.1
                38         .
64

Categorical variables: the xi prefix
Categorical variables with more than two levels are usually included in the regression model using indicator/dummy variables. The indicator variable omitted from the model identifies the referent group. The xi prefix command makes it easy to generate the indicator variables as well as all interaction terms. By default, Stata uses the lowest value of the categorical variable as the reference. 65

Mean(na) = β0 + β2·_Iwtdiffc_2 + ... + β7·_Iwtdiffc_7

. xi: regress na i.wtdiffc
i.wtdiffc         _Iwtdiffc_1-7       (naturally coded; _Iwtdiffc_1 omitted)

      Source |       SS       df       MS              Number of obs =     450
-------------+------------------------------           F(  6,   443) =   16.48
       Model |  1888.39075     6  314.731791           Prob > F      =  0.0000
    Residual |   8462.1337   443  19.1018819           R-squared     =  0.1824
-------------+------------------------------           Adj R-squared =  0.1714
       Total |  10350.5244   449   23.052393           Root MSE      =  4.3706

------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Iwtdiffc_2 |   1.196429   1.980583     0.60   0.546    -2.696077    5.088934
 _Iwtdiffc_3 |   4.238095   1.794055     2.36   0.019     .7121797    7.764011
 _Iwtdiffc_4 |   5.644345   1.711087     3.30   0.001     2.281489    9.007201
 _Iwtdiffc_5 |   7.442988   1.704138     4.37   0.000     4.093789    10.79219
 _Iwtdiffc_6 |   8.227994   1.709324     4.81   0.000     4.868603    11.58739
 _Iwtdiffc_7 |   9.083333   1.719373     5.28   0.000     5.704192    12.46247
       _cons |   133.4286    1.65192    80.77   0.000      130.182    136.6751
66

The intercept (_cons) is the mean sodium concentration at the referent value of all predictors, that is, individuals who gained 3 to 4.9 kg during the race. The coefficient of _Iwtdiffc_2 is the difference in the mean sodium concentration comparing runners who gained 2 to 2.9 kg vs the referent. The coefficient of _Iwtdiffc_7 is the difference in the mean sodium concentration comparing runners who lost 2.1 to 5 kg vs the referent.

Suppose you want to define a weight change of 0 to 0.9 kg as your referent group rather than the default lowest value. 67

. char wtdiffc[omit] 4
. xi: regress na i.wtdiffc
i.wtdiffc         _Iwtdiffc_1-7       (naturally coded; _Iwtdiffc_4 omitted)

      Source |       SS       df       MS              Number of obs =     450
-------------+------------------------------           F(  6,   443) =   16.48
       Model |  1888.39075     6  314.731791           Prob > F      =  0.0000
    Residual |   8462.1337   443  19.1018819           R-squared     =  0.1824
-------------+------------------------------           Adj R-squared =  0.1714
       Total |  10350.5244   449   23.052393           Root MSE      =  4.3706

------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Iwtdiffc_1 |  -5.644345   1.711087    -3.30   0.001    -9.007201   -2.281489
 _Iwtdiffc_2 |  -4.447917   1.180189    -3.77   0.000    -6.767381   -2.128452
 _Iwtdiffc_3 |   -1.40625   .8299216    -1.69   0.091    -3.037323    .2248226
 _Iwtdiffc_5 |   1.798643    .611739     2.94   0.003     .5963719    3.000914
 _Iwtdiffc_6 |   2.583649   .6260401     4.13   0.000     1.353271    3.814027
 _Iwtdiffc_7 |   3.438988   .6529788     5.27   0.000     2.155667    4.722309
       _cons |   139.0729   .4460694   311.77   0.000     138.1962    139.9496
------------------------------------------------------------------------------ 68

The intercept (_cons) is the mean sodium concentration at the referent value of all predictors, that is, individuals who gained 0 to 0.9 kg during the race. The coefficient of _Iwtdiffc_1 is the difference in the mean sodium concentration comparing runners who gained 3 to 4.9 kg vs the referent, and so on. The coefficient of _Iwtdiffc_7 is the difference in the mean sodium concentration comparing runners who lost 2.1 to 5 kg vs the referent. 69
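On Stata 11 or later, factor-variable notation can replace the xi prefix, and the ib#. operator sets the referent level directly; a minimal sketch reproducing the model above with category 4 (0.0 to 0.9 kg) as the referent:

regress na ib4.wtdiffc     // same fit as xi: regress with wtdiffc[omit] set to 4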

Comparing different approaches

// nahat2 holds the fitted values from the categorical (step-function) model,
// e.g. predict nahat2 after xi: regress na i.wtdiffc
tw (lfit na wtdiff) ///
   (lowess na wtdiff, lc(red)) ///
   (line nahat2 wtdiff, c(j) lp(-) sort ) ///
   , legend(ring(0) pos(1) col(1) ///
     label(1 "Linear trend") ///
     label(2 "Smoothed trend") ///
     label(3 "Step-function") ) ///
   ytitle("Mean sodium concentration, mmol/liter") ///
   xlabel(-7(1)4) ylabel(130(5)150, angle(horiz))
70

[Figure: mean sodium concentration (mmol/liter) vs. weight change (kg) pre/post race, showing the linear trend, smoothed trend, and step-function fits] 71

Lowess regression
Lowess (locally weighted scatterplot smoothing) fits a line through a scatter plot without any model assumption:
Each observation (x_i, y_i) is fitted with a separate linear regression based on adjacent observations.
Each point in this range is weighted as a function of its distance from x_i.
It provides a graph that makes it easy to detect strong departures from linearity. 72
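A minimal sketch of the dedicated command (the bandwidth shown is Stata's default and is purely illustrative):

lowess na wtdiff, bwidth(0.8)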

Non-linear associations
A linear model can be used to model exposure-response relations that are not linear. In our example, the flexible smoothed line for weight change suggests a possible non-linear relationship: the rate of change of sodium concentration among those who lost weight is not as steep as for those who gained weight during the race.

A way to detect strong departure from linearity is to fit a model that allows for non-linearity and includes the linear model as a special case. A simple example is to fit a regression model that includes both the exposure as it is and the exposure squared (raised to the power of 2), known as a quadratic model. 73

Adding a quadratic transformation
The quadratic model for a quantitative exposure x is

Mean(y | x) = β0 + β1·x + β2·x²

The linear response model is nested in (a special case of) the quadratic model. A p-value for departure from linearity is obtained by testing the coefficient β2 equal to zero. 74

If the p-value is small (say, < 0.05), there is a departure from linearity that needs care and attention. Otherwise, the simpler linear model fits the data adequately.

We first generate a new variable containing weight change to the power of 2 (wtdiff squared):

. gen wtdiffsq = wtdiff^2

Then we fit the quadratic regression model. 75

. regress na wtdiff wtdiffsq

      Source |       SS       df       MS              Number of obs =     455
-------------+------------------------------           F(  2,   452) =   51.04
       Model |  1930.17099     2  965.085495           Prob > F      =  0.0000
    Residual |  8546.40923   452   18.907985           R-squared     =  0.1842
-------------+------------------------------           Adj R-squared =  0.1806
       Total |  10476.5802   454  23.0761679           Root MSE      =  4.3483

------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      wtdiff |  -1.445263   .1477126    -9.78   0.000    -1.735552   -1.154974
    wtdiffsq |  -.1355339   .0459424    -2.95   0.003     -.225821   -.0452467
       _cons |   139.8157   .2387765   585.55   0.000     139.3465     140.285
------------------------------------------------------------------------------ 76

Question 1. Does weight change overall predict the mean sodium concentration?

We test the two coefficients simultaneously equal to zero.

. testparm wtdiff wtdiffsq

 ( 1)  wtdiff = 0
 ( 2)  wtdiffsq = 0

       F(  2,   452) =   51.04
            Prob > F =    0.0000

The p-value is small, so the answer is yes. 77

Question 2. Does a quadratic model for weight change predict the mean sodium concentration better than the simpler linear model?

We test the coefficient of the squared exposure equal to zero. The test and its p-value are already in the output of the regress command (p = 0.003). The p-value is small, so the answer is yes. 78

Question 3. What is the difference in the mean sodium concentration comparing those who gained 2 kg with those who did not change weight?

To put it more generally, the predicted mean responses for any two values of x in a quadratic model are

Mean(y | x = x1) = β0 + β1·x1 + β2·x1²
Mean(y | x = x2) = β0 + β1·x2 + β2·x2²
79

The quantity

Mean(y | x = x1) − Mean(y | x = x2) = β1·(x1 − x2) + β2·(x1² − x2²)

is the contrast between two predicted responses associated with an (x1 − x2) unit change of the exposure x. Compared with the linear response model, quantifying the change in the mean response is now more complicated because it involves two regression coefficients and two variables. 80

In health-related fields, the value of the covariate x = x2 is called a reference value, and it is used to compute and interpret a set of comparisons of subpopulations defined by different covariate values.

You can easily estimate the above quantity with the postestimation commands lincom or predictnl. The postestimation command xblc carries out these computations:
Orsini N, Greenland S. A procedure to tabulate and plot results after flexible modeling of a quantitative covariate. Stata Journal 2011; 11(1): 1-29.

Example, using the post-estimation lincom command:

. lincom _b[wtdiff]*(2-0) + _b[wtdiffsq]*(4-0)
81

 ( 1)  2*wtdiff + 4*wtdiffsq = 0

------------------------------------------------------------------------------
          na |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -3.432662   .4214855    -8.14   0.000    -4.260976   -2.604347
------------------------------------------------------------------------------

Compared with runners who did not change weight, runners who gained 2 kg had a significantly lower mean sodium concentration, by 3.4 mmol/liter. One can tabulate differences in mean responses for a list of specific values of the exposure.

Question 4. How can we plot the change in the mean response with 95% confidence intervals as a function of the exposure, using a specific exposure value as reference? 82

To create a plot, we need to store the numbers we are interested in as variables. Once again, we can use the post-estimation command predictnl:

predictnl diff = _b[wtdiff]*(wtdiff-0) + ///
    _b[wtdiffsq]*(wtdiffsq-0), ci(lb ub)

This gives us 3 new variables (diff, lb, and ub) in one line, ready to be plotted with a standard twoway plot. 83

twoway (line diff lb ub wtdiff, sort lp(l - -)), ///
    legend(off) scheme(s1mono) ytitle("Mean Difference")

[Figure: mean difference in serum sodium concentration (mmol/liter) with 95% confidence limits vs. weight change (kg) pre/post race] 84

Summary
Linear regression is used to make inference on the population mean conditionally on predictors, assuming independent observations. The normal distribution is important for making inference about the population parameters. We have seen how to interpret the regression coefficients and how to graphically present the model. 85