BIOS 312: MODERN REGRESSION ANALYSIS


James C (Chris) Slaughter
Department of Biostatistics
Vanderbilt University School of Medicine
james.c.slaughter@vanderbilt.edu
biostat.mc.vanderbilt.edu/coursebios312

Copyright JC Slaughter. All Rights Reserved.
Updated March 26, 2012

Contents

14 Regression Based Prediction
  Overview
    Uses of Regression
    Examples of Prediction
    Assumptions needed for inference
  Estimation (prediction) of summary measures
    Point estimates
    Interval estimates
  Prediction (forecasting) of individual observations
    Assumptions
    Point and interval estimates for Normal data
    Example: FEV by age and height
    Mathematics of Prediction Intervals
  Prediction of Binary Measurements
    Classification (discrimination)
    Stepwise model building
    ROC Curves
    Example 1: Age or height to predict smoking
    Example 2: Multiple predictors of smoking
    Analysis of ROC curves in R
  Multivariable prognostic models

Chapter 14
Regression Based Prediction

14.1 Overview

Uses of Regression

Cluster Analysis
- Focuses on identifying observations (covariates) with similar characteristics
- Interest lies in finding clusters that are more (or less) likely to have some outcome of interest
- Divide a population into subgroups based on patterns of similar measurement
- Clustering can be done by considering one variable (univariable) or multiple variables (multivariable)
- Number of clusters can be known or unknown
- Examples
  - Microarrays

  - fMRI
  - Small n, large p (few subjects, many predictors)
- Statistical techniques
  - Stepwise regression
  - Regression trees
  - Other data-driven model building approaches
- Getting valid inference (i.e. reliable p-values, unbiased parameter estimates) is particularly challenging
- Area of active research in Biostatistics

Quantifying and comparing distributions
- Compare the distribution of the outcome variable across levels of the grouping variable or variables
  - Compare salary in males and females, across professors of the same rank, degree, experience, and field
  - Compare FEV in smokers and non-smokers, across teenagers of the same age and height
- Need to decide which summary measure is appropriate for describing the distribution
  - Common summary measures: means, medians, geometric means, odds, rates, hazards
  - Other measures: variance, skewness, kurtosis, likelihood of extreme values, quantiles, etc.

- May desire estimates within specific subgroups
  - Estimates of the association within gender, race, or age groups (this is effect modification)

Regression based prediction
- Prediction of summary measures
  - Point prediction: best single estimate for the measurement that would be obtained, on average, for many individuals with given covariates
  - Interval prediction: quantify the uncertainty of the average, i.e. the range of the summary measure that might be reasonable to observe
- Prediction of individual measurements
  - Point prediction: best single estimate for the measurement that would be obtained on a single individual with given covariates
  - Interval prediction: quantify the uncertainty of the individual prediction, i.e. the range of measurements that might reasonably be observed in a future individual with given covariates

Examples of Prediction

Continuous prediction: Creatinine clearance
- Creatinine is a continuously produced breakdown product in muscles

- Removed by the kidney through filtration
- Amount of creatinine cleared by the kidneys in 24 hours is a measure of renal function
- Gold standard is to collect urine output for 24 hours and measure creatinine
- Would prefer to find a combination of blood and urine measures that can be obtained at one time point, yet still provides accurate prediction of a patient's creatinine clearance
- Statistical approach
  - Collect a training dataset to build a regression model for prediction
    - Measure true creatinine clearance
    - Measure age, gender, weight, height, blood markers, urinary markers, etc.
    - Fit a (data driven?) regression model to predict true creatinine clearance
  - Collect a validation dataset
    - Use the regression estimates (the βs) from the training dataset to see how well your model predicts creatinine clearance
    - Quantify the accuracy of the predictive model (e.g. mean squared error)
  - Cross validation
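The training/validation workflow above can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python rather than the Stata and R used in these notes, it uses a single predictor, and the data values and the helper names `fit_simple_linear` and `mean_squared_error` are hypothetical.

```python
# Minimal sketch: fit on a training set, then quantify accuracy on a
# separate validation set using the training-set coefficients.
# All data values below are hypothetical.

def fit_simple_linear(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between observed and predicted outcomes."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
valid_x = [1.5, 2.5, 3.5]
valid_y = [3.0, 5.2, 6.9]

# Coefficients come from the training data only.
b0, b1 = fit_simple_linear(train_x, train_y)

# The same coefficients are then applied, unchanged, to the validation data.
train_mse = mean_squared_error(train_x, train_y, b0, b1)
valid_mse = mean_squared_error(valid_x, valid_y, b0, b1)
print(f"training MSE = {train_mse:.4f}, validation MSE = {valid_mse:.4f}")
```

The key design point is that the validation data never influence the coefficients, so the validation error is an honest estimate of accuracy in new subjects.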

Discrimination (binary prediction): Low birth weight
- Want to predict which infants are more likely to be less than 2500 grams at birth
- Possible predictors: age, race, blood biomarkers collected during pregnancy
- Statistical approach
  - Collect a training dataset to build a regression model for prediction
    - Measure birth weight
    - Measure age, race, blood biomarkers
    - Fit a (data driven?) regression model to predict low birth weight
  - Collect a validation dataset
    - Use the regression estimates (the βs) from the training dataset to see how well your model predicts low birth weight
    - Quantify the accuracy of the predictive model
      - Sensitivity, specificity (ROC curves)
      - Predictive value positive, predictive value negative

Interval prediction: Normal ranges of PSA
- Identify the range of PSA values that would be expected in 95% of healthy adult males
- Possibly stratify by age, race
- Statistical approach

  - Collect a training dataset to build a regression model for prediction
    - Measure PSA and variables to predict PSA
    - Need to estimate the quantiles
      - Mean plus/minus 2 standard deviations (makes a strong Normality assumption)
      - Estimate the quantiles, provide CIs around the quantiles (likely low precision)
  - Collect a validation dataset
    - Quantify the accuracy of the predictive model (e.g. coverage probabilities)

Assumptions needed for inference

Inference for associations
- Necessary assumptions for classical regression (no robust standard errors)
  - Independence of response measurements within identified clusters
  - Have appropriately modeled the within group variance
    - Linear regression: equal variance across groups
    - Other regressions: appropriate mean-variance relationship
    - Lack of model fit may lead to a poor estimate of the variance
  - Sample size is large enough so that parameter estimates approximately follow a Normal distribution
- Necessary assumptions for first order trends using robust standard errors
  - Independence of response measurements within identified clusters

  - (Robust standard errors account for heteroscedasticity in large samples)
  - Lack of model fit leads to conservative inference due to mixing systematic and random error
  - Sample size is large enough so that parameter estimates approximately follow a Normal distribution

Inference for predictions
- Additional assumptions for predictions of means
  - Our regression model has accurately described the relationship between summary measures across groups
  - For continuous covariates, this often involves flexible models that allow for departures from linearity
- Additional assumptions for predictions of individual observations
  - Also need to know the shape of the distribution within each group
  - Methods implemented in software often rely on strong assumptions like Normality

14.2 Estimation (prediction) of summary measures

- Given age, height, and sex, estimate the mean (or geometric mean) FEV
  - Use linear regression to obtain estimates and CI
- Given age and PSA, estimate the probability (or odds) of remaining in remission for 24 months

  - Use logistic regression to obtain estimates and CI
- Assumptions
  - Independence (between clusters for robust SE)
  - Variance approximated by the model (relaxed for robust SE)
  - Regression model accurately describes the relationship of summary measures across groups
  - Sufficient sample sizes for asymptotic distributions of parameters to be a good approximation

Point estimates
- Point estimates are obtained by substituting predictor values into the estimated regression equation
- Linear regression: $E[\text{Chol} \mid \text{Age}] = \hat\beta_0 + \hat\beta_1 \times \text{Age}$
  - Expected cholesterol for a 50 year old is $\hat\beta_0 + \hat\beta_1 \times 50$
- Logistic regression: $\text{logodds}(\text{Survival} \mid \text{Age}) = \hat\beta_0 + \hat\beta_1 \times \text{Age}$
  - Expected log-odds of survival for a 50 year old is $-0.53$
  - Expected odds $= e^{-0.53} = 0.59$
  - Expected probability $= 0.59 / (1 + 0.59) = 0.37$

Interval estimates
- If assumptions hold, interval estimates can be obtained

- We generally find a confidence interval for the transformed quantity, and then back transform to the desired quantity
  - For logistic regression, calculate the CI for the log odds, then transform to odds or probability
- Statistical criteria for determining the best estimate
  - Consistent: correct estimate (with infinite sample size)
  - Precise: minimum variance among (unbiased) estimators
- Common regression methods provide the best estimate in a wide variety of settings
- Necessary assumptions
  - Independence
  - Variance approximated by the model (relaxed for robust SE)
  - Regression model accurately describes the relationship among summary measures across groups
  - Sufficient sample size so that asymptotic Normality of the parameter estimates holds

Interval estimates
- Substituting the predictor values into the fitted equation provides an estimate of the model transformation of the summary measure
- The model transformation of the summary measure varies by regression setting
  - Linear regression: mean

  - Linear regression on log transformed outcome: log geometric mean
  - Logistic regression: log odds
  - Poisson regression: log rate
- Formulas for the confidence interval
  - In general: (estimate) ± (crit value) × (std error)
  - In linear regression, the t distribution is usually used to obtain the confidence interval
    - Stata: (crit value) = invttail(df, α/2)
    - R: (crit value) = qt(1 - α/2, df)
    - Degrees of freedom: df = n - (number of estimated regression parameters, counting the intercept); e.g. 654 - 3 = 651 in the FEV example with two predictors
  - In other regressions, we use the standard Normal distribution
    - Stata: (crit value) = invnorm(1 - α/2) = 1.96 for α = 0.05
    - R: (crit value) = qnorm(1 - α/2) = 1.96 for α = 0.05

Interval estimates in Stata
- After any regression command, the Stata command predict will compute estimates and standard errors
  predict varname, [what]
- varname is the name of the new variable to be created
- what is one of
  - xb: the linear predictor (works for all regressions)

  - stdp: standard error of the linear prediction
  - p: for logistic regression, the predicted probability

Interval estimates in R
- After storing a model, the R function predict will compute estimates and standard errors
  predict(object, se.fit=FALSE, ...)
- object is the stored fitted model
- se.fit will provide the standard error of the fit (defaults to FALSE)
- See help(predict) for more details and options
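The back-transformation idea for logistic regression can be sketched as follows. This is an illustrative Python sketch, not the Stata or R workflow of the notes: it takes a log-odds of -0.53 (which gives odds of about 0.59, matching the worked point estimate), while the standard error of 0.20 and the function names are hypothetical assumptions.

```python
import math
from statistics import NormalDist

def back_transformed_interval(log_odds, se, alpha=0.05):
    """CI computed on the log-odds scale, then exponentiated to the odds scale."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    lower = log_odds - crit * se
    upper = log_odds + crit * se
    return tuple(math.exp(v) for v in (log_odds, lower, upper))

def odds_to_probability(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)

log_odds = -0.53   # point estimate on the log-odds scale
se = 0.20          # hypothetical standard error of the linear predictor

odds, odds_lo, odds_hi = back_transformed_interval(log_odds, se)
print(f"odds = {odds:.2f} ({odds_lo:.2f}, {odds_hi:.2f})")
print(f"probability = {odds_to_probability(odds):.2f} "
      f"({odds_to_probability(odds_lo):.2f}, {odds_to_probability(odds_hi):.2f})")
```

Because the interval is built on the log-odds scale and then transformed, the resulting interval for the odds is not symmetric around the point estimate, and the probability interval always stays within (0, 1).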

. gen logfev = log(fev)
. regress logfev height age
  (output omitted: regression table for height, age, and _cons; F(2, 651))
. predict fitlogfev
  (option xb assumed; fitted values)
. predict sefit, stdp
. gen gmfev = exp(fitlogfev)
. gen gmlofev = exp(fitlogfev - invttail(651, .025) * sefit)
. gen gmhifev = exp(fitlogfev + invttail(651, .025) * sefit)
. list gmfev gmlofev gmhifev if age==10 & height==
  (output omitted: gmfev, gmlofev, gmhifev)

. logit smoker height age
  (output omitted: logistic regression, number of obs = 654; coefficient table for height, age, and _cons)
. predict logodds, xb
. predict selogodds, stdp
. gen odds = exp(logodds)
. gen oddslo = exp(logodds - 1.96 * selogodds)
. gen oddshi = exp(logodds + 1.96 * selogodds)
. list odds oddslo oddshi if age==10 & height==
  (output omitted: odds, oddslo, oddshi)

Prediction (forecasting) of individual observations

- Methods are for continuous outcomes
  - Will deal with discrimination of binary outcomes in the next section
- Given age, height, and sex, predict a new subject's FEV
  - Use linear regression to obtain estimates and CI

Assumptions
- Necessary assumptions to predict individual observations

  - Independence (between clusters for robust SE)
  - Variance approximated by the model (NOT relaxed for robust SE)
  - Regression model accurately describes the relationship of summary measures across groups
  - Shape of the distribution is the same in each group
    - Transformation of the outcome may be very useful
  - Sufficient sample sizes for asymptotic distributions of parameters to be a good approximation
- Assumptions are very strong
  - Consequently, we do not have many methods that provide robust inference
  - In general, I prefer methods that make as few assumptions as possible
  - Robust standard errors will not help in this situation
  - Proper transformation of outcomes and predictors may be necessary so that the underlying assumptions of the classical linear regression model hold
  - Models and estimates will need to be appropriately penalized for valid inference
    - Topic of more advanced regression courses (Regression Modeling Strategies)
- Precise methods have been developed for
  - Binary variables (the mean specifies the variance)
  - Continuous data that follow a Normal distribution

Point and interval estimates for Normal data

- Point estimates are obtained by substitution into the estimated regression equation
  - $E[\text{Chol} \mid \text{Age}] = \hat\beta_0 + \hat\beta_1 \times \text{Age}$
  - Expected cholesterol for a 50 year old is $\hat\beta_0 + \hat\beta_1 \times 50$
  - Substituting in age values provides an estimate of the forecast cholesterol
- Interval estimates
  - Under appropriate assumptions, we can obtain standard errors for such predictions
  - Standard errors must account for two sources of variability
    - Variability in estimating the regression parameters (same as in predictions of summary measures)
    - Variability due to subject: additional variability about the sample mean, i.e. the within group standard deviation
  - Estimating these sources of variability is where the additional Normality assumption is key
- Formulas for the prediction interval
  - In general: (prediction) ± (crit value) × (std error)
  - In linear regression, the t distribution is usually used to obtain the prediction interval
    - Stata: (crit value) = invttail(df, α/2)
    - R: (crit value) = qt(1 - α/2, df)

    - Degrees of freedom: df = n - (number of estimated regression parameters, counting the intercept)

Interval estimates in Stata
- After any regression command, the Stata command predict will compute estimates and standard errors
  predict varname, [what]
- varname is the name of the new variable to be created
- what is one of
  - xb: the linear predictor (works for all regressions)
  - stdf: standard error of the forecast prediction

Interval estimates in R
- After storing a model, the R function predict will compute estimates and standard errors
  predict(object, se.fit=FALSE, ...)
- object is the stored fitted model
- se.fit will provide the standard error of the fit (defaults to FALSE)
- For the standard error of the forecast, need to include the root mean squared error
  $se_{pred}^2 = se_{fit}^2 + RMSE^2$

General comment about software
- Commercial software only implements prediction intervals by assuming Normal data

- If using R libraries (or Stata ado files) for prediction, carefully investigate and understand how they are making predictions
- Don't trust the black box to be giving you results that are widely applicable to all situations
- Be careful that you understand exactly what is going on if you are interested in making predictions

Example: FEV by age and height

. regress logfev height age
  (output omitted: regression table for height, age, and _cons; F(2, 651))
. predict fitlogfev
  (option xb assumed; fitted values)
. predict sepredict, stdf
. gen predfev = exp(fitlogfev)
. gen predlofev = exp(fitlogfev - invttail(651, .025) * sepredict)
. gen predhifev = exp(fitlogfev + invttail(651, .025) * sepredict)
. list predfev predlofev predhifev if age==10 & height==
  (output omitted: predfev, predlofev, predhifev)

- The preceding output gave predictions for individual observations

- We can compare these results to the prediction for summary measures done previously
  (output omitted: gmfev, gmlofev, gmhifev alongside predfev, predlofev, predhifev)
  - Point estimates are identical
  - The interval for individual predictions is wider
  - As n increases, the width of the prediction interval (on the log scale) approaches ±1.96 × RMSE
  - As n increases, the width of the interval around the summary measure approaches 0
- We can also calculate the standard error of the prediction by hand
  - Example: Age is 10 and Height is 66

. gen sepredict2 = sepredict^2
. gen sefit2 = sefit^2
. list sepredict sefit sepredict2 sefit2 if age==10 & height==
  (output omitted: sepredict, sefit, sepredict2, sefit2)
. di .14659^

  $se_{pred}^2 = se_{fit}^2 + RMSE^2$
  (values of $se_{pred}$, $se_{fit}$, and RMSE omitted)
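The relationship between the standard error of the fit and the standard error of the forecast can be sketched as below. This is an illustrative Python sketch rather than the Stata calculation in the notes; the function name `forecast_se` and both numeric inputs are hypothetical, not values from the FEV output.

```python
import math

def forecast_se(se_fit, rmse):
    """Standard error of an individual prediction: se_pred^2 = se_fit^2 + RMSE^2."""
    return math.sqrt(se_fit ** 2 + rmse ** 2)

se_fit = 0.012   # hypothetical standard error of the fitted (log) mean
rmse = 0.146     # hypothetical root mean squared error of the regression

se_pred = forecast_se(se_fit, rmse)
print(f"se_fit = {se_fit:.4f}, RMSE = {rmse:.4f}, se_pred = {se_pred:.4f}")

# As n grows, se_fit shrinks toward 0 and se_pred approaches the RMSE,
# so the prediction interval half-width approaches about 1.96 * RMSE.
print(f"limiting se_pred = {forecast_se(0.0, rmse):.4f}")
```

With a large sample the forecast standard error is dominated by the RMSE, which is why the prediction interval does not shrink to zero the way the confidence interval for the mean does.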

- Note that the discrepancy between the hand calculation and the Stata output is due to rounding error
- This is an academic exercise in Stata, but necessary in R (unless you can find a function to do it for you)

Mathematics of Prediction Intervals

Basic ideas behind prediction intervals
- Model: $Y_i \mid X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$
- Alternative specification: $Y_i \mid X_i = \beta_0 + \beta_1 X_i + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$
- Estimated mean: $\hat\beta_0 + \hat\beta_1 X_h \sim N\left(\beta_0 + \beta_1 X_h, \sigma^2 V\right)$
- Predicted mean: $\hat\beta_0 + \hat\beta_1 X_h + \epsilon_h \sim N\left(\beta_0 + \beta_1 X_h, \sigma^2 (1 + V)\right)$
- $V = \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
- $X_h$ is the chosen value of the covariate (e.g. age==10)
- Note: As $n \to \infty$, $V \to 0$

Prediction of Binary Measurements

Classification (discrimination)
- Sometimes the scientific question is one of deriving a rule to classify subjects

  - Diagnosis of prostate cancer: based on age, race, and PSA, should we make a diagnosis of prostate cancer?
  - Prognosis of patients with primary biliary cirrhosis: based on age, bilirubin, albumin, edema, and protime, is the patient likely to die within the next year?
- Classification can be regarded as trying to predict the value of a binary variable
  - Earlier, we were estimating the probability and odds of relapse within a particular group (a summary measure)
  - Now we want to decide whether a particular individual will relapse or not (an individual measure)
- There is an obvious connection between the above two ideas
  - The probability or odds tells us everything about the distribution of values
  - The only possible values are 0 or 1
- Typical approach
  - First, use a regression model to estimate the probability of the event in each group
  - Second, form a decision rule based on the estimated probability of the event
    - If estimate ≥ c (or ≤ c), predict outcome is 1
    - If estimate < c (or > c), predict outcome is 0
  - Quantify the accuracy of the decision rule

    - For disease D and test T
      - Sensitivity: $\Pr(T^+ \mid D^+)$
      - Specificity: $\Pr(T^- \mid D^-)$
      - Predicted Value Positive: $\Pr(D^+ \mid T^+)$
      - Predicted Value Negative: $\Pr(D^- \mid T^-)$

Stepwise model building
- Old method for considering a large number of covariates that might possibly be predictive of an outcome
  - If stepwise model building had been proposed today, it would not have passed peer review
  - Available in all software as an automated tool, but should not be used
  - Avoid recreating the stepwise approach when building your own models manually
- Major caveats
  - Overfits your dataset
  - P-values are not true p-values; they are very anti-conservative
  - You will often obtain different models if you use forward or backward stepwise regression
  - Ignores confounding effects (a variable could be an important confounder, but itself not be statistically significant in the model)

  - Without clustering certain predictors, may throw out covariates to make a nonsensical model
    - Suppose race/ethnicity is coded as White, Black, Hispanic, Other
    - Want to model as 3 dummy variables
    - Stepwise may throw out one of your dummy variables and keep the rest
    - Pairwise significance would also be highly dependent on the reference group chosen by the analyst
- Two flavors of stepwise model building
  - Start with no covariates: forward stepwise regression
  - Start with all covariates: backward stepwise regression
- Stepwise procedure proceeds by adding or removing covariates from the model based on the corresponding partial t or Z test
  - Must pre-specify a P to enter and a P to remove
  - To avoid infinite loops, P to enter must be less than P to remove
  - E.g. add a variable to the model if p < 0.05, but only remove a variable from the model if p > 0.10
  - Repeat the process until you arrive at a final model

ROC Curves
- Receiver operating characteristic (ROC) curves plot the sensitivity against the false positive rate (i.e. 1 - specificity)

  - Ideal tests will have both high sensitivity and high specificity (low false positives)
  - Graphically depicted as points in the upper left portion of the plot
  - The 1:1 line represents a test that is no better than the flip of a coin
- A test usually involves the classification of some binary outcome by a continuous predictor (or set of continuous predictors)
  - When the continuous predictor is above some point c, the test is said to be positive
  - ROC curves plot the sensitivity and false positive rate for all possible values of c
- Major drawbacks to ROC analysis
  - Tempts the analyst to find a cutoff point (c) when none really exists
  - Better to treat your covariate or linear prediction as a continuous variable

Example 1: Age or height to predict smoking
- Question: Is age or height a better predictor of smoking status?
- Will compare the ROC curves
  - Often done using the area under the curve (AUC)
  - Model with higher AUC is better
  - A null model (not predictive of outcome) will have area of 0.50
  - Maximum AUC is 1.0
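The AUC benchmarks above (0.50 for a null model, 1.0 at best) can be illustrated with a small sketch. This is an assumption-laden Python illustration rather than the Stata lroc/roccomp commands used in these notes: it uses the standard interpretation of the AUC as the probability that a randomly chosen subject with the outcome has a higher score than a randomly chosen subject without it (ties counted as one half), and the function name `auc` and all data are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Each (positive, negative) pair counts 1 if the positive subject has the
    higher score, 0.5 if the scores are tied, and 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: perfect separation, a constant (null) score, and a mixed case.
print(auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))   # perfectly separated
print(auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))   # uninformative score
print(auc([0.2, 0.6, 0.4, 0.8], [0, 0, 1, 1]))   # partial overlap
```

The pairwise form is quadratic in the sample size, which is fine for a teaching example but would be replaced by a rank-based calculation for large datasets.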

- Stata
  - Fit the logistic regression model
  - lroc creates the ROC curve
  - predict to save the fitted curves
  - roccomp to compare the curves

. logit smoker height
  (output omitted: logistic regression, number of obs = 654; coefficient table for height and _cons)
. lroc, nograph
  (output omitted: logistic model for smoker, number of observations = 654, area under ROC curve)
. predict xb1, xb
. logit smoker age
  (output omitted: logistic regression, number of obs = 654; coefficient table for age and _cons)
. lroc, nograph
  (output omitted: logistic model for smoker, number of observations = 654, area under ROC curve)
. predict xb2, xb
. roccomp smoker xb1 xb2, graph summary
  (output omitted: ROC area, standard error, and asymptotic Normal 95% confidence interval for xb1 and xb2; chi2(1) test of Ho: area(xb1) = area(xb2))

- Age has a significantly larger AUC than Height (p < 0.001)
  - For any given false positive rate, the sensitivity is higher for age
  - Predicted value positive would also always be higher

Example 2: Multiple predictors of smoking
- Will consider age, height, sex, and fev as predictors of smoking
- Fit a multivariable logistic regression model using these covariates in a training sample
  - Training sample will contain 60% of the observations in the current study
  - Develop a model on the training sample, then see how well it fits in the validation sample (the remaining 40% of the data)

- Consider a rule that predicts a subject will smoke if their predicted probability is greater than 0.5
  - Could choose other cutoffs than 0.5
  - Other cutoffs will be represented on the ROC curves
  - Sensitivity and specificity will vary by the cutoff chosen
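How sensitivity and specificity move with the cutoff can be sketched as follows. This is an illustrative Python sketch, not the Stata analysis of the notes; the predicted probabilities, observed outcomes, and the helper names `classify` and `sensitivity_specificity` are all hypothetical.

```python
def classify(probabilities, cutoff=0.5):
    """Predict 1 when the estimated probability exceeds the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]

def sensitivity_specificity(predicted, observed):
    """Return (sensitivity, specificity) for 0/1 predictions and outcomes."""
    true_pos = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 1)
    true_neg = sum(1 for p, o in zip(predicted, observed) if p == 0 and o == 0)
    n_pos = sum(observed)
    n_neg = len(observed) - n_pos
    return true_pos / n_pos, true_neg / n_neg

# Hypothetical predicted probabilities and observed smoking status.
probs = [0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80]
observed = [0, 0, 0, 1, 0, 1, 0, 1]

results = {}
for cutoff in (0.3, 0.5, 0.7):
    results[cutoff] = sensitivity_specificity(classify(probs, cutoff), observed)
    sens, spec = results[cutoff]
    print(f"cutoff {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Raising the cutoff trades sensitivity for specificity; each cutoff corresponds to one point on the ROC curve.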

. xi: logit smoker age height i.sex fev if training < .6
  i.sex    _Isex_1-2    (_Isex_1 for sex==female omitted)
  (output omitted: logistic regression, number of obs = 385, LR chi2(4); coefficient table for age, height, sex, fev, and _cons)
. predict pfit
  (option pr assumed; Pr(smoker))
. gen pfit50 = pfit
. recode pfit50 0/0.5=0 0.5/1=1
  (pfit50: 654 changes made)
. tabulate smoker pfit50 if training > .6, row col
  (output omitted: 2 x 2 table of smoker by pfit50 with row and column percentages)

Using a cutoff of 0.50
- Sensitivity: 8/33 = 24.24%
- Specificity: 226/236 = 95.76%

- PVP: 8/18 = 44.44%
- PVN: 226/251 = 90.04%

Thresholds other than 0.50 give different values for sensitivity, specificity, PVP, and PVN
- The ROC curve displays all possible thresholds
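The four accuracy measures for the 0.50 cutoff can be reproduced from the counts implied by the fractions above: 8 true positives out of 33 smokers, 226 true negatives out of 236 non-smokers, hence 25 false negatives and 10 false positives. The Python sketch below is illustrative only; the function name `accuracy_measures` is a hypothetical helper, not part of the Stata analysis.

```python
def accuracy_measures(tp, fn, fp, tn):
    """Sensitivity, specificity, and predictive values from 2 x 2 counts."""
    return {
        "sensitivity": tp / (tp + fn),  # Pr(T+ | D+)
        "specificity": tn / (tn + fp),  # Pr(T- | D-)
        "pvp": tp / (tp + fp),          # Pr(D+ | T+)
        "pvn": tn / (tn + fn),          # Pr(D- | T-)
    }

# Counts implied by the validation-sample fractions 8/33, 226/236, 8/18, 226/251.
measures = accuracy_measures(tp=8, fn=25, fp=10, tn=226)
for name, value in measures.items():
    print(f"{name}: {100 * value:.2f}%")
```

Note that the predictive values depend on the prevalence of smoking in the validation sample, whereas sensitivity and specificity condition on true smoking status.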

Analysis of ROC curves in R
- There are a number of add-on packages that will conduct ROC analysis in R
- ROCR is one popular package

Multivariable prognostic models

Reference
- Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine. 15(4), Feb 1996
- A Tutorial in Biostatistics (a series that frequently appears in Statistics in Medicine)

From abstract of paper
- "Multivariable regression models are powerful tools that are frequently used in clinical outcomes... however uncritical application of modeling techniques can result in models that poorly fit the dataset at hand, or, even more likely, inaccurately predict outcomes on new subjects.... predictive accuracy should be unbiasedly validated using bootstrapping or cross-validation before using predictions in new data series."

Common problem: Number of predictors being considered (p) is larger than the number of events (m)
- Recommendation: consider no more than m/10 potential predictors in a model
- E.g. if the event rate is 10% and n = 200 then m = 20; only 2 potential predictors

Step 1: Consider your dataset
- Missing data (imputation?)
- Interactions (scientific relevance?)
- Transformation of predictors (scientific relevance?)

Step 2: Data Reduction
- Reduce the number of predictors to a manageable amount
- Utilize methods that do not consider the outcome (unsupervised learning)
  - If you do not use the outcome at this stage, your statistical inference is preserved

  - Correlations among predictors: variable clustering, principal components analysis
  - Scientifically meaningful summary measures

Step 3: Fit the model and...
- Evaluate modeling assumptions: linearity, additivity, distributional assumptions
- Use backwards stepwise selection to find a simpler model
  - Harrell: remove predictor if χ² < 2 (p > 0.16)
- Will use all data to fit the model

Step 4: Evaluate model
- Bootstrap cross-validation
  - Sample data with replacement
  - For each bootstrap sample, the evaluation of modeling assumptions and backwards selection will be performed (these steps use the outcome)
  - Summarize predictive accuracy using a statistic (C-index, Brier score)
- Bootstrap cross-validation provides a nearly unbiased estimate of predictive accuracy (e.g. C-index, Brier score) while allowing the entire dataset to be used for model development
  - C-index: area under the ROC curve. An index near 1 indicates higher accuracy.
  - Brier score: average squared deviation between the predicted probability of the event and the observed outcome. A lower score represents higher accuracy.

Case Study using lrm() in R
- Logistic regression model
  - Main effects: sex, cholesterol (splines), age (polynomial), and blood pressure (linear)
  - Interactions: sex with cholesterol (5 predictors maximum)

> f <- lrm(y ~ sex*rcs(cholesterol)+pol(age,2)+blood.pressure, x=TRUE, y=TRUE)
> f
  Logistic Regression Model
  lrm(formula = y ~ sex * rcs(cholesterol) + pol(age, 2) + blood.pressure, x = TRUE, y = TRUE)
  (output omitted: Obs 1000; model likelihood ratio test on 12 d.f.; discrimination indexes R2, g, gr, gp, Brier; rank discrimination indexes C, Dxy, gamma, tau-a; coefficient table for Intercept, sex=male, the cholesterol spline terms, age, age^2, blood.pressure, and the sex=male * cholesterol interaction terms)
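The Brier score described above is simple enough to compute directly. The sketch below is a Python illustration rather than the R lrm() output; the function name `brier_score`, the predicted probabilities, and the outcomes are hypothetical.

```python
def brier_score(predicted_probabilities, outcomes):
    """Average squared deviation between predicted probability and 0/1 outcome."""
    pairs = list(zip(predicted_probabilities, outcomes))
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

# Hypothetical predictions from a model that discriminates well.
sharp = brier_score([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])

# Predicting 0.5 for everyone carries no information about the outcome.
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

print(f"sharp model: {sharp:.4f}, uninformative model: {uninformative:.4f}")
```

A constant prediction of 0.5 always scores 0.25, so scores well below that indicate predictions that carry information about the outcome.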

Validation...
- Backwards stepwise regression
  - Different bootstrap samples will remove different covariates
  - Indicates which covariates were included for each sample
- Estimates of discrimination indexes

> validate(f, B=150, bw=TRUE, rule="p", sls=.1, type="individual")
  Backwards Step-down - Original Model
  (output omitted: deletion table with Chi-Sq, d.f., P, Residual, d.f., P, and AIC for the deleted factors blood.pressure and cholesterol)
  Approximate Estimates after Deleting Factors
  (output omitted: Coef, S.E., Wald Z, and P for Intercept, sex=male, age, age^2, and the sex=male * cholesterol terms)
  Factors in Final Model
  [1] sex age sex * cholesterol
  (output omitted: index.orig, training, test, optimism, index.corrected, and n for Dxy, R2, Intercept, Slope, Emax, D, U, Q, B, g, gp)
  Factors Retained in Backwards Elimination
  sex cholesterol age blood.pressure sex * cholesterol

  * * * * *
  (output omitted; continues for each bootstrap sample)
  Frequencies of Numbers of Factors Retained
  (output omitted)
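The index.orig, training, test, optimism, and index.corrected columns of the validation output follow a simple arithmetic pattern: the optimism is the average amount by which an index computed on each bootstrap sample exceeds the same index computed when that bootstrap model is applied to the original data, and the corrected index subtracts that optimism from the apparent value. The Python sketch below illustrates only this arithmetic; the function name `optimism_corrected` and all numbers are hypothetical, and it does not refit any models.

```python
def optimism_corrected(index_orig, training, test):
    """Bootstrap optimism correction for a predictive accuracy index.

    training[b]: index of the model fit to bootstrap sample b, evaluated on sample b
    test[b]:     index of that same model evaluated on the original data
    """
    optimism = sum(t - s for t, s in zip(training, test)) / len(training)
    return optimism, index_orig - optimism

# Hypothetical values of a discrimination index over three bootstrap samples.
index_orig = 0.80
training = [0.82, 0.81, 0.83]
test = [0.78, 0.79, 0.77]

optimism, index_corrected = optimism_corrected(index_orig, training, test)
print(f"optimism = {optimism:.3f}, corrected index = {index_corrected:.3f}")
```

In practice many more bootstrap samples would be used (the call above requests 150), and the model selection steps are repeated inside each sample so that the optimism reflects the whole model building process.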

Homework Solutions Applied Logistic Regression

Homework Solutions Applied Logistic Regression Homework Solutions Applied Logistic Regression WEEK 6 Exercise 1 From the ICU data, use as the outcome variable vital status (STA) and CPR prior to ICU admission (CPR) as a covariate. (a) Demonstrate that

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 14/11/2017 This Week Categorical Variables Categorical

More information

Binary Dependent Variables
