BIOS 312: MODERN REGRESSION ANALYSIS


James C (Chris) Slaughter
Department of Biostatistics, Vanderbilt University School of Medicine
james.c.slaughter@vanderbilt.edu
biostat.mc.vanderbilt.edu/coursebios312
Copyright JC Slaughter. All Rights Reserved.
Updated March 26, 2012

Contents

14 Regression Based Prediction
  14.1 Overview
    14.1.1 Uses of Regression
    14.1.2 Examples of Prediction
    14.1.3 Assumptions needed for inference
  14.2 Estimation (prediction) of summary measures
    14.2.1 Point estimates
    14.2.2 Interval estimates
  14.3 Prediction (forecasting) of individual observations
    14.3.1 Assumptions
    14.3.2 Point and interval estimates for Normal data
    14.3.3 Example: FEV by age and height
    14.3.4 Mathematics of Prediction Intervals
  14.4 Prediction of Binary Measurements
    14.4.1 Classification (discrimination)
    14.4.2 Stepwise model building
    14.4.3 ROC Curves
    14.4.4 Example 1: Age or height to predict smoking
    14.4.5 Example 2: Multiple predictors of smoking
    14.4.6 Analysis of ROC curves in R
    14.4.7 Multivariable prognostic models

Chapter 14
Regression Based Prediction

14.1 Overview

14.1.1 Uses of Regression

- Cluster analysis
  - Focuses on identifying observations (covariates) with similar characteristics
  - Interest lies in finding clusters that are more (or less) likely to have some outcome of interest
  - Divide a population into subgroups based on patterns of similar measurement
  - Clustering can be done by considering one variable (univariable) or multiple variables (multivariable)
  - Number of clusters can be known or unknown
  - Examples
    - Microarrays

    - fMRI
  - Small n, large p (few subjects, many predictors)
  - Statistical techniques
    - Stepwise regression
    - Regression trees
    - Other data-driven model building approaches
  - Getting valid inference (i.e. reliable p-values, unbiased parameter estimates) is particularly challenging
  - Area of active research in Biostatistics
- Quantifying and comparing distributions
  - Compare the distribution of the outcome variable across levels of the grouping variable or variables
    - Compare salary in males and females, across professors of the same rank, degree, experience, and field
    - Compare FEV in smokers and non-smokers, across teenagers of the same age and height
  - Need to decide which summary measure is appropriate for describing the distribution
    - Common summary measures: means, medians, geometric means, odds, rates, hazards
    - Other measures: variance, skewness, kurtosis, likelihood of extreme values, quantiles, etc.

  - May desire estimates within specific subgroups
    - Estimates of the association within gender, race, or age groups (this is effect modification)
- Regression based prediction
  - Prediction of summary measures
    - Point prediction: best single estimate for the measurement that would be obtained, on average, for many individuals with given covariates
    - Interval prediction: quantify the uncertainty of the average; the range of the summary measure that might be reasonable to observe
  - Prediction of individual measurements
    - Point prediction: best single estimate for the measurement that would be obtained on a single individual with given covariates
    - Interval prediction: quantify the uncertainty of the individual prediction; the range of measurements that might reasonably be observed in a future individual with given covariates

14.1.2 Examples of Prediction

- Continuous prediction: Creatinine clearance
  - Creatinine is a continuously produced breakdown product in muscles

  - Removed by the kidney through filtration
  - Amount of creatinine cleared by the kidneys in 24 hours is a measure of renal function
  - Gold standard is to collect urine output for 24 hours and measure creatinine
  - Would prefer to find a combination of blood and urine measures that can be obtained at one time point, yet still provides accurate prediction of a patient's creatinine clearance
  - Statistical approach
    - Collect a training dataset to build a regression model for prediction
      - Measure true creatinine clearance
      - Measure age, gender, weight, height, blood markers, urinary markers, etc.
      - Fit a (data driven?) regression model to predict true creatinine clearance
    - Collect a validation dataset
      - Use the regression estimates (the βs) from the training dataset to see how well your model predicts creatinine clearance
      - Quantify the accuracy of the predictive model (e.g. mean squared error)
    - Cross validation
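The training/validation idea above can be sketched in a few lines. This is only an illustration under stated assumptions: a single hypothetical predictor, made-up data values, and hand-written helper functions (`fit_simple_regression`, `mean_squared_error`) that are not part of the course materials. The point is that the βs are estimated once on the training data and then held fixed when accuracy is measured on the validation data.

```python
def fit_simple_regression(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between observed and predicted outcomes."""
    errors = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return sum(errors) / len(errors)


# Hypothetical data: x is one marker, y is the measured outcome.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
valid_x = [1.5, 2.5, 3.5]
valid_y = [3.0, 5.2, 6.9]

# Estimate the betas on the training data only.
b0, b1 = fit_simple_regression(train_x, train_y)

# Apply the fixed training estimates to both datasets.
train_mse = mean_squared_error(train_x, train_y, b0, b1)
valid_mse = mean_squared_error(valid_x, valid_y, b0, b1)
print(f"training MSE = {train_mse:.3f}, validation MSE = {valid_mse:.3f}")
```

The validation MSE is the more honest measure of predictive accuracy, because the validation observations played no part in choosing the coefficients.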

- Discrimination (binary prediction): Low birth weight
  - Want to predict which infants are more likely to be less than 2500 grams at birth
  - Possible predictors: age, race, blood biomarkers collected during pregnancy
  - Statistical approach
    - Collect a training dataset to build a regression model for prediction
      - Measure birth weight
      - Measure age, race, blood biomarkers
      - Fit a (data driven?) regression model to predict low birth weight
    - Collect a validation dataset
      - Use the regression estimates (the βs) from the training dataset to see how well your model predicts low birth weight
      - Quantify the accuracy of the predictive model
        - Sensitivity, specificity (ROC curves)
        - Predictive value positive, predictive value negative
- Interval prediction: Normal ranges of PSA
  - Identify the range of PSA values that would be expected in 95% of healthy adult males
  - Possibly stratify by age, race
  - Statistical approach

    - Collect a training dataset to build a regression model for prediction
      - Measure PSA and the variables used to predict PSA
      - Need to estimate the quantiles
        - Mean plus/minus 2 standard deviations (makes strong Normality assumption)
        - Estimate the quantiles, provide CIs around the quantiles (likely low precision)
    - Collect a validation dataset
      - Quantify the accuracy of the predictive model (e.g. coverage probabilities)

14.1.3 Assumptions needed for inference

- Inference for associations
  - Necessary assumptions for classical regression (no robust standard errors)
    - Independence of response measurements within identified clusters
    - Have appropriately modeled the within group variance
      - Linear regression: equal variance across groups
      - Other regressions: appropriate mean-variance relationship
      - Lack of model fit may lead to a poor estimate of the variance
    - Sample size is large enough so that parameter estimates approximately follow a Normal distribution
  - Necessary assumptions for first order trends using robust standard errors
    - Independence of response measurements within identified clusters

    - (Robust standard errors account for heteroscedasticity in large samples)
    - Lack of model fit leads to conservative inference due to mixing systematic and random error
    - Sample size is large enough so that parameter estimates approximately follow a Normal distribution
- Inference for predictions
  - Additional assumptions for predictions of means
    - Our regression model has accurately described the relationship between summary measures across groups
    - For continuous covariates, this often involves flexible models that allow for departures from linearity
  - Additional assumptions for predictions of individual observations
    - Also need to know the shape of the distribution within each group
    - Methods implemented in software often rely on strong assumptions like Normality

14.2 Estimation (prediction) of summary measures

- Given age, height, and sex, estimate the mean (or geometric mean) FEV
  - Use linear regression to obtain estimates and CI
- Given age and PSA, estimate the probability (or odds) of remaining in remission for 24 months

  - Use logistic regression to obtain estimates and CI
- Assumptions
  - Independence (between clusters for robust SE)
  - Variance approximated by the model (relaxed for robust SE)
  - Regression model accurately describes the relationship of summary measures across groups
  - Sufficient sample sizes for asymptotic distributions of parameters to be a good approximation

14.2.1 Point estimates

- Point estimates are obtained by substituting predictor values into the estimated regression equation
  - E[Chol | Age] = β̂0 + β̂1 × Age
    - Expected cholesterol for a 50 year old is β̂0 + β̂1 × 50
  - logodds(Survival | Age) = β̂0 + β̂1 × Age
    - Expected log-odds of survival for a 50 year old is −0.53
    - Expected odds = e^(−0.53) = 0.59
    - Expected probability = 0.59 / (1 + 0.59) = 0.37

14.2.2 Interval estimates

- If assumptions hold, interval estimates can be obtained

- We generally find a confidence interval for the transformed quantity, and then back transform to the desired quantity
  - For logistic regression, calculate the CI for the log odds, then transform to odds or probability
- Statistical criteria for determining the best estimate
  - Consistent: correct estimate (with infinite sample size)
  - Precise: minimum variance among (unbiased) estimators
- Common regression methods provide the best estimate in a wide variety of settings
- Necessary assumptions
  - Independence
  - Variance approximated by the model (relaxed for robust SE)
  - Regression model accurately describes the relationship among summary measures across groups
  - Sufficient sample size so that asymptotic Normality of the parameter estimates holds
- Interval estimates
  - Substituting the predictor values into the fitted equation gives an estimate of the model transformation of the summary measure
  - The model transformation of the summary measure varies by regression setting
    - Linear regression: mean

    - Linear regression on log transformed outcome: log geometric mean
    - Logistic regression: log odds
    - Poisson regression: log rate
- Formulas for the confidence interval
  - In general: (estimate) ± (crit value) × (std error)
  - In linear regression, the t distribution is usually used to obtain the confidence interval
    - Stata: (crit value) = invttail(df, α/2)
    - R: (crit value) = qt(1 − α/2, df)
    - Degrees of freedom: df = n − (number of estimated regression parameters, counting the intercept); e.g. n = 654 with two predictors plus an intercept gives df = 651
  - In other regressions, we use the standard Normal distribution
    - Stata: (crit value) = invnorm(1 − α/2) = 1.96 for α = 0.05
    - R: (crit value) = qnorm(1 − α/2) = 1.96 for α = 0.05
- Interval estimates in Stata
  - After any regression command, the Stata command predict will compute estimates and standard errors
  - predict varname, [what]
    - varname is the name of the new variable to be created
    - what is one of
      - xb: the linear predictor (works for all regressions)

      - stdp: standard error of the linear prediction
      - p: for logistic regression, the predicted probability
- Interval estimates in R
  - After storing a model, the R function predict will compute estimates and standard errors
  - predict(object, se.fit=FALSE, ...)
    - object is the stored fitted model
    - se.fit=TRUE will provide the standard error of the fit (defaults to FALSE)
    - See help(predict) for more details and options
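As a small illustration of "find the interval on the transformed scale, then back transform", the sketch below works on the log-odds scale of a logistic regression. It uses a log-odds of −0.53 (the value whose exponential gives the odds of 0.59 quoted in the point-estimate example); the standard error of 0.20 and the function names are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def logodds_interval(logodds, se, alpha=0.05):
    """Normal-theory confidence interval on the log-odds (linear predictor) scale."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return logodds - crit * se, logodds + crit * se


def to_odds(logodds):
    """Back transform log-odds to odds."""
    return math.exp(logodds)


def to_probability(logodds):
    """Back transform log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-logodds))


logodds = -0.53  # point estimate on the log-odds scale
se = 0.20        # hypothetical standard error of the linear prediction

lo, hi = logodds_interval(logodds, se)

print(f"odds        = {to_odds(logodds):.2f} ({to_odds(lo):.2f}, {to_odds(hi):.2f})")
print(f"probability = {to_probability(logodds):.2f} "
      f"({to_probability(lo):.2f}, {to_probability(hi):.2f})")
```

Because both back transformations are monotone increasing, the endpoints of the log-odds interval map directly to the endpoints of the odds and probability intervals.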

Example in Stata: geometric mean FEV by age and height

  . gen logfev = log(fev)
  . regress logfev height age
  [Regression output: Number of obs, F(2, 651), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for height, age, _cons]
  . predict fitlogfev
  (option xb assumed; fitted values)
  . predict sefit, stdp
  . gen gmfev = exp(fitlogfev)
  . gen gmlofev = exp(fitlogfev - invttail(651,.025) * sefit)
  . gen gmhifev = exp(fitlogfev + invttail(651,.025) * sefit)
  . list gmfev gmlofev gmhifev if age==10 & height==
  [Listing of gmfev, gmlofev, gmhifev]

Example in Stata: odds of smoking by age and height

  . logit smoker height age
  [Logistic regression output: Number of obs = 654, LR chi2(2), Prob > chi2, Log likelihood, Pseudo R2; coefficient table (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) for height, age, _cons]
  . predict logodds, xb
  . predict selogodds, stdp
  . gen odds = exp(logodds)
  . gen oddslo = exp(logodds - 1.96 * selogodds)
  . gen oddshi = exp(logodds + 1.96 * selogodds)
  . list odds oddslo oddshi if age==10 & height==
  [Listing of odds, oddslo, oddshi]

14.3 Prediction (forecasting) of individual observations

- Methods are for continuous outcomes
  - Will deal with discrimination of binary outcomes in the next section
- Given age, height, and sex, predict a new subject's FEV
  - Use linear regression to obtain estimates and CI

14.3.1 Assumptions

- Necessary assumptions to predict individual observations

  - Independence (between clusters for robust SE)
  - Variance approximated by the model (NOT relaxed for robust SE)
  - Regression model accurately describes the relationship of summary measures across groups
  - Shape of the distribution is the same in each group
    - Transformation of the outcome may be very useful
  - Sufficient sample sizes for asymptotic distributions of parameters to be a good approximation
- Assumptions are very strong
  - Consequently, we do not have many methods that provide robust inference
  - In general, I prefer methods that make as few assumptions as possible
  - Robust standard errors will not help in this situation
  - Proper transformation of outcomes and predictors may be necessary so that the underlying assumptions of the classical linear regression model hold
  - Models and estimates will need to be appropriately penalized for valid inference
    - Topic of more advanced regression courses (Regression Modeling Strategies)
- Precise methods have been developed for
  - Binary variables (the mean specifies the variance)
  - Continuous data that follow a Normal distribution

14.3.2 Point and interval estimates for Normal data

- Point estimates are obtained by substitution into the estimated regression equation
  - E[Chol | Age] = β̂0 + β̂1 × Age
    - Expected cholesterol for a 50 year old is β̂0 + β̂1 × 50
  - When we substitute in age values, it provides an estimate of the forecast cholesterol
- Interval estimates
  - Under appropriate assumptions, we can obtain standard errors for such predictions
  - Standard errors must account for two sources of variability
    - Variability in estimating the regression parameters (same as in predictions of summary measures)
    - Variability due to subject
      - Additional variability about the sample mean; the within group standard deviation
  - Estimating these sources of variability is where the additional Normality assumption is key
- Formulas for the prediction interval
  - In general: (prediction) ± (crit value) × (std error)
  - In linear regression, the t distribution is usually used to obtain the prediction interval
    - Stata: (crit value) = invttail(df, α/2)
    - R: (crit value) = qt(1 − α/2, df)

    - Degrees of freedom: df = n − (number of estimated regression parameters, counting the intercept)
- Interval estimates in Stata
  - After any regression command, the Stata command predict will compute estimates and standard errors
  - predict varname, [what]
    - varname is the name of the new variable to be created
    - what is one of
      - xb: the linear predictor (works for all regressions)
      - stdf: standard error of the forecast prediction
- Interval estimates in R
  - After storing a model, the R function predict will compute estimates and standard errors
  - predict(object, se.fit=FALSE, ...)
    - object is the stored fitted model
    - se.fit=TRUE will provide the standard error of the fit (defaults to FALSE)
    - For the standard error of the forecast, need to include the root mean squared error: $se_{pred}^2 = se_{fit}^2 + RMSE^2$
- General comment about software
  - Commercial software only implements prediction intervals by assuming Normal data

  - If using R libraries (or Stata ado files) for prediction, carefully investigate and understand how they are making predictions
  - Don't trust the black box to be giving you results that are widely applicable to all situations
  - Be careful that you understand exactly what is going on if you are interested in making predictions

14.3.3 Example: FEV by age and height

  . regress logfev height age
  [Regression output: Number of obs, F(2, 651), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for height, age, _cons]
  . predict fitlogfev
  (option xb assumed; fitted values)
  . predict sepredict, stdf
  . gen predfev = exp(fitlogfev)
  . gen predlofev = exp(fitlogfev - invttail(651,.025) * sepredict)
  . gen predhifev = exp(fitlogfev + invttail(651,.025) * sepredict)
  . list predfev predlofev predhifev if age==10 & height==
  [Listing of predfev, predlofev, predhifev]

- The preceding output gave predictions for individual observations
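To make the difference between `stdp` and `stdf` concrete, the following sketch applies the relationship se_pred² = se_fit² + RMSE² and back transforms both intervals from the log scale. All numeric inputs are hypothetical, and the critical value 1.96 is used as a large-sample stand-in for the t quantile with 651 degrees of freedom; the function names are illustrative only.

```python
import math


def forecast_se(se_fit, rmse):
    """Standard error for predicting one new observation:
    se_pred^2 = se_fit^2 + RMSE^2."""
    return math.sqrt(se_fit ** 2 + rmse ** 2)


def back_transformed_interval(fit_log, se, crit):
    """Interval computed on the log scale, then exponentiated."""
    return math.exp(fit_log - crit * se), math.exp(fit_log + crit * se)


# Hypothetical inputs on the log(FEV) scale.
fit_log = 1.00  # fitted log FEV for the chosen age and height
se_fit = 0.01   # standard error of the fitted mean (Stata: stdp)
rmse = 0.15     # root mean squared error of the regression
crit = 1.96     # large-sample approximation to invttail(651, .025)

se_pred = forecast_se(se_fit, rmse)

# Interval for the geometric mean versus interval for a single new subject.
ci_mean = back_transformed_interval(fit_log, se_fit, crit)
pred_int = back_transformed_interval(fit_log, se_pred, crit)

print(f"point estimate      = {math.exp(fit_log):.3f}")
print(f"CI for the mean     = ({ci_mean[0]:.3f}, {ci_mean[1]:.3f})")
print(f"prediction interval = ({pred_int[0]:.3f}, {pred_int[1]:.3f})")
```

Both intervals share the same center, but the prediction interval cannot shrink below roughly ±crit × RMSE on the log scale no matter how large the sample, because the subject-level variability does not depend on n.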

- We can compare these results to the prediction for summary measures done previously
  [Side-by-side listing of gmfev, gmlofev, gmhifev and predfev, predlofev, predhifev]
  - Point estimates are identical
  - The interval for individual predictions is wider
  - As n increases, the prediction interval (on the log scale) approaches ±1.96 × RMSE
  - As n increases, the width of the interval around the summary measure approaches 0
- We can also calculate the standard error of the prediction by hand
  - Example: Age is 10 and Height is 66

  . gen sepredict2 = sepredict^2
  . gen sefit2 = sefit^2
  . list sepredict sefit sepredict2 sefit2 if age==10 & height==66
  [Listing of sepredict, sefit, sepredict2, sefit2]
  . di .14659^

  - $se_{pred}^2 = se_{fit}^2 + RMSE^2$
  [Values of se_pred, se_fit, and RMSE]

  - Note that the discrepancy between the hand calculation and the Stata output is due to rounding error
  - This is an academic exercise in Stata, but necessary in R (unless you can find a function to do it for you)

14.3.4 Mathematics of Prediction Intervals

- Basic ideas behind prediction intervals
  - Model: $Y_i \mid X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$
  - Alternative specification: $Y_i \mid X_i = \beta_0 + \beta_1 X_i + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$
  - Estimated mean: $\hat\beta_0 + \hat\beta_1 X_i \sim N\left(\beta_0 + \beta_1 X_i, \sigma^2 V\right)$
  - Predicted mean: $\hat\beta_0 + \hat\beta_1 X_i + \epsilon_i \sim N\left(\beta_0 + \beta_1 X_i, \sigma^2 (1 + V)\right)$
  - $V = \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
  - $X_h$ is the chosen value of the covariate (e.g. age==10)
  - Note: As $n \to \infty$, $V \to 0$

14.4 Prediction of Binary Measurements

14.4.1 Classification (discrimination)

- Sometimes the scientific question is one of deriving a rule to classify subjects

  - Diagnosis of prostate cancer: Based on age, race, and PSA, should we make a diagnosis of prostate cancer?
  - Prognosis of patients with primary biliary cirrhosis: Based on age, bilirubin, albumin, edema, and protime, is the patient likely to die within the next year?
- Classification can be regarded as trying to predict the value of a binary variable
  - Earlier, we were estimating the probability and odds of relapse within a particular group (a summary measure)
  - Now we want to decide whether a particular individual will relapse or not (an individual measure)
- There is an obvious connection between the above two ideas
  - The probability or odds tells us everything about the distribution of values
  - The only possible values are 0 or 1
- Typical approach
  - First, use a regression model to estimate the probability of the event in each group
  - Second, form a decision rule based on the estimated probability of the event
    - If estimate ≥ c (or ≤ c), predict outcome is 1
    - If estimate < c (or > c), predict outcome is 0
  - Quantify the accuracy of the decision rule

    - For disease D and test T
      - Sensitivity: Pr(T+ | D+)
      - Specificity: Pr(T− | D−)
      - Predictive value positive: Pr(D+ | T+)
      - Predictive value negative: Pr(D− | T−)

14.4.2 Stepwise model building

- Old method for considering a large number of covariates that might possibly be predictive of an outcome
  - If stepwise model building had been proposed today, it would not have passed peer review
  - Available in all software as an automated tool, but should not be used
  - Avoid recreating the stepwise approach when building your own models manually
- Major caveats
  - Overfits your dataset
  - P-values are not true p-values; they are very anti-conservative
  - You will often obtain different models if you use forward or backward stepwise regression
  - Ignores confounding effects (a variable could be an important confounder, but itself not be statistically significant in the model)

  - Unless certain predictors are grouped together, stepwise may throw out covariates and produce a nonsensical model
    - Suppose race/ethnicity is coded as White, Black, Hispanic, Other
    - Want to model as 3 dummy variables
    - Stepwise may throw out one of your dummy variables and keep the rest
    - Pairwise significance would also be highly dependent on the reference group chosen by the analyst
- Two flavors of stepwise model building
  - Start with no covariates: forward stepwise regression
  - Start with all covariates: backward stepwise regression
- The stepwise procedure proceeds by adding or removing covariates from the model based on the corresponding partial t or Z test
  - Must pre-specify a P to enter and a P to remove
  - To avoid infinite loops, P to enter must be less than P to remove
    - E.g. add a variable to the model if p < 0.05, but only remove a variable from the model if p > 0.10
  - Repeat the process until you arrive at a final model

14.4.3 ROC Curves

- Receiver operating characteristic (ROC) curves plot the sensitivity against the false positive rate (i.e. 1 − specificity)

  - Ideal tests will have both high sensitivity and high specificity (low false positives)
  - Graphically depicted as points in the upper left portion of the plot
  - The 1:1 line represents a test that is no better than the flip of a coin
- A test usually involves the classification of some binary outcome by a continuous predictor (or set of continuous predictors)
  - When the continuous predictor is above some point c, the test is said to be positive
  - ROC curves plot the sensitivity and false positive rate for all possible values of c
- Major drawbacks to ROC analysis
  - Tempts the analyst to find a cutoff point (c) when none really exists
  - Better to treat your covariate or linear prediction as a continuous variable

14.4.4 Example 1: Age or height to predict smoking

- Question: Is age or height a better predictor of smoking status?
- Will compare the ROC curves
  - Often done using the area under the curve (AUC)
  - The model with the higher AUC is better
  - A null model (not predictive of outcome) will have an area of 0.50
  - Maximum AUC is 1.0

- Stata
  - Fit the logistic regression model
  - lroc creates the ROC curve
  - predict to save the fitted curves
  - roccomp to compare the curves
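To show what an ROC curve and its AUC summarize, here is a minimal sketch that builds both directly from scores and outcomes. It is not how `lroc` or `roccomp` are implemented; the data are hypothetical, the function names are illustrative, and the AUC is computed with the standard pairwise interpretation (the proportion of case/non-case pairs in which the case has the higher score, counting ties as one half).

```python
def roc_points(scores, outcomes):
    """(false positive rate, sensitivity) pairs for every cutoff c,
    calling the test positive when score >= c."""
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    points = [(0.0, 0.0)]
    for c in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= c and y == 1)
        fp = sum(1 for s, y in zip(scores, outcomes) if s >= c and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical linear predictions (or probabilities) and observed outcomes.
scores = [0.10, 0.40, 0.35, 0.80]
outcomes = [0, 0, 1, 1]

print(roc_points(scores, outcomes))
print(f"AUC = {auc(scores, outcomes):.2f}")
```

A score unrelated to the outcome gives an AUC near 0.50, and a score that perfectly separates cases from non-cases gives 1.0, matching the bounds stated above.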

  . logit smoker height
  [Logistic regression output: Number of obs = 654, LR chi2(1), Prob > chi2, Log likelihood, Pseudo R2; coefficient table for height, _cons]
  . lroc, nograph
  [Logistic model for smoker: number of observations = 654, area under ROC curve]
  . predict xb1, xb

  . logit smoker age
  [Logistic regression output: Number of obs = 654, LR chi2(1), Prob > chi2, Log likelihood, Pseudo R2; coefficient table for age, _cons]
  . lroc, nograph
  [Logistic model for smoker: number of observations = 654, area under ROC curve]
  . predict xb2, xb

  . roccomp smoker xb1 xb2, graph summary
  [Output: Obs, ROC Area, Std. Err., and asymptotic Normal 95% Conf. Interval for xb1 and xb2; test of Ho: area(xb1) = area(xb2), chi2(1), Prob>chi2]

- Age has a significantly larger AUC than height (p < 0.001)
  - For any given false positive rate, the sensitivity is higher for age
  - Predictive value positive would also always be higher

14.4.5 Example 2: Multiple predictors of smoking

- Will consider age, height, sex, and FEV as predictors of smoking
- Fit a multivariable logistic regression model using these covariates in a training sample
  - The training sample will contain 60% of the observations in the current study
  - Develop a model on the training sample, then see how well it fits in the validation sample (the remaining 40% of the data)

- Consider a rule that predicts a subject will smoke if their predicted probability is greater than 0.5
  - Could choose cutoffs other than 0.5
  - Other cutoffs will be represented on the ROC curves
  - Sensitivity and specificity will vary by the cutoff chosen

  . xi: logit smoker age height i.sex fev if training <.6
  i.sex _Isex_1-2 (_Isex_1 for sex==female omitted)
  [Logistic regression output: Number of obs = 385, LR chi2(4), Prob > chi2, Log likelihood, Pseudo R2; coefficient table for age, height, _Isex_2, fev, _cons]
  . predict pfit
  (option pr assumed; Pr(smoker))
  . gen pfit50=pfit
  . recode pfit50 0/0.5=0 0.5/1=1
  (pfit50: 654 changes made)
  . tabulate smoker pfit50 if training >.6, row col
  [Two-way table of smoker (rows) by pfit50 (columns) with row and column percentages]

- Using a cutoff of 0.50
  - Sensitivity: 8/33 = 24.24
  - Specificity: 226/236 = 95.76

  - PVP: 8/18 = 44.44
  - PVN: 226/251 = 90.04
- Thresholds other than 0.50 give different values for sensitivity, specificity, PVP, PVN
  - The ROC curve displays all possible thresholds
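The four accuracy measures for the 0.50 cutoff follow directly from the cells of the validation table. The sketch below recomputes them from the counts implied by the fractions reported in the notes (8/33, 226/236, 8/18, 226/251): 8 true positives, 25 false negatives, 10 false positives, and 226 true negatives. The function name is illustrative only.

```python
def classification_summary(tp, fn, fp, tn):
    """Accuracy measures for a binary decision rule, as proportions."""
    return {
        "sensitivity": tp / (tp + fn),  # Pr(T+ | D+)
        "specificity": tn / (tn + fp),  # Pr(T- | D-)
        "pvp": tp / (tp + fp),          # Pr(D+ | T+)
        "pvn": tn / (tn + fn),          # Pr(D- | T-)
    }


# Counts implied by the fractions in the validation sample.
summary = classification_summary(tp=8, fn=25, fp=10, tn=226)

for name, value in summary.items():
    print(f"{name}: {100 * value:.2f}%")
```

Sensitivity and specificity condition on the true smoking status, while PVP and PVN condition on the prediction, so the latter two also depend on how common smoking is in the validation sample.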


14.4.6 Analysis of ROC curves in R

- There are a number of add-on packages that will conduct ROC analysis in R
  - ROCR is one popular package

14.4.7 Multivariable prognostic models

- Reference
  - Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 15(4), Feb 1996.
  - A Tutorial in Biostatistics (these frequently appear in Statistics in Medicine)

- From the abstract of the paper
  - "Multivariable regression models are powerful tools that are frequently used in clinical outcomes ... however uncritical application of modeling techniques can result in models that poorly fit the dataset at hand, or, even more likely, inaccurately predict outcomes on new subjects."
  - "... predictive accuracy should be unbiasedly validated using bootstrapping or cross-validation before using predictions in new data series."
- Common problem: the number of predictors being considered (p) is larger than the number of events (m)
  - Recommendation: consider no more than m/10 potential predictors in a model
  - E.g. if the event rate is 10% and n = 200, then m = 20; only 2 potential predictors
- Step 1: Consider your dataset
  - Missing data (imputation?)
  - Interactions (scientific relevance?)
  - Transformation of predictors (scientific relevance?)
- Step 2: Data reduction
  - Reduce the number of predictors to a manageable amount
  - Utilize methods that do not consider the outcome ("unsupervised" learning)
    - If you do not use the outcome at this stage, your statistical inference is preserved

    - Correlations among predictors: variable clustering, principal components analysis
    - Scientifically meaningful summary measures
- Step 3: Fit the model and...
  - Evaluate modeling assumptions: linearity, additivity, distributional assumptions
  - Use backwards stepwise selection to find a simpler model
    - Harrell: remove a predictor if χ² < 2 (p > 0.16)
  - Will use all data to fit the model
- Step 4: Evaluate the model
  - Bootstrap cross-validation
    - Sample data with replacement
    - For each bootstrap sample, the evaluation of modeling assumptions and backwards selection will be performed (these steps use the outcome)
    - Summarize predictive accuracy using a statistic (C-index, Brier score)
  - Bootstrap cross-validation provides a nearly unbiased estimate of predictive accuracy (e.g. C-index, Brier score) while allowing the entire dataset to be used for model development
    - C-index: area under the ROC curve. An index near 1 indicates higher accuracy.
    - Brier score: average squared deviation between the predicted probability of the event and the observed outcome. A lower score represents higher accuracy.

Case study using lrm() in R

- Logistic regression model
  - Main effects: sex, cholesterol (splines), age (polynomial), and blood pressure (linear)
  - Interactions: sex with cholesterol (5 predictors maximum)

  > f <- lrm(y ~ sex*rcs(cholesterol) + pol(age,2) + blood.pressure, x=TRUE, y=TRUE)
  > f
  Logistic Regression Model
  lrm(formula = y ~ sex * rcs(cholesterol) + pol(age, 2) + blood.pressure, x = TRUE, y = TRUE)
  [Model summary: Obs 1000, d.f. 12; Model Likelihood Ratio Test (LR chi2, Pr(> chi2)); Discrimination Indexes (R2, g, gr, gp, Brier); Rank Discrim. Indexes (C, Dxy, gamma, tau-a); max |deriv| 2e-04]
  [Coefficient table (Coef, S.E., Wald Z, Pr(>|Z|)) for Intercept, sex=male, the cholesterol spline terms, age, age^2, blood.pressure, and the sex=male * cholesterol interaction terms]

- Validation...
  - Backwards stepwise regression
    - Different bootstrap samples will remove different covariates
    - Indicates which covariates were included for each sample
  - Estimates of discrimination indexes

  > validate(f, B=150, bw=TRUE, rule="p", sls=.1, type="individual")
  Backwards Step-down - Original Model
  [Deleted factors (Chi-Sq, d.f., P, Residual, d.f., P, AIC): blood.pressure, cholesterol]
  Approximate Estimates after Deleting Factors
  [Coefficient table (Coef, S.E., Wald Z, P) for Intercept, sex=male, age, age^2, and the sex=male * cholesterol interaction terms]
  Factors in Final Model
  [1] sex age sex * cholesterol
  [Validation table with columns index.orig, training, test, optimism, index.corrected, n and rows Dxy, R2, Intercept, Slope, Emax, D, U, Q, B, g, gp]
  Factors Retained in Backwards Elimination
  sex cholesterol age blood.pressure sex * cholesterol

  [One row of * markers per bootstrap sample, showing which factors were retained; output omitted, continues for each bootstrap sample]
  Frequencies of Numbers of Factors Retained
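The Brier score reported by `lrm()` and `validate()` (row B) has a simple definition that can be checked by hand. The following sketch computes it for a handful of hypothetical predicted probabilities and outcomes; it illustrates the definition given earlier and is not the rms implementation.

```python
def brier_score(predicted, observed):
    """Average squared deviation between the predicted probability of the
    event and the observed 0/1 outcome. Lower values mean higher accuracy."""
    squared = [(p - y) ** 2 for p, y in zip(predicted, observed)]
    return sum(squared) / len(squared)


# Hypothetical predicted probabilities and observed outcomes.
predicted = [0.9, 0.2, 0.7, 0.1]
observed = [1, 0, 0, 0]

score = brier_score(predicted, observed)
print(f"Brier score = {score:.4f}")
```

As reference points, perfect predictions give a score of 0, and predicting 0.5 for everyone gives 0.25 regardless of the outcomes.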
