Topic 13 - Model Selection

Outline: variable selection (R², C_p, adjusted R², PRESS); predicting survival (page 350); survival time as a response.


Topic 13 - Model Selection - Fall 2013

Outline
- Variable selection: R², C_p, adjusted R², PRESS
- Automatic search procedures

Predicting Survival (page 350)
- A surgical unit wants to predict survival in patients undergoing a particular liver operation
- A random sample of 108 patients is available; only 54 are used here
- Response Y is survival time (days)
- Eight predictor variables: X1 blood clotting score, X2 prognostic index, X3 enzyme function score, X4 liver function score, X5 age, X6 gender, X7 and X8 history of alcohol use

Survival Time as a Response
- The conditional distribution is often highly skewed to the right
- Times can be censored if the study is stopped before all deaths occur
- Survival analysis techniques should be used when censoring is present
- In this case all survival times are observed, so we investigate a transformation using the Box-Cox procedure
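The Box-Cox step can be sketched outside SAS by maximizing the profile log-likelihood over the transformation parameter λ. This is a minimal Python sketch, not the notes' method verbatim: the right-skewed times below are hypothetical stand-ins for the surgical data (which live in Ch09ta01.txt and are not reproduced here), and the intercept-only normal model stands in for the full regression.

```python
import numpy as np

# Hypothetical right-skewed "survival times": stand-ins for the 54 observed
# times (the real data live in Ch09ta01.txt and are not reproduced here).
rng = np.random.default_rng(0)
y = rng.lognormal(mean=6.4, sigma=0.5, size=54)
n = len(y)

def boxcox_loglik(lam):
    """Profile log-likelihood of the Box-Cox parameter for a normal model."""
    z = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1) / lam
    # mu and sigma^2 are profiled out; (lam - 1) * sum(log y) is the Jacobian.
    return -n / 2 * np.log(np.var(z)) + (lam - 1) * np.log(y).sum()

grid = np.linspace(-2, 2, 81)
lam_hat = grid[np.argmax([boxcox_loglik(l) for l in grid])]
print(lam_hat)  # close to 0 for log-normal data, supporting lsurv = log(surv)
```

A λ̂ near 0 is the usual justification for the log transformation used in the next step of the analysis.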

SAS Commands

data a1;
  infile 'U:\.www\datasets525\Ch09ta01.txt' dlm='09'x;
  input blood prog enz liver age female mod_alc sev_alc surv;

proc transreg;
  model boxcox(surv) = identity(blood) identity(prog) identity(enz)
                       identity(liver) identity(age) identity(female)
                       identity(mod_alc) identity(sev_alc);
run;

Continuing the Analysis

data a1;
  set a1;
  lsurv = log(surv);

proc sgscatter;
  matrix lsurv blood prog enz liver age;

proc corr;
  var lsurv blood prog enz liver age;

proc reg data=a1;
  model lsurv = blood prog enz liver age female mod_alc sev_alc
        / selection=rsquare adjrsq cp aic sbc best=2 b;
run;
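The DATA-step transformation and PROC CORR steps above can be mirrored in Python as a rough sketch. The variables below are hypothetical stand-ins (only three of the nine columns, with made-up values), since the actual Ch09ta01.txt data are not reproduced here.

```python
import numpy as np

# Hypothetical stand-ins for three of the variables (the surgical data in
# Ch09ta01.txt are not reproduced here); surv is built to be right-skewed.
rng = np.random.default_rng(4)
blood = rng.normal(5.8, 1.6, size=54)
prog = rng.normal(63.2, 16.9, size=54)
surv = np.exp(6.4 + 0.02 * (prog - 63.2) + rng.normal(scale=0.4, size=54))

lsurv = np.log(surv)                      # mirrors lsurv = log(surv)
corr = np.corrcoef([lsurv, blood, prog])  # mirrors PROC CORR
print(np.round(corr, 3))
```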

Output

Simple Statistics

Variable    N      Mean      Std Dev         Sum    Minimum    Maximum
lsurv      54   6.43054      0.49152   347.24929    5.19850    7.75919
blood      54   5.78333      1.60303   312.30000    2.60000   11.20000
prog       54  63.24074     16.90253        3415    8.00000   96.00000
enz        54  77.11111     21.25378        4164   23.00000  119.00000
liver      54   2.74426      1.07036   148.19000    0.74000    6.40000
age        54  51.61111     11.12267        2787   30.00000   70.00000

Pearson Correlation Coefficients, N = 54 (p-value beneath each correlation)

          lsurv     blood      prog       enz     liver       age
lsurv   1.00000   0.24633   0.47015   0.65365   0.64920  -0.14505
                   0.0726    0.0003    <.0001    <.0001    0.2953
blood             1.00000   0.09012  -0.14963   0.50242  -0.02069
                             0.5169    0.2802    0.0001    0.8820
prog                        1.00000  -0.02361   0.36903  -0.04767
                                       0.8655    0.0060    0.7321
enz                                   1.00000   0.41642  -0.01290
                                                 0.0017    0.9262
liver                                           1.00000  -0.20738
                                                           0.1324
age                                                       1.00000

Variable Selection

Two distinct questions:
1. What is the appropriate subset size? (adjusted R², C_p, MSE, PRESS, AIC, SBC)
2. What is the best model for a fixed size? (R² and any of the above measures)

C_p Criterion

Compares the total mean squared error of the fitted values with σ². The squared error decomposes as

(Ŷi - μi)² = (Ŷi - E(Ŷi) + E(Ŷi) - μi)²

and, taking expectations (the cross term vanishes),

E(Ŷi - μi)² = (E(Ŷi) - μi)² + σ²(Ŷi) = Bias² + variance.

Summing over cases, the total mean squared error is Σ[(E(Ŷi) - μi)² + σ²(Ŷi)], and the criterion measure is

Γp = Σ[(E(Ŷi) - μi)² + σ²(Ŷi)] / σ²

Neither σ² nor the numerator is known:
- For σ², use MSE(X1, X2, ..., X(p-1)) = MSE(F), the MSE of the full model, as the estimate.
- For the variance part: one can show σ²(Ŷ) = σ²H, so Σ σ²(Ŷi) = σ² Trace(H) = σ²p (the trace of an idempotent matrix equals its rank).
- For the bias part: one can show E(SSEp) = Σ(E(Ŷi) - μi)² + (n - p)σ², so SSEp - (n - p)MSE(F) estimates the bias term. Note: when the model is correct, Σ(E(Ŷi) - μi)² = 0.

Substituting these estimates gives

Cp = [SSEp - (n - p)MSE(F) + p·MSE(F)] / MSE(F) = SSEp / MSE(F) - (n - 2p)
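The C_p formula can be checked numerically. This is a minimal sketch on simulated data (the surgical data are not reproduced here; all names and values below are hypothetical), illustrating the formula Cp = SSEp/MSE(F) - (n - 2p) and the fact from the notes that the full model's C_p equals p by definition.

```python
import numpy as np

# Hypothetical data: 4 candidate predictors, only the first two matter.
rng = np.random.default_rng(1)
n = 54
X = rng.normal(size=(n, 4))
y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

def sse(cols):
    """SSE from an OLS fit (with intercept) on the given predictor columns."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return resid @ resid

p_full = X.shape[1] + 1                      # predictors + intercept
mse_full = sse([0, 1, 2, 3]) / (n - p_full)  # MSE(F), the estimate of sigma^2

def cp(cols):
    p = len(cols) + 1
    return sse(cols) / mse_full - (n - 2 * p)

print(cp([0, 1, 2, 3]))  # full model: equals p_full = 5 by definition
print(cp([0, 1]))        # correct subset: expect C_p near p = 3
print(cp([2, 3]))        # omits the real effects: C_p far above p = 3
```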

Here p is the number of predictors plus one for the intercept. Properties of C_p:
- When the model is correct, there is no bias, so E(Cp) ≈ p.
- When plotting models against p, biased models fall above the line Cp = p and unbiased models fall around it.
- By definition, Cp for the full model equals p.

Adjusted R² Criterion

Takes the number of parameters in the model into account by switching from sums of squares to mean squares:

R²a = 1 - [(n - 1)/(n - p)](SSE/SSTO) = 1 - MSE/MSTO

Choose the model that maximizes R²a; this is the same as choosing the model with the smallest MSE.

PRESS_p Criterion

The prediction sum of squares quantifies how well the fitted values can predict the observed responses. For each case i, predict Yi using the model fit to the other n - 1 cases:

PRESS = Σ(Yi - Ŷi(i))²

Select a model with small PRESS. PRESS can be computed from a single fit (Chapter 10).

Other Approaches

Criteria based on minimizing -2 log(likelihood) plus a penalty for a more complex model; these can also be used to compare non-nested models.
- AIC (Akaike's information criterion): n log(SSEp/n) + 2p
- SBC (Schwarz Bayesian Criterion): n log(SSEp/n) + p log(n)
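The criteria above, including the "single fit" PRESS computation via the hat matrix (the deleted residual is e_i / (1 - h_ii), the Chapter 10 identity the notes allude to), can be sketched as follows. The data are simulated; all names are hypothetical.

```python
import numpy as np

# Hypothetical data illustrating the criteria above (not the surgical data).
rng = np.random.default_rng(2)
n = 54
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.4, size=n)

A = np.column_stack([np.ones(n), X])
p = A.shape[1]                            # predictors + intercept
H = A @ np.linalg.inv(A.T @ A) @ A.T      # hat matrix
e = y - H @ y                             # ordinary residuals
sse = e @ e
ssto = ((y - y.mean()) ** 2).sum()

adj_r2 = 1 - (n - 1) / (n - p) * sse / ssto
# "One fit" PRESS: the deleted residual is e_i / (1 - h_ii).
press = ((e / (1 - np.diag(H))) ** 2).sum()
aic = n * np.log(sse / n) + 2 * p
sbc = n * np.log(sse / n) + p * np.log(n)
print(adj_r2, press, aic, sbc)  # PRESS always exceeds SSE; SBC > AIC here
```

Note that SBC penalizes each parameter by log(n) ≈ 4 here versus AIC's 2, so SBC tends to prefer smaller models.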

Selection in SAS

Helpful options in the model statement:
- selection= chooses the criterion and search method: forward (step up), backward (step down), or stepwise (forward with a backward glance)
- include=n forces the first n variables into all models
- best=n limits output to the best n models
- start=n limits output to models with at least n X's
- b includes the parameter estimates

Models of the Same Subset Size

Can also compare using R² or SSE. This may leave several worthy models; use subject-matter knowledge to make the final decision. The decision matters less if the goal is prediction.

Output

R-Square Selection Method

Number in              Adjusted
Model      R-Square    R-Square      C(p)         AIC          SBC
1          0.4273      0.4162    117.4783   -103.8110    -99.83305
1          0.4215      0.4103    119.1712   -103.2679    -99.28994
2          0.6632      0.6500     50.4918   -130.4785   -124.51159
2          0.5992      0.5835     69.1967   -121.0890   -115.12206
3          0.7780      0.7647     18.9015   -151.0021   -143.04620
3          0.7572      0.7427     24.9882   -146.1614   -138.20545
4          0.8299      0.8160      5.7340   -163.3759   -153.43101
4          0.8144      0.7993     10.2633   -158.6694   -148.72443
5          0.8375      0.8205      5.5282   -163.8257   -151.89179
5          0.8359      0.8188      5.9990   -163.2934   -151.35953
6          0.8435      0.8235      5.7725   -163.8583   -149.93537
6          0.8392      0.8186      7.0288   -162.3961   -148.47317
7          0.8460      0.8226      7.0288   -162.7428   -146.83088
7          0.8436      0.8198      7.7214   -161.9186   -146.00670
8          0.8461      0.8187      9.0000   -160.7773   -142.87649

Background Reading

KNNL Sections 9.1-9.5 (program: knnl350.sas); KNNL Chapter 10.
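An all-subsets search like the selection=rsquare output above can be sketched in Python. This is a hypothetical illustration on simulated data, not the SAS procedure: for a fixed subset size, maximizing R² is the same as minimizing SSE, so the loop reports the best model of each size.

```python
import itertools
import numpy as np

# Hypothetical data: 5 candidate predictors, the first three have real effects.
rng = np.random.default_rng(3)
n = 54
X = rng.normal(size=(n, 5))
y = 1.0 + X @ np.array([1.0, 0.8, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=n)

def fit_sse(cols):
    """SSE of the OLS fit (with intercept) on the chosen columns."""
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return resid @ resid

ssto = ((y - y.mean()) ** 2).sum()
# All-subsets search: report the best (minimum-SSE) model of each size
# along with its R^2, mimicking one line per size of the SAS output.
for k in range(1, 6):
    best = min(itertools.combinations(range(5), k), key=fit_sse)
    print(k, best, round(1 - fit_sse(best) / ssto, 4))
```

With 2^8 = 256 possible subsets in the surgical example, an exhaustive search like this is cheap; the forward, backward, and stepwise options exist for problems where it is not.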