Resampling-Based Information Criteria for Adaptive Linear Model Selection

1 Resampling-Based Information Criteria for Adaptive Linear Model Selection
Phil Reiss, January 5, 2010
Joint work with Joe Cavanaugh, Lei Huang, and Amy Krain Roy

2 Outline
- Motivating application: amygdala connectivity
- Linear models and linear model selection
- Akaike's information criterion and its limitations
- Resampling and its application to model selection
- Simulation results
- Back to the connectivity application

3 Motivation
Stein et al. (2007) proposed a network of effective amygdala connectivity comprising 10 connections among the following 8 regions of interest: AMY (amygdala); INS (insula); OFC (orbitofrontal cortex); PCC (posterior cingulate cortex); PFC (prefrontal cortex); PHG (parahippocampal gyrus); SUB (subgenual cingulate); SUP (supragenual cingulate).
We have measured resting-state functional connectivity (FC) for these 10 connections in 42 individuals who also responded to a battery of psychological tests.
Goal: Use multiple linear regression to investigate whether these 10 FC scores (x) predict psychological variables (y). More specifically, regress y on an appropriate subset of the x's (a "model selection" problem).

4 Figure 1: The eight brain ROIs studied by Stein et al. (2007), and the 10 connections that they identified. Black dots indicate ROIs that lie on the mid-sagittal plane shown, while grey dots indicate left-hemisphere ROIs that have been projected onto this plane for illustration.

5 Simple linear regression: fit the least-squares line. [Figure: scatterplots of y versus x, with the fitted least-squares line.]

6 Multiple linear regression: fit the least-squares (hyper)plane. [Figure: scatterplots of y against predictors x1 and x2, with the fitted plane.]

7 Which has lower sum of squared residuals: the exact model y = 3 + 2x + ε, or the fitted least-squares model ŷ = β̂_0 + β̂_1 x? [Figure: scatterplot of y versus x showing the exact and fitted lines.]

8 Basic model equation for multiple linear regression
y_i = β_0 + β_1 x_{i1} + ... + β_p x_{ip} + ε_i,
where
- y_i is the ith subject's outcome,
- β_0 is the intercept of the best-fit line (or plane...),
- x_{i1}, ..., x_{ip} are subject i's values for predictors 1, ..., p,
- β_1, ..., β_p are the coefficients or effects of predictors 1, ..., p,
- ε_i is an error term, which in the nicest case is
  1. independent across subjects,
  2. identically distributed for each subject,
  3. normally distributed.
Each of these 3 conditions can be relaxed.
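
A minimal sketch of this model (my own illustration, not from the talk; the sample size, coefficient values, and variable names are all made-up assumptions): simulate data from the equation above and recover β_0, ..., β_p by ordinary least squares.

```python
import numpy as np

# Hedged sketch: simulate y_i = beta_0 + beta_1*x_{i1} + ... + beta_p*x_{ip} + eps_i
# with made-up coefficients, then recover them by ordinary least squares.
rng = np.random.default_rng(0)
n, p = 42, 3                                     # sample size, number of predictors
beta = np.array([3.0, 2.0, -1.0, 0.5])           # [beta_0, ..., beta_p] (hypothetical)

X = rng.normal(size=(n, p))                      # predictor values x_{i1}, ..., x_{ip}
y = beta[0] + X @ beta[1:] + rng.normal(size=n)  # iid normal errors eps_i

# Least squares: prepend an intercept column and minimize ||y - X1 b||^2
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                                  # estimates of beta_0, ..., beta_p
```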

9 Model selection
In many situations we are not given the p predictors in the model. Rather, we have a number (perhaps a large number) of candidate predictors, and we seek a subset of the predictors that will constitute the "best" model.
More ambitiously, we may say: we seek the "true" model. But...

10 The "true model" has two meanings!
1. Right set of predictors. E.g., suppose a person's HAM-D score is a linear function of his/her age and an FC measure, plus random noise. Then the true model is
y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ε_i,
where y_i = ith individual's HAM-D, x_{i1} = age, x_{i2} = FC. All other linear models are
- underfitted, e.g. y_i = β_0 + β_1 x_{i1} + ε_i;
- overfitted, e.g. y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3} + ε_i for some additional predictor x_3; or
- under/overfitted, e.g. y_i = β_0 + β_1 x_{i1} + β_3 x_{i3} + ε_i.
2. Right coefficient values: in the above example, the actual numeric values of β_0, β_1, β_2.
To reduce confusion, we'll refer to #1 as the "true model" and #2 as the "exact model." The exact model is inherently unknowable, assuming a finite sample. Model selection aims for the true model.

11 General strategy for model selection: minimize prediction error
The model fit gives rise to fitted values
ŷ_i = β̂_0 + β̂_1 x_{i1} + ... + β̂_p x_{ip}
minimizing the sum of squared residuals
SS_resid = Σ_{i=1}^n (y_i - ŷ_i)^2.
If we collected a new data set with the same x's but new outcomes y_1^+, ..., y_n^+, the predicted outcomes would be ŷ_1, ..., ŷ_n as defined above. Generally speaking, on average, the sum of squared prediction errors
SS_pred = Σ_{i=1}^n (y_i^+ - ŷ_i)^2
is smaller for the true model than for other models.
Strategy:
1. for each candidate model, estimate mean SS_pred;
2. select the model for which this estimate is lowest.
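
As a hedged illustration (assumptions mine: synthetic data with a known true model), the sketch below fits the true model and compares SS_resid on the observed outcomes with SS_pred on a fresh replicate y^+ at the same x's.

```python
import numpy as np

# Sketch: even with the true model, SS_resid (fit to the data in hand) is
# smaller, on average, than SS_pred (error on fresh outcomes at the same x's).
rng = np.random.default_rng(1)
n, p = 42, 3
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([3.0, 2.0, -1.0, 0.5])         # hypothetical true coefficients

mean_y = X1 @ beta
y = mean_y + rng.normal(size=n)                # observed outcomes y_i
y_plus = mean_y + rng.normal(size=n)           # new outcomes y_i^+, same x's

beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat                          # fitted values
print("SS_resid:", np.sum((y - y_hat) ** 2))
print("SS_pred: ", np.sum((y_plus - y_hat) ** 2))  # larger, on average
```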

12 How to estimate mean SS_pred for each model?
Naive idea: estimate
mean SS_pred = E[ Σ_{i=1}^n (y_i^+ - ŷ_i)^2 ]
by
SS_resid = Σ_{i=1}^n (y_i - ŷ_i)^2,
and select the model with the lowest SS_resid.
That's a bad idea, for at least two reasons.
First reason: As predictors are added to the model, SS_resid always decreases. So if you choose the minimum-SS_resid model, you'll always choose the largest candidate model!
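
A quick numerical check of this first reason (a sketch under my own assumptions, with an outcome unrelated to the predictors): SS_resid falls monotonically as predictors are added, even though none of them carries any signal.

```python
import numpy as np

# Sketch: SS_resid can only decrease as predictors are added, even when every
# predictor is pure noise, so minimizing SS_resid picks the largest model.
rng = np.random.default_rng(2)
n = 42
y = rng.normal(size=n)                         # outcome unrelated to predictors
X = rng.normal(size=(n, 10))                   # ten noise predictors

for k in range(11):                            # model using the first k predictors
    X1 = np.column_stack([np.ones(n), X[:, :k]])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    ss = np.sum((y - X1 @ beta_hat) ** 2)
    print(f"p = {k:2d}  SS_resid = {ss:.2f}")  # nonincreasing in k
```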

13 Another reason: Even with the true model, SS_resid is a poor estimate of mean SS_pred.
[Figure: scatterplots of y and of new outcomes y_new against x, showing the exact and fitted lines.]
SS_resid^fitted < SS_resid^exact, yet E(SS_pred^exact) < E(SS_pred^fitted).
SS_resid is an overoptimistic (negatively biased) estimate of E(SS_pred).

14 Akaike information criterion (AIC)
In the early 1970s, Akaike proposed to do model selection by minimizing the Kullback-Leibler information, which in the linear model context is essentially the mean of n log(SS_pred).
n log(SS_resid) is a negatively biased estimate of this quantity; it turns out that, for a model with p predictors, the bias is about -2p. Thus
AIC = n log(SS_resid) + 2p
is an approximately unbiased estimate of the K-L information.
As the number of predictors grows, n log(SS_resid) decreases while 2p increases. The AIC-minimizing model strikes an ideal balance between goodness of fit and model parsimony.
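
Here is a small sketch (my own, with synthetic data) of computing this AIC variant, n log(SS_resid) + 2p, across a nested sequence of models; in this made-up example only the first two predictors are true.

```python
import numpy as np

# Sketch: AIC = n*log(SS_resid) + 2p for nested models with 0..6 predictors.
def aic(y, X1, p):
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    ss_resid = np.sum((y - X1 @ beta_hat) ** 2)
    return len(y) * np.log(ss_resid) + 2 * p

rng = np.random.default_rng(3)
n = 42
X = rng.normal(size=(n, 6))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)  # two true predictors

for k in range(7):
    X1 = np.column_stack([np.ones(n), X[:, :k]])
    print(f"p = {k}  AIC = {aic(y, X1, k):.2f}")    # typically minimized near p = 2
```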

15 AIC with many candidate models
In a standard application, we compute AIC = n log(SS_resid) + 2p for each of several candidate models, and select the model with the lowest AIC.
But what if we have many candidate models? E.g., with 10 possible predictors, there are 2^10 = 1024 candidate models. Even if none of the 10 are true predictors (i.e., the true model is the null model with no predictors), it is likely that at least one of these 1024 models will have lower AIC than the null model just by chance. This is closely related to the problem of multiple hypothesis testing.
To account for multiplicity, we need a modified AIC,
n log(SS_resid) + penalty,
for some penalty > 2p. But how do we find this penalty?
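
The sketch below (assumptions mine: synthetic null data and an exhaustive enumeration of subsets) illustrates the multiplicity problem by counting how many of the 2^10 - 1 = 1023 non-null models beat the null model's AIC purely by chance.

```python
import numpy as np
from itertools import combinations

# Sketch: under the null model, many of the 1023 non-null candidate models
# achieve a lower AIC than the null model just by chance.
rng = np.random.default_rng(4)
n, p_max = 42, 10
X = rng.normal(size=(n, p_max))
y = rng.normal(size=n)                         # the null model is true

def aic(y, cols):
    X1 = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return len(y) * np.log(np.sum((y - X1 @ beta_hat) ** 2)) + 2 * len(cols)

aic_null = aic(y, ())
n_better = sum(
    aic(y, cols) < aic_null
    for k in range(1, p_max + 1)
    for cols in combinations(range(p_max), k)
)
print(f"{n_better} of 1023 non-null models beat the null model's AIC")
```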

16 Resampling
We want to find a penalty such that
n log(SS_pred) ≈ n log(SS_resid) + penalty,
i.e.,
penalty ≈ n log(SS_pred) - n log(SS_resid) = (population quantity Q_pop) - (sample quantity Q_sample).
Resampling is a general way to estimate the difference between an (observed) sample quantity Q_sample and a corresponding (unobservable) population quantity Q_pop.
Idea: Repeatedly draw resampled data sets ("samples from the sample").
Resample : Sample :: Sample : Population (approximately).
The average of Q_sample - Q_resample is an estimate of Q_pop - Q_sample.
Two ways to resample:
1. Bootstrapping: draw a sample of n with replacement. E.g., if n = 10, we might take observations 1, 3, 3, 4, 5, 7, 7, 9, 10, 10.
2. Subsampling: draw a sample of m < n without replacement. E.g., if n = 10 and m = 8, we might take observations 1, 2, 3, 4, 7, 8, 9, 10.
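
The sketch below is one simple bootstrap version of this idea (my own simplification, not the exact criterion developed in the talk): fit on each resample, treat the original sample as the "population" for that resample, and average Q_sample - Q_resample, where Q = n log(SS).

```python
import numpy as np

# Bootstrap sketch of the penalty estimate: with Q = n*log(SS), average
# Q_sample - Q_resample over bootstrap draws; the original sample plays the
# role of the population for each resample.
rng = np.random.default_rng(5)
n = 42
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X1 @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

def q(y_eval, X_eval, beta_hat):               # Q = n*log(SS) under a given fit
    return len(y_eval) * np.log(np.sum((y_eval - X_eval @ beta_hat) ** 2))

penalties = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)           # draw n observations with replacement
    beta_b, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)
    # original sample : resample :: population : sample
    penalties.append(q(y, X1, beta_b) - q(y[idx], X1[idx], beta_b))
print("estimated penalty:", np.mean(penalties))  # positive, on the order of 2p
```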

17 Simulation study
We simulated four collections of data sets with 20 candidate predictors, of which 0, 2, 6, or 10 were true predictors. We then compared how well the true predictors were recovered by the following model selection methods:
1. AIC;
2. AIC_c (corrected AIC);
3. CIC (Tibshirani and Knight, 1999);
4. two bootstrap methods;
5. two subsampling methods.
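
To make the setup concrete, here is a hedged sketch of one such simulated data set and a simple recovery score; the design choices (n, coefficient values, and greedy forward search under AIC rather than the study's actual search over candidate models) are all my assumptions.

```python
import numpy as np

# Sketch: one simulated data set with 20 candidate predictors (6 true here),
# scored by how well greedy forward selection under AIC recovers the truth.
rng = np.random.default_rng(6)
n, p, n_true = 100, 20, 6
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:n_true] = 1.0                            # hypothetical true coefficients
y = X @ beta + rng.normal(size=n)

def aic_for(cols):
    X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return n * np.log(np.sum((y - X1 @ b) ** 2)) + 2 * len(cols)

selected, current = [], aic_for([])
while True:                                    # greedy forward search
    trials = {j: aic_for(selected + [j]) for j in range(p) if j not in selected}
    if not trials or min(trials.values()) >= current:
        break
    j_best = min(trials, key=trials.get)
    selected.append(j_best)
    current = trials[j_best]

true_set = set(range(n_true))
print("true positives: ", len(true_set & set(selected)))
print("false positives:", len(set(selected) - true_set))
```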

18-21 [Four figures, one per simulation scenario: each compares the true coefficients with the models selected by AIC, AIC_c, CIC, bootstrap methods 1 and 2, and subsampling methods 1 and 2.]

22 Back to the Stein seed data
We regressed the following outcomes on the ten connections:
1. Rosenberg self-esteem score
2. Mood and Anxiety Symptom Questionnaire (MASQ) General Distress: Depressive Symptoms (GDD) subscale
We selected best models using
1. AIC
2. AIC_c
3. our subsampling criterion

23 Self-esteem results
13 models, with 1-5 predictors, had somewhat lower AIC than the null model. However, both AIC_c and the subsampling method choose the null model as the best model, suggesting that chance variation accounts for the AIC-based findings.

24 MASQ-GDD results
The number of models outscoring the null model is 78 for AIC and 56 for AIC_c, but only 6 for the subsampling method.
The PCC-AMY and SUB-INS connections tended to appear in most of the best models by all three methods. The subsampling method favored more parsimonious models; AIC favored less parsimonious ones; AIC_c was in between.
PCC-AMY and SUB-INS appear positively related to depression: a one-SD increase in either FC score is associated with a 10% increase in the GDD subscale. The coefficients of determination are roughly 11% for PCC-AMY alone, 13% for SUB-INS alone, and 25% for the model containing both.

25 [Figure: two dot plots, one per criterion (AIC_c and IC_sub^ad), showing criterion values for the best models; the vertical axis lists the connections AMY-PHG, AMY-SUB, INS-AMY, OFC-AMY, OFC-PFC, PCC-AMY, SUB-INS, SUB-SUP, SUP-AMY, SUP-PCC, and the horizontal axis is the criterion value.]
Figure 2: Information criterion values for the best models for MASQ-GDD score, based on AIC_c and on IC_sub^ad. The dotted line in the right plot indicates IC_sub^ad for the null model.

26 Summary
A popular way to select a model from a set of candidates is to take the model with the lowest AIC. However, when selecting among a large set of models, the minimum-AIC model is likely to overfit. Resampling-based information criteria can correct this tendency.
For our data set:
1. The minimum-AIC model suggests that self-esteem is predicted by some of the connections; our methods suggest these effects are spurious.
2. PCC-AMY and SUB-INS seem to be positively associated with the MASQ GDD subscale; our method again gives more parsimonious results than AIC.
