Extended Bayesian Information Criteria for Model Selection with Large Model Spaces
Jiahua Chen, University of British Columbia
Zehua Chen, National University of Singapore
(Biometrika, 2008)
Variable Selection

Classical criteria:
- Akaike Information Criterion (Akaike, 1973)
- Bayesian Information Criterion (Schwarz, 1978)
- Cross-Validation (Stone, 1974)

Large P, small n settings, e.g. genome-wide association studies:
the criteria above select many spurious covariates. A great challenge.
Bayesian Information Criterion (BIC)

{(y_i, x_i) : i = 1, ..., n}: independent observations with density f(y_i | x_i, θ), θ ∈ R^P
L_n(θ) = ∏_{i=1}^n f(y_i | x_i, θ)
For s ⊂ {1, ..., P}, θ(s): θ with the components outside s set to 0

BIC(s) = −2 log L_n{θ̂(s)} + ν(s) log n,

where θ̂(s): MLE of θ(s), ν(s): number of components in s.
BIC: approximate Bayes approach

S: model space under consideration; p(s): prior probability of model s.
The posterior probability of model s is

Pr(s | Z) ∝ Pr(Z | s) p(s),

where Pr(Z | s) = ∫ Pr(Z | θ(s), s) p(θ(s) | s) dθ(s).
BIC: approximate Bayes approach

Select the s that maximizes log{Pr(Z | s) p(s)}.
A (Laplace) approximation to the integral, followed by some simplifications, gives

log Pr(Z | s) = log L_n{θ̂(s)} − (ν(s)/2) log n + O(1),

so

log{Pr(Z | s) p(s)} = log L_n{θ̂(s)} − (ν(s)/2) log n + O(1) + log p(s)
                    = −(1/2) BIC(s) + O(1) + log p(s).
Constant prior assumption behind BIC

An implicit assumption underlying the use of BIC: p(s) is constant over s (uniform prior).
This may not be reasonable with large P.
S_j: the class of models containing j covariates; e.g. S_1 is the collection of models with a single covariate.
Under the constant prior, Pr(S_j) ∝ τ(S_j), where τ(S_j): size of S_j.
Ex. P = 1000: τ(S_1) = 1000 vs τ(S_2) = 1000 · 999/2.
The constant prior prefers larger models.
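The size comparison above can be checked directly: under a uniform prior over individual models, the prior mass of the class S_j is proportional to the binomial coefficient C(P, j). A minimal sketch:

```python
from math import comb

# Under a uniform prior over all 2^P models, the prior mass of S_j
# (models with exactly j covariates) is proportional to tau(S_j) = C(P, j).
P = 1000
tau_S1 = comb(P, 1)   # number of one-covariate models
tau_S2 = comb(P, 2)   # number of two-covariate models: 1000 * 999 / 2

print(tau_S1, tau_S2)          # 1000 499500
print(tau_S2 / tau_S1)         # S_2 carries ~500x the prior mass of S_1
```

This is exactly why the constant prior favors larger model classes: the class sizes grow combinatorially in j.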
Extended BIC

S = ∪_{j=1}^P S_j
Take Pr(S_j) ∝ τ^ξ(S_j), where 0 ≤ ξ ≤ 1, and Pr(s | S_j) = 1/τ(S_j) for any s ∈ S_j
(all models in S_j are equally plausible).
Then p(s) ∝ τ^{−γ}(S_j) for s ∈ S_j, where γ = 1 − ξ.

Extended BIC:

BIC_γ(s) = −2 log L_n{θ̂(s)} + ν(s) log n + 2γ log τ(S_j), where 0 ≤ γ ≤ 1
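For a Gaussian linear model, −2 log L_n{θ̂(s)} equals n·log(RSS_s/n) up to an additive constant that is the same for every model, and τ(S_j) = C(P, j). A minimal sketch of computing BIC_γ on toy data (the function name and data are illustrative, not from the paper):

```python
import numpy as np
from math import comb, log

def ebic(y, X, support, gamma, P):
    """Extended BIC for a Gaussian linear model restricted to `support`.

    Uses -2 log L = n * log(RSS/n) up to a model-independent constant,
    which cancels when comparing models.
    """
    n = len(y)
    k = len(support)
    if k == 0:
        rss = float(y @ y)  # null model: no covariates
    else:
        Xs = X[:, support]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        rss = float(resid @ resid)
    # ordinary BIC plus the extra 2*gamma*log tau(S_k) term
    return n * log(rss / n) + k * log(n) + 2 * gamma * log(comb(P, k))

# toy check with simulated data: the true covariate should score lower
rng = np.random.default_rng(0)
n, P = 100, 20
X = rng.standard_normal((n, P))
y = 2.0 * X[:, 0] + rng.standard_normal(n)
print(ebic(y, X, [0], gamma=1.0, P=P) < ebic(y, X, [1], gamma=1.0, P=P))
```

Setting gamma=0 recovers the ordinary BIC, since log C(P, k) then drops out.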
Consistency of the Extended BIC

P = p_n = O(n^κ) as n → ∞, for some κ > 0.
Model:

y_n = X_n β + ε_n,  where ε_n ~ N(0, σ² I_n),   (1)

s_0: the smallest subset of {1, ..., p_n} such that μ_n = E(y_n) = X_n(s_0) β(s_0),
where X_n(s_0), β(s_0): design matrix and coefficients corresponding to s_0.
Call s_0 the true submodel.
Asymptotic identifiability

K_0 = ν(s_0); H_n(s): projection matrix onto the column space of X_n(s).
Δ_n(s) = ‖μ_n − H_n(s)μ_n‖²

Condition 1: Asymptotic identifiability. Model (1) with true submodel s_0 is asymptotically identifiable if

lim_{n→∞} min{(log n)^{−1} Δ_n(s) : s ≠ s_0, ν(s) ≤ K_0} = ∞.

In words: any other model of comparable size cannot predict the mean response well.
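Δ_n(s) can be evaluated without forming H_n(s) explicitly, since H_n(s)μ_n is just the least-squares projection of μ_n onto the span of X_n(s). A small sketch on made-up data (the function name is illustrative):

```python
import numpy as np

def delta_n(mu, X_s):
    """Identifiability gap Delta_n(s) = ||mu - H_n(s) mu||^2:
    squared distance from the true mean mu to the column space of X_n(s)."""
    # least-squares fit of mu on X_n(s) gives the projection H_n(s) mu
    coef, *_ = np.linalg.lstsq(X_s, mu, rcond=None)
    resid = mu - X_s @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
mu = X[:, [0, 1]] @ np.array([1.0, -2.0])  # true mean lies in span of columns 0, 1

print(delta_n(mu, X[:, [0, 1]]))       # ~0: correct support, mu is in the span
print(delta_n(mu, X[:, [2, 3]]) > 0)   # a wrong model of the same size leaves a gap
```

The condition asks this gap, for every wrong model of size at most K_0, to grow faster than log n.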
Consistency of the Extended BIC

Theorem. Assume that p_n = O(n^κ) for some constant κ. If γ > 1 − 1/(2κ), then, under the asymptotic identifiability condition,

Pr[min{BIC_γ(s) : ν(s) = j, s ≠ s_0} > BIC_γ(s_0)] → 1, for j = 1, ..., K,

as n → ∞ (K is an upper bound for K_0).

- If γ = 1: consistent for any κ > 0.
- If γ = 0 (original BIC): consistent only for κ < 0.5.
Simulation Studies

γ = 0, 0.5 and 1.
Cannot afford to compute BIC_γ(s) for all possible s.
Instead: LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001).
Increase the penalty λ gradually; BIC_γ is computed for the resulting sequence of nested models.
If P ≥ n, a tournament approach (Chen and Chen, 2008):
1. randomly partition the covariates (each group: n/2 covariates)
2. apply LASSO or SCAD within each group
3. pool the nonzero components
4. repeat if necessary
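The idea of scoring a nested model sequence with BIC_γ can be sketched as follows. As a stand-in for the LASSO/SCAD path used in the slides, greedy forward selection (my substitution, not the paper's method) produces the nested sequence; BIC_γ then picks the stopping point:

```python
import numpy as np
from math import comb, log

def ebic_gamma(y, Xs, n, P, k, gamma):
    # Gaussian BIC_gamma up to a model-independent additive constant
    if k == 0:
        rss = float(y @ y)
    else:
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        rss = float(r @ r)
    return n * log(rss / n) + k * log(n) + 2 * gamma * log(comb(P, k))

def select_by_ebic(y, X, gamma, max_k=10):
    """Build a nested model sequence by greedy forward selection (a stand-in
    for the LASSO/SCAD path) and return the BIC_gamma-minimizing support."""
    n, P = X.shape
    support, remaining = [], list(range(P))
    best, best_score = [], ebic_gamma(y, None, n, P, 0, gamma)
    for _ in range(min(max_k, P)):
        # add the covariate giving the largest RSS reduction
        def rss_with(j):
            Xs = X[:, support + [j]]
            b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            r = y - Xs @ b
            return float(r @ r)
        j_best = min(remaining, key=rss_with)
        support.append(j_best)
        remaining.remove(j_best)
        score = ebic_gamma(y, X[:, support], n, P, len(support), gamma)
        if score < best_score:
            best, best_score = list(support), score
    return sorted(best)

# toy run: three strong true covariates among 50
rng = np.random.default_rng(2)
n, P = 200, 50
X = rng.standard_normal((n, P))
y = X[:, [0, 1, 2]] @ np.array([1.5, -1.0, 2.0]) + rng.standard_normal(n)
print(select_by_ebic(y, X, gamma=1.0))
```

With strong signals like these, the BIC_γ minimizer should contain the true support {0, 1, 2} and add few or no spurious covariates.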
Simulation Studies: Case 1

Linear model in all cases:

y_i = x_i^T(s) β(s) + ε_i, for some s, where ε_i ~ N(0, 1)

Case 1: P = 50, n = 200, with correlation structures
(i) cor(x_j, x_k) = ρ for all (j, k)
(ii) cor(x_j, x_k) = ρ for k = j ± 1
(iii) cor(x_j, x_k) = ρ^{|k−j|}
β(s) = (7, 9, 4, 3, 10, 2, 2, 1)^T / 10
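Structure (iii) is an AR(1)-type correlation, which can be generated with a Cholesky factor of the correlation matrix; a minimal sketch (not the authors' code):

```python
import numpy as np

def ar1_design(n, P, rho, rng):
    """Gaussian design with cor(x_j, x_k) = rho^|k-j| (Case 1, structure iii)."""
    idx = np.arange(P)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1) correlation matrix
    L = np.linalg.cholesky(Sigma)
    # rows are i.i.d. N(0, Sigma) covariate vectors
    return rng.standard_normal((n, P)) @ L.T

rng = np.random.default_rng(3)
X = ar1_design(5000, 10, rho=0.5, rng=rng)
# sample correlations should be near 0.5 at lag 1 and 0.25 at lag 2
print(round(np.corrcoef(X[:, 0], X[:, 1])[0, 1], 2))
```

Structures (i) and (ii) follow the same recipe with the equicorrelated or tridiagonal correlation matrix in place of Sigma.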
Results of Case 1

s*: the model selected by the extended BIC.
Positive Selection Rate (PSR): ν(s ∩ s*)/ν(s)
False Discovery Rate (FDR): ν(s* \ s)/ν(s*)
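The two rates above are simple index-set arithmetic; a minimal sketch (function name illustrative):

```python
def psr_fdr(true_s, selected_s):
    """Positive selection rate and false discovery rate of a selected model.

    PSR = |selected ∩ true| / |true|
    FDR = |selected \\ true| / |selected|  (0 if nothing is selected)
    """
    true_s, selected_s = set(true_s), set(selected_s)
    psr = len(selected_s & true_s) / len(true_s)
    fdr = len(selected_s - true_s) / len(selected_s) if selected_s else 0.0
    return psr, fdr

# 2 of 4 true covariates found, 1 of 3 selections spurious
print(psr_fdr({0, 1, 2, 3}, {0, 1, 7}))  # (0.5, 0.3333...)
```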
Simulation Studies: Case 2

P = 1000, n = 200.
Covariates in 20 groups of size 50.
First 10 groups: generated as in Case 1; the other 10 groups: generated from a discrete distribution.
β(s) = (10, 7, 5, 3, 2)^T / 10
The tournament approach:
1. the 1000 covariates are partitioned into subsets of size 100
2. from each subset, 12 covariates are selected
3. the pooled 120 covariates are reduced to 20 final candidate covariates
Results of Case 2

[results table from the original slide not reproduced]
Simulation Studies: Case 3

A dataset from a genome-wide association study:
y: mRNA level of a particular gene
X: 1414 single-nucleotide polymorphisms (SNPs)
n = 233
Setting 1: y randomly permuted.
Setting 2: y randomly generated under the assumption of no association.
Setting 3: y generated from the linear model with
β(s) = (1.56, 1.09, 1.22, 0.06, 0.08, 0.012, 0.067, 0.047, 0.07, 0.05)^T
Setting 4: as Setting 3 with
β(s) = (0.31, 0.23, 0.42, 0.32, 0.33, 0.26, 0.41, 0.29, 0.35, 0.69)^T
Results of Case 3

[results table from the original slide not reproduced]
Summary: BIC_γ

γ = 1: effectively controls FDR; consistent for any κ > 0 (p_n = O(n^κ)).
γ = 0 (original BIC): a slightly better PSR but much worse FDR; consistent only for κ < 0.5 (p_n = O(n^κ)).
BIC_γ incurs a small loss in PSR but tightly controls FDR.
Another way of choosing γ:
1. solve P = n^κ for κ, i.e. κ = log P / log n
2. set γ = 1 − 1/(2κ)
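The two-step rule above is a one-liner; a minimal sketch (the clipping of γ to [0, 1] is my addition, since the criterion is defined for 0 ≤ γ ≤ 1):

```python
from math import log

def choose_gamma(P, n):
    """Data-driven gamma from the summary: solve P = n^kappa for kappa,
    then set gamma = 1 - 1/(2*kappa).  Clipping to [0, 1] is an added
    assumption, since BIC_gamma is defined for gamma in that range."""
    kappa = log(P) / log(n)
    return min(1.0, max(0.0, 1.0 - 1.0 / (2.0 * kappa)))

# e.g. the Case 2 dimensions P = 1000, n = 200
print(round(choose_gamma(P=1000, n=200), 3))  # 0.616
```

When P grows slowly enough that κ < 0.5, the rule returns γ = 0 and the criterion reduces to the ordinary BIC, matching the consistency region above.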