Extended Bayesian Information Criteria for Model Selection with Large Model Spaces
Jiahua Chen, University of British Columbia
Zehua Chen, National University of Singapore
(Biometrika, 2008)
Variable Selection

Classical criteria:
- Akaike Information Criterion (Akaike, 1973)
- Bayesian Information Criterion (Schwarz, 1978)
- Cross-Validation (Stone, 1974)

Large P, small n settings, e.g. genome-wide association studies:
the criteria above select many spurious covariates. A great challenge.
Bayesian Information Criterion (BIC)

{(y_i, x_i) : i = 1, ..., n}: independent observations with density f(y_i | x_i, θ), θ ∈ R^P
L_n(θ) = ∏_{i=1}^n f(y_i | x_i, θ)
For s ⊂ {1, ..., P}, θ(s): θ with the components outside s set to 0

BIC(s) = −2 log L_n{θ̂(s)} + ν(s) log n,

where θ̂(s): MLE of θ(s), ν(s): number of components in s.
BIC: approximate Bayes approach

S: model space under consideration; p(s): prior probability of model s.
The posterior probability of model s is

Pr(s | Z) ∝ Pr(Z | s) p(s),

where Pr(Z | s) = ∫ Pr(Z | θ(s), s) p(θ(s) | s) dθ(s).
BIC: approximate Bayes approach

Select the s that maximizes log{Pr(Z | s) p(s)}.
A (Laplace) approximation to the integral, followed by some simplifications, gives

log Pr(Z | s) = log L_n{θ̂(s)} − (ν(s)/2) log n + O(1),

so

log{Pr(Z | s) p(s)} = log L_n{θ̂(s)} − (ν(s)/2) log n + O(1) + log p(s)
                    = −(1/2) BIC(s) + O(1) + log p(s).
Constant prior assumption behind BIC

An implicit assumption underlying the use of BIC: p(s) is constant over s (uniform prior).
This may not be reasonable with large P.
S_j: the class of models containing j covariates; e.g. S_1 is the collection of models with a single covariate.
Under the constant prior, Pr(S_j) ∝ τ(S_j), where τ(S_j): size of S_j.
Ex. P = 1000: τ(S_1) = 1000 vs τ(S_2) = 1000 · 999/2.
The constant prior prefers larger models.
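The size comparison above can be checked directly: under a uniform prior over individual models, the prior mass of the class S_j is proportional to the binomial coefficient C(P, j). A minimal sketch:

```python
from math import comb

# Under a uniform prior over all 2^P models, the prior mass of S_j
# (models with exactly j covariates) is proportional to tau(S_j) = C(P, j).
P = 1000
tau_S1 = comb(P, 1)   # number of one-covariate models
tau_S2 = comb(P, 2)   # number of two-covariate models: 1000 * 999 / 2

print(tau_S1, tau_S2)          # 1000 499500
print(tau_S2 / tau_S1)         # S_2 carries ~500x the prior mass of S_1
```

This is exactly why the constant prior favors larger model classes: the class sizes grow combinatorially in j.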
Extended BIC

S = ∪_{j=1}^P S_j
Take Pr(S_j) ∝ τ^ξ(S_j), where 0 ≤ ξ ≤ 1, and Pr(s | S_j) = 1/τ(S_j) for any s ∈ S_j
(all models in S_j are equally plausible).
Then p(s) ∝ τ^{−γ}(S_j) for s ∈ S_j, where γ = 1 − ξ.

Extended BIC:

BIC_γ(s) = −2 log L_n{θ̂(s)} + ν(s) log n + 2γ log τ(S_j), where 0 ≤ γ ≤ 1
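For a Gaussian linear model, −2 log L_n{θ̂(s)} equals n·log(RSS_s/n) up to an additive constant that is the same for every model, and τ(S_j) = C(P, j). A minimal sketch of computing BIC_γ on toy data (the function name and data are illustrative, not from the paper):

```python
import numpy as np
from math import comb, log

def ebic(y, X, support, gamma, P):
    """Extended BIC for a Gaussian linear model restricted to `support`.

    Uses -2 log L = n * log(RSS/n) up to a model-independent constant,
    which cancels when comparing models.
    """
    n = len(y)
    k = len(support)
    if k == 0:
        rss = float(y @ y)  # null model: no covariates
    else:
        Xs = X[:, support]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        rss = float(resid @ resid)
    # ordinary BIC plus the extra 2*gamma*log tau(S_k) term
    return n * log(rss / n) + k * log(n) + 2 * gamma * log(comb(P, k))

# toy check with simulated data: the true covariate should score lower
rng = np.random.default_rng(0)
n, P = 100, 20
X = rng.standard_normal((n, P))
y = 2.0 * X[:, 0] + rng.standard_normal(n)
print(ebic(y, X, [0], gamma=1.0, P=P) < ebic(y, X, [1], gamma=1.0, P=P))
```

Setting gamma=0 recovers the ordinary BIC, since log C(P, k) then drops out.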
Consistency of the Extended BIC

P = p_n = O(n^κ) as n → ∞, for some κ > 0.
Model:

y_n = X_n β + ε_n,  where ε_n ~ N(0, σ² I_n),   (1)

s_0: the smallest subset of {1, ..., p_n} such that μ_n = E(y_n) = X_n(s_0) β(s_0),
where X_n(s_0), β(s_0): design matrix and coefficients corresponding to s_0.
Call s_0 the true submodel.
Asymptotic identifiability

K_0 = ν(s_0); H_n(s): projection matrix onto the column space of X_n(s).
Δ_n(s) = ‖μ_n − H_n(s)μ_n‖²

Condition 1: Asymptotic identifiability. Model (1) with true submodel s_0 is asymptotically identifiable if

lim_{n→∞} min{(log n)^{−1} Δ_n(s) : s ≠ s_0, ν(s) ≤ K_0} = ∞.

In words: any other model of comparable size cannot predict the mean response well.
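Δ_n(s) can be evaluated without forming H_n(s) explicitly, since H_n(s)μ_n is just the least-squares projection of μ_n onto the span of X_n(s). A small sketch on made-up data (the function name is illustrative):

```python
import numpy as np

def delta_n(mu, X_s):
    """Identifiability gap Delta_n(s) = ||mu - H_n(s) mu||^2:
    squared distance from the true mean mu to the column space of X_n(s)."""
    # least-squares fit of mu on X_n(s) gives the projection H_n(s) mu
    coef, *_ = np.linalg.lstsq(X_s, mu, rcond=None)
    resid = mu - X_s @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
mu = X[:, [0, 1]] @ np.array([1.0, -2.0])  # true mean lies in span of columns 0, 1

print(delta_n(mu, X[:, [0, 1]]))       # ~0: correct support, mu is in the span
print(delta_n(mu, X[:, [2, 3]]) > 0)   # a wrong model of the same size leaves a gap
```

The condition asks this gap, for every wrong model of size at most K_0, to grow faster than log n.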
Consistency of the Extended BIC

Theorem. Assume that p_n = O(n^κ) for some constant κ. If γ > 1 − 1/(2κ), then, under the asymptotic identifiability condition,

Pr[min{BIC_γ(s) : ν(s) = j, s ≠ s_0} > BIC_γ(s_0)] → 1, for j = 1, ..., K,

as n → ∞ (K is an upper bound for K_0).

- If γ = 1: consistent for any κ > 0.
- If γ = 0 (original BIC): consistent only for κ < 0.5.
Simulation Studies

γ = 0, 0.5 and 1.
Cannot afford to compute BIC_γ(s) for all possible s.
Instead: LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001).
Increase the penalty λ gradually; BIC_γ is computed for the resulting sequence of nested models.
If P ≥ n, a tournament approach (Chen and Chen, 2008):
1. randomly partition the covariates (each group: n/2 covariates)
2. apply LASSO or SCAD within each group
3. pool the nonzero components
4. repeat if necessary
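The idea of scoring a nested model sequence with BIC_γ can be sketched as follows. As a stand-in for the LASSO/SCAD path used in the slides, greedy forward selection (my substitution, not the paper's method) produces the nested sequence; BIC_γ then picks the stopping point:

```python
import numpy as np
from math import comb, log

def ebic_gamma(y, Xs, n, P, k, gamma):
    # Gaussian BIC_gamma up to a model-independent additive constant
    if k == 0:
        rss = float(y @ y)
    else:
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        rss = float(r @ r)
    return n * log(rss / n) + k * log(n) + 2 * gamma * log(comb(P, k))

def select_by_ebic(y, X, gamma, max_k=10):
    """Build a nested model sequence by greedy forward selection (a stand-in
    for the LASSO/SCAD path) and return the BIC_gamma-minimizing support."""
    n, P = X.shape
    support, remaining = [], list(range(P))
    best, best_score = [], ebic_gamma(y, None, n, P, 0, gamma)
    for _ in range(min(max_k, P)):
        # add the covariate giving the largest RSS reduction
        def rss_with(j):
            Xs = X[:, support + [j]]
            b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            r = y - Xs @ b
            return float(r @ r)
        j_best = min(remaining, key=rss_with)
        support.append(j_best)
        remaining.remove(j_best)
        score = ebic_gamma(y, X[:, support], n, P, len(support), gamma)
        if score < best_score:
            best, best_score = list(support), score
    return sorted(best)

# toy run: three strong true covariates among 50
rng = np.random.default_rng(2)
n, P = 200, 50
X = rng.standard_normal((n, P))
y = X[:, [0, 1, 2]] @ np.array([1.5, -1.0, 2.0]) + rng.standard_normal(n)
print(select_by_ebic(y, X, gamma=1.0))
```

With strong signals like these, the BIC_γ minimizer should contain the true support {0, 1, 2} and add few or no spurious covariates.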
Simulation Studies: Case 1

Linear model in all cases:

y_i = x_i^T(s) β(s) + ε_i, for some s, where ε_i ~ N(0, 1)

Case 1: P = 50, n = 200, with correlation structures
(i) cor(x_j, x_k) = ρ for all (j, k)
(ii) cor(x_j, x_k) = ρ for k = j ± 1
(iii) cor(x_j, x_k) = ρ^{|k−j|}
β(s) = (7, 9, 4, 3, 10, 2, 2, 1)^T / 10
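Structure (iii) is an AR(1)-type correlation, which can be generated with a Cholesky factor of the correlation matrix; a minimal sketch (not the authors' code):

```python
import numpy as np

def ar1_design(n, P, rho, rng):
    """Gaussian design with cor(x_j, x_k) = rho^|k-j| (Case 1, structure iii)."""
    idx = np.arange(P)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1) correlation matrix
    L = np.linalg.cholesky(Sigma)
    # rows are i.i.d. N(0, Sigma) covariate vectors
    return rng.standard_normal((n, P)) @ L.T

rng = np.random.default_rng(3)
X = ar1_design(5000, 10, rho=0.5, rng=rng)
# sample correlations should be near 0.5 at lag 1 and 0.25 at lag 2
print(round(np.corrcoef(X[:, 0], X[:, 1])[0, 1], 2))
```

Structures (i) and (ii) follow the same recipe with the equicorrelated or tridiagonal correlation matrix in place of Sigma.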
Results of Case 1

s*: the model selected by the extended BIC.
Positive Selection Rate (PSR): ν(s ∩ s*)/ν(s)
False Discovery Rate (FDR): ν(s* \ s)/ν(s*)
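The two rates above are simple index-set arithmetic; a minimal sketch (function name illustrative):

```python
def psr_fdr(true_s, selected_s):
    """Positive selection rate and false discovery rate of a selected model.

    PSR = |selected ∩ true| / |true|
    FDR = |selected \\ true| / |selected|  (0 if nothing is selected)
    """
    true_s, selected_s = set(true_s), set(selected_s)
    psr = len(selected_s & true_s) / len(true_s)
    fdr = len(selected_s - true_s) / len(selected_s) if selected_s else 0.0
    return psr, fdr

# 2 of 4 true covariates found, 1 of 3 selections spurious
print(psr_fdr({0, 1, 2, 3}, {0, 1, 7}))  # (0.5, 0.3333...)
```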
Simulation Studies: Case 2

P = 1000, n = 200.
Covariates in 20 groups of size 50.
First 10 groups: generated as in Case 1; the other 10 groups: generated from a discrete distribution.
β(s) = (10, 7, 5, 3, 2)^T / 10
The tournament approach:
1. the 1000 covariates are partitioned into subsets of size 100
2. from each subset, 12 covariates are selected
3. the pooled 120 covariates are reduced to 20 final candidate covariates
Results of Case 2

[results table from the original slide not reproduced]
Simulation Studies: Case 3

A dataset from a genome-wide association study:
y: mRNA level of a particular gene
X: 1414 single-nucleotide polymorphisms (SNPs)
n = 233
Setting 1: y randomly permuted.
Setting 2: y randomly generated under the assumption of no association.
Setting 3: y generated from the linear model with
β(s) = (1.56, 1.09, 1.22, 0.06, 0.08, 0.012, 0.067, 0.047, 0.07, 0.05)^T
Setting 4: as Setting 3 with
β(s) = (0.31, 0.23, 0.42, 0.32, 0.33, 0.26, 0.41, 0.29, 0.35, 0.69)^T
Results of Case 3

[results table from the original slide not reproduced]
Summary: BIC_γ

γ = 1: effectively controls FDR; consistent for any κ > 0 (p_n = O(n^κ)).
γ = 0 (original BIC): a slightly better PSR but much worse FDR; consistent only for κ < 0.5 (p_n = O(n^κ)).
BIC_γ incurs a small loss in PSR but tightly controls FDR.
Another way of choosing γ:
1. solve P = n^κ for κ, i.e. κ = log P / log n
2. set γ = 1 − 1/(2κ)
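The two-step rule above is a one-liner; a minimal sketch (the clipping of γ to [0, 1] is my addition, since the criterion is defined for 0 ≤ γ ≤ 1):

```python
from math import log

def choose_gamma(P, n):
    """Data-driven gamma from the summary: solve P = n^kappa for kappa,
    then set gamma = 1 - 1/(2*kappa).  Clipping to [0, 1] is an added
    assumption, since BIC_gamma is defined for gamma in that range."""
    kappa = log(P) / log(n)
    return min(1.0, max(0.0, 1.0 - 1.0 / (2.0 * kappa)))

# e.g. the Case 2 dimensions P = 1000, n = 200
print(round(choose_gamma(P=1000, n=200), 3))  # 0.616
```

When P grows slowly enough that κ < 0.5, the rule returns γ = 0 and the criterion reduces to the ordinary BIC, matching the consistency region above.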