Extended Bayesian Information Criteria for Model Selection with Large Model Spaces

Extended Bayesian Information Criteria for Model Selection with Large Model Spaces
Jiahua Chen, University of British Columbia
Zehua Chen, National University of Singapore
(Biometrika, 2008)

Variable Selection

Classical criteria:
- Akaike Information Criterion (Akaike, 1973)
- Bayesian Information Criterion (Schwarz, 1978)
- Cross-Validation (Stone, 1974)

Large P, small n settings (e.g., genome-wide association studies) pose a great challenge: the criteria above select many spurious covariates.

Bayesian Information Criterion (BIC)

{(y_i, x_i) : i = 1, ..., n}: independent observations with density f(y_i | x_i, θ), θ ∈ R^P
L_n(θ) = ∏_{i=1}^n f(y_i | x_i, θ)
For s ⊂ {1, ..., P}, θ(s): θ with the components outside s set to 0

BIC(s) = −2 log L_n{θ̂(s)} + ν(s) log n,

where θ̂(s) is the MLE of θ(s) and ν(s) is the number of components in s.
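
As a concrete illustration, here is a minimal Python sketch of BIC(s) for a Gaussian linear model, with the error variance profiled out; the function name bic and its interface are illustrative, not the authors' code.

import numpy as np

def bic(y, X, s):
    """BIC(s) = -2 log L_n{theta_hat(s)} + nu(s) log n, Gaussian errors, sigma^2 profiled."""
    n = len(y)
    Xs = X[:, s]                                  # design restricted to submodel s
    beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta_hat) ** 2)
    # -2 log-likelihood at the MLE, dropping the additive constant n log(2*pi) + n
    return n * np.log(rss / n) + len(s) * np.log(n)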

BIC: approximate Bayes approach

S: model space under consideration
p(s): prior probability of model s
The posterior probability of model s is

Pr(s | Z) ∝ Pr(Z | s) p(s), where Pr(Z | s) = ∫ Pr(Z | θ(s), s) Pr(θ(s) | s) dθ(s).

BIC: approximate Bayes approach

Choose the s that maximizes log{Pr(Z | s) p(s)}. A Laplace approximation to the integral, followed by some simplifications, gives

log Pr(Z | s) = log L_n{θ̂(s)} − (ν(s)/2) log n + O(1).

So

log{Pr(Z | s) p(s)} = log Pr(Z | s) + log p(s)
                    = log L_n{θ̂(s)} − (ν(s)/2) log n + O(1) + log p(s)
                    = −(1/2) BIC(s) + O(1) + log p(s).

Constant prior assumption behind BIC

An implicit assumption underlying the use of BIC: p(s) is constant over s (a uniform prior). This may not be reasonable when P is large.

S_j: the class of models containing exactly j covariates (e.g., S_1 is the collection of models with a single covariate)
Under the constant prior, Pr(S_j) ∝ τ(S_j), where τ(S_j) is the size of S_j.

Example: P = 1000 gives τ(S_1) = 1000 vs. τ(S_2) = 1000 · 999/2 = 499500.
The constant prior therefore prefers larger models.
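
The example's arithmetic, checked in a few lines of Python (math.comb is the binomial coefficient; looping over j = 1, 2, 3 is arbitrary):

from math import comb

P = 1000
for j in (1, 2, 3):
    print(j, comb(P, j))             # 1000, 499500, 166167000

Under the uniform prior over models, S_2 thus receives roughly 500 times the prior mass of S_1, and S_3 roughly 333 times that of S_2.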

Extended BIC

S = ∪_{j=1}^P S_j
Pr(S_j) ∝ τ^ξ(S_j), where 0 ≤ ξ ≤ 1
Pr(s | S_j) = 1/τ(S_j) for any s ∈ S_j (all models in S_j are equally plausible)
Hence p(s) ∝ τ^{−γ}(S_j) for s ∈ S_j, where γ = 1 − ξ.

Extended BIC:

BIC_γ(s) = −2 log L_n{θ̂(s)} + ν(s) log n + 2γ log τ(S_j),   for s ∈ S_j, 0 ≤ γ ≤ 1.
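
A minimal sketch of BIC_γ in the same style, assuming τ(S_j) = C(P, j) as on the slide and reusing the bic helper sketched earlier; names are illustrative:

from math import comb, log

def ebic(y, X, s, gamma):
    """BIC_gamma(s) = BIC(s) + 2 gamma log tau(S_j), with tau(S_j) = C(P, j)."""
    P = X.shape[1]
    j = len(s)                       # nu(s), the model size
    return bic(y, X, s) + 2.0 * gamma * log(comb(P, j))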

Consistency of the Extended BIC

P = p_n = O(n^κ) as n → ∞, for some κ > 0

Model:

y_n = X_n β + ε_n,   (1)

where ε_n ~ N(0, σ² I_n).

s_0: the smallest subset of {1, ..., p_n} such that μ_n = E(y_n) = X_n(s_0) β(s_0), where X_n(s_0) and β(s_0) are the design matrix and coefficients corresponding to s_0. Call s_0 the true submodel.

Asymptotic identifiability

K_0 = ν(s_0)
H_n(s): projection matrix onto the column space of X_n(s)
Δ_n(s) = ‖μ_n − H_n(s) μ_n‖²

Condition 1 (asymptotic identifiability): Model (1) with true submodel s_0 is asymptotically identifiable if

lim_{n→∞} min{ (log n)^{−1} Δ_n(s) : s ≠ s_0, ν(s) ≤ K_0 } = ∞.

In words: any other model of comparable size cannot predict the response well.
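
For intuition, Δ_n(s) can be computed by projecting μ_n onto the span of X_n(s); a small sketch under the same illustrative conventions:

import numpy as np

def delta_n(mu, X, s):
    """Delta_n(s) = ||mu_n - H_n(s) mu_n||^2, via least-squares projection."""
    coef, *_ = np.linalg.lstsq(X[:, s], mu, rcond=None)
    resid = mu - X[:, s] @ coef      # component of mu orthogonal to col(X_n(s))
    return float(resid @ resid)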

Consistency of the Extended BIC

Theorem. Assume that p_n = O(n^κ) for some constant κ > 0. If γ > 1 − 1/(2κ), then, under the asymptotic identifiability condition,

Pr[ min{ BIC_γ(s) : ν(s) = j, s ≠ s_0 } > BIC_γ(s_0) ] → 1

for j = 1, ..., K, as n → ∞, where K is an upper bound for K_0.

If γ = 1: consistent for any κ > 0.
If γ = 0 (original BIC): consistent only for κ < 0.5.

Simulation Studies

γ = 0, 0.5, and 1 are compared. One cannot afford to compute BIC_γ(s) for all possible s, so candidate models are generated with LASSO (Tibshirani, 1996) and SCAD (Fan and Li, 2001): increase the penalty parameter λ gradually, and compute BIC_γ for the resulting sequence of nested models.

If P ≫ n, a tournament approach is used (Chen and Chen, 2008); see the sketch below:
1. randomly partition the covariates into groups (each with n/2 covariates)
2. apply LASSO or SCAD within each group
3. pool the nonzero components
4. repeat if necessary
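
A rough sketch of one tournament round, using scikit-learn's Lasso as the screener; ranking survivors by absolute coefficient size at a fixed alpha (rather than tracing the lasso path as λ increases) is an assumption of this sketch, not the authors' exact procedure:

import numpy as np
from sklearn.linear_model import Lasso

def tournament_round(X, y, group_size, keep, alpha=0.1, seed=None):
    """Screen each random group of covariates and pool the survivors."""
    rng = np.random.default_rng(seed)
    P = X.shape[1]
    idx = rng.permutation(P)                     # random partition of covariates
    survivors = []
    for start in range(0, P, group_size):
        cols = idx[start:start + group_size]
        fit = Lasso(alpha=alpha).fit(X[:, cols], y)
        order = np.argsort(-np.abs(fit.coef_))   # strongest coefficients first
        survivors.extend(cols[order[:keep]].tolist())
    return np.array(sorted(survivors))           # pooled candidate covariates

Calling it again on the pooled columns implements step 4.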

Simulation Studies: Case 1

Linear model in all cases:

y_i = x_i^T(s) β(s) + ε_i, for some s, where ε_i ~ N(0, 1)

Case 1: P = 50, n = 200, with three correlation structures:
(i) cor(x_j, x_k) = ρ for all (j, k)
(ii) cor(x_j, x_k) = ρ for k = j ± 1
(iii) cor(x_j, x_k) = ρ^{|k−j|}

β(s) = (7, 9, 4, 3, 10, 2, 2, 1)^T / 10
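
The three correlation structures can be generated as follows, assuming standard normal covariates (a sketch, not the paper's simulation code); note that the banded structure (ii) is positive definite only for |ρ| < 0.5:

import numpy as np

def make_X(n, P, rho, structure, seed=None):
    """Draw n rows of P correlated N(0, 1) covariates."""
    rng = np.random.default_rng(seed)
    d = np.abs(np.subtract.outer(np.arange(P), np.arange(P)))  # |j - k|
    if structure == "constant":        # (i)  cor(x_j, x_k) = rho for all j != k
        S = np.where(d > 0, rho, 1.0)
    elif structure == "banded":        # (ii) cor(x_j, x_k) = rho for k = j +/- 1
        S = np.where(d == 1, rho, 0.0)
        np.fill_diagonal(S, 1.0)
    else:                              # (iii) cor(x_j, x_k) = rho^{|k - j|}
        S = rho ** d
    return rng.multivariate_normal(np.zeros(P), S, size=n)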

Results of Case 1

s*: the model selected by the extended BIC
Positive Selection Rate (PSR): ν(s* ∩ s) / ν(s)
False Discovery Rate (FDR): ν(s* \ s) / ν(s*)
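
PSR and FDR computed from index sets, matching the definitions above (an illustrative helper):

def psr_fdr(selected, true):
    """Positive selection rate and false discovery rate from index sets."""
    selected, true = set(selected), set(true)
    psr = len(selected & true) / len(true)              # nu(s* & s) / nu(s)
    fdr = len(selected - true) / max(len(selected), 1)  # nu(s* - s) / nu(s*)
    return psr, fdr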

Simulation Studies: Case 2

P = 1000, n = 200
20 groups of size 50
First 10 groups: generated as in Case 1
The other 10 groups: generated from a discrete distribution
β(s) = (10, 7, 5, 3, 2)^T / 10

The tournament approach:
1. the 1000 covariates are partitioned into subsets of size 100
2. from each subset, 12 covariates are selected
3. the 120 pooled covariates are reduced to 20 final candidate covariates

Results of Case 2

[Results table not transcribed.]

Simulation Studies: Case 3

A dataset from a genome-wide association study:
y: mRNA level of a particular gene
X: 1414 single-nucleotide polymorphisms (SNPs)
n = 233

Setting 1: y randomly permuted
Setting 2: y randomly generated under the assumption of no association
Setting 3: y generated from the linear model with β(s) = (1.56, 1.09, 1.22, 0.06, 0.08, 0.012, 0.067, 0.047, 0.07, 0.05)^T
Setting 4: as Setting 3, with β(s) = (0.31, 0.23, 0.42, 0.32, 0.33, 0.26, 0.41, 0.29, 0.35, 0.69)^T

Results of Case 3

[Results table not transcribed.]

Summary: BIC_γ

γ = 1: effectively controls FDR; consistent for any κ > 0 (p_n = O(n^κ))
γ = 0 (original BIC): a slightly better PSR but much worse FDR; consistent only for κ < 0.5 (p_n = O(n^κ))
BIC_γ incurs a small loss in PSR but tightly controls FDR.

Another way of choosing γ: solve P = n^κ for κ, then set γ = 1 − 1/(2κ); see the sketch below.
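
That recipe in code, solving P = n^κ for κ = log P / log n (clipping γ to [0, 1] is an added safeguard, not from the slides):

from math import log

def choose_gamma(P, n):
    kappa = log(P) / log(n)           # solve P = n^kappa
    gamma = 1.0 - 1.0 / (2.0 * kappa)
    return min(1.0, max(0.0, gamma))  # clip to the [0, 1] range of BIC_gamma

# e.g. choose_gamma(1000, 200): kappa ~ 1.30, gamma ~ 0.62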