Model selection in penalized Gaussian graphical models

Ernst Wit, University of Groningen. e.c.wit@rug.nl, http://www.math.rug.nl/~ernst. January 2014.

Penalized likelihood generates a PATH of solutions

Consider an experiment: Γ genes measured across T time points. Assume n iid samples y^(1), ..., y^(n), where y^(i) = (y^(i)_1, ..., y^(i)_{ΓT}). Assume Y^(i) ~ N(0, K^{-1}); then the likelihood is

l(K | y) = (n/2) {log|K| - tr(SK)}.

AIM: optimization of the penalized likelihood

K̂_λ := argmax_K l(K | y)  subject to K ≻ 0, ||K||_1 ≤ 1/λ,

... for λ in some range (and possibly some factorial colouring F)!
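
The constrained problem above is usually attacked in an l1-penalized (Lagrangian) form, the graphical lasso. A minimal sketch, not from the slides, using scikit-learn's GraphicalLasso; its alpha parameter plays the role of the slide's tuning parameter, and the simulated data and alpha grid are arbitrary assumptions:

```python
# Sketch (not from the slides): a path of penalized precision-matrix estimates
# with scikit-learn's GraphicalLasso, which solves an l1-penalized (Lagrangian)
# version of the constrained problem on the slide; alpha is the tuning parameter.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 100, 20                              # n iid samples of a p-dimensional vector
X = rng.standard_normal((n, p))             # placeholder data (assumption)
S = np.cov(X, rowvar=False, bias=True)      # ML sample covariance

def gauss_loglik(K, S, n):
    """l(K | y) = (n/2) * (log|K| - tr(SK)), up to an additive constant."""
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * n * (logdet - np.trace(S @ K))

for alpha in [0.05, 0.1, 0.2, 0.4]:         # arbitrary grid of penalty values
    K_hat = GraphicalLasso(alpha=alpha).fit(X).precision_
    n_edges = np.count_nonzero(np.triu(K_hat, k=1))
    print(f"alpha={alpha:.2f}  loglik={gauss_loglik(K_hat, S, n):9.2f}  edges={n_edges}")
```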

Between proximity and truth

True process for the data Y: Y ~ Q. A statistical model is a collection of measures, M_i = {P_θ : θ ∈ Θ_i}, and typically we consider several of them: M = {M_1, ..., M_k}. What is the best model?
Proximity: the model that is closest to the truth, min_{i ≤ k} KL(P_{θ̂_i}; Q).
Truth: the model that is most likely to be the truth, max_{i ≤ k} P(M_i | Y).

Truth

With a flat prior on Θ_M and on M, the model probability for M ∈ M is

p(M | y) ∝ p(y | M) ∝ ∫_{Θ_M} e^{n l̄(θ)} dθ
         ≈ e^{l(θ̂)} ∫_{Θ_M} e^{-(1/2)(θ - θ̂)^t [-n ∂²_θ l̄(θ̂)] (θ - θ̂)} dθ
         = e^{l(θ̂)} (2π)^{p/2} n^{-p/2} |-∂²_θ l̄(θ̂)|^{-1/2},

where p = dim(Θ_M) and l̄(θ̂) = l(θ̂)/n. Schwarz (1978) ignored the terms not depending on n:

p(M | y) ≈ e^{l(θ̂)} n^{-p/2}, for large n.
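
As a sanity check on the Laplace approximation above (my own illustration, not in the slides): for a Bernoulli model with a flat prior the marginal likelihood is available in closed form, so the approximation can be compared directly; the sample size and success count below are arbitrary:

```python
# Minimal sketch (assumption: Bernoulli model, flat prior on theta in (0,1)),
# comparing the exact log marginal likelihood with the Laplace approximation
#   log p(y|M) ~= l(theta_hat) + (p/2)log(2*pi) - (p/2)log(n)
#                 - (1/2) log| -lbar''(theta_hat) |,  here p = 1.
import math

n, s = 50, 18                      # n Bernoulli trials, s successes (arbitrary)
theta_hat = s / n                  # MLE
l_hat = s * math.log(theta_hat) + (n - s) * math.log(1 - theta_hat)
# mean log-likelihood curvature: lbar''(theta_hat) = l''(theta_hat)/n
lbar_dd = -(s / theta_hat**2 + (n - s) / (1 - theta_hat)**2) / n

log_laplace = l_hat + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(n) \
              - 0.5 * math.log(-lbar_dd)
# exact: p(y|M) = Integral theta^s (1-theta)^(n-s) dtheta = B(s+1, n-s+1)
log_exact = math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)
log_schwarz = l_hat - 0.5 * math.log(n)    # keep only the n-dependent terms

print(f"exact {log_exact:.3f}  Laplace {log_laplace:.3f}  Schwarz {log_schwarz:.3f}")
```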

BIC: Bayesian information criterion

Taking logs, multiplying by -2 and ignoring constant terms gives

BIC(M) = -2 l(θ̂) + p log(n).

Minimizing BIC corresponds to maximizing the (approximate) posterior model probability. Define model weights

W(M) = e^{-BIC(M)/2},

which, rescaled, correspond to posterior model probabilities:

p(M_i | y) = W(M_i) / Σ_{M ∈ M} W(M).
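
A minimal sketch (mine, not from the slides) of turning a set of criterion values into rescaled weights; the same rescaling reappears later for Akaike weights, and the example BIC values are hypothetical:

```python
# Sketch: rescale information-criterion values into model weights,
#   W(M) = exp(-IC(M)/2),  p(M_i) = W(M_i) / sum_M W(M).
# Works for BIC (approximate posterior model probabilities) and for AIC (Akaike weights).
import numpy as np

def ic_weights(ic_values):
    ic = np.asarray(ic_values, dtype=float)
    w = np.exp(-0.5 * (ic - ic.min()))   # subtracting the minimum does not change the ratios
    return w / w.sum()

print(ic_weights([1210.3, 1207.8, 1215.1]))   # hypothetical BIC values for three models
```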

BIC for Gaussian graphical models

The BIC for an estimated Gaussian graphical model K̃ is

BIC(K̃) = n(-log|K̃| + tr(S K̃)) + p_K̃ log(n),

where p_K̃ = #{unique non-zeroes in K̃}, which is at most p(p+1)/2.

But... BIC is an asymptotic approximation. What is n? And for a penalized estimate K̃, is the effective number of parameters not smaller?

ebic: extended Bayesian information criterion

The extended BIC for an estimated Gaussian graphical model K̃ is

ebic_γ(K̃) = n(-log|K̃| + tr(S K̃)) + p_K̃ log(n) + 4 γ p_K̃ log(p),

where p_K̃ = #{unique non-zeroes in K̃}, p = number of nodes, and γ = tuning parameter (0 ≤ γ ≤ 1).
γ = 0: ordinary BIC; γ = 1: additional sparsity; γ = 0.5: a good trade-off (Foygel & Drton, 2010).
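
A minimal sketch (mine) of the two criteria above; treating p_K̃ as the number of non-zero entries on and above the diagonal of K̃ is an assumption about the slide's counting convention:

```python
# Sketch: BIC and extended BIC for an estimated precision matrix K_hat,
# sample covariance S and sample size n. Assumption: p_K counts the unique
# non-zero entries of the symmetric K_hat (diagonal plus upper triangle).
import numpy as np

def n_params(K_hat, tol=1e-8):
    return int(np.count_nonzero(np.abs(np.triu(K_hat)) > tol))

def bic_ggm(K_hat, S, n):
    _, logdet = np.linalg.slogdet(K_hat)
    fit = n * (-logdet + np.trace(S @ K_hat))      # -2 * log-likelihood, up to a constant
    return fit + n_params(K_hat) * np.log(n)

def ebic_ggm(K_hat, S, n, gamma=0.5):
    p = K_hat.shape[0]                             # number of nodes
    return bic_ggm(K_hat, S, n) + 4 * gamma * n_params(K_hat) * np.log(p)
```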

Proximity

Let the true density of the data Y be: Y ~ q = dQ. The Kullback-Leibler (KL) divergence between the fitted and the true model is

KL(θ̂_i) = ∫ q(y) log q(y) dy - ∫ q(y) log p(y; θ̂_i) dy
         = C - E_q( l(θ̂_i) )
         ≈ C - l(θ̂_i) + p_i,

where p(·; θ) is the density associated with P_θ and

p_i = Trace( J_i^{-1} K_i ) ≈ dim(Θ_i),

with J_i and K_i the Fisher-information matrices under model M_i:

J_i = -E_q( ∂² log f(Y, θ̂_i) / ∂θ ∂θ^t ),   K_i = V_q( ∂ log f(Y, θ̂_i) / ∂θ ).

AIC: Akaike's information criterion

Multiplying by 2 and ignoring the constant terms gives

AIC(M) = -2 l(θ̂) + 2p.

Minimizing AIC corresponds to minimizing the estimated KL divergence from the truth. Define Akaike weights

W(M) = e^{-AIC(M)/2},

which, rescaled, correspond to probability weights that add up to one:

p(M_i) = W(M_i) / Σ_{M ∈ M} W(M).

AIC for Gaussian graphical models

The AIC for an estimated Gaussian graphical model K̃ is

AIC(K̃) = n(-log|K̃| + tr(S K̃)) + 2 p_K̃,

where p_K̃ = #{unique non-zeroes in K̃}, which is at most p(p+1)/2.

But... AIC is an asymptotic approximation. What is n? And for a penalized estimate K̃, is the effective number of parameters not smaller? In the next slides, we propose three alternatives.
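
The corresponding AIC sketch (again mine, same counting convention as the BIC sketch above); only the penalty term changes:

```python
# Sketch: AIC for an estimated precision matrix; compared with the BIC sketch
# above only the penalty term changes (2 * p_K instead of p_K * log(n)).
import numpy as np

def aic_ggm(K_hat, S, n, tol=1e-8):
    p_K = int(np.count_nonzero(np.abs(np.triu(K_hat)) > tol))   # unique non-zeroes
    _, logdet = np.linalg.slogdet(K_hat)
    return n * (-logdet + np.trace(S @ K_hat)) + 2 * p_K
```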

1. Exact AIC

KL(K_0 ‖ K̃) = (1/2) {tr(K̃ Σ_0) - log|K̃ Σ_0| - p}.

Scaling this by 2 and ignoring a constant,

2 KL(K_0 ‖ K̃) = -{log|K̃| - tr(K̃ S)} + 2 · (1/2) {tr(K̃(Σ_0 - S))}.

We can write this as

2 KL(K_0 ‖ K̃) = -(2/n) l(K̃) + 2 · (1/2) {tr(K̃(Σ_0 - S))}.

Definition (Degrees of freedom in a Gaussian graphical model). Let Y ~ N(0, K_0^{-1}) and let K̃ be an estimate of K_0:

df_K̃ = (1/2) {tr(K̃(Σ_0 - S))}.

Approximate Exact AIC

We obviously don't know Σ_0, but we can estimate it by

Σ̃_0 = π S + (1 - π) diag{σ²_11, ..., σ²_pp},

for some tuning parameter π.

Definition (Approximate Exact AIC). Let Y ~ N(0, K_0^{-1}) and let K̃ be an estimate of K_0:

AIC(K̃) = -(2/n) l(K̃) + 2 d̂f,   where d̂f = (1/2) {tr(K̃(Σ̃_0 - S))}.
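
A sketch (mine) of the approximate exact AIC, following the slide's formulas literally; interpreting σ²_jj as the sample variances, and the default choice of π, are assumptions:

```python
# Sketch: "approximate exact AIC" with the unknown Sigma_0 replaced by a
# shrinkage estimate  pi*S + (1-pi)*diag(S)  (pi and the diagonal target are
# assumptions about the slide's notation).
import numpy as np

def approx_exact_aic(K_hat, S, n, pi=0.5):
    sigma0_tilde = pi * S + (1 - pi) * np.diag(np.diag(S))   # estimate of Sigma_0
    df_hat = 0.5 * np.trace(K_hat @ (sigma0_tilde - S))      # estimated degrees of freedom
    _, logdet = np.linalg.slogdet(K_hat)
    loglik = 0.5 * n * (logdet - np.trace(S @ K_hat))        # l(K) = (n/2){log|K| - tr(SK)}
    return -(2.0 / n) * loglik + 2.0 * df_hat
```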

2. Exact GIC

Problem with AIC: K̂_λ is not an MLE. The GIC for an M-estimator K̂ was derived by Konishi & Kitagawa (1996):

GIC = -2 Σ_{k=1}^n l_k(K̂; x_k) + 2 tr{R^{-1} Q},   (1)

where R and Q are square matrices of order p², given by

R = -(1/n) Σ_{k=1}^n Dψ(x_k, K) |_{K=K̂},
Q = (1/n) Σ_{k=1}^n ψ(x_k, K) Dl_k(K) |_{K=K̂}.

Problem: the bias term in (1) requires inversion of the p² × p² matrix R.

Approximate GIC (Abbruzzo, Vujacic, Wit)

We derived an explicit estimator of the KL divergence that avoids the matrix inversion:

ĜIC(λ) = -2 l(K̂_λ) + 2 d̂f_GIC,

where

d̂f_GIC = (1/(2n)) Σ_{k=1}^n c(S_k ∘ I_λ)^t c(K̂_λ (S_k ∘ I_λ) K̂_λ) - (1/2) c(S ∘ I_λ)^t c(K̂_λ (S ∘ I_λ) K̂_λ),

c(·) is the vectorize operator, ∘ the elementwise product, S_k = x_k x_k^t, and I_λ = 1(K̂_λ ≠ 0) an indicator matrix.
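
A sketch (mine) of the d̂f_GIC formula as reconstructed above; reading the products with I_λ as elementwise products is an assumption:

```python
# Sketch of the approximate-GIC degrees of freedom as written above, reading
# "A o I" as the elementwise product and c() as vectorization.
# x has shape (n, p); K_hat is the penalized precision estimate for one lambda.
import numpy as np

def df_gic(K_hat, x, tol=1e-8):
    n = x.shape[0]
    S = x.T @ x / n                                      # ML sample covariance
    I_lam = (np.abs(K_hat) > tol).astype(float)          # indicator of non-zero entries
    def term(A):
        M = A * I_lam                                    # A o I_lambda
        return M.ravel() @ (K_hat @ M @ K_hat).ravel()   # c(M)^t c(K M K)
    total = sum(term(np.outer(xk, xk)) for xk in x)      # sum over S_k = x_k x_k^t
    return total / (2 * n) - term(S) / 2
```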

Approximate GIC: Simulations

[Figure: estimates of the bias term (left: df_KL, df_GIC, df_AIC) and of the KL divergence (right: KL, GIC, AIC) as functions of λ, for n = 100, d = 50.]

3. Exact Cross-validation (CV)

Ignoring an additive constant, the KL divergence is

KL(K_0 ‖ K̃) = -(1/2) log|K̃| + E_{Σ_0} (1/2) {tr(K̃ X X^t)}.

The idea of cross-validation is to replace the final expectation by a leave-one-out average,

KL(K_0 ‖ K̃) ≈ (1/(2n)) Σ_{k=1}^n {-log|K̃^(-k)| + tr(K̃^(-k) x_k x_k^t)}.

Clearly, this would require recalculating the estimate K̃^(-k) many times.

Approximate Cross-validation (Vujacic, Abbruzzo, Wit)

We write the KL divergence as

KL(K_0 ‖ K̂_λ) = -(1/n) l(K̂_λ) + bias,

where l(K) = n{log|K| - tr(KS)}/2 and bias = tr(K̂_λ(K_0^{-1} - S))/2.

Definition (LOOCV-inspired estimate of the Kullback-Leibler divergence).

KLCV(λ) = -(1/n) l(K̂_λ) + (1/(2n(n-1))) Σ_{i=1}^n c[(K̂_λ^{-1} - S_i) ∘ I_λ]^t (K̂_λ ⊗ K̂_λ) c[(S - S_i) ∘ I_λ],

where c(·) is the vectorize operator, ∘ the elementwise product, ⊗ the Kronecker product, S_i = x_i x_i^t, and I_λ = 1(K̂_λ ≠ 0) an indicator matrix.
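
A sketch (mine) of the KLCV estimate as reconstructed above; the identity (K̂ ⊗ K̂) c(B) = c(K̂ B K̂) is used to avoid forming the Kronecker product explicitly:

```python
# Sketch of the LOOCV-inspired KL estimate (KLCV) as reconstructed above;
# x has shape (n, p), K_hat is the penalized precision estimate for one lambda.
import numpy as np

def klcv(K_hat, x, tol=1e-8):
    n = x.shape[0]
    S = x.T @ x / n                                        # ML sample covariance
    _, logdet = np.linalg.slogdet(K_hat)
    loglik = 0.5 * n * (logdet - np.trace(K_hat @ S))      # l(K) = n{log|K| - tr(KS)}/2
    I_lam = (np.abs(K_hat) > tol).astype(float)            # indicator of non-zero entries
    K_inv = np.linalg.inv(K_hat)
    bias = 0.0
    for xi in x:
        S_i = np.outer(xi, xi)
        A = (K_inv - S_i) * I_lam                          # (K^-1 - S_i) o I_lambda
        B = (S - S_i) * I_lam                              # (S - S_i) o I_lambda
        # (K (x) K) c(B) = c(K B K), so the summand is c(A)^t (K (x) K) c(B) = <A, K B K>
        bias += np.sum(A * (K_hat @ B @ K_hat))
    return -loglik / n + bias / (2 * n * (n - 1))
```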

Proof

LOOCV = -(1/(2n)) Σ_{i=1}^n f(S_i, K̂^(-i))
      = -(1/2) f(S, K̂) - (1/(2n)) Σ_{i=1}^n [f(S_i, K̂^(-i)) - f(S_i, K̂)]
      ≈ -(1/n) l(K̂) - (1/(2n)) Σ_{i=1}^n [df(S_i, K̂)/dΩ]^t c(K̂^(-i) - K̂).

Matrix differential calculus: df(S_i, K̂)/dΩ = c(K̂^{-1} - S_i). The term c(K̂^(-i) - K̂) is obtained via the Taylor expansion

0 ≈ df(S, K̂)/dΩ + [d²f(S, K̂)/dΩ²] c(K̂^(-i) - K̂) + [d²f(S, K̂)/dΩ dS] c(S^(-i) - S).

Inserting df(S, K̂)/dΩ = c(K̂^{-1} - S), d²f(S, K̂)/dΩ dS = -I_{p²} and d²f(S, K̂)/dΩ² = -K̂^{-1} ⊗ K̂^{-1} gives

c(K̂^(-i) - K̂) ≈ -(K̂ ⊗ K̂) c(S^(-i) - S).
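
A small numerical check (mine, not part of the slides) of the two matrix-calculus facts used in the proof, via finite differences in a symmetric direction:

```python
# Finite-difference check of the derivatives of f(S, K) = log|K| - tr(KS):
# df/dK = K^{-1} - S  and  d(vec K^{-1})/d(vec K) = -K^{-1} (x) K^{-1}.
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
K = A @ A.T + p * np.eye(p)                          # random symmetric positive-definite K
S = np.cov(rng.standard_normal((50, p)), rowvar=False, bias=True)
E = rng.standard_normal((p, p)); E = (E + E.T) / 2   # symmetric perturbation direction
t = 1e-6

f = lambda K_: np.linalg.slogdet(K_)[1] - np.trace(K_ @ S)
K_inv = np.linalg.inv(K)

# first derivative: directional derivative of f should equal <K^{-1} - S, E>
num_grad = (f(K + t * E) - f(K - t * E)) / (2 * t)
print(np.isclose(num_grad, np.sum((K_inv - S) * E), rtol=1e-4))

# second derivative: the change of K^{-1} in direction E should equal
# -(K^{-1} (x) K^{-1}) vec(E), i.e. -K^{-1} E K^{-1}
num_dir = (np.linalg.inv(K + t * E) - np.linalg.inv(K - t * E)) / (2 * t)
print(np.allclose(num_dir, -K_inv @ E @ K_inv, rtol=1e-4, atol=1e-8))
```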

KLCV simulation results

KL divergence under each tuning-parameter selector, d = 100:

           ORACLE         KLCV           AIC            GACV
n = 20     8.06 (0.37)    8.60 (0.45)    12.24 (0.28)   28.59 (19.94)
n = 30     6.87 (0.34)    7.29 (0.39)    10.59 (0.41)   32.07 (2.77)
n = 50     5.24 (0.27)    5.63 (0.33)    7.33 (0.81)    16.93 (1.40)
n = 100    3.34 (0.19)    3.57 (0.23)    3.63 (0.48)    6.81 (0.52)
n = 400    1.13 (0.07)    1.20 (0.08)    1.17 (0.08)    1.24 (0.07)

Conclusions

BIC aims to find the true model; AIC aims to come as close as possible to the truth. BIC typically gives a sparser model than AIC. Both AIC and BIC have asymptotic issues: what is the number of parameters under penalized inference? Improved versions are available!