Empirical Likelihood Based Deviance Information Criterion
1 Empirical Likelihood Based Deviance Information Criterion. Yin Teng, Smart and Safe City Center of Excellence, NCS Pte Ltd. June 22, 2016
2 Outline
- Bayesian empirical likelihood: definition; problems
- Empirical likelihood based deviance information criterion: deviance information criterion; empirical likelihood based DIC
- Simulation studies and real data analysis
- Conclusion and discussion
4 Empirical Likelihood (EL) - Definition
$X \in \mathbb{R}$ is a random variable following $F^0_\theta$, with $\theta = (\theta_1, \dots, \theta_d) \in \Theta \subset \mathbb{R}^d$. The functions $h(X, \theta) = (h_1(X, \theta), \dots, h_q(X, \theta))^T$ are known to satisfy
$$E_{F^0_\theta}[h(X, \theta)] = 0. \quad (1.1)$$
Here $x = (x_1, \dots, x_n)$ are $n$ observations of $X$, and we write $F(x_i) = P(X \le x_i)$ and $F(x_i^-) = P(X < x_i)$. A non-parametric likelihood of $F$ can be defined as
$$L(F) = \prod_{i=1}^n \{F(x_i) - F(x_i^-)\}. \quad (1.2)$$
5 Empirical Likelihood - Definition (Cont'd)
EL estimates $F^0_\theta$ by maximising $L(F)$ over $F$ under constraints depending on $h(x, \theta)$. Define $\omega_i = F(x_i) - F(x_i^-)$. For a given $\theta \in \Theta$, EL computes
$$\hat\omega(\theta) = \operatorname*{argmax}_{\omega \in W_\theta} \sum_{i=1}^n \log \omega_i(\theta), \quad (1.3)$$
where
$$W_\theta = \Big\{ \omega : \sum_{i=1}^n \omega_i(\theta)\, h(x_i, \theta) = 0 \Big\} \cap \Delta_{n-1}.$$
Here $\Delta_{n-1}$ is the $(n-1)$-dimensional simplex.
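The inner optimisation in (1.3) is usually solved through its convex dual: with a Lagrange multiplier $\lambda$, the weights take the form $\omega_i = 1/\{n(1 + \lambda^T h(x_i, \theta))\}$, and $\lambda$ minimises $-\sum_i \log(1 + \lambda^T h_i)$. A minimal sketch in Python (not the speaker's code; the helper name and the Nelder-Mead solver are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

def el_weights(h):
    """Profile EL for moment functions h: an (n, q) array with rows h(x_i, theta).
    Returns the log EL ratio sum_i log(n w_i) and the implied weights w,
    by minimising the convex dual f(lam) = -sum_i log(1 + lam' h_i)."""
    n, q = h.shape

    def neg_dual(lam):
        arg = 1.0 + h @ lam
        if np.any(arg <= 1e-10):          # outside the dual domain
            return np.inf
        return -np.sum(np.log(arg))

    res = minimize(neg_dual, np.zeros(q), method="Nelder-Mead")
    lam = res.x
    w = 1.0 / (n * (1.0 + h @ lam))       # EL weights at the dual optimum
    return res.fun, w
```

At the optimum the weights automatically sum to one, and the log ratio is zero exactly when the unconstrained maximiser (uniform weights) already satisfies the moment constraint.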
6 Empirical Likelihood - Definition (Cont'd)
$F^0_\theta$ can be estimated by
$$\hat F^0_\theta(x) = \sum_{i=1}^n \hat\omega_i(\theta)\, 1_{\{x_i \le x\}}.$$
The empirical likelihood corresponding to $\hat F^0_\theta$ is then given by
$$L(\theta) = \prod_{i=1}^n \hat\omega_i(\theta). \quad (1.4)$$
7 Bayesian Empirical Likelihood
Given a prior $\pi(\theta)$, we can define a posterior as
$$\Pi(\theta \mid x) = \frac{L(\theta)\pi(\theta)}{\int L(\theta)\pi(\theta)\, d\theta}. \quad (1.5)$$
1. If $L(\theta) = n^{-n}$ for every $\theta$ (the unconstrained maximum), then $\Pi(\theta \mid x) = \pi(\theta)$.
2. The posterior has no analytical form.
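Because (1.5) has no closed form, sampling is the practical route to the BayesEL posterior. A toy random-walk Metropolis-Hastings sketch for a scalar mean parameter; the data, the $N(0, 10^2)$ prior, and the step size 0.5 are illustrative assumptions, not the talk's settings:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)          # toy data with true mean 2

def el_loglik(theta):
    """Log EL ratio sum_i log(n w_i) for the mean constraint E[X - theta] = 0."""
    h = x - theta
    if h.min() >= 0 or h.max() <= 0:       # theta outside the convex hull
        return -np.inf
    g = lambda lam: np.sum(h / (1.0 + lam * h))
    lam = brentq(g, -1.0 / h.max() + 1e-8, -1.0 / h.min() - 1e-8)
    return -np.sum(np.log1p(lam * h))

def log_post(theta):
    # EL log likelihood plus a N(0, 10^2) prior (illustrative choice)
    return el_loglik(theta) - theta ** 2 / 200.0

theta, lp = x.mean(), log_post(x.mean())
draws = []
for _ in range(3000):                      # random-walk Metropolis-Hastings
    prop = theta + 0.5 * rng.normal()
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    draws.append(theta)
post = np.array(draws[500:])               # discard burn-in
```

Proposals outside the convex hull of the data get log likelihood $-\infty$ and are rejected automatically, which is one concrete way the non-convex support complicates the random walk.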
8 Bayesian Empirical Likelihood - Problems
Computational issues:
- No analytical form of the posterior, so Gibbs sampling fails
- Non-convex support, so random-walk Metropolis-Hastings is not easy
- In mixed effect models, parallel tempering mixes slowly
Bayesian model selection:
- The Bayes factor is not easy to calculate
- The posterior predictive distribution involves an integral that is not easily obtained
9 Outline
- Bayesian empirical likelihood: definition; problems
- Empirical likelihood based deviance information criterion: deviance information criterion; empirical likelihood based DIC
- Simulation studies and real data analysis
- Conclusion and discussion
10 Deviance Information Criterion (DIC)
$f(y \mid \theta)$ depends on a parameter vector $\theta \in \Theta \subset \mathbb{R}^p$, and $y_1, \dots, y_n$ are $n$ i.i.d. random variables. The posterior of $\theta$ is then given by
$$\Pi(\theta \mid y) = \frac{p(y \mid \theta)\pi(\theta)}{\int_{\theta \in \Theta} p(y \mid \theta)\pi(\theta)\, d\theta}.$$
The Bayesian deviance is defined as
$$D(\theta) = -2 \log p(y \mid \theta) + 2 \log p(y), \quad (2.1)$$
where $p(y)$ is some fully specified standardizing term that is a function of the data alone (Spiegelhalter et al., 2002).
11 Deviance Information Criterion (DIC)
Bayesian model fit is measured by the posterior expectation of the deviance,
$$\bar D(\theta) = E_\Pi[D(\theta)]. \quad (2.2)$$
Bayesian model complexity is measured by the effective number of parameters in the model,
$$p_D = \bar D(\theta) - D(\hat\theta_\Pi), \quad (2.3)$$
where $\hat\theta_\Pi$ is some posterior estimate of the parameter. DIC is defined as
$$\mathrm{DIC} = \bar D(\theta) + p_D = 2\bar D(\theta) - D(\hat\theta_\Pi) = D(\hat\theta_\Pi) + 2 p_D. \quad (2.4)$$
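In practice (2.2)-(2.4) are estimated from posterior draws. A hypothetical illustration for a conjugate normal-mean model, where the exact posterior is available (the data, the prior variance 100, and the choice $p(y) = 1$ for the standardizing term are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=200)
n = y.size

# Exact posterior of mu for y_i ~ N(mu, 1) with prior mu ~ N(0, 100)
v = 1.0 / (n + 1.0 / 100.0)               # posterior variance
m = v * y.sum()                           # posterior mean
mu_draws = rng.normal(m, np.sqrt(v), size=20000)

def deviance(mu):
    # D(mu) = -2 log p(y | mu), with standardizing term p(y) = 1
    return n * mu**2 - 2.0 * mu * y.sum() + np.sum(y**2) + n * np.log(2 * np.pi)

D_bar = deviance(mu_draws).mean()         # posterior mean deviance, (2.2)
D_hat = deviance(m)                       # deviance at the posterior mean
p_D = D_bar - D_hat                       # effective number of parameters, (2.3)
DIC = D_bar + p_D                         # (2.4)
```

With one free parameter and a near-flat prior, $p_D$ comes out close to 1, matching the "effective number of parameters" reading of (2.3).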
12 Empirical Likelihood Based DIC - Definition
The empirical likelihood deviance is given by
$$D_{EL}(\theta) = -2 \sum_{i=1}^n \log \hat\omega_i(\theta) - 2 n \log n. \quad (2.5)$$
BayesEL model fit can be defined as the BayesEL posterior mean of (2.5),
$$\bar D_{EL}(\theta) = E_{\Pi_{EL}}[D_{EL}(\theta)]. \quad (2.6)$$
BayesEL model complexity is defined as the effective number of parameters,
$$p_D^{EL} = \bar D_{EL}(\theta) - D_{EL}(\tilde\theta_{EL}), \quad (2.7)$$
where $\tilde\theta_{EL}$ is the posterior mean. The empirical likelihood based DIC (ELDIC) is defined as
$$\mathrm{ELDIC} = \bar D_{EL}(\theta) + p_D^{EL}. \quad (2.8)$$
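A hypothetical end-to-end sketch of (2.5)-(2.8) for a scalar mean parameter, approximating the BayesEL posterior on a grid under a flat prior (the data, the grid, and the flat-prior choice are all assumptions; with an MCMC sample the same averages would be taken over draws):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=80)
n = x.size

def el_log_ratio(theta):
    """sum_i log(n w_i(theta)) for the mean constraint E[X - theta] = 0."""
    h = x - theta
    if h.min() >= 0 or h.max() <= 0:      # theta outside the convex hull
        return -np.inf
    g = lambda lam: np.sum(h / (1.0 + lam * h))
    lam = brentq(g, -1.0 / h.max() + 1e-8, -1.0 / h.min() - 1e-8)
    return -np.sum(np.log1p(lam * h))

# grid approximation of the BayesEL posterior under a flat prior
grid = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 2001)
logpost = np.array([el_log_ratio(t) for t in grid])
keep = np.isfinite(logpost)               # drop points outside the hull
grid, logpost = grid[keep], logpost[keep]
w = np.exp(logpost - logpost.max())
w /= w.sum()

D_EL = -2.0 * logpost                     # EL deviance (2.5): -2 sum log(n w_i)
D_bar = np.sum(w * D_EL)                  # posterior mean deviance, (2.6)
theta_bar = np.sum(w * grid)              # posterior mean of theta
p_D_EL = D_bar - (-2.0 * el_log_ratio(theta_bar))   # (2.7)
ELDIC = D_bar + p_D_EL                    # (2.8)
```

Note that $-2\sum_i \log \hat\omega_i - 2n\log n = -2\sum_i \log(n\hat\omega_i)$, so (2.5) is exactly $-2$ times the log EL ratio; for this one-parameter model $p_D^{EL}$ lands near 1.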
13 Empirical Likelihood Based DIC - Definition (Cont'd)

| DIC | EL based DIC |
| --- | --- |
| $D(\theta) = -2\log p(y \mid \theta) + 2\log p(y)$ | $D_{EL}(\theta) = -2\log L(\theta) - 2n\log n$ |
| $\mathrm{DIC} = \bar D(\theta) + p_D$ | $\mathrm{ELDIC} = \bar D_{EL}(\theta) + p_D^{EL}$ |
| $\bar D(\theta) = E_\Pi[D(\theta)]$ | $\bar D_{EL}(\theta) = E_{\Pi_{EL}}[D_{EL}(\theta)]$ |
| $p_D = \bar D(\theta) - D(\tilde\theta_\Pi)$ | $p_D^{EL} = \bar D_{EL}(\theta) - D_{EL}(\tilde\theta_{EL})$ |
14 Empirical Likelihood Based DIC - Definition (Cont'd)
Properties of DIC:
- Approximate normality of $\Pi(\theta \mid y)$ is important!
- Positivity of $p_D$
- Decision theoretic justification
- Bayesian version of AIC (Akaike, 1974)
- $p_D$ is not invariant to reparameterization!
Properties of ELDIC:
- Approximate normality of $\Pi_{EL}(\theta \mid y)$ is important!
- Positivity of $p_D^{EL}$
- Consistency of $\tilde\theta_{EL}$
- Decision theoretic justification
- Bayesian version of ELAIC (Variyath et al., 2010)
- $p_D^{EL}$ is not invariant to reparameterization!
15 Approximate Normality
Theorem 2.1. Let
$$J(\hat\theta_n) = -\frac{1}{n} \frac{\partial^2}{\partial\theta\, \partial\theta^T} \log L(\theta)\Big|_{\theta = \hat\theta_n}. \quad (2.9)$$
Assume that $J(\hat\theta_n)$ and $J_0$ are invertible, where $m_0$ and $J_0$ denote the prior mean and prior precision. Under some regularity conditions, for $\{\theta : \|\theta - \theta_0\| = O(n^{-1/2})\}$, the posterior distribution of $\theta$ has density
$$\Pi_{EL}(\theta \mid y) \propto \exp\Big\{ -\tfrac{1}{2}(\theta - \tilde\theta_{EL})^T J_n (\theta - \tilde\theta_{EL}) + o_p(1) \Big\}, \quad (2.10)$$
where
$$J_n = J_0 + n J(\hat\theta_n), \quad (2.11)$$
$$\tilde\theta_{EL} = J_n^{-1}\{ J_0 m_0 + n J(\hat\theta_n)\hat\theta_n \}. \quad (2.12)$$
Furthermore, if $J(\hat\theta_n)$ and $J_0$ are positive definite, $J_n^{1/2}(\theta - \tilde\theta_{EL}) \xrightarrow{D} N(0, I)$.
16 Consistency of the Posterior Mean
Theorem 2.2. Assume $m_0 = O_p(1)$ and $J_0 = O_p(1)$. Under some regularity assumptions, $\tilde\theta_{EL} \xrightarrow{p} \theta_0$.
17 Decision Theoretic Justification
$y_{rep} = (y_{rep,1}, \dots, y_{rep,n})$ is a replicate of $y = (y_1, \dots, y_n)$ from the same data generating process; $y_{rep}$ follows $F^0_{rep,\theta}$, which is actually the same as $F^0_\theta$. Let
$$W_{y_{rep},\theta} = \Big\{ \omega : \sum_{i=1}^n \omega_i h(y_{rep,i}, \theta) = 0 \Big\} \cap \Delta_{n-1}.$$
Given $\theta$ and $y_{rep}$, an empirical likelihood of $y_{rep}$ is
$$L_{rep}(\theta) = \prod_{i=1}^n \hat\omega_{rep,i}, \quad \text{where } \hat\omega_{rep} = (\hat\omega_{rep,1}, \dots, \hat\omega_{rep,n}) = \operatorname*{argmax}_{\omega \in W_{y_{rep},\theta}} \sum_{i=1}^n \log \omega_i.$$
18 Decision Theoretic Justification (Cont'd)
The loss function of using the observed data $y$ to predict $y_{rep}$ is
$$L(y_{rep}, y) = -2 \log L_{rep}(\tilde\theta(y)) - 2n \log n = -2 \sum_{i=1}^n \log\big(n \hat\omega_{rep,i}\big), \quad (2.13)$$
where $\tilde\theta(y)$ is a summary of $\theta$ based on the observed data $y$. The BayesEL posterior predictive distribution function is
$$F(y_{rep} \mid y) = \int \Pi_{EL}(\theta \mid y)\, F^0_{rep,\theta}\, d\theta. \quad (2.14)$$
The risk of predicting $y_{rep}$ by using $y$ is
$$R(y) = E_{y_{rep} \mid y}\big(L(y_{rep}, y)\big) = \int L(y_{rep}, y)\, dF(y_{rep} \mid y). \quad (2.15)$$
19 Decision Theoretic Justification (Cont'd)
Theorem 2.3. If $\tilde\theta(y) = \tilde\theta_{EL}$ and $y_{rep}$ and $y$ are generated from the same mechanism, then
$$E_y(R(y)) = E_y E_{y_{rep} \mid y}\big(L(y_{rep}, y)\big) = E_y(\mathrm{ELDIC}) + o(1). \quad (2.16)$$
Under a diffuse prior (i.e. $J_0 = o_p(1)$), Theorem 2.1 gives $\tilde\theta_{EL} \approx \hat\theta_n$ and $p_D^{EL} \approx p$. Thus
$$\mathrm{ELDIC} = D_{EL}(\tilde\theta_{EL}) + 2 p_D^{EL} \approx -2 \log L(\hat\theta_n) - 2n \log n + 2p,$$
so, up to the model-independent constant $-2n\log n$, ELDIC reduces to the empirical likelihood based AIC (Variyath et al., 2010).
20 Prior Varying with the Sample Size
Consider the following three classes of priors:
1. For $\pi_1(\theta)$: $m_{0,n} = O_p(1)$, $J_{0,n} = o_p(n)$
2. For $\pi_2(\theta)$: $m_{0,n} = \theta_0 + o_p(1)$, $J_{0,n} = O_p(n)$
3. For $\pi_3(\theta)$: $m_{0,n} = \theta_0 + o_p(1)$, $J_{0,n} = O_p(n^{1+\alpha})$, where $\alpha > 0$
Note that under $\pi_1(\theta)$, $J_{0,n}$ increases at a slower rate than $n$; $\pi_2(\theta)$ shrinks to $\theta_0$ at the same rate as $n$; $\pi_3(\theta)$ shrinks to $\theta_0$ at a faster rate than $n$.
21 Prior Varying with the Sample Size
Theorem 2.4. Assume that $\lambda^{(1)}(J(\hat\theta_n))$ and $\lambda^{(1)}(J(\hat\theta_n)^{-1})$ are bounded. Under the same assumptions as in Theorem 2.1, and under each of the priors $\pi_1(\theta)$, $\pi_2(\theta)$ and $\pi_3(\theta)$, $\tilde\theta_{EL} \xrightarrow{p} \theta_0$.
Theorem 2.5. Under the same assumptions as Theorem 2.4, let $p$ be the dimension of the parameter. The following statements hold:
1. If the prior is $\pi_1(\theta)$, then $p_D^{EL} \xrightarrow{p} p$ as $n \to \infty$.
2. If the prior is $\pi_2(\theta)$, then $p_D^{EL} < p$ as $n \to \infty$.
3. If the prior is $\pi_3(\theta)$, then $p_D^{EL} \xrightarrow{p} 0$ as $n \to \infty$.
22 An Alternative Definition of $p_D^{EL}$
Recall that $p_D$ is not invariant to reparameterization! Gelman et al. (2003) defined an alternative measure
$$p_V = \frac{V_\Pi[D(\theta)]}{2}.$$
$p_D^{EL}$ is also not invariant to reparameterization, so we define
$$p_V^{EL} = \frac{V_{\Pi_{EL}}[D_{EL}(\theta)]}{2}.$$
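For a quadratic deviance the mean-based and half-variance measures agree, which Theorem 2.6 below formalises. A tiny hypothetical check for the normal-mean case, where the deviance above its minimum is a scaled $\chi^2_1$ variable under the posterior (all numbers here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
# Posterior draws of mu around its mean; the deviance above its minimum is
# D(mu) - D(mu_hat) = n * (mu - mu_hat)^2, a chi-square(1) under the posterior.
mu_draws = rng.normal(0.0, 1.0 / np.sqrt(n), size=50000)
D = n * mu_draws**2
p_D = D.mean()                  # mean-based measure: E[D] - D(posterior mean)
p_V = D.var() / 2.0             # Gelman et al.'s half-variance measure
```

Both estimates land near 1, the number of free parameters, since a $\chi^2_1$ variable has mean 1 and variance 2.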
23 An Alternative Definition of $p_D^{EL}$ (Cont'd)
Theorem 2.6. Assume $J_0 = o_p(1)$. Under the same assumptions as in Theorem 2.4,
$$p_V^{EL} = p_D^{EL} + o_p(1). \quad (2.17)$$
Corollary 2.7. Define $\mathrm{ELDIC}^{(2)} = D_{EL}(\tilde\theta_{EL}) + 2 p_V^{EL}$. Then
$$E_y(R(y)) = E_y(\mathrm{ELDIC}^{(2)}) + o(1). \quad (2.18)$$
24 Priors and $p_D^{EL}$
The predictor vector $x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5})^T$ was generated independently from $N(0, I)$, where $I$ is the identity matrix. The coefficient vector $\beta = (\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)$ was generated independently from $N(0, 0.25 I)$. 100 independent observations were generated from the linear regression model
$$y_i = x_i^T \beta + \epsilon_i, \quad i = 1, 2, \dots, 100,$$
where $\epsilon_i$ follows a mean-zero normal distribution.
Three priors for $\beta$ are considered: $N(0, 100)$, $N(\beta_0, 0.01)$ and $N(\beta_0, 0.001)$. In $\pi_2(\beta)$ and $\pi_3(\beta)$, $\beta_0$ is the true value of $\beta$.
25 Priors and $p_D^{EL}$
Suppose $\omega = (\omega_1, \dots, \omega_n)$ is the vector of jumps of the estimated joint distribution of $(y, x)$ at the observations. We define the set of constraints as
$$W_\beta = \Big\{ \omega : \sum_{i=1}^n \omega_i (y_i - x_i^T\beta) = 0,\ \sum_{i=1}^n \omega_i x_{ij}(y_i - x_i^T\beta) = 0,\ j = 1, \dots, 5 \Big\}.$$
Given $\beta$, the empirical likelihood is given by
$$L(\beta) = \prod_{i=1}^n \hat\omega_i, \quad (2.19)$$
where $\hat\omega = \operatorname*{argmax}_{\omega \in W_\beta} \sum_{i=1}^n \log \omega_i$.
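A sketch of how the moment functions defining $W_\beta$ can be assembled in code for the simulation above (the seed, the error variance 1, and the helper name are assumptions; the slides do not state the error variance):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))                 # x_i ~ N(0, I_5)
beta_true = rng.normal(scale=0.5, size=p)   # beta ~ N(0, 0.25 I)
y = X @ beta_true + rng.normal(size=n)      # error variance assumed to be 1

def moments(beta):
    """Rows h_i(beta): the residual moment and the five residual-times-covariate
    moments whose weighted sums define W_beta."""
    r = y - X @ beta
    return np.column_stack([r, X * r[:, None]])   # shape (n, p + 1)

# At the least-squares solution the covariate moments average to zero exactly,
# so the uniform weights w_i = 1/n come close to satisfying the constraints.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
h = moments(beta_ls)
```

Feeding `h` into a profile-EL solver (as sketched earlier for the mean case) would then give $L(\beta)$ at any candidate $\beta$.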
26 Priors and $p_D^{EL}$
Table: The means and standard deviations (sd) of $\bar D_{EL}(\theta)$, $D_{EL}(\tilde\theta_{EL})$, $p_D^{EL}$ and $p_V^{EL}$ over 1000 repetitions under the priors $N(0, 100)$, $N(\beta_0, 0.01)$ and $N(\beta_0, 0.001)$ respectively.
27 Variable Selection for Over-dispersed Poisson Regression
The marginal distribution of $y$ satisfies
$$E(y) = \mu, \quad Var(y) = \mu(1 + \mu\lambda).$$
We consider a generalized linear model with four covariates,
$$\log(\mu) = \beta_0 + x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4,$$
with $\beta = (0.5, 0.5, 0.6, 0, 0)$. The covariance structure is $cov(x_i, x_j) = (0.5)^{|i-j|}$, and $\lambda$ has four levels: $0, 1/8, 1/6, 1/4$.
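One standard way to generate counts with exactly this mean-variance relation is a Poisson-gamma mixture; a hedged sketch under assumed settings (the seed, the sample size, and the mixture construction itself are illustrative choices, not necessarily the talk's data-generating code):

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 200_000, 0.25

# AR(1)-type covariance: cov(x_i, x_j) = 0.5^|i-j|
idx = np.arange(4)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
x = rng.multivariate_normal(np.zeros(4), Sigma, size=n)

beta0, beta = 0.5, np.array([0.5, 0.6, 0.0, 0.0])
mu = np.exp(beta0 + x @ beta)

# y | g ~ Poisson(mu * g) with g ~ Gamma(shape=1/lam, scale=lam), so g has
# mean 1 and variance lam; marginally E(y) = mu and Var(y) = mu * (1 + mu*lam).
g = rng.gamma(shape=1.0 / lam, scale=lam, size=n)
y = rng.poisson(mu * g)
```

Mixing the Poisson rate over a mean-one gamma variable adds the $\mu^2\lambda$ over-dispersion term while leaving the mean untouched; setting $\lambda = 0$ recovers the plain Poisson model.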
28 Variable Selection for Over-dispersed Poisson Regression
As in the linear model example, we use $\gamma = (\gamma_1, \dots, \gamma_4)$ to denote a model, $x_{i\gamma}$ to denote the covariate vector of the $i$-th observation under model $\gamma$, and $\beta_\gamma$ the corresponding coefficient vector. The set of constraints is defined as
$$W_{\beta_\gamma} = \Big\{ \omega : \sum_{i=1}^n \omega_i \big(y_i - \exp\{\beta_0 + x_{i\gamma}^T \beta_\gamma\}\big) = 0,\ \sum_{i=1}^n \omega_i x_{ij}\big(y_i - \exp\{\beta_0 + x_{i\gamma}^T \beta_\gamma\}\big) = 0,\ j = 1, 2, 3, 4 \Big\}.$$
The empirical likelihood under model $\gamma$ is given by
$$L(\beta_\gamma) = \prod_{i=1}^n \hat\omega_i, \quad (2.20)$$
where $\hat\omega = \operatorname*{argmax}_{\omega \in W_{\beta_\gamma}} \sum_{i=1}^n \log \omega_i$.
29 Variable Selection for Over-dispersed Poisson Regression
Table: Comparison of $\mathrm{ELDIC}^{(1)}$, $\mathrm{ELDIC}^{(2)}$, $\mathrm{DIC}^{(1)}$ and $\mathrm{DIC}^{(2)}$ based on the percentage of times each criterion selected (1) the true model (TM); (2) TM+1; (3) TM+2, for each level of the over-dispersion parameter $\lambda \in \{0, 1/8, 1/6, 1/4\}$.
30 Analysis of Gene Expression Data
The data come from $n = 118$ microarray experiments collected and analyzed by Wille et al. (2004). To reveal the correlation structure of these genes, Wille et al. (2004) proposed a modified graphical Gaussian modeling approach in which the dependence between two genes is investigated conditioning only on a third gene, rather than on all the other genes at once. Drton & Perlman (2007) employed a multiple-testing-based graphical model selection approach to analyze 13 genes from the MEP pathway in Wille et al. (2004). We apply ELDIC to the same genes as Drton & Perlman (2007).
Prior settings: $\beta_\gamma \sim$ double exponential(0, 100).
31 Analysis of Gene Expression Data (Cont'd)
These 13 genes are well ordered, and the observations for each one are standardized. For each gene (node), a linear model is applied, with all ancestors of the given node considered as potential covariates. Our goal is to select a directed acyclic graph (DAG) model that best illustrates the correlation structure among these genes.
32 Analysis of Gene Expression Data (Cont'd)
Suppose $k \in \{4, 5, \dots, 13\}$ indexes the genes and $g_k$ denotes the $k$-th gene. Given $g_k$, the model fitted to this gene is denoted by $\gamma_k = (\gamma_1, \dots, \gamma_{k-1})$, where $\gamma_i$ is 1 when $g_i$ is in the model and 0 otherwise. Let $g_{\gamma_k}$ be the covariates, $\beta_{\gamma_k}$ the corresponding coefficient vector, and $\epsilon_{\gamma_k}$ the error, which has mean 0 and variance $\sigma^2_{\gamma_k}$. The model $\gamma_k$ is then given by
$$g_k = \beta_{\gamma_k}^T g_{\gamma_k} + \epsilon_{\gamma_k}. \quad (2.21)$$
Prior settings: $\beta_{\gamma_k} \sim$ double exponential(0, 100).
33 Analysis of Gene Expression Data (Cont'd)
Suppose that $(g_k^{(i)}, g_{\gamma_k}^{(i)})$ is the $i$-th observation of $(g_k, g_{\gamma_k})$, and let $\omega_i$ be the weight in the empirical likelihood for the $i$-th observation. The set of constraints for the empirical likelihood given node $k$ is defined as
$$W_{\beta_{\gamma_k}} = \Big\{ \omega : \sum_{i=1}^n \omega_i \big(g_k^{(i)} - [g_{\gamma_k}^{(i)}]^T \beta_{\gamma_k}\big) = 0,\ \sum_{i=1}^n \omega_i g_j^{(i)} \big(g_k^{(i)} - [g_{\gamma_k}^{(i)}]^T \beta_{\gamma_k}\big) = 0,\ j = 4, \dots, k \Big\}. \quad (2.22)$$
Given $\beta_{\gamma_k}$, the empirical likelihood is given by
$$L(\beta_{\gamma_k}) = \prod_{i=1}^n \hat\omega_i, \quad \text{where } \hat\omega = \operatorname*{argmax}_{\omega \in W_{\beta_{\gamma_k}}} \sum_{i=1}^n \log \omega_i.$$
34 Analysis of Gene Expression Data
Figure: Graphical model for the gene data
35 Application to Gene Expression Data (RJMCMC)
Figure: Graphical model for the gene data
36 Conclusion and Discussion
- The proposed ELDIC has a similar form to the classical DIC.
- The model with the minimum value of ELDIC is selected.
- Heuristically, the model with the minimum ELDIC has the smallest posterior predictive risk.
- The proposed BayesEL based estimate of the effective number of parameters in the model is valid.
- Application of ELDIC to other models remains an open problem.
- The magnitude of a significant difference between ELDIC values is still not clear.
37 Thank you!
38 Bibliography I
[Akaike 1974] Akaike, Hirotugu: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 (1974), No. 6.
[Drton & Perlman 2007] Drton, Mathias; Perlman, Michael D.: Multiple testing and error control in Gaussian graphical model selection. Statistical Science (2007).
[Gelman et al. 2003] Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Rubin, Donald B.: Bayesian Data Analysis. Chapman & Hall/CRC, 2003.
[Spiegelhalter et al. 2002] Spiegelhalter, David J.; Best, Nicola G.; Carlin, Bradley P.; van der Linde, Angelika: Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (2002), No. 4.
[Variyath et al. 2010] Variyath, Asokan M.; Chen, Jiahua; Abraham, Bovas: Empirical likelihood based variable selection. Journal of Statistical Planning and Inference 140 (2010), No. 4.
39 Bibliography II
[Wille et al. 2004] Wille, Anja; Zimmermann, Philip; Vranová, Eva; Fürholz, Andreas; Laule, Oliver; Bleuler, Stefan; Hennig, Lars; Prelic, Amela; von Rohr, Peter; Thiele, Lothar; et al.: Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5 (2004), No. 11, R92.
More informationMinimum Message Length Analysis of the Behrens Fisher Problem
Analysis of the Behrens Fisher Problem Enes Makalic and Daniel F Schmidt Centre for MEGA Epidemiology The University of Melbourne Solomonoff 85th Memorial Conference, 2011 Outline Introduction 1 Introduction
More informationBayesian Inference and the Parametric Bootstrap. Bradley Efron Stanford University
Bayesian Inference and the Parametric Bootstrap Bradley Efron Stanford University Importance Sampling for Bayes Posterior Distribution Newton and Raftery (1994 JRSS-B) Nonparametric Bootstrap: good choice
More informationBayesian Interpretations of Regularization
Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S
More informationDIC, AIC, BIC, PPL, MSPE Residuals Predictive residuals
DIC, AIC, BIC, PPL, MSPE Residuals Predictive residuals Overall Measures of GOF Deviance: this measures the overall likelihood of the model given a parameter vector D( θ) = 2 log L( θ) This can be averaged
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationProbabilistic machine learning group, Aalto University Bayesian theory and methods, approximative integration, model
Aki Vehtari, Aalto University, Finland Probabilistic machine learning group, Aalto University http://research.cs.aalto.fi/pml/ Bayesian theory and methods, approximative integration, model assessment and
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationStat 451 Lecture Notes Numerical Integration
Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29
More informationThe Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model
Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for
More informationMIT Spring 2016
MIT 18.655 Dr. Kempthorne Spring 2016 1 MIT 18.655 Outline 1 2 MIT 18.655 Decision Problem: Basic Components P = {P θ : θ Θ} : parametric model. Θ = {θ}: Parameter space. A{a} : Action space. L(θ, a) :
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming
More informationDavid Giles Bayesian Econometrics
9. Model Selection - Theory David Giles Bayesian Econometrics One nice feature of the Bayesian analysis is that we can apply it to drawing inferences about entire models, not just parameters. Can't do
More informationOverall Objective Priors
Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationTopic 12 Overview of Estimation
Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the
More informationBayesian and frequentist inference
Bayesian and frequentist inference Nancy Reid March 26, 2007 Don Fraser, Ana-Maria Staicu Overview Methods of inference Asymptotic theory Approximate posteriors matching priors Examples Logistic regression
More informationSTAT Financial Time Series
STAT 6104 - Financial Time Series Chapter 4 - Estimation in the time Domain Chun Yip Yau (CUHK) STAT 6104:Financial Time Series 1 / 46 Agenda 1 Introduction 2 Moment Estimates 3 Autoregressive Models (AR
More informationVarious types of likelihood
Various types of likelihood 1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood 2. semi-parametric likelihood, partial likelihood 3. empirical likelihood,
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationLOGISTIC REGRESSION Joseph M. Hilbe
LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of
More informationGibbs Sampling in Endogenous Variables Models
Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take
More informationA Frequentist Assessment of Bayesian Inclusion Probabilities
A Frequentist Assessment of Bayesian Inclusion Probabilities Department of Statistical Sciences and Operations Research October 13, 2008 Outline 1 Quantitative Traits Genetic Map The model 2 Bayesian Model
More information