Selecting explanatory variables with a modified version of the Bayesian Information Criterion
Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland
in cooperation with
J. K. Ghosh, R. W. Doerge, R. Cheng (Purdue University)
A. Baierl, F. Frommlet, A. Futschik (Vienna University)
A. Chakrabarti (Indian Statistical Institute)
P. Biecek, A. Ochman, M. Żak (Wrocław University of Technology)
Vienna, 24/07/2008
Searching large data bases
Y - the quantitative variable of interest (fruit size, survival time, process yield)
Aim - identify factors influencing Y
Properties of the data base - the number of potential factors, m, may be much larger than the number of cases, n
Assumption of sparsity - only a small proportion of the potential explanatory variables influences Y
Specific application - Locating Quantitative Trait Loci
Data for QTL mapping in backcross populations and recombinant inbred lines
Only two genotypes are possible at a given locus.
X_ij - dummy variable encoding the genotype of the i-th individual at locus j, X_ij ∈ {−1/2, 1/2}
Multiple regression model:
Y_i = β_0 + Σ_{j=1}^m β_j X_ij + ε_i,   (0.1)
where i ∈ {1, ..., n} and ε_i ~ N(0, σ²).
Problem - estimation of the number of influential genes
Bayesian Information Criterion (1)
M_i - the i-th linear model, with k_i < n regressors
θ_i = (β_0, β_1, ..., β_{k_i}, σ) - vector of model parameters
Bayesian Information Criterion (Schwarz, 1978): maximize
BIC = log L(Y | M_i, θ̂_i) − (1/2) k_i log n
If m is fixed, n → ∞ and X'X/n → Q, where Q is a positive definite matrix, then BIC is consistent: the probability of choosing the proper model converges to 1.
When n ≥ 8, BIC never chooses more regressors than AIC and is usually considered one of the most restrictive model selection criteria.
Surprise? Broman and Speed (JRSS, 2002) report that BIC overestimates the number of regressors when applied to QTL mapping.
Explanation - Bayesian roots of BIC (1)
f(θ_i) - prior density of θ_i; π(M_i) - prior probability of M_i
m_i(Y) = ∫ L(Y | M_i, θ_i) f(θ_i) dθ_i - integrated likelihood of the data given the model M_i
posterior probability of M_i: P(M_i | Y) ∝ m_i(Y) π(M_i)
BIC neglects π(M_i) and uses the approximation log m_i(Y) ≈ log L(Y | M_i, θ̂_i) − (1/2)(k_i + 2) log n + R_i, where R_i is bounded in n.
Explanation - Bayesian roots of BIC (2)
Neglecting π(M_i) = assuming all models have the same prior probability = assigning a large prior probability to the event that the true model contains approximately m/2 regressors.
For m = 200: 200 models with one regressor, C(200, 2) = 19900 models with two regressors, but C(200, 100) ≈ 9 × 10^58 models with 100 regressors.
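A quick sanity check on these counts, using only the Python standard library:

```python
from math import comb

# How many models contain exactly k of m = 200 candidate regressors?
m = 200
print(comb(m, 1))    # 200 models with one regressor
print(comb(m, 2))    # 19900 models with two regressors
print(comb(m, 100))  # ~9e58 models with one hundred regressors
```

The middle model sizes dominate the count overwhelmingly, which is why a uniform prior over models implicitly favors models with roughly m/2 regressors.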
Modified version of BIC, mBIC (1)
M. Bogdan, J. K. Ghosh, R. W. Doerge, Genetics (2004)
Proposed solution - supplement BIC with an informative prior distribution on the set of possible models, as proposed in George and McCulloch (1993).
p - prior probability that a randomly chosen regressor influences Y
π(M_i) = p^{k_i} (1 − p)^{m − k_i}
log π(M_i) = m log(1 − p) − k_i log((1 − p)/p)
The modified version of BIC recommends choosing the model maximizing
log L(Y | M_i, θ̂_i) − (1/2) k_i log n − k_i log((1 − p)/p)
mBIC (2)
c = mp - expected number of true regressors
mBIC = log L(Y | M_i, θ̂_i) − (1/2) k_i log n − k_i log(m/c − 1)
The standard version of mBIC uses c = 4 to control the overall type I error at a level below 10%.
A similar log m penalty appears in the RIC of Foster and George (1994).
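A minimal sketch of evaluating this criterion for a given subset of regressors, using the profile Gaussian log-likelihood (σ estimated by ML); the function name and interface are illustrative, not taken from the cited papers:

```python
import numpy as np

def mbic(y, X, subset, c=4.0):
    """mBIC score of the linear model built on the columns in `subset`.

    Uses the profile Gaussian log-likelihood (sigma estimated by ML);
    larger values are better.
    """
    n, m = X.shape
    k = len(subset)
    # Design matrix with intercept plus the selected columns.
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = float(np.sum((y - Xs @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1.0)
    return loglik - 0.5 * k * np.log(n) - k * np.log(m / c - 1)
```

With c = 4 the penalty per regressor is (1/2) log n + log(m/4 − 1); a full search would maximize this score over subsets, typically by stepwise methods when m is large.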
Relationship to multiple testing (1)
Orthogonal design: X'X = n I_{(m+1)×(m+1)},   (1)
BIC chooses those X_j's for which n β̂_j²/σ² > log n.
Under H_0j: β_j = 0, Z_j = √n β̂_j/σ ~ N(0, 1).
Since for c > 0, 1 − Φ(c) = (φ(c)/c)(1 + o_c), it holds for large values of n that
α_n = 2 P(Z_j > √(log n)) ≈ √(2/(π n log n)).
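A numerical check of this approximation; note the displayed rate is reconstructed from the surrounding derivation, so treat the constant as an assumption:

```python
from math import erf, sqrt, log, pi

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n = 10**6
exact = 2 * (1 - Phi(sqrt(log(n))))    # per-test type I error of BIC under H_0j
approx = sqrt(2 / (pi * n * log(n)))   # the asymptotic rate
print(exact, approx)                   # both of order 2e-4 for n = 10^6
```

The error decays only like 1/√(n log n) per test, which is far too slow once m tests are performed simultaneously.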
Relationship to multiple testing (2)
When n and m go to infinity and the number of true signals remains fixed, the expected number of false discoveries grows at the rate m/√(n log n).
Corollary: BIC is not consistent when m/√(n log n) does not converge to 0.
Bonferroni correction for multiple testing: α_{n,m} = α_n/m
probability of detecting at least one false positive: FWER ≤ α_n
2(1 − Φ(√c_Bon)) = α_n/m
c_Bon = 2 log(m/α_n)(1 + o_{n,m}) = (log n + 2 log m)(1 + o_{n,m}),
where o_{n,m} converges to zero when n or m tends to infinity.
c_mBIC = log n + 2 log(m/c − 1) ≈ log n + 2 log m − 2 log c
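To leading order the two thresholds therefore differ only by 2 log c, which a couple of lines confirm numerically (the specific n, m below are illustrative):

```python
from math import log

n, m, c = 200, 1000, 4.0
c_bon_leading = log(n) + 2 * log(m)      # leading term of c_Bon
c_mbic = log(n) + 2 * log(m / c - 1)     # mBIC threshold for Z_j^2
print(c_bon_leading - c_mbic)            # close to 2*log(c) ~ 2.77
```

So mBIC with c = 4 behaves like a slightly relaxed Bonferroni rule: it tolerates on average about c false positives instead of a fraction of one.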
Properties of mBIC
1. FWER ⪅ 2c/√(2π n (log n + 2 log m − 2 log c))
2. The power of detecting an explanatory variable with β_j ≠ 0 is given by
1 − P(−√c_mBIC − √n β_j/σ < √n(β̂_j − β_j)/σ < √c_mBIC − √n β_j/σ)
> 1 − Φ(√c_mBIC − √n |β_j|/σ) → 1,
Corollary: independently of the choice of c, mBIC is consistent.
The standard version of mBIC uses c = 4 to control FWER at a level below 10% when n ≥ 200.
Asymptotic optimality of mBIC (1)
γ_0 - cost of a false discovery, γ_A - cost of missing a true signal
β_j ~ (1 − p) δ_0 + p N(0, τ²)
Expected cost of the experiment: R = m(γ_0 t_1 (1 − p) + γ_A t_2 p), where t_1 and t_2 are the type I and type II error probabilities.
Optimal rule (Bayes oracle): f_A(β̂_j)/f_0(β̂_j) > ((1 − p) γ_0)/(p γ_A), where f_A(β̂_j) ~ N(0, τ² + σ²/n) and f_0(β̂_j) ~ N(0, σ²/n).
Asymptotic optimality of mBIC (2)
Bayes oracle:
n β̂_j²/σ² > ((σ² + nτ²)/(nτ²)) [log((nτ² + σ²)/σ²) + 2 log((1 − p)/p) + 2 log(γ_0/γ_A)].
Asymptotic optimality: the model selection rule V is asymptotically optimal if lim_{n,m→∞} R_V/R_BO = 1.
Theorem 1 (Bogdan, Chakrabarti, Ghosh, 2008). Under the orthogonal design (1), mBIC is asymptotically optimal when lim_{m→∞} mp = s, where s ∈ R.
Conjecture (Frommlet, Bogdan, 2008). Theorem 1 holds also when β_j ~ (1 − p) δ_0 + p F_A, where F_A has a positive density at 0.
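The oracle threshold for the scaled statistic n β̂_j²/σ² can be computed directly; a sketch (function and parameter names are mine):

```python
from math import log

def bayes_oracle_threshold(n, sigma2, tau2, p, gamma0, gammaA):
    """Threshold for n * betahat_j**2 / sigma**2 implied by the Bayes oracle."""
    v = (sigma2 + n * tau2) / (n * tau2)
    return v * (log((n * tau2 + sigma2) / sigma2)
                + 2 * log((1 - p) / p)
                + 2 * log(gamma0 / gammaA))

# Example: n = 200, sigma = tau = 1, p = 1/30, equal costs.
print(bayes_oracle_threshold(200, 1.0, 1.0, 1 / 30, 1.0, 1.0))
```

For these values the oracle threshold is roughly 12.1, while the mBIC threshold log n + 2 log(m/c − 1) with n = 200, m = 300 and c = mp = 10 is roughly 12.0, illustrating the closeness that Theorem 1 formalizes.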
Computer simulations (1)
Setting: n = 200, m = 300, entries of X ~ N(0, σ = 0.5), k ~ Binomial(m, p) with p = 1/30 (mp = 10), β_i ~ N(0, σ = 1.5), ε ~ N(0, 1) and Tukey's gross error model: ε ~ Tukey(0.95, 100, 1) = 0.95 N(0, 1) + 0.05 N(0, 10).
Characteristics: Power, FDR = FP/AP, MR = FP + FN, l_2 = Σ_{j=1}^m (β_j − β̂_j)², and d - the mean absolute prediction error based on 50 additional observations.
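One replication of this setting can be sketched as follows (data generation only; the selection step and the counting of FP/FN against the true support are omitted, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, p = 200, 300, 1 / 30

def simulate(tukey=False):
    """Draw (y, X, true support) for one replication of the slide's setting."""
    X = rng.normal(0.0, 0.5, size=(n, m))
    k = rng.binomial(m, p)                         # number of true regressors
    support = rng.choice(m, size=k, replace=False)
    beta = np.zeros(m)
    beta[support] = rng.normal(0.0, 1.5, size=k)
    if tukey:  # Tukey's gross error model: 0.95*N(0,1) + 0.05*N(0,10)
        heavy = rng.random(n) < 0.05
        eps = np.where(heavy, rng.normal(0.0, 10.0, n), rng.normal(0.0, 1.0, n))
    else:
        eps = rng.normal(0.0, 1.0, size=n)
    return X @ beta + eps, X, set(support.tolist())

y, X, true_support = simulate()
print(len(true_support))  # E[k] = mp = 10
```

Repeating this 1000 times and running each selection criterion on every replication yields the kind of table shown next.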
Computer simulations (2)
Table: Results for 1000 replications.

            noise: N(0,1)               noise: Tukey(0.95, 100, 1)
criterion   BIC      mBIC     rBIC     BIC      mBIC     rBIC
FP          13.3     0.073    0.08     12.5     0.08     0.1
FN          1.84     2.97     3.45     3.95     6.11     4.29
Power       0.8155   0.7030   0.6586   0.6087   0.3923   0.5806
FDR         0.5889   0.0107   0.0116   0.6487   0.0210   0.0162
MR          15.1480  3.0410   3.5310   16.4440  6.1910   4.3910
l_2         2.3610   0.6025   0.8500   13.51    4.732    1.597
d           0.9460   0.8505   0.8687   1.714    1.503    1.298

E|ε_1| ≈ 0.8, E|ε_2| ≈ 1.16
Applications for QTL mapping
Y_i = μ + Σ_{j∈I} β_j X_ij + Σ_{(u,v)∈U} γ_uv X_iu X_iv + ε_i,
where I is a certain subset of the set N = {1, ..., m} and U is a certain subset of N × N.
Standard version of mBIC - minimize
n log(RSS) + (p + r) log n + 2p log(m/2.2 − 1) + 2r log(N_e/2.2 − 1),
where p is the number of main effects, r is the number of interactions, and N_e = m(m − 1)/2.
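The penalty part of this criterion is straightforward to compute; a short sketch (function name is mine):

```python
from math import log

def mbic_qtl_penalty(n, m, p_main, r_inter, c=2.2):
    """Penalty added to n*log(RSS) in the QTL version of mBIC.

    p_main: number of main effects, r_inter: number of interactions.
    """
    Ne = m * (m - 1) // 2                        # candidate pairwise interactions
    return ((p_main + r_inter) * log(n)
            + 2 * p_main * log(m / c - 1)
            + 2 * r_inter * log(Ne / c - 1))
```

For the Drosophila analysis below, m = 59 gives N_e = 59 · 58/2 = 1711, so interaction terms pay a much larger penalty than main effects.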
Further applications for QTL mapping
1. Extension to more complicated genetic scenarios + an iterative version of mBIC: Baierl, Bogdan, Frommlet, Futschik, Genetics, 2006
2. Robust versions based on M-estimates: Baierl, Futschik, Bogdan, Biecek, CSDA, 2007
3. Rank version: Żak, Baierl, Bogdan, Futschik, Genetics, 2007
4. Taking into account the correlations between neighboring markers: Bogdan, Frommlet, Biecek, Cheng, Ghosh, Doerge, Biometrics, 2008
Real Data Analysis (1)
Huttunen et al. (2004) - data on the variation in male courtship song characters in Drosophila virilis.
Real Data Analysis (2)
Drosophila "sing" by vibrating their wings. The most common song type is the pulse song, which consists of rapid transients (short-lived oscillations) of low frequency.
Quantitative trait: PN - the number of pulses in a pulse train.
Data: 24 markers on three chromosomes, n = 520 males.
Huttunen et al. (2004) used single marker analysis and composite interval mapping. They found one QTL on chromosome 2, five QTL on chromosome 3 (not sure if there are only 2) and another QTL on chromosome 4.
We use mBIC supplied with Haley and Knott regression. We impute genotypes inside intermarker intervals so that the distance between tested positions does not exceed 10 cM, and we penalize these imputed locations like real markers. As a result, m = 59 and N_e = 1711.
Real Data Analysis (3)
Real Data Analysis (4)
Zeng et al. (2000) - data on the morphological differences between two species of Drosophila, Drosophila simulans and Drosophila mauritiana.
Trait: the size and shape of the posterior lobe of the male genital arch, quantified by a morphometric descriptor.
n_1 = 471, n_2 = 491, m = 193; genotypes at neighboring positions are closely correlated; N_e = 18,528.
Real Data Analysis, BM
[Figure: QTL positions (in cM, range 0-140) detected by forward selection, mBIC and Zeng et al., shown per chromosome.]
Real Data Analysis, BS
[Figure: QTL positions (in cM, range 0-140) detected by forward selection, mBIC and Zeng et al., shown per chromosome.]
Further work
1. Relaxing the penalty so as to control FDR instead of FWER; expected optimality for a wider range of values of p - with F. Frommlet, J. K. Ghosh, A. Chakrabarti and M. Murawska.
2. Application to association mapping - with F. Frommlet and M. Murawska.
3. Application to GLMs and zero-inflated generalized Poisson regression - with M. Żak, C. Czado, V. Erhardt.
4. Application to model selection in logic regression and comparison with Bayesian regression trees - with M. Malina, K. Ickstadt, H. Schwender.
References
1. Baierl, A., Bogdan, M., Frommlet, F., Futschik, A., 2006. On locating multiple interacting quantitative trait loci in intercross designs. Genetics 173, 1693-1703.
2. Baierl, A., Futschik, A., Bogdan, M., Biecek, P., 2007. Locating multiple interacting quantitative trait loci using robust model selection. Computational Statistics and Data Analysis 51, 6423-6434.
3. Bogdan, M., Ghosh, J.K., Doerge, R.W., 2004. Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting quantitative trait loci. Genetics 167, 989-999.
4. Bogdan, M., Frommlet, F., Biecek, P., Cheng, R., Ghosh, J.K., Doerge, R.W., 2008. Extending the Modified Bayesian Information Criterion (mBIC) to dense markers and multiple interval mapping. Biometrics, doi: 10.1111/j.1541-0420.2008.00989.x.
5. Broman, K.W., Speed, T.P., 2002. A model selection approach for the identification of quantitative trait loci in experimental crosses. J. Roy. Stat. Soc. B 64, 641-656.
6. George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88, 881-889.
7. Żak, M., Baierl, A., Bogdan, M., Futschik, A., 2007. Locating multiple interacting quantitative trait loci using rank-based model selection. Genetics 176, 1845-1854.