Selecting explanatory variables with the modified version of Bayesian Information Criterion


Selecting explanatory variables with the modified version of Bayesian Information Criterion

Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland

In cooperation with:
J.K. Ghosh, R.W. Doerge, R. Cheng - Purdue University
A. Baierl, F. Frommlet, A. Futschik - Vienna University
A. Chakrabarti - Indian Statistical Institute
P. Biecek, A. Ochman, M. Żak - Wrocław University of Technology

Vienna, 24/07/2008

Searching large data bases

Y - the quantitative variable of interest (fruit size, survival time, process yield)

Aim: identify factors influencing Y.

Properties of the data base:
- the number of potential factors, m, may be much larger than the number of cases, n
- Assumption of Sparsity: only a small proportion of the potential explanatory variables influences Y

Specific application - Locating Quantitative Trait Loci

Data for QTL mapping in backcross populations and recombinant inbred lines

Only two genotypes are possible at a given locus.

$X_{ij}$ - dummy variable encoding the genotype of the i-th individual at locus j, with $X_{ij} \in \{-1/2, 1/2\}$

Multiple regression model:

$$Y_i = \beta_0 + \sum_{j=1}^{m} \beta_j X_{ij} + \epsilon_i, \quad (0.1)$$

where $i \in \{1, \ldots, n\}$ and $\epsilon_i \sim N(0, \sigma^2)$.

Problem: estimation of the number of influential genes.
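To make the setting concrete, here is a minimal Python sketch (my illustration, not part of the original talk) that simulates data from model (0.1); the dimensions and effect-size distribution are borrowed from the simulation study later in the talk, and linkage between neighboring markers is ignored for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 300, 10                    # cases, candidate loci, true QTL

# Backcross genotypes: each X_ij equals -1/2 or 1/2 (independent loci are
# a simplification; real markers on a chromosome are correlated).
X = rng.choice([-0.5, 0.5], size=(n, m))

beta = np.zeros(m)                        # sparsity: only k nonzero effects
beta[rng.choice(m, size=k, replace=False)] = rng.normal(0, 1.5, size=k)

Y = 1.0 + X @ beta + rng.normal(0, 1.0, size=n)   # model (0.1) with beta_0 = 1
```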

Bayesian Information Criterion (1)

$M_i$ - the i-th linear model, with $k_i < n$ regressors

$\theta_i = (\beta_0, \beta_1, \ldots, \beta_{k_i}, \sigma)$ - vector of model parameters

Bayesian Information Criterion (Schwarz, 1978): maximize

$$\mathrm{BIC} = \log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2} k_i \log n$$

If $m$ is fixed, $n \to \infty$ and $X'X/n \to Q$, where $Q$ is a positive definite matrix, then BIC is consistent: the probability of choosing the proper model converges to 1.

When $n \geq 8$, BIC never chooses more regressors than AIC, and it is usually considered one of the most restrictive model selection criteria.

Surprise: Broman and Speed (JRSS B, 2002) report that BIC overestimates the number of regressors when applied to QTL mapping.
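As a reference point, the criterion above can be computed directly from the residual sum of squares of a fitted linear model. A sketch (my illustration, not the talk's code), using the fact that the maximized Gaussian log-likelihood equals $-\frac{n}{2}(\log(2\pi\hat\sigma^2) + 1)$ with $\hat\sigma^2 = \mathrm{RSS}/n$:

```python
import numpy as np

def gaussian_loglik(Y, X_sub):
    """Maximized Gaussian log-likelihood of a linear model with intercept."""
    n = len(Y)
    A = np.column_stack([np.ones(n), X_sub])
    resid = Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]
    sigma2_hat = resid @ resid / n            # MLE of sigma^2
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

def bic(Y, X_sub):
    """BIC in the slide's 'maximize' form: log L - (k/2) log n."""
    n, k = len(Y), X_sub.shape[1]
    return gaussian_loglik(Y, X_sub) - 0.5 * k * np.log(n)
```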

Explanation - Bayesian roots of BIC (1)

$f(\theta_i)$ - prior density of $\theta_i$; $\pi(M_i)$ - prior probability of $M_i$

$m_i(Y) = \int L(Y \mid M_i, \theta_i) f(\theta_i)\, d\theta_i$ - integrated likelihood of the data given the model $M_i$

Posterior probability of $M_i$: $P(M_i \mid Y) \propto m_i(Y)\, \pi(M_i)$

BIC neglects $\pi(M_i)$ and uses the approximation

$$\log m_i(Y) = \log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2}(k_i + 2) \log n + R_i,$$

where $R_i$ is bounded in $n$.

Explanation - Bayesian roots of BIC (2)

Neglecting $\pi(M_i)$ = assuming all the models have the same prior probability = assigning a large prior probability to the event that the true model contains approximately $m/2$ regressors.

Example: for $m = 200$ there are 200 models with one regressor, $\binom{200}{2} = 19900$ models with two regressors, and $\binom{200}{100} \approx 9 \times 10^{58}$ models with 100 regressors.

Modified version of BIC, mBIC (1)

M. Bogdan, J.K. Ghosh, R.W. Doerge, Genetics (2004)

Proposed solution: supplementing BIC with an informative prior distribution on the set of possible models, as proposed in George and McCulloch (1993).

$p$ - prior probability that a randomly chosen regressor influences $Y$:

$$\pi(M_i) = p^{k_i} (1-p)^{m-k_i}, \qquad \log \pi(M_i) = m \log(1-p) - k_i \log\left(\frac{1-p}{p}\right)$$

The modified version of BIC recommends choosing the model maximizing

$$\log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2} k_i \log n - k_i \log\left(\frac{1-p}{p}\right)$$

mBIC (2)

$c = mp$ - expected number of true regressors

$$\mathrm{mBIC} = \log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2} k_i \log n - k_i \log\left(\frac{m}{c} - 1\right)$$

The standard version of mBIC uses $c = 4$ to control the overall type I error at a level below 10%.

A similar $\log m$ penalty also appears in the RIC of Foster and George (1994).
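In code, mBIC only changes the penalty. A self-contained sketch (illustrative, for the same Gaussian linear model as above):

```python
import numpy as np

def mbic(Y, X_sub, m, c=4.0):
    """mBIC of the slide: log L - (k/2) log n - k log(m/c - 1).

    m is the total number of candidate regressors; c = mp is the a priori
    expected number of true regressors (the standard choice is c = 4).
    """
    n, k = len(Y), X_sub.shape[1]
    A = np.column_stack([np.ones(n), X_sub])
    resid = Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]
    loglik = -0.5 * n * (np.log(2 * np.pi * resid @ resid / n) + 1)
    return loglik - 0.5 * k * np.log(n) - k * np.log(m / c - 1.0)
```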

Relationship to multiple testing (1)

Orthogonal design:

$$X^T X = n I_{(m+1) \times (m+1)}, \quad (1)$$

BIC chooses those $X_j$'s for which

$$\frac{n \hat\beta_j^2}{\sigma^2} > \log n$$

Under $H_{0j}: \beta_j = 0$, $Z_j = \frac{\sqrt{n}\, \hat\beta_j}{\sigma} \sim N(0, 1)$.

Since for $c > 0$, $1 - \Phi(c) = \frac{\phi(c)}{c}(1 + o_c)$, it holds that for large values of $n$

$$\alpha_n = 2 P\left(Z_j > \sqrt{\log n}\right) \approx \sqrt{\frac{2}{\pi n \log n}}.$$

Relationship to multiple testing (2)

When $n$ and $m$ go to infinity and the number of true signals remains fixed, the expected number of false discoveries is of the rate $m/\sqrt{n \log n}$.

Corollary: BIC is not consistent when $m$ grows at least as fast as $\sqrt{n \log n}$.
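These rates are easy to check numerically; a small sketch of mine (using scipy.stats for the Gaussian tail), at the sample sizes of the later simulations:

```python
import numpy as np
from scipy.stats import norm

n, m = 200, 300
alpha_n = 2 * norm.sf(np.sqrt(np.log(n)))        # exact per-test level of BIC
approx = np.sqrt(2 / (np.pi * n * np.log(n)))    # asymptotic formula above
print(alpha_n, approx)          # ~0.021 vs ~0.025
print(m * alpha_n)              # ~6.4 expected false discoveries under the null
```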

Bonferroni correction for multiple testing: $\alpha_{n,m} = \alpha_n / m$

Probability of detecting at least one false positive: $\mathrm{FWER} \leq \alpha_n$

$$2(1 - \Phi(c_{Bon})) = \frac{\alpha_n}{m}$$

$$c_{Bon}^2 = 2 \log\left(\frac{m}{\alpha_n}\right)(1 + o_{n,m}) = (\log n + 2 \log m)(1 + o_{n,m}),$$

where $o_{n,m}$ converges to zero when $n$ or $m$ tends to infinity.

For comparison, the mBIC threshold satisfies

$$c_{mBIC}^2 = \log n + 2 \log\left(\frac{m}{c} - 1\right) \approx \log n + 2 \log m - 2 \log c.$$

Properties of mBIC

1. $$\mathrm{FWER} \lessapprox \frac{2c}{\sqrt{2\pi n (\log n + 2 \log m - 2 \log c)}}$$

2. The power of detecting an explanatory variable with $\beta_j \neq 0$ is given by

$$1 - P\left(-\frac{\sqrt{n}\beta_j}{\sigma} - c_{mBIC} < \frac{\sqrt{n}(\hat\beta_j - \beta_j)}{\sigma} < -\frac{\sqrt{n}\beta_j}{\sigma} + c_{mBIC}\right) > 1 - \Phi\left(c_{mBIC} - \frac{\sqrt{n}\beta_j}{\sigma}\right) \to 1.$$

Corollary: independently of the choice of $c$, mBIC is consistent.

The standard version of mBIC uses $c = 4$ to control FWER at a level below 10% when $n \geq 200$.
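As a sanity check that $c = 4$ keeps FWER below 10% at these sample sizes, a sketch of mine (assuming the bound as reconstructed above) evaluates both the bound and the exact Bonferroni-type FWER at the mBIC threshold:

```python
import numpy as np
from scipy.stats import norm

def mbic_fwer_bound(n, m, c=4.0):
    L = np.log(n) + 2 * np.log(m) - 2 * np.log(c)
    return 2 * c / np.sqrt(2 * np.pi * n * L)

def mbic_fwer_exact(n, m, c=4.0):
    # expected number of false positives at the mBIC threshold (an upper
    # bound on FWER), with all m null hypotheses true
    c_mbic = np.sqrt(np.log(n) + 2 * np.log(m / c - 1))
    return m * 2 * norm.sf(c_mbic)

print(mbic_fwer_bound(200, 300), mbic_fwer_exact(200, 300))   # ~0.060, ~0.056
```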

Asymptotic optimality of mBIC (1)

$\gamma_0$ - cost of a false discovery; $\gamma_A$ - cost of missing a true signal

$$\beta_j \sim (1-p)\,\delta_0 + p\,N(0, \tau^2)$$

Expected cost of the experiment:

$$R = m(\gamma_0 t_1 (1-p) + \gamma_A t_2 p),$$

where $t_1$ and $t_2$ are the type I and type II error probabilities.

Optimal rule (Bayes oracle):

$$\frac{f_A(\hat\beta_j)}{f_0(\hat\beta_j)} > \frac{(1-p)\gamma_0}{p\,\gamma_A},$$

where $f_A(\hat\beta_j) \sim N(0, \tau^2 + \frac{\sigma^2}{n})$ and $f_0(\hat\beta_j) \sim N(0, \frac{\sigma^2}{n})$.

Asymptotic optimality of mBIC (2)

Bayes oracle:

$$\frac{n\hat\beta_j^2}{\sigma^2} > \frac{\sigma^2 + n\tau^2}{n\tau^2}\left[\log\left(\frac{n\tau^2 + \sigma^2}{\sigma^2}\right) + 2\log\left(\frac{1-p}{p}\right) + 2\log\left(\frac{\gamma_0}{\gamma_A}\right)\right]$$

Asymptotic optimality: the model selection rule $V$ is asymptotically optimal if $\lim_{n,m \to \infty} \frac{R_V}{R_{BO}} = 1$.

Theorem 1 (Bogdan, Chakrabarti, Ghosh, 2008). Under the orthogonal design (1), mBIC is asymptotically optimal when $\lim_m mp = s$, where $s \in \mathbb{R}$.

Conjecture (Frommlet, Bogdan, 2008). Theorem 1 holds also when $\beta_j \sim (1-p)\,\delta_0 + p\,F_A$, where $F_A$ has a positive density at 0.
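For orientation, the oracle cutoff is easy to evaluate. A sketch with illustrative parameter values matching the later simulations ($\sigma^2 = 1$, $\tau^2 = 1.5^2$, $p = 1/30$; the equal costs are my assumption):

```python
import numpy as np

def oracle_threshold(n, sigma2, tau2, p, gamma0, gammaA):
    """Bayes-oracle cutoff for n * betahat_j^2 / sigma^2 (slide formula)."""
    ratio = (sigma2 + n * tau2) / (n * tau2)
    return ratio * (np.log((n * tau2 + sigma2) / sigma2)
                    + 2 * np.log((1 - p) / p)
                    + 2 * np.log(gamma0 / gammaA))

# n = 200, sigma^2 = 1, tau^2 = 2.25, p = 1/30, gamma_0 = gamma_A:
print(oracle_threshold(200, 1.0, 2.25, 1 / 30, 1.0, 1.0))   # ~12.9
```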

Computer simulations (1)

Setting: $n = 200$, $m = 300$, entries of $X \sim N(0, \sigma = 0.5)$, $k \sim \mathrm{Binomial}(m, p)$ with $p = 1/30$ ($mp = 10$), $\beta_i \sim N(0, \sigma = 1.5)$, $\varepsilon \sim N(0, 1)$, and Tukey's gross error model: $\varepsilon \sim \mathrm{Tukey}(0.95, 100, 1) = 0.95\, N(0, 1) + 0.05\, N(0, 10)$.

Characteristics: Power; $\mathrm{FDR} = \mathrm{FP}/\mathrm{AP}$; $\mathrm{MR} = \mathrm{FP} + \mathrm{FN}$; $\ell_2 = \sum_{j=1}^{m} (\beta_j - \hat\beta_j)^2$; $d$ - mean value of the absolute prediction error based on 50 additional observations.
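A condensed sketch of one replication of this setting (my reconstruction, not the authors' simulation code), scoring a greedy forward search with mBIC in the equivalent "minimize" form $n \log \mathrm{RSS} + k \log n + 2k \log(m/c - 1)$ used later in the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, c = 200, 300, 1 / 30, 4.0

X = rng.normal(0, 0.5, size=(n, m))
k_true = rng.binomial(m, p)
support = rng.choice(m, size=k_true, replace=False)
beta = np.zeros(m)
beta[support] = rng.normal(0, 1.5, size=k_true)
Y = X @ beta + rng.normal(0, 1.0, size=n)          # N(0,1) noise case

def mbic_score(sel):
    A = np.column_stack([np.ones(n), X[:, sel]])
    rss = np.sum((Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]) ** 2)
    k = len(sel)
    return n * np.log(rss) + k * np.log(n) + 2 * k * np.log(m / c - 1)

selected, best = [], mbic_score([])
while True:                                         # greedy forward search
    cand = [(mbic_score(selected + [j]), j) for j in range(m) if j not in selected]
    score, j = min(cand)
    if score >= best:
        break
    selected.append(j)
    best = score

fp = len(set(selected) - set(support))
fn = len(set(support) - set(selected))
print(f"true k = {k_true}, selected = {len(selected)}, FP = {fp}, FN = {fn}")
```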

Computer simulations (2)

Table: Results for 1000 replications.

                 noise N(0,1)                 Tukey(0.95, 100, 1)
criterion    BIC       mBIC      rBIC      BIC       mBIC      rBIC
FP           13.3      0.073     0.08      12.5      0.08      0.1
FN           1.84      2.97      3.45      3.95      6.11      4.29
Power        0.8155    0.7030    0.6586    0.6087    0.3923    0.5806
FDR          0.5889    0.0107    0.0116    0.6487    0.0210    0.0162
MR           15.1480   3.0410    3.5310    16.4440   6.1910    4.3910
l2           2.3610    0.6025    0.8500    13.51     4.732     1.597
d            0.9460    0.8505    0.8687    1.714     1.503     1.298

$E|\varepsilon_1| \approx 0.8$, $E|\varepsilon_2| \approx 1.16$

Applications for QTL mapping

$$Y_i = \mu + \sum_{j \in I} \beta_j X_{ij} + \sum_{(u,v) \in U} \gamma_{uv} X_{iu} X_{iv} + \varepsilon_i,$$

where $I$ is a certain subset of the set $N = \{1, \ldots, m\}$ and $U$ is a certain subset of $N \times N$.

Standard version of mBIC - minimize

$$n \log(\mathrm{RSS}) + (p + r) \log(n) + 2p \log(m/2.2 - 1) + 2r \log(N_e/2.2 - 1),$$

where $p$ is the number of main effects, $r$ is the number of interactions, and $N_e = m(m-1)/2$.
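The penalty is straightforward to code; a sketch (the helper name is mine) that scores a model with p main effects and r pairwise interaction terms:

```python
import numpy as np

def mbic_epistasis(n, rss, p, r, m):
    """Slide's mBIC (minimized) for p main effects and r pairwise interactions."""
    Ne = m * (m - 1) / 2                  # number of candidate interactions
    return (n * np.log(rss)
            + (p + r) * np.log(n)
            + 2 * p * np.log(m / 2.2 - 1)
            + 2 * r * np.log(Ne / 2.2 - 1))

# e.g. the Drosophila analysis below has m = 59 candidate positions, Ne = 1711
```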

Further applications for QTL mapping

1. Extending to more complicated genetic scenarios + an iterative version of mBIC: Baierl, Bogdan, Frommlet, Futschik, Genetics, 2006
2. Robust versions based on M-estimates: Baierl, Futschik, Bogdan, Biecek, CSDA, 2007
3. Rank version: Żak, Baierl, Bogdan, Futschik, Genetics, 2007
4. Taking into account the correlations between neighboring markers: Bogdan, Frommlet, Biecek, Cheng, Ghosh, Doerge, Biometrics, 2008

Real Data Analysis (1)

Huttunen et al. (2004) - data on the variation in male courtship song characters in Drosophila virilis.

Real Data Analysis (2)

Drosophila "sing" by vibrating their wings. The most common song type is the pulse song, which consists of rapid transients (short-lived oscillations) of low frequency.

Quantitative trait: PN - the number of pulses in a pulse train.

Data: 24 markers on three chromosomes, n = 520 males.

Huttunen et al. (2004) used single marker analysis and composite interval mapping. They found one QTL on chromosome 2, five QTL on chromosome 3 (not sure if there are only 2) and another QTL on chromosome 4.

We use mBIC supplied with Haley-Knott regression. We impute the genotypes inside intermarker intervals so that the distance between tested positions does not exceed 10 cM, and we penalize these imputed locations as real markers. As a result, m = 59 and N_e = 1711.

Real Data Analysis (3)

Real Data Analysis (4)

Zeng et al. (2000) - data on the morphological differences between two species of Drosophila, Drosophila simulans and Drosophila mauritiana.

Trait: the size and shape of the posterior lobe of the male genital arch, quantified by a morphometric descriptor.

$n_1 = 471$, $n_2 = 491$, $m = 193$; genotypes at neighboring positions are closely correlated; $N_e = 18528$.

Real Data Analysis, BM

[Figure: QTL locations along the chromosomes (positions 0-140) found by forward selection with mBIC, compared with those reported by Zeng et al.]

Real Data Analysis, BS

[Figure: the analogous comparison for the BS data.]

Further work

1. Relaxing the penalty so as to control FDR instead of FWER; expected optimality for a wider range of values of p - with F. Frommlet, J.K. Ghosh, A. Chakrabarti and M. Murawska.
2. Application to association mapping - with F. Frommlet and M. Murawska.
3. Application to GLM and Zero-Inflated Generalized Poisson Regression - with M. Żak, C. Czado, V. Earhardt.
4. Application to model selection in logic regression and comparison with Bayesian Regression Trees - with M. Malina, K. Ickstadt, H. Schwender.

References

1. Baierl, A., Bogdan, M., Frommlet, F., Futschik, A., 2006. On locating multiple interacting quantitative trait loci in intercross designs. Genetics 173, 1693-1703.
2. Baierl, A., Futschik, A., Bogdan, M., Biecek, P., 2007. Locating multiple interacting quantitative trait loci using robust model selection. Computational Statistics and Data Analysis 51, 6423-6434.
3. Bogdan, M., Ghosh, J.K., Doerge, R.W., 2004. Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting quantitative trait loci. Genetics 167, 989-999.
4. Bogdan, M., Frommlet, F., Biecek, P., Cheng, R., Ghosh, J.K., Doerge, R.W., 2008. Extending the Modified Bayesian Information Criterion (mBIC) to dense markers and multiple interval mapping. Biometrics, doi:10.1111/j.1541-0420.2008.00989.x.
5. Broman, K.W., Speed, T.P., 2002. A model selection approach for the identification of quantitative trait loci in experimental crosses. J. Roy. Stat. Soc. B 64, 641-656.
6. George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88, 881-889.
7. Żak, M., Baierl, A., Bogdan, M., Futschik, A., 2007. Locating multiple interacting quantitative trait loci using rank-based model selection. Genetics 176, 1845-1854.