Variable Selection in Structured High-dimensional Covariate Spaces
Fan Li (Department of Health Care Policy, Harvard University)
Nancy Zhang (Department of Statistics, Stanford University)
May 14, 2007
Problem Formulation
Consider the standard linear regression problem:
$Y = X\beta + \varepsilon$, (1)
where $Y$ is the $n \times 1$ response, $X$ is the $n \times p$ covariate matrix, and $\varepsilon$ is the $n \times 1$ vector of independent errors.
Many problems in genomics fall into this scenario:
1. Known structure exists in the covariate space.
2. $p$ is large (possibly $> n$), and we want a sparse estimate of $\beta$.
Example 1: Chromosome Copy Number Data
$X_{i,j}$: noisy measurement of copy number at location $j$ in person $i$.
$Y_i$: quantitative trait for person $i$.
Scale: $i$ in the tens or hundreds, $j$ in the thousands.
Example of $X_{i,\cdot}$: at each location, we observe a noisy measurement of the underlying copy number.
Example 1: Chromosome Copy Number
Pollack et al. (2002) breast cancer data set:
Expression of key oncogenes: ERBB2 has elevated expression in 30% of breast cancers. It is correlated with more aggressive cancer. What controls the expression of ERBB2?
Example 2: Biological Motif Analysis Data
$X_{i,w}$: count of occurrences of word $w$ upstream of gene $i$.
$Y_i$: expression of gene $i$.
Scale: $i$ in the thousands, $w$ in the thousands.
Examples of $w$: ACGCGTT, ACGCGTG, TCGCGTA, TCGCGGA.
The motifs for a single transcription factor tend to be clustered. More on this later...
Review of Structured Variable Selection
Lasso ($L_1$ penalty) type:
1. Fused lasso (Tibshirani et al., 2005): 1-d smoothing.
2. Grouped lasso (Yuan and Lin, 2006): variables added and dropped in groups.
Bayesian variable selection (BVS) framework (next slide):
1. Gibbs sampler, George and McCulloch (1993).
2. Many improvements since then...
We apply BVS in structured settings to genomic analysis.
Review: Latent Variable Model
Define latent variables $\gamma_i \in \{0, 1\}$, $i = 1, \ldots, p$.
Conditioned on $\gamma_i$:
$\beta_i \mid \gamma_i \sim (1 - \gamma_i)\,N(0, \tau_i^2) + \gamma_i\,N(0, \sigma^2 c_i^2 \tau_i^2)$. (2)
Special case:
$\beta_i \mid \gamma_i \sim (1 - \gamma_i)\,I_0 + \gamma_i\,N(0, \nu_i^2)$. (3)
Conjugate prior for the variance:
$\sigma^2 \mid \gamma \sim IG(\nu_\gamma / 2, \nu_\gamma \lambda_\gamma / 2)$.
Likelihood for the observed data:
$Y \mid \beta, X \sim N(X\beta, \sigma^2 I)$.
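To make the prior concrete, here is a minimal simulation sketch of (2) and (3). The values of $p$, $\tau$, $c$, $\sigma$, $\nu_i$, and the inclusion probability are illustrative assumptions, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)
p, tau, c, sigma = 50, 0.01, 100.0, 1.0   # illustrative hyperparameters

gamma = rng.binomial(1, 0.1, size=p)       # latent inclusion indicators
# Prior (2): a "spike" N(0, tau^2) for gamma_i = 0 and a
# "slab" N(0, sigma^2 c^2 tau^2) for gamma_i = 1.
beta = np.where(gamma == 0,
                rng.normal(0.0, tau, size=p),
                rng.normal(0.0, sigma * c * tau, size=p))
# Special case (3): the spike is an exact point mass at zero (nu_i = 1 here).
beta_special = gamma * rng.normal(0.0, 1.0, size=p)
```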
First, a Simple Markov Prior
Transition matrix:
$P = \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix}$.
Assume that $\gamma_1 \sim \pi$, where $\pi = \left(\frac{1-q}{2-p-q}, \frac{1-p}{2-p-q}\right)$ is the stationary distribution of $P$.
More interpretable to re-parameterize the prior as:
$r = \pi_1 / \pi_0 = (1-p)/(1-q)$: prior odds of including a variable;
$w = q/(1-p)$: fold change in the probability of including the next variable when the current one is included.
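A small sketch of this re-parameterization and of sampling a $\gamma$ path from the stationary chain; the chosen $(p, q)$ values are illustrative assumptions.

```python
import numpy as np

def markov_prior(p, q):
    pi1 = (1 - p) / (2 - p - q)   # stationary P(gamma_i = 1)
    r = (1 - p) / (1 - q)         # prior odds pi_1 / pi_0
    w = q / (1 - p)               # fold change: P(next=1|cur=1) / P(next=1|cur=0)
    return pi1, r, w

def sample_gamma_path(p, q, length, rng):
    """Sample gamma_1, ..., gamma_length from the stationary Markov prior."""
    pi1, _, _ = markov_prior(p, q)
    gamma = np.empty(length, dtype=int)
    gamma[0] = rng.random() < pi1
    for i in range(1, length):
        p_incl = q if gamma[i - 1] == 1 else 1 - p   # P(gamma_i = 1 | gamma_{i-1})
        gamma[i] = rng.random() < p_incl
    return gamma

rng = np.random.default_rng(1)
print(markov_prior(p=0.99, q=0.8))   # a sparse, smoothing prior
```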
Two Strategies for the Gibbs Sampler
Auxiliary chain: $f(\gamma \mid Y)$ is sampled from the auxiliary Markov chain $\beta^0, \sigma^0, \gamma^0, \beta^1, \sigma^1, \gamma^1, \ldots$
Direct chain: $f(\gamma \mid Y)$ is sampled directly from $\gamma^0, \gamma^1, \gamma^2, \ldots$
Auxiliary Chain
Posterior distribution of $\beta$:
$\beta^j \sim N_p(A_{\gamma^{j-1}} X'Y, \; A_{\gamma^{j-1}})$, (4)
where $A_{\gamma^{j-1}} = [X'X + D_{\gamma^{j-1}}^{-1} R^{-1} D_{\gamma^{j-1}}^{-1}]^{-1}$.
The hierarchical structure of the model, $\gamma_i \to \beta_i \to Y_i$, implies $f(\gamma^j \mid Y, \beta^{j-1}, \sigma^{j-1}) = f(\gamma^j \mid \beta^{j-1})$.
The posterior $f(\gamma^j \mid \beta^{j-1})$ is an inhomogeneous Markov chain.
Posterior distribution of $\sigma^j$:
$\sigma^j \sim IG\left(\frac{n + \nu_{\gamma^{j-1}}}{2}, \; \frac{\|Y - X\beta^j\|^2 + \nu_{\gamma^{j-1}} \lambda_{\gamma^{j-1}}}{2}\right)$.
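A minimal sketch of one draw from (4), assuming $R = I$ and a diagonal $D_\gamma$, using a direct Cholesky solve; this is the baseline that the fast updates on the next slide improve on, with illustrative names.

```python
import numpy as np

def draw_beta(X, Y, d_gamma, rng):
    """One draw of beta ~ N_p(A X'Y, A), A = (X'X + diag(d_gamma^-2))^{-1}."""
    M = X.T @ X + np.diag(1.0 / d_gamma**2)   # M = A^{-1}
    L = np.linalg.cholesky(M)                 # M = L L'
    mean = np.linalg.solve(M, X.T @ Y)        # posterior mean A X'Y
    z = rng.normal(size=X.shape[1])
    u = np.linalg.solve(L.T, z)               # Cov(u) = (L L')^{-1} = A
    return mean + u
```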
Auxiliary Chain: Computational Notes
Two computationally intensive tasks:
1. inverting a $p \times p$ matrix to obtain $A_\gamma$;
2. computing a square root of $A_\gamma$.
Traditionally done by Cholesky-type decomposition: $O(p^3)$ computation per sweep.
Key observation: very few entries of $\gamma$ change between consecutive sweeps.
Key idea:
1. low-rank update of the matrix inverse (e.g., the Sherman-Morrison-Woodbury formula);
2. low-rank update of the Cholesky decomposition (e.g., algorithms C1-C4 in Gill et al. (1974)).
Combining the two gives a fast update algorithm with $O(lp^2)$ computation, where $l$ is the average number of changed $\gamma$ entries per sweep.
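For concreteness, a sketch of the rank-one Sherman-Morrison-Woodbury update in the simplest case, assuming $R = I$ so that flipping one $\gamma_i$ perturbs a single diagonal entry of $A_\gamma^{-1}$; a toy illustration, not the authors' implementation.

```python
import numpy as np

def smw_diag_update(A, i, delta):
    """Update A = M^{-1} after M <- M + delta * e_i e_i'.

    O(p^2) instead of the O(p^3) cost of refactorizing M.
    """
    a_i = A[:, i]   # i-th column of the current inverse, i.e. M^{-1} e_i
    return A - np.outer(a_i, a_i) * (delta / (1.0 + delta * A[i, i]))

# Check against a direct inverse on a random positive-definite M.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))
M = X.T @ X + np.eye(10)
A = np.linalg.inv(M)
i, delta = 3, 5.0
M[i, i] += delta
assert np.allclose(smw_diag_update(A, i, delta), np.linalg.inv(M))
```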
Direct Chain
Built on the special Gaussian mixture prior (3). $\gamma_i \mid \gamma_{(-i)}$ is sampled based on:
$P(\gamma_i = 1 \mid \gamma_{(-i)}, Y) = \frac{P(\gamma_i = 1 \mid \gamma_{(-i)})}{P(\gamma_i = 1 \mid \gamma_{(-i)}) + \mathrm{BF} \cdot P(\gamma_i = 0 \mid \gamma_{(-i)})}$,
where $\mathrm{BF}$ is the Bayes factor $\mathrm{BF} = \frac{P(Y \mid \gamma_i = 0, \gamma_{(-i)})}{P(Y \mid \gamma_i = 1, \gamma_{(-i)})}$.
Collapsing over $\sigma$ and $\beta$:
$\mathrm{BF} = \nu_i \, \frac{\Gamma\!\left(\frac{n - n_i + 1 + \nu}{2}\right)}{\Gamma\!\left(\frac{n - n_i + \nu}{2}\right)} \, \frac{|A_{(-i)}|^{1/2}}{|A_i|^{1/2}} \, \frac{\left(Y'Y - Y'X_{I_i} A_i^{-1} X_{I_i}' Y + \nu\lambda\right)^{\frac{n - n_i + \nu}{2}}}{\left(Y'Y - Y'X_{I_{(-i)}} A_{(-i)}^{-1} X_{I_{(-i)}}' Y + \nu\lambda\right)^{\frac{n - n_i + 1 + \nu}{2}}}$,
where $A_i = X_{I_i}' X_{I_i} + D_{I_i}^{-2}$ for the conjugate setup.
Direct Chain: Computational Notes
Two main computational tasks:
1. inverting the $n_i \times n_i$ (resp. $n_{-i} \times n_{-i}$) matrix $A_i$ (resp. $A_{(-i)}$);
2. computing the determinants of $A_i$ and $A_{(-i)}$ (equivalent to a decomposition).
Both are required for each $\gamma_i$ in every sweep: $O(\bar{n}^3 p)$ computation per sweep by standard Cholesky, where $\bar{n}$ is the average model size.
Further speed-ups:
1. $A_i^{-1}$ is updated from $A_{(-i)}^{-1}$ using block matrix inversion: $O(n_i^2)$ computation.
2. $A_i^{1/2}$ is updated from $A_{(-i)}^{1/2}$ via a low-rank update of the Cholesky factor: $O(n_i^2)$ computation.
3. One of $A_i$ and $A_{(-i)}$ is always the same as $A_{i-1}$ or $A_{(-(i-1))}$.
Overall, our fast update algorithm is $O(\bar{n}^2 p)$. For sparse models, this becomes feasible for large $p$.
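A hedged sketch of the block-inverse step, growing $\mathrm{inv}(A_{(-i)})$ to $\mathrm{inv}(A_i)$ when variable $i$ enters the model. The Schur complement $s$ also gives the determinant update, $|A_i| = |A_{(-i)}| \cdot s$, as needed for the Bayes factor. Function and variable names are illustrative.

```python
import numpy as np

def grow_inverse(A_inv, b, c):
    """Given inv(A) for the current model, return inv([[A, b], [b', c]])."""
    u = A_inv @ b
    s = c - b @ u                        # Schur complement (a scalar)
    n = A_inv.shape[0]
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = A_inv + np.outer(u, u) / s
    out[:n, n:] = -u[:, None] / s
    out[n:, :n] = out[:n, n:].T
    out[n, n] = 1.0 / s
    return out

# Check against a direct inverse of the grown matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 6))
A = X[:, :5].T @ X[:, :5] + np.eye(5)
b = X[:, :5].T @ X[:, 5]
c = X[:, 5] @ X[:, 5] + 1.0
full = np.block([[A, b[:, None]], [b[None, :], np.array([[c]])]])
assert np.allclose(grow_inverse(np.linalg.inv(A), b, c), np.linalg.inv(full))
```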
Simulation Study: Design
$p = 200$ predictors, two blocks of $\gamma_i = 1$.
$X$ i.i.d. $N(0, 1)$, $Y = X\beta_\gamma + N(0, \sigma_\varepsilon^2)$.
25,000 iterations, at various levels of the smoothing parameter $w$.
Interested in low signal-to-noise situations: $\sigma_\varepsilon^2 = 1$, $\beta = 0.8$.
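A small data-generating sketch matching this design; the sample size and block positions are not given on the slide, so the values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, beta_val, sigma_eps = 100, 200, 0.8, 1.0   # n assumed
gamma = np.zeros(p, dtype=int)
gamma[40:50] = 1                   # first block (positions assumed)
gamma[140:150] = 1                 # second block (positions assumed)
X = rng.normal(size=(n, p))        # i.i.d. N(0, 1) covariates
beta = beta_val * gamma
Y = X @ beta + rng.normal(0.0, sigma_eps, size=n)
```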
Simulation Results: Structure in γ is used to obtain better estimates
Finding Regulators of the ERBB2 Gene
$p = 6000$, $n = 41$; adjusted the sparsity parameter $r$ so that the average model size is 20-30.
10 random restarts, 100,000 sweeps per round of Monte Carlo.
For this data set, not much difference between $w = 1$ and $w = 10$. However, a few low signals did jump out consistently for $w = 10$.
Finding Regulators of the ERBB2 Gene
Gene | Notes
PTPRN2 | Tumor suppressor gene, target of methylation in human cancers.
NR1H3, STAT1, MINK | Transcription factors that regulate cell growth; a sizeable body of data implicates these factors in the oncogenesis of breast cancer.
MLN64 | Co-regulated with ERBB2 (Alpy et al., 2003, Oncogene).
ERBB2 | Locus for ERBB2.
Biological Motif Detection
Transcription Regulation
Transcription factors...
1. regulate gene expression by helping (or inhibiting) transcription initiation.
2. bind to DNA in a sequence-specific manner.
3. play an important part in the much larger picture of expression regulation.
Ultimate goal: learn the grammar of transcription regulation.
Data Description
For each gene $g$:
Promoter sequence $S_g$
Expression $Y_g$
Regression Model
Bussemaker, Li, and Siggia (2001):
$Y_g = \beta_0 + \sum_{m=1}^{M} \beta_m X_g(m) + \text{error}$,
where $X_g(m)$ is the count of word $m$ in $S_g$, and $Y_g$ is the log expression of gene $g$.
[Figures: MCB (ACGCGT), SCB (TTTCGCG), and an arbitrary word (TGATATC), each at 21 minutes.]
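A sketch of how the counts $X_g(m)$ can be assembled from promoter sequences, assuming toy sequences and counting overlapping occurrences; illustrative only, not the authors' pipeline.

```python
import numpy as np
from itertools import product

def count_word(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(seq[i:i + len(word)] == word
               for i in range(len(seq) - len(word) + 1))

def design_matrix(promoters, L=6):
    """Rows: genes; columns: all 4^L words of length L."""
    words = ["".join(w) for w in product("ACGT", repeat=L)]
    X = np.array([[count_word(s, w) for w in words] for s in promoters])
    return X, words

# Two toy promoter sequences stand in for the real data.
X, words = design_matrix(["ACGCGTACGCGT", "TTTCGCGAATTT"], L=6)
```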
Modeling Motif Degeneracy
1. Inexact matches are allowed.
2. Not all positions are created equal.
Information Content of Positions
Pattern observed for most transcription factors:
Dimeric binding: two such peaks separated by a short distance.
This has been noted by several studies:
1. Eisen, 2005, Genome Biology.
2. Kechris et al., 2004.
3. Keles et al., 2002.
Hypercube Model
We consider all words of length $L = 6, 7$ to lie in a graph.
There is an edge between words $w_1, w_2$ if $d_{\mathrm{Hamming}}(w_1, w_2) = 1$.
The weight on the edge depends on the position of the differing letter.
Hard to draw; here's a 2-D simplification:
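A sketch of enumerating this weighted Hamming-distance-1 graph. The weight vector shown matches the cell-cycle analysis later (weight 1 at positions 1 and 6, weight 2 at positions 2-5), but is otherwise an illustrative assumption.

```python
from itertools import product

def hypercube_edges(L, position_weight):
    """Yield (w1, w2, weight) for all word pairs at Hamming distance 1."""
    for word in product("ACGT", repeat=L):
        for pos in range(L):
            for letter in "ACGT":
                if letter > word[pos]:   # emit each undirected edge once
                    other = word[:pos] + (letter,) + word[pos + 1:]
                    yield "".join(word), "".join(other), position_weight[pos]

weights = [1, 2, 2, 2, 2, 1]                  # L = 6 example
edges = list(hypercube_edges(6, weights))     # 4^6 * 6 * 3 / 2 = 36,864 edges
```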
A More General Model: Ising Prior for $\gamma$
$P(\gamma) = \frac{e^{\alpha'\gamma + \gamma' B \gamma}}{\psi(\alpha, B)}$,
where $\alpha = (\alpha_1, \ldots, \alpha_p)$ and $B = (b_{i,j})_{p \times p}$ are hyperparameters, and $\psi(\alpha, B)$ is the normalizing constant:
$\psi(\alpha, B) = \sum_{\gamma \in \{0,1\}^p} e^{\alpha'\gamma + \gamma' B \gamma}$.
Ising Prior: Posterior Computation
For each $i$, the conditional distribution
$P(\gamma_i \mid \gamma_{(-i)}) = \frac{e^{\gamma_i (\alpha_i + \sum_{j \in I_{(-i)}} b_{ij} \gamma_j)}}{1 + e^{\alpha_i + \sum_{j \in I_{(-i)}} b_{ij} \gamma_j}}$
can be computed efficiently for sparse $B$.
Apply to structured model selection:
$P(\gamma_i = 1 \mid \gamma_{(-i)}, Y) = \frac{P(\gamma_i = 1 \mid \gamma_{(-i)})}{P(\gamma_i = 1 \mid \gamma_{(-i)}) + \mathrm{BF} \cdot P(\gamma_i = 0 \mid \gamma_{(-i)})}$.
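A minimal sketch of this conditional with $B$ stored sparsely as neighbor lists, and of combining it with the direct-chain Bayes factor; names are illustrative.

```python
import math

def ising_conditional(i, gamma, alpha, neighbors):
    """P(gamma_i = 1 | gamma_{-i}) under the Ising prior.

    neighbors[i] is a list of (j, b_ij) pairs, so the cost per update
    is the degree of node i rather than p.
    """
    field = alpha[i] + sum(b_ij * gamma[j] for j, b_ij in neighbors[i])
    return 1.0 / (1.0 + math.exp(-field))

def gibbs_prob(prior_1, bf):
    """P(gamma_i = 1 | gamma_{-i}, Y) from the prior conditional and BF."""
    return prior_1 / (prior_1 + bf * (1.0 - prior_1))
```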
Example: Regulatory Motifs for Yeast Cell Cycle
Periodic time series
PCA of cell cycle experiment
Motif Analysis Results I
Motif length 6.
$b_{i,j} = \begin{cases} 1, & \text{differing position } 1 \text{ or } 6, \\ 2, & \text{differing position } 2, 3, 4, 5. \end{cases}$
Distance between the top 100 motifs found by our model:
Motif Analysis Results II
Signals that were found (posterior probability > 0.05) with smoothing but lost without it:
Words | Description
ACGCGT, TCGCGT, TCGCGA, GCGCGT, CCGCGT | MCB binding site in CLN1
TGCTGG, GGCTGG | SWI5 binding site
ACGGGT | MCM1 binding site in CLN3
TCGCGG, TCGGGT | REB1 binding sites
16 new motifs total.
Notes on Hyperparameter Selection
For the motif model, we have so far selected hyperparameters somewhat arbitrarily, for good computational properties.
For the 1-d lattice: $(r, w)$ code for sparsity and smoothness. Given $w$, $r$ can be chosen analytically to set the desired model size. Since the algorithm is $O(p \bar{n}^2)$, this is practically very important.
For general graphs: model size is no longer as easy to control. Phase transition behavior. Asymmetry in the graph: $\alpha$ should not be constant. Interpretation of the hyperparameters is not as straightforward.
Favor a priori certain structures, not certain variables.
Conclusions and Extensions
General Ising model for variable selection in structured covariate spaces.
1-d lattice: reduces to a simple Markov model, applicable to problems in genomic profiling studies.
L-d hypercube: a natural model for motif detection.
Computationally feasible for $p > 1000$.
Extensions:
Hyperparameter selection for biological motif discovery.
Convergence speed-up.
Nonlinear regression models.
Thank you!