Variable Selection in Structured High-dimensional Covariate Spaces
Fan Li (Department of Health Care Policy, Harvard University)
Nancy Zhang (Department of Statistics, Stanford University)
May
Problem Formulation

Consider the standard linear regression problem:

$$Y = X\beta + \varepsilon, \qquad (1)$$

where $Y$ is the $n \times 1$ response, $X$ is the $n \times p$ covariate matrix, and $\varepsilon$ is the $n \times 1$ vector of independent errors. Many problems in genomics fall into this scenario:

1. Known structure exists in the covariate space.
2. $p$ is large (possibly $> n$), and we want a sparse estimate of $\beta$.
Example 1: Chromosome Copy Number Data

$X_{i,j}$: noisy measure of copy number at location $j$ in person $i$.
$Y_i$: quantitative trait for person $i$.
Scale: $i$ in the tens or hundreds, $j$ in the thousands.

[Figure: an example row $X_{i,:}$.] At each location, we observe a noisy measurement of the underlying copy number.
Example 1: Chromosome Copy Number

Pollack et al. (2002) breast cancer data set. [Figure.]
Expression of Key Oncogenes

ERBB2 has elevated expression in 30% of breast cancers and is correlated with more aggressive cancer. What controls the expression of ERBB2?
Example 2: Biological Motif Analysis

$X_{i,w}$: count of occurrences of word $w$ upstream of gene $i$.
$Y_i$: expression of gene $i$.
Scale: $i$ in the thousands, $w$ in the thousands.

Examples of $w$: ACGCGTT, ACGCGTG, TCGCGTA, TCGCGGA. The motifs for a single transcription factor tend to be clustered. More on this later...
Review of Structured Variable Selection

Lasso ($L_1$ penalty) type:
1. Fused lasso (Tibshirani et al., 2005): 1-d smoothing.
2. Group lasso (Yuan and Lin, 2006): variables added and dropped in groups.

Bayesian variable selection (BVS) framework (next slide):
1. Gibbs sampler (George and McCulloch, 1993).
2. Many improvements since then...

We apply BVS in structured settings to genomic analysis.
Review: Latent Variable Model

Define latent variables $\gamma_i \in \{0,1\}$, $i = 1, \ldots, p$. Conditioned on $\gamma_i$:

$$\beta_i \mid \gamma_i \sim (1-\gamma_i)\, N(0, \tau_i^2) + \gamma_i\, N(0, \sigma^2 c_i^2 \tau_i^2). \qquad (2)$$

Special case (point mass at zero):

$$\beta_i \mid \gamma_i \sim (1-\gamma_i)\, I_0 + \gamma_i\, N(0, \nu_i^2). \qquad (3)$$

Conjugate prior for the variance:

$$\sigma^2 \mid \gamma \sim \mathrm{IG}(\nu_\gamma/2,\ \nu_\gamma \lambda_\gamma/2).$$

Likelihood for the observed data:

$$Y \mid \beta, X \sim N(X\beta, \sigma^2 I).$$
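To make the prior concrete, here is a minimal Python sketch of drawing coefficients from the point-mass special case (3); the inclusion probability and slab scale are illustrative values, not ones from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200
pi1 = 0.05          # illustrative marginal inclusion probability (assumption)
nu_slab = 1.0       # illustrative slab standard deviation nu_i (assumption)

gamma = rng.binomial(1, pi1, size=p)            # latent inclusion indicators
beta = np.where(gamma == 1,
                rng.normal(0.0, nu_slab, p),    # slab N(0, nu_i^2) when gamma_i = 1
                0.0)                            # point mass at 0 when gamma_i = 0
```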
First, a Simple Markov Prior

Transition matrix:

$$P = \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix}.$$

Assume that $\gamma_1 \sim \pi$, where $\pi = \left( \frac{1-q}{2-p-q},\ \frac{1-p}{2-p-q} \right)$ is the stationary distribution of $P$.

It is more interpretable to re-parameterize the prior as:
- $r = \pi_1/\pi_0 = \frac{1-p}{1-q}$: prior probability ratio of inclusion of a variable;
- $w = \frac{q}{1-p}$: fold change in the probability of inclusion of the next variable.
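A small numeric sketch of this reparameterization (the values of $p$ and $q$ are illustrative only):

```python
import numpy as np

# Two-state Markov prior on (gamma_i): state 0 = excluded, state 1 = included.
p, q = 0.98, 0.60   # illustrative self-transition probabilities P(0->0)=p, P(1->1)=q

P = np.array([[p, 1 - p],
              [1 - q, q]])
pi = np.array([1 - q, 1 - p]) / (2 - p - q)   # stationary distribution of P
assert np.allclose(pi @ P, pi)

r = pi[1] / pi[0]    # prior odds of inclusion, (1-p)/(1-q)
w = q / (1 - p)      # fold change in inclusion prob. of the next variable
print(pi, r, w)
```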
Two Strategies for the Gibbs Sampler

Auxiliary chain: $f(\gamma \mid Y)$ is sampled from the auxiliary Markov chain

$$\beta^0, \sigma^0, \gamma^0,\ \beta^1, \sigma^1, \gamma^1,\ \ldots$$

Direct chain: $f(\gamma \mid Y)$ is sampled directly from

$$\gamma^0, \gamma^1, \gamma^2, \ldots$$
Auxiliary Chain

Posterior distribution of $\beta$:

$$\beta^j \sim N_p\!\left(A_{\gamma^{j-1}} X'Y,\ A_{\gamma^{j-1}}\right), \qquad (4)$$

where $A_{\gamma^{j-1}} = \left[X'X + D_{\gamma^{j-1}}^{-1} R^{-1} D_{\gamma^{j-1}}^{-1}\right]^{-1}$.

The hierarchical structure of the model, $\gamma_i \to \beta_i \to Y$, implies

$$f(\gamma^j \mid Y, \beta^{j-1}, \sigma^{j-1}) = f(\gamma^j \mid \beta^{j-1}).$$

The posterior $f(\gamma^j \mid \beta^{j-1})$ is an inhomogeneous Markov chain.

Posterior distribution of $\sigma$:

$$\sigma^j \sim \mathrm{IG}\!\left(\frac{n + \nu_{\gamma^{j-1}}}{2},\ \frac{\|Y - X\beta^j\|^2 + \nu_{\gamma^{j-1}} \lambda_{\gamma^{j-1}}}{2}\right).$$
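A minimal sketch of these two conditional draws, assuming $D_\gamma$, $R$, $\nu$, and $\lambda$ are supplied; this follows the displayed forms rather than any released code:

```python
import numpy as np

def draw_beta_sigma(X, Y, D_gamma, R, nu, lam, rng):
    """One auxiliary-chain update: draw beta as in (4), then sigma^2 | beta.
    A sketch; D_gamma, R, nu, lam are assumed given by the prior."""
    n, p = X.shape
    Dinv = np.diag(1.0 / np.diag(D_gamma))
    A = np.linalg.inv(X.T @ X + Dinv @ np.linalg.inv(R) @ Dinv)
    beta = rng.multivariate_normal(A @ X.T @ Y, A)     # beta^j ~ N_p(A X'Y, A)
    resid = Y - X @ beta
    shape = (n + nu) / 2.0
    scale = (resid @ resid + nu * lam) / 2.0
    sigma2 = scale / rng.gamma(shape)                  # inverse-gamma draw
    return beta, sigma2
```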
Auxiliary Chain: Computational Notes

Two computationally intensive tasks:
1. Inverting a $p \times p$ matrix to obtain $A_\gamma$.
2. Computing the square root of $A_\gamma$.

Traditionally done by Cholesky-type decomposition: $O(p^3)$ computation per sweep.

Key observation: very few $\gamma_i$ change between two consecutive sweeps. Key ideas:
1. Low-rank update of the matrix inverse (e.g., the Sherman-Morrison-Woodbury formula).
2. Low-rank update of the Cholesky decomposition (e.g., algorithms C1-C4 in Gill et al., 1974).

Combining the two gives a fast update algorithm with $O(lp^2)$ computation per sweep, where $l$ is the average number of $\gamma_i$ that change per sweep.
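For intuition, a NumPy sketch of the rank-one Sherman-Morrison update that underlies the low-rank strategy (illustrative only; the actual algorithm combines such updates with Cholesky-factor updates):

```python
import numpy as np

def sherman_morrison_update(Ainv, u, v):
    """Return (A + u v')^{-1} given Ainv = A^{-1}: an O(p^2) rank-one update,
    versus O(p^3) for re-inverting from scratch."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

# Check against direct inversion on a small random example.
rng = np.random.default_rng(1)
p = 5
A = np.eye(p) + 0.1 * rng.standard_normal((p, p))
u, v = rng.standard_normal(p), rng.standard_normal(p)
direct = np.linalg.inv(A + np.outer(u, v))
updated = sherman_morrison_update(np.linalg.inv(A), u, v)
print(np.abs(direct - updated).max())   # should be ~0
```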
Direct Chain

Built on the special Gaussian mixture prior (3). $\gamma_i \mid \gamma_{(-i)}$ is sampled based on:

$$P(\gamma_i = 1 \mid \gamma_{(-i)}, Y) = \frac{P(\gamma_i = 1 \mid \gamma_{(-i)})}{P(\gamma_i = 1 \mid \gamma_{(-i)}) + \mathrm{BF} \cdot P(\gamma_i = 0 \mid \gamma_{(-i)})},$$

where BF is the Bayes factor $\mathrm{BF} = \frac{P(Y \mid \gamma_i = 0, \gamma_{(-i)})}{P(Y \mid \gamma_i = 1, \gamma_{(-i)})}$. Collapsing over $\sigma$ and $\beta$,

$$\mathrm{BF} = \nu_i\, \frac{\Gamma\!\left(\frac{n - n_i + 1 + \nu}{2}\right)}{\Gamma\!\left(\frac{n - n_i + \nu}{2}\right)} \frac{|A_{(-i)}|^{1/2}}{|A_i|^{1/2}} \frac{\left(Y'Y - Y'X_{I_i} A_i^{-1} X_{I_i}' Y + \nu\lambda\right)^{\frac{n - n_i + \nu}{2}}}{\left(Y'Y - Y'X_{I_{(-i)}} A_{(-i)}^{-1} X_{I_{(-i)}}' Y + \nu\lambda\right)^{\frac{n - n_i + 1 + \nu}{2}}},$$

with $A_i = X_{I_i}' X_{I_i} + D_{I_i}^{-2}$ in the conjugate setup.
Direct Chain: Computational Notes

Two main computational tasks:
1. Inverting the $n_i \times n_i$ (resp. $n_{(-i)} \times n_{(-i)}$) matrix $A_i$ (resp. $A_{(-i)}$).
2. Computing the determinants of $A_i$ and $A_{(-i)}$ (equivalent to a decomposition).

Both are required for each $\gamma_i$ in every sweep: $O(\bar{n}^3 p)$ computation per sweep by standard Cholesky, where $\bar{n}$ is the average model size.

Further speed-ups:
1. $A_i^{-1}$ is updated from $A_{(-i)}^{-1}$ using block matrix inversion: $O(n_i^2)$ computation.
2. The Cholesky factor $A_i^{1/2}$ is updated from $A_{(-i)}^{1/2}$ via a low-rank update: $O(n_i^2)$ computation.
3. One of $A_i$ and $A_{(-i)}$ is always the same as $A_{i-1}$ or $A_{(-(i-1))}$ from the previous update.

Overall, our fast update algorithm is $O(\bar{n}^2 p)$ per sweep. For sparse models, this becomes doable for large $p$.
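A sketch of the block-matrix-inversion update for adding one variable to the model, written for the conjugate $A_i = X_{I_i}'X_{I_i} + D_{I_i}^{-2}$ form; the variable names are ours, not the authors':

```python
import numpy as np

def add_variable_inverse(Ainv, X_in, x_new, d2_new):
    """Given Ainv = (X_in' X_in + D^{-2})^{-1} for the current model, return the
    inverse after adding one column x_new with prior variance term d2_new.
    Uses the block-inverse formula: O(n_i^2) instead of O(n_i^3)."""
    b = X_in.T @ x_new                    # cross products with current columns
    c = x_new @ x_new + 1.0 / d2_new      # new diagonal entry of A
    Ab = Ainv @ b
    s = c - b @ Ab                        # Schur complement (scalar)
    top_left = Ainv + np.outer(Ab, Ab) / s
    return np.block([[top_left, -Ab[:, None] / s],
                     [-Ab[None, :] / s, np.array([[1.0 / s]])]])
```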
Simulation Study: Design

$p = 200$ predictors, 2 blocks of $\gamma = 1$. $X$ i.i.d. $N(0,1)$, $Y = X\beta_\gamma + N(0, \sigma_\varepsilon^2)$. The sampler was run at various levels of the smoothing parameter $w$. We are interested in low signal-to-noise situations: $\sigma_\varepsilon^2 = 1$, $\beta = 0.8$.
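The design translates directly into a data-generation sketch; the sample size and block positions below are illustrative, since the slide does not state them:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, beta_val, sigma_eps = 100, 200, 0.8, 1.0   # n is an assumption; p, beta, sigma from the slide

gamma = np.zeros(p, dtype=int)
gamma[40:50] = 1       # two blocks of included variables (positions illustrative)
gamma[120:130] = 1

X = rng.standard_normal((n, p))                  # X i.i.d. N(0, 1)
beta = beta_val * gamma                          # beta_gamma: signal only where gamma = 1
Y = X @ beta + sigma_eps * rng.standard_normal(n)
```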
Simulation Results

Structure in $\gamma$ is used to obtain better estimates. [Figure.]
Finding Regulators of the ERBB2 Gene

$p = 6000$, $n = 41$; the sparsity $r$ was adjusted to fix the average model size. Multiple random restarts, 100,000 sweeps per round of Monte Carlo. For this data set, there was not much difference between $w = 1$ and $w = 10$; however, a few low signals did jump out consistently for $w = 10$.
Finding Regulators of the ERBB2 Gene

Gene            Notes
PTPRN2          Tumor suppressor gene, target of methylation in human cancers.
NR1H3, STAT1    Transcription factors that regulate cell growth; a sizeable body of data implicates these factors in the oncogenesis of breast cancer.
MINK
MLN64           Co-regulated with ERBB2 (Alpy et al., 2003, Oncogene).
ERBB2           Locus for ERBB2.
Biological Motif Detection
Transcription Regulation

Transcription factors:
1. regulate gene expression by helping (or inhibiting) transcription initiation;
2. bind to DNA in a sequence-specific manner;
3. play an important part in the much larger picture of expression regulation.

Ultimate goal: learn the grammar of transcription regulation.
Data Description

For each gene $g$: promoter sequence $S_g$ and expression $Y_g$.
Regression Model

Bussemaker, Li, and Siggia (2001):

$$Y_g = \beta_0 + \sum_{m=1}^{M} \beta_m X_g(m) + \text{error},$$

where $X_g(m)$ is the count of word $m$ in $S_g$ and $Y_g$ is the log expression of gene $g$.

[Figures: fitted time courses for MCB (ACGCGT), SCB (TTTCGCG), and an arbitrary word (TGATATC), each at 21 minutes.]
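A toy sketch of this word-count regression; the sequences, words, and expression values below are made up for illustration:

```python
import numpy as np

def word_counts(promoters, words):
    """Count occurrences (with overlaps) of each word in each promoter sequence."""
    X = np.zeros((len(promoters), len(words)))
    for g, seq in enumerate(promoters):
        for m, w in enumerate(words):
            X[g, m] = sum(seq[k:k + len(w)] == w
                          for k in range(len(seq) - len(w) + 1))
    return X

# Toy data (illustrative): promoter sequences, log expression, candidate words.
promoters = ["ACGCGTACGCGTTTTT", "TTTTTTGATATCAAAA",
             "ACGCGTTTTCGCGAAA", "AAAATTTTCCCCGGGG"]
expression = np.array([1.2, -0.3, 0.8, 0.1])
words = ["ACGCGT", "TGATATC", "TTTCGCG"]

X = word_counts(promoters, words)
X1 = np.column_stack([np.ones(len(promoters)), X])   # add intercept beta_0
beta_hat, *_ = np.linalg.lstsq(X1, expression, rcond=None)
```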
Modeling Motif Degeneracy

1. Inexact matches are allowed.
2. Not all positions are created equal.
Information Content of Positions

Pattern observed for most transcription factors. [Figure: information content by position.] Dimeric binding: two such peaks separated by a short distance.

This information has been noted by several studies:
1. Eisen, 2005, Genome Biology.
2. Kechris et al.; Keles et al., 2002.
Hypercube Model

We consider all words of length $L = 6, 7$ to lie in a graph. There is an edge between words $w_1, w_2$ if $d_{\mathrm{Hamming}}(w_1, w_2) = 1$. The weight on the edge depends on the position of the differing letter. The full graph is hard to draw; here is a 2-D simplification: [figure].
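A small sketch of building this weighted Hamming graph; the position-weight function is an illustrative choice echoing the $B_{i,j}$ weights used later:

```python
from itertools import product

def hypercube_edges(L, position_weight):
    """Build weighted edges between all DNA words of length L at Hamming
    distance 1; the weight depends on the position of the differing letter."""
    alphabet = "ACGT"
    edges = {}
    for word in map("".join, product(alphabet, repeat=L)):
        for pos in range(L):
            for letter in alphabet:
                if letter > word[pos]:   # count each unordered pair once
                    neighbor = word[:pos] + letter + word[pos + 1:]
                    edges[(word, neighbor)] = position_weight(pos)
    return edges

# Illustrative weights: end positions (1 and 6) matter less than the middle.
edges = hypercube_edges(6, lambda pos: 1 if pos in (0, 5) else 2)
```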
A More General Model: Ising Prior for γ

$$P(\gamma) = \frac{e^{\alpha'\gamma + \gamma' B \gamma}}{\psi(\alpha, B)},$$

where $\alpha = (\alpha_1, \ldots, \alpha_p)$ and $B = (b_{i,j})_{p \times p}$ are hyperparameters, and $\psi(\alpha, B)$ is the normalizing constant:

$$\psi(\alpha, B) = \sum_{\gamma \in \{0,1\}^p} e^{\alpha'\gamma + \gamma' B \gamma}.$$

Ising Prior: Posterior Computation

For each $i$, the conditional distribution

$$P(\gamma_i \mid \gamma_{(-i)}) = \frac{e^{\gamma_i \left(\alpha_i + \sum_{j \in I_{(-i)}} b_{ij} \gamma_j\right)}}{1 + e^{\alpha_i + \sum_{j \in I_{(-i)}} b_{ij} \gamma_j}}$$

can be computed efficiently for sparse $B$. Apply to structured model selection:

$$P(\gamma_i = 1 \mid \gamma_{(-i)}, Y) = \frac{P(\gamma_i = 1 \mid \gamma_{(-i)})}{P(\gamma_i = 1 \mid \gamma_{(-i)}) + \mathrm{BF} \cdot P(\gamma_i = 0 \mid \gamma_{(-i)})}.$$
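Putting the two displays together, one direct-chain update of $\gamma_i$ looks roughly like the following sketch; `B_sparse` and `bayes_factor` are hypothetical helpers standing in for the sparse hyperparameter matrix and the collapsed Bayes factor:

```python
import numpy as np

def gibbs_step_gamma(i, gamma, alpha, B_sparse, bayes_factor, rng):
    """One direct-chain update of gamma_i under the Ising prior: combine the
    conditional prior odds with BF = P(Y|gamma_i=0,.)/P(Y|gamma_i=1,.).
    B_sparse maps i -> list of (j, b_ij) pairs; bayes_factor(i, gamma) is a
    hypothetical callback supplied by the regression model."""
    field = alpha[i] + sum(b * gamma[j] for j, b in B_sparse.get(i, []))
    prior_odds = np.exp(field)               # P(gamma_i=1|rest) / P(gamma_i=0|rest)
    bf = bayes_factor(i, gamma)
    prob1 = prior_odds / (prior_odds + bf)   # posterior inclusion probability
    gamma[i] = int(rng.random() < prob1)
    return gamma
```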
Example: Regulatory Motifs for the Yeast Cell Cycle

[Figures: periodic expression time series; PCA of the cell cycle experiment.]
Motif Analysis Results I

Motif length 6. Edge weights:

$$B_{i,j} = \begin{cases} 1, & \text{differing position} \in \{1, 6\}; \\ 2, & \text{differing position} \in \{2, 3, 4, 5\}. \end{cases}$$

[Figure: distances between the top 100 motifs found by our model.]
Motif Analysis Results II

Signals that were found with smoothing (posterior probability > 0.05) but lost without it:

Words                                       Description
ACGCGT, TCGCGT, TCGCGA, GCGCGT, CCGCGT      MCB binding site in CLN1
TGCTGG, GGCTGG                              SWI5 binding site
ACGGGT                                      MCM1 binding site in CLN3
TCGCGG, TCGGGT                              REB1 binding sites

16 new motifs total.
Notes on Hyperparameter Selection

For the motif model, we have so far selected hyperparameters somewhat arbitrarily, for good computational properties.

For the 1-d lattice:
- $(r, w)$ code for sparsity and smoothness.
- Given $w$, $r$ can be chosen analytically to set the desired model size. Since the algorithm is $O(p\bar{n}^2)$, this is practically very important.

For general graphs:
- Model size is no longer as easy to control; phase transition behavior.
- Asymmetry in the graph: $\alpha$ should not be constant.
- Interpretation of the hyperparameters is not as straightforward.
- Favor certain structures a priori, not certain variables.
Conclusions and Extensions

- General Ising model for variable selection in structured covariate spaces.
- 1-d lattice: reduces to a simple Markov model applicable to problems in genomic profiling studies.
- L-d hypercube: a natural model for motif detection.
- Computationally feasible for p > 1000.

Extensions:
- Hyperparameter selection for biological motif discovery.
- Convergence speed-up.
- Nonlinear regression models.

Thank you!