Gibbs Sampling Methods for Multiple Sequence Alignment


Slide 1: Gibbs Sampling Methods for Multiple Sequence Alignment
Scott C. Schmidler (Section on Medical Informatics) and Jun S. Liu (Department of Statistics), Stanford University. November 17, 1999.

Slide 2: Outline
- Statistical models for multiple alignment
- Bayesian methodology
- Missing data problems
- The Gibbs sampling algorithm
- Extensions to multiple motifs and repeats
- Motif Sampler, PROBE, and beyond

Slide 3: The sequence alignment problem
Find commonalities among multiple protein sequences in order to:
- understand function
- predict structure
- reveal evolutionary relationships
Approaches: pairwise vs. multiple; global vs. local; score-based vs. model-based (EM, Gibbs, HMM).
Our basic view: the block-motif model.

Slide 4: Multiple (local) sequence alignment
Figure: k sequences of lengths $n_1, \ldots, n_k$, each containing a motif occurrence of width w starting at position $a_i$.
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$.
Objective: find the best (most probable) common patterns.

Slide 5: Statistical models
How do we describe patterns?
- Frequencies of amino acid types
- Multinomial distribution
- More generally, a probabilistic model
Figure: a typical aligned motif.

Slide 6: Multinomial distribution
Observed motif columns (a total of k = 5 sequences here):

  Seq 1: I G K P I E
  Seq 2: V G D P G E
  Seq 3: V G D D A D
  Seq 4: I G Q H P E
  Seq 5: L S G P E E

Model for the i-th column: $(k_{i,1}, k_{i,2}, \ldots, k_{i,20}) \sim \text{Multinomial}(k, p_i)$, where $p_i = (p_{i,1}, \ldots, p_{i,20})$.
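
To make the column counts concrete, here is a minimal Python sketch (mine, not from the slides) that tallies the counts $k_{i,j}$ for the five rows shown above:

```python
from collections import Counter

# Motif rows transcribed from the slide (k = 5 sequences, 6 motif columns).
rows = ["IGKPIE", "VGDPGE", "VGDDAD", "IGQHPE", "LSGPEE"]

# Tally amino acid counts per column: these are the multinomial counts k_{i,j}.
for i, column in enumerate(zip(*rows), start=1):
    print(i, Counter(column))
# Column 1 -> Counter({'I': 2, 'V': 2, 'L': 1}), and so on.
```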

Slide 7: Estimating model probabilities
Maximum likelihood: $\hat{p}_{ij} = k_{ij} / k$.
Bayesian estimate:
- Prior: $p_i \sim \text{Dirichlet}(\alpha_{i,1}, \ldots, \alpha_{i,20})$; the $\alpha_{ij}$ act as pseudo-counts.
- Posterior distribution: $[p_i \mid \text{obs}] \sim \text{Dirichlet}(\alpha_{i,1} + k_{i,1}, \ldots, \alpha_{i,20} + k_{i,20})$.
- Posterior mean: $\hat{p}_{ij} = \dfrac{k_{ij} + \alpha_{ij}}{k + \sum_j \alpha_{ij}}$.
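
As a quick numeric illustration of the posterior mean (a sketch; the 4-letter alphabet and toy counts are mine, chosen for brevity):

```python
import numpy as np

def posterior_mean(counts, alpha):
    """Posterior mean of a column's multinomial p_i under a Dirichlet(alpha)
    prior: p_hat_ij = (k_ij + alpha_ij) / (k + sum_j alpha_ij)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

col_counts = np.array([3, 1, 1, 0])       # residue counts in one column, k = 5
alpha = np.full(4, 0.5)                   # Dirichlet pseudo-counts
print(posterior_mean(col_counts, alpha))  # [0.5    0.2143 0.2143 0.0714]
```

Note how the zero-count residue still receives positive probability thanks to the pseudo-counts.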

Slide 8: Prior and posterior distributions
Figure: the density of p under a Beta(10, 10) prior and, after observing k = 90 successes and N - k = 0 failures, under the Beta(k + 10 = 100, N - k + 10 = 10) posterior. The prior belief changes after observing the data.

Slide 9: Motif alignment model
Figure: k sequences of lengths $n_1, \ldots, n_k$, each with a motif occurrence of width w starting at position $a_i$.
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$.
- Non-site positions follow a common multinomial distribution with $p_0 = (p_{0,1}, \ldots, p_{0,20})$.
- Each position i in the motif follows its own probability distribution $p_i = (p_{i,1}, \ldots, p_{i,20})$.

Slide 10: The tricky part
We don't know the alignment variables $A = \{a_1, a_2, \ldots, a_k\}$; they are "missing".
General missing data problem:
- a statistical problem of unobserved data
- data that may be potentially observable, but unrecorded
- arises in, e.g., non-response in surveys
Biological examples: sequence alignment; protein and RNA structure.

Slide 11: Dealing with the missing data
Let $\Theta = (p_0, p_1, \ldots, p_w)$ and $A = \{a_1, a_2, \ldots, a_K\}$.
Iterative sampling: draw from $P(\Theta \mid A, \text{Data})$, then from $P(A \mid \Theta, \text{Data})$.
Predictive updating: pretend the other K - 1 sequences are aligned and predict the K-th sequence's alignment stochastically (figure: $a_1, a_2, a_3, \ldots$ fixed, $a_K$ unknown).

Slide 12: The Predictive Update step
1. Compute predictive frequencies for each position i in the motif:
   - $c_{ij}$ = count of amino acid type j at motif position i (over the K - 1 aligned sequences);
   - $c_{0j}$ = count of amino acid type j in all non-site positions;
   - $q_{ij} = (c_{ij} + b_j) / (K - 1 + B)$, where the $b_j$ are pseudo-counts and $B = \sum_j b_j$.
2. Sample from the predictive distribution of $a_k$:
   $P(a_k = l) \propto \prod_{i=1}^{w} \dfrac{q_{i, R_k(l+i-1)}}{q_{0, R_k(l+i-1)}}$,
   where $R_k(j)$ denotes the residue at position j of sequence k, so the candidate motif occupies positions $l, \ldots, l + w - 1$.

Slide 13: The algorithm
- Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_K^{(0)}$.
- Iterate the following steps many times:
  - randomly or systematically choose a sequence, say sequence k, to exclude;
  - carry out the Predictive Update step to update $a_k$.
- Stop when changes are small, or when another convergence criterion is met.
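
Slides 12-13 combine into a short, self-contained toy implementation. This is a sketch under my own assumptions (a 4-letter alphabet, a width-3 motif, one planted occurrence per sequence, and invented variable names), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
A, w, K, n = 4, 3, 6, 30                  # alphabet size, motif width, #seqs, length
b = np.full(A, 0.5)                       # pseudo-counts b_j
B = b.sum()

# Toy data: random sequences, each with one planted occurrence of a motif.
motif = np.array([0, 1, 2])
seqs = []
for _ in range(K):
    s = rng.integers(0, A, n)
    pos = rng.integers(0, n - w + 1)
    s[pos:pos + w] = motif
    seqs.append(s)

starts = rng.integers(0, n - w + 1, size=K)   # random initial alignment a^(0)

def predictive(k):
    """P(a_k = l) for every legal start l, given the K-1 other alignments."""
    c = np.zeros((w, A))                  # c_ij: motif-column counts
    c0 = np.zeros(A)                      # c_0j: non-site counts
    for i in range(K):
        if i == k:
            continue                      # exclude sequence k
        s, a = seqs[i], starts[i]
        for j, r in enumerate(s):
            if a <= j < a + w:
                c[j - a, r] += 1
            else:
                c0[r] += 1
    q = (c + b) / (K - 1 + B)             # q_ij = (c_ij + b_j) / (K - 1 + B)
    q0 = (c0 + b) / (c0.sum() + B)        # background, smoothed the same way
    s = seqs[k]
    wts = np.array([np.prod(q[np.arange(w), s[l:l + w]] / q0[s[l:l + w]])
                    for l in range(n - w + 1)])
    return wts / wts.sum()

for sweep in range(200):                  # iterate "many times" (slide 13)
    for k in range(K):
        starts[k] = rng.choice(n - w + 1, p=predictive(k))
print(starts)                             # sampled motif start positions
```

Each sweep excludes one sequence, recomputes the count matrices from the other K - 1 alignments, and redraws that sequence's start from the predictive distribution of slide 12.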

Slide 14: Phase shift
The sampler sometimes gets stuck in a shifted local optimum, with every $a_k$ offset from the true motif location by the same amount. How can it escape?
- Simultaneous move: propose $A \Rightarrow A + \delta = \{a_1 + \delta, \ldots, a_K + \delta\}$.
- Use a Metropolis step: accept the move with probability $\min\{1,\, P(A + \delta \mid R) / P(A \mid R)\}$.
- In effect, this compares the entropy between the added and removed columns.
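
A sketch of the simultaneous-shift move in isolation (the `log_posterior` stand-in below is a toy surrogate for log P(A | R), which in the real sampler comes from the motif model):

```python
import numpy as np

rng = np.random.default_rng(1)

def phase_shift_move(starts, log_posterior, seq_len, w):
    """Propose shifting every a_k by the same delta; accept via Metropolis."""
    delta = rng.choice([-1, 1])
    proposal = starts + delta
    if proposal.min() < 0 or proposal.max() > seq_len - w:
        return starts                               # shift runs off an end: reject
    log_ratio = log_posterior(proposal) - log_posterior(starts)
    if np.log(rng.random()) < min(0.0, log_ratio):  # accept w.p. min{1, ratio}
        return proposal
    return starts

# Toy surrogate posterior: prefers alignments whose average start is 5.
log_post = lambda a: -10.0 * (a.mean() - 5.0) ** 2
starts = np.array([2, 3, 5, 4])
for _ in range(100):
    starts = phase_shift_move(starts, log_post, seq_len=20, w=3)
print(starts)                                       # drifts toward mean start 5
```

The key design point is that all K starts move together, so the chain can hop between shift-related modes that single-sequence updates rarely cross.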

Slide 15: Fragmentation
Imagine we have W positions for the motif, but only J of them are "important": choose the best J columns among the $\binom{W}{J}$ possibilities via a Metropolis algorithm.
Prior distribution for fragmentation: W is regarded as the length of the span, so there is no need to specify the motif width exactly with this model.
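
A sketch of the column-selection move (my own simplification: each candidate column is scored independently, whereas the real posterior couples them through the alignment):

```python
import numpy as np

rng = np.random.default_rng(2)

def fragmentation_move(on, log_col_score):
    """Swap one of the J included columns for an excluded one (Metropolis)."""
    on = on.copy()
    i = rng.choice(np.flatnonzero(on))          # column proposed to switch off
    j = rng.choice(np.flatnonzero(~on))         # column proposed to switch on
    log_ratio = log_col_score[j] - log_col_score[i]
    if np.log(rng.random()) < min(0.0, log_ratio):
        on[i], on[j] = False, True
    return on

# Toy usage: W = 8 candidate columns, J = 3 included, scores favor columns 0-2.
scores = np.array([3.0, 2.5, 2.0, 0.1, 0.0, 0.2, 0.1, 0.0])
mask = np.zeros(8, dtype=bool)
mask[[4, 5, 6]] = True                          # deliberately poor starting set
for _ in range(200):
    mask = fragmentation_move(mask, scores)
print(np.flatnonzero(mask))                     # tends to settle on columns 0-2
```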

Slide 16: Example: H-T-H proteins
- HTH (helix-turn-helix): sequence-specific DNA binding, gene regulation.
- The motifs occur as local, isolated structures; the whole 3-D structures are known and very different.
- 30 sequences with known HTH positions were chosen; the set represents a typically diverse cross-section of HTH sequences.
- The width of the motif pattern is assumed to lie in the range 17 to 22; the information-per-parameter (IPP) criterion selects 21 as the optimal width.
- Heuristic convergence criteria were developed (multiple restarts with IPP monitored).

Slide 17: Alignment
Figure: the resulting alignment (* = structural alignment).

Slide 18: Model probability ratios
Figure: the ratios $100 \times q_{i,j} / p_j$.

Slide 19: Determining the width of the motif
Figure legend:
- •: all sequences correctly aligned
- +: one or more sequences incorrectly aligned
- o: sequences with the HTH region deleted
- x: randomized sequences

Slide 20: Monitoring convergence
Figure: convergence monitoring for the sampler.

Slide 21: Motif Sampler
Often one has multiple repeats of several motifs in the set of sequences (figure: one long sequence with 3 motif types).
Idea: partition the input sequence into segments that correspond to different (unknown) motif models, using mixture models (unsupervised learning).

Slide 22: Special case: the Bernoulli sampler
- Sequence data: $R = r_1 r_2 r_3 \cdots r_N$.
- Indicator variables: $\Delta = (\delta_1, \delta_2, \ldots, \delta_N)$, where $\delta_i = 1$ if position i is the start of a motif element and $\delta_i = 0$ if not.
- $\Theta$: parameters for the motif model.
- Likelihood: $p(R \mid \Delta, \Theta, \epsilon)$, where $\epsilon$ is the prior probability that $\delta_i = 1$.
- Predictive update:
  $\dfrac{P(\delta_k = 1 \mid \Delta_{[-k]}, R)}{P(\delta_k = 0 \mid \Delta_{[-k]}, R)} = \dfrac{\hat{\epsilon}}{1 - \hat{\epsilon}} \prod_{i=1}^{w} \dfrac{\hat{p}_{i, r_{k+i-1}}}{\hat{p}_{0, r_{k+i-1}}}$,
  so a start at position k is weighed by how much better the motif model explains residues $r_k, \ldots, r_{k+w-1}$ than the background does.
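
A sketch of that predictive odds computation (here $\hat{p}$, $\hat{p}_0$, and $\hat{\epsilon}$ are passed in directly; in the sampler they would be re-estimated from the currently assigned motif elements):

```python
import numpy as np

def start_probability(seq, k, p_hat, p0_hat, eps_hat):
    """Posterior probability that a motif element starts at position k:
    odds = [eps/(1-eps)] * prod_i p_hat[i, r_{k+i-1}] / p0_hat[r_{k+i-1}]."""
    w = p_hat.shape[0]
    window = seq[k:k + w]                  # residues covered by the element
    ratio = np.prod(p_hat[np.arange(w), window] / p0_hat[window])
    odds = eps_hat / (1.0 - eps_hat) * ratio
    return odds / (1.0 + odds)

# Toy example: width-2 motif over a 4-letter alphabet.
p_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.7, 0.1, 0.1]])
p0_hat = np.full(4, 0.25)
seq = np.array([0, 1, 3, 2, 0, 1])
print(start_probability(seq, k=0, p_hat=p_hat, p0_hat=p0_hat, eps_hat=0.05))
# -> about 0.29: a strong motif match overcomes the small prior eps = 0.05.
```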

Slide 23: Example: porin-like proteins
- Membrane proteins: channels for nutrients, waste products, and antibiotics.
- 25 proteins suspected of being porins.
- 70 segments found, with width = 13.
- 4 repeats detected for each of the 3 porins with known 3-D structures; these correspond to alternating membrane-spanning β-strands, all on the outside of the structure.

Slide 24: Conserved repeats in bacterial porins (figure).

Slide 25: Multiple motif regions detected in bacterial porins (figure).

Slide 26: Example: HTH proteins (continued)
- 30 HTH sequences (J = 12, unknown number of repeats).
- Fragmentation model test: 100 runs of the Bernoulli sampler, with and without fragmentation (table: average numbers of correct and incorrect sites, with vs. without).
- Runs most often contained 18 of the 30 known sites; the most frequent solutions were 28 sites with MAP = 139 and 25 sites with MAP = 147.
- Multiple runs are needed to reveal subtle qualitative differences; different nearby modes are worth exploring.

Slide 27: Hidden Markov model architecture
Figure: a profile-HMM with match states $m_0, m_1, \ldots, m_4, m_E$, insert states $I_0, \ldots, I_4$, and delete states $D_1, \ldots, D_4$.

Slide 28: The path
Path variables: $\delta_i = 0$ if residue $r_i$ was generated by an insertion, $\delta_i = 1$ if it was generated by a match; $G_i$ = the number of deletions.
Figure: residues $r_1, \ldots, r_6$ threaded through match states $m_0, \ldots, m_4$, each residue labeled by its pair $(\delta_i, G_i)$; a solid and a dashed example path are shown.

Slide 29: More restricted models
Block-motif propagation model: limit the total number of gaps, with no deletions allowed.
Figure: Motifs 1-4 laid out along the sequence; the variant with deletion indicators adds $D_1, \ldots, D_4$ to the model.

Slide 30: Bayesian analysis of the propagation model
Random variables in the model:
- A: alignment variable indicating where each motif starts
- $\Theta$: parameters for insertion and model frequencies
- $\Delta$: fragmentation indicator (for which columns are included)
- L: total number of motifs for the alignment
- W: total number of alignment columns
- $R = (R_1, R_2, \ldots, R_K)$: the set of protein sequences to be aligned
Likelihood: $P(R_1, R_2, \ldots, R_K \mid A, \Theta, \Delta, L, W)$.
Posterior: $P(A, \Theta, \Delta, L, W \mid R) = c\, P(R \mid A, \Theta, \Delta, L, W)\, P(A, \Theta, \Delta, L, W)$, with c the normalizing constant and the final factor the prior.

Slide 31: The PROBE algorithm
- Iterative (Gibbs) sampling from the conditionals: given $A_{[-k]}$, the alignment of the remaining sequences, find the distribution of sequence k's alignment.
- With multiple motifs (motif 1 at $a_1$ through motif 4 at $a_4$, with motif models estimated from the other sequences), use recursive summations over the possible joint placements; see the sketch below.
- A genetic algorithm handles the variable dimensions L and W.
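
One way to realize those recursive summations for ordered, non-overlapping motif blocks is a forward pass over prefix sums; the formulation below is my own minimal sketch, not PROBE's implementation:

```python
import numpy as np

def forward_sums(site_weight, widths, n):
    """site_weight[m][l]: likelihood ratio for motif m starting at position l.
    Returns F, where F[m][j] is the total weight of placing motifs 1..m,
    ordered and non-overlapping, within the first j residues."""
    M = len(widths)
    F = np.zeros((M + 1, n + 1))
    F[0, :] = 1.0                          # the empty placement has weight 1
    for m in range(1, M + 1):
        w = widths[m - 1]
        for j in range(1, n + 1):
            F[m, j] = F[m, j - 1]          # motif m ends strictly before j ...
            if j >= w:                     # ... or ends exactly at position j
                F[m, j] += site_weight[m - 1][j - w] * F[m - 1, j - w]
    return F

# Toy check: two width-2 motifs in a length-8 sequence with flat site weights.
weights = [np.ones(7), np.ones(7)]         # legal starts 0..6, equally weighted
F = forward_sums(weights, widths=[2, 2], n=8)
print(F[2, 8])                             # 15.0 = number of ordered placements
```

Sequence k's alignment can then be drawn backwards: the last motif's end position is sampled in proportion to its increment to F[M][.], then the previous motif's position given that, and so on.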

Slide 32: An illustration
- Alignment of 16 putative N-acetyltransferases; the best alignment found has 4 blocks and 47 columns.
- The value plotted is the logarithm of the ratio between the likelihoods of the posited model and the null model.
- The same analysis was also applied to the shuffled sequences.
Figure: log-likelihood ratios for models with 1 to 5 blocks, together with the results for the shuffled sequences.

Slide 33: Example: the swinging-arm domain in bacterial membrane fusion proteins (MFPs)
- Lipoyl-binding domain from Bacillus stearothermophilus dihydrolipoamide acetyltransferase.
- PROBE detects conserved regions in MFPs that are shared with lipoyl- and biotin-binding domains.
- Large insertions between the two conserved regions in MFPs imply a swinging-arm mechanism.

Slide 34: Representative alignment (figure).

Slide 35: References
- Lawrence, C.E. et al. (1993). Gibbs sampling for detecting subtle motifs in multiple sequences. Science 262.
- Liu, J.S. (1994). The collapsed Gibbs sampler with application to a gene regulation problem. J. Amer. Statist. Assoc. 89.
- Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90, 1156.
- Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1999). Markovian structures in biological sequence alignments. J. Amer. Statist. Assoc., forthcoming.
- Neuwald, A.F., Liu, J.S., and Lawrence, C.E. (1995). Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science 4.
- Neuwald, A.F., Liu, J.S., Lipman, D.J., and Lawrence, C.E. (1997). Extracting protein alignment models from the sequence database. Nucleic Acids Res. 25.
