Profile HMM for multiple sequences
- Phillip Terry
1 Profile HMM for multiple sequences
2 Pair HMM: an HMM for pairwise sequence alignment which incorporates affine gap scores. Hidden states: Match (M), Insertion in x (X), Insertion in y (Y). Observation symbols: Match (M): {(a,b) | a,b in Σ}; Insertion in x (X): {(a,-) | a in Σ}; Insertion in y (Y): {(-,a) | a in Σ}.
3 Alignment: a path = a hidden state sequence. Example: x: A T - G T T A T; y: A T C G T - A C; states: M M Y M M X M M
4 Pair HMM state diagram (figure): states Begin, M, X, Y, End. Transitions: Begin/M to M with 1-2δ-τ; M to X and M to Y with δ; X to X and Y to Y with ε; X to M and Y to M with 1-ε-τ; each state to End with τ.
5 Multiple sequence alignment (Globin family)
6 Profile model (PSSM). A natural probabilistic model for a conserved region is to specify independent probabilities e_i(a) of observing nucleotide (amino acid) a in position i. The probability of a new sequence x according to this model is P(x|M) = ∏_{i=1}^{L} e_i(x_i).
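The product above takes only a few lines of code. A minimal sketch in Python, where the emission table `e` and the sequences are made-up toy values, not taken from the slides:

```python
# Toy PSSM over DNA for a length-3 motif: e[i][a] = P(symbol a at position i).
# These numbers are illustrative, not from any real profile.
e = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def pssm_prob(x, e):
    """P(x | M) = product over positions i of e_i(x_i)."""
    assert len(x) == len(e)
    p = 1.0
    for i, a in enumerate(x):
        p *= e[i][a]
    return p

print(pssm_prob("AGC", e))  # the consensus sequence scores highest under this model
```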
7 Profile / PSSM: DNA / proteins. Segments of the same length L; often represented as a positional frequency matrix:
LTMTRGDGNYLGLTVETSRLLGRFQKSGML
LTMTRGDGNYLGLTETSRLLGRFQKSGM
LTMTRGDGNYLGLTVETSRLLGRFQKSEL
LTMTRGDGNYLGLTVETSRLLGRLQKMGL
LAMSRNEGNYLGLAVETVSRVFSRFQQNEL
LAMSRNEGNYLGLAVETVSRVFTRFQQNGL
LPMSRNEGNYLGLAVETVSRVFTRFQQNGLL
VRMSREEGNYLGLTLETVSRLFSRFGREGL
LRMSREEGSYLGLKLETVSRTLSKFHQEGL
LPMCRRDGDYLGLTLETVSRALSQLHTQGL
LPMSRRDADYLGLTVETVSRAVSQLHTDGVL
LPMSRQDADYLGLTETVSRTFTKLERHGA
8 Searching profiles: inference. Given a sequence S of length L, compute the likelihood ratio of being generated from this profile vs. from the background model: R(S|P) = ∏_{i=1}^{L} e_i(x_i) / b_{x_i}. Searching for motifs in a sequence: sliding-window approach.
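The sliding-window search can be sketched as follows. The emission table, background distribution, and query sequence below are illustrative toy values:

```python
import math

# Toy length-2 profile and uniform background (illustrative values only).
e = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds(window, e, b):
    """log R = sum_i log(e_i(x_i) / b_{x_i})."""
    return sum(math.log(e[i][a] / b[a]) for i, a in enumerate(window))

def scan(seq, e, b):
    """Score every window of length len(e) along seq."""
    L = len(e)
    return [(i, log_odds(seq[i:i + L], e, b)) for i in range(len(seq) - L + 1)]

best = max(scan("TTAGTT", e, b), key=lambda t: t[1])  # best-scoring window
```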
9 Match states for profile HMMs. Match states have emission probabilities e_{M_i}(a). Architecture: Begin → M_1 → ... → M_L → End.
10 Components of profile HMMs: insert states. Emission prob. e_{I_i}(a): usually the background distribution q_a. Transition probs: M_i to I_i, I_i to itself, I_i to M_{i+1}. Log-odds score for a gap of length k (no log-odds contribution from emission, since inserts emit from the background): log a_{M_i I_i} + (k-1) log a_{I_i I_i} + log a_{I_i M_{i+1}}.
11 Components of profile HMMs: delete states. No emission prob. Cost of a deletion: transitions M → D, D → D, D → M. Each D_i → D_{i+1} might be different.
12 Full structure of profile HMMs (figure): Begin and End states plus a column of match (M_i), insert (I_i) and delete (D_i) states for each position.
13 Deriving HMMs from multiple alignments. Key idea behind profile HMMs: a model representing the consensus of the alignment of sequences from the same family, not the sequence of any particular member.
HBA_HUMAN ...VGA--HAGEY...
HBB_HUMAN ...V----NVDEV...
MYG_PHYCA ...VEA--DVAGH...
GLB3_CHITP ...VKG------D...
GLB5_PETMA ...VYS--TYETS...
LGB2_LUPLU ...FNA--NIPKH...
GLB1_GLYDI ...IAGADNGAGV...
*** *****
14 Deriving HMMs from multiple alignments. Basic profile HMM parameterization. Aim: make sequences from the family receive higher probability. Parameters: (1) the probability values: trivial if many independent aligned sequences are given, using the observed counts A_kl of transitions and E_k(a) of emissions: a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(a) = E_k(a) / Σ_{a'} E_k(a'); (2) the length of the model: chosen by heuristics or in a systematic way.
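The count-normalization formulas above can be sketched directly. The emission and transition counts below are hypothetical:

```python
from collections import Counter

def normalize(counts):
    """a_kl = A_kl / sum_l' A_kl'  (and likewise e_k(a) = E_k(a) / sum_a' E_k(a'))."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Hypothetical emission counts for one match state:
E = Counter({"A": 6, "C": 1, "G": 2, "T": 1})
e = normalize(E)

# Hypothetical transition counts out of a match state:
A = Counter({"M->M": 8, "M->I": 1, "M->D": 1})
a = normalize(A)
```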
15 Sequence conservation: entropy profile of the emission probability distributions.
16 Searching with profile HMMs. Main usage of profile HMMs: detecting potential sequences in a family, by matching a sequence to the profile HMM (Viterbi algorithm or forward algorithm) and comparing the resulting probability with the random model P(x|R) = ∏_i q_{x_i}.
17 Searching with profile HMMs. Viterbi algorithm (optimal log-odds alignment):
V^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V^M_{j-1}(i-1) + log a_{M_{j-1} M_j}, V^I_{j-1}(i-1) + log a_{I_{j-1} M_j}, V^D_{j-1}(i-1) + log a_{D_{j-1} M_j} }
V^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V^M_j(i-1) + log a_{M_j I_j}, V^I_j(i-1) + log a_{I_j I_j}, V^D_j(i-1) + log a_{D_j I_j} }
V^D_j(i) = max{ V^M_{j-1}(i) + log a_{M_{j-1} D_j}, V^I_{j-1}(i) + log a_{I_{j-1} D_j}, V^D_{j-1}(i) + log a_{D_{j-1} D_j} }
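The recursion can be sketched as a log-space dynamic program. This is a minimal illustration under stated assumptions, not a production implementation: the state naming ("M0", "I0", ...), the dictionary-based model encoding, and the one-state toy model are all made up for the example, and End-transition handling is omitted.

```python
import math

NEG_INF = float("-inf")

def viterbi_profile(x, eM, q, a):
    """Log-odds Viterbi score for a profile HMM with match/insert/delete states.

    x  : query sequence (string)
    eM : eM[j][c] = emission prob of symbol c at match state j (index 0 unused)
    q  : background distribution; inserts emit from q, so their log-odds term is 0
    a  : a[(s, t)] = transition prob from state s to t; names like "M0" (begin),
         "M1", "I0", "D1", ...  Missing transitions have probability 0.
    """
    L = len(eM) - 1
    n = len(x)
    V = {("M", 0, 0): 0.0}  # (type, j, i): begin state, nothing emitted yet

    def get(t, j, i):
        return V.get((t, j, i), NEG_INF)

    def la(s, t):  # log transition probability, -inf if absent
        p = a.get((s, t), 0.0)
        return math.log(p) if p > 0 else NEG_INF

    for j in range(0, L + 1):
        for i in range(0, n + 1):
            if j >= 1 and i >= 1:  # match state M_j emits x[i-1]
                em = math.log(eM[j][x[i - 1]] / q[x[i - 1]])
                V[("M", j, i)] = em + max(
                    get("M", j - 1, i - 1) + la(f"M{j-1}", f"M{j}"),
                    get("I", j - 1, i - 1) + la(f"I{j-1}", f"M{j}"),
                    get("D", j - 1, i - 1) + la(f"D{j-1}", f"M{j}"))
            if i >= 1:  # insert state I_j emits from the background (log-odds 0)
                V[("I", j, i)] = max(
                    get("M", j, i - 1) + la(f"M{j}", f"I{j}"),
                    get("I", j, i - 1) + la(f"I{j}", f"I{j}"),
                    get("D", j, i - 1) + la(f"D{j}", f"I{j}"))
            if j >= 1:  # delete state D_j emits nothing
                V[("D", j, i)] = max(
                    get("M", j - 1, i) + la(f"M{j-1}", f"D{j}"),
                    get("I", j - 1, i) + la(f"I{j-1}", f"D{j}"),
                    get("D", j - 1, i) + la(f"D{j-1}", f"D{j}"))
    return max(get("M", L, n), get("I", L, n), get("D", L, n))

# Hypothetical one-state toy model, just to exercise the recursion:
eM = [None, {"A": 0.9, "B": 0.1}]
q = {"A": 0.5, "B": 0.5}
a = {("M0", "M1"): 0.9, ("M0", "D1"): 0.05, ("M0", "I0"): 0.05}
score = viterbi_profile("A", eM, q, a)
```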
18 Searching with profile HMMs. Forward algorithm: summing over all potential alignments:
F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + log[ a_{M_{j-1} M_j} exp(F^M_{j-1}(i-1)) + a_{I_{j-1} M_j} exp(F^I_{j-1}(i-1)) + a_{D_{j-1} M_j} exp(F^D_{j-1}(i-1)) ]
F^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + log[ a_{M_j I_j} exp(F^M_j(i-1)) + a_{I_j I_j} exp(F^I_j(i-1)) + a_{D_j I_j} exp(F^D_j(i-1)) ]
F^D_j(i) = log[ a_{M_{j-1} D_j} exp(F^M_{j-1}(i)) + a_{I_{j-1} D_j} exp(F^I_{j-1}(i)) + a_{D_{j-1} D_j} exp(F^D_{j-1}(i)) ]
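The forward recursion replaces the Viterbi max with a log of summed exponentials, which is the numerically delicate step: exponentiating large log-probabilities overflows. The standard remedy is to factor out the largest term before exponentiating. A small sketch:

```python
import math

def log_sum_exp(values):
    """log(sum_i exp(v_i)), computed stably by factoring out the maximum."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

# The naive form overflows for large log-probabilities; the stable form does not:
print(log_sum_exp([1000.0, 1000.0]))  # fine, whereas math.exp(1000) overflows
```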
19 Variants for non-global alignments. Local alignments (flanking model): emission probs in the flanking states use background values q_a; looping prob. close to 1, e.g. (1 - η) for some small η. (Figure: flanking states Q on either side of the Begin-M-End core.)
20 Variants for non-global alignments. Overlap alignments: only transitions to the first model state are allowed. Used when the model is expected to be either present as a whole or absent. A transition to the first delete state allows a missing first residue.
21 Variants for non-global alignments. Repeat alignments: a transition from the right flanking state back to the random model; can find multiple matching segments in the query string.
22 Estimation of probabilities. Maximum likelihood (ML) estimation given observed frequency c_{ia} of residue a in position i: e_{M_i}(a) = c_{ia} / Σ_{a'} c_{ia'}. Simple pseudocounts: e_{M_i}(a) = (c_{ia} + A q_a) / (Σ_{a'} c_{ia'} + A), where q_a is the background distribution and A a weight factor.
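The pseudocount formula can be sketched as follows; the counts and the weight A below are illustrative:

```python
def match_emissions(counts, q, A):
    """e_Mi(a) = (c_ia + A * q_a) / (sum_a' c_ia' + A).

    counts : observed residue counts at one position (may be empty)
    q      : background distribution
    A      : pseudocount weight; larger A pulls the estimate toward q
    """
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0) + A * q[a]) / total for a in q}

q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# With no data the estimate falls back to the background distribution:
print(match_emissions({}, q, A=1.0))
```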
23 Optimal model construction: mark columns.
(a) Multiple alignment, with columns 1, 2 and 6 marked (x) as match columns:
bat  AG---C
rat  A-AG-C
cat  AG-AA-
gnat --AAAC
goat AG---C
(b) Profile-HMM architecture: beg, M1, M2, M3, end, with the corresponding insert and delete states (figure).
(c) Observed emission/transition counts: match emissions, insert emissions, and state transitions (M-M, M-D, M-I, I-M, D-M, D-D, ...) tallied per position (table in the original slide).
24 Optimal model construction: MAP (match-insert assignment). Recursive calculation of a number S_j: the log probability of the optimal model for the alignment up to and including column j, assuming column j is marked. S_j is calculated from S_i and the summed log probability of the transitions between columns i and j. T_ij: summed log prob. of all the state transitions between marked columns i and j: T_ij = Σ_{x,y ∈ {M,D,I}} c_xy log a_xy, where the counts c_xy are obtained from the partial state paths implied by marking i and j.
25 Optimal model construction. Algorithm: MAP model construction.
Initialization: S_0 = 0, M_{L+1} = 0.
Recurrence: for j = 1, ..., L+1:
S_j = max_{0 ≤ i < j} [ S_i + T_ij + M_j + I_{i+1,j-1} + λ ]
σ_j = argmax_{0 ≤ i < j} [ S_i + T_ij + M_j + I_{i+1,j-1} + λ ]
Traceback: from j = σ_{L+1}, while j > 0: mark column j as a match column; set j = σ_j.
(M_j and I_{i+1,j-1} are summed log probabilities of match emissions at column j and insert emissions in columns i+1..j-1; λ is a per-match-state penalty.)
26 Weighting training sequences. Are the input sequences random? The assumption that all examples are independent samples might be incorrect. Solution: weight sequences based on similarity.
27 Weighting training sequences. Simple weighting schemes derived from a tree, when a phylogenetic tree is given [Thompson, Higgins & Gibson 1994b; Gerstein, Sonnhammer & Chothia 1994]: working up the tree, the weight increment that leaf i receives at an ancestral node n with branch length t_n is Δw_i = t_n · w_i / Σ_{k: leaves below n} w_k.
28 Weighting training sequences. (Worked example on a small tree; the figure did not survive transcription. Branch lengths include t_1 = 2, t_5 = 3, t_3 = 5, t_4 = 8; the two weighting schemes give ratios w_1:w_2:w_3:w_4 = 35:35:50:... and 20:20:32:47.)
29 Multiple alignment by training profile HMMs. Sequence profiles can be represented as probabilistic models like profile HMMs. Profile HMMs can simply be used in place of standard profiles in progressive or iterative alignment methods. ML methods for building (training) profile HMMs (described previously) are based on a multiple sequence alignment. Profile HMMs can also be trained from initially unaligned sequences using the Baum-Welch (EM) algorithm.
30 Multiple alignment by profile HMM training. Multiple alignment with a known profile HMM: before estimating a model and a multiple alignment simultaneously, we consider a simpler problem: deriving a multiple alignment from a known profile HMM. This can be applied to align a large number of sequences from the same family, based on an HMM built from the (seed) multiple alignment of a small representative set of sequences in the family.
31 Multiple alignment with a known profile HMM. Aligning a sequence to a profile HMM: the Viterbi algorithm. Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence. Residues aligned to the same match state in the profile HMM should be aligned in the same columns.
32 Multiple alignment with a known profile HMM. Given a preliminary alignment, the HMM can align additional sequences.
33 Multiple alignment with a known profile HMM (figure).
34 Multiple alignment with a known profile HMM. Important difference from other MSA programs: the Viterbi path through the HMM identifies inserts; the profile HMM does not align inserts, whereas other multiple alignment algorithms align the whole sequences.
35 Profile HMM training from unaligned sequences. Harder problem: estimating both a model and a multiple alignment from initially unaligned sequences. Initialization: choose the length of the profile HMM and initialize parameters. Training: estimate the model using the Baum-Welch algorithm (iteratively). Multiple alignment: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section.
36 Profile HMM training from unaligned sequences: initial model. The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model M. A commonly used rule is to set M to the average length of the training sequences. We need some randomness in the initial parameters to avoid local maxima.
37 Multiple alignment by profile HMM training: avoiding local maxima. The Baum-Welch algorithm is only guaranteed to find a LOCAL maximum. Models are usually quite long and there are many opportunities to get stuck in a wrong solution. Solutions: start many times from different initial models; use some form of stochastic search algorithm, e.g. simulated annealing.
38 Multiple alignment by profile HMM: similarity to Gibbs sampling. The Gibbs sampler algorithm described by Lawrence et al. [1993] has substantial similarities: the problem there was to simultaneously find the motif positions and estimate the parameters of a consensus statistical model of them. The statistical model used is essentially a profile HMM with no insert or delete states.
39 Multiple alignment by profile HMM training: model surgery. We can modify the model after (or during) training by manually checking the alignment produced by the model: some match states may be redundant, and some insert states may absorb too many sequences. Model surgery: if a match state is used by fewer than half of the training sequences, delete its module (match-insert-delete states); if more than half of the training sequences use a certain insert state, expand it into n new modules, where n is the average length of the insertions. Ad hoc, but works well.
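The two surgery rules can be sketched as a simple check over state-usage statistics, e.g. gathered from the Viterbi paths of the training sequences. The function name, data layout, and usage fractions below are hypothetical, made up for the example:

```python
# match_usage[j] / insert_usage[j]: fraction of training sequences whose
# Viterbi path visits match state j / insert state j (hypothetical statistics).
def surgery_plan(match_usage, insert_usage, avg_insert_len):
    plan = []
    for j, frac in match_usage.items():
        if frac < 0.5:                      # rule 1: redundant match state
            plan.append(("delete-module", j))
    for j, frac in insert_usage.items():
        if frac > 0.5:                      # rule 2: overloaded insert state
            plan.append(("expand", j, round(avg_insert_len[j])))
    return plan

plan = surgery_plan({1: 0.9, 2: 0.3}, {1: 0.7, 2: 0.1}, {1: 3.4, 2: 0.0})
```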
40 Phylo-HMMs: modeling multiple alignments of syntenic sequences. A phylo-HMM is a probabilistic machine that generates a multiple alignment, column by column, such that each column is defined by a phylogenetic model. Unlike single-sequence HMMs, the emission probabilities of phylo-HMMs are complex distributions defined by phylogenetic models.
41 Applications of phylo-HMMs: improving phylogenetic models that allow for variation among sites in the rate of substitution (Felsenstein & Churchill, 1996; Yang, 1995); protein secondary structure prediction (Goldman et al., 1996; Thorne et al., 1996); detection of recombination from DNA multiple alignments (Husmeier & Wright, 2001); recently, comparative genomics (Siepel et al., 2005).
42 Phylo-HMMs: combining phylogeny and HMMs. Molecular evolution can be viewed as a combination of two Markov processes: one that operates in the dimension of space (along a genome), and one that operates in the dimension of time (along the branches of a phylogenetic tree). Phylo-HMMs model this combination.
43 Single-sequence HMM vs. phylo-HMM (figure).
44 Phylogenetic models: a stochastic process of substitution that operates independently at each site in a genome. A character is first drawn at random from the background distribution and assigned to the root of the tree; character substitutions then occur randomly along the tree branches, from root to leaves. The characters at the leaves define an alignment column.
45 Phylogenetic models. The different phylogenetic models associated with the states of a phylo-HMM may reflect different overall rates of substitution (e.g. in conserved and non-conserved regions), different patterns of substitution or background distributions, or even different tree topologies (as with recombination).
46 Phylo-HMMs: formal definition. A phylo-HMM is a 4-tuple θ = (S, ψ, A, b): S = {s_1, ..., s_M}: set of hidden states; ψ = {ψ_1, ..., ψ_M}: set of associated phylogenetic models; A = {a_{j,k}} (1 ≤ j, k ≤ M): transition probabilities; b = (b_1, ..., b_M): initial probabilities.
47 The phylogenetic model ψ_j = (Q_j, π_j, τ_j, β_j): Q_j: substitution rate matrix; π_j: background frequencies; τ_j: binary tree; β_j: branch lengths.
48 The phylogenetic model. The model is defined with respect to an alphabet Σ whose size is denoted d. The substitution rate matrix has dimension d × d, and the background frequency vector has dimension d. The tree has n leaves, corresponding to n extant taxa, and the branch lengths are associated with the tree.
49 Probability of the data. Let X be an alignment consisting of L columns and n rows, with the ith column denoted X_i. The probability that column X_i is emitted by state s_j is simply the probability of X_i under the corresponding phylogenetic model, P(X_i|ψ_j). This is the likelihood of the column given the tree, which can be computed efficiently using Felsenstein's pruning algorithm (described in later lectures).
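Felsenstein's pruning algorithm, used here to compute P(X_i|ψ_j), can be sketched for a toy case. Everything in this sketch is an assumption made for illustration: a hard-coded three-leaf tree, Jukes-Cantor substitution probabilities in closed form, and uniform base frequencies.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor P(b | a, t), t in expected substitutions per site."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return same if a == b else diff

def prune(node, column):
    """Return L[a] = P(observed leaves below node | node has base a)."""
    if isinstance(node, str):  # leaf: node is a taxon name
        return {a: 1.0 if a == column[node] else 0.0 for a in BASES}
    left, t_left, right, t_right = node  # internal node: (left, t, right, t)
    Ll, Lr = prune(left, column), prune(right, column)
    return {a: sum(jc_prob(a, b, t_left) * Ll[b] for b in BASES)
             * sum(jc_prob(a, b, t_right) * Lr[b] for b in BASES)
            for a in BASES}

def column_likelihood(tree, column, pi):
    """P(column | tree) = sum_a pi_a * L_root[a]."""
    root = prune(tree, column)
    return sum(pi[a] * root[a] for a in BASES)

# Toy tree: ((human:0.1, chimp:0.1):0.2, mouse:0.3), uniform root frequencies.
tree = (("human", 0.1, "chimp", 0.1), 0.2, "mouse", 0.3)
pi = {a: 0.25 for a in BASES}
lik = column_likelihood(tree, {"human": "A", "chimp": "A", "mouse": "A"}, pi)
```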
50 Substitution probabilities. Felsenstein's algorithm requires the conditional probabilities of substitution for all bases a, b ∈ Σ and branch lengths t ∈ β_j. The probability of substitution of a base b for a base a along a branch of length t, denoted P(b|a, t, ψ_j), is based on a continuous-time Markov model of substitution defined by the rate matrix Q_j.
51 Substitution probabilities. In particular, for any given non-negative value t, the conditional probabilities P(b|a, t, ψ_j) for all a, b ∈ Σ are given by the d × d matrix P_j(t) = exp(Q_j t), where exp(Q_j t) = Σ_{k=0}^{∞} (Q_j t)^k / k!.
52 Example: the HKY model. π = (π_A, π_C, π_G, π_T); κ represents the transition/transversion rate ratio for ψ_j; the '-' entries denote the quantities required to normalize each row (make it sum to zero).
53 State sequences in phylo-HMMs. A state sequence through the phylo-HMM is a sequence φ = (φ_1, ..., φ_L) such that φ_i ∈ S for 1 ≤ i ≤ L. The joint probability of a path and an alignment is: P(φ, X|θ) = b_{φ_1} P(X_1|ψ_{φ_1}) ∏_{i=2}^{L} a_{φ_{i-1}, φ_i} P(X_i|ψ_{φ_i}).
54 Phylo-HMMs. The likelihood is given by the sum over all paths (forward algorithm): P(X|θ) = Σ_φ P(φ, X|θ). The maximum-likelihood path is (Viterbi): φ̂ = argmax_φ P(φ, X|θ).
55 Computing the probabilities. The likelihood can be computed efficiently using the forward algorithm, and the maximum-likelihood path using the Viterbi algorithm. The forward and backward algorithms can be combined to compute the posterior probability P(φ_i = j | X, θ).
56 Higher-order Markov models for emissions. It is common with gene-finding HMMs to condition the emission probability of each observation on the observations that immediately precede it in the sequence. For example, in a 3rd-codon-position state, the emission of a base x_i = A might have a fairly high probability if the previous two bases are x_{i-2} = G and x_{i-1} = A (GAA = Glu), but should have zero probability if the previous two bases are x_{i-2} = T and x_{i-1} = A (TAA = stop).
57 Higher-order Markov models for emissions. Considering the N observations preceding each x_i corresponds to using an Nth-order Markov model for emissions. An Nth-order model for emissions is typically parameterized in terms of (N+1)-tuples of observations, and conditional probabilities are computed as ratios of tuple probabilities: P(x_i | x_{i-N}, ..., x_{i-1}) = P(x_{i-N}, ..., x_i) / P(x_{i-N}, ..., x_{i-1}).
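The tuple-count parameterization can be sketched as follows; the training string is a toy example made up for illustration:

```python
from collections import Counter

def nth_order_emission(seq, N):
    """Estimate P(x_i | previous N symbols) from (N+1)-tuple and N-tuple counts."""
    tuples = Counter(seq[i:i + N + 1] for i in range(len(seq) - N))
    contexts = Counter(seq[i:i + N] for i in range(len(seq) - N))

    def p(context, x):
        # Ratio of tuple counts: c(context + x) / c(context), 0 for unseen contexts.
        return tuples[context + x] / contexts[context] if contexts[context] else 0.0
    return p

p = nth_order_emission("GAAGAATAA", 2)
# In this toy string, "GAA" occurs twice and the context "GA" occurs twice,
# so P(A | GA) = 1.0; "AA" is followed once by G and once by T.
```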
58 Nth-order phylo-HMMs. Probability of the N-tuple: sum over all possible alignment columns Y (can be calculated efficiently by a slight modification of Felsenstein's pruning algorithm).
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationCSC 411 / CSC D11 / CSC C11
18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t
More informationA Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach
A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland
More informationKernel Methods and SVMs Extension
Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general
More informationPhysics 5153 Classical Mechanics. Principle of Virtual Work-1
P. Guterrez 1 Introducton Physcs 5153 Classcal Mechancs Prncple of Vrtual Work The frst varatonal prncple we encounter n mechancs s the prncple of vrtual work. It establshes the equlbrum condton of a mechancal
More informationThe Feynman path integral
The Feynman path ntegral Aprl 3, 205 Hesenberg and Schrödnger pctures The Schrödnger wave functon places the tme dependence of a physcal system n the state, ψ, t, where the state s a vector n Hlbert space
More informationNegative Binomial Regression
STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...
More informationLecture 4. Instructor: Haipeng Luo
Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would
More informationTime-Varying Systems and Computations Lecture 6
Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy
More informationDistance-Based Approaches to Inferring Phylogenetic Trees
Dstance-Base Approaches to Inferrng Phylogenetc Trees BMI/CS 576 www.bostat.wsc.eu/bm576.html Mark Craven craven@bostat.wsc.eu Fall 0 Representng stances n roote an unroote trees st(a,c) = 8 st(a,d) =
More informationOn the Repeating Group Finding Problem
The 9th Workshop on Combnatoral Mathematcs and Computaton Theory On the Repeatng Group Fndng Problem Bo-Ren Kung, Wen-Hsen Chen, R.C.T Lee Graduate Insttute of Informaton Technology and Management Takmng
More informationPredictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore
Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.
More informationModule 9. Lecture 6. Duality in Assignment Problems
Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept
More informationParametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010
Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton
More informationDesign and Analysis of Algorithms
Desgn and Analyss of Algorthms CSE 53 Lecture 4 Dynamc Programmng Junzhou Huang, Ph.D. Department of Computer Scence and Engneerng CSE53 Desgn and Analyss of Algorthms The General Dynamc Programmng Technque
More informationGrover s Algorithm + Quantum Zeno Effect + Vaidman
Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the
More informationMultiple Sequence Alignment
Introducton to Bonformatcs BINF 630 r.. Andrew Carr Multple Sequence Algnments Multple Sequence Algnment Fgure: Conserved catalytc motfs n the caspase-le superfamly of proteases. 2003 by Kluwer Academc
More informationMean Field / Variational Approximations
Mean Feld / Varatonal Appromatons resented by Jose Nuñez 0/24/05 Outlne Introducton Mean Feld Appromaton Structured Mean Feld Weghted Mean Feld Varatonal Methods Introducton roblem: We have dstrbuton but
More informationComputation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models
Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,
More informationFeature Selection: Part 1
CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?
More information6. Stochastic processes (2)
6. Stochastc processes () Lect6.ppt S-38.45 - Introducton to Teletraffc Theory Sprng 5 6. Stochastc processes () Contents Markov processes Brth-death processes 6. Stochastc processes () Markov process
More informationCE 530 Molecular Simulation
CE 530 Molecular Smulaton Lecture 22 Chan-Molecule Samplng Technques Davd A. Kofke Department of Chemcal Engneerng SUNY Buffalo kofke@eng.buffalo.edu 2 Monte Carlo Samplng MC method permts great flexblty
More information6. Stochastic processes (2)
Contents Markov processes Brth-death processes Lect6.ppt S-38.45 - Introducton to Teletraffc Theory Sprng 5 Markov process Consder a contnuous-tme and dscrete-state stochastc process X(t) wth state space
More informationCHAPTER 17 Amortized Analysis
CHAPTER 7 Amortzed Analyss In an amortzed analyss, the tme requred to perform a sequence of data structure operatons s averaged over all the operatons performed. It can be used to show that the average
More informationEnsemble Methods: Boosting
Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement
More informationDynamical Systems and Information Theory
Dynamcal Systems and Informaton Theory Informaton Theory Lecture 4 Let s consder systems that evolve wth tme x F ( x, x, x,... That s, systems that can be descrbed as the evoluton of a set of state varables
More informationMultipoint Analysis for Sibling Pairs. Biostatistics 666 Lecture 18
Multpont Analyss for Sblng ars Bostatstcs 666 Lecture 8 revously Lnkage analyss wth pars of ndvduals Non-paraetrc BS Methods Maxu Lkelhood BD Based Method ossble Trangle Constrant AS Methods Covered So
More informationLearning from Data 1 Naive Bayes
Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why
More informationMotion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong
Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal
More information