Probabilistic Arithmetic Automata


1 Probabilistic Arithmetic Automata: Applications of a Stochastic Computational Framework in Biological Sequence Analysis
Inke Herms, PhD thesis defense

2 Overview
1. Probabilistic Arithmetic Automata
2. Application I: Protein Identification by PMF
3. Modeling Protein Sequences
4. Application II: Seed Sensitivity
5. Application III: 454 Sequencing Read Statistics


4 Probabilistic Arithmetic Automaton (PAA)
Components:
1. Markov chain: a random walk describing the order of operands
2. Emissions: in each state, a weight is emitted according to the respective distribution
3. Arithmetic operations are performed on the emissions

14 Computational Framework
Probability distribution of the resulting value after t steps:
$P(V_t = v) = \sum_{q} P(\text{step } t,\ \text{state} = q,\ \text{value} = v)$, where the sum runs over all states q
Runtime: $O(t \cdot (\#\text{states})^2 \cdot \#\text{emissions} \cdot \#\text{values})$
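The computation on this slide can be illustrated with a minimal sketch. It is not the implementation from the thesis. The function name `paa_value_distribution`, the dictionary-based data layout, and the coin-flip example are assumptions made for illustration. The nested loops mirror the stated runtime: steps, pairs of states, emissions, and values.

```python
from collections import defaultdict

def paa_value_distribution(start, T, emissions, op, v0, steps):
    """Distribution of the accumulated value after `steps` emissions.

    start:     dict state -> initial probability
    T:         dict state -> dict next_state -> transition probability
    emissions: dict state -> dict weight -> emission probability
    op:        arithmetic operation, (value, weight) -> new value
    v0:        value before the first emission
    """
    # f[(q, v)] = P(step t, state = q, value = v)
    f = defaultdict(float)
    for q, p in start.items():
        for w, e in emissions[q].items():
            f[(q, op(v0, w))] += p * e
    for _ in range(steps - 1):
        g = defaultdict(float)
        for (q, v), p in f.items():
            for q2, t in T[q].items():
                for w, e in emissions[q2].items():
                    g[(q2, op(v, w))] += p * t * e
        f = g
    # Marginalize over states to obtain P(V_t = v).
    dist = defaultdict(float)
    for (q, v), p in f.items():
        dist[v] += p
    return dict(dist)

# Hypothetical toy example: three fair coin flips, value = number of heads.
states = ("H", "T")
start = {"H": 0.5, "T": 0.5}
T = {q: {"H": 0.5, "T": 0.5} for q in states}
emissions = {"H": {1: 1.0}, "T": {0: 1.0}}
dist = paa_value_distribution(start, T, emissions, lambda v, w: v + w, 0, 3)
print(dist)
```

In this toy case the result is the binomial distribution with three trials and success probability 0.5, which gives an easy sanity check for the recurrence.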

15 Applications Overview
- significance of protein identification/classification
- finding over- or underrepresented patterns in biosequences
- significance of transcription factor binding sites
- quality of database filtering criteria
- length distribution of DNA reads produced by high-throughput pyrosequencing
- expected population size under specified evolutionary processes
- etc.


24 Peptide Mass Fingerprinting (PMF) [Application I: Peptide Statistics]

25 Motivation [Application I: Peptide Statistics]
- score peak matchings with respect to a null model
- higher scores for significant matches
Goal:
- build a PAA to measure peptide fragments
- extend the i.i.d. model (H.-M. Kaltenbach, 2007) to a Markov model for peptides
- incorporate incomplete cleavage and post-translational modifications

27 In silico Digestion [Application I: Peptide Statistics]
- proteolytic cleavage by a site-specific protease
- cleavage characters Γ, prohibition characters Π
- widely used: trypsin cleaves after Γ = {K, R}, unless followed by Π = {P}
- cleavage patterns: a character from Γ followed by a character not in Π
- distinguish first and following fragments (following fragments do not start with P)
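The cleavage rule above can be sketched as follows, assuming complete cleavage. The function name `digest` and its parameter names are hypothetical. The example sequence is the PDB sequence shown later in the talk.

```python
def digest(protein, cleave_after="KR", prohibit="P"):
    """Split a protein string after every cleavage character that is
    not followed by a prohibition character (complete cleavage)."""
    if not protein:
        return []
    fragments, start = [], 0
    # The last residue never triggers a cut: there is nothing after it.
    for i, aa in enumerate(protein[:-1]):
        if aa in cleave_after and protein[i + 1] not in prohibit:
            fragments.append(protein[start:i + 1])
            start = i + 1
    fragments.append(protein[start:])
    return fragments

fragments = digest("MQHVSAPVFVFECTRLAYVQHK")
print(fragments)
```

With trypsin defaults, a K or R directly followed by P does not end a fragment, so for example `digest("MKPRAKG")` keeps `MKPR` together.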

28 PAA Design [Application I: Peptide Statistics]
- states: amino acids
- transitions: conditional amino acid frequencies; outgoing from cleavage characters: prohibition character or end of fragment
- weights: amino acid molecular masses
- operations: cumulate character weights

29 PAA Recurrence [Application I: Peptide Statistics]
Joint state-value distribution: $f_t(q, m) = P(\text{step } t,\ \text{state} = q,\ \text{mass} = m)$
Recurrence:
$f_t(q, m) = \sum_{q' \in Q} \sum_{m' \in MS(q)} f_{t-1}(q', m - m')\, T_{q'q}\, e_q(m')$

30 Fragment Statistics [Application I: Peptide Statistics]
- joint length-mass distribution: P(length = k, mass = m) = P(step k + 1, state = ζ, mass = m)
- marginalization:
  1. length distribution P(length = k)
  2. mass distribution P(mass = m)
- mass occurrence probability: P(fragmentation yields at least one fragment of mass m)
- significance of mass spectra alignment scores (H.-M. Kaltenbach, 2007)
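Written out, the marginalization step sums the joint length-mass distribution over the other variable. This is the standard definition of a marginal distribution and adds no assumptions beyond the joint distribution given on the slide:

```latex
\begin{align*}
P(\text{length} = k) &= \sum_{m} P(\text{length} = k,\ \text{mass} = m) \\
P(\text{mass} = m)   &= \sum_{k} P(\text{length} = k,\ \text{mass} = m)
\end{align*}
```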

32 Modified PAA [Application I: Peptide Statistics]
Peptide statistics are influenced by:
1. Incomplete cleavage
   - due to: inadequate conditions for the protease; self-digestion of the protease
   - inclusion into PAA: change transitions
     - reduce the probability to transit from a cleavage character to the end
     - redistribute the probability to non-prohibition characters
   - future extension: take amino acid propensities into account
2. Post-translational modifications (PTMs)
   - variants: addition of functional groups; structural changes
   - impact: modify the function of a protein; transform a precursor molecule into an active protein
   - inclusion into PAA: change weights, i.e. augment the weight distributions
     - phosphorylation: +80 Da
     - methylation: +14 Da or +28 Da

38 Results: Comparison of Peptide Models [Application I: Peptide Statistics]
[Figure: length distribution (probability vs. fragment length) and length-mass distribution (probability vs. mass m in Dalton) for the first fragment and for following fragments, comparing the i.i.d. model, the Markov model, and empirical data]

39 Results: Incomplete Cleavage and PTMs [Application I: Peptide Statistics]
[Figure: missed cleavages: probability vs. fragment length and probability vs. mass m in Dalton, complete cleavage compared with missed cleavages. Post-translational modifications: mass occurrence probability vs. mass m in Dalton, Markov model compared with additional PTMs]

40 Summary [Application I: Peptide Statistics]
Contributions:
- PAA to measure proteolytic fragments
- different models for peptides
- single molecular mass or isotopic distribution
- distribution of fragment length, mass, and mass occurrence probabilities for i.i.d. and Markov peptides
  - the Markov model generates only slightly different results than the i.i.d. model; the difference is more apparent for the first fragment
- inclusion of incomplete cleavage and PTMs
  - PTMs introduce additional fragment masses
  - missed cleavages induce the strongest effect on peptide statistics
- use for mass spectra alignment

41 Summary [Application I: Peptide Statistics]
Directions for future research:
- extend the model of incomplete cleavage
- generalization to the tandem MS context


43 Motivation [SSE Protein Model]
Why a new null model for protein sequences? To improve
- the identification of unknown proteins
- the classification of protein domains or protein families
- the prediction of secondary structure for a sequence of amino acids
all of which rely on a reasonable null model.
Goal:
- string model for protein sequences
- include information from secondary structure annotation

45 Secondary Structure Information [SSE Protein Model]
8 secondary structure elements:
- B, E: β-bridge and β-strand
- S, T: region with high curvature
- G, H, I: different kinds of helices
- C: others (random coil)
PDB sequence with annotation, split into segments:
sequence:   MQHVS APVFVF ECTR LAY VQHK
annotation: CCCCC HHHHHH CCCC HHH EEEE

48 Secondary Structure Information [SSE Protein Model]
4 secondary structure classes:
- B, E: sheet
- S, T: turn
- G, H, I: helix
- C: coil
Investigate:
- conditional amino acid frequencies within and across elements
- length of amino acid chains

50 The SSE Model [SSE Protein Model]
Secondary Structure Element Based Protein String Model

51 Evaluation: Log Likelihood [SSE Protein Model]
- compare the SSE model with the i.i.d. model and the Markov models M1 and M2
- ln(maximum likelihood) = ln P(data | model)
- more involved computation for SSE: $O(n^2)$ for a sequence of length n
[Table: number of free model parameters for i.i.d., M1, M2, and SSE]
[Figure: average log likelihood per model (i.i.d., M1, M2, SSE)]

52 Evaluation: AIC [SSE Protein Model]
- penalized model selection: balance goodness of fit and model complexity
- $\mathrm{AIC}(M) := -2 \ln(\text{maximum likelihood}) + 2 \cdot \#\text{free model parameters}$
- a good model minimizes AIC
- compare models by AIC difference: $\mathrm{AIC}(M_i) - \min_j \mathrm{AIC}(M_j)$; the best model has AIC difference 0
- account for sample size: second-order AIC
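The AIC comparison can be sketched in a few lines. The function names are hypothetical, and the log likelihoods and parameter counts in the example are made-up placeholders, not results from the talk. The second-order correction uses the standard small-sample formula AIC + 2k(k + 1)/(n - k - 1).

```python
def aic(log_likelihood, n_params):
    """AIC = -2 ln(maximum likelihood) + 2 * number of free parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

def aic_c(log_likelihood, n_params, n_samples):
    """Second-order AIC with the standard small-sample correction."""
    k, n = n_params, n_samples
    if n - k - 1 <= 0:
        raise ValueError("need n_samples > n_params + 1")
    return aic(log_likelihood, k) + 2.0 * k * (k + 1) / (n - k - 1)

def aic_differences(scores):
    """AIC(M_i) - min_j AIC(M_j) for every model; the best model gets 0."""
    best = min(scores.values())
    return {model: score - best for model, score in scores.items()}

# Hypothetical placeholder numbers, chosen only to exercise the functions.
scores = {
    "model_a": aic(-100.0, 5),   # 210.0
    "model_b": aic(-98.0, 10),   # 216.0
}
diffs = aic_differences(scores)
print(diffs)
```

A richer model only wins if its gain in log likelihood outweighs the penalty for its extra parameters, which is why `model_b` ranks second here despite the higher likelihood.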

53 Results [SSE Protein Model]
Ranking according to second-order AIC (rank, model, AIC difference):
1 SSE 0
2 M i.i.d M
SSE performs best!

54 Summary [SSE Protein Model]
Contributions:
- first generative random string model for proteins including SSEs
- $O(n^2)$ method to compute the log likelihood of a given sequence of length n
- the SSE model outperforms i.i.d., M1, and M2 according to AIC
- article submitted to RECOMB 2010

55 Summary [SSE Protein Model]
Possible extensions:
- more elaborate models within individual elements
- hybrid model to account for bad annotation of initial amino acids
- variants of the SSE model that focus on families of proteins
- classification of new proteins


57 Seeded Alignment [Application II: Seed Sensitivity]
- database search for putatively homologous sequences
- heuristic tools use filtration:
  1. select candidate sequences that match a short seed
  2. investigate candidates by exact local alignment methods
- quality crucially depends on the seed:
  - longer seed: lower sensitivity
  - shorter seed: lower specificity
Goal:
- PAA to compute seed sensitivity, i.e. the fraction of sought sequences selected by filtration
- unified approach for gapped and ungapped alignments

59 Prerequisites [Application II: Seed Sensitivity]
Model alignments with a certain degree of similarity: homology model
- Ungapped alignments: i.i.d. string over Σ = {0, 1} with match probability p
- Gapped alignments: Markovian string over Σ = {0, 1, 2, 3} with initial distribution $p^0$ and transition matrix (rows and columns indexed by 0, 1, 2, 3)
$$P = \begin{pmatrix} p_0 & p_1 & p_g & p_g \\ p_0 & p_1 & p_g & p_g \\ p_0 & p_1 & p_g & 0 \\ p_0 & p_1 & 0 & p_g \end{pmatrix}$$

60 Prerequisites [Application II: Seed Sensitivity]
- model alignments with a certain degree of similarity: homology model
- filtration criterion: seed model
  - Consecutive seed: require contiguous perfect matches, e.g. π = …
  - Spaced seed: specify discontiguous matching positions, e.g. π = 1*11*1, * ∈ [0, 1]
  - Indel seed: allow for indels of variable size, e.g. π = 1*1??1, * ∈ [0, 1], ? ∈ [ε, 0, 1, 2, 3]

61 Prerequisites [Application II: Seed Sensitivity]
- model alignments with a certain degree of similarity: homology model
- filtration criterion: seed model
- seed sensitivity: P(seed hits random alignment)
Hit count: count occurrences of seed patterns in a random alignment
Define $V_t(\pi)$: number of matches of seed π
Wanted:
1. Sensitivity: $P(V_t(\pi) \geq 1)$
2. Hit distribution: $P(V_t(\pi) = k)$, k = 0, ..., K

62 PAA Design [Application II: Seed Sensitivity]
Example seed: π = 1 1
- states: prefixes of seed patterns
- transitions: alignment column dependencies given by the homology model
- weights: number of seed patterns ending in the state
- operations: cumulate seed hits

63 Results: Sensitivity and Hit Distribution [Application II: Seed Sensitivity]
Distribution of seed hits:
$P(V_t = k) = \sum_{q \in Q} P(\text{step } t,\ \text{state} = q,\ \text{hits} = k)$
Recurrence:
$f_t(q, k) = \sum_{q' \in Q} f_{t-1}(q', k - C(q))\, T_{q'q}$
where $C(q)$ is the number of patterns ending in q
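The hit-count recurrence can be illustrated for the simplest case: a consecutive seed of w contiguous matches on an ungapped i.i.d. alignment with match probability p. This is a simplified sketch, not the general PAA construction from the talk. The states are the seed prefixes, represented here by the current match run length capped at w, and C(q) is 1 for the full-seed state and 0 otherwise. The function name and the overlapping hit counting are assumptions.

```python
def seed_hit_distribution(w, p, t, max_hits):
    """Hit distribution of a consecutive seed of weight w on an i.i.d.
    alignment of length t. Returns a list d with d[k] = P(V_t = k) for
    k < max_hits and d[max_hits] = P(V_t >= max_hits)."""
    # f[j][k] = P(current run of matches has capped length j, k hits so far)
    f = [[0.0] * (max_hits + 1) for _ in range(w + 1)]
    f[0][0] = 1.0
    for _ in range(t):
        g = [[0.0] * (max_hits + 1) for _ in range(w + 1)]
        for j in range(w + 1):
            for k in range(max_hits + 1):
                prob = f[j][k]
                if prob == 0.0:
                    continue
                # Mismatch column: back to the empty prefix, no hit.
                g[0][k] += prob * (1.0 - p)
                # Match column: extend the run; reaching w counts one hit.
                j2 = min(j + 1, w)
                k2 = min(k + (1 if j2 == w else 0), max_hits)
                g[j2][k2] += prob * p
        f = g
    return [sum(f[j][k] for j in range(w + 1)) for k in range(max_hits + 1)]

hits = seed_hit_distribution(w=2, p=0.5, t=3, max_hits=2)
sensitivity = 1.0 - hits[0]
print(hits, sensitivity)
```

Sensitivity is the probability of at least one hit, so it is read off as one minus the first entry of the hit distribution.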

64 Results: Alternative Criteria [Application II: Seed Sensitivity]
Idea: a seed should
1. maximize the sensitivity to alignments referring to homologous sequences ($M_{hom}$)
2. maximize the specificity to alignments referring to random, unrelated sequences ($M_0$)
Criterion: $\max\ P(V_t(M_{hom}) \geq k)\, P(V_t(M_0) < k)$ for k = 1, ..., K
Result: no further information; the sensitivity criterion is supported

65 Summary [Application II: Seed Sensitivity]
Contributions:
- PAA to compute seed sensitivity and the entire hit distribution
- different seed and homology models
- unifying definitions for gapped and ungapped alignments
- different occurrence counts (overlapping or non-overlapping)
- definition and evaluation of alternative criteria: sensitivity supported
- work presented at WABI 2008


67 454 Sequencing Technology [Application III: 454 Read Statistics]

68 Motivation [Application III: 454 Read Statistics]
- massively parallel DNA sequencing
- current GS-FLX: 400,000 reads with 250 nt on average
- current GS-FLX Titanium: 1,000,000 reads with 400 nt on average
- still short reads (Sanger: 800 nt)
- the longer the reads, the better (for genome assembly)
Goal:
- PAA to compute the length distribution of 454 reads
- investigate potential improvement of the average read length

70 454 Sequencing [Application III: 454 Read Statistics]
- sequencing by synthesis
- nucleotides are flown periodically according to a specified order
- incorporation is accompanied by a flash of light
- the signal is proportional to the number of synthesized nucleotides (up to 8 nt)
Dispensation order d = d[1] ... d[l] with d[i] ∈ Σ = {A, C, G, T}, where
1. every character from Σ has to occur in d
2. no character is flown twice in a row: d[i] = σ ⇒ d[i + 1] ≠ σ
3. d[1] ≠ d[l]
GS-FLX standard dispensation order: TACG
100 cycles = 400 nucleotide flows

73 Influence of d on Read Length [Application III: 454 Read Statistics]
[Figure: nucleotide flows per cycle for the dispensation orders TACG and GTCA on an example sequence; read lengths: 6 and 10]
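The effect of the dispensation order on read length can be sketched with an idealized simulation. The function `read_length` and the example template are hypothetical. The sketch treats the template as the strand being synthesized and ignores the signal limit of about 8 nt per flow and any quality filtering.

```python
def read_length(template, order, n_flows):
    """Number of nucleotides of `template` synthesized after `n_flows`
    flows when nucleotides are flown cyclically according to `order`."""
    pos = 0
    for flow in range(n_flows):
        nucleotide = order[flow % len(order)]
        # One flow incorporates the whole run of the flown nucleotide.
        while pos < len(template) and template[pos] == nucleotide:
            pos += 1
    return pos

# Hypothetical template: the same 8 flows yield different read lengths.
template = "GTCAGTCA"
len_tacg = read_length(template, "TACG", 8)
len_gtca = read_length(template, "GTCA", 8)
print(len_tacg, len_gtca)
```

An order that matches the frequent dinucleotides of the template advances with almost every flow, whereas a poorly matched order wastes most flows.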

74 PAA Design [Application III: 454 Read Statistics]
- states: dispensation order d
- transitions: conditional nucleotide frequencies
- weights: Dirac measures (emission 1 in all nucleotide states)
- operations: cumulate the number of inserted nucleotides

75 Proof of Concept [Application III: 454 Read Statistics]
- comparison of computed read lengths to empirical data
- simulated reads, processed according to the TrimBack filter
- the PAA yields reasonable results
[Figure: read length distributions (frequency vs. length): theoretical vs. empirical and simulated vs. empirical]

76 The Optimal Dispensation Order [Application III: 454 Read Statistics]
Observation for sample reads:
- dispensation orders of length 4 perform best
- compared expected read length under all orders of length 4
- CGAT provides on average 10% longer reads than TACG (282 vs. 257)
- the optimal dispensation order maximizes the sum of dinucleotide frequencies
For a new genome:
1. estimate conditional nucleotide frequencies (from preceding sequencing runs)
2. design a PAA and compute the expected read length for orders of length 4
3. return the best-performing order

78 Summary [Application III: 454 Read Statistics]
Contributions:
- PAA to compute the length distribution of 454 sequence reads
  - for dispensation orders of different lengths
  - for different models of nucleotide sequences
- the choice of dispensation order can have a noticeable effect on the average read length
- propose an optimal dispensation order for a new genome, given preliminary reads

79 Summary [Application III: 454 Read Statistics]
Future prospects:
- theoretical analysis of dispensation orders
- experimental validation

80 Publications
- I. Herms and S. Rahmann. Modeling Protein Sequences Including Information from Secondary Structure. Submitted to RECOMB 2010.
- S. Wolfsheimer, I. Herms, S. Rahmann and A.K. Hartmann. Accurate Statistics for Local Sequence Alignment with Position-Dependent Scoring by Rare-Event Sampling. Submitted to BMC Bioinformatics, 2009.
- I. Herms and S. Rahmann. Computing Alignment Seed Sensitivity with Probabilistic Arithmetic Automata. WABI 2008, LNBI 5251, pp
- E. Baake and I. Herms. Single-Crossover Dynamics: Finite versus Infinite Populations. Bulletin of Mathematical Biology 70 (2008), pp

81 Conclusion
PAA framework:
- three applications of the PAA framework to sequence statistics
- general, unifying framework
- elegant formulation and simplification of solutions to different tasks
- straightforward computations (re-use of algorithms from automata and Markov chain theory)
- extensions to formerly investigated questions
SSE model:
- biologically motivated null model for protein sequences
- outperforms customary string models

82 Analysis of Secondary Structure Segments [Appendix]
[Figure: relative entropy between amino acid compositions within individually annotated segments]

83 Secondary Structure Classification [Appendix]
[Figure: amino acid compositions after collapsing the structure annotation to 4 classes; one panel each for coil C, helix H, sheet S, and turn T, showing the frequencies of the amino acids A C D E F G H I K L M N P Q R S T V W Y]

84 Model Log Likelihoods [Appendix]
For a sequence s of length $|s| = n$:
$\ln P(s \mid \text{i.i.d.}) = \sum_{i=1}^{|s|} \ln P(S_i = s[i]) = \sum_{i=1}^{|s|} \ln p_{s[i]}$
$\ln P(s \mid \text{M1}) = \ln p^0_{s[1]} + \sum_{i=1}^{|s|-1} \ln P_{s[i]s[i+1]}$
$\ln P(s \mid \text{M2}) = \ln p^0_{s[1]} + \ln P_{s[1]s[2]} + \sum_{i=1}^{|s|-2} \ln P(S_{i+2} = s[i+2] \mid S_i = s[i],\ S_{i+1} = s[i+1])$
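The i.i.d. and first-order Markov (M1) log likelihoods can be sketched directly from these sums. The function names and the two-letter toy alphabet are assumptions for illustration; real parameters would be estimated from protein data.

```python
import math

def loglik_iid(s, p):
    """ln P(s | i.i.d.) with character probabilities p[c]."""
    return sum(math.log(p[c]) for c in s)

def loglik_m1(s, p0, P):
    """ln P(s | M1) with initial distribution p0 and transition
    probabilities P[a][b] = P(next = b | current = a)."""
    ll = math.log(p0[s[0]])
    for a, b in zip(s, s[1:]):
        ll += math.log(P[a][b])
    return ll

# Hypothetical toy parameters over the alphabet {A, B}.
p = {"A": 0.7, "B": 0.3}
P = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}
ll_iid = loglik_iid("AAB", p)
ll_m1 = loglik_m1("AAB", p, P)
print(ll_iid, ll_m1)
```

If every row of the transition matrix equals the i.i.d. distribution, the two models assign the same likelihood, which is a convenient consistency check.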

85 Log Likelihood of SSE Model [Appendix]
Compute $\ln P(S = s \mid \text{SSE})$ with $|s| = n$:
$P(S = s) = \sum_{v \in Q} P(S_1, \ldots, S_n = s[1, n],\ Y_n = v)$
Variant of the forward algorithm:
$f_k(v) := P(S_1, \ldots, S_k = s[1, k],\ Y_k = v,\ Y_{k+1} \neq v)$ for $1 \leq k < n$
$f_n(v) := P(S_1, \ldots, S_n = s[1, n],\ Y_n = v)$
$P(S = s) = \sum_{v \in Q} f_n(v)$

86 Log Likelihood of SSE Model [Appendix]
Recurrence, for $1 \leq k < n$:
$f_k(v) = \sum_{u \in Q} \sum_{l=1}^{k-1} f_l(u)\, T^{uv}_{s[l]s[l+1]}\, p_v(s[l+1, k])\, P(R(v) = k - l) + P_{\text{start},v}\, \pi^v_{s[1]}\, p_v(s[1, k])\, P(R(v) = k)$
For the last position:
$f_n(v) = \sum_{u \in Q} \sum_{l=1}^{n-1} f_l(u)\, T^{uv}_{s[l]s[l+1]}\, p_v(s[l+1, n])\, P(R(v) \geq n - l) + P_{\text{start},v}\, \pi^v_{s[1]}\, p_v(s[1, n])\, P(R(v) \geq n)$
Output: $f(s) = \sum_{v \in Q} f_n(v)$

87 Log Likelihood of SSE Model [Appendix]
For the log likelihood: implementation of logarithms of sums of small values
Use $\ln(x + y) = \ln(x(1 + y/x)) = \ln x + \ln(1 + \exp(\ln y - \ln x))$:
$\ln f_k(v) = \ln F^0_k(v) + \mathrm{log1p}\left( \sum_{l=1}^{k-1} \exp\left( \ln F^l_k(v) - \ln F^0_k(v) \right) \right)$
where $F^l_k(v)$ is the probability of a path that generates the length-k prefix, with the last k - l residues produced in element v
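The log-space summation trick can be sketched as follows. The helper names `log_add` and `log_sum` are hypothetical. One deliberate deviation from the slide: the slide uses the first term $F^0_k(v)$ as the reference, while this sketch uses the largest term, which keeps every exponent non-positive.

```python
import math

def log_add(ln_x, ln_y):
    """Return ln(x + y) given ln x and ln y, without leaving log space."""
    if ln_x < ln_y:
        ln_x, ln_y = ln_y, ln_x
    if ln_y == -math.inf:  # adding zero
        return ln_x
    return ln_x + math.log1p(math.exp(ln_y - ln_x))

def log_sum(ln_terms):
    """Return ln(sum_i x_i) given a non-empty list of ln x_i."""
    ref = max(ln_terms)
    if ref == -math.inf:  # all terms are zero
        return -math.inf
    rest = list(ln_terms)
    rest.remove(ref)  # drop one occurrence of the reference term
    return ref + math.log1p(sum(math.exp(t - ref) for t in rest))

# Probabilities around e^-1000 underflow to 0.0 as floats,
# but their sum is still representable in log space.
tiny = log_sum([-1000.0, -1001.0])
print(tiny)
```

Summing the plain probabilities here would give `math.log(0.0)` and fail, which is the situation the slide's formula avoids.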

88 Second-order AIC [Appendix]
$\mathrm{AIC}_c := \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} = -2\,\ell(\hat{\theta} \mid x) + 2k\,\frac{n}{n-k-1}$
with
- n: size of the sample x
- k: number of free parameters in the most complex model

89 454 Read Length Distribution [Appendix]
Recurrence: f s Sq f s (q, l) = q q (q, l 1)T q q Q f h(q,s) (q, l) if q = g(s) otherwise
- g(s) maps the number of a nucleotide flow to the corresponding nucleotide within d
- h(q, s) gives the last preceding flow corresponding to nucleotide q
