Probabilistic Arithmetic Automata


Probabilistic Arithmetic Automata: Applications of a Stochastic Computational Framework in Biological Sequence Analysis
Inke Herms, PhD thesis defense

Overview
1. Probabilistic Arithmetic Automata
2. Application I: Protein Identification by PMF
3. Modeling Protein Sequences
4. Application II: Seed Sensitivity
5. Application III: 454 Sequencing Read Statistics

Probabilistic Arithmetic Automaton (PAA): Components
1. Markov chain: a random walk describing the order of operands
2. Emissions: in each state, a weight is emitted according to the respective distribution
3. Arithmetic: operations are performed on the emissions

Computational Framework (Probabilistic Arithmetic Automata)
Probability distribution of the resulting value after t steps:
\[ P(V_t = v) = \sum_{\text{states } q} P(\text{step } t,\ \text{state} = q,\ \text{value} = v) \]
Runtime: $O(t \cdot (\#\text{states})^2 \cdot \#\text{emissions} \cdot \#\text{values})$
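
The sum over states can be evaluated with a simple dynamic program over the joint state-value distribution. The following is a minimal sketch, not the thesis implementation: the function name `paa_value_distribution`, the dictionary-based representation, and the use of one shared operation `op` for all states are assumptions made for illustration.

```python
from collections import defaultdict

def paa_value_distribution(start, trans, emit, op, v0, steps):
    """Distribution of the accumulated value after `steps` steps.

    start: {state: prob}            distribution of the first state (assumption)
    trans: {state: {state: prob}}   Markov chain transitions
    emit:  {state: {weight: prob}}  emission distribution of each state
    op:    function (value, weight) -> value, applied when a state is entered
    v0:    initial value
    """
    # f[(q, v)] = P(state = q, value = v) after the current number of steps
    f = defaultdict(float)
    for q, p in start.items():
        for w, pe in emit[q].items():
            f[(q, op(v0, w))] += p * pe
    for _ in range(steps - 1):
        g = defaultdict(float)
        for (q, v), p in f.items():
            for q2, pt in trans[q].items():
                for w, pe in emit[q2].items():
                    g[(q2, op(v, w))] += p * pt * pe
        f = g
    # Marginalize over the states to obtain P(V_t = v)
    dist = defaultdict(float)
    for (_, v), p in f.items():
        dist[v] += p
    return dict(dist)

# Toy example (hypothetical): count heads in 3 fair coin flips
states = ("H", "T")
start = {"H": 0.5, "T": 0.5}
trans = {q: {"H": 0.5, "T": 0.5} for q in states}
emit = {"H": {1: 1.0}, "T": {0: 1.0}}
dist = paa_value_distribution(start, trans, emit, lambda v, w: v + w, 0, 3)
```

Each step loops over the reachable (state, value) pairs, the successor states, and the emissions, which mirrors the runtime bound stated on the slide.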

Applications Overview (Probabilistic Arithmetic Automata)
- significance of protein identification/classification
- finding over- or underrepresented patterns in biosequences
- significance of transcription factor binding sites
- quality of database filtering criteria
- length distribution of DNA reads produced by high-throughput pyrosequencing
- expected population size under specified evolutionary processes
- etc.

Application I: Peptide Statistics (Protein Identification by PMF)

Peptide Mass Fingerprinting (PMF)

Motivation (Application I: Peptide Statistics)
- score peak matchings with respect to a null model
- higher scores for significant matches
Goal
- build a PAA to measure peptide fragments
- extend the i.i.d. model (H.-M. Kaltenbach, 2007) to a Markov model for peptides
- incorporate incomplete cleavage and post-translational modifications

In silico Digestion (Application I: Peptide Statistics)
- proteolytic cleavage by a site-specific protease
- cleavage characters Γ, prohibition characters Π
- widely used: trypsin cleaves after Γ = {K, R}, unless followed by Π = {P}
- cleavage patterns: a character from Γ that is not followed by a character from Π
- distinguish first and following fragments (following fragments do not start with P)
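
The cleavage rule can be illustrated on a concrete string. This is a small sketch assuming complete cleavage; the function name `digest` and the example sequence are hypothetical and not taken from the slides.

```python
def digest(protein, cleavage=frozenset("KR"), prohibition=frozenset("P")):
    """Split `protein` after every cleavage character that is not
    directly followed by a prohibition character (complete cleavage)."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if aa in cleavage and not is_last and protein[i + 1] not in prohibition:
            fragments.append(protein[start:i + 1])
            start = i + 1
    # The remainder after the last cleavage site is the final fragment
    fragments.append(protein[start:])
    return fragments

# Hypothetical example: R followed by P is not cleaved
fragments = digest("MKRPAKGR")
```

With the defaults this models trypsin; other proteases would be expressed by passing different `cleavage` and `prohibition` sets.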

PAA Design (Application I: Peptide Statistics)
- states: amino acids
- transitions: conditional amino acid frequencies
  - outgoing from cleavage characters: to a prohibition character or to the end of the fragment
- weights: amino acid molecular masses
- operations: cumulate character weights

PAA Recurrence (Application I: Peptide Statistics)
Joint state-value distribution: $f_t(q, m) = P(\text{step } t,\ \text{state} = q,\ \text{mass} = m)$
Recurrence:
\[ f_t(q, m) = \sum_{q' \in Q} \sum_{m' \in MS(q)} f_{t-1}(q', m - m')\, T_{q'q}\, e_q(m') \]

Fragment Statistics (Application I: Peptide Statistics)
Joint length-mass distribution:
\[ P(\text{length} = k,\ \text{mass} = m) = P(\text{step } k + 1,\ \text{state} = \zeta,\ \text{mass} = m) \]
Marginalization:
1. length distribution P(length = k)
2. mass distribution P(mass = m)
Mass occurrence probability: P(fragmentation yields at least one fragment of mass m)
- significance of mass spectra alignment scores (H.-M. Kaltenbach, 2007)

Modified PAA (Application I: Peptide Statistics)
Peptide statistics are influenced by:
1. Incomplete cleavage
   - due to: inadequate conditions for the protease; self-digestion of the protease
   - inclusion into PAA: change transitions
     - reduce the probability of moving from a cleavage character to the end state
     - redistribute this probability to non-prohibition characters
   - future extension: take amino acid propensities into account
2. Post-translational modifications (PTMs)
   - variants: addition of functional groups; structural changes
   - impact: modify the function of a protein; transform a precursor molecule into the active protein
   - inclusion into PAA: change weights
     - phosphorylation: +80 Da
     - methylation: +14 Da or +28 Da
     - augment weight distributions
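
Augmenting a weight distribution for a PTM amounts to splitting each mass into an unmodified and a shifted copy. The sketch below assumes integer masses and a single modification per residue; the function name `augment_with_ptm` and the 10% modification probability are hypothetical illustration values, not results from the thesis.

```python
def augment_with_ptm(emission, shift, prob):
    """Return a new weight distribution in which each weight is shifted by
    `shift` (the PTM mass in Da) with probability `prob`."""
    out = {}
    for mass, p in emission.items():
        out[mass] = out.get(mass, 0.0) + p * (1.0 - prob)
        out[mass + shift] = out.get(mass + shift, 0.0) + p * prob
    return out

# Serine residue with integer mass 87 Da; assumed (hypothetical) 10%
# phosphorylation rate, adding +80 Da
serine = augment_with_ptm({87: 1.0}, shift=80, prob=0.1)
```

Because only the emission distributions change, the Markov chain and the recurrence stay as they are, which matches the "change weights" item above.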

Results: Comparison of peptide models (Application I: Peptide Statistics)
[Figure: length distribution (probability vs. fragment length) and length-mass distribution (probability vs. mass m in Dalton) for the first fragment and for following fragments; curves: i.i.d., Markov, empirical]

Results: Incomplete Cleavage and PTMs (Application I: Peptide Statistics)
[Figure, missed cleavages: probability vs. fragment length and probability vs. mass m in Dalton; curves: complete cleavage, missed cleavages]
[Figure, post-translational modifications: mass occurrence probability vs. mass m in Dalton; curves: Markov, additional PTMs]

Summary (Application I: Peptide Statistics)
Contributions
- PAA to measure proteolytic fragments
- different models for peptides
- single molecular mass or isotopic distribution
- distribution of fragment length, mass, and mass occurrence probabilities for i.i.d. and Markov peptides
  - the Markov model generates only slightly different results than the i.i.d. model; the difference is more apparent for the first fragment
- inclusion of incomplete cleavage and PTMs
  - PTMs introduce additional fragment masses
  - missed cleavages induce the strongest effect on peptide statistics
- use for mass spectra alignment
Directions for future research
- extend the model of incomplete cleavage
- generalization to the tandem MS context

Modeling Protein Sequences: SSE Protein Model

Motivation (SSE Protein Model)
Why a new null model for protein sequences? To improve
- the identification of unknown proteins
- the classification of protein domains or protein families
- the prediction of secondary structure for a sequence of amino acids
all of which rely on a reasonable null model.
Goal
- string model for protein sequences
- include information from secondary structure annotation

Secondary Structure Information (SSE Protein Model)
8 secondary structure elements:
- B, E: β-bridge and β-strand
- S, T: region with high curvature
- G, H, I: different kinds of helices
- C: others (random coil)
PDB sequence with annotation, split into segments:
  sequence    MQHVS APVFVF ECTR LAY VQHK
  annotation  CCCCC HHHHHH CCCC HHH EEEE
4 secondary structure classes:
- B, E: sheet
- S, T: turn
- G, H, I: helix
- C: coil
Investigate:
- conditional amino acid frequencies within and across elements
- length of amino acid chains

The SSE Model (SSE Protein Model)
Secondary Structure Element Based Protein String Model

Evaluation: Log likelihood (SSE Protein Model)
- compare the SSE model with the i.i.d. model and the Markov models M1, M2
- ln(maximum likelihood) = ln P(data | model)
- more involved computation for SSE: $O(n^2)$ for a sequence of length n
[Table: number of free model parameters for i.i.d., M1, M2, SSE; values not recoverable]
[Figure: average log likelihood per model (i.i.d., M1, M2, SSE)]

Evaluation: AIC (SSE Protein Model)
- penalized model selection: balance goodness of fit and model complexity
\[ \mathrm{AIC}(M) := -2 \ln(\text{maximum likelihood}) + 2 \cdot \#\text{free model parameters} \]
- a good model minimizes AIC
- compare models by AIC difference: $\mathrm{AIC}(M_i) - \min_j \mathrm{AIC}(M_j)$
- the best model has AIC difference 0
- account for sample size: second-order AIC
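
As a worked illustration of the AIC difference, the following sketch ranks models by their distance to the best AIC. The model names and numbers are hypothetical placeholders, not values from the thesis, and the helper names `aic` and `aic_differences` are assumptions.

```python
def aic(log_likelihood, n_params):
    """AIC = -2 ln(maximum likelihood) + 2 * (number of free parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def aic_differences(models):
    """models: {name: (maximum log likelihood, number of free parameters)}.
    Returns {name: AIC(M) - min_j AIC(M_j)}; the best model gets 0."""
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    best = min(scores.values())
    return {name: s - best for name, s in scores.items()}

# Hypothetical numbers, for illustration only
models = {"A": (-1000.0, 19), "B": (-990.0, 399)}
diffs = aic_differences(models)
```

In this toy case model B fits slightly better but pays for its many parameters, so model A has AIC difference 0.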

Results (SSE Protein Model)
Ranking according to second-order AIC:
  rank  model  AIC difference
  1     SSE    0
  (ranks 2 to 4, in source order: a Markov model, i.i.d., a Markov model; model indices and AIC differences not recoverable)
SSE performs best!

Summary (SSE Protein Model)
Contributions
- first generative random string model for proteins including SSEs
- $O(n^2)$ method to compute the log likelihood of a given sequence of length n
- the SSE model outperforms i.i.d., M1, and M2 according to AIC
- article submitted to RECOMB 2010
Possible extensions
- more elaborate models within individual elements
- hybrid model to account for bad annotation of initial amino acids
- variants of the SSE model that focus on families of proteins
  - classification of new proteins

Application II: Seed Sensitivity

Seeded Alignment (Application II: Seed Sensitivity)
- database search for putatively homologous sequences
- heuristic tools use filtration:
  1. select candidate sequences that match a short seed
  2. investigate candidates by exact local alignment methods
- quality crucially depends on the seed:
  - longer seed: lower sensitivity
  - shorter seed: lower specificity
Goal
- PAA to compute seed sensitivity = fraction of sought sequences selected by filtration
- unified approach for gapped and ungapped alignments

Prerequisites (Application II: Seed Sensitivity)
1. Homology model: model alignments with a certain degree of similarity
   - ungapped alignments: i.i.d. string over Σ = {0, 1} with match probability p
   - gapped alignments: Markovian string over Σ = {0, 1, 2, 3} with initial distribution $p^0$ and transition matrix
     \[ P = \begin{pmatrix} p_0 & p_1 & p_g & p_g \\ p_0 & p_1 & p_g & p_g \\ p_0 & p_1 & p_g & 0 \\ p_0 & p_1 & 0 & p_g \end{pmatrix} \]
     (rows and columns indexed by 0, 1, 2, 3; any accents on the entries of rows 2 and 3 were lost in extraction)
2. Seed model: filtration criterion
   - consecutive seed: require contiguous perfect matches (example not recoverable)
   - spaced seed: specify discontiguous matching positions, e.g. π = 1*11*1 with * ∈ [0, 1]
   - indel seed: allow for indels of variable size, e.g. π = 1*1??1 with * ∈ [0, 1], ? ∈ [ε, 0, 1, 2, 3]
3. Seed sensitivity: P(seed hits random alignment)
   - hit count: count occurrences of seed patterns in a random alignment
   - define $V_t(\pi)$: number of matches of seed π
   - wanted: (1) sensitivity $P(V_t(\pi) \ge 1)$; (2) hit distribution $P(V_t(\pi) = k)$, $k = 0, \dots, K$

PAA Design (Application II: Seed Sensitivity)
[Figure: automaton for the example seed π = 1*1]
- states: prefixes of seed patterns
- transitions: alignment column dependencies given by the homology model
- weights: number of seed patterns ending in the state
- operations: cumulate seed hits

Results: Sensitivity and Hit Distribution (Application II: Seed Sensitivity)
Distribution of seed hits:
\[ P(V_t = k) = \sum_{q \in Q} P(\text{step } t,\ \text{state} = q,\ \text{hits} = k) \]
Recurrence:
\[ f_t(q, k) = \sum_{q' \in Q} f_{t-1}(q', k - C(q))\, T_{q'q} \]
where $C(q)$ is the number of patterns ending in q.
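
The hit-count recurrence can be tried out on the simplest setting: ungapped alignments under the i.i.d. homology model. The sketch below is a simplification and not the construction from the slides: instead of a prefix automaton, it uses the last span-1 alignment columns as the state, so every state has C(q) in {0, 1}. The function name, the `*` wildcard notation, and the capping of counts at `max_hits` are assumptions.

```python
def seed_hit_distribution(seed, p, t, max_hits):
    """P(V_t = k) for k = 0..max_hits; the last entry collects >= max_hits.

    seed: string over '1' (match required) and '*' (don't care)
    p:    match probability of the i.i.d. homology model over {0, 1}
    t:    alignment length
    Overlapping hits are counted.
    """
    span = len(seed)

    def hit(window):
        return all(s == "*" or c == "1" for s, c in zip(seed, window))

    # f[(context, k)]: context = last (span - 1) columns, k = hits so far
    f = {("", 0): 1.0}
    for _ in range(t):
        g = {}
        for (ctx, k), prob in f.items():
            for c, pc in (("1", p), ("0", 1.0 - p)):
                window = ctx + c
                k2 = k
                if len(window) == span and hit(window):
                    k2 = min(k + 1, max_hits)
                new_ctx = window[-(span - 1):] if span > 1 else ""
                g[(new_ctx, k2)] = g.get((new_ctx, k2), 0.0) + prob * pc
        f = g
    dist = [0.0] * (max_hits + 1)
    for (_, k), prob in f.items():
        dist[k] += prob
    return dist

# Sensitivity is the probability of at least one hit
hits = seed_hit_distribution("11", p=0.5, t=3, max_hits=2)
sensitivity = 1.0 - hits[0]
```

The state space of this sketch grows exponentially with the seed span, which is the cost of skipping the prefix automaton; it is meant only to make the recurrence concrete.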

Results: Alternative Criteria (Application II: Seed Sensitivity)
Idea: a seed should
1. maximize the sensitivity to alignments of homologous sequences ($M_{hom}$)
2. maximize the specificity with respect to alignments of random, unrelated sequences ($M_0$)
\[ \max\ P(V_t(M_{hom}) \ge k) \cdot P(V_t(M_0) < k) \quad \text{for } k = 1, \dots, K \]
Result: no further information; sensitivity is supported as a criterion

Summary (Application II: Seed Sensitivity)
Contributions
- PAA to compute seed sensitivity and the entire hit distribution
- different seed and homology models
- unifying definitions for gapped and ungapped alignments
- different occurrence counts (overlapping or non-overlapping)
- definition and evaluation of alternative criteria: sensitivity supported
- work presented at WABI 2008

Application III: 454 Read Statistics

454 Sequencing Technology

Motivation (Application III: 454 Read Statistics)
- massively parallel DNA sequencing
- current GS-FLX: 400,000 reads with 250 nt on average
- current GS-FLX Titanium: 1,000,000 reads with 400 nt on average
- still short reads (Sanger: 800 nt)
- the longer the reads, the better (for genome assembly)
Goal
- PAA to compute the length distribution of 454 reads
- investigate potential improvement of average read length

454 Sequencing (Application III: 454 Read Statistics)
- sequencing by synthesis
- nucleotides are flowed periodically according to a specified order
- incorporation is accompanied by a flash of light
- the signal is proportional to the number of synthesized nucleotides (up to 8 nt)
Dispensation order d = d[1] ... d[l] with d[i] ∈ Σ = {A, C, G, T}, where
1. every character from Σ has to occur in d
2. no character is flowed twice in a row: d[i] = σ ⇒ d[i + 1] ≠ σ
3. d[1] ≠ d[l]
GS-FLX standard dispensation order: TACG; 100 cycles = 400 nucleotide flows
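
The three conditions on a dispensation order translate directly into a small validity check. This is an illustrative sketch; the function name `is_valid_dispensation_order` is an assumption.

```python
def is_valid_dispensation_order(d, alphabet="ACGT"):
    """Check the three conditions on a dispensation order d."""
    # 1. every character of the alphabet occurs in d (and nothing else does)
    if not d or set(d) != set(alphabet):
        return False
    # 2. no character is flowed twice in a row
    if any(a == b for a, b in zip(d, d[1:])):
        return False
    # 3. first and last character differ, so that repeating d cyclically
    #    does not flow the same nucleotide twice in a row either
    return d[0] != d[-1]
```

For example, the standard order TACG passes, while TACGT fails the third condition because the next cycle would start with the nucleotide that was just flowed.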

Influence of d on Read Length (Application III: 454 Read Statistics)
[Figure: nucleotide flows per cycle for the dispensation orders TACG and GTCA on an example sequence; resulting read lengths 6 and 10, respectively]

PAA Design (Application III: 454 Read Statistics)
- states: dispensation order d
- transitions: conditional nucleotide frequencies
- weights: Dirac measures (emission 1 in all nucleotide states)
- operations: cumulate the number of inserted nucleotides

Proof of Concept (Application III: 454 Read Statistics)
[Figure: frequency vs. read length; panels: theoretical vs. empirical, simulated vs. empirical]
- comparison of computed read lengths to empirical data
- simulated reads, processed according to the TrimBack filter
- PAA yields reasonable results

The Optimal Dispensation Order (Application III: 454 Read Statistics)
Observation for sample reads
- dispensation orders of length 4 perform best
- compared expected read length under all orders of length 4
- CGAT provides on average 10% longer reads than TACG (282 vs. 257)
- the optimal dispensation order maximizes the sum of dinucleotide frequencies
For a new genome:
1. estimate conditional nucleotide frequencies (from preceding sequencing runs)
2. design the PAA and compute the expected read length for orders of length 4
3. return the best-performing order
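
The effect of the dispensation order can be explored empirically by replaying the flows on given sequences. The sketch below is a stand-in for step 2 above, not the PAA computation: it averages simulated read lengths over sample sequences instead of computing the expectation under a nucleotide model. The function names are assumptions, the sequences are taken to be the reads as reported, and the 8 nt signal limit is ignored.

```python
from itertools import permutations

def read_length(template, order, n_flows):
    """Number of nucleotides of `template` synthesized within n_flows flows.
    Each flow incorporates the whole homopolymer run matching the flowed
    nucleotide."""
    pos = 0
    for flow in range(n_flows):
        nt = order[flow % len(order)]
        while pos < len(template) and template[pos] == nt:
            pos += 1
        if pos == len(template):
            break
    return pos

def rank_orders(templates, n_flows):
    """Mean read length for each of the 24 orders of length 4, best first."""
    scores = {}
    for perm in permutations("ACGT"):
        order = "".join(perm)
        total = sum(read_length(s, order, n_flows) for s in templates)
        scores[order] = total / len(templates)
    return sorted(scores.items(), key=lambda item: -item[1])
```

Calling `rank_orders(sample_reads, 400)` on a list of sequences (here a hypothetical `sample_reads`) returns the orders sorted by mean simulated read length, so the first entry plays the role of the best-performing order in step 3.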

Summary (Application III: 454 Read Statistics)
Contributions
- PAA to compute the length distribution of 454 sequence reads
  - for dispensation orders of different lengths
  - different models for nucleotide sequences
- the choice of dispensation order can have a noticeable effect on the average read length
- propose an optimal dispensation order for a new genome given preliminary reads
Future prospects
- theoretical analysis of dispensation orders
- experimental validation

Publications
- I. Herms and S. Rahmann. Modeling Protein Sequences Including Information from Secondary Structure. Submitted to RECOMB 2010.
- S. Wolfsheimer, I. Herms, S. Rahmann, and A.K. Hartmann. Accurate Statistics for Local Sequence Alignment with Position-Dependent Scoring by Rare-Event Sampling. Submitted to BMC Bioinformatics, 2009.
- I. Herms and S. Rahmann. Computing Alignment Seed Sensitivity with Probabilistic Arithmetic Automata. WABI 2008, LNBI 5251.
- E. Baake and I. Herms. Single-Crossover Dynamics: Finite versus Infinite Populations. Bulletin of Mathematical Biology 70 (2008).

Conclusion
PAA framework
- three applications of the PAA framework to sequence statistics
- general, unifying framework
- elegant formulation and simplification of solutions to different tasks
- straightforward computations (re-use of algorithms from automata and Markov chain theory)
- extensions to formerly investigated questions
SSE model
- biologically motivated null model for protein sequences
- outperforms customary string models

Analysis of Secondary Structure Segments (Appendix)
[Figure: relative entropy between amino acid compositions within individually annotated segments]

Secondary Structure Classification (Appendix)
[Figure: amino acid compositions (amino acids A to Y) for the four classes coil C, helix H, sheet S, and turn T, after collapsing the structure annotation to 4 classes]

Model Log Likelihoods (Appendix)
For a sequence s of length |s| = n:
\[ \ln P(s \mid \text{i.i.d.}) = \sum_{i=1}^{|s|} \ln P(S_i = s[i]) = \sum_{i=1}^{|s|} \ln p_{s[i]} \]
\[ \ln P(s \mid \text{M1}) = \ln p^0_{s[1]} + \sum_{i=1}^{|s|-1} \ln P_{s[i]s[i+1]} \]
\[ \ln P(s \mid \text{M2}) = \ln p^0_{s[1]} + \ln P_{s[1]s[2]} + \sum_{i=1}^{|s|-2} \ln P(S_{i+2} = s[i+2] \mid S_i = s[i],\ S_{i+1} = s[i+1]) \]

Log Likelihood of SSE Model (Appendix)
Wanted: $\ln P(S = s \mid \text{SSE})$ with |s| = n, where
\[ P(S = s) = \sum_{v \in Q} P(S_1, \dots, S_n = s[1, n],\ Y_n = v) \]
Variant of the forward algorithm:
\[ f_k(v) := P(S_1, \dots, S_k = s[1, k],\ Y_k = v,\ Y_{k+1} \ne v) \quad \text{for } 1 \le k < n \]
\[ f_n(v) := P(S_1, \dots, S_n = s[1, n],\ Y_n = v) \]
\[ P(S = s) = \sum_{v \in Q} f_n(v) \]

Recurrence, for $1 \le k < n$:
\[ f_k(v) = \sum_{u \in Q} \sum_{l=1}^{k-1} f_l(u)\, T^{uv}_{s[l]s[l+1]}\, p^v(s[l+1, k])\, P(R(v) = k - l) + P_{\text{start},v}\, \pi^v_{s[1]}\, p^v(s[1, k])\, P(R(v) = k) \]
and for the last position:
\[ f_n(v) = \sum_{u \in Q} \sum_{l=1}^{n-1} f_l(u)\, T^{uv}_{s[l]s[l+1]}\, p^v(s[l+1, n])\, P(R(v) \ge n - l) + P_{\text{start},v}\, \pi^v_{s[1]}\, p^v(s[1, n])\, P(R(v) \ge n) \]
Output: $f(s) = \sum_{v \in Q} f_n(v)$

Log Likelihood of SSE Model (Appendix)
For the log likelihood, logarithms of sums of small values have to be computed. Use
\[ \ln(x + y) = \ln\bigl(x (1 + y/x)\bigr) = \ln x + \ln\bigl(1 + \exp(\ln y - \ln x)\bigr) \]
which gives
\[ \ln f_k(v) = \ln F^0_k(v) + \operatorname{log1p}\Bigl( \sum_{l=1}^{k-1} \exp\bigl( \ln F^l_k(v) - \ln F^0_k(v) \bigr) \Bigr) \]
where $F^l_k(v)$ is the probability of a path that generates the length-k prefix, with the last k - l residues produced in element v.
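
The identity above keeps all intermediate quantities in log space. A minimal sketch of it, with the hypothetical helper name `log_sum`, could look as follows; it assumes the first term is finite and not much smaller than the others.

```python
import math

def log_sum(log_terms):
    """ln(sum_i exp(log_terms[i])), computed relative to the first term:
    ln x_0 + log1p(sum_{i > 0} exp(ln x_i - ln x_0))."""
    first, rest = log_terms[0], log_terms[1:]
    return first + math.log1p(sum(math.exp(t - first) for t in rest))

# Probabilities far below the smallest positive float still add up correctly
tiny = log_sum([-1000.0, -1001.0])
```

If the first term can be much smaller than the rest, factoring out the largest term instead avoids overflow in `exp`; that variant is the usual log-sum-exp trick and is a swapped-in refinement, not something stated on the slide.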

Second-order AIC (Appendix)
\[ \mathrm{AIC}_c := \mathrm{AIC} + \frac{2k(k + 1)}{n - k - 1} = -2\, l(\hat{\theta} \mid x) + 2k\, \frac{n}{n - k - 1} \]
with
- n: size of sample x
- k: number of free parameters in the most complex model
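
The two forms of the second-order AIC agree because 2k + 2k(k+1)/(n-k-1) = 2kn/(n-k-1). A small sketch with hypothetical inputs makes this concrete; the function name `aicc` is an assumption.

```python
def aicc(log_likelihood, k, n):
    """Second-order AIC: -2 l + 2 k n / (n - k - 1), for n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return -2.0 * log_likelihood + 2.0 * k * n / (n - k - 1)

# Hypothetical values: log likelihood -100, k = 2 parameters, n = 10 samples
value = aicc(-100.0, k=2, n=10)
# The same number via AIC plus the correction term
check = (-2.0 * -100.0 + 2.0 * 2) + 2.0 * 2 * (2 + 1) / (10 - 2 - 1)
```

For large n the correction term vanishes and the value approaches the plain AIC.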

454 Read Length Distribution (Appendix)
Recurrence:
\[ f_s(q, l) = \begin{cases} \sum_{q' \in Q} f_{\,\cdot\,}(q', l - 1)\, T_{q'q} & \text{if } q = g(s) \\ f_{h(q,s)}(q, l) & \text{otherwise} \end{cases} \]
(the flow index of f in the first case is garbled in the source and left open here)
- g(s) maps the number s of the nucleotide flow to the corresponding nucleotide within d
- h(q, s) gives the last preceding flow corresponding to nucleotide q


More information

Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were

Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were developed to allow the analysis of large intact (bigger than

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

Local Alignment: Smith-Waterman algorithm

Local Alignment: Smith-Waterman algorithm Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Stephen Scott.

Stephen Scott. 1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

Data Mining in Bioinformatics HMM

Data Mining in Bioinformatics HMM Data Mining in Bioinformatics HMM Microarray Problem: Major Objective n Major Objective: Discover a comprehensive theory of life s organization at the molecular level 2 1 Data Mining in Bioinformatics

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION AND CALIBRATION Calculation of turn and beta intrinsic propensities. A statistical analysis of a protein structure

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

Protein Secondary Structure Prediction

Protein Secondary Structure Prediction Protein Secondary Structure Prediction Doug Brutlag & Scott C. Schmidler Overview Goals and problem definition Existing approaches Classic methods Recent successful approaches Evaluating prediction algorithms

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1 Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2 An accessible

More information

Proteins: Structure & Function. Ulf Leser

Proteins: Structure & Function. Ulf Leser Proteins: Structure & Function Ulf Leser This Lecture Proteins Structure Function Databases Predicting Protein Secondary Structure Many figures from Zvelebil, M. and Baum, J. O. (2008). "Understanding

More information

A Statistical Model of Proteolytic Digestion

A Statistical Model of Proteolytic Digestion A Statistical Model of Proteolytic Digestion I-Jeng Wang, Christopher P. Diehl Research and Technology Development Center Johns Hopkins University Applied Physics Laboratory Laurel, MD 20723 6099 Email:

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1 Protein Structures Sequences of amino acid residues 20 different amino acids Primary Secondary Tertiary Quaternary 10/8/2002 Lecture 12 1 Angles φ and ψ in the polypeptide chain 10/8/2002 Lecture 12 2

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra Xiaowen Liu Department of BioHealth Informatics, Department of Computer and Information Sciences, Indiana University-Purdue

More information

bioinformatics 1 -- lecture 7

bioinformatics 1 -- lecture 7 bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment

More information

MS-MS Analysis Programs

MS-MS Analysis Programs MS-MS Analysis Programs Basic Process Genome - Gives AA sequences of proteins Use this to predict spectra Compare data to prediction Determine degree of correctness Make assignment Did we see the protein?

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

What is the central dogma of biology?

What is the central dogma of biology? Bellringer What is the central dogma of biology? A. RNA DNA Protein B. DNA Protein Gene C. DNA Gene RNA D. DNA RNA Protein Review of DNA processes Replication (7.1) Transcription(7.2) Translation(7.3)

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction CMPS 6630: Introduction to Computational Biology and Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/14/07 CAP5510 1 CpG Islands Regions in DNA sequences with increased

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

More information

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009 114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009 9 Protein tertiary structure Sources for this chapter, which are all recommended reading: D.W. Mount. Bioinformatics: Sequences and Genome

More information

Hidden Markov Models. Three classic HMM problems

Hidden Markov Models. Three classic HMM problems An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

Seed-based sequence search: some theory and some applications

Seed-based sequence search: some theory and some applications Seed-based sequence search: some theory and some applications Gregory Kucherov CNRS/LIGM, Marne-la-Vallée joint work with Laurent Noé (LIFL LIlle) Journées GDR IM, Lyon, January -, 3 Filtration for approximate

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment Molecular Modeling 2018-- Lecture 7 Homology modeling insertions/deletions manual realignment Homology modeling also called comparative modeling Sequences that have similar sequence have similar structure.

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction CMPS 3110: Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the laws of physics! Conformation space is finite

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University Protein Quantitation II: Multiple Reaction Monitoring Kelly Ruggles kelly@fenyolab.org New York University Traditional Affinity-based proteomics Use antibodies to quantify proteins Western Blot RPPA Immunohistochemistry

More information

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 1 Sequence Motifs what is a sequence motif? a sequence pattern of biological significance typically

More information

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs Bioinformatics 1--lectures 15, 16 Markov chains Hidden Markov models Profile HMMs target sequence database input to database search results are sequence family pseudocounts or background-weighted pseudocounts

More information

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98) Hidden Markov Models Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98) 1 The occasionally dishonest casino A P A (1) = P A (2) = = 1/6 P A->B = P B->A = 1/10 B P B (1)=0.1... P

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Protein Secondary Structure Prediction

Protein Secondary Structure Prediction part of Bioinformatik von RNA- und Proteinstrukturen Computational EvoDevo University Leipzig Leipzig, SS 2011 the goal is the prediction of the secondary structure conformation which is local each amino

More information

Modeling Mass Spectrometry-Based Protein Analysis

Modeling Mass Spectrometry-Based Protein Analysis Chapter 8 Jan Eriksson and David Fenyö Abstract The success of mass spectrometry based proteomics depends on efficient methods for data analysis. These methods require a detailed understanding of the information

More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Tutorial 1: Setting up your Skyline document

Tutorial 1: Setting up your Skyline document Tutorial 1: Setting up your Skyline document Caution! For using Skyline the number formats of your computer have to be set to English (United States). Open the Control Panel Clock, Language, and Region

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA RNA & PROTEIN SYNTHESIS Making Proteins Using Directions From DNA RNA & Protein Synthesis v Nitrogenous bases in DNA contain information that directs protein synthesis v DNA remains in nucleus v in order

More information

Was T. rex Just a Big Chicken? Computational Proteomics

Was T. rex Just a Big Chicken? Computational Proteomics Was T. rex Just a Big Chicken? Computational Proteomics Phillip Compeau and Pavel Pevzner adjusted by Jovana Kovačević Bioinformatics Algorithms: an Active Learning Approach 215 by Compeau and Pevzner.

More information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding

More information

Orientational degeneracy in the presence of one alignment tensor.

Orientational degeneracy in the presence of one alignment tensor. Orientational degeneracy in the presence of one alignment tensor. Rotation about the x, y and z axes can be performed in the aligned mode of the program to examine the four degenerate orientations of two

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Protein Structure Prediction, Engineering & Design CHEM 430

Protein Structure Prediction, Engineering & Design CHEM 430 Protein Structure Prediction, Engineering & Design CHEM 430 Eero Saarinen The free energy surface of a protein Protein Structure Prediction & Design Full Protein Structure from Sequence - High Alignment

More information

Atomic masses. Atomic masses of elements. Atomic masses of isotopes. Nominal and exact atomic masses. Example: CO, N 2 ja C 2 H 4

Atomic masses. Atomic masses of elements. Atomic masses of isotopes. Nominal and exact atomic masses. Example: CO, N 2 ja C 2 H 4 High-Resolution Mass spectrometry (HR-MS, HRAM-MS) (FT mass spectrometry) MS that enables identifying elemental compositions (empirical formulas) from accurate m/z data 9.05.2017 1 Atomic masses (atomic

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

CSE182-L8. Mass Spectrometry

CSE182-L8. Mass Spectrometry CSE182-L8 Mass Spectrometry Project Notes Implement a few tools for proteomics C1:11/2/04 Answer MS questions to get started, select project partner, select a project. C2:11/15/04 (All but web-team) Plan

More information

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Hemashree Bordoloi and Kandarpa Kumar Sarma Abstract. Protein secondary structure prediction is the method of extracting

More information

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University Protein Quantitation II: Multiple Reaction Monitoring Kelly Ruggles kelly@fenyolab.org New York University Traditional Affinity-based proteomics Use antibodies to quantify proteins Western Blot Immunohistochemistry

More information

Lecture 7 Sequence analysis. Hidden Markov Models

Lecture 7 Sequence analysis. Hidden Markov Models Lecture 7 Sequence analysis. Hidden Markov Models Nicolas Lartillot may 2012 Nicolas Lartillot (Universite de Montréal) BIN6009 may 2012 1 / 60 1 Motivation 2 Examples of Hidden Markov models 3 Hidden

More information

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint SA-REPC - Sequence Alignment with a Regular Expression Path Constraint Nimrod Milo Tamar Pinhas Michal Ziv-Ukelson Ben-Gurion University of the Negev, Be er Sheva, Israel Graduate Seminar, BGU 2010 Milo,

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/15/07 CAP5510 1 EM Algorithm Goal: Find θ, Z that maximize Pr

More information

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

1-D Predictions. Prediction of local features: Secondary structure & surface exposure 1-D Predictions Prediction of local features: Secondary structure & surface exposure 1 Learning Objectives After today s session you should be able to: Explain the meaning and usage of the following local

More information