Probabilistic Arithmetic Automata
Transcription
1 Probabilistic Arithmetic Automata: Applications of a Stochastic Computational Framework in Biological Sequence Analysis. Inke Herms, PhD thesis defense
2 Overview
1. Probabilistic Arithmetic Automata
2. Application I: Protein Identification by PMF
3. Modeling Protein Sequences
4. Application II: Seed Sensitivity
5. Application III: 454 Sequencing Read Statistics
6 Probabilistic Arithmetic Automaton (PAA): Components
1. Markov chain: a random walk describing the order of operands
2. Emissions: in each state, a weight is emitted according to the respective distribution
3. Arithmetic operations are performed on the emissions
14 Computational Framework
- probability distribution of the resulting value after $t$ steps: $P(V_t = v) = \sum_{q} P(\text{step } t,\ \text{state} = q,\ \text{value} = v)$, summing over all states $q$
- runtime: $O\big(t \cdot (\#\text{states})^2 \cdot \#\text{emissions} \cdot \#\text{values}\big)$
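The state-value computation on this slide can be sketched as a small dynamic program. This is only an illustration under assumptions: the function name `paa_value_distribution`, the dictionary-based data layout, and the coin example are hypothetical and not taken from the thesis.

```python
from collections import defaultdict

def paa_value_distribution(start, transitions, emissions, operation, v0, steps):
    """Distribution of the accumulated value after `steps` steps (steps >= 1).

    start:       dict state -> initial probability
    transitions: dict state -> dict next_state -> probability
    emissions:   dict state -> dict weight -> probability
    operation:   function (value, weight) -> new value
    v0:          value before the first emission
    """
    # f[(q, v)] = P(state = q, value = v) at the current step
    f = defaultdict(float)
    for q, p in start.items():
        for w, e in emissions[q].items():
            f[(q, operation(v0, w))] += p * e
    for _ in range(steps - 1):
        g = defaultdict(float)
        for (q, v), p in f.items():
            for q2, t in transitions[q].items():
                for w, e in emissions[q2].items():
                    g[(q2, operation(v, w))] += p * t * e
        f = g
    # Marginalize over the states to obtain P(V_t = v).
    dist = defaultdict(float)
    for (_, v), p in f.items():
        dist[v] += p
    return dict(dist)

# Toy example (hypothetical): a fair coin, counting heads in three tosses.
half = {"H": 0.5, "T": 0.5}
coin_dist = paa_value_distribution(
    start=half,
    transitions={"H": half, "T": half},
    emissions={"H": {1: 1.0}, "T": {0: 1.0}},
    operation=lambda value, weight: value + weight,
    v0=0,
    steps=3,
)
```

The nested loops mirror the runtime bound stated on the slide: each step visits every (state, value) pair, every successor state, and every emission.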
21 Applications Overview
- significance of protein identification/classification
- finding over- or underrepresented patterns in biosequences
- significance of transcription factor binding sites
- quality of database filtering criteria
- length distribution of DNA reads produced by high-throughput pyrosequencing
- expected population size under specified evolutionary processes
- etc.
24 Application I: Peptide Statistics. Peptide Mass Fingerprinting (PMF)
26 Motivation (Application I: Peptide Statistics)
- score peak matchings with respect to a null model
- higher scores for significant matches
Goal
- build a PAA to measure peptide fragments
- extend the i.i.d. model (H.-M. Kaltenbach, 2007) to a Markov model for peptides
- incorporate incomplete cleavage and post-translational modifications
27 In silico Digestion (Application I: Peptide Statistics)
- proteolytic cleavage by a site-specific protease
- cleavage characters $\Gamma$, prohibition characters $\Pi$
- widely used: trypsin cleaves after $\Gamma = \{K, R\}$, unless followed by $\Pi = \{P\}$
- cleavage patterns Γ Π
- distinguish first and following fragments (following fragments do not start with P)
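The cleavage rule above can be illustrated with a short sketch. The function name `digest` and the example sequence are hypothetical; the rule itself (cleave after K or R unless the next residue is P) is the one stated on the slide, and complete cleavage is assumed.

```python
def digest(protein, cleave_after="KR", prohibit="P"):
    """Split `protein` into fragments, assuming complete cleavage.

    A cut is placed after every cleavage character unless the next
    residue is a prohibition character. The C-terminal residue never
    triggers a cut, so the last fragment is simply the remainder.
    """
    fragments, start = [], 0
    for i in range(len(protein) - 1):
        if protein[i] in cleave_after and protein[i + 1] not in prohibit:
            fragments.append(protein[start:i + 1])
            start = i + 1
    fragments.append(protein[start:])
    return fragments

# Hypothetical example: the R at index 2 is followed by P, so no cut there.
example_fragments = digest("AKRPGKLR")
```

Note that only the first fragment can begin with P here, which is why the slide distinguishes first and following fragments.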
28 PAA Design (Application I: Peptide Statistics)
- states: amino acids
- transitions: conditional amino acid frequencies; outgoing from cleavage characters: to a prohibition character or to the end of the fragment
- weights: amino acid molecular masses
- operations: cumulate character weights
29 PAA Recurrence (Application I: Peptide Statistics)
- joint state-value distribution: $f_t(q, m) = P(\text{step } t,\ \text{state} = q,\ \text{mass} = m)$
Recurrence
$$f_t(q, m) = \sum_{q' \in Q} \ \sum_{m' \in MS(q)} f_{t-1}(q', m - m')\, T_{q'q}\, e_q(m')$$
31 Fragment Statistics (Application I: Peptide Statistics)
- joint length-mass distribution: $P(\text{length} = k,\ \text{mass} = m) = P(\text{step } k + 1,\ \text{state} = \zeta,\ \text{mass} = m)$
- marginalization:
  1. length distribution $P(\text{length} = k)$
  2. mass distribution $P(\text{mass} = m)$
- mass occurrence probability: $P(\text{fragmentation yields at least one fragment of mass } m)$
- significance of mass spectra alignment scores (H.-M. Kaltenbach, 2007)
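The marginalization step can be written down directly once the joint length-mass distribution is available. The sketch below assumes the joint distribution is stored as a dictionary keyed by (length, mass); the helper name `marginals` and the toy numbers are hypothetical.

```python
from collections import defaultdict

def marginals(joint):
    """joint: dict (length k, mass m) -> probability.

    Returns the length distribution and the mass distribution,
    each obtained by summing out the other coordinate.
    """
    length_dist = defaultdict(float)
    mass_dist = defaultdict(float)
    for (k, m), p in joint.items():
        length_dist[k] += p
        mass_dist[m] += p
    return dict(length_dist), dict(mass_dist)

# Hypothetical toy joint distribution over (length, integer mass in Da).
joint = {(1, 128): 0.2, (1, 156): 0.1, (2, 199): 0.3, (2, 227): 0.4}
length_dist, mass_dist = marginals(joint)
```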
32-37 Modified PAA (Application I: Peptide Statistics)
Peptide statistics are influenced by:
1. Incomplete cleavage
   - due to: inadequate conditions for the protease; self-digestion of the protease
   - inclusion into PAA: change transitions
     - reduce the probability to transit from a cleavage character to the end state
     - redistribute the probability to non-prohibition characters
   - future extension: take amino acid propensities into account
2. Post-translational modifications (PTMs)
   - variants: addition of functional groups; structural changes
   - impact: modify the function of a protein; transform a precursor molecule into an active protein
   - inclusion into PAA: change weights
     - phosphorylation: +80 Da
     - methylation: +14 Da or +28 Da
     - augment weight distributions
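Augmenting a weight distribution with a PTM can be sketched as mixing the unmodified mass with a shifted copy. The helper name `add_ptm`, the modification probability of 0.1, and the use of integer masses (serine residue taken as 87 Da) are assumptions made for illustration only.

```python
from collections import defaultdict

def add_ptm(emission, shift, prob):
    """Return an augmented emission distribution.

    emission: dict mass -> probability for one amino acid state
    shift:    mass shift of the modification in Da (e.g. +80 for phosphorylation)
    prob:     assumed probability that the residue carries the modification
    """
    out = defaultdict(float)
    for mass, p in emission.items():
        out[mass] += p * (1.0 - prob)   # unmodified residue
        out[mass + shift] += p * prob   # modified residue
    return dict(out)

# Hypothetical example: serine with a 10% chance of being phosphorylated.
serine = add_ptm({87: 1.0}, shift=80, prob=0.1)
```

Because the result is again a dictionary of masses and probabilities, the function can be applied repeatedly, for example to add both methylation shifts (+14 Da and +28 Da) to the same state.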
38 Results: Comparison of peptide models (Application I: Peptide Statistics)
[Figure: length distribution (probability vs. fragment length) and length-mass distribution (probability vs. mass m in Dalton), shown for the first fragment and for following fragments; curves: i.i.d., Markov, empirical.]
39 Results: Incomplete Cleavage and PTMs (Application I: Peptide Statistics)
[Figure, missed cleavages: probability vs. fragment length and probability vs. mass m in Dalton; curves: complete cleavage, missed cleavages.]
[Figure, post-translational modifications: mass occurrence probability vs. mass m in Dalton; curves: Markov, additional PTMs.]
40 Summary (Application I: Peptide Statistics)
Contributions
- PAA to measure proteolytic fragments
- different models for peptides
- single molecular mass or isotopic distribution
- distribution of fragment length, mass, and mass occurrence probabilities for i.i.d. and Markov peptides
  - the Markov model generates only slightly different results than the i.i.d. model; the difference is more apparent for the first fragment
- inclusion of incomplete cleavage and PTMs
  - PTMs introduce additional fragment masses
  - missed cleavages induce the strongest effect on peptide statistics
- use for mass spectra alignment
41 Summary (Application I: Peptide Statistics)
Directions for future research
- extend the model of incomplete cleavage
- generalization to the tandem MS context
44 Motivation (SSE Protein Model)
Why a new null model for protein sequences? To improve
- the identification of unknown proteins
- the classification of protein domains or protein families
- the prediction of secondary structure for a sequence of amino acids
all of which rely on a reasonable null model.
Goal
- string model for protein sequences
- include information from secondary structure annotation
47 Secondary Structure Information (SSE Protein Model)
8 secondary structure elements:
- B, E: β-bridge and β-strand
- S, T: region with high curvature
- G, H, I: different kinds of helices
- C: others (random coil)
PDB sequence with annotation, split into segments:
sequence   MQHVS APVFVF ECTR LAY VQHK
annotation CCCCC HHHHHH CCCC HHH EEEE
49 Secondary Structure Information (SSE Protein Model)
4 secondary structure classes:
- B, E: sheet
- S, T: turn
- G, H, I: helix
- C: coil
Investigate:
- conditional amino acid frequencies within and across elements
- length of amino acid chains
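Collapsing the eight annotation letters to four classes and splitting a sequence into segments can be sketched as follows. The names `CLASS_OF` and `segments` are hypothetical; the mapping and the example sequence are the ones shown on the slides.

```python
from itertools import groupby

# Mapping of the 8 annotation letters to the 4 classes from the slide.
CLASS_OF = {"B": "sheet", "E": "sheet", "S": "turn", "T": "turn",
            "G": "helix", "H": "helix", "I": "helix", "C": "coil"}

def segments(sequence, annotation):
    """Split `sequence` into maximal runs that share one structure class."""
    if len(sequence) != len(annotation):
        raise ValueError("sequence and annotation must have equal length")
    result, pos = [], 0
    for cls, run in groupby(annotation, key=CLASS_OF.__getitem__):
        n = len(list(run))
        result.append((cls, sequence[pos:pos + n]))
        pos += n
    return result

example_segments = segments("MQHVSAPVFVFECTRLAYVQHK", "CCCCCHHHHHHCCCCHHHEEEE")
# Segment lengths per class, the kind of statistic the slide proposes to study.
segment_lengths = [(cls, len(seg)) for cls, seg in example_segments]
```

Grouping on the collapsed class rather than the raw letter means that adjacent runs such as G followed by H would be merged into one helix segment.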
50 The SSE Model: Secondary Structure Element Based Protein String Model
51 Evaluation: Log likelihood (SSE Protein Model)
- compare the SSE model with i.i.d. and Markov models M1, M2
- [Table: number of free model parameters for i.i.d., M1, M2, SSE; values not preserved in the transcript.]
- $\ln(\text{maximum likelihood}) = \ln P(\text{data} \mid \text{model})$
- more involved computation for SSE: $O(n^2)$ for a sequence of length $n$
- [Figure: average log likelihood per model (i.i.d., M1, M2, SSE).]
52 Evaluation: AIC (SSE Protein Model)
- penalized model selection: balance goodness of fit and model complexity
- $\mathrm{AIC}(M) := -2 \ln(\text{maximum likelihood}) + 2 \cdot \#\text{free model parameters}$
- a good model minimizes AIC
- compare models by AIC difference: $\mathrm{AIC}(M_i) - \min_j \mathrm{AIC}(M_j)$
- the best model has AIC difference 0
- account for sample size: second-order AIC
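As a small worked illustration of the AIC difference, the sketch below ranks two made-up models. The function names and all numbers are hypothetical and are not the values from the thesis.

```python
def aic(log_likelihood, num_params):
    """AIC = -2 ln(maximum likelihood) + 2 * (number of free parameters)."""
    return -2.0 * log_likelihood + 2.0 * num_params

def aic_differences(models):
    """models: dict name -> (maximized log likelihood, number of free parameters).

    Returns dict name -> AIC(M_i) - min_j AIC(M_j); the best model gets 0.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}

# Hypothetical numbers: model B fits slightly better but uses more parameters.
differences = aic_differences({"A": (-100.0, 2), "B": (-98.0, 10)})
```

In this toy case model A scores 204 and model B scores 216, so the better fit of B does not compensate for its eight additional parameters.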
53 Results (SSE Protein Model)
Ranking according to second-order AIC
[Table: rank, model, AIC difference. Rank 1: SSE with AIC difference 0; the remaining rows (i.i.d., M1, M2) are not legible in the transcript.]
SSE performs best!
54 Summary (SSE Protein Model)
Contributions
- first generative random string model for proteins including SSEs
- $O(n^2)$ method to compute the log likelihood of a given sequence of length $n$
- SSE model outperforms i.i.d., M1, and M2 according to AIC
- article submitted to RECOMB 2010
55 Summary (SSE Protein Model)
Possible extensions
- more elaborate models within individual elements
- hybrid model to account for bad annotation of initial amino acids
- variants of the SSE model that focus on families of proteins
- classification of new proteins
58 Seeded Alignment (Application II: Seed Sensitivity)
- database search for putatively homologous sequences
- heuristic tools use filtration:
  1. select candidate sequences that match a short seed
  2. investigate candidates by exact local alignment methods
- quality crucially depends on the seed: a longer seed lowers sensitivity; a shorter seed lowers specificity
Goal
- PAA to compute seed sensitivity = fraction of sought sequences selected by filtration
- unified approach for gapped and ungapped alignments
59 Prerequisites (Application II: Seed Sensitivity)
- model alignments with a certain degree of similarity: homology model
  - ungapped alignments: i.i.d. string over $\Sigma = \{0, 1\}$ with match probability $p$
  - gapped alignments: Markovian string over $\Sigma = \{0, 1, 2, 3\}$ with parameters $p^0$ and (rows and columns indexed by 0, 1, 2, 3)
$$P = \begin{pmatrix} p_0 & p_1 & p_g & p_g \\ p_0 & p_1 & p_g & p_g \\ p_0 & p_1 & p_g & 0 \\ p_0 & p_1 & 0 & p_g \end{pmatrix}$$
60 Prerequisites (Application II: Seed Sensitivity)
- model alignments with a certain degree of similarity: homology model
- filtration criterion: seed model
  - consecutive seed: require contiguous perfect matches
  - spaced seed: specify discontiguous matching positions, e.g. $\pi = 1{*}11{*}1$, $* \in [0, 1]$
  - indel seed: allow for indels of variable size, e.g. $\pi = 1{*}1??1$, $* \in [0, 1]$, $? \in [\varepsilon, 0, 1, 2, 3]$
61 Prerequisites (Application II: Seed Sensitivity)
- model alignments with a certain degree of similarity: homology model
- filtration criterion: seed model
- seed sensitivity: $P(\text{seed hits random alignment})$
Hit count: count occurrences of seed patterns in a random alignment.
Define $V_t(\pi)$: number of matches of seed $\pi$.
Wanted:
1. Sensitivity: $P(V_t(\pi) \ge 1)$
2. Hit distribution: $P(V_t(\pi) = k)$, $k = 0, \dots, K$
62 PAA Design (Application II: Seed Sensitivity), example seed $\pi = 1{*}1$
- states: prefixes of seed patterns
- transitions: alignment column dependencies given by the homology model
- weights: number of seed patterns ending in the state
- operations: cumulate seed hits
63 Results: Sensitivity and Hit Distribution (Application II: Seed Sensitivity)
- distribution of seed hits: $P(V_t = k) = \sum_{q \in Q} P(\text{step } t,\ \text{state} = q,\ \text{hits} = k)$
Recurrence
$$f_t(q, k) = \sum_{q' \in Q} f_{t-1}(q', k - C(q))\, T_{q'q}$$
where $C(q)$ is the number of patterns ending in $q$.
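For the simplest setting, a consecutive seed of weight $w$ on an ungapped i.i.d. alignment, the hit-count recurrence can be sketched in a few lines. This is an illustrative special case, not the general construction from the thesis: the function name `hit_distribution` is hypothetical, the states are reduced to the number of trailing matches (capped at $w$), and overlapping hits are counted.

```python
def hit_distribution(w, p, t, K):
    """Hit distribution of the consecutive seed 1^w on an i.i.d. alignment.

    w: seed length, p: match probability, t: alignment length,
    K: largest hit count tracked separately.
    Returns a list d with d[k] = P(exactly k hits) for k < K and
    d[K] = P(at least K hits). Overlapping hits are counted.
    """
    # f[q][k]: q = trailing matches (capped at w), k = hits so far (capped at K)
    f = [[0.0] * (K + 1) for _ in range(w + 1)]
    f[0][0] = 1.0
    for _ in range(t):
        g = [[0.0] * (K + 1) for _ in range(w + 1)]
        for q in range(w + 1):
            for k in range(K + 1):
                mass = f[q][k]
                if mass == 0.0:
                    continue
                g[0][k] += mass * (1.0 - p)        # mismatch resets the run
                q2 = min(q + 1, w)                 # match extends the run
                k2 = min(k + (1 if q2 == w else 0), K)
                g[q2][k2] += mass * p
        f = g
    return [sum(f[q][k] for q in range(w + 1)) for k in range(K + 1)]

def sensitivity(w, p, t):
    """P(at least one hit) = 1 - P(no hit)."""
    return 1.0 - hit_distribution(w, p, t, K=1)[0]
```

For example, with $w = 2$, $p = 0.5$, and $t = 3$, the alignments 110, 011, and 111 contain the seed, so the sensitivity is 3/8.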
64 Results: Alternative Criteria (Application II: Seed Sensitivity)
Idea: the seed should
1. maximize the sensitivity to alignments referring to homologous sequences ($M_{hom}$)
2. maximize the specificity to alignments referring to random, unrelated sequences ($M_0$)
$$\max\ P(V_t(M_{hom}) \ge k) \cdot P(V_t(M_0) < k) \quad \text{for } k = 1, \dots, K$$
Result: no further information; sensitivity is supported
65 Summary (Application II: Seed Sensitivity)
Contributions
- PAA to compute seed sensitivity and the entire hit distribution
- different seed and homology models
  - unifying definitions for gapped and ungapped alignments
  - different occurrence counts (overlapping or non-overlapping)
- definition and evaluation of alternative criteria: sensitivity supported
- work presented at WABI 2008
67 454 Sequencing Technology (Application III: 454 Read Statistics)
69 Motivation (Application III: 454 Read Statistics)
- massively parallel DNA sequencing
- current GS-FLX: 400,000 reads with 250 nt on average
- current GS-FLX Titanium: 1,000,000 reads with 400 nt on average
- still short reads (Sanger: 800 nt)
- the longer the reads, the better (for genome assembly)
Goal
- PAA to compute the length distribution of 454 reads
- investigate potential improvement of average read length
71 454 Sequencing (Application III: 454 Read Statistics)
- sequencing by synthesis
- nucleotides are flown periodically according to a specified order
- incorporation is accompanied by a flash of light
- signal is proportional to the number of synthesized nucleotides (up to 8 nt)
Dispensation order: $d = d[1] \dots d[l]$ with $d[i] \in \Sigma = \{A, C, G, T\}$, where
1. every character from $\Sigma$ has to occur in $d$
2. no character is flown twice in a row: $d[i] = \sigma \Rightarrow d[i+1] \ne \sigma$
3. $d[1] \ne d[l]$
72 454 Sequencing (Application III: 454 Read Statistics)
- GS-FLX standard dispensation order: TACG
- 100 cycles = 400 nucleotide flows
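How the dispensation order limits the read length can be illustrated with a short simulation. This is a simplified sketch under assumptions: `read_length` is a hypothetical helper, `template` stands for the nucleotides as they are synthesized, each flow incorporates the whole homopolymer run of its nucleotide, and the signal limit of 8 nt and any quality filtering are ignored.

```python
def read_length(template, order="TACG", cycles=100):
    """Number of nucleotides synthesized after `cycles` repetitions of `order`."""
    pos = 0
    for flow in range(cycles * len(order)):
        nucleotide = order[flow % len(order)]
        # One flow incorporates the complete run of matching nucleotides.
        while pos < len(template) and template[pos] == nucleotide:
            pos += 1
        if pos == len(template):
            break
    return pos

# A template that follows the order is read quickly; one that runs
# against the order needs almost a full cycle per nucleotide.
fast = read_length("TACG", order="TACG", cycles=1)
slow = read_length("GCAT", order="TACG", cycles=1)
```

With the standard order and 100 cycles, the loop runs for at most 400 flows, matching the figure on the slide.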
73 Influence of d on Read Length (Application III: 454 Read Statistics)
[Figure: nucleotide flows per cycle for the dispensation orders TACG and GTCA on an example sequence; resulting read lengths 6 and 10.]
74 PAA Design (Application III: 454 Read Statistics)
- states: dispensation order $d$
- transitions: conditional nucleotide frequencies
- weights: Dirac measures (emission 1 in all nucleotide states)
- operations: cumulate the number of inserted nucleotides
75 Proof of Concept (Application III: 454 Read Statistics)
[Figure: read length distributions (frequency vs. length): theoretical vs. empirical, and simulated vs. empirical.]
- comparison of computed read lengths to empirical data
- simulated reads, processing according to TrimBack filter
- PAA yields reasonable results
77 The Optimal Dispensation Order (Application III: 454 Read Statistics)
Observation for sample reads:
- dispensation orders of length 4 perform best
- compared expected read length under all orders of length 4
- CGAT provides on average 10% longer reads than TACG (282 vs. 257)
- the optimal dispensation order maximizes the sum of dinucleotide frequencies
For a new genome:
1. estimate conditional nucleotide frequencies (from preceding sequencing runs)
2. design the PAA and compute the expected read length for orders of length 4
3. return the best-performing order
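Step 1 of this procedure, estimating conditional nucleotide frequencies from earlier reads, might look like the sketch below. The function name `conditional_frequencies` and the uniform fallback for nucleotides that never occur as a predecessor are assumptions; the thesis may estimate these frequencies differently.

```python
from collections import Counter

ALPHABET = "ACGT"

def conditional_frequencies(reads):
    """Estimate T[a][b] = P(next nucleotide is b | current nucleotide is a).

    reads: iterable of strings over ACGT from preceding sequencing runs.
    Rows without any observation fall back to the uniform distribution
    (an assumption made here to keep every row a valid distribution).
    """
    counts = Counter()
    for read in reads:
        counts.update(zip(read, read[1:]))   # adjacent nucleotide pairs
    table = {}
    for a in ALPHABET:
        total = sum(counts[(a, b)] for b in ALPHABET)
        table[a] = {b: (counts[(a, b)] / total if total else 1.0 / len(ALPHABET))
                    for b in ALPHABET}
    return table

# Hypothetical toy input: a single very short read.
T = conditional_frequencies(["AAC"])
```

The resulting table has the shape of the transition function of a first-order Markov chain, which is what the PAA design for read lengths takes as input.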
78 Summary (Application III: 454 Read Statistics)
Contributions
- PAA to compute the length distribution of 454 sequence reads
  - for dispensation orders of different lengths
  - different models for nucleotide sequences
- the choice of dispensation order can have a noticeable effect on the average read length
- propose an optimal dispensation order for a new genome given preliminary reads
79 Summary (Application III: 454 Read Statistics)
Future prospects
- theoretical analysis of dispensation orders
- experimental validation
80 Publications
- I. Herms and S. Rahmann. Modeling Protein Sequences Including Information from Secondary Structure. Submitted to RECOMB 2010.
- S. Wolfsheimer, I. Herms, S. Rahmann and A.K. Hartmann. Accurate Statistics for Local Sequence Alignment with Position-Dependent Scoring by Rare-Event Sampling. Submitted to BMC Bioinformatics, 2009.
- I. Herms and S. Rahmann. Computing Alignment Seed Sensitivity with Probabilistic Arithmetic Automata. WABI 2008, LNBI 5251, pp
- E. Baake and I. Herms. Single-Crossover Dynamics: Finite versus Infinite Populations. Bulletin of Mathematical Biology 70 (2008), pp
81 Conclusion
PAA framework
- three applications of the PAA framework to sequence statistics
- general, unifying framework
- elegant formulation and simplification of solutions to different tasks
- straightforward computations (re-use of algorithms from automata and Markov chain theory)
- extensions to formerly investigated questions
SSE model
- biologically motivated null model for protein sequences
- outperforms customary string models
82 Analysis of Secondary Structure Segments (Appendix)
[Figure: relative entropy between amino acid compositions within individually annotated segments.]
83 Secondary Structure Classification (Appendix)
[Figure: amino acid compositions after collapsing the structure annotation to 4 classes; four bar charts (coil C, helix H, sheet S, turn T) over the amino acids A C D E F G H I K L M N P Q R S T V W Y.]
84 Model Log Likelihoods (Appendix)
For a sequence $s$ of length $|s| = n$:
$$\ln P(s \mid \text{i.i.d.}) = \sum_{i=1}^{|s|} \ln P(S_i = s[i]) = \sum_{i=1}^{|s|} \ln p_{s[i]}$$
$$\ln P(s \mid M1) = \ln p^0_{s[1]} + \sum_{i=1}^{|s|-1} \ln P_{s[i]s[i+1]}$$
$$\ln P(s \mid M2) = \ln p^0_{s[1]} + \ln P_{s[1]s[2]} + \sum_{i=1}^{|s|-2} \ln P(S_{i+2} = s[i+2] \mid S_i = s[i],\ S_{i+1} = s[i+1])$$
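The first two log likelihoods translate almost literally into code. The sketch below assumes the model parameters are given as plain dictionaries; the function names and the two-letter toy alphabet are hypothetical.

```python
import math

def loglik_iid(s, p):
    """ln P(s | i.i.d.) with character probabilities p[c]."""
    return sum(math.log(p[c]) for c in s)

def loglik_m1(s, p0, P):
    """ln P(s | M1): initial distribution p0 and transition matrix P[a][b]."""
    return math.log(p0[s[0]]) + sum(math.log(P[a][b]) for a, b in zip(s, s[1:]))

# Hypothetical toy parameters over a two-letter alphabet.
uniform = {"A": 0.5, "B": 0.5}
ll_iid = loglik_iid("AB", uniform)
ll_m1 = loglik_m1("AAB", uniform, {"A": uniform, "B": uniform})
```

Working with sums of logarithms rather than products of probabilities keeps the values in a safe numerical range even for long sequences.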
85 Log Likelihood of SSE Model (Appendix)
Compute $\ln P(S = s \mid \text{SSE})$ with $|s| = n$:
$$P(S = s) = \sum_{v \in Q} P(S_1, \dots, S_n = s[1, n],\ Y_n = v)$$
Variant of the forward algorithm:
$$f_k(v) := P(S_1, \dots, S_k = s[1, k],\ Y_k = v,\ Y_{k+1} \ne v) \quad \text{for } 1 \le k < n$$
$$f_n(v) := P(S_1, \dots, S_n = s[1, n],\ Y_n = v)$$
$$P(S = s) = \sum_{v \in Q} f_n(v)$$
86 Log Likelihood of SSE Model (Appendix)
Recurrence, for $1 \le k < n$:
$$f_k(v) = \sum_{u \in Q} \sum_{l=1}^{k-1} f_l(u)\, T^{uv}_{s[l]s[l+1]}\, p_v(s[l+1, k])\, P(R(v) = k - l) + P_{\text{start},v}\, \pi^v_{s[1]}\, p_v(s[1, k])\, P(R(v) = k)$$
$$f_n(v) = \sum_{u \in Q} \sum_{l=1}^{n-1} f_l(u)\, T^{uv}_{s[l]s[l+1]}\, p_v(s[l+1, n])\, P(R(v) \ge n - l) + P_{\text{start},v}\, \pi^v_{s[1]}\, p_v(s[1, n])\, P(R(v) \ge n)$$
Output: $f(s) = \sum_{v \in Q} f_n(v)$
87 Log Likelihood of SSE Model (Appendix)
For the log likelihood: implementation of logarithms of sums of small values. Use
$$\ln(x + y) = \ln\big(x (1 + y/x)\big) = \ln x + \ln\big(1 + \exp(\ln y - \ln x)\big)$$
$$\ln f_k(v) = \ln F^0_k(v) + \operatorname{log1p}\Big(\sum_{l=1}^{k-1} \exp\big(\ln F^l_k(v) - \ln F^0_k(v)\big)\Big)$$
where $F^l_k(v)$ is the probability of a path that generates the length-$k$ prefix, with the last $k - l$ residues produced in element $v$.
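The identity above is easy to try out with the standard library. In the sketch below, `log_add` and `log_sum` are hypothetical helper names; pivoting on the larger argument is a common safeguard added here so that the exponent is never positive, and it is not claimed to be the thesis implementation.

```python
import math

def log_add(ln_x, ln_y):
    """Return ln(x + y) given ln x and ln y, without leaving log space."""
    if ln_x < ln_y:
        ln_x, ln_y = ln_y, ln_x          # make ln_x the larger term
    if ln_y == -math.inf:
        return ln_x                      # adding zero changes nothing
    return ln_x + math.log1p(math.exp(ln_y - ln_x))

def log_sum(ln_terms):
    """ln of a sum of several terms, each given by its logarithm."""
    total = -math.inf
    for term in ln_terms:
        total = log_add(total, term)
    return total

# exp(-1000) underflows to 0.0 in double precision, so a naive
# math.log(math.exp(-1000) + math.exp(-1000)) would fail; log_add does not.
tiny = log_add(-1000.0, -1000.0)
```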
88 Second-order AIC (Appendix)
$$\mathrm{AIC}_c := \mathrm{AIC} + \frac{2k(k + 1)}{n - k - 1} = -2\, \ell(\hat{\theta} \mid x) + 2k\, \frac{n}{n - k - 1}$$
with
- $n$: size of sample $x$
- $k$: number of free parameters in the most complex model
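The two forms of the second-order AIC given above agree, which a few lines of code can confirm numerically. The function name `aicc` and the numbers are hypothetical.

```python
def aicc(log_likelihood, k, n):
    """Second-order AIC: AIC plus the small-sample correction 2k(k+1)/(n-k-1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

# Hypothetical values: log likelihood -100, k = 2 parameters, n = 13 samples.
value = aicc(-100.0, k=2, n=13)
# The closed form -2 l + 2 k n / (n - k - 1) gives the same number.
closed_form = 200.0 + 2.0 * 2 * 13 / (13 - 2 - 1)
```

As the sample size grows, the correction term shrinks toward zero and the value approaches the plain AIC.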
89 454 Read Length Distribution (Appendix)
Recurrence
$$f_s(q, l) = \begin{cases} \sum_{q' \in Q} f_s(q', l - 1)\, T_{q'q} & \text{if } q = g(s) \\ f_{h(q,s)}(q, l) & \text{otherwise} \end{cases}$$
- $g(s)$ maps the number $s$ of a nucleotide flow to the corresponding nucleotide within $d$
- $h(q, s)$ gives the last preceding flow corresponding to nucleotide $q$
Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationMotivating the need for optimal sequence alignments...
1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use
More informationOn the Monotonicity of the String Correction Factor for Words with Mismatches
On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.
More informationProtein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.
Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein
More informationStephen Scott.
1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for