DNA1: Last week's take-home lessons

Similar documents
Increasingly complex (accurate) searches. DNA2: diversity & continuity. Alignments & Scores. DNA2: Aligning ancient diversity.

Sequencing alignment Ameer Effat M. Elfarash

Sequencing alignment Ameer Effat M. Elfarash

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

Multiple Sequence Alignment

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Hidden Markov Models

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Lecture 7 Sequence analysis. Hidden Markov Models

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Hidden Markov Models

Biology 644: Bioinformatics

Computational Biology: Basics & Interesting Problems

HMMs and biological sequence analysis

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Bioinformatics Chapter 1. Introduction

Sequence analysis and comparison

Markov Chains and Hidden Markov Models. = stochastic, generative models

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Week 10: Homology Modelling (II) - HHpred

Markov Models & DNA Sequence Evolution

Hidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Bioinformatics 2 - Lecture 4

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

Sequence Alignment Techniques and Their Uses

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

EECS730: Introduction to Bioinformatics

Hidden Markov Models for biological sequence analysis I

Hidden Markov Models for biological sequence analysis

Stephen Scott.

Computational Genomics and Molecular Biology, Fall

Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics

Sequence Analysis and Databases 2: Sequences and Multiple Alignments

Hidden Markov Models. music recognition. deal with variations in - pitch - timing - timbre 2

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Multiple sequence alignment

O 3 O 4 O 5. q 3. q 4. Transition

Tools and Algorithms in Bioinformatics

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Basic Local Alignment Search Tool

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

Sequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Similarity or Identity? When are molecules similar?

Hidden Markov models in population genetics and evolutionary biology

Welcome to Class 21!

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II

Large-Scale Genomic Surveys

Multiple Sequence Alignment using Profile HMM

Hidden Markov Models

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Example: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding

BME 5742 Biosystems Modeling and Control

Protein function prediction based on sequence analysis

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bioinformatics Exercises

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

Motivating the need for optimal sequence alignments...

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

From gene to protein. Premedical biology

Exercise 5. Sequence Profiles & BLAST

1. In most cases, genes code for and it is that

Advanced topics in bioinformatics

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Introduction to Bioinformatics

Sequence analysis and Genomics

Plan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping

Similarity searching summary (2)

Sequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos. Molecular Genetics 101. Cell s internal world

7.36/7.91 recitation CB Lecture #4

ROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Comparative genomics: Overview & Tools + MUMmer algorithm

Reducing storage requirements for biological sequence comparison

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

From Gene to Protein

Today s Lecture: HMMs

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

STA 414/2104: Machine Learning

From DNA to protein, i.e. the central dogma

Transcription:

Harvard-MIT Division of Health Sciences and Technology HST.508: Genomics and Computational Biology DNA1: Last week's take-home lessons Types of mutants Mutation, drift, selection Binomial for each Association studies χ 2 statistic Linked & causative alleles Alleles, Haplotypes, genotypes Computing the first genome, the second... New technologies Random and systematic errors 1

DNA2: Today's story and goals Motivation and connection to DNA1 Comparing types of alignments & algorithms Dynamic programming Multi-sequence alignment Space-time-accuracy tradeoffs Finding genes -- motif profiles Hidden Markov Model for CpG Islands 2

DNA 2 DNA1: the last 5000 generations Intro2: Common & simple Figure (http://216.190.101.28/gold/) 3

Applications of Dynamic Programming ato sequence analysis Shotgun sequence assembly Multiple alignments Dispersed & tandem repeats Bird song alignments Gene Expression time-warping athrough HMMs RNA gene search & structure prediction Distant protein homologies Speech recognition 4

Alignments & Scores Global (e.g. haplotype) Local (motif) ACCACACA ACCACACA ::xx::x: :::: ACACCATA ACACCATA Score= 5(+1) + 3(-1) = 2 Score= 4(+1) = 4 Suffix (shotgun assembly) ACCACACA ::: ACACCATA Score= 3(+1) =3 5

Increasingly complex (accurate) searches Exact (StringSearch) Regular expression (PrositeSearch) Substitution matrix (BlastN) Profile matrix (PSI-blast) Gaps (Gap-Blast) Dynamic Programming (NW, SM) CGCG CGN{0-9}CG = CGAACG CGCG ~= CACG CGc(g/a) ~ = CACG CGCG ~= CGAACG CGCG ~= CAGACG Hidden Markov Models (HMMER) WU (http://hmmer.wustl.edu/) 6

"Hardness" of (multi-) sequence alignment Align 2 sequences of length N allowing gaps. ACCAC-ACA ACCACACA ::x::x:x: :xxxxxx: AC-ACCATA, A-----CACCATA, etc. 2N gap positions, gap lengths of 0 to N each: A naïve algorithm might scale by O(N 2N ). For N= 3x10 9 this is rather large. Now, what about k>2 sequences? or rearrangements other than gaps? 7

Testing search & classification algorithms Separate Training set and Testing sets Need databases of non-redundant sets. Need evaluation criteria (programs) Sensitivity and Specificity (false negatives & positives) sensitivity (true_predicted/true) specificity (true_predicted/all_predicted) Where do training sets come from? More expensive experiments: crystallography, genetics, biochemistry 8

Comparisons of homology scores Pearson WR Protein Sci 1995 Jun;4(6):1145-60 Comparison of methods for searching protein sequence databases. Methods Enzymol 1996;266:227-58 Effective protein sequence comparison. Algorithm: FASTA, Blastp, Blitz Substitution matrix:pam120, PAM250, BLOSUM50, BLOSUM62 Database: PIR, SWISS-PROT, GenPept 9

Switch to protein searches when possible M F x= u c a g 3 uac 3 aag 5'... aug uuu... Adjacent mrna codons Uxu F Y C uxc S uxa - - uxg - W Cxu L H cxc P R cxa Q cxg axu N S axc I T axa K R axg M TER C-S NH+ gxu D gxc V A G O gxa E gxg H:D/A 10

A Multiple Alignment of Immunoglobulins VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- 11

Scoring matrix based on large set of distantly related blocks: Blosum62 10 1 6 6 4 8 2 6 6 9 2 4 4 4 5 6 6 7 1 3 % A C D E F G H I K L M N P Q R S T V W Y 8 0-4 -2-4 0-4 -2-2 -2-2 -4-2 -2-2 2 0 0-6 -4 A 18-6 -8-4 -6-6 -2-6 -2-2 -6-6 -6-6 -2-2 -2-4 -4 C 12 4-6 -2-2 -6-2 -8-6 2-2 0-4 0-2 -6-8 -6 D 10-6 -4 0-6 2-6 -4 0-2 4 0 0-2 -4-6 -4 E 12-6 -2 0-6 0 0-6 -8-6 -6-4 -4-2 2 6 F 12-4 -8-4 -8-6 0-4 -4-4 0-4 -6-4 -6 G 16-6 -2-6 -4 2-4 0 0-2 -4-6 -4 4 H 8-6 4 2-6 -6-6 -6-4 -2 6-6 -2 I 10-4 -2 0-2 2 4 0-2 -4-6 -4 K 8 4-6 -6-4 -4-4 -2 2-4 -2 L 10-4 -4 0-2 -2-2 2-2 -2 M 12-4 0 0 2 0-6 -8-4 N 14-2 -4-2 -2-4 -8-6 P 10 2 0-2 -4-4 -2 Q 10-2 -2-6 -6-4 R 8 2-4 -6-4 S 10 0-4 -4 T 8-6 -2 V 22 4 W 14 Y 12

Scoring Functions and Alignments ascoring function: ω(match) = +1; ω(mismatch) = -1; ω(indel) = -2; } ω(other) = 0. aalignment score: sum of columns. aoptimal alignment: maximum score. substitution matrix 13

Calculating Alignment Scores ATGA A-TGA (1) :xx: (2) : : : ACTA ACT-A A T G A (1)Score = ω + ω + ω + ω = 1 1 1 + 1 = 0. A C T A A T G A (2) Score =ω +ω + ω + ω + ω = 1 2 + 1 2 + 1 = 1. A C T A if ω ( indel ) = 1, Score = 1 1 + 1 1 + 1 = + 1. 14

DNA2: Today's story and goals Motivation and connection to DNA1 Comparing types of alignments & algorithms Dynamic programming Multi-sequence alignment Space-time-accuracy tradeoffs Finding genes -- motif profiles Hidden Markov Model for CpG Islands 15

What is dynamic programming? A dynamic programming algorithm solves every subsubproblem just once and then saves its answer in a table, avoiding the work of recomputing the answer every time the subsubproblem is encountered. -- Cormen et al. "Introduction to Algorithms", The MIT Press. 16

Recursion of Optimal Global Alignments u s : optimal global alignment score of u and v. v ATGA s + ω ; ACT A ATGA ATG A s = max s + ω ; ACTA ACT A ATG A s + ω. ACTA 17

Recursion of Optimal Local Alignments u s : optimal local alignment score of u and v. v u 1 u 2...ui s +ω ; 1 v v 2...vj 1 vj u 1u 2...ui 1 ui u 1 u 2...ui s +ω ; s = max v 1 v 2...vj 1 vj v 1 v 2...vj 1 s u u 2...ui 1 ui +ω ; v 1 v 2...vj 0. 18

Computing Row-by-Row A C T A 0 min min min min A min 1-1 -3-5 T min G min A min A C T A 0 min min min min A min 1-1 -3-5 T min -1 0 0-2 G A min min A C T A 0 min min min min A min 1-1 -3-5 T min -1 0 0-2 G min -3-2 -1-1 A min A C T A 0 min min min min A min 1-1 -3-5 T min -1 0 0-2 G min -3-2 -1-1 A min -5-4 -3 0 min = -10 99 19

Traceback Optimal Global Alignment A C T A 0 min min min min A min 1-1 -3-5 T min -1 0 0-2 G min -3-2 -1-1 A min -5-4 -3 0 A G T A : : A T C A 20

Local and Global Alignments A C C A C A C A 0 0 0 0 0 0 0 0 0 A 0 1 0 0 1 0 1 0 1 C 0 0 2 1 0 2 0 2 0 A 0 1 0 1 2 0 3 1 3 C 0 0 2 1 0 3 1 4 2 C 0 0 0 3 2 1 2 2 3 A 0 1 0 1 4 2 2 1 3 T 0 0 0 0 2 3 1 1 1 A 0 1 0 0 1 1 4 2 2 A C C A C A C A 0 m m m m m m m m A m 1-1 -3-5 -7-9 -11-13 C m -1 2 0-2 -4-6 -8-10 A m -3 0 1 1-1 -3-5 -7 C m -5-2 1 0 2 0-2 -4 C m -7-4 -1 0 1 1 1 A m -9-6 -3 0-1 2 0 2 T m -11-8 -5-2 -1 0 1 0 A m -13-10 -1-7 -4-1 0-1 2 21

Time and Space Complexity of Computing Alignments For two sequences u=u 1 u 2 u n and v=v 1 v 2 v m, finding the optimal alignment takes O(mn) time and O(mn) space. An O(1)-time operation: one comparison, three multiplication steps, computing an entry in the alignment table An O(1)-space memory: one byte, a data structure of two floating points, an entry in the alignment table 22

Time and Space Problems acomparing two one-megabase genomes. aspace: An entry: 4 bytes; Table: 4 * 10^6 * 10^6 = 4 G bytes memory. atime: 1000 MHz CPU: 1M entries/second; 10^12 entries: 1M seconds = 10 days. 23

Time & Space Improvement for w-band Global Alignments atwo sequences differ by at most w bps (w<<n). aw-band algorithm: O(wn) time and space. aexample: w=3. A C C A C A C A 0 m m m A m 1-1 -3-5 C m -1 2 0-2 -4 A m -3 0 1 1-1 -3 C -5-2 1 0 2 0-2 C -4-1 0 1 1 1 A -3 0-1 2 0 2 T -2-1 0 1 0 A -1 0-1 2-1 24

Summary Dynamic programming Statistical interpretation of alignments Computing optimal global alignment Computing optimal local alignment Time and space complexity Improvement of time and space Scoring functions 25

DNA2: Today's story and goals Motivation and connection to DNA1 Comparing types of alignments & algorithms Dynamic programming Multi-sequence alignment Space-time-accuracy tradeoffs Finding genes -- motif profiles Hidden Markov Model for CpG Islands 26

A Multiple Alignment of Immunoglobulins VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- 27

A multiple alignment <=> Dynamic programming on a hyperlattice See G. Fullen, 1996. 28

Multiple Alignment vs Pairwise Alignment AT AT AT A- -T AT AT A- -T Optimal Multiple Alignment Non-Optimal Pairwise Alignment 29

- - + δ s Computing a Node on Hyperlattice k=3 2 k 1=7 - S A +δ NV N N V - A +δ - - NS N N - S - +δ - NA NV N V S - - - NA N N V - - V S A N N N + δ + δ - NS N NA s s s s s = max NA NS s NA - N NV s V S A NA NS - N s - N NS - N s - N NV N N N s A S - +δ NV NS NA s NV - N s - - A + δ - NS N NV s 30

Challenges of Optimal Multiple Alignments aspace complexity (hyperlattice size): O(n k ) for k sequences each n long. acomputing a hyperlattice node: O(2 k ). atime complexity: O(2 k n k ). afind the optimal solution is exponential in k (non-polynomial, NP-hard). 31

Methods and Heuristics for Optimal Multiple Alignments aoptimal: dynamic programming Pruning the hyperlattice (MSA) aheuristics: tree alignments(clustalw) star alignments sampling (Gibbs) (discussed in RNA2) local profiling with iteration (PSI-Blast,...) 32

ClustalW: Progressive Multiple Alignment S 1 S 2 S 3 S 4 S S S 1 2 3 S 1 S 2 4 All Pairwise Alignments Similarity Matrix S 3 9 4 S S4 See Higgins(1991) and Thompson(1994). 4 4 7 4 Cluster Analysis Multiple Alignment Step: 1. Aligning S 1 and S 3 2. Aligning S 2 and S 4 3. Aligning (S 1,S 3 ) with (S 2,S 4 ). Dendrogram S 1 S 3 Distance S 2 S 4 33

Star Alignments s s s s s 1 2 3 4 5 = = = = = s s s s 1 2 3 4 ATTGCCATT ATGGCCATT ATCCAATTTT ATCTTCTT ACTGACC Similarity s2 7 s3 2 2 Pairwise Alignment Matrix s4 0 0 0 s5 3 4 7 3 Find the Central Sequence s 1 Multiple Alignment ATTGCCATT-- ATGGCCATT-- ATC-CAATTTT ATCTTC-TT-- ACTGACC---- AT*GCCATTTT Combine into Multiple Alignment Pairwise Alignment ATTGCCATT ATGGCCATT ATTGCCATT-- ATC-CAATTTT ATTGCCATT ATCTTC-TT ATTGCCATT ACTGACC 34

DNA2: Today's story and goals Motivation and connection to DNA1 Comparing types of alignments & algorithms Dynamic programming Multi-sequence alignment Space-time-accuracy tradeoffs Finding genes -- motif profiles Hidden Markov Model for CpG Islands 35

Accurately finding genes & their edges What is distinctive? 0. Promoters & CGs islands 1. Preferred codons 2. RNA splice signals 3. Frame across splices 4. Inter-species conservation 5. cdna for splice edges Failure to find edges? Variety & combinations Tiny proteins (& RNAs) Alternatives & weak motifs Alternatives Gene too close or distant Rare transcript 36

Annotated "Protein" Sizes in Yeast & Mycoplasma 14% 12% 10% 8% 6% 4% 2% Yeast SC MG % of proteins at length x 0% 0 100 200 300 400 500 600 700 800 900 x = "Protein" size in #aa 37

Predicting small proteins (ORFs) 100000 #codons giving random ORF expectation=1 per kb (vs %GC) min max 10000 1000 100 10 Yeast ORF length giving random expectation=1 per kb (vs %GC) 1 0% 50% 100% 38

Small coding regions Mutations in domain II of 23 S rrna facilitate translation of a 23 S rrna-encoded pentapeptide conferring erythromycin resistance. Dam et al. 1996 J Mol Biol 259:1-6 Trp (W) leader peptide, 14 codons: MKAIFVLKGWWRTS STOP Phe (F) leader peptide, 15 codons: MKHIPFFFAFFFTFP STOP His (H) leader peptide, 16 codons: STOP MTRVQFKHHHHHHHPD Other examples in proteomics lectures 39

Motif Matrices a a t g c a t g g a t g t g t g a 1 3 0 0 c 1 0 0 0 g 1 1 0 4 t 1 0 4 0 Align and calculate frequencies. Note: Higher order correlations lost. 40

Protein starts See GeneMark 41

Motif Matrices a a t g c a t g g a t g t g t g 1+3+4+4 = 12 1+3+4+4 = 12 1+3+4+4 = 12 1+1+4+4 = 10 a 1 3 0 0 c 1 0 0 0 g 1 1 0 4 t 1 0 4 0 Align and calculate frequencies. Note: Higher order correlations lost. Score test sets: a c c c 1+0+0+0 = 1 42

DNA2: Today's story and goals Motivation and connection to DNA1 Comparing types of alignments & algorithms Dynamic programming Multi-sequence alignment Space-time-accuracy tradeoffs Finding genes -- motif profiles Hidden Markov Model for CpG Islands 43

Why probabilistic models in sequence analysis? a Recognition - Is this sequence a protein start? a Discrimination - Is this protein more like a hemoglobin or a myoglobin? a Database search - What are all of sequences in SwissProt that look like a serine protease? 44

A Basic idea Assign a number to every possible sequence such that s P(s M) = 1 P(s M) is a probability of sequence s given a model M. 45

Sequence recognition Recognition question - What is the probability that the sequence s is from the start site model M? P(M s) = P(M)* P(s M) / P(s) (Bayes' theorem) P(M) and P(s) are prior probabilities and P(M s) is posterior probability. 46

Database search a N = null model (random bases or AAs) a Report all sequences with logp(s M) - logp(s N) > logp(n) - logp(m) a Example, say α/β hydrolase fold is rare in the database, about 10 in 10,000,000. The threshold is 20 bits. If considering 0.05 as a significant level, then the threshold is 20+4.4 = 24.4 bits. 47

Plausible sources of mono, di, tri, & tetra- nucleotide biases C rare due to lack of uracil glycosylase (cytidine deamination) TT rare due to lack of UV repair enzymes. CG rare due to 5methylCG to TG transitions (cytidine deamination) AGG rare due to low abundance of the corresponding Arg-tRNA. CTAG rare in bacteria due to error-prone "repair" of CTAGG to C*CAGG. AAAA excess due to polya pseudogenes and/or polymerase slippage. AmAcid Codon Number /1000 Fraction Arg AGG 3363.00 1.93 0.03 Arg AGA 5345.00 3.07 0.06 Arg CGG 10558.00 6.06 0.11 Arg CGA 6853.00 3.94 0.07 Arg CGT 34601.00 19.87 0.36 Arg CGC 36362.00 20.88 0.37 ftp://sanger.otago.ac.nz/pub/transterm/data/codons/bct/esccol.cod 48

CpG Island + in a ocean of - First order Hidden Markov Model P(A+ A+) A+ MM=16, HMM= 64 transition probabilities (adjacent bp) T+ A- T- C+ P(G+ C+) > G+ P(C- A+)> C- G- 49

Estimate transistion probabilities -- an example Training set S= aaacagcctgacatggttccgaacagcctcgacatggcgtt Isle A+ C+ G+ T+ A C G T- A+.20.40.20.20.00.00.00.00 1.00 C+.29.14.43.14.00.00.00.00 1.00 G+.33.33.17.17.00.00.00.00 1.00 T+.00.25.25.25.25.00.00.00 1.00.82 1.13 1.05.76.25.00.00.00 Sums Ocean A C G T A+ C+ G+ T+ A-.33.33.17.17.00.00.00.00 1.00 C-.40.20.00.20.00.20.00.00 1.00 G-.25.25.25.25.00.00.00.00 1.00 T-.00.25.50.25.00.00.00.00 1.00.98 1.03.92.87.00.20.00.00 Sums Laplace pseudocount: Add +1 count to each observed. (p.9,108,321 P(G C) = #(CG) / N #(CN) 50 Dirichlet) (http://shop.barnesandnoble.com/textbooks/booksearch/isbninquiry.asp?mscssid=&isbn=0521629713)

Estimated transistion probabilities from 48 "known" islands Training set (+) A C G T A.18.27.43.12 1.00 C.17.37.27.19 1.00 G.16.34.38.13 1.00 T.08.36.38.18 1.00 (-) A C G T P(G C) = A.30.21.29.21 1.00 #(CG) / N #(CN) C.32.30.08.30 1.00 G.25.25.30.21 1.00 T.18.24.29.29 1.00 1.05.99.95 1.01 51 Sums

Viterbi: dynamic programming Most probable path l,k=2 states Recursion: v l (i+1) = e l (x i +1) max(v k (i)a kl ) for HMM 1/8*.27 v s i = C G C G begin 1 0 0 0 0 A+ 0 0 0 0 0 C+ 0 0.125 0 0.012 0 G+ 0 0 0.034 0 0.0032 T+ 0 0 0 0 0 A 0 0 0 0 0 C 0 0.125 0 0.0026 0 G 0 0 0.01 0 0.0002 T 0 0 0 0 0 a= table in slide 51 e= emit s i in state l (Durbin p.56) 52

DNA2: Today's story and goals Motivation and connection to DNA1 Comparing types of alignments & algorithms Dynamic programming Multi-sequence alignment Space-time-accuracy tradeoffs Finding genes -- motif profiles Hidden Markov Model for CpG Islands 53