Sequence Analysis. BBSI 2006: Lecture #(χ+1). Takis Benos (2006). Molecular Genetics 101: Cell's internal world



DNA - Chromosomes - Genes

What is a gene?
- "We cannot define it but we know it when we see it."
- A loose definition: a gene is a DNA/RNA information unit that is able to perform a function in a cellular environment.

Central Dogma: protein-coding genes
DNA -> (transcribed) -> RNA -> (translated) -> protein

Open Reading Frames (ORFs)

aatagcgaat tttccaacga caaaagctaa atatcgcaaa aacctcagta aaaatcttgc 60
tggagctatt attgctaagt aacatttacc ccctgaagtt aatggatcaa tcaagagaga 120
tgtgggctgt aatgaatcgt cttattgaat taacaggttg gatcgttctt gtcgtttcag 180
tcattcttct tggcgtggcg agtcacattg acaactatca gccacctgaa cagagtgctt 240
cggtacaaca caagtaagct ctgcacttgt ggagcgacat gctgcccgtc cgggtgcatg 300
ttttcacttg tcggatatta aaccaggaat ttattatctt gttcgatgtt gtaataaa 358

Open Reading Frames (ORFs), annotated on the same sequence
The ORF marked on the slide starts at the ATG in the third sequence line (M N R ...) and ends in the fifth line with K followed by a stop codon (TAA). Its translation is:
MNRLIELTGWIVLVVSVILLGVASHIDNYQPPEQSASVQHK

Gene's characteristics, annotated on the same sequence
promoter, 5' UTR, ATG, ORF, TAA, 3' UTR; the mRNA is marked along the sequence.
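As a minimal sketch of how the annotated ORF can be read off the sequence, the snippet below translates positions 121-300 of the slide's DNA starting from the ATG that the slide marks with "M N R". The helper name `translate` and the compact codon-table construction are illustrative choices, not part of the lecture. The start position is taken from the slide's annotation rather than searched for, because the sequence contains earlier in-frame ATGs.

```python
# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Positions 121-300 of the slide's sequence (spaces are removed below).
DNA = (
    "tgtgggctgt aatgaatcgt cttattgaat taacaggttg gatcgttctt gtcgtttcag "
    "tcattcttct tggcgtggcg agtcacattg acaactatca gccacctgaa cagagtgctt "
    "cggtacaaca caagtaagct ctgcacttgt ggagcgacat gctgcccgtc cgggtgcatg"
).replace(" ", "")


def translate(dna, start):
    """Translate codon by codon from `start` until a stop codon or the end."""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: end of the ORF
            break
        protein.append(amino_acid)
    return "".join(protein)


# Start codon as annotated on the slide (the ATG followed by aat cgt).
ORF_START = DNA.find("atgaatcgt")
PROTEIN = translate(DNA, ORF_START)
print(PROTEIN)
```

A fuller ORF finder would scan all six reading frames and report every ATG-to-stop stretch above a length threshold; this sketch only checks the single frame shown on the slide.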

Transcription

cugaaguu aauggaucaa ucaagagaga 120
ugugggcugu aaugaaucgu cuuauugaau uaacagguug gaucguucuu gucguuucag 180
ucauucuucu uggcguggcg agucacauug acaacuauca gccaccugaa cagagugcuu 240
cgguacaaca caaguaagcu cugcacuugu ggagcgacau gcugcccguc cgggugcaug 300
uuuucacuug ucggauauua aaccaggaau uuauuaucuu guucgauguu guaauaaa 358

Processed mRNA: CAP, cuga ... aauaaa, aaaaaaa (polyA tail)

Translation

mRNA: cuga ... aauaaa aaaaaaa
Protein: MNRLIELTGWIVLVVSVILLGVASHIDNYQPPEQSASVQHK

Protein coding genes (cntd)
Source: http://www.emc.maricopa.edu/faculty/farabee/biobk/biobookprotsyn.html

Genes and Proteins

Alterations of the DNA
Base substitutions:
- silent: no amino acid (a.a.) replacement
- missense: a.a. replacement
- nonsense: a.a. codon replaced by a stop codon

Alterations of the DNA (cntd)
Example, starting from UGU (Cys):
- UGU -> UGC (Cys): silent
- UGU -> UGA (stop): nonsense
- UGU -> UGG (Trp): missense
Source: http://www.emc.maricopa.edu/faculty/farabee/biobk/biobookprotsyn.html
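The three categories of base substitution can be checked mechanically against the standard genetic code. The following is a small sketch; the function name `classify_substitution` is hypothetical, and it assumes the original codon is a sense codon, as in the UGU example.

```python
# Standard genetic code for RNA, with codons enumerated in u, c, a, g order.
BASES = "ucag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(original, mutated):
    """Classify a codon change as silent, nonsense, or missense."""
    before = CODON_TABLE[original.lower()]
    after = CODON_TABLE[mutated.lower()]
    if after == before:
        return "silent"      # same amino acid, no replacement
    if after == "*":
        return "nonsense"    # amino acid codon became a stop codon
    return "missense"        # a different amino acid is encoded


for mutant in ("UGC", "UGA", "UGG"):
    print("UGU ->", mutant, ":", classify_substitution("UGU", mutant))
```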

Inheritance - genetic differences

Molecular evolution
Two species will acquire mutations in proportion to their divergence time. However:
- not all proteins change at the same pace
- a given protein does not necessarily change at the same pace throughout time
- different parts of the same protein change at different paces

Molecular evolution (cntd)
Human (CP11A_HUMAN; P05108) vs. Drosophila (CP11A_PIG; P10612)
[Pairwise protein alignment, Query/Sbjct rows: query residues 1-520 against subject residues 1-519]

Molecular evolution (cntd)
Human (CP11A_HUMAN; P05108) vs. zebrafish (Cyp11a; Q8JH93)
[Pairwise protein alignment, Query/Sbjct rows: query residues 34-515 against subject residues 27-506]

Transcription regulation
- promoter region
- expression rates
- degradation
- post modifications

Splicing
genomic DNA (intron-1, intron-2, intron-3) -> mRNA precursor -> mRNA

Non-coding genes
- tRNA
- ribosomal RNA
- snoRNA
- microRNA
- etc.
Source: http://www.emc.maricopa.edu/faculty/farabee/biobk/biobookprotsyn.html

Elements of Probability Theory (with examples)

Outline
- Conditional Probabilities
- Markov Chains
- Hidden Markov Models
- Information measures

Probabilities
Definition:
$P(x) = \frac{\#\ \text{favourable outcomes for } x}{\text{total } \#\ \text{possible outcomes}}$

Conditional Probabilities
Definition:
$P(x \mid A) = \frac{\#\ \text{favourable outcomes for } x \text{ given } A}{\text{total } \#\ \text{possible outcomes given } A}$

Conditional Probabilities (cntd)
Example:
- outcome x: tomorrow I'll go hiking
- event A: tomorrow will be sunny
[Diagram labels: P(x), P(x | A), A, NOT x]

Conditional Probabilities (cntd)
Joint probability:
$P(X, Y) = P(X \mid Y)\, P(Y)$
If $P(X \mid Y) = P(X)$, then X and Y are independent:
$P(X, Y) = P(X)\, P(Y)$
Marginal probability:
$P(X) = \sum_{Y} P(X, Y) = \sum_{Y} P(X \mid Y)\, P(Y)$

Conditional Probabilities (cntd)
Bayes' theorem:
$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)} = \frac{P(Y \mid X)\, P(X)}{\sum_{x} P(Y \mid x)\, P(x)}$
Posterior probabilities are the compromise between data and prior information.

Bayes: Application-1
Problem (from Durbin et al., 1998): A rare genetic disease is discovered with a population frequency of one in a million. An extremely good genetic test is 100% sensitive (always correct if you have the disease) and 99.99% specific (false positive rate 0.01%). Would you be willing to take such a test?
Hint: What is the probability that you have the disease, if the test is positive?

Bayes: Application-1 (cntd)
Answer:
$P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+)} = \frac{1.0 \times 10^{-6}}{1.0 \times 10^{-6} + 10^{-4} \times (1 - 10^{-6})} = 0.0099$

Application of Bayes-2
Problem: Given a set of transmembrane proteins with specified membrane domains of length L (training set), can you develop a probabilistic model that predicts which parts of a new transmembrane protein are likely to be membrane domains?

Application of Bayes-2 (cntd)
Solution: Suppose that we suspect that the amino acid frequencies differ between membrane and nonmembrane regions.
- Using the training set, calculate the probabilities $P(a_i \mid D)$ that each amino acid $a_i$ is part of a membrane domain D.
- Also, using the nonmembrane parts, calculate the corresponding probabilities $P(a_i \mid \text{not } D)$.
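The arithmetic in the genetic-test answer can be reproduced in a few lines. This is a sketch only; the function and parameter names are illustrative, and the inputs are the figures stated in the problem (prior of one in a million, sensitivity 1.0, false positive rate 0.0001).

```python
def posterior_disease(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(disease | positive test)."""
    true_positive = sensitivity * prior                   # P(+ | D) P(D)
    false_positive = false_positive_rate * (1.0 - prior)  # P(+ | not D) P(not D)
    return true_positive / (true_positive + false_positive)


p = posterior_disease(prior=1e-6, sensitivity=1.0, false_positive_rate=1e-4)
print(round(p, 4))  # 0.0099
```

Even with a near-perfect test, the posterior stays below 1% because false positives from the healthy majority outnumber true positives by about 100 to 1.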

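Application of Bayes-2 (cntd)
Solution (cntd):
- Divide the new protein into segments.
- Using Bayes' theorem, calculate the posterior probability of each segment d being a membrane domain, using the $P(a_i \mid D)$:
$P(M \mid d) = \frac{P(d \mid M)\, P(M)}{\sum_{M'} P(d \mid M')\, P(M')}, \quad M \in \{D, \text{not } D\}$
- Assign each segment to the model with the highest posterior: $\arg\max_{M} P(M \mid d)$.

Markov chains
What is a Markov chain?
A Markov chain of order n is a stochastic process of a series of outcomes, in which the probability of outcome x depends on the state of the previous n outcomes.

Markov chains (cntd)
Markov chain of first order:
Transition probabilities: $P(x_i \mid x_{i-1})$
$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1)$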
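To make the first-order factorisation concrete, the sketch below evaluates the probability of a sequence as the start probability times a product of transition probabilities, working in log space to avoid underflow on long sequences. The uniform start and transition tables are placeholder values chosen for illustration, not parameters from the lecture.

```python
import math


def log_prob_first_order(seq, start, trans):
    """log P(x) = log P(x_1) + sum over i >= 2 of log P(x_i | x_{i-1})."""
    logp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp


# Hypothetical parameters: every base and every transition equally likely.
UNIFORM_START = {b: 0.25 for b in "acgt"}
UNIFORM_TRANS = {a: {b: 0.25 for b in "acgt"} for a in "acgt"}

logp = log_prob_first_order("acgt", UNIFORM_START, UNIFORM_TRANS)
print(math.exp(logp))  # probability of the 4-base sequence under the toy model
```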

Application of Markov chains
Problem (from Durbin et al.): CpG islands
Given two sets of sequences from the human genome, one with CpG islands and one without, can you calculate a model that can predict the CpG islands?

Application of Markov chains (cntd)
Solution:

Application of Markov chains (cntd)
Histogram of scores
[Figure: histogram of scores; legend: CpG islands, other]
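One common way to approach the CpG problem, following the treatment in Durbin et al., is to estimate one first-order transition table from the CpG-island set and one from the other set, then score a new sequence by the log-odds ratio of the two models. The sketch below assumes that approach; the tiny training sequences are invented for illustration and are not real genomic data, and the pseudocount of 1 is an arbitrary smoothing choice.

```python
import math

ALPHABET = "acgt"


def estimate_transitions(seqs, pseudocount=1.0):
    """Estimate P(cur | prev) from dinucleotide counts with pseudocounts."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(seq, plus, minus):
    """Sum of log2 ratios of transition probabilities (island vs. other)."""
    return sum(
        math.log2(plus[prev][cur] / minus[prev][cur])
        for prev, cur in zip(seq, seq[1:])
    )


# Hypothetical toy training sets standing in for the two sequence sets.
cpg_set = ["cgcgcgcggcgc", "gcgcgccgcg"]
other_set = ["atatatttaa", "aattatatat"]

plus_model = estimate_transitions(cpg_set)
minus_model = estimate_transitions(other_set)

print(log_odds("cgcgcg", plus_model, minus_model))  # positive: island-like
print(log_odds("atatat", plus_model, minus_model))  # negative: not island-like
```

Plotting the length-normalised scores of the training sequences gives the kind of histogram referred to above, with the two sets ideally forming separate peaks.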

Hidden Markov Models
What is a Hidden Markov Model?
A Markov process in which the probability of an outcome also depends on a hidden random variable (state).
- Transition probability: the probability of reaching a state given the previous state.
- Emission probability: the probability of an outcome given the state.

Hidden Markov Models (cntd)
Graphical representation of the HMM: CpG islands
[Figure: CpG island HMM with transition probabilities]
Question: Where is the Markov process here?

Application of HMMs
Problem (from Durbin et al.): dishonest casino
Fair die: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded die: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair -> Fair 0.95, Fair -> Loaded 0.05, Loaded -> Loaded 0.9, Loaded -> Fair 0.1

Application of HMMs (cntd)
Problem (from Durbin et al.): dishonest casino
Given (1) the previous model and (2) a series of die rolls $x_i$, $i = 1, \ldots, L$, can we predict which of the rolls are coming from the fair and which from the loaded die?
Question: What is hidden here?

Application of HMMs (cntd)
Answer: YES
- Viterbi algorithm: best path
- Forward-backward algorithm: probability of state k in outcome $x_i$

HMMs: Viterbi algorithm
[Figure: Viterbi predictions for 300 rolls of the die]
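A compact sketch of Viterbi decoding for the dishonest casino follows, using the transition and emission values from the casino model (fair die uniform, loaded die with a six at probability 1/2, switching probabilities 0.05 and 0.1). The equal start probabilities are an assumption made here, since the slides do not state them, and the function name `viterbi` is illustrative.

```python
import math

STATES = "FL"  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}  # assumed: not given on the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable state path for a non-empty list of rolls."""
    # Log-probability of the best path ending in each state after roll 1.
    v = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for x in rolls[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            nxt[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][x])
        backpointers.append(ptr)
        v = nxt
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi([6] * 20))               # a run of sixes decodes as loaded
print(viterbi([1, 2, 3, 4, 5] * 4))    # no sixes at all decodes as fair
```

Viterbi returns a single best path; when the question is how confident the model is about one particular roll, the forward-backward posterior is the more appropriate quantity.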

HMMs in biology
General comments:
- Usually the structure of the model is unknown.
- The transition and emission probabilities are calculated based on trusted training sets and the postulated model.
- Predictions are based on the Viterbi or the forward-backward algorithm, depending on the question asked.

Information measures
Definitions:
Entropy:
$H(P) = E[-\log P] = -\sum_{i=1}^{n} p_i \log p_i$
Relative Entropy:
$H(P \,\|\, Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$
Mutual Information:
$M(X, Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)}$
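The three information measures translate directly into code. The sketch below uses base-2 logarithms, so results are in bits; the slides do not fix the base, so this is an assumption. Terms with zero probability are skipped, following the usual convention that 0 log 0 = 0. Function names are illustrative.

```python
import math


def entropy(p):
    """H(P) = -sum_i p_i log2 p_i."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def relative_entropy(p, q):
    """H(P||Q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """M(X,Y) from a joint probability table given as a list of rows."""
    px = [sum(row) for row in joint]          # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


print(entropy([0.25, 0.25, 0.25, 0.25]))                 # uniform over 4 outcomes
print(relative_entropy([0.5, 0.5], [0.25, 0.75]))        # positive when P differs from Q
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # fully dependent variables
```

Mutual information equals the relative entropy between the joint distribution and the product of its marginals, which is why it is zero exactly when X and Y are independent.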