Sequence Analysis BBSI 2006: ecture #χ+ Takis Benos 2006 Molecular Genetics 0 Cell s internal world
DNA - Chromosomes - Genes We cannot define it but we know it when we see it A loose definition: What is a gene? Gene is a DNA/RNA information unit that is able to perform a function in a cellular environment Central Dogma: rotein coding genes transcribed translated DNA RNA protein 2
Open Reading Frames ORFs aatagcgaat tttccaacga caaaagctaa atatcgcaaa aacctcagta aaaatcttgc 60 tggagctatt attgctaagt aacatttacc ccctgaagtt aatggatcaa tcaagagaga 20 tgtgggctgt aatgaatcgt cttattgaat taacaggttg gatcgttctt gtcgtttcag 80 tcattcttct tggcgtggcg agtcacattg acaactatca gccacctgaa cagagtgctt 240 cggtacaaca caagtaagct ctgcacttgt ggagcgacat gctgcccgtc cgggtgcatg 300 ttttcacttg tcggatatta aaccaggaat ttattatctt gttcgatgtt gtaataaa 358 Open Reading Frames ORFs aatagcgaat tttccaacga caaaagctaa atatcgcaaa aacctcagta aaaatcttgc 60 tggagctatt attgctaagt aacatttacc ccctgaagtt aatggatcaa tcaagagaga 20 tgtgggctgt aatgaatcgt cttattgaat taacaggttg gatcgttctt gtcgtttcag 80 M N R tcattcttct tggcgtggcg agtcacattg acaactatca gccacctgaa cagagtgctt 240 cggtacaaca caagtaagct ctgcacttgt ggagcgacat gctgcccgtc cgggtgcatg 300 ttttcacttg tcggatatta K stop aaccaggaat ttattatctt gttcgatgtt gtaataaa 358 MNRIETGWIVVVSVIGVASHIDNYQEQSASVQHK Gene s characteristics aatagcgaat tttccaacga caaaagctaa atatcgcaaa aacctcagta aaaatcttgc 60 tggagctatt attgctaagt aacatttacc ccctgaagtt aatggatcaa tcaagagaga 20 tgtgggctgt aatgaatcgt cttattgaat taacaggttg gatcgttctt gtcgtttcag 80 tcattcttct tggcgtggcg agtcacattg acaactatca gccacctgaa cagagtgctt 240 cggtacaaca caagtaagct ctgcacttgt ggagcgacat gctgcccgtc cgggtgcatg 300 ttttcacttg tcggatatta aaccaggaat ttattatctt gttcgatgtt gtaataaa 358 ATG TAA promoter 5 UTR ORF 3 UTR mrna 3
Transcription cugaaguu aauggaucaa ucaagagaga 20 ugugggcugu aaugaaucgu cuuauugaau uaacagguug gaucguucuu gucguuucag 80 ucauucuucu uggcguggcg agucacauug acaacuauca gccaccugaa cagagugcuu 240 cgguacaaca caaguaagcu cugcacuugu ggagcgacau gcugcccguc cgggugcaug 300 uuuucacuug ucggauauua aaccaggaau uuauuaucuu guucgauguu guaauaaa 358 CA cuga aauaaa aaaaaaa olya tail Translation cuga aauaaa aaaaaaa MNRIETGWIVVVSVIGVASHIDNYQEQSASVQHK rotein coding genes cntd Source: http://www.emc.maricopa.edu/faculty/farabee/biobk/biobookr OTSYn.html 4
Genes and roteins Alterations of the DNA Base substitutions: silent: no a.a. replacement missense: a.a. replacement non-sense: a.a. stop codon replacement Alterations of the DNA cntd UGU UGC silent UGA non-sense UGG missense Source: http://www.emc.maricopa.edu/faculty/farabee/biobk/biobookrotsyn.html 5
Inheritance - genetic differences Molecular evolution Two species will acquire mutations proportionally to their divergence time. However: all proteins do not change in the same pace a given protein does not necessarily change in the same pace throughout time different parts of the same protein change at different paces Molecular evolution cntd Human CA_HUMAN; 0508 vs. Drosophila CA_IG; 062 Query: MAKGRSVVKGYQTFSAREGGRRVTGEGAGISTRSRFNEISGDNGW 60 MA+G RSVVKG Q FSARE G RV TGEGA IST++RF+EISGDNGW+ Sbjct: MARGARSVVKGCQFSARECGHRVGTGEGACISTKTRFSEISGDNGWI 60 Query: 6 NYHFWRETGTHKVHHHVQNFQKYGIYREKGNVESVYVIDEDVAFKSEGNER 20 NY FW+E GT K+H HHVQNFQKYGIYREKGN+ESVY+IDEDVAFK EGNER Sbjct: 6 NYRFWKEKGTQKIHYHHVQNFQKYGIYREKGNESVYIIDEDVAFKFEGNER 20 Query: 2 FIWVAYHQYYQRIGVKKSAAWKKDRVANQEVMAEATKNFDAVSRDFVS 80 + IWVAYHQ+YQ++GVKKS AWKKDR+ N EVMAEA KNF+D VS+DFV Sbjct: 2 YNIWVAYHQHYQKVGVKKSGAWKKDRVNTEVMAEAIKNFIDTVSQDFVG 80 Query: 8 VHRRIKKAGSGNYSGDISDDFRFAFESITNVIFGERQGMEEVVNEAQRFIDAIYQM 240 VHRRIK+ GSG +SGDI +DFRFAFESITNVIFGER GMEE+V+EAQ+FIDA+YQM Sbjct: 8 VHRRIKQQGSGKFSGDIREDFRFAFESITNVIFGERGMEEIVDEAQKFIDAVYQM 240 Query: 24 FHTSVMNDFRFRTKTWKDHVAAWDVIFSKADIYTQNFYWERQKGSVHHDYRG 300 FHTSVMNDFRFRTKTW+DHVAAWD IF+KA+ YTQNFYW+R+K ++Y G Sbjct: 24 FHTSVMNDFRFRTKTWRDHVAAWDTIFNKAEKYTQNFYWDRRKRE-FNNYG 299 Query: 30 MYRGDSKMSFEDIKANVTEMAGGVDTTSMTQWHYEMARNKVQDMRAEVAAR 360 +YRG+ K+ ED+KANVTEMAGGVDTTSMTQWHYEMAR+ VQ+MR EV AR Sbjct: 300 IYRGNDKSEDVKANVTEMAGGVDTTSMTQWHYEMARSNVQEMREEVNAR 359 Query: 36 HQAQGDMATMQVKASIKETRHISVTQRYVNDVRDYMIAKTVQVAIY 420 QAQGD + MQVKASIKETRHISVTQRYVNDVRDYMIAKTVQVA+Y Sbjct: 360 RQAQGDTSKMQVKASIKETRHISVTQRYVNDVRDYMIAKTVQVAVY 49 Query: 42 AGRETFFFDENFDTRWSKDKNITYFRNGFGWGVRQCGRRIAEEMTIFINM 480 A+GR+ FF + FDTRW K++++ +FRNGFGWGVRQC+GRRIAEEMT+FI++ Sbjct: 420 AMGRDAFFSNGQFDTRWGKERDIHFRNGFGWGVRQCVGRRIAEEMTFIHI 479 Query: 48 ENFRVEIQHSDVGTTFNIMEKISFTFWFNQEATQ 520 ENF+VE+QH SDV T FNIM+KI F FNQ+ Q Sbjct: 480 ENFKVEQHFSDVDTIFNIMDKIFVFRFNQDQ 59 6
Molecular evolution cntd Human CA_HUMAN; 0508 vs. zebrafish Cypa; Q8JH93 Query: 34 TGEGAGISTRSRFNEISGDNGWNYHFWRETGTHKVHHHVQNFQKYGIYREK 93 T G + +FN+I N ++ F + G VH V NF+ +GIYREK+ Sbjct: 27 TRSGRAQNSTVQFNKIGRWRNSSVAFTKMGGRNVHRIMVHNFKTFGIYREKV 86 Query: 94 GNVESVYVIDEDVAFKSEGNERFIWVAYHQYYQRIGVKKSAAWKKDRVA 53 G +SVY+I ED A+FK+EG + R + W AY Y + GVK+ AWK DR+ Sbjct: 87 GIYDSVYIIKEDGAIFKAEGHHNRINVDAWTAYRDYRNQKYGVKEGKAWKTDRMI 46 Query: 54 NQEVMAEATKNFDAVSRDFVSVHRRIKKAGSGNYSGDISDDFRFAFESITNV 23 N+E++ + F+D V +DFV+ ++++I+++G ++ D++ DFRF+ ES++ V Sbjct: 47 NKEKQGTFVDEVGQDFVARVNKQIERSGQKQWTTDTHDFRFSESVSAV 206 Query: 24 IFGERQGMEEVVNEAQRFIDAIYQMFHTSVMNDFRFRTKTWKDHVAAWDVI 273 ++GER G+ + ++E Q FID + MF T+ M R + WK+HV AWD I Sbjct: 207 YGERGDNIDEFQHFIDCVSVMFKTTSMYGRSIGSNIWKNHVEAWDGI 266 Query: 274 FSKADIYTQNFYWERQKGSVHHDYRGMYRGDSKMSFEDIKANVTEMAGGVDTTSM 333 F++AD QN + + ++ + Y G+ K+S EDIKA+VTE++AGGVD+ + Sbjct: 267 FNQADRCIQNIFKQWKENEGNGKYGVAIMQDKSIEDIKASVTEMAGGVDSVTF 326 Query: 334 TQWHYEMARNKVQDMRAEVAARHQAQGDMATMQVKASIKETRHISVT 393 T W YE+AR +QD RAE+ AAR +GDM M++++KA++KETRH++++ Sbjct: 327 TWTYEARQDQDERAEISAARIAFKGDMVQMVKMIKAAKETRHVAMS 386 Query: 394 QRYVNDVRDYMIAKTVQVAIYAGRETFFFDENFDTRWSKDKNITYFRN 453 RY+ D V+++Y IA TVQ+ +YA+GR+ FF E + +RW+S ++ YF++ Sbjct: 387 RYITEDTVIQNYHIAGTVQGVYAMGRDHQFFKEQYCSRWISSNRQ--YFKS 444 Query: 454 GFGWGVRQCGRRIAEEMTIFINMENFRVEIQHSDVGTTFNIMEKISFTFW 53 GFG+G RQCGRRIAE EM IFI+MENFR+E Q +V + F +MEKI T Sbjct: 445 GFGFGRQCGRRIAETEMQIFIHMENFRIEKQKQIEVRSKFEMEKIITIK 504 Query: 54 FN 55 N Sbjct: 505 N 506 Transcription regulation promoter region expression rates degradation post modifications Splicing genomic DNA intron- intron-2 intron-3 mrna precursor mrna 7
trna ribosomal RNA snorna microrna etc Non-coding genes Source: http://www.emc.maricopa.edu/faculty/farabee/biobk/biobookrotsyn.html Elements of robability Theory with examples Outline Conditional robabilities Markov Chains Hidden Markov Models Information measures 8
robabilities Definition: x # favourable outcomes x total # possible outcomes Conditional robabilities Definition: xa # favourable outcomes x given A total # possible outcomes given A Conditional robabilities cntd Example: outcome x: tomorrow I ll go hiking event A: tomorrow will be sunny x xa A NOT x 9
Conditional robabilities cntd Joint probability:,y Y Y If Y then,y independent,y Y Marginal probability:,y YY Y Y Conditional robabilities cntd Bayes theorem Y Y Y Y Y! x osterior probabilities are the compromise between data and prior information. Bayes: Application- roblem from Durbin et al., 998: A rare genetic disease is discovered with population frequency one in million. An extremely good genetic test is 00% sensitive always correct if you have the disease and 99.99% specific false positive rate 0.0%. Will you be willing to take such a test? Hint: What is the probability that you have the disease, if the test is positive? 0
Bayes: Application- cntd Answer: D + + D D / +.0 * 0-6 / [.0 * 0-6 + 0-4 * - 0-6 ] 0.0099 roblem: Application of Bayes-2 Given a set of transmembrane proteins with specified membrane domains of length training set, can you develop a probabilistic model that predicts which parts of a new transmembrane protein are likely to be membrane domains? Application of Bayes-2 cntd Solution: Suppose that we suspect that the amino acid frequencies differ between membrane and nonmembrane regions. Using the training set, calculate the probabilities, a i D, that each amino acid a i is part of a membrane domain D. Also, using the nonmembrane parts, calculate the corresponding probabilities, a i not D.
2 Application of Bayes-2 cntd Solution cntd: Divide the new protein into segments. Using Bayes theorem, calculate the posterior probability of each segment being a membrane domain using the a i D.! d M d d M D D M M M arg max : ; Markov chains What is a Markov chain? Markov chain of order n is a stochastic process of a series of outcomes, in which the probability of outcome x depends on the state of the previous n outcomes. Markov chains cntd Markov chain of first order: Transition probabilities: i i-! i i i x 2 2 2 2......,...,,...,,...,,
Application of Markov chains roblem from Durbin et al.: CpG islands Given two sets of sequences from the human genome, one with CpG islands and one without, can you calculate a model that can predict the CpG islands? Application of Markov chains cntd Solution: Application of Markov chains cntd Histogram of scores CpG islands: CpG other 3
Hidden Markov Models What is a Hidden Markov Model? A Markov process in which the probability of an outcome depends also in a hidden random variable state. Transition probability: the probability of reaching a state given the previous state. Emission probability: the probability of an outcome given the state. Hidden Markov Models cntd Graphical representation of the HMM: CpG islands transition probabilities Question: Where is the Markov process here? Application of HMMs roblem from Durbin et al.: dishonest casino 0.95 0.9 Fair : /6 2: /6 : /0 2: /0 0.05 3: /6 4: /6 5: /6 6: /6 0. 3: /0 4: /0 5: /0 6: /2 oaded 4
Application of HMMs cntd roblem from Durbin et al.: dishonest casino Given the previous model and 2 a series of die rolls x i, i,,, can we predict which of the rolls are coming from the fair and which from the loaded die? Question: What is hidden here? Application of HMMs cntd Answer: YES Viterbi algorithm best path Forward-backward algorithm probability of state k in outcome x i HMMs: Viterbi algorithm Viterbi predictions: 300 rolls of die 5
HMMs in biology General comments: Usually the structure of the model is unknown The transition and emission probabilities are calculated based on trusted training sets and the postulated model redictions are based on the Viterbi or the forward-backward algorithm, depending on the question asked Definitions: Information measures Entropy: E#log # $ p i log p i n Relative Entropy:, Q! i n i pi pi log q M,Y x Mutual Information: i,y j log x, y i j i, j i x i y j 6