2005 Fall Workshop on Information Theory and Communications: Bioinformatics from a Perspective of Communication Science


1 2005 Fall Workshop on Information Theory and Communications Bioinformatics from a Perspective of Communication Science Chung-Chin Lu Department of Electrical Engineering National Tsing Hua University cclu@ee.nthu.edu.tw July 22, 2005

2 Biological Background : The Central Dogma 1

3 Typical Animal Cell From Neil Campbell, Jane Reece, and Larry Mitchell, Biology, 5th ed. (Menlo Park, CA: Addison Wesley Longman, 1999) © Addison Wesley Longman, Inc.

4 Places for Genetic Information Processing in Cell
Nucleus: storage and transportation of genetic information
  Storage: deoxyribonucleic acid (DNA)
  Transportation: ribonucleic acid (RNA)
Ribosomes: factories for protein synthesis with the blueprint in RNA
Mitochondria: energy production, with their own circular, double-stranded DNA

5 Double-stranded DNA From J. D. Watson, N. H. Hopkins, J. W. Roberts, J. A. Steitz, and A. M. Weiner, Molecular Biology of the Gene, 4th ed. (Redwood City, CA: Benjamin/Cummings Publishing Co., 1987). © 1987 James D. Watson.

6 Chemical Structures of RNA and DNA From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

7 Repeating Units of DNA and RNA
Three subunits in a repeating unit (a ribonucleotide) of RNA:
  A phosphate group
  A ribose
  A base
Three subunits in a repeating unit (a deoxyribonucleotide) of DNA:
  A phosphate group
  A 2'-deoxyribose
  A base

8 Nucleic Acid Bases and Pairing of Bases in DNA
Five nucleic acid bases
  Two purines
    Adenine (A) in DNA and RNA
    Guanine (G) in DNA and RNA
  Three pyrimidines
    Cytosine (C) in DNA and RNA
    Thymine (T) in DNA and Uracil (U) in RNA
Pairing of bases in DNA
  Adenine (A) with Thymine (T)
  Guanine (G) with Cytosine (C)

9 The Bases of DNA From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

10 The Flow of Genetic Information in Cell From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

11 DNA Replication From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

12 Transcription From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

13 Relationships of DNA to mRNA to Polypeptide From Blanchetot, Nature (1983) 301: © 1983 Macmillan Magazines, Ltd.

14 Table of mRNA Codons and Corresponding Amino Acids

First    Second letter                                       Third
letter   U           C           A           G               letter
U        UUU Phe     UCU Ser     UAU Tyr     UGU Cys         U
         UUC Phe     UCC Ser     UAC Tyr     UGC Cys         C
         UUA Leu     UCA Ser     UAA Stop    UGA Stop        A
         UUG Leu     UCG Ser     UAG Stop    UGG Trp         G
C        CUU Leu     CCU Pro     CAU His     CGU Arg         U
         CUC Leu     CCC Pro     CAC His     CGC Arg         C
         CUA Leu     CCA Pro     CAA Gln     CGA Arg         A
         CUG Leu     CCG Pro     CAG Gln     CGG Arg         G
A        AUU Ile     ACU Thr     AAU Asn     AGU Ser         U
         AUC Ile     ACC Thr     AAC Asn     AGC Ser         C
         AUA Ile     ACA Thr     AAA Lys     AGA Arg         A
         AUG Met     ACG Thr     AAG Lys     AGG Arg         G
G        GUU Val     GCU Ala     GAU Asp     GGU Gly         U
         GUC Val     GCC Ala     GAC Asp     GGC Gly         C
         GUA Val     GCA Ala     GAA Glu     GGA Gly         A
         GUG Val     GCG Ala     GAG Glu     GGG Gly         G

15 Protein as a Polymer
A protein is a chain of amino acids.
There are 20 different kinds of amino acids.
Each amino acid is coded as a 3-tuple of nucleotides, called a codon.
There are 64 codons but only 20 amino acids, so most amino acids are represented by more than one codon.
Start codon: AUG, which also encodes the amino acid methionine.
Stop codons: UAA, UAG and UGA, which do not encode any amino acid.
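The codon rules on this slide can be illustrated with a short sketch. It builds the standard 64-entry codon table and reads an mRNA string from the first AUG until a stop codon. The function name `translate` and the one-letter amino acid codes are illustrative choices, not part of the slides, and real translation initiation is more selective than "first AUG".

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the standard genetic code, listed with the
# first base varying slowest and the third base fastest, each in U, C, A, G order.
# "*" marks the three stop codons.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGUUUGGCUAAUU"))  # AUG UUU GGC UAA -> "MFG"
```

The table confirms the counting argument: 64 codons, 3 stops, and 61 codons shared among 20 amino acids.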

16 Translation of an RNA Message into a Protein From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

17 The Central Dogma: Transcription and Translation

18 Messenger RNA (mRNA)
Pre-messenger RNA (pre-mRNA): primary RNA transcript
  Exons: concatenated to form a coding sequence for the synthesis of protein
  Introns: lie between exons; their functions are not clear
  Poly-A signal, cleavage site and downstream element
Splicing: deletion of introns
Messenger RNA (mRNA): RNA after splicing
  A cap and a 5' untranslated region (5' UTR)
  Coding sequence (CDS) for protein synthesis
  3' untranslated region (3' UTR) and a poly-A tail sequence

19 Biological Data are Digital!!
Biological digital information modulates macromolecules made of repeating units.
Analogy: digital information modulates sinusoidal waves.

20 Digital Modulation on Macromolecules of Repeating Units

21 From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

22 Gene Structure Prediction and The Decoding Problem Aim To predict the structure of a gene based on the DNA sequence. Formalism To decode a DNA sequence (a sequence of A, T, G, C) to a sequence of exons and introns (a sequence of E and I). 21

23 Signal Sensors for Gene Structure Prediction 22

24 Signal Sensors
To determine the transcriptional beginning of a gene, called the transcription start site (TSS), and the upstream regulatory region, called the promoter.
  Transcription start site (TSS).
  Elements of the promoter: TATA box, CCAAT box, GC box, etc.
To determine the precise exon-intron boundaries, called splice sites, in the coding region, as a crucial part of gene structure prediction.
  Donor site: the 5' splice site of an intron.
  Acceptor site: the 3' splice site of an intron.
To determine the transcriptional termination of a gene.

25   PolyA signal, cleavage site, downstream element (DSE).

26 Transcription From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) © 2000 Addison Wesley Longman, Inc.

27 Transcription Start Site (TSS) Signals
Recognized by the basal RNA polymerase II transcriptional machinery.
Encompassing three main core promoter elements, the TATA box, the initiator (Inr), and the downstream promoter element (DPE), which are generally located within -60 to +50 of the transcription start site.

28 The Basal RNA Polymerase II Transcriptional Machinery 27

29 Consensus in Transcription Start Site Signals
TATA box has a strong consensus:
  T A T A A A at positions $T_{+1}, T_{+2}, T_{+3}, T_{+4}, T_{+5}, T_{+6}$.
  Located about 25 to 30 nt upstream of the TSS.
Inr element has a loose consensus:
  Py Py A N T/A Py Py at positions $A_{-2}, A_{-1}, A_{+1}, A_{+2}, A_{+3}, A_{+4}, A_{+5}$.
  Encompassing the TSS.
DPE has no consensus and is not yet well characterized.
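As a rough illustration of what a consensus means operationally, the two patterns above can be written as regular expressions over the DNA alphabet. This sketch assumes Py stands for a pyrimidine (C or T) and N for any base; the names `TATA`, `INR` and `find_motif` are hypothetical. An exact-consensus scan like this is far cruder than the statistical models discussed later in the talk.

```python
import re

# Lookahead groups let overlapping occurrences be reported as well.
TATA = re.compile(r"(?=(TATAAA))")                    # strong consensus
INR = re.compile(r"(?=([CT][CT]A[ACGT][AT][CT][CT]))")  # Py Py A N T/A Py Py


def find_motif(dna: str, pattern: re.Pattern) -> list:
    """Return (0-based start index, matched text) for every occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(dna)]


print(find_motif("CCTATAAAGG", TATA))   # [(2, 'TATAAA')]
print(find_motif("GGTCAGTCTGG", INR))   # [(2, 'TCAGTCT')]
```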

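30 The Nucleotide Distribution in the TATA Signal
[Figure: percentage of A, T, C and G at each position of the TATA signal]

31 The Nucleotide Distribution in the TSS Signal
[Figure: percentage of A, T, C and G at each position of the TSS signal]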

32 Pre-mRNA Splicing - Spliceosome Cycle After Sharp, Cell (1994) 77:811. © 1994 Cell Press/The Nobel Foundation.

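33 Consensus in Splice Signals
Donor sites: exon A G / G U AUGU intron, with positions $D_{-2}\, D_{-1}$ / $D_{+1}\, D_{+2}$ around the boundary.
Acceptor sites: intron (C/T) N N A G / G exon, with positions $A_{-2}\, A_{-1}$ / $A_{+1}$ around the boundary.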

34 Conservation in Splice Signals
Donor sites: only GU, in the $D_{+1} D_{+2}$ positions, is conserved in more than 98% of donor sites.
Acceptor sites: only AG, in the $A_{-2} A_{-1}$ positions, is conserved in more than 98% of acceptor sites.
The spliceosome, which carries out splicing, recognizes the splice sites.
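Because only the two dinucleotides are nearly invariant, a first-pass scan can do no more than list candidates. The sketch below does this at the DNA level, where GU appears as GT; the function name `candidate_splice_sites` is hypothetical. Most positions it returns are pseudo sites, which is why the statistical signal models that follow are needed.

```python
def candidate_splice_sites(dna: str):
    """List 0-based indices of GT (donor D+1) and AG (acceptor A-2) dinucleotides."""
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("AAGGTAAGTTTCAGGC")
print(donors)     # [3, 7]
print(acceptors)  # [1, 6, 12]
```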

35 Polyadenylation Mechanism
There are three major steps:
  Recognition of the authentic signals of polyadenylation at the 3' end of a pre-mRNA,
  Cleavage of the pre-mRNA,
  Addition of up to 250 adenosine residues (the poly(A) tail).

36 Precleavage Complex 35

37 Proteins Involved
CPSF (blue): Cleavage and polyadenylation specificity factor; binds to the AAUAAA motif and interacts with PAP and CstF.
CstF (brown): Cleavage stimulation factor; binds to the GT/T-rich element (at the DNA level).
CF I and II (gray): Cleavage factors I and II are required for cleavage.
RNA polymerase II CTD (carboxyl-terminal domain): stimulates the cleavage reaction.
PAP (orange): Poly(A) polymerase; initiates poly(A) synthesis, yielding an oligo(A) at least 10 nt long.
PAB II (yellow): Poly(A)-binding protein II; required for the elongation of poly(A) in mammals.

38 Model for Polyadenylation 37


40 Authentic Signals of Polyadenylation
There are two major signals:
PolyA signal (PAS)
  Located upstream of the cleavage/polyadenylation site.
  A highly conserved hexamer AAUAAA (and the common variant AUUAAA).
  Recognized by the cleavage and polyadenylation specificity factor (CPSF).
Downstream element (DE)
  Located downstream of the cleavage/polyadenylation site,

41   consisting of a much less well-characterized U-rich or G-U-rich sequence.
  Recognized by the cleavage stimulation factor (CstF).
Cleavage then occurs between these two signals, as directed by two cleavage factors, CF Im and CF IIm.

42 Polyadenylation Site, Window = [-200, +206]
[Figure: proportion of A, U, G, C, G+C and G+U at each position in the window]

43 Biological data are Noisy!! 42

44 Variation in Nucleotide Positions of a Signal
Variation comes from the evolution of organisms.
Transcriptional and post-transcriptional factors can still recognize signals.
Inter-relation between nucleotide positions of a signal induces the recognition process by DNA-protein interaction.

45 Hidden Encoding and Channel Models Are there hidden legitimate codeword(s) for representing a signal? What is the codeword length for that signal? What is the stochastic mechanism (the channel) between the hidden legitimate codeword(s) and the observed diversified DNA segments for a signal? 44

46 Inter-dependency among Base Positions in Splice Signals
Not sufficiently addressed by previous models such as
  Weight matrix model (WMM) (Standen, 1984)
  Weight array model (WAM) (Zhang and Marr, 1993)
  Maximal dependence decomposition (MDD) (Burge and Karlin, 1997)
  Tree model (Cai et al., 2000)
Potential models?
  Higher-order Markov chains
  Dependency graphs
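To make the limitation concrete, here is a minimal sketch of a weight matrix model, which treats every position as independent of the others. It assumes the add-one form of Laplace's rule (the result slides mention models "with Laplace's rule" but do not spell out the formula); `train_wmm` and `log_prob` are hypothetical names.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_wmm(aligned_seqs):
    """Estimate one base distribution per position from aligned signal sequences."""
    n = len(aligned_seqs)
    width = len(aligned_seqs[0])
    model = []
    for pos in range(width):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        # Laplace's rule (assumed add-one form): no base gets probability zero.
        model.append({b: (counts[b] + 1) / (n + len(ALPHABET)) for b in ALPHABET})
    return model


def log_prob(model, seq):
    """Log-probability of seq; positions contribute independently (zero-order model)."""
    return sum(math.log(column[base]) for column, base in zip(model, seq))


model = train_wmm(["TATA", "TATA", "TAAA"])
print(log_prob(model, "TATA") > log_prob(model, "CCCC"))  # True
```

Since the score is a plain sum over positions, any correlation between two positions is invisible to this model, which is the gap the dependency tests below are meant to expose.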

47 Test of Dependency and Chi-square Statistics
Question: How to find the dependency (strength of inter-relation) between the positions in a splice signal?

Table 1: A contingency table for signals in DNA sequence.

s_i \ s_j   A        T        C        G        Total
A           Y_{11}   Y_{12}   Y_{13}   Y_{14}   Y_{1c}
T           Y_{21}   Y_{22}   Y_{23}   Y_{24}   Y_{2c}
C           Y_{31}   Y_{32}   Y_{33}   Y_{34}   Y_{3c}
G           Y_{41}   Y_{42}   Y_{43}   Y_{44}   Y_{4c}
Total       Y_{r1}   Y_{r2}   Y_{r3}   Y_{r4}   Y

48 Test of Dependency and Chi-square Statistics
Chi-square test statistic:
$$\chi^2(X_i, X_j) = \sum_{m=1}^{4} \sum_{n=1}^{4} \frac{(Y_{mn} - E_{mn})^2}{E_{mn}}, \qquad \text{where } E_{mn} = \frac{Y_{mc}\, Y_{rn}}{Y}.$$
$$P(\text{null hypothesis is rejected when it is true}) = P(\chi^2(X_i, X_j) \geq K \mid \text{null hypothesis}) = \alpha,$$
where $\alpha$ is the Type I error probability of the test.
If $\chi^2(X_i, X_j)$ is greater than the critical point $K$, the two positions are said to have strong dependency.
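The statistic can be computed directly from a set of aligned signal sequences. In this sketch the row totals, column totals and grand total play the roles of $Y_{mc}$, $Y_{rn}$ and $Y$; skipping cells whose expected count is zero is an assumption made here to avoid division by zero, and the function name `chi_square` is hypothetical.

```python
from collections import Counter


def chi_square(seqs, i, j, alphabet="ATCG"):
    """Chi-square statistic between positions i and j of aligned sequences."""
    pair = Counter((s[i], s[j]) for s in seqs)   # Y_mn
    row = Counter(s[i] for s in seqs)            # Y_mc (row totals)
    col = Counter(s[j] for s in seqs)            # Y_rn (column totals)
    total = len(seqs)                            # Y
    stat = 0.0
    for m in alphabet:
        for n in alphabet:
            expected = row[m] * col[n] / total   # E_mn
            if expected > 0:                     # assumption: skip empty cells
                stat += (pair[(m, n)] - expected) ** 2 / expected
    return stat


print(chi_square(["AA", "AA", "TT", "TT"], 0, 1))  # 4.0, fully dependent
print(chi_square(["AA", "AT", "TA", "TT"], 0, 1))  # 0.0, independent
```

Comparing the returned value with the critical point $K$ for the chosen $\alpha$ decides whether an edge between the two positions enters the dependency graph.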

49 The Chi-square Statistics for the TATA Box Signal
[Table: chi-square statistics for each pair of positions (i, j), with positions $X_{-5}$ through $X_{+8}$ on both axes]

50 Dependency Graph for the TATA Box Signal

51 Dependency Graph for the TSS Signal 50

52 Dependency Graph for Donor Site

53 Difficulty for Statistical Reasoning with Dependency Graphs There are always cycles in a dependency graph. 52

54 A Remedy Expanding a dependency graph into a Bayesian network. 53

55 Expanded Bayesian Network for the TATA Box Signal 54


57 Expanded Bayesian Network for Donor Site
[Figure: layered network with root $D_{-2}^{(0)}$ at layer 0 and copies of the donor-site positions $D_{-9}, \ldots, D_{+9}$ at layers 1 through 8]

58 Datasets for TATA Model Training
862 human non-redundant and experimentally verified TATA box sequences were extracted from NCBI to form a true dataset.
Pseudo-TATA signals were retrieved from the exon and intron regions of the 862 genes to form a false dataset.

59 Datasets for TSS Model Training
1430 human non-redundant promoter sequences were extracted from EPD78 and aligned against the human genome with BLAST to form a true dataset.
False TSS signal sequences were obtained at random from the exon and intron regions of the 1430 genes.

60 Datasets for Splice Site Model Training
We extract a collection of real and pseudo splice sites from a set of 462 annotated multiple-exon human genes.
Table 2: Number of genes, true and pseudo splice sites in the dataset (columns: Genes; true donor and acceptor sites; false donor and acceptor sites).
We exclude splice sites that contain base positions labelled with symbols other than A, T, C, G.

61 Datasets for PolyA Signal Model Training
2923 polyA signal sequences with AAUAAA were retrieved from GenBank to form a true dataset.
Pseudo-polyA signals with AAUAAA were retrieved from the exon and intron regions of the 2923 genes to form a false dataset.

62 Five-fold Cross-Validation
The models are cross-validated by randomly partitioning the dataset into five subsets. We then test each subset (the testing data) with the parameters trained on the other four subsets (the training data) under the splice site models, and take the average of the five predictive accuracy measures corresponding to the five testing/training data pairs. In the same manner, we also evaluate the training data with the model trained on them.
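The procedure can be sketched in a few lines. The functions `five_fold_splits` and `cross_validate`, the fixed random seed, and the callback signatures `train_fn(train)` and `eval_fn(model, test)` are all assumptions made for illustration; the slides do not describe how the partition was implemented.

```python
import random


def five_fold_splits(items, seed=0, k=5):
    """Randomly partition items into k subsets and yield (training, testing) pairs."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for t in range(k):
        testing = folds[t]
        training = [x for idx, fold in enumerate(folds) if idx != t for x in fold]
        yield training, testing


def cross_validate(items, train_fn, eval_fn, seed=0):
    """Average an accuracy measure over the five testing/training pairs."""
    scores = [eval_fn(train_fn(training), testing)
              for training, testing in five_fold_splits(items, seed)]
    return sum(scores) / len(scores)


for training, testing in five_fold_splits(range(10)):
    print(len(training), len(testing))  # 8 2 on each of the five lines
```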

63 Measures for Predictive Accuracy
Actual positive (AP), actual negative (AN), predicted positive (PP), predicted negative (PN):

        AP    AN
PP      TP    FP
PN      FN    TN

$AP = TP + FN$, $AN = FP + TN$, $PP = TP + FP$, $PN = FN + TN$.
False negative rate: $$FN_{rate} = \frac{\#FN}{\#TP + \#FN}.$$
False positive rate: $$FP_{rate} = \frac{\#FP}{\#TN + \#FP}.$$

64 Measures for Predictive Accuracy (Cont.)
Sensitivity: $$\text{Sensitivity} = 1 - \text{false negative rate} = \frac{\#TP}{\#TP + \#FN} = \frac{\#TP}{\#AP}.$$
Specificity: $$\text{Specificity} = 1 - \text{false positive rate} = \frac{\#TN}{\#TN + \#FP} = \frac{\#TN}{\#AN}.$$
Positive predictive value (PPV): $$PPV = \frac{\#TP}{\#TP + \#FP} = \frac{\#TP}{\#PP}.$$
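These definitions translate directly into code. The counts in the example call are made up for illustration and are not results from the talk; `accuracy_measures` is a hypothetical helper name.

```python
def accuracy_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the rates defined on the two slides above from the four counts."""
    return {
        "fn_rate": fn / (tp + fn),
        "fp_rate": fp / (tn + fp),
        "sensitivity": tp / (tp + fn),   # = 1 - fn_rate
        "specificity": tn / (tn + fp),   # = 1 - fp_rate
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts, for illustration only.
m = accuracy_measures(tp=80, fp=10, fn=20, tn=90)
print(m["sensitivity"], m["specificity"])  # 0.8 0.9
```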

65 Results and Comparison 64

66 TATA Signal
[Plot: TATA (Testing Data), Dependency Graph Models; False Positive Rate (%) versus False Negative Rate (%)]
Figure 1: Comparison of prediction accuracy of 6 dependency graph models for the training data of TATA box corresponding to 6 different windows.

67 [Plot: TATA (Training Data), Dependency Graph Models; False Positive Rate (%) versus False Negative Rate (%)]
Figure 2: Comparison of prediction accuracy of 6 dependency graph models for the testing data of TATA box corresponding to 6 different windows.

68 TSS Site
[Plot: TSS (Training Data), Dependency Graph Models; False Positive Rate (%) versus False Negative Rate (%)]
Figure 3: Comparison of prediction accuracy of 6 dependency graph models for the training data of TSS corresponding to 6 different windows with the same right edge.

69 [Plot: TSS (Testing Data), Dependency Graph Models; False Positive Rate (%) versus False Negative Rate (%)]
Figure 4: Comparison of prediction accuracy of 6 dependency graph models for the testing data of TSS corresponding to 6 different windows with the same right edge.

70 [Plot: TSS (Training Data), Dependency Graph Models; False Positive Rate (%) versus False Negative Rate (%)]
Figure 5: Comparison of prediction accuracy of 6 dependency graph models for the training data of TSS corresponding to 6 different windows with the same left edge.

71 [Plot: TSS (Testing Data), Dependency Graph Models; False Positive Rate (%) versus False Negative Rate (%)]
Figure 6: Comparison of prediction accuracy of 6 dependency graph models for the testing data of TSS corresponding to 6 different windows with the same left edge.

72 Donor Site
[Plot: Donor site (Training Data); False Positive Rate (%) versus False Negative Rate (%)]
Models compared, all with Laplace's rule:
  Zero-order Markov chain model (WMM), window = [-6, +9]
  1st-order Markov chain model (WAM), window = [-9, +9]
  2nd-order Markov chain model, window = [-9, +9]
  3rd-order Markov chain model, window = [-3, +7]
  MDD, window = [-6, +15]
  Cai tree model, window = [-9, +9]
  EBN model (at most 1 parent), window = [-9, +9], $\alpha = 10^{-8}$
  EBN model (at most 2 parents), window = [-9, +9], $\alpha = 10^{-8}$
  EBN model (at most 3 parents), window = [-6, +15], $\alpha = 10^{-1}$
Figure 7: Comparison of predictive accuracy for the training data of the donor site under the WMM, WAM, MDD, Cai's tree, 2nd-order Markov chain, 3rd-order Markov chain, and expanded Bayesian network (with at most 1, 2 and 3 parents) prediction models.

73 [Plot: Donor site (Testing Data); False Positive Rate (%) versus False Negative Rate (%); same models, windows and settings as for the donor site training data]
Figure 8: Comparison of predictive accuracy for the testing data of the donor site under the WMM, WAM, MDD, Cai's tree, 2nd-order Markov chain, 3rd-order Markov chain, and expanded Bayesian network (with at most 1, 2 and 3 parents) prediction models.

74 Acceptor Site
[Plot: Acceptor site (Training Data); False Positive Rate (%) versus False Negative Rate (%)]
Models compared:
  Zero-order Markov chain model (WMM), window = [-27, +3], without Laplace's rule
  1st-order Markov chain model (WAM), window = [-27, +9], without Laplace's rule
  2nd-order Markov chain model, window = [-27, +9], with Laplace's rule
  3rd-order Markov chain model, window = [-27, +3], with Laplace's rule
  MDD, window = [-27, +9], with Laplace's rule
  Cai tree model, window = [-27, +9], without Laplace's rule
  EBN model (at most 1 parent), window = [-27, +3], $\alpha = 10^{-8}$, without Laplace's rule
  EBN model (at most 2 parents), window = [-27, +9], $\alpha = 10^{-3}$, without Laplace's rule
  EBN model (at most 3 parents), window = [-27, +3], $\alpha = 10^{-3}$, with Laplace's rule
Figure 9: Comparison of predictive accuracy for the training data of the acceptor site under the WMM, WAM, MDD, Cai's tree, 2nd-order Markov chain, 3rd-order Markov chain, and expanded Bayesian network (with at most 1, 2 and 3 parents) prediction models.

75 [Plot: Acceptor site (Testing Data); False Positive Rate (%) versus False Negative Rate (%); same models, windows and settings as for the acceptor site training data]
Figure 10: Comparison of predictive accuracy for the testing data of the acceptor site under the WMM, WAM, MDD, Cai's tree, 2nd-order Markov chain, 3rd-order Markov chain, and expanded Bayesian network (with at most 1, 2 and 3 parents) prediction models.

76 PolyA Signal
[Plot: PAS, Training (5-fold cross-validation), Window = [-90, +96]; curves for DG, WMM, WAM and SMC; FP rate (%) versus FN rate (%) in training data]
Figure 11: Comparing accuracy between different methods using training data (in region [-90, +96]).

77 [Plot: PAS, Testing (5-fold cross-validation), Window = [-90, +96]; curves for DG, WMM, WAM, SMC, ERPIN and POLYAH; FP rate (%) versus FN rate (%) in testing data]
Figure 12: Comparing accuracy between different methods using testing data (in region [-90, +96]).

78 Discussion Dependency graph models with their expanded Bayesian networks seem to be sufficient to address the intrinsic cyclic inter-dependency between base positions in a biological signal site. 77

79 Content Sensors for Gene Structure Prediction 78

80 Content Sensors
Exon sensor: to detect the coding regions, i.e., the regions where exons reside.
  We have constructed a dependency graph with a window size of 9 nucleotides and then created an expanded Bayesian network for the exon sensor.
Intron sensor: to detect the non-coding regions, i.e., the regions where introns reside.
  We have also constructed a dependency graph with a window size of 9 nucleotides and then created an expanded Bayesian network for the intron sensor.

81 Gene Structure Prediction by Stochastic Grammar
There is a stochastic grammar (encoding process) that describes the alternating exon and intron structure of a gene.
A state diagram and the corresponding trellis diagram with proper states are created to represent the stochastic grammar (encoding process).
Each state is created by a dependency graph with its expanded Bayesian network.
The Viterbi algorithm (a dynamic programming algorithm) is used to decode a path on the trellis diagram to determine the exon-intron structure of a gene.
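The decoding step can be illustrated with a deliberately simplified sketch: a two-state trellis (E for exon, I for intron) with single-nucleotide emissions and made-up probabilities. The model in the talk uses many more states, each scored by an expanded Bayesian network, so everything below except the Viterbi recursion itself is a toy assumption.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for obs (log-space dynamic programming)."""
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit[s][symbol]
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


def logs(table):
    return {k: math.log(v) for k, v in table.items()}


# Toy parameters (assumed): exons GC-rich, introns AT-rich, states tend to persist.
STATES = "EI"
LOG_START = logs({"E": 0.5, "I": 0.5})
LOG_TRANS = {"E": logs({"E": 0.9, "I": 0.1}), "I": logs({"E": 0.1, "I": 0.9})}
LOG_EMIT = {"E": logs({"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}),
            "I": logs({"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1})}

print(viterbi("GCGCGCATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT))
# EEEEEEIIIIII
```

The output is exactly the "sequence of E and I" that the decoding formalism asks for, one label per nucleotide.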


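83 State Diagram
[Figure: state diagram linking the exon states $E_0$, $E_1$, $E_2$, the intron state $I$, the acceptor-site states $A_{-27}, \ldots, A_{-2}, A_{-1}, A_{+1}, \ldots, A_{+9}$, and the donor-site states $D_{-3}, D_{-2}, D_{-1}, D_{+1}, D_{+2}, \ldots, D_{+15}$]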

84 The trellis diagram 83

85 [Figure: trellis diagram; each section contains the acceptor-site states $A_{-27}, \ldots, A_{+9}$, the exon states E, the donor-site states $D_{-3}, \ldots, D_{+15}$, and the intron state I]

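86 Maximum a Posteriori (MAP) Decoding
$\Gamma = \gamma_1 \gamma_2 \cdots \gamma_N$: state sequence
$S = s_1 s_2 \cdots s_N$: DNA sequence
$$\mathcal{S} = \begin{pmatrix} s^{(1)}_1 & s^{(1)}_2 & \cdots & s^{(1)}_N \\ s^{(2)}_1 & s^{(2)}_2 & \cdots & s^{(2)}_N \\ \vdots & \vdots & & \vdots \\ s^{(k)}_1 & s^{(k)}_2 & \cdots & s^{(k)}_N \end{pmatrix}$$

87 Maximum a Posteriori (MAP) Decoding
$$\hat{\Gamma} = \arg\max_{\Gamma} P(\Gamma \mid \mathcal{S}) = \arg\max_{\Gamma} \prod_{i=1}^{N} P(\gamma_i \mid \gamma_1, \ldots, \gamma_{i-1}, \mathcal{S}).$$
Assumptions:
(1) $$P(\gamma_i \mid \gamma_1, \ldots, \gamma_{i-1}, \mathcal{S}) = P(\gamma_i \mid \mathcal{S}) = P(\gamma_i \mid \mathcal{S}(i, \gamma_i)) = \frac{P(\gamma_i)\, P(\mathcal{S}(i, \gamma_i) \mid \gamma_i)}{P(\mathcal{S}(i, \gamma_i))}.$$
(2) $$P(\mathcal{S}(i, \gamma_i)) = \prod_{l} P(s^{(j)}_l), \qquad s^{(j)}_l \in \mathcal{S}(i, \gamma_i).$$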

88 Viterbi Algorithm - The Best Path
[Figure: trellis over successive sequence positions with states $A_{+9}$, $E_0$, $E_1$, $E_2$, $D_{-3}$, $D_{-2}$, $D_{-1}$, showing the best path]

89 Measures for Predictive Accuracy
Sensitivity: $$\text{Sensitivity} = 1 - \text{false negative rate} = \frac{\#TP}{\#TP + \#FN} = \frac{\#TP}{\#AP}.$$
Specificity: $$\text{Specificity} = 1 - \text{false positive rate} = \frac{\#TN}{\#TN + \#FP} = \frac{\#TN}{\#AN}.$$
Positive predictive value (PPV): $$PPV = \frac{\#TP}{\#TP + \#FP} = \frac{\#TP}{\#PP}.$$

90 Nucleotide-level Accuracy
Total test genes: 462

        AP          AN
PP      TP = …      FP = 7332
PN      FN = …      TN = …

Sensitivity: …   Specificity: …   P.P.V.: …

91 Nucleotide-level Accuracy with Start Codon and Stop Codon
Total test genes: 462

        AP            AN
PP      TP = …        FP = 7584
PN      FN = 72675    TN = …

Sensitivity: …   Specificity: …   P.P.V.: …

92 Exon-level Accuracy with Start Codon and Stop Codon
Total actual exons: 2843 (exactly predicted: …, partially predicted: …, overlapped: …, missed: …)
Total predicted exons: 2713 (exact: …, partial: …, overlapping: …, wrong: …)
Sensitivity: …   ME: …   P.P.V.: …   WE: …

93 Prediction of Protein Structure 92

94 Protein 3D structure From H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne: The Protein Data Bank. Nucleic Acids Research, 28 pp (2000) 93

95 Levels of Protein Structure
Primary structure: the amino acid sequence
Secondary structure: the segmentation of an amino acid sequence into α-helices, β-sheets, and loops
Tertiary structure: domains and folds

96 Different Mapping for Protein Secondary Structure

RES    KENITNLDACITRLRVSVADVSKVDQAGLKKLG
DSSP   STTBCEEEECSSCEEEEESCGGGCCHHHHHHTT
EHL    CCCECEEEECCCCEEEEECCHHHCCHHHHHHCC
CK     CCCCCEEEECCCCEEEEECCCCCCCHHHHHHCC
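The two reductions can be read off the example rows: under the EHL mapping, H and G become H, E and B become E, and everything else becomes C; under the CK mapping only H and E are kept. The sketch below encodes exactly that and nothing more. How other DSSP symbols not present in this example (such as I) are treated is not shown on the slide, so sending them to C here is an assumption, and `reduce_states` is a hypothetical name.

```python
# Mappings inferred from the example rows; unlisted DSSP symbols map to C (assumed).
EHL = {"H": "H", "G": "H", "E": "E", "B": "E"}
CK = {"H": "H", "E": "E"}


def reduce_states(dssp: str, mapping: dict) -> str:
    """Reduce a DSSP string to the three states H, E and C."""
    return "".join(mapping.get(symbol, "C") for symbol in dssp)


dssp = "STTBCEEEECSSCEEEEESCGGGCCHHHHHHTT"
print(reduce_states(dssp, EHL))  # CCCECEEEECCCCEEEEECCHHHCCHHHHHHCC
print(reduce_states(dssp, CK))   # CCCCCEEEECCCCEEEEECCCCCCCHHHHHHCC
```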

97 Prediction of Protein Secondary Structure
$\mathcal{C}$: the set of all eligible protein secondary structure sequences, i.e., the code of protein secondary structure sequences.
$\mathcal{C}$ is assumed to be represented by a trellis, i.e., to be a trellis code.
Each protein secondary structure sequence $Q = (Q_1, Q_2, \ldots, Q_L)$ of length $L$ in $\mathcal{C}$ corresponds to a unique (state) path $X = (X_0, X_1, \ldots, X_{L-4})$ in the trellis from the initial state $X_0$ to the final state $X_{L-4}$.

98 For the purpose of representing interactions among amino acids in a protein chain, we define by
$$\mathcal{O} = \begin{pmatrix} O^{(1)}_1 & O^{(1)}_2 & \cdots & O^{(1)}_L \\ O^{(2)}_1 & O^{(2)}_2 & \cdots & O^{(2)}_L \\ \vdots & \vdots & & \vdots \\ O^{(k)}_1 & O^{(k)}_2 & \cdots & O^{(k)}_L \end{pmatrix}$$
the observation template of the unconnected Bayesian network of $k$ layers. The template $\mathcal{O}$ of $k$ layers is fully defined, given the observed amino acid sequence $O = (O_1, O_2, \ldots, O_L)$, by assigning $O^{(j)}_i = O_i$ for all depths $j$ and all $i$. Maximum-likelihood decoding chooses the maximizer

99 $$\hat{Q} = \arg\max_{Q} P(Q \mid \mathcal{O})$$
as the output, where
$$P(Q \mid \mathcal{O}) = P(Q_1, Q_2, \ldots, Q_L \mid \mathcal{O}) = P(Q_1, Q_2, Q_3 \mid \mathcal{O})\, P(Q_4, Q_5, \ldots, Q_L \mid \mathcal{O}, Q_1, Q_2, Q_3)$$
$$= P(Q_1, Q_2, Q_3 \mid \mathcal{O}) \prod_{i=4}^{L-2} P(Q_i \mid \mathcal{O}, Q_{i-1})\; P(Q_{L-1}, Q_L \mid \mathcal{O}, Q_{L-2}).$$

100 For the initial states:
$$P(Q_1, Q_2, Q_3 \mid \mathcal{O}) = P(Q_1, Q_2, Q_3 \mid \mathcal{O}_{[1:l_I]}) = \frac{P(Q_1, Q_2, Q_3)\, P(\mathcal{O}_{[1:l_I]} \mid Q_1, Q_2, Q_3)}{P(\mathcal{O}_{[1:l_I]})},$$
where $l_I$ is the window size of an initial state.
For the intermediate states:
$$P(Q_i \mid \mathcal{O}, Q_{i-1}) = P(Q_i \mid \mathcal{O}_{[i-l_L : i+l_R-1]}, Q_{i-1}) = \frac{P(Q_i \mid Q_{i-1})\, P(\mathcal{O}_{[i-l_L : i+l_R-1]} \mid Q_i, Q_{i-1})}{P(\mathcal{O}_{[i-l_L : i+l_R-1]} \mid Q_{i-1})},$$
where $l_L$ and $l_R$ determine the start and the end of the window of an internal state, respectively.

101 For the terminal states:
$$P(Q_{L-1}, Q_L \mid \mathcal{O}, Q_{L-2}) = P(Q_{L-1}, Q_L \mid \mathcal{O}_{[L-l_T+1 : L]}, Q_{L-2}) = \frac{P(Q_{L-1}, Q_L \mid Q_{L-2})\, P(\mathcal{O}_{[L-l_T+1 : L]} \mid Q_{L-2}, Q_{L-1}, Q_L)}{P(\mathcal{O}_{[L-l_T+1 : L]} \mid Q_{L-2})},$$
where $l_T$ is the window size of a terminal state.
The decoding can be carried out by applying the Viterbi algorithm to compute $\hat{Q} = \arg\max_{Q} P(Q \mid \mathcal{O})$ given an amino acid sequence $O$.

102 Factor Graph for Protein Secondary Structure Prediction
Since the observed amino acid sequence $O$ is fixed in the annotating (decoding) problem, the observation template $\mathcal{O}$ is considered fixed as well.
The global function of this annotating (decoding) problem is selected as the APP
$$g(X_0, \ldots, X_{L-4}) = P(X_0, X_1, \ldots, X_{L-4} \mid \mathcal{O}).$$
The global function $g$ can be further factored into a product of

103 several local functions:
$$g(X_0, \ldots, X_{L-4}) = P(X_0 \mid \mathcal{O})\, P(X_1 \mid X_0, \mathcal{O}) \cdots P(X_{L-4} \mid X_0, X_1, \ldots, X_{L-5}, \mathcal{O})$$
$$= P(X_0 \mid \mathcal{O})\, P(X_1 \mid \mathcal{O}) \cdots P(X_{L-4} \mid \mathcal{O}) \prod_{i=1}^{L-4} I_i(X_{i-1}, X_i)$$
$$= f_1(X_0)\, f_2(X_0, X_1)\, f_3(X_1)\, f_4(X_1, X_2) \cdots f_{2L-9}(X_{L-5})\, f_{2L-8}(X_{L-5}, X_{L-4})\, f_{2L-7}(X_{L-4}),$$
where $f_{2i}(X_{i-1}, X_i) = I_i(X_{i-1}, X_i)$ is an indicator function of the local behavior of the $i$th trellis section that constrains the possible combinations of $X_{i-1}$ and $X_i$.

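104 For $f_1(X_0)$:
$$f_1(X_0) = P(X_0 \mid \mathcal{O}) = P(X_0 \mid \mathcal{O}_{[1:l_I]}) = \frac{P(X_0)\, P(\mathcal{O}_{[1:l_I]} \mid X_0)}{P(\mathcal{O}_{[1:l_I]})}.$$
For $f_{2i+1}(X_i)$, $1 \leq i \leq L-5$:
$$f_{2i+1}(X_i) = P(X_i \mid \mathcal{O}) = P(X_i \mid \mathcal{O}_{[(i+3)-l_L : (i+3)+l_R-1]}) = \frac{P(X_i)\, P(\mathcal{O}_{[(i+3)-l_L : (i+3)+l_R-1]} \mid X_i)}{P(\mathcal{O}_{[(i+3)-l_L : (i+3)+l_R-1]})}.$$

105 For $f_{2L-7}(X_{L-4})$:
$$f_{2L-7}(X_{L-4}) = P(X_{L-4} \mid \mathcal{O}) = P(X_{L-4} \mid \mathcal{O}_{[L-l_T+1 : L]}) = \frac{P(X_{L-4})\, P(\mathcal{O}_{[L-l_T+1 : L]} \mid X_{L-4})}{P(\mathcal{O}_{[L-l_T+1 : L]})}.$$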

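106 The factor graph for the annotating problem is as follows:
[Figure: a chain of variable vertices $X_0, X_1, X_2, \ldots, X_{L-5}, X_{L-4}$; consecutive variables are joined by the function vertices $f_2, f_4, \ldots, f_{2L-8}$, and each variable also has its own function vertex $f_1, f_3, f_5, \ldots, f_{2L-9}, f_{2L-7}$]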

107 The Sum-Product Update Rule
The message sent between a variable vertex and a function vertex in the sum-product algorithm is updated according to the following sum-product update rule:
Variable vertex to function vertex:
$$\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x).$$
Function vertex to variable vertex:
$$\mu_{f \to x}(x) = \sum_{\sim\{x\}} \Big( f(N_f) \prod_{y \in N_f \setminus \{x\}} \mu_{y \to f}(y) \Big),$$
where $\sum_{\sim\{x\}}$ sums over all variables in $N_f$ except $x$.

108 When the factor graph is cycle-free, the sum-product algorithm is guaranteed to give
$$\sum_{\sim\{v\}} g(V) = \prod_{f \in n(v)} \mu_{f \to v}(v), \qquad v \in V,$$
after a suitable message-passing schedule is completed.
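On a chain-shaped factor graph like the one in this talk, the update rule reduces to one left-to-right and one right-to-left pass. The sketch below runs both passes on a small chain and checks the resulting marginals against brute-force summation of the global function. The data layout (`unary[i][a]` for the single-variable functions, `pairwise[i][a][b]` for the functions joining $X_i$ and $X_{i+1}$) and the numeric values are assumptions for illustration.

```python
from itertools import product


def chain_marginals(unary, pairwise):
    """Normalized marginals of every variable on a chain via sum-product messages."""
    n, k = len(unary), len(unary[0])
    # fwd[i][b]: message reaching X_i from the pairwise function on its left.
    fwd = [[1.0] * k for _ in range(n)]
    for i in range(1, n):
        for b in range(k):
            fwd[i][b] = sum(fwd[i - 1][a] * unary[i - 1][a] * pairwise[i - 1][a][b]
                            for a in range(k))
    # bwd[i][a]: message reaching X_i from the pairwise function on its right.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in range(k):
            bwd[i][a] = sum(pairwise[i][a][b] * unary[i + 1][b] * bwd[i + 1][b]
                            for b in range(k))
    result = []
    for i in range(n):
        m = [fwd[i][a] * unary[i][a] * bwd[i][a] for a in range(k)]
        z = sum(m)
        result.append([v / z for v in m])
    return result


def brute_force_marginals(unary, pairwise):
    """Same marginals by summing the global function over every configuration."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for x in product(range(k), repeat=n):
        g = 1.0
        for i in range(n):
            g *= unary[i][x[i]]
        for i in range(n - 1):
            g *= pairwise[i][x[i]][x[i + 1]]
        for i in range(n):
            totals[i][x[i]] += g
    return [[v / sum(row) for v in row] for row in totals]


# Illustrative values; the zero entry acts like an indicator that forbids a transition.
unary = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
pairwise = [[[1.0, 0.0], [1.0, 1.0]], [[0.9, 0.1], [0.3, 0.7]]]
fast = chain_marginals(unary, pairwise)
slow = brute_force_marginals(unary, pairwise)
print(all(abs(a - b) < 1e-12 for r, s in zip(fast, slow) for a, b in zip(r, s)))  # True
```

The message passes cost time linear in the chain length, whereas the brute-force check grows exponentially, which is the practical reason for using the factor graph.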

109 Measure for Secondary Structure Prediction Accuracy
Let $M_{ij}$ be the number of residues observed in state $i$ and predicted in state $j$, with $i, j \in \{H, E, L\}$. The total number of residues is simply
$$N = \sum_{i,j} M_{ij}.$$
The three-state per-residue accuracy $Q_3$ is thus defined as
$$Q_3 = 100\, \frac{\sum_i M_{ii}}{N}.$$
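A small sketch of this measure, building the matrix $M_{ij}$ from an observed and a predicted state string. The function name `q3` and the example strings are made up for illustration.

```python
from collections import Counter


def q3(observed: str, predicted: str, states: str = "HEL") -> float:
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    m = Counter(zip(observed, predicted))   # M_ij
    n = sum(m.values())                     # N
    return 100.0 * sum(m[(s, s)] for s in states) / n


print(q3("HHHEELLL", "HHEEELLH"))  # 75.0, since 6 of 8 residues agree
```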

110 Prediction Results
Table of prediction results for the PDBselect25 data set with EHL- and CK-mapping in $Q_3$ measure:

L    EHL(vi)   EHL(sp)   CK(vi)    CK(sp)
…    …%        62.96%    63.61%    64.91%
…    …%        63.60%    64.16%    65.35%
…    …%        63.86%    65.17%    65.84%

L: number of secondary structures in the initial state or terminal state
vi: decoded by the Viterbi algorithm
sp: decoded by the sum-product algorithm

111 Table of prediction results for the CB513 data set with EHL- and CK-mapping in $Q_3$ measure:

L    EHL(vi)   EHL(sp)   CK(vi)    CK(sp)
…    …%        60.00%    60.81%    62.28%
…    …%        59.91%    61.26%    62.62%
…    …%        58.94%    61.03%    62.24%

112 Current Challenging Problems in Bioinformatics
Genomics
  Novel gene discovery
  Alternative splicing
  Gene regulatory networks
  Cell cycle
  Development
  Disease finding
Proteomics
  Prediction of structures and functions of proteins
  Drug design

113 Metabolic pathways
  Construction of pathways
  Cellular functions
Systems biology

114 Systems Biology From Neil Campbell, Jane Reece, and Larry Mitchell, Biology, 5th ed. (Menlo Park, CA: Addison Wesley Longman, 1999) © Addison Wesley Longman, Inc.

115 Bioinformatics is Not Just Computer Algorithms!!
Information theory
Coding theory
Communication theory
Signal processing and linguistics
System (control) theory
Statistics, probability theory and stochastic processes
Combinatorics and graph theory

116 Co-investigators Dr. Wen-Hsiung Li at the Department of Ecology and Evolution, University of Chicago, USA. Te-Ming Chen, Chao-Chung Chang, Chen-Wei Hsu, Yun Lee, Chiung-Wen He at the Department of Electrical Engineering, National Tsing Hua University, Taiwan. 115

117 Thank You Very Much 116


More information

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype Lecture Series 7 From DNA to Protein: Genotype to Phenotype Reading Assignments Read Chapter 7 From DNA to Protein A. Genes and the Synthesis of Polypeptides Genes are made up of DNA and are expressed

More information

A modular Fibonacci sequence in proteins

A modular Fibonacci sequence in proteins A modular Fibonacci sequence in proteins P. Dominy 1 and G. Rosen 2 1 Hagerty Library, Drexel University, Philadelphia, PA 19104, USA 2 Department of Physics, Drexel University, Philadelphia, PA 19104,

More information

Videos. Bozeman, transcription and translation: https://youtu.be/h3b9arupxzg Crashcourse: Transcription and Translation - https://youtu.

Videos. Bozeman, transcription and translation: https://youtu.be/h3b9arupxzg Crashcourse: Transcription and Translation - https://youtu. Translation Translation Videos Bozeman, transcription and translation: https://youtu.be/h3b9arupxzg Crashcourse: Transcription and Translation - https://youtu.be/itsb2sqr-r0 Translation Translation The

More information

Edinburgh Research Explorer

Edinburgh Research Explorer Edinburgh Research Explorer Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable

More information

Lesson Overview. Ribosomes and Protein Synthesis 13.2

Lesson Overview. Ribosomes and Protein Synthesis 13.2 13.2 The Genetic Code The first step in decoding genetic messages is to transcribe a nucleotide base sequence from DNA to mrna. This transcribed information contains a code for making proteins. The Genetic

More information

Reducing Redundancy of Codons through Total Graph

Reducing Redundancy of Codons through Total Graph American Journal of Bioinformatics Original Research Paper Reducing Redundancy of Codons through Total Graph Nisha Gohain, Tazid Ali and Adil Akhtar Department of Mathematics, Dibrugarh University, Dibrugarh-786004,

More information

Lecture IV A. Shannon s theory of noisy channels and molecular codes

Lecture IV A. Shannon s theory of noisy channels and molecular codes Lecture IV A Shannon s theory of noisy channels and molecular codes Noisy molecular codes: Rate-Distortion theory S Mapping M Channel/Code = mapping between two molecular spaces. Two functionals determine

More information

Genetic code on the dyadic plane

Genetic code on the dyadic plane Genetic code on the dyadic plane arxiv:q-bio/0701007v3 [q-bio.qm] 2 Nov 2007 A.Yu.Khrennikov, S.V.Kozyrev June 18, 2018 Abstract We introduce the simple parametrization for the space of codons (triples

More information

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.

More information

UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS General Certifi cate of Education Advanced Subsidiary Level and Advanced Level

UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS General Certifi cate of Education Advanced Subsidiary Level and Advanced Level *1166350738* UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS General Certifi cate of Education Advanced Subsidiary Level and Advanced Level CEMISTRY 9701/43 Paper 4 Structured Questions October/November

More information

From Gene to Protein

From Gene to Protein From Gene to Protein Gene Expression Process by which DNA directs the synthesis of a protein 2 stages transcription translation All organisms One gene one protein 1. Transcription of DNA Gene Composed

More information

1. In most cases, genes code for and it is that

1. In most cases, genes code for and it is that Name Chapter 10 Reading Guide From DNA to Protein: Gene Expression Concept 10.1 Genetics Shows That Genes Code for Proteins 1. In most cases, genes code for and it is that determine. 2. Describe what Garrod

More information

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: m Eukaryotic mrna processing Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: Cap structure a modified guanine base is added to the 5 end. Poly-A tail

More information

RNA Processing: Eukaryotic mrnas

RNA Processing: Eukaryotic mrnas RNA Processing: Eukaryotic mrnas Eukaryotic mrnas have three main parts (Figure 13.8): 5! untranslated region (5! UTR), varies in length. The coding sequence specifies the amino acid sequence of the protein

More information

The degeneracy of the genetic code and Hadamard matrices. Sergey V. Petoukhov

The degeneracy of the genetic code and Hadamard matrices. Sergey V. Petoukhov The degeneracy of the genetic code and Hadamard matrices Sergey V. Petoukhov Department of Biomechanics, Mechanical Engineering Research Institute of the Russian Academy of Sciences petoukhov@hotmail.com,

More information

Biology 155 Practice FINAL EXAM

Biology 155 Practice FINAL EXAM Biology 155 Practice FINAL EXAM 1. Which of the following is NOT necessary for adaptive evolution? a. differential fitness among phenotypes b. small population size c. phenotypic variation d. heritability

More information

From DNA to protein, i.e. the central dogma

From DNA to protein, i.e. the central dogma From DNA to protein, i.e. the central dogma DNA RNA Protein Biochemistry, chapters1 5 and Chapters 29 31. Chapters 2 5 and 29 31 will be covered more in detail in other lectures. ph, chapter 1, will be

More information

GCD3033:Cell Biology. Transcription

GCD3033:Cell Biology. Transcription Transcription Transcription: DNA to RNA A) production of complementary strand of DNA B) RNA types C) transcription start/stop signals D) Initiation of eukaryotic gene expression E) transcription factors

More information

Slide 1 / 54. Gene Expression in Eukaryotic cells

Slide 1 / 54. Gene Expression in Eukaryotic cells Slide 1 / 54 Gene Expression in Eukaryotic cells Slide 2 / 54 Central Dogma DNA is the the genetic material of the eukaryotic cell. Watson & Crick worked out the structure of DNA as a double helix. According

More information

C CH 3 N C COOH. Write the structural formulas of all of the dipeptides that they could form with each other.

C CH 3 N C COOH. Write the structural formulas of all of the dipeptides that they could form with each other. hapter 25 Biochemistry oncept heck 25.1 Two common amino acids are 3 2 N alanine 3 2 N threonine Write the structural formulas of all of the dipeptides that they could form with each other. The carboxyl

More information

PROTEIN SYNTHESIS INTRO

PROTEIN SYNTHESIS INTRO MR. POMERANTZ Page 1 of 6 Protein synthesis Intro. Use the text book to help properly answer the following questions 1. RNA differs from DNA in that RNA a. is single-stranded. c. contains the nitrogen

More information

BME 5742 Biosystems Modeling and Control

BME 5742 Biosystems Modeling and Control BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various

More information

ومن أحياها Translation 1. Translation 1. DONE BY :Maen Faoury

ومن أحياها Translation 1. Translation 1. DONE BY :Maen Faoury Translation 1 DONE BY :Maen Faoury 0 1 ومن أحياها Translation 1 2 ومن أحياها Translation 1 In this lecture and the coming lectures you are going to see how the genetic information is transferred into proteins

More information

Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India. 1 st November, 2013

Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India. 1 st November, 2013 Hydration of protein-rna recognition sites Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India 1 st November, 2013 Central Dogma of life DNA

More information

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications 1 GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications 2 DNA Promoter Gene A Gene B Termination Signal Transcription

More information

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA RNA & PROTEIN SYNTHESIS Making Proteins Using Directions From DNA RNA & Protein Synthesis v Nitrogenous bases in DNA contain information that directs protein synthesis v DNA remains in nucleus v in order

More information

Mathematics of Bioinformatics ---Theory, Practice, and Applications (Part II)

Mathematics of Bioinformatics ---Theory, Practice, and Applications (Part II) Mathematics of Bioinformatics ---Theory, Practice, and Applications (Part II) Matthew He, Ph.D. Professor/Director Division of Math, Science, and Technology Nova Southeastern University, Florida, USA December

More information

GENETICS - CLUTCH CH.11 TRANSLATION.

GENETICS - CLUTCH CH.11 TRANSLATION. !! www.clutchprep.com CONCEPT: GENETIC CODE Nucleotides and amino acids are translated in a 1 to 1 method The triplet code states that three nucleotides codes for one amino acid - A codon is a term for

More information

CHEMISTRY 9701/42 Paper 4 Structured Questions May/June hours Candidates answer on the Question Paper. Additional Materials: Data Booklet

CHEMISTRY 9701/42 Paper 4 Structured Questions May/June hours Candidates answer on the Question Paper. Additional Materials: Data Booklet Cambridge International Examinations Cambridge International Advanced Level CHEMISTRY 9701/42 Paper 4 Structured Questions May/June 2014 2 hours Candidates answer on the Question Paper. Additional Materials:

More information

Organic Chemistry Option II: Chemical Biology

Organic Chemistry Option II: Chemical Biology Organic Chemistry Option II: Chemical Biology Recommended books: Dr Stuart Conway Department of Chemistry, Chemistry Research Laboratory, University of Oxford email: stuart.conway@chem.ox.ac.uk Teaching

More information

A Minimum Principle in Codon-Anticodon Interaction

A Minimum Principle in Codon-Anticodon Interaction A Minimum Principle in Codon-Anticodon Interaction A. Sciarrino a,b,, P. Sorba c arxiv:0.480v [q-bio.qm] 9 Oct 0 Abstract a Dipartimento di Scienze Fisiche, Università di Napoli Federico II Complesso Universitario

More information

Laith AL-Mustafa. Protein synthesis. Nabil Bashir 10\28\ First

Laith AL-Mustafa. Protein synthesis. Nabil Bashir 10\28\ First Laith AL-Mustafa Protein synthesis Nabil Bashir 10\28\2015 http://1drv.ms/1gigdnv 01 First 0 Protein synthesis In previous lectures we started talking about DNA Replication (DNA synthesis) and we covered

More information

Molecular Biology - Translation of RNA to make Protein *

Molecular Biology - Translation of RNA to make Protein * OpenStax-CNX module: m49485 1 Molecular Biology - Translation of RNA to make Protein * Jerey Mahr Based on Translation by OpenStax This work is produced by OpenStax-CNX and licensed under the Creative

More information

Chapter 17. From Gene to Protein. Biology Kevin Dees

Chapter 17. From Gene to Protein. Biology Kevin Dees Chapter 17 From Gene to Protein DNA The information molecule Sequences of bases is a code DNA organized in to chromosomes Chromosomes are organized into genes What do the genes actually say??? Reflecting

More information

Translation. Genetic code

Translation. Genetic code Translation Genetic code If genes are segments of DNA and if DNA is just a string of nucleotide pairs, then how does the sequence of nucleotide pairs dictate the sequence of amino acids in proteins? Simple

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation. Protein Synthesis Unit 6 Goal: Students will be able to describe the processes of transcription and translation. Types of RNA Messenger RNA (mrna) makes a copy of DNA, carries instructions for making proteins,

More information

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis Chapters 12&13 Notes: DNA, RNA & Protein Synthesis Name Period Words to Know: nucleotides, DNA, complementary base pairing, replication, genes, proteins, mrna, rrna, trna, transcription, translation, codon,

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1. Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer

More information

Natural Selection. Nothing in Biology makes sense, except in the light of evolution. T. Dobzhansky

Natural Selection. Nothing in Biology makes sense, except in the light of evolution. T. Dobzhansky It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp

More information

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation. Protein Synthesis Unit 6 Goal: Students will be able to describe the processes of transcription and translation. Protein Synthesis: Protein synthesis uses the information in genes to make proteins. 2 Steps

More information

Molecular Biology of the Cell

Molecular Biology of the Cell Alberts Johnson Lewis Raff Roberts Walter Molecular Biology of the Cell Fifth Edition Chapter 6 How Cells Read the Genome: From DNA to Protein Copyright Garland Science 2008 Figure 6-1 Molecular Biology

More information

The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts. Sergey V. Petoukhov

The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts. Sergey V. Petoukhov The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts Sergey V. Petoukhov Head of Laboratory of Biomechanical System, Mechanical Engineering Research Institute of the Russian Academy of

More information

A Mathematical Model of the Genetic Code, the Origin of Protein Coding, and the Ribosome as a Dynamical Molecular Machine

A Mathematical Model of the Genetic Code, the Origin of Protein Coding, and the Ribosome as a Dynamical Molecular Machine A Mathematical Model of the Genetic Code, the Origin of Protein Coding, and the Ribosome as a Dynamical Molecular Machine Diego L. Gonzalez CNR- IMM Is)tuto per la Microele4ronica e i Microsistemi Dipar)mento

More information

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

Name: SBI 4U. Gene Expression Quiz. Overall Expectation: Gene Expression Quiz Overall Expectation: - Demonstrate an understanding of concepts related to molecular genetics, and how genetic modification is applied in industry and agriculture Specific Expectation(s):

More information

Abstract Following Petoukhov and his collaborators we use two length n zero-one sequences, α and β,

Abstract Following Petoukhov and his collaborators we use two length n zero-one sequences, α and β, Studying Genetic Code by a Matrix Approach Tanner Crowder 1 and Chi-Kwong Li 2 Department of Mathematics, The College of William and Mary, Williamsburg, Virginia 23185, USA E-mails: tjcrow@wmedu, ckli@mathwmedu

More information

UNIT 5. Protein Synthesis 11/22/16

UNIT 5. Protein Synthesis 11/22/16 UNIT 5 Protein Synthesis IV. Transcription (8.4) A. RNA carries DNA s instruction 1. Francis Crick defined the central dogma of molecular biology a. Replication copies DNA b. Transcription converts DNA

More information

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Interpolated Markov Models for Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Interpolated Markov Models for Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following the

More information

Translation. A ribosome, mrna, and trna.

Translation. A ribosome, mrna, and trna. Translation The basic processes of translation are conserved among prokaryotes and eukaryotes. Prokaryotic Translation A ribosome, mrna, and trna. In the initiation of translation in prokaryotes, the Shine-Dalgarno

More information

Multiple Choice Review- Eukaryotic Gene Expression

Multiple Choice Review- Eukaryotic Gene Expression Multiple Choice Review- Eukaryotic Gene Expression 1. Which of the following is the Central Dogma of cell biology? a. DNA Nucleic Acid Protein Amino Acid b. Prokaryote Bacteria - Eukaryote c. Atom Molecule

More information

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine.

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine. Protein Synthesis & Mutations RNA 1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine. RNA Contains: 1. Adenine 2.

More information

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11 The Eukaryotic Genome and Its Expression Lecture Series 11 The Eukaryotic Genome and Its Expression A. The Eukaryotic Genome B. Repetitive Sequences (rem: teleomeres) C. The Structures of Protein-Coding

More information

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p.110-114 Arrangement of information in DNA----- requirements for RNA Common arrangement of protein-coding genes in prokaryotes=

More information

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell.

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell. RNAs L.Os. Know the different types of RNA & their relative concentration Know the structure of each RNA Understand their functions Know their locations in the cell Understand the differences between prokaryotic

More information

9/11/18. Molecular and Cellular Biology. 3. The Cell From Genes to Proteins. key processes

9/11/18. Molecular and Cellular Biology. 3. The Cell From Genes to Proteins. key processes Molecular and Cellular Biology Animal Cell ((eukaryotic cell) -----> compare with prokaryotic cell) ENDOPLASMIC RETICULUM (ER) Rough ER Smooth ER Flagellum Nuclear envelope Nucleolus NUCLEUS Chromatin

More information

Three-Dimensional Algebraic Models of the trna Code and 12 Graphs for Representing the Amino Acids

Three-Dimensional Algebraic Models of the trna Code and 12 Graphs for Representing the Amino Acids Life 2014, 4, 341-373; doi:10.3390/life4030341 Article OPEN ACCESS life ISSN 2075-1729 www.mdpi.com/journal/life Three-Dimensional Algebraic Models of the trna Code and 12 Graphs for Representing the Amino

More information

Molecular Biology of the Cell

Molecular Biology of the Cell Alberts Johnson Lewis Raff Roberts Walter Molecular Biology of the Cell Fifth Edition Chapter 6 How Cells Read the Genome: From DNA to Protein Copyright Garland Science 2008 Figure 6-1 Molecular Biology

More information

Molecular Genetics Principles of Gene Expression: Translation

Molecular Genetics Principles of Gene Expression: Translation Paper No. : 16 Module : 13 Principles of gene expression: Translation Development Team Principal Investigator: Prof. Neeta Sehgal Head, Department of Zoology, University of Delhi Paper Coordinator: Prof.

More information

Advanced Topics in RNA and DNA. DNA Microarrays Aptamers

Advanced Topics in RNA and DNA. DNA Microarrays Aptamers Quiz 1 Advanced Topics in RNA and DNA DNA Microarrays Aptamers 2 Quantifying mrna levels to asses protein expression 3 The DNA Microarray Experiment 4 Application of DNA Microarrays 5 Some applications

More information

Introduction to the Ribosome Overview of protein synthesis on the ribosome Prof. Anders Liljas

Introduction to the Ribosome Overview of protein synthesis on the ribosome Prof. Anders Liljas Introduction to the Ribosome Molecular Biophysics Lund University 1 A B C D E F G H I J Genome Protein aa1 aa2 aa3 aa4 aa5 aa6 aa7 aa10 aa9 aa8 aa11 aa12 aa13 a a 14 How is a polypeptide synthesized? 2

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

Introduction to molecular biology. Mitesh Shrestha

Introduction to molecular biology. Mitesh Shrestha Introduction to molecular biology Mitesh Shrestha Molecular biology: definition Molecular biology is the study of molecular underpinnings of the process of replication, transcription and translation of

More information

Practical Bioinformatics

Practical Bioinformatics 5/2/2017 Dictionaries d i c t i o n a r y = { A : T, T : A, G : C, C : G } d i c t i o n a r y [ G ] d i c t i o n a r y [ N ] = N d i c t i o n a r y. h a s k e y ( C ) Dictionaries g e n e t i c C o

More information

Crick s early Hypothesis Revisited

Crick s early Hypothesis Revisited Crick s early Hypothesis Revisited Or The Existence of a Universal Coding Frame Ryan Rossi, Jean-Louis Lassez and Axel Bernal UPenn Center for Bioinformatics BIOINFORMATICS The application of computer

More information

Eukaryotic vs. Prokaryotic genes

Eukaryotic vs. Prokaryotic genes BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 18: Eukaryotic genes http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Eukaryotic vs. Prokaryotic genes Like in prokaryotes,

More information

-14. -Abdulrahman Al-Hanbali. -Shahd Alqudah. -Dr Ma mon Ahram. 1 P a g e

-14. -Abdulrahman Al-Hanbali. -Shahd Alqudah. -Dr Ma mon Ahram. 1 P a g e -14 -Abdulrahman Al-Hanbali -Shahd Alqudah -Dr Ma mon Ahram 1 P a g e In this lecture we will talk about the last stage in the synthesis of proteins from DNA which is translation. Translation is the process

More information

Energy and Cellular Metabolism

Energy and Cellular Metabolism 1 Chapter 4 About This Chapter Energy and Cellular Metabolism 2 Energy in biological systems Chemical reactions Enzymes Metabolism Figure 4.1 Energy transfer in the environment Table 4.1 Properties of

More information

Gene regulation II Biochemistry 302. February 27, 2006

Gene regulation II Biochemistry 302. February 27, 2006 Gene regulation II Biochemistry 302 February 27, 2006 Molecular basis of inhibition of RNAP by Lac repressor 35 promoter site 10 promoter site CRP/DNA complex 60 Lewis, M. et al. (1996) Science 271:1247

More information

Supplementary Information for

Supplementary Information for Supplementary Information for Evolutionary conservation of codon optimality reveals hidden signatures of co-translational folding Sebastian Pechmann & Judith Frydman Department of Biology and BioX, Stanford

More information

Molecular Biology (9)

Molecular Biology (9) Molecular Biology (9) Translation Mamoun Ahram, PhD Second semester, 2017-2018 1 Resources This lecture Cooper, Ch. 8 (297-319) 2 General information Protein synthesis involves interactions between three

More information

Chapter

Chapter Chapter 17 17.4-17.6 Molecular Components of Translation A cell interprets a genetic message and builds a polypeptide The message is a series of codons on mrna The interpreter is called transfer (trna)

More information

DNA Feature Sensors. B. Majoros

DNA Feature Sensors. B. Majoros DNA Feature Sensors B. Majoros What is Feature Sensing? A feature is any DNA subsequence of biological significance. For practical reasons, we recognize two broad classes of features: signals short, fixed-length

More information

ATTRIBUTIVE CONCEPTION OF GENETIC CODE, ITS BI-PERIODIC TABLES AND PROBLEM OF UNIFICATION BASES OF BIOLOGICAL LANGUAGES *

ATTRIBUTIVE CONCEPTION OF GENETIC CODE, ITS BI-PERIODIC TABLES AND PROBLEM OF UNIFICATION BASES OF BIOLOGICAL LANGUAGES * Symmetry: Culture and Science Vols. 14-15, 281-307, 2003-2004 ATTRIBUTIVE CONCEPTION OF GENETIC CODE, ITS BI-PERIODIC TABLES AND PROBLEM OF UNIFICATION BASES OF BIOLOGICAL LANGUAGES * Sergei V. Petoukhov

More information

Section 7. Junaid Malek, M.D.

Section 7. Junaid Malek, M.D. Section 7 Junaid Malek, M.D. RNA Processing and Nomenclature For the purposes of this class, please do not refer to anything as mrna that has not been completely processed (spliced, capped, tailed) RNAs

More information

Translation Part 2 of Protein Synthesis

Translation Part 2 of Protein Synthesis Translation Part 2 of Protein Synthesis IN: How is transcription like making a jello mold? (be specific) What process does this diagram represent? A. Mutation B. Replication C.Transcription D.Translation

More information

Quiz answers. Allele. BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA)

Quiz answers. Allele. BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA) BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA) http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Quiz answers Kinase: An enzyme

More information

Translation and the Genetic Code

Translation and the Genetic Code Chapter 11. Translation and the Genetic Code 1. Protein Structure 2. Components required for Protein Synthesis 3. Properties of the Genetic Code: An Overview 4. A Degenerate and Ordered Code 1 Sickle-Cell

More information
