2005 Fall Workshop on Information Theory and Communications: Bioinformatics from a Perspective of Communication Science


1 2005 Fall Workshop on Information Theory and Communications Bioinformatics from a Perspective of Communication Science Chung-Chin Lu Department of Electrical Engineering National Tsing Hua University cclu@ee.nthu.edu.tw July 22, 2005

2 Biological Background : The Central Dogma 1

3 Typical Animal Cell From Neil Campbell, Jane Reece, and Larry Mitchell, Biology, 5th ed. (Menlo Park, CA: Addison Wesley Longman, 1999) c Addison Wesley Longman, Inc. 2

4 Places for Genetic Information Processing in Cell
  Nucleus: storage and transportation of genetic information
    Storage: deoxyribonucleic acid (DNA)
    Transportation: ribonucleic acid (RNA)
  Ribosomes: factories for protein synthesis with the blueprint in RNA
  Mitochondria: energy production, with their own circular, double-stranded DNA

5 Double-stranded DNA From J. D. Watson, N. H. Hopkins, J. W. Roberts, J. A. Steitz, and A. M. Weiner, Molecular Biology of the Gene, 4th ed. (Redwood City, CA: Benjamin/Cummings Publishing Co., 1987). c 1987 James D. Watson. 4

6 Chemical Structures of RNA and DNA From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 5

7 Repeating Units of DNA and RNA
  Three subunits in a repeating unit (a ribonucleotide) of RNA
    A phosphate group
    A ribose
    A base
  Three subunits in a repeating unit (a deoxyribonucleotide) of DNA
    A phosphate group
    A 2'-deoxyribose
    A base

8 Nucleic Acid Bases and Pairing of Bases in DNA
  Five nucleic acid bases
    Two purines
      Adenine (A) in DNA and RNA
      Guanine (G) in DNA and RNA
    Three pyrimidines
      Cytosine (C) in DNA and RNA
      Thymine (T) in DNA and Uracil (U) in RNA
  Pairing of bases in DNA
    Adenine (A) - Thymine (T)
    Guanine (G) - Cytosine (C)
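The pairing rules above can be written as a lookup table. The following is a minimal sketch, not part of the original slides; the function names `complement`, `reverse_complement` and `transcribe` are illustrative choices.

```python
# Watson-Crick pairing in DNA: A pairs with T, G pairs with C.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


def transcribe(coding_strand: str) -> str:
    """RNA copy of the coding strand: uracil (U) replaces thymine (T)."""
    return coding_strand.upper().replace("T", "U")
```

For example, `complement("ATGC")` gives `"TACG"`, and `transcribe("TATAAA")` gives `"UAUAAA"`.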

9 The Bases of DNA From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 8

10 The Flow of Genetic Information in Cell From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 9

11 DNA replication From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 10

12 Transcription From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 11

13 Relationships of DNA to mRNA to Polypeptide From Blanchetot, Nature (1983) 301: c 1983 Macmillan Magazines, Ltd. 12

14 Table of mRNA Codons and Corresponding Amino Acids
  First                  Second letter                     Third
  letter   U          C          A           G            letter
  U        UUU Phe    UCU Ser    UAU Tyr     UGU Cys      U
           UUC Phe    UCC Ser    UAC Tyr     UGC Cys      C
           UUA Leu    UCA Ser    UAA Stop    UGA Stop     A
           UUG Leu    UCG Ser    UAG Stop    UGG Trp      G
  C        CUU Leu    CCU Pro    CAU His     CGU Arg      U
           CUC Leu    CCC Pro    CAC His     CGC Arg      C
           CUA Leu    CCA Pro    CAA Gln     CGA Arg      A
           CUG Leu    CCG Pro    CAG Gln     CGG Arg      G
  A        AUU Ile    ACU Thr    AAU Asn     AGU Ser      U
           AUC Ile    ACC Thr    AAC Asn     AGC Ser      C
           AUA Ile    ACA Thr    AAA Lys     AGA Arg      A
           AUG Met    ACG Thr    AAG Lys     AGG Arg      G
  G        GUU Val    GCU Ala    GAU Asp     GGU Gly      U
           GUC Val    GCC Ala    GAC Asp     GGC Gly      C
           GUA Val    GCA Ala    GAA Glu     GGA Gly      A
           GUG Val    GCG Ala    GAG Glu     GGG Gly      G

15 Protein as a Polymer
  A protein is a chain of amino acids.
  There are 20 different kinds of amino acids.
  Each amino acid is coded as a 3-tuple of nucleotides, called a codon.
  There are 64 codons but only 20 amino acids, so many amino acids are represented by more than one codon.
  Start codon: AUG, which also encodes the amino acid methionine.
  Stop codons: UAA, UAG and UGA, which do not encode any amino acid.
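The codon-to-amino-acid mapping can be exercised with a short translation routine. This is a sketch under stated assumptions, not the speaker's code: it uses the standard genetic code with one-letter amino acid symbols, starts at the first AUG it finds, and stops at the first in-frame stop codon. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in one-letter symbols, ordered with the first letter
# varying slowest over U, C, A, G; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":  # UAA, UAG or UGA
            break
        protein.append(amino)
    return "".join(protein)
```

With this table, `translate("GGAUGUUUGGCUAAUU")` reads AUG, UUU, GGC and then the stop codon UAA, giving `"MFG"` (Met-Phe-Gly).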

16 Translation of an RNA Message into a Protein From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 15

17 The Central Dogma: Transcription and Translation 16

18 Messenger RNA (mRNA)
  Pre-messenger RNA (pre-mRNA): the primary RNA transcript
    Exons: concatenated to form a coding sequence for the synthesis of protein
    Introns: lie between exons; their functions are not clear
    Poly-A signal, cleavage site and downstream element
  Splicing: deletion of introns
  Messenger RNA (mRNA): the RNA after splicing
    A cap and a 5' untranslated region (5' UTR)
    Coding sequence (CDS) for protein synthesis
    3' untranslated region (3' UTR) and a poly-A tail sequence

19 Biological Data are Digital!! Biological digital information modulates macromolecules made of repetitive units. Analogy: digital information modulates sinusoidal waves. 18

20 Digital Modulation on Macromolecules of Repetitive Units 19

21 From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 20

22 Gene Structure Prediction and The Decoding Problem Aim To predict the structure of a gene based on the DNA sequence. Formalism To decode a DNA sequence (a sequence of A, T, G, C) to a sequence of exons and introns (a sequence of E and I). 21

23 Signal Sensors for Gene Structure Prediction 22

24 Signal Sensors
  To determine the transcriptional beginning of a gene, called the transcription start site (TSS), and the upstream regulatory region, called the promoter.
    Transcription start site (TSS).
    Elements of the promoter: TATA box, CCAAT box, GC box, etc.
  To determine the precise exon-intron boundaries, called splice sites, in the coding region, a crucial part of gene structure prediction.
    Donor site: the 5' splice site of an intron.
    Acceptor site: the 3' splice site of an intron.
  To determine the transcriptional termination of a gene.

25 PolyA signal, cleavage site, downstream element (DSE). 24

26 Transcription From Christopher K. Mathews, K. E. van Holde, and Kevin G. Ahern, Biochemistry, 3rd ed. (San Francisco, CA: Benjamin/Cummings, 2000) c 2000 Addison Wesley Longman, Inc. 25

27 Transcription Start Site (TSS) Signals Recognized by the basal RNA polymerase II transcriptional machinery. Encompassing three main core promoter elements: the TATA box, the initiator (Inr) and the downstream promoter element (DPE), which are generally located within -60 to +50 of the transcription start site. 26

28 The Basal RNA Polymerase II Transcriptional Machinery 27

29 Consensus in Transcription Start Site Signals
  The TATA box has a strong consensus:
    consensus  T    A    T    A    A    A
    position   T+1  T+2  T+3  T+4  T+5  T+6
  It is located about 25 to 30 nt upstream of the TSS.
  The Inr element has a loose consensus:
    consensus  Py   Py   A    N    T/A  Py   Py
    position   A-2  A-1  A+1  A+2  A+3  A+4  A+5
  It encompasses the TSS.
  The DPE has no consensus and is not yet well characterized.

30 The Nucleotide Distribution in the TATA signal [Plot: percentage (%) of A, T, C and G at each position.] 29

31 The Nucleotide Distribution in the TSS signal [Plot: percentage (%) of A, T, C and G at each position.] 30

32 Pre-mRNA Splicing - Spliceosome Cycle From After Sharp, Cell (1994) 77:811. c 1994 Cell Press/The Nobel Foundation. 31

33 Consensus in Splice Signals
  Donor sites:
    exon  A    G   /  G    U    AUGU  intron
          D-2  D-1    D+1  D+2
  Acceptor sites:
    intron  (C/T)  N  N  A    G   /  G    exon
                         A-2  A-1    A+1

34 Conservation in Splice Signals
  Donor sites: only GU is conserved, in the D+1 D+2 positions, for more than 98% of donor sites.
  Acceptor sites: only AG is conserved, in the A-2 A-1 positions, for more than 98% of acceptor sites.
  The spliceosome, which carries out splicing, can nevertheless recognize the splice sites.

35 Polyadenylation Mechanism
  There are three major steps:
    Recognition of the authentic signals of polyadenylation at the 3' terminus of a pre-mRNA,
    Cleavage of the pre-mRNA,
    Addition of up to 250 adenosine residues (named the poly(A) tail).

36 Precleavage Complex 35

37 Proteins Involved
  CPSF (blue): cleavage and polyadenylation specificity factor; binds to the AAUAAA motif and interacts with PAP and CstF.
  CstF (brown): cleavage stimulation factor; binds to the GT/T-rich element (at the DNA level).
  CF I and II (gray): cleavage factors I and II, required for cleavage.
  RNA polymerase II CTD (carboxyl-terminal domain): stimulates the cleavage reaction.
  PAP (orange): poly(A) polymerase; initiates poly(A) synthesis, yielding an oligo(A) at least 10 nt long.
  PAB II (yellow): poly(A)-binding protein II; required for the elongation of poly(A) in mammals.

38 Model for Polyadenylation 37


40 Authentic Signals of Polyadenylation
  There are two major signals:
  PolyA signal (PAS)
    Located upstream of the cleavage/polyadenylation site.
    A highly conserved hexamer AAUAAA (and the common variant AUUAAA).
    Recognized by the cleavage and polyadenylation specificity factor (CPSF).
  Downstream element (DE)
    Located downstream of the cleavage/polyadenylation site.
    Consists of a much less well-characterized U-rich or G-U-rich sequence.
    Recognized by the cleavage stimulation factor (CstF).
  Cleavage then occurs between these two signals, as directed by two cleavage factors, CF Im and CF IIm.
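Locating candidate PolyA signals amounts to scanning for the two hexamers named above. The sketch below is illustrative only (the name `find_pas` is an assumption); a hit is merely a candidate, since the same hexamers also occur as pseudo-signals inside exons and introns.

```python
import re

# Hexamers named on the slide: AAUAAA and the common variant AUUAAA.
# The lookahead lets overlapping occurrences be reported as well.
PAS_PATTERN = re.compile(r"(?=(AAUAAA|AUUAAA))")


def find_pas(rna: str):
    """Return (0-based position, hexamer) for every candidate PolyA signal."""
    return [(m.start(), m.group(1)) for m in PAS_PATTERN.finditer(rna.upper())]
```

For example, `find_pas("GGAAUAAACCAUUAAAGG")` reports AAUAAA at position 2 and AUUAAA at position 10.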

42 Polyadenylation site, Window = [-200, +206] [Plot: proportion of A, U, G, C, G+C and G+U at each position.] 41

43 Biological data are Noisy!! 42

44 Variation in Nucleotide Positions of a Signal Variation is from evolution of organisms. Transcriptional and post-transcriptional factors can still recognize signals. Inter-relation between nucleotide positions of a signal induces the recognition process by DNA-protein interaction. 43

45 Hidden Encoding and Channel Models Are there hidden legitimate codeword(s) for representing a signal? What is the codeword length for that signal? What is the stochastic mechanism (the channel) between the hidden legitimate codeword(s) and the observed diversified DNA segments for a signal? 44

46 Inter-dependency among Base Positions in Splice Signals
  Not sufficiently addressed by previous models such as
    Weight matrix model (WMM) (Standen, 1984)
    Weight array model (WAM) (Zhang and Marr, 1993)
    Maximal dependence decomposition (MDD) (Burge and Karlin, 1997)
    Tree model (Cai et al., 2000)
  Potential models?
    Higher-order Markov chains
    Dependency graphs
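For reference, the simplest of these baselines, the weight matrix model, treats every position independently. The sketch below is a generic illustration rather than the implementation evaluated in the talk; it assumes that "Laplace's rule" means add-one smoothing of the counts, and the function names are hypothetical.

```python
import math

ALPHABET = "ATCG"


def train_wmm(sequences, laplace=True):
    """Estimate per-position base probabilities from aligned, equal-length sequences."""
    pseudo = 1 if laplace else 0  # assumption: Laplace's rule = add-one smoothing
    model = []
    for pos in range(len(sequences[0])):
        counts = {base: pseudo for base in ALPHABET}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        model.append({base: counts[base] / total for base in ALPHABET})
    return model


def log_likelihood(model, seq):
    """Log-probability of seq when positions are treated as independent."""
    return sum(math.log(column[base]) for column, base in zip(model, seq))


def log_odds(true_model, false_model, seq):
    """Positive values favour the true-signal model over the pseudo-signal model."""
    return log_likelihood(true_model, seq) - log_likelihood(false_model, seq)
```

Because each position contributes its own term, a WMM cannot express that the base at one position changes what is likely at another, which is the limitation the dependency-graph models target.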

47 Test of Dependency and Chi-square Statistics
  Question: how to find the dependency (strength of inter-relation) between the positions in a splice signal?
  Table 1: A contingency table for signals in DNA sequence.
    s_i \ s_j   A      T      C      G      Total
    A           Y_11   Y_12   Y_13   Y_14   Y_1c
    T           Y_21   Y_22   Y_23   Y_24   Y_2c
    C           Y_31   Y_32   Y_33   Y_34   Y_3c
    G           Y_41   Y_42   Y_43   Y_44   Y_4c
    Total       Y_r1   Y_r2   Y_r3   Y_r4   Y

48 Test of Dependency and Chi-square Statistics
  Chi-square test statistic:
  $$\chi^2(X_i, X_j) = \sum_{m=1}^{4} \sum_{n=1}^{4} \frac{(Y_{mn} - E_{mn})^2}{E_{mn}}, \qquad \text{where } E_{mn} = \frac{Y_{mc}\, Y_{rn}}{Y}.$$
  $$P(\text{null hypothesis is rejected when it is true}) = P\big(\chi^2(X_i, X_j) > K \mid \text{null hypothesis}\big) = \alpha,$$
  where $\alpha$ is a numerical value for the Type I error of the test.
  If $\chi^2(X_i, X_j)$ is greater than a critical point $K$, the two positions are said to have strong dependency.
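The statistic can be computed directly from a set of aligned signal sequences. This is a minimal sketch with illustrative function names; cells whose expected count is zero (an empty row or column) are skipped, which is an assumption about how such cells should be handled. For a full 4 x 4 table the statistic is compared against a chi-square distribution with (4-1)(4-1) = 9 degrees of freedom.

```python
ALPHABET = "ATCG"


def contingency_table(sequences, i, j):
    """Count Y[m][n]: sequences with base m at position i and base n at position j."""
    index = {base: k for k, base in enumerate(ALPHABET)}
    table = [[0] * 4 for _ in range(4)]
    for seq in sequences:
        table[index[seq[i]]][index[seq[j]]] += 1
    return table


def chi_square(table):
    """Chi-square statistic: sum over cells of (Y_mn - E_mn)^2 / E_mn."""
    row_totals = [sum(row) for row in table]        # Y_mc
    col_totals = [sum(col) for col in zip(*table)]  # Y_rn
    total = sum(row_totals)                         # Y
    stat = 0.0
    for m, row in enumerate(table):
        for n, observed in enumerate(row):
            expected = row_totals[m] * col_totals[n] / total  # E_mn
            if expected > 0:  # skip cells in an empty row or column
                stat += (observed - expected) ** 2 / expected
    return stat
```

Two positions that always carry the same base give a large statistic, while positions whose bases occur in every combination equally often give zero.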

49 The Chi-square Statistics for the TATA Box Signal [Table: pairwise chi-square statistics between positions X-5, ..., X-1, X+1, ..., X+8.]

50 Dependency Graph for the TATA Box Signal

51 Dependency Graph for the TSS Signal 50

52 Dependency Graph for Donor Site 51

53 Difficulty for Statistical Reasoning with Dependency Graphs There are always cycles in a dependency graph. 52

54 A Remedy Expanding a dependency graph into a Bayesian network. 53

55 Expanded Bayesian Network for the TATA Box Signal 54

57 Expanded Bayesian Network for Donor Site [Figure: layered network rooted at D-2 (layer 0), with the donor-site positions D-9 to D+9 repeated across layers 1 to 8.] 56

58 Datasets for TATA Model Training 862 human non-redundant and experimentally verified TATA box sequences were extracted from NCBI to form a true dataset. Pseudo-TATA signals are retrieved from the exon and intron regions of the 862 genes to form a false dataset. 57

59 Datasets for TSS Model Training 1430 human non-redundant promoter sequences were extracted from EPD78 and BLASTed against the human genome to form a true dataset. False TSS signal sequences are obtained at random from the exon and intron regions of the 1430 genes. 58

60 Datasets for Splice Site Model Training We extract a collection of real and pseudo splice sites from a set of 462 annotated multiple-exon human genes. Table 2: Number of genes, true and pseudo splice sites in the dataset (columns: genes; true donor and acceptor sites; false donor and acceptor sites). We exclude the splice sites that contain base positions labelled not with A, T, C or G but with other symbols. 59

61 Datasets for PolyA Signal Model Training 2923 polyA signal sequences with AAUAAA are retrieved from GenBank to form a true dataset. Pseudo-polyA signals with AAUAAA are retrieved from the exon and intron regions of the 2923 genes to form a false dataset. 60

62 Five-fold Cross-Validation. The models are cross-validated by randomly partitioning the dataset into five subsets. Each subset (the testing data) is then tested with the parameters trained on the other four subsets (the training data) under the splice site models, and the five predictive accuracy measures corresponding to the five testing/training data pairs are averaged. The training data are also evaluated, in the same manner, with the models trained on them. 61
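The partition-and-average procedure can be sketched as follows. This is a generic illustration, not the code used for the reported results; `train` and `evaluate` are hypothetical callables supplied by the caller, and the fixed `seed` is only there to make the random partition repeatable.

```python
import random


def five_fold_splits(items, k=5, seed=0):
    """Yield (training, testing) pairs from a random partition into k subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]  # k disjoint subsets
    for f in range(k):
        testing = folds[f]
        training = [x for g in range(k) if g != f for x in folds[g]]
        yield training, testing


def cross_validate(items, train, evaluate, k=5, seed=0):
    """Average the accuracy measure returned by evaluate over the k splits."""
    scores = [
        evaluate(train(training), testing)
        for training, testing in five_fold_splits(items, k, seed)
    ]
    return sum(scores) / len(scores)
```

Each item lands in exactly one testing subset, so every sequence is scored once by a model that never saw it during training.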

63 Measures for Predictive Accuracy
  Actual positive (AP), actual negative (AN), predicted positive (PP), predicted negative (PN):
            AP    AN
      PP    TP    FP
      PN    FN    TN
  AP = TP + FN, AN = FP + TN, PP = TP + FP, PN = FN + TN
  False negative rate: $\text{FN rate} = \dfrac{\#FN}{\#TP + \#FN}$
  False positive rate: $\text{FP rate} = \dfrac{\#FP}{\#TN + \#FP}$

64 Measures for Predictive Accuracy (Cont.)
  Sensitivity: $\text{Sensitivity} = 1 - \text{false negative rate} = \dfrac{\#TP}{\#TP + \#FN} = \dfrac{\#TP}{\#AP}$
  Specificity: $\text{Specificity} = 1 - \text{false positive rate} = \dfrac{\#TN}{\#TN + \#FP} = \dfrac{\#TN}{\#AN}$
  Positive predictive value (PPV): $\text{PPV} = \dfrac{\#TP}{\#TP + \#FP} = \dfrac{\#TP}{\#PP}$
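These measures follow directly from the four counts. A minimal sketch, with an illustrative function name and made-up counts in the usage note:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute the slide's accuracy measures from confusion-matrix counts."""
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    pp = tp + fp  # predicted positives
    return {
        "fn_rate": fn / ap,
        "fp_rate": fp / an,
        "sensitivity": tp / ap,  # equals 1 - fn_rate
        "specificity": tn / an,  # equals 1 - fp_rate
        "ppv": tp / pp,
    }
```

With hypothetical counts TP = 80, FP = 10, FN = 20 and TN = 90, the sensitivity is 0.8, the specificity is 0.9 and the PPV is 80/90.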

65 Results and Comparison 64

66 TATA Signal [Plot: TATA (Testing Data), Dependency Graph Models; axes: False Positive Rate (%) vs. False Negative Rate (%).] Figure 1: Comparison of prediction accuracy of 6 dependency graph models for the training data of TATA box corresponding to 6 different windows. 65

67 [Plot: TATA (Training Data), Dependency Graph Models; axes: False Positive Rate (%) vs. False Negative Rate (%).] Figure 2: Comparison of prediction accuracy of 6 dependency graph models for the testing data of TATA box corresponding to 6 different windows. 66

68 TSS Site [Plot: TSS (Training Data), Dependency Graph Models; axes: False Positive Rate (%) vs. False Negative Rate (%).] Figure 3: Comparison of prediction accuracy of 6 dependency graph models for the training data of TSS corresponding to 6 different windows with the same right edge. 67

69 [Plot: TSS (Testing Data), Dependency Graph Models; axes: False Positive Rate (%) vs. False Negative Rate (%).] Figure 4: Comparison of prediction accuracy of 6 dependency graph models for the testing data of TSS corresponding to 6 different windows with the same right edge. 68

70 [Plot: TSS (Training Data), Dependency Graph Models; axes: False Positive Rate (%) vs. False Negative Rate (%).] Figure 5: Comparison of prediction accuracy of 6 dependency graph models for the training data of TSS corresponding to 6 different windows with the same left edge. 69

71 [Plot: TSS (Testing Data), Dependency Graph Models; axes: False Positive Rate (%) vs. False Negative Rate (%).] Figure 6: Comparison of prediction accuracy of 6 dependency graph models for the testing data of TSS corresponding to 6 different windows with the same left edge. 70

72 Donor Site
  [Plot: Donor site (Training Data); axes: False Positive Rate (%) vs. False Negative Rate (%).]
  Legend:
    Zero-order Markov chain model (WMM), window = [-6, +9], with Laplace's rule
    1st-order Markov chain model (WAM), window = [-9, +9], with Laplace's rule
    2nd-order Markov chain model, window = [-9, +9], with Laplace's rule
    3rd-order Markov chain model, window = [-3, +7], with Laplace's rule
    MDD, window = [-6, +15], with Laplace's rule
    Cai tree model, window = [-9, +9], with Laplace's rule
    EBN model (at most 1 parent), window = [-9, +9], α = 10^-8, with Laplace's rule
    EBN model (at most 2 parents), window = [-9, +9], α = 10^-8, with Laplace's rule
    EBN model (at most 3 parents), window = [-6, +15], α = 10^-1, with Laplace's rule
  Figure 7: Comparison of predictive accuracy for the training data of the donor site under WMM, WAM, MDD, Cai's tree, the 2nd-order Markov chain, the 3rd-order Markov chain and the expanded Bayesian network with at most 1, 2 and 3 parents prediction models.

73 [Plot: Donor site (Testing Data); axes: False Positive Rate (%) vs. False Negative Rate (%); same models, windows and settings as for the training data.]
  Figure 8: Comparison of predictive accuracy for the testing data of the donor site under WMM, WAM, MDD, Cai's tree, the 2nd-order Markov chain, the 3rd-order Markov chain and the expanded Bayesian network with at most 1, 2 and 3 parents prediction models.

74 Acceptor Site
  [Plot: Acceptor site (Training Data); axes: False Positive Rate (%) vs. False Negative Rate (%).]
  Legend:
    Zero-order Markov chain model (WMM), window = [-27, +3], without Laplace's rule
    1st-order Markov chain model (WAM), window = [-27, +9], without Laplace's rule
    2nd-order Markov chain model, window = [-27, +9], with Laplace's rule
    3rd-order Markov chain model, window = [-27, +3], with Laplace's rule
    MDD, window = [-27, +9], with Laplace's rule
    Cai tree model, window = [-27, +9], without Laplace's rule
    EBN model (at most 1 parent), window = [-27, +3], α = 10^-8, without Laplace's rule
    EBN model (at most 2 parents), window = [-27, +9], α = 10^-3, without Laplace's rule
    EBN model (at most 3 parents), window = [-27, +3], α = 10^-3, with Laplace's rule
  Figure 9: Comparison of predictive accuracy for the training data of the acceptor site under WMM, WAM, MDD, Cai's tree, the 2nd-order Markov chain, the 3rd-order Markov chain and the expanded Bayesian network with at most 1, 2 and 3 parents prediction models.

75 [Plot: Acceptor site (Testing Data); axes: False Positive Rate (%) vs. False Negative Rate (%); same models, windows and settings as for the training data.]
  Figure 10: Comparison of predictive accuracy for the testing data of the acceptor site under WMM, WAM, MDD, Cai's tree, the 2nd-order Markov chain, the 3rd-order Markov chain and the expanded Bayesian network with at most 1, 2 and 3 parents prediction models.

76 PolyA Signal [Plot: PAS, Training (5-fold cross-validation), Window = [-90, +96]; methods: DG, WMM, WAM, SMC; axes: FP rate (%) vs. FN rate (%) in training data.] Figure 11: Comparing accuracy between different methods using training data (in region [-90, +96]). 75

77 [Plot: PAS, Testing (5-fold cross-validation), Window = [-90, +96]; methods: DG, WMM, WAM, SMC, ERPIN, POLYAH; axes: FP rate (%) vs. FN rate (%) in testing data.] Figure 12: Comparing accuracy between different methods using testing data (in region [-90, +96]). 76

78 Discussion Dependency graph models with their expanded Bayesian networks seem to be sufficient to address the intrinsic cyclic inter-dependency between base positions in a biological signal site. 77

79 Content Sensors for Gene Structure Prediction 78

80 Content Sensors
  Exon sensor: to detect the coding regions, i.e., the regions where exons reside.
    We have constructed a dependency graph with a window size of 9 nucleotides and then created an expanded Bayesian network for the exon sensor.
  Intron sensor: to detect the non-coding regions, i.e., the regions where introns reside.
    We have also constructed a dependency graph with a window size of 9 nucleotides and then created an expanded Bayesian network for the intron sensor.

81 Gene Structure Prediction by Stochastic Grammar
  A stochastic grammar (encoding process) describes the alternating exon and intron structure of a gene.
  A state diagram and the corresponding trellis diagram with proper states are created to represent the stochastic grammar (encoding process).
  Each state is created by a dependency graph with its expanded Bayesian network.
  The Viterbi algorithm (a dynamic programming algorithm) is used to decode a path on the trellis diagram to determine the exon-intron structure of a gene.
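The decoding step can be illustrated on a deliberately simplified trellis. The sketch below is not the model from the talk: it uses only two states (E for exon, I for intron) and plain per-base emission tables in place of the dependency-graph sensors, and every probability in it is a made-up toy value.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state path through the trellis (dynamic programming)."""
    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Toy parameters (hypothetical): GC-rich exons, AT-rich introns, sticky states.
STATES = "EI"
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {a: {b: math.log(0.9 if a == b else 0.1) for b in STATES} for a in STATES}
LOG_EMIT = {
    "E": {base: math.log(p) for base, p in zip("ATCG", (0.1, 0.1, 0.4, 0.4))},
    "I": {base: math.log(p) for base, p in zip("ATCG", (0.4, 0.4, 0.1, 0.1))},
}


def decode(dna):
    """Label each base of dna as exon (E) or intron (I) under the toy model."""
    return viterbi(dna, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

Under these toy parameters `decode("GCGCGCATATAT")` returns `"EEEEEEIIIIII"`. Working in log-probabilities avoids numerical underflow on long sequences.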


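83 State Diagram [Figure: state diagram linking the intron state I and the exon states E0, E1 and E2 through the acceptor-site states A-27, ..., A-2, A-1 (ag), A+1, ..., A+9 and the donor-site states D-3, D-2, D-1, D+1, D+2 (gt), ..., D+15.] 82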

84 The trellis diagram 83

85 [Figure: one section of the trellis diagram, with the acceptor-site states A-27 to A+9, the exon states E, the donor-site states D-3 to D+15 and the intron state I.] 84

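86 Maximum a Posteriori (MAP) Decoding
  $\Gamma = \gamma_1 \gamma_2 \cdots \gamma_N$: state sequence
  $S = s_1 s_2 \cdots s_N$: DNA sequence
  $$\mathcal{S} = \begin{pmatrix} s^{(1)}_1 & s^{(1)}_2 & \cdots & s^{(1)}_N \\ s^{(2)}_1 & s^{(2)}_2 & \cdots & s^{(2)}_N \\ \vdots & \vdots & & \vdots \\ s^{(k)}_1 & s^{(k)}_2 & \cdots & s^{(k)}_N \end{pmatrix}$$

87 Maximum a Posteriori (MAP) Decoding
  $$\hat{\Gamma} = \arg\max_{\Gamma} P(\Gamma \mid \mathcal{S}) = \arg\max_{\Gamma} \prod_{i=1}^{N} P(\gamma_i \mid \gamma_1, \ldots, \gamma_{i-1}, \mathcal{S}).$$
  Assumptions:
  (1) $P(\gamma_i \mid \gamma_1, \ldots, \gamma_{i-1}, \mathcal{S}) = P(\gamma_i \mid \mathcal{S}) = P(\gamma_i \mid \mathcal{S}(i, \gamma_i)) = \dfrac{P(\gamma_i)\, P(\mathcal{S}(i, \gamma_i) \mid \gamma_i)}{P(\mathcal{S}(i, \gamma_i))}.$
  (2) $P(\mathcal{S}(i, \gamma_i)) = \prod_{l} P(s^{(j)}_l), \quad s^{(j)}_l \in \mathcal{S}(i, \gamma_i).$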

88 Viterbi Algorithm - The Best Path [Figure: trellis over the states A+9, E0, E1, E2, D-3, D-2 and D-1, with the best path marked.] 87

89 Measures for Predictive Accuracy
  Sensitivity: $\text{Sensitivity} = 1 - \text{false negative rate} = \dfrac{\#TP}{\#TP + \#FN} = \dfrac{\#TP}{\#AP}$
  Specificity: $\text{Specificity} = 1 - \text{false positive rate} = \dfrac{\#TN}{\#TN + \#FP} = \dfrac{\#TN}{\#AN}$
  Positive predictive value (PPV): $\text{PPV} = \dfrac{\#TP}{\#TP + \#FP} = \dfrac{\#TP}{\#PP}$

90 Nucleotide-level Accuracy
  Total test genes: 462
            AP        AN
      PP    TP =      FP = 7332
      PN    FN =      TN =
  Sensitivity:    Specificity:    P.P.V.:

91 Nucleotide-level Accuracy with Start Codon and Stop Codon
  Total test genes: 462
            AP            AN
      PP    TP =          FP = 7584
      PN    FN = 72675    TN =
  Sensitivity:    Specificity:    P.P.V.:

92 Exon-level Accuracy with Start Codon and Stop Codon
  Total actual exons: 2843 (exactly predicted, partially predicted, overlapped, missed)
  Total predicted exons: 2713 (exact, partial, overlapping, wrong)
  Sensitivity:    ME:    P.P.V.:    WE:

93 Prediction of Protein Structure 92

94 Protein 3D structure From H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne: The Protein Data Bank. Nucleic Acids Research, 28 pp (2000) 93

95 Levels of Protein Structure Primary structure: the amino acid sequence Secondary structure: the segmentation of an amino acid sequence into α-helices, β-sheets, and loops Tertiary structure: domains and folds 94

96 Different Mapping for Protein Secondary Structure
  RES   KENITNLDACITRLRVSVADVSKVDQAGLKKLG
  DSSP  STTBCEEEECSSCEEEEESCGGGCCHHHHHHTT
  EHL   CCCECEEEECCCCEEEEECCHHHCCHHHHHHCC
  CK    CCCCCEEEECCCCEEEEECCCCCCCHHHHHHCC
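The two reductions can be read off the example rows: EHL sends H and G to helix and E and B to strand, while CK keeps only H and E. The sketch below encodes exactly that reading; sending every other DSSP symbol (including ones that do not occur in the example) to coil is an assumption, as are the function names.

```python
def reduce_dssp(dssp: str, helix: str, strand: str) -> str:
    """Map DSSP symbols to three states: H (helix), E (strand), C (everything else)."""
    reduced = []
    for symbol in dssp:
        if symbol in helix:
            reduced.append("H")
        elif symbol in strand:
            reduced.append("E")
        else:
            reduced.append("C")
    return "".join(reduced)


def ehl_mapping(dssp: str) -> str:
    """EHL mapping as inferred from the example: H, G -> H and E, B -> E."""
    return reduce_dssp(dssp, helix="HG", strand="EB")


def ck_mapping(dssp: str) -> str:
    """CK mapping as inferred from the example: H -> H and E -> E."""
    return reduce_dssp(dssp, helix="H", strand="E")
```

Applied to the DSSP row of the example, the two functions reproduce the EHL and CK rows.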

97 Prediction of Protein Secondary Structure
  $C$: the set of all eligible protein secondary structure sequences, i.e., the code of protein secondary structure sequences.
  $C$ is assumed to be represented by a trellis, i.e., to be a trellis code.
  Each protein secondary structure sequence $Q = (Q_1, Q_2, \ldots, Q_L)$ of length $L$ in $C$ corresponds to a unique (state) path $X = (X_0, X_1, \ldots, X_{L-4})$ in the trellis from the initial state $X_0$ to the final state $X_{L-4}$.
  For the purpose of representing interactions among amino acids in a protein chain, we define
  $$\mathcal{O} = \begin{pmatrix} O^{(1)}_1 & O^{(1)}_2 & \cdots & O^{(1)}_L \\ O^{(2)}_1 & O^{(2)}_2 & \cdots & O^{(2)}_L \\ \vdots & \vdots & & \vdots \\ O^{(k)}_1 & O^{(k)}_2 & \cdots & O^{(k)}_L \end{pmatrix}$$
  as the observation template of the unconnected Bayesian network of $k$ layers. The template $\mathcal{O}$ of $k$ layers is fully defined given the observed amino acid sequence $O = (O_1, O_2, \ldots, O_L)$ by assigning $O^{(j)}_i = O_i$ for all depths $j$ and all $i$.
  Maximum a posteriori decoding chooses the sequence with the largest posterior probability,
  $$\hat{Q} = \arg\max_{Q} P(Q \mid \mathcal{O}),$$
  as the output, where
  $$\begin{aligned} P(Q \mid \mathcal{O}) &= P(Q_1, Q_2, \ldots, Q_L \mid \mathcal{O}) \\ &= P(Q_1, Q_2, Q_3 \mid \mathcal{O})\, P(Q_4, Q_5, \ldots, Q_L \mid \mathcal{O}, Q_1, Q_2, Q_3) \\ &= P(Q_1, Q_2, Q_3 \mid \mathcal{O}) \left[ \prod_{i=4}^{L-2} P(Q_i \mid \mathcal{O}, Q_{i-1}) \right] P(Q_{L-1}, Q_L \mid \mathcal{O}, Q_{L-2}). \end{aligned}$$

100 For the initial states:
  $$P(Q_1, Q_2, Q_3 \mid \mathcal{O}) = P(Q_1, Q_2, Q_3 \mid \mathcal{O}_{[1:l_I]}) = \frac{P(Q_1, Q_2, Q_3)\, P(\mathcal{O}_{[1:l_I]} \mid Q_1, Q_2, Q_3)}{P(\mathcal{O}_{[1:l_I]})},$$
  where $l_I$ is the window size of an initial state.
  For the intermediate states:
  $$P(Q_i \mid \mathcal{O}, Q_{i-1}) = P(Q_i \mid \mathcal{O}_{[i-l_L:i+l_R-1]}, Q_{i-1}) = \frac{P(Q_i \mid Q_{i-1})\, P(\mathcal{O}_{[i-l_L:i+l_R-1]} \mid Q_i, Q_{i-1})}{P(\mathcal{O}_{[i-l_L:i+l_R-1]} \mid Q_{i-1})},$$
  where $l_L$ and $l_R$ determine the start and the end of the window of an internal state, respectively.

101 For the terminal states:
  $$P(Q_{L-1}, Q_L \mid \mathcal{O}, Q_{L-2}) = P(Q_{L-1}, Q_L \mid \mathcal{O}_{[L-l_T+1:L]}, Q_{L-2}) = \frac{P(Q_{L-1}, Q_L \mid Q_{L-2})\, P(\mathcal{O}_{[L-l_T+1:L]} \mid Q_{L-2}, Q_{L-1}, Q_L)}{P(\mathcal{O}_{[L-l_T+1:L]} \mid Q_{L-2})},$$
  where $l_T$ is the window size of a terminal state.
  The decoding can be carried out by applying the Viterbi algorithm to compute $\hat{Q} = \arg\max_{Q} P(Q \mid \mathcal{O})$ for a given amino acid sequence $O$.

102 Factor Graph for Protein Secondary Structure Prediction
  Since the observed amino acid sequence $O$ is fixed in the annotating (decoding) problem, the observation template $\mathcal{O}$ is also considered fixed. The global function of this annotating (decoding) problem is selected as the APP
  $$g(X_0, \ldots, X_{L-4}) \triangleq P(X_0, X_1, \ldots, X_{L-4} \mid \mathcal{O}).$$
  The global function $g$ can be further factored into a product of several local functions:
  $$\begin{aligned} g(X_0, \ldots, X_{L-4}) &= P(X_0 \mid \mathcal{O})\, P(X_1 \mid X_0, \mathcal{O}) \cdots P(X_{L-4} \mid X_0, X_1, \ldots, X_{L-5}, \mathcal{O}) \\ &= P(X_0 \mid \mathcal{O})\, P(X_1 \mid \mathcal{O}) \cdots P(X_{L-4} \mid \mathcal{O}) \prod_{i=1}^{L-4} I_i(X_{i-1}, X_i) \\ &= f_1(X_0)\, f_2(X_0, X_1)\, f_3(X_1)\, f_4(X_1, X_2) \cdots f_{2L-9}(X_{L-5})\, f_{2L-8}(X_{L-5}, X_{L-4})\, f_{2L-7}(X_{L-4}), \end{aligned}$$
  where $f_{2i}(X_{i-1}, X_i) = I_i(X_{i-1}, X_i)$ is an indicator function of the local behavior of the $i$th trellis section that constrains the possible combinations of $X_{i-1}$ and $X_i$.

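104 For $f_1(X_0)$:
  $$f_1(X_0) = P(X_0 \mid \mathcal{O}) = P(X_0 \mid \mathcal{O}_{[1:l_I]}) = \frac{P(X_0)\, P(\mathcal{O}_{[1:l_I]} \mid X_0)}{P(\mathcal{O}_{[1:l_I]})}.$$
  For $f_{2i+1}(X_i)$, $1 \le i \le L-5$:
  $$f_{2i+1}(X_i) = P(X_i \mid \mathcal{O}) = P(X_i \mid \mathcal{O}_{[(i+3)-l_L:(i+3)+l_R-1]}) = \frac{P(X_i)\, P(\mathcal{O}_{[(i+3)-l_L:(i+3)+l_R-1]} \mid X_i)}{P(\mathcal{O}_{[(i+3)-l_L:(i+3)+l_R-1]})}.$$

105 For $f_{2L-7}(X_{L-4})$:
  $$f_{2L-7}(X_{L-4}) = P(X_{L-4} \mid \mathcal{O}) = P(X_{L-4} \mid \mathcal{O}_{[L-l_T+1:L]}) = \frac{P(X_{L-4})\, P(\mathcal{O}_{[L-l_T+1:L]} \mid X_{L-4})}{P(\mathcal{O}_{[L-l_T+1:L]})}.$$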

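106 The factor graph for the annotating problem is as follows: [Figure: a chain of variable vertices $X_0, X_1, X_2, \ldots, X_{L-5}, X_{L-4}$; the function vertices $f_2, f_4, \ldots, f_{2L-8}$ connect adjacent variables, and the function vertices $f_1, f_3, f_5, \ldots, f_{2L-9}, f_{2L-7}$ each attach to a single variable.] 105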

107 The Sum-Product Update Rule
  The message sent between one variable vertex and one function vertex in the sum-product algorithm is updated according to the following sum-product update rule.
  Variable vertex to function vertex:
  $$\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x).$$
  Function vertex to variable vertex:
  $$\mu_{f \to x}(x) = \sum_{\sim\{x\}} \Big( f(N_f) \prod_{y \in N_f \setminus \{x\}} \mu_{y \to f}(y) \Big),$$
  where the sum runs over all variables in $N_f$ except $x$.

108 When the factor graph is cycle-free, the sum-product algorithm is guaranteed to give
  $$\sum_{\sim\{v\}} g(V) = \prod_{f \in n(v)} \mu_{f \to v}(v), \qquad v \in V,$$
  after a suitable message passing schedule is done.
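On a chain-shaped, cycle-free factor graph the update rule reduces to one forward and one backward sweep. The sketch below is a generic illustration with hypothetical names and a small toy graph, not the secondary-structure model itself: `unary[i]` plays the role of the single-variable factors and `pairwise[i]` the role of the factors joining adjacent variables. A brute-force enumeration is included as a check.

```python
from itertools import product


def sum_product_chain(unary, pairwise):
    """Normalised marginals on a chain via sum-product message passing."""
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # message into X_i from the factor on its left
    bwd = [[1.0] * k for _ in range(n)]  # message into X_i from the factor on its right
    for i in range(n - 1):
        # Variable-to-function message: product of the other incoming messages.
        out = [unary[i][a] * fwd[i][a] for a in range(k)]
        # Function-to-variable message: sum out the other argument of the factor.
        fwd[i + 1] = [sum(out[a] * pairwise[i][a][b] for a in range(k)) for b in range(k)]
    for i in range(n - 1, 0, -1):
        out = [unary[i][b] * bwd[i][b] for b in range(k)]
        bwd[i - 1] = [sum(pairwise[i - 1][a][b] * out[b] for b in range(k)) for a in range(k)]
    marginals = []
    for i in range(n):
        belief = [unary[i][x] * fwd[i][x] * bwd[i][x] for x in range(k)]
        z = sum(belief)
        marginals.append([v / z for v in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Same marginals by summing the global function over every configuration."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for xs in product(range(k), repeat=n):
        weight = 1.0
        for i, x in enumerate(xs):
            weight *= unary[i][x]
        for i in range(n - 1):
            weight *= pairwise[i][xs[i]][xs[i + 1]]
        for i, x in enumerate(xs):
            totals[i][x] += weight
    z = sum(totals[0])
    return [[v / z for v in row] for row in totals]
```

The message-passing version needs work linear in the chain length, whereas the enumeration grows exponentially; on a cycle-free graph both give the same marginals.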

109 Measure for Secondary Structure Prediction Accuracy
  Let $M_{ij}$ be the number of residues observed in state $i$ and predicted in state $j$, with $i, j \in \{H, E, L\}$. The total number of residues is simply
  $$N = \sum_{i,j} M_{ij}.$$
  The three-state per-residue accuracy $Q_3$ is thus defined as
  $$Q_3 = 100\, \frac{\sum_{i} M_{ii}}{N}.$$
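The definition translates directly into a count over aligned observed and predicted state strings. A minimal sketch, assuming both strings use the three symbols H, E and L; the function name `q3` is illustrative.

```python
def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy, in percent."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    states = "HEL"
    # m[i][j]: number of residues observed in state i and predicted in state j
    m = {i: {j: 0 for j in states} for i in states}
    for obs, pred in zip(observed, predicted):
        m[obs][pred] += 1
    n = sum(sum(row.values()) for row in m.values())
    return 100.0 * sum(m[s][s] for s in states) / n
```

For instance, `q3("HHHEELLL", "HHEEELLH")` has 6 of 8 residues on the diagonal and returns 75.0.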

110 Prediction Results
  Table of prediction results for the PDBselect25 data set with EHL- and CK-mapping in Q3 measure:
    L    EHL(vi)   EHL(sp)   CK(vi)    CK(sp)
         %         62.96%    63.61%    64.91%
         %         63.60%    64.16%    65.35%
         %         63.86%    65.17%    65.84%
  L: number of secondary structures in the initial state or terminal state
  vi: decoded by the Viterbi algorithm
  sp: decoded by the sum-product algorithm

111 Table of prediction results for the CB513 data set with EHL- and CK-mapping in Q3 measure:
    L    EHL(vi)   EHL(sp)   CK(vi)    CK(sp)
         %         60.00%    60.81%    62.28%
         %         59.91%    61.26%    62.62%
         %         58.94%    61.03%    62.24%

112 Current Challenging Problems in Bioinformatics Genomics Novel gene discovery Alternative splicing Gene regulatory networks Cell cycle Development Disease finding Proteomics Prediction of structures and functions of proteins Drug design 111

113 Metabolic pathways Construction of pathways Cellular functions Systems biology 112

114 Systems Biology From Neil Campbell, Jane Reece, and Larry Mitchell, Biology, 5th ed. (Menlo Park, CA: Addison Wesley Longman, 1999) c Addison Wesley Longman, Inc. 113

115 Bioinformatics is Not Just Computer Algorithms!! Information theory Coding theory Communication theory Signal processing and linguistics System (control) theory Statistics, probability theory and stochastic processes Combinatorics and graph theory 114

116 Co-investigators Dr. Wen-Hsiung Li at the Department of Ecology and Evolution, University of Chicago, USA. Te-Ming Chen, Chao-Chung Chang, Chen-Wei Hsu, Yun Lee, Chiung-Wen He at the Department of Electrical Engineering, National Tsing Hua University, Taiwan. 115

117 Thank You Very Much 116
