Lecture 15: Realities of Genome Assembly Protein Sequencing

Study Chapter 8.10-8.15

Euler's Theorems
- A graph is balanced if, for every vertex, the number of incoming edges equals the number of outgoing edges: in(v) = out(v)
- Theorem 1: A connected graph has an Eulerian cycle if and only if each of its vertices is balanced.
  - In mid-tour, for every path onto an island there must be another path off
  - Exceptions are allowed at the start and end of the tour
- Theorem 2: A connected graph has an Eulerian path if and only if it contains exactly two semi-balanced vertices and all others are balanced.
  - Semi-balanced vertex: |in(v) - out(v)| = 1
  - One of the semi-balanced vertices, with out(v) = in(v) + 1, is the start of the tour
  - The other semi-balanced vertex, with in(v) = out(v) + 1, is the end of the tour
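The two theorems translate directly into a degree count. The following is a minimal sketch, assuming the graph is given as a list of directed (u, v) edges and is already known to be connected; the function names `classify_vertices` and `euler_kind` are illustrative, not from the lecture.

```python
from collections import Counter

def classify_vertices(edges):
    """Split vertices into balanced, start-like, end-like, and other."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    balanced, starts, ends, other = [], [], [], []
    for v in sorted(set(indeg) | set(outdeg)):
        diff = outdeg[v] - indeg[v]
        if diff == 0:
            balanced.append(v)   # in(v) = out(v)
        elif diff == 1:
            starts.append(v)     # out(v) = in(v) + 1
        elif diff == -1:
            ends.append(v)       # in(v) = out(v) + 1
        else:
            other.append(v)      # neither balanced nor semi-balanced
    return balanced, starts, ends, other

def euler_kind(edges):
    """Return 'cycle', 'path', or 'none' (connectivity is assumed, not checked)."""
    _, starts, ends, other = classify_vertices(edges)
    if other:
        return "none"
    if not starts and not ends:
        return "cycle"           # Theorem 1: every vertex balanced
    if len(starts) == 1 and len(ends) == 1:
        return "path"            # Theorem 2: exactly two semi-balanced vertices
    return "none"

print(euler_kind([("a", "b"), ("b", "c"), ("c", "a")]))
print(euler_kind([("a", "b"), ("b", "c")]))
```

A full check would also verify connectivity, which this sketch leaves out on purpose to keep the degree conditions visible.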

Eulerian Cycle
- Start at any vertex *v*, and follow a trail of edges until you return to *v*
- As long as there exists any vertex *u* that belongs to the current tour, but has adjacent edges that are not part of the tour:
  - Start a new trail from *u*
  - Follow unused edges until returning to *u*
  - Join the new trail to the original tour
[Figure: a more complicated Königsberg]

Example Problem: Eulerian Path Approach
- S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
- Vertices correspond to (l-1)-mers: { AT, TG, GC, GG, GT, CA, CG }
- Edges correspond to l-mers from S
[Figure: directed graph on the vertices GT, CG, AT, TG, GC, CA, GG]
- Find a path that visits every EDGE once
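The example can be worked through in code. This is a sketch of a stack-based variant of the trail-joining procedure (often called Hierholzer's algorithm), applied to the set S from the slide. The function names `eulerian_path` and `spell` are illustrative, and the code assumes an Eulerian path exists rather than checking for one.

```python
from collections import Counter, defaultdict

def eulerian_path(kmers):
    """Return the vertex sequence of an Eulerian path in the k-mer graph."""
    graph = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]      # edge from prefix to suffix (l-1)-mer
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = sorted(set(indeg) | set(outdeg))
    # Start at the semi-balanced vertex with out = in + 1, if there is one.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), nodes[0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: vertex is finished
    path.reverse()
    return path

def spell(path):
    """Spell the string obtained by walking the vertices of the path."""
    return path[0] + "".join(v[-1] for v in path[1:])

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(spell(eulerian_path(S)))
```

This graph admits more than one Eulerian path, so the printed string is one valid reconstruction rather than the only one. That is the same ambiguity that repeats cause in real assemblies.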

Genome Assembly vs Minimal Superstring
- Minimal superstring problem
  - Every k-mer is known and used as a vertex (all σ^k)
  - Paths, and there may be multiple, are solutions
- Read fragments
  - No guarantee that we will see every k-mer
  - Can't disambiguate repeats

From DNA to Proteins
- DNA sequences are the "OS" that controls living biological systems
- Sections of DNA (genes) encode proteins, like programs
- Triplets of nucleotides (codons) encode the amino-acid sequences, as well as the stop codes, used to assemble proteins
- Complications in going from DNA to protein: introns, RNA editing prior to translation, post-translational modifications
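To make the codon idea concrete, here is a minimal sketch of translation using the standard genetic code. It reads a coding DNA string in frame from its first base and stops at the first stop codon. It deliberately ignores the complications listed above (introns, RNA editing, post-translational modifications), and the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(dna):
    """Translate coding DNA (uppercase A/C/G/T) into one-letter amino acids."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":        # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTAA"))
```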

Proteins
- Proteins are the machinery or hardware
  - Compose the cellular structures
  - Control the biochemical reactions in cells
  - Regulate and trigger the chain reactions (metabolic pathways) that result in the cell's life cycle
  - Determine which parts of the DNA code are activated and executed, and when
- Like DNA, proteins are long molecular chains
  - Sequences of 20 amino acid residues rather than 4 nucleotides

From Genes to Proteins
- The central dogma of molecular biology is that information encoded by the bases of DNA is transcribed into RNA and then translated into proteins

Protein Components
- Proteins are made from 20 amino acids
- Peptide bonds join amino acids into long chains
  - 100s to 1000s of amino acid residues long

Amino Acid      3-Letter  1-Letter  Molecular Weight (Da)
Alanine         Ala       A          89.09
Cysteine        Cys       C         121.16
Aspartate       Asp       D         133.10
Glutamate       Glu       E         147.13
Phenylalanine   Phe       F         165.19
Glycine         Gly       G          75.07
Histidine       His       H         155.16
Isoleucine      Ile       I         131.18
Lysine          Lys       K         146.19
Leucine         Leu       L         131.18
Methionine      Met       M         149.21
Asparagine      Asn       N         132.12
Proline         Pro       P         115.13
Glutamine       Gln       Q         146.15
Arginine        Arg       R         174.20
Serine          Ser       S         105.09
Threonine       Thr       T         119.12
Valine          Val       V         117.15
Tryptophan      Trp       W         204.23
Tyrosine        Tyr       Y         181.19

Protein Assembly
- Amino acids are joined by peptide bonds into long chains
- These chains fold into proteins
- Proteins interact with other proteins and large molecules
[Figure: a peptide chain drawn from its N-terminus to its C-terminus]

Protein Sequencing
- Purify a sample
- Break into pieces
  - Proteases cleave proteins into smaller peptide chains
- Read fragments (the hard part)
  - Edman degradation for short peptide sequences
  - Mass spectrometry measures mass/charge
- Reassemble (relatively easy)

Peptide Fragmentation
- Collision Induced Dissociation
[Figure: protonated (H+) peptide backbone H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH, split into a prefix fragment and a suffix fragment]
- Peptides tend to fragment along the backbone.
- Fragments can also lose neutral chemical groups like NH3 and H2O.

Breaking Peptides into Fragment Ions
- Proteases, e.g. trypsin, break proteins into peptides.
- A tandem mass spectrometer (MS/MS) further breaks the peptides down into fragment ions and measures the mass of each piece.
- The mass spectrometer accelerates the fragmented ions; heavier ions accelerate more slowly than lighter ones.
- The mass spectrometer measures the mass/charge ratio of an ion.

N- and C-terminal Peptides
[Figure: a peptide drawn from its N-terminus (NH2-) to its C-terminus (-CO2H)]
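The protease step can be simulated in a few lines. The slides do not state trypsin's cleavage rule, so this sketch assumes the commonly cited simplified rule: cut after lysine (K) or arginine (R), except when the next residue is proline (P). The function name `trypsin_digest` and the input sequence are illustrative only.

```python
def trypsin_digest(protein):
    """Split a protein string into peptides using a simplified trypsin rule.

    Assumed rule (not from the slides): cleave after K or R unless the
    following residue is P.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # trailing peptide without K/R
    return peptides

# Illustrative input, not a real protein.
print(trypsin_digest("AVGELTKHEWAILFRPGK"))
```

Real digests also produce missed cleavages, which this sketch does not model.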

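Terminal Peptides and Ion Types
- Peptide: mass (Da) 57 + 97 + 147 + 114 = 415
- Peptide without H2O: mass (Da) 57 + 97 + 147 + 114 - 18 = 397

N- and C-terminal Peptides
[Figure: a peptide drawn from its N-terminus (NH2-) to its C-terminus (-CO2H), total mass 486, with complementary terminal fragment masses 415/71, 301/185, 154/332, and 57/429; each pair sums to 486]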
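The masses on these slides follow from simple running sums. The sketch below assumes a five-residue peptide with integer residue masses 57, 97, 147, 114, 71. The first four come from the sum shown on the slide, and the last is inferred from 486 - 415 = 71. The function name `fragment_masses` is illustrative.

```python
from itertools import accumulate

def fragment_masses(residue_masses):
    """Return (prefix masses, complementary suffix masses, total mass)."""
    total = sum(residue_masses)
    prefixes = list(accumulate(residue_masses))[:-1]   # N-terminal pieces
    suffixes = [total - p for p in prefixes]           # matching C-terminal pieces
    return prefixes, suffixes, total

residues = [57, 97, 147, 114, 71]      # last value is an inference, see above
prefixes, suffixes, total = fragment_masses(residues)
for p, s in zip(prefixes, suffixes):
    print(p, "+", s, "=", total)

# A neutral loss of water shifts a fragment down by 18 Da.
print(prefixes[-1] - 18)
```

Each prefix and its complementary suffix add up to the total, which is why the fragment masses in the figure come in pairs.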


N- and C-terminal Peptides
- Reconstruct the peptide from the set of masses of fragment ions (the mass spectrum)
[Figure: fragment masses 57, 71, 154, 185, 301, 332, 415, 429, and 486]

Theoretical Mass Spectrum
- protein = PLAY
- (The amino acid molecular weight table from the Protein Components slide is repeated on this slide.)
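One way to work the PLAY exercise is sketched below. The slide does not say which mass convention it intends, so this version makes two labeled assumptions: a residue's mass is its free amino acid weight from the table minus one water (taken as 18.015 Da), and ion-type offsets are ignored. The names `residue_mass` and `theoretical_spectrum` are illustrative.

```python
WATER = 18.015  # assumed average mass of H2O in Da, lost per peptide bond

# Free amino acid molecular weights (Da), copied from the lecture table.
FREE_WEIGHT = {
    "A": 89.09, "C": 121.16, "D": 133.10, "E": 147.13, "F": 165.19,
    "G": 75.07, "H": 155.16, "I": 131.18, "K": 146.19, "L": 131.18,
    "M": 149.21, "N": 132.12, "P": 115.13, "Q": 146.15, "R": 174.20,
    "S": 105.09, "T": 119.12, "V": 117.15, "W": 204.23, "Y": 181.19,
}

def residue_mass(aa):
    """Mass of an amino acid once incorporated into a chain (assumption above)."""
    return FREE_WEIGHT[aa] - WATER

def theoretical_spectrum(peptide):
    """Sorted masses of all prefix fragments and all proper suffix fragments."""
    masses = [residue_mass(aa) for aa in peptide]
    prefixes = [sum(masses[:i]) for i in range(1, len(masses) + 1)]
    suffixes = [sum(masses[i:]) for i in range(1, len(masses))]
    return sorted(round(m, 2) for m in prefixes + suffixes)

print(theoretical_spectrum("PLAY"))
```

For PLAY this gives seven masses: the prefixes P, PL, PLA, PLAY and the suffixes LAY, AY, Y. A measured spectrum would be compared against a list like this one.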

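Mass Spectra
[Figure: mass spectrum (intensity vs. mass, axis starting at 0) for the peptide G V D L K, annotated with the sequence in both directions; labels include 57 Da = G, 99 Da = V, and an H2O loss]
- The peaks in the mass spectrum:
  - Prefix and suffix fragments
  - Fragments with neutral losses (-H2O, -NH3)
  - Noise and missing peaks

Protein Identification with MS/MS
[Figure: MS/MS spectrum of the peptide G V D L K, mass axis starting at 0]
- MS/MS peptide identification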
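The labels 57 Da = G and 99 Da = V show the basic reading rule: the gap between neighboring peaks of one fragment series names a residue. Below is a minimal sketch, assuming an ideal, complete prefix ladder with integer masses. The peak list is made up for illustration (it is what G V D L K would give), and `RESIDUE_BY_MASS` and `read_ladder` are illustrative names.

```python
# Nominal (integer) residue masses in Da. Residues that share a nominal
# mass cannot be told apart at this resolution, hence "L/I" and "K/Q".
RESIDUE_BY_MASS = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L/I", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def read_ladder(peaks):
    """Label each gap between consecutive peaks with a residue, or '?'."""
    peaks = sorted(peaks)
    return [RESIDUE_BY_MASS.get(hi - lo, "?") for lo, hi in zip(peaks, peaks[1:])]

# Hypothetical noise-free prefix ladder, starting from mass 0.
print(read_ladder([0, 57, 156, 271, 384, 512]))
```

Real spectra mix prefix and suffix series and contain neutral losses, noise, and missing peaks, so a plain ladder like this is rarely available.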

[Figure: example MS/MS spectrum (S#: 1708, RT: 54.47, AV: 1, NL: 5.27E6, Full ms2 638.00, range 165.00-1925.00); relative abundance (0-100) vs. m/z (200-2000); labeled peaks at 226.9, 326.0, 397.1, 425.0, 489.1, 524.9, 588.1, 589.2, 629.0, 687.3, 850.3, 851.4, 949.4, 1048.6, 1049.6]

De Novo vs. Database Search
- Database search: compare the spectrum (mass, score) against a database of known peptides
  MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN, ...
- De novo: the database of all peptides has size 20^n
  AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAAI, ..., AVGELTI, AVGELTK, AVGELTL, AVGELTM, ..., YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY
- Result in both cases: AVGELTK

A Paradox
- The database of all peptides is huge: O(20^n).
- The database of all known peptides is much smaller: O(10^8).
- However, de novo algorithms can be much faster, even though their search space is much larger!
- A database search scans all peptides in the database of known peptides to find the best one.
- De novo eliminates the need to scan the database of all peptides by modeling the problem as a graph search.