De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics



1 De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics
  John R. Rose
  Computer Science and Engineering
  University of South Carolina

2 Overview
  Background
  Information Theoretic Scoring Function
  Test Data Set
  Comparison with Existing Methods
  Conclusions
  Future Work

3 Background
  Analogy:
    Genome corresponds to machine code
    Proteome corresponds to execution of code
  Protein identification is important:
    For drug discovery research
    For the identification of microbes in environmental samples
  Approaches using tandem mass spectrometry data:
    Database searching
    De novo sequencing
    Tagging

4 Tandem MS Data
  A peptide is ionized and the peptide bonds are fragmented.
  Fragment ions form peaks in the spectrum corresponding to their mass-to-charge ratio (m/z).
  [Figure: example spectrum, intensity (a.u.) versus m/z]

5 Tandem MS Data
  Fragment ions include a, b, c, x, y, and z ions.
  De novo sequencing focuses on y and b ions:
    y ions contain the carboxyl terminus
    b ions contain the amino terminus

6 Tandem MS Data
  A good quality spectrum consists of a ladder of peaks of the y-ions and a ladder of peaks of the b-ions.
  Ex:
    b-ions    y-ions
    F         GLSLVR
    FG        LSLVR
    FGL       SLVR
    FGLS      LVR
    FGLSL     VR
    FGLSLV    R
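The two ladders on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not code from the talk: the residue masses are standard monoisotopic values, the ions are assumed to be singly charged, and `ion_ladders` is a name chosen here for illustration.

```python
# Monoisotopic residue masses in Da (standard values, not taken from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def ion_ladders(peptide):
    """Return (b_ions, y_ions) as lists of (fragment, m/z), assuming charge +1."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        prefix, suffix = peptide[:i], peptide[i:]
        # b ion: N-terminal fragment, residue masses plus one proton
        b_mz = sum(RESIDUE_MASS[a] for a in prefix) + PROTON
        # y ion: C-terminal fragment, residue masses plus water plus one proton
        y_mz = sum(RESIDUE_MASS[a] for a in suffix) + WATER + PROTON
        b_ions.append((prefix, b_mz))
        y_ions.append((suffix, y_mz))
    return b_ions, y_ions


b_ions, y_ions = ion_ladders("FGLSLVR")
for (b_frag, b_mz), (y_frag, y_mz) in zip(b_ions, y_ions):
    print(f"{b_frag:<7} {b_mz:9.4f}   {y_frag:<7} {y_mz:9.4f}")
```

Each row pairs a b-ion with its complementary y-ion, so the two m/z values in a row always add up to the same total (the peptide mass plus two protons).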

7 Approaches to peptide identification Frank et al. JPR

8 De Novo Sequencing
  Data: tandem MS spectrum
  Goal: find the corresponding peptide
  General approach:
    Identify y and/or b ions
    Propose candidate peptides
    Score each candidate
    Return highest ranking peptides
  Two key issues:
    Model for candidate peptide generation
    Scoring function to evaluate candidates

9 Candidate Peptide Generation
  The peptide sequence can be derived from the mass differences of adjacent peaks in each of the two ladders.
  Ex (peptide IYEVEGMR):
    b-ions     y-ions
    I          YEVEGMR
    IY         EVEGMR
    IYE        VEGMR
    IYEV       EGMR
    IYEVE      GMR
    IYEVEG     MR
    IYEVEGM    R
  Complicating factors:
    Missing peaks
    Posttranslational modifications
    Many-to-one equivalences, e.g., AG, GA, K, Q, E are similar in mass
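Reading residues off a ladder by mass difference can be sketched as follows. This is an illustrative example under stated assumptions, not QuasiNovo's implementation: the peaks are assumed to be a complete, noise-free run of singly charged b-ions, the residue masses are standard monoisotopic values, and `read_ladder` and its `tol` parameter are hypothetical names.

```python
# Monoisotopic residue masses in Da (standard values, not taken from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def read_ladder(peaks, tol=0.02):
    """Label each gap between adjacent ladder peaks with the matching residue(s).

    peaks: ascending m/z values of consecutive ions of one series.
    Returns one string per gap: matching residues joined by "/", or "?" if none.
    """
    calls = []
    for low, high in zip(peaks, peaks[1:]):
        diff = high - low
        matches = sorted(a for a, m in RESIDUE_MASS.items() if abs(m - diff) <= tol)
        calls.append("/".join(matches) if matches else "?")
    return calls


# Build an idealized b-ion ladder for IYEVEGMR, starting from a virtual b0 peak
# (a bare proton) so that the first gap yields the first residue.
b_peaks = [PROTON]
for residue in "IYEVEGM":
    b_peaks.append(b_peaks[-1] + RESIDUE_MASS[residue])

print(read_ladder(b_peaks))
```

The first gap is reported as `I/L` because isoleucine and leucine have identical masses, a simple instance of the many-to-one equivalences listed on the slide. With a looser tolerance, K and Q also become indistinguishable.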

10 Actual example of labeled y and b ion peaks

11 The spectrum graph Frank et al. JPR

12 Construction of the NC-spectrum Graph
  Chen et al., JCB 2001
  Create a pair of nodes, N_j and C_j, for each ion I_j.
  Create two auxiliary nodes, N_0 and C_0, to represent the zero mass and the parent mass, respectively.
  Let V = {N_0, N_1, ..., N_k, C_0, C_1, ..., C_k}.
  Each node x is assigned a coordinate cord(x) according to the total mass of its amino acids, that is,

  \[
  \mathrm{cord}(x) =
  \begin{cases}
  0           & x = N_0 \\
  W - 18      & x = C_0 \\
  w_j - 1     & x = N_j \\
  W - w_j + 1 & x = C_j
  \end{cases}
  \]

  where w_j is the mass of ion I_j and W is the parent mass.
  [Figure: nodes N_0, C_2, C_1, N_1, N_2, C_0 placed along the mass axis]
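The node placement can be written out directly. The sketch below assumes the coordinate rule cord(N_0) = 0, cord(C_0) = W - 18, cord(N_j) = w_j - 1, cord(C_j) = W - w_j + 1, with nominal masses of 18 for water and 1 for hydrogen as on the slide; `nc_coordinates` is a name invented for this example.

```python
def nc_coordinates(ion_masses, parent_mass):
    """Place the 2k + 2 nodes of an NC-spectrum graph on the mass axis.

    ion_masses: observed ion masses w_1..w_k.
    parent_mass: parent mass W.
    Returns a dict mapping node names ("N0", "C0", "N1", "C1", ...) to coordinates.
    """
    coords = {"N0": 0.0, "C0": parent_mass - 18.0}
    for j, w in enumerate(ion_masses, start=1):
        # Interpretation 1: the ion is an N-terminal fragment.
        coords[f"N{j}"] = w - 1.0
        # Interpretation 2: the ion is a C-terminal fragment, so the
        # complementary N-terminal prefix sits at W - w + 1.
        coords[f"C{j}"] = parent_mass - w + 1.0
    return coords


coords = nc_coordinates([30.0], 100.0)
for name, value in sorted(coords.items(), key=lambda item: item[1]):
    print(name, value)
```

Each observed ion contributes two nodes because its type is unknown in advance. A valid sequence should use only one of the two interpretations per ion.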

13 Construction of the NC-spectrum Graph
  [Figure: spectrum plot, Abundance (%) versus Mass / Charge, with the cord(x) formula from slide 12 repeated; nodes shown: N_0, C_0]

14 Construction of the NC-spectrum Graph
  [Figure: same plot; nodes shown: N_0, C_1, N_1, C_0]

15 Construction of the NC-spectrum Graph
  [Figure: same plot; nodes shown: N_0, C_2, C_1, N_1, N_2, C_0]

16 Construction of the NC-spectrum Graph
  [Figure: graph on nodes N_0, C_2, C_1, N_1, N_2, C_0 with edges labeled Mass(S), Mass(W), Mass(R), and Mass(S+W)]

17 Construction of the NC-spectrum Graph
  [Figure: graph on nodes N_0, N_2, C_1, N_1, C_2, C_0]
  Each path from N_0 to C_0 represents a possible sequence for the peptide.
  A feasible path is a path from N_0 to C_0 that goes through exactly one node of each pair (either N_j or C_j).

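18 Construction of the NC-spectrum Graph
  [Figure: highlighted path in the graph on nodes N_0, N_2, C_1, N_1, C_2, C_0]
  This is not a feasible path: it misses ion I_2.

19 Construction of the NC-spectrum Graph
  [Figure: highlighted path in the graph on nodes N_0, N_2, C_1, N_1, C_2, C_0]
  This is a feasible path.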

20 Problem Reformulation
  Input: an NC-spectrum graph G.
  Output: a feasible path from N_0 to C_0.
  Difficulty:
    A longest path does not always go through exactly one of each pair of nodes.
    This is an NP-hard problem if the graph is a general directed graph.

21 Renaming Nodes
  Rename the nodes from left to right as X_0, ..., X_k, Y_k, ..., Y_0.
    N_0   N_2   C_1   N_1   C_2   C_0
    X_0   X_1   X_2   Y_2   Y_1   Y_0
  X_i and Y_i form a complementary pair of nodes for ion i.

22 Problem Reformulation
  Node order: X_0, X_1, ..., X_k, Y_k, ..., Y_1, Y_0
  Let M(i, j) be a two-dimensional matrix with 0 <= i, j <= k.
  Let M(i, j) = 1 if there exists a path L from X_0 to X_i and a path R from Y_j to Y_0, such that L and R together contain exactly one of X_p and Y_p for each p in [0, max{i, j}].
  [Figure: path L running X_0, X_1, X_2, ..., X_i and path R running Y_j, ..., Y_2, Y_1, Y_0]

23 Problem Reformulation
  There is a feasible path if and only if
    for some i and k, there is an edge e from X_i to Y_k and M(i, k) = 1, or
    for some k and j, there is an edge e from X_k to Y_j and M(k, j) = 1.
  [Figure: two cases, X_0 -L-> X_i -e-> Y_k -R-> Y_0 and X_0 -L-> X_k -e-> Y_j -R-> Y_0]
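The slides define M(i, j) and the final feasibility test but do not show how the matrix is filled. The following is a minimal sketch of one way to do it: pairs are processed in order 1..k, and each reachable state is extended either by adding X_m to the left path or Y_m to the right path. The fill order, the function name `has_feasible_path`, and the string node names are assumptions made for this example. The loop is written for clarity rather than speed, and the pair (X_0, Y_0) is treated as always present on both ends.

```python
def has_feasible_path(k, edges):
    """Decide whether a feasible path from X0 to Y0 exists.

    k: number of ions (pairs X1/Y1 .. Xk/Yk, plus the end nodes X0 and Y0).
    edges: set of directed (u, v) node-name pairs, e.g. ("X0", "X1"),
           always pointing from left to right in the order X0..Xk, Yk..Y0.
    """
    def edge(u, v):
        return (u, v) in edges

    # M[i][j] is True when a left path X0..Xi and a right path Yj..Y0 together
    # use exactly one node of every pair 1..max(i, j).
    M = [[False] * (k + 1) for _ in range(k + 1)]
    M[0][0] = True

    for m in range(k):
        nxt = m + 1  # next pair that must be covered
        for i in range(m + 1):
            for j in range(m + 1):
                if not M[i][j] or max(i, j) != m:
                    continue
                if edge(f"X{i}", f"X{nxt}"):   # cover pair nxt on the left
                    M[nxt][j] = True
                if edge(f"Y{nxt}", f"Y{j}"):   # cover pair nxt on the right
                    M[i][nxt] = True

    # Final test from slide 23: a state covering all k pairs plus a bridging edge.
    return any(
        M[i][j] and max(i, j) == k and edge(f"X{i}", f"Y{j}")
        for i in range(k + 1)
        for j in range(k + 1)
    )


feasible = {("X0", "X1"), ("X1", "Y2"), ("Y2", "Y0")}    # uses X1 and Y2
infeasible = {("X0", "X1"), ("X1", "Y1"), ("Y1", "Y0")}  # uses X1 and Y1, skips pair 2
print(has_feasible_path(2, feasible), has_feasible_path(2, infeasible))
```

The second graph contains a path from X0 to Y0, but that path visits both nodes of pair 1 and neither node of pair 2, so it is rejected.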

24 Candidate Peptide Generation
  Complicating factors:
    Posttranslational modifications
    Many-to-one equivalences, e.g., AG, GA, K, Q, E are similar in mass
    Noise peaks
    Missing peaks

25 Candidate Peptide Generation
  Missing peaks:
    Now a many-to-many combinatorial problem
    Ex: ATEEQLK
      If the b_4 ion is missing, then b_3 represents ATE and b_5 represents ATEEQ.
      The mass difference between b_3 and b_5 then covers EQ as a unit, and its composition is unresolved.
      Recall that AG, GA, K, Q, E are similar in mass.
      Thus EQ, QE, AGQ, GAQ, AGE, GAE, ... have similar mass.
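The size of this ambiguity can be seen by enumerating every short residue string whose mass matches the unresolved gap. This is a brute-force sketch for illustration only: the residue masses are standard monoisotopic values, and `explain_gap`, `max_len`, and `tol` are hypothetical names and parameters, not part of QuasiNovo.

```python
from itertools import product

# Monoisotopic residue masses in Da (standard values, not taken from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def explain_gap(gap, max_len=3, tol=0.5):
    """List every residue string of length 1..max_len whose mass is within tol of gap."""
    hits = []
    for n in range(1, max_len + 1):
        for combo in product(RESIDUE_MASS, repeat=n):
            if abs(sum(RESIDUE_MASS[a] for a in combo) - gap) <= tol:
                hits.append("".join(combo))
    return hits


# Gap left between b3 (ATE) and b5 (ATEEQ) when b4 is missing.
gap = RESIDUE_MASS["E"] + RESIDUE_MASS["Q"]
candidates = explain_gap(gap)
print(len(candidates), candidates[:10])
```

With a 0.5 Da tolerance the list includes EQ, QE, AGE, and GAE. Widening the tolerance to 1.0 Da also admits AGQ and GAQ, which differ from EQ by about 1 Da.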

26 Candidate Peptide Evaluation
  Model for candidate generation
  Traditional focus on fragmentation model:
    Increasing fragmentation model sophistication
    Better posttranslational modification models
    No model of peptide amino acid content
  QuasiNovo approach:
    Unsophisticated fragmentation model
    No posttranslational modification model
    Uses information theory to model peptide amino acid content

27 Modeling Peptide Amino Acid Content
  Basic idea: examine actual proteins to characterize likely combinations of amino acids
  Underlying hypothesis: amino acid content is not random
  Analogy: model letter combinations in a language
    Examine documents in that language
    Compile profiles of letter combinations
    Predict missing letters from partial data
  Motivation:
    Ability to distinguish between mass-equivalent combinations
    Ability to deal with missing peaks

28 Amino Acid Distribution Data
  Tabulation of amino acid distributions:
    Let <a_1 a_2 ... a_n> be a contiguous sequence of n amino acids.
    There are n amino acids: <a_1>, <a_2>, ..., <a_n>
    There are n-1 ordered amino acid pairs: <a_1 a_2>, <a_2 a_3>, ..., <a_{n-1} a_n>
    etc.
  QuasiNovo has been evaluated with 3-, 4-, 5-, and 6-tuples.
  Tuple frequencies are then normalized.
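The tabulation can be sketched with a sliding window over each peptide. The slide does not say how the frequencies are normalized, so this example assumes the simplest choice, dividing each count by the total number of tuples of that length. `tuple_frequencies` is a name chosen for illustration.

```python
from collections import Counter


def tuple_frequencies(peptides, n):
    """Count every contiguous n-tuple in the peptides and normalize to frequencies."""
    counts = Counter()
    for peptide in peptides:
        # A peptide of length L contributes L - n + 1 overlapping n-tuples.
        for i in range(len(peptide) - n + 1):
            counts[peptide[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tup: count / total for tup, count in counts.items()}


pairs = tuple_frequencies(["FGLSLVR", "IYEVEGMR"], 2)
print(sorted(pairs.items(), key=lambda item: -item[1])[:5])
```

A peptide of length 7 yields 6 pairs and a peptide of length 8 yields 7, which matches the n-1 ordered pairs stated on the slide.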

29 Amino Acid Distribution Data
  Three amino acid profiles used:
    1. Gammaproteobacteria: 206 complete genomes, 23,882,564 tryptic peptides
    2. Actinobacteria: 58 complete genomes, 7,380,927 tryptic peptides generated
    3. Mammalia: 4 complete genomes (Bovine, Human, Mouse, Rat), 9,835,585 tryptic peptides generated

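30 QuasiNovo's Use of Tuple-Profiles
  Score candidate peptides:

  \[
  \mathrm{score}(\mathrm{FGLSLVR}) = p(\mathrm{SLVR})\, p(\mathrm{L} \mid \mathrm{SLVR})\, p(\mathrm{G} \mid \mathrm{LSLV})\, p(\mathrm{F} \mid \mathrm{GLSL})
  \]

  Discard poor scoring candidates
  Handle missing peaks:
    Find the set of a_i that maximize

  \[
  P(a_i \mid a_{i-4}\, a_{i-3}\, a_{i-2}\, a_{i-1})
  \]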
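A chained score of this shape can be sketched from 4-tuple and 5-tuple counts. This is an illustration under several assumptions, not QuasiNovo's actual scoring code: each conditional probability is estimated as a ratio of tuple counts, add-alpha smoothing is used so that unseen tuples do not give a zero probability, and the score is accumulated in log space. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def count_tuples(peptides, n):
    """Count every contiguous n-tuple in a list of peptides."""
    counts = Counter()
    for peptide in peptides:
        for i in range(len(peptide) - n + 1):
            counts[peptide[i:i + n]] += 1
    return counts


def log_score(candidate, counts4, counts5, alpha=1.0):
    """Log of p(last four residues) times p(residue | next four residues) for each
    earlier residue, walking from the C-terminus toward the N-terminus."""
    total4 = sum(counts4.values())
    tail = candidate[-4:]
    # Smoothed estimate of p(tail); 20**4 is the number of possible 4-tuples.
    score = math.log((counts4[tail] + alpha) / (total4 + alpha * 20 ** 4))
    for i in range(len(candidate) - 5, -1, -1):
        context = candidate[i + 1:i + 5]
        # Smoothed estimate of p(candidate[i] | context) from count ratios.
        score += math.log(
            (counts5[candidate[i:i + 5]] + alpha) / (counts4[context] + alpha * 20)
        )
    return score


training = ["FGLSLVR"] * 5 + ["AAGLSLVR"]
counts4 = count_tuples(training, 4)
counts5 = count_tuples(training, 5)
print(log_score("FGLSLVR", counts4, counts5), log_score("FGISLVR", counts4, counts5))
```

In this toy profile the candidate FGLSLVR scores higher than FGISLVR, even though I and L have the same mass, because only the former matches tuples seen in the training peptides. That is the kind of mass-equivalent ambiguity the amino acid model is meant to resolve.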

31 Test Data Set
  280 spectra of peptides selected by Frank & Pevzner (2005)
    Molecular mass of up to 1400 Da
    Peptides with 7-16 amino acids (average length of 10.5)
    Source: ISB protein mixture data set and Open Proteomics Database
  Data set used to compare PepNovo with:
    Sherenga
    Peaks
    Lutefisk
  Later used to compare NovoHMM with:
    PepNovo
    Sherenga
    Peaks
    Lutefisk

32 Results
  The contenders:
    PepNovo v1.03
    PepNovo+
    NovoHMM
    QuasiNovo
    QuasiNovo Reranking

33 Results
  [Figure: % Correct versus Number of Incorrect Residues; series: PepNovo+, PepNovo v1.03, NovoHMM, QuasiNovo, QuasiNovo Reranking]
  Results for the set of 280 MS-MS test spectra comparing PepNovo+, PepNovo, and NovoHMM with QuasiNovo and a QuasiNovo reranking.

34 Results
  [Figure: % Correct versus Number of Incorrect Residues; series: PepNovo+, PepNovo v1.03, NovoHMM, Gammaproteobacteria, Actinobacteria, Mammalia]
  Results for the set of 76 MS-MS test spectra for E. coli peptides comparing PepNovo+, PepNovo, and NovoHMM with three QuasiNovo scoring functions based on amino acid distributions in Gammaproteobacteria, Actinobacteria, and Mammalia.

35 Results
  [Table: Comparison of Terminal Pair and Overall Accuracy. Columns: Algorithm, PepNovo+, NovoHMM, QuasiNovo Reranking. Rows: Terminal ion pair (b2-ion, y2-ion), Complete peptide.]

36 Conclusions and Future Work
  The QuasiNovo peptide model:
    Predicts peptide amino acid content
    Has limited understanding of fragmentation
    Outperforms PepNovo+ and NovoHMM
  QuasiNovo reranking:
    Reranks PepNovo+ and NovoHMM results
    Proof-of-concept for combining peptide and fragmentation models
    Shows best overall performance
  Future: combine the QuasiNovo amino acid model with a sophisticated fragmentation model

37 Acknowledgements
  Rose Lab:
    Jimmy Cleveland
    Achraf Elallali
    Amadeo Bellotti
  Fox Lab:
    Alvin Fox
    Karen Fox
    Jennifer Intelicato-Young
  Support:
    Funding from Alfred P. Sloan Foundation
    Experiments were conducted on a 128-core shared memory computer funded by NSF (CNS ).

38 Gammaproteobacteria
  [Figure: two panels of cumulative results for x = 3 to x = 12; legend as shown: QuasiNovo MM Reranking NovoHMM PepNovo+]
  Cumulative results from 174 spectra
  x = n: number of correctly predicted amino acids
  Note: a predicted amino acid is correct if it appears within 2.5 Da of its position in the actual peptide

39 Actinobacteria
  [Figure: two panels of cumulative results for x = 3 to x = 12; legend as shown: QuasiNovo MM Reranking NovoHMM PepNovo+]
  Cumulative results from 27 spectra
  x = n: number of correctly predicted amino acids
  Note: a predicted amino acid is correct if it appears within 2.5 Da of its position in the actual peptide

40 Results: Mammalia
  [Figure: two panels of cumulative results for x = 3 to x = 12; legend as shown: QuasiNovo MM Reranking NovoHMM PepNovo+]
  Cumulative results from 79 spectra
  x = n: number of correctly predicted amino acids
  Note: a predicted amino acid is correct if it appears within 2.5 Da of its position in the actual peptide
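The correctness criterion in the note can be sketched as a comparison of residue positions on the mass axis. This example rests on an interpretation that the slides do not spell out: a residue's "position" is taken to be the summed mass of the residues before it, and a predicted residue counts as correct when the actual peptide has the same residue within 2.5 Da of that position. The residue masses are standard monoisotopic values, and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, not taken from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_positions(peptide):
    """Return (residue, mass of all preceding residues) for each residue."""
    positions, offset = [], 0.0
    for residue in peptide:
        positions.append((residue, offset))
        offset += RESIDUE_MASS[residue]
    return positions


def count_correct(predicted, actual, tol=2.5):
    """Count predicted residues that match the same residue in the actual
    peptide at a mass position no more than tol Da away."""
    actual_positions = prefix_positions(actual)
    return sum(
        any(res == act and abs(pos - act_pos) <= tol
            for act, act_pos in actual_positions)
        for res, pos in prefix_positions(predicted)
    )


print(count_correct("GFLSLVR", "FGLSLVR"))  # first two residues swapped
```

Swapping the first two residues makes both of them wrong, while the remaining five are still counted as correct because the total mass before them is unchanged. Note that this sketch treats I and L as different residues.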

41 EF-Tu Protein
  DISTILLER/MASCOT identification: AIDKPFLLPIEDVFSISGR
  QuasiNovo identification: DSDKPFMMPVEDVFSITGR
  Score(AIDKPFLLPIEDVFSISGR) = e-38
  Score(DSDKPFMMPVEDVFSITGR) = e-36
  QuasiNovo result supported by microbiological data:
    Gram stain
    Physiological tests
    Visual comparison of spectra of environmental isolates versus known S. aureus, and interpretation of Distiller/Mascot sequence assignment
  Note: Distiller results based on 18 peaks vs 12 peaks for QuasiNovo
  Peptide displays loss of 3 water molecules
