De Novo Peptide Sequencing

Berenice Ray
5 years ago

1 De Novo Peptide Sequencing

2 Outline
A simple de novo sequencing algorithm
PTM
Other ion types
Mass segment error

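3 De Novo Peptide Sequencing
Fragmentation of the peptide ANELLLNVK into b ions (prefixes) and the complementary y ions (suffixes):
b1 A | NELLLNVK y8
b2 AN | ELLLNVK y7
b3 ANE | LLLNVK y6
b4 ANEL | LLNVK y5
b5 ANELL | LNVK y4
b6 ANELLL | NVK y3
b7 ANELLLN | VK y2
b8 ANELLLNV | K y1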
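The ladder above can be computed with a short sketch. It assumes integer (nominal) residue masses and the singly charged offsets b = prefix + 1 and y = suffix + 19; the table `NOMINAL` and the function `ion_ladder` are illustrative names, not part of the slides.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def ion_ladder(peptide):
    """Return (total residue mass, rows), one row per cleavage site.

    Each row is (b label, prefix, b mass, y label, suffix, y mass).
    """
    masses = [NOMINAL[a] for a in peptide]
    total = sum(masses)
    rows = []
    prefix = 0
    for i in range(1, len(peptide)):
        prefix += masses[i - 1]
        rows.append((
            f"b{i}", peptide[:i], prefix + 1,
            f"y{len(peptide) - i}", peptide[i:], total - prefix + 19,
        ))
    return total, rows

total, rows = ion_ladder("ANELLLNVK")
for row in rows:
    print(*row)
```

Every row satisfies b + y = total residue mass + 20, which is one way to check a ladder for consistency.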

4 Score Function Implementation
Let the peptide be $a_1 a_2 \cdots a_k$ with total residue mass M. A cut between $a_i$ and $a_{i+1}$ splits it into a prefix $a_1 \cdots a_i$ of mass m and a suffix $a_{i+1} \cdots a_k$ of mass M - m.
If the corresponding y and/or b ions are observed for prefix mass m (and therefore suffix mass M - m), then the peptide is likely to have a prefix of mass m, and we let f(m) > 0. Otherwise, f(m) ≤ 0.
Note that f(m) is usually the sum of the scores of several related ion types. Also, f(m) can be computed without knowing the actual sequence.

5 Score for a Peptide
For a sequence P with prefix masses $m_1, m_2, \ldots, m_k$, the peptide score is defined as
$f(S, P) = f(m_1) + f(m_2) + \cdots + f(m_k)$.
De novo sequencing: given a spectrum S, construct the peptide P that maximizes f(S, P).
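As a minimal sketch of this definition, the following code sums f over the prefix masses of a candidate peptide. The three-letter alphabet `TOY_MASSES` and the scoring table inside `toy_f` are made-up illustrations; a real f(m) would be derived from the spectrum S.

```python
from itertools import accumulate

# Hypothetical reduced alphabet with nominal residue masses.
TOY_MASSES = {"G": 57, "A": 71, "S": 87}

def prefix_masses(peptide, masses=TOY_MASSES):
    """Prefix masses m_1, ..., m_k of the peptide (m_k is the total mass)."""
    return list(accumulate(masses[a] for a in peptide))

def peptide_score(peptide, f, masses=TOY_MASSES):
    """f(S, P) as the sum of f(m) over all prefix masses of P."""
    return sum(f(m) for m in prefix_masses(peptide, masses))

def toy_f(m):
    """Made-up score: positive where ion evidence is assumed, else -1."""
    return {57: 3.0, 128: 2.0, 215: 1.0}.get(m, -1.0)

for candidate in ("GAS", "AGS", "SAG"):
    print(candidate, prefix_masses(candidate), peptide_score(candidate, toy_f))
```

All three candidates share the same total mass, so the score alone decides between them.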

6 De Novo Sequencing
De novo sequencing: given a spectrum S and a total residue mass M, compute a peptide P such that the score f(S, P) is maximized.
When f(S, P) is a sum of f(prefix mass) terms, there is a simple algorithm. For simplicity, we use nominal masses.

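7 Simple Model
(Figure: a row of cells f(m) indexed by mass from 0 to M; one step of the path has length m(a), the mass of amino acid a.)
Find a path from 0 to M, where each step is equal to the mass of some amino acid. Maximize the total score of the cells the path lands on.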

8 Algorithm Idea
D(m): the maximum score that can be achieved by a partial path reaching m.
If P = P'a is the best path for m, then P' must be the best path for m - m(a). Thus,
$D(m) = \max_{a} \{ D(m - m(a)) \} + f(m)$.
The algorithm initializes D(0) = 0 and all other cells to $-\infty$, then computes D(m) for m from 1 to M using the formula above.

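9 Dynamic Programming
(Figure: the table D with one cell per mass from 0 to M and a step of length m(a) into cell m.)
The best sequence can be retrieved by backtracking: repeatedly compute the last amino acid a of the best path and step back from m to m - m(a).
Time complexity?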
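The recurrence and the backtracking step can be sketched as follows, assuming nominal masses. The function name `de_novo`, the back-pointer array, and the toy alphabet and scores are illustrative choices; storing a back pointer per cell is one of several ways to recover the last amino acid.

```python
import math

def de_novo(f, total_mass, residues):
    """Maximize the sum of f over prefix masses with D(m) = max_a D(m - m(a)) + f(m).

    f: callable mapping an integer mass to a score.
    residues: dict from residue label to nominal mass.
    Returns (best sequence, best score), or (None, -inf) if M is unreachable.
    """
    D = [-math.inf] * (total_mass + 1)
    back = [None] * (total_mass + 1)   # last residue on the best path to m
    D[0] = 0.0
    for m in range(1, total_mass + 1):
        for label, mass in residues.items():
            prev = m - mass
            if prev >= 0 and D[prev] > -math.inf:
                candidate = D[prev] + f(m)
                if candidate > D[m]:
                    D[m] = candidate
                    back[m] = label
    if D[total_mass] == -math.inf:
        return None, -math.inf
    # Backtracking: follow the stored last residue from M down to 0.
    sequence = []
    m = total_mass
    while m > 0:
        label = back[m]
        sequence.append(label)
        m -= residues[label]
    return "".join(reversed(sequence)), D[total_mass]

# Toy input: hypothetical 3-letter alphabet and made-up scores.
toy_residues = {"G": 57, "A": 71, "S": 87}
toy_f = lambda m: {57: 3.0, 128: 2.0, 215: 1.0}.get(m, -1.0)
print(de_novo(toy_f, 215, toy_residues))
```

In this sketch the two nested loops visit each of the M cells once per residue, so the running time is proportional to M times the alphabet size. Residues with equal nominal mass (such as I and L) cannot be distinguished, and only the first one tried is reported.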

10 High Resolution Data What if the mass values are not nominal?

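11 Phosphorylation
Monoisotopic mass change: HPO3 = 79.966 Da.
(Figure: structures of phosphoserine (pS), phosphothreonine (pT), and phosphotyrosine (pY), the phosphorylated forms of S, T, and Y.)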

12 PTM and De Novo Sequencing
Variable PTMs do not cause a major slowdown for de novo sequencing algorithms. Instead of trying only the 20 regular amino acids in the maximization, the algorithm simply tries all modified amino acids too, so the time complexity increases by a constant factor. (Compare this with the exponential growth in the database search approach.)
However, since the solution space is larger when many variable PTMs are allowed, the accuracy of the algorithm is reduced.
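A minimal sketch of "trying all modified amino acids too": a variable PTM simply adds entries to the residue alphabet that the maximization iterates over. The function `with_variable_ptm`, the label prefix "p", and the nominal phosphorylation shift of 80 are assumptions made for illustration.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def with_variable_ptm(residues, targets, shift, tag):
    """Return a copy of the alphabet with a modified form of each target residue.

    The unmodified residues stay available, which is what makes the PTM variable.
    """
    extended = dict(residues)
    for a in targets:
        extended[tag + a] = residues[a] + shift
    return extended

# Phosphorylation on S, T, Y with an assumed nominal shift of 80.
alphabet = with_variable_ptm(NOMINAL, "STY", 80, "p")
print(len(NOMINAL), "->", len(alphabet))
print({k: v for k, v in alphabet.items() if k.startswith("p")})
```

The alphabet grows from 20 to 23 entries here, so a loop over all residues does proportionally more work per mass cell but nothing else about the algorithm changes.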

13 Other Fragment Ions
(Figure: backbone of a four-residue peptide with side chains R1 to R4; cleavage sites are labeled a1/b1/c1 through a3/b3/c3 from the N-terminus and x3/y3/z3 through x1/y1/z1 from the C-terminus.)
Between two adjacent residues, there are 3 fragmentation possibilities, giving 6 fragment ion types.
Each ion type has a mass offset: a: -27, b: +1, c: +18, x: +45, y: +19, z: +2.
b and y ions are complementary: at charge one, b + y = total residue mass + 20.
The y ion is usually the most abundant.
There are also neutral-loss ions such as y-H2O.

14 Calculating f(m) with Other Ions
For example, if b and y ions are considered, then for prefix mass m the corresponding ion masses are b = m + 1 and y = M - m + 19.
Calculate the log-likelihood ratio for each ion type and add them up to obtain f(m).
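One way to turn this into code is sketched below. The slides do not specify the likelihood model, so the probabilities `p_ion` (chance that a true ion of a given type is observed) and `p_noise` (chance that a random mass holds a peak) are placeholder assumptions, as are the function name `make_f` and the exact-match peak lookup on nominal masses.

```python
import math

def make_f(peaks, total_mass, p_ion, p_noise):
    """Build f(m) as a sum of log-likelihood ratios over b and y ions.

    peaks: set of observed nominal peak masses.
    p_ion: dict with assumed observation probabilities for "b" and "y".
    p_noise: assumed probability that a peak appears at a random mass.
    """
    def f(m):
        expected = {"b": m + 1, "y": total_mass - m + 19}
        score = 0.0
        for ion_type, mass in expected.items():
            if mass in peaks:
                score += math.log(p_ion[ion_type] / p_noise)
            else:
                score += math.log((1 - p_ion[ion_type]) / (1 - p_noise))
        return score
    return f

# Toy spectrum for a peptide with M = 994: b and y peaks for prefix mass 71.
f = make_f(peaks={72, 942}, total_mass=994,
           p_ion={"b": 0.5, "y": 0.5}, p_noise=0.1)
print(f(71), f(100))
```

With these placeholder probabilities, a prefix mass supported by both ions scores positive and an unsupported one scores negative, which matches the sign convention f(m) > 0 for supported prefix masses.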

15 Double Count

16 Solution
Pretend it does not exist? It is a rare event in real sequences, anyway.
No. The algorithm is encouraged to find a peptide that reuses many significant peaks as both y and b ions, and the results it finds that way will be wrong.
Heuristic solution: detect the error and re-run the algorithm, discounting the overlapped peaks.
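The detection step of the heuristic could look like the following sketch: for a candidate peptide, list the observed peaks that are claimed both as a b ion and as a y ion. It assumes nominal masses and the offsets b = m + 1 and y = M - m + 19; the function names are illustrative, and the re-run with discounted peaks is not shown.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def b_y_ions(peptide, masses=NOMINAL):
    """Return the sets of singly charged b and y ion masses of the peptide."""
    residue = [masses[a] for a in peptide]
    total = sum(residue)
    b, y = set(), set()
    prefix = 0
    for w in residue[:-1]:          # one cleavage site between adjacent residues
        prefix += w
        b.add(prefix + 1)
        y.add(total - prefix + 19)
    return b, y

def double_counted(peptide, peaks, masses=NOMINAL):
    """Observed peaks that the peptide explains as both a b ion and a y ion."""
    b, y = b_y_ions(peptide, masses)
    return (b & y) & set(peaks)

# In "MAL", b1 = 131 + 1 and y1 = 113 + 19 coincide at 132 (and b2 = y2 = 203).
print(double_counted("MAL", {132, 203, 500}))
print(double_counted("GAS", {58, 129, 106, 177}))
```

A non-empty result flags a candidate whose score counted the same peak twice.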

17 Mass Segment Error
Most errors are due to incomplete ion ladders in the spectrum, so a segment of amino acids cannot be determined. However, the total mass of the segment is fixed. E.g., [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E.
The first two or three residues often have low confidence because of a lack of fragment ions.
Most de novo sequencing software uses the precursor mass as a constraint (thus the mass of the derived sequence is usually correct).
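The ambiguity of a mass segment such as [242] can be made concrete by enumerating the residue combinations that add up to it. This is a sketch under nominal masses; the function `compositions` and its `max_len` parameter are illustrative, and it lists unordered combinations only, so the order of residues inside the segment is a further ambiguity.

```python
from itertools import combinations_with_replacement

# Nominal (integer) residue masses of the 20 standard amino acids.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def compositions(gap, max_len, masses=NOMINAL):
    """Unordered residue combinations of up to max_len residues summing to gap."""
    letters = sorted(masses)
    found = []
    for n in range(1, max_len + 1):
        for combo in combinations_with_replacement(letters, n):
            if sum(masses[a] for a in combo) == gap:
                found.append(combo)
    return found

print(compositions(242, 2))
print(len(compositions(242, 3)))
```

With two residues this reproduces N+Q, N+K, and L+E from the example, plus I+E, because I and L share a nominal mass. Allowing three residues adds further candidates such as G+A+N.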

18 Solution
Match against a protein sequence database to correct some of the errors.
Use machine learning to learn the more frequent combinations when the peaks are missing.

19 Automated De Novo Sequencing
Many de novo sequencing programs exist: Sherenga (1999), Lutefisk (2001), PEAKS (2003), PepNovo (2005), Novor (2015).
Two main models: spectrum graph and PEAKS.
