De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry

Similar documents
DE NOVO PEPTIDE SEQUENCING FOR MASS SPECTRA BASED ON MULTI-CHARGE STRONG TAGS

De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry

NovoHMM: A Hidden Markov Model for de Novo Peptide Sequencing

A New Hybrid De Novo Sequencing Method For Protein Identification

via Tandem Mass Spectrometry and Propositional Satisfiability De Novo Peptide Sequencing Renato Bruni University of Perugia

Efficiency of Database Search for Identification of Mutated and Modified Proteins via Mass Spectrometry

A Suboptimal Algorithm for De Novo Peptide Sequencing via Tandem Mass Spectrometry. BINGWEN LU and TING CHEN ABSTRACT

Computational Methods for Mass Spectrometry Proteomics

PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search

Peptide Sequence Tags for Fast Database Search in Mass-Spectrometry

Modeling Mass Spectrometry-Based Protein Analysis

Quality Assessment of Tandem Mass Spectra Based on Cumulative Intensity Normalization

Proteomics. November 13, 2007

An SVM Scorer for More Sensitive and Reliable Peptide Identification via Tandem Mass Spectrometry

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu

Identification of proteins by enzyme digestion, mass

Peter A. DiMaggio, Jr., Nicolas L. Young, Richard C. Baliban, Benjamin A. Garcia, and Christodoulos A. Floudas. Research

De Novo Peptide Sequencing

Analysis of Peptide MS/MS Spectra from Large-Scale Proteomics Experiments Using Spectrum Libraries

MS-MS Analysis Programs

Protein Identification Using Tandem Mass Spectrometry. Nathan Edwards Informatics Research Applied Biosystems

Nature Methods: doi: /nmeth Supplementary Figure 1. Fragment indexing allows efficient spectra similarity comparisons.

Protein Sequencing and Identification by Mass Spectrometry

Towards Detecting Protein Complexes from Protein Interaction Data

Supplementary Material for: Clustering Millions of Tandem Mass Spectra

Learning Score Function Parameters for Improved Spectrum Identification in Tandem Mass Spectrometry Experiments

Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search

Electrospray ionization mass spectrometry (ESI-

QuasiNovo: Algorithms for De Novo Peptide Sequencing

A Kernel-Based Case Retrieval Algorithm with Application to Bioinformatics

A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra

BIOINFORMATICS. Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry

Computational Analysis of Mass Spectrometric Data for Whole Organism Proteomic Studies

InsPecT: Identification of Posttranslationally Modified Peptides from Tandem Mass Spectra

Identification of Post-translational Modifications via Blind Search of Mass-Spectra

PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search

SPECTRA LIBRARY ASSISTED DE NOVO PEPTIDE SEQUENCING FOR HCD AND ETD SPECTRA PAIRS

ADVANCEMENT IN PROTEIN INFERENCE FROM SHOTGUN PROTEOMICS USING PEPTIDE DETECTABILITY

Tandem Mass Spectrometry: Generating function, alignment and assembly

In shotgun proteomics, a complex protein mixture derived from a biological sample is directly analyzed. Research Article

On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

Mass Spectrometry Based De Novo Peptide Sequencing Error Correction

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry

pnovo+: De Novo Peptide Sequencing Using Complementary HCD and ETD Tandem Mass Spectra

HMMatch: Peptide Identification by Spectral Matching of Tandem Mass Spectra Using Hidden Markov Models ABSTRACT

Jiří Novák and David Hoksza

Speeding up Scoring Module of Mass Spectrometry Based Protein Identification by GPUs

Introduction to pepxmltab

ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES. A Thesis XINCHENG TANG

Intensity-based protein identification by machine learning from a library of tandem mass spectra

Algorithms for Identifying Protein Cross-Links via Tandem Mass Spectrometry ABSTRACT

De novo peptide sequencing methods for tandem mass. spectra

Mass spectrometry in proteomics

AB SCIEX SelexION Technology Used to Improve Mass Spectral Library Searching Scores by Removal of Isobaric Interferences

Identification of Human Hemoglobin Protein Variants Using Electrospray Ionization-Electron Transfer Dissociation Mass Spectrometry

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University

Tandem mass spectra were extracted from the Xcalibur data system format. (.RAW) and charge state assignment was performed using in house software

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Yifei Bao. Beatrix. Manor Askenazi

A Statistical Model of Proteolytic Digestion

HMMatch: Peptide Identification by Spectral Matching. of Tandem Mass Spectra using Hidden Markov Models

Mass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University

The Power of LC MALDI: Identification of Proteins by LC MALDI MS/MS Using the Applied Biosystems 4700 Proteomics Analyzer with TOF/TOF Optics

Liangyi Zhang and James P. Reilly* Department of Chemistry, Indiana University, 800 East Kirkwood Avenue, Bloomington, Indiana 47405

Methods for proteome analysis of obesity (Adipose tissue)

BST 226 Statistical Methods for Bioinformatics David M. Rocke. January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1

pnovo: De novo Peptide Sequencing and Identification Using HCD Spectra

MasSPIKE (Mass SPectrum Interpretation and Kernel Extraction) for Biological Samples p.1

Data pre-processing in liquid chromatography mass spectrometry-based proteomics

Parallel Algorithms For Real-Time Peptide-Spectrum Matching

BIOINFORMATICS. Bingwen Lu and Ting Chen INTRODUCTION

Atomic masses. Atomic masses of elements. Atomic masses of isotopes. Nominal and exact atomic masses. Example: CO, N 2 ja C 2 H 4

The influence of histidine on cleavage C-terminal to acidic residues in doubly protonated tryptic peptides

DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics

Multi-residue analysis of pesticides by GC-HRMS

Towards the Prediction of Protein Abundance from Tandem Mass Spectrometry Data

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

(Refer Slide Time 00:09) (Refer Slide Time 00:13)

MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data

MSc Chemistry Analytical Sciences. Advances in Data Dependent and Data Independent Acquisition for data analysis in proteomic research

MaSS-Simulator: A highly configurable MS/MS simulator for generating test datasets for big data algorithms.

Identifying the proteome: software tools David Fenyö

PROTEIN SEQUENCING AND IDENTIFICATION USING TANDEM MASS SPECTROMETRY

TANDEM mass spectrometry (MS/MS) is an essential and

Peptide de Novo Sequencing Using 157 nm Photodissociation in a Tandem Time-of-Flight Mass Spectrometer

Spectrum-to-Spectrum Searching Using a. Proteome-wide Spectral Library

MS-based proteomics to investigate proteins and their modifications

Fundamentals of Mass Spectrometry. Fundamentals of Mass Spectrometry. Learning Objective. Proteomics

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

An Unsupervised, Model-Free, Machine-Learning Combiner for Peptide Identifications from Tandem Mass Spectra

Computer aided operation and design of the cationic surfactants production

PesticideScreener. Innovation with Integrity. Comprehensive Pesticide Screening and Quantitation UHR-TOF MS

Properties of Average Score Distributions of SEQUEST

Fast Protein and Peptide Separations Using Monolithic Nanocolumns and Capillary Columns

CSE182-L8. Mass Spectrometry

BioSystems 109 (2012) Contents lists available at SciVerse ScienceDirect. BioSystems

Accelerating the Metabolite Identification Process Using High Resolution Q-TOF Data and Mass-MetaSite Software

Designed for Accuracy. Innovation with Integrity. High resolution quantitative proteomics LC-MS

Transcription:

17 th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved. 1 De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry Peter A. DiMaggio Jr. and Christodoulos A. Floudas a a Department of Chemical Engineering Princeton University Princeton NJ 08544 USA floudas@titan.princeton.edu Abstract A novel methodology for the de novo identification of peptides via mixedinteger linear optimization (MILP) and tandem mass spectrometry is presented. The overall mathematical model is presented and the key concepts of the proposed approach are described. A pre-processing algorithm is utilized to identify important m/z values in the tandem mass spectrum. Missing peaks due to residue-dependent fragmentation characteristics are dealt with using a twostage algorithmic framework. A cross-correlation approach is used to resolve missing amino acid assignments and to select the most probable peptide by comparing the theoretical spectra of the candidate sequences that were generated from the MILP sequencing stages with the experimental tandem mass spectrum. The novel proposed de novo method denoted as PILOT is compared to existing popular methods such as Lutefisk PEAKS PepNovo EigenMS and NovoHMM for a set of spectra resulting from QTOF instruments. Keywords: mixed-integer linear optimization (MILP) de novo peptide identification tandem mass spectrometry (MS/MS) 1. Introduction Of fundamental importance in proteomics is the problem of peptide and protein identification. Over the past couple decades tandem mass spectrometry

2 P. DiMaggio et. al (MS/MS) coupled with high performance liquid chromatography (HPLC) has emerged as a powerful experimental technique for the effective identification of peptides and proteins. In recognition of the extensive amount of sequence information embedded in a single mass spectrum tandem MS has served as an impetus for the recent development of numerous computational approaches formulated to sequence peptides robustly and efficiently with particular emphasis on the integration of these algorithms into a high throughput computational framework for proteomics. The two most frequently reported computational approaches in the literature are (a) de novo and (b) database search methods both of which can utilize deterministic probabilistic and/or stochastic solution techniques. The maority of peptide identification methods used in industry are database search methods [1-5] due to their accuracy and their ability to exploit organism information during the identification. A variety of techniques for peptide identification using databases currently exist. One approach as implemented in the SEQUEST algorithm [1] uses a signal-processing technique known as cross-correlation to mathematically determine the overlap between a theoretical spectrum as derived from a sequence in the database and the experimental spectrum under investigation. The more frequently used technique known as probability-based matching utilizes a probabilistic model to determine whether an ion peak match between the experimental and theoretical tandem mass spectrum is actual or random [245]. Despite the sophistication of these database methods they are ineffective if the database in which the search is conducted does not contain the corresponding peptide responsible for generating the tandem mass spectrum. De novo methods have received considerable interest since they are the only efficient means for applications such as finding novel proteins amino acid mutations and studying the proteome before the genome. A prominent methodology for the de novo peptide sequencing problem is a spectrum graph approach [6-10]. Various types of graph representations have been proposed but the maority of methods map the peaks in the tandem mass spectrum to nodes on a directed graph where the nodes are connected by edges if the mass difference between them is equal to the weight of an amino acid. Despite the vast potential of de novo methods they can be computationally demanding and may exhibit inconsistent prediction accuracies. 2. Novel Method for De Novo Peptide Identification 2.1. Mathematical Model and Algorithmic Framework A tandem mass spectrum is comprised of the mass of the parent peptide (m P ) and a set of data point pairs corresponding to the mass-to-charge ratio of the ion

De Novo Peptide Identification via Mixed-Integer Linear Optimization and Tandem Mass Spectrometry 3 peaks (mass(ion peak i)) and their intensities (λ i ). The following sets are defined using these parameters: S = {S i = (i): M i mass(ion peak ) mass(ion peak i) = mass of an amino acid mass(ion peak ) > mass(ion peak i)} (1) C = {C i = (i): mass(ion peak i) + mass(ion peak ) = m P + 2} (2) The set S contains the pairs of peaks whose mass difference (M i ) is equal to the weight of an amino acid and the set C contains the pairs of peaks which are known as complementary ions. It is important to note that the pair (i) in C indicates that ion peak i and ion peak are of different ion type. A peptide sequence is derived from tandem mass spectrum data by connecting ion peaks of similar ion type by the weights of amino acids. This is nontrivial since the type of an ion peak (i.e. a b c x y z) is not known a priori. The key idea of the proposed approach is two model the selection of peaks and connections between peaks using binary variables. We define the binary variable p k to equal one if ion peak k is used in the construction of the candidate sequence (i.e. p k = 1) else p k is equal to zero. We also define the binary variable w i to equal one if peaks i and are connected in the construction of the candidate sequence (i.e. p i = p = 1) and to be zero otherwise. Based on the observation that y- and b-ions are typically the most abundant in intensity in a tandem mass spectrum we postulate the obective function of maximimizing the intensities of the peaks used in the construction of the candidate sequence so as to maximize the number of b- or y-ions used. MAX λ wi (3) ( i ) Si Several contraints can be added to the problem defined in Eq. (3) in order to incorporate various ion peak properties and fragmentation characteristics as shown in Eqs. (4) (11). MAX λ wi ( i ) Si s.t. M i wi ( mp 18) + tolerance (4) ( i ) S i M i wi ( mp 18) tolerance (5) ( i ) S i i p + p 1 ( i ) Ci (6)

4 P. DiMaggio et. al Si w w i = p i = p i BC i (7) i BC i (8) i i Si wi = 1 (9) i BCi Si wi = 1 (10) tail BC i S i tail wi wik = 0 ii BCi i BCi (11) Si k Sik w p = ( ) i k {01} i k The mass balance for the peptide is defined by Eqs. (4) and (5) which ensures that the sum of the masses used to derive the candidate sequence is within some error tolerance of the experimental mass of the parent peptide minus water. To eliminate ion peaks of a different ion type from being used in the peptide sequence Eq. (6) enforces that if ion peak i is selected (i.e p i = 1) then its complementary ion ion peak will not be selected (i.e. p = 0) since ion peak i and ion peak are complementary ions (see Eq. (2)). The relationship between the binary variables p and w given in Eqs. (7) and (8) ensures that if ion peak i and ion peak are choosen (i.e. p i = p = 1) then a path exists between these two peaks (w i = 1). Eqs. (9) and (10) require that the candidate sequence has the correct N- and C-terminus boundary conditions which are predefined in the sets tail BCi and BC i respectively. Eq. (11) enforces that the number of input paths entering and the number of output paths leaving an ion peak i are equal. The peptide identification problem is defined by Eqs. (3)-(11). A preprocessing algorithm is used to filter the peaks in the raw tandem mass spectrum and to validate the existence of ion peaks pertaining to the N- and C- terminus boundary conditions of the ion series. To accommodate missing peaks in the tandem mass spectrum a two-stage framework is employed in which the first stage sequences the candidate peptides using single amino acid weights and the second stage allows for combinations of two to three amino acids weights to be used in the construction of the candidate sequences. Residue assignment ambiguities are subsequently resolved using a modified SEQUEST algorithm [1] so as to exploit the information in the tandem mass spectrum which was not utilized in the sequencing calculations. This postprocessing component of the method replaces weights in the candidate peptide sequences derived from the second stage calculations with permutations of amino acids consistent with these weights. The theoretical tandem mass spectrum for each candidate sequence is predicted and cross-correlated with the experimental tandem mass spectrum and the highest scoring sequence is

De Novo Peptide Identification via Mixed-Integer Linear Optimization and Tandem Mass Spectrometry 5 reported as the most probable peptide. This overall framework is denoted as PILOT which stands for Peptide identification via mixed Integer Linear Optimization and Tandem mass spectrometry. 2.2. Case study In this section we present a comparative study with several existing de novo peptide identification methods to demonstrate the predictive capabilities of the proposed framework PILOT. The algorithms examined in the comparison that is Lutefisk LutefiskXP PepNovo PEAKS EigenMS NovoHMM were selected on the basis of availability reported popularity and performance. In the studies presented assignments to isobaric residues (i.e. Q and K I and L) are considered to be equivalent. To test the method's performance on quadrupole time-of-flight (QTOF) tandem mass spectra we selected an existing data set that is publicly available [11]. These spectra were collected with Q-TOF2 and Q-TOF-Global mass spectrometers for a control mixture of four known proteins: alcohol dehydrogenase (yeast) myoglobin (horse) albumin (bovine BSA) and cytochrome C (horse). The top-ranked sequence reported from each of these methods were compared using a number of metrics. 2.3. Results A summary of the identification results for the de novo methods on the 38 quadrupole time-of-flight spectra are presented in Table 1. Table 1: Identification Rates for the 38 QTOF Spectra Lutefisk LutefiskXP PepNovo PEAKS EigenMS PILOT Correct Peptides 10 9 16 21 20 25 with in 1 residue 11 10 17 22 21 25 with in 2 residues 23 22 25 29 29 33 with in 3 residues 23 25 27 32 30 35 Correct Residues 245 294 337 366 353 381 In terms of correct peptide identifications PILOT is superior to the other de novo methods with an identification rate of about 66 percent followed by PEAKS and EigenMS both at about 53 percent. A common limitation of de novo methods is the inability to assign the correct N-terminal amino acid pair or resolve isobaric residues (i.e. Q or GA W or SV etc.). Thus to accommodate this limitation in the comparison we also reported the percentage of predictions for which there are only one two or three incorrect amino acid assignments in the entire sequence. In Table 1 it is seen that allowing for up to three incorrect amino acids increases the identification rate for all methods on the order of 30 percent indicating that these limitations affect the results reported by all the de novo methods. The last entry in Table 1 reports the number of correctly assigned residues normalized by the total number of actual residues (which is

6 P. DiMaggio et. al 418 for the 38 doubly-charged peptides considered). PILOT outperforms the other de novo methods with a residue accuracy of 91 percent. 3. Conclusions A novel mixed-integer linear optimization framework PILOT was proposed for the automated de novo identification of peptides using tandem mass spectroscopy. For a given experimental MS/MS spectrum PILOT generates a rank-ordered list of potential candidate sequences and a cross-correlation technique is employed to assess the degree of similarity between the theoretical tandem mass spectra of predicted sequences and the experimental tandem mass spectrum. A comparative study for 38 quadrupole time-of-flight spectra was presented to benchmark the performance of the proposed framework with several existing methods. For the case study presented PILOT consistently outperformed the other de novo methods in several measures of prediction accuracy. Acknowledgements The authors gratefully acknowledge financial support from the US Environmental Protection Agency EPA (R 832721-010) Siemens Corporation and the National Institutes of Health. Although the research described in the article has been funded in part by the U.S. Environmental Protection Agency's STAR program through grant (R 832721-010) it has not been subected to any EPA review and therefore does not necessarily reflect the views of the Agency and no official endorsement should be inferred. References 1. J.K. Eng A.L. McCormack and J.R. Yates J. Am. Soc. Mass Spectrom. 5 (1994) 976. 2. D.N. Perkins D.J.C. Pappin D.M. Creasy and J.S. Cottrell Electrophoresis 20 (1999) 3551. 3. P.A. Pevzner Z. Mulyukov V. Dancik and C. L. Tang Genome Research 11 (2001) 290. 4. V. Bafna and N. Edwards Bioinformatics 17 (2001) S13. 5. M. Havilio Y. Haddad and Z. Smilansky Anal. Chem. 75 (2003) 435. 6. J.A. Taylor and R.S. Johnson Rapid Commun. Mass Spectrom. 11 (1997) 1067. 7. V. Dancik T.A. Addona K.R. Clauser J.E. Vath and P.A. Pevzner J. Comp. Biol. 6 (1999) 327. 8. T. Chen M.Y. Kao M. Tepel J. Rush and G.M. Church J. Comp. Biol. 10 (2001) 325. 9. A. Frank and P. Pevzner Anal. Chem. 77 (2005) 964. 10. M. Bern and D. Goldberg J. Comp. Biol. 13 (2006) 364. 11. B. Ma K.Z. Zhang C. Hendrie C. Liang M. Li A. Doherty-Kirby and G. Laoie Rapid Commun. Mass Spectrom. 17 (2003) 2337.