Quality Assessment of Tandem Mass Spectra Based on Cumulative Intensity Normalization


Seungjin Na and Eunok Paek*

Department of Mechanical and Information Engineering, University of Seoul, Seoul, Korea

* To whom correspondence should be addressed. Eunok Paek, Dept. of Mechanical and Information Engineering, University of Seoul, 90 Jeonnongdong, Dongdaemun-gu, Seoul, Korea. E-mail: paek@uos.ac.kr.

Received July 4, 2006

A large proportion of MS/MS spectral analyses do not result in significant matches because the spectral quality is too poor to produce a meaningful identification. The throughput of peptide identification can be greatly improved if one can filter out, in advance, spectra that would lead to wrong identifications. We introduce here an approach to assessing spectral quality that uses a new spectral feature called Xrea, based on cumulative intensity normalization.

Keywords: tandem mass spectra; intensity normalization; spectral quality; peptide identification

Introduction

Various approaches have been introduced to interpret tandem mass spectra (MS/MS) for peptide identification. Database search methods such as SEQUEST (1) and Mascot (2) compare an experimental MS/MS spectrum with theoretical MS/MS spectra generated from candidate peptides found in a database. In de novo sequencing approaches such as Lutefisk (3) and PEAKS (4), on the other hand, a peptide sequence is inferred directly from each spectrum without resorting to a peptide sequence database. Approaches combining these two methods have also been proposed (5,6).

A single spectrum may be associated with many candidate peptides. A scoring scheme is applied to these candidates to find the peptide that most likely corresponds to the spectrum. The peptide with the highest score among the candidates is regarded as the correct match when its score is considered significant. Scoring is therefore a critical step in peptide identification, and many scoring algorithms have been proposed. Intensity-based scoring algorithms (7-10) in particular have produced substantial improvements in peptide identification.

It is difficult, however, to apply raw intensities of fragment ion peaks to a scoring algorithm because they vary considerably from spectrum to spectrum. Intensity normalization is required to incorporate the intensity information of a spectrum into a scoring algorithm, and many methods have been proposed for this purpose. Existing methods mainly normalize intensity with specific information about raw intensities, for example, the intensity of the most abundant peak. Such methods are very sensitive to variations in raw intensities. In this paper, we propose a new intensity normalization method, called cumulative intensity normalization, which considers both the magnitude of individual fragment ion peaks and their ranking in raw intensities and thus overcomes the shortcomings of existing normalization methods. The effectiveness of the proposed method is demonstrated by an experiment that compares SEQUEST search results when different normalization methods are applied.

In addition to its contribution to better scoring, cumulative intensity normalization can also be useful for estimating the quality of a spectrum. In mass spectrometry-based proteomics, a high percentage (e.g., 80-90%) of MS/MS data does not result in significant matches and is discarded. It is therefore useful if low-quality spectra that are not likely to lead to any useful identification can be identified and filtered out. Recently, there have been many reports on assessing the data quality of peptide mass spectra. Some of these efforts adopt machine learning approaches using various spectral features. The machine learning methods vary from a genetic algorithm (11) to a support vector machine (12), and several different learning methods have been compared in terms of their performance (13). Other approaches use a score function based on spatial distribution patterns of peaks (14) or on the maximum length of peptide sequence tags (15). These results are used for filtering noisy spectrum data via spectral quality assessment. Another study reports that unassigned spectra of high quality can be identified by additional extended searching (16).

In this paper, we propose a method for the assessment of spectral quality that addresses the problem of computational efficiency encountered by MS/MS spectra analyses in high-throughput proteomics. Several spectral features are introduced to assess the quality of tandem mass spectra. First, we present a novel spectral feature that measures patterns in the peak intensity distribution of a spectrum. This measure, named Xrea, uses the properties of cumulative intensity normalization. In addition, we use features that measure the abundance of potentially interpretable peak pairs in a spectrum. Our approach, combining features involving intensity and m/z difference, greatly enhances computational throughput by filtering out useless data prior to database searching. We evaluated the effectiveness of our method against two different MS/MS data sets.

American Chemical Society. Journal of Proteome Research 2006, 5. Published on Web 11/09/2006.

Figure 1. MS/MS spectra of various qualities.

Experimental Section

We used two publicly available MS/MS data sets acquired on ion trap tandem mass spectrometers. One data set, from the ISB protein mixture, was used as control data to demonstrate the effectiveness of the newly proposed intensity normalization, to characterize spectral quality with the newly introduced features, and to optimize the quality assessment algorithm. The other set, from human K562 cells, was used as test data to validate the performance and usefulness of our algorithm.

1. ISB Data. The ISB protein mixture data set (17) consists of MS/MS spectra obtained by mixing together 18 purified proteins and analyzing the mixture on an ESI-ITMS (ThermoFinnigan, San Jose, CA). Combined tandem mass spectra from this control mixture were searched using SEQUEST against a human protein database appended with the sequences of the 18 control mixture proteins. This analysis produced assignments to doubly charged spectra, to triply charged spectra, and 504 to singly charged spectra (a spectrum representing a charge state greater than 1+ is searched twice). In total, 1656 peptide assignments to doubly charged spectra, 984 to triply charged spectra, and 125 to singly charged spectra were determined to be correct after manual inspection.

2. Human Erythroleukemia K562 Cell Line. This data set (18) is available in the data repository from PeptideAtlas. An extract of the erythroleukemia cell line K562 grown in suspension was analyzed on an LCQ Classic ion trap mass spectrometer. Spectra were searched using SEQUEST against the IPI human protein database (version 2.18). This analysis produced assignments to doubly charged spectra, to triply charged spectra, and 1644 to singly charged spectra. Of those, 1692 peptide assignments to doubly charged spectra and 1790 to triply charged spectra scored above the XCorr threshold and were regarded as high-confidence assignments.
In general, multiply charged spectra from ion trap mass spectrometers are searched twice, assuming a 2+ or a 3+ charge state, which wastes computational resources. This calls for filtering out as many of these multiply charged spectra as possible. Furthermore, singly charged spectra constitute only a small fraction of the entire set. Therefore, we restricted our analysis to multiply charged tandem mass spectra.

To establish a model for spectral quality assessment, we used a support vector machine (SVM) as the machine learning method. From the ISB data set, the 2640 spectra whose assignments were determined to be correct were labeled GOOD (identifiable). The remaining spectra were labeled BAD (unidentifiable). The SVM classifier was trained on these labeled spectra using 5-fold cross validation. The learned SVM classifier was then applied to a more typical high-throughput proteomics data set (human erythroleukemia K562 cell line) to test our algorithm, and its classification power is presented.

Results and Discussion

Intensity Normalization. For peptide identification, many scoring algorithms (1-6) have been developed to evaluate a match between a spectrum and a peptide. Early scoring algorithms focused only on the presence or absence of peaks at specific m/z values and did not take their intensities into account. However, it has been suggested that intensity-based scoring algorithms (7-10) can provide significant improvements in peptide identification. It is difficult to apply raw intensities to a scoring algorithm, however, because the raw intensities of fragment ion peaks are highly variable from spectrum to spectrum. Intensity normalization is therefore necessary for incorporating the raw intensities of fragment ion peaks into algorithms. Most existing methods of analysis use relative intensity, defined as the raw intensity normalized by the intensity of the most abundant peak or by the total intensity of all the peaks in a spectrum. This relative intensity normalization has some shortcomings.
First, when the single most abundant peak is very strong relative to the rest of the peaks, as is observed quite often, the intensities of the remaining peaks may be hard to distinguish from the background noise and can thus be ignored, as in the spectrum in Figure 1c. In Figure 1c, the original spectrum is redrawn in a small dashed box after removing the most intense peak. In the spectrum displayed inside this inset box, high and low peaks are more easily distinguishable than in the original spectrum. Second, when raw intensities are small over the entire spectrum, the relative intensities will be more or less the same, and the peaks may therefore look indistinguishable, as in Figure 1d.

To overcome these shortcomings, a different normalization approach, called rank-based intensity normalization, has been proposed (12). It relies entirely on the rank of a fragment ion peak in a spectrum, without any regard to the magnitude of its raw intensity. However, it has been reported previously that fragment ion intensities are reproducible (10,19-20), and experts in mass spectrometry believe that the intensities of fragment ion peaks carry useful information even if the fragmentation process of collision-induced dissociation is not exactly quantitative.

Figure 2. Transformation of relative normalization into cumulative normalization. Two curves are shown: the normalized peak intensities arranged in ascending order under the relative method and under the cumulative method. If all the peaks in a spectrum are equal, the cumulative curve is a diagonal line. The cumulative curve cannot extend over this diagonal line.

We propose here a new intensity normalization method, called cumulative intensity normalization. In this approach, relative intensities (raw intensities divided by the total intensity) are accumulated so that the normalized intensity of the nth highest peak is defined as the sum of the relative intensities of all the peaks whose intensities are smaller than or equal to that of the nth highest peak. Equation 1 shown below defines cumulative intensity.
$$\text{Cumulative normalized intensity of the } n\text{th highest peak} = \frac{\sum \{\, I_{\mathrm{raw}}(x) \mid \mathrm{Rank}(x) \ge n \,\}}{\mathrm{TIC}} \qquad (1)$$

where $I_{\mathrm{raw}}(x)$ is the raw intensity of a fragment ion at x (m/z), TIC (total ion current) is the total intensity of a spectrum, and $\mathrm{Rank}(x)$ represents the order of a fragment ion at x when sorted by magnitude of raw intensities in descending order. The most intense peak has rank 1, the second most intense peak has rank 2, and so forth. In accordance with this definition, the most intense peak is always normalized to 1.

Figure 2 compares curves of normalized peak intensities, arranged in ascending order, under each of the two normalization methods. The curve drawn using cumulative normalization represents gradients in raw intensities in a spectrum. In the cumulative curve, the difference between the nth and (n - 1)th peak values is the nth $RI_{\mathrm{byTIC}}$, the raw intensity divided by TIC. Thus, if every peak in a spectrum is equal in its raw intensity, the increasing rate of the curve is a constant CS, where CS is $RI_{\mathrm{byTIC}}$, which is the same for all the fragment ion peaks in this case. Accordingly, the cumulative curve will be the diagonal line shown in Figure 2, and its slope will be CS. But, given that the total intensity of a spectrum is constant and the raw intensities of the peaks are not all equal, if there are peaks whose $RI_{\mathrm{byTIC}}$ values are higher than CS, there must be peaks whose $RI_{\mathrm{byTIC}}$ values are smaller than CS, because the sum of all $RI_{\mathrm{byTIC}}$ values in a spectrum is a constant 1. Because the normalized intensities are arranged by their magnitudes in this curve, no cumulative curve extends over this diagonal line. A more rigorous proof of this property can be found in the Supporting Information.

Figure 3. Cumulative and relative curves for the four spectra presented in Figure 1.

Figure 3 shows relative and cumulative curves for the spectra shown in Figure 1. The cumulative peak intensities have the rank-based normalized intensities (diagonal line) as their upper bounds.
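As an illustration of eq 1, the following minimal sketch computes cumulative normalized intensities for a list of raw peak intensities. The function name `cumulative_normalize` is hypothetical and not part of the original work, and the handling of tied intensities (ties keep their input order and receive distinct ranks) is an assumption chosen so that a spectrum of equal peaks falls on the diagonal line, as described above.

```python
def cumulative_normalize(intensities):
    """Return the cumulative normalized intensity of each peak, in input order.

    Assumption: peaks with tied raw intensities receive distinct ranks
    (input order breaks ties), so equal peaks map onto the diagonal.
    """
    tic = sum(intensities)  # total ion current of the spectrum
    if tic <= 0:
        raise ValueError("spectrum must have a positive total intensity")

    # Visit peaks from the least to the most intense one.
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])

    normalized = [0.0] * len(intensities)
    running = 0.0
    for i in order:
        running += intensities[i]      # sum of all peaks ranked at or below this one
        normalized[i] = running / tic  # eq 1
    return normalized


if __name__ == "__main__":
    # The most intense peak is always normalized to 1.
    print(cumulative_normalize([60, 10, 30]))  # -> [1.0, 0.1, 0.4]
```

Sorting once and keeping a running sum gives every peak its value in a single pass, and the values are written back in input order so they can replace the raw intensities peak by peak.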
In contrast, the relative curve in Figure 3d extends over the diagonal. In other words, when raw intensities are small over the entire spectrum as in Figure 1d, relative intensities will be more or less the same, while cumulative intensities will show obvious differences due to their ranking. Thus, cumulative normalization respects the individual ranking of each peak while also taking individual intensity magnitudes into account.

It must also be noted that cumulative intensity normalization does not suffer from the problem that the normalized intensity is very sensitive to the magnitude of the most abundant peak. To see why, assume that the intensity of the most abundant peak is $R$, that there are $N$ peaks other than the most abundant peak, and that the intensities of those $N$ peaks are all $\beta$, which is much smaller than $R$ ($\beta \ll R$). The total intensity of the spectrum is then $R + N\beta$. In cumulative normalization, the cumulative intensity of the second most abundant peak is $CI_2 = N\beta/(R + N\beta)$. Given that $R$ is a constant, $CI_2$ increases as $N\beta$ increases. $N\beta$ is proportional not only to $\beta$ but also to $N$: if the number of peaks in a spectrum increases, $N\beta$ increases. That is, even if the magnitude of $\beta$ is very small, the influence of the most abundant peak can be reduced by an increase in the number of peaks. In relative normalization, by contrast, the relative intensity of the second most abundant peak is $RI_2 = \beta/R$. Because it is determined entirely by the magnitude of $\beta$ given a constant $R$, the smaller $\beta$ is, the smaller $RI_2$ becomes. As a result, as shown in Figure 1c, while relative normalization is very sensitive to the most abundant peak, cumulative normalization can be relatively free from such influence. The curves in Figure 3c exemplify this effect very well. In other words, cumulatively normalized peak intensities are relatively stable from spectrum to spectrum.

Effectiveness of Cumulative Intensity Normalization.
To demonstrate the effectiveness of cumulative intensity normalization, we performed a SEQUEST search on the ISB data set twice against a human protein database (40 Mb) (extracted from ftp://ftp.ncicrf.gov/pub/nonredun/protein.nrdb.z) appended with the sequences of the 18 control mixture proteins, using a no-enzyme search while tolerating up to 4 miscleavages. The first search was a regular SEQUEST search, and the second was done by replacing all the intensities in the *.dta files (the spectrum file format for SEQUEST) with cumulative intensities, while all the search parameters remained the same. To measure the performance of each search, we regarded peptide assignments corresponding to the 18 proteins in the control mixture as correct. From the search with the unprocessed spectra, 1728 peptide assignments among 2+ spectra and 1078 assignments among 3+ spectra corresponded to the control mixture. In contrast, with cumulatively normalized spectra, 1811 and 1122 peptide assignments among 2+ and 3+ spectra, respectively, corresponded to the control mixture.

Further detailed investigation showed that, in addition to an increase in the number of correct assignments to the control proteins, we obtained more meaningful search results. Figure 4 shows the distributions of XCorr scores from the two SEQUEST searches for doubly charged spectra. The inset box shows the distribution of spectra whose peptide assignments correspond to the control mixture proteins. It shows not only an increase in the number of assignments to correct proteins but also an increase in the match scores of peptides for the search with cumulative intensities. When the SEQUEST XCorr threshold for significant peptide hits was set at 2.5, more correct peptide hits lay above the threshold line in the search using cumulatively normalized spectra than in the regular search. The number of unassigned spectra above the threshold increased as well, but this increase was relatively small compared with the increase in the number of assigned spectra. A search with triply charged spectra showed a similar trend in distribution.

Figure 4. Distributions of XCorr scores from two different SEQUEST searches using different normalization schemes (relative vs cumulative) for 2+ spectra. The inset box represents the distribution of scores of peptide assignments to a protein in the control mixture. It shows more correct assignments to control proteins (1811 vs 1728) when cumulatively normalized spectra were used. It also shows an increase in the match scores of peptide hits for the correct assignments. The dashed line represents an XCorr value of 2.5 as the threshold of a significant peptide hit, which is taken from the DTASelect algorithm (21). More correct peptide hits exist over the threshold in the search with cumulatively normalized spectra than in the regular search (1436 vs 1091).

These results were further analyzed using PeptideProphet (22). When a PeptideProphet probability score of 0.9 was used as the threshold, it resulted in 2405 peptide assignments in the unprocessed search and 2498 assignments in the search using cumulative intensity. Of these, the number of peptide assignments corresponding to the control mixture was 2346 in the unprocessed search and 2454 in the search using cumulative intensity (59 vs 44 assignments corresponding to proteins other than the control mixture). When applied to tandem mass spectra analysis, cumulative intensity normalization thus enabled us to identify more peptide assignments with higher confidence than the existing normalization method.

Novel Feature for Spectral Quality. The features that are known to relate to the quality of mass spectra include the number of peaks, total ion current, signal-to-noise level, how likely two fragment ion peaks are to differ by the mass of an amino acid, and the existence of isotope and neutral loss peaks. Generally, noise peaks in a spectrum have low intensities and signal peaks have high intensities, although with some exceptions. The quality of a mass spectrum can be said to be good when one can clearly distinguish peaks with high intensities from those with low intensities, that is, when signal and noise peaks are clearly distinguishable. Figure 1a is an example of a good-quality spectrum, in which one can distinguish peaks with high intensities from peaks with low intensities, while Figure 1d is an example of a bad-quality spectrum.

In this paper, we propose a new feature to assess the quality of mass spectra using properties of the cumulative curve. As a spectrum shows more prominent differences between high and low peaks, as in Figure 1a, its cumulative curve gets closer to the bottom right corner (Figure 3a). Since the cumulative curve represents gradients in raw intensities in a spectrum, with lower intensity values the cumulative curve gets closer to the bottom, while with higher values it gets closer to the right. As a result, if a spectrum has low intensities for noise peaks and high intensities for signal peaks, the cumulative curve will be closer to the bottom right corner. Figure 5 shows cumulative curves for the spectra shown in Figure 1. Here, we hypothesize that the larger the area XX is, the better the quality of a spectrum is, where XX is defined as the area between the diagonal and the cumulative curve, as marked in Figure 5. On the basis of this hypothesis, we propose a new feature called Xrea, defined in eq 2, and propose to use it when evaluating the quality of a mass spectrum.

$$\mathrm{Xrea} = \frac{\text{the area of XX}}{\text{the area of a lower right triangle} + R} \qquad (2)$$

where the area under the cumulative curve is computed using the strip method of numerical integration. The bin width is fixed as 1/n, where n is the number of fragment ion peaks. $R$ is a penalty factor defined as the relative magnitude of the most abundant peak. The difference between the most intense and the second most intense cumulative intensity is the $RI_{\mathrm{byTIC}}$ of the most intense peak (see the definition of cumulative intensity). The more intense the most abundant peak is, the larger the area of XX becomes, and thus the spectrum would be regarded as having better quality. To counteract this tendency, $R$ is employed, and its value is the largest $RI_{\mathrm{byTIC}}$ in each spectrum. As the spectrum in Figure 1c shows, if the single most abundant peak is very strong, it is hard to determine spectral quality, even though its influence is reduced by using cumulative intensity normalization.

Figure 5. Cumulative curves. XX is defined as the area between the cumulative curve and the diagonal line. The larger the area XX is, the better the quality of a mass spectrum is.

According to eq 2, Figure 1a, with the biggest XX, is the best spectrum among the four example spectra. On the other hand, Xrea will be 0 for a spectrum with the worst quality, whose cumulative curve is equal to the diagonal line. This is the case when every peak in a spectrum is equal in its raw intensity. We assume that such a spectrum is of the worst quality because signals and noise are indistinguishable.

To demonstrate that Xrea is a useful measure of spectral quality, we assessed the correlation between Xrea and SEQUEST XCorr. Figure 6 shows the averaged XCorr scores of spectra against Xrea. The XCorr score generally increases as Xrea increases. If we assume that an increase in Xrea means better spectral quality, it follows that as spectral quality gets better, the peptide match score will be higher. Figure 7 shows the spectra distribution against Xrea.
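The definition of Xrea in eq 2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `xrea` is hypothetical, and the area of the lower right triangle is assumed to be computed with the same strip rule (bin width 1/n) as the area under the cumulative curve, which makes Xrea exactly 0 for a spectrum of equal peaks, as the text requires. The original work does not spell out these details.

```python
def xrea(intensities):
    """Sketch of the Xrea feature (eq 2) for a list of raw peak intensities.

    Assumption: both the triangle under the diagonal and the area under the
    cumulative curve are integrated with strips of width 1/n.
    """
    n = len(intensities)
    tic = sum(intensities)
    if n == 0 or tic <= 0:
        raise ValueError("spectrum must contain peaks with positive total intensity")

    ascending = sorted(intensities)
    width = 1.0 / n          # bin width of the strip method
    running = 0.0
    area_curve = 0.0         # area under the cumulative curve
    area_triangle = 0.0      # area under the diagonal (all peaks equal)
    for k, value in enumerate(ascending, start=1):
        running += value
        area_curve += (running / tic) * width
        area_triangle += (k / n) * width

    area_xx = area_triangle - area_curve  # area between diagonal and curve
    penalty = ascending[-1] / tic         # R: relative magnitude of the top peak
    return area_xx / (area_triangle + penalty)


if __name__ == "__main__":
    flat = [5, 5, 5, 5]                 # signal and noise indistinguishable
    structured = [1, 1, 1, 1, 50, 60]   # a few strong peaks over weak noise
    print(xrea(flat), xrea(structured))
```

Under these assumptions the flat spectrum scores 0 and the structured one scores higher, which matches the hypothesis that a larger area XX indicates better quality.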
With an increase in Xrea, it is more likely that a spectrum results in a significant match. Overall, an Xrea threshold (dashed line) can be used to separate poor matches from good ones.

Figure 6. Average XCorr score of doubly and triply charged spectra against Xrea. The XCorr score generally increases as Xrea increases. A scatter plot for doubly charged spectra is shown in Figure 10.

Figure 7. Distribution of spectra against Xrea. Distributions for the GOOD (black) and BAD (gray) spectra sets are shown. As Xrea gets bigger, the fraction of good spectra at a given Xrea value increases. SEQUEST results are mostly poor for the spectra below the specified threshold (dashed line) of Xrea.

Quality Assessment of Mass Spectra. Currently, the most popular approach to interpreting tandem mass spectra is to search protein sequence databases with experimental MS/MS data (1,2). When the database is large, the search consumes considerable computational resources. Recent advances in mass spectrometry technology have made high-throughput proteome analysis possible. When processing a large number of spectra from such high-throughput experiments, a large portion of the spectra do not result in significant matches because their spectral quality is too poor to produce a useful identification. In Figure 7, spectra represented by the black region produce useful identifications, while spectra in the gray region do not result in significant matches and are therefore discarded. If the quality of tandem mass spectra can be assessed, filtering out bad spectra prior to software analysis saves considerable computational resources and greatly improves the throughput of peptide identification.

In determining the quality of a given spectrum, the aforementioned feature, Xrea, is used together with another feature called Good-Diff Fraction (12). Good-Diff Fraction (GDFR), which measures how likely a fragment ion peak pair is to differ by the mass of an amino acid, is defined as follows:

$$\mathrm{GDFR} = \frac{\sum \{\, I(x) + I(y) \mid M(x) - M(y) = M_i \,\}}{\sum \{\, I(x) + I(y) \mid 56 \le M(x) - M(y) \le 187 \,\}}$$

where $I(x)$ is the intensity of peak x, $M(x)$ is the m/z value of x, and $M_i$ is the mass of an amino acid. The numbers 56 and 187 bracket the residue masses of glycine and tryptophan. In this definition, we use cumulative intensity and tolerate offsets of ±0.25 Da when comparing masses.
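The GDFR definition above can be sketched for singly charged fragments as follows. The function name `good_diff_fraction` and the table of monoisotopic residue masses for the unmodified amino acids are assumptions made for illustration; the original work does not state which mass table was used. Intensities are taken as given, so cumulative intensities can be passed in directly.

```python
# Monoisotopic residue masses (Da) of the unmodified amino acids.
# Assumption: leucine and isoleucine share one entry because they are isobaric.
RESIDUE_MASSES = (
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
    129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333,
    186.07931,
)


def good_diff_fraction(peaks, tolerance=0.25, low=56.0, high=187.0):
    """Sketch of GDFR for singly charged fragments.

    peaks: list of (mz, intensity) tuples.
    Returns 0.0 when no peak pair falls inside the [low, high] window.
    """
    good = 0.0   # intensity sum of pairs that differ by an amino acid mass
    total = 0.0  # intensity sum of all pairs inside the mass window
    for a in range(len(peaks)):
        for b in range(a + 1, len(peaks)):
            diff = abs(peaks[a][0] - peaks[b][0])
            if not (low <= diff <= high):
                continue
            pair_intensity = peaks[a][1] + peaks[b][1]
            total += pair_intensity
            if any(abs(diff - mass) <= tolerance for mass in RESIDUE_MASSES):
                good += pair_intensity
    return good / total if total > 0 else 0.0


if __name__ == "__main__":
    # The first two peaks differ by a glycine residue; the last pair does not
    # match any residue mass but still lies inside the 56-187 window.
    example = [(100.0, 0.2), (157.02146, 0.3), (300.0, 0.5)]
    print(good_diff_fraction(example))
```

The double loop is quadratic in the number of peaks, which is acceptable for a sketch; sorting the peaks by m/z and stopping the inner loop once the difference exceeds 187 would be a natural refinement.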

Figure 8. Spectra distribution against GDFR defined for singly charged fragments (1+ fragments). GOOD (black) and BAD (gray) assignments are shown. For GOOD assignments, the GDFR (1+ fragments) of doubly charged peptides is higher than that of triply charged peptides.

Multiply charged spectra include multiply charged fragment ions. Thus, it is reasonable to extend the definition of GDFR to include mass differences between multiply charged fragment ions as well, so that $M_i/2$ or $M_i/3$ is used in addition to $M_i$ in the GDFR definition. However, a previous investigation of peptide fragmentation indicates that singly charged fragments are much more abundant than doubly charged fragments when precursor ions are doubly charged (19). Therefore, it is sufficient to use the GDFR defined only for singly charged fragments as a feature for the quality of 2+ spectra. Figure 8a shows the distribution of 2+ spectra against the GDFR of singly charged fragments after filtering out low-quality spectra based on the Xrea threshold value. GDFR clearly discriminates between GOOD and BAD spectra. Figure 8b shows the same distribution for 3+ spectra. In this case, however, GDFR does not seem to be a useful feature for discerning between the GOOD and BAD sets. This is because doubly charged fragments are more abundant than singly charged fragments when precursor ions are triply charged. For 3+ spectra, it is therefore necessary to consider the GDFR for doubly charged fragments as well as for singly charged fragments.

Quality assessment was conducted with the multiply charged tandem mass spectra. The data set was partitioned into GOOD and BAD spectra, as explained in the Experimental Section. For classification, we used an SVM (support vector machine). As input to the SVM classifier generator, Xrea values and GDFR values for singly and doubly charged fragments were used. To avoid any bias in the training set during learning, 5-fold cross validation was adopted.
The training set consisted of 2640 and 2688 spectra of GOOD and BAD sets, respectively. To test the learned model, MS/MS spectra from the human erythroleukemia K562 cell line (18) were searched using SEQUEST against the IPI human protein database. Of assignments to doubly charged spectra and to triply charged spectra, 1692 peptide assignments to doubly charged spectra and 1790 to triply charged spectra were regarded as GOOD. The XCorr thresholds used were 3.22 ($[\mathrm{M} + 2\mathrm{H}]^{2+}$) and 3.45 ($[\mathrm{M} + 3\mathrm{H}]^{3+}$), which were determined by comparing search results against normal and reverse databases (18). By assessing the quality of each spectrum using the SVM classifier, we could filter out 75% of unidentifiable spectra while losing only 10% of identifiable spectra. Figure 9 shows the overall performance of the SVM classifier by means of a receiver operating characteristic (ROC) curve. Even if only a 2% loss of GOOD identifiable spectra is allowed, the classifier can filter out about 60% of BAD unidentifiable ones.

Figure 9. ROC curve for the SVM classifier tested on the human erythroleukemia data set. The inset box represents the ROC curve while keeping more than 90% of GOOD spectra. Our method makes it possible to filter out 75% of unidentifiable spectra while losing only 10% of identifiable spectra. Even if only a 2% loss of GOOD identifiable spectra is allowed, it can filter out about 60% of BAD unidentifiable ones.

Applications of the Model and Future Directions. The proposed spectral quality measure predicts reasonably well how likely a spectrum is to result in a correct identification and thus can be used to filter out low-quality spectra. But there are quite a few spectra for which this measure does not correspond to the likelihood of correct identification. The most interesting examples of such spectra are high-quality spectra whose match scores are poor (those in the marked section A of Figure 10).
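The filtering trade-off reported with Figure 9 (the fraction of BAD spectra removed for a given allowed loss of GOOD spectra) can be computed from any per-spectrum quality score, as in the sketch below. The function name `bad_filtered_at_good_loss` and the example scores are hypothetical; the sketch does not reproduce the SVM classifier itself, only how one point of such a trade-off curve is read off from scores and GOOD/BAD labels.

```python
def bad_filtered_at_good_loss(scores, is_good, max_good_loss):
    """Return (threshold, fraction of BAD spectra filtered out).

    scores: quality score per spectrum (higher means better quality).
    is_good: True for GOOD (identifiable) spectra, False for BAD ones.
    max_good_loss: largest fraction of GOOD spectra that may be discarded.
    Spectra scoring strictly below the threshold are filtered out.
    """
    good = sorted(s for s, g in zip(scores, is_good) if g)
    bad = [s for s, g in zip(scores, is_good) if not g]
    if not good or not bad:
        raise ValueError("both GOOD and BAD spectra are required")

    # Number of lowest-scoring GOOD spectra we are allowed to discard.
    allowed = min(int(max_good_loss * len(good)), len(good) - 1)
    threshold = good[allowed]  # lowest GOOD score that is still kept

    filtered = sum(1 for s in bad if s < threshold)
    return threshold, filtered / len(bad)


if __name__ == "__main__":
    # Hypothetical scores: four GOOD spectra followed by four BAD spectra.
    scores = [0.9, 0.8, 0.7, 0.3, 0.1, 0.2, 0.5, 0.75]
    labels = [True, True, True, True, False, False, False, False]
    print(bad_filtered_at_good_loss(scores, labels, 0.25))  # -> (0.7, 0.75)
```

Sweeping `max_good_loss` from 0 to 1 and collecting the returned fractions yields the points of a curve of the kind shown in Figure 9.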
In database searching approaches, peptide sequencing is impossible when peptides are not in the searched database or when peptides are post-translationally modified. In such cases, the search will provide incorrect identifications and insignificant scores. However, the peptide-spectrum match score can be improved either by extending the database or by considering post-translational modifications during the analysis. If the spectral quality assessment results in high values, it will be helpful to conduct a further in-depth analysis by spending more computational time. In a recent work, it was shown that unassigned spectra of high quality can be identified by additional extended searching (16).

Figure 10. XCorr score distribution against Xrea for doubly charged spectra from the ISB data set. The dashed line represents an XCorr score of 2.5. Spectra with very high XCorr are associated with very high Xrea values, and almost all the spectra below the specified Xrea threshold have poor XCorr values. Section A represents high-quality spectra whose match scores are poor, which warrants a more in-depth analysis. Section B represents low-quality spectra whose match scores are good, of which the search results are suspicious.

Another interesting set of spectra arises when the match score is high while the spectral quality is evaluated as poor, in spite of the general expectation that good-quality spectra lead to good search results. We investigated the spectra with high XCorr values and poor Xrea estimates, that is, those in section B of Figure 10. Figure 11 shows a typical example of such a spectrum. It was observed that, in most such cases, the peptide mass is relatively heavy. It was previously reported that the XCorr value correlates with peptide mass and that high-mass peptides get better XCorr values (23). Thus, if a spectrum receives a poor spectral quality evaluation while its search results score well, it may require manual inspection by a mass spectrometry expert, or at least some other validation process, as this may well be an indication that the search results are not entirely credible.

Figure 11. Peptide-spectrum match shown as colored peaks (red for y-ions and blue for b-ions) where a peak corresponds to a theoretical fragmentation site of the peptide. The spectrum is interpreted as a doubly charged precursor ion with the peptide VAGTWYSLAMAASDISLLDAQSAPLR. A high XCorr score (3.15) and DeltaCn (0.19) are assigned to this spectrum by SEQUEST, which are well over the usual stringent threshold values of XCorr of 2.5 and DeltaCn of 0.1. Experts evaluate this match as a false positive because noise and signal peaks are indistinguishable, random matches are made for the fragment ion peaks, and high-intensity peaks are not explained. The spectral quality assessment result by Xrea for this spectrum was poor and below the threshold.

Conclusions

We developed a useful quality measure to overcome problems in peptide identification by MS/MS analysis. The new intensity normalization, using cumulative intensity, enabled us to identify potential peptides that are likely to be lost in a database search due to poor spectral quality. The proposed spectral quality filtering method proved exceptional for filtering out poor-quality spectra that could lead to wrong identifications in large shotgun proteomic data sets, and it greatly improved the throughput of peptide identification. Evaluation of the method using the ISB and human erythroleukemia data sets established the utility of our method and its applicability to MS/MS data under different conditions. We expect that intensity normalization and quality assessment will be useful for designing efficient scoring algorithms and search strategies in software development for the analysis of shotgun proteomic data.

Acknowledgment. This work was supported by the 21C Frontier Functional Proteomics Project from the Korean Ministry of Science & Technology (FPR05A2-340) and an Academic Research Grant (2005) from the University of Seoul.

Supporting Information Available: A detailed proof of the property of cumulative intensity normalization. This material is available free of charge via the Internet.

References

(1) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5,
(2) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20,
(3) Taylor, J. A.; Johnson, R. S. Anal. Chem. 2001, 74,
(4) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17,
(5) Mann, M.; Wilm, M. Anal. Chem. 1994, 66,
(6) Tabb, D. L.; Saraf, A.; Yates, J. R., III. Anal. Chem. 2003, 75,
(7) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. Proteomics 2003, 3,
(8) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75,
(9) Frank, A.; Pevzner, P. Anal. Chem. 2005, 77,
(10) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22,
(11) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2000, 11,
(12) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R., III. Bioinformatics 2004, 20, i49-i54.
(13) Flikka, K.; Martens, L.; Vandekerckhove, J.; Gevaert, K.; Eidhammer, I. Proteomics 2006, 6,
(14) Xu, M.; Geer, L. Y.; Bryant, S. H.; Roth, J. S.; Kowalak, J. A.; Maynard, D. M.; Markey, S. P. J. Proteome Res. 2005, 4,
(15) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. Mol. Cell. Proteomics 2005, 4,
(16) Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Mol. Cell. Proteomics 2006, 5,
(17) Keller, A.; Purvine, S.; Nesvizhskii, A. I.; Stolyar, S.; Goodlett, D. R.; Kolker, E. OMICS 2002, 6,
(18) Resing, K. A.; Meyer-Arendt, K.; Mendoza, A. M.; Aveline-Wolf, L. D.; Jonscher, K. R.; Pierce, K. G.; Old, W. M.; Cheung, H. T.;

8 research articles Russell, S.; Wattawa, J. L.; Goehle, G. R.; Knight, R. D.; Ahn, N. G. Anal. Chem. 2004, 76, (19) Huang, Y.; Triscari, J. M.; Pasa-Tolic, L.; Anderson, G. A.; Lipton, M. S.; Smith, R. D.; Wysocki, V. H. J. Am. Chem. Soc. 2004, 126, (20) Huang, Y.; Triscari, J. M.; Tseng, G. C.; Pasa-Tolic, L.; Lipton, M. S.; Smith, R. D.; Wysocki, V. H. Anal. Chem. 2005, 77, (21) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III. J. Proteome Res. 2002, 1, (22) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, (23) MacCoss, M. J.; Wu, C. C.; Yates, J. R., III. Anal. Chem. 2002, 74, PR Na and Paek 3248 Journal of Proteome Research Vol. 5, No. 12, 2006
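The triage implied by sections A and B of Figure 10 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the XCorr cut-off of 2.5 comes from the Figure 10 caption, whereas the Xrea cut-off (0.4 here), the `SpectrumResult` record, the `triage` function, the category labels, and the choice to treat values equal to a cut-off as passing are all hypothetical and introduced only for the example.

```python
from dataclasses import dataclass

XCORR_CUTOFF = 2.5  # XCorr threshold quoted in the Figure 10 caption
XREA_CUTOFF = 0.4   # hypothetical value; the text only mentions "the specified Xrea threshold"


@dataclass
class SpectrumResult:
    """Hypothetical record pairing a search score with a quality score."""
    scan_id: str
    xcorr: float  # SEQUEST cross-correlation score of the top match
    xrea: float   # spectral quality score


def triage(spec: SpectrumResult,
           xcorr_cutoff: float = XCORR_CUTOFF,
           xrea_cutoff: float = XREA_CUTOFF) -> str:
    """Assign a spectrum to one of four follow-up categories."""
    good_quality = spec.xrea >= xrea_cutoff
    good_match = spec.xcorr >= xcorr_cutoff
    if good_quality and good_match:
        return "accept"
    if good_quality:
        # Section A: high-quality spectrum, poor match score.
        return "section A: search further"
    if good_match:
        # Section B: low-quality spectrum, good match score.
        return "section B: inspect manually"
    return "filter out"


if __name__ == "__main__":
    examples = [
        SpectrumResult("high_quality_unmatched", xcorr=1.2, xrea=0.7),
        SpectrumResult("figure_11_like", xcorr=3.15, xrea=0.1),
    ]
    for spec in examples:
        print(spec.scan_id, "->", triage(spec))
```

With these assumed thresholds, a spectrum like the one in Figure 11 (XCorr of 3.15 with an Xrea value below the cut-off) falls into section B and is flagged for manual inspection, while a high-quality spectrum with a poor XCorr falls into section A and becomes a candidate for an extended search.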


More information

Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation

Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation Jesper V. Olsen and Matthias Mann* Center for Experimental BioInformatics, Department of Biochemistry

More information

Using Annotated Peptide Mass Spectrum Libraries for Protein Identification

Using Annotated Peptide Mass Spectrum Libraries for Protein Identification for Protein Identification R. Craig, J. C. Cortens, D. Fenyo, and R. C. Beavis*,, Beavis Informatics Ltd., Winnipeg, MB, Canada R3B 1G7, Manitoba Centre for Systems Biology and Proteomics, University of

More information

iprophet: Multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates

iprophet: Multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates MCP Papers in Press. Published on August 29, 2011 as Manuscript M111.007690 This is the Pre-Published Version iprophet: Multi-level integrative analysis of shotgun proteomic data improves peptide and protein

More information

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University Protein Quantitation II: Multiple Reaction Monitoring Kelly Ruggles kelly@fenyolab.org New York University Traditional Affinity-based proteomics Use antibodies to quantify proteins Western Blot Immunohistochemistry

More information

Nature Methods: doi: /nmeth Supplementary Figure 1. Fragment indexing allows efficient spectra similarity comparisons.

Nature Methods: doi: /nmeth Supplementary Figure 1. Fragment indexing allows efficient spectra similarity comparisons. Supplementary Figure 1 Fragment indexing allows efficient spectra similarity comparisons. The cost and efficiency of spectra similarity calculations can be approximated by the number of fragment comparisons

More information

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry by Xi Han A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree

More information

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra Xiaowen Liu Department of BioHealth Informatics, Department of Computer and Information Sciences, Indiana University-Purdue

More information

SRM assay generation and data analysis in Skyline

SRM assay generation and data analysis in Skyline in Skyline Preparation 1. Download the example data from www.srmcourse.ch/eupa.html (3 raw files, 1 csv file, 1 sptxt file). 2. The number formats of your computer have to be set to English (United States).

More information

Mass Spectrometry Based De Novo Peptide Sequencing Error Correction

Mass Spectrometry Based De Novo Peptide Sequencing Error Correction Mass Spectrometry Based De Novo Peptide Sequencing Error Correction by Chenyu Yao A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics

More information

NPTEL VIDEO COURSE PROTEOMICS PROF. SANJEEVA SRIVASTAVA

NPTEL VIDEO COURSE PROTEOMICS PROF. SANJEEVA SRIVASTAVA LECTURE-25 Quantitative proteomics: itraq and TMT TRANSCRIPT Welcome to the proteomics course. Today we will talk about quantitative proteomics and discuss about itraq and TMT techniques. The quantitative

More information

Yifei Bao. Beatrix. Manor Askenazi

Yifei Bao. Beatrix. Manor Askenazi Detection and Correction of Interference in MS1 Quantitation of Peptides Using their Isotope Distributions Yifei Bao Department of Computer Science Stevens Institute of Technology Beatrix Ueberheide Department

More information

Optimization and Use of Peptide Mass Measurement Accuracy in Shotgun Proteomics

Optimization and Use of Peptide Mass Measurement Accuracy in Shotgun Proteomics MCP Papers in Press. Published on April 23, 26 as Manuscript M5339-MCP2 Optimization and Use of Peptide Mass Measurement Accuracy in Shotgun Proteomics Wilhelm Haas, Brendan K. Faherty, Scott A. Gerber,

More information

Mass spectrometry-based proteomics has become

Mass spectrometry-based proteomics has become FOCUS: THE ORBITRAP Computational Principles of Determining and Improving Mass Precision and Accuracy for Proteome Measurements in an Orbitrap Jürgen Cox and Matthias Mann Proteomics and Signal Transduction,

More information

Quantitation of a target protein in crude samples using targeted peptide quantification by Mass Spectrometry

Quantitation of a target protein in crude samples using targeted peptide quantification by Mass Spectrometry Quantitation of a target protein in crude samples using targeted peptide quantification by Mass Spectrometry Jon Hao, Rong Ye, and Mason Tao Poochon Scientific, Frederick, Maryland 21701 Abstract Background:

More information

Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry

Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry Steve M. M. Sweet,,# Andrew W. Jones, Debbie L. Cunningham, John K. Heath, Andrew J. Creese,

More information

A Statistical Model of Proteolytic Digestion

A Statistical Model of Proteolytic Digestion A Statistical Model of Proteolytic Digestion I-Jeng Wang, Christopher P. Diehl Research and Technology Development Center Johns Hopkins University Applied Physics Laboratory Laurel, MD 20723 6099 Email:

More information

TUTORIAL EXERCISES WITH ANSWERS

TUTORIAL EXERCISES WITH ANSWERS TUTORIAL EXERCISES WITH ANSWERS Tutorial 1 Settings 1. What is the exact monoisotopic mass difference for peptides carrying a 13 C (and NO additional 15 N) labelled C-terminal lysine residue? a. 6.020129

More information

Parallel Algorithms For Real-Time Peptide-Spectrum Matching

Parallel Algorithms For Real-Time Peptide-Spectrum Matching Parallel Algorithms For Real-Time Peptide-Spectrum Matching A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science

More information

False Discovery Rates of Protein Identifications: A Strike against the Two-Peptide Rule

False Discovery Rates of Protein Identifications: A Strike against the Two-Peptide Rule False Discovery Rates of Protein Identifications: A Strike against the Two-Peptide Rule Nitin Gupta*, and Pavel A. Pevzner Bioinformatics Program and Department of Computer Science and Engineering, University

More information

MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data

MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data Timothy Lee 1, Rahul Singh 1, Ten-Yang Yen 2, and Bruce Macher 2 1 Department

More information

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Oliver Serang Department of Genome Sciences, University of Washington, Seattle, Washington Michael

More information

Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry

Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry John T. Halloran Dept. of Electrical Engineering University of Washington Seattle, WA 99, USA Jeff A. Bilmes Dept. of Electrical

More information

ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES. A Thesis XINCHENG TANG

ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES. A Thesis XINCHENG TANG ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES A Thesis by XINCHENG TANG Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of

More information

Protein Post-translational Modifications Mapping with MS/MS based Frequent Interval Pattern Mining

Protein Post-translational Modifications Mapping with MS/MS based Frequent Interval Pattern Mining Protein Post-translational Modifications Mapping with MS/MS based Frequent Interval Pattern Mining Han Liu Department of Computer Science University of Illinois at Urbana-Champaign Email: hanliu@ncsa.uiuc.edu

More information

WADA Technical Document TD2015IDCR

WADA Technical Document TD2015IDCR MINIMUM CRITERIA FOR CHROMATOGRAPHIC-MASS SPECTROMETRIC CONFIRMATION OF THE IDENTITY OF ANALYTES FOR DOPING CONTROL PURPOSES. The ability of a method to identify an analyte is a function of the entire

More information