On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

Jiří Novák, David Hoksza, Jakub Lokoč, and Tomáš Skopal

Siret Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, Prague, Czech Republic

Abstract. Tandem mass spectrometry is a well-known technique for the identification of protein sequences from an in vitro sample. To identify the sequences from spectra captured by a spectrometer, a similarity search in a database of hypothetical mass spectra is often used. For this purpose, a database of known protein sequences is utilized to generate the hypothetical spectra. Since the number of sequences in the databases grows rapidly over time, several approaches have been proposed to index the databases of mass spectra. In this paper, we improve an approach based on the non-metric similarity search, where the M-tree and the TriGen algorithm are employed for fast and approximate search. We show that preprocessing of mass spectra by clustering speeds up the identification of sequences more than 100× with respect to the sequential scan of the entire database. Moreover, when the protein candidates are refined by a sequential scan in the postprocessing step, the whole approach exhibits precision similar to that of the sequential scan over the entire database (over 90%).

Keywords: tandem mass spectrometry, similarity search, non-metric access methods, protein sequences identification, spectral clustering

1 Introduction

Almost every process at the cell level is carried out by proteins, whose interactions form the basis of all living organisms. The functions of proteins are determined by their 3D structure, which is derived from protein sequences. Tandem mass spectrometry (MS/MS) is a widely known method for the identification of protein and peptide sequences from a sample of proteins in vitro. Commonly, the sample is analyzed in several runs of a mass spectrometer.
A set containing hundreds to thousands of mass spectra is captured in each run. The proteins in the sample are split into many peptide ions, where each mass spectrum corresponds to one peptide ion. Several peptide ions correspond to one peptide sequence and, similarly, several peptide sequences come from one protein sequence.

This work was supported by Czech Science Foundation (GAČR) projects P202/11/0968, P202/12/P297, 201/09/H057 and by the Grant Agency of Charles University (GAUK) project Nr

A mass spectrum is a list of peaks corresponding to peptide fragment ions. A peak is a pair (m/z, I), where m/z is the mass-to-charge ratio and I is the intensity of a fragment ion occurrence. Several types of fragment ions occur in a spectrum, forming so-called ion series. The most important series for correct peptide sequence identification are the y-ions and the b-ions. The completeness of these series is crucial for correct spectra interpretation, because the m/z difference between two neighboring peaks in one series, e.g., y_i and y_{i+1}, corresponds to the mass of an amino acid in the peptide. The precursor mass m_p (the mass of the peptide ion before splitting) is also provided as additional information for each spectrum captured by a spectrometer. The interpretation of spectra is often complicated by post-translational modifications (PTMs) of amino acids, because the masses of amino acids are changed in that case, and thus the peaks are shifted [21].

The mass spectrometer does not determine the peptide sequences from mass spectra directly; the spectra must be interpreted after they are captured. The successful computational approaches for the interpretation of mass spectra (i.e., assigning peptide sequences to mass spectra) [16] are often based on the similarity search in databases of already known or predicted protein sequences [24]. The databases contain millions of protein sequences [15], and the spectra sets generated by an MS/MS analysis that need to be interpreted (query sets from the database point of view) contain thousands of mass spectra. Thus, a sequential scan of the entire database for each mass spectrum is time consuming. To speed up the search, an index over the database of hypothetical mass spectra generated from known peptide sequences (short pieces of proteins) can be constructed.
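As an illustration of the peak structure described above, the following minimal Python sketch generates the b-ion and y-ion m/z values of a hypothetical mass spectrum for one peptide. It assumes singly charged fragments and standard monoisotopic residue masses; the function names `fragment_ions` and `precursor_mass` and the reduced residue table are illustrative assumptions and are not taken from the paper.

```python
# Monoisotopic residue masses in Da (standard values; only a subset is listed).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056   # mass of H2O in Da
PROTON = 1.00728   # mass of a proton in Da


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    Assumption: charge 1+ for every fragment, no modifications.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # b_1 .. b_{n-1}: N-terminal prefixes
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y_1 .. y_{n-1}: C-terminal suffixes
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


def precursor_mass(peptide):
    """Neutral mass of the unfragmented peptide (residues plus one water)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


b, y = fragment_ions("GASPVK")
# Neighboring peaks of one series differ by the mass of one amino acid.
print(y[1] - y[0])  # mass of V
print(b[1] - b[0])  # mass of A
```

The differences between neighboring peaks in one series equal the residue masses, which is why gaps in the y-ion or b-ion series make a spectrum harder to interpret.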
The simplest way is to index the database of spectra by the peptide precursor mass, because there is a correlation between the precursor masses of similar peptides [20]. A disadvantage is that indexing peptides by precursor mass limits the capability of managing spectra with PTMs, because the mass of peptides with PTMs may differ from that of the peptides without PTMs by tens to hundreds of Daltons. A few approaches were proposed where an inverted file was employed to index the database of protein sequences [11], [14]. Another approach uses a suffix tree [13], and there are also approaches based on the similarity search in metric spaces [22], [4]. The approaches based on the inverted index and on the similarity search in metric spaces commonly do not support the search of spectra with PTMs. We have proposed a fast method based on approximate non-metric similarity search [19], which is able to manage spectra with PTMs.

Although the search in an indexed database is fast, query sets of mass spectra still contain many noise spectra that should be ignored (on average 90% of the query set [10]). The noise spectra cannot be assigned to peptide sequences because they occur as an artifact of the spectrometer process. Query set preprocessing can be used to eliminate the noise spectra and to speed up the search, because only a small part of the query set needs to be interpreted [25], [23]. Commonly used preprocessing approaches are spectrum quality filtering [6], [17] and clustering [8], [7], [5]. The spectrum quality filtering methods analyze many parameters of spectra (the number of peaks and their relative intensity, the precursor mass, the number

of complementary y-ions and b-ions, etc.) and assign a score to each spectrum. Since the mass spectrometers from different manufacturers use different physical principles, the significance of the parameters may differ from machine to machine. Thus, the score heavily depends on the mass spectrometers that were used to capture the spectra. Only the spectra reaching a specific score are used for further processing, while the others are ignored as noisy ones.

On the other hand, the clustering is independent of the properties of different machines, because the spectra from different spectrometers are processed the same way, i.e., without the knowledge of the significance of particular parameters. The clustering is based on the fact that a mass spectrometer generates multiple spectra corresponding to a peptide sequence [28]. Since the spectra corresponding to the same peptide sequence are similar, they form a cluster. However, the set of spectra obtained from one spectrometer run contains many spectra that are not noise and correspond to a peptide sequence, yet are not clustered. A disadvantage in this case is that the clustering causes the loss of some peptide sequences [28]. Since a mass spectrometry task often consists of multiple spectrometer runs (each run generates a query set), this disadvantage can be resolved by merging query sets from multiple runs. The number of identifiable peptides that are not clustered decreases with the increasing number of merged query sets, while the noise spectra are successfully eliminated [2].

2 Computational Methods

We briefly introduce the metric access methods, the spectrum similarity employed in MAMs and for clustering of mass spectra, the original idea for spectra interpretation by MAMs, and an extension of this approach using preprocessing and postprocessing.

2.1 Metric access methods (MAMs)

A metric is a distance function satisfying reflexivity, symmetry, non-negativity and the triangle inequality.
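The triangle inequality named above can be checked empirically on a sample of objects. The following minimal Python sketch counts the fraction of triplets in which the longest side exceeds the sum of the other two. The function name `triangle_violation_ratio` and the exhaustive enumeration of triplets are assumptions made for illustration; a practical tool would sample triplets rather than enumerate all of them.

```python
import itertools


def triangle_violation_ratio(objects, dist):
    """Fraction of triplets that violate the triangle inequality under `dist`.

    A triplet is triangular if its longest side is not longer than the
    sum of the two remaining sides.
    """
    total = 0
    violated = 0
    for a, b, c in itertools.combinations(objects, 3):
        dab, dbc, dac = dist(a, b), dist(b, c), dist(a, c)
        longest = max(dab, dbc, dac)
        total += 1
        # Small epsilon guards against floating-point noise.
        if longest > (dab + dbc + dac - longest) + 1e-12:
            violated += 1
    return violated / total if total else 0.0


points = [0.0, 1.0, 2.0, 3.0]
# The absolute difference is a metric, so no triplet is violated.
print(triangle_violation_ratio(points, lambda a, b: abs(a - b)))
# The squared difference is only a semi-metric on these points.
print(triangle_violation_ratio(points, lambda a, b: (a - b) ** 2))
```

A distance with a non-zero ratio cannot be indexed exactly by a MAM, which motivates the approximate (non-metric) search discussed next.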
The MAMs were designed for fast search in databases modeled in metric spaces, where the triangle inequality is crucial for organizing objects into metric regions and for pruning irrelevant regions while searching [31]. A distance that (partially) violates the triangle inequality is called a semi-metric, and the search under such a distance is denoted as non-metric search. The tolerated violation of the triangle inequality is expressed by the triangle error (T-error) tolerance θ [26].

2.2 Spectrum similarity

The use of MAMs and of clustering requires a similarity function that says how similar two spectra are. A commonly employed function is the angle distance (or cosine similarity) [12], [1]; another approach is the sigmoid similarity [7], [25]. In this paper, we have chosen the parameterized Hausdorff distance

d_HP (Eq. 2), which was successfully applied in non-metric indexes [18].

h(x, y) = \frac{\sum_{x_i \in x} \sqrt[n]{\min_{y_j \in y} \{\max(0, |x_i - y_j| - \xi)\}}}{\dim(x)}    (1)

d_{HP}(x, y) = \max(h(x, y), h(y, x))    (2)

where x and y are the vectors of m/z ratios, dim(x) is the length of x and ξ is the mass error tolerance. The angle distance can be computed slightly faster than d_HP, but the number of identified peptides and the efficiency of MAMs are lower when the angle distance is used. Since the lists of peaks in mass spectra are implicitly sorted, both distances can be computed in linear time O(p), where p is the number of peaks in a spectrum [18].

2.3 Original method

We briefly describe the previously proposed approach [19], which employs the M-tree [3] and the TriGen algorithm [26]. First, protein sequences from a database are split into peptide sequences and the hypothetical mass spectra are generated from the peptide sequences. Second, the hypothetical mass spectra are indexed by the M-tree (or by another MAM) under d_HP, while the TriGen algorithm is utilized to control the T-error tolerance θ (i.e., the efficiency of the M-tree). The search is faster with increasing θ, but the number of identified peptides is lower. Finally, a k-nearest neighbor (kNN) query is performed for each query spectrum. For many spectra in the query set, a hypothetical mass spectrum among the k nearest neighbors corresponds to the peptide sequence that we are looking for. Since d_HP is a coarse function, an additional re-ranking is assumed to determine the correct peptide sequence from the k nearest neighbors.

2.4 Improvements

We propose an extension of the original approach, where the clustering is employed in the preprocessing step to filter out the noise spectra, thus speeding up the search, and where the sequential scan over the candidates is used in the postprocessing step to increase the number of identified peptide sequences (Fig. 1).

Fig. 1. Sequences identification (original method is yellow, improvements are blue)
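Returning to the parameterized Hausdorff distance of Sec. 2.2 (Eqs. 1 and 2), the following Python sketch shows one way it could be computed in linear time over two sorted m/z lists. It is a sketch under stated assumptions: the names `directed_h` and `d_hp`, the parameter name `tol` (standing for ξ), and the two-pointer scan are illustrative, and the normalization of the returned values is not reproduced here.

```python
def directed_h(x, y, tol=0.4, n=30):
    """Directed part h(x, y) of Eq. 1 for two ascending lists of m/z values.

    `tol` plays the role of the mass error tolerance and `n` is the root
    index. Because both lists are sorted, the index of the nearest peak in
    `y` never decreases while scanning `x`, so one pass suffices.
    """
    if not x or not y:
        raise ValueError("both spectra must contain at least one peak")
    total = 0.0
    j = 0
    for mz in x:
        # Advance while the next peak of y is at least as close to mz.
        while j + 1 < len(y) and abs(y[j + 1] - mz) <= abs(y[j] - mz):
            j += 1
        gap = max(0.0, abs(mz - y[j]) - tol)
        total += gap ** (1.0 / n)
    return total / len(x)


def d_hp(x, y, tol=0.4, n=30):
    """Symmetric distance of Eq. 2: the larger of the two directed parts."""
    return max(directed_h(x, y, tol, n), directed_h(y, x, tol, n))


query = [100.0, 200.0, 300.0]
shifted = [100.2, 200.3, 299.9]   # every peak within the tolerance
print(d_hp(query, shifted))       # 0.0
```

Peaks that lie within the tolerance of their nearest counterpart contribute zero, so two spectra that differ only by measurement error are at distance 0.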

Preprocessing. The preprocessing is realized by the clustering (Fig. 1a). A major premise for the clustering is that query sets from several spectrometer runs are merged. Hence, many interpretable spectra that are captured only once per spectrometer run have a twin in the query set, so they are not eliminated by the clustering. On the other hand, the noise spectra are successfully cleared away; thus, many spectra are not searched in the query phase (Fig. 1b), making the search significantly faster.

Alg. 1. Clustering of query mass spectra

Clustering(a set of clusters C, a threshold t, a number of cycles w) {
  let C be initialized with one mass spectrum per cluster;
  for w cycles {
    MergeClusters(C, t);
    SelectCentroids(C);
    RearrangeClusters(C, t);
    SelectCentroids(C); } }

MergeClusters(a set of clusters C, a threshold t) {
  for all clusters c_i ∈ C {
    select the spectrum c_{i,0};  // a centroid is stored at c_{i,0}
    for all clusters c_j ∈ C {
      if all spectra c_{j,k} have d_HP(c_{i,0}, c_{j,k}) ≤ t {
        store the position p of the cluster c_j with the minimal d_HP(c_{i,0}, c_{j,0}); } }
    merge the clusters c_i and c_p; } }

SelectCentroids(a set of clusters C) {
  for all clusters c_i ∈ C {
    for all spectra c_{i,j} {
      P = ∅;
      for all spectra c_{i,k} {
        store the maximal distance d_HP(c_{i,j}, c_{i,k}) and
        the position k of the spectrum c_{i,k} in the maximal d_HP into P; }
      select the position p with the minimal d_HP from P; }
    switch the spectra c_{i,0} and c_{i,p}; } }  // a new centroid has been moved to c_{i,0}

RearrangeClusters(a set of clusters C, a threshold t) {
  for all clusters c_i ∈ C {
    for all spectra c_{i,m} {
      P = ∅;
      for all clusters c_j ∈ C {
        for all spectra c_{j,n} {
          if all d_HP(c_{i,m}, c_{j,n}) ≤ t {
            store the distance d_HP(c_{i,m}, c_{j,0}) and
            the position j of the cluster c_j into P; } } }
      select the position p of the cluster with minimal d_HP from P;
      move the spectrum c_{i,m} to the cluster c_p; } } }

One of the best-known clustering algorithms is K-means [30], which is not suitable for clustering of mass spectra because we cannot predict the number of clusters K before the clustering [8]. Moreover, its time complexity is O(NKd), where N is the number of spectra in the query set and d is the dimensionality. K-means is therefore not suitable for large query sets and high-dimensional data, which is exactly the case of mass spectra (usually containing many peaks/dimensions). A better clustering algorithm for mass spectra is hierarchical clustering [8], [30]. Its disadvantage for large query sets of spectra is the time complexity O(N^2). Since we analyze the impact of the clustering on the number of identified peptide sequences, we employ a simple hierarchical-like clustering (Alg. 1).
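The structure of Alg. 1 can be sketched in Python as follows. This is one possible reading of the pseudocode rather than the authors' implementation: the distance is passed in as a generic `dist` callable (d_HP in the paper), the centroid is taken to be the member with the smallest maximal distance to the other members, and all function names are illustrative assumptions.

```python
def select_centroid(cluster, dist):
    """Swap the member with the smallest maximal distance to index 0."""
    best = min(range(len(cluster)),
               key=lambda j: max(dist(cluster[j], s) for s in cluster))
    cluster[0], cluster[best] = cluster[best], cluster[0]


def merge_clusters(clusters, t, dist):
    """Merge each cluster with the nearest cluster whose members all lie
    within t of the current centroid."""
    i = 0
    while i < len(clusters):
        centroid = clusters[i][0]
        best, best_d = None, None
        for j, other in enumerate(clusters):
            if j != i and all(dist(centroid, s) <= t for s in other):
                d = dist(centroid, other[0])
                if best is None or d < best_d:
                    best, best_d = j, d
        if best is not None:
            clusters[i].extend(clusters[best])
            del clusters[best]
            if best < i:
                i -= 1  # the current cluster shifted one position left
        i += 1


def rearrange_clusters(clusters, t, dist):
    """Move a spectrum to the admissible cluster with the nearest centroid."""
    for i in range(len(clusters)):
        for spectrum in list(clusters[i]):
            best, best_d = None, None
            for j, target in enumerate(clusters):
                if target and all(dist(spectrum, s) <= t for s in target):
                    d = dist(spectrum, target[0])
                    if best is None or d < best_d:
                        best, best_d = j, d
            if best is not None and best != i:
                clusters[i].remove(spectrum)
                clusters[best].append(spectrum)
    clusters[:] = [c for c in clusters if c]  # drop emptied clusters


def cluster_spectra(spectra, t, cycles, dist):
    """Run the merge / rearrange phases for the given number of cycles."""
    clusters = [[s] for s in spectra]  # one spectrum per cluster
    for _ in range(cycles):
        merge_clusters(clusters, t, dist)
        for c in clusters:
            select_centroid(c, dist)
        rearrange_clusters(clusters, t, dist)
        for c in clusters:
            select_centroid(c, dist)
    return clusters


def query_centroids(clusters):
    """Centroids of clusters with at least two members become the queries."""
    return [c[0] for c in clusters if len(c) >= 2]


# Toy usage with 1-D "spectra" and the absolute difference as the distance.
toy = [1.0, 1.1, 5.0, 5.05, 9.0]
result = cluster_spectra(toy, t=0.5, cycles=2, dist=lambda a, b: abs(a - b))
print(result)                   # two pairs and one singleton
print(query_centroids(result))  # the singleton 9.0 is not queried
```

Every phase compares spectra pairwise, which is where the quadratic cost mentioned above comes from.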

More efficient clustering algorithms [30] may be used for large query sets, e.g., an approach based on density clustering (DENCLUE) [9] with the time complexity O(N log N), which is capable of tackling high-dimensional data and which is robust when dealing with noisy data.

The clustering algorithm (Alg. 1) requires a set of clusters C initialized with one mass spectrum per cluster. Then two phases are repeated in w cycles. First, pairs of clusters with the minimal d_HP(c_{i,0}, c_{j,0}) such that d_HP(c_{i,0}, c_{j,0}) ≤ t are merged, where t is a threshold of d_HP and c_{i,0}, c_{j,0} are the centroids of the clusters c_i, c_j. The threshold t determines whether the spectra in a cluster are similar or not. Moreover, it determines the number of clusters. If t is too low, each spectrum forms a singleton cluster. If t is too high, all spectra form one cluster. Second, the spectra are rearranged among the clusters. A spectrum is moved to another cluster if the d_HP between the spectrum and every spectrum in the target cluster is less than or equal to t. If several clusters qualify, the cluster whose centroid has the minimal d_HP to the moved spectrum is picked. New centroids of clusters are selected after each phase. Finally, the centroids of clusters containing at least two spectra form the queries, which will be processed by the MAM.

Another way consists in putting all peaks from all the spectra in the cluster into a representative spectrum [2]. The intensities of the closest peaks are summed and their m/z values are averaged. Since an increasing number of peaks in a spectrum worsens the efficiency of MAMs because of the high intrinsic dimensionality [19], this approach needs a slight modification for the purposes of mass spectra interpretation by the non-metric similarity search. For example, a specified number of peaks with the highest intensity can be selected from the representative spectrum.

Query phase.
The query phase corresponds to the original idea presented in Sec. 2.3, where a kNN query is performed by a MAM for each spectrum selected in the preprocessing (Fig. 1b). The k nearest neighbor peptide sequences to each query spectrum are assigned to the protein sequences of their origin. The protein sequences containing at least one good peptide sequence hit (e.g., d_HP ≤ 0.65) are the protein sequence candidates. The MAM we have chosen for the query phase is the non-metric tree (NM-tree) [27], because it combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. The NM-tree could be replaced by another MAM, since our approach is independent of a specific method.

Postprocessing. The postprocessing is a sequential scan of the protein sequence candidates (Fig. 1c), which significantly improves the number of identified peptide sequences, because several peptide sequences in a protein sequence correspond to the mass spectra in the query set [29], [25]. The protein sequence candidates (i.e., a small subset of the database sequences selected in the query phase) are split into peptide sequences and their hypothetical mass spectra are compared to the entire set of input spectra (as it was before the clustering phase). The

spectra previously missed during the preprocessing or during the query phase are assigned to peptide sequences. The newly identified peptide sequences are assigned to the protein sequences of their origin. Finally, the peptide (or protein, respectively) sequences identified in the query phase and refined in the postprocessing phase form the result.

Note that some peptide sequences are lost during the clustering because their spectra are present only once in the query set. Some peptide sequences are lost during the query phase because the search is only approximate (non-metric). The sequential scan of protein sequence candidates helps to reveal a peptide sequence if it forms a part of a candidate protein sequence that was hit by another peptide sequence.

3 Experiments

We used a dataset containing MS/MS spectra from 2 protein mixtures A and B [10]. Spectra corresponding to peptide sequences were manually annotated. 14 mass spectrometer runs were performed on the mixture A and 8 runs on the mixture B. We show the results for the spectra from the first 6 runs on mixture A and from all runs on mixture B. The spectra were searched in the database of hypothetical mass spectra generated from the database of protein sequences. We used a part of the MSDB [15] database containing 100,000 protein sequences (i.e., 5.6 million peptide sequences or hypothetical mass spectra), including the reference protein sequences for mixtures A and B. All the experiments were carried out on a machine with 2 processors Intel Xeon X5660 (24 cores, 2.8 GHz) with 24 GB RAM and 64-bit OS Windows Server 2008 R2. Even though our implementation supports parallel processing of mass spectra, the stated times of clustering and peptide sequences identification were measured on one core.
If not otherwise stated, the following settings were used: protein sequences splitting enzyme: trypsin; maximum number of missed cleavage sites: 1; mass range of peptide ions generated from the database: 500-5,000 Da; fragment ions generated in hypothetical mass spectra: y-ions and b-ions; mass range of generated fragment ions: 300-2,000; m/z error tolerance (ξ): 0.4 Da; number of peaks with highest intensity used in a query: 50; distance measure: d_HP (with n = 30); clustering threshold (t): 0.65 (values returned by d_HP are normalized to ⟨0, 1⟩); T-error tolerance θ: 0.1.

The number of clusters is the number of those containing at least 2 spectra. Since we perform one kNN query per cluster, the number of clusters determines the number of kNN queries processed on the NM-tree. The number of missed spectra is counted after the clustering phase and before the query phase. It is the number of annotated spectra in clusters with single objects, which are thus missed by the clustering. "Independent runs" means that query sets of spectra from several spectrometer runs were processed separately and the results were summed (the number of clusters, number of missed spectra, time of clustering and ratio of identified spectra to annotated spectra) or averaged (time of identification per spectrum). "Merged runs" means that query sets of spectra from several spectrometer runs were processed together.
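The splitting of protein sequences listed in the settings above (trypsin, at most one missed cleavage site) can be illustrated by a short Python sketch. It assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P; this rule and the function name `tryptic_peptides` are assumptions, since the paper does not spell out the cleavage rule. The filter on the peptide mass range would be applied to the returned peptides afterwards and is omitted here.

```python
def tryptic_peptides(protein, missed_cleavages=1):
    """Split a protein sequence into tryptic peptides.

    Assumed rule: cleave after K or R, but not when followed by P.
    Peptides spanning up to `missed_cleavages` uncut sites are included.
    """
    if not protein:
        return []
    # Positions at which the sequence is cut (as slice boundaries).
    cuts = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(protein))

    peptides = []
    for start in range(len(cuts) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(cuts):
                break
            peptides.append(protein[cuts[start]:cuts[end]])
    return peptides


# R at position 6 is followed by P, so no cut is made there.
print(tryptic_peptides("AAKGGRPLLKCC"))
# ['AAK', 'AAKGGRPLLK', 'GGRPLLK', 'GGRPLLKCC', 'CC']
```

Allowing one missed cleavage roughly doubles the number of generated peptides, which is one reason the database of hypothetical spectra is much larger than the database of protein sequences.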

3.1 Clustering of spectra from two spectrometer runs

We have verified that clusters formed from merged query sets of spectra from two spectrometer runs contain many more annotated spectra than clusters formed from the query sets that are processed separately (Tab. 1). On average, the clusters formed from spectra from two spectrometer runs contain about 40.7% more annotated spectra than clusters formed from a single spectrometer run. Since we perform one kNN query per cluster containing at least 2 spectra, up to 79% of all kNN queries are not performed for the clusters formed from the spectra merged from two runs. For clusters formed from the spectra from single runs, up to 87% of all kNN queries are not performed, but there are many missed annotated spectra.

Table 1. Clustering of spectra from single runs and from two merged runs
Columns: Dataset | Num. of all spectra | Num. of annotated spectra | Independent runs: Num. of clusters, Spectra missed, Clustering time [s] | Merged runs: Num. of clusters, Spectra missed, Clustering time [s]
Rows: three datasets from mixture A and four datasets from mixture B (table values not recoverable).

3.2 Effectiveness and efficiency of peptide sequences identification

We have tested the impact of the query spectra clustering on the number of finally identified peptide sequences (i.e., after the postprocessing) and on the average time of identification per spectrum. We have compared the sequential scan of the entire database and the NM-tree in 3 different ways: without the clustering, with the clustering of two query sets processed independently, and with the clustering of two query sets processed together. When the clustering and/or the NM-tree were employed, the postprocessing was used.

The most peptide sequences (on average 94.6%) were identified when the sequential scan was performed without the clustering (Tab. 2). On average, 93.8% of peptide sequences were identified when the NM-tree was employed without clustering.
The ratio of identified peptides was noticeably worse when the clustering was applied on the query sets from single runs: about 75.3% for the sequential scan and only 65.4% for the NM-tree. When the clustering was applied on the query sets merged from two spectrometer runs, the ratio of identified peptides was almost the same as when no clustering was employed. On average, it was about 93.6% for the sequential scan and 90.1% for the NM-tree. The clustering of query sets merged from 2 runs thus worsens the ratio of identified peptides by about 1 percentage point when the sequential scan is performed over the entire database and by about 3.7 percentage points when the NM-tree is employed.

The slowest method was the sequential scan without clustering, where the average time of identification per spectrum was 7.04 s (Tab. 3). The NM-tree

without clustering took 0.28 s, thus the speed-up was about 25× (7.04/0.28). When clustering was applied on the query sets from single runs, the average time was 0.98 s (speed-up 7.2×) for the sequential scan and 0.04 s (speed-up about 176×, i.e., 7.04/0.04) for the NM-tree. When query sets from two spectrometer runs were merged and the clustering was applied, the average time was 1.59 s for the sequential scan (speed-up 4.4×) and 0.07 s for the NM-tree (speed-up about 100×, i.e., 7.04/0.07). When the NM-tree was employed with clustering, the average speed-up was 4× with respect to the NM-tree without clustering.

Table 2. The ratio of identified spectra to annotated spectra [%]
Columns: Dataset | With clustering, independent runs: Seq. scan, NM-tree | With clustering, merged runs: Seq. scan, NM-tree | Without clustering: Seq. scan, NM-tree
Rows: three datasets from mixture A and four datasets from mixture B (table values not recoverable).

Table 3. Time of identification per spectrum [s]
Columns: Dataset | With clustering, independent runs: Seq. scan, NM-tree | With clustering, merged runs: Seq. scan, NM-tree | Without clustering: Seq. scan, NM-tree
Rows: three datasets from mixture A and four datasets from mixture B (table values not recoverable).

3.3 Clustering of spectra merged from more spectrometer runs

We have tested the impact of the increasing number of spectra from several spectrometer runs in a query set on the number of annotated mass spectra missed by clustering and on the time of clustering (Tab. 4). We can observe that the number of missed annotated spectra is almost the same when spectra from two or more spectrometer runs are merged; thus, merging spectra from more than two spectrometer runs does not significantly improve the effectiveness of peptide sequences identification. Since we employ a simple clustering algorithm (Alg. 1), a disadvantage of merging spectra from too many spectrometer runs is that the time of clustering grows quadratically.

We also measured the ratio of identified to annotated spectra and the average time of identification per spectrum on the NM-tree. The ratio of identified spectra is almost the same when spectra from two or more spectrometer runs are merged (on average 95%).
The time of identification increases slightly with the increasing number of spectra because of the quadratic complexity of clustering. We can observe that the ratio of the number of clusters to the number of all spectra in a query set decreases with the increasing number of spectra. This could

be an advantage for large query sets of mass spectra, because only a small number of the spectra is queried and thus the search is significantly faster. When spectra from 14 spectrometer runs on mixture A were merged, the spectra formed 1188 clusters with more than one spectrum. Thus, only 8.3% of all queries were performed on the NM-tree. When spectra from 8 spectrometer runs on mixture B were merged, 4599 spectra formed 711 clusters; thus, only 15.5% of all queries were performed.

Table 4. Clustering of spectra merged from more spectrometer runs
Columns: Dataset | Num. of all spectra | Num. of annotated spectra | Num. of clusters | Ratio of clust. to all spectra [%] | Clustering time [s] | Spectra missed | Ratio of ident. spectra [%] | Time of ident. [s]
Rows: six datasets from mixture A (table values not recoverable).

3.4 Impact of distance threshold on clustering

We have tested the impact of the threshold t of d_HP on the number of clusters, the number of spectra missed by the clustering and the time of clustering (Tab. 5). We used the dataset A1-2 with 2213 spectra merged from two spectrometer runs. The number of clusters increases with increasing t, while the number of spectra missed by clustering decreases. The optimal t seems to be about 0.65, where the number of clusters (or kNN queries performed, respectively) is only 17.9% of the number of kNN queries that must be performed when the clustering is not employed. Moreover, there are only 16 missed spectra. For t < 0.65, the number of spectra missed by clustering grows because there are fewer hits among the hypothetical and the query spectra. The ratio of identified to annotated spectra is still more than 95% because the sequential scan of protein sequence candidates is employed. For t > 0.65, the number of clusters increases (up to t = 0.75) and the number of missed spectra is almost zero. A disadvantage is that a high t may form clusters of spectra not coming from the same peptide. In practice, the optimal t depends on the number of peaks in query spectra.
The optimal t may be higher than 0.65 when support for PTMs is implemented as described in [19]. The time of identification increases slightly with increasing t; this corresponds to the increasing number of clusters.

Table 5. Impact of distance threshold t on clustering
Columns: t | Num. of clusters | Spectra missed | Clustering time [s] | Ratio of ident. spectra [%] | Time of ident. [s]
(Table values not recoverable.)

4 Conclusions

We have shown that the clustering of tandem mass spectra significantly improves the efficiency of the method for protein and peptide sequences identification based on the non-metric similarity search in databases of protein sequences. When the NM-tree was employed with clustering, the search was more than 100× faster than the sequential scan without clustering, while the ratio of identified peptides was more than 90% in both cases.

The first major premise for successful identification of peptide sequences with clustering is that query sets from at least two spectrometer runs are merged. The second major premise is that the sequential scan of protein sequence candidates is performed, because the search using the NM-tree is fast but approximate. The fulfillment of both premises increases the number of identified peptide sequences and speeds up the search.

An important advantage of mass spectra preprocessing by clustering is its independence of the mass spectrometer that is used to capture the spectra. Since the mass spectrometer can generate spectra in many runs, a disadvantage may be the time complexity of the algorithm that is used to cluster the spectra. We use only a simple clustering algorithm with quadratic time complexity; thus, the implementation of a more sophisticated clustering algorithm with a time complexity of, e.g., O(N log N) is a subject of our future work.

References

1. Alfassi, Z.B.: On the Normalization of a Mass Spectrum for Comparison of Two Spectra. Journal of the Am. Soc. for Mass Spec. 15(3) (2004)
2. Beer, I., Barnea, E., Ziv, T., Admon, A.: Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics 4 (2004)
3. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: VLDB (1997)
4. Dutta, D., Chen, T.: Speeding up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast Near Neighbor Search. Bioinf. 23(5) (2007)
5. Falkner, J.A., Falkner, J.W., Yocum, A.K., Andrews, P.C.: A spectral clustering approach to MS/MS identification of post-translational modifications. Journal of Proteome Research 7(11) (2008)
6. Flikka, K., et al.: Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics 6 (2006)
7. Flikka, K., et al.: Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics 7 (2007)
8. Frank, A.M., et al.: Clustering millions of tandem mass spectra. Journal of Proteome Research 7(1) (2008)
9. Hinneburg, A., Keim, D.A.: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In: Proc. of KDD 98 (1998)
10. Keller, A., et al.: Experimental Protein Mixture for Validating Tandem Mass Spectral Analysis. OMICS: A Journal of Integrative Biology 6(2) (2002)
11. Li, Y., et al.: Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing. Rapid Comm. Mass Spec. 24(6) (2010)
12. Liu, J., et al.: Methods for peptide identification by spectral comparison. Proteome Science 5(3) (2007)

13. Lu, B., Chen, T.: A Suffix Tree Approach to the Interpretation of Tandem Mass Spectra: Applications to Peptides of Non-specific Digestion and Post-translational Modifications. In: Bioinformatics. vol. 19, Suppl. 2 (2003)
14. Mao, R., Ramakrishnan, S.R., Nuckolls, G., Miranker, D.P.: An inverted index for mass spectra similarity query and comparison with a metric-space method: case study. In: SISAP 10 (2010)
15. MSDB
16. Nesvizhskii, A.I.: A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. Journal of Proteomics 73(11) (2010)
17. Nesvizhskii, A.I., et al.: Dynamic Spectrum Quality Assessment and Iterative Computational Analysis of Shotgun Proteomic Data. Molecular & Cellular Proteomics 5 (2006)
18. Novák, J., Hoksza, D.: Parametrised Hausdorff Distance as a Non-Metric Similarity Model for Tandem Mass Spectrometry. In: CEUR Proc. DATESO (2010)
19. Novák, J., Skopal, T., Hoksza, D., Lokoč, J.: Non-metric Similarity Search of Tandem Mass Spectra Including Posttranslational Modifications. Journal of Discrete Algorithms (2011)
20. Park, C.Y., et al.: Rapid and accurate peptide identification from tandem mass spectra. Journal of Proteome Research 7(7) (2008)
21. Pevzner, P.A., Mulyukov, Z., Dančík, V., Tang, C.L.: Efficiency of Database Search for Identification of Mutated and Modified Proteins via Mass Spectrometry. Genome Research 11(2) (2001)
22. Ramakrishnan, S.R., et al.: A Fast Coarse Filtering Method for Peptide Identification by Mass Spectrometry. Bioinformatics 22(12) (2006)
23. Renard, B.Y., et al.: When less can yield more - Computational preprocessing of MS/MS spectra for peptide identification. Proteomics 9 (2009)
24. Sadygov, R.G., et al.: Large-scale Database Searching Using Tandem Mass Spectra: Looking up the Answer in the Back of the Book. Nature Met. 1(3) (2004)
25. Salmi, J., Nyman, T.A., Nevalainen, O.S., Aittokallio, T.: Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics 9 (2009)
26. Skopal, T.: Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces. ACM Transactions on Database Systems 32(4), 29 (2007)
27. Skopal, T., Lokoč, J.: NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. In: DEXA 08 (2008)
28. Tabb, D.L., et al.: Similarity among Tandem Mass Spectra from Proteomic Experiments: Detection, Significance and Utility. Anal. Chem. 75(10) (2003)
29. Wang, J., et al.: Peptide identification from mixture tandem mass spectra. Molecular & Cellular Proteomics 9(7) (2010)
30. Xu, R., Wunsch, D.: Survey of Clustering Algorithms. IEEE Transactions on Neural Networks 16(3) (2005)
31. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer, USA (2006)


More information

SRM assay generation and data analysis in Skyline

SRM assay generation and data analysis in Skyline in Skyline Preparation 1. Download the example data from www.srmcourse.ch/eupa.html (3 raw files, 1 csv file, 1 sptxt file). 2. The number formats of your computer have to be set to English (United States).

More information

Computational Analysis of Mass Spectrometric Data for Whole Organism Proteomic Studies

Computational Analysis of Mass Spectrometric Data for Whole Organism Proteomic Studies University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Doctoral Dissertations Graduate School 5-2006 Computational Analysis of Mass Spectrometric Data for Whole Organism Proteomic

More information

TUTORIAL EXERCISES WITH ANSWERS

TUTORIAL EXERCISES WITH ANSWERS TUTORIAL EXERCISES WITH ANSWERS Tutorial 1 Settings 1. What is the exact monoisotopic mass difference for peptides carrying a 13 C (and NO additional 15 N) labelled C-terminal lysine residue? a. 6.020129

More information

Fast Comparison of Software Birthmarks for Detecting the Theft with the Search Engine

Fast Comparison of Software Birthmarks for Detecting the Theft with the Search Engine 2016 4th Intl Conf on Applied Computing and Information Technology/3rd Intl Conf on Computational Science/Intelligence and Applied Informatics/1st Intl Conf on Big Data, Cloud Computing, Data Science &

More information

Pavel Zezula, Giuseppe Amato,

Pavel Zezula, Giuseppe Amato, SIMILARITY SEARCH The Metric Space Approach Pavel Zezula Giuseppe Amato Vlastislav Dohnal Michal Batko Table of Content Part I: Metric searching in a nutshell Foundations of metric space searching Survey

More information

via Tandem Mass Spectrometry and Propositional Satisfiability De Novo Peptide Sequencing Renato Bruni University of Perugia

via Tandem Mass Spectrometry and Propositional Satisfiability De Novo Peptide Sequencing Renato Bruni University of Perugia De Novo Peptide Sequencing via Tandem Mass Spectrometry and Propositional Satisfiability Renato Bruni bruni@diei.unipg.it or bruni@dis.uniroma1.it University of Perugia I FIMA International Conference

More information

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia

More information

PRIDE Cluster: building the consensus of proteomics data

PRIDE Cluster: building the consensus of proteomics data Supplementary Materials PRIDE Cluster: building the consensus of proteomics data Johannes Griss, Joseph Michael Foster, Henning Hermjakob and Juan Antonio Vizcaíno EMBL-European Bioinformatics Institute,

More information

Was T. rex Just a Big Chicken? Computational Proteomics

Was T. rex Just a Big Chicken? Computational Proteomics Was T. rex Just a Big Chicken? Computational Proteomics Phillip Compeau and Pavel Pevzner adjusted by Jovana Kovačević Bioinformatics Algorithms: an Active Learning Approach 215 by Compeau and Pevzner.

More information

Fast similarity searching making the virtual real. Stephen Pickett, GSK

Fast similarity searching making the virtual real. Stephen Pickett, GSK Fast similarity searching making the virtual real Stephen Pickett, GSK Introduction Introduction to similarity searching Use cases Why is speed so crucial? Why MadFast? Some performance stats Implementation

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

A Statistical Model of Proteolytic Digestion

A Statistical Model of Proteolytic Digestion A Statistical Model of Proteolytic Digestion I-Jeng Wang, Christopher P. Diehl Research and Technology Development Center Johns Hopkins University Applied Physics Laboratory Laurel, MD 20723 6099 Email:

More information

Improved Validation of Peptide MS/MS Assignments. Using Spectral Intensity Prediction

Improved Validation of Peptide MS/MS Assignments. Using Spectral Intensity Prediction MCP Papers in Press. Published on October 2, 2006 as Manuscript M600320-MCP200 Improved Validation of Peptide MS/MS Assignments Using Spectral Intensity Prediction Shaojun Sun 1, Karen Meyer-Arendt 2,

More information

A New Hybrid De Novo Sequencing Method For Protein Identification

A New Hybrid De Novo Sequencing Method For Protein Identification A New Hybrid De Novo Sequencing Method For Protein Identification Penghao Wang 1*, Albert Zomaya 2, Susan Wilson 1,3 1. Prince of Wales Clinical School, University of New South Wales, Kensington NSW 2052,

More information