On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

Jiří Novák, David Hoksza, Jakub Lokoč, and Tomáš Skopal

Siret Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, Prague, Czech Republic

Abstract. Tandem mass spectrometry is a well-known technique for identifying protein sequences from an in vitro sample. To identify the sequences from spectra captured by a spectrometer, a similarity search in a database of hypothetical mass spectra is often used. For this purpose, a database of known protein sequences is utilized to generate the hypothetical spectra. Since the number of sequences in the databases grows rapidly over time, several approaches have been proposed to index the databases of mass spectra. In this paper, we improve an approach based on non-metric similarity search, where the M-tree and the TriGen algorithm are employed for fast and approximate search. We show that preprocessing of mass spectra by clustering speeds up the identification of sequences more than 100× with respect to a sequential scan of the entire database. Moreover, when the protein candidates are refined by a sequential scan in the postprocessing step, the whole approach exhibits precision similar to that of a sequential scan over the entire database (over 90%).

Keywords: tandem mass spectrometry, similarity search, non-metric access methods, protein sequences identification, spectral clustering

1 Introduction

Almost every process at the cell level is carried out by proteins, whose interactions form the basis of all living organisms. The functions of proteins are determined by their 3D structure, which is derived from protein sequences. Tandem mass spectrometry (MS/MS) is a widely known method for identifying protein and peptide sequences from a sample of proteins in vitro. Commonly, the sample is analyzed in several runs of a mass spectrometer.
A set containing hundreds to thousands of mass spectra is captured in each run. The proteins in the sample are split into many peptide ions, and each mass spectrum corresponds to one peptide ion. Several peptide ions correspond to one peptide sequence and, similarly, several peptide sequences come from one protein sequence.

This work was supported by Czech Science Foundation (GAČR) projects P202/11/0968, P202/12/P297, 201/09/H057 and by the Grant Agency of Charles University (GAUK) project Nr

A mass spectrum is a list of peaks corresponding to peptide fragment ions. A peak is a pair (m/z, I), where m/z is the mass-to-charge ratio and I is the intensity of a fragment ion occurrence. Several types of fragment ions occur in a spectrum, forming so-called ion series. The most important series for correct peptide sequence identification are y-ions and b-ions. The completeness of these series is crucial for correct spectra interpretation, because the m/z difference between two neighboring peaks in one series, e.g., y_i and y_{i+1}, corresponds to the mass of an amino acid in the peptide. The precursor mass m_p (the mass of the peptide ion before splitting) is also provided as additional information for each spectrum captured by a spectrometer. The interpretation of spectra is often complicated by post-translational modifications (PTMs) of amino acids, because the masses of amino acids are changed in that case, and thus the peaks are shifted [21].

The mass spectrometer does not determine the peptide sequences from mass spectra directly; the spectra must be interpreted after they are captured. The successful computational approaches for interpretation of mass spectra (i.e., assigning the peptide sequences to mass spectra) [16] are often based on similarity search in databases of already known or predicted protein sequences [24]. The databases contain millions of protein sequences [15], and the spectra sets generated by an MS/MS analysis that need to be interpreted (query sets from the database point of view) contain thousands of mass spectra. Thus a sequential scan of the entire database for each mass spectrum is time-consuming. To speed up the search, an index over the database of hypothetical mass spectra generated from known peptide sequences (short pieces of proteins) can be constructed.
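The relation between a peptide sequence and its hypothetical spectrum can be illustrated with a minimal sketch. It assumes singly charged b- and y-ions and standard monoisotopic residue masses; the mass table, the function names `fragment_mz` and `precursor_mass`, and the default m/z window are illustrative assumptions, not the authors' implementation.

```python
# Monoisotopic residue masses in Da (standard reference values, not from the paper).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton (charge carrier)


def precursor_mass(peptide):
    """Neutral mass of the whole peptide: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_mz(peptide, lo=300.0, hi=2000.0):
    """Sorted m/z values of singly charged b- and y-ions inside [lo, hi]."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    peaks = []
    prefix = 0.0
    for m in masses[:-1]:                  # every internal cleavage site
        prefix += m
        peaks.append(prefix + PROTON)                    # b-ion (N-terminal part)
        peaks.append(total - prefix + WATER + PROTON)    # y-ion (C-terminal part)
    return sorted(p for p in peaks if lo <= p <= hi)
```

Under these assumptions, two neighboring y-ions of the same peptide differ by exactly one residue mass, which is the property the text relies on for interpretation.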
The simplest way is to index the database of spectra by peptide precursor mass, because there is a correlation between the precursor masses of similar peptides [20]. A disadvantage is that indexing peptides by precursor mass limits the capability of managing spectra with PTMs, because the mass of a peptide with PTMs may differ from that of the peptide without PTMs by tens to hundreds of Daltons. A few approaches were proposed where an inverted file was employed to index the database of protein sequences [11], [14]. Another approach uses a suffix tree [13], and there are also approaches based on similarity search in metric spaces [22], [4]. The approaches based on the inverted index and on similarity search in metric spaces commonly do not support the search of spectra with PTMs. We have proposed a fast method based on approximate non-metric similarity search [19], which is able to manage spectra with PTMs.

Although the search in an indexed database is fast, query sets of mass spectra still contain many noise spectra that should be ignored (on average 90% of the query set [10]). The noise spectra cannot be assigned to peptide sequences because they occur as an artifact of the spectrometer process. Query set preprocessing can be used to eliminate the noise spectra and to speed up the search, because only a small part of the query set then needs to be interpreted [25], [23]. Commonly used preprocessing approaches are spectrum quality filtering [6], [17] and clustering [8], [7], [5].

The spectrum quality filtering methods analyze many parameters of spectra (the number of peaks and their relative intensity, the precursor mass, the number of complementary y-ions and b-ions, etc.) and assign a score to each spectrum. Since mass spectrometers from different manufacturers use different physical principles, the significance of the parameters may differ from machine to machine. Thus, the score heavily depends on the mass spectrometers that were used to capture the spectra. Only the spectra reaching a specific score are used for further processing, while the others are ignored as noisy ones.

On the other hand, clustering is independent of the properties of different machines, because the spectra from different spectrometers are processed in the same way, i.e., without knowledge of the significance of particular parameters. The clustering is based on the fact that a mass spectrometer generates multiple spectra corresponding to a peptide sequence [28]. Since the spectra corresponding to the same peptide sequence are similar, they form a cluster. However, the set of spectra obtained from one spectrometer run contains many spectra that are not noise and do correspond to a peptide sequence, yet are captured only once. A disadvantage in this case is that the clustering causes a loss of some peptide sequences [28]. Since a mass spectrometry task often consists of multiple spectrometer runs (each run generates a query set), this disadvantage can be resolved by merging query sets from multiple runs. The number of identifiable peptides that are not clustered decreases with the increasing number of merged query sets, while the noise spectra are still eliminated [2].

2 Computational Methods

We briefly introduce the metric access methods, the spectrum similarity employed in MAMs and for clustering of mass spectra, the original idea for spectra interpretation by MAMs, and an extension of this approach using preprocessing and postprocessing.

2.1 Metric access methods (MAMs)

A metric is a distance function satisfying reflexivity, symmetry, non-negativity and the triangle inequality.
The MAMs were designed for fast search in databases modeled in metric spaces, where the triangle inequality is crucial for organizing objects into metric regions and for pruning irrelevant regions while searching [31]. A distance that (partially) violates the triangle inequality is called a semi-metric, and searching under such a distance is denoted as non-metric search. The violation of the triangle inequality is expressed by the triangle error (T-error) tolerance θ [26].

2.2 Spectrum similarity

The use of MAMs and clustering requires a similarity function that says how similar two spectra are. A commonly employed function is the angle distance (or cosine similarity) [12], [1]; another approach is the sigmoid similarity [7], [25]. In this paper, we have chosen the parameterized Hausdorff distance d_HP (Eq. 2), which was successfully applied in non-metric indexes [18].

$$h(x, y) = \frac{\sum_{x_i \in x} \sqrt[n]{\min_{y_j \in y} \{\max(0, |x_i - y_j| - \xi)\}}}{\dim(x)} \qquad (1)$$

$$d_{HP}(x, y) = \max(h(x, y), h(y, x)) \qquad (2)$$

where x and y are the vectors of m/z ratios, dim(x) is the length of x, n is the index of the root, and ξ is a mass error tolerance. The angle distance can be computed slightly faster than d_HP, but the number of identified peptides and the efficiency of MAMs are lower when the angle distance is utilized. Since the lists of peaks in mass spectra are implicitly sorted, both distances can be computed with linear time complexity O(p), where p is the number of peaks in a spectrum [18].

2.3 Original method

We briefly describe the previously proposed approach [19], which employs the M-tree [3] and the TriGen algorithm [26]. First, protein sequences from a database are split into peptide sequences, and the hypothetical mass spectra are generated from the peptide sequences. Second, the hypothetical mass spectra are indexed by the M-tree (or by another MAM) under d_HP, while the TriGen algorithm is utilized to control the T-error tolerance θ (i.e., the efficiency of the M-tree). The search is faster with increasing θ, but the number of identified peptides is lower. Finally, a k-nearest neighbor (kNN) query is performed for each query spectrum. For many spectra in the query set, a hypothetical mass spectrum among the k nearest neighbors corresponds to the peptide sequence that we are looking for. Since d_HP is a coarse function, an additional re-ranking is assumed to determine the correct peptide sequence from the k nearest neighbors.

2.4 Improvements

We propose an extension of the original approach, where clustering is employed in the preprocessing step to filter out the noise spectra and thus speed up the search, and where a sequential scan over the candidates is used in the postprocessing step to increase the number of identified peptide sequences (Fig. 1).

Fig. 1. Sequences identification (original method is yellow, improvements are blue)
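The parameterized Hausdorff distance of Sec. 2.2 can be sketched as follows. This is a minimal illustration, assuming that the n-th root in Eq. 1 is applied to each clipped nearest-peak gap and that both peak lists are non-empty and sorted in ascending order; the function names `h` and `d_hp` and the default values ξ = 0.4 and n = 30 are assumptions taken for illustration. The normalization of d_HP values to a unit interval mentioned in the experimental settings is not reproduced here.

```python
def h(x, y, xi=0.4, n=30):
    """Directed part of d_HP: mean n-th root of the nearest-peak gap beyond xi.

    x and y are ascending lists of m/z values. Because both lists are sorted,
    the nearest peak in y can be tracked with a single forward-moving pointer,
    which gives linear time in the number of peaks.
    """
    j = 0
    total = 0.0
    for xv in x:
        # Advance while the next peak in y is at least as close to xv.
        while j + 1 < len(y) and abs(y[j + 1] - xv) <= abs(y[j] - xv):
            j += 1
        gap = max(0.0, abs(xv - y[j]) - xi)   # gaps within the tolerance count as 0
        total += gap ** (1.0 / n)
    return total / len(x)


def d_hp(x, y, xi=0.4, n=30):
    """Symmetric distance: the larger of the two directed values (Eq. 2)."""
    return max(h(x, y, xi, n), h(y, x, xi, n))
```

Two spectra whose peaks all match within the tolerance ξ get distance 0, so small measurement errors in m/z do not separate spectra of the same peptide.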

Preprocessing. The preprocessing is realized by clustering (Fig. 1a). A major premise for the clustering is that query sets from several spectrometer runs are merged. Hence, many interpretable spectra that are captured only once per spectrometer run have a twin in the merged query set, so they are not eliminated by the clustering. On the other hand, the noise spectra are cleared away, so many spectra are not searched in the query phase (Fig. 1b), making the search significantly faster.

Alg. 1. Clustering of query mass spectra

    Clustering(a set of clusters C, a threshold t, a number of cycles w) {
      let C be initialized with one mass spectrum per cluster;
      for w cycles {
        MergeClusters(C, t);
        SelectCentroids(C);
        RearrangeClusters(C, t);
        SelectCentroids(C);
      }
    }

    MergeClusters(a set of clusters C, a threshold t) {
      for all clusters c_i ∈ C {
        select the spectrum c_i,0;   // a centroid is stored at c_i,0
        for all clusters c_j ∈ C {
          if all spectra c_j,k have d_HP(c_i,0, c_j,k) ≤ t {
            store the position p of the cluster c_j with the minimal d_HP(c_i,0, c_j,0);
          }
        }
        if such a cluster was found, merge the clusters c_i and c_p;
      }
    }

    SelectCentroids(a set of clusters C) {
      for all clusters c_i ∈ C {
        P = ∅;
        for all spectra c_i,j {
          for all spectra c_i,k {
            find the maximal distance d_HP(c_i,j, c_i,k);
          }
          store the maximal distance and the position j into P;
        }
        select the position p with the minimal stored distance from P;
        switch the spectra c_i,0 and c_i,p;   // a new centroid has been moved to c_i,0
      }
    }

    RearrangeClusters(a set of clusters C, a threshold t) {
      for all clusters c_i ∈ C {
        for all spectra c_i,m {
          P = ∅;
          for all clusters c_j ∈ C {
            if all spectra c_j,n have d_HP(c_i,m, c_j,n) ≤ t {
              store the distance d_HP(c_i,m, c_j,0) and the position j of the cluster c_j into P;
            }
          }
          select the position p of the cluster with the minimal d_HP from P;
          move the spectrum c_i,m to the cluster c_p;
        }
      }
    }

One of the best-known clustering algorithms is K-means [30], which is not suitable for clustering of mass spectra because the number of clusters K cannot be predicted before the clustering [8]. Moreover, its time complexity is O(NKd), where N is the number of spectra in the query set and d is the dimensionality. K-means is therefore not suitable for large query sets and high-dimensional data, which is exactly the case of mass spectra (usually containing many peaks/dimensions). A better clustering algorithm for mass spectra is hierarchical clustering [8], [30]. Its disadvantage for large query sets of spectra is the time complexity O(N²). Since we analyze the impact of the clustering on the number of identified peptide sequences, we employ a simple hierarchical-like clustering (Alg. 1).
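The structure of Alg. 1 can be made concrete with a short runnable sketch. It is a simplified reading of the pseudocode, not the authors' implementation: clusters are lists of spectrum indices with the centroid at position 0, the distance is passed in as a function (d_HP in the paper, any callable here), and the names `cluster_spectra` and `query_indices` are hypothetical.

```python
def select_centroid(cluster, d):
    """Move the member with the smallest maximal distance to the rest to index 0."""
    best = min(range(len(cluster)),
               key=lambda j: max(d(cluster[j], s) for s in cluster))
    cluster[0], cluster[best] = cluster[best], cluster[0]


def fits(s, cluster, d, t):
    """True if spectrum s is within t of every member of the cluster."""
    return all(d(s, m) <= t for m in cluster)


def merge_clusters(clusters, d, t):
    i = 0
    while i < len(clusters):
        centroid = clusters[i][0]
        candidates = [j for j in range(len(clusters))
                      if j != i and fits(centroid, clusters[j], d, t)]
        if candidates:
            # Merge with the candidate whose centroid is closest.
            p = min(candidates, key=lambda j: d(centroid, clusters[j][0]))
            clusters[i].extend(clusters.pop(p))
            if p < i:
                i -= 1            # the current cluster shifted one slot left
        i += 1


def rearrange_clusters(clusters, d, t):
    for s in [s for c in clusters for s in c]:     # snapshot of all spectra
        src = next(c for c in clusters if s in c)
        options = [c for c in clusters if fits(s, c, d, t)]
        if not options:
            continue
        target = min(options, key=lambda c: d(s, c[0]))
        if target is not src:
            src.remove(s)
            target.append(s)
            if not src:
                clusters.remove(src)               # drop the emptied cluster


def cluster_spectra(spectra, dist, t, w=3):
    """Run w merge/rearrange cycles, starting from one spectrum per cluster."""
    d = lambda a, b: dist(spectra[a], spectra[b])
    clusters = [[i] for i in range(len(spectra))]
    for _ in range(w):
        merge_clusters(clusters, d, t)
        for c in clusters:
            select_centroid(c, d)
        rearrange_clusters(clusters, d, t)
        for c in clusters:
            select_centroid(c, d)
    return clusters


def query_indices(clusters):
    """Centroids of clusters with at least two spectra become the queries."""
    return [c[0] for c in clusters if len(c) >= 2]
```

As a toy usage example with one-dimensional "spectra" and the absolute difference as distance, `cluster_spectra([10.0, 11.0, 12.0, 50.0, 51.0, 90.0], lambda a, b: abs(a - b), t=3.0)` groups the first three values, the next two, and leaves the last value as a singleton that would not be queried.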

More efficient clustering algorithms [30] may be used for large query sets, e.g., an approach based on density clustering (DENCLUE) [9] with the time complexity O(N log N), which is capable of tackling high-dimensional data and which is robust when dealing with noisy data.

The clustering algorithm (Alg. 1) requires a set of clusters C initialized with one mass spectrum per cluster. Then two phases are repeated in w cycles. First, pairs of clusters with the minimal d_HP(c_i,0, c_j,0) such that d_HP(c_i,0, c_j,0) ≤ t are merged, where t is a threshold of d_HP and c_i,0, c_j,0 are the centroids of the clusters c_i, c_j. The threshold t determines whether the spectra in a cluster are considered similar. Moreover, it determines the number of clusters. If t is too low, each spectrum forms a singleton cluster. If t is too high, all spectra form one cluster. Second, the spectra are rearranged among the clusters. A spectrum is moved to another cluster if the d_HP between the spectrum and every spectrum in the target cluster is less than or equal to t. If several clusters qualify, the cluster is picked for which the d_HP between its centroid and the moved spectrum is minimal. New centroids of clusters are selected after each phase.

Finally, the centroids of clusters containing at least two spectra form the queries, which will be processed by a MAM. An alternative is to put all peaks from all the spectra in the cluster into a representative spectrum [2]. The intensities of the closest peaks are summed and their m/z values are averaged. Since an increasing number of peaks in a spectrum worsens the efficiency of MAMs because of high intrinsic dimensionality [19], this approach needs a small improvement for the purposes of mass spectra interpretation by non-metric similarity search. For example, a specified number of peaks with the highest intensity can be selected from the representative spectrum.

Query phase. The query phase corresponds to the original idea presented in Sec. 2.3, where a kNN query is performed by a MAM for each spectrum selected in the preprocessing (Fig. 1b). The k nearest neighbor peptide sequences of each query spectrum are assigned to the protein sequences of their origin. The protein sequences containing at least one good peptide sequence hit (e.g., d_HP ≤ 0.65) are the protein sequence candidates. The MAM we have chosen for the query phase is the non-metric tree (NM-tree) [27], because it combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. The NM-tree could be replaced by another MAM, since our approach is independent of a specific method.

Postprocessing. The postprocessing is a sequential scan of the protein sequence candidates (Fig. 1c), which significantly improves the number of identified peptide sequences, because several peptide sequences in a protein sequence correspond to the mass spectra in the query set [29], [25]. The protein sequence candidates (i.e., a small subset of the database sequences selected in the query phase) are split into peptide sequences, and their hypothetical mass spectra are compared to the entire set of input spectra (as it was before the clustering phase). The spectra previously missed during the preprocessing or during the query phase are assigned to peptide sequences. The newly identified peptide sequences are assigned to the protein sequences of their origin. Finally, the peptide (or protein, respectively) sequences identified in the query phase and refined in the postprocessing phase form the result.

Note that some peptide sequences are lost during the clustering because their spectra are present only once in the query set. Some peptide sequences are lost during the query phase because the search is only approximate (non-metric). The sequential scan of protein sequence candidates helps to reveal a peptide sequence if it forms a part of a candidate protein sequence that was hit by another peptide sequence.

3 Experiments

We used a dataset containing MS/MS spectra from 2 protein mixtures A and B [10]. Spectra corresponding to peptide sequences were manually annotated. 14 mass spectrometer runs were performed on mixture A and 8 runs on mixture B. We show the results for the spectra from the first 6 runs on mixture A and from all runs on mixture B. The spectra were searched in the database of hypothetical mass spectra generated from the database of protein sequences. We used a part of the MSDB [15] database containing 100,000 protein sequences (i.e., 5.6 million peptide sequences or hypothetical mass spectra), including the reference protein sequences for mixtures A and B.

All the experiments were carried out on a machine with 2 Intel Xeon X5660 processors (24 cores, 2.8 GHz), 24 GB RAM and the 64-bit OS Windows Server 2008 R2. Even though our implementation supports parallel processing of mass spectra, the stated times of clustering and peptide sequence identification were measured on one core.
If not otherwise stated, the following settings were used: protein sequences splitting enzyme: trypsin; maximum number of missed cleavage sites: 1; mass range of peptide ions generated from the database: 500-5,000 Da; fragment ions generated in hypothetical mass spectra: y-ions and b-ions; mass range of generated fragment ions: 300-2,000; m/z error tolerance (ξ): 0.4 Da; number of peaks with the highest intensity used in a query: 50; distance measure: d_HP (with n = 30); clustering threshold (t): 0.65 (values returned by d_HP are normalized to ⟨0, 1⟩); T-error tolerance θ: 0.1.

The number of clusters is the number of those containing at least 2 spectra. Since we perform one kNN query per cluster, the number of clusters determines the number of kNN queries processed on the NM-tree. The number of missed spectra is counted after the clustering phase and before the query phase. It is the number of annotated spectra in singleton clusters, which are thus missed by the clustering. Independent runs means that query sets of spectra from several spectrometer runs were processed separately and the results were summed (the number of clusters, the number of missed spectra, the time of clustering and the ratio of identified spectra to annotated spectra) or averaged (the time of identification per spectrum). Merged runs means that query sets of spectra from several spectrometer runs were processed together.
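The digestion settings listed above (trypsin, at most one missed cleavage site) can be illustrated with a small in-silico digest. This is a sketch under a common convention that is an assumption here and is not stated in the paper: trypsin cleaves after K or R unless the next residue is P. The function names `cleave` and `digest` are hypothetical, and the peptide mass-range filter is left out.

```python
def cleave(protein):
    """Split a protein after every K or R that is not followed by P."""
    pieces = []
    start = 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])     # C-terminal remainder
    return pieces


def digest(protein, missed=1):
    """Peptides made of 1 to missed + 1 consecutive fully cleaved pieces."""
    pieces = cleave(protein)
    peptides = []
    for i in range(len(pieces)):
        for k in range(missed + 1):
            if i + k < len(pieces):
                peptides.append("".join(pieces[i:i + k + 1]))
    return peptides
```

Allowing one missed cleavage roughly doubles the number of peptides per protein, which is one reason the database of hypothetical spectra is much larger than the database of protein sequences.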

3.1 Clustering of spectra from two spectrometer runs

We have verified that clusters formed from merged query sets of spectra from two spectrometer runs contain many more annotated spectra than clusters formed from query sets that are processed separately (Tab. 1). On average, the clusters formed from spectra from two spectrometer runs contain about 40.7% more annotated spectra than clusters formed from a single spectrometer run. Since we perform one kNN query per cluster containing at least 2 spectra, up to 79% of all kNN queries are not performed for the clusters formed from the spectra merged from two runs. For clusters formed from the spectra from single runs, up to 87% of all kNN queries are not performed, but there are many missed annotated spectra.

Table 1. Clustering of spectra from single runs and from two merged runs (columns: Dataset; Num. of all spectra; Num. of annotated spectra; Independent runs: Num. of clusters, Spectra missed, Clustering time [s]; Merged runs: Num. of clusters, Spectra missed, Clustering time [s]; rows: datasets from mixtures A and B)

3.2 Effectiveness and efficiency of peptide sequences identification

We have tested the impact of the query spectra clustering on the number of finally identified peptide sequences (i.e., after the postprocessing) and on the average time of identification per spectrum. We have compared the sequential scan of the entire database and the NM-tree in 3 different ways: without the clustering, with the clustering of two query sets processed independently, and with the clustering of two query sets processed together. When the clustering and/or the NM-tree were employed, the postprocessing was used.

The most peptide sequences (on average 94.6%) were identified when the sequential scan was performed without the clustering (Tab. 2). On average, 93.8% of peptide sequences were identified when the NM-tree was employed without clustering.
The ratio of identified peptides was noticeably worse when the clustering was applied to the query sets from single runs: about 75.3% for the sequential scan and only 65.4% for the NM-tree. When the clustering was applied to the query sets merged from two spectrometer runs, the ratio of identified peptides was almost the same as when no clustering was employed. On average, it was about 93.6% for the sequential scan and 90.1% for the NM-tree. The clustering of query sets merged from 2 runs worsens the ratio of identified peptides by about 1% when the sequential scan is performed over the entire database and by about 3.7% when the NM-tree is employed.

The slowest method was the sequential scan without clustering, where the average time of identification per spectrum was 7.04 s (Tab. 3). The NM-tree without clustering took 0.28 s, a speed-up of about 25× (7.04/0.28). When clustering was applied to the query sets from single runs, the average time was 0.98 s (speed-up 7.2×) for the sequential scan and 0.04 s (speed-up about 176×, i.e., 7.04/0.04) for the NM-tree. When query sets from two spectrometer runs were merged and the clustering was applied, the average time was 1.59 s for the sequential scan (speed-up 4.4×) and 0.07 s for the NM-tree (speed-up about 100×, i.e., 7.04/0.07). When the NM-tree was employed with clustering, the average speed-up was 4× with respect to the NM-tree without clustering.

Table 2. The ratio of identified spectra to annotated spectra [%] (columns: Dataset; With clustering, Independent runs: Seq. scan, NM-tree; With clustering, Merged runs: Seq. scan, NM-tree; Without clustering: Seq. scan, NM-tree; rows: datasets from mixtures A and B)

Table 3. Time of identification per spectrum [s] (columns: Dataset; With clustering, Independent runs: Seq. scan, NM-tree; With clustering, Merged runs: Seq. scan, NM-tree; Without clustering: Seq. scan, NM-tree; rows: datasets from mixtures A and B)

3.3 Clustering of spectra merged from more spectrometer runs

We have tested the impact of the increasing number of spectra from more spectrometer runs in a query set on the number of annotated mass spectra missed by clustering and on the time of clustering (Tab. 4). We can observe that the number of missed annotated spectra is almost the same when spectra from two or more spectrometer runs are merged; thus merging spectra from more than two spectrometer runs does not significantly improve the effectiveness of peptide sequence identification. Since we employ a simple clustering algorithm (Alg. 1), a disadvantage of merging spectra from too many spectrometer runs is that the time of clustering grows quadratically with the number of spectra.

We also measured the ratio of identified to annotated spectra and the average time of identification per spectrum on the NM-tree. The ratio of identified spectra is almost the same when spectra from two or more spectrometer runs are merged (on average 95%).
The time of identification increases slightly with the increasing number of spectra because of the quadratic complexity of clustering. We can observe that the ratio of the number of clusters to the number of all spectra in a query set decreases with the increasing number of spectra. This could be an advantage for large query sets of mass spectra, because only a small number of the spectra is queried and thus the search is significantly faster. When spectra from 14 spectrometer runs on mixture A were merged, the spectra formed 1188 clusters with more than one spectrum. Thus only 8.3% of all queries were performed on the NM-tree. When spectra from 8 spectrometer runs on mixture B were merged, 4599 spectra formed 711 clusters, thus only 15.5% of all queries were performed.

Table 4. Clustering of spectra merged from more spectrometer runs (columns: Dataset; Num. of all spectra; Num. of annotated spectra; Num. of clusters; Ratio of clusters to all spectra [%]; Spectra missed; Clustering time [s]; Ratio of ident. spectra [%]; Time of ident. [s]; rows: datasets from mixture A)

3.4 Impact of distance threshold on clustering

We have tested the impact of the threshold t of d_HP on the number of clusters, the number of spectra missed by the clustering and the time of clustering (Tab. 5). We used the dataset A1-2 with 2213 spectra merged from two spectrometer runs. The number of clusters increases with increasing t, while the number of spectra missed by clustering decreases. The optimal t seems to be about 0.65, where the number of clusters (or kNN queries performed, respectively) is only 17.9% of the number of kNN queries that must be performed when the clustering is not employed. Moreover, there are only 16 missed spectra. For t < 0.65, the number of spectra missed by clustering grows because there are fewer hits between the hypothetical and the query spectra. The ratio of identified to annotated spectra is still more than 95% because the sequential scan of protein sequence candidates is employed. For t > 0.65, the number of clusters increases (up to t = 0.75) and the number of missed spectra is almost zero. A disadvantage is that a high t may form clusters of spectra that do not come from the same peptide. In practice, the optimal t depends on the number of peaks in query spectra.
The optimal t may be higher than 0.65 when support for PTMs is implemented as described in [19]. The time of identification increases slightly with increasing t; this corresponds to the increasing number of clusters.

Table 5. Impact of distance threshold t on clustering (columns: t; Num. of clusters; Spectra missed; Clustering time [s]; Ratio of ident. spectra [%]; Time of ident. [s])

4 Conclusions

We have shown that the clustering of tandem mass spectra significantly improves the efficiency of the method for protein and peptide sequence identification based on non-metric similarity search in databases of protein sequences. When the NM-tree was employed with clustering, the search was more than 100× faster than the sequential scan without clustering, while the ratio of identified peptides was more than 90% in both cases. The first major premise for successful identification of peptide sequences with clustering is that query sets from at least two spectrometer runs are merged. The second major premise is that the sequential scan of protein sequence candidates is performed, because the search using the NM-tree is fast but approximate. Fulfilling both premises increases the number of identified peptide sequences and speeds up the search.

An important advantage of mass spectra preprocessing by clustering is its independence of the mass spectrometer that is used to capture the spectra. Since the mass spectrometer can generate spectra in many runs, a disadvantage may be the time complexity of the algorithm that is used to cluster the spectra. We use only a simple clustering algorithm with quadratic time complexity; thus the implementation of a more sophisticated clustering algorithm with a time complexity of, e.g., O(N log N) is a subject of our future work.

References

1. Alfassi, Z.B.: On the Normalization of a Mass Spectrum for Comparison of Two Spectra. Journal of the Am. Soc. for Mass Spec. 15(3) (2004)
2. Beer, I., Barnea, E., Ziv, T., Admon, A.: Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics 4 (2004)
3. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: VLDB (1997)
4. Dutta, D., Chen, T.: Speeding up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast Near Neighbor Search. Bioinf. 23(5) (2007)
5. Falkner, J.A., Falkner, J.W., Yocum, A.K., Andrews, P.C.: A spectral clustering approach to MS/MS identification of post-translational modifications. Journal of Proteome Research 7(11) (2008)
6. Flikka, K., et al.: Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics 6 (2006)
7. Flikka, K., et al.: Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics 7 (2007)
8. Frank, A.M., et al.: Clustering millions of tandem mass spectra. Journal of Proteome Research 7(1) (2008)
9. Hinneburg, A., Keim, D.A.: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In: Proc. of KDD 98 (1998)
10. Keller, A., et al.: Experimental Protein Mixture for Validating Tandem Mass Spectral Analysis. OMICS: A Journal of Integrative Biology 6(2) (2002)
11. Li, Y., et al.: Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing. Rapid Comm. Mass Spec. 24(6) (2010)
12. Liu, J., et al.: Methods for peptide identification by spectral comparison. Proteome Science 5(3) (2007)
13. Lu, B., Chen, T.: A Suffix Tree Approach to the Interpretation of Tandem Mass Spectra: Applications to Peptides of Non-specific Digestion and Post-translational Modifications. In: Bioinformatics. vol. 19, Suppl. 2 (2003)
14. Mao, R., Ramakrishnan, S.R., Nuckolls, G., Miranker, D.P.: An inverted index for mass spectra similarity query and comparison with a metric-space method: case study. In: SISAP 10 (2010)
15. MSDB
16. Nesvizhskii, A.I.: A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. Journal of Proteomics 73(11) (2010)
17. Nesvizhskii, A.I., et al.: Dynamic Spectrum Quality Assessment and Iterative Computational Analysis of Shotgun Proteomic Data. Molecular & Cellular Proteomics 5 (2006)
18. Novák, J., Hoksza, D.: Parametrised Hausdorff Distance as a Non-Metric Similarity Model for Tandem Mass Spectrometry. In: CEUR Proc. DATESO (2010)
19. Novák, J., Skopal, T., Hoksza, D., Lokoč, J.: Non-metric Similarity Search of Tandem Mass Spectra Including Posttranslational Modifications. Journal of Discrete Algorithms (2011)
20. Park, C.Y., et al.: Rapid and accurate peptide identification from tandem mass spectra. Journal of Proteome Research 7(7) (2008)
21. Pevzner, P.A., Mulyukov, Z., Dančík, V., Tang, C.L.: Efficiency of Database Search for Identification of Mutated and Modified Proteins via Mass Spectrometry. Genome Research 11(2) (2001)
22. Ramakrishnan, S.R., et al.: A Fast Coarse Filtering Method for Peptide Identification by Mass Spectrometry. Bioinformatics 22(12) (2006)
23. Renard, B.Y., et al.: When less can yield more - Computational preprocessing of MS/MS spectra for peptide identification. Proteomics 9 (2009)
24. Sadygov, R.G., et al.: Large-scale Database Searching Using Tandem Mass Spectra: Looking up the Answer in the Back of the Book. Nature Met. 1(3) (2004)
25. Salmi, J., Nyman, T.A., Nevalainen, O.S., Aittokallio, T.: Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics 9 (2009)
26. Skopal, T.: Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces. ACM Transactions on Database Systems 32(4), 29 (2007)
27. Skopal, T., Lokoč, J.: NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. In: DEXA 08 (2008)
28. Tabb, D.L., et al.: Similarity among Tandem Mass Spectra from Proteomic Experiments: Detection, Significance and Utility. Anal. Chem. 75(10) (2003)
29. Wang, J., et al.: Peptide identification from mixture tandem mass spectra. Molecular & Cellular Proteomics 9(7) (2010)
30. Xu, R., Wunsch, D.: Survey of Clustering Algorithms. IEEE Transactions on Neural Networks 16(3) (2005)
31. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer, USA (2006)
