PRIDE Cluster: building the consensus of proteomics data


Supplementary Materials

PRIDE Cluster: building the consensus of proteomics data

Johannes Griss, Joseph Michael Foster, Henning Hermjakob and Juan Antonio Vizcaíno

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Table of Contents

Supplementary Figures
Supplementary Table 1 - List of available spectral libraries
Supplementary Protocol
Supplementary Note 1 - Clustering algorithm
Supplementary Note 2 - PRIDE Spectra Clustering API
Supplementary Note 3 - Test datasets
Supplementary Note 4 - Assessing clustering quality and determining clustering thresholds
Supplementary Note 5 - Assessing PSM reliabilities
Supplementary Note 6 - Identifying incorrect annotations in PRIDE through PRIDE Cluster
Supplementary Note 7 - The PRIDE Cluster spectral libraries
Abbreviations
References

Supplementary Figures

Supplementary Figure 1. The clustering algorithm (MS-Cluster) described in Frank et al. [1] may lead to incorrect results in a highly heterogeneous environment such as the PRIDE repository.
A) Unidentified, low-resolution (= low-quality) spectra.
B) (Later) addition of high-quality, identified spectra.
C) Identification is extended to unidentified spectra while the cluster membership is not updated.
D) Incorrect identifications are generated.
Early experiments using low-resolution instruments record spectra that cannot be identified (a). Over time, higher-quality spectra that can be identified are recorded and submitted to the repository (in this example identifying two peptides: one green, one red) (b). When these new spectra are incrementally added to the clusters, one of the high-quality spectra is randomly chosen first and assigned to the existing cluster (in this example a green one). This shifts the cluster's consensus spectrum, and all green spectra are assigned to this cluster. The red spectra no longer fit the now shifted cluster and are assigned to a new, purely red cluster (c). As described by Frank et al., the identification from the high-quality spectra is attributed to all unidentified spectra in the cluster. Therefore, some spectra that should have been assigned to the red peptide (d) are misclassified as green ones. This problem will grow over time as (unidentified) low-quality clusters continue to shift when new high-quality spectra are added.

Supplementary Figure 2. The proportion of clusters that contained spectra identified as multiple different peptides proved to be too dataset dependent for a reliable assessment of clustering quality. Proportion of clusters that contained spectra identified as more than one peptide sequence. The data are shown for all test instances and all evaluated combinations of clustering settings (the number of iterations (N) and the similarity threshold (t)). The fraction of mixed clusters decreased with increasing clustering threshold. Even with the highest threshold this fraction was still considerably higher than the 1.5% reported by Frank et al. [1]. This higher proportion of mixed clusters is not indicative of lower clustering quality but an expected phenomenon. Clusters are able to group high- and low-quality spectra from the same peptide (as shown through the range of the spectra's precursor ion m/z values). If all of these spectra are identified with a database search engine, lower-quality spectra have a higher probability of being identified incorrectly. While the relative number of mixed clusters was comparable between the COPD and the CPTAC instance, the number of mixed clusters was considerably lower in the HUPO instance. We were not able to identify a conclusive cause for this difference. Nevertheless, these results indicate that the outcome of this analysis depends strongly on the analyzed test dataset. Furthermore, this might explain the difference to the results reported by Frank et al., who may have used more homogeneous test datasets for their analysis (data not shown in [1]).

Supplementary Figure 3. Larger clusters contained spectra attributed to fewer peptides than smaller clusters. Proportion of spectra that were identified as the most commonly identified peptide sequence within a cluster versus the cluster's size. Larger clusters were more likely to contain more spectra identified as a single peptide than smaller clusters. This finding was independent of the chosen clustering parameters. The data from the CPTAC instance differed slightly from the data from the HUPO and the COPD instance. This was most probably caused by the fact that the HUPO and the COPD instance had a comparable size while the CPTAC instance was considerably smaller. Nevertheless, these data show that clusters tend to become more reliable with increasing size. This observation coincides with the expectation that more frequently measured peptides are more likely to result in high-quality spectra. Once these high-quality spectra are present, the clusters become more stable and thereby more likely to only contain spectra from the same peptide. It can therefore be extrapolated that the larger a (clustered) repository gets, the more distinct the reliable clusters become, so that incorrect identifications of lower-quality spectra can be identified.

Supplementary Figure 4. SpectraST identified the consensus spectra from larger clusters more reliably than from smaller ones. Charts representing the SpectraST f-score for each cluster's consensus spectrum versus the cluster's size. These results indicate that larger clusters result in consensus spectra that are more similar to the consensus spectra from the corresponding NIST spectral library (see Supplementary Note 3). Larger clusters were more reliably identified than small clusters. This is in line with the observation presented in Supplementary Figure 3 that larger clusters were more likely to only contain one peptide identification and can thus be regarded as more reliable.

Supplementary Figure 5. The relative number of peptides in one cluster identified differently by the search engines used decreased with increasing threshold. Relative number of consensus spectra that were identified as different peptides by the search engines used. This analysis only considered the first-ranked identification from every search engine. We observed several cases where one of the search engines reported the same sequence as the others but ranked it second or third. Consensus spectra were classified as differently identified as soon as a single search engine identified a different peptide. Therefore, the numbers shown in this figure present the worst-case scenario. The relative number of consensus spectra with different results decreased with increasing clustering threshold in all test datasets. The decrease in spectra that led to different identifications was similar for every increase of the similarity threshold.

Supplementary Figure 6. The reliability of peptide identifications is assessed through their ratio within a cluster. The ratio is the proportion of spectra within a cluster that were identified as the same peptide.
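The ratio defined in this caption can be illustrated with a minimal sketch. It assumes that every spectrum in a cluster carries exactly one peptide sequence (in PRIDE a spectrum can carry several identifications), and the function name `peptide_ratios` is hypothetical and not part of PRIDE Cluster.

```python
from collections import Counter


def peptide_ratios(cluster_sequences):
    """Return {sequence: ratio} for the identifications of one cluster.

    cluster_sequences holds one peptide sequence per spectrum in the
    cluster (a simplifying assumption made for this sketch).
    """
    if not cluster_sequences:
        return {}
    counts = Counter(cluster_sequences)
    total = len(cluster_sequences)
    # ratio = spectra identified as this sequence / all spectra in the cluster
    return {sequence: n / total for sequence, n in counts.items()}


# Three spectra identified as one peptide, one spectrum as another.
example_cluster = ["IGGIGTVPVGR", "IGGIGTVPVGR", "IGGIGTVPVGR", "FLDGLYVSEK"]
ratios = peptide_ratios(example_cluster)
```

In this example the dominant sequence would receive a ratio of 0.75 and the minority sequence a ratio of 0.25.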

Supplementary Figure 7. The number of target and decoy identifications with low ratios was nearly identical, while almost exclusively target identifications had high ratios. Cumulative number of target and decoy identifications versus cluster ratio. These plots show the cumulative number of peptide spectrum matches (PSMs, y-axis) that identify a sequence with a ratio less than the given maximum ratio (x-axis). The plots show that there is an exponential relationship between the cumulative number of target PSMs and the ratio in all instances. The number of decoy sequences shows a similar relationship up to a certain ratio, above which no additional decoy sequences exist. This threshold seems to be dataset dependent but can easily be learned if a decoy search has been performed. Additionally, up to this point the number of target and decoy sequences is comparable, which is in accordance with the assumption that incorrect identifications are evenly distributed between the target and the decoy database [2]. Therefore, it seems safe to assume that identifications with a low ratio represent the incorrect identifications present in the search results.
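How such a dataset-dependent threshold might be read off a decoy search can be sketched as follows. This is an illustration only, not the procedure used for the figure: the input layout (pairs of ratio and decoy flag), the function names, and the choice to count identifications with a ratio up to and including the maximum are all assumptions.

```python
def cumulative_counts(identifications, max_ratio):
    """Count (targets, decoys) with a ratio of at most max_ratio.

    identifications is an iterable of (ratio, is_decoy) pairs; this
    layout is a hypothetical choice made for the sketch.
    """
    targets = sum(1 for r, is_decoy in identifications
                  if r <= max_ratio and not is_decoy)
    decoys = sum(1 for r, is_decoy in identifications
                 if r <= max_ratio and is_decoy)
    return targets, decoys


def decoy_ratio_threshold(identifications):
    """Return the highest ratio reached by any decoy identification.

    Above this ratio no further decoys are observed; None is returned
    when the input contains no decoy identifications.
    """
    decoy_ratios = [r for r, is_decoy in identifications if is_decoy]
    return max(decoy_ratios) if decoy_ratios else None


# Toy data: decoys only occur at low ratios.
toy = [(0.1, True), (0.1, False), (0.3, True), (0.3, False),
       (0.9, False), (1.0, False)]
threshold = decoy_ratio_threshold(toy)
```

On the toy data the threshold is the largest decoy ratio, and below it the target and decoy counts are equal, which mirrors the behaviour described in the caption.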


Supplementary Figure 8. PRIDE Cluster's role in the PRIDE ecosystem. Researchers submit their data to PRIDE in the PRIDE XML format created by PRIDE Converter. Experiments in PRIDE and generated PRIDE XML files can be viewed and assessed using PRIDE Inspector. PRIDE Cluster retrieves all identified spectra from PRIDE and clusters them using the publicly available PRIDE Spectra Clustering API. The clustering results are used to assess the reliability of identifications in PRIDE. These reliability assessments are visible in PRIDE and made available to external resources for further processing. Additionally, spectral libraries created from the clustering results are made available to the research community.

Supplementary Figure 9. The vast majority of clusters only contained spectra from a single species. Number of different species from which the spectra in a cluster originated. Still, there were several clusters that contained peptides from a considerable number of species. While most of the spectra in clusters with multiple species were identified as peptides from common contaminants (for instance, actin or keratin), some of the peptides (for example IGGIGTVPVGR from elongation factor 1-alpha or FLDGLYVSEK from 60S ribosomal protein L9) originated from proteins that are conserved across all represented species. Therefore, it seems inaccurate to classify spectra / peptides as contaminants only because they were identified in multiple species [1].

Supplementary Table 1 - List of available spectral libraries

Species | Number of Spectra
Homo sapiens (Human) | 81,428
Mus musculus (Mouse) | 53,376
Rattus norvegicus (Rat) | 17,624
Arabidopsis thaliana | 69,242
Drosophila melanogaster (Fruit fly) | 4,686
Saccharomyces cerevisiae (Baker's yeast) | 71,487
Salmonella typhimurium | 25,713

The spectral libraries created from PRIDE Cluster (version ). Libraries are made available as soon as they contain more than 4,000 spectra. All libraries are available online.

Supplementary Protocol

Clustering algorithm. We developed an adapted version of the MS-Cluster clustering algorithm presented by Frank et al. [1] that was optimized for clustering quality. The main differences to the MS-Cluster algorithm are that our algorithm can split clusters and always assigns spectra to the cluster with the most similar consensus spectrum. The methods to calculate spectral similarities and generate consensus spectra were not altered. Furthermore, methods that were only applicable to data from ion trap mass spectrometers were replaced by methods that can be used on data from any type of mass spectrometer. These changes considerably increased the execution time of the algorithm but were necessary to make the algorithm usable for heterogeneous datasets. As in the MS-Cluster algorithm, the spectra are sorted according to their quality before clustering. A spectrum's quality is estimated by roughly calculating its signal-to-noise ratio using the same method as the spectral library building component of SpectraST [3] (see Supplementary Note 1). Next, the algorithm iterates over all spectra and assigns them to the most similar cluster (each spectrum is only compared to the cluster's consensus spectrum). If no cluster exists that exceeds the defined similarity threshold, a new cluster is created that only contains this spectrum. Then, all clusters are compared with each other and similar clusters (based on the set similarity threshold) are merged. Finally, all spectra in all clusters are checked to determine whether they are still similar to the cluster's consensus spectrum. Spectra that are no longer similar to the respective consensus spectrum are returned to the pool of unclustered spectra. This process is repeated until either all spectra are assigned to clusters or a maximum of N iterations is reached. Thus, two parameters can be set when using this algorithm: the minimum similarity threshold t and the maximum number of iterations N.
Datasets and database search. We used three distinct datasets to test the performance of our algorithm. The COPD dataset (4,612,229 spectra) contains the data from a study by Steiling et al. [4] on the influence of smoking on proteomics profiles. The HUPO dataset (5,192,683 spectra) contains the data from the original HUPO Plasma Proteome Project [5] and experiments from the HUPO Brain Proteome Project [6] (pilot projects from labs 10, 12, and 13). The data for these two test datasets were downloaded in mzXML format from the PeptideAtlas [7] repository (for a detailed list of experiments see Table P.1).

PeptideAtlas accessions of the HUPO test dataset experiments: PAe000058, PAe000059, PAe000060, PAe000061, PAe000062, PAe000063, PAe000064, PAe000065, PAe000066, PAe000067, PAe000068, PAe000069, PAe000070, PAe000096, PAe000101, PAe000130, PAe000131, PAe000132, PAe000133, PAe000134, PAe000135, PAe000794, PAe000797, PAe000806, PAe000810, PAe000815, PAe000817, PAe000819, PAe000825, PAe000845, PAe000846, PAe000854, PAe000857, PAe000859, PAe000862, PAe000876, PAe001855, PAe001856, PAe001857, PAe001858, PAe001859, PAe001860, PAe001861, PAe001862, PAe001863, PAe001864, PAe001865, PAe

PeptideAtlas accessions of the COPD test dataset experiments: PAe000795, PAe000796, PAe000812, PAe000814, PAe000816, PAe000820, PAe000821, PAe000824, PAe000828, PAe000832, PAe000835, PAe000838, PAe000841, PAe000842, PAe000848, PAe000860, PAe000863, PAe000864, PAe000865, PAe000867, PAe000870, PAe000871, PAe000874, PAe000875, PAe000879, PAe

Table P.1: List of experiments downloaded from PeptideAtlas.

The CPTAC dataset (1,325,842 spectra) contains data from five instrument configurations from study 6 [8] conducted by the Clinical Proteomic Technology Assessment for Cancer (CPTAC). The data were downloaded from Tranche [9] and converted from Thermo Raw to MGF format using ProteoWizard [10]. We used the database search engines X!Tandem [11] (version ) and OMSSA [12] (version 2.1.8) to identify the spectra in the test datasets. The same settings were used for both search engines (see Supplementary Note 3). Carbamidomethylation was set as a fixed modification and methionine oxidation as a variable modification. All other parameters were left at their default settings. For X!Tandem, the refinement mode was turned off. Searches were performed against a randomized, concatenated target-decoy database. The score cut-off for identifications was set to an expectation value of 0.1 for both search engines (FDR of 5% in the COPD and HUPO instance and 1% in the CPTAC instance). For a detailed description of the test datasets and search procedure see Supplementary Note 3.

Clustering process. All spectra have to be kept in memory to be able to split clusters while clustering.
To still be able to process large datasets and to decrease the processing time, the spectra are split into groups based on their precursor ion's m/z. These groups of spectra can then be processed in parallel. To make sure that all spectra of a peptide are included in at least one clustering process, the groups overlap by half the precursor m/z window size. As a result, every spectrum is clustered twice, which leads to duplicate and highly similar clusters. Therefore, after the clustering process, identical clusters from two neighboring m/z regions are merged. We clustered all identified spectra (target and decoy identifications) from the three test instances. We used four similarity thresholds (0.5, 0.6, 0.7, and 0.8) with a maximum of 4 iterations, and additionally a similarity threshold of 0.7 with a maximum of 10 iterations. Therefore, every instance was clustered with five different settings. To assess the algorithm's robustness we used an extremely wide precursor m/z window size of 20 m/z units for all test instances. This confronts the algorithm with spectra from a large number of different peptides.
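The overlapping precursor m/z windows can be sketched as follows. This is an illustration under stated assumptions, not the PRIDE Cluster implementation: windows are assumed to start at 0 m/z, window number w is assumed to cover the half-open range from w * W/2 to w * W/2 + W, and the function name `assign_to_windows` is hypothetical.

```python
import math
from collections import defaultdict


def assign_to_windows(precursor_mzs, window_size=20.0):
    """Map window index -> list of spectrum indices.

    Neighbouring windows overlap by window_size / 2, so a spectrum
    falls into (at most) two windows.
    """
    step = window_size / 2.0
    windows = defaultdict(list)
    for index, mz in enumerate(precursor_mzs):
        last = math.floor(mz / step)    # last window that starts at or below mz
        for w in (last - 1, last):      # the previous window still covers mz
            if w >= 0:
                windows[w].append(index)
    return dict(windows)


# Three spectra with precursor m/z values of 405.0, 412.5 and 431.0.
windows = assign_to_windows([405.0, 412.5, 431.0])
```

With a window size of 20 m/z units, the first two spectra share the window that starts at 400 m/z, so they meet in at least one clustering process even though each of them is also placed in a different second window. Each window can then be clustered independently, and the duplicate clusters from neighbouring windows are merged afterwards.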

A detailed description of the clustering process and the processing of the test instances can be found in Supplementary Note 3.

Assessing clustering quality. To assess the clustering quality we searched the consensus spectra of all clusters from the three test datasets using five different search engines (X!Tandem, OMSSA 2.1.8, Mascot, Crux, and SpectraST [3] 4.0, standalone). This was possible as the number of consensus spectra was considerably smaller than the number of original input spectra. As five different search engines were used, we assume that no search engine specific bias was introduced. We used the same settings and protein sequence databases to search the consensus spectra as for the search of the individual spectra. Only for SpectraST were the search conditions different: the respective NIST spectral libraries and a precursor tolerance of 4 m/z units were used. For these results we did not specify any confidence thresholds for the identifications. A detailed description can be found in Supplementary Note 3.

Identifying reliable identifications. To identify reliable identifications we transformed the sequences identified within a cluster into property vectors. For every distinct sequence the following attributes were analyzed: the ratio (the proportion of spectra identified as this peptide sequence within a given cluster), its rank among the sequences identified in the cluster, the total size of the cluster, the search engine ratio (the ratio of search engines that identified the same sequence from the consensus spectrum), the search engine ratio of the higher ranked sequence in the cluster (if applicable), and whether it is a decoy sequence. We then used a machine learning approach to identify the property that separated target and decoy sequences (see Supplementary Note 5 for details).

Building PRIDE Cluster.
We retrieved all identified spectra from the PRIDE repository containing the data of over 9,040 different public proteomics experiments (June 2012). The only requirement for an experiment to be included was that the investigated species had to be reported. Identified spectra were not included if the m/z of the spectrum's precursor ion was missing. This resulted in a total of 20,666,123 identified spectra identifying 2,815,820 distinct peptides in 40 species. The spectra were clustered using a similarity threshold of 0.7 and a maximum of 4 iterations. The data was made accessible through a web based application. The results from the clustering process are stored in a MySQL database (see Figure P.1).

Figure P.1: The PRIDE Cluster database contains the complete results from clustering all public identified spectra available in PRIDE. The tables highlighted in green hold the summarized (meta-) data retrieved from the PRIDE repository. This data is not altered by the clustering process. The blue tables hold the actual results of the clustering process. The table clustering_method is only used for debugging purposes in case the process encounters any problems. The cluster table contains all clusters generated during the process, including each cluster's consensus spectrum stored as comma-delimited strings. The links to the actual data from the PRIDE repository are stored in the tables cluster_has_spectrum and cluster_has_peptide. These two links are necessary since one spectrum can potentially have multiple peptide identifications.

The server-side code was developed using the Perl programming language. Boxplots were generated using R (version ). The client side is a standard HTML web page and utilizes JavaScript for interactive components. The jQuery (version 1.7.1), jQuery UI (version ) and jQuery DataTables (version 1.9.0) JavaScript libraries were used.

Spectral libraries were generated from all reliable clusters. Their consensus spectra are made available in NIST's MSP data format. A cluster's consensus spectrum is added to a species-specific library as soon as it contains at least one (reliable) peptide identification from an experiment from the respective species.

Supplementary Note 1 - Clustering algorithm

The clustering algorithm used in this work is based on the MS-Cluster algorithm proposed by Frank et al. [1]. It was optimized to increase the quality of the generated clusters at the cost of reduced speed.

Spectrum normalization
Before any comparison the spectra's intensities are normalized. The peak intensities are normalized so that the total spectrum intensity (the sum of the intensities of all peaks) is 1,000. This method was not changed from the original algorithm [1].

Spectra similarity (normalized dot product)
The similarity between two spectra is assessed using the normalized dot product as described by Frank et al. [1] (no changes were made to this algorithm). For the comparison of two spectra only the k highest peaks are taken into consideration. k is calculated by dividing the precursor m/z by 50. Additionally, the peak intensities used for the comparison are transformed using 1 + LOG(I), where I is the peak's normalized intensity.
Algorithm:
1. Calculate k as described above.
2. Get the k highest peaks from both spectra S1 and S2.
3. Sort the peaks according to their m/z value.
4. Create the intensity vectors SV1 and SV2. NOTE: All added intensities are transformed using 1 + LOG(intensity).
5. Iterate over the S1 peaks:
   a. Add the intensity to the SV1 intensity array.
   b. Check whether the S2 peaks contain a peak with a comparable m/z (the closest within a range of 0.5 m/z units).
      i. If yes, add the peak with the closest m/z to the SV2 intensity array.
      ii. Otherwise, add 0.0 to the SV2 intensity array.
6. Add all S2 peak intensities that were not added yet to the SV2 intensity array, and add 0.0 to the SV1 intensity array for every peak added in this step.
7. Calculate the normalized dot product over the two intensity vectors SV1 and SV2:

\[ \mathrm{sim}(S_1, S_2) = \frac{\sum_i \mathrm{SV1}_i \cdot \mathrm{SV2}_i}{\sqrt{\sum_i \mathrm{SV1}_i^2 \cdot \sum_i \mathrm{SV2}_i^2}} \]

Equation 1.1 Formula to calculate the normalized dot-product.

Spectrum quality assessment (signal-to-noise ratio)
Frank et al.'s MS-Cluster algorithm uses a machine learning based rule set to assess the quality of MS/MS spectra. This set of rules is only applicable to ion trap data [1]. This method was replaced by the more basic method used in the spectral library building component of the spectral search engine SpectraST to roughly assess a spectrum's signal-to-noise ratio [3]. The advantage of this simpler approach is that it is applicable to spectra originating from virtually any mass spectrometer platform. A spectrum's signal-to-noise ratio (considered as its quality) is approximated by dividing the sum of the intensities (I) of the 2nd to 6th highest peaks by the median intensity of all peaks:

\[ \mathrm{quality} = \frac{\sum_{i=2}^{6} I_{(i)}}{\mathrm{median}(I)} \]

where I_{(i)} is the intensity of the i-th highest peak.

Equation 1.1 Formula used to calculate a spectrum's quality (i.e. its signal-to-noise ratio).

Consensus spectrum building
The algorithm used to build consensus spectra is the same as the one used by Frank et al. [1] and originally described in [15]. The final m/z threshold is set to 0.4 m/z units, starting from 0.1 m/z units and increasing in steps of 0.1 m/z units.
Algorithm: Every peak in the consensus spectrum stores the m/z value, the intensity value, and the number of spectra in which the peak was observed. Since the total number of spectra contributing to the consensus spectrum is known, a peak's probability of being observed in a spectrum can be calculated.
1. Add all peaks from all spectra to the consensus spectrum (CS). If two peaks have an identical m/z value, add the intensities and increment the count of how often the peak was observed.
2. Merge identical peaks.
   a. Start at a tolerance of 0.1 m/z units and increment by 0.1 m/z units until 0.4 m/z units are reached.
   b. Merge peaks within the tolerance. Use the weighted average m/z (weighted based on the peaks' intensities) as the new m/z.
3. Adapt the peak intensities based on how often the peaks were observed (Pi), where I is the peak's intensity and Pi is the probability that the peak is detected in a spectrum. This formula multiplies the observed intensity by a factor that depends on Pi.
4. Filter the consensus spectrum:
   a. Keep only the top 5 peaks within every window of 100 m/z units.
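The normalization, similarity, and quality measures described in this note can be sketched in a few lines. This is a minimal illustration and not the code of the PRIDE Spectra Clustering API. Several details are assumptions: LOG is taken to be the natural logarithm, each S2 peak is matched at most once, the precursor m/z used to derive k is passed in explicitly, spectra with fewer than six peaks receive a quality of 0, and all function names are hypothetical. Peaks are (m/z, intensity) pairs with positive intensities.

```python
import math
from statistics import median


def normalize(peaks, total=1000.0):
    """Scale intensities so that they sum to `total` (1,000 in the text)."""
    intensity_sum = sum(i for _, i in peaks)
    return [(mz, i * total / intensity_sum) for mz, i in peaks]


def top_peaks(peaks, k):
    """Return the k most intense peaks, sorted by m/z."""
    return sorted(sorted(peaks, key=lambda p: p[1], reverse=True)[:k])


def normalized_dot_product(peaks1, peaks2, precursor_mz, tolerance=0.5):
    """Normalized dot product following steps 1-7 of the similarity algorithm."""
    k = max(1, int(precursor_mz / 50))          # step 1
    s1, s2 = top_peaks(peaks1, k), top_peaks(peaks2, k)   # steps 2 and 3

    def transform(intensity):
        return 1.0 + math.log(intensity)        # assumption: natural log

    sv1, sv2, used = [], [], set()
    for mz, intensity in s1:                    # step 5
        sv1.append(transform(intensity))
        candidates = [(abs(mz - mz2), j) for j, (mz2, _) in enumerate(s2)
                      if j not in used and abs(mz - mz2) <= tolerance]
        if candidates:
            _, j = min(candidates)              # closest S2 peak in range
            used.add(j)
            sv2.append(transform(s2[j][1]))
        else:
            sv2.append(0.0)
    for j, (_, intensity) in enumerate(s2):     # step 6: unmatched S2 peaks
        if j not in used:
            sv1.append(0.0)
            sv2.append(transform(intensity))
    dot = sum(a * b for a, b in zip(sv1, sv2))  # step 7
    norm = math.sqrt(sum(a * a for a in sv1) * sum(b * b for b in sv2))
    return dot / norm if norm > 0 else 0.0


def spectrum_quality(peaks):
    """Sum of the 2nd to 6th highest intensities divided by the median."""
    intensities = sorted((i for _, i in peaks), reverse=True)
    if len(intensities) < 6:                    # assumption: too few peaks
        return 0.0
    med = median(intensities)
    return sum(intensities[1:6]) / med if med > 0 else 0.0
```

Two identical spectra yield a similarity of 1, and two spectra without any peaks within 0.5 m/z units of each other yield 0. Skipping the most intense peak in the quality estimate keeps a single dominant peak from inflating the signal-to-noise value.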
Spectra clustering
As mentioned above, the clustering algorithm was adapted to improve the quality of the generated clusters. One major difference to the MS-Cluster algorithm [1] is that clusters can be split if new spectra are added to the cluster (see step 4 below). Additionally, spectra are not added to the first fitting cluster (the first cluster with a similarity above the set threshold t) but to the best fitting cluster (the one with the highest similarity, see step 2). The algorithm depends on two variables: the similarity threshold t, defining how similar spectra must be to be clustered together, and the maximum number of iterations N to optimize the clustering result.
Algorithm:
1. Sort spectra: Spectra are sorted according to their estimated quality (see the spectrum quality assessment above).
2. Cluster spectra: Iterate over all spectra and compare every spectrum against the consensus spectrum of every cluster.
   a. IF the similarity is above the threshold t, the similarity is stored, and at the end of the comparison the spectrum is added to the cluster with the highest similarity.
   b. ELSE the spectrum is not similar to any consensus spectrum and a new cluster containing only this spectrum is generated.
3. Merge clusters: The consensus spectra of all clusters are compared with each other.
   a. IF a cluster's consensus spectrum is similar (above the threshold t) to another cluster's consensus spectrum, the clusters are merged.
4. Remove non-fitting spectra: Every cluster is checked to determine whether all of its spectra are still similar (above the threshold t) to the cluster's consensus spectrum.
   a. IF a spectrum is no longer similar to the cluster's consensus spectrum, the spectrum is removed from the cluster and a new cluster containing only this spectrum is created.
5. Go to step 2 UNTIL all spectra fit their cluster or a maximum of N iterations is reached.
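The control flow of steps 1 to 5 can be sketched as follows. This is an illustration and not the FrankEtAlClustering implementation: the similarity, consensus, and quality functions are passed in as parameters, spectra removed in step 4 are returned to a pool and reassigned in the next round (as described in the Supplementary Protocol), and the vector-based stand-ins `cosine`, `mean_consensus`, and `max_intensity` exist only to make the example self-contained.

```python
import math


def cluster_spectra(spectra, similarity, consensus, quality,
                    threshold=0.7, max_rounds=4):
    """Iterative clustering with best-fit assignment, merging and splitting."""
    pool = sorted(spectra, key=quality, reverse=True)       # step 1
    clusters = []                                           # lists of spectra
    for _ in range(max_rounds):
        for spectrum in pool:                               # step 2
            best, best_sim = None, threshold
            for cluster in clusters:
                sim = similarity(spectrum, consensus(cluster))
                if sim > best_sim:                          # keep the best fit
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append([spectrum])                 # new cluster
            else:
                best.append(spectrum)
        pool = []
        merged = []                                         # step 3
        for cluster in clusters:
            for target in merged:
                if similarity(consensus(cluster), consensus(target)) > threshold:
                    target.extend(cluster)
                    break
            else:
                merged.append(cluster)
        clusters = merged
        for cluster in clusters:                            # step 4
            cons = consensus(cluster)
            fitting = []
            for spectrum in cluster:
                keep = similarity(spectrum, cons) > threshold
                (fitting if keep else pool).append(spectrum)
            cluster[:] = fitting
        clusters = [c for c in clusters if c]
        if not pool:                                        # step 5
            break
    clusters.extend([s] for s in pool)  # leftovers become singleton clusters
    return clusters


# Stand-ins for illustration only: spectra are plain intensity vectors.
def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm > 0 else 0.0


def mean_consensus(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def max_intensity(spectrum):
    return max(spectrum)
```

Called with four toy vectors of which two pairs resemble each other, the function returns two clusters of two spectra each. Because every spectrum is compared against all consensus spectra in every round, the run time grows with the number of clusters, which is consistent with the trade-off between quality and speed noted in the text.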

Supplementary Note 2 - PRIDE Spectra Clustering API

The PRIDE Spectra Clustering API is a Java API and was used to create the results presented in this manuscript. It is available online as well as directly through the EBI Maven repository (uk.ac.ebi.pride.tools:pride-spectra-clustering 1.0). The project's homepage contains a step-by-step tutorial on how to use the PRIDE Spectra Clustering API to cluster spectra as it was done in this work.

The API's central points of entry are the implementations of the SpectraClustering interface. Currently, only the FrankEtAlClustering class is available, which is the adapted version of Frank et al.'s [1] original algorithm used in this work. We tried to represent every step of the clustering process through Java interfaces. We therefore hope that this API is a good basis for developing and testing new methods to cluster MS/MS spectra.

The SpectraClustering interface contains two methods that influence the algorithm's performance:
- setSimilarityThreshold: Sets the similarity threshold required to cluster spectra together.
- setClusteringRounds: Sets the maximum number of iterations (N) to optimize the clustering results.

The function clusterSpectra is then used to cluster a List of Spectrum. The Spectrum interface used in the PRIDE Spectra Clustering API is the same as the one used in the jmzReader API [16]. Therefore, spectra can simply be read from an MS data file using jmzReader and then directly clustered using the PRIDE Spectra Clustering API. clusterSpectra returns a List of SpectraCluster that represents the clustering result. Every SpectraCluster contains properties that describe the cluster (for example the cluster's average precursor m/z) as well as a List of Spectrum that holds all spectra that were added to this cluster.


Supplementary Note 3 - Test datasets

Figure 3.1 Process used to develop and assess the performance of the clustering algorithm.

We used three distinct datasets to test our algorithm (see Figure 3.1).

The COPD dataset represents data acquired during a study by Steiling et al. [4]. They analyzed the proteomics and transcriptomics profiles of the bronchial airway epithelium of current and never smokers. The data were collected using 1D-PAGE and an LTQ ProteomeX ion trap (ThermoFinnigan, Waltham, MA). The data were downloaded in mzXML format from the PeptideAtlas [7] repository and consisted of 26 experiments with 4,612,229 spectra. This dataset, in our opinion, represents a standard, state-of-the-art proteomics experiment.

The HUPO dataset consists of data from various experiments conducted during two projects of the Human Proteome Organization (HUPO). These are the experiments from the original HUPO Plasma Proteome Project (PPP) [5] as well as experiments from the HUPO Brain Proteome Project (HBPP) [6] (pilot projects from labs 10, 12, and 13). The data were downloaded in mzXML format from the PeptideAtlas [7] repository and consisted of 48 experiments with 5,192,683 spectra. These data are very heterogeneous and were generated in the early days of large-scale proteomics research. Therefore, they seem to be a representative sample of the legacy data found in current public proteomics repositories.

The CPTAC dataset contains data from study 6 conducted by the Clinical Proteomic Technology Assessment for Cancer (CPTAC) [8]. All replicates from five instrument configurations / labs were included in our analysis: the LTQ2@95, LTQ@73, LTQ Orbitrap@86, LTQ XL OrbitrapP@65 and the LTQ XLx@65. The data were downloaded from Tranche [9] and converted to MGF format using ProteoWizard [10]. This resulted in a total of 71 experiments (1 experiment per replicate) with 1,325,842 spectra. Four files could not be converted using ProteoWizard as only corrupted versions were present in Tranche. These were excluded from the analysis (1 replicate LTQ@73 6A, 1 replicate LTQ XL OrbitrapP@65 6B, 2 replicates LTQ@73 6C). All of these experiments were performed in highly controlled environments following strict standard operating procedures. Therefore, this dataset can be seen as a collection of highly reproducible and robust experiments.

The detailed experiment list for the COPD and the HUPO instances can be found in Table P.1 of the Supplementary Protocol.

Search
All experiments were searched using X!Tandem [11] (Cyclone version ) and OMSSA [12] (see Table 3.1). Two search engines were used to prevent any bias towards the algorithm of a single search engine and to replicate the now common approach of searching experiments with multiple search engines. The search settings were as follows: the precursor tolerance was set to 2.0 Da and the fragment tolerance to 0.6 Da. Charge states between 2 and 4 as well as up to 2 missed cleavages were allowed (the enzyme was set to trypsin). Carbamidomethylation was set as a fixed modification and methionine oxidation as a variable modification. Refinement was disabled in X!Tandem while all other settings were left at their default values.
The search was performed against a concatenated target and (randomized) decoy database generated using the decoy.pl script from Matrix Science (UK). The COPD and the HUPO instances were searched against the UniProtKB human complete proteome set (version ). The CPTAC instance was searched against the UniProtKB yeast complete proteome set (version ) as well as the Universal Protein Standard (UPS) 1 and UPS 2 fasta library from Sigma-Aldrich. A maximum expectation threshold of 0.1 was used for both search engines, X!Tandem and OMSSA. All peptide-spectrum matches (PSMs) (i.e. all identified spectra and the associated peptide identification data) were stored in a database for further analysis.

Dataset    X!Tandem (PSMs Decoy / Target)    OMSSA (PSMs Decoy / Target)
COPD       5,155 / 350,477                   38,786 / 444,673
HUPO       10,468 / 183,090                  7,559 / 159,926
CPTAC      1,109 / 119,094                   2,042 / 119,657

Table 3.1. Number of decoy and target PSMs per search engine and dataset.

Clustering

All identified spectra of each test instance were clustered using the algorithm described in Supplementary Note 1. The clustering process was repeated several times using different thresholds t and iterations N (see Table 3.2). Clusters with only one spectrum were discarded (COPD: 4,807 clusters, HUPO: 19,569 clusters, CPTAC: 7,270 clusters).

Threshold (t) Maximum Iterations (N) COPD HUPO CPTAC ,578 39, ,383 48,589 23, ,676 61,906 33, ,857 83,394 53, ,688 83,314 53, , ,193 88,114

Table 3.2. Clustering settings used for the various instances and the number of generated clusters per instance.

Since clusters can be split and spectra hence reassigned to new clusters, all spectra have to be kept in memory. Therefore, only spectra with a similar precursor ion m/z ratio were clustered at once. The window size W used to select spectra was set to 20 m/z units. This extremely wide window was chosen to thoroughly test the algorithm's accuracy under extreme conditions and is considerably wider than the window of 2 m/z used by Frank et al. 1 Neighbouring windows overlapped by W/2 m/z units. This ensured that all spectra originating from the same peptide were present together in at least one clustering process.

Because every spectrum was clustered in two clustering processes, most spectra are part of two clusters. This generates duplicate or highly similar clusters. Therefore, these clusters from neighbouring windows are merged after the clustering. Two thresholds had to be defined to identify these similar clusters: the maximum difference between the average precursor m/z of the clusters and the required minimum similarity between the clusters' consensus spectra.
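The overlapping-window scheme can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the function names are hypothetical, and the windows are assumed to start at the lowest precursor m/z of interest and to advance in steps of W/2.

```python
def window_starts(min_mz, max_mz, width=20.0):
    """Start positions of windows of the given width that overlap by width / 2."""
    step = width / 2
    starts = []
    start = min_mz
    while start < max_mz:
        starts.append(start)
        start += step
    return starts


def windows_for(precursor_mz, starts, width=20.0):
    """Start positions of all windows that contain the given precursor m/z."""
    return [s for s in starts if s <= precursor_mz < s + width]
```

With W = 20, a spectrum at 425.3 m/z falls into the windows starting at 410 and 420, so it takes part in two clustering processes. Two spectra whose precursor m/z values differ by less than W/2 always share at least one window in this scheme.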
A set of 3,600 neighbouring clusters was analyzed using a grid expansion on both variables: thresholds between and precursor m/z tolerances between 0 and 1 m/z units. Clusters whose average precursor m/z values are more than 0.5 m/z units apart are highly unlikely to be identical (see Figure 3.1). Therefore, the ideal similarity threshold was defined as the lowest similarity threshold for which the number of merged clusters did not increase when the tolerance was raised from 0.5 to 1 m/z units. The allowed precursor m/z tolerance was then defined as the lowest tolerance at which this final maximum number of merged clusters was reached. This resulted in a minimal similarity threshold of 0.97 and a maximum precursor m/z tolerance of 0.35 m/z units.

Figure 3.1. Results from the grid expansion of the required similarity threshold and the maximum precursor m/z difference to identify identical clusters. Only the 9 highest thresholds used are shown as these proved to be best suited.

Searching consensus spectra

The consensus spectra of all clusters were searched using five different search engines: X!Tandem 11 (version ), OMSSA 12 (version 2.1.8), Mascot 13 (version 2.3), Crux 14 (version 1.37), and SpectraST 3 (version 4.0, standalone). This allowed us to additionally assess the quality of the consensus spectrum building algorithm as well as the clustering accuracy. Apart from SpectraST, the exact same databases were used for the search of the consensus spectra as were used for the search of the individual spectra. For SpectraST, the NIST ion trap human spectral library (version ) was used for the COPD and the HUPO instances. The CPTAC instance was searched against a concatenated spectral library of the NIST ion trap yeast spectral library ( ) and the NIST ion trap UPS 1 spectral library ( ). All search engines, apart from SpectraST, used the same settings as for the search of the individual spectra (precursor tolerance of 2.0 Da, fragment tolerance of 0.6 Da, maximum 2 missed cleavages, enzyme trypsin, charge states 2-4). For SpectraST, the precursor tolerance was set to 4 m/z units.
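The criterion derived earlier in this note for merging duplicate clusters from neighbouring windows can be illustrated with a short sketch. The two thresholds (0.97 and 0.35 m/z units) are taken from the text. Everything else is an assumption made for illustration: the `Cluster` record is hypothetical, consensus spectra are represented as a mapping from peak bin to intensity, and similarity is computed as a normalized dot product, which may differ from the measure defined in Supplementary Note 1.

```python
import math
from dataclasses import dataclass

MIN_SIMILARITY = 0.97      # minimal similarity threshold (from the text)
MAX_PRECURSOR_DIFF = 0.35  # maximum precursor m/z tolerance in m/z units (from the text)


@dataclass
class Cluster:
    """Hypothetical cluster record used only for this illustration."""
    avg_precursor_mz: float
    consensus: dict  # peak bin -> intensity (assumed representation)


def normalized_dot_product(spec_a, spec_b):
    """Cosine-style similarity between two binned spectra (assumed measure)."""
    shared_bins = set(spec_a) & set(spec_b)
    dot = sum(spec_a[b] * spec_b[b] for b in shared_bins)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_merge(a, b, min_similarity=MIN_SIMILARITY,
                 max_precursor_diff=MAX_PRECURSOR_DIFF):
    """True if two clusters from neighbouring windows satisfy both thresholds."""
    close = abs(a.avg_precursor_mz - b.avg_precursor_mz) <= max_precursor_diff
    similar = normalized_dot_product(a.consensus, b.consensus) >= min_similarity
    return close and similar
```

Both conditions have to hold: two clusters with near-identical consensus spectra are still kept apart when their average precursor m/z values differ by more than the tolerance.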

Supplementary Note 4 - Assessing clustering quality and determining clustering thresholds

Frank et al. estimated the clustering quality by checking how many clusters contained spectra identified as different peptides (using the search engine InsPecT) 1. For their test they used spectra attributed to a given peptide identification and added an equal number of other randomly selected decoy spectra that had a different identification and a precursor m/z at least 8 m/z units apart from the target peptide's precursor (see Supplementary Note 3 in 1). In early tests of our algorithm, we replicated this approach and got comparable results (data not shown). When we investigated these results we found that spectra with different precursor m/z are basically never clustered together.

We then adapted this approach to use a random selection of spectra from different peptide identifications but with a similar precursor m/z. As expected, the measured clustering accuracy decreased, as spectra with similar precursor m/z are more likely to be similar (also because the underlying peptides were more similar). Additionally, the results from this approach greatly depended on the sampled decoy spectra (data not shown). For the same set of target spectra, the proportion of spectra with different peptide identifications clustered together ranged between 0-10% depending on the (randomly) chosen set of decoy spectra (for 1,000 randomly chosen target sequences the clustering process was repeated 100 times, i.e. choosing 100 different random sets of decoy spectra, data not shown).

To circumvent the problem of random sampling, we adapted the whole clustering approach to cluster all spectra within a given precursor m/z range. This m/z range was wide enough to include all spectra of a given peptide identification as well as a considerable number of spectra from other peptides to thoroughly test the algorithm's robustness. This range was set to 20 m/z units for all test instances.
We then assessed the clustering quality by examining the precursor m/z range of the spectra included in a cluster. The clustering algorithm is blind to this attribute and thus not biased by it. While the identification data from search engines is always incorrect in a certain proportion of cases, the precursor m/z is directly measured by the instrument and is a physical property of the measured analyte. It can therefore be regarded as a reliable, independent measurement of a spectrum's origin. We believe that this approach is considerably more stringent than the approach used by Frank et al. to assess the algorithm's performance.
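A minimal sketch of this quality measure is shown below. It assumes that each cluster is available as a list of the precursor m/z values of its member spectra; the function names and the input layout are hypothetical.

```python
def precursor_mz_range(precursor_mzs):
    """Difference between the highest and lowest precursor m/z in one cluster."""
    return max(precursor_mzs) - min(precursor_mzs)


def cluster_ranges(clusters):
    """Map each cluster id to the precursor m/z range of its spectra."""
    return {cid: precursor_mz_range(mzs) for cid, mzs in clusters.items()}


def fraction_below(ranges, limit):
    """Fraction of clusters whose precursor m/z range is below the given limit."""
    if not ranges:
        return 0.0
    return sum(1 for r in ranges.values() if r < limit) / len(ranges)
```

The per-cluster ranges are the quantity whose distribution is compared between datasets and clustering settings; `fraction_below` summarizes how many clusters stay under a chosen range limit.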

Figure 4.1. Boxplot of the clusters' precursor m/z ranges per dataset and clustering method (i.e. method settings). The limited discriminative power of the dot-product can be seen by the fact that the ranges decrease from larger (COPD instance) to smaller test sets (CPTAC instance). There was no significant difference in clustering accuracy when using a maximum of 4 iterations (N) compared to a maximum of 10 iterations (tested for t = 0.7) (Wilcoxon rank-sum test (data not normally distributed), COPD instance p = , HUPO instance p = , CPTAC instance p = ). There was a significant increase in clustering accuracy, indicated by decreasing cluster ranges, for every threshold increase in every instance (Wilcoxon rank-sum test, p < 2.2e-16 for every comparison of neighbouring thresholds).

The analysis of the precursor m/z ranges of the spectra in all clusters from all test datasets as well as the PRIDE Cluster database (Figure 4.1 and Figure 1b in the main manuscript) clearly shows that the clustering algorithm is highly stable irrespective of the test dataset. The vast majority of cluster ranges were below 2 m/z units, which can be explained by normal variations in MS data. These results indicate that the clustering algorithm reliably clusters spectra originating from the same chemical compound irrespective of the underlying dataset.

We manually investigated several of the clusters with high m/z ranges from the COPD instance clustered with a threshold t = 0.7 and a maximum of 4 iterations (N). Three clusters had an m/z range greater than 10 m/z units. The first contained 1,988 spectra identified as EFTPPVQAAYQK and three spectra attributed to three other sequences (Figure 4.2a). All five search engines used to search the consensus spectrum also identified EFTPPVQAAYQK with high confidence. The group of spectra with a precursor m/z below 682 m/z were also identified as EFTPPVQAAYQK but with a dehydration (mass shift of 18 Da) on the first amino acid.

The second cluster (Figure 4.2b) contained 6 out of 7 spectra identified as QFPFLASIQNQGR. The spectra with a precursor m/z around 745 m/z were also identified as QFPFLASIQNQGR but with a deamination (mass shift of Da) on the first residue. The fact that the clustering algorithm grouped these spectra together is not surprising as only the k highest peaks are used for the comparison (Supplementary Note 1). The main peaks used for the comparison were not affected by the modification. Another cluster contained 777 spectra identified as QISNLQQSISDAEQR. The two spectra that had outlying precursor m/z again contained deaminated residues (Figure 4.2c). These examples show that the outliers found are caused by highly similar peptides that cannot be distinguished using the algorithm presented here. Nevertheless, the spectra all originated from the same peptide sequence, which, given the amount of data processed, can be seen as a very good result.

Figure 4.2 (panels a, b, c). Precursor ion m/z distribution of the spectra from three clusters from the COPD instance (maximum number of iterations N=4, similarity threshold T=0.7). These clusters showed high precursor m/z ranges and were investigated manually. The outliers were all caused by peptides with a deamination on one amino acid that was not recognized by the clustering algorithm.

When we investigated clusters generated with clustering similarity thresholds below 0.7, the clusters with wide precursor m/z ranges did contain spectra attributed to various peptide identifications. The consensus spectra from these clusters could then no longer be confidently identified by any of the five search engines used. Therefore, it seems most likely that spectra from different peptide species were clustered together when using these lower thresholds. The threshold t of 0.7 was the first threshold to produce reliable results in all test instances.

Supplementary Note 5 - Assessing PSM reliabilities

We assessed the feasibility of identifying incorrect identifications by checking whether it was possible to identify the decoy identifications within the test instances' search results. To assess the reliability of the PSMs in a cluster, the distinct sequences identified within a cluster were stored as property vectors. Modifications were not taken into account as the spectra's precursor ion m/z values were close enough to indicate that the masses of the identified peptides were basically identical. Every sequence was represented using a vector of properties: the sequence's ratio (the proportion of spectra identified as this sequence within the cluster), its rank among the sequences identified in the cluster, the total size of the cluster, the ratio of search engines that identified the same sequence from the consensus spectrum, the search engine ratio (see before) of the previous (higher ranked) sequence and whether it is a decoy sequence.

A machine learning approach was followed using the Waikato Environment for Knowledge Analysis (WEKA) version . To learn the rules to classify target and decoy sequences the Conjunctive Rule Learner was used with the following parameters: 3 folds, minimum total weight 2.0, number of antecedents -1, seed 1. This approach was chosen since it is one of the simplest machine learning algorithms and produces simple, human-understandable results. More sophisticated algorithms will most probably return better results but are more likely to overfit the data. The data from the COPD instance was used as training set with 10-fold cross-validation and the data from the HUPO instance as independent test set. Irrespective of the clustering settings used and the combination of attributes, the algorithm only chose the (distinct) peptide sequence's ratio within the cluster to distinguish between target and decoy identifications (a ratio > to classify a sequence as non-decoy, i.e. a correct identification, Figure 5.1).
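The sequence ratio and rank that enter the property vector can be computed as in the following sketch. The function names and the input layout (one peptide sequence per identified spectrum of a cluster) are hypothetical. The ratio cut-off learned by the rule learner is not reproduced here, so it is passed in as a parameter rather than hard-coded.

```python
from collections import Counter


def sequence_ratios(cluster_sequences):
    """Return (sequence, ratio, rank) for each distinct sequence in a cluster.

    cluster_sequences holds one peptide sequence per identified spectrum.
    The ratio is the proportion of spectra identified as that sequence;
    rank 1 is the most frequent sequence.
    """
    total = len(cluster_sequences)
    ranked = Counter(cluster_sequences).most_common()
    return [(seq, count / total, rank)
            for rank, (seq, count) in enumerate(ranked, start=1)]


def classify_by_ratio(ratio, cutoff):
    """Single-attribute rule: ratios above the cut-off are treated as target."""
    return "target" if ratio > cutoff else "decoy"
```

For a cluster of ten spectra in which eight were identified as one sequence and two as another, `sequence_ratios` returns ratios of 0.8 and 0.2 with ranks 1 and 2.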

Figure 5.1. Distribution of ratios between target and decoy sequences. An identification's ratio (the relative number of spectra within a cluster that were identified as this sequence) was identified as the most suited attribute to classify target and decoy identifications. The ratios of target sequences were significantly higher than the ratios of decoy sequences (Wilcoxon rank-sum test p < 2.2e-16 for all comparisons). This data clearly shows the feasibility of classifying target and decoy (i.e. incorrect) identifications based on the ratio alone. A similarity threshold of 0.7 was the first one where both distributions were clearly separated in all test datasets.

While the results presented in Supplementary Figure 6 are comparable for the COPD and the HUPO instances, there is a distinct difference to the results from the CPTAC instance. In terms of size the HUPO and the CPTAC instances are comparable. However, the main difference between the COPD and the HUPO instances on the one hand and the CPTAC instance on the other is the peptide false discovery rate (FDR) observed in the identifications. We calculated the FDR of an instance by merging the results of both search engines and used the formula proposed by Elias et al. 2: Based on this formula the COPD and HUPO instances have an FDR of 5% while the CPTAC instance has a considerably lower FDR of 1%. This suggests that the ratio cut-off to separate incorrect and correct identifications most probably depends on the peptide FDR found in the analyzed dataset. As the true FDR is unknown in the PRIDE repository, we selected a minimum similarity threshold (t) of 0.7 to classify identifications as reliable in PRIDE. This threshold is well above the threshold suggested by the results from the test datasets and can therefore be regarded as a very conservative assessment.

As this ratio is likely to randomly occur in very small clusters, we analyzed the size distribution of clusters where the most frequent peptide identification was from the decoy database and constituted at least 70% of the peptide identifications within the cluster (Figure 5.2). Based on this analysis we defined a minimum cluster size of 10 spectra as it was able to identify more than 85% of all validated decoy clusters within all three test datasets. This seems to be a good balance between decreasing the number of incorrect identifications and removing correct ones.

Figure 5.2. Size distribution of clusters whose primary peptide identification was from the decoy database and constituted at least 70% of all identifications in the cluster (N = maximum number of iterations, T = similarity threshold).

Non-random incorrect identifications

When analyzing clusters where the most frequent peptide identification was from the decoy database we observed that some of these clusters contained up to 1,000 spectra from multiple experiments. In several cases, all of these spectra were identified as the same decoy sequence (see outliers in Figure 5.2). It is highly improbable that these spectra were generated by chance or that these spectra originate from a chemical compound other than a peptide. Therefore, a possible hypothesis is that these spectra originated from peptides that are not present in the sequence database used. These spectra lead to stable incorrect identifications and represent a systematic error in the search results. Therefore, they violate the assumption that incorrect identifications are evenly distributed among the target and the decoy database 2. Furthermore, several validation tools that take into consideration the number of search engines and replicates that identified a given peptide sequence (for example iProphet 18) will classify these identifications as highly reliable. Even though the algorithm presented here cannot identify the origin of these spectra and will also incorrectly label these identifications as reliable, it can identify the presence of such non-random incorrect identifications when decoy search results are present. Such results can then be used to investigate whether the search database used might have to be extended.

In an effort to identify the true origin of these spectra we searched the consensus spectra of all clusters that contained equal to or more than 50% decoy PSMs against the common Repository of Adventitious Proteins (cRAP, version ) using X!Tandem and OMSSA. This database contains proteins of three general classes: (i) common laboratory proteins; (ii) proteins added by accident through dust or physical contact; and (iii) proteins used as molecular weight or mass spectrometry quantitation standards. The search settings were identical to those used before (see Supplementary Note 3). For both search engines only identifications with an expectation value lower than 0.1 were accepted. As can be seen from the number of identifications presented in Table 5.1, only 10% - 20% of these consensus spectra were successfully identified. The number of spectra that were identified as the same peptide by both search engines was even lower.
Therefore, even though some of these clusters seem to represent peptides that originate from common contaminants, the vast majority seems to originate from peptides that were present neither in the search database used nor in the cRAP database.

Instance N = 4, t = 0.6 N = 4, t = 0.7 N = 4, t = 0.8 Total Ident. Same Total Ident. Same Total Ident. Same COPD 3, , ,175 1,648 1,139 HUPO 2, , , CPTAC

Table 5.1. Number of identified ("Ident.") together with the number of equal identifications ("Same") by both search engines against the cRAP database compared to the total number of clusters ("Total") that contained equal
