PRIDE Cluster: building the consensus of proteomics data


Supplementary Materials

PRIDE Cluster: building the consensus of proteomics data

Johannes Griss, Joseph Michael Foster, Henning Hermjakob and Juan Antonio Vizcaíno

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Table of Contents

Supplementary Figures
Supplementary Table 1 - List of available spectral libraries
Supplementary Protocol
Supplementary Note 1 - Clustering algorithm
Supplementary Note 2 - PRIDE Spectra Clustering API
Supplementary Note 3 - Test datasets
Supplementary Note 4 - Assessing clustering quality and determining clustering thresholds
Supplementary Note 5 - Assessing PSM reliabilities
Supplementary Note 6 - Identifying incorrect annotations in PRIDE through PRIDE Cluster
Supplementary Note 7 - The PRIDE Cluster spectral libraries
Abbreviations
References

Supplementary Figures

Supplementary Figure 1. The clustering algorithm (MS-Cluster) described in Frank et al. 1 may lead to incorrect results in a highly heterogeneous environment such as the PRIDE repository.
A) Unidentified, low resolution (= quality) spectra
B) (Later) addition of high quality, identified spectra
C) Identification is extended to unidentified spectra while the cluster membership is not updated.
D) Incorrect identifications are generated
Early experiments using low-resolution instruments record spectra that cannot be identified (a). Over time, higher quality spectra are recorded (that can be identified) and submitted to the repository (in this example identifying two peptides: one green, one red) (b). When these new spectra are incrementally added to the clusters, one of the high quality spectra is randomly chosen first and assigned to the existing cluster (in this example a green one). This shifts the cluster's consensus spectrum and all green spectra are assigned to this cluster. The red spectra no longer fit the now shifted cluster and are assigned to a new, purely red cluster (c). As described by Frank et al., the identification from the high quality spectra is attributed to all unidentified spectra in the cluster. Therefore, some spectra that should have been assigned to the red peptide (d) are misclassified as green ones. This problem will grow over time as (unidentified) low quality clusters continue to shift when new high quality spectra are added.

Supplementary Figure 2. The proportion of clusters that contained spectra identified as multiple different peptides proved to be too dataset dependent for a reliable assessment of clustering quality. Proportion of clusters that contained spectra identified as more than one peptide sequence. The data is shown for all test instances and combinations of clustering settings evaluated (the number of iterations (N) and the similarity threshold (t)). The fraction of mixed clusters decreased with increasing clustering threshold. Even with the highest threshold this fraction was still considerably higher than the 1.5% reported by Frank et al. 1 This higher proportion of mixed clusters is not indicative of lower clustering quality but an expected phenomenon. Clusters are able to group high and low quality spectra from the same peptide (as shown through the range of the spectra's precursor ion m/z values). If all of these spectra are identified with a database search engine, lower quality spectra have a higher probability of being identified incorrectly. While the relative number of mixed clusters was comparable between the COPD and the CPTAC instance, the number of mixed clusters was considerably lower in the HUPO instance. We were not able to identify a conclusive cause for this difference. Nevertheless, these results indicate that this measure depends strongly on the test dataset analyzed. Furthermore, this might explain the difference from the results reported by Frank et al., who may have used more homogeneous test datasets for their analysis (data not shown in 1 ).

Supplementary Figure 3. Larger clusters contained spectra attributed to fewer peptides than smaller clusters. Proportion of spectra that were identified as the most commonly identified peptide sequence within a cluster versus the cluster's size. Larger clusters were more likely to contain more spectra identified as a single peptide than smaller clusters. This finding was independent of the chosen clustering parameters. The data from the CPTAC instance was slightly different compared to the data from the HUPO and the COPD instance. This was most probably caused by the fact that while the HUPO and the COPD instance had a comparable size, the CPTAC instance was considerably smaller. Nevertheless, this data conclusively shows that clusters tend to become more reliable with increasing size. This observation coincides with the expectation that more frequently measured peptides are more likely to result in high quality spectra. Once these high quality spectra are present, the clusters become more stable and thereby more likely to only contain spectra from the same peptide. It can therefore be extrapolated that the larger a (clustered) repository gets, the more distinct reliable clusters become, and the more readily incorrect identifications of lower quality spectra can be identified.

Supplementary Figure 4. SpectraST identified the consensus spectra from larger clusters more reliably than from smaller ones. Charts representing the SpectraST f-score for each cluster's consensus spectrum versus the cluster's size. These results indicate that larger clusters result in consensus spectra that are more similar to the consensus spectra from the corresponding NIST spectral library (see Supplementary Note 3). Larger clusters were more reliably identified than small clusters. This is in line with the observation presented in Supplementary Figure 3 that larger clusters were more likely to only contain one peptide identification and can thus be regarded as more reliable.

Supplementary Figure 5. The relative number of consensus spectra identified differently by the search engines used decreased with increasing threshold. Relative number of consensus spectra that were identified as different peptides by the search engines used. This analysis only considered the first ranked identification from every search engine. We observed several cases where one of the search engines reported the same sequence as the others but ranked second or third. Consensus spectra were classified as differently identified as soon as a single search engine identified a different peptide. Therefore, the numbers shown in this figure present the worst-case scenario. The relative number of consensus spectra with different results decreased with increasing clustering threshold in all test datasets. The decrease of spectra that led to different identifications was similar for every increase of the similarity threshold.

Supplementary Figure 6. The reliability of peptide identifications is assessed through their ratio within a cluster. The ratio is the proportion of spectra within a cluster that were identified as the same peptide.
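The ratio defined in this caption can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `peptide_ratios` and the input layout (one identified sequence per spectrum) are assumptions made for illustration and are not part of PRIDE Cluster.

```python
from collections import Counter


def peptide_ratios(cluster_sequences):
    """Return {sequence: ratio} for one cluster.

    cluster_sequences holds one identified peptide sequence per spectrum
    in the cluster (an assumed, simplified input layout).
    """
    counts = Counter(cluster_sequences)
    total = len(cluster_sequences)
    # The ratio is the share of the cluster's spectra carrying this sequence.
    return {sequence: n / total for sequence, n in counts.items()}
```

For a cluster of four spectra in which three were identified as the same peptide, that peptide would receive a ratio of 0.75 and the remaining one a ratio of 0.25.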

Supplementary Figure 7. The number of target and decoy identifications with low ratios was nearly identical, while almost exclusively target identifications had high ratios. Cumulative number of target and decoy identifications versus cluster ratio. These plots show the cumulative number of peptide spectrum matches (PSMs, y-axis) that identify a sequence with a ratio less than the given maximum ratio (x-axis). The plots show that there is an exponential correlation between the cumulative number of target PSMs and the ratio in all instances. The numbers of decoy sequences show a similar relationship up to a certain ratio, above which no additional decoy sequences exist. This threshold seems to be dataset dependent but can easily be learned if a decoy search has been performed. Additionally, up to this point the number of target and decoy sequences is comparable, which is in accordance with the assumption that incorrect identifications are evenly distributed between the target and the decoy database 2. Therefore, it seems safe to assume that identifications with a low ratio represent the incorrect identifications present in the search results.
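The cumulative counts plotted in this figure, and the ratio above which no decoys remain, could be computed along the following lines. This is a hedged sketch only: the `(ratio, is_decoy)` tuple layout and both function names are assumptions for illustration, and taking the highest decoy ratio as the cut-off is a simplification rather than the machine learning approach referred to in Supplementary Note 5.

```python
def cumulative_counts(psms, max_ratio):
    """Count target and decoy PSMs whose ratio is below max_ratio.

    psms is an iterable of (ratio, is_decoy) tuples (assumed layout).
    """
    targets = sum(1 for ratio, is_decoy in psms if ratio < max_ratio and not is_decoy)
    decoys = sum(1 for ratio, is_decoy in psms if ratio < max_ratio and is_decoy)
    return targets, decoys


def decoy_ceiling(psms):
    """Return the highest ratio observed for any decoy PSM (0.0 if none)."""
    return max((ratio for ratio, is_decoy in psms if is_decoy), default=0.0)
```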

Supplementary Figure 8. PRIDE Cluster's role in the PRIDE ecosystem. Researchers submit their data to PRIDE in the PRIDE XML format created by PRIDE Converter. Experiments in PRIDE and generated PRIDE XML files can be viewed and assessed using PRIDE Inspector. PRIDE Cluster retrieves all identified spectra from PRIDE and clusters them using the publicly available PRIDE Spectra Clustering API. The clustering results are used to assess the reliability of identifications in PRIDE. These reliability assessments are visible in PRIDE and made available to external resources for further processing. Additionally, spectral libraries created from the clustering results are made available to the research community.

Supplementary Figure 9. The vast majority of clusters only contained spectra from a single species. Number of different species from which the spectra in a cluster originated. The vast majority of clusters only contained spectra from a single species. Still, there were several clusters that contained peptides from a considerable number of species. While most of the spectra in clusters with multiple species were identified as peptides from common contaminants (for instance, actin or keratin), some of the peptides (for example IGGIGTVPVGR from elongation factor 1-alpha or FLDGLYVSEK from 60S ribosomal protein L9) originated from proteins that are conserved across all represented species. Therefore, it seems inaccurate to classify spectra / peptides as contaminants only because they were identified in multiple species 1.

Supplementary Table 1 - List of available spectral libraries

Species                                     Number of Spectra
Homo sapiens (Human)                        81,428
Mus musculus (Mouse)                        53,376
Rattus norvegicus (Rat)                     17,624
Arabidopsis thaliana                        69,242
Drosophila melanogaster (Fruit fly)         4,686
Saccharomyces cerevisiae (Baker's yeast)    71,487
Salmonella typhimurium                      25,713

The spectral libraries created from PRIDE Cluster (version ). Libraries are made available as soon as they contain more than 4,000 spectra. All libraries are available at

Supplementary Protocol

Clustering algorithm. We developed an adapted version of the MS-Cluster clustering algorithm presented by Frank et al. 1 that was optimized for clustering quality. The main differences from the MS-Cluster algorithm are that our algorithm can split clusters and always assigns spectra to the cluster with the most similar consensus spectrum. The methods to calculate spectral similarities and generate consensus spectra were not altered. Furthermore, methods that were only applicable to data from ion trap mass spectrometers were replaced by methods that can be used on data from any type of mass spectrometer. These changes considerably increased the execution time of the algorithm but were necessary to make the algorithm usable for heterogeneous datasets.

As in the MS-Cluster algorithm, the spectra are sorted according to their quality before clustering. A spectrum's quality is estimated by roughly calculating its signal-to-noise ratio using the same method as the spectral library building component of SpectraST 3 (see Supplementary Note 1). Next, the algorithm iterates over all spectra and assigns them to the most similar cluster (each spectrum is only compared to the cluster's consensus spectrum). If no cluster exists that exceeds the defined similarity threshold, a new cluster is created that only contains this spectrum. Then, all clusters are compared with each other and similar clusters (based on the set similarity threshold) are merged. Finally, all spectra in all clusters are checked to determine whether they are still similar to the cluster's consensus spectrum. Spectra that are no longer similar to the respective consensus spectrum are returned to the pool of unclustered spectra. This process is repeated until either all spectra are assigned to clusters or a maximum of N iterations is reached. Thus, two parameters can be set when using this algorithm: the minimum similarity threshold t and the maximum number of iterations N.

Datasets and database search. We used three distinct datasets to test the performance of our algorithm. The COPD dataset (4,612,229 spectra) contains the data from a study by Steiling et al. 4 on the influence of smoking on proteomics profiles. The HUPO dataset (5,192,683 spectra) contained the data from the original HUPO Plasma Proteome Project 5 and experiments from the HUPO Brain Proteome Project 6 (pilot projects from labs 10, 12, and 13). The data for these two test datasets was downloaded in mzXML format from the PeptideAtlas 7 repository (for a detailed list of experiments see Table P.1).

PeptideAtlas accessions of the HUPO test dataset experiments: PAe000058, PAe000059, PAe000060, PAe000061, PAe000062, PAe000063, PAe000064, PAe000065, PAe000066, PAe000067, PAe000068, PAe000069, PAe000070, PAe000096, PAe000101, PAe000130, PAe000131, PAe000132, PAe000133, PAe000134, PAe000135, PAe000794, PAe000797, PAe000806, PAe000810, PAe000815, PAe000817, PAe000819, PAe000825, PAe000845, PAe000846, PAe000854, PAe000857, PAe000859, PAe000862, PAe000876, PAe001855, PAe001856, PAe001857, PAe001858, PAe001859, PAe001860, PAe001861, PAe001862, PAe001863, PAe001864, PAe001865, PAe

PeptideAtlas accessions of the COPD test dataset experiments: PAe000795, PAe000796, PAe000812, PAe000814, PAe000816, PAe000820, PAe000821, PAe000824, PAe000828, PAe000832, PAe000835, PAe000838, PAe000841, PAe000842, PAe000848, PAe000860, PAe000863, PAe000864, PAe000865, PAe000867, PAe000870, PAe000871, PAe000874, PAe000875, PAe000879, PAe

Table P.1: List of experiments downloaded from PeptideAtlas

The CPTAC dataset (1,325,842 spectra) contained data from five instrument configurations from study 6 8 conducted by the Clinical Proteomic Technology Assessment for Cancer (CPTAC). The data was downloaded from Tranche 9 and converted from Thermo Raw to MGF format using ProteoWizard 10.

We used the database search engine tools X!Tandem 11 (version ) and OMSSA 12 (version 2.1.8) to identify the spectra in the test datasets. The same settings were used for both search engines (see Supplementary Note 3). Carbamidomethylation was set as a fixed modification and methionine oxidation as a variable modification. All other parameters were left at their default settings. For X!Tandem, the refinement mode was turned off. Searches were performed against a randomized, concatenated target-decoy database. The score cut-off for identifications was set to an expectation value of 0.1 for both search engines (FDR of 5% in the COPD and HUPO instance and 1% in the CPTAC instance). For a detailed description of the test datasets and search procedure see Supplementary Note 3.

Clustering process. All spectra have to be kept in memory to be able to split clusters while clustering.
To still be able to process large datasets and decrease the processing time, the spectra are split based on their precursor ion's m/z. These groups of spectra can then be processed in parallel. To make sure that all spectra of a peptide are included in at least one clustering process, the groups overlap by half the precursor m/z window size used. As a result, every spectrum is clustered twice, which leads to duplicate and highly similar clusters. Therefore, after the clustering process, identical clusters from two neighboring m/z regions are merged. We clustered all identified spectra (target and decoy identifications) from the three test instances. We used four similarity thresholds (0.5, 0.6, 0.7, and 0.8) with a maximum of 4 iterations, and additionally a similarity threshold of 0.7 with a maximum of 10 iterations. Therefore, every instance was clustered with five different settings. To assess the algorithm's robustness we used an extremely wide precursor m/z window size of 20 m/z units for all test instances. This confronts the algorithm with spectra from a large number of different peptides.
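The splitting into overlapping precursor m/z windows described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for PRIDE Cluster: the function names are hypothetical, and it assumes that window i covers the half-open range from i * W/2 to i * W/2 + W, so that neighbouring windows overlap by W/2.

```python
import math


def window_indices(precursor_mz, window_size=20.0):
    """Return the indices of the overlapping windows that contain precursor_mz.

    Window i is assumed to cover [i * step, i * step + window_size),
    with step = window_size / 2.
    """
    step = window_size / 2.0
    last = math.floor(precursor_mz / step)
    # A precursor falls into the window starting at or below it and the one before.
    return [i for i in (last - 1, last) if i >= 0]


def split_into_windows(precursor_mzs, window_size=20.0):
    """Group spectrum indices by window so each group can be clustered separately."""
    groups = {}
    for index, mz in enumerate(precursor_mzs):
        for window in window_indices(mz, window_size):
            groups.setdefault(window, []).append(index)
    return groups
```

With the 20 m/z window used for the test instances, a precursor at 505.3 m/z would be placed in the windows starting at 490 and 500 m/z, which is why most spectra take part in two clustering processes.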

A detailed description of the clustering process and the processing of the test instances can be found in Supplementary Note 3.

Assessing clustering quality. To assess the clustering quality we searched the consensus spectra of all clusters from the three test datasets using five different database search engines (X!Tandem , OMSSA 2.1.8, Mascot , Crux , and SpectraST 3 4.0, standalone). This was possible as the number of consensus spectra was considerably smaller than the number of original input spectra. As five different search engines were used, we can safely assume that no search engine specific bias was introduced. We used the same settings and protein sequence databases to search the consensus spectra as for the search of the individual spectra. Only for SpectraST were the conditions of the search different: the respective NIST spectral libraries and a precursor tolerance of 4 m/z units were used. For these results we did not specify any confidence thresholds for the identifications. A detailed description can be found in Supplementary Note 3.

Identifying reliable identifications. To identify reliable identifications we transformed the sequences identified within a cluster to property vectors. For every distinct sequence the following attributes were analyzed: the ratio (the proportion of spectra identified as this peptide sequence within a given cluster), its rank among the sequences identified in the cluster, the total size of the cluster, the ratio of search engines that identified the same sequence from the consensus spectrum, the search engine ratio (as just defined) of the higher ranked sequence (if applicable) in the cluster, and whether it is a decoy sequence. We then used a machine learning approach to identify the property that separated target and decoy sequences (see Supplementary Note 5 for details).

Building PRIDE Cluster. We retrieved all identified spectra from the PRIDE repository containing the data of over 9,040 different public proteomics experiments (June 2012). The only requirement for an experiment to be included was that the investigated species had to be reported. Identified spectra were not included if the m/z of the spectrum's precursor ion was missing. This resulted in a total of 20,666,123 identified spectra identifying 2,815,820 distinct peptides in 40 species. The spectra were clustered using a similarity threshold of 0.7 and a maximum of 4 iterations. The data was made accessible through a web based application. The results from the clustering process are stored in a MySQL database (see Figure P.1).

Figure P.1: The PRIDE Cluster database contains the complete results from clustering all public identified spectra available in PRIDE. The tables highlighted in green hold the summarized (meta-) data retrieved from the PRIDE repository. This data is not altered by the clustering process. The blue tables hold the actual results of the clustering process. The table clustering_method is only used for debugging purposes in case the process encounters any problems. The cluster table contains all clusters generated during the process including the cluster's consensus spectrum as comma delimited strings. The link to the actual data from the PRIDE repository is stored in the self-explanatory tables cluster_has_spectrum and cluster_has_peptide. These two links are necessary since one spectrum can potentially have multiple peptide identifications.

The server-side code was developed using the Perl programming language. Boxplots were generated using R (version ). The client-side is a standard HTML web page and utilizes JavaScript for interactive components. The jQuery (version 1.7.1), jQuery UI (version ) and jQuery DataTables (version 1.9.0) JavaScript libraries were used.

Spectral libraries were generated from all reliable clusters. Their consensus spectra are made available in NIST's MSP data format. A cluster's consensus spectrum is added to a species specific library as soon as it contains at least one (reliable) peptide identification from an experiment from the respective species.

Supplementary Note 1 - Clustering algorithm

The clustering algorithm used in this work is based on the MS-Cluster algorithm proposed by Frank et al. 1. It was optimized to increase the quality of the generated clusters at the cost of reducing its speed.

Spectrum normalization

Before any comparison the spectra's intensities are normalized. The peak intensities are normalized so that the total spectrum intensity (sum of intensities of all peaks) is 1,000. This method was not changed from the original algorithm 1.

Spectra Similarity (normalized dot product)

The similarity between two spectra is assessed using the normalized dot product as described by Frank et al. 1 (no changes were made to this algorithm). For the comparison of two spectra only the k highest peaks are taken into consideration. k is calculated by dividing the precursor m/z by 50. Additionally, the peak intensities used for the comparison are transformed using 1 + LOG(I), where I is the peak's normalized intensity.

Algorithm:
1. Calculate k as described above.
2. Get the k highest peaks from both spectra S1 and S2.
3. Sort the peaks according to m/z value.
4. Create intensity vectors SV1 and SV2. NOTE: All added intensities are transformed using 1 + LOG(intensity).
5. Iterate over S1 peaks:
   a. Add intensity to SV1 intensity array.
   b. Check if S2 peaks contain a peak with comparable m/z (closest within 0.5 m/z units range).
      i. If yes, add the peak with the closest m/z to the SV2 intensity array.
      ii. Else, add 0.0 to the SV2 intensity array.
6. Add all S2 peak intensities that were not yet added to the SV2 intensity array, and add 0.0 to the SV1 intensity array for every peak added in this step.
7. Calculate the dot-product over the two intensity vectors SV1 and SV2:

\[ \mathrm{similarity}(S1, S2) = \frac{\sum_{i} SV1_i \cdot SV2_i}{\sqrt{\sum_{i} SV1_i^{2} \cdot \sum_{i} SV2_i^{2}}} \]

Equation 1.1 Formula to calculate the normalized dot-product.

Spectrum quality assessment (signal-to-noise ratio)

Frank et al.'s MS-Cluster algorithm uses a machine learning based rule set to assess the quality of MS/MS spectra. This set of rules is only applicable to ion trap data 1. This method was replaced by the more basic method used in the spectral library building component of the spectral search engine SpectraST to roughly assess a spectrum's signal-to-noise ratio 3. The advantage of this simpler approach is that it is applicable to spectra originating from virtually any mass spectrometer platform. A spectrum's signal-to-noise ratio (considered as its quality) is approximated by dividing the sum of the intensities (I) of the 2nd to 6th highest peaks by the median intensity of all peaks:

\[ \mathrm{quality} = \frac{\sum_{i=2}^{6} I_{(i)}}{\mathrm{median}(I)} \]

where I_(i) denotes the intensity of the i-th highest peak.

Equation 1.1 Formula used to calculate a spectrum's quality (i.e. its signal-to-noise ratio).

Consensus spectrum building

The algorithm used to build consensus spectra is the same as the one used by Frank et al. 1 and originally described here 15. The final m/z threshold used is set to 0.4 m/z units starting from 0.1 m/z units and using 0.1 m/z unit step increases.

Algorithm: Every peak in the consensus spectrum stores the m/z value, the intensity value and the number of spectra in which the peak was observed. Since the total number of spectra contributing to the consensus spectrum is known, a peak's probability of being observed in a spectrum can be calculated.
1. Add all peaks from all spectra to the consensus spectrum (CS). In case two peaks have an identical m/z value, add the intensities and increment how often the peak was observed.
2. Merge identical peaks.
   a. Start at a tolerance of 0.1 m/z units and increment by 0.1 m/z units until 0.4 m/z units are reached.
   b. Merge peaks within the tolerance. Use the weighted average m/z (weighted based on the peaks' intensities) as the new m/z.
3. Adapt peak intensities based on how often they were observed (Pi): I the peak's intensity. Pi the probability the peak is detected in a spectrum. This formula multiplies the observed intensity by
4. Filter the consensus spectrum:
   a. Keep only the top 5 peaks within every 100 m/z units window.

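The normalization, the similarity steps 1-7 and the quality estimate described in this note can be sketched in Python as follows. This is a minimal sketch and not the PRIDE Spectra Clustering API: all function names are hypothetical, spectra are assumed to be lists of (m/z, intensity) tuples, LOG is assumed to be the natural logarithm, the normalized dot product is assumed to take the usual cosine form, and each S2 peak is assumed to be matched at most once.

```python
import math
from statistics import median


def normalize(peaks, total=1000.0):
    """Scale intensities so that they sum to `total` (1,000 in the text)."""
    intensity_sum = sum(intensity for _, intensity in peaks)
    return [(mz, intensity * total / intensity_sum) for mz, intensity in peaks]


def top_k(peaks, k):
    """Keep the k most intense peaks, returned sorted by m/z."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:k]
    return sorted(strongest)


def transform(intensity):
    """1 + LOG(I); the natural logarithm is an assumption."""
    return 1.0 + math.log(intensity)


def normalized_dot_product(s1, s2, precursor_mz, tolerance=0.5):
    """Similarity of two normalized spectra following steps 1-7."""
    k = max(1, int(precursor_mz // 50))          # step 1
    p1, p2 = top_k(s1, k), top_k(s2, k)          # steps 2 and 3
    v1, v2, used = [], [], set()                 # step 4
    for mz, intensity in p1:                     # step 5
        v1.append(transform(intensity))
        best = None
        for j, (mz2, _) in enumerate(p2):
            if j in used or abs(mz2 - mz) > tolerance:
                continue
            if best is None or abs(mz2 - mz) < abs(p2[best][0] - mz):
                best = j
        if best is None:
            v2.append(0.0)
        else:
            used.add(best)
            v2.append(transform(p2[best][1]))
    for j, (_, intensity) in enumerate(p2):      # step 6
        if j not in used:
            v1.append(0.0)
            v2.append(transform(intensity))
    dot = sum(a * b for a, b in zip(v1, v2))     # step 7
    norm = math.sqrt(sum(a * a for a in v1) * sum(b * b for b in v2))
    return dot / norm if norm > 0 else 0.0


def quality(peaks):
    """Sum of the 2nd to 6th highest intensities divided by the median intensity."""
    intensities = sorted((intensity for _, intensity in peaks), reverse=True)
    return sum(intensities[1:6]) / median(intensities)
```

Under these assumptions, comparing a spectrum with itself yields a similarity of 1, and two spectra without any peaks within 0.5 m/z units of each other yield 0.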
Spectra Clustering

As mentioned above, the clustering algorithm was adapted to improve the quality of the generated clusters. One major difference from the MS-Cluster algorithm 1 is that clusters can be split if new spectra are added to the cluster (see step 4 below). Additionally, spectra are not added to the first fitting cluster (the first cluster with a similarity above the set threshold t) but to the best fitting cluster (the one with the highest similarity, see step 2). The algorithm depends on two variables: the similarity threshold t defining how similar spectra must be to be clustered together, and the maximum number of iterations N to optimize the clustering result.

Algorithm:
1. Sort Spectra: Spectra are sorted according to their estimated quality (see Equation 1.1).
2. Clustering Spectra: Iterate over all spectra and compare every spectrum against the consensus spectrum of every cluster.
   a. IF the similarity is above the threshold t, the similarity is stored and at the end of the comparison the spectrum is added to the cluster with the highest similarity.
   b. ELSE the spectrum is not similar to any consensus spectrum and a new cluster containing only this spectrum is generated.
3. Merging Cluster: Consensus spectra of all clusters are compared with each other.
   a. IF a cluster's consensus spectrum is similar (above threshold t) to another cluster's consensus spectrum the clusters are merged.
4. Remove non-fitting spectra: Every cluster is checked to determine whether all spectra are still similar (above the threshold t) to the cluster's consensus spectrum.
   a. IF a spectrum is no longer similar to the cluster's consensus spectrum the spectrum is removed from the cluster and a new cluster only containing this spectrum is created.
5. Go to 2. UNTIL all spectra fit their cluster or a maximum of N iterations is reached.
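The five steps above can be sketched as a short Python function. This is a simplified illustration rather than the implementation used in this work: the names `cluster_spectra` and `merge_similar` are hypothetical, the similarity, consensus and quality functions are passed in as callables, merging is done in a single greedy pass, and removed spectra are returned to the pool for the next iteration (as described in the Supplementary Protocol) and only become single-spectrum clusters after the last iteration.

```python
def merge_similar(clusters, similarity, consensus, t):
    """Step 3: greedily merge clusters whose consensus spectra exceed t."""
    merged = []
    for cluster in clusters:
        for target in merged:
            if similarity(consensus(cluster), consensus(target)) > t:
                target.extend(cluster)
                break
        else:
            merged.append(list(cluster))
    return merged


def cluster_spectra(spectra, similarity, consensus, quality, t=0.7, max_rounds=4):
    """Cluster spectra following steps 1-5; returns a list of clusters (lists)."""
    # Step 1: process the highest quality spectra first.
    pool = sorted(spectra, key=quality, reverse=True)
    clusters = []
    for _ in range(max_rounds):
        # Step 2: add every pooled spectrum to the best fitting cluster.
        for spectrum in pool:
            best, best_similarity = None, t
            for cluster in clusters:
                current = similarity(spectrum, consensus(cluster))
                if current > best_similarity:
                    best, best_similarity = cluster, current
            if best is None:
                clusters.append([spectrum])
            else:
                best.append(spectrum)
        pool = []
        # Step 3: merge similar clusters.
        clusters = merge_similar(clusters, similarity, consensus, t)
        # Step 4: remove spectra that no longer fit their cluster.
        for cluster in clusters:
            reference = consensus(cluster)
            fitting = [s for s in cluster if similarity(s, reference) > t]
            pool.extend(s for s in cluster if similarity(s, reference) <= t)
            cluster[:] = fitting
        clusters = [cluster for cluster in clusters if cluster]
        # Step 5: stop once every spectrum fits its cluster.
        if not pool:
            break
    # Spectra still unassigned after the last round end up as singletons.
    clusters.extend([spectrum] for spectrum in pool)
    return clusters
```

Passing the similarity and consensus functions as arguments keeps the loop independent of how spectra are represented, which loosely mirrors the interface-based design described in Supplementary Note 2.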

Supplementary Note 2 - PRIDE Spectra Clustering API

The PRIDE Spectra Clustering API is a Java API and was used to create the results presented in this manuscript. It is available from as well as directly through the EBI maven repository (uk.ac.ebi.pride.tools:pride-spectra-clustering 1.0 from ). The project's homepage contains a step-by-step tutorial of how to use the PRIDE Spectra Clustering API to cluster spectra as it was done in this work.

The API's central points of entry are the implementations of the SpectraClustering interface. Currently, there is only the FrankEtAlClustering class available, which is the adapted version of Frank et al.'s 1 original algorithm used in this work. We tried to represent every step used during the clustering process through Java interfaces. Therefore, we hope that this API is a good basis to develop and test new methods to cluster MS/MS spectra.

The SpectraClustering interface contains two methods that influence the algorithm's performance:
- setsimilaritythreshold: Sets the similarity threshold required to cluster spectra together.
- setclusteringrounds: Sets the maximum number of iterations (N) to optimize the clustering results.

The function clusterspectra is then used to cluster a List of Spectrum. The Spectrum interface used in the PRIDE Spectra Clustering API is the same as the one used in the jmzReader API 16. Therefore, spectra can simply be read from an MS data file using jmzReader and then directly clustered using the PRIDE Spectra Clustering API. clusterspectra returns a List of SpectraCluster that represents the clustering result. Every SpectraCluster contains properties that describe the cluster (for example the cluster's average precursor m/z) as well as a List of Spectrum that holds all spectra that were added to this cluster.

Supplementary Note 3 - Test datasets

Figure 3.1 Process used to develop and assess the performance of the clustering algorithm.

We used three distinct datasets to test our algorithm (see Figure 3.1). The COPD dataset represents data acquired during a study by Steiling et al. 4. They analyzed the proteomics and transcriptomics profiles of the bronchial airway epithelium of current and never smokers. The data was collected using 1D-PAGE and an LTQ ProteomeX ion trap (ThermoFinnigan, Waltham, MA). The data was downloaded in mzXML format from the PeptideAtlas 7 repository and consisted of 26 experiments with 4,612,229 spectra. This dataset, in our opinion, represents a standard, state-of-the-art proteomics experiment.

The HUPO dataset consists of data from various experiments conducted during two projects of the Human Proteome Organization (HUPO). These are the experiments from the original HUPO Plasma Proteome Project (PPP) 5 as well as experiments from the HUPO Brain Proteome Project 6 (HBPP) (pilot projects from labs 10, 12, and 13). The data was downloaded in mzXML format from the PeptideAtlas 7 repository and consisted of 48 experiments with 5,192,683 spectra. These data are very heterogeneous and were generated in the early days of large-scale proteomics research. Therefore, they seem to be a representative sample of legacy data found in current public proteomics repositories.

The CPTAC dataset contains data from study 6 conducted by the Clinical Proteomic Technology Assessment for Cancer (CPTAC) 8. All replicates from five instrument configurations / labs were included in our analysis: the LTQ2@95, LTQ@73, LTQ Orbitrap@86, LTQ XL OrbitrapP@65 and the LTQ XLx@65. The data was downloaded from Tranche 9 and converted to MGF format using ProteoWizard 10. This resulted in a total of 71 experiments (1 experiment per replicate) with 1,325,842 spectra. Four files could not be converted using ProteoWizard as only corrupted versions were present in Tranche. These were excluded from the analysis (1 replicate LTQ@73 6A, 1 replicate LTQ XL OrbitrapP@65 6B, 2 replicates LTQ@73 6C). All of these experiments were performed in highly controlled environments following strict standard operating procedures. Therefore, this dataset can be seen as a collection of highly reproducible and robust experiments. The detailed experiment list for the COPD and the HUPO instances can be found in Supplementary Table 1.

Search

All experiments were searched using X!Tandem 11 ( Cyclone version ) and OMSSA 12 (see Table 3.1). Two search engines were used to prevent any bias towards the algorithm of a single search engine and to replicate the now common approach of searching experiments with multiple search engines. The search settings were as follows: precursor tolerance was set to 2.0 Da and the fragment tolerance to 0.6 Da. Charge states between 2 and 4 as well as up to 2 missed cleavages were allowed (enzyme was set to trypsin). Carbamidomethylation was set as a fixed modification and methionine oxidation as a variable modification. Refinement was disabled in X!Tandem while all other settings were left at their default values.
The search was performed against a concatenated target and (randomized) decoy database generated using the decoy.pl script from Matrix Science (UK). The COPD and the HUPO instances were searched against the UniProtKB human complete proteome set version . The CPTAC instance was searched against the UniProtKB yeast complete proteome set version as well as the Universal Protein Standard (UPS) 1 and UPS 2 fasta library from Sigma-Aldrich. A maximum expectation threshold of 0.1 was used for both search engines, X!Tandem and OMSSA. All peptide-spectrum-matches (PSMs) (i.e. all identified spectra and the associated peptide identification data) were stored in a database for further analysis.

Dataset   X!Tandem (PSMs Decoy / Target)   OMSSA (PSMs Decoy / Target)
COPD      5,155 / 350,477                  38,786 / 444,673
HUPO      10,468 / 183,090                 7,559 / 159,926
CPTAC     1,109 / 119,094                  2,042 / 119,657

Table 3.1. Number of decoy and target PSMs per search engine and dataset.

Clustering

All identified spectra of each test instance were clustered using the algorithm described in Supplementary Note 1. The clustering process was repeated several times using different thresholds t and iterations N (see Table 3.2). Clusters with only one spectrum were ignored and discarded (COPD: 4,807 clusters, HUPO: 19,569 clusters, CPTAC: 7,270 clusters).

Threshold (t) Maximum Iterations (N) COPD HUPO CPTAC ,578 39, ,383 48,589 23, ,676 61,906 33, ,857 83,394 53, ,688 83,314 53, , ,193 88,114

Table 3.2. Clustering settings used for the various instances and the number of generated clusters per instance.

Since clusters can be split and spectra hence reassigned to new clusters, all spectra have to be kept in memory. Therefore, only spectra with a similar precursor ion m/z ratio were clustered at once. The window size W used to select spectra was set to 20 m/z units. This extremely wide window was chosen to thoroughly test the algorithm's accuracy under extreme conditions and is considerably wider than the window of 2 m/z used by Frank et al. 1 Neighbouring windows overlapped by W/2 m/z units. This ensured that all spectra originating from the same peptide were present in at least one clustering process. Most spectra are part of two clusters as every spectrum was clustered in two clustering processes. This leads to the generation of duplicate / highly similar clusters. Therefore, these clusters from neighbouring windows are merged after the clustering. Two thresholds had to be defined to identify these similar clusters: the maximum difference between the average precursor m/z of the clusters and the required minimum similarity between the clusters' consensus spectra.
A set of 3,600 neighbouring clusters was analyzed using a grid expansion on both variables: a range of similarity thresholds and precursor m/z tolerances between 0 and 1 m/z units. Clusters whose average precursor m/z values are more than 0.5 m/z units apart are highly unlikely to be identical (see Figure 3.1). Therefore, the ideal similarity threshold was defined as the lowest similarity threshold at which the number of merged clusters did not increase when the tolerance was raised from 0.5 to 1 m/z units. The allowed precursor m/z tolerance was then defined as the lowest tolerance at which this final maximum number of merged clusters was reached. This resulted in a minimum similarity threshold of 0.97 and a maximum precursor m/z tolerance of 0.35 m/z units.

Figure 3.1. Results from the grid expansion of the required similarity threshold and the maximum precursor m/z difference to identify identical clusters. Only the 9 highest thresholds used are shown as these proved to be best suited.

Searching consensus spectra

The consensus spectra of all clusters were searched using five different search engines: X!Tandem 11, OMSSA 12 (version 2.1.8), Mascot 13 (version 2.3), Crux 14 (version 1.37), and SpectraST 3 (version 4.0, standalone). This allowed us to additionally assess the quality of the consensus spectrum building algorithm as well as the clustering accuracy. Apart from SpectraST, the exact same databases were used for the search of the consensus spectra as were used for the search of the individual spectra. For SpectraST, the NIST ion trap human spectral library was used for the COPD and the HUPO instances. The CPTAC instance was searched against a concatenated spectral library of the NIST ion trap yeast spectral library and the NIST ion trap UPS 1 spectral library. All search engines, apart from SpectraST, used the same settings as for the search of the individual spectra (including a fragment tolerance of 0.6 Da, a maximum of 2 missed cleavages, trypsin as the enzyme, and charge states 2-4). For SpectraST, the precursor tolerance was set to 4 m/z units.
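The windowing and merging steps described in this note can be sketched as follows. This is a minimal illustration, not the PRIDE Cluster implementation: the function names `overlapping_windows` and `should_merge` are hypothetical, and the consensus similarity is assumed to be computed elsewhere. Only the window width (20 m/z), the W/2 overlap, and the merge thresholds (0.97 and 0.35 m/z) are taken from the text.

```python
from typing import Iterator, Tuple


def overlapping_windows(min_mz: float, max_mz: float,
                        width: float = 20.0) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) precursor m/z windows that overlap by width / 2."""
    step = width / 2.0  # neighbouring windows are shifted by half a window
    start = min_mz
    while start < max_mz:
        yield (start, start + width)
        start += step


def should_merge(mean_mz_a: float, mean_mz_b: float,
                 consensus_similarity: float,
                 min_similarity: float = 0.97,
                 max_mz_diff: float = 0.35) -> bool:
    """Decide whether two clusters from neighbouring windows are duplicates.

    Both conditions from the text must hold: the average precursor m/z
    values are close enough and the consensus spectra are similar enough.
    """
    close_enough = abs(mean_mz_a - mean_mz_b) <= max_mz_diff
    similar_enough = consensus_similarity >= min_similarity
    return close_enough and similar_enough
```

With this layout a spectrum away from the edges of the covered range falls into exactly two windows, which is why duplicate clusters arise and have to be merged afterwards.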

Supplementary Note 4 - Assessing clustering quality and determining clustering thresholds

Frank et al. estimated the clustering quality by checking how many clusters contained spectra identified as different peptides (using the search engine InsPecT) 1. For their test they used spectra attributed to a given peptide identification and added an equal number of other randomly selected decoy spectra that had a different identification and a precursor m/z at least 8 m/z units away from the target peptide's precursor (see Supplementary Note 3 in ref. 1). In early tests of our algorithm, we replicated this approach and obtained comparable results (data not shown). When we investigated these results we found that spectra with different precursor m/z values are practically never clustered together. We then adapted this approach to use a random selection of spectra from different peptide identifications but with a similar precursor m/z. As expected, the measured clustering accuracy decreased, as spectra with similar precursor m/z are more likely to be similar (also because the underlying peptides were more similar). Additionally, the results from this approach depended strongly on the sampled decoy spectra (data not shown). For the same set of target spectra, the proportion of spectra with different peptide identifications clustered together ranged from 0 to 10% depending on the (randomly) chosen set of decoy spectra (for 1,000 randomly chosen target sequences the clustering process was repeated 100 times, i.e. choosing 100 different random sets of decoy spectra, data not shown). To circumvent the problem of random sampling, we adapted the whole clustering approach to cluster all spectra within a given precursor m/z range. This m/z range was wide enough to include all spectra of a given peptide identification as well as a considerable number of spectra from other peptides to thoroughly test the algorithm's robustness. This range was set to 20 m/z units for all test instances.

We then assessed the clustering quality by examining the precursor m/z range of the spectra included in a cluster. The clustering algorithm is blind to this attribute and thus not biased by it. While the identification data from search engines is always incorrect in a certain proportion of cases, the precursor m/z is directly measured by the instrument and is a physical property of the measured analyte. It can therefore be regarded as a reliable, independent measurement of a spectrum's origin. We believe that this approach is considerably more stringent than the approach used by Frank et al. to assess the algorithm's performance.
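The quality measure described above can be illustrated with a short sketch. It assumes each cluster is available as a plain list of precursor m/z values; the function names are hypothetical and not part of the PRIDE Spectra Clustering API. The 2 m/z cut-off in the second function is an illustrative default only.

```python
from typing import List, Sequence


def precursor_mz_range(precursor_mzs: Sequence[float]) -> float:
    """Return the precursor m/z range (max - min) of one cluster."""
    return max(precursor_mzs) - min(precursor_mzs)


def fraction_below(clusters: List[Sequence[float]], max_range: float = 2.0) -> float:
    """Return the fraction of clusters whose precursor m/z range is below max_range."""
    narrow = sum(1 for mzs in clusters if precursor_mz_range(mzs) < max_range)
    return narrow / len(clusters)
```

Because the clustering itself never looks at the precursor m/z, a narrow range across a cluster is independent evidence that its spectra share one origin.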

Figure 4.1. Boxplot of the clusters' precursor m/z ranges per dataset and clustering method (i.e. method settings). The limited discriminative power of the dot-product can be seen from the fact that the ranges decrease from larger ("COPD" instance) to smaller test sets ("CPTAC" instance). There was no significant difference in clustering accuracy when using a maximum of 4 iterations (N) compared to a maximum of 10 iterations (tested for t = 0.7; Wilcoxon rank-sum test for each of the COPD, HUPO and CPTAC instances, as the data were not normally distributed). Clustering accuracy, measured as decreasing cluster ranges, increased significantly with every threshold increase in every instance (Wilcoxon rank-sum test, p < 2.2e-16 for every comparison of neighbouring thresholds).

The analysis of the precursor m/z ranges of the spectra in all clusters from all test datasets as well as the PRIDE Cluster database (Figure 4.1 and Figure 1b in the main manuscript) clearly shows that the clustering algorithm is highly stable irrespective of the test dataset. The vast majority of cluster ranges were below 2 m/z units, which can be explained by normal variations in MS data. These results indicate that the clustering algorithm reliably clusters spectra originating from the same chemical compound irrespective of the underlying dataset.

We manually investigated several of the clusters with high m/z ranges from the COPD instance clustered with a threshold t = 0.7 and a maximum of 4 iterations (N). Three clusters had an m/z range greater than 10 m/z units. The first contained 1,988 spectra identified as EFTPPVQAAYQK and three spectra attributed to three other sequences (Figure 4.2a). All five search engines used to search the consensus spectrum also identified EFTPPVQAAYQK with high confidence. The group of spectra with a precursor m/z below 682 m/z was also identified as EFTPPVQAAYQK but with a dehydration (mass shift of 18 Da) on the first amino acid.

The second cluster (Figure 4.2b) contained 6 out of 7 spectra identified as QFPFLASIQNQGR. The spectra with a precursor m/z around 745 m/z were also identified as QFPFLASIQNQGR but with a deamination on the first residue. The fact that the clustering algorithm grouped these spectra together is not surprising as only the k highest peaks are used for the comparison (Supplementary Note 1). The main peaks used for the comparison were not affected by the modification. Another cluster contained 777 spectra identified as QISNLQQSISDAEQR. The two spectra with outlying precursor m/z values again contained deaminated residues (Figure 4.2c). These examples show that the observed outliers are caused by highly similar peptides that cannot be distinguished using the algorithm presented here. Nevertheless, the spectra all originated from the same peptide sequence, which, given the amount of data processed, can be seen as a very good result.

Figure 4.2 (panels a, b, c). Precursor ion m/z distribution of the spectra from three clusters from the COPD instance (maximum number of iterations N=4, similarity threshold T=0.7). These clusters showed high precursor m/z ranges and were investigated manually. The outliers were all caused by peptides with a deamination on one amino acid that was not recognized by the clustering algorithm.

When we investigated clusters generated with clustering similarity thresholds below 0.7, the clusters with wide precursor m/z ranges did contain spectra attributed to various peptide identifications. The consensus spectra from these clusters could then no longer be confidently identified by any of the five search engines used. Therefore, it seems most likely that spectra from different peptide species were clustered together when using these lower thresholds. The threshold t of 0.7 was the first threshold to produce reliable results in all test instances.
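As a small worked example of why a single modification can widen a cluster's precursor m/z range, the sketch below converts a mass shift into the corresponding precursor m/z shift. The relation (mass shift divided by charge) is standard; the charge states used here are assumptions for illustration, since the charge of the investigated spectra is not stated in this note, and the function name is hypothetical.

```python
def precursor_mz_shift(mass_shift_da: float, charge: int) -> float:
    """Return the precursor m/z shift caused by a mass shift at a given charge."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return mass_shift_da / charge


# The 18 Da dehydration mentioned above, for the charge states 2-4
# that were considered in the searches (Supplementary Note 3).
shifts = {z: precursor_mz_shift(18.0, z) for z in (2, 3, 4)}
```

For a doubly charged precursor (an assumption), the 18 Da shift alone moves the precursor by 9 m/z units, far more than the roughly 2 m/z attributed to normal variation.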

Supplementary Note 5 - Assessing PSM reliabilities

We assessed the feasibility of detecting incorrect identifications by checking whether it was possible to identify the decoy identifications within the test instances' search results. To assess the reliability of the PSMs in a cluster, the distinct sequences identified within a cluster were stored as property vectors. Modifications were not taken into account as the spectra's precursor ion m/z values were close enough to indicate that the masses of the identified peptides were essentially identical. Every sequence was represented using a vector of properties: the sequence's ratio (the proportion of spectra identified as this sequence within the cluster), its rank among the sequences identified in the cluster, the total size of the cluster, the ratio of search engines that identified the same sequence from the consensus spectrum, the search engine ratio (as just defined) of the previous (higher ranked) sequence, and whether it is a decoy sequence.

A machine learning approach was followed using the Waikato Environment for Knowledge Analysis (WEKA). To learn the rules to classify target and decoy sequences, the Conjunctive Rule Learner was used with the following parameters: 3 folds, minimum total weight 2.0, number of antecedents -1, seed 1. This approach was chosen since it is one of the simplest machine learning algorithms and produces simple, human-understandable results. More sophisticated algorithms will most probably return better results but are more likely to overfit the data. The data from the COPD instance was used as training set with 10-fold cross-validation and the data from the HUPO instance as independent test set. Irrespective of the clustering settings used and the combination of attributes, the algorithm only chose the (distinct) peptide sequence's ratio within the cluster to distinguish between target and decoy identifications (sequences with a ratio above the learned cut-off were classified as non-decoy, i.e. as correct identifications; Figure 5.1).
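The first three entries of the property vector (ratio, rank and cluster size) can be derived directly from the list of sequences assigned to the spectra of a cluster. The sketch below is a minimal illustration under that assumption; `sequence_properties` is a hypothetical helper, and the search-engine ratios and the decoy flag are left out because they require the consensus search results.

```python
from collections import Counter
from typing import Dict, List, Sequence, Union


def sequence_properties(sequences: Sequence[str]) -> List[Dict[str, Union[str, float, int]]]:
    """Return ratio, rank and cluster size for every distinct sequence in a cluster.

    `sequences` holds one (unmodified) peptide sequence per spectrum.
    Rank 1 is the most frequent sequence in the cluster.
    """
    cluster_size = len(sequences)
    counts = Counter(sequences)
    properties = []
    for rank, (sequence, count) in enumerate(counts.most_common(), start=1):
        properties.append({
            "sequence": sequence,
            "ratio": count / cluster_size,  # share of spectra with this sequence
            "rank": rank,
            "cluster_size": cluster_size,
        })
    return properties
```

For a cluster of ten spectra in which eight are identified as EFTPPVQAAYQK and two as some other sequence, the top-ranked entry has a ratio of 0.8 and the second a ratio of 0.2.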

Figure 5.1. Distribution of ratios between target and decoy sequences. An identification's ratio (the relative number of spectra within a cluster that were identified as this sequence) was identified as the most suitable attribute to classify target and decoy identifications. The ratios of target sequences were significantly higher than the ratios of decoy sequences (Wilcoxon rank-sum test p < 2.2e-16 for all comparisons). These data clearly show that it is feasible to distinguish target from decoy (i.e. incorrect) identifications based on the ratio alone. A similarity threshold of 0.7 was the first one at which both distributions were clearly separated in all test datasets.

While the results presented in Supplementary Figure 6 are comparable for the COPD and the HUPO instances, there is a distinct difference to the results from the CPTAC instance. In terms of size the HUPO and the CPTAC instances are comparable. However, the main difference between the COPD and the HUPO instances, compared to the CPTAC instance, is the peptide false discovery rate (FDR) observed in the identifications. We calculated the FDR of an instance by merging the results of both search engines and used the formula proposed by Elias et al. 2 Based on this formula, the COPD and HUPO instances have an FDR of 5% while the CPTAC instance has a considerably lower FDR of 1%. This suggests that the ratio cut-off to separate incorrect and correct identifications most probably depends on the peptide FDR found in the analyzed dataset. As the true FDR is unknown in the PRIDE repository, we selected a minimum ratio of 0.7 to classify identifications as reliable in PRIDE. This threshold is well above the threshold suggested by the results from the test datasets and can therefore be regarded as a very conservative assessment. As this ratio is likely to occur by chance in very small clusters, we analyzed the size distribution of clusters where the most frequent peptide identification was from the decoy database and constituted at least 70% of the peptide identifications within the cluster (Figure 5.2). Based on this analysis we defined a minimum cluster size of 10 spectra as it was able to identify more than 85% of all validated decoy clusters within all three test datasets. This seems to be a good balance between decreasing the number of incorrect identifications and removing correct ones.

Figure 5.2. Size distribution of clusters whose primary peptide identification was from the decoy database and constituted at least 70% of all identifications in the cluster (N = maximum number of iterations, T = similarity threshold).

Non-random incorrect identifications

When analyzing clusters where the most frequent peptide identification was from the decoy database, we observed that some of these clusters contained up to 1,000 spectra from multiple experiments. In several cases, all of these spectra were identified as the same decoy sequence (see outliers in Figure 5.2). It is highly improbable that these spectra were generated by chance or that these spectra originate from a chemical compound other than a peptide. Therefore, a possible hypothesis is that these spectra originated from peptides that are not present in the sequence database used. These spectra lead to stable incorrect identifications and represent a systematic error in the search results. Therefore, they violate the assumption that incorrect identifications are evenly distributed among the target and the decoy database 2. Furthermore, several validation tools that take the number of search engines and replicates that identified a given peptide sequence into consideration (for example iProphet 18) will classify these identifications as highly reliable.

Even though the algorithm presented here cannot identify the origin of these spectra and will also incorrectly label these identifications as reliable, it can identify the presence of such non-random incorrect identifications when decoy search results are present. Such results can then be used to investigate whether the search database used might have to be extended. In an effort to identify the true origin of these spectra we searched the consensus spectra of all clusters that contained 50% or more decoy PSMs against the common Repository of Adventitious Proteins (cRAP) using X!Tandem and OMSSA. This database contains proteins of three general classes: (i) common laboratory proteins; (ii) proteins added by accident through dust or physical contact; and (iii) proteins used as molecular weight or mass spectrometry quantitation standards. The search settings were identical to those used before (see Supplementary Note 3). For both search engines only identifications with an expectation value lower than 0.1 were accepted. As can be seen from the number of identifications presented in Table 5.1, only 10% - 20% of these consensus spectra were successfully identified. The number of spectra that were identified as the same peptide by both search engines was even lower.
Therefore, even though some of these clusters seem to represent peptides that originate from common contaminants, the vast majority seems to originate from peptides that were present neither in the search database used nor in the cRAP database.

Instance N = 4, t = 0.6 N = 4, t = 0.7 N = 4, t = 0.8 Total Ident. Same Total Ident. Same Total Ident. Same
COPD 3, , ,175 1,648 1,139 HUPO 2, , , CPTAC

Table 5.1. Number of identified ("Ident.") consensus spectra together with the number of equal identifications ("Same") by both search engines against the cRAP database, compared to the total number of clusters ("Total") that contained equal to or more than 50% decoy PSMs.
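Two of the numerical steps in this note can be sketched in a few lines. The first function is only a plausibility check: the formula from Elias et al. is not reproduced in this text, so a plain decoy-to-target ratio over the summed PSM counts of Table 3.1 is assumed here, and summing is only one possible reading of "merging the results of both search engines". The second function applies the two reliability criteria stated above (a ratio of at least 0.7 and at least 10 spectra). All names are hypothetical.

```python
from typing import Dict, Tuple


def decoy_to_target_rate(decoy_psms: int, target_psms: int) -> float:
    """Assumed error estimate: number of decoy PSMs divided by target PSMs."""
    return decoy_psms / target_psms


# (decoy, target) PSM counts from Table 3.1, summed over X!Tandem and OMSSA.
TABLE_3_1: Dict[str, Tuple[int, int]] = {
    "COPD": (5_155 + 38_786, 350_477 + 444_673),
    "HUPO": (10_468 + 7_559, 183_090 + 159_926),
    "CPTAC": (1_109 + 2_042, 119_094 + 119_657),
}

RATES = {name: decoy_to_target_rate(d, t) for name, (d, t) in TABLE_3_1.items()}


def is_reliable(top_ratio: float, cluster_size: int,
                min_ratio: float = 0.7, min_size: int = 10) -> bool:
    """Apply the ratio and cluster-size criteria to a cluster's top-ranked sequence."""
    return top_ratio >= min_ratio and cluster_size >= min_size
```

Under these assumptions the rates come out at roughly 5.5% (COPD), 5.3% (HUPO) and 1.3% (CPTAC), which is in line with the 5%, 5% and 1% quoted in the text, although the authors' exact calculation may differ.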


More information

The Pitfalls of Peaklist Generation Software Performance on Database Searches

The Pitfalls of Peaklist Generation Software Performance on Database Searches Proceedings of the 56th ASMS Conference on Mass Spectrometry and Allied Topics, Denver, CO, June 1-5, 2008 The Pitfalls of Peaklist Generation Software Performance on Database Searches Aenoch J. Lynn,

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007 -2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007 About PASA PASA is an open

More information

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Oliver Serang Department of Genome Sciences, University of Washington, Seattle, Washington Michael

More information

Self-assembling covalent organic frameworks functionalized. magnetic graphene hydrophilic biocomposite as an ultrasensitive

Self-assembling covalent organic frameworks functionalized. magnetic graphene hydrophilic biocomposite as an ultrasensitive Electronic Supplementary Material (ESI) for Nanoscale. This journal is The Royal Society of Chemistry 2017 Electronic Supporting Information for: Self-assembling covalent organic frameworks functionalized

More information

De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics

De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics John R. Rose Computer Science and Engineering University of South Carolina 1 Overview Background Information Theoretic

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

HMMatch: Peptide Identification by Spectral Matching of Tandem Mass Spectra Using Hidden Markov Models ABSTRACT

HMMatch: Peptide Identification by Spectral Matching of Tandem Mass Spectra Using Hidden Markov Models ABSTRACT JOURNAL OF COMPUTATIONAL BIOLOGY Volume 14, Number 8, 2007 Mary Ann Liebert, Inc. Pp. 1025 1043 DOI: 10.1089/cmb.2007.0071 HMMatch: Peptide Identification by Spectral Matching of Tandem Mass Spectra Using

More information

Site-specific Identification of Lysine Acetylation Stoichiometries in Mammalian Cells

Site-specific Identification of Lysine Acetylation Stoichiometries in Mammalian Cells Supplementary Information Site-specific Identification of Lysine Acetylation Stoichiometries in Mammalian Cells Tong Zhou 1, 2, Ying-hua Chung 1, 2, Jianji Chen 1, Yue Chen 1 1. Department of Biochemistry,

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Methods for proteome analysis of obesity (Adipose tissue)

Methods for proteome analysis of obesity (Adipose tissue) Methods for proteome analysis of obesity (Adipose tissue) I. Sample preparation and liquid chromatography-tandem mass spectrometric analysis Instruments, softwares, and materials AB SCIEX Triple TOF 5600

More information

Proteomics. November 13, 2007

Proteomics. November 13, 2007 Proteomics November 13, 2007 Acknowledgement Slides presented here have been borrowed from presentations by : Dr. Mark A. Knepper (LKEM, NHLBI, NIH) Dr. Nathan Edwards (Center for Bioinformatics and Computational

More information

Protein Structure Determination from Pseudocontact Shifts Using ROSETTA

Protein Structure Determination from Pseudocontact Shifts Using ROSETTA Supporting Information Protein Structure Determination from Pseudocontact Shifts Using ROSETTA Christophe Schmitz, Robert Vernon, Gottfried Otting, David Baker and Thomas Huber Table S0. Biological Magnetic

More information

Agilent MassHunter Quantitative Data Analysis

Agilent MassHunter Quantitative Data Analysis Agilent MassHunter Quantitative Data Analysis Presenters: Howard Sanford Stephen Harnos MassHunter Quantitation: Batch Table, Compound Information Setup, Calibration Curve and Globals Settings 1 MassHunter

More information

Procedure to Create NCBI KOGS

Procedure to Create NCBI KOGS Procedure to Create NCBI KOGS full details in: Tatusov et al (2003) BMC Bioinformatics 4:41. 1. Detect and mask typical repetitive domains Reason: masking prevents spurious lumping of non-orthologs based

More information

Yifei Bao. Beatrix. Manor Askenazi

Yifei Bao. Beatrix. Manor Askenazi Detection and Correction of Interference in MS1 Quantitation of Peptides Using their Isotope Distributions Yifei Bao Department of Computer Science Stevens Institute of Technology Beatrix Ueberheide Department

More information

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra Xiaowen Liu Department of BioHealth Informatics, Department of Computer and Information Sciences, Indiana University-Purdue

More information

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research profileanalysis Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research Innovation with Integrity Omics Research Biomarker Discovery Made Easy by ProfileAnalysis

More information

Isotopic-Labeling and Mass Spectrometry-Based Quantitative Proteomics

Isotopic-Labeling and Mass Spectrometry-Based Quantitative Proteomics Isotopic-Labeling and Mass Spectrometry-Based Quantitative Proteomics Xiao-jun Li, Ph.D. Current address: Homestead Clinical Day 4 October 19, 2006 Protein Quantification LC-MS/MS Data XLink mzxml file

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Was T. rex Just a Big Chicken? Computational Proteomics

Was T. rex Just a Big Chicken? Computational Proteomics Was T. rex Just a Big Chicken? Computational Proteomics Phillip Compeau and Pavel Pevzner adjusted by Jovana Kovačević Bioinformatics Algorithms: an Active Learning Approach 215 by Compeau and Pevzner.

More information

Spectronaut Pulsar. User Manual

Spectronaut Pulsar. User Manual Spectronaut Pulsar User Manual 1 General Information... 6 1.1 Computer System Requirements... 6 1.2 Scope of Spectronaut Software... 6 1.3 Spectronaut Pulsar... 6 1.4 Spectronaut Release Features... 7

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data

MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data MS2DB: An Algorithmic Approach to Determine Disulfide Linkage Patterns in Proteins by Utilizing Tandem Mass Spectrometric Data Timothy Lee 1, Rahul Singh 1, Ten-Yang Yen 2, and Bruce Macher 2 1 Department

More information

Supplementary Materials for mplr-loc Web-server

Supplementary Materials for mplr-loc Web-server Supplementary Materials for mplr-loc Web-server Shibiao Wan and Man-Wai Mak email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk June 2014 Back to mplr-loc Server Contents 1 Introduction to mplr-loc

More information

A graph-based filtering method for top-down mass spectral identification

A graph-based filtering method for top-down mass spectral identification Yang and Zhu BMC Genomics 2018, 19(Suppl 7):666 https://doi.org/10.1186/s12864-018-5026-x METHODOLOGY Open Access A graph-based filtering method for top-down mass spectral identification Runmin Yang and

More information

Analysis of Peptide MS/MS Spectra from Large-Scale Proteomics Experiments Using Spectrum Libraries

Analysis of Peptide MS/MS Spectra from Large-Scale Proteomics Experiments Using Spectrum Libraries Anal. Chem. 2006, 78, 5678-5684 Analysis of Peptide MS/MS Spectra from Large-Scale Proteomics Experiments Using Spectrum Libraries Barbara E. Frewen, Gennifer E. Merrihew, Christine C. Wu, William Stafford

More information

Photometric Redshifts with DAME

Photometric Redshifts with DAME Photometric Redshifts with DAME O. Laurino, R. D Abrusco M. Brescia, G. Longo & DAME Working Group VO-Day... in Tour Napoli, February 09-0, 200 The general astrophysical problem Due to new instruments

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

MassHunter TOF/QTOF Users Meeting

MassHunter TOF/QTOF Users Meeting MassHunter TOF/QTOF Users Meeting 1 Qualitative Analysis Workflows Workflows in Qualitative Analysis allow the user to only see and work with the areas and dialog boxes they need for their specific tasks

More information

TANDEM MASS SPECTRAL LIBRARIES OF PEPTIDES AND THEIR ROLES IN PROTEOMICS RESEARCH

TANDEM MASS SPECTRAL LIBRARIES OF PEPTIDES AND THEIR ROLES IN PROTEOMICS RESEARCH TANDEM MASS SPECTRAL LIBRARIES OF PEPTIDES AND THEIR ROLES IN PROTEOMICS RESEARCH Wenguang Shao 1,2 and Henry Lam 2,3 * 1 Department of Biology, Institute of Molecular Systems Biology, Eidgen ossische

More information

Compounding insights Thermo Scientific Compound Discoverer Software

Compounding insights Thermo Scientific Compound Discoverer Software Compounding insights Thermo Scientific Compound Discoverer Software Integrated, complete, toolset solves small-molecule analysis challenges Thermo Scientific Orbitrap mass spectrometers produce information-rich

More information

High-Throughput Protein Quantitation Using Multiple Reaction Monitoring

High-Throughput Protein Quantitation Using Multiple Reaction Monitoring High-Throughput Protein Quantitation Using Multiple Reaction Monitoring Application Note Authors Ning Tang, Christine Miller, Joe Roark, Norton Kitagawa and Keith Waddell Agilent Technologies, Inc. Santa

More information

Electrospray ionization mass spectrometry (ESI-

Electrospray ionization mass spectrometry (ESI- Automated Charge State Determination of Complex Isotope-Resolved Mass Spectra by Peak-Target Fourier Transform Li Chen a and Yee Leng Yap b a Bioinformatics Institute, 30 Biopolis Street, Singapore b Davos

More information

Towards Detecting Protein Complexes from Protein Interaction Data

Towards Detecting Protein Complexes from Protein Interaction Data Towards Detecting Protein Complexes from Protein Interaction Data Pengjun Pei 1 and Aidong Zhang 1 Department of Computer Science and Engineering State University of New York at Buffalo Buffalo NY 14260,

More information

A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra

A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores D.C. Anderson*, Weiqun Li, Donald G. Payan,

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering Jiří Novák, David Hoksza, Jakub Lokoč, and Tomáš Skopal Siret Research Group, Faculty of Mathematics and Physics, Charles

More information

Analysis of Labeled and Non-Labeled Proteomic Data Using Progenesis QI for Proteomics

Analysis of Labeled and Non-Labeled Proteomic Data Using Progenesis QI for Proteomics Analysis of Labeled and Non-Labeled Proteomic Data Using Progenesis QI for Proteomics Lee Gethings, Gushinder Atwal, Martin Palmer, Chris Hughes, Hans Vissers, and James Langridge Waters Corporation, Wilmslow,

More information

Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry

Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry Steve M. M. Sweet,,# Andrew W. Jones, Debbie L. Cunningham, John K. Heath, Andrew J. Creese,

More information

Computational Structural Bioinformatics

Computational Structural Bioinformatics Computational Structural Bioinformatics ECS129 Instructor: Patrice Koehl http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129 koehl@cs.ucdavis.edu Learning curve Math / CS Biology/ Chemistry Pre-requisite

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

Data Warehousing & Data Mining

Data Warehousing & Data Mining 13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.

More information

ASCQ_ME: a new engine for peptide mass fingerprint directly from mass spectrum without mass list extraction

ASCQ_ME: a new engine for peptide mass fingerprint directly from mass spectrum without mass list extraction ASCQ_ME: a new engine for peptide mass fingerprint directly from mass spectrum without mass list extraction Jean-Charles BOISSON1, Laetitia JOURDAN1, El-Ghazali TALBI1, Cécile CREN-OLIVE2 et Christian

More information

Chemometrics. 1. Find an important subset of the original variables.

Chemometrics. 1. Find an important subset of the original variables. Chemistry 311 2003-01-13 1 Chemometrics Chemometrics: Mathematical, statistical, graphical or symbolic methods to improve the understanding of chemical information. or The science of relating measurements

More information

Information Dependent Acquisition (IDA) 1

Information Dependent Acquisition (IDA) 1 Information Dependent Acquisition (IDA) Information Dependent Acquisition (IDA) enables on the fly acquisition of MS/MS spectra during a chromatographic run. Analyst Software IDA is optimized to generate

More information

MANUAL for GLORIA light curve demonstrator experiment test interface implementation

MANUAL for GLORIA light curve demonstrator experiment test interface implementation GLORIA is funded by the European Union 7th Framework Programme (FP7/2007-2013) under grant agreement n 283783 MANUAL for GLORIA light curve demonstrator experiment test interface implementation Version:

More information

MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines

MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines Article Subscriber access provided by University of Texas Libraries MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines Taejoon Kwon, Hyungwon

More information

The AAFCO Proficiency Testing Program Statistics and Reporting

The AAFCO Proficiency Testing Program Statistics and Reporting The AAFCO Proficiency Testing Program Statistics and Reporting Program Chair: Dr. Victoria Siegel Statistics and Reports: Dr. Andrew Crawford Contents Program Model Data Prescreening Calculating Robust

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

What s New in NIST11 (April 3, 2011)

What s New in NIST11 (April 3, 2011) What s New in NIST11 (April 3, 2011) NIST11 consists of the 2011 release of the NIST/EPA/NIH Electron Ionization (EI) Mass Spectral Database, the NIST MS/MS Database, and the NIST GC Methods and Retention

More information

Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were

Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were developed to allow the analysis of large intact (bigger than

More information

Agilent MassHunter Profinder: Solving the Challenge of Isotopologue Extraction for Qualitative Flux Analysis

Agilent MassHunter Profinder: Solving the Challenge of Isotopologue Extraction for Qualitative Flux Analysis Agilent MassHunter Profinder: Solving the Challenge of Isotopologue Extraction for Qualitative Flux Analysis Technical Overview Introduction Metabolomics studies measure the relative abundance of metabolites

More information