
Overview - MS Proteomics in One Slide. Obtain protein. Digest into peptides. Acquire spectra in the mass spectrometer: MS gives the masses of peptides, MS/MS gives the fragments of a peptide. Match to a sequence database. Results!

But it's more complex than that. Most things are more difficult than for genomics / transcriptomics. No amplification: we only ever lose signal. We can't sequence peptides as sensitively as DNA/RNA. More complications: peptides can be modified in many, many ways. Mapping spectra -> peptides -> proteins is not as easy as mapping reads to transcripts. All of this has implications for quantification.

Data Analysis Challenges & Solutions. Let's look at data analysis, using an example experiment with control cells and treated cells. We have a cancer cell line. We treated it with secret compound Z. We want to know what effect Z has on the proteome of the cells. What proteins are in the samples? Which proteins significantly change in amount between the samples? This will be a discovery, shotgun proteomics experiment.

Peptide LC-MS. Optional separation. Figure: https://commons.wikimedia.org/wiki/file:mass_spectrometry_protocol.png

An LC-MS System. Oxford TDI Proteomics Core.

Figure: https://commons.wikimedia.org/wiki/file:mass_spectrometry_protocol.png

MS/MS Fragmentation

A Real Spectrum. Peptide from keratin, a common contaminant!

Great, what do we get out of the machine? 15 RAW data files in vendor format (5 x triplicate runs). Approx. 30 GB of raw data. 500,000 to 1,000,000 spectra.

Formats, formats, formats. Vendor raw data formats: Thermo .RAW, Agilent .d, Waters .RAW, AB Sciex .wiff, Bruker .yep / .baf. These are converted by vendor software, academic converters, or bespoke scripts. Fragment spectra peak lists: PKL, DTA, MGF, MS2. Generic flexible formats: mzML, mzXML, mzData.

Converters and BioHPC. BioHPC cannot install license-restricted mass-spec vendor software, so we need to use open formats. To run protein ID analyses on BioHPC we recommend obtaining MGF format data. Ask your proteomics core, or use ProteoWizard: http://proteowizard.sourceforge.net/
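To give a feel for what an MGF file contains, here is a minimal sketch of a reader in Python. It assumes the common layout of `BEGIN IONS` / `END IONS` blocks holding `KEY=value` header lines followed by whitespace-separated m/z and intensity pairs. The function name `read_mgf` and the example spectrum are made up for illustration, and real files may carry extra columns or headers that this sketch does not handle.

```python
from io import StringIO


def read_mgf(handle):
    """Yield one dict per spectrum from an MGF-formatted text stream."""
    spectrum = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as PEPMASS=586.29
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z then intensity
                fields = line.split()
                spectrum["peaks"].append((float(fields[0]), float(fields[1])))


# Made-up example spectrum, for illustration only.
example_mgf = """BEGIN IONS
TITLE=example spectrum (made up)
PEPMASS=586.29
CHARGE=2+
175.119 1200.0
304.161 850.5
END IONS
"""

spectra = list(read_mgf(StringIO(example_mgf)))
print(len(spectra), spectra[0]["params"]["PEPMASS"], spectra[0]["peaks"])
```

For real analyses a maintained parser is a better choice than a hand-written one; the point here is only that MGF is plain text and easy to inspect.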

Peptide Identification. Identify peptides by matching experimental spectra to theoretical ones from a protein sequence database. The database search engine takes an input spectrum and a sequence database (e.g. UniProtKB) and returns a Peptide to Spectrum Match (PSM), for example HQGVMVGMGQK with Score: 46.
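The matching idea can be sketched in a few lines of Python. This is a toy illustration, not how any particular search engine scores: it computes singly charged b- and y-ion m/z values for an unmodified peptide from monoisotopic residue masses, then counts how many theoretical fragments have an observed peak within a tolerance. The function names, the 0.02 tolerance, and the observed peak list are assumptions made for the example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mz(peptide):
    """Return (b_ions, y_ions) as singly charged m/z lists."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def count_matches(observed_mz, peptide, tolerance=0.02):
    """Count theoretical fragments with an observed peak within tolerance."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(obs - theo) <= tolerance for obs in observed_mz)
        for theo in b_ions + y_ions
    )


# Made-up observed peaks: three sit on fragments of the peptide, one does not.
observed = [138.066, 147.113, 275.171, 500.0]
matched = count_matches(observed, "HQGVMVGMGQK")
print(matched)
```

Real search engines go further: they consider modifications, multiple charge states, and peak intensities, and they turn the match into an engine-specific score.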

SearchGUI. How to run searches easily? There are many tools, all with their own command line and parameter formats. Use CompOmics SearchGUI, installed on BioHPC. Installed search engines: X! Tandem, MS-GF+, OMSSA, Comet. http://compomics.github.io/projects/searchgui.html

PSM Scoring: making scores more meaningful. Mascot Score = 46. Xcorr = 2.43. HyperScore = 0.9844. Every search engine uses a different scoring algorithm. Rules for calling a good ID have evolved, but may not be based on good evidence. It is very hard to compare or combine results. Can we transform them into something more useful?

PSM Re-Scoring. Take the score(s) from the search engine. Map them to a standard scale. Fit distributions for true and false IDs. Obtain the probability that an ID with score x is correct. Keller, A., Nesvizhskii, A. I., Kolker, E., & Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 5383-5392.
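The last step, going from two fitted distributions to a probability, is an application of Bayes' rule. The sketch below assumes both distributions have already been fitted and, purely for simplicity, that both are Gaussian. The parameter values and the function names are invented for the example and are not taken from the cited paper.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_correct(score, prior_correct, true_params, false_params):
    """P(ID is correct | score) from a two-component mixture.

    true_params and false_params are (mean, sd) tuples of the fitted
    score distributions for correct and incorrect IDs.
    """
    p_true = prior_correct * normal_pdf(score, *true_params)
    p_false = (1 - prior_correct) * normal_pdf(score, *false_params)
    return p_true / (p_true + p_false)


# Hypothetical fitted parameters on a standardized score scale.
TRUE_PARAMS = (4.0, 1.0)
FALSE_PARAMS = (0.0, 1.0)

for score in (0.0, 2.0, 4.0):
    prob = posterior_correct(score, 0.5, TRUE_PARAMS, FALSE_PARAMS)
    print(score, round(prob, 4))
```

A score halfway between the two distributions gets a probability of 0.5 when the prior is 0.5, and the probability rises towards 1 as the score moves into the region dominated by correct IDs.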

Target-Decoy Method. Are our probabilities really accurate? Search against real sequences (TARGETS) and fake sequences (DECOYS). A correct match is always to a real sequence. An incorrect random match is equally likely to be to a target or a decoy sequence. Estimate the number of incorrect target matches by counting the decoy matches. This gives an empirical estimate of the False Discovery Rate (FDR). Elias, J. E., & Gygi, S. P. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3), 207-214.
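The counting argument translates directly into code. This is a minimal sketch assuming each PSM is a `(score, is_decoy)` tuple, that higher scores are better, and that the target and decoy databases are the same size. The function name and example scores are made up.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among target PSMs scoring at or above threshold.

    psms is a list of (score, is_decoy) tuples. Each decoy hit above the
    threshold stands in for one expected incorrect target hit.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Made-up PSM list: (score, is_decoy).
example_psms = [
    (50, False), (45, False), (40, False), (38, True), (35, False),
    (30, False), (28, True), (25, False), (20, True), (15, False),
]

print(estimate_fdr(example_psms, 30))
print(estimate_fdr(example_psms, 40))
```

Raising the threshold lowers the estimated FDR but also discards correct matches, which is the trade-off behind choosing a reporting cutoff.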

Filtering PSMs. A large experiment produces > 100,000 PSMs. No way to manually inspect each one! We usually report PSMs filtered to a specific False Discovery Rate; 1% FDR is most common. Matches with post-translational modifications require special treatment.
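One way to apply such a filter is to walk down the score-ranked list, track the decoy-to-target ratio as an FDR estimate, and keep the longest top-scoring prefix that stays within the limit. The sketch below assumes `(score, is_decoy)` tuples with higher scores better, and it ignores tied scores. The function name and the synthetic data are illustrative, not part of any tool mentioned here.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Return target PSMs from the largest top-ranked set with FDR <= max_fdr."""
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for index, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR of the top `index` PSMs is decoys / targets.
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = index
    return [psm for psm in ranked[:best_cut] if not psm[1]]


# Synthetic example: mostly targets at high scores, decoys creeping in lower down.
synthetic = (
    [(1000 - i, False) for i in range(150)]
    + [(800, True)]
    + [(700 - i, False) for i in range(50)]
    + [(600, True), (599, True)]
    + [(500, False)]
)

kept = filter_at_fdr(synthetic, max_fdr=0.01)
print(len(kept), min(score for score, _ in kept))
```

Keeping the last position that satisfies the limit, rather than stopping at the first violation, lets a run of good targets further down the list rescue an early decoy.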

PeptideShaker. How to do all the combination and filtering easily? Use CompOmics PeptideShaker, installed on BioHPC. http://compomics.github.io/projects/peptide-shaker.html

Protein Inference I. We don't identify proteins, we only identify peptides! Peptides could come from one or more proteins: how do we resolve this? Diagram: peptides A to E mapped to proteins 1 to 5. Protein 1 is present because peptide A is uniquely assigned to it. Where all of a protein's peptides are shared, the protein may be present or not present.

Protein Inference II. (Very) naïve rules: e.g. if a protein is identified with 2 unique peptides it is present. Parsimony: the smallest list of proteins that can explain the peptides identified is the most likely. Minimal set cover / minimal partial set cover etc. Bayesian models: consider the probabilities that proteins produced peptides and spectra. Prior information: the probability that peptide x can be observed by MS etc. Correlation with other sources: e.g. did RNA-Seq on the same sample find mRNA for the protein?
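The parsimony idea can be illustrated with a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. Greedy selection only approximates the true minimal set cover, and the function name and the protein-to-peptide mapping below are invented for the example.

```python
def greedy_parsimony(protein_to_peptides):
    """Pick proteins greedily until every identified peptide is explained.

    protein_to_peptides maps a protein name to the set of identified
    peptides it contains. Ties are broken by protein name.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical mapping of five proteins to five peptides.
example_map = {
    "P1": {"A", "B"},
    "P2": {"B", "C", "D"},
    "P3": {"C"},
    "P4": {"D", "E"},
    "P5": {"E"},
}

print(greedy_parsimony(example_map))
```

In this example P3 and P5 are never needed, because every peptide they contain is already explained by another chosen protein. Note that the choice between P4 and P5 for peptide E is settled only by the tie-break, which is exactly the ambiguity that protein grouping tries to make explicit.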

Protein Scoring - ProteinProphet. Start with a list of peptides and their identification probabilities. Map peptides to all possible proteins that contain them. Group proteins that can't be distinguished (no unique peptides). Adjust peptide probabilities based on the number of siblings. Assign weights for shared peptides to each protein containing them. Compute the protein probability assuming peptide IDs are independent events:

$$\mathrm{ProteinProb}_i = 1 - \prod_{j=1}^{N} \left(1 - \mathrm{Weight}_{i,j}\,\mathrm{PeptideProb}_j\right)$$

Protein Probability = Probability at least one of the peptide IDs was correct = 1 - Probability all of the IDs were wrong. Nesvizhskii, A. I., Keller, A., Kolker, E., & Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry, 75(17), 4646-4658.
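As a small worked example of the final step only, the sketch below computes one minus the product of the per-peptide "wrong" probabilities. The weights and peptide probabilities are made up, the function name is hypothetical, and the earlier steps (sibling adjustment, weight assignment) are assumed to have been done already.

```python
def protein_probability(weighted_peptides):
    """Probability that at least one peptide ID for the protein is correct.

    weighted_peptides is a list of (weight, peptide_probability) tuples,
    and peptide IDs are treated as independent events.
    """
    p_all_wrong = 1.0
    for weight, peptide_prob in weighted_peptides:
        p_all_wrong *= 1.0 - weight * peptide_prob
    return 1.0 - p_all_wrong


# Made-up example: two unique peptides (weight 1.0) and one shared (weight 0.5).
example_peptides = [(1.0, 0.9), (0.5, 0.8), (1.0, 0.3)]
print(round(protein_probability(example_peptides), 3))
```

Here the probability that all three IDs are wrong is 0.1 x 0.6 x 0.7 = 0.042, so the protein probability is 0.958. Even a weak peptide raises the protein probability, which is why the independence assumption matters.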

Quantification: Spectral Counts. Now that we have protein IDs we could do quantification by counting the number of PSMs assigned to each protein. The number of PSMs produced by a protein is proportional to its abundance, BUT longer proteins generate more peptides = more PSMs, and some proteins are just difficult = fewer PSMs than expected. We can compare spectral counts of the same protein, or normalize by length, Mw, expected number of peptides etc. Not good for less-abundant proteins: low spectral counts = poor comparisons. 1 vs 2 is not as accurate as 10 vs 20.
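Normalizing by length is simple to sketch. The example below divides each protein's spectral count by its length and rescales so the values sum to 1 within a sample. The rescaling step, the function name, and the numbers are choices made for this illustration, and the sketch assumes at least one protein has a non-zero count.

```python
def length_normalized_counts(spectral_counts, lengths):
    """Divide spectral counts by protein length, then rescale to sum to 1.

    Both arguments are dicts keyed by protein name. Lengths are in residues.
    """
    per_residue = {
        protein: spectral_counts[protein] / lengths[protein]
        for protein in spectral_counts
    }
    total = sum(per_residue.values())
    return {protein: value / total for protein, value in per_residue.items()}


# Made-up example: the long protein has more PSMs but fewer per residue.
counts = {"short_protein": 10, "long_protein": 20}
protein_lengths = {"short_protein": 100, "long_protein": 400}

normalized = length_normalized_counts(counts, protein_lengths)
print(normalized)
```

The long protein has twice the raw count yet ends up with half the normalized value, which is the length bias described above.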

Astrocyte CompOmics Protein ID Workflow. BioHPC provides a simple workflow to: identify peptides with 3 search engines; combine the results; perform target-decoy validation; export reports; download the project for inspection in the PeptideShaker GUI. Uses CompOmics tools, runs on the BioHPC Nucleus cluster. See https://astrocyte.biohpc.swmed.edu/brand/biohpc

The Example Experiment. We want to do an exhaustive and accurate comparison, so we will use SILAC quantification and fractionate our samples at the peptide level. Control cells are grown in normal growth medium and treated cells in heavy growth medium. Each is lysed, then the lysates are mixed, digested, and fractionated, giving 5 MS runs. REPEAT IN TRIPLICATE.

Quantification: SILAC I. Control cells are grown in normal growth medium, treated cells in heavy growth medium. We always see each peptide twice: heavy and light forms. Protein ratio = peptide Heavy to Light ratio. So, let's find ratios for all peptides in our MS data.

Quantification: SILAC II. Find SILAC pair signatures in the MS run. Many scans through time see the same SILAC pair as the peptide elutes. Extract the intensity of light and heavy at each scan through time. Peptide Ratio = Slope of heavy vs. light intensity, OR Peptide Ratio = Ratio of the areas under the curves. (Figure: light and heavy peaks plotted against m/z and time.)
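Both ratio definitions can be sketched on a single extracted SILAC pair. The example assumes the light and heavy intensities have already been extracted at the same scan times. The slope is fitted by least squares through the origin, which is one possible choice rather than the method of any specific tool, and the function names and elution profile are made up.

```python
def ratio_by_slope(light, heavy):
    """Least-squares slope through the origin of heavy vs. light intensity."""
    return sum(l * h for l, h in zip(light, heavy)) / sum(l * l for l in light)


def trapezoid_area(times, intensities):
    """Area under an intensity trace using the trapezoid rule."""
    return sum(
        (t1 - t0) * (y0 + y1) / 2.0
        for t0, t1, y0, y1 in zip(times, times[1:], intensities, intensities[1:])
    )


def ratio_by_area(times, light, heavy):
    """Heavy to light ratio of the areas under the elution curves."""
    return trapezoid_area(times, heavy) / trapezoid_area(times, light)


# Made-up elution profile: heavy is exactly twice light at every scan.
scan_times = [0.0, 1.0, 2.0, 3.0, 4.0]
light_intensity = [0.0, 10.0, 20.0, 10.0, 0.0]
heavy_intensity = [0.0, 20.0, 40.0, 20.0, 0.0]

print(ratio_by_slope(light_intensity, heavy_intensity))
print(ratio_by_area(scan_times, light_intensity, heavy_intensity))
```

On clean data the two estimates agree. On noisy data they can differ, since the slope weights high-intensity scans more heavily.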

SILAC Pair Finding. From: Cox, J., & Mann, M. (2008). MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12), 1367-1372.

SILAC Protein Level. In SILAC each peptide gives a ratio estimate for the protein(s) it originates from. Ratios from multiple peptides can be combined into a protein ratio in various ways: simple mean / median of peptide ratios; weighted mean / median, where more abundant peptides contribute more; find & discard outliers; plot H vs L peptide areas and perform linear regression. Multiple peptide observations allow an estimate of the protein ratio error via the variability between peptide ratios.
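The simplest of these options, the median, can be sketched together with an error estimate. Working on the log2 scale is an assumption made here so that 2-fold up and 2-fold down are treated symmetrically. The function name and the peptide ratios are made up.

```python
import math
import statistics


def protein_ratio(peptide_ratios):
    """Combine peptide H/L ratios into a protein ratio and a spread.

    Returns (median ratio, standard deviation of the log2 peptide ratios).
    The spread is NaN when only one peptide is available.
    """
    logs = [math.log2(ratio) for ratio in peptide_ratios]
    median_log = statistics.median(logs)
    spread = statistics.stdev(logs) if len(logs) > 1 else float("nan")
    return 2 ** median_log, spread


# Made-up peptide ratios for one protein, including two that disagree.
example_ratios = [2.0, 2.0, 4.0, 1.0]
ratio, log2_spread = protein_ratio(example_ratios)
print(ratio, round(log2_spread, 3))
```

The median is robust to the two disagreeing peptides, while the spread records that the peptide-level evidence for this protein is not very consistent.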

MaxQuant. Installed in BioHPC windcv session (Windows only). Uses vendor-specific input files (Thermo RAW). Identification & quantitation in one package. Good for SILAC experiments, not recommended for non-quantitative work.

Significance Analysis I. Once we have the protein quantitation we can look for meaningful differences between samples. At this point proteomics data is much like other datasets. You can apply techniques from e.g. microarray analysis to proteomics data. Proteomics has a poor reputation for statistical rigor: e.g. many people assume that in a 1:1 mixture the log2 protein ratios are normally distributed, and that anything beyond 2 s.d. is an interesting change. No! The variance of proteomics measurements is highly dependent on intensity.

Significance Analysis II. To make well-grounded decisions we must model the variance of the measurements, which depends on intensity. (Figure: log intensity plotted against log ratio, with points A at high intensity and B at low intensity.) A & B have the same ratios between samples. A changes significantly. B does not change significantly. Can use microarray-focused packages, such as LIMMA or plgem for R.
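To see why A and B can differ in significance at the same ratio, here is a rough Python illustration. It is not LIMMA or plgem: it simply ranks proteins by log intensity, splits them into equal-sized bins, and computes a z-score for each log ratio using the mean and standard deviation of its own bin. The function name, bin size, and synthetic data are all assumptions made for the example.

```python
import statistics


def intensity_binned_z(proteins, bin_size=50):
    """Z-score each log ratio against proteins of similar intensity.

    proteins is a list of (name, log_intensity, log_ratio) tuples.
    Returns a dict mapping name to z-score. The last bin may be smaller.
    """
    ranked = sorted(proteins, key=lambda protein: protein[1])
    z_scores = {}
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        ratios = [protein[2] for protein in chunk]
        center = statistics.mean(ratios)
        spread = statistics.pstdev(ratios)
        for name, _, ratio in chunk:
            z_scores[name] = (ratio - center) / spread if spread > 0 else 0.0
    return z_scores


# Synthetic data: noisy ratios at low intensity, tight ratios at high intensity.
wide = [-1.5, -0.75, 0.0, 0.75, 1.5]
narrow = [-0.2, -0.1, 0.0, 0.1, 0.2]
dataset = [("low%d" % i, 18 + i * 0.01, wide[i % 5]) for i in range(49)]
dataset += [("high%d" % i, 28 + i * 0.01, narrow[i % 5]) for i in range(49)]
dataset += [("A", 28.25, 1.0), ("B", 18.25, 1.0)]  # same log ratio for both

z = intensity_binned_z(dataset, bin_size=50)
print(round(z["A"], 2), round(z["B"], 2))
```

With these made-up numbers A lies far outside the spread of its high-intensity neighbours while B sits well inside the noise of the low-intensity bin, even though both have a log ratio of 1.0. Dedicated packages model the intensity and variance relationship more carefully than fixed bins do.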

Introductory Web Resources. Proteome Software Wiki: http://proteome-software.wikispaces.com/proteomics and http://proteome-software.wikispaces.com/bioinformatics. CompOmics Tutorials: https://compomics.com/bioinformatics-for-proteomics/. Steen & Steen Lab @ Harvard: http://www.childrenshospital.org/cfapps/research/data_admin/site602/mainpages602p0.html

What do you want to do on BioHPC? We have installed various software. Is there anything else you need? What analyses do you want to do on BioHPC? File conversion: ProteoWizard / msconvert (windcv). Peptide identification search engines: X! Tandem, OMSSA, MS-GF+, Comet, SearchGUI. Postprocessing tools: PeptideShaker, Trans-Proteomic Pipeline. Quantitative proteomics: MaxQuant (windcv). Downstream statistics: PeptideShaker, Perseus (windcv), R, Python.