Overview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database

Overview - MS Proteomics in One Slide
Obtain protein. Digest into peptides. Acquire spectra in the mass spectrometer: MS gives the masses of peptides, MS/MS gives the fragments of a peptide. Match to a sequence database. Results!

But it's more complex than that
Most things are more difficult than for genomics / transcriptomics.
No amplification: we only ever lose signal.
Can't sequence peptides as sensitively as DNA/RNA.
More complications: peptides can be modified in many, many ways.
Mapping spectra -> peptides -> proteins is not as easy as mapping reads to transcripts.
All of this has implications for quantification.

Data Analysis Challenges & Solutions
Let's look at data analysis, using an example experiment with control cells and treated cells. We have a cancer cell line. We treated it with secret compound Z. We want to know what effect Z has on the proteome of the cells.
What proteins are in the samples?
Which proteins significantly change in amount between the samples?
This will be a discovery, shotgun proteomics experiment.

Peptide LC-MS
[Figure: mass spectrometry protocol, with an optional separation step. Source: https://commons.wikimedia.org/wiki/file:mass_spectrometry_protocol.png]

An LC-MS System (Oxford TDI Proteomics Core)

[Figure source: https://commons.wikimedia.org/wiki/file:mass_spectrometry_protocol.png]

MS/MS Fragmentation

A Real Spectrum
Peptide from keratin, a common contaminant!

Great, what do we get out of the machine?
15 RAW data files in vendor format (5 x triplicate runs).
Approx. 30 GB of raw data.
500,000 to 1,000,000 spectra.

Formats, formats, formats
Vendor raw data formats: Thermo .RAW, Agilent .d, Waters .RAW, AB Sciex .wiff, Bruker .yep / .baf.
Converted by: vendor software, academic converters, bespoke scripts.
Fragment spectra peak lists: PKL, DTA, MGF, MS2.
Generic flexible formats: mzML, mzXML, mzData.

Converters and BioHPC
BioHPC cannot install license-restricted mass-spec vendor software, so we need to use open formats. To run protein ID analyses on BioHPC we recommend obtaining MGF format data. Ask your proteomics core, or use ProteoWizard: http://proteowizard.sourceforge.net/
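
To make the MGF recommendation concrete, here is a minimal sketch of reading MGF peak lists in Python. It assumes the common layout of BEGIN IONS / END IONS blocks containing KEY=value headers followed by "m/z intensity" lines. The function name read_mgf and the example spectrum values are made up for illustration; real files may carry extra fields that this sketch does not handle.

```python
from typing import Dict, Iterable, Iterator


def read_mgf(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per spectrum from an iterable of MGF text lines."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # Peak line: m/z followed by an optional intensity
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                spectrum["peaks"].append((mz, intensity))


# Illustrative spectrum only; the numbers are invented.
example = """BEGIN IONS
TITLE=example scan 1
PEPMASS=571.79 12345.0
CHARGE=2+
147.11 100.0
204.13 250.5
END IONS
""".splitlines()

spectra = list(read_mgf(example))
print(len(spectra), spectra[0]["params"]["CHARGE"], spectra[0]["peaks"])
```

For a file on disk, pass an open file handle instead of the example list, e.g. read_mgf(open(path)) with a path of your own.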

Peptide Identification
Identify peptides by matching experimental spectra to theoretical ones from a protein sequence database.
[Figure: an input spectrum and a sequence database (e.g. UniProtKB) go into a database search engine, which returns a peptide, HQGVMVGMGQK, with Score: 46. This is a Peptide to Spectrum Match (PSM).]
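
The matching idea can be sketched in a few lines of Python: compute the theoretical singly charged b and y fragment m/z values for a candidate peptide, then count how many fall near an observed peak. This is only an illustration, not how any particular search engine scores. The function names and the 0.02 m/z tolerance are assumptions, modifications and higher charge states are ignored, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


def count_matches(peptide, observed_mz, tol=0.02):
    """Count theoretical fragments with an observed peak within tol m/z."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        1
        for theo in b_ions + y_ions
        if any(abs(theo - obs) <= tol for obs in observed_mz)
    )


# The peptide from the slide, against three made-up observed peaks.
b_ions, y_ions = fragment_mz("HQGVMVGMGQK")
n_matched = count_matches("HQGVMVGMGQK", [147.113, 138.066, 500.0])
print(round(y_ions[0], 4), n_matched)
```

A real search engine turns such matches into a score that also accounts for peak intensities and the chance of random matches.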

SearchGUI
How to run searches easily? There are many tools, all with their own command line and parameter formats. Use CompOmics SearchGUI, installed on BioHPC.
Installed search engines: X! Tandem, MS-GF+, OMSSA, Comet.
http://compomics.github.io/projects/searchgui.html

PSM Scoring - Making scores more meaningful
Mascot Score = 46. Xcorr = 2.43. HyperScore = 0.9844.
Every search engine uses a different scoring algorithm. Rules for calling a good ID have evolved, but may not be based on good evidence. It is very hard to compare or combine results. Can we transform the scores into something more useful?

PSM Re-Scoring
Take the score(s) from the search engine. Map them to a standard scale. Fit distributions for true and false IDs. Obtain the probability that an ID with score x is correct.
Keller, A., Nesvizhskii, A. I., Kolker, E., & Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry, 74(20), 5383-5392.
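
The last step, going from two fitted distributions to a probability, is Bayes' rule. The sketch below assumes both score distributions are Gaussian and that their parameters and the prior fraction of correct IDs are already known. Those are simplifications made here for illustration: in practice the distributions are fitted from the data, and the cited method does not have to use these shapes. All numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, prior_correct, correct, incorrect):
    """P(ID is correct | score) for a two-component mixture.

    correct and incorrect are (mean, sd) tuples of the assumed
    score distributions for true and false IDs.
    """
    p_true = prior_correct * normal_pdf(score, *correct)
    p_false = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return p_true / (p_true + p_false)


# Hypothetical parameters: false IDs centred on 0, true IDs on 4.
for score in (0.0, 2.0, 4.0):
    p = posterior_correct(score, 0.5, correct=(4.0, 1.0), incorrect=(0.0, 1.0))
    print(score, round(p, 3))
```

Once every search engine's score is converted this way, results sit on a common probability scale and become comparable.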

Target-Decoy Method
Are our probabilities really accurate? Search real sequences (TARGETS) together with fake sequences (DECOYS).
A correct match is always to a real sequence.
An incorrect, random match is equally likely to hit a target or a decoy sequence.
So we can estimate the number of incorrect target matches by counting the decoy matches. This gives an empirical estimate of the False Discovery Rate (FDR).
Elias, J. E., & Gygi, S. P. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods, 4(3), 207-214.
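
The counting argument translates directly into code. This minimal sketch assumes each PSM is a (score, is_decoy) pair where a higher score is better, and estimates the FDR above a score threshold as decoy matches divided by target matches. The function name and the example PSMs are made up for illustration.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among target PSMs scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs. Each decoy hit stands in
    for one incorrect target hit, so FDR is estimated as decoys / targets.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Made-up PSMs: three targets and one decoy score 7 or higher.
psms = [(10, False), (9, False), (8, True), (7, False), (3, True)]
print(estimate_fdr(psms, 7))
print(estimate_fdr(psms, 9))
```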

Filtering PSMs
A large experiment produces > 100,000 PSMs. There is no way to manually inspect each one! We usually report PSMs filtered to a specific False Discovery Rate; 1% FDR is most common. Matches with post-translational modifications require special treatment.
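
One common way to filter to a chosen FDR is to rank PSMs by score, track the running decoy/target ratio, and keep every target above the lowest cutoff that still meets the limit. The sketch below assumes (score, is_decoy) pairs with higher scores being better. The function name is hypothetical, and tied scores and the special treatment of modified peptides are not handled.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Return target PSMs that pass the given FDR limit.

    psms is a list of (score, is_decoy) pairs.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    running_fdr = []
    targets = decoys = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        running_fdr.append(decoys / max(targets, 1))
    # Convert to q-values: the lowest FDR at this cutoff or any looser one.
    q_values = running_fdr[:]
    for i in range(len(q_values) - 2, -1, -1):
        q_values[i] = min(q_values[i], q_values[i + 1])
    return [
        psm for psm, q in zip(ranked, q_values)
        if q <= max_fdr and not psm[1]
    ]


# Made-up PSMs: four strong targets, then a decoy, then a weak target.
psms = [(10, False), (9, False), (8, False), (7, False), (6, True), (5, False)]
print(filter_at_fdr(psms, max_fdr=0.1))
print(filter_at_fdr(psms, max_fdr=0.25))
```

With a 10% limit the weak target below the decoy is dropped; with a 25% limit it is kept.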

PeptideShaker
How to do all the combination and filtering easily? Use CompOmics PeptideShaker, installed on BioHPC.
http://compomics.github.io/projects/peptide-shaker.html

Protein Inference I
We don't identify proteins, we only identify peptides! Peptides could come from one or more proteins. How do we resolve this?
[Figure: peptides A to E mapped to proteins 1 to 5. Protein 1 is marked "Present" because peptide A is uniquely assigned to it. Proteins for which all peptides are shared are marked "???". Other proteins are labelled "Not Present" or "Present".]

Protein Inference II
(Very) naïve rules: e.g. if a protein is identified with 2 unique peptides it is present.
Parsimony: the smallest list of proteins that can explain the peptides identified is the most likely. Minimal set cover / minimal partial set cover etc.
Bayesian models: consider the probabilities that proteins produced the peptides and spectra. Prior information: the probability that peptide x can be observed by MS etc.
Correlation with other sources: e.g. did RNA-Seq on the same sample find mRNA for the protein?
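
The parsimony idea can be sketched with a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. Greedy selection is a standard approximation, not guaranteed to find the true minimum set, and it is not claimed to be what any specific tool does. The function name and the protein-to-peptide mapping are hypothetical.

```python
def greedy_parsimony(protein_to_peptides):
    """Pick a small set of proteins that explains every observed peptide.

    protein_to_peptides maps a protein name to a set of peptide names.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Sorting the names first makes tie-breaking deterministic.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Made-up mapping: P2's peptides are all shared with P3.
mapping = {
    "P1": {"A"},
    "P2": {"B", "C"},
    "P3": {"B", "C", "D"},
    "P4": {"D", "E"},
}
print(greedy_parsimony(mapping))
```

Here P2 is left out because P3 already explains its peptides, while P1 and P4 are each needed for a peptide that nothing else explains.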

Protein Scoring - ProteinProphet
Start with a list of peptides and their identification probabilities.
Map peptides to all possible proteins that contain them.
Group proteins that can't be distinguished (no unique peptides).
Adjust peptide probabilities based on the number of siblings.
Assign weights for shared peptides to each protein containing them.
Compute the protein probability, assuming peptide IDs are independent events:

$$\mathrm{ProteinProb}_i = 1 - \prod_{j=1}^{N} \left(1 - \mathrm{Weight}_{i,j}\,\mathrm{PeptideProb}_j\right)$$

Protein Probability = Probability at least one of the peptide IDs was correct = 1 - Probability all of the IDs were wrong.
Nesvizhskii, A. I., Keller, A., Kolker, E., & Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Analytical chemistry, 75(17), 4646-4658.
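
The final step, combining peptide probabilities into a protein probability, is short enough to write out. This sketch covers only that step: one minus the product over peptides of (1 - weight x peptide probability). It assumes the weights and adjusted peptide probabilities have already been worked out. The function name and numbers are illustrative.

```python
def protein_probability(peptide_probs, weights):
    """Probability that at least one weighted peptide ID is correct."""
    p_all_wrong = 1.0
    for prob, weight in zip(peptide_probs, weights):
        p_all_wrong *= 1.0 - weight * prob
    return 1.0 - p_all_wrong


# Two unique peptides (weight 1) with probabilities 0.9 and 0.8.
print(round(protein_probability([0.9, 0.8], [1.0, 1.0]), 4))
# One peptide shared equally with another protein (weight 0.5).
print(round(protein_probability([0.9], [0.5]), 4))
```

Two moderately confident unique peptides give a protein probability of 0.98, while a single shared peptide contributes only its weighted share.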

Quantification - Spectral Counts
Now we have protein IDs we could do quantification by counting the number of PSMs assigned to each protein. The number of PSMs produced by a protein is proportional to its abundance, BUT:
Longer proteins generate more peptides = more PSMs.
Some proteins are just difficult = fewer PSMs than expected.
We can compare spectral counts of the same protein between samples, or normalize by length, Mw, expected number of peptides etc.
Not good for less-abundant proteins: low spectral counts = poor comparisons. 1 v 2 is not as accurate as 10 v 20.
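
As an example of normalizing by length, the sketch below divides each protein's spectral count by its length and rescales so the values sum to 1 within a sample. This form of length normalization is widely known as NSAF; it is swapped in here as one option and is not named on the slide. The protein names, counts and lengths are made up.

```python
def length_normalized_counts(spectral_counts, lengths):
    """Length-normalized spectral counts that sum to 1 across proteins.

    spectral_counts and lengths are dicts keyed by protein name.
    """
    per_residue = {
        prot: spectral_counts[prot] / lengths[prot] for prot in spectral_counts
    }
    total = sum(per_residue.values())
    return {prot: value / total for prot, value in per_residue.items()}


# Same PSM count, but protein B is four times longer than protein A.
counts = {"A": 10, "B": 10}
lengths = {"A": 100, "B": 400}
normalized = length_normalized_counts(counts, lengths)
print(normalized)
```

After normalization protein A gets four times the share of protein B, reflecting that the same count from a shorter protein suggests higher abundance.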

Astrocyte CompOmics Protein ID Workflow
BioHPC provides a simple workflow to:
Identify peptides with 3 search engines.
Combine the results.
Perform target-decoy validation.
Export reports.
Download the project for inspection in the PeptideShaker GUI.
Uses CompOmics tools, runs on the BioHPC Nucleus cluster.
See https://astrocyte.biohpc.swmed.edu/brand/biohpc

The Example Experiment
We want to do an exhaustive and accurate comparison, so we will use SILAC quantification and fractionate our samples at the peptide level.
[Figure: control cells (normal growth medium) and treated cells (heavy growth medium) are each lysed, then mixed, digested and fractionated, giving 5 MS runs. Repeat in triplicate.]

Quantification - SILAC I
[Figure: control cells in normal growth medium, treated cells in heavy growth medium.]
We always see each peptide twice, in heavy and light forms. Protein ratio = peptide heavy to light ratio. So, let's find ratios for all peptides in our MS data.

Quantification - SILAC II
[Figure: light and heavy peaks of a SILAC pair, plotted against m/z and time.]
Find SILAC pair signatures in the MS run. Many scans through time see the same SILAC pair as the peptide elutes. Extract the intensity of light and heavy at each scan through time.
Peptide Ratio = Slope, OR Peptide Ratio = Ratio of the areas under the curves.
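
Both ways of getting a peptide ratio can be sketched from the extracted intensities. The sketch assumes "slope" means the slope of heavy against light intensity across scans, fitted through the origin, and that areas are computed with the trapezoid rule. Both are choices made here for illustration, and the elution profile is invented.

```python
def ratio_by_slope(light, heavy):
    """Least-squares slope of heavy vs light intensity, through the origin."""
    return sum(l * h for l, h in zip(light, heavy)) / sum(l * l for l in light)


def trapezoid_area(times, intensities):
    """Area under an intensity-vs-time curve using the trapezoid rule."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (intensities[i] + intensities[i - 1]) / 2.0
    return area


def ratio_by_area(times, light, heavy):
    """Heavy/light ratio of the areas under the two elution curves."""
    return trapezoid_area(times, heavy) / trapezoid_area(times, light)


# Made-up elution profile: heavy is exactly twice light at every scan.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [1.0, 4.0, 9.0, 4.0, 1.0]
heavy = [2.0, 8.0, 18.0, 8.0, 2.0]
print(ratio_by_slope(light, heavy), ratio_by_area(times, light, heavy))
```

On clean data the two estimates agree; on noisy data they differ because the slope fit gives more influence to the high-intensity scans.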

SILAC Pair Finding From: Cox, J., & Mann, M. (2008). MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature biotechnology, 26(12), 1367-1372.

SILAC - Protein Level
In SILAC each peptide gives a ratio estimate for the protein(s) it originates from. Ratios from multiple peptides can be combined into a protein ratio in various ways:
Simple mean / median of peptide ratios.
Weighted mean / median, where more abundant peptides contribute more.
Find & discard outliers.
Plot H vs L peptide areas and perform linear regression.
Multiple peptide observations allow an estimate of the protein ratio error via the variability between peptide ratios.
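
The simplest combination, a median with a spread estimate, looks like this. The sketch assumes ratios are combined on the log2 scale, a common choice that the slide does not specify, and it reports the standard deviation of the log2 peptide ratios as a rough error estimate. The function name and ratios are made up.

```python
import math
import statistics


def protein_ratio(peptide_ratios):
    """Return (median ratio, s.d. of log2 peptide ratios) for one protein."""
    logs = [math.log2(ratio) for ratio in peptide_ratios]
    median_log = statistics.median(logs)
    # The spread needs at least two peptides; otherwise report NaN.
    spread = statistics.stdev(logs) if len(logs) > 1 else float("nan")
    return 2.0 ** median_log, spread


# Three made-up peptide H/L ratios for one protein.
ratio, spread = protein_ratio([1.0, 2.0, 4.0])
print(ratio, spread)
```

The median is less sensitive to a single outlying peptide than the mean, which is one reason it is a popular default.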

MaxQuant
Installed in the BioHPC windcv session (Windows only). Uses vendor-specific input files (Thermo RAW). Identification & quantitation in one package. Good for SILAC experiments, not recommended for non-quantitative work.

Significance Analysis I
Once we have the protein quantitation we can look for meaningful differences between samples. At this point proteomics data is much like other datasets. You can apply techniques from e.g. microarray analysis to proteomics data.
Proteomics has a poor reputation for statistical rigor: e.g. many people consider that in a 1:1 mixture the log2 protein ratios are normally distributed, so things beyond 2 s.d. are interesting changes.
No! The variance of proteomics measurements is highly dependent on intensity.

Significance Analysis II
To make well-grounded decisions we must model the variance of the measurements, which depends on intensity.
[Figure: log intensity plotted against log ratio. Points A and B have the same ratio between samples but sit at different intensities. A changes significantly; B does not change significantly.]
Can use microarray-focused packages, such as LIMMA or plgem for R.
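
A crude way to see the effect of intensity-dependent variance is to group proteins into intensity bins and score each log ratio against the spread within its own bin. This is only a sketch of the idea, not what LIMMA or plgem do. The function name, the bin size and the data are hypothetical.

```python
import statistics


def intensity_binned_z(log_ratios, log_intensities, bin_size=50):
    """Z-score each log ratio against proteins of similar intensity."""
    order = sorted(range(len(log_ratios)), key=lambda i: log_intensities[i])
    z_scores = [0.0] * len(log_ratios)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        values = [log_ratios[i] for i in members]
        if len(values) < 2:
            continue  # cannot estimate a spread from one protein
        center = statistics.median(values)
        spread = statistics.stdev(values)
        for i in members:
            z_scores[i] = (log_ratios[i] - center) / spread if spread > 0 else 0.0
    return z_scores


# Made-up data: ratios are noisy at low intensity and tight at high intensity.
log_ratios = [-2.0, -1.0, 0.0, 1.0, 2.0, -0.2, -0.1, 0.0, 0.1, 1.0]
log_intensities = [10, 11, 12, 13, 14, 25, 26, 27, 28, 29]
z = intensity_binned_z(log_ratios, log_intensities, bin_size=5)
print(round(z[3], 2), round(z[9], 2))
```

Both scored proteins have a log ratio of 1.0, but only the high-intensity one stands out from its neighbours, which mirrors the A versus B comparison on the slide.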

Introductory Web Resources
Proteome Software Wiki: http://proteome-software.wikispaces.com/proteomics and http://proteome-software.wikispaces.com/bioinformatics
CompOmics Tutorials: https://compomics.com/bioinformatics-for-proteomics/
Steen & Steen Lab @ Harvard: http://www.childrenshospital.org/cfapps/research/data_admin/site602/mainpages602p0.html

What do you want to do on BioHPC?
We have installed various software. Is there anything else you need? What analyses do you want to do on BioHPC?
File conversion: Proteowizard / msconvert (windcv).
Peptide identification search engines: X!Tandem, OMSSA, MSGF+, Comet, SearchGUI.
Postprocessing tools: PeptideShaker, Trans-Proteomics Pipeline.
Quantitative proteomics: MaxQuant (windcv).
Downstream statistics: PeptideShaker, Perseus (windcv), R, Python.