Statistical analysis of isobaric-labeled mass spectrometry data

Farhad Shakeri
July 3, 2018
Core Unit for Bioinformatics Analyses
Institute for Genomic Statistics and Bioinformatics
University Hospital Bonn

Outline
- Proteomics, general overview
- Mass spectrometry-based quantitative proteomics
- Stable isotope labeling mass spectrometry
- Peptide identification
- Data structure
- Data analysis workflow

What is proteomics?
Omics technologies:
- Genomics: what can the cell potentially do?
- Transcriptomics: what is currently being turned on?
- Proteomics: which enzymes are currently active? Which signals are being transduced?
- Metabolomics: what is being produced or consumed?
Definition: The proteome is the entire set of proteins in a given cell, tissue or biological sample, at a precise developmental or cellular phase.
Figure source: http://en.wikipedia.org/wiki/file:metabolomics_schema.png, accessed 2014-03-10, 11:42:00 UTC

What is proteomics?
Transcriptomics:
- Most features known
- Most features measured
- Signal correlates with abundance
MS-based proteomics:
- The full set of possible features is not known
- Sample is dynamic during analysis
- 20-50% of features measured
- An undetected signal means either that the feature is not present, or that it is present but was not detected
[Figure: LCMS] Source: Science Learning Hub, University of Waikato. Credit: Steve Carr, Broad Institute of MIT and Harvard

Mass spectrometry-based quantitative proteomics
LC/MSMS: Liquid Chromatography Tandem Mass Spectrometry
Source: Emmanuel Barillot, Laurence Calzone, Philippe Hupé, Jean-Philippe Vert, Andrei Zinovyev, Computational Systems Biology of Cancer, Chapman & Hall/CRC Mathematical & Computational Biology, 2012

Mass spectrometry-based quantitative proteomics
LC/MSMS: Liquid Chromatography Tandem Mass Spectrometry
Two main mass-spec categories:
1. Data-Dependent Acquisition (DDA)
2. Selected Reaction Monitoring (SRM)
Also: Data-Independent Acquisition (DIA)
Specific to DDA: allows relative quantification of peptides through chemical labeling.
- isotopic (stable isotope labeling by amino acids in cell culture, or SILAC)
- isobaric (isobaric tags for relative and absolute quantitation, or iTRAQ)
- isobaric (Tandem Mass Tags, or TMT)

Stable isotope labeling mass spectrometry
LC/MSMS, DDA
Multiple samples can be quantitated simultaneously using isobaric tags.
Tandem Mass Tags (TMT):
1. Isobaric labeling (peptides are tagged)
2. Pooling (mixture)
3. Fractionation (MS runs)
4. Mass spec
Figure: Roberto Romero et al., 2010

Stable isotope labeling mass spectrometry
Multiple samples can be quantitated simultaneously using isobaric tags.
iTRAQ example: four samples (DMSO, Kinase Inhib 1, Kinase Inhib 2, Kinase Inhib 3) are lysed and digested, labeled (reporters 114, 115, 116, 117), and pooled.
- Tags consist of reporter, balance, and reactive regions.
- Lighter reporter regions are paired with heavier balance regions.
- The entire tag attached to the peptide adds the same mass shift (MS1).
All figures from: Steve Carr, Broad Institute of MIT and Harvard

Stable isotope labeling mass spectrometry
Multiple samples can be quantitated simultaneously using isobaric tags.
- MS1: the entire tag attached to the peptide adds the same mass shift, so the peptides from a mix appear as a single, shifted precursor.
- MS2: on fragmentation, the reporter regions dissociate to produce reporter ion signals.
- The reporter ions (MS2) give quantitative information regarding the relative amount of the peptide in the samples.
[Figure: MS2 spectrum (relative abundance v. m/z, 100 to 950) with insets of the reporter ion region (m/z 112 to 118). Peptide #1: no effect. Peptide #2: sensitive to all inhibitors.]
All figures from: Steve Carr, Broad Institute of MIT and Harvard
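A minimal sketch of how reporter ion signals can be turned into relative quantities, assuming one MS2 spectrum with four iTRAQ reporter channels. The function name `reporter_ratios`, the choice of channel 114 as the reference, and all intensities are hypothetical and only serve as illustration.

```python
from math import log2

def reporter_ratios(intensities, reference):
    """Return log2 ratios of each reporter channel against the reference channel."""
    ref = intensities[reference]
    if ref <= 0:
        raise ValueError("reference channel has no signal")
    # Channels without signal are skipped rather than given a ratio.
    return {ch: log2(v / ref) for ch, v in intensities.items() if v > 0}

# Hypothetical reporter ion intensities for one peptide (made-up numbers).
peptide = {"114": 1000.0, "115": 500.0, "116": 250.0, "117": 1000.0}
ratios = reporter_ratios(peptide, reference="114")
```

A log2 ratio of 0 means the channel matches the reference, and a negative value means the peptide is less abundant in that sample than in the reference.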

Peptide structure & MS spectra
Peptide sequence: CLCYDGFMASEDMK
- b ions: fragments containing the amino terminus (N-terminus)
- y ions: fragments containing the carboxyl terminus (C-terminus)
[Figure: corresponding (theoretical) ion spectrum of CLCYDGFMASEDMK, intensity v. m/z, with peaks labeled by ion type (for example y1_2, y1_3, b1_2, b1_3).]
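A short sketch of how a theoretical b/y ion spectrum can be computed for the example peptide. It assumes standard monoisotopic residue masses, singly charged fragments, and unmodified residues (no isobaric tag and no cysteine modification); the function name `fragment_ions` is illustrative.

```python
# Monoisotopic residue masses in Da (standard values); only the residues
# needed for the example peptide are listed.
RESIDUE_MASS = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "E": 129.04259,
    "F": 147.06841, "G": 57.02146, "K": 128.09496, "L": 113.08406,
    "M": 131.04049, "S": 87.03203, "Y": 163.06333,
}
WATER = 18.01056   # mass of H2O, carried by y ions
PROTON = 1.00728   # mass of the charge-carrying proton

def fragment_ions(peptide):
    """Return singly charged b and y ion m/z values keyed by fragment length."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b ions keep the N-terminal residues; y ions keep the C-terminal residues.
    b = {i: sum(masses[:i]) + PROTON for i in range(1, n)}
    y = {i: sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b, y

b_ions, y_ions = fragment_ions("CLCYDGFMASEDMK")
```

Under these assumptions `y_ions[2]` and `y_ions[3]` come out at about 278.15 and 393.18, two of the m/z values that appear in the slide's theoretical spectrum.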

Peptide structure & MS spectra
Example peptide spectrum from an MS experiment (source: Tobias Kind/FiehnLab).
A complete MS map is a jungle of hundreds of thousands of peptide spectra.

Peptide identification
- Observed spectrum: from the mass spectrometer.
- Theoretical peptide spectra: either stored in spectral libraries or calculated for each candidate peptide.
- The two are matched by maximum likelihood algorithms, which score how well each observed spectrum matches a number of candidate peptides.
- Result: a Peptide Spectrum Match (PSM).
[Figure: likelihood (E-value) estimation: log(# of matches) v. hyper score, marking the expected number of random matches and the best hit. Credit: Brian Searls]
Source: Tobias Kind/FiehnLab
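To illustrate the idea of scoring an observed spectrum against candidate peptides, here is a deliberately simplified sketch that counts shared peaks within an m/z tolerance. This is not the hyper score or E-value calculation shown on the slide; the function names, the tolerance, and all peak lists are hypothetical.

```python
def shared_peak_count(observed_mz, theoretical_mz, tol=0.02):
    """Count theoretical peaks that have an observed peak within +/- tol (m/z units)."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in theoretical_mz
    )

def best_match(observed_mz, candidates, tol=0.02):
    """Return (peptide, score) for the highest-scoring candidate."""
    scores = {
        pep: shared_peak_count(observed_mz, mz, tol)
        for pep, mz in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical observed peaks and candidate fragment lists (made-up values).
observed = [175.12, 278.15, 393.18, 522.22, 700.00]
candidates = {
    "CANDIDATE_A": [278.15, 393.18, 522.22],
    "CANDIDATE_B": [278.15, 410.00, 530.50],
}
peptide, score = best_match(observed, candidates)
```

Real search engines weight matches by intensity and ion type and then estimate how likely the best score is to arise by chance; the sketch only shows the matching step.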

Mass spec data output: matrix of PSM intensities
- Columns: Mixture (1, 2, 3) > Run (1, 2, ..., 12) > Channel (1 to 6).
- Rows: Protein > Peptide > PSM (for example Protein 1 > Peptide 1 > PSM 1).
- Feature (every single row in the matrix): Protein + PSM.
- PSM (Peptide Spectrum Match): Peptide + Charge.
Experimental design:
Channel  Condition  Mixture  Bio-Replicate
126      WT.0h      2        3
127      WT.24h     2        3
128      WT.48h     2        3
129      Cblb.0h    2        3
130      Cblb.24h   2        3
131      Cblb.48h   2        3
126      WT.0h      3        7
127      Cblb.0h    3        7
128      WT.24h     3        7
129      WT.0h      3        8
130      Cblb.0h    3        8
131      WT.24h     3        8
126      Cblb.24h   4        7
127      WT.48h     4        7
128      Cblb.48h   4        7
129      Cblb.24h   4        8
130      WT.48h     4        8
131      Cblb.48h   4        8

Data analysis workflow
1. Pre-processing
   - Non-complete channel sets.
   - Repeated PSM measurements within fractions.
   - Repeated PSM measurements across fractions.
   - Unique peptides per protein.
2. Transformation / normalization
   - Log2 + median/quantile
   - Variance Stabilisation Normalisation (VSN)
3. Missing data imputation
   - Not applicable to TMT (yet!)
4. Summarization
   - Tukey's median polish
5. Significance inference
   - Moderated t-test (limma)
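As an illustration of the "Log2 + median" option in the normalization step, the sketch below log2-transforms each channel and shifts it so that all channel medians coincide. It does not cover quantile normalization or VSN; the function name and the intensities are hypothetical.

```python
from math import log2
from statistics import median

def log2_median_normalize(channels):
    """Log2-transform each channel and shift it so all channel medians agree.

    `channels` maps a channel name to a list of raw intensities (> 0).
    """
    logged = {ch: [log2(v) for v in vals] for ch, vals in channels.items()}
    channel_median = {ch: median(vals) for ch, vals in logged.items()}
    # Use the median of the channel medians as the common target level.
    target = median(channel_median.values())
    return {
        ch: [v - channel_median[ch] + target for v in vals]
        for ch, vals in logged.items()
    }

# Made-up intensities: each channel is a scaled copy of the first one.
raw = {
    "126": [100.0, 200.0, 400.0],
    "127": [200.0, 400.0, 800.0],
    "128": [400.0, 800.0, 1600.0],
}
normalized = log2_median_normalize(raw)
```

Because the example channels differ only by a constant factor, they become identical after normalization, which is the systematic channel effect the step is meant to remove.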

Data analysis workflow
[Figure: intensity boxplot before normalization, fractions combined: log2(intensity) per channel (126 to 131), one panel per mixture (2, 3, 4), legend Rep.3, Rep.7, Rep.8.]

Data analysis workflow: Transformation / Normalization
Variance Stabilisation Normalisation (VSN)
[Figure: intensity boxplot, VSN normalization, fractions combined: Abundance and Abundance.Norm per channel (126 to 131), one panel per mixture (2, 3, 4).]
Check density for normality.
[Figure: density of Abundance and Abundance.Norm per channel (126 to 131), VSN normalization, fractions combined, one panel per mixture (2, 3, 4).]

Data analysis workflow: Summarization
Tukey's median polish
$Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$
log-intensity = grand median + row median + column median + residuals
Use cases:
- Combine features, rolling up from peptide to protein level. Overall intensity of a protein in sample $i$: $Y_i = \mu + \alpha_i$ (the slide also shows the variant $Y_i = \mu + \mathrm{median}(\epsilon_{ij})$).
- Impute missing data: $Y_{ij}^{\mathrm{miss}} = \mu + \alpha_i + \beta_j$
What else!?
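The additive model $Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$ can be fitted with a few alternating median sweeps. The sketch below is a plain-Python version of Tukey's median polish, assuming rows are samples (index i) and columns are features of one protein (index j); missing values are passed as None. The function name and the toy table are hypothetical.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] = overall + row[i] + col[j] + residual[i][j].

    Missing values are given as None and skipped when taking medians.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(r) for r in table]
    overall, row, col = 0.0, [0.0] * n_rows, [0.0] * n_cols

    def med(values):
        present = [v for v in values if v is not None]
        return median(present) if present else 0.0

    for _ in range(max_iter):
        # Sweep the row medians out of the residuals.
        row_delta = [med(resid[i]) for i in range(n_rows)]
        for i in range(n_rows):
            row[i] += row_delta[i]
            resid[i] = [None if v is None else v - row_delta[i] for v in resid[i]]
        shift = med(col)
        overall += shift
        col = [c - shift for c in col]
        # Sweep the column medians out of the residuals.
        col_delta = [med([resid[i][j] for i in range(n_rows)]) for j in range(n_cols)]
        for j in range(n_cols):
            col[j] += col_delta[j]
            for i in range(n_rows):
                if resid[i][j] is not None:
                    resid[i][j] -= col_delta[j]
        shift = med(row)
        overall += shift
        row = [r - shift for r in row]
        if max(abs(d) for d in row_delta + col_delta) < tol:
            break
    return overall, row, col, resid

# Toy log-intensities: 3 samples x 3 features, one value missing.
table = [
    [7.0, 9.0, None],
    [8.0, 10.0, 12.0],
    [9.0, 11.0, 13.0],
]
overall, row, col, resid = median_polish(table)
protein_summary = [overall + r for r in row]   # one value per sample
imputed = overall + row[0] + col[2]            # fills the missing cell
```

Both use cases from the slide appear here: `protein_summary` is the roll-up $\mu + \alpha_i$ per sample, and `imputed` is $\mu + \alpha_i + \beta_j$ for the missing cell.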

Profile plots & Summarization
[Figure: two profile plots for protein P17918 (# peptide: 20), log2 intensities across MS runs, grouped by condition (Cblb.0h, Cblb.24h, Cblb.48h, WT.0h, WT.24h, WT.48h) within mixtures 2, 3 and 4. One panel shows the processed feature-level data, one line per PSM (for example atplsptvtlsmsadvplvveyk_2, cagnediitlr_2, vsdyemk_3); the other shows the run summary.]

Issue with normalization and summarization
Current approach:
1. Combine fractions.
2. Summarize for every protein, within mixture.
3. Between-channel normalization (VSN).
Alternative:
1. Within-mixture and between-fraction normalization.
2. Fraction combination within mixture.
3. Between-channel normalization (VSN).
4. Protein summarization, within mixture.

Statistical Inference
Traditional approach, for one protein (Student's t-test):
$$t = \frac{\text{difference of group means}}{\text{estimate of variation}} = \frac{\bar{Y}_1 - \bar{Y}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim \mathrm{Student}(d)$$
- $s^2$: protein-specific variance.
- $d$: protein-specific degree of freedom, reflects the number of replicates.
Solution by linear models for microarrays (limma; Smyth, 2005): borrow information across proteins.
$$\tilde{s}^2 = \frac{d_0 s_0^2 + d s^2}{d_0 + d}, \qquad \tilde{d} = d_0 + d$$
- $s_0^2$: consensus variance over all proteins.
- $d_0$: consensus degree of freedom over all proteins.
[Figure: volcano plot, WT-24h v. KO-24h: Log2 fold change v. -Log10 adjusted P; legend: NS, LogFC > 1, P.adj < 0.05 & LogFC > 1; proteins labeled by accession (for example P23492, F8WJ93, Q8C405).]
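The shrinkage formula for the variance can be made concrete with a small sketch. It assumes a two-group comparison for a single protein and takes the consensus values d0 and s0^2 as given inputs; limma itself estimates them from all proteins by empirical Bayes, which is not reproduced here. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def moderated_t(group1, group2, d0, s0_sq):
    """Return (t, degrees of freedom, shrunken variance) for one protein."""
    n1, n2 = len(group1), len(group2)
    d = n1 + n2 - 2
    # Protein-specific pooled variance of the two groups.
    s_sq = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / d
    # Shrink the protein-specific variance toward the consensus variance.
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)
    t = (mean(group1) - mean(group2)) / (sqrt(s_tilde_sq) * sqrt(1 / n1 + 1 / n2))
    return t, d0 + d, s_tilde_sq

# Made-up log2 intensities for one protein; d0 and s0_sq are assumed values.
wt = [10.0, 11.0, 12.0]
ko = [8.0, 9.0, 10.0]
t_mod, df_mod, s_tilde_sq = moderated_t(wt, ko, d0=4, s0_sq=0.5)
```

With few replicates the protein-specific variance is noisy, so pulling it toward the consensus value stabilizes the statistic, and the added d0 raises the degrees of freedom.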

Outlook
- Between-channel v. between-run normalization (order!)
- Alternative methods for fraction combination.
- Mixed-effect models instead of median polish and t-test.
- Model missingness (metric for assessing data quality?).
- Imputing missing values prior to summarization.

Acknowledgments
This work has been done in collaboration with:
- Dr. Marc Sylvester, Mass Spectrometry Core Facility, Institute for Biochemistry and Molecular Biology, University of Bonn
- Dr. Andreas Buness, Core Unit for Bioinformatics Analyses, Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn

Liquid Chromatography Tandem Mass Spectrometry
- 1D or 2D gel electrophoresis.
- Proteins are digested into peptides by means of chemical or enzymatic digestion.
- LC: separation of peptides in time, such that the mass spectrometer is provided with only a small portion at a time.
- Liquid from the LC undergoes electrospray ionisation to form molecular ions.
- The ion mixture is sorted according to m/z ratio: survey scan (precursor ion selection).
- Fragmentation of the precursor ion into product ions.
- Sorting of product ions according to their m/z ratio in the second analyser.
Source: Emmanuel Barillot, Laurence Calzone, Philippe Hupé, Jean-Philippe Vert, Andrei Zinovyev, Computational Systems Biology of Cancer, Chapman & Hall/CRC Mathematical & Computational Biology, 2012

Mass spectrometry-based quantitative proteomics
LC/MSMS: Liquid Chromatography Tandem Mass Spectrometry
Two main categories: discovery proteomics (shotgun) and targeted proteomics.
In common: proteins are digested into peptides; peptides are separated by liquid chromatography (LC).
Discovery proteomics (shotgun):
- Operated in: Data-Dependent Acquisition (DDA)
- Pros: discovering the maximal number of proteins from one or a few samples.
- Cons: limited quantification capabilities on large sample sets.
Targeted proteomics:
- Operated in: Selected Reaction Monitoring (SRM)
- Pros: accurate quantification of sets of specific proteins in many samples.
- Cons: limited to measurements of a few thousand transitions.
Specific to DDA: allows relative quantification of peptides through chemical labeling.
- isotopic (stable isotope labeling by amino acids in cell culture, or SILAC)
- isobaric (isobaric tags for relative and absolute quantitation, or iTRAQ)
- isobaric (Tandem Mass Tags, or TMT)

Data analysis workflow: Pre-Processing
Repeated feature measurements within a fraction (run).
Example, Mixture 1 (x = measured, - = not measured):
Feature            Run 1        Run 2        Run 12
                   1 2 3 4 5 6  1 2 3 4 5 6  1 2 3 4 5 6
PSM 1              x x x x x x  - - - - - -  - - - - - -
PSM 1*             x - x x - x  - - - - - -  - - - - - -

Data analysis workflow: Pre-Processing
Repeated feature measurements within fractions (runs).
Example, Mixture 1 (x = measured, - = not measured):
Feature            Ion score  Run 1        Run 2        Run 12
                              1 2 3 4 5 6  1 2 3 4 5 6  1 2 3 4 5 6
PSM 1              46         x x x x x x  - - - - - -  - - - - - -
PSM 1*             53         x - x x - x  - - - - - -  - - - - - -

Data analysis workflow: Pre-Processing
Repeated feature measurements across fractions (runs).
Example, Mixture 1 (x = measured, - = not measured):
Feature            Ion score  Run 1        Run 2        Run 12
                              1 2 3 4 5 6  1 2 3 4 5 6  1 2 3 4 5 6
PSM 1              46         x x x x x x  - - - - - -  - - - - - -
PSM 1*             53         x - x x - x  - - - - - -  - - - - - -
Peptide 3, PSM 1              x x x x x x  - - - - - -  x x x x x -

Data analysis workflow: Pre-Processing
Use features with a complete channel set.
Example, Mixture 1 (x = measured, - = not measured):
Feature            Ion score  Run 1        Run 2        Run 12
                              1 2 3 4 5 6  1 2 3 4 5 6  1 2 3 4 5 6
PSM 1              46         x x x x x x  - - - - - -  - - - - - -
PSM 1*             53         x - x x - x  - - - - - -  - - - - - -
PSM 3                         x x x - x x  - - - - - -  - - - - - -
Peptide 3, PSM 1              x x x x x x  - - - - - -  x x x x x -

Data analysis workflow: Pre-Processing
Remove single-shot proteins (at least 2 peptides per protein).
Example, Mixture 1 (x = measured, - = not measured):
Feature                       Ion score  Run 1        Run 2        Run 12
                                         1 2 3 4 5 6  1 2 3 4 5 6  1 2 3 4 5 6
Protein 1, Peptide 1, PSM 1              x x x x x x
PSM 1                         46         x x x x x x  - - - - - -  - - - - - -
PSM 1*                        53         x - x x - x  - - - - - -  - - - - - -
Peptide 3, PSM 1                         x x x x x x  - - - - - -  x x x x x -
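Two of the pre-processing rules, keeping only features with a complete channel set and removing single-shot proteins, can be sketched as a small filter. The record layout (dicts with `protein`, `peptide`, `charge`, `intensities`), the function name, and the choice to count peptides after the completeness filter are assumptions made for illustration.

```python
from collections import defaultdict

def filter_features(features, n_channels=6, min_peptides=2):
    """Keep complete-channel features from proteins with enough peptides.

    Each feature is a dict with keys 'protein', 'peptide', 'charge' and
    'intensities' (one value per channel, None when not measured).
    """
    # Rule 1: every channel of the feature must have a measurement.
    complete = [
        f for f in features
        if len(f["intensities"]) == n_channels
        and all(v is not None for v in f["intensities"])
    ]
    # Rule 2: count distinct peptides per protein among the remaining features.
    peptides = defaultdict(set)
    for f in complete:
        peptides[f["protein"]].add(f["peptide"])
    return [f for f in complete if len(peptides[f["protein"]]) >= min_peptides]

# Hypothetical features (made-up proteins, peptides and intensities).
full = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gap = [1.0, None, 3.0, 4.0, None, 6.0]
features = [
    {"protein": "P1", "peptide": "pepA", "charge": 2, "intensities": full},
    {"protein": "P2", "peptide": "pepA", "charge": 2, "intensities": full},
    {"protein": "P2", "peptide": "pepB", "charge": 3, "intensities": gap},
    {"protein": "P2", "peptide": "pepC", "charge": 2, "intensities": full},
]
kept = filter_features(features)
```

In this example P1 is dropped as a single-shot protein and the P2 feature with missing channels is dropped for its incomplete channel set, leaving two P2 features.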