BIOINF 4120 Bioinformatics 2 - Structures and Systems

Oliver Kohlbacher, Summer 2013
16. Quantitative Proteomics

Overview
  LC-MS-based proteomics: definition of maps and features
  Quantification approaches
    Labeled quantification
    Label-free quantification
  Algorithmic problems
    Feature finding
    Map alignment
  Examples

LC-MS-Based Proteomics
  Quantitative proteomics tries to measure the expression level of all proteins (as many as possible) in a sample.
  Problems
    The limited sensitivity of MS makes detection of low-abundance proteins difficult
    MS signal intensity is proportional to peptide concentration, but the factor varies from peptide to peptide, so the signal alone gives no absolute quantification
    The complexity of the sample makes separation difficult
    Datasets tend to get huge (up to hundreds of GB per sample), so data analysis is difficult

Shotgun Proteomics
  [Figure: proteins -> digestion -> peptide digest -> separation]
  Key ideas
    Separation of whole proteins is possible but difficult, hence digestion is preferred
    Usually: trypsin cuts after K and R and yields peptides suitable for MS (positive charge at the end)
    Separate the peptides; this is easy
    Identify proteins through their peptides

HPLC-MS Analysis I
  Pipeline: HPLC -> ESI -> TOF -> spectrum (scan)
  Separation 1 (HPLC): different peptides have different retention times (RT)
  Ionization (ESI): each peptide receives z charge units
  Separation 2 (TOF): the detector measures m/z
  [Figure: Orbitrap analyzer; spectrum with axes intensity vs. mass/charge]
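The tryptic cleavage rule from the Shotgun Proteomics slide (cut after every K and R) can be illustrated with a minimal in-silico digest. This is a sketch under the slide's simplified rule only; it ignores missed cleavages and the commonly used exception that trypsin tends not to cleave before proline. The function name `tryptic_digest` is hypothetical.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence after every K and R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            # The cleavage site lies after the K/R, so the residue stays in the peptide.
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # C-terminal remainder that does not end in K or R.
        peptides.append(protein[start:])
    return peptides


peptides = tryptic_digest("ALELFRHPNDMAAKGASEDIPVK")
print(peptides)
```

Every peptide except possibly the last one ends in K or R, which is what gives tryptic peptides a positive charge at the C-terminal end.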

LC-MS Data (Map)
  [Figure series: LC-MS map]




LC-MS Data (Map)
  [Figure: LC-MS map with an annotated feature]
  Identification (e.g., EVAAFAQFGSDLDASTK)
  Quantification (e.g., 15 nmol/µl, 3x overexpressed, ...)

Quantification
  Key problem
    Detector signal is proportional to peptide concentration
    The proportionality factor varies from peptide to peptide
    Hence, absolute signal intensity cannot be compared directly with absolute concentration across peptides
  Reason
    Different ionization/flight behavior of different peptides
  Consequences
    Relative quantification is possible between two samples
    Absolute quantification requires an external standard for calibration

Differential Analysis
  Two basic approaches:
    Labeling (e.g., SILAC, iTRAQ, ...)
    Label-free quantification (LFQ)
  [Figure: labeling workflow. State 1 (healthy): proteins with heavy label; State 2 (diseased): proteins with light label; steps shown: mix, fractionate, digest, isolate. Source: Nat. Biotechnol. 17: 994-999, 1999]

SILAC: Stable Isotope Labeling with Amino Acids in Cell Culture
  Introduce stable labels by feeding labeled amino acids to the cell culture
  Labels will be integrated into all proteins after a reasonable amount of time
  Mix and compare with an unlabeled sample
  Tryptic digest ensures that each peptide contains (with some exceptions) exactly one K/R
  Peptides with heavy and light label are otherwise identical and coelute
  Spectra contain isotope patterns for both heavy and light peptides
  [Figure: SILAC pair (light, heavy) with charge 2 and approximately a 1:1 ratio (unperturbed)]
  [Figure: SILAC workflow. Source: Mumby, Brekken, Genome Biol (2005), 6:230]

Isobaric Labeling
  [Figure: http://en.wikipedia.org/wiki/file:isobaric_labeling.png, accessed 19.11.11, 19:48 CET]
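For the SILAC pair described on the slide, the heavy and light isotope patterns are offset in m/z by the label mass shift divided by the charge, and the ratio follows from the summed intensities of the two patterns. The sketch below assumes a 13C6-lysine label (mass shift of about 6.0201 Da); the label choice, the intensity values, and both function names are illustrative assumptions, not part of the lecture.

```python
LYS6_SHIFT_DA = 6.0201  # assumed label: 13C6-lysine; other labels have other shifts


def silac_mz_shift(label_mass_shift_da: float, charge: int) -> float:
    """m/z distance between the light and the heavy pattern of one peptide."""
    return label_mass_shift_da / charge


def heavy_to_light_ratio(light_intensities, heavy_intensities) -> float:
    """Ratio of summed isotope-peak intensities, heavy over light."""
    return sum(heavy_intensities) / sum(light_intensities)


# Hypothetical isotope-peak intensities of one charge-2 SILAC pair.
shift = silac_mz_shift(LYS6_SHIFT_DA, charge=2)
ratio = heavy_to_light_ratio([100.0, 60.0, 20.0], [98.0, 61.0, 21.0])
print(f"expected m/z offset: {shift:.5f}, heavy/light ratio: {ratio:.2f}")
```

Because both forms of the peptide are chemically identical, the unknown peptide-specific response factor cancels in this ratio, which is why relative quantification works without calibration.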

Isobaric Labeling
  Idea
    Label the different samples with labels of the same mass (isobaric)
    Design the labels so that they fragment differently upon collision-induced dissociation
    MS2 spectra will then contain reporter ions
    Quantification and identification are then both based on tandem spectra only
  Key method: iTRAQ (isobaric tags for relative and absolute quantification)
    Based on covalent modification of the N-terminus of peptides
    Labeling is performed after digestion (also applicable to clinical samples)
    Kits are available for 4 or 8 distinct labels ("quadruplex", "octoplex")

iTRAQ
  [Figure: Ross et al., Mol Cell Prot (2004), 3, 1154-1169]

Differential Analysis
  Two basic approaches:
    Labeling (e.g., SILAC, iTRAQ)
    Label-free quantification (LFQ)
  [Figure: Map 1 ("healthy") vs. Map 2 ("diseased")?]

Label-Free Quantification (LFQ)
  Label-free quantification is probably the most natural way of quantifying
  No labeling required: removes further sources of error, puts no restriction on sample generation, and is cheap
  Data on different samples are acquired in different measurements, so higher reproducibility is needed
  Manual analysis is difficult
  Scales very well with the number of samples: basically no limit, and no difference in the analysis between 2 or 100 samples

Data Reduction
  [Figure: peptide (feature), isotope pattern, elution profile]
  Feature finding problem: identify all peaks belonging to one peptide and sum up their intensities

Data Reduction
  Features
    Aggregation of peaks to features achieves up to 1,000-fold reduction of data volume
    Reduction to a meaningful quantity: ion count of one peptide

LFQ Analysis Strategy
  1. Find features in all maps
  2. Align maps
  3. Link corresponding features
  4. Identify features (e.g., GDAFFGMSCK)
  5. Quantify (e.g., GDAFFGMSCK 1.0 : 1.2 : 0.5)
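The final "Quantify" step of the strategy reduces, for one linked feature, to expressing its intensity in each map relative to a reference map. The sketch below assumes the first map is the reference and uses made-up intensities chosen to reproduce a 1.0 : 1.2 : 0.5 ratio; `relative_abundances` is a hypothetical helper name.

```python
def relative_abundances(linked_intensities, reference_map=0):
    """Intensities of one linked feature across maps, relative to a reference map."""
    reference = linked_intensities[reference_map]
    if reference <= 0:
        raise ValueError("reference intensity must be positive")
    return [intensity / reference for intensity in linked_intensities]


# Hypothetical feature intensities of one peptide in three aligned maps.
ratios = relative_abundances([2.0e6, 2.4e6, 1.0e6])
print(" : ".join(f"{r:.1f}" for r in ratios))
```

In practice the maps would also be normalized against each other before ratios are computed, since the measurements were acquired in separate runs.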

Proteomics Data Flow
  [Figure: sample -> HPLC/MS -> raw data (10 GB) -> signal processing -> filtered raw data (1 GB) -> data reduction -> maps (50 MB) -> identification -> annotated maps (50 MB) -> differential analysis -> differentially expressed proteins (1 kB)]

Quantifying Analytes
  Analytes have to be in solution for proteomics and metabolomics
  We thus deal with concentrations: amounts per volume of sample $V$
    Molar concentration: $c_i = n_i / V$ [SI unit: mol/m^3]
    Mass concentration: $\rho_i = m_i / V$ [SI unit: kg/m^3]
  Molar concentrations are translated into mass concentrations via the molecular weight $M_i$ of the analyte:
    $\rho_i = c_i M_i$

Precision and Accuracy
  [Figure: probability density over measured value relative to the reference value; one curve with good accuracy but poor precision, one with good precision but poor accuracy]
  Accuracy: closeness to the true value (mostly influenced by systematic error); repeating the experiment will not improve the result
  Precision: repeatability of the measurement (mostly influenced by random error); repeating the experiment and averaging will yield a value closer to the true value
  An ideal experiment combines high accuracy with high precision
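The relation between molar and mass concentration, and the distinction between accuracy and precision, can both be checked numerically. The sketch below is illustrative only: the analyte (molar mass 17 kg/mol), the concentration, and the two measurement series are invented, and bias and sample standard deviation are used here as simple stand-ins for accuracy and precision.

```python
from statistics import mean, stdev


def mass_concentration(molar_conc_mol_per_m3: float, molar_mass_kg_per_mol: float) -> float:
    """rho_i = c_i * M_i, in kg/m^3."""
    return molar_conc_mol_per_m3 * molar_mass_kg_per_mol


def bias_and_spread(measurements, reference_value):
    """Bias (systematic error) and sample standard deviation (random error)."""
    return mean(measurements) - reference_value, stdev(measurements)


rho = mass_concentration(2.0e-3, 17.0)  # hypothetical analyte

# Series A: accurate but imprecise; series B: precise but inaccurate.
bias_a, spread_a = bias_and_spread([10.2, 9.8, 10.1, 9.9], reference_value=10.0)
bias_b, spread_b = bias_and_spread([11.0, 11.01, 10.99, 11.0], reference_value=10.0)
print(rho, (bias_a, spread_a), (bias_b, spread_b))
```

Averaging more repeats shrinks the uncertainty of the mean in series A, but no number of repeats removes the bias of series B.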

Measurement Errors
  [Figure: probability density over measured value; accuracy is the offset from the reference value, precision is the width of the distribution]
  Each measurement is associated with an error
  There are two basic types of error:
    Random error: defines the variance of repeated measurements (e.g., due to a high noise level); it is present in every measurement
    Systematic error (bias): shifts the mean of repeated experiments (e.g., due to an incorrect calibration)

Calibration Curve
  [Figure: detector response vs. concentration]
  Measuring the detector response for various (known) concentrations allows the construction of a calibration curve
  Most detectors are designed so that the response changes linearly with the concentration
  Once the calibration curve has been measured, it allows the determination of the concentration of an unknown sample

Response Saturation
  [Figure: detector response vs. concentration, showing the noise level, LOD, LOQ, the linear range, and LOL; slope = sensitivity]
  LOD (limit of detection): at what concentration can we decide that the analyte is present?
  LOQ (limit of quantification): at what concentration can we accurately quantify it?
  LOL (limit of linearity): saturation effects start here
  Linear range (dynamic range): the concentration range in which the response is linear in the concentration
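The calibration-curve idea can be sketched as an ordinary least-squares line through the responses of known standards, which is then inverted for an unknown sample. The standards and responses below are synthetic, the function names are hypothetical, and the inversion is only meaningful inside the linear range (between LOQ and LOL).

```python
def fit_line(concentrations, responses):
    """Least-squares fit of response = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(responses) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(concentrations, responses))
    slope = sxy / sxx  # the slope is the sensitivity of the detector
    return slope, mean_y - slope * mean_x


def concentration_from_response(response, slope, intercept):
    """Invert the calibration line; valid only within the linear range."""
    return (response - intercept) / slope


# Synthetic standards with an exactly linear response (sensitivity 2.1, offset 0.5).
standards = [1.0, 2.0, 4.0, 8.0]
responses = [2.1 * c + 0.5 for c in standards]
slope, intercept = fit_line(standards, responses)
unknown = concentration_from_response(11.0, slope, intercept)
print(slope, intercept, unknown)
```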

Feature Finding
  Feature finding is a key data reduction step enabling complex analysis
  It is a key step for LC-MS data, both for proteomics and metabolomics
  Feature finding boils down huge maps to hundreds or thousands of features
  Various algorithms have been proposed for feature finding
  We will discuss the algorithm proposed by Gröpl et al. (2005), which uses a two-dimensional model fit
  The model is based on the shape of an ideal feature as defined by the separation process
  Reference: Gröpl, C, Lange, E, Reinert, K, Kohlbacher, O, Sturm, M, Huber, C, Mayr, B, and Klein, C (2005). CompLife 2005, Springer LNBI 3695, p. 151-161.

Feature Finding
  Identify all peaks belonging to one peptide
  Key idea: identify suspicious regions
  Fit a model to that region and see which peaks are explained by it

Feature Attributes
  Position (m/z, RT)
  Intensity, volume
  Quality

Feature Models
  Feature model = isotope pattern x elution profile

Feature Models
  Physical processes leading to the shape of a feature:
    Chromatography
      Elution profiles are (ideally) shaped like a Gaussian
      Parameters: width, height, position
    Mass spectrometry
      Mass spectra of peptides are characterized by the isotope pattern
      Modeled by a binomial distribution
  A two-dimensional feature is then described by the product of these functions

Isotope Patterns
  Molecule with one carbon atom
  Two possibilities:
    Light variant: 12C
    Heavy variant: 13C
  98.9% of all molecules will be light, 1.1% will be heavy
  Natural isotope abundances:
    Isotope   Abundance
    12C       98.90%
    13C        1.10%
    14N       99.63%
    15N        0.37%
    16O       99.76%
    17O        0.04%
    18O        0.20%
    1H        99.98%
    2H         0.02%

Isotope Patterns
  Molecule with 10 carbon atoms
  The lightest variant contains only 12C; this is called monoisotopic
  Others contain 1-10 13C atoms; these are heavier by about 1-10 Da than the monoisotopic one
  In general, the relative intensities follow a binomial distribution
  For higher masses (i.e., a larger number of atoms), the monoisotopic peak is no longer the most likely variant

Averagine
  Since the isotope pattern changes with the composition of the peptide, it is unknown which pattern should be fitted
  Idea
    We know the mass of the feature
    Assume an average composition of an amino acid
    Then we can estimate the composition
  The elemental composition of such an average amino acid, also called averagine, can be derived statistically:
    C 4.94  H 7.76  N 1.36  O 1.48  S 0.04

Isotope Patterns
  Based on averagine compositions one can compute the isotope pattern for any given mass
  Heavier peptides have smaller monoisotopic peaks
  In the limit, the distribution approaches a normal distribution
    m [Da]   P(k=0)   P(k=1)   P(k=2)   P(k=3)   P(k=4)
    1000     0.55     0.30     0.10     0.02     0.00
    2000     0.30     0.33     0.21     0.09     0.03
    3000     0.17     0.28     0.25     0.15     0.08
    4000     0.09     0.20     0.24     0.19     0.12
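The averagine idea and the binomial isotope model can be combined in a few lines: estimate the number of carbon atoms from the feature mass, then evaluate the binomial distribution for the number of 13C atoms. This is a simplified sketch that considers carbon isotopes only, so its values are somewhat higher for P(k=0) than those in the table, which presumably also accounts for the other elements. The average atomic masses and all function names are assumptions introduced for the example.

```python
from math import comb

AVERAGINE = {"C": 4.94, "H": 7.76, "N": 1.36, "O": 1.48, "S": 0.04}  # from the slide
# Assumed standard average atomic masses in Da (not given in the lecture).
AVERAGE_ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def averagine_unit_mass() -> float:
    """Mass of one average amino acid residue."""
    return sum(count * AVERAGE_ATOMIC_MASS[el] for el, count in AVERAGINE.items())


def estimated_carbon_count(peptide_mass_da: float) -> int:
    """Estimate the number of carbon atoms of a peptide from its mass."""
    units = peptide_mass_da / averagine_unit_mass()
    return round(units * AVERAGINE["C"])


def carbon_isotope_pattern(n_carbons: int, p_heavy: float = 0.011, max_k: int = 4):
    """P(k) that exactly k of n carbons are 13C (binomial, carbon only)."""
    return [
        comb(n_carbons, k) * p_heavy**k * (1.0 - p_heavy) ** (n_carbons - k)
        for k in range(max_k + 1)
    ]


small = carbon_isotope_pattern(estimated_carbon_count(1000.0))
large = carbon_isotope_pattern(estimated_carbon_count(4000.0))
print([round(p, 2) for p in small])
print([round(p, 2) for p in large])
```

For the 4000 Da case the second entry exceeds the first, which reproduces the statement that the monoisotopic peak is no longer the most likely variant at higher masses.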

Feature Model (m/z dimension)
  The isotope pattern is also modulated by the instrument resolution
  We can assume a Gaussian shape for each of the peaks of the isotope pattern

Feature Model (RT dimension)
  The elution profile is typically assumed to be Gaussian
  There are some variants that also allow for asymmetric peaks
  This defines the RT dimension of a feature

Feature Finding Algorithm
  The algorithm consists of four phases:
  1. Seeding. Choose peaks of high intensities, as those are usually in features ("seeds").
  2. Extension. Conservatively add peaks around the seed; never mind if you pick up a few peaks too many.
  3. Modeling. Estimate the parameters of a two-dimensional feature for the region.
  4. Refinement. Optimally fit a model to the collected peaks. Remove peaks not agreeing with the model. Iterate until convergence.
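The two-dimensional feature model described on these slides (Gaussian elution profile in RT multiplied by an isotope pattern whose peaks are Gaussian in m/z) can be written down directly. The sketch below is not the parameterization used by Gröpl et al.; the isotope spacing of 1.00335 Da (the 13C-12C mass difference), the parameter names, and the example values are assumptions.

```python
from math import exp

ISOTOPE_SPACING_DA = 1.00335  # assumed 13C-12C mass difference


def gaussian(x: float, center: float, sigma: float) -> float:
    """Unnormalized Gaussian with value 1 at its center."""
    return exp(-0.5 * ((x - center) / sigma) ** 2)


def feature_model(mz, rt, *, mono_mz, charge, isotope_probs,
                  rt_apex, rt_sigma, mz_sigma, height):
    """Model intensity at (m/z, RT): elution profile times isotope pattern."""
    elution = gaussian(rt, rt_apex, rt_sigma)
    isotope = sum(
        # Isotope peak k sits k * spacing / charge above the monoisotopic m/z.
        p * gaussian(mz, mono_mz + k * ISOTOPE_SPACING_DA / charge, mz_sigma)
        for k, p in enumerate(isotope_probs)
    )
    return height * elution * isotope


params = dict(mono_mz=500.0, charge=2, isotope_probs=[0.55, 0.30, 0.10],
              rt_apex=1200.0, rt_sigma=10.0, mz_sigma=0.01, height=1000.0)
apex_value = feature_model(500.0, 1200.0, **params)
off_apex_value = feature_model(500.0, 1210.0, **params)
print(apex_value, off_apex_value)
```

Fitting this model in the modeling and refinement phases means adjusting `rt_apex`, `rt_sigma`, `height`, and the charge until the model agrees with the collected peaks.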

Algorithm: Seeding
  Start with the highest peaks in the map
  Pick only one seed per feature; thus exclude peaks of already identified features from later seeding
  More advanced variants of the algorithm use wavelet techniques to detect the best seeds
  Problems
    Low-intensity features have intensities barely above the surrounding noise
    Choose a threshold based on the average noise
    Dilemma: if the threshold is too high, features will not get seeded; if it is too low, millions of noise peaks will be considered as seeds, leading to huge run times

Feature Finding Overview
  [Figure: Sturm, OpenMS: A Framework for Computational Mass Spectrometry, Dissertation, Tübingen, 2010]

Algorithm: Extension
  Explore the peaks around the seed
  Add them to a set of relevant peaks
  Abort if the peaks are getting too small or too far away
  [Figure: extension from the seed in the up and down directions]
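The seeding phase described on the slide can be sketched as a greedy pass over the peaks in order of decreasing intensity. In this simplified version, a fixed RT/m/z window around each accepted seed stands in for "peaks of already identified features", and the threshold is a multiple of a single noise estimate; the window sizes, the factor 3, and the function name are assumptions.

```python
def choose_seeds(peaks, noise_level, snr_threshold=3.0, rt_window=30.0, mz_window=2.0):
    """Pick seeds from (rt, mz, intensity) peaks, highest intensity first."""
    seeds = []
    for rt, mz, intensity in sorted(peaks, key=lambda p: p[2], reverse=True):
        if intensity < snr_threshold * noise_level:
            break  # all remaining peaks are even weaker
        already_covered = any(
            abs(rt - seed_rt) <= rt_window and abs(mz - seed_mz) <= mz_window
            for seed_rt, seed_mz, _ in seeds
        )
        if not already_covered:
            seeds.append((rt, mz, intensity))
    return seeds


# Hypothetical peaks: two belong to the same feature, one is near the noise level.
peaks = [(100.0, 500.0, 1000.0), (102.0, 500.5, 800.0),
         (300.0, 700.0, 400.0), (305.0, 900.0, 20.0)]
seeds = choose_seeds(peaks, noise_level=10.0)
print(seeds)
```

Lowering `snr_threshold` illustrates the dilemma from the slide: more low-intensity features get seeded, but so do noise peaks.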

Algorithm: Refinement
  Remove peaks that are not consistent with the model
  Determine the optimal model for the reduced set of peaks
  Iterate this until no further improvement can be achieved
  Remove all peaks of this feature from the potential seeds

Feature Finding
  Identify all peaks belonging to one peptide
  Key idea: identify suspicious regions
  Fit a model to that region and identify the peaks explained by it
  Extension: collect all data points close to the seed
  Refinement: remove peaks that are not consistent with the model, fit an optimal model for the reduced set of peaks, and iterate until no further improvement can be achieved

Collecting Mass Traces
  A mass trace is a series of peaks along the RT dimension with little variation in the m/z dimension
  Mass traces are found with a simple heuristic that aborts the search if the peak intensity hits the local noise level
  Search for mass traces at the correct distance
  Limit the length of each mass trace to the length of the most intense mass trace
  [Figure: Sturm, OpenMS: A Framework for Computational Mass Spectrometry, Dissertation, Tübingen, 2010]

Feature Deconvolution
  [Figure: Sturm, OpenMS: A Framework for Computational Mass Spectrometry, Dissertation, Tübingen, 2010]
  Features can overlap in various ways
    Mass traces can contain more than one chromatographic peak (features not baseline-separated in the RT dimension)
    Mass traces can be interleaved between features in the m/z dimension
    Co-eluting features can share mass traces
  These conflicts are resolved in a feature deconvolution step by statistical testing:
    Test several hypotheses that could explain the features
    The most likely hypothesis is identified through comparison with the data

Algorithm: Modeling
  Test all possible models for different charge states (charge +2, charge +3, ...)
  Decide on the charge of the feature based on the best fit among these models
  [Figure: model fits for charges 1, 2, and 3]

Algorithm: Modeling/Refinement
  Estimate the quality of fit for model $m$ and data $d_i$ at positions $r_i$
  [Equation not preserved in this transcript]
  A maximum likelihood estimator determines good starting values for the model parameters
  Further optimization of the model parameters takes place in the refinement phase (least-squares fit)

Feature Assembly
  Feature resolution is not always possible unambiguously
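The charge decision in the modeling phase rests on the fact that isotope peaks of a charge-z peptide are spaced by roughly 1/z in m/z. The sketch below tests a few charge hypotheses against the m/z positions of consecutive isotope peaks and keeps the one with the smallest squared deviation. It is a strongly reduced stand-in for the two-dimensional model fit in the lecture: it uses positions only, assumes the first peak is monoisotopic, and assumes a spacing of 1.00335 Da; the function names are hypothetical.

```python
ISOTOPE_SPACING_DA = 1.00335  # assumed 13C-12C mass difference


def charge_fit_error(peak_mzs, charge):
    """Sum of squared deviations from the isotope positions expected for this charge."""
    mono_mz = peak_mzs[0]
    return sum(
        (mz - (mono_mz + k * ISOTOPE_SPACING_DA / charge)) ** 2
        for k, mz in enumerate(peak_mzs)
    )


def best_charge(peak_mzs, charges=(1, 2, 3, 4)):
    """Charge hypothesis whose expected peak positions fit the data best."""
    return min(charges, key=lambda z: charge_fit_error(peak_mzs, z))


# Hypothetical consecutive isotope peaks of one feature.
observed = [500.0, 500.50168, 501.00335, 501.50503]
charge = best_charge(observed)
print(charge)
```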

Still Difficult: Low-Intensity Features
  [Figure: blue feature, red feature, green noise peaks]
  Problem: the algorithm picked up the blue feature; the red one was not found because it was too close to the noise peaks (green)

Map Alignment
  Goal: correct retention time offsets and distortions in label-free experiments

Multiple Feature Map Alignment
  Given k feature maps (Map 1, Map 2, ..., Map k)
  RT and m/z of a peptide may vary between maps: compute a suitable mapping

Multiple Feature Map Alignment
  Dewarp the k maps onto a comparable coordinate system (transformations T_1, T_2, ..., T_k)
  Assign corresponding elements across the k maps (consensus map)

Map Alignment Algorithm
  The algorithm proposed by Lange et al. (2007) tries to find an optimal alignment of two maps through pose clustering
  It assumes an affine transformation between two maps (suitable unless the chromatographic separation has very poor reproducibility)
  The algorithm consists of two phases
    Superposition phase: transform all maps onto the coordinate system of a reference map
    Consensus phase: successive grouping of corresponding elements; group those elements in the transformed maps that are nearest neighbors in a weighted Euclidean metric
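The consensus phase groups elements that are nearest neighbors in a weighted Euclidean metric. A reduced two-map version of that idea is sketched below: features are (RT, m/z) tuples, and two features are linked only if each is the other's nearest neighbor and they lie within a maximum distance. The weights, the distance cutoff, and the function names are assumptions, and the successive grouping across k maps is not shown.

```python
from math import sqrt


def weighted_distance(f, g, rt_weight=1.0, mz_weight=100.0):
    """Weighted Euclidean distance between two (rt, mz) features."""
    return sqrt((rt_weight * (f[0] - g[0])) ** 2 + (mz_weight * (f[1] - g[1])) ** 2)


def link_features(map_a, map_b, max_distance=10.0):
    """Index pairs (i, j) of mutual nearest neighbors within max_distance."""
    if not map_a or not map_b:
        return []

    def nearest(feature, candidates):
        return min(range(len(candidates)),
                   key=lambda idx: weighted_distance(feature, candidates[idx]))

    links = []
    for i, feature in enumerate(map_a):
        j = nearest(feature, map_b)
        is_mutual = nearest(map_b[j], map_a) == i
        if is_mutual and weighted_distance(feature, map_b[j]) <= max_distance:
            links.append((i, j))
    return links


# Hypothetical, already superposed feature maps.
map_a = [(100.0, 500.00), (200.0, 650.30)]
map_b = [(201.5, 650.31), (101.0, 500.01), (400.0, 800.00)]
links = link_features(map_a, map_b)
print(links)
```

The large m/z weight reflects that m/z is usually far more reproducible between runs than RT, so a small m/z difference should count as much as a larger RT difference.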

Superposition Phase
  [Figure: maps S and M]
  $T(s) = A s + b$
  The problem is to find the affine transformation T that minimizes the distance between T(S) and M.

Pose Clustering
  [Figure: maps S and M with candidate transformations of the form $T(s) = a s + b$]

Pose Clustering
  [Figure: point pair (s_1, s_2) in S matched onto point pair (m_1, m_2) in M; parameter space with axes a and b]
  Matching one pair from S onto one pair from M determines one candidate transformation (a, b):
    $m_1 = a s_1 + b$
    $m_2 = a s_2 + b$

Pose Clustering
  [Figure: matches (s_1, m_1) and (s_2, m_2) between S and M; votes accumulate in the (a, b) parameter space]
  Matching of corresponding pairs will result in the correct transformation
  These are more likely than random matches!

Speeding Things Up
  Only consider pairs (s_1, s_2) in S where s_1 has a small distance to s_2 in m/z.
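The voting scheme behind pose clustering can be sketched for the retention time dimension alone. Every pair of points in S matched onto every pair in M yields one candidate (a, b) from the two equations m_1 = a s_1 + b and m_2 = a s_2 + b; the candidates are binned, and the fullest bin wins. This is a brute-force illustration, not the implementation of Lange et al.: it assumes a > 0 (elution order preserved), uses arbitrary bin widths, and omits the speed-ups and the intensity weighting. The function name and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pose_clustering(scene_rts, model_rts, a_bin=0.01, b_bin=1.0):
    """Estimate (a, b) of m = a * s + b by voting over all pair matches."""
    scene = sorted(scene_rts)
    model = sorted(model_rts)
    votes = Counter()
    for s1, s2 in combinations(scene, 2):
        if s1 == s2:
            continue  # identical points do not define a slope
        for m1, m2 in combinations(model, 2):
            a = (m2 - m1) / (s2 - s1)
            b = m1 - a * s1
            votes[(round(a / a_bin), round(b / b_bin))] += 1
    if not votes:
        raise ValueError("need at least two distinct points in each map")
    (a_index, b_index), _ = votes.most_common(1)[0]
    return a_index * a_bin, b_index * b_bin


# Hypothetical maps: M is S transformed by a = 1.1, b = 5, plus one spurious point.
scene = [100.0, 180.0, 310.0, 405.0]
model = [115.0, 203.0, 260.0, 346.0, 450.5]
a, b = pose_clustering(scene, model)
print(a, b)
```

The correct transformation collects one vote for every pair of true correspondences, whereas wrong matches scatter across the parameter space, which is why corresponding pairs dominate the vote.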

Speeding Things Up
  Only match pair (s_1, s_2) onto pair (m_1, m_2) if s_1 and m_1 as well as s_2 and m_2 lie close together in m/z.

Improve Matching
  Normalize intensities in M and S: weigh the vote of each transformation by the intensity similarities of the point matches (s_1, m_1) and (s_2, m_2).

Summary
  Quantitative shotgun proteomics produces large and complex datasets
  Manual analysis of these datasets is often prohibitively labor-intensive
  Feature detection significantly reduces the data and makes the quantitative analysis viable
  Map alignment enables the comparison of features across maps, thus allowing for label-free quantification

References
  Papers on feature finding and map alignment
    Gröpl, C, Lange, E, Reinert, K, Kohlbacher, O, Sturm, M, Huber, C, Mayr, B, and Klein, C (2005). Algorithms for the automated absolute quantification of diagnostic markers in complex proteomics samples. In: Proceedings of the 1st Symposium on Computational Life Sciences (CLS 2005), edited by M. Berthold, R. Glen, K. Diederichs, O. Kohlbacher, I. Fischer. Springer LNBI 3695, p. 151-161.
    Mayr, B, Kohlbacher, O, Reinert, K, Sturm, M, Gröpl, C, Lange, E, Klein, C, and Huber, CG (2006). Absolute Myoglobin Quantitation in Serum by Combining Two-Dimensional Liquid Chromatography-Electrospray Ionization Mass Spectrometry and Novel Data Analysis Algorithms. J. Proteome Res. 5:414-421.
    Lange, E, Gröpl, C, Schulz-Trieglaff, O, Leinenbach, A, Huber, C, and Reinert, K (2007). A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics 23(13):i273-81.
  Web links to software tools
    www.openms.de