Developing Algorithms for the Determination of Relative Peptide Abundances from LC/MS Data


Research in Industrial Projects for Students
Institute for Pure & Applied Mathematics
University of California, Los Angeles

Final Report prepared for
The Spielberg Family Center for Applied Proteomics

Developing Algorithms for the Determination of Relative Peptide Abundances from LC/MS Data

Student Members: Jacob Marcus (Project Manager), Anne Eaton, Melanie Kanter, Arunima Ray
Faculty Mentors: Shawn Cokus, Matteo Pellegrini
Sponsoring Mentors: Parag Mallick, Roland Luethy

August 22, 2008

This project was jointly supported by The Spielberg Family Center for Applied Proteomics and NSF Grant DMS


Abstract

Proteomics as a field has gained prominence in recent years as scientists have realized its potential as a clinical and diagnostic tool. The Spielberg Family Center for Applied Proteomics consists of a group of researchers aiming to improve disease treatment through proteome analysis. The ability to identify characteristics of a proteome will help predict the efficacy of drugs prior to their administration, and so enable doctors to tailor disease treatments to individuals. Quantifying the differences in the abundances of proteins among patients is an important step on the path toward individualized treatments. Liquid Chromatography/Mass Spectrometry (LC/MS) can be used to identify the components of a mixture of proteins broken down into peptides. However, the tools needed to quantify these peptides are not well developed. The RIPS team was asked to develop and compare algorithms to determine the relative abundance of peptides and proteins across samples based on LC/MS images. Our approach has been modular, to facilitate comparative analysis of variants of each component module. Tests on datasets derived from samples containing known amounts of purified proteins have yielded accurate results. Data filtering and the selection of optimal module combinations have greatly reduced the variance in our results. We have also modified our algorithm to be compatible with a dataset generated from samples of a human epithelial carcinoma cell line, in which differently labeled proteins are present in the same sample in known ratios. Initial results on this dataset are promising, but further refinement is needed. We have also done some preliminary work on protein quantification and on assigning confidences to the ratios calculated by our algorithms. The results of our algorithm show a systematic deviation from the expected ratios as concentration increases. This may be attributed to non-linearities in mass spectrometric intensity measurements as a function of the concentration of the substance being analyzed. We have also determined some important factors hindering accurate identification, such as differing levels of tryptic digestion across samples and overlapping peaks.


Contents

Abstract

1 Introduction
  1.1 Background
  1.2 The Spielberg Center and Our Project
  1.3 Challenges and Our Approach

2 Data
  2.1 Generation
    2.1.1 Purified Protein Mixes
    2.1.2 SILAC Data
  2.2 Formatting
  2.3 Visualization

3 Initial Quantification Approaches
  3.1 Naïve Approaches
    3.1.1 Intensity at a Coordinate
    3.1.2 Summing Intensities
    3.1.3 Area under a Fitted Curve
  3.2 Multiple Retention Times
    3.2.1 Summing Intensities
    3.2.2 Area under a Fitted Curve
  3.3 Areas for Improvement

4 A Modular Quantification Approach
  4.1 Purified Protein Mixes
    4.1.1 Extraction of Two-Dimensional Neighborhood
    4.1.2 Consolidation of Isotopes
    4.1.3 Estimation of Noise
    4.1.4 Bounding of Retention Time
    4.1.5 Curve Fitting
    4.1.6 Quantification
  4.2 SILAC Dataset

5 Data Filtering and Optimizing the Algorithm
  5.1 Data Filtering
    5.1.1 Removing Nonsense Peptides
    5.1.2 Removing Matches with Large Differences in Retention Times
  5.2 Optimizing the Algorithm

    5.2.1 Estimation of Noise and Bounding of Retention Time
    5.2.2 Curve Fitting and Quantification
    5.2.3 The Selected Combination of Modules
  5.3 Remaining Outliers

6 Final Evaluations
  6.1 Evaluation of Initial Quantification Approaches
  6.2 Evaluation of Modular Quantification Algorithm
    6.2.1 6 Protein Mix
    6.2.2 5 Protein Mix
    6.2.3 SILAC Dataset

7 Recommendations for Future Work
  7.1 Further Refinement of Modules
    7.1.1 Data Filtering for SILAC Data
    7.1.2 Bounding of Retention Time
    7.1.3 Curve Fitting
  7.2 Concentration Dependence
  7.3 Expanding to All Features
  7.4 Utilizing Information from Multiple Experiments
  7.5 Assigning a Confidence Value
  7.6 Protein Quantification

8 Summary and Conclusions

Appendices

A Liquid Chromatography/Mass Spectrometry-Mass Spectrometry
  A.1 Liquid Chromatography
  A.2 Mass Spectrometry
  A.3 Combined Liquid Chromatography/Mass Spectrometry-Mass Spectrometry

References

Cited Works

List of Figures

1.1 The Central Dogma of Molecular Biology
1.2 Personalized Treatment Based on Proteome Analysis
1.3 Data Generated by a Single LC/MS Experiment
2.1 Subsets of Data
2.2 A Typical Feature Associated with an Identified Peptide
3.1 A Naïve Approach
4.1 Stages of the Modular Quantification Algorithm
4.2 Extraction of a Two-Dimensional Neighborhood for a Given Peptide
4.3 Consolidation of Isotopes
Bounding of Retention Time
Fitting a Gamma Curve
Examples of Possibly Mismatched Peptides
Identifications at the Edge of the Retention Time Axis
Effects of Data Filtering
Comparing Modules for Estimation of Noise and Bounding of Retention Time
Comparing Modules for Curve Fitting and Quantification
Evaluation on 6 Protein Mix for Summing Intensities at Points within a Certain Distance from an Identified Point
Evaluation on 5 and 6 Protein Mix
Evaluation on SILAC Dataset and Comparison with Q
Concentration Dependence
Calculation of Relative Protein Abundances
A.1 Liquid Chromatography
A.2 Mass Spectrometry
A.3 Combined Liquid Chromatography/Mass Spectrometry
A.4 Identifications by Tandem Mass Spectrometry


List of Tables

2.1 Composition of the 6 Protein Mix
2.2 Composition of the 5 Protein Mix
Injection Volumes for the 5 Protein Mix
Incomplete Trypsination Products
Injection Volumes for the 5 Protein Mix


Acknowledgments

Thanks to our mentors Shawn, Matteo, Parag, and Roland for their invaluable help throughout this summer. This project would not have been possible without their constant support, encouragement, and words of wisdom. We would also like to thank the Institute for Pure and Applied Mathematics for hosting us, and the National Science Foundation and the Spielberg Center for funding our research.


Chapter 1

Introduction

1.1 Background

The field of proteomics has gained prominence in recent years as scientists have come to realize its vast, mostly untapped potential as a clinical and diagnostic tool. Similar to genomics (the study of an organism's genetic material, the genome), proteomics is the study of the proteome, that is, the collection of proteins expressed by cells. Unlike the genome, the proteome may differ across cell types of the same individual and, more importantly, changes over time based on external and internal factors such as disease, drugs, and the environment.

Figure 1.1: The central dogma of molecular biology. The genetic material (DNA) generates RNA, which gives rise to a string of amino acids called a polypeptide. Each polypeptide is processed further through the events illustrated to form a functional protein. Studying DNA is helpful, but yields no direct information about the intermediate modification steps that can drastically affect protein structure and function. It is difficult to predict protein expression levels based on DNA alone.

While DNA is the blueprint for the generation of proteins, there are various intermediate steps that contribute to the final structure of any particular protein (see figure 1.1). Events such as post-transcriptional RNA modification, post-translational polypeptide modifications (including methylation and phosphorylation), chaperoning of protein folding, and protein complex formation are not directly measurable using genomics. Studying DNA alone cannot yield sufficient information to accurately predict protein structure and quantity. Studying proteins, therefore, provides researchers and doctors more complete and relevant information about an individual. Since most drugs target proteins, proteomics can also enhance drug discovery and development.

Figure 1.2: Personalized treatment based on proteome analysis. Analysis of a blood sample could reveal biomarkers indicating the optimum treatment for an individual. Moreover, studying the patients who do not respond well to any of the current treatments can guide future research on potential drugs. Such personalization of treatment is the ultimate goal of clinical proteomics.

A very important application of proteomics is the discovery of biomarkers. A biomarker is a substance used as an indicator of a biological state, such as disease. Protein markers have been used for early diagnosis of diseases [1, 2] and may be important to understanding the diversity of patient response to treatments. For example, the widely used breast cancer drug Herceptin is effective against only 25% of breast cancers. In the past, doctors would prescribe this medication without any prior knowledge of how effective it might be. However, it is now known that Herceptin is effective only when the patient exhibits overexpression of the HER-2 biomarker [3]. This information has helped doctors prescribe treatments more likely to be effective, has led to a better understanding of breast cancer, and has identified avenues for further research. Identifying and quantifying the proteins expressed in different patients will thus assist doctors in tailoring disease treatment to individuals by predicting the efficacy of different drugs for each specific individual (figure 1.2) [4]. Such personalized treatment is the ultimate goal of clinical proteomics and of our sponsor, the Spielberg Family Center for Applied Proteomics. Since there are more than 25,000 genes in the human genome and each gene can yield multiple proteins, fast and accurate identification and quantification are not simple tasks. However, recent advances in mass spectrometry (MS) have greatly increased our ability to identify proteins [1].
Combined with an accurate quantification algorithm, these advances make personalized medicine based on proteomic analysis a feasible goal.

1.2 The Spielberg Center and Our Project

The Spielberg Center, based at Cedars-Sinai Hospital in Los Angeles, California, consists of a group of researchers aiming to improve disease treatment by applying proteomic techniques to clinical samples in order to better understand the biology of cancer and other diseases.

This involves establishing panels of biomarkers, predicting response to a given therapy, and identifying targets for potential new drugs [5]. The Spielberg Center's focus is on the reliable extraction and comparison of qualitative and quantitative information from high-resolution mass spectrometric proteomics data. The LC/MS approach is highly efficient, has high throughput, and can be used to simultaneously monitor a large number of peptides [6]. In brief, it involves the digestion of proteins into peptides, which are coarsely separated by HPLC (High Pressure/Performance Liquid Chromatography) into fractions according to their affinity to water. The fractions are vaporized and then ionized, that is, electrically charged, and further separated by MS according to their mass/charge ratio (see Appendix A).

Figure 1.3: Data generated by a single LC/MS experiment. Proteins are first digested by the enzyme trypsin and separated by liquid chromatography according to their affinity to water (y-axis). The separated fractions are ionized, vaporized, and subjected to mass spectrometry, and thus separated by mass/charge ratio (x-axis). Detected intensity is depicted using a colorscale as shown. Points that have been selected for a second round of mass spectrometry, which might lead them to be associated with a peptide sequence, are marked with white specks.

The Spielberg Center's work on the quantification of peptides has primarily involved studying the data generated by LC/MS experiments. An example of data from one experiment is given in figure 1.3, where certain points marked with white spots indicate molecules that have been subjected to a second round of mass spectrometry (MS2) in an attempt to identify their sequence. Since the intensity recorded by the mass spectrometer is related to

the quantity of peptide present in the sample, it should be possible to calculate the relative abundance of a given peptide in different samples by comparing LC/MS images. The problem presented to the RIPS team is the exploration of algorithms to quickly and accurately estimate ratios of peptide abundances, given LC/MS images and the locations and amino acid sequences of identified peptides in these images.

1.3 Challenges and Our Approach

The most daunting challenge for our project is the complexity of clinical protein samples. The image shown in figure 1.3 was generated by a sample containing only six proteins. Realistically, a quantification algorithm must be able to simultaneously quantify peptides generated by clinical samples, which may contain hundreds or thousands of proteins. For instance, much diagnostic attention is paid to the proteins in blood plasma (which might represent the largest version of the human proteome); the plasma proteome contains about 40,000 plasma proteins, 500,000 proteins from tissues, and 10,000,000 clonal forms of immunoglobulin [7]. The large dynamic range of clinical samples, combined with the low abundance of potentially interesting proteins, is another complication. Successful algorithms, therefore, should be versatile enough to handle vast numbers of peptides simultaneously, without prohibitive computational requirements or loss of reliability. Another challenge we dealt with involved peptide isotopes. Elements in nature may exist as isotopes, alternative forms of an atom containing the same number of protons but different numbers of neutrons. For a macromolecule such as a peptide, an isotope is a form in which some of the component atoms are in a different isotopic form. For example, the most abundant form of a peptide is frequently the form in which one of its carbon atoms is the C-13 isotope and all the others are of the usual C-12 form.
Ionized peptides may exist as different isotopes or in different charge states. Since mass spectrometry separates particles based on mass/charge ratio, forms of the same peptide may be found in a variety of different locations in an LC/MS image. Since our ultimate goal is quantification of the peptide, all of these locations need to be considered to estimate the quantity of the peptide in question. In addition, due to the continuous nature of liquid chromatography, peptides are found over a range of retention times, but the width of this range varies for each peptide, and even between experiments for the same peptide. Moreover, noise is an innate part of mass spectrometry, and a successful algorithm must be able to accurately detect a signal amidst a noisy background. Finally, once an algorithm provides accurate information about peptide quantities, we must still determine the best way to calculate quantities of proteins in order to obtain biologically significant information.

Our first steps in the project involved data exploration and visualization. Next, we attempted some simplistic initial approaches to quantification before moving on to a modular quantification algorithm; this provides us with a framework for refining each module of the process and comparing different versions of the same module. Data filtering and optimization of the performance of combinations of different versions of each module have led to accurate results for purified protein mixes. Preliminary work on a dataset generated using a human epithelial carcinoma cell line has also yielded promising results. The ratios estimated by our algorithm for peptides generated by the purified protein mixes showed very little spread, but the medians deviated from the expected ratios for large concentration differences.
The systematic nature of the deviation leads us to believe that the cause may lie in the nature of mass spectrometry and an inherent nonlinearity in intensity measurements depending on the total amount of substance being analyzed.
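The isotope and charge-state bookkeeping described in Section 1.3 can be made concrete. The following is a minimal Python sketch of where the isotopic forms of one peptide land on the mass/charge axis (our own illustration, not the project's Matlab code; the neutron spacing and proton mass are standard constants, and the function name is ours):

```python
# Where the isotopic forms of one peptide appear on the mass/charge axis.
# For a peptide of monoisotopic mass M carrying charge z, the k-th isotopic
# peak (k extra neutrons) is observed near (M + k*1.00335 + z*1.00728) / z.

NEUTRON_SPACING = 1.00335  # average mass gap between isotopic peaks, in Da
PROTON_MASS = 1.00728      # mass of the charge-carrying proton, in Da

def isotope_mz(mass, charge, n_isotopes=4):
    """Theoretical m/z positions of the first n isotopic peaks."""
    return [(mass + k * NEUTRON_SPACING + charge * PROTON_MASS) / charge
            for k in range(n_isotopes)]

# A 1000 Da peptide seen at charge 1 and at charge 2 lands in two different
# regions of the image, near 1001 and 501 m/z respectively.
print(isotope_mz(1000.0, 1))   # isotope spacing ~1.003 m/z
print(isotope_mz(1000.0, 2))   # isotope spacing ~0.502 m/z
```

A peptide observed at charge 2 thus shows isotopic peaks spaced about half as far apart as at charge 1, which is one reason forms of the same peptide are found in several distinct places in an LC/MS image.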

Chapter 2

Data

2.1 Generation

Purified Protein Mixes

Our sponsors obtained the 6 purified proteins listed in table 2.1 from Michrom Bioresources Inc. These proteins had been reduced, alkylated, digested using trypsin, and dried. The 6 Protein Mix was created by mixing 1 µM of each of the digested proteins in a solution of 3 M GuHCl and 0.1% formic acid. Similarly, the 5 digested yeast proteins of table 2.2 were resuspended in 95% water, 5% acetonitrile, and 0.1% formic acid to generate the 5 Protein Mix. Samples of the 5 Protein Mix and 6 Protein Mix were injected into an inline liquid chromatography/Fourier transform ion cyclotron resonance mass spectrometer (LC/FT-ICR-MS). Data from eight replicate runs of the 6 Protein Mix, and four runs (each of a different quantity) of the 5 Protein Mix, were provided to us.

SILAC Data

Stable Isotope Labeling with Amino acids in Cell culture (SILAC) is a method for in vivo incorporation of an isotope label into proteins for mass spectrometry-based quantification. Two cell culture media are prepared, containing light and heavy versions of the amino acid lysine respectively; that is, the six constituent carbon atoms are either all in the C-12 form (light lysine) or all in the C-13 form (heavy lysine). Cells from the human epithelial carcinoma cell line A431 were grown on these media. These cells overexpress the cell surface protein EGFR and are thus a good mimic for a variety of epithelial cancers, such as lung and prostate cancer. Since lysine cannot be synthesized in vivo and must be gathered from the environment, proteins within any cells grown in these media will be labeled with a heavy or light lysine isotope. Cells in the two media are considered to be otherwise identical and thus should generate the same set of proteins. These proteins will differ only in their masses: each constituent heavy lysine leads to a mass difference of 6 Da. The two populations of cells were mixed together in different known proportions.
The mixed samples were subjected to the same analysis as the purified protein mixes. Data from nine such LC/MS experiments were provided to us.
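Because each heavy lysine adds 6 Da, the expected mass offset between the heavy and light forms of a peptide depends only on its lysine count. A small Python sketch of that bookkeeping (our own illustration; the example sequence is arbitrary and the function names are ours):

```python
HEAVY_LYSINE_SHIFT = 6.0  # Da per labeled lysine, as stated for C-13 lysine

def silac_mass_shift(sequence):
    """Mass offset between heavy- and light-labeled forms of a peptide.

    The identification files mark a heavy lysine as K followed by an
    apostrophe; either way, the shift depends only on the lysine count.
    """
    return sequence.upper().count("K") * HEAVY_LYSINE_SHIFT

def heavy_mz_offset(sequence, charge):
    """Offset along the mass/charge axis for a peptide of the given charge."""
    return silac_mass_shift(sequence) / charge

print(heavy_mz_offset("LVNELTEFAK", 2))  # one lysine at charge 2 -> 3.0 m/z
```

This is why the heavy and light features of a peptide appear at the same retention time but at different mass/charge values, with the m/z gap shrinking as the charge grows.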

Name of Protein                    Amount (in pmol)
Bovine catalase                    500
Bovine lactoperoxidase             500
Bovine serum albumin               500
Bovine apotransferrin              500
Bovine glutamate dehydrogenase     500
Bovine carbonic anhydrase          500

Table 2.1: Composition of the 6 Protein Mix

Name of Protein                    Amount (in pmol)
Alcohol Dehydrogenase              500
Enolase                            50
Triosephosphate isomerase          5
Hexokinase                         0.5
Phosphoglucose isomerase           0.05

Table 2.2: Composition of the 5 Protein Mix

2.2 Formatting

For each sample, we received two types of data from our sponsors: intensity information at millions of different retention time and mass/charge coordinates, and sets of identifications, each consisting of a retention time and mass/charge coordinate, a peptide sequence, the confidence in the peptide identification (from PeptideProphet [8]), and the protein from which the peptide was likely derived. The latter information was obtained via the Computational Proteomics Analysis System [9]. For the SILAC data, if a constituent lysine residue was in the heavy form, the symbol for lysine (K) in the given sequence is followed by an apostrophe. The intensity data were visualized in two ways using tools from the ProteoWizard library provided to us by the Spielberg Center. The msaccess and mspicture programs were used to create two-dimensional images, as shown in figure 1.3, and SeeMS was used to interactively examine individual mass spectra. The program msaccess is also useful for extracting smaller subsets of the data. When exploring a particular feature, we are only interested in the points near an identified point. Since each dataset ranges from 300 MB to 1.3 GB in size, it is inconvenient to store all the data in Matlab memory while running our algorithms. Upper and lower limits for mass/charge and retention time can be provided to msaccess, which then returns the data in the specified region. We have written a Matlab function (Extractslice) to import such subsets of data extracted by msaccess into a Matlab array.
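The role of msaccess and Extractslice, pulling only the points inside a retention time and mass/charge window, can be imitated in a few lines. This is a Python stand-in for the Matlab function, not the project's actual code:

```python
def extract_slice(points, rt_lo, rt_hi, mz_lo, mz_hi):
    """Return the (retention time, m/z, intensity) triples inside a window.

    `points` stands in for the full LC/MS dataset; keeping only a small
    window avoids holding the 300 MB to 1.3 GB file in memory at once.
    """
    return [(rt, mz, i) for (rt, mz, i) in points
            if rt_lo <= rt <= rt_hi and mz_lo <= mz <= mz_hi]

# Toy data: two points near an identification and one far away.
data = [(100.0, 500.1, 8e4), (105.0, 500.2, 9e4), (400.0, 700.0, 1e3)]
print(extract_slice(data, 90, 110, 499, 501))  # keeps the first two points
```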
A companion function, getdata, obtains such subsets of data and stores them as text files.

2.3 Visualization

To gain a better understanding of the data, and in particular the characteristics of a typical peak, we graphed small subsets of the data from the purified protein mixes in three dimensions using Matlab. We went on to compare graphs centered at an identified point to

Figure 2.1: Subsets of data, centered around a randomly selected point (left) and an identified point (right). The red lines are at the mass/charge and retention time values of the center. The identification is located near the leading retention time edge of the signal points, and the ranges of observed intensities in the two graphs differ markedly.

Figure 2.2: A typical feature associated with an identified peptide. Discrete mass/charge values correspond to isotopes; points that are presumably noise can be seen between isotope signals.

those centered at arbitrary locations (figure 2.1). Note that the retention time coordinate at which a peptide is identified does not represent the center of the corresponding feature on the retention time axis; due to the settings of the tandem mass spectrometer, identifications generally occur near the leading retention time edge of features (figure 2.1). A typical feature identified as representing a peptide is shown in figure 2.2. Each copy of

a peptide can be found in a variety of isotopic forms (as mentioned in Chapter 1). The most abundant form of the peptide depends on the size of the peptide and the distribution of isotopes for each constituent atom. Frequently, this is the form containing a single atom with an extra neutron. The signal associated with each isotope is spread along the retention time axis, as is expected for liquid chromatography. However, this spread is not the same for different peptides, or even for the same peptide across experiments. In addition, there is no easy way to align retention times for different experiments: the same peptide may elute off the column at somewhat different times even in replicate runs. Methods for choosing relevant retention time values for a particular peptide are complicated by noise, which also varies within and between experiments. Due to time constraints, we were unable to explore the SILAC dataset as thoroughly as the others. However, features associated with heavy and light versions of the same peptide were found in the same region along the retention time axis, at different mass/charge values, as expected. In general, the features showed characteristics similar to those for the purified protein mixes. The SILAC dataset was notably more complex than the other sets, as exhibited by its dense distribution of features.

Chapter 3

Initial Quantification Approaches

Initially, we developed some simple algorithms to calculate a measure of abundance of a peptide in a single LC/MS experiment. Since we are focusing on determining relative amounts of peptides, these algorithms attempt to calculate a number linearly related to the absolute amount of a peptide, so that taking ratios of the outputs for different runs yields the relative abundance. The algorithms described in this section are simplistic, but they helped guide our future work and allowed us to familiarize ourselves with the data.

3.1 Naïve Approaches

We began by coding three simple algorithms. These algorithms were unlikely to be effective, but allowed us to familiarize ourselves with the data and its processing. All of these algorithms shared a similar structure: they took as input a vector of mass/charge values and a corresponding vector of intensities. Multiple retention times were not considered. Using parameters such as a radius around an identified point, they produce a single number that represents part or all of the given intensities. The confidence in the peptide identifications was not taken into account.

3.1.1 Intensity at a Coordinate

We have been given the mass/charge coordinates of identified peptides. Our first naïve approach simply returns the intensity at the given coordinate for an identified peptide. Unfortunately, due to the methods by which mass/charge is measured, although the peptide coordinates are often close to regions of interest, the exact mass/charge coordinate provided may have an intensity that is not a good estimate of the true peptide intensity.

3.1.2 Summing Intensities

The next algorithm sums the intensities at a fixed number of the points to the left and the right of a specified mass/charge value.
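The summing approach just described can be sketched in a few lines (a hypothetical Python version; the function and variable names are ours, not the project's Matlab code):

```python
def sum_around(mz_values, intensities, center_mz, n_points):
    """Sum intensities at the point closest to center_mz plus the
    n_points nearest sampled points on each side of it."""
    # index of the sampled m/z closest to the identified coordinate
    c = min(range(len(mz_values)), key=lambda i: abs(mz_values[i] - center_mz))
    lo = max(0, c - n_points)
    hi = min(len(mz_values), c + n_points + 1)
    return sum(intensities[lo:hi])

mz = [500.0, 500.1, 500.2, 500.3, 500.4]
inten = [10.0, 50.0, 200.0, 40.0, 5.0]
print(sum_around(mz, inten, 500.2, 1))  # 50 + 200 + 40 = 290
```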
This algorithm does not take noise into account.

3.1.3 Area under a Fitted Curve

With the expectation that more sophisticated algorithms would incorporate fitting and integrating under curves fitted to the data, we decided to code a simplified version of this approach. The user provides this algorithm with a vector of mass/charge values, a

Figure 3.1: A naïve approach. Area is calculated using the trapezoidal rule.

vector of intensities, a mass/charge coordinate, and a radius. The code then computes an approximation of the area under the mass/charge vs. intensity curve via linear interpolation (the trapezoidal rule), as shown in figure 3.1. This code ignores the fact that not all peaks in the same retention time slice should be attributed to the same feature.

3.2 Multiple Retention Times

Since peptides elute off the liquid chromatography column in a continuous process, considering a single retention time severely limits the ability to determine peptide abundances. The next set of algorithms we developed included multiple slices of data along both the mass/charge and retention time axes.

3.2.1 Summing Intensities

This algorithm extends the naïve summing algorithm to include intensities along the retention time axis as well. The user specifies a center (corresponding to an identified peptide), a retention time radius, a mass/charge radius, and an intensity threshold. The code then sums the intensities above the threshold for points located within the rectangle generated by the specified radii around the given center. This is also the first method that attempts to distinguish between noise and peaks of interest, by setting a threshold below which points are ignored. One problem with this algorithm is the difficulty of determining the optimum radii for our analysis. These radii are fixed in our algorithm, but in actual data they vary for different peptides. Since identifications are generally located near the leading retention time edge of the feature, the upper and lower retention time radii should be different. Also, this algorithm does not consider the fact that signal points should only be found in discrete regions along the mass/charge axis, due to the mass properties of peptide isotopes.
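The rectangle-and-threshold method of Section 3.2.1 amounts to the following sketch (hypothetical Python; the names and sample points are ours):

```python
def sum_in_rectangle(points, center_rt, center_mz, rt_radius, mz_radius,
                     threshold):
    """Sum intensities above `threshold` inside the rectangle spanned by the
    given radii around an identified (retention time, m/z) coordinate."""
    total = 0.0
    for rt, mz, intensity in points:
        if (abs(rt - center_rt) <= rt_radius
                and abs(mz - center_mz) <= mz_radius
                and intensity > threshold):
            total += intensity
    return total

pts = [(100, 500.0, 1e5), (101, 500.5, 2e5),   # two signal points
       (100, 500.2, 50.0),                     # below threshold: noise
       (300, 700.0, 9e5)]                      # outside the rectangle
print(sum_in_rectangle(pts, 100, 500.0, 5, 1.0, 100.0))  # 1e5 + 2e5
```

Note the fixed radii and the single global threshold: exactly the limitations discussed above.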

3.2.2 Area under a Fitted Curve

We expect the intensities along the retention time axis, for each mass/charge value associated with a feature, to have a roughly Gaussian shape. We have observed that the spread of the Gaussian curve is different for each peptide. Moreover, peptide features are known to show discretization along the mass/charge axis. Therefore, summing the intensities within a fixed rectangle might include points unassociated with the feature of interest and is thus likely to yield inaccurate results. We first attempted to correct this with a program that fits a Gaussian curve to the intensities along the retention time axis for each mass/charge value, ignoring intensities below a certain threshold.

3.3 Areas for Improvement

Tests of our initial approaches suggested various areas for improvement. For instance, we ignored both the known distribution of isotopes and the fact that the level of noise varies along both the retention time and mass/charge axes. A more sophisticated approach would need to take these and other related factors into account.
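The Gaussian fit of Section 3.2.2 can be sketched without a nonlinear optimizer: taking logarithms turns a Gaussian into a parabola, which ordinary least squares fits directly. This is one possible implementation under that assumption, not necessarily the one used in the project's Matlab code:

```python
import numpy as np

def fit_gaussian(rt, intensity, threshold=0.0):
    """Fit A*exp(-(t-mu)^2 / (2*sigma^2)) to points above `threshold`
    by fitting a parabola to log-intensity, then return (mu, sigma, area)."""
    rt = np.asarray(rt, float)
    y = np.asarray(intensity, float)
    keep = y > threshold                    # ignore sub-threshold points
    c2, c1, c0 = np.polyfit(rt[keep], np.log(y[keep]), 2)
    mu = -c1 / (2.0 * c2)                   # peak center
    sigma = np.sqrt(-1.0 / (2.0 * c2))      # peak spread
    amp = np.exp(c0 - c1 ** 2 / (4.0 * c2))
    area = amp * sigma * np.sqrt(2.0 * np.pi)  # area under the fitted curve
    return mu, sigma, area

# Noise-free synthetic elution profile: the fit recovers mu and sigma.
t = np.linspace(90, 110, 41)
profile = 1e5 * np.exp(-(t - 100.0) ** 2 / (2 * 2.5 ** 2))
mu, sigma, area = fit_gaussian(t, profile, threshold=1.0)
print(round(mu, 2), round(sigma, 2))  # ~100.0, ~2.5
```

The log-parabola trick is fast but weights low-intensity points heavily once logged, which is one reason a threshold on the input intensities matters.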


Chapter 4

A Modular Quantification Approach

Because of the complexity of the peptide quantification problem, and to ensure the flexibility required to make future changes, we have separated the peptide quantification algorithm into separate modules (figure 4.1). We have also decided to consider only peptide identifications with a PeptideProphet value of at least 0.9. This ensures that only the features most likely associated with peptide sequences are quantified, since identifications with lower PeptideProphet values are more likely to be incorrect.

Figure 4.1: Stages of the modular quantification algorithm.

4.1 Purified Protein Mixes

Starting with any identified peptide, the modular algorithm uses the given location information and the calculated theoretical mass/charge value to extract a two-dimensional neighborhood. Within this region, points associated with isotopic forms of the peptide are found, and their intensities are combined. In this step, we are generating an approximation of the elution profile we would expect for a pure solution of the peptide. The algorithm then retrieves a local estimate of the noise in the neighborhood and uses it to limit the data along the retention time axis. A curve is fit to this filtered data, and the algorithm then computes a quantity for the given peptide by taking the area under the fitted curve. Each step of the

process is a separate module. We have conceived and implemented different approaches for each of the steps.

4.1.1 Extraction of Two-Dimensional Neighborhood

Associated code: main

Since the sequence of an identified peptide is known, we have a very accurate estimate of its mass. As mentioned in Chapter 2, the identifications received from our sponsor contain the charge of each identified species. Thus, for any given identified peptide, the theoretical mass/charge ratio is known. However, the coordinates at which the peptide is identified do not represent the center of a feature; due to the settings of the tandem mass spectrometer, identifications generally occur on the leading retention time edge of the feature associated with the identified peptide (figure 2.1). For an identified peptide, the first module of our algorithm takes in the retention time coordinate of the point of identification and uses the theoretical mass/charge ratio to extract a two-dimensional subset of the data (figure 4.2). Upper and lower distances along retention time and mass/charge, chosen by averaging over multiple features, are set to ensure that the peak will not be missed. The distances chosen are 300 seconds in each direction along the retention time axis and 10 Da in each direction along the mass/charge axis.

Figure 4.2: Extraction of a two-dimensional neighborhood for a given peptide. A subset of data extracted for a peptide of theoretical mass/charge Daltons, with radii in the mass/charge ratio of 2 Da and 11 Da (lower and upper, respectively) and retention time radii of 25 seconds and 45 seconds, respectively.

In this step of the algorithm, there are no measures to ensure that additional peaks are not extracted as well. The purpose of this step is to let us focus on a smaller area in order to minimize running time.
The parameters used in this step have an important impact on the subsequent steps, since we must ensure that the entire peak is contained in the extracted neighborhood.
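The theoretical mass/charge ratio used by this module follows directly from the peptide's sequence: sum the monoisotopic residue masses, add one water for the termini, add one proton per charge, and divide by the charge. A sketch using standard monoisotopic residue masses (illustrative Python; the project's Matlab code is not shown in the report):

```python
# Standard monoisotopic residue masses in Da (truncated to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the peptide termini
PROTON = 1.00728   # mass of each charge-carrying proton

def theoretical_mz(sequence, charge):
    """Theoretical mass/charge ratio of a peptide with the given charge."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (mass + charge * PROTON) / charge

print(round(theoretical_mz("PEPTIDE", 1), 3))  # ~800.367
```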

4.1.2 Consolidation of Isotopes

Figure 4.3: Consolidation of isotopes. From the extracted two-dimensional subset, signal points are found and selected out. (a) A zoomed-in view of the data shown in figure 4.2. (b) and (c) Selected points within 25 ppm of theoretical mass/charge values, with and without other points, respectively. (d) Intensities for isotopes at the same retention time are summed to yield a master peak.

Quantization of Signal Points

Associated code: Squish and Squish2

The features associated with peptides exhibit discretization along the mass/charge axis due to the presence of different isotopic forms. Using the known charge state of the peptide, the sequence of the peptide, and the fact that the difference between isotope masses is roughly 1 Da (the mass of a neutron), the theoretical mass/charge ratios for the isotopes can be calculated. However, the methods used by the mass spectrometer to determine mass/charge are only approximate, and therefore the observed mass/charge for each isotope may differ from the theoretical value, usually by less than 5 ppm. The basic version of this module (Squish) searches within a strip of 25 ppm around the theoretical mass/charge value for each isotope, and the most intense point in a particular strip for each retention time is centered at the theoretical value. Although mass spectrometry is accurate to within 5 ppm, we used strips of 25 ppm to ensure that the signal is not missed. Also, since a maximum value is taken, there is only a small probability that the point selected to be centered at the theoretical value will be noise instead of signal. The quantization process is shown in figure

4.3(a-c). At this point, signal points associated with the same isotopic form are assigned the theoretical mass/charge value of that form for the remaining steps of the algorithm.

In the second version of this module (Squish2), signal points are also required to be higher in intensity than a calculated noise threshold (see Section 4.4). This further consideration is likely to improve the calculated ratios by removing noise at this basic level.

Summing of Isotope Data

Associated code: MetaSquish

Since isotopes have identical chemical properties, it is not necessary to quantify them separately. Therefore, to represent the peptide, the intensities of the isotopes at each retention time are summed to yield a master peak, as shown in figure 4.3(d).

Estimation of Noise

Associated code: NoiseLevel, NoiseLevel2 and NoiseLevel3

The noise level in an LC/MS image varies depending on retention time and mass/charge. Since a global noise level does not account for this variation, this module was written to calculate a local noise estimate for each feature. In the second module, we described the spread of signal points around the theoretical mass/charge values for a given peptide. These signal points are found in regularly spaced strips, separated by noise strips. The intensities of points in these noise strips were used to calculate a noise level. The first two versions of this module, NoiseLevel and NoiseLevel2, compute the mean and median intensity, respectively, in each noise strip and sum them. We decided to sum the noise values since the signal points have also been summed in the previous module. The third version of this module (NoiseLevel3) is much simpler: it finds the most intense point within the peak of interest, and a certain percentage of that intensity is used as the noise threshold. Validation tests led us to set 10% as the percentage value.
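The quantization and summing steps can be sketched in Python as follows (a hypothetical rendering of Squish and MetaSquish, not the project's Matlab code; the data layout is an (N, 3) array of (retention time, m/z, intensity), and 1.00335 Da is the approximate C-13/C-12 mass difference):

```python
import numpy as np

ISOTOPE_SPACING_DA = 1.00335  # approximate mass difference between isotopes

def squish(points, mono_mz, charge, n_isotopes=4, ppm=25.0):
    """For each isotope, search a strip of +/- ppm around its theoretical
    m/z; at each retention time keep only the most intense point in the
    strip and assign it the theoretical m/z (the Squish step)."""
    out = []
    for k in range(n_isotopes):
        iso_mz = mono_mz + k * ISOTOPE_SPACING_DA / charge
        tol = iso_mz * ppm * 1e-6
        strip = points[np.abs(points[:, 1] - iso_mz) <= tol]
        for rt in np.unique(strip[:, 0]):
            at_rt = strip[strip[:, 0] == rt]
            out.append((rt, iso_mz, at_rt[:, 2].max()))
    return np.array(out)

def metasquish(squished):
    """Sum isotope intensities at each retention time (MetaSquish),
    yielding the master peak."""
    rts = np.unique(squished[:, 0])
    return np.array([(rt, squished[squished[:, 0] == rt, 2].sum())
                     for rt in rts])

# Two isotopes of a charge-1 peptide observed at one retention time
data = np.array([[100.0, 500.0010, 10.0],
                 [100.0, 500.0020,  7.0],
                 [100.0, 501.0033,  5.0]])
master = metasquish(squish(data, mono_mz=500.0, charge=1, n_isotopes=2))
```

Squish2 would differ only in additionally requiring the selected intensity to exceed the local noise threshold before the point is kept.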
It is also possible to compute a median intensity across noise strips in the second module (Consolidation of Isotopes) of the algorithm and bypass the noise module completely.

Bounding of Retention Time Comparison Approaches

Associated code: RTCut1, RTCut2 and RTCut3

So far the data have not been bounded in the retention time direction. However, this is necessary to ensure that only the relevant points are considered in our curve-fitting and quantification modules, and to exclude noisy tails and nearby peaks. This module takes in the filtered data from the third module (the master peak) and a value for the local noise threshold, and outputs a matrix of the relevant data as chosen by specified parameters; these parameters include the amount of data to be examined at each step and the criterion for deciding whether a subset of the data is signal.

The first step of this module finds the most intense point in the data and considers it the center of the peak of interest. The algorithm then examines a segment of the data to the immediate right of the center and determines whether the data are mostly signal. The criterion for this decision is different for each version of the module. If the data are mostly signal, the algorithm moves on to the next segment of data to the right. The search continues until a segment not fulfilling the criterion is found. The first point of this segment is then considered the right boundary of the area of interest. The same process is repeated

Figure 4.4: Bounding of retention time. A threshold based on a percentage of the most intense point is calculated. Data are examined by the module to find where most data points lie below the threshold, and a boundary (shown in red) is drawn there.

to the left of the center, examining segments and shifting left until a segment fails the chosen criterion, and then identifying that point as the left boundary of the data. If no such boundary is found in one or the other direction, the algorithm draws the boundary at the corresponding edge of the input data.

The first version of this module (RTCut1) considers points to be signal if their intensities are above the noise threshold (figure 4.4). The second version (RTCut2) examines segments of data until intensities fall below the noise threshold, and then continues until intensities begin to rise above it again. The boundaries are thus drawn as far from the center as possible without including another peak. The third version (RTCut3) is similar to the second, but differs in that it decides that the next peak has been found if the intensities of a majority of points in a segment are increasing.

Frequently, the data input to the Bounding of Retention Time module contain large gaps along retention time with no intensities, and an unrelated feature may exist beyond such a gap. In this situation, RTCut1, not finding any points below the threshold given to it, would include the unrelated feature when it passes data on to the next module. To avoid this, RTCut1 has a gap parameter, currently set to 10 Da; if the intensities of the given data are always higher than the threshold, the code checks for gaps and cuts the data there if it finds one.

There are a few potential problems with the current versions of this module. The parameters for the module still need to be optimized; at sub-optimal values, these parameters might render the module ineffective.
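The outward boundary search shared by the RTCut variants can be sketched as below (hypothetical Python, not the project's code; the five-point segments and the four-out-of-five rule follow the description of RTCut1, and the gap check is omitted for brevity):

```python
def rtcut1(intensities, noise_threshold, seg=5, min_below=4):
    """Walk outward from the most intense point and draw a boundary at
    the first segment in which at least min_below of seg consecutive
    points fall below the noise threshold (the RTCut1 criterion).
    Returns inclusive (left, right) indices into intensities."""
    center = max(range(len(intensities)), key=lambda i: intensities[i])

    def scan(indices):
        # examine successive segments moving away from the center
        for start in range(0, len(indices), seg):
            chunk = indices[start:start + seg]
            below = sum(intensities[i] < noise_threshold for i in chunk)
            if below >= min_below:
                return chunk[0]  # first point of the failing segment
        return indices[-1] if indices else None  # boundary at data edge

    right = scan(list(range(center + 1, len(intensities))))
    left = scan(list(range(center - 1, -1, -1)))
    return (left if left is not None else center,
            right if right is not None else center)

# A peak with a noisy right tail: the right boundary lands just past it
peak = [1, 1, 2, 9, 20, 8, 2, 1, 1, 1, 1, 1, 1, 1]
bounds = rtcut1(peak, noise_threshold=2.5)
```

RTCut2 and RTCut3 would replace the below-threshold test inside the scan with their own criteria, leaving the outward walk unchanged.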
Also, it is possible that the most intense point in the data does not correspond to the center of the peak of interest; this might occur if there is a highly intense noise point or if the data happen to include another peak. A newer version of RTCut1 has been written in which the highest point within a certain distance of the point of identification is considered to be the center. This code is included among our deliverables, but was not used for the evaluations in Chapters 5 and 6.

Fingerprinting

Associated code: fingerprint, CountCarbons, Score, Average, RTCutCarbs and RemoveNoise

We have also developed an approach called fingerprinting, which combines the Consolidation of Isotopes and Bounding of Retention Time modules. Since we know the relative abundances of C-12 and C-13 in nature, and the number of carbon atoms in each identified peptide, it is possible to calculate the expected distribution of peptide isotopes. We call this distribution the fingerprint for the given peptide. In each retention time slice, points with intensities below 10% of the maximum intensity are removed in the RemoveNoise submodule. At each retention time, intensities within a radius of 5 ppm of the theoretical mass/charge values for the first six isotopes are averaged in the submodule Average. At this point we have exactly six intensities for each retention time. These data are compared to the fingerprint by computing a cross-correlation score in the submodule Score. The submodule RTCutCarbs limits the data around the retention time slice with the maximum cross-correlation score, cutting when a certain number of consecutive slices with scores below 10% of the maximum score are found. This is somewhat analogous to RTCut1, except that the vertical axis represents cross-correlation score rather than intensity, and the bounding rule is different.

Curve Fitting

Associated code: mygaussfit [10], GammaFit and mygammafit

Since liquid chromatography is a continuous process, it is likely that the most accurate way to determine the total amount of peptide eluted is to extrapolate a smooth curve from the master peak. Due to the random nature of diffusion through the chromatographic column, we initially hypothesized that the master peak from the Consolidation of Isotopes module would be Gaussian with respect to retention time. We therefore used the program mygaussfit, which uses the method of least squares to fit a second-degree polynomial to the natural log of the data. However, we noticed that peaks were frequently right-skewed, with very steep left sides.
This asymmetry was more pronounced at higher concentrations of peptides. Asymmetry of chromatographic peaks is not unusual [11]. We believe that the right tail may be due to the physical limitations caused by the flow rate off the liquid chromatography column.

To accommodate the skewed nature of our peaks, we can use GammaFit and mygammafit to fit a gamma curve to the data using a nonlinear regression function in Matlab, nlinfit (figure 4.5). There are some restrictions to using these functions. First, the data must be scaled along the intensity axis to reduce the intensity values; this is important because if the intensity values are high, the fitting parameters for the gamma curve become very large and Matlab rounds them to infinity. Secondly, the gamma curve must be modified to include a shifting parameter; this is necessary because a gamma curve is supported on [0, ∞), so the data must be near zero to ensure a proper fit.

Despite these modifications, there are cases when the function is unable to satisfactorily fit a gamma curve; we were unable to determine why this happens. In these cases the code uses mygammafit to fit a Gaussian curve. The Gaussian curve is a special form of the gamma curve and in many cases yields accurate fits. If the data input into the curve-fitting module do not show peak-like characteristics, attempts at curve fitting (gamma or Gaussian) are likely to fail. In this case, the area under the fitted curve is not estimated, NaN is returned, and the results for this peptide are excluded from any evaluations. Additionally, it is important to note that in both curve fits (gamma and Gaussian), a threshold is set as a proportion of the highest intensity observed in the input data; points with intensities below the threshold are not used by the curve-fitting algorithm.
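To make the fitting concrete, here is a small Python sketch of the Gaussian branch (mygaussfit fits a second-degree polynomial to the natural log of the data; the function name, threshold handling, and synthetic peak below are our own illustration, not the project's Matlab code). The gamma branch follows the same pattern, but with a nonlinear solver in place of the polynomial fit:

```python
import numpy as np

def gauss_fit_area(t, y, threshold=0.1):
    """Fit a Gaussian by least squares on the log of the data and return
    (mu, sigma, area under the full fitted curve). Points below
    threshold * max(y) are excluded, as described in the text."""
    keep = y > threshold * y.max()
    # ln y = c2*t^2 + c1*t + c0 for a Gaussian, so fit a quadratic
    c2, c1, c0 = np.polyfit(t[keep], np.log(y[keep]), 2)
    sigma = np.sqrt(-1.0 / (2.0 * c2))
    mu = -c1 / (2.0 * c2)
    amplitude = np.exp(c0 - c1**2 / (4.0 * c2))
    area = amplitude * sigma * np.sqrt(2.0 * np.pi)
    return mu, sigma, area

# Synthetic noiseless peak: amplitude 100, mu = 30 s, sigma = 5 s
t = np.linspace(0.0, 60.0, 121)
y = 100.0 * np.exp(-(t - 30.0)**2 / (2.0 * 5.0**2))
mu, sigma, area = gauss_fit_area(t, y)
```

Since the log of a Gaussian is a quadratic, this least-squares problem is linear and always solvable; the area under the full fitted curve, amplitude × σ × √(2π), is the quantity passed to the next module.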

Figure 4.5: Fitting a gamma curve. A gamma curve is fit to the data using a nonlinear regression function in Matlab. Points below 20% of the most intense point are not considered when fitting the curve. The curve fits this example quite well.

The threshold to be used can be changed; for our evaluations, 0.1 is used. The optimal threshold likely depends on the noise level in the given LC/MS image.

Quantification

Associated code: GammaFit, mygaussfit, trapz and sum

Once we have obtained a curve fitted to our data, the area under the curve can be calculated and used to represent the quantity of a peptide in the original sample. The GammaFit and mygaussfit versions of the Curve Fitting module return the total area under the fitted curve. Alternatively, the trapezoidal method can be applied to the fragment of the curve bounded by the points selected by the previous module, using the built-in Matlab function trapz. In another version of this module, the curve-fitting module is bypassed and the Matlab function sum is used to sum the intensities of the points selected by the Bounding of Retention Time module. Ideally, the number generated by our algorithm is a relative measure of the absolute quantity of the peptide in the original sample. Since our focus is on relative quantification, we run our algorithm on two different datasets and obtain a ratio for each peptide.

4.2 SILAC Dataset

Associated code: silac

The modular algorithm used on the purified protein mixes was adapted to work on SILAC data. For each identification, the algorithm first determines whether the peptide is in the heavy or light form. As described in Chapter 2, the presence of an apostrophe after the symbol for lysine (K) in a peptide sequence denotes that the lysine was in the heavy form. An identified peptide's corresponding light or heavy partner will be located a certain distance to the left or right, respectively, of the identified form on the mass/charge axis.
The algorithm calculates this distance using the number of lysines in the peptide sequence and the common charge state of the two forms. For example, suppose an identified heavy peptide contains two lysines. Because each lysine contributes a mass difference of 6 Da, the

partner will be lighter by 12 Da. If the heavy and light forms both have charge 2, the light form can be found 12/2 = 6 Da to the left of the identified form. More generally, the distance along mass/charge between the heavy and light forms of a peptide is given by:

distance = (6 × number of constituent lysines) / charge

Our problem is complicated by the fact that the heavy and light forms are each associated with several isotopes, due to the isotopes of amino acids other than lysine. Our methods for dealing with this are similar to those described earlier in this chapter for the purified protein mixes. The following steps are performed for each form of a peptide. First, a strip of data starting at the theoretical mass/charge is extracted. The width of the strip is 6 × 1/charge units, to encompass all the isotopes of the peptide; to account for slight error in the mass spectrometer, the strip width is increased slightly. Within this strip, the algorithm finds the retention time slice with the most data points (within a 10 second window of the identified retention time), since, based on observation, these data are likely to be signal. The highest-intensity point in this slice is then found; this corresponds to the most abundant isotope. The algorithm then restricts analysis to the most abundant isotope and the two isotopes to its right. To account for mass spectrometry error, we consider points in a small interval surrounding the theoretical mass/charge value for each isotope. If multiple intensities are recorded at a single retention time within one of these intervals, the maximum intensity is chosen. The chosen intensities in the intervals around these mass/charge values are then summed at each retention time. The algorithm then calls RTCut1. Finally, the algorithm fits a gamma curve to the intensities, and the area under the curve is calculated.
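The partner-location rule can be sketched as follows (a hypothetical helper, not the project's silac code; the K' notation for a heavy lysine follows the convention described in Chapter 2):

```python
def partner_mz(identified_mz, sequence, charge):
    """Return the m/z of the partner form of a SILAC peptide.

    A heavy lysine is written K' in the sequence; each lysine
    contributes a 6 Da mass difference between heavy and light forms.
    """
    n_lys = sequence.count("K")        # total lysines, heavy or light
    shift = 6.0 * n_lys / charge
    heavy = "K'" in sequence           # identified form is heavy
    # light partner lies `shift` units to the left of a heavy form;
    # heavy partner lies `shift` units to the right of a light form
    return identified_mz - shift if heavy else identified_mz + shift
```

With two lysines at charge 2, the partner lies 12/2 = 6 m/z units away, matching the worked example above.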

Chapter 5

Data Filtering and Optimizing the Algorithm

The data from the 5 Protein Mix were used to guide our work on data filtering and optimization of module combinations. Different volumes of the same mixture were injected in different runs, summarized in table 5.1, leading to different quantities of peptides in each run. About 50 peptides were compared between each pair of runs. Due to time constraints, not all possible module combinations were evaluated; in particular, the first two modules were not altered at all in our comparisons. The following combination of modules, with the given parameters, was used to generate a baseline for comparisons (figure 5.3) and to determine the effect of data filtering:

- Extraction of Two-Dimensional Neighborhood: main, with 10 Da as the lower and upper mass/charge radius, and 300 seconds as the lower and upper retention time radius
- Consolidation of Isotopes: Squish and MetaSquish
- Estimation of Noise: bypassed (unused by subsequent modules)
- Limit Retention Time Axis: cutting off when the intensities of four out of five consecutive points are in increasing order (RTCut3)
- Fit Curve: Gaussian curve, curve-fitting threshold = 0.1
- Quantify: area under the truncated curve, using trapz

While the ratios that result from this combination of modules are fairly close to the expected value, a large number of outliers are seen. These outliers are frequently quite extreme, such as a ratio of 57,084,044 instead of the expected value of 2 (figure 5.3). These outliers were used to determine the shortcomings of our algorithm.

Run Name    Injection Volume (in µl)
Run 1
Run 2       1
Run 3       5
Run 4       10

Table 5.1: Injection Volumes for the 5 Protein Mix

5.1 Data Filtering

Removing Nonsense Peptides

Initial investigation of the outliers in the baseline combination revealed that many of them were identified as being generated by proteins not among the components of the 5 Protein Mix. Some peptides were identified as being obtained from trypsin, which was used for digesting the proteins into peptides. However, there were also peptides identified as coming from serum albumin, the serotransferrin receptor, etc. These could be misidentifications, or could be generated by impurities within the mixtures. Regardless of their source, we have no knowledge of their relative abundance in different runs, and as a result they were excluded from further evaluations. This step removed a large number of outliers, as shown in figure 5.3.

Removing Matches with Large Differences in Retention Times

Many of the remaining outliers exhibit large differences in retention times (figure 5.1). Although there is some variability in the retention time associated with the same peptide in different experiments, it is very unlikely that the difference will exceed 100 seconds. Therefore, ratios obtained for features supposedly associated with the same peptide but with retention times differing by more than 100 seconds were excluded.

5.2 Optimizing the Algorithm

Estimation of Noise and Bounding of Retention Time

First we wished to compare the versions of the Estimation of Noise and Bounding of Retention Time modules. The baseline combination was compared to other combinations differing only in the versions used for these modules. The results are shown in figure 5.4. We were surprised to note the poor performance of the fingerprinting approach to Bounding of Retention Time. We believe that refining our approach might lead to better results.
As we see from figure 5.4, RTCut1 with NoiseLevel3 performs better than the other combinations shown.

Curve Fitting and Quantification

Comparisons between results obtained for the different versions of the Curve Fitting and Quantification modules are shown in figure 5.5. These results were generated using the screened list of peptides, with RTCut1 and NoiseLevel3. Summing the intensities of the selected points does not yield good results; this was expected, since the distribution of data is not very uniform. It is theoretically sound to use the total area under the fitted curve as the quantity. Using a gamma curve yields the best results; while there are some extreme outliers, the interquartile ranges are smaller than for a fitted Gaussian curve.

5.3 The Selected Combination of Modules

Based on our comparisons, the following combination of module versions was selected to form the final version of the modular algorithm:

Figure 5.1: Examples of possibly mismatched peptides. (a) and (b) show data after Consolidation of Isotopes, identified as associated with the peptide K.ATDGGAHGVINVSVSEAAIEASTR (charge 2), from run 1 and run 2 respectively; the baseline algorithm returned a ratio of 0.0, compared to the expected ratio of 2. (c) and (d) are similarly matched data for K.CCSDVFNQVVK.S (charge 2) from run 1 and run 3 respectively; the baseline algorithm returned a ratio far from the expected ratio of 10.

- Extraction of Two-Dimensional Neighborhood: main, lower and upper mass/charge radius = 10 Da, lower and upper retention time radius = 300 seconds
- Consolidation of Isotopes: Squish and MetaSquish
- Estimate Noise: proportion (0.1) of the highest intensity (NoiseLevel3)
- Limit Retention Time Axis: cutting off when the intensities of four out of five consecutive points are below the noise threshold (RTCut1)
- Fit Curve: gamma curve, curve-fitting threshold = 0.1
- Quantify: total area under the curve
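For reference, the screening rules of Section 5.1, which precede all of the comparisons in this chapter, can be sketched as follows (the record layout and the protein names in the example are illustrative, not the actual mix components; the 100-second cutoff is from Section 5.1):

```python
def screen_matches(matches, mix_proteins, max_rt_diff=100.0):
    """Drop matches whose peptide comes from a protein not in the mix
    ("nonsense peptides") and matches whose retention times differ by
    more than max_rt_diff seconds between the two runs.

    matches: list of dicts with keys 'protein', 'rt1', 'rt2', 'ratio'.
    """
    kept = []
    for m in matches:
        if m["protein"] not in mix_proteins:
            continue                      # impurity or misidentification
        if abs(m["rt1"] - m["rt2"]) > max_rt_diff:
            continue                      # likely mismatched features
        kept.append(m)
    return kept

# Hypothetical matches: only the first survives both filters
matches = [
    {"protein": "myoglobin", "rt1": 1200.0, "rt2": 1230.0, "ratio": 2.1},
    {"protein": "trypsin",   "rt1": 900.0,  "rt2": 910.0,  "ratio": 57.0},
    {"protein": "myoglobin", "rt1": 800.0,  "rt2": 1100.0, "ratio": 0.0},
]
kept = screen_matches(matches, {"myoglobin", "aldolase"})
```

Only ratios for peptides passing both filters enter the comparisons of figures 5.3 through 5.5.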

Peptide                           Incomplete Trypsination Products
K.YDLDFK.N                        K.DGKYDLDFKNPNSDK.S, K.YDLDFKNPNSDK.N
R.SYFHEDDKFIADK.T                 R.SYFHEDDKFIADKTK.F
K.ADVDGFLOVGGASLKPEFVDIINSR.N     K.DKADVDGFLOVGGASLKPEFVDIINSR.N
R.VLGIDGGEGK.E                    R.VLGIDGGEGKEELFR.S

Table 5.2: Incomplete trypsination products

5.4 Remaining Outliers

Several of the few remaining outliers are products of incomplete trypsination. Frequently, these occur in sets with overlapping regions, and if some are overestimated, the others are underestimated. For instance, if K.YDLDFKNPNSDK.S is underestimated, K.YDLDFK.N is overestimated. Other examples of such peptides are shown in table 5.2. Other outliers are found at the lower and upper limits of the retention time range, such as the one shown in figure 5.2; since the entire peak may not be included in such cases, the generated ratios are likely to be inaccurate. Modified peptides (indicated by caret symbols in their sequences) are consistently underestimated by our algorithm, and we are also unable to deal with overlapping peaks.

Figure 5.2: Identifications at the edge of the retention time axis. Both pictures show data at the end of the Consolidation of Isotopes step. As we can see, the curve is truncated in the picture on the right.

Figure 5.3: Effects of data filtering. Results using the baseline algorithm for all combinations of the four given datasets. Expected ratios are along the horizontal axis. (a) and (d), (b) and (e), and (c) and (f) show the same data, with and without outliers respectively. (a) and (d) show the calculated ratios for all matched identifications. (b) and (e) show the calculated ratios without those for peptides obtained from impurities. (c) and (f) show the results additionally excluding matches that differed in retention time by more than 100 seconds.

Figure 5.4: Comparing modules for estimation of noise and bounding of retention time. Results for all combinations of the four given datasets, using the screened list of peptides. Expected ratios are along the horizontal axis. (a) and (e), (b) and (f), (c) and (g), and (d) and (h) show the same data, with and without outliers respectively. (a) and (e) are the results of the baseline algorithm. (b) and (f) are the results using the fingerprinting approach to Bounding of Retention Time, as described in Chapter 4, retaining all other versions from the baseline algorithm. (c) and (g) are results using RTCut1, bypassing the Estimation of Noise module, and retaining all other aspects of the baseline algorithm. (d) and (h) are results using RTCut1 and NoiseLevel3 (proportion = 0.1), retaining all other aspects of the baseline algorithm.

Figure 5.5: Comparing modules for curve fitting and quantification. Results for all combinations of the four given datasets, using the screened list of peptides. Expected ratios are along the horizontal axis. All module combinations used RTCut1 with NoiseLevel3 (proportion = 0.1). (a) and (e), (b) and (f), (c) and (g), and (d) and (h) show the same data, with and without outliers respectively. (a) and (e) show the calculated ratios when the curve-fitting module is bypassed and the intensities of the selected points are summed. (b) and (f) show the results when a Gaussian curve is fitted to the data and the area under the truncated curve is used to quantify the peptide. (c) and (g) show results where the entire area under a fitted Gaussian is used to quantify the peptide. (d) and (h) show results where the total area under a fitted gamma curve is used to quantify the peptide.


Chapter 6

Final Evaluations

As described earlier in Chapter 2, we received data from several LC/MS experiments from our sponsor. The concentrations of proteins, and thus (assuming perfect tryptic digestion) of peptides, in the samples that generated these data are known, so we can compare the results of our algorithm to the expected ratios.

6.1 Evaluation of Initial Quantification Approaches

As expected, the naïve approaches mentioned in Chapter 3 did not perform well. The best results were seen for summing intensities within fixed radii in the retention time and mass/charge directions, as described in Section 3.2.1, but these results were very sensitive to the radii chosen. Only a few peptides were compared. Some results for the 6 Protein Mix are shown in figure 6.1.

6.2 Evaluation of Modular Quantification Algorithm

The best combination, as described in Chapter 5, was used to generate the following evaluations.

5 Protein Mix

We were provided data from LC/MS experiments on a mixture of 5 proteins (described in Chapter 2). Different volumes of the same mixture were injected in different runs, summarized in table 6.1; thus, the quantities of peptides in each run were expected to be different. About 50 peptides were compared to generate the results shown in figure 6.2. We can see that the spread of the calculated ratios is very small, and the medians are close to the expected values. Notably, peptides from only the three most abundant proteins were identified in most cases. Peptides from hexokinase and phosphoglucose isomerase were only seen in runs 3 and 4, and even then in very small numbers. This is not surprising, since these proteins were present in very low amounts.

Figure 6.1: Evaluation for summing intensities at points within a certain distance from an identified point. Results generated for Runs 1 and 2 of the 6 Protein Mix.

Run Name    Injection Volume (in µl)
Run 1
Run 2       1
Run 3       5
Run 4       10

Table 6.1: Injection Volumes for the 5 Protein Mix

6 Protein Mix

About 50 peptides were identified in both of two LC/MS experiments on samples of equal concentrations of the 6 Protein Mix. The results of our algorithm are shown in figure 6.2. The medians of the calculated ratios are very close to the expected values. Notably, the combination of modules does not perform as well on this mix as on the 5 Protein Mix, but this is to be expected, since the combination was optimized using the 5 Protein Mix.

SILAC Dataset

The Spielberg Center provided us with quantifications of the dataset produced by the program they currently use, Q3 [12]. The results of our algorithm are compared to these in

Figure 6.2: Evaluation on the 5 and 6 Protein Mixes. Expected ratios are shown on the horizontal axis. (a) shows the results for all combinations of the four runs of the 5 Protein Mix. (b) shows the results for combinations of four runs of the 6 Protein Mix. The optimal combination of modules, as described in Chapter 5, was used to generate the results in both cases.

figure 6.3. Our algorithm performed satisfactorily, considering that the module combinations were not optimized using the SILAC dataset, which is significantly more complex than the other protein mix datasets. Moreover, due to time constraints, the results shown are the first set of ratios we generated. We see that the medians of the ratios calculated by our algorithm compare favorably to those of the currently used program. We are confident that optimizing the parameters used can drastically reduce the large spread of our calculated ratios.

Figure 6.3: Evaluation on the SILAC dataset and comparison with Q3. Expected ratios are shown along the horizontal axis. About 40 peptides were quantified in each experiment.

Chapter 7

Recommendations for Future Work

7.1 Further Refinement of Modules

Data Filtering for SILAC Data

Due to time constraints, and because it was derived from actual cells, we were unable to filter the SILAC data nearly as extensively as the purified protein mixes. The lack of filtering explains the underperformance of our adapted algorithm to some extent. A possible way of filtering SILAC data would be to find the maximum and minimum intensities in the data following the Consolidation of Isotopes module, take their difference, and divide the difference by the maximum intensity. This calculation could yield a measure by which to determine whether a peak is present. If we find that an identified peptide does not have a partner, it is likely derived from an impurity.

Bounding of Retention Time Comparison Approaches

Some of the versions of the Bounding of Retention Time module frequently fail to find a boundary for the signal along retention time, in one or both directions. In such a situation, the boundary is drawn at the corresponding edge of the input data. This could indicate that the data represent noise, or that the initially extracted two-dimensional patch was not sufficiently large to include the entire feature of interest. A potential future addition to the module would signal the user when a boundary could not be found on either side; we have already done some work on this in RTCut1.

We have noticed that RTCut1 has a tendency to cut off too soon on the left side of a peak. This occurs because the left side is frequently very steep, often with very low intensity points adjacent to the most intense point of the peak. These low intensity points may even be below the threshold for cutting. In the current design, a boundary is drawn when four out of five consecutive points are below the threshold; thus, these low intensity points cause part of the left side of the peak to be cut off.
This is worrisome, since the data given to the Curve Fitting module might then simply decrease in intensity to the right, and an incorrect curve would be fit. Very simple modifications to this module could deal with this problem. The criterion for cutting could be made different for each side of the peak; this would be prudent, since the two sides of a peak are often quite differently shaped. Currently, when RTCut1 finds five points which fail the criterion, all five of those points are excluded. It would be easy to change this to include those points

but no more, or to draw the boundary after the first three points. This might include two or three low intensity points in the data passed on to the Curve Fitting module, but since a threshold is used in that module as well, this is unlikely to cause much trouble.

Fingerprinting

Improving the implementation of the ideas in the fingerprinting module offers a potentially fruitful avenue of investigation. Presently, the module averages intensities around the theoretical mass/charge coordinates for each isotope. A maximum value could be taken instead, as is done in Squish. Other methods of matching data to the theoretical distribution of isotopes could also be explored.

Curve Fitting

We have been using a threshold in the Curve Fitting module because some low intensity points can be seen within our master peak, interspersed with high intensity signal points. These points are generated by the presence of retention time slices with few or no data points: a fairly odd phenomenon in LC/MS, yet something we noticed often during the course of this project. We have developed QTest, which removes these low intensity points from within the master peak using an approach based on the standard Q-test. Further work can be done to determine the source of these points, and better ways of removing them can be devised.

Moreover, points could be weighted by intensity for curve fitting. Since we are confident that the high intensity points are signal, while the low intensity points may be either signal or noise, this would allow us to get more information from our data and obtain better fits.

The shape of the elution profile of one peptide can differ significantly from that of another, depending on the structure of the peptides, the concentration of proteins in the chromatographic column, the concentration gradient used for chromatography, and the occurrence of ion suppression.
Grouping peptides into different families based on these factors, and then fitting a different curve to each family, might improve quantification significantly.

Concentration Dependence

In the course of evaluating our algorithm on several datasets, we noticed that the results are significantly closer to the expected value for samples with lower concentrations of peptides. As shown in figure 7.1, the median of the calculated ratios shows a systematic deviation from the expected values. The systematic nature of the deviation, along with the small spread of the ratios, leads us to believe that the cause may lie in the LC/MS instrumentation or the pre-processing of the data. It is possible that there is a nonlinear relationship between intensity and quantity in mass spectrometry, or that noise is affecting the signal at higher concentrations. Determining the relationship between the calculated and expected ratios would help in normalizing the results of our algorithm to yield more accurate quantifications.

7.2 Expanding to All Features

Given data from two different runs and corresponding identifications, our algorithm currently only quantifies peptides that have been identified in both runs. However, due to the mechanism of tandem mass spectrometry, a large number of the peptides identified in

47 Figure 7.1: Concentration dependence. The results of our algorithm show a systematic deviation from the expected ratios with increase in concentration. the first sample are not identified in the second and vice versa. A much larger amount of information could be extracted from our data by matching features that are not identified in both samples and quantifying those as well. We would like to be able to take an identified feature in one sample and find the corresponding feature in another sample, even if it was not identified with tandem mass spectrometry. While the mass/charge value of a peptide will be the same in both samples, the retention time at which it appears can vary by a large amount, usually less than 100 seconds. Using pairs of identified features in the two samples, we could create a map between retention times for the two LC/MS images. MSMatchmaker is a program that creates such a mapping and should be easily adaptable to our algorithm. This would greatly increase the number of peptide features that are analyzed by our algorithm, potentially increasing the number of proteins we can quantify and the accuracy with which we can quantify them. 7.3 Utilizing Information from Multiple Experiments Presently, our algorithm arrives at a relative quantification for a peptide only after independent analyses of data from two different LC/MS experiments. However, we believe that analysis of features associated with the same peptide in multiple LC/MS experiments could be used to inform one another; this might decrease the number of mismatched peptides and improve quantification by, for instance, checking to see that maximum intensities associated with a peptide in two samples do not differ by extraordinary amounts. 7.4 Assigning a Confidence Value When we evaluated the performance of our algorithm on purified protein mixes, the data were screened for peptides from impurities, however, this would be impossible to do when analyzing biological samples. 
In our evaluations we were also able to exclude mismatched features by using only identifications with PeptideProphet values above 0.9 and excluding matches with a retention time difference greater than 100 seconds. Such screening will not be possible when our algorithm is used on real data. In that situation, the user will simply input two mass/charge and retention time coordinates for features that supposedly represent the same peptide, and the algorithm will quantify the features and report a ratio. However, if the features are mismatched, the ratio produced is meaningless. Therefore, it will be very useful to assign a confidence value to each matching and quantification. There are a number of variables that can help us predict a mismatch: the PeptideProphet values of the original identifications, the difference in retention time between matched features, and some measure of whether a signal is actually present can be used to assign a match confidence to each ratio. Besides mismatches, there are other cases in which the ratio returned by our algorithm would be suspect, such as when the fitted curve has a low R² value. Variables that contribute to the accuracy of the algorithm could be used to assign a quantification confidence to each reported ratio. We have done preliminary work on determining the confidence in ratios. CalcCon calculates a single confidence value between 0 and 1 based on the retention times and PeptideProphet values of the two identifications, the maximum intensities of the two associated features, the R² values of the two fitted curves, and the sequence of the peptide in question. The returned confidence value incorporates both match confidence and quantification confidence. So far, the correlation between the confidence values returned by CalcCon and the relative errors in the calculated ratios has been minimal. Linear regression on these variables could be used to refine our methods and determine the most important predictors of error.
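The regression idea above can be sketched for a single predictor. The data below are hypothetical, invented for illustration: the retention time difference between matched features is regressed against the relative error of the calculated ratio by ordinary least squares.

```python
# Hypothetical data: retention time difference (seconds) between
# matched features, and the relative error of the calculated ratio.
delta_rt = [5.0, 10.0, 40.0, 80.0]
rel_error = [0.05, 0.03, 0.20, 0.45]

# Ordinary least squares for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
n = len(delta_rt)
mx = sum(delta_rt) / n
my = sum(rel_error) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(delta_rt, rel_error))
sxx = sum((x - mx) ** 2 for x in delta_rt)
slope = sxy / sxx
intercept = my - slope * mx
```

A positive slope would indicate that larger retention time differences predict larger quantification errors; fitting one such model per candidate variable (PeptideProphet value, R², maximum intensity) would reveal the most important predictors.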
7.5 Protein Quantification

Figure 7.2: Calculation of relative protein abundances. Results from all possible pairs of the data from the 5 Protein Mix are shown. Expected ratios are along the horizontal axis.

Throughout this project, we have calculated ratios of peptide abundances across two samples. However, the relative abundances of the parent proteins are of more clinical significance. Thus, an important continuation of our project would be to develop an accurate, robust estimator of parent protein ratios based on peptide ratios. For a specific protein, the ratios of the peptides derived from it can be considered a random sample from a distribution centered at the true protein ratio. The spread of this distribution is due to both experimental and algorithmic error. We expect the distribution to be symmetric, and if so, the mean of the peptide ratios is a good estimate of the protein ratio. Other estimators may be considered as well. If the ratios for some peptides are found to be more consistent and accurate than others, the ratios associated with those peptides could be given more weight in the estimation. Further, our estimator could be tailored to account for the fact that our algorithm seems to overestimate ratios. We have done some preliminary work on calculating protein ratios from peptide ratios. Results using the mean of peptide ratios for the 5 Protein Mix data are shown in figure 7.2. The ratio for each protein is based on data from ten to twenty of its constituent peptides.
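A minimal sketch of such an estimator follows; the function name and interface are ours, chosen for illustration. The mean is the natural estimator under the symmetry assumption above, while the median is one robust alternative when some peptide ratios are suspect.

```python
import statistics

def protein_ratio(peptide_ratios, estimator="mean"):
    """Estimate a protein's abundance ratio from the ratios of its
    constituent peptides. The mean is appropriate if the peptide-ratio
    distribution is symmetric about the true protein ratio; the median
    is more robust to mismatched or poorly quantified peptides.
    """
    if not peptide_ratios:
        raise ValueError("no peptide ratios for this protein")
    if estimator == "median":
        return statistics.median(peptide_ratios)
    return statistics.fmean(peptide_ratios)
```

With ratios of 1.9, 2.0, and 2.1, both estimators return a protein ratio near 2.0; adding a single mismatched peptide at 10.0 pulls the mean up sharply while the median barely moves, which is why downweighting or excluding inconsistent peptides matters.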


Chapter 8

Summary and Conclusions

We have developed a modular framework to determine relative peptide quantities from LC/MS images. We have delivered multiple versions of the component modules and compared the performance of combinations of different versions. These were evaluated on both replicate data from a 6 Protein Mix and data generated from a 5 Protein Mix at varying concentrations. Currently, we are able to determine relative peptide abundances with considerable precision for peptides generated from mixtures of purified proteins. We have also adapted this algorithm for use on SILAC data. The adapted algorithm has been evaluated with promising results, but needs further refinement. We have addressed key challenges, such as filtering the data to remove impurities and mismatches between samples, extracting only the signal associated with a specified peptide while excluding noise and nearby features associated with unrelated species, and finding a curve that can accurately describe a peptide's elution profile. Our evaluation of different combinations of module versions was as exhaustive as time would allow, but there is still room for further investigation and improvement. We have done limited work on optimizing the parameter values used in the different modules, and doing so could also improve the performance of the algorithm. Other versions of the modules can also be envisioned, such as fitting different types of curves, determining different families of applicable curves, weighting points by intensity for curve fitting, or further investigating the fingerprinting approach for bounding along retention time. Other directions for future work include assigning a confidence value to quantifications performed by the algorithm, expanding the set of features the algorithm can quantify to those not identified by external software, combining peptide abundances in order to quantify their parent proteins, and improving the algorithm's performance on SILAC data.


Appendix A

Liquid Chromatography/Mass Spectrometry-Mass Spectrometry

A.1 Liquid Chromatography

Figure A.1: Liquid chromatography. A process used to separate mixtures of substances based on physical properties. In this example, a mixture of inks is poured into a column and adheres to it. A solvent (or series of solvents) is then poured into the column, and depending on the properties of the component inks and the solvents, each type of ink leaves the column at a different retention time. If a detector is present, a chromatogram is generated, recording the elution of substances from the column over time.

The term chromatography describes a family of techniques used to separate fluid mixtures based on the differing physical properties of their components. The solution to be separated (the mobile phase) is made to pass through a stationary phase. In column chromatography, the stationary phase is contained within a column.

Figure A.2: Mass spectrometry. A method used to separate charged gaseous particles based on their mass/charge ratios. A solution is vaporized into charged droplets by being sprayed through a fine needle in the presence of a strong electric field. Subjected to a strong magnetic field, the trajectories of the particles are determined by their mass/charge values, and a mass spectrum is generated by a particle detector. Intensity is related to abundance.

In the first step of column chromatography, the mixture to be separated is poured into the chromatographic column, which contains a permeable solid matrix composed of the stationary phase, immersed in solvent. The mixture adheres to the stationary phase. In the most general form of column chromatography, solvent is then continuously poured into the column. The components of the mixture interact differentially with the stationary phase, and thus come off the column, that is, elute, at different times. The time at which a component elutes off a column is called its retention time; however, substances generally elute over a range of times. A detector can be used to generate a chromatogram, as shown in figure A.1, which tracks the elution of substances over time. A variety of stationary phases and buffers can be used to separate mixtures based on charge, hydrophobicity, size, or the ability to bind to certain chemical groups. Liquid Chromatography (LC) is a type of column chromatography (LC may also be performed on a plane) in which the mobile phase is a liquid.
A form of liquid chromatography in which the stationary phase is composed of very small packing particles and a relatively high pressure is applied is referred to as high-performance liquid chromatography, or high-pressure liquid chromatography (HPLC). Using high pressure generally reduces the width of the peaks in a chromatogram and thus improves resolution. In HPLC, after the original mixture is poured into the chromatographic column, elution buffers with an increasing concentration of a particular substance are poured through the column; this further increases the resolution obtained by the chromatography.

A.2 Mass Spectrometry

Mass Spectrometry (MS) is a technique that can be used to separate and identify the components of a mixture based on their mass/charge ratios. A simplified schematic is shown in figure A.2. The components of a mixture are vaporized and ionized. This cloud of charged particles is then injected into an electromagnetic field, in which the trajectory of a particle depends on its mass/charge ratio. A detector is used to determine the intensity of each species, shown in a mass spectrum (figure A.2). Mass spectrometry is used widely and has many variants; a more complete description is beyond the scope of this report, but can be found in the references [13, 14, 15].

A.3 Combined Liquid Chromatography/Mass Spectrometry-Mass Spectrometry

Figure A.3: Combined liquid chromatography/mass spectrometry. The separation based on two factors greatly reduces the number of mixture components that must be dealt with simultaneously, increases resolution, and enables analysis of large numbers of peptides.

Combined liquid chromatography/mass spectrometry is a powerful technique that can separate complex mixtures with greater resolution than either of the component procedures alone. Generally, an HPLC column is connected in-line with a mass spectrometer, so that the fractions generated by the chromatography column are immediately introduced into the mass spectrometer. In tandem mass spectrometry, during the first MS step (MS1), certain species can be selected for further analysis. These chosen species are broken down into smaller pieces and subjected to another round of MS (MS2). Based on the MS2 spectra, the species may be sequenced. The result of a single LC/MS experiment can be summarized as a two-dimensional image, as shown in figure A.4. In the process used to generate the data described in this report, a sample of proteins is first digested using the enzyme trypsin, and the resulting mixture of peptides is injected into the LC/MS-MS apparatus. Using the information from MS2, some peptides from the original sample are identified, and using a database search we can determine their parent proteins.
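Theoretical isotope coordinates of the kind the fingerprinting module matches against can be computed directly from a peptide's monoisotopic mass and charge. The sketch below is our own illustration using the standard proton mass and the 13C/12C mass difference; the function name and interface are hypothetical.

```python
PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.003355  # Da, mass difference between 13C and 12C

def isotope_mz(monoisotopic_mass, charge, n_isotopes=4):
    """Theoretical m/z coordinates of the first few isotope peaks of a
    peptide: the k-th peak is heavier by k neutron-mass units, and the
    whole envelope is shifted and scaled by the charge.
    """
    return [
        (monoisotopic_mass + k * ISOTOPE_SPACING + charge * PROTON_MASS) / charge
        for k in range(n_isotopes)
    ]
```

Note that the spacing between adjacent isotope peaks in m/z is 1.003355/z, which is why the charge state of a feature can be read off directly from its isotope pattern.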

Figure A.4: Identifications by tandem mass spectrometry. Each individual mass spectrum forms a horizontal slice of the image. Retention times are ordered along the vertical axis, and intensity is represented by a colorscale as shown. Points that have been selected for MS2 are marked with white circles.


More information

Comprehensive support for quantitation

Comprehensive support for quantitation Comprehensive support for quantitation One of the major new features in the current release of Mascot is support for quantitation. This is still work in progress. Our goal is to support all of the popular

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

SEAMLESS INTEGRATION OF MASS DETECTION INTO THE UV CHROMATOGRAPHIC WORKFLOW

SEAMLESS INTEGRATION OF MASS DETECTION INTO THE UV CHROMATOGRAPHIC WORKFLOW SEAMLESS INTEGRATION OF MASS DETECTION INTO THE UV CHROMATOGRAPHIC WORKFLOW Paula Hong, John Van Antwerp, and Patricia McConville Waters Corporation, Milford, MA, USA Historically UV detection has been

More information

PC235: 2008 Lecture 5: Quantitation. Arnold Falick

PC235: 2008 Lecture 5: Quantitation. Arnold Falick PC235: 2008 Lecture 5: Quantitation Arnold Falick falickam@berkeley.edu Summary What you will learn from this lecture: There are many methods to perform quantitation using mass spectrometry (any method

More information

Quantitative Proteomics

Quantitative Proteomics Quantitative Proteomics Quantitation AND Mass Spectrometry Condition A Condition B Identify and quantify differently expressed proteins resulting from a change in the environment (stimulus, disease) Lyse

More information

WADA Technical Document TD2015IDCR

WADA Technical Document TD2015IDCR MINIMUM CRITERIA FOR CHROMATOGRAPHIC-MASS SPECTROMETRIC CONFIRMATION OF THE IDENTITY OF ANALYTES FOR DOPING CONTROL PURPOSES. The ability of a method to identify an analyte is a function of the entire

More information

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research profileanalysis Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research Innovation with Integrity Omics Research Biomarker Discovery Made Easy by ProfileAnalysis

More information

Proteomics. Areas of Interest

Proteomics. Areas of Interest Introduction to BioMEMS & Medical Microdevices Proteomics and Protein Microarrays Companion lecture to the textbook: Fundamentals of BioMEMS and Medical Microdevices, by Prof., http://saliterman.umn.edu/

More information

Chem 250 Unit 1 Proteomics by Mass Spectrometry

Chem 250 Unit 1 Proteomics by Mass Spectrometry Chem 250 Unit 1 Proteomics by Mass Spectrometry Article #1 Quantitative MS for proteomics: teaching a new dog old tricks. MacCoss MJ, Matthews DE., Anal Chem. 2005 Aug 1;77(15):294A-302A. 1. Synopsis 1.1.

More information

Proteomics. November 13, 2007

Proteomics. November 13, 2007 Proteomics November 13, 2007 Acknowledgement Slides presented here have been borrowed from presentations by : Dr. Mark A. Knepper (LKEM, NHLBI, NIH) Dr. Nathan Edwards (Center for Bioinformatics and Computational

More information

Effective desalting and concentration of in-gel digest samples with Vivapure C18 Micro spin columns prior to MALDI-TOF analysis.

Effective desalting and concentration of in-gel digest samples with Vivapure C18 Micro spin columns prior to MALDI-TOF analysis. Introduction The identification of proteins plays an important role in today s pharmaceutical and proteomics research. Commonly used methods for separating proteins from complex samples are 1D or 2D gels.

More information

Workflow concept. Data goes through the workflow. A Node contains an operation An edge represents data flow The results are brought together in tables

Workflow concept. Data goes through the workflow. A Node contains an operation An edge represents data flow The results are brought together in tables PROTEOME DISCOVERER Workflow concept Data goes through the workflow Spectra Peptides Quantitation A Node contains an operation An edge represents data flow The results are brought together in tables Protein

More information

Mass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University

Mass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University Mass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University matthias.trost@ncl.ac.uk Previously Proteomics Sample prep 144 Lecture 5 Quantitation techniques Search Algorithms Proteomics

More information

Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data. Han Liu

Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data. Han Liu Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data by Han Liu A thesis submitted in conformity with the requirements for the degree of Master of Science

More information

PROTEIN SEQUENCING AND IDENTIFICATION USING TANDEM MASS SPECTROMETRY

PROTEIN SEQUENCING AND IDENTIFICATION USING TANDEM MASS SPECTROMETRY PROTEIN SEQUENCING AND IDENTIFICATION USING TANDEM MASS SPECTROMETRY Michael Kinter Department of Cell Biology Lerner Research Institute Cleveland Clinic Foundation Nicholas E. Sherman Department of Microbiology

More information

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry Ting Chen Department of Genetics arvard Medical School Boston, MA 02115, USA Ming-Yang Kao Department of Computer

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

1 Measurement Uncertainties

1 Measurement Uncertainties 1 Measurement Uncertainties (Adapted stolen, really from work by Amin Jaziri) 1.1 Introduction No measurement can be perfectly certain. No measuring device is infinitely sensitive or infinitely precise.

More information

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Systems Biology Exp. Methods

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Systems Biology Exp. Methods BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer 2013 14. Systems Biology Exp. Methods Overview Transcriptomics Basics of microarrays Comparative analysis Interactomics:

More information

MS-MS Analysis Programs

MS-MS Analysis Programs MS-MS Analysis Programs Basic Process Genome - Gives AA sequences of proteins Use this to predict spectra Compare data to prediction Determine degree of correctness Make assignment Did we see the protein?

More information

Application Note LCMS-112 A Fully Automated Two-Step Procedure for Quality Control of Synthetic Peptides

Application Note LCMS-112 A Fully Automated Two-Step Procedure for Quality Control of Synthetic Peptides Application Note LCMS-112 A Fully Automated Two-Step Procedure for Quality Control of Synthetic Peptides Abstract Here we describe a two-step QC procedure for synthetic peptides. In the first step, the

More information

Regulatory and Alternative Analytical Procedures are defined as follows 2 :

Regulatory and Alternative Analytical Procedures are defined as follows 2 : Title: Alternative Analytical Method Validation in Pharmaceuticals: Replacing a Regulatory Analytical Method in Cleaning Validation Authors: Stephen Lawson, Will McHale and Brian Wallace This re titled

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Mass Spectrometry. Hyphenated Techniques GC-MS LC-MS and MS-MS

Mass Spectrometry. Hyphenated Techniques GC-MS LC-MS and MS-MS Mass Spectrometry Hyphenated Techniques GC-MS LC-MS and MS-MS Reasons for Using Chromatography with MS Mixture analysis by MS alone is difficult Fragmentation from ionization (EI or CI) Fragments from

More information

mrna proteins DNA predicts certain possible future health issues Limited info concerning evolving health; advanced measurement technologies

mrna proteins DNA predicts certain possible future health issues Limited info concerning evolving health; advanced measurement technologies DA predicts certain possible future health issues mra Limited info concerning evolving health; advanced measurement technologies proteins Potentially comprehensive information concerning evolving health;

More information

New Approaches to the Development of GC/MS Selected Ion Monitoring Acquisition and Quantitation Methods Technique/Technology

New Approaches to the Development of GC/MS Selected Ion Monitoring Acquisition and Quantitation Methods Technique/Technology New Approaches to the Development of GC/MS Selected Ion Monitoring Acquisition and Quantitation Methods Technique/Technology Gas Chromatography/Mass Spectrometry Author Harry Prest 1601 California Avenue

More information

COLA Mass Spec Criteria

COLA Mass Spec Criteria Wednesday April 5, 2017 2pm COLA Mass Spec Criteria Kathy Nucifora, MPH, MT(ASCP) COLA, Accreditation Division Director Columbia, MD DESCRIPTION: Kathy Nucifora, COLA Accreditation Division Director, will

More information

Time: 1 hour 30 minutes

Time: 1 hour 30 minutes Paper Reference(s) 668/0 Edexcel GCE Statistics S Silver Level S2 Time: hour 0 minutes Materials required for examination papers Mathematical Formulae (Green) Items included with question Nil Candidates

More information

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it? Proteomics What is it? Reveal protein interactions Protein profiling in a sample Yeast two hybrid screening High throughput 2D PAGE Automatic analysis of 2D Page Yeast two hybrid Use two mating strains

More information

Analysis of MALDI-TOF Data: from Data Preprocessing to Model Validation for Survival Outcome

Analysis of MALDI-TOF Data: from Data Preprocessing to Model Validation for Survival Outcome Analysis of MALDI-TOF Data: from Data Preprocessing to Model Validation for Survival Outcome Heidi Chen, Ph.D. Cancer Biostatistics Center Vanderbilt University School of Medicine March 20, 2009 Outline

More information

Bayesian Clustering of Multi-Omics

Bayesian Clustering of Multi-Omics Bayesian Clustering of Multi-Omics for Cardiovascular Diseases Nils Strelow 22./23.01.2019 Final Presentation Trends in Bioinformatics WS18/19 Recap Intermediate presentation Precision Medicine Multi-Omics

More information

SeqAn and OpenMS Integration Workshop. Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn The Center for Integrative Bioinformatics (CIBI)

SeqAn and OpenMS Integration Workshop. Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn The Center for Integrative Bioinformatics (CIBI) SeqAn and OpenMS Integration Workshop Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn The Center for Integrative Bioinformatics (CIBI) Mass-spectrometry data analysis in KNIME Julianus Pfeuffer,

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Analytical Chemistry

Analytical Chemistry Analytical Chemistry Chromatographic Separations KAM021 2016 Dr. A. Jesorka, 6112, aldo@chalmers.se Introduction to Chromatographic Separations Theory of Separations -Chromatography Terms Summary: Chromatography

More information

Chapter 2 What are the Common Mass Spectrometry-Based Analyses Used in Biology?

Chapter 2 What are the Common Mass Spectrometry-Based Analyses Used in Biology? Chapter 2 What are the Common Mass Spectrometry-Based Analyses Used in Biology? Abstract Mass spectrometry is used in many field of research, such as biology, chemistry, geology, etc. The focus of this

More information

The performance expectation above was developed using the following elements from A Framework for K-12 Science Education: Disciplinary Core Ideas

The performance expectation above was developed using the following elements from A Framework for K-12 Science Education: Disciplinary Core Ideas HS-PS1-1 HS-PS1-1. Use the periodic table as a model to predict the relative properties of elements based on the patterns of electrons in the outermost energy level of atoms. [Clarification Statement:

More information

Analysis of a Verapamil Microsomal Incubation using Metabolite ID and Mass Frontier TM

Analysis of a Verapamil Microsomal Incubation using Metabolite ID and Mass Frontier TM Application Note: 320 Analysis of a Verapamil Microsomal Incubation using Metabolite ID and Mass Frontier TM Key Words Metabolism Study Structure Elucidation Metabolite ID Mass Frontier Chromatography

More information

Development of a protein quantification algorithm for data analysis in the field of proteomics

Development of a protein quantification algorithm for data analysis in the field of proteomics Facultat d Informàtica de Barcelona (FIB) Development of a protein quantification algorithm for data analysis in the field of proteomics Autora: Ariadna Llovet Soto Titulació: Enginyeria superior en informàtica

More information

Identification of Human Hemoglobin Protein Variants Using Electrospray Ionization-Electron Transfer Dissociation Mass Spectrometry

Identification of Human Hemoglobin Protein Variants Using Electrospray Ionization-Electron Transfer Dissociation Mass Spectrometry Identification of Human Hemoglobin Protein Variants Using Electrospray Ionization-Electron Transfer Dissociation Mass Spectrometry Jonathan Williams Waters Corporation, Milford, MA, USA A P P L I C AT

More information

Protein Structure Analysis and Verification. Course S Basics for Biosystems of the Cell exercise work. Maija Nevala, BIO, 67485U 16.1.

Protein Structure Analysis and Verification. Course S Basics for Biosystems of the Cell exercise work. Maija Nevala, BIO, 67485U 16.1. Protein Structure Analysis and Verification Course S-114.2500 Basics for Biosystems of the Cell exercise work Maija Nevala, BIO, 67485U 16.1.2008 1. Preface When faced with an unknown protein, scientists

More information

Chromatography & instrumentation in Organic Chemistry

Chromatography & instrumentation in Organic Chemistry Chromatography & instrumentation in Organic Chemistry What is Chromatography? Chromatography is a technique for separating mixtures into their components in order to analyze, identify, purify, and/or quantify

More information

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics GENOME Bioinformatics 2 Proteomics protein-gene PROTEOME protein-protein METABOLISM Slide from http://www.nd.edu/~networks/ Citrate Cycle Bio-chemical reactions What is it? Proteomics Reveal protein Protein

More information

Statistical concepts in QSAR.

Statistical concepts in QSAR. Statistical concepts in QSAR. Computational chemistry represents molecular structures as a numerical models and simulates their behavior with the equations of quantum and classical physics. Available programs

More information

Response to Questions Raised by Tom Phillips Concerning the Enforcement Analyitical Method for Avail and Nutrisphere Products. Verdesain Life Sciences

Response to Questions Raised by Tom Phillips Concerning the Enforcement Analyitical Method for Avail and Nutrisphere Products. Verdesain Life Sciences Response to Questions Raised by Tom Phillips Concerning the Enforcement Analyitical Method for Avail and Nutrisphere Products Verdesain Life Sciences December 12, 2018 Philip Davidson of Maryland read

More information

A Software Suite for the Generation and Comparison of Peptide Arrays from Sets. of Data Collected by Liquid Chromatography-Mass Spectrometry

A Software Suite for the Generation and Comparison of Peptide Arrays from Sets. of Data Collected by Liquid Chromatography-Mass Spectrometry MCP Papers in Press. Published on July 26, 2005 as Manuscript M500141-MCP200 A Software Suite for the Generation and Comparison of Peptide Arrays from Sets of Data Collected by Liquid Chromatography-Mass

More information