ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES. A Thesis XINCHENG TANG

Similar documents
STATISTICAL METHODS FOR THE ANALYSIS OF MASS SPECTROMETRY- BASED PROTEOMICS DATA. A Dissertation XUAN WANG

Computational Methods for Mass Spectrometry Proteomics

Modeling Mass Spectrometry-Based Protein Analysis

Proteomics. November 13, 2007

A statistical method for chromatographic alignment of LC-MS data

BST 226 Statistical Methods for Bioinformatics David M. Rocke. January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

Identification of proteins by enzyme digestion, mass

Introduction to pepxmltab

Protein Quantitation II: Multiple Reaction Monitoring. Kelly Ruggles New York University

WADA Technical Document TD2015IDCR

for the Novice Mass Spectrometry (^>, John Greaves and John Roboz yc**' CRC Press J Taylor & Francis Group Boca Raton London New York

LECTURE-13. Peptide Mass Fingerprinting HANDOUT. Mass spectrometry is an indispensable tool for qualitative and quantitative analysis of

Towards the Prediction of Protein Abundance from Tandem Mass Spectrometry Data

Data pre-processing in liquid chromatography mass spectrometry-based proteomics

TANDEM MASS SPECTROSCOPY

Protein Identification Using Tandem Mass Spectrometry. Nathan Edwards Informatics Research Applied Biosystems

PeptideProphet: Validation of Peptide Assignments to MS/MS Spectra. Andrew Keller

Overview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database

Yifei Bao. Beatrix. Manor Askenazi

Tandem mass spectra were extracted from the Xcalibur data system format. (.RAW) and charge state assignment was performed using in house software

MS-MS Analysis Programs

Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were

QTOF-based proteomics and metabolomics for the agro-food chain.

Developing Algorithms for the Determination of Relative Abundances of Peptides from LC/MS Data

Isotopic-Labeling and Mass Spectrometry-Based Quantitative Proteomics

Mass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University

Isotope Dilution Mass Spectrometry

MSnID Package for Handling MS/MS Identifications

Workflow concept. Data goes through the workflow. A Node contains an operation An edge represents data flow The results are brought together in tables

Tutorial 1: Setting up your Skyline document

PeptideProphet: Validation of Peptide Assignments to MS/MS Spectra

WADA Technical Document TD2003IDCR

Supplementary Material for: Clustering Millions of Tandem Mass Spectra

Mass Spectrometry. Hyphenated Techniques GC-MS LC-MS and MS-MS

Key questions of proteomics. Bioinformatics 2. Proteomics. Foundation of proteomics. What proteins are there? Protein digestion

Comprehensive support for quantitation

Aplicació de la proteòmica a la cerca de Biomarcadors proteics Barcelona, 08 de Juny 2010

Quantitation of a target protein in crude samples using targeted peptide quantification by Mass Spectrometry

Making Sense of Differences in LCMS Data: Integrated Tools

Analysis of Peptide MS/MS Spectra from Large-Scale Proteomics Experiments Using Spectrum Libraries

MS-based proteomics to investigate proteins and their modifications

High-Throughput Protein Quantitation Using Multiple Reaction Monitoring

De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu

Qualitative Proteomics (how to obtain high-confidence high-throughput protein identification!)

DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics

A Statistical Model of Proteolytic Digestion

SRM assay generation and data analysis in Skyline

Biological Mass Spectrometry

MassHunter TOF/QTOF Users Meeting

Parallel Algorithms For Real-Time Peptide-Spectrum Matching

UCD Conway Institute of Biomolecular & Biomedical Research Graduate Education 2009/2010

NPTEL VIDEO COURSE PROTEOMICS PROF. SANJEEVA SRIVASTAVA

A New Hybrid De Novo Sequencing Method For Protein Identification

DE NOVO PEPTIDE SEQUENCING FOR MASS SPECTRA BASED ON MULTI-CHARGE STRONG TAGS

Nature Methods: doi: /nmeth Supplementary Figure 1. Fragment indexing allows efficient spectra similarity comparisons.

Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 *

De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry

Fundamentals of Mass Spectrometry. Fundamentals of Mass Spectrometry. Learning Objective. Proteomics

Proteome-wide label-free quantification with MaxQuant. Jürgen Cox Max Planck Institute of Biochemistry July 2011

LC-MS Based Metabolomics

All Ions MS/MS: Targeted Screening and Quantitation Using Agilent TOF and Q-TOF LC/MS Systems

Analysis of Polar Metabolites using Mass Spectrometry

Methods for proteome analysis of obesity (Adipose tissue)

Quantitative Proteomics

PROTEIN SEQUENCING AND IDENTIFICATION USING TANDEM MASS SPECTROMETRY

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry

Learning Score Function Parameters for Improved Spectrum Identification in Tandem Mass Spectrometry Experiments

Mass spectrometry in proteomics

Automated and accurate component detection using reference mass spectra

CHEMISTRY (CHEM) CHEM 208. Introduction to Chemical Analysis II - SL

A Software Suite for the Generation and Comparison of Peptide Arrays from Sets. of Data Collected by Liquid Chromatography-Mass Spectrometry

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research

Contaminant profiling: A set of next generation analytical tools to deal with contaminant complexity

LECTURE-11. Hybrid MS Configurations HANDOUT. As discussed in our previous lecture, mass spectrometry is by far the most versatile

MassHunter Software Overview

An SVM Scorer for More Sensitive and Reliable Peptide Identification via Tandem Mass Spectrometry

HOWTO, example workflow and data files. (Version )

1. Prepare the MALDI sample plate by spotting an angiotensin standard and the test sample(s).

Other Methods for Generating Ions 1. MALDI matrix assisted laser desorption ionization MS 2. Spray ionization techniques 3. Fast atom bombardment 4.

Wavelet-Based Preprocessing Methods for Mass Spectrometry Data

Agilent MassHunter Profinder: Solving the Challenge of Isotopologue Extraction for Qualitative Flux Analysis

PC235: 2008 Lecture 5: Quantitation. Arnold Falick

(Refer Slide Time 00:09) (Refer Slide Time 00:13)

Peptide Targeted Quantification By High Resolution Mass Spectrometry A Paradigm Shift? Zhiqi Hao Thermo Fisher Scientific San Jose, CA

Protein Sequencing and Identification by Mass Spectrometry

Analysis of Labeled and Non-Labeled Proteomic Data Using Progenesis QI for Proteomics

Designed for Accuracy. Innovation with Integrity. High resolution quantitative proteomics LC-MS

Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data. Han Liu

Chemistry Instrumental Analysis Lecture 37. Chem 4631

SeqAn and OpenMS Integration Workshop. Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn The Center for Integrative Bioinformatics (CIBI)

ADVANCEMENT IN PROTEIN INFERENCE FROM SHOTGUN PROTEOMICS USING PEPTIDE DETECTABILITY

Overview. Introduction. André Schreiber AB SCIEX Concord, Ontario (Canada)

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Systems Biology Exp. Methods

6 x 5 Ways to Ensure Your LC-MS/MS is Healthy

pparse: A method for accurate determination of monoisotopic peaks in high-resolution mass spectra

Chapter 2 What are the Common Mass Spectrometry-Based Analyses Used in Biology?

Computational Analysis of Mass Spectrometric Data for Whole Organism Proteomic Studies

Identification and Characterization of an Isolated Impurity Fraction: Analysis of an Unknown Degradant Found in Quetiapine Fumarate

Protein analysis using mass spectrometry

Transcription:

ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES A Thesis by XINCHENG TANG Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE December 2011 Major Subject: Statistics

ALIGNMENT OF LC-MS DATA USING PEPTIDE FEATURES A Thesis by XINCHENG TANG Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Approved by: Chair of Committee, Committee Members, Head of Department, Alan R. Dabney Fred P. Dahm Scott Dindot Simon J. Sheather December 2011 Major Subject: Statistics

iii ABSTRACT Alignment of LC-MS Data Using Peptide Features. (December 2011) Xincheng Tang, B.E., Wuhan University, Hubei, China Chair of Advisory Committee: Dr. Alan R. Dabney Integrated liquid-chromatography mass-spectrometry(lc-ms) is becoming a widely used approach for quantifying the protein composition of complex samples. In the last few years, this technology has been used to compare complex biological samples across multiple conditions. One challenge in the analysis of an LC-MS experiment is the alignment of peptide features across samples. In this paper, we proposed a new method using the peptide internal information (both LC-MS and LC-MS/MS information) to align features from multiple LC-MS experiments. We defined Anchor points which are data elements that are highly confident we have identified and are shared by both samples. We chose one sample as template data set, find Anchor Points in this sample, then apply alignment to modify another sample and find Anchors in modified sample, these Anchors should line up with one another. One advantage of our method is that it allows statistical assessment of alignment performance. Use Anchor Points to perform alignment between samples, and labeling an objective performance in LC-MS.

iv ACKNOWLEDGMENTS I would like to express my sincerest appreciation to my advisor, Dr. Alan R. Dabney, for his direction, suggestion, encouragement, patience and continuous support toward my professional development. He guided me to think about statistical research problems, how to do research and how to write a paper step by step. I am very grateful to all my committee members, Dr. Fred P. Dahm, Dr. Scott Dindot, for their guidance and support throughout the course of this research. Thanks also go to my friends and colleagues and the department faculty and staff for making my time at Texas A&M University a great experience. Finally I feel lucky to have my wife Xuan Wang, my parents by my side, thank you for your support and love.

v NOMENCLATURE GC-MS LC-MS MALDI MS MS/MS M/Z PNNL RT SELDI TOF Gas Chromatography Mass Spectrometry Liquid Chromatography Mass Spectrometry Matrix Assisted Laser Desorption Ionization Mass Spectrometry Tandem Mass Spectrometry Mass-To-Charge Ratio Pacific Northwest National Laboratory Retention Time Surface Enhanced Laser Desorption Ionization Time-Of-Flight

vi TABLE OF CONTENTS Page ABSTRACT... ACKNOWLEDGEMENTS... NOMENCLATURE... TABLE OF CONTENTS... LIST OF TABLES... LIST OF FIGURES... iii iv v vi vii viii 1 INTRODUCTION... 1 1.1 Background... 1 1.2 Experimental Procedure... 2 1.3 Existing Alignment Methods... 2 2 METHODS... 7 2.1 Anchor Points... 7 2.2 Alignment Algorithm... 8 3 REAL DATA EXAMPLE... 10 4 CONCLUSION... 17 REFERENCES... 18 VITA... 20

vii LIST OF TABLES TABLE Page 2.1 An Example of Data Record Returned from SEQUEST.......... 7 2.2 An Example of Anchor Points Located.................... 7

viii LIST OF FIGURES FIGURE Page 1.1 Sample Preparation.............................. 3 1.2 Mass Spectrometry.............................. 4 3.1 Plot of Anchor Points Embedded in Both Samples. Left Panel is for Sample 1 and Right Panel for Sample 2................... 11 3.2 Histograms of Scan Number of Anchor Points. Left Panel is for before Alignment and Tight Panel for after Alignment............... 12 3.3 Histograms of M/Z of Anchor Points. Left Panel is for before Alignment and Right Panel for after Alignment..................... 13 3.4 Histograms of Log(Peak Area) of Anchor Points. Left Panel is for before Alignment and Right Panel for after Alignment............... 14 3.5 Scatter Plot of Scan Number vs Log (Peak Area) on All Data Points. Left Panel is for before Alignment and Right Panel for after Alignment.... 15 3.6 Scatter Plot of Scan Number vs. M/Z on All Data Points. Left Panel is for before Alignment and Right Panel for after Alignment......... 16

1 1. INTRODUCTION 1.1 Background Mass spectrometry (MS) proteomics has become the tool of choice for identifying and quantifying the proteome of an organism, in recent years, tremendous improvement in instrument performance and computational tools are used. Several MS methods for interrogating the proteome have been developed: Surface Enhanced Laser Desorption Ionization(SELDI)[1], Matrix Assisted Laser Desorption Ionization (MALDI)[2] coupled with time-of-flight (TOF) or other instruments, and gas chromatography MS (GC-MS) or liquid chromatography MS (LC-MS). In LC-MS-based proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage, and the resulting peptide products are analyzed by using a mass spectrometer. In tandem mass spectrometry (denoted by MS/MS), fragmentation spectra are obtained for each subset of observed high-intensity peaks and compared to fragmentation spectra in a database, using software like SEQUEST[3], Mascot[4], or X!Tandem[5]. If the data used to be analyzed and interpreted at the protein level, then once a list has been constructed of the proteins believed to be present in the sample, the next task is to quantify the abundance of the proteins. Protein abundance information is contained in the set of peaks that correspond to the protein component peptides. Peak height or area is a function of the number of ions detected for a particular peptide, and is related to peptide abundance. In LC-MS, each sample may have thousands of scans, each containing a mass spectrum. The mass spectrum for a single MS scan can be summarized by a plot of M/Z values versus peak intensities. These data contains signals that are specific to individual peptides. As a first step towards identifying and quantifying those This thesis follows the style of Journal of Nuclear Materials.

2 peptides, features need to be identified in the data and distinguished from background noise, one simple method is to employ a filter on the signal-to-noise ratio of a peak relative to its local background. Each peptide gives an envelope of peaks due to a peptide constituent amino acids. The presence of a peptide can be characterized by the M/Z value corresponding to the peak arising from the most common isotope, referred to as the monoisotopic mass. 1.2 Experimental Procedure A LC-MS-based proteomic experiment requires several steps of sample preparation (Figure 1.1), including cell lysis to break cells apart, protein separation to spread out the collection of protein into more homogenous groups, and protein digestion to break intact proteins into more manageable peptide components. Once this is complete, peptides are further separated, then ionized and introduced into the mass spectrometer. A mass spectrometer measures the mass-to-charge ratio (M/Z) of ionized molecules, which is designed to carry out the distinct functions of ionization and mass analysis. The key components of a mass spectrometer are the ion source, mass analyzer, and ion detector (Figure 1.2). The ion source is responsible for assigning charge to each peptide. Mass analyzer measures the mass-to-charge (M/Z) ratio of each ion. The detector captures the ions and measures the intensity of each ion species. In terms of a mass spectrum, the mass analyzer is responsible for the M/Z information on the x-axis and the detector is responsible for the peak intensity information on the y-axis. 1.3 Existing Alignment Methods The goal of alignment is to match corresponding peptide features in the M/Z vs scan plot(see Figure 1.2) from different experiments. A time warping method

Fig. 1.1. Sample Preparation 3

Fig. 1.2. Mass Spectrometry 4

5 based on raw spectrum for alignment of LC-MS data was introduced by Bylund and others[6],which is a modification of the original correlated optimized warping algorithm[7]. Wang and others [8], implemented a dynamic time warping algorithm allowing every RT point to be moved. However, the LC-MS data have added dimension of mass spectral information, so only mapping the retension time coordinates between two LC-MS files is not sufficient to provide alignment for individual peptides. Radulovic and others [9] performed alignment based on (M/Z, RT) values of detected features. Their method first divides the M/Z domain into several intervals and fitted different piece-wise linear time warping functions for each M/Z interval. After the time warping, they applied a wobble function to peak and allow peak to move (1-2% of total scan range) in order to match with the nearest adjacent peak in another file. Their method relies on the (M/Z, RT) values of detected peptide features, it fails to take advantage of other information in the raw image. Wang and others [10] proposed an alignment algorithm, PETAL, for LC-MS data. It uses both the raw spectrum data and the information of the detected peak features for peptide alignment. In this paper, two Shewanella datasets are obtained from Pacific Northwest National Laboratory (PNNL) and they were analyzed by SEQUEST on different days. SEQUEST correlates uninterpreted tandem mass spectra of peptides with amino acid sequences from protein and nucleotide databases, which determines the amino acid sequence and thus the protein(s) and organism(s) that correspond to the mass spectrum being analyzed. Based on the SEQUEST output files, each sample has thousands of scans, and the M/Z, peak intensities, peptide information associated. It s obvious that there s some systematic error between the alignment of the two datasets. In this study, we first applied some filter criteria to choose data points matched with high confidence in both samples, which are called Anchor Points. We then use these Anchor Points in sample one as the baseline and modify the data points in

6 sample two, to make the Anchor Points between the two samples aligned together, which means after alignment the Anchor Points in both samples show up at the same locations. The alignment algorithm is then applied on all the data points in sample two. Finally, statistical measurements of the performance of alignment are given on sample level and regional level. In future study, we hope this alignment method can be applied to several samples of one organism, and as a guide to justify the points with same peptide information in different samples.

7 2. METHODS 2.1 Anchor Points In our method, for different experiments, we have the raw data analyzed by SEQUEST. Based on the SEQUEST output files, each sample has thousands of scans, and the M/Z, peak intensities, peptide information of each scan, an example is given in Table 2.1. Define a point with high probability that its peptide shows up in SEQUEST output files as Anchor Points. A sample record of data is given in Table 2.1 Table 2.1 An Example of Data Record Returned from SEQUEST. ScanNum MZ PeakArea PassFilt PeakSignalToNoiseRatio 8239 826.4 4.98E+06 1 25.1 ChargeState Xcorr NumTrypticEnds RankXc Peptide 3 7.9171 2 1 K.LAYADGYVHA Table 2.2 An Example of Anchor Points Located. ScanNum MZ PeakArea PassFilt PeakSignalToNoiseRatio 8239 826.4 4.98E+06 1 25.1 ChargeState Xcorr NumTrypticEnds RankXc Peptide 3 7.9171 2 1 K.LAYADGYVHA ScanNum MZ PeakArea PassFilt PeakSignalToNoiseRatio 7567 813.2 3.57E+06 1 19.2 ChargeState Xcorr NumTrypticEnds RankXc Peptide 3 7.5681 2 1 K.LAYADGYVHA

8 In LC-MS, we need to distinguish the peptide features from the background noise, the first step for doing this is MS peak detection. We employ a simple filter routine on the signal-to-noise ratio of a peak relative to its local background [11]. In our approach, in order to find peptides that exist in both samples with high confidence, three filtering criteria are applied. The first criterion is PassFilt equaling to 1, the second criterion is NumTrypticEnds equaling to 2 and the third criterion is SignalToNoiseRatio being greater than 10. PassFilt is a score that does not come from by SEQUEST but is calculated from syn-fht summary generator using Xcorr, DelCN, RankXc and the number of tryptic termini. NumTrypticEnds (number of tryptic cleavage sites) is the number of termini that conforms to the expected cleavage behavior of trypsin (i.e. C-terminal to R and K). Note that K-P and R-P do not qualify as tryptic cleavages because of the proline rule. However, the protein N-terminus and protein C-terminus do count as tryptic cleavage sites. Values can be 0, 1, or 2 with 2 = fully tryptic; 1 = partially tryptic; 0 = Non tryptic. Any points in both sample satisfied these three criteria are called Anchor Points. Table 2.2 is an example of the Anchor Points located after applying the three filtering criteria. In this example, we can find the points in both sample with same peptide information, which is K.LAYADGYVHA. 2.2 Alignment Algorithm Anchor Points found from both samples differ on ScanNum, which represent retention time, so we need to find some algorithms to make these points in two samples aligned on ScanNum as well as M/Z and PeakArea. Let M i be M/Z for peak i, S i be the Scan Number for peak i, and P i be the log intensity for peak i. The data is normalized to 0-1 range by dividing normalization factors of M/Z, scan number and log(peak Area), denoted as MN, SN, and PN, which are the max M/Z, max Scan Number and max log(peak Area) of the two samples. The reason why we do normalization is that due to the different scales of Scan Number,M/Z and Peak

9 intensity values, the distance can not be equally measured, for example, the scan number is very large compared to the M/Z. So transforming the data into 0-1 range will give equal weight of all these three values. With the pool of Anchor Points found between sample one and sample two, we are able to locate the five nearest anchor points for peak i in the second sample, where the distance is defined by Euclidean metric considering the three dimensions of normalized M/Z, Scan Number and log(peakarea). With the defined range, let D i j be the distance between peak i and Anchor Point j in the second sample two, D ij = ((M i M j )/MN) 2 + ((S i S j )/SN) 2 + (P i P j )/P N) 2 Let 1 be the difference of averaged M/Z across the five nearest Anchor Points of peak i between the two samples, 2 be the difference of averaged scan number and 3 be the difference of averaged intensity. Then we use the differences to modify peak i in sample two by adding ( 1, 2, 3 ) to (M i, S i, P i ) 1 = M ij1 M ij2 2 = S ij1 S ij2 3 = P ij1 P ij2 where j = 1, 2, 3, 4, 5 This algorithm uses both the raw spectrum data which are analyzed by SE- QUEST and the information of the detected peak features for peptide alignment, we use all the information,such as Scan Number, intensity and M/Z to find the target points. Although in both samples,the points with same peptide information appear in different place due to the systematic bias, we assume that, for such point in sample one, the point with same peptide information in sample two should be not far away from the point in sample one.

10 3. REAL DATA EXAMPLE In the Shewanella datasets, the normalizing factors of M/Z, Scan Number and Peak Area are as follows: MN = 1519.48, SN = 10606, P N = 10.08476 For comparison, both Anchor Points before and after alignment are plotted in Figure 3.1. The distance of Anchors in both samples before alignment have systematic difference, and after alignment, the difference should be randomly distributed around the standard points (Anchor Points in sample one). We draw histograms to compare the distance of Anchors between both samples before and after alignment in Figure 3.2, Figure 3.3, and Figure 3.4, the Histogram shows that, after alignment, the distance differences between two samples are mostly around 0. It is important to perform test to justify that after alignment, there is no systematic difference on the anchor points between two samples. As justification, perform Wilcoxon signed-rank test on the differences between two samples on the Anchor points before and after alignment. The null hypothesis is the true location shift is equal to 0 and the alternative hypothesis is the true location shift is not equal to 0. We have the following results: Before alignment Wilcoxon signed-rank test on Scan Number of anchor point with continuity correction W = 537733, p-value =2.2e- 16. After alignment Wilcoxon signed-rank test on Scan Number of anchor point with continuity correction W = 293222, p-value =0.0986. Before alignment Wilcoxon signed-rank test on M/Z of anchor points with continuity correction W = 142212, p-value =0.1091. After alignment Wilcoxon signedrank test on M/Z of anchor point with continuity correction W = 265884, p-value =0.9049. Before alignment Wilcoxon signed-rank test on log(peak Area) of anchor point with continuity correction W = 336362, p-value=1.655e-09. After alignment Wilcoxon

11 Fig. 3.1. Plot of Anchor Points Embedded in Both Samples. Left Panel is for Sample 1 and Right Panel for Sample 2. signed-rank test on log(peak Area) of anchor point with continuity correction W = 281910, p-value =0.6141. So we conclude that the difference between Anchor points after alignment in two samples has a common median 0, which indicates the method can be used as one alignment method to find or justify Anchor Points in sample two. All the data points in the two sampels before and after alignment are displayed in Figure 3.5, and Figure 3.6. It s clear that the two samples mixed better in the scatter plots after alignment.

Fig. 3.2. Histograms of Scan Number of Anchor Points. Left Panel is for before Alignment and Tight Panel for after Alignment. 12

Fig. 3.3. Histograms of M/Z of Anchor Points. Left Panel is for before Alignment and Right Panel for after Alignment. 13

Fig. 3.4. Histograms of Log(Peak Area) of Anchor Points. Left Panel is for before Alignment and Right Panel for after Alignment. 14

Fig. 3.5. Scatter Plot of Scan Number vs Log (Peak Area) on All Data Points. Left Panel is for before Alignment and Right Panel for after Alignment. 15

Fig. 3.6. Scatter Plot of Scan Number vs. M/Z on All Data Points. Left Panel is for before Alignment and Right Panel for after Alignment. 16

17 4. CONCLUSION One advantage of our method is that it allows statistical assessment of alignment performance. We could statistically evaluate the performance of our methodology with other alignment algorithms on some dataset that has peptides identified with high confidence. The statistical confidence measure of our method was given on sample level. We expanded it to region level and the future work would be developing peptide level statistical confidence measure and pass it to downstream quantitative analysis.

18 REFERENCES [1] N. Tang, P. Tornatore, and S. R. Weinberger. Current Developments in SELDI Affinity Technology. Mass Spectrometry Reviews, 23(1) (2004) 33-34. [2] M. Karas, D. Bachman, U. Bahr, and F. Hillenkamp. Matrix-Assisted Ultraviolet Laser Desorption of Non-Volatile Compounds. Int J Mass Spectrometry Ion Proc, 78 (1987) 53-68. [3] J. K. Eng, A. L. Mccormack, and J. R. III. Yates. An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in A Protein Database. J Am Soc Mass Spectrom, 5 (1987) 976-989. [4] D. N. Perkins, D. J. C. Pappin, and D. M. Creasy. Probability-Based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data. Electrophoresis, 20 (1987) 3551-3567. [5] R. Craig and R. C. Beavis. TANDEM: Matching Proteins with Tandem Mass Spectra. Bioinformatics, 20(9) (2004) 1466-1467. [6] D. Bylund, R. Danielsson, G. Malmquist, and K. E. Markides. Chromatographic Alignment by Warping and Dynamic Programming as A Pre-Processing Tool for Parafac Modelling of Liquid Chromatography-Mass Spectrometry Data. Journal of Chromatography, A 961 (2002) 237-244. [7] N. P. Nielsen, J. M. Carstensen, and J. Smedsgaard. Aligning of Single and Multiple Wave Length Chromatographic Profiles for Chemometric Data Analysis Using Correlation Optimised Warping. Bioinformatics, A 805 (1987) 17-35. [8] W. Wang, H. Zhou, H. Lin, S. Roy, T. A. Shaler, L. R. Hill, S. Norton, P. Kumar, M. Anderle, and C. Becker. Quantification of Proteins and Metabolites by Mass Spectrometry without Isotopic Labeling or Spiked Standards. Analytical Chemistry, 75 (2003) 4818-4826. [9] D. Radulovic, S. Jelveh, S. Ryu, T. G. Hamilton, E. Foss, Y. Mao, and A. Emili. Informatics Platform for Global Proteomic Profiling and Biomarker Discovery Using

19 Liquid-Chromatography-Tandem Mass Spectrometry. Molecular & Cellular Proteomics, 3 (2004) 984-997. [10] P. Wang, H. Tang, M. P. Fitzgibbon, M. Mcintosh, M. Coram, H. Zhang, E. Yi, and R. Aebersold. A Statistical Method for Chromatographic Alignment of LC-MS Data. Biostatistics, 82 (2007) 357-367. [11] N. Jaitly, A. Mayampurath, K. Little_eld, J. N. Adkins, G. A. Anderson, and R. D. Smith. Decon2LS: An Open-Source Software Package for Automated Processing and Visualization of High Resolution Mass Spectrometry Data. Bioinformatics, 10(87) (2009) 1471-2105.

20 VITA Xincheng Tang majored in mechanical engineering at Wuhan University(P.R.China), where he obtained his Bachelor of Engineering in 2006. He received his M.S. in statistics from Texas A&M University in December 2011. His research interests include Quantitative Proteomics, Clinical Trial Design. He may be reached at: Department of Statistics Texas A&M University 3143 TAMU College Station, TX 77843-3143. and his email address is xtang@stat.tamu.edu.