Interpreting raw biological mass spectra using isotopic mass-to-charge ratio and envelope fingerprinting


Research Article

Received: 24 December 2 Revised: 1 February 21 Accepted: March 21 Published online in Wiley Online Library

Rapid Commun. Mass Spectrom. 21, 27, 7 77 (wileyonlinelibrary.com) DOI: 1./rcm.55

Li Li 1,2 and Zhixin Tian 1,2 *

1 State Key Laboratory of Molecular Reaction Dynamics, Dalian National Laboratory for Clean Energy, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Liaoning 1, China
2 Dalian National Laboratory for Clean Energy, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Liaoning 1, China

RATIONALE: Soft ionization, high-resolution mass spectrometry is widely used to characterize large biological molecules such as proteins. Deconvolution ("deisotoping") of isotopic envelopes (iEs) in biological mass spectra into monoisotopic or average masses is challenging because of low signals and heavily overlapped iEs, and it results in many wrong interpretations.

METHODS: Isotopic envelopes (iEs) are used directly, without deisotoping, to identify biological molecules. An algorithm, isotopic mass-to-charge ratio (m/z) and envelope fingerprinting (iMEF), was implemented in the ProteinGoggle search engine for top-down intact protein database searching. iMEF combines isotopic m/z fingerprinting (iMF) and isotopic envelope fingerprinting (iEF), where "isotopic mass-to-charge ratio" means the m/z value of the most abundant isotopic peak within the iE of a precursor or product ion. iMF is used to fish precursor or product ion candidates from the database, which is pre-built and contains the iE information (precursor and product ions) of all proteoforms of the studied system. iEF identifies matching precursor or product ions. A protein is finally identified with a user-specified total number of matching product ions and post-translational modification scores.
RESULTS: The working principles of iMEF and ProteinGoggle, and the definitions of a set of related parameters and scoring metrics, are illustrated with high-resolution tandem mass spectrometric analysis of a mixture of ubiquitin and the HUMAN histone H4 proteoforms. Ubiquitin was confidently identified from its CID, ETD, and HCD spectra with 57, 1, and matching product ions, respectively; 5 proteoforms were confidently found from the H4 dataset. The locations of PTMs in 54 and isoforms were partially and fully identified.

CONCLUSIONS: Database search with iMEF bypasses deisotoping, avoiding the associated errors, and also provides full quality control of matching precursor and product ions and, finally, of protein IDs. Overlapped iEs of different product ions can also be confidently unwrapped in situ. Improvement of ProteinGoggle and the addition of more functionalities and utilities are underway.

Copyright 21 John Wiley & Sons, Ltd.

* Correspondence to: Z. Tian, State Key Laboratory of Molecular Reaction Dynamics, Dalian National Laboratory for Clean Energy, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Liaoning 1, China. E-mail: zhixintian@dicp.ac.cn

Since the very first mass measurement of an electron by Thomson, [1] mass spectrometers have been used to analyze small inorganic and organic molecules as well as large biological molecules. For these molecules, mass spectrometers measure their isotopic envelopes (iEs). Every iE consists of a certain number of isotopic peaks due to the presence of heavy isotopes. For small inorganic and organic molecules (<5 kDa), which are often ionized with electron ionization or chemical ionization and where a charge state of 1 (plus or minus) is normally observed, monoisotopic peaks are normally the predominant peaks within measured iEs, and the corresponding monoisotopic masses (i.e., the m/z value with z being 1) are traditionally used as the so-called masses. For large biological molecules (such as proteins), which are often analyzed by electrospray ionization (ESI) and carry multiple charges, the abundance of the monoisotopic peak is often very low due to the increased probability of multiple heavy isotopes. Observation of the monoisotopic peak becomes unlikely for molecules larger than kDa, and the corresponding monoisotopic masses are missed. In order to find these missing monoisotopic masses, as well as to transform observed charge states to 1 or 0 (with or without a proton in the positive mode), a variety of deconvolution ("deisotoping") algorithms and programs [2 7] have been successfully developed and widely used in biological mass spectrometry.

These monoisotopic masses of both small and large molecules are used either to confirm known molecules or to identify unknown molecules. One of the most used applications of the latter is protein identification from tandem mass spectrometry (MS/MS) data of unknown protein mixtures in the field of top-down proteomics. Deisotoping has been implemented in various search engines (such as ProSightPC 2., [8] MSAlign, [] MS-TopDown, [1] Big Mascot, [11] and PIITA [] ) for protein database search with mass fingerprinting. These search engines have made identification of proteins from thousands of tandem mass spectra possible in a high-throughput fashion, allowing for subsequent pathway analysis, comparison with entire genomes and identification of regulatory post-translational modifications (PTMs).

The raw MS data for these search engines are the measured iEs of charged molecules. These iEs provide all the information for these molecules. Here we propose the use of iEs instead of deconvoluted masses for the analysis of either known or unknown molecules. Proof-of-principle of the idea is demonstrated with top-down protein database searches from high-resolution MS/MS analysis of intact proteins or intact protein mixtures. A search algorithm called isotopic mass-to-charge ratio (m/z) and envelope fingerprinting (iMEF) [1,14] was created and implemented in an engine called ProteinGoggle. iMEF is the combination of isotopic m/z fingerprinting (iMF) and isotopic envelope fingerprinting (iEF). "Isotopic mass-to-charge ratio" here specifically means the m/z value of the most abundant isotopic peak within the iE of a precursor or product ion, and iMF is used to fish precursor ion or product ion candidates out of the database. iEF is used to identify the precursor or product ion. Because the iEs of both precursor and product ions are used directly as measured by MS, the deisotoping step adopted in any mass fingerprinting algorithm is bypassed. The working principles of iMEF and ProteinGoggle, and the definitions of a set of related parameters and scoring metrics, are illustrated with the MS/MS analysis of a ubiquitin and HUMAN histone H4 proteoform mixture using an Orbitrap mass spectrometer.

EXPERIMENTAL

Materials

Ubiquitin (U25), formic acid, and HPLC grade solvents were purchased from Sigma-Aldrich (St. Louis, MO, USA).
Tandem mass spectrometry analysis of ubiquitin and HUMAN histone H4 mixture

Collision-induced dissociation (CID), electron transfer dissociation (ETD), and high-energy collision-induced dissociation (HCD) tandem mass spectra of ubiquitin were acquired in an Orbitrap Elite mass spectrometer (Thermo Scientific, Waltham, MA, USA). Specifically, ubiquitin (2 nmol/l, CH3OH/H2O :1 (v/v), HCOOH 1%) was infused by a syringe pump at nl/min and transmitted to the inlet capillary of the mass spectrometer (4 cm long, 5 mm i.d., C) using electrospray ionization (tip 1 mm i.d., high voltage 2 V). A selected ubiquitin precursor ion (m/z 857, z = 1) was isolated using a units window. The AGC (automatic gain control) targets for the precursor, product, and reagent ions were 5 1 5,11 5, and 5 1 5, respectively. The normalized collision energy, activation Q, and activation time for CID were 5.%,.25, and 1 ms, respectively. The activation time for ETD was ms. The normalized collision energy and activation time for HCD were set to 2.% and 1 ms. Both MS and MS/MS spectra were acquired and averaged with 5 micro-scans.

The liquid chromatography (weak cation exchange)/tandem mass spectrometry (LC/MS/MS) dataset of the HeLa histone H4 proteoform mixture was taken from PeptideAtlas [] (dataset Identifier: PASS7). [1] The H4 family mixture was separated from HeLa core histones with reversed-phase LC.

Implementation of the iMEF algorithm

Step 1. Conversion of raw experimental data

Single MS/MS spectra and LC/MS/MS datasets from the analysis of single intact proteins and intact protein mixtures are converted into centroid format if they were acquired in profile mode.

Step 2. Preparation of a database

A customized database including combinatorial PTMs and amino acid variations (at the user's choice) of all relevant proteins and their proteoforms is generated from the corresponding flat text file downloaded from Uniprot. [17] The iEs are generated and stored for both precursor and product ions of every proteoform. An example database of ubiquitin is provided in Supplementary Table S1 (see Supporting Information). The theoretical iEs from different charge states of each proteoform with m/z values in the experimental MS spectrum acquisition range (e.g. 5 2) are calculated and stored in the MySQL data directory as "precursor_ions". The iEs of all product ions (1+ only; b/y ions for CID/HCD and c/z ions for ETD/ECD) are also generated as "product ions"; the theoretical relative abundance of higher charge states (smaller than that of the precursor ion) is the same as that of 1+, and the m/z information is calculated on the fly. Selected information from the input flat text file is read in as "flat_txt".

Table 1. iEF parameters for the identification of a precursor or product ion

Short name | Full name | Definition | Value
IPMD | isotopic peak m/z deviation | Relative deviation in ppm of an experimental isotopic peak m/z from the corresponding theoretical value within an iE | ppm
IPMDO | isotopic peak m/z deviation outlier | Percentage of isotopic peak outliers (among all isotopic peaks above IPACO) whose m/z deviation is allowed to be larger than IPMD within an iE | 2%
IPMDOM | isotopic peak m/z deviation outlier maximum | Maximum IPMD allowed for IPMDOs within an iE | ppm
IPAD | isotopic peak abundance deviation | Relative abundance deviation of an experimental isotopic peak from that of the corresponding theoretical one within an iE | 1%
IPADO | isotopic peak abundance deviation outlier | Percentage of isotopic peak outliers (among all isotopic peaks above IPACO) whose abundance deviation is allowed to be larger than IPAD within an iE | 2%
IPADOM | isotopic peak abundance deviation outlier maximum | Maximum IPAD allowed for IPADOs within an iE | 2%

Table 2. Additional parameters to the iEF ones listed in Table 1 for the identification of a protein ID

Short name | Full name | Definition | Value
PMPs | Percentage of matching product ions | Minimum percentage of experimental matching product ions for the identification of a protein. PMPs is calculated by dividing the actual number of matching product ions (MPs) by the total number of theoretical product ions (2n-2), where n is the number of amino acids in the protein sequence | 5%
PTM Score | Post-translational modification score | Total number of non-redundant matching product ions (multiple charge states are counted only once) that independently define the unique localization of a PTM. For such a specific product ion, no other modification along the amino acid sequence range of the product ion in the database used in the search could account for the extra elemental composition (e.g., CH 2 from methylation) from the modification; other modifications are potentially either the same modification on the same amino acid but at a different location, or a different type of modification with the same elemental composition on the same or a different amino acid | 1

Step 3. MS level iMEF of precursor ions: searching for protein candidates

Step 3.1 Searching of preliminary protein candidates for the precursor ion starts from the first MS/MS spectrum using iMF. The m/z value of the most abundant isotopic peak (named the Highest Isotopic Peak, excluding those on the Isotopic Peak Exclusion List, Scheme 1) is searched in the database with a tolerance parameter (e.g., ppm, IPMD in Table 1). This search is conducted only for isotopic peaks within the isolation window of the precursor ion in the preceding MS spectrum. Every searched m/z value is added to the Isotopic Peak Exclusion List (hereinafter the same). If one or more preliminary protein candidates are found, Step 3.4 is followed (hereinafter the same).

Step 3.2 The m/z value of the left isotopic peak of the Highest Isotopic Peak (named the Left Isotopic Peak) is chosen to repeat iMF as in Step 3.1.

Figure 1. Representative graphical user interfaces (GUIs) of ProteinGoggle: (a) Main GUI; (b) Create a new database; (c) Database and Repository Manager; (d) ProteinGoggleView list of protein IDs; (e) Graphical view of protein ID histone H4_S1acK1acK2me2. From the top are sequentially the iEF map of the precursor ion, the graphical fragmentation map, and the iEF maps of all matching product ions.

Scheme 1. Working flow chart of iMEF. Highest Isotopic Peak = the most abundant isotopic peak (not on the Isotopic Peak Exclusion List) in the MS spectrum; Left Isotopic Peak = the left isotopic peak of the highest isotopic peak; Right Isotopic Peak = the right isotopic peak of the highest isotopic peak; A = Algorithm, DB = DataBase, exp. = experimental, iE = Isotopic Envelope, N = No, MPs = Number of Matching Product ions, P = Parameter, theo. = theoretical, Y = Yes; A1 = iMF, A2 = iEF, A3 = iMEF, P1 = IPMD, P2 = IPACO + IPAD + IPADO + IPADOM + IPMD + IPMDO + IPMDOM, P3 = P2 + PMPs + PTM Score.

Step 3.3 The m/z value of the right isotopic peak of the Highest Isotopic Peak (named the Right Isotopic Peak) is chosen to repeat iMF as in Step 3.1.

Step 3.4 iEF of each preliminary protein candidate found in Steps 3.1, 3.2, and 3.3 is conducted to secure protein candidates. In iEF, a relative abundance threshold (isotopic peak abundance cutoff, IPACO) is first used to make an abundance cutoff, above which all theoretical isotopic peaks are fingerprinted with the corresponding experimental ones using the two parameters IPMD and IPAD (Table 1) to limit the m/z and abundance deviations. IPACO, IPMD, and IPAD are essential parameters for the fingerprinting. A set of four other optional parameters, IPMDO, IPMDOM, IPADO, and IPADOM (Table 1), is used to control outlying isotopic peaks.

Step 3 is repeated until the most abundant isotopic peak in the remaining MS spectrum within the isolation window of the precursor ion is below a user-specified precursor abundance threshold (PreAT). If multiple precursor ions co-elute in the isolation window, each is independently searched against the database, although the MS/MS spectrum and corresponding product ions are shared.

Figure 2. iEF of ubiquitin (z = 1). IPMD = ppm, IPACO = 1%, IPMDO = 2%, IPMDOM = ppm, IPAD = 1%, IPADO = 2%, IPADOM = 2%, Window = (). Actual IPMD and IPAD values for all isotopic peaks above IPACO are marked by the peaks.

Step 4. MS/MS level iMEF of product ions: searching for initial protein IDs

Once all the protein candidates from the database for a particular experimental precursor ion have been secured, iMEF of the product ions, comparing the experimental and the theoretical, is carried out using either a top-down or a targeted screening approach. In the top-down screening approach, as used in Step 3, iMEF of the observed and theoretical product ions starts from the most abundant isotopic peak and continues until the remaining isotopic peaks in the MS/MS spectrum are below a user-specified product ion abundance threshold (ProAT). In the targeted screening approach, for each protein candidate, the iMF of each theoretical product ion is carried out in a targeted search of product ion candidates in the experimental MS/MS spectrum. iEF of every product ion candidate is then carried out in a search for matching product ions. All matching product ions of each protein candidate are collected for the calculation of PMPs and PTM Scores. Protein candidates passing PMPs and PTM Scores (Table 2) are preliminary protein IDs, and the preliminary protein ID with the largest MPs is chosen as an initial protein ID. If only a single MS/MS spectrum from dissociation of a pure protein is provided, ProteinGoggle will fingerprint the spectrum to the precursor entry with the highest charge state in the corresponding database.

Figure 3. The overall graphical fragmentation maps together with re-drawn representative iEF maps of matching product ions for identification of ubiquitin (z = 1) from CID (a, d), ETD (b, e), and HCD (c, f). The search parameters are IPMD = ppm, IPACO = 1%, IPMDO = 2%, IPMDOM = ppm, IPAD = 1%, IPADO = 2%, IPADOM = 2%, PMPs = 5%, and PTM Score = 1. Actual IPMD and IPAD values for all isotopic peaks above IPACO are marked by the peaks.

Operation of ProteinGoggle

When ProteinGoggle is started, the main graphical user interface (GUI) opens (Fig. 1(a)) for users to run a new search by first creating a customized database, then selecting a dataset file, opening or creating a search parameter file, and specifying a directory for the output results. Users can also revisit and view previous search results by clicking the view button at the bottom of the main GUI.

Create a customized database

All ProteinGoggle databases are managed by MySQL Server 5.. From the Databases pull-down menu at the main GUI, users choose create a database to open the database-creating GUI (Fig. 1(b)). Users first open an already downloaded Uniprot [17] flat text file which contains all the proteins of interest. ProteinGoggle reads out all types of PTMs and amino acid variations for users to select. Users then specify the MS acquisition window and the maximum number of PTMs (as well as amino acid variants) for ProteinGoggle to generate the iE information of all the combinatorial proteoforms and all possible charge states of each proteoform in the window. Basic information on generated databases can be viewed by selecting Database Manager (Fig. 1(c)) from the Databases pull-down menu.

Choose input files

At the main GUI, users select a dataset file, a database, and search parameters. The dataset can be a single tandem mass (MS/MS) spectrum from dissociation of a protein or an LC/MS/MS dataset from the analysis of a protein mixture. ProteinGoggle currently reads both Thermo .RAW and .mzXML data formats. Users then choose a related database from the database column. For search parameters, users can open a saved file or change every parameter by direct keyboard input or by clicking the up and down buttons at the side of each parameter. Users can then save the new set of parameters as a new parameter file or overwrite the original file.

Run the search

After selection of a dataset file, a database, and search parameters, the search is initiated by clicking the Execute button at the main GUI. The default output file path is the same as that of the dataset. Upon completion of a search, ProteinGoggle outputs the combined initial protein IDs in an MS Excel file and a search parameter summary file and, for every protein ID, iEF maps of both the precursor ion and all matching product ions in JPG format. At the same time, all search results pop up in ProteinGoggleView (Fig. 1(d)). By clicking any protein ID, users can view the iEF map of the precursor ion, the graphical fragmentation map, and the iEF maps of all the matching product ions in the Image window (Fig. 1(e)).

Obtain final protein IDs

The combined initial protein IDs in the Excel file can be grouped in MS Access by protein sequence and combinatorial PTMs, with both PTM Score and MPs as filtering criteria, to remove duplicates.

Figure 4. Overlapped iEs of z74 8+, c7 8+, and c4 7+ from ETD of ubiquitin (z = 1) are well unwrapped with iMEF. The abundance of overlapped isotopic peaks is proportionally partitioned to minimize the total relative abundance deviation (IPAD, absolute value) of every isotopic peak of the separated iEs. au = arbitrary unit.

Figure 5. An example of PTM Score assignment. HUMAN histone H4 (P285) with acetylation on S1, K1, and dimethylation on K2 was identified with 4 matching product ions. PTM Scores for S1ac, K1ac, and K2me2 are 1, 1, and 5, respectively.

Figure 6. Standard deviation of IPMD (a) and IPAD (b) of ubiquitin (m/z 857, z = 1). IPAD and IPMD values of every isotopic peak above IPACO = 1% in the iE are calculated and labeled.

RESULTS AND DISCUSSION

For iEF of the precursor ion of ubiquitin (z = 1) (Fig. 2) in the MS spectrum, 1 isotopic peaks in the theoretical iE are above an IPACO of 1%. These peaks passed fingerprinting with the corresponding experimental ones using the parameters in Table 1. Thus, this ubiquitin ion becomes a precursor candidate for subsequent iMEF of product ions: 57, 1, and matching product ions were found for the CID, ETD, and HCD of ubiquitin (z = 1), respectively, all of which pass the identification parameters in Table 2. The overall graphical fragmentation maps together with re-drawn representative iEF maps of matching product ions are displayed in Fig. 3. The original iEF maps directly output by ProteinGoggle for all the matching product ions for CID, ETD, and HCD of ubiquitin are provided in Supplementary Figs. S1, S2, and S3 (Supporting Information), respectively.

In the analysis of the histone H4 proteoform mixture dataset, 14,8 initial protein IDs were obtained, of which 5 are unique IDs. The locations of PTMs in 54 and isoforms were partially and fully identified, respectively. The detailed information for these protein IDs is provided in Supplementary Table S2 (see Supporting Information). This search took about 2 h on a laptop computer with an Intel i CPU of 2.GHz and RAM of.41gb.

In addition to a decrease in the time required for deisotoping of MS/MS data, by comparing experimental raw data with theoretical iEs, iMEF in ProteinGoggle provides a direct comparison between observed data and the corresponding predicted data from the database. This both removes uncertainties in the deisotoping steps and provides adjustable and visible confidence for identification.
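The precursor-level search described in the implementation steps (fishing candidates by the m/z of the most abundant isotopic peak, then fingerprinting the whole envelope) can be illustrated with a minimal Python sketch. It is not the ProteinGoggle implementation. The names `IefParams`, `imf_candidates`, and `ief_match`, the dictionary layout of the database, the exact formula used for the relative abundance deviation, and all numerical values are assumptions made for illustration; the sketch also assumes that experimental and theoretical peaks have already been paired one to one and that abundances are given as percentages of the most abundant peak.

```python
from dataclasses import dataclass


@dataclass
class IefParams:
    """Illustrative thresholds; the numbers used below are made up."""
    ipaco: float   # abundance cutoff (%) applied to theoretical peaks
    ipmd: float    # allowed m/z deviation (ppm)
    ipmdo: float   # share (%) of peaks allowed to exceed ipmd
    ipmdom: float  # hard m/z deviation limit (ppm) for those outliers
    ipad: float    # allowed relative abundance deviation (%)
    ipado: float   # share (%) of peaks allowed to exceed ipad
    ipadom: float  # hard abundance deviation limit (%) for those outliers


def ppm_deviation(exp_mz, theo_mz):
    """Signed m/z deviation of an experimental peak in ppm."""
    return (exp_mz - theo_mz) / theo_mz * 1e6


def imf_candidates(peak_mz, database, tol_ppm):
    """iMF: keep entries whose most abundant theoretical peak is within tol_ppm."""
    return [entry for entry in database
            if abs(ppm_deviation(peak_mz, entry["top_mz"])) <= tol_ppm]


def _passes(deviations, limit, outlier_pct, outlier_max):
    """Accept if only a bounded share of peaks exceeds limit, none beyond outlier_max."""
    outliers = [d for d in deviations if d > limit]
    if any(d > outlier_max for d in outliers):
        return False
    return 100.0 * len(outliers) / len(deviations) <= outlier_pct


def ief_match(theo, exp, p):
    """iEF: compare paired (m/z, relative abundance %) peaks of two envelopes."""
    mz_dev, ab_dev = [], []
    for (t_mz, t_ab), (e_mz, e_ab) in zip(theo, exp):
        if t_ab < p.ipaco or t_ab <= 0:   # skip theoretical peaks below the cutoff
            continue
        mz_dev.append(abs(ppm_deviation(e_mz, t_mz)))
        ab_dev.append(abs(e_ab - t_ab) / t_ab * 100.0)  # assumed IPAD definition
    if not mz_dev:
        return False
    return (_passes(mz_dev, p.ipmd, p.ipmdo, p.ipmdom)
            and _passes(ab_dev, p.ipad, p.ipado, p.ipadom))


# Made-up example: one theoretical envelope and a slightly perturbed "experimental" copy.
params = IefParams(ipaco=10, ipmd=5, ipmdo=20, ipmdom=10, ipad=20, ipado=20, ipadom=50)
theo = [(857.0, 40.0), (857.1, 80.0), (857.2, 100.0),
        (857.3, 70.0), (857.4, 30.0), (857.5, 5.0)]
exp = [(mz * (1 + 2e-6), ab * 1.05) for mz, ab in theo]
database = [{"name": "entry_A", "top_mz": 857.2}, {"name": "entry_B", "top_mz": 860.0}]

top_peak_mz = max(exp, key=lambda peak: peak[1])[0]   # most abundant experimental peak
candidates = imf_candidates(top_peak_mz, database, params.ipmd)
print([c["name"] for c in candidates], ief_match(theo, exp, params))
```

In this sketch the outlier parameters act as a two-level tolerance: a limited share of peaks may exceed the primary limit, but none may exceed the outlier maximum. A real implementation would also need to pair peaks, handle missing experimental peaks, and maintain the Isotopic Peak Exclusion List, all of which are omitted here.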
All fingerprinting parameters (IPMD, IPACO, IPMDO, IPMDOM, IPAD, IPADO, and IPADOM) for both precursor and product ions, as well as the identification parameters (PMPs and PTM Score), can be freely specified by the user, so that the analysis can be customized to the type of instrumentation used and the study design. The program also provides a set of suggested default values.

For overlapped ies of adjacent ions, the abundance of the overlapped isotopic peaks can be proportionally partitioned. A partition coefficient is assigned to each overlapped isotopic peak, and all coefficients are optimized to minimize the total relative abundance deviation (IPAD, absolute value) over every isotopic peak of the separated ies. The overlapped ies of z74 8+, c7 8+, and c4 7+ from ETD of ubiquitin (z = 1) were efficiently separated with this partitioning strategy (Fig. 4; detailed data are provided in Supplementary Table S, see Supporting Information).

L. Li and Z. Tian
Rapid Commun. Mass Spectrom. 21, 27, 7 77 Copyright 21 John Wiley & Sons, Ltd. wileyonlinelibrary.com/journal/rcm

Figure 7. Standard deviation of IPMD of nine representative high, medium, and low abundance ions from CID (y58 7+, y24 4+, y7 4+), ETD (c75 7+, c27 +, z4 +), and HCD (y58 7+, y 2+, b4 1+) of ubiquitin (z = 1). IPMD values of every isotopic peak above IPACO = 1% in the ies are calculated and labeled. [Panels: rows CID, ETD, HCD; columns high, medium, low abundance.]

Figure 8. Standard deviation of IPAD of nine representative high, medium, and low abundance ions from CID (y58 7+, y24 4+, y7 4+), ETD (c75 7+, c27 +, z4 +), and HCD (y58 7+, y 2+, b4 1+) of ubiquitin (z = 1). IPAD values of every isotopic peak above IPACO = 1% in the ies are calculated and labeled. [Panels: rows CID, ETD, HCD; columns high, medium, low abundance.]

Figure. Comparison of imef (as implemented in ProteinGoggle) and PMF (as implemented in ProSightPC) in the protein database search of ETD of ubiquitin ( 857, z = 1). a) Venn diagram of matching product ions from the two search engines; b) example ief map of c7 1+ from the 4 unique matching product ions of ProSightPC; c) example ief map of c45 5+ from the unique matching product ions of ProteinGoggle.

Confidence of PTM location assignment is characterized by the PTM Score. If multiple PTMs appear on the same protein, the PTM Scores are calculated separately. An example of an H4 proteoform with acetylation on S1, K1, and dimethylation on K2 is shown in Fig. 5, where the PTM Scores are 1, 1, and 5, respectively.

Because the IPAD and IPMD of isotopic peaks in the ies of both the precursor and the product ions are key parameters used in the search, their good reproducibility is indispensable for robust identification. The reproducibility of IPMD and IPAD at both the MS and MS/MS levels was evaluated with a sample size of 5 (n = 5). Standard deviations of IPMD and IPAD were calculated and plotted for all isotopic peaks above IPACO = 1% in the ies of both the precursor ion and representative high, medium, and low abundance matching product ions from CID, ETD, and HCD. The standard deviations of IPMD and IPAD of ubiquitin ( 857, z = 1) at the MS level span the ranges of .2 1. ppm and 4%, respectively (Fig. ). At the MS/MS level, the standard deviations of IPMD and IPAD were evaluated for nine representative high, medium, and low abundance ions from CID (y58 7+, y24 4+, y7 4+), ETD (c75 7+, c27 +, z4 +), and HCD (y58 7+, y 2+, b4 1+). The standard deviation of IPMD of all ions is mostly within 2 ppm, although those of z4 + go up to 4 ppm (Fig. 7).
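The reproducibility measure used here, a standard deviation per isotopic peak over n replicate acquisitions, can be sketched in a few lines of Python. The sketch assumes the sample (n − 1) standard deviation, since the text does not say which estimator was used; the function name `per_peak_stdev` and the replicate values are hypothetical and serve only as an illustration.

```python
from statistics import stdev


def per_peak_stdev(replicates):
    """Standard deviation of a deviation metric (IPMD or IPAD) per isotopic peak.

    replicates: list of runs; each run is a list holding one value per
    isotopic peak, in the same peak order for every run.
    Returns one sample standard deviation per isotopic peak.
    """
    if len(replicates) < 2:
        raise ValueError("at least two replicate runs are required")
    n_peaks = len(replicates[0])
    if any(len(run) != n_peaks for run in replicates):
        raise ValueError("every run must report the same isotopic peaks")
    # Transpose runs x peaks into peaks x runs, then reduce each peak.
    return [stdev([run[k] for run in replicates]) for k in range(n_peaks)]


# Made-up IPMD values (ppm) for two isotopic peaks measured in n = 5 runs.
ipmd_runs = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0], [5.0, 2.0]]
ipmd_sd = per_peak_stdev(ipmd_runs)
```

The same function applies unchanged to IPAD values expressed in percent.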
The standard deviation of IPAD of all ions is mostly within 2%, although those of z4 + go up to 47% (Fig. 8). In general, the higher the abundance of the ion, the smaller the standard deviations of both IPMD and IPAD. Because the efficiency of ETD is normally much lower than that of CID and HCD, the abundance of its MS/MS spectrum is on average much lower, and the highest standard deviations of both IPMD and IPAD appeared in the product ions from ETD.

To compare imef with peptide mass fingerprinting (PMF) algorithms in a protein database search, the ETD spectrum of ubiquitin ( 857, z = 1) was also searched with ProSightPC 2.. The search was done in the absolute mass mode using monoisotopic masses for both the precursor and the product ions. The parameters adopted in the search are: minimum signal-to-noise ratio (S/N) =, minimum RL value =., product ion mass tolerance = ppm. Among 2 matching product ions, 58 were also found by ProteinGoggle (Fig. (a)). Among the 4 unique matching product ions of ProSightPC 2., 27 have missing isotopic peaks (up to 4) in their ies (IPACO = 1%); the ief map of one example, c7 1+, is shown in Fig. (b) and the rest are provided in Supplementary Fig. S4 (see Supporting Information). Six have very high IPAD values (from 5 to ), and the remaining z4 4+ has a good ief map. For the unique matching product ions of ProteinGoggle, no isotopic peak is missing in any of the ies and the highest allowed IPAD value is 2%. The ief map of one example, c45 5+, is shown in Fig. (c) and the rest are provided in Supplementary Fig. S5 (see Supporting Information).

For the analysis of the H4 dataset, proteoforms were identified by ProSightPC with a P Score cutoff of 1E-4, where 47 proteoforms were also identified by ProteinGoggle. In addition to the absolute mass mode, ProSightPC also has biomarker and sequence tag search modes, whereas ProteinGoggle currently has only one search mode.
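The proportional partitioning of overlapped isotopic peaks described earlier in this section can be illustrated with a small numerical sketch. The text states only that the partition coefficients are optimized to minimize the total absolute IPAD of the separated ies; the optimizer is not specified, so the exhaustive grid search below is a stand-in chosen for clarity, not the method used in ProteinGoggle. IPAD is assumed to be the percent deviation of the relative abundance from its theoretical value after normalizing each envelope to its base peak. The names `total_abs_ipad` and `partition` and all intensities are hypothetical.

```python
from itertools import product


def total_abs_ipad(theo, assigned):
    """Sum of |IPAD| (%) over one envelope.

    theo:     theoretical relative abundances in % (base peak = 100).
    assigned: intensities attributed to this ion at the same peak positions.
    """
    base = assigned[theo.index(max(theo))]
    if base <= 0:
        return float("inf")  # no intensity left for the base peak
    return sum(abs(100.0 * a / base - t) / t * 100.0
               for t, a in zip(theo, assigned))


def partition(theo_a, obs_a, theo_b, obs_b, shared, steps=100):
    """Split overlapped peaks between ions A and B by grid search.

    shared: list of (index_in_a, index_in_b) pairs naming overlapped peaks;
            obs_a[i] and obs_b[j] both hold the total observed intensity.
    Returns (best_cost, coefficients), where each coefficient is the
    fraction of the overlapped peak attributed to ion A.
    """
    grid = [k / steps for k in range(steps + 1)]
    best_cost, best_coeffs = float("inf"), None
    for coeffs in product(grid, repeat=len(shared)):
        a, b = list(obs_a), list(obs_b)
        for p, (i, j) in zip(coeffs, shared):
            total = obs_a[i]
            a[i] = p * total          # share given to ion A
            b[j] = (1.0 - p) * total  # remainder given to ion B
        cost = total_abs_ipad(theo_a, a) + total_abs_ipad(theo_b, b)
        if cost < best_cost:
            best_cost, best_coeffs = cost, coeffs
    return best_cost, best_coeffs


# Made-up example: the third peak of ion A coincides with the first peak of
# ion B, and the observed intensity there (2500) is the sum of both ions.
theo_a, obs_a = [60.0, 100.0, 50.0], [600.0, 1000.0, 2500.0]
theo_b, obs_b = [100.0, 70.0], [2500.0, 1400.0]
cost, coeffs = partition(theo_a, obs_a, theo_b, obs_b, shared=[(2, 0)])
```

The grid search scales exponentially with the number of overlapped peaks, so it is practical only for the handful of shared peaks typical of two or three adjacent envelopes; a continuous optimizer would be the natural replacement for larger cases.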

Further developments include a decoy search for false discovery rate control, a comprehensive score for each ID, and other utilities. Because of the success that we have observed in protein database search using imef, we have been incorporating it into two additional search engines, PeptideGoggle and GlycanGoggle, for peptide and glycan database search, respectively.

CONCLUSIONS

Protein database searches with imef and ProteinGoggle use raw MS data, i.e., the ies of both precursor and product ions, for fingerprinting and database search, and the deisotoping step used in other mass fingerprinting algorithms is bypassed. With the development and wide availability of mass spectrometers of high mass resolution and mass measurement accuracy, we expect that imef and ProteinGoggle will find wide application.

Based on .NET Framework 4., the ProteinGoggle program has been written in C# with a Client/Server (C/S) infrastructure and MySQL Server 5. as the data system. The code uses the standard three-layer structure, i.e., User Interface/Business Logic Layer/Data Access Layer (UI/BLL/DAL). A standalone ProteinGoggle 1. software package for computers running Windows operating systems is freely available to academic users upon request.

SUPPORTING INFORMATION

Additional supporting information may be found in the online version of this article.

Acknowledgements

This work was partially supported by the Dalian Institute of Chemical Physics Research Start-up Fund and China Youth 1-talents Scheme.