Algorithms for Identifying Protein Cross-Links via Tandem Mass Spectrometry ABSTRACT

Size: px

Start display at page:

Download "Algorithms for Identifying Protein Cross-Links via Tandem Mass Spectrometry ABSTRACT"

Bruno Morgan
6 years ago
Views:

1 JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 6, 2001 Mary Ann Liebert, Inc. Pp Algorithms for Identifying Protein Cross-Links via Tandem Mass Spectrometry TING CHEN, 1 JACOB D. JAFFE, 2 and GEORGE M. CHURCH 3 ABSTRACT Cross-linking technology combined with tandem mass spectrometry (MS-MS) is a powerful method that provides a rapid solution to the discovery of protein protein interactions and protein structures. We studied the problem of detecting cross-linked peptides and crosslinked amino acids from tandem mass spectral data. Our method consists of two steps: the rst step nds two protein subsequences whose mass sum equals a given mass measured from the mass spectrometry; and the second step nds the best cross-linked amino acids in these two peptide sequences that are optimally correlated to a given tandem mass spectrum. We designed fast and space-ef cient algorithms for these two steps and implemented and tested them on experimental data of cross-linked hemoglobin proteins. An interchain crosslink between two subunits was found in two tandem mass spectra. The length of the cross-linker (7.7 Å ) is very close to the actual distance (8.18 Å ) obtained from the molecular structure in PDB. Key words: proteomics, mass spectrometry, algorithms, protein cross-linking, protein-protein interactions, protein folding. 1. INTRODUCTION In recent years, more and more genomes of model organisms have been sequenced. Using these genomic sequences, researchers have focused on the identi cation of genes on the genome, the study of gene regulation and gene regulatory networks, the discovery of signal transduction pathways, the determination of protein structures, the detection of protein protein, protein DNA, and protein metabolite interactions, and the elucidation of functions of genes and their protein products. A method which combines chemical cross-linking of proteins with mass spectrometry or tandem mass spectrometry may be useful in discovering protein complexes, determining their structure (Young et al., 2000) and/or quantitating in vivo concentrations. This paper focuses on new algorithms for interpretation of complex experimental data generated by protein cross-linking and tandem mass spectrometry. Traditionally, three-dimensional structures of proteins are solved by x-ray crystallography and NMR. However, generating an accurate structure that satis es constraints of experimental data can be extremely 1 Department of Molecular Biology, University of Southern California, Los Angeles, CA Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA Department of Genetics, Harvard Medical School, Boston, MA

2 572 CHEN ET AL. dif cult. There has been some success in other computational methods that predict structures from energy functions, multiple alignments, and threading. However, the accuracy and general applicability of these methods lags far behind the rates at which new protein sequences are being identi ed. Classical methods of studying protein protein interactions often scale up poorly, make assumptions about number of interacting proteins or their modi cation state, or lose details about environmental effects on complex formation. Cross-linking technology combined with mass spectrometry provides an alternative approach to detecting protein protein interactions and adding reliable distance constraints to protein structures. Previous studies of cross-linking have been able to produce low resolution interatomic distance constraints, which in conjunction with threading, has led to the determination of the three-dimensional structure of a model protein (Young et al., 2000). A technique called isotopic labeling (Kellogg et al., 2001) has been proposed to separate peptides and cross-links in mass spectrometry. Through a proper design of labeling, the crosslinks can be easily identi ed by checking their masses in the spectra. Similar techniques can be applied to identify the docking structure of two interacting proteins. However, these ideas mostly focus on the individual protein or interaction, and it is not clear how to scale them up to the whole proteome level. In this paper, we applied tandem mass spectrometry to detect cross-linked peptides and cross-linked amino acids, which has the potential for large scale applications. Tandem mass spectrometry plays a powerful role in the identi cation of cross-linking sites. Combined with high performance liquid chromatography (HPLC), tandem mass spectrometry has been widely used to identify peptides and analyze protein sequences. Peptides, i.e., of an amino acid sequence HNHCHR 1 CO- NHCHR 2 CO- -NHCHR n COOH, resulting from proteolytic digestion of proteins are separated by HPLC and then analyzed in a mass spectrometer. The latter is a two step process which consists of measuring the massto-charge (m/z) ratio of an ionized peptide (the parent ion) and then measuring the m/z of fragmentation products (the daughter ions) of the peptide after collision-induced dissociation (CID). For each molecule of this peptide, a randomly selected peptide bond is broken, and as a result, CID produces a ladder of ions, typically N-terminal b ions (HNHCHR 1 CO- -NHCHR i CO C ) and C-terminal y ions (HNHCHR i+1 CO- -NHCHR n COOH C H C ), where two consecutive b-ions (or y-ions) differ by exactly one amino acid. These ions display a characteristic pattern for this peptide on the tandem mass spectrum. Computer programs such as SEQUEST (Eng et al., 1994) correlate peptide sequences in a protein database with the tandem mass spectrum and report all peptides with signi cant correlation scores. Some programs (Pevzner et al., 2000) allow searching the database with amino acids modi cations. An alternative approach (Taylor and Johnson, 1997; Dancik et al., 2000; Chen et al., 2001), called de novo peptide sequencing, extracts candidate peptide sequences from the spectral data before they are validated and/or available in a database. We performed an experiment that rst chemically cross-linked interacting proteins, then digested them proteolytically (using trypsin), and nally separated and identi ed cross-linking sites through HPLC-tandem mass spectrometry. This paper focuses on the algorithmic solutions to the identi cation of cross-linked peptides from the tandem mass spectral data. Cross-link structures are shown in Fig. 1. Figure 1(a) shows an H-structure cross-link where two amino acids in two peptides are connected to a linker by covalent bonds, Fig. 1(b) shows that the linker may simply decorate some amino acid of a peptide, Fig. 1(c) shows a self-cross-linked peptide where two amino acids on this peptide are connected by a linker. Figure 1(d) shows two doubly cross-linked peptides. Our main interest is to identify an H-structure cross-link (Fig. 1(a)). Our approach to nding this crosslink from a tandem mass spectrum uses the following two steps: ² Step 1: Given the mass of the parent cross-link molecules measured in mass spectrometry, nd every pair of peptides whose mass sum (plus the mass of the linker) equals this mass. ² Step 2: Given a pair of peptides, nd the cross-linked amino acid pair that is optimally correlated to the tandem mass spectrum. Finally, the algorithm reports the pair of peptides with the maximum correlation score. For simplicity, assuming that we are given k protein sequences each with n proteolytic amino acids (the targets for proteolytic digestion, i.e., for trypsin, they are Ks and Rs), we can ² Solve Step 1 in O.kn 2 log.kn// time and O.kn/ space.

3 IDENTIFYING PROTEIN CROSS-LINKS 573 FIG. 1. Four cross-link structures: (a) two cross-linked peptides, (b) one decorated peptide, (c) one self cross-linked peptide, (d) two doubly cross-linked peptides. If two peptide sequences of m amino acids are given and the tandem mass spectrum has h mass peaks, we can ² Solve Step 2 in O.m C h/ time and O.m C h/ space. The algorithm does not depend on the number of possible cross-linked amino acid pairs. Our paper is organized as follows. Section 2 describes the algorithm to nd two protein subsequences whose mass sum equals a given mass. Section 3 describes the algorithm of nding the best cross-linking site between two peptides. Section 4 discusses implementation issues and the probabilities of solutions. Section 5 reports the implementation of our algorithms and the test on hemoglobin cross-linking experimental data. 2. IDENTIFYING TWO SUBSEQUENCES FOR A MASS SUM There are n.n C 1/=2 possible subsequences corresponding to a protein with n proteolytic amino acids. For example, trypsin cuts a protein sequence right after a lysine(k) or an arginine(r). The exact place of each cut depends on the distribution of Ks and Rs: if two Ks or Rs are very close, the rst one may not be cut. Moreover, local sequence in uences and modi cations may protect some Ks and Rs from being cleaved by trypsin. Therefore, two protein sequences have O.n 4 / possible pairs of subsequences. To speed up the computation, we translate a protein sequence P into an array A of nc1 masses, each of which corresponds to a unique subsequence between two adjacent proteolytic amino acids. For example, P =NRDNKT, when trypsin is used for digestion, is translated into an array A D.A 1 ; A 2 ; A 3 / where A 1 D the mass for NR, A 2 D the mass for DNK, and A 3 D the mass for T. Thus, any proteolytic subsequence

4 574 CHEN ET AL. of P has the mass sum equal to the sum of the elements in the corresponding interval of A. For example, the subsequence DNKT has the mass of A 2 C A 3. Since the mass of the linker is xed, we focus on nding two protein subsequences for a given mass M, which is equivalent to the Subsequence Sum Problem (SSP) de ned below. De nition 1. Subsequence Sum Problem (SSP). Given two positive n-number sequences A D.a 1 ; : : : ; a n / and B D.b 1 ; : : : ; b n /, and a number M, nd all possible pairs of subsequences,.a i ; : : : ; a j /, 1 i j n, and.b k ; : : : ; b l /, 1 k l n, such that jx a s C sdi lx b t D M: tdk First, we consider an n-number sequence A and a mass M, and we are asked to nd a subsequence of A such that its sum equals M. The following Algorithm Find-A solves this problem. Algorithm Find-A.A; M/ 1. i D 1, j D 1; # 1st seq:.a i ; ; a j / D.a 1 / 2. sum D a 1 ; # sum is the sum of.a 1 / 3. While j < n or sum M 4. If sum D M # Solution: sum D M 5 then output.i; j/; 6. If sum < M # Add a jc1 to.a i ; ; a j / 7. then j D min.j C 1; n/, sum D sum C a j ; 8. Else # Delete a i from.a i ; ; a j / 9. sum D sum a i, i D i C 1. Theorem 1. Given a number M and a positive number sequence A D.a 1 ; ; a n /, Algorithm Find-A nds all the subsequences.a i ; ; a j /; 1 i j n, satisfying P j sdi a s D M, in O.n/ time and O.n/ space. Proof. Straightforward. Following Algorithm Find-A and Theorem 1, we have Lemma 2. The SSP can be solved in O.n 3 / time and O.n C p/ space, where p is the number of solutions, and the number of solutions for SSP is at most O.n 3 /. Proof. For every subsequence a i ; : : : ; a j of A, Algorithm Find-A nds all subsequences of B that satisfy P l tdk b t D M P j sdi a s in O.n/ time. There are O.n 2 / subsequences of A, and thus SSP can be solved in O.n 3 / time. This algorithm stores all the solutions in O.p/ space and requires O.n/ space for A and B, a total of O.n C p/ space. Since there are at most O.n/ solutions (subsequences of B ) for every subsequence of A, The number of solutions for SSP is at most O.n 3 /. If O.n 2 / space is allowed, then the following holds. Lemma 3. The SSP can be solved in O.n 2 log n C p/ time and O.n 2 C p/ space, where p is the number of solutions. Proof. We compute the sums of all subsequences of B in O.n 2 / time and store them into an array of O.n 2 / space. Then, the sums are sorted in O.n 2 log n/ time. For every subsequence a i ; : : : ; a j of A, we can nd all the subsequences of B that satisfy P l tdk b t D M P j sdi a s in O.log n C q/ time using binary search, where q is the number of solutions. If p is the total number of solutions, nding all of them takes O.n 2 log n C p/ time and O.n 2 C p/ space.

5 IDENTIFYING PROTEIN CROSS-LINKS 575 Algorithm Find-AB.A; B; M/ 1. S A D f.a i ; ; a j / j. P j sdi a s < M PjC1 sdi a s; j < n/ or. P n sdi a s < M/g; 2. S B D f.b k /; k D 1; ; ng; 3. While (S A 6D ;) 4. Find.a i ; ; a j / 2 S A where v D P j sdi a s is the maximum sum; 5. For every.b k ; ; b l / 2 S B and P l tdk b t < M v 6. Delete.b k ; ; b l / from S B ; 7. If.l < n/, then add.b k ; ; b lc1 / into S B ; 8. For every.b k ; ; b l / 2 S B and P l tdk b t D M v 9. Output i; j; k; l; 10. Delete.a i ; ; a j / from S A ; 11. If i < j, then add.a i ; ; a j 1 / into S A. For every a i, Step 1 nds the longest subsequence.a i ; ; a j / that satis es P j sdi a s < M for every i. Any subsequence of A to be in a solution must have the sum less than M and be a pre x subsequence of some element in S A. Then, Step 2 nds the shortest subsequences of B. Any subsequence of B to be in a solution must be a pre x supersequence of some element in S B. Steps 3 11 check every subsequence of A with sum less than M in decreasing order and search its corresponding solutions in B. Step 4 nds.a i ; : : : ; a j / 2 S A with the maximum sum v. Steps 5 7 delete every element.b k ; : : : ; b l / 2 S B such that P l tdk b t < M v and replace it by its immediate pre x supersequence.b k ; : : : ; b lc1 /. Steps 8 9 nd every element.b k ; : : : ; b l / 2 S B such that P l tdk b t D M v in S B and report the solutions.a i ; : : : ; a j / and.b k ; : : : ; b l /. After all solutions for.a i ; : : : ; a j / are found, Steps replace them by the immediate pre x subsequence.a i ; : : : ; a j 1 / in S A. Theorem 4. Algorithm Find-AB solves SSP in O.n 2 logn C p log n/ time and O.n C p/ space, where p is the number of solutions. Proof. Obviously, every output of Algorithm Find-AB is correct and unique. Assume that.a i ; ; a j / and.b k ; ; b l / satisfy P j sdi a s C P l tdk b t D M. We show that the algorithm nds this solution. In Step 1, let.a i ; ; a j 0/ 2 S A, which is the longest subsequence of A that starts with a i and has a sum less than M. Because P j sdi a s < M, it is true that j j 0. The algorithm will check.a i ; ; a j 0/ (Step 4) and all its pre x subsequences including.a i ; ; a j / (Steps 10 11) in length-decreasing order. Similarly, every subsequence of B starting with b k, including.b k ; ; b l /, is checked in Steps 5, 6, and 7 in length-increasing order. Let Step 4 generate a series of v-values: v 1 ; v 2 ;. This series is in decreasing order, because Step 4 always nds an element in S A with the maximum sum and Steps 8 9 replace it by a subsequence with a smaller sum. Step 5 deletes subsequences of B that have a sum less than M v 1, M v 2, at each iteration, respectively, and Steps 6 7 always replace them with subsequences of larger sums. Let w D P j sdi a s. When the algorithm is deleting.a i ; ; a j / in Step 10, all the subsequences of B with a sum less than M w would have been deleted. Thus,.b k ; ; b l / is present in S B because P l tdk b t D M w. The algorithm should nd it and report the solution.a i ; ; a j / and.b k ; ; b l /. Both S A and S B maintain O.n/ elements throughout the algorithm. We store them in a data structure called a Red-Black Tree (Cormen et al., 1990). This is a binary tree that allows retrieval, addition, and deletion in O.log n/ time while keeping the tree height O.log n/ through ef cient tree rotations. Thus, the algorithm requires only O.n C p/ space. In Steps 3 11, every subsequence of A is checked and deleted at most once. Every subsequence of B is checked and deleted at most once in Steps 5 9 if it is not in any solution. If a subsequence is in r 1 solutions, it will be checked exactly r C 1 times. It takes O.log n/ time for every retrieval in Steps 4, 5, and 8, every deletion in Steps 6 and 10, and every addition in Steps 7 and 11. The total running time is O.n 2 log n C p log n/.

6 576 CHEN ET AL. Algorithm Find-AB requires basically only O.n/ space if the solutions are not stored. Similar algorithms can be applied to solve the SSP problem with k sequences: Theorem 5. Given a number M and k positive n-number sequences A t D.a t1 ; ; a tn /, t D 1; ; k, nding all possible pairs of subsequences of A 1, A 2,, or A k such that their mass sum equals M takes O.kn 2 log.kn/ C p log.kn// time and O.kn C p/ space, where p is the number of solutions. The total number of solutions is at most O.k 2 n 3 /. In a protein database of k n size where k À n, Theorem 5 implies that the solutions can be found in the scale of O.k log k/ time. Lemma 3 can also be applied to solve the SSP problem with k sequences, and it requires O.kn 2 log.kn/ C p/ time, similar to that in Theorem 5, but uses O.kn 2 / space. The O.kn/-space algorithm has an advantage in applications where kn is huge, because an O.kn 2 /-space algorithm could easily take up the whole virtual memory Cross-linking structures 3. IDENTIFYING CROSS-LINKED AMINO ACIDS Algorithms in Section 2 produce a list of peptide pairs whose mass sum equals the parent ion mass M. Let a pair of peptides P D.p 1 ; ; p m / and Q D.q 1 ; ; q n / form an H-structure cross-link shown in Fig. 1(a). In Figure 1(a), amino acids p i and q j are connected by a linker introduced by a cross-linking agent. A typical agent, disuccinimidyl glutarate (DSG), cross-links two lysines to form a Lys linker Lys structure. If each peptide sequence has r (r m and r n) such amino acids, there are r 2 possible cross-links. In tandem mass spectrometry, CID fragments each cross-link molecule into one of the four patterns shown in Fig. 2. In Fig. 2(a), the peptide bond p k p kc1, located on the N-terminal side (left) of the cross-linked amino acid (p i ), is broken, and the molecule is fragmented into an N-terminal ion, p 1 p k C, and a C-terminal ion, the H-structure ion Q. C p kc1 p m /. In Fig. 2(b), the peptide bond is located on the C-terminal side (right) of p i, and the molecule is fragmented into an N-terminal ion Q.p 1 p k C / and a C-terminal ion p C kc1 p m. Similarly, if a peptide bond of Q is broken, the molecule has two fragmentation patterns shown in Fig. 2(c) and 2(d). Depending on the cross-linking agent used in the experiment, a linker may contain a peptide bond. If the linker is broken, the H-structure molecule is cleaved into two peptide ions, each of which has a decoration similar to the molecule shown in Fig. 1(c). For simplicity, assume m D n in the following description. Theoretically, there are O.m/ possible ions for each cross-linked P and Q. A tandem mass spectrum is a collection of mass peaks, ideally, each of which corresponds to some ion in Fig Algorithms for identifying cross-linked amino acids De nition 2. Cross-Linked Amino Acids Problem (CLAA). Given an h-mass peak tandem mass spectrum T D.t 1 ; ; t h / and two cross-linked m-amino acid peptides P D.p 1 ; ; p m / and Q D.q 1 ; ; q m /, each having r (r m) amino acids that can be cross-linked, nd the cross-linked amino acids that maximize the value of a given scoring function F.P; Q; T /. There are r 2 cross-linked amino acid pairs corresponding to r 2 possible cross-link structures. For every such structure, we can generate a hypothetical tandem mass spectrum, S D.s 1 ; : : : ; s 4.m 1/ /, from the masses of all the possible fragmented ions in Fig. 2. A scoring function F compares S with T and gives a score based on how well they are correlated. Generally, F considers a match between two mass peaks t i of T and s j of S, if jt i s j j ", where " is the maximum measurement error allowed in an experiment. The higher the number of matches is, the higher the F-score is. For simplicity, consider F counting the number of matches. For every mass peak of S, nding its match in T takes O.log h/ time by binary search. The total number of matches can be determined in O.m log h/ time. Thus, nding the most likely cross-linked amino acid pair takes a total of O.r 2 m log h/ time. However, there is better way to do it.

7 IDENTIFYING PROTEIN CROSS-LINKS 577 FIG. 2. Four fragmentation patterns for a cross-linked molecule in tandem mass spectrometry. Theorem 6. The CLAA problem can be solved in O.m C h/ time and O.m C h/ space if the scoring function F counts the number of matches. Proof. Figures 2(a) (d) shows 4.m 1/ ions for one cross-link, and thus r 2 cross-links have a total of 4.m 1/r 2 ions. However, there are only 8.m 1/ different masses for these ions. Figure 3 shows two types of N- and C-terminal ions after breaking the peptide bond p i p ic1. In Fig. 3(a), the cross-linked amino acid of P locates after p i, and in Fig. 3(b), the cross-linked amino acid of P locates before p i. The C-terminal ion in Fig. 3(a) has a xed mass no matter where the cross-link may be located, as long as it is after p i. So does the N-terminal ion in Figure 3(b). Since P and Q together have 2.m 1/ peptide bonds, there are only 8.m 1/ different masses. Let M. ; T / D 1 indicate the match of a mass with a spectrum T, and M. ; T / D 0 otherwise. In the following steps, we can nd the best cross-link: 1. Calculate ion masses of P :.L N p 1 ; L C p 1 /,,.L N p m 1 ; L C p m 1 / for the N- and C-terminal ions in Fig. 3(a) corresponding to the breaking of peptide bonds.p 1 p 2 /,,.p m 1 p m /, respectively, and similarly,.r N p 1 ; R C p 1 /,,.R N p m 1 ; R C p m 1 / for the N- and C-terminal ions in Figure 3(b). 2. Look up these masses in T for matches: M.L N p i ; T /, M.L C p i ; T /, M.R N p i ; T /, and M.R C p i ; T /, for i D 1; ; m 1.

8 578 CHEN ET AL. FIG. 3. Two types of N- and C-terminal ions after breaking the p i p ic1 peptide bond. 3. For every possible cross-linked amino acid p i, calculate Xi 1 score i D.M.L N p j ; T / C M.L C p j ; T // j D1 C mx.m.rp N j ; T / C M.Rp C j ; T //: jdi Find the maximum score, score a, which indicates the most likely cross-linked amino acid p a. 4. Repeat Steps 1, 2, and 3 and nd the most likely cross-linked amino acid q b for Q. 5. Report the cross-linked amino acids p a and q b. In Step 1, it takes O.m/ time to calculate.l N p 1 ; L C p 1 / and from.l N p 1 ; L C p 1 / to.l N p 2 ; L C p 2 / takes only O.1/ time for adding and subtracting p 2. Thus, Step 1 takes O.m/ time. The matching in Step 2 takes O.m Ch/ time by merging four sorted lists, fl N p i g, fl C p i g, fr N p i g, and fr C p i g, into one sorted list and matching this list with the sorted list of T. Similarly, it takes O.m/ time to calculate score 1 and only additional O.1/ time for score 2, score 3, and so on. Thus, Step 3 takes O.m/ time. Step 4 has similar time complexity. Therefore, nding the cross-linked amino acids takes a total of O.m C h/ time. Although Theorem 6 is for the special F-function that counts the number of matches, this theorem can be generalized for many other functions, such as the correlation function, the convolution function, and so on Speed up searching 4. IMPLEMENTATION AND PROBABILITY Lemma 2 indicates that there may be as many as O.n 3 / different solutions for the Subsequence Sum Problem (SSP), given a pair of protein sequences. For a database of k protein sequences, Theorem 5 tells us that the number could be as large as O.k 2 n 3 /. In fact, it is possible that there are too many solutions for SSP, especially when the database is very large and the number of subsequence pairs grows quadratically to the database size k. On the other hand, if the mass/charge error is large, many peptide pairs could fall into this error range. In both cases, checking all cross-linked peptides using the algorithm in Theorem 6 is very time consuming. To speed up the SSP search, we propose to examine a peptide and determine

9 IDENTIFYING PROTEIN CROSS-LINKS 579 whether it could belong to a solution by comparing its hypothetical cross-linked tandem mass spectrum with the experimental spectrum. Let M be the parent ion mass, T be an experimental spectrum, and F be the scoring function that counts the number of matches between a peptide P and a spectrum T. Then F.P; T / can be computed by Steps 1, 2, and 3 in the algorithm in Theorem 6. Let ± be the minimum matching score (of F.P; T /) for accepting P as one of the peptides of the cross-link. We can speed up the database search by using the following steps: 1. For every peptide subsequence P in the database, compute F.P; T / and save P to a candidate list if F.P ; T / ±. 2. From the candidate list, nd every pair of P and Q satisfying mass.p / C mass.q/ C mass.linker/ D M, and let F.P; Q; T / DF.P ; T /CF.Q; T /. 3. Report the peptide pair with the maximum score. This process could eliminate a large number of peptide pairs giving low matching scores Matching probability Probabilistic models (Qin et al., 1997; Dancik et al., 2000; Bafna and Edwards, 2001) have been proposed for scoring tandem mass spectra against a peptide database. Environmental parameters such as the probability of losing a peak and the probability of a random noise have to be determined for each experiment in order to derive a good scoring function. In our case, we have already known that the experimental spectra came from some peptide in the human hemoglobin protein. In this study, we use a simple scoring function. Let T be an experimental spectrum of t mass peaks, and H be a hypothetical spectrum of h ions. Assume the abundance levels are ignored; otherwise we can set an abundance level threshold to lter out noise. Let the matching range be [ ±; ±] and let the machine detect ions up to r m/z ratio. In our ESI-MS-MS experiments, ± is at most 0.2, h is generally between 20 and 50 ions, r is about 2,000 m/z, and depending on the experiment, t ranges from tens to hundreds of mass peaks. Assume that matching ranges between peaks in H do not overlap. Then the expected (mean) number of matches ¹ between T and H is ¹ D t 2±h r where 2±h=r is the probability that a random mass peak hits an ion of H. If matching ranges overlap, we can replace 2±h by the sum of matching ranges. The number of matches between T and H has a binomial distribution. Since 2±h=r is small (< 0:1), the binomial distribution can be can be approximated by the Poisson distribution ¹ ¹x P.x/ D e x! and the probability of at least m matches can be computed by m 1 X P.x m/ D 1 P.x D i/: id Cross-linking experiments 5. EXPERIMENTS AND DATA ANALYSIS Chemical modi cations and cross-links were introduced into the human blood protein hemoglobin (Hb; Sigma, St. Louis, MO) by reaction with disuccinimidyl glutarate (DSG; Pierce, Rockford, IL). Hemoglobin is a four-subunit protein consisting of two subunits each of two unique polypeptide chains, and. The

Brie y, 50 mm Hb was reacted with 5 mm DSG for one hour at room temperature in the presence of 90 mm sodium phosphate ph 7.5 and 10% DMSO. The reaction was stopped by addition of an equal volume of 1.

10 580 CHEN ET AL. approximate molecular weight of each chain is 15,000 daltons. DSG is a homobifunctional cross-linker which creates a link between two separate lysine residues by inserting a 5-carbon chain between them. Brie y, 50 mm Hb was reacted with 5 mm DSG for one hour at room temperature in the presence of 90 mm sodium phosphate ph 7.5 and 10% DMSO. The reaction was stopped by addition of an equal volume of 1.0 M Tris ph 7.5. An aliquot of this reaction was mixed with 2 volumes of 6M urea and TPCK-treated trypsin (Sigma, St. Louis, MO) was added at a ratio of 1:125 w/w trypsin to Hb. This mixture was allowed to react at 37 ± C overnight. One hundred ¹l of the digestion mixture was subject to HPLC-MS-MS analysis. The sample was loaded onto a YMC ODS-AQ S3¹ 120 Å mm column (Waters, Milford, MA) and the peptides were separated by a gradient of 5 80% acetonitrile (0.03% tri uoroacetic acid and 0.10% acetic acid as ion pairing reagents) at a ow rate of 0.1 ml/min. The eluent was directed to a Finnigan LCQ (Thermoquest, San Jose, CA) ion trap mass spectrometer tted with an ESI source and operated in positive ion mode. Signi cantly ionized peptides were automatically measured for m/z and subjected to CID. Data were automatically collected Data analysis The data were analyzed visually and with several in-house algorithms, as well as the commercial package SEQUEST (Eng et al., 1994). Every tandem mass spectrum is preprocessed by the following steps: 1) close mass peaks that are candidates of isotopic ions are merged and replaced by a new mass peak, and the abundance levels are replaced by their sum; 2) mass peaks with low relative abundance levels are cut off. Then, we implemented the algorithm in Lemma 3 and computed the score between a pair of sequences S and a tandem mass spectrum T using the convolution function. The probability of each top score solution was calculated from the Poisson distribution. Several types of modi cation events were recognized: interpeptide cross-links (Fig. 1(a)), decoration (Fig. 1(b)), and internal cross-links (Fig. 1(c)). Of speci c interest to this work, a cross-link between lysine 82 of both subunits of Hb was detected, as shown in Fig. 4. Structurally, this would correspond to bridging the central channel of the Hb protein. The interresidue distance between the two lysines was FIG. 4. 3D structure of two identical peptides (67-VLGAFSDGLAHLDNLK-82) from two subunits cross-linked in a hemoglobin protein, and the cross-linked lysines (Ks).

11 IDENTIFYING PROTEIN CROSS-LINKS 581 FIG. 5. The tandem mass spectrum of two identical peptides (67-VLGAFSDGLAHLDNLK-82) from two subunits cross-linked in a hemoglobin protein where four b-ions and three y-ions containing the cross-link were identi ed. measured to be 8.18 Å (derived from the PDB le 1A3N), which is comparable with the 7.7 Å spacer length of DSG (Tame and Vallone, 2000). This cross-link was found in two tandem mass spectra, shown in Figs. 5 and 6, and both parent ions have zd3 charges. The masses are equivalent to two molecules of the tryptic peptide 67-VLGAFSDGLAHLDNLK-82 of Hb after being carbamylated (an N-terminal modi cation of 43 daltons due to urea) and cross-linked by the glutarate moiety of the DSG (96 daltons). Fragmented daughter ions from this peptide were observed in the spectrum shown in Fig. 5: four b-ions (ZD1), b7- H 2 O, b9 H 2 O, b11 H 2 O, and b14ch 2 O and three cross-linked y-ions (Z=2), L y2, L y3, and L y15. Using the error range of [ 0:2; 0:2], we compute the probability from the Poisson distribution: P.x 7/ D 2:42E 05. Figure 6 shows four b-ions, b7 H 2 O, b9 H 2 O, b11 H 2 O, and b14 H 2 O and two cross-linked y-ions (Z=2), L y3 and L y15. The probability computed from the Poisson distribution is P.x 6/ D 6:47E DISCUSSION The interchain cross-link is con rmed in two tandem mass spectra both with < 10 4 matching probabilities. The length of the cross-linker (7.7 Å ) is very close to the actual distance (8.18 Å ) obtained from the molecular structure in PDB. Our experiment strongly suggests that protein structure information can be derived via protein cross-linking and tandem mass spectrometry. It should be pointed out that the algorithms suggested here do not consider the case of amino acid modi cations. However, if the modi cation such as phosphorylation is speci ed, the algorithms still work. The underlying chemistry in MS-MS is still not completely understood. There are many mass peaks shown in both Figs. 5 and 6 which have not been interpreted. Some of them may be internal ions of the cross-link, and to verify this, further experiments on MS3 are probably necessary. There are many other complications

12 582 CHEN ET AL. FIG. 6. The tandem mass spectrum of two identical peptides (67-VLGAFSDGLAHLDNLK-82) from two subunits cross-linked in a Hemoglobin protein where four b-ions and two y-ions containing the cross-link were identi ed. in the experimental data because of noise, multiply charged ions, mass measurement errors, calibration errors, internal ions from double fragmentation, other types of ions such as a- and z-ions, and so on. All these can affect the interpretation and some of them are machine-dependent while others are experimentdependent. Also, a good scoring function is critical to judge what is the best interpretation for a tandem mass spectrum. Agreement on the best scoring function and how to use the abundance of information remain unresolved. REFERENCES Bafna, V., and Edwards, N A probablistic model for scoring tandem mass spectra against a peptide database. To appear in Bioinformatics. Chen, T., Kao, M., Tepel, M., Rush, J., and Church, G.M A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comp. Biol. 8(3), Cormen, T.H., Leiserson, C.E., and Rivest, R.L Introduction to Algorithms, MIT Press, Boston. Dancik, V., Addona, T.A., Clauser, K.R., Vath, J.E., and Pevzner, P.A De novo peptide sequencing via tandem mass spectrometry: A graph-theoretical approach. J. Comp. Biol. 6, Eng, J.K., McCormack, A.L., and Yates, J.R An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrometry 5, Kellogg, C.B., Kelley, J.J., Stein, C., and Donald, B.R Reducing mass degeneracy in SAR by MS by stable isotopic labeling. J. Comp. Biol. 8(1), Pevzner, P.A., Dancik, V., and Tang, C.L Mutation-tolerantprotein identi cation by mass spectrometry. J. Comp. Biol. 7(6), Qin, J., Fenyo, D., Zhao, Y., Hall, W.W., Chao, D.M., Wilson, C.J., Young, R.A., and Chait, B.T A strategy for rapid, high-con dence protein identi cation. Analytical Chemistry 69,

13 IDENTIFYING PROTEIN CROSS-LINKS 583 Tame, J.R., and Vallone, B The structures of deoxy human haemoglobin and the mutant Hb Tyralpha42His at 120 K. Acta Crystallogr. D. Biol. Crystallogr. 56(7), Taylor, J.A., and Johnson, R.S Sequence database searches via de novo peptide sequencing by tandem mass spectral data. Biomed. Environ. Mass Spectrum. 11, Young, M.M., Tang, N., Hempel, J.C., Oshiro, C.M., Taylor, E.W., Kuntz, I.D., Gibson, B.W., and Dollinger, G High throughput protein fold identi cation by using experimental constraints derived from intramolecular cross-links and mass spectrometry. Proc. Natl. Acad. Sci. USA 97(11), Address correspondence to: Ting Chen Department of Molecular Biology University of Southern California Los Angeles, CA

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry Ting Chen Department of Genetics arvard Medical School Boston, MA 02115, USA Ming-Yang Kao Department of Computer