A Statistical Model of Proteolytic Digestion

Size: px

Start display at page:

Download "A Statistical Model of Proteolytic Digestion"

Hugh Flynn
5 years ago
Views:

1 A Statistical Model of Proteolytic Digestion I-Jeng Wang, Christopher P. Diehl Research and Technology Development Center Johns Hopkins University Applied Physics Laboratory Laurel, MD Fernando J. Pineda Dept. of Molecular Microbiology and Immunology Johns Hopkins Bloomberg School of Public Health Baltimore, MD Abstract We present a stochastic model of proteolytic digestion of a proteome, assuming the distribution of parent protein lengths in the proteome, the relative abundances of the 20 amino acids in the proteome, and the digestion rules of the enzyme used in the digestion. We derived a closed form expression for the fragment mass distribution for a large class of enzymes including the widely used Trypsin. The expression uses the distribution of lengths in a mixture of proteins taken from a proteome, as well as the relative abundances of the 20 amino acids in the proteome. The agreement between theory and the in silico digest is excellent. I. INTRODUCTION We present a rigorous statistical model of proteolytic digestion of a proteome, assuming (1) the distribution of parent protein lengths in the proteome, (2) the relative abundances of the 20 amino acids in the proteome, and (3) the digestion rules of the enzyme used in the digestion. Such a statistical model can serve as a model of the null hypothesis to calculate p-values in hypothesis testing for microorganism identification by proteolytic digestion. Current protein identification software (e.g., MOWSE [1, MassSearch [2, and Mascot [3) is based on histograms of digested proteomes, or on models derived from unrealistic assumptions (e.g., a uniform distribution of proteolytic fragments). The proposed model accounts for the fact that digestion products come from the mixture of proteins that constitutes a microorganism s proteome. To facilitate our discussion, we organize digestion models into a taxonomy according to (1) the class of digestion rules obeyed by the enzyme and (2) the order of the Markov model used to model the amino acid sequence. We denote by C(m) the class of enzymes whose digestion rules depend on m adjacent residues. Table I shows a selection of enzymes and their corresponding cleavage rules. In this paper, we present the result assuming that the cleavage rules are deterministic. We also introduce a technique to extend the result to address simple probabilistic digestion rules that account for possible (probabilistic) partial digestions. We classify amino acid sequence models according to the order of the Markov chain used to model the sequence. For example, M(0) denotes the simplest sequence models assuming that the adjacent amino acid residues are independent; M(1) denotes the next most complex models assuming that the next residue depends on the previous amino acid residue; and so forth. Our methodology views the enzymatic digestion problem as a regenerative process to which we can apply Wald s Digesting Reagent Class C(1) Cleavage Rules Armillaria Before {C K } AspN Before D Thermolysin Before {F I L M V W} Clostripain After R LysC After K Pancreatic Elastase After {A G S V} Digesting Reagent Class C(2) Cleavage Rules Chymotrypsin After {F L M W Y} if not followed by P Post Proline After P if not followed by P Trypsin After {K R} if not followed by P V8 Ammonium Acetate After E if not followed by P V8 Phosphate Buffer After {D E} if not followed by P CNBr Cys { Before C After M } Hydroxylamine After N if followed by G Mild Acid Hydrolysis After D if followed by P TABLE I CLEAVAGE RULES FOR SELECTED REAGENTS equation and its generalizations to simplify the computation of the fragment mass distribution. In particular, when overlaid with a cleavage process (either C(1) or C(2)), a protein sequence with models M(0) can be equivalently modeled as a regenerative process with the cleavage sites as the regeneration points. Hence the regenerative cycles (the fragments) are i.i.d.. Using Wald s first lemma and its extensions [4, [5, the fragment mass distribution can be conveniently partitioned into two terms: the length distribution and the conditional mass probability density. In this paper, we present results for {C(1), M(0)} and a subset of {C(2), M(0)} with the class of enzymes C P (2) that exhibit Proline blocking. The class C P (2) includes the widely used enzyme, Trypsin, which cleaves after Lysine or Arginine unless followed by Proline. The same technique can be applied to derive expressions for general models in {C(2), M(0)}. The agreement between theory and the in silico digest is excellent as illustrated by the examples given in Section 4. Note that a summary of part of the results given here was presented in [6 as a poster. II. MODELS AND DECOMPOSITION Our objective is to analytically compute the expected number of proteolytic fragments with mass in an arbitrary mass interval (m m, m + m), denoted by H(m, m). We assume that H(m, m) depends only on the parent protein length distribution in the database of interest, a probabilistic

2 protein sequence model derived from the database, and the chosen enzyme digestion rule. Since the parent proteins are finite in length, application of digestions rules to a finite parent protein can lead to two distinct classes of fragments: fragments ending at a cleavage site and fragments not ending at a cleavage site. Here the cleavage site is referred to a specific amino acid that defines a cleavage following a enzyme digestion rule. For example, amino acids C and K are both cleavage sites for fragments from digesting enzyme Armillaria (see Table I). In this paper, we only focus on the derivation of H(m, m) contributed by fragments ending at a cleavage site. The contribution by the other class of fragments is less significant and can be more easily derived since there can be at most one of such fragment from every single parent protein. A. Sequence and fragment models Throughout the paper, we assume that the protein sequence is i.i.d.. That is, we only consider class M(0) models. We use S = X 1 X 2 X 3 to denote an infinite random protein sequence, where the X i s are random variables taking value from a finite alphabet A (the set of 20 amino acids). We use p a, to denote P {X n = a}, and p C, C A, to denote P {X n C}. We assume that p a s are estimated from the relative amino acid abundance in the database of interest. We consider C(1) and C P (2) enzyme classes. For C(1) enzymes, cleavage occurs at sites adjacent to amino acids in the set C. For C P (2) enzymes, cleavage occurs after any amino acid in C if the following amino acid is not Proline. To facilitate our discussion, we will use {F n } to denote a generic random fragment sequence resulting from the application of a cleavage rule to a random protein sequence S. Here, we provide an informal discussion on why {F n } is i.i.d. except for the first and possibly the last fragments (a rigorous proof can be easily constructed following the discussion ). We first address digestion rules in C(1). Given that amino acids in M(0) protein models are i.i.d., fragment mass distributions do not depend on which end of the protein we choose as the N-terminus. Therefore, without loss of generality, we can assume that for class C(1) enzymes, cleavage sites always occur after (instead of before as indicated in Table 1 for some C(1) enzymes) an amino acid in C. For models in {C(1), M(0)}, it is clear that the cleavage event {X n C} is a regenerative point for the protein sequence S. Hence the process between cleavage (that is, the fragments {F n }) is i.i.d.. For models in {C(2), M(0)}, a similar conclusion can be drawn by constructing an equivalent two-step model for the protein sequence. Figure 1 illustrates such a model for digestion with Trypsin for which a cleavage occurs after K or R unless followed by P. In step one, we first determine whether the amino acid is P or not (occurs with probability p P ). In step two, we determine the specific amino acid conditioned on the outcome of the step one. It is obvious that the protein sequence generated by this two-step model is statistically equivalent to a sequence S generated by a model in M(0). With this equivalent protein model, the digestion rule (with Proline blocking) can be applied right after the step one to determine cleavage Fig. 1. A two-step process to generating an i.i.d protein sequence and fragmentation by Trypsin. sites. It can be seen that the cleavage sites are regenerative points and an i.i.d. fragment process {F n } can be defined from the regenerative cycles. Note that this leads to a very minor approximation since the first fragment F 1 has a different distribution at the first amino acid (the last fragment could have a different distribution as well if it ends exactly at the last amino acid in the parent protein). Given that {F n } is i.i.d. (or approximately i.i.d. as for the C P (2) models), we will use F to denote a generic random fragment in the sequel. B. Decomposition of mass distribution Based on the i.i.d. property of fragments established in the last section, we can establish a decomposition that enable us to treat the fragment length distribution and mass probability density separately in the derivation of H(m, m). This is accomplished by applying Wald s first lemma and its extensions [4, [5. Here we first state the decomposition: where H(m, m) N = g(k)p { M(F ) m < m L(F )=k}, k=1 g(k) = N N=k n(n)l N,k, and n(n) is the number of proteins of length N in the database of interest; L N,k is the expected number of fragments of length k from a parent protein sequence of length N; M(F ) denotes the mass of F ; and L(F ) denotes the length of F. This decomposition described by (1) significantly simplifies the computation of H(m, m) in that it separates the calculation of fragment length distribution (g(k)), from the mass distribution of a random fragment (P { M(F ) m < m L(F )=k}). To prove the decomposition (1), we first observe that H(m, m) can be derived by breaking it into terms based on fragment and parent protein lengths. Define H N,k (m, m) as the expected number of fragments in the mass range that come from parent proteins with length N and has length k. Then (1)

3 we can write H(m, m) as H(m, m) = N N=1 n(n) N H N,k (m, m). (2) k=1 Hence we can focus our discussion on the decomposition of H N,k (m, m). We first provide an expression for L N,k (expected number of fragments with length k from protein with length N) by applying the Wald s equation. Define a stopping time associated with the fragment process {F n } by τ n min{k : k L(F i ) n}. Using the indicator function we can write L N,k as [ τn k+1 L N,k = E I{L(F i )=k}, (3) where I{} denote the indicator function. Note that the proper choice of stopping time (τ N k+1 ) is important to avoid overor under-counting of fragments. Since {F n } is i.i.d., we can apply the Wald s equation to (3) and obtain L N,k = E(τ N k+1 )P {L(F )=k}. (4) Following the same argument, we can also obtain H N,k (m, m) = E(τ N k+1 )P { M(F ) m < m, L(F )=k} = E(τ N k+1 )P {L(F )=k} P { M(F ) m < m L(F )=k}. By substituting (4) into the above equation we obtain the decomposition for H N,k (m, m): H N,k (m, m) =L N,k P { M(F ) m < m L(F )=k}. (5) The desired decomposition (1) can be easily derived by substituting (5) into (2). III. FRAGMENT MASS DISTRIBUTIONS Based on the decomposition (1) we can obtain the closed form expression for the fragment mass distribution H(m), for models in both {C(1), M(0)} and {C P (2), M(0)}, by deriving expressions for L N,k and P { M(F ) m < m L(F )=k}. A. Digestion model {C(1), M(0)} Given a random protein sequence of length N, S N = X 1 X 2 X N, L N,k for models in {C(1), M(0)} can be derived as [ N k+1 L N,k = E I {X i...x i+k 1 is a fragment } = P {X 1,...,X k 1 C, X k C}+ N k+1 i=2 P {X i 1 C,X i,...,x i+k 2 C,X i+k 1 C}. By deriving the needed probabilities based on the M(0) sequence model, we can obtain L N,k =[(N k)p C +1(1 p C ) k 1 p C. (6) To derive the conditional mass probability P { M(F ) m < m L(F ) = k}, we assume that the mass of a fragment depends only on its amino acid composition. We represent the composition of a fragment F by a 20-dimensional vector of nonnegative integers, denoted by C(F )=[n 1,...,n 20, where n i N {0} is the number of occurrences of the i-th amino acid (assuming that a fixed ordering is defined on A) inf. We will use n =[n 1,...,n 20 to denote a generic composition vector. We assume that the mass of a composition (and hence the mass of the fragment with the particular composition) is determined by 20 M(n) =m H2O + m i n i, where m i is the mass of the i-th amino acid. Note that we can also generalize this model to incorporate uncertainty in amino acid masses modeled by appropriate distributions. Following the above assumptions, we can write the desired conditional mass distribution as I{ M(n) m < m}p {C(F )=n L(F)=k}, n =k i C n (7) where n i n i. Note that the indicator function in (7) is deterministic and should be replaced by a probability if a probabilistic model is used to define the mass of a composition. Based on (7) the closed-form expression for the conditional mass probability can be derived as n =k i C n I{ M(n) m < m}(k 1)! where ˆp i is defined by ˆp i = p i z i = z i { p C if i C, 1 p C otherwise. ( 20 ) ˆp ni i, (8) n i! And again the indicator function can be replaced by a probability to incorporate uncertainty in amino acid masses. B. Digestion model {C P (2), M(0)} To derive L N,k, we again use indicator functions to obtain [ N k+1 L N,k = E I {X i...x i+k 1 is a fragment } = N k+1 P {X i...x i+k 1 is a fragment}. However, the derivation of the probabilities are not as straightforward due to the blockings of cleavage by proline. Let us

4 denote the set of amino acid pairs that represent unblocked cleavage sites by C NB. That is C NB {XX : X C,X P}. Then we can derive the needed probabilities using Markovian arguments. Since the protein sequence is i.i.d., we have P {X 1...X k is a fragment} = P {X 1 X 2 C NB } k 1 j=2 P {X k X k+1 C NB X k 1 X k C NB } ; for i =2,...,N k, wehave P {X i...x i+k 1 is a fragment} = P {X i 1 X i C NB } P {X i X i+1 C NB X i 1 X i C NB } and i+k 2 j=i+1 P {X i+k 1 X i+k C NB X i+k 2 X i+k 1 C NB } ; P {X N k+1...x N is a fragment} = P {X N k X N k+1 C NB } P {X N k+1 X N k C NB X N k X N k+1 C NB } N 1 j=n k+2 P {X N C X N 1 X N C NB }. Deriving the needed conditional probabilities and substituting them into the above equations we can obtain [ ( L N,k = α 1+ N k + p ) P p C (1 p 1 p C) k 1 p C, P (9) where p P p C = p C [1 1 p C (1 p P ) [ α = (1 p C {1 ) 1 p C, 1 p C 1 p C (1 p P ) }. Note that p C <p C as long as p P > 0. Furthermore, when p P is small, (9) can be approximated quite accurately by L N,k [1 + (N k) p C(1 p C) k 1 p C, which is identical to the length distribution for C(1) enzymes given in (6) but with scaled cleavage probability p C. To compute the conditional mass probability, we first decompose the probability based on the composition of the fragment composition as in (7): I{ M(n) m < m}p {C(F )=n L(F)=k}. n =k 1 i C ni<np (10) (theory- in_silico)/sqrt(theory) fragments per bin (1Da bins) mass (Da) 800 in silico theory Fig. 2. Comparison between analytical expression and in silico digestion for E. Coli proteome over a 100 Da window. Note that the number of proline occurrences, n P, needs to be larger than or equal to ( i C n i) 1 in order to block the cleavages in the interior of a fragment. The conditional composition probability P {C(F ) = n L(F ) = k} can be derived by treating each case with a specific number of blocked cleavages separately. C. Accounting for partial digestions The presented results for deterministic digestion can be easily extended to account for partial digestions modelled with a simple probabilistic digestion model. Consider the case for C(1) models under which cleavages occurs after every amino acid in C. Assume now that each cleavage after an amino acid a C occurs only with probability q a (q a < 1 implies partial digestions). The presented results can be applied by expanding the set A to include one additional fictitious amino acid a + for each a C.Fora C, the probability is re-defined as q a p a. For the additional amino acid a +, the probability is (1 q a )p a. The mass of the additional amino acid a + is the same as the mass of its associated original amino acid a. With this expanded alphabet and modified protein sequence model, the presented result can be applied directly as if the digestions are deterministic (with the same C). Note that the same technique applies to models in C P (2) as well. IV. NUMERICAL RESULTS A comparison of analytic and in-silico results is shown in Figure 2. We use an E. coli proteome taken from the SWISSPROT database. The proteins were shuffled to destroy sequential correlations. The analytic calculation is faster than 1000

5 the corresponding in silico digestion. Except for unexplained excess dispersion at 450 Da, the residuals between the in-silco and analytic results are consistent with Poisson statistics. It is clear from this figure that the mass distribution can vary by several orders of magnitude over just few Daltons. This result leads us to believe that heuristically derived mass distributions are likely to result in biased p-values in hypothesis tests. This is especially true for small peptides. It becomes a minor effect in heavier peptides. It is typically recommended that protein identification be performed using peptides heavier than 500 Da due to lack of selectivity of the lighter peptides. Our results suggest that, a more careful treatment of the mass distribution below 500Da, could make this mass range more informative. ACKNOWLEDGMENT This work was supported by the JHU/APL Independent Research and Development Program. REFERENCES [1 D. J. C. Pappin, P. Hojrup, and A. J. Bleasby, Rapid Identification of Proteins by Peptide-mass Fingerprinting, Current Biology, vol. 3, no. 6, 1993, pp [2 P. James, M. Quadroni, E. Carafoli, and G. Gonnet, Protein Identification by Mass Profile Fingerprinting, Biochemical and Biophysical Research Communications, 195(1), August 1993, pp [3 D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell, Probability-based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data, Electrophoresis, 20, 1999, pp [4 A. Wald, Sequential Tests of Statistical Hypotheses, Annals of Mathematical Statistics, 16, 1945, pp [5 G.V. Moustakides, Extensions of Wald s first lemma to Markov processes, Journal of Applied Probability, 36(1), 1999, pp [6 I-J. Wang, C. P. Diehl, and F. J. Pineda, A statistical model of proteolytic digestion, conference poster, IEEE Computer Society Bioinformatics Conference, Stanford, CA, August 2003.

Identification of proteins by enzyme digestion, mass

Identification of proteins by enzyme digestion, mass Method for Screening Peptide Fragment Ion Mass Spectra Prior to Database Searching Roger E. Moore, Mary K. Young, and Terry D. Lee Beckman Research Institute of the City of Hope, Duarte, California, USA