Protein Post-translational Modifications Mapping with MS/MS based Frequent Interval Pattern Mining

Han Liu
Department of Computer Science
University of Illinois at Urbana-Champaign
May 8, 2005

ABSTRACT

Tandem mass spectrometry (MS/MS)-based proteomics has proven to be an indispensable tool for large-scale protein identification and expression profiling. However, the analysis of protein post-translational modifications (PTMs) with MS/MS still presents formidable challenges. Although some heuristic algorithms have been developed for this problem, they are far from mature. In this paper, we give a detailed survey of current approaches and propose an alternative but more powerful solution, one that uses more straightforward laboratory procedures and can be applied automatically. Specifically, our aim is to use frequent interval pattern mining techniques to map modification sites in molecular detail. A parametric scoring scheme is also proposed; its parameters can be tuned from the mining results or assigned a priori by a domain expert. Compared with classical PTM mapping algorithms such as SALSA and P-Mod, this mass spectrometric data mining approach is more robust and flexible. To evaluate its biological relevance, we test these methods on real-world datasets generated by MS/MS experiments performed on various tissue samples taken from mouse. The results show that the new approach is competitive on automated PTM mapping tasks, with good efficiency and scalability for real-world bioinformatics applications.

Keywords: protein post-translational modifications mapping, frequent interval pattern mining, scalability, mass spectrometry, bio-data mining

Technical report submitted according to the regulations of the Department of Computer Science at the University of Illinois at Urbana-Champaign.

1 Introduction

Proteomics is the large-scale study of the thousands of proteins in a cell [8]. In a typical proteomics experiment, the goal might be to compare the proteins present in a certain tissue under different conditions. For instance, a biologist might study cancer by comparing the proteins in a cancerous liver with those in a healthy liver. Modern mass spectrometry makes this possible by enabling the identification of thousands of proteins in a complex mixture [16, 2]. However, identifying proteins is only part of the story. It is also important to map protein post-translational modifications (PTMs) in molecular detail [11], since PTMs modulate the activity of most eukaryotic proteins and their determination yields valuable insight into biological function. Despite the great importance of PTMs, their large-scale study has been hampered by a lack of suitable methods. Many PTMs have been discovered serendipitously during studies of individual proteins with standard molecular techniques. Such direct analysis strategies require isolating the correctly processed protein in an amount large enough for biochemical study. As a result, many PTMs cannot be analyzed, which prevents us from fully understanding the protein modification mechanisms at work in the cell [11].

Figure 1: MS/MS. In the first phase, peptide ions with a given m/z value are filtered out. In the second phase, the selected peptide ions are further dissociated and the m/z values of the ion fragments are measured.

Tandem mass spectrometry (MS/MS) of peptides is a central technology of proteomics, enabling the identification of thousands of peptides and proteins from a complex mixture [13, 10, 1]. The overall MS/MS process is illustrated in Figure 1.
In a typical experiment, thousands of proteins from a tissue sample are fragmented into tens of thousands of peptides and ionized; each peptide ion consists of about 6 to 30 amino acid residues. The peptide ion mixture then enters the tandem mass spectrometer and is analyzed in two phases. The first phase is responsible for filtering peptide ions

of a certain mass-to-charge ratio (m/z). In the second phase, a peptide is split into two fragments by collision-induced dissociation (CID) with a noble gas. In almost all cases, the peptide breaks at the bond between two amino acids. The result is a collection of spectra, one for each peptide ion, where each peak represents the relative abundance of a prefix or suffix ion [15].

Table 1: Some common and important post-translational modifications

PTM type | Mass (Da) | Stability | Function and notes
Phosphorylation | +80 | +/++ | modulation of molecular interactions, signaling
Acetylation | | | regulation of protein-DNA interactions
Methylation | | | regulation of gene expression
Hydroxyproline | | | protein-ligand interaction
Sulfation (sTyr) | | | regulation of protein-protein interactions
Disulfide bond | | | intra- and intermolecular crosslink, protein stability
Deamidation | | | a common chemical artifact
Nitration of tyrosine | +45 | +/++ | oxidative damage during inflammation
Pyroglutamic acid | | | protein stability
Farnesyl | | | membrane tethering
Palmitoyl | | /++ | cellular localization and signaling

With the increasing acquisition rate of tandem mass spectrometers, there is growing potential to solve important biological problems such as peptide sequencing or protein identification by applying data mining techniques to MS/MS data [3, 2]. Although the identification of proteins in a complex mixture is becoming routine, protein identification alone provides only limited insight into protein function. An important component of protein regulation and function is the covalent modification of protein structures that occurs post-translationally [12]. Many protein post-translational modifications (PTMs) give rise to specific features in MS/MS spectra [8]. For example, phosphorylated serine and threonine residues eliminate phosphoric acid (80 Da) in MS/MS. Thus, product ions at 40 and 80 units below doubly and singly charged precursor ions, respectively, are observed in the corresponding spectra.
Table 1 summarizes some common and important PTMs [11]. Currently, over 380 different PTMs have been discovered; identifying the type and location of these PTMs is a first step in understanding their regulatory potential. Despite their importance to cellular function, the methodologies used to study these modifications are far from mature: most of them are not compatible with protein mixtures, or are specific to a given type of PTM. Among these algorithms, the most promising is SALSA [6], a pattern recognition algorithm that maps PTMs according to user-specified modification characteristics. However, as we will discuss in the following

section, SALSA cannot be used to predict unanticipated modifications without user-specified priors, which greatly limits its real-world applicability. Another well-known algorithm is P-Mod, which maps modifications based on the mass shift of peptide sequences. However, it relies on the assumption that we already know a priori which proteins are in the mixture, which is too strong an assumption for modern proteomics analysis.

In this paper, we propose a fundamentally different approach from previous work. First, the spectral data are viewed as a directed sequence, called a spectrum sequence, where each element is a dual item: one component corresponds to a mass peak (each peak represents a fragmented peptide ion), and the other is the mass difference between the two corresponding fragmented peptide ions. Then, we mine closed frequent motifs satisfying a min_sup threshold in the sequence space. A parametric scoring scheme is also proposed; its parameters can be tuned from the mining results or assigned a priori by a domain expert. To evaluate biological relevance, we test these methods on real-world datasets generated by MS/MS experiments performed on various tissue samples taken from mouse. The results show that the new approach provides reasonable detection of protein post-translational modifications compared with SALSA, with good efficiency and scalability for real-world bioinformatics applications.

This paper is organized as follows. Section 2 introduces two classical PTM mapping algorithms, SALSA and P-Mod. Section 3 formally defines the closed motif finding and peptide PTM mapping problems. Section 4 describes the frequent interval pattern mining algorithms for the peptide PTM mapping problem on two kinds of spectra: ideal spectra and noisy spectra. Section 5 reports the implementation and testing of our method on a real-world dataset. Section 6 discusses future research directions.
2 Related Work

In this section, we first introduce two state-of-the-art algorithms, SALSA and P-Mod, both of which are used intensively by biologists for PTM mapping from MS/MS data. We also point out the limitations of SALSA and P-Mod that this paper tries to address. Based on the characteristics of SALSA, we formalize the motif searching and PTM mapping problems under a frequent interval pattern mining framework, which forms the basis of our work.

2.1 SALSA Algorithm

SALSA (scoring algorithm for spectral analysis) was developed by biologists for rapidly screening large numbers of peptide MS/MS spectra for fragmentation characteristics indicative of specific peptide modifications. It can detect specific features in MS/MS

spectra and scores the spectra based on how many of the features are displayed and on their intensities in the spectrum. SALSA detects PTMs by detecting simple component features. Four types of features can be detected, as shown in Figure 2.

Figure 2: Spectral characteristics detected by the SALSA algorithm

The first feature is a product ion at a specific m/z value. An example is the loss of a chemical modification as a charged fragment that then appears in the MS/MS spectrum at a particular m/z value, regardless of the m/z of the peptide from which it was lost. The second is a neutral loss, in which a neutral fragment is lost from the precursor ion. The product ion has the same charge state as the precursor, and the difference between the mass of the precursor and that of the detected product ion equals the mass of the lost neutral fragment. The third feature is a charged loss, in which a multiply charged precursor

ion loses a charged fragment. An example is the loss of a singly charged fragment from a doubly charged precursor. The fourth feature is an ion pair, which denotes any two signals separated by a specified m/z value anywhere in the MS/MS spectrum. The appearance of an ion pair can indicate the presence of a specific component in a peptide sequence. A natural extension of the fourth feature is amino acid sequence motif searching: instead of detecting just a pair of ions, SALSA can find a series of ions in the spectra.

For the first feature type, the product ion, SALSA scores specific product ions by identifying the most abundant ion within a window of ±0.5 m/z units centered at the designated m/z value of the selected ion. Neutral losses and charged losses are scored in an analogous manner. The window for neutral loss detection is centered at the precursor m/z minus the user-specified neutral mass divided by the precursor charge. (Note that the actual m/z value for a neutral loss from a doubly charged precursor is half that of the same mass loss from a singly charged precursor.) Neutral losses result in product ions that have the same charge as the precursor ion. In contrast, charged losses generate product ions that have a charge one unit less than that of the precursor and are only observed in spectra arising from doubly charged precursors. Charged losses are calculated from $2 \times \text{precursor } m/z - 1$.

For the ion pair and ion series problems, SALSA scores the correspondence between the experimental and theoretical ion series regardless of their absolute positions on the m/z axis. A virtual ruler, on which the relative separations of the ions are fixed, is superimposed on the experimental mass spectrum by aligning the first ion in the ion series with the fragment ion that has the highest experimentally determined m/z value. More details are illustrated in Figure 2.
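The window arithmetic described above can be made concrete with a small sketch. It is only an illustration of the stated rule (precursor m/z minus neutral mass divided by charge, then the most abundant ion within ±0.5 m/z); the function names and the `(mz, intensity)` peak representation are assumptions, not part of SALSA itself.

```python
def neutral_loss_center(precursor_mz, charge, neutral_mass):
    """Expected m/z of the product ion after a neutral loss.

    The product ion keeps the precursor charge, so the mass loss
    shows up on the m/z axis divided by that charge.
    """
    return precursor_mz - neutral_mass / charge


def most_abundant_in_window(peaks, center, half_width=0.5):
    """Return the (mz, intensity) peak with the highest intensity
    inside [center - half_width, center + half_width], or None."""
    lo, hi = center - half_width, center + half_width
    inside = [(mz, inten) for mz, inten in peaks if lo <= mz <= hi]
    return max(inside, key=lambda p: p[1]) if inside else None


# Hypothetical spectrum: three peaks as (m/z, intensity) pairs.
spectrum = [(419.8, 5.0), (420.3, 9.0), (430.0, 50.0)]
center = neutral_loss_center(500.0, 1, 80.0)      # singly charged precursor
print(center, most_abundant_in_window(spectrum, center))
```

For a doubly charged precursor at the same m/z, the same 80 Da loss moves the window center by only 40 m/z units, which is the halving noted in the text.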
The score of a spectrum is calculated from the %TIC values of the detected ions corresponding to the hypothetical ion series $i_1, \ldots, i_n$. The %TIC values corresponding to peaks $i_1, i_2, i_3, \ldots, i_n$ are denoted $T_1, T_2, T_3, \ldots, T_n$, respectively. Scores for spectra are calculated as

$$\text{Score} = N \left( T_1 T_2 \cdots T_n \right)^{1/n} \quad (1)$$

where $N$ is the number of detected ions that correspond to the hypothetical ions $i_1, \ldots, i_n$ in the series. For spectra in which one or more of the ions in the series are missing, the algorithm inserts a value $I_n$ equal to the threshold value for ion detection.

SALSA provides a focused search for spectra corresponding to a particular peptide or peptide modification; it does not look for exact matches in MS/MS spectra. Instead, it uses user-specified criteria to search for specific spectral features or fragmentation patterns, which can be thought of as spectral fingerprints for a peptide sequence or its variants. However, identifying peptide modifications with SALSA still requires expertise in spectral interpretation. Moreover, SALSA scores only rank spectra by their correspondence to the search criteria and do not provide any quantitative measurement. SALSA therefore runs into problems when faced with very large numbers of spectra.
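As a rough illustration of Eq. (1), the sketch below computes the count of detected ions times the geometric mean of the %TIC values, substituting the detection threshold for missing ions as the text describes. The function name, the use of `None` to mark a missing ion, and the sample values are assumptions made for this example only.

```python
import math


def salsa_score(tic_values, threshold):
    """Score = N * (T_1 * T_2 * ... * T_n) ** (1 / n).

    tic_values: one %TIC value per hypothetical ion, or None when the
                ion was not detected (must be non-empty).
    threshold:  value substituted for each missing ion.
    """
    n = len(tic_values)
    detected = sum(1 for t in tic_values if t is not None)    # N in Eq. (1)
    filled = [threshold if t is None else t for t in tic_values]
    geometric_mean = math.prod(filled) ** (1.0 / n)
    return detected * geometric_mean


# A three-ion series with one missing ion (hypothetical numbers).
print(salsa_score([4.0, None, 1.0], threshold=0.25))
```

Because the geometric mean multiplies all terms, a single missing ion lowers the score twice: once through the small substituted value and once through the smaller N.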

2.2 P-Mod Algorithm

To address the deficiencies of SALSA, P-Mod was developed by the same research group [7]. P-Mod calculates mass differences between search peptide sequences and MS/MS precursors and localizes the mass shift to a sequence position in the peptide. The mass shifts are calculated as

$$\text{mass shift} = \text{sequence mass} - \text{neutral precursor mass} \quad (2)$$

Since modifications are detected as mass shifts, P-Mod does not require the user to guess the masses or sequence locations of modifications. For PTM mapping, an array of customized search criteria is generated for every sequence-to-spectrum comparison, taking into consideration the primary peptide sequence, the observed mass shift, the precursor m/z, and the instrumental limitations of ion trap mass spectrometers. The first element in each search array is a list of all of the expected b- or y-series fragment ions for the unmodified peptide sequence. Succeeding elements in the search array consist of these same fragment ions, tailored to reflect the mass shift localized at different amino acid residues in the sequence. Each element with more than 6 applied search criteria is given a raw score. The corresponding scoring formula is

$$\text{score} = \frac{1}{n} \sum \ln\!\left(1 + \frac{I_n}{b_{ci}\,(1 + 3d^2)}\right) \quad (3)$$

where $n$ = the number of applied search criteria, $I_n$ = the intensity of the largest ion within 1.25 m/z of the expected location for the nth search criterion, $b_{ci}$ = the background intensity in the index compartment that contains the scored ion, and $d$ = the distance in m/z between the scored ion and its expected location. Although each element in each peptide search array is scored separately, only the highest scoring element is recorded as a potential match. Because only the highest scoring element from the search array is recorded, the raw scores assigned to individual sequence-to-spectrum comparisons are extreme values.
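The construction of the search array can be sketched as follows. This is not P-Mod's implementation; it only illustrates the idea that a modification at one residue moves every b-ion that contains that residue. The function name, the restriction to a singly charged b-series, and the parameter `added_mass` (the amount by which the observed precursor is heavier than the unmodified sequence) are assumptions for the example.

```python
def shifted_b_series(b_ions, added_mass):
    """Build a search array from an unmodified b-ion series.

    Element 0 is the unmodified series. Element s + 1 places the
    modification on the residue at index s, so b-ions from index s
    onward (which all contain that residue) move by added_mass.
    """
    arrays = [list(b_ions)]
    for site in range(len(b_ions)):
        arrays.append([mz + added_mass if i >= site else mz
                       for i, mz in enumerate(b_ions)])
    return arrays


# Hypothetical three-ion b-series and an 80 Da modification.
for element in shifted_b_series([100.0, 200.0, 300.0], 80.0):
    print(element)
```

Each element of the returned array would then be scored against the spectrum separately, and only the best-scoring element kept, as described above.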
P-Mod uses extreme value statistics to model the distribution of scores assigned to sequence matches and derives a corresponding p-value for evaluating statistical significance:

$$Y = \frac{S - \mu}{\alpha} - \ln\frac{k}{100} \quad \text{and} \quad p = 1 - \exp(-\exp(-Y)) \quad (4)$$

where $Y$ = the extreme value reduced variate, $S$ = the raw score, $\mu$ = a conditional location parameter, $\alpha$ = a conditional scale parameter, $k$ = the number of comparisons, and $p$ = the estimated p-value.

As described above, SALSA can detect peptide modifications for a particular protein even at relatively small abundance. However, the fact that these modification features must be specified by the user prevents SALSA from automatically discovering

unanticipated PTMs from huge numbers of spectra. P-Mod does not have this constraint; however, it makes the very strong assumption that we already know which proteins are in the mixture, or that the proteins have already been identified from the mixture. This philosophy is dubious, since PTM mapping is mainly aimed at improving identification performance: without mapping the modifications first, database searching algorithms such as SEQUEST or X!Tandem cannot work well. P-Mod is a method for mapping modifications, yet its performance depends on the identification algorithm, which in turn relies on the modification mapping itself. Therefore, P-Mod's assumption is not reasonable or natural enough for real-world applications. A more intuitive and natural method is needed.

3 Problem Formalization

Assume that each spectrum has been normalized so that the x-axis is $m$ rather than $m/z$. A spectrum can be represented by a position sequence $S = \langle (m_1, I_1), (m_2, I_2), \ldots, (m_n, I_n) \rangle$. Each element of $S$ is a dual item $(m_i, I_i)$, where $m_i$ represents the horizontal position of the ith peak and $I_i$ represents the corresponding peak intensity. As shown in Figure 3, the position sequence $S$ can be converted into a distance sequence $\Delta = \langle (\sigma_1, I_1), (\sigma_2, I_2), \ldots, (\sigma_n, I_n) \rangle$, where each $\sigma_i$ is defined as

$$\sigma_i = \begin{cases} m_i & \text{if } i = 1 \\ m_i - m_{i-1} & \text{otherwise} \end{cases} \quad (5)$$

Figure 3: MS/MS spectrum for the sequence $S = \langle (m_1, I_1), (m_2, I_2), \ldots, (m_6, I_6) \rangle$

For convenience of description, both the position sequence $S$ and the distance sequence $\Delta$ notations are used in this paper. Because of the existence of noise, we set up a noise offset vector $O = \langle \delta_1, \delta_2, \ldots, \delta_n \rangle$, which means that the ith peak may appear in the range $[m_i - \delta_i, m_i + \delta_i]$.
Without loss of generality, we simply write a sequence as $S = \langle m_1, m_2, \ldots, m_n \rangle$ or $\Delta = \langle \sigma_1, \sigma_2, \ldots, \sigma_n \rangle$, since $m_i$ and $\sigma_i$ are what our algorithm uses directly for frequent interval pattern mining, while $I_i$ is only needed when we score the mining results.
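The conversion between the two notations is a running difference and its inverse, a running sum. The sketch below assumes that every entry after the first is the distance from the preceding peak, the same convention used later for transaction gaps; the function names and sample masses are illustrative.

```python
from itertools import accumulate


def to_distance_sequence(positions):
    """Position sequence <m_1, ..., m_n> -> distance sequence:
    the first entry is m_1, each later entry is m_i - m_(i-1)."""
    if not positions:
        return []
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]


def to_position_sequence(distances):
    """Inverse conversion: cumulative sum of the distances."""
    return list(accumulate(distances))


peaks = [100, 157, 228]                 # hypothetical peak masses
gaps = to_distance_sequence(peaks)
print(gaps, to_position_sequence(gaps))
```

Intensities are left out here, in line with the simplified notation above.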

A position sequence $S_\alpha = \langle \alpha_1, \alpha_2, \ldots, \alpha_m \rangle$ is a sub-sequence of another position sequence $S_\beta = \langle \beta_1, \beta_2, \ldots, \beta_n \rangle$, denoted $S_\alpha \sqsubseteq S_\beta$ (if $m < n$, written $S_\alpha \sqsubset S_\beta$), if and only if $\exists k \in \mathbb{Z}$ such that the item set of $\alpha$, denoted $H_\alpha = \{\alpha_1 + k, \alpha_2 + k, \ldots, \alpha_m + k\}$, is a subset of the item set of $\beta$, denoted $H_\beta = \{\beta_1, \beta_2, \ldots, \beta_n\}$, i.e., $H_\alpha \subseteq H_\beta$. It is straightforward to see that this definition is a partial order. We also define a distance sequence $\Delta_\alpha$ to be a sub-sequence of another distance sequence $\Delta_\beta$ if and only if their corresponding position sequences have this partial order relationship.

The output of an MS/MS experiment is a spectra dataset in which each spectrum is associated with an id. For simplicity, say the id of the ith spectrum is $i$. We then transform the spectra dataset into a position sequence database $D_S = \{S_1, S_2, \ldots, S_n\}$, which is a set of position sequences, and a distance sequence database $D_\Delta = \{\Delta_1, \Delta_2, \ldots, \Delta_n\}$. $|S|$ represents the number of elements in sequence $S$, while $|D|$ represents the number of sequences in the database $D$. The support of a sequence $S_\alpha$ in a sequence database $D$ is the number of sequences in $D$ that contain $S_\alpha$, denoted $N_\alpha$: $N_\alpha = |\{S \mid S \in D \text{ and } S_\alpha \sqsubseteq S\}|$. Given a minimum support threshold min_sup, the set of frequent interval patterns, $F_{interval}$, includes all the sequences whose support is no less than min_sup. The set of closed frequent interval patterns is defined as $C_{interval} = \{S_\alpha \mid S_\alpha \in F_{interval} \text{ and } \nexists S_\beta \in F_{interval} \text{ s.t. } S_\alpha \sqsubset S_\beta \text{ and } N_\alpha = N_\beta\}$. Since $C_{interval}$ includes no sequence that has a super-sequence with the same support, we have $C_{interval} \subseteq F_{interval}$. Without considering any kind of noise or peak loss, the problem of closed motif finding is to find the $C_{interval}$ above a minimum support threshold in the distance database $D_\Delta$.
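The shift-invariant sub-sequence test and the support count can be sketched directly from these definitions. The only candidate offsets k worth trying are those that align the first item of the pattern with some item of the spectrum. The sketch uses integer masses so that set membership is exact; the function names and data are illustrative assumptions, and noise is ignored as in the exact formulation.

```python
def is_subsequence(alpha, beta):
    """True if some integer offset k maps every item of alpha
    onto an item of beta, i.e. {a + k} is a subset of set(beta)."""
    if not alpha:
        return True
    beta_items = set(beta)
    for b in beta:
        k = b - alpha[0]          # align alpha's first item with b
        if all(a + k in beta_items for a in alpha):
            return True
    return False


def support(alpha, database):
    """Number of sequences in the database that contain alpha."""
    return sum(1 for seq in database if is_subsequence(alpha, seq))


database = [[100, 157, 200, 228], [10, 67, 90], [5, 6]]
print(support([0, 57], database), support([0, 57, 128], database))
```

A pattern is then frequent when `support(pattern, database) >= min_sup`.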
The single-site exact peptide PTM mapping problem is to find a pair of closed frequent interval patterns, say $P_A = \langle \alpha_1, \alpha_2, \ldots, \alpha_k, \alpha_{k+1}, \ldots, \alpha_n \rangle$ and $P_B = \langle \beta_1, \beta_2, \ldots, \beta_{k'}, \beta_{k'+1}, \ldots, \beta_{n'} \rangle$ ($n' \le n$), with two integers $k$ ($k < n$) and $l$, satisfying

$$\begin{cases} P_{B_2} \sqsubseteq P'_{A_2} \text{ and } N_{P_B}/N_{P_A} > I_1, & \text{if } P_{B_1} \sqsubseteq P_{A_1}, \\ P_{B_1} \sqsubseteq P'_{A_1} \text{ and } N_{P_B}/N_{P_A} > I_1, & \text{if } P_{B_2} \sqsubseteq P_{A_2}, \end{cases} \quad (6)$$

where $P_{A_1} = \langle \alpha_1, \alpha_2, \ldots, \alpha_k \rangle$, $P_{A_2} = \langle \alpha_{k+1}, \ldots, \alpha_n \rangle$, $P_{B_1} = \langle \beta_1, \beta_2, \ldots, \beta_{k'} \rangle$, $P_{B_2} = \langle \beta_{k'+1}, \ldots, \beta_{n'} \rangle$, $P'_{A_1} = \langle \alpha_1 + l, \ldots, \alpha_k + l \rangle$, $P'_{A_2} = \langle \alpha_{k+1} + l, \ldots, \alpha_n + l \rangle$, and $0 < I_1 < 1$ is a given threshold. The tuning of $I_1$ should depend on domain-specific knowledge. In fact, the two cases are equivalent; we write both only to make the representation explicit.

Similarly, the double-site peptide PTM mapping problem is to first find a one-site modified interval pattern such as $P_B$, and then use similar methods to find another closed frequent interval pattern, say $P_C = \langle \gamma_1, \gamma_2, \ldots, \gamma_{k''}, \gamma_{k''+1}, \ldots, \gamma_{n''} \rangle$ ($P_C \neq P_A$, $n'' \le n'$), and two integers $k''$ ($k'' < n''$) and $l'$. The definition is similar, and a parameter $0 < I_2 < 1$ is used to denote the domain-specific threshold. In this way, we can go on to define triple-site peptide PTM mapping problems recursively.

Discussion: The definitions above assume that there is no noise. For real applications, two kinds of noise should be considered: peak loss and peak shift. Peak

loss means that, for some spectra, we may lose some peaks because of partial ionization of the peptide fragments. It is also possible that some peaks exist without corresponding peptide ions. Peak shift is mainly due to instrument resolution and isotopic distributions: the measured peak may shift by a small amount along the x-axis. When noise is considered, the PTM mapping definition above must be modified into an approximate peptide PTM mapping problem. To deal with peak loss, we simply change every $C_{interval}$ in the above definition to $F_{interval}$. For the peak shift scenario, to simplify the discussion, we assume a symmetric white noise offset vector $O = \{\delta_1, \delta_2, \ldots, \delta_n\}$ with

$$\delta_i \sim N(0, \sigma^2) \quad (7)$$

When comparing two values $m_i^1$ and $m_i^2$, we say that $m_i^1$ matches $m_i^2$ if and only if $m_i^1 \in [m_i^2 - 2\sigma, m_i^2 + 2\sigma]$. Since real applications generally contain noise, the approximate peptide PTM mapping problem is our concern in the remainder of this paper.

4 Methodology

In this section, a frequent interval pattern mining algorithm is developed for the protein PTM mapping problem. Traditional frequent pattern mining algorithms can be roughly divided into two approaches: candidate generate-and-test and pattern growth. In previous work, the pattern growth approach has outperformed its generate-and-test counterparts in many applications. Therefore, we design our algorithm based on the pattern growth approach, in the hope of better performance in real-world applications.

4.1 Pattern Growth Approach

Pattern growth is an efficient method for mining frequent patterns from large-scale databases. It was first introduced by Han et al. [5, 4]. It adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. A pattern fragment growth method is then used to avoid the costly candidate generate-and-test procedure altogether.
Moreover, an extended prefix-tree structure is constructed to compress crucial information about frequent patterns and to avoid costly, repeated database I/O operations. A comprehensive performance study has shown that it is robust and especially suitable for most real-world applications. Many subsequent works following this approach [17, 18, 9] have been published, and our algorithm also follows this direction.

4.2 Initial Database and Conditional Database

The key concept of the pattern growth approach is the conditional database. By employing a divide-and-conquer strategy, the original database is divided into several partitions according to prefix patterns. For a given prefix interval pattern $P_i$, all the subsequent frequent interval patterns can be mined from the partition $D_{P_i}$ specified by $P_i$ without

accessing any other information. Each conditional database is further divided recursively following the same procedure.

Definition 1 (Prefix interval pattern): A frequent interval pattern in $F_{interval}$ is also called a prefix interval pattern. We call it a prefix because it will grow in subsequent steps.

Definition 2 (Conditional database): The conditional database $D_{P_i}$ of a prefix pattern $P_i$ is the database of all the transactions that contain/follow the prefix pattern.

Definition 3 (Initial database): The initial database is the conditional database of the empty prefix pattern.

To facilitate the mining process, only the current gap and the position of the next peak are stored for each transaction. Instead of storing whole transactions in the conditional database, only references are stored: each transaction in the conditional database is stored as a triple <Gap, <SpectrumID, PeakID>>, where Gap is the current gap value and the peak is identified by <SpectrumID, PeakID>. Since interval patterns may appear at different positions in different spectra, an n-peak spectrum $(p_1, p_2, p_3, \ldots, p_n)$ is populated into n transactions. For each peak $p_i$ in spectrum $j$, the corresponding transaction is represented as

$$\begin{cases} \langle \text{Gap} = m_i - m_{i-1}, \langle \text{SpectrumID} = j, \text{PeakID} = i \rangle \rangle, & \text{if } i > 1 \\ \langle \text{Gap} = 0, \langle \text{SpectrumID} = j, \text{PeakID} = i \rangle \rangle, & \text{otherwise} \end{cases} \quad (8)$$

During the mining process, these transactions are sorted by their current gaps, and all transactions with the same current gap are clustered into the same group. When the initial database is constructed, following the data format of the conditional database introduced above, all transactions are represented as <0, <SpectrumID, PeakID>>. Here the 0 means that all the current gaps are 0, since we have aligned all the spectra according to their first peak.
4.3 Frequent Interval Pattern Mining Algorithm

Exact Pattern Mining: Figure 4 gives the pseudocode for the exact frequent pattern mining problem, which is a recursive method. The main idea comes from the traditional pattern growth approach. However, there is a subtle difference between interval pattern mining and traditional pattern-growth frequent pattern mining algorithms. Assume we have an interval $[a, c]$ and $b \in [a, c]$. Following the prefix pattern mining approach, even if $[a, b]$ is not frequent, we cannot prune $[a, c]$ directly, since $[a, c]$ may still be a frequent interval pattern. To deal with this problem, the algorithm has two branches. In lines 13-15, we construct the conditional database for the recursive call. In lines 19-21, we push forward the current transactions. The difference between these two cases lies in how the new gap is calculated: the former uses the next gap itself, and the latter adds the next

gap to the current gap. This is because when we push forward the current transactions, we are actually ignoring the current peaks. As a result, the new gap size is the sum of the sizes of the two old gaps. However, when we prepare the conditional database for the recursive call, we count the current peak, so the new gap size is just the size of the next gap. This difference does not exist in previous sequential pattern mining problems, since gaps are ignored in sequential patterns.

ExactPatternMining(PatternPrefix P, ConditionalDatabase D)
BEGIN
 1. sort D based on the first gap of each transaction
 2. while not empty(D)
 3.     find the transactions T with the smallest gap in D
 4.     let G = t.Gap, t ∈ T
 5.     remove T from D
 6.     if (support(T) ≥ minsup)
 7.         construct the new pattern prefix P2 = <P, G>
 8.         output <P2, support(T)>
 9.         construct the conditional database D2 = ∅
10.         for each transaction t = <G, <id, num>> in T
11.             if (Spectrum[id] has more than num peaks)
12.                 construct a new transaction t2
13.                 let t2.Gap = peak[id][num + 1] − peak[id][num]
14.                 let t2.num = num + 1
15.                 insert t2 into D2
16.         ExactPatternMining(P2, D2)
17.     for each transaction t = <G, <id, num>> in T
18.         if (Spectrum[id] has more than num peaks)
19.             construct a new transaction t2
20.             let t2.Gap = G + (peak[id][num + 1] − peak[id][num])
21.             let t2.num = num + 1
22.             insert t2 into D
END

Figure 4: The Exact Frequent Interval Mining Algorithm

The key to an efficient implementation of this algorithm is the choice of data structure for the conditional database. Two critical operations must be supported: removing the least element and inserting a new element. For this purpose a heap is adopted, constructed so that the transaction with the smallest gap is on the top

of the heap. The time complexities of least-element removal and new-element insertion are both $O(\log n)$, where $n$ is the number of elements in the heap.

For the space complexity, we denote the total number of spectra by $N$ and the average number of peaks per spectrum by $P$. The size of the peak array is $O(NP)$. Since the initial database is not populated explicitly and only references are constructed, its size is also $O(NP)$. The depth of the recursive calls cannot exceed the number of peaks in a spectrum. Thus, the worst-case space complexity of this algorithm is $O(NP^2)$. In the average case, the size of the conditional database shrinks exponentially with the depth of the recursive calls, so the average-case space complexity is $O(NP)$. This is because, if we assume that $|D_{i+1}| \le \frac{1}{2}|D_i|$ for each recursive call, then

$$\sum_i |D_i| \le \left(2 - \frac{1}{2^{P-1}}\right)|D| \le 2|D| \quad (9)$$

Approximate Pattern Mining: The above algorithm solves the problem of finding exact frequent interval patterns. However, as discussed before, peak shifts may exist because of instrument constraints. A method capable of dealing with this uncertainty is more meaningful for biologists. Here, we show how the exact pattern mining algorithm can be extended to an approximate version. A straightforward extension uses discretization: all peak positions are discretized according to a compartment size. After this preprocessing, we can reuse the exact interval pattern mining algorithm, since the support of a given pattern now includes all the transactions that approximately match it. Let the original peak position be $m_{original}$ and the given compartment size be $Z$. The new peak position $m_{new}$ is calculated as

$$m_{new} = \frac{1}{2}\left( \left\lfloor \frac{m_{original}}{Z} \right\rfloor Z + \left\lceil \frac{m_{original}}{Z} \right\rceil Z \right) \quad (10)$$

This discretization approach is easy to implement, but it has some potential problems.
Consider an n-peak spectrum $(p_1, p_2, p_3, \ldots, p_n)$ in which the position $m_i$ of the ith peak is viewed as a random variable with Gaussian noise $N(0, \sigma^2)$. If the first peak is fixed, the Gaussian noise leads to peak shifts of up to about $2\sigma$ for the other peaks. Given $(m_1, m_2, m_3, \ldots, m_n)$ for the corresponding peaks, if we use an error bound of $[-2\sigma, 2\sigma]$ to match the sequence, then under the independence assumption the probability of a match should be $0.95^n$. However, if we follow the discretization approach and simply set the compartment size to $4\sigma$, we do not obtain that probability. This is because the true value of peak $p_i$ does not always lie in the middle of the compartment. If it coincides with a discretization boundary, the probability of matching is only 0.50; if there are $k$ such peaks, their joint matching probability is only $0.50^k$. In the other cases, the probability of correct matching is a number $\pi_i$ with $0.50 \le \pi_i \le 0.95$. For the whole spectrum, the probability of correct matching is therefore only $\pi_1 \pi_2 \cdots \pi_n < 0.95^n$.
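The boundary effect can be seen in a few lines. The sketch below reads Eq. (10) as mapping each peak to the midpoint of the compartment that contains it (a peak lying exactly on a multiple of Z is left unchanged); that reading, the function name, and the numbers are assumptions made for illustration.

```python
import math


def discretize(m_original, z):
    """Map a peak position to the midpoint of its size-z compartment,
    i.e. the average of the multiples of z just below and just above."""
    lower = math.floor(m_original / z) * z
    upper = math.ceil(m_original / z) * z
    return 0.5 * (lower + upper)


sigma = 0.25
z = 4 * sigma                       # compartment size of 4 sigma
a, b = 99.9, 100.1                  # two noisy readings of one true peak

print(abs(a - b) <= 2 * sigma)      # within the 2-sigma matching bound
print(discretize(a, z), discretize(b, z))   # yet different compartments
```

The two readings differ by less than 2σ and would match under the direct comparison, but they straddle the boundary at 100 and land in different compartments, so the exact miner would treat them as different positions.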

The above analysis motivates us to design a more effective approximate mining algorithm. Fortunately, it is not hard to modify the previous exact pattern mining algorithm for this purpose. The pseudocode is shown in Figure 5.

ApproximatePatternMining(PatternPrefix P, ConditionalDatabase D)
BEGIN
 1. sort D based on the first gap of each transaction
 2. let G = 0
 3. while true
 4.     while not empty(D) && |D[G − a, G + a]| < minsup
 5.         increase G until the first peak is outside of the range
 6.         push-forward transactions in D with gap < G − a
 7.     if empty(D)
 8.         break
 9.     let T = D[G − a, G + a]
10.     construct the new pattern prefix P2 = <P, G>
11.     output <P2, support(T)>
12.     construct the conditional database D2 = ∅
13.     for each transaction t = <g, <id, num>> in T
14.         if (Spectrum[id] has more than num peaks)
15.             construct a new transaction t2
16.             let t2.g = peak[id][num + 1] − peak[id][num]
17.             let t2.num = num + 1
18.             insert t2 into D2
19.     ApproximatePatternMining(P2, D2)
20.     let G = G + b
21.     push-forward transactions in D with gap < G − a
END

Figure 5: The Approximate Frequent Mining Algorithm

This method introduces two parameters, $a$ and $b$. The parameter $a$ is the maximum allowed error bound for a single peak, that is, the maximum allowed difference between corresponding peaks when we align the first peaks of two patterns (or of one pattern and one transaction). It can be viewed as a measure of how much uncertainty we want to tolerate. The parameter $b$ is the smallest increment of the current gap between two adjacent recursive calls; it can be viewed as a measure of how large a step we enforce for the algorithm to move forward smoothly. The introduction of $b$ is necessary because when we have more than minsup transactions within the range $[g - a, g + a]$ ($g$ = current peak position + $a$), we may also have more than minsup transactions within the range $[g + \epsilon - a, g + \epsilon + a]$, where $\epsilon$ is a tiny value. Such recursive calls are almost duplicates.
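Lines 4 and 9 of Figure 5 both rely on selecting the transactions whose gap lies in [G − a, G + a]. A minimal sketch of that one operation is given below, with a gap-sorted Python list and the standard `bisect` module standing in for whatever ordered container holds the conditional database. The tuple layout `(gap, spectrum_id, peak_id)`, the function names, and the choice to count support as distinct spectra are assumptions of this example.

```python
from bisect import bisect_left, bisect_right


def in_range(sorted_db, g, a):
    """Transactions whose gap lies in [g - a, g + a].

    sorted_db: list of (gap, spectrum_id, peak_id) tuples sorted by gap.
    Rebuilding the gap list on every call is O(n); it keeps the sketch
    short and is not how an efficient implementation would do it.
    """
    gaps = [t[0] for t in sorted_db]
    lo = bisect_left(gaps, g - a)
    hi = bisect_right(gaps, g + a)
    return sorted_db[lo:hi]


def range_support(sorted_db, g, a):
    """Number of distinct spectra with a transaction in the range."""
    return len({spectrum_id for _, spectrum_id, _ in in_range(sorted_db, g, a)})


db = sorted([(57.0, 0, 1), (57.1, 1, 1), (56.8, 2, 1), (71.0, 0, 2), (57.05, 0, 3)])
print(in_range(db, 57.0, 0.15), range_support(db, 57.0, 0.15))
```

Advancing G by b and repeating the query corresponds to lines 20 and 4 of the figure; a small b makes successive windows overlap heavily, which is the duplication the text warns about.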

In the above pseudocode, there is an operation called push-forward. This is just an abbreviation of the corresponding lines of the Exact Frequent Interval Mining algorithm given earlier. Heaps are no longer suitable for constructing conditional databases in this approximate mining algorithm: besides minimum-element removal and insertion, we also need to count the transactions within the range $[G - a, G + a]$ efficiently, which heaps do not support. Therefore, a balanced search tree is adopted here. We keep two pointers, one to the minimum element and one to the minimum element that is greater than $G + a$. In this way, it is easy to maintain the number of transactions in the range $[G - a, G + a]$.

The worst-case space complexity of the approximate interval pattern mining algorithm is the same as that of the exact interval pattern mining algorithm, since the size of the conditional database and the maximal number of recursive calls do not change. In the average case, however, the space usage increases, because with approximate matching the conditional database for a recursive call is usually larger than its exact-matching counterpart. As a result, the constant factor in average space usage increases, while the asymptotic complexity does not change. The following theorem guarantees that all frequent interval patterns are mined by this method.

Theorem 1: The given exact and approximate frequent interval pattern mining algorithms are guaranteed to output all the frequent interval patterns.

Proof: Completeness follows directly from the structure of the two pattern mining algorithms, each of which has two branches. The first branch is pattern mining in the prefix-tree-based conditional database, whose completeness is guaranteed by the PrefixSpan method [14]. The second, the peak-removal branch, in fact adds more transactions to the original database and does not prune anything at all. This completes the proof.
4.4 Modifications Mapping

Modifications mapping can be conducted on the basis of the frequent interval patterns. According to the definition of PTMs, we are trying to find the modifications between two frequent patterns. In the following, one exact approach and one approximate approach are presented. Then a scoring function is defined, which can be used to rank the mining results.

Exact Modifications Mapping: The key problem in detecting a modification is to determine where the modification begins. Suppose we have two length-$n$ patterns $P_1 = \langle \psi_1, \psi_2, \ldots, \psi_n \rangle$ and $P_2 = \langle \psi'_1, \psi'_2, \ldots, \psi'_n \rangle$, where $\psi_i$ and $\psi'_i$ represent the $i$th gap in the respective frequent interval patterns. If there exists a number $k$ ($k < n$) for which $\psi_i = \psi'_i$ for all $1 \le i < k$ and all $k < i \le n$, so that the two patterns can differ only in their $k$th gap, then we say there is a modification at position $k$ for the corresponding frequent patterns.

The way to detect such modifications is to enumerate all the possible values of $k$. Each time, we only need to compare patterns of equal length, since we use frequent interval patterns $F_{interval}$ instead of closed frequent interval patterns $C_{interval}$. Given a value $k$ for length-$n$ patterns ($k < n$, with the patterns represented as $(\psi_1, \psi_2, \ldots, \psi_n)$), we sort all the length-$n$ patterns by the multi-key $(\psi_1, \ldots, \psi_{k-1}, \psi_{k+1}, \ldots, \psi_n)$. Two patterns are modifications of each other if and only if they have the same key. This requires a multiple-key sort. The naive method treats a multiple-key sort as $n$ single-key sorts. There are altogether $n - 1$ possible values of $k$ (from 1 to $n - 1$), so the naive approach needs $n(n - 1)$ single-key sorts. Assuming there are $m$ length-$n$ patterns, the time complexity is $O(n^2 m \log m)$.

A more sophisticated extension of the naive approach reuses the previous sorting results and reduces the time complexity by a factor of $n$. The key component of this approach is stable sorting, which differs from ordinary sorting in that it guarantees that the relative order of two data items with the same key does not change during the sorting procedure. With stable sorting, the number of single-key sorts can be reduced to $2n - 2$. The method is shown in Figure 6.

ModificationMapping(length-n PatternArray Pset)
BEGIN
    for i = n − 1 to 1
        stable sort Pset on single dimension i
    output detected modifications
    for i = n downto 2
        stable sort Pset on dimension i
        extract the modifications at dimension i from the sorting results
    output detected modifications
END

Figure 6: Modifications Mapping method

An illustrative run of this algorithm is shown in Table 2. In this example, we focus on length-5 patterns.
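The per-position comparison can be sketched in a few lines. The following is a minimal sketch of the naive variant only (one multi-key sort per value of $k$), assuming patterns are given as equal-length tuples of gaps; it does not implement the $2n - 2$ sort-reuse scheme of Figure 6, and the function name `map_modifications` is hypothetical. It relies on Python's built-in sort being stable.

```python
from itertools import groupby

def map_modifications(patterns):
    """Return (k, p, q) triples where p and q agree everywhere except gap k.

    k is 1-based and restricted to k < n, as in the text. Patterns must be
    tuples of the same length n.
    """
    if not patterns:
        return []
    n = len(patterns[0])
    found = []
    for k in range(n - 1):                      # 0-based 0 .. n-2, i.e. k = 1 .. n-1
        def key(p, k=k):
            return p[:k] + p[k + 1:]            # the pattern with gap k removed
        ordered = sorted(patterns, key=key)     # stable multi-key sort
        for _, group in groupby(ordered, key=key):
            group = list(group)
            # patterns sharing the key are candidate modifications of each other
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    if group[i][k] != group[j][k]:
                        found.append((k + 1, group[i], group[j]))
    return found

patterns = [
    (57, 71, 99, 113),
    (57, 87, 99, 113),
    (57, 71, 99, 128),
    (101, 71, 99, 113),
]
for hit in map_modifications(patterns):
    print(hit)
```

The pairwise comparison inside each group is quadratic in the group size, which is acceptable for a sketch but would need care when many patterns share a key.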
The success of this algorithm depends on a critical property of stable sorting: if we have two keys $k$ and $K$, and the data is already ordered by key $k$, then after a stable sort on key $K$ the data is ordered by the composite key $\langle K, k \rangle$. With this property, we see that in the first loop of the algorithm the keys $\psi_4, \psi_3, \psi_2, \psi_1$ are sorted together, so when extracting the modification patterns, all the modifications in the fifth row of Table 2 can be detected. Then, in the first run of the second loop, the keys $\psi_3, \psi_2, \psi_1, \psi_5$ are sorted, so all the modifications in the fourth row can be extracted. Each step of the second loop reuses the previous results. In this way the whole key space is traversed, and the correctness of the algorithm is guaranteed. The number of single-key sorts is reduced to only $2n - 2$, and the time complexity is only $O(n\, m \log m)$, which is more efficient than the naive approach.

Table 2: An illustrative run of the modification mapping algorithm

Key (gaps compared)      Step      Dimensions sorted so far
ψ2 ψ3 ψ4 ψ5              step 5    4, 3, 2, 1, 5, 4, 3, 2
ψ1 ψ3 ψ4 ψ5              step 4    4, 3, 2, 1, 5, 4, 3
ψ1 ψ2 ψ4 ψ5              step 3    4, 3, 2, 1, 5, 4
ψ1 ψ2 ψ3 ψ5              step 2    4, 3, 2, 1, 5
ψ1 ψ2 ψ3 ψ4              step 1    4, 3, 2, 1

Approximate Modifications Mapping: The same problem arises again when we switch from exact to approximate modifications mapping. Approximate mapping is more difficult because modifications cannot be detected accurately by simple sorting. There is, however, a useful property that permits efficient approximate modifications mapping: if the prefixes of two patterns are the same, then the first gap in which the patterns differ must differ by at least $\delta$. This condition tells us that the approximate modifications mapping algorithm needs to conduct approximate matching only after the modification position. We can still use stable sorting for the dimensions on or before the modification position, but not for the part of the patterns after it. Instead, for each block of patterns that agree on the prefix, an R-Tree is adopted to store the suffixes. In this way, approximate modifications can be mapped efficiently.

Discussions: The above discussion focuses mainly on the single-site PTMs mapping problem. However, it is straightforward to see that this algorithm can also detect multiple-site PTMs. Assume that $P_1$ is the pattern for the original peptide sequence, $P_2$ is the single-site modified version, and $P_3$ is the double-site modified version based on $P_2$. By the previous definition, $P_1$, $P_2$ and $P_3$ are all frequent interval patterns, and by Theorem 1 they can all be mined by our algorithms. In the modifications mapping step, all of the modifications can therefore be detected.
Also, in order to improve efficiency, we can enforce a constraint on the modification positions $k_1$ and $k_2$ by requiring $k_1 \ge 2$ and $n - k_2 \ge 2$. In this way the search space can be pruned further.

4.5 Scoring Function

In the protein PTMs mapping problem many modifications may be detected, so a good scoring function to rank the results becomes a necessity. Here we give a simple and intuitively meaningful scoring function. Assume that one modification is detected from two frequent interval patterns $P_1$ and $P_2$, that the supports of these two patterns are $N_{P_1}$ and $N_{P_2}$, and that $I_{ki}$ represents the peak intensity vector for the $k$-th instance of pattern $P_i$. The final score of this modification is defined as

$$\mathrm{score} = \frac{\left(N_{P_1} \sum_k I_{k1}\right)\left(N_{P_2} \sum_l I_{l2}\right)}{N_{P_1} \sum_k I_{k1} + N_{P_2} \sum_l I_{l2}} \qquad (11)$$

The interpretation of this score is intuitive: up to a constant factor of 2, it is the harmonic mean of the weighted intensities of the two component patterns.

5 Experimental Results

In this section, several experiments are reported that show the efficiency and effectiveness of our algorithm. By conducting experiments on a real-world dataset, we also analyze the biological relevance of our algorithm.

5.1 Data set

The data set included 42,392 different spectra, collected during 46 runs of Liquid-Chromatography Tandem Mass Spectrometry (LC-MS/MS). The whole spectrum collection from the first run was used in the analysis. There was no merging of spectra, and hence each spectrum corresponds to one peptide. All protein samples originated from 1D PAGE or 2D gel electrophoresis separations. Bands or spots were extracted from the gel and subjected to in-gel digestion by trypsin. Cysteines were reduced and alkylated by iodoacetamide. Two randomly selected spectra are shown in Figure 7.

There are altogether 1,302 spectra in this data set. We first rescale the x-axis from mass-to-charge ratio $m/z$ to mass $m$. Since there are many noisy peaks, only the 300 most abundant peaks are selected and the other peaks are removed. The spectra are then discretized with a window of width $Z = 0.3$; for different peaks in the same compartment, only the most intense one is kept. After these preprocessing steps, the average number of peaks per spectrum is $P = 243$. We also rescale the peak intensities to 100.
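The preprocessing steps can be summarized in a short sketch. It assumes a spectrum is a list of (mass, intensity) pairs that has already been converted from $m/z$ to mass, and it interprets "rescale the peak intensities to 100" as scaling the most intense remaining peak to 100, which is an assumption rather than something the text states. The function name `preprocess` and its parameter names are hypothetical.

```python
def preprocess(peaks, top_n=300, width=0.3, scale=100.0):
    """Keep the top_n most intense peaks, one peak per compartment, then rescale.

    peaks: list of (mass, intensity) pairs, already converted from m/z to mass.
    Returns the surviving peaks sorted by mass.
    """
    # 1. Keep only the top_n most intense peaks.
    kept = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]

    # 2. Discretize: within each compartment of the given width,
    #    keep only the most intense peak.
    best = {}
    for mass, intensity in kept:
        compartment = int(mass // width)
        if compartment not in best or intensity > best[compartment][1]:
            best[compartment] = (mass, intensity)
    result = sorted(best.values())

    # 3. Rescale so that the most intense peak has intensity `scale`
    #    (assumed reading of "rescale to 100").
    top = max((intensity for _, intensity in result), default=0.0)
    if top > 0:
        result = [(mass, intensity * scale / top) for mass, intensity in result]
    return result

spectrum = [(100.0, 10.0), (100.1, 40.0), (150.1, 20.0), (200.2, 5.0)]
print(preprocess(spectrum, top_n=3))
```

In the example, the weakest peak is dropped by the top-`top_n` filter and the two peaks near mass 100 share a compartment, so only the more intense of them survives.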
Since we are only interested in the post-translational modifications mapping problem, we did not conduct protein or peptide identification via SEQUEST or X!Tandem. All the original spectra are stored in .pkl format files, in which one column holds the $m/z$ values and the other column holds the corresponding intensity levels. For each spectrum, the precursor $m/z$ and charge are also recorded. In the experiments that follow, we first evaluate the performance of the algorithm (mainly its running time) on real-world data, and then consider and discuss the biological relevance of the results.

Figure 7: Two randomly selected spectra from the dataset, shown with their precursor charge (intensity vs. m/z).

5.2 Algorithm Performance and Scalability

We now consider the scalability of the proposed frequent pattern mining algorithms. The space complexity has already been discussed; here we mainly focus on the running time on real data. In an extreme case such as minimal support = 1, the worst case of every combinatorial algorithm has exponential complexity, but average performance differs considerably depending on the underlying heuristics. Since the running time of the approximate pattern mining method is proportional to that of the exact pattern mining method up to a constant factor, we only test the running time of the exact approach. Figure 8 shows the algorithm profiles. We fixed min_sup at 20, the minimal length of each pattern at 8, and the window width for discretization at 0.3. The first plot shows the time used as the number of spectra grows; each point is the average over 20 repeated runs. With the number of spectra fixed at 10 and all other parameters unchanged, the second plot shows the time used as the minimal support min_sup grows; each point again averages the results over 20 runs. All these experiments were conducted on a Linux machine with a Pentium 1.7 GHz CPU.

Figure 8: Algorithm performance. The first plot shows time (seconds) vs. number of spectra; the second shows time (seconds) vs. minimum support min_sup.

As the figures show, the running time as a function of the number of spectra $x$ on the real dataset grows as $x^{1.6}$, a polynomial growth below $x^2$, which is reasonable for a frequent pattern mining algorithm. From the second plot we can see that the running time is very sensitive to a decrease of the minimum support: if the minimum support is less than 4, the time exceeds 20 hours even for 20 spectra. Fortunately, we normally set min_sup $\ge 10$, and in this range the running time is still approximately polynomial. Experiments with other parameter settings have consistently shown results similar to those in the above figures and are omitted.

5.3 Biological Relevance

In the previous sections, we focused on the frequent interval pattern mining problem and tested the scalability of the algorithms on the real dataset. In this section, we discuss the biological relevance of the modifications mapping algorithms. For our experiments, we set the modifications support ratio to 0.8 and require the modified and non-modified parts to have a length of at least 3.

Our strategy here is to conduct the modifications mapping first; then all the detected modifications are clustered into histograms according to their scores, therefore, we

could get the distribution of different modifications. One thing to note is that a typical mass spectrometry experiment contains a large portion of noise: even though we select only the 300 most intense peaks, many of them are caused by random noise. If the peak noise is uniform or Gaussian, then, viewing interval gaps as random variables generated by the peaks, the noise distribution is biased towards smaller intervals. Therefore, we need to build a suitable noise model and the corresponding noise distributions for baseline correction. In this paper, two noise models are considered: uniform noise and Gaussian noise. Their distributions are shown in Figure 9.

Figure 9: The noise models: Uniform vs. Gaussian (panels: distribution of the uniform noise; distribution of the Gaussian noise)

From the noise models, we can see that the interval distribution generated by uniform noise has a heavier tail than that generated by Gaussian noise. When we analyze the distribution of modifications, we need to consider the effects of random noise and conduct a baseline correction. The modification distribution is shown in Figure 10, from which we conclude that the underlying noise has a Gaussian distribution. After the baseline correction, we can see that modifications with gaps such as 109, 98 and 42 are very frequent.

Figure 10: The distribution of the detected modifications

This distribution automatically rediscovered qualitative phenomena previously known to experienced mass spectrometrists. The intense mode at 109 represents the post-translational modification Pyrrolysine. The intense mode at 98 is mainly caused by the modification Phosphorylation together with water, while the intense mode at 210 is mainly caused by Myristoylation. The intense mode at 17 represents Pyrrolidone carboxylic acid. All of these are very common modifications, so it is not surprising that they are more frequent than the others. Some of the detected frequent gaps, such as 113 and 114, are in fact the masses of the amino acids Isoleucine and Asparagine. These are not really modifications, but since they occur frequently they are also detected.

The algorithm also detected previously undescribed patterns. For example, after ranking the detected modifications according to their scores, the gap 226 ranks very high, and this result is consistent under different parameter settings. Since no frequent double-site modification yields such a value, we suspect that this may be a new modification in the given samples. More details need to be confirmed by analytical biological experiments.

6 Conclusions and Discussions

We make two main contributions in this paper. First, we motivated and formalized a novel class of data mining problems that arise in protein post-translational modifications mapping, specifically for the analysis of tandem mass spectrometry data, but with natural applications to market-basket databases. Second, we developed the frequent interval pattern mining and modifications mapping algorithms. Based on an extension of PrefixSpan and stable sorting, these methods are natural and efficient on the real-world mass
