Protein Post-translational Modifications Mapping with MS/MS based Frequent Interval Pattern Mining
Han Liu
Department of Computer Science
University of Illinois at Urbana-Champaign
May 8, 2005

ABSTRACT

Tandem mass spectrometry (MS/MS)-based proteomics has proven to be an indispensable tool for large-scale protein identification and expression profiling. However, analysis of protein post-translational modifications (PTMs) with MS/MS still presents formidable challenges. Although some heuristic algorithms have been developed for this problem, they are far from mature. In this paper, we give a detailed survey of current approaches and propose an alternative but more powerful solution, one that uses more straightforward laboratory procedures and can be applied automatically. Specifically, our aim is to use frequent interval pattern mining techniques to map modification sites in molecular detail. A parametric scoring scheme is also proposed; its parameters can be tuned from the mining results or assigned a priori by a domain expert. Compared with classical PTMs mapping algorithms such as SALSA and P-Mod, this mass spectrometric data mining approach demonstrates more robustness and flexibility. To evaluate biological relevance, we test these methods on real-world datasets generated by MS/MS experiments performed on various tissue samples taken from mouse. The results show that the new approach is competitive on automated PTMs mapping tasks, with good efficiency and scalability for real-world bioinformatics applications.

Keywords: protein post-translational modifications mapping, frequent interval pattern mining, scalability, mass spectrometry, bio-data mining

Technical report submitted according to the regulations of the Department of Computer Science at the University of Illinois at Urbana-Champaign
1 Introduction

Proteomics is the large-scale study of the thousands of proteins in a cell [8]. In a typical proteomics experiment, the goal might be to compare the proteins present in a certain tissue under different conditions. For instance, a biologist might want to study cancer by comparing the proteins in a cancerous liver to the proteins in a healthy liver. Modern mass spectrometry makes this possible by enabling the identification of thousands of proteins in a complex mixture [16, 2]. However, identifying proteins is only part of the story. It is also important to map protein post-translational modifications (PTMs) in molecular detail [11], since PTMs modulate the activity of most eukaryotic proteins and their determination generates valuable insight into biological function.

Despite the great importance of PTMs, their study on a large scale has been hampered by a lack of suitable methods. Many PTMs have been discovered serendipitously during studies of individual proteins with the help of standard molecular techniques. Such direct analysis strategies require isolation of the correctly processed protein in an amount sufficiently large for biochemical study. As a result, many PTMs cannot be analyzed, which prevents us from fully understanding the protein modification mechanisms at work in the cell [11].

Figure 1: MS/MS. In the first phase, peptide ions of a given m/z value are filtered out. In the second phase, the selected peptide ions are further dissociated and the m/z values of the ion fragments are measured.

Tandem mass spectrometry (MS/MS) of peptides is a central technology of proteomics, enabling the identification of thousands of peptides and proteins from a complex mixture [13, 10, 1]. The whole MS/MS process is illustrated in Figure 1.
In a typical experiment, thousands of proteins from a tissue sample are fragmented into tens of thousands of peptides and ionized; each peptide ion consists of about 6 to 30 amino acid residues. This mixture of peptide ions then enters the tandem mass spectrometer and is analyzed in two phases. The first phase is responsible for filtering peptide ions
of a certain mass-to-charge ratio (m/z). In the second phase, a peptide is split into two fragments by means of collision-induced dissociation (CID) with a noble gas. In almost all cases, the peptide is broken at the bond between two amino acids. The result is a collection of spectra, one for each peptide ion, where each peak represents the relative abundance of a prefix or suffix ion [15].

Table 1: Some common and important post-translational modifications

PTM type | Mass (Da) | Stability | Function and notes
Phosphorylation | +80 | +/++ | modulation of molecular interactions, signaling
Acetylation | | | regulates protein-DNA interactions
Methylation | | | regulates gene expression
Hydroxyproline | | | protein-ligand interaction
Sulfation (sTyr) | | | regulates protein-protein interactions
Disulfide bond | | | intra- and intermolecular crosslink, protein stability
Deamidation | | | a common chemical artifact
Nitration of tyrosine | +45 | +/++ | oxidative damage during inflammation
Pyroglutamic acid | | | protein stability
Farnesyl | | | membrane tethering
Palmitoyl | | /++ | cellular localization and signaling

With the increasing acquisition rate of tandem mass spectrometers, there is increasing potential to solve important biological problems such as peptide sequencing or protein identification by applying data mining techniques to MS/MS data [3, 2]. Although the identification of proteins in a complex mixture is becoming routine, protein identification alone provides only limited insight into protein function. An important component of protein regulation and function is covalent modification of protein structures that occurs post-translationally [12]. Many protein post-translational modifications (PTMs) give rise to specific features in MS/MS spectra [8]. For example, phosphorylated serine and threonine residues eliminate phosphoric acid (80 Da) in MS/MS. Thus, product ions at 40 and 80 units below doubly and singly charged precursor ions, respectively, are observed in the corresponding spectra.
Table 1 summarizes some common and important PTMs [11]. Currently, over 380 different PTMs have been discovered; identifying the type and location of these PTMs is a first step in understanding their regulatory potential. Despite their importance to cellular function, the methodologies used to study these modifications are far from mature: most of them are not compatible with protein mixtures, or are specific to a given type of PTM. Among these algorithms, the most promising is SALSA [6], a pattern recognition algorithm that maps PTMs according to user-specified modification characteristics. However, as we will state in the following
section, SALSA cannot be used to predict unanticipated modifications without user-specified priors, which greatly limits its real-world applicability. Another well-known algorithm is P-Mod, which maps modifications based on the mass shift of peptide sequences. However, it relies on the strict assumption that we already know a priori which proteins are in the mixture, which is still too strong an assumption for modern proteomics analysis.

In this paper, we propose a fundamentally different approach from previous work. First, the spectral data is viewed as a directed sequence, called a spectrum sequence, where each element is a dual item: one component corresponds to a mass peak (each peak represents a fragmented peptide ion), while the other is the mass difference between two corresponding fragmented peptide ions. Then, we mine closed frequent motifs satisfying a min_sup threshold in the sequence space. A parametric scoring scheme is also proposed; its parameters can be tuned from the mining results or assigned a priori by a domain expert. To evaluate biological relevance, we test these methods on real-world datasets generated by MS/MS experiments performed on various tissue samples taken from mouse. The results show that the new approach provides reasonable detection of protein post-translational modifications compared to SALSA, with good efficiency and scalability for real-world bioinformatics applications.

This paper is organized as follows. Section 2 introduces two classical PTMs mapping algorithms, SALSA and P-Mod. Section 3 formally defines the closed motif finding problem and the peptide PTMs mapping problem. Section 4 describes the frequent interval pattern mining algorithms for the peptide PTMs mapping problem on two kinds of spectra: ideal spectra and noisy spectra. Section 5 reports the implementation and testing of our method on a real-world dataset. Section 6 discusses future research directions.
2 Related Work

In this section, we first introduce two state-of-the-art algorithms, SALSA and P-Mod, both of which are used intensively by biologists for PTMs mapping from MS/MS data. We also point out the problems of SALSA and P-Mod, which this paper tries to remedy. Based on the characteristics of SALSA, we formalize the motif searching and PTMs mapping problems under a frequent interval pattern mining framework, which forms the basis of our work.

2.1 SALSA Algorithm

SALSA (scoring algorithm for spectral analysis) was developed by biologists for rapidly screening large numbers of peptide MS/MS spectra for fragmentation characteristics indicative of specific peptide modifications. It can detect specific features in MS/MS
spectra and scores the spectra based on how many of the features are displayed and their intensities in the spectrum. SALSA detects PTMs by detecting some simple component features. Four types of features can be detected, as shown in Figure 2.

Figure 2: Spectral characteristics detected by the SALSA algorithm

The first feature is a product ion at a specific m/z value. An example is the loss of a chemical modification as a charged fragment that then appears in the MS/MS spectrum at a particular m/z value, regardless of the m/z of the peptide from which it was lost. The second is a neutral loss, in which a neutral fragment is lost from the precursor ion. The product ion has the same charge state as the precursor, but the difference between the mass of the precursor and that of the detected product ion equals the mass of the lost neutral fragment. The third feature is a charged loss, in which a multiply charged precursor
ion loses a charged fragment. An example of this is the loss of a singly charged fragment from a doubly charged precursor. The fourth feature is an ion pair, which denotes any two signals separated by a specified m/z value anywhere in the MS/MS spectrum. The appearance of an ion pair can indicate the presence of a specific component in a peptide sequence. A natural extension of the fourth feature is amino acid sequence motif searching: instead of detecting just a pair of ions, SALSA can find a series of ions in the spectra.

For the first type of feature, the product ion, SALSA scores specific product ions by identifying the most abundant ion within a window of ±0.5 m/z units centered at the designated m/z value for the selected ion. Neutral losses and charged losses are scored in an analogous manner. The window for neutral loss detection is centered at the precursor m/z minus the user-specified neutral mass divided by the precursor charge (note that the m/z difference for a neutral loss from a doubly charged precursor is half of that for the same mass loss from a singly charged precursor). Neutral losses result in product ions that have the same charge as the precursor ion. In contrast, charged losses generate product ions that have a charge one unit less than that of the precursor and are only observed in spectra arising from doubly charged precursors. Charged losses are calculated from 2 × precursor m/z − 1.

For the ion pair and ion series problems, SALSA scores the correspondence between the experimental and theoretical ion series regardless of their absolute positions on the m/z axis. A virtual ruler is used, with the relative separations of the ions fixed, and is superimposed on the experimental mass spectrum by aligning the first ion in the ion series with the fragment ion that has the highest experimentally determined m/z value. More details are illustrated in Figure 2.
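The neutral-loss window described above can be sketched in a few lines of Python. This is a minimal illustration, not SALSA's actual code: the function names, the peak representation as `(m/z, intensity)` tuples, and the example values are all assumptions made for the sketch.

```python
def neutral_loss_center(precursor_mz, neutral_mass, charge):
    """Centre of the detection window: precursor m/z minus neutral mass / charge."""
    return precursor_mz - neutral_mass / charge


def most_abundant_in_window(peaks, center, half_width=0.5):
    """Return the most intense (mz, intensity) peak within center +/- half_width, or None."""
    window = [p for p in peaks if abs(p[0] - center) <= half_width]
    return max(window, key=lambda p: p[1]) if window else None


# Hypothetical doubly charged precursor at m/z 600 losing an 80 Da neutral fragment.
center = neutral_loss_center(600.0, 80.0, charge=2)
peaks = [(559.8, 10.0), (560.3, 25.0), (561.0, 99.0)]
hit = most_abundant_in_window(peaks, center)
```

With these toy values the window is centered 40 m/z units below the precursor, matching the halving of the m/z difference for doubly charged precursors noted in the text.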
Spectra are scored from the %TIC values of the detected ions corresponding to the hypothetical ion series $i_1, \ldots, i_n$. The %TIC values corresponding to peaks $i_1, i_2, i_3, \ldots, i_n$ are denoted $T_1, T_2, T_3, \ldots, T_n$, respectively. Scores for spectra are calculated as

$$\mathrm{Score} = N \, (T_1 T_2 \cdots T_n)^{1/n} \tag{1}$$

where $N$ is the number of detected ions that correspond to hypothetical ions $i_1, \ldots, i_n$ in the series. For spectra in which one or more of the ions in the series are missing, the algorithm inserts a value $I_n$ equal to the threshold value for ion detection.

SALSA provides a focused search for spectra corresponding to a particular peptide or peptide modification; it does not look for exact matches in MS/MS spectra. Instead, it uses user-specified criteria to search for specific spectral features or fragmentation patterns, which can be thought of as spectral fingerprints for a peptide sequence or its variants. However, identification of peptide modifications using SALSA still requires expertise in spectral interpretation. Moreover, SALSA scores only rank spectra based on their correspondence to the search criteria and do not provide any quantitative measurement. When facing a huge amount of data, SALSA encounters problems.
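As a small worked illustration of Equation 1, the sketch below computes the score as the number of detected ions times the geometric mean of the %TIC values. It assumes a missing ion is passed as `None` and replaced by the detection threshold; the function name and the example numbers are hypothetical.

```python
import math


def salsa_score(tic_values, detection_threshold):
    """Score = N * (T_1 * ... * T_n)^(1/n), with missing ions set to the threshold.

    tic_values: one %TIC value per hypothetical ion, or None if the ion was not detected.
    detection_threshold: positive value substituted for each missing ion.
    """
    n = len(tic_values)
    detected = sum(1 for t in tic_values if t is not None)  # this is N
    filled = [t if t is not None else detection_threshold for t in tic_values]
    # Geometric mean computed through logarithms to avoid overflow on long series.
    geometric_mean = math.exp(sum(math.log(t) for t in filled) / n)
    return detected * geometric_mean


full_series = salsa_score([4.0, 1.0, 2.0], detection_threshold=0.5)
one_missing = salsa_score([4.0, None, 4.0], detection_threshold=0.5)
```

Because the geometric mean is multiplicative, a single missing ion pulls the score down sharply, which is consistent with the ranking behaviour described above.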
2.2 P-Mod Algorithm

To make up for the deficiencies of SALSA, P-Mod was developed by the same research group [7]. P-Mod calculates mass differences between search peptide sequences and MS/MS precursors and localizes the mass shift to a sequence position in the peptide. The mass shifts are calculated as

$$\text{mass shift} = \text{sequence mass} - \text{neutral precursor mass} \tag{2}$$

Since modifications are detected as mass shifts, P-Mod does not require the user to guess at masses or sequence locations of modifications. For PTMs mapping, an array of customized search criteria is generated for every sequence-to-spectrum comparison, taking into consideration the primary peptide sequence, the observed mass shift, the precursor m/z, and the instrumental limitations of ion trap mass spectrometers. The first element in each search array is a list of all of the expected b- or y-series fragment ions for the unmodified peptide sequence. Succeeding elements in the search array consist of these same fragment ions, tailored to reflect the mass shift localized at different amino acid residues in the sequence. Each element of more than 6 applied search criteria is given a raw score. The corresponding scoring formula is

$$\mathrm{score} = \frac{1}{n} \sum_{n} \frac{\ln(1 + I_n)}{b_{ci}\,(1 + 3d^2)} \tag{3}$$

where $n$ is the number of applied search criteria, $I_n$ is the intensity of the largest ion within 1.25 m/z of the expected location for the nth search criterion, $b_{ci}$ is the background intensity in the index compartment that contains the scored ion, and $d$ is the distance in m/z between the scored ion and its expected location. Although each element in each peptide search array is scored separately, only the highest scoring element is recorded as a potential match. Because only the highest scoring element from the search array is recorded, the raw scores assigned to individual sequence-to-spectrum comparisons are extreme values.
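The mass shift of Equation 2 and the idea of a search array can be sketched as follows. This is an illustration under several assumptions, not P-Mod's implementation: only singly charged b-ions are generated, the residue masses are round toy numbers, and all function names are hypothetical. Since Equation 2 subtracts the precursor mass, a precursor that is heavier than the sequence yields a negative shift, so the sketch adds the negated shift to the fragments.

```python
PROTON = 1.007276   # mass of a proton in Da
WATER = 18.010565   # mass of H2O in Da


def neutral_precursor_mass(precursor_mz, charge):
    """Neutral mass of a precursor observed at precursor_mz with the given charge."""
    return charge * (precursor_mz - PROTON)


def mass_shift(residue_masses, precursor_mz, charge):
    """Equation 2: sequence mass minus neutral precursor mass."""
    sequence_mass = sum(residue_masses) + WATER
    return sequence_mass - neutral_precursor_mass(precursor_mz, charge)


def b_series(residue_masses, site=None, delta=0.0):
    """Singly charged b-ions; delta is added from the modified residue onwards."""
    ions, running = [], PROTON
    for i, mass in enumerate(residue_masses[:-1]):
        running += mass
        if i == site:
            running += delta
        ions.append(running)
    return ions


def search_array(residue_masses, shift):
    """First element: unmodified series. Then one series per candidate site."""
    delta = -shift  # heavier precursor -> negative shift -> positive added mass
    variants = [b_series(residue_masses, site, delta)
                for site in range(len(residue_masses))]
    return [b_series(residue_masses)] + variants


# Toy peptide of three residues whose doubly charged precursor carries +80 Da.
masses = [100.0, 200.0, 300.0]
precursor_mz = (sum(masses) + WATER + 80.0) / 2 + PROTON
shift = mass_shift(masses, precursor_mz, charge=2)
array = search_array(masses, shift)
```

In this toy case the variant for the last residue equals the unmodified b-series, which illustrates why the search array also needs y-series ions to localize a C-terminal modification.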
P-Mod uses extreme value statistics to model the distribution of scores assigned to sequence matches, and derives a corresponding p-value for evaluating statistical significance. The formula is

$$Y = \frac{S - \mu}{\alpha} - \ln\frac{k}{100} \quad \text{and} \quad p = 1 - \exp(-\exp(-Y)) \tag{4}$$

where $Y$ is the extreme value reduced variate, $S$ is the raw score, $\mu$ is a conditional location parameter, $\alpha$ is a conditional scale parameter, $k$ is the number of comparisons, and $p$ is the estimated p-value.

As described before, SALSA can detect peptide modifications for a particular protein even at relatively low abundance. However, the fact that these modification features must be specified by the user prevents SALSA from automatically discovering
unanticipated PTMs from a huge number of spectra. P-Mod does not have this constraint; however, it makes the very strong assumption that we already know which proteins are in the mixture, or that the proteins have already been identified from the mixture. This philosophy is dubious, since PTMs mapping is mainly targeted at improving identification performance: without mapping the modifications first, database searching algorithms such as SEQUEST or X!Tandem cannot work well. P-Mod is a method for modifications mapping, yet its performance depends on the identification algorithm, whose performance in turn depends on modifications mapping itself. Therefore, P-Mod's assumption is not reasonable or natural enough for real-world applications. A more intuitive and natural method is needed.

3 Problem Formalization

Assume that each spectrum has been normalized so that the x-axis is $m$ rather than $m/z$. A spectrum can be represented by a position sequence $S = \langle (m_1, I_1), (m_2, I_2), \ldots, (m_n, I_n) \rangle$. Each element of $S$ is a dual item $(m_i, I_i)$, where $m_i$ represents the horizontal position of the ith peak and $I_i$ represents the corresponding peak intensity. As shown in Figure 3, the position sequence $S$ can be converted into a distance sequence $\Delta = \langle (\sigma_1, I_1), (\sigma_2, I_2), \ldots, (\sigma_n, I_n) \rangle$, where each $\sigma_i$ is defined as

$$\sigma_i = \begin{cases} m_i & \text{if } i = 1 \\ m_i - m_{i-1} & \text{otherwise} \end{cases} \tag{5}$$

Figure 3: MS/MS spectrum for the sequence $S = \langle (m_1, I_1), (m_2, I_2), \ldots, (m_6, I_6) \rangle$

For convenience of description, both the position sequence $S$ and the distance sequence $\Delta$ notations are used in this paper. Because of the existence of noise, we set up a noise offset vector $O = \langle \delta_1, \delta_2, \ldots, \delta_n \rangle$, which means the ith peak may appear in the range $[m_i - \delta_i, m_i + \delta_i]$.
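The conversion between the two representations is a running difference and its inverse, a running sum. The sketch below assumes the first distance equals the first position and each later distance is the gap to the preceding peak (an assumption about the intended indexing of Equation 5); intensities are omitted, and the function names are hypothetical.

```python
from itertools import accumulate


def to_distance_sequence(positions):
    """sigma_1 = m_1, and sigma_i = m_i - m_(i-1) for every later peak."""
    if not positions:
        return []
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return [positions[0]] + gaps


def to_position_sequence(distances):
    """Inverse conversion: positions are the running sums of the distances."""
    return list(accumulate(distances))


def within_offset(observed, expected, delta):
    """True if the observed peak lies in [expected - delta, expected + delta]."""
    return abs(observed - expected) <= delta


positions = [100, 157, 228]
distances = to_distance_sequence(positions)
```

The two conversions round-trip exactly on integer masses; on real-valued masses the round trip is only exact up to floating-point error.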
Without loss of generality, we simply write a sequence as $S = \langle m_1, m_2, \ldots, m_n \rangle$ or $\Delta = \langle \sigma_1, \sigma_2, \ldots, \sigma_n \rangle$, since $m_i$ and $\sigma_i$ are what our algorithm uses directly for frequent interval pattern mining, while $I_i$ is only needed when we score the mining results.
A position sequence $S_\alpha = \langle \alpha_1, \alpha_2, \ldots, \alpha_m \rangle$ is a sub-sequence of another position sequence $S_\beta = \langle \beta_1, \beta_2, \ldots, \beta_n \rangle$, denoted $S_\alpha \sqsubseteq S_\beta$ (if $m \neq n$, written $S_\alpha \sqsubset S_\beta$), if and only if $\exists k \in \mathbb{Z}$ such that the item set of $\alpha$, denoted $H_\alpha = \{\alpha_1 + k, \alpha_2 + k, \ldots, \alpha_m + k\}$, is a subset of the item set of $\beta$, denoted $H_\beta = \{\beta_1, \beta_2, \ldots, \beta_n\}$, i.e., $H_\alpha \subseteq H_\beta$. It is straightforward to see that this definition is a partial order. Likewise, a distance sequence $\Delta_\alpha$ is a sub-sequence of another distance sequence $\Delta_\beta$ if and only if their corresponding position sequences have this partial order relationship.

The output of an MS/MS experiment is a spectra dataset, in which each spectrum is associated with an id. For simplicity, say the id of the ith spectrum is $i$. We then transform the spectra dataset into a position sequence database $D_S = \{S_1, S_2, \ldots, S_n\}$, which is a set of position sequences, and a distance sequence database $D_\Delta = \{\Delta_1, \Delta_2, \ldots, \Delta_n\}$. $|S|$ represents the number of elements in sequence $S$, while $|D|$ represents the number of sequences in the database $D$.

The support of a sequence $S_\alpha$ in a sequence database $D$ is the number of sequences in $D$ that contain $S_\alpha$, denoted $N_\alpha$: $N_\alpha = |\{S \mid S \in D \text{ and } S_\alpha \sqsubseteq S\}|$. Given a minimum support threshold min_sup, the set of frequent interval patterns, $F_{interval}$, includes all the sequences whose support is no less than min_sup. The set of closed frequent interval patterns is defined as $C_{interval} = \{S_\alpha \mid S_\alpha \in F_{interval} \text{ and } \nexists S_\beta \in F_{interval} \text{ s.t. } S_\alpha \sqsubset S_\beta \text{ and } N_\alpha = N_\beta\}$. Since $C_{interval}$ includes no sequence that has a super-sequence with the same support, we have $C_{interval} \subseteq F_{interval}$. Without considering any kind of noise or peak loss, the problem of closed motif finding is to find $C_{interval}$ above a minimum support threshold in the distance database $D_\Delta$.
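These definitions translate almost directly into code. The following sketch checks the shifted-subset relation, counts support, and filters closed patterns by brute force. It assumes integer peak positions (as the shift k is an integer), takes a proper super-sequence to be a strictly longer one, and uses hypothetical function names; it is meant to make the definitions concrete, not to be efficient.

```python
def is_subsequence(alpha, beta):
    """True if some shift k maps every item of alpha onto an item of beta."""
    if not alpha:
        return True
    items = set(beta)
    # Only shifts that map alpha[0] onto an item of beta can possibly work.
    return any(all(a + (b - alpha[0]) in items for a in alpha) for b in beta)


def support(alpha, database):
    """Number of sequences in the database that contain alpha."""
    return sum(1 for s in database if is_subsequence(alpha, s))


def closed_patterns(frequent):
    """Drop every pattern that has a longer super-sequence with equal support.

    frequent: dict mapping pattern (tuple of positions) -> support.
    """
    return {
        p: n for p, n in frequent.items()
        if not any(len(q) > len(p) and m == n and is_subsequence(p, q)
                   for q, m in frequent.items())
    }


database = [(10, 67, 100, 138), (5, 62, 133), (0, 50, 128)]
n_alpha = support((0, 57, 128), database)
closed = closed_patterns({(0, 57): 2, (0, 57, 128): 2, (0, 71): 3})
```

Here `(0, 57)` is absorbed by `(0, 57, 128)` because both have support 2, while `(0, 71)` survives because its super-sequence has a different support.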
The single-site exact peptide PTMs mapping problem is to find a pair of closed frequent interval patterns, say $P_A = \langle \alpha_1, \alpha_2, \ldots, \alpha_k, \alpha_{k+1}, \ldots, \alpha_n \rangle$ and $P_B = \langle \beta_1, \beta_2, \ldots, \beta_{k'}, \beta_{k'+1}, \ldots, \beta_{n'} \rangle$ ($n' \le n$), with two integers $k$ ($k < n$) and $l$, satisfying

$$\begin{cases} P_{B2} \sqsubseteq P'_{A2} \text{ and } N_{P_B}/N_{P_A} > I_1, & \text{if } P_{B1} \sqsubseteq P_{A1}, \\ P_{B1} \sqsubseteq P'_{A1} \text{ and } N_{P_B}/N_{P_A} > I_1, & \text{if } P_{B2} \sqsubseteq P_{A2}, \end{cases} \tag{6}$$

where $P_{A1} = \langle \alpha_1, \alpha_2, \ldots, \alpha_k \rangle$, $P_{A2} = \langle \alpha_{k+1}, \ldots, \alpha_n \rangle$, $P_{B1} = \langle \beta_1, \beta_2, \ldots, \beta_{k'} \rangle$, $P_{B2} = \langle \beta_{k'+1}, \ldots, \beta_{n'} \rangle$, $P'_{A1} = \langle \alpha_1 + l, \ldots, \alpha_k + l \rangle$, $P'_{A2} = \langle \alpha_{k+1} + l, \ldots, \alpha_n + l \rangle$, and $0 < I_1 < 1$ is a given threshold. The tuning of $I_1$ should depend on domain-specific knowledge. In fact, the two cases are equivalent; we write both only to make the representation explicit.

Similarly, the double-site peptide PTMs mapping problem is to first find a one-site modified interval pattern, like $P_B$, and then use similar methods to find another closed frequent interval pattern, say $P_C = \langle \gamma_1, \gamma_2, \ldots, \gamma_{k''}, \gamma_{k''+1}, \ldots, \gamma_{n''} \rangle$ ($P_C \neq P_A$, $n'' \le n'$), and two integers $k''$ ($k'' < n''$) and $l'$. The definition is similar, and a parameter $0 < I_2 < 1$ denotes the domain-specific threshold. In this way, we can go on to define triple-site peptide PTMs mapping problems recursively.

Discussion: The above definitions assume that there is no noise. For real applications, two kinds of noise should be considered: peak loss and peak shift. Peak
loss means that for some spectra we may lose some peaks due to partial ionization of the peptide fragments. It is also possible that some peaks exist without corresponding peptide ions. Peak shift is mainly due to instrument resolution and isotopic distributions: the measured peak may shift by a small amount along the x-axis. When considering noise, we modify the above PTMs mapping definition into the approximate peptide PTMs mapping problem. To deal with the peak loss problem, we simply change every $C_{interval}$ in the above definition to $F_{interval}$. For the peak shift scenario, to simplify the discussion, we assume a symmetric white noise offset vector $O = \langle \delta_1, \delta_2, \ldots, \delta_n \rangle$ with

$$\delta_i \sim \mathcal{N}(0, \sigma^2) \tag{7}$$

When comparing two values $m_i^1$ and $m_i^2$, we say $m_i^1$ matches $m_i^2$ if and only if $m_i^1 \in [m_i^2 - 2\sigma, m_i^2 + 2\sigma]$. Since real applications generally contain noise, the approximate peptide PTMs mapping problem is our concern in the remainder of this paper.

4 Methodology

In this section, a frequent interval pattern mining algorithm is developed for the protein PTMs mapping problem. Traditional frequent pattern mining algorithms can be roughly divided into two approaches: candidate generate-and-test and pattern growth. In previous work, the pattern growth approach has outperformed its generate-and-test counterparts in many applications. Therefore, we design our algorithm based on the pattern growth approach, in the hope of better performance in real-world applications.

4.1 Pattern Growth Approach

Pattern growth is an efficient method for mining frequent patterns from large-scale databases. It was first introduced by Han et al. [5, 4]. It adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. Then, a pattern fragment growth method is used to avoid the costly candidate generation-and-test procedure altogether.
Moreover, an extended prefix-tree structure is constructed to compress crucial information about frequent patterns and avoid costly, repeated database I/O operations. A comprehensive performance study has shown that it is robust and especially suitable for most real-world applications. Many subsequent works following this approach have been published [17, 18, 9]. Our algorithm also follows this direction.

4.2 Initial Database and Conditional Database

The key concept of the pattern growth approach is the conditional database. By employing a divide-and-conquer strategy, the original database is divided into several partitions according to some prefix patterns. For a given prefix interval pattern $P_i$, all the frequent interval patterns that extend it can be mined from the $P_i$-specified partition $D_{P_i}$ without
accessing any other information. Each conditional database is further divided recursively following the same procedure.

Definition 1: Prefix interval pattern. A frequent interval pattern in $F_{interval}$ is also called a prefix interval pattern. We call it a prefix because it will grow in subsequent steps.

Definition 2: Conditional database. The conditional database $D_{P_i}$ of a prefix pattern $P_i$ is the database of all the transactions that contain/follow the prefix pattern.

Definition 3: Initial database. The initial database is the conditional database of the empty prefix pattern.

To facilitate the mining process, only the current gap and the position of the next peak are stored for each transaction. Instead of storing all the transactions in the conditional database, only references are stored: each transaction in the conditional database is stored as a triple <Gap, <SpectrumID, PeakID> >. Here, Gap is the current gap value, and the peak is stored as <SpectrumID, PeakID>. Since interval patterns may appear at different positions in different spectra, an n-peak spectrum $(p_1, p_2, p_3, \ldots, p_n)$ is populated into n transactions. For each peak $p_i$ in spectrum $j$, the corresponding transaction is represented as

$$\begin{cases} \langle \mathrm{Gap} = m_i - m_{i-1}, \langle \mathrm{SpectrumID} = j, \mathrm{PeakID} = i \rangle \rangle, & \text{if } i > 1 \\ \langle \mathrm{Gap} = 0, \langle \mathrm{SpectrumID} = j, \mathrm{PeakID} = i \rangle \rangle, & \text{otherwise} \end{cases} \tag{8}$$

During the mining process, these transactions are sorted by their current gaps, and all transactions with the same current gap are clustered into the same group. When constructing the initial database, according to the data format introduced above, all transactions are represented as <0, <SpectrumID, PeakID> >, where the 0 means that all the current gaps are 0, since we have aligned all the spectra according to their first peak.
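The transaction format can be sketched with a small named tuple. The snippet below builds the initial database (every peak with gap 0), performs one growth step to obtain gap-carrying transactions, and clusters them by gap. Peak ids are 0-based here, unlike the 1-based ids in the text, and the names `Transaction`, `initial_database`, `grow_once` and `group_by_gap` are hypothetical.

```python
from collections import defaultdict, namedtuple

# A transaction is a reference into the peak array, not a copy of the spectrum.
Transaction = namedtuple("Transaction", ["gap", "spectrum_id", "peak_id"])


def initial_database(spectra):
    """One transaction <0, <SpectrumID, PeakID>> per peak of every spectrum."""
    return [Transaction(0, j, i)
            for j, peaks in enumerate(spectra) for i in range(len(peaks))]


def grow_once(transactions, spectra):
    """Point each transaction at its next peak; the gap is the spacing between them."""
    grown = []
    for t in transactions:
        peaks = spectra[t.spectrum_id]
        if t.peak_id + 1 < len(peaks):
            gap = peaks[t.peak_id + 1] - peaks[t.peak_id]
            grown.append(Transaction(gap, t.spectrum_id, t.peak_id + 1))
    return grown


def group_by_gap(transactions):
    """Cluster transactions with the same current gap, smallest gap first."""
    groups = defaultdict(list)
    for t in transactions:
        groups[t.gap].append(t)
    return dict(sorted(groups.items()))


spectra = [[100, 157, 228], [50, 107]]
initial = initial_database(spectra)
groups = group_by_gap(grow_once(initial, spectra))
```

Storing only `(gap, spectrum_id, peak_id)` keeps each conditional database small, since the peak positions themselves stay in the shared `spectra` array.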
4.3 Frequent Interval Pattern Mining Algorithm

Exact Pattern Mining: Figure 4 gives the pseudocode for the exact frequent pattern mining problem, which is a recursive method. The main idea comes from the traditional pattern growth approach. However, there is a subtle difference between interval pattern mining and traditional pattern-growth frequent pattern mining algorithms. Assume we have an interval $[a, c]$ and $b \in [a, c]$. Following the prefix pattern mining approach, even if $[a, b]$ is not frequent, we cannot prune $[a, c]$ directly, since it is still possible that $[a, c]$ is a frequent interval pattern. To deal with this problem, in Lines 13-15 we construct the conditional database for the recursive call, and in Lines 19-21 we push forward the current transactions. The difference between these two cases lies in how the new gap is calculated: the former uses the next gap itself, and the latter adds the next
gap to the current gap. This is because when we push forward the current transactions, we are actually ignoring the current peaks. As a result, the new gap size is the sum of the sizes of the two old gaps. However, when we are preparing the conditional database for the recursive call, we count the current peak, so the new gap size is just the size of the next gap. This difference does not exist in previous sequential pattern mining problems, since gaps are ignored in sequential patterns.

ExactPatternMining(PatternPrefix P, ConditionalDatabase D)
BEGIN
 1. sort D based on the first gap of each transaction
 2. while not empty(D)
 3.     find the transactions T with the smallest gap in D
 4.     let G = t.Gap, t ∈ T
 5.     remove T from D
 6.     if (support(T) ≥ minsup)
 7.         construct the new pattern prefix P2 = <P, G>
 8.         output <P2, support(T)>
 9.         construct the conditional database D2
10.         for each transaction t = <G, <id, num>> in T
11.             if (Spectrum[id] has more than num peaks)
12.                 construct a new transaction t2
13.                 let t2.Gap = peak[id][num + 1] − peak[id][num]
14.                 let t2.num = num + 1
15.                 insert t2 into D2
16.         ExactPatternMining(P2, D2)
17.     for each transaction t = <G, <id, num>> in T
18.         if (Spectrum[id] has more than num peaks)
19.             construct a new transaction t2
20.             let t2.Gap = G + (peak[id][num + 1] − peak[id][num])
21.             let t2.num = num + 1
22.             insert t2 into D
END

Figure 4: The Exact Frequent Interval Mining Algorithm

The key to an efficient implementation of this algorithm is the choice of data structure for the conditional database. Two critical operations must be supported: removing the least element and inserting a new element. For this purpose, a heap is adopted, constructed so that the transaction with the smallest gap is on the top
of the heap. The time complexities of least-element removal and new-element insertion are both $O(\log n)$, where $n$ is the number of elements in the heap.

For the space complexity, we denote the total number of spectra by $N$ and the average number of peaks per spectrum by $P$. The size of the peak array is $O(NP)$. Instead of populating the initial database explicitly, only references are constructed, so the size of the initial database is $O(NP)$. The depth of the recursive calls cannot exceed the number of peaks in the spectra. Thus, the worst-case space complexity of this algorithm is $O(NP^2)$. In the average case, the size of the conditional database shrinks exponentially with the depth of the recursive calls, so the average-case space complexity is $O(NP)$. This is because, if we assume that $|D_{i+1}| \le \frac{1}{2} |D_i|$ for each recursive call, then

$$\sum_{i} |D_i| \le \left(1 + \frac{1}{2} + \cdots + \frac{1}{2^{P-1}}\right) |D| \le 2\,|D| \tag{9}$$

Approximate Pattern Mining: The above algorithm solves the problem of finding exact frequent interval patterns. However, as discussed before, peak shifts may exist due to instrument constraints. A method capable of dealing with this uncertainty is more meaningful for biologists. Here, we show how the exact pattern mining algorithm can be extended to an approximate version. A straightforward extension is discretization: all peak positions are discretized with respect to a compartment size. After this preprocessing, we can reuse the exact interval pattern mining algorithm, since the support of a given pattern now includes all the transactions that approximately match it. If the original peak position is $m_{original}$ and the given compartment size is $Z$, the new peak position $m_{new}$ is calculated as

$$m_{new} = \frac{1}{2} \left( \left\lfloor \frac{m_{original}}{Z} \right\rfloor Z + \left\lceil \frac{m_{original}}{Z} \right\rceil Z \right) \tag{10}$$

This discretization approach is easy to implement; however, it has some potential problems.
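The discretize-then-mine idea can be sketched end to end in Python with the standard `heapq` module. Several choices here are assumptions of this sketch rather than details given in the text: Equation 10 is read as mapping each peak to the centre of its compartment, support is counted as the number of distinct spectra in a gap group, peaks that fall into the same compartment are merged, and the recursion is seeded directly with the first gaps instead of the trivial gap-0 transactions. All names are hypothetical.

```python
import heapq
import math


def discretize(mass, z):
    """Map a peak position to the centre of its size-z compartment."""
    return 0.5 * (math.floor(mass / z) * z + math.ceil(mass / z) * z)


def mine_exact(spectra, minsup):
    """Return {gap pattern: support} for all frequent interval patterns.

    spectra: list of sorted peak-position lists. A transaction is a tuple
    (gap, spectrum_id, peak_index): peak_index is the candidate next peak and
    gap is its distance from the last peak already matched by the prefix.
    """
    found = {}

    def grow(prefix, db):
        heapq.heapify(db)  # transaction with the smallest gap on top
        while db:
            gap = db[0][0]
            group = []
            while db and db[0][0] == gap:  # all transactions sharing that gap
                group.append(heapq.heappop(db))
            support = len({sid for _, sid, _ in group})  # distinct spectra
            if support >= minsup:
                pattern = prefix + (gap,)
                found[pattern] = support
                # Conditional database: the current peak is counted, so the
                # new gap is just the spacing to the next peak.
                cond = [(spectra[sid][pid + 1] - spectra[sid][pid], sid, pid + 1)
                        for _, sid, pid in group if pid + 1 < len(spectra[sid])]
                grow(pattern, cond)
            # Push forward: the current peak is skipped, so the gaps add up.
            # This happens even for infrequent gaps, which cannot be pruned.
            for g, sid, pid in group:
                if pid + 1 < len(spectra[sid]):
                    step = spectra[sid][pid + 1] - spectra[sid][pid]
                    heapq.heappush(db, (g + step, sid, pid + 1))

    seed = [(s[i + 1] - s[i], sid, i + 1)
            for sid, s in enumerate(spectra) for i in range(len(s) - 1)]
    grow((), seed)
    return found


# Two noisy toy spectra that share the gaps 57 and 71 after binning with Z = 1.
noisy = [[0.2, 57.22, 128.3], [10.4, 67.41, 138.45]]
binned = [sorted({discretize(m, 1.0) for m in s}) for s in noisy]
patterns = mine_exact(binned, minsup=2)
```

A binary heap gives logarithmic pop and push, which is all the exact miner needs; the skipped-peak pattern `(128.0,)` in the result comes from the push-forward step, not from the recursive call.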
Consider an n-peak spectrum $(p_1, p_2, p_3, \ldots, p_n)$ in which the position $m_i$ of the ith peak is viewed as a random variable with Gaussian noise $\mathcal{N}(0, \sigma^2)$. If the first peak is fixed, the Gaussian noise leads to peak shifts of up to $2\sigma$ for all peaks with probability about 0.95 each. Given $(m_1, m_2, m_3, \ldots, m_n)$ for the corresponding peaks, if we use an error bound of $[-2\sigma, 2\sigma]$ to match the sequence, then under the independence assumption the probability that they match should be $0.95^n$. However, if we follow the discretization approach and simply set the compartment size to $4\sigma$, we do not obtain that probability. This is because the true value of peak $p_i$ does not always lie in the middle of the compartment. If it coincides with a discretization point, the probability of matching is only 0.50; if there are k such peaks, their joint matching probability is only $0.50^k$. In the other cases, the probability of correct matching is a number $\pi_i$ with $0.50 \le \pi_i \le 0.95$. For the whole spectrum, the probability of correct matching is therefore only $\pi_1 \pi_2 \cdots \pi_n < 0.95^n$.
The above analysis motivates us to design a more effective approximate mining algorithm. Fortunately, it is not hard to modify the previous exact pattern mining algorithm for this purpose. The pseudocode is shown in Figure 5.

ApproximatePatternMining(PatternPrefix P, ConditionalDatabase D)
BEGIN
 1. sort D based on the first gap of each transaction
 2. let G = 0
 3. while true
 4.     while not empty(D) && |D[G − a, G + a]| < minsup
 5.         increase G until the first peak is outside of the range
 6.         push-forward transactions in D with gap < G − a
 7.     if empty(D)
 8.         break
 9.     let T = D[G − a, G + a]
10.     construct the new pattern prefix P2 = <P, G>
11.     output <P2, support(T)>
12.     construct the conditional database D2
13.     for each transaction t = <g, <id, num>> in T
14.         if (Spectrum[id] has more than num peaks)
15.             construct a new transaction t2
16.             let t2.g = peak[id][num + 1] − peak[id][num]
17.             let t2.num = num + 1
18.             insert t2 into D2
19.     ApproximatePatternMining(P2, D2)
20.     let G = G + b
21.     push-forward transactions in D with gap < G − a
END

Figure 5: The Approximate Frequent Mining Algorithm

For this method, two parameters a and b are introduced. The parameter a is the maximum allowed error bound for a single peak, that is, the maximum allowed difference between corresponding peaks when we align the first peaks of two patterns (or of one pattern and one transaction). It can be viewed as a measure of how much uncertainty we want to tolerate. The parameter b is the smallest increment of the current gap between two adjacent recursive calls; it can be viewed as a measure of how much difference we enforce so that the algorithm moves smoothly. The introduction of b is necessary because when we have more than minsup transactions within the range $[g - a, g + a]$ (g = current peak position + a), we might also have more than minsup transactions within the range $[g + \epsilon - a, g + \epsilon + a]$ ($\epsilon$ is a tiny value). These recursive calls would be almost duplicates.
In the above pseudocode, there is an operation called push-forward. This is simply shorthand for the corresponding lines of the Exact Frequent Interval Mining algorithm given earlier. Heaps are no longer suitable for constructing conditional databases in this approximate mining algorithm: besides minimum-element removal and insertion, we also need to calculate efficiently the number of transactions within the range [G − a, G + a], which heaps do not support. Therefore, a balanced search tree is adopted here. We keep two pointers, one to the minimum element and one to the minimum element that is greater than G + a. In this way, it is easy to maintain the number of transactions in the range [G − a, G + a].

The worst-case space complexity of the approximate interval pattern mining algorithm is the same as that of the exact interval pattern mining algorithm, since the size of the conditional database and the maximal number of recursive calls do not change. In the average case, however, the space usage increases, because with approximate matching the conditional database for a recursive call is usually larger than its exact-matching counterpart. As a result, the constant factor in the average space usage increases, while the computational complexity does not change. The following theorem guarantees that all the frequent interval patterns can be mined by this method.

Theorem 1: The exact and approximate frequent interval pattern mining algorithms given above are guaranteed to output all the frequent interval patterns.

Proof: Completeness follows from the structure of the two pattern mining algorithms, each of which has two branches. The first is pattern mining in the prefix-tree-based conditional database, whose completeness is guaranteed by the PrefixSpan method [14]. The second, the peak-removal branch, in fact adds more transactions to the original database and does not prune anything at all. This completes the proof.
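The range-counting requirement discussed above can be sketched as follows. This is only an illustration: a sorted Python list maintained with the standard `bisect` module stands in for the balanced search tree (insertion into a list takes linear time, whereas a real balanced tree would give logarithmic updates), and the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort

class GapIndex:
    """Ordered collection of gaps supporting insertion, minimum removal,
    and counting of the gaps that fall inside a closed range."""

    def __init__(self):
        self._gaps = []

    def insert(self, gap):
        # Keep the list sorted on every insertion.
        insort(self._gaps, gap)

    def pop_min(self):
        # Remove and return the smallest gap (the "push-forward" direction).
        return self._gaps.pop(0)

    def count_in_range(self, low, high):
        # Number of stored gaps g with low <= g <= high.
        return bisect_right(self._gaps, high) - bisect_left(self._gaps, low)

index = GapIndex()
for gap in (5, 1, 3, 9):
    index.insert(gap)
```

A binary heap offers the first two operations but not the third, which is the reason given in the text for moving to an ordered structure.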
4.4 Modifications Mapping

Modifications mapping can be conducted on the basis of the frequent interval patterns. According to the definition of PTMs, we are trying to find the modifications between two frequent patterns. In the following, one exact approach and one approximate approach are presented. Then a scoring function is defined, which can be used to rank the mining results.

Exact Modifications Mapping: The key problem in detecting a modification is to determine where the modification begins. For example, suppose we have two length-n patterns $P_1 = \langle \psi_1, \psi_2, \ldots, \psi_n \rangle$ and $P_2 = \langle \psi'_1, \psi'_2, \ldots, \psi'_n \rangle$, where $\psi_i$ and $\psi'_i$ represent the $i$-th gap in the respective frequent interval patterns. If there exists a number $k$ ($k \le n$) such that $\psi_i = \psi'_i$ for $1 \le i < k$ and for $k < i \le n$, while the two patterns differ at gap $k$, then we say there is a modification at position $k$ for the corresponding frequent patterns.
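For a single pair of patterns, the definition above amounts to checking that exactly one gap differs. A minimal Python sketch, assuming patterns are given as equal-length sequences of exactly comparable gap values (the function name and the example numbers are hypothetical):

```python
from typing import Optional, Sequence

def modification_position(p1: Sequence[int], p2: Sequence[int]) -> Optional[int]:
    """Return the 1-based position k if p1 and p2 differ in exactly one gap,
    otherwise None."""
    if len(p1) != len(p2):
        return None
    differing = [i for i, (x, y) in enumerate(zip(p1, p2), start=1) if x != y]
    return differing[0] if len(differing) == 1 else None

unmodified = [57, 71, 87, 99]
modified = [57, 71, 167, 99]   # third gap shifted by 80 in this made-up example
position = modification_position(unmodified, modified)
```

Comparing every pair of patterns in this way is quadratic in the number of patterns, which is why the text turns to sorting next.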
The way to detect such modifications is to enumerate all possible values of k. Each time, we only need to compare patterns of equal length, since we use frequent interval patterns $F_{interval}$ instead of closed frequent interval patterns $C_{interval}$. Given a value k for length-n patterns (k < n, with the patterns represented as $(\psi_1, \psi_2, \ldots, \psi_n)$), we sort all the length-n patterns with the multi-key $(\psi_1, \ldots, \psi_{k-1}, \psi_{k+1}, \ldots, \psi_n)$. Two patterns are modifications of each other if and only if they have the same key. This requires multiple-key sorting. The naive method treats one multiple-key sort as n single-key sorts. There are altogether n − 1 possible k values (from 1 to n − 1), so the number of single-key sorts for this naive approach is n(n − 1). Assuming there are m length-n patterns, the time complexity is $O(n^2 m \log m)$.

A more sophisticated extension of the naive approach reuses the previous sorting results to reduce the time complexity by a factor of n. The key component of this approach is stable sorting, which differs from ordinary sorting in that it guarantees that the relative order of two data items with the same key does not change during sorting. With stable sorting, the number of single-key sorts can be reduced to 2n − 2. The method is shown in Figure 6.

ModificationMapping(length-n PatternArray P_set)
BEGIN
 1. for i = n − 1 downto 1
 2.   stable sort P_set on single dimension i
 3. output detected modifications
 4. for i = n downto 2
 5.   stable sort P_set on dimension i
 6.   extract modifications at dimension i to end from sorting results
 7.   output detected modifications
END

Figure 6: Modifications Mapping method

An illustrative run of this algorithm is shown in Table 2. In this example, we assume that we focus on length-5 patterns.
The success of this algorithm depends on a critical property of stable sorting: if we have two keys $K_1$ and $K_2$, and the data are already ordered by key $K_1$, then after a stable sort on key $K_2$ the data are ordered by the compound key $\langle K_2, K_1 \rangle$. With this property, we see that after the first loop of the algorithm the keys $\psi_4, \psi_3, \psi_2, \psi_1$ have been sorted together, so when extracting the modification patterns, all the modifications in the fifth row of Table 2 can be detected. Then, after the first run of the second loop, the keys $\psi_3, \psi_2, \psi_1, \psi_5$ have been sorted, so all the modifications in the fourth row can be extracted. Each step of the second loop reuses the previous results. In this way the whole key space is traversed, and the correctness of the algorithm is guaranteed. Also, the number of single-key sorts is reduced to only 2n − 2. The time complexity is only $O(n\, m \log m)$, which is more efficient than the naive approach.

Table 2: An illustrative run of the modification mapping algorithm

  Sort key                                 Step      Single-key sorts performed so far
  $\psi_2\ \psi_3\ \psi_4\ \psi_5$         step 5    4, 3, 2, 1, 5, 4, 3, 2
  $\psi_1\ \psi_3\ \psi_4\ \psi_5$         step 4    4, 3, 2, 1, 5, 4, 3
  $\psi_1\ \psi_2\ \psi_4\ \psi_5$         step 3    4, 3, 2, 1, 5, 4
  $\psi_1\ \psi_2\ \psi_3\ \psi_5$         step 2    4, 3, 2, 1, 5
  $\psi_1\ \psi_2\ \psi_3\ \psi_4$         step 1    4, 3, 2, 1

Approximate Modifications Mapping: The same problem arises again when we switch from exact to approximate modifications mapping. Approximate mapping is more difficult because modifications cannot be detected accurately by simple sorting. There is, however, a useful property that makes efficient approximate modifications mapping possible: if the prefixes of two patterns are the same, then the first gap in which these patterns differ should differ by at least delta. This condition tells us that in the approximate modifications mapping algorithm we need to conduct approximate matching only after the modification position. We can still use stable sorting for the dimensions at or before the modification position, but not for the part of the patterns after it. Instead, for each block of patterns that agree on the prefix, an R-tree is adopted to store the suffixes. In this way, approximate modifications can be mapped efficiently.

Discussions: The above discussion focuses mainly on the single-site PTMs mapping problem; however, it is straightforward to see that the algorithm can also detect multiple-site PTMs. Assume that $P_1$ is the pattern for the original peptide sequence, $P_2$ is the single-site modified version, and $P_3$ is the double-site modified version based on $P_2$. By the previous definition, $P_1$, $P_2$ and $P_3$ are all frequent interval patterns. By Theorem 1, they can all be mined by our algorithms, and in the modifications mapping step all the modifications can be detected.
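The stable-sorting scheme for exact mapping can be sketched in Python, whose built-in `list.sort` is stable. This is an illustration under stated assumptions rather than the author's implementation: patterns are equal-length tuples of exactly comparable gaps with length at least 2, duplicates are removed first, and the function name and the use of `itertools.groupby` to read off blocks of equal keys are choices made here.

```python
from itertools import groupby

def map_modifications(patterns):
    """Return sorted (k, p, q) triples where p and q differ only in gap k (1-based)."""
    rows = list({tuple(p) for p in patterns})   # distinct patterns only
    n = len(rows[0])
    found = set()

    def extract(skip):
        # Patterns that agree on every gap except index `skip` are contiguous here,
        # because the n - 1 most recent stable sorts cover exactly those gaps.
        def reduced_key(p):
            return p[:skip] + p[skip + 1:]
        for _, block in groupby(rows, key=reduced_key):
            block = sorted(block)
            for i in range(len(block)):
                for j in range(i + 1, len(block)):
                    found.add((skip + 1, block[i], block[j]))

    # First loop: single-key stable sorts on gaps n-1, ..., 1 (1-based).
    for dim in range(n - 2, -1, -1):
        rows.sort(key=lambda p, d=dim: p[d])
    extract(n - 1)                              # modifications at the last gap

    # Second loop: one more stable sort per step, reusing the previous order.
    for dim in range(n - 1, 0, -1):
        rows.sort(key=lambda p, d=dim: p[d])
        extract(dim - 1)
    return sorted(found)

demo = map_modifications([(1, 2, 3), (1, 9, 3), (1, 2, 4), (5, 2, 3)])
```

In total the sketch performs (n − 1) + (n − 1) = 2n − 2 single-key sorts, matching the count given in the text; the pairwise enumeration inside each block is only there to list the results explicitly.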
Also, in order to improve efficiency, we can enforce a constraint on the modification positions $k_1$ and $k_2$, for example $k_1 \ge 2$ and $n - k_2 \ge 2$, so that the search space is pruned further.

4.5 Scoring Function

For the protein PTMs mapping problem, many modifications may be detected, so a good scoring function to rank the results becomes a necessity. Here we give a simple and intuitively meaningful scoring function. Assume that one modification is detected from two frequent interval patterns $P_1$ and $P_2$, that the corresponding supports of these two patterns are $N_{P_1}$ and $N_{P_2}$, and that $I_{ki}$ represents the peak intensity vector for the $k$-th instance of pattern $P_i$. The final score of this modification is defined as

$$\text{score} = \frac{\left(N_{P_1} \sum_k I_{k1}\right)\left(N_{P_2} \sum_l I_{l2}\right)}{N_{P_1} \sum_k I_{k1} + N_{P_2} \sum_l I_{l2}} \qquad (11)$$

The interpretation of this score is intuitive: up to a constant factor, it is the harmonic mean of the weighted intensities of the two component patterns.

5 Experimental Results

In this section, several experiments are performed to show the efficiency and effectiveness of our algorithm. By conducting experiments on a real-world dataset, we also analyze the biological relevance of our algorithm.

5.1 Data set

The data set included 42,392 different spectra, collected during 46 runs of Liquid Chromatography Tandem Mass Spectrometry (LC-MS/MS). The whole spectrum collection from the first run was used in the analysis. There was no merging of spectra, and hence each spectrum corresponds to one peptide. All protein samples originated from 1D PAGE or 2D gel electrophoresis separations. Bands or spots were extracted from the gel and subjected to in-gel digestion by trypsin. Cysteines were reduced and alkylated by iodoacetamide. Two randomly selected spectra are shown in Figure 7.

There are altogether 1,302 spectra in this data set. The x-axis was rescaled from mass-to-charge ratio m/z to mass m. Since there are many noisy peaks, only the 300 most abundant peaks were selected and the other peaks were removed. The spectra were then discretized with a window of width Z = 0.3; among different peaks in the same compartment, only the most intense one is kept. After these preprocessing steps, the average number of peaks per spectrum is P = 243. We also rescale the peak intensities to 100.
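The peak-selection, discretization and intensity-rescaling steps just described can be sketched as follows. This is a minimal illustration, assuming that a spectrum is a list of (mass, intensity) pairs already converted to the mass axis, that compartments are the intervals [j·Z, (j+1)·Z), and that "rescaling to 100" means the most intense retained peak is set to 100; the function and parameter names are hypothetical.

```python
import math

def preprocess(peaks, top_n=300, width=0.3, scale=100.0):
    """Keep the top_n most intense peaks, keep one peak per compartment,
    and rescale intensities so that the maximum equals `scale`."""
    # 1. Keep only the most intense peaks.
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]

    # 2. Within each compartment of the given width, keep the most intense peak.
    best = {}
    for mass, intensity in strongest:
        compartment = math.floor(mass / width)
        if compartment not in best or intensity > best[compartment][1]:
            best[compartment] = (mass, intensity)
    kept = sorted(best.values())
    if not kept:
        return []

    # 3. Rescale intensities relative to the strongest retained peak.
    top = max(intensity for _, intensity in kept)
    return [(mass, scale * intensity / top) for mass, intensity in kept]

toy_spectrum = [(100.05, 10.0), (100.10, 40.0), (100.40, 5.0), (250.0, 80.0)]
cleaned = preprocess(toy_spectrum, top_n=3)
```

In the toy spectrum, the weakest peak is dropped by the top-n filter, the two peaks near mass 100 share a compartment so only the stronger one survives, and the remaining intensities are expressed relative to the strongest peak.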
Since we are only interested in the post-translational modifications mapping problem, we did not conduct protein/peptide identification via SEQUEST or X!Tandem. All the original spectra are stored in .pkl format files, in which one column holds the m/z values and the other column holds the intensity levels. For each spectrum, the precursor m/z and charge value are also recorded. In the following experiments, we first evaluate the algorithm performance (mainly the running time) on real-world data; then the biological relevance of the results is considered and discussed.
Figure 7: Two randomly selected spectra from the dataset (intensity versus m/z), precursor charge = …

5.2 Algorithm Performance and Scalability

We now consider the scalability of the proposed frequent pattern mining algorithms. The space complexity has already been discussed; here we mainly focus on the running time on real data. For a specific setting such as minimal support = 1, the worst case of all such combinatorial algorithms has exponential complexity; however, their average performance differs considerably depending on the underlying heuristics. Since the running time of the approximate pattern mining method differs from that of the exact pattern mining method only by a constant factor, we test only the exact pattern mining approach. Figure 8 shows the algorithm profiles. We fixed min_sup at 20, the minimal length of each pattern at 8, and the window width for discretization at 0.3. The first plot shows the time used with respect to the number of spectra; each point on the graph is the average over 20 repeated runs. With the number of spectra fixed at 10 and all other parameters unchanged, the second plot shows the time used with respect to the minimal support min_sup; each point again averages the results over 20 runs.

Figure 8: Algorithm performance. The first plot shows time (seconds) vs. number of spectra; the second plot shows time (seconds) vs. min_sup.

All these experiments were conducted on a Linux machine with a 1.7 GHz Pentium CPU. As we can see from the figures, the running time as a function of the number of spectra x on the real dataset grows roughly as $x^{1.6}$, which is polynomial and below $x^2$; it is therefore a reasonable complexity for frequent pattern mining algorithms. From the second graph, we can see that the running time is very sensitive to decreases in the minimum support: if the minimum support is less than 4, the time would be more than 20 hours even for 20 spectra. Fortunately, we normally set min_sup ≥ 10; in this range, the running time is still approximately polynomial. Experiments with other parameter settings have consistently shown results similar to those in the above figures and are omitted.

5.3 Biological Relevance

In previous sections, we focused on the frequent interval pattern mining problem and tested the scalability of the algorithms on the real dataset. In this section, we discuss the biological relevance of the modifications mapping algorithms. In our experiments, we set the modifications support ratio = 0.8, and the modified and non-modified parts must each have a length of at least 3. Our strategy is to conduct the modifications mapping first; all the detected modifications are then clustered into histograms according to their scores, so that we
can obtain the distribution of different modifications. One thing to note is that in a typical mass spectrometry experiment there is a large portion of noise: even though we select only the 300 most intense peaks, many of them are caused by random noise. If the peak noise is uniform or Gaussian, then, viewing the interval gaps as random variables generated by the peaks, the noise distribution is biased toward smaller intervals. Therefore, we need to build a suitable noise model and the corresponding noise distributions for baseline correction. In this paper, two noise models are considered: uniform noise and Gaussian noise. Their distributions are shown in Figure 9.

Figure 9: The noise models: Uniform vs. Gaussian (panels: distribution of the uniform noise; distribution of the Gaussian noise)

From the noise models, we can see that the interval distribution generated by uniform noise has a heavier tail than that generated by Gaussian noise. When we analyze the modification distributions, we need to consider the effects of random noise and conduct a baseline correction. The modification distribution is shown in Figure 10, from which we can conclude that the underlying noise has a Gaussian distribution. After the baseline correction, we can see that modifications with mass shifts such as 109, 98 and 42 are very frequent. This distribution automatically rediscovered qualitative phenomena previously known to experienced mass spectrometrists. The intense mode at 109 represents the post-translational modification Pyrrolysine; the intense mode at 98 is mainly caused by the modification Phosphorylation and water, while the intense mode at 210 is mainly caused by Myristoylation. The intense mode at 17 represents Pyrrolidone carboxylic acid. All these modifications are very common ones, so it is not surprising that they are more frequent than the others. Also, some detected frequent gaps, such as 113 and 114, are in fact the masses of the amino acids Isoleucine and Asparagine. These are not really modifications, but since they occur frequently they are detected as well. The algorithm also detected previously undescribed patterns. For example, after ranking the detected modifications according to the given scores, the gap of 226 ranks very high, and this result is consistent under different parameter settings. Since no frequent double-site modification could produce such a value, we suspect that this may be a new modification in the given samples. More details need to be confirmed by analytical biological experiments.

Figure 10: The distribution of the detected modifications

6 Conclusions and Discussions

We make two main contributions in this paper. First, we motivated and formalized a novel class of data mining problems that arise in protein post-translational modifications mapping, specifically for the analysis of tandem mass spectrometry data, but with natural applications to market-basket databases. Second, we developed the frequent interval pattern mining and modifications mapping algorithms. Based on an extension of PrefixSpan and stable sorting, these methods are natural and efficient on the real-world mass
Aside: Golden Ratio Golden Ratio: A universal law. Golden ratio φ = lim n a n+b n a n = 1+ 5 2 a n+1 = a n + b n, b n = a n 1 Ruta (UIUC) CS473 1 Spring 2018 1 / 41 CS 473: Algorithms, Spring 2018 Dynamic
More informationDe Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry
17 th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved. 1 De Novo Peptide Identification Via Mixed-Integer Linear
More informationBIOLIGHT STUDIO IN ROUTINE UV/VIS SPECTROSCOPY
BIOLIGHT STUDIO IN ROUTINE UV/VIS SPECTROSCOPY UV/Vis Spectroscopy is a technique that is widely used to characterize, identify and quantify chemical compounds in all fields of analytical chemistry. The
More informationNPTEL VIDEO COURSE PROTEOMICS PROF. SANJEEVA SRIVASTAVA
LECTURE-25 Quantitative proteomics: itraq and TMT TRANSCRIPT Welcome to the proteomics course. Today we will talk about quantitative proteomics and discuss about itraq and TMT techniques. The quantitative
More informationMetabolite Identification and Characterization by Mining Mass Spectrometry Data with SAS and Python
PharmaSUG 2018 - Paper AD34 Metabolite Identification and Characterization by Mining Mass Spectrometry Data with SAS and Python Kristen Cardinal, Colorado Springs, Colorado, United States Hao Sun, Sun
More informationHidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationTowards Detecting Protein Complexes from Protein Interaction Data
Towards Detecting Protein Complexes from Protein Interaction Data Pengjun Pei 1 and Aidong Zhang 1 Department of Computer Science and Engineering State University of New York at Buffalo Buffalo NY 14260,
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 6
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights
More informationIntroduction to spectral alignment
SI Appendix C. Introduction to spectral alignment Due to the complexity of the anti-symmetric spectral alignment algorithm described in Appendix A, this appendix provides an extended introduction to the
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationTopic Contents. Factoring Methods. Unit 3: Factoring Methods. Finding the square root of a number
Topic Contents Factoring Methods Unit 3 The smallest divisor of an integer The GCD of two numbers Generating prime numbers Computing prime factors of an integer Generating pseudo random numbers Raising
More informationMass Spectrometry. Hyphenated Techniques GC-MS LC-MS and MS-MS
Mass Spectrometry Hyphenated Techniques GC-MS LC-MS and MS-MS Reasons for Using Chromatography with MS Mixture analysis by MS alone is difficult Fragmentation from ionization (EI or CI) Fragments from
More informationOverview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database
Overview - MS Proteomics in One Slide Obtain protein Digest into peptides Acquire spectra in mass spectrometer MS masses of peptides MS/MS fragments of a peptide Results! Match to sequence database 2 But
More informationvia Tandem Mass Spectrometry and Propositional Satisfiability De Novo Peptide Sequencing Renato Bruni University of Perugia
De Novo Peptide Sequencing via Tandem Mass Spectrometry and Propositional Satisfiability Renato Bruni bruni@diei.unipg.it or bruni@dis.uniroma1.it University of Perugia I FIMA International Conference
More informationHigh-Field Orbitrap Creating new possibilities
Thermo Scientific Orbitrap Elite Hybrid Mass Spectrometer High-Field Orbitrap Creating new possibilities Ultrahigh resolution Faster scanning Higher sensitivity Complementary fragmentation The highest
More informationProperties of Average Score Distributions of SEQUEST
Research Properties of Average Score Distributions of SEQUEST THE PROBABILITY RATIO METHOD* S Salvador Martínez-Bartolomé, Pedro Navarro, Fernando Martín-Maroto, Daniel López-Ferrer **, Antonio Ramos-Fernández,
More informationMeasures of hydroxymethylation
Measures of hydroxymethylation Alla Slynko Axel Benner July 22, 2018 arxiv:1708.04819v2 [q-bio.qm] 17 Aug 2017 Abstract Hydroxymethylcytosine (5hmC) methylation is well-known epigenetic mark impacting
More informationQuasiNovo: Algorithms for De Novo Peptide Sequencing
University of South Carolina Scholar Commons Theses and Dissertations 2013 QuasiNovo: Algorithms for De Novo Peptide Sequencing James Paul Cleveland University of South Carolina Follow this and additional
More informationFinnigan LCQ Advantage MAX
www.ietltd.com Proudly serving laboratories worldwide since 1979 CALL +847.913.0777 for Refurbished & Certified Lab Equipment Finnigan LCQ Advantage MAX The Finnigan LCQ Advantage MAX ion trap mass spectrometer
More informationSEAMLESS INTEGRATION OF MASS DETECTION INTO THE UV CHROMATOGRAPHIC WORKFLOW
SEAMLESS INTEGRATION OF MASS DETECTION INTO THE UV CHROMATOGRAPHIC WORKFLOW Paula Hong, John Van Antwerp, and Patricia McConville Waters Corporation, Milford, MA, USA Historically UV detection has been
More informationfor the Novice Mass Spectrometry (^>, John Greaves and John Roboz yc**' CRC Press J Taylor & Francis Group Boca Raton London New York
Mass Spectrometry for the Novice John Greaves and John Roboz (^>, yc**' CRC Press J Taylor & Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor & Francis Croup, an informa business
More informationReductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York
Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association
More informationprofileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research
profileanalysis Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research Innovation with Integrity Omics Research Biomarker Discovery Made Easy by ProfileAnalysis
More informationMACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance
MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance Jingbo Shang, Jian Peng, Jiawei Han University of Illinois, Urbana-Champaign May 6, 2016 Presented by Jingbo Shang 2 Outline
More informationDe novo peptide sequencing methods for tandem mass. spectra
De novo peptide sequencing methods for tandem mass spectra A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Doctor of Philosophy
More informationOn Two Class-Constrained Versions of the Multiple Knapsack Problem
On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic
More informationMass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University
Mass Spectrometry and Proteomics - Lecture 5 - Matthias Trost Newcastle University matthias.trost@ncl.ac.uk Previously Proteomics Sample prep 144 Lecture 5 Quantitation techniques Search Algorithms Proteomics
More informationImproved Validation of Peptide MS/MS Assignments. Using Spectral Intensity Prediction
MCP Papers in Press. Published on October 2, 2006 as Manuscript M600320-MCP200 Improved Validation of Peptide MS/MS Assignments Using Spectral Intensity Prediction Shaojun Sun 1, Karen Meyer-Arendt 2,
More informationSequence analysis and comparison
The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species
More informationIntroduction to Algorithms
Lecture 1 Introduction to Algorithms 1.1 Overview The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of thinking it involves: why we focus on the subjects that
More informationTandem mass spectra were extracted from the Xcalibur data system format. (.RAW) and charge state assignment was performed using in house software
Supplementary Methods Software Interpretation of Tandem mass spectra Tandem mass spectra were extracted from the Xcalibur data system format (.RAW) and charge state assignment was performed using in house
More informationFigure S1. Interaction of PcTS with αsyn. (a) 1 H- 15 N HSQC NMR spectra of 100 µm αsyn in the absence (0:1, black) and increasing equivalent
Figure S1. Interaction of PcTS with αsyn. (a) 1 H- 15 N HSQC NMR spectra of 100 µm αsyn in the absence (0:1, black) and increasing equivalent concentrations of PcTS (100 µm, blue; 500 µm, green; 1.5 mm,
More informationQuiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)
Introduction to Algorithms October 13, 2010 Massachusetts Institute of Technology 6.006 Fall 2010 Professors Konstantinos Daskalakis and Patrick Jaillet Quiz 1 Solutions Quiz 1 Solutions Problem 1. We
More informationDe Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics
De Novo Peptide Sequencing: Informatics and Pattern Recognition applied to Proteomics John R. Rose Computer Science and Engineering University of South Carolina 1 Overview Background Information Theoretic
More informationCS 584 Data Mining. Association Rule Mining 2
CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M
More informationSpectrum-to-Spectrum Searching Using a. Proteome-wide Spectral Library
MCP Papers in Press. Published on April 30, 2011 as Manuscript M111.007666 Spectrum-to-Spectrum Searching Using a Proteome-wide Spectral Library Chia-Yu Yen, Stephane Houel, Natalie G. Ahn, and William
More information1 Divide and Conquer (September 3)
The control of a large force is the same principle as the control of a few men: it is merely a question of dividing up their numbers. Sun Zi, The Art of War (c. 400 C.E.), translated by Lionel Giles (1910)
More informationDesigned for Accuracy. Innovation with Integrity. High resolution quantitative proteomics LC-MS
Designed for Accuracy High resolution quantitative proteomics Innovation with Integrity LC-MS Setting New Standards in Accuracy The development of mass spectrometry based proteomics approaches has dramatically
More informationHandling a Concept Hierarchy
Food Electronics Handling a Concept Hierarchy Bread Milk Computers Home Wheat White Skim 2% Desktop Laptop Accessory TV DVD Foremost Kemps Printer Scanner Data Mining: Association Rules 5 Why should we
More informationElectrospray ionization mass spectrometry (ESI-
Automated Charge State Determination of Complex Isotope-Resolved Mass Spectra by Peak-Target Fourier Transform Li Chen a and Yee Leng Yap b a Bioinformatics Institute, 30 Biopolis Street, Singapore b Davos
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning
More information