Protein Post-translational Modifications Mapping with MS/MS based Frequent Interval Pattern Mining


Han Liu
Department of Computer Science, University of Illinois at Urbana-Champaign
May 8, 2005

ABSTRACT

Tandem mass spectrometry (MS/MS)-based proteomics has proven to be an indispensable tool for large-scale protein identification and expression profiling. However, analysis of protein post-translational modifications (PTMs) with MS/MS still presents formidable challenges. Although some heuristic algorithms have been developed for this problem, they are far from mature. In this paper, we give a detailed survey of current approaches and propose an alternative and more powerful solution, one that uses more straightforward laboratory procedures and can be applied automatically. Specifically, we use frequent interval pattern mining techniques to map modification sites in molecular detail. We also propose a parametric scoring scheme whose parameters can be tuned from the mining results or assigned a priori by a domain expert. Compared with classical PTM mapping algorithms such as SALSA and P-Mod, this mass spectrometric data mining approach is more robust and flexible. To evaluate biological relevance, we test these methods on real-world datasets generated by MS/MS experiments on various mouse tissue samples. The results show that the new approach is competitive on automated PTM mapping tasks while offering good efficiency and scalability for real-world bioinformatics applications.

Keywords: protein post-translational modifications mapping, frequent interval pattern mining, scalability, mass spectrometry, bio-data mining

Technical report submitted according to the regulations of the Department of Computer Science at the University of Illinois at Urbana-Champaign

1 Introduction

Proteomics is the large-scale study of the thousands of proteins in a cell [8]. In a typical proteomics experiment, the goal might be to compare the proteins present in a certain tissue under different conditions. For instance, a biologist might study cancer by comparing the proteins in a cancerous liver with those in a healthy liver. Modern mass spectrometry makes this possible by enabling the identification of thousands of proteins in a complex mixture [16, 2]. However, identifying proteins is only part of the story. It is also important to map protein post-translational modifications (PTMs) in molecular detail [11], since PTMs modulate the activity of most eukaryotic proteins and their determination yields valuable insight into biological function. Despite the importance of PTMs, their large-scale study has been hampered by a lack of suitable methods. Many PTMs were discovered serendipitously during studies of individual proteins using standard molecular techniques. Such direct analysis strategies require isolating the correctly processed protein in an amount large enough for biochemical study. As a result, many PTMs cannot be analyzed, which prevents us from fully understanding the protein modification mechanisms at work in the cell [11].

Figure 1: MS/MS. In the first phase, peptide ions of a given m/z value are selected. In the second phase, the selected peptide ions are further dissociated and the m/z values of the ion fragments are measured.

Tandem mass spectrometry (MS/MS) of peptides is a central technology of proteomics, enabling the identification of thousands of peptides and proteins from a complex mixture [13, 10, 1]. The overall MS/MS process is illustrated in Figure 1.
In a typical experiment, thousands of proteins from a tissue sample are fragmented into tens of thousands of peptides and ionized; each peptide ion consists of about 6 to 30 amino acid residues. This mixture of peptide ions then enters the tandem mass spectrometer and is analyzed in two phases. The first phase filters peptide ions

of a certain mass-to-charge ratio (m/z). In the second phase, a peptide is split into two fragments by collision-induced dissociation (CID) with a noble gas. In almost all cases, the peptide breaks at the chemical bond between two amino acids. The result is a collection of spectra, one for each peptide ion, in which each peak represents the relative abundance of a prefix or suffix ion [15].

Table 1: Some common and important post-translational modifications

PTM type              | Mass (Da) | Stability | Function and notes
Phosphorylation       | +80       | +/++      | modulation of molecular interactions, signaling
Acetylation           |           |           | regulates protein-DNA interactions
Methylation           |           |           | regulates gene expression
Hydroxyproline        |           |           | protein-ligand interaction
Sulfation (sTyr)      |           |           | regulates protein-protein interactions
Disulfide bond        |           |           | intra- and intermolecular crosslink, protein stability
Deamidation           |           |           | a common chemical artifact
Nitration of tyrosine | +45       | +/++      | oxidative damage during inflammation
Pyroglutamic acid     |           |           | protein stability
Farnesyl              |           |           | membrane tethering
Palmitoyl             |           | /++       | cellular localization and signaling

With the increasing acquisition rate of tandem mass spectrometers, there is growing potential to solve important biological problems such as peptide sequencing or protein identification by applying data mining techniques to MS/MS data [3, 2]. Although the identification of proteins in a complex mixture is becoming routine, protein identification alone provides only limited insight into protein function. An important component of protein regulation and function is covalent modification of protein structures that occurs post-translationally [12]. Many protein post-translational modifications (PTMs) give rise to specific features in MS/MS spectra [8]. For example, phosphorylated serine and threonine residues eliminate phosphoric acid (80 Da) in MS/MS. Thus, product ions at 40 and 80 units below doubly and singly charged precursor ions, respectively, are observed in the corresponding spectra.
Table 1 summarizes some common and important PTMs [11]. Currently, over 380 different PTMs have been discovered, and identifying the type and location of these PTMs is a first step in understanding their regulatory potential. Despite their importance to cellular function, the methodologies used to study these modifications are far from mature: most of them are not compatible with protein mixtures or are specific to a given type of PTM. Among these algorithms, the most promising is SALSA [6], a pattern recognition algorithm that maps PTMs according to user-specified modification characteristics. However, as we discuss in the following

section, SALSA cannot predict unanticipated modifications without user-specified priors, which greatly limits its real-world applicability. Another well-known algorithm is P-Mod, which maps modifications based on the mass shift of peptide sequences. However, it relies on the strict assumption that we already know a priori which proteins are in the mixture, which is too strong an assumption for modern proteomics analysis.

In this paper, we propose an approach that is fundamentally different from previous work. First, the spectral data is viewed as a directed sequence, called a spectrum sequence, in which each element is a dual item: one component corresponds to a mass peak (each peak represents a fragmented peptide ion), and the other is the mass difference between two corresponding fragmented peptide ions. Then, we mine closed frequent motifs satisfying a min_sup threshold in the sequence space. We also propose a parametric scoring scheme whose parameters can be tuned from the mining results or assigned a priori by a domain expert. To evaluate biological relevance, we test these methods on real-world datasets generated by MS/MS experiments on various mouse tissue samples. The results show that the new approach provides reasonable detection of protein post-translational modifications compared to SALSA, with good efficiency and scalability for real-world bioinformatics applications.

This paper is organized as follows. Section 2 introduces two classical PTM mapping algorithms, SALSA and P-Mod. Section 3 formally defines the closed motif finding problem and the peptide PTM mapping problem. Section 4 describes the frequent interval pattern mining algorithms for the peptide PTM mapping problem on two kinds of spectra: ideal spectra and noisy spectra. Section 5 reports the implementation and testing of our method on a real-world dataset. Section 6 discusses future research directions.
2 Related Work

In this section, we first introduce two state-of-the-art algorithms, SALSA and P-Mod, both of which are used intensively by biologists for PTM mapping from MS/MS data. We also point out the limitations of SALSA and P-Mod that this paper aims to address. Based on the characteristics of SALSA, we formalize the motif searching and PTM mapping problems under a frequent interval pattern mining framework, which forms the basis of our work.

2.1 SALSA Algorithm

SALSA (scoring algorithm for spectral analysis) was developed by biologists for rapidly screening large numbers of peptide MS/MS spectra for fragmentation characteristics indicative of specific peptide modifications. It detects specific features in MS/MS

spectra and scores each spectrum based on how many of the features are present and on their intensities. SALSA detects PTMs by detecting simple component features. It can detect four types of features, as shown in Figure 2.

Figure 2: Spectral characteristics detected by the SALSA algorithm

The first feature is a product ion at a specific m/z value. An example is the loss of a chemical modification as a charged fragment, which then appears in the MS/MS spectrum at a particular m/z value regardless of the m/z of the peptide from which it was lost. The second is a neutral loss, in which a neutral fragment is lost from the precursor ion. The product ion has the same charge state as the precursor, and the difference between the mass of the precursor and that of the detected product ion equals the mass of the lost neutral fragment. The third feature is a charged loss, in which a multiply charged precursor

ion loses a charged fragment. An example is the loss of a singly charged fragment from a doubly charged precursor. The fourth feature is an ion pair, which denotes any two signals separated by a specified m/z value anywhere in the MS/MS spectrum. The appearance of an ion pair can indicate the presence of a specific component in a peptide sequence. A natural extension of the fourth feature is amino acid sequence motif searching: instead of detecting just a pair of ions, SALSA can find a series of ions in the spectra.

For the first feature type, the product ion, SALSA scores a specific product ion by identifying the most abundant ion within a window of ±0.5 m/z units centered at the designated m/z value. Neutral losses and charged losses are scored in an analogous manner. The window for neutral loss detection is centered at the precursor m/z minus the user-specified neutral mass divided by the precursor charge. (Note that the actual m/z value for a neutral loss from a doubly charged precursor is half of that of the same mass loss from a singly charged precursor.) Neutral losses result in product ions that have the same charge as the precursor ion. In contrast, charged losses generate product ions that have a charge one unit less than that of the precursor and are only observed in spectra arising from doubly charged precursors. Charged losses are calculated from 2 × (precursor m/z) − 1.

For the ion pair and ion series problems, SALSA scores the correspondence between the experimental and theoretical ion series regardless of their absolute positions on the m/z axis. A virtual ruler, on which the relative separations of the ions are fixed, is superimposed on the experimental mass spectrum by aligning the first ion in the ion series with the fragment ion that has the highest experimentally determined m/z value. More details are illustrated in Figure 2.
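The neutral-loss window described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `neutral_loss_center` and `most_abundant_in_window` are hypothetical, peaks are assumed to be `(mz, intensity)` tuples, and the 80 Da loss in the usage lines is the value quoted in this report for phosphorylation.

```python
def neutral_loss_center(precursor_mz, neutral_mass, charge):
    """Expected product-ion m/z after a neutral loss.

    The lost fragment carries no charge, so the product ion keeps the
    precursor charge and the m/z drops by neutral_mass / charge.
    """
    return precursor_mz - neutral_mass / charge


def most_abundant_in_window(peaks, center, half_width=0.5):
    """Return the most intense (mz, intensity) peak within center +/- half_width.

    Returns None when no peak falls inside the window.
    """
    in_window = [p for p in peaks if abs(p[0] - center) <= half_width]
    if not in_window:
        return None
    return max(in_window, key=lambda p: p[1])


# Hypothetical spectrum: (m/z, intensity) pairs.
peaks = [(459.8, 10.0), (460.3, 25.0), (470.0, 99.0)]
center = neutral_loss_center(500.0, 80.0, 2)   # doubly charged precursor
best = most_abundant_in_window(peaks, center)
```

For a doubly charged precursor the window sits 40 m/z units below the precursor, for a singly charged one 80 units below, matching the note in parentheses above.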
The score of a spectrum is calculated from the %TIC values of the detected ions corresponding to the hypothetical ion series $i_1, \ldots, i_n$. The %TIC values corresponding to peaks $i_1, i_2, i_3, \ldots, i_n$ are denoted $T_1, T_2, T_3, \ldots, T_n$, respectively. Scores for spectra are calculated as

$\mathrm{Score} = N \, (T_1 T_2 \cdots T_n)^{1/n}$ (1)

where $N$ is the number of detected ions that correspond to the hypothetical ions $i_1, \ldots, i_n$ in the series. For spectra in which one or more of the ions in the series are missing, the algorithm substitutes for each missing ion a value equal to the threshold value for ion detection.

SALSA provides a focused search for spectra corresponding to a particular peptide or peptide modification. It does not look for exact matches in MS/MS spectra. Instead, it uses user-specified criteria to search for specific spectral features or fragmentation patterns, which can be thought of as spectral fingerprints of a peptide sequence or its variants. However, identifying peptide modifications with SALSA still requires expertise in spectral interpretation. Moreover, SALSA scores only rank spectra by their correspondence to the search criteria and do not provide any quantitative measure of confidence. These limitations become serious when a very large number of spectra must be processed.
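Equation (1) is the number of detected ions times the geometric mean of the %TIC values. A minimal sketch follows, assuming missing ions are passed as `None` and that the detection threshold is a positive number; the function name `salsa_series_score` is hypothetical and not part of SALSA itself.

```python
import math


def salsa_series_score(tic_values, threshold):
    """Score = N * (T_1 * ... * T_n) ** (1 / n), following Equation (1).

    tic_values: one %TIC value per hypothetical ion, or None if the ion
                was not detected (assumed encoding).
    threshold:  positive detection threshold substituted for missing ions.
    """
    n = len(tic_values)
    if n == 0:
        return 0.0
    detected = sum(1 for t in tic_values if t is not None)  # this is N
    filled = [threshold if t is None else t for t in tic_values]
    # Geometric mean computed in log space to avoid overflow on long series.
    log_mean = sum(math.log(t) for t in filled) / n
    return detected * math.exp(log_mean)


# Three hypothetical ions, one missing: N = 2, (4 * 1 * 2) ** (1/3) = 2.
score = salsa_series_score([4.0, 1.0, None], threshold=2.0)
```

Because the geometric mean multiplies all terms, a single very weak ion lowers the score much more than it would under an arithmetic mean, which is why the threshold substitution for missing ions matters.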

2.2 P-Mod Algorithm

To address the deficiencies of SALSA, P-Mod was developed by the same research group [7]. P-Mod calculates mass differences between search peptide sequences and MS/MS precursors and localizes the mass shift to a sequence position in the peptide. The mass shift is calculated as

$\text{mass shift} = \text{sequence mass} - \text{neutral precursor mass}$ (2)

Since modifications are detected as mass shifts, P-Mod does not require the user to guess the masses or sequence locations of modifications. For PTM mapping, an array of customized search criteria is generated for every sequence-to-spectrum comparison, taking into consideration the primary peptide sequence, the observed mass shift, the precursor m/z, and the instrumental limitations of ion trap mass spectrometers. The first element in each search array is a list of all the expected b- or y-series fragment ions for the unmodified peptide sequence. Succeeding elements in the search array consist of these same fragment ions, adjusted to reflect the mass shift localized at different amino acid residues in the sequence. Each element with more than 6 applied search criteria is given a raw score. The corresponding scoring formula is

$\mathrm{score} = \frac{1}{n} \sum \frac{\ln(1 + I_n)}{b_{ci} \, (1 + 3d^2)}$ (3)

where the sum runs over the applied search criteria, $n$ is the number of applied search criteria, $I_n$ is the intensity of the largest ion within 1.25 m/z of the expected location for the nth search criterion, $b_{ci}$ is the background intensity in the index compartment that contains the scored ion, and $d$ is the distance in m/z between the scored ion and its expected location. Although each element in each peptide search array is scored separately, only the highest scoring element is recorded as a potential match. Because only the highest scoring element from the search array is recorded, the raw scores assigned to individual sequence-to-spectrum comparisons are extreme values.
P-Mod uses extreme value statistics to model the distribution of scores assigned to sequence matches and derives a corresponding p-value for evaluating statistical significance. The formulas are

$Y = \frac{S - \mu}{\alpha} - \ln\frac{k}{100}$ and $p = 1 - \exp(-\exp(-Y))$ (4)

where $Y$ is the extreme value reduced variate, $S$ is the raw score, $\mu$ is a conditional location parameter, $\alpha$ is a conditional scale parameter, $k$ is the number of comparisons, and $p$ is the estimated p-value.

As described above, SALSA can detect peptide modifications for a particular protein even at relatively low abundance. However, the fact that these modification features must be specified by the user prevents SALSA from automatically discovering

unanticipated PTMs from a huge number of spectra. P-Mod does not have this constraint. However, it makes the very strong assumption that we already know which proteins are in the mixture, that is, that the proteins have already been identified. This premise is dubious, since PTM mapping is mainly aimed at improving identification performance: without mapping the modifications first, database searching algorithms such as SEQUEST or X!Tandem cannot work well. P-Mod is a method for modification mapping, yet its performance depends on the identification algorithm, which in turn relies on the performance of modification mapping itself. Therefore, P-Mod's assumption is not reasonable or natural enough for real-world applications. A more intuitive and natural method is needed.

3 Problem Formalization

Assume that each spectrum has been normalized so that the x-axis is the mass $m$ rather than m/z. A spectrum can be represented by a position sequence $S = \langle (m_1, I_1), (m_2, I_2), \ldots, (m_n, I_n) \rangle$. Each element of $S$ is a dual item $(m_i, I_i)$, where $m_i$ is the horizontal position of the ith peak and $I_i$ is the corresponding peak intensity. As shown in Figure 3, the position sequence $S$ can be converted into a distance sequence $\Delta = \langle (\sigma_1, I_1), (\sigma_2, I_2), \ldots, (\sigma_n, I_n) \rangle$, where each $\sigma_i$ is defined as

$\sigma_i = \begin{cases} m_i & \text{if } i = 1 \\ m_i - m_{i-1} & \text{otherwise} \end{cases}$ (5)

Figure 3: MS/MS spectrum for the sequence $S = \langle (m_1, I_1), (m_2, I_2), \ldots, (m_6, I_6) \rangle$

For convenience of description, both the position sequence $S$ and the distance sequence $\Delta$ will be used in this paper. Because of noise, we introduce a noise offset vector $O = \langle \delta_1, \delta_2, \ldots, \delta_n \rangle$, which means that the ith peak may appear anywhere in the range $[m_i - \delta_i, m_i + \delta_i]$.
Without loss of generality, we simply write a sequence as $S = \langle m_1, m_2, \ldots, m_n \rangle$ or $\Delta = \langle \sigma_1, \sigma_2, \ldots, \sigma_n \rangle$, since $m_i$ and $\sigma_i$ are what our algorithm uses directly for frequent interval pattern mining, while $I_i$ is only needed when we score the mining results.
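The conversion between the two representations can be illustrated with a short Python sketch. It assumes that each distance is the gap to the preceding peak, with the first element holding the absolute position of the first peak, which is consistent with the Gap definition used later in Section 4.2; the function names are hypothetical.

```python
from itertools import accumulate


def to_distance_sequence(positions):
    """Convert peak positions <m_1, ..., m_n> into gaps <sigma_1, ..., sigma_n>.

    sigma_1 is the absolute position of the first peak; every later
    sigma_i is the distance from the previous peak (assumed reading).
    """
    return [m if i == 0 else m - positions[i - 1]
            for i, m in enumerate(positions)]


def to_position_sequence(distances):
    """Inverse conversion: a running sum of the gaps restores the positions."""
    return list(accumulate(distances))


positions = [100.0, 171.0, 300.0]            # hypothetical peak masses
distances = to_distance_sequence(positions)  # [100.0, 71.0, 129.0]
restored = to_position_sequence(distances)
```

The round trip shows that both sequences carry the same information, so the choice between them is a matter of convenience for a given step.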

A position sequence $S_\alpha = \langle \alpha_1, \alpha_2, \ldots, \alpha_m \rangle$ is a sub-sequence of another position sequence $S_\beta = \langle \beta_1, \beta_2, \ldots, \beta_n \rangle$, denoted $S_\alpha \sqsubseteq S_\beta$ (if $m \neq n$, written $S_\alpha \sqsubset S_\beta$), if and only if $\exists k \in \mathbb{Z}$ such that the shifted item set of $\alpha$, $H_\alpha = \{\alpha_1 + k, \alpha_2 + k, \ldots, \alpha_m + k\}$, is a subset of the item set of $\beta$, $H_\beta = \{\beta_1, \beta_2, \ldots, \beta_n\}$, i.e., $H_\alpha \subseteq H_\beta$. It is straightforward to verify that this relation is a partial order. We also say that a distance sequence $\Delta_\alpha$ is a sub-sequence of another distance sequence $\Delta_\beta$ if and only if their corresponding position sequences are related by this partial order.

The output of an MS/MS experiment is a spectra dataset in which each spectrum is associated with an id. For simplicity, say the id of the ith spectrum is $i$. We then transform the spectra dataset into a position sequence database $D_S = \{S_1, S_2, \ldots, S_n\}$, which is a set of position sequences, and a distance sequence database $D_\Delta = \{\Delta_1, \Delta_2, \ldots, \Delta_n\}$. $|S|$ denotes the number of elements in sequence $S$, and $|D|$ denotes the number of sequences in database $D$. The support of a sequence $S_\alpha$ in a sequence database $D$ is the number of sequences in $D$ that contain $S_\alpha$, denoted $N_\alpha$: $N_\alpha = |\{S \mid S \in D \text{ and } S_\alpha \sqsubseteq S\}|$. Given a minimum support threshold min_sup, the set of frequent interval patterns, $F_{interval}$, includes all sequences whose support is no less than min_sup. The set of closed frequent interval patterns is defined as $C_{interval} = \{S_\alpha \mid S_\alpha \in F_{interval} \text{ and } \nexists S_\beta \in F_{interval} \text{ s.t. } S_\alpha \sqsubset S_\beta \text{ and } N_\alpha = N_\beta\}$. Since $C_{interval}$ contains no sequence that has a super-sequence with the same support, we have $C_{interval} \subseteq F_{interval}$. Ignoring noise and peak loss, the problem of closed motif finding is to find $C_{interval}$ above a minimum support threshold in the distance database $D_\Delta$.
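The sub-sequence relation and the support count can be made concrete with a brute-force Python sketch. It assumes exact matching on integer positions and tries every shift that aligns the first element of the candidate with some peak of the spectrum; `is_subsequence` and `support` are hypothetical helper names, and this is not the mining algorithm of Section 4.

```python
def is_subsequence(alpha, beta):
    """True if some shift k maps every position of alpha onto a position of beta."""
    if not alpha:
        return True
    if len(alpha) > len(beta):
        return False
    beta_set = set(beta)
    # Any valid shift must send alpha[0] to some element of beta,
    # so only len(beta) candidate shifts need to be checked.
    for b in beta:
        k = b - alpha[0]
        if all(a + k in beta_set for a in alpha):
            return True
    return False


def support(alpha, database):
    """Number of sequences in the database that contain alpha."""
    return sum(1 for s in database if is_subsequence(alpha, s))


# Hypothetical position sequence database.
database = [
    [3, 10, 15, 22, 30],
    [3, 10, 15, 23],
    [1, 6, 13],
]
pattern = [0, 5, 12]
n_alpha = support(pattern, database)   # contained in the first and third sequence
```

With min_sup = 2 the pattern above would be frequent, since it occurs (after shifts of 10 and 1) in two of the three sequences.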
The single-site exact peptide PTM mapping problem is to find a pair of closed frequent interval patterns, say $P_A = \langle \alpha_1, \alpha_2, \ldots, \alpha_k, \alpha_{k+1}, \ldots, \alpha_n \rangle$ and $P_B = \langle \beta_1, \beta_2, \ldots, \beta_{k'}, \beta_{k'+1}, \ldots, \beta_{n'} \rangle$ ($n' \le n$), together with two integers $k$ ($k < n$) and $l$, satisfying

$\begin{cases} P_{B2} \sqsubseteq P'_{A2} \text{ and } N_{P_B}/N_{P_A} > I_1, & \text{if } P_{B1} \sqsubseteq P_{A1}, \\ P_{B1} \sqsubseteq P'_{A1} \text{ and } N_{P_B}/N_{P_A} > I_1, & \text{if } P_{B2} \sqsubseteq P_{A2}, \end{cases}$ (6)

where $P_{A1} = \langle \alpha_1, \alpha_2, \ldots, \alpha_k \rangle$, $P_{A2} = \langle \alpha_{k+1}, \ldots, \alpha_n \rangle$, $P_{B1} = \langle \beta_1, \beta_2, \ldots, \beta_{k'} \rangle$, $P_{B2} = \langle \beta_{k'+1}, \ldots, \beta_{n'} \rangle$, $P'_{A1} = \langle \alpha_1 + l, \ldots, \alpha_k + l \rangle$, $P'_{A2} = \langle \alpha_{k+1} + l, \ldots, \alpha_n + l \rangle$, and $0 < I_1 < 1$ is a given threshold. The tuning of $I_1$ should depend on domain-specific knowledge. In fact, the two cases are equivalent. We write both only to make the representation explicit.

Similarly, the double-site peptide PTM mapping problem is to first find a one-site modified interval pattern such as $P_B$, and then use the same method to find another closed frequent interval pattern, say $P_C = \langle \gamma_1, \gamma_2, \ldots, \gamma_{k''}, \gamma_{k''+1}, \ldots, \gamma_{n''} \rangle$ ($P_C \neq P_A$, $n'' \le n'$), together with two integers $k''$ ($k'' < n''$) and $l''$. The definition is analogous, with a parameter $0 < I_2 < 1$ denoting the domain-specific threshold. In this way, we can go on to define triple-site peptide PTM mapping problems recursively.

Discussion: The above definitions assume that there is no noise. For real applications, two kinds of noise should be considered: peak loss and peak shift. Peak

loss means that, for some spectra, some peaks are missing due to partial ionization of the peptide fragments. It is also possible that some peaks exist without corresponding peptide ions. Peak shift is mainly due to instrument resolution and isotopic distributions: the measured peak may shift by a small amount along the x-axis. When noise is considered, we relax the above PTM mapping definition to the approximate peptide PTM mapping problem. To deal with peak loss, we simply change every $C_{interval}$ in the above definition to $F_{interval}$. For the peak shift scenario, to simplify the discussion, we assume a symmetric white noise offset vector $O = \{\delta_1, \delta_2, \ldots, \delta_n\}$ with

$\delta_i \sim N(0, \sigma^2)$ (7)

When comparing two values $m_i^1$ and $m_i^2$, we say that $m_i^1$ matches $m_i^2$ if and only if $m_i^1 \in [m_i^2 - 2\sigma, m_i^2 + 2\sigma]$. Since real applications generally contain noise, the approximate peptide PTM mapping problem is our concern in the remainder of this paper.

4 Methodology

In this section, we develop a frequent interval pattern mining algorithm for the protein PTM mapping problem. Traditional frequent pattern mining algorithms can be roughly divided into two approaches: candidate generate-and-test and pattern growth. In previous work, the pattern growth approach has outperformed its generate-and-test counterparts in many applications. Therefore, we base our algorithm on the pattern growth approach, in the hope of better performance in real-world applications.

4.1 Pattern Growth Approach

Pattern growth is an efficient method for mining frequent patterns from large-scale databases. It was first introduced by Han et al. [5, 4]. It adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. A pattern fragment growth method is then used to avoid the costly candidate generate-and-test procedure altogether.
Moreover, an extended prefix-tree structure is constructed to compress crucial information about frequent patterns and to avoid costly, repeated database I/O operations. A comprehensive performance study has shown that the method is robust and well suited to most real-world applications. Many subsequent works [17, 18, 9] following this approach have been published. Our algorithm also follows this direction.

4.2 Initial Database and Conditional Database

The key concept of the pattern growth approach is the conditional database. Using a divide-and-conquer strategy, the original database is divided into several partitions according to prefix patterns. For a given prefix interval pattern $P_i$, all frequent interval patterns that extend it can be mined from the partition $D_{P_i}$ specified by $P_i$ without

accessing any other information. Each conditional database is further divided recursively following the same procedure.

Definition 1 (Prefix interval pattern): A frequent interval pattern in $F_{interval}$ is also called a prefix interval pattern. We call it a prefix because it will grow in subsequent steps.

Definition 2 (Conditional database): The conditional database $D_{P_i}$ of a prefix pattern $P_i$ is the database of all transactions that contain or follow the prefix pattern.

Definition 3 (Initial database): The initial database is the conditional database of the empty prefix pattern.

To facilitate the mining process, only the current gap and the position of the next peak are stored for each transaction. Instead of storing full transactions in the conditional database, only references are stored: each transaction in the conditional database is a triple <Gap, <SpectrumID, PeakID>>, where Gap is the current gap value and the peak is identified by <SpectrumID, PeakID>. Since interval patterns may appear at different positions in different spectra, an n-peak spectrum $(p_1, p_2, p_3, \ldots, p_n)$ is expanded into n transactions. For each peak $p_i$ in spectrum $j$, the corresponding transaction is

$\begin{cases} \langle \text{Gap} = m_i - m_{i-1}, \langle \text{SpectrumID} = j, \text{PeakID} = i \rangle \rangle, & \text{if } i > 1 \\ \langle \text{Gap} = 0, \langle \text{SpectrumID} = j, \text{PeakID} = i \rangle \rangle, & \text{otherwise} \end{cases}$ (8)

During the mining process, these transactions are sorted by their current gaps, and all transactions with the same current gap are clustered into the same group. When the initial database is constructed, all transactions are represented in the above format as <0, <SpectrumID, PeakID>>. Here the 0 means that all current gaps are 0, since all spectra are aligned at their first peak.
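The transaction format can be sketched in Python as follows. The `Transaction` tuple, the helper names, and the use of a dictionary from spectrum id to a sorted list of peak masses are assumptions made for illustration; PeakID is 1-based as in Equation (8).

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    gap: float        # current gap value
    spectrum_id: int  # which spectrum the reference points into
    peak_id: int      # 1-based index of the peak inside that spectrum


def peak_transactions(spectrum_id, masses):
    """One transaction per peak, following Equation (8)."""
    return [
        Transaction(0.0 if i == 0 else masses[i] - masses[i - 1],
                    spectrum_id, i + 1)
        for i in range(len(masses))
    ]


def initial_database(spectra):
    """Initial database: every peak of every spectrum with current gap 0."""
    return [
        Transaction(0.0, sid, pid)
        for sid, masses in spectra.items()
        for pid in range(1, len(masses) + 1)
    ]


def group_by_gap(transactions):
    """Cluster transactions that share the same current gap."""
    groups = {}
    for t in sorted(transactions):
        groups.setdefault(t.gap, []).append(t)
    return groups


spectra = {1: [100.0, 171.0, 300.0], 2: [50.0, 121.0]}  # hypothetical data
per_peak = peak_transactions(1, spectra[1])
initial = initial_database(spectra)
```

Only references (spectrum id and peak id) are stored, so the peak array itself is never copied into a conditional database.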
4.3 Frequent Interval Pattern Mining Algorithm

Exact Pattern Mining: Figure 4 gives the pseudocode for the exact frequent pattern mining problem, which is solved by a recursive method. The main idea comes from the traditional pattern growth approach. However, there is a subtle difference between interval pattern mining and traditional pattern-growth frequent pattern mining. Assume we have an interval $[a, c]$ and a point $b \in [a, c]$. Under the prefix pattern mining approach, even if $[a, b]$ is not frequent, we cannot prune $[a, c]$ directly, since $[a, c]$ may still be a frequent interval pattern. To deal with this problem, in Lines 13-15 we construct the conditional database for the recursive call, and in Lines 19-21 we push forward the current transactions. The difference between these two cases lies in how the new gap is calculated: the former uses the next gap itself, whereas the latter adds the next

gap to the current gap. This is because when we push forward the current transactions, we are in effect ignoring the current peaks. As a result, the new gap size is the sum of the sizes of the two old gaps. When we prepare the conditional database for the recursive call, however, we count the current peak, so the new gap size is just the size of the next gap. This difference does not arise in previous sequential pattern mining problems, since gaps are ignored in sequential patterns.

ExactPatternMining(PatternPrefix P, ConditionalDatabase D)
BEGIN
 1. sort D based on the first gap of each transaction
 2. while not empty(D)
 3.   find the transactions T with the smallest gap in D
 4.   let G = t.gap, t ∈ T
 5.   remove T from D
 6.   if (support(T) ≥ minsup)
 7.     construct the new pattern prefix P2 = <P, G>
 8.     output <P2, support(T)>
 9.     construct the conditional database D2
10.     for each transaction t = <G, <id, num>> in T
11.       if (Spectrum[id] has more than num peaks)
12.         construct a new transaction t2
13.         let t2.Gap = peak[id][num + 1] − peak[id][num]
14.         let t2.num = num + 1
15.         insert t2 into D2
16.     ExactPatternMining(P2, D2)
17.   for each transaction t = <G, <id, num>> in T
18.     if (Spectrum[id] has more than num peaks)
19.       construct a new transaction t2
20.       let t2.Gap = G + (peak[id][num + 1] − peak[id][num])
21.       let t2.num = num + 1
22.       insert t2 into D
END

Figure 4: The Exact Frequent Interval Mining Algorithm

The key to an efficient implementation of this algorithm is the choice of data structure for the conditional database. Two critical operations must be supported: removing the least element and inserting a new element. For this purpose a heap is adopted, constructed so that the transaction with the smallest gap is on the top

of the heap. The time complexities of least-element removal and new-element insertion are both $O(\log n)$, where $n$ is the number of elements in the heap.

For the space complexity, denote the total number of spectra by $N$ and the average number of peaks per spectrum by $P$. The size of the peak array is $O(NP)$. Since the initial database is not populated explicitly and only references are constructed, the size of the initial database is also $O(NP)$. The depth of the recursive calls cannot exceed the number of peaks in a spectrum. Thus, the worst-case space complexity of this algorithm is $O(NP^2)$. In the average case, the size of the conditional database shrinks exponentially with the depth of the recursive calls, so the average-case space complexity is $O(NP)$. This is because, if we assume that $|D_{i+1}| \le \frac{1}{2}|D_i|$ at each level of recursion, then

$\sum_i |D_i| \le \left(2 - \frac{1}{2^{P-1}}\right)|D| < 2|D|$ (9)

Approximate Pattern Mining: The above algorithm solves the problem of finding exact frequent interval patterns. However, as discussed before, peak shifts may occur due to instrument limitations. A method capable of dealing with this uncertainty is more meaningful for biologists. Here, we show how the exact pattern mining algorithm can be extended to an approximate version. A straightforward extension is discretization: all peak positions are discretized to a grid defined by a compartment size. After this preprocessing, we can reuse the exact interval pattern mining algorithm, since the support of a given pattern now includes all transactions that approximately match it. Let the original peak position be $m_{original}$ and the given compartment size be $Z$. The new peak position $m_{new}$ is calculated as

$m_{new} = \frac{1}{2}\left(\left\lfloor \frac{m_{original}}{Z} \right\rfloor Z + \left\lceil \frac{m_{original}}{Z} \right\rceil Z\right)$ (10)

This discretization approach is easy to implement. However, it has some potential problems.
Consider an n-peak spectrum $(p_1, p_2, p_3, \ldots, p_n)$ in which the position $m_i$ of the ith peak is viewed as a random variable with Gaussian noise $N(0, \sigma^2)$. If the first peak is fixed, the Gaussian noise leads to peak shifts of up to $2\sigma$ for all peaks. Given $(m_1, m_2, m_3, \ldots, m_n)$ for the corresponding peaks, if we use an error bound of $[-2\sigma, 2\sigma]$ to match the sequence, then under the independence assumption the probability of a match is $0.95^n$. However, if we follow the discretization approach and simply set the compartment size to $4\sigma$, we do not obtain that probability. This is because the true value of peak $p_i$ does not necessarily lie in the middle of a compartment. If it coincides with a discretization point, the probability of matching is only 0.50, and if there are $k$ such peaks, they alone contribute a factor of $0.50^k$. In the other cases, the probability of correct matching is a number $\pi_i$ with $0.50 \le \pi_i \le 0.95$. For the whole spectrum, the probability of correct matching is therefore only $\pi_1 \pi_2 \cdots \pi_n < 0.95^n$.
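The boundary effect can be seen in a tiny deterministic Python example. It assumes that discretization maps each position to the midpoint of its compartment (the reading of Equation (10) used here); the function name `discretize` and the numeric values are hypothetical.

```python
import math


def discretize(m_original, z):
    """Map a peak position to the midpoint of its compartment of size z."""
    lower = math.floor(m_original / z) * z
    upper = math.ceil(m_original / z) * z
    return 0.5 * (lower + upper)


sigma = 0.25
z = 4 * sigma            # compartment size of 4 sigma, as in the text

# Two noisy observations of a peak whose true mass sits on a grid point.
obs_a, obs_b = 99.9, 100.1
within_tolerance = abs(obs_a - obs_b) <= 2 * sigma      # they should match
same_compartment = discretize(obs_a, z) == discretize(obs_b, z)

# Two observations of a peak in the middle of a compartment behave well.
obs_c, obs_d = 100.4, 100.6
same_compartment_mid = discretize(obs_c, z) == discretize(obs_d, z)
```

The first pair differs by less than 2σ yet lands in different compartments, so exact mining on discretized positions would treat the two peaks as unrelated. This is the loss of matching probability described above.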

The above analysis motivates us to design a more effective approximate mining algorithm. Fortunately, it is not hard to modify the exact pattern mining algorithm for this purpose. The pseudocode is shown in Figure 5.

ApproximatePatternMining(PatternPrefix P, ConditionalDatabase D)
BEGIN
 1. sort D based on the first gap of each transaction
 2. let G = 0
 3. while true
 4.   while not empty(D) && |D ∩ [G − a, G + a]| < minsup
 5.     increase G until the first peak is outside of the range
 6.     push-forward transactions in D with gap < G − a
 7.   if empty(D)
 8.     break
 9.   let T = D ∩ [G − a, G + a]
10.   construct the new pattern prefix P2 = <P, G>
11.   output <P2, support(T)>
12.   construct the conditional database D2
13.   for each transaction t = <g, <id, num>> in T
14.     if (Spectrum[id] has more than num peaks)
15.       construct a new transaction t2
16.       let t2.g = peak[id][num + 1] − peak[id][num]
17.       let t2.num = num + 1
18.       insert t2 into D2
19.   ApproximatePatternMining(P2, D2)
20.   let G = G + b
21.   push-forward transactions in D with gap < G − a
END

Figure 5: The Approximate Frequent Mining Algorithm

This method introduces two parameters, $a$ and $b$. The parameter $a$ is the maximum allowed error for a single peak, that is, the maximum allowed difference between corresponding peaks when the first peaks of two patterns (or of one pattern and one transaction) are aligned. It can be viewed as a measure of how much uncertainty we are willing to tolerate. The parameter $b$ is the smallest increment of the current gap between two adjacent recursive calls. It can be viewed as a measure of how much the gap must change before the algorithm moves on to a new pattern. Introducing $b$ is necessary because, when there are more than minsup transactions within the range $[g - a, g + a]$ ($g$ = current peak position + $a$), there may also be more than minsup transactions within the range $[g + \epsilon - a, g + \epsilon + a]$ for a tiny value $\epsilon$. The corresponding recursive calls would be almost duplicates.
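The roles of a and b can be illustrated on a single level of the search with the following simplified Python sketch. It is not the full recursive procedure of Figure 5: it only counts how many current gaps fall within [G − a, G + a] and skips centers closer than b to the last reported one. A sorted list with binary search stands in for the ordered container, and candidate centers are restricted to observed gap values. Both choices, like the function names, are assumptions made for brevity.

```python
from bisect import bisect_left, bisect_right


def count_in_window(sorted_gaps, center, a):
    """Number of gaps g with center - a <= g <= center + a."""
    return bisect_right(sorted_gaps, center + a) - bisect_left(sorted_gaps, center - a)


def frequent_gap_centers(gaps, a, b, min_sup):
    """Report gap centers supported by at least min_sup gaps.

    a: tolerated error around a center.
    b: minimum spacing between reported centers, which suppresses
       near-duplicate windows such as [g - a, g + a] and [g + eps - a, g + eps + a].
    """
    sorted_gaps = sorted(gaps)
    centers = []
    for g in sorted_gaps:
        if centers and g < centers[-1] + b:
            continue  # too close to the previously reported center
        if count_in_window(sorted_gaps, g, a) >= min_sup:
            centers.append(g)
    return centers


# Hypothetical current gaps of six transactions.
gaps = [71.0, 71.1, 70.9, 129.0, 129.2, 87.0]
centers = frequent_gap_centers(gaps, a=0.25, b=0.5, min_sup=2)
```

Without the spacing b, the three gaps near 71 would each produce an almost identical window and hence an almost identical recursive call.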

In the above pseudocode, there is an operation called push-forward. This is just shorthand for the corresponding lines of the Exact Frequent Interval Mining algorithm given earlier. Heaps are no longer suitable for constructing conditional databases in this approximate mining algorithm. Besides minimum-element removal and insertion, we also need to count the transactions within the range $[G - a, G + a]$ efficiently, which heaps do not support. Therefore, a balanced search tree is adopted here. We keep two pointers, one to the minimum element and one to the minimum element that is greater than $G + a$. In this way, it is easy to maintain the number of transactions in the range $[G - a, G + a]$.

The space complexity of the approximate interval pattern mining algorithm is the same as that of the exact interval pattern mining algorithm, since the size of the conditional database and the maximal number of recursive calls do not change. However, in the average case, the space usage increases. This is because, with approximate matching, the conditional database for each recursive call is usually larger than its exact-matching counterpart. As a result, the constant factor in average space usage increases, while the computational complexity does not change. The following theorem guarantees that all the frequent interval patterns are mined by this method.

Theorem 1: The given exact and approximate frequent interval pattern mining algorithms are guaranteed to output all the frequent interval patterns.

Proof: Completeness follows from the structure of the two pattern mining algorithms, each of which has two branches. One branch is the pattern mining in the prefix-tree-based conditional database, whose completeness is guaranteed by the PrefixSpan method [14]. The peak-removal branch in fact only adds more transactions to the original database and does not prune anything at all. This completes the proof.
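
The three operations required of the conditional database earlier in this section (insertion, minimum removal, and counting the transactions in $[G - a, G + a]$) can be sketched as follows. As an assumption for brevity, a sorted Python list with the standard `bisect` module stands in for the balanced search tree, so insertion and minimum removal here cost $O(n)$ rather than the $O(\log n)$ a balanced tree would give; the class name `GapWindow` is hypothetical.

```python
import bisect


class GapWindow:
    """Sorted container of gaps supporting range counting."""

    def __init__(self, gaps=()):
        self._gaps = sorted(gaps)

    def insert(self, gap: float) -> None:
        """Insert a gap while keeping the container sorted."""
        bisect.insort(self._gaps, gap)

    def pop_min(self) -> float:
        """Remove and return the smallest gap."""
        return self._gaps.pop(0)

    def count_in_range(self, lo: float, hi: float) -> int:
        """Number of stored gaps g with lo <= g <= hi."""
        first = bisect.bisect_left(self._gaps, lo)   # first index with gap >= lo
        last = bisect.bisect_right(self._gaps, hi)   # first index with gap > hi
        return last - first


window = GapWindow([10.0, 10.25, 10.5, 15.0])
in_range = window.count_in_range(10.0, 10.5)  # counts 10.0, 10.25 and 10.5
```

The two indices computed in `count_in_range` play the role of the two pointers mentioned in the text: one at the low end of the window and one at the first element beyond $G + a$.
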
4.4 Modifications Mapping

Modifications mapping can be conducted based on the frequent interval patterns. According to the definition of PTMs, we are trying to find the modifications between two frequent patterns. In the following, one exact approach and one approximate approach are presented. Then a scoring function is defined, which can be used to rank the mining results.

Exact Modifications Mapping: The key problem in detecting a modification is to determine where the modification begins. For example, suppose we have two length-$n$ patterns $P_1 = \langle \psi_1, \psi_2, \ldots, \psi_n \rangle$ and $P_2 = \langle \psi'_1, \psi'_2, \ldots, \psi'_n \rangle$. Here, $\psi_i$ and $\psi'_i$ represent the $i$th gap in the respective frequent interval patterns. If there exists a number $k$ ($k \le n$) for which $\psi_i = \psi'_i$ for $1 \le i < k$ and $k < i \le n$, then we say there is a modification at position $k$ for the corresponding frequent patterns.
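
The definition above can be turned into a direct pairwise check. The sketch below makes one assumption that the definition leaves implicit, namely that the gaps at position $k$ themselves differ; the function name `modification_position` and the integer gap values are purely illustrative.

```python
from typing import Optional, Sequence


def modification_position(p1: Sequence[float], p2: Sequence[float]) -> Optional[int]:
    """Return the 1-based position k at which p1 and p2 differ, if it is unique.

    Returns None when the patterns have different lengths, are identical,
    or differ in more than one gap.
    """
    if len(p1) != len(p2):
        return None
    differing = [i for i, (x, y) in enumerate(zip(p1, p2), start=1) if x != y]
    return differing[0] if len(differing) == 1 else None


original = (57, 71, 87, 99)    # illustrative gap values only
modified = (57, 71, 167, 99)   # third gap shifted
k = modification_position(original, modified)
```

Comparing every pair of patterns this way is quadratic in the number of patterns, which is why a sorting-based method is preferable for large pattern sets.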

The way to detect such modifications is to enumerate all possible values of $k$. Each time, we only need to compare patterns of equal length, since we use frequent interval patterns $F_{interval}$ instead of closed frequent interval patterns $C_{interval}$. Given a value $k$ for length-$n$ patterns ($k < n$, with the patterns represented as $(\psi_1, \psi_2, \ldots, \psi_n)$), we sort all the length-$n$ patterns with the multi-key $(\psi_1, \ldots, \psi_{k-1}, \psi_{k+1}, \ldots, \psi_n)$. Two patterns are modifications of each other if and only if they have the same key. This requires a multiple-key sort. The naive method treats a multiple-key sort as $n$ single-key sorts. There are altogether $n - 1$ possible values of $k$ (from 1 to $n - 1$). Thus, the number of single-key sorts for this naive approach is $n(n - 1)$. Assuming there are $m$ length-$n$ patterns, the time complexity is $O(n^2 m \log m)$.

A more sophisticated extension of the naive approach reuses the previous sorting results to reduce the time complexity by a factor of $n$. The key component of this approach is stable sorting, which differs from ordinary sorting in that it guarantees that the relative order of two data items with the same key does not change during the sorting procedure. With stable sorting, the number of single-key sorts can be reduced to $2n - 2$. The method is shown in figure 6.

    ModificationMapping(length-n PatternArray P_set)
    BEGIN
    1. for i = n - 1 downto 1
    2.     stable sort P_set on single dimension i
    3. output detected modifications
    4. for i = n downto 2
    5.     stable sort P_set on dimension i
    6.     extract modifications at dimension i to end from sorting results
    7.     output detected modifications
    END

Figure 6: Modifications Mapping method

An illustrative run of this algorithm is shown in table 2. In this example, we focus on length-5 patterns.
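
The sorting-based detection can be sketched in a few lines. The example below implements only the naive variant (a multi-key sort carried out as a sequence of single-key stable sorts, repeated for each $k$), not the result-reusing schedule of Figure 6. It relies on Python's `list.sort` being stable; positions are 0-based, and the names `stable_multikey_sort` and `modifications_at` are hypothetical.

```python
from itertools import groupby


def stable_multikey_sort(patterns, dims):
    """Order patterns lexicographically by `dims` using one stable sort per dimension."""
    result = list(patterns)
    for d in reversed(dims):               # least significant key first
        result.sort(key=lambda p: p[d])    # list.sort is stable
    return result


def modifications_at(patterns, k):
    """Groups of equal-length patterns that agree everywhere except position k."""
    n = len(patterns[0])
    dims = [d for d in range(n) if d != k]

    def key(p):
        return tuple(p[d] for d in dims)

    ordered = stable_multikey_sort(patterns, dims)
    groups = []
    for _, run in groupby(ordered, key=key):   # equal keys are adjacent after sorting
        run = list(run)
        if len({p[k] for p in run}) > 1:       # at least two distinct gaps at position k
            groups.append(run)
    return groups


patterns = [(57, 71, 87, 99), (57, 71, 167, 99), (57, 71, 87, 113), (71, 71, 87, 99)]
pairs_at_third_gap = modifications_at(patterns, 2)
```

Sorting by the least significant dimension first and the most significant last is exactly the stable-sorting property that the reuse scheme exploits.
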
The success of this algorithm depends on a critical property of stable sorting: if we have two keys $K$ and $K'$, and the data is already ordered by key $K'$ before the stable sort, then after a stable sort on key $K$ the data is ordered by the composite key $\langle K, K' \rangle$. With this property, we see that in the first loop of the algorithm the keys $\psi_4, \psi_3, \psi_2, \psi_1$ are sorted together, so when extracting the modification patterns, all the modifications in the fifth row can be detected. Then, in the first run of the second loop, the keys $\psi_3, \psi_2, \psi_1, \psi_5$ are sorted, so all the modifications in the fourth row can be extracted. Each step of the second loop reuses the previous results. In this way, the whole key space is traversed, and the correctness of the algorithm is guaranteed. Also, the number of single-key sorts is reduced to only $2n - 2$. The time complexity is only $O(n\, m \log m)$, which is more efficient than the naive approach.

Table 2: An illustrative run of the modification mapping algorithm

    Key compared        Step   Dimensions sorted so far
    ψ2, ψ3, ψ4, ψ5      5      4, 3, 2, 1, 5, 4, 3, 2
    ψ1, ψ3, ψ4, ψ5      4      4, 3, 2, 1, 5, 4, 3
    ψ1, ψ2, ψ4, ψ5      3      4, 3, 2, 1, 5, 4
    ψ1, ψ2, ψ3, ψ5      2      4, 3, 2, 1, 5
    ψ1, ψ2, ψ3, ψ4      1      4, 3, 2, 1

Approximate Modifications Mapping: The same problem arises again when we switch from exact modifications mapping to approximate modifications mapping. Approximate mapping is more difficult because modifications cannot be detected accurately by simple sorting. A useful property makes efficient approximate modifications mapping possible: if the prefixes of two patterns are the same, then the difference at the first gap in which the patterns differ should be at least delta. This condition effectively tells us that, for the approximate modifications mapping algorithm, we need to conduct approximate matching only after the modification position. We can still use stable sorting for the dimensions at or before the modification position. However, we cannot use stable sorting for the part of the patterns after the modification position. Instead, for each block of patterns that agree on the prefix, an R-tree is used to store the suffixes. In this way, approximate modifications can be mapped efficiently.

Discussions: The above discussion focuses mainly on the single-site PTMs mapping problem; however, it is straightforward to see that this algorithm can also detect multiple-site PTMs. Assume that $P_1$ is the pattern for the original peptide sequence, $P_2$ is the single-site modified version, and $P_3$ is the double-site modified version based on $P_2$. From the previous definition, $P_1$, $P_2$ and $P_3$ are all frequent interval patterns. By Theorem 1, they can all be mined by our algorithms. In the modifications mapping step, all the modifications can then be detected.
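
The approximate mapping condition described above (exact agreement before the modification position, a difference of at least delta at that position, and tolerance-based agreement after it) can be written as a pairwise predicate. This is a minimal sketch under assumptions: a plain linear comparison replaces the R-tree lookup, positions are 0-based, and the names `approx_modification`, `tol` and `delta` are illustrative.

```python
from typing import Sequence


def approx_modification(
    p1: Sequence[float],
    p2: Sequence[float],
    k: int,
    tol: float,
    delta: float,
) -> bool:
    """True if p1 and p2 look like a modification pair at 0-based position k.

    The prefix before k must match exactly, the gaps at k must differ by at
    least `delta`, and each gap after k may differ by at most `tol`.
    """
    if len(p1) != len(p2) or not 0 <= k < len(p1):
        return False
    prefix_equal = tuple(p1[:k]) == tuple(p2[:k])
    gap_differs = abs(p1[k] - p2[k]) >= delta
    suffix_close = all(abs(x - y) <= tol for x, y in zip(p1[k + 1:], p2[k + 1:]))
    return prefix_equal and gap_differs and suffix_close


original = (57.0, 71.0, 87.0, 99.0)    # illustrative gap values only
candidate = (57.0, 71.0, 167.0, 99.25)
is_pair = approx_modification(original, candidate, k=2, tol=0.5, delta=1.0)
```

A spatial index such as the R-tree mentioned in the text would replace the `suffix_close` scan when many suffixes share one prefix block.
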
Also, in order to improve efficiency, we can enforce a constraint on the modification positions $k_1$ and $k_2$, such that $k_1 \ge 2$ and $n - k_2 \ge 2$. The search space can thereby be pruned further.

4.5 Scoring Function

For the protein PTMs mapping problem, many modifications may be detected, so a good scoring function for ranking the results becomes a necessity. Here, we give a simple and intuitively meaningful scoring function. Assume that one modification is detected from two frequent interval patterns $P_1$ and $P_2$, that the corresponding supports of these two patterns are $N_{P_1}$ and $N_{P_2}$, and that $I_{ki}$ represents the peak intensity vector for the $k$-th instance of pattern $P_i$. The final score is defined as

$$\mathrm{score} = \frac{\left(N_{P_1} \sum_k I_{k1}\right)\left(N_{P_2} \sum_l I_{l2}\right)}{N_{P_1} \sum_k I_{k1} + N_{P_2} \sum_l I_{l2}} \qquad (11)$$

The interpretation of this score is intuitive: up to a constant factor of 2, it is the harmonic mean of the weighted intensities of the two component patterns.

5 Experimental Results

In this section, several experiments are performed to show the efficiency and effectiveness of our algorithm. By conducting experiments on a real-world dataset, we also analyze the biological relevance of our algorithm.

5.1 Data set

The data set included 42,392 different spectra, collected during 46 runs of Liquid Chromatography Tandem Mass Spectrometry (LC-MS/MS). The entire spectrum collection from the first run was used in the analysis. There was no merging of spectra, and hence each spectrum corresponds to one peptide. All protein samples originated from 1D PAGE or 2D gel electrophoresis separations. Bands or spots were extracted from the gel and subjected to in-gel digestion by trypsin. Cysteines were reduced and alkylated by iodoacetamide. Two randomly selected spectra are shown in figure 7.

There are altogether 1,302 spectra in the data set. The x-axis was rescaled from mass-to-charge ratio $m/z$ to mass $m$. Since there are many noisy peaks, only the 300 most abundant peaks were selected and the other peaks were removed. The spectra were then discretized with a window of width $Z = 0.3$; among different peaks in the same compartment, only the most intense one is kept. After these preprocessing steps, the average number of peaks per spectrum is $P = 243$. We also rescale the peak intensities to 100.
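
The preprocessing steps just described (keep the most intense peaks, discretize, keep one peak per compartment, rescale intensities) can be sketched as below. Several points are assumptions: the conversion from $m/z$ to mass is taken to have been done already, "rescale to 100" is read as scaling so that the strongest retained peak has intensity 100, intensities are assumed positive, and the function name `preprocess` and its parameters are hypothetical.

```python
def preprocess(peaks, top_n=300, width=0.3, max_intensity=100.0):
    """Filter and normalise one spectrum given as (mass, intensity) pairs.

    Keeps the `top_n` most intense peaks, keeps only the most intense peak
    per compartment of size `width`, and rescales intensities so that the
    largest equals `max_intensity`.
    """
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]

    best_per_bin = {}
    for mass, intensity in strongest:
        bin_index = int(mass // width)             # compartment containing this peak
        current = best_per_bin.get(bin_index)
        if current is None or intensity > current[1]:
            best_per_bin[bin_index] = (mass, intensity)

    kept = sorted(best_per_bin.values())           # back to increasing mass order
    if not kept:
        return []
    scale = max_intensity / max(intensity for _, intensity in kept)
    return [(mass, intensity * scale) for mass, intensity in kept]


raw = [(100.0, 10.0), (100.1, 50.0), (200.0, 25.0), (300.0, 5.0)]
clean = preprocess(raw, top_n=3, width=0.5)
```

In the toy call, the weakest peak is dropped by the `top_n` filter and the two peaks near mass 100 fall into the same compartment, so only the stronger one survives.
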
Since we are only interested in the post-translational modifications mapping problem, we did not conduct protein/peptide identification via SEQUEST or X!Tandem. All the original spectra are stored in .pkl format files, in which one column holds the m/z values and the other column holds the intensity levels. For each spectrum, the precursor m/z and charge value are also recorded. In the following experiments, we first evaluate the performance of the algorithm (mainly, the running time) on real-world data; then the biological relevance of the results is considered and discussed.
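
Because the detected modifications are ranked by score before their biological relevance is examined, a small worked example of the scoring function of Section 4.5 (Equation 11) may help. The sketch assumes that each instance's intensity vector has already been reduced to a single number (for instance, a summed intensity); that reduction, the zero-weight guard and the name `modification_score` are assumptions of this example rather than details given in the report.

```python
from typing import Sequence


def modification_score(
    support1: int,
    intensities1: Sequence[float],
    support2: int,
    intensities2: Sequence[float],
) -> float:
    """Score of a modification detected between two frequent patterns.

    Each pattern contributes a weight equal to its support times the sum of
    its per-instance intensities; the score is w1 * w2 / (w1 + w2).
    """
    w1 = support1 * sum(intensities1)
    w2 = support2 * sum(intensities2)
    if w1 + w2 == 0:
        return 0.0                     # guard for empty or zero-intensity patterns
    return (w1 * w2) / (w1 + w2)


# Pattern 1: support 2, instance intensities 10 and 20 -> weight 60.
# Pattern 2: support 3, instance intensities 5, 5 and 10 -> weight 60.
score = modification_score(2, [10.0, 20.0], 3, [5.0, 5.0, 10.0])
```

With two equal weights of 60 the score is 30, which is half of their harmonic mean; a modification backed by one strong and one very weak pattern scores close to the weaker weight.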

Figure 7: Two randomly selected spectra from the dataset (x-axis: m/z, y-axis: intensity)

5.2 Algorithm Performance and Scalability

We now consider the scalability of the proposed frequent pattern mining algorithms. The space complexity has already been discussed; here, we focus mainly on the running time on real data. In an extreme case, e.g. when the minimal support is 1, the worst case of all such combinatorial algorithms has exponential complexity; however, their average performance differs considerably depending on the underlying heuristics. Since the running time of the approximate pattern mining method differs from that of the exact pattern mining method only by a constant factor, we only test the exact pattern mining approach here. Figure 8 shows the algorithm profiles. We fixed the value of min_sup at 20, the minimal length of each pattern at 8, and the window width for discretization at 0.3. The first plot shows the time used as the number of spectra grows. Each point on the graph is the average of experimental results over 20 repeated runs. With the number of spectra fixed at 10 and all other parameters unchanged, the second plot shows the time used as the minimal support min_sup grows; each point again averages the results over 20 runs. All these experiments were conducted on a Linux machine with a Pentium 1.7 GHz CPU.

Figure 8: Algorithm performance. The first plot shows time (seconds) vs. number of spectra; the second plot shows time (seconds) vs. min_sup.

As we can see from the figures, the running time as a function of the number of spectra $x$ on the real dataset grows as $x^{1.6}$, which is polynomial and below $x^2$; it is therefore a reasonable complexity for frequent pattern mining algorithms. From the second graph, we can see that the running time is very sensitive to decreases in the minimum support: if the minimum support is less than 4, the time would be more than 20 hours even for 20 spectra! Fortunately, we normally set min_sup ≥ 10; in this range, the running time is still approximately polynomial. Experiments with other parameter settings have consistently shown results similar to those in the above figures, and are omitted.

5.3 Biological Relevance

In the previous sections, we focused on the frequent interval pattern mining problem and on testing its scalability on the real dataset. In this section, we discuss the biological relevance of the modifications mapping algorithms. For our experiments, we set the modifications support ratio to 0.8, and require the modified and non-modified parts to each have a length of at least 3.

Our strategy here is to conduct the modifications mapping first; then all the detected modifications are clustered into histograms according to their scores, from which we obtain the distribution of the different modifications. One thing to note is that in a typical mass spectrometry experiment there is a large amount of noise: even though we only select the 300 most intense peaks, many of them are caused by random noise. If the peak noise is uniform or Gaussian, then, viewing interval gaps as random variables generated by the peaks, the noise distribution is biased toward smaller intervals. Therefore, we need to build a suitable noise model and the corresponding noise distributions for baseline correction. In this paper, two noise models are considered: uniform noise and Gaussian noise. Their distributions are shown in figure 9.

Figure 9: The noise models: Uniform vs. Gaussian

From the noise models, we can see that the interval distribution generated by uniform noise has a heavier tail than that generated by Gaussian noise. When we analyze the modification distributions, we need to consider the effects of random noise and conduct a baseline correction. The modification distribution is shown in figure 10, from which we can conclude that the underlying noise has a Gaussian distribution. After the baseline correction, we can see that modifications with gaps such as 109, 98 and 42 are very frequent. This distribution automatically rediscovers qualitative phenomena previously known to experienced mass spectrometrists. The intense mode at 109 represents the post-translational modification Pyrrolysine; the intense mode at 98 is mainly caused by the modification Phosphorylation and water, while the intense mode at 210 is mainly caused by Myristoylation. The intense mode at 17 represents Pyrrolidone carboxylic acid. All these modifications are very common ones, so it is not surprising that they are more frequent than the others. Also, some detected frequent gaps, such as 113 and 114, are in fact the masses of the amino acids Isoleucine and Asparagine. These are not really modifications, but since they occur frequently, they are also detected.

Figure 10: The distribution of the detected modifications

The algorithm also detected previously undescribed patterns. For example, after ranking the detected modifications according to the given scores, the gap of 226 ranks very high. This result is consistent across different parameter settings. Since no frequent double-site modification produces such a value, we suspect that this may be a new modification in the given samples. Further details need to be confirmed by analytical biological experiments.

6 Conclusions and Discussions

We make two main contributions in this paper. First, we motivated and formalized a novel class of data mining problems that arise in protein post-translational modifications mapping, specifically for the analysis of tandem mass spectrometry data, but with natural applications to market-basket databases. Second, we developed the frequent interval pattern mining and modifications mapping algorithms. Based on an extension of PrefixSpan and stable sorting, these methods are natural and efficient on the real-world mass


De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Xiaowen Liu De novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra Xiaowen Liu Department of BioHealth Informatics, Department of Computer and Information Sciences, Indiana University-Purdue

More information

Key questions of proteomics. Bioinformatics 2. Proteomics. Foundation of proteomics. What proteins are there? Protein digestion

Key questions of proteomics. Bioinformatics 2. Proteomics. Foundation of proteomics. What proteins are there? Protein digestion s s Key questions of proteomics What proteins are there? Bioinformatics 2 Lecture 2 roteomics How much is there of each of the proteins? - Absolute quantitation - Stoichiometry What (modification/splice)

More information

Parallel Algorithms For Real-Time Peptide-Spectrum Matching

Parallel Algorithms For Real-Time Peptide-Spectrum Matching Parallel Algorithms For Real-Time Peptide-Spectrum Matching A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science

More information

Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data. Han Liu

Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data. Han Liu Development and Evaluation of Methods for Predicting Protein Levels from Tandem Mass Spectrometry Data by Han Liu A thesis submitted in conformity with the requirements for the degree of Master of Science

More information
