General comments, TODOs, etc.

- Think up a catchy title.
- Decide which references to anonymize to maintain double-blindness.
- Write the abstract.
- Fill in the remaining [CITE] markers, which are cases where it is unclear to me what paper should be cited. The NIPS style guide allows for one page only of citations; the font size has been reduced as much as permissible.
- Given the space constraints, I'm inclined to think that we should not include a future work section in the paper.

Identifying Tandem Mass Spectra using Dynamic Bayesian Networks

Anonymous Author(s), Affiliation, Address

Abstract

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Introduction

Tandem mass spectrometry, a.k.a. shotgun proteomics, is an increasingly accurate and efficient technology for identifying and quantifying the proteins in a complex biological sample, such as a drop of blood. This technology has been used to identify biomarkers associated with disease [CITE] and to quantify changes in protein expression across different cell types [CITE]. Most applications of tandem mass spectrometry require the ability to accurately map a fragmentation spectrum generated by the device to the peptide, a protein subsequence, which generated the spectrum. The task of mapping spectra to peptides is known as spectrum identification, a pattern recognition task akin to speech recognition. In speech recognition, the input is an utterance, which must be mapped to a sentence in natural language, an enormous structured class of labels. A spectrum is akin to an acoustic utterance; a peptide is akin to a sentence, a sequence of amino acids instead of words. Unlike speech recognition, however, (i) accurate labelled data, i.e., ground truth peptide-spectrum matches, cannot be acquired; (ii) the scoring function for peptide-spectrum matches has traditionally been a non-probabilistic function, whereas probabilistic approaches have become dominant in speech; and (iii) the optimization used to identify the best peptide match requires enumerating and scoring all candidate peptides against a spectrum.

In this work, we introduce a dynamic Bayesian network (DBN) that generalizes one of the most popular scoring functions for peptide identification (Section ??). Our probabilistic formulation provides new insight into a technique that has been used in computational biology for over 7 years, and it provides a new function for scoring peptide-spectrum matches that significantly outperforms existing scoring functions, including those used in expensive commercial tools for peptide identification. We further show that additional qualitative knowledge about peptide fragmentation can easily be incorporated into the model, leading to further improvements in identification accuracy.

A fundamental computational constraint in current approaches to spectrum identification is the dependence on peptide database search. The best peptide match is found by exhaustively scoring a large list of candidate peptides against the spectrum. In speech recognition, database search would be analogous to decoding an utterance by scoring every common sentence in the English language against the utterance and picking the highest scoring match. In Section ??, we extend the model with lattices, a compressed representation of sequences that is common in speech and language processing [CITE]. Lattices find novel use here, allowing us to replace an exhaustive enumeration with dynamic programming over peptide sequences.

Figure 1: (A) Schematic of a typical shotgun proteomics experiment. The three steps, (1) cleaving proteins into peptides, (2) separation of peptides using liquid chromatography, and (3) tandem mass spectrometry analysis, are described in the text. (B) A sample fragmentation spectrum, along with the peptide (PTPVSHNDDLYG) responsible for generating the spectrum. Peaks corresponding to prefixes and suffixes of the peptide are colored red and blue, respectively. By convention, prefixes are referred to as b-ions and suffixes as y-ions.

Jeff: [Note, while lattices are common in the speech world, outside of speech they might be confusable with, say, Birkhoff lattices. We might want to add a bit of text in the above saying that a lattice, in this context, is a linear-sized representation of an exponential number of sequences, and can be seen as a sequential analogue of, say, binary decision diagrams. BTW, also, one option for the extended version is to, say, define a macro that is a comment for the main version but includes the text for the extended version, so then we can have one .tex file for both submission and supplement.]

Jeff: [One other comment here. I think this reads well, but we then immediately go on to describe shotgun proteomics. Perhaps in the intro offer up a few more details of the model and what enables it to achieve such good performance. The reason is that, otherwise, people might be left wondering.]

Tandem Mass Spectrometry

Experimental framework

A typical shotgun proteomics experiment proceeds in three steps, as illustrated in Figure 1. The input to the experiment is a collection of proteins, which have been isolated from a complex mixture. Each protein can be represented as a string of amino acids, where the alphabet is of size 20 and the proteins range widely in length. A typical complex mixture may contain a few thousand proteins, ranging in abundance from tens to hundreds of thousands of copies. In the first experimental step, the proteins are digested into shorter sequences (peptides) using a molecular agent called trypsin. To a first approximation, trypsin cleaves each protein deterministically at all occurrences of K or R unless they are followed by a P (a code sketch of this rule appears below). This digestion is necessary because whole proteins are too massive to be subject to direct mass spectrometry analysis without using very expensive equipment. Second, the peptides are subjected to a process called liquid chromatography, in which the peptides pass through a thin glass column that separates them based on a particular chemical property (e.g., hydrophobicity). This separation step reduces the complexity of the mixtures of peptides going into the mass spectrometer. The third step, which occurs inside the mass spectrometer, involves two rounds of mass spectrometry. Approximately every second, the device analyzes the population of intact peptides that most recently exited from the liquid chromatography column. Then, based on this initial analysis, the machine selects five distinct peptide species for fragmentation. Each of these fragmented species is subjected to a second round of mass spectrometry analysis.
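To make the digestion step concrete, the following sketch performs an in silico digest under the first-approximation rule stated above (cleave after K or R, except when followed by P). This is an illustrative sketch, not the digestion code used in the experiments.

def tryptic_digest(protein):
    """In silico trypsin digest: cleave after K or R, unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

# Example: the K before R is cleaved, but R followed by P is not.
assert tryptic_digest("AAKRPMMKGG") == ["AAK", "RPMMK", "GG"]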

The resulting fragmentation spectra are the primary output of the experiment. A sample fragmentation spectrum is shown in Figure 1B. During the fragmentation process, each amino acid sequence is typically cleaved once, so cleavage of the population results in a variety of observed prefix and suffix sequences. Each of these subpeptides is characterized by its mass-to-charge ratio (m/z, shown on the horizontal axis) and a corresponding intensity (unitless, but roughly proportional to abundance, shown on the vertical axis). The input to the spectrum identification problem is one such fragmentation spectrum, along with the observed (approximate) mass of the intact peptide. The goal is to identify the peptide sequence that was responsible for generating the spectrum.

Solving the spectrum identification problem

In practice, the spectrum identification problem can be solved in two different ways: either de novo, in which the universe of all possible peptides is considered as candidate solutions, or by restricting the search space to a given peptide database. Because high-throughput DNA sequencing can provide a very good estimate of the set of possible peptide sequences for most commonly studied organisms, and because database search typically provides more accurate results than de novo approaches, we focus on the database search version of the problem in this paper. The first computer program to use a database search procedure to identify fragmentation spectra was SEQUEST [7], and SEQUEST's basic algorithm is still used by essentially all database search tools available today. John: [cite: Sadygov] Bill: [Do we really need a cite for the previous sentence? If so, then we have to use something more recent. I vote to delete this cite.]

The approach is as follows. We are given a spectrum $S$, a peptide database $\mathcal{P}$, a precursor mass $m$ (i.e., the measured mass of the intact peptide), and a precursor mass tolerance $\delta$. The algorithm extracts from the database all peptides whose mass lies within the range $[m - \delta, m + \delta]$. These comprise the set of candidate peptides

$\mathcal{C}(m, \mathcal{P}, \delta) = \{p : p \in \mathcal{P},\; |m(p) - m| < \delta\}$,

where $m(p)$ is the calculated mass of peptide $p$. In practice, depending on the size of the peptide database and the precursor mass tolerance, the number of candidate peptides ranges from hundreds to hundreds of thousands. Each candidate peptide $p$ is used to generate a theoretical spectrum $s(p)$, and the theoretical spectrum is compared to the observed spectrum using a score function $K(\cdot, \cdot)$. The program reports the candidate peptide whose theoretical spectrum scores most highly with respect to the observed spectrum (a code sketch of this procedure appears below):

$\arg\max_{p \in \mathcal{C}(m, \mathcal{P}, \delta)} K(S, s(p))$.

In this work, we compare the performance of our method to two widely used search programs, SEQUEST and Mascot [11], as well as to a less commonly used but methodologically related method [15]. These three methods differ primarily in their choice of score function $K(\cdot, \cdot)$. Describing the details of SEQUEST's score function, XCorr, is beyond the scope of this paper, but the basic idea is to compute a scalar product of the observed and theoretical spectra and then subtract out an average scalar-product term produced by shifting the two spectra relative to one another:

$\mathrm{XCorr}(S, s(p)) = \langle S, s(p) \rangle - \frac{1}{151} \sum_{\tau=-75}^{+75} \sum_{i=1}^{N} S_i \, s(p)_{i-\tau}$.

Mascot is a commercial product that uses a probabilistic scoring function to rank candidate peptides, the details of which have not been published. The HMM-based method [15] first generates a theoretical spectrum, akin to SEQUEST's. The probability that the peaks in the theoretical spectrum occurred in the observed spectrum is then calculated using a hidden Markov model (HMM), and the candidate peptide is assigned a score based on the confidence of this probability, which is measured using an estimated normal distribution over the peptide masses within $\pm\delta$ of the precursor mass.
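The database-search loop itself is simple; the sketch below filters candidates by precursor mass and returns the best-scoring peptide. Here calc_mass, theoretical_spectrum, and score are hypothetical placeholders standing in for $m(\cdot)$, $s(\cdot)$, and $K(\cdot, \cdot)$; none of them are the paper's implementation.

def candidates(peptides, m, delta, calc_mass):
    """C(m, P, delta): peptides whose calculated mass is within delta of m."""
    return [p for p in peptides if abs(calc_mass(p) - m) < delta]

def identify(spectrum, peptides, m, delta, calc_mass, theoretical_spectrum, score):
    """Return the candidate peptide whose theoretical spectrum scores highest."""
    best_p, best_score = None, float("-inf")
    for p in candidates(peptides, m, delta, calc_mass):
        k = score(spectrum, theoretical_spectrum(p))
        if k > best_score:
            best_p, best_score = p, k
    return best_p, best_score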
The spectrum identification problem is difficult to solve primarily because of noise in the observed spectrum. In general, the x-axis of the observed spectrum is known with relatively high precision. However, in any given spectrum, many expected fragment ions will fail to be observed, and the spectrum is also likely to contain a variety of additional, unexplained peaks. These unexplained peaks may result from unusual fragmentation events, in which small molecular groups are shed from the peptide during fragmentation, or from contaminating molecules (peptides or other small molecules) that are present in the mass spectrometer along with the target peptide species.

Evaluation Metrics

Jeff: [Do we want this here? Most ML papers put the evaluation methodology just before the results section, and the first thing that is done is the intro, motivation, background literature and alternative approaches, and then the new approach (i.e., the new probabilistic method). Then the results (and methodology) go at the end. Putting the evaluation section here might surprise the reviewer.]

Labelled data for spectrum identification would consist of a set of ground truth peptide-spectrum matches: spectra where the mapping to a peptide is known. Unfortunately, accurate labelled data does not exist in this domain, which complicates evaluation. To estimate the probability that a spectrum identification is false, we therefore make use of the standard target-decoy approach [6, 9]. For each spectrum, two searches are performed: one to find the best peptide in the target database $\mathcal{C}(m, \mathcal{P}, \delta)$, and a second to find the best peptide in a decoy database $\mathcal{C}(m, \overline{\mathcal{P}}, \delta)$: a set of plausible peptides that is extremely unlikely to contain the correct peptide. In our experiments, the target database $\mathcal{P}$ and decoy database $\overline{\mathcal{P}}$ are the same size, with decoys being generated by randomly permuting peptides in the target database, under the requirement that $\mathcal{P} \cap \overline{\mathcal{P}} = \emptyset$.

A single tandem mass spectrometry experiment generates a large number $m$ of spectra. We expect a certain fraction of the identifications to be spurious, so only the top-$k$ scoring identifications are retained as quality matches; the rest are ignored. The False Discovery Rate (FDR) [12] (essentially one minus precision) provides a rule for determining what $k$ should be, given a bound on the expected fraction of spurious identifications among the top $k$. To make use of FDR, we first pose the question of whether or not to accept a single spectrum identification as a hypothesis test. Consider a single spectrum $s$, searched against the target database $\mathcal{C}(m(s), \mathcal{P}, \delta)$. Denote the peptide scoring function $\theta : \mathcal{P} \to \mathbb{R}$; when only one spectrum is under consideration, the dependence of $\theta$ on $s$ is not shown. Now, $\theta(p)$ is itself a random variable. To sample from the distribution of $\theta(p)$, we score each peptide in the target database: $\theta(\mathcal{C}) = \{\theta(p) : p \in \mathcal{C}(m(s), \mathcal{P}, \delta)\}$. Choosing the highest scoring peptide as the proposed match corresponds to the test statistic $T(\theta(\mathcal{C})) = \max\{\theta(p) : p \in \mathcal{C}(m(s), \mathcal{P}, \delta)\}$.

Colloquially, the hypothesis test can be expressed in terms of the test statistic. The null hypothesis, $H_0$, is that a peptide matches the spectrum by chance; the alternate hypothesis, $H_1$, is that the peptide generated the spectrum. Formally, the hypothesis test is

$H_0 : \theta(p) \leq \theta_0 \qquad H_1 : \theta(p) > \theta_0$,

where $\theta_0$ is a user-determined threshold on the score which determines the stringency of the test. As a decision rule, the null hypothesis is rejected if the test statistic $T(\theta(\mathcal{C}))$ exceeds a critical value $c$. Equivalently, the highest scoring peptide match for a spectrum is deemed correct if its score is greater than $c$.

A single tandem MS experiment leads to $m$ hypotheses. Let $V(c)$ be the number of hypotheses where $H_0$ is incorrectly rejected at critical value $c$; let $R(c)$ be the number of hypotheses where $H_0$ was rejected. For sufficiently large $m$, we estimate FDR using $\widehat{\mathrm{FDR}}(c) = \hat{E}[V(c)] / \hat{E}[R(c)]$. An estimate of $E[V(c)]$ is the number of spectra where the best decoy match has a score higher than $c$; an estimate of $E[R(c)]$ is the value of $R(c)$ itself, the number of spectra where the best target match has a score higher than $c$.
The decoy database is only used to estimate the error rate. The above estimate of FDR has an intuitive interpretation: it is one minus precision. Since $\widehat{\mathrm{FDR}}(c)$ is not necessarily monotone in $c$, we instead report the estimated q-value [14]:

$\hat{q}(c) = \min_{t \leq c} \widehat{\mathrm{FDR}}(t)$.

At a score threshold $c$, we have $q(c) \in [0, 1]$, which is the expected fraction of spurious identifications among those whose score is at least $c$. (A code sketch of this estimator is given below.) Jeff: [I think most of the equations above are not long and should be inlined, to save space.] The tradeoff between the number of identifications that are accepted and the stringency of the acceptance criterion is represented as an absolute ranking curve. Each point on the x-axis is a q-value in $[0, 1]$; the corresponding value on the y-axis is the number of top-scoring spectra whose identification is accepted at that q-value. At $q = 1$, all $m$ identifications are accepted; at $q = 0$, no identifications are accepted. In real-world usage, the concern is with maximizing performance at small q-values, so we plot only $q \in [0, 0.1]$.
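A minimal sketch of the target-decoy q-value estimator described above, assuming one best target score and one best decoy score per spectrum; the function name and interface are illustrative, not from the paper.

import numpy as np

def qvalues(target_scores, decoy_scores):
    """Estimate q-values for target PSMs from best target/decoy scores.

    FDR(c) is estimated as #{decoy scores > c} / #{target scores > c};
    the q-value of a target score c is the minimum estimated FDR over
    thresholds t <= c. Returns q-values aligned with the target scores
    sorted in decreasing order.
    """
    targets = np.sort(np.asarray(target_scores, dtype=float))[::-1]
    decoys = np.sort(np.asarray(decoy_scores, dtype=float))[::-1]
    fdr = np.empty(len(targets))
    d = 0
    for i, c in enumerate(targets):
        while d < len(decoys) and decoys[d] > c:
            d += 1
        fdr[i] = d / (i + 1)          # estimated FDR at threshold c
    # q-value: running minimum of FDR over all less stringent thresholds.
    return np.minimum.accumulate(fdr[::-1])[::-1]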

One method dominates another if its absolute ranking curve is strictly above the absolute ranking curve for the other method. Ajit: [Include a pointer to an absolute ranking plot.] Ajit: [It's natural for a machine learning audience to want to view hypothesis testing as a 0/1 classification problem: i.e., assign label 1 to the identifications we want to accept. If we switch from FDR to the positive False Discovery Rate [13], we can draw a connection to Bayes error rates on such a classification problem. However, using the Bayes error rate would require the user to control the stringency of the test by setting a parameter that corresponds to the relative importance of a false negative to a false positive, and that is harder to understand.]

Scoring identifications as inference in a Dynamic Bayesian Network

In this section, we show that Equation ?? can be generalized Ajit: [this is not strictly what is going on, as we're not generalizing it; rather we are inspired by it to create a proper probabilistic model] as inference in a DBN (Figure 2). The DBN is based on the mobile proton hypothesis of peptide fragmentation [4], which we describe mathematically below. We provide empirical evidence that our probabilistic scoring function is significantly better than the scoring functions used in commercially developed packages.

Peptide Fragmentation

We start at the second phase in tandem mass spectrometry: the protein sequence has been digested, and a peptide has been isolated in the first mass spectrometry step. A peptide is represented as a string $p = a_1 a_2 \ldots a_n$, since our only concern is in decoding a peptide's sequence. Each letter $a_t$ is drawn from an alphabet of standard amino acids, whose masses are known. The mass function $m(\cdot)$ refers both to the mass of a residue, $m(a_t)$, and to the mass of a sequence of residues, $m(p) = \sum_{t=1}^{n} m(a_t)$. Peptides are ionized in the second phase of mass spectrometry, so each peptide has a positive charge due to carrying one, two, or three extra protons: $c(p) \in \{1, 2, 3\}$. Peptides predominantly fragment into a prefix and a suffix: $b = a_1 \ldots a_t$, $y = a_{t+1} \ldots a_n$. The extra protons are divided between the prefix and suffix: $c(b) + c(y) = c(p)$. If either $b$ or $y$ has zero charge, it cannot be detected, and its corresponding peak will not show up in the spectrum. Charge distributions are not equally probable: e.g., when $c(p) = 2$, fragment ions carrying both extra protons are exceedingly rare. When the peptide fragments at position $t$, the prefix fragment ion is referred to as the $b_t$-ion and the suffix fragment ion as the $y_t$-ion. The set $\{b_t\}_t$ is referred to as the b-ion series, with the y-ion series defined analogously. Each peak in an idealized spectrum corresponds to a fragment ion in the b-ion or y-ion series: the position of the peak for a fragment ion $b$ is a deterministic function of $m(b)$ and $c(b)$, and likewise for $y$. A fragmentation spectrum measures how often particular peaks with a specific mass-to-charge (m/z) ratio are detected, so there is no sequence information in a peak. A spectrum $s$ is a collection of peaks, i.e., intensities at given m/z positions: $s = \{(x_j, h_j)\}$, where $x_j$ is a point on the m/z axis (x-axis) and $h_j$ is the corresponding intensity (see Figure 1B). In practice, there is substantial discrepancy between an idealized spectrum and a real one, due to measurement noise, secondary fragmentation of the b or y ions, non-protein contaminants, or other imperfections in the isolation of the peptide.
Even barring noise in the spectra, there is substantial variation across spectra which must be controlled. There can be order-of-magnitude differences in both total intensity $\sum_j h_j$ and maximum intensity $\max_j h_j$ across spectra. To control for intensity variation, we rank-normalize each spectrum: peaks are sorted in order of increasing intensity, and the $i$-th peak is assigned intensity $i/|s|$, so that $\max_j h_j = 1$. From the settings used to collect the spectra, we know the range of m/z units within which every $x_j$ lies. We quantize this m/z scanning range into $B$ uniformly sized bins. The bins correspond to a vector of random variables $S = (S_i : i = 1, \ldots, B)$. A spectrum is an instantiation of $S$, $s = (s_1, \ldots, s_B)$, where the most intense rank-normalized peak is retained in each bin. If no peak is present in a bin, then $S_i = 0$. (A code sketch of this preprocessing follows.)
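The rank-normalization and binning just described are easy to state in code. In this sketch the m/z range and bin count are illustrative parameters, not the settings used in the paper.

import numpy as np

def preprocess(peaks, mz_range=(200.0, 2000.0), n_bins=2000):
    """Rank-normalize a spectrum and quantize it into uniform m/z bins.

    peaks: list of (mz, intensity) pairs.
    Returns a length-n_bins vector; each bin keeps the most intense
    rank-normalized peak, and empty bins are 0.
    """
    # Rank-normalize: the i-th smallest intensity becomes i/|s|, so max = 1.
    order = np.argsort([h for _, h in peaks])
    rank_of = {idx: (r + 1) / len(peaks) for r, idx in enumerate(order)}

    lo, hi = mz_range
    width = (hi - lo) / n_bins
    s = np.zeros(n_bins)
    for idx, (mz, _) in enumerate(peaks):
        if lo <= mz < hi:
            b = int((mz - lo) / width)
            s[b] = max(s[b], rank_of[idx])  # most intense peak per bin
    return s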

A Generative Model of Peptide Fragmentation

DBNs are commonly used to model discrete-time phenomena, but they can be applied to any sequential data. In Figure 2, each non-prologue frame $t = 1, \ldots, n$ corresponds to the fragmentation of peptide $p$ into the $b_t$ and $y_t$ ions. The peptide is represented as a vector of random variables $A = (A_i : i = 1, \ldots, n)$. Since we are given the peptide-spectrum match to score, $A$ is observed, with $A_t = a_t$. The spectrum variables $S$ are fixed across all frames, and observed, since the spectrum is given. The masses of the prefix and suffix are denoted $n_t = m(a_1 \ldots a_t)$ and $c_t = m(a_{t+1} \ldots a_n)$. The masses can be defined recursively: $n_0 = 0$, $n_t = n_{t-1} + m(a_t)$, and $c_n = 0$, $c_t = c_{t+1} + m(a_{t+1})$. The variables $\{n_t, c_t\}_{t=1}^{n}$ identify the peptide. The random variables $b_t, y_t \in \{1, \ldots, B\}$ are indices that select which bins are expected to contain the $b_t$-ion and the $y_t$-ion, respectively. Recall that there is a deterministic relationship between the mass and charge of a fragment and its location on the m/z axis: e.g., $b_t = \mathrm{round}((n_t + 1)/z_t)$.

To generalize Equation ?? to a posterior probability, we need a background score which measures the average fit of the spectrum to a shifted version of the theoretical spectrum. The shift variable $\tau$ allows us to shift the theoretical spectrum: $\tau \in \{-M, \ldots, +M\}$, for a choice of $M \in \{1, \ldots, B\}$. Instead of predicting the $b_t$-ion at bin $b_t$, we predict it at bin $b_t + \tau$. If the shifted bin location is outside the range $\{1, \ldots, B\}$, we map those positions to a special bin that contains no peak. To shift the entire theoretical spectrum, $\tau_t = \tau_{t-1}$, $t = 2, \ldots, n$. The distribution over $\tau_1$ is uniform. Most of the conditional probability distributions in Figure 2 are deterministic, which leads to a simple form for the joint distribution:

$p(\tau, s, p) = p(\tau_1) \prod_{t=1}^{n} \prod_{i=1}^{B} \left[ P(S_i \mid b_t, y_t, \tau_t) \right]^{\delta(i = b_t + \tau_t \,\vee\, i = y_t + \tau_t)}.$  (1)

The inference which connects this model to Equation ?? is the log-posterior of $\tau_n$:

$\theta(s, p) \triangleq \log p(\tau_n = 0 \mid p, s) = \log p(\tau = 0, p, s) - \log \sum_{\tau} p(\tau)\, p(p, s \mid \tau).$  (2)

The $\log p(\tau = 0, p, s)$ term is the probabilistic analogue of $\langle S, s(p) \rangle$ in Equation ??, a term which measures the similarity between the theoretical and observed spectra. The $\log \sum_{\tau} p(\tau)\, p(p, s \mid \tau)$ term is a generalized version of the cross-correlation between the real and theoretical spectra: the average similarity between the spectrum and shifted versions of the theoretical spectrum. Computing the scoring function $\theta(s, p)$ is somewhat simpler than computing the evidence $p(p, s)$. Algorithms for DBN inference are typically forward-backward schemes (cf. [2]), and $\theta(\cdot)$ can be computed using only a forward pass.
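The foreground-minus-background structure of Equation (2) can be illustrated directly on binned spectra: score the unshifted theoretical peaks, then subtract the log of the average score over all shifts. The sketch below is a toy illustration that operates on bin vectors rather than performing DBN inference; f is a per-bin virtual-evidence function such as the one defined in the next section.

import numpy as np

def shift_score(s, theoretical_bins, M, f):
    """Toy analogue of Equation (2): log-score at shift 0 minus the log of
    the average score over shifts tau in {-M, ..., +M}.

    s: length-B vector of rank-normalized bin intensities.
    theoretical_bins: bin indices predicted to contain b-/y-ions.
    f: non-negative per-bin evidence function with f(0) = 1.
    """
    B = len(s)

    def log_match(tau):
        # Shifted bins outside [0, B) map to an empty bin (intensity 0).
        return sum(np.log(f(s[i + tau] if 0 <= i + tau < B else 0.0))
                   for i in theoretical_bins)

    scores = np.array([log_match(tau) for tau in range(-M, M + 1)])
    # Log of the average over shifts, computed stably (log-sum-exp).
    m = scores.max()
    log_avg = m + np.log(np.mean(np.exp(scores - m)))
    return log_match(0) - log_avg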
Virtual Evidence

An advantage of our probabilistic approach to scoring is that we have substantial flexibility in representing the contribution of peaks towards the score, $P(S_i \mid b_t, y_t, \tau_t)$. Using virtual evidence [10], we are free to choose an arbitrary non-negative function $f_i(S)$ to model each bin. One way to mimic the observation $S_i = s_i$ is to introduce a virtual binary variable $C_i$, whose sole parent is $S_i$. The virtual child is fixed to $C_i = 1$. If $P(C_i = 1 \mid S_i) \propto \delta(S_i = s_i)$, then $P(S_i = s_i \mid b_t, y_t, \tau_t) \propto \sum_{S_i} P(S_i \mid b_t, y_t, \tau_t)\, P(C_i = 1 \mid S_i)$. Virtual evidence changes the definition of the virtual child's conditional probability distribution to $P(C_i = 1 \mid S_i) = f_i(S_i)$, for a user-defined non-negative function $f_i$. One could define a separate $f_i$ for each bin $i$, but for simplicity we choose a single function $f$ for all bins. Following Equation ?? we impose additional constraints on the form of $f$. The score of a peptide-spectrum match should depend only on the peaks: $f(0) = 1$. If a peak is found in an activated bin, its contribution to the score must be higher than that of an activated bin with no peak: for $S > 0$, $f(S) > f(0)$. Finally, matching high-intensity peaks should be worth more than matching low-intensity peaks; the b- and y-series should be more prominent than noise in the spectrum: i.e., $f$ is monotone increasing. Based on our experiments, a class of $f$ that works particularly well is

$f_\lambda(S) = \frac{e^{\lambda} - \lambda - 1 + \lambda e^{\lambda S}}{e^{\lambda} - 1}.$  (3)
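A short sketch of the $f_\lambda$ family as reconstructed above; the exact algebraic form should be treated as an assumption, so the code checks the constraints stated in the text ($f(0) = 1$ and monotonicity) rather than asserting the paper's exact function.

import numpy as np

def f_lambda(s, lam):
    """Reconstructed virtual-evidence function: f(0) = 1 and monotone
    increasing in s, with lam > 0 weighting peak intensity."""
    return (np.exp(lam) - lam - 1 + lam * np.exp(lam * s)) / (np.exp(lam) - 1)

# Sanity checks against the constraints stated in the text.
assert np.isclose(f_lambda(0.0, 2.0), 1.0)        # f(0) = 1
grid = np.linspace(0.0, 1.0, 11)
assert np.all(np.diff(f_lambda(grid, 2.0)) > 0)   # monotone increasing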

The parameter $\lambda > 0$ dictates the relative value placed upon peak intensity in the scoring function.

Figure 2: The model as a graphical model (prologue, chunk, epilogue): the prologue occurs once at the beginning, the epilogue occurs once at the end, and the chunk is unrolled as necessary to any desired length. At each chunk, the bottom plate is expanded to have $B$ copies ($S_i$, $i = 1, \ldots, B$, shared across all frames).

Experiments

We compare the performance of spectrum identification algorithms on three tandem MS experiments on proteins from two different organisms:

- cm: A tryptic digest of S. cerevisiae lysate, in which every spectrum has the same precursor ion charge; that single value of c(p) is used for all candidate peptides.
- Yeast-: A tryptic digest of S. cerevisiae lysate, in which each spectrum has one of two possible precursor charges. We compare all algorithms under the assumption of a single fixed c(p) for each candidate peptide.
- Worm-: A tryptic digest of C. elegans proteins, with the same two possible precursor charges. Again, we compare all algorithms under a single assumed c(p).

The peptide database $\mathcal{P}$ for the yeast data sets is generated by an in silico trypsin digest of the soluble yeast proteome [16]. We compare the performance of four spectrum identification algorithms on these three data sets: our method, SEQUEST/XCorr, Mascot, and the HMM-based method [15]. Jeff: [Say again that these other methods include at least one that is quite standard, and that Mascot is commercial.] The search parameters are controlled across the four methods. Candidate peptides are selected using a fixed precursor mass window $\delta$, save for the HMM-based method, which uses a hard-coded window. The entire b- and y-ion series are assumed to be present. A fixed modification to cysteine is included to account for carbamidomethylation of protein disulfide bonds. In all cases, the decoys are generated by randomly permuting target peptides.

Figure 3 presents the absolute ranking comparison of the four methods. In all cases, there is a significant improvement in the number of spectra that are confidently identified, with our method strictly dominating over the plotted range of q, save for the Worm- experiment. Ajit: [I'm betting that the poor performance on Worm- is due largely to failures on spectra with heavy precursor masses, where the charge is probably higher than assumed. If we assume too low a charge, a large chunk of the b/y-series would fall outside the scanning range of the device.] Jeff: [I think we should include some reasons for this performance increase, rather than just giving it. First, did you end up using the MLE for λ? If so, say so. This should also include the benefit of the f function: this distribution was not pulled out of a hat, but is a family of distributions that are particularly suited to this problem. Should also say that this function is novel, and has not before been used for this (or any) problem, as far as we know. Now also, a key benefit of our approach is that it is probabilistic, and thus automatically normalized appropriately, unlike the crux approach mentioned above where there can be unwanted miscalibrations between foreground and background model (at least this should be our hypothesis).]

Footnotes: The prefix and suffix of a peptide are more commonly referred to as the N- and C-terminal fragments. Comparisons on additional data sets are included in supplementary Section D.

Figure 3: Absolute ranking comparison of the four methods (our method, Mascot, and the others) on the cm, Yeast-, and Worm- data sets; panels (a)-(h) plot the number of spectra identified against the q-value threshold.

Lattice Decoding

Lattice representation for peptide database

The drawback of representing each peptide as an individual observation sequence is that the same computations need to be carried out multiple times for peptides with identical substrings. A more efficient way of representing a peptide database is in the form of a subpeptide lattice. Lattice representations are widely used for other sequence modeling problems outside computational biology, such as speech and language processing (e.g., [1, 3, 5]). They provide a way of representing a finite but possibly very large set of strings in a compact, compressed form by sharing common substrings. Given an alphabet $\mathcal{A}$ of amino acids, a peptide $p$ can be defined as a string over $\mathcal{A}$. A subpeptide $s$ is a substring of $p$ whose length $|s|$ is typically less than the length of $p$. We denote the total inventory of subpeptides by $\mathcal{S}$. A subpeptide lattice is a directed acyclic graph $G = (V, E)$ with a set of vertices $V$ and a set of edges $E$, each of which is labelled with a subpeptide $s \in \mathcal{S}$ and, optionally, additional information such as frequencies or probabilities.

Figure 4: Compressed lattice for the three peptides AAAANWLR, AAADEWDER, AAADLISR. From Jeff: [I don't think we'll have space for this figure, unfortunately, at least in the main version. We could use it in the extended version (which could be a strict superset of this paper).]

Figure 5: Graphical model structure for a peptide lattice.

The concatenation of subpeptides along a path through the lattice corresponds to a complete peptide in the database. Using a lattice representation, common subpeptides can be shared among peptides and the peptide database can be represented much more compactly. The computations needed to evaluate the observation model for specific amino acids are only performed once per edge; thus, depending on the degree of sharing inherent in the lattice (relative to the uncompressed database), significant speedups can be achieved. The question is how to define $\mathcal{S}$ such that the resulting lattice is as compact as possible. To address this problem we exploit the fact that, formally, a lattice is a (weighted) finite-state automaton (FSA) Jeff: [Add cite]. Our initial starting point is a naive lattice representation where every peptide is represented as a separate path consisting of edges labelled with individual amino acids only. We then apply a series of well-known operations on finite-state machines that transform the lattice into the corresponding minimal lattice, the one with the smallest possible number of states. The alphabet $\mathcal{S}$ results as a by-product of this procedure. The first step is to convert the peptide database (i.e., a simple set of strings) to a finite-state automaton $F$. Next, $F$ is determinized. Determinization converts $F$ into an equivalent FSA, $F_{det}$, such that for any given state $q$ and alphabet symbol $a \in \mathcal{A}$, there is only a single outgoing edge from $q$ labeled with $a$. Third, $F_{det}$ is minimized. Minimization creates an FSA, $F_{min}$, that is equivalent to $F_{det}$ but has the minimal number of states. Algorithms for determinization and minimization have been studied in depth (e.g., [8]); we use the implementations provided in the OpenFst toolkit. Finally, deterministic subpaths in the lattice (sequences of states with only one outgoing edge) are collapsed into a single edge, further limiting the number of states and edges and thus reducing memory requirements. Figure 4 shows an example of a compressed lattice for three peptides. At the end of this procedure, $\mathcal{S}$ is defined by the list of unique edge labels in the final collapsed lattice. (A code sketch of this construction is given below.)

One problem is that the lattice incorporates peptides of different lengths, which complicates scoring with the observation model described in Section ??. In order to be able to score all strings simultaneously, they need to be warped to a common length. We achieve this by appending a dummy amino acid symbol to peptides shorter than the longest peptide in the database, such that all strings have the same length.
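The determinize-then-minimize pipeline can be sketched without OpenFst: a trie over a set of strings is already deterministic, and merging equivalent sub-tries bottom-up performs minimization (the classic DAWG construction). This is an illustrative sketch, not the paper's implementation, and it omits the final step of collapsing deterministic subpaths into single multi-symbol edges.

def minimal_lattice(peptides):
    """Minimal deterministic acyclic automaton for a set of peptides.

    Returns (transitions, start, finals), where transitions maps a state id
    to {symbol: next state id}.
    """
    # Step 1: build a trie (deterministic by construction for a string set).
    trie, final, next_id = {0: {}}, {0: False}, 1
    for p in peptides:
        node = 0
        for a in p:
            if a not in trie[node]:
                trie[next_id], final[next_id] = {}, False
                trie[node][a] = next_id
                next_id += 1
            node = trie[node][a]
        final[node] = True

    # Step 2: minimize by hashing each sub-trie's signature bottom-up;
    # states with identical signatures are merged.
    canon, remap = {}, {}

    def collapse(node):
        edges = tuple(sorted((a, collapse(c)) for a, c in trie[node].items()))
        sig = (final[node], edges)
        remap[node] = canon.setdefault(sig, node)
        return remap[node]

    start = collapse(0)
    transitions = {n: {a: remap[c] for a, c in trie[n].items()}
                   for n in trie if remap[n] == n}
    finals = {n for n in transitions if final[n]}
    return transitions, start, finals

# The three peptides from Figure 4 share the prefix AAA, which becomes a
# single path in the minimized lattice.
trans, start, finals = minimal_lattice(["AAAANWLR", "AAADEWDER", "AAADLISR"])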
Graphical model representation of lattices

In order to use a lattice representation within our graphical modeling framework, the lattice needs to be represented as a graphical model structure, visualized in Figure 5. Valid paths through the lattice are specified by the NODE variable and associated parameters: the probability of node $j$ given node $i$ is nonzero whenever an edge exists between them in the original lattice. The SP (subpeptide) variable, with cardinality $|\mathcal{S}|$, encodes the identity of the edge label and depends on a start node $i$ and an end node $j$. SPPOS specifies the position in the subpeptide; whenever the final position is reached, the binary transition variable TRANS is switched to 1. The TRANS variable is in turn a switching parent for NODE and SP; if it is 1, NODE and SP take on new values (i.e., a transition in the lattice occurs), otherwise the values from the previous frame are copied. Finally, SPPOS and SP jointly determine the amino acid (AA) variable, which is connected to an observation (or a more complicated observation model as described above). The validity of strings is ensured by dedicated end-node and end-transition variables, which ensure that the end of the observation sequence coincides with the end of a subpeptide. When using a peptide lattice to search an entire database, precursor filtering can be done as part of the search. To this end, a pruning variable is included that assigns zero probability to a path if the current accumulated mass exceeds the upper mass limit, or if the lower mass limit exceeds the current mass plus the maximum possible mass that can still be added before the end of the peptide is reached (the maximum remaining mass is equal to the remaining number of peptide positions multiplied by the largest mass value of any amino acid). The pruning variable is checked whenever a new edge in the lattice is being entered; a sketch of this check follows.
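The pruning rule is a simple predicate on partial paths. The sketch below uses the heaviest standard residue (tryptophan) for the remaining-mass bound; the interface is illustrative, not the paper's implementation.

MAX_RESIDUE_MASS = 186.08  # tryptophan, the heaviest standard residue

def prune(current_mass, positions_left, lower, upper,
          max_residue_mass=MAX_RESIDUE_MASS):
    """True if a partial lattice path can no longer land in [lower, upper].

    Prune when the accumulated mass already exceeds the upper limit, or
    when even appending the heaviest residue at every remaining position
    cannot reach the lower limit.
    """
    return (current_mass > upper or
            current_mass + positions_left * max_residue_mass < lower)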

Table 1: Sizes of naive and compressed lattices (given as number of nodes/number of edges), and the size of the subpeptide alphabet $\mathcal{S}$, for the worm and yeast databases.

Table 2: CPU time of inference for database search vs. search through the lattice, for Experiments A, B, and C.

Experiments

Table 1 compares the sizes of the original naive lattice representation, where each peptide is represented as an individual string, and the corresponding compressed lattice representation. With respect to computational efficiency, speedups can be achieved by evaluating the observation model only once for each edge in the lattice. Three different timing experiments were conducted to evaluate the lattice representation. In Experiment A, we use the lattice as a compact representation for sets of peptides that have been prefiltered according to their precursor mass values. In Experiment B, the entire peptide database is represented as a lattice and the search is conducted against the entire database; precursor filtering is performed as part of the search, through the pruning variable in the graphical model lattice structure, as described above. In Experiment C, we also conduct a search over the entire database but (additionally?) use pruning options provided by the graphical model inference code. Timing experiments were conducted on a (MACHINE SPECS?). Each number is the average over repeated runs and reports the inference time only, excluding startup cost.

Acknowledgements

Use unnumbered third level headings for the acknowledgements title. All acknowledgements go at the end of the paper.

References

[1] X. Aubert, C. Dugast, H. Ney, and V. Steinbiss. Large vocabulary continuous speech recognition of Wall Street Journal data. In Proceedings of ICASSP, 1994.
[2] J. Bilmes. Dynamic graphical models. IEEE Signal Processing Magazine, Nov. 2010.
[3] C. Chelba and A. Acero. Position-specific posterior lattices for indexing speech. In Proceedings of ACL, 2005.
[4] A. R. Dongre, J. L. Jones, A. Somogyi, and V. H. Wysocki. Influence of peptide composition, gas-phase basicity, and chemical modification on fragmentation efficiency: evidence for the mobile proton model. Journal of the American Chemical Society, 118:8365-8374, 1996.
[5] C. Dyer, S. Muresan, and P. Resnik. Generalizing word lattice translation. In Proceedings of ACL/HLT, 2008.
[6] J. E. Elias and S. P. Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207-214, 2007.
[7] J. K. Eng, A. L. McCormack, and J. R. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976-989, 1994.
[8] J. E. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Mass., 1979.
[9] L. Käll, J. D. Storey, M. J. MacCoss, and W. S. Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(1):29-34, 2008.
[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[11] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551-3567, 1999.
[12] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64:479-498, 2002.
[13] J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013-2035, 2003.
[14] J. D. Storey and R. Tibshirani. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences of the United States of America, 100:9440-9445, 2003.
[15] Y. Wan, A. Yang, and T. Chen. PepHMM: A hidden Markov model based scoring function for mass spectrometry database search. Analytical Chemistry, 78(2):432-437, 2006.
[16] M. P. Washburn, D. Wolters, and J. R. Yates, III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19:242-247, 2001.

SUPPLEMENTARY MATERIAL

A Tandem Mass Spectrometry

B Description of the Testing Datasets

Elided. The data sets have been used in previously published work.

C Evaluation Metrics

Add the form of the equation when n_targets ≠ n_decoys. Include results on qvality: i.e., that we do better even under alternate estimators of the FDR.

Ajit: [Questions that we may want to put in a supplement.]

Q: Why are ground truth peptide-spectrum matches not available in any significant quantities?
A: Theoretically, one could create a purified sample of a peptide, which could be used to generate a spectrum where the peptide is known. However, the resolution of tandem mass spectrometry is so high that creating sufficiently pure samples is impractical. One could attempt to label spectra by hand, but such labellings are known not to be especially accurate [CITE].

D Scoring Identifications as Inference in a Dynamic Bayesian Network

Explain where the virtual evidence function comes from, the MLE, and why it does not work well.

Additional Experiments: Scatter plots relating our scoring function to the competing methods. Break down the comparison of methods by filtering returned PSMs by peptide length, by spectrum length, and by precursor mass. Sum-product vs. max-product. Ablative: replace the VECPT function with intensity.

E Lattice Decoding

Additional Experiments


Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry Ting Chen Department of Genetics arvard Medical School Boston, MA 02115, USA Ming-Yang Kao Department of Computer

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Identification of proteins by enzyme digestion, mass

Identification of proteins by enzyme digestion, mass Method for Screening Peptide Fragment Ion Mass Spectra Prior to Database Searching Roger E. Moore, Mary K. Young, and Terry D. Lee Beckman Research Institute of the City of Hope, Duarte, California, USA

More information

Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry* S

Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry* S Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry* S Yan Fu and Xiaohong Qian Technological Innovation and Resources 2014 by The American

More information

Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 *

Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 * Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 * 1 Department of Chemistry, Pomona College, Claremont, California

More information

Announcements. Problem Set 6 due next Monday, February 25, at 12:50PM. Midterm graded, will be returned at end of lecture.

Announcements. Problem Set 6 due next Monday, February 25, at 12:50PM. Midterm graded, will be returned at end of lecture. Turing Machines Hello Hello Condensed Slide Slide Readers! Readers! This This lecture lecture is is almost almost entirely entirely animations that that show show how how each each Turing Turing machine

More information

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry by Xi Han A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Oliver Serang Department of Genome Sciences, University of Washington, Seattle, Washington Michael

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Intensity-based protein identification by machine learning from a library of tandem mass spectra

Intensity-based protein identification by machine learning from a library of tandem mass spectra Intensity-based protein identification by machine learning from a library of tandem mass spectra Joshua E Elias 1,Francis D Gibbons 2,Oliver D King 2,Frederick P Roth 2,4 & Steven P Gygi 1,3,4 Tandem mass

More information

De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry

De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry 17 th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved. 1 De Novo Peptide Identification Via Mixed-Integer Linear

More information

Proteomics. November 13, 2007

Proteomics. November 13, 2007 Proteomics November 13, 2007 Acknowledgement Slides presented here have been borrowed from presentations by : Dr. Mark A. Knepper (LKEM, NHLBI, NIH) Dr. Nathan Edwards (Center for Bioinformatics and Computational

More information

Week 2: Defining Computation

Week 2: Defining Computation Computational Complexity Theory Summer HSSP 2018 Week 2: Defining Computation Dylan Hendrickson MIT Educational Studies Program 2.1 Turing Machines Turing machines provide a simple, clearly defined way

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Augmented Statistical Models for Speech Recognition

Augmented Statistical Models for Speech Recognition Augmented Statistical Models for Speech Recognition Mark Gales & Martin Layton 31 August 2005 Trajectory Models For Speech Processing Workshop Overview Dependency Modelling in Speech Recognition: latent

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra John T. Halloran Department of Public Health Sciences University of California, Davis jthalloran@ucdavis.edu

More information

Extensions of Bayesian Networks. Outline. Bayesian Network. Reasoning under Uncertainty. Features of Bayesian Networks.

Extensions of Bayesian Networks. Outline. Bayesian Network. Reasoning under Uncertainty. Features of Bayesian Networks. Extensions of Bayesian Networks Outline Ethan Howe, James Lenfestey, Tom Temple Intro to Dynamic Bayesian Nets (Tom Exact inference in DBNs with demo (Ethan Approximate inference and learning (Tom Probabilistic

More information

Protein Identification Using Tandem Mass Spectrometry. Nathan Edwards Informatics Research Applied Biosystems

Protein Identification Using Tandem Mass Spectrometry. Nathan Edwards Informatics Research Applied Biosystems Protein Identification Using Tandem Mass Spectrometry Nathan Edwards Informatics Research Applied Biosystems Outline Proteomics context Tandem mass spectrometry Peptide fragmentation Peptide identification

More information

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means 4.1 The Need for Analytical Comparisons...the between-groups sum of squares averages the differences

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I kevin small & byron wallace today a review of probability random variables, maximum likelihood, etc. crucial for clinical

More information

Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges

Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges 1 PRELIMINARIES Two vertices X i and X j are adjacent if there is an edge between them. A path

More information

Approximate Inference

Approximate Inference Approximate Inference Simulation has a name: sampling Sampling is a hot topic in machine learning, and it s really simple Basic idea: Draw N samples from a sampling distribution S Compute an approximate

More information

Introduction to pepxmltab

Introduction to pepxmltab Introduction to pepxmltab Xiaojing Wang October 30, 2018 Contents 1 Introduction 1 2 Convert pepxml to a tabular format 1 3 PSMs Filtering 4 4 Session Information 5 1 Introduction Mass spectrometry (MS)-based

More information

About the relationship between formal logic and complexity classes

About the relationship between formal logic and complexity classes About the relationship between formal logic and complexity classes Working paper Comments welcome; my email: armandobcm@yahoo.com Armando B. Matos October 20, 2013 1 Introduction We analyze a particular

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Comprehensive support for quantitation

Comprehensive support for quantitation Comprehensive support for quantitation One of the major new features in the current release of Mascot is support for quantitation. This is still work in progress. Our goal is to support all of the popular

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

On Stateless Multicounter Machines

On Stateless Multicounter Machines On Stateless Multicounter Machines Ömer Eğecioğlu and Oscar H. Ibarra Department of Computer Science University of California, Santa Barbara, CA 93106, USA Email: {omer, ibarra}@cs.ucsb.edu Abstract. We

More information

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma COMS 4771 Probabilistic Reasoning via Graphical Models Nakul Verma Last time Dimensionality Reduction Linear vs non-linear Dimensionality Reduction Principal Component Analysis (PCA) Non-linear methods

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Parallel Algorithms For Real-Time Peptide-Spectrum Matching

Parallel Algorithms For Real-Time Peptide-Spectrum Matching Parallel Algorithms For Real-Time Peptide-Spectrum Matching A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science

More information

Last updated: Copyright

Last updated: Copyright Last updated: 2012-08-20 Copyright 2004-2012 plabel (v2.4) User s Manual by Bioinformatics Group, Institute of Computing Technology, Chinese Academy of Sciences Tel: 86-10-62601016 Email: zhangkun01@ict.ac.cn,

More information

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4 Inference: Exploiting Local Structure aphne Koller Stanford University CS228 Handout #4 We have seen that N inference exploits the network structure, in particular the conditional independence and the

More information

CSE182-L8. Mass Spectrometry

CSE182-L8. Mass Spectrometry CSE182-L8 Mass Spectrometry Project Notes Implement a few tools for proteomics C1:11/2/04 Answer MS questions to get started, select project partner, select a project. C2:11/15/04 (All but web-team) Plan

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information