General comments, TODOs, etc.

- Think up a catchy title.
- Decide which references to anonymize to maintain double-blindness.
- Write the abstract.
- Fill in the remaining [CITE] markers, which are cases where it is unclear to me what paper should be cited. The NIPS style guide allows for one page only of citations; the font size has been reduced as much as permissible.
- Given the space constraints, I'm inclined to think that we should not include a future work section in the paper.

Identifying Tandem Mass Spectra using Dynamic Bayesian Networks

Anonymous Author(s), Affiliation, Address

Abstract

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Introduction

Tandem mass spectrometry, a.k.a. shotgun proteomics, is an increasingly accurate and efficient technology for identifying and quantifying the proteins in a complex biological sample, such as a drop of blood. This technology has been used to identify biomarkers associated with disease [CITE] and to quantify changes in protein expression across different cell types [CITE]. Most applications of tandem mass spectrometry require the ability to accurately map a fragmentation spectrum generated by the device to the peptide, a protein subsequence, which generated the spectrum. The task of mapping spectra to peptides is known as spectrum identification, a pattern recognition task akin to speech recognition. In speech recognition, the input is an utterance, which must be mapped to a sentence in natural language, an enormous structured class of labels. A spectrum is akin to an acoustic utterance; a peptide is akin to a sentence, a sequence of amino acids instead of words. Unlike speech recognition, however, (i) accurate labelled data, i.e., ground truth peptide-spectrum matches, cannot be acquired; (ii) the scoring function for peptide-spectrum matches has traditionally been a non-probabilistic function, whereas probabilistic approaches have become dominant in speech; and (iii) the optimization used to identify the best peptide match requires enumerating and scoring all candidate peptides against a spectrum.

In this work, we introduce a dynamic Bayesian network (DBN) that generalizes one of the most popular scoring functions for peptide identification (Section ??). Our probabilistic formulation provides new insight into a technique that has been used in computational biology for over 7 years, and it provides a new function for scoring peptide-spectrum matches that significantly outperforms existing scoring functions, including those used in expensive commercial tools for peptide identification. We further show that additional qualitative knowledge about peptide fragmentation can easily be incorporated into the model, leading to further improvements in identification accuracy.

A fundamental computational constraint in current approaches to spectrum identification is the dependence on peptide database search. The best peptide match is found by exhaustively scoring a large list of candidate peptides against the spectrum. In speech recognition, database search would be analogous to decoding an utterance by scoring every common sentence in the English language against the utterance and picking the highest scoring match. In Section ??, we extend the model with lattices, a compressed representation of sequences that is common in speech and language processing [CITE]. Lattices find novel use here, allowing us to replace an exhaustive enumeration with dynamic programming over peptide sequences.

Figure 1: (A) Schematic of a typical shotgun proteomics experiment. The three steps, (1) cleaving proteins into peptides, (2) separation of peptides using liquid chromatography, and (3) tandem mass spectrometry analysis, are described in the text. (B) A sample fragmentation spectrum, along with the peptide (PTPVSHNDDLYG) responsible for generating the spectrum. Peaks corresponding to prefixes and suffixes of the peptide are colored red and blue, respectively. By convention, prefixes are referred to as b-ions and suffixes as y-ions.

Jeff: [Note, while lattices are common in the speech world, outside of speech they might be confusable with, say, Birkhoff lattices. We might want to add a bit of text in the above saying that a lattice, in this context, is a linear-sized representation of an exponential number of sequences, and can be seen as a sequential analogue of, say, binary decision diagrams. BTW, also, one option for the extended version is to, say, define a macro that is a comment for the main version but includes the text for the extended version, so then we can have one .tex file for both submission and supplement.]

Jeff: [One other comment here. I think this reads well, but we then immediately go on to describe shotgun proteomics. Perhaps in the intro offer up a few more details of the model and what enables it to achieve such good performance. The reason is that, otherwise, people might be left wondering.]

Tandem Mass Spectrometry

Experimental framework

A typical shotgun proteomics experiment proceeds in three steps, as illustrated in Figure 1. The input to the experiment is a collection of proteins, which have been isolated from a complex mixture. Each protein can be represented as a string of amino acids, where the alphabet is of size 20 and the proteins range widely in length. A typical complex mixture may contain a few thousand proteins, ranging in abundance from tens to hundreds of thousands of copies. In the first experimental step, the proteins are digested into shorter sequences (peptides) using a molecular agent called trypsin. To a first approximation, trypsin cleaves each protein deterministically at all occurrences of K or R unless they are followed by a P (a code sketch of this rule appears below). This digestion is necessary because whole proteins are too massive to be subject to direct mass spectrometry analysis without using very expensive equipment. Second, the peptides are subjected to a process called liquid chromatography, in which the peptides pass through a thin glass column that separates them based on a particular chemical property (e.g., hydrophobicity). This separation step reduces the complexity of the mixtures of peptides going into the mass spectrometer. The third step, which occurs inside the mass spectrometer, involves two rounds of mass spectrometry. Approximately every second, the device analyzes the population of intact peptides that most recently exited from the liquid chromatography column. Then, based on this initial analysis, the machine selects five distinct peptide species for fragmentation. Each of these fragmented species is subjected to a second round of mass spectrometry analysis.
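To make the digestion step concrete, the following sketch performs an in silico digest under the first-approximation rule stated above (cleave after K or R, except when followed by P). This is an illustrative sketch, not the digestion code used in the experiments.

def tryptic_digest(protein):
    """In silico trypsin digest: cleave after K or R, unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

# Example: the K before R is cleaved, but R followed by P is not.
assert tryptic_digest("AAKRPMMKGG") == ["AAK", "RPMMK", "GG"]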

The resulting fragmentation spectra are the primary output of the experiment. A sample fragmentation spectrum is shown in Figure 1B. During the fragmentation process, each amino acid sequence is typically cleaved once, so cleavage of the population results in a variety of observed prefix and suffix sequences. Each of these subpeptides is characterized by its mass-to-charge ratio (m/z, shown on the horizontal axis) and a corresponding intensity (unitless, but roughly proportional to abundance, shown on the vertical axis). The input to the spectrum identification problem is one such fragmentation spectrum, along with the observed (approximate) mass of the intact peptide. The goal is to identify the peptide sequence that was responsible for generating the spectrum.

Solving the spectrum identification problem

In practice, the spectrum identification problem can be solved in two different ways: either de novo, in which the universe of all possible peptides is considered as candidate solutions, or by restricting the search space to a given peptide database. Because high-throughput DNA sequencing can provide a very good estimate of the set of possible peptide sequences for most commonly studied organisms, and because database search typically provides more accurate results than de novo approaches, we focus on the database search version of the problem in this paper. The first computer program to use a database search procedure to identify fragmentation spectra was SEQUEST [7], and SEQUEST's basic algorithm is still used by essentially all database search tools available today. John: [cite: Sadygov] Bill: [Do we really need a cite for the previous sentence? If so, then we have to use something more recent. I vote to delete this cite.]

The approach is as follows. We are given a spectrum $S$, a peptide database $\mathcal{P}$, a precursor mass $m$ (i.e., the measured mass of the intact peptide), and a precursor mass tolerance $\delta$. The algorithm extracts from the database all peptides whose mass lies within the range $[m - \delta, m + \delta]$. These comprise the set of candidate peptides

$\mathcal{C}(m, \mathcal{P}, \delta) = \{p : p \in \mathcal{P},\; |m(p) - m| < \delta\}$,

where $m(p)$ is the calculated mass of peptide $p$. In practice, depending on the size of the peptide database and the precursor mass tolerance, the number of candidate peptides ranges from hundreds to hundreds of thousands. Each candidate peptide $p$ is used to generate a theoretical spectrum $s(p)$, and the theoretical spectrum is compared to the observed spectrum using a score function $K(\cdot, \cdot)$. The program reports the candidate peptide whose theoretical spectrum scores most highly with respect to the observed spectrum (a code sketch of this procedure appears below):

$\arg\max_{p \in \mathcal{C}(m, \mathcal{P}, \delta)} K(S, s(p))$.

In this work, we compare the performance of our method to two widely used search programs, SEQUEST and Mascot [11], as well as to a less commonly used but methodologically related method [15]. These three methods differ primarily in their choice of score function $K(\cdot, \cdot)$. Describing the details of SEQUEST's score function, XCorr, is beyond the scope of this paper, but the basic idea is to compute a scalar product of the observed and theoretical spectra and then subtract out an average scalar-product term produced by shifting the two spectra relative to one another:

$\mathrm{XCorr}(S, s(p)) = \langle S, s(p) \rangle - \frac{1}{151} \sum_{\tau=-75}^{+75} \sum_{i=1}^{N} S_i \, s(p)_{i-\tau}$.

Mascot is a commercial product that uses a probabilistic scoring function to rank candidate peptides, the details of which have not been published. The HMM-based method [15] first generates a theoretical spectrum, akin to SEQUEST's. The probability that the peaks in the theoretical spectrum occurred in the observed spectrum is then calculated using a hidden Markov model (HMM), and the candidate peptide is assigned a score based on the confidence of this probability, which is measured using an estimated normal distribution over the peptide masses within $\pm\delta$ of the precursor mass.
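The database-search loop itself is simple; the sketch below filters candidates by precursor mass and returns the best-scoring peptide. Here calc_mass, theoretical_spectrum, and score are hypothetical placeholders standing in for $m(\cdot)$, $s(\cdot)$, and $K(\cdot, \cdot)$; none of them are the paper's implementation.

def candidates(peptides, m, delta, calc_mass):
    """C(m, P, delta): peptides whose calculated mass is within delta of m."""
    return [p for p in peptides if abs(calc_mass(p) - m) < delta]

def identify(spectrum, peptides, m, delta, calc_mass, theoretical_spectrum, score):
    """Return the candidate peptide whose theoretical spectrum scores highest."""
    best_p, best_score = None, float("-inf")
    for p in candidates(peptides, m, delta, calc_mass):
        k = score(spectrum, theoretical_spectrum(p))
        if k > best_score:
            best_p, best_score = p, k
    return best_p, best_score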
The spectrum identification problem is difficult to solve primarily because of noise in the observed spectrum. In general, the x-axis of the observed spectrum is known with relatively high precision. However, in any given spectrum, many expected fragment ions will fail to be observed, and the spectrum is also likely to contain a variety of additional, unexplained peaks. These unexplained peaks may result from unusual fragmentation events, in which small molecular groups are shed from the peptide during fragmentation, or from contaminating molecules (peptides or other small molecules) that are present in the mass spectrometer along with the target peptide species.

Evaluation Metrics

Jeff: [Do we want this here? Most ML papers put the evaluation methodology just before the results section, and the first thing that is done is the intro, motivation, background literature and alternative approaches, and then the new approach (i.e., the new probabilistic method). Then the results (and methodology) go at the end. Putting the evaluation section here might surprise the reviewer.]

Labelled data for spectrum identification would consist of a set of ground truth peptide-spectrum matches: spectra where the mapping to a peptide is known. Unfortunately, accurate labelled data does not exist in this domain, which complicates evaluation. To estimate the probability that a spectrum identification is false, we therefore make use of the standard target-decoy approach [6, 9]. For each spectrum, two searches are performed: one to find the best peptide in the target database $\mathcal{C}(m, \mathcal{P}, \delta)$, and a second to find the best peptide in a decoy database $\mathcal{C}(m, \overline{\mathcal{P}}, \delta)$: a set of plausible peptides that is extremely unlikely to contain the correct peptide. In our experiments, the target database $\mathcal{P}$ and decoy database $\overline{\mathcal{P}}$ are the same size, with decoys being generated by randomly permuting peptides in the target database, under the requirement that $\mathcal{P} \cap \overline{\mathcal{P}} = \emptyset$.

A single tandem mass spectrometry experiment generates a large number $m$ of spectra. We expect a certain fraction of the identifications to be spurious, so only the top-$k$ scoring identifications are retained as quality matches; the rest are ignored. The False Discovery Rate (FDR) [12] (essentially one minus precision) provides a rule for determining what $k$ should be, given a bound on the expected fraction of spurious identifications among the top $k$. To make use of FDR, we first pose the question of whether or not to accept a single spectrum identification as a hypothesis test. Consider a single spectrum $s$, searched against the target database $\mathcal{C}(m(s), \mathcal{P}, \delta)$. Denote the peptide scoring function $\theta : \mathcal{P} \to \mathbb{R}$; when only one spectrum is under consideration, the dependence of $\theta$ on $s$ is not shown. Now, $\theta(p)$ is itself a random variable. To sample from the distribution of $\theta(p)$, we score each peptide in the target database: $\theta(\mathcal{C}) = \{\theta(p) : p \in \mathcal{C}(m(s), \mathcal{P}, \delta)\}$. Choosing the highest scoring peptide as the proposed match corresponds to the test statistic $T(\theta(\mathcal{C})) = \max\{\theta(p) : p \in \mathcal{C}(m(s), \mathcal{P}, \delta)\}$.

Colloquially, the hypothesis test can be expressed in terms of the test statistic. The null hypothesis, $H_0$, is that a peptide matches the spectrum by chance; the alternate hypothesis, $H_1$, is that the peptide generated the spectrum. Formally, the hypothesis test is

$H_0 : \theta(p) \leq \theta_0 \qquad H_1 : \theta(p) > \theta_0$,

where $\theta_0$ is a user-determined threshold on the score which determines the stringency of the test. As a decision rule, the null hypothesis is rejected if the test statistic $T(\theta(\mathcal{C}))$ exceeds a critical value $c$. Equivalently, the highest scoring peptide match for a spectrum is deemed correct if its score is greater than $c$.

A single tandem MS experiment leads to $m$ hypotheses. Let $V(c)$ be the number of hypotheses where $H_0$ is incorrectly rejected at critical value $c$; let $R(c)$ be the number of hypotheses where $H_0$ was rejected. For sufficiently large $m$, we estimate FDR using $\widehat{\mathrm{FDR}}(c) = \hat{E}[V(c)] / \hat{E}[R(c)]$. An estimate of $E[V(c)]$ is the number of spectra where the best decoy match has a score higher than $c$; an estimate of $E[R(c)]$ is the value of $R(c)$ itself, the number of spectra where the best target match has a score higher than $c$.
The decoy database is only used to estimate the error rate. The above estimate of FDR has an intuitive interpretation: it is one minus precision. Since $\widehat{\mathrm{FDR}}(c)$ is not necessarily monotone in $c$, we instead report the estimated q-value [14]:

$\hat{q}(c) = \min_{t \leq c} \widehat{\mathrm{FDR}}(t)$.

At a score threshold $c$, we have $q(c) \in [0, 1]$, which is the expected fraction of spurious identifications among those whose score is at least $c$. (A code sketch of this estimator is given below.) Jeff: [I think most of the equations above are not long and should be inlined, to save space.] The tradeoff between the number of identifications that are accepted and the stringency of the acceptance criterion is represented as an absolute ranking curve. Each point on the x-axis is a q-value in $[0, 1]$; the corresponding value on the y-axis is the number of top-scoring spectra whose identification is accepted at that q-value. At $q = 1$, all $m$ identifications are accepted; at $q = 0$, no identifications are accepted. In real-world usage, the concern is with maximizing performance at small q-values, so we plot only $q \in [0, 0.1]$.
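A minimal sketch of the target-decoy q-value estimator described above, assuming one best target score and one best decoy score per spectrum; the function name and interface are illustrative, not from the paper.

import numpy as np

def qvalues(target_scores, decoy_scores):
    """Estimate q-values for target PSMs from best target/decoy scores.

    FDR(c) is estimated as #{decoy scores > c} / #{target scores > c};
    the q-value of a target score c is the minimum estimated FDR over
    thresholds t <= c. Returns q-values aligned with the target scores
    sorted in decreasing order.
    """
    targets = np.sort(np.asarray(target_scores, dtype=float))[::-1]
    decoys = np.sort(np.asarray(decoy_scores, dtype=float))[::-1]
    fdr = np.empty(len(targets))
    d = 0
    for i, c in enumerate(targets):
        while d < len(decoys) and decoys[d] > c:
            d += 1
        fdr[i] = d / (i + 1)          # estimated FDR at threshold c
    # q-value: running minimum of FDR over all less stringent thresholds.
    return np.minimum.accumulate(fdr[::-1])[::-1]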

One method dominates another if its absolute ranking curve is strictly above the absolute ranking curve for the other method. Ajit: [Include a pointer to an absolute ranking plot.] Ajit: [It's natural for a machine learning audience to want to view hypothesis testing as a 0/1 classification problem: i.e., assign label 1 to the identifications we want to accept. If we switch from FDR to the positive False Discovery Rate [13], we can draw a connection to Bayes error rates on such a classification problem. However, using the Bayes error rate would require the user to control the stringency of the test by setting a parameter that corresponds to the relative importance of a false negative to a false positive, and that is harder to understand.]

Scoring identifications as inference in a Dynamic Bayesian Network

In this section, we show that Equation ?? can be generalized Ajit: [this is not strictly what is going on, as we're not generalizing it; rather we are inspired by it to create a proper probabilistic model] as inference in a DBN (Figure 2). The DBN is based on the mobile proton hypothesis of peptide fragmentation [4], which we describe mathematically below. We provide empirical evidence that our probabilistic scoring function is significantly better than the scoring functions used in commercially developed packages.

Peptide Fragmentation

We start at the second phase in tandem mass spectrometry: the protein sequence has been digested, and a peptide has been isolated in the first mass spectrometry step. A peptide is represented as a string $p = a_1 a_2 \ldots a_n$, since our only concern is in decoding a peptide's sequence. Each letter $a_t$ is drawn from an alphabet of standard amino acids, whose masses are known. The mass function $m(\cdot)$ refers both to the mass of a residue, $m(a_t)$, and to the mass of a sequence of residues, $m(p) = \sum_{t=1}^{n} m(a_t)$. Peptides are ionized in the second phase of mass spectrometry, so each peptide has a positive charge due to carrying one, two, or three extra protons: $c(p) \in \{1, 2, 3\}$. Peptides predominantly fragment into a prefix and a suffix: $b = a_1 \ldots a_t$, $y = a_{t+1} \ldots a_n$. The extra protons are divided between the prefix and suffix: $c(b) + c(y) = c(p)$. If either $b$ or $y$ has zero charge, it cannot be detected, and its corresponding peak will not show up in the spectrum. Charge distributions are not equally probable: e.g., when $c(p) = 2$, fragment ions carrying both extra protons are exceedingly rare. When the peptide fragments at position $t$, the prefix fragment ion is referred to as the $b_t$-ion and the suffix fragment ion as the $y_t$-ion. The set $\{b_t\}_t$ is referred to as the b-ion series, with the y-ion series defined analogously. Each peak in an idealized spectrum corresponds to a fragment ion in the b-ion or y-ion series: the position of the peak for a fragment ion $b$ is a deterministic function of $m(b)$ and $c(b)$, and likewise for $y$. A fragmentation spectrum measures how often particular peaks with a specific mass-to-charge (m/z) ratio are detected, so there is no sequence information in a peak. A spectrum $s$ is a collection of peaks, i.e., intensities at given m/z positions: $s = \{(x_j, h_j)\}$, where $x_j$ is a point on the m/z axis (x-axis) and $h_j$ is the corresponding intensity (see Figure 1B). In practice, there is substantial discrepancy between an idealized spectrum and a real one, due to measurement noise, secondary fragmentation of the b or y ions, non-protein contaminants, or other imperfections in the isolation of the peptide.
Even barring noise in the spectra, there is substantial variation across spectra which must be controlled. There can be order-of-magnitude differences in both total intensity $\sum_j h_j$ and maximum intensity $\max_j h_j$ across spectra. To control for intensity variation, we rank-normalize each spectrum: peaks are sorted in order of increasing intensity, and the $i$-th peak is assigned intensity $i/|s|$, so that $\max_j h_j = 1$. From the settings used to collect the spectra, we know the range of m/z units within which every $x_j$ lies. We quantize this m/z scanning range into $B$ uniformly sized bins. The bins correspond to a vector of random variables $S = (S_i : i = 1, \ldots, B)$. A spectrum is an instantiation of $S$, $s = (s_1, \ldots, s_B)$, where the most intense rank-normalized peak is retained in each bin. If no peak is present in a bin, then $S_i = 0$. (A code sketch of this preprocessing follows.)
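The rank-normalization and binning just described are easy to state in code. In this sketch the m/z range and bin count are illustrative parameters, not the settings used in the paper.

import numpy as np

def preprocess(peaks, mz_range=(200.0, 2000.0), n_bins=2000):
    """Rank-normalize a spectrum and quantize it into uniform m/z bins.

    peaks: list of (mz, intensity) pairs.
    Returns a length-n_bins vector; each bin keeps the most intense
    rank-normalized peak, and empty bins are 0.
    """
    # Rank-normalize: the i-th smallest intensity becomes i/|s|, so max = 1.
    order = np.argsort([h for _, h in peaks])
    rank_of = {idx: (r + 1) / len(peaks) for r, idx in enumerate(order)}

    lo, hi = mz_range
    width = (hi - lo) / n_bins
    s = np.zeros(n_bins)
    for idx, (mz, _) in enumerate(peaks):
        if lo <= mz < hi:
            b = int((mz - lo) / width)
            s[b] = max(s[b], rank_of[idx])  # most intense peak per bin
    return s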

A Generative Model of Peptide Fragmentation

DBNs are commonly used to model discrete-time phenomena, but they can be applied to any sequential data. In Figure 2, each non-prologue frame $t = 1, \ldots, n$ corresponds to the fragmentation of peptide $p$ into the $b_t$ and $y_t$ ions. The peptide is represented as a vector of random variables $A = (A_i : i = 1, \ldots, n)$. Since we are given the peptide-spectrum match to score, $A$ is observed, with $A_t = a_t$. The spectrum variables $S$ are fixed across all frames, and observed, since the spectrum is given. The masses of the prefix and suffix are denoted $n_t = m(a_1 \ldots a_t)$ and $c_t = m(a_{t+1} \ldots a_n)$. The masses can be defined recursively: $n_0 = 0$, $n_t = n_{t-1} + m(a_t)$, and $c_n = 0$, $c_t = c_{t+1} + m(a_{t+1})$. The variables $\{n_t, c_t\}_{t=1}^{n}$ identify the peptide. The random variables $b_t, y_t \in \{1, \ldots, B\}$ are indices that select which bins are expected to contain the $b_t$-ion and the $y_t$-ion, respectively. Recall that there is a deterministic relationship between the mass and charge of a fragment and its location on the m/z axis: e.g., $b_t = \mathrm{round}((n_t + 1)/z_t)$.

To generalize Equation ?? to a posterior probability, we need a background score which measures the average fit of the spectrum to a shifted version of the theoretical spectrum. The shift variable $\tau$ allows us to shift the theoretical spectrum: $\tau \in \{-M, \ldots, +M\}$, for a choice of $M \in \{1, \ldots, B\}$. Instead of predicting the $b_t$-ion at bin $b_t$, we predict it at bin $b_t + \tau$. If the shifted bin location is outside the range $\{1, \ldots, B\}$, we map those positions to a special bin that contains no peak. To shift the entire theoretical spectrum, $\tau_t = \tau_{t-1}$, $t = 2, \ldots, n$. The distribution over $\tau_1$ is uniform. Most of the conditional probability distributions in Figure 2 are deterministic, which leads to a simple form for the joint distribution:

$p(\tau, s, p) = p(\tau_1) \prod_{t=1}^{n} \prod_{i=1}^{B} \left[ P(S_i \mid b_t, y_t, \tau_t) \right]^{\delta(i = b_t + \tau_t \,\vee\, i = y_t + \tau_t)}.$  (1)

The inference which connects this model to Equation ?? is the log-posterior of $\tau_n$:

$\theta(s, p) \triangleq \log p(\tau_n = 0 \mid p, s) = \log p(\tau = 0, p, s) - \log \sum_{\tau} p(\tau)\, p(p, s \mid \tau).$  (2)

The $\log p(\tau = 0, p, s)$ term is the probabilistic analogue of $\langle S, s(p) \rangle$ in Equation ??, a term which measures the similarity between the theoretical and observed spectra. The $\log \sum_{\tau} p(\tau)\, p(p, s \mid \tau)$ term is a generalized version of the cross-correlation between the real and theoretical spectra: the average similarity between the spectrum and shifted versions of the theoretical spectrum. Computing the scoring function $\theta(s, p)$ is somewhat simpler than computing the evidence $p(p, s)$. Algorithms for DBN inference are typically forward-backward schemes (cf. [2]), and $\theta(\cdot)$ can be computed using only a forward pass.
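The foreground-minus-background structure of Equation (2) can be illustrated directly on binned spectra: score the unshifted theoretical peaks, then subtract the log of the average score over all shifts. The sketch below is a toy illustration that operates on bin vectors rather than performing DBN inference; f is a per-bin virtual-evidence function such as the one defined in the next section.

import numpy as np

def shift_score(s, theoretical_bins, M, f):
    """Toy analogue of Equation (2): log-score at shift 0 minus the log of
    the average score over shifts tau in {-M, ..., +M}.

    s: length-B vector of rank-normalized bin intensities.
    theoretical_bins: bin indices predicted to contain b-/y-ions.
    f: non-negative per-bin evidence function with f(0) = 1.
    """
    B = len(s)

    def log_match(tau):
        # Shifted bins outside [0, B) map to an empty bin (intensity 0).
        return sum(np.log(f(s[i + tau] if 0 <= i + tau < B else 0.0))
                   for i in theoretical_bins)

    scores = np.array([log_match(tau) for tau in range(-M, M + 1)])
    # Log of the average over shifts, computed stably (log-sum-exp).
    m = scores.max()
    log_avg = m + np.log(np.mean(np.exp(scores - m)))
    return log_match(0) - log_avg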
Virtual Evidence

An advantage of our probabilistic approach to scoring is that we have substantial flexibility in representing the contribution of peaks towards the score, $P(S_i \mid b_t, y_t, \tau_t)$. Using virtual evidence [10], we are free to choose an arbitrary non-negative function $f_i(S)$ to model each bin. One way to mimic the observation $S_i = s_i$ is to introduce a virtual binary variable $C_i$, whose sole parent is $S_i$. The virtual child is fixed to $C_i = 1$. If $P(C_i = 1 \mid S_i) \propto \delta(S_i = s_i)$, then $P(S_i = s_i \mid b_t, y_t, \tau_t) \propto \sum_{S_i} P(S_i \mid b_t, y_t, \tau_t)\, P(C_i = 1 \mid S_i)$. Virtual evidence changes the definition of the virtual child's conditional probability distribution to $P(C_i = 1 \mid S_i) = f_i(S_i)$, for a user-defined non-negative function $f_i$. One could define a separate $f_i$ for each bin $i$, but for simplicity we choose a single function $f$ for all bins. Following Equation ?? we impose additional constraints on the form of $f$. The score of a peptide-spectrum match should depend only on the peaks: $f(0) = 1$. If a peak is found in an activated bin, its contribution to the score must be higher than that of an activated bin with no peak: for $S > 0$, $f(S) > f(0)$. Finally, matching high-intensity peaks should be worth more than matching low-intensity peaks; the b- and y-series should be more prominent than noise in the spectrum: i.e., $f$ is monotone increasing. Based on our experiments, a class of $f$ that works particularly well is

$f_\lambda(S) = \frac{e^{\lambda} - \lambda - 1 + \lambda e^{\lambda S}}{e^{\lambda} - 1}.$  (3)
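A short sketch of the $f_\lambda$ family as reconstructed above; the exact algebraic form should be treated as an assumption, so the code checks the constraints stated in the text ($f(0) = 1$ and monotonicity) rather than asserting the paper's exact function.

import numpy as np

def f_lambda(s, lam):
    """Reconstructed virtual-evidence function: f(0) = 1 and monotone
    increasing in s, with lam > 0 weighting peak intensity."""
    return (np.exp(lam) - lam - 1 + lam * np.exp(lam * s)) / (np.exp(lam) - 1)

# Sanity checks against the constraints stated in the text.
assert np.isclose(f_lambda(0.0, 2.0), 1.0)        # f(0) = 1
grid = np.linspace(0.0, 1.0, 11)
assert np.all(np.diff(f_lambda(grid, 2.0)) > 0)   # monotone increasing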

The parameter $\lambda > 0$ dictates the relative value placed upon peak intensity in the scoring function.

Figure 2: The model as a graphical model (prologue, chunk, epilogue): the prologue occurs once at the beginning, the epilogue occurs once at the end, and the chunk is unrolled as necessary to any desired length. At each chunk, the bottom plate is expanded to have $B$ copies ($S_i$, $i = 1, \ldots, B$, shared across all frames).

Experiments

We compare the performance of spectrum identification algorithms on three tandem MS experiments on proteins from two different organisms:

- cm: A tryptic digest of S. cerevisiae lysate, in which every spectrum has the same precursor ion charge; that single value of c(p) is used for all candidate peptides.
- Yeast-: A tryptic digest of S. cerevisiae lysate, in which each spectrum has one of two possible precursor charges. We compare all algorithms under the assumption of a single fixed c(p) for each candidate peptide.
- Worm-: A tryptic digest of C. elegans proteins, with the same two possible precursor charges. Again, we compare all algorithms under a single assumed c(p).

The peptide database $\mathcal{P}$ for the yeast data sets is generated by an in silico trypsin digest of the soluble yeast proteome [16]. We compare the performance of four spectrum identification algorithms on these three data sets: our method, SEQUEST/XCorr, Mascot, and the HMM-based method [15]. Jeff: [Say again that these other methods include at least one that is quite standard, and that Mascot is commercial.] The search parameters are controlled across the four methods. Candidate peptides are selected using a fixed precursor mass window $\delta$, save for the HMM-based method, which uses a hard-coded window. The entire b- and y-ion series are assumed to be present. A fixed modification to cysteine is included to account for carbamidomethylation of protein disulfide bonds. In all cases, the decoys are generated by randomly permuting target peptides.

Figure 3 presents the absolute ranking comparison of the four methods. In all cases, there is a significant improvement in the number of spectra that are confidently identified, with our method strictly dominating over the plotted range of q, save for the Worm- experiment. Ajit: [I'm betting that the poor performance on Worm- is due largely to failures on spectra with heavy precursor masses, where the charge is probably higher than assumed. If we assume too low a charge, a large chunk of the b/y-series would fall outside the scanning range of the device.] Jeff: [I think we should include some reasons for this performance increase, rather than just giving it. First, did you end up using the MLE for λ? If so, say so. This should also include the benefit of the f function: this distribution was not pulled out of a hat, but is a family of distributions that are particularly suited to this problem. Should also say that this function is novel, and has not before been used for this (or any) problem, as far as we know. Now also, a key benefit of our approach is that it is probabilistic, and thus automatically normalized appropriately, unlike the crux approach mentioned above where there can be unwanted miscalibrations between foreground and background model (at least this should be our hypothesis).]

Footnotes: The prefix and suffix of a peptide are more commonly referred to as the N- and C-terminal fragments. Comparisons on additional data sets are included in supplementary Section D.

Figure 3: Absolute ranking comparison of the four methods (our method, Mascot, and the others) on the cm, Yeast-, and Worm- data sets; panels (a)-(h) plot the number of spectra identified against the q-value threshold.

Lattice Decoding

Lattice representation for peptide database

The drawback of representing each peptide as an individual observation sequence is that the same computations need to be carried out multiple times for peptides with identical substrings. A more efficient way of representing a peptide database is in the form of a subpeptide lattice. Lattice representations are widely used for other sequence modeling problems outside computational biology, such as speech and language processing (e.g., [1, 3, 5]). They provide a way of representing a finite but possibly very large set of strings in a compact, compressed form by sharing common substrings. Given an alphabet $\mathcal{A}$ of amino acids, a peptide $p$ can be defined as a string over $\mathcal{A}$. A subpeptide $s$ is a substring of $p$ whose length $|s|$ is typically less than the length of $p$. We denote the total inventory of subpeptides by $\mathcal{S}$. A subpeptide lattice is a directed acyclic graph $G = (V, E)$ with a set of vertices $V$ and a set of edges $E$, each of which is labelled with a subpeptide $s \in \mathcal{S}$ and, optionally, additional information such as frequencies or probabilities.

Figure 4: Compressed lattice for the three peptides AAAANWLR, AAADEWDER, AAADLISR. From Jeff: [I don't think we'll have space for this figure, unfortunately, at least in the main version. We could use it in the extended version (which could be a strict superset of this paper).]

Figure 5: Graphical model structure for a peptide lattice.

The concatenation of subpeptides along a path through the lattice corresponds to a complete peptide in the database. Using a lattice representation, common subpeptides can be shared among peptides and the peptide database can be represented much more compactly. The computations needed to evaluate the observation model for specific amino acids are only performed once per edge; thus, depending on the degree of sharing inherent in the lattice (relative to the uncompressed database), significant speedups can be achieved. The question is how to define $\mathcal{S}$ such that the resulting lattice is as compact as possible. To address this problem we exploit the fact that, formally, a lattice is a (weighted) finite-state automaton (FSA) Jeff: [Add cite]. Our initial starting point is a naive lattice representation where every peptide is represented as a separate path consisting of edges labelled with individual amino acids only. We then apply a series of well-known operations on finite-state machines that transform the lattice into the corresponding minimal lattice, the one with the smallest possible number of states. The alphabet $\mathcal{S}$ results as a by-product of this procedure. The first step is to convert the peptide database (i.e., a simple set of strings) to a finite-state automaton $F$. Next, $F$ is determinized. Determinization converts $F$ into an equivalent FSA, $F_{det}$, such that for any given state $q$ and alphabet symbol $a \in \mathcal{A}$, there is only a single outgoing edge from $q$ labeled with $a$. Third, $F_{det}$ is minimized. Minimization creates an FSA, $F_{min}$, that is equivalent to $F_{det}$ but has the minimal number of states. Algorithms for determinization and minimization have been studied in depth (e.g., [8]); we use the implementations provided in the OpenFst toolkit. Finally, deterministic subpaths in the lattice (sequences of states with only one outgoing edge) are collapsed into a single edge, further limiting the number of states and edges and thus reducing memory requirements. Figure 4 shows an example of a compressed lattice for three peptides. At the end of this procedure, $\mathcal{S}$ is defined by the list of unique edge labels in the final collapsed lattice. (A code sketch of this construction is given below.)

One problem is that the lattice incorporates peptides of different lengths, which complicates scoring with the observation model described in Section ??. In order to be able to score all strings simultaneously, they need to be warped to a common length. We achieve this by appending a dummy amino acid symbol to peptides shorter than the longest peptide in the database, such that all strings have the same length.
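The determinize-then-minimize pipeline can be sketched without OpenFst: a trie over a set of strings is already deterministic, and merging equivalent sub-tries bottom-up performs minimization (the classic DAWG construction). This is an illustrative sketch, not the paper's implementation, and it omits the final step of collapsing deterministic subpaths into single multi-symbol edges.

def minimal_lattice(peptides):
    """Minimal deterministic acyclic automaton for a set of peptides.

    Returns (transitions, start, finals), where transitions maps a state id
    to {symbol: next state id}.
    """
    # Step 1: build a trie (deterministic by construction for a string set).
    trie, final, next_id = {0: {}}, {0: False}, 1
    for p in peptides:
        node = 0
        for a in p:
            if a not in trie[node]:
                trie[next_id], final[next_id] = {}, False
                trie[node][a] = next_id
                next_id += 1
            node = trie[node][a]
        final[node] = True

    # Step 2: minimize by hashing each sub-trie's signature bottom-up;
    # states with identical signatures are merged.
    canon, remap = {}, {}

    def collapse(node):
        edges = tuple(sorted((a, collapse(c)) for a, c in trie[node].items()))
        sig = (final[node], edges)
        remap[node] = canon.setdefault(sig, node)
        return remap[node]

    start = collapse(0)
    transitions = {n: {a: remap[c] for a, c in trie[n].items()}
                   for n in trie if remap[n] == n}
    finals = {n for n in transitions if final[n]}
    return transitions, start, finals

# The three peptides from Figure 4 share the prefix AAA, which becomes a
# single path in the minimized lattice.
trans, start, finals = minimal_lattice(["AAAANWLR", "AAADEWDER", "AAADLISR"])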
Graphical model representation of lattices

In order to use a lattice representation within our graphical modeling framework, the lattice needs to be represented as a graphical model structure, visualized in Figure 5. Valid paths through the lattice are specified by the NODE variable and associated parameters: the probability of node $j$ given node $i$ is nonzero whenever an edge exists between them in the original lattice. The SP (subpeptide) variable, with cardinality $|\mathcal{S}|$, encodes the identity of the edge label and depends on a start node $i$ and an end node $j$. SPPOS specifies the position in the subpeptide; whenever the final position is reached, the binary transition variable TRANS is switched to 1. The TRANS variable is in turn a switching parent for NODE and SP; if it is 1, NODE and SP take on new values (i.e., a transition in the lattice occurs), otherwise the values from the previous frame are copied. Finally, SPPOS and SP jointly determine the amino acid (AA) variable, which is connected to an observation (or a more complicated observation model as described above). The validity of strings is ensured by dedicated end-node and end-transition variables, which ensure that the end of the observation sequence coincides with the end of a subpeptide. When using a peptide lattice to search an entire database, precursor filtering can be done as part of the search. To this end, a pruning variable is included that assigns zero probability to a path if the current accumulated mass exceeds the upper mass limit, or if the lower mass limit exceeds the current mass plus the maximum possible mass that can still be added before the end of the peptide is reached (the maximum remaining mass is equal to the remaining number of peptide positions multiplied by the largest mass value of any amino acid). The pruning variable is checked whenever a new edge in the lattice is being entered; a sketch of this check follows.
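The pruning rule is a simple predicate on partial paths. The sketch below uses the heaviest standard residue (tryptophan) for the remaining-mass bound; the interface is illustrative, not the paper's implementation.

MAX_RESIDUE_MASS = 186.08  # tryptophan, the heaviest standard residue

def prune(current_mass, positions_left, lower, upper,
          max_residue_mass=MAX_RESIDUE_MASS):
    """True if a partial lattice path can no longer land in [lower, upper].

    Prune when the accumulated mass already exceeds the upper limit, or
    when even appending the heaviest residue at every remaining position
    cannot reach the lower limit.
    """
    return (current_mass > upper or
            current_mass + positions_left * max_residue_mass < lower)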

Table 1: Sizes of naive and compressed lattices (given as number of nodes/number of edges), and the size of the subpeptide alphabet $\mathcal{S}$, for the worm and yeast databases.

Table 2: CPU time of inference for database search vs. search through the lattice, for Experiments A, B, and C.

Experiments

Table 1 compares the sizes of the original naive lattice representation, where each peptide is represented as an individual string, and the corresponding compressed lattice representation. With respect to computational efficiency, speedups can be achieved by evaluating the observation model only once for each edge in the lattice. Three different timing experiments were conducted to evaluate the lattice representation. In Experiment A, we use the lattice as a compact representation for sets of peptides that have been prefiltered according to their precursor mass values. In Experiment B, the entire peptide database is represented as a lattice and the search is conducted against the entire database; precursor filtering is performed as part of the search, through the pruning variable in the graphical model lattice structure, as described above. In Experiment C, we also conduct a search over the entire database but (additionally?) use pruning options provided by the graphical model inference code. Timing experiments were conducted on a (MACHINE SPECS?). Each number is the average over repeated runs and reports the inference time only, excluding startup cost.

Acknowledgements

Use unnumbered third level headings for the acknowledgements title. All acknowledgements go at the end of the paper.

References

[1] X. Aubert, C. Dugast, H. Ney, and V. Steinbiss. Large vocabulary continuous speech recognition of Wall Street Journal data. In Proceedings of ICASSP, 1994.
[2] J. Bilmes. Dynamic graphical models. IEEE Signal Processing Magazine, Nov. 2010.
[3] C. Chelba and A. Acero. Position-specific posterior lattices for indexing speech. In Proceedings of ACL, 2005.
[4] A. R. Dongre, J. L. Jones, A. Somogyi, and V. H. Wysocki. Influence of peptide composition, gas-phase basicity, and chemical modification on fragmentation efficiency: evidence for the mobile proton model. Journal of the American Chemical Society, 118:8365-8374, 1996.
[5] C. Dyer, S. Muresan, and P. Resnik. Generalizing word lattice translation. In Proceedings of ACL/HLT, 2008.
[6] J. E. Elias and S. P. Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207-214, 2007.
[7] J. K. Eng, A. L. McCormack, and J. R. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976-989, 1994.
[8] J. E. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Mass., 1979.
[9] L. Käll, J. D. Storey, M. J. MacCoss, and W. S. Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(1):29-34, 2008.
[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[11] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551-3567, 1999.
[12] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64:479-498, 2002.
[13] J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013-2035, 2003.
[14] J. D. Storey and R. Tibshirani. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences of the United States of America, 100:9440-9445, 2003.
[15] Y. Wan, A. Yang, and T. Chen. PepHMM: A hidden Markov model based scoring function for mass spectrometry database search. Analytical Chemistry, 78(2):432-437, 2006.
[16] M. P. Washburn, D. Wolters, and J. R. Yates, III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19:242-247, 2001.

SUPPLEMENTARY MATERIAL

A Tandem Mass Spectrometry

B Description of the Testing Datasets

Elided. The data sets have been used in previously published work.

C Evaluation Metrics

Add the form of the equation when n_targets ≠ n_decoys. Include results on qvality: i.e., that we do better even under alternate estimators of the FDR.

Ajit: [Questions that we may want to put in a supplement.]

Q: Why are ground truth peptide-spectrum matches not available in any significant quantities?
A: Theoretically, one could create a purified sample of a peptide, which could be used to generate a spectrum where the peptide is known. However, the resolution of tandem mass spectrometry is so high that creating sufficiently pure samples is impractical. One could attempt to label spectra by hand, but such labellings are known not to be especially accurate [CITE].

D Scoring Identifications as Inference in a Dynamic Bayesian Network

Explain where the virtual evidence function comes from, the MLE, and why it does not work well.

Additional Experiments: Scatter plots relating our scoring function to the competing methods. Break down the comparison of methods by filtering returned PSMs by peptide length, by spectrum length, and by precursor mass. Sum-product vs. max-product. Ablative: replace the VECPT function with intensity.

E Lattice Decoding

Additional Experiments


Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry

A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry Ting Chen Department of Genetics arvard Medical School Boston, MA 02115, USA Ming-Yang Kao Department of Computer

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Identification of proteins by enzyme digestion, mass

Identification of proteins by enzyme digestion, mass Method for Screening Peptide Fragment Ion Mass Spectra Prior to Database Searching Roger E. Moore, Mary K. Young, and Terry D. Lee Beckman Research Institute of the City of Hope, Duarte, California, USA

More information

Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry* S

Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry* S Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry* S Yan Fu and Xiaohong Qian Technological Innovation and Resources 2014 by The American

More information

Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 *

Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 * Improved 6- Plex TMT Quantification Throughput Using a Linear Ion Trap HCD MS 3 Scan Jane M. Liu, 1,2 * Michael J. Sweredoski, 2 Sonja Hess 2 * 1 Department of Chemistry, Pomona College, Claremont, California

More information

Announcements. Problem Set 6 due next Monday, February 25, at 12:50PM. Midterm graded, will be returned at end of lecture.

Announcements. Problem Set 6 due next Monday, February 25, at 12:50PM. Midterm graded, will be returned at end of lecture. Turing Machines Hello Hello Condensed Slide Slide Readers! Readers! This This lecture lecture is is almost almost entirely entirely animations that that show show how how each each Turing Turing machine

More information

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry by Xi Han A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data

Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data Oliver Serang Department of Genome Sciences, University of Washington, Seattle, Washington Michael

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Intensity-based protein identification by machine learning from a library of tandem mass spectra

Intensity-based protein identification by machine learning from a library of tandem mass spectra Intensity-based protein identification by machine learning from a library of tandem mass spectra Joshua E Elias 1,Francis D Gibbons 2,Oliver D King 2,Frederick P Roth 2,4 & Steven P Gygi 1,3,4 Tandem mass

More information

De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry

De Novo Peptide Identification Via Mixed-Integer Linear Optimization And Tandem Mass Spectrometry 17 th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved. 1 De Novo Peptide Identification Via Mixed-Integer Linear

More information

Proteomics. November 13, 2007

Proteomics. November 13, 2007 Proteomics November 13, 2007 Acknowledgement Slides presented here have been borrowed from presentations by : Dr. Mark A. Knepper (LKEM, NHLBI, NIH) Dr. Nathan Edwards (Center for Bioinformatics and Computational

More information

Week 2: Defining Computation

Week 2: Defining Computation Computational Complexity Theory Summer HSSP 2018 Week 2: Defining Computation Dylan Hendrickson MIT Educational Studies Program 2.1 Turing Machines Turing machines provide a simple, clearly defined way

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Augmented Statistical Models for Speech Recognition

Augmented Statistical Models for Speech Recognition Augmented Statistical Models for Speech Recognition Mark Gales & Martin Layton 31 August 2005 Trajectory Models For Speech Processing Workshop Overview Dependency Modelling in Speech Recognition: latent

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra John T. Halloran Department of Public Health Sciences University of California, Davis jthalloran@ucdavis.edu

More information

Extensions of Bayesian Networks. Outline. Bayesian Network. Reasoning under Uncertainty. Features of Bayesian Networks.

Extensions of Bayesian Networks. Outline. Bayesian Network. Reasoning under Uncertainty. Features of Bayesian Networks. Extensions of Bayesian Networks Outline Ethan Howe, James Lenfestey, Tom Temple Intro to Dynamic Bayesian Nets (Tom Exact inference in DBNs with demo (Ethan Approximate inference and learning (Tom Probabilistic

More information

Protein Identification Using Tandem Mass Spectrometry. Nathan Edwards Informatics Research Applied Biosystems

Protein Identification Using Tandem Mass Spectrometry. Nathan Edwards Informatics Research Applied Biosystems Protein Identification Using Tandem Mass Spectrometry Nathan Edwards Informatics Research Applied Biosystems Outline Proteomics context Tandem mass spectrometry Peptide fragmentation Peptide identification

More information

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means 4.1 The Need for Analytical Comparisons...the between-groups sum of squares averages the differences

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I kevin small & byron wallace today a review of probability random variables, maximum likelihood, etc. crucial for clinical

More information

Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges

Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges 1 PRELIMINARIES Two vertices X i and X j are adjacent if there is an edge between them. A path

More information

Approximate Inference

Approximate Inference Approximate Inference Simulation has a name: sampling Sampling is a hot topic in machine learning, and it s really simple Basic idea: Draw N samples from a sampling distribution S Compute an approximate

More information

Introduction to pepxmltab

Introduction to pepxmltab Introduction to pepxmltab Xiaojing Wang October 30, 2018 Contents 1 Introduction 1 2 Convert pepxml to a tabular format 1 3 PSMs Filtering 4 4 Session Information 5 1 Introduction Mass spectrometry (MS)-based

More information

About the relationship between formal logic and complexity classes

About the relationship between formal logic and complexity classes About the relationship between formal logic and complexity classes Working paper Comments welcome; my email: armandobcm@yahoo.com Armando B. Matos October 20, 2013 1 Introduction We analyze a particular

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Comprehensive support for quantitation

Comprehensive support for quantitation Comprehensive support for quantitation One of the major new features in the current release of Mascot is support for quantitation. This is still work in progress. Our goal is to support all of the popular

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

On Stateless Multicounter Machines

On Stateless Multicounter Machines On Stateless Multicounter Machines Ömer Eğecioğlu and Oscar H. Ibarra Department of Computer Science University of California, Santa Barbara, CA 93106, USA Email: {omer, ibarra}@cs.ucsb.edu Abstract. We

More information

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma COMS 4771 Probabilistic Reasoning via Graphical Models Nakul Verma Last time Dimensionality Reduction Linear vs non-linear Dimensionality Reduction Principal Component Analysis (PCA) Non-linear methods

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Parallel Algorithms For Real-Time Peptide-Spectrum Matching

Parallel Algorithms For Real-Time Peptide-Spectrum Matching Parallel Algorithms For Real-Time Peptide-Spectrum Matching A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science

More information

Last updated: Copyright

Last updated: Copyright Last updated: 2012-08-20 Copyright 2004-2012 plabel (v2.4) User s Manual by Bioinformatics Group, Institute of Computing Technology, Chinese Academy of Sciences Tel: 86-10-62601016 Email: zhangkun01@ict.ac.cn,

More information

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4 Inference: Exploiting Local Structure aphne Koller Stanford University CS228 Handout #4 We have seen that N inference exploits the network structure, in particular the conditional independence and the

More information

CSE182-L8. Mass Spectrometry

CSE182-L8. Mass Spectrometry CSE182-L8 Mass Spectrometry Project Notes Implement a few tools for proteomics C1:11/2/04 Answer MS questions to get started, select project partner, select a project. C2:11/15/04 (All but web-team) Plan

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information