Approximate P-Values for Local Sequence Alignments: Numerical Studies. JOHN D. STOREY and DAVID SIEGMUND


JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 8, Number 5, 2001, Mary Ann Liebert, Inc., Pp. 549-556

ABSTRACT

Siegmund and Yakir (2000) have given an approximate p-value when two independent, identically distributed sequences from a finite alphabet are optimally aligned based on a scoring system that rewards similarities according to a general scoring matrix and penalizes gaps (insertions and deletions). The approximation involves an infinite sequence of difficult-to-compute parameters. In this paper, it is shown by numerical studies that these reduce to essentially two numerically distinct parameters, which can be computed as one-dimensional numerical integrals. For an arbitrary scoring matrix and affine gap penalty, this modified approximation is easily evaluated. Comparison with published numerical results shows that it is reasonably accurate.

Key words: local alignment, affine gap penalty, p-value, Markov renewal theory.

1. INTRODUCTION

Comparison of a new gene (DNA sequence) or protein (amino acid sequence) with the existing sequences in a database can be an important first step in learning the structure and/or function of the new sequence. Local pairwise alignment of sequences plays an important role in such database searches. The quality of an alignment is determined by the total similarity score of letters at aligned positions and the number of gaps (insertions/deletions) in each sequence. (These concepts will be defined precisely in what follows.) Smith and Waterman (1981) and Altschul et al. (1990) have developed rapid and effective computational algorithms for local pairwise comparison of DNA or amino acid sequences. In addition, one would like to evaluate the statistical significance of sequences showing a particular level of similarity (e.g., Arratia et al., 1990; Dembo et al., 1994; Altschul and Gish, 1996).
Approximate p-values when gaps are not allowed have been given by a number of authors, e.g., Arratia et al. (1990), for a special scoring function and more generally by Dembo et al. (1994). The form of the approximation in the ungapped case has been conjectured to be valid also in the gapped case (Waterman and Vingron, 1994) and has been empirically fit to simulated and to real data to obtain values for parameters in the approximating formula (cf. Waterman and Vingron, 1994; Altschul and Gish, 1996). While these approximations seem to be reasonably accurate, development of each approximation requires substantial statistical analysis and computation that go beyond the basic data of the problem: the scoring matrix and the gap costs. More recent results in this direction have been obtained by Altschul et al. (2001).

Department of Statistics, Stanford University, Stanford, CA

For affine gap costs, Mott and Tribe (1999) obtain a useful approximation by an interesting heuristic argument. Assuming the gap open penalty is sufficiently large, Siegmund and Yakir (2000) provide a somewhat different approximation, which depends on infinitely many parameters. They also conjecture that for numerical purposes a further approximation depending on two of these parameters would suffice. The purpose of this paper is to show that (a) this conjecture is correct, (b) the modified approximation is reasonably accurate, and (c) it can be rapidly computed directly from basic data of the problem. In addition to giving a better theoretical understanding of how the p-value depends on the parameters of the problem, this approximation requires effectively zero computing time, so it could be implemented online for a scoring matrix of the user's choice.

Section 2 gives our notation and states the formal approximation of Siegmund and Yakir (2000), which depends on the parameters $\lambda_0, \lambda_1, \ldots$. Section 3 explains the probabilistic meaning of these parameters. Section 4 contains simulations based on the representations of Section 3, which show for several common scoring matrices that the $\lambda_r$ for $r \ge 1$ are essentially constant. Using the two parameters $\lambda_0$ and $\Lambda = \lim_{r \to \infty} \lambda_r$, one obtains a much simpler approximation, which is shown numerically to be in reasonable agreement with the approximation of Altschul and Gish (1996) based on their empirically estimated parameters. Convenient integral representations for $\lambda_0$ and $\Lambda$ are given in an appendix.

2. NOTATION AND APPROXIMATION

Consider two finite sequences $\mathbf{x}$ and $\mathbf{y}$ from a finite alphabet $\mathcal{A}$. Thus, $\mathbf{x} = x_1 x_2 \cdots x_m$ and $\mathbf{y} = y_1 y_2 \cdots y_n$ with $x_i, y_j \in \mathcal{A}$. We assume throughout that $x_1, \ldots, x_m$ are independently distributed with $P_0\{x_i = \alpha\} = \mu_\alpha$ for all $i$, and similarly $y_1, \ldots, y_n$ are independently distributed with $P_0\{y_j = \beta\} = \nu_\beta$ for all $j$. Moreover, the $x$'s and $y$'s are independent.
A candidate alignment, $z = \{(i_t, j_t) : 1 \le t \le k\}$, for some $1 \le i_1 < i_2 < \cdots < i_k \le m$ and $1 \le j_1 < j_2 < \cdots < j_k \le n$, specifies that $x_{i_t}$ and $y_{j_t}$ are aligned for all $t = 1, \ldots, k$. The other $x$'s with subscripts between $i_1$ and $i_k$ and the other $y$'s with subscripts between $j_1$ and $j_k$ are said to be unaligned. Note that there may be other letters, at the beginning or end of the two sequences, that are neither aligned nor formally designated as unaligned. We assume that either $i_{t+1} = i_t + 1$ or $j_{t+1} = j_t + 1$ for all $t < k$; i.e., there can be unaligned letters in only one sequence at a time.

With each candidate alignment $z$ we associate a score $S_z = S_z(\mathbf{x}, \mathbf{y})$. Aligned letters $x_i$ and $y_j$ are scored according to a similarity matrix $K(x_i, y_j)$. We assume a penalty of $\delta$ for each unaligned letter. The total number of unaligned letters is $l = (i_k - i_1 + 1 - k) + (j_k - j_1 + 1 - k)$. A gap is the interval of unaligned letters that begins with a value $t + 1$ such that either $i_{t+1} > i_t + 1$ or $j_{t+1} > j_t + 1$ and ends with the next aligned pair after $(x_{i_t}, y_{j_t})$. Each gap is assessed a cost $\Delta$ in addition to the cost $\delta$ of each unaligned letter. Frequently one refers to $\Delta$ as the gap open and $\delta$ as the gap extension cost. The score of the candidate alignment $z$ is

$$S_z = S_z(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{k} K(x_{i_t}, y_{j_t}) - \Delta j - \delta l,$$

where $j$ is the number of gaps and $l$ is the total number of unaligned letters, or equivalently the total length of all gaps. Note that there are no costs assessed for letters that are neither aligned nor unaligned.

Within the collection $\mathcal{Z}$ of all candidate alignments, one can identify the best alignment, the one with the highest score. The p-value of the best score under the null assumption that the sequences $\mathbf{x}$ and $\mathbf{y}$ are independent random samples from the given alphabet is

$$P_0\Bigl\{\max_{z \in \mathcal{Z}} S_z \ge b\Bigr\}, \tag{1}$$

where $b$ is the observed value of the score for the best alignment and $P_0$ is the null probability described above.

We assume that

$$E_0 K(x, y) < 0 \quad \text{and} \quad P_0\{K(x, y) > 0\} > 0. \tag{2}$$

Let $\psi(\theta) = \log E_0 \exp[\theta K(x, y)]$. Then $\exp[\theta K(\alpha, \beta) - \psi(\theta)]\, \mu_\alpha \nu_\beta$ defines an exponential family of probabilities indexed by $\theta$, which for $\theta = 0$ reduces to the original probability $Q_0(\alpha, \beta) = \mu_\alpha \nu_\beta$. It follows from (2) by convexity that there is a unique value $\theta^* > 0$ for which $\psi(\theta^*) = 0$. Let $\mu_0 = \psi'(\theta^*) > 0$.
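The root-finding step just described can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it finds the positive root of $\psi$ by bisection and then evaluates the mean of $K$ under the tilted distribution, which equals $\psi'$ at the root. The function names (`psi`, `solve_theta_star`, `tilted_mean`) and the toy match/mismatch scoring matrix with uniform letter frequencies are assumptions made only for this example, and the search assumes condition (2) holds.

```python
import math

def psi(theta, K, mu, nu):
    """psi(theta) = log E_0 exp[theta * K(x, y)] for independent letters."""
    total = 0.0
    for a, pa in enumerate(mu):
        for b, pb in enumerate(nu):
            total += pa * pb * math.exp(theta * K[a][b])
    return math.log(total)

def solve_theta_star(K, mu, nu, tol=1e-12):
    """Bisection for the unique positive root of psi (assumes condition (2))."""
    lo = 1.0
    while psi(lo, K, mu, nu) >= 0.0:   # psi < 0 just to the right of 0
        lo /= 2.0
    hi = lo
    while psi(hi, K, mu, nu) < 0.0:    # psi eventually becomes positive
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid, K, mu, nu) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tilted_mean(theta, K, mu, nu):
    """Mean of K under the exponentially tilted law, i.e. psi'(theta)."""
    num = den = 0.0
    for a, pa in enumerate(mu):
        for b, pb in enumerate(nu):
            w = pa * pb * math.exp(theta * K[a][b])
            num += K[a][b] * w
            den += w
    return num / den

# Hypothetical toy input: 4-letter alphabet, match = +1, mismatch = -1.
K = [[1 if a == b else -1 for b in range(4)] for a in range(4)]
mu = nu = [0.25] * 4

theta_star = solve_theta_star(K, mu, nu)
mu0 = tilted_mean(theta_star, K, mu, nu)
print(theta_star, mu0)
```

For this toy matrix the equation $\tfrac14 e^{\theta} + \tfrac34 e^{-\theta} = 1$ can be solved by hand, giving $\theta^* = \log 3$ and a tilted mean of $1/2$, which makes a convenient check on the bisection.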

Let $Q^*(\alpha, \beta) = \exp[\theta^* K(\alpha, \beta)]\, Q_0(\alpha, \beta)$ and note that $\mu_0 = E^* K(x, y)$. Also let $Q_{i,j}$ be defined to be the product probability that gives $x$ the marginal distribution it has under $Q_i$ and $y$ the marginal distribution it has under $Q_j$. We assume that

$$E^* K(x, y) - E_{i,j} K(x, y) > 0 \tag{3}$$

for all $i, j$. This condition is easily checked and excludes the case that $K(\alpha, \beta)$ is a sum of a function of $\alpha$ and a function of $\beta$. We shall require the technical assumptions that $\min(m, n)/b \to \infty$ and that for some $0 < \epsilon < 1$, $mn \exp\{-(\theta^* b)^{\epsilon}\}$ is bounded. The second assumption puts the probability of interest into the domain of large deviations.

To simplify the presentation of the main approximation, we assume that $K(x, y)$ is a nonarithmetic random variable and discuss the more complicated arithmetic case below. The main result of Siegmund and Yakir (2000) is as follows.

Theorem 1. Assume conditions (2) and (3) hold and that

$$\theta^* \Delta = \log(\theta^* b) + C \tag{4}$$

for some constant $C$. There exist parameters $\lambda_0, \lambda_1, \ldots$ in $(0, 1]$ such that as $b \to \infty$,

$$P_0\Bigl\{\max_{z \in \mathcal{Z}} S_z \ge b\Bigr\} \sim mn\, e^{-\theta^* b} (\theta^* \mu_0)^{-1} \lambda_0^2 \sum_{j=0}^{\infty} \frac{(2b\, e^{-\theta^* \Delta}/\mu_0)^j}{j!} \sum_{l=j}^{\infty} e^{-\theta^* \delta l} \sum \lambda_1^{i_1} \cdots \lambda_{l-j+1}^{i_{l-j+1}}, \tag{5}$$

where the innermost summation extends over the $\binom{l-1}{j-1}$ terms having $i_1 + \cdots + i_{l-j+1} = j$ and $i_1 + 2 i_2 + \cdots + (l-j+1)\, i_{l-j+1} = l$. An upper bound for the indicated sums is $1/[\exp(\theta^* \delta) - 1]$.

Remarks. (i) A principal consequence of the numerical studies reported below is that the constants $\lambda_r$ for $r \ge 1$ are essentially all equal, say, to a single parameter $\Lambda$. Hence the right-hand side of (5) can reasonably be approximated by the much simpler expression

$$mn\, (\theta^* \mu_0)^{-1} \lambda_0^2 \exp\bigl(-\theta^* b \bigl\{1 - 2 (\theta^* \mu_0)^{-1} \Lambda\, e^{-\theta^* \Delta} / [e^{\theta^* \delta} - 1]\bigr\}\bigr). \tag{6}$$

(ii) In applications, the random variables $K(x_i, y_j)$ will be arithmetic of span $h$, typically $h = 1$. In this case it does not seem possible to give a precise asymptotic expression for the desired probability, which can be shown to lie asymptotically in an interval bounded below by $h\theta^* / (e^{h\theta^*} - 1)$ and above by $(1 - e^{-h\theta^*}) / (h\theta^*)$ times the quantity on the right-hand side of (5).
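The simplified expression (6) of Remark (i) is a closed form, so a short numerical sketch may help make its structure concrete. The function below reads (6) as $mn\,(\theta^*\mu_0)^{-1}\lambda_0^2 \exp(-\theta^* b\{1 - 2(\theta^*\mu_0)^{-1}\Lambda e^{-\theta^*\Delta}/[e^{\theta^*\delta}-1]\})$; that reading, the function name `pvalue_approx_6`, and every numerical input used below are assumptions for illustration only and are not values from the paper's tables.

```python
import math

def pvalue_approx_6(m, n, b, theta, mu0, lam0, Lam, gap_open, gap_ext):
    """Evaluate the simplified approximation (6) as read above.

    theta    : positive root of psi (theta* in the text)
    mu0      : psi'(theta*)
    lam0     : lambda_0, the ungapped constant
    Lam      : Lambda, the common value assumed for lambda_r, r >= 1
    gap_open : Delta, cost per gap
    gap_ext  : delta, cost per unaligned letter
    """
    # Relative reduction of the exponential decay rate caused by allowing gaps.
    gap_term = (2.0 * Lam * math.exp(-theta * gap_open)
                / (theta * mu0 * (math.exp(theta * gap_ext) - 1.0)))
    prefactor = m * n * lam0 ** 2 / (theta * mu0)
    return prefactor * math.exp(-theta * b * (1.0 - gap_term))

# Hypothetical inputs, chosen only to exercise the formula.
args = dict(m=500, n=500, b=60.0, theta=0.3, mu0=1.5, lam0=0.4,
            gap_open=11.0, gap_ext=1.0)
p_gapped = pvalue_approx_6(Lam=0.3, **args)
p_ungapped = pvalue_approx_6(Lam=0.0, **args)   # no gap contribution
print(p_gapped, p_ungapped)
```

Setting `Lam=0.0` switches off the gap term and leaves only the ungapped factor $mn\,(\theta^*\mu_0)^{-1}\lambda_0^2 e^{-\theta^* b}$, so comparing the two printed values shows how much the gap allowance inflates the approximate p-value for a given choice of gap costs.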
(iii) As explained by Siegmund and Yakir (2000), the condition (4) can be viewed as a diagnostic for the accuracy of the approximation (5). If we suppose that $b$ and $\Delta$ are given and (4) defines the value of $C$, then one should be careful in applying (5) when $C$ is small.

3. THE CONSTANTS $\lambda_r$, $r = 0, 1, 2, \ldots$

In the rest of this paper we assume that $K(x_i, y_j)$ is arithmetic with span $h = 1$. The constant $\lambda_0$ appears in the approximation of Dembo et al. (1994), where gaps are not permitted. It has the same probabilistic meaning as the other $\lambda_r$, but it is defined relative to a random walk process rather than a Markov chain and consequently is easily evaluated numerically. It has been thoroughly discussed in the context of sequential analysis (e.g., Woodroofe, 1982; Siegmund, 1985).

To be more precise and prepare for the more complex case of $\lambda_r$ ($r \ge 1$) given below, let $P^*$ denote the probability under which $(x_1, y_1), (x_2, y_2), \ldots$ are independent and identically distributed with the distribution $Q^*$ defined following display (2). The log likelihood ratio of the first $n$ pairs $(x_i, y_i)$ under $P^*$ relative to $P_0$ is $\ell_n = \theta^* S_n$, where $S_n = \sum_{i=1}^{n} K(x_i, y_i)$. It follows from the arguments of Siegmund and Yakir (2000) that

$$\lambda_0 = \lim_{N \to \infty} N^{-1} E_0\bigl[\exp\bigl(\max_{n \le N} \ell_n\bigr)\bigr], \tag{7}$$

which after summation by parts and a change of measure equals

$$\lim_{N \to \infty} N^{-1} (1 - e^{-\theta^*}) \sum_{j} E^*\bigl[\exp\{-\theta^* (S_{\tau(j)} - j)\};\ \tau(j) \le N\bigr], \tag{8}$$

where $\tau(j) = \min\{n : S_n \ge j\}$ for $j \ge 1$ (and $\tau(0)$ is similarly defined with a strict inequality). Since $\mu_0 = E^* K(x_1, y_1) > 0$, by the strong law of large numbers $P^*\{\tau(j) \le N\}$ converges to 1 for $j < (1 - \epsilon) N \mu_0$ and to 0 for $j > (1 + \epsilon) N \mu_0$ for arbitrary $0 < \epsilon < 1$. It follows from the renewal theorem that

$$\nu_0 = \lim_{j \to \infty} E^*\bigl[\exp\{-\theta^* (S_{\tau(j)} - j)\}\bigr] \tag{9}$$

exists. From (7)-(9) and elementary analysis one obtains

$$\lambda_0 = \mu_0 (1 - e^{-\theta^*})\, \nu_0. \tag{10}$$

In addition, $\nu_0$ can be expressed in terms of the distribution of the ladder variables $(\tau(0), S_{\tau(0)})$ (e.g., Siegmund, 1985, Chapter 8) and then can be evaluated numerically as a one-dimensional inverse Fourier transform (cf. Woodroofe [1982] for nonarithmetic variables and Tu and Siegmund [1999] for the arithmetic case). For completeness the result of Tu and Siegmund (1999) is stated in the appendix.

Except for the numerical evaluation of $\nu_0$, the preceding argument applies with minor modifications to the $\lambda_r$ for $r \ge 1$. Consider two sequences $\mathbf{x}$ and $\mathbf{y}$ of length $N$, for some integers $N \to \infty$. Given $r$ and $k$, $1 \le k < N - r$, consider the alignment which matches the first $k$ $x$'s with the first $k$ $y$'s and the $x$'s between $k + 1$ and $N - r$ with the $y$'s between $k + r + 1$ and $N$. Let $P_k^{(r)}$ denote the probability measure that makes the aligned pairs $x_i, y_j$ independent and identically distributed with joint distribution $Q^*$ and leaves the distribution of the other $x_i$ and $y_j$ unchanged. Let $\ell_k^{(r)}$ denote the log likelihood ratio of $P_k^{(r)}$ relative to the null probability $P_0$, so

$$\ell_k^{(r)} = \theta^* \Bigl[\sum_{i=1}^{k} K(x_i, y_i) + \sum_{i=k+1}^{N-r} K(x_i, y_{i+r})\Bigr] \quad \text{and} \quad \ell_k^{(r)} - \ell_1^{(r)} = \theta^* \sum_{i=2}^{k} [K(x_i, y_i) - K(x_i, y_{i+r})].$$

Siegmund and Yakir (2000) obtain the representations (cf. (7))

$$\lambda_r = \lim_{N \to \infty} N^{-1} E_0 \exp\bigl(\max_k \ell_k^{(r)}\bigr) = \cdots. \tag{11}$$

Let $P^{(r)}$ denote the extension of $P_{N-r}^{(r)}$ to the infinitely long sequence $\{x_i, y_i;\ 1 \le i < \infty\}$. Note that the $k$th increment, $\ell_k^{(r)} - \ell_{k-1}^{(r)} = \theta^* [K(x_k, y_k) - K(x_k, y_{k+r})]$, is a function of $x_k$, $y_k$ and $y_{k+r}$ only. Under $P^{(r)}$ the increments are identically distributed, and the $k$th increment is independent of the $l$th for any $l$ such that $(l \bmod r) \ne (k \bmod r)$. For each $0 \le i < r$, the increments with index $k$, $1 \le k \le N - r$, satisfying $(k \bmod r) = i$ form a first order Markov chain. Thus $\ell_k^{(r)} - \ell_1^{(r)}$, the sum of the increments with indices $2, \ldots, k$, is an additive functional of an $r$th order Markov chain. By (11) and the argument leading from (7) to (10), we see that

$$\lambda_r = \mu_r (1 - e^{-\theta^*})\, \nu_r, \tag{12}$$

where $\mu_r = E^{(r)}[K(x_2, y_2) - K(x_2, y_{2+r})]$ is constant in $r$, and the limit defining $\nu_r$ (exactly as in (9)) is guaranteed to exist by the renewal theorem for additive functionals of a Markov chain applied to the process of ladder variables, e.g., Athreya et al. (1978).

Although we obtain a representation structurally identical to (10), numerical computations are much more complicated than for a random walk. Asmussen (1989) has used a representation of the limit defining $\nu_r$ in terms of ladder variables to evaluate such a parameter for a similar process. The problem is also discussed by Karlin and Dembo (1992), although they do not appear to have developed a numerical algorithm. In the following section we use the representation in (11) for simulating $\nu_r$. In this regard, note that simulation requires only the generation of independent, identically distributed, discrete random variables, while an analytic approach must deal with the Markovian dependence described above.

From the probabilistic meaning of the $\nu_r$, it seems natural to conjecture that they are effectively constant in $r \ge 1$; and since the increments of $\ell^{(r)}$ with indices $2, \ldots, r + 1$ are independent and identically distributed, with a distribution not depending on $r$, the limit of $\nu_r$ as $r \to \infty$ can be evaluated in terms of a random walk (albeit a different random walk from that entering into $\nu_0$ above).

4. NUMERICAL RESULTS

Our goal in this section is twofold: (i) to show by simulation that it is reasonable to regard the constants $\lambda_r$ for $r \ge 1$ as a single constant, so that the complicated approximation given in Theorem 1 can be replaced by the much simpler expression (6), which is very easy to evaluate numerically via the algorithm given in the appendix; and (ii) to compare the results of our approximation to selected results in the literature reported by Altschul and Gish (1996). We use the amino acid frequencies reported by Robinson and Robinson (1991) and the scoring matrices BLOSUM62, BLOSUM50 (Henikoff and Henikoff, 1992), and PAM250 (e.g., Dayhoff, 1978).

Tables 1 and 2 give estimated values of $\nu_r$ for the BLOSUM62 and PAM250 substitution matrices, respectively. The estimates are based on 50,000 repetitions of a Monte Carlo experiment and are given for various values of $r$. Since $\nu_r$ is represented as a limit as $j \to \infty$ and large values of $j$ require proportionately long Monte Carlo runs, the estimates are also tabled for different values of $j$. (The process increases at rate $\mu_\infty$, so a rough estimate, which ignores edge effects, of the mean value of $\tau^{(r)}(j)$ is $j/\mu_\infty$. Table 3 gives values of $\mu_\infty$.) Standard deviations of the estimates are about … The estimates are essentially constant in both $r$ and $j$.
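To make the Monte Carlo idea concrete, here is a minimal sketch for the simplest case only: the random-walk constant $\nu_0$ of (9), estimated as the average of $\exp\{-\theta^*(S_{\tau(j)} - j)\}$ over simulated walks whose steps are drawn from the tilted score distribution. It does not reproduce the Markov-dependent simulations of $\nu_r$ for $r \ge 1$ reported in Tables 1 and 2. The two-valued score distribution, the function names `tilt` and `estimate_nu0`, and the choices of level and repetition count are all hypothetical.

```python
import math
import random

def tilt(score_probs, theta):
    """Tilted score law: P*(K = s) = exp(theta * s) * P_0(K = s)."""
    return {s: p * math.exp(theta * s) for s, p in score_probs.items()}

def estimate_nu0(score_probs_star, theta, level, reps, seed=0):
    """Average of exp(-theta * overshoot) over `reps` walks crossing `level`."""
    rng = random.Random(seed)
    scores = list(score_probs_star)
    weights = [score_probs_star[s] for s in scores]
    total = 0.0
    for _ in range(reps):
        s = 0
        while s < level:                      # tau(level) = first n with S_n >= level
            s += rng.choices(scores, weights)[0]
        total += math.exp(-theta * (s - level))
    return total / reps

# Hypothetical null score law: K = +2 with probability 0.1, K = -1 otherwise.
p = 0.1
null_probs = {2: p, -1: 1.0 - p}
# For this toy law psi(theta) = 0 reduces to a quadratic in u = exp(theta).
u = (-p + math.sqrt(p * p + 4.0 * p * (1.0 - p))) / (2.0 * p)
theta_star = math.log(u)

star_probs = tilt(null_probs, theta_star)     # sums to 1 because psi(theta*) = 0
nu0_hat = estimate_nu0(star_probs, theta_star, level=30, reps=2000)
print(theta_star, nu0_hat)
```

Because the toy walk can overshoot the level by at most one unit, the estimate must fall between $e^{-\theta^*}$ and 1, which gives a quick sanity check; increasing `level` and `reps` trades run time for a closer approach to the limit in (9).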
Very similar results have been obtained for the BLOSUM50 substitution matrix and hence are omitted. We conclude from these simulations that it is reasonable to use the simplified approximation (6) instead of the much more complicated (5).

Table 3 gives the numerical values of different parameters that are required for implementation of the approximation (6). In particular, instead of $\nu_r$ for $r \ge 1$, we propose to use $\nu_\infty = \lim_{r \to \infty} \nu_r$, which involves independent, identically distributed random variables and hence can be computed by the method given in the appendix. Then $\Lambda = \mu_\infty (1 - e^{-\theta^*})\, \nu_\infty$. For ease in making numerical comparisons with published values, we continue to use the parameters in the form given above. We also give parameters relevant to the scale that Altschul and Gish (1996) call nats, which arises from multiplying $K$ given in an arbitrary scale by $\theta^*$. An increase of one unit in the threshold $a = \theta^* b$ results in a change of the probability (1) by approximately the factor $e^{-1}$. In this scale, the mean values $\mu_r$ become the Kullback-Leibler information numbers $I_r = \theta^* \mu_r$ ($r = 0, \infty$).

In Table 3, the values given for $\nu_0$ and $\nu_\infty$ have been obtained by the method given in the appendix, which requires a one-dimensional numerical integration. It is easily automated and very fast to evaluate. To the extent that the values of $\nu_r$ do change with $r$, the value given for $\nu_\infty$ in Table 3 should be a better approximation for large $r$, when the increments of the process $\ell^{(r)}$ are independent over long stretches. Although the value of $\nu_\infty$ is close to the values given in Table 1, it is outside the range of reasonable sampling variability. The values of $\nu_\infty$ for the BLOSUM50 and PAM250 matrices are within the range of reasonable sampling variability of the Monte Carlo estimates for large $r$.

Table 1. Simulated values of $\nu_r$ for the BLOSUM62 matrix, by $r$ and $j$. Estimates are based on 50,000 repetitions of a Monte Carlo experiment, for various values of $j$.

Table 2. Simulated values of $\nu_r$ for the PAM250 matrix, by $r$ and $j$. Estimates are based on 50,000 repetitions of a Monte Carlo experiment, for various values of $j$.

Table 3. Parameters of selected substitution matrices (BLOSUM62, BLOSUM50, PAM250): $\theta^*$, $\nu_0$, $\mu_0$, $I_0$, $\nu_\infty$, $\mu_\infty$, $I_\infty$.

Table 4. p-values for BLOSUM and PAM scoring matrices $K$, by $\Delta$, $\delta$, and $b$. Upper entry is our approximation; lower entry from Altschul and Gish (1996); $m = n = 500$.

To verify that these values are at least reasonable, one can compare them to the case of a random walk having normally distributed increments, for which there is the simple approximation $\nu \approx \exp[-\rho (2I)^{1/2}]$, where $\rho = 0.583$ (cf. Siegmund, 1985). This gives $\nu_0 \approx 0.625$ and $\nu_\infty \approx 0.503$ for the BLOSUM62 matrix and similar results for the other two cases.

For implementation of the approximation (6), we also make edge corrections to $m$ and $n$, by replacing $m$ with $m' = m - b/\mu_0$ and by a similar correction to $n$. This correction is justified by the observation that an (ungapped) interval that achieves a score of $b$ is roughly of length $b/\mu_0$. Similar edge corrections are used by Altschul and Gish (1996) and by Mott and Tribe (1999).

Table 4 gives p-values for $m = n = 500$ and selected values of $\Delta$, $\delta$, and $b$. The first entry given is the approximation (6); the second uses the extreme value approximation suggested by Waterman and Vingron (1994) and by Altschul and Gish (1996) with the empirically determined parameters given in Altschul and Gish (1996). Our approximation is typically about 2 to 5 times as large as the Altschul-Gish approximation. Very similar results hold for other values of $m = n$ in the range [300, 1000].

The practical advantage of our approximation is that it requires only the almost trivial evaluation of $\theta^*$ and related parameters and computation of two one-dimensional numerical integrals. Hence it can be easily evaluated for a scoring matrix chosen by the user without restriction to the few cases for which the empirical research necessary to estimate the parameters of the Altschul-Gish approximation has been carried out.

ACKNOWLEDGMENTS

This research was supported in part by National Science Foundation Grant DMS and by an NSF Graduate Research Fellowship.

5. APPENDIX

Theorem 2. Let $x_1, x_2, \ldots, x_n$ be independent and identically distributed arithmetic random variables with positive mean $\mu$, finite positive variance $\sigma^2$, and span $h$. Let $\phi(t)$ be the characteristic function of $x_1$, and note that $-\log(1 - \phi(t)) = \sum_{n=1}^{\infty} \phi^n(t)/n$. Let $S_n = \sum_{i=1}^{n} x_i$. Then for arbitrary $\theta > 0$,

$$\sum_{n=1}^{\infty} n^{-1} E\bigl(e^{-\theta S_n^+}\bigr) + \log(\mu/h) = \cdots. \tag{13}$$

Let $S(\theta)$ denote the left-hand side of (13). Siegmund (1985, p. 175) shows that $\nu_0$ and $\nu_\infty$ are of the form $h \exp[-S(\theta^*)] / [1 - e^{-\theta^* h}]$. The values given in Table 3 were obtained from this expression.

REFERENCES

Altschul, S.F., Bundschuh, R., Olsen, R., and Hwa, T. 2001. The estimation of statistical parameters for local alignment score distributions. Nucl. Acids Res. 29.
Altschul, S.F., and Gish, W. 1996. Local alignment statistics. Methods in Enzymology 266.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25.
Arratia, R., Goldstein, L., and Gordon, L. 1989. Two moments suffice for Poisson approximation: The Chen-Stein method. Ann. Probab. 17.
Arratia, R., Gordon, L., and Waterman, M.S. 1990. The Erdös-Rényi law in distribution for coin tossing and sequence matching. Ann. Statist. 18.
Asmussen, S. 1989. Risk theory in a Markovian environment. Scand. Actuarial J.
Athreya, K.B., McDonald, D., and Ney, P. 1978. Limit theorems for semi-Markov processes and renewal theory for Markov chains. Ann. Probab. 6.
Dayhoff, M.O. 1978. Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, DC.
Dembo, A., Karlin, S., and Zeitouni, O. 1994. Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Probab. 22.
Henikoff, S., and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. Sci. USA 89.
Karlin, S., and Dembo, A. 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24.
Mott, R., and Tribe, R. 1999. Approximate statistics of gapped alignments. J. Comp. Biol. 6, 91-112.
Pearson, W.R. 1995. Comparison of methods for searching protein databases. Protein Sci. 4.
Robinson, A.B., and Robinson, L.R. 1991. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Nat. Acad. Sci. USA 88.
Siegmund, D. 1985. Sequential Analysis: Tests and Confidence Intervals. Springer-Verlag, New York.
Siegmund, D., and Yakir, B. 2000. Approximate p-values for local sequence alignments. Ann. Statist. 28.
Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147.
Tu, I.-P., and Siegmund, D. 1999. The maximum of a function of a Markov chain and application to linkage analysis. Adv. Appl. Probab. 31.
Waterman, M. 1995. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London.
Waterman, M., and Vingron, M. 1994. Sequence comparison and Poisson approximation. Statistical Science 9.
Woodroofe, M. 1982. Nonlinear Renewal Theory in Sequential Analysis. SIAM, Philadelphia.
Yakir, B., and Pollak, M. 1998. A new representation for a renewal-theoretic constant appearing in asymptotic approximations of large deviations. Ann. Appl. Probab. 8.

Address correspondence to: David Siegmund, Stanford University, Department of Statistics, Sequoia Hall 40, Stanford, CA. E-mail: dos@stat.stanford.edu


More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Generalized Affine Gap Costs for Protein Sequence Alignment

Generalized Affine Gap Costs for Protein Sequence Alignment PROTEINS: Structure, Function, and Genetics 32:88 96 (1998) Generalized Affine Gap Costs for Protein Sequence Alignment Stephen F. Altschul* National Center for Biotechnology Information, National Library

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

Local Alignment: Smith-Waterman algorithm

Local Alignment: Smith-Waterman algorithm Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.

More information

Finite Width Model Sequence Comparison

Finite Width Model Sequence Comparison Finite Width Model Sequence Comparison Nicholas Chia and Ralf Bundschuh Department of Physics, Ohio State University, 74 W. 8th St., Columbus, OH 4, USA (Dated: May, 4) Sequence comparison is a widely

More information

A Practical Approach to Significance Assessment in Alignment with Gaps

A Practical Approach to Significance Assessment in Alignment with Gaps A Practical Approach to Significance Assessment in Alignment with Gaps Nicholas Chia and Ralf Bundschuh 1 Ohio State University, Columbus, OH 43210, USA Abstract. Current numerical methods for assessing

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

Stephen Scott.

Stephen Scott. 1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING METHODS /(X) ¼ I Q(X)

SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING METHODS /(X) ¼ I Q(X) JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 5, 20 # Mary Ann Liebert, Inc. Pp. 677 69 DOI: 0.089/cmb.200.0328 Research Articles Sequence Alignment as Hypothesis Testing LU MENG, FENGZHU SUN, 2 XUEGONG

More information

Sequence Analysis '17 -- lecture 7

Sequence Analysis '17 -- lecture 7 Sequence Analysis '17 -- lecture 7 Significance E-values How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

More information

Poisson Approximation for Two Scan Statistics with Rates of Convergence

Poisson Approximation for Two Scan Statistics with Rates of Convergence Poisson Approximation for Two Scan Statistics with Rates of Convergence Xiao Fang (Joint work with David Siegmund) National University of Singapore May 28, 2015 Outline The first scan statistic The second

More information

E-value Estimation for Non-Local Alignment Scores

E-value Estimation for Non-Local Alignment Scores E-value Estimation for Non-Local Alignment Scores 1,2 1 Wadsworth Center, New York State Department of Health 2 Department of Computer Science, Rensselaer Polytechnic Institute April 13, 211 Janelia Farm

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Fundamentals of database searching

Fundamentals of database searching Fundamentals of database searching Aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins. The principles

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

Derived Distribution Points Heuristic for Fast Pairwise Statistical Significance Estimation

Derived Distribution Points Heuristic for Fast Pairwise Statistical Significance Estimation Derived Distribution Points Heuristic for Fast Pairwise Statistical Significance Estimation Ankit Agrawal, Alok Choudhary Dept. of Electrical Engg. and Computer Science Northwestern University 2145 Sheridan

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment Ofer Gill and Bud Mishra Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street,

More information

SEQUENCE alignment is an underlying application in the

SEQUENCE alignment is an underlying application in the 194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

Do Aligned Sequences Share the Same Fold?

Do Aligned Sequences Share the Same Fold? J. Mol. Biol. (1997) 273, 355±368 Do Aligned Sequences Share the Same Fold? Ruben A. Abagyan* and Serge Batalov The Skirball Institute of Biomolecular Medicine Biochemistry Department NYU Medical Center

More information

Distributional regimes for the number of k-word matches between two random sequences

Distributional regimes for the number of k-word matches between two random sequences Distributional regimes for the number of k-word matches between two random sequences Ross A. Lippert, Haiyan Huang, and Michael S. Waterman Informatics Research, Celera Genomics, Rockville, MD 0878; Department

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

bioinformatics 1 -- lecture 7

bioinformatics 1 -- lecture 7 bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment

More information

Sequence comparison: Score matrices

Sequence comparison: Score matrices Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL 2 Pairwise alignment 2.1 Introduction The most basic sequence analysis task is to ask if two sequences are related. This is usually done by first aligning the sequences (or parts of them) and then deciding

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

1 Variance of a Random Variable

1 Variance of a Random Variable Indian Institute of Technology Bombay Department of Electrical Engineering Handout 14 EE 325 Probability and Random Processes Lecture Notes 9 August 28, 2014 1 Variance of a Random Variable The expectation

More information

The Kuhn-Tucker Problem

The Kuhn-Tucker Problem Natalia Lazzati Mathematics for Economics (Part I) Note 8: Nonlinear Programming - The Kuhn-Tucker Problem Note 8 is based on de la Fuente (2000, Ch. 7) and Simon and Blume (1994, Ch. 18 and 19). The Kuhn-Tucker

More information

2 ANDREW L. RUKHIN We accept here the usual in the change-point analysis convention that, when the minimizer in ) is not determined uniquely, the smal

2 ANDREW L. RUKHIN We accept here the usual in the change-point analysis convention that, when the minimizer in ) is not determined uniquely, the smal THE RATES OF CONVERGENCE OF BAYES ESTIMATORS IN CHANGE-OINT ANALYSIS ANDREW L. RUKHIN Department of Mathematics and Statistics UMBC Baltimore, MD 2228 USA Abstract In the asymptotic setting of the change-point

More information

11. Bootstrap Methods

11. Bootstrap Methods 11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 1, 2001 Mary Ann Liebert, Inc. Pp. 69 78 Perfect Phylogenetic Networks with Recombination LUSHENG WANG, 1 KAIZHONG ZHANG, 2 and LOUXIN ZHANG 3 ABSTRACT

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Bisection Ideas in End-Point Conditioned Markov Process Simulation

Bisection Ideas in End-Point Conditioned Markov Process Simulation Bisection Ideas in End-Point Conditioned Markov Process Simulation Søren Asmussen and Asger Hobolth Department of Mathematical Sciences, Aarhus University Ny Munkegade, 8000 Aarhus C, Denmark {asmus,asger}@imf.au.dk

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

Markov-Switching Models with Endogenous Explanatory Variables. Chang-Jin Kim 1

Markov-Switching Models with Endogenous Explanatory Variables. Chang-Jin Kim 1 Markov-Switching Models with Endogenous Explanatory Variables by Chang-Jin Kim 1 Dept. of Economics, Korea University and Dept. of Economics, University of Washington First draft: August, 2002 This version:

More information

The Distribution of Short Word Match Counts Between Markovian Sequences

The Distribution of Short Word Match Counts Between Markovian Sequences The Distribution of Short Word Match Counts Between Markovian Sequences Conrad J. Burden 1, Paul Leopardi 1 and Sylvain Forêt 2 1 Mathematical Sciences Institute, Australian National University, Canberra,

More information

On the length of the longest exact position. match in a random sequence

On the length of the longest exact position. match in a random sequence On the length of the longest exact position match in a random sequence G. Reinert and Michael S. Waterman Abstract A mixed Poisson approximation and a Poisson approximation for the length of the longest

More information

A Poisson model for gapped local alignments

A Poisson model for gapped local alignments A Poisson model for gapped local alignments Dirk Metzler 1, 2, Steffen Grossmann, and Anton Wakolbinger Department of Mathematics, Goethe-Universität, Postfach 11 19 32, D-60054 Frankfurt am Main, Germany

More information

PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 46, Number 1, October 1974 ASYMPTOTIC DISTRIBUTION OF NORMALIZED ARITHMETICAL FUNCTIONS PAUL E

PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 46, Number 1, October 1974 ASYMPTOTIC DISTRIBUTION OF NORMALIZED ARITHMETICAL FUNCTIONS PAUL E PROCEEDIGS OF THE AMERICA MATHEMATICAL SOCIETY Volume 46, umber 1, October 1974 ASYMPTOTIC DISTRIBUTIO OF ORMALIZED ARITHMETICAL FUCTIOS PAUL ERDOS AD JAOS GALAMBOS ABSTRACT. Let f(n) be an arbitrary arithmetical

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

ECONOMET RICS P RELIM EXAM August 24, 2010 Department of Economics, Michigan State University

ECONOMET RICS P RELIM EXAM August 24, 2010 Department of Economics, Michigan State University ECONOMET RICS P RELIM EXAM August 24, 2010 Department of Economics, Michigan State University Instructions: Answer all four (4) questions. Be sure to show your work or provide su cient justi cation for

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

Nonlinear Programming (NLP)

Nonlinear Programming (NLP) Natalia Lazzati Mathematics for Economics (Part I) Note 6: Nonlinear Programming - Unconstrained Optimization Note 6 is based on de la Fuente (2000, Ch. 7), Madden (1986, Ch. 3 and 5) and Simon and Blume

More information

Making sense of score statistics for sequence alignments

Making sense of score statistics for sequence alignments Marco Pagni is staff member at the Swiss Institute of Bioinformatics. His research interests include software development and handling of databases of protein domains. C. Victor Jongeneel is Director of

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information