Computational Methods for Structural Bioinformatics and Computational Biology (1) (Sequence comparison)


1 Computational Methods for Structural Bioinformatics and Computational Biology (1) (Sequence comparison) Dragon Star Short Course, Suzhou University, June 15 to June 19, 2009. Jie Liang 梁杰, Molecular and Systems Computational Bioengineering Lab (MoSCoBL), Department of Bioengineering, University of Illinois at Chicago; 上海交通大学系统医学研究院; 上海生物信息技术研究中心

2 Course Organization June 15 to June 19. Working language is Chinese; slides are in English. Discussions are encouraged throughout the lectures. Lectures will focus on fundamentals, while students are welcome to challenge the instructor with any questions related to the subject. Additional discussion session.

3 Reference Books
"Introduction to Computational Molecular Biology" by Carlos Setubal and Joao Meidanis, PWS Publishing, 1997.
"Computational Molecular Biology" by Peter Clote and Rolf Backofen, John Wiley, 2000.
"Geometry and Topology for Mesh Generation" by Herbert Edelsbrunner, Cambridge University Press, 2001.
"Monte Carlo Strategies in Scientific Computing" by Jun S. Liu, Springer Series in Statistics, 2009 (paperback).
"Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison, Cambridge University Press, 1999.
"Monte Carlo Statistical Methods" by Christian P. Robert and George Casella, Springer, 2005.

4 A Brief Survey Computer science background? Biology background? Mathematics/Statistical background? None of above? Have you taken another bioinformatics course?

5 Prerequisites Basic knowledge of computer science Assume no prior knowledge in biology above high school Strong motivation in learning bioinformatics and computational biology

6 Today's Lecture Scope of bioinformatics and the course; basic concepts in molecular biology: DNA, RNA, protein; dynamic programming and pairwise sequence analysis; statistical models for evaluating aligned sequences; optimal multiple sequence alignment; heuristic multiple sequence alignment

7 Scope of Bioinformatics: Studying Biology on Computer. Activities: data management; data mining; modeling; prediction; theory formulation. Objects of study: genes, proteins, protein complexes, pathways, cells, organisms, ecosystems. Bioinformatics is an indispensable part of biological science with its own methodology, with both an engineering aspect and a scientific aspect. Contributing disciplines: computer science, biology, statistics, physics, mathematics, chemistry, engineering.

8 Why Bioinformatics? (I) More than 80 US universities offer graduate degrees in bioinformatics. It sits at the intersection of two of the most exciting fields: computer science and biology. Exponential growth in computing technologies (hardware, Internet) paves the way for bioinformatics development.

9 Why Bioinformatics? (II) Analytical technology High-throughput data Biological knowledge Medicine & bioengineering

10 What Can Computing Do for Biology? 1. Data interpretation in analytical technologies 2. Data management and computational infrastructure 3. Discovery from data mining 4. Physical models through computing, prediction, and design 5. Theoretical / in silico biology Almost every area of computer science can be applied to biology Many computer scientists study biological problems

11 1. Data Interpretation in Analytical Technologies (I) Analytical technologies are the driving force of new (large-scale) biology: DNA sequencing (genomics) X-ray / NMR structure determination (structural genomics) Protein identification using mass spectrometry (proteomics) Microarray chips (functional genomics)

12 1. Data Interpretation in Analytical Technologies (II) [Figure: NMR protein structure determination. NMR spectra, then peak assignment along the peptide backbone (residues i-1 to i+4), then structural restraint extraction, then structure calculation, giving the protein structure.]

13 1. Data Interpretation in Analytical Technologies (III) From image to data (image processing). Large-scale data cannot be handled without computers. Noisy data (optimization with under-constrained or over-constrained problems). Computer algorithms and programs can mimic the human interpretation process and do it much faster. Automation of experimental data interpretation.

14 2. Data Management and Computational Infrastructure Track instruments, experiment conditions and results at each step of a complicated biological experiment (LIMS at modern wet labs) Data storage and retrieval (database) Data visualization Data query and analysis pipeline

15 3. Discovery from Data Mining (I)

16 3. Discovery from Data Mining (II) Pattern/knowledge discovery from data many biological data are generated by biological processes which are not well understood interpretation of such data requires discovery of convoluted relationships hidden in the data which segment of a DNA sequence represents a gene, a regulatory region which genes are possibly responsible for a particular disease Complicated data Large-scale, high-dimension Noisy (false positives and false negatives)

17 4. Modeling, Prediction and Design (I) Modeling and prediction of biological objects/processes modeling of biochemistry enzyme reaction rates modeling of biophysics dynamics of biomolecules modeling of evolution prediction of phylogeny and substitution pattern

18 4. Modeling, Prediction and Design (II) Prediction of outcomes of biological processes computing will become an integral part of modern biology through an iterative process of model formulation computational prediction experimental validation From prediction to engineering design Protein structure prediction to protein engineering Design genetically modified species

19 5. Theoretical / In Silico Biology Generate new hypothesis, formulate and test fundamental theories of biology new hypothesis about detailed evolutionary history, through mining genomic sequence data? new hypothesis about a particular signaling network, through data mining? new hypothesis about protein folding pathways, through simulations? new hypothesis of cancer biology and developmental biology

20 Bioinformatics Application to Biological Systems bacteria (Synechococcus) yeast (Saccharomyces cerevisiae) plants (Arabidopsis) viruses (SARS) neural systems (neurons)

21 Can Biology Help Computing? Computational techniques inspired by biology: Neural network (artificial intelligence) Genetic algorithm, automata A new driver of computer science: New algorithms New driver for theory development Better hardware (clusters and supercomputers) Develop new theoretical framework: DNA computing Network communication,

22 Computing versus Biology "What computer science is to molecular biology is like what mathematics has been to physics" (Larry Hunter, ISMB '94). "Molecular biology is (becoming) an information science" (Leroy Hood, RECOMB '00). Bioinformatics and computational biology are still in early development!

23 Course Topics Data interpretation in analytical technologies Data management and computational infrastructure Discovery from data mining Modeling, prediction and design Theoretical / in silico biology Cover some classical/mainstream as well as many research bioinformatics problems from a computational perspective

24 Course Outline (1) June 15: Comparison and prediction of biological molecules (with introduction) Pairwise sequence comparison Multiple sequence comparison June 16: Geometric structures of biomolecules Protein structure, geometric volume and surface models of biomolecules Secondary and tertiary structure prediction Geometric constructs: Voronoi diagram, Delaunay triangulation, alpha shape Algorithms for computing geometric constructs Application: protein function prediction

25 Course Outline (2) June 17: Generating conformations of biomolecules State models of biomolecules Sampling by Markov chain Monte Carlo Sampling by Sequential Importance Sampling Applications: protein packing problem June 18: Empirical potential and fitness function for biomolecules Anfinsen's principle and mathematical structure for designing potentials for structure prediction and for protein fitness landscape Empirical statistical function Potential function by optimization Application: global nonlinear fitness function of evolution of protein folds June 19: Evolution of biomolecules and stochastic networks Models of molecular evolution, Molecular phylogeny Maximum likelihood and Bayesian Monte Carlo estimators Application: protein function prediction Stochastic molecular networks Simulation and exact solution of stochastic landscape of genetic circuits.

26 What I Will Teach A general introduction to a few important problems in bioinformatics and computational biology problems definitions: from biological problem to computable problem some key aspects of models, theories, algorithms, and computational techniques A way of thinking: tackling biological problem computationally how to look at a biological problem from a computational point of view how to formulate a computational problem to address a biological issue how to collect statistics from biological data how to build a computational model how to design algorithms for the model how to test and evaluate a computational algorithm how to gain confidence of a prediction result

27 New Ways of Thinking Critical thinking Analytical thinking Quantitative thinking Algorithmic thinking

28 Introduction (1) Biological sequence comparison DNA-DNA RNA-RNA Protein-protein Sequence comparison is the most important and fundamental operation in bioinformatics Key to understand evolution of a gene or an organism

29 Introduction (2) Applications in most bioinformatics problems Sequence assembly Gene finding Protein structure prediction Evolutionary analysis THE most popular tool: BLAST Foundation of sequence database search

30 Today's Lecture Scope of bioinformatics and the course Sequence comparison

31 Genome Each cell contains a full genome (DNA). The size varies:
Small for viruses and prokaryotes (10 kbp to 20 Mbp)
Medium for lower eukaryotes: yeast, unicellular eukaryote, 13 Mbp; worm (Caenorhabditis elegans), 100 Mbp; fly, invertebrate (Drosophila melanogaster), 170 Mbp
Larger for higher eukaryotes: mouse and human, 3000 Mbp
Very variable for plants (many are polyploid): mouse-ear cress (Arabidopsis thaliana), 120 Mbp; lilies, 60,000 Mbp

32 Differences in DNA ~2% ~4% ~0.2%

33 An Example of Sequence Comparison (two sequences, shown in three consecutive blocks)
TT...TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG
AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
AAGGATC...TCAGTAATTAATCATGCACCTATGTGGCGG
AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

34 Alignment (1) Alignment: a correspondence between elements of two sequences with order kept. Pairwise alignment: 2 sequences aligned. Multiple alignment: alignment of 3 or more sequences. Two pairwise examples (":" marks a match):
FSEYTTHRGHR
:  ::::: ::
FESYTTHRPHR
and
FESYTTHRGHR
:::::::: ::
FESYTTHRPHR

35 Alignment (2) Similar to the longest common subsequence (LCS) problem for strings (Robinson, 1938). Define a set of operations (e.g. substitution, insertion or deletion) that transform the aligned elements of one sequence into the corresponding elements of the other, and associate with each operation a cost or a score. Optimal alignment: the alignment with the lowest cost (or highest score). Between two sequences, several optimal alignments can be constructed with the same optimal score, for example:
FSEY-THRGHR
:  : ::: ::
FESYTTHRPHR
and
FSEYT-HRGHR
:  :: :: ::
FESYTTHRPHR

36 Some Terminology
Alphabet: a finite set of characters from which strings are made, e.g. {A, T, G, C} or the twenty amino acid residues.
String: an ordered succession of characters or symbols; synonymous with sequence.
Length of a string s: denoted $|s|$, the number of characters in it. The character at position i is s(i).
Concatenation: the concatenation of two strings s and t is denoted st and is obtained by appending all characters of t, in sequence, after those of s. Its length is $|s| + |t|$. If s = GGCTA and t = CAAC, then st = GGCTACAAC.
Prefix: a prefix of s is any substring of s of the form s[1...j] for $0 \le j \le |s|$. Special case: we allow j = 0, so that s[1...0] is the empty string, which is also a prefix of s. t is a prefix of s if and only if there is another string u such that s = tu. Prefix(s, k) denotes the prefix of s with exactly k characters, with $0 \le k \le |s|$.
Suffix: a suffix of s is a substring of the form s[i...|s|] for some i with $1 \le i \le |s| + 1$. We allow i = |s| + 1, in which case s[|s|+1...|s|] denotes the empty string. A string t is a suffix of s if and only if there exists u such that s = ut.

37 Components of Sequence Alignment (1) Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps); (2) An algorithm for alignment; (3) Confidence assessment of the alignment result. Example, with ":" marking a match, "." a mismatch (substitution), and "-" an insertion or deletion (indel):
FDSK-THRGHR
:.:  :: :::
FESYWTH-GHR

38 Edit Distance (Levenshtein Distance) Introduced by Levenshtein in 1966. Definition: the minimum number of edit operations needed to transform one string into another. Possible edit operations: symbol insertion and deletion; symbol substitution. (Hamming distance is the restricted case of equal-length strings with substitutions only.) Binary scoring: match = 1 / mismatch = 0 (identity matrix). Can be used for DNA/RNA.
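
The definition above can be made concrete with a short sketch. It assumes unit cost for each insertion, deletion and substitution; the function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t.

    Assumption: every edit operation costs 1 (unit-cost Levenshtein distance).
    """
    m, n = len(s), len(t)
    prev = list(range(n + 1))            # distances from the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n             # i deletions turn s[:i] into ""
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete s[i-1]
                          curr[j - 1] + 1,      # insert t[j-1]
                          prev[j - 1] + cost)   # match or substitute
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("FSEYTTHRGHR", "FESYTTHRPHR"))
```

Only two rows of the table are kept at a time, which is enough when the distance alone is needed and no alignment has to be recovered.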

39 Scoring Matrix Amino acid substitution matrices (20 × 20) account for the probability of one amino acid being substituted for another: frequency of substitution (genetic code); tolerance for changes (natural selection). They penalize residue pairs with a low probability of mutation in evolution and reward pairs with a high probability. They are empirically derived from observed amino acid substitutions that occur between aligned residues in homologous sequences.

40 Physical Bases of Mutation Matrix Geometric nature Physical nature (charged or hydrophobic) Chemical nature Frequencies of amino acids physical property matrices

41 PAM The first substitution matrices, derived by Dayhoff et al. (1978). PAM (point accepted mutation) distance: two sequences are defined to have diverged by one PAM unit if they show on average one accepted point mutation (i.e. one amino acid change) per hundred amino acids. Derived from the pairwise alignment of sequences less than 15% divergent.

42 BLOSUM Block substitution matrices (Henikoff & Henikoff 1992) Blocks: highly conserved regions in a set of aligned protein sequences (local multiple alignment) Number of BLOSUM matrix (e.g. BLOSUM 62) indicates the cutoff of percent identity that defines the clusters - lower cutoffs allow more diverse sequences

43 BLOSUM 62 Matrix [Table: 20 × 20 substitution matrix with rows and columns in the residue order A R N D C Q E G H I L K M F P S T W Y V; the numerical entries are not included here.]

44 What Matrices to Use Close homolog: high cutoffs for BLOSUM (up to BLOSUM 90) or lower PAM values BLAST default: BLOSUM 62 Remote homolog: lower cutoffs for BLOSUM (down to BLOSUM 10) or high PAM values (PAM 200 or PAM 250) A best performer in structure prediction: PAM 250

45 Gap Penalty Functions Corresponding to insertion/deletion in evolution Can be derived from alignment Known alignments Performance-based (sequence comparison)

46 Affine Gap Penalty Function If we introduce k spaces together, the penalty should be no more than that for k independent spaces, i.e. $w(k) \le k\,w(1)$ or, more generally, $w(k_1 + k_2 + \cdots + k_n) \le w(k_1) + w(k_2) + \cdots + w(k_n)$. A function that satisfies these conditions is called a subadditive function. An affine function is a function of the form $w(k) = h + gk$ for $k \ge 1$, where $w(0) = 0$ and $h, g > 0$.

47 Affine Gap Penalty This is the most commonly used model: $w(k) = h + gk$ for $k \ge 1$, with $w(0) = 0$. h: gap opening penalty; g: gap extension penalty; h > g > 0 (e.g., for PAM250, k). Non-linear form: $h + g \log(k)$. One gap of two spaces compared with two separate gaps of one space each:
FDS--THRGHR
:.:  ::::::
FESYTTHRGHR
and
FDS-T-HRGHR
:.: : :::::
FESYTTHRGHR
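
As a rough illustration of why the affine model prefers one long gap over several short ones, the sketch below totals w(k) over the runs of "-" in an aligned sequence. The values h = 3 and g = 1 are made-up example parameters, not values from the slides, and the helper names are hypothetical.

```python
def affine_gap_penalty(k: int, h: float, g: float) -> float:
    """w(k) = h + g*k for k >= 1, and w(0) = 0."""
    return 0.0 if k == 0 else h + g * k


def total_gap_penalty(aligned: str, h: float, g: float) -> float:
    """Sum w(k) over every maximal run of '-' in one aligned sequence."""
    total, run = 0.0, 0
    for ch in aligned + "#":          # '#' is a sentinel that closes a trailing run
        if ch == "-":
            run += 1
        else:
            total += affine_gap_penalty(run, h, g)
            run = 0
    return total


if __name__ == "__main__":
    h, g = 3, 1                        # assumed example parameters with h > g > 0
    print(total_gap_penalty("FDS--THRGHR", h, g))   # one gap of length 2
    print(total_gap_penalty("FDS-T-HRGHR", h, g))   # two gaps of length 1
```

With these assumed parameters the single two-space gap costs h + 2g, while the two isolated spaces cost 2(h + g), so the opening penalty is paid twice in the second alignment.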

48 Time Complexities General gap penalty functions: $O(mn^2 + m^2 n)$, which is $O(n^3)$ if m is about the same length as n. Affine gap penalty functions: $O(mn)$.

49 Score of an alignment: reward matches and penalize mismatches and spaces. E.g., each column gets a value: a match (both characters are the same): +1; a mismatch (the characters differ): -1; a space in a column: -2. The total score of an alignment is the sum of the values assigned to its columns. The best alignment is the one with the maximum total score. Example:
GA-CGGATTAG
GATCGGAATAG
With 9 matches, 1 mismatch and 1 space, the total score is $9 \times 1 + 1 \times (-1) + 1 \times (-2) = 6$. The score of the best alignment is the similarity between the two sequences s and t: sim(s, t). How to find the best alignment? One could generate or enumerate all possible alignments and pick the one with the best score. Dynamic programming: much faster!
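
The column-by-column scoring rule can be sketched as follows, using the slide's values (+1 match, -1 mismatch, -2 space) as defaults. The function name `alignment_score` is illustrative, and the two rows are assumed to be given already aligned, with "-" standing for a space.

```python
def alignment_score(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Sum the column values of a given pairwise alignment ('-' is a space)."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_s, row_t):
        if x == "-" and y == "-":
            raise ValueError("a column cannot pair two spaces")
        if x == "-" or y == "-":
            score += space
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    # 9 matches, 1 mismatch, 1 space
    print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))
```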

50 Dot Matrix and Alignment Dot matrix: score between cross-elements. Path: mapping to an alignment. [Figure: dot matrix of AACGGTATGC against ATCGGGTTGC, with entries of 1 marking identical residues; the path through the matrix corresponds to the alignment below.]
AACG-GTATGC
ATCGGGT-TGC

51 Dynamic Programming Steps 1. Assign scores between elements in dot matrix 2. For each cell in the dot matrix, check all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the last cell in the dot matrix (or the highest scoring) cell to give the highest scoring alignment

52 Global vs. Local Alignment Global alignment: the alignment of full sequences. Good for comparing members of the same protein family. Needleman & Wunsch, 1970, J Mol Biol 48:443. Local alignment: the alignment of segments of sequences, ignoring areas that show little similarity. Smith & Waterman, 1981, J Mol Biol 147:195. Modified from the Needleman-Wunsch algorithm; can be done with heuristics (FASTA and BLAST).

53 Dynamic Programming for global alignment (Solving a problem by using already computed solutions for smaller instances of the same problem.) Concept: given two sequences s and t, instead of determining the similarity between s and t as whole sequences only, we build up the solution by determining all similarities between arbitrary prefixes of the two sequences. We start with shorter prefixes and use the computed values to solve the problem for larger prefixes. Let m be the size of s and n the size of t. There are m+1 prefixes of s and n+1 prefixes of t, including the empty string. We can arrange our calculations in an (m+1) × (n+1) matrix a, where element a(i, j) contains the similarity between s[1...i] and t[1...j].

54 The matrix a for s = AAAC and t = AGC [Figure: the matrix a, with rows indexed by s(i) and columns by t(j).] The first row and first column are multiples of the space penalty: only one alignment is possible if one sequence is empty (add spaces), with score -2k for k spaces. Key point: the value of entry a(i, j) can be obtained by looking at just three previous entries: those for (i-1, j), (i-1, j-1) and (i, j-1). The reason is that there are only three ways to align s[1..i] and t[1..j]: (1) align s[1...i] with t[1...j-1] and match a space with t[j]; (2) align s[1...i-1] with t[1...j-1] and match s[i] with t[j]; (3) align s[1...i-1] with t[1...j] and match s[i] with a space. These possibilities are exhaustive, since we cannot have two spaces paired.

55 For example, the value of a[1, 2] comes from one of three candidates: a[1, 1] - 2 = -1; a[0, 1] - 1 = -3; a[0, 2] - 2 = -6. In general,
$$\mathrm{sim}(s[1..i], t[1..j]) = \max \begin{cases} \mathrm{sim}(s[1..i], t[1..j-1]) - 2 \\ \mathrm{sim}(s[1..i-1], t[1..j-1]) + p(i, j) \\ \mathrm{sim}(s[1..i-1], t[1..j]) - 2 \end{cases}$$
Here p(i, j) is the score of pairing s[i] with t[j], and -2 is the space score used in this example; in general the space score g is negative (g < 0). Entries are computed row by row. Algorithm to compute global similarity: since a(i, j) stores the similarity of s[1...i] with t[1...j],
$$a(i, j) = \max \{\, a(i, j-1) - 2,\;\; a(i-1, j-1) + p(i, j),\;\; a(i-1, j) - 2 \,\}.$$
Important: the order of computation must make sure that a(i, j-1), a(i-1, j-1), and a(i-1, j) are available when computing a(i, j).
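
A minimal sketch of the matrix fill described above, assuming the example scoring (+1 match, -1 mismatch, -2 space). The function name and the keyword parameters are illustrative; p(i, j) is computed inline from the two characters.

```python
def global_similarity_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the (m+1) x (n+1) matrix a, where a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: s[1..i] against spaces only
        a[i][0] = i * gap
    for j in range(1, n + 1):          # first row: t[1..j] against spaces only
        a[0][j] = j * gap
    for i in range(1, m + 1):          # row by row, so all three neighbours exist
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap,      # space paired with t[j]
                          a[i - 1][j - 1] + p,    # s[i] paired with t[j]
                          a[i - 1][j] + gap)      # s[i] paired with a space
    return a


if __name__ == "__main__":
    for row in global_similarity_matrix("AAAC", "AGC"):
        print(row)
```

The bottom-right entry a[m][n] is sim(s, t); Python strings are 0-indexed, which is why s[i - 1] plays the role of s[i] in the slides.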

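56 Quadratic time complexity: initializing the first column and the first row takes $O(m)$ and $O(n)$ time, and filling the remaining entries takes $O(mn)$, so the total is $O(mn)$. If the sequences are of similar length n, the time complexity is $O(n^2)$.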

57 Optimal Alignment [Figure: the matrix a with arrows, rows indexed by s(i) and columns by t(j).] The arrows are not implemented explicitly. We can start at entry (m, n) and follow the arrows until we get to (0, 0). Each arrow gives one column of the alignment. Horizontal: a space in s matches t[j]. Vertical: a space in t matches s[i]. Diagonal: s[i] matches t[j]. A call to Align(m, n, len) gives an optimal alignment, given the matrix a and the strings s and t.

58 Answers are given in the vectors align-s and align-t, which hold the aligned characters (symbols or spaces) in positions 1..len. The length of the alignment is returned in len: $\max(|s|, |t|) \le len \le m + n$. Time complexity when the matrix a is given: O(len), the size of the returned alignment, or O(m+n). Several alignments may have the same score; the algorithm returns just one, giving preference to the edges leaving (i, j) in counterclockwise order. Upmost alignment: this is achieved through the order of the if statements. That is, if there are two or three choices, a column with a space in t is preferred over a column with two symbols, which is preferred over a column with a space in s. For example, the three alignments
AAA   AAA   AAA
AG-   -AG   A-G
have the same score. Similarly, for s = ATAT and t = TATA, the alignment
-ATAT
TATA-
is preferred over
ATAT-
-TATA
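
The traceback can be sketched iteratively rather than as the recursive Align(m, n, len) of the slides; this is a swapped-in but equivalent formulation, and the names `fill_matrix` and `align` are illustrative. The order of the `if` branches implements the stated preference: space in t first, then two symbols, then space in s.

```python
def fill_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Global similarity matrix, a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap, a[i - 1][j - 1] + p, a[i - 1][j] + gap)
    return a


def align(s, t, match=1, mismatch=-1, gap=-2):
    """Return one optimal alignment (the 'upmost' one) as two gapped strings."""
    a = fill_matrix(s, t, match, mismatch, gap)
    i, j = len(s), len(t)
    row_s, row_t = [], []
    while i > 0 or j > 0:
        p = (match if s[i - 1] == t[j - 1] else mismatch) if i > 0 and j > 0 else None
        if i > 0 and a[i][j] == a[i - 1][j] + gap:        # vertical: space in t
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        elif p is not None and a[i][j] == a[i - 1][j - 1] + p:   # diagonal
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        else:                                             # horizontal: space in s
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return "".join(reversed(row_s)), "".join(reversed(row_t))


if __name__ == "__main__":
    print(align("ATAT", "TATA"))
    print(align("AAAC", "AGC"))
```

Columns are collected from the last one backwards and reversed at the end, so the traceback itself takes time proportional to the alignment length.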

59 Local Comparison A local alignment is an alignment between a substring of s and a substring of t. Goal: to find the highest-scoring local alignment. Same data structure: an (m+1) × (n+1) array a. a[i, j] is the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]. Because the empty string, whose alignment has score 0, is always a valid suffix of a sequence, all entries are ≥ 0. First row and first column: initialize to 0. The recurrence is
$$a[i, j] = \max \{\, a[i, j-1] + g,\;\; a[i-1, j-1] + p(i, j),\;\; a[i-1, j] + g,\;\; 0 \,\},$$
where g < 0 is the space score and the final 0 corresponds to the empty alignment.
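
A small sketch of the local recurrence, again assuming the example scores (+1, -1, -2). The function name is illustrative; it returns both the best local score, which is the maximum entry of the array, and the array itself. The test sequences in the usage line are invented for illustration.

```python
def local_similarity(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best local score, matrix a) for the local alignment recurrence."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                    # empty alignment
                          a[i][j - 1] + gap,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap)
            best = max(best, a[i][j])
    return best, a


if __name__ == "__main__":
    score, _ = local_similarity("TTTTACGTACGTTTTT", "GGGACGTACGGGG")
    print(score)   # the shared segment ACGTACG dominates
```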

60 [Figure: local alignment matrix with rows indexed by s(i) and columns by t(j); ignore the numbers in this figure.]

61 Find the maximum entry in the whole array: this is the score of an optimal local alignment. Start from any entry with this score value and trace back until there is no arrow: this gives an optimal local alignment. In general, we are interested not only in the optimal local alignment but also in near-optimal alignments.
End Spaces in Alignments: end spaces are spaces before the first character or after the last character. Consider the following alignment of a sequence of size 18 with a sequence of size 8:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
There will be many spaces in any alignment because of the length difference, contributing to a large negative score: here $12 \times (-2) + 6 \times 1 + 1 \times (-1) = -19$. The above alignment is pretty good if end spaces are ignored: 6 matches, 1 mismatch, 1 space.

62 The alignment with the best score is:
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
with score $10 \times (-2) + 8 \times 1 = -12$. Although this alignment gives a better score (-12 as compared to -19), it is not interesting because it does not find similar regions. We are interested in regions of the longer sequence that are approximately the same as the shorter sequence. If we choose the first alignment and neglect all end spaces, the score is +3. Semiglobal comparison!

63 Ignoring end spaces after s: spaces after the last character of s have no cost. In an optimal alignment, these spaces are matched to a suffix of t. Removing this final part of the alignment, we obtain an alignment between s and a prefix of t with the same score. Therefore we need to find the best similarity between s and a prefix of t. Since in the basic algorithm a[i, j] contains the similarity between s[1..i] and t[1..j], take the maximum value in the last row of a: $\mathrm{sim}(s, t) = \max_{1 \le j \le n} a[m, j] = a[m, k]$. The alignment can be obtained by tracing back, except that we start from (m, k).

64 Ignoring end spaces after t: spaces after the last character of t have no cost. In an optimal alignment, these spaces are matched to a suffix of s. Removing this final part, we obtain an alignment between t and a prefix of s with the same score. Therefore we need to find the best similarity between t and a prefix of s. Since in the basic algorithm a[i, j] contains the similarity between s[1..i] and t[1..j], take the maximum value in the last column of a: $\mathrm{sim}(t, s) = \max_{1 \le i \le m} a[i, n] = a[k, n]$. The alignment can be obtained by tracing back, except that we start from (k, n).

65 Ignoring initial spaces before s: spaces before the first character of s have no cost. This is equivalent to finding the best alignment between s and a suffix of t. a[i, j] needs to contain the highest similarity between s[1..i] and a suffix of t[1..j]. Therefore, for s with |s| = m and t with |t| = n, we need to look at a[m, n]. The matrix can be filled in the same way as in the basic global algorithm, but the first row has to be 0, since initial spaces before s have no cost. The alignment can be obtained by tracing back from (m, n). Ignoring initial spaces before t: the same, except that the first column has to be 0. If both are ignored, the first row and the first column are all 0.

66 Summary of end gap conditions: in order not to penalize spaces at a given place, take the corresponding action in a[ , ]:
Beginning of s: initialize the first row with zeros
End of s: find the maximum in the last row
Beginning of t: initialize the first column with zeros
End of t: find the maximum in the last column
And combinations of these.
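
The four conditions in the summary can be combined in one routine. The sketch below is an illustration under the example scoring (+1, -1, -2); the function name and the four boolean flags are hypothetical names introduced here, with rows of the matrix indexed by s and columns by t as in the slides.

```python
def semiglobal_similarity(s, t,
                          free_start_s=False, free_end_s=False,
                          free_start_t=False, free_end_t=False,
                          match=1, mismatch=-1, gap=-2):
    """Similarity of s and t where selected end spaces are not penalized."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):      # spaces before s face a prefix of t: first row
        a[0][j] = 0 if free_start_s else j * gap
    for i in range(1, m + 1):      # spaces before t face a prefix of s: first column
        a[i][0] = 0 if free_start_t else i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + gap, a[i - 1][j - 1] + p, a[i - 1][j] + gap)
    candidates = [a[m][n]]
    if free_end_s:                 # spaces after s: maximum over the last row
        candidates.extend(a[m])
    if free_end_t:                 # spaces after t: maximum over the last column
        candidates.extend(a[i][n] for i in range(m + 1))
    return max(candidates)


if __name__ == "__main__":
    s, t = "CAGCACTTGGATTCTCGG", "CAGCGTGG"
    print(semiglobal_similarity(s, t))                                      # global
    print(semiglobal_similarity(s, t, free_start_t=True, free_end_t=True))  # t's end spaces free
```

With no flag set the routine reduces to the basic global algorithm; setting both flags for t corresponds to searching the longer sequence s for a region resembling the shorter sequence t.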

67 Reading Material About dynamic programming in sequence alignment W.R. Pearson and W.Miller. Methods in Enzymology, 210: , T.F.Smith and M.S. Waterman. J. Mol. Biol. 147: , 1981

68 Computational Complexity of Dynamic Programming Computing time: O(nm), where n and m are the sequence lengths. Retrieval time: proportional to the alignment length [worst case: n+m; best case: max(n, m)]. Required memory: O(nm).

69 Comparing Very similar sequences: The scores of their optimal alignments are very close to the maximum possible. For two sequences s and t with the same lengths, The dynamic programming matrix is square. The main diagonal gives an alignment without spaces. If that alignment is not optimal, need to add spaces. We add spaces in pairs, so s and t still have the same lengths: But now the alignment is off diagonal!

70 s = GCGCATGGATTGAGCGA, t = TGCGCCATGGATGAGCA. [Figure: the optimal alignment of s and t and its path in the matrix.] The path of the optimal alignment is not on the main diagonal, but twice removed. There are two spaces. If the sequences are similar, the path of the best alignment should be very close to the main diagonal. Therefore, we may not need to fill the entire matrix; rather, we fill a narrow band of entries around the main diagonal, using an algorithm that fills in a band of width 2k+1 around the main diagonal.

71 a[i, j] depends on a[i-1, j], a[i-1, j-1], and a[i, j-1]. Do not use a[i-1, j] or a[i, j-1] if they are outside the k-band. There is no need to test a[i-1, j-1]: it is always inside the k-band. For a[i-1, j] and a[i, j-1] we must test, because a[i, j] may be on the border of the band: InsideStrip(i, j, k) is true exactly when $-k \le i - j \le k$. The entry a[n, n] contains the highest score of an alignment within the k-band. The time complexity is $O(kn)$; if k is modest, this is much better than $O(n^2)$. How do we know the result is correct if we only look at entries within the k-band? If there are k + 1 or more space pairs, the best possible score is obtained when all the rest of the sequence matches perfectly: $M(n - k - 1) + 2(k + 1)g$, where M is the score of a match and g < 0 is the score of a space. If our k-band computation gives a score at least as good as this bound, then there is no need to increase k. If not, we need to increase k and repeat the calculation. Usually, we double k and run the calculation again.
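
The k-band procedure with doubling can be sketched as below, for two sequences of equal length n and the example scoring (+1, -1, -2). The function names are illustrative, the band is stored in a dictionary for brevity rather than in the compact array a real implementation would likely use, and the stopping test uses "greater than or equal to" the bound.

```python
def inside_strip(i, j, k):
    """True when entry (i, j) lies within the band of half-width k."""
    return -k <= i - j <= k


def kband_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best score of an alignment of s and t that stays inside the k-band."""
    n = len(s)
    assert len(t) == n, "this sketch assumes equal-length sequences"
    a = {(0, 0): 0}
    for d in range(1, min(k, n) + 1):           # in-band part of first row/column
        a[(d, 0)] = d * gap
        a[(0, d)] = d * gap
    for i in range(1, n + 1):
        for j in range(max(1, i - k), min(n, i + k) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            best = a[(i - 1, j - 1)] + p        # the diagonal is always in the band
            if inside_strip(i - 1, j, k):
                best = max(best, a[(i - 1, j)] + gap)
            if inside_strip(i, j - 1, k):
                best = max(best, a[(i, j - 1)] + gap)
            a[(i, j)] = best
    return a[(n, n)]


def banded_similarity(s, t, match=1, mismatch=-1, gap=-2):
    """Double k until the band score reaches the bound for k+1 space pairs."""
    n, k = len(s), 1
    while True:
        score = kband_score(s, t, k, match, mismatch, gap)
        bound = match * (n - k - 1) + 2 * (k + 1) * gap
        if score >= bound or k >= n:            # k >= n covers the whole matrix
            return score, k
        k *= 2


if __name__ == "__main__":
    print(banded_similarity("GCGCATGGATTGAGCGA", "TGCGCCATGGATGAGCA"))
```

The returned pair is the similarity and the band half-width at which the computation stopped; for very similar sequences the loop ends after a few doublings, which is where the saving over the full O(n²) fill comes from.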

72 Confidence Assessment of Sequence Alignment Why confidence assessment is needed True homology or alignment by chance Expected probability by chance Statistical models

73 Why not to use sequence identity as confidence measure

74 p-value and e-value p-value: the probability that a variate would assume a value greater than or equal to the observed value strictly by chance, $P(z \ge z_o)$. If the p-value found for an alignment is low (< 0.001), the alignment is probably biologically meaningful. Pre-compute the parameters based on a statistical model.

75 Need for Heuristic Alignment Time complexity for optimal alignment: $O(n^2)$, where n is the sequence length. Given the current size of sequence databases, use of optimal algorithms is not practical for database search. Heuristic techniques: BLAST, FASTA, MUMmer, PatternHunter. min (optimal alignment, SSearch) 2min (FASTA) 20 sec (BLAST)

76 Ideas in Heuristics Search Indexing and filtering: Google search Good alignment includes short identical, or similar fragments break entire string into substrings, index the substrings Search for matching short substrings and use as seed for further analysis extend to entire string and find the most significant local alignment segment
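
The indexing-and-seeding idea can be sketched in a few lines. This is a toy illustration of the general principle, not the actual FASTA or BLAST implementation: it indexes exact words of length w in a target sequence, looks up the words of a query, and counts hits per diagonal. All names and the example sequences are invented for the illustration.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every length-w substring of seq to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - w + 1):
        index[seq[pos:pos + w]].append(pos)
    return index


def seed_hits(query, target, w):
    """All (query_pos, target_pos) pairs where a length-w word matches exactly."""
    index = index_words(target, w)
    hits = []
    for q in range(len(query) - w + 1):
        for t in index.get(query[q:q + w], []):
            hits.append((q, t))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit count) for the diagonal target_pos - query_pos with most seeds."""
    counts = defaultdict(int)
    for q, t in hits:
        counts[t - q] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    hits = seed_hits("ACGTAC", "TTACGTACTT", 3)
    print(hits)
    print(best_diagonal(hits))
```

Seeds that fall on the same diagonal suggest a gap-free region worth extending with a more careful (for example, dynamic programming) alignment.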

77 FASTA (1) Lipman & Pearson, 1985, Science 227, Key idea Identify regions of the sequences with the highest density of matches. In this step exact matches of a given length (by default 2 for proteins, 6 for nucleic acids) are determined and regions (fragments of diagonals) with a high number of matches selected.

78 FASTA (2) Example of an aligned pair:
A-FTFWSYAIGL--PSSSIVSWKSCHVLHKVLRDGHPNVLHDCQRYRSNI...
AIPQFWSYAIERPLNSSWIVVWKSCITTHHLMVYGNERFIQYLAS-RNTL

79 BLAST Basic Local Alignment Search Tool (Altschul et al., 1990, J. Mol. Biol. 215). Uses word matching like FASTA. Similarity matching of words (3 amino acids, 11 bases) does not require identical words. If no words are similar, then there is no alignment: it won't find matches for very short sequences.

80 Today's Lecture Scope of bioinformatics and the course Pairwise sequence comparison Multiple sequence comparison

81 Introduction The multiple sequence alignment of a set of sequences may be viewed as an evolutionary history of the sequences. No sequence ordering is required.

82 An Example of Multiple Alignment
VTISCTGSESNIGAG-NHVKWYQQLPG
VTISCTGTESNIGS--ITVNWYQQLPG
LRLSCSSSDFIFSS--YAMYWVRQAPG
LSLTCTVSETSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKEFYPSD--IAVEWWSNG--

83 Why Multiple Alignment (1) Natural extension of pairwise sequence alignment. "Pairwise alignment whispers, multiple alignment shouts out loud" (Hubbard et al., 1996). Much more sensitive in detecting sequence relationships and patterns.

84 Why Multiple Alignment (2) Gives hints about the function and evolutionary history of a set of sequences. Foundation for phylogenetic tree construction and protein family classification. Useful for protein structure prediction.

85 Scoring Multiple Alignment Which one is better?
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
or
VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG
or
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCS-SSGFIFSS-YAMYWVRQAPG
LSLTCT-VSGTSFDD-YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

86 Scoring PQRRZW YQRKZX YZTUOP TZZ_FO Total Score = [w(r, U ) + w(r, Δ ) + w(k, U) + w(k, Δ) ] / 4

87 Additive functions: the alignment score is the sum of column scores. The score should be independent of the ordering of the sequences in the alignment: a column with I, -, I, V should score the same as V, I, I, -. It should reward the presence of many equal or strongly related residues, and penalize unrelated residues and spaces. Sum-of-Pairs (SP) function: the sum of pairwise scores of all pairs of symbols in the column, e.g. SP_score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V). Here we use the "unit costs" for pairwise alignment, summing the costs of all possible pairs of letters; for a column of eight sequences, this is the sum of the unit costs of the pairs (1,2), (1,3), (1,4), ..., (1,8), (2,3), (2,4), ..., (2,8), (3,4), (3,5), ..., and (7,8). In general, any cost/weight scheme could be used; it just needs to map pairs of characters to a numeric value.

88 p(-, -) = 0: if we select two sequences from a multiple alignment and ignore the rest, we have a pairwise alignment, provided we ignore columns with two spaces. This is called a projection of the multiple alignment. For example,
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
becomes
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
Projection of a multiple alignment a: if $a_{ij}$ is the pairwise projection of a onto sequences i and j, then $\mathrm{SP\_score}(a) = \sum_{i<j} \mathrm{score}(a_{ij})$, where $\mathrm{score}(a_{ij})$ is the score of the projected pairwise alignment.
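
The equivalence between the column-wise SP score and the sum over pairwise projections can be checked with a short sketch. The pair scores (+1 match, -1 mismatch, -2 space, 0 for two spaces) are assumed example values, and all function names are illustrative.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, space=-2):
    """Score of one pair of symbols; two spaces score 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return space
    return match if x == y else mismatch


def sp_column(column):
    """Sum-of-pairs score of one column (a string of symbols, one per sequence)."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def sp_alignment(rows):
    """SP score of a multiple alignment given as equal-length gapped rows."""
    return sum(sp_column(col) for col in zip(*rows))


def project(rows, i, j):
    """Pairwise projection onto rows i and j: drop columns with two spaces."""
    kept = [(x, y) for x, y in zip(rows[i], rows[j]) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


def sp_by_projection(rows):
    """Sum of the scores of all pairwise projections."""
    total = 0
    for i, j in combinations(range(len(rows)), 2):
        ps, pt = project(rows, i, j)
        total += sum(pair_score(x, y) for x, y in zip(ps, pt))
    return total


if __name__ == "__main__":
    rows = ["ATLVCLISDFYPGA--VTVAWKADS--",
            "AALGCLVKDYFPEP--VTVSWNSG---",
            "VSLTCLVKGFYPSD--IAVEWESNG--"]
    print(sp_column("I-IV"), sp_alignment(rows), sp_by_projection(rows))
```

Because p(-, -) = 0, dropping the double-space columns in a projection does not change its score, which is why the two ways of computing the SP score agree.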

89 Optimal Multiple Alignment: an alignment with minimum overall cost, or maximum overall similarity.
The Dynamic Programming Hyperlattice: for the alignment of 3 sequences, every alignment can be seen as a unique path through a 3-dimensional lattice. The path is denoted by listing, at each node visited, component by component, the distance from the starting point in the bottom left (i.e. from the source (0,0,0) of the lattice): (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2) for the above example.
The distance from the starting point is the number of letters already aligned. For example, column 4 of the alignment corresponds to node (3,3,1): at that point, 3 letters each from the first and second sequences have been aligned, and one letter from the third.
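The correspondence between alignment columns and lattice nodes can be sketched as below. The three aligned rows are hypothetical; they were chosen only so that the resulting path reproduces the node list quoted on this slide.

```python
def alignment_path(rows):
    """Return the lattice nodes visited by an alignment, starting at the source.

    Each column advances the coordinate of every sequence that contributes
    a letter (not a gap) to that column.
    """
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for k, symbol in enumerate(column):
            if symbol != "-":
                position[k] += 1
        path.append(tuple(position))
    return path

# Hypothetical alignment of three sequences (ABCD, EFG, HI).
rows = ["ABC-D",
        "-EFG-",
        "---HI"]
path = alignment_path(rows)
```

Here `path[4]` is the node reached after column 4.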

90 Imagine light sources on the top, front, and right-hand side of the lattice: "shadows" of the alignment path are projected onto the opposing faces (walls). Assume that the light sources are far away from the lattice, so that the shadows are projected without distortion. In Fig. a, only the light source on the right is "on", projecting the path onto the face on the left. In Fig. b, all light sources are "on".

91 An optimal multiple alignment can be calculated by dynamic programming. The special case of pairwise alignment: we can visualize dynamic programming as a calculation that visits every node in a 2-dimensional matrix, or 2-d lattice, in a way that obeys the order of dependencies between the nodes, as indicated by the arrows. The sequences are now arranged such that the calculation of the alignment starts in the bottom-left corner, and not the top-left corner. In this setting, we start bottom-left, then move to the right until the bottom row is finished, then visit the node marked by an asterisk (*), move to the right as before, etc. In three or more dimensions, we have to look at more nodes: e.g. 7 nodes for three sequences. Correspondingly, the minimum needs to be taken over 7 possible values.
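A minimal sketch of the three-sequence case, assuming sum-of-pairs unit costs (0 for identical symbols, including two gaps, and 1 otherwise). It returns only the optimal cost and keeps no traceback; the names are illustrative.

```python
from itertools import combinations, product

def unit_cost(x, y):
    """Unit cost; identical symbols (including two gaps) cost 0."""
    return 0 if x == y else 1

def column_cost(column):
    """Sum-of-pairs cost of one alignment column."""
    return sum(unit_cost(x, y) for x, y in combinations(column, 2))

def msa3_cost(s1, s2, s3):
    """Minimum SP cost of aligning three sequences by dynamic programming."""
    seqs = (s1, s2, s3)
    # The 7 predecessor directions: every non-zero 0/1 vector of length 3.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    D = {(0, 0, 0): 0}
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            for k in range(len(s3) + 1):
                node = (i, j, k)
                if node == (0, 0, 0):
                    continue
                best = float("inf")
                for m in moves:
                    prev = tuple(n - d for n, d in zip(node, m))
                    if min(prev) < 0:
                        continue  # predecessor lies outside the lattice
                    # A sequence contributes its next letter if it advances, else a gap.
                    column = tuple(seqs[t][node[t] - 1] if m[t] else "-" for t in range(3))
                    best = min(best, D[prev] + column_cost(column))
                D[node] = best
    return D[(len(s1), len(s2), len(s3))]
```

The nested loops visit nodes in an order that guarantees all predecessors of a node are already filled in.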

92 Computational Complexity of Multiple Alignment by Standard Dynamic Programming
Each node in the k-dimensional hyperlattice is visited once, and therefore the running time must be proportional to the number of nodes in the lattice. This number is the product of the lengths of the sequences, e.g. the 3-dimensional lattice as visualized.
How many steps does the algorithm "rest" at each node? Dynamic programming organizes the visiting of nodes in such a way that we just need to "look back" one single step, at the nodes that we've visited before, to look up the values we need for calculating the minimum. The time we spend retrieving the values and calculating the minimum does not depend on the length of the sequences. However, it depends on the number of sequences. We've had 3 values for 2 sequences, 7 values for 3 sequences, 15 values for 4 sequences. This goes up exponentially: $2^k - 1$.
The running time is exponential: $O(2^k \prod_{i=1}^{k} s_i)$, where $s_i$ is the length of sequence i.
If the proportionality factor is 1 nanosecond, then for 6 sequences of length 100 we'll have a running time of $2^6 \times 100^6 \times 10^{-9}$ seconds, that's roughly 64,000 seconds (nearly 18 hours). Add two more sequences, and we will have to wait roughly $2.6 \times 10^9$ seconds = 82 years!
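The arithmetic on this slide can be checked with a few lines. This is only a back-of-the-envelope sketch; the 1 ns per step constant is the slide's assumption, and the function names are illustrative.

```python
from math import prod

def predecessors(k):
    """Number of neighbouring nodes to look back at for k sequences: 2^k - 1."""
    return 2 ** k - 1

def running_time_seconds(lengths, seconds_per_step=1e-9):
    """Estimate 2^k * (product of sequence lengths) steps at a fixed cost per step."""
    k = len(lengths)
    return (2 ** k) * prod(lengths) * seconds_per_step

six = running_time_seconds([100] * 6)             # about 64,000 seconds
eight = running_time_seconds([100] * 8)           # about 2.56e9 seconds
years = eight / (365.25 * 24 * 3600)              # on the order of 80 years
```

The exact figure for eight sequences is about 2.56 x 10^9 seconds, which the slide rounds up.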

93 The memory space requirement is even worse. To trace back the alignment, we need to store the whole lattice, a data structure the size of a multidimensional skyscraper. In fact, space is the No. 1 problem here, bogging down multiple alignment methods that try to achieve optimality. Furthermore, incorporating a realistic gap model further increases the demands on space and running time.

94 As we proceed Warning: Muddy Road Ahead!!!

95 Progressive Alignment Devised by Feng and Doolittle in A heuristic method, not guaranteed to find the optimal alignment. Multiple alignment is achieved by successive application of pairwise methods.

96 Basic Algorithm
Compare all sequences pairwise.
Perform cluster analysis on the pairwise data to generate a hierarchy for alignment (guide tree).
Build the alignment step by step according to the guide tree: first align the most similar pair of sequences, then add another sequence or another pairwise alignment.
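The first two steps (pairwise comparison, then clustering into a guide tree) can be sketched as below. Two simplifications are assumptions of this sketch, not of the slides: edit distance stands in for the pairwise comparison, and average-linkage clustering stands in for the cluster analysis. Real tools use other distance measures and tree-building methods.

```python
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost edit distance, used here as a stand-in pairwise comparison."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    """Build a guide tree (nested tuples of names) by average-linkage clustering.

    seqs maps a sequence name to its sequence string.
    """
    dist = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    clusters = {name: [name] for name in seqs}  # subtree -> member names

    def average(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)  # merge closest pair
    return next(iter(clusters))

tree = guide_tree({"s1": "ACGT", "s2": "ACGA", "s3": "TTTT"})
```

The innermost tuples of the tree are the pairs to align first; each enclosing tuple is a later merge of a sequence or alignment with another.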

97 Steps in Progressive Multiple Alignment Compare pairwise sequences Perform cluster analysis on pairwise data to generate hierarchy for alignment

98 Alignment (1) Build multiple alignments by first aligning the most similar pair, then the next most similar pair, etc.

99 Alignment (2)

100 CLUSTAL
Most successful implementation of progressive alignment (Des Higgins)
CLUSTAL: gives equal weight to all sequences
CLUSTALW: has the ability to give different weights to the sequences
CLUSTALX: provides a GUI to CLUSTAL

101 Profile-Based Approach Seq1-> Seq2-> Seq3-> Seq4-> Information about the degree of conservation of sequence positions is included

102 Position-specific Score Matrix (PSSM)
For a protein of length L, the scoring matrix PSSM(i,j) is L x 20.
Profile: specific scores for each of the 20 amino acids at each position in a sequence.
For a highly conserved residue at a particular position, a high positive score is assigned, and the other residues are assigned large negative scores.
For a weakly conserved position, a value close to zero is assigned to all the amino acid types.

103 Building a Profile
First, get a multiple sequence alignment of N sequences using a substitution matrix, $S_{jk}$.
Second, count the number of occurrences of amino acid k at position i, $C_{ik}$.
(1) Average-score method: $W_{ij} = \sum_k C_{ik} S_{jk} / N$.
(2) Log-odds-ratio formula: $W_{ij} = \log(q_{ij} / p_j)$, where $q_{ij} = C_{ij} / N$ and $p_j$ is the background probability of residue j.
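Both formulas can be sketched in a few lines. The two-letter alphabet, the substitution scores, and the background probabilities in the example are toy values assumed for illustration; a real profile would use all 20 amino acids and a real substitution matrix.

```python
import math
from collections import Counter

def column_counts(alignment):
    """C_ik: for each position i, a Counter of the symbols k in that column."""
    return [Counter(column) for column in zip(*alignment)]

def average_score_profile(alignment, S, alphabet):
    """W_ij = sum_k C_ik * S_jk / N, with S given as a nested dict S[j][k]."""
    N = len(alignment)
    return [{j: sum(c[k] * S[j][k] for k in c) / N for j in alphabet}
            for c in column_counts(alignment)]

def log_odds_profile(alignment, background):
    """W_ij = log(q_ij / p_j) with q_ij = C_ij / N; -inf where residue j is unobserved."""
    N = len(alignment)
    profile = []
    for c in column_counts(alignment):
        row = {}
        for j, p_j in background.items():
            q_ij = c[j] / N
            row[j] = math.log(q_ij / p_j) if q_ij > 0 else float("-inf")
        profile.append(row)
    return profile

# Toy example (assumed values).
alignment = ["AG", "AA", "GA"]
S = {"A": {"A": 2, "G": -1}, "G": {"A": -1, "G": 3}}
W_avg = average_score_profile(alignment, S, "AG")
W_log = log_odds_profile(alignment, {"A": 0.5, "G": 0.5})
```

The `-inf` entries show why unobserved residues are a problem for the plain log-odds formula when N is small.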

104 Calculating Profiles (1)
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
$W_{ij} = \sum_k C_{ik} S_{jk} / N$
$C_{1A} = 4$, $C_{1G} = 3$
$W_{1A} = (4 S_{AA} + 3 S_{AG}) / 7 = 2.3$
Gribskov et al, Proc. Natl. Acad. Sci. USA 84, , 19
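The worked example can be reproduced as follows. The substitution scores are not legible in the transcription; the values below (S_AA = 4, S_AG = 0, as in BLOSUM62) are an assumption that happens to reproduce the slide's result of 2.3.

```python
from collections import Counter

sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]

# Counts C_1k of each residue k at position 1 (the first column).
counts = Counter(seq[0] for seq in sequences)

# Assumed substitution scores (BLOSUM62 values for A-A and A-G).
S_AA, S_AG = 4, 0

# Average-score method for residue A at position 1.
W_1A = (counts["A"] * S_AA + counts["G"] * S_AG) / len(sequences)
```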

105 Calculating Profiles (2)
$W_{ij} = \log(q_{ij} / p_j)$, where $q_{ij} = C_{ij} / N$ and $p_j$ is the background probability of residue j.
For small N, the formula $q_{ij} = C_{ij} / N$ is not a good estimate.
A large set of too closely related sequences carries little more information than a single member.
Absence of Leu does not mean no Leu at this position when Ile is abundant!
Pseudocount frequency, $g_{ij}$

106 Frequency Matrix
Effective frequency, $f_{ij}$:
$f_{ij} = \dfrac{\alpha q_{ij} + \beta g_{ij}}{\alpha + \beta}$
The frequency matrix element $f_{ij}$ is the probability of amino acid j at position i.
$g_{ij} = p_j \sum_k q_{ik} \exp(\lambda S_{kj})$
$p_j$: background frequency
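A minimal sketch of the two formulas for a single position i. The two-residue alphabet, the all-zero score matrix, and the values of α, β, and λ are toy assumptions; with zero scores the pseudocount frequency reduces to the background frequency, which makes the result easy to check by hand.

```python
import math

def pseudocount_frequencies(q_i, p, S, lam):
    """g_ij = p_j * sum_k q_ik * exp(lam * S_kj) for one position i."""
    return {j: p[j] * sum(q_i[k] * math.exp(lam * S[k][j]) for k in q_i)
            for j in p}

def effective_frequencies(q_i, g_i, alpha, beta):
    """f_ij = (alpha * q_ij + beta * g_ij) / (alpha + beta)."""
    return {j: (alpha * q_i[j] + beta * g_i[j]) / (alpha + beta) for j in q_i}

# Toy position where only Ile was observed (assumed values).
q_i = {"I": 1.0, "L": 0.0}
p = {"I": 0.5, "L": 0.5}
S = {"I": {"I": 0, "L": 0}, "L": {"I": 0, "L": 0}}
g_i = pseudocount_frequencies(q_i, p, S, lam=0.3)
f_i = effective_frequencies(q_i, g_i, alpha=1.0, beta=1.0)
```

Even though Leu was never observed, its effective frequency is non-zero, so its log-odds score stays finite.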

107 Frequency matrix, example
i (position): 1, …, L
j (amino acid type): 1, …, 20

108 Profile Alignment (1) sequence profile ACD VWY

109 Profile Alignment (2): alignment of alignments
Sequence-Profile Alignment. Profile-Profile Alignment.
Penalize gaps in conserved regions more heavily than gaps in more variable regions.
Dynamic Programming (same idea as in Pairwise Sequence Alignment).
Optimal alignment in time $O(a^2 l^2)$, where a = alphabet size and l = sequence length.
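A minimal sketch of sequence-profile alignment by dynamic programming, returning only the optimal global score. The linear, position-specific gap penalties and the toy PSSM are assumptions made for illustration; they are one simple way to penalize gaps more heavily at conserved positions.

```python
def align_to_profile(seq, pssm, gap_penalties):
    """Best global score for aligning seq to a profile.

    pssm[i] maps a residue to its score at profile position i;
    gap_penalties[i] is the (positive) cost of a gap at profile position i.
    """
    L, n = len(pssm), len(seq)
    NEG = float("-inf")
    score = [[NEG] * (n + 1) for _ in range(L + 1)]
    score[0][0] = 0.0
    for i in range(L + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:   # residue j aligned to profile position i
                best = max(best, score[i - 1][j - 1] + pssm[i - 1][seq[j - 1]])
            if i > 0:             # profile position i left unmatched
                best = max(best, score[i - 1][j] - gap_penalties[i - 1])
            if j > 0:             # residue j inserted relative to the profile
                penalty = gap_penalties[i - 1] if i > 0 else gap_penalties[0]
                best = max(best, score[i][j - 1] - penalty)
            score[i][j] = best
    return score[L][n]

# Toy profile: position 1 is conserved (gap cost 4), position 2 is variable (gap cost 1).
pssm = [{"A": 2, "C": -1}, {"A": -1, "C": 3}]
gaps = [4, 1]
full = align_to_profile("AC", pssm, gaps)
short = align_to_profile("A", pssm, gaps)
```

In the second call the cheaper gap at the variable position is chosen, so the single residue is matched to the conserved position.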

110 PSI-BLAST
PSI-BLAST (Position-Specific Iterated BLAST) is an automatic profile-like search.
The program first performs a gapped BLAST search of the database.
The information in the significant alignments is then used to construct a position-specific score matrix.
This matrix replaces the query sequence in the next round of database searching.
The program may be iterated until no new significant alignments are found.

111 Summary Diverse scope of bioinformatics Pairwise sequence comparison Multiple sequence comparison Acknowledgement: Prof Dong Xu, Chair, Dept of Computer science


More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

More information

Network alignment and querying

Network alignment and querying Network biology minicourse (part 4) Algorithmic challenges in genomics Network alignment and querying Roded Sharan School of Computer Science, Tel Aviv University Multiple Species PPI Data Rapid growth

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

CSE 549: Computational Biology. Substitution Matrices

CSE 549: Computational Biology. Substitution Matrices CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Local Alignment: Smith-Waterman algorithm

Local Alignment: Smith-Waterman algorithm Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.

More information

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan Biology Tutorial Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan Viruses A T4 bacteriophage injecting DNA into a cell. Influenza A virus Electron micrograph of HIV. Cone-shaped cores are

More information

Collected Works of Charles Dickens

Collected Works of Charles Dickens Collected Works of Charles Dickens A Random Dickens Quote If there were no bad people, there would be no good lawyers. Original Sentence It was a dark and stormy night; the night was dark except at sunny

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

More information