Computational Methods for Structural Bioinformatics and Computational Biology (1) (Sequence comparison)


1 Computational Methods for Structural Bioinformatics and Computational Biology (1) (Sequence comparison) Dragon Star Short Course, Suzhou University, June 15 to June 19, 2009 Jie Liang (梁杰) Molecular and Systems Computational Bioengineering Lab (MoSCoBL), Department of Bioengineering, University of Illinois at Chicago; Institute of Systems Medicine, Shanghai Jiao Tong University; Shanghai Center for Bioinformation Technology

2 Course Organization June 15 to June 19. Working language is Chinese; slides are in English. Discussions are encouraged throughout the lectures. Lectures will focus on fundamentals, while students are welcome to challenge the instructor with any questions related to the subject. Additional discussion session

3 Reference Books "Introduction to Computational Molecular Biology" by Carlos Setubal and Joao Meidanis, 1997, PWS Publishing, ISBN Computational Molecular Biology by Peter Clote and Rolf Backofen, John Wiley, 2000, ISBN Geometry and topology for mesh generation by Herbert Edelsbrunner, Cambridge University Press, 2001, ISBN Monte Carlo Strategies in Scientific Computing by Jun S. Liu (Springer Series in Statistics), 2009 (Paperback), ISBN-10: "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison, Cambridge University Press, 1999, ISBN Monte Carlo Statistical Methods by Christian P. Robert and George Casella, Springer, 2005, ISBN-10:

4 A Brief Survey Computer science background? Biology background? Mathematics/statistics background? None of the above? Have you taken another bioinformatics course?

5 Prerequisites Basic knowledge of computer science Assume no prior knowledge in biology above high school Strong motivation in learning bioinformatics and computational biology

6 Today's Lecture Scope of bioinformatics and the course Basic concepts in molecular biology: DNA, RNA, protein Dynamic programming and pairwise sequence analysis Statistical models for evaluating aligned sequences Optimal multiple sequence alignment Heuristic multiple sequence alignment

7 Scope of Bioinformatics: Studying Biology on Computer data management; data mining; modeling; prediction; theory formulation bioinformatics genes, proteins, protein complexes, pathways, cells, organisms, ecosystem an indispensable part of biological science with its own methodology engineering aspect scientific aspect computer science, biology, statistics physics, mathematics, chemistry, engineering,

8 Why Bioinformatics? (I) More than 80 US universities offer graduate degrees in bioinformatics. At the intersection of two of the most exciting fields: computer science and biology. Exponential growth in computing technologies (hardware, Internet) paves the way for bioinformatics development

9 Why Bioinformatics? (II) Analytical technology High-throughput data Biological knowledge Medicine & bioengineering

10 What Can Computing Do for Biology? 1. Data interpretation in analytical technologies 2. Data management and computational infrastructure 3. Discovery from data mining 4. Physical models through computing, prediction, and design 5. Theoretical / in silico biology Almost every area of computer science can be applied to biology Many computer scientists study biological problems

11 1. Data Interpretation in Analytical Technologies (I) Analytical technologies are the driving force of new (large-scale) biology: DNA sequencing (genomics) X-ray / NMR structure determination (structural genomics) Protein identification using mass spectrometry (proteomics) Microarray chips (functional genomics)

12 1. Data Interpretation in Analytical Technologies (II) [Figure: NMR protein structure determination: NMR spectra, peak assignment, structural restraint extraction, structure calculation, protein structure]

13 1. Data Interpretation in Analytical Technologies (III) From image to data (image processing) Large-scale data cannot be handled without computers Noisy data (optimization with under-constrained or over-constrained problems) Computer algorithms/programs can mimic the human interpretation process and do it much faster Automation of experimental data interpretation

14 2. Data Management and Computational Infrastructure Track instruments, experiment conditions and results at each step of a complicated biological experiment (LIMS at modern wet labs) Data storage and retrieval (database) Data visualization Data query and analysis pipeline

15 3. Discovery from Data Mining (I)

16 3. Discovery from Data Mining (II) Pattern/knowledge discovery from data many biological data are generated by biological processes which are not well understood interpretation of such data requires discovery of convoluted relationships hidden in the data which segment of a DNA sequence represents a gene, a regulatory region which genes are possibly responsible for a particular disease Complicated data Large-scale, high-dimension Noisy (false positives and false negatives)

17 4. Modeling, Prediction and Design (I) Modeling and prediction of biological objects/processes modeling of biochemistry enzyme reaction rates modeling of biophysics dynamics of biomolecules modeling of evolution prediction of phylogeny and substitution pattern

18 4. Modeling, Prediction and Design (II) Prediction of outcomes of biological processes computing will become an integral part of modern biology through an iterative process of model formulation computational prediction experimental validation From prediction to engineering design Protein structure prediction to protein engineering Design genetically modified species

19 5. Theoretical / In Silico Biology Generate new hypothesis, formulate and test fundamental theories of biology new hypothesis about detailed evolutionary history, through mining genomic sequence data? new hypothesis about a particular signaling network, through data mining? new hypothesis about protein folding pathways, through simulations? new hypothesis of cancer biology and developmental biology

20 Bioinformatics Application to Biological Systems bacteria (Synechococcus) yeast (Saccharomyces cerevisiae) plants (Arabidopsis) viruses (SARS) neural systems (neurons)

21 Can Biology Help Computing? Computational techniques inspired by biology: Neural network (artificial intelligence) Genetic algorithm, automata A new driver of computer science: New algorithms New driver for theory development Better hardware (clusters and supercomputers) Develop new theoretical framework: DNA computing Network communication,

22 Computing versus Biology "What computer science is to molecular biology is like what mathematics has been to physics" (Larry Hunter, ISMB '94). "Molecular biology is (becoming) an information science" (Leroy Hood, RECOMB '00). Bioinformatics and computational biology are still in early development!

23 Course Topics Data interpretation in analytical technologies Data management and computational infrastructure Discovery from data mining Modeling, prediction and design Theoretical / in silico biology Cover some classical/mainstream as well as many research bioinformatics problems from a computational perspective

24 Course Outline (1) June 15: Comparison and prediction of biological molecules (with introduction) Pairwise sequence comparison Multiple sequence comparison June 16: Geometric structures of biomolecules Protein structure, geometric volume and surface models of biomolecules Secondary and tertiary structure prediction Geometric constructs: Voronoi diagram, Delaunay triangulation, alpha shape Algorithms for computing geometric constructs Application: protein function prediction

25 Course Outline (2) June 17: Generating conformations of biomolecules State models of biomolecules Sampling by Markov chain Monte Carlo Sampling by Sequential Importance Sampling Applications: protein packing problem June 18: Empirical potential and fitness function for biomolecules Anfinsen's principle and mathematical structure for designing potentials for structure prediction and for protein fitness landscape Empirical statistical function Potential function by optimization Application: global nonlinear fitness function of evolution of protein folds June 19: Evolution of biomolecules and stochastic networks Models of molecular evolution, Molecular phylogeny Maximum likelihood and Bayesian Monte Carlo estimators Application: protein function prediction Stochastic molecular networks Simulation and exact solution of stochastic landscape of genetic circuits.

26 What I Will Teach A general introduction to a few important problems in bioinformatics and computational biology problem definitions: from biological problem to computable problem some key aspects of models, theories, algorithms, and computational techniques A way of thinking: tackling biological problems computationally how to look at a biological problem from a computational point of view how to formulate a computational problem to address a biological issue how to collect statistics from biological data how to build a computational model how to design algorithms for the model how to test and evaluate a computational algorithm how to gain confidence in a prediction result

27 New Ways of Thinking Critical thinking Analytical thinking Quantitative thinking Algorithmic thinking

28 Introduction (1) Biological sequence comparison DNA-DNA RNA-RNA Protein-protein Sequence comparison is the most important and fundamental operation in bioinformatics Key to understanding the evolution of a gene or an organism

29 Introduction (2) Applications in most bioinformatics problems Sequence assembly Gene finding Protein structure prediction Evolutionary analysis THE most popular tool: BLAST Foundation of sequence database search

30 Today's Lecture Scope of bioinformatics and the course Sequence comparison

31 Genome Each cell contains a full genome (DNA) The size varies: Small for viruses and prokaryotes (10 kbp to 20 Mbp) Medium for lower eukaryotes Yeast, unicellular eukaryote 13 Mbp Worm (Caenorhabditis elegans) 100 Mbp Fly, invertebrate (Drosophila melanogaster) 170 Mbp Larger for higher eukaryotes Mouse and human 3000 Mbp Very variable for plants (many are polyploid) Mouse-ear cress (Arabidopsis thaliana) 120 Mbp Lilies 60,000 Mbp

32 Differences in DNA ~2% ~4% ~0.2%

33 An Example of Sequence Comparison TT...TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG AAGGATC...TCAGTAATTAATCATGCACCTATGTGGCGG AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

34 Alignment (1) An alignment is a correspondence between elements of two sequences with order kept. Pairwise alignment: 2 sequences aligned. Multiple alignment: alignment of 3 or more sequences. Two pairwise examples (':' marks a match):
FSEYTTHRGHR
:  ::::: ::
FESYTTHRPHR

FESYTTHRGHR
:::::::: ::
FESYTTHRPHR

35 Alignment (2) Similar to the longest common subsequence (LCS) problem for strings (Robinson, 1938). Define a set of operations (e.g. substitution, insertion or deletion) that transform the aligned elements of one sequence into the corresponding elements of the other, and associate with each operation a cost or a score. Optimal alignment: the alignment that is associated with the lowest cost (or highest score). Between two sequences, several optimal alignments can be constructed with the same optimal score:
FSEY-THRGHR
:  : ::: ::
FESYTTHRPHR

FSEYT-HRGHR
:  :: :: ::
FESYTTHRPHR

36 Some Terminology Alphabet: a finite set of characters from which strings are made, e.g. {A,T,G,C} or the twenty amino acid residues. String: an ordered succession of characters or symbols; synonymous with sequence. Length of a string s: denoted |s|, the number of characters in it. The character at position i is s(i). Concatenation: the concatenation of two strings s and t is denoted st and is obtained by appending all characters of t, in order, after those of s. Its length is |s| + |t|. If s = GGCTA and t = CAAC, then st = GGCTACAAC. Prefix: a prefix of s is any substring of s of the form s[1...j] for 0 <= j <= |s|. Special case: we allow j = 0, so that s[1...0] is the empty string, which is also a prefix of s. t is a prefix of s if and only if there is another string u such that s = tu. Prefix(s,k) denotes the prefix of s with exactly k characters, with 0 <= k <= |s|. Suffix: a suffix of s is a substring of the form s[i...|s|] for some i such that 1 <= i <= |s|+1. We allow i = |s|+1, in which case s[|s|+1...|s|] denotes the empty string. A string t is a suffix of s iff there exists u such that s = ut.

37 Components of Sequence Alignment (1) Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps); (2) An algorithm for alignment; (3) Confidence assessment of the alignment result. In the example below, ':' marks a match, unmarked letter pairs are mismatches (substitutions), and each '-' is an insertion or deletion (indel):
FDSK-THRGHR
:.:  :: :::
FESYWTH-GHR

38 Edit Distance (Levenshtein Distance) Introduced by Levenshtein in 1966. Binary scoring: match = 1 / mismatch = 0 (identity matrix). Definition: the minimum number of edit operations needed to transform one string into another. Can be used for DNA/RNA. Possible edit operations: symbol insertion and deletion; symbol substitution. The related Hamming distance allows substitutions only and is defined for strings of equal length.
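The definition above can be turned into a short dynamic program. The following is a minimal sketch, assuming unit cost for every insertion, deletion and substitution; the function names are illustrative and not taken from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    (unit cost each, an assumption) needed to turn s into t."""
    n = len(t)
    # prev[j] = distance between the prefix of s processed so far and t[:j]
    prev = list(range(n + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitution = prev[j - 1] + (s[i - 1] != t[j - 1])
            deletion = prev[j] + 1        # drop s[i-1]
            insertion = curr[j - 1] + 1   # insert t[j-1]
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[n]


def hamming_distance(s: str, t: str) -> int:
    """Number of mismatching positions; only defined for equal lengths."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(s, t))


print(edit_distance("ACGT", "CGTA"))     # 2: delete the leading A, append an A
print(hamming_distance("ACGT", "CGTA"))  # 4: every position differs
```

The last two lines illustrate why the two distances should not be confused: a single shift makes every position differ, yet only two edit operations are needed.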

39 Scoring Matrix Amino acid substitution matrices (20 × 20) account for the probability of one amino acid being substituted for another: frequency of substitution (genetic code); tolerance for changes (natural selection). A matrix penalizes residue pairs with a low probability of mutation in evolution and rewards pairs with a high probability. Matrices are empirically derived from observed amino acid substitutions that occur between aligned residues in homologous sequences

40 Physical Bases of Mutation Matrix Geometric nature Physical nature (charged or hydrophobic) Chemical nature Frequencies of amino acids physical property matrices

41 PAM The first substitution matrices, derived by Dayhoff et al. (1978). PAM (point accepted mutation) distance: two sequences are defined to have diverged by one PAM unit if they show on average one accepted point mutation (i.e. one amino acid change) per hundred amino acids. Derived from pairwise alignments of sequences less than 15% divergent.

42 BLOSUM Block substitution matrices (Henikoff & Henikoff 1992). Blocks: highly conserved regions in a set of aligned protein sequences (local multiple alignment). The number of a BLOSUM matrix (e.g. BLOSUM 62) indicates the percent identity cutoff that defines the clusters; lower cutoffs allow more diverse sequences

43 BLOSUM 62 Matrix [Table: 20 × 20 substitution scores, with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V]

44 What Matrices to Use Close homolog: high cutoffs for BLOSUM (up to BLOSUM 90) or lower PAM values BLAST default: BLOSUM 62 Remote homolog: lower cutoffs for BLOSUM (down to BLOSUM 10) or high PAM values (PAM 200 or PAM 250) A best performer in structure prediction: PAM 250

45 Gap Penalty Functions Corresponding to insertion/deletion in evolution Can be derived from alignment Known alignments Performance-based (sequence comparison)

46 Affine Gap Penalty Function If we introduce k spaces together, the penalty should be no more than that for k independent spaces, i.e. $w(k) \le k\,w(1)$, or more generally $w(k_1 + k_2 + \dots + k_n) \le w(k_1) + w(k_2) + \dots + w(k_n)$. A function that satisfies these conditions is called a subadditive function. An affine function is a function of the form $w(k) = h + g k$ for $k \ge 1$, where $w(0) = 0$ and $h, g > 0$.

47 Affine Gap Penalty This is the most commonly used model: $w(k) = h + g k$ for $k \ge 1$, with $w(0) = 0$. h: gap opening penalty; g: gap extension penalty; $h > g > 0$ (e.g., for PAM250, k). Non-linear form: $h + g \log(k)$. One gap of length 2 compared with two gaps of length 1:
FDS--THRGHR
:.:  ::::::
FESYTTHRGHR

FDS-T-HRGHR
:.: : :::::
FESYTTHRGHR
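To see what the affine model does to the two example alignments, the sketch below computes the total gap penalty of an aligned row. The values h = 10 and g = 2 are placeholders chosen only to satisfy h > g > 0; they are not taken from the slides, and the helper names are hypothetical.

```python
from itertools import groupby


def affine_gap_penalty(k: int, h: int = 10, g: int = 2) -> int:
    """w(k) = h + g*k for k >= 1, and w(0) = 0 (h, g are assumed values)."""
    return 0 if k == 0 else h + g * k


def gap_lengths(row: str) -> list:
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(list(run)) for ch, run in groupby(row) if ch == "-"]


def total_gap_penalty(row: str, h: int = 10, g: int = 2) -> int:
    """Sum of w(k) over all gaps in the row."""
    return sum(affine_gap_penalty(k, h, g) for k in gap_lengths(row))


print(total_gap_penalty("FDS--THRGHR"))  # one gap of length 2: 10 + 2*2 = 14
print(total_gap_penalty("FDS-T-HRGHR"))  # two gaps of length 1: 2 * (10 + 2) = 24
```

With these assumed parameters, the alignment with a single longer gap is penalized less than the one with two separate gaps, which is the behavior subadditivity is meant to capture.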

48 Time Complexities General gap penalty functions: $O(mn^2 + m^2 n)$, which is $O(n^3)$ if m is about the same as n. Affine gap penalty functions: $O(mn)$.

49 Score of an alignment: reward matches and penalize mismatches and spaces. E.g., each column gets a value: a match (both characters the same): +1; a mismatch (different characters): -1; a space in a column: -2. The total score of an alignment is the sum of the values assigned to its columns. The best alignment is the one with the maximum total score. Example:
GA-CGGATTAG
GATCGGAATAG
With match +1, mismatch -1 and space -2, the total score is 9 × 1 + 1 × (-1) + 1 × (-2) = 6. The score of the best alignment is the similarity between the two sequences s and t: sim(s,t). How to find the best alignment? Generate or enumerate all possible alignments, and pick the one with the best score. Dynamic programming: much faster!
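The column-by-column scoring can be checked with a few lines of code. This is a minimal sketch using the +1 / -1 / -2 values from the slide; the function name is illustrative.

```python
def score_alignment(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Sum the column values of a given alignment ('-' denotes a space)."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for x, y in zip(row_s, row_t):
        if x == "-" or y == "-":
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9 - 1 - 2 = 6
```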

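50 Dot Matrix and Alignment Dot matrix: scores between cross-elements of the two sequences. Path: a path through the matrix maps to an alignment. [Figure: dot matrix of AACGGTATGC (horizontal) against ATCGGGTTGC (vertical), with a path corresponding to the alignment below]
AACG-GTATGC
ATCGGGT-TGC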

51 Dynamic Programming Steps 1. Assign scores between elements in dot matrix 2. For each cell in the dot matrix, check all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the last cell in the dot matrix (or the highest scoring) cell to give the highest scoring alignment

52 Global vs. Local Alignment Global alignment: the alignment of full sequences Good for comparing members of the same protein family Needleman & Wunsch, 1970, J Mol Biol 48:443 Local alignment: the alignment of segments of sequences; ignores areas that show little similarity Smith & Waterman 1981, J Mol Biol, 147:195 modified from the Needleman-Wunsch algorithm can be done with heuristics (FASTA and BLAST)

53 Dynamic Programming for global alignment (Solving a problem by using already computed solutions for smaller instances of the same problem.) Concept: given two sequences s and t, instead of determining the similarity between s and t as whole sequences only, we build up the solution by determining all similarities between arbitrary prefixes of the two sequences. We start with shorter prefixes and use the computed values of these to solve the problem for larger prefixes. Let m be the size of s and n the size of t. There are m+1 prefixes of s and n+1 of t, including the empty string. We can arrange our calculations in an (m+1) × (n+1) matrix a, where element a(i, j) contains the similarity between s[1...i] and t[1...j].

54 The matrix a for s = AAAC and t = AGC, with s(i) along the rows and t(j) along the columns. The first row and first column are multiples of the space penalty: only one alignment is possible if one sequence is empty (add k spaces, with score -2k). Key point: the value for the entry a(i, j) can be obtained by looking at just three previous entries: those for (i-1, j), (i-1, j-1) and (i, j-1). The reason is that there are only three ways to align s[1..i] and t[1..j]: (1) align s[1...i] with t[1...j-1] and match a space with t[j]; (2) align s[1...i-1] with t[1...j-1] and match s[i] with t[j]; (3) align s[1...i-1] with t[1...j] and match s[i] with a space. These possibilities are exhaustive, since we cannot have two spaces paired.

55 For example, the value of a[1, 2] is the maximum of three candidates: a[1, 1] - 2 = -1; a[0, 1] - 1 = -3; a[0, 2] - 2 = -6. In general:
$$\mathrm{sim}(s[1..i], t[1..j]) = \max \begin{cases} \mathrm{sim}(s[1..i], t[1..j-1]) - 2 \\ \mathrm{sim}(s[1..i-1], t[1..j-1]) + p(i,j) \\ \mathrm{sim}(s[1..i-1], t[1..j]) - 2 \end{cases}$$
Here entries are computed row by row, and usually the score of a space is negative, g < 0 (here g = -2). Algorithm to compute global similarity: since a(i, j) stores sim(s[1..i], t[1..j]),
$$a(i,j) = \max \begin{cases} a(i, j-1) - 2 \\ a(i-1, j-1) + p(i,j) \\ a(i-1, j) - 2 \end{cases}$$
Important: the order of the computation needs to make sure that a(i, j-1), a(i-1, j-1), and a(i-1, j) are available when computing a(i, j).
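The recurrence can be written down almost literally. The sketch below fills the matrix row by row for the running example s = AAAC, t = AGC, assuming p(i, j) = +1 for a match and -1 for a mismatch as on the scoring slide; the function name is illustrative.

```python
def similarity_matrix(s: str, t: str,
                      match: int = 1, mismatch: int = -1, space: int = -2):
    """Return the (m+1) x (n+1) matrix a with a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: s[1..i] against spaces
        a[i][0] = i * space
    for j in range(1, n + 1):          # first row: t[1..j] against spaces
        a[0][j] = j * space
    for i in range(1, m + 1):          # row by row, so all three neighbors exist
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + space,      # space matched with t[j]
                          a[i - 1][j - 1] + p,      # s[i] matched with t[j]
                          a[i - 1][j] + space)      # s[i] matched with a space
    return a


a = similarity_matrix("AAAC", "AGC")
print(a[1][2])   # -1, as in the worked example above
print(a[4][3])   # -1, the global similarity sim(AAAC, AGC)
```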

56 Quadratic time complexity: initializing the first column and the first row takes $O(m)$ and $O(n)$; filling the matrix takes $O(mn)$. If the sequences are of similar length n, then the time complexity is $O(n^2)$.

57 Optimal Alignment [Figure: matrix a with s(i) along the rows, t(j) along the columns, and traceback arrows] The arrows are not implemented explicitly. We can start at entry (m, n) and follow the arrows until we get to (0, 0). Each arrow gives one column of the alignment. Horizontal: a space in s matches t[j]. Vertical: a space in t matches s[i]. Diagonal: s[i] matches t[j]. A call to Align(m, n, len) gives an optimal alignment, given the matrix a and the strings s and t

58 Answers are given in the vectors align-s and align-t, which hold in positions 1..len the aligned characters (symbols or spaces). The length of the alignment is returned in len: max(|s|, |t|) <= len <= m+n. Time complexity when the matrix a is given: O(len), the size of the returned alignment, i.e. O(m+n). Several alignments may have the same score: the algorithm returns just one, giving preference to edges leaving (i, j) in counterclockwise order. This is the upmost alignment, and it is achieved through the order of the if statements. That is, if there are two or three choices, a column with a space in t is preferred over a column with two symbols, which is preferred over a column with a space in s. E.g., for s = AAA and t = AG, three alignments have the same score:
AAA    AAA    AAA
AG-    -AG    A-G
E.g., for s = ATAT and t = TATA, the first alignment below is preferred over the second:
-ATAT    ATAT-
TATA-    -TATA
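The traceback and its tie-breaking order can be sketched as follows. This is not the Align(m, n, len) routine of the slides: it is an iterative variant that fills the matrix itself, assumes the +1 / -1 / -2 scoring of the earlier slides, and tests the three cases in the stated order (space in t first, then two symbols, then space in s).

```python
def global_align(s: str, t: str,
                 match: int = 1, mismatch: int = -1, space: int = -2):
    """Return (score, aligned s, aligned t) for the upmost optimal alignment."""
    m, n = len(s), len(t)

    def p(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # Fill the similarity matrix.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + space,
                          a[i - 1][j - 1] + p(i, j),
                          a[i - 1][j] + space)

    # Trace back from (m, n) to (0, 0); arrows are recomputed, not stored.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + space:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1      # vertical
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1])          # diagonal
            i -= 1; j -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1       # horizontal
    return a[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


print(global_align("AAAC", "AGC"))   # (-1, 'AAAC', 'AG-C')
print(global_align("ATAT", "TATA"))  # (-1, '-ATAT', 'TATA-')
```

Changing the order of the three tests would return a different optimal alignment with the same score, which is the point the slide makes about ties.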

59 Local Comparison A local alignment is an alignment between a substring of s and a substring of t. Goal: to find the highest scoring local alignment. Same data structure: an (m+1) × (n+1) array a. a[i, j]: the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]. Because the empty string, which has score 0, is always a valid suffix of a sequence, all entries are >= 0. First row and first column: initialize to 0. With g < 0 the score of a space (g = -2 in the earlier examples), each entry is
$$a(i,j) = \max \begin{cases} a(i, j-1) + g \\ a(i-1, j-1) + p(i,j) \\ a(i-1, j) + g \\ 0 \quad \text{(an empty alignment)} \end{cases}$$
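A minimal sketch of the local recurrence follows, again assuming +1 / -1 / -2 scoring. It returns only the best score and the cell where it occurs; tracing back from that cell until a zero entry is reached would recover the local alignment itself. The function name is illustrative.

```python
def local_similarity(s: str, t: str,
                     match: int = 1, mismatch: int = -1, space: int = -2):
    """Return (best local score, (i, j) of the first cell attaining it)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + space,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + space,
                          0)                     # the empty alignment
            if a[i][j] > best:
                best, best_cell = a[i][j], (i, j)
    return best, best_cell


# The shared segment ACG ends at position 6 of s and position 5 of t.
print(local_similarity("TTTACGTTT", "GGACGGG"))  # (3, (6, 5))
```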

60 t(j) s(i) (Ignore the numbers in this figure)

61 Find the maximum entry in the whole array: this is the score of an optimal local alignment. Start from any entry with this value and trace back until there is no arrow: this gives an optimal local alignment. In general, we are interested not only in the optimal local alignment but also in near-optimal alignments. End Spaces in Alignments: end spaces are spaces before the first character or after the last character of a sequence. Consider the following alignment of a sequence of size 18 with a sequence of size 8:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
There will be many spaces in any alignment because of the length difference, contributing a large negative score: here 12 × (-2) + 6 × 1 + 1 × (-1) = -19. The above alignment is pretty good if end spaces are ignored: 6 matches, 1 mismatch, 1 space.

62 The alignment with the best score is:
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
10 × (-2) + 8 × 1 = -12. Although this alignment gives a better score (-12 as compared to -19), it is not interesting because it does not find similar regions. We are interested in regions of the longer sequence that are approximately the same as the shorter sequence. If we choose the first alignment and neglect all end spaces, then the score is +3. Semiglobal comparison!

63 Ignore the end spaces after s: spaces after the last character of s have no cost. In an optimal alignment, these spaces are matched to a suffix of t. Removing this final part of the alignment, we obtain an alignment between s and a prefix of t, with the same score. We therefore need to find the best similarity between s and a prefix of t. Since in the basic algorithm a[i, j] contains the similarity between s[1..i] and t[1..j], take the maximum value in the last row of a: sim(s, t) = max over j in [1, n] of a[m, j] = a[m, k]. The alignment can be obtained by tracing back, except that we start from (m, k).

64 Ignore the end spaces after t: spaces after the last character of t have no cost. In an optimal alignment, these spaces are matched to a suffix of s. Removing this final part, we obtain an alignment between t and a prefix of s, with the same score. We therefore need to find the best similarity between t and a prefix of s. Since in the basic algorithm a[i, j] contains the similarity between s[1..i] and t[1..j], take the maximum value in the last column of a: sim(t, s) = max over i in [1, m] of a[i, n] = a[k, n]. The alignment can be obtained by tracing back, except that we start from (k, n).

65 Ignore the initial spaces before s: spaces before the first character of s have no cost. This is equivalent to finding the best alignment between s and a suffix of t. a[i, j] now needs to contain the highest similarity between s[1..i] and a suffix of t[1..j]; therefore, for s with |s| = m and t with |t| = n, we need to look at a[m, n]. The matrix can be filled in the same way as in the basic global algorithm, but the first row has to be 0, since initial spaces before s have no cost. The alignment can be obtained by tracing back from (m, n). Ignore the initial spaces before t: the same, except that the first column has to be 0. If both are ignored, the first row and the first column are all 0s.

66 Summary of end gap conditions
In order not to penalize spaces at: take the following action in a[ , ]:
Beginning of s: initialize the first row with zeros
End of s: find the maximum in the last row
Beginning of t: initialize the first column with zeros
End of t: find the maximum in the last column
And combinations...
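The four rules in the summary map onto four independent switches. The sketch below is one way to express them, assuming the +1 / -1 / -2 scoring used earlier; the flag names are hypothetical and only the score is returned.

```python
def semiglobal_similarity(s: str, t: str,
                          free_start_s: bool = False, free_end_s: bool = False,
                          free_start_t: bool = False, free_end_t: bool = False,
                          match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Similarity of s and t where selected end spaces carry no cost."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):      # beginning of s: first row is zero if free
        a[0][j] = 0 if free_start_s else j * space
    for i in range(1, m + 1):      # beginning of t: first column is zero if free
        a[i][0] = 0 if free_start_t else i * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + space,
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + space)
    candidates = [a[m][n]]
    if free_end_s:                 # end of s: maximum in the last row
        candidates.extend(a[m])
    if free_end_t:                 # end of t: maximum in the last column
        candidates.extend(row[n] for row in a)
    return max(candidates)


s, t = "CAGCACTTGGATTCTCGG", "CAGCGTGG"
print(semiglobal_similarity(s, t))                                      # plain global score
print(semiglobal_similarity(s, t, free_start_t=True, free_end_t=True))  # end spaces around t are free
```

With all flags off the function reduces to the basic global algorithm, so the same routine covers every combination listed in the summary.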

67 Reading Material About dynamic programming in sequence alignment W.R. Pearson and W.Miller. Methods in Enzymology, 210: , T.F.Smith and M.S. Waterman. J. Mol. Biol. 147: , 1981

68 Computational Complexity of Dynamic Programming Computing time: O(nm), where n and m are the sequence lengths. Retrieval (traceback) time: O(n+m) [the alignment length is between Max(n,m) and n+m]. Required memory: O(nm).

69 Comparing Very similar sequences: The scores of their optimal alignments are very close to the maximum possible. For two sequences s and t with the same lengths, The dynamic programming matrix is square. The main diagonal gives an alignment without spaces. If that alignment is not optimal, need to add spaces. We add spaces in pairs, so s and t still have the same lengths: But now the alignment is off diagonal!

70 s = GCGCATGGATTGAGCGA, t = TGCGCCATGGATGAGCA. [Figure: the optimal alignment of s and t and its path in the matrix] If the sequences are similar, the path of the best alignment should be very close to the main diagonal. Therefore, we may not need to fill the entire matrix; rather, we fill a narrow band of entries around the main diagonal: the algorithm fills in a band of width 2k+1 around the main diagonal. Here the path of the optimal alignment is not on the main diagonal, but twice removed. There are two spaces.

71 a[i, j] depends on a[i-1, j], a[i-1, j-1], and a[i, j-1]. Do not use a[i-1, j] or a[i, j-1] if they are outside the k-band. There is no need to test a[i-1, j-1]: it will always be inside the k-band. For a[i-1, j] and a[i, j-1] we test, because a[i, j] may be on the border of the band: InsideStrip(i, j, k) is true if and only if $-k \le i - j \le k$. The entry a[n, n] contains the highest score of an alignment within the k-band. The time complexity is $O(kn)$; if k is modest, this is much better than $O(n^2)$. How do we know the result is correct if we only look at entries within the k-band? An alignment that leaves the band has (k+1) or more space pairs, so its best possible score is reached when all of the rest of the sequence matches perfectly: $M(n - k - 1) + 2(k + 1)g$, where M is the score of a match and g < 0 the score of a space. If our k-band computation gives a score at least as good as this bound, then there is no need to increase k. If not, we need to increase k and repeat the calculation. Usually, we double k and run the calculation again.
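The band computation and the doubling strategy can be sketched as below for two sequences of equal length n. The +1 / -1 / -2 scoring, the starting value k = 1 and the use of a dictionary to store only band cells are choices made for this illustration, not details given in the slides.

```python
def inside_strip(i: int, j: int, k: int) -> bool:
    """True if cell (i, j) lies within the band -k <= i - j <= k."""
    return -k <= i - j <= k


def kband_score(s: str, t: str, k: int,
                match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Best score of an alignment whose path stays inside the k-band."""
    n = len(s)
    if len(t) != n:
        raise ValueError("this sketch assumes sequences of equal length")
    a = {(0, 0): 0}                       # only band cells are ever stored
    for i in range(n + 1):
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:           # diagonal neighbor is always in the band
                p = match if s[i - 1] == t[j - 1] else mismatch
                candidates.append(a[i - 1, j - 1] + p)
            if i > 0 and inside_strip(i - 1, j, k):
                candidates.append(a[i - 1, j] + space)
            if j > 0 and inside_strip(i, j - 1, k):
                candidates.append(a[i, j - 1] + space)
            a[i, j] = max(candidates)
    return a[n, n]


def kband_similarity(s: str, t: str,
                     match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Double k until the band score reaches the bound M(n-k-1) + 2(k+1)g."""
    n, k = len(s), 1
    while True:
        score = kband_score(s, t, k, match, mismatch, space)
        bound = match * (n - k - 1) + 2 * (k + 1) * space
        if k >= n or score >= bound:      # k >= n: the band covers the whole matrix
            return score
        k *= 2


print(kband_similarity("GCGCATGGATTGAGCGA", "TGCGCCATGGATGAGCA"))
```

A quick sanity check is that the result agrees with kband_score(s, t, len(s)), which fills the full matrix.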

72 Confidence Assessment of Sequence Alignment Why confidence assessment is needed True homology or alignment by chance Expected probability by chance Statistical models

73 Why not to use sequence identity as confidence measure

74 p-value and e-value The p-value is the probability that a variate would assume a value greater than or equal to the observed value strictly by chance: $P(z \ge z_o)$. If the p-value found for an alignment is low (< 0.001), the alignment is probably biologically meaningful. The parameters are pre-computed based on a statistical model

75 Need for Heuristic Alignment Time complexity for optimal alignment: $O(n^2)$, where n is the sequence length. Given the current size of sequence databases, the use of optimal algorithms is not practical for database search. Heuristic techniques: BLAST, FASTA, MUMmer, PatternHunter. Example timings: min (optimal alignment, SSearch); 2 min (FASTA); 20 sec (BLAST)

76 Ideas in Heuristic Search Indexing and filtering (as in Google search). A good alignment includes short identical or similar fragments. Break the entire string into substrings and index the substrings. Search for matching short substrings and use them as seeds for further analysis. Extend each seed and find the most significant local alignment segment
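The index, seed and extend idea can be illustrated with a toy example. This sketch indexes exact words of length w and extends a seed only while characters match exactly; real tools such as FASTA and BLAST use more elaborate scoring and extension rules, and all names here are hypothetical.

```python
from collections import defaultdict


def build_index(text: str, w: int) -> dict:
    """Map every length-w substring of text to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(text) - w + 1):
        index[text[pos:pos + w]].append(pos)
    return index


def find_seeds(query: str, index: dict, w: int) -> list:
    """All (query position, text position) pairs sharing an exact word."""
    seeds = []
    for q in range(len(query) - w + 1):
        for d in index.get(query[q:q + w], []):
            seeds.append((q, d))
    return seeds


def extend_seed(query: str, text: str, q: int, d: int, w: int) -> tuple:
    """Extend a seed left and right without gaps while characters agree.
    Returns (query start, text start, length) of the extended match."""
    start_q, start_d = q, d
    while start_q > 0 and start_d > 0 and query[start_q - 1] == text[start_d - 1]:
        start_q -= 1
        start_d -= 1
    end_q, end_d = q + w, d + w
    while end_q < len(query) and end_d < len(text) and query[end_q] == text[end_d]:
        end_q += 1
        end_d += 1
    return start_q, start_d, end_q - start_q


database = "TTGACCATGGA"
query = "ACCATG"
index = build_index(database, 3)
print(find_seeds(query, index, 3))              # [(0, 3), (1, 4), (2, 5), (3, 6)]
print(extend_seed(query, database, 0, 3, 3))    # (0, 3, 6): the whole query matches
```

All four seeds lie on the same diagonal (text position minus query position equals 3), which is the kind of signal that diagonal-based filters look for.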

77 FASTA (1) Lipman & Pearson, 1985, Science 227, Key idea Identify regions of the sequences with the highest density of matches. In this step exact matches of a given length (by default 2 for proteins, 6 for nucleic acids) are determined and regions (fragments of diagonals) with a high number of matches selected.

78 FASTA (2) A-FTFWSYAIGL--PSSSIVSWKSCHVLHKVLRDGHPNVLHDCQRYRSNI... : AIPQFWSYAIERPLNSSWIVVWKSCITTHHLMVYGNERFIQYLAS-RNTL

79 BLAST Basic Local Alignment Search Tool (Altschul et al, 1990, J. Mol. Biol. 215, ) Uses word matching like FASTA. Similarity matching of words (3 aa's, 11 bases): does not require identical words. If no words are similar, then no alignment is found: it won't find matches for very short sequences

80 Today's Lecture Scope of bioinformatics and the course Pairwise sequence comparison Multiple sequence comparison

81 Introduction The multiple sequence alignment of a set of sequences may be viewed as an evolutionary history of the sequences. No sequence ordering is required.

82 An Example of Multiple Alignment
VTISCTGSESNIGAG-NHVKWYQQLPG
VTISCTGTESNIGS--ITVNWYQQLPG
LRLSCSSSDFIFSS--YAMYWVRQAPG
LSLTCTVSETSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKEFYPSD--IAVEWWSNG--

83 Why Multiple Alignment (1) Natural extension of pairwise sequence alignment. "Pairwise alignment whispers ... multiple alignment shouts out loud" (Hubbard et al 1996). Much more sensitive in detecting sequence relationships and patterns

84 Why Multiple Alignment (2) Gives hints about the function and evolutionary history of a set of sequences Foundation for phylogenetic tree construction and protein family classification Useful for protein structure prediction

85 Scoring Multiple Alignment Which one is better?
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
or
VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG
or
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCS-SSGFIFSS-YAMYWVRQAPG
LSLTCT-VSGTSFDD-YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

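86 Scoring Example: consider column 4 of the sequences
PQRRZW
YQRKZX
YZTUOP
TZZ_FO
which contains R and K in the first two sequences and U and a gap (Δ) in the last two. Total Score = [w(R, U) + w(R, Δ) + w(K, U) + w(K, Δ)] / 4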

87 Additive functions: the alignment score is the sum of column scores. The score should be independent of the ordering of the sequences in the alignment: a column with I, -, I, V should score the same as one with V, I, I, -. It should reward the presence of many equal or strongly related residues, and penalize unrelated residues and spaces. Sum-of-Pairs (SP) function: the sum of pairwise scores of all pairs of symbols in the column. E.g., SP_score(I, -, I, V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V). Here we use the "unit costs" of pairwise alignment, summing up the costs of all possible pairs of letters; for a column of 8 sequences, these are the pairs (1,2), (1,3), (1,4), ..., (1,8), (2,3), (2,4), ..., (2,8), (3,4), (3,5), ..., and (7,8). In general, any cost/weight scheme could be used; it just needs to map pairs of characters to a numeric value.
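The SP score of a column is a sum over all unordered pairs, which makes it independent of sequence order. The sketch below assumes the pairwise values +1 (match), -1 (mismatch), -2 (symbol against space) from the pairwise slides and 0 for two spaces; any other pairwise scheme could be substituted.

```python
from itertools import combinations


def pair_score(x: str, y: str,
               match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Pairwise value p(x, y); two spaces score 0 (assumed scheme)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return space
    return match if x == y else mismatch


def sp_column(column) -> int:
    """Sum of p over all unordered pairs of symbols in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def sp_alignment(rows) -> int:
    """Additive score: sum of the SP scores of all columns."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column("I-IV"))  # -2 + 1 - 1 - 2 - 2 - 1 = -7
print(sp_column("VII-"))  # same symbols in another order: -7
```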

88 p(-,-) = 0: If we select two sequences from a multiple alignment and ignore the rest, we have a pairwise alignment, provided we ignore columns with two spaces. This is called a projection of the multiple alignment. For example,
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
becomes:
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
Projection of a multiple alignment a: if a_ij is the pairwise projection of a onto sequences i and j, then $\mathrm{SP\_score}(a) = \sum_{i<j} \mathrm{score}(a_{ij})$. Here score(a_ij) is the score of a projected pairwise alignment.
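A small Python sketch of the projection idea follows. The helper names and the unit cost are assumptions for illustration; the three rows are taken from the first alignment on slide 85. It checks that, with p(-,-) = 0, the column-wise SP score equals the sum of the scores of all projected pairwise alignments.

```python
from itertools import combinations

GAP = "-"

rows = [
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWESNG--",
]


def project(rows, i, j):
    """Pairwise projection: keep rows i and j, drop columns where both hold a space."""
    kept = [(a, b) for a, b in zip(rows[i], rows[j]) if not (a == GAP and b == GAP)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def pair_cost(x, y):
    """Assumed unit cost; note that pair_cost('-', '-') is 0."""
    return 0 if x == y else 1


def pairwise_score(a, b):
    """Score of a pairwise alignment given as two equal-length rows."""
    return sum(pair_cost(x, y) for x, y in zip(a, b))


def sp_by_columns(rows):
    """SP score computed column by column."""
    return sum(pair_cost(x, y) for col in zip(*rows) for x, y in combinations(col, 2))


def sp_by_projections(rows):
    """SP score computed as the sum over all projected pairwise alignments."""
    pairs = combinations(range(len(rows)), 2)
    return sum(pairwise_score(*project(rows, i, j)) for i, j in pairs)


print(project(rows, 0, 1))
print(sp_by_columns(rows) == sp_by_projections(rows))
```

In the projection of the first two rows, the second row keeps one trailing space, because that column pairs the final S of the first row with a space and is therefore not dropped.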

89 An optimal multiple alignment is an alignment with minimum overall cost, or maximum overall similarity. The Dynamic Programming Hyperlattice. For the alignment of 3 sequences, every alignment can be seen as a unique path through a 3-dimensional lattice. This path is denoted by listing, at each node visited and component by component, the distance from the starting point in the bottom left (i.e. from the source (0,0,0) of the lattice): (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2) for the above example. The distance from the starting point is the number of letters already aligned. For example, column 4 of the alignment corresponds to node (3,3,1): 3 letters from each of the first and second sequences are aligned at that point, and one letter from the third.
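The correspondence between an alignment and its lattice path can be sketched as below. The three rows are hypothetical, chosen only so that the resulting path matches the node list given on the slide; the alignment actually shown on the original slide is not available here.

```python
def alignment_to_path(rows, gap="-"):
    """List the lattice nodes visited by an alignment, starting at the source."""
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        # Each letter in the column advances its sequence by one step.
        for k, symbol in enumerate(column):
            if symbol != gap:
                position[k] += 1
        path.append(tuple(position))
    return path


# Hypothetical alignment of three short sequences.
example = ["ABC-D", "-EFG-", "---HI"]
print(alignment_to_path(example))
```

Column 4 of this hypothetical alignment ends at node (3, 3, 1), as in the slide's example.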

90 Imagine light sources on the top, front, and right-hand side of the lattice: "shadows" of the alignment will be projected onto the opposing faces (walls). Assume that the light sources are far away from the lattice, so that the shadows are projected without distortion. In Fig. a, only the light source on the right is "on", projecting the path onto the face on the left. In Fig. b, all light sources are "on".

91 An optimal multiple alignment can be calculated by dynamic programming. The special case of pairwise alignment: We can visualize dynamic programming as a calculation that visits every node in a 2-dimensional matrix, or 2-d lattice, in a way that obeys the order of dependencies between the nodes, as indicated by the arrows. The sequences are now arranged such that the calculation of the alignment starts in the bottom-left corner, and not the top-left corner. In this setting, we start bottom-left, then move to the right until the bottom row is finished, then visit the node marked by an asterisk (*), move to the right as before, etc. In three or more dimensions, we have to look at more nodes: e.g. 7 nodes for three sequences. Correspondingly, the minimum needs to be taken from 7 possible values.
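The lattice calculation can be sketched for any number of sequences as follows. This is a minimal illustration under stated assumptions: it uses a sum-of-pairs unit cost with p(-,-) = 0, returns only the minimum cost (no traceback), and all function names are made up for this example.

```python
from itertools import combinations, product


def moves(k):
    """All non-zero 0/1 step vectors: the 2**k - 1 ways to reach a node."""
    return [m for m in product((0, 1), repeat=k) if any(m)]


def column_cost(column):
    """Assumed sum-of-pairs unit cost of one column."""
    return sum(0 if x == y else 1 for x, y in combinations(column, 2))


def optimal_sp_cost(seqs, gap="-"):
    """Minimum sum-of-pairs cost over all alignments of the given sequences."""
    k = len(seqs)
    source = (0,) * k
    best = {source: 0}
    # product() yields nodes in lexicographic order, so every predecessor
    # of a node has already been computed when the node is visited.
    for node in product(*(range(len(s) + 1) for s in seqs)):
        if node == source:
            continue
        candidates = []
        for step in moves(k):
            prev = tuple(n - d for n, d in zip(node, step))
            if min(prev) < 0:
                continue
            # A sequence that advances contributes its letter, the others a space.
            column = [s[p] if d else gap for s, p, d in zip(seqs, prev, step)]
            candidates.append(best[prev] + column_cost(column))
        best[node] = min(candidates)
    return best[tuple(len(s) for s in seqs)]


print(len(moves(3)))                       # number of predecessors for 3 sequences
print(optimal_sp_cost(["AC", "AC", "A"]))  # small three-sequence example
```

For two sequences `moves(2)` gives the familiar three predecessors of pairwise alignment; for three sequences it gives seven.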

92 Computational Complexity of Multiple Alignment by Standard Dynamic Programming. Each node in the k-dimensional hyperlattice is visited once, and therefore the running time must be proportional to the number of nodes in the lattice. This number is the product of the lengths of the sequences, e.g. the 3-dimensional lattice as visualized. How many steps does the algorithm "rest" at each node? Dynamic programming organizes the visiting of nodes in such a way that we just need to "look back" one single step, at the nodes that we've visited before, to look up the values we need for calculating the minimum. The time we spend retrieving the values and calculating the minimum does not depend on the length of the sequences. However, it depends on the number of sequences. We've had 3 values for 2 sequences, 7 values for 3 sequences, 15 values for 4 sequences. This goes up exponentially: $2^k - 1$. The running time is exponential: $O(2^k \prod_{i=1}^{k} s_i)$, where $s_i$ is the length of sequence i. If the proportionality factor is 1 nanosecond, then for 6 sequences of length 100, we'll have a running time of $2^6 \times 100^6 \times 10^{-9}$ seconds, that's roughly 64,000 seconds (17 hours). Add two more sequences, and we will have to wait roughly $2.6 \times 10^9$ seconds = 82 years!
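The running-time estimate can be reproduced with a few lines of arithmetic. The function name and the conversion constants are illustrative assumptions; the 1 nanosecond per step figure is the one used on the slide.

```python
def dp_seconds(k, length, seconds_per_step=1e-9):
    """Estimated time for k sequences of equal length: 2**k * length**k steps."""
    return (2 ** k) * (length ** k) * seconds_per_step


six = dp_seconds(6, 100)
eight = dp_seconds(8, 100)

print(f"6 sequences: {six:,.0f} s = {six / 3600:.1f} hours")
print(f"8 sequences: {eight:.2e} s = {eight / (365.25 * 24 * 3600):.0f} years")
```

Without rounding, eight sequences come to about 2.56 x 10^9 seconds, or roughly 81 years, which agrees with the slide's rounded figure.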

93 The memory space requirement is even worse. To trace back the alignment, we need to store the whole lattice, a data structure the size of a multidimensional skyscraper. In fact, space is the No. 1 problem here, bogging down multiple alignment methods that try to achieve optimality. Furthermore, incorporating a realistic gap model further increases the demands on space and running time.

94 As we proceed Warning: Muddy Road Ahead!!!

95 Progressive Alignment Devised by Feng and Doolittle. A heuristic method, not guaranteed to find the optimal alignment. Multiple alignment is achieved by successive application of pairwise methods.

96 Basic Algorithm Compare all sequences pairwise. Perform cluster analysis on the pairwise data to generate a hierarchy for alignment (guide tree). Build the alignment step by step according to the guide tree: first align the most similar pair of sequences, then add another sequence or another pairwise alignment.
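The first two steps (pairwise comparison and cluster analysis) can be sketched as below. Everything here is an assumption made for illustration: the sequences are hypothetical, the distance is a simple fraction of mismatched positions standing in for a score derived from real pairwise alignments, and average linkage is just one possible clustering rule. It is not the procedure of any particular program.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Stand-in pairwise distance for equal-length sequences (an assumption)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Average-linkage clustering; returns the tree and the order of merges."""
    dist = {frozenset((a, b)): mismatch_fraction(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)
    merges = []

    def average(members_a, members_b):
        total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Pick the two closest clusters and join them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]][1], clusters[ij[1]][1]))
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merges.append((tree_i, tree_j))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(((tree_i, tree_j), members_i + members_j))
    return clusters[0][0], merges


# Hypothetical sequences.
seqs = {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "TTGTCC", "s4": "TTGAGC"}
tree, merges = guide_tree(seqs)
print(tree)
print(merges)
```

The merge order is the order in which a progressive method would build up the alignment: the most similar pair first, then the next pair, and finally the two pairwise alignments are aligned to each other.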

97 Steps in Progressive Multiple Alignment Compare pairwise sequences Perform cluster analysis on pairwise data to generate hierarchy for alignment

98 Alignment (1) Build multiple alignments by first aligning most similar pair, then next similar pair etc.

99 Alignment (2)

100 CLUSTAL Most successful implementation of progressive alignment (Des Higgins) CLUSTAL - gives equal weight to all sequences CLUSTALW - has the ability to give different weights to the sequences CLUSTALX - provides a GUI to CLUSTAL

101 Profile-Based Approach Seq1-> Seq2-> Seq3-> Seq4-> Information about the degree of conservation of sequence positions is included

102 Position-specific Score Matrix (PSSM) For a protein of length L, the scoring matrix is L x 20, PSSM(i,j). Profile: specific scores for each of the 20 amino acids at each position in a sequence. For a highly conserved residue at a particular position, a high positive score is assigned, and the other amino acids are assigned strongly negative scores. For a weakly conserved position, a value close to zero is assigned to all the amino acid types.

103 Building a Profile First, get multiple sequence alignments using a substitution matrix, $S_{jk}$. Second, count the number of occurrences of amino acid k at position i, $C_{ik}$. (1) Average-score method: $W_{ij} = \sum_k C_{ik} S_{jk} / N$, where N is the number of aligned sequences. (2) Log-odds-ratio formula: $W_{ij} = \log(q_{ij} / p_j)$, with $q_{ij} = C_{ij} / N$ and $p_j$ the background probability of residue j.

104 Calculating Profiles (1)
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
$W_{ij} = \sum_k C_{ik} S_{jk} / N$. $C_{1A} = 4$, $C_{1G} = 3$. $W_{1A} = (4 S_{AA} + 3 S_{AG}) / 7 = ( ) / 7 = 2.3$. Gribskov et al, Proc. Natl. Acad. Sci. USA 84, , 19
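The average-score calculation for position 1 can be checked with a short Python sketch. The substitution scores used below (S_AA = 4, S_AG = 0) are the BLOSUM62 values and are an assumption: the numbers actually used on the slide did not survive, although these values do give a result that rounds to 2.3. The function names are illustrative.

```python
from collections import Counter

sequences = [
    "ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
    "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL",
]

# Assumed substitution scores (BLOSUM62 entries for A-A and A-G).
S = {("A", "A"): 4, ("A", "G"): 0}


def column_counts(seqs):
    """C_ik: occurrences of each amino acid k at each position i."""
    return [Counter(column) for column in zip(*seqs)]


def average_score(counts_i, j, scores, n):
    """W_ij = sum_k C_ik * S_jk / N for one position i and amino acid j."""
    return sum(c * scores[tuple(sorted((j, k)))] for k, c in counts_i.items()) / n


counts = column_counts(sequences)
w_1A = average_score(counts[0], "A", S, len(sequences))
print(counts[0], round(w_1A, 1))
```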

105 Calculating Profiles (2) $W_{ij} = \log(q_{ij} / p_j)$, with $q_{ij} = C_{ij} / N$ and $p_j$ the background probability of residue j. For small N, the formula $q_{ij} = C_{ij} / N$ is not good. A large set of too closely related sequences carries little more information than a single member. Absence of Leu does not mean no Leu at this position when Ile is abundant! Pseudocount frequency, $g_{ij}$.

106 Frequency Matrix Effective frequency, $f_{ij}$: $f_{ij} = \dfrac{\alpha q_{ij} + \beta g_{ij}}{\alpha + \beta}$. The frequency matrix element, $f_{ij}$, is the probability of amino acid j at position i. $g_{ij} = p_j \sum_k q_{ik} \exp(\lambda S_{kj})$, where $p_j$ is the background frequency.
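The effect of the pseudocount can be sketched on a toy two-letter alphabet. All numbers below (the alphabet, the scores, lambda, alpha and beta) are hypothetical values chosen for illustration; only the two formulas for g_ij and f_ij come from the slide.

```python
import math


def pseudocount(q_i, p, S, lam):
    """g_ij = p_j * sum_k q_ik * exp(lam * S_kj) for one position i."""
    return {j: p[j] * sum(q_i[k] * math.exp(lam * S[k][j]) for k in q_i) for j in p}


def effective_frequency(q_i, g_i, alpha, beta):
    """f_ij = (alpha * q_ij + beta * g_ij) / (alpha + beta)."""
    return {j: (alpha * q_i[j] + beta * g_i[j]) / (alpha + beta) for j in q_i}


def log_odds(f_i, p):
    """W_ij = log(f_ij / p_j), using the effective frequency in place of q_ij."""
    return {j: math.log(f_i[j] / p[j]) for j in f_i}


# Toy position where only Ile was observed (hypothetical values throughout).
p = {"I": 0.5, "L": 0.5}
q = {"I": 1.0, "L": 0.0}
S = {"I": {"I": math.log(1.5), "L": math.log(0.5)},
     "L": {"I": math.log(0.5), "L": math.log(1.5)}}

g = pseudocount(q, p, S, lam=1.0)
f = effective_frequency(q, g, alpha=3.0, beta=1.0)
W = log_odds(f, p)
print(g, f, W)
```

Although Leu was never observed at this position, its effective frequency is non-zero because Ile and Leu substitute for each other, so the log-odds score stays finite, whereas log(q/p) with q = 0 would not be defined.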

107 Frequency matrix, example. i (position): 1, ..., L; j (amino acid type): 1, ..., 20

108 Profile Alignment (1) sequence profile ACD VWY

109 Profile Alignment (2) Alignment of alignments: Sequence-Profile Alignment. Profile-Profile Alignment. Penalize gaps in conserved regions more heavily than gaps in more variable regions. Dynamic Programming (same idea as in Pairwise Sequence Alignment). Optimal alignment in time $O(a^2 l^2)$, where a = alphabet size, l = sequence length.
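A sequence-to-profile alignment can be sketched with the same dynamic programming recurrence as pairwise alignment, replacing the substitution matrix lookup with a PSSM lookup. This is a minimal sketch under assumptions: global alignment, linear gap costs, a per-position penalty for skipping a profile position (larger where the profile is conserved), a constant penalty for an unmatched sequence residue, and a hypothetical two-letter PSSM. It covers only the sequence-profile case, not profile-profile alignment.

```python
def align_to_profile(seq, pssm, skip_penalty, insert_penalty):
    """Best global score of a sequence against a profile.

    pssm[i][residue] is the score of a residue at profile position i.
    skip_penalty[i] is the cost of leaving profile position i unmatched.
    insert_penalty is the cost of a sequence residue matched to no position.
    """
    n, m = len(pssm), len(seq)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - skip_penalty[i - 1]
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - insert_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pssm[i - 1][seq[j - 1]],  # match
                score[i - 1][j] - skip_penalty[i - 1],          # skip profile position
                score[i][j - 1] - insert_penalty,               # extra sequence residue
            )
    return score[n][m]


# Hypothetical profile: positions 1 and 2 conserved, position 3 variable.
pssm = [{"A": 2, "C": -1}, {"A": -1, "C": 2}, {"A": 1, "C": 1}]
skip_penalty = [3, 3, 1]

print(align_to_profile("AC", pssm, skip_penalty, insert_penalty=2))
print(align_to_profile("ACA", pssm, skip_penalty, insert_penalty=2))
```

With these toy values the short sequence "AC" is best placed on the two conserved positions, leaving the cheap-to-skip variable position unmatched.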

110 PSI-BLAST PSI (Position-Specific Iterated) BLAST is an automatic profile-like search. The program first performs a gapped BLAST search of the database. The information from the significant alignments is then used to construct a position-specific score matrix. This matrix replaces the query sequence in the next round of database searching. The program may be iterated until no new significant alignments are found.

111 Summary Diverse scope of bioinformatics Pairwise sequence comparison Multiple sequence comparison Acknowledgement: Prof Dong Xu, Chair, Dept of Computer science

In-Depth Assessment of Local Sequence Alignment

2012 International Conference on Environment Science and Engineering, IPCBEE vol. 32 (2012), (2012) IACSIT Press, Singapore. In-Depth Assessment of Local Sequence Alignment. Atoosa Ghahremani and Mahmood A.