SEQUENCE ALIGNMENT

BACKGROUND: BIOINFORMATICS

Prokaryotes and Eukaryotes

DNA and RNA
Double helix structure

Codons

Codons are triplets of bases from the RNA sequence. Each triplet defines an amino acid.

The standard genetic code (codons written in the DNA alphabet, with T in place of U)

TTT Phe   TCT Ser   TAT Tyr    TGT Cys
TTC Phe   TCC Ser   TAC Tyr    TGC Cys
TTA Leu   TCA Ser   TAA STOP   TGA STOP
TTG Leu   TCG Ser   TAG STOP   TGG Trp

CTT Leu   CCT Pro   CAT His    CGT Arg
CTC Leu   CCC Pro   CAC His    CGC Arg
CTA Leu   CCA Pro   CAA Gln    CGA Arg
CTG Leu   CCG Pro   CAG Gln    CGG Arg

ATT Ile   ACT Thr   AAT Asn    AGT Ser
ATC Ile   ACC Thr   AAC Asn    AGC Ser
ATA Ile   ACA Thr   AAA Lys    AGA Arg
ATG Met   ACG Thr   AAG Lys    AGG Arg

GTT Val   GCT Ala   GAT Asp    GGT Gly
GTC Val   GCC Ala   GAC Asp    GGC Gly
GTA Val   GCA Ala   GAA Glu    GGA Gly
GTG Val   GCG Ala   GAG Glu    GGG Gly

Nucleotides and amino acids

The four nucleotides in DNA (RNA)

A  adenine
G  guanine
C  cytosine
T  thymine (U uracil)

The twenty amino acids in proteins

A alanine         G glycine      M methionine   S serine
C cysteine        H histidine    N asparagine   T threonine
D aspartic acid   I isoleucine   P proline      V valine
E glutamic acid   K lysine       Q glutamine    W tryptophan
F phenylalanine   L leucine      R arginine     Y tyrosine
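The codon table above can be turned into a lookup structure. The following is a minimal sketch, not part of the original slides: the names `CODON_TABLE` and `translate` are illustrative, one-letter amino-acid codes are used instead of the three-letter codes of the table, and `*` stands for STOP. Translation is assumed to start at the first base and to end at the first STOP codon.

```python
from itertools import product

BASES = "TCAG"
# One-letter codes for the 64 codons, enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a STOP codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate codon by codon from position 0, stopping at the first STOP."""
    dna = seq.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for start in range(0, len(dna) - 2, 3):  # incomplete trailing codons are ignored
        aa = CODON_TABLE[dna[start:start + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("GCTGAACGATTCGTTACT"))  # AERFVT
```

Building the table from two compact strings keeps the sketch short; a hand-written dictionary with three-letter codes would work equally well.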
Sequences

DNA sequence: GCTGAACGATTCGTTACT
Amino-acid sequence: MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYS

Goals of sequence alignment

Given two (nucleotide or amino-acid) sequences, we want to:
- measure their similarity
- determine the correspondences between elements of the sequences
- observe patterns of conservation and variability of sequences over time

Definition of sequence alignment

- Given an alphabet A, a string is a finite sequence of letters from A.
  Example: GCTGAACG (DNA alphabet)
- Sequence alignment is the assignment of letter-letter correspondences between two or more strings from a given alphabet.

Changes in alphabets

- exchange of a single letter for another (point mutation)
- insertion of a single letter
- deletion of a single letter

Pairwise sequence alignment is the process of transforming one sequence into another by repeated application of these three operations on single letters.
Example alignment

Without gaps:

G C T G A A C G
C T A T A A T C

With gaps:

[Figure: two gapped alignments of GCTGAACG and CTATAATC]

Different alignment notations

[Figure: an alignment of GCTGAACG and CTATAATC shown in alternative notations]

Optimal alignment

To decide which alignment is the best of all possibilities, we need:
1. A way to systematically examine all possible alignments
2. A score for each possible alignment, which reflects the similarity of the two sequences

The optimal alignment is then the one with the highest score. Note that there may be more than one optimal alignment.

Dotplot

[Figure: dotplot of DOROTHYCROWFOOTHODGKIN (horizontal) against DOROTHYHODGKIN (vertical); a letter is printed in each cell where the two sequences match]
Dotplot

[Figure: dotplot for the sequence ABRACADABRACADABRA]

Dotplot

[Figure: dotplot of the amino acid sequence of the SLIT protein of Drosophila melanogaster (fruit fly)]

Web tool Dotlet: http://myhits.isb-sib.ch/cgi-bin/dotlet

Dotplot: filtering

- To avoid very short stretches, or many small gaps along stretches of matches, one may use the filtering parameters window and threshold.
- A dot appears in a cell of the dotplot if that cell is at the center of a stretch of characters of length window in which the number of matches is at least threshold.
- Another option is to give the cell a color (or grey value), such that the more matches there are in the window, the more intense the color becomes.

Dotplot with filtering

[Figure: two panels, w = 1, t = 1 and w = 11, t = 5]
Dotplot with window w and threshold t of the amino acid sequence of the protein pancreatic ribonuclease of the horse.
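The window/threshold filter can be sketched in a few lines. This is an illustrative reading of the rule above, not the Dotlet implementation: the window is taken along the diagonal through the cell, window positions that fall outside either sequence count as non-matches, and the function names `dotplot` and `render` are made up for this example.

```python
def dotplot(s: str, t: str, window: int = 1, threshold: int = 1) -> list[list[bool]]:
    """Return a len(s) x len(t) grid; grid[i][j] is True where a dot is drawn."""
    half = window // 2
    grid = [[False] * len(t) for _ in range(len(s))]
    for i in range(len(s)):
        for j in range(len(t)):
            matches = 0
            # Walk along the diagonal through (i, j), with the cell at the center.
            for k in range(-half, window - half):
                a, b = i + k, j + k
                if 0 <= a < len(s) and 0 <= b < len(t) and s[a] == t[b]:
                    matches += 1
            grid[i][j] = matches >= threshold
    return grid


def render(s: str, t: str, grid: list[list[bool]]) -> str:
    """Plain-text picture: s runs top to bottom, t runs left to right."""
    lines = ["  " + " ".join(t)]
    for letter, row in zip(s, grid):
        lines.append(letter + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


word = "ABRACADABRA"
print(render(word, word, dotplot(word, word)))
print()
print(render(word, word, dotplot(word, word, window=3, threshold=3)))
```

With window = 1 and threshold = 1 every single match is drawn; raising both suppresses isolated dots and leaves only the diagonal runs.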
Measures of sequence similarity

Functions that associate a numeric value with a pair of sequences:
1. similarity measure: a higher value means greater similarity
2. distance function: a larger distance means smaller similarity (a distance function is a dissimilarity measure)

Hamming distance

For two strings of equal length, the Hamming distance is the number of character positions in which they differ.

s : A G T C
t : C G T A
Hamming distance = 2

s : A G C A C A C A
t : A C A C A C T A
Hamming distance = 6

Disadvantage: a shift of just one position leads to a large Hamming distance.

Edit distance

- Distance can be based on the number of edit operations required to change one string into the other.
- Here an edit operation is a deletion, insertion or alteration of a single character in either sequence.

$(a, a)$   match (no change from s to t)
$(a, -)$   deletion of character a (in s); indicated by $-$ in t
$(a, b)$   replacement of a (in s) by b (in t), where $a \neq b$
$(-, b)$   insertion of character b (in s); indicated by $-$ in s

For the DNA alphabet: $a \in \{A, C, T, G\}$, $b \in \{A, C, T, G\}$

Alignments with gaps

Input:
s : A G C A C A C A
t : A C A C A C T A

Alignment 1:
s : A G C A C A C - A
t : A - C A C A C T A

Alignment 2:
[Figure: a second gapped alignment of s and t]
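The Hamming distance defined above is a position-by-position comparison. A minimal sketch, with the function name `hamming_distance` chosen for illustration, reproduces the two values given on the slide:

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("AGTC", "CGTA"))          # 2
print(hamming_distance("AGCACACA", "ACACACTA"))  # 6
```

The second pair illustrates the stated disadvantage: t is close to a one-position shift of s, yet six of the eight positions differ.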
Protocol of edit operations

Alignment 1:
s : A G C A C A C - A
t : A - C A C A C T A

Match (A, A)
Delete (G, -)
Match (C, C)
Match (A, A)
Match (C, C)
Match (A, A)
Match (C, C)
Insert (-, T)
Match (A, A)

Unit cost model

Assign a cost or weight w to each operation. For example:

match:               $w(a, a) = 0$
replacement:         $w(a, b) = 1$ for $a \neq b$
deletion/insertion:  $w(a, -) = w(-, b) = 1$

This scheme is known as the Levenshtein distance, also called the unit cost model.

Edit distance

Given a cost function w for single operations:
1. The cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations needed to transform s into t.
2. An optimal alignment of s and t is an alignment which has minimal cost among all possible alignments.
3. The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function w. We denote it by $d_w(s, t)$.

Edit distance: examples

Alignment 1: cost = 2
s : A G C A C A C - A
t : A - C A C A C T A

Alignment 2: cost = 4
[Figure: a second gapped alignment of s and t]

Alignment 1 is optimal under the unit cost model, so the edit distance is $d_w(s, t) = 2$.
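The cost of a given alignment can be checked mechanically. The sketch below assumes an alignment is written as two equal-length rows with `-` as the gap symbol (the rows for Alignment 1 are read off from its protocol of edit operations); the names `unit_cost`, `alignment_cost` and `protocol` are illustrative.

```python
GAP = "-"


def unit_cost(a: str, b: str) -> int:
    """Unit cost model: 0 for a match, 1 for a replacement, deletion or insertion."""
    return 0 if a == b else 1


def alignment_cost(row_s: str, row_t: str, w=unit_cost) -> int:
    """Sum the cost w over all columns of a gapped alignment."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == GAP and b == GAP:
            raise ValueError("a column cannot pair a gap with a gap")
        total += w(a, b)
    return total


def protocol(row_s: str, row_t: str) -> list[str]:
    """List the edit operations, column by column, that transform s into t."""
    ops = []
    for a, b in zip(row_s, row_t):
        if a == GAP:
            ops.append(f"Insert ({GAP}, {b})")
        elif b == GAP:
            ops.append(f"Delete ({a}, {GAP})")
        elif a == b:
            ops.append(f"Match ({a}, {b})")
        else:
            ops.append(f"Replace ({a}, {b})")
    return ops


print(alignment_cost("AGCACAC-A", "A-CACACTA"))  # 2
print(protocol("AGCACAC-A", "A-CACACTA"))
```

Passing a different function for `w` gives the cost under another scheme, for example one with a higher gap penalty.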
Scoring functions

- Some changes in nucleotide or amino acid sequences are more likely than others.
- So assign variable weights to different edit operations.
- This leads to the concept of scoring functions or substitution matrices.
- A substitution matrix is a square array of values which indicate the scores associated with possible transitions (replacements, insertions, deletions).
- One uses either similarity scores or dissimilarity scores (such as edit distance).

Similarity scoring schemes for DNA sequences

Percent Identity substitution matrix:

99% identity                50% identity

     A   T   G   C               A   T   G   C
A   +1  -3  -3  -3          A   +3  -2  -2  -2
T   -3  +1  -3  -3          T   -2  +3  -2  -2
G   -3  -3  +1  -3          G   -2  -2  +3  -2
C   -3  -3  -3  +1          C   -2  -2  -2  +3

Substitutions that are more likely get a higher similarity score or, equivalently, a smaller dissimilarity score.

Dotplots and sequence alignment

[Figure: dotplot of DOROTHYCROWFOOTHODGKIN (horizontal) against DOROTHYHODGKIN (vertical), with a path from upper left to lower right and the corresponding alignment of the two names]

Any path through this dotplot from upper left to lower right, moving at each cell only East, South or Southeast, corresponds to an alignment.

Optimal substructure property

Dynamic programming

[Figure: an optimal path from S to T through M (solid line) and an alternative path from S to M (dotted line)]

If M is a point on an optimal path $\pi[S \to T]$ (solid line), then $\pi[S \to M]$ and $\pi[M \to T]$ are also optimal paths. The cost of the dotted path from S to M cannot be smaller than the cost of the solid path from S to M.
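The correspondence between dotplot paths and alignments can be made concrete. In the sketch below a path is encoded as a string of moves, with `E` for East, `S` for South and `D` for the Southeast (diagonal) step. This encoding and the function name `path_to_alignment` are assumptions made for illustration, and the example path is one plausible choice rather than necessarily the one drawn in the figure.

```python
def path_to_alignment(horizontal: str, vertical: str, moves: str) -> tuple[str, str]:
    """Convert a path of E/S/D moves into the two rows of an alignment."""
    i = j = 0  # i counts consumed letters of vertical, j of horizontal
    top, bottom = [], []
    for move in moves:
        if move == "E":    # East: letter of the horizontal sequence against a gap
            top.append(horizontal[j])
            bottom.append("-")
            j += 1
        elif move == "S":  # South: letter of the vertical sequence against a gap
            top.append("-")
            bottom.append(vertical[i])
            i += 1
        elif move == "D":  # Southeast: one letter from each sequence
            top.append(horizontal[j])
            bottom.append(vertical[i])
            i += 1
            j += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(vertical) or j != len(horizontal):
        raise ValueError("the path must end in the lower-right corner")
    return "".join(top), "".join(bottom)


# Hypothetical path: 7 diagonal steps, 8 steps East, 7 diagonal steps.
top, bottom = path_to_alignment(
    "DOROTHYCROWFOOTHODGKIN", "DOROTHYHODGKIN", "D" * 7 + "E" * 8 + "D" * 7
)
print(top)     # DOROTHYCROWFOOTHODGKIN
print(bottom)  # DOROTHY--------HODGKIN
```

Every alignment column is produced by exactly one move, which is why finding the best alignment can be treated as finding the best path.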
Edit distance: Recursive computation

- Create a matrix D with elements [1] $D(i, j)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$, such that $D(i, j)$ is the minimal edit distance between the sequences that consist of the first i characters of s and the first j characters of t.
- Then $D(n, m)$ is the minimal edit distance between the full sequences s and t.
- For initialization, we need to add an extra row $D(0, j)$, $j = 0, 1, 2, \ldots, m$, and an extra column $D(i, 0)$, $i = 0, 1, 2, \ldots, n$, to the matrix:

$$
D = \begin{pmatrix}
D_{00} & D_{01} & \cdots & D_{0m} \\
D_{10} & D_{11} & \cdots & D_{1m} \\
\vdots & \vdots & \ddots & \vdots \\
D_{n0} & D_{n1} & \cdots & D_{nm}
\end{pmatrix}
$$

[1] i is the row index, running from top to bottom; j is the column index, running from left to right.

Steps in the dotplot matrix

Each step in the matrix which arrives in cell $(i, j)$ can be of three types:
- East (previous cell was $(i, j-1)$)
- South (previous cell was $(i-1, j)$)
- SouthEast (previous cell was $(i-1, j-1)$)

edit operation                 step in matrix               cost
substitution $a_i \to b_j$     $(i-1, j-1) \to (i, j)$      $w(a_i, b_j)$
deletion of $a_i$ from s       $(i-1, j) \to (i, j)$        $w(a_i, -)$
deletion of $b_j$ from t       $(i, j-1) \to (i, j)$        $w(-, b_j)$

Recursion

Three paths arrive at cell $(i, j)$: the optimal paths from $(0, 0)$ to $(i-1, j-1)$, $(i-1, j)$, or $(i, j-1)$, followed by:

step                         cost
$(i-1, j-1) \to (i, j)$      $D(i-1, j-1) + w(a_i, b_j)$
$(i-1, j) \to (i, j)$        $D(i-1, j) + w(a_i, -)$
$(i, j-1) \to (i, j)$        $D(i, j-1) + w(-, b_j)$

The minimum of these is the cost $D(i, j)$ of the optimal path from $(0, 0)$ to $(i, j)$.

Recursion:

$$D(i, j) = \min\{\, D(i-1, j-1) + w(a_i, b_j),\; D(i-1, j) + w(a_i, -),\; D(i, j-1) + w(-, b_j) \,\}$$
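The recursion can be written down almost literally as a memoized function. This is a minimal top-down sketch under two assumptions that are not fixed by the recursion itself: a constant gap cost is used for deletions and insertions, and the first row and column are set to the gap cost times the index. The names `edit_distance`, `mismatch` and `gap` are illustrative.

```python
from functools import lru_cache


def edit_distance(s: str, t: str, mismatch: int = 1, gap: int = 1) -> int:
    """D(n, m) computed top-down from the recursion, with memoization."""

    def w(a: str, b: str) -> int:
        return 0 if a == b else mismatch

    @lru_cache(maxsize=None)
    def D(i: int, j: int) -> int:
        if i == 0:
            return gap * j  # only insertions remain
        if j == 0:
            return gap * i  # only deletions remain
        return min(
            D(i - 1, j - 1) + w(s[i - 1], t[j - 1]),  # SouthEast step
            D(i - 1, j) + gap,                        # South step
            D(i, j - 1) + gap,                        # East step
        )

    return D(len(s), len(t))


print(edit_distance("AGCACACA", "ACACACTA"))  # 2
```

Memoization makes each cell $D(i, j)$ be evaluated once, so the work is proportional to n times m; for long sequences an explicit matrix avoids Python's recursion limit.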
Initialization

On the top row and left column of the matrix we have no North or West neighbours, respectively. So here we have to initialize values:

$$D(i, 0) = \sum_{k=1}^{i} w(a_k, -), \qquad D(0, j) = \sum_{k=1}^{j} w(-, b_k)$$

which impose the gap penalty on unmatched characters at the beginning of either sequence.

In practice one often uses a constant gap penalty: $w(a_k, -) = w(-, b_k) = g$

Retrieving the optimal path(s)

- For each cell $(i, j)$, store a pointer (an arrow) to the one of the three cells $(i-1, j-1)$, $(i-1, j)$ or $(i, j-1)$ that provided the minimal value. This cell is called the predecessor of $(i, j)$.
- If more than one cell provided the minimal value (remember that optimal paths need not be unique), we store a pointer to each of these cells.

Needleman-Wunsch algorithm

INPUT: two sequences s = a_1 a_2 ... a_n and t = b_1 b_2 ... b_m; cost function w with gap penalty g
OUTPUT: matrix D containing the minimal edit distance between the sequences s and t

for i = 0 to n do
    D(i, 0) ← g · i
end for
for j = 0 to m do
    D(0, j) ← g · j
end for
for i = 1 to n do
    for j = 1 to m do
        Match  ← D(i-1, j-1) + w(a_i, b_j)
        Delete ← D(i-1, j) + g
        Insert ← D(i, j-1) + g
        D(i, j) ← min(Match, Insert, Delete)
    end for
end for

Example

Alignment of sequences s = GGAATGG and t = ATG with scoring scheme:

$w(a, a) = 0$ (match)
$w(a, b) = 4$ for $a \neq b$ (mismatch)
$w(a, -) = w(-, b) = 5$ (gap insertion)
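The algorithm and the pointer bookkeeping can be combined in one short program. The following is a sketch under the constant gap penalty of the pseudocode, not a reference implementation: the names `needleman_wunsch` and `optimal_alignments`, and the representation of pointers as lists of predecessor cells, are choices made for this example. It is run on the example sequences with match 0, mismatch 4 and gap 5.

```python
def needleman_wunsch(s: str, t: str, w, g: int):
    """Fill the matrix D and record, for every cell, all optimal predecessors."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    pred = [[[] for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):  # left column: only deletions
        D[i][0] = g * i
        pred[i][0] = [(i - 1, 0)]
    for j in range(1, m + 1):  # top row: only insertions
        D[0][j] = g * j
        pred[0][j] = [(0, j - 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = {
                (i - 1, j - 1): D[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # Match
                (i - 1, j): D[i - 1][j] + g,                              # Delete
                (i, j - 1): D[i][j - 1] + g,                              # Insert
            }
            D[i][j] = min(candidates.values())
            # Keep a pointer to every cell that attains the minimum.
            pred[i][j] = [cell for cell, cost in candidates.items() if cost == D[i][j]]
    return D, pred


def optimal_alignments(s: str, t: str, pred) -> list[tuple[str, str]]:
    """Follow all pointers back from (n, m) to (0, 0)."""

    def walk(i: int, j: int):
        if i == 0 and j == 0:
            yield "", ""
            return
        for pi, pj in pred[i][j]:
            a = s[i - 1] if pi == i - 1 else "-"  # row index moved: letter of s
            b = t[j - 1] if pj == j - 1 else "-"  # column index moved: letter of t
            for top, bottom in walk(pi, pj):
                yield top + a, bottom + b

    return sorted(walk(len(s), len(t)))


def cost(a: str, b: str) -> int:
    return 0 if a == b else 4


D, pred = needleman_wunsch("GGAATGG", "ATG", cost, 5)
print(D[7][3])  # 20
for top, bottom in optimal_alignments("GGAATGG", "ATG", pred):
    print(top, bottom)
```

Storing every minimizing predecessor, rather than a single one, is what allows all optimal alignments to be enumerated instead of just one.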
Example (continued)

Matrix D(i, j) after initialization and the first diagonal step:

s \ t        A    T    G
        0    5   10   15
  G     5    4
  G    10
  A    15
  A    20
  T    25
  G    30
  G    35

NB: to be consistent with the definition of the matrix D(i, j), sequence s is plotted vertically, sequence t horizontally.

Example (continued)

Matrix after termination (pointer arrows not shown):

s \ t        A    T    G
        0    5   10   15
  G     5    4    9   10
  G    10    9    8    9
  A    15   10   13   12
  A    20   15   14   17
  T    25   20   15   18
  G    30   25   20   15
  G    35   30   25   20

Red arrows in the figure indicate the trace-back paths of the optimal alignments.

Example (continued)

There are two cells where the trace-back path branches, which gives four optimal alignments with equal score:

G G A A T G G        G G A A T G G
- - - A T G -        - - A - T G -

G G A A T G G        G G A A T G G
- - - A T - G        - - A - T - G

Sequence logos

Graphical display of a multiple alignment, with colored stacks of letters representing nucleotides or amino acids at successive positions. The height of a letter at a certain position increases with the frequency of that nucleotide or amino acid at that position.

[Figure: sequence logo of human exon-intron splice boundaries. © http://weblogo.berkeley.edu]
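The quantity a sequence logo is built from is the per-column letter frequency of a multiple alignment. The sketch below computes only these relative frequencies, which is what the description above refers to; tools such as WebLogo additionally scale each stack by the information content of the column. The function name `column_frequencies` is illustrative, and the four aligned sequences are made-up toy data, not real splice-site sequences.

```python
from collections import Counter


def column_frequencies(aligned: list[str]) -> list[dict[str, float]]:
    """Relative frequency of each letter in every column of an alignment."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for column in zip(*aligned):
        counts = Counter(column)
        # most_common() orders letters from most to least frequent,
        # i.e. from the tallest to the shortest letter in the stack.
        freqs.append({letter: n / len(column) for letter, n in counts.most_common()})
    return freqs


# Hypothetical toy alignment, for illustration only.
toy = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
for position, freq in enumerate(column_frequencies(toy), start=1):
    print(position, freq)
```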