Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformatics
Sequence Alignment: Pairwise Sequence Alignment
16/3/29 & 23/3/29 27/4/29

Outline
  Introduction
  What is pairwise sequence alignment?
  Scoring Sequence Alignments
  Pairwise Sequence Alignment Algorithms
    Needleman-Wunsch Algorithm (global alignment)
    Smith-Waterman Algorithm (local alignment)
    Extension to repeated matches (multiple local alignments)
  Heuristic Pairwise Sequence Alignment Algorithms
    BLAST
    FASTA

Introduction
Advances in molecular biology allow increasingly rapid sequencing of genomes, leading to exponential growth in GenBank.
François Jacob (1977) [Evolution and tinkering, Science 196:1161-1166]
  "Nature is a tinkerer and not an inventor."
Eric Wieschaus (1995) [Associated Press, 9 October 1995]
  "We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans."

Introduction
New sequences are adapted from pre-existing sequences rather than invented de novo.
Sequence similarity is an indicator of homology.
Sequence similarity has several other uses:
  Database queries
  Comparative genomics
  ...

What is Pairwise Sequence Alignment?
The problem of deciding whether a pair of sequences is evolutionarily related.
Two biological sequences are similar when the strings that represent them are similar.
Over time, sequences accumulate insertions, deletions, and substitutions.

What is Pairwise Sequence Alignment?
Distance Between DNA Sequences
Hamming distance is not typically used to compare DNA or protein sequences, because it is only defined for strings of equal length.
Levenshtein distance allows one to compare strings of different lengths.
Edit distance
Definition: The edit distance between two strings is the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. Matches are not counted.

What is Pairwise Sequence Alignment?
String Alignment
The concept of an alignment is crucial.
Global Alignment
Definition: A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space (dash) in either string is opposite a unique character or a unique space (dash) in the other string.
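The edit distance defined above can be computed with a standard dynamic-programming recurrence. The Python sketch below is illustrative only: the function name `edit_distance` is an assumption, and every operation is given unit cost, as in the definition.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] is the distance between the previous prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            substitution = prev[j - 1] + (0 if a == b else 1)  # matches cost nothing
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))  # 1 (delete the C)
```

Only two rows of the table are kept in memory, which is enough when the distance itself is needed and not the alignment that achieves it.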

What is Pairwise Sequence Alignment?
Gaps
Gaps help create alignments that better conform to underlying biological models.
Mechanisms that make long insertions or deletions in DNA include:
  unequal crossing-over in meiosis;
  DNA slippage during replication;
  insertion of transposable elements into a DNA string;
  insertion of DNA by retroviruses;
  etc.
Definition: A gap is any maximal, consecutive run of spaces (or dashes) in a single string of a given alignment.

What is Pairwise Sequence Alignment?
Example
S1 = WEAGAWGHEE
S2 = PAWHEAE

  WEAGAWGHE-E
  P-A--W-HEAE

  WEAGAWGHE-E
  --P-AW-HEAE

Each column is a match, a mismatch, or a gap.
More than one alignment is possible! Which one is better? Is it a true or a spurious alignment?
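To make the match, mismatch, and gap terminology concrete, the two example alignments can be summarized with a small Python sketch. The function name `describe_alignment` and the returned keys are assumptions made for illustration; a gap is counted as a maximal run of dashes within one row, following the definition above.

```python
from itertools import groupby


def describe_alignment(row1: str, row2: str) -> dict:
    """Count matches, mismatches, spaces and gaps in a two-row alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = spaces = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            spaces += 1          # a residue opposite a dash
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    # A gap is a maximal run of dashes within a single row.
    gaps = sum(
        1
        for row in (row1, row2)
        for is_dash, _ in groupby(row, key=lambda c: c == "-")
        if is_dash
    )
    return {"matches": matches, "mismatches": mismatches, "spaces": spaces, "gaps": gaps}


print(describe_alignment("WEAGAWGHE-E", "P-A--W-HEAE"))
print(describe_alignment("WEAGAWGHE-E", "--P-AW-HEAE"))
```

Both alignments have the same counts (5 matches, 1 mismatch, 5 spaces in 4 gaps), so counting alone cannot rank them; a scoring scheme is needed.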

How to Score an Alignment?
Find the best alignment between two strings under some scoring scheme.
Use a scoring model that quantifies evolutionary preferences.
  Substitution matrices (matches and mismatches): a set of values quantifying the likelihood of one residue being substituted by another in an alignment.
  Gap penalty: the cost of initiating a gap.
  Gap extension penalty: the cost of extending a gap.

The Scoring Model
The total score will be a sum of terms for each aligned pair of residues, plus terms for each gap.
Identities and conservative substitutions are expected to be more likely in real alignments than by chance, so they contribute positive score terms.
Non-conservative changes are expected to be observed less frequently in real alignments than by chance, so they contribute negative score terms.

The Scoring Model
The score assigned to an alignment is computed using this function:

$$S = \sum_i s\big(s_1(i), s_2(i)\big) + G(g)$$

where
  s(s_1(i), s_2(i)) is the score for each aligned pair of residues, given by a scoring matrix, and
  G(g) are the gap penalties, given a priori.
Scores s(.,.) and gap penalties G(g) can be computed using different models (scoring matrices, probabilistic models, ...).

Example Alignment Scores

       A    E    H    P    W
  A    5   -1   -2   -1   -3
  E   -1    6    0   -1   -3
  G    0   -3   -2   -2   -3
  H   -2    0   10   -2   -3
  W   -3   -3   -3   -4   15

Gap penalty: -8
Gap extension penalty: -8

  WEAGAWGHE-E
  --P-AW-HEAE

(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: What is the score of the following alignment?

  WEAGAWGHE-E
  P-A--W-HEAE
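The column-by-column sum in the first example can be reproduced with a short Python sketch. The names `PAIR_SCORES`, `pair_score`, and `alignment_score` are hypothetical. Only the pairs that occur in the two example alignments are listed, and the values for H/H (10) and A/P (-1) are assumed to be the BLOSUM50 entries, which is the choice consistent with the total of 1 reported for the first alignment.

```python
# Pair scores needed for the example alignments (assumed BLOSUM50 entries).
PAIR_SCORES = {
    ("A", "A"): 5, ("E", "E"): 6, ("H", "H"): 10, ("W", "W"): 15,
    ("A", "P"): -1, ("P", "W"): -4,
}
GAP = -8  # linear gap model: opening and extending a gap both cost -8


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    return PAIR_SCORES[tuple(sorted((a, b)))]


def alignment_score(row1: str, row2: str) -> int:
    """Sum substitution scores and gap penalties column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        total += GAP if "-" in (a, b) else pair_score(a, b)
    return total


print(alignment_score("WEAGAWGHE-E", "--P-AW-HEAE"))  # 1
```

Because the gap opening and extension penalties are equal here, each dash simply costs -8. A scheme with distinct opening and extension costs would need to track whether the previous column was already part of the same gap.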

Example Alignment Scores
Solution to the exercise:

  WEAGAWGHE-E
  P-A--W-HEAE

(-4) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = -2

Scoring Matrices
Family of matrices listing the likelihood of change from one residue to another during evolution.
Amino acid substitution matrices
  PAM (Point Accepted Mutation)
  BLOSUM (BLOcks of Amino Acid SUbstitution Matrix)
DNA substitution matrices
  DNA is less conserved than protein sequences.
  It is less effective to compare coding regions at the nucleotide level.

DNA Substitution Matrices
Scoring matrices for nucleotide sequences are relatively simple.
A positive value or a high score is given for a match, and a negative value or a low positive score is given for a mismatch.
This assignment is based on the assumption that the frequencies of mutation are equal for all bases.
However, this assumption may not be realistic!
Observations show that transitions (substitutions between two purines, A<->G, or between two pyrimidines, C<->T) occur more frequently than transversions (substitutions between a purine and a pyrimidine, e.g. A<->C or G<->T).
Therefore, a more sophisticated statistical model, with different probability values for the two types of mutation, is needed!
Several nucleotide substitution models exist (example: the Kimura model).

Amino acid substitution matrices
PAM Matrices (Dayhoff, 1978)
Encode and summarize expected evolutionary change at the amino acid level.
Each matrix is designed to be used to compare pairs of sequences that are a specific number of PAM units diverged.
1 PAM unit indicates the probability of 1 point mutation per 100 residues.
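For the DNA case described above, a scoring scheme that separates the two kinds of mismatch can be sketched in Python. It treats A<->G and C<->T as transitions and any purine/pyrimidine change as a transversion. The score values and all names are hypothetical, chosen only to show the structure; they are not taken from the Kimura model or any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Hypothetical score values, for illustration only.
MATCH, TRANSITION, TRANSVERSION = 1, -1, -2


def nucleotide_score(a: str, b: str) -> int:
    """Score a pair of bases, penalising transversions more than transitions."""
    if a == b:
        return MATCH
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return TRANSITION if same_class else TRANSVERSION


bases = "ACGT"
matrix = {(a, b): nucleotide_score(a, b) for a in bases for b in bases}
print(matrix[("A", "G")], matrix[("A", "C")])  # -1 -2
```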

Amino acid substitution matrices
After 100 PAMs of evolution, not every residue will have changed:
  Some residues may have mutated several times.
  Some residues may have returned to their original state.
  Some residues may not have changed at all.
The construction of PAM matrices started with hypothetical phylogenetic trees relating the sequences in 71 families, where each pair of sequences differed by no more than 15% of their residues.
For each amino acid pair, A_i and A_j, count the number of times that A_i aligns opposite A_j, and divide that number by the total number of pairs in all the aligned data.

PAM Matrices
Let F(i,j) denote the resulting frequency.
Let F_i and F_j be the frequencies with which amino acids A_i and A_j appear in the sequences.
The (i,j) entry for the ideal PAMn matrix is:

$$\log \frac{F(i,j)}{F_i \, F_j}$$
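The log-odds entry above can be illustrated on toy data. The Python sketch below is not Dayhoff's full procedure: it assumes the aligned residue pairs are already available as a list, counts each pair in both orientations so the result is symmetric, estimates F_i from the same columns, and uses base-10 logarithms. The function name and the two-letter alphabet are made up for the example.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Return {(a, b): log10(F(a,b) / (F_a * F_b))} from aligned residue pairs."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count each column in both orientations so the result is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Residue frequencies F_a, taken here from the same aligned columns.
    residue_freq = Counter()
    for (a, _), n in pair_counts.items():
        residue_freq[a] += n / total

    return {
        (a, b): math.log10((n / total) / (residue_freq[a] * residue_freq[b]))
        for (a, b), n in pair_counts.items()
    }


# Toy data (made up): identities are common, A/B substitutions are rare.
pairs = [("A", "A")] * 4 + [("B", "B")] * 4 + [("A", "B")] * 2
scores = log_odds_matrix(pairs)
print(round(scores[("A", "A")], 3), round(scores[("A", "B")], 3))  # 0.204 -0.398
```

Pairs seen more often than chance predicts get a positive entry and pairs seen less often get a negative one, which matches the sign convention of the scoring model.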

Amino acid substitution matrices

  Evolutionary distance (PAM):   1   11   23   38   56   80   120   159   250
  Observed difference (%):       1   10   20   30   40   50    60    70    80

Most widely used PAM matrix: PAM250

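Amino acid substitution matrices
BLOSUM Matrices (Henikoff, 1992)
Substitution matrices derived using probabilistic models.
Matrices derived from a much larger dataset: the BLOCKS database of protein families.
Sequences are clustered whenever their percentage of identical residues exceeds some level L%.
BLOSUM50 and BLOSUM62 are widely used.
BLOSUM observes significantly more replacements than PAM, even for infrequent pairs.

BLOSUM50

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R   -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N   -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D   -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C   -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q   -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E   -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G    0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H   -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I   -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L   -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K   -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M   -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F   -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P   -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S    1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W   -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y   -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V    0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5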

Amino acid substitution matrices
PAM Matrices vs BLOSUM Matrices
The PAM model is designed to track the evolutionary origin of proteins.
The BLOSUM model is designed to find conserved domains of proteins.
Rules of thumb
  Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
  Higher PAMs and lower BLOSUMs find longer, weaker local alignments.