EECS 730 Introduction to Bioinformatics Sequence Alignment Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/

Administrative
Term project
- Work in a group of 2 students.
- Think about it and ask around.
- I will post a choice of projects online soon.
- You will need to turn in a report and give a presentation at the end of the semester.
- More details to come.
2012/9/4 EECS 730

Total number of alignments
Align seq1: ATAAGC and seq2: AAAACG. Let L be the alignment length. For example, L=7:
    ATAAGC-
    AAAA-CG
We know that L is at least 6 and at most 12:
    L=6:   ATAAGC
           AAAACG
    L=12:  ATAAGC------
           ------AAAACG
The total number of possible alignments is 8989, and the total number of non-redundant alignments is 924.
A more realistic example: two protein sequences of 300 amino acids each have about 10^88 possible alignments.
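The two counts on this slide can be reproduced with a short Python sketch. It rests on an assumption that is not stated on the slide: that "possible alignments" counts every way of building columns from three choices (residue pair, gap in seq1, gap in seq2), which gives the Delannoy number D(6,6), and that "non-redundant alignments" corresponds to the binomial coefficient C(12,6). The function names are hypothetical.

```python
from math import comb

def count_alignments(m: int, n: int) -> int:
    """Delannoy number D(m, n): alignments built from three column types
    (residue/residue, residue/gap, gap/residue). Assumed interpretation."""
    return sum(comb(m, k) * comb(n, k) * 2 ** k for k in range(min(m, n) + 1))

def count_nonredundant(m: int, n: int) -> int:
    """Binomial C(m + n, n). Assumed interpretation of 'non-redundant'."""
    return comb(m + n, n)

print(count_alignments(6, 6))    # 8989
print(count_nonredundant(6, 6))  # 924
```

Both values grow very quickly with sequence length, which is why brute-force enumeration is not an option.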

Now what?
When two sequences are aligned, there is an enormous number of possible alignments. Examining all possible alignments of two sequences by brute force could take forever.
Dynamic programming is a very general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that:
- The initial stage contains trivial solutions to sub-problems.
- Each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage.
- The final stage contains the overall solution.

Scoring matrix
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
R    3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
N    4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
D    5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
C    2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Q    3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
E    5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
G   12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
H    2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
I    3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
L    6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
K    6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
M    1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
F    2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
P    7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
S    9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
T    8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
W    0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Y    1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
V    7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17
Top: original amino acid; Side: replacement amino acid. P(i, j) is multiplied by 100.

Dynamic Programming (DP)
Dynamic programming usually consists of three components:
- Recursive relation
- Tabular computation
- Traceback
This efficient recursive method is used to search through all possible alignments and find the one with the optimal score.

How is DP used for alignment?
An alignment is represented as a path through a matrix. DP is used to search through all possible paths in the matrix and find the optimal path.

Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)

Intuition of Dynamic Programming
If we already have the optimal solution to:
    XY
    AB
then we know the next pair of characters will be one of:
    XYZ      XY-      XYZ
    ABC  or  ABC  or  AB-
(where - indicates a gap). So we can extend the match by determining which of these has the highest score.

Sequence alignment with DP
Align sequences x and y. F is the DP matrix; s is the substitution matrix; \gamma(n) is the gap penalty, written here as the score (zero or negative) of a gap of length n.
\[ F(0,0) = 0 \]
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(k,j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i,k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases} \]

Two types of sequence alignment: global and local
Global alignment: the entire sequence of each protein or DNA sequence is contained in the alignment.
Local alignment: only regions of greatest similarity between two sequences are aligned.
percent identity: ~26%
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K++ +  + +GTW++MA      + L   + A   V  T  +         +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Global vs. local alignment
Global alignments are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
Global methods are useful when you want to force two sequences to align over their entire length.
Local alignment is almost always used for database searches such as BLAST. It is useful for finding domains (or limited regions of homology) within sequences.

Local alignment DP
Align sequences x and y. F is the DP matrix; s is the substitution matrix; \gamma(n) is the score (zero or negative) of a gap of length n. For a linear gap penalty d, \gamma(n) = -nd.
\[ F(0,0) = 0 \]
\[ F(i,j) = \max \begin{cases} 0, \\ F(i-1,j-1) + s(x_i, y_j), \\ F(k,j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i,k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases} \]

Algorithms for sequence alignment
- The global alignment algorithm of Needleman and Wunsch (1970).
- The local alignment algorithm of Smith and Waterman (1981).

Global alignment with the algorithm of Needleman and Wunsch (1970)
Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn.
Find the optimal subpaths, and add them up to achieve the best score. This involves:
- adding gaps when needed
- allowing for conservative substitutions
- choosing a scoring system (simple or complicated)
N-W is guaranteed to find optimal alignment(s).

Gap Penalties
Above, the gap penalty function can take any form. With \gamma(n) denoting the score (zero or negative) of a gap of length n:
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(k,j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i,k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases} \]
Below, a simple gap penalty (-d for each gap position) is used to speed up the alignment algorithm:
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases} \]

Three steps to global alignment with the Needleman-Wunsch algorithm
[1] set up a matrix
[2] score the matrix
[3] identify the optimal alignment(s)
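The three steps can be sketched in Python as follows. This is a minimal illustration, not the lecture's reference implementation: it assumes the simple linear gap penalty (-d per gap position) and a toy match/mismatch score in place of a PAM or BLOSUM matrix. The function name and the default parameter values are hypothetical.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty d (toy scoring, assumed)."""
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    # [1] Set up the matrix; the first row and column hold pure-gap prefixes.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    # [2] Score the matrix with the three-way recurrence.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # [3] Trace back from the bottom-right corner to recover one optimal path.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, top, bottom = needleman_wunsch("ATAAGC", "AAAACG")
print(score)
print(top)
print(bottom)
```

When several paths tie, the traceback above prefers the diagonal move, so it reports only one of the optimal alignments.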

Start Needleman-Wunsch with an identity matrix
    sequence 1  ABCNJ-RQCLCR-PM
    sequence 2  AJC-JNR-CKCRBP-

    sequence 1  ABC-NJRQCLCR-PM
    sequence 2  AJCJN-R-CKCRBP-

Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments.
It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

Local alignment
Often, one wants to align subsequences of two larger sequences. Examples:
- database search
- locating common domains in proteins (e.g., transmembrane proteins)
- comparing extended sections of genomic DNA sequence
- detecting similarity between highly diverged sequences (only certain parts of the sequence are conserved enough for alignment; the rest may have accumulated too much noise)
Local alignments are generally not subsets of global alignments!

Local alignment
Two major differences with respect to global alignment:
- No score is negative.
- Traceback begins at the highest score in the matrix and continues until you reach 0.
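The two differences can be seen in a minimal Python sketch of local alignment. As an assumption made for brevity, it uses a linear gap penalty d and a toy match/mismatch score instead of a real substitution matrix. The function name and default values are hypothetical.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty d (toy scoring, assumed)."""
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Difference 1: the extra 0 keeps every cell non-negative.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Difference 2: trace back from the highest score until a 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("XXACGTXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
```

The flanking residues that do not match are simply left out of the reported alignment, which is the behaviour that makes local alignment suitable for finding shared domains.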

Recap: Dynamic Programming
- Great for doing pairwise global alignments.
- Produces a quantitative alignment score.
- Problems arise if one tries to align very large sequences (the memory requirement grows as N^2, or as N x M).
- Serious problems arise if one tries to align one sequence against a database (tens of hours).
Need an alternative.

Parameters of Sequence Alignment
Scoring systems: each symbol pairing is assigned a numerical value, based on a symbol comparison table.
Gap penalties:
- Opening: the cost to introduce a gap
- Extension: the cost to elongate a gap

Substitution Matrix
A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids.
Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
The two major types of substitution matrices are PAM and BLOSUM.

PAM matrices: Point-accepted mutations
PAM matrices are based on global alignments of closely related proteins.
PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
Other PAM matrices are extrapolated from PAM1.
All the PAM data come from closely related proteins (>85% amino acid identity).

Dayhoff's 34 protein superfamilies
Protein           PAMs per 100 million years
Ig kappa chain    37
Kappa casein      33
Lactalbumin       27
Hemoglobin        12
Myoglobin         8.9
Insulin           4.4
Histone H4        0.10
Ubiquitin         0.00

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Dayhoff's numbers of "Percent Accepted Mutation": what amino acid substitutions occur in proteins?
          A     R     N     D     C     Q     E     G
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
A
R        30
N       109    17
D       154     0   532
C        33    10     0     0
Q        93   120    50    76     0
E       266     0    94   831     0   422
G       579    10   156   162    10    30   112
H        21   103   226    43    10   243    23    10

Normalized frequencies of amino acids
Gly  8.9%    Arg  4.1%
Ala  8.7%    Asn  4.0%
Leu  8.5%    Phe  4.0%
Lys  8.1%    Gln  3.8%
Ser  7.0%    Ile  3.7%
Val  6.5%    His  3.4%
Thr  5.8%    Cys  3.3%
Pro  5.1%    Tyr  3.0%
Glu  5.0%    Met  1.5%
Asp  4.7%    Trp  1.0%
Leu, Ser, and Arg are each encoded by 6 codons; Met and Trp are each encoded by 1 codon.

Dayhoff's PAM1 mutation probability matrix
          A     R     N     D     C     Q     E     G     H     I
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly   His   Ile
A      9867     2     9    10     3     8    17    21     2     6
R         1  9913     1     0     1    10     0     0    10     3
N         4     1  9822    36     0     4     6     6    21     3
D         6     0    42  9859     0     6    53     6     4     1
C         1     1     0     0  9973     0     0     0     1     1
Q         3     9     4     5     0  9876    27     1    23     1
E        10     0     7    56     0    35  9865     4     2     3
G        21     1    12    11     1     3     7  9935     1     0
H         1     8    18     3     1    20     1     0  9912     0
I         2     2     3     1     2     1     2     0     0  9872
Columns: original amino acid. P(i, j) is multiplied by 10,000.

Dayhoff's PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins
PAM0      A     R     N     D     C     Q     E     G
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
A      100%    0%    0%    0%    0%    0%    0%    0%
R        0%  100%    0%    0%    0%    0%    0%    0%
N        0%    0%  100%    0%    0%    0%    0%    0%
D        0%    0%    0%  100%    0%    0%    0%    0%
C        0%    0%    0%    0%  100%    0%    0%    0%
Q        0%    0%    0%    0%    0%  100%    0%    0%
E        0%    0%    0%    0%    0%    0%  100%    0%
G        0%    0%    0%    0%    0%    0%    0%  100%
Top: original amino acid; Side: replacement amino acid

Dayhoff's PAM2000 mutation probability matrix: the rules for very distantly related proteins
          A     R     N     D     C     Q     E     G
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
A      8.7%  8.7%  8.7%  8.7%  8.7%  8.7%  8.7%  8.7%
R      4.1%  4.1%  4.1%  4.1%  4.1%  4.1%  4.1%  4.1%
N      4.0%  4.0%  4.0%  4.0%  4.0%  4.0%  4.0%  4.0%
D      4.7%  4.7%  4.7%  4.7%  4.7%  4.7%  4.7%  4.7%
C      3.3%  3.3%  3.3%  3.3%  3.3%  3.3%  3.3%  3.3%
Q      3.8%  3.8%  3.8%  3.8%  3.8%  3.8%  3.8%  3.8%
E      5.0%  5.0%  5.0%  5.0%  5.0%  5.0%  5.0%  5.0%
G      8.9%  8.9%  8.9%  8.9%  8.9%  8.9%  8.9%  8.9%
Top: original amino acid; Side: replacement amino acid
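The two limiting cases, PAM0 and PAM2000, follow from extrapolating PAM1 by repeated matrix multiplication. The Python sketch below shows this on a made-up two-letter alphabet rather than the real 20 x 20 PAM1 matrix. The matrix `TOY_PAM1` and the helper names are hypothetical, chosen only so the arithmetic is easy to follow. Columns are the original residue and rows the replacement, as on the slides, so each column sums to 1.

```python
def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def extrapolate(pam1, n):
    """Return pam1 raised to the n-th power (n = 0 gives the identity)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, pam1)
    return result

# Hypothetical 2-letter "PAM1": column = original residue, row = replacement.
TOY_PAM1 = [[0.99, 0.02],
            [0.01, 0.98]]

print(extrapolate(TOY_PAM1, 0))     # identity: nothing has changed yet
print(extrapolate(TOY_PAM1, 2000))  # every column approaches (2/3, 1/3)
```

At zero steps the matrix is the identity, as in PAM0. After many steps every column converges to the same background distribution, so each row holds one repeated value, which is the pattern in the PAM2000 table.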

PAM250 mutation probability matrix
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
R    3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
N    4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
D    5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
C    2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Q    3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
E    5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
G   12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
H    2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
I    3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
L    6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
K    6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
M    1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
F    2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
P    7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
S    9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
T    8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
W    0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Y    1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
V    7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17
Top: original amino acid; Side: replacement amino acid. P(i, j) is multiplied by 100.

Why do we go from a mutation probability matrix to a log odds matrix?
We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.
Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix?
The cells in a log odds matrix consist of an "odds ratio":
    (the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a, b is given by:
\[ S(a,b) = 10 \log_{10} (M_{ab} / p_b) \]
As an example, for tryptophan aligned with tryptophan, take M(trp,trp) = 0.55 (the PAM250 value) and p(trp) = 0.010:
\[ S(\mathrm{trp},\mathrm{trp}) = 10 \log_{10} (0.55 / 0.010) = 17.4 \]

What do the numbers mean in a log odds matrix?
\[ S(\mathrm{trp},\mathrm{trp}) = 10 \log_{10} (0.55 / 0.010) = 17.4 \]
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two trp residues.
With S(a,b) = 17 and the odds ratio M_{ab} / p_b = x:
\[ 17 = 10 \log_{10} x \quad\Rightarrow\quad 1.7 = \log_{10} x \quad\Rightarrow\quad x = 10^{1.7} \approx 50 \]

What do the numbers mean in a log odds matrix?
A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.
A score of 0 is neutral.
A score of -10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
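The arithmetic on the log odds slides can be checked with a few lines of Python. The function names are hypothetical. The inputs are the values quoted on the slides (0.55 and 0.010 for tryptophan, and the scores +17, +2, and -10).

```python
import math

def log_odds_score(m_ab: float, p_b: float) -> float:
    """S(a, b) = 10 * log10(M_ab / p_b)."""
    return 10 * math.log10(m_ab / p_b)

def odds_ratio(score: float) -> float:
    """Invert the score: how many times more likely than chance."""
    return 10 ** (score / 10)

print(round(log_odds_score(0.55, 0.010), 1))  # 17.4
print(round(odds_ratio(17)))                  # 50
print(round(odds_ratio(2), 1))                # 1.6
print(round(odds_ratio(-10), 3))              # 0.1
```

Because the scores are logarithms, the score of a whole alignment is the sum of its per-column scores, which corresponds to multiplying the underlying odds ratios.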

PAM250 log odds scoring matrix
A    2
R   -2   6
N    0   0   2
D    0  -1   2   4
C   -2  -4  -4  -5  12
Q    0   1   1   2  -5   4
E    0  -1   1   3  -5   2   4
G    1  -3   0   1  -3  -1   0   5
H   -1   2   2   1  -3   3   1  -2   6
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L   -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F   -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P    1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

PAM250 vs. PAM10
SIM alignment program: http://www.expasy.ch/tools/sim-prot.html
Try it yourself by aligning RBP4 (Q28369) and bovine β-lactoglobulin (P02754).

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.
hsrbp, 136 CRLLNLDGTC
btlact,  3 CLLLALALTC
           * ** *  **
A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                *     ****  *        * *   *   *               **  *
rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
            **        *  ** **

BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOCKS database: >500 groups of local multiple alignments of distantly related protein sequences, focused on conserved regions.
BLOSUM stands for BLOcks SUbstitution Matrix.
BLOSUM62 is a matrix calculated from comparisons of sequences with less than 62% identity.

BLOSUM Matrices
[Figure: percent amino acid identity scale (100, 62, 30) for BLOSUM62]

BLOSUM Matrices
[Figure: percent amino acid identity scales (100, 62, 30) for BLOSUM80, BLOSUM62, and BLOSUM30]

BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

BLOSUM62 scoring matrix
A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5
G    0  -2   0  -1  -3  -2  -2   6
H   -2   0   1  -1  -3   0   0  -2   8
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K   -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M   -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Summary of PAM & BLOSUM
[Figure: PAM and BLOSUM matrices suited to two comparisons: rat versus mouse RBP; rat versus bacterial lipocalin]

Tips on choosing a scoring matrix
Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, use higher PAM or lower BLOSUM matrices.
For database searching, the commonly used matrix is BLOSUM62.

Why Gap Penalties?
The optimal alignment of two similar sequences is usually the one that maximizes the number of matches and minimizes the number of gaps.
There is a tradeoff between these two: adding gaps reduces mismatches.
Permitting the insertion of arbitrarily many gaps can lead to high-scoring alignments of nonhomologous sequences.
Penalizing gaps forces alignments to have relatively few gaps.

Gap Penalties
How to balance gaps with mismatches?
Gaps must get a steep penalty, or else you'll end up with nonsense alignments.
In real sequences, multi-base (or multi-amino-acid) gaps are quite common: they arise from genetic insertion/deletion events.
Affine gap penalties give a big penalty for each new gap, but a much smaller gap extension penalty.

Gap penalties
If too high a gap penalty is used relative to the range of scores in the substitution matrix, gaps will never appear in the alignment. Conversely, if the gap penalty is too low, gaps will appear everywhere.
One gap of length k is more probable than k gaps of length 1 (a single mutational event vs. several separate mutational events):
    PSKGK-----GRSW
    P-S-K-GK-GR-SW

Affine Gap Penalty
The scoring scheme should penalize new gaps more than gap extensions.
Affine means that the penalty for a gap is computed as a linear function of its length x.
Total gap score:
\[ w_x = g + r(x - 1) \]
w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length.
Example: g = -12; r = -4.
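A short Python sketch of the affine formula, using the example values g = -12 and r = -4 from the slide. The function name is hypothetical. It compares one gap of length 5 with five separate gaps of length 1.

```python
def affine_gap_score(x: int, g: int = -12, r: int = -4) -> int:
    """w_x = g + r * (x - 1) for a gap of length x >= 1; 0 for no gap."""
    if x <= 0:
        return 0
    return g + r * (x - 1)

one_long_gap = affine_gap_score(5)         # -12 + 4 * (-4) = -28
five_short_gaps = 5 * affine_gap_score(1)  # 5 * (-12) = -60
print(one_long_gap, five_short_gaps)       # -28 -60
```

Under these values, a single five-residue gap costs far less than five scattered one-residue gaps, in line with the view that one long insertion or deletion event is more plausible than several independent ones.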

Significance of alignments
Percent identity rule of thumb: if two proteins share 25% amino acid identity over a span of 150 amino acids, they are probably significantly related.
What if two proteins share only limited amino acid identity? Use statistical tests to decide whether the matches are true positives or false positives.
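Percent identity can be computed directly from a gapped alignment. The Python sketch below makes one assumption that the slides do not fix: the denominator is the number of columns in which neither sequence has a gap (other conventions divide by the full alignment length). The function name is hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped columns in which the two aligned residues agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(p, q) for p, q in zip(a, b) if p != "-" and q != "-"]
    if not columns:
        return 0.0
    matches = sum(p == q for p, q in columns)
    return 100.0 * matches / len(columns)

# The L=7 example alignment of ATAAGC and AAAACG.
print(percent_identity("ATAAGC-", "AAAA-CG"))  # 80.0
```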

Randomization test: scramble a sequence
First compare two proteins and obtain a score:
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K++ +  + +GTW++MA      + L   + A   V  T  +         +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Next scramble the bottom sequence 100 times, and obtain 100 randomized scores (+/- standard deviation). Composition and length are maintained.
If the comparison is "real" we expect the authentic score to be several standard deviations above the mean of the randomized scores.

A randomization test shows that RBP is significantly related to β-lactoglobulin
[Figure: histogram of number of instances versus quality score]
100 random shuffles: mean score = 8.4, std. dev. = 4.5. Real comparison: score = 37.
But this test assumes a normal distribution of scores!

You can perform this randomization test and obtain a Z score:
\[ Z = \frac{S_{\mathrm{real}} - \bar{X}_{\mathrm{randomized}}}{\text{standard deviation}} \]
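The shuffle-and-score procedure and the Z score can be sketched in Python. The scoring function here is a deliberately simple stand-in (a count of identical positions) rather than a real alignment score. All function names, the fixed random seed, and that toy score are assumptions made for illustration. The last line applies the Z formula to the numbers reported for the RBP comparison (real score 37, mean 8.4, standard deviation 4.5).

```python
import random

def identity_score(a: str, b: str) -> int:
    """Toy score (assumed): number of positions at which the strings agree."""
    return sum(p == q for p, q in zip(a, b))

def randomized_scores(x, y, score_fn, n_shuffles=100, seed=0):
    """Shuffle y repeatedly (composition and length are kept) and score each."""
    rng = random.Random(seed)
    letters = list(y)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(score_fn(x, "".join(letters)))
    return scores

def z_score(real: float, mean: float, sd: float) -> float:
    """Z = (S_real - mean of randomized scores) / standard deviation."""
    return (real - mean) / sd

print(round(z_score(37, 8.4, 4.5), 2))  # 6.36
```

In practice `score_fn` would be an alignment routine, and the mean and standard deviation would be computed from the list returned by `randomized_scores` (for example with `statistics.mean` and `statistics.stdev`).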

Summary
Sequence alignment algorithms:
- Substitution matrices (represent the probability of mutations): PAM, BLOSUM
- Gap penalty: gap opening penalty, gap extension penalty
- Global alignment aligns the entire sequence of each protein or DNA sequence: the Needleman-Wunsch global alignment algorithm.
- Local alignment finds the best match between subsequences: the Smith-Waterman local alignment algorithm.

Acknowledgment
Many of the images and slides in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright 2003 by John Wiley & Sons, Inc.