Substitution matrices


Introduction to Bioinformatics
Substitution matrices
Jacques van Helden (Jacques.van-Helden@univ-amu.fr)
Université d'Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/
Former address (1999-2011): Université Libre de Bruxelles, Belgium, Bioinformatique des Génomes et des Réseaux (BiGRe lab), http://www.bigre.ulb.ac.be/

Substitution matrix
- A substitution matrix gives the score associated with each possible pair of aligned residues in an alignment.
- Each row and each column represents one of the possible residues (4 for DNA, 20 for proteins).
- The diagonal contains the scores for identities.
- The lower triangle contains the scores for substitutions. The upper triangle is symmetrical to the lower triangle and does not need to be displayed.
- Positive scores indicate that the aligned pair of residues is considered favourable for the alignment. Note that some mismatches can have positive scores in protein alignments (see later).
- Negative scores are penalties associated with mismatches; alignment algorithms will try to avoid aligning such pairs of residues.
- One could decide to give a lower cost to A-T substitutions if we assume that these are more likely to occur in our sequences.

Example: arbitrarily defined scores for DNA alignment
  match            2
  A-T mismatch    -1
  other mismatch  -2

The scoring scheme can be represented as a substitution matrix (top matrix on the slide):

    A   C   G   T
A   2
C  -2   2
G  -2  -2   2
T  -1  -2  -2   2

Bottom matrix on the slide: a substitution matrix for proteins (the values are those of the BLOSUM62 matrix presented later):

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

Scoring an alignment matrix with a substitution matrix
- Let us come back to our previous alignment matrix.
- For each cell of the alignment matrix, we compare the residues of sequences A and B, and take the score for this pair of residues from the substitution matrix.

Substitution matrix:

    A   C   G   T
A   2
C  -2   2
G  -2  -2   2
T  -1  -2  -2   2

Scored alignment matrix (one sequence along the columns, the other along the rows):

   A  A  T  C  T  T  C  A  G  C  G  T  A  T  T  G  C  T
A  2  2 -1 -2 -1 -1 -2  2 -2 -2 -2 -1  2 -1 -1 -2 -2 -1
T -1 -1  2 -2  2  2 -2 -1 -2 -2 -2  2 -1  2  2 -2 -2  2
C -2 -2 -2  2 -2 -2  2 -2 -2  2 -2 -2 -2 -2 -2 -2  2 -2
T -1 -1  2 -2  2  2 -2 -1 -2 -2 -2  2 -1  2  2 -2 -2  2
T -1 -1  2 -2  2  2 -2 -1 -2 -2 -2  2 -1  2  2 -2 -2  2
A  2  2 -1 -2 -1 -1 -2  2 -2 -2 -2 -1  2 -1 -1 -2 -2 -1
G -2 -2 -2 -2 -2 -2 -2 -2  2 -2  2 -2 -2 -2 -2  2 -2 -2
C -2 -2 -2  2 -2 -2  2 -2 -2  2 -2 -2 -2 -2 -2 -2  2 -2
C -2 -2 -2  2 -2 -2  2 -2 -2  2 -2 -2 -2 -2 -2 -2  2 -2
G -2 -2 -2 -2 -2 -2 -2 -2  2 -2  2 -2 -2 -2 -2  2 -2 -2
G -2 -2 -2 -2 -2 -2 -2 -2  2 -2  2 -2 -2 -2 -2  2 -2 -2
A  2  2 -1 -2 -1 -1 -2  2 -2 -2 -2 -1  2 -1 -1 -2 -2 -1
G -2 -2 -2 -2 -2 -2 -2 -2  2 -2  2 -2 -2 -2 -2  2 -2 -2
G -2 -2 -2 -2 -2 -2 -2 -2  2 -2  2 -2 -2 -2 -2  2 -2 -2
T -1 -1  2 -2  2  2 -2 -1 -2 -2 -2  2 -1  2  2 -2 -2  2
A  2  2 -1 -2 -1 -1 -2  2 -2 -2 -2 -1  2 -1 -1 -2 -2 -1
T -1 -1  2 -2  2  2 -2 -1 -2 -2 -2  2 -1  2  2 -2 -2  2
T -1 -1  2 -2  2  2 -2 -1 -2 -2 -2  2 -1  2  2 -2 -2  2
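The filling of the alignment matrix can be sketched in a few lines of Python. This is only an illustration of the DNA scoring scheme shown on the slide (match 2, A-T mismatch -1, other mismatch -2); the function names `dna_score` and `score_matrix` are hypothetical, and which sequence goes on the rows or on the columns is an assumption read off the slide.

```python
def dna_score(a, b):
    """Score one pair of nucleotides with the slide's arbitrary scheme."""
    if a == b:
        return 2              # identity
    if {a, b} == {"A", "T"}:
        return -1             # A-T mismatch, given a lower cost
    return -2                 # any other mismatch


def score_matrix(seq_rows, seq_cols):
    """Return one row of scores per residue of seq_rows."""
    return [[dna_score(r, c) for c in seq_cols] for r in seq_rows]


# Sequences as they appear along the columns and rows of the slide (assumed orientation).
seq_cols = "AATCTTCAGCGTATTGCT"
seq_rows = "ATCTTAGCCGGAGGTATT"

matrix = score_matrix(seq_rows, seq_cols)
for residue, row in zip(seq_rows, matrix):
    print(residue, " ".join(f"{s:2d}" for s in row))
```

Because the scheme is symmetric, swapping the two sequences simply transposes the matrix.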

Substitution counts in 71 groups of aligned proteins (Dayhoff, 1978)
(On the original slide, illegible figures are shown in red; "1?" below marks a value that is only partly legible.)

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A
R    30
N   109   17
D   154    0  532
C    33   10    0    0
Q    93  120   50   76    0
E   266    0   94  831    0  422
G   579   10  156  162   10   30  112
H    21  103  226   43   10  243   23   10
I    66   30   36   31   17    8   35    0    3
L    95   17   37    0    0   75   15   17   40  253
K    57  423  322   85    0  147  104   60   23   43   39
M    29   12    0    0    0   20    7    7    0   57  207   90
F    20    7    7    0    0    0    0   1?   20   90  167    0   17
P   345   67   20   10   10   93   40   49   50    7   43   43    4    7
S   772  137  432   89  117   47   86  450   26   20   32  168   20   40  269
T   590   20  169   57   10   37   31   50   14  129   52  200   28   10   73  696
W     0   27    3    0    0    0    0    0    3    0   13    0    0   10    0   17    0
Y    20    3   36    0   30    0   10    0   40   13   23   10    0  260    0   22   23    6
V   365   20   13   17   33   27   37   97   30  661  303   17   77   10   50   43   18    0   17

Residue classes:
  Hydrophobic  A C G I L M P V
  Aromatic     H F W Y
  Polar        N Q S T Y
  Basic        R H K
  Acidic       D E

Substitution matrices for proteins

$$ s_{i,j} = s_{j,i} = \log_2 \left( \frac{f_{i,j}}{f_i \, f_j} \right) $$

      C     S     T     P     A     G   ...
C  11.5
S   0.1   2.2
T  -0.5   1.5   2.5
P  -3.1   0.4   0.1   7.6
A   0.5   1.1   0.6   0.3   2.4
G  -2.0   0.4  -1.1  -1.6   0.5   1.6
...

- Margaret Dayhoff (1978) measured the rate of substitutions between each pair of amino acids in a collection of aligned proteins.
- Scores are calculated as log-odds: f_{i,j} is the observed frequency of the aligned pair (i, j), and f_i f_j is the frequency expected by chance from the individual residue frequencies.
- Positive values reflect frequent ("accepted") substitutions, i.e. substitutions that occur more frequently than expected by chance.
- Negative values reflect rare ("unfavourable") substitutions, i.e. substitutions that occur less frequently than expected by chance.
- The diagonal reflects residue conservation.

Reference: Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345-352. National Biomedical Research Foundation, Silver Spring, MD.
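As a small worked illustration of the log-odds formula, the sketch below computes s = log2(f_ij / (f_i f_j)) for made-up frequencies. The numbers are hypothetical and chosen only to show the sign of the score; they are not taken from Dayhoff's data, and the function name `log_odds` is an assumption.

```python
import math


def log_odds(f_ij, f_i, f_j):
    """Log-odds score (in bits) of an aligned residue pair.

    f_ij: observed frequency of the pair in the alignments
    f_i, f_j: background frequencies of each residue
    """
    return math.log2(f_ij / (f_i * f_j))


# Hypothetical background frequencies for two residues.
f_i, f_j = 0.1, 0.1          # expected pair frequency by chance: 0.01

# Pair observed twice as often as expected by chance: positive score.
print(round(log_odds(0.02, f_i, f_j), 3))

# Pair observed half as often as expected by chance: negative score.
print(round(log_odds(0.005, f_i, f_j), 3))
```

A pair observed exactly as often as chance predicts gets a score of 0, which is why the sign of the score separates "accepted" from "unfavourable" substitutions.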

PAM scoring matrices
- The alignments used by Dayhoff had ~85% identity.
- However, the frequencies of substitutions are expected to depend on the divergence between sequences: the number of substitutions increases with time.
- To take the divergence into account, Margaret Dayhoff calculated a series of scoring matrices, each reflecting a certain level of divergence:
    PAM001  rates of substitution between amino-acid pairs expected for proteins with an average of 1% substitution per position
    PAM050  rates of substitution between amino-acid pairs expected for proteins with an average of 50% substitution per position
    PAM250  250% mutations per position (note: a position can mutate several times)
- The substitution matrix must thus be chosen according to the relatedness of the sequences to be aligned.

Reference: Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345-352. National Biomedical Research Foundation, Silver Spring, MD.

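Extrapolation of the PAM series from PAM001
Two vectors of one-step probabilities are taken from the PAM001 matrix M: column 3, M_{i,3}, and row 17, M_{17,j}. They link the two residues of interest to every possible residue X.

  M_{i,3}:   0.0009 0.0001 0.9822 0.0042 0.0000 0.0004 ... 0.0013 0.0000 0.0003 0.0001
  M_{17,j}:  0.0022 0.0002 0.0013 0.0004 0.0001 0.0003 ... 0.9871 0.0000 0.0002 0.0009

The probability of a substitution over two PAM steps is the sum, over all possible intermediate residues X, of the products of the two one-step probabilities (a and b stand for the two residues, which are shown as symbols on the original slide):

$$ P(a \to b) = \sum_{X} P(a \to X \to b) = \sum_{X} P(a \to X)\, P(X \to b) $$

  = (0.0009)(0.0001) + (0.0001)(0.0002) + ... + (0.0001)(0.009)

Repeating this product, i.e. taking successive powers of the PAM001 matrix, gives the other matrices of the series.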
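The extrapolation amounts to raising the one-step probability matrix to a power. The sketch below illustrates this with a toy 3-residue matrix whose values are invented for the example (they are not Dayhoff's PAM001 values). It assumes the convention row = original residue, column = replacement residue, with each row summing to 1; the helper names `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply m by itself `power` times (power >= 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy one-step matrix for a 3-letter alphabet (hypothetical values).
# toy_pam1[i][j] = P(residue i is replaced by residue j in one step).
toy_pam1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.01, 0.98],
]

toy_pam2 = mat_pow(toy_pam1, 2)      # two steps: sum over intermediate residues
toy_pam250 = mat_pow(toy_pam1, 250)  # 250 steps

# Two-step probability of going from residue 0 to residue 1.
print(toy_pam2[0][1])
# After many steps the diagonal is much less dominant.
print([round(p, 3) for p in toy_pam250[0]])
```

The entry `toy_pam2[0][1]` is exactly the sum of products written on the slide: one term for each possible intermediate residue.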

PAM250 matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -1   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

Residue classes:
  Hydrophobic  C P A G M I L V
  Aromatic     H F Y W
  Polar        S T N Q Y
  Basic        H R K
  Acidic       D E

Hinton diagram of the PAM250 matrix
- Yellow boxes indicate positive values (accepted mutations).
- Red boxes indicate negative values (avoided mutations).
- The area of each box is proportional to the absolute value of the log-odds score.

BLOSUM scoring matrices
- Henikoff and Henikoff (1992) analyzed substitution rates on the basis of aligned regions (blocks).
- They calculated scoring matrices from blocks with different percentages of protein divergence. Examples:
    BLOSUM62  calculated from blocks with ~62% identity (sequences sharing at least 62% identity are clustered and counted as one)
    BLOSUM80  calculated from blocks with ~80% identity
- When these substitution matrices are used to score sequence alignments, one should always choose the matrix appropriate to the expected percentage of similarity.

Reference: Henikoff, S. and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. PNAS 89:10915-10919.

BLOSUM30
(B = Asx, Z = Glx, X = unknown, * = end)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   4
R  -1   8
N   0  -2   8
D   0  -1   1   9
C  -3  -2  -1  -3  17
Q   1   3  -1  -1  -2   8
E   0  -1  -1   1   1   2   6
G   0  -2   0  -1  -4  -2  -2   8
H  -2  -1  -1  -2  -5   0   0  -3  14
I   0  -3   0  -4  -2  -2  -3  -1  -2   6
L  -1  -2  -2  -1   0  -2  -1  -2  -1   2   4
K   0   1   0   0  -3   0   2  -1  -2  -2  -2   4
M   1   0   0  -3  -2  -1  -1  -2   2   1   2   2   6
F  -2  -1  -1  -5  -3  -3  -4  -3  -3   0   2  -1  -2  10
P  -1  -1  -3  -1  -3   0   1  -1   1  -3  -3   1  -4  -4  11
S   1  -1   0   0  -2  -1   0   0  -1  -1  -2   0  -2  -1  -1   4
T   1  -3   1  -1  -2   0  -2  -2  -2   0   0  -1   0  -2   0   2   5
W  -5   0  -7  -4  -2  -1  -1   1  -5  -3  -2  -2  -3   1  -3  -3  -5  20
Y  -4   0  -4  -1  -6  -1  -2  -3   0  -1   3  -1  -1   3  -2  -2  -1   5   9
V   1  -1  -2  -2  -2  -3  -3  -3  -3   4   1  -2   0   1  -4  -1   1  -3   1   5
B   0  -2   4   5  -2  -1   0   0  -2  -2  -1   0  -2  -3  -2   0   0  -5  -3  -2   5
Z   0   0  -1   0   0   4   5  -2   0  -3  -1   1  -1  -4   0  -1  -1  -1  -2  -3   0   4
X   0  -1   0  -1  -2   0  -1  -1  -1   0   0   0   0  -1  -1   0   0  -2  -1   0  -1   0  -1
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Residue classes:
  Hydrophobic  A C G I L M P V
  Aromatic     H F W Y
  Polar        N Q S T Y
  Basic        R H K
  Acidic       D E

BLOSUM62
(B = Asx, Z = Glx, X = unknown, * = end)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
B  -2  -1   3   4  -3   0   1  -1   0  -3  -4   0  -3  -3  -2   0  -1  -4  -3  -3   4
Z  -1   0   0   1  -3   3   4  -2   0  -3  -3   1  -1  -3  -1   0  -1  -3  -2  -2   1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1  -1  -1
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Residue classes:
  Hydrophobic  A C G I L M P V
  Aromatic     H F W Y
  Polar        N Q S T Y
  Basic        R H K
  Acidic       D E

BLOSUM80
(B = Asx, Z = Glx, X = unknown, * = end)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   5
R  -2   6
N  -2  -1   6
D  -2  -2   1   6
C  -1  -4  -3  -4   9
Q  -1   1   0  -1  -4   6
E  -1  -1  -1   1  -5   2   6
G   0  -3  -1  -2  -4  -2  -3   6
H  -2   0   0  -2  -4   1   0  -3   8
I  -2  -3  -4  -4  -2  -3  -4  -5  -4   5
L  -2  -3  -4  -5  -2  -3  -4  -4  -3   1   4
K  -1   2   0  -1  -4   1   1  -2  -1  -3  -3   5
M  -1  -2  -3  -4  -2   0  -2  -4  -2   1   2  -2   6
F  -3  -4  -4  -4  -3  -4  -4  -4  -2  -1   0  -4   0   6
P  -1  -2  -3  -2  -4  -2  -2  -3  -3  -4  -3  -1  -3  -4   8
S   1  -1   0  -1  -2   0   0  -1  -1  -3  -3  -1  -2  -3  -1   5
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -2  -1  -1  -2  -2   1   5
W  -3  -4  -4  -6  -3  -3  -4  -4  -3  -3  -2  -4  -2   0  -5  -4  -4  11
Y  -2  -3  -3  -4  -3  -2  -3  -4   2  -2  -2  -3  -2   3  -4  -2  -2   2   7
V   0  -3  -4  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -2   4
B  -2  -2   4   4  -4   0   1  -1  -1  -4  -4  -1  -3  -4  -2   0  -1  -5  -3  -4   4
Z  -1   0   0   1  -4   3   4  -3   0  -4  -3   1  -2  -4  -2   0  -1  -4  -3  -3   0   4
X  -1  -1  -1  -2  -3  -1  -1  -2  -2  -2  -2  -1  -1  -2  -2  -1  -1  -3  -2  -1  -2  -1  -1
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Residue classes:
  Hydrophobic  A C G I L M P V
  Aromatic     H F W Y
  Polar        N Q S T Y
  Basic        R H K
  Acidic       D E

BLOSUM62: amino acid properties
The residues of the BLOSUM62 matrix are grouped into the following classes:
  Hydrophobic  A C G I L M P V
  Aromatic     H F W Y
  Polar        N Q S T Y
  Basic        R H K
  Acidic       D E

BLOSUM62 - substitutions between acidic residues
Scores for pairs of acidic residues (D, E), extracted from the BLOSUM62 matrix:

    D   E
D   6
E   2   5

BLOSUM62 - substitutions between basic residues
Scores for pairs of basic residues (R, H, K), extracted from the BLOSUM62 matrix:

    R   H   K
R   5
H   0   8
K   2  -1   5

BLOSUM62 - substitutions between aromatic residues
Scores for pairs of aromatic residues (H, F, W, Y), extracted from the BLOSUM62 matrix:

    H   F   W   Y
H   8
F  -1   6
W  -2   1  11
Y   2   3   2   7

BLOSUM62 - substitutions between polar residues
Scores for pairs of polar residues (N, Q, S, T, Y), extracted from the BLOSUM62 matrix:

    N   Q   S   T   Y
N   6
Q   0   5
S   1   0   4
T   0  -1   1   5
Y  -2  -1  -2  -2   7

BLOSUM62 - substitutions between hydrophobic residues
Scores for pairs of hydrophobic residues (A, C, G, I, L, M, P, V), extracted from the BLOSUM62 matrix:

    A   C   G   I   L   M   P   V
A   4
C   0   9
G   0  -3   6
I  -1  -1  -4   4
L  -1  -1  -4   2   4
M  -1  -1  -3   1   2   5
P  -1  -3  -2  -3  -3  -2   7
V   0  -1  -3   3   1   1  -2   4

Substitution matrices - summary
Different substitution scoring matrices have been established:
- Residue categories (Phylip).
- PAM (Dayhoff, 1978). PAM stands for Point Accepted Mutation (also expanded as Percent Accepted Mutation).
- BLOSUM (Henikoff & Henikoff, 1992). BLOSUM stands for BLOcks SUbstitution Matrix.
Substitution matrices make it possible to detect similarities between more distant proteins than would be detected from residue identity alone.
The matrix must be chosen carefully, depending on the expected level of conservation between the sequences to be aligned.
Beware:
- With PAM matrices, the number in the matrix name indicates the percentage of substitutions per position: higher numbers are appropriate for more distant proteins.
- With BLOSUM matrices, the number in the matrix name indicates the percentage of conservation: higher numbers are appropriate for more conserved proteins.

Scoring an alignment with a substitution matrix
- The substitution matrix (here, the BLOSUM62 values) can be used to assign a score to a pairwise alignment.
- The score of the alignment is the sum, over all the aligned positions (i from 1 to L), of the scores of the pairs of residues r_{1,i} and r_{2,i}:

$$ S = \sum_{i=1}^{L} s_{r_{1,i},\, r_{2,i}} $$

- Gaps are penalized with two parameters, which enter the sum as negative scores:
    Gap opening (go) penalty: typical values between -10 and -15
    Gap extension (ge) penalty: typical values between -0.5 and -2

Example (go = -10, ge = -1):

i        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
seq1     R   L   A   S   V   E   T   D   M   P   -   -   -   -   -   L   T   L   R   Q   H
seq2     T   L   T   S   L   Q   T   T   L   K   N   L   K   E   M   A   H   L   G   T   H
gap                                             go  ge  ge  ge  ge
score   -1  +4  +0  +4  +1  +2  +5  -1  +2  -1 -10  -1  -1  -1  -1  -1  -2  +4  -2  -1  +8

S = 7
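The example can be reproduced with a short Python sketch. It assumes, as on the slide, that the first gap position costs the gap opening penalty (go = -10) and each following gap position costs the extension penalty (ge = -1). Only the BLOSUM62 entries needed for this alignment are copied into the dictionary; the names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are hypothetical.

```python
# BLOSUM62 scores for the residue pairs that occur in the example alignment.
BLOSUM62_SUBSET = {
    ("R", "T"): -1, ("L", "L"): 4, ("A", "T"): 0, ("S", "S"): 4,
    ("V", "L"): 1, ("E", "Q"): 2, ("T", "T"): 5, ("D", "T"): -1,
    ("M", "L"): 2, ("P", "K"): -1, ("L", "A"): -1, ("T", "H"): -2,
    ("R", "G"): -2, ("Q", "T"): -1, ("H", "H"): 8,
}


def pair_score(a, b, table=BLOSUM62_SUBSET):
    """Look up a symmetric score: (a, b) and (b, a) are equivalent."""
    return table[(a, b)] if (a, b) in table else table[(b, a)]


def alignment_score(seq1, seq2, gap_open=-10, gap_extend=-1):
    """Sum pair scores over the aligned positions, with gap penalties.

    Simplification: consecutive gap columns count as one gap,
    whichever sequence carries the '-' character.
    """
    total = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


seq1 = "RLASVETDMP-----LTLRQH"
seq2 = "TLTSLQTTLKNLKEMAHLGTH"
print(alignment_score(seq1, seq2))  # 7, as on the slide
```

With a full substitution matrix in place of the subset, the same function scores any pair of aligned sequences; changing `gap_open` and `gap_extend` shows how strongly the gap parameters affect the total.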

Bibliography
Substitution matrices
- PAM series:
  Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, 345-352.
- BLOSUM substitution matrices:
  Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
- Gonnet matrices, built by an iterative procedure:
  Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science 256, 1443-5.