PAM-1 Matrix × 10,000

To \ From   Ala   Arg   Asn   Asp   Cys   Gln   Glu
Ala        9867     2     9    10     3     8    17
Arg           1  9913     1     0     1    10     0
Asn           4     1  9822    36     0     4     6
Asp           6     0    42  9859     0     6    53
Cys           1     1     0     0  9973     0     0
Gln           3     9     4     5     0  9876    27
Glu          10     0     7    56     0    35  9865

PAM1 is the expectation after approximately 1% of the sequence has been substituted.
PAM2 is calculated as PAM1 × PAM1.
PAMx is calculated as PAM(x-1) × PAM1.
PAM250 is generally used for distant comparisons. It corresponds to 2.5 differences per site (about 20% identity).
NOTE: These measure divergence, not time.
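The recurrence above is repeated matrix multiplication, which can be sketched as follows. This is only an illustration: `TOY_PAM1` is a hypothetical 3-letter matrix (not Dayhoff data), and it assumes the same layout as the tables here, with columns as "From", rows as "To", and each column summing to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return PAMn, built as PAM(n-1) x PAM1 starting from PAM1 (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical 3-letter "PAM1": columns are From, rows are To,
# and each column sums to 1 (about 1% of each residue is substituted).
TOY_PAM1 = [
    [0.990, 0.004, 0.003],
    [0.006, 0.990, 0.007],
    [0.004, 0.006, 0.990],
]

if __name__ == "__main__":
    toy_pam250 = pam_n(TOY_PAM1, 250)
    for row in toy_pam250:
        print(["%.3f" % value for value in row])
```

After 250 steps the diagonal of the toy matrix is far below 0.99, which mirrors how the real PAM-250 diagonal is much smaller than the PAM-1 diagonal.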

PAM-250 Matrix × 100

To \ From   Ala   Arg   Asn   Asp   Cys   Gln   Glu
Ala          13     6     9     9     5     8     9
Arg           3    17     4     3     2     5     3
Asn           4     4     6     7     2     5     6
Asp           5     4     8    11     1     7    10
Cys           2     1     1     1    52     1     1
Gln           3     5     5     6     1    10     7
Glu           5     4     7    11     1     9    12

PAM scoring matrix
The scoring values are generally shown as a symmetric log odds ratio matrix.
Odds (for those who do not gamble) are $p / (1 - p)$, where $p$ is the probability of an event and $1 - p$ is the probability of some other event.
For example, if p = 0.5 then the odds are 50/50 or 1 to 1 (0.5/0.5 = 1), while if p = 0.75 then the odds are 3 to 1 (0.75/0.25 = 3).
The odds ratio is the ratio of the odds for and against.

Generally the odds are presented as log values. For PAM matrices it is generally log10 that is used, so each integer value represents an order of magnitude.
For example, if p = 0.08, the odds are 0.08/0.92 = 0.087 (about 11.5 to 1 against) and the log odds are log10(0.087) = -1.06, while if p = 0.996, the odds are 0.996/0.004 = 249 (an order of magnitude more lopsided, and in the opposite direction) and the log odds are log10(249) = +2.40.
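The odds and log odds arithmetic in these examples can be checked with a few lines of Python. This is a small sketch; the function names are illustrative only.

```python
import math


def odds(p):
    """Odds in favour of an event with probability p, i.e. p / (1 - p)."""
    return p / (1.0 - p)


def log10_odds(p):
    """Base-10 log odds; each unit is one order of magnitude."""
    return math.log10(odds(p))


if __name__ == "__main__":
    for p in (0.5, 0.75, 0.08, 0.996):
        print("p = %.3f  odds = %.3f  log10 odds = %+.2f"
              % (p, odds(p), log10_odds(p)))
```

A probability below 0.5 gives odds below 1 and therefore a negative log odds, which is why the sign matters when reading a scoring matrix.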

For a PAM scoring matrix

$$S_{ij} = \log \frac{p_i M_{ij}}{p_i p_j} = \log \frac{M_{ij}}{p_j} = \log \frac{\text{observed frequency}}{\text{expected frequency}}$$

This matrix will be symmetric.
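As a rough illustration of why the log odds matrix comes out symmetric, the sketch below uses a hypothetical 2-letter alphabet. It assumes that `M[i][j]` is the probability that residue i is replaced by residue j (each row sums to 1) and that the toy numbers satisfy p_i M_ij = p_j M_ji. The values are invented for the example and are not Dayhoff data.

```python
import math


def log_odds_matrix(M, p):
    """S[i][j] = log10(M[i][j] / p[j]), i.e. observed over expected."""
    n = len(p)
    return [[math.log10(M[i][j] / p[j]) for j in range(n)] for i in range(n)]


# Hypothetical frequencies and substitution probabilities.
# Chosen so that p[0] * M[0][1] == p[1] * M[1][0] (0.6 * 0.2 == 0.4 * 0.3).
p = [0.6, 0.4]
M = [
    [0.8, 0.2],
    [0.3, 0.7],
]

S = log_odds_matrix(M, p)

if __name__ == "__main__":
    for row in S:
        print(["%+.3f" % value for value in row])
```

In this toy case both off-diagonal entries equal log10(0.5), so the matrix is symmetric even though M itself is not.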

[Table: PAM250 log odds scoring matrix, lower triangle.]
Values multiplied by 10.

A log odds of zero implies the two amino acids are found across from each other in an alignment as often as expected by chance (given their mutabilities and frequencies of occurrence).
A log odds greater than zero implies the two amino acids are found across from each other more often than expected by chance.
A log odds less than zero implies the two amino acids are found across from each other less often than expected by chance.

Two uses for PAM matrices:
Scoring matrix: PAM250 (very distant), PAM160 (distant), PAM70 (less distant), PAM30 (more similar), etc.
Transition matrix: PAM1

PAM - strange (?) patterns
Lots of interesting properties.
Many exchanges between certain amino acids.
Far more double codon substitutions than expected.
Fewer of some single codon substitutions.

PAM - scoring an amino acid alignment
Consider an alignment of two four-residue peptides, Seq1 and Seq2, whose PAM250 scores at the four sites are 12, 5, 2 and -3.
Total score is 12 + 5 + 2 - 3 = 16.
The chance of getting an alignment this good by chance is given by the odds. Normally one would multiply the odds at each site (assuming independence), but since logs have been taken we can add the log odds.
Because the matrix values are multiplied by 10, a score of 16 is a log10 odds of 1.6, which corresponds to odds of 39.8.
So this is an unusual similarity between these two peptides despite their short length (in large part due to rare cysteines across from each other).
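The scoring arithmetic for this example can be sketched in Python. The per-site scores are the four values quoted above. The division by 10 assumes the scores come from a matrix whose log10 odds were multiplied by 10.

```python
def alignment_odds(site_scores, scale=10):
    """Sum per-site log odds scores and convert the total back to odds.

    `scale` is the factor the log10 odds were multiplied by in the matrix.
    """
    total = sum(site_scores)
    log10_odds = total / scale
    return total, log10_odds, 10 ** log10_odds


if __name__ == "__main__":
    total, log10_odds, odds = alignment_odds([12, 5, 2, -3])
    print("total score = %d, log10 odds = %.1f, odds = %.1f"
          % (total, log10_odds, odds))
```

Adding the log odds is equivalent to multiplying the per-site odds, which is only valid under the independence assumption stated above.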

The PAM matrix was computed on globular proteins and may therefore not be a good representation of the substitution matrix for membrane or other non-globular proteins.
It assumes that all sites are equally mutable (but not all residues).
Only a limited number of proteins were available in comparison to the huge numbers today.

The JTT matrix (Jones, Taylor, Thornton 1992) was an update of the PAM matrix.
It is mostly used as a transition matrix rather than as a scoring matrix (for the latter purpose PAM250 still seems the method of choice).

BLOSUM matrix: BLOcks SUbstitution Matrix
Based on the analysis of conserved protein regions from the BLOCKS database.
More reliable than the PAM matrix for distantly related proteins.
Default for BLAST searches.
Used in many other programs.

BLOSUM matrix
1. Find the frequency of occurrence of one amino acid: $p_i = q_{ii} + \sum_{j \ne i} q_{ij} / 2$
2. Expected frequencies: $e_{ij} = p_i^2$ if $i = j$, and $e_{ij} = 2 p_i p_j$ if $i \ne j$
3. Score: $s_{ij} = 2 \log_2 (q_{ij} / e_{ij})$

The matrix consists of the scores $s_{ij} = 2 \log_2 (q_{ij} / e_{ij})$.
If the observed number of differences between a pair of amino acids is equal to the expected number then $s_{ij} = 0$.
If the observed is less than expected then $s_{ij} < 0$.
If the observed is greater than expected then $s_{ij} > 0$.
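The three steps and the sign rule can be tried on a toy example. The sketch below assumes a hypothetical 2-letter alphabet and invented pair frequencies `toy_q` (keyed by unordered pairs and summing to 1). It is not derived from the BLOCKS data.

```python
import math


def pair_key(a, b):
    """Canonical key for an unordered residue pair."""
    return (a, b) if a <= b else (b, a)


def blosum_scores(q, alphabet):
    """Return (p, s): residue frequencies and scores 2*log2(q_ij / e_ij)."""
    # Step 1: p_i = q_ii + sum over j != i of q_ij / 2
    p = {a: q[pair_key(a, a)]
            + sum(q[pair_key(a, b)] for b in alphabet if b != a) / 2
         for a in alphabet}
    s = {}
    for i, a in enumerate(alphabet):
        for b in alphabet[i:]:
            # Step 2: expected frequency of the unordered pair
            e = p[a] ** 2 if a == b else 2 * p[a] * p[b]
            # Step 3: score in half-bit units
            s[pair_key(a, b)] = 2 * math.log2(q[pair_key(a, b)] / e)
    return p, s


# Hypothetical observed pair frequencies (sum to 1).
toy_q = {("A", "A"): 0.5, ("A", "B"): 0.2, ("B", "B"): 0.3}

if __name__ == "__main__":
    p, s = blosum_scores(toy_q, "AB")
    print(p)
    print({pair: round(score, 2) for pair, score in s.items()})
```

In the toy data the A-B pair is observed less often than expected (0.2 against 0.48), so its score is negative, while both identical pairs score above zero.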

BLOSUM matrix
[Table: lower-left triangular matrix of log odds scores.]
The lower left gives the log odds matrix (BLOSUM62).

The BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, but more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches, than a PAM matrix of a similar level.

This is a BLOSUM62 matrix. It is roughly equivalent to a PAM160 matrix.
The levels come from weighting different entries. In this case all proteins within 62% identity sum to a weight of 1.

BLAST's recommendations

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (11,1)

Empirical measures still seem to work best despite many advances.
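The recommendation table can be expressed as a small lookup, sketched below. Two points are assumptions rather than part of the table: a query of exactly 50 residues is assigned to the 50-85 row, and the gap cost pairs are simply passed through as tuples without interpreting them. The function name is illustrative only.

```python
def recommend(query_length):
    """Return (substitution matrix, gap costs) for a query length.

    Boundary handling is an assumption: length 50 falls in the 50-85 row,
    and length 85 stays in that row because the last row is strictly > 85.
    """
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length < 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


if __name__ == "__main__":
    for length in (20, 40, 60, 300):
        print(length, recommend(length))
```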

GONNET matrix
Uses classical distance measures to produce protein alignments.
Given the alignments, it computes a new distance matrix.
Align again using the new distance matrix.
Repeat this process many times.
In addition, they computed empirical measures for gap penalties. They suggest, for the probability $P$ of a gap of length $k$,

$$10 \ln(P) = -36.31 + 7.44 \ln(\text{PAM distance}) - 14.93 \ln(k)$$

If a distance is not available,

$$10 \ln(P) = -20.63 - 1.65 \ln(k - 1)$$

GONNET matrix
[Table: lower-left triangular matrix of log odds scores.]
The log odds matrix is lower left. It is 10 times the log of (probability that these amino acids are aligned / probability of a chance alignment).

Specialized matrices
Some matrices also incorporate additional information - one such matrix includes information about protein structure and can be used with very distantly related sequences.
Other matrices are specific for different types of proteins - SLIM (ScoreMatrix Leading to Intra-Membrane) and PHAT (Predicted Hydrophobic and Transmembrane matrix) are designed from/for membrane proteins (not soluble proteins).
As of 2006, 94 matrices in GenomeNet.