Exercise 5 Sequence Profiles & BLAST 1
Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2
Substitution Matrix (BLOSUM62)
Only one score per amino acid pair.
Is this true for every amino acid in every protein, i.e. are they independent of each other?
Can we make a substitution matrix specific to a single protein? For every position of it?
Why: better alignments!?

>p1 SEAN
        A    C    D    E
  S     1   -1    0    0
  E    -1   -4    1    5
  A     2   -2    0    4
  N     0   -2    1    0
3
Position-Specific Scoring Matrix (PSSM)
What do we need?
1. Information about substitutions occurring in the protein
2. Differentiation between favored and unfavored substitutions
3. Transformation into positive and negative scores
4
Multiple Sequence Alignments MSAs contain information about evolution of proteins Figure taken from https://en.wikipedia.org/wiki/multiple_sequence_alignment 5
Position-Specific Scoring Matrix (PSSM)
Define favored substitutions as those observed more often than expected, unfavored as those observed less often than expected.
What DO we expect?
  There are 20 amino acids, so P = 0.05, right?
  Are there better estimates?
What to do?
  Count the amino acids occurring in each MSA column
  Normalize to relative frequencies
  Divide by expected (background) frequencies
6
Position-Specific Scoring Matrix (PSSM)
  SE-AN
  SE-ES
  SEVEN
  SE-AS

Count observed amino acids (one row per MSA column)
  col     A     E     N     S     V   other
  1       0     0     0     4     0     0
  2       0     4     0     0     0     0
  3       0     0     0     0     1     0
  4       2     2     0     0     0     0
  5       0     0     2     2     0     0

Normalize (divide by row sum)
  col     A     E     N     S     V   other
  1       0     0     0    4/4    0     0
  2       0    4/4    0     0     0     0
  3       0     0     0     0    1/1    0
  4      2/4   2/4    0     0     0     0
  5       0     0    2/4   2/4    0     0

Divide by background frequencies (P = 0.05)
  col     A     E     N     S     V   other
  1       0     0     0    20     0     0
  2       0    20     0     0     0     0
  3       0     0     0     0    20     0
  4      10    10     0     0     0     0
  5       0     0    10    10     0     0
7
Position-Specific Scoring Matrix (PSSM)
Not quite there yet! The transformation into positive/negative scores is still missing.
Use the logarithm:
  log(x) > 0 for x > 1
  log(x) < 0 for x < 1

$$S_{i,j} = 2 \log_2 \frac{f_{i,j}}{P_j}$$

where
  $S_{i,j}$ is the score for amino acid $j$ in MSA column $i$
  $f_{i,j}$ is the relative frequency of amino acid $j$ in MSA column $i$
  $P_j$ is the expected (background) frequency of amino acid $j$
8
Position-Specific Scoring Matrix (PSSM)
  SE-AN
  SE-ES
  SEVEN
  SE-AS

$2 \log_2(x)$
  col      A      E      N      S      V    other
  1      -inf   -inf   -inf     9    -inf   -inf
  2      -inf     9    -inf   -inf   -inf   -inf
  3      -inf   -inf   -inf   -inf     9    -inf
  4        7      7    -inf   -inf   -inf   -inf
  5      -inf   -inf     7      7    -inf   -inf

Note: Scores are rounded to the nearest integer.
Problem with small/homogeneous MSAs:
  Some amino acids are never observed in an MSA column
  log(0) = negative infinity
9
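The count, normalize, divide, and log steps so far can be sketched in a few lines of plain Python. This is only an illustration under the slides' assumptions (uniform background P = 0.05, gaps ignored); the names `basic_pssm`, `MSA` and `BACKGROUND` are made up for this sketch, and it assumes no MSA column consists of gaps only.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]
# Assumption: uniform background frequencies, as in the slide example.
BACKGROUND = {aa: 0.05 for aa in ALPHABET}


def basic_pssm(msa, background):
    """Return one dict of rounded log-odds scores per MSA column (gaps ignored)."""
    pssm = []
    for column in zip(*msa):  # iterate over MSA columns
        residues = [aa for aa in column if aa != "-"]
        row = {}
        for aa, expected in background.items():
            freq = residues.count(aa) / len(residues)  # relative frequency
            if freq == 0:
                row[aa] = -math.inf  # log(0): amino acid never observed
            else:
                row[aa] = round(2 * math.log2(freq / expected))
        pssm.append(row)
    return pssm


pssm = basic_pssm(MSA, BACKGROUND)
print(pssm[0]["S"], pssm[3]["A"], pssm[0]["A"])
```

Unobserved amino acids end up at negative infinity here, which is the problem the following slides address.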
PSSM: Redistributing Gaps
So far we have ignored the gaps in the MSA. Can we do better?
How could we interpret the gaps?
  See them as wildcards
  Since the amino acid can be missing altogether, it shouldn't matter too much which amino acid we put there
Redistribute gaps according to the expected (background) amino acid frequencies,
i.e. every gap adds $P_j$ to the count of amino acid $j$
10
PSSM: Redistributing Gaps
  SE-AN
  SE-ES
  SEVEN
  SE-AS

Count observed amino acids and gaps
  col     A     E     N     S     V   other   -
  1       0     0     0     4     0     0     0
  2       0     4     0     0     0     0     0
  3       0     0     0     0     1     0     3
  4       2     2     0     0     0     0     0
  5       0     0     2     2     0     0     0

1. Multiply gaps by amino acid background frequencies and add to amino acid counts
2. Normalize to rel. frequencies
3. Divide by background
4. Calculate Log-Score

Counts after step 1
  col     A     E     N     S     V   other
  1       0     0     0     4     0     0
  2       0     4     0     0     0     0
  3      0.15  0.15  0.15  0.15  1.15  0.15
  4       2     2     0     0     0     0
  5       0     0     2     2     0     0

Note: This example uses uniform background frequencies (P = 0.05)
11
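A minimal sketch of the gap redistribution step, assuming uniform background frequencies as in the example. The function name `counts_with_redistributed_gaps` is hypothetical; it only covers the counting step, not normalization or scoring.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]
# Assumption: uniform background frequencies (P = 0.05), as in the slide example.
BACKGROUND = {aa: 0.05 for aa in ALPHABET}


def counts_with_redistributed_gaps(msa, background):
    """Count amino acids per MSA column; every gap adds P_j to the count of amino acid j."""
    rows = []
    for column in zip(*msa):
        gaps = column.count("-")
        rows.append({aa: column.count(aa) + gaps * p for aa, p in background.items()})
    return rows


counts = counts_with_redistributed_gaps(MSA, BACKGROUND)
print(round(counts[2]["V"], 2), round(counts[2]["A"], 2))
```

With a uniform background, the three gaps in column 3 add 0.15 to every amino acid, so the row total stays at 4 (one observed V plus three redistributed gaps).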
PSSM: Sequence weights Are all sequences in the MSA equal? Do some provide more information than others? SE-AN SEVEN SE-AN SE-AN SE-AN SE-AN SE-AN SEVEN Does the second MSA provide additional information regarding viable substitutions? 12
PSSM: Sequence weights
Sequence weights: a matter of variation
Henikoff S, Henikoff JG (1994). Position-based sequence weights. J. Mol. Biol., 243, 4:574-8.
Combine MSA column variation and sequence variation:

$$w_{i,k} = \frac{1}{r_i \, S_{i,k}}$$

where
  $w_{i,k}$ is the weight for sequence $k$ in MSA column $i$
  $r_i$ is the number of different observed amino acids in MSA column $i$ (count gaps as a 21st amino acid)
  $S_{i,k}$ is the number of sequences in MSA column $i$ sharing the same amino acid as sequence $k$ (including itself)
13
PSSM: Sequence weights
  S1: SE-AN
  S2: SE-ES
  S3: SEVEN
  S4: SE-AS

Per-column weights 1/(r · s)
  col     S1     S2     S3     S4     r
  1      1/4    1/4    1/4    1/4     1
  2      1/4    1/4    1/4    1/4     1
  3      1/6    1/6    1/2    1/6     2
  4      1/4    1/4    1/4    1/4     2
  5      1/4    1/4    1/4    1/4     2
  sum    0.67   0.67   1.0    0.67

Final weight is the sum over all positions of a sequence.
Exclude positions where r = 1.
For example: $w_{S1} = \frac{1}{6} + \frac{1}{4} + \frac{1}{4} = \frac{2}{3}$
14
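The weighting scheme can be sketched as follows. This is an illustrative reading of the slides (gaps counted as a 21st symbol, columns with r = 1 skipped); the function name `sequence_weights` is an assumption of this sketch.

```python
from collections import Counter

MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]


def sequence_weights(msa):
    """Position-based sequence weights: sum of 1 / (r * s) over columns with r > 1."""
    weights = [0.0] * len(msa)
    for column in zip(*msa):
        occurrences = Counter(column)  # gaps count as a 21st symbol
        r = len(occurrences)           # number of different symbols in the column
        if r == 1:
            continue                   # fully conserved columns are excluded
        for k, symbol in enumerate(column):
            weights[k] += 1 / (r * occurrences[symbol])
    return weights


weights = sequence_weights(MSA)
print([round(w, 2) for w in weights])
```

For the example MSA this gives roughly 0.67 for S1, S2 and S4 and 1.0 for S3, matching the table.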
PSSM: Sequence weights
Adjust amino acid and gap counts by the weight of the contributing sequences.
  SE-AN   S1   0.67
  SE-ES   S2   0.67
  SEVEN   S3   1.0
  SE-AS   S4   0.67

Weighted counts
  col     A      E      N      S      V    other    -
  1       0      0      0     3.00    0      0      0
  2       0     3.00    0      0      0      0      0
  3       0      0      0      0     1.00    0     2.00
  4      1.33   1.67    0      0      0      0      0
  5       0      0     1.67   1.33    0      0      0

$f_{1,S} = \frac{2}{3} + \frac{2}{3} + 1 + \frac{2}{3} = 3.00$
$f_{4,E} = \frac{2}{3} + 1 = 1.67$
15
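A short sketch of weighted counting, assuming the sequence weights from the example (2/3, 2/3, 1, 2/3) are already known. The helper name `weighted_counts` is made up for illustration; gaps are kept as their own symbol so they can be redistributed afterwards.

```python
from collections import defaultdict

MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]
WEIGHTS = [2 / 3, 2 / 3, 1.0, 2 / 3]  # sequence weights from the example


def weighted_counts(msa, weights):
    """Per MSA column, sum the weights of the sequences contributing each symbol."""
    rows = []
    for column in zip(*msa):
        row = defaultdict(float)
        for symbol, weight in zip(column, weights):
            row[symbol] += weight  # a sequence counts with its weight, not with 1
        rows.append(dict(row))
    return rows


counts = weighted_counts(MSA, WEIGHTS)
print(round(counts[0]["S"], 2), round(counts[3]["E"], 2), round(counts[2]["-"], 2))
```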
PSSM: Pseudocounts
Problem with small/homogeneous MSAs:
  Some amino acids are never observed in an MSA column
  log(0) = negative infinity
Solution: Pseudocounts
  Add an arbitrary number of counts to each amino acid
  No more unobserved amino acids
16
PSSM: Pseudocounts
Simple example: add 1 to every amino acid (not gaps).
Let's ignore sequence weights and gap redistribution for now.
  SE-AN
  SE-ES
  SEVEN
  SE-AS

Counts with pseudocounts
  col     A      E      N      S      V    other    -
  1       1      1      1      5      1      1      0
  2       1      5      1      1      1      1      0
  3       1      1      1      1      2      1      3
  4       3      3      1      1      1      1      0
  5       1      1      3      3      1      1      0

Relative frequencies
  col     A      E      N      S      V    other    -
  1      1/24   1/24   1/24   5/24   1/24   1/24    0
  2      1/24   5/24   1/24   1/24   1/24   1/24    0
  3      1/24   1/24   1/24   1/24   2/24   1/24   3/24
  4      3/24   3/24   1/24   1/24   1/24   1/24    0
  5      1/24   1/24   3/24   3/24   1/24   1/24    0
17
PSSM: Pseudocounts
The simple solution eliminates the log(0) problem. Can we do better?
Once again: use background frequencies.
  Every observed amino acid adds to the pseudocounts based on its substitution ratios
  Where can we get those ratios?
18
PSSM: Pseudocounts Use BLOSUM62 amino acid pair frequencies Whole matrix normalized to sum up to 1.0 Each column/row sum equals background frequency of the corresponding amino acid (matrix is symmetric) 19
PSSM: Pseudocounts
Every observed* amino acid adds to the pseudocounts based on its pair frequencies (similar to redistributing gaps):

$$g_{i,a} = \sum_{j} \frac{f_{i,j}}{P_j} \, q_{a,j}$$

where
  $g_{i,a}$ is the pseudocount value for amino acid $a$ in MSA column $i$
  $f_{i,j}$ is the observed* frequency of amino acid $j$ in MSA column $i$
  $P_j$ is the background frequency of amino acid $j$
  $q_{a,j}$ is the frequency of the amino acid pair $a, j$

*adjusted by sequence weights and redistributed gaps
20
PSSM: Pseudocounts
For example, let's assume that
  $q_{S,S} = 0.010$ for amino acid pair S, S
  $q_{S,A} = 0.004$ for amino acid pair S, A
  $q_{S,j} = 0.002$ for all other amino acid pairs S, j

Weighted f-matrix from page 15 after gap redistribution (first MSA column)
    A      E      N      S      V    other
    0      0      0     3.00    0      0

Pseudocounts (assuming uniform P = 0.05)
    A      E      N      S      V    other
   0.24   0.12   0.12   0.60   0.12   0.12

Calculate the pseudocounts from the f-matrix, then add them
    A      E      N      S      V    other
   0.24   0.12   0.12   3.60   0.12   0.12
21
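The worked example can be reproduced with a small sketch. The pair frequencies below are the made-up values from the example, not real BLOSUM62 frequencies, and `toy_pair_frequency` and `pseudocounts` are hypothetical helper names. In practice the pair frequencies would come from the normalized BLOSUM62 pair frequency matrix.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies, as in the slide example.
BACKGROUND = {aa: 0.05 for aa in ALPHABET}


def toy_pair_frequency(a, j):
    """Made-up pair frequencies from the example; only pairs involving S are defined."""
    if "S" not in (a, j):
        raise ValueError("toy example only defines pairs that involve S")
    partner = j if a == "S" else a
    return {"S": 0.010, "A": 0.004}.get(partner, 0.002)


def pseudocounts(freqs, background, pair_frequency):
    """g_a = sum over observed j of f_j / P_j * q_{a,j} for one MSA column."""
    return {
        a: sum(f / background[j] * pair_frequency(a, j)
               for j, f in freqs.items() if f > 0)
        for a in background
    }


# First MSA column of the weighted f-matrix: only S is observed (f = 3.00).
column = {aa: 0.0 for aa in ALPHABET}
column["S"] = 3.0

g = pseudocounts(column, BACKGROUND, toy_pair_frequency)
adjusted = {aa: column[aa] + g[aa] for aa in ALPHABET}
print(round(g["A"], 2), round(g["S"], 2), round(adjusted["S"], 2))
```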
PSSM: Pseudocounts
How much weight should the pseudocounts have? 50%? More? Less?
Is there some dynamic value?
  The more independent observations there are in the MSA, the fewer pseudocounts are needed/wanted
  Simple estimate: average variation in the MSA columns
22
PSSM: Pseudocounts
Estimate the number of independent observations:

$$N = \frac{1}{L} \sum_{i=1}^{L} r_i$$

where
  $N$ is the estimated number of independent observations
  $L$ is the number of MSA columns
  $r_i$ is the number of different observed amino acids in MSA column $i$ (count gaps as a 21st amino acid)
23
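As a quick check, the estimate can be computed directly from the MSA columns. A minimal sketch; the function name `independent_observations` is an assumption, and gaps are counted as their own symbol as stated above.

```python
MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]


def independent_observations(msa):
    """Average number of different symbols (gaps included) per MSA column."""
    columns = list(zip(*msa))
    return sum(len(set(column)) for column in columns) / len(columns)


n_obs = independent_observations(MSA)
print(n_obs)
```

For the example MSA the columns contain 1, 1, 2, 2 and 2 different symbols, so N = 8 / 5 = 1.6.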
PSSM: Pseudocounts
Weight observed* amino acids against pseudocounts:

$$F_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$

where
  $F_i$ are the adjusted amino acid frequencies in MSA column $i$
  $f_i$ are the observed* amino acid frequencies in MSA column $i$
  $g_i$ are the pseudocounts for MSA column $i$
  $\alpha$ is equal to $N - 1$
  $\beta$ is an empirically chosen weight factor for the pseudocounts

*adjusted by sequence weights and redistributed gaps
24
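The weighting step could look like the sketch below. It reads α as N − 1 and takes β as a free parameter; the value β = 4.0 in the usage lines is arbitrary and chosen only for illustration, and `mix_with_pseudocounts` is a hypothetical name.

```python
def mix_with_pseudocounts(freqs, pseudo, n_obs, beta):
    """Weighted average of observed frequencies and pseudocounts for one MSA column."""
    alpha = n_obs - 1  # weight of the observations grows with N
    return {
        aa: (alpha * freqs[aa] + beta * pseudo[aa]) / (alpha + beta)
        for aa in freqs
    }


# Toy column restricted to two amino acids; beta = 4.0 is an arbitrary choice.
observed = {"S": 3.0, "A": 0.0}
pseudo = {"S": 0.60, "A": 0.24}
mixed = mix_with_pseudocounts(observed, pseudo, n_obs=1.6, beta=4.0)
print({aa: round(value, 3) for aa, value in mixed.items()})
```

With N = 1 (a fully conserved MSA) α becomes 0, so the adjusted frequencies are determined by the pseudocounts alone.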
Position-Specific Scoring Matrix (PSSM)
Putting it all together:
1. Calculate sequence weights
2. Count (with weights) observed amino acids and gaps
3. Redistribute gaps according to background frequencies
4. Add pseudocounts according to amino acid pair frequencies
5. Normalize to relative frequencies
6. Divide by background frequencies
7. Calculate Log-Score
8. Remove rows corresponding to gaps in the primary sequence (here the primary sequence is the first one in the MSA)
Order of steps is important!
25
Position-Specific Scoring Matrix (PSSM)
  SE-AN
  SE-ES
  SEVEN
  SE-AS

  row     A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
  S       1  -1   0   0  -3   0  -1  -3   0  -3  -2   0  -1   0  -1   4   1  -2  -3  -2
  E      -1  -4   1   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -3  -3  -2
  -       0   0  -1  -1   0  -1  -1   1  -1   0   0  -1  -1  -1  -1   0   0   2  -1   0
  A       2  -2   0   4  -3  -1  -1  -2   0  -2  -2  -1  -1   1  -1   0  -1  -1  -3  -2
  N       0  -2   1   0  -3  -1   0  -3   0  -3  -2   5  -2   0  -1   3   0  -2  -3  -2
26
Position-Specific Scoring Matrix (PSSM)
After all that: WHY do we want PSSMs (again)?
PSSMs help to improve alignments (local and global)
  Use PSSM scores instead of, for example, BLOSUM62
  You can even align two PSSMs
PSSMs condense information about the evolution of a protein
  Conserved positions are easy to spot
  Important input feature for many prediction methods
PSSMs help to find protein homologs in databases
27
BLAST Basic Local Alignment Search Tool (BLAST) Searches databases for similar protein/nucleotide sequences Scores hits based on local alignments and score matrices (default: BLOSUM62 for proteins) High speed due to using seeds for hit determination 28
BLAST
What are BLAST seeds?
Short sequences (3-grams for proteins) that have a high pairwise score to the query sequence (based on the scoring matrix)
Query: SEQWENCE
Seeds:
  EQW = 5 + 5 + 11 = 21
  WQN = 11 + 2 + 6 = 19
  NCQ = 6 + 9 + 2 = 17
Analogous for seed vs PSSM:
  Use rows of the PSSM as sequence positions
  Use the corresponding amino acid column of the PSSM as the score
29
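Seed generation against a PSSM can be sketched by enumerating all 3-grams for every window of three PSSM rows. The toy PSSM below uses the scores for the residues S, E, A, N restricted to five amino acids to keep the output small; the function name `pssm_seeds`, the reduced alphabet and the threshold of 11 are assumptions of this sketch.

```python
from itertools import product

TOY_ALPHABET = "AENSV"
# Rows for the primary sequence SEAN, restricted to five amino acids.
TOY_PSSM = [
    {"A": 1, "E": 0, "N": 0, "S": 4, "V": -2},
    {"A": -1, "E": 5, "N": 0, "S": 0, "V": -3},
    {"A": 2, "E": 4, "N": -1, "S": 0, "V": -1},
    {"A": 0, "E": 0, "N": 5, "S": 3, "V": -2},
]


def pssm_seeds(pssm, alphabet, min_score, word_size=3):
    """Return (word, start position, score) for every word scoring at least min_score."""
    seeds = []
    for start in range(len(pssm) - word_size + 1):
        window = pssm[start:start + word_size]  # rows act as sequence positions
        for word in product(alphabet, repeat=word_size):
            score = sum(row[aa] for row, aa in zip(window, word))
            if score >= min_score:
                seeds.append(("".join(word), start, score))
    return seeds


for seed in pssm_seeds(TOY_PSSM, TOY_ALPHABET, min_score=11):
    print(seed)
```

Brute-force enumeration is cheap for proteins: with 20 amino acids there are only 8000 possible 3-grams per position.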
BLAST Search algorithm Find seeds in (indexed) sequence database Extend alignments with sequences that contain two or more seeds (dynamic programming) Keep high-scoring (local) alignments 30
PSI-BLAST
Iterative BLAST
1. Use BLOSUM62 scores for the first search against the database
2. Build a PSSM based on high-scoring hits
3. Search again using the PSSM
4. Repeat steps 2 & 3 a specified number of times
Can find more distantly related protein sequences
But: false hits can pollute the PSSM
31
Homework
Compute several variants of PSSMs from an MSA
  From basic to complex PSSMs
  Carefully read the different steps again
Try to be efficient in calculating the arrays and matrices
  Re-use variables and methods
  Use numpy arrays and built-in features
Generate a list of BLAST seeds for a PSSM
  PSSM and minimum score will be provided via parameter
32