Advanced topics in bioinformatics


1 Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site:

2 Lecture 1, 5/3/2003: Substitution matrices theory and schemes

3 Sequence alignment
We wish to identify which regions of the two sequences are most similar to each other. The sequences are shifted relative to one another and gaps are introduced, to cover all possible alignments. The shifts and gaps provide the steps by which one sequence can be converted into the other.
[Figure: TTCAGTC shifted along ATCAGAGTC at several offsets, followed by the gapped alignment]
ATCAGAGTC
TTCA--GTC
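The shifting step can be sketched in a few lines of Python. This is a minimal illustration that only counts identities for every ungapped placement of one sequence along the other; the function name shift_scores is invented here, and gaps are not handled.

```python
def shift_scores(a, b):
    """Count identical positions for every ungapped placement of b along a.

    The offset is the index in `a` that is aligned with b[0]; negative
    offsets mean that b starts before a does.
    """
    scores = {}
    for offset in range(-(len(b) - 1), len(a)):
        matches = 0
        for j, residue in enumerate(b):
            i = offset + j
            if 0 <= i < len(a) and a[i] == residue:
                matches += 1
        scores[offset] = matches
    return scores


scores = shift_scores("ATCAGAGTC", "TTCAGTC")
best = max(scores.values())
best_offsets = [o for o, s in scores.items() if s == best]
print(best, best_offsets)
```

For the slide's two sequences, offsets 0 and 2 tie with four identities each. The gapped alignment TTCA--GTC combines the matching parts of both placements.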

4 Alignment scoring schemes: substitution matrices Unitary substitution matrix: two scores are used, one for matches and one for mismatches. In practice such matrices are used for nucleotide alphabets. [Table: 4 x 4 matrix with rows and columns A, C, G, T; score values lost in extraction]
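A unitary matrix is easy to build programmatically. The sketch below assumes a match score of +1 and a mismatch score of -1; these values are placeholders, since the scores on the slide did not survive, and the helper names are hypothetical.

```python
def unitary_matrix(alphabet, match=1, mismatch=-1):
    """Build a two-score substitution matrix as a nested dict."""
    return {
        x: {y: (match if x == y else mismatch) for y in alphabet}
        for x in alphabet
    }


def score_ungapped(a, b, matrix):
    """Sum substitution scores over an ungapped, equal-length alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[x][y] for x, y in zip(a, b))


dna = unitary_matrix("ACGT")
for row in "ACGT":
    print(row, [dna[row][col] for col in "ACGT"])
print(score_ungapped("ATCA", "TTCA", dna))
```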

5 Alignment scoring schemes: substitution matrices Protein sequences contain 20 types of residues (amino acids, aa) with complex relations by size, charge, genetic code and chemistry. Unitary aa substitution matrices are outperformed by matrices that assign different scores to the 210 possible aa pairs. These matrices are calculated by scoring the relations between aa according to some of their features, and/or from which substitutions occur in correct alignments and the probability of observing them by chance.
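The figure of 210 pairs follows from counting unordered pairs of 20 residues, identical pairs included. A quick check in Python:

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues

pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical = [p for p in pairs if p[0] == p[1]]
different = [p for p in pairs if p[0] != p[1]]
print(len(identical), len(different), len(pairs))
```

This gives 20 identical pairs plus 190 distinct pairs, which is the size of one triangle of a symmetric 20 x 20 matrix including its diagonal.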

6 Sequence alignment BLOSUM62 amino acid substitution matrix, in 1/2 bit units. [Table: lower-triangular matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V X; only the first entries survive extraction: A:A = 4, R:A = -1, R:R = 5]

7 Alignment scoring schemes: substitution matrices (Altschul, JMB 219:555, 1991) Every substitution matrix is either explicitly calculated from target frequencies of aligned residue pairs ($q_{ij}$) and the frequencies of the residues ($p_i$), or these frequencies are implicit and can be back-calculated from the substitution scores ($s_{ij}$). The ratio of a target frequency to the frequency with which the pair would occur by chance, $q_{ij} / (p_i p_j)$, compares the probability of an event under two alternative hypotheses. This is called a likelihood, or odds, ratio. Odds ratios of independent events are multiplied to get their joint ratio or, equivalently, their logarithms are added. Log-odds score: $s_{ij} = \ln\big(q_{ij} / (p_i p_j)\big) / \lambda$ ($\lambda$ determines the base of the logarithm).
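The log-odds formula can be tried out directly. In this sketch the frequencies are made-up numbers, and choosing lambda = ln(2)/2 is simply one way to express scores in half-bit units.

```python
import math


def log_odds(q_ij, p_i, p_j, lam):
    """s_ij = ln(q_ij / (p_i * p_j)) / lam."""
    return math.log(q_ij / (p_i * p_j)) / lam


HALF_BIT = math.log(2) / 2  # lam that expresses scores in half-bit units

# Hypothetical frequencies, chosen only for illustration.
p_i, p_j = 0.07, 0.05
q_ij = 2 * p_i * p_j  # pair seen twice as often as expected by chance

print(log_odds(q_ij, p_i, p_j, HALF_BIT))       # about 2 half-bits, i.e. 1 bit
print(log_odds(p_i * p_j, p_i, p_j, HALF_BIT))  # chance level scores 0
```

A pair observed more often than chance gets a positive score, a pair observed less often gets a negative one, and adding scores along an alignment corresponds to multiplying the odds ratios.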

8 Sequence alignment BLOSUM62 amino acid substitution matrix, in 1/2 bit units. [Table: BLOSUM62 matrix with a worked derivation for the pairs A:A and A:R, in columns P_i P_j, Q_ij, Q_ij/(P_i P_j) and 2 log2(Q_ij/(P_i P_j)); numeric values lost in extraction apart from A:A = 4, R:A = -1, R:R = 5] See ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blosum/readme and ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blosum/blosum/

9 Alignment scoring schemes: substitution matrices (Altschul, JMB 219:555, 1991) Substitution matrices are characterized by their average score per residue pair: $H = \sum_{i,j} q_{ij} s_{ij} = \sum_{i,j} q_{ij} \log_2\big(q_{ij} / (p_i p_j)\big)$. H is the information, in bit units, per aligned residue pair. It depends on the target frequencies ($q_{ij}$), calculated from what we think are correct alignments, and on the pair frequencies expected by chance ($p_i p_j$). It is termed the relative entropy of the matrix. H measures the information provided by the matrix to distinguish correct alignments from chance ones. Well-made matrices with lower H values will identify more distant sequence relationships, which produce weaker alignments.
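As a rough illustration of the relative entropy, the sketch below evaluates H for a toy two-letter alphabet. All frequencies are invented, and q is taken here as a full table over ordered pairs that sums to 1.

```python
import math


def relative_entropy(q, p):
    """H = sum_ij q[i][j] * log2(q[i][j] / (p[i] * p[j])), in bits."""
    h = 0.0
    for i, row in enumerate(q):
        for j, q_ij in enumerate(row):
            if q_ij > 0:  # terms with q_ij = 0 contribute nothing
                h += q_ij * math.log2(q_ij / (p[i] * p[j]))
    return h


# Toy two-letter alphabet; all numbers are invented for illustration.
p = [0.5, 0.5]
q_conserved = [[0.4, 0.1], [0.1, 0.4]]   # identities favoured
q_chance = [[0.25, 0.25], [0.25, 0.25]]  # q_ij = p_i * p_j

print(relative_entropy(q_conserved, p))
print(relative_entropy(q_chance, p))
```

When the target frequencies equal the chance frequencies, H is zero and the matrix carries no information. The more the targets depart from chance, the larger H becomes.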

10 Alignment scoring schemes: substitution matrices Substitution matrices differ by the models and data used for their calculation. Each is suitable for identifying alignments of sequences with different evolutionary distances. Nevertheless, longer alignments are needed to identify the relationship between more distant sequences. The scale of the substitution matrix (base of the log) is arbitrary. However, matrices must be in the same scale to be compared to each other, and gap penalties are specific to the matrix and scale used. Typical penalties for local alignment with the BLOSUM62 matrix in half-bit units are 12 for opening a gap and 2 for extending it.
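The open and extend penalties combine into a cost for a whole gap. The slide does not say whether the first gapped residue also pays the extension penalty, so the sketch below exposes both conventions through a hypothetical flag; the defaults of 12 and 2 are the half-bit values quoted above.

```python
def affine_gap_cost(length, gap_open=12, gap_extend=2, first_residue_extends=True):
    """Penalty, in the matrix's own units, for one gap of `length` residues.

    first_residue_extends=True  -> gap_open + length * gap_extend
    first_residue_extends=False -> gap_open + (length - 1) * gap_extend
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    extensions = length if first_residue_extends else length - 1
    return gap_open + extensions * gap_extend


for k in (1, 2, 5):
    print(k, affine_gap_cost(k), affine_gap_cost(k, first_residue_extends=False))
```

Because these numbers are in half-bit units, they would need rescaling before being used with a matrix expressed in another scale.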

11 Amino acid (aa) substitution matrices can be calculated empirically, by examining which substitutions occur in correct alignments together with a model for random protein sequences. These matrices can also be derived by scoring the relations of aa to each other according to some of their features, such as size, charge, hydrophobicity and genetic code.

12 Residue-features matrices [Figure: the 20 amino acids (F S Y C L P H W I T Q R M A N G V D K E) arranged in a diagram; layout lost in extraction]

13 Residue-features matrices [Figure: Venn diagram grouping the amino acids by property. Set labels: Small, Tiny, Polar, Charged, Negative, Positive, Hydrophobic, Aliphatic, Aromatic. Set membership lost in extraction]

14 Residue-features matrices Hydrophobicity aa substitution matrix. [Table: lower-triangular matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V X; only the first entries survive extraction: A:A = 10, R:A = 5, R:R = 10]

15 Residue-features matrices Genetic code aa substitution matrix: minimal number of base changes. [Table: lower-triangular matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V X; only the first entries survive extraction: A:A = 0, R:A = 2, R:R = 0; some cells are marked with *]
Universal genetic code (* = stop):
TTT F   TCT S   TAT Y   TGT C
TTC F   TCC S   TAC Y   TGC C
TTA L   TCA S   TAA *   TGA *
TTG L   TCG S   TAG *   TGG W
CTT L   CCT P   CAT H   CGT R
CTC L   CCC P   CAC H   CGC R
CTA L   CCA P   CAA Q   CGA R
CTG L   CCG P   CAG Q   CGG R
ATT I   ACT T   AAT N   AGT S
ATC I   ACC T   AAC N   AGC S
ATA I   ACA T   AAA K   AGA R
ATG M   ACG T   AAG K   AGG R
GTT V   GCT A   GAT D   GGT G
GTC V   GCC A   GAC D   GGC G
GTA V   GCA A   GAA E   GGA G
GTG V   GCG A   GAG E   GGG G

16 Residue-features matrices Genetic code aa substitution matrix: minimal number of base changes. [Table and universal genetic code repeated from slide 15, without the * marks in the matrix; surviving entries: A:A = 0, R:A = 2, R:R = 0]
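The entries of such a matrix can be recomputed from the universal genetic code. The following sketch builds the codon table from a compact 64-letter string (bases in the order T, C, A, G, as on the slide) and takes the minimum number of differing bases over all codon pairs; the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for codons enumerated as product(BASES, repeat=3); * = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """All codons that encode the given amino acid."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == aa]


def min_base_changes(aa1, aa2):
    """Smallest number of differing bases over all codon pairs of aa1 and aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(min_base_changes("A", "A"), min_base_changes("R", "A"))
```

The two printed values agree with the surviving matrix entries A:A = 0 and R:A = 2.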

17 Residue-features matrices
A selection of substitution matrices based on amino acid features:
Mutation values for the interconversion of amino acid pairs (Fitch, 1966)
Genetic code matrix (Benner et al., 1994)
Residue replaceability matrix (Cserzo et al., 1994)
Structure-Genetic matrix (Feng et al., 1985)
Hydrophobicity scoring matrix (George et al., 1990)
Chemical distance (Grantham, 1974)
Chemical similarity scores (McLachlan, 1972)
Base-substitution-protein-stability matrix (Miyazawa-Jernigan, 1993)
Hydrophobicity scoring matrix (Riek et al., 1995)
WAC matrix constructed from amino acid comparative profiles (Wei et al., 1997)
Source: AAindex database.

18 The Dayhoff, or PAM, matrices PAM matrices are based on an explicit model of mutations during evolution. Amino acid (aa) changes at each site are assumed to be independent of previous changes at that site, of changes at other sites and of the position of the site. This model allows substitutions observed over short evolutionary distances to be extrapolated to longer ones. The input data are groups of protein sequences at least 85% identical to each other (protein families). Amino acid substitutions within each group probably result from single mutation events and do not significantly change the protein's function, and are thus termed accepted mutations. Sequences within each family are organized into a phylogenetic tree.

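19 The Dayhoff, or PAM, matrices
Phylogenetic trees allow counting the aligned aa pairs ($A_{ij}$) that correspond to actual mutation events. This solves the dependence problem of different sequence alignments within one family. (Dayhoff et al., A model of evolutionary change in proteins, in Atlas of Protein Sequence and Structure, Suppl 3.)
[Figure: phylogenetic tree. The inferred ancestors ABGH and ABIJ are joined by the changes I<=>G and J<=>H. ABGH gives the observed sequences ACGH (B=>C) and DBGH (A=>D); ABIJ gives ADIJ (B=>D) and CBIJ (A=>C)]
Accepted point mutation counts from the tree:
   A B C D G H I J
A  . . 1 1 . . . .
B  . . 1 1 . . . .
C  1 1 . . . . . .
D  1 1 . . . . . .
G  . . . . . . 1 .
H  . . . . . . . 1
I  . . . . 1 . . .
J  . . . . . 1 . .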
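Counting along tree edges, rather than over all sequence pairs, can be sketched as follows. The edge list is my reading of the slide's example tree, and each change is entered symmetrically because the direction between the two inferred ancestors is not known.

```python
from collections import Counter

# Tree edges (ancestor, descendant) as read from the slide's example.
EDGES = [
    ("ABGH", "ACGH"),
    ("ABGH", "DBGH"),
    ("ABIJ", "ADIJ"),
    ("ABIJ", "CBIJ"),
    ("ABGH", "ABIJ"),  # the two inferred ancestors; direction unknown
]


def count_accepted_mutations(edges):
    """Count each change once per tree edge, symmetrically as (i, j) and (j, i)."""
    counts = Counter()
    for seq1, seq2 in edges:
        for x, y in zip(seq1, seq2):
            if x != y:
                counts[(x, y)] += 1
                counts[(y, x)] += 1
    return counts


A = count_accepted_mutations(EDGES)
for (x, y), n in sorted(A.items()):
    print(x, y, n)
```

Comparing all four observed sequences pairwise would instead count the G/I and H/J changes four times each, which is the dependence problem the tree avoids.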

20 The Dayhoff, or PAM, matrices We need to know the probability that each aa will change within a given evolutionary distance. This number is termed the relative mutability of the aa ($m_j$). [Table: for the sequence alignment ADA / ADB and the amino acids A, B and D, rows give Changes, Occurrence and Relative mutability; values lost in extraction] (Dayhoff et al., A model of evolutionary change in proteins, in Atlas of Protein Sequence and Structure, Suppl 3.) [...] exposure to mutation, these numbers are multiplied by the total number of mutations per 100 positions in each family. This scales the data from different families to the same evolutionary distance: 1 percent accepted mutation (PAM).
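For the small alignment ADA / ADB, the relative mutability can be worked out as changes divided by occurrences. This is a minimal sketch under that reading of the slide's table; it leaves out the per-family scaling by mutations per 100 positions.

```python
from collections import Counter
from fractions import Fraction


def relative_mutability(seq1, seq2):
    """Changes divided by occurrences, per residue, for one aligned pair."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for x, y in zip(seq1, seq2):
        if x != y:
            changes[x] += 1
            changes[y] += 1
    return {aa: Fraction(changes[aa], occurrences[aa]) for aa in sorted(occurrences)}


m = relative_mutability("ADA", "ADB")
for aa, value in m.items():
    print(aa, value)
```

Here A changes once in three occurrences, B changes in its only occurrence and D never changes, giving mutabilities of 1/3, 1 and 0.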

21 The Dayhoff, or PAM, matrices Combining the accepted point mutation counts $A_{ij}$ with the relative mutabilities $m_j$ gives the probability $M_{ij}$ that aa j will change into aa i over a given evolutionary distance: $M_{ij} = \lambda m_j A_{ij} / \sum_i A_{ij}$ for $i \ne j$, and $M_{jj} = 1 - \lambda m_j$. $\lambda$ is a proportionality constant chosen to keep the evolutionary distance at 1 PAM. The PAM1 matrix gives the probabilities for 1 aa change per 100 aa (1%). To get PAM matrices for larger distances, PAM1 is multiplied by itself. 250 multiplications give the PAM250 matrix, which has the probabilities for aa substitutions expected over 250% change. This causes a sequence divergence of only ~80%, because a position can change more than once and can mutate back to an aa previously found in it.
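The effect of repeated multiplication can be seen with a toy model. The sketch below does not use Dayhoff's data: it assumes a uniform 20-letter alphabet in which every residue changes with probability 0.01 per step, spread evenly over the other 19 residues, and the helper names are invented.

```python
N = 20  # alphabet size


def toy_pam1(rate=0.01):
    """Uniform toy model: each residue changes with probability `rate`,
    spread evenly over the other N - 1 residues. Not Dayhoff's data."""
    off = rate / (N - 1)
    return [[(1 - rate) if i == j else off for j in range(N)] for i in range(N)]


def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matrix_power(m, n):
    """Multiply m into the identity n times, giving the n-step matrix."""
    result = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    for _ in range(n):
        result = matmul(result, m)
    return result


pam250 = matrix_power(toy_pam1(), 250)
unchanged = sum(pam250[i][i] for i in range(N)) / N  # uniform composition
print(f"positions that differ after 250 PAM: {1 - unchanged:.0%}")
```

Even in this simplified model the observed divergence stays far below 250%, since repeated and reverse changes at the same position are not visible in the final sequences. The exact percentage differs from the real PAM250 value because the toy rates are uniform.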

22 The Dayhoff, or PAM, matrices To get odds values for the PAM matrices, the probabilities are divided by the frequency of residue i, $f_i$: $R_{ij} = M_{ij} / f_i$. After taking the $\log_{10}$ of the odds ratios, the values for the $R_{ij}$ and $R_{ji}$ changes are averaged and multiplied by 10. The resulting matrices are termed mutation data matrices (MDM). Their values are log-odds scores in units of about a third of a bit ($10 \log_{10} x \approx 3 \log_2 x$). H, the relative entropy of the PAM250 MDM matrix, is 0.35 bits and its expected score per aligned pair is [value missing in source]. There is a series of PAM matrices, each suitable for aligning sequences with a different amount of divergence. For sequence searches this means that you must know ahead what type of relationships you want to optimize the search for. This is true for all types of substitution matrices.
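The conversion from probabilities to MDM scores, and the third-bit unit, can be checked numerically. The probabilities and frequencies below are hypothetical, and mdm_score is an illustrative name rather than anything from the Dayhoff paper.

```python
import math


def mdm_score(m_ij, m_ji, f_i, f_j):
    """Average of 10*log10(R_ij) and 10*log10(R_ji), with R_ij = M_ij / f_i."""
    r_ij = m_ij / f_i
    r_ji = m_ji / f_j
    return 10 * (math.log10(r_ij) + math.log10(r_ji)) / 2


# Hypothetical probabilities and frequencies, for illustration only.
print(round(mdm_score(m_ij=0.02, m_ji=0.03, f_i=0.05, f_j=0.08)))

# Unit check: a score of 10*log10(x) equals about 3.01*log2(x).
print(10 * math.log10(2))
```

Averaging the two directions makes the score matrix symmetric, which is what an alignment program needs when it does not know which sequence is ancestral.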

23 The Dayhoff, or PAM, matrices The Dayhoff matrices were used extensively and successfully for more than 15 years. Dayhoff and her coworkers introduced two key concepts for calculating aa substitution matrices: i) substitution frequencies can be based on estimated mutation data, ii) sequence alignments can be effectively scored by log-odds values. PAM matrices also have several weaknesses. The assumption of independent mutation events necessitates the use of closely related sequences. Thus, the estimated mutation rates need to be extrapolated to longer evolutionary distances, which amplifies inaccuracies. One likely cause of inaccuracy is that aa changes between closely related sequences are mainly due to single nucleotide changes, while more of the changes in distant sequences come from multinucleotide changes. Protein positions are also not equally mutable, and mutagenesis hot spots and cold spots exist in most protein families.

24 Blocks substitution matrices (BLOSUM) Blocks are ungapped local multiple-sequence alignments. They can be found automatically from groups of related protein sequences (families). Blocks represent the most conserved sequence regions of protein families (motifs). BLOSUM matrices are based on the changes within sequence motifs of protein families. Common evolutionary origin (homology) of the motifs is implicit. Relations between all motifs that differ by at least some threshold are considered equally.
seq1  CFTKGTQV
seq2  CLAEGTRI
seq3  CMNYSTRV
seq4  CHPADTKV
seq5  CLTADARI
seq6  CISKFSHI
seq7  CVTGDALV
seq8  CLTGDALV
seq9  CVTGDALV
seq10 ALAYDEPI

25 Blocks substitution matrices (BLOSUM)
First column of the block (seq1 to seq10): C C C C C C C C C A
Observed aa pairs, pairing each sequence with all those listed before it: 1CC + 2CC + 3CC + 4CC + 5CC + 6CC + 7CC + 8CC + 9AC = 36CC + 9AC

26 Blocks substitution matrices (BLOSUM) (Henikoff & Henikoff, PNAS USA 89)
Column: C C C C C C C C C A. Observed aa pairs: 36CC + 9AC.
The column defines 45 aa pairs ($10 \times 9 / 2$). The observed frequencies of pairs in the column are $q_{CC} = 36/45 = 0.8$ and $q_{AC} = 9/45 = 0.2$.
The observed frequencies of single aa in the pairs are $p_i = q_{ii} + \sum_{j \ne i} q_{ij} / 2$, so $p_C = 0.8 + 0.2/2 = 0.9$ and $p_A = 0.2/2 = 0.1$.
The expected frequencies of aa pairs are $e_{ii} = p_i p_i = p_i^2$ and, for $i \ne j$, $e_{ij} = p_i p_j + p_j p_i = 2 p_i p_j$, so $e_{CC} = 0.9 \times 0.9 = 0.81$, $e_{AC} = 2 (0.9 \times 0.1) = 0.18$ and $e_{AA} = 0.1 \times 0.1 = 0.01$.
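The column arithmetic can be reproduced in a few lines. This sketch handles a single column only, and the function name column_frequencies is made up for the example.

```python
from collections import Counter
from itertools import combinations


def column_frequencies(column):
    """Observed pair (q), single-residue (p) and expected pair (e) frequencies."""
    pairs = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    total = sum(pairs.values())
    q = {pair: n / total for pair, n in pairs.items()}
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq          # q_ii counts fully towards p_i
        else:
            p[x] += freq / 2      # q_ij is split between i and j
            p[y] += freq / 2
    residues = sorted(p)
    e = {
        (x, y): (p[x] ** 2 if x == y else 2 * p[x] * p[y])
        for i, x in enumerate(residues)
        for y in residues[i:]
    }
    return pairs, q, dict(p), e


pairs, q, p, e = column_frequencies("CCCCCCCCCA")
print(pairs)
print(q, p, e)
```

Note that the AA pair has an expected frequency of 0.01 even though it is never observed in this column, which is why counts must be pooled over many columns before taking logarithms.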

27 Blocks substitution matrices (BLOSUM) The observed and expected frequencies of aa pairs in the column can be used to calculate log-odds scores for the pairs: $s_{ij} = \log(q_{ij} / e_{ij})$. Observed and expected frequencies are accumulated over all columns of the Blocks database to calculate log-odds scores of amino acid substitutions in conserved protein motifs.
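Accumulating over columns can be illustrated with the ten-sequence example block from the earlier slide. This is a sketch only: it pools a single block rather than a whole database, skips the clustering step, and assumes base-2 logarithms, which the slide leaves open.

```python
import math
from collections import Counter
from itertools import combinations

BLOCK = ["CFTKGTQV", "CLAEGTRI", "CMNYSTRV", "CHPADTKV", "CLTADARI",
         "CISKFSHI", "CVTGDALV", "CLTGDALV", "CVTGDALV", "ALAYDEPI"]


def pair_counts(block):
    """Pool unordered residue pairs over every column of the block."""
    counts = Counter()
    for column in zip(*block):
        counts.update(tuple(sorted(pair)) for pair in combinations(column, 2))
    return counts


def log_odds_scores(counts):
    """s_ij = log2(q_ij / e_ij) for every pair that was observed."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}
    p = Counter()
    for (x, y), freq in q.items():
        p[x] += freq / 2  # for x == y both halves go to the same residue
        p[y] += freq / 2
    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] ** 2 if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = math.log2(freq / expected)
    return scores


counts = pair_counts(BLOCK)
scores = log_odds_scores(counts)
print(sum(counts.values()), counts[("C", "C")], counts[("A", "C")])
```

With eight columns of 45 pairs each, the block contributes 360 pairs in total. The C:C pair comes out with a positive score because C is rare overall yet almost always paired with itself.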

28 Blocks substitution matrices (BLOSUM) To reduce multiple contributions of the most closely related family members to the aa pair frequencies, sequences are clustered in each block. Clustering is done by percent identity, e.g. all sequences within a block that are 80% or more identical to each other are clustered together. The aa in each cluster are weighted by 1/c, where c is the number of sequences in the cluster.
Column: Seq1 A, Seq2 A, Seq3 S, Seq4 C, Seq5 C
Cluster1 (Seq1 to Seq3): 1/3 A, 1/3 A, 1/3 S
Cluster2 (Seq4, Seq5): 1/2 C, 1/2 C
Observed aa pairs: 1/6 CA + 1/6 CA + 1/6 CS + 1/6 CA + 1/6 CA + 1/6 CS
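The weighted counting in this example can be sketched as below. It assumes, as the slide's result suggests, that only pairs between different clusters are counted and that each residue carries the weight 1/c of its cluster; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def weighted_pairs(clusters):
    """Count pairs between different clusters; each residue in a cluster of
    size c carries weight 1/c. Pairs within a cluster are not counted."""
    counts = Counter()
    for cl1, cl2 in combinations(clusters, 2):
        weight = Fraction(1, len(cl1)) * Fraction(1, len(cl2))
        for x in cl1:
            for y in cl2:
                counts[tuple(sorted((x, y)))] += weight
    return counts


counts = weighted_pairs([["A", "A", "S"], ["C", "C"]])
for pair, weight in sorted(counts.items()):
    print(pair, weight)
```

The six pairs of weight 1/6 add up to 2/3 for A:C and 1/3 for C:S, so the two clusters together contribute exactly one pair, as if each cluster were a single sequence.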

29 Blocks substitution matrices (BLOSUM) Reducing the clustering percent can cause some blocks to be entirely clustered, which eliminates their contribution of pairs. A reduced clustering percent also decreases the contribution of closely related sequences and lowers the information (H) of the resulting matrix. Matrices made from more distant sequences (lower clustering percent and information) are more suitable for identifying distant sequence relationships. [Figure: plot from Henikoff & Henikoff, PNAS USA 89; surviving axis label: Information - H]

30 Blocks substitution matrices (BLOSUM) The procedure that constructs the Blocks database from sequences of protein families employs an aa substitution matrix. To make the BLOSUM matrices, the database was first made with a unitary substitution matrix. The constructed blocks were then used to make a BLOSUM matrix, which was in turn used to reconstruct the database. After three such iterations the Blocks database and BLOSUM matrices converged. Using a PAM matrix or just parts of the data also resulted in very similar matrices. Currently the Blocks database is significantly larger and more diverse than the version used to construct the BLOSUM matrices. Nevertheless, the database yields the same matrices. [Figure: iteration cycle. Start with a unitary matrix; sequence groups -> make database -> Blocks DB -> make matrix -> substitution matrix -> make database; iterate until convergence]

31 Blocks substitution matrices (BLOSUM) The performance of substitution matrices depends on the type of alignment used and on the evolutionary distance between the aligned sequences. Alignment types can be global, local or ungapped local. Performance in the first two types also depends on the gap model and penalties. BLOSUM matrices were found to be very good for identifying long-diverged sequences and for ungapped local alignments (BLAST algorithm). This probably reflects the data used for their construction.

32 More details, sources and things to do for next lecture
Sources:
Altschul, Amino acid substitution matrices from an information theoretic perspective, J Mol Biol 219 (1991).
Henikoff, Scores for sequence searches and alignments, Curr Opin Struct Biol 6 (1996).
Dayhoff et al., A model of evolutionary change in proteins, in Atlas of Protein Sequence and Structure, Suppl 3, NBRF (1978).
Henikoff & Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci. USA 89 (1992).
Assignment: Read the source articles for this lecture. List the similarities and differences between the approaches for calculating the PAM and BLOSUM matrices: for example, the types of data, the underlying assumptions, and how each deals with the lack of independence (dependence) of the sequence data.

33 More details, sources and things to do for next lecture For those who are not acquainted with information theory, or who want to be certain they know its basics: An information theory primer for molecular biologists, http://

34 Next lecture: Dynamic programming
