Computational Biology



Lecture 6, 31 October

Overview
- Scoring matrices (thanks to Shannon McWeeney)
- BLAST algorithm
- Start sequence alignment

What is a homologous sequence?
- In molecular biology, a sequence is homologous to another sequence when the two are similar because they share a common ancestor.

Origins of similar sequences
[Figure: gene A duplicates into A1 and A2 (gene duplication); speciation then gives Species A and Species B, each carrying A1 and A2.]
- Also horizontal gene transfer

Database Searching: The Assumptions
- The sequences being sought share an evolutionary ancestral sequence with the query sequence.
- The best guess at the actual path of evolution is the path that requires the fewest evolutionary events (the most parsimonious path).
- Not all substitutions are equally likely, so they should be weighted accordingly.
- Insertions and deletions (indels) are less likely than substitutions and should be weighted accordingly.

Scoring Matrices
- Scoring matrices are designed to detect signal above background, that is, to detect similarities beyond what would be observed by chance alone.

BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)
- Derived from a set (2000) of aligned, ungapped regions from protein families.
- Puts more emphasis on chemical similarity than on how easy it is to mutate from one residue to another.
- BLOSUMx is derived from the set of segments of x% identity.

Example aligned segments:
  LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI 137
  LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

[Figure: BLOSUM62 matrix, log-odds representation, with rows and columns in the order C S T P A G N D E Q H R K M I L V F Y W (from Henikoff 1996)]
Example entries: D:D = +6, D:R =

The rationale for substitution matrices
We have two related amino acid sequences:
  HRLPCYVRRTS
  HRLPWVRRTS
There are two possible alignments (- represents a gap):
  HRLPCYVRRTS
  HRLPW-VRRTS
or
  HRLPCYVRRTS
  HRLP-WVRRTS
- As given, there is no way to choose between them.
- But what if we knew that C (cysteine) rarely substitutes for W (tryptophan), whereas Y (tyrosine) substitutes for W more commonly?
- Some amino acid substitutions are more likely than others, and we would like to capture these relative likelihoods with some kind of score.

Some caveats
- This information is stored in a scoring matrix.
- To create the scoring matrix, we need data that can be trusted.
- This trusted data is used to determine which substitutions are more or less likely.
- The key is to choose the right data set to generate the scoring matrix (it depends on how distant a match we seek in the database).
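A minimal sketch of how a scoring matrix settles the choice between the two candidate alignments above. The substitution scores are the BLOSUM62 entries for the residue pairs involved, copied by hand; the gap score of -4 and the names `SCORES`, `GAP` and `alignment_score` are assumptions made for illustration.

```python
# Hand-copied BLOSUM62 entries for the residue pairs that occur in the example.
SCORES = {
    ("H", "H"): 8, ("R", "R"): 5, ("L", "L"): 4, ("P", "P"): 7,
    ("V", "V"): 4, ("T", "T"): 5, ("S", "S"): 4,
    ("C", "W"): -2,  # cysteine rarely substitutes for tryptophan
    ("Y", "W"): 2,   # tyrosine substitutes for tryptophan more often
}
GAP = -4  # assumed score for a residue aligned with a gap


def alignment_score(top, bottom):
    """Sum the column scores of two equal-length gapped strings."""
    assert len(top) == len(bottom)
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        else:
            # The table is symmetric, so try both orders.
            total += SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]
    return total


score_cw = alignment_score("HRLPCYVRRTS", "HRLPW-VRRTS")  # C aligned with W
score_yw = alignment_score("HRLPCYVRRTS", "HRLP-WVRRTS")  # Y aligned with W
print(score_cw, score_yw)
```

Both alignments contain exactly one gap, so the alignment that pairs Y with W comes out 4 points ahead whatever gap score is assumed.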

BLOSUM (Henikoff and Henikoff, 1992)
- Henikoff and Henikoff started with a set of protein sequences from public databases that had been grouped into related families.
- From these sequences, they obtained blocks of aligned sequences.
- Note: a block is the ungapped alignment of a relatively highly conserved region of a family of proteins.
- These blocks provide the basic data for the BLOSUM approach.

Amino Acid Frequency
- If all amino acids occurred with equal frequency, we could just count the number of times each possible substitution occurs and use the relative frequencies as scores.
- However, amino acids do not occur with equal frequency, and this would bias the results.

An Illustration
Suppose there are only three amino acids: A, B and W. Over time, we have seen that:
- W never substitutes for another amino acid.
- A has a 50% chance of substituting for B, and vice versa.
- W is very rare and occurs only 1 time out of a thousand.

From our trusted data: 1000 aligned pairs, with 1 W-W pair, 250 A-A, 499 A-B and 250 B-B. Looking at the proportions, we would get:
  A-A  .25
  A-B  .499
  B-B  .25
  W-W  .001
  (W-A and W-B are 0)
This would give W-W a very low number compared to A-A, A-B and B-B, even though W is never substituted.
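One way to see why the raw proportions mislead is to divide each observed proportion by the proportion that chance alone would give. The sketch below does this for the illustration's counts. The chance model (two residues drawn independently from the background frequencies) and all variable names are assumptions made for illustration.

```python
from fractions import Fraction

# Pair counts from the illustration: 1000 aligned pairs in total.
pair_counts = {("A", "A"): 250, ("A", "B"): 499, ("B", "B"): 250, ("W", "W"): 1}
total_pairs = sum(pair_counts.values())

# Background frequency of each amino acid: every pair contributes two residues.
residue_counts = {}
for (x, y), n in pair_counts.items():
    residue_counts[x] = residue_counts.get(x, 0) + n
    residue_counts[y] = residue_counts.get(y, 0) + n
freq = {r: Fraction(n, 2 * total_pairs) for r, n in residue_counts.items()}

# Ratio of the observed pair proportion to the proportion expected by chance.
odds = {}
for (x, y), n in pair_counts.items():
    observed = Fraction(n, total_pairs)
    expected = freq[x] * freq[y] * (1 if x == y else 2)
    odds[(x, y)] = observed / expected

for pair, ratio in odds.items():
    print(pair, float(ratio))
```

Under this model the W-W pair is seen 1000 times more often than chance predicts, while A-A, A-B and B-B all sit close to 1.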

To Generate Scores
- We must take into consideration the relative rarity of the amino acids.
- We do this by considering the ratio of the proportion of times a substitution occurs in the blocks to the proportion expected given the background frequencies.

A simple example with one block
  BABA
  AAAC
  AACC
  AABA
  AACC
  AABC

From our example
There is one block with 24 observed amino acids: 14 are A, 4 are B and 6 are C. This gives the observed proportions:
  A  14/24
  B  4/24
  C  6/24

Possible Aligned Pairs
Next we need to determine how many aligned pairs of amino acids are in the block. The number is $w\,s(s-1)/2$, where $w$ is the width of the block (in amino acids) and $s$ is its depth (the number of sequences). In our case $w = 4$ and $s = 6$, so there are $24 \cdot 5 / 2 = 60$ aligned pairs.
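The residue counts and the number of aligned pairs in the example block can be checked by brute force. This is a small sketch; the variable names are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations

block = ["BABA", "AAAC", "AACC", "AABA", "AACC", "AABC"]
width, depth = len(block[0]), len(block)

# Residue counts over the whole block.
residue_counts = Counter("".join(block))

# Count unordered aligned pairs column by column.
pair_counts = Counter()
for col in range(width):
    column = [seq[col] for seq in block]
    for x, y in combinations(column, 2):
        pair_counts["".join(sorted(x + y))] += 1

# Closed form: w * s * (s - 1) / 2 pairs in a block of width w and depth s.
total_pairs = width * depth * (depth - 1) // 2
print(residue_counts)
print(pair_counts, total_pairs)
```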

Observed Proportions for Pairs
  Aligned pair   Proportion of time observed
  A to A         26/60
  A to B         8/60
  A to C         10/60
  B to B         3/60
  B to C         6/60
  C to C         7/60

Expected Proportion of Pairs
- This is the expected proportion of times that each amino acid pair is aligned under a random assortment of the observed amino acids, given the observed amino acid frequencies.
- That is, if we choose two sequences of the same length at random with these frequencies and align them, the expected proportion of pairs in which A is aligned with A is 14/24 * 14/24, and the expected proportion of pairs in which A is aligned with B is 2 * 14/24 * 4/24.
- Why is there a factor of 2 in the expected proportion for A to B? (An A-B pair can arise with A in the first sequence and B in the second, or the other way round.)

Generating Likelihood
- We use these expected proportions and observed proportions to calculate estimated likelihood ratios.
- The score LR is twice the base-2 logarithm of the ratio of the observed proportion to the proportion expected from random sequences with the same frequencies:

  $LR = 2 \log_2\left(\dfrac{\text{proportion obs.}}{\text{proportion exp.}}\right)$

  Aligned pair   Prop. obs.   Prop. exp.   LR
  A-A            26/60        196/576       0.70
  A-B            8/60         112/576      -1.09
  A-C            10/60        168/576      -1.61
  B-B            3/60         16/576        1.70
  B-C            6/60         48/576        0.53
  C-C            7/60         36/576        1.80
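The log-odds calculation for the one-block example can be reproduced in a few lines. This is a sketch that takes the residue and pair counts of the example as given; the dictionary names are chosen here for illustration.

```python
import math

# Counts from the one-block example: 24 residues and 60 aligned pairs.
residue_counts = {"A": 14, "B": 4, "C": 6}
pair_counts = {"AA": 26, "AB": 8, "AC": 10, "BB": 3, "BC": 6, "CC": 7}

n_residues = sum(residue_counts.values())
n_pairs = sum(pair_counts.values())

scores = {}
for pair, count in pair_counts.items():
    x, y = pair
    observed = count / n_pairs
    p_x = residue_counts[x] / n_residues
    p_y = residue_counts[y] / n_residues
    # An unordered pair of different residues can arise in two orders.
    expected = p_x * p_y * (1 if x == y else 2)
    scores[pair] = 2 * math.log2(observed / expected)

for pair, lr in scores.items():
    print(pair, round(lr, 2), round(lr))
```

The last printed column is the LR rounded to the nearest integer, which is the form in which scores are usually stored in a substitution matrix.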

Entries in the Substitution Matrix
Take the LR and round to the nearest integer. In our example:
       A    B    C
  A    1
  B   -1    2
  C   -2    1    2

Some Comments
- While simplistic, this scheme is more useful than simply scoring 1 for a match and 0 for a mismatch.
- However, the result depends heavily on which sequences of each family happen to be in the database used to create the blocks.
- Some closely related groups could be overrepresented.

Use Scaling on Similar Groups
- Within a block, cluster sequences with x% similarity (e.g., 85%).
- Scale each residue frequency by 1/(cluster size).
- Scale each match by 1/((cluster A size) * (cluster B size)), and don't count matches within a cluster.

Suppose we had the following data:
  ABAA
  ABAC
  ABAA
  ABDA
  ABDC
  ACDC
  DBDC
- Each A counts 1/4
- Each A counts 1/3
- Each match counts 1/
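A sketch of the cluster weighting, under a loudly stated assumption: the seven sequences are taken to form two clusters, the first four and the last three. That split is a guess consistent with the weights 1/4 and 1/3 quoted above, not something the slide states. All names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed clustering of the seven sequences: the first four in one cluster,
# the last three in another (the source does not state the cluster boundaries).
clusters = [
    ["ABAA", "ABAC", "ABAA", "ABDA"],
    ["ABDC", "ACDC", "DBDC"],
]
width = len(clusters[0][0])

# Each residue counts 1 / (size of its cluster).
residue_weight = defaultdict(Fraction)
for cluster in clusters:
    for seq in cluster:
        for residue in seq:
            residue_weight[residue] += Fraction(1, len(cluster))

# Pairs are formed only between clusters; each counts 1 / (size A * size B).
pair_weight = defaultdict(Fraction)
first, second = clusters
for col in range(width):
    for s in first:
        for t in second:
            pair = "".join(sorted(s[col] + t[col]))
            pair_weight[pair] += Fraction(1, len(first) * len(second))

print(dict(residue_weight))
print(dict(pair_weight))
```

With this weighting each cluster contributes as much as a single sequence would, so a family with many near-identical members no longer dominates the counts.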

In actual practice
- Henikoff and Henikoff conducted an iterative cycle, which started with a unitary matrix (1 = match, 0 = mismatch) to generate the initial blocks of sequences.
- The sequences in these blocks are then clustered at X% similarity, and the BLOSUM approach is used to generate a substitution matrix.
- Repeat.
- If a similarity threshold of .85 is adopted for clustering, the final matrix is BLOSUM85.
- Choosing which matrix to use for a database search often requires knowledge about the evolutionary distance between the sequences of interest. With no information, BLOSUM62 is often the default.

Substitution/Scoring Matrices
- PAM matrices (Dayhoff et al. 1978) are phylogeny-based.
- PAM1: expected number of point mutations = 1 per 100 amino acids.
[Figure: PAM250 matrix, log-odds representation]

Why do we need these matrices?
- Database searching needs different levels of sensitivity.
- Close relationships: low PAM, high BLOSUM.
- Distant relationships: high PAM, low BLOSUM.

Protein vs Nucleotide
- Which molecules should you search with?
- Which databases should you search, nucleotide or protein?

DNA vs. Protein searches
- DNA is composed of 4 characters: A, G, C, T.
- On average, at least 25% of the residues of any 2 unrelated aligned DNA sequences are expected to be identical.
- Protein sequence is composed of 20 characters (amino acids), which improves the sensitivity of the comparison.
- It is generally accepted that convergence of proteins is rare, so high similarity between 2 proteins implies homology.

DNA vs. Protein searches (continued)
- What about very different DNA sequences that code for similar protein sequences? We certainly do not want to miss those.
- Conclusion: we may have more to gain by using proteins for database similarity searches when possible.

DNA vs. Protein searches (continued)
The reasons for this conclusion are:
- When comparing DNA sequences, we get significantly more random matches than we get with proteins.
- The DNA databases are much larger, and grow faster, than the protein databases. As the size of the database increases, so does the probability of random hits.
- For DNA we usually use identity matrices; for protein we use more sensitive matrices like PAM and BLOSUM, which allow for better search results.
- Proteins are conserved in evolution: they are rarely mutated.

Main algorithms for database searching
- FastA: better for nucleotides than for proteins.
- BLAST (Basic Local Alignment Search Tool): better for proteins than for nucleotides.
- Smith-Waterman: more sensitive than FastA or BLAST.

Specificity and sensitivity
- Sensitivity: the ability to detect true positive matches. The most sensitive search finds all true matches, but might have lots of false positives.
- Specificity: the ability to reject false positive matches. The most specific search will return only true matches, but might have lots of false negatives.

BLAST Algorithm
- Looks for the best match of part of the query string against part of the database (local alignment).
- Adds an evaluation of how likely such a match is to happen by chance.
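The definitions of sensitivity and specificity given above are commonly expressed as ratios of counts. The sketch below uses those standard ratios; the counts in the example are hypothetical and only there to show the arithmetic.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of the true matches that the search reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of the non-matches that the search correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 100 database sequences are truly related to the query
# and 900 are not; the search reports 90 of the related and 45 of the unrelated.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=855, false_pos=45)
print(sens, spec)
```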

Consider a Simple Example
Two DNA sequences:
  ggagactgtagacagctaatgctata
  gaacgccctagccacgagcccttatc
- If the sequences were random, we would expect 1/4 of the positions to match.
- We have 11/26 matches. Is this significant?
- It turns out that the probability of getting 11/26 (or better) for random sequences is about

Matching by Exhaustive Comparison
- Query length 200, DB size 20M
- About 20M alignments
- Substrings of query: (200 x 201)/2
- Average length of substring:
- Comparisons: 75 x 20K x 20M
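The significance question in the simple DNA example can be explored with a binomial tail. This is a sketch under an assumed chance model: the 26 positions are treated as independent, each matching with probability 1/4. The variable names are illustrative.

```python
from math import comb

seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"

n = len(seq1)
matches = sum(a == b for a, b in zip(seq1, seq2))

# Chance model (an assumption): positions are independent and each matches
# with probability 1/4.
p = 0.25
tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(matches, n + 1))
print(matches, n, round(tail, 3))
```

Under that model the chance of 11 or more matches out of 26 comes out at roughly 0.04.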

Strategy
- BLAST uses a heuristic to find the maximal segment pair (MSP): the highest-scoring pair of segments of the same length L from the query and the database (or HSPs: high-scoring segment pairs).
- Idea: if the MSP has a high score, then that pair will contain a short region that scores well.

Example
  K K I H I S K N N G Y L V D G
  K P N N I T K S N G F L A L G

  K N N G
  K S N G

Basic Algorithm
1. Generate all short segments of the query sequence.
2. Find all closely matching short segments in the database.
3. For each match ("hit"), see if it can be extended to a high-scoring alignment.

What's short? What's a close match?
Two main parameters to the algorithm:
- w: length of short region
- T: minimum score for a close match
For our examples, w = 4 and T = 15 (by BLOSUM62).
Query string:
  A N F K K I H I S K N N G Y L V D G K N F V
so consider its segments of length 4.
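Step 1 of the basic algorithm, generating all short segments of the query, is a sliding window. A minimal sketch follows; the function name `query_words` and the 0-based positions are choices made here.

```python
def query_words(query, w):
    """Return (position, word) for every length-w segment of the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


query = "ANFKKIHISKNNGYLVDGKNFV"
words = query_words(query, w=4)
print(len(words), words[:3])
```

A query of length n yields n - w + 1 words, so the 22-residue example gives 19 words for w = 4.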

The Word List
Precompute which other strings have a match score ≥ T (≥ 15 here):
  A N F K    A N F K
  A N F K    A S F K
  A N F K    A S F C

  K N N G    K N N G
  K N N G    K S N G
  K N N G    K S S G

Find All Occurrences in DB
A. Build a finite automaton from the word list and run it against the database, or
B. Index all segments of length w in the database (with their locations) and look up the segments in the word list.
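The word-list test can be sketched as a position-by-position sum compared against T. The scores below are the BLOSUM62 entries for the residue pairs that occur in the example words, copied by hand. Only the candidate words shown above are tested, whereas a real word list would enumerate every possible word. The names `SCORES` and `word_score` are illustrative.

```python
# Hand-copied BLOSUM62 entries for the residue pairs used below.
SCORES = {
    ("A", "A"): 4, ("N", "N"): 6, ("F", "F"): 6, ("K", "K"): 5,
    ("G", "G"): 6, ("N", "S"): 1, ("K", "C"): -3,
}
T = 15  # minimum score for a close match, as in the lecture's example


def word_score(word, candidate):
    """Sum of position-by-position substitution scores for two words."""
    total = 0
    for a, b in zip(word, candidate):
        total += SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]
    return total


candidates = {"ANFK": ["ANFK", "ASFK", "ASFC"], "KNNG": ["KNNG", "KSNG", "KSSG"]}
word_list = {
    word: [c for c in cands if word_score(word, c) >= T]
    for word, cands in candidates.items()
}
print(word_list)
```

With these scores ASFK and KSNG clear the threshold of 15, while ASFC and KSSG fall below it.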

Extending a Hit
- Whenever we find a match, try to extend it on each end.
- How far? Should we go past negative-scoring letter pairs? There might be some positive scores farther on.
  A N F K K I H I S K N N G Y L V D G K N F V
  S V P K P N N I T K S N G F L A L G H G P C

Allowing Dips
- Third parameter, X: how far we will let the score go downhill while looking for an extension.
- X = 2: willing to go 2 down from the maximum so far while looking for pairs that give an increase.
- Report the match if its score is sufficiently high.
  A N F K K I H I S K N N G Y L V D G K N F V
  S V P K P N N I T K S N G F L A L G H G P C
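A sketch of ungapped extension with the dip parameter X, shown for the right-hand end only (the left-hand end is the mirror image). To keep it self-contained it uses a toy scoring function, +2 for a match and -1 for a mismatch, in place of BLOSUM62. The function names and the toy sequences are assumptions made for illustration.

```python
def toy_score(a, b):
    """Toy stand-in for a substitution matrix: +2 match, -1 mismatch."""
    return 2 if a == b else -1


def extend_right(query, subject, q_end, s_end, start_score, X, score):
    """Extend a hit to the right without gaps, giving up after an X-drop.

    q_end and s_end index the last matched position of the hit in each
    sequence. Returns (best_score, number_of_columns_added).
    """
    best = current = start_score
    best_len = 0
    i, j, added = q_end + 1, s_end + 1, 0
    while i < len(query) and j < len(subject):
        current += score(query[i], subject[j])
        added += 1
        if current > best:
            best, best_len = current, added
        elif best - current > X:
            break  # fallen more than X below the best score seen so far
        i += 1
        j += 1
    return best, best_len


# The hit is ABCD at positions 0..3 of both toy sequences (score 4 * 2 = 8).
with_dip = extend_right("ABCDXEFQQQQ", "ABCDYEFRRRR", 3, 3, 8, 2, toy_score)
no_dip = extend_right("ABCDXEFQQQQ", "ABCDYEFRRRR", 3, 3, 8, 0, toy_score)
print(with_dip, no_dip)
```

With X = 2 the extension steps over the single mismatch and picks up the two matches behind it. With X = 0 it stops at the mismatch and the hit is not extended.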

Effect of Parameters
Higher T
- Smaller word list
- Less sensitive
Larger w
- Larger word list
- More sensitive
Larger X
- Match extensions take longer to check
- May find longer matches in some cases

Refinement: Two-Hit Method
- Profiling shows that most of the time in BLAST (90%) is spent extending hits.
- So require two non-overlapping hits close together on the same diagonal to trigger an extension.

Define Terms
- Close together: parameter A.
- On the same diagonal: the hits could be part of the same alignment. Given a hit at (i, j), is there another at (i+k, j+k)?
- Must lower T to retain similar sensitivity.
- Cuts time in half.

Example: A = 10, T =
  H I S K ... G Y L V
  N I T K ... G F L A
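The two-hit test can be sketched by grouping hits by diagonal (i - j) and comparing neighbouring hits. Two simplifications are assumed here: distance is measured between the start positions of the two words, and only consecutive hits on a diagonal are compared. The function name and the hit coordinates (0-based positions of HISK/NITK and GYLV/GFLA in the example sequences) are illustrative.

```python
from collections import defaultdict


def two_hit_pairs(hits, w, A):
    """Find pairs of non-overlapping hits on one diagonal at most A apart.

    hits is a list of (i, j) start positions: a word at query offset i
    matches a word at database offset j.
    """
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[i - j].append(i)

    pairs = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        for first, second in zip(starts, starts[1:]):
            gap = second - first
            if w <= gap <= A:  # gap >= w means the two words do not overlap
                pairs.append(((first, first - diagonal), (second, second - diagonal)))
    return pairs


# HISK/NITK start at (6, 6) and GYLV/GFLA at (12, 12); (2, 9) is a stray hit.
pairs = two_hit_pairs([(6, 6), (12, 12), (2, 9)], w=4, A=10)
print(pairs)
```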

Sequence Alignment Roadmap
- Trying to determine the similarity or differences of two sequences
- Variations
- Longest Common Subsequence (LCS)
- Dynamic programming solution
- Incorporating mismatches
- Local alignment
- Multiple sequence alignment

Variations
- Kind of sequence: DNA, amino acid
- Distance or similarity measures
  - Insertion-deletions (indels) only
  - Substitution costs (e.g., BLOSUM62)
  - Gap costs: linear, affine, other

More Variations
- Scope
  - Global: entirety of two sequences
  - Local: best-matching segments
- Multiplicity
  - 2 sequences
  - k sequences

Longest Common Subsequence (LCS)
- LCS is a similarity measure (larger = better match).
- $x = a_1 a_2 \cdots a_m$ is a subsequence of $y = b_1 b_2 \cdots b_n$ if you can obtain x from y by deleting 0 or more characters.
- For the sequence SOMETIMES, some subsequences are

Common Subsequence
- x is a common subsequence of y and z if x is a subsequence of y and x is a subsequence of z.
- For the sequences BECAUSE and ESCAPE, some common subsequences are
- A longest common subsequence of y and z is a common subsequence x of greatest length.

Threading Schemes
- I like to think of characters as beads to be threaded.
- Each thread must go through a same-named bead on each sequence.
- Threads can't cross.
- Goal: the most threads possible.
  B E C A U S E
  E S C A P E

You Try It
  M I S C E L L A N E O U S
  S C A L L I O N S

  C O L L I N E A R
  S C A L L I O N S

Alignments
- Biologists like to line up the sequences, using - to represent an indel:
  B E - C A U S - E
  - E S C A - - P E
- However, there can be several alignments for the same LCS:
  B E - C A U - S E
  - E S C A - P - E

Shortest Common Supersequence (SCS)
- An SCS of y and z is a shortest sequence x such that both y and z are subsequences of x.
- Similar threading game, except:
  - A thread doesn't need a bead from every sequence.
  - All beads must be threaded.
  - Fewer threads is better.
  B E C A U S E
  E S C A P E

You Try It
  M I S C E L L A N E O U S
  S C A L L I O N S
  C O L L I N E A R

Can Find LCS Recursively
$x = a_1 a_2 \cdots a_m$, $y = b_1 b_2 \cdots b_n$
$LCS(i, j)$ is the length of the LCS of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$.

$LCS(0, j) = LCS(i, 0) = 0$
$LCS(i, j) = \max(LCS(i-1, j),\ LCS(i, j-1),\ LCS(i-1, j-1) + 1)$ if $a_i = b_j$
$LCS(i, j) = \max(LCS(i-1, j),\ LCS(i, j-1))$ otherwise

Examples
x = BECAUSE, y = ESCAPE
$LCS(2, 3) = \max(LCS(2, 2),\ LCS(1, 3))$
$LCS(3, 3) = \max(LCS(3, 2),\ LCS(2, 3),\ LCS(2, 2) + 1)$
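The recurrence can be transcribed almost literally into a recursive function. This sketch adds memoization through `functools.lru_cache` so that repeated sub-problems are computed only once; the function names are chosen here.

```python
from functools import lru_cache


def lcs_length(x, y):
    """Length of a longest common subsequence, following the recurrence."""

    @lru_cache(maxsize=None)
    def lcs(i, j):
        # lcs(i, j) looks at the first i characters of x and first j of y.
        if i == 0 or j == 0:
            return 0
        best = max(lcs(i - 1, j), lcs(i, j - 1))
        if x[i - 1] == y[j - 1]:  # a_i == b_j, with 1-based i and j
            best = max(best, lcs(i - 1, j - 1) + 1)
        return best

    return lcs(len(x), len(y))


print(lcs_length("BECAUSE", "ESCAPE"))
```

For BECAUSE and ESCAPE this returns 4, the length of the common subsequence ECAE.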

Dynamic Programming
- Use for recursive computation with repeated sub-problems.
- Compute answers to all sub-problems.
- Do smaller sub-problems first.
- Remember the results.

Example: LCS table for BECAUSE (rows, index i) and ESCAPE (columns, index j), with column 0 initialized to 0
         j       E  S  C  A  P  E
  i
  B  1   0
  E  2   0
  C  3   0
  A  4   0
  U  5   0
  S  6   0
  E
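The dynamic programming idea can be sketched as a table filled row by row, followed by a walk back through the table to recover one LCS. When the two characters match, the sketch takes the diagonal value plus 1 directly, which gives the same result as taking the maximum of the three candidates. The function names are chosen here.

```python
def lcs_table(x, y):
    """Fill the (len(x)+1) by (len(y)+1) table of LCS lengths row by row."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def traceback(table, x, y):
    """Walk back from the bottom-right cell to recover one LCS."""
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


table = lcs_table("BECAUSE", "ESCAPE")
for row in table:
    print(row)
print(traceback(table, "BECAUSE", "ESCAPE"))
```

Filling the table takes time proportional to the product of the two sequence lengths, since every sub-problem is computed exactly once.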
