CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018
SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE
"Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution." (Wikipedia)

COMPARING SEQUENCES
- a cornerstone of sequence analysis
- aims to identify sequence relatedness
- ONLY homologous sequences (derived from the same ancestor) can be meaningfully compared
- homologous sequences usually, but not necessarily, have similar functions and similar sequences

HOMOLOGY IN STRUCTURES
- structures can have similar shapes for two reasons: homology and homoplasy
- homology = the structures share the same ancestor
- homoplasy = the structures are similar but not derived from the same ancestor

HOMOLOGY IN SEQUENCES
ACTGTACTCGCATCG
ACTATACTCTCATTG  species A
ACTGTTCTCCCATCA  species B

DEGREES OF HOMOLOGY
- Homology is qualitative!
- Paralogs: homologous genes that diverged from each other after gene duplication
- Orthologs: homologous genes that originated from a single ancestral gene and diverged through speciation
- Xenologs: homologous genes acquired via horizontal gene transfer (HGT)
Koonin (2005) Annu. Rev. Genet.

SEQUENCE ALIGNMENT
ACTATACTCTCATTG
ACTGTTCTCCCATCA

DOT PLOT
sequence 1  ACCTCGTGCA
sequence 2  ACTTAGTCCA
[Figure: dot plot with sequence 1 on one axis and sequence 2 on the other]
sequence 1  ACCT-CGTGC-A
sequence 2  AC-TTAGT-CCA

DOT PLOT
[Figure: dot plot of seq_1 against seq_2 with a high background]
- too many dots (high background) = no information
- How can we handle this problem?

GENERAL PARAMETERS FOR DOT PLOT
- Window size = length of the subsequences being compared
- Window sliding = step by which the window is moved along the sequences
- Threshold or mismatch = cut-off for drawing a dot (normally a similarity score is used as the cut-off)
[Figure: a window sliding along the two sequences below; the example windows score 70%, 70% and 80%]
TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTG
TAAATGGCAGTACAGCTGTTAGGCCCATCGTGGCTAAGATCAGGCTCCAAATAGGTCCAGTTCCC

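The window and threshold parameters can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular dot plot program: the names `dot_plot` and `render` are hypothetical, the window is assumed to slide one position at a time, and the threshold is a plain count of identical positions rather than a similarity score. The sequences are the two short ones from the earlier dot plot slide.

```python
def dot_plot(seq1, seq2, window=3, threshold=2):
    """Return the set of (i, j) cells where the windows starting at
    seq1[i] and seq2[j] share at least `threshold` identical positions."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: seq1 across the top, seq2 down the side."""
    rows = ["  " + seq1]
    for j, residue in enumerate(seq2):
        cells = "".join("*" if (i, j) in dots else "." for i in range(len(seq1)))
        rows.append(residue + " " + cells)
    return "\n".join(rows)


seq1 = "ACCTCGTGCA"
seq2 = "ACTTAGTCCA"

# window=1, threshold=1 marks every identical pair (high background)
unfiltered = dot_plot(seq1, seq2, window=1, threshold=1)
# a longer window with a threshold removes part of the background
filtered = dot_plot(seq1, seq2, window=3, threshold=2)

print(render(seq1, seq2, unfiltered))
print()
print(render(seq1, seq2, filtered))
```

Comparing the two printed plots shows how a larger window with a threshold thins out the background dots while keeping the cells along the diagonal.
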
PRACTICAL HINTS FOR DOT PLOT
- a window of 10-20 residues is a good place to start
- when comparing very large sequences, larger windows (>30 to about 100 residues) may be useful
- a good practical rule is to make plots that have 3-5 times as many dots as the length of the sequences (e.g., 3000-5000 dots for a 1000-base sequence)

DOT PLOT
[Figure: dot plot of sequence 1 against sequence 2; horizontal offsets in the diagonal mark indels]

INTERPRETATION OF DOT PLOT (1)
[Figure: dot plot of sequence 1 against sequence 2]
- highly similar sequences give a single diagonal line
- needs noise (or background) reduction

INTERPRETATION OF DOT PLOT (2)
[Figure: dot plot of sequence 1 against sequence 2; domain identification]

EXON AND INTRON
[Figure from http://myhits.isb-sib.ch/util/dotlet]

INTERPRETATION OF DOT PLOT (3)
[Figure: dot plot of sequence 1 against sequence 2; inversion]

INTERPRETATION OF DOT PLOT (4)
[Figure: dot plot of sequence 1 against sequence 2; repeat]

REPEATED PROTEIN DOMAINS
[Figure from http://myhits.isb-sib.ch/util/dotlet]

INTERPRETATION OF DOT PLOT (5)
[Figure: dot plot of sequence 1 against sequence 2; palindromic sequence]

TERMINATORS AND OTHER STEM-LOOP STRUCTURES
[Figure from http://myhits.isb-sib.ch/util/dotlet]

INTERPRETATION OF DOT PLOT (6)
[Figure: dot plot of sequence 1 against sequence 2; low complexity regions, e.g. AAAAAAAAAAAAAA]

LOW-COMPLEXITY REGIONS
Plasmodium falciparum serine-repeat antigen protein precursor
[Figure from http://myhits.isb-sib.ch/util/dotlet]

GAPS IN ALIGNMENT
- gaps do not exist in nature
- gaps make the comparison difficult
- a gap in a sequence alignment most likely represents an indel
- the accuracy of the alignment determines the accuracy of the inferred indels
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
gap ~ indel (insertion/deletion)

SCORING PAIRWISE SEQUENCE ALIGNMENT FOR DNA SEQUENCES
- the easiest scoring method is match scoring
seq1  ATTCGTCGTAGCTAGGCTAA
seq2  ATTGGCCGTACCATGGATAA
match = 14 positions
similarity score = 14
- Normalized score
seq1  ATTCGTCGTAGCTAGGCTAA
seq2  ATTGGCCGTACCATGGATAA
match = 14 positions
mismatch = 6 positions
total length = 20 positions
similarity score = 70%

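The match score and its normalized form can be checked with a short Python sketch. The function names `match_score` and `percent_identity` are hypothetical; the two sequences are the ones on the slide, assumed to be already aligned without gaps.

```python
def match_score(seq1, seq2):
    """Count the positions at which two equal-length sequences agree."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(seq1, seq2))


def percent_identity(seq1, seq2):
    """Normalize the match score by the alignment length."""
    return 100.0 * match_score(seq1, seq2) / len(seq1)


seq1 = "ATTCGTCGTAGCTAGGCTAA"
seq2 = "ATTGGCCGTACCATGGATAA"

matches = match_score(seq1, seq2)
print("match =", matches, "positions")
print("mismatch =", len(seq1) - matches, "positions")
print("similarity score =", percent_identity(seq1, seq2), "%")
```

The normalized score makes alignments of different lengths comparable, which the raw match count does not.
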
SCORING PAIRWISE SEQUENCE ALIGNMENT FOR PROTEIN SEQUENCES
MAATPTVLLFWKLLDEVFMA
MAVTPLVLFFWKLVDEVFMA
80% identity
- idea = replacing an amino acid with one that has the same physicochemical properties should not change the structure of the protein
MAATPTVLLFWKLLDEVFMA
MAVTPLVLFFWKLVDEVFMA
two of the four mismatched positions are marked "+" (similar residues)
90% similarity

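The difference between identity and similarity can be sketched as follows. The residue groups in `SIMILAR_GROUPS` are a hypothetical, simplified stand-in for a real substitution matrix, chosen only for illustration; with a different grouping or matrix the similarity value may differ, while the identity value stays the same.

```python
# Hypothetical physicochemical groups (an assumption for this sketch,
# not a published classification or substitution matrix).
SIMILAR_GROUPS = [
    set("AVLI"),  # aliphatic
    set("FWY"),   # aromatic
    set("ST"),    # small hydroxyl
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("NQ"),    # amide
]


def is_similar(a, b):
    """Identical residues, or residues in the same group, count as similar."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity(seq1, seq2):
    return 100.0 * sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)


def percent_similarity(seq1, seq2):
    return 100.0 * sum(is_similar(a, b) for a, b in zip(seq1, seq2)) / len(seq1)


prot1 = "MAATPTVLLFWKLLDEVFMA"
prot2 = "MAVTPLVLFFWKLVDEVFMA"

print("identity   =", percent_identity(prot1, prot2), "%")
print("similarity =", percent_similarity(prot1, prot2), "%")
```

With these assumed groups, the A/V and L/V mismatches count as similar and the T/L and L/F mismatches do not, which reproduces the 80% and 90% figures on the slide.
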
CONFUSING TERMS
Identity
- proportion of pairs of identical characters between 2 sequences
- strongly depends on how the two sequences are aligned
Similarity
- proportion of pairs of similar characters between 2 sequences
- similarity is determined by a substitution matrix
- strongly depends on how the two sequences are aligned and on the matrix used
Homology
- two sequences are homologs if they share the same ancestor
- homology cannot be scored (the answer is yes or no ONLY)

ALIGNMENT EVENT AND MUTATION EVENT
- Match -> no mutation
- Mismatch -> substitution
- Gap -> insertion/deletion (InDel)

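The mapping from alignment columns to mutation events can be illustrated with a small Python sketch. The function name `count_events` is hypothetical, and the gapped alignment is the one shown on the first dot plot slide.

```python
from collections import Counter


def count_events(aligned1, aligned2):
    """Classify each alignment column as a match, mismatch, or gap."""
    events = Counter()
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            events["gap"] += 1        # interpreted as an insertion/deletion
        elif a == b:
            events["match"] += 1      # interpreted as no mutation
        else:
            events["mismatch"] += 1   # interpreted as a substitution
    return events


aligned1 = "ACCT-CGTGC-A"
aligned2 = "AC-TTAGT-CCA"

events = count_events(aligned1, aligned2)
for name in ("match", "mismatch", "gap"):
    print(name, "=", events[name])
```
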
SUBSTITUTION MUTATION IN DNA
original DNA seq.    TAC CTG AGC CAA    Tyr Leu Ser Gln
silent mutation      TAC CTC AGC CAA    Tyr Leu Ser Gln
missense mutation    TAC CTG CGC CAA    Tyr Leu Arg Gln
nonsense mutation    TAC CTG AGC TAA    Tyr Leu Ser (stop)
(CTA also codes for Leu)

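The three kinds of substitution can be told apart by translating both sequences and comparing the results. The sketch below is a minimal illustration: `CODON_TABLE` deliberately contains only the codons that appear on this slide (a full genetic code table would be needed for real data), and the function names `translate` and `classify` are hypothetical.

```python
# Partial standard genetic code: only the codons used on the slide.
CODON_TABLE = {
    "TAC": "Tyr",
    "CTG": "Leu", "CTC": "Leu", "CTA": "Leu",
    "AGC": "Ser",
    "CGC": "Arg",
    "CAA": "Gln",
    "TAA": "Stop",
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def classify(original, mutant):
    """Classify a substitution by its effect on the translated protein."""
    for ref, alt in zip(translate(original), translate(mutant)):
        if ref != alt:
            return "nonsense" if alt == "Stop" else "missense"
    return "silent" if original != mutant else "no mutation"


original = "TACCTGAGCCAA"
for mutant in ("TACCTCAGCCAA", "TACCTGCGCCAA", "TACCTGAGCTAA"):
    print(mutant, "->", " ".join(translate(mutant)), "->", classify(original, mutant))
```
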
NUCLEOTIDE SUBSTITUTION
- sequences that share a common ancestor gradually diverge
- divergence is very difficult to observe directly
- sequence divergence = proportion (p) of nucleotide sites at which two sequences differ
ACTGTACTCGCATCG
ACTATACTCTCATTG
ACTGTTCTCCCATCA

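The proportion p can be computed directly for the three sequences on the slide. In this sketch the function name `p_distance` is hypothetical, and the labels are assumptions taken from the earlier homology slide: the first sequence is treated as the ancestor and the other two as species A and species B.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


# Labels are assumed for illustration (see the lead-in above).
sequences = {
    "ancestor":  "ACTGTACTCGCATCG",
    "species A": "ACTATACTCTCATTG",
    "species B": "ACTGTTCTCCCATCA",
}

for (name1, s1), (name2, s2) in combinations(sequences.items(), 2):
    print(f"{name1} vs {name2}: p = {p_distance(s1, s2):.3f}")
```

Each descendant differs from the assumed ancestor at 3 of 15 sites, while the two descendants differ from each other at 5 of 15 sites, because each lineage has accumulated its own changes.
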
EMPIRICAL STUDIES OF AMINO ACID SUBSTITUTION
- several studies have observed amino acid substitutions
- the results show that amino acid substitution is not random
- amino acids with similar chemical properties substitute for each other more often
- some amino acids (e.g., cysteine, glycine and tryptophan) are rarely changed

POINT ACCEPTED MUTATION (PAM)
- proposed in 1978 by Margaret Oakley Dayhoff
- the first substitution matrix for amino acid changes
- one PAM is a unit of evolutionary divergence in which 1% of amino acids have been changed
- if there were no selection for fitness (impossible!!), substitution would be one of the main factors driving protein sequence change
- in observations of related protein sequences, the frequencies of amino acid substitutions are biased towards changes that maintain the function of the protein
- these are the point mutations that have been accepted during evolution

PAM 250 MATRIX
- the 1 PAM matrix was constructed from observed amino acid changes in closely related proteins
- the 1 PAM data were then extrapolated to PAM250
- only PAM250 was published by Dayhoff et al. (1978)
- higher PAM matrices are good for highly divergent sequences; lower PAM matrices are good for conserved sequences
[Figure: PAM250 matrix. Source: BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins]

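The extrapolation step can be illustrated by raising a one-step mutation probability matrix to a power. The sketch below is a toy under stated assumptions: a made-up 3-letter alphabet replaces the 20 amino acids, the probabilities in `PAM1_TOY` are invented so that each residue has a 1% chance of changing per step, and the function names are hypothetical. It shows only the extrapolation, not the construction of a scoring matrix from the resulting probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(matrix, power):
    """Multiply a matrix by itself `power` times (repeated multiplication)."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy one-step matrix: row = original residue, column = replacement.
# Invented numbers; every row sums to 1 and keeps 99% unchanged.
PAM1_TOY = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

pam250_toy = mat_pow(PAM1_TOY, 250)
for row in pam250_toy:
    print(["%.3f" % value for value in row])
```

After 250 steps the diagonal entries are far below 0.99, which reflects why a high-numbered PAM matrix is suited to highly divergent sequences.
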
BLOSUM MATRIX
- amino acid changes were observed using a different strategy from the one used to construct the PAM matrices
- the sequence data are derived from the BLOCKS database
- unlike PAM, BLOSUM used distantly related sequences (PAM used closely related sequences)
- BLOSUM62 matrix (the first BLOSUM matrix): sequences having at least 62% identity are merged into a single sequence
- a higher BLOSUM matrix (e.g., BLOSUM90) is good for comparing very similar sequences; a lower BLOSUM matrix (e.g., BLOSUM30) is for highly divergent sequences

BLOSUM 62 MATRIX
[Figure: BLOSUM62 matrix. Source: BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins]

SUGGESTED USES FOR COMMON SUBSTITUTION MATRICES
[Table from Menlove, Clement, and Crandall: Similarity Searching Using BLAST]

GAP PENALTY
- assumption = indels are rare (they do not occur easily)
- gap opening = penalty applied when a gap is introduced into the alignment
- gap extension = penalty for lengthening a gap, normally counted from the second position of the gap
CCGTATCGTCTATCTACGTGCACTGAT
CCCAATCTTCAATCTACG---TCTGAT
first gap position = gap opening; following gap positions = gap extension

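Gap opening and gap extension can be combined into a score for an existing alignment as sketched below. The numeric values (match +1, mismatch -1, gap opening -5, gap extension -1) and the function name `alignment_score` are assumptions chosen for illustration; real programs use their own defaults. The alignment is the one shown on the slide.

```python
def alignment_score(aligned1, aligned2, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score a gapped alignment with separate opening and extension penalties.

    The first position of a gap costs `gap_open`; every further position
    of the same gap costs `gap_extend`.
    """
    score = 0
    previous_gap_row = None  # which sequence had a gap in the previous column
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            score += gap_extend if previous_gap_row == gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


aligned1 = "CCGTATCGTCTATCTACGTGCACTGAT"
aligned2 = "CCCAATCTTCAATCTACG---TCTGAT"

affine = alignment_score(aligned1, aligned2)
# Charging the opening penalty at every gap position, for comparison
linear = alignment_score(aligned1, aligned2, gap_extend=-5)
print("opening + extension:", affine)
print("opening at every position:", linear)
```

Because extension is cheaper than opening, one gap of three positions is penalized less than three separate gaps, which matches the assumption that a single indel event is more plausible than several.
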
DYNAMIC PROGRAMMING
[Figure from Sean R Eddy, Nature Biotechnology 22, 909-910 (2004)]

BLAST: BASIC LOCAL ALIGNMENT SEARCH TOOL
[Figure from Wishard, Introduction to Bioinformatics: A Theoretical and Practical Approach]

PAIRWISE SEQUENCE ALIGNMENT
- aims to compare 2 sequences
- global alignment: tries to find the best alignment of two sequences across their entire length
- local alignment: tries to find the highly similar region(s) between two sequences
- overlapping alignment: global alignment of two sequences with different sizes

GLOBAL ALIGNMENT
- end-to-end alignment
- may end up with many gaps in the alignment if the 2 sequences differ in size
- not sensitive to the modular nature of proteins
- very sensitive to gap penalties (gap opening and gap extension)
- Needleman-Wunsch algorithm (1970)
5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
5' ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA 3'

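A minimal sketch of global alignment by dynamic programming, in the spirit of the Needleman-Wunsch algorithm, is given below. It is a simplification under stated assumptions: a single linear gap penalty is used instead of separate opening and extension penalties, the scores (match +1, mismatch -1, gap -1) are illustrative, and the function name `needleman_wunsch` is hypothetical. When several alignments share the best score, only one of them is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an end-to-end alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        substitution = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GCATGCG", "GATTACA")
print(best)
print(top)
print(bottom)
```

The whole of both sequences always appears in the output, which is why a global alignment of sequences with very different lengths ends up with many gap positions.
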
LOCAL ALIGNMENT
- finds local regions with a high level of similarity
- more sensitive to the modular nature of proteins
- can be used to search databases
- Smith-Waterman algorithm (1981)
Global Alignment
ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA
ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA
Local Alignment
ACTACTAGATT    ACGGATC    GTACTTTAGAGGCTTGCAACCA
ACTACTAGATT    ACGGATC    GTACTTTAGAGGCTAGCAACCA

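A minimal sketch of local alignment, in the spirit of the Smith-Waterman algorithm, follows. The assumptions are the same kind as for the global case: a linear gap penalty, illustrative scores (match +2, mismatch -1, gap -2), a hypothetical function name `smith_waterman`, and made-up test sequences that share one embedded region. Only the single best-scoring local region is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        substitution = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + substitution:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up sequences: unrelated flanks around a shared core.
local_score, local_a, local_b = smith_waterman("AAAGATTACATTT", "CCCGATTACAGGG")
print(local_score)
print(local_a)
print(local_b)
```

Only the shared core is reported and the unrelated flanks are ignored, which is the behaviour that makes local alignment suitable for modular proteins and database searches.
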
MULTIPLE SEQUENCE ALIGNMENT

PROBLEM OF USING PAIRWISE ALIGNMENT
- good for comparing only two sequences
- the alignment results are hard to understand and interpret when there are more than 2 sequences
- less evolutionary meaning
pairwise alignments:
ATGCTAGTAAGC
ATTCAA-T--GC

ATTCAA-TGC
-TTCTAGCGC

ATGCTAGTAAGC
-TTCTAGC--GC
all three sequences aligned together:
ATGCTAGTAAGC
ATTCAA-T--GC
-TTCTAGC--GC

MULTIPLE SEQUENCE ALIGNMENT (MSA)
- the most useful object in sequence analysis
- in the mid 1980s, MSAs were generated by hand because dynamic programming was (at that time) slow when applied to >3 sequences
- idea: arrange the homologous residues (nucleotides or amino acids) in the same column
- provides more biological information than pairwise sequence alignment

MSA METHODS
- Exact method
- Progressive methods: Clustal, MUSCLE
- Iterative methods: MAFFT
- Consistency based methods: T-Coffee, ProbCons
- Structure based methods: 3D-Coffee

MSA METHODS
[Figure from Sviatopolk-Mirsky Pais et al. (2014) Algorithms for Molecular Biology]

PROGRESSIVE ALIGNMENT
[Figure: progressive alignment using dynamic programming]

THE CLUSTAL SERIES
- ClustalW was published by Thompson et al. in 1994
- versions: ClustalW, ClustalX
- the older Clustal programs are obsolete, but their algorithm is good for understanding how MSA algorithms work
- generates a guide tree, then performs a progressive alignment based on that guide tree
- Latest: Clustal Omega

MUSCLE ALIGNMENT PROGRAM
- MUltiple Sequence Comparison by Log-Expectation (MUSCLE) was published by Edgar RC in 2004
- step I: progressive alignment
- step II: improve progressive alignment
- step III: refinement
- very easy command line
- improved speed and accuracy (based on SP method)

MUSCLE ALIGNMENT PROGRAM
[Figure: MUSCLE alignment program]

CHOOSING THE RIGHT MSA PROGRAM
[Figure from Chagoyen M (2013) Sequence Analysis and Structure Prediction Service]

QUESTIONS?