Biosequence Alignment
Ying Xu (徐鹰)
Department of Biochemistry, University of Georgia
College of Computer Science, Jilin University
Bio sequences
Sequences could be DNA, protein and RNA sequences.
DNA sequence (consisting of 4 letters: A, C, G, T)
Ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtg
RNA sequence (consisting of 4 letters: A, C, G, U)
Protein sequence (consisting of 20 letters: A, C, D, ..., Y)
Central Dogma of Biology
Sequence Homology
Genes that have evolved from a common ancestor generally have sequence-level similarity, i.e., similar sequences.
Similar sequences tend to have similar biological functions.
Through sequence comparison, one can infer whether two sequences may have the same or related functions.
Sequence Homology
Through multiple sequence alignment, one can possibly derive the functional sites of a sequence.
In biology, only useful things will be preserved.
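
To make the idea of conserved positions concrete, here is a minimal Python sketch. The three aligned sequences are made-up toy data, and `conserved_columns` is a hypothetical helper name rather than a function from any library. It reports the alignment columns in which every sequence carries the same letter; such columns are candidates for functional sites, not proof of them.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where all rows share one non-gap letter."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {row[i] for row in alignment}
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Toy multiple sequence alignment (hypothetical data, for illustration only).
toy_alignment = [
    "FDSK-THRGHR",
    "FESYWTH-GHR",
    "FDSY-THKGHR",
]
print(conserved_columns(toy_alignment))  # [0, 2, 5, 6, 8, 9, 10]
```
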
Sequence Homology
Bio Sequence Comparison
DNA sequence alignment: aligning two DNA sequences to maximize their similarity
Example 1: AACG and AACG
AACG
AACG
Example 2: AAGG and AACG
AAGG
AACG
1 mismatch
Example 3: AACGGTATGC and ATCGGGTTGC
AACG-GTATGC
ATCGGGT-TGC
2 gaps and 1 mismatch
Bio Sequence Comparison
Best alignment: align two sequences using the smallest number of mismatches and gaps.
Score: each matched position: +2; each mismatch or gap: -1
AACG
AACG
score = 8
AAGG
AACG
score = 5
AACG-GTATGC
ATCGGGT-TGC
score = 13
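
The scoring rule can be checked with a few lines of Python. This is a minimal sketch: `simple_score` is an illustrative name, and the penalty is taken to be -1 per mismatch or gap, which is the value that reproduces the three totals on the slide.

```python
def simple_score(row_a, row_b, match=2, penalty=-1):
    """Score two equal-length alignment rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and a == b:
            score += match      # identical letters in the same column
        else:
            score += penalty    # mismatch, or a gap in either row
    return score


print(simple_score("AACG", "AACG"))                # 8
print(simple_score("AAGG", "AACG"))                # 5
print(simple_score("AACG-GTATGC", "ATCGGGT-TGC"))  # 13
```
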
Bio Sequence Comparison
Protein sequence alignment: measuring protein sequence similarity is more complex than measuring DNA sequence similarity.
DNA sequence alignment: match or mismatch/gap
Protein sequence alignment: degree of similarity
There are twenty types of amino acids; each pair of amino acids has a similarity score, which varies from pair to pair.
Example: (A, A) = 4; (R, R) = 5; (A, R) = -1; (C, A) = 0
Bio Sequence Comparison
BLOSUM matrix
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Bio Sequence Comparison
Aligning protein sequences (gap penalty = 5): FDSKTHRGHR and FESYWTHGHR
FDSK-THRGHR
:.:  :: :::
FESYWTH-GHR
Score: 6+2+4-2-5+5+8-5+6+8+5 = 32
FDSKTHRGHR-
-FESYWTHGHR
Score: -5-3+0+0-2-2-1-2-2+0-5 = -22
Amino acids with similar physicochemical properties have higher similarity scores among them.
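
A protein alignment is scored the same way as a DNA alignment, except that each aligned pair is looked up in the substitution matrix. The sketch below is a minimal illustration: it copies only the handful of pair scores needed for the first alignment from the BLOSUM matrix slide, uses the gap penalty of 5 stated above, and the names `PAIR_SCORES`, `pair_score` and `protein_score` are illustrative rather than part of any library.

```python
GAP = -5  # gap penalty from the slide, applied as a negative score

# Only the pair scores needed below, copied from the BLOSUM matrix slide.
PAIR_SCORES = {
    ("F", "F"): 6, ("D", "E"): 2, ("S", "S"): 4, ("K", "Y"): -2,
    ("T", "T"): 5, ("H", "H"): 8, ("G", "G"): 6, ("R", "R"): 5,
}


def pair_score(a, b):
    """Look up a symmetric substitution score for two amino acids."""
    key = (a, b) if (a, b) in PAIR_SCORES else (b, a)
    return PAIR_SCORES[key]


def protein_score(row_a, row_b):
    """Sum substitution scores over aligned pairs and gap penalties over gaps."""
    total = 0
    for a, b in zip(row_a, row_b):
        total += GAP if "-" in (a, b) else pair_score(a, b)
    return total


print(protein_score("FDSK-THRGHR", "FESYWTH-GHR"))
```

A full implementation would load all 210 entries of the matrix instead of a hand-picked subset.
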
Computing Sequence Alignment
Two sequences: AACG and AAGG
Step #1: calculating the alignment matrix
     A   A   C   G
A    2   1   0  -1
A    1   4   3   2
G    0   3   3   5
G   -1   2   2   5
AAGG
AACG
Rules:
1: initialization: fill the first row and column with matching scores plus gap penalties
2: fill an empty cell based on the scores of its left, upper and upper-left neighbors plus the matching score of the current cell
3: choose the option giving the highest score
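
The fill rules can be sketched in Python as follows. This is a minimal sketch that assumes the standard Needleman-Wunsch setup, which the slide does not spell out: an extra boundary row and column hold cumulative gap penalties, a match scores +2, and a mismatch or gap scores -1 as in the earlier DNA example. `fill_matrix` is an illustrative name.

```python
def fill_matrix(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Fill a global-alignment score matrix; rows follow seq_a, columns seq_b."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Assumed boundary: aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # upper-left neighbor: align the two letters
                score[i - 1][j] + gap,    # upper neighbor: gap in seq_b
                score[i][j - 1] + gap,    # left neighbor: gap in seq_a
            )
    return score


matrix = fill_matrix("AAGG", "AACG")
for letter, row in zip("AAGG", matrix[1:]):
    print(letter, row[1:])  # skip the boundary row and column
print("best score:", matrix[-1][-1])  # best score: 5
```
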
Computing Sequence Alignment
Step #2: tracing back to recover the alignment
     A   A   C   G
A    2   1   0  -1
A    1   4   3   2
G    0   3   3   5
G   -1   2   2   5
Rules:
1: start from the lower-right corner
2: trace back to the left, upper or upper-left neighbor that gives the current cell's score
3: keep doing this until it cannot continue
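
The traceback rules can be sketched in the same way. The block below is self-contained, so it repeats the matrix fill before tracing back. It assumes Needleman-Wunsch style boundary cells holding cumulative gap penalties and breaks ties by preferring the upper-left neighbor; both choices are assumptions, and `align` is an illustrative name.

```python
def align(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming plus traceback (a sketch)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: start at the lower-right corner and move to whichever
    # neighbor reproduces the current cell's score.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(align("AAGG", "AACG"))  # (5, 'AAGG', 'AACG')
```

When several neighbors give the same score, more than one optimal alignment exists; this sketch returns only one of them.
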
Sequence Alignment Algorithm
Algorithmically, the sequence alignment problem can be solved using a dynamic programming method.
Dynamic Programming
Trace Back for Solution Recovery
Interpreting Sequence Alignments
Does a higher sequence alignment score always mean a better sequence alignment?
Sequence alignment scores depend not only on the quality of an alignment but also on sequence length and composition.
So we need to remove the background information to derive the true quality of a sequence alignment.
Interpreting Sequence Alignments
Query sequence: AAAA
Database #1: AATTAATACATTAATATAATAAAATTACTGA
Database #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
Which of these two sequences will have a better chance to have a good match with the query sequence after being randomly shuffled?
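
The question can be explored with a small simulation. The sketch below assumes that "a good match" means the shuffled database sequence contains the query as an exact substring, which is a deliberate simplification; the trial count, the seed and the name `chance_of_hit` are likewise illustrative choices.

```python
import random


def chance_of_hit(query, database_seq, trials=2000, seed=0):
    """Fraction of random shuffles of database_seq that contain query exactly."""
    rng = random.Random(seed)
    letters = list(database_seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps the letter composition, destroys the order
        if query in "".join(letters):
            hits += 1
    return hits / trials


query = "AAAA"
db1 = "AATTAATACATTAATATAATAAAATTACTGA"
db2 = "CGGTAGTACGTAGTGTTTAGTAGCTATGAA"
print("database #1:", chance_of_hit(query, db1))
print("database #2:", chance_of_hit(query, db2))
```

Database #1 is rich in A, so its shuffles contain AAAA far more often than those of database #2, even though the shuffled sequences carry no biological signal.
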
Interpreting Sequence Alignments
E-value
One way to assess the true quality of a particular alignment is to derive the background alignment score distribution of similar sequences with the same letter composition.
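
One simple way to approximate such a background distribution is to shuffle one sequence many times, which preserves its letter composition, and record how often a shuffle scores at least as well as the real alignment. The sketch below is a permutation-style estimate under assumed DNA scores (+2 match, -1 mismatch or gap); it is not the statistical formula BLAST uses for its E-value, and the function names are illustrative.

```python
import random


def global_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best global alignment score, keeping only two rows of the DP matrix."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def empirical_p_value(query, target, shuffles=200, seed=0):
    """Share of composition-preserving shuffles scoring at least as well as target."""
    rng = random.Random(seed)
    observed = global_score(query, target)
    letters = list(target)
    at_least = 0
    for _ in range(shuffles):
        rng.shuffle(letters)
        if global_score(query, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (shuffles + 1)


observed, p = empirical_p_value("AACGGTATGC", "ATCGGGTTGC")
print("observed score:", observed)
print("estimated p-value:", p)
```

A small estimated p-value means that few shuffled sequences score as well as the real one, so the alignment is unlikely to be explained by length and composition alone.
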
Sequence Alignment Programs
Homology Search by BLAST
Homology Search by BLAST
Take Home Message
Sequence comparison provides a powerful tool for derivation of homologous genes, and hence functional and structural information.
60% of the genes in a newly sequenced genome have homologues among well-annotated genes.
Conserved sequence segments across multiple homologous genes suggest functional sites.