Introduction to Bioinformatics for Medical Research Gideon Greenspan gdg@cs.technion.ac.il Lecture 8 Multiple Alignment and Phylogeny
Multiple Alignment & Phylogeny Multiple Alignment Scoring Complexity Feasible multiple alignment ClustalW Phylogenetic Trees Tree reconstruction Phylodendron 2
Multiple Alignment Like pairwise alignment n input sequences instead of 2 Add indels to make same length Local and global alignments Score columns in alignment Positive score for agreement Negative for disagreement or indels Seek an alignment to maximize score 3
Alignment Example Same allele in 3 or 4 GTCGTAGTCGGCTCGAC GTCTAGCGAGCGTGAT GCGAAGAGGCGAGC GCCGTCGCGTCGTAAC 1 23 2 30 3 6 4 4 Lone allele GTCGTAGTCG-GC-TCGAC GTC-TAG-CGAGCGT-GAT GC-GAAG-AG-GCG-AG-C GCCGTCG-CG-TCGTA-AC 1 15 2 6 3 39 4 16 4
Motivation Determine consensus sequence Ignore variations, e.g. for genome sequencing Identify variations, e.g. for SNP mapping Build sequence families Orthologous genes, protein groups Build phylogenetic trees Infer evolutionary relationship 5
Scoring (1) Sum of pairwise alignment scores For n sequences, there are n (n-1)/2 pairs GTCGTAGTCG-GC-TCGAC GTC-TAG-CGAGCGT-GAT GC-GAAG-AG-GCG-AG-C GCCGTCG-CG-TCGTA-AC 6
Scoring (2) Sum of column scores C T G C C T C - C - G A C T C A Minus number of different values Example: 1, 2, 2, 3 Minus number not in consensus Example: 0, 1, 2, 2 Maximum profile probability Example: 1.0, 0.1, 0.06, 0.02 7
Complexity (1) Pairwise A B alignment table Cell (i,j) = score of best alignment between first i elements of A and first j elements of B Complexity: length of A length of B 3-way A B C alignment table Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C Complexity: length A length B length C 8
Complexity (2) n-way S 1 S 2 S n-1 S n alignment table Cell (x 1,,x n ) = best alignment score between first x 1 elements of S 1,, x n elements of S n Complexity: length S 1 length S n Example: protein family alignment 100 proteins, 1000 amino acids each Complexity: 10 300 table cells Calculation time: beyond the big crunch! 9
Feasible Approach Based on pairwise alignment scores Build n by n table of pairwise scores Align similar sequences first After alignment, consider as single sequence Continue aligning with further sequences Complexity is n 2 l 2 100 proteins of 1000 amino acids: ~ 5,000 alignments of 10 6 cells fi ~ 5 10 9 operations 10
Example (1) 1 GTCGTAGTCGGCTCGAC 2 GTCTAGCGAGCGTGAT 3 GCGAAGAGGCGAGC 4 GCCGTCGCGTCGTAAC 2 3 4 1 7 4 5 2 2 2 3 2 11
Example (2) 1 GTCGTAGTCG-GC-TCGAC 2 GTC-TAG-CGAGCGT-GAT 3 GCGAAGAGGCGAGC 4 GCCGTCGCGTCGTAAC 12 3 3-3 4-1 2 12
Example (3) 1 GTCGTAGTCG-GC-TCGAC 2 GTC-TAG-CGAGCGT-GAT 3 GC-GAAGAGGCG-AGC 4 GCCGTCGCGTCGTAAC 1 GTCGTA-GTCG-GC-TCGAC 2 GTC-TA-G-CGAGCGT-GAT 3 G-C-GAAGA-G-GCG-AG-C 4 G-CCGTCGC-G-TCGTAA-C 13
ClustalW Input Fast alignment? Scoring matrix Input sequences Alignment format Fast alignment options Gap scoring Phylogenetic trees 14
ClustalW Output (1) Input sequences Pairwise alignment scores Building alignment Final score 15
ClustalW Output (2) Sequence names Sequence positions Match strength in decreasing order: * :. 16
Phylogenetic Trees Represent closeness between many entities In our case, genomic or protein sequences chimp Distance representation monkey Observed entity human Unobserved commonality 17
Rooting Trees A tree can be hung from a root Adds directional information Requires addition of outgroup We know this is furthest pig monkey human chimp So we hang the tree from where it joins 18
Phylogeny and Evolution Evolutionary Time Common Ancestor Speciation Number of mutations 19
Tree Reconstruction Build tree based on organism sequences Distance-based methods Use pairwise alignment scores to build tree Ignores sequences after initial alignments Character-based methods Learn a tree with intermediate sequences that minimizes total number of mutations Slower but generally better results 20
Distance-based Example (1) 1 2 3 2 3 4 1 2 7 4 2 5 2 2 3 4 21
Distance-based Example (2) 12 3 3-3 4-1 2 1 2 3 4 22
Distance-based Example (3) 12 34-4 3 4 1 2 23
Newick Tree Format (CFTR_SHEEP:0.01457, (CFTR_HUMAN:0.16153, (CFTR_MOUSE:0.70599, (CFTR_RABIT:2.76042, (CFTR_SQUAC:1.27192, CFTR_XENLA:0.28818) :3.42183) :0.77076) :0.65873) :0.73937, CFTR_BOVIN:0.00953); 24
Phylodendron Input Graphical style Newick tree description Orientation Tree size 25