1 Sequence Analysis and Databases 2: Sequences and Multiple Alignments Jose María González-Izarzugaza Martínez CNIO Spanish National Cancer Research Centre (jmgonzalez@cnio.es)
2 Sequence Comparisons: How?
- Pairwise Alignment of 2 Sequences:
  - Aligning a couple of sequences
  - Searching for homologues (BLAST)
- Multiple Sequence Alignments (n>2)
- Advanced Sequence Alignments:
  - Patterns, Profiles, HMMs
  - Distant Homologues with PSI-BLAST
3 Multiple Sequence Alignments
Rationale:
- They align more than 2 homologous sequences at once.
- Conserved regions are likely to be important, because selective pressure keeps them from changing.
- Alignments improve because we can focus on the rules shared by the whole family.
Algorithmic Complexity (exact alignment):
- 2 Sequences: NxM (as seen for Pairwise Alignments)
- 3 Sequences: NxMxL
- 4 Sequences: NxMxLxJ
- The cost is multiplied by one sequence length for every sequence added, so it is not feasible in general for a set of homologous proteins.
Examples of heuristics that work around this algorithmic problem:
- T-Coffee: [http://www.ch.embnet.org/software/tcoffee.html]
- ClustalW: [http://www.ebi.ac.uk/clustalw/]
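To make the growth in the complexity list concrete, here is a minimal sketch that counts the cost as the product of the sequence lengths, exactly as the slide writes it. The function name `dp_cells` and the length of 300 residues are illustrative assumptions, not part of the original material.

```python
from math import prod

def dp_cells(lengths):
    """Cost of an exact alignment, counted as the product of the
    sequence lengths (N x M x L x ...)."""
    return prod(lengths)

# Hypothetical family of proteins, each 300 residues long.
for k in range(2, 6):
    print(k, "sequences:", dp_cells([300] * k), "cells")
```

Every extra sequence multiplies the cost by another factor of 300 in this example, which is why heuristics are needed.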
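4 ClustalW Algorithm
The algorithm in plain words:
1) Sort the sequences by similarity.
2) Align the two most similar ones.
3) Label those sequences as aligned.
4) Add the most similar unlabeled sequence to the alignment and label it.
5) Repeat from 4 until there are no unlabeled sequences.
Computational Complexity: ranking N sequences takes N(N-1)/2 pairwise comparisons, and building the alignment then takes N-1 alignment steps. This is far cheaper than one N-dimensional alignment, so it is feasible for the computer to calculate it for a family of homologues.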
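The ordering logic described above can be sketched as follows. This is only the simplified procedure from the slide, not the actual ClustalW implementation: the alignment step itself is left out, and similarity is approximated by the fraction of identical positions between equal-length, ungapped sequences. The names `identity`, `progressive_order` and the example sequences are assumptions made for illustration.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions (assumes equal-length, ungapped sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def progressive_order(seqs):
    """Return the order in which sequences would join the alignment.

    seqs maps a sequence name to its sequence (at least two entries).
    """
    # All-against-all comparison: N*(N-1)/2 pairwise scores.
    sim = {frozenset(pair): identity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(seqs, 2)}
    # Start with the two most similar sequences and label them as aligned.
    aligned = sorted(max(sim, key=sim.get))
    unlabeled = set(seqs) - set(aligned)
    # Repeatedly add the unlabeled sequence closest to any aligned one.
    while unlabeled:
        best = max(sorted(unlabeled),
                   key=lambda u: max(sim[frozenset((u, a))] for a in aligned))
        aligned.append(best)
        unlabeled.remove(best)
    return aligned

# Hypothetical toy family.
family = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACGTTTTT", "d": "TTTTTTTT"}
print(progressive_order(family))
```

In this toy family, `a` and `b` are aligned first because they differ at a single position, and the most divergent sequence joins last.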
5 View of an MSA using Belvu
6 Pairwise vs MSA
- If two homologues have diverged so far that their sequence identity is below about 20%, the signal between them is so weak that BLAST is not able to catch it. In other words, we can NOT rely on BLAST to find remote homologues.
- A Multiple Sequence Alignment (MSA) is able to spot important regions. Important regions are under higher selective pressure, so they change (evolve) less than other regions: conservation is related to importance.
- If the matches between 2 sequences occur in the conserved regions, the chances of these 2 sequences being homologues increase.
So how can we use all this information to improve our knowledge? Can an MSA spot remote homologues?
8 Advanced Searches: Consensus
Algorithm: for each position (column) of the MSA, the consensus takes the most frequent monomer.
Pros:
- The most basic way of summarizing an MSA in a single line.
- Easy to implement, easy to understand.
Cons:
- It keeps only the most frequent monomer at each position and discards the frequencies themselves.
Example:
ACTGACTACGTACA
ATGCGTACCATACA
ATCAGTATCGTAGA
ATCAGTATCGAACA
--------------
ATCAGTATCGTACA   Consensus Sequence
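The consensus rule above fits in a few lines. The sketch below assumes that every row of the alignment has the same length and does not treat ties specially. The function name `consensus` is an illustrative choice, and the input is the example alignment from this slide.

```python
from collections import Counter

def consensus(alignment):
    """Most frequent monomer in each column of an alignment of equal-length rows."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*alignment))

msa = [
    "ACTGACTACGTACA",
    "ATGCGTACCATACA",
    "ATCAGTATCGTAGA",
    "ATCAGTATCGAACA",
]
print(consensus(msa))
```

Note that a column with counts 2/1/1 and a column with counts 4/0/0 give the same kind of output, which is the weakness listed under Cons.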
9 Advanced Searches: Patterns
- Bioinformaticians also call patterns Regular Expressions.
- Useful when dealing with motifs.
- It is a complex (but powerful) language, not always easy to read at first glance.
- Ambiguity can be expressed:
  [A,B]: either A or B
  {A,B}: anything but A or B
  x: any monomer
- Repetitions are easily represented (braces placed right after a monomer give a repeat count):
  A{2,4}: either AA, AAA or AAAA
  A+: any number of A's (but at least one)
  A*: any number of A's (or even none)
- MSAs are reduced to a single line.
- Frequencies are not taken into account.
Example 1: the aligned sequences AGLV, AGLV and AGIV reduce to the pattern AG[IL]V.
Example 2: [AC]-x-G-x{4}-{L,I} reads [Ala or Cys]-any-Gly-any-any-any-any-[anything but Leu or Ile].
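As a rough illustration, the notation on this slide can be translated into the syntax of Python's `re` module, which shares the brackets and repeat counts but writes exclusion as `[^...]` and "any monomer" as `.`. The converter below is a hypothetical helper (`slide_pattern_to_regex`) that only handles dash-separated elements written as on this slide. It is a sketch, not a parser for any official pattern format.

```python
import re

# One element: [..] or {..} or a single letter, optionally followed by {n} or {n,m}.
ELEMENT = re.compile(r"(\[[^\]]+\]|\{[^}]+\}|[A-Za-z])(\{\d+(?:,\d+)?\})?")

def slide_pattern_to_regex(pattern):
    """Translate a dash-separated pattern in the slide's notation to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, repeat = match.group(1), match.group(2) or ""
        letters = re.sub(r"[^A-Za-z]", "", core).upper()
        if core.startswith("["):      # ambiguity: any of the listed monomers
            regex = f"[{letters}]"
        elif core.startswith("{"):    # exclusion: anything but the listed monomers
            regex = f"[^{letters}]"
        elif core in "xX":            # wildcard: any monomer
            regex = "."
        else:                         # a single fixed monomer
            regex = letters
        parts.append(regex + repeat)
    return "".join(parts)

regex = slide_pattern_to_regex("[AC]-x-G-x{4}-{L,I}")
print(regex)
print(bool(re.fullmatch(regex, "AWGWWWWK")))      # last residue is not L or I
print(bool(re.fullmatch(regex, "AWGWWWWL")))      # last residue is excluded
print(bool(re.fullmatch("AG[IL]V", "AGIV")))      # Example 1 is already a valid regex
```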
10 Advanced Searches: Profiles
- Profiles are Position Specific Scoring Matrices (PSSMs).
- Same concept as the general scoring matrices (e.g. BLOSUM), but these are calculated from scratch for each position of the MSA instead of being pre-calculated.
- Thus, PSSMs are 20xN matrices, where N is the length of the alignment.
- PSSMs take into account information specific to the family of proteins. Everything is inferred from the alignment, so few assumptions are needed.
- We can align a sequence against a profile with the Smith-Waterman algorithm and use this to search for homologues.
[Figure: MSA -> PSSM -> search for homologues. PSSM borrowed from F. Abascal]
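One simple way to build such a matrix is sketched below. It assumes an ungapped alignment, a uniform background frequency of 1/20 and a pseudocount of 1 per amino acid, and it stores log-odds scores in base 2. These choices and the names `build_pssm` and `score` are assumptions for illustration; real profile tools use more refined weighting schemes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return a list of N columns, each mapping the 20 amino acids to a log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)                  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Hypothetical family: the three aligned sequences from the Patterns example.
pssm = build_pssm(["AGLV", "AGLV", "AGIV"])
for candidate in ("AGLV", "AGIV", "WWWW"):
    print(candidate, round(score(pssm, candidate), 2))
```

Unlike the pattern AG[IL]V, the profile scores L higher than I at the third position because L is seen twice and I only once.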
11 Advanced Searches: HMM-profiles
- HMM stands for Hidden Markov Model.
- HMMs first became popular in speech recognition.
- They are much more robust than simple profiles, especially when dealing with gaps. However, they are harder to implement.
- HMMER is a package similar to BLAST, but it searches with HMMs: http://hmmer.janelia.org/
[Figure legend: x = hidden states, y = observable outputs, a = transition probabilities, b = output probabilities]
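To show how the four ingredients in the figure legend fit together, here is a minimal sketch of the forward algorithm, which computes the probability of a sequence of observable outputs y by summing over all paths of hidden states x. The two-state model, its probabilities and the starting distribution `start_p` are invented for illustration. This is a generic HMM, not the profile-HMM architecture used by HMMER.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Probability of the observed outputs, summed over all hidden-state paths."""
    # alpha[s]: probability of the outputs seen so far, ending in hidden state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical model: a "conserved" state that mostly emits A
# and a "variable" state that emits A or G equally often.
states = ("conserved", "variable")
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "variable": 0.1},   # a: transition probabilities
           "variable": {"conserved": 0.2, "variable": 0.8}}
emit_p = {"conserved": {"A": 0.9, "G": 0.1},                   # b: output probabilities
          "variable": {"A": 0.5, "G": 0.5}}

print(forward("AAG", states, start_p, trans_p, emit_p))
```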
13 Remote Homologues: PSI-BLAST
- PSI-BLAST: Position Specific Iterated BLAST
- PSI-BLAST is useful to retrieve remote homologues (id<20%)
- Algorithm:
  1) Run BLAST [Iteration #0]
  2) Generate a PSSM from the results that score better than a given threshold (e-value)
  3) Run BLAST again using the PSSM as input [Iterations #1 to #N]
  4) Update the PSSM with the new results
  5) Repeat from 3 until convergence*
*Convergence: when no new results are found
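The iteration scheme above can be imitated on a toy scale. The sketch below is not PSI-BLAST: it assumes equal-length, ungapped peptides, replaces the BLAST search with a direct PSSM score against every database entry, uses a fixed score threshold instead of an e-value, and starts from a PSSM built from the query alone as a stand-in for iteration #0. All names (`toy_psi_blast`, `build_pssm`, `score`) and sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(seqs, pseudocount=1.0):
    """Log-odds PSSM with a uniform background (simplifying assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score(pssm, seq):
    return sum(column[aa] for column, aa in zip(pssm, seq))

def toy_psi_blast(query, database, threshold, max_iterations=10):
    """Return the accumulated hits after each iteration that found something new."""
    hits = {query}
    history = []
    for _ in range(max_iterations):
        pssm = build_pssm(sorted(hits))               # steps 2 and 4: (re)build the PSSM
        found = {s for s in database if score(pssm, s) >= threshold}  # steps 1 and 3
        if found <= hits:                             # convergence: no new results
            break
        hits |= found
        history.append(sorted(hits))
    return history

database = ["ACDEFH", "ACDKFH", "ACNKFH", "WWWWWW"]
for i, hits in enumerate(toy_psi_blast("ACDEFG", database, threshold=3.0)):
    print("iteration", i, hits)
```

In this toy database, ACNKFH shares only three positions with the query and is missed in the first round. It is picked up once the PSSM has learned from the closer relatives that K and H are acceptable at positions 4 and 6.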
14 Remote Homologues: PSI-BLAST
[Figure labels: Target Sequence, BLAST, Database, PSI-BLAST, PSSM, Closely Related Homologues, Remotely Related Homologues]
15 Acknowledgements: Federico Abascal (original text), Juan Carlos Sanchez, Alfonso Valencia