Sequence Bioinformatics: Multiple Sequence Alignment
Waqas Nasir, 2010-11-12

Multiple Sequence Alignment
"One amino acid plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud." (Lesk, 2008)
- More reliable than pairwise alignments
- Exposes patterns of amino acid conservation
- Helps predict secondary structure more accurately

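Multiple Sequence Alignment
- For pairwise alignment the substitution score is a log-odds ratio: $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$, where $p_{ab}$ is the probability of seeing $a$ aligned with $b$ in related sequences and $q_a$, $q_b$ are the background frequencies
- How do we align three or more sequences?
- How do we calculate $p_{abc}$ or $p_{abcd}$ in the likelihood ratio? A large amount of data is needed to estimate them.
- Is it even possible to obtain an optimal multiple sequence alignment?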
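The data problem raised on this slide can be made concrete with a small sketch. It assumes the usual pairwise log-odds form $\log(p_{ab} / (q_a q_b))$ and a 20-letter amino acid alphabet; the function names `log_odds` and `joint_table_size` are hypothetical and only serve the illustration.

```python
import math

def log_odds(p_ab, q_a, q_b):
    """Pairwise log-odds score: log of the joint probability of the aligned
    pair over the probability of seeing the two residues independently."""
    return math.log(p_ab / (q_a * q_b))

def joint_table_size(n_sequences, alphabet=20):
    """Number of joint probabilities (p_ab, p_abc, p_abcd, ...) that would
    have to be estimated for columns of n_sequences residues."""
    return alphabet ** n_sequences

for n in (2, 3, 4):
    print(n, joint_table_size(n))
```

The table grows by a factor of 20 with every added sequence, which is why the joint probabilities cannot be estimated directly from available data.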

Scoring Schemes
- A multiple alignment score should reflect two important features: position-specific scoring and the evolutionary relationship between the sequences
- Not enough data is available to parameterize a full evolutionary model
- Assuming independence between the columns gives the score $S(m) = \sum_i S(m_i)$, where $m_i$ is column $i$ of the alignment $m$

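Scoring Scheme: Minimum Entropy
- Try to minimize the entropy of each column
- The probability of observing residue $a$ in column $m_i$ can be estimated by $p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}}$, where $c_{ia}$ is the number of times $a$ occurs in column $i$
- The probability of the column is then $P(m_i) = \prod_a p_{ia}^{c_{ia}}$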

Scoring Scheme: Minimum Entropy
- The minimum entropy score for column $m_i$ is $S(m_i) = -\sum_a c_{ia} \log p_{ia}$, where $c_{ia}$ is the count of residue $a$ in the column and $p_{ia}$ its estimated probability
- The more variation in the column, the higher the entropy
- Highly conserved columns carry more information
- A completely conserved column scores 0
- A good alignment minimizes the total entropy
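The column score can be sketched in a few lines. This is a minimal illustration, assuming the score $-\sum_a c_a \log p_a$ with $p_a$ estimated as a plain relative frequency (no pseudocounts), natural logarithms, and gap characters counted like any other symbol; the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy_score(column):
    """Entropy-style score of one alignment column: -sum_a c_a * log(p_a)."""
    counts = Counter(column)          # c_a: how often each residue occurs
    total = sum(counts.values())
    score = 0.0
    for c in counts.values():
        p = c / total                 # p_a estimated as a relative frequency
        score -= c * math.log(p)
    return score

def alignment_entropy_score(rows):
    """Total score of an alignment, treating its columns as independent."""
    return sum(column_entropy_score(col) for col in zip(*rows))

print(column_entropy_score("AAAA"))   # fully conserved column
print(column_entropy_score("AACC"))   # mixed column scores higher
```

A fully conserved column contributes nothing, so alignments that line up identical residues are preferred when the total is minimized.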

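Scoring Scheme: SP Scores (Sum of Pairs)
- Columns are again assumed to be independent
- The score of a column is the sum of all pairwise scores within it
- For column $m_i$ the score becomes $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^k$ is the residue of sequence $k$ in column $i$
- The substitution scores $s$ come from a PAM or BLOSUM matrix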

Scoring Scheme: SP Scores (Sum of Pairs)
- The final alignment score is then $S(m) = \sum_i S(m_i)$
- The scheme implicitly assumes the same evolutionary distance between every pair of sequences, which is unrealistic
- Not enough data is available to estimate the probabilities of all evolutionary events
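The sum-of-pairs score can be sketched as follows. The scoring function `toy_score` is an assumption made for illustration only: its match, mismatch and gap values are invented and are not PAM or BLOSUM entries, and scoring a gap against a gap as 0 is a common convention rather than something stated in the slides.

```python
from itertools import combinations

def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Stand-in for a substitution matrix lookup (values are made up)."""
    if a == "-" and b == "-":
        return 0                      # gap against gap contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column_score(column, s=toy_score):
    """Sum of the scores of all residue pairs in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_alignment_score(rows, s=toy_score):
    """Sum of the column scores, treating columns as independent."""
    return sum(sp_column_score(col, s) for col in zip(*rows))

print(sp_alignment_score(["AC-", "AG-", "A-T"]))
```

Replacing `toy_score` with a lookup into a real substitution matrix would leave the two SP functions unchanged.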

Multi-dimensional Dynamic Programming
- Dynamic programming (DP) can be extended to multiple sequences
- The DP matrix has as many dimensions as there are sequences
- The algorithms require a lot of memory and have a high computational cost
- Examples of DP-based algorithms include:
  - MSA (Multiple Sequence Alignment algorithm)
  - Progressive alignment (Feng-Doolittle algorithm)
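A rough count shows why the full matrix is impractical. The sketch below assumes one cell per combination of prefix lengths and at most $2^N - 1$ predecessor cells per cell (every non-empty choice of sequences that advance by one residue); `dp_cost` is a hypothetical helper and the second number is an upper bound, since boundary cells have fewer predecessors.

```python
def dp_cost(lengths):
    """Return (number of cells, upper bound on cell updates) for a full
    multi-dimensional DP over sequences of the given lengths."""
    n = len(lengths)
    cells = 1
    for length in lengths:
        cells *= length + 1           # prefix lengths 0..length per sequence
    moves_per_cell = 2 ** n - 1       # non-empty subsets of advancing sequences
    return cells, cells * moves_per_cell

for n in (2, 3, 5, 7):
    print(n, dp_cost([200] * n))
```

For seven sequences of 200 residues the matrix already has on the order of 10^16 cells, which motivates restricting the search to part of the hypercube.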

MSA (Multiple Sequence Alignment)
- Reduces the volume of the multi-dimensional DP matrix that has to be searched
- Can optimally align up to 7 sequences of 200-300 residues
- Uses the SP scoring scheme
- The score of a multiple alignment is given by $S(m) = \sum_{k<l} S(a^{kl})$

MSA (Multiple Sequence Alignment)
- Here $a^{kl}$ is the pairwise alignment between sequences $k$ and $l$
- MSA uses a lower threshold score $\beta^{kl}$ for each pair of sequences
- Only pairwise alignments scoring higher than $\beta^{kl}$ are considered
- Instead of visiting every point in the DP matrix, only those points are added to the search space through which the best pairwise alignment scores higher than $\beta^{kl}$

MSA (Multiple Sequence Alignment)
- Finally, the multi-dimensional DP algorithm is run on this subset of the hypercube
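The pruning step for a single pair of sequences can be sketched like this. It is an illustration under stated assumptions, not the MSA program itself: scoring uses made-up match, mismatch and linear gap values, the threshold `beta` is simply passed in (how it is derived is not shown here), and cells that exactly reach the threshold are kept. The best score of any alignment passing through cell (i, j) is obtained by adding a forward and a backward global-alignment matrix.

```python
def nw_matrix(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment DP matrix: F[i][j] is the best score for aligning
    the prefixes x[:i] and y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = i * gap
    for j in range(1, len(y) + 1):
        F[0][j] = j * gap
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F

def allowed_cells(x, y, beta):
    """Cells (i, j) through which the best pairwise alignment scores at
    least beta; only these would enter the restricted search space."""
    n, m = len(x), len(y)
    forward = nw_matrix(x, y)
    # Aligning the reversed sequences gives the best scores for suffixes.
    backward = nw_matrix(x[::-1], y[::-1])
    return {(i, j)
            for i in range(n + 1)
            for j in range(m + 1)
            if forward[i][j] + backward[n - i][m - j] >= beta}

print(sorted(allowed_cells("ACGT", "ACGT", 4)))
```

With a strict threshold only the cells on the optimal path survive; lowering `beta` widens the band of cells that the multi-dimensional DP has to visit.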

Progressive Alignment
- The most commonly used approach
- Uses pairwise DP to align sequences
- The idea is to start with the most closely related sequences and build up the alignment by adding further sequences and groups
- An optimal alignment is not guaranteed
- Fast and efficient; produces reasonable alignments

The Feng-Doolittle Algorithm
The algorithm works as follows:
1. Perform pairwise alignments of all sequences
2. Convert the alignment scores to evolutionary distances
3. Construct a guide tree
4. Align the most closely related sequences in the guide tree
5. Then repeatedly align one of the following:
   - the sequence most closely related to the existing alignment, OR
   - the next most closely related pair of sequences, OR
   - two sub-alignments (groups)
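The order in which sequences and groups are joined can be sketched from a distance table. Average-linkage clustering is swapped in here as a simple stand-in for guide-tree construction; it is not claimed to be the tree-building method of the original algorithm, and the function name and the example distances are hypothetical.

```python
from itertools import combinations

def average_linkage_order(names, dist):
    """Return the sequence of merges (group, group) obtained by repeatedly
    joining the two closest groups. dist maps frozenset({a, b}) to the
    distance between sequences a and b."""
    clusters = [(name,) for name in names]
    merges = []

    def group_distance(c1, c2):
        # Average distance over all cross-group sequence pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: group_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)      # the merged group replaces both
        merges.append((c1, c2))
    return merges

example = {frozenset(pair): d for pair, d in {
    ("A", "B"): 1, ("C", "D"): 2,
    ("A", "C"): 5, ("A", "D"): 5, ("B", "C"): 5, ("B", "D"): 5,
}.items()}
print(average_linkage_order(["A", "B", "C", "D"], example))
```

Each merge corresponds to one alignment step: two single sequences, a sequence and a group, or two groups.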

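The Feng-Doolittle Algorithm
- PAM scores and affine gap penalties are used
- "Once a gap, always a gap" (Feng & Doolittle, 1987)
- The highest-scoring pairwise alignment determines how a sequence or group is aligned to a group
- The distance $D$ is calculated as $D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}$, where $S_{\mathrm{obs}}$ is the observed pairwise alignment score, $S_{\mathrm{max}}$ the maximum possible score, and $S_{\mathrm{rand}}$ the expected score for random sequences of the same length and composition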
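The conversion from score to distance can be sketched as below. It assumes the normalized form $D = -\log\big((S_{obs} - S_{rand}) / (S_{max} - S_{rand})\big)$ as it is commonly given in textbooks, with natural logarithms; the function name and the example scores are hypothetical.

```python
import math

def score_to_distance(s_obs, s_rand, s_max):
    """Convert a pairwise alignment score into a distance.

    s_obs:  observed alignment score of the two sequences
    s_rand: expected score for unrelated (shuffled) sequences
    s_max:  best achievable score, e.g. from aligning sequences to themselves
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is no better than random")
    return -math.log(s_eff)

print(score_to_distance(60, 20, 100))   # halfway between random and maximal
```

Identical sequences give a distance of 0, and the distance grows without bound as the observed score approaches the random expectation.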

Suffix Trees
- A data structure that represents all suffixes of a given string S
- Defined as a rooted tree in which:
  - every internal node other than the root has at least two children
  - no two edges out of a node begin with the same character
  - every edge is labeled with a non-empty substring of S
- Facilitates fast retrieval of, and operations on, substrings
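The substring lookup that suffix trees enable can be illustrated with a suffix trie, a simpler uncompressed relative in which every edge carries a single character. This is a sketch only: a real suffix tree merges chains of single-child nodes into one edge and can be built in linear time, whereas this version takes quadratic time and space. The terminator `$` and the function names are assumptions.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s (plus a terminator) into a nested-dict trie."""
    text = s + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # follow or create the edge
    return root

def contains(trie, pattern):
    """True if pattern is a substring of the indexed string: every substring
    is a prefix of some suffix, so it spells a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))
```

Once the structure is built, a lookup takes time proportional to the pattern length, independent of the length of the indexed string.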

Example: Multiple Sequence Alignment
Case study 5.2 (Lesk, 2008)

Structural Inferences from MSA
- The most highly conserved regions probably correspond to the active site
- Regions rich in edit operations (insertions and deletions) probably correspond to surface loops
- A conserved Gly/Pro column probably represents a turn

Structural Inferences from MSA
- A conserved pattern of hydrophobicity with spacing 2, where the intervening residues are more variable and include hydrophilic residues, suggests a β-strand on the surface (residues 50-60)
- A conserved pattern of hydrophobicity with spacing 4 suggests a helix (residues 40-49)
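The spacing rule can be checked mechanically on an alignment. The sketch below is a rough illustration, assuming a fixed set of hydrophobic residues (`HYDROPHOBIC` is an arbitrary choice, not taken from the case study), calling a column conserved-hydrophobic only if every sequence has a hydrophobic residue there, and using invented toy sequences.

```python
from collections import Counter

HYDROPHOBIC = set("AVLIMFWC")   # assumed classification, for illustration

def conserved_hydrophobic_columns(rows):
    """Indices of columns in which every sequence has a hydrophobic residue."""
    return [i for i, col in enumerate(zip(*rows))
            if all(residue in HYDROPHOBIC for residue in col)]

def dominant_spacing(positions):
    """Most frequent distance between consecutive positions, or None."""
    gaps = Counter(b - a for a, b in zip(positions, positions[1:]))
    return gaps.most_common(1)[0][0] if gaps else None

rows = ["VKLTVEI",
        "IRVSLDV",
        "LEIKVNL"]
columns = conserved_hydrophobic_columns(rows)
print(columns, dominant_spacing(columns))
```

A dominant spacing of 2 would point towards a surface strand and a spacing of about 4 towards a helix, in line with the rules of thumb above; real data would need a tolerance for imperfect conservation.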

References
Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987;25(4):351-60.
Lesk AM. Introduction to Bioinformatics. 3rd ed. New York: Oxford University Press; 2008. ISBN 978-0-19-920804-3.