Exp 11 - THEORY

Sequence alignment is the process of arranging sequences so as to achieve the maximum level of identity between them. This helps to derive functional, structural and evolutionary relationships between them. Aligning sequences helps to assign functions to unknown proteins, determine the evolutionary relatedness of organisms and predict 3D structures.

Homology is similarity attributed to descent from a common ancestor, i.e. if two sequences from different organisms are similar, there is a possibility that they are homologous. Homology in turn allows structure and function to be predicted for the sequences.

Types of Sequence Alignment

Based on sequence length - According to the length of the sequences being compared, alignment is of the following two types:

1) Global sequence alignment
In this alignment, sequences are aligned along their entire length to include as many matching characters as possible.

TAGC-GC-GT
TA-CA-CAGT

2) Local sequence alignment
In this alignment, sequences are aligned to find a region with a high density of matches, i.e. strong similarity.

CGATAACGTAT
--ATAAAC---

Based on number of sequences - According to the number of sequences being compared, alignment is of the following two types:

1) Pairwise sequence alignment - This involves aligning two sequences to find the best region of similarity.

1 KTSSGNGAEDS 11
1 KTSSGNGAEDS 11
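The level of identity between two aligned sequences can be made concrete with a short calculation. The sketch below is only an illustration: the function name percent_identity is hypothetical, and it assumes identity is counted as matching columns divided by the total number of alignment columns (other conventions divide by the length of the shorter sequence or by the number of ungapped columns).

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percentage of alignment columns in which both rows hold the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both characters agree and are not gaps.
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return 100.0 * matches / len(row_a)


# The global alignment example from the text: 6 matching columns out of 10.
print(percent_identity("TAGC-GC-GT", "TA-CA-CAGT"))    # 60.0
# The pairwise example from the text: two identical sequences.
print(percent_identity("KTSSGNGAEDS", "KTSSGNGAEDS"))  # 100.0
```

Because the result depends on the chosen denominator, the convention should be stated whenever identity values are compared.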
Various methods used for pairwise alignment of nucleotide and protein sequences are:

1) Dot plot
It is a graphical method for comparing two sequences to identify the regions of maximum similarity and dissimilarity, depicted by the presence and absence of dots.

2) Dynamic programming
This method breaks a problem into smaller sub-problems and uses the solutions of the sub-problems to compute the solution of the larger one. The Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment) algorithms are used here.

3) Heuristic method
When a single sequence is to be compared against a whole database, heuristic methods like BLAST and FASTA are used.

The following are certain parameters used for producing an optimum alignment:

a) Max target sequences
It sets the maximum number of aligned sequences displayed in the results.

b) Expect threshold
The expect (E) value is a statistical indicator of how many alignments with an equally good score are expected to arise by random chance. The lower the E value, the more significant the score. The default threshold is 10, meaning that up to 10 matches are expected to be found by chance alone (stochastic model of Karlin & Altschul, 1990).

c) Max matches in a query range
It limits the number of matches reported for a query range. This is useful when many strong matches to one part of the query would otherwise prevent weaker matches to another part of the query from being shown.

d) Word size
BLAST works by using word matches between the query and the database sequences. It searches for word matches and then initiates extensions from them, leading to the full alignment. A word size of 3 is used for standard protein alignment. A word size of 2 is used for short and nearly exact matches.

e) Scoring schemes
Different scoring schemes are devised to obtain an optimum alignment. A substitution matrix assigns a score to each possible pair of aligned residues. Different PAM and BLOSUM matrices are used to score pairwise sequence alignments and check their quality.

BLOSUM (BLOcks SUbstitution Matrix)
This has been developed using conserved regions, called blocks, of distantly related protein sequences available from the BLOCKS database. Of all the matrices, BLOSUM62 is best for detecting most protein similarities. BLOSUM45 may be used for longer and weaker alignments.
PAM (Point Accepted Mutation)
This is developed by calculating the amino acid substitutions that were naturally accepted during evolution. PAM30 is used for sequences shorter than 35 residues, whereas PAM70 is used for sequences of 35 to 50 residues.

f) Gap costs
A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. Too many gaps should be avoided in the alignment, and hence a gap penalty (gap score) is assigned. Introducing a gap deducts the gap penalty from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. Increasing the gap costs decreases the number of gaps in the alignment. The penalty for creating a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time. Commonly used values are existence 10 or 11 and extension 1.

2) Multiple sequence alignment - This involves the alignment of more than two (protein or DNA) sequences to assess the sequence conservation of protein domains and protein structures. It is an extension of pairwise sequence alignment which reflects the alignment of similar sequences and provides a better alignment score.
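The gap costs described under f) apply to pairwise and multiple alignments alike, and a short sketch shows why one long gap is cheaper than several short ones. This is an illustration under stated assumptions, not the scoring code of any particular tool: it assumes the convention in which a gap of length k costs existence + k x extension (some programs instead charge the opening penalty for the first gapped position and the extension penalty only for the remaining ones), and the function names gap_cost, gap_lengths and total_gap_cost are hypothetical.

```python
from itertools import groupby


def gap_cost(length, existence=11, extension=1):
    """Cost of a single gap: existence + length * extension (assumed convention)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return existence + length * extension


def gap_lengths(row, gap="-"):
    """Lengths of the runs of consecutive gap characters in one aligned row."""
    return [len(list(run))
            for is_gap, run in groupby(row, key=lambda c: c == gap)
            if is_gap]


def total_gap_cost(row_a, row_b, existence=11, extension=1):
    """Sum of the costs of every gap in both rows of a pairwise alignment."""
    lengths = gap_lengths(row_a) + gap_lengths(row_b)
    return sum(gap_cost(n, existence, extension) for n in lengths)


# Global alignment example from the text: four separate gaps of length 1.
print(total_gap_cost("TAGC-GC-GT", "TA-CA-CAGT"))  # 4 * (11 + 1) = 48
# One gap covering the same four positions would cost far less.
print(gap_cost(4))                                  # 11 + 4 = 15
```

With existence 11 and extension 1, four scattered single-residue gaps cost 48 while a single four-residue gap costs 15, which reflects the idea that insertions and deletions tend to span several residues at a time.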
Various analyses, such as homology modeling for prediction of protein structure, phylogenetic analysis and motif detection, are based on the results of multiple sequence alignment. Many software packages, such as Clustal, T-Coffee, MSA and MUSCLE, are used for obtaining multiple sequence alignments, and PHYLIP is used for phylogenetic analysis of the results.

Example
Seq 3 -
Seq 4 -
Seq 5 - PQGGGGWGQ

The following parameters should be considered when aligning multiple sequences:

Protein weight matrix - The substitution matrix used to score aligned residues, e.g. PAM and BLOSUM.

Gap open - The penalty to open a gap. The presence of a gap is frequently given more significance than the length of the gap. By default, the gap opening penalty is 10.

Gap extension - The penalty to extend a gap. Extension of the gap to include additional amino acids is also penalized in the scoring of an alignment. By default, the gap extension penalty is 0.20.

Application of MSA results - Phylogenetic analysis
It is one of the major areas where multiple sequence alignment results are used to find the evolutionary relatedness between sequences. The results are displayed in the form of a phylogenetic tree, which has a set of nodes and branches that link the nodes. Methods used for phylogenetic analysis are:

1) UPGMA (Unweighted Pair Group Method with Arithmetic Mean) - It is a simple hierarchical clustering method for building trees, which uses a distance matrix to find the relatedness between sequences.

2) NJ (Neighbor Joining) - It is a method related to the clustering methods. It is especially suited for datasets comprising lineages with largely varying rates of evolution. It can be seen as a special case of the star decomposition method.

File format view:
PHYLIP - PHYLogeny Inference Package. It is the format for Joe Felsenstein's phylogenetic applications, with a maximum length of 8 letters for the sequence ID.

Cladogram - In a cladogram, the external taxa line up neatly in a row. The branch lengths are not proportional to the number of evolutionary changes, so no phylogenetic time analysis can be done; only the relative ordering of the taxa can be analyzed.

Phylogram - In a phylogram, the branch lengths represent the amount of evolutionary divergence. Such trees are said to be scaled.

Pearson/FASTA - Text-based format that represents nucleotides or amino acids in single-letter code. Each sequence is preceded by a line giving the sequence name, optionally followed by comments.

Jalview - Java alignment editor. It is a visualization tool for the results of alignment algorithms and other database searches.
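The Pearson/FASTA layout described above can be read with a few lines of code. The sketch below is a minimal illustration rather than a full parser: it assumes the common convention that each record starts with a ">" header line whose first word is the sequence name and whose remainder is the comment, with the sequence on the following lines. The function name parse_fasta and the record names seq_a and seq_b are hypothetical; the residues are taken from the examples in the text.

```python
def parse_fasta(text):
    """Return a list of (name, comment, sequence) tuples from FASTA-formatted text."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, comment, "".join(chunks)))
            header = line[1:].strip()
            name, _, comment = header.partition(" ")
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


example = """>seq_a hypothetical example record
KTSSGN
GAEDS
>seq_b
PQGGGGWGQ
"""

for name, comment, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```

For real work, an established library parser is usually preferable, since it also handles malformed headers and very large files.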