Sequence analysis and comparison

Marjolein Thunnissen, Lund, September 2012

The aim with sequence identification:
- Is there any known protein sequence that is homologous to mine?
- Are there any other species which have a similar gene (ORF)?
- Has anybody already studied this protein or a similar one?
- What is the biochemical function, and what physicochemical characteristics can be expected?

Search & analysis strategy:
- Sequence search based on homology (similarity).
- Pattern searches: search for occurrences of a predefined pattern (may be a short sequence motif).
- Annotation searches: search by keywords, authors, additional features.
- Search for a 3D structure of a homologous protein.

Amino acid sequences contain:
- Information regarding the protein's function (catalytic activity, specific recognition sites, etc.).
- The protein's evolutionary origin.
- Information regarding the type of its 3D structure (folding type).
Extracting this information is the task of sequence analysis.

Other goals of sequence analysis:
- Assembly of sequence fragments into complete units (proteins, genes, chromosomes).
- Finding open reading frames (ORFs) in cDNAs or genomic DNA, and using codon usage tables.
- Management of sequence information.
- Prediction of the biochemical and physicochemical characteristics of a protein (molecular weight, isoelectric point (pI), extinction coefficient).
- Finding and using consensus sequences. Examples:
  - promoters
  - transcription initiation sites
  - transcription termination sites
  - polyadenylation sites
  - ribosome binding sites
  - protein features
  - post-translational modifications: formation of disulfide bonds, glycosylation, cleavage of signal sequences, etc.

Analysing relationships between proteins - some general rules:
- Proteins with the same function taken from closely related organisms have highly similar amino acid sequences.
- The greater the differences observed between related proteins, the longer the time since the organisms diverged (genetic divergence). The opposite is genetic convergence.

Types of sequence comparison and alignment:
- Compare sequence to database. Goal: find related sequences (SIMILARITY).
- Compare sequence to sequence. Goal: find matching domains (ALIGNMENT).
- Compare database to database. Goal: estimate genetic distance (EVOLUTION).
- Comparisons can also be used to determine consensus sequences.
- Comparisons can be pairwise or multiple.

Sequence alignment:
Sequence alignment makes it possible to align and compare a sequence with a family of related sequences, revealing conserved regions of functional importance. An accurate alignment can also give an idea of the 3D structure of a protein. Since there are many ways of aligning two sequences (an alignment produced by a program is only one of several possible), we need criteria to judge the quality of an alignment.

Modifications of a protein sequence to be considered:
- Replacement of one amino acid by another:
    aabb
    acbb
- Insertion and deletion of single amino acids and larger blocks:
    ccc-dee
    c-cddee
- Large rearrangements of the gene:
    aaaaaabbbbbb
    bbbbbbaaaaaa

Alignment accuracy - Mind the Gap!
The best alignment is the one that has the maximum number of identical residues aligned against each other (% similarity). Example:
    Sequence 1:         CPKICIGGWFAAY
    Sequence 2:         CSGICKKAWFV-Y
    Alignment pattern:  C--IC---WF--Y
    Similarity = 6/13 = 46%
    Score (s) = matches - mismatches = 6 - 7 = -1

Introducing gaps can increase the number of matches. The same two sequences aligned without and with gaps (2 identical positions without gaps, 3 with two gaps):
    GATC      GAT-C
    GTGC      G-TGC

Generally:
    S = Σ gains (identities, replacements) - Σ penalties
    Penalties = number of gaps × gap creation penalty
The values of identities and replacements are elements of the replacement matrix.

Rules of thumb:
- As many residues as possible should be aligned.
- A gap should be added only if it significantly increases the number of matches.
- The size of the gap and its position are important.
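The arithmetic of the worked example can be checked with a few lines of Python. This is only a sketch of the simple scheme used on the slide (+1 per identical position, -1 per other position, a gap counted as a mismatch). The function name `score_alignment` and its parameters are illustrative and do not come from any alignment package.

```python
def score_alignment(row1, row2, match=1, mismatch=-1):
    """Score two equal-length alignment rows.

    Returns (identical positions, fraction identical, score), where
    score = matches * match + mismatches * mismatch.
    A position containing a gap ('-') is counted as a mismatch,
    which is how the slide example arrives at 6 - 7 = -1.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    mismatches = len(row1) - matches
    return matches, matches / len(row1), matches * match + mismatches * mismatch


matches, identity, score = score_alignment("CPKICIGGWFAAY", "CSGICKKAWFV-Y")
print(matches, round(100 * identity), score)   # 6 46 -1

# The short DNA example: two gaps buy one extra identical position.
print(score_alignment("GATC", "GTGC")[0], score_alignment("GAT-C", "G-TGC")[0])   # 2 3
```

A realistic scoring function would replace the fixed +1/-1 with values from a replacement matrix and charge a separate gap penalty, as in the general formula for S.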

Substitution scoring schemes:
A scoring scheme is needed to assign a score to each possible substitution of one amino acid by another. In total there are 210 possible pairs (190 pairs of different a.a. + 20 pairs of identical a.a.), presented in the form of a 20 x 20 matrix. Possible scoring schemes include:
- Identity scoring: 0 if the a.a. are different and 1 if they are the same.
- Observed substitutions: assigns weights based on the analysis of substitution frequencies derived from manual alignments.
- Chemical similarity score: gives a higher weight to the alignment of a.a. with similar chemical properties (e.g. V/L, K/R).

Amino acid substitution matrices - the PAM family of matrices (Dayhoff matrix):
- Take an aligned set of closely related proteins (1300 sequences in 72 families in the original work).
- For each position in the set, find the most common amino acid observed.
- Calculate the frequency with which each other amino acid is observed at that position.
- Combine the frequencies from all positions to give a table of frequencies for each amino acid changing to each other amino acid.
- Take the logarithm and normalize for the frequency of each amino acid.

Properties of the PAM matrix:
- Each element M(i,j) gives the probability that the a.a. in column j has mutated to the a.a. in row i after a particular evolutionary time, measured in PAM (percentage of accepted mutations per 10^8 years).
- 1 PAM corresponds to an average change in 1% of all a.a. positions.
- After 100 PAM of evolution not every residue will have changed: some will have mutated several times, perhaps returning to their original state, while others will not have changed at all.
- At 256 PAM, 80% of all a.a. will have changed, although to various degrees: 48% of Trp, 41% of Cys and 20% of His would be unchanged, but only 7% of Ser will remain.

# PAM 250 matrix
# Science June 5, 1992.
# Values rounded to nearest integer
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2  -1   0   0   0   0   0   0  -1  -1  -1   0  -1  -2   0   1   1  -4  -2   0
R  -1   5   0   0  -2   2   0  -1   1  -2  -2   3  -2  -3  -1   0   0  -2  -2  -2
N   0   0   4   2  -2   1   1   0   1  -3  -3   1  -2  -3  -1   1   0  -4  -1  -2
D   0   0   2   5  -3   1   3   0   0  -4  -4   0  -3  -4  -1   0   0  -5  -3  -3
C   0  -2  -2  -3  12  -2  -3  -2  -1  -1  -2  -3  -1  -1  -3   0   0  -1   0   0
Q   0   2   1   1  -2   3   2  -1   1  -2  -2   2  -1  -3   0   0   0  -3  -2  -2
E   0   0   1   3  -3   2   4  -1   0  -3  -3   1  -2  -4   0   0   0  -4  -3  -2
G   0  -1   0   0  -2  -1  -1   7  -1  -4  -4  -1  -4  -5  -2   0  -1  -4  -4  -3
H  -1   1   1   0  -1   1   0  -1   6  -2  -2   1  -1   0  -1   0   0  -1   2  -2
I  -1  -2  -3  -4  -1  -2  -3  -4  -2   4   3  -2   2   1  -3  -2  -1  -2  -1   3
L  -1  -2  -3  -4  -2  -2  -3  -4  -2   3   4  -2   3   2  -2  -2  -1  -1   0   2
K   0   3   1   0  -3   2   1  -1   1  -2  -2   3  -1  -3  -1   0   0  -4  -2  -2
M  -1  -2  -2  -3  -1  -1  -2  -4  -1   2   3  -1   4   2  -2  -1  -1  -1   0   2
F  -2  -3  -3  -4  -1  -3  -4  -5   0   1   2  -3   2   7  -4  -3  -2   4   5   0
P   0  -1  -1  -1  -3   0   0  -2  -1  -3  -2  -1  -2  -4   8   0   0  -5  -3  -2
S   1   0   1   0   0   0   0   0   0  -2  -2   0  -1  -3   0   2   2  -3  -2  -1
T   1   0   0   0   0   0   0  -1   0  -1  -1   0  -1  -2   0   2   2  -4  -2   0
W  -4  -2  -4  -5  -1  -3  -4  -4  -1  -2  -1  -4  -1   4  -5  -3  -4  14   4  -3
Y  -2  -2  -1  -3   0  -2  -3  -4   2  -1   0  -2   0   5  -3  -2  -2   4   8  -1
V   0  -2  -2  -3   0  -2  -2  -3  -2   3   2  -2   2   0  -2  -1   0  -3  -1   3
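To make the matrix usable in the later examples, the PAM 250 values listed above can be loaded into a lookup table. The sketch below copies the 20 rows as text and builds a dictionary keyed by residue pairs. The names `AMINO_ACIDS`, `ROWS`, `build_matrix` and `PAM250` are illustrative choices, not part of any library. It also confirms the count of 210 distinct unordered pairs quoted earlier.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column/row order of the table

# The 20 rows of the PAM 250 table, in the same order as AMINO_ACIDS.
ROWS = """
 2 -1  0  0  0  0  0  0 -1 -1 -1  0 -1 -2  0  1  1 -4 -2  0
-1  5  0  0 -2  2  0 -1  1 -2 -2  3 -2 -3 -1  0  0 -2 -2 -2
 0  0  4  2 -2  1  1  0  1 -3 -3  1 -2 -3 -1  1  0 -4 -1 -2
 0  0  2  5 -3  1  3  0  0 -4 -4  0 -3 -4 -1  0  0 -5 -3 -3
 0 -2 -2 -3 12 -2 -3 -2 -1 -1 -2 -3 -1 -1 -3  0  0 -1  0  0
 0  2  1  1 -2  3  2 -1  1 -2 -2  2 -1 -3  0  0  0 -3 -2 -2
 0  0  1  3 -3  2  4 -1  0 -3 -3  1 -2 -4  0  0  0 -4 -3 -2
 0 -1  0  0 -2 -1 -1  7 -1 -4 -4 -1 -4 -5 -2  0 -1 -4 -4 -3
-1  1  1  0 -1  1  0 -1  6 -2 -2  1 -1  0 -1  0  0 -1  2 -2
-1 -2 -3 -4 -1 -2 -3 -4 -2  4  3 -2  2  1 -3 -2 -1 -2 -1  3
-1 -2 -3 -4 -2 -2 -3 -4 -2  3  4 -2  3  2 -2 -2 -1 -1  0  2
 0  3  1  0 -3  2  1 -1  1 -2 -2  3 -1 -3 -1  0  0 -4 -2 -2
-1 -2 -2 -3 -1 -1 -2 -4 -1  2  3 -1  4  2 -2 -1 -1 -1  0  2
-2 -3 -3 -4 -1 -3 -4 -5  0  1  2 -3  2  7 -4 -3 -2  4  5  0
 0 -1 -1 -1 -3  0  0 -2 -1 -3 -2 -1 -2 -4  8  0  0 -5 -3 -2
 1  0  1  0  0  0  0  0  0 -2 -2  0 -1 -3  0  2  2 -3 -2 -1
 1  0  0  0  0  0  0 -1  0 -1 -1  0 -1 -2  0  2  2 -4 -2  0
-4 -2 -4 -5 -1 -3 -4 -4 -1 -2 -1 -4 -1  4 -5 -3 -4 14  4 -3
-2 -2 -1 -3  0 -2 -3 -4  2 -1  0 -2  0  5 -3 -2 -2  4  8 -1
 0 -2 -2 -3  0 -2 -2 -3 -2  3  2 -2  2  0 -2 -1  0 -3 -1  3
"""


def build_matrix(rows_text, alphabet=AMINO_ACIDS):
    """Turn the text rows into a dict: matrix[(row_aa, col_aa)] -> score."""
    matrix = {}
    lines = rows_text.strip().splitlines()
    if len(lines) != len(alphabet):
        raise ValueError("expected one row per amino acid")
    for row_aa, line in zip(alphabet, lines):
        values = [int(v) for v in line.split()]
        if len(values) != len(alphabet):
            raise ValueError(f"row {row_aa} does not have 20 values")
        for col_aa, value in zip(alphabet, values):
            matrix[row_aa, col_aa] = value
    return matrix


PAM250 = build_matrix(ROWS)

# 190 pairs of different residues + 20 identical pairs = 210 distinct pairs.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
print(len(pairs))                                             # 210

# Chemically similar pairs score positively, W-W scores highest.
print(PAM250["V", "L"], PAM250["K", "R"], PAM250["W", "W"])   # 2 3 14
```

Because the table is symmetric, `PAM250[a, b]` and `PAM250[b, a]` give the same value, so the order of the two residues does not matter when scoring an aligned pair.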

Other types of matrices:
- PET91: a version of PAM using a set of 2621 families of sequences.
- BLOSUM (blocks substitution matrix): amino acid substitution tables that score amino acid pairs based on the frequency of amino acid substitutions in aligned sequence motifs (blocks). Based on local alignments of 2000 blocks from 500 families.
- Different BLOSUM types: 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90. BLOSUM62, the most popular, is derived from blocks in which sequences sharing 62% identity or more are clustered together.
- High BLOSUM numbers: closely related sequences. Low BLOSUM numbers: distant sequences.
- Differences with PAM: evolutionarily divergent proteins are used, and blocks (local alignments) are used instead of global alignments.

Which matrix to use? Relationships between matrices:
- PAM-1 / BLOSUM-90: small evolutionary distance; high identity within short sequences.
- PAM-250 / BLOSUM-20: large evolutionary distance; low identity within long sequences.

Biological criteria can be used in alignment:
- Frequent and infrequent residues.
- Structurally or functionally important amino acids.
- A match to highly conserved residues.
- Repetitive sequences.

Methods for sequence comparisons - the sliding window method:
The sliding window is central to many of the algorithms used in sequence analysis. The basic idea is to define a "window" of a certain number of residues (nucleotides or amino acids) and to calculate some value for the residues in that fragment. Once the calculation is completed, the program shifts by one residue and analyzes the next window of residues. This process repeats until the end of the sequence is reached.
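The sliding window idea can be sketched in a few lines of Python. The window generator is generic; the value computed per window here (fraction of hydrophobic residues) is just one possible choice, and the `HYDROPHOBIC` set is a crude assumption made for illustration, not a published hydropathy scale.

```python
def sliding_windows(sequence, size):
    """Yield (start_index, window) for every full window of `size` residues,
    shifting by one residue each time."""
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


# Assumption for illustration only: a rough set of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_fraction_profile(sequence, size):
    """Compute one value per window: the fraction of hydrophobic residues."""
    return [
        sum(residue in HYDROPHOBIC for residue in window) / size
        for _, window in sliding_windows(sequence, size)
    ]


print(list(sliding_windows("ABCDE", 3)))
# [(0, 'ABC'), (1, 'BCD'), (2, 'CDE')]

profile = hydrophobic_fraction_profile("CPKICIGGWFAAY", 5)
print(len(profile), profile[0])   # 9 0.6
```

A sequence of length n gives n - size + 1 windows, so the profile is slightly shorter than the sequence. Larger windows give smoother profiles at the cost of positional detail.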

Sliding window in sequence analysis:
Given two sequences A and B, all possible overlapping segments of a particular length (the window length) from A are compared to all segments of B. For each pair of segments the amino acid pair scores are accumulated over the length of the segment. For example, the comparison of the two segments
    ALGAWDE
    ALATWDE
gives a score of 1+1+0+0+1+1+1 = 5.

The dot matrix method for sequence comparison:
The two axes each represent one of the two sequences: sequence A along the top from left to right, and sequence B along the left from top to bottom. The matrix is filled in by taking a window of sequence A and scanning along sequence B. Whenever a match occurs, a dot is placed in the matrix. After reaching the end of sequence B, a new query segment is generated from sequence A by sliding the window to the next position in sequence A.

Figure: Example of a dot matrix comparison of two protein sequences.
Figure: Dot matrix comparison of genomic DNA and cDNA sequences.

When two sequences share similarity over their entire length, a diagonal line extends from one corner of the dot plot to the diagonally opposite corner. If two sequences share only patches of similarity, this is revealed by diagonal stretches. Jumps between diagonals correspond to positions where one sequence has more (or fewer) letters than the other (insertions and deletions).
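The window comparison and the dot matrix described above can be sketched as follows. Identity scoring (1 for identical residues, 0 otherwise) is assumed, matching the ALGAWDE/ALATWDE example. The function names, the default window length and the threshold are illustrative choices.

```python
def window_score(seg_a, seg_b):
    """Identity score of two equal-length segments (1 per identical pair)."""
    return sum(1 for a, b in zip(seg_a, seg_b) if a == b)


def dot_matrix(seq_a, seq_b, window=3, threshold=3):
    """Return a list of rows of booleans.

    Columns follow sequence A (left to right), rows follow sequence B
    (top to bottom). A dot (True) is placed where the two windows starting
    at those positions score at least `threshold`.
    """
    n_cols = len(seq_a) - window + 1
    n_rows = len(seq_b) - window + 1
    return [
        [
            window_score(seq_a[j:j + window], seq_b[i:i + window]) >= threshold
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in matrix)


print(window_score("ALGAWDE", "ALATWDE"))   # 5

# A sequence compared with itself gives an unbroken main diagonal.
print(render(dot_matrix("ALGAWDE", "ALGAWDE")))
```

Lowering `threshold` below `window` tolerates mismatches inside a window, which reveals weaker similarity but also adds background noise to the plot.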

Alignment using dynamic programming:
Given two sequences A and B, at each aligned position there are 3 possibilities:
- w(Ai, Bj): substitution of Ai by Bj
- w(Ai, D): deletion of Ai
- w(D, Bj): deletion of Bj
The weight w is derived from the chosen scoring scheme (e.g. a PAM matrix). Gaps (D) are given a negative weight, called the gap penalty, since insertions and deletions are less common than substitutions.

Graphical representation of dynamic programming:
The aim is to find the path that gives the maximal score. Three moves are allowed: matching residues (diagonal move), deleting a residue from one sequence (horizontal move), or deleting a residue from the other sequence (vertical move). Example of a resulting alignment:
    RNI-LVSDAKNVGI
    RDISLV---KNAGI

Types of alignment:
- Global alignment: align two sequences from beginning to end, requiring that all sequence positions are included in the alignment. Used for the alignment of sequences known to be related.
- Local alignment: find the best region of similarity between two sequences without requiring that the entire sequences match (the result may be several alignments with close or different scores). Used in database searching and in the alignment of distantly related sequences with several regions of homology.
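The three moves can be turned into a small dynamic programming routine. The sketch below is not taken from the slides: it uses the standard textbook formulations (commonly known as Needleman-Wunsch for global and Smith-Waterman for local alignment) with a simple +1/-1 scheme and a linear gap penalty as stand-ins for a real substitution matrix. All function names and default parameters are assumptions.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with traceback. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # diagonal move: pair a[i-1], b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    # Trace one optimal path back from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: same recurrence, but never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score, top, bottom = global_alignment("GATC", "GTGC")
print(score)
print(top)
print(bottom)
print(local_alignment_score("TTGATCTT", "AAGATCAA"))
```

Several paths can share the maximal score, so the alignment printed is one optimal solution, not necessarily the only one. The only difference in the local version is the extra 0 in the `max`, which lets an alignment start afresh wherever the running score would become negative.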

Functional information from multiple sequence alignment:
A multiple sequence alignment allows us to extract information that is difficult to extract from a single sequence or from an alignment of only two sequences. When making a multiple sequence alignment, try to include both sequences that are highly conserved and some that are more distantly related. If possible, use programs for automatic analysis of multiple sequence alignments (e.g. AMAS at http://www.compbio.dundee.ac.uk/software/Amas/amas.html).

Structural information from multiple sequence alignment (example: alignment of ferrochelatase):
- Positions of insertions and deletions suggest regions of surface loops in the 3D structure.
- Conserved Gly and Pro suggest a β-turn.
- Hydrophobic residues conserved at positions i, i+2, i+4, etc., separated by hydrophilic residues, suggest a surface β-strand.
- A short run of hydrophobic residues (4 aa) may suggest a buried β-strand; longer stretches (20 aa) may suggest a membrane-spanning helix.
- Pairs of conserved hydrophobic aa separated by pairs of hydrophilic residues suggest an α-helix with one face packed against the protein core.
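Two of the rules above (gapped columns hint at loops, conserved Gly/Pro hint at turns) are easy to apply mechanically. The sketch below scans the columns of a small alignment. The three aligned sequences are invented for illustration and are not ferrochelatase data, and the function name `column_summary` is hypothetical.

```python
from collections import Counter


def column_summary(alignment):
    """Classify each column of an alignment given as equal-length strings.

    Returns three lists of column indices:
      gap_columns   - columns containing at least one gap ('-')
      conserved     - gap-free columns with a single residue type
      conserved_gp  - conserved columns whose residue is Gly or Pro
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    gap_columns, conserved, conserved_gp = [], [], []
    for col in range(length):
        residues = [row[col] for row in alignment]
        if "-" in residues:
            gap_columns.append(col)
            continue
        counts = Counter(residues)
        if len(counts) == 1:
            conserved.append(col)
            if residues[0] in "GP":
                conserved_gp.append(col)
    return gap_columns, conserved, conserved_gp


# Hypothetical toy alignment, for illustration only.
toy = [
    "MKVLG-PAE",
    "MRVLGSPAD",
    "MKILG-PSE",
]
gaps, conserved, turns = column_summary(toy)
print(gaps)        # [5]           -> candidate loop position
print(conserved)   # [0, 3, 4, 6]
print(turns)       # [4, 6]        -> conserved Gly/Pro, candidate turn
```

These are hints to be checked against a structure or a prediction, not conclusions. With only three sequences many columns look conserved by chance, which is why the notes recommend including more distantly related sequences.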

Alignment accuracy:
- The accuracy of a multiple sequence alignment is generally higher than that of a pairwise alignment.
- Overall alignment accuracy: it is possible to compare the score to the distribution of scores for alignments of random sequences of the same length and composition. The result may be expressed in standard deviation units above the mean.
- The alignment of some regions is more reliable than that of others. The most reliable regions are those for which the alignment does not change when small changes are made to the gap penalty and matrix parameters. The least reliable are regions of insertions and deletions, often loop regions.
- Percentage identity: unrelated sequences chosen at random are expected to be identical in about 5% of their residues. To be confident of homology, more than 20% identity is required.
- Percentage identity depends on the length of the alignment: an alignment of 200 residues with 30% identity is more significant than an alignment of 50 residues with 30% identity.

What are you trying to find out?
- Are you trying to locate similar domains or motifs? --> Local alignment is probably best.
- Are you trying to determine whether the sequences come from the same family? --> Use one of the BLOSUM matrices.
- Are you trying to determine how closely related the sequences are evolutionarily? --> Use one of the PAM matrices.
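The comparison against random sequences of the same length and composition can be sketched by shuffling one sequence many times and re-scoring. The code below is a minimal illustration: it uses a simple global alignment score (+1/-1, linear gap penalty) in place of a real matrix, and the names `global_score` and `z_score`, the number of shuffles and the seed are all assumptions.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score only (no traceback), one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def z_score(a, b, n_shuffles=200, seed=0):
    """Observed score expressed in standard deviations above the mean score
    obtained when b is replaced by shuffled copies of itself (same length,
    same composition)."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = global_score(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        random_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("shuffled scores show no variation")
    return (observed - mean) / sd


z = z_score("CPKICIGGWFAAY", "CPKICIGGWFAAY")
print(round(z, 1))
```

A sequence aligned with itself should sit many standard deviations above the shuffled background, whereas a score close to the background mean gives no evidence of relatedness. The same check can be repeated with slightly different gap penalties to see which parts of the conclusion are stable.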
