Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Pairwise similarity searching (1)
Figure 5.5: manual alignment. One of the amino acids in the top sequence has no equivalent, and a gap has been introduced into the bottom sequence to make the alignment more meaningful.
Two computer algorithms are used for real protein sequences:
-The Needleman-Wunsch algorithm looks for global similarity between the sequences
-The Smith-Waterman algorithm focuses on shorter regions of local similarity
Both are dynamic programming algorithms, which find alignments containing the largest possible number of identical and similar amino acids by inserting gaps wherever necessary.

Pairwise similarity searching (2)
The problem with this approach is that the indiscriminate use of gaps can make any two sequences match, no matter how dissimilar (Figure 5.6).

Pairwise similarity searching (3)
The problem is addressed by constraining the dynamic programming algorithms with gap penalties, which reduce the overall alignment score as more gaps are introduced.
-Figure 5.7a: A head-to-head alignment with no gaps provides a relatively low score
-The indiscriminate insertion of gaps would produce a higher score but a meaningless alignment
-Figure 5.7b: A sensible gap penalty, which reduces the alignment score as more gaps are introduced, produces the optimal alignment, in which there are three gaps.
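
The effect of a gap penalty can be sketched with a small global (Needleman-Wunsch style) scoring routine. This is a minimal illustration, not the scheme behind Figure 5.7: the scoring values (match +1, mismatch -1, gap -2) are assumptions, and real programs score residue pairs with a substitution matrix.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score for sequences a and b.

    The match, mismatch and gap values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

With free gaps (`gap=0`), two unrelated sequences such as `"AAAA"` and `"CCCC"` reach a score of 0 simply by gapping every residue, which beats the ungapped score of -4. Once gaps are penalized, the ungapped alignment becomes the best one available.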

Pairwise similarity searching (4)
Most algorithms employ more complex penalty systems, in which the penalty is proportional to the length of the gap, or in which there is an initial penalty for opening a gap and then a lower penalty for extending it.
However, dynamic programming algorithms are slow and resource-hungry.
=> Alternative heuristic methods have been developed, which avoid a full dynamic programming search and are faster but less accurate. These have been important in the development of Internet-based database search facilities.
-BLAST and FASTA
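
The two penalty systems described above can be written as simple functions of gap length. The numerical values below are assumptions chosen for illustration, and conventions differ between programs (some charge the extension penalty for the first gap position as well).

```python
def linear_gap_penalty(length, per_residue=2):
    """Penalty proportional to the length of the gap."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Initial penalty for opening a gap, then a lower penalty per extension.

    Convention assumed here: the first gap position costs gap_open and
    each additional position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)
```

Under these assumed values, one gap of five residues costs 14 with the affine scheme, whereas five separate single-residue gaps cost 50, so the affine scheme favors a few long gaps over many scattered ones.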

Pairwise similarity searching (5)
BLAST and FASTA
-Several variants exist (Table 5.1)
-http://blast.ncbi.nlm.nih.gov/blast.cgi

Pairwise similarity searching (6)
Both BLAST and FASTA take into account the fact that high-scoring alignments are likely to contain short stretches of identical or near-identical letters, which are sometimes termed words.
-BLAST looks for words of a certain fixed length (W) that score above a given threshold level (T), set by the user.
-In FASTA, this word length is two amino acids and there is no T value.
Both programs extend their matching segments to produce longer alignments, which in BLAST are called high-scoring segment pairs.
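
The word-matching step can be sketched as a lookup of identical words shared by two sequences. This is a simplified illustration in the spirit of the FASTA word search (word length 2, exact identity only). It does not implement BLAST's threshold T, which also admits near-identical words scored with a substitution matrix, and the function names are hypothetical.

```python
from collections import defaultdict


def index_words(seq, k=2):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_words(query, subject, k=2):
    """Return (query_pos, subject_pos) for every identical word of length k."""
    index = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits
```

For example, `shared_words("WVLTAAHC", "AAHCWV")` returns `[(0, 4), (4, 0), (5, 1), (6, 2)]`. The last three hits lie on the same diagonal (query position minus subject position is 4 in each case), which is the kind of cluster that the extension step would merge into a longer matching segment.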

Significance of sequence alignments (1)
The significance of a sequence identity or sequence similarity score depends on the length of the sequence over which the alignment takes place.
Ex. 60% similarity over 30 residues vs 60% similarity over 300 residues (the latter is much less likely to arise by chance)
The difference between chance similarity and alignments that have real biological significance is determined by the statistical analysis of search scores, using p values and E values.

Significance of sequence alignments (2)
The p value of a similarity score S is the probability that a score of at least S would have been obtained in a match between any two unrelated protein sequences of similar composition and length.
-Low p value (ex. p value = 0.01): it is very unlikely that the similarity score was obtained by chance.
-The E value is related to the p value and is the expected frequency with which similarity scores of at least S would occur by chance.
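
The relationship between the two values can be made concrete under one common assumption, namely that chance hits scoring at least S follow a Poisson distribution (the model used in BLAST statistics). The probability of seeing at least one such hit is then p = 1 - exp(-E). The function name below is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)
```

For small values the two are nearly identical (E = 0.01 gives p of about 0.00995), while for large E the p value saturates close to 1. This is one reason why search programs tend to report E values, which remain informative when many chance hits are expected.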

Multiple alignments (1)
Multiple alignments search for inter-relationships between members of a protein family.
If the same residue is found at the same position in five or ten proteins in the family, especially if the proteins are diverse, this suggests that the residue may play a key functional role.

Multiple alignments (2)
Figure 5.8: Multiple alignment of 15 serine protease sequences.
The most highly conserved residues are those whose physical and chemical properties are absolutely essential to maintain protein function.
Ex. Histidine, Cysteine, Proline

Multiple alignments (3)
ClustalW/X: the most commonly used software package
These programs use a progressive alignment strategy: pairwise alignments are carried out first to assess the degree of similarity between each pair of sequences, and the results are used to produce a dendrogram of these relationships, which is similar to a phylogenetic tree. The two most similar sequences are aligned first and the others are added in order of similarity.
Advantage: fast
Disadvantage: information in distant sequence alignments that could improve the overall alignment is lost.
=> The alignment is adjusted manually by bringing conserved residues into register
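
The ordering step of progressive alignment can be sketched as follows. This is a toy illustration only: it assumes the sequences already have equal length and measures similarity as the fraction of identical positions, whereas ClustalW/X scores real pairwise alignments and follows a guide tree. The function names and the example sequences are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def joining_order(seqs):
    """Return sequence names in the order they would be added.

    seqs maps a name to a sequence. The most similar pair comes first,
    then each remaining sequence in order of its best similarity to any
    sequence already included.
    """
    names = list(seqs)
    first = max(
        combinations(names, 2),
        key=lambda pair: identity(seqs[pair[0]], seqs[pair[1]]),
    )
    order = list(first)
    remaining = [name for name in names if name not in order]
    while remaining:
        best = max(
            remaining,
            key=lambda n: max(identity(seqs[n], seqs[m]) for m in order),
        )
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical toy sequences, not taken from Figure 5.8.
example = {
    "A": "WVLTAAHC",
    "B": "WVLSAAHC",
    "C": "WIVTAGHC",
    "D": "MKKTPPHC",
}
```

Here `joining_order(example)` starts with A and B (7 of 8 positions identical), then adds C, and finally the most divergent sequence, D.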

Finding more distant relationships
The standard similarity search algorithms discussed above are able to detect sequences showing 30% similarity with reasonable reliability.
However, as sequences begin to diverge even further, the evolutionary relationships between proteins are more difficult to detect.
Proteins with very little sequence similarity can still be related, because protein structure is much more strongly conserved in evolution than sequence.
Ex. Globin family

PSI-BLAST
PSI-BLAST: position-specific iterated BLAST
The principle is iterated database searching, in which the results of a standard BLAST search are collected into a profile, which is then used for a second round of searching.

PSI-BLAST
Figure 5.9: Query sequence A will find any sequences that show a sufficient degree of similarity (B, C, D). Then, if B, C and D are used as the search queries, the threshold of detection is extended to include E and F. In the next iteration, a profile that includes all the sequences from A to F should identify G.
One problem with PSI-BLAST is its tendency to identify spurious matches.

Pattern recognition
Pattern recognition search methods are an extension of the multiple alignment strategy for identifying structurally and functionally conserved elements of proteins.
-Consensus sequences
-Sequence patterns
-Motifs and blocks
-Domains
Each of the secondary databases built on the above methods has its strengths and weaknesses.
=> An integrated cross-referencing tool called InterPro has been developed, which allows a query sequence to be screened against all of the databases and then extracts and presents the relevant information. (Plate 4, located at p82-83)

Consensus sequence
Consensus sequences: a single sequence that represents the most common amino acid residue found at each position in a multiple alignment.
If at any given position no single amino acid is shared by 60% or more of the sequences, then there is no consensus and the residue is represented by X.
e.g. from Figure 5.8: W-V-X-T-A-A-H-C
Major drawback: it does not take into account conservative substitutions (e.g. leucine, isoleucine and valine), which would be informative. This method is rarely used.
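
The 60% rule can be sketched in a few lines. The alignment below is a hypothetical toy example constructed to reproduce the consensus shown above. It is not the data from Figure 5.8, and the function name is an assumption.

```python
from collections import Counter


def consensus(alignment, threshold=0.6):
    """Return the consensus of equal-length aligned sequences.

    A position is written as X when no single residue is shared by at
    least `threshold` of the sequences.
    """
    residues = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        residues.append(residue if count / len(column) >= threshold else "X")
    return "-".join(residues)


# Hypothetical toy alignment of five sequences.
toy_alignment = [
    "WVLTAAHC",
    "WVVTAAHC",
    "WILTAAHC",
    "WVMSAAHC",
    "WVITAAHC",
]
```

`consensus(toy_alignment)` gives `W-V-X-T-A-A-H-C`. The third position holds L, V, M and I, all hydrophobic residues, yet it collapses to X, which illustrates the drawback noted above.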

Sequence patterns
Sequence patterns: like consensus sequences, except that variation is allowed at each position and is shown within brackets.
e.g. from Figure 5.8: W-[VI]-[LV]-[ST]-A-A-H-C or W-[VI]-[LIVM]-[ST]-A-[STAG]-H-C
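
Patterns written in this bracket notation map almost directly onto regular expressions, so a sequence can be screened with a short converter. This sketch handles only the elements used above (single residues, bracketed alternatives and X for any residue), not the full syntax of any particular pattern database. The function name is hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Compile a hyphen-separated sequence pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        # X stands for any residue; brackets are already valid regex classes.
        parts.append("." if element == "X" else element)
    return re.compile("".join(parts))


serine_protease = pattern_to_regex("W-[VI]-[LIVM]-[ST]-A-[STAG]-H-C")
```

For example, `serine_protease.search("GGWVLTAAHCGG")` finds a match starting at position 2, while `serine_protease.search("WALTAAHC")` returns `None` because A is not allowed at the second position.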

Motifs and blocks
These are not individual sequences but multiply-aligned ungapped segments derived from the most highly conserved regions in protein families.
Found in two databases:
(1) PRINTS: individual motifs from a single protein family are grouped together as fingerprints (Figure 5.10).
(2) BLOCKS

Domains
A protein domain: an independent unit of structure or function, which can often be found in the context of otherwise unrelated sequences.
-ProDom, which lists the sequences of known protein domains and is created automatically by searching protein primary sequence databases.
-PROSITE, which lists sequence profiles corresponding to domain sequences; these are weight matrices showing the likelihood of particular amino acids being found at each position.
-Pfam and SMART
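
A weight matrix of the kind described above can be sketched as a table of log-odds scores built from an ungapped alignment. This is a simplified illustration and not the PROSITE profile format: it assumes a uniform background frequency for the 20 amino acids and a pseudocount of 1, and the function names and alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, segment):
    """Sum the position-specific scores of a segment of the same length."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Hypothetical toy alignment (standard residues only, no gaps).
toy_profile = build_profile([
    "WVLTAAHC",
    "WVVTAAHC",
    "WILTAAHC",
    "WVMSAAHC",
    "WVITAAHC",
])
```

Unlike a consensus sequence, the profile gives partial credit to substitutions that were observed: a segment such as `"WIVSAAHC"`, which matches none of the input sequences exactly, still scores well above an unrelated segment such as `"GGGGGGGG"`.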

Pitfalls of functional annotation by similarity searching
Standard similarity searches, recursive methods and pattern or profile searching can all identify sequences which are more or less related to a particular query. However, these methods are not foolproof, and all have the potential to come up with spurious matches and annotations.
-One of the most pressing dangers is database pollution. e.g. SWISS-PROT vs TrEMBL
-Errors can also be introduced by the user if the similarity search algorithms are not applied properly. e.g. PSI-BLAST
-Sequence conservation does not always predict functional conservation.

Alternative methods for functional annotation (1)
In any genome project, a stubborn minority of sequences resist all forms of functional annotation by homology searching.
e.g. Among the proteins predicted from the yeast genome project:
-30%: previously known and functionally characterized in experiments
-another 30%: could be assigned a tentative function by homology searching, although in many cases only at the biochemical level
-a further 30%: completely uncharacterized
-the remaining 10%: regarded as unsafe predictions (questionable ORFs)

Alternative methods for functional annotation (2)
The anonymous proteins were placed into two categories:
-Hypothetical proteins, the products of orphan genes, which are predicted protein sequences that do not match any other sequence in the databases
-Members of orphan families, which are predicted protein sequences with homologs in the databases, where the homologs themselves are of unknown function