Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Pairwise similarity searching (1)
Figure 5.5: manual alignment. One of the amino acids in the top sequence has no equivalent, so a gap has been introduced into the bottom sequence to make the alignment more meaningful.
Two computer algorithms are used for real protein sequences:
-The Needleman-Wunsch algorithm looks for global similarity between the sequences.
-The Smith-Waterman algorithm focuses on shorter regions of local similarity.
Dynamic programming algorithms find alignments containing the largest possible number of identical and similar amino acids by inserting gaps wherever necessary.

Pairwise similarity searching (2)
The problem with this approach is that the indiscriminate use of gaps can make any two sequences match, no matter how dissimilar they are (Figure 5.6).

Pairwise similarity searching (3)
The problem is addressed by constraining the dynamic programming algorithms with gap penalties, which reduce the overall alignment score as more gaps are introduced.
-Figure 5.7a: a head-to-head alignment with no gaps gives a relatively low score.
-The indiscriminate insertion of gaps would produce a higher score but a meaningless alignment.
-Figure 5.7b: a sensible gap penalty, which reduces the alignment score as more gaps are introduced, produces the optimal alignment, in which there are three gaps.
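
The effect of a gap penalty on a dynamic programming alignment can be sketched as follows. This is a minimal illustration, not the scheme used in Figure 5.7: the function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, and a real protein alignment would use a substitution matrix rather than simple match/mismatch scores. Setting `local=True` switches from Needleman-Wunsch (global) to Smith-Waterman (local) scoring.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False: Needleman-Wunsch (global); local=True: Smith-Waterman (local).
    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps in either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(alignment_score("HEAG", "HAG"))         # one gap is paid for
print(alignment_score("HEAG", "HAG", gap=0))  # gaps are free
```

With `gap=0` every gap is free, which is the indiscriminate situation described above; a negative `gap` makes each additional gap lower the score.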

Pairwise similarity searching (4)
Most algorithms employ more complex penalty systems, in which the penalty is proportional to the length of the gap, or in which there is an initial penalty for opening a gap and then a lower penalty for extending it.
However, dynamic programming algorithms are slow and resource-hungry.
=> Alternative methods have been developed that are not dynamic programming methods and that are faster but less accurate. These methods, BLAST and FASTA, have been important in the development of Internet-based database search facilities.
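
The two gap penalty systems described above can be compared with a short sketch. The function names and the numeric penalties are assumptions for illustration only, and the affine form shown here charges `gap_open` for the first gapped position and `gap_extend` for each further position; other tools define the opening cost slightly differently.

```python
def linear_gap_penalty(length, per_residue=2):
    """Penalty proportional to gap length (per_residue is an assumed value)."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Initial penalty for opening a gap, lower penalty for each extension.

    Assumed convention: the first gapped position costs gap_open and
    every additional position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One long gap is cheaper than several short ones under the affine scheme.
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))
```

Under the affine scheme a single gap of four residues costs less than four separate one-residue gaps, which favors alignments with a few longer gaps over many scattered ones.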

Pairwise similarity searching (5)
BLAST and FASTA exist in several variants (Table 5.1).
http://blast.ncbi.nlm.nih.gov/blast.cgi

Pairwise similarity searching (6)
Both BLAST and FASTA take into account the fact that high-scoring alignments are likely to contain short stretches of identical or near-identical letters, which are sometimes termed words.
-BLAST looks for words of a certain fixed length (W) that score above a given threshold level, T, set by the user.
-In FASTA, the word length is two amino acids and there is no T value.
Both programs extend their matching segments to produce longer alignments, which in BLAST are called high-scoring segment pairs.
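
The word-based strategy can be sketched in a simplified form. The sketch below only finds identical words and extends them while residues stay identical, which is closer to the FASTA description above; BLAST also accepts non-identical words that score above T and extends by score rather than by identity. The function names `word_hits` and `extend_hit` and the toy sequences are assumptions for illustration.

```python
from collections import defaultdict


def word_hits(query, subject, w=2):
    """Return (query_pos, subject_pos) pairs where identical words of length w occur."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w=2):
    """Extend a word hit without gaps in both directions while residues are identical."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q  # segment start in each sequence, and length


hits = word_hits("AHCW", "GGHCWA")
print(hits)
print(extend_hit("AHCW", "GGHCWA", *hits[0]))
```

Only the positions reached from a word hit are examined, which is why this kind of search is much faster than filling a complete dynamic programming matrix.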

Significance of sequence alignments (1)
The significance of a sequence identity or sequence similarity score depends on the length of the sequence over which the alignment takes place.
Ex. 60% similarity over 30 residues is far less convincing than 60% similarity over 300 residues.
Chance similarity is distinguished from alignments that have real biological significance by the statistical analysis of search scores, using p values and E values.

Significance of sequence alignments (2)
-The p value of a similarity score S is the probability that a score of at least S would have been obtained in a match between any two unrelated protein sequences of similar composition and length.
-A low p value (ex. p value = 0.01) means it is very unlikely that the similarity score was obtained by chance.
-The E value is related to the p value and is the expected number of times that a similarity score of at least S would occur by chance.
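
The slide does not state how the two values are related. In the statistics used by BLAST, the number of chance alignments scoring at least S is treated as Poisson-distributed with mean E, so the probability of seeing at least one is p = 1 - exp(-E). The sketch below assumes that relation; the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming a Poisson model with mean E."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), computed accurately for small E


for e in (0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two numbers are almost identical (E = 0.01 gives p of about 0.00995), while for large E the p value saturates near 1 and the E value remains the more informative quantity.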

Multiple alignments (1)
Multiple alignment searches for inter-relationships between the members of a protein family.
If the same residue is found in five or ten proteins in the family, especially if the proteins are diverse, this suggests that the residue may play a key functional role.

Multiple alignments (2)
Figure 5.8: multiple alignment of 15 serine protease sequences.
The most highly conserved residues are those whose physical and chemical properties are absolutely essential to maintain protein function.
Ex. histidine, cysteine, proline

Multiple alignments (3)
ClustalW/X is the most commonly used software package.
It uses a progressive alignment strategy: pairwise alignments are carried out first to assess the degree of similarity between each pair of sequences, and these relationships are used to produce a dendrogram, which is similar to a phylogenetic tree. The two most similar sequences are aligned first and the others are added in order of similarity.
-Advantage: fast
-Disadvantage: information in distant sequence alignments that could improve the overall alignment is lost.
=> Alignments are manually adjusted by bringing conserved residues into register.
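
The ordering step of a progressive strategy can be sketched as follows. This is not ClustalW's implementation: it only decides the order in which sequences would be added, it uses the `difflib.SequenceMatcher` ratio from the Python standard library as a stand-in for a real pairwise alignment score, and it ranks sequences by their best similarity to any sequence already chosen instead of building a dendrogram. The function names and the toy sequences are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity measure (not a real alignment score)."""
    return SequenceMatcher(None, a, b).ratio()


def progressive_order(seqs):
    """Return sequence names in the order they would be added to the alignment.

    seqs maps a name to an unaligned sequence. The most similar pair comes
    first; each remaining sequence is then added in order of its best
    similarity to any sequence already included.
    """
    names = list(seqs)
    first, second = max(
        combinations(names, 2),
        key=lambda pair: similarity(seqs[pair[0]], seqs[pair[1]]),
    )
    order = [first, second]
    remaining = [n for n in names if n not in order]
    while remaining:
        best = max(
            remaining,
            key=lambda n: max(similarity(seqs[n], seqs[m]) for m in order),
        )
        order.append(best)
        remaining.remove(best)
    return order


toy = {"D": "MKKPQRNN", "C": "WILSAGHC", "A": "WVLTAAHC", "B": "WVLTAGHC"}
print(progressive_order(toy))
```

Because the most divergent sequence (here D) is added last, any information it carries cannot influence how the earlier sequences were aligned, which is the disadvantage noted above.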

Finding more distant relationships
The standard similarity search algorithms discussed above are able to detect sequences showing as little as 30% similarity with reasonable reliability. However, as sequences diverge even further, the evolutionary relationships between proteins become more difficult to detect.
Proteins with very little sequence similarity can still be related, because protein structure is much more strongly preserved in evolution than sequence.
Ex. globin family

PSI-BLAST
PSI-BLAST: position-specific iterated BLAST
The principle is iterated database searching: the results of a standard BLAST search are collected into a profile, which is then used for a second round of searching.

PSI-BLAST
Figure 5.9: query sequence A will find any sequences that show a sufficient degree of similarity (B, C, D). If B, C and D are then used as search queries, the threshold of detection is extended to include E and F. In the next iteration, a profile that includes all the sequences from A to F should identify G.
One problem with PSI-BLAST is its tendency to identify spurious matches.
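
The idea of collecting hits into a profile and searching with it can be sketched as a simple position-specific scoring matrix. This is an illustration under stated assumptions, not PSI-BLAST's actual procedure: the hits are assumed to be already aligned without gaps, the background frequency is taken as uniform (1/20), and a flat pseudocount is used, whereas PSI-BLAST weights sequences and derives pseudocounts from a substitution matrix. All names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length, ungapped aligned hits."""
    background = 1 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, segment):
    """Score a candidate segment of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


profile = build_profile(["WVLT", "WILT", "WVVS"])
print(profile_score(profile, "WILS"), profile_score(profile, "GGGG"))
```

The candidate "WILS" is not identical to any of the three hits but scores well because each of its residues has been seen at the matching position. This is how a profile reaches more distant relatives, and also why one wrongly included sequence can pull in spurious matches in later iterations.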

Pattern recognition
Pattern recognition search methods are an extension of the multiple alignment strategy for identifying structurally and functionally conserved elements of proteins.
-Consensus sequences
-Sequence patterns
-Motifs and blocks
-Domains
Each of the secondary databases built on these methods has its strengths and weaknesses.
=> An integrated cross-referencing tool called InterPro has been developed, which allows a query sequence to be screened against all of the databases and then extracts and presents the relevant information (Plate 4, located at p82-83).

Consensus sequence
A consensus sequence is a single sequence that represents the most common amino acid residue found at each position in a multiple alignment.
If at any given position no single amino acid is shared by 60% or more of the sequences, then there is no consensus and the residue is represented by X.
e.g. from Figure 5.8: W-V-X-T-A-A-H-C
Major drawback: it does not take into account conservative substitutions (e.g. leucine, isoleucine and valine), which would be informative. This method is rarely used.
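
The 60% rule can be sketched in a few lines. The five aligned sequences below are invented for the example (they are not the serine protease sequences of Figure 5.8) and were chosen so that the result has the same form as the consensus quoted above; the function name is hypothetical and gap characters are not handled.

```python
from collections import Counter


def consensus(alignment, threshold=0.6):
    """Return the consensus of equal-length aligned sequences, X where none exists."""
    residues = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        residues.append(residue if count / len(column) >= threshold else "X")
    return "-".join(residues)


toy_alignment = [
    "WVLTAAHC",
    "WVVTAAHC",
    "WILSAAHC",
    "WVVTAAHC",
    "WIISAAHC",
]
print(consensus(toy_alignment))  # W-V-X-T-A-A-H-C
```

The third column contains only leucine, valine and isoleucine, yet it is reported as X, which illustrates the drawback noted above: conservative substitutions are lost.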

Sequence patterns
Sequence patterns are like consensus sequences, except that variation is allowed at each position and is shown within brackets.
e.g. from Figure 5.8: W-[VI]-[LV]-[ST]-A-A-H-C or W-[VI]-[LIVM]-[ST]-A-[STAG]-H-C
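
Patterns written in this notation map directly onto regular expressions, since a bracketed set of residues is already a valid character class. The sketch below handles only the elements used on this slide (single residues, bracketed alternatives, and X for any residue); the function name and the test sequences are assumptions, and fuller pattern syntaxes with repeats or excluded residues are not covered.

```python
import re


def pattern_to_regex(pattern):
    """Convert a hyphen-separated sequence pattern into a compiled regular expression."""
    elements = pattern.split("-")
    return re.compile("".join("." if element == "X" else element for element in elements))


strict = pattern_to_regex("W-[VI]-[LV]-[ST]-A-A-H-C")
relaxed = pattern_to_regex("W-[VI]-[LIVM]-[ST]-A-[STAG]-H-C")

match = strict.search("GGWILSAAHCKK")
print(match.span())                         # (2, 10)
print(bool(strict.search("KWVMTAGHCK")))    # False
print(bool(relaxed.search("KWVMTAGHCK")))   # True
```

The second pattern accepts a sequence that the first rejects, showing how widening the allowed residues at a position trades specificity for sensitivity.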

Motifs and blocks
These are not individual sequences but multiply aligned, ungapped segments derived from the most highly conserved regions in protein families. They are found in two databases:
(1) PRINTS: individual motifs from a single protein family are grouped together as fingerprints (Figure 5.10).
(2) BLOCKS

Domains
A protein domain is an independent unit of structure or function, which can often be found in the context of otherwise unrelated sequences.
-ProDom lists the sequences of known protein domains and is created automatically by searching protein primary sequence databases.
-PROSITE lists sequence profiles corresponding to domain sequences, with weight matrices showing the likelihood of particular amino acids being found at each position.
-Pfam and SMART

Pitfalls of functional annotation by similarity searching
Standard similarity searches, recursive methods and pattern or profile searching can all identify sequences that are more or less related to a particular query. However, these methods are not foolproof and all have the potential to come up with spurious matches and annotations.
-One of the most pressing dangers is database pollution (e.g. SWISS-PROT vs TrEMBL).
-Errors can also be introduced by the user if the similarity search algorithms are not used properly (e.g. PSI-BLAST).
-Sequence conservation does not always predict functional conservation.

Alternative methods for functional annotation (1)
In any genome project, a stubborn minority of sequences resist all forms of functional annotation by homology searching.
e.g. among the proteins predicted from the yeast genome project:
-30%: previously known and functionally characterized in experiments
-another 30%: could be assigned a tentative function by homology searching, although in many cases only at the biochemical level
-a further 30%: completely uncharacterized
-the remaining 10%: regarded as unsafe predictions (questionable ORFs)

Alternative methods for functional annotation (2)
The anonymous proteins were placed into two categories:
-Hypothetical proteins, the products of orphan genes: predicted protein sequences that do not match any other sequence in the databases
-Members of orphan families: predicted protein sequences with homologs in the databases, where the homologs themselves are of unknown function