Sequence analysis and comparison

Marjolein Thunnissen, Lund, September 2012

The aim with sequence identification:
- Is there any known protein sequence that is homologous to mine?
- Are there any other species which have a similar gene (ORF)?
- Has anybody already studied this protein or a similar one?
- What is the biochemical function, and what physicochemical characteristics should be expected?

Search & analysis strategy:
- Sequence search based on homology (similarity).
- Pattern searches: search for occurrences of a predefined pattern (may be a short sequence motif).
- Annotation searches: search by keywords, authors, additional features.
- Search for a 3D structure of a homologous protein.

Amino acid sequences:
An amino acid sequence carries:
- Information regarding the protein's function (catalytic activity, specific recognition sites, etc.).
- The protein's evolutionary origin.
- Information regarding the type of its 3D structure (folding type).
Extracting this information is the task of sequence analysis.

Other goals of sequence analysis:
- Assembly of sequence fragments into complete units (proteins, genes, chromosomes).
- Finding open reading frames (ORFs) for cDNAs or genomic DNA, and using codon usage tables.
- Management of sequence information.
- Prediction of the biochemical and physicochemical characteristics of a protein (molecular weight, isoelectric point (pI), extinction coefficient).

Even more goals of sequence analysis:
Finding and using consensus sequences. Examples:
- promoters
- transcription initiation sites
- transcription termination sites
- polyadenylation sites
- ribosome binding sites
- protein features
- post-translational modifications: formation of disulfide bonds, glycosylation, cleavage of signal sequences, etc.

Analysing relationships between proteins - some general rules:
- Proteins with the same function taken from closely related organisms have highly similar amino acid sequences.
- The greater the differences observed for related proteins, the longer the time since the organisms diverged (genetic divergence). The opposite is genetic convergence.

Types of sequence comparison and alignment:
- compare sequence to database - goal: find related sequences (SIMILARITY)
- compare sequence to sequence - goal: find matching domains (ALIGNMENT)
- compare database to database - goal: estimate genetic distance (EVOLUTION)
- either of the above - goal: determine consensus sequences
Comparisons can be pairwise or multiple.

Sequence alignment:
Sequence alignment allows us to align and compare a sequence to a family of related sequences, to reveal conserved regions of functional importance. An accurate alignment can be useful for obtaining an idea of the 3D structure of a protein. Since there are many ways of aligning two sequences (an alignment produced by a program is one of several possible), we need criteria to judge the quality of an alignment.

Modifications of a protein sequence to be considered:
- Replacement of one amino acid by another:
    aabb
    acbb
- Insertion and deletion of single amino acids and larger blocks:
    ccc-dee
    c-cddee
- Large rearrangements of the gene:
    aaaaaabbbbbb
    bbbbbbaaaaaa

Alignment accuracy:
The best alignment is the one that has the maximum number of identical residues aligned against each other (% similarity).
Example:
    Sequence 1:         CPKICIGGWFAAY
    Sequence 2:         CSGICKKAWFV-Y
    Alignment pattern:  C--IC---WF--Y
Similarity = 6/13 = 46 %
Score (s) = matches - mismatches = 6 - 7 = -1

Mind the Gap!
Without gaps:
    GATC
    GTGC
With gaps:
    GAT-C
    G-TGC
Generally: S = Σ gains (identities, replacements) - Σ penalties
Penalties = number of gaps × gap creation penalty
The values of identities and replacements are elements of the replacement matrix.
Rules of thumb:
- As many residues as possible should be aligned.
- A gap should be added only if it significantly increases the number of matches.
- The size of the gap and its position are important.
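The counting rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the scoring used by any particular program: the function names (alignment_pattern, simple_score, gapped_score) and the default gain and penalty values are invented for this example, and real programs take the gains from a replacement matrix.

```python
from itertools import groupby


def alignment_pattern(row1, row2):
    """Keep a residue where both rows agree, '-' everywhere else."""
    return "".join(a if a == b and a != "-" else "-" for a, b in zip(row1, row2))


def count_matches(row1, row2):
    """Number of columns with identical residues (gap columns never match)."""
    return sum(1 for a, b in zip(row1, row2) if a == b and a != "-")


def percent_identity(row1, row2):
    """Matches divided by alignment length, as a percentage."""
    return 100.0 * count_matches(row1, row2) / len(row1)


def simple_score(row1, row2):
    """Score = matches - mismatches; a gap column counts as a mismatch."""
    matches = count_matches(row1, row2)
    return matches - (len(row1) - matches)


def count_gaps(row):
    """Number of gaps, where one gap is a maximal run of '-' characters."""
    return sum(1 for ch, _ in groupby(row) if ch == "-")


def gapped_score(row1, row2, match=1, mismatch=0, gap_creation=1):
    """S = sum of gains - number of gaps * gap creation penalty.

    The gain values and the penalty are assumptions for illustration.
    """
    gains = sum(
        match if a == b else mismatch
        for a, b in zip(row1, row2)
        if a != "-" and b != "-"
    )
    return gains - (count_gaps(row1) + count_gaps(row2)) * gap_creation


seq1, seq2 = "CPKICIGGWFAAY", "CSGICKKAWFV-Y"
print(alignment_pattern(seq1, seq2))        # C--IC---WF--Y
print(round(percent_identity(seq1, seq2)))  # 46
print(simple_score(seq1, seq2))             # -1
```

With these assumed values, the ungapped GATC/GTGC alignment scores 2 and the gapped GAT-C/G-TGC alignment scores 3 - 2 × 1 = 1, so the two gaps do not pay for the single extra match, in line with the rules of thumb.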

Substitution scoring schemes:
A scoring scheme is needed to assign a score to each possible substitution of one amino acid by another. There are 210 possible pairs in total (190 pairs of different amino acids + 20 pairs of identical amino acids), presented in the form of a 20 × 20 matrix. Possible scoring schemes include:
- Identity scoring: 0 if the amino acids are different and 1 if they are the same.
- Observed substitutions: assigns weights based on the analysis of substitution frequencies derived from manual alignments.
- Chemical similarity score: gives higher weight to the alignment of amino acids with similar chemical properties (V/L, K/R).

Amino acid substitution matrices:
PAM family of matrices (Dayhoff matrix):
- Take an aligned set of closely related proteins (1300 sequences in 72 families in the original work).
- For each position in the set, find the most common amino acid observed.
- Calculate the frequency with which each other amino acid is observed at that position.
- Combine the frequencies from all positions to give a table of frequencies for each amino acid changing to each other amino acid.
- Take the logarithm and normalize for the frequency of each amino acid.

Properties of the PAM matrix:
- Each element M(i,j) gives the probability that the amino acid in column j has mutated to the amino acid in row i after a particular evolutionary time, measured as the percentage of accepted mutations per 10^8 years (PAM).
- 1 PAM corresponds to an average change in 1% of all amino acid positions.
- After 100 PAM of evolution not every residue will have changed: some will have mutated several times, perhaps returning to their original state, while others will not have changed at all.
- At 256 PAM, 80% of all amino acids will have changed, although to various degrees: 48% of Trp, 41% of Cys and 20% of His would be unchanged, but only 7% of Ser will remain.

# PAM 250 matrix
# Science June 5, 1992.
# Values rounded to nearest integer
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2  -1   0   0   0   0   0   0  -1  -1  -1   0  -1  -2   0   1   1  -4  -2   0
R  -1   5   0   0  -2   2   0  -1   1  -2  -2   3  -2  -3  -1   0   0  -2  -2  -2
N   0   0   4   2  -2   1   1   0   1  -3  -3   1  -2  -3  -1   1   0  -4  -1  -2
D   0   0   2   5  -3   1   3   0   0  -4  -4   0  -3  -4  -1   0   0  -5  -3  -3
C   0  -2  -2  -3  12  -2  -3  -2  -1  -1  -2  -3  -1  -1  -3   0   0  -1   0   0
Q   0   2   1   1  -2   3   2  -1   1  -2  -2   2  -1  -3   0   0   0  -3  -2  -2
E   0   0   1   3  -3   2   4  -1   0  -3  -3   1  -2  -4   0   0   0  -4  -3  -2
G   0  -1   0   0  -2  -1  -1   7  -1  -4  -4  -1  -4  -5  -2   0  -1  -4  -4  -3
H  -1   1   1   0  -1   1   0  -1   6  -2  -2   1  -1   0  -1   0   0  -1   2  -2
I  -1  -2  -3  -4  -1  -2  -3  -4  -2   4   3  -2   2   1  -3  -2  -1  -2  -1   3
L  -1  -2  -3  -4  -2  -2  -3  -4  -2   3   4  -2   3   2  -2  -2  -1  -1   0   2
K   0   3   1   0  -3   2   1  -1   1  -2  -2   3  -1  -3  -1   0   0  -4  -2  -2
M  -1  -2  -2  -3  -1  -1  -2  -4  -1   2   3  -1   4   2  -2  -1  -1  -1   0   2
F  -2  -3  -3  -4  -1  -3  -4  -5   0   1   2  -3   2   7  -4  -3  -2   4   5   0
P   0  -1  -1  -1  -3   0   0  -2  -1  -3  -2  -1  -2  -4   8   0   0  -5  -3  -2
S   1   0   1   0   0   0   0   0   0  -2  -2   0  -1  -3   0   2   2  -3  -2  -1
T   1   0   0   0   0   0   0  -1   0  -1  -1   0  -1  -2   0   2   2  -4  -2   0
W  -4  -2  -4  -5  -1  -3  -4  -4  -1  -2  -1  -4  -1   4  -5  -3  -4  14   4  -3
Y  -2  -2  -1  -3   0  -2  -3  -4   2  -1   0  -2   0   5  -3  -2  -2   4   8  -1
V   0  -2  -2  -3   0  -2  -2  -3  -2   3   2  -2   2   0  -2  -1   0  -3  -1   3
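Two of the statements about substitution matrices can be checked with a short Python sketch: the count of 210 amino acid pairs, and the "take the logarithm and normalize for frequency" step. The log-odds form used here, scale × log10(mutation probability / background frequency), is the usual way such scores are written; the function name, the scale factor of 10 and the example probabilities are assumptions for illustration and are not taken from the slides or from the matrix above.

```python
import math
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Unordered pairs, including the 20 identical pairs: 190 + 20 = 210.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
print(len(pairs))  # 210


def identity_score(a, b):
    """Identity scoring: 1 if the amino acids are the same, 0 otherwise."""
    return 1 if a == b else 0


def log_odds_score(p_mutation, background_freq, scale=10.0):
    """Rounded log-odds score for one matrix element.

    p_mutation      -- probability that residue j is replaced by residue i
    background_freq -- overall frequency of residue i
    scale           -- assumed scaling factor (hypothetical choice)
    """
    return round(scale * math.log10(p_mutation / background_freq))


# Hypothetical numbers: a replacement seen four times more often than chance
# scores positive, one seen five times less often than chance scores negative.
print(log_odds_score(0.20, 0.05))  # 6
print(log_odds_score(0.01, 0.05))  # -7
```

A positive element therefore means the pair is aligned more often than expected by chance, a negative element means less often, and zero means the pair carries no information.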

Other types of matrices:
- PET91: a version of PAM using a set of 2621 families of sequences.
- BLOSUM (blocks substitution matrix): amino acid substitution tables which score amino acid pairs based on the frequency of amino acid substitutions in aligned sequence motifs (blocks). Based on local alignments of 2000 blocks from 500 families.

Different BLOSUM types: 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.
BLOSUM62, the most popular, is derived from blocks in which sequences sharing at least 62% identity are clustered and counted as a single sequence.
High BLOSUM number - closely related sequences. Low BLOSUM number - distant sequences.
Differences with PAM:
- Evolutionarily divergent proteins are used.
- Uses blocks (local alignments) instead of global alignments.

Which matrix to use?:
- PAM-1, BLOSUM-90: small evolutionary distance; high identity within short sequences.
- PAM-250, BLOSUM-20: large evolutionary distance; low identity within long sequences.

Relationships between matrices: [figure]

Biological criteria can be used in alignment:
- Frequent and infrequent residues
- Structurally or functionally important amino acids
- A match to highly conserved residues
- Repetitive sequences

Methods for sequence comparisons:
Sliding window method
Central to many of the algorithms used in sequence analysis. The basic idea is to define a "window" of a certain number of residues (nucleotides or amino acids) and to calculate some value for the residues in that fragment. Once the calculation is completed, the program shifts by one residue and analyzes the next window of residues. This process repeats until the end of the sequence is reached.
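The sliding window idea can be sketched in a few lines of Python. The slides do not say which value is calculated per window, so the example uses the fraction of hydrophobic residues; that choice, the residue set and the function names are assumptions made for illustration.

```python
def sliding_window(sequence, window, func):
    """Apply func to every window of the given length, shifting by one residue."""
    if window <= 0:
        raise ValueError("window length must be positive")
    return [
        func(sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


# Assumed (hypothetical) set of hydrophobic residues for the example.
HYDROPHOBIC = set("AVILMFWC")


def hydrophobic_fraction(fragment):
    """Fraction of residues in the fragment that belong to HYDROPHOBIC."""
    return sum(1 for residue in fragment if residue in HYDROPHOBIC) / len(fragment)


print(sliding_window("AAKK", 2, hydrophobic_fraction))  # [1.0, 0.5, 0.0]
```

A sequence of length n yields n - window + 1 values, one per window start position; a window longer than the sequence yields an empty list.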

Sliding window in sequence analysis:
Given two sequences A and B, all possible overlapping segments of a particular length (window length) from A are compared to all segments of B. For each pair of segments the amino acid pair scores are accumulated over the length of the segment. For example, the comparison of the two segments
    ALGAWDE
    ALATWDE
gives a score of 1+1+0+0+1+1+1 = 5.

The dot matrix method for sequence comparison:
The two axes represent the two sequences: sequence A along the top from left to right and sequence B along the left from top to bottom. The matrix is filled in by taking a window of sequence A and scanning along sequence B. Whenever a match occurs, a dot is placed in the matrix. After reaching the end of sequence B, a new query sequence is generated from sequence A by sliding the window to the next position in sequence A.

Example of a dot matrix comparison of two protein sequences: [figure]

Dot matrix comparison of genomic DNA and cDNA sequences:
- When two sequences share similarity over their entire length, a diagonal line will extend from one corner of the dot plot to the diagonally opposite corner.
- If two sequences only share patches of similarity, this will be revealed by diagonal stretches.
- Jumps correspond to positions where one sequence has more (or fewer) letters than the other (insertions and deletions).
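The segment comparison and the dot matrix can be sketched in Python as follows. Identity scoring (1 for a match, 0 otherwise) is used because it reproduces the score of 5 in the example; the function names, the threshold parameter and the text rendering are assumptions for illustration.

```python
def segment_score(seg_a, seg_b):
    """Accumulate identity scores over two segments of equal length."""
    return sum(1 for a, b in zip(seg_a, seg_b) if a == b)


def dot_matrix(seq_a, seq_b, window=1, threshold=None):
    """Boolean matrix: rows follow seq_b (left axis), columns follow seq_a (top axis).

    A cell is True when the window of seq_a starting at that column and the
    window of seq_b starting at that row score at least `threshold`.
    By default every position in the window has to match.
    """
    if threshold is None:
        threshold = window
    n_rows = len(seq_b) - window + 1
    n_cols = len(seq_a) - window + 1
    return [
        [
            segment_score(seq_a[col:col + window], seq_b[row:row + window]) >= threshold
            for col in range(n_cols)
        ]
        for row in range(n_rows)
    ]


def render(matrix):
    """Draw the matrix as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


print(segment_score("ALGAWDE", "ALATWDE"))  # 5
print(render(dot_matrix("ABCD", "ABXCD")))
# *...
# .*..
# ....
# ..*.
# ...*
```

In the small plot the extra X in the second sequence leaves an empty row, and the diagonal continues one row lower: the kind of jump that marks an insertion or deletion.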

Alignment using dynamic programming:
Given two sequences A and B, at each aligned position there are 3 possibilities:
- w(Ai, Bj) - substitution of Ai by Bj
- w(Ai, Δ) - deletion of Ai
- w(Δ, Bj) - deletion of Bj
w, the weight, is derived from the chosen scoring scheme (e.g. a PAM matrix). Gaps (Δ) are given a negative weight, called the gap penalty, since insertions and deletions are less common than substitutions.

Graphical representation of dynamic programming:
Try to find the path that gives the maximal score. Three moves are allowed: matching residues (diagonal move), deleting a residue from one sequence (horizontal move) or deleting a residue from the other (vertical move).
    RNI-LVSDAKNVGI
    RDISLV---KNAGI

Types of alignment:
- Global alignment: align two sequences from beginning to end, requiring that every position of both sequences be included in the alignment. Used in the alignment of sequences known to be related.
- Local alignment: find the best region of similarity between two sequences without insisting that the entire sequences match (the result may be several alignments with close or different scores). Used in database searching and in the alignment of distantly related sequences with several regions of homology.
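The three moves can be turned into a small dynamic programming sketch in Python. The global version follows the Needleman-Wunsch scheme and the local version the Smith-Waterman scheme; neither algorithm is named in the slides, and the scoring values (+1 match, -1 mismatch, -2 per gap position) and function names are assumptions standing in for a real substitution matrix and gap penalty.

```python
def identity_weight(x, y):
    """Assumed weight w: +1 for identical residues, -1 otherwise."""
    return 1 if x == y else -1


def global_align(a, b, weight=identity_weight, gap=-2):
    """Global alignment: return (score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + weight(a[i - 1], b[j - 1]),  # diagonal move
                F[i - 1][j] + gap,  # residue of a aligned to a gap
                F[i][j - 1] + gap,  # residue of b aligned to a gap
            )
    # Trace the path back from the last cell to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + weight(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


def local_score(a, b, weight=identity_weight, gap=-2):
    """Local alignment: best score of any pair of subsequences, never below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                0,  # start a new local alignment here
                prev[j - 1] + weight(a[i - 1], b[j - 1]),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            best = max(best, cur[j])
        prev = cur
    return best


print(global_align("GATC", "GTGC"))         # (0, 'GATC', 'GTGC')
print(local_score("XXXWDEYYY", "QQWDEQQ"))  # 3
```

The only difference between the two functions is the extra 0 in the maximum and where the best score is read from: the last cell for a global alignment, the highest cell anywhere for a local one.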

Functional information from multiple sequence alignment:
- A multiple sequence alignment allows us to extract information which is difficult to extract from a single sequence or from an alignment of only two sequences.
- When making a multiple sequence alignment, try to include both sequences that are highly conserved and some that are more distantly related.
- If possible, use programs for automatic analysis of multiple sequence alignments (e.g. AMAS at http://www.compbio.dundee.ac.uk/software/Amas/amas.html).

Structural information from multiple sequence alignment:
Example: alignment of ferrochelatase [figure]
- Positions of insertions and deletions suggest regions of surface loops in the 3D structure.
- Conserved Gly and Pro suggest a β-turn.
- Hydrophobic residues conserved at i, i+2, i+4 etc., separated by hydrophilic residues, suggest a surface β-strand.
- A short run of hydrophobic residues (4 aa) may suggest a buried β-strand; longer stretches (20 aa) may suggest a membrane-spanning helix.
- Pairs of conserved hydrophobic aa separated by pairs of hydrophilic residues suggest an α-helix with one face packed against the protein core.
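Spotting conserved Gly and Pro columns, the first step toward the β-turn rule above, can be sketched in Python. The three-sequence alignment below is a made-up toy example, not the ferrochelatase alignment, and the function names and the default conservation threshold are assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column return (most common symbol, fraction of rows showing it)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result


def conserved_positions(alignment, residues="GP", min_fraction=1.0):
    """0-based columns whose dominant residue is in `residues` and conserved enough."""
    return [
        index
        for index, (symbol, fraction) in enumerate(column_conservation(alignment))
        if symbol in residues and fraction >= min_fraction
    ]


# Hypothetical toy alignment, used only to exercise the functions.
toy_alignment = [
    "AKGPL",
    "SKGPV",
    "TRGPI",
]
print(conserved_positions(toy_alignment))  # [2, 3]
```

Lowering min_fraction relaxes the definition of "conserved", which matters once the alignment includes the more distantly related sequences recommended above.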

Alignment accuracy:
- The accuracy of a multiple sequence alignment is generally higher than that of a pairwise alignment.
- Overall alignment accuracy: it is possible to compare the score to the distribution of scores for alignments of random sequences of the same length and composition. The result may be expressed in standard deviation units above the mean.
- The alignment of some regions is more reliable than that of others. The most reliable regions are those for which the alignment does not change when small changes are made to the gap penalty and matrix parameters. The least reliable are regions of insertions and deletions, often loop regions.
- Percentage identity: unrelated sequences chosen at random are expected to be identical in about 5% of their residues. To be certain of homology, more than 20% identity is required.
- Percentage identity depends on the length of the alignment: an alignment of 200 residues with 30% identity is more significant than an alignment of 50 residues with 30% identity.

What are you trying to find out?
- Are you trying to locate similar domains or motifs? --> Local alignment is probably best.
- Are you trying to determine whether the sequences come from the same family? --> Use one of the BLOSUM matrices.
- Are you trying to determine how closely related the sequences are evolutionarily? --> Use one of the PAM matrices.
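Expressing an alignment score in standard deviation units above the mean of random alignments can be sketched in Python. Shuffling one sequence keeps its length and composition, as the text requires. The scoring values, the number of shuffles, the fixed random seed and the function names are assumptions for illustration; a real analysis would use a substitution matrix and many more shuffles.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, keeping two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def z_score(a, b, n_shuffles=200, seed=0):
    """(observed score - mean of shuffled scores) / standard deviation."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = global_score(a, b)
    letters = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same length and composition as b
        background.append(global_score(a, "".join(letters)))
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0  # all shuffles scored the same; no basis for a z-score
    return (observed - statistics.mean(background)) / spread


query = "CPKICIGGWFAAY"
print(z_score(query, query) > 0)  # True: an identical sequence beats its shuffles
```

The higher the value, the less likely it is that the observed score arose from two unrelated sequences of the same composition.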

THE END