Pairwise & Multiple sequence alignments

Urmila Kulkarni-Kale
Bioinformatics Centre 411 007
urmila@bioinfo.ernet.in

Basis for sequence comparison
- Theory of evolution: gene sequences have evolved (been derived) from a common ancestor, so comparing them traces the history of mutations and other evolutionary changes.
- Proteins that are similar in sequence are likely to have similar structure and function.

What is alignment?
Alignments are useful organizing tools because they provide a pictorial representation of similarity or homology between protein or nucleic acid sequences.

Sample alignment
SEQ_A: GDVEKGKKIFIMKCSQ
SEQ_B: GCVEKGKIFINWCSQ

There are two possible linear alignments:

1. GDVEKGKKIFIMKCSQ
   GCVEKGKIFINWCSQ

2. GDVEKGKKIFIMKCSQ
    GCVEKGKIFINWCSQ

The optimal alignment
GDVEKGKKIFIMKCSQ
GCVEKGK-IFINWCSQ
Inserting one break (gap) maximizes the number of identities.
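The effect of the single break can be checked by counting identical columns. The sketch below is only an illustration: the helper name `count_identities` is made up here, and the two ungapped placements are assumed to be SEQ_B flush left and SEQ_B shifted right by one position.

```python
def count_identities(row_a, row_b):
    """Count columns where both rows carry the same residue (gaps never match)."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")

seq_a = "GDVEKGKKIFIMKCSQ"
seq_b = "GCVEKGKIFINWCSQ"

# Two ungapped placements (assumed: flush left, and shifted right by one).
flush_left = count_identities(seq_a, seq_b + "-")
shifted_right = count_identities(seq_a, "-" + seq_b)

# The gapped alignment from the slide.
with_break = count_identities(seq_a, "GCVEKGK-IFINWCSQ")

print(flush_left, shifted_right, with_break)  # prints: 6 7 12
```

Neither ungapped placement can match both the N-terminal and the C-terminal stretch at once; the single break lets both stretches line up.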

Theoretical background
Alignment rests on the theoretical view that two sequences can be derived from each other by a number of elementary transformations:
- Mutation (residue substitution)
- Insertion/deletion
- Slide function

Transformations
Substitution, addition/deletion, slide function.
The most closely related sequences are those that can be derived from one another by the smallest number of such transformations. Finding that smallest number means searching over the possible alignments, so alignment is an optimization problem.

Terminology
- Identity
- Similarity
- Homology

Identity
- Objective and well defined
- Can be quantified as a percentage: the number of identical matches divided by the length of the aligned region

What is similarity?
Protein similarity could be due to:
- Evolutionary relationship
- Similar two- or three-dimensional structure
- Common function
It can be quantified as a percentage: the number of identical plus similar matches divided by the length of the aligned region.
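Both percentages can be computed directly from an aligned pair of rows. This is a minimal sketch under two assumptions that the slides do not state: the aligned-region length counts every column, including gap columns, and "similar" residues are defined by the hypothetical groups in `SIMILAR_GROUPS` (a real analysis would derive similarity from a substitution matrix).

```python
# Hypothetical groups of "similar" residues, for illustration only.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]

def is_similar(x, y):
    """True if both residues fall in the same (assumed) similarity group."""
    return any(x in group and y in group for group in SIMILAR_GROUPS)

def percent_identity_similarity(row_a, row_b):
    """Return (% identity, % similarity) over all aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = similar = 0
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
        elif is_similar(x, y):
            similar += 1
    length = len(row_a)
    return 100 * identical / length, 100 * (identical + similar) / length

# Alignment from the slides: 12 identities over 16 columns.
print(percent_identity_similarity("GDVEKGKKIFIMKCSQ", "GCVEKGK-IFINWCSQ"))  # (75.0, 75.0)
# Made-up rows where similarity exceeds identity (K/R and I/L count as similar).
print(percent_identity_similarity("KIFIM", "RIFLM"))  # (60.0, 100.0)
```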

What is homology?
Homologous proteins may be encoded by:
- The same genes in different species
- Genes that have been transferred between species
- Genes that have originated from duplication of ancestral genes

Difference between homology and similarity
- Similarity does not necessarily imply homology.
- Homology has a precise definition: having a common evolutionary origin.
- Since homology is a qualitative description of the relationship, the term "% homology" has no meaning.
- Supporting data for a homologous relationship may include sequence or structural similarities, which can be described in quantitative terms (% identity, RMSD).

Global Alignment

Local Alignment

An optimal alignment
AALIM
AAL-M

A sub-optimal alignment
AALIM
AA-LM

Needleman & Wunsch algorithm
JMB (1970) 48:443-453.
- Maximizes the number of amino acids of one protein that can be matched with the amino acids of the other protein while allowing for optimal deletions/insertions.
- Based on the theory of a random walk in two dimensions.

Random walk in two dimensions
- Three possible paths: diagonal, horizontal, vertical
- Optimum path: diagonal

N & W algorithm
The optimal alignment is obtained by maximizing the similarities and minimizing the gaps.

Glossary
1. PROTEINS: words composed of 20 letters.
2. LETTER: an element other than NULL.
3. NULL: the symbol "-", i.e. the gap.
4. GAPS: runs of nulls, which indicate deletion(s) in one sequence and insertion(s) in the other sequence.

Contd.
5. SCORING MATRIX: assigns a value to each possible pair of amino acids. Examples of matrices are UN, MD, GCM, CSW, UP.
6. PENALTY: there are two types of penalties.
   - Matrix bias: added to every cell of the scoring matrix; it decides the size of the break. Also called the gap continuation penalty.
   - Break penalty: applied every time a gap is inserted in either sequence.
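The two penalties resemble what is now commonly called an affine gap model, in which opening a gap and extending it are charged separately. The sketch below assumes that reading; it is not necessarily how the program behind these slides applies its matrix bias. The function name `gap_cost` and both default values are hypothetical.

```python
def gap_cost(gap_length, break_penalty=8.0, continuation_penalty=1.0):
    """Total penalty for one gap (a run of nulls) of the given length.

    Assumed affine model: the break penalty is charged once per gap and
    the continuation penalty once per null. Default values are made up.
    """
    if gap_length <= 0:
        return 0.0
    return break_penalty + continuation_penalty * gap_length

for length in (1, 2, 5):
    print(length, gap_cost(length))
```

Under this model one long gap costs less than several short gaps with the same total length, which favours a few contiguous insertions or deletions over many scattered ones.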

Unitary matrix
- Simplest scoring scheme
- Amino acid pairs are classified into two types: identical and non-identical
- Identical pairs are scored 1; non-identical pairs are scored 0
- Less effective for detection of weak similarities

     A  R  N  D  ...
  A  1  0  0  0
  R  0  1  0  0
  N  0  0  1  0
  D  0  0  0  1
  ...

MAT(i,j) = SM(A_i, B_j) + max(X, Y, Z)
where
X = row maximum along the diagonal, less the penalty
Y = column maximum along the diagonal, less the penalty
Z = next diagonal: MAT(i+1, j+1)

GDVEKGKKIFIMKCSQ
GCVEKGK-IFINWCSQ

Contd.
- Real score (R): similarity score of the real sequences.
- Mean score (M): average similarity score of randomly permuted sequences.
- Standard deviation (Sd): standard deviation of the similarity scores of randomly permuted sequences.
- Alignment score (A): A = (R - M) / Sd

The alignment score is expressed as the number of standard deviation units by which the similarity score of the real sequences (R) exceeds the average similarity score (M) of randomly permuted sequences.
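The calculation of A can be sketched as a small permutation test. Everything below is an assumption made for illustration: the scoring (unitary matrix with a linear gap penalty of 1), the choice to shuffle both sequences, the number of shuffles, the fixed random seed, and the function names.

```python
import random
import statistics

def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score (score only, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffled(seq, rng):
    """Return a random permutation of seq (composition is preserved)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def alignment_score(a, b, n_shuffles=200, seed=0):
    """Return (R, M, Sd, A) with A = (R - M) / Sd."""
    rng = random.Random(seed)
    real = global_score(a, b)
    permuted = [global_score(shuffled(a, rng), shuffled(b, rng))
                for _ in range(n_shuffles)]
    mean = statistics.mean(permuted)
    sd = statistics.stdev(permuted)
    return real, mean, sd, (real - mean) / sd

R, M, Sd, A = alignment_score("GDVEKGKKIFIMKCSQ", "GCVEKGKIFINWCSQ")
print(f"R={R} M={M:.2f} Sd={Sd:.2f} A={A:.2f}")
```

Because shuffling keeps the amino acid composition, a large A indicates that the real score depends on the order of the residues rather than on composition alone.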

Trace back
GDVEKGKKIFIMKCSQ
GCVEKGK-IFINWCSQ
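Matrix filling and trace back can be sketched together in a few lines. Note that this is the textbook dynamic-programming formulation filled from the top-left corner with a linear gap penalty, not the original 1970 procedure summarized in the MAT(i,j) formula. The scores (unitary matrix, gap penalty of 1) and the function name are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # diagonal
                              score[i - 1][j] + gap,     # vertical
                              score[i][j - 1] + gap)     # horizontal
    # Trace back from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

best, top, bottom = needleman_wunsch("GDVEKGKKIFIMKCSQ", "GCVEKGKIFINWCSQ")
print(best)
print(top)
print(bottom)
```

With these scores the best alignment has 12 identities and one null, giving a score of 11. The null can sit opposite either K of the "KK" pair without changing the score, so the printed alignment may place the break one column away from the one shown on the slide.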

Sample output

Evolutionary process: orthologues
- A single gene X is retained as the species diverges into two separate species.
- The copies of gene X in the two species are orthologues.
(Diagram labels: Gene X, Gene X, Gene X.)

Evolutionary process: paralogues
Paralogues are genes that arise through duplication.
- A single gene X in one species is duplicated.
- As each copy gathers mutations, it may begin to perform a new function or may specialize in carrying out some functions of the ancestral gene.
- These genes within a single species are paralogues.
- If the species diverges, the daughter species may maintain the duplicated genes; each species then contains an orthologue and a paralogue of each gene in the other species.
(Diagram labels: Gene X, Gene X, Gene A, Gene X, Gene B.)

Homologous, orthologous and paralogous sequences
Orthologous sequences are homologous sequences in different species that have a common origin.
- Differences between orthologues result from gradual evolutionary modification from the common ancestor.
- They typically perform the same function in different species.
Paralogous sequences are homologous sequences that exist within a species.
- They have a common origin but arise through gene duplication events.
- After duplication, one copy is free to evolve a new function.
- They often perform different functions.

Local Sequence Alignment Using the Smith-Waterman Dynamic Programming Algorithm

Significance of local sequence alignment
- In global alignment, an attempt is made to align the entire sequences, matching as many characters as possible.
- In local alignment, stretches of sequence with the highest density of matches are given the highest priority, generating one or more islands of matches in the aligned sequences.
Applications:
- Locating common domains in proteins. Example: transmembrane proteins, which might have different ends sticking out of the cell membrane but share common middle parts.
- Comparing long DNA sequences with a short one, such as comparing a gene with a complete genome.
- Detecting similarities between highly diverged sequences that still share common subsequences (with few or no mutations).

Local sequence alignment
- Performs an exhaustive search for the optimal local alignment.
- Modification of the Needleman-Wunsch algorithm:
  - Negative weighting of mismatches
  - Matrix entries are non-negative
  - The optimal path may start anywhere (not just in the first or last row/column)
- After the whole path matrix is filled, the optimal local alignment is given by the path that starts at the highest score in the matrix and is traced back through the contributing cells until the path score has dropped to zero.
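These modifications are usually summarized as a single recurrence. The form below is the common textbook statement with a linear gap penalty; the symbols H (path matrix), s (substitution score) and d (a positive gap penalty) are notation chosen here, not taken from the slides.

```latex
H(i,0) = H(0,j) = 0, \qquad
H(i,j) = \max
\begin{cases}
0 \\
H(i-1,\,j-1) + s(a_i, b_j) \\
H(i-1,\,j) - d \\
H(i,\,j-1) - d
\end{cases}
```

The extra 0 keeps every entry non-negative and lets a path start at any cell. The optimal local alignment ends at the cell holding the largest H(i,j).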

Smith-Waterman Algorithm

Example of local alignment

Scoring the alignment using the BLOSUM50 matrix

        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   0  -2  -1   5   0   5  -3   0  -2  -1  -1
W   0  -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   0  10   0  -2  -2  -2  -3  -2  10   0   0
E   0   0   6  -1  -3  -1  -3  -3   0   6   6
A   0  -2  -1   5   0   5  -3   0  -2  -1  -1
E   0   0   6  -1  -3  -1  -3  -3   0   6   6

Gap penalty: -8
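The table and gap penalty are enough to run the local alignment by hand or in code. The sketch below copies the BLOSUM50 values from the table, reads the sequences off its row and column labels (HEAGAWGHEE across the top, PAWHEAE down the side), and assumes the gap penalty is linear, i.e. -8 per null. The function and variable names are illustrative.

```python
SEQ_TOP = "HEAGAWGHEE"   # column labels of the table
SEQ_SIDE = "PAWHEAE"     # row labels of the table
GAP = -8                 # assumed linear: charged once per null

# BLOSUM50 values from the table, one row per residue of SEQ_SIDE.
TABLE = [
    [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],  # P
    [-2, -1,  5,  0,  5, -3,  0, -2, -1, -1],  # A
    [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],  # W
    [10,  0, -2, -2, -2, -3, -2, 10,  0,  0],  # H
    [ 0,  6, -1, -3, -1, -3, -3,  0,  6,  6],  # E
    [-2, -1,  5,  0,  5, -3,  0, -2, -1, -1],  # A
    [ 0,  6, -1, -3, -1, -3, -3,  0,  6,  6],  # E
]

def smith_waterman(top, side, table, gap):
    """Local alignment; returns (best score, aligned top, aligned side)."""
    rows, cols = len(side), len(top)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + table[i - 1][j - 1],
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score reaches zero.
    i, j = best_pos
    top_row, side_row = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + table[i - 1][j - 1]:
            top_row.append(top[j - 1]); side_row.append(side[i - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top_row.append("-"); side_row.append(side[i - 1])
            i -= 1
        else:
            top_row.append(top[j - 1]); side_row.append("-")
            j -= 1
    return best, "".join(reversed(top_row)), "".join(reversed(side_row))

result = smith_waterman(SEQ_TOP, SEQ_SIDE, TABLE, GAP)
print(result)  # (28, 'AWGHE', 'AW-HE')
```

The best local alignment pairs AWGHE with AW-HE: 5 + 15 - 8 + 10 + 6 = 28. The flanking residues are left out because including them would lower the score.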