Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)


Bioinformatics
Sequence Alignment: Pairwise Sequence Alignment
16/3/29 & 23/3/29 27/4/29

Outline
  Introduction
  What is pairwise sequence alignment?
  Scoring Sequence Alignments
  Pairwise Sequence Alignment Algorithms
    Needleman-Wunsch Algorithm (global alignment)
    Smith-Waterman Algorithm (local alignment)
    Extension to repeated matches (multiple local alignments)
  Heuristic Pairwise Sequence Alignment Algorithms
    BLAST
    FASTA

Introduction

Advances in molecular biology allow increasingly rapid sequencing of genomes, which has led to exponential growth in GenBank.

François Jacob (1977) [Evolution and tinkering, Science 196:1161-1166]: "Nature is a tinkerer and not an inventor."

Eric Wieschaus (1995) [Associated Press, 9 October 1995]: "We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans."

New sequences are adapted from pre-existing sequences rather than invented de novo.
Sequence similarity is an indicator of homology.
Sequence similarity has several other uses:
  Database queries
  Comparative genomics
  ...

What is Pairwise Sequence Alignment?

The problem of deciding whether a pair of sequences is evolutionarily related.
Deciding whether two biological sequences are similar is treated as deciding whether two strings are similar.
Sequences accumulate:
  Insertions
  Deletions
  Substitutions

Distance Between DNA Sequences

Hamming distance is not typically used to compare DNA or protein sequences, since it is only defined for strings of equal length.
Levenshtein distance allows one to compare strings of different lengths.

Edit distance
Definition: The edit distance between two strings is the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. Matches are not counted.

String Alignment

The concept of an alignment is crucial.

Global Alignment
Definition: A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space (dash) in either string is opposite a unique character or a unique space (dash) in the other string.
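The edit distance defined above can be computed by dynamic programming. The following is a minimal sketch, assuming unit cost for every insertion, deletion and substitution; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Assumption: every edit operation costs 1 and matches cost 0.
    """
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # matches are not counted
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory, which is enough when the distance alone is needed; recovering the alignment itself would require the full table.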

Gaps

Gaps help create alignments that better conform to underlying biological models.
Mechanisms that make long insertions or deletions in DNA include:
  unequal crossing-over in meiosis;
  DNA slippage during replication;
  insertion of transposable elements into a DNA string;
  insertion of DNA by retroviruses;
  etc.
Definition: A gap is any maximal, consecutive run of spaces (or dashes) in a single string of a given alignment.

Example

S1 = WEAGAWGHEE
S2 = PAWHEAE

  WEAGAWGHE-E
  P-A--W-HEAE

  WEAGAWGHE-E
  --P-AW-HEAE

Each column of an alignment is a match, a mismatch or a gap.
More than one alignment is possible! Which one is better? Is it a true or a spurious alignment?
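The definition of a gap as a maximal run of dashes can be made concrete with a short sketch. The helper name `find_gaps` is hypothetical, and it assumes the dash character "-" is the only gap symbol.

```python
import re


def find_gaps(aligned: str) -> list[tuple[int, int]]:
    """Return (start, length) for each maximal run of dashes in one aligned string."""
    # "-+" matches a run of dashes greedily, so each match is a maximal run.
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", aligned)]


print(find_gaps("P-A--W-HEAE"))  # [(1, 1), (3, 2), (6, 1)]
```

In the first example alignment, the string P-A--W-HEAE therefore contains three gaps covering four dash positions, which matters once opening a gap and extending it are penalised differently.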

How to Score an Alignment?

Find the best alignment between two strings under some scoring scheme.
Use a scoring model that quantifies evolutionary preferences:
  Substitution matrices: matches and mismatches. A set of values quantifying the likelihood of one residue being substituted by another in an alignment.
  Gap penalty: initiating a gap.
  Gap extension penalty: extending a gap.

The Scoring Model

The total score is a sum of terms for each aligned pair of residues, plus terms for each gap.
Identities and conservative substitutions are expected to occur more often in real alignments than by chance, so they contribute positive score terms.
Non-conservative changes are expected to occur less often in real alignments than by chance, so they contribute negative score terms.

The score assigned to an alignment is computed using this function:

  S = \sum_i s(s_1(i), s_2(i)) + \sum_g G(g)

where
  s(s_1(i), s_2(i)) is the score for each aligned pair of residues, given by a scoring matrix;
  G(g) is the penalty for gap g, given a priori.

Scores s(.,.) and gap penalties G(g) can be computed using different models (scoring matrices, probabilistic models, ...).
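The scoring function above can be sketched in a few lines. This is an illustration under stated assumptions, not the course's implementation: the pair scores are the BLOSUM50 entries for the residue pairs that occur in the example alignments of S1 = WEAGAWGHEE and S2 = PAWHEAE, and the gap term is assumed to be linear, costing -8 for every gap position. The names `PAIR_SCORES`, `GAP` and `alignment_score` are hypothetical.

```python
# BLOSUM50 entries for the residue pairs used in the example (symmetric matrix).
PAIR_SCORES = {
    ("A", "A"): 5, ("E", "E"): 6, ("H", "H"): 10, ("W", "W"): 15,
    ("A", "P"): -1, ("P", "W"): -4,
}
GAP = -8  # assumed linear penalty: the same cost for opening and extending


def pair_score(x: str, y: str) -> int:
    """Look up s(x, y), trying both orders because the matrix is symmetric."""
    key = (x, y) if (x, y) in PAIR_SCORES else (y, x)
    return PAIR_SCORES[key]


def alignment_score(row1: str, row2: str) -> int:
    """Sum the pair scores and gap penalties over the columns of an alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += GAP  # a column that contains a gap position
        else:
            total += pair_score(x, y)  # an aligned pair of residues
    return total


print(alignment_score("WEAGAWGHE-E", "--P-AW-HEAE"))  # 1
```

With an affine model the gap term would depend on whether a dash opens or extends a gap; the linear assumption keeps the sketch column-by-column.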

Example Alignment Scores

Substitution scores (BLOSUM50 values for the residues involved):

       A    E    H    P    W
  A    5   -1   -2   -1   -3
  E   -1    6    0   -1   -3
  G    0   -3   -2   -2   -3
  H   -2    0   10   -2   -3
  W   -3   -3   -3   -4   15

Gap penalty: -8
Gap extension penalty: -8

  WEAGAWGHE-E
  --P-AW-HEAE

(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: What is the score of the following alignment?

  WEAGAWGHE-E
  P-A--W-HEAE

Answer:

  WEAGAWGHE-E
  P-A--W-HEAE

(-4) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = -2

Scoring Matrices

A family of matrices listing the likelihood of change from one residue to another during evolution.
  Amino acid substitution matrices
    PAM (Point Accepted Mutation)
    BLOSUM (BLOcks of Amino Acid SUbstitution Matrix)
  DNA substitution matrices
    DNA is less conserved than protein sequences.
    It is less effective to compare coding regions at the nucleotide level.

DNA Substitution Matrices

Scoring matrices for nucleotide sequences are relatively simple.
A positive value or a high score is given for a match, and a negative value or a low positive score is given for a mismatch.
This assignment is based on the assumption that the frequencies of mutation are equal for all bases.
However, this assumption may not be realistic!
Observations show that transitions (substitutions between two purines, A<->G, or between two pyrimidines, C<->T) occur more frequently than transversions (substitutions between a purine and a pyrimidine, e.g. T<->G).
Therefore, a more sophisticated statistical model, with different probability values for the two types of mutation, is needed!
Several nucleotide substitution models exist (example: the Kimura model).

Amino acid substitution matrices

PAM Matrices (Dayhoff, 1978)
Encode and summarize expected evolutionary change at the amino acid level.
Each matrix is designed to compare pairs of sequences that have diverged by a specific number of PAM units.
1 PAM unit corresponds to 1 point mutation per 100 residues.
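Going back to the DNA case for a moment: the transition/transversion distinction can be turned into a simple scoring function. The sketch below only illustrates the idea; the score values in `SCORES` are hypothetical and are not taken from the Kimura model or from the slides.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

# Hypothetical scores: transversions are penalised more than transitions.
SCORES = {"match": 1, "transition": -1, "transversion": -2}


def substitution_type(x: str, y: str) -> str:
    """Classify an aligned pair of DNA bases."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "match"
    pair = {x, y}
    # Both bases in the same chemical class: purine<->purine or pyrimidine<->pyrimidine.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def dna_score(x: str, y: str) -> int:
    """Score an aligned pair of bases using the hypothetical SCORES table."""
    return SCORES[substitution_type(x, y)]


print(substitution_type("A", "G"), substitution_type("T", "G"))  # transition transversion
```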

After 100 PAMs of evolution, not every residue will have changed:
  Some residues may have mutated several times.
  Some residues may have returned to their original state.
  Some residues may not have changed at all.

PAM matrices were built by first constructing hypothetical phylogenetic trees relating the sequences in 71 families, where each pair of sequences differed by no more than 15% of their residues.
For each amino acid pair, A_i and A_j, count the number of times that A_i aligns opposite A_j, and divide that number by the total number of pairs in all the aligned data.

PAM Matrices

Let F(i,j) denote the resulting frequency.
Let F(i) and F(j) be the frequencies with which amino acids A_i and A_j appear in the sequences.
The (i,j) entry of the ideal PAMn matrix is:

  \log \frac{F(i,j)}{F(i) F(j)}
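The log-odds entry above can be computed directly from a list of aligned residue pairs. The sketch below makes two assumptions that the slides do not state: each observed pair is counted in both orders so that the resulting matrix is symmetric, and the logarithm base is a parameter (base 2 by default). The function name `log_odds_scores` is hypothetical, and pairs that are never observed are simply left out.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs, base=2.0):
    """Return {(a, b): log(F(a,b) / (F(a) F(b)))} for every observed pair."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Assumption: count each pair in both orders to make F symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # F(a): frequency of residue a, obtained as the marginal of F(a, b).
    residue_freq = Counter()
    for (a, _), n in pair_counts.items():
        residue_freq[a] += n / total

    return {
        (a, b): math.log((n / total) / (residue_freq[a] * residue_freq[b]), base)
        for (a, b), n in pair_counts.items()
    }


# Toy data: identities are frequent, the A/S substitution is rare.
pairs = [("A", "A")] * 3 + [("S", "S")] * 3 + [("A", "S")] * 2
scores = log_odds_scores(pairs)
print(scores[("A", "A")] > 0, scores[("A", "S")] < 0)  # True True
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative ones, which is the behaviour the scoring model asks for.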

  Evolutionary distance (PAM)   1   11   23   38   56   80   120   159
  Observed difference (%)       1   10   20   30   40   50    60    70

Most widely used PAM matrix: PAM250

BLOSUM Matrices (Henikoff, 1992)

Substitution matrices derived using probabilistic models.
Matrices derived from a much larger dataset: the BLOCKS database of protein families.
Sequences are clustered whenever their percentage of identical residues exceeds some level L%.
BLOSUM50 and BLOSUM62 are widely used.
BLOSUM observes significantly more replacements than PAM, even for infrequent pairs.

[Table: BLOSUM50 substitution matrix over the amino acids A R N D C Q E G H I L K M F P S T W Y V]

PAM Matrices vs BLOSUM Matrices
  The PAM model is designed to track the evolutionary origin of proteins.
  The BLOSUM model is designed to find conserved domains of proteins.

Rules of thumb
  Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
  Higher PAMs and lower BLOSUMs find longer, weaker local alignments.