Quantifying sequence similarity

Similar documents
Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Scoring Matrices. Shifra Ben-Dor Irit Orr

Phylogenetic inference

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Substitution matrices

Moreover, the circular logic

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Sequence analysis and comparison

Computational Biology

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Local Alignment Statistics

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence comparison: Score matrices

Algorithms in Bioinformatics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Tools and Algorithms in Bioinformatics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Pairwise & Multiple sequence alignments

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Practical considerations of working with sequencing data

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Pairwise sequence alignments

CSE 549: Computational Biology. Substitution Matrices

Dr. Amira A. AL-Hosary

Advanced topics in bioinformatics

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

7.36/7.91 recitation CB Lecture #4

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Scoring Matrices. Shifra Ben Dor Irit Orr

Large-Scale Genomic Surveys

Lecture Notes: Markov chains

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

EECS730: Introduction to Bioinformatics

Lecture 5,6 Local sequence alignment

Week 10: Homology Modelling (II) - HHpred

Bioinformatics and BLAST

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Using Bioinformatics to Study Evolutionary Relationships Instructions

bioinformatics 1 -- lecture 7

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Evolutionary Models. Evolutionary Models

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Understanding relationship between homologous sequences

Introduction to Bioinformatics Algorithms Homework 3 Solution

BIOINFORMATICS: An Introduction

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

BLAST: Target frequencies and information content Dannie Durand

Copyright 2000 N. AYDIN. All rights reserved. 1

In-Depth Assessment of Local Sequence Alignment

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Cladistics and Bioinformatics Questions 2013

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Overview Multiple Sequence Alignment

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Sequence Analysis '17 -- lecture 7

Tools and Algorithms in Bioinformatics

Homology Modeling. Roberto Lins EPFL - summer semester 2005

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Pairwise sequence alignment

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Similarity or Identity? When are molecules similar?

BINF 730. DNA Sequence Alignment Why?

Similarity searching summary (2)

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Basic Local Alignment Search Tool

PAM-1 Matrix 10,000. From: Ala Arg Asn Asp Cys Gln Glu To:

Tree Building Activity

Multiple Alignment using Hydrophobic Clusters : a tool to align and identify distantly related proteins

Bioinformatics Exercises

Collected Works of Charles Dickens

BLAST. Varieties of BLAST

p(-,i)+p(,i)+p(-,v)+p(i,v),v)+p(i,v)

What Is Conservation?

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence Alignment Techniques and Their Uses

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Introduction to Evolutionary Concepts

Introduction to Bioinformatics

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION

Sequence Database Search Techniques I: Blast and PatternHunter tools

Exercise 5. Sequence Profiles & BLAST

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Transcription:

Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity give two examples of DNA substitution matrices list four properties of amino acids that might be important in determining their physico-chemical similarity derive BLOSUM scores from counts of aligned residues interpret BLOSUM scores and BLOSUM-based alignment scores by performing calculations with log-odds values explain when to use matrices based on high/low BLOSUM numbers (e.g. 45/62/80) and why they are suitable list some of the elements in a model of evolution, explain what it is and why we use it convert mutational distance into evolutionary distance by correcting for mutational saturation with Jukes-Cantor 1

2/19/17 Homology Phylogenetic trees and sequence alignments contain a lot of information about function and evolution of sequences But they only make sense if the sequences descend from a common ancestor, i.e. they are evolutionarily related We are NEVER allowed to interpret trees or alignments, UNLESS they are made with homologous sequences! so how do we determine if sequences share a common ancestor? (i.e. are homologs?) Similarity Things that look really similar probably share a common ancestor 2

2/19/17 Identity Alignment means: adding gaps in one and/or the other sequence until they are both equally long: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--AGGCTATCACCTGACCTCCAGGCCGATGCCC TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC TAGCTATCACGACCGCGGTCGATTTGCCCGAC Before alignment: After alignment: Aligned sequences allow us to calculate percent identity: 8 different positions, and 31 identical positions The two sequences are 100 * 31 = 79% identical 39 Identity can be quantified, but homology cannot! Saying 79% homologous is wrong (even though too many people say it) Homology Definitions Property of two sequences that have a shared ancestor Homology is TRUE or FALSE: either you re family or you re not Identity Percentage of identical residues in an alignment Used for amino acids or nucleotides Similarity Percentage of amino acid residues in an alignment with a positive substitution score Not used for DNA 3

Evolutionary distance The evolutionary distance between two sequences is often expressed as the frequency of mismatches The distance between virus3 and virus7: 0.77 mutations/site Assumption: positions evolve randomly and independently The horizontal branches represent evolutionary lineages changing over time. The branch length represents the evolutionary time between two nodes. Unit: substitutions per sequence site. http://epidemic.bio.ed.ac.uk/how_to_read_a_phylogeny DNA substitution matrices Similarity is quantified with a substitution matrix Scores for matches in a sequence alignment Penalties for mismatches in a sequence alignment The identity matrix is the most often used for scoring DNA sequence alignments Not all nucleotide substitutions are equally likely Transitions occur about twice as often as transversions A 1 A C G T C -1 1 G -1-1 1 T -1-1 -1 1 Identity matrix A G T C Transitions: A G C T Transversions: A C A T C G G T A C G T A 1 C -2 1 G -1-2 1 T -2-1 -2 1 4

2/19/17 Identity and similarity These three sequences are all 66.7% identical but what are those colors? * Some amino acids are more similar than others (* hint) Cysteine (C) Aspartic acid (D) Glutamic acid (E) George Orwell Animal Farm Amino acid properties http://swift.cmbi.ru.nl/teach/b1b/html/postera4_nbic_new.pdf 5

2/19/17 The similarity signal What is the most relevant definition of a similarity signal between amino acids? Size? Polarity? Hydrophobicity? Preferred protein fold? any other measures that might be important? Similarity signal of what, exactly? * Of the function of the amino acid in the protein Of an evolutionary relationship (* hint) Evolution has tested this for us So let s use evolved sequences to score similarity! Which amino acids mutate into each other? We use alignments to discover which amino acids are more similar than others, according to evolution We quantify this by counting how often two amino acids really mutated into each other during evolution The most relevant signal to detect homology is based on reliable, well-aligned homologs Well-aligned homologs 6

BLOcks SUbstitution Matrix (BLOSUM) 1. Take some aligned homologous sequences 2. Group highly identical sequences (e.g. >62% identity) To remove redundancy biases in the sequence database 3. Identify well-aligned blocks So that only real mutations are compared (reliable, well-aligned homologs) 4. Count how often each pair of two amino acids mutated into each other (i.e. how often they are aligned) How to quantify this similarity? BLOSUM defines similarity between amino acids based on the statistical probability that they align in well-aligned homologs Observed/expected ratio (odds ratio) An obeserved/expected ratio of 2 means something happens twice as often as you expected by random chance Observed: aligned in well-aligned homologs Expected: aligned in unaligned sequences BLOSUM score: log ( ) S i,j = 2 log 2( F i,j ) aligned in well-aligned observed homologs aligned in expected unaligned sequences F i F j The observed/expected ratio for a whole alignment can be calculated by multiplying the ratios for all aligned amino acids This is a lot of multiplying for long sequences Logarithms allow us to sum the scores of aligned residues Much easier to use for a scoring system Hence: the BLOSUM score 7

An example Let s say that we have a well-aligned block of: 100 amino acids long 1,000 proteins deep With no gaps 7,400 7.4% are alanine (A) F A = 100 1,000 = 0.074 1.3% are tryptophan (W) F W = 0.013 Randomly, we expect a fraction of A-W alignments of: F A F W = 0.074 0.013 = 0.000962 In reality, we observe a fraction of A-W alignments of: 0.00034 The A-W mutation occurred less frequently in evolution than expected! so the substitution score must be negative The A-W substitution score is: S i,j = 2 log 2( F i,j F i F j ) S A,W = 2 log 2( 0.00034 ) 0.000962 = -3 BLOSUM62 Substitution matrix Low for infrequent substitutions High for frequent substitutions Small Low for common amino acids High for rare amino acids Polar, non-hydrophobic Positive Hydrophobic Aromatic Why are the highest numbers in each row/column on the diagonal? Because in well-aligned homologs, every amino acid is most often aligned to itself (i.e. it is not mutated) 8

So what do the numbers mean? S i,j = 2 log 2( F i,j F i F j ) 2 (S i,j/2) = F i,j F i F j A score of -3 for the substitution A-W means that this substitution is observed in well-aligned homologs 0.35 times as often as expected: 2 (-3/2) 0.00034 = 0.35 = 1 0.000962 or: = 2.83 times less often than expected 0.35 A score of 2 for the substitution D-E means that this substitution is observed in well-aligned homologs twice as often as expected: 2 (2/2) = 2 A score of 9 for the substitution C-C means that this substitution is observed in well-aligned homologs 22.6 times as often as expected: 2 (9/2) = 22.6 Exercise Small Polar, non-hydrophobic Positive Hydrophobic Aromatic Calculate the similarity score between these sequences: a) KAWSADV 1 + 4 1 + 1 3 + 2 2 = 0 b) HAMNWEQ 0 2 1 + 1 3 + 5 2 = 2 c) RDWSACV 2 2 + 11 + 4 + 4 3 + 4 = 20 9

Odds ratio that sequences are well-aligned homologs How likely is it that two sequences are well-aligned homologs? seqa seqb : -1 + -3 + -2 = -6 seqa seqc : -1 + -2 + -2 = -5 seqb seqc : 4 + 2 + 4 = 10 2 (-6/2) = 0.125 2 (-5/2) = 0.177 2 (10/2) = 32 When you observe RYD aligned to SDA, these sequences are 0.125 times more likely to be well-aligned homologs than expected or 8 times less likely to be well-aligned homologs than expected When you observe RYD aligned to SEA, these sequences are 0.177 times more likely to be well-aligned homologs than expected or 5.6 times less likely to be well-aligned homologs than expected When you observe SDA aligned to SEA, these sequences are 32 times more likely to be well-aligned homologs than expected Length of the alignment As you see a more and more length of two sequences If two sequences are well-aligned homologs, most of the aligned amino acids will have positive scores so the alignment score keeps increasing as you see more length of the sequences so the likelihood that they are well-aligned homologs keeps increasing If two sequences are not homologous, most of the aligned amino acids will have negative scores so the score keeps decreasing as you see more sequence so the likelihood that they are well-aligned homologs keeps decreasing 10

BLOSUM numbers BLOSUM removes redundancy biases from the blocks of well-aligned homologs The sequence selection in the database is biased Clusters of highly identical sequences are collapsed Per cluster, one consensus sequence is used to calculate the BLOSUM matrix High numbers collapse only very identical homologs Used to compare and detect closely related proteins Low numbers also collapse more divergent homologs Used to compare and detect more distant proteins The most used matrices: BLOSUM80: closely related sequences collapsed (>80% id) BLOSUM62: default, sequences collapsed if they have >62% identity BLOSUM45: distantly related sequences collapsed (>45% id) Substitution matrices Substitution matrices are used in (almost) all sequence analyses strongly influence your results Other amino acid substitution matrices Point Accepted Mutations (PAM) Developed by Margaret Dayhoff in 1978 Based on counting 1,572 real mutations that occurred during the evolution of 71 proteins Gonnet (1992) Similar to PAM, but based on more sequences Default in popular alignment program Clustal Omega Bioinformatic programs often have default parameter values set For example the choice of a substitution matrix Default settings are not always the most optimal for your questions If you have no idea, check the literature in your biological field what is considered reliable 11

Models of evolution We use models to check if we really understand something If our observations agree with the model, then we understand that something well (and vice versa) In a model of evolution, we quantitatively formalize the evolutionary processes that we think are at play A substitution matrix is (part of) a model of evolution, because it formalizes all the relative substitution rates between residues If other types of mutations are included, the model often agrees better with genome sequence data Absolute versus evolutionary distance Consider a recently diverged sequence The two DNA sequences will diverge, but how? Time à The divergence saturates at ~25% identity Mutational saturation: if the same position mutates again The identity between two random DNA sequences is 25% We can correct for it with the Jukes-Cantor formula: % id for random seqs 0.5 0.4 0.3 0.2 y = x 2 - x + 0.5 0 0.2 0.4 0.6 0.8 1 %GC of sequences 12