Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Similar documents
Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

CSE 549: Computational Biology. Substitution Matrices

Quantifying sequence similarity

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Scoring Matrices. Shifra Ben-Dor Irit Orr

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Advanced topics in bioinformatics

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Local Alignment Statistics

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Substitution matrices

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Sequence analysis and comparison

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Practical considerations of working with sequencing data

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Single alignment: Substitution Matrix. 16 march 2017

Pairwise sequence alignments

Computational Biology

Scoring Matrices. Shifra Ben Dor Irit Orr

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise sequence alignment

Algorithms in Bioinformatics

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence Database Search Techniques I: Blast and PatternHunter tools

Large-Scale Genomic Surveys

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Basic Local Alignment Search Tool

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Introduction to Bioinformatics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Sequence comparison: Score matrices

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

In-Depth Assessment of Local Sequence Alignment

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

BLAST: Target frequencies and information content Dannie Durand

Tools and Algorithms in Bioinformatics

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

M.O. Dayhoff, R.M. Schwartz, and B. C, Orcutt

Similarity or Identity? When are molecules similar?

Pairwise & Multiple sequence alignments

Chapter 2: Sequence Alignment

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

An Introduction to Sequence Similarity ( Homology ) Searching

Lecture Notes: Markov chains

Dr. Amira A. AL-Hosary

PAM-1 Matrix 10,000. From: Ala Arg Asn Asp Cys Gln Glu To:

Pairwise Sequence Alignment

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Protein Sequence Alignment and Database Scanning

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld

Tutorial 4 Substitution matrices and PSI-BLAST

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Sequence Alignment Techniques and Their Uses

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences

Warm-Up- Review Natural Selection and Reproduction for quiz today!!!! Notes on Evidence of Evolution Work on Vocabulary and Lab

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Week 10: Homology Modelling (II) - HHpred

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Tools and Algorithms in Bioinformatics

7.36/7.91 recitation CB Lecture #4

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Exercise 5. Sequence Profiles & BLAST

Moreover, the circular logic

bioinformatics 1 -- lecture 7

BINF 730. DNA Sequence Alignment Why?

Bioinformatics Exercises

Bioinformatics and BLAST

Phylogenetic inference

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

Copyright 2000 N. AYDIN. All rights reserved. 1

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Reproduction- passing genetic information to the next generation

Sequence analysis and Genomics

Figure A1. Phylogenetic trees based on concatenated sequences of eight MLST loci. Phylogenetic trees were constructed based on concatenated sequences

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

BIOINFORMATICS: An Introduction

Introduction to protein alignments

SCIENTIFIC EVIDENCE TO SUPPORT THE THEORY OF EVOLUTION. Using Anatomy, Embryology, Biochemistry, and Paleontology

Substitution Matrices

Chapter 7: Rapid alignment methods: FASTA and BLAST

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Lecture 5,6 Local sequence alignment

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Cladistics and Bioinformatics Questions 2013

Domain-based computational approaches to understand the molecular basis of diseases

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Transcription:

Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow

Learning Objectives To explain the requirement for a scoring system reflecting possible biological relationships To describe the development of PAM scoring matrices To describe the development of BLOSUM scoring matrices 2

Database search to identify homologous sequences based on similarity scores Ignore position of symbols when scoring Similarity scores are additive over positions on each sequence to enable DP Scores for each possible pairing, e.g. proteins composed of 20 amino acids, 20 x 20 scoring matrix 3

Scoring matrix should reflect Degree of biological relationship between the amino-acids or nucleotides The probability that two AA s occur in homologous positions in sequences that share a common ancestor Or that one sequence is the ancestor of the other Scoring schemes based on physico-chemical properties also proposed 4

Use of Identity Unequal AA s score zero, equal AA s score 1. Overall score can then be normalised by length of sequences to provide percentage identity Use of Genetic Code How many mutations required in NA s to transform one AA to another Phe (Codes UUU & UUC) to Asn (AAU, AAC) Use of AA Classification Similarity based on properties such as charge, acidic/basic, hydrophobicity, etc 5

should be developed from experimental data Reflecting the kind of relationships occurring in nature Point Accepted Mutation (PAM) matrices Dayhoff (1978) Estimated substitution probabilities Using known mutational (substitution) histories 6

Dayhoff employed 71 groups of near homologous sequences (>85% identity) For each group a phylogenetic tree constructed Mutations accepted by species are estimated New AA must have similar functional characteristics to one replaced Requires strong physico-chemical similarity Dependent on how critical position of AA is to protein Employs time intervals based on number of mutations per residue 7

Overall Dayhoff Procedure:- Divide set of sequences into groups of similar sequences multiple alignment for each group Construct phylogenetic tree for each group Define evolutionary model to explain evolution Construct substitution matrices The substitution matrix for an evolutionary time interval t gives for each pair of AA (a, b) an estimate for the probability of a to mutate to b in a time interval t. 8

Evolutionary Model Scoring Matrices Assumptions : The probability of a mutation in one position of a sequence is only dependent on which AA is in the position Independent of position and neighbour AA s Independent of previous mutations in the position No need to consider position of AA s in sequence Biological clock rate of mutations constant over time Time of evolution measured by number of mutations observed in given number of AA s. 1-PAM = one accepted mutation per 100 residues 9

Calculating Substitution Matrix count number of accepted mutations ACGH DKGH DDIL CKIL C-K D-A D-K C-A A C D G H I K L AKGH G-I H-L D-A AKIL A C D G H 1 2 1 2 1 1 1 1 I 1 K 1 1 L 1 10

Once all accepted mutations identified calculate The number of a to b or b to a mutations from table denoted as f ab The total number of mutations in which a takes part denoted as f a = Σ b a f ab The total number of mutations f =Σ a f a (each mutation counted twice) Calculate relative occurrence of AA s p a where Σ a p a = 1 11

Calculate the relative mutability for each AA Measure of probability that a will mutate in the evolutionary time being considered Mutability depends on f a As f a increases so should mutability m a ; AA occurring in many mutations indicates high mutability As p a increases mutability should decrease ; many occurrences of AA indicate many mutations due to frequent occurrence of AA Mutability can be defined as m a = K f a / p a where K is a constant 12

Probability that an arbitrary mutation contains a 2f a / f Probability that an arbitrary mutation is from a f a / f For 100 AA s there are 100p a occurrences of a Probability to select a 1/ 100p a Probability of any of a to mutate m a = (1/ 100p a ) x (f a / f) Probability that a mutates in 1 PAM time unit defined by m a 13

Probability that a mutates to b given that a mutates is f ab / f a Probability that a mutates to b in time t = 1 PAM M ab = m a f ab / f a when a b X=0 C 12 S 0 2 T -2 1 3 P -3 1 0 6 Log-odds PAM 250 matrix A -2 1 1 1 2 G -3 1 0-1 1 5 N -4 1 0-1 0 0 2 D -5 0 0-1 0 1 2 4 E -5 0 0-1 0 0 1 3 4 Q -5-1 -1 0 0-1 1 2 2 4 H -3-1 -1 0-1 -2 2 1 1 3 6 R -4 0-1 0-2 -3 0-1 -1 1 2 6 K -5 0 0-1 -1-2 1 0 0 1 0 3 5 M -5-2 -1-2 -1-3 -2-3 -2-1 -2 0 0 6 I -2-1 0-2 -1-3 -2-2 -2-2 -2-2 -2 2 5 L -6-3 -2-3 -2-4 -3-4 -3-2 -2-3 -3 4 2 6 V -2-1 0-1 0-1 -2-2 -2-2 -2-2 -2 2 4 2 4 F -4-3 -3-5 -4-5 -4-6 -5-5 -2-4 -5 0 1 2-1 9 W 0-3 -3-5 -3-5 -2-4 -4-4 0-4 -4-2 -1-1 -2 7 10 Y -8-2 -5-6 -6-7 -4-7 -7-5 -3 2-3 -4-5 -2-6 0 0 17 C S T P A G N D E Q H R K M I L V F W Y 14

Dayhoff mutation matrix (1978) - summary Point Accepted Mutation (PAM) Dayhof matrices derived from sequences 85% identical Evolutionary distance of 1 PAM = probability of 1 point mutation per 100 residues Likelihood (odds) ratio for residues a and b : Probability a-b is a mutation / probability a-b is chance PAM matrices contain log-odds figures val > 0 : likely mutation val = 0 : random mutation vak < 0 : unlikely mutation 250 PAM : similarity scores equivalent to 20% identity low PAM - good for finding short, strong local similarities high PAM = long weak similarities 15

What about longer evolutionary times? Consider two mutation periods 2-PAM a is mutated to b in first period and unchanged in second Probability is M ab M bb a is unchanged in first period but mutated to b in the second Probability is M aa M ab a is mutated to c in the first which is mutated to b in the second Probability is M ac M cb Final probability for a to be replaced with b M 2 ab = M ab M bb + M aa M ab + Σ c a,b M ac M cb = Σ c M ac M cb 16

Simple definition of matrix multiplication M 2 ab = Σ c M ac M cb M 3 ab = Σ c M2 ac M cb etc Typically M 40 M 120 M 160 M 250 are used in scoring Low values find short local alignments, High values find longer and weaker alignments Two AA s can be opposite in alignment not as a results of homology but by pure chance Need to use odds-ratio O ab = M ab / P b (Use of Log) O ab > 1 : b replaces a more often in bologically related sequences than in random sequences where b occurs with probability P b O ab < 1 : b replaces a less often in bologically related sequences than in random sequences where b occurs with probability P b 17

BLOSUM Scoring Matrices PAM matrices derived from sequences with at least 85% identity Alignments usually performed on sequences with less similarity Henikoff & Henikoff (1992) develop scoring system based on more diverse sequences BLOSUM BLOcks SUbstitution Matrix Blocks defined as ungapped regions of aligned AA s from related proteins Employed > 2000 blocks to derive scoring matrix 18

BLOSUM Scoring Matrices Statistics of occurrence of AA pairs obtained As with PAM frequency of co-occurrence of AA pairs and individual AA s employed to derive Odds ratio BLOSUM matrices for different evolutionary distances Unlike PAM cannot derive direct from original matrix Scoring Matrices derived from Blocks with differing levels of identity 19

BLOSUM Scoring Matrices Overall procedure to develop a BLOSUM X matrix Collect a set of multiple alignments Find the Blocks (no gaps) Group segments of Blocks with X% identity Count the occurrence of all pairs of AA s Employ these counts to obtain odds ratio (log) Most common BLOSUM matrices are 45, 62 & 80 20

Differences between PAM & BLOSUM PAM based on predictions of mutations when proteins diverge from common ancestor explicit evolutionary model BLOSUM based on common regions (BLOCKS) in protein families BLOSUM better designed to find conserved domains BLOSUM - Much larger data set used than for the PAM matrix BLOSUM matrices with small percentage correspond to PAM with large evolutionary distances BLOSUM64 is roughly equivalent to PAM 120 21