Scoring Matrices. Shifra Ben-Dor, Irit Orr

Scoring matrices. Sequence alignment and database searching programs compare sequences to each other as a series of characters. All comparison algorithms (programs) rely on some scoring scheme to do this. Scoring matrices are used to assign a score to each comparison of a pair of characters.

Scoring matrices. The scores in the matrix are integer values. In most cases a positive score is given to identical or similar character pairs, and a negative or zero score to dissimilar character pairs.

Different types of matrices. Identity scoring: the simplest scoring scheme, where characters are classified as identical (score 1) or non-identical (score 0). This scoring scheme is rarely used. DNA scoring: changes are considered as transitions and transversions. This matrix scores identical bases 3, transitions 2, and transversions 0.
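The two simple schemes above can be sketched in a few lines of Python. This is only an illustration: the function names `identity_score` and `dna_score` are made up for this example, and the scores 3, 2 and 0 are the ones quoted on the slide.

```python
# Purines and pyrimidines; a transition stays within one class,
# a transversion crosses between the two classes.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(x: str, y: str) -> int:
    """Identity scoring: 1 for identical characters, 0 otherwise."""
    return 1 if x.upper() == y.upper() else 0


def dna_score(x: str, y: str) -> int:
    """DNA scoring: identical bases 3, transitions 2, transversions 0."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 3
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 2 if same_class else 0
```

For example, `dna_score("A", "G")` is a purine-to-purine change (a transition), while `dna_score("A", "C")` crosses classes (a transversion).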

Different types of matrices. Chemical similarity scoring (for proteins): this matrix gives greater weight to amino acids with similar chemical properties (e.g. size, shape or charge of the amino acid). Observed matrices for proteins: the type most commonly used by all programs. These matrices are constructed by analyzing the substitution frequencies seen in the alignments of known families of proteins.

Observed Scoring Matrices. Every possible identity and substitution is assigned a score. This score is based on the observed frequencies of such occurrences in alignments of evolutionarily related proteins. The score also reflects how frequently a particular amino acid occurs in nature, as some amino acids are more abundant than others.

Observed Scoring Matrices. Identities are assigned the most positive scores. Frequently observed substitutions also receive positive scores. Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores.

Each matrix entry gives the ratio of the observed frequency of substitution between each possible pair of amino acids in related proteins to that expected by chance, given the frequencies of amino acids in proteins. These ratios are called odds scores. The ratios are transformed to logarithms, called log odds scores. Odds scores and log odds scores are used to score protein alignments.
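As a rough illustration of the odds and log odds idea, the sketch below divides an observed pair frequency by the frequency expected by chance and takes the logarithm. The function name `log_odds`, the choice of base 2, and all the numbers are assumptions made for this example; the expected frequency is simply taken as the product of the two amino acid frequencies.

```python
import math


def log_odds(observed_pair_freq: float, freq_a: float, freq_b: float,
             base: float = 2.0) -> float:
    """Log of (observed pair frequency / frequency expected by chance)."""
    expected = freq_a * freq_b  # chance pairing of two independent residues
    return math.log(observed_pair_freq / expected, base)


# Hypothetical frequencies, for illustration only.
frequent = log_odds(0.02, 0.1, 0.1)     # seen twice as often as chance
rare = log_odds(0.0025, 0.1, 0.05)      # seen half as often as chance
```

A pair seen more often than chance gets a positive score, and a pair seen less often than chance gets a negative one, which matches the sign convention described above.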

Different types of matrices. Observed scoring matrices are superior to simple identity scores, or scores based solely on chemical properties of amino acids. The most frequently used observed log odds matrices are the PAM and BLOSUM matrices.

PAM Matrices. Developed by Margaret Dayhoff and co-workers. Derived from global alignments of very similar sequences (at least 85% identity), so that there would be little likelihood of an observed change being the result of several successive mutations; each change should reflect one mutation only. PAM = Point Accepted Mutations.

An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes: the first is the occurrence of a mutation in the portion of the gene template producing one amino acid of a protein; the second is the acceptance of the mutation by the species as the new predominant form. To be accepted, the new amino acid usually must function in a way similar to the old one: chemical and physical similarities are found between the amino acids that are observed to interchange frequently.

Dayhoff estimated mutation rates from substitutions observed in closely related proteins and extrapolated those rates to model distant relationships. The PAM mutation probability matrix gives the probability that a given amino acid will be replaced by any other particular amino acid after a given evolutionary interval, in this case 1 accepted point mutation per 100 amino acids (PAM1).

When used for protein comparison, the mutation probability (odds) matrix is normalized and the logarithm is taken (this lets us add the scores along a protein instead of multiplying the probabilities). The resulting matrix is the log odds matrix, known as the PAM matrix.

PAM# = number of point accepted mutations per 100 amino acids. The number with the matrix (PAM120, PAM90) refers to the evolutionary distance; greater numbers are greater distances. To derive PAM250, the PAM1 mutation probability matrix is raised to the 250th power (multiplied by itself repeatedly). PAM250 is thus the matrix for sequences separated by 250 PAMs.
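The extrapolation from PAM1 to a larger distance can be sketched with plain matrix multiplication. The 3-letter matrix below is a toy, not real amino acid data, and the helper names `mat_mul` and `mat_power` are made up for this example. Treating each row as the "from" residue is also an assumption of the sketch.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return the n-th power of the square matrix m (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Toy 3-letter alphabet, NOT real amino acid data. Each row sums to 1 and
# 99% of positions stay unchanged after one step, as in PAM1.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
```

Each row of `toy_pam250` still sums to 1, but far less of the probability remains on the diagonal, which is why identities score lower at larger PAM distances. A real PAM matrix would then be converted to log odds as described above.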

PAM250. At this evolutionary distance, only one amino acid in five remains unchanged. However, the amino acids vary greatly in their mutability: 55% of the tryptophans, 52% of the cysteines and 27% of the glycines would still be unchanged, but only 6% of the highly mutable asparagines would remain. Several other amino acids, particularly alanine, aspartic acid, glutamic acid, glycine, lysine, and serine, are more likely to occur in place of an original asparagine than asparagine itself at this evolutionary distance!

PAM250 MATRIX
# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93]
#
# PAM 250 substitution matrix, scale = ln(2)/3 = 0.231049
#
# Expected score = -0.844, Entropy = 0.354 bits
#
# Lowest score = -8, Highest score = 17
#
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -3  1  1  1 -6 -3  0  0  0  0 -8
R -2  6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2 -1  0 -1 -8
N  0  0  2  2 -4  1  1  0  2 -2 -3  1 -2 -3  0  1  0 -4 -2 -2  2  1  0 -8
D  0 -1  2  4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  3  3 -1 -8
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -4 -5 -3 -8
Q  0  1  1  2 -5  4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  1  3 -1 -8
E  0 -1  1  3 -5  2  4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  3  3 -1 -8
G  1 -3  0  1 -3 -1  0  5 -2 -3 -4 -2 -3 -5  0  1  0 -7 -5 -1  0  0 -1 -8
H -1  2  2  1 -3  3  1 -2  6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  1  2 -1 -8
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5  2 -2  2  1 -2 -1  0 -5 -1  4 -2 -2 -1 -8
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6 -3  4  2 -3 -3 -2 -2 -1  2 -3 -3 -1 -8
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5  0 -5 -1  0  0 -3 -4 -2  1  0 -1 -8
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6  0 -2 -2 -1 -4 -2  2 -2 -2 -1 -8
F -3 -4 -3 -6 -4 -5 -5 -5 -2  1  2 -5  0  9 -5 -3 -3  0  7 -1 -4 -5 -2 -8
P  1  0  0 -1 -3  0 -1  0  0 -2 -3 -1 -2 -5  6  1  0 -6 -5 -1 -1  0 -1 -8
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2  1 -2 -3 -1  0  0  0 -8
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3 -5 -3  0  0 -1  0 -8

PET91, an updated Dayhoff matrix. Since the family of PAM matrices was derived from a comparatively small number of protein families, many of the possible mutations were not observed. Jones et al. derived an updated matrix by examining a very large number of families, and created the PET91 scoring matrix.

Gonnet Matrices. Another improvement on the PAM matrices, done when the database was larger. An all-against-all pairwise comparison was used to see all possible substitutions. Optimized to find sequences that are further apart (as opposed to PAM, which started with very similar sequences).

BLOSUM Matrices. Created by Henikoff & Henikoff, based on local multiple alignments of more distantly related sequences. First, multiple alignments of short regions (without gaps) of related sequences were gathered. In each alignment, the sequences similar at some threshold value of percent identity were clustered into groups and averaged.

BLOSUM Matrices. Substitution frequencies for all pairs of amino acids were calculated between the groups; these were used to create the log odds BLOSUM (Block Substitution Matrix).
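The pair-counting step can be illustrated on a toy ungapped block. This sketch deliberately skips the clustering and averaging of similar sequences described above, so it is not the Henikoff procedure itself; the function name `count_pairs` and the block contents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


# Toy block of three aligned, equal-length segments (hypothetical data).
toy_block = ["AVL", "AIL", "SVL"]
pairs = count_pairs(toy_block)
```

Dividing each count by the total number of pairs gives observed pair frequencies, which can then be compared with chance expectation to obtain log odds scores.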

BLOSUM#, where # is the threshold identity percentage of the sequences clustered in those blocks. Thus, BLOSUM62 means that the sequences clustered in this block are at least 62% identical. This allows detection of more distantly related sequences, as it downplays the role of the more closely related sequences in the block when building the matrix.

BLOSUM62 MATRIX
# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
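To show how such a matrix is used, the sketch below scores a short ungapped alignment by summing one matrix entry per aligned position. Only four entries are included, copied from the BLOSUM62 table above; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` and the two example sequences are made up for this illustration.

```python
# A few BLOSUM62 entries copied from the table above (half-bit units).
# Keys are stored with the two residues in alphabetical order.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("H", "H"): 8,
    ("D", "E"): 2,
    ("G", "W"): -2,
}


def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up a symmetric score regardless of the order of x and y."""
    return matrix[tuple(sorted((x, y)))]


def ungapped_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum the per-position scores of two equal-length aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))


total = ungapped_score("HEAG", "HDAW")  # 8 + 2 + 4 - 2
```

Because the entries are log odds, adding them along the alignment corresponds to multiplying the underlying odds ratios. A real program would load the full matrix and also handle gaps.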

So which observed matrix to use? For most programs the default is BLOSUM62, and for most searches it works very well. If you don't get results, then try the following rules of thumb:

So which observed matrix to use? For global alignments use PAM matrices. Lower PAM matrices tend to find short alignments of highly similar regions. Higher PAM matrices will find weaker, longer alignments. For local alignments use BLOSUM matrices. BLOSUM matrices with a HIGH number are better for similar sequences. BLOSUM matrices with a LOW number are better for distant sequences.

Tips. When doing global alignment (and database scanning) of related (similar) sequences, use PAM200 or PAM250. If you don't know what to expect (e.g. for database scanning), use PAM120. For local database scanning (e.g. BLAST), or for ungapped local alignments, use BLOSUM62 (recommended for proteins).
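These tips can be written down as a small lookup, shown here only as a sketch. The function name `suggest_matrix` and its argument values are hypothetical, and PAM250 stands in for the "PAM200 or PAM250" recommendation.

```python
def suggest_matrix(alignment: str, similarity: str = "unknown") -> str:
    """Return a starting matrix name following the tips above.

    alignment:  "global" or "local"
    similarity: "related" or "unknown" (only used for global alignment)
    """
    if alignment == "local":
        return "BLOSUM62"
    if alignment == "global":
        return "PAM250" if similarity == "related" else "PAM120"
    raise ValueError("alignment must be 'global' or 'local'")
```

The result is only a starting point; a second matrix is worth trying when the first one gives no results.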

Tips. In all cases it is recommended to try more than one matrix for any database scanning when the defaults don't work.