Scoring Matrices. Shifra Ben-Dor Irit Orr

Similar documents
Scoring Matrices. Shifra Ben Dor Irit Orr

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices

Quantifying sequence similarity

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Computational Biology

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Substitution matrices

Practical considerations of working with sequencing data

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Advanced topics in bioinformatics

Large-Scale Genomic Surveys

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Sequence analysis and comparison

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Algorithms in Bioinformatics

Local Alignment Statistics

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

CSE 549: Computational Biology. Substitution Matrices

PAM-1 Matrix 10,000. From: Ala Arg Asn Asp Cys Gln Glu To:

M.O. Dayhoff, R.M. Schwartz, and B. C, Orcutt

In-Depth Assessment of Local Sequence Alignment

Pairwise sequence alignment

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Exercise 5. Sequence Profiles & BLAST

Tools and Algorithms in Bioinformatics

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Global alignments - review

Single alignment: Substitution Matrix. 16 march 2017

Basic Local Alignment Search Tool

Optimization of a New Score Function for the Detection of Remote Homologs

Protein Sequence Alignment and Database Scanning

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Tutorial 4 Substitution matrices and PSI-BLAST

Bioinformatics and BLAST

Introduction to Bioinformatics

Chapter 7: Rapid alignment methods: FASTA and BLAST

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

COPIA: A New Software for Finding Consensus Patterns. Chengzhi Liang. A thesis. presented to the University ofwaterloo. in fulfilment of the

Week 10: Homology Modelling (II) - HHpred

Protein Secondary Structure Prediction

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

7.36/7.91 recitation CB Lecture #4

Pairwise sequence alignments

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Pairwise Sequence Alignment

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise & Multiple sequence alignments

BLAST: Target frequencies and information content Dannie Durand

Substitution Matrices

C h a p t e r 2 A n a l y s i s o f s o m e S e q u e n c e... methods use different attributes related to mis sense mutations such as

Foreword by. Stephen Altschul. An Essential Guide to the Basic Local Alignment Search Tool BLAST. Ian Korf, Mark Yandell & Joseph Bedell

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Moreover, the circular logic

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Alignment & BLAST. By: Hadi Mozafari KUMS

Proteome Informatics. Brian C. Searle Creative Commons Attribution

Lecture 5,6 Local sequence alignment

Similarity or Identity? When are molecules similar?

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Sequence Analysis '17 -- lecture 7

Ch. 9 Multiple Sequence Alignment (MSA)

Chapter 2: Sequence Alignment

Copyright 2000 N. AYDIN. All rights reserved. 1

Sequence Alignment Techniques and Their Uses

An Introduction to Sequence Similarity ( Homology ) Searching

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

BIOINFORMATICS: An Introduction

Phylogenetic inference

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Bioinformatics for Biologists

BLAST: Basic Local Alignment Search Tool

BINF 730. DNA Sequence Alignment Why?

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY

Dr. Amira A. AL-Hosary

Sequence analysis and Genomics

EECS730: Introduction to Bioinformatics

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Similarity searching summary (2)

Exploring Evolution & Bioinformatics

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Transcription:

Scoring Matrices Shifra Ben-Dor Irit Orr

Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison rely on some scoring scheme for that. Scoring matrices are used to assign a score to each comparison of a pair of characters.

Scoring matrices The scores in the matrix are integer values. In most cases a positive score is given to identical or similar character pairs, and a negative or zero score to dissimilar character pairs.

Different types of matrices Identity scoring - the simplest scoring scheme, where characters are classified as: identical (scores 1), or non-identical (scores 0). This scoring scheme is not much used. DNA scoring - consider changes as transitions and transversions. This matrix scores identical bp 3, transitions 2, and transversions 0.

Different types of matrices Chemical similarity scoring (for proteins) - this matrix gives greater weight to amino acids with similar chemical properties (e.g size, shape or charge of the aa). Observed matrices for proteins - most commonly used by all programs. These matrices are constructed by analyzing the substitution frequencies seen in the alignments of known families of proteins.

Different types of matrices The most frequently observed matrices used are PAM and BLOSUM matrices.

PAM Matrices Developed by Margaret Dayhoff and co-workers. Derived from global alignments of very similar sequences (at least 85% identity), so that there would be little likelihood of an observed change being the result of several successive mutations, but it should reflect one mutation only. PAM - Point Accepted Mutations.

An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes: the first is the occurrence of a mutation in the portion of the gene template producing one amino acid of a protein the second is the acceptance of the mutation by the species as the new predominant form. To be accepted, the new amino acid usually must function in a way similar to the old one: chemical and physical similarities are found between the amino acids that are observed to interchange frequently.

Dayhoff estimated mutation rates from substitutions observed in closely related proteins and extrapolated those rates to model distant relationships. PAM gives the probability that a given amino acid will be replaced by any other particular amino acid after a given evolutionary interval, in this case 1 accepted point mutation per 100 amino acids.

When used for protein comparison, the mutation probability (odds) matrix is normalized and the logarithm is taken. (this lets us add the scores along a protein instead of multiplying the probabilities) The resulting matrix is the log-odds matrix, known as the PAM matrix.

PAM#=Point Accepted Mutations / 100 bases The number with the matrix (PAM120, PAM90), refers to the evolutionary distance. Greater numbers are greater distances. To derive PAM250 you multiply PAM1 250 times itself PAM250 is the matrix derived of sequences with 250 PAMs.

PAM250 At this evolutionary distance, only one amino acid in five remains unchanged. However, the amino acids vary greatly in their mutability; 55% of the tryptophans. 52% of the cysteines and 27% of the glycines would still be unchanged, but only 6% of the highly mutable asparagines would remain. Several other amino acids, particularly alanine, aspartic acid, glutamic acid, glycine, lysine, and serine are more likely to occur in place of an original asparagine than asparagine itself at this evolutionary distance!

PAM250 MATRIX # This matrix was produced by "pam" Version 1.0.6 [28-Jul-93] # # PAM 250 substitution matrix, scale = ln(2)/3 = 0.231049 # # Expected score = -0.844, Entropy = 0.354 bits # # Lowest score = -8, Highest score = 17 # A R N D C Q E G H I L K M F P S T W Y V B Z X * A 2-2 0 0-2 0 0 1-1 -1-2 -1-1 -3 1 1 1-6 -3 0 0 0 0-8 R -2 6 0-1 -4 1-1 -3 2-2 -3 3 0-4 0 0-1 2-4 -2-1 0-1 -8 N 0 0 2 2-4 1 1 0 2-2 -3 1-2 -3 0 1 0-4 -2-2 2 1 0-8 D 0-1 2 4-5 2 3 1 1-2 -4 0-3 -6-1 0 0-7 -4-2 3 3-1 -8 C -2-4 -4-5 12-5 -5-3 -3-2 -6-5 -5-4 -3 0-2 -8 0-2 -4-5 -3-8 Q 0 1 1 2-5 4 2-1 3-2 -2 1-1 -5 0-1 -1-5 -4-2 1 3-1 -8 E 0-1 1 3-5 2 4 0 1-2 -3 0-2 -5-1 0 0-7 -4-2 3 3-1 -8 G 1-3 0 1-3 -1 0 5-2 -3-4 -2-3 -5 0 1 0-7 -5-1 0 0-1 -8 H -1 2 2 1-3 3 1-2 6-2 -2 0-2 -2 0-1 -1-3 0-2 1 2-1 -8 I-1-2 -2-2 -2-2 -2-3 -2 5 2-2 2 1-2 -1 0-5 -1 4-2 -2-1 -8 L -2-3 -3-4 -6-2 -3-4 -2 2 6-3 4 2-3 -3-2 -2-1 2-3 -3-1 -8 K -1 3 1 0-5 1 0-2 0-2 -3 5 0-5 -1 0 0-3 -4-2 1 0-1 -8 M -1 0-2 -3-5 -1-2 -3-2 2 4 0 6 0-2 -2-1 -4-2 2-2 -2-1 -8 F -3-4 -3-6 -4-5 -5-5 -2 1 2-5 0 9-5 -3-3 0 7-1 -4-5 -2-8 P 1 0 0-1 -3 0-1 0 0-2 -3-1 -2-5 6 1 0-6 -5-1 -1 0-1 -8 S 1 0 1 0 0-1 0 1-1 -1-3 0-2 -3 1 2 1-2 -3-1 0 0 0-8 T 1-1 0 0-2 -1 0 0-1 0-2 0-1 -3 0 1 3-5 -3 0 0-1 0-8

Pet91 - an updated Dayhoff matrix Since the family of PAM matrices were derived from a comparatively small number of families, many of the possible mutations were not observed. Jones et al. have derived an updated matrix by examining a very large number of families, and created the PET91 scoring matrix.

Gonnet Matrices Another improvement on the PAM matrices Done when the database was larger All-against-all pairwise comparison to see all possible substitutions Optimized to find sequences that are further apart (as opposed to PAM, that started with very similar sequences)

BLOSUM Matrices Created by Henikoff & Henikoff, based on local multiple alignments of more distantly related sequences. First, multiple alignments of short regions (without gaps) of related sequences were gathered. In each alignment the sequences similar at some threshold value of percent identity were clustered into groups and averaged.

BLOSUM Matrices Substitution frequencies for all pairs of amino acids were calculated between the groups, this was used to create the log-odds BLOSUM ( Block Substitution Matrix ).

BLOSUM# - where # is the threshold identity percentage of the sequences clustered in those blocks. Thus, BLOSUM62 means that the sequences clustered in this block are at least 62% identical. This allows detection of more distantly related sequences, as it downplays the role of the more related sequences in the block when building the matrix.

BLOSUM62 MATRIX # Matrix made by matblas from blosum62.iij # * column uses minimum score # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Blocks Database = /data/blocks_5.0/blocks.dat # Cluster Percentage: >= 62 # Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4-1 -2-2 0-1 -1 0-2 -1-1 -1-1 -2-1 1 0-3 -2 0-2 -1 0-4 R -1 5 0-2 -3 1 0-2 0-3 -2 2-1 -3-2 -1-1 -3-2 -3-1 0-1 -4 N -2 0 6 1-3 0 0 0 1-3 -3 0-2 -3-2 1 0-4 -2-3 3 0-1 -4 D -2-2 1 6-3 0 2-1 -1-3 -4-1 -3-3 -1 0-1 -4-3 -3 4 1-1 -4 C 0-3 -3-3 9-3 -4-3 -3-1 -1-3 -1-2 -3-1 -1-2 -2-1 -3-3 -2-4 Q -1 1 0 0-3 5 2-2 0-3 -2 1 0-3 -1 0-1 -2-1 -2 0 3-1 -4 E -1 0 0 2-4 2 5-2 0-3 -3 1-2 -3-1 0-1 -3-2 -2 1 4-1 -4 G 0-2 0-1 -3-2 -2 6-2 -4-4 -2-3 -3-2 0-2 -2-3 -3-1 -2-1 -4 H -2 0 1-1 -3 0 0-2 8-3 -3-1 -2-1 -2-1 -2-2 2-3 0 0-1 -4 I-1-3 -3-3 -1-3 -3-4 -3 4 2-3 1 0-3 -2-1 -3-1 3-3 -3-1 -4 L -1-2 -3-4 -1-2 -3-4 -3 2 4-2 2 0-3 -2-1 -2-1 1-4 -3-1 -4 K -1 2 0-1 -3 1 1-2 -1-3 -2 5-1 -3-1 0-1 -3-2 -2 0 1-1 -4 M -1-1 -2-3 -1 0-2 -3-2 1 2-1 5 0-2 -1-1 -1-1 1-3 -1-1 -4 F -2-3 -3-3 -2-3 -3-3 -1 0 0-3 0 6-4 -2-2 1 3-1 -3-3 -1-4 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 7-1 -1-4 -3-2 -2-1 -2-4 S 1-1 1 0-1 0 0 0-1 -2-2 0-1 -2-1 4 1-3 -2-2 0 0 0-4 T 0-1 0-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5-2 -2 0-1 -1 0-4 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 11 2-3 -4-3 -2-4 Y -2-2 -2-3 -2-1 -2-3 2-1 -1-2 -1 3-3 -2-2 2 7-1 -3-2 -1-4

So which observed matrix to use??? For global alignments use PAM matrices. Lower PAM matrices tend to find short alignments of highly similar regions. Higher PAM matrices will find weaker, longer alignments. For local alignments use BLOSUM matrices. BLOSUM matrices with HIGH number, are better for similar sequences. BLOSUM matrices with LOW number, are better for distant sequences.

Tips... When doing global alignment (and database scanning) of related (similar) sequences use PAM200 or PAM250. If you don t know what to expect (e.g. for database scanning) use PAM120. For local database scanning (e.g blast), or for ungapped, local alignments, use BLOSUM62 (recommended for proteins).

Tips. In all cases it is recommended to use more than one matrix for any database scanning.

Matrices used in GCG GCG uses 2 types of matrix Native GCG matrices BLAST-formatted matrices. In most programs, the default matrix is BLOSUM62. A different matrix can be specified on the command line: -MAT=matrix_name

To view a list of available matrices for GCG: go to the gcg root directory: DAPSAS /gcgdomain/gcg/gcgcore/data/moredata INN /usr/local/gcg/gcgcore/data/moredata

Defaults: blosum50, blosum62 blosum30 blosum35 blosum40 blosum45 blosum55 blosum60 blosum65 blosum70 blosum75 blosum80 blosum85 blosum90 blosum100 pam120 pam250