PAM-1 Matrix 10,000. From: Ala Arg Asn Asp Cys Gln Glu To:

PAM-1 Matrix (values × 10,000)

To \ From    Ala    Arg    Asn    Asp    Cys    Gln    Glu
Ala         9867      2      9     10      3      8     17
Arg            1   9913      1      0      1     10      0
Asn            4      1   9822     36      0      4      6
Asp            6      0     42   9859      0      6     53
Cys            1      1      0      0   9973      0      0
Gln            3      9      4      5      0   9876     27
Glu           10      0      7     56      0     35   9865

PAM1 is the expectation after approximately 1% of the sequence has been substituted.
PAM2 is calculated as PAM1 × PAM1.
PAMx is calculated as PAM(x-1) × PAM1.
PAM250 is generally used for distant comparisons. It corresponds to 2.5 differences per site (about 20% identity).
NOTE: These measure divergence, not time.
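The repeated multiplication can be sketched in a few lines of Python. This is a minimal illustration only: it assumes a made-up 2-state transition matrix in place of the real 20 × 20 PAM1 matrix, and `mat_mul` and `pam_power` are hypothetical helper names, not part of any library.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_power(pam1, x):
    """Return PAMx = PAM(x-1) * PAM1, built up by repeated multiplication."""
    result = pam1
    for _ in range(x - 1):
        result = mat_mul(result, pam1)
    return result


# Toy 2-state matrix (an assumption for illustration, not real PAM data):
# 99% of the probability mass stays on the diagonal, as in PAM1.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam2 = pam_power(toy_pam1, 2)
toy_pam250 = pam_power(toy_pam1, 250)
print(toy_pam2[0])
print(toy_pam250[0])
```

In this toy case the diagonal of the 250th power is already close to 0.5, which illustrates why a high-numbered PAM matrix describes very divergent sequences.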

PAM-250 Matrix (values × 100)

To \ From    Ala    Arg    Asn    Asp    Cys    Gln    Glu
Ala           13      6      9      9      5      8      9
Arg            3     17      4      3      2      5      3
Asn            4      4      6      7      2      5      6
Asp            5      4      8     11      1      7     10
Cys            2      1      1      1     52      1      1
Gln            3      5      5      6      1     10      7
Glu            5      4      7     11      1      9     12

PAM scoring matrix. The scoring values are generally shown as a symmetric log odds ratio matrix. Odds (for those who do not gamble) are $p / (1 - p)$, where $p$ is the probability of an event and $1 - p$ is the probability that it does not occur. For example, if $p = 0.5$ then the odds are 50/50, or 1 to 1 ($0.5 / 0.5 = 1$), while if $p = 0.75$ then the odds are 3 to 1 ($0.75 / 0.25 = 3$). The odds ratio is the ratio of the odds for and against.

PAM scoring matrix. Generally the odds are presented as log values. For PAM matrices it is generally $\log_{10}$ that is used, so each integer value represents an order of magnitude. For example, if $p = 0.08$, the odds are $0.08 / 0.92 = 0.087$ (about 11.5 to 1 against) and the log odds are $\log_{10}(0.087) = -1.06$, while if $p = 0.996$, the odds are $0.996 / 0.004 = 249$ (orders of magnitude larger and in the opposite direction) and the log odds are $\log_{10}(249) = +2.40$.
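The arithmetic in these examples can be checked with a short Python sketch. The function names `odds` and `log10_odds` are hypothetical, chosen only for this illustration.

```python
import math


def odds(p):
    """Odds in favour of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)


def log10_odds(p):
    """Base-10 logarithm of the odds."""
    return math.log10(odds(p))


for p in (0.5, 0.75, 0.08, 0.996):
    print(f"p = {p}: odds = {odds(p):.3f}, log10 odds = {log10_odds(p):+.2f}")
```

Probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds, which is the sign convention used for the scoring matrices.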

For a PAM scoring matrix,

$$S_{ij} = \log \frac{p_i M_{ij}}{p_i p_j} = \log \frac{M_{ij}}{p_j} = \log \frac{\text{observed frequency}}{\text{expected frequency}}$$

This matrix will be symmetric.
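As a rough sketch of this log(observed/expected) score, the following Python snippet uses made-up numbers rather than real PAM values. The function name `log_odds_score` and the default scale of 10 (matching the "values multiplied by 10" convention of the PAM250 table) are assumptions of this example.

```python
import math


def log_odds_score(observed, expected, scale=10):
    """Scaled log10 of observed over expected frequency, rounded to an integer."""
    return round(scale * math.log10(observed / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.05, 0.05))  # observed equals expected
print(log_odds_score(0.20, 0.05))  # observed four times more often than expected
print(log_odds_score(0.01, 0.05))  # observed five times less often than expected
```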

[Table: PAM250 log-odds scoring matrix, lower triangle, 20 × 20 amino acids. The minus signs and row/column labels of the entries did not survive extraction.] Values multiplied by 10.

A log odds of zero implies that the two amino acids are found across from each other in an alignment as often as expected by chance (given their mutabilities and frequencies of occurrence).
A log odds greater than zero implies that the two amino acids are found across from each other more often than expected by chance.
A log odds less than zero implies that the two amino acids are found across from each other less often than expected by chance.

Two uses for PAM matrices:
Scoring matrix: PAM250 (very distant), PAM160 (distant), PAM70 (less distant), PAM30 (more similar), etc.
Transition matrix: PAM1

PAM - strange (?) patterns
Lots of interesting properties.
Many exchanges between amino acids [illegible] and [illegible].
Far more double codon substitutions than expected.
Fewer of some single codon substitutions; e.g. [illegible] and [illegible].

PAM - scoring an amino acid alignment
Consider an alignment...
Seq1    [residues illegible]
Seq2    [residues illegible]
PAM250  12  5  2  -3
Total score is 12 + 5 + 2 - 3 = 16. The chance of getting an alignment this good by chance is given by the odds. Normally one would multiply the odds at each site (assuming independence), but since logs have been taken we can add the log odds. Because the scores are multiplied by 10, the total of 16 is a $\log_{10}$ odds of 1.6, which corresponds to odds of 39.8. So this is an unusual similarity between these two peptides despite their short length (in large part due to rare cysteines across from each other).
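The calculation in this example can be reproduced directly. The sketch below assumes only the four per-column scores quoted in the slide (the residues themselves are not used) and the ×10 scaling of the PAM250 table.

```python
# Per-column PAM250 scores quoted in the example (log10 odds x 10).
column_scores = [12, 5, 2, -3]

total = sum(column_scores)   # adding log odds is equivalent to multiplying odds
log10_odds = total / 10      # undo the x10 scaling of the score table
odds = 10 ** log10_odds      # convert back from log odds to odds

print(total, log10_odds, round(odds, 1))
```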

The PAM matrix was computed on globular proteins and may therefore not be a good representation of the substitution matrix for membrane or other non-globular proteins.
It assumes that all sites are equally mutable (but not all residues).
Only a limited number of proteins were available in comparison to the huge numbers today.

The JTT matrix (Jones, Taylor, Thornton 1992) was an update of the PAM matrix. It is mostly used as a transition matrix rather than as a scoring matrix (for the latter purpose PAM250 still seems the method of choice).

BLOSUM matrix of [illegible]
BLOSUM: BLOcks SUbstitution Matrix.
Based on the analysis of conserved protein regions from the BLOCKS database.
More reliable than the PAM matrix for distantly related proteins.
Default for BLAST searches.
Used in many other programs including [illegible].

BLOSUM matrix
1. Find the frequency of occurrence of one amino acid: $p_i = q_{ii} + \sum_{j \neq i} q_{ij} / 2$
2. Expected frequencies: $e_{ij} = p_i^2$ if $i = j$; $e_{ij} = 2 p_i p_j$ if $i \neq j$
3. Score: $s_{ij} = 2 \log_2 (q_{ij} / e_{ij})$

The matrix consists of the scores $s_{ij} = 2 \log_2 (q_{ij} / e_{ij})$.
If the observed number of differences between a pair of amino acids is equal to the expected number, then $s_{ij} = 0$.
If the observed is less than expected, then $s_{ij} < 0$.
If the observed is greater than expected, then $s_{ij} > 0$.
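The three steps can be sketched in Python as follows. This is a minimal illustration on a hypothetical two-letter alphabet with made-up pair frequencies, not BLOCKS data; the function name `blosum_scores` and the dictionary layout are assumptions, and every pair is assumed to have a non-zero observed frequency.

```python
import math


def blosum_scores(q):
    """Compute s_ij = 2 * log2(q_ij / e_ij) from observed pair frequencies.

    q maps an unordered pair (a, b) to its observed frequency; the
    frequencies are assumed to sum to 1 and to be non-zero.
    """
    letters = sorted({x for pair in q for x in pair})
    # Step 1: p_i = q_ii + (sum over j != i of q_ij) / 2
    p = {}
    for i in letters:
        off_diag = sum(f for (a, b), f in q.items() if a != b and i in (a, b))
        p[i] = q.get((i, i), 0.0) + off_diag / 2
    scores = {}
    for (a, b), f in q.items():
        # Step 2: expected frequency of the pair
        e = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        # Step 3: log-odds score in half-bit units, rounded to an integer
        scores[(a, b)] = round(2 * math.log2(f / e))
    return scores


# Toy two-letter alphabet (hypothetical frequencies).
toy_q = {("A", "A"): 0.4, ("A", "B"): 0.2, ("B", "B"): 0.4}
print(blosum_scores(toy_q))
```

In the toy data identical pairs are over-represented relative to expectation and score positive, while the mismatch is under-represented and scores negative.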

BLOSUM matrix

     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   6
D   -3   0  -1  -1  -2  -1   1   6
E   -4   0  -1  -1  -1  -2   0   2   5
Q   -3   0  -1  -1  -1  -2   0   0   2   5
H   -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R   -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K   -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M   -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I   -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L   -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V   -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F   -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y   -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W   -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11

The lower left gives the log odds matrix (BLOSUM62).

BLOSUM matrix. The BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, but more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches, than a PAM matrix of similar level.

BLOSUM matrix. This is a BLOSUM62 matrix. It is roughly equivalent to a PAM160 matrix. The levels come from weighting different entries. In this case all proteins within 62% identity sum to a weight of 1.

NCBI's recommendations

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (11,1)

Empirical measures still seem to work best despite many advances.
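These recommendations amount to a simple lookup on query length, sketched below in Python. The function name `recommended_scoring` is hypothetical, the gap costs are read as (existence, extension), and because the listed ranges share the endpoints 50 and 85, assigning those endpoints to the lower range is an assumption of this sketch rather than something the table states.

```python
def recommended_scoring(query_length):
    """Return (substitution matrix, (gap existence, gap extension)).

    The table gives overlapping ranges (35-50, 50-85); putting the shared
    endpoints 50 and 85 in the lower range is an assumption of this sketch.
    """
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


for length in (20, 40, 60, 300):
    print(length, recommended_scoring(length))
```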

GONNET matrix
Uses classical distance measures to produce protein alignments.
Given the alignments, it computes a new distance matrix.
Align again using the new distance matrix.
Repeat this process many times.

In addition, they computed empirical measures for gap penalties. They suggest, for the probability $P$ of a gap of length $k$,

$$10 \ln(P) = -36.31 + 7.44 \ln(\text{PAM distance}) - 14.93 \ln(k)$$

If a distance is not available,

$$10 \ln(P) = -20.63 - 1.65 \ln(k - 1)$$

GONNET matrix
[Table: GONNET log-odds matrix, lower triangle, 20 × 20 amino acids. The minus signs and row/column labels of the entries did not survive extraction.]
The log odds matrix is lower left. It is 10 times the log of the probability that these amino acids are aligned divided by the probability of a chance alignment.

Specialized matrices
Some matrices also incorporate additional information: the [illegible] matrix includes information about protein structure and can be used with very distantly related sequences.
Other matrices are specific for different types of proteins: SLIM (ScoreMatrix Leading to Intra-Membrane) and PHAT (Predicted Hydrophobic and Transmembrane matrix) are designed from/for membrane proteins (not soluble proteins).
As of 2006, 94 matrices in GenomeNet.