Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Informal inductive proof of best alignment path

Consider the last step in the best alignment path to node a. This path must come from one of three neighboring nodes, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node a by three possible paths: an A-B match, a gap in sequence A, or a gap in sequence B.

(Figure: a patch of the alignment grid, seq A on one axis and seq B on the other. Node X reaches a by a match step; nodes Y and Z reach a by gap steps.)

The best-scoring path to a is the maximum of:
- X + match
- Y + gap
- Z + gap

But the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, and so on. Inductively, QED.
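The inductive argument above translates directly into a table-filling loop. The following is a minimal sketch, not the lecture's own code: the function names, the toy scoring function (+1 match, -1 mismatch), and the linear gap penalty are assumptions made for illustration.

```python
def global_alignment_score(seq_a, seq_b, score, gap):
    """Best global alignment score, filled in row by row."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]
    # The first row and column can only be reached by gap steps.
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # X + match
                F[i - 1][j] + gap,                                    # Y + gap
                F[i][j - 1] + gap,                                    # Z + gap
            )
    return F[-1][-1]


def simple_score(a, b):
    # Hypothetical toy scores, not from the lecture: +1 match, -1 mismatch.
    return 1 if a == b else -1
```

Each cell only looks at its three upstream neighbors, which is exactly the step the proof relies on: once those three cells hold best scores, so does the current one.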

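Local alignment

Score matrix s(x, y), with gap penalty d = -5:

       A    C    G    T
  A    2   -7   -5   -7
  C   -7    2   -7   -5
  G   -5   -7    2   -7
  T   -7   -5   -7    2

Recurrence:

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) + d \\ F(i,j-1) + d \end{cases}$$

Filled matrix:

            A    A    G
       0    0    0    0
  A    0    2    2    0
  G    0    0    0    4
  C    0    0    0    0

(no arrow means no preceding alignment)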

Local alignment

Two differences from global alignment:
- If a score is negative, replace with 0.
- Traceback from the highest score in the matrix and continue until you reach 0.

Global alignment algorithm: Needleman-Wunsch
Local alignment algorithm: Smith-Waterman
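The two rules above can be sketched in a few lines. This is an illustrative implementation, not the lecture's code: it uses the nucleotide score matrix and d = -5 from the local alignment example, and it assumes the example sequences are AGC (rows) and AAG (columns). All function and variable names are made up for the sketch.

```python
NUCLEOTIDES = "ACGT"
SCORES = [
    [2, -7, -5, -7],   # A
    [-7, 2, -7, -5],   # C
    [-5, -7, 2, -7],   # G
    [-7, -5, -7, 2],   # T
]


def s(a, b):
    """Substitution score for nucleotides a and b."""
    return SCORES[NUCLEOTIDES.index(a)][NUCLEOTIDES.index(b)]


def local_alignment(x, y, d=-5):
    """Fill the local alignment matrix and trace back from its highest cell."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                0,                                        # negative scores become 0
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # match or mismatch
                F[i - 1][j] + d,                          # gap in y
                F[i][j - 1] + d,                          # gap in x
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback: start at the highest score, stop when a 0 is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, best, "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `local_alignment("AGC", "AAG")` gives a best score of 4 and the local alignment AG / AG, with every other cell of the matrix at 0 or 2.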

Protein score matrices

- DNA score matrices are much simpler (and are conceptually similar).
- Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.
- All possible amino acid changes are represented (matrix of size at least 20 x 20).
- Most commonly used are several different BLOSUM matrices, derived for different degrees of evolutionary divergence.

BLOSUM62 Score Matrix

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62

(Figure: the matrix covers the regular 20 amino acids, plus ambiguity codes and stop.)

Amino acid structures

(Figure: residues are grouped as hydrophobic, polar, or charged. Residue shown: phenylalanine (F).)

BLOSUM62 Score Matrix

- Good scores: chemically similar
- Bad scores: chemically dissimilar

Amino acid structures

(Figure: chemical structures, grouped as hydrophobic, polar, or charged. Residues shown: glycine (G), cysteine (C), histidine (H), alanine (A), serine (S), valine (V), lysine (K), threonine (T), arginine (R), leucine (L), tyrosine (Y), isoleucine (I), aspartate (D), asparagine (N), glutamate (E), methionine (M), glutamine (Q), proline (P), tryptophan (W).)

Deriving BLOSUM scores

- Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).
- Measure how often various amino acid pairs occur in the alignments.
- Normalize this to the expected frequency of such pairs randomly in the same set of alignments.
- Derive a log-odds score (often in half bits).

Example of alignment block

(Figure: one block, 31 amino acids (columns) by 61 sequences (rows).)

- Thousands of such blocks go into computing a single BLOSUM matrix.
- Represent full diversity of sequences.
- Results are summed over all columns of all blocks.

Pair frequency vs expectation

Actual aligned pair frequency:

$$q_{ij} = \frac{1}{T}\, c_{ij}$$

where $c_{ij}$ is the count of ij pairs and $T$ is the total pair count.

Randomly expected pair frequency:

$$e_{aa} = p_a p_a, \qquad e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b$$

where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues a and b.

Sample column from a multiple alignment, containing four D, one E, and one N: 6 D-D pairs, 4 D-E pairs, 4 D-N pairs, 1 E-N pair, etc.

A multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number N(N-1)/2.
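The pair counts for the sample column can be checked with a short sketch. The column string "DEDNDD" is an assumption (the counts do not depend on residue order), and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs over all pairs of rows in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


def observed_pair_frequencies(counts):
    """q_ij = c_ij / T, where T is the total pair count."""
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def expected_pair_frequency(p, a, b):
    """e_aa = p_a * p_a for identical residues, e_ab = 2 * p_a * p_b otherwise."""
    return p[a] * p[a] if a == b else 2 * p[a] * p[b]


# Assumed sample column: four D, one E, one N.
counts = column_pair_counts("DEDNDD")
```

For six rows there are 6 x 5 / 2 = 15 pairs in total, which matches the sum 6 + 4 + 4 + 1. In a real derivation the counts would be accumulated over every column of every block before dividing by T.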

Log-odds score calculation

$$s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$$

(so adding scores == multiplying probabilities)

For computational speed, scores are often rounded to the nearest integer, and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a half-bit score:

$$\text{matrix score (rounded)} = 2 \log_2 \frac{q_{ij}}{e_{ij}}$$
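A small sketch of the log-odds calculation follows. The function names and the example frequencies are hypothetical and chosen only to make the arithmetic easy to follow. Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the rounding used to build published matrices.

```python
import math


def log_odds_bits(q_ij, e_ij):
    """Log-odds score in bits: log2 of observed over expected pair frequency."""
    return math.log2(q_ij / e_ij)


def half_bit_score(q_ij, e_ij):
    """Matrix entry in half-bit units: multiplied by 2 first, then rounded."""
    return round(2 * log_odds_bits(q_ij, e_ij))


# Hypothetical (q, e) values for two aligned columns, for illustration only.
columns = [(0.02, 0.005), (0.003, 0.006)]
summed_bits = sum(log_odds_bits(q, e) for q, e in columns)
product_of_odds = math.prod(q / e for q, e in columns)
```

Here `summed_bits` equals `math.log2(product_of_odds)` up to floating-point error, which is the sense in which adding scores corresponds to multiplying probabilities.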

BLOSUM62 matrix (half-bit scores)

C-C score: 9 half-bits = 4.5 bits

Frequency of residue C over all proteins: 0.0162 (you have to look this up)

Reverse calculation of aligned C-C pair frequency in BLOSUM data set:

$$\frac{q_{CC}}{e_{CC}} = 2^{4.5} = 22.63$$

$$e_{CC} = 0.0162 \times 0.0162 = 0.000262$$

thus

$$q_{CC} = 22.63 \times 0.000262 \approx 0.00594$$
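The reverse calculation can be reproduced numerically. This is a sketch with a hypothetical function name. Because matrix entries are rounded, the recovered frequency is only approximate.

```python
def pair_frequency_from_score(half_bits, expected):
    """Invert score = 2 * log2(q / e): q = 2 ** (score / 2) * e."""
    return 2 ** (half_bits / 2) * expected


p_c = 0.0162                              # frequency of C, from the slide
e_cc = p_c * p_c                          # expected C-C pair frequency
q_cc = pair_frequency_from_score(9, e_cc)  # C-C score of 9 half-bits
```

With these inputs `e_cc` is about 0.000262 and `q_cc` is about 0.00594, in line with the slide.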

Constructing Blocks

1. Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.
2. Cluster the members of each block according to their percent identity.
3. Make pair counts and score matrix from a large collection of similarly clustered blocks.

Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).
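Step 2 can be illustrated with a minimal sketch. The slide does not say which linkage rule is used, so this assumes single linkage (a sequence joins a cluster if it reaches the cutoff against any member). The function names and the toy block are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two ungapped, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("block members must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, cutoff):
    """Group block members by single linkage at the given percent identity cutoff."""
    clusters = []
    for seq in block:
        linked, unlinked = [], []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= cutoff for member in cluster):
                linked.append(cluster)
            else:
                unlinked.append(cluster)
        # Merge the new sequence with every cluster it is linked to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = unlinked + [merged]
    return clusters


# Toy block of four 4-residue sequences, clustered at a 75% cutoff.
clusters = cluster_block(["AAAA", "AAAT", "AATT", "GGGG"], 75)
```

In the toy block, AATT is only 50% identical to AAAA but still ends up in the same cluster, because it reaches 75% against AAAT. That chaining is the defining behavior of single linkage.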

Probabilistic Interpretation of Scores (ungapped)

$$\text{matrix score (rounded)} = 2 \log_2 \frac{q_{ij}}{e_{ij}} \quad \text{(BLOSUM62)}$$

By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score.

FIAP
FLSP

This alignment has a score of 16 (6+2+1+7) by BLOSUM62, meaning an alignment with this score or more is $2^8$ (256) times more likely to be seen in a real alignment than in a random alignment.

VHRDLKPENLLLASK
VHRDLKPENLLLASK
(4+8+5+6+4+5+7+5+6+4+4+4+4+4+5)

This 15 amino acid alignment has a score of 75, meaning that it is ~$10^{11}$ times more likely to be seen in a real alignment than in a random alignment(!!)
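The conversion from a half-bit alignment score back to an odds ratio is a one-liner. The sketch below uses the per-column scores listed on the slide; the function name is hypothetical.

```python
def alignment_odds(half_bit_scores):
    """Sum per-column half-bit scores and convert the total to an odds ratio."""
    total = sum(half_bit_scores)
    return total, 2 ** (total / 2)   # two half-bits make one bit


short_total, short_odds = alignment_odds([6, 2, 1, 7])
long_total, long_odds = alignment_odds(
    [4, 8, 5, 6, 4, 5, 7, 5, 6, 4, 4, 4, 4, 4, 5]
)
```

The first alignment totals 16 half-bits, giving odds of 256. The second totals 75 half-bits, giving odds of roughly 1.9 x 10^11.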

Randomly Distributed Gaps

If $p_g = k$ (probability of a gap at each position in the sequence), then

$$P(g_1) = k, \quad P(g_2) = k^2, \quad \ldots, \quad P(g_n) = k^n$$

where $g_n$ is a gap of length n.

[Note: the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear.]
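The linearity claim can be checked numerically. This sketch uses a hypothetical per-position gap probability of 0.1 and hypothetical function names. It follows the slide's simplified model, P(g_n) = k^n.

```python
import math


def gap_length_probability(k, n):
    """P(g_n) = k ** n: n consecutive gap positions, each with probability k."""
    return k ** n


def log10_gap_probabilities(k, max_length):
    """log10 of P(g_n) for n = 1 .. max_length."""
    return [math.log10(gap_length_probability(k, n)) for n in range(1, max_length + 1)]


k = 0.1  # hypothetical gap probability, for illustration only
logs = log10_gap_probabilities(k, 5)
# Successive differences are the slope of the log-linear plot.
slopes = [b - a for a, b in zip(logs, logs[1:])]
```

Every entry of `slopes` equals log10(k) up to floating-point error, so the points fall on a straight line whose slope depends only on k.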

(Figure: distribution of alignment gap lengths in a large set of structurally aligned proteins, shown as a log-linear plot.)

Summary

- How a score matrix is derived
- What the scores mean probabilistically
- Why gap penalties should be affine
- How to use scores in dynamic programming