Exercise 5. Sequence Profiles & BLAST


Substitution Matrix (BLOSUM62)
Likelihood of substituting one amino acid with another.
Figure taken from https://en.wikipedia.org/wiki/blosum

Substitution Matrix (BLOSUM62)
Only one score per amino acid pair.
Is this true for every amino acid in every protein, i.e. are they independent of each other?
Can we make a substitution matrix specific for a single protein? For every position of it?
Why: better alignments!?

    >p1
    SEAN

          A    C    D    E
    S     1   -1    0    0
    E    -1   -4    1    5
    A     2   -2    0    4
    N     0   -2    1    0

Position-Specific Scoring Matrix (PSSM)
What do we need?
1. Information about substitutions occurring in the protein
2. Differentiation between favored and unfavored substitutions
3. Transformation into positive and negative scores

Multiple Sequence Alignments
MSAs contain information about the evolution of proteins.
Figure taken from https://en.wikipedia.org/wiki/multiple_sequence_alignment

Position-Specific Scoring Matrix (PSSM)
Define favored substitutions as those occurring more often than expected, unfavored ones as those occurring less often than expected.
What DO we expect?
There are 20 amino acids, so P = 0.05, right? Are there better estimates?
What to do?
1. Count the amino acids occurring in each MSA column
2. Normalize to relative frequencies
3. Divide by expected (background) frequencies

Position-Specific Scoring Matrix (PSSM)

    SE-AN
    SE-ES
    SEVEN
    SE-AS

Count observed amino acids (one row per MSA column; "..." stands for the remaining amino acids):

    col    A     E     N     S     V    ...
    1      0     0     0     4     0     0
    2      0     4     0     0     0     0
    3      0     0     0     0     1     0
    4      2     2     0     0     0     0
    5      0     0     2     2     0     0

Normalize (divide by row sum):

    col    A     E     N     S     V    ...
    1      0     0     0    4/4    0     0
    2      0    4/4    0     0     0     0
    3      0     0     0     0    1/1    0
    4     2/4   2/4    0     0     0     0
    5      0     0    2/4   2/4    0     0

Divide by background frequencies (P = 0.05):

    col    A     E     N     S     V    ...
    1      0     0     0    20     0     0
    2      0    20     0     0     0     0
    3      0     0     0     0    20     0
    4     10    10     0     0     0     0
    5      0     0    10    10     0     0
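The three steps of this example (count, normalize, divide by background) can be sketched in plain Python. This is only an illustration under the slide's assumptions: gaps are ignored and the background is uniform (P = 0.05). The function name `column_ratios` is made up for this sketch and is not part of the exercise.

```python
from collections import Counter

MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]
BACKGROUND = 0.05  # uniform background frequency, as assumed on the slide


def column_ratios(msa, background=BACKGROUND):
    """Per MSA column: {amino acid: relative frequency / background}."""
    ratios = []
    for column in zip(*msa):  # iterate over MSA columns
        counts = Counter(aa for aa in column if aa != "-")  # gaps are ignored
        total = sum(counts.values())  # row sum used for normalization
        ratios.append(
            {aa: (n / total) / background for aa, n in counts.items()}
        )
    return ratios


ratios = column_ratios(MSA)
for i, row in enumerate(ratios, start=1):
    print(i, row)
```

Amino acids that never occur in a column are simply absent from the dictionaries, which corresponds to the zeros in the tables.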

Position-Specific Scoring Matrix (PSSM)
Not quite there yet! The transformation into positive/negative scores is still missing.
Use the logarithm: log(x) > 0 for x > 1, log(x) < 0 for x < 1

$S_{i,j} = 2 \log_2 \frac{f_{i,j}}{P_j}$, where
$S_{i,j}$ is the score for amino acid $j$ in MSA column $i$
$f_{i,j}$ is the relative frequency for amino acid $j$ in MSA column $i$
$P_j$ is the expected (background) frequency for amino acid $j$

Position-Specific Scoring Matrix (PSSM)

    SE-AN
    SE-ES
    SEVEN
    SE-AS

$2 \log_2(x)$:

    col     A      E      N      S      V     ...
    1     -inf   -inf   -inf     9    -inf   -inf
    2     -inf     9    -inf   -inf   -inf   -inf
    3     -inf   -inf   -inf   -inf     9    -inf
    4       7      7    -inf   -inf   -inf   -inf
    5     -inf   -inf     7      7    -inf   -inf

Note: Scores are rounded to the nearest integer.
Problem with small/homogeneous MSAs:
Some amino acids are never observed in an MSA column
Log(0) = negative infinity
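A minimal sketch of the log-score for a single cell, assuming the uniform background P = 0.05 used in the example. The helper name `log_odds_score` is hypothetical. It shows where the 9, the 7 and the negative infinity in the table come from.

```python
import math


def log_odds_score(rel_freq, background=0.05):
    """Return 2 * log2(rel_freq / background), rounded to the nearest integer."""
    if rel_freq == 0:
        return float("-inf")  # unobserved amino acid: log(0)
    return round(2 * math.log2(rel_freq / background))


print(log_odds_score(1.0))  # fully conserved column, ratio 20
print(log_odds_score(0.5))  # amino acid seen in half of the sequences, ratio 10
print(log_odds_score(0.0))  # amino acid never observed in the column
```

The explicit zero check is needed because `math.log2(0)` raises a `ValueError` instead of returning negative infinity.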

PSSM: Redistributing Gaps
So far we have ignored the gaps in the MSA. Can we do better?
How could we interpret the gaps?
See them as wildcards: since the amino acid can be missing altogether, it shouldn't matter too much what amino acid we put there.
Redistribute gaps according to the expected (background) amino acid frequencies, i.e. every gap adds $P_j$ to the count of amino acid $j$.

PSSM: Redistributing Gaps

    SE-AN
    SE-ES
    SEVEN
    SE-AS

Count observed amino acids and gaps:

    col    A     E     N     S     V    ...    -
    1      0     0     0     4     0     0     0
    2      0     4     0     0     0     0     0
    3      0     0     0     0     1     0     3
    4      2     2     0     0     0     0     0
    5      0     0     2     2     0     0     0

Then:
1. Multiply gaps by amino acid background frequencies and add to amino acid counts
2. Normalize to relative frequencies
3. Divide by background
4. Calculate log-score

Counts after redistributing the gaps:

    col    A     E     N     S     V    ...
    1      0     0     0     4     0     0
    2      0     4     0     0     0     0
    3    0.15  0.15  0.15  0.15  1.15  0.15
    4      2     2     0     0     0     0
    5      0     0     2     2     0     0

Note: This example uses uniform background frequencies (P = 0.05)
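Gap redistribution for one MSA column can be sketched as follows. The function name and the dictionary layout are assumptions made for this illustration. The background is passed in as a dictionary, so non-uniform frequencies work the same way.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def counts_with_gaps_redistributed(column, background):
    """Count amino acids in one MSA column; every gap adds P_j to amino acid j."""
    counts = {aa: 0.0 for aa in AMINO_ACIDS}
    gaps = 0
    for symbol in column:
        if symbol == "-":
            gaps += 1
        else:
            counts[symbol] += 1.0
    for aa in AMINO_ACIDS:
        counts[aa] += gaps * background[aa]  # spread the gaps over all amino acids
    return counts


uniform = {aa: 0.05 for aa in AMINO_ACIDS}  # uniform background, as in the example
column3 = counts_with_gaps_redistributed("--V-", uniform)  # third MSA column
print(round(column3["V"], 2), round(column3["A"], 2))
```

Because the background frequencies sum to 1, the total count of a column stays equal to the number of sequences.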

PSSM: Sequence weights
Are all sequences in the MSA equal? Do some provide more information than others?

    MSA 1     MSA 2
    SE-AN     SE-AN
    SEVEN     SE-AN
              SE-AN
              SE-AN
              SE-AN
              SEVEN

Does the second MSA provide additional information regarding viable substitutions?

PSSM: Sequence weights
Sequence weights: a matter of variation
Henikoff S, Henikoff JG (1994). Position-based sequence weights. J. Mol. Biol., 243, 4:574-8.
Combine MSA column variation and sequence variation:

$w_{i,k} = \frac{1}{r_i \, S_{i,k}}$, where
$w_{i,k}$ is the weight for sequence $k$ in MSA column $i$
$r_i$ is the number of different observed amino acids in MSA column $i$ (count gaps as a 21st amino acid)
$S_{i,k}$ is the number of sequences in MSA column $i$ sharing the same amino acid as sequence $k$ (including itself)

PSSM: Sequence weights

    S1: SE-AN
    S2: SE-ES
    S3: SEVEN
    S4: SE-AS

Values of $\frac{1}{r \cdot s}$ per MSA column and sequence:

    col     S1     S2     S3     S4     r
    1      1/4    1/4    1/4    1/4     1
    2      1/4    1/4    1/4    1/4     1
    3      1/6    1/6    1/2    1/6     2
    4      1/4    1/4    1/4    1/4     2
    5      1/4    1/4    1/4    1/4     2
    sum   0.67   0.67    1.0   0.67

The final weight is the sum over all positions of a sequence.
Exclude positions where r = 1.
For example: $w_{S1} = \frac{1}{6} + \frac{1}{4} + \frac{1}{4} = \frac{2}{3}$
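The weights in this table can be reproduced with a short sketch of the position-based scheme. The function name `sequence_weights` is an assumption of this illustration. Gaps are counted as a 21st symbol and columns with r = 1 are skipped, as stated above.

```python
from collections import Counter

MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]


def sequence_weights(msa):
    """Position-based weights: sum of 1 / (r_i * s_ik) over columns with r_i > 1."""
    weights = [0.0] * len(msa)
    for column in zip(*msa):
        counts = Counter(column)  # the gap symbol counts as a 21st amino acid
        r = len(counts)           # number of different symbols in this column
        if r == 1:
            continue              # fully conserved columns are excluded
        for k, symbol in enumerate(column):
            weights[k] += 1.0 / (r * counts[symbol])
    return weights


weights = sequence_weights(MSA)
print([round(w, 2) for w in weights])
```

For the example MSA this gives 2/3 for S1, S2 and S4 and 1.0 for S3, which is the only sequence with a residue in the third column.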

PSSM: Sequence weights
Adjust amino acid and gap counts by the weights of the contributing sequences.

    S1   SE-AN   0.67
    S2   SE-ES   0.67
    S3   SEVEN   1.0
    S4   SE-AS   0.67

    col     A      E      N      S      V     ...     -
    1       0      0      0    3.00     0      0      0
    2       0    3.00     0      0      0      0      0
    3       0      0      0      0    1.00     0    2.00
    4     1.33   1.67     0      0      0      0      0
    5       0      0    1.67   1.33     0      0      0

$f_{1,S} = \frac{2}{3} + \frac{2}{3} + 1 + \frac{2}{3} = 3.00$
$f_{4,E} = \frac{2}{3} + 1 = 1.67$
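A sketch of the weighted counting, taking the sequence weights of the example as given. The name `weighted_counts` and the list-of-dictionaries layout are assumptions of this illustration.

```python
from collections import defaultdict

MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]
WEIGHTS = [2 / 3, 2 / 3, 1.0, 2 / 3]  # sequence weights from the example above


def weighted_counts(msa, weights):
    """Per MSA column: {symbol: sum of the weights of the sequences showing it}."""
    table = []
    for column in zip(*msa):
        counts = defaultdict(float)
        for symbol, weight in zip(column, weights):
            counts[symbol] += weight  # gaps ("-") are counted like any symbol
        table.append(dict(counts))
    return table


f = weighted_counts(MSA, WEIGHTS)
print(round(f[0]["S"], 2), round(f[3]["E"], 2), round(f[2]["-"], 2))
```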

PSSM: Pseudocounts
Problem with small/homogeneous MSAs:
Some amino acids are never observed in an MSA column
Log(0) = negative infinity
Solution: Pseudocounts
Add an arbitrary number of counts to each amino acid
No more unobserved amino acids

PSSM: Pseudocounts
Simple example: add 1 to every amino acid (not gaps).
Let's ignore sequence weights and gap redistribution for now.

    SE-AN
    SE-ES
    SEVEN
    SE-AS

Counts plus 1:

    col     A      E      N      S      V     ...     -
    1       1      1      1      5      1      1      0
    2       1      5      1      1      1      1      0
    3       1      1      1      1      2      1      3
    4       3      3      1      1      1      1      0
    5       1      1      3      3      1      1      0

Relative frequencies:

    col     A      E      N      S      V     ...     -
    1     1/24   1/24   1/24   5/24   1/24   1/24     0
    2     1/24   5/24   1/24   1/24   1/24   1/24     0
    3     1/24   1/24   1/24   1/24   2/24   1/24   3/24
    4     3/24   3/24   1/24   1/24   1/24   1/24     0
    5     1/24   1/24   3/24   3/24   1/24   1/24     0
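The add-one variant can be sketched per column as below. The function name is hypothetical. Following the table, the gap count is left unchanged but still enters the row total, which is why every column of this MSA is divided by 24.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def add_one_frequencies(column):
    """Add 1 to every amino acid count (not to gaps) and normalize the column."""
    counts = Counter(column)
    gaps = counts.pop("-", 0)  # gaps receive no pseudocount
    smoothed = {aa: counts.get(aa, 0) + 1 for aa in AMINO_ACIDS}
    total = sum(smoothed.values()) + gaps  # gaps still count toward the row sum
    return {aa: n / total for aa, n in smoothed.items()}


col1 = add_one_frequencies("SSSS")  # first MSA column
col3 = add_one_frequencies("--V-")  # third MSA column
print(col1["S"], col3["V"])
```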

PSSM: Pseudocounts
The simple solution eliminates the Log(0) problem. Can we do better?
Once again: use background frequencies.
Every observed amino acid adds to the pseudocounts based on its substitution ratios.
Where can we get those ratios?

PSSM: Pseudocounts
Use BLOSUM62 amino acid pair frequencies.
The whole matrix is normalized to sum up to 1.0.
Each column/row sum equals the background frequency of the corresponding amino acid (the matrix is symmetric).

PSSM: Pseudocounts
Every observed* amino acid adds to the pseudocounts based on its pair frequencies, similar to redistributing gaps:

$g_{i,a} = \sum_j \frac{f_{i,j}}{P_j} \, q_{a,j}$, where
$g_{i,a}$ is the pseudocount value for amino acid $a$ in MSA column $i$
$f_{i,j}$ is the observed* frequency of amino acid $j$ in MSA column $i$
$P_j$ is the background frequency of amino acid $j$
$q_{a,j}$ is the frequency for the amino acid pair $a, j$

*adjusted by sequence weights and redistributed gaps

PSSM: Pseudocounts
For example, let's assume that
$q_{S,S} = 0.010$ for amino acid pair S, S
$q_{S,A} = 0.004$ for amino acid pair S, A
$q_{S,j} = 0.002$ for all other amino acid pairs S, j

Weighted f-matrix from the sequence-weights example (page 15 of the slides) after gap redistribution, MSA column 1:

      A      E      N      S      V     ...
      0      0      0    3.00     0      0

Pseudocounts (assuming uniform P = 0.05), e.g. $g_{1,A} = \frac{3.00}{0.05} \cdot 0.004 = 0.24$:

      A      E      N      S      V     ...
    0.24   0.12   0.12   0.60   0.12   0.12

Calculate the pseudocounts from the f-matrix, then add them:

      A      E      N      S      V     ...
    0.24   0.12   0.12   3.60   0.12   0.12
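The worked example can be checked with a small sketch of the pseudocount formula. To keep it short, it uses a reduced five-letter alphabet and only the assumed pair frequencies quoted in the example; the function name and the data layout are hypothetical as well.

```python
def pseudocounts(f_row, pair_freq, background):
    """g[a] = sum over j of f[j] / P[j] * q[a][j] for one MSA column."""
    g = {}
    for a in background:
        g[a] = sum(
            f_j / background[j] * pair_freq[a][j]
            for j, f_j in f_row.items()
            if f_j > 0  # unobserved amino acids contribute nothing
        )
    return g


LETTERS = "AENSV"  # reduced alphabet, for illustration only
background = {aa: 0.05 for aa in LETTERS}  # uniform background, as in the example

# Assumed pair frequencies from the example; only pairs with S are needed here.
pair_freq = {a: {"S": 0.002} for a in LETTERS}
pair_freq["S"]["S"] = 0.010
pair_freq["A"]["S"] = 0.004

f_row = {"S": 3.0}  # weighted counts of MSA column 1
g_row = pseudocounts(f_row, pair_freq, background)
adjusted = {a: f_row.get(a, 0.0) + g_row[a] for a in LETTERS}  # f plus pseudocounts
print({a: round(v, 2) for a, v in adjusted.items()})
```

With a full 20 x 20 pair-frequency matrix, the same sum becomes a matrix-vector product.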

PSSM: Pseudocounts
How much weight should the pseudocounts have? 50%? More? Less?
Is there some dynamic value?
The more independent observations there are in the MSA, the fewer pseudocounts are needed/wanted.
Simple estimate: average variation in the MSA columns

PSSM: Pseudocounts
Estimate the number of independent observations:

$N = \frac{1}{L} \sum_{i=1}^{L} r_i$, where
$N$ is the estimated number of independent observations
$L$ is the number of MSA columns
$r_i$ is the number of different observed amino acids in MSA column $i$ (count gaps as a 21st amino acid)
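As a quick sketch of this estimate, applied to the four-sequence example MSA from the earlier slides (the function name is an assumption of this illustration):

```python
MSA = ["SE-AN", "SE-ES", "SEVEN", "SE-AS"]


def independent_observations(msa):
    """N = average number of different symbols per MSA column (gaps included)."""
    columns = list(zip(*msa))
    return sum(len(set(column)) for column in columns) / len(columns)


n_obs = independent_observations(MSA)
print(n_obs)  # column variation is r = 1, 1, 2, 2, 2
```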

PSSM: Pseudocounts
Weight observed* amino acids against pseudocounts:

$F_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$, where
$F_i$ are the adjusted amino acid frequencies in MSA column $i$
$f_i$ are the observed* amino acid frequencies in MSA column $i$
$g_i$ are the pseudocounts for MSA column $i$
$\alpha$ is equal to $N - 1$
$\beta$ is an empirically chosen weight factor for the pseudocounts

*adjusted by sequence weights and redistributed gaps
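A sketch of the weighting step for one MSA column. The function name is hypothetical, and the value of β in the usage lines is an arbitrary placeholder, not a recommended setting.

```python
def mix_with_pseudocounts(f_row, g_row, n_independent, beta):
    """F = (alpha * f + beta * g) / (alpha + beta), with alpha = N - 1."""
    alpha = n_independent - 1
    return {
        a: (alpha * f_row.get(a, 0.0) + beta * g_row[a]) / (alpha + beta)
        for a in g_row
    }


# Column 1 of the running example: observed counts and pseudocounts.
f_row = {"S": 3.0}
g_row = {"A": 0.24, "E": 0.12, "N": 0.12, "S": 0.60, "V": 0.12}
# N = 1.6 is the estimate for the example MSA; beta = 4.0 is an arbitrary choice.
mixed = mix_with_pseudocounts(f_row, g_row, n_independent=1.6, beta=4.0)
print({a: round(v, 3) for a, v in mixed.items()})
```

Note that α + β must not be zero: for a fully conserved MSA (N = 1, so α = 0) the result consists of the pseudocounts alone, and β has to be positive.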

Position-Specific Scoring Matrix (PSSM)
Putting it all together:
1. Calculate sequence weights
2. Count (with weights) observed amino acids and gaps
3. Redistribute gaps according to background frequencies
4. Add pseudocounts according to amino acid pair frequencies
5. Normalize to relative frequencies
6. Divide by background frequencies
7. Calculate log-score
8. Remove rows corresponding to gaps in the primary sequence (here the primary sequence is the first one in the MSA)
The order of the steps is important!

Position-Specific Scoring Matrix (PSSM)

    SE-AN
    SE-ES
    SEVEN
    SE-AS

    col   A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
    1     1  -1   0   0  -3   0  -1  -3   0  -3  -2   0  -1   0  -1   4   1  -2  -3  -2
    2    -1  -4   1   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -3  -3  -2
    3     0   0  -1  -1   0  -1  -1   1  -1   0   0  -1  -1  -1  -1   0   0   2  -1   0
    4     2  -2   0   4  -3  -1  -1  -2   0  -2  -2  -1  -1   1  -1   0  -1  -1  -3  -2
    5     0  -2   1   0  -3  -1   0  -3   0  -3  -2   5  -2   0  -1   3   0  -2  -3  -2

Position-Specific Scoring Matrix (PSSM)
After all that: WHY do we want PSSMs (again)?
PSSMs help to improve alignments (local and global)
    Use PSSM scores instead of, for example, BLOSUM62
    You can even align two PSSMs
PSSMs condense information about the evolution of a protein
    Conserved positions are easy to spot
    Important input feature for many prediction methods
PSSMs help to find protein homologs in databases

BLAST
Basic Local Alignment Search Tool (BLAST)
Searches databases for similar protein/nucleotide sequences
Scores hits based on local alignments and score matrices (default: BLOSUM62 for proteins)
High speed due to using seeds for hit determination

BLAST
What are BLAST seeds?
Short sequences (3-grams for proteins) that have a high pairwise score to the query sequence (based on the scoring matrix)

    Query: SEQWENCE
    Seeds: EQW = 5 + 5 + 11 = 21
           WQN = 11 + 2 + 6 = 19
           NCQ = 6 + 9 + 2 = 17

Analogous for seed vs. PSSM:
Use the rows of the PSSM as sequence positions
Use the corresponding amino acid column of the PSSM as the score
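Seed generation against a PSSM can be sketched by brute force: enumerate every word of the seed length and keep those that reach a minimum score at some position. The function name, the tuple layout, the three-letter alphabet and the toy PSSM values below are all assumptions made for this illustration.

```python
from itertools import product


def pssm_seeds(pssm, alphabet, min_score, word_size=3):
    """Return (position, word, score) for all words scoring at least min_score."""
    seeds = []
    for start in range(len(pssm) - word_size + 1):
        rows = pssm[start:start + word_size]  # PSSM rows act as query positions
        for letters in product(alphabet, repeat=word_size):
            # The score of a word is the sum of its letters' column entries.
            score = sum(row[aa] for row, aa in zip(rows, letters))
            if score >= min_score:
                seeds.append((start, "".join(letters), score))
    return seeds


# Toy PSSM with made-up scores over a three-letter alphabet.
toy_pssm = [
    {"A": 1, "E": 0, "S": 4},
    {"A": -1, "E": 5, "S": 0},
    {"A": 2, "E": 4, "S": 0},
]
seeds = pssm_seeds(toy_pssm, "AES", min_score=11)
print(seeds)
```

With the full alphabet there are 20^3 = 8000 candidate words per position, so exhaustive enumeration stays cheap for word size 3.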

BLAST
Search algorithm:
1. Find seeds in the (indexed) sequence database
2. Extend alignments with sequences that contain two or more seeds (dynamic programming)
3. Keep high-scoring (local) alignments

PSI-BLAST
Iterative BLAST
1. Use BLOSUM62 scores for the first search against the database
2. Build a PSSM based on high-scoring hits
3. Search again using the PSSM
4. Repeat steps 2 & 3 for a specified number of times
Can find more distantly related protein sequences
But: false hits can pollute the PSSM

Homework
Compute several variants of PSSMs from an MSA
    From basic to complex PSSMs
    Carefully read the different steps again
    Try to be efficient in calculating the arrays and matrices
    Re-use variables and methods
    Use numpy arrays and built-in features
Generate a list of BLAST seeds for a PSSM
    PSSM and minimum score will be provided via parameter