Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

How can I represent thousands of homologous sequences in a compact way?

Families and superfamilies: the weird shape of sequence space.
Family: short evolutionary distance, visible by pairwise comparison. Superfamily: long evolutionary distance, visible by profile-based methods.
At short sequence distances, all sequences are homologs; at longer distances, not all are. Homologs are clustered, and the clusters may be separated by regions of non-homologous sequence space. Even very, very distant sequences may be homologous. To model a whole family or superfamily, we need to sample sequence space.
A family is the set of true homologs that can see each other in a database search. A superfamily is a group of distant homologs that cannot easily be found by sequence searches.

Redundancy problem. If we submit one sequence (for example, citrate synthase from human) to the GenBank database (using BLAST, for example), take 100 results, and build a cladogram from them, we might get something like the tree on the slide. What is our representative going to look like if we use the rule "one sequence, one vote"?
(Figure: cladogram with leaves labeled primates, rabbit, rat, E. coli, lawyer.)

Sequence weighting corrects for uneven sampling. To build a representative model we can either
(1) throw out all redundant sequences and keep only representatives of each clade, or
(2) apply a weight to each sequence reflecting how non-redundant that sequence is.
One measure of non-redundancy is sequence distance, or evolutionary distance.

Crude weights from a cladogram. Simplest weighting scheme: start with weight = 1.0 at the common ancestor of the tree, then split the weight evenly at each node. Human sequences are 10/18 of the tree, but receive only 0.125 of the weight.
(Figure: cladogram annotated with node weights, from 1.000 at the root through 0.500, 0.250, 0.125, 0.0625, 0.031 and 0.016 down to 0.008 at the deepest primate leaves; leaves labeled primates, rabbit, rat, lawyer, E. coli.)
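The even-split rule can be sketched in a few lines of Python. This is only an illustration: the nested-tuple tree is a made-up toy topology, not the tree on the slide, and `cladogram_weights` is a hypothetical helper name.

```python
def cladogram_weights(tree, weight=1.0):
    """Split `weight` evenly among the children of each node; leaves are strings."""
    if isinstance(tree, str):          # a leaf keeps whatever weight reaches it
        return {tree: weight}
    weights = {}
    share = weight / len(tree)         # even split at this node
    for child in tree:
        weights.update(cladogram_weights(child, share))
    return weights

# Toy tree (assumed for illustration): four closely related primates versus E. coli.
toy_tree = ((("human", "chimp"), ("gorilla", "orangutan")), "E. coli")
weights = cladogram_weights(toy_tree)
for name, w in weights.items():
    print(f"{name:10s} {w:.3f}")
```

In this toy tree the four primates together carry the same total weight as the single E. coli sequence, which is the point of the scheme: heavily sampled clades no longer dominate the vote.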

What part of the MSA measures time the best?
(Figure: alignment with regions labeled "hot spot, fast evolving" and "slow evolving".)

Pruning and trimming to remove hotspots. Red boxes show blocks that are representative of the family: useful for remote homolog detection, and useful for phylogenetics if changes are sufficient. Yellow boxes show hot spots of mutation: useful for near-homolog phylogenetic analysis, not for remote homolog detection.

In-class exercise: make a representative profile.
(1) Select a sequence.
(2) Find all near homologs using NCBI BLAST. Keep 200.
(3) Make a representative MSA using MUSCLE as follows (in shorthand): Align... Trim... Align... Prune... Align... Prune... (Calculate distances... Prune... Align...), repeating the last group n times.
(4) Calculate a profile (you must trim all gaps first).

Another exercise: re-aligning within an MSA.
Select a region, then use MUSCLE to align that column range.
All gaps in an internal segment are internal gaps, not end gaps.
Sometimes useful for resolving the speciation order.

Better weights from a phylogram. The sequence weight is calculated starting from the distance from the taxon to its first ancestor node, adding half of the distance from the first ancestor to the second ancestor, one quarter of the distance from the second to the third ancestor, and so on. Finally, the weights are normalized to sum to 1.00.
Example (branch lengths: A = 0.2, B = 0.1, branch above the common ancestor of A and B = 0.3, C = 0.5):
w_A = 0.2 + 0.3/2 = 0.35
w_B = 0.1 + 0.3/2 = 0.25
w_C = 0.5
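A minimal sketch of this halving rule, assuming each taxon is described by the list of branch lengths on its path from leaf to root (a simplification chosen here for brevity; `phylogram_weights` and the `paths` layout are hypothetical, not part of the lecture):

```python
def phylogram_weights(paths):
    """paths maps each taxon to its branch lengths, ordered from leaf toward root."""
    raw = {
        taxon: sum(length / 2 ** k for k, length in enumerate(branches))
        for taxon, branches in paths.items()
    }
    total = sum(raw.values())
    return raw, {taxon: w / total for taxon, w in raw.items()}

# Branch lengths from the three-taxon example: A and B share a 0.3 branch.
paths = {"A": [0.2, 0.3], "B": [0.1, 0.3], "C": [0.5]}
raw, normalized = phylogram_weights(paths)
print(raw)
print(normalized)
```

The raw values reproduce the 0.35, 0.25 and 0.5 of the example; dividing by their sum (1.1) gives the normalized weights.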

Distance-based weights. Self-consistent weights, method of Sander & Schneider, 1994.
Distance matrix D:
      A     B     C
A     -     0.3   1.0
B     0.3   -     0.9
C     1.0   0.9   -
(1) Sum the weighted distances to get new weights.
(2) Normalize the new weights.
(3) Repeat (1) and (2) until no change.
Pseudocode:
    all w_i initialized to 1
    repeat
        for i from A to C do
            w'_i = Σ_j w_j D_ij
        end do
        for i from A to C do
            w_i = w'_i / Σ_j w'_j
        end do
    until the w_i no longer change

Distance-based weights: running the pseudocode with D(A,B) = 0.3, D(A,C) = 1.0, D(B,C) = 0.9.
(1) Sum the weighted distances to get new weights (all w start at 1):
w'_A = 0.3 + 1.0 = 1.3
w'_B = 0.3 + 0.9 = 1.2
w'_C = 1.0 + 0.9 = 1.9
(2) Normalize the new weights:
w_A = 1.3/(1.3 + 1.2 + 1.9) = 0.30
w_B = 1.2/4.4 = 0.27
w_C = 1.9/4.4 = 0.43
(1) Sum the weighted distances again:
w'_A = 0.3*0.27 + 1.0*0.43 = 0.51
w'_B = 0.3*0.30 + 0.9*0.43 = 0.48
w'_C = 1.0*0.30 + 0.9*0.27 = 0.54
(3) Repeat (1) and (2) until no change. Successive values of (w_A, w_B, w_C) as listed on the slide:
0.33 0.31 0.35
0.30 0.28 0.42
0.31 0.29 0.40
0.30 0.28 0.41
0.30 0.28 0.41 (converged)
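The iteration can be sketched in Python as below. This is a plain reading of the pseudocode under stated assumptions, not necessarily the exact procedure behind the slide: the tolerance `tol`, the cap `max_iter` and the function name are choices made here. Run as written, this undamped version oscillates and settles near (0.32, 0.30, 0.39), somewhat different from the final values listed on the slide, whose later steps may reflect rounding or a step that is not shown.

```python
def self_consistent_weights(dist, tol=1e-9, max_iter=1000):
    """Iterate w_i <- sum_j w_j * D_ij, normalize, and stop when w stops changing."""
    n = len(dist)
    w = [1.0] * n
    for _ in range(max_iter):
        raw = [sum(w[j] * dist[i][j] for j in range(n)) for i in range(n)]
        total = sum(raw)
        new_w = [r / total for r in raw]
        if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w
        w = new_w
    return w

# Distance matrix from the example: D(A,B)=0.3, D(A,C)=1.0, D(B,C)=0.9.
D = [[0.0, 0.3, 1.0],
     [0.3, 0.0, 0.9],
     [1.0, 0.9, 0.0]]
weights = self_consistent_weights(D)
print([round(x, 2) for x in weights])
```

Either way the ordering is the same: C, the most distant sequence, receives the largest weight, and A and B, which are close to each other, share the rest.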

Amino acid probability profiles. An amino acid profile is defined as a set of probability distributions over the 20 amino acids: one probability density function (PDF) for each position in the alignment. Gap probabilities are usually not included.
$P(a \mid i) = \dfrac{\sum_{S : S_i = a} w_S}{\sum_S w_S}$
The probability of amino acid a at position i is the sum of the sequence weights $w_S$ over all sequences S such that the amino acid at position i of that sequence, $S_i$, is a, divided by the sum of the sequence weights $w_S$ over all sequences S.
Example column (residues G, D, D, T, S, each sequence with its own weight w): P(G|i) = 0.11, P(D|i) = 0.5, P(T|i) = 0.22, P(S|i) = 0.28.
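As a sketch of this calculation for a single column: the sequence weights below are invented for illustration (the slide's weights are not recoverable), `column_profile` is a hypothetical name, and skipping gap characters is an assumption consistent with gaps usually being left out.

```python
from collections import defaultdict

def column_profile(column, seq_weights):
    """Return P(a|i) for one alignment column; gap characters are skipped."""
    totals = defaultdict(float)
    for residue, w in zip(column, seq_weights):
        if residue != "-":
            totals[residue] += w
    norm = sum(totals.values())
    return {aa: t / norm for aa, t in totals.items()}

# Column G, D, D, T, S; the weights here are invented for illustration.
column = "GDDTS"
seq_weights = [0.1, 0.2, 0.2, 0.2, 0.3]
profile = column_profile(column, seq_weights)
print(profile)
```

The two D sequences pool their weights, so D ends up with probability 0.4 even though no single sequence carries more than 0.3.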

PSSMs (position-specific scoring matrices) and log-likelihood ratios.
$\mathrm{LLR}(a) = \log\dfrac{P(a \mid i)}{P(a)}$
Here $P(a \mid i)$ is the probability of a in one column, and $P(a)$ is the "background" likelihood of a overall (in the whole database).
Amino acids are not equally likely in nature. K, L and R are the most common. M, C and W are the rarest.
(Figure: bar chart of background frequencies for the 20 amino acids, A C D E F G H I K L M N P Q R S T V W Y.)

Pseudocounts account for unsampled data.
$\mathrm{LLR}(a) = \log\dfrac{P(a \mid i)}{P(a)}$
You can't take the log of zero, and the probability of seeing a in column i of a sequence alignment is never really zero. So we add a small number of "pseudocounts", ε:
$\mathrm{LLR}(a) = \log\dfrac{P(a \mid i) + \varepsilon}{P(a)}$
This LLR approaches $\log(\varepsilon / P(a))$ in the limit as $P(a \mid i) \to 0$, which is a negative number whenever ε is smaller than P(a).
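A small numerical illustration of the pseudocount formula, in bits (log base 2). The values of ε and the background frequency are assumptions chosen for the example, and the formula is used exactly as written, without renormalizing after adding ε.

```python
import math

def llr(p_column, p_background, epsilon=0.01):
    """Log-likelihood ratio in bits, with a flat pseudocount epsilon added."""
    return math.log2((p_column + epsilon) / p_background)

# Assumed values for illustration: background frequency 0.05 for this amino acid.
enriched = llr(0.50, 0.05)   # common in the column: positive score
unseen = llr(0.00, 0.05)     # never observed: finite, equals log2(eps / P(a))
print(enriched, unseen)
```

Without ε the second call would fail with a math domain error; with it, an unobserved amino acid gets a finite penalty instead.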

Smart pseudocounts: guessing the missing counts using BLOSUM.
$\mathrm{LLR}(a) = \log\dfrac{P(a \mid i) + \varepsilon\,\omega}{P(a)}$, where $\omega = \sum_b P(b \mid i)\, S(a \mid b)$
and $S(a \mid b)$ is the probability of the substitution b -> a.
Why smart? Because we use the observed counts to predict the unobserved counts, as opposed to blindly applying the same pseudocount everywhere.
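To make the role of ω concrete, here is a sketch on a toy three-letter alphabet. Every number in it (background frequencies, substitution probabilities, ε) is invented for illustration and is not taken from BLOSUM; `smart_llr` is a hypothetical name.

```python
import math

def smart_llr(a, column, background, subst, epsilon=0.1):
    """LLR(a) = log2((P(a|i) + eps * omega) / P(a)), omega = sum_b P(b|i) * S(a|b)."""
    omega = sum(p_b * subst[b][a] for b, p_b in column.items())
    return math.log2((column.get(a, 0.0) + epsilon * omega) / background[a])

# Toy three-letter alphabet; every number below is invented for illustration.
background = {"D": 0.3, "E": 0.3, "W": 0.4}
subst = {                      # subst[b][a] = probability that b is replaced by a
    "D": {"D": 0.7, "E": 0.25, "W": 0.05},
    "E": {"D": 0.25, "E": 0.7, "W": 0.05},
    "W": {"D": 0.05, "E": 0.05, "W": 0.9},
}
column = {"D": 1.0}            # only D was observed at this position
scores = {aa: smart_llr(aa, column, background, subst) for aa in "DEW"}
print(scores)
```

Neither E nor W was observed, but E is penalized less than W because D -> E is a likely substitution in the toy matrix. A flat pseudocount would treat the two unobserved residues identically apart from their background frequencies.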

One way to visualize profiles. Color matrix from the I-sites library (Bystroff et al., 1998). Color = LLR: blue = high negative values, green = zero, red = high positive values.
(Figure panels: Proline Helix C-cap; Glycine Helix C-cap; Glycine Helix C-cap.)

Logos. The height of each letter is the LLR, in bits, half-bits or nats.
http://weblogo.berkeley.edu/logo.cgi

Sometimes you can detect a pattern only after merging and condensing a lot of data.

Alignments of transcription factor footprint sites. Many transcription factors bind DNA as homodimers, therefore they interact with... palindromes! (In DNA, a palindrome is a site that reads the same as its reverse complement.)
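A palindrome in the DNA sense is a site equal to its own reverse complement, which is easy to check. The two test strings below are arbitrary examples, not footprint sites from the lecture, and the function names are hypothetical.

```python
def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))

def is_palindromic_site(dna):
    """True if the site reads the same on both strands."""
    return dna == reverse_complement(dna)

print(is_palindromic_site("GAATTC"))
print(is_palindromic_site("GATTACA"))
```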

Protein sequence logo. What do you see? What might it mean?

How do you compare a sequence against a profile? The score is the sum of the log-likelihood ratios of the amino acids in the sequence, each looked up at its own profile position:
$\mathrm{score} = \sum_i \mathrm{LLR}(a_i)$
Example sequence: KEMGFDHIIIHP
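A minimal sketch of this scoring, assuming the PSSM is stored as one dictionary of LLR values per position. The three-position matrix and its values are invented for illustration, and gaps are ignored.

```python
def profile_score(sequence, pssm, offset=0):
    """Sum LLR(a_i) over the residues of `sequence` placed at profile position `offset`."""
    return sum(pssm[offset + i][aa] for i, aa in enumerate(sequence))

# Toy 3-position PSSM over a reduced alphabet; the LLR values are invented.
pssm = [
    {"K": 1.2, "E": -0.5, "M": -1.0},
    {"K": -0.8, "E": 1.5, "M": -0.3},
    {"K": -1.1, "E": -0.2, "M": 2.0},
]
good = profile_score("KEM", pssm)
poor = profile_score("MEK", pssm)
print(good, poor)
```

The same residues in a different order score very differently, because each LLR is position specific.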

Aligning sequence to profile (profile 1 at position i, sequence 2 at position j). P(aa,i) is the profile probability $P(aa \mid i)$ and B is the BLOSUM score:
    S(i,j) = 0
    do aa=1,20
      S(i,j) = S(i,j) + P(aa,i)*B(aa,s(j))
    enddo
Aligning profile to profile:
    S(i,j) = 0
    do aai=1,20
      do aaj=1,20
        S(i,j) = S(i,j) + P(aai,i)*P(aaj,j)*B(aai,aaj)
      enddo
    enddo
No need to normalize, since $\sum_{aa_i} \sum_{aa_j} P(aa_i \mid i)\, P(aa_j \mid j) = 1$.
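The two cell scores translate directly into Python. In this sketch the substitution scores in `B` are a toy two-letter table, not real BLOSUM values, and the profile columns are made up; the function names are hypothetical.

```python
def seq_profile_score(profile_col, residue, B):
    """S(i,j) = sum over aa of P(aa|i) * B(aa, s_j)."""
    return sum(p * B[aa][residue] for aa, p in profile_col.items())

def profile_profile_score(col_i, col_j, B):
    """S(i,j) = sum over aa_i, aa_j of P(aa_i|i) * P(aa_j|j) * B(aa_i, aa_j)."""
    return sum(p * q * B[a][b] for a, p in col_i.items() for b, q in col_j.items())

# Toy symmetric substitution scores on a two-letter alphabet (not real BLOSUM values).
B = {"D": {"D": 6, "E": 2}, "E": {"D": 2, "E": 5}}
col_i = {"D": 0.75, "E": 0.25}
col_j = {"D": 0.5, "E": 0.5}
s_seq = seq_profile_score(col_i, "E", B)
s_prof = profile_profile_score(col_i, col_j, B)
print(s_seq, s_prof)
```

A sequence is the special case of a profile whose columns put all probability on one residue, so scoring `col_i` against the column `{"E": 1.0}` gives the same value as scoring it against the residue E.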

Reminder: PSI-BLAST is BLAST with profiles. PSI-BLAST searches the database iteratively.
(Cycle 1) Normal BLAST (with gaps).
(Cycle 2) (a) Construct a profile from the results of Cycle 1. (b) Search the database using the profile.
(Cycle 3) (a) Construct a profile from the results of Cycle 2. (b) Search the database using the profile.
And so on (the user sets the number of cycles).
PSI-BLAST is much more sensitive than BLAST. It is also more vulnerable to low-complexity sequence.

Review.
What is a sequence family? Superfamily?
What is a profile?
Why must sequences be weighted to calculate a profile?
What is a distance matrix? What does it contain? How is it used to get sequence weights?
What information is expressed in a sequence logo?
What are pseudocounts? Why bother pseudocounting?