Outline Sequence-comparison methods. Buzzzzzzzz. MB330 - The class of 2008

Sequence-comparison methods
Gerard Kleywegt
Uppsala University

Outline
  Why compare sequences?
  Dotplots
  Pairwise sequence alignments
  Multiple sequence alignments
  Profile methods

Why compare sequences?
  Sequence comparison is the bread and butter of bioinformatics - WHY?
  Sequence-to-database
  Sequence-to-sequence

Buzzzzzzzz
  Discuss in groups of 2-3 for ~3 minutes
  Write down ~3 things that you think protein sequence comparisons could be used for!

MB330 - The class of 2008: Sequence-to-database
  Find related genes in different species
  Patenting (check novelty of sequence)
  Clues about function
  Clues about structure
  Identification of the protein

MB330 - The class of 2008: Sequence-to-sequence
  Find small sequence variations
  Study mutation rates
  Phylogenetic analysis, evolutionary relationships
  Protein structure prediction
  Finding sequence motifs (active site, ...)

B351 - The class of 2007
  Function prediction
  Structure prediction
  Evolutionary history, phylogeny, ancestry, classification
  Find homologous proteins
  Identify unknown proteins
  Find similarities and differences (mutations) between proteins, species
  Find conserved/consensus sequences
  Domain structure

MB330 - The class of 2007
  Find related proteins (homology)
  Clues about function
  Evolutionary history, phylogeny
  Clues about structure
  Disease-related variants
  Identify unknown proteins
  Find similarities and differences between proteins, species
  Identify possible active-site residues

MB330 - The class of 2006: Sequence-to-database
  Identification of protein
  Clues about function
  Find related sequences
  Clues about domain structure
  Verify hypothetical proteins
  Clues about structural similarities
  Find sequence motifs (active site, ...)

MB330 - The class of 2006: Sequence-to-sequence
  Investigate evolutionary history and relationships
  Analyse differences between species and between individuals (e.g., disease-causing mutations)
  Structure modelling
  Clues about secondary structure
  Sequence motifs (active site, ...)

Sequence-database comparison
  Find related sequences
    Homology
      Descended from a common ancestor (T/F!)
      Occurrence in other organisms (orthologs; speciation)
      Occurrence in same organism (paralogs; gene duplication)
    Convergent evolution
      Independently evolved same function
    Shared motif(s)
    Shared domains
    Chance similarities
  Find clues about structure
  Find clues about function

Sequence-sequence comparison
  Alignment of (possibly) homologous sequences
  Determine residue-residue correspondences
  Measure similarity, cluster
  Infer evolutionary relationships, phylogeny
  Find patterns of conservation and variability
    Functionally important sites
    Structurally important sites
    Sites important for specificity
  Structure prediction
    Secondary structure prediction
    Homology modelling
  Function prediction (caution!)

Sequence identity
  Sequence identity (%S) = 100% * (Nr of identical residues in pairwise alignment) / (Length of the shortest sequence)
  Other definitions exist
  Ex: [example alignment lost in extraction]
    %S = 100% * 6 / min(9,10) = 67%
    55% 60% 67% 75%

Sequence identity/homology
  Homology and level of sequence identity (or similarity) are two fundamentally different concepts!
  "These two proteins are 28% homologous"
  Can homology be inferred/rejected based on the level of sequence identity?

Sequence identity/homology
  [Figure: sequence identity of non-homologous proteins (Rost, 1999)]
  [Figure: sequence identity of homologous proteins (Rost, 1999)]

Sequence identity/homology
  Two proteins of 100 or more residues with %S >35% are likely to be homologous
  However, homologous proteins may well have %S <35%
    Twilight Zone (Doolittle)
    %S <20% Midnight Zone (Rost)
  Average %S ~8.5% for remote homologues
  Average %S ~5.6% for random sequences

Structure conservation
  Homologous proteins will have similar structures
  Structure better conserved than sequence!
  Proteins with similar structure and function likely to be homologous
  Could also be analogous (similar due to convergent evolution)
  (Chothia & Lesk, 1986)
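The sequence-identity definition above (identical residues divided by the length of the shortest sequence) can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity`, the gap character `-`, and the two aligned strings are assumptions made up for this example. The strings were chosen so that the counts match the slide's arithmetic (6 identities, sequence lengths 9 and 10).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """%S = 100 * (identical aligned residues) / (length of the shortest sequence)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count alignment columns where both sequences carry the same residue.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Ungapped lengths of the two sequences.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    shortest = min(len_a, len_b)
    if shortest == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * identical / shortest


# Hypothetical alignment (not the one from the slide, which is lost):
# 6 identical columns, ungapped lengths 10 and 9.
a = "MKTAYIAKQR-"
b = "MKT--IAKLWG"
print(f"{percent_identity(a, b):.0f}%")  # 100 * 6 / min(10, 9) -> 67%
```

Because other definitions exist (for instance dividing by the alignment length or by the longer sequence), the denominator is the one line to change when comparing values from different sources.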

Homology - current thinking
  Statistically significant sequence and structural similarity strongly imply common ancestry (i.e., homology)
  Statistically significant sequence or structural similarity
    Weakly implies common ancestry (homology)
    Could result from convergent evolution (analogy)
  Functional similarity
    Supports a common ancestry hypothesis, but is not sufficient to prove it
    Functional dissimilarity does not disprove common ancestry (e.g., lactalbumin vs. lysozyme)

Homology - why bother?
  Science: (probable) homology must be established before you can
    Conclude that the structures will be similar
    Suspect that the functions may be related
    Do phylogenetic analysis
    Draw any meaningful conclusions from a (multiple) sequence alignment
  Practical: If you plan to design a drug against a bacterial or parasitic enzyme you want to know about any human orthologs of that enzyme!

Dotplots
  Dotplot: simple overview of the similarities of two words/sequences
  Gives clues about alignment too
  Calculation:
    Matrix
    Columns = residues of sequence 1
    Rows = residues of sequence 2 (or 1)
    Simplest form: put dots in the matrix where the row and column residues are identical

Dotplot example
  [Figure: dotplot of two short words; the axis letters are only partly recoverable]
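The simplest form of the dotplot calculation described above (columns are residues of sequence 1, rows are residues of sequence 2, a dot wherever the two residues are identical) might look like the following Python sketch. The function names `dotplot` and `render` and the two example words are assumptions for illustration only.

```python
from __future__ import annotations


def dotplot(seq1: str, seq2: str) -> list[list[bool]]:
    """Return a matrix with one row per residue of seq2 and one column
    per residue of seq1; True marks identical residues (a dot)."""
    return [[r2 == r1 for r1 in seq1] for r2 in seq2]


def render(matrix: list[list[bool]], seq1: str, seq2: str) -> str:
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq1)]
    for residue, row in zip(seq2, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(f"{residue} {cells}")
    return "\n".join(lines)


# Hypothetical example words (not taken from the slides).
word1, word2 = "ABCA", "CAB"
print(render(dotplot(word1, word2), word1, word2))
```

Passing the same string as both arguments gives a self-dotplot, in which the main diagonal is always filled.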

Why compare a sequence to itself?
  Self-dotplot
  Internal symmetry
    Translational = domain duplication
    Inversion
      recognition sites for transcriptional regulators and restriction enzymes
      Ex: cor: /
  Low-complexity regions
    Ex: Alu repeat

[Figure: dotplot of a palindrome]
[Figure: low-complexity region]
[Figure: domain duplication (Domain A, Domain B, Domain A, Domain B)]
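The self-dotplot signatures listed above can be checked programmatically. The sketch below assumes plain-text (character-level) palindromes rather than reverse-complement DNA palindromes, and the helper names `self_dotplot`, `has_full_antidiagonal`, and `has_full_diagonal_at` are hypothetical. A palindrome fills the anti-diagonal of its self-dotplot, and a tandem duplication fills a diagonal parallel to the main one.

```python
from __future__ import annotations


def self_dotplot(seq: str) -> list[list[bool]]:
    """Dotplot of a sequence against itself (identity, no window)."""
    return [[a == b for a in seq] for b in seq]


def has_full_antidiagonal(matrix: list[list[bool]]) -> bool:
    """True if every cell on the anti-diagonal holds a dot,
    which is the signature of a (text) palindrome."""
    n = len(matrix)
    return all(matrix[i][n - 1 - i] for i in range(n))


def has_full_diagonal_at(matrix: list[list[bool]], offset: int) -> bool:
    """True if the diagonal shifted by `offset` columns is completely
    filled, which is the signature of a duplication at that spacing."""
    n = len(matrix)
    if not 0 < offset < n:
        raise ValueError("offset must be between 1 and len(seq) - 1")
    return all(matrix[i][i + offset] for i in range(n - offset))


# Made-up example strings.
print(has_full_antidiagonal(self_dotplot("ABCBA")))    # palindrome
print(has_full_diagonal_at(self_dotplot("ABCABC"), 3))  # duplicated unit "ABC"
```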

[Figure: domain duplication]
[Figure: shared domains; the domain labels are only partly recoverable]

Dotplots
  Usually:
    Define a window size
    Count number of identical residues within the window
    If the count exceeds a certain threshold, put a dot in the matrix element
  Ex: window 3 (-1,0,+1), minimum of 2 identities
  Ex: window 15 (-7,-6,...,+7), minimum of 6 identities

Dotplots with window
  [Figures: window 3, threshold 2]
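The windowed rule described above can be sketched as follows. This is a minimal illustration, not the lecturer's program: the function name `windowed_dotplot` and its parameters are assumptions, the window is taken to be centred on each cell and to run along the diagonal, positions that fall outside either sequence are simply skipped, and a dot is placed when the count reaches the stated minimum (matching "window 3, minimum of 2 identities").

```python
from __future__ import annotations


def windowed_dotplot(
    seq1: str, seq2: str, window: int = 3, min_identities: int = 2
) -> list[list[bool]]:
    """Rows follow seq2, columns follow seq1. A cell gets a dot when at
    least `min_identities` residue pairs match inside the diagonal
    window centred on that cell."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    matrix = []
    for j in range(len(seq2)):
        row = []
        for i in range(len(seq1)):
            count = 0
            # Walk along the diagonal: offsets -half .. +half.
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                in_range = 0 <= a < len(seq1) and 0 <= b < len(seq2)
                if in_range and seq1[a] == seq2[b]:
                    count += 1
            row.append(count >= min_identities)
        matrix.append(row)
    return matrix


# Made-up sequences: an isolated single match is filtered out,
# while a run of matches survives.
plot = windowed_dotplot("XAY", "ZAW", window=3, min_identities=2)
print(plot[1][1])  # lone identity at the centre, count is 1
```

With `window=1` and `min_identities=1` the function reduces to the simplest identity dotplot; larger windows suppress isolated chance matches at the cost of blurring short features.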

Dotplots with window
  [Figures: window 3, threshold 2]

Dotplot examples
  Human lactalbumin: Calcium-binding protein involved in lactose biosynthesis
    123 Residues, sequence from PDB entry 1B9O
  Hen egg-white lysozyme: Enzyme that breaks down bacterial cell walls
    129 Residues, sequence from PDB entry 2S
  Homologous; %S ~36% (structure-based sequence alignment)
  Note: plots now from lower-left to upper-right corner

[Figures: HEW lysozyme vs. human lactalbumin; window 1, threshold 1; window 3, threshold 2]

[Figure: HEW lysozyme vs. human lactalbumin; window 11, threshold 5]

Summary
  Dotplots are an excellent means of assessing the (self-)similarity of sequences
  Easy to calculate
  Easy to interpret
  Compare every residue in one sequence to every residue in the other sequence
  Provide an indication of how the sequences should be aligned
  Detect similarities that are easily missed by global pairwise alignment (e.g., shuffled domain order, internal symmetry)

A different kind of dotplot
  Dotplots can be used to compare any strings
  Ex: a manual chapter in Dutch, French, German, Italian, Spanish, and Swedish (one million 4-grams)
  Also: academic fraud
  [Figure: dotplot comparing the six language versions]

Sequencing!
  For the next lecture I needed two random sequences
  I asked the MB330 students of 2006 to each pick one of the four nucleotides: A, C, G, or T
  This yielded a random(ish) Pojk sequence (boys) and Tjej sequence (girls)

Sequencing!
  Tjej
    Note: contains low-complexity palindrome () and a repeat of the (palindromic) domain ()
  Pojk
    Note: contains low-complexity region () and a palindrome-in-a-palindrome ()
  The following dotplots were calculated with window size 3 and threshold 2

[Figure: Tjej-Tjej dotplot]

[Figure: Pojk-Pojk dotplot]
[Figure: Tjej-Pojk dotplot]