Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis


The universe of biological sequence analysis
Word/pattern recognition - identification of restriction enzyme cleavage sites (example: PstI)
Prediction of exon structure
Sequence alignment methods - pairwise alignment
[Figure: predicted exon structure, with the amino acid translations of Exon 1 and Exon 2]

Why sequence alignments?
Prediction of function
Protein family analysis
Comparative genomics
Phylogeny / evolutionary history
Genome sequencing: assembly, alignment to reference genome

Prediction of function
We have a new sequence. Is it similar to a previously known sequence? We can test by alignment against a database of sequences whether it is similar to a sequence with known function. If it is, we can assign a possible function to our new sequence.

Protein family analysis

Comparative genomics - reveals biologically significant regions of the genome

Pairwise alignment - dotplot
[Figure: dot plots of two sequences, followed by two example alignments scored position by position; the per-position scores sum to 25 for the first alignment and to -2 for the second]
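The dot plot idea can be sketched in a few lines of Python. This is a minimal illustration, not the dottup implementation: the function name `dotplot`, the `window` parameter and the example sequences are assumptions made for this sketch. A cell is marked when the words of length `window` starting at the two positions are identical, so similar regions show up as diagonal runs.

```python
def dotplot(seq_a: str, seq_b: str, window: int = 1) -> list[str]:
    """Return one text row per start position in seq_a.

    A '*' at column j of row i means that the word of length `window`
    starting at seq_a[i] equals the word starting at seq_b[j].
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row.append("*" if same else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Hypothetical toy sequences; the second has one extra T.
    for line in dotplot("GATTACA", "GATTTACA"):
        print(line)
```

A larger `window` suppresses isolated chance matches at the cost of missing short similarities.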

More sophisticated scoring of protein sequence alignments
Each amino acid change has a characteristic probability - substitution matrix
Example: the scores of the aligned residue pairs are added up: 4 + 0 + 4 + 9 + 2 = 19

Local and global alignments
[Figure: local alignment and global alignment of two sequences, A and B]

Frequently used methods in sequence analysis that are based on sequence alignment
BLAST - searches in databases for sequence similarity
ClustalW - multiple alignment of sequences
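Scoring with a substitution matrix amounts to looking up each aligned residue pair and summing the values. The sketch below is an illustration only: the two sequences are hypothetical (the residues of the slide's example did not survive), chosen so that the per-position scores match the sum shown above, and `SCORES` is a small excerpt using BLOSUM62 values rather than a full matrix.

```python
# Excerpt of a substitution matrix (values as in BLOSUM62).
# Only the pairs needed for the example are listed; this is an assumption
# made to keep the sketch short, not a complete matrix.
SCORES = {
    ("L", "L"): 4,
    ("A", "G"): 0,
    ("C", "C"): 9,
    ("D", "E"): 2,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair; the matrix is symmetric."""
    key = (a, b) if (a, b) in SCORES else (b, a)
    return SCORES[key]


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum substitution scores over an ungapped alignment of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    # Hypothetical sequences: L/L=4, A/G=0, L/L=4, C/C=9, E/D=2
    print(alignment_score("LALCE", "LGLCD"))  # 19
```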

BLAST
Searching databases for sequence similarity - traditional alignment method too slow
BLAST - Basic Local Alignment Search Tool
FASTA, 1988 (William Pearson)
BLAST, 1990 (David Lipman, Stephen Altschul)
A query sequence (DNA or protein) is tested against all sequences in a database (DNA or protein), i.e. the query is aligned to all the database sequences. The final output is a list of the best-matching database sequences.

Searching databases for sequence similarity - shortcuts of BLAST
Improvement of speed as compared to the local alignment algorithm: the initial search is for word hits. Word hits are then extended in either direction.
"word hit"
M K I Q L K R Y
M K L Q L K R Y

BLAST output
BLASTP 2.2.9 [May-01-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= lcl|SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein (SRP54) (504 letters)
Database: swissprot 197,228 sequences; 71,501,181 total letters
Searching...done

Sequences producing significant alignments:                          Score (bits)   E value
SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein...   959            0.0
SRP54_PONPY (Q5R4R6) Signal recognition particle 54 kDa protein...   958            0.0
SRP54_MACFA (Q4R965) Signal recognition particle 54 kDa protein...   958            0.0
SRP54_HUMAN (P61011) Signal recognition particle 54 kDa protein...   958            0.0
SRP54_CANFA (P61010) Signal recognition particle 54 kDa protein...   958            0.0
SRP54_RAT (Q6AYB5) Signal recognition particle 54 kDa protein (S...  957            0.0
SRP54_GEOCY (Q8MZJ6) Signal recognition particle 54 kDa protein...   794            0.0
SR542_LYCES (P49972) Signal recognition particle 54 kDa protein...   565            e-161
SR543_ARATH (P49967) Signal recognition particle 54 kDa protein...   560            e-159
SR542_HORVU (P49969) Signal recognition particle 54 kDa protein...   558            e-158
...
SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha s...  99             3e-20
SRPR_HUMAN (P08240) Signal recognition particle receptor alpha s...  99             3e-20
SRPR_YEAST (P32916) Signal recognition particle receptor alpha s...  98             7e-20
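The word-hit shortcut can be sketched as follows. This is a simplified illustration, not BLAST's actual algorithm: it assumes exact word matches only (protein BLAST also accepts similar "neighbourhood" words), a hypothetical match/mismatch scoring of +1/-1 instead of a substitution matrix, and an ungapped extension that stops once the running score falls more than `x_drop` below the best score seen. The function names and parameters are invented for this sketch; the two sequences are the ones from the "word hit" example as they appear above.

```python
def find_word_hits(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index: dict[str, list[int]] = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps in both directions.

    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score dropped too far below the best: stop extending
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left + right
    return score, i - left_len, i + w + right_len, j - left_len


if __name__ == "__main__":
    q, s = "MKIQLKRY", "MKLQLKRY"
    hits = find_word_hits(q, s)
    print(hits)
    print(extend_hit(q, s, *hits[0]))
```

Because only the neighbourhood of each word hit is examined, most of the dynamic-programming matrix of a full local alignment is never computed, which is where the speed-up comes from.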

BLAST output, cont.
sp|Q9I3P8.1|FLHF_PSEAE RecName: Full=Flagellar biosynthesis prot...   57   3e-07
sp|Q44758.1|FLHF_BORBU RecName: Full=Flagellar biosynthesis prot...   55   2e-06
sp|Q01960.1|FLHF_BACSU RecName: Full=Flagellar biosynthesis prot...   53   4e-06
sp|O28980.1|Y1289_ARCFU RecName: Full=Uncharacterized protein AF...   39   0.064
sp B9LK1.1 YS_HLSY RecName: Full=Adenylyl-sulfate kinase; Al...       38   0.21
sp Q12U80.1 RDB_MEBU RecName: Full=DNA repair and recombinatio...     37   0.29
sp 5D014.1 D_PELS RecName: Full=Acetyl-coenzyme A carboxyla...        35   0.93
sp Q0356.1 RSM_LB RecName: Full=Ribosomal RNA small subunit...        35   1.2
sp Q1I2K4.1 YS_PSEE4 RecName: Full=Adenylyl-sulfate kinase; Al...     35   1.6
sp Q38V22.1 RSM_LSS RecName: Full=Ribosomal RNA small subunit...      34   1.8
sp 1U3X8.1 YS_MRV RecName: Full=Adenylyl-sulfate kinase; Al...        34   2.3
sp 6D42.1 YS_KLEP7 RecName: Full=Adenylyl-sulfate kinase; Al...       34   2.9
sp P63890.2 YS_SLI RecName: Full=Adenylyl-sulfate kinase; Al...       34   2.9
...

Expect value (E)
Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit means that, in a database of the current size, one might expect to see 1 match with a similar score simply by chance. The lower the E value, or the closer it is to 0, the more "significant" the match is.
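The dependence of E on score and database size can be made concrete with the Karlin-Altschul relation for bit scores, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below is an approximation under a stated assumption: it uses the raw lengths, whereas BLAST applies effective (edge-corrected) lengths, so its reported values differ somewhat. The function name `expect_value` is hypothetical.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Assumption: raw query and database lengths are used instead of the
    effective lengths that BLAST computes internally.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Numbers taken from the SRP54_MOUSE search: 504-letter query,
    # 71,501,181 database letters, SRPR_MOUSE hit with 99 bits.
    e = expect_value(99, 504, 71_501_181)
    print(f"{e:.1e}")  # same order of magnitude as the reported 3e-20
```

Each additional bit of score halves E, and doubling the database size doubles it, which is why the same alignment looks less significant in a larger database.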
High Scoring Pair (HSP)
[BLAST alignment output for a near-identical hit: query positions 1-300 aligned to subject positions 1-300]

>SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha subunit (SR-alpha) (Docking protein alpha) (DP-alpha)
Length = 636
Score = 99.0 bits (245), Expect = 3e-20
Identities = 68/313 (21%), Positives = 143/313 (45%), Gaps = 31/313 (9%)
[BLAST alignment output: query positions 14-173 aligned to subject positions 322-501]

BLAST output revealing orthologs and paralogs
In the BLASTP output for the query SRP54_MOUSE against swissprot (same search as above), the hits are labelled as orthologs (the SRP54 proteins of other species, scores 959 down to 558 bits) and paralogs (the signal recognition particle receptor proteins SRPR_MOUSE, SRPR_HUMAN and SRPR_YEAST, scores 99 to 98 bits).

The two kinds of protein evolutionary relationship
Genes or proteins are homologous if they are related by divergence from a common ancestor.
Orthology - sequences that diverged after a speciation event. Orthologous genes often have the same function in different species.
Paralogy - sequences that diverged after a gene duplication event. Paralogous genes often perform different but related functions within one organism.
[Figure: orthologs - gene X of an ancestral organism is separated by speciation into organism A and organism B (Xa, Xb); paralogs - gene X is duplicated within an organism (X1, X2)]

Example of orthology / paralogy relationships

The different variants of BLAST
Program   Query     Database
blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA
Cited 31998 times since 1990!

BLAT

Alignment software specialized for next-generation sequencing technology
BWA, Bowtie, SOAP2 - align reads to a reference genome

Further improvement of computational efficiency - BLAT (http://genome.ucsc.edu/cgi-bin/hgblat?command=start)

ClustalW - multiple alignment of sequences
Cited 34,646 times!
Construction of a tree based on pairwise alignments
Progressive alignment guided by the tree

Introduction to the practical
EMBOSS programs in this practical:
sixpack - translation of a nucleotide sequence
plotorf - plot of open reading frames
dottup - dotplot analysis
water - Smith-Waterman local alignment
needle - Needleman-Wunsch global alignment
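For orientation, the Smith-Waterman recurrence behind local alignment can be sketched as below. This is not the EMBOSS water implementation: it assumes a hypothetical match/mismatch scoring (+2/-1) and a single linear gap penalty, whereas water uses a substitution matrix with separate gap opening and extension penalties. It returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty).

    Each cell holds the best score of an alignment ending at that pair of
    positions; the floor of 0 lets an alignment start anywhere.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical sequences sharing only the short region "ACG".
    print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```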

Translation of a nucleotide sequence using sixpack
[sixpack output for nucleotide positions 1-180: the three forward reading frames (F1, F2, F3) are printed above the sequence ruler and the three reverse reading frames (F4, F5, F6) below it]

Introduction to the practical
Plotorf to show open reading frames (in this case an ORF is defined as starting with an AUG codon)
Unnamed protein 416-1522
Ribosomal protein S16 1771-2019
tRNA methyltransferase 2617-3384
Ribosomal protein L19 3426-3773
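A six-frame translation like the one sixpack reports can be sketched in Python. This is an illustration under stated assumptions, not sixpack itself: it uses the standard genetic code, labels the reverse-strand frames R1-R3 rather than reproducing sixpack's F4-F6 numbering, and writes unknown codons as X. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' is stop, 'X' is unknown."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translate the three forward (F1-F3) and three reverse (R1-R3) frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[f"F{k + 1}"] = translate(dna[k:])
        frames[f"R{k + 1}"] = translate(rev[k:])
    return frames


if __name__ == "__main__":
    # Hypothetical short sequence: ATG GCC TAA -> M A *
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

An open reading frame in the sense used by plotorf here would then be a stretch in one frame that starts at an M (AUG) and runs to the next stop.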

Introduction to the practical
Gag; Gag-Pol fusion (5%)

Global alignment of mRNA sequence to genomic DNA sequence
Effect of gap parameters
[Figure: genomic DNA aligned to the mature, spliced mRNA]
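How the gap penalty changes a global alignment score can be seen with a small Needleman-Wunsch sketch. It is a simplification, not the EMBOSS needle implementation: it assumes match/mismatch scores of +1/-1 and one linear gap penalty per gapped position, whereas needle uses separate gap opening and extension penalties. The toy "mRNA" and "genomic" sequences are hypothetical, with a four-base insert standing in for an intron.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score of a and b with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: b against gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)          # first column: a against gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    mrna = "ACGTACGT"         # hypothetical spliced sequence
    genomic = "ACGTTTTTACGT"  # same "exons" with a TTTT "intron" between them
    for gap in (-2, -1, 0):
        print(gap, needleman_wunsch_score(mrna, genomic, gap=gap))
```

Here the optimal alignment always pairs the eight matching bases and spans the insert with four gap positions, so the score is 8 + 4 * gap. With real introns that are hundreds of bases long, a cheap gap extension is what allows the alignment to bridge them.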

Introduction to the practical
Dot plot analysis (dottup) reveals repeats

Introduction to the "Exercises with biological sequences - examining HIV genes and proteins" - biological questions addressed with BLAST and ClustalX

BLAST - search databases for sequence similarity
Identifying homologous proteins. Are there non-viral homologues to any HIV proteins?
Are we able to identify a relationship between human HIV and the monkey SIV?

ClustalX - multiple sequence alignment
Identifying amino acids involved in drug resistance.
What is the relationship between HIV and monkey SIV? Using a multiple alignment to compute a phylogenetic tree.