Alignment & BLAST. By: Hadi Mozafari KUMS


SIMILARITY - ALIGNMENT
Comparison of primary DNA or protein sequences to other primary or secondary sequences, expecting that the function of the similar sequence is known from experiments.
Thinking by analogy: assuming that if the sequence is similar, the function is also similar.
-- Is it contaminated with vector sequences?
-- Is it an already known gene?
-- Is it related to any other genes by having a common ancestor?
-- Is it similar in function to other genes via convergent evolution?
-- What could the protein sequence be for this nucleotide fragment if it is translated, and what might that protein be like?

Similarity

Limits of Similarity

Significance of Alignment

Alignment
Most alignment programs create an alignment that represents what happened during evolution at the DNA level. To carry over information from a well-studied sequence to a newly determined one, we need an alignment that represents the protein structures of today.

Sequence Alignment
In phylogeny one wants to line up residues that came from a common ancestor.
For information transfer one wants to line up residues at similar positions in the structure.
gap = insertion or deletion

Global versus Local Alignment Global Local

An Example for Proteins
Pairwise alignment of retinol-binding protein (RBP) and β-lactoglobulin

      1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
      1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

     51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
     45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

     98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...QYSC 136 RBP
     94 IPAVFKIDALNENKVL...VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

    137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
    136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI... 178 lactoglobulin

[Figure annotations: in the match line between the paired rows, a bar marks identity, two dots mark very similar residues and one dot marks somewhat similar residues. Dots inside a sequence row are gaps, labelled as internal gaps (within the sequence) or terminal gaps (at its ends).]

Global Alignment
Align two sequences from head to toe, i.e. from 5' ends to 3' ends, or from N-termini to C-termini.
Algorithm published by: Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.

Global Alignment
We fill up this matrix backwards, using a very simple scoring scheme: Identity = 1. Other = 0. Gaps cost -1.

         a   a   c   t   t   g   a   g   c   -
     c                                      -6
     t                                      -5
     g                                      -4
     a                                      -3
     g                                      -2
     t                                      -1
     -  -9  -8  -7  -6  -5  -4  -3  -2  -1   0

Score of a cell = score of the cell you came from + gap penalty (for a horizontal or vertical step) or similarity score (for a diagonal step).
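The backward fill described on this slide can be sketched in Python as follows. This is a minimal illustration rather than the presenter's code: the function name `fill_backwards` and the layout (first sequence along the columns, second along the rows) are assumptions, while the scores (identity 1, other 0, gap -1) are the ones given on the slide.

```python
def fill_backwards(cols, rows, match=1, mismatch=0, gap=-1):
    """Fill a global-alignment matrix from the bottom-right corner.

    F[i][j] is the best score for aligning rows[i:] with cols[j:].
    """
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Borders: aligning a remaining suffix against nothing costs one gap per residue.
    for j in range(m - 1, -1, -1):
        F[n][j] = F[n][j + 1] + gap
    for i in range(n - 1, -1, -1):
        F[i][m] = F[i + 1][m] + gap

    # Each inner cell: where you came from + similarity score or gap penalty.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            s = match if rows[i] == cols[j] else mismatch
            F[i][j] = max(
                F[i + 1][j + 1] + s,   # diagonal: align rows[i] with cols[j]
                F[i + 1][j] + gap,     # gap in the column sequence
                F[i][j + 1] + gap,     # gap in the row sequence
            )
    return F


F = fill_backwards("aacttgagc", "ctgagt")
print(F[-1])                    # bottom row of the matrix
print([row[-1] for row in F])   # right-hand column of the matrix
print(F[0][0])                  # score of the optimal global alignment
```

With the slide's two sequences, the bottom row and right-hand column reproduce the -9 ... 0 and -6 ... 0 borders shown in the matrix, and the top-left cell holds the global alignment score.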

Local Alignment
Locate region(s) with high degree of similarity in two sequences.
Algorithm published by: Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
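A small sketch of the local-alignment idea, assuming a simple scoring scheme that is not from the slides (match +2, mismatch -1, gap -2) and a hypothetical function name `smith_waterman`. The key difference from global alignment is that a cell score never drops below zero, so the best-scoring region can start and end anywhere.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new local alignment start at any cell.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("xxTEGNAPyy", "zzzTEGNAPww"))
```

The flanking lowercase letters share nothing, so only the common region is returned; a global alignment of the same strings would be forced to pair or gap the flanks as well.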

Alignment methods Rigorous algorithms Needleman-Wunsch Smith-Waterman Heuristic algorithms BLAST FASTA

The Needleman-Wunsch algorithm, published in 1970, finds the optimal global alignment of two sequences by maximizing the number of amino acid matches and minimizing the number of gaps necessary to align the two sequences.

The Smith-Waterman algorithm, published in 1981, is very similar to the Needleman-Wunsch algorithm but is a local sequence alignment algorithm: instead of aligning the entire length of two protein sequences, it finds the region of highest similarity between two proteins.

Heuristic Methods
Thus far, we have discussed optimal sequence alignment methods, which find the highest scoring alignment for any pair of protein sequences. However, these algorithms are often too slow to search an entire database in reasonable time. Heuristic, or approximate, algorithms like FASTA and BLAST were therefore developed to speed up the process while attempting to keep as much sensitivity as possible.

BLAST
The BLAST (Basic Local Alignment Search Tool) algorithm was developed by Altschul et al. in 1990 and, like the FASTA algorithm, is a heuristic pairwise sequence aligner. The query protein sequence is broken into short words. Using a substitution matrix, a list of other words, called a neighborhood, is created for each word found in the protein sequence; these words must be related to the original word and must reach a substitution matrix score of at least T against it, else they are not considered. For fast access to these data, the word positions are entered into a hash table.
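The neighborhood and hash-table steps can be illustrated with a toy example. Everything named here is an assumption made for the sketch: the residue groups and the scores in `toy_score` are a made-up stand-in for a real substitution matrix such as BLOSUM62, the threshold is treated as inclusive (score at least T), and `neighborhood` and `build_word_table` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical residue groups; NOT a real substitution matrix.
GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def toy_score(a, b):
    """Toy similarity: identical 4, same group 1, otherwise -2."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -2


def neighborhood(word, T, score=toy_score):
    """All words of the same length scoring at least T against `word`."""
    return [
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= T
    ]


def build_word_table(query, w, T):
    """Hash table: neighborhood word -> query positions it was derived from."""
    table = defaultdict(list)
    for pos in range(len(query) - w + 1):
        for nb in neighborhood(query[pos:pos + w], T):
            table[nb].append(pos)
    return table


table = build_word_table("ILKIL", w=3, T=9)
print(sorted(neighborhood("ILK", T=9)))
print(table["VLK"])   # positions in the query whose neighborhood contains VLK
```

A database scan then only needs one dictionary lookup per database word to find every query position it could seed, which is where the speed-up over a full dynamic-programming comparison comes from.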

Pairwise comparison
-- Local alignment: identify the most similar region shared between two sequences (Smith-Waterman)
-- Global alignment: align over the length of both sequences (Needleman-Wunsch)

Global and local alignment
Sequence 1: TEGNAP VELED VOLTAM
Sequence 2: TEGNAP VELED MAGOLTAM VELE DALOLTAM

Global

    TEGNAP VELED MAGOLTAM VELE DALOLTAM
    ::::::::::::          :       :::::
    TEGNAP VELED----------V-------OLTAM

    TEGNAP VELED MAGOLTAM VELE DALOLTAM
    ::::::::::::   .:::::
    TEGNAP-VELED---VOLTAM--------------

    TEGNAP VELED MAGOLTAM VELE DALOLTAM
    ::::::::::::                 .:::::
    TEGNAP VELED ----------------VOLTAM

    TEGNAP VELED MAGOLTAM VELE DALOLTAM
    ::::::                :::: : .:::::
    TEGNAP----------------VELE-D-VOLTAM

Local

    TEGNAP VELED MAGOLTAM
    ::::::::::::   .:::::
    TEGNAP VELED---VOLTAM

    TEGNAP VELED
    :::::: :::::
    TEGNAP VELED

    VELE DALOLTAM
    :::: : .:::::
    VELE-D-VOLTAM

Multiple Sequence Alignment (MSA) and Trees
Take, for example, the three sequences:

    1 ASWTFGHK
    2 GTWSFANR
    3 ATWAFADR

and you see immediately that 2 and 3 are close, while 1 is further away. So the tree will look roughly like:
[Figure: tree in which 3 and 2 are joined first and 1 branches off separately]
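The "2 and 3 are close" observation can be checked by counting identical positions for each pair. This is only a sketch of the intuition: counting identities is an assumed, deliberately crude distance measure, and `identity_count` is a hypothetical helper, not a tree-building method.

```python
from itertools import combinations

seqs = {1: "ASWTFGHK", 2: "GTWSFANR", 3: "ATWAFADR"}


def identity_count(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


# Pairwise identities for every pair of sequence labels.
pairs = {(i, j): identity_count(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}
closest = max(pairs, key=pairs.get)

for pair, n in pairs.items():
    print(pair, f"{n}/8 identical")
print("closest pair:", closest)
```

The pair with the most identities is the one a simple clustering approach would join first, leaving the remaining sequence as the outlying branch.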

Useful terms and expressions in sequence alignment

Scoring Matrix/Substitution Matrix
-- Used to score the quality of an alignment
-- Contains scores for pairs of residues (amino acids or nucleic acids) in a sequence alignment
-- For protein/protein comparisons: a 20 x 20 matrix of similarity scores where identical amino acids and those of similar character (e.g. Ile, Leu) give higher scores compared to those of different character (e.g. Ile, Asp)
-- Symmetric

Protein Scoring Systems
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
[Figure: Venn diagram grouping the amino acids by property: aliphatic, tiny, small, hydrophobic, aromatic, charged, positive, polar]

Protein Scoring Systems
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
Scoring matrices reflect:
-- the probabilities of mutual substitutions
-- the probability of occurrence of each amino acid
Widely used scoring matrices:
-- PAM
-- BLOSUM

DNA Scoring Systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Negative scoring values to penalize mismatches:

         A   T   C   G
     A   5  -4  -4  -4
     T  -4   5  -4  -4
     C  -4  -4   5  -4
     G  -4  -4  -4   5

Matches: 5
Mismatches: 19
Score: 5 x 5 + 19 x (-4) = -51
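The arithmetic on this slide can be reproduced with a few lines of Python. The function names are hypothetical, and the sketch assumes an ungapped comparison of two equal-length strings; the +5/-4 values are those of the matrix above.

```python
def dna_score(a, b, match=5, mismatch=-4):
    """Score two equal-length DNA strings position by position (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a.upper(), b.upper()))


def score_from_counts(matches, mismatches, match=5, mismatch=-4):
    """Score computed directly from the number of matches and mismatches."""
    return matches * match + mismatches * mismatch


print(score_from_counts(5, 19))      # the slide's 5 matches and 19 mismatches
print(dna_score("acgt", "acga"))     # three matches, one mismatch
```

Because a mismatch costs almost as much as a match earns, an alignment needs well over 40% identity before its score becomes positive under this scheme.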

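Dotplots
Scoring matrix:

         A   T   G   C
     A   5  -4  -4  -4
     T  -4   5  -4  -4
     G  -4  -4   5  -4
     C  -4  -4  -4   5

    CCTCCTTTGT
    CCTCCTTTGT
    5 5 5 5 5 5 5 5 5 5       Point = 50

    CCTCCTTTGG
    CCTCCCTTAG
    5 5 5 5 5 -4 5 5 -4 5     Point = 32

Both mismatches are silent at the protein level: CCT and CCC both encode Pro, and TTG and TTA both encode Leu.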

Substitution Matrices
Not all amino acids are equal
-- Residues mutate more easily to similar ones
-- Residues at surface mutate more easily
-- Aromatics mutate preferably into aromatics
-- Mutations tend to favor some substitutions
-- Core tends to be hydrophobic
-- Selection tends to favor some substitutions
-- Cysteines are dangerous at the surface
-- Cysteines in bridges seldom mutate

Which Matrix to use?
-- Close relationships: low PAM, high BLOSUM
-- Distant relationships: high PAM, low BLOSUM

    More conserved                          More variable
    BLOSUM 80         BLOSUM 62             BLOSUM 45
    PAM 20            PAM 120               PAM 250

Often used defaults are: PAM250, BLOSUM62

BLOSUM62 Substitution Matrix
-- Zero: substitution occurs as often as expected by chance
-- +: more often than chance
-- -: less often than chance
-- Arranged by side groups, so high scores fall in the end boxes
-- Example: M, I, L, V are interchangeable
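One common way to make "more than chance" precise, given here as general background rather than something stated on the slide, is the log-odds form of a substitution score. The symbols are introduced for this note only: q_ab is the frequency with which residues a and b are observed aligned, p_a and p_b are their background frequencies, and λ is a scaling constant.

```latex
s(a,b) \;=\; \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b}
```

The score is zero when the pair is seen exactly as often as chance predicts (q_ab = p_a p_b), positive when it is seen more often, and negative when it is seen less often, which matches the three cases listed above.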

PAM250 Matrix

Scoring example
Score of an alignment is the sum of the scores of all pairs of residues in the alignment.

    sequence 1:  T  C  C  P  S  I  V  A  R  S  N
    sequence 2:  S  C  C  P  S  I  S  A  R  N  T
    score:       1 12 12  6  2  5 -1  2  6  1  0

=> alignment score = 46
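The sum on this slide can be written out as a short Python sketch. The per-pair scores are copied from the slide; the lookup table built from them covers only the residue pairs that occur in this example, and `alignment_score` is a hypothetical helper name.

```python
seq1 = "TCCPSIVARSN"
seq2 = "SCCPSISARNT"
slide_scores = [1, 12, 12, 6, 2, 5, -1, 2, 6, 1, 0]

# Build a symmetric lookup: an unordered residue pair -> its score on the slide.
pair_table = {frozenset((a, b)): s for a, b, s in zip(seq1, seq2, slide_scores)}


def alignment_score(a, b, table):
    """Sum the substitution scores of all aligned residue pairs (no gaps)."""
    return sum(table[frozenset((x, y))] for x, y in zip(a, b))


print(alignment_score(seq1, seq2, pair_table))
```

Using an unordered pair as the key reflects the fact that a substitution matrix is symmetric, so swapping the two sequences gives the same total.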

BLAST
Question: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence?
BLAST finds the highest scoring locally optimal alignments between a query sequence and a database. It compares new genes to known ones from different species or hosts and suggests possible functions based on similarities to known sequences.
-- Very fast algorithm
-- Can be used to search extremely large databases
-- Sufficiently sensitive and selective for most purposes
-- Robust: the default parameters can usually be used

BLAST is like using Google for DNA sequences

BLAST Algorithm
Step 1: Read/understand user query sequence.
Step 2: Use hashing technology to select several thousand likely candidates.
Step 3: Do a real alignment between the query sequence and those likely candidates. Real alignment is a main topic of this course.
Step 4: Present output to user.
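Steps 2 to 4 can be sketched as a seed-and-extend search. This is a heavily simplified illustration under stated assumptions, not BLAST itself: it seeds on exact shared words only (no neighborhood words), it "aligns" by extending each seed while residues are identical instead of scoring with a substitution matrix, and the function names and the tiny in-memory database are hypothetical.

```python
from collections import defaultdict


def build_index(database, w):
    """Step 2a: hash every word of length w to (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def extend_exact(query, subject, qpos, spos, w):
    """Step 3 (simplified): grow a seed left and right while residues match."""
    left = 0
    while (qpos - left - 1 >= 0 and spos - left - 1 >= 0
           and query[qpos - left - 1] == subject[spos - left - 1]):
        left += 1
    right = 0
    while (qpos + w + right < len(query) and spos + w + right < len(subject)
           and query[qpos + w + right] == subject[spos + w + right]):
        right += 1
    return left + w + right


def search(query, database, w=3):
    """Steps 2-4: look up seeds, extend them, report best hit per sequence."""
    index = build_index(database, w)
    best = {}
    for qpos in range(len(query) - w + 1):
        for name, spos in index.get(query[qpos:qpos + w], []):
            length = extend_exact(query, database[name], qpos, spos, w)
            best[name] = max(best.get(name, 0), length)
    return sorted(best.items(), key=lambda hit: -hit[1])


db = {"s1": "MKTAYIAKQR", "s2": "GGGGGGGG", "s3": "AYIAK"}
print(search("TAYIAK", db))
```

Sequences that share no word with the query (here `s2`) are never examined in step 3, which is why the hashing step makes the search fast.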

Steps in running BLAST: BLAST Input
-- Enter your query sequence (cut-and-paste)
-- Select the database(s) you want to search
-- Choose output parameters
-- Choose alignment parameters (scoring matrix, filters, ...)

Example query =
MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN

Alignment Significance in BLAST
P-value (probability)
-- Relates the score for an alignment to the likelihood that it arose by chance.
-- The closer to zero, the greater the confidence that the hit is real.
E-value (expect value)
-- The number of alignments with a score at least this high that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance).
-- A match will be reported if its E is below the threshold.
-- Lower E thresholds are more stringent and report fewer matches.
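The relationship between score, E-value, P-value and the reporting threshold can be sketched as below. The formulas are the standard Karlin-Altschul forms, E = K·m·n·e^(-λS) and P = 1 - e^(-E), which the slide does not spell out; K and λ depend on the scoring system and are left as required arguments rather than guessed, and the hit list is made-up illustration data.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: total database length; K and lam depend on the
    scoring system and must be supplied by the caller.
    """
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment, given its E-value."""
    return -math.expm1(-e)   # equals 1 - exp(-e), accurate for tiny e


def report(hits, threshold):
    """Keep only hits whose E-value is below the threshold."""
    return [hit for hit in hits if hit[1] < threshold]


hits = [("hit_a", 1e-30), ("hit_b", 5.0), ("hit_c", 0.01)]   # hypothetical hits
print(report(hits, threshold=0.05))
print(report(hits, threshold=1e-10))   # stricter threshold, fewer matches
print(p_value(10), p_value(1e-5))
```

For small E the two measures are nearly equal, while for large E the P-value saturates near 1; this is why E-values are the more informative figure when many chance hits are expected.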

BLAST Types

BLAST programs

    Program    Input      Database    Comparisons
    blastn     DNA        DNA         1
    blastp     protein    protein     1
    blastx     DNA        protein     6
    tblastn    protein    DNA         6
    tblastx    DNA        DNA         36
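The counts of 6 and 36 come from translating DNA in all six reading frames (three per strand): blastx and tblastn translate one side, tblastx translates both, giving 6 x 6 comparisons. A minimal sketch of six-frame translation, assuming the standard genetic code and using hypothetical helper names:

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translations of the three forward and three reverse reading frames."""
    rc = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(dna[k:])
        frames[f"-{k + 1}"] = translate(rc[k:])
    return frames


for name, protein in six_frames("ATGAAACCTCCTTTG").items():
    print(name, protein)
```

Each of the six translations is then compared as a protein sequence, which is why translated searches cost several times more than blastn or blastp on the same input.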

Guide to BLAST results

Database Searching Overview
Query sequence Q -> comparison algorithm (run against a database of sequences) -> list of similar protein sequences -> infer homologues and similar structures

Search with Protein not DNA
1) 4 DNA bases vs. 20 amino acids: less random similarity
2) Can have varying degrees of similarity between different amino acids
3) Protein databanks are much smaller than DNA databanks.

Pairwise alignment: protein sequences can be more informative than DNA
Many times, DNA alignments are appropriate
-- to confirm the identity of a cDNA
-- to study noncoding regions of DNA
-- to study DNA polymorphisms
-- example: Neanderthal vs modern human DNA

    Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
    Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

BLAST Databases

BLAST output

Graphic Display Colors

Output Parts

Practical: Go to BLAST in NCBI

Select BLAST Type

blastn

Results Page: Graphic view

Alignments


Target Sequence in NCBI

blastx parameters

tblastx Parameters
