BLAST. Varieties of BLAST


BLAST: Basic Local Alignment Search Tool (1990)
Altschul, Gish, Miller, Myers, & Lipman
- Uses short-cuts, or heuristics, to improve search speed
- Like speed-reading, it does not examine every nucleotide of the database
- However, there are many more choices (parameters) to make to adjust search success (over 30!)

Varieties of BLAST

BLASTN
- Database: nucleotide
- Query: nucleotide
- Typical uses: mapping oligonucleotides, cDNAs, and PCR products to a genome; screening repetitive elements; cross-species sequence exploration; annotating genomic DNA; clustering sequencing reads; vector clipping

BLASTP
- Database: protein
- Query: protein
- Typical uses: identifying common regions between proteins; collecting related proteins for phylogenetic analyses

BLASTX
- Database: protein
- Query: nucleotide translated into protein
- Typical uses: finding protein-coding genes in genomic DNA; determining if a cDNA corresponds to a known protein

TBLASTN
- Database: nucleotide translated into protein
- Query: protein
- Typical uses: identifying transcripts, potentially from multiple organisms, similar to a given protein; mapping a protein to genomic DNA

TBLASTX
- Database: nucleotide translated into protein
- Query: nucleotide translated into protein
- Typical uses: cross-species gene prediction at the genome or transcript level; searching for genes missed by traditional methods or not yet in protein databases

From: BLAST by Joseph Bedell, Ian Korf, Mark Yandell; O'Reilly, 2003
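The translated programs listed above (BLASTX, TBLASTN, TBLASTX) compare sequences at the protein level by translating nucleotide sequence in every reading frame. The sketch below is only an illustration of that idea, not BLAST's own code. It assumes the standard genetic code (NCBI translation table 1); the function names are hypothetical.

```python
# Standard genetic code (NCBI translation table 1), with codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X") for p in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Hypothetical helper: translate frames +1..+3 and -1..-3 of a DNA string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCC").items():
        print(name, protein)
```

Stop codons are reported as `*` rather than ending the translation, so each frame stays readable along the full length of the input.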

How Does it Work?
- Searches for short exact matches (nucleotide) or near-exact matches, aka "neighborhoods" (protein), of certain word lengths
- Defaults: BLASTP: 3 amino acids; BLASTN: 11 nucleotides
- Without an initial word match, BLAST can MISS possibly important matches
- With a seed hit, it tries to extend the alignment in both directions
- A fully extended hit is an HSP (high-scoring segment pair)

Translated BLAST
- DNA -> protein: 3 reading frames on the upper strand + 3 reading frames on the lower strand
- Important when you don't know, or don't trust, the sequencing quality or gene annotation
- Translating in all possible frames gives additional sensitivity and avoids reading-frame errors due to incorrect gene prediction or sequencing errors
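The seed-and-extend idea for nucleotide searches can be sketched as follows. This is a simplified illustration, not the actual BLAST implementation: it handles exact word matches and ungapped extension only. The match/mismatch scores and the drop-off threshold `x_drop` are assumed values chosen for the example, and all function names are hypothetical.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word match starts."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def _extend_one_way(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction; stop once the score falls x_drop below its best."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, qi, sj, word_size=11,
                match=1, mismatch=-3, x_drop=5):
    """Extend a seed both ways without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    right, right_len = _extend_one_way(
        query, subject, qi + word_size, sj + word_size, 1, match, mismatch, x_drop)
    left, left_len = _extend_one_way(
        query, subject, qi - 1, sj - 1, -1, match, mismatch, x_drop)
    score = word_size * match + left + right
    return score, qi - left_len, qi + word_size + right_len


if __name__ == "__main__":
    q, s = "GATTACA", "CCGATTACACC"
    for qi, sj in find_seeds(q, s, word_size=4):
        print((qi, sj), extend_seed(q, s, qi, sj, word_size=4))
```

If the query and subject share no word of length `word_size`, `find_seeds` returns an empty list and nothing is extended, which is how a seed-based search can miss a real but divergent match.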

Customized Applications
- Because there are so many parameters, a few web versions of BLAST have parameters preset for specific applications
- MEGABLAST: finds long alignments; allows you to specify a minimum percent identity
- "Search for short and near exact matches": finds short identical hits of ~20 bases (e.g., for checking primers)
- More discussion: http://www.ncbi.nlm.nih.gov/blast/why.shtml

Assessing Significance
Most basic rules of thumb:
- Two nucleotide sequences that are at least 70% identical are likely homologous
- Two protein sequences that are at least 25% identical over a 100-amino-acid alignment are likely homologous
- These rules do not take into account the precise length of the alignment or the number of gaps!
- They are not sufficient to quantitatively rank hits from a database search
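To make the rule-of-thumb numbers concrete, here is a small sketch that computes percent identity from two already-aligned sequences. It assumes one common convention: identity is counted over all alignment columns, including gap columns, and `-` marks a gap. Other conventions use a different denominator. The function name is hypothetical.

```python
def alignment_summary(aligned_a, aligned_b, gap="-"):
    """Return (percent_identity, alignment_length, gap_columns).

    Percent identity here is identical columns divided by all columns,
    which is only one of several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    if length == 0:
        return 0.0, 0, 0
    identical = 0
    gap_columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gap_columns += 1
        elif a == b:
            identical += 1
    return 100.0 * identical / length, length, gap_columns


if __name__ == "__main__":
    identity, length, gaps = alignment_summary("ACGT-A", "ACCTTA")
    print(f"{identity:.1f}% identity over {length} columns, {gaps} gap column(s)")
```

Reporting length and gap count alongside the percentage is deliberate: the percentage alone says nothing about either, which is why it cannot rank database hits.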

Re: The Twilight Zone
- Less than 25% sequence identity for two protein sequences
- May still be homologous, but only similarity of 3-D protein structures can verify similar function (structural comparison tools to detect these are discussed later in the quarter)

Quantitative Assessment of Significance: E-values & P-values
Expressions of the same thing:
- E-value: number of expected random (not biological) matches in a given database search. Examples: 0.001, 0.1, 1.0, 10, 100, 1000
- P-value: probability that this hit is random (not biological). Examples: 0.1, 0.05, 0.0001, 1x10^-3, 1x10^-60

E/P-values
- Mathematical conversion between them: $P = 1 - e^{-E}$, where $E$ is the E-value and $P$ is the P-value
- For values < 0.01, the E-value and P-value are nearly the same
- Takes into account:
  - Length of sequence similarity
  - Conservation of aligned nucleotides/amino acids
  - Number and length of insertions and deletions
  - Sizes of the query sequence and of the database you are searching

What is Reliable?
- In biology, one would expect a P-value of 0.05 to be good enough (5 chances in 100 that the match is random)
- Because BLAST only estimates significance, you shouldn't blindly trust P- or E-values > 1x10^-4
- Note: even with a good E-value, the match may be between paralogs with different functions!
- Examine the alignment for local areas of high similarity (are these known domains from a CDD search?)
- For good measure, I don't have great confidence unless the E-value is less than 1x10^-8
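The conversion between E-values and P-values can be checked numerically. The sketch below uses Python's `math.expm1` and `math.log1p` so that very small values are not lost to rounding. The function names are hypothetical.

```python
import math


def e_to_p(e_value):
    """P = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e_value)


def p_to_e(p_value):
    """Inverse conversion, E = -ln(1 - P); requires 0 <= P < 1."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (1e-8, 1e-4, 0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:<8g} P = {e_to_p(e):.6g}")
```

Running the loop shows the two values agreeing closely below about 0.01 and diverging above it, since P can never exceed 1 while E can grow without bound.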

Beware Hit Transitivity!
- BLAST hits are not transitive, unless alignments are overlapping

  Seq1: AAAAABBBB
  Seq2: AAAAA
  Seq3:      BBBB

- Seq2 and Seq3 are not necessarily homologous!

Example
- Fibrillarin-like protein
- DNA: XM_293903, NR_024356
- Protein: XP_293903
- How far can we go in the tree of life using nucleotide vs. protein searches?
- Another query: Hox gene gi|13124744|sp|P31270|HXA11_HUMAN
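One way to screen for the transitivity trap is to check whether two hits cover the same region of the shared sequence. The sketch below assumes each hit is given as a half-open `(start, end)` interval on Seq1. The `min_overlap` threshold is an arbitrary assumption, and overlap is a necessary condition rather than proof of homology. The function names are hypothetical.

```python
def overlap_length(hit_a, hit_b):
    """Number of positions shared by two half-open (start, end) intervals."""
    return max(0, min(hit_a[1], hit_b[1]) - max(hit_a[0], hit_b[0]))


def hits_overlap(hit_a, hit_b, min_overlap=1):
    """True if the two hits share at least min_overlap positions."""
    return overlap_length(hit_a, hit_b) >= min_overlap


if __name__ == "__main__":
    # Seq1 = AAAAABBBB: Seq2 matches positions 0-5, Seq3 matches positions 5-9.
    seq2_on_seq1 = (0, 5)
    seq3_on_seq1 = (5, 9)
    print(hits_overlap(seq2_on_seq1, seq3_on_seq1))
```

In the Seq1/Seq2/Seq3 example the two intervals share no positions, so nothing can be inferred about Seq2 and Seq3 from their common hit to Seq1.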

Why would you ever use BLASTN if BLASTP is more sensitive?
- Non-translated sequences (RNA genes, promoters, etc.)
- Closely related species, where you expect sequence identity > 70%

A Related Note: Homology
- Based on the inference that two sequences are ancestrally derived from the same molecule
- If two sequences have high similarity, they may be inferred to be homologous
- It is WRONG to say two sequences or genes are "80% homologous" (they either are related, or they are not)

Homology: Same Function?
- Even if two sequences are ancestrally derived from the same molecule, they may or may not still have the same function
- Orthologs: homologous genes created by speciation. Generally implies function remains the same
- Paralogs: homologous genes created by a gene duplication event (in the same species). Implies function may have changed

Homology Diagram
Source: http://www.ncbi.nlm.nih.gov/education/blastinfo/orthology.html