BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Similar documents
BLAST. Varieties of BLAST

Basic Local Alignment Search Tool

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Heuristic Alignment and Searching

Bioinformatics and BLAST

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

GEP Annotation Report

SUPPLEMENTARY INFORMATION

Algorithms in Bioinformatics

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

Bioinformatics for Biologists

EECS730: Introduction to Bioinformatics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Hands-On Nine The PAX6 Gene and Protein

Comparative Bioinformatics Midterm II Fall 2004

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Computational Biology

Protein function prediction based on sequence analysis

Tools and Algorithms in Bioinformatics

Introduction to Bioinformatics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Alignment & BLAST. By: Hadi Mozafari KUMS

Sequence Alignment Techniques and Their Uses

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

Bioinformatics Chapter 1. Introduction

Session 5: Phylogenomics

MiGA: The Microbial Genome Atlas

Orthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Supplemental Materials

Tools and Algorithms in Bioinformatics

Genomics and bioinformatics summary. Finding genes -- computer searches

Computational approaches for functional genomics

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Single alignment: Substitution Matrix. 16 march 2017

Sequencing alignment Ameer Effat M. Elfarash

BIOINFORMATICS LAB AP BIOLOGY

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence analysis and comparison

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

In-Depth Assessment of Local Sequence Alignment

Comparing whole genomes

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Introduction to Bioinformatics

Computational methods for predicting protein-protein interactions

Comparative Network Analysis

A Browser for Pig Genome Data

EBI web resources II: Ensembl and InterPro

Sequencing alignment Ameer Effat M. Elfarash

Robert Edgar. Independent scientist

Orthology Part I: concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Bioinformatics Exercises

Example of Function Prediction

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Introduction to protein alignments

Sequences, Structures, and Gene Regulatory Networks

Annotation of Drosophila grimashawi Contig12

Introduction to Bioinformatics Introduction to Bioinformatics

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Motivating the need for optimal sequence alignments...

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Procedure to Create NCBI KOGS

Biol478/ August

An Introduction to Sequence Similarity ( Homology ) Searching

Practical considerations of working with sequencing data

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

-max_target_seqs: maximum number of targets to report

Using Bioinformatics to Study Evolutionary Relationships Instructions

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

Quantifying sequence similarity

BLAST: Basic Local Alignment Search Tool

Bioinformatics 1 lecture 13. Database searches. Profiles Orthologs/paralogs Tree of Life term projects

Translation Part 2 of Protein Synthesis

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Exploring Evolution & Bioinformatics

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

Fundamentals of database searching

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Microbiome: 16S rrna Sequencing 3/30/2018

bioinformatics 1 -- lecture 7

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Introduction to Sequence Alignment. Manpreet S. Katari

Georgia Standards of Excellence Biology

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

Sequence Alignment (chapter 6)

Chapter 7: Rapid alignment methods: FASTA and BLAST

Homology and Information Gathering and Domain Annotation for Proteins

Introduction to Evolutionary Concepts

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Transcription:

BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010

Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for next class 5 question quiz at the beginning of next class (using WebCT), please be sure to bring laptops Will cover readings & skills through today Reminder: Homework due at 5pm this Tuesday

Self-help at NCBI for BLAST Simple Tutorial and Guide available http://www.ncbi.nih.gov/education/blastinfo/information3.html Will use these with in-class exercises Many helpful, detailed discussions of BLAST read for help on homework assignments Entire book on BLAST, available on-line within the.ucsc.edu domain: http://proquest.safaribooksonline.com/0596002998

BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database However, many more choices (parameters) to make to adjust search success (over 30!!)

Varieties of BLAST Program Database Query Typical uses BLASTN Nucleotide Nucleotide BLASTP Protein Protein BLASTX Protein Nucleotide translated into protein TBLASTN Nucleotide translated into protein Protein TBLASTX Nucleotide translated into protein Nucleotide translated into protein Mapping oligonucleotides, cdnas, and PCR products to a genome; screening repetitive elements; crossspecies sequence exploration; annotating genomic DNA; clustering sequencing reads; vector clipping Identifying common regions between proteins; collecting related proteins for phylogenetic analyses Finding protein-coding genes in genomic DNA; determining if a cdna corresponds to a known protein Identifying transcripts, potentially from multiple organisms, similar to a given protein; mapping a protein to genomic DNA Cross-species gene prediction at the genome or transcript level; searching for genes missed by traditional methods or not yet in protein databases From: BLAST by Joseph Bedell, Ian Korf, Mark Yandell; O Rielly 2003

How Does it Work? Searches for short exact (nucleotide) or near-exact matches aka neighborhoods (protein) of certain word lengths Defaults: Blastp: 3 amino acids Blastn: 11 nucleotides Without an initial word match, can MISS possibly important matches With seed hit, tries to extend alignment in both directions A fully extended hit is an HSP (high scoring pair)

BLASTP Heuristics Most sequences will be unrelated Related sequences are likely to have short stretches of identities Use identical (or closely related) short words as seeds for local alignment Like BLASTn with shorter words Will return global alignment if proteins similar over whole length

Translated BLAST DNA-> protein, 3 reading frames upper sequence + 3 reading frames lower sequence Important when you don t know or trust gene sequencing quality or annotation Translating in all possible frames gives additional sensitivity, avoids reading frame errors due to incorrect gene prediction or sequencing errors Example: Gene to right of Pcal_1600 in Pyrobaculum calidifontis

Customized Applications Because there are so many parameters, a few web-versions of BLAST have parameters preset for specific applications MEGA BLAST find long alignments; allows you to specify min percent identity Search for short and near exact matches find short identical hits ~20 bases (i.e. for checking primers) More discussion: http://www.ncbi.nlm.nih.gov/blast/why.shtml

Assessing Significance Most Basic Rules of thumb: Two nucleotide sequences at least 70% identical, they are likely homologous Two protein sequences at least 25% identical over 100 amino acid alignment Does not take into account precise length of alignment, or number of gaps! Not sufficient to quantitatively rank hits from a database search

Re: The Twilight Zone Less than 25% sequence identity for two protein sequences May still be homologous, but only similarity of 3-D protein structures can verify similar function (structural comparison tools to detect these discussed later in quarter)

Quantitative Assessment of Significance: E-values & P-values Expressions of the same thing: E-value : number of expected random (not biological) matches in a given db search Examples: 0.001, 0.1, 1.0, 10, 100, 1000 P-value: probability that this hit is random (not biological) Examples: 0.1, 0.05, 0.0001, 1x10-3, 1x10-60

E/P - values Mathematical conversion between them: P-value = 1 e -(E-value ) For values < 0.01, E-value and P-value are nearly the same Takes into account: Length of sequence similarity Conservation of aligned nucleotides/amino acids Number and length of insertions and deletions Sizes of query sequence and database you are searching

What is Reliable? In biology P-value of 0.05 expect would be good enough (5 chances in 100 of not being correlated) Due to BLAST s estimation of significance, shouldn t blindly trust P or E values > 1x10-4 Note: Even with a good E-value, the match may be between paralogs with different function! Examine alignment for local areas of high similarity (are these known domains from CDD search?) For good measure, I don t have great confidence unless E value is less than 1x10-8

Beware Hit Transitivity! BLAST hits are not transitive, unless alignments are overlapping Seq1: AAAAABBBB Seq2: AAAAA Seq3: BBBB Seq2 and Seq3 not necessarily homologous!

Example Fibrillarin-like protein DNA: NM_001436, Protein: NP_001427 How far can we go in tree of life using nucleotide v. protein searches? Another query: Hox gene NM_153631.2 (HOX3A)

Why would you ever use BLASTN if BlastP is more Sensitive? Non-translated sequences (RNA genes, promotors, etc) Closely related species, where you expect sequence identity > 70%

Repetitive (Low Complexity) Element Filtering Removes sequences that occur commonly in genomes, but do not imply functional similarity SINEs, LINEs, and other selfish DNA elements Simple repeats: long runs of a short (1-4) nucleotide repeats due to errors in DNA replication or structural elements (telomeres) Protein: polymer tracks common in trans-membrane domains, etc. Always use UNLESS looking for ncrnas can remove biologically-important RNA hits!!

Limit Search Space If you only want hits to a specific species or phylogenetic group, it is **much** faster to only search that sub-group: Under Options for Advanced Blasting, use Limit by entrez query, or select from : Viruses, Archaea, Bacteria, Mammalia, Homo sapiens, etc.

A Related Note: Homology Based on inference that two sequences are ancestrally derived from same molecule If two sequences have high similarity, they may be inferred to be homologous It is WRONG to say two sequences or genes are 80% homologous (they either are related, or they are not)

Homology: Same Function? Even if two sequences are ancestrally derived from same molecule, they may or may not still have the same function Orthologs: homologous genes created by speciation Generally implies function remains the same Paralogs: homologous genes created by a gene duplication event (in same species) Implies function may have changed

Homology Diagram Source: http://www.ncbi.nlm.nih.gov/education/blastinfo/orthology.html