EECS730: Introduction to Bioinformatics

Lecture 05: Index-based alignment algorithms
Slides adapted from Dr. Shaojie Zhang (University of Central Florida)

Real applications of alignment: database search
Assume we have a gene g and a genome G, and we want to find the homolog of g in G. The Smith-Waterman algorithm (local alignment) would take $O(|g| \cdot |G|)$ time. Even more ambitious is searching g against a whole collection of genomes to find all of its homologs. As of 2014, 157,943,793,171 nt (~160 billion) were registered in the NCBI NT (non-redundant nucleotide) database.

Naïve Smith-Waterman
Given a newly discovered gene: Does it occur in other species? How fast does it evolve?
Our new gene: $10^4$
The entire genomic database: $10^{11}$ to $10^{12}$

Let's try a shorter one
>gi|57013850|sp|P69905.2|HBA_HUMAN RecName: Full=Hemoglobin subunit alpha; AltName: Full=Alpha-globin; AltName: Full=Hemoglobin alpha chain
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
YR

Different flavors of BLAST
BLASTN: nucleotide query against a nucleotide database
BLASTP: protein query against a protein database
BLASTX: translated nucleotide query against a protein database; finds proteins that may be encoded by the query
TBLASTX: translated nucleotide query against a translated nucleotide database; finds nucleotide sequences that code for the same or a similar protein
TBLASTN: protein query against a translated nucleotide database; finds nucleotide sequences that code for the query

Index-based searches (BLAST: Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query.
2. Initiate a local alignment for each word match between the query and the DB.
Running time: O(MN) in the worst case; in practice, however, orders of magnitude faster than Smith-Waterman.
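The two steps of the main idea can be sketched in Python as follows. This is a minimal illustration, not BLAST's actual implementation; the function names `build_word_index` and `find_seeds` and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict

def build_word_index(query, k):
    """Step 1: map every k-long word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def find_seeds(query, db, k):
    """Step 2: scan the DB once and report every (query_pos, db_pos) word match."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        # .get avoids creating empty entries for words absent from the query
        for i in index.get(db[j:j + k], ()):
            seeds.append((i, j))
    return seeds

# Toy sequences chosen for illustration only
seeds = find_seeds("AGCGTTAGGTCCTTC", "ACGAAGTAAGGTCCAGT", 4)
print(seeds)
```

Each reported pair is a candidate anchor from which a local alignment would be initiated; the DB is read only once, and the dictionary lookup replaces the cell-by-cell work of dynamic programming.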

Words
A word is a k-long sequence fragment, also called a k-mer.
Intuition: if we require a k-mer match to initiate an alignment, we can expect to speed up the search by a factor of about $a^k$ (a is the size of the alphabet: 4 for DNA, 20 for protein), given that the distribution of k-mers is uniform, because two random k-mers then match with probability $a^{-k}$.
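To make the intuition concrete, the expected number of chance word matches between two random sequences can be estimated as follows. This is a back-of-the-envelope sketch that assumes uniform, independent characters; the function name `expected_random_seeds` is hypothetical.

```python
def expected_random_seeds(m, n, k, alphabet_size):
    """Expected number of chance k-mer matches between random sequences
    of lengths m and n, assuming uniform and independent characters."""
    positions = max(m - k + 1, 0) * max(n - k + 1, 0)
    return positions / alphabet_size ** k

# A 10^4-long query against a 10^11-long database, DNA alphabet (a = 4)
for k in (7, 11, 15):
    print(k, expected_random_seeds(10**4, 10**11, k, 4))
```

Each increase of k by one divides the number of chance seeds, and hence the number of alignments that must be initiated, by roughly a.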

The indexing scheme
Dictionary: all words of length k (11-13, tunable).
Alignment is initiated between k-mer matches found while scanning the DB against the query dictionary.
Alignment: ungapped extensions until the score falls below a statistical threshold, then gapped extension until the score falls below a statistical threshold.
Output: all local alignments with score > statistical threshold.

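Example (k = 4)
Sequences (as printed along the axes of the figure): A C G A A G T A A G G T C C A G T and C T T C C T G G A T T G C G A; the second is printed in reverse and reads AGCGTTAGGTCCTTC from right to left.
The matching word GGTC initiates an alignment.
Extension to the left and right with no gaps continues until the alignment score falls more than T below the best score so far.
Output:
GTAAGGTCC
GTTAGGTCC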
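The ungapped extension in this example can be sketched as below. The scoring (+1 match, -1 mismatch), the drop-off value of 2, and the function name `extend_ungapped` are assumptions for illustration; the query string is the slide's second sequence read right to left, which is an interpretation of the figure.

```python
def extend_ungapped(q, d, qi, di, k, xdrop=2, match=1, mismatch=-1):
    """Extend a k-long exact seed (q[qi:qi+k] == d[di:di+k]) without gaps.

    Each direction stops once the running score falls more than `xdrop`
    below the best score seen in that direction, and is trimmed back to
    the best-scoring point. Returns (q_start, q_end, d_start, score),
    with q_end exclusive.
    """
    def walk(step, qpos, dpos):
        best = run = 0
        best_len = n = 0
        while 0 <= qpos < len(q) and 0 <= dpos < len(d):
            run += match if q[qpos] == d[dpos] else mismatch
            n += 1
            if run > best:
                best, best_len = run, n
            elif best - run > xdrop:
                break
            qpos += step
            dpos += step
        return best, best_len

    right_score, right_len = walk(+1, qi + k, di + k)
    left_score, left_len = walk(-1, qi - 1, di - 1)
    score = k * match + left_score + right_score
    return qi - left_len, qi + k + right_len, di - left_len, score

query = "AGCGTTAGGTCCTTC"
db = "ACGAAGTAAGGTCCAGT"
# GGTC starts at position 7 in the query and position 9 in the DB
qs, qe, ds, score = extend_ungapped(query, db, 7, 9, 4)
print(db[ds:ds + (qe - qs)])   # GTAAGGTCC
print(query[qs:qe])            # GTTAGGTCC
```

Under these assumed parameters the extension tolerates the single A/T mismatch on the left because the matches beyond it raise the score again, and it stops on the right where three mismatches in a row follow the final C.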

Gapped extensions
Sequences (as printed along the axes of the figure): A C G A A G T A A G G T C C A G T and G A T C C T G G A T T G C G A
Extensions with gaps are computed in a band around the anchor.
The extension terminates after a significant score drop-off.
Output:
GTAAGGTCC-AGT
GTTAGGTCCTAGT
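The idea of restricting gapped alignment to a band can be sketched with a banded dynamic program. This is a simplification: it scores a global alignment of two fixed substrings inside a band of diagonals, rather than extending outward with a score drop-off as BLAST does. The scoring values, the band width, and the name `banded_global_score` are assumptions.

```python
def banded_global_score(a, b, band=2, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b using only cells with
    |i - j| <= band. Cells outside the band are treated as unreachable."""
    NEG = float("-inf")
    n, m = len(a), len(b)
    prev = [NEG] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j * gap                      # first row: leading gaps
    for i in range(1, n + 1):
        cur = [NEG] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i * gap               # first column: leading gaps
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                 # gap in b
            left = cur[j - 1] + gap            # gap in a
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]

# The two rows of the slide's output, with the gap character removed
print(banded_global_score("GTAAGGTCCAGT", "GTTAGGTCCTAGT"))
```

Only about (2 * band + 1) cells per row are filled, so the cost grows linearly with the length of the region instead of quadratically.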

Sensitivity/speed tradeoff
Long words (k = 15) favor speed; short words (k = 7) favor sensitivity.
Kent WJ, Genome Research 2002

Using gapped seeds
Gapped seeds allow variations in between the matching positions:
ATAACGGACGACTGATTACACTGATTCTTAC
GGCACGGACCAGTGACTACTCTGATTCCCAG

The BLAST configuration

Gapped seeds Kent WJ, Genome Research 2002

Inexact matches
Pre-build high-scoring k-mer pairs that allow mismatches.
This rescues homologs that lack long stretches of perfect matches.
Inexact matches are used more often in protein alignment.
Example:
ATAACCGACCGTTAC
GGCACGGACAGTTCC
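A minimal sketch of pre-building inexact matches: for a given word, enumerate every word of the same length within a fixed number of substitutions. Counting mismatches is an assumption chosen to keep the example short; protein BLAST instead keeps neighbor words whose substitution-matrix score against the query word reaches a threshold. The name `neighborhood` is hypothetical.

```python
from itertools import product

def neighborhood(word, max_mismatches=1, alphabet="ACGT"):
    """All words of the same length within `max_mismatches` substitutions."""
    return {
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(a != b for a, b in zip(cand, word)) <= max_mismatches
    }

words = neighborhood("ACG")
print(len(words), sorted(words))
```

Indexing every neighbor of each query word lets an exact dictionary lookup in the DB find seeds that contain a mismatch, at the price of a larger dictionary and more chance hits. The brute-force enumeration here is only practical for small k.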

Inexact matches Kent WJ, Genome Research 2002

Reduced alphabet Ye et al. BMC Bioinformatics, 2011

Choice of reduced alphabet in seeding
[Figure: percentage of related alignments sharing the seed, and log ratio of the percentage of related alignments to the percentage of unrelated alignments sharing the seed]
Ye et al. BMC Bioinformatics, 2011
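Seeding on a reduced alphabet means mapping each amino acid to a group label before extracting words, so that substitutions within a group still produce a seed match. The grouping below is a hypothetical one based on rough physicochemical classes; it is not the alphabet evaluated by Ye et al.

```python
# Hypothetical grouping for illustration only; NOT the alphabet from Ye et al.
GROUPS = ["LIVMC", "AGSTP", "FYW", "EDNQ", "KRH"]

# Each amino acid is represented by the first letter of its group
REDUCE = {aa: group[0] for group in GROUPS for aa in group}

def reduce_sequence(protein):
    """Rewrite a protein sequence over the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in protein)

print(reduce_sequence("MVLSPADK"))
print(reduce_sequence("ILVTPGEK"))
```

Two peptides that differ only by within-group substitutions become identical after reduction, so an exact k-mer lookup on the reduced strings behaves like an inexact lookup on the original ones.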

Patterns
Patterns are non-consecutive words. They increase the probability of at least one hit while reducing the expected number of hits.
[Figure labels: 3 seeds; 2 seeds; no 5-mer perfect matches; pattern]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47

Patterns
Note that with patterns fewer seed hits are generated, so the search is also faster.
[Figure labels: 11 positions; 11 positions; 10 positions]
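Matching with a pattern can be sketched by masking each window before the dictionary lookup: only the positions marked `1` must agree, while positions marked `0` are free. The pattern strings and the names `masked_word` and `spaced_seed_hits` are illustrative assumptions, not the seeds used by any particular tool.

```python
from collections import defaultdict

def masked_word(seq, start, pattern):
    """Keep only the characters at the '1' positions of the pattern."""
    return "".join(seq[start + i] for i, bit in enumerate(pattern) if bit == "1")

def spaced_seed_hits(query, db, pattern):
    """Report (query_pos, db_pos) pairs whose windows agree at all '1' positions."""
    span = len(pattern)
    index = defaultdict(list)
    for i in range(len(query) - span + 1):
        index[masked_word(query, i, pattern)].append(i)
    hits = []
    for j in range(len(db) - span + 1):
        for i in index.get(masked_word(db, j, pattern), ()):
            hits.append((i, j))
    return hits

# A mismatch at the third position is tolerated by the pattern 1101
print(spaced_seed_hits("ACGTAC", "ACTTAC", "1101"))
# The consecutive word 1111 finds nothing in the same pair
print(spaced_seed_hits("ACGTAC", "ACTTAC", "1111"))
```

A consecutive k-mer is the special case where the pattern is all ones, so the same routine covers both kinds of seed.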

Variants of BLAST
NCBI BLAST: search the universe. http://www.ncbi.nlm.nih.gov/blast/
MEGABLAST: optimized to align very similar sequences; works best when k = 4i ≥ 16; linear gap penalty.
WU-BLAST (Wash U BLAST): very good optimizations; good set of features and command line arguments.
BLAT: faster, less sensitive than BLAST; good for aligning huge numbers of queries.
CHAOS: uses inexact k-mers; sensitive.
PatternHunter: uses patterns instead of k-mers.
BlastZ: uses patterns; good for finding genes.

BLAST statistics
Score matrices used to seek local alignments of variable length should have a negative expected score. Otherwise the alignment would span the entire sequence, since even random sequences would produce a good alignment score.
$\sum_{i,j} p_i p_j S_{i,j} < 0$, where $p_i$ is the frequency of character i, $p_j$ is the frequency of character j, and $S_{i,j}$ is the score for substituting character i with j.
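The negative expected score condition is easy to check numerically. The sketch below assumes a uniform DNA background and a +1/-1 scoring scheme, both chosen only for illustration; `expected_score` is a hypothetical helper name.

```python
def expected_score(freqs, score):
    """Sum over all character pairs of p_i * p_j * S(i, j)."""
    return sum(freqs[a] * freqs[b] * score(a, b) for a in freqs for b in freqs)

# Assumed background: uniform DNA; assumed scores: +1 match, -1 mismatch
dna = {base: 0.25 for base in "ACGT"}

def simple_score(a, b):
    return 1 if a == b else -1

print(expected_score(dna, simple_score))  # 4/16 * (+1) + 12/16 * (-1) = -0.5
```

A scheme such as +1 for a match and 0 for a mismatch would give a positive expected score under the same background, so it would fail the condition.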

Log-odds score
Let $S_{i,j}$ be the scaled log-odds score for substituting i with j (scaled and rounded to integers to simplify score computation).
$S_{i,j} = \ln\left(q_{i,j} / (p_i p_j)\right) / \lambda$, where $q_{i,j}$ is the frequency of substituting i with j and $\lambda$ is a scaling parameter that converts the score back to the probability space.
The task is to find the appropriate value for $\lambda$.

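Finding λ
We know that the sum of all $q_{i,j}$ should be 1. Recall that $S_{i,j} = \ln\left(q_{i,j} / (p_i p_j)\right) / \lambda$, so $q_{i,j} = p_i p_j e^{\lambda S_{i,j}}$ and
$f(\lambda) = \sum_{i,j} p_i p_j e^{\lambda S_{i,j}} = 1$.
We know that $\lambda = 0$ is always a solution, i.e. $f(0) = 1$, but it is the trivial one.
$f'(0) = \sum_{i,j} p_i p_j S_{i,j} < 0$ (the negative expected score assumption).
$f''(\lambda) = \sum_{i,j} p_i p_j S_{i,j}^2 e^{\lambda S_{i,j}} > 0$.
So f is convex and dips below 1 just to the right of 0; provided at least one score is positive, f eventually grows without bound, so it crosses 1 exactly once more, at a unique positive λ.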

Sketching the function f(λ)
[Figure: plot of f(λ) against λ, with the level f = 1 marked]
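Because $f(\lambda) = \sum_{i,j} p_i p_j e^{\lambda S_{i,j}}$ is convex and falls below 1 just to the right of 0, the positive root of $f(\lambda) = 1$ can be found by bisection. The sketch below assumes a uniform DNA background and +1/-1 scores, for which the root works out analytically to ln 3; `solve_lambda` is a hypothetical helper name.

```python
import math

def solve_lambda(freqs, score, tol=1e-12):
    """Find the positive root of f(lam) = sum p_i p_j exp(lam * S_ij) = 1.

    Assumes a negative expected score and at least one positive score,
    so that f < 1 just right of 0 and f > 1 for large lam.
    """
    def f(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
                   for a in freqs for b in freqs)

    lo, hi = 1e-6, 1.0
    while f(hi) <= 1.0:          # grow the bracket until f(hi) > 1
        hi *= 2.0
    while hi - lo > tol:         # bisection: keep f(lo) < 1 < f(hi)
        mid = (lo + hi) / 2.0
        if f(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

dna = {base: 0.25 for base in "ACGT"}

def simple_score(a, b):
    return 1 if a == b else -1

lam = solve_lambda(dna, simple_score)
print(lam, math.log(3))
```

For this scoring scheme the equation reduces to $0.25 e^{\lambda} + 0.75 e^{-\lambda} = 1$, whose solutions are $e^{\lambda} = 1$ (the trivial root) and $e^{\lambda} = 3$.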

The E-value
Given a particular scoring system, how many distinct local alignments with score at least S can one expect to find by chance when comparing two random sequences of lengths m and n?
The answer, E(S,m,n), should depend on S and on the lengths of the sequences compared. If we double m, we get twice as many local alignments; if we double n, we also get twice as many local alignments.
E(S,m,n) is proportional to m*n.

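Frequency of observing a stretch of alignment
$1 / \prod \left(q_{i,j} / (p_i p_j)\right) = e^{-\ln \prod \left(q_{i,j} / (p_i p_j)\right)} = e^{-\sum \ln\left(q_{i,j} / (p_i p_j)\right)}$
$e^{-\sum \ln\left(q_{i,j} / (p_i p_j)\right)} = e^{-\sum \lambda S_{i,j}} = e^{-\lambda \sum S_{i,j}} = e^{-\lambda S}$
Here the product and the sums run over the aligned pairs of the stretch, and S is its total score.
E(S,m,n) is proportional to $e^{-\lambda S}$.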

The E-value
E(S,m,n) is proportional to m*n.
E(S,m,n) is proportional to $e^{-\lambda S}$.
$E(S,m,n) = K \cdot m \cdot n \cdot e^{-\lambda S}$
K can be determined through simulation.
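The formula translates directly into code. In the sketch below the values of K and λ are placeholders chosen for illustration (λ = ln 3 corresponds to +1/-1 scoring on uniform DNA; K = 0.1 is arbitrary); real values depend on the scoring system. The name `e_value` is hypothetical.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of chance local alignments: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)

K, lam = 0.1, math.log(3)      # placeholder parameters, not fitted values
m, n = 10**4, 10**11           # query and database lengths

for s in (20, 30, 40, 50):
    print(s, e_value(s, m, n, K, lam))
```

The output shows both proportionalities at work: doubling m or n doubles E, while each extra unit of score divides E by $e^{\lambda}$.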

With gaps
The theory is provably valid only for local alignments without gaps. However, although no formal proof is available, random simulation suggests the theory remains valid when gaps are allowed, provided the gap costs are sufficiently large. In this case, no analytic formulas for the statistical parameters λ and K are available, but these parameters may be estimated by random simulation.
https://wwwold.cs.umd.edu/class/fall2011/cmsc858s/local_alignment_statistics.pdf