Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Similar documents
Welcome to CS262: Computational Genomics

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Moreover, the circular logic

Lecture 2: Pairwise Alignment. CG Ron Shamir

EECS730: Introduction to Bioinformatics

Practical considerations of working with sequencing data

Bio nformatics. Lecture 3. Saad Mneimneh

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise sequence alignment

Introduction to Bioinformatics Algorithms Homework 3 Solution

Text Algorithms (4AP) Biological measures of similarity. Jaak Vilo 2008 fall. MTAT Text Algorithms. Complete genomes

Quantifying sequence similarity

Algorithms in Bioinformatics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Pairwise & Multiple sequence alignments

Analysis and Design of Algorithms Dynamic Programming

Collected Works of Charles Dickens

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Sequence analysis and Genomics

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

1.5 Sequence alignment

Local Alignment: Smith-Waterman algorithm

Bioinformatics and BLAST

Sequence Alignment (chapter 6)

Lecture 4: September 19

Practical Bioinformatics

6.6 Sequence Alignment

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Lecture 5: September Time Complexity Analysis of Local Alignment

Linear-Space Alignment

In-Depth Assessment of Local Sequence Alignment

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Pair Hidden Markov Models

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Algorithm Design and Analysis

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Computational Biology

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Sequence Comparison. mouse human

Lecture 5,6 Local sequence alignment

Sequence Alignment Techniques and Their Uses

CS 580: Algorithm Design and Analysis

Global alignments - review

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

String Matching Problem

Copyright 2000 N. AYDIN. All rights reserved. 1

CSE 202 Dynamic Programming II

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Unsupervised Vocabulary Induction

Motivating the need for optimal sequence alignments...

2 Pairwise alignment. 2.1 References. 2.2 Importance of sequence alignment. Introduction to the pairwise sequence alignment problem.

Large-Scale Genomic Surveys

Tools and Algorithms in Bioinformatics

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

NUMB3RS Activity: DNA Sequence Alignment. Episode: Guns and Roses

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Introduction to Bioinformatics

Whole Genome Alignments and Synteny Maps

Sequence analysis and comparison

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Pairwise sequence alignments

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Single alignment: Substitution Matrix. 16 march 2017

Pairwise Sequence Alignment

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

Sequence comparison: Score matrices

More Dynamic Programming

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

Chapter 6. Dynamic Programming. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

More Dynamic Programming

Introduction to Sequence Alignment. Manpreet S. Katari

Hidden Markov Models. x 1 x 2 x 3 x K

Tools and Algorithms in Bioinformatics

Dynamic Programming. Weighted Interval Scheduling. Algorithmic Paradigms. Dynamic Programming

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Evolutionary Models. Evolutionary Models

Alignment Algorithms. Alignment Algorithms

L3: Blast: Keyword match basics

Algorithms in Bioinformatics I, ZBIT, Uni Tübingen, Daniel Huson, WS 2003/4 1

Sequence Comparison with Mixed Convex and Concave Costs

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Pairwise sequence alignment and pair hidden Markov models

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Transcription:

Sequence Alignment

Evolution CT Amemiya et al. Nature 496, 311-316 (2013) doi:10.1038/nature12027

Evolutionary Rates next generation OK OK OK X X Still OK?

Sequence conservation implies function Alignment is the key to Finding important regions Determining function Uncovering evolutionary events

Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 y N, an alignment is an assignment of gaps to positions 0,, N in x, and 0,, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence

What is a good alignment? AGGCTAGTT, AGCGAAGTTT AGGCTAGTT- 6 matches, 3 mismatches, 1 gap AGCGAAGTTT AGGCTA-GTT- 7 matches, 1 mismatch, 3 gaps AG-CGAAGTTT AGGC-TA-GTT- AG-CG-AAGTTT 7 matches, 0 mismatches, 5 gaps

Scoring Function Sequence edits: AGGCCTC Mutations AGGACTC Insertions AGGGCCTC Deletions AGG. CTC Scoring Function: Match: +m Mismatch: -s Gap: -d Alternative definition: minimal edit distance Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other Score F = (# matches) m - (# mismatches) s (#gaps) d

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Too many possible alignments: (exercise) >> 2 N

Alignment is additive Observation: The score of aligning x 1 x M y 1 y N is additive Say that x 1 x i x i+1 x M aligns to y 1 y j y j+1 y N The two scores add up: F(x[1:M], y[1:n]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:n])

Dynamic Programming There are only a polynomial number of subproblems Align x 1 x i to y 1 y j Original problem is one of the subproblems Align x 1 x M to y 1 y N Each subproblem is easily solved from smaller subproblems We will show next Let F(i, j) = optimal score of aligning x 1 x i y 1 y j F: Dynamic Programming table

Dynamic Programming (cont d) Notice three possible cases: 1. x i aligns to y j x 1 x i-1 x i y 1 y j-1 y j F(i, j) = F(i 1, j 1) + m, if x i = y j -s, if not 2. x i aligns to a gap x 1 x i-1 x i y 1 y j - F(i, j) = F(i 1, j) d 3. y j aligns to a gap x 1 x i - y 1 y j-1 y j F(i, j) = F(i, j 1) d

Dynamic Programming (cont d) How do we know which case is correct? Inductive assumption: F(i, j 1), F(i 1, j), F(i 1, j 1) are optimal Then, F(i, j) = max F(i 1, j 1) + s(x i, y j ) F(i 1, j) d F(i, j 1) d Where s(x i, y j ) = m, if x i = y j ; -s, if not

Example x = AGTA m = 1 Procedure to output y = ATA s = -1 Alignment d = -1 F(i,j) i = 0 1 2 3 4 A G T A j = 0 0-1 -2-3 -4 1 A -1 1 0-1 -2 2 T -2 0 0 1 0 Follow the backpointers F(1, 1) = When max{f(0,0) diagonal, + s(a, A), OUTPUT x F(0, 1) i, y j d, F(1, 0) d} = When up, OUTPUT max{0 + 1, y j -1 1, When left, -1 1} = 1 OUTPUT x i 3 A -3-1 -1 0 2 A A G - T T A A

The Needleman-Wunsch Matrix x 1 x M y 1 y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences An optimal alignment is composed of optimal subalignments

The Needleman-Wunsch Algorithm Initialization. F(0, 0) = 0 F(0, j) = - j d F(i, 0) = - i d Main Iteration. Filling-in partial alignments For each i = 1 M For each j = 1 N F(i 1,j 1) + s(x i, y j ) [case 1] F(i, j) = max F(i 1, j) d [case 2] F(i, j 1) d [case 3] DIAG, if [case 1] Ptr(i, j) = LEFT, if [case 2] UP, if [case 3] 3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- Then, we don t want to penalize gaps in the ends

Different types of overlaps Example: 2 overlapping reads from a sequencing project Example: Search for a mouse gene within a human chromosome

The Overlap Detection variant x 1 x M Changes: y 1 y N 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2. Termination max i F(i, N) F OPT = max max j F(M, j)

The local alignment problem Given two strings x = x 1 x M, y = y 1 y N Find substrings x, y whose similarity (optimal global alignment value) is maximum x = aaaacccccggggtta y = ttcccgggaaccaacc

Why local alignment Genes are shuffled between genomes

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: F(0, j) = F(i, 0) = 0 Iteration: F(i, j) = max F(i 1, j) d 0 F(i, j 1) d F(i 1, j 1) + s(x i, y j )

The Smith-Waterman algorithm Termination: 1. If we want the best local alignment F OPT = max i,j F(i, j) Find F OPT and trace back 2. If we want all local alignments scoring > t?? For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments Waterman Eggert 87: find all non-overlapping local alignments with minimal recalculation of the DP matrix

Scoring the gaps more accurately Current model: γ(n) Gap of length n incurs penalty n d However, gaps usually occur in bunches Concave gap penalty function γ(n) (aka Convex -γ(n)): γ(n) γ(n): for all n, γ(n + 1) - γ(n) γ(n) - γ(n 1)

Convex gap dynamic programming Initialization: same Iteration: F(i 1, j 1) + s(x i, y j ) F(i, j) = max max k=0 i-1 F(k, j) γ(i k) max k=0 j-1 F(i, k) γ(j k) Termination: same Running Time: O(N 2 M) (assume N>M) Space: O(NM)

Compromise: affine gaps γ(n) = d + (n 1) e gap gap open extend To compute optimal alignment, γ(n) d e At position i, j, need to remember best score if gap is open best score if gap is not open F(i, j): score of alignment x 1 x i to y 1 y j if x i aligns to y j G(i, j): score if x i aligns to a gap after y j H(i, j): score if y j aligns to a gap after x i V(i, j) = best score of alignment x 1 x i to y 1 y j

Needleman-Wunsch with affine gaps Why do we need matrices F, G, H? Because, perhaps x i aligns to y j G(i, j) < V(i, j) x 1 x i-1 x i x i+1 y 1 y j-1 y j - (it is best to align x i to y j if we were aligning only x 1 x i to y 1 y j and not the rest of x, y), but on the contrary Add -d G(i+1, j) = F(i, j) d G(i, j) x e i aligns > V(i, j) to d a gap after y j x 1 x i-1 x i x i+1 y 1 y j - - (i.e., had we fixed our decision that x i aligns to y j, we could regret it at the next step when aligning x 1 x i+1 to y 1 y j ) Add -e G(i+1, j) = G(i, j) e

Needleman-Wunsch with affine gaps Initialization: V(i, 0) = - d - (i 1) e V(0, j) = - d - (j 1) e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i 1, j 1) + s(x i, y j ) G(i, j) = max V(i 1, j) d G(i 1, j) e Termination: V(i, j 1) d H(i, j) = max H(i, j 1) e V(i, j) has the best alignment Time? Space?

To generalize a bit think of how you would compute optimal alignment with this gap function γ(n).in time O(MN)

Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(n) x i Then, implies i j < k(n) y j We can align x and y more efficiently: Time, Space: O(N k(n)) << O(N 2 )

Bounded Dynamic Programming y 1 y N x 1 x M k(n) Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1 M For j = max(1, i k) min(n, i+k) F(i 1, j 1)+ s(x i, y j ) F(i, j) = max F(i, j 1) d, if j > i k(n) F(i 1, j) d, if j < i + k(n) Termination: same Easy to extend to the affine gap case

Outline Linear-Space Alignment BLAST local alignment search Ultra-fast alignment for (human) genome resequencing