Welcome to CS262: Computational Genomics


Instructor: Serafim Batzoglou
TA: Paul Chen
Email: cs262-win2015-staff@lists.stanford.edu
Tuesdays & Thursdays 12:50-2:05pm, Clark S361
http://cs262.stanford.edu

Goals of this course
Introduction to Computational Biology & Genomics
  Basic concepts and scientific questions
  Why does it matter?
  Basic biology for computer scientists
  In-depth coverage of algorithmic techniques
  Current active areas of research
Useful algorithms
  Dynamic programming
  String algorithms
  HMMs and other graphical models for sequence analysis

Topics in CS262
Part 1: Basic Algorithms
  Dynamic Programming & sequence alignment
  HMMs, CRFs & sequence modeling
  Sequence indexing; Burrows-Wheeler transform, de Bruijn graphs
Part 2: Topics in computational genomics and areas of active research
  DNA sequencing and assembly
  Comparative genomics
  Human genome resequencing
  Alignment
  Compression
  Human genome variation
  Cancer genomics
  Functional genomics
  Population genomics

Course responsibilities
Homeworks
  4 challenging problem sets, 4-5 problems/pset
  Due at beginning of class
  Up to 3 late days (24-hr periods) for the quarter
  Collaboration allowed; please give credit
  Teams of 2 or 3 students
  Individual writeups
  If individual (no team), then drop score of worst problem per problem set
(Optional) Scribing
  Due one week after the lecture, except with special permission
  Scribing grade replaces 2 lowest problems from all problem sets
  First-come first-serve, email staff list to sign up

Reading material
Main reading: lecture notes, papers
Optional: Biological Sequence Analysis by Durbin, Eddy, Krogh, Mitchison
  Chapters 1-4, 6, 7-8, 9-10

Birth of Molecular Biology
[Figure: DNA structure showing phosphate group, sugar, and nitrogenous base (A, C, G, T); portraits labeled "Physicist" and "Ornithologist"]

Genetics in the 20th Century

Human Genome Project
1990: Start (3 billion basepairs, $3 billion)
2000: Bill Clinton: "most important scientific discovery in the 20th century"
2001: Draft
2003: Finished
Now what?

Sequencing Growth
Cost of one human genome
  2004: $30,000,000
  2008: $100,000
  2010: $10,000
  2014: $1,000
  ???: $300
How much would you pay for a smartphone?

Uses of Genomes
Medicine
  Prenatal/Mendelian diseases
  Drug dosage (e.g. Warfarin)
  Disease risk
  Diagnosis of infections
Ancestry
Genealogy
Nutrition?
Psychology?
Baby Engineering???...
Ethical Issues

How soon will we all be sequenced?
[Figure: cost and applications plotted against time (2015? 2020?), annotated "Killer apps" and "Roadblocks?"]

Intro to Biology

Sequence Alignment

Evolution CT Amemiya et al. Nature 496, 311-316 (2013) doi:10.1038/nature12027

Evolution at the DNA level
Sequence edits: mutation, deletion
  ACGGTGCAGTTACCA
  AC----CAGTCCACCA
Rearrangements: inversion, translocation, duplication

Evolutionary Rates
[Figure: changes passed to the next generation, marked "OK", "X", or "Still OK?"]

Sequence conservation implies function
Alignment is the key to
  Finding important regions
  Determining function
  Uncovering evolutionary events

Sequence Alignment
Unaligned:
  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned:
  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings x = x_1 x_2 ... x_M and y = y_1 y_2 ... y_N, an alignment is an assignment of gaps to positions 0, ..., M in x and 0, ..., N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

What is a good alignment?
Sequences: AGGCTAGTT, AGCGAAGTTT
  AGGCTAGTT-
  AGCGAAGTTT
  6 matches, 3 mismatches, 1 gap
  AGGCTA-GTT-
  AG-CGAAGTTT
  7 matches, 1 mismatch, 3 gaps
  AGGC-TA-GTT-
  AG-CG-AAGTTT
  7 matches, 0 mismatches, 5 gaps

Scoring Function
Sequence edits (starting from AGGCCTC):
  Mutations:  AGGACTC
  Insertions: AGGGCCTC
  Deletions:  AGG.CTC
Scoring function:
  Match: +m
  Mismatch: -s
  Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
Alternative definition: minimal edit distance. Given two strings x, y, find the minimum # of edits (insertions, deletions, mutations) to transform one string to the other.
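The scoring function above can be checked by hand on the three candidate alignments of AGGCTAGTT and AGCGAAGTTT. The following is a minimal sketch, assuming m = s = d = 1 (the slides leave these as parameters); the name score_alignment is hypothetical and only used for illustration.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Score two gapped rows of equal length.

    F = (# matches) * m - (# mismatches) * s - (# gaps) * d
    A column counts as a gap if either row has '-' in it.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            score -= d      # gap column
        elif a == b:
            score += m      # match
        else:
            score -= s      # mismatch
    return score


# The three alignments from the "What is a good alignment?" slide.
print(score_alignment("AGGCTAGTT-", "AGCGAAGTTT"))      # 6 - 3 - 1 = 2
print(score_alignment("AGGCTA-GTT-", "AG-CGAAGTTT"))    # 7 - 1 - 3 = 3
print(score_alignment("AGGC-TA-GTT-", "AG-CG-AAGTTT"))  # 7 - 0 - 5 = 2
```

Under these particular weights the second alignment scores highest; other choices of m, s, d can change the ranking.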

How do we compute the best alignment?
  AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
  AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: >> 2^N (exercise)

Alignment is additive
Observation: The score of aligning x_1 ... x_M to y_1 ... y_N is additive.
Say that x_1 ... x_i aligns to y_1 ... y_j, and x_{i+1} ... x_M aligns to y_{j+1} ... y_N.
The two scores add up:
  F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Dynamic Programming
There are only a polynomial number of subproblems
  Align x_1 ... x_i to y_1 ... y_j
Original problem is one of the subproblems
  Align x_1 ... x_M to y_1 ... y_N
Each subproblem is easily solved from smaller subproblems
  We will show next
Then, we can apply Dynamic Programming!!!
Let F(i, j) = optimal score of aligning x_1 ... x_i to y_1 ... y_j
F is the DP "Matrix" or "Table" ("Memoization")

Dynamic Programming (cont'd)
Notice three possible cases:
1. x_i aligns to y_j
     x_1 ... x_{i-1} x_i
     y_1 ... y_{j-1} y_j
   F(i, j) = F(i-1, j-1) + m, if x_i = y_j
   F(i, j) = F(i-1, j-1) - s, if not
2. x_i aligns to a gap
     x_1 ... x_{i-1} x_i
     y_1 ... y_j     -
   F(i, j) = F(i-1, j) - d
3. y_j aligns to a gap
     x_1 ... x_i     -
     y_1 ... y_{j-1} y_j
   F(i, j) = F(i, j-1) - d

Dynamic Programming (cont'd)
How do we know which case is correct?
Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
Then,
  F(i, j) = max { F(i-1, j-1) + s(x_i, y_j),
                  F(i-1, j) - d,
                  F(i, j-1) - d }
where s(x_i, y_j) = m, if x_i = y_j; -s, if not

Example
x = AGTA, y = ATA
m = 1, s = 1, d = 1 (a match scores +1; a mismatch and a gap each score -1)

  F(i, j)   i = 0    1    2    3    4
                     A    G    T    A
  j = 0         0   -1   -2   -3   -4
  j = 1  A     -1    1    0   -1   -2
  j = 2  T     -2    0    0    1    0
  j = 3  A     -3   -1   -1    0    2

F(1, 1) = max{ F(0, 0) + s(A, A), F(0, 1) - d, F(1, 0) - d } = max{ 0 + 1, -1 - 1, -1 - 1 } = 1
Procedure to output the alignment: follow the backpointers
  When diagonal, OUTPUT x_i, y_j
  When up, OUTPUT y_j
  When left, OUTPUT x_i
Resulting alignment:
  A G T A
  A - T A

The Needleman-Wunsch Matrix
[Figure: DP matrix with x_1 ... x_M along one axis and y_1 ... y_N along the other]
Every nondecreasing path from (0, 0) to (M, N) corresponds to an alignment of the two sequences
An optimal alignment is composed of optimal subalignments

The Needleman-Wunsch Algorithm
1. Initialization.
   F(0, 0) = 0
   F(0, j) = -j × d
   F(i, 0) = -i × d
2. Main Iteration. Filling-in partial alignments
   For each i = 1 ... M
     For each j = 1 ... N
       F(i, j) = max { F(i-1, j-1) + s(x_i, y_j)   [case 1]
                       F(i-1, j) - d               [case 2]
                       F(i, j-1) - d               [case 3] }
       Ptr(i, j) = DIAG, if [case 1]
                   LEFT, if [case 2]
                   UP,   if [case 3]
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment
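The three steps above (initialization, main iteration, termination with traceback) can be sketched as follows. This is an illustrative sketch, not the course's reference implementation: the function name and the default weights m = s = d = 1 are assumptions, and ties between cases are broken in favor of DIAG, which the slides do not specify.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment; returns (optimal score, aligned x, aligned y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: leading gaps cost d each.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # 2. Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cases = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j aligns to a gap
            ]
            # max() keeps the first maximal entry, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(cases, key=lambda c: c[0])

    # 3. Termination: trace back from (M, N).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

On the slide example x = AGTA, y = ATA this reproduces the score F(4, 3) = 2 and the alignment obtained by following the backpointers.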

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

A variant of the basic algorithm:
Maybe it is OK to have an unlimited # of gaps in the beginning and end:
  ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
  GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Then, we don't want to penalize gaps in the ends

Different types of overlaps Example: 2 overlapping reads from a sequencing project Example: Search for a mouse gene within a human chromosome

The Overlap Detection variant
[Figure: DP matrix with x_1 ... x_M along one axis and y_1 ... y_N along the other]
Changes:
1. Initialization
   For all i, j,
     F(i, 0) = 0
     F(0, j) = 0
2. Termination
   F_OPT = max { max_i F(i, N), max_j F(M, j) }
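A score-only sketch of these two changes, assuming the same match/mismatch/gap weights as before (m = s = d = 1 by default); the function name and the toy strings are hypothetical. The iteration is the unchanged Needleman-Wunsch recurrence.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at either end of x or y are free."""
    M, N = len(x), len(y)
    # Initialization: F(i, 0) = F(0, j) = 0, so leading gaps are not penalized.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Termination: the best cell in the last row or last column of the
    # matrix, so trailing gaps are not penalized either.
    best_last_col = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_col, best_last_row)


# The suffix "CGT" of the first string overlaps the prefix "CGT" of the second.
print(overlap_score("AAACGT", "CGTTTT"))  # 3
```

Recovering the overlap itself would need the same backpointers as in Needleman-Wunsch, with the traceback starting at the maximizing cell and stopping at the first row or column.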

The local alignment problem
Given two strings x = x_1 ... x_M, y = y_1 ... y_N
Find substrings x', y' whose similarity (optimal global alignment value) is maximum
  x = aaaacccccggggtta
  y = ttcccgggaaccaacc

Why local alignment Genes are shuffled between genomes

Cross-species genome similarity
98% of genes are conserved between any two mammals
>70% average similarity in protein sequence
Example: atoh enhancer in human, mouse, rat, fugu fish
  hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
  mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
  rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
  fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174
  hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
  mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
  rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
  fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174
  hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
  mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
  rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
  fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174
  hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
  mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
  rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
  fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

The Smith-Waterman algorithm
Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
  Initialization: F(0, j) = F(i, 0) = 0
  Iteration:
    F(i, j) = max { 0,
                    F(i-1, j) - d,
                    F(i, j-1) - d,
                    F(i-1, j-1) + s(x_i, y_j) }

The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment
   F_OPT = max_{i,j} F(i, j)
   Find F_OPT and trace back
2. If we want all local alignments scoring > t
   For all i, j find F(i, j) > t, and trace back?
   Complicated by overlapping local alignments
   Waterman-Eggert '87: find all non-overlapping local alignments with minimal recalculation of the DP matrix
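The modifications above, together with termination option 1 (best local alignment only), can be sketched as follows. The function name and default weights m = s = d = 1 are assumptions; instead of storing Ptr, this sketch re-derives each traceback step from the recurrence and stops at the first cell with score 0.

```python
def smith_waterman(x, y, m=1, s=1, d=1):
    """Best local alignment; returns (score, aligned x', aligned y')."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # F(0, j) = F(i, 0) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            # The extra 0 lets an alignment restart after a bad region.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Trace back from the best cell until the score drops to 0.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        sub = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# Toy strings (hypothetical): only the shared core "ACGT" aligns well.
print(smith_waterman("xxACGTyy", "zzzACGTww"))  # (4, 'ACGT', 'ACGT')
# The pair from the local alignment problem slide.
print(smith_waterman("aaaacccccggggtta", "ttcccgggaaccaacc"))
```

Option 2 (all local alignments scoring above t) is not covered here; as the slide notes, it is complicated by overlapping alignments.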

Scoring the gaps more accurately
Current model: a gap of length n incurs penalty γ(n) = n × d
However, gaps usually occur in bunches
Concave gap penalty function γ(n) (aka convex -γ(n)):
  for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n - 1)
[Figure: plots of γ(n) for the linear and the concave model]

Convex gap dynamic programming
Initialization: same
Iteration:
  F(i, j) = max { F(i-1, j-1) + s(x_i, y_j),
                  max_{k=0...i-1} F(k, j) - γ(i - k),
                  max_{k=0...j-1} F(i, k) - γ(j - k) }
Termination: same
Running Time: O(N^2 M) (assume N > M)
Space: O(NM)
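A score-only sketch of this iteration with an arbitrary gap function passed in as a Python callable. The name general_gap_score is hypothetical, and reading "Initialization: same" as F(i, 0) = -γ(i), F(0, j) = -γ(j) is an assumption of this sketch.

```python
def general_gap_score(x, y, gamma, m=1, s=1):
    """Global alignment score where a gap of length n costs gamma(n)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed initialization: one leading gap of the full prefix length.
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            best = F[i - 1][j - 1] + sub
            # A gap of length i - k in y ending at x_i.
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap of length j - k in x ending at y_j.
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]


# With gamma(n) = n this reduces to the linear-gap score of the earlier example.
print(general_gap_score("AGTA", "ATA", gamma=lambda n: n))  # 2
# A gap function that charges 3 to open and 1 per additional position.
print(general_gap_score("AAAGGG", "AAACCCGGG", gamma=lambda n: 3 + (n - 1)))  # 1
```

The two inner loops over k are what raise the running time from O(NM) to O(N^2 M).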

Compromise: affine gaps
γ(n) = d + (n - 1) × e
  d: gap open penalty
  e: gap extend penalty
[Figure: plot of γ(n), starting at d and rising with slope e]
To compute the optimal alignment, at position i, j, we need to remember the best score if a gap is open and the best score if a gap is not open:
  F(i, j): score of alignment x_1 ... x_i to y_1 ... y_j if x_i aligns to y_j
  G(i, j): score if x_i aligns to a gap after y_j
  H(i, j): score if y_j aligns to a gap after x_i
  V(i, j) = best score of alignment x_1 ... x_i to y_1 ... y_j

Needleman-Wunsch with affine gaps
Why do we need matrices F, G, H?
Because, perhaps
  G(i, j) < V(i, j)
(it is best to align x_i to y_j if we were aligning only x_1 ... x_i to y_1 ... y_j and not the rest of x, y), but on the contrary
  G(i, j) - e > V(i, j) - d
(i.e., had we fixed our decision that x_i aligns to y_j, we could regret it at the next step when aligning x_1 ... x_{i+1} to y_1 ... y_j)
Case: x_i aligns to y_j
  x_1 ... x_{i-1} x_i x_{i+1}
  y_1 ... y_{j-1} y_j -
  Add -d: G(i+1, j) = F(i, j) - d
Case: x_i aligns to a gap after y_j
  x_1 ... x_{i-1} x_i x_{i+1}
  y_1 ... y_j     -   -
  Add -e: G(i+1, j) = G(i, j) - e

Needleman-Wunsch with affine gaps
Initialization (as scores, so the gap penalties enter with a minus sign):
  V(i, 0) = -(d + (i - 1) × e)
  V(0, j) = -(d + (j - 1) × e)
Iteration:
  V(i, j) = max{ F(i, j), G(i, j), H(i, j) }
  F(i, j) = V(i-1, j-1) + s(x_i, y_j)
  G(i, j) = max { V(i-1, j) - d,
                  G(i-1, j) - e }
  H(i, j) = max { V(i, j-1) - d,
                  H(i, j-1) - e }
Termination: V(M, N) holds the score of the best alignment
Time? Space?
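A score-only sketch of the V, G, H recurrences above. The function name, the default weights (m = 1, s = 1, d = 3, e = 1), and the use of minus infinity for undefined G and H border cells are assumptions of this sketch; F is folded into a local variable because only V, G, and H are read by later cells.

```python
NEG_INF = float("-inf")


def affine_gap_score(x, y, m=1, s=1, d=3, e=1):
    """Global alignment score with gap penalty gamma(n) = d + (n - 1) * e."""
    M, N = len(x), len(y)
    V = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    G = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # x_i aligned to a gap
    H = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # y_j aligned to a gap

    # Initialization: a single leading gap of length i (or j).
    V[0][0] = 0
    for i in range(1, M + 1):
        G[i][0] = V[i][0] = -(d + (i - 1) * e)
    for j in range(1, N + 1):
        H[0][j] = V[0][j] = -(d + (j - 1) * e)

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            f = V[i - 1][j - 1] + sub                         # F(i, j)
            G[i][j] = max(V[i - 1][j] - d, G[i - 1][j] - e)   # open or extend
            H[i][j] = max(V[i][j - 1] - d, H[i][j - 1] - e)   # open or extend
            V[i][j] = max(f, G[i][j], H[i][j])
    return V[M][N]


# Six matches and one gap of length 3: 6 - (3 + 2 * 1) = 1.
print(affine_gap_score("AAAGGG", "AAACCCGGG"))  # 1
```

Each cell still costs constant work, which suggests an answer to the slide's "Time? Space?" question: O(MN) for both, with a constant factor of about three.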

To generalize a bit: think of how you would compute the optimal alignment with this gap function γ(n) in time O(MN).
[Figure: plot of the gap function γ(n)]

Bounded Dynamic Programming
Assume we know that x and y are very similar
Assumption: # gaps(x, y) < k(N)
Then, x_i aligned to y_j implies |i - j| < k(N)
We can align x and y more efficiently:
  Time, Space: O(N × k(N)) << O(N^2)

Bounded Dynamic Programming
[Figure: DP matrix with x_1 ... x_M and y_1 ... y_N, showing a band of width k(N) around the main diagonal]
Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
  For i = 1 ... M
    For j = max(1, i - k) ... min(N, i + k)
      F(i, j) = max { F(i-1, j-1) + s(x_i, y_j),
                      F(i, j-1) - d, if j > i - k(N)
                      F(i-1, j) - d, if j < i + k(N) }
Termination: same
Easy to extend to the affine gap case
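A score-only sketch of the banded iteration. Storing cells in a dictionary keyed by (i, j) and treating missing cells as minus infinity is a choice made for this sketch (the slides only say those cells are undefined); the function name and the weights m = s = d = 1 are likewise assumptions.

```python
def banded_nw_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow: cell (M, N) lies outside it")
    NEG_INF = float("-inf")
    F = {(0, 0): 0}
    # Border cells exist only inside the band.
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            # Neighbors outside the band are missing and count as -infinity,
            # which plays the role of the "if j > i - k" / "if j < i + k" guards.
            F[(i, j)] = max(F.get((i - 1, j - 1), NEG_INF) + sub,
                            F.get((i, j - 1), NEG_INF) - d,
                            F.get((i - 1, j), NEG_INF) - d)
    return F[(M, N)]


# The optimal path for AGTA / ATA stays within one cell of the diagonal.
print(banded_nw_score("AGTA", "ATA", k=1))  # 2
```

The dictionary holds at most about (2k + 1) entries per row, which is where the O(N × k(N)) space bound comes from. If the true optimal alignment leaves the band, the returned score is only a lower bound on the unrestricted optimum.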

Outline
  Linear-Space Alignment
  BLAST local alignment search
  Ultra-fast alignment for (human) genome resequencing

Linear-Space Alignment

Subsequences and Substrings
Definition: A string x' is a substring of a string x if x = u x' v for some prefix string u and suffix string v
  (similarly, x' = x_i ... x_j, for some 1 ≤ i ≤ j ≤ |x|)
A string x' is a subsequence of a string x if x' can be obtained from x by deleting 0 or more letters
  (x' = x_{i_1} ... x_{i_k}, for some 1 ≤ i_1 < ... < i_k ≤ |x|)
Note: a substring is always a subsequence
Example: x = abracadabra
  y = cadabr; substring
  z = brcdbr; subsequence, not substring

Hirschberg's algorithm
Given a set of strings x, y, ..., a common subsequence is a string u that is a subsequence of all strings x, y, ...
Longest common subsequence: Given strings x = x_1 x_2 ... x_M, y = y_1 y_2 ... y_N, find a longest common subsequence u = u_1 ... u_k
Algorithm:
  F(i, j) = max { F(i-1, j),
                  F(i, j-1),
                  F(i-1, j-1) + [1, if x_i = y_j; 0 otherwise] }
  Ptr(i, j) = (same as in N-W)
Termination: trace back from Ptr(M, N), and prepend a letter to u whenever Ptr(i, j) = DIAG and F(i-1, j-1) < F(i, j)
Hirschberg's original algorithm solves this in linear space
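The quadratic-space version of this longest common subsequence recurrence can be sketched as follows. The function name is hypothetical, and the traceback recomputes the move from F rather than storing Ptr, which is a simplification made for this sketch.

```python
def longest_common_subsequence(x, y):
    """Return one longest common subsequence of x and y (O(MN) space)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0
            F[i][j] = max(F[i - 1][j],
                          F[i][j - 1],
                          F[i - 1][j - 1] + match)

    # Trace back from (M, N); emit a letter on each diagonal step
    # that increased the score.
    u = []
    i, j = M, N
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            u.append(x[i - 1]); i -= 1; j -= 1
        elif F[i - 1][j] >= F[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(u))


# z = brcdbr is a subsequence of abracadabra, so it is its own LCS with x.
print(longest_common_subsequence("abracadabra", "brcdbr"))  # brcdbr
```

This is the same computation as Needleman-Wunsch with m = 1, s = 0, d = 0, which is why Hirschberg's linear-space technique carries over to alignment.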

Introduction: Compute optimal score
It is easy to compute F(M, N) in linear space
  Allocate(column[1])
  Allocate(column[2])
  For i = 1 ... M
    If i > 1, then:
      Free(column[i - 2])
      Allocate(column[i])
    For j = 1 ... N
      F(i, j) = ...
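A minimal sketch of the two-column idea, assuming the linear-gap Needleman-Wunsch recurrence with m = s = d = 1 by default. Python lists stand in for the explicitly allocated and freed columns, and the function name is hypothetical.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score F(M, N) using O(N) memory."""
    N = len(y)
    # prev holds column i - 1 of the DP matrix: F(i - 1, 0..N).
    prev = [-j * d for j in range(N + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * N          # F(i, 0) = -i * d
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub,   # x_i aligns to y_j
                          prev[j] - d,         # x_i aligns to a gap
                          curr[j - 1] - d)     # y_j aligns to a gap
        prev = curr                        # column i - 1 is no longer needed
    return prev[N]


print(nw_score_linear_space("AGTA", "ATA"))  # 2
```

Only the score survives; the backpointers needed for the alignment itself are discarded along with the old columns, which is the problem the divide-and-conquer approach addresses.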

Linear-space alignment
To compute both the optimal score and the optimal alignment:
Divide & Conquer approach:
Notation:
  x^r, y^r: reverse of x, y
  E.g. x = accgg; x^r = ggcca
  F^r(i, j): optimal score of aligning x^r_1 ... x^r_i & y^r_1 ... y^r_j
             same as aligning x_{M-i+1} ... x_M & y_{N-j+1} ... y_N

Linear-space alignment
Lemma: (assume M is even)
  F(M, N) = max_{k=0...N} ( F(M/2, k) + F^r(M/2, N - k) )
[Figure: DP matrix split at column M/2; F(M/2, k) covers the left half, F^r(M/2, N - k) the right half, meeting at row k*]
Example:
  ACC-GGTGCCCAGGACTG--CAT
  ACCAGGTG----GGACTGGGCAG
  k* = 8

Linear-space alignment
Now, using 2 columns of space, we can compute, for k = 0 ... N, F(M/2, k) and F^r(M/2, N - k), PLUS the backpointers
[Figure: forward pass over x_1 ... x_{M/2} and reverse pass over x_{M/2+1} ... x_M, each against y_1 ... y_N]

Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + F^r(M/2, N - k)
Also, we can trace the path exiting column M/2 from k*
[Figure: columns 0, 1, ..., M/2, M/2+1, ..., M, M+1, with the path passing through rows k* and k*+1]

Linear-space alignment
Iterate this procedure to the left and right!
[Figure: the two subproblems, of sizes M/2 × k* and M/2 × (N - k*)]

Linear-space alignment
Hirschberg's linear-space algorithm:
MEMALIGN(l, l', r, r'): (aligns x_l ... x_l' with y_r ... y_r')
1. Let h = (l' - l)/2
2. Find (in Time O((l' - l) × (r' - r)), Space O(r' - r)) the optimal path, L_h, entering column h - 1, exiting column h
   Let k_1 = position at column h - 2 where L_h enters
       k_2 = position at column h + 1 where L_h exits
3. MEMALIGN(l, h - 2, r, k_1)
4. Output L_h
5. MEMALIGN(h + 1, l', k_2, r')
Top level call: MEMALIGN(1, M, 1, N)

Linear-space alignment
Time, Space analysis of Hirschberg's algorithm:
To compute the optimal path at the middle column, for a box of size M × N,
  Space: 2N
  Time: cMN, for some constant c
Then, left, right calls cost c( M/2 × k* + M/2 × (N - k*) ) = cMN/2
All recursive calls cost
  Total Time: cMN + cMN/2 + cMN/4 + ... = 2cMN = O(MN)
  Total Space: O(N) for computation, O(N + M) to store the optimal alignment
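The divide-and-conquer idea can be sketched as follows. This is a simplified variant rather than MEMALIGN as stated on the slides: it splits x at its midpoint, finds k* from a forward and a reverse last-column computation, and recurses on the two halves directly, instead of tracking the path through columns h - 1 and h. All function names and the default weights m = s = d = 1 are assumptions.

```python
def nw_last_column(x, y, m, s, d):
    """Scores F(len(x), j) for j = 0..len(y), computed with two columns."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Optimal global alignment (aligned x, aligned y) in linear space."""
    M, N = len(x), len(y)
    if M == 0:
        return "-" * N, y
    if N == 0:
        return x, "-" * M
    if M == 1:
        # Base case: align the single letter to its best partner in y,
        # unless leaving it opposite a gap scores strictly better.
        j = max(range(N), key=lambda t: m if x == y[t] else -s)
        sub = m if x == y[j] else -s
        if sub - (N - 1) * d >= -(N + 1) * d:
            return "-" * j + x + "-" * (N - 1 - j), y
        return x + "-" * N, "-" + y
    mid = M // 2
    left = nw_last_column(x[:mid], y, m, s, d)                # F(M/2, k)
    right = nw_last_column(x[mid:][::-1], y[::-1], m, s, d)   # F^r(., N - k)
    k_star = max(range(N + 1), key=lambda k: left[k] + right[N - k])
    ax1, ay1 = hirschberg(x[:mid], y[:k_star], m, s, d)
    ax2, ay2 = hirschberg(x[mid:], y[k_star:], m, s, d)
    return ax1 + ax2, ay1 + ay2


def alignment_score(ax, ay, m=1, s=1, d=1):
    """Score of a gapped alignment, used here only to check the result."""
    return sum(-d if "-" in (a, b) else (m if a == b else -s)
               for a, b in zip(ax, ay))


ax, ay = hirschberg("AGTA", "ATA")
print(ax, ay, alignment_score(ax, ay))  # AGTA A-TA 2
```

The string slices copy data for readability; an index-based version in the style of MEMALIGN(l, l', r, r') would avoid the copies while keeping the same recursion, and the same cMN + cMN/2 + cMN/4 + ... time argument applies.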