Introduction to Bioinformatics Algorithms Homework 3 Solution

Similar documents
Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

Quantifying sequence similarity

CSCI Honor seminar in algorithms Homework 2 Solution

Bio nformatics. Lecture 3. Saad Mneimneh

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

BLAST: Basic Local Alignment Search Tool

Moreover, the circular logic

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Maximum sum contiguous subsequence Longest common subsequence Matrix chain multiplication All pair shortest path Kna. Dynamic Programming

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequence analysis and Genomics

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Computational Biology

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Pairwise & Multiple sequence alignments

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

General Methods for Algorithm Design

Algorithms in Bioinformatics

Dynamic Programming: Shortest Paths and DFA to Reg Exps

Introduction to Computational Biology Homework 2 Solution

Local Alignment: Smith-Waterman algorithm

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Math 54 Homework 3 Solutions 9/

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

CSCI 150 Discrete Mathematics Homework 5 Solution

Lecture 5,6 Local sequence alignment

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Pair Hidden Markov Models

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Matrix Dimensions(orders)

Dynamic Programming: Shortest Paths and DFA to Reg Exps

Sequence analysis and comparison

Chapter 4 Divide-and-Conquer

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Sequence Alignment. Johannes Starlinger

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

More Dynamic Programming

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

More Dynamic Programming

Introduction to spectral alignment

Analysis and Design of Algorithms Dynamic Programming

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Sequence Analysis '17 -- lecture 7

Substitution matrices

CSE 202 Dynamic Programming II

Fall Inverse of a matrix. Institute: UC San Diego. Authors: Alexander Knop

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 5: September Time Complexity Analysis of Local Alignment

Lecture 2: Divide and conquer and Dynamic programming

BIOINFORMATICS: An Introduction

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Greedy Algorithms My T. UF

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Math/CS 466/666: Homework Solutions for Chapter 3

Exercise 5. Sequence Profiles & BLAST

10708 Graphical Models: Homework 2

CS 598RM: Algorithmic Game Theory, Spring Practice Exam Solutions

COM S 330 Homework 08 Solutions. Type your answers to the following questions and submit a PDF file to Blackboard. One page per problem.

14 Increasing and decreasing functions

Lecture 10: Powers of Matrices, Difference Equations

Math 320, spring 2011 before the first midterm

Word Alignment via Submodular Maximization over Matroids

18.06 Problem Set 3 Due Wednesday, 27 February 2008 at 4 pm in

Algorithms Exam TIN093 /DIT602

6.6 Sequence Alignment

String Matching Problem

Tools and Algorithms in Bioinformatics

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

1.5 Sequence alignment

Advanced Analysis of Algorithms - Midterm (Solutions)

EECS730: Introduction to Bioinformatics

General context Anchor-based method Evaluation Discussion. CoCoGen meeting. Accuracy of the anchor-based strategy for genome alignment.

Second Midterm Exam April 14, 2011 Answers., and

Sequence Comparison. mouse human

Quicksort (CLRS 7) We previously saw how the divide-and-conquer technique can be used to design sorting algorithm Merge-sort

Using Bioinformatics to Study Evolutionary Relationships Instructions

In-Depth Assessment of Local Sequence Alignment

Section Summary. Sequences. Recurrence Relations. Summations Special Integer Sequences (optional)

Sequence Comparison with Mixed Convex and Concave Costs

Dynamic programming. Curs 2015

M 340L CS Homework Set 12 Solutions. Note: Scale all eigenvectors so the largest component is +1.

Large-Scale Genomic Surveys

Chapter 6. Dynamic Programming. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Hidden Markov Models

Sequence Alignment (chapter 6)

Phylogenetic inference

Implementing Approximate Regularities

Lecture 2: Pairwise Alignment. CG Ron Shamir

Single alignment: Substitution Matrix. 16 march 2017

Tools and Algorithms in Bioinformatics

Algorithm Design and Analysis Homework #5 Due: 1pm, Monday, December 26, === Homework submission instructions ===

EECS730: Introduction to Bioinformatics

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

The Matrix Algebra of Sample Statistics

Transcription:

Introduction to Bioinformatics Algorithms Homework 3 Solution Saad Mneimneh Computer Science Hunter College of CUNY Problem 1: Concave penalty function We have seen in class the following recurrence for alignment with a general penalty function: A(i, j) = max A(i k, j) γ(k) k = 1,..., i A(i, j k) γ(k) k = 1,..., j where A(i, 0) = γ(i) and A(0, j) = γ(j), and γ(k) is the penalty of a gap of length k. (a) Show by a counter example that this algorithm fails to correctly find the optimal score. Hint: construct a γ that is not concave. Solution: Consider the penalty function γ(0) = 0, γ(1) = 2, γ(x) = 10 for x > 1. Assume that a match has a +1 score, and a mismatch a 1 score. Consider the two strings A and ABA. Then the optimal alignment has score -5: A B A - A - However, the algorithm will compute the optimal score as 3 for the following alignment: A B A A - - This is the score of a gap of length 1 plus the score of optimally aligning AB and A, which is -1. The problem is that the penalty of a gap of length 2 is higher then the penalty of two gaps of length 1. The algorithm will fail because splitting this gap falsely suggests a smaller penalty. This will not happen if the condition in part (b) below is satisfied. (b) Show that if γ(0) = 0 and γ(k) γ(k 1) is non-increasing in k, then we have (sub-additive property) γ(k 1 + k 2 ) γ(k 1 ) + γ(k 2 ) Explain why this property makes the algorithm above correct.

Solution: γ(x 1 +x 2 ) = γ(x 1 )+[γ(x 1 +1) γ(x 1 )]+[γ(x 1 +2) γ(x 1 +1)]+...+[γ(x 1 +x 2 ) γ(x 1 +x 2 1)] γ(x 1 ) + [γ(1) γ(0)] + [γ(2) γ(1)] +... + [γ(x 2 ) γ(x 2 1)] = γ(x 1 ) + γ(x 2 ) γ(0) = γ(x 1 ) + γ(x 2 ) Since the penalty of a gap becomes higher when the gap is scored as separate smaller pieces, the algorithm will not favor such erroneous breaks. (c) Modify the algorithm to work with any gap function. Solution: Let A(i, j) be the score of the optimal alignment that ends in x i and y j aligned. Let B(i, j) be the score of the optimal alignment when x i is aligned with a gap. Finally, let C(i, j) be the score of the optimal alignment when y j is aligned with a gap. A(i, j) = max B(i 1, j 1) + s(x i, y j ) C(i 1, j 1) + s(x i, y j ) { A(i k, j) γ(k) k = 1... i B(i, j) = max C(i k, j) γ(k) k = 1... i { A(i, j k) γ(k) k = 1... j C(i, j) = max B(i, j k) γ(k) k = 1... j For initialization: A(0, 0) = 0, A(i, 0) = A(0, j) =, B(i, 0) = γ(i), B(0, j) =, C(0, j) = γ(j), C(i, 0) =. The running time of this algorithm is O(m 2 n + n 2 m). Problem 2: Substitution matrices Let s say that we would like to build a DNA substitution matrix (x matrix) optimized for finding alignments among sequences that have 99% conservation (so mutation rate is, evolutionary distance 1). Assume that p i = 0.25 for every nucleotide i (A, G, C, or T), and that all matches are equally probable, and all mismatches are equally probable (uniform model). (a) Find the matrix Q, where Q ij is the probability of seeing nucleotides i and j aligned. Solution: Q = 2 (b) After constructing the matrix M, where M ij = Q ij /p i, find the match and mismatch scores in the matrix S given by S ij = log 2 M k ij /p j, for evolutionary

distances k = 1, 2, 5, 25, 50, 75, 100. What is the percentage conservation in each case (e.g. for k = 1 it is )? Solution: Here are the match and mismatch scores found for the different evolutionary distances, and the percentage conservation: match mismatch % conservation 1 1.98-6.23 99 2 1.97-5.25 98.0 5 1.93-3.9 95.1 10 1.86-2.99 90.6 25 1.65-1.81 78.6 50 1.3-1.03 63.3 75 1.07-0.66 52. 100 0.83-0..6 (c) [optional] Show that regardless of the probabilities p i and Q ij, S will always be symmetric for any evolutionary distance. Hint: M = DQ for some diagonal matrix D (what is D?), and S = log 2 [(DQ) k D]. Solution: D is the diagonal matrix defined as D ii = 1/p i and D ij = 0 for i j. Observe that D and Q are symmetric. A matrix is symmetric iff it is equal to it s transpose. In addition (AB) T is B T A T. Moreover, matrix multiplication is associative. So [(DQ) k D] T = (DQDQ... DQD) T = D T Q T D T... Q T D T Q T D T = DQD... QDQD = (DQ) k D Problem 3: Longest increasing subsequence Given a sequence x 1, x 2,..., x n, an increasing subsequence of length k is a sequence x i1, x i2,..., x ik such that i 1 < i 2 <... < i k and x i1 < x i2 <... < x ik. For example, in the sequence 8, 2, 1, 6, 5, 7,, 3, 9, an increasing subsequence is 1, 5, 7, 9. It has length, and it is a longest possible increasing subsequence. (a) Describe a dynamic programming algorithm to find a longest increasing subsequence. Hint: you can think about alignment between the sequence and its sorted version (what scores will you assign for matches, mismatches, and gaps?). Solution: We can align the sequence x with it s sorted version y. For any given pair (i, j), the score of the optimal alignment of x 1... x i and y 1... y j is given by (gaps score 0): A(i, j) = max A(i 1, j) A(j, j 1) where A(i, 0) = A(j, 0) = 0 and s(x i, y j ) = 1 if x i = y j and 1 otherwise. This way, we will never align x i with y j if they are not equal, because the mismatch can be replaced by two gaps (one in each sequence). (b) A 2-increasing subsequence is one that can be partitioned into two subsequences that are increasing. For example, while 2, 1, 6, 5, 7, 9 is not an increasing

subsequence, it is a 2-increasing subsequence because it can be partitioned into: 2, 6 and 1, 5, 7, 9. Show by a counter example that the longest 2-increasing subsequence cannot be obtained by a greedy strategy, i.e. removing the longest increasing sequence, then finding the longest increasing sequence among the remaining elements, and finally interleaving the two. Solution: Consider the sequence i + 1, 1,..., i 1, i + 2,..., n, i The longest increasing sequence has length n 2, leaving two elements i + 1, i, which define an increasing sequence of length 1. So in total, we achieve length n 1. However, the entire sequence of length n can be partitioned into 1,..., i and i + 1,..., n. (c) Describe a dynamic programming algorithm to find the longest 2-increasing subsequence. Hint: you can think about aligning the sequence with two of its sorted versions. But this is not the standard multiple alignment because an alignment up to x i, y j, and z k can end in one of 5 possibilities: x i x i x i y j y j z k z k Solution: This generalizes part (a). A(,, ) is dropped from the maximization if any of the indices is negative. This will simplify the initial condition to A(0, 0, 0) = 0. A(i, j, k) = max where A(0, 0, 0) = 0 and s defined as before. Problem : Homodeletions Problem 6.0 in the book. A(i 1, j 1, k) + s(x i, y j ) A(i 1, j, k 1) + s(x i, z k ) A(i 1, j, k) A(i, j 1, k) A(i, j, k 1) Solution: We only allow matches and gaps. A match has a score of +1. The score of the gap will depend on whether it s an opening of the gap ( 1), or an extension of it (zero). Let A(i, j) be the score of the optimal alignment that ends in x i and y j aligned. Let B(i, j) be the score of the optimal alignment when x i is aligned with a gap. Finally, let C(i, j) be the score of the optimal alignment when y j is aligned with a gap. A(i, j) = min B(i, j) = min B(i 1, j 1) + s(x i, y j ) C(i 1, j 1) + s(x i, y j ) A(i 1, j) + 1 B(i 1, j) + γ(x i 1, x i ) C(i 1, j) + 1

C(i, j) = min A(i, j 1) + 1 B(i, j 1) + 1 C(i, j 1) + γ(y j 1, y j ) where s(x i, y j ) = 0 if x i = y j and otherwise (no mismatches), and γ(a, b) = 1 if a b and 0 otherwise. For initialization: A(0, 0) = 0, A(i, 0) = A(0, j) =, B(1, 0) = 1, B(i, 0) = B(i 1, 0) + γ(x i 1, x i ), B(0, j) =, C(0, 1) = 1, C(0, j) = C(0, j 1) + γ(y j 1, y j ), C(i, 0) =. The running time of this algorithm is O(mn).