Did you know that Multiple Alignment is NP-hard? Isaac Elias Royal Institute of Technology Sweden


Results
Multiple Alignment with SP-score, Star Alignment, and Tree Alignment (with given phylogeny) are NP-hard under all metrics!

Pairwise Alignment
Mutations: substitutions, insertions, and deletions.
Input: Two related strings $s_1$ and $s_2$.
Output: The least number of mutations needed to transform $s_1$ into $s_2$, read off a $2 \times l$ alignment matrix $A$.

    s_1 = a a g a c t
    s_2 = a g t g c t

$d_A(s_1, s_2) = 3$ mutations
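The pairwise problem on this slide can be solved with the standard dynamic program for edit distance. The following is a minimal sketch under the unit metric (every substitution, insertion, and deletion costs 1); the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Least number of substitutions, insertions and deletions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a
                curr[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),   # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("aagact", "agtgct"))  # 3, matching the slide's example
```

The table is filled row by row, so the running time is proportional to the product of the two string lengths.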

Metric symbol distance
Unit metric - binary alphabet (the edit distance):

    Σ    0    1    -
    0    0    1    1
    1    1    0    1
    -    1    1    0

Insertions and deletions occur less frequently! Gaps can therefore be given a higher cost:

    Σ    0    1    -
    0    0    1    1.5
    1    1    0    1.5
    -    1.5  1.5  0

General metric - α, β, γ have metric properties (identity, symmetry, triangle ineq.):

    Σ    0    1    -
    0    0    α    β
    1    α    0    γ
    -    β    γ    0
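Whether a choice of α, β, γ satisfies the metric properties can be checked mechanically. The sketch below assumes the three symbols are `0`, `1`, and the gap `-`, with α the cost of 0 against 1, β the cost of 0 against a gap, and γ the cost of 1 against a gap; the helper names `symbol_distance` and `is_metric` are hypothetical.

```python
from itertools import product

SYMBOLS = "01-"


def symbol_distance(alpha: float, beta: float, gamma: float) -> dict:
    """Build the symmetric 3x3 cost table over the symbols 0, 1 and gap."""
    table = {(x, x): 0 for x in SYMBOLS}
    for (x, y), cost in {("0", "1"): alpha, ("0", "-"): beta, ("1", "-"): gamma}.items():
        table[x, y] = cost
        table[y, x] = cost
    return table


def is_metric(dist: dict) -> bool:
    """Check identity, symmetry and the triangle inequality."""
    for x, y in product(SYMBOLS, repeat=2):
        if (dist[x, y] == 0) != (x == y):   # zero exactly on the diagonal
            return False
        if dist[x, y] != dist[y, x]:        # symmetry
            return False
    return all(dist[x, z] <= dist[x, y] + dist[y, z]
               for x, y, z in product(SYMBOLS, repeat=3))


print(is_metric(symbol_distance(1, 1, 1)))      # unit metric
print(is_metric(symbol_distance(1, 1.5, 1.5)))  # gaps cost more
print(is_metric(symbol_distance(1, 3, 1)))      # violates the triangle inequality
```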

Multiple Alignment
When sequence similarity is low, pairwise alignment may fail to find similarities. Considering several strings may be helpful.
Input: k strings $s_1, \ldots, s_k$.
Output: a $k \times l$ matrix $A$.

    a a g a - c t
    - - g a g c t
    a c g a g c -
    a - g t g c t

How do we score columns?

Sum of Pairs score (SP-score)
Let the cost be the sum of costs for all pairs of rows:
$$\sum_{i=1}^{k} \sum_{j=i}^{k} d_A(s_i, s_j),$$
where $d_A(s_i, s_j)$ is the pairwise distance between the rows containing strings $s_i$ and $s_j$.
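As a small illustration of the definition, the sketch below computes the SP-score of an alignment given as equal-length rows. It assumes the unit metric with a gap written as `-` and a gap-against-gap column costing 0; the pairs with i = j contribute nothing, so only pairs with i < j are summed. The names `unit_cost` and `sp_score` are chosen here for illustration.

```python
from itertools import combinations


def unit_cost(x: str, y: str) -> int:
    """Unit metric on symbols: 0 if equal (including gap against gap), else 1."""
    return 0 if x == y else 1


def sp_score(rows: list, cost=unit_cost) -> float:
    """Sum, over all pairs of rows, of the column-wise cost between the two rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(cost(x, y)
               for row_a, row_b in combinations(rows, 2)
               for x, y in zip(row_a, row_b))


# The 4 x 7 alignment shown on the Multiple Alignment slide.
alignment = ["aaga-ct", "--gagct", "acgagc-", "a-gtgct"]
print(sp_score(alignment))
```

Passing a different `cost` function, for example a lookup into an α, β, γ table, scores the same alignment under another metric.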

Earlier NP-hardness results
Wang and Jiang 94 - Non-metric
Bonizzoni and Vedova 01 - Specific metric (we build on their construction)
Just 01 - Many metrics
In particular, hardness was unknown for the unit metric.
Here: All binary or larger alphabets under all metrics!

Independent R3 Set
Independent set in 3-regular graphs, i.e. graphs in which all vertices have degree 3, is NP-hard.
Input: A 3-regular graph G = (V, E) and an integer c.
Output: Yes if there is an independent set V' ⊆ V of size c, otherwise No.
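To make the decision problem concrete, here is a brute-force sketch that tries every vertex subset of size c. It takes exponential time and is only meant for tiny graphs; the function name and the two example graphs (the complete graph on four vertices and the complete bipartite graph with three vertices per side, both 3-regular) are illustrative choices.

```python
from itertools import combinations


def has_independent_set(vertices: list, edges: list, c: int) -> bool:
    """Return True if some c vertices are pairwise non-adjacent."""
    edge_set = {frozenset(e) for e in edges}
    for candidate in combinations(vertices, c):
        # The candidate is independent if no pair inside it is an edge.
        if all(frozenset(pair) not in edge_set
               for pair in combinations(candidate, 2)):
            return True
    return False


# Complete graph on 4 vertices: every vertex has degree 3.
k4_vertices = [1, 2, 3, 4]
k4_edges = list(combinations(k4_vertices, 2))
print(has_independent_set(k4_vertices, k4_edges, 1))  # True
print(has_independent_set(k4_vertices, k4_edges, 2))  # False

# Complete bipartite graph with sides {0,1,2} and {3,4,5}: also 3-regular.
k33_vertices = list(range(6))
k33_edges = [(u, v) for u in range(3) for v in range(3, 6)]
print(has_independent_set(k33_vertices, k33_edges, 3))  # True
```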

Reduction
Independent R3 Set → SP-score
G = (V, E), c → set of strings, K
set V' of size c ⟺ matrix A of cost K
[Figure: example graph on vertices v_1, ..., v_4 and the corresponding alignment matrix over the vertex columns v_1, ..., v_4.]

Gadgeteering
1. Each vertex is represented by a column (n vertex columns).
2. c vertex columns are picked.
3. Each edge is represented by an edge string.
[Figure: alignment matrix over the vertex columns v_1, v_2, ..., v_i, ..., v_n, with one row per edge at the bottom.]

Template strings
Vertex columns: we add very many template strings $T = (10^b)^{n-1}1$.
Fact: Identical strings are aligned identically in an optimal alignment.
[Figure: identical template rows over the vertex columns v_1, v_2, ..., v_i, ..., v_n.]

Pick strings
$V' \subseteq V$: we add very many pick strings $P = 1^c$.
Fact: c vertex columns are picked, since this is the alignment with the fewest mismatches.
[Figure: template rows over the vertex columns v_1, v_2, ..., v_i, ..., v_n, with the pick rows below them.]

Edge strings
Property: If there is an edge $(v_i, v_j)$, then $v_i$ and $v_j$ cannot both be part of the independent set.
$$E_{ij} = 0^s\,(00^b)^{i-1}\,10^{b-s}\,(00^b)^{j-i-1}\,10^b\,(00^b)^{n-j-1}\,0\,0^s$$
[Figure: three alignments of an edge string against the vertex columns v_1, ..., v_i, ..., v_j, ..., v_n; two are labelled good and one bad.]
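The three string gadgets can be generated directly from their definitions. The sketch below is an illustration under stated assumptions: it reads the template string as $(10^b)^{n-1}1$, the pick string as $1^c$, and the edge string as $0^s(00^b)^{i-1}10^{b-s}(00^b)^{j-i-1}10^b(00^b)^{n-j-1}00^s$, where the exponent $b-s$ is a reconstruction of the garbled slide text. It also requires j < n so that every exponent is non-negative. The function names are hypothetical.

```python
def template_string(n: int, b: int) -> str:
    """n ones (the vertex columns), separated by blocks of b zeros."""
    return ("1" + "0" * b) * (n - 1) + "1"


def pick_string(c: int) -> str:
    """c ones, one for each vertex column to be picked."""
    return "1" * c


def edge_string(i: int, j: int, n: int, b: int, s: int) -> str:
    """Edge string for the edge (v_i, v_j), following the assumed formula."""
    if not (1 <= i < j < n and 0 <= s <= b):
        raise ValueError("sketch requires 1 <= i < j < n and 0 <= s <= b")
    zero_block = "0" + "0" * b              # the block (0 0^b)
    return ("0" * s
            + zero_block * (i - 1)
            + "1" + "0" * (b - s)           # one for v_i, followed by a shortened gap
            + zero_block * (j - i - 1)
            + "1" + "0" * b                 # one for v_j
            + zero_block * (n - j - 1)
            + "0" + "0" * s)


t = template_string(4, 3)
e = edge_string(1, 2, 4, 3, 1)
print(t)  # 1000100010001
print(e)  # 01001000000000
```

Under this reading the edge string is exactly s symbols longer than the template, and its two ones sit s positions closer together than the corresponding vertex columns, so shifting the string by s aligns one of its ones or the other with a vertex column, but not both.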

Independent set c ⟺ Alignment K
$$K = (n-c)\gamma b^2 + b(n-1)\beta b^2 + (s\beta + n\alpha)bm + \bigl(c\alpha + (s + b(n-1) + n - c - 2)\beta + 2\gamma\bigr)bm - 3cb(\alpha + \gamma - \beta) + (s\beta + 2\alpha)m^2$$
1. Remember that the graph is 3-regular, so there are at most three ones in each vertex column.
2. If a column with only two ones is picked, then there is an extra mismatch for each pick string (score > K).

Example
1. At most three ones in each vertex column.
2. Pick strings pick the columns with the most ones.
[Figure: the graph on v_1, ..., v_4 and its alignment matrix; the edge strings E_12, E_13, E_14, E_23, E_24, E_34 are placed step by step against the vertex columns v_1, ..., v_4.]

Star Alignment
The SP-score is not a good model of evolution.
Input: k strings $s_1, \ldots, s_k$.
Output: a string c minimizing $\sum_i d(c, s_i)$.
[Figure: star with unknown center c and leaves s_1, s_2, s_3, s_4.]
Symbol distance metric ⇒ pairwise alignment distance metric.
Star Alignment is a special case of Steiner Star in a metric space.
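The objective can be illustrated with an exhaustive search over candidate centers. This is a sketch for tiny inputs only, assuming the unit metric, a binary alphabet, and candidates no longer than the longest input string (a restriction made here to keep the search finite, not a claim from the slides). The helper names are hypothetical.

```python
from itertools import product


def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance via the standard dynamic program."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def star_cost(center: str, strings: list) -> int:
    """Sum of distances from the center to every input string."""
    return sum(edit_distance(center, s) for s in strings)


def best_center(strings: list, alphabet: str = "01") -> str:
    """Try every string up to the longest input length; exponential time."""
    max_len = max(len(s) for s in strings)
    candidates = ("".join(t)
                  for length in range(max_len + 1)
                  for t in product(alphabet, repeat=length))
    return min(candidates, key=lambda c: star_cost(c, strings))


inputs = ["01", "01", "11"]
center = best_center(inputs)
print(center, star_cost(center, inputs))  # 01 1
```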

Earlier results - Star Alignment
Wang and Jiang 94 - Non-metric, APX-complete
Li, Ma, Wang 99 - Constant number of gaps, NP-hard and PTAS
Sim and Park 01 - Specific metric, NP-hard
Here: All binary or larger alphabets under all metrics!

Reduction
Vertex Cover → Star Alignment
G = (V, E) → set of strings
minimum cover V' ⟺ minimum string c = DDCDD

Vertex positions:  v_1  v_2  v_3  ...  v_n
    C             = ...?...?...?...?...
    $C_V$         = ...1...1...1...1...   (only 1s)
    $C_\emptyset$ = ...0...0...0...0...   (only 0s)

Construction Idea
(E, G) ⇒ c = DDCDD
(E, $S_{ab}$) ⇒ c = vertex cover
G ⇒ minimum cover
String minimizing $\sum_i d(c, s_i)$ ⟺ minimum cover
[Figure: star with center c and leaves E, G, $S_{ab}$, $S_{a'b'}$, and further copies of E and G.]

Canonical Structure
$E = DDC_VDD$ and $G = DDC_\emptyset DD$ differ only in the vertex positions.
Many such pairs will vote for the canonical structure.
Vote even in the vertex positions!
c = DDCDD
[Figure: star with center c and leaves E, G, E, G.]

String Cover
For each edge $(v_a, v_b)$ add two strings: $E = DDC_VDD$ and $S_{ab} = C_aDC_b$.

Vertex positions:  v_1 ... v_a ... v_b ... v_n
    $C_a$ = ...0...1...0...0...
    $C_b$ = ...0...0...1...0...

E votes 1 in all vertex positions.
$S_{ab}$ votes 0 in all except vertex position a or b.
[Figure: the center D D C D D with $C_a D C_b$ aligned against it in two ways.]
Either vertex $v_a$ or vertex $v_b$ is part of the cover.
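The condition each edge imposes, that at least one endpoint is chosen, is exactly the vertex cover property. The sketch below checks that property and finds a minimum cover by brute force. It is exponential and only intended for tiny graphs, and the function names and example graphs are illustrative.

```python
from itertools import combinations


def is_vertex_cover(cover: set, edges: list) -> bool:
    """Every edge (v_a, v_b) must have v_a or v_b in the cover."""
    return all(a in cover or b in cover for a, b in edges)


def minimum_vertex_cover(vertices: list, edges: list) -> set:
    """Try subsets in order of increasing size and return the first cover found."""
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(set(candidate), edges):
                return set(candidate)
    return set(vertices)  # not reached: the full vertex set always covers


path_edges = [(1, 2), (2, 3)]
print(minimum_vertex_cover([1, 2, 3], path_edges))      # {2}

k4_edges = list(combinations([1, 2, 3, 4], 2))
print(len(minimum_vertex_cover([1, 2, 3, 4], k4_edges)))  # 3
```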

Minimal String ⟺ Minimum Cover
(E, G) ⇒ c = DDCDD
(E, $S_{ab}$) ⇒ c = vertex cover
$G = DDC_\emptyset DD$ penalizes each 1 in the vertex positions.
String minimizing $\sum_i d(c, s_i)$ ⟺ minimum cover
[Figure: star with center c and leaves E, G, $S_{ab}$, $S_{a'b'}$, and additional copies of G.]

Tree Alignment (given phylogeny)
Star Alignment is not a good model of evolution.
Input: k strings $s_1, \ldots, s_k$ and a phylogeny T.
Output: strings $p_1, \ldots, p_{k-1}$ minimizing $\sum_{(a,b) \in E(T)} d(a, b)$.
[Figure: phylogeny with leaves s_1, s_2, s_3, s_4 and internal nodes p_1, p_2, p_3, whose strings are unknown.]
Symbol distance metric ⇒ pairwise alignment distance metric.
Tree Alignment is a special case of Steiner Tree in a metric space.
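Given a labelling of the internal nodes, the tree alignment objective is just the sum of pairwise distances over the tree edges. The sketch below evaluates that sum for one labelling under the unit metric. The tree shape, node names, and strings are hypothetical examples, and finding the best internal labels is the hard part that this sketch does not attempt.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance via the standard dynamic program."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def tree_cost(labels: dict, tree_edges: list) -> int:
    """Sum of d(a, b) over all edges (a, b) of the phylogeny."""
    return sum(edit_distance(labels[a], labels[b]) for a, b in tree_edges)


# Hypothetical phylogeny: root p1 with children p2 and p3, each with two leaves.
tree_edges = [("p1", "p2"), ("p1", "p3"),
              ("p2", "s1"), ("p2", "s2"),
              ("p3", "s3"), ("p3", "s4")]
labels = {"s1": "aagact", "s2": "aagact", "s3": "agtgct", "s4": "agtgct",
          "p1": "aagact", "p2": "aagact", "p3": "agtgct"}
print(tree_cost(labels, tree_edges))  # 3: only the edge (p1, p3) has nonzero cost
```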

Construction Idea
Same idea as for Star Alignment.
[Figure: tree with m branches, each ending in a string $S_{ij}$, with r base comp. (pairs E, G) in between each branch.]

Overview and Open problems

    Problem                            Here          Approx
    SP-score                           all metrics   2-approx [GPBL]
    Star Alignment                     all metrics   2-approx
    Tree Alignment (given phylogeny)   all metrics   PTAS [WJGL]
    Consensus Patterns                 NP-hard       PTAS [LMW]
    Substring Parsimony                NP-hard       PTAS [WJGL]

Acknowledgments
My advisor Prof. Jens Lagergren
Prof. Benny for hosting me
Thanks!