Computing Bi-Clusters for Microarray Analysis

Similar documents
Closest String and Closest Substring Problems

ACO Comprehensive Exam October 14 and 15, 2013

Fisher Equilibrium Price with a Class of Concave Utility Functions

Bounds for pairs in partitions of graphs

CS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

THE SZEMERÉDI REGULARITY LEMMA AND ITS APPLICATION

On the k-closest Substring and k-consensus Pattern Problems

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

x 1 + x 2 2 x 1 x 2 1 x 2 2 min 3x 1 + 2x 2

6.842 Randomness and Computation Lecture 5

Problem Set 1: Solutions Math 201A: Fall Problem 1. Let (X, d) be a metric space. (a) Prove the reverse triangle inequality: for every x, y, z X

Non-Interactive Zero Knowledge (II)

Matrices: 2.1 Operations with Matrices

On the complexity of unsigned translocation distance

Fast Algorithms for Constant Approximation k-means Clustering

Approximation Algorithms

The Algorithmic Aspects of the Regularity Lemma

On approximation of real, complex, and p-adic numbers by algebraic numbers of bounded degree. by K. I. Tsishchanka

Topics in Approximation Algorithms Solution for Homework 3

Reductions. Reduction. Linear Time Reduction: Examples. Linear Time Reductions

1 The Knapsack Problem

Approximate Counting and Markov Chain Monte Carlo

A conjecture on the alphabet size needed to produce all correlation classes of pairs of words

ICALP g Mingji Xia

Integer Factorization using lattices

Statistics and data analyses

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Factoring Algorithms Pollard s p 1 Method. This method discovers a prime factor p of an integer n whenever p 1 has only small prime factors.

SZEMERÉDI S REGULARITY LEMMA FOR MATRICES AND SPARSE GRAPHS

Approximation Algorithms and Hardness of Approximation. IPM, Jan Mohammad R. Salavatipour Department of Computing Science University of Alberta

A Polytope for a Product of Real Linear Functions in 0/1 Variables

Lecture 2. MATH3220 Operations Research and Logistics Jan. 8, Pan Li The Chinese University of Hong Kong. Integer Programming Formulations

Lecture 5: Two-point Sampling

Improved Parallel Approximation of a Class of Integer Programming Problems

Math 1302 Notes 2. How many solutions? What type of solution in the real number system? What kind of equation is it?

Signal Recovery from Permuted Observations

Probability and Samples. Sampling. Point Estimates

ON COST MATRICES WITH TWO AND THREE DISTINCT VALUES OF HAMILTONIAN PATHS AND CYCLES

Analysis and Design of Algorithms Dynamic Programming

Integer Linear Programs

RANK AND PERIMETER PRESERVER OF RANK-1 MATRICES OVER MAX ALGEBRA

HEURISTICS FOR TWO-MACHINE FLOWSHOP SCHEDULING WITH SETUP TIMES AND AN AVAILABILITY CONSTRAINT

Technische Universität München, Zentrum Mathematik Lehrstuhl für Angewandte Geometrie und Diskrete Mathematik. Combinatorial Optimization (MA 4502)

Zero-sum square matrices

Random Variable. Pr(X = a) = Pr(s)

A Note on the Density of the Multiple Subset Sum Problems

Scheduling on Unrelated Parallel Machines. Approximation Algorithms, V. V. Vazirani Book Chapter 17

CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms

Diameter of the Zero Divisor Graph of Semiring of Matrices over Boolean Semiring

Lecture Semidefinite Programming and Graph Partitioning

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

Algorithms for Molecular Biology

Assignment 2 : Probabilistic Methods

Matrices. Chapter Definitions and Notations

Separation Techniques for Constrained Nonlinear 0 1 Programming

Proofs Not Based On POMI

STABILITY OF JOHNSON S SCHEDULE WITH LIMITED MACHINE AVAILABILITY

Handout 5. α a1 a n. }, where. xi if a i = 1 1 if a i = 0.

Data Mining and Matrices

Approximation Schemes for Job Shop Scheduling Problems with Controllable Processing Times

LP based heuristics for the multiple knapsack problem. with assignment restrictions

CONSTRUCTION OF SLICED SPACE-FILLING DESIGNS BASED ON BALANCED SLICED ORTHOGONAL ARRAYS

EE512: Error Control Coding

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Lecture 9: Submatrices and Some Special Matrices

COVARIANCE IDENTITIES AND MIXING OF RANDOM TRANSFORMATIONS ON THE WIENER SPACE

CHAPTER 6. Direct Methods for Solving Linear Systems

Approximability of Dense Instances of Nearest Codeword Problem

1 Column Generation and the Cutting Stock Problem

RECAP How to find a maximum matching?

Linear and Integer Programming - ideas

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

On The Computational Complexity Of Bi-Clustering

Linear Classification: Linear Programming

Lecture 4 Scheduling 1

Consensus Optimizing Both Distance Sum and Radius

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

National University of Singapore Department of Electrical & Computer Engineering. Examination for

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Algorithms for Data Science: Lecture on Finding Similar Items

Sequential pairing of mixed integer inequalities

Q(x n+1,..., x m ) in n independent variables

CS 6820 Fall 2014 Lectures, October 3-20, 2014

A Branch-and-Cut Algorithm for the Stochastic Uncapacitated Lot-Sizing Problem

Bipartite Perfect Matching

CONSTRUCTION OF SLICED ORTHOGONAL LATIN HYPERCUBE DESIGNS

Solving the Order-Preserving Submatrix Problem via Integer Programming

Faster Algorithms for some Optimization Problems on Collinear Points

Lecture 13 October 6, Covering Numbers and Maurey s Empirical Method

CS Homework Chapter 6 ( 6.14 )

Approximations for MAX-SAT Problem

Concentration function and other stuff

Approximations for MAX-SAT Problem

CHAPTER 10 Shape Preserving Properties of B-splines

5 Integer Linear Programming (ILP) E. Amaldi Foundations of Operations Research Politecnico di Milano 1

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Conference Program Design with Single-Peaked and Single-Crossing Preferences

Applied Mathematics Letters


December 21, 2006

General Bi-clustering Problem

Input: an n × m matrix A.
Output: a sub-matrix A_{P,Q} of A such that the rows of A_{P,Q} are similar; in the basic model, all the rows are identical.

Why a sub-matrix? A subset of genes is co-regulated and co-expressed under specific conditions, so it is interesting to find those subsets of genes and conditions.

Similarity of Rows (1-5)

1. All rows are identical:

   1 1 2 3 2 3 3 2
   1 1 2 3 2 3 3 2
   1 1 2 3 2 3 3 2

2. All the elements in a row are identical:

   1 1 1 1 1 1 1 1
   2 2 2 2 2 2 2 2
   5 5 5 5 5 5 5 5

   (the same as case 1 if we treat columns as rows)

Similarity of Rows (1-5)

3. The curves for all rows are similar (additive):

   a_{i,j} - a_{i,k} = c(j, k) for every row i.

Case 3 is equivalent to case 2 (and thus also to case 1) if we construct a new matrix a'_{i,j} = a_{i,j} - a_{i,p} for a fixed column p.

Similarity of Rows (1-5)

4. The curves for all rows are similar (multiplicative):

   a_{1,1}      a_{1,2}      a_{1,3}      ...  a_{1,m}
   c_1 a_{1,1}  c_1 a_{1,2}  c_1 a_{1,3}  ...  c_1 a_{1,m}
   c_2 a_{1,1}  c_2 a_{1,2}  c_2 a_{1,3}  ...  c_2 a_{1,m}
   ...
   c_n a_{1,1}  c_n a_{1,2}  c_n a_{1,3}  ...  c_n a_{1,m}

Transform to case 2 (and thus case 1) by taking logarithms and subtracting. Cases 3 and 4 are called bi-clusters with coherent values.

Similarity of Rows (1-5)

5. The curves for all rows are similar (multiplicative and additive):

   a_{i,j} = c_i a_{k,j} + d_i

Transform to case 2 (and thus case 1) by subtracting a fixed row (row i), taking logarithms, and subtracting row i again.

The basic model: all the rows in the sub-matrix are identical.
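To make the case 3 and case 4 reductions concrete, here is a minimal numpy sketch (not from the slides; the example matrices and the fixed column p are made up). Subtracting a fixed column flattens an additive bi-cluster into identical rows, and taking logarithms first does the same for a multiplicative one:

```python
import numpy as np

# Case 3 (additive): column differences are the same in every row.
A_add = np.array([[1, 2, 4],
                  [3, 4, 6],
                  [0, 1, 3]], dtype=float)
p = 0                                # any fixed column works
B = A_add - A_add[:, [p]]            # a'_{i,j} = a_{i,j} - a_{i,p}
print(B)                             # every row is now [0, 1, 3]

# Case 4 (multiplicative): row i is c_i times the first row (positive entries).
A_mul = np.array([[1, 2, 4],
                  [2, 4, 8],
                  [5, 10, 20]], dtype=float)
L = np.log(A_mul)                    # log turns multiplicative into additive
B2 = L - L[:, [p]]                   # then subtract a fixed column as above
print(B2)                            # every row is now [0, log 2, log 4]
```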

Cheng and Church's Model

The model introduces a similarity score, the mean squared residue score H, to measure the coherence of the rows and columns of the sub-matrix:

   H(P, Q) = (1 / (|P| |Q|)) Σ_{i ∈ P, j ∈ Q} (a_{i,j} - a_{i,Q} - a_{P,j} + a_{P,Q})²,

where

   a_{i,Q} = (1 / |Q|) Σ_{j ∈ Q} a_{i,j},
   a_{P,j} = (1 / |P|) Σ_{i ∈ P} a_{i,j},
   a_{P,Q} = (1 / (|P| |Q|)) Σ_{i ∈ P, j ∈ Q} a_{i,j}.

If there is no error, H(P, Q) = 0 for cases 1, 2 and 3. Many heuristics (programs) have been produced for this model.
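The score translates directly into code. A small numpy transcription (illustrative sketch; the example matrix is made up and reuses the additive case above):

```python
import numpy as np

def msr_score(A, P, Q):
    """Mean squared residue H(P, Q) of the sub-matrix A[P, Q]."""
    S = A[np.ix_(P, Q)].astype(float)
    row_mean = S.mean(axis=1, keepdims=True)   # a_{i,Q}
    col_mean = S.mean(axis=0, keepdims=True)   # a_{P,j}
    all_mean = S.mean()                        # a_{P,Q}
    return np.mean((S - row_mean - col_mean + all_mean) ** 2)

# An additive (case 3) sub-matrix has score 0 up to floating-point error.
A = np.array([[1, 2, 4],
              [3, 4, 6],
              [0, 1, 3]], dtype=float)
print(msr_score(A, [0, 1, 2], [0, 1, 2]))      # ~0.0
```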

Outline

Consensus Sub-matrix Problem
Bottleneck Sub-matrix Problem
NP-Hardness Results

Consensus Sub-matrix Problem

Input: an m × n matrix A, integers l and k.
Output: a sub-matrix A_{P,Q} of A with l rows and k columns, and a consensus row z (of k elements), such that Σ_{r_i ∈ P} d(r_i|Q, z) is minimized. Here d(·,·) is the Hamming distance.

Bottleneck Sub-matrix Problem

Input: an m × n matrix A, integers l and k.
Output: a sub-matrix A_{P,Q} of A with l rows and k columns, and a consensus row z (of k elements), such that for every r_i in P, d(r_i|Q, z) ≤ d, and d is minimized. Here d(·,·) is the Hamming distance.
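Both objectives are easy to evaluate for a candidate solution. A small Python sketch (the matrix, P, Q and z below are hypothetical illustrations, not from the slides):

```python
import numpy as np

def hamming(u, v):
    """Hamming distance d(u, v)."""
    return int(np.sum(u != v))

def consensus_score(A, P, Q, z):
    """Sum over rows i in P of d(r_i|Q, z): the consensus objective."""
    return sum(hamming(A[i, Q], z) for i in P)

def bottleneck_score(A, P, Q, z):
    """Max over rows i in P of d(r_i|Q, z): the bottleneck objective."""
    return max(hamming(A[i, Q], z) for i in P)

# Toy instance over {0, 1}; P, Q and z are hypothetical choices.
A = np.array([[0, 1, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 1]])
P, Q = [0, 1], np.array([0, 1, 3])
z = np.array([0, 1, 1])
print(consensus_score(A, P, Q, z))   # 2
print(bottleneck_score(A, P, Q, z))  # 1
```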

NP-Hardness Results

Theorem 1: Both the consensus sub-matrix and the bottleneck sub-matrix problems are NP-hard.

Proof idea: a reduction from the maximum edge bipartite problem.

A PTAS for the Consensus Sub-matrix Problem

Input: an m × n matrix A, integers l and k.
Output: a sub-matrix A_{P,Q} of A with l rows and k columns, and a consensus row z (of k elements), such that Σ_{r_i ∈ P} d(r_i|Q, z) is minimized. Here d(·,·) is the Hamming distance.

Assumptions: H_opt = Σ_{p_i ∈ P_opt} d(x_{p_i}|Q_opt, z_opt) = Ω(kl), i.e., H_opt · c ≥ kl, and |Q_opt| = k = Ω(n), i.e., k · c ≥ n.

Idea: We use a random sampling technique to randomly select O(log m) columns of Q_opt and enumerate all possible vectors of length O(log m) on those columns. At some point in the enumeration we know O(log m) bits of z_opt, and we can use this partial z_opt to select the l rows that are closest to z_opt on those O(log m) bits. We then construct a consensus vector as follows: for each column, choose the majority letter, i.e., the letter that appears most often among the l letters of the l selected rows. Then, for each of the n columns, we calculate the number of mismatches between the majority letter and the l letters in the selected rows. Selecting the best k columns yields a good solution.

How do we randomly select O(log m) columns of Q_opt while Q_opt is unknown? Our new idea is to randomly select a set B of (c + 1) log m columns from A and enumerate all subsets of B of size log m, in time O(m^{c+1}), which is polynomial in the input size O(mn). We can show that, with high probability, we obtain a set of log m columns that behaves like a random sample from Q_opt.

Algorithm 1 for the Consensus Sub-matrix Problem

Input: an m × n matrix A, integers l and k, and ε > 0.
Output: a size-l subset P of rows, a size-k subset Q of columns, and a length-k consensus vector z.

Step 1: Randomly select a set B of (c + 1)((4/ε²) log m + 1) columns from A.
  (1.1) for every subset R of B of size (4/ε²) log m do
  (1.2)   for every z_R ∈ Σ^{|R|} do
            (a) Select the best l rows P = {p_1, ..., p_l} that minimize d(z_R, x_i|R).
            (b) For each column j, compute f(j) = Σ_{i=1}^{l} d(s_j, a_{p_i,j}), where s_j is the majority element of column j among the l rows of P. Select the k columns Q = {q_1, ..., q_k} with minimum value f(j) and let z(Q) = s_{q_1} s_{q_2} ... s_{q_k}.
            (c) Calculate the consensus score H = Σ_{i=1}^{l} d(x_{p_i}|Q, z) of this solution.
Step 2: Output the P, Q and z with minimum H.
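The following Python sketch mirrors the structure of Algorithm 1 with tiny hard-coded parameters; it is illustrative, not the authors' implementation. Here r_size and b_size are stand-ins for (4/ε²) log m and (c + 1)((4/ε²) log m + 1), and the input matrix is random:

```python
import itertools
import numpy as np

def hamming(u, v):
    """Hamming distance between two equal-length vectors."""
    return int(np.sum(u != v))

def consensus_submatrix(A, l, k, alphabet, r_size, b_size, rng):
    """Sketch of Algorithm 1 with simplified sample sizes."""
    m, n = A.shape
    best = (None, None, None, np.inf)                 # (P, Q, z, H)
    B = rng.choice(n, size=min(b_size, n), replace=False)        # Step 1
    for R in itertools.combinations(B, min(r_size, len(B))):     # (1.1)
        R = list(R)
        for z_R in itertools.product(alphabet, repeat=len(R)):   # (1.2)
            z_R = np.array(z_R)
            # (a) the l rows closest to the guessed bits of z_opt on R
            order = sorted(range(m), key=lambda i: hamming(A[i, R], z_R))
            P = order[:l]
            # (b) majority letter s_j per column and its mismatch count f(j)
            f = np.empty(n, dtype=int)
            s = np.empty(n, dtype=A.dtype)
            for j in range(n):
                vals, cnts = np.unique(A[P, j], return_counts=True)
                s[j] = vals[np.argmax(cnts)]
                f[j] = l - cnts.max()
            Q = np.argsort(f)[:k]                     # best k columns
            z = s[Q]
            # (c) consensus score of this candidate solution
            H = sum(hamming(A[i, Q], z) for i in P)
            if H < best[3]:
                best = (P, Q, z, H)                   # Step 2: keep minimum
    return best

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 10))                  # toy 8 x 10 binary matrix
P, Q, z, H = consensus_submatrix(A, l=3, k=4, alphabet=[0, 1],
                                 r_size=2, b_size=4, rng=rng)
print(P, Q, z, H)
```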

Lemma 1: With probability at most m^{-2/(ε²c²(c+1))}, no subset R of size (4/ε²) log m used in Step 1 of Algorithm 1 satisfies R ⊆ Q_opt.

Lemma 2: Assume |R| = (4/ε²) log m and R ⊆ Q_opt, and let ρ = k/|R|. With probability at most m^{-1/3}, there is a row x_i in X satisfying (d(z_opt, x_i|Q_opt) - εk)/ρ > d(z_opt|R, x_i|R). With probability at most m^{-1/3}, there is a row x_i in X satisfying d(z_opt|R, x_i|R) > (d(z_opt, x_i|Q_opt) + εk)/ρ.

Lemma 3: When R ⊆ Q_opt and z_R = z_opt|R, with probability at most 2m^{-1/3}, the set of rows P = {p_1, ..., p_l} selected in Step 1(a) of Algorithm 1 satisfies Σ_{i=1}^{l} d(z_opt, x_{p_i}|Q_opt) > H_opt + 2εkl.

Theorem 2: For any δ > 0, with probability at least 1 - m^{-2/((δ/(8c))²c²(c+1))} - 2m^{-1/3}, Algorithm 1 outputs a solution with consensus score at most (1 + δ)H_opt in O(n m^{O(1/δ²)}) time.

A PTAS for the Bottleneck Sub-matrix Problem

Input: an m × n matrix A, integers l and k.
Output: a sub-matrix A_{P,Q} of A with l rows and k columns, and a consensus row z (of k elements), such that for every r_i in P, d(r_i|Q, z) ≤ d, and d is minimized. Here d(·,·) is the Hamming distance.

Assumptions: d_opt = max_{p_i ∈ P_opt} d(x_{p_i}|Q_opt, z_opt) = Ω(k), i.e., d_opt · c′ ≥ k, and |Q_opt| = k = Ω(n), i.e., k · c ≥ n.

Idea: (1) Use the random sampling technique to learn O(log m) bits of z_opt and select the l best rows, as in Algorithm 1. (2) Use linear programming and randomized rounding to select the k columns of the matrix.

Linear Programming

Given a set of rows P = {p_1, ..., p_l}, we want to find a set Q of k columns and a vector z such that the bottleneck score is minimized:

   min d
   s.t. Σ_{i=1}^{n} Σ_{j=1}^{|Σ|} y_{i,j} = k,
        Σ_{j=1}^{|Σ|} y_{i,j} ≤ 1,  i = 1, 2, ..., n,
        Σ_{i=1}^{n} Σ_{j=1}^{|Σ|} χ(π_j, x_{p_s,i}) y_{i,j} ≤ d,  s = 1, 2, ..., l.

Here y_{i,j} = 1 if and only if column i is in Q and the corresponding bit of z is π_j, and for any a, b ∈ Σ, χ(a, b) = 0 if a = b and χ(a, b) = 1 if a ≠ b.
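The LP relaxation (0 ≤ y_{i,j} ≤ 1) can be written down directly with an off-the-shelf solver. A sketch using the PuLP modeling package (assumed installed; the rows X and all parameters below are made up):

```python
import numpy as np
import pulp  # third-party LP modeling package, assumed installed

# Made-up data: the l selected rows over alphabet Sigma, n columns, pick k.
Sigma = [0, 1]
X = np.array([[0, 1, 1, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 0, 1]])
l, n, k = X.shape[0], X.shape[1], 3

prob = pulp.LpProblem("bottleneck_LP", pulp.LpMinimize)
d = pulp.LpVariable("d", lowBound=0)
# LP relaxation: 0 <= y_{i,j} <= 1 instead of y_{i,j} in {0, 1}
y = pulp.LpVariable.dicts(
    "y", [(i, j) for i in range(n) for j in range(len(Sigma))],
    lowBound=0, upBound=1)

prob += d                                   # minimize the bottleneck score
prob += pulp.lpSum(y[i, j] for i in range(n)
                   for j in range(len(Sigma))) == k   # exactly k columns
for i in range(n):                          # at most one letter per column
    prob += pulp.lpSum(y[i, j] for j in range(len(Sigma))) <= 1
for s in range(l):                          # every row within distance d
    prob += pulp.lpSum((1 if Sigma[j] != X[s, i] else 0) * y[i, j]
                       for i in range(n) for j in range(len(Sigma))) <= d

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print("LP optimum d =", d.value())
frac_y = {key: var.value() for key, var in y.items()}  # input to the rounding
```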

Randomized Rounding

Two goals: (1) Select k′ columns, where k ≥ k′ ≥ k - δd_opt. (2) Get integer values for the y_{i,j} such that the distance (restricted to the k′ selected columns) between any row in P and the center vector thus obtained is at most (1 + γ)d_opt. Here δ > 0 and γ > 0 are two parameters used to control the errors.
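A standard way to realize this rounding, sketched below with synthetic fractional values (not the authors' code): independently for each column i, pick letter π_j with probability y_{i,j}, and leave the column unselected with the leftover probability. Concentration bounds (as in Lemma 4) show the two goals hold with high probability.

```python
import numpy as np

def round_columns(frac_y, n, sigma_size, rng):
    """Round fractional y_{i,j} independently per column: pick letter j with
    probability y_{i,j}, or leave column i unselected with the leftover
    probability. Synthetic stand-in for the rounding step of Algorithm 2."""
    center = {}
    for i in range(n):
        p = np.array([frac_y[i, j] for j in range(sigma_size)])
        p = np.append(p, max(0.0, 1.0 - p.sum()))  # "column not chosen"
        j = rng.choice(sigma_size + 1, p=p / p.sum())
        if j < sigma_size:
            center[i] = j   # column i selected, its center letter is pi_j
    return center           # expected size = sum of the fractional y_{i,j}

rng = np.random.default_rng(1)
n, sigma_size = 6, 2
frac_y = {(i, j): rng.random() / sigma_size
          for i in range(n) for j in range(sigma_size)}
print(round_columns(frac_y, n, sigma_size, rng))
```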

Lemma 4: When nγ²/(3(cc′)²) ≥ 2 log m, for any γ, δ > 0, with probability at most exp(-nδ²/(2(cc′)²)) + m^{-1}, the rounding result y = {y_{1,1}, ..., y_{1,|Σ|}, ..., y_{n,1}, ..., y_{n,|Σ|}} fails to satisfy at least one of the following inequalities:

   Σ_{i=1}^{n} Σ_{j=1}^{|Σ|} y_{i,j} > k - δd_opt,

and, for every row x_{p_s} (s = 1, 2, ..., l),

   Σ_{i=1}^{n} Σ_{j=1}^{|Σ|} χ(π_j, x_{p_s,i}) y_{i,j} < d + γd_opt.

Algorithm 2 for the Bottleneck Sub-matrix Problem

Input: a matrix A ∈ Σ^{m×n}, integers l and k, and small numbers ε > 0, γ > 0 and δ > 0.
Output: a size-l subset P of rows, a size-k subset Q of columns, and a length-k consensus vector z.

if nγ²/(3(cc′)²) ≤ 2 log m then
  try all size-k subsets Q of the n columns and all z of length k to solve the problem exactly.
if nγ²/(3(cc′)²) > 2 log m then
  Step 1: Randomly select a set B of 4(c + 1)(log m)/ε² columns from A.
    for every subset R of B of size (4/ε²) log m do
      for every z_R ∈ Σ^{|R|} do
        (a) Select the best l rows P = {p_1, ..., p_l} that minimize d(z_R, x_i|R).
        (b) Solve the optimization problem by linear programming and randomized rounding to get Q and z.
  Step 2: Output the P, Q and z with minimum bottleneck score d.

Lemma 5: When R ⊆ Q_opt and z_R = z_opt|R, with probability at most 2m^{-1/3}, the set of rows P = {p_1, ..., p_l} obtained in Step 1(a) of Algorithm 2 satisfies d(z_opt, x_{p_i}|Q_opt) > d_opt + 2εk for some row x_{p_i} (1 ≤ i ≤ l).

Theorem 3: With probability at least 1 - m^{-2/(ε²c²(c+1))} - 2m^{-1/3} - exp(-nδ²/(2(cc′)²)) - m^{-1}, Algorithm 2 runs in time O(n^{O(1)} m^{O(1/ε² + 1/γ²)}) and obtains a solution with bottleneck score at most (1 + 2c′ε + γ + δ)d_opt, for any fixed ε, γ, δ > 0.

Acknowledgements

This work is fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 1070/02E]. This is joint work with Dr. Lusheng Wang and Xiaowen Liu of City University of Hong Kong, Hong Kong, China.

Appendix: Chernoff Bounds

Let X_1, X_2, ..., X_n be n independent 0-1 random variables, where X_i takes value 1 with probability p_i, 0 < p_i < 1. Let X = Σ_{i=1}^{n} X_i and μ = E[X]. Then for any 0 < ε ≤ 1,

   Pr(X > μ + εn) < e^{-nε²/3},
   Pr(X < μ - εn) ≤ e^{-nε²/2}.
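These are the standard Chernoff-Hoeffding tail bounds used throughout the analysis. A quick Monte Carlo sanity check (illustrative; the p_i, n and ε are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 200, 100_000, 0.1
p = rng.random(n)                               # arbitrary p_i in (0, 1)
X = (rng.random((trials, n)) < p).sum(axis=1)   # trials samples of X
mu = p.sum()                                    # mu = E[X]
print("Pr(X > mu + eps*n) ~", np.mean(X > mu + eps * n),
      "vs bound", np.exp(-n * eps**2 / 3))
print("Pr(X < mu - eps*n) ~", np.mean(X < mu - eps * n),
      "vs bound", np.exp(-n * eps**2 / 2))
```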