Haplotype Assembly Using Manifold Optimization and Error Correction Mechanism

Mohamad Mahdi Mohades, Sina Majidian, and Mohammad Hossein Kahaei

arXiv preprint [math.OC], January 2019

Abstract — Recent matrix-completion-based methods have not been able to properly model the Haplotype Assembly Problem (HAP) for noisy observations. To cope with such a case, in this letter we propose a new Minimum Error Correction (MEC) based matrix completion optimization problem over the manifold of rank-one matrices. The convergence of a specific iterative algorithm for solving this problem is proved. Simulation results illustrate that the proposed method not only outperforms some well-known matrix-completion-based methods, but also gives more accurate results than a most recent MEC-based algorithm for haplotype estimation.

Index Terms — Haplotype assembly, manifold optimization, matrix completion, minimum error correction, Hamming distance.

I. INTRODUCTION

The study of genetic variations is a critical task for disease recognition and drug development. Such variations can be recognized using haplotypes, which are strings of Single Nucleotide Polymorphisms (SNPs) of chromosomes [1]. However, due to hardware limitations, haplotypes cannot be measured in their entirety. Instead, they must be reconstructed from the short reads obtained by high-throughput sequencing systems. For diploid organisms, each SNP site is modeled by either $+1$ or $-1$. Hence, the haplotype sequence $h$ corresponding to one of the chromosomes is a vector of $\pm 1$ entries, while the haplotype of the other is $-h$. Accordingly, each short read is drawn from either $h$ or $-h$. Such a sampling procedure can be modeled by a matrix, say $R$, whose entries are either $\pm 1$ or unobserved. Specifically, if we denote the set of observed entries by $\Omega$, the entries of $R \in \{-1, +1, 0\}^{m \times n}$ are given by

$$r_{ij} = \pm 1, \ (i,j) \in \Omega; \qquad r_{ij} = 0, \ (i,j) \notin \Omega, \tag{1}$$

where $0$ marks unobserved entries. The completed version of $R$, denoted $\bar{R}$, is assumed to be a rank-one matrix with $\pm 1$ entries that factorizes as

$$\bar{R} = c\, h^T, \tag{2}$$

where $h \in \{\pm 1\}^n$ is the intended haplotype sequence and $c \in \{\pm 1\}^m$ indicates, for each read, whether it comes from $h$ ($+1$) or $-h$ ($-1$).

When the number of reads (observed entries) is sufficient and the reads are error-free, it is easy to obtain $\bar{R}$ and consequently $h$. In practice, however, measurements are corrupted by noise, which typically flips the signs of some observed entries. Many haplotype assembly methods handle erroneous measurements through the MEC criterion, defined as [2]

$$\mathrm{MEC}(R, h) = \sum_{i=1}^{m} \min\big(\mathrm{hd}(r_i, h),\, \mathrm{hd}(r_i, -h)\big), \tag{3}$$

where the Hamming distance $\mathrm{hd}(\cdot,\cdot)$ counts the number of mismatched entries,

$$\mathrm{hd}(r_i, h) = \sum_{j:\, (i,j) \in \Omega} d(r_{ij}, h_j), \tag{4}$$

with $d(\cdot,\cdot)$ equal to $0$ for equal inputs and $1$ otherwise. Since the exact minimization of (3) is NP-hard, several heuristic methods have been proposed [2], [3], [4].

Apart from the MEC approaches, other techniques exploit the fact that the rank of $\bar{R}$ ought to be one. In [5], matrix factorization is employed in the minimization problem

$$\min_{u \in \mathbb{R}^m,\, v \in \mathbb{R}^n} \big\| P_\Omega\big(R - u v^T\big) \big\|_F^2, \tag{5}$$

where $P_\Omega$ is the sampling operator defined as

$$\big[P_\Omega(Q)\big]_{ij} = \begin{cases} q_{ij}, & (i,j) \in \Omega \\ 0, & (i,j) \notin \Omega \end{cases} \tag{6}$$

and $\|\cdot\|_F$ denotes the Frobenius norm. Then $h$ is obtained by applying the sign function to $v$. The nonconvex problem (5) is readily solved by alternating minimization.
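Problem (5) is bilinear in $u$ and $v$, so each factor admits a closed-form least-squares update with the other held fixed. The sketch below is a minimal illustration of this alternating scheme, not the implementation of [5]; the initialization, iteration count, and function names are our own assumptions.

```python
import numpy as np

def altmin_rank_one(R, mask, n_iter=100, seed=0):
    """Alternating minimization for (5):
    min_{u, v} || P_Omega(R - u v^T) ||_F^2.

    R    : (m, n) array with observed +/-1 entries and zeros elsewhere
    mask : (m, n) boolean array, True on the observed set Omega
    """
    m, n = R.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    u = np.zeros(m)
    for _ in range(n_iter):
        for i in range(m):                  # closed-form update of each u_i
            vi = v[mask[i]]
            d = vi @ vi
            u[i] = (R[i, mask[i]] @ vi) / d if d > 0 else 0.0
        for j in range(n):                  # symmetric update of each v_j
            uj = u[mask[:, j]]
            d = uj @ uj
            v[j] = (R[mask[:, j], j] @ uj) / d if d > 0 else 0.0
    return u, v  # haplotype estimate: np.sign(v); chromosome labels: np.sign(u)
```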
From the low-rank matrix completion point of view, haplotype estimation can also be performed as follows. Based on [6], it is possible to complete $P_\Omega(R)$ through the optimization problem

$$\min_{X} \big\| P_\Omega(R) - P_\Omega(X) \big\|_F^2 + \lambda \|X\|_*, \tag{7}$$

where $\|\cdot\|_*$ stands for the nuclear norm and $\lambda$ is a regularization factor. Afterwards, $h$ can be estimated by applying the sign function to the right singular vector corresponding to the largest singular value. In addition, minimizing the cost $\| P_\Omega(R) - P_\Omega(X) \|_F^2$ over the manifold of rank-one matrices, or over the Grassmann manifold, also completes $P_\Omega(R)$ and yields an estimate of $h$.
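Problem (7) can be attacked with proximal gradient descent, since the proximal operator of the nuclear norm soft-thresholds the singular values [12]. The following is one standard realization under an assumed step size and iteration budget, not the specific solver of [6].

```python
import numpy as np

def svt_complete(R, mask, lam=1.0, step=0.25, n_iter=200):
    """Proximal gradient for (7):
    min_X || P_Omega(R) - P_Omega(X) ||_F^2 + lam * ||X||_*.
    """
    X = np.zeros_like(R, dtype=float)
    for _ in range(n_iter):
        G = 2.0 * mask * (X - R)             # gradient of the data-fit term
        U, s, Vt = np.linalg.svd(X - step * G, full_matrices=False)
        s = np.maximum(s - step * lam, 0.0)  # soft-threshold the spectrum [12]
        X = (U * s) @ Vt
    return X  # haplotype estimate: sign of the top right singular vector of X
```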

In this paper, we propose a new error correction mechanism over the manifold of rank-one matrices to solve the noisy HAP. This approach exploits the underlying structure of the HAP, which seeks a rank-one matrix with $\pm 1$ entries. As a result, unlike common matrix completion methods, we propose a Hamming distance cost function over the manifold of rank-one matrices to complete the desired matrix. We then present a differentiable surrogate for this nonsmooth cost function and analyze its convergence behaviour. In this way, unlike existing MEC-based algorithms, we are able to derive performance guarantees for our optimization problem. Simulation results confirm that this method is more reliable for solving noisy HAPs.

The paper is organized as follows. Section II presents the required concepts of manifold optimization. Section III is devoted to our main result. Simulation results are reported in Section IV, and Section V concludes the paper.

II. PRELIMINARIES

As mentioned, the HAP can be modeled as a rank-one matrix completion problem. On the other hand, the set of real-valued rank-one matrices forms a smooth manifold. To optimize a differentiable function over such a manifold, we first review some concepts of optimization over manifolds.

Theorem 1. The set of all $m \times n$ real-valued matrices of rank one is a smooth manifold $\mathcal{M}$ whose tangent space at a point $X \in \mathcal{M}$, with singular value decomposition (SVD) $X = U \Sigma V^T$, is

$$\mathcal{T}_X \mathcal{M} = \big\{ U M V^T + U_p V^T + U V_p^T \,:\, M \in \mathbb{R},\ U_p \in \mathbb{R}^m,\ U_p^T U = 0,\ V_p \in \mathbb{R}^n,\ V_p^T V = 0 \big\}. \tag{8}$$

Convergence of manifold optimization schemes is evaluated through a metric on the manifold. Inspired by the Euclidean space $\mathbb{R}^{m \times n}$, the metric on the manifold of rank-one matrices is defined by the inner product $\langle \xi, \eta \rangle_X = \mathrm{tr}(\xi^T \eta)$, where the subscript $X$ indicates the restriction to the tangent space $\mathcal{T}_X \mathcal{M}$ and $\xi, \eta \in \mathcal{T}_X \mathcal{M}$ are tangent vectors. A smoothly varying inner product is a Riemannian metric, and a manifold endowed with such a metric is called Riemannian. The gradient of a function on such a manifold is defined as follows.

Definition 1 ([9]). Let $f$ be a scalar-valued function over the Riemannian manifold $\mathcal{M}$ endowed with the inner product $\langle \cdot, \cdot \rangle_X$. The gradient of $f$ at a point $X \in \mathcal{M}$, denoted $\mathrm{grad}\, f(X)$, is the unique element of the tangent space $\mathcal{T}_X \mathcal{M}$ satisfying

$$\langle \mathrm{grad}\, f(X), \xi \rangle_X = Df(X)[\xi] \quad \forall \xi \in \mathcal{T}_X \mathcal{M}, \tag{9}$$

where $Df(X)[\xi]$ denotes the directional derivative of $f$ along the tangent vector $\xi$.

By iteratively solving an optimization problem over the manifold $\mathcal{M}$, we may reach a point of the tangent space $\mathcal{T}_X \mathcal{M}$ that does not necessarily belong to the manifold. To bring this point back to the manifold, the retraction defined below is used.

Definition 2 ([9]). Let $T\mathcal{M} := \bigcup_{X \in \mathcal{M}} \{X\} \times \mathcal{T}_X \mathcal{M}$ be the tangent bundle and $R : T\mathcal{M} \to \mathcal{M}$ be a smooth mapping whose restriction to $\mathcal{T}_X \mathcal{M}$ is $R_X$. Then $R$ is a retraction on the manifold $\mathcal{M}$ if
1) $R_X(0_X) = X$, where $0_X$ is the zero element of $\mathcal{T}_X \mathcal{M}$;
2) $DR_X(0_X) = \mathrm{id}_{\mathcal{T}_X \mathcal{M}}$, where $\mathrm{id}_{\mathcal{T}_X \mathcal{M}}$ is the identity mapping on $\mathcal{T}_X \mathcal{M}$.

More concretely, the result of the retraction $R_X(\xi)$ is found by first calculating $Y = X + \xi$, where $X \in \mathcal{M}$ and $\xi \in \mathcal{T}_X \mathcal{M}$, and then applying the retraction mapping to $Y$. It has been shown that retracting onto the manifold of rank-one matrices is equivalent to taking the SVD of $Y = X + \xi$ and setting all singular values except the largest one to zero [10].
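In code, this retraction is a truncated SVD. A minimal sketch, assuming dense matrices and using a full SVD for clarity:

```python
import numpy as np

def retract_rank_one(Y):
    """Retract Y = X + xi onto the rank-one manifold: keep only the largest
    singular value of the SVD of Y and zero out the rest [10]."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])
```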
Another concept to note is the gradient descent algorithm on Riemannian manifolds, given in Alg. 1 [9]. Steps 1 and 2 determine the search direction and evaluate convergence. Step 3, known as Armijo backtracking, guarantees that the generated points form a descent sequence [9]. Step 4 performs the retraction onto the manifold.

Alg. 1: Gradient descent method on a Riemannian manifold.

Requirements: differentiable cost function $f$, manifold $\mathcal{M}$, inner product $\langle \cdot, \cdot \rangle$, initial matrix $X_0 \in \mathcal{M}$, retraction $R$, scalars $\bar{\alpha} > 0$, $\beta, \sigma \in (0, 1)$, and tolerance $\tau > 0$.

for $i = 0, 1, 2, \ldots$ do
  Step 1: Set $\xi_i$ as the negative direction of the gradient, $\xi_i := -\mathrm{grad}\, f(X_i)$.
  Step 2: Convergence evaluation: if $\|\xi_i\| < \tau$, then break.
  Step 3: Find the smallest $m \ge 0$ satisfying $f(X_i) - f\big(R_{X_i}(\bar{\alpha} \beta^m \xi_i)\big) \ge \sigma \bar{\alpha} \beta^m \langle \xi_i, \xi_i \rangle_{X_i}$.
  Step 4: Set $X_{i+1} := R_{X_i}(\bar{\alpha} \beta^m \xi_i)$.
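Alg. 1 translates almost line for line into code. The sketch below is generic: it takes the cost, its Riemannian gradient, and the retraction as callables; the default parameter values and the cap on backtracking steps are our own additions.

```python
import numpy as np

def riemannian_gd(f, grad, retract, X0, alpha_bar=1.0, beta=0.5,
                  sigma=1e-4, tol=1e-6, max_iter=500, max_backtrack=50):
    """Gradient descent on a Riemannian manifold (Alg. 1).

    f       : cost function on the manifold
    grad    : X -> Riemannian gradient of f at X (a tangent vector)
    retract : (X, xi) -> point of the manifold near X + xi
    """
    X = X0
    for _ in range(max_iter):
        xi = -grad(X)                         # Step 1: steepest-descent direction
        xi_sq = float(np.sum(xi * xi))        # <xi, xi>_X under the trace metric
        if np.sqrt(xi_sq) < tol:              # Step 2: convergence evaluation
            break
        t = alpha_bar
        for _ in range(max_backtrack):        # Step 3: Armijo backtracking
            if f(X) - f(retract(X, t * xi)) >= sigma * t * xi_sq:
                break
            t *= beta
        X = retract(X, t * xi)                # Step 4: retract to the manifold
    return X
```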

III. MAIN RESULT

In the following, we first show through an example of the noisy HAP that minimizing the cost function $\| P_{\Omega_E}(R) - P_\Omega(X) \|_F^2$ and computing the haplotype by applying the sign function to the right singular vector corresponding to the largest singular value need not lead to the desired result. We then propose an optimization problem that properly models the HAP.

Example 1. Build up $\bar{R}$ by substituting particular $\pm 1$ vectors $h$ and $c$ in (2), and let an erroneous sampling $P_{\Omega_E}(\bar{R})$ of it be given, where the subscript $E$ stands for erroneous observation, the only unobserved entry is $(1,2)$, and the set of erroneous observations is $\{(1,5), (2,1), (3,3)\}$. Based on our simulations, minimizing $\| P_{\Omega_E}(\bar{R}) - X \|_F^2$ over the manifold of rank-one matrices can return either of two solutions: a matrix $\bar{R}_1$ with fixed numerical entries, or a matrix $\bar{R}_2$ whose entries involve infinitesimal positive numbers $\varepsilon$ and $\gamma$. One can easily check that although $\| P_{\Omega_E}(\bar{R}) - \bar{R}_1 \|_F \approx \| P_{\Omega_E}(\bar{R}) - \bar{R}_2 \|_F \approx 2.8284$ for this example, the corresponding haplotype estimates $h_1$ and $h_2$ differ, while only $h_1$ is the correct answer.

For the above example, with the Hamming distance of (4) in hand, it is easy to verify that the inequality

$$\mathrm{hd}\Big(\mathrm{vec}\big(P_\Omega(\mathrm{sign}(\bar{R}_1))\big),\, \mathrm{vec}\big(P_{\Omega_E}(\bar{R})\big)\Big) < \mathrm{hd}\Big(\mathrm{vec}\big(P_\Omega(\mathrm{sign}(\bar{R}_2))\big),\, \mathrm{vec}\big(P_{\Omega_E}(\bar{R})\big)\Big) \tag{14}$$

holds, where $\mathrm{vec}(\cdot)$ vectorizes its input matrix. Equivalently, we have

$$\big\| P_{\Omega_E}(\bar{R}) - P_\Omega(\mathrm{sign}(\bar{R}_1)) \big\|_0 < \big\| P_{\Omega_E}(\bar{R}) - P_\Omega(\mathrm{sign}(\bar{R}_2)) \big\|_0, \tag{15}$$

where $\|\cdot\|_0$ is the $\ell_0$-norm, which counts the number of nonzero entries of a matrix. Therefore, we propose the following minimization problem for the noisy HAP:

$$\min_{X \in \mathcal{M}} \big\| P_{\Omega_E}(R) - P_\Omega(\mathrm{sign}(X)) \big\|_0. \tag{16}$$

In other words, through (16) we seek a rank-one matrix whose sign pattern is as similar as possible to the observations. The objective function in (16), however, is nondifferentiable. To cope with this difficulty, we first replace $\mathrm{sign}(x_{ij})$ by the differentiable function $\gamma_1 \tan^{-1}(\gamma_2 x_{ij})$, in which the positive values $\gamma_1$ and $\gamma_2$ are selected so as to approximate the sign function. Another source of nondifferentiability in (16) is the discontinuity of the $\ell_0$-norm, which we address by replacing it with the $\ell_p$-norm, applied to the vectorized matrix, for $p \ge 1$. However, since the $\ell_p$-norm is still nondifferentiable whenever

$$\gamma_1 \tan^{-1}(\gamma_2 X_{ij}) = \pm 1 \tag{17}$$

for some $(i,j) \in \Omega$, we limit the value of $\gamma_1$ to the open interval $(0, 2/\pi)$, so that $|\gamma_1 \tan^{-1}(\gamma_2 X_{ij})| < 1$. In this way, (16) is reformulated as the optimization problem

$$\min_{X \in \mathcal{M}} f(X) = \big\| P_{\Omega_E}(R) - P_\Omega\big(\gamma_1 \tan^{-1}(\gamma_2 X)\big) \big\|_p^p, \tag{18}$$

where $\tan^{-1}$ acts elementwise. Although sparsity is promoted in (18) for $0 < p \le 1$, here we choose $p$ equal to or slightly greater than $1$ so that the triangle inequality can be used in proving the convergence of Alg. 1. Simulation results will later show that such a choice leads to more accurate haplotype estimates.
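To run Alg. 1 on (18), one needs the cost, its Riemannian gradient, and the retraction. Under the trace metric, the Riemannian gradient is the projection of the entrywise Euclidean gradient onto the tangent space (8). The sketch below spells out both, with the chain rule applied to $\tan^{-1}$; the function names and defaults are ours, and the projection formula $P_U G + G P_V - P_U G P_V$ is the standard one for fixed-rank manifolds rather than a formula stated in this letter.

```python
import numpy as np

def smoothed_cost(X, R, mask, g1=0.5, g2=2.0, p=1.2):
    """f(X) = || P_OmegaE(R) - P_Omega(g1 * arctan(g2 * X)) ||_p^p, as in (18)."""
    E = mask * (R - g1 * np.arctan(g2 * X))
    return float(np.sum(np.abs(E) ** p))

def euclidean_grad(X, R, mask, g1=0.5, g2=2.0, p=1.2):
    """Entrywise gradient of smoothed_cost via the chain rule on arctan."""
    E = mask * (R - g1 * np.arctan(g2 * X))
    return (-p * np.abs(E) ** (p - 1) * np.sign(E)
            * mask * g1 * g2 / (1.0 + (g2 * X) ** 2))

def project_tangent(X, G):
    """Project G onto the tangent space (8) at the rank-one point X,
    giving the Riemannian gradient under the trace metric."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, :1], Vt[:1].T           # dominant singular vectors of X
    Pu, Pv = u @ u.T, v @ v.T
    return Pu @ G + G @ Pv - Pu @ G @ Pv
```

With these pieces, the proposed method amounts to calling the earlier `riemannian_gd` with `f = smoothed_cost`, `grad` as the composition of `project_tangent` with `euclidean_grad`, and `retract(X, xi) = retract_rank_one(X + xi)`.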
Now we prove the convergence of Alg. 1 for (18).

Theorem 2. Let positive values $\gamma_1 < 2/\pi$ and $\gamma_2$ be given. Then, by choosing the initial matrix $X_0$ in such a way that $\|X_0\|_F^2$ is bounded and $f(X_0) < |\Omega|$, Alg. 1 converges.

Proof. By Theorem 4.3.1 in [9], every limit point of the infinite sequence $\{X_i\}$ generated by Alg. 1 is a critical point of the cost function $f(X)$ in (18). Accordingly, we need to prove that, under the stated conditions, $\{X_i\}$ has a limit point belonging to the manifold of rank-one matrices, which implies convergence. To do so, we show that the sequence lies in a compact set, in which every sequence has a convergent subsequence [11]. Since $\{f(X_i)\}$ is decreasing, having a convergent subsequence then guarantees the convergence of $\{X_i\}$.

To show the compactness of the set containing $\{X_i\}$, we must prove that this set is bounded and closed [11]. First, closedness. Any limit point of $\{X_i\}$ is either the zero matrix or a rank-one matrix belonging to the manifold, so it suffices to show that, under the conditions of Theorem 2, the zero matrix is not a limit point of $\{X_i\}$. By Step 3 of Alg. 1, $\{X_i\}$ is a decreasing sequence for $f(X)$. Suppose $\{Y_j\}$ is a subsequence of $\{X_i\}$ converging to the zero matrix. Then there is an integer $K$ such that, for all $j > K$, $\|Y_j\|_p^p \le \|Y_j\|_F^2 < 2^{-j}$. From the triangle inequality, $f(Y_j)^{1/p} \ge \| P_{\Omega_E}(R) \|_p - \| \gamma_1 \tan^{-1}(\gamma_2 Y_j) \|_p$ for $j > K$, and thus $\lim_{j \to \infty} f(Y_j) \ge \| P_{\Omega_E}(R) \|_p^p = |\Omega|$, since the observed entries are $\pm 1$. Now, we simply enforce Alg. 1 to start from an initial matrix $X_0$ for which $f(X_0) < |\Omega|$. Since $\{X_i\}$ is decreasing for $f(X)$, the zero matrix cannot be a limit point. This ensures that $\{X_i\}$ lies in a closed set.

Next, we show that $\{X_i\}$ stays in a bounded set. To this end, we modify (18) and define $f_m(X)$ by adding the regularization term $\mu \|X\|_F^2$, where $\mu$ is a positive value. Since Alg. 1 starts from an initial matrix $X_0$ satisfying $f(X_0) < |\Omega|$, we get $f_m(X_0) < |\Omega| + \mu \|X_0\|_F^2$. Moreover, $X_0$ is norm bounded, so $f_m(X_0)$ is bounded by some value, say $z$. Therefore, the decreasing sequence $\{X_i\}$ generated by Alg. 1 for $f_m$ satisfies $f(X_i) + \mu \|X_i\|_F^2 < z$, showing that $\|X_i\|_F^2$ is bounded. Since, for an infinitesimal value of $\mu$, the sequences generated by Alg. 1 for $f(X)$ and $f_m(X)$ coincide, $\{X_i\}$ stays in a bounded set.

From the above reasoning, $\{X_i\}$ stays in a compact set, meaning that the proposed minimization problem (18) converges. ∎

IV. SIMULATION RESULTS

For comparison purposes, two criteria are considered. First, we evaluate the completion performance of different methods using the NMSE, defined as

$$\mathrm{NMSE} = \frac{\| \bar{R} - \hat{R} \|_F^2}{\| \bar{R} \|_F^2}, \tag{19}$$

where $\hat{R}$ is the estimate of the original matrix $\bar{R}$. The other criterion is the Hamming distance $\mathrm{hd}$ between the original haplotype and its estimate. Note that since the original haplotype can be either $h$ or $-h$, we calculate the Hamming distance of the estimated haplotype to both $h$ and $-h$ and keep the minimum.
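Both criteria are one-liners; the only subtlety is the global sign ambiguity, which the minimum over $h$ and $-h$ removes. A small sketch, with our own function names:

```python
import numpy as np

def nmse(R_bar, R_hat):
    """NMSE of (19)."""
    return (np.linalg.norm(R_bar - R_hat, 'fro') ** 2
            / np.linalg.norm(R_bar, 'fro') ** 2)

def haplotype_hd(h_true, h_est):
    """Hamming distance between haplotypes, minimized over the global sign."""
    return int(min(np.sum(h_true != h_est), np.sum(h_true != -h_est)))
```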

We compare our results with the matrix-completion-based methods discussed in Section I and with the most recent MEC-based algorithm [4].

Simulations are performed with synthetic data. To generate $\bar{R} \in \{\pm 1\}^{m \times n}$ in (2), we randomly generate $h \in \{\pm 1\}^n$ and $c \in \{\pm 1\}^m$. The random set of observations $\Omega$ is produced with observation probability $p_d$. Erroneous observations $\Omega_E$ are then obtained by flipping the sign of some of the entries observed through $\Omega$, where $\Omega_E \subseteq \Omega$. We set $\gamma_1 = 0.5$, $\gamma_2 = 2$, and $p = 1.2$. For the initial matrix $X_0$, we use the rank-one approximation of the matrix $P_{\Omega_E}(R)$; our simulations show that this initial matrix satisfies the conditions of Theorem 2. To obtain smooth curves, the results are averaged over 50 independent trials, each with different sets of $\Omega$ and $\Omega_E$.

Figs. 1 and 2 are depicted for $m = 250$, $n = 30$, and $0.05 \le p_d \le 0.2$, with $|\Omega_E|/|\Omega|$ set to $0.25$, where $|\cdot|$ denotes the cardinality of a set. One can observe that the proposed method outperforms the other ones in estimating haplotypes by yielding lower NMSE and hd values. As seen in Fig. 2, the MEC-based algorithm of [4] competes with our algorithm only for observation probabilities below 0.07 and severely fails to estimate the haplotype correctly as the observation probability increases. This deficiency of [4] is essentially due to its reliance on the erroneous observations.

Fig. 1. NMSE of matrix completion versus observation probability for $|\Omega_E|/|\Omega| = 0.25$.

Fig. 2. Averaged Hamming distance between the estimated haplotype and the original one versus observation probability for $|\Omega_E|/|\Omega| = 0.25$.

To generate Figs. 3 and 4, we vary $|\Omega_E|/|\Omega|$ from 0.14 to 0.28, proportionally vary $p$ from 1.05 to 1.2, and set $p_d = 0.17$. As seen, these results again demonstrate that the proposed method outperforms the others.

Fig. 3. NMSE of matrix completion versus $|\Omega_E|/|\Omega|$ for $p_d = 0.17$.

Fig. 4. Averaged Hamming distance between the estimated haplotype and the original one versus $|\Omega_E|/|\Omega|$ for $p_d = 0.17$.

V. CONCLUSION

The haplotype assembly problem was investigated. A new matrix completion minimization problem was proposed that benefits from an error correction mechanism over the manifold of rank-one matrices. The convergence of an iterative algorithm for this problem was proved theoretically. Simulation results validated the proposed optimization problem, which produced more accurate haplotype estimates than some recent related methods.

REFERENCES

[1] R. Schwartz et al., "Theory and algorithms for the haplotype assembly problem," Communications in Information & Systems, vol. 10, no. 1, pp. 23–38, 2010.
[2] V. Bansal and V. Bafna, "HapCUT: an efficient and accurate algorithm for the haplotype assembly problem," Bioinformatics, vol. 24, no. 16, pp. i153–i159, 2008.
[3] Z. Puljiz and H. Vikalo, "Decoding genetic variations: Communications-inspired haplotype assembly," IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 13, no. 3, pp. 518–530, 2016.
[4] A. Hashemi, B. Zhu, and H. Vikalo, "Sparse tensor decomposition for haplotype assembly of diploids and polyploids," BMC Genomics, vol. 19, no. 4, p. 191, 2018.
[5] C. Cai, S. Sanghavi, and H. Vikalo, "Structured low-rank matrix factorization for haplotype assembly," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 647–657, 2016.
[6] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
[7] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[8] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, 2010.
[9] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[10] P.-A. Absil and J. Malick, "Projection-like retractions on matrix manifolds," SIAM Journal on Optimization, vol. 22, no. 1, pp. 135–158, 2012.
[11] E. Kreyszig, Introductory Functional Analysis with Applications, vol. 1. Wiley, New York, 1978.
[12] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.