Dictionary Learning for L1-Exact Sparse Coding

Mark D. Plumbley
Department of Electronic Engineering, Queen Mary University of London,
Mile End Road, London E1 4NS, United Kingdom. Email: mark.plumbley@elec.qmul.ac.uk

Abstract. We have derived a new algorithm for dictionary learning for sparse coding in the $\ell_1$ exact sparse framework. The algorithm does not rely on an approximation residual to operate, but rather uses the special geometry of the $\ell_1$ exact sparse solution to give a computationally simple yet conceptually interesting algorithm. A self-normalizing version of the algorithm is also derived, which uses negative feedback to ensure that basis vectors converge to unit norm. The operation of the algorithm is illustrated on a simple numerical example.

1 Introduction

Suppose we have a sequence of observations $X = [x^1, \ldots, x^p]$, $x^k \in \mathbb{R}^n$. In the sparse coding problem [1] we wish to find a dictionary matrix $A$ and representation matrix $S$ such that

$$X = AS \qquad (1)$$

and where the representations $s^k \in \mathbb{R}^m$ in the matrix $S = [s^1, \ldots, s^p]$ are sparse, i.e. where there are few non-zero entries in each $s^k$. In the case where we look for solutions $X = AS$ with no error, we say that this is an exact sparse solution. In the sparse coding problem we typically have $m > n$, so this is closely related to the overcomplete independent component analysis (overcomplete ICA) problem, which has the additional assumption that the components of the representation vectors $s^k$ are statistically independent.

If the dictionary $A$ is given, then for each sample $x = x^k$ we can separately look for the sparsest representation

$$\min_s \|s\|_0 \quad \text{such that} \quad x = As. \qquad (2)$$

However, even this is a hard problem, so one approach is to solve instead the relaxed $\ell_1$-norm problem

$$\min_s \|s\|_1 \quad \text{such that} \quad x = As. \qquad (3)$$

This approach, known in the signal processing literature as Basis Pursuit [2], can be solved efficiently with linear programming methods (see also [3]).
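As a concrete illustration of this reformulation (not code from the paper), the sketch below solves (3) with SciPy's general-purpose LP solver by splitting $s$ into nonnegative parts; the function and variable names are our own.

```python
# Minimal sketch: min ||s||_1 s.t. x = A s as a linear program, assuming
# NumPy and SciPy are available. Split s = u - v with u, v >= 0 and minimize
# sum(u) + sum(v) subject to [A, -A][u; v] = x.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, x):
    """Return a minimum-l1-norm solution of x = A s (Basis Pursuit)."""
    n, m = A.shape
    c = np.ones(2 * m)                      # objective: ||s||_1 = sum(u) + sum(v)
    A_eq = np.hstack([A, -A])               # equality constraint x = A u - A v
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * m))
    u, v = res.x[:m], res.x[m:]
    return u - v                            # recover signed coefficients s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 4))
    A /= np.linalg.norm(A, axis=0)          # unit-norm dictionary atoms
    s_true = np.array([0.0, 1.5, 0.0, -0.7])
    x = A @ s_true
    s_hat = basis_pursuit(A, x)
    print(np.allclose(A @ s_hat, x, atol=1e-6))   # exact reconstruction
```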

Even if such efficient sparse representation methods exist, learning the dictionary $A$ is a non-trivial task. Several methods have been proposed in the literature, such as those by Olshausen and Field [4] and Lewicki and Sejnowski [5], and these can be derived within a principled probabilistic framework [1]. A recent alternative is the K-SVD algorithm [6], which is a generalization of the K-means algorithm. However, many of these algorithms are designed to solve the sparse approximation problem $X = AS + R$ for some nonzero residual term $R$, rather than the exact sparse problem (1). For example, the Olshausen and Field [4] approximate maximum likelihood algorithm is

$$\Delta A = \eta\, E(r s^T) \qquad (4)$$

where $r = x - As$ is the residual after approximation, and the K-SVD algorithm [1] minimizes the norm $\|R\|_F = \|X - AS\|_F$. If we have a sparse representation algorithm that is successfully able to solve (3) exactly on each data sample $x^k$, then we have a zero residual $R = 0$, and there is nothing to drive the dictionary learning algorithm. Some other dictionary learning algorithms have other constraints: for example, the method of Georgiev et al. [7] requires at most $m - 1$ nonzero elements in each column of $S$.

While these algorithms have been successful for practical problems, in this paper we specifically explore the special geometry of the $\ell_1$ exact sparse dictionary learning problem. We shall derive a new dictionary learning algorithm for the $\ell_1$ exact sparse problem, using the basis vertex $c = (\tilde{A})^{-T}\mathbf{1}$ associated with a subdictionary (basis set) $\tilde{A}$ identified in the $\ell_1$ exact sparse representation problem (3).

2 The Dual Problem and Basis Vertex

The linear program (3) has a corresponding dual linear program [2]

$$\max_c\; x^T c \quad \text{such that} \quad \pm a_j^T c \le 1, \quad j = 1, \ldots, m \qquad (5)$$

which has an optimum $c$ associated with any optimum $s$ of (3). In a previous paper we explored the polytope geometry of this type of dual problem, and derived an algorithm, Polytope Faces Pursuit (PFP), which searches for the optimal vertex $c$ which maximizes $x^T c$, and uses that to find the optimal vector $s$ [8]. Polytope Faces Pursuit is a gradient projection method [9] which iteratively builds a solution basis $\tilde{A}$ consisting of a subset of the signed columns $\sigma_j a_j$ of $A$, $\sigma_j \in \{-1, 0, +1\}$, chosen such that $x = \tilde{A}\tilde{s}$ with $\tilde{s} > 0$ containing the absolute values of the nonzero coefficients of $s$ at the solution. The algorithm is similar in structure to orthogonal matching pursuit (OMP), but with a modified admission criterion

$$a_k = \arg\max_{a_i} \frac{a_i^T r}{1 - a_i^T c} \qquad (6)$$

to add a new basis vector $a_k$ to the current basis set, together with an additional rule to switch out basis vectors which are no longer feasible.
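For intuition only, the following sketch computes the two quantities just introduced: the basis vertex $c$ solving $\tilde{A}^T c = \mathbf{1}$ (via a pseudo-inverse, so non-square subdictionaries are also handled) and the admission scores of criterion (6). It is a loose paraphrase of these formulas, not the published PFP implementation; `basis_vertex`, `admission_scores` and the `active` index set are our own names.

```python
# Sketch of the dual-vertex and admission-score computations from Section 2.
import numpy as np

def basis_vertex(A_sub):
    """Vertex c with A_sub^T c = 1; pinv covers non-square subdictionaries."""
    q = A_sub.shape[1]
    return np.linalg.pinv(A_sub.T) @ np.ones(q)

def admission_scores(A, r, c, active):
    """PFP-style scores a_i^T r / (1 - a_i^T c); the largest picks the new atom."""
    num = A.T @ r
    den = 1.0 - A.T @ c
    scores = np.full(A.shape[1], -np.inf)
    ok = den > 1e-12                      # only atoms strictly inside the face
    scores[ok] = num[ok] / den[ok]
    scores[list(active)] = -np.inf        # never re-admit an already-active atom
    return scores
```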

The basis vertex $c = (\tilde{A})^{-T}\mathbf{1}$ is the solution to the dual problem (5). During the operation of the algorithm $c$ satisfies $A^T c \le 1$, so it remains dual-feasible throughout. For all active atoms $a_j$ in the current basis set, we have $a_j^T c = 1$. Therefore at the minimum $\ell_1$ norm solution the following conditions hold:

$$\tilde{A}^T c = \mathbf{1} \qquad (7)$$
$$x = \tilde{A}\tilde{s}, \quad \tilde{s} > 0. \qquad (8)$$

We will use these conditions in our derivation of the dictionary learning algorithm that follows.

3 Dictionary Learning Algorithm

We would like to construct an algorithm to find the matrix $A$ that minimizes the total $\ell_1$ norm

$$J = \sum_{k=1}^{p} \|s^k\|_1 \qquad (9)$$

where $s^k$ is chosen such that $x^k = A s^k$ for all $k = 1, \ldots, p$, i.e. such that $X = AS$, and where the columns of $A$ are constrained to have unit norm. In particular, we would like to construct an iterative algorithm to adjust $A$ to reduce the total $\ell_1$ norm (9): let us therefore investigate how $J$ depends on $A$.

For the contribution due to the $k$th sample we have $J = \sum_k J^k$ where $J^k = \|s^k\|_1 = \mathbf{1}^T \tilde{s}^k$ since $\tilde{s}^k \ge 0$. Dropping the superscripts $k$ from $x^k$, $\tilde{A}^k$ and $\tilde{s}^k$, we therefore wish to find how $J^k = \mathbf{1}^T \tilde{s}$ changes with $\tilde{A}$, so taking derivatives of $J^k$ we get

$$dJ^k/dt = \mathbf{1}^T (d\tilde{s}/dt). \qquad (10)$$

Now taking the derivative of (8) for fixed $x$ we get

$$0 = (d\tilde{A}/dt)\,\tilde{s} + \tilde{A}\,(d\tilde{s}/dt) \qquad (11)$$

and pre-multiplying by $c^T$ gives us

$$c^T (d\tilde{A}/dt)\,\tilde{s} = -c^T \tilde{A}\,(d\tilde{s}/dt) \qquad (12)$$
$$= -\mathbf{1}^T (d\tilde{s}/dt) \qquad (13)$$
$$= -dJ^k/dt \qquad (14)$$

where the last two equations follow from (7) and (10). Introducing $\mathrm{trace}(\cdot)$ for the trace of a matrix, we can rearrange this to get

$$\frac{dJ^k}{dt} = -\mathrm{trace}\Big(c^T \frac{d\tilde{A}}{dt}\,\tilde{s}\Big) = -\mathrm{trace}\Big((c\,\tilde{s}^T)^T \frac{d\tilde{A}}{dt}\Big) = -\Big\langle c\,\tilde{s}^T, \frac{d\tilde{A}}{dt} \Big\rangle \qquad (15)$$

from which we see that the gradient of $J^k$ with respect to $\tilde{A}$ is given by $\nabla_{\tilde{A}} J^k = -c\,\tilde{s}^T$. Summing up over all $k$ and applying to the original matrix $A$ we get

$$\nabla_A J = -\sum_k c^k (s^k)^T = -CS^T \qquad (16)$$

with $C = [c^1, \ldots, c^p]$, a somewhat surprisingly simple result. Therefore the update

$$\Delta A = \eta \sum_k c^k (s^k)^T \qquad (17)$$

will perform a steepest descent search for the minimum total $\ell_1$ norm $J$, and any path $dA/dt$ for which $\langle dA/dt, \nabla_A J \rangle < 0$ will cause $J$ to decrease.
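Assuming an exact $\ell_1$ solver has already produced the coefficients $s^k$ and dual vertices $c^k$ for every sample (stacked as the columns of $S$ and $C$), the gradient result (16) and the unconstrained step (17) reduce to two lines of matrix code. The following is only a sketch with our own function names.

```python
# Sketch of the objective (9) and the unconstrained descent step (17),
# assuming C (n x p) holds the dual vertices c^k and S (m x p) the s^k.
import numpy as np

def total_l1_norm(S):
    """J = sum_k ||s^k||_1, the quantity the update is descending."""
    return np.abs(S).sum()

def unconstrained_update(A, C, S, eta=0.01):
    """One steepest-descent step Delta A = eta * C S^T (equation (17))."""
    return A + eta * (C @ S.T)
```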

3.1 Unit norm atom constraint

Now without any constraint, algorithm (17) will tend to reduce the $\ell_1$ norm by causing $A$ to increase without bound, so we need to impose a constraint on $A$. A common constraint is to require the columns $a_j$ of $A$ to be unit vectors, $\|a_j\| = 1$, i.e. $a_j^T a_j = 1$. We therefore require our update to be restricted to paths $da_j/dt$ for which $a_j^T (da_j/dt) = 0$. To find the projection of (16) in this direction, consider the gradient component

$$g_j = -\frac{dJ}{da_j} = \sum_k c^k s_j^k. \qquad (18)$$

The orthogonal projection of $g_j$ onto the required tangent space is given by

$$g_j^{\perp} = P_{a_j}^{\perp} g_j = \Big(I - \frac{a_j a_j^T}{\|a_j\|^2}\Big) g_j = g_j - \frac{1}{\|a_j\|^2}\, a_j (a_j^T g_j). \qquad (19)$$

Now considering the rightmost factor $a_j^T g_j$, from (18) we get

$$a_j^T g_j = \sum_k a_j^T c^k s_j^k. \qquad (20)$$

Considering just the $k$th term $a_j^T c^k s_j^k$: if $a_j$ is one of the basis vectors in $\tilde{A}^k$ (with possible change of sign $\sigma_j^k = \mathrm{sign}(s_j^k)$) which forms part of the solution $x^k = \tilde{A}^k \tilde{s}^k$ found by a minimum $\ell_1$ norm solution, then we must have $\sigma_j^k a_j^T c^k = 1$ where $\sigma_j^k = \mathrm{sign}(s_j^k)$, because $(\tilde{A}^k)^T c^k = \mathbf{1}$, so $a_j^T c^k s_j^k = \sigma_j^k s_j^k = |s_j^k|$. On the other hand, if $a_j$ does not form part of $\tilde{A}^k$, then $s_j^k = 0$ so $a_j^T c^k s_j^k = 0 = |s_j^k|$. Thus regardless of the involvement of $a_j$ in $\tilde{A}^k$, we have $a_j^T c^k s_j^k = |s_j^k|$, so

$$a_j^T g_j = \sum_k |s_j^k| \qquad (21)$$

and therefore

$$g_j^{\perp} = \sum_k c^k s_j^k - \frac{1}{\|a_j\|^2}\, a_j \sum_k |s_j^k|. \qquad (22)$$
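The projection (19)-(22) can be written for all atoms at once, since by (21) the factor $a_j^T g_j$ is just the column-wise sum of $|s_j^k|$ over samples. The sketch below follows (22) directly; the names are again our own, and $C$, $S$ are assumed to come from an exact $\ell_1$ solver as before.

```python
# Sketch of the tangent-space gradient (22) for every atom simultaneously.
import numpy as np

def projected_gradient(A, C, S):
    """Return the matrix whose column j is g_j^perp of equation (22)."""
    G = C @ S.T                          # column j holds g_j = sum_k c^k s_j^k
    w = np.abs(S).sum(axis=1)            # w_j = sum_k |s_j^k| = a_j^T g_j, by (21)
    norms2 = (A ** 2).sum(axis=0)        # ||a_j||^2 for each atom
    return G - A * (w / norms2)          # subtract the radial component
```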

Therefore we have the following tangent update rule:

$$a_j(T+1) = a_j(T) + \eta \Big( \sum_k c^k s_j^k - \frac{1}{\|a_j(T)\|^2}\, a_j(T) \sum_k |s_j^k| \Big) \qquad (23)$$

which will perform a tangent-constrained steepest descent update to find the minimum total $\ell_1$ norm $J$. We should note that the tangent update is not entirely sufficient to constrain $a_j$ to remain of unit norm, so an occasional renormalization step $a_j \leftarrow a_j / \|a_j\|$ will be required after a number of applications of (23).

4 Self-normalizing Algorithm

Based on the well-known negative feedback structure used in PCA algorithms such as the Oja [10] PCA neuron, we can modify algorithm (23) to produce the following self-normalizing algorithm that does not require the explicit renormalization step:

$$a_j(T+1) = a_j(T) + \eta \Big( \sum_k c^k s_j^k - a_j(T) \sum_k |s_j^k| \Big) \qquad (24)$$

where we have simply removed the factor $1/\|a_j(T)\|^2$ from the second term in (23). This algorithm is computationally very simple, and suggests an online version $a_j(k) = a_j(k-1) + \eta\big(c^k s_j^k - a_j(k-1)\,|s_j^k|\big)$ with the dictionary updated as each data point is presented. For unit norm basis vectors $\|a_j(T)\| = 1$, the update produced by algorithm (24) is identical to that produced by the tangent algorithm (23). Therefore, for unit norm basis vectors, algorithm (24) produces a step in a direction which reduces $J$. (Note that algorithm (24) will not necessarily reduce $J$ when $a_j$ is not unit norm.)
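The tangent rule (23) is simply one step along the projected gradient sketched above, followed by an occasional renormalization. A rough sketch of the self-normalizing rule (24) and its per-sample variant, under the same assumed $C$, $S$ layout and with our own function names:

```python
# Sketch of the batch self-normalizing update (24) and an online variant.
import numpy as np

def self_normalizing_update(A, C, S, eta=0.01):
    """Batch step (24): the 1/||a_j||^2 factor of (23) is simply dropped."""
    G = C @ S.T                          # sum_k c^k s_j^k, per atom j
    w = np.abs(S).sum(axis=1)            # sum_k |s_j^k|, per atom j
    return A + eta * (G - A * w)

def online_update(A, c_k, s_k, eta=0.01):
    """Per-sample step: a_j <- a_j + eta (c^k s_j^k - a_j |s_j^k|)."""
    return A + eta * (np.outer(c_k, s_k) - A * np.abs(s_k))
```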

To show that the norms of the basis vectors $a_j$ in algorithm (24) converge to unit length, we require that each $a_j$ must be involved in the representation of at least one pattern $x^k$, i.e. for some $k$ we have $s_j^k \neq 0$. (If this were not true, that basis vector would have been ignored completely so would not be updated by the algorithm.) Consider the ordinary differential equation (ODE) version of (24):

$$\frac{da_j}{dt} = \sum_k \big( c^k s_j^k - a_j |s_j^k| \big) \qquad (25)$$
$$= g_j^{\perp} + \frac{1}{\|a_j\|^2}\, a_j \sum_k |s_j^k| - a_j \sum_k |s_j^k| \qquad (26)$$
$$= g_j^{\perp} + \frac{1}{\|a_j\|^2}\big(1 - \|a_j\|^2\big)\, a_j \sum_k |s_j^k| \qquad (27)$$

which, noting that $a_j^T g_j^{\perp} = a_j^T P_{a_j}^{\perp} g_j = 0$, gives us

$$a_j^T \frac{da_j}{dt} = \frac{1}{\|a_j\|^2}\big(1 - \|a_j\|^2\big)\, a_j^T a_j \sum_k |s_j^k| = \big(1 - \|a_j\|^2\big) \sum_k |s_j^k|. \qquad (28)$$

Constructing the Lyapunov function [11] $Q = \tfrac{1}{4}\big(1 - \|a_j\|^2\big)^2 \ge 0$, which is zero if and only if $a_j$ has unit length, we get

$$dQ/dt = -\big(1 - \|a_j\|^2\big)\, a_j^T (da_j/dt) \qquad (29)$$
$$= -\big(1 - \|a_j\|^2\big)^2 \sum_k |s_j^k| \qquad (30)$$
$$\le 0 \qquad (31)$$

where $\sum_k |s_j^k| > 0$ since at least one of the $s_j^k$ is nonzero, so equality holds in (31) if and only if $1 - \|a_j\|^2 = 0$. Therefore the ODE (25) will cause $\|a_j\|$ to converge to 1 for all basis vectors $a_j$.

While algorithm (24) does not strictly require renormalization, we found experimentally that an explicit unit norm renormalization step did produce slightly more consistent behaviour in reduction of the total $\ell_1$ norm $J$.

Finally we note that at convergence of algorithm (24), the basis vectors must satisfy

$$a_j = \frac{\sum_k c^k s_j^k}{\sum_k |s_j^k|} \qquad (32)$$

so that $a_j$ must be a (signed) weighted sum of the basis vertices $c^k$ in which it is involved. While equation (32) is suggestive of a fixed point algorithm, we have observed that it yields unstable behaviour if used directly. Nevertheless we believe that it would be interesting to explore this in future for the final stages of an algorithm, as it nears convergence.

5 Augmenting Polytope Faces Pursuit

After an update to the dictionary $A$, it is not necessary to restart the search for the minimum $\ell_1$ norm solutions $s^k$ to $x^k = A s^k$ from $s^k = 0$. In many cases the dictionary vectors will have changed only slightly, so the signs $\sigma_j^k$ and selected subdictionary $\tilde{A}^k$ may be very similar to those of the previous solution, before the dictionary update. At the $T$th dictionary learning step we can therefore restart the search for $s^k(T)$ from the basis set selected by the last solution $s^k(T-1)$. However, if we start from the same subdictionary selection pattern after a change to the dictionary, we can no longer guarantee that the solution will be dual-feasible, i.e. that (5) is always satisfied, which is required for the Polytope Faces Pursuit algorithm [8]. While we will still have $a_j^T c = 1$ for all vectors $a_j$ in the solution subdictionary, we may have increased $a_j^T c$ for some other $a_j$ not in the original solution set, so that now $a_j^T c > 1$. To overcome this problem, if $a_j^T c > 1$ such that dual feasibility fails for a particular sample $k$ and basis vector $a_j$, we simply restart the sparse Polytope Faces Pursuit algorithm from $s = 0$ for this particular sample, to guarantee that dual-feasibility is restored.
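One way to realize this safeguard in code is sketched below, under our own interface: `solve` stands in for any exact $\ell_1$/PFP solver that accepts an initial support, which is an assumed API rather than one prescribed by the paper.

```python
# Sketch of the warm/cold restart rule from Section 5: keep the previous
# support only if the old dual vertex is still feasible for the new dictionary.
import numpy as np

def warm_or_cold_start(solve, A, x, prev_support=None, prev_c=None, tol=1e-9):
    """`solve(A, x, init_support)` is any exact l1 solver (hypothetical API)."""
    feasible = prev_c is not None and np.all(A.T @ prev_c <= 1.0 + tol)
    if prev_support is not None and feasible:
        return solve(A, x, init_support=prev_support)   # warm restart
    return solve(A, x, init_support=None)               # restart from s = 0
```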

We believe that it may be possible to construct a more efficient method to restore dual-feasibility, based on selectively swapping vectors to bring the solution into feasibility, but it appears to be non-trivial to guarantee that loops will not result.

6 Numerical Illustration

To illustrate the operation of the algorithm, Figure 1 shows a small graphical example. Here four source variables $s_j$ are generated with identical mixture-of-Gaussian (MoG) densities in an $n = 2$ dimensional space and added with angles $\theta \in \{0, \pi/6, \pi/3, 4\pi/6\}$.

[Fig. 1 (plots omitted). Illustration of dictionary learning of an $n = 2$ dimensional mixture of $m = 4$ MoG-distributed sources, for $p = 3000$ samples. The plots show (a) the initial condition, and updates after (b) 5 dictionary updates and (c) 25 dictionary updates. The learning curve (d) confirms that the $\ell_1$ norm decreases as the algorithm progresses. On (a)-(c), the longer arrows are scaled versions of the learned dictionary vectors $a_j$, with the shorter arrows showing the directions of the generating dictionary vectors for comparison.]

It is important to note that, even in the initial condition (Figure 1(a)), the basis set spans the input space and optimization of (3) has an exact solution $x = As$ for all data samples $x$, at least to within the numerical precision of the algorithm. Therefore this situation would not be suitable for any dictionary learning algorithm which relies on a residual $r = x - As$.
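For readers wishing to reproduce a qualitatively similar experiment, the sketch below generates sparse MoG-like sources mixed by unit vectors at the stated angles. The density parameters (spike probability and variances) are our own guesses; the paper does not specify them, so this is only a stand-in data generator on which to run the updates sketched earlier.

```python
# Rough sketch of a Figure-1-style setup: m = 4 sparse sources in n = 2 dims.
import numpy as np

def make_data(p=3000, angles=(0.0, np.pi / 6, np.pi / 3, 4 * np.pi / 6), seed=0):
    rng = np.random.default_rng(seed)
    A_true = np.stack([np.cos(angles), np.sin(angles)])   # 2 x 4 unit-norm atoms
    # MoG-like sources: mostly small values, occasional large "spikes".
    S = 0.05 * rng.standard_normal((len(angles), p))
    spikes = rng.random((len(angles), p)) < 0.2
    S[spikes] += 2.0 * rng.standard_normal(spikes.sum())
    X = A_true @ S                                         # exact mixtures, no noise
    return X, A_true
```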

7 Conclusions

We have derived a new algorithm for dictionary learning for sparse coding in the $\ell_1$ exact sparse framework. The algorithm does not rely on an approximation residual to operate, but rather uses the special geometry of the $\ell_1$ exact sparse solution to give a computationally simple yet conceptually interesting algorithm. A self-normalizing version of the algorithm is also derived, which uses negative feedback to ensure that basis vectors converge to unit norm. The operation of the algorithm was illustrated on a simple numerical example. While we emphasize the derivation and geometry of the algorithm in the present paper, we are currently working on applying this new algorithm to practical sparse approximation problems, and will present these results in future work.

8 Acknowledgements

This work was supported by EPSRC grants GR/S7580/01, GR/S813/01, GR/S85900/01, EP/C005554/1 and EP/D00046/1.

References

1. Kreutz-Delgado, K., Murray, J.F., Rao, B.D., Engan, K., Lee, T.W., Sejnowski, T.J.: Dictionary learning algorithms for sparse representation. Neural Computation 15 (2003) 349-396
2. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20 (1998) 33-61
3. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81 (2001) 2353-2362
4. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature 381 (1996) 607-609
5. Lewicki, M.S., Sejnowski, T.J.: Learning overcomplete representations. Neural Computation 12 (2000) 337-365
6. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: Design of dictionaries for sparse representation. In: Proceedings of SPARS'05, Rennes, France (2005) 9-12
7. Georgiev, P., Theis, F., Cichocki, A.: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Transactions on Neural Networks 16 (2005) 992-996
8. Plumbley, M.D.: Recovery of sparse representations by polytope faces pursuit. In: Proceedings of the 6th International Conference on Independent Component Analysis and Blind Source Separation (ICA 2006), Charleston, SC, USA (2006) 206-213. LNCS 3889
9. Rosen, J.B.: The gradient projection method for nonlinear programming. Part I. Linear constraints. J. Soc. Indust. Appl. Math. 8 (1960) 181-217
10. Oja, E.: A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology 15 (1982) 267-273
11. Cook, P.A.: Nonlinear Dynamical Systems. Prentice Hall, Englewood Cliffs, NJ (1986)