Topographic Dictionary Learning with Structured Sparsity

Topographic Dictionary Learning with Structured Sparsity

Julien Mairal (1), Rodolphe Jenatton (2), Guillaume Obozinski (2), Francis Bach (2)
(1) UC Berkeley, (2) INRIA - SIERRA Project-Team

San Diego, Wavelets and Sparsity XIV, August 2011

What this work is about

Group sparsity with overlapping groups.
Hierarchical and topographic dictionary learning; more generally, structured dictionaries of natural image patches.

Related publications:
[1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.
[2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. JMLR, 2011.

Part I: Introduction to Dictionary Learning

What is a Sparse Linear Model?

Let $x \in \mathbb{R}^m$ be a signal. Let $D = [d_1, \ldots, d_p] \in \mathbb{R}^{m \times p}$ be a set of normalized basis vectors. We call it a dictionary.

$D$ is adapted to $x$ if it can represent it with a few basis vectors, that is, there exists a sparse vector $\alpha \in \mathbb{R}^p$ such that $x \approx D\alpha$. We call $\alpha$ the sparse code:

$$
\underbrace{x}_{\in \mathbb{R}^m} \approx
\underbrace{\begin{bmatrix} d_1 & d_2 & \cdots & d_p \end{bmatrix}}_{D \in \mathbb{R}^{m \times p}}
\underbrace{\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix}}_{\alpha \in \mathbb{R}^p,\ \text{sparse}}
$$
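To make the model concrete, here is a minimal numpy sketch (dimensions, indices, and values are made up for illustration): a random dictionary with unit-norm columns synthesizes a signal from a code with only three active elements.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 16, 64
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)          # normalize the dictionary elements d_1, ..., d_p

alpha = np.zeros(p)                     # sparse code: only 3 of the 64 entries are nonzero
alpha[[4, 17, 40]] = [1.2, -0.7, 0.5]
x = D @ alpha                           # the signal x = D * alpha lives in R^m
print(np.count_nonzero(alpha), "nonzero coefficients out of", p)
```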

The Sparse Decomposition Problem

$$
\min_{\alpha \in \mathbb{R}^p}\ \underbrace{\tfrac{1}{2}\|x - D\alpha\|_2^2}_{\text{data fitting term}} + \underbrace{\lambda\,\psi(\alpha)}_{\text{sparsity-inducing regularization}}
$$

$\psi$ induces sparsity in $\alpha$:
- the $\ell_0$ pseudo-norm, $\|\alpha\|_0 = \#\{i \text{ s.t. } \alpha_i \neq 0\}$ (NP-hard);
- the $\ell_1$ norm, $\|\alpha\|_1 = \sum_{i=1}^p |\alpha_i|$ (convex); ...

This is a selection problem. When $\psi$ is the $\ell_1$ norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].
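As a quick numerical illustration, scikit-learn's Lasso solves exactly this problem (a sketch under the assumption that scikit-learn is available; note that its objective divides the quadratic term by the number of rows, so its regularization parameter must be rescaled to match $\lambda$ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, p = 64, 256
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)                     # normalized dictionary

alpha_true = np.zeros(p)
alpha_true[rng.choice(p, 5, replace=False)] = rng.standard_normal(5)
x = D @ alpha_true

lam = 0.05
# sklearn minimizes (1/(2m))||x - D a||_2^2 + reg * ||a||_1,
# so reg = lam / m matches (1/2)||x - D a||_2^2 + lam * ||a||_1.
model = Lasso(alpha=lam / m, fit_intercept=False, max_iter=10000)
model.fit(D, x)
print("selected atoms:", np.flatnonzero(model.coef_))
```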

Sparse representations for image restoration

Designed dictionaries [Haar, 1910], [Zweig, Morlet, Grossman 70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes 80s-today]: wavelets, curvelets, wedgelets, bandlets, ...-lets.

Learned dictionaries of patches [Olshausen and Field, 1997, Engan et al., 1999, Lewicki and Sejnowski, 2000, Aharon et al., 2006], ...:

$$
\min_{D \in \mathcal{D},\ \alpha_i \in \mathbb{R}^p}\ \sum_{i=1}^n \underbrace{\tfrac{1}{2}\|x_i - D\alpha_i\|_2^2}_{\text{reconstruction}} + \underbrace{\lambda\,\psi(\alpha_i)}_{\text{sparsity}}
$$

with $\psi(\alpha) = \|\alpha\|_0$ ($\ell_0$ pseudo-norm) or $\psi(\alpha) = \|\alpha\|_1$ ($\ell_1$ norm).

Sparse representations for image restoration: grayscale vs color image patches

Figure: Left: dictionary learned on grayscale image patches. Right: dictionary learned on color image patches (after removing the mean color from each patch).

Algorithms

$$
\min_{D \in \mathcal{D},\ \alpha \in \mathbb{R}^{p \times n}}\ \sum_{i=1}^n \tfrac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\,\psi(\alpha_i).
$$

How do we optimize that?
- alternate between $D$ and $\alpha$ [Engan et al., 1999], or other variants [Elad and Aharon, 2006];
- online learning [Olshausen and Field, 1997, Mairal et al., 2009, Skretting and Engan, 2010].

Code SPAMS available: http://www.di.ens.fr/willow/spams/, now open-source!
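A minimal sketch of the alternating strategy (not the paper's algorithm: this is a plain MOD-style update [Engan et al., 1999] with scikit-learn's Lasso as the sparse-coding step; names and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(X, p=100, lam=0.1, n_iter=20, seed=0):
    """Alternate between sparse coding (D fixed) and a least-squares
    dictionary update (alpha fixed). X is (m, n); columns are signals."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # Sparse coding step: solve one Lasso per signal.
        coder = Lasso(alpha=lam / m, fit_intercept=False, max_iter=5000)
        A = np.column_stack([coder.fit(D, X[:, i]).coef_ for i in range(n)])
        # Dictionary update step (method of optimal directions).
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(p))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # renormalize columns
    return D, A
```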

Part II: Introduction to Structured Sparsity (Let us play with $\psi$)

Group Sparsity-Inducing Norms

The most popular choice for $\psi$:

$$
\min_{\alpha \in \mathbb{R}^p}\ \tfrac{1}{2}\|x - D\alpha\|_2^2 + \lambda \underbrace{\psi(\alpha)}_{\text{sparsity-inducing norm}}
$$

The $\ell_1$ norm, $\psi(\alpha) = \|\alpha\|_1$. However, the $\ell_1$ norm encodes poor information, just cardinality!

Another popular choice for $\psi$: the $\ell_1$-$\ell_q$ norm [Turlach et al., 2005], with $q = 2$ or $q = \infty$,

$$
\psi(\alpha) = \sum_{g \in \mathcal{G}} \|\alpha_g\|_q,
$$

with $\mathcal{G}$ a partition of $\{1, \ldots, p\}$. The $\ell_1$-$\ell_q$ norm sets to zero groups of non-overlapping variables (as opposed to single variables for the $\ell_1$ norm).
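For a partition, the proximal operator of this norm is simple block soft-thresholding, shown below for $q = 2$ (a self-contained numpy sketch; the vector and groups are toy values):

```python
import numpy as np

def prox_group_l2(u, groups, lam):
    """Prox of lam * sum_g ||alpha_g||_2 for a partition `groups`
    (list of index arrays): each block shrinks, possibly to zero, as a whole."""
    alpha = u.copy()
    for g in groups:
        norm_g = np.linalg.norm(u[g])
        scale = max(0.0, 1.0 - lam / norm_g) if norm_g > 0 else 0.0
        alpha[g] = scale * u[g]
    return alpha

u = np.array([3.0, -1.0, 0.5, 0.2, 0.3, -0.1])
print(prox_group_l2(u, [np.arange(0, 3), np.arange(3, 6)], lam=1.0))
# The second group's norm is below lam, so its variables are zeroed together.
```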

Structured Sparsity with Overlapping Groups

Warning: under the name "structured sparsity" appear in fact significantly different formulations!

1. Non-convex:
- zero-tree wavelets [Shapiro, 1993];
- sparsity patterns in a predefined collection [Baraniuk et al., 2010];
- selecting a union of groups [Huang et al., 2009];
- structure via Markov random fields [Cevher et al., 2008].

2. Convex:
- tree structure [Zhao et al., 2009];
- non-zero patterns are a union of groups [Jacob et al., 2009];
- zero patterns are a union of groups [Jenatton et al., 2009];
- other norms [Micchelli et al., 2010].

Structured Sparsity with Overlapping Groups

$$
\psi(\alpha) = \sum_{g \in \mathcal{G}} \|\alpha_g\|_q
$$

What happens when the groups overlap? [Jenatton et al., 2009]
- Inside the groups, the $\ell_2$ norm (or $\ell_\infty$) does not promote sparsity.
- Variables belonging to the same groups are encouraged to be set to zero together.

Examples of sets of groups $\mathcal{G}$ [Jenatton et al., 2009]

Figure: Selection of contiguous patterns on a sequence, $p = 6$. $\mathcal{G}$ is the set of blue groups. Any union of blue groups set to zero leads to the selection of a contiguous pattern.

Hierarchical Norms [Zhao et al., 2009]

A node can be active only if its ancestors are active. The selected patterns are rooted subtrees.

Algorithms/Difficulties [Jenatton et al., 2010, Mairal et al., 2011]

$$
\min_{\alpha \in \mathbb{R}^p}\ \tfrac{1}{2}\|x - D\alpha\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \|\alpha_g\|_q.
$$

The function is convex but non-differentiable; the regularizer is a sum of simple but non-separable terms. How do we optimize that?
- hierarchical norms: same complexity as $\ell_1$ with proximal methods;
- general case: augmented Lagrangian techniques;
- general case with $\ell_\infty$ norms: proximal methods combined with network flow optimization.

Also implemented in the toolbox SPAMS.

Part III: Learning Structured Dictionaries

Topographic Dictionary Learning

[Kavukcuoglu et al., 2009]: organize the dictionary elements on a 2D grid and use $\psi$ with $e \times e$ overlapping groups.
[Garrigues and Olshausen, 2010]: sparse coding + a probabilistic model of lateral interactions.
Topographic ICA by Hyvärinen et al. [2001].
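To show what such a group structure looks like, here is a sketch that builds all $e \times e$ neighborhoods on a grid of dictionary elements (assuming wrap-around boundaries, a common choice for topographic maps; the grid size is illustrative):

```python
import numpy as np

def grid_groups(side, e):
    """All e x e neighborhoods (with wrap-around) on a side x side grid of
    dictionary elements; each group is a flat index array into alpha."""
    groups = []
    for i in range(side):
        for j in range(side):
            rows = [(i + di) % side for di in range(e)]
            cols = [(j + dj) % side for dj in range(e)]
            groups.append(np.array([r * side + c for r in rows for c in cols]))
    return groups

G = grid_groups(side=10, e=3)    # 100 overlapping 3x3 groups for p = 100 elements
```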

Topographic Dictionary Learning

Figure: [Mairal, Jenatton, Obozinski, and Bach, 2011], $3 \times 3$ neighborhoods.

Topographic Dictionary Learning

Figure: [Mairal, Jenatton, Obozinski, and Bach, 2011], $4 \times 4$ neighborhoods.

Hierarchical Dictionary Learning

Figure: [Jenatton, Mairal, Obozinski, and Bach, 2010].

Conclusion / Discussion

Structured sparsity is a natural framework for learning structured dictionaries... and it comes with efficient optimization tools. There are other applications in natural language processing, bio-informatics, neuroscience, ...

SPAMS toolbox (open-source)

C++, interfaced with Matlab.
- proximal gradient methods for $\ell_0$, $\ell_1$, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-$\ell_0$, sparse group-Lasso, overlapping group-Lasso...
- ...for square, logistic, and multi-class logistic loss functions;
- handles sparse matrices, provides duality gaps;
- also coordinate descent and block coordinate descent algorithms;
- fastest available implementation of OMP and LARS;
- dictionary learning and matrix factorization (NMF, sparse PCA);
- fast projections onto some convex sets.

Try it! http://www.di.ens.fr/willow/spams/

References I

M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11):4311-4322, November 2006.

R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 2010. To appear.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.

References II

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33-61, 1999.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736-3745, December 2006.

K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.

P. Garrigues and B. Olshausen. Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems, 2010.

A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69:331-371, 1910.

References III

J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

A. Hyvärinen, P. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527-1558, 2001.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

References IV

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337-365, 2000.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Preprint arXiv:1104.1872, 2011.

C. A. Micchelli, J. M. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In Advances in Neural Information Processing Systems, 2010.

References V

Y. Nesterov. A method for solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Math. Dokl., 27:372-376, 1983.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.

J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445-3462, 1993.

K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121-2130, 2010.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

References VI

B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349-363, 2005.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468-3497, 2009.

First-order/proximal methods

$$
\min_{\alpha \in \mathbb{R}^p} f(\alpha) + \lambda\,\Omega(\alpha)
$$

$f$ is strictly convex and differentiable with a Lipschitz gradient. This generalizes the idea of gradient descent:

$$
\alpha^{k+1} \leftarrow \arg\min_{\alpha \in \mathbb{R}^p} \underbrace{f(\alpha^k) + \nabla f(\alpha^k)^\top (\alpha - \alpha^k)}_{\text{linear approximation}} + \underbrace{\tfrac{L}{2}\|\alpha - \alpha^k\|_2^2}_{\text{quadratic term}} + \lambda\,\Omega(\alpha)
$$

$$
= \arg\min_{\alpha \in \mathbb{R}^p} \tfrac{1}{2}\Big\|\alpha - \Big(\alpha^k - \tfrac{1}{L}\nabla f(\alpha^k)\Big)\Big\|_2^2 + \tfrac{\lambda}{L}\Omega(\alpha)
$$

When $\lambda = 0$, $\alpha^{k+1} \leftarrow \alpha^k - \tfrac{1}{L}\nabla f(\alpha^k)$: this is equivalent to a classical gradient descent step.
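The update above translates directly into code; here is a generic numpy sketch (names are illustrative), parameterized by the gradient of $f$ and by the proximal operator of $\Omega$:

```python
import numpy as np

def proximal_gradient(grad_f, prox_omega, alpha0, L, lam, n_iter=300):
    """Proximal-gradient iteration:
    alpha^{k+1} = prox_{(lam/L) * Omega}(alpha^k - (1/L) * grad_f(alpha^k))."""
    alpha = np.array(alpha0, dtype=float)
    for _ in range(n_iter):
        alpha = prox_omega(alpha - grad_f(alpha) / L, lam / L)
    return alpha
```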

First-order/proximal methods

They require solving efficiently the proximal operator:

$$
\min_{\alpha \in \mathbb{R}^p}\ \tfrac{1}{2}\|u - \alpha\|_2^2 + \lambda\,\Omega(\alpha)
$$

For the $\ell_1$ norm, this amounts to soft-thresholding:

$$
\alpha_i^\star = \operatorname{sign}(u_i)\,(|u_i| - \lambda)_+.
$$

There exist accelerated versions based on Nesterov's optimal first-order method (gradient method with "extrapolation") [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.
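Plugging the $\ell_1$ prox into the proximal-gradient iteration gives ISTA for the Lasso; a self-contained sketch with toy sizes and $\lambda$ (FISTA would add Nesterov extrapolation on top of the same loop):

```python
import numpy as np

def soft_threshold(u, t):
    """Prox of t * ||.||_1: alpha_i = sign(u_i) * max(|u_i| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(0)
m, p, lam = 64, 256, 0.1
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)
x = D @ np.where(np.arange(p) < 5, 1.0, 0.0)     # signal with a 5-sparse code
L = np.linalg.norm(D, 2) ** 2                    # Lipschitz constant of the gradient

alpha = np.zeros(p)
for _ in range(300):                             # ISTA steps
    alpha = soft_threshold(alpha - D.T @ (D @ alpha - x) / L, lam / L)
print("nonzeros:", np.count_nonzero(alpha))
```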

Tree-structured groups

Proposition [Jenatton, Mairal, Obozinski, and Bach, 2010]. If $\mathcal{G}$ is a tree-structured set of groups, i.e.,

$$
\forall g, h \in \mathcal{G},\quad g \cap h = \emptyset \ \text{ or } \ g \subseteq h \ \text{ or } \ h \subseteq g,
$$

then for $q = 2$ or $q = \infty$, defining $\text{Prox}_g$ and $\text{Prox}_\Omega$ as

$$
\text{Prox}_g: u \mapsto \arg\min_{\alpha \in \mathbb{R}^p} \tfrac{1}{2}\|u - \alpha\|_2^2 + \lambda\|\alpha_g\|_q,
$$
$$
\text{Prox}_\Omega: u \mapsto \arg\min_{\alpha \in \mathbb{R}^p} \tfrac{1}{2}\|u - \alpha\|_2^2 + \lambda\sum_{g \in \mathcal{G}}\|\alpha_g\|_q,
$$

if the groups are sorted from the leaves to the root, then

$$
\text{Prox}_\Omega = \text{Prox}_{g_m} \circ \cdots \circ \text{Prox}_{g_1}.
$$

Tree-structured regularization: efficient linear-time algorithm.
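The composition is easy to sketch for $q = 2$ (a toy example, assuming each group is a node together with its descendants and the list is ordered leaves first):

```python
import numpy as np

def prox_tree(u, groups_leaves_to_root, lam):
    """Prox of lam * sum_g ||alpha_g||_2 for tree-structured groups,
    computed as Prox_{g_m} o ... o Prox_{g_1} with leaf groups applied first."""
    alpha = u.copy()
    for g in groups_leaves_to_root:
        n = np.linalg.norm(alpha[g])
        alpha[g] *= max(0.0, 1.0 - lam / n) if n > 0 else 0.0
    return alpha

# Depth-2 tree on 3 variables: leaf groups {1}, {2}, root group {0, 1, 2}.
groups = [np.array([1]), np.array([2]), np.array([0, 1, 2])]
print(prox_tree(np.array([1.0, 2.0, -0.5]), groups, lam=0.4))
```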

General Overlapping Groups for $q = \infty$ [Mairal, Jenatton, Obozinski, and Bach, 2011]

Dual formulation. The solutions $\alpha^\star$ and $\xi^\star$ of the following optimization problems

$$
\min_{\alpha \in \mathbb{R}^p}\ \tfrac{1}{2}\|u - \alpha\|_2^2 + \lambda\sum_{g \in \mathcal{G}}\|\alpha_g\|_\infty \qquad \text{(Primal)}
$$

$$
\min_{\xi \in \mathbb{R}^{p \times |\mathcal{G}|}}\ \tfrac{1}{2}\Big\|u - \sum_{g \in \mathcal{G}} \xi^g\Big\|_2^2 \quad \text{s.t.} \quad \forall g \in \mathcal{G},\ \|\xi^g\|_1 \le \lambda \ \text{ and } \ \xi^g_j = 0 \ \text{ if } j \notin g \qquad \text{(Dual)}
$$

satisfy

$$
\alpha^\star = u - \sum_{g \in \mathcal{G}} \xi^{g\star}. \qquad \text{(Primal-dual relation)}
$$

The dual formulation has more variables, but is equivalent to a quadratic min-cost flow problem.
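On a toy instance, the primal-dual relation can be checked with a generic convex solver (a sketch assuming cvxpy is available; this is only a sanity check, not the network-flow algorithm of the paper):

```python
import cvxpy as cp
import numpy as np

u = np.array([3.0, -2.0, 1.5, 0.3])
lam, p = 1.0, 4
groups = [[0, 1], [1, 2, 3]]             # overlapping groups

# Primal: prox of lam * sum_g ||alpha_g||_inf.
alpha = cp.Variable(p)
primal = cp.Problem(cp.Minimize(
    0.5 * cp.sum_squares(u - alpha)
    + lam * sum(cp.norm(alpha[g], 'inf') for g in groups)))
primal.solve()

# Dual: one vector xi^g per group, supported on g, with ||xi^g||_1 <= lam.
xis = [cp.Variable(p) for _ in groups]
cons = []
for xi, g in zip(xis, groups):
    cons.append(cp.norm(xi, 1) <= lam)
    cons += [xi[j] == 0 for j in range(p) if j not in g]
dual = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(u - sum(xis))), cons)
dual.solve()

print(alpha.value)                       # primal solution
print(u - sum(xi.value for xi in xis))   # should match: alpha* = u - sum_g xi^g*
```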