Sparse Estimation and Dictionary Learning

Sparse Estimation and Dictionary Learning (for Biostatistics?)
Julien Mairal, Biostatistics Seminar, UC Berkeley

What is this talk about? Why sparsity, what for, and how?
- Feature learning / clustering / sparse PCA;
- Machine learning: selecting relevant features;
- Signal and image processing: restoration, reconstruction;
- Biostatistics: you tell me.

Part I: Sparse Estimation

Sparse Linear Model: Machine Learning Point of View
Let $(y_i, x_i)_{i=1}^n$ be a training set, where the vectors $x_i$ are in $\mathbb{R}^p$ and are called features. The scalars $y_i$ are in $\{-1,+1\}$ for binary classification problems and in $\mathbb{R}$ for regression problems. We assume there is a relation $y \approx w^\top x$, and solve
$$\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, w^\top x_i)}_{\text{empirical risk}} + \underbrace{\lambda \psi(w)}_{\text{regularization}}.$$

Sparse Linear Models: Machine Learning Point of View
A few examples:
- Ridge regression: $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \frac{1}{2}(y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$.
- Linear SVM: $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i w^\top x_i) + \lambda \|w\|_2^2$.
- Logistic regression: $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \log\big(1 + e^{-y_i w^\top x_i}\big) + \lambda \|w\|_2^2$.
The squared $\ell_2$-norm induces smoothness in $w$. When one knows in advance that $w$ should be sparse, one should use a sparsity-inducing regularization such as the $\ell_1$-norm [Chen et al., 1999, Tibshirani, 1996].

Sparse Linear Models: Signal Processing Point of View
Let $y \in \mathbb{R}^n$ be a signal, and let $X = [x_1, \ldots, x_p] \in \mathbb{R}^{n \times p}$ be a set of normalized basis vectors; we call it a dictionary. $X$ is adapted to $y$ if it can represent it with a few basis vectors, that is, if there exists a sparse vector $w \in \mathbb{R}^p$ such that $y \approx Xw$. We call $w$ the sparse code.

The Sparse Decomposition Problem
$$\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{2} \|y - Xw\|_2^2}_{\text{data fitting term}} + \underbrace{\lambda \psi(w)}_{\text{sparsity-inducing regularization}}$$
$\psi$ induces sparsity in $w$. It can be
- the $\ell_0$ pseudo-norm, $\|w\|_0 = \#\{i \text{ s.t. } w_i \neq 0\}$ (NP-hard);
- the $\ell_1$ norm, $\|w\|_1 = \sum_{i=1}^p |w_i|$ (convex);
- ...
This is a selection problem. When $\psi$ is the $\ell_1$-norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].

Why does the $\ell_1$-norm induce sparsity?
Example: a quadratic problem in 1D,
$$\min_{w \in \mathbb{R}} \frac{1}{2}(u - w)^2 + \lambda |w|.$$
This is a piecewise quadratic function with a kink at zero. The one-sided derivatives at zero are $g_+ = -u + \lambda$ (at $0^+$) and $g_- = -u - \lambda$ (at $0^-$). Optimality conditions: $w$ is optimal iff
- $|w| > 0$ and $-(u - w) + \lambda \, \mathrm{sign}(w) = 0$, or
- $w = 0$, $g_+ \geq 0$, and $g_- \leq 0$.
The solution is the soft-thresholding operator: $w^\star = \mathrm{sign}(u)(|u| - \lambda)_+$.
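
This derivation translates directly into code. A minimal NumPy sketch (the function name is mine, not from the slides):

```python
import numpy as np

def soft_threshold(u, lam):
    """Solve min_w 0.5*(u - w)^2 + lam*|w| elementwise:
    w* = sign(u) * (|u| - lam)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# Values with |u| <= lam are set exactly to zero.
print(soft_threshold(np.array([-2.0, -0.3, 0.1, 1.5]), lam=0.5))
# [-1.5 -0.   0.   1. ]
```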

Why does the $\ell_1$-norm induce sparsity?
Figure: (a) the soft-thresholding operator $w^\star = \mathrm{sign}(u)(|u| - \lambda)_+$, solution of $\min_w \frac{1}{2}(u - w)^2 + \lambda |w|$; (b) the hard-thresholding operator $w^\star = u \, \mathbf{1}_{|u| \geq \sqrt{2\lambda}}$, solution of $\min_w \frac{1}{2}(u - w)^2 + \lambda \mathbf{1}_{|w| > 0}$.

Why does the $\ell_1$-norm induce sparsity?
Comparison with $\ell_2$-regularization in 1D: for $\psi(w) = w^2$, $\psi'(w) = 2w$; for $\psi(w) = |w|$, $\psi'_-(w) = -1$ and $\psi'_+(w) = 1$. The gradient of the $\ell_2$-penalty vanishes when $w$ gets close to 0, whereas on its differentiable part the gradient of the $\ell_1$-norm has constant magnitude.

Why does the $\ell_1$-norm induce sparsity? Physical illustration: a mass hangs from a spring with energy $E_1 = \frac{k_1}{2}(y_0 - y)^2$. Adding a second spring with energy $E_2 = \frac{k_2}{2} y^2$ (the analogue of $\ell_2$-regularization) pulls the mass towards $y = 0$ but never exactly to it. Replacing the second spring by gravity, with energy $E_2 = mgy$ (the analogue of $\ell_1$-regularization), makes the mass reach $y = 0$ exactly.

Regularizing with the $\ell_1$-norm (figure: the $\ell_1$-ball $\|w\|_1 \leq T$ in 2D). The projection onto a convex set is biased towards singularities.

Regularizing with the $\ell_2$-norm (figure: the $\ell_2$-ball $\|w\|_2 \leq T$ in 2D). The $\ell_2$-norm is isotropic.

Regularizing with the $\ell_\infty$-norm (figure: the $\ell_\infty$-ball $\|w\|_\infty \leq T$ in 2D). The $\ell_\infty$-norm encourages $|w_1| = |w_2|$.

In 3D (figure of norm balls in 3D; copyright G. Obozinski).

What about more complicated norms? (Figures of the corresponding unit balls; copyright G. Obozinski.)

Examples of sparsity-inducing penalties
Exploiting concave functions with a kink at zero, $\psi(w) = \sum_{i=1}^p \phi(|w_i|)$:
- $\ell_q$-pseudo-norm, with $0 < q < 1$: $\psi(w) = \sum_{i=1}^p |w_i|^q$;
- log penalty: $\psi(w) = \sum_{i=1}^p \log(|w_i| + \varepsilon)$;
- more generally, $\phi$ is any concave, increasing function with a kink at zero.

Examples of sparsity-inducing penalties. Figure: open balls in 2D corresponding to several $\ell_q$-norms and pseudo-norms: (c) $\ell_{0.5}$-ball, (d) $\ell_1$-ball, (e) $\ell_2$-ball.

Examples of sparsity-inducing penalties
- The $\ell_1$-$\ell_2$ norm (group Lasso): $\sum_{g \in \mathcal{G}} \|w_g\|_2 = \sum_{g \in \mathcal{G}} \big( \sum_{j \in g} w_j^2 \big)^{1/2}$, with $\mathcal{G}$ a partition of $\{1, \ldots, p\}$; it selects groups of variables [Yuan and Lin, 2006].
- The fused Lasso [Tibshirani et al., 2005] or total variation [Rudin et al., 1992]: $\psi(w) = \sum_{j=1}^{p-1} |w_{j+1} - w_j|$.
Extensions (out of the scope of this talk): hierarchical norms [Zhao et al., 2009]; structured sparsity [Jenatton et al., 2009, Jacob et al., 2009, Huang et al., 2009, Baraniuk et al., 2010].

Part II: Dictionary Learning and Matrix Factorization

Matrix Factorization and Clustering
Let us cluster some training vectors $y_1, \ldots, y_m$ into $p$ clusters using K-means:
$$\min_{(x_j)_{j=1}^p, \, (l_i)_{i=1}^m} \sum_{i=1}^m \|y_i - x_{l_i}\|_2^2.$$
It can be equivalently formulated as
$$\min_{X \in \mathbb{R}^{n \times p}, \, W \in \{0,1\}^{p \times m}} \sum_{i=1}^m \|y_i - X w^i\|_2^2 \quad \text{s.t.} \quad \sum_{j=1}^p w^i_j = 1,$$
which is a matrix factorization problem:
$$\min_{X \in \mathbb{R}^{n \times p}, \, W \in \{0,1\}^{p \times m}} \|Y - XW\|_F^2 \quad \text{s.t.} \quad \sum_{j=1}^p w^i_j = 1 \text{ for all } i.$$
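
To make the matrix-factorization view concrete, here is a minimal NumPy sketch of K-means (Lloyd's algorithm) written as alternating minimization over the centroids $X$ and the binary assignments; the function name and initialization are my own choices, not from the slides:

```python
import numpy as np

def kmeans_mf(Y, p, n_iter=20, seed=0):
    """Lloyd's algorithm as alternating minimization of ||Y - X W||_F^2:
    Y is (n, m) with one signal per column; X holds the p centroids."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    X = Y[:, rng.choice(m, size=p, replace=False)].copy()  # init centroids
    labels = np.zeros(m, dtype=int)
    for _ in range(n_iter):
        # W-step: each column of W has a single 1 (nearest centroid).
        dists = ((Y[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # (m, p)
        labels = dists.argmin(axis=1)
        # X-step: each centroid is the mean of its assigned signals.
        for j in range(p):
            if np.any(labels == j):
                X[:, j] = Y[:, labels == j].mean(axis=1)
    return X, labels
```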

Matrix Factorization and Clustering
Hard clustering:
$$\min_{X \in \mathbb{R}^{n \times p}, \, W \in \{0,1\}^{p \times m}} \|Y - XW\|_F^2 \quad \text{s.t.} \quad \sum_{j=1}^p w^i_j = 1.$$
Soft clustering:
$$\min_{X \in \mathbb{R}^{n \times p}, \, W \in \mathbb{R}^{p \times m}} \|Y - XW\|_F^2 \quad \text{s.t.} \quad W \geq 0 \text{ and } \sum_{j=1}^p w^i_j = 1.$$
In both cases, $X = [x_1, \ldots, x_p]$ contains the centroids of the $p$ clusters.

Other Matrix Factorization Problems
PCA:
$$\min_{W \in \mathbb{R}^{p \times n}, \, X \in \mathbb{R}^{m \times p}} \frac{1}{2} \|Y - XW\|_F^2 \quad \text{s.t.} \quad X^\top X = I \text{ and } WW^\top \text{ is diagonal}.$$
$X = [x_1, \ldots, x_p]$ contains the principal components.

Other Matrix Factorization Problems
Non-negative matrix factorization [Lee and Seung, 2001]:
$$\min_{W \in \mathbb{R}^{p \times n}, \, X \in \mathbb{R}^{m \times p}} \frac{1}{2} \|Y - XW\|_F^2 \quad \text{s.t.} \quad W \geq 0 \text{ and } X \geq 0.$$

Dictionary Learning and Matrix Factorization [Olshausen and Field, 1997]
$$\min_{X \in \mathcal{X}, \, W \in \mathbb{R}^{p \times n}} \sum_{i=1}^n \frac{1}{2} \|y_i - X w^i\|_2^2 + \lambda \|w^i\|_1,$$
which is again a matrix factorization problem:
$$\min_{X \in \mathcal{X}, \, W \in \mathbb{R}^{p \times n}} \frac{1}{2} \|Y - XW\|_F^2 + \lambda \|W\|_1.$$
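
A minimal sketch of this formulation by alternating minimization: sparse coding of $W$ by ISTA, then a projected gradient step on $X$ with renormalized columns. Step sizes, iteration counts, and names are my own choices; the online algorithm of [Mairal et al., 2010] scales far better to large $n$:

```python
import numpy as np

def dict_learn(Y, p, lam, n_outer=30, n_ista=50, seed=0):
    """Alternating minimization for
    min_{X, W} 0.5*||Y - X W||_F^2 + lam*||W||_1, columns of X unit-norm."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0)
    W = np.zeros((p, m))
    for _ in range(n_outer):
        # Sparse coding step: ISTA on W with X fixed.
        L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant
        for _ in range(n_ista):
            G = X.T @ (X @ W - Y)                # gradient of the quadratic
            W = np.sign(W - G / L) * np.maximum(np.abs(W - G / L) - lam / L, 0.0)
        # Dictionary step: one gradient step on X, then renormalize columns.
        LW = np.linalg.norm(W, 2) ** 2 + 1e-12
        X -= (X @ W - Y) @ W.T / LW
        X /= np.maximum(np.linalg.norm(X, axis=0), 1e-12)
    return X, W
```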

Why have a unified point of view?
$$\min_{W \in \mathcal{W}, \, X \in \mathcal{X}} \frac{1}{2} \|Y - XW\|_F^2 + \lambda \psi(W).$$
- Same framework for NMF, sparse PCA, dictionary learning, clustering, topic modelling;
- one can play with various constraints/penalties on $W$ (coefficients) and on $X$ (loadings, dictionary, centroids);
- same algorithms (no need to reinvent the wheel): alternate minimization, online learning [Mairal et al., 2010].

Advertisement: the SPAMS toolbox (open-source)
C++, interfaced with Matlab, R, and Python. It provides:
- proximal gradient methods for $\ell_0$, $\ell_1$, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-$\ell_0$, sparse group-Lasso, and overlapping group-Lasso penalties, for square, logistic, and multi-class logistic loss functions;
- support for sparse matrices and duality gaps;
- fast implementations of OMP and LARS/homotopy;
- dictionary learning and matrix factorization (NMF, sparse PCA);
- coordinate descent and block coordinate descent algorithms;
- fast projections onto some convex sets.
Try it! http://www.di.ens.fr/willow/spams/

Part III: A Few Image Processing Stories

The Image Denoising Problem
$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$

Sparse representations for image restoration
$y = x_{\text{orig}} + w$ (measurements = original image + noise). Energy minimization problem (MAP estimation):
$$E(x) = \underbrace{\frac{1}{2} \|y - x\|_2^2}_{\text{relation to measurements}} + \underbrace{\psi(x)}_{\text{image model } (-\log \text{ prior})}$$
Some classical priors: smoothness $\lambda \|Lx\|_2^2$; total variation $\lambda \|\nabla x\|_1$; MRF priors; ...

Sparse representations for image restoration
Designed dictionaries [Haar, 1910], [Zweig, Morlet, Grossman 70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes 80s--today] (see [Mallat, 1999]): wavelets, curvelets, wedgelets, bandlets, ... -lets.
Learned dictionaries of patches [Olshausen and Field, 1997], [Engan et al., 1999], [Lewicki and Sejnowski, 2000], [Aharon et al., 2006], [Roth and Black, 2005], [Lee et al., 2007]:
$$\min_{w^i, \, X \in \mathcal{C}} \underbrace{\frac{1}{2} \|y_i - X w^i\|_2^2}_{\text{reconstruction}} + \underbrace{\lambda \psi(w^i)}_{\text{sparsity}},$$
with $\psi(w) = \|w\|_0$ ($\ell_0$ pseudo-norm) or $\psi(w) = \|w\|_1$ ($\ell_1$ norm).

Sparse representations for image restoration
Solving the denoising problem [Elad and Aharon, 2006]:
- Extract all overlapping $8 \times 8$ patches $y_i$.
- Solve a matrix factorization problem
$$\min_{w^i, \, X \in \mathcal{C}} \sum_{i=1}^n \underbrace{\frac{1}{2} \|y_i - X w^i\|_2^2}_{\text{reconstruction}} + \underbrace{\lambda \psi(w^i)}_{\text{sparsity}},$$
with $n > 100{,}000$.
- Average the reconstructions of each patch.

Sparse representations for image restoration
K-SVD [Elad and Aharon, 2006]. Figure: dictionary trained on a noisy version of the image "boat".

Sparse representations for image restoration. Figure: grayscale vs. color image patches.

Sparse representations for image restoration [Mairal, Sapiro, and Elad, 2008b].

Sparse representations for image restoration. Inpainting, [Mairal, Elad, and Sapiro, 2008a] (figures).

Sparse representations for video restoration
Key ideas for video processing [Protter and Elad, 2009]:
- use a 3D dictionary;
- process many frames at the same time;
- propagate the dictionary from frame to frame.

Sparse representations for image restoration. Inpainting, [Mairal, Sapiro, and Elad, 2008b]. Figures: inpainting results (several examples).

Sparse representations for image restoration. Color video denoising, [Mairal, Sapiro, and Elad, 2008b]. Figures: denoising results, $\sigma = 25$ (several frames).

Digital Zooming (Couzinie-Devy, 2010). Figures: two examples comparing the original image, bicubic interpolation, and the proposed method.

Inverse half-toning. Figures: five examples comparing the original and the reconstructed image.

Conclusion
- Many formulations are related to sparse regularized matrix factorization problems: PCA, sparse PCA, clustering, NMF, dictionary learning.
- We have seen successful stories in image processing, computer vision, and neuroscience.
- Efficient software exists for Matlab/R/Python.

References
- M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, November 2006.
- R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 2010. To appear.
- A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
- E. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, 2006.
- S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33-61, 1999.
- M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 54(12):3736-3745, December 2006.
- K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.
- A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69:331-371, 1910.
- J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
- L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
- R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.
- D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.
- H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 801-808. MIT Press, Cambridge, MA, 2007.
- M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337-365, 2000.
- J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53-69, January 2008a.
- J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214-241, April 2008b.
- J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 2010.
- S. Mallat. A Wavelet Tour of Signal Processing, Second Edition. Academic Press, New York, September 1999.
- S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
- Y. Nesterov. A method for solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Math. Dokl., 27:372-376, 1983.
- Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.
- B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.
- M. Protter and M. Elad. Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(1):27-36, 2009.
- S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
- L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259-268, 1992.
- R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
- R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1):91-108, 2005.
- M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49-67, 2006.
- P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468-3497, 2009.

One short slide on compressed sensing
Important message: sparse coding is not compressed sensing. Compressed sensing is a theory [see Candes, 2006] saying that a sparse signal can be recovered from a few linear measurements under some conditions.
- Signal acquisition: $z = Z^\top y$, where $Z \in \mathbb{R}^{m \times s}$ is a sensing matrix with $s \ll m$.
- Signal decoding: $\min_{w \in \mathbb{R}^p} \|w\|_1$ s.t. $Z^\top y = Z^\top X w$,
with extensions to approximately sparse signals and noisy measurements.
Remark: the dictionaries we are using in this lecture do not satisfy the recovery assumptions of compressed sensing.

Greedy Algorithms
Several equivalent non-convex and NP-hard problems:
$$\min_{w \in \mathbb{R}^p} \frac{1}{2} \underbrace{\|y - Xw\|_2^2}_{\text{residual } r} + \lambda \underbrace{\|w\|_0}_{\text{regularization}},$$
$$\min_{w \in \mathbb{R}^p} \|y - Xw\|_2^2 \quad \text{s.t.} \quad \|w\|_0 \leq L,$$
$$\min_{w \in \mathbb{R}^p} \|w\|_0 \quad \text{s.t.} \quad \|y - Xw\|_2^2 \leq \varepsilon.$$
The solution is often approximated with a greedy algorithm. Signal processing: Matching Pursuit [Mallat and Zhang, 1993], Orthogonal Matching Pursuit [?]. Statistics: L2-boosting, forward selection.

Matching Pursuit: illustration in 3D. Starting from $w = (0,0,0)$ with residual $r = y$, the atom most correlated with the residual ($x_3$ here) is selected, and $\langle r, x_3 \rangle x_3$ is removed from the residual, giving $w = (0, 0, 0.75)$. Repeating the process selects $x_2$, giving $w = (0, 0.24, 0.75)$, then $x_3$ again, giving $w = (0, 0.24, 0.65)$; the residual shrinks at each step.

Matching Pursuit
$$\min_{w \in \mathbb{R}^p} \underbrace{\|y - Xw\|_2^2}_{\|r\|_2^2} \quad \text{s.t.} \quad \|w\|_0 \leq L$$
1: $w \leftarrow 0$
2: $r \leftarrow y$ (residual)
3: while $\|w\|_0 < L$ do
4:   select the atom with maximum correlation with the residual: $\hat{\imath} \leftarrow \arg\max_{i=1,\ldots,p} |x_i^\top r|$
5:   update the coefficients and the residual: $w_{\hat{\imath}} \leftarrow w_{\hat{\imath}} + x_{\hat{\imath}}^\top r$; $\; r \leftarrow r - (x_{\hat{\imath}}^\top r)\, x_{\hat{\imath}}$
6: end while
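
The same algorithm in a few lines of NumPy, assuming unit-norm columns of $X$. The sketch runs a fixed number of greedy updates rather than testing $\|w\|_0 < L$, since MP may reselect an atom; the function name is mine:

```python
import numpy as np

def matching_pursuit(y, X, L):
    """Matching Pursuit for min_w ||y - X w||^2 s.t. ||w||_0 <= L.
    Assumes the columns of X have unit l2-norm; runs L greedy updates."""
    w = np.zeros(X.shape[1])
    r = y.copy()                           # residual
    for _ in range(L):
        corr = X.T @ r
        i = int(np.argmax(np.abs(corr)))   # most correlated atom
        w[i] += corr[i]                    # coefficient update
        r -= corr[i] * X[:, i]             # residual update
    return w
```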

Orthogonal Matching Pursuit: illustration in 3D. Starting from $w = (0,0,0)$ and $J = \emptyset$, the first step selects $x_3$ and projects $y$ onto it, giving $w = (0, 0, 0.75)$, $J = \{3\}$. The second step adds $x_2$ and re-projects $y$ orthogonally onto $\mathrm{span}(x_3, x_2)$, giving $w = (0, 0.29, 0.63)$, $J = \{3, 2\}$; unlike MP, all active coefficients are refitted at each step.

Orthogonal Matching Pursuit
$$\min_{w \in \mathbb{R}^p} \|y - Xw\|_2^2 \quad \text{s.t.} \quad \|w\|_0 \leq L$$
1: $J \leftarrow \emptyset$
2: for $\mathrm{iter} = 1, \ldots, L$ do
3:   select the predictor which most reduces the objective: $\hat{\imath} \leftarrow \arg\min_{i \in J^c} \big\{ \min_{w'} \|y - X_{J \cup \{i\}} w'\|_2^2 \big\}$
4:   update the active set: $J \leftarrow J \cup \{\hat{\imath}\}$
5:   update the residual (orthogonal projection): $r \leftarrow (I - X_J (X_J^\top X_J)^{-1} X_J^\top)\, y$
6:   update the coefficients: $w_J \leftarrow (X_J^\top X_J)^{-1} X_J^\top y$
7: end for
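
A NumPy sketch of OMP. It uses the standard, cheaper selection rule (maximal correlation with the current residual) rather than the exact objective-reduction argmin of line 3, and a least-squares refit instead of the Cholesky bookkeeping discussed next; the function name is mine:

```python
import numpy as np

def omp(y, X, L):
    """Orthogonal Matching Pursuit: greedily grow the active set J,
    refitting all active coefficients by least squares at each step."""
    J = []
    r = y.copy()
    w = np.zeros(X.shape[1])
    for _ in range(L):
        i = int(np.argmax(np.abs(X.T @ r)))          # selection by correlation
        J.append(i)
        XJ = X[:, J]
        wJ, *_ = np.linalg.lstsq(XJ, y, rcond=None)  # orthogonal projection
        r = y - XJ @ wJ                              # residual update
    w[J] = wJ
    return w
```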

Orthogonal Matching Pursuit
The keys for a good implementation:
- if available, use the Gram matrix $G = X^\top X$;
- maintain the computation of $X^\top r$ for each signal;
- maintain a Cholesky decomposition of $(X_J^\top X_J)^{-1}$ for each signal.
The total complexity for decomposing $n$ $L$-sparse signals of size $m$ with a dictionary of size $p$ is
$$\underbrace{O(p^2 m)}_{\text{Gram matrix}} + \underbrace{O(nL^3)}_{\text{Cholesky}} + \underbrace{O(n(pm + pL^2))}_{X^\top r} = O(np(m + L^2)).$$
It is also possible to use the matrix inversion lemma instead of a Cholesky decomposition (same complexity, but less numerical stability).

Coordinate Descent for the Lasso
Coordinate descent + nonsmooth objective: WARNING, not convergent in general. Here, however, the problem is equivalent to a convex smooth optimization problem with separable constraints,
$$\min_{w_+, w_-} \frac{1}{2} \|y - X w_+ + X w_-\|_2^2 + \lambda w_+^\top \mathbf{1} + \lambda w_-^\top \mathbf{1} \quad \text{s.t.} \quad w_+, w_- \geq 0,$$
and for this specific problem, coordinate descent is convergent. Assume the columns of $X$ have unit $\ell_2$-norm; updating coordinate $i$ amounts to
$$w_i \leftarrow \arg\min_{w \in \mathbb{R}} \frac{1}{2} \Big\| \underbrace{y - \textstyle\sum_{j \neq i} w_j x_j}_{r} - w x_i \Big\|_2^2 + \lambda |w| = \mathrm{sign}(x_i^\top r)(|x_i^\top r| - \lambda)_+ :$$
soft-thresholding!
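
A minimal NumPy sketch of this update rule, sweeping the coordinates cyclically and maintaining the residual; the function name and epoch count are mine, and the columns of X are assumed to have unit norm:

```python
import numpy as np

def cd_lasso(y, X, lam, n_epochs=100):
    """Cyclic coordinate descent for min_w 0.5*||y - X w||^2 + lam*||w||_1.
    Assumes the columns of X have unit l2-norm."""
    p = X.shape[1]
    w = np.zeros(p)
    r = y.copy()                         # residual y - X w
    for _ in range(n_epochs):
        for i in range(p):
            r += w[i] * X[:, i]          # remove coordinate i from the fit
            rho = X[:, i] @ r            # x_i^T r
            w[i] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-threshold
            r -= w[i] * X[:, i]          # put coordinate i back
    return w
```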

First-order/proximal methods
$$\min_{w \in \mathbb{R}^p} f(w) + \lambda \psi(w),$$
where $f$ is strictly convex and differentiable with a Lipschitz gradient. Proximal methods generalize the idea of gradient descent:
$$w^{k+1} \leftarrow \arg\min_{w \in \mathbb{R}^p} \underbrace{f(w^k) + \nabla f(w^k)^\top (w - w^k)}_{\text{linear approximation}} + \underbrace{\frac{L}{2} \|w - w^k\|_2^2}_{\text{quadratic term}} + \lambda \psi(w)$$
$$= \arg\min_{w \in \mathbb{R}^p} \frac{1}{2} \Big\| w - \Big( w^k - \frac{1}{L} \nabla f(w^k) \Big) \Big\|_2^2 + \frac{\lambda}{L} \psi(w).$$
When $\lambda = 0$, $w^{k+1} \leftarrow w^k - \frac{1}{L} \nabla f(w^k)$: this is equivalent to a classical gradient descent step.

First-order/proximal methods
They require solving efficiently the proximal operator
$$\min_{w \in \mathbb{R}^p} \frac{1}{2} \|u - w\|_2^2 + \lambda \psi(w).$$
For the $\ell_1$-norm, this amounts to a soft-thresholding: $w^\star_i = \mathrm{sign}(u_i)(|u_i| - \lambda)_+$.
There exist accelerated versions based on Nesterov's optimal first-order method (gradient method with extrapolation) [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.
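
A minimal NumPy sketch of the accelerated variant (FISTA) applied to the Lasso, where the proximal operator is the soft-thresholding above; the names and iteration count are mine:

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def fista(y, X, lam, n_iter=200):
    """FISTA for the Lasso: min_w 0.5*||y - X w||^2 + lam*||w||_1."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of grad f
    w = np.zeros(X.shape[1])
    z, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)
        w_new = soft_threshold(z - grad / L, lam / L)   # proximal step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_new + ((t - 1.0) / t_new) * (w_new - w)   # extrapolation
        w, t = w_new, t_new
    return w
```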

Optimization for Grouped Sparsity
The proximal operator
$$\min_{w \in \mathbb{R}^p} \frac{1}{2} \|u - w\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \|w_g\|_q$$
has a closed form. For $q = 2$:
$$w^\star_g = \frac{u_g}{\|u_g\|_2} (\|u_g\|_2 - \lambda)_+, \quad g \in \mathcal{G};$$
for $q = \infty$:
$$w^\star_g = u_g - \Pi_{\|\cdot\|_1 \leq \lambda}[u_g], \quad g \in \mathcal{G}.$$
These formulas generalize soft-thresholding to groups of variables. They are used in block-coordinate descent and proximal algorithms.
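
The $q = 2$ case in a few lines of NumPy; `groups` is assumed to be a partition given as a list of index arrays, and the function name is mine:

```python
import numpy as np

def group_soft_threshold(u, groups, lam):
    """Prox of lam * sum_g ||w_g||_2 for a partition `groups`
    (list of index arrays): w_g = (u_g / ||u_g||_2) * (||u_g||_2 - lam)_+."""
    w = np.zeros_like(u)
    for g in groups:
        norm_g = np.linalg.norm(u[g])
        if norm_g > lam:
            w[g] = u[g] * (1.0 - lam / norm_g)
    return w

# Example: the second group is zeroed as a whole.
u = np.array([3.0, 4.0, 0.2, 0.1])
print(group_soft_threshold(u, [np.array([0, 1]), np.array([2, 3])], 1.0))
# [2.4 3.2 0.  0. ]
```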

Smoothing Techniques: Reweighted $\ell_2$
Let us start from something simple:
$$a^2 - 2ab + b^2 \geq 0,$$
so that $|a| \leq \frac{1}{2} \big( \frac{a^2}{b} + b \big)$ for $b > 0$, with equality iff $b = |a|$. Therefore
$$\|w\|_1 = \min_{\eta_j \geq 0} \frac{1}{2} \sum_{j=1}^p \Big( \frac{w_j^2}{\eta_j} + \eta_j \Big),$$
and the formulation becomes
$$\min_{w, \, \eta_j \geq \varepsilon} \frac{1}{2} \|y - Xw\|_2^2 + \frac{\lambda}{2} \sum_{j=1}^p \Big( \frac{w_j^2}{\eta_j} + \eta_j \Big).$$

DC (difference of convex) Programming
Remember the concave penalties with a kink at zero, $\psi(w) = \sum_{i=1}^p \phi(|w_i|)$:
- $\ell_q$-pseudo-norm, with $0 < q < 1$: $\psi(w) = \sum_{i=1}^p |w_i|^q$;
- log penalty: $\psi(w) = \sum_{i=1}^p \log(|w_i| + \varepsilon)$;
- more generally, any concave, increasing $\phi$ with a kink at zero.

Figure: illustration of the DC-programming approach for $\phi(|w|) = \log(|w| + \varepsilon)$. The concave part of the objective is replaced by its linearization at the current iterate $w^k$: the surrogate $f(w) + \phi'(|w^k|)\,|w| + C$ majorizes $f(w) + \phi(|w|)$ and coincides with it at $w^k$.

DC (difference of convex) Programming
$$\min_{w \in \mathbb{R}^p} f(w) + \lambda \sum_{i=1}^p \phi(|w_i|).$$
This problem is non-convex: $f$ is convex, and $\phi$ is concave on $\mathbb{R}_+$. If $w^k$ is the current estimate at iteration $k$, the algorithm solves
$$w^{k+1} \leftarrow \arg\min_{w \in \mathbb{R}^p} f(w) + \lambda \sum_{i=1}^p \phi'(|w_i^k|) \, |w_i|,$$
which is a reweighted-$\ell_1$ problem. Warning: this does not solve the non-convex problem; it only provides a stationary point. In practice, each iteration sets small coefficients to zero, and after 2-3 iterations the result does not change much.
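
A sketch of the resulting loop for the log penalty, reusing the `fista` function from the proximal-methods sketch above. Each weighted Lasso is solved by rescaling the columns of $X$ (substituting $v_i = c_i w_i$); the number of outer iterations, $\varepsilon$, and the unit-weight initialization are my own choices:

```python
import numpy as np

def reweighted_l1(y, X, lam, eps=0.01, n_outer=3):
    """DC programming for min_w f(w) + lam * sum_i log(|w_i| + eps),
    with f(w) = 0.5*||y - X w||^2. Each outer iteration solves a
    weighted Lasso with weights c_i = 1/(|w_i^k| + eps) by column
    rescaling, reusing the `fista` sketch above."""
    p = X.shape[1]
    c = np.ones(p)                       # first pass: plain Lasso
    w = np.zeros(p)
    for _ in range(n_outer):
        v = fista(y, X / c, lam)         # weighted Lasso in v = c * w
        w = v / c
        c = 1.0 / (np.abs(w) + eps)      # phi'(|w_i^k|) for the log penalty
    return w
```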