Parcimonie en apprentissage statistique (Sparsity in Statistical Learning)


Parcimonie en apprentissage statistique. Guillaume Obozinski, Ecole des Ponts - ParisTech. Journée Parcimonie, Fédération Charles Hermite, June 23, 2014.

Classical supervised learning setup (ERM).
Data: (x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n); f ∈ H is the function to learn.
Loss function l : (y, a) ↦ l(y, a), e.g. l(y, a) = (1/2)(y − a)^2, the logistic loss, the hinge loss, etc.
Empirical risk minimization:
  \min_{f \in \mathcal{H}} \ \underbrace{\frac{1}{n} \sum_{i=1}^n l(f(x_i), y_i)}_{\text{empirical risk}} + \underbrace{\lambda \|f\|_{\mathcal{H}}^2}_{\text{regularization}}
H is typically an RKHS and λ is the regularization coefficient: λ controls the complexity of the function that we are willing to learn for a given amount of data.

Learning linear functions.
Restricting to linear functions f_w : x ↦ w^⊤ x:
  \min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n l(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|^2
For the square loss this is ridge regression (a sketch follows below). Issue: the number of features p is typically large compared to the amount of data. Sparsity provides an alternative to this kind of regularization: reducing the number of features entering the model yields another way of controlling model complexity, more interpretable models (very important in biomedical applications), and computationally efficient algorithms.
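
As a concrete illustration of the ridge case above, here is a minimal NumPy sketch of the closed-form solution of the regularized normal equations (the data X, y and the level lam are synthetic placeholders):

    import numpy as np

    def ridge(X, y, lam):
        """Closed-form minimizer of (1/(2n))||y - Xw||^2 + (lam/2)||w||^2."""
        n, p = X.shape
        # Regularized normal equations: (X'X/n + lam*I) w = X'y/n
        return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((64, 20))
    w_true = rng.standard_normal(20)
    y = X @ w_true + 0.1 * rng.standard_normal(64)
    w_hat = ridge(X, y, lam=0.1)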

A sparse signal.
y ∈ R^n is the signal, X ∈ R^{n×p} is some overcomplete basis, and w is the sparse representation of the signal: find w sparse such that y = Xw.
Classical signal processing formulation of the problem:
  \min_w \|w\|_0 \quad \text{s.t.} \quad y = Xw.
Problem: there is noise... and noise is not sparse. Noisy formulations:
  \min_w \|w\|_0 \ \text{s.t.} \ \|y - Xw\|_2 \le \epsilon, \qquad
  \min_w \|y - Xw\|_2^2 \ \text{s.t.} \ \|w\|_0 \le k, \qquad
  \min_w \tfrac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_0.
These problems are NP-hard.

Approaches.
Greedy methods: Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), Least-squares OMP, CoSaMP.
Relaxation methods: Lasso / Basis Pursuit, Dantzig Selector.
Bayesian methods: Spike-and-Slab priors (ARD), Empirical Bayes.
(A minimal OMP sketch follows below.)
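
To make the greedy family concrete, a minimal NumPy sketch of Orthogonal Matching Pursuit, assuming roughly unit-norm columns of X (illustrative only, not a reference implementation of the methods listed above):

    import numpy as np

    def omp(X, y, k):
        """Greedy OMP: add the atom most correlated with the residual,
        then refit by least squares on the selected support."""
        n, p = X.shape
        support, residual = [], y.copy()
        w = np.zeros(p)
        for _ in range(k):
            j = int(np.argmax(np.abs(X.T @ residual)))   # most correlated atom
            if j not in support:
                support.append(j)
            w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            residual = y - X[:, support] @ w_s
        w[support] = w_s
        return w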

A convex relaxation...
Empirical risk: for w ∈ R^p, L(w) = \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^\top w)^2.
Support of the model: Supp(w) = {i : w_i ≠ 0}, so that |Supp(w)| = \sum_i 1_{\{w_i \neq 0\}} = \|w\|_0.
Penalization for variable selection and its convex relaxation, the Lasso:
  \min_{w \in \mathbb{R}^p} L(w) + \lambda |\mathrm{Supp}(w)| \qquad \longrightarrow \qquad \min_{w \in \mathbb{R}^p} L(w) + \lambda \|w\|_1
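
In practice the ℓ1-relaxed problem above can be solved with off-the-shelf tools; a short scikit-learn sketch on synthetic data (the value of alpha is an illustrative placeholder to be tuned):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 50))
    w_true = np.zeros(50)
    w_true[:4] = [1.5, -2.0, 3.0, 0.5]             # 4-sparse ground truth
    y = X @ w_true + 0.1 * rng.standard_normal(100)

    # scikit-learn minimizes (1/(2n))||y - Xw||^2 + alpha * ||w||_1
    model = Lasso(alpha=0.1).fit(X, y)
    support = np.flatnonzero(model.coef_)          # estimated Supp(w)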

Related formulations through convex programs.
Basis Pursuit:                 \min_w \|w\|_1 \ \text{s.t.} \ y = Xw
Basis Pursuit (noisy setting): \min_w \|w\|_1 \ \text{s.t.} \ \|y - Xw\|_2 \le \eta
Dantzig Selector:              \min_w \|w\|_1 \ \text{s.t.} \ \|X^\top (y - Xw)\|_\infty \le \lambda
Remarks: the minima are not necessarily unique; the Dantzig Selector is a linear program; the optimality conditions for the Lasso require \|X^\top (y - Xw)\|_\infty \le \lambda.

Optimization algorithms.
Generic approaches: for the Lasso, interior point methods (on an equivalent quadratic program); subgradient descent.
Efficient first-order methods: coordinate descent methods, proximal methods (an ISTA sketch follows below), reweighted ℓ2 methods (especially for structured sparse methods).
Active set methods: for the Lasso, the LARS algorithm; in general, meta-algorithms to combine with the methods above. See e.g. Bach et al. (2012).
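
Among the first-order methods listed above, the proximal-gradient (ISTA) iteration for the Lasso is short enough to sketch; this is an illustrative NumPy version with a constant step size 1/L, not the LARS path algorithm:

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista_lasso(X, y, lam, n_iter=500):
        """Proximal gradient descent on (1/(2n))||y - Xw||^2 + lam * ||w||_1."""
        n, p = X.shape
        L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the smooth part
        w = np.zeros(p)
        for _ in range(n_iter):
            grad = X.T @ (X @ w - y) / n            # gradient of the quadratic term
            w = soft_threshold(w - grad / L, lam / L)
        return w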

Software: the SPAMS toolbox.
Toolbox developed by Julien Mairal; C++, interfaced with Matlab, R and Python.
Proximal gradient methods for ℓ0, ℓ1, elastic net, fused Lasso, group Lasso and more, for the square, logistic and multi-class logistic loss functions; handles sparse matrices; fast implementations of OMP and LARS; dictionary learning and matrix factorization (NMF, sparse PCA).
http://www.di.ens.fr/willow/spams/
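
A possible usage sketch of the toolbox's Python bindings, with the documented entry points spams.trainDL (dictionary learning) and spams.lasso (sparse coding); the parameter values below are illustrative assumptions, not recommended settings:

    import numpy as np
    import spams   # SPAMS Python bindings

    # SPAMS expects Fortran-ordered float arrays with one signal per column
    X = np.asfortranarray(np.random.randn(64, 10000))

    # Learn a dictionary of 256 atoms under an l1 penalty (illustrative parameters)
    D = spams.trainDL(X, K=256, lambda1=0.15, iter=200)

    # Sparse codes of the signals on D (LARS-based Lasso solver); returns a sparse matrix
    alpha = spams.lasso(X, D=D, lambda1=0.15)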

Correlation, stability and the Elastic Net.
In the presence of strong correlations between the variables, the Lasso chooses among them arbitrarily, which causes stability issues. The Elastic Net (Zou and Hastie, 2005):
  \min_w \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \|w\|_1 + \mu \|w\|_2^2
It makes the optimization problem strongly convex, so there is always a unique solution and many algorithms converge faster, and it selects correlated variables together. Two intuitions/views: (i) decorrelation, since the problem can be rewritten (up to a rescaling of µ) as
  \min_w \frac{1}{2n} \big( w^\top (X^\top X + \mu I) w - 2 y^\top X w + y^\top y \big) + \lambda \|w\|_1;
(ii) smoothing: for a pair of very correlated variables x_1, x_2, the quadratic term encourages w_1 ≈ w_2. Overall, better behavior with heavily correlated variables.
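
The elastic net is also available off the shelf; a hedged scikit-learn sketch, reusing the X, y from the Lasso sketch earlier (scikit-learn parametrizes the penalty with alpha and l1_ratio rather than the λ, µ of the slide):

    from sklearn.linear_model import ElasticNet

    # scikit-learn minimizes
    #   (1/(2n))||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2,
    # i.e. lambda = alpha*l1_ratio and mu = 0.5*alpha*(1 - l1_ratio) in the slide's notation.
    enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)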

Comparing the Lasso and other strategies for linear regression.
Compare:
  Ridge regression:  \min_{w \in \mathbb{R}^p} \frac{1}{2} \|y - Xw\|_2^2 + \frac{\lambda}{2} \|w\|_2^2
  Lasso:             \min_{w \in \mathbb{R}^p} \frac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_1
  OMP/FS:            \min_{w \in \mathbb{R}^p} \frac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_0
Each method builds a path of solutions from 0 to the ordinary least-squares solution; the regularization parameters are selected on the test set.

Simulation results.
I.i.d. Gaussian design matrix, k = 4, n = 64, p ∈ [2, 256], SNR = 1. Note the stability to non-sparsity and the variability.
[Figure: mean square error versus log2(p) for the ℓ1, ℓ2 and greedy methods (plus the oracle), on sparse data (left panel) and rotated, non-sparse data (right panel).]

Advantages and drawbacks of ℓ1 vs. ℓ0 penalization.
Advantages: the solution α(x) is a continuous (and differentiable on the support) function of the data x; the ℓ1-norm is more robust to violations of the sparsity assumption; it controls the influence of spuriously introduced variables (like ℓ0 + ℓ2); the convex formulation leads to principled algorithms that generalize well to new situations, and to natural theoretical analyses.
Drawbacks: it introduces an estimation bias which, if ignored, leads to the selection of too many variables; some of the ℓ0 algorithms are simpler.

Group sparse models

From ℓ1-regularization...
  \min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \big( y^{(i)} - w^\top x^{(i)} \big)^2 + \lambda \|w\|_1, \qquad \text{with } \|w\|_1 = \sum_{j=1}^p |w_j|.

...to penalization with grouped variables.
Assume that {1, ..., p} is partitioned into m groups G_1, ..., G_m, and write w = (w_{G_1}, ..., w_{G_m}) and x = (x_{G_1}, ..., x_{G_m}):
  \min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n l(w^\top x^{(i)}, y^{(i)}) + \lambda \sum_{j=1}^m \|w_{G_j}\|
Group Lasso (Yuan and Lin, 2007):
  \min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \big( y^{(i)} - w^\top x^{(i)} \big)^2 + \lambda \sum_{j=1}^m \|w_{G_j}\|_2
The ℓ1/ℓ2 norm: Ω(w) := \sum_{G \in \mathcal{G}} \|w_G\|_2. Its unit ball in R^3 is \|(w_1, w_2)\|_2 + |w_3| \le 1. Entire groups are set to 0, but there are no zeros within the selected groups.

ℓ1/ℓq regularization.
One can also consider the ℓ1/ℓ∞ norm (more non-differentiabilities).
Applications: groups of nominal variables (dummy binary variables); learning sums of polynomial functions f(x) = f_1(x_1) + ... + f_p(x_p), i.e.
  \min_w \frac{1}{n} \sum_i \Big( y^{(i)} - \sum_{j,k} w_{jk} \big( x_j^{(i)} \big)^k \Big)^2 + \lambda \sum_{j=1}^p \|(w_{j1}, \dots, w_{jK})\|_2
(j indexes variables, i indexes observations, k is the degree of the monomial).

Algorithms for ℓ1/ℓ2 regularization.
  \min_{w \in \mathbb{R}^p} \ \underbrace{\frac{1}{n} \sum_{i=1}^n l(w^\top x^{(i)}, y^{(i)})}_{f(w)} + \lambda \underbrace{\sum_{j=1}^m \|w_{G_j}\|_2}_{\Omega(w)}
Reweighted ℓ2 algorithms, proximal methods, blockwise coordinate descent (the proximal step is sketched below).
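
The proximal step behind the proximal and block coordinate methods above is a block soft-thresholding of each group; a minimal NumPy sketch (the groups are assumed to form a partition of the indices):

    import numpy as np

    def prox_group_l2(w, groups, t):
        """Proximal operator of w -> t * sum_j ||w_{G_j}||_2 (block soft-thresholding)."""
        w = w.copy()
        for g in groups:
            norm = np.linalg.norm(w[g])
            w[g] = 0.0 if norm <= t else (1 - t / norm) * w[g]
        return w

    # Example: 6 variables split into 3 groups of 2
    groups = [[0, 1], [2, 3], [4, 5]]
    w = np.array([0.1, -0.2, 2.0, 1.0, 0.05, 0.0])
    print(prox_group_l2(w, groups, t=0.5))   # first and last groups are zeroed out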

Sparsity in function space and multiple kernel learning

Introducing a feature map.
A feature map φ : x ↦ φ(x) maps the input data to a richer, possibly more explicit feature space, typically high-dimensional or even infinite-dimensional:
  \min_w \sum_{i=1}^n l(w^\top \phi(x_i), y_i) + \frac{\lambda}{2} \|w\|_2^2.

Changing the dot product.
Let x = (x_1, x_2) ∈ R^2 and φ(x) = (x_1, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2). Then
  \langle \phi(x), \phi(y) \rangle = x_1 y_1 + x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2
                                  = x_1 y_1 + x_2 y_2 + (x_1 y_1)^2 + (x_2 y_2)^2 + 2 (x_1 y_1)(x_2 y_2)
                                  = \langle x, y \rangle + \langle x, y \rangle^2
For w = (0, 0, 1, 1, 0), w^⊤ φ(x) − 1 ≥ 0 ⟺ x_1^2 + x_2^2 ≥ 1: linear separators in R^5 correspond to conic separators in R^2. http://www.youtube.com/watch?v=3licbrzprza
More generally, let x = (x_1, ..., x_p) ∈ R^p and φ(x) = (x_1, ..., x_p, x_1^2, ..., x_p^2, \sqrt{2}\,x_1 x_2, ..., \sqrt{2}\,x_i x_j, ..., \sqrt{2}\,x_{p-1} x_p). We still have ⟨φ(x), φ(y)⟩ = ⟨x, y⟩ + ⟨x, y⟩^2, but the explicit mapping is too expensive to compute: φ(x) ∈ R^{p + p(p+1)/2}.
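
A quick numerical check of the identity ⟨φ(x), φ(y)⟩ = ⟨x, y⟩ + ⟨x, y⟩² used above (illustrative NumPy sketch for the p = 2 case):

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

    x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])
    lhs = phi(x) @ phi(y)
    rhs = x @ y + (x @ y) ** 2
    assert np.isclose(lhs, rhs)   # explicit feature map matches k(x, y) = <x,y> + <x,y>^2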

Duality for regularized empirical risk minimization.
Define ψ_i : u ↦ l(u, y_i). Let Ω be a norm and consider
  \min_w \frac{1}{n} \sum_{i=1}^n l(w^\top x_i, y_i) + \frac{\lambda}{2} \Omega(w)^2
  = \min_{u, w} \frac{1}{n} \sum_i \psi_i(u_i) + \frac{\lambda}{2} \Omega(w)^2 \quad \text{s.t.} \quad \forall i, \ u_i = w^\top x_i
  = \min_{u, w} \frac{1}{n} \sum_i \psi_i(u_i) + \frac{\lambda}{2} \Omega(w)^2 \quad \text{s.t.} \quad u = Xw
  = \min_{u, w} \max_\alpha \frac{1}{n} \sum_i \psi_i(u_i) + \frac{\lambda}{2} \Omega(w)^2 - \lambda \alpha^\top (u - Xw)
  = \max_\alpha \min_{u, w} \frac{1}{n} \sum_i \big[ \psi_i(u_i) - (n\lambda\alpha_i) u_i \big] + \lambda \Big[ \frac{1}{2} \Omega(w)^2 + w^\top (X^\top \alpha) \Big]
  = \max_\alpha \ -\frac{1}{n} \sum_i \psi_i^*(n\lambda\alpha_i) - \frac{\lambda}{2} \Omega^*(X^\top \alpha)^2
where ψ_i^* denotes the Fenchel conjugate of ψ_i and Ω^* the dual norm of Ω.

Representer property and kernelized version.
Consider the special case Ω = ‖·‖_2:
  \min_w \frac{1}{n} \sum_{i=1}^n l(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2
  = \max_\alpha \ -\frac{1}{n} \sum_i \psi_i^*(n\lambda\alpha_i) - \frac{\lambda}{2} \|X^\top \alpha\|_2^2
  = \max_\alpha \ -\frac{1}{n} \sum_i \psi_i^*(n\lambda\alpha_i) - \frac{\lambda}{2} \alpha^\top K \alpha, \qquad \text{with } K = XX^\top,
with the relation between the optimal solutions w^* = X^\top \alpha^* = \sum_i \alpha_i^* x_i. So if we replace x_i with φ(x_i), we have w^* = \sum_{i=1}^n \alpha_i^* \phi(x_i), and
  f(x) = \langle w^*, \phi(x) \rangle = \sum_i \alpha_i^* \langle \phi(x_i), \phi(x) \rangle = \sum_i \alpha_i^* K(x_i, x).
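
For the square loss this dual leads to kernel ridge regression; a minimal sketch with an explicit Gram matrix (the Gaussian kernel and the (K + nλI) scaling below follow the usual convention consistent with w = X^⊤α, up to constants):

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    def kernel_ridge_fit(K, y, lam):
        """Dual coefficients alpha such that f(x) = sum_i alpha_i K(x_i, x)."""
        n = K.shape[0]
        return np.linalg.solve(K + n * lam * np.eye(n), y)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    K = gaussian_kernel(X, X)
    alpha = kernel_ridge_fit(K, y, lam=0.1)
    f_train = K @ alpha   # fitted values f(x_i) = sum_j alpha_j K(x_j, x_i)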

Regularization for multiple features.
Map x to p feature vectors Φ_1(x), ..., Φ_j(x), ..., Φ_p(x), each with its own weight vector w_j, and predict with w_1^\top \Phi_1(x) + \dots + w_p^\top \Phi_p(x).
Concatenating feature spaces is equivalent to summing kernels: with the regularizer \sum_{j=1}^p \|w_j\|_2^2, the resulting kernel is K = \sum_{j=1}^p K_j.

General kernel learning (Lanckriet et al., 2004; Bach et al., 2005; Micchelli and Pontil, 2005).
  G(K) = \min_{w \in \mathcal{F}} \sum_{i=1}^n l(y_i, w^\top \Phi(x_i)) + \frac{\lambda}{2} \|w\|_2^2
       = \max_{\alpha \in \mathbb{R}^n} \ -\sum_i l_i^*(\lambda \alpha_i) - \frac{\lambda}{2} \alpha^\top K \alpha
is a convex function of the kernel matrix K.
Learning a convex combination of kernels: given K_1, ..., K_p, consider the formulation
  \min_\eta \ G\Big( \sum_j \eta_j K_j \Big) \quad \text{s.t.} \quad \sum_j \eta_j = 1, \ \eta_j \ge 0.
The simplex constraint, like the ℓ1-norm, induces sparsity.

Multiple kernel learning and ℓ1/ℓ2.
Block ℓ1-norm problem:
  \min_{w_1, \dots, w_p} \sum_{i=1}^n l\big( y_i, w_1^\top \Phi_1(x_i) + \dots + w_p^\top \Phi_p(x_i) \big) + \frac{\lambda}{2} \big( \|w_1\|_2 + \dots + \|w_p\|_2 \big)^2
Proposition: block ℓ1-norm regularization is equivalent to minimizing with respect to η the optimal value G(\sum_{j=1}^p \eta_j K_j). The (sparse) weights η are obtained from the optimality conditions, and the dual parameters α are optimal for K = \sum_{j=1}^p \eta_j K_j: a single optimization problem learns both η and α.

Proof of equivalence.
  \min_{w_1, \dots, w_p} \sum_i l\Big( y_i, \sum_{j=1}^p w_j^\top \Phi_j(x_i) \Big) + \lambda \Big( \sum_{j=1}^p \|w_j\|_2 \Big)^2
  = \min_{w_1, \dots, w_p} \ \min_{\eta: \sum_j \eta_j = 1} \sum_i l\Big( y_i, \sum_{j=1}^p w_j^\top \Phi_j(x_i) \Big) + \lambda \sum_{j=1}^p \|w_j\|_2^2 / \eta_j
  = \min_{\eta: \sum_j \eta_j = 1} \ \min_{\tilde w_1, \dots, \tilde w_p} \sum_i l\Big( y_i, \sum_{j=1}^p \eta_j^{1/2} \tilde w_j^\top \Phi_j(x_i) \Big) + \lambda \sum_{j=1}^p \|\tilde w_j\|_2^2 \qquad \text{with } \tilde w_j = w_j \, \eta_j^{-1/2}
  = \min_{\eta: \sum_j \eta_j = 1} \ \min_{\tilde w} \sum_i l\big( y_i, \tilde w^\top \Psi_\eta(x_i) \big) + \lambda \|\tilde w\|_2^2 \qquad \text{with } \Psi_\eta(x) = \big( \eta_1^{1/2} \Phi_1(x), \dots, \eta_p^{1/2} \Phi_p(x) \big)
We have \Psi_\eta(x)^\top \Psi_\eta(x') = \sum_{j=1}^p \eta_j K_j(x, x') with \sum_{j=1}^p \eta_j = 1.

Dictionary Learning

Dictionary learning for image denoising (Elad and Aharon, 2006).
  x = x_0 + ε: measurements = original image + noise.

Sparse PCA / Dictionary Learning.
In both cases the data matrix is factored as X = Dα.
Sparse PCA: sparse dictionary D, e.g. for microarray data (Witten et al., 2009; Bach et al., 2008).
Dictionary learning: sparse decomposition α on an overcomplete dictionary, e.g. for natural images (Elad and Aharon, 2006).

K-SVD (Aharon et al., 2006).
Formulation:
  \min_{D, \alpha} \frac{1}{2} \sum_i \|x_i - D\alpha_i\|^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \le k
Idea: (1) decompose the signals on the current dictionary using a greedy algorithm; (2) fix the obtained supports; (3) for each j in sequence, update the j-th atom and the decomposition coefficients on this atom to optimize the fit on the set of signals that use this atom; (4) iterate until the supports no longer change.

K-SVD algorithm.
Algorithm 1: K-SVD
repeat
  for i = 1 to n do
    find α_i using a greedy algorithm (e.g. OMP)
  end for
  for j = 1 to K do
    I_j ← {i : α_{ij} ≠ 0}
    E^{(j)} ← X_{:, I_j} − \sum_{k \neq j} d^{(k)} \alpha_{k, I_j}
    solve (using the Lanczos algorithm) (d^{(j)}, \alpha_{j, I_j}) ← \arg\min_{d, \alpha} \|E^{(j)} - d\alpha\|^2 \ \text{s.t.} \ \|d\|_2 = 1
  end for
until none of the I_j change anymore.
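
A compact NumPy sketch of the atom-update loop of Algorithm 1, using a rank-1 SVD of the restricted residual (the sparse coding step is omitted and the supports are kept fixed, as in the slide; this is an illustration, not the reference K-SVD code):

    import numpy as np

    def ksvd_atom_updates(X, D, A):
        """One pass of K-SVD dictionary updates.
        X: p x n signals, D: p x K dictionary, A: K x n sparse codes (supports fixed)."""
        for j in range(D.shape[1]):
            I = np.flatnonzero(A[j, :])        # signals that use atom j
            if I.size == 0:
                continue
            # Residual of those signals once the contribution of atom j is removed
            E = X[:, I] - D @ A[:, I] + np.outer(D[:, j], A[j, I])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)   # best rank-1 approximation
            D[:, j] = U[:, 0]                  # unit-norm updated atom
            A[j, I] = s[0] * Vt[0, :]          # updated coefficients on that atom
        return D, A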

K-SVD: heuristics.
Replace atoms that are rarely used by the least well explained datapoints in the dataset.
Remove atoms that are too correlated with other existing atoms and replace them by the least well explained datapoints.

ℓ1 formulation for dictionary learning.
  \min_{A \in \mathbb{R}^{k \times n}, \ D \in \mathbb{R}^{p \times k}} \sum_i \Big( \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big) \quad \text{s.t.} \quad \forall j, \ \|d_j\|_2 \le 1.
In both cases there is no orthogonality constraint. The problem is not jointly convex, but it is convex in each d_j and each α_i separately; classical optimization alternates between D and α.

Block coordinate descent for DL (Lee et al., 2007; Witten et al., 2009).
  \min_{U, V} \|X - UV\|_F^2 + \lambda \sum_j \|v_j\|_1 \quad \text{s.t.} \quad \|u_j\|_2 \le 1
Denote X^{(j)} = X - \sum_{k \neq j} u_k v_k^\top, with u_k the columns of U and v_k^\top the rows of V.
Minimizing w.r.t. u_j:
  \min_{u_j} \|X^{(j)} - u_j v_j^\top\|_F^2 \ \text{s.t.} \ \|u_j\|_2 \le 1, \qquad \text{solved by} \quad u_j \leftarrow \frac{X^{(j)} v_j}{\|X^{(j)} v_j\|_2}.
Minimizing w.r.t. v_j:
  \min_{v_j} \|X^{(j)} - u_j v_j^\top\|_F^2 + \lambda \|v_j\|_1,
solved by soft-thresholding: no matrix inversion is required, one can take advantage of efficient algorithms for the Lasso, and warm starts and active sets can be used (a code sketch follows below).
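
The two block updates above translate directly into code; a minimal NumPy sketch of one sweep, storing the v_j as columns of a matrix V so that X ≈ U V^T (the λ/2 threshold comes from expanding the quadratic when ||u_j||_2 = 1):

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def dl_block_pass(X, U, V, lam):
        """One block coordinate sweep on ||X - U V^T||_F^2 + lam * sum_j ||v_j||_1,
        subject to ||u_j||_2 <= 1.  U is p x K, V is n x K."""
        for j in range(U.shape[1]):
            Xj = X - U @ V.T + np.outer(U[:, j], V[:, j])       # residual without component j
            u = Xj @ V[:, j]                                     # update u_j, then normalize
            norm = np.linalg.norm(u)
            if norm > 0:
                U[:, j] = u / norm
            V[:, j] = soft_threshold(Xj.T @ U[:, j], lam / 2)    # update v_j by soft-thresholding
        return U, V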

Dictionary learning for image denoising.
Extract all overlapping 8×8 patches x_i ∈ R^{64} and form the matrix X = [x_1, ..., x_n] ∈ R^{64×n}. Solve the matrix factorization problem
  \min_{A \in \mathbb{R}^{k \times n}, \ D \in \mathbb{R}^{p \times k}} \sum_i \Big( \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big) \quad \text{s.t.} \quad \forall j, \ \|d_j\|_2 \le 1,
where α_i is sparse and D is the dictionary. Each patch is decomposed as x_i ≈ Dα_i, and the reconstructions Dα_i of the overlapping patches are averaged to rebuild a full-sized image. The number of patches n is large (on the order of the number of pixels), so one uses stochastic optimization / online learning (Mairal et al., 2009a), which can handle potentially infinite datasets and adapt to dynamic training sets.
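
The patch-based pipeline above can be prototyped with standard tools; a hedged sketch using scikit-learn's online dictionary learning in the spirit of Mairal et al. (2009a), where the noisy image and all parameter values are illustrative placeholders:

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning
    from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

    noisy = np.random.rand(64, 64)                # placeholder for a noisy grayscale image

    patches = extract_patches_2d(noisy, (8, 8)).reshape(-1, 64)
    mean = patches.mean(axis=1, keepdims=True)
    patches = patches - mean                      # work on centered patches

    dico = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, batch_size=200)
    codes = dico.fit_transform(patches)           # sparse codes alpha_i
    denoised_patches = (codes @ dico.components_ + mean).reshape(-1, 8, 8)
    denoised = reconstruct_from_patches_2d(denoised_patches, noisy.shape)   # averages overlaps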

Denoising result (Mairal et al., 2009b)

Dictionary Learning for Image Inpainting. Example from Mairal et al. (2008)

What does the dictionary V look like?

Hierarchical sparsity and some applications. Joint work with Rodolphe Jenatton, Julien Mairal and Francis Bach

Hierarchical norms (Zhao et al., 2009; Bach, 2008; Jenatton, Mairal, Obozinski and Bach, 2010a).
A dictionary element is selected only after its ancestors. The structure is on the codes α, not on the individual dictionary elements d_i. Hierarchical penalization:
  \Omega(\alpha) = \sum_{g \in \mathcal{G}} \|\alpha_g\|_2,
where the groups g ∈ G are the sets of descendants of the nodes of a tree.
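
To make the group structure concrete, a small sketch that builds the descendant groups of a toy tree over the code entries and evaluates the penalty Ω(α) (the tree and the values of α are illustrative):

    import numpy as np

    # Toy tree on nodes 0..6; the node index doubles as the coefficient index
    children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}

    def descendants(node):
        """Group attached to a node: the node itself plus all of its descendants."""
        group = [node]
        for c in children[node]:
            group += descendants(c)
        return group

    groups = [descendants(v) for v in children]            # one group per node of the tree
    alpha = np.array([1.0, 0.5, 0.0, 0.2, 0.0, 0.0, 0.0])  # nonzeros form a rooted subtree
    omega = sum(np.linalg.norm(alpha[g]) for g in groups)  # Omega(alpha) = sum_g ||alpha_g||_2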

Hierarchical dictionary for image patches

References I
Aharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322.
Bach, F. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106.
Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. Technical report 0812.1869, arXiv.
Elad, M. and Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745.
Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS).
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2009a). Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML).
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. (2009b). Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV).
Mairal, J., Sapiro, G., and Elad, M. (2008). Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214–241.

References II
Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534.
Yuan, M. and Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143–161.
Zhao, P., Rocha, G., and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320.