Sparsity in statistical learning
Guillaume Obozinski
Ecole des Ponts - ParisTech
Journée Parcimonie, Fédération Charles Hermite, 23 June 2014
Classical supervised learning setup (ERM)
Data: $(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)$; $f \in \mathcal{H}$ the function to learn.
Loss function: $\ell : (y, a) \mapsto \ell(y, a)$, e.g. $\ell(y, a) = \frac{1}{2}(y - a)^2$, the logistic loss, the hinge loss, etc.
Empirical Risk Minimization:
$$\min_{f \in \mathcal{H}} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)}_{\text{Empirical Risk}} + \underbrace{\lambda \|f\|_{\mathcal{H}}^2}_{\text{Regularization}}$$
$\mathcal{H}$ is typically an RKHS and $\lambda$ the regularization coefficient: $\lambda$ controls the complexity of the function that we are willing to learn for a given amount of data.
Learning linear functions
Restricting to linear functions $f_w : x \mapsto w^\top x$:
$$\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|^2$$
For the square loss, this is ridge regression (see the sketch below).
Issue: the number of features $p$ is typically large compared to the amount of data.
Sparsity provides an alternative to (squared-norm) regularization: reducing the number of features entering the model yields
- another way to control model complexity,
- more interpretable models (very important in biomedical applications),
- computationally efficient algorithms.
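As a quick illustration, a minimal NumPy sketch of ridge regression via its normal equations, for the square loss $\ell(y, a) = \frac{1}{2}(y - a)^2$ and the $\frac{1}{n}$-normalized risk above (function and variable names are ours, not from the slides):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: argmin_w 1/(2n) ||y - Xw||^2 + (lam/2) ||w||^2.

    Setting the gradient to zero gives the closed form
    w = (X^T X / n + lam * I)^{-1} X^T y / n.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
```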
A sparse signal
$y \in \mathbb{R}^n$ is the signal, $X \in \mathbb{R}^{n \times p}$ is some overcomplete basis, and $w$ is the sparse representation of the signal.
Find $w$ sparse such that $y = Xw$. Classical signal processing formulation of the problem:
$$\min_w \|w\|_0 \quad \text{s.t.} \quad y = Xw.$$
Problem: there is noise... and noise is not sparse. Noisy variants:
$$\min_w \|w\|_0 \ \text{ s.t. }\ \|y - Xw\|_2 \le \epsilon, \qquad \min_w \|y - Xw\|_2^2 \ \text{ s.t. }\ \|w\|_0 \le k, \qquad \min_w \tfrac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_0.$$
These problems are NP-hard.
Approaches
Greedy methods: Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), Least-squares OMP, CoSaMP (a sketch of OMP follows below).
Relaxation methods: Lasso / Basis Pursuit, Dantzig Selector.
Bayesian methods: Spike-and-Slab priors, ARD, Empirical Bayes.
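To make the greedy family concrete, a minimal NumPy sketch of Orthogonal Matching Pursuit for the constrained formulation $\min \|y - Xw\|_2^2$ s.t. $\|w\|_0 \le k$ (names are ours):

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: greedily select k columns of X.

    At each step, pick the column most correlated with the current
    residual, then refit by least squares on the selected support
    (the "orthogonal" projection step).
    """
    n, p = X.shape
    support, w = [], np.zeros(p)
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))  # most correlated atom
        support.append(j)
        w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ w_s
    w[support] = w_s
    return w
```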
A convex relaxation...
Empirical risk: for $w \in \mathbb{R}^p$, $L(w) = \frac{1}{2n}\sum_{i=1}^n (y_i - x_i^\top w)^2$.
Support of the model: $\mathrm{Supp}(w) = \{i \mid w_i \ne 0\}$, with $|\mathrm{Supp}(w)| = \sum_i 1_{\{w_i \ne 0\}}$.
Penalization for variable selection: $\min_{w \in \mathbb{R}^p} L(w) + \lambda\,|\mathrm{Supp}(w)|$.
Lasso (its convex relaxation): $\min_{w \in \mathbb{R}^p} L(w) + \lambda \|w\|_1$.
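For illustration, the Lasso relaxation is solvable with off-the-shelf software; a small sketch using scikit-learn's `Lasso`, whose objective is exactly $L(w) + \alpha\|w\|_1$ with $L$ as above (the synthetic data are ours, chosen only for the example):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 64, 256, 4
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:k] = 1.0                      # k-sparse ground truth
y = X @ w_true + 0.5 * rng.standard_normal(n)

# scikit-learn's Lasso minimizes 1/(2n) ||y - Xw||^2 + alpha ||w||_1
model = Lasso(alpha=0.1).fit(X, y)
print("support size:", int(np.sum(model.coef_ != 0)))
```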
Related formulations through convex programs
Basis Pursuit: $\min_w \|w\|_1$ s.t. $y = Xw$
Basis Pursuit (noisy setting): $\min_w \|w\|_1$ s.t. $\|y - Xw\|_2 \le \eta$
Dantzig Selector: $\min_w \|w\|_1$ s.t. $\|X^\top (y - Xw)\|_\infty \le \lambda$
Remarks:
- minima are not necessarily unique;
- the Dantzig Selector is a linear program;
- the optimality conditions for the Lasso require $\|X^\top (y - Xw)\|_\infty \le \lambda$.
Optimization algorithms
Generic approaches: for the Lasso, solving the equivalent quadratic program with interior point methods; subgradient descent.
Efficient first-order methods (a proximal-gradient sketch follows below):
- coordinate descent methods,
- proximal methods,
- reweighted $\ell_2$ methods (esp. for structured sparse methods),
- active set methods.
For the Lasso: the LARS algorithm. In general: meta-algorithms to combine with the methods above; see e.g. Bach et al. (2012).
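As a concrete instance of the proximal family, a minimal ISTA sketch for the Lasso objective $\frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$; the step size uses the Lipschitz constant of the smooth part (all names are illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w 1/(2n)||y - Xw||^2 + lam ||w||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L with L = sigma_max(X)^2 / n
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n      # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w
```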
Software: SPAMS toolbox
Toolbox developed by Julien Mairal; C++, interfaced with Matlab, R, Python.
- Proximal gradient methods for $\ell_0$, $\ell_1$, elastic net, fused Lasso, group Lasso, and more...
- for square, logistic, and multi-class logistic loss functions;
- handles sparse matrices;
- fast implementations of OMP and LARS;
- dictionary learning and matrix factorization (NMF, sparse PCA).
http://www.di.ens.fr/willow/spams/
Correlation, Stability and Elastic Net
In the presence of strong correlations between the variables, the Lasso chooses arbitrarily among them, which causes stability issues.
Elastic Net (Zou and Hastie, 2005):
$$\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1 + \mu\|w\|_2^2$$
Makes the optimization problem strongly convex: always a unique solution, and faster convergence for many algorithms. Selects correlated variables together.
Two intuitions/views:
- by decorrelating the variables: $\min_w \frac{1}{2n}\big(w^\top(X^\top X + 2n\mu I)w - 2\,y^\top Xw + y^\top y\big) + \lambda\|w\|_1$, i.e. the Gram matrix is replaced by the better-conditioned $X^\top X + 2n\mu I$;
- by smoothing: for a pair of highly correlated variables $x_1, x_2$, the penalty encourages $w_1 \approx w_2$.
Better behavior with heavily correlated variables.
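A small sketch with scikit-learn's `ElasticNet`; note the mapping from the $(\lambda, \mu)$ of the objective above to scikit-learn's `(alpha, l1_ratio)` parametrization, and the near-equal weights on two nearly duplicated columns (data and values are ours, for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# scikit-learn minimizes 1/(2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
#                        + 0.5*alpha*(1 - l1_ratio)*||w||_2^2,
# so lam = alpha*l1_ratio and mu = 0.5*alpha*(1 - l1_ratio).
lam, mu = 0.1, 0.05
alpha, l1_ratio = lam + 2 * mu, lam / (lam + 2 * mu)

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
X = np.column_stack([x1,                                  # two near-duplicate
                     x1 + 0.01 * rng.standard_normal(n),  # correlated columns
                     rng.standard_normal(n)])
y = x1 + 0.1 * rng.standard_normal(n)

model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)
print(model.coef_)   # the first two (correlated) weights come out close
```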
Comparing the Lasso and other strategies for linear regression
Compare:
- Ridge regression: $\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$
- Lasso: $\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$
- OMP/FS: $\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_0$
Each method builds a path of solutions from 0 to the ordinary least-squares solution. Regularization parameters are selected on the test set.
Simulation results
i.i.d. Gaussian design matrix, $k = 4$, $n = 64$, $p \in [2, 256]$, SNR = 1. Note the stability to non-sparsity and the variability.
[Figure: mean square error vs. $\log_2(p)$ for $\ell_1$, $\ell_2$, greedy, and oracle methods; left panel: sparse data; right panel: rotated (non-sparse) data.]
Advantages and drawbacks of $\ell_1$ vs $\ell_0$ penalization
Advantages:
- The solution $\hat\alpha(x)$ is a continuous (differentiable on the support) function of the data $x$.
- The $\ell_1$-norm is more robust to violations of the sparsity assumption.
- It controls the influence of spuriously introduced variables (like $\ell_0 + \ell_2$).
- The convex formulation leads to principled algorithms that generalize well to new situations, and to natural theoretical analyses.
Drawbacks:
- It introduces an estimation bias, which leads to the selection of too many variables if ignored.
- Some of the $\ell_0$ algorithms are simpler.
Group sparse models
From $\ell_1$-regularization...
$$\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \big(y^{(i)} - w^\top x^{(i)}\big)^2 + \lambda\|w\|_1, \qquad \text{with } \|w\|_1 = \sum_{j=1}^p |w_j|.$$
...to penalization with grouped variables
Assume that $\{1, \dots, p\}$ is partitioned into $m$ groups $G_1, \dots, G_m$, so that $w = (w_{G_1}, \dots, w_{G_m})$ and $x = (x_{G_1}, \dots, x_{G_m})$:
$$\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x^{(i)}, y^{(i)}) + \lambda\sum_{j=1}^m \|w_{G_j}\|_2$$
Group Lasso (Yuan and Lin, 2007):
$$\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \big(y^{(i)} - w^\top x^{(i)}\big)^2 + \lambda\sum_{j=1}^m \|w_{G_j}\|_2$$
The $\ell_1/\ell_2$ norm: $\Omega(w) := \sum_{G \in \mathcal{G}} \|w_G\|_2$. Unit ball in $\mathbb{R}^3$: $\|(w_1, w_2)\|_2 + |w_3| \le 1$.
Some entire groups are set to 0; no zeros within groups.
$\ell_1/\ell_q$-regularization
Can also consider the $\ell_1/\ell_\infty$-norm (more non-differentiabilities).
Applications:
- groups of nominal variables (dummy binary variables);
- learning sums of polynomial functions of single variables, $f(x) = f_1(x_1) + \dots + f_p(x_p)$:
$$\min_w \frac{1}{n}\sum_{i=1}^n \Big(y^{(i)} - \sum_{j,k} w_{jk}\,\big(x_j^{(i)}\big)^k\Big)^2 + \lambda\sum_{j=1}^p \big\|(w_{j1}, \dots, w_{jK})\big\|_2$$
with $j$ indexing the variables, $i$ the observations, and $k$ the degree of the monomial.
Algorithms for $\ell_1/\ell_2$-regularization
$$\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x^{(i)}, y^{(i)})}_{f(w)} + \lambda \underbrace{\sum_{j=1}^m \|w_{G_j}\|_2}_{\Omega(w)}$$
- Reweighted $\ell_2$ algorithms
- Proximal methods (the key step, block soft-thresholding, is sketched below)
- Blockwise coordinate descent
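A minimal sketch of the proximal step for $\Omega$, block soft-thresholding, which can be plugged into a proximal-gradient loop such as the ISTA sketch earlier (names are ours):

```python
import numpy as np

def prox_group_l2(w, groups, t):
    """Proximal operator of t * sum_j ||w_{G_j}||_2 (block soft-thresholding).

    Each block is shrunk toward zero, and set exactly to zero when its
    norm falls below the threshold t: whole groups vanish at once.
    """
    w = w.copy()
    for g in groups:                      # g: index array of one group G_j
        norm = np.linalg.norm(w[g])
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        w[g] = scale * w[g]
    return w
```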
Sparsity in function space and multiple kernel learning
Introducing a feature map
Feature map $\phi : x \mapsto \phi(x)$: maps the input data to a richer feature space, typically high-dimensional, possibly infinite-dimensional.
$$\min_w \frac{1}{n}\sum_{i=1}^n \ell(w^\top \phi(x_i), y_i) + \frac{\lambda}{2}\|w\|_2^2.$$
Changing the dot product
Let $x = (x_1, x_2) \in \mathbb{R}^2$ and $\phi(x) = (x_1, x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$. Then
$$\langle \phi(x), \phi(y) \rangle = x_1 y_1 + x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2\,x_1 x_2 y_1 y_2 = \langle x, y \rangle + \langle x, y \rangle^2.$$
For $w = (0, 0, 1, 1, 0)$, $w^\top \phi(x) \ge 1 \Leftrightarrow \|x\|_2^2 \ge 1$: linear separators in $\mathbb{R}^5$ correspond to conic separators in $\mathbb{R}^2$.
http://www.youtube.com/watch?v=3licbrzprza
Let $x = (x_1, \dots, x_p) \in \mathbb{R}^p$ and $\phi(x) = (x_1, \dots, x_p, x_1^2, \dots, x_p^2, \sqrt{2}\,x_1 x_2, \dots, \sqrt{2}\,x_i x_j, \dots, \sqrt{2}\,x_{p-1} x_p)$.
We still have $\langle \phi(x), \phi(y)\rangle = \langle x, y \rangle + \langle x, y\rangle^2$, but the explicit mapping is too expensive to compute: $\phi(x) \in \mathbb{R}^{p + p(p+1)/2}$.
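A short numerical check of the identity above, with arbitrarily chosen points (the code is ours):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2, as on the slide."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(y)          # explicit feature-space dot product
rhs = x @ y + (x @ y) ** 2     # kernel evaluation, no explicit map
print(lhs, rhs)                # both equal 2.0
```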
Duality for regularized empirical risk minimization
Define $\psi_i : u \mapsto \ell(u, y_i)$. Let $\Omega$ be a norm and consider
$$\min_{w} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\,\Omega(w)^2$$
$$= \min_{u,w}\ \frac{1}{n}\sum_{i=1}^n \psi_i(u_i) + \frac{\lambda}{2}\,\Omega(w)^2 \quad\text{s.t.}\quad \forall i,\ u_i = w^\top x_i$$
$$= \min_{u,w}\ \frac{1}{n}\sum_{i=1}^n \psi_i(u_i) + \frac{\lambda}{2}\,\Omega(w)^2 \quad\text{s.t.}\quad u = Xw$$
$$= \min_{u,w}\,\max_{\alpha}\ \frac{1}{n}\sum_{i=1}^n \psi_i(u_i) + \frac{\lambda}{2}\,\Omega(w)^2 + \lambda\,\alpha^\top(u - Xw)$$
$$= \max_{\alpha}\ \min_{u}\frac{1}{n}\sum_{i=1}^n \big[\psi_i(u_i) + (n\lambda\alpha_i)\,u_i\big] + \min_w\Big[\frac{\lambda}{2}\,\Omega(w)^2 - \lambda\,w^\top(X^\top\alpha)\Big]$$
$$= \max_{\alpha}\ -\frac{1}{n}\sum_{i=1}^n \psi_i^*(-n\lambda\alpha_i) - \frac{\lambda}{2}\,\Omega^*(X^\top\alpha)^2,$$
where $\psi_i^*$ is the Fenchel conjugate of $\psi_i$ and $\Omega^*$ the dual norm. With this sign convention, the minimizer in $w$ satisfies $w = X^\top\alpha$ when $\Omega = \|\cdot\|_2$, as used on the next slide.
Representer property and kernelized version
Consider the special case $\Omega = \|\cdot\|_2$:
$$\min_{w} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2 = \max_{\alpha}\ -\frac{1}{n}\sum_{i=1}^n \psi_i^*(-n\lambda\alpha_i) - \frac{\lambda}{2}\|X^\top\alpha\|_2^2 = \max_{\alpha}\ -\frac{1}{n}\sum_{i=1}^n \psi_i^*(-n\lambda\alpha_i) - \frac{\lambda}{2}\,\alpha^\top K\alpha$$
with $K = XX^\top$, and the relation between optimal solutions $w = X^\top\alpha = \sum_i \alpha_i x_i$.
So if we replace $x_i$ with $\phi(x_i)$, we have $w = \sum_{i=1}^n \alpha_i\,\phi(x_i)$, and
$$f(x) = \langle w, \phi(x)\rangle = \sum_{i=1}^n \alpha_i\,\langle \phi(x_i), \phi(x)\rangle = \sum_{i=1}^n \alpha_i\,K(x_i, x).$$
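For the square loss, the dual solution is available in closed form; a minimal NumPy sketch, assuming $\ell(u, y) = \frac{1}{2}(u - y)^2$ and the $\frac{1}{n}$-normalized risk as on the slide (names are ours):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Dual solution for the square loss: alpha = (K + n*lam*I)^{-1} y.

    This is the kernelized ridge regression implied by the slide:
    w = sum_i alpha_i phi(x_i), so only the Gram matrix K is needed.
    """
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(alpha, K_test):
    """f(x) = sum_i alpha_i K(x_i, x); K_test[i, j] = K(x_i, x_test_j)."""
    return K_test.T @ alpha
```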
Regularization for multiple features
$x$ is mapped to features $\Phi_1(x), \dots, \Phi_j(x), \dots, \Phi_p(x)$, with weights $w_1, \dots, w_p$, to form the prediction $w_1^\top\Phi_1(x) + \dots + w_p^\top\Phi_p(x)$.
Concatenating feature spaces is equivalent to summing kernels: regularizing with $\sum_{j=1}^p \|w_j\|_2^2$ corresponds to the kernel $K = \sum_{j=1}^p K_j$.
General kernel learning (Lanckriet et al., 2004; Bach et al., 2005; Micchelli and Pontil, 2005)
$$G(K) = \min_{w \in \mathcal{F}} \sum_{i=1}^n \ell(y_i, w^\top\Phi(x_i)) + \frac{\lambda}{2}\|w\|_2^2 = \max_{\alpha \in \mathbb{R}^n} -\sum_{i=1}^n \ell_i^*(-\lambda\alpha_i) - \frac{\lambda}{2}\,\alpha^\top K\alpha$$
(with $\ell_i^*$ the Fenchel conjugate of $\ell(y_i, \cdot)$) is a convex function of the kernel matrix $K$, as a maximum of functions linear in $K$.
Learning a convex combination of kernels: given $K_1, \dots, K_p$, consider the formulation
$$\min_{\eta}\ G\Big(\sum_{j=1}^p \eta_j K_j\Big) \quad\text{s.t.}\quad \sum_j \eta_j = 1,\ \eta_j \ge 0.$$
The simplex constraints, like the $\ell_1$-norm, induce sparsity.
Multiple kernel learning and $\ell_1/\ell_2$
Block $\ell_1$-norm problem:
$$\min_{w_1,\dots,w_p} \sum_{i=1}^n \ell\big(y_i,\ w_1^\top\Phi_1(x_i) + \dots + w_p^\top\Phi_p(x_i)\big) + \frac{\lambda}{2}\big(\|w_1\|_2 + \dots + \|w_p\|_2\big)^2$$
Proposition: block $\ell_1$-norm regularization is equivalent to minimizing the optimal value $G\big(\sum_{j=1}^p \eta_j K_j\big)$ with respect to $\eta$:
- the (sparse) weights $\eta$ are obtained from the optimality conditions;
- the dual parameters $\alpha$ are optimal for $K = \sum_{j=1}^p \eta_j K_j$;
- a single optimization problem learns both $\eta$ and $\alpha$.
Proof of equivalence
$$\min_{w_1,\dots,w_p} \sum_{i=1}^n \ell\Big(y_i, \sum_{j=1}^p w_j^\top\Phi_j(x_i)\Big) + \frac{\lambda}{2}\Big(\sum_{j=1}^p \|w_j\|_2\Big)^2$$
$$= \min_{w_1,\dots,w_p}\ \min_{\eta:\,\sum_j \eta_j = 1} \sum_{i=1}^n \ell\Big(y_i, \sum_{j=1}^p w_j^\top\Phi_j(x_i)\Big) + \frac{\lambda}{2}\sum_{j=1}^p \frac{\|w_j\|_2^2}{\eta_j}$$
$$= \min_{\eta:\,\sum_j \eta_j = 1}\ \min_{\tilde w_1,\dots,\tilde w_p} \sum_{i=1}^n \ell\Big(y_i, \sum_{j=1}^p \eta_j^{1/2}\,\tilde w_j^\top\Phi_j(x_i)\Big) + \frac{\lambda}{2}\sum_{j=1}^p \|\tilde w_j\|_2^2 \qquad\text{with } w_j = \eta_j^{1/2}\,\tilde w_j$$
$$= \min_{\eta:\,\sum_j \eta_j = 1}\ \min_{w} \sum_{i=1}^n \ell\big(y_i, w^\top\Psi_\eta(x_i)\big) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with } \Psi_\eta(x) = \big(\eta_1^{1/2}\Phi_1(x), \dots, \eta_p^{1/2}\Phi_p(x)\big)$$
The first equality uses $\big(\sum_j a_j\big)^2 = \min_{\eta \in \Delta} \sum_j a_j^2/\eta_j$ (Cauchy-Schwarz), the minimum being attained at $\eta_j = a_j / \sum_k a_k$.
We have $\Psi_\eta(x)^\top\Psi_\eta(x') = \sum_{j=1}^p \eta_j K_j(x, x')$ with $\sum_{j=1}^p \eta_j = 1$.
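A minimal alternating sketch of this equivalence for the square loss: kernel ridge on the combined kernel, then the closed-form simplex update $\eta_j \propto \|w_j\|_2$ from the Cauchy-Schwarz step. It ignores the small smoothing usually added so that kernels with $\eta_j = 0$ can re-enter; all names are ours:

```python
import numpy as np

def mkl_kernel_ridge(Ks, y, lam, n_iter=50):
    """Alternating MKL sketch for the square loss l(y, u) = (y - u)^2 / 2.

    Alternates (i) kernel ridge with K = sum_j eta_j K_j, i.e.
    alpha = (K + lam*I)^{-1} y for the unnormalized risk of G(K),
    and (ii) the simplex update eta_j ~ ||w_j||_2.
    """
    n = len(y)
    eta = np.full(len(Ks), 1.0 / len(Ks))
    alpha = np.zeros(n)
    for _ in range(n_iter):
        K = sum(e * Kj for e, Kj in zip(eta, Ks))
        alpha = np.linalg.solve(K + lam * np.eye(n), y)
        # ||w_j||_2 = eta_j * sqrt(alpha^T K_j alpha) in the combined space
        norms = np.array([e * np.sqrt(max(alpha @ Kj @ alpha, 0.0))
                          for e, Kj in zip(eta, Ks)])
        if norms.sum() == 0:
            break
        eta = norms / norms.sum()   # sparsity-inducing simplex update
    return eta, alpha
```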
Dictionary Learning
Dictionary learning for image denoising (Elad and Aharon, 2006)
$$\underbrace{x}_{\text{measurements}} = \underbrace{x_0}_{\text{original image}} + \underbrace{\varepsilon}_{\text{noise}}$$
Sparse PCA / Dictionary Learning
Sparse PCA: $X = D\alpha$ with a sparse dictionary $D$, e.g. for microarray data (Witten et al., 2009; Bach et al., 2008).
Dictionary Learning: $X = D\alpha$ with a sparse decomposition $\alpha$, e.g. overcomplete dictionaries for natural images (Elad and Aharon, 2006).
K-SVD (Aharon et al., 2006)
Formulation:
$$\min_{D, \alpha} \frac{1}{2}\sum_i \|x_i - D\alpha_i\|^2 \quad\text{s.t.}\quad \forall i,\ \|\alpha_i\|_0 \le k$$
Idea:
1. Decompose the signals on the current dictionary using a greedy algorithm.
2. Fix the obtained supports.
3. For each $j$ in a sequence, update the $j$th atom and the decomposition coefficients on this atom to optimize the fit on the set of signals that use this atom.
4. Iterate until the supports no longer change.
K-SVD algorithm
Algorithm 1: K-SVD
repeat
  for $i = 1$ to $n$: find $\alpha_i$ using a greedy algorithm (e.g. OMP)
  for $j = 1$ to $K$:
    $I_j \leftarrow \{i \mid \alpha_{ji} \ne 0\}$
    $E^{(j)} \leftarrow X_{:, I_j} - \sum_{k \ne j} d^{(k)} \alpha_{k, I_j}$
    solve (using the Lanczos algorithm) $(d^{(j)}, \alpha_{j, I_j}) \leftarrow \arg\min_{d, \alpha} \|E^{(j)} - d\alpha^\top\|^2$ s.t. $\|d\|_2 = 1$
until none of the $I_j$ change anymore.
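A NumPy sketch of the inner atom update; it uses a full SVD for simplicity where the slide suggests Lanczos for the leading singular pair (names are ours):

```python
import numpy as np

def ksvd_atom_update(X, D, A, j):
    """One K-SVD atom update: rank-1 SVD of the restricted residual.

    X: (p, n) signals, D: (p, K) dictionary, A: (K, n) sparse codes.
    Updates column j of D and row j of A in place, restricted to the
    signals that currently use atom j (so supports are preserved).
    """
    I = np.nonzero(A[j, :])[0]
    if I.size == 0:
        return
    # residual without atom j, on the signals that use it
    E = X[:, I] - D @ A[:, I] + np.outer(D[:, j], A[j, I])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]             # best unit-norm atom
    A[j, I] = s[0] * Vt[0, :]     # matching coefficients
```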
K-SVD: heuristics
- Replace atoms that are rarely used by the least well explained datapoints in the dataset.
- Remove atoms that are too correlated with other existing atoms and replace them by the least well explained datapoints.
$\ell_1$ formulation for Dictionary Learning
$$\min_{\substack{A \in \mathbb{R}^{k \times n} \\ D \in \mathbb{R}^{p \times k}}} \sum_{i=1}^n \Big(\frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1\Big) \quad\text{s.t.}\quad \forall j,\ \|d_j\|_2 \le 1.$$
In both cases (sparse PCA and dictionary learning), there is no orthogonality constraint.
Not jointly convex, but convex in each of $D$ and $\alpha$ separately. Classical optimization alternates between $D$ and $\alpha$.
Block-coordinate descent for DL (Lee et al., 2007; Witten et al., 2009)
$$\min_{U, V} \|X - UV^\top\|_F^2 + \lambda\sum_j \|v_j\|_1 \quad\text{s.t.}\quad \|u_j\|_2 \le 1$$
Denote $X^{(j)} = X - \sum_{j' \ne j} u_{j'} v_{j'}^\top$.
Minimizing w.r.t. $u_j$: $\min_{u_j} \|X^{(j)} - u_j v_j^\top\|_F^2$ s.t. $\|u_j\|_2 \le 1$, solved by $u_j \leftarrow \frac{X^{(j)} v_j}{\|X^{(j)} v_j\|}$.
Minimizing w.r.t. $v_j$: $\min_{v_j} \|X^{(j)} - u_j v_j^\top\|_F^2 + \lambda\|v_j\|_1$, solved by soft-thresholding (one sweep is sketched below).
Requires no matrix inversion, can take advantage of efficient algorithms for the Lasso, and can use warm starts and active sets.
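A minimal NumPy sketch of one BCD sweep; the $v_j$ update reduces to coordinate-wise soft-thresholding at level $\lambda/2$ because $\|u_j\|_2 = 1$ after its update (names are ours):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dl_bcd_step(X, U, V, lam):
    """One sweep of block-coordinate descent for
    min ||X - U V^T||_F^2 + lam * sum_j ||v_j||_1  s.t. ||u_j||_2 <= 1.
    """
    for j in range(U.shape[1]):
        Xj = X - U @ V.T + np.outer(U[:, j], V[:, j])  # residual w/o pair j
        uj = Xj @ V[:, j]
        if np.linalg.norm(uj) > 0:
            U[:, j] = uj / np.linalg.norm(uj)          # projection on the ball
        # closed-form Lasso step, valid since ||u_j||_2 = 1
        V[:, j] = soft_threshold(Xj.T @ U[:, j], lam / 2)
    return U, V
```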
Dictionary learning for image denoising
Extract all overlapping $8 \times 8$ patches $x_i \in \mathbb{R}^{64}$ and form the matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{64 \times n}$.
Solve the matrix factorization problem
$$\min_{\substack{A \in \mathbb{R}^{k \times n} \\ D \in \mathbb{R}^{64 \times k}}} \sum_{i=1}^n \Big(\frac{1}{2}\|x_i - D\alpha_i\|^2 + \lambda\|\alpha_i\|_1\Big) \quad\text{s.t.}\quad \forall j,\ \|d_j\|_2 \le 1,$$
where the $\alpha_i$ are sparse and $D$ is the dictionary.
Each patch is decomposed as $x_i = D\alpha_i$; average the reconstructions $D\alpha_i$ of the patches $x_i$ to reconstruct a full-sized image.
The number of patches $n$ is large (= the number of pixels), so use stochastic optimization / online learning (Mairal et al., 2009a):
- can handle potentially infinite datasets;
- can adapt to dynamic training sets.
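A sketch of this pipeline using scikit-learn's patch utilities; `sparse_code` is a hypothetical placeholder for any Lasso solver returning codes $A$ with $X \approx DA$, and the per-patch centering is a common practical detail assumed here, not stated on the slide:

```python
import numpy as np
from sklearn.feature_extraction.image import (extract_patches_2d,
                                              reconstruct_from_patches_2d)

def denoise(image, D, sparse_code, lam):
    """Patch-based denoising sketch: extract 8x8 patches, sparse-code
    them on dictionary D, then average the overlapping reconstructions."""
    patches = extract_patches_2d(image, (8, 8))
    X = patches.reshape(len(patches), -1).T       # (64, n) patch matrix
    mean = X.mean(axis=0)                         # center each patch
    A = sparse_code(D, X - mean, lam)             # placeholder Lasso solver
    X_hat = D @ A + mean
    patches_hat = X_hat.T.reshape(-1, 8, 8)
    return reconstruct_from_patches_2d(patches_hat, image.shape)
```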
Denoising result (Mairal et al., 2009b)
Dictionary Learning for Image Inpainting
Example from Mairal et al. (2008)
What does the dictionary V look like?
Hierarchical sparsity and some applications
Joint work with Rodolphe Jenatton, Julien Mairal and Francis Bach
Hierarchical Norms (Zhao et al., 2009; Bach, 2008; Jenatton, Mairal, Obozinski and Bach, 2010a)
A dictionary element is selected only after its ancestors. The structure is on the codes $\alpha$ (not on the individual dictionary elements $d_i$).
Hierarchical penalization: $\Omega(\alpha) = \sum_{g \in \mathcal{G}} \|\alpha_g\|_2$, where the groups $g \in \mathcal{G}$ are the sets of descendants of the nodes of a tree.
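A small sketch computing $\Omega(\alpha)$ on a toy tree, with one group per node equal to its set of descendants (the tree encoding is ours):

```python
import numpy as np

def hierarchical_norm(alpha, children):
    """Omega(alpha) = sum over nodes of ||alpha_{descendants(node)}||_2.

    children[i] lists the children of node i; node i indexes coefficient
    alpha[i], and each group contains a node together with its subtree.
    """
    def descendants(i):
        out = [i]
        for c in children[i]:
            out.extend(descendants(c))
        return out
    return sum(np.linalg.norm(alpha[descendants(i)])
               for i in range(len(children)))

# Toy tree: root 0 with children 1 and 2; node 1 has child 3.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
alpha = np.array([0.0, 1.0, 0.0, 2.0])
print(hierarchical_norm(alpha, children))   # 2*sqrt(5) + 2
```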
Hierarchical dictionary for image patches
References I
Aharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322.
Bach, F. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106.
Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. Technical Report 0812.1869, ArXiv.
Elad, M. and Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736-3745.
Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS).
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2009a). Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML).
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. (2009b). Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV).
Mairal, J., Sapiro, G., and Elad, M. (2008). Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214-241.
References II
Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534.
Yuan, M. and Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143-161.
Zhao, P., Rocha, G., and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468-3497.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301-320.