Parcimonie en apprentissage statistique (Sparsity in statistical learning)
Sparsity in statistical learning (Parcimonie en apprentissage statistique). Guillaume Obozinski, École des Ponts ParisTech. Journée Parcimonie, Fédération Charles Hermite, 23 June 2014. Parcimonie en apprentissage 1/44
Classical supervised learning setup (ERM). Data: $(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)$; $f \in \mathcal{H}$ is the function to learn. Loss function $\ell : (y, a) \mapsto \ell(y, a)$, e.g. $\ell(y, a) = \frac{1}{2}(y - a)^2$, the logistic loss, the hinge loss, etc. Empirical risk minimization:
$\min_{f \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)}_{\text{empirical risk}} + \underbrace{\lambda \|f\|_{\mathcal{H}}^2}_{\text{regularization}}$
$\mathcal{H}$ is typically an RKHS, and $\lambda$ is the regularization coefficient: it controls the complexity of the function that we are willing to learn for a given amount of data.
Learning linear functions. Restricting to linear functions $f_w : x \mapsto w^\top x$:
$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$
For the square loss, this is ridge regression. Issue: the number of features $p$ is typically large compared to the amount of data. Sparsity provides an alternative to $\ell_2$ regularization: reducing the number of features entering the model yields another way to control model complexity, more interpretable models (very important in biomedical applications), and computationally efficient algorithms.
A sparse signal. $y \in \mathbb{R}^n$ is the signal, $X \in \mathbb{R}^{n \times p}$ is some overcomplete basis, and $w$ is the sparse representation of the signal: find $w$ sparse such that $y = Xw$. Classical signal processing formulation: $\min_w \|w\|_0$ s.t. $y = Xw$. Problem: there is noise, and noise is not sparse. Noisy formulations: $\min_w \|w\|_0$ s.t. $\|y - Xw\|_2 \le \epsilon$; $\min_w \|y - Xw\|_2^2$ s.t. $\|w\|_0 \le k$; $\min_w \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_0$. These problems are NP-hard.
Approaches. Greedy methods: Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), least-squares OMP, CoSaMP. Relaxation methods: Lasso/Basis Pursuit, Dantzig selector. Bayesian methods: spike-and-slab priors, automatic relevance determination (ARD), empirical Bayes.
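As an illustration of the greedy family, here is a minimal sketch of Orthogonal Matching Pursuit (not the talk's code): pick the column most correlated with the residual, then refit by least squares on the selected support.

```python
import numpy as np

def omp(X, y, k):
    """Minimal Orthogonal Matching Pursuit: greedily select the column of X
    most correlated with the current residual, then refit by least squares
    on the selected support (the 'orthogonal' step)."""
    n, p = X.shape
    support = []
    w = np.zeros(p)
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit on the current support: orthogonal projection of y.
        w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        w = np.zeros(p)
        w[support] = w_s
        residual = y - X @ w
    return w

# Noiseless demo: recover a 2-sparse vector from a random Gaussian design.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10); w_true[[2, 7]] = [1.5, -2.0]
w_hat = omp(X, X @ w_true, k=2)
```

In the noiseless, well-conditioned regime above, OMP recovers the exact support and coefficients in $k$ steps.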
A convex relaxation... Empirical risk: for $w \in \mathbb{R}^p$, $L(w) = \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^\top w)^2$. Support of the model: $\mathrm{Supp}(w) = \{i \mid w_i \neq 0\}$, with $|\mathrm{Supp}(w)| = \sum_i 1_{\{w_i \neq 0\}} = \|w\|_0$. Penalization for variable selection: $\min_{w \in \mathbb{R}^p} L(w) + \lambda \|w\|_0$. Lasso (convex relaxation): $\min_{w \in \mathbb{R}^p} L(w) + \lambda \|w\|_1$.
Related formulations through convex programs. Basis pursuit: $\min_w \|w\|_1$ s.t. $y = Xw$. Basis pursuit (noisy setting): $\min_w \|w\|_1$ s.t. $\|y - Xw\|_2 \le \eta$. Dantzig selector: $\min_w \|w\|_1$ s.t. $\|X^\top (y - Xw)\|_\infty \le \lambda$. Remarks: the minima are not necessarily unique; the Dantzig selector is a linear program; the optimality conditions for the Lasso require $\|X^\top (y - Xw)\|_\infty \le \lambda$.
Optimization algorithms. Generic approaches: interior point methods (for the Lasso), subgradient descent. Efficient first-order methods: coordinate descent methods, proximal methods, reweighted $\ell_2$ methods (especially for structured sparse models). Active set methods: for the Lasso, the LARS algorithm; in general, meta-algorithms to combine with the methods above; see e.g. Bach et al. (2012).
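A minimal sketch of one of these first-order methods, proximal gradient descent (ISTA) for the Lasso $\min_w \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$ (illustrative, not the talk's implementation): a gradient step on the smooth part followed by the soft-thresholding proximal operator of the $\ell_1$ norm.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w 1/(2n) ||y - Xw||^2 + lam ||w||_1."""
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 1.5]
y = X @ w_true + 0.01 * rng.standard_normal(100)
w_hat = ista(X, y, lam=0.1)
```

Soft-thresholding produces exact zeros, which is how the iterates become sparse rather than merely small.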
Software: the SPAMS toolbox. Toolbox developed by Julien Mairal, in C++, interfaced with Matlab, R and Python. Proximal gradient methods for $\ell_0$, $\ell_1$, elastic net, fused lasso, group lasso and more, for the square, logistic and multi-class logistic loss functions; handles sparse matrices; fast implementations of OMP and LARS; dictionary learning and matrix factorization (NMF, sparse PCA).
Correlation, stability and the elastic net. In the presence of strong correlations between the variables, the Lasso chooses arbitrarily among them: stability issues. Elastic net (Zou and Hastie, 2005):
$\min_w \frac{1}{2n} \|y - Xw\|_2^2 + \lambda \|w\|_1 + \mu \|w\|_2^2$
It makes the optimization problem strongly convex: there is always a unique solution, and many algorithms converge faster. It selects correlated variables together. Two intuitions/views: (i) decorrelation, since the problem rewrites as $\min_w \frac{1}{2n} \big( w^\top (X^\top X + 2n\mu I) w - 2 y^\top X w + y^\top y \big) + \lambda \|w\|_1$; (ii) smoothing: for a pair of very correlated variables $x_1, x_2$, it encourages $w_1 \approx w_2$. Better behavior with heavily correlated variables.
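The elastic-net penalty has a simple proximal operator, which is one reason the strongly convex formulation is algorithm-friendly: soft-threshold (the $\ell_1$ part), then shrink (the $\ell_2$ part). A small sketch with a numerical check against a grid search (illustrative, not from the talk):

```python
import numpy as np

def prox_elastic_net(v, t, lam, mu):
    """Proximal operator of t * (lam * ||.||_1 + mu * ||.||_2^2):
    coordinate-wise soft-thresholding at t*lam, then shrinkage by 1/(1 + 2*t*mu)."""
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0) / (1.0 + 2.0 * t * mu)

# Demo on a few scalar coordinates.
w = prox_elastic_net(np.array([1.3, -1.3, 0.1]), t=0.5, lam=0.4, mu=0.7)
```

The closed form follows from the scalar optimality condition of $\frac{1}{2}(w - v)^2 + t\lambda|w| + t\mu w^2$.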
Comparing the Lasso and other strategies for linear regression. Compare: ridge regression, $\min_{w \in \mathbb{R}^p} \frac{1}{2} \|y - Xw\|_2^2 + \frac{\lambda}{2} \|w\|_2^2$; Lasso, $\min_{w \in \mathbb{R}^p} \frac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_1$; OMP/forward selection, which targets $\min_{w \in \mathbb{R}^p} \frac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_0$. Each method builds a path of solutions from $0$ to the ordinary least-squares solution; regularization parameters are selected on the test set.
Simulation results. i.i.d. Gaussian design matrix, $k = 4$, $n = 64$, $p \in [2, 256]$, SNR = 1. Note the stability to non-sparsity and the variability. [Figure: mean square error vs. $\log_2(p)$ for $\ell_1$, $\ell_2$, greedy and oracle methods, in a sparse setting and a rotated (non-sparse) setting.]
Advantages and drawbacks of $\ell_1$ vs. $\ell_0$ penalization. Advantages: the solution $\alpha(x)$ is a continuous (differentiable on the support) function of the data $x$; the $\ell_1$ norm is more robust to violations of the sparsity assumption; it controls the influence of spuriously introduced variables (like $\ell_0 + \ell_2$); the convex formulation leads to principled algorithms that generalize well to new situations, and to natural theoretical analyses. Drawbacks: it introduces an estimation bias, which leads to the selection of too many variables if ignored; some of the $\ell_0$ algorithms are simpler.
Group sparse models
From $\ell_1$ regularization...
$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y^{(i)} - w^\top x^{(i)})^2 + \lambda \|w\|_1$ with $\|w\|_1 = \sum_{j=1}^p |w_j|$.
...to penalization with grouped variables. Assume that $\{1, \dots, p\}$ is partitioned into $m$ groups $G_1, \dots, G_m$, and write $w = (w_{G_1}, \dots, w_{G_m})$ and $x = (x_{G_1}, \dots, x_{G_m})$:
$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(w^\top x^{(i)}, y^{(i)}) + \lambda \sum_{j=1}^m \|w_{G_j}\|_2$
Group Lasso (Yuan and Lin, 2007):
$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y^{(i)} - w^\top x^{(i)})^2 + \lambda \sum_{j=1}^m \|w_{G_j}\|_2$
The $\ell_1/\ell_2$ norm: $\Omega(w) := \sum_{G \in \mathcal{G}} \|w_G\|_2$. Unit ball in $\mathbb{R}^3$: $\|(w_1, w_2)\|_2 + |w_3| \le 1$. Some entire groups are set to 0, but there are no zeros within selected groups.
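The "entire groups set to 0" behaviour can be seen in the proximal operator of the $\ell_1/\ell_2$ norm, which is block soft-thresholding. A minimal sketch (illustrative, not the talk's code):

```python
import numpy as np

def prox_group(v, t):
    """Proximal operator of t * ||.||_2 on one group: shrink the whole block
    toward 0, and zero it out entirely when its norm is below t."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def prox_group_lasso(w, groups, t):
    """Apply the block prox independently to each group of a partition."""
    out = w.copy()
    for g in groups:
        out[g] = prox_group(w[g], t)
    return out

w = np.array([3.0, 4.0, 0.3, 0.4])
out = prox_group_lasso(w, groups=[[0, 1], [2, 3]], t=1.0)
```

The first block (norm 5) is shrunk but kept; the second block (norm 0.5) is removed entirely: no zeros appear within a surviving group.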
$\ell_1/\ell_q$ regularization. One can also consider the $\ell_1/\ell_\infty$ norm, which has more non-differentiabilities. Applications: groups of nominal variables (dummy binary variables); learning sums of polynomial functions $f(x) = f_1(x_1) + \dots + f_p(x_p)$:
$\min_w \frac{1}{n} \sum_i \big( y^{(i)} - \sum_{j,k} w_{jk} (x_j^{(i)})^k \big)^2 + \lambda \sum_{j=1}^p \|(w_{j1}, \dots, w_{jK})\|_2$
where $j$ indexes variables, $i$ observations, and $k$ the degree of the monomial.
Algorithms for $\ell_1/\ell_2$ regularization:
$\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(w^\top x^{(i)}, y^{(i)})}_{f(w)} + \lambda \underbrace{\sum_{j=1}^m \|w_{G_j}\|_2}_{\Omega(w)}$
Reweighted $\ell_2$ algorithms, proximal methods, blockwise coordinate descent.
Sparsity in function space and multiple kernel learning
Introducing a feature map. A feature map $\phi : x \mapsto \phi(x)$ maps the input data to a richer, possibly more explicit feature space, typically high-dimensional or even infinite-dimensional:
$\min_w \sum_{i=1}^n \ell(w^\top \phi(x_i), y_i) + \frac{\lambda}{2} \|w\|_2^2$
Changing the dot product. Let $x = (x_1, x_2) \in \mathbb{R}^2$ and $\phi(x) = (x_1, x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2)$. Then
$\langle \phi(x), \phi(y) \rangle = x_1 y_1 + x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = x_1 y_1 + x_2 y_2 + (x_1 y_1)^2 + (x_2 y_2)^2 + 2 (x_1 y_1)(x_2 y_2) = \langle x, y \rangle + \langle x, y \rangle^2$
For $w = (0, 0, 1, 1, 0)$, $w^\top \phi(x) - 1 \ge 0 \Leftrightarrow x_1^2 + x_2^2 \ge 1$: linear separators in $\mathbb{R}^5$ correspond to conic separators in $\mathbb{R}^2$. More generally, let $x = (x_1, \dots, x_p) \in \mathbb{R}^p$ and $\phi(x) = (x_1, \dots, x_p, x_1^2, \dots, x_p^2, \sqrt{2} x_1 x_2, \dots, \sqrt{2} x_i x_j, \dots, \sqrt{2} x_{p-1} x_p)$. We still have $\langle \phi(x), \phi(y) \rangle = \langle x, y \rangle + \langle x, y \rangle^2$, but the explicit mapping is too expensive to compute: $\phi(x) \in \mathbb{R}^{p + p(p+1)/2}$.
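The identity $\langle \phi(x), \phi(y) \rangle = \langle x, y \rangle + \langle x, y \rangle^2$ is easy to check numerically for the 2-D example above:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the 2-D example: (x1, x2, x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def k_poly(x, y):
    # Same quantity computed from the kernel alone, without the explicit map.
    s = float(np.dot(x, y))
    return s + s ** 2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
```

The $\sqrt{2}$ factor on the cross term is what makes the expansion of $\langle x, y \rangle^2$ come out exactly.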
Duality for regularized empirical risk minimization. Define $\psi_i : u \mapsto \ell(u, y_i)$. Let $\Omega$ be a norm and consider
$\min_w \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \Omega(w)^2$
$= \min_{u,w} \frac{1}{n} \sum_i \psi_i(u_i) + \frac{\lambda}{2} \Omega(w)^2$ s.t. $\forall i,\ u_i = w^\top x_i$, i.e. $u = Xw$
$= \min_{u,w} \max_\alpha \frac{1}{n} \sum_i \psi_i(u_i) + \frac{\lambda}{2} \Omega(w)^2 - \lambda \alpha^\top (u - Xw)$
$= \max_\alpha \min_{u,w} \frac{1}{n} \sum_i \big[ \psi_i(u_i) - (n \lambda \alpha_i) u_i \big] + \big[ \frac{\lambda}{2} \Omega(w)^2 + \lambda w^\top (X^\top \alpha) \big]$
$= \max_\alpha -\frac{1}{n} \sum_i \psi_i^*(n \lambda \alpha_i) - \frac{\lambda}{2} \Omega^*(X^\top \alpha)^2$
Representer property and kernelized version. Consider the special case $\Omega = \|\cdot\|_2$:
$\min_w \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2 = \max_\alpha -\frac{1}{n} \sum_i \psi_i^*(n \lambda \alpha_i) - \frac{\lambda}{2} \|X^\top \alpha\|_2^2 = \max_\alpha -\frac{1}{n} \sum_i \psi_i^*(n \lambda \alpha_i) - \frac{\lambda}{2} \alpha^\top K \alpha$ with $K = XX^\top$,
with the relation between optimal solutions $w = X^\top \alpha = \sum_i \alpha_i x_i$. So if we replace $x_i$ with $\phi(x_i)$, we have $w = \sum_{i=1}^n \alpha_i \phi(x_i)$, and $f(x) = \langle w, \phi(x) \rangle = \sum_i \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_i \alpha_i K(x_i, x)$.
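For the square loss this representer property can be checked directly: with the scaling used here, $\min_w \frac{1}{n}\|y - Xw\|_2^2 + \lambda\|w\|_2^2$ has primal solution $w = (X^\top X + n\lambda I)^{-1} X^\top y$ and dual solution $\alpha = (K + n\lambda I)^{-1} y$ with $K = XX^\top$, and the two agree through $w = X^\top \alpha$. A sketch (the $n\lambda$ factor is tied to this particular $1/n$ loss scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Primal ridge: min_w 1/n ||y - Xw||^2 + lam ||w||^2.
w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Kernelized version with the linear kernel K = X X^T:
# alpha = (K + n lam I)^{-1} y, and w = X^T alpha = sum_i alpha_i x_i.
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
w_dual = X.T @ alpha
```

With a nonlinear kernel the primal vector is never formed; predictions use $f(x) = \sum_i \alpha_i K(x_i, x)$ directly.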
Regularization for multiple features. Map $x$ to features $\Phi_1(x), \dots, \Phi_p(x)$ with weights $w_1, \dots, w_p$, and predict with $w_1^\top \Phi_1(x) + \dots + w_p^\top \Phi_p(x)$. Concatenating feature spaces is equivalent to summing kernels: penalizing $\sum_{j=1}^p \|w_j\|_2^2$ corresponds to $K = \sum_{j=1}^p K_j$.
General kernel learning (Lanckriet et al., 2004; Bach et al., 2005; Micchelli and Pontil, 2005).
$G(K) = \min_{w \in \mathcal{F}} \sum_{i=1}^n \ell(y_i, w^\top \Phi(x_i)) + \frac{\lambda}{2} \|w\|_2^2 = \max_{\alpha \in \mathbb{R}^n} -\sum_i \ell_i^*(\lambda \alpha_i) - \frac{\lambda}{2} \alpha^\top K \alpha$
is a convex function of the kernel matrix $K$. Learning a convex combination of kernels: given $K_1, \dots, K_p$, consider
$\min_\eta G\big( \sum_j \eta_j K_j \big)$ s.t. $\sum_j \eta_j = 1,\ \eta_j \ge 0$.
The simplex constraints, like the $\ell_1$ norm, induce sparsity.
Multiple kernel learning and $\ell_1/\ell_2$. Block $\ell_1$-norm problem:
$\min_{w_1, \dots, w_p} \sum_i \ell(y_i, w_1^\top \Phi_1(x_i) + \dots + w_p^\top \Phi_p(x_i)) + \frac{\lambda}{2} (\|w_1\|_2 + \dots + \|w_p\|_2)^2$
Proposition: block $\ell_1$-norm regularization is equivalent to minimizing with respect to $\eta$ the optimal value $G(\sum_{j=1}^p \eta_j K_j)$; the (sparse) weights $\eta$ are obtained from the optimality conditions, and the dual parameters $\alpha$ are optimal for $K = \sum_{j=1}^p \eta_j K_j$. This gives a single optimization problem for learning both $\eta$ and $\alpha$.
Proof of equivalence:
$\min_{w_1, \dots, w_p} \sum_i \ell\big( y_i, \sum_{j=1}^p w_j^\top \Phi_j(x_i) \big) + \frac{\lambda}{2} \big( \sum_{j=1}^p \|w_j\|_2 \big)^2$
$= \min_{w_1, \dots, w_p}\ \min_{\eta \ge 0,\ \sum_j \eta_j = 1} \sum_i \ell\big( y_i, \sum_{j=1}^p w_j^\top \Phi_j(x_i) \big) + \frac{\lambda}{2} \sum_{j=1}^p \|w_j\|_2^2 / \eta_j$
$= \min_{\eta \ge 0,\ \sum_j \eta_j = 1}\ \min_{\tilde{w}_1, \dots, \tilde{w}_p} \sum_i \ell\big( y_i, \sum_{j=1}^p \eta_j^{1/2} \tilde{w}_j^\top \Phi_j(x_i) \big) + \frac{\lambda}{2} \sum_{j=1}^p \|\tilde{w}_j\|_2^2$ with $\tilde{w}_j = w_j \eta_j^{-1/2}$
$= \min_{\eta \ge 0,\ \sum_j \eta_j = 1}\ \min_w \sum_i \ell\big( y_i, w^\top \Psi_\eta(x_i) \big) + \frac{\lambda}{2} \|w\|_2^2$ with $\Psi_\eta(x) = (\eta_1^{1/2} \Phi_1(x), \dots, \eta_p^{1/2} \Phi_p(x))$
We have $\Psi_\eta(x)^\top \Psi_\eta(x') = \sum_{j=1}^p \eta_j K_j(x, x')$ with $\sum_{j=1}^p \eta_j = 1$.
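The first step of the proof rests on the variational identity $\big(\sum_j \|w_j\|\big)^2 = \min_{\eta \in \Delta} \sum_j \|w_j\|^2 / \eta_j$, attained at $\eta_j = \|w_j\| / \sum_k \|w_k\|$. It is easy to verify numerically:

```python
import numpy as np

# Numerical check of the variational identity used in the proof:
# (sum_j ||w_j||)^2 = min_{eta in simplex} sum_j ||w_j||^2 / eta_j,
# with the minimum attained at eta_j = ||w_j|| / sum_k ||w_k||.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal(3) for _ in range(4)]
norms = np.array([np.linalg.norm(b) for b in blocks])

lhs = norms.sum() ** 2
eta_star = norms / norms.sum()
attained = np.sum(norms ** 2 / eta_star)

# Any other eta on the simplex should do no better (Cauchy-Schwarz).
eta_rand = rng.random(4); eta_rand /= eta_rand.sum()
other = np.sum(norms ** 2 / eta_rand)
```

The identity is what converts the non-smooth squared block $\ell_1$ norm into a jointly tractable problem over $(w, \eta)$.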
Dictionary Learning
Dictionary learning for image denoising (Elad and Aharon, 2006): $\underbrace{x}_{\text{measurements}} = \underbrace{x_0}_{\text{original image}} + \underbrace{\varepsilon}_{\text{noise}}$
Sparse PCA / dictionary learning. In both cases, $X = D\alpha$. Sparse PCA: e.g. microarray data, with a sparse dictionary (Witten et al., 2009; Bach et al., 2008). Dictionary learning: e.g. overcomplete dictionaries for natural images, with a sparse decomposition (Elad and Aharon, 2006).
K-SVD (Aharon et al., 2006). Formulation:
$\min_{D, \alpha} \frac{1}{2} \sum_i \|x_i - D \alpha_i\|_2^2$ s.t. $\|\alpha_i\|_0 \le k$
Idea: (1) decompose the signals on the current dictionary using a greedy algorithm; (2) fix the obtained supports; (3) for each $j$ in a sequence, update the $j$th atom and the decomposition coefficients on this atom to optimize the fit on the set of signals that use this atom; (4) iterate until the supports no longer change.
K-SVD algorithm:
repeat
  for $i = 1$ to $n$: find $\alpha_i$ using a greedy algorithm (e.g. OMP)
  for $j = 1$ to $K$:
    $I_j \leftarrow \{i \mid \alpha_{ij} \neq 0\}$
    $E^{(j)} \leftarrow X_{:, I_j} - \sum_{k \neq j} d^{(k)} \alpha_{k, I_j}$
    solve (using the Lanczos algorithm) $(d^{(j)}, \alpha_{j, I_j}) \leftarrow \arg\min_{d, \alpha} \|E^{(j)} - d \alpha\|_F^2$ s.t. $\|d\|_2 = 1$
until none of the $I_j$ change anymore.
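The inner atom update is a best rank-1 approximation of the residual matrix $E^{(j)}$, given by its leading singular pair (by the Eckart–Young theorem). A dense-SVD sketch of that step (K-SVD would use a Lanczos-type method for the leading pair only):

```python
import numpy as np

def ksvd_atom_update(E):
    """Best rank-1 fit: min_{d, a} ||E - d a^T||_F^2 s.t. ||d||_2 = 1,
    solved by the leading singular pair of the residual matrix E."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    d = U[:, 0]        # updated atom, unit norm by construction
    a = s[0] * Vt[0]   # updated coefficients of the signals on this atom
    return d, a

rng = np.random.default_rng(0)
E = rng.standard_normal((8, 12))
d, a = ksvd_atom_update(E)
```

No other unit-norm atom (with its own optimal coefficients) can achieve a smaller residual.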
K-SVD heuristics: replace atoms that are rarely used by the least well explained datapoints in the dataset; remove atoms that are too correlated with other existing atoms, and replace them by the least well explained datapoints.
$\ell_1$ formulation for dictionary learning:
$\min_{A \in \mathbb{R}^{k \times n},\ D \in \mathbb{R}^{p \times k}} \sum_i \big( \frac{1}{2} \|x_i - D \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \big)$ s.t. $\forall j,\ \|d_j\|_2 \le 1$
In both cases, no orthogonality is imposed. The problem is not jointly convex, but it is convex in each $d_j$ and each $\alpha_i$ separately; classical optimization alternates between $D$ and $\alpha$.
Block-coordinate descent for DL (Lee et al., 2007; Witten et al., 2009):
$\min_{U, V} \|X - UV^\top\|_F^2 + \lambda \sum_j \|v_j\|_1$ s.t. $\|u_j\|_2 \le 1$
Denote $X^{(j)} = X - \sum_{j' \neq j} u_{j'} v_{j'}^\top$. Minimizing w.r.t. $u_j$, $\min_{u_j} \|X^{(j)} - u_j v_j^\top\|_F^2$ s.t. $\|u_j\|_2 \le 1$, is solved by $u_j \leftarrow X^{(j)} v_j / \|X^{(j)} v_j\|_2$. Minimizing w.r.t. $v_j$, $\min_{v_j} \|X^{(j)} - u_j v_j^\top\|_F^2 + \lambda \|v_j\|_1$, amounts to soft-thresholding: it requires no matrix inversion, can take advantage of efficient algorithms for the Lasso, and can use warm starts and active sets.
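One sweep of these block updates can be sketched as follows (an illustrative toy implementation, not the authors' code); since the block minimizations are exact, each sweep can only decrease the objective:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dl_objective(X, U, V, lam):
    return np.linalg.norm(X - U @ V.T) ** 2 + lam * np.abs(V).sum()

def bcd_pass(X, U, V, lam):
    """One block-coordinate sweep over the columns of U (atoms) and V (codes)
    for min ||X - U V^T||_F^2 + lam sum_j ||v_j||_1 s.t. ||u_j||_2 <= 1."""
    for j in range(U.shape[1]):
        # Residual with atom j taken out: X^(j) = X - sum_{j' != j} u_j' v_j'^T.
        Xj = X - U @ V.T + np.outer(U[:, j], V[:, j])
        uj = Xj @ V[:, j]
        nrm = np.linalg.norm(uj)
        if nrm > 0:
            U[:, j] = uj / nrm  # u_j <- X^(j) v_j / ||X^(j) v_j||_2
        # v_j update: with ||u_j||_2 = 1 each coordinate is a scalar quadratic
        # plus lam |v_ij|, solved by soft-thresholding at lam / 2.
        V[:, j] = soft_threshold(Xj.T @ U[:, j], lam / 2.0)
    return U, V

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
U = rng.standard_normal((10, 4)); U /= np.linalg.norm(U, axis=0)
V = rng.standard_normal((30, 4))
before = dl_objective(X, U, V, 0.5)
U, V = bcd_pass(X, U, V, 0.5)
after = dl_objective(X, U, V, 0.5)
```

The `lam / 2` threshold comes from expanding $\|X^{(j)} - u_j v_j^\top\|_F^2$ coordinate-wise in $v_j$ with $\|u_j\|_2 = 1$.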
Dictionary learning for image denoising. Extract all overlapping $8 \times 8$ patches $x_i \in \mathbb{R}^{64}$ and form the matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{64 \times n}$. Solve the matrix factorization problem
$\min_{A \in \mathbb{R}^{k \times n},\ D \in \mathbb{R}^{p \times k}} \sum_i \big( \frac{1}{2} \|x_i - D \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \big)$ s.t. $\forall j,\ \|d_j\|_2 \le 1$
where the $\alpha_i$ are sparse and $D$ is the dictionary. Each patch is decomposed as $x_i \approx D \alpha_i$, and the reconstructions $D \alpha_i$ of the overlapping patches are averaged to reconstruct a full-sized image $y$. The number of patches $n$ is large (roughly the number of pixels), so one uses stochastic optimization/online learning (Mairal et al., 2009a), which can handle potentially infinite datasets and adapt to dynamic training sets.
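The patch-extraction and overlap-averaging machinery around the factorization can be sketched directly (illustrative code; averaging the *unmodified* patches reconstructs the image exactly, which is a useful sanity check before plugging in denoised patches $D\alpha_i$):

```python
import numpy as np

def extract_patches(img, s=8):
    """All overlapping s x s patches of a 2-D image, as columns of a (s*s, n) matrix."""
    H, W = img.shape
    cols = [img[i:i + s, j:j + s].ravel()
            for i in range(H - s + 1) for j in range(W - s + 1)]
    return np.stack(cols, axis=1)

def average_patches(P, shape, s=8):
    """Put (possibly processed) patches back in place and average the overlaps:
    the reconstruction step used after denoising each patch."""
    H, W = shape
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    idx = 0
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            acc[i:i + s, j:j + s] += P[:, idx].reshape(s, s)
            cnt[i:i + s, j:j + s] += 1
            idx += 1
    return acc / cnt

rng = np.random.default_rng(0)
img = rng.random((16, 16))
P = extract_patches(img)      # 9 * 9 = 81 overlapping patches of dimension 64
rec = average_patches(P, img.shape)
```

In the denoising pipeline, each column of `P` would be replaced by its sparse reconstruction $D\alpha_i$ before averaging.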
Denoising result (Mairal et al., 2009b). [Figure.]
Dictionary learning for image inpainting: example from Mairal et al. (2008). [Figure.]
What does the dictionary V look like? [Figure.]
Hierarchical sparsity and some applications. Joint work with Rodolphe Jenatton, Julien Mairal and Francis Bach.
Hierarchical norms (Zhao et al., 2009; Bach, 2008; Jenatton, Mairal, Obozinski and Bach, 2010a). A dictionary element is selected only after its ancestors. The structure is on the codes $\alpha$ (not on the individual dictionary elements $d_i$). Hierarchical penalization: $\Omega(\alpha) = \sum_{g \in \mathcal{G}} \|\alpha_g\|_2$, where the groups $g \in \mathcal{G}$ are the sets of descendants of the nodes of a tree.
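A tiny sketch of evaluating this hierarchical norm on a toy tree (illustrative, with a hypothetical child-list representation): each group collects a node and all its descendants, so a nonzero leaf contributes to the group of every one of its ancestors, which is what pushes ancestors to be selected first.

```python
import numpy as np

def descendant_groups(children, root=0):
    """Groups of the hierarchical norm: for each node of the tree, the set of
    its descendants (node included). The tree is given as a child list."""
    groups = {}
    def collect(v):
        g = [v]
        for c in children.get(v, []):
            g += collect(c)
        groups[v] = g
        return g
    collect(root)
    return [groups[v] for v in sorted(groups)]

def hierarchical_norm(alpha, groups):
    # Omega(alpha) = sum over groups g of ||alpha_g||_2.
    return sum(np.linalg.norm(alpha[g]) for g in groups)

# Toy tree on 4 nodes: 0 -> {1, 2}, 1 -> {3}.
children = {0: [1, 2], 1: [3]}
groups = descendant_groups(children)
alpha = np.array([0.0, 3.0, 0.0, 4.0])
omega = hierarchical_norm(alpha, groups)
```

Here the nonzero entries at nodes 1 and 3 are penalized through the groups rooted at 0, 1 and 3, but the group rooted at node 2 contributes nothing.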
Hierarchical dictionary for image patches. [Figure.]
References I
Aharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11).
Bach, F. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1).
Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. Technical report, arXiv.
Elad, M. and Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12).
Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS).
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2009a). Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML).
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. (2009b). Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV).
Mairal, J., Sapiro, G., and Elad, M. (2008). Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1).
References II
Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3).
Yuan, M. and Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2).
Zhao, P., Rocha, G., and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A).
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2).
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationOnline Learning for Matrix Factorization and Sparse Coding
Online Learning for Matrix Factorization and Sparse Coding Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro To cite this version: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro. Online
More informationConvergence Rates of Kernel Quadrature Rules
Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction
More informationSCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.
SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression
More informationCOMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017
COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We
More information(k, q)-trace norm for sparse matrix factorization
(k, q)-trace norm for sparse matrix factorization Emile Richard Department of Electrical Engineering Stanford University emileric@stanford.edu Guillaume Obozinski Imagine Ecole des Ponts ParisTech guillaume.obozinski@imagine.enpc.fr
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationTHe linear decomposition of data using a few elements
1 Task-Driven Dictionary Learning Julien Mairal, Francis Bach, and Jean Ponce arxiv:1009.5358v2 [stat.ml] 9 Sep 2013 Abstract Modeling data with linear combinations of a few elements from a learned dictionary
More informationMultiple kernel learning for multiple sources
Multiple kernel learning for multiple sources Francis Bach INRIA - Ecole Normale Supérieure NIPS Workshop - December 2008 Talk outline Multiple sources in computer vision Multiple kernel learning (MKL)
More informationBayesian Grouped Horseshoe Regression with Application to Additive Models
Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationLasso, Ridge, and Elastic Net
Lasso, Ridge, and Elastic Net David Rosenberg New York University February 7, 2017 David Rosenberg (New York University) DS-GA 1003 February 7, 2017 1 / 29 Linearly Dependent Features Linearly Dependent
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationGreedy Dictionary Selection for Sparse Representation
Greedy Dictionary Selection for Sparse Representation Volkan Cevher Rice University volkan@rice.edu Andreas Krause Caltech krausea@caltech.edu Abstract We discuss how to construct a dictionary by selecting
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationAn Efficient Proximal Gradient Method for General Structured Sparse Learning
Journal of Machine Learning Research 11 (2010) Submitted 11/2010; Published An Efficient Proximal Gradient Method for General Structured Sparse Learning Xi Chen Qihang Lin Seyoung Kim Jaime G. Carbonell
More informationSparse Gaussian conditional random fields
Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian
More informationConvex Coding. David M. Bradley, J. Andrew Bagnell CMU-RI-TR May, 2009
Convex Coding David M. Bradley, J. Andrew Bagnell CMU-RI-TR-09-22 May, 2009 Robotics Institute Carnegie Mellon University Pittsburgh, Pennsylvania 15213 c Carnegie Mellon University I Abstract Inspired
More informationAdaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries
Adaptive Compressive Imaging Using Sparse Hierarchical Learned Dictionaries Jarvis Haupt University of Minnesota Department of Electrical and Computer Engineering Supported by Motivation New Agile Sensing
More informationLecture 25: November 27
10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationSpectral k-support Norm Regularization
Spectral k-support Norm Regularization Andrew McDonald Department of Computer Science, UCL (Joint work with Massimiliano Pontil and Dimitris Stamos) 25 March, 2015 1 / 19 Problem: Matrix Completion Goal:
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationOslo Class 4 Early Stopping and Spectral Regularization
RegML2017@SIMULA Oslo Class 4 Early Stopping and Spectral Regularization Lorenzo Rosasco UNIGE-MIT-IIT June 28, 2016 Learning problem Solve min w E(w), E(w) = dρ(x, y)l(w x, y) given (x 1, y 1 ),..., (x
More informationA Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)
A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationPre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models
Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable
More informationThe lasso, persistence, and cross-validation
The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationSparse & Redundant Signal Representation, and its Role in Image Processing
Sparse & Redundant Signal Representation, and its Role in Michael Elad The CS Department The Technion Israel Institute of technology Haifa 3000, Israel Wave 006 Wavelet and Applications Ecole Polytechnique
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Adrien Todeschini Inria Bordeaux JdS 2014, Rennes Aug. 2014 Joint work with François Caron (Univ. Oxford), Marie
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More information