MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco


Learning algorithms so far: ERM + explicit ℓ2 penalty,
min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, w^⊤ x_i) + λ‖w‖².
Implicit regularization by optimization. Regularization with projections/sketching. Nonlinear extensions with features/kernels. What about other norms/penalties?
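For the square loss, this penalized ERM problem has the familiar closed-form solution w = (X^⊤X + nλI)^{−1} X^⊤Ŷ. A minimal NumPy sketch (the function name ridge_erm is illustrative):

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Closed-form minimizer of (1/n)||Xw - y||^2 + lam * ||w||^2."""
    n, d = X.shape
    # Normal equations of the penalized least-squares problem.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
```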

Sparsity: the function of interest depends on only a few building blocks.

Why sparsity? Interpretability. High-dimensional statistics, n ≪ d. Compression.

What is sparsity? f(x) = Σ_{j=1}^d x_j w_j. Sparse coefficients: few w_j ≠ 0.

Sparsity and dictionaries: more generally, consider f(x) = Σ_{j=1}^p φ_j(x) w_j, with φ_1, ..., φ_p a dictionary.

Sparsity and dictionaries (cont.): the concept of sparsity depends on the considered dictionary. If (φ_j)_j and (ψ_j)_j are two dictionaries of linearly independent features such that f(x) = Σ_j φ_j(x) β_j = Σ_j ψ_j(x) b_j, then f determines both β and b uniquely. However, sparsity with respect to (φ_j)_j and (ψ_j)_j can be very different!

Linear: we stick to linear functions for the sake of simplicity, f(x) = Σ_{j=1}^d x_j w_j. Given data, consider the linear system Xw = Ŷ.

Linear systems with sparsity, n ≪ d: there is a solution with s ≪ d nonzero entries in unknown locations.

Best subset selection: solve for all possible subsets of columns. Aka torturing the data until they confess.

Sparse regularization: best subset selection is equivalent to
min_{w ∈ R^d} ‖w‖_0, subject to Xw = Ŷ,
or
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ‖w‖_0,
with the ℓ0-norm ‖w‖_0 = Σ_{j=1}^d 1{w_j ≠ 0}.

Best subset selection:
min_{w ∈ R^d} ‖w‖_0, subject to Xw = Ŷ.
The problem is combinatorially hard. Approximate approaches include: 1. Greedy methods. 2. Convex relaxations.
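To see the combinatorial blow-up concretely: searching for the best support of size s means scanning C(d, s) subsets; for example, with d = 1000 and s = 10 this is about 2.6 × 10^23 candidate supports, each requiring its own least-squares fit.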

Greedy methods: initialize, then select a variable. Compute the solution. Update. Repeat.

Matching pursuit:
r_0 = Ŷ, w_0 = 0, I_0 = ∅.
For i = 1 to T:
  let X_j = X e_j, and select j ∈ {1, ..., d} maximizing a_j = v_j² ‖X_j‖², with v_j = ⟨r_{i−1}, X_j⟩ / ‖X_j‖²;
  I_i = I_{i−1} ∪ {j};
  w_i = w_{i−1} + v_j e_j;
  r_i = r_{i−1} − X_j v_j = Ŷ − X w_i.
Note that v_j = argmin_{v ∈ R} ‖X_j v − r_{i−1}‖², and ‖X_j v_j − r_{i−1}‖² = ‖r_{i−1}‖² − a_j.
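A minimal NumPy sketch of the matching pursuit iteration above (function and variable names are illustrative):

```python
import numpy as np

def matching_pursuit(X, y, n_iter):
    """Greedy matching pursuit: add one coordinate update per iteration."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                      # residual r_0 = y
    col_norms = np.sum(X**2, axis=0)  # ||X_j||^2 for each column
    for _ in range(n_iter):
        v = X.T @ r / col_norms       # v_j = <r, X_j> / ||X_j||^2
        scores = v**2 * col_norms     # a_j = v_j^2 ||X_j||^2
        j = int(np.argmax(scores))    # coordinate giving the largest residual reduction
        w[j] += v[j]                  # w_i = w_{i-1} + v_j e_j
        r -= v[j] * X[:, j]           # r_i = r_{i-1} - X_j v_j
    return w
```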

Orthogonal matching pursuit:
r_0 = Ŷ, w_0 = 0, I_0 = ∅.
For i = 1 to T:
  let X_j = X e_j, and select j ∈ {1, ..., d} maximizing a_j = v_j² ‖X_j‖², with v_j = ⟨r_{i−1}, X_j⟩ / ‖X_j‖²;
  I_i = I_{i−1} ∪ {j};
  w_i = argmin_w ‖X M_{I_i} w − Ŷ‖², where (M_{I_i} w)_j = δ_{j ∈ I_i} w_j;
  r_i = Ŷ − X w_i.
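Orthogonal matching pursuit differs only in the update of w_i, which re-fits least squares on the selected support. A sketch under the same assumptions as the previous snippet:

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, n_iter):
    """OMP: select a coordinate, then refit least squares on the active set."""
    n, d = X.shape
    support = []
    w = np.zeros(d)
    r = y.copy()
    col_norms = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        v = X.T @ r / col_norms
        j = int(np.argmax(v**2 * col_norms))     # same selection rule as matching pursuit
        if j not in support:
            support.append(j)
        # w_i = argmin_w ||X M_I w - y||^2: least squares restricted to the support
        w_S, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        w = np.zeros(d)
        w[support] = w_S
        r = y - X @ w                            # r_i = y - X w_i
    return w
```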

Convex relaxation: Lasso (statistics) or Basis Pursuit (signal processing),
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ‖w‖_1,
with the ℓ1-norm ‖w‖_1 = Σ_{i=1}^d |w_i|. Next, we discuss modeling + optimization aspects.
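In practice this problem is rarely coded from scratch; for instance, scikit-learn's Lasso solves an objective of this form, with a slightly different scaling of the data term, so its alpha does not coincide exactly with λ above. A small illustrative usage on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative synthetic data: only the first 5 of 100 coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
w_true = np.zeros(100)
w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(50)

# scikit-learn minimizes (1/(2n))||y - Xw||^2 + alpha*||w||_1,
# i.e. the lasso above up to a factor 1/2 on the data term.
model = Lasso(alpha=0.1).fit(X, y)
print("selected variables:", np.flatnonzero(model.coef_))
```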

The geometry of sparsity: min ‖w‖_1, s.t. Xw = Ŷ.

Ridge regression and sparsity: replace ‖w‖_1 with ‖w‖²?

ℓ1 vs ℓ2: unlike ridge regression, ℓ1 regularization leads to sparsity!

Optimization for sparse regularization: convex but not smooth,
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ‖w‖_1.

Optimization: could be solved via the subgradient method. The objective function is composite,
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ‖w‖_1,
where the first term is convex and smooth and the second is convex but not smooth.

Proximal methods: min_{w ∈ R^d} E(w) + R(w). Let
Prox_R(w) = argmin_{v ∈ R^d} (1/2)‖v − w‖² + R(v),
and, for w_0 = 0,
w_t = Prox_{γR}(w_{t−1} − γ∇E(w_{t−1})).

Proximal methods (cont.): min_{w ∈ R^d} E(w) + R(w). Let R: R^d → R be convex and continuous, and E: R^d → R differentiable and convex with Lipschitz gradient, ‖∇E(w) − ∇E(w′)‖ ≤ L‖w − w′‖ (e.g. sup_w ‖H(w)‖ ≤ L, where H is the Hessian of E). Then, for γ = 1/L,
w_t = Prox_{γR}(w_{t−1} − γ∇E(w_{t−1}))
converges to a minimizer of E + R.
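For instance, for the lasso data term E(w) = (1/n)‖Xw − Ŷ‖² we have ∇E(w) = (2/n) X^⊤(Xw − Ŷ), which is Lipschitz with L = (2/n) σ_max(X)², where σ_max(X) is the largest singular value of X; the condition γ = 1/L then reads γ = n / (2 σ_max(X)²).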

Soft thresholding: for R(w) = λ‖w‖_1,
(Prox_{λ‖·‖_1}(w))_j = w_j − λ if w_j > λ; 0 if w_j ∈ [−λ, λ]; w_j + λ if w_j < −λ.
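A one-line NumPy version of this proximal operator (the function name soft_threshold is illustrative):

```python
import numpy as np

def soft_threshold(w, lam):
    """Prox of lam*||.||_1: shrink each entry toward zero by lam, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

print(soft_threshold(np.array([1.5, 0.3, -0.7]), 0.5))  # -> [ 1.   0.  -0.2]
```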

ISTA:
w_{t+1} = Prox_{γλ‖·‖_1}( w_t − γ (2/n) X^⊤(Xw_t − Ŷ) ),
with
(Prox_{γλ‖·‖_1}(w))_j = w_j − γλ if w_j > γλ; 0 if w_j ∈ [−γλ, γλ]; w_j + γλ if w_j < −γλ.
Small coefficients are set to zero!
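Putting the gradient step and the soft thresholding together, a minimal ISTA sketch for the lasso (the function name and iteration count are illustrative; the step size is γ = 1/L as in the convergence condition above):

```python
import numpy as np

def ista(X, y, lam, n_iter=500):
    """ISTA for (1/n)||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the data-term gradient
    gamma = 1.0 / L                            # step size gamma = 1/L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)     # gradient of the smooth part
        z = w - gamma * grad                   # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # soft thresholding
    return w
```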

Back to inverse problems: Xw = Ŷ. If the x_i are i.i.d. Gaussian vectors, ‖w‖_0 = s, and n ≳ 2 s log(d/s), then ℓ1 regularization recovers w with high probability.
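For example, with d = 10,000 unknowns and s = 10 nonzero entries, the bound asks for only n ≳ 2·10·log(10000/10) ≈ 138 measurements (up to constants), far fewer than d.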

Sampling theorem: classically, 2ω_0 samples are needed (for a signal with maximal frequency ω_0).

LASSO:
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ‖w‖_1.
Interpretability: variable selection!

Variable selection and correlation:
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ‖w‖_1
is convex but not strictly convex, and it cannot handle correlations between the variables.

Elastic net regularization:
min_{w ∈ R^d} (1/n)‖Xw − Ŷ‖² + λ(α‖w‖_1 + (1 − α)‖w‖²).

ISTA for elastic net:
w_{t+1} = Prox_{γλα‖·‖_1}( w_t − γ( (2/n) X^⊤(Xw_t − Ŷ) + 2λ(1 − α) w_t ) ),
with
(Prox_{γλα‖·‖_1}(w))_j = w_j − γλα if w_j > γλα; 0 if w_j ∈ [−γλα, γλα]; w_j + γλα if w_j < −γλα.
Small coefficients are set to zero!
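A sketch of this iteration in NumPy, under the same assumptions as the ISTA snippet above (the squared-norm penalty is handled in the gradient step, and only the ℓ1 part goes through the proximal map):

```python
import numpy as np

def ista_elastic_net(X, y, lam, alpha, n_iter=500):
    """ISTA for (1/n)||Xw - y||^2 + lam*(alpha*||w||_1 + (1-alpha)*||w||^2)."""
    n, d = X.shape
    # Lipschitz constant of the smooth part: data term plus the squared-norm penalty.
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2 + 2.0 * lam * (1 - alpha)
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam * (1 - alpha) * w
        z = w - gamma * grad
        w = np.sign(z) * np.maximum(np.abs(z) - gamma * lam * alpha, 0.0)
    return w
```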

Grouping effect: strong convexity ⇒ all relevant (possibly correlated) variables are selected.

Elastic net and ℓp norms. [Figure: unit balls of the elastic net penalty (1/2)‖w‖_1 + (1/2)‖w‖² = 1 and of the ℓp ball (Σ_{j=1}^d |w_j|^p)^{1/p} = 1.] ℓp norms are similar to the elastic net but they are smooth (no kink!).

Summary: sparsity; geometry; computations; variable selection and elastic net.