Oslo Class 6 Sparsity based regularization

RegML2017@SIMULA. Oslo Class 6: Sparsity based regularization. Lorenzo Rosasco, UNIGE-MIT-IIT. May 4, 2017

Learning from data: possible only under assumptions (regularization):

min_w Ê(w) + λR(w)

Typical assumptions: smoothness, sparsity.

Sparsity: the function of interest depends on only a few building blocks.

Why sparsity? Interpretability, high-dimensional statistics, compression.

What is sparsity?

f(x) = Σ_{j=1}^d x^j w_j

Sparse coefficients: few w_j ≠ 0.

Sparsity and dictionaries. More generally, consider

f(x) = Σ_{j=1}^p φ_j(x) w_j

with φ_1, ..., φ_p atoms of a dictionary. The concept of sparsity depends on the considered dictionary.

Linear inverse problems: n < d, more variables than observations.

Sparse regularization

min_w (1/n)‖X̂w − ŷ‖² + λ‖w‖₀

with the ℓ0 "norm" ‖w‖₀ = Σ_{j=1}^d 1_{w_j ≠ 0}.

Best subset selection

min_w (1/n)‖X̂w − ŷ‖² + λ‖w‖₀

...as hard as trying all possible subsets. Two ways out: 1. Greedy methods 2. Convex relaxations

Greedy methods: initialize, then repeat the following: select a variable, compute the solution, update.

Matching pursuit

r_0 = ŷ, w_0 = 0, I_0 = ∅
for i = 1 to T:
    let X̂_j = X̂ e_j and select j ∈ {1, ..., d} maximizing a_j = v_j² ‖X̂_j‖², with v_j = r_{i−1}ᵀ X̂_j / ‖X̂_j‖²
    I_i = I_{i−1} ∪ {j}
    w_i = w_{i−1} + v_j e_j
    r_i = ŷ − X̂ w_i

Note that v_j = argmin_{v ∈ ℝ} ‖X̂_j v − r_{i−1}‖², so maximizing a_j amounts to choosing the column that most reduces the residual ‖X̂_j v_j − r_{i−1}‖².
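
A minimal NumPy sketch of the matching pursuit iteration above; names X, y, T are placeholders, and it assumes all columns of X are nonzero:

    import numpy as np

    def matching_pursuit(X, y, T):
        """Greedy matching pursuit: T iterations, one coefficient updated per step."""
        n, d = X.shape
        w = np.zeros(d)
        r = y.copy()                         # r_0 = y
        col_norms = np.sum(X ** 2, axis=0)   # ||X_j||^2 for each column
        for _ in range(T):
            v = X.T @ r / col_norms          # v_j = <r, X_j> / ||X_j||^2
            a = v ** 2 * col_norms           # a_j = v_j^2 ||X_j||^2
            j = int(np.argmax(a))            # select the best single column
            w[j] += v[j]                     # w_i = w_{i-1} + v_j e_j
            r = y - X @ w                    # residual r_i = y - X w_i
        return w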

Orthogonal matching pursuit

r_0 = ŷ, w_0 = 0, I_0 = ∅
for i = 1 to T:
    select j ∈ {1, ..., d} maximizing v_j² ‖X̂ e_j‖², with v_j = r_{i−1}ᵀ X̂ e_j / ‖X̂ e_j‖²
    I_i = I_{i−1} ∪ {j}
    w_i = argmin_w ‖X̂ M_{I_i} w − ŷ‖², where (M_{I_i} w)_j = δ_{j ∈ I_i} w_j
    r_i = ŷ − X̂ w_i
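
A corresponding sketch of orthogonal matching pursuit, where the coefficients on the active set are refit by least squares at every step (again placeholder names; the inner problem uses numpy.linalg.lstsq):

    import numpy as np

    def orthogonal_matching_pursuit(X, y, T):
        """Greedy OMP: the active set grows by one column per iteration,
        and the coefficients on the active set are refit by least squares."""
        n, d = X.shape
        w = np.zeros(d)
        r = y.copy()
        I = []                                  # active set I_i
        col_norms = np.sum(X ** 2, axis=0)
        for _ in range(T):
            v = X.T @ r / col_norms
            j = int(np.argmax(v ** 2 * col_norms))
            if j not in I:
                I.append(j)
            w_I, *_ = np.linalg.lstsq(X[:, I], y, rcond=None)  # refit on active set
            w = np.zeros(d)
            w[I] = w_I
            r = y - X @ w                       # residual r_i = y - X w_i
        return w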

Convex relaxation

min_w (1/n)‖X̂w − ŷ‖² + λ‖w‖₁,   with the ℓ1 norm ‖w‖₁ = Σ_{i=1}^d |w_i|

Two questions follow: modeling and optimization.

The problem of sparsity

min_w ‖w‖₁   s.t.   X̂w = ŷ
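
This equality-constrained ℓ1 problem is not solved this way on the slides, but for small problems it can be rewritten as a linear program by splitting w into positive and negative parts; a sketch of that standard reformulation with scipy.optimize.linprog:

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit(X, y):
        """min ||w||_1  s.t.  X w = y, via the LP split w = w_plus - w_minus.
        Only sensible for moderate problem sizes."""
        n, d = X.shape
        c = np.ones(2 * d)                  # objective: sum(w_plus) + sum(w_minus)
        A_eq = np.hstack([X, -X])           # X (w_plus - w_minus) = y
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * d))
        return res.x[:d] - res.x[d:]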

Ridge regression and sparsity: replace ‖w‖₁ with ‖w‖₂?

Unlike ridge regression, ℓ1 regularization leads to sparsity!

Sparse regularization

min_w (1/n)‖X̂w − ŷ‖² + λ‖w‖₁

Called Lasso or Basis Pursuit. Convex but not smooth.

Optimization. Could be solved via the subgradient method. The objective function is composite:

min_w (1/n)‖X̂w − ŷ‖²  (convex, smooth)  +  λ‖w‖₁  (convex)

Proximal methods

min_w E(w) + R(w)

Let Prox_R(w) = argmin_v (1/2)‖v − w‖² + R(v) and, for w_0 = 0,

w_t = Prox_{γR}(w_{t−1} − γ∇E(w_{t−1}))

Proximal methods (cont.)

min_w E(w) + R(w)

Let R: ℝ^p → ℝ be convex and continuous, and E: ℝ^p → ℝ differentiable and convex with ‖∇E(w) − ∇E(w′)‖ ≤ L‖w − w′‖ (e.g. sup_w ‖H(w)‖ ≤ L, with H the Hessian of E). Then for γ = 1/L,

w_t = Prox_{γR}(w_{t−1} − γ∇E(w_{t−1}))

converges to a minimizer of E + R.
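
For the least squares data term used throughout these slides, the Lipschitz constant of the gradient can be made explicit; a short check (not on the slide, just the standard computation):

    \[ E(w) = \frac{1}{n}\|\hat X w - \hat y\|^2, \qquad
       \nabla E(w) = \frac{2}{n}\hat X^\top(\hat X w - \hat y), \]
    \[ \|\nabla E(w) - \nabla E(w')\|
       = \frac{2}{n}\,\|\hat X^\top \hat X\,(w - w')\|
       \le \frac{2}{n}\,\|\hat X^\top \hat X\|_{\mathrm{op}}\,\|w - w'\|
       =: L\,\|w - w'\|. \]

So γ = 1/L = n / (2‖X̂ᵀX̂‖_op) is an admissible step size for the ISTA iteration on the next slides.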

Soft thresholding. For R(w) = λ‖w‖₁,

(Prox_{λ‖·‖₁}(w))_j =  w_j − λ   if w_j > λ
                        0         if w_j ∈ [−λ, λ]
                        w_j + λ   if w_j < −λ
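
In code the soft-thresholding operator is essentially one line; a NumPy sketch implementing the three cases above:

    import numpy as np

    def soft_threshold(w, lam):
        """Entrywise soft thresholding: zero out [-lam, lam], shrink the rest by lam."""
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)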

ISTA

w_{t+1} = Prox_{γλ‖·‖₁}(w_t − (2γ/n) X̂ᵀ(X̂w_t − ŷ)),   where

(Prox_{γλ‖·‖₁}(w))_j =  w_j − γλ   if w_j > γλ
                         0          if w_j ∈ [−γλ, γλ]
                         w_j + γλ   if w_j < −γλ

Small coefficients are set to zero!
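
Putting the pieces together, a minimal ISTA sketch for the Lasso objective above, with step size γ = 1/L as computed earlier (names and the iteration count are placeholders):

    import numpy as np

    def soft_threshold(w, lam):
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

    def ista(X, y, lam, n_iter=1000):
        """ISTA for (1/n)||X w - y||^2 + lam * ||w||_1."""
        n, d = X.shape
        L = 2.0 / n * np.linalg.norm(X.T @ X, 2)   # Lipschitz constant of the smooth part
        gamma = 1.0 / L
        w = np.zeros(d)
        for _ in range(n_iter):
            grad = 2.0 / n * X.T @ (X @ w - y)     # gradient of the least squares term
            w = soft_threshold(w - gamma * grad, gamma * lam)   # proximal step
        return w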

Back to inverse problems: X̂w + δ = ŷ. If the x_i are i.i.d. random and n ≳ 2s log(d/s), where s is the number of nonzero coefficients, then ℓ1 regularization recovers w.

Sampling theorem: 2ω_0 samples needed (the Nyquist rate for a signal with maximal frequency ω_0).

LASSO

min_w (1/n)‖X̂w − ŷ‖² + λ‖w‖₁

Interpretability: variable selection!
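
For reference, this estimator is also available off the shelf; a short scikit-learn usage sketch on synthetic data (the parameter alpha plays the role of λ, up to scikit-learn's own scaling convention):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 200))        # n = 50 observations, d = 200 variables
    w_true = np.zeros(200)
    w_true[:5] = 1.0                          # only 5 active variables
    y = X @ w_true + 0.01 * rng.standard_normal(50)

    model = Lasso(alpha=0.1).fit(X, y)
    selected = np.flatnonzero(model.coef_)    # indices of the selected variables
    print(selected)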

Variable selection and correlation

min_w (1/n)‖X̂w − ŷ‖² + λ‖w‖₁   is not strictly convex:

it cannot handle correlations between the variables.

Elastic net regularization

min_w (1/n)‖X̂w − ŷ‖² + λ(α‖w‖₁ + (1 − α)‖w‖₂²)

ISTA for elastic net

w_{t+1} = Prox_{γλα‖·‖₁}(w_t − (2γ/n) X̂ᵀ(X̂w_t − ŷ) − γλ(1 − α)w_t),   where

(Prox_{γλα‖·‖₁}(w))_j =  w_j − γλα   if w_j > γλα
                          0           if w_j ∈ [−γλα, γλα]
                          w_j + γλα   if w_j < −γλα

Small coefficients are set to zero!
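
The change from the Lasso version is one extra gradient term; a sketch of the elastic net iteration for the objective on the previous slide (note the factor 2 from differentiating the squared ℓ2 term, which the slide absorbs into its constants; names are placeholders):

    import numpy as np

    def soft_threshold(w, lam):
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

    def ista_elastic_net(X, y, lam, alpha, n_iter=1000):
        """ISTA for (1/n)||X w - y||^2 + lam * (alpha*||w||_1 + (1 - alpha)*||w||_2^2)."""
        n, d = X.shape
        # Lipschitz constant of the smooth part: least squares term plus ridge term.
        L = 2.0 / n * np.linalg.norm(X.T @ X, 2) + 2.0 * lam * (1 - alpha)
        gamma = 1.0 / L
        w = np.zeros(d)
        for _ in range(n_iter):
            grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam * (1 - alpha) * w
            w = soft_threshold(w - gamma * grad, gamma * lam * alpha)
        return w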

Grouping effect: strong convexity ⇒ all relevant (possibly correlated) variables are selected.

Elastic net and ℓp norms

[Figure: unit level sets of (1/2)‖w‖₁ + (1/2)‖w‖₂² = 1 and of (Σ_{j=1}^d |w_j|^p)^{1/p} = 1]

ℓp norms are similar to the elastic net, but they are smooth (no kink!).

This class: sparsity, geometry, computations, variable selection and elastic net.

Next class: structured sparsity.