
MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT

Outline
- Variable Selection
- Subset Selection
- Greedy Methods: (Orthogonal) Matching Pursuit
- Convex Relaxation: LASSO & Elastic Net

Prediction and Interpretability
In many practical situations, beyond prediction, it is important to obtain interpretable results. Interpretability is often determined by detecting which factors allow good prediction. We look at this question from the perspective of variable selection.

Linear Models
Consider a linear model
$f_w(x) = w^T x = \sum_{j=1}^{D} w^j x^j$
Here the components $x^j$ of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, ...).
Given data, the goal of variable selection is to detect which variables are important for prediction.
Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero.

Outline
- Variable Selection
- Subset Selection
- Greedy Methods: (Orthogonal) Matching Pursuit
- Convex Relaxation: LASSO & Elastic Net


Notation
We need some notation:
- $X_n$ is the $n \times D$ data matrix
- $X^j \in \mathbb{R}^n$, $j = 1, \dots, D$, are its columns
- $Y_n \in \mathbb{R}^n$ is the output vector

High-dimensional Statistics
Estimating a linear model corresponds to solving a linear system $X_n w = Y_n$.
Classically $n \gg D$: low dimensional / overdetermined system.
Lately $n \ll D$: high dimensional / underdetermined system. Buzzwords: compressed sensing, high-dimensional statistics, ...

High-dimensional Statistics
Estimating a linear model corresponds to solving a linear system $X_n w = Y_n$.
[Figure: the underdetermined system $X_n w = Y_n$, with $X_n$ having $n$ rows and $D \gg n$ columns.]
Sparsity!
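
The setting above can be made concrete with a small synthetic problem. The sketch below assumes numpy; the sizes and the name make_sparse_problem are illustrative, not from the slides. It builds an underdetermined system whose true coefficient vector is sparse, and the later sketches can be run on its output.

```python
import numpy as np

def make_sparse_problem(n=50, D=200, s=5, noise=0.1, seed=0):
    """Underdetermined regression problem (n << D) with an s-sparse true coefficient vector."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, D))           # data matrix X_n
    w_true = np.zeros(D)
    support = rng.choice(D, size=s, replace=False)
    w_true[support] = rng.standard_normal(s)  # only s non-zero coefficients
    y = X @ w_true + noise * rng.standard_normal(n)  # output vector Y_n
    return X, y, w_true
```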

Brute Force Approach
Sparsity can be measured by the $\ell_0$ norm
$\|w\|_0 = \#\{j : w^j \neq 0\}$
that counts the non-zero components of $w$. If we consider the square loss, a regularization approach is given by
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n (y_i - f_w(x_i))^2 + \lambda \|w\|_0$

Best subset selection is Hard
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n (y_i - f_w(x_i))^2 + \lambda \|w\|_0$
The above approach is as hard as a brute force approach: it amounts to considering all the training problems obtained from all possible subsets of variables (singletons, pairs, triplets, ... of variables). The computational complexity is combinatorial. In the following we consider two approximate approaches:
- greedy methods
- convex relaxation
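
To see the combinatorial cost concretely, the sketch below enumerates every subset of variables and solves a least squares problem on each, which is only feasible for very small $D$. It is a hedged illustration assuming numpy; the function name and interface are not from the slides.

```python
import itertools
import numpy as np

def best_subset(X, y, lam):
    """Exhaustive l0-penalized least squares: cost grows like 2^D."""
    n, D = X.shape
    best_obj, best_w = np.mean(y ** 2), np.zeros(D)   # empty subset as baseline
    for k in range(1, D + 1):
        for S in itertools.combinations(range(D), k):
            cols = list(S)
            w_S, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            obj = np.mean((y - X[:, cols] @ w_S) ** 2) + lam * k   # empirical risk + lambda * ||w||_0
            if obj < best_obj:
                best_obj = obj
                best_w = np.zeros(D)
                best_w[cols] = w_S
    return best_w
```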

Outline
- Variable Selection
- Subset Selection
- Greedy Methods: (Orthogonal) Matching Pursuit
- Convex Relaxation: LASSO & Elastic Net

Greedy Methods Approach
Greedy approaches encompass the following steps:
1. initialize the residual, the coefficient vector, and the index set
2. find the variable most correlated with the residual
3. update the index set to include the index of that variable
4. update/compute the coefficient vector
5. update the residual
The simplest such procedure is called forward stage-wise regression in statistics and matching pursuit (MP) in signal processing.

Initialization
Let $r$, $w$, $I$ denote the residual, the coefficient vector, and the index set, respectively. The MP algorithm starts by initializing the residual $r \in \mathbb{R}^n$, the coefficient vector $w \in \mathbb{R}^D$, and the index set $I \subseteq \{1, \dots, D\}$:
$r_0 = Y_n, \quad w_0 = 0, \quad I_0 = \emptyset$

Selection
The variable most correlated with the residual is given by
$k = \arg\max_{j=1,\dots,D} a_j, \qquad a_j = \frac{(r_{i-1}^T X^j)^2}{\|X^j\|^2}$,
where we note that
$v_j = \frac{r_{i-1}^T X^j}{\|X^j\|^2} = \arg\min_{v \in \mathbb{R}} \|r_{i-1} - X^j v\|^2, \qquad \|r_{i-1} - X^j v_j\|^2 = \|r_{i-1}\|^2 - a_j$

Selection (cont.)
Such a selection rule has two interpretations: we select the variable with the largest projection on the output, or equivalently we select the variable whose corresponding column best explains the output vector in a least squares sense.

Active Set, Solution and Residual Update
Then the index set is updated as $I_i = I_{i-1} \cup \{k\}$, and the coefficient vector is given by
$w_i = w_{i-1} + w_k, \qquad w_k = v_k e_k$,
where $e_k$ is the element of the canonical basis in $\mathbb{R}^D$ with $k$-th component different from zero. Finally, the residual is updated as
$r_i = r_{i-1} - X_n w_k$
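
The steps above fit in a few lines of code. The following is a hedged sketch of matching pursuit assuming numpy; function and variable names are illustrative, not from the slides.

```python
import numpy as np

def matching_pursuit(X, y, n_iter):
    """Matching pursuit: greedily fit the residual one variable at a time."""
    n, D = X.shape
    w = np.zeros(D)
    r = y.copy()                        # r_0 = Y_n
    active = []                         # I_0 = empty set
    col_norms = np.sum(X ** 2, axis=0)  # ||X^j||^2 for each column
    for _ in range(n_iter):
        corr = X.T @ r                  # r_{i-1}^T X^j
        a = corr ** 2 / col_norms       # selection scores a_j
        k = int(np.argmax(a))
        v_k = corr[k] / col_norms[k]    # one-dimensional least squares coefficient
        active.append(k)
        w[k] += v_k                     # w_i = w_{i-1} + v_k e_k
        r = r - X[:, k] * v_k           # residual update
    return w, active
```

For instance, on the synthetic problem sketched earlier, `matching_pursuit(X, y, n_iter=10)` would return an approximately sparse coefficient vector together with the selected indices.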

Orthogonal Matching Pursuit
A variant of the above procedure, called orthogonal matching pursuit (OMP), is also often considered, where the coefficient computation is replaced by
$w_i = \arg\min_{w \in \mathbb{R}^D} \|Y_n - X_n M_{I_i} w\|^2$,
where the $D \times D$ matrix $M_I$ is such that $(M_I w)^j = w^j$ if $j \in I$ and $(M_I w)^j = 0$ otherwise. Moreover, the residual update is replaced by
$r_i = Y_n - X_n w_i$
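
The corresponding change to the previous sketch is to refit, at every iteration, a least squares problem restricted to the current active set. Again a hedged sketch assuming numpy, with illustrative names.

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, n_iter):
    """OMP: select greedily, then refit by least squares on the active set."""
    n, D = X.shape
    w = np.zeros(D)
    r = y.copy()
    active = []
    col_norms = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        a = (X.T @ r) ** 2 / col_norms
        k = int(np.argmax(a))
        if k not in active:
            active.append(k)
        w_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        w = np.zeros(D)
        w[active] = w_active            # coefficients supported on the active set only
        r = y - X @ w                   # r_i = Y_n - X_n w_i
    return w, active
```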

Theoretical Guarantees
If the solution is sparse and the columns of the data matrix are not too correlated, OMP can be shown to recover the right vector of coefficients with high probability.

Outline
- Variable Selection
- Subset Selection
- Greedy Methods: (Orthogonal) Matching Pursuit
- Convex Relaxation: LASSO & Elastic Net

$\ell_1$ Norm and Regularization
Another popular approach to find sparse solutions is based on a convex relaxation. Namely, the $\ell_0$ norm is replaced by the $\ell_1$ norm,
$\|w\|_1 = \sum_{j=1}^{D} |w^j|$
In the case of least squares, one can consider
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n (y_i - f_w(x_i))^2 + \lambda \|w\|_1$

Convex Relaxation
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n (y_i - f_w(x_i))^2 + \lambda \|w\|_1$
The above problem is called LASSO in statistics and basis pursuit in signal processing. The objective function of the corresponding minimization problem is convex but not differentiable, so tools from non-smooth convex optimization are needed to find a solution.

Iterative Soft Thresholding
A simple yet powerful procedure to compute a solution is the so-called iterative soft thresholding algorithm (ISTA):
$w_0 = 0, \qquad w_i = S_{\lambda\gamma}\left(w_{i-1} + \frac{2\gamma}{n} X_n^T (Y_n - X_n w_{i-1})\right), \quad i = 1, \dots, T_{\max}$
At each iteration a non-linear soft thresholding operator is applied to a gradient step.

Iterative Soft Thresholding (cont.)
$w_0 = 0, \qquad w_i = S_{\lambda\gamma}\left(w_{i-1} + \frac{2\gamma}{n} X_n^T (Y_n - X_n w_{i-1})\right), \quad i = 1, \dots, T_{\max}$
The iteration should be run until a convergence criterion is met, e.g. $\|w_i - w_{i-1}\| \le \epsilon$ for some precision $\epsilon$, or until a maximum number of iterations $T_{\max}$ is reached. To ensure convergence we should choose the step size
$\gamma = \frac{n}{2\,\|X_n^T X_n\|}$
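
The following is a minimal ISTA sketch for the LASSO problem, assuming numpy; it follows the update and the step-size choice given above, with illustrative names and default parameters that are not from the slides.

```python
import numpy as np

def soft_threshold(u, alpha):
    """Component-wise soft thresholding operator S_alpha."""
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def ista_lasso(X, y, lam, t_max=1000, tol=1e-6):
    """ISTA for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, D = X.shape
    gamma = n / (2 * np.linalg.norm(X.T @ X, 2))   # step size n / (2 ||X^T X||)
    w = np.zeros(D)
    for _ in range(t_max):
        grad_step = w + (2 * gamma / n) * (X.T @ (y - X @ w))   # gradient step on the data term
        w_new = soft_threshold(grad_step, lam * gamma)          # proximal step on the l1 term
        if np.linalg.norm(w_new - w) <= tol:                    # convergence criterion
            return w_new
        w = w_new
    return w
```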

Splitting Methods
In ISTA the contributions of the error term and of the regularization are split: the argument of the soft thresholding operator corresponds to a step of gradient descent,
$\frac{2}{n} X_n^T (Y_n - X_n w_{i-1})$
The soft thresholding operator depends only on the regularization and acts component-wise on a vector $w$, so that
$S_\alpha(u) = \frac{(|u| - \alpha)_+}{|u|}\, u$

Soft Thresholding and Sparsity
$S_\alpha(u) = \frac{(|u| - \alpha)_+}{|u|}\, u$
The above expression shows that the coefficients of the solution computed by ISTA can be exactly zero. This can be contrasted with Tikhonov regularization, where this is hardly the case.
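
As a quick numeric check of this sparsifying effect (an illustrative snippet, assuming numpy): components whose magnitude does not exceed $\alpha$ are mapped exactly to zero, while the others are shrunk.

```python
import numpy as np

u = np.array([3.0, 0.2, -1.5, -0.05])
alpha = 0.5
print(np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0))   # [ 2.5  0.  -1.   0. ]
```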

Lasso meets Tikhonov: Elastic Net
Indeed, it is possible to see that, while Tikhonov regularization allows one to compute a stable solution, in general its solution is not sparse; on the other hand, the solution of the LASSO might not be stable. The elastic net algorithm, defined as
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n (y_i - f_w(x_i))^2 + \lambda\left(\alpha \|w\|_1 + (1-\alpha)\|w\|_2^2\right), \quad \alpha \in [0,1] \qquad (2)$
can be seen as a hybrid algorithm which interpolates between Tikhonov regularization and the LASSO.

ISTA for Elastic Net
The ISTA procedure can be adapted to solve the elastic net problem, where the gradient descent step also incorporates the derivative of the $\ell_2$ penalty term. The resulting algorithm is
$w_0 = 0, \qquad w_i = S_{\lambda\alpha\gamma}\left((1 - \lambda\gamma(1-\alpha))\, w_{i-1} + \frac{2\gamma}{n} X_n^T (Y_n - X_n w_{i-1})\right), \quad i = 1, \dots, T_{\max}$
To ensure convergence we should choose the step size
$\gamma = \frac{n}{2\left(\|X_n^T X_n\| + \lambda(1-\alpha)\right)}$
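
A sketch of this variant, again assuming numpy and following the update rule as written above (names and defaults are illustrative):

```python
import numpy as np

def ista_elastic_net(X, y, lam, alpha, t_max=1000, tol=1e-6):
    """ISTA-style iteration for the elastic net, as on the slide."""
    n, D = X.shape
    gamma = n / (2 * (np.linalg.norm(X.T @ X, 2) + lam * (1 - alpha)))
    w = np.zeros(D)
    for _ in range(t_max):
        shrunk = (1 - lam * gamma * (1 - alpha)) * w            # contribution of the l2 penalty
        grad = (2 * gamma / n) * (X.T @ (y - X @ w))            # gradient step on the data term
        w_new = np.sign(shrunk + grad) * np.maximum(np.abs(shrunk + grad) - lam * alpha * gamma, 0.0)
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w
```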

Wrapping Up
Sparsity and interpretable models:
- greedy methods
- convex relaxation

Next Class
Unsupervised learning: clustering!