Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007

About this class. Goal: to introduce the problem of variable selection, discuss its connection to sparse approximation, and describe some of the methods designed to solve such problems.

Why Selecting Variables?
- Interpretability of the model: in many learning problems a main goal, besides good prediction, is to gain a better understanding of the problem, for example detecting the most discriminative information.
- Data-driven representation: in place of tailoring an ad hoc representation (for example via a kernel), we can take a large, redundant set of measurements and then try to devise a data-driven selection scheme.
- Compression: it is often desirable to have parsimonious models, that is, models requiring a (possibly very) small number of parameters to be described.

A Useful Example: Biomarker Identification. Setup: n patients belonging to 2 groups (say, two different diseases); p measurements for each patient quantifying the expression of p genes. Goal: learn a classification rule to predict occurrence of the disease for future patients, and detect which genes are responsible for the disease. This is the $p \gg n$ paradigm: typically n is on the order of tens and p of thousands.

Some Notation. Measurement matrix: let X be the $n \times p$ measurement matrix
$$X = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}$$
where n is the number of examples and p is the number of variables; we denote by $X^i$, $i = 1, \dots, p$, the columns of X. For each patient we have a response (output) $y \in \mathbb{R}$ or $y = \pm 1$. In particular, we are given the responses for the training set, $Y = (y_1, y_2, \dots, y_n)$.

Approaches to Variable Selection. So far we still have to define what "relevant" variables are. Different approaches are based on different ways to specify what is relevant: filter methods, wrappers, and embedded methods. We will focus on the last class of methods. (See "An Introduction to Variable and Feature Selection", Guyon and Elisseeff, 2003.)

A Few Words on Filter Methods. The idea is to rank the variables $X^i$, $i = 1, \dots, p$, according to some criterion $r(i)$:
- correlation score (similar criteria exist for nonlinear correlation): $r(i) = \frac{\mathrm{cov}(X^i, Y)}{\sqrt{\mathrm{Var}(X^i)\,\mathrm{Var}(Y)}}$
- single-variable classifiers: $r(i)$ is the test error of a classifier trained with only the i-th variable
- information-based criteria: $r(i) = \sum_{X^i, Y} P(X^i, Y) \log \frac{P(X^i, Y)}{P(X^i)\,P(Y)}$
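As a concrete illustration of the correlation-score filter, here is a minimal NumPy sketch; the helper name rank_by_correlation and the toy data are illustrative, not part of the slides.

```python
import numpy as np

def rank_by_correlation(X, Y):
    """Rank the p columns of X by decreasing |corr(X^i, Y)| (Pearson correlation)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    r = (Xc * Yc[:, None]).mean(axis=0) / (Xc.std(axis=0) * Yc.std() + 1e-12)
    return np.argsort(-np.abs(r)), r

# Toy data in the p >> n regime: n = 20 patients, p = 100 genes,
# with only gene 3 carrying signal about the label.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
Y = np.sign(X[:, 3] + 0.1 * rng.standard_normal(20))
order, scores = rank_by_correlation(X, Y)
print(order[:5])  # gene 3 should appear near the top of the ranking
```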

Filter or Non-Filter? Cons: filter methods usually look separately at each variable and are prone to provide a redundant set of variables. Moreover, no explicit selection is embedded in the method, and some thresholding procedure is needed. Pros: nonetheless, they are often pretty fast and give an indication of whether some variables are by far more important than others.

Wrapper Methods. An algorithm of choice is used (as a black box) to evaluate the importance of a set of variables for the given task, and a strategy is needed to explore the variable space; a brute-force approach is infeasible even for relatively few features. There are two main approaches: backward elimination starts with all the variables and proceeds iteratively to discard them, while forward selection starts with no variables and incrementally adds them, until some stopping criterion is satisfied, for example when a fixed number of variables, or a certain validation error, is reached. A sketch of forward selection follows below.
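Here is a minimal sketch of forward selection, assuming scikit-learn is available; the base estimator (logistic regression), the cross-validation scoring, and the stopping rule (a fixed number k of variables) are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, Y, k=5):
    """Greedily add the variable that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # score every candidate variable when added to the current subset
        scores = [
            (cross_val_score(LogisticRegression(max_iter=1000),
                             X[:, selected + [j]], Y, cv=3).mean(), j)
            for j in remaining
        ]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```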

About Wrappers. Such methods often require massive computation, unless a favorable search strategy can be used. Using a learning algorithm as a black box makes them simple and problem-independent... still, we risk forgetting what the learning machine does to avoid overfitting.

Embedded Methods. The selection procedure is embedded in the training phase. An intuition: what happens to the generalization properties of empirical risk minimization as we subtract variables? If we keep all the variables we probably overfit; if we take just a few variables we are likely to oversmooth (in the limit we have a single-variable classifier!). We are going to discuss this class of methods in detail.

Function Approximation. Suppose the output is a linear combination of the variables, $f(x) = \sum_{i=1}^p \beta_i x^i = \langle \beta, x \rangle$, where each coefficient $\beta_i$ can be seen as a weight on the i-th variable. For a given training set we have to solve (approximately) $f(x_i) = y_i$, $i = 1, \dots, n$; recall that the data are sampled and noisy.

Solving a BIG Linear System. In vector notation we can write the problem as a linear system of equations, $Y = X\beta$. Though existence is ensured (recall $p \gg n$), the solution is not unique. Moreover, we can expect unstable behavior with respect to noisy data. In other words, the problem is ill-posed.
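The non-uniqueness is easy to see numerically; the following sketch builds a tiny underdetermined system and exhibits two different coefficient vectors that reproduce the data exactly (all names and sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))     # n = 3 examples, p = 10 variables
Y = rng.standard_normal(3)

beta1 = np.linalg.lstsq(X, Y, rcond=None)[0]   # minimum-norm exact solution
null_dir = np.linalg.svd(X)[2][-1]             # a direction in the null space of X
beta2 = beta1 + 5.0 * null_dir                 # a very different exact solution

print(np.allclose(X @ beta1, Y), np.allclose(X @ beta2, Y))  # True True
```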

Sparse Approximation. We can consider a more general framework. Given a function (a signal) $f : X \to \mathbb{R}$, we look for a sparse approximation on an overcomplete dictionary. Overcomplete Dictionary: let $D = \{\varphi_i : X \to \mathbb{R};\ i = 1, \dots\} \subseteq H$ be a set of functions, namely the dictionary. The functions $\varphi_i$ are called atoms. Overcomplete means that we have many ways to write $f = \sum_i \beta_i \varphi_i$, where $f \in H$.

Examples of Overcomplete Dictionaries. In recent years a large number of overcomplete dictionaries have been proposed: stationary wavelets, wavelet packets, cosine packets, chirplets, warplets... The simplest example we can think of is a union of orthonormal bases $(e_i), (g_i), \dots$ in some space H. It is clear that every function in H can be written in many ways using the elements of D.

Finite-Dimensional Problem. Typically one has to solve a (noisy) discretization of the sparse approximation problem: the signal f is replaced by a vector f. Discretization via sampling: a possible discretization is obtained by considering a sample $x_1, \dots, x_n$ and $f = (f(x_1), \dots, f(x_n)) \in \mathbb{R}^n$. Noise: the signal is often noisy, so that in place of f one has a noisy vector $Y = (y_1, \dots, y_n) \in \mathbb{R}^n$, where $y_j = f(x_j) + \text{noise}$, $j = 1, \dots, n$.

Truncated Dictionary. In such cases it often suffices to consider a truncated dictionary $\varphi_1, \dots, \varphi_p$ and write $f(x_j) = \sum_{i=1}^p \beta_i \varphi_i(x_j)$, $j = 1, \dots, n$. The measurement matrix is now
$$\Phi = \begin{pmatrix} \varphi_1(x_1) & \cdots & \varphi_p(x_1) \\ \vdots & & \vdots \\ \varphi_1(x_n) & \cdots & \varphi_p(x_n) \end{pmatrix}$$
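A minimal sketch of how such a $\Phi$ is built from a truncated dictionary evaluated at the sample points; the cosine atoms are just an illustrative choice of dictionary.

```python
import numpy as np

def measurement_matrix(atoms, x):
    """Phi[j, i] = phi_i(x_j) for atoms phi_1, ..., phi_p and samples x_1, ..., x_n."""
    return np.column_stack([phi(x) for phi in atoms])

x = np.linspace(0.0, 1.0, 50)                                                    # n = 50 samples
atoms = [(lambda k: (lambda t: np.cos(np.pi * k * t)))(k) for k in range(200)]   # p = 200 atoms
Phi = measurement_matrix(atoms, x)
print(Phi.shape)   # (50, 200): p >> n, a truncated but still overcomplete dictionary
```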

A Few Comments. It is easy to check that if the dictionary is simply an orthonormal basis, $\varphi_i = e_i$, then $e_i(x) = x^i$ and the measurement matrix $\Phi$ is simply the matrix X we previously defined. In principle, if we sample $(x_1, \dots, x_n)$ and n can increase, the size of the truncated dictionary should depend on n, that is, $p = p(n)$. If f is a discrete and finite signal (e.g. an image), the idea is simply to take $p \gg n$. In any case one should be aware of the possible effects of the discretization step.

Ill-Posed Problem. Using again vector notation, we have to solve (approximately) the ill-posed problem $Y = \Phi\beta$. Even in the noiseless case there is not a unique solution, and in general for noisy data we need some regularization procedure. In particular, we would like a sparsity-enhancing regularization.

Generalized and Tikhonov Regularized Solutions. To restore well-posedness:
- if there is no noise, take the generalized solution $\min_{\beta \in \mathbb{R}^p} \left(\sum_{i=1}^p \beta_i^2\right)^{1/2}$ s.t. $f = \Phi\beta$ (called the method of frames (MOF) in signal processing);
- if there is noise, use Tikhonov regularization, $\min_{\beta \in \mathbb{R}^p} \left\{\frac{1}{n}\sum_{j=1}^n V(y_j, \langle \beta, \Phi(x_j) \rangle) + \lambda \sum_{i=1}^p \beta_i^2\right\}$.
In general all the $\beta_i$ will be different from zero; a sketch contrasting the two solutions follows below.
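A minimal sketch contrasting the generalized (minimum L2-norm) solution with the Tikhonov solution for the square loss; the function names are illustrative, and the Tikhonov formula assumes the 1/n-scaled empirical risk used above.

```python
import numpy as np

def minimum_norm_solution(Phi, f):
    """Method of frames: beta = Phi^+ f, the exact solution of minimal L2 norm."""
    return np.linalg.pinv(Phi) @ f

def tikhonov_solution(Phi, Y, lam):
    """Ridge regression: beta = (Phi^T Phi + n * lam * I)^{-1} Phi^T Y."""
    n, p = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ Y)
```

Both solutions are generally dense: every coefficient comes out different from zero.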

Sparsity. Define the "zero norm" (not a real norm) as $\|\beta\|_0 = \#\{i = 1, \dots, p \mid \beta_i \neq 0\}$. It is a measure of how "complex" f is and of how many variables are important. Is it a good way to define sparsity?

L0 Regularization. If we have the prior information that only a few variables are meaningful, we can look for $\min_{\beta \in \mathbb{R}^p} \|\beta\|_0$ s.t. $f = \Phi\beta$, or, since the data are noisy, we would like to consider $\min_{\beta \in \mathbb{R}^p} \left\{\frac{1}{n}\sum_{j=1}^n V(y_j, \langle \beta, \Phi(x_j) \rangle) + \lambda \|\beta\|_0\right\}$. This is as difficult as trying all possible subsets of variables. Can we find meaningful approximations?

Approximate Solutions: Two Main Approaches. There exist approximations for various choices of loss function. Most of them fall into one of the following two approaches: (1) convex relaxation (L1 regularization, L0-SVM, ...); (2) greedy schemes (boosting algorithms, projection pursuit, ...). We mostly discuss the first class of methods.

One Slide on Greedy Approaches. Very similar techniques have been proposed by different communities under different names: forward stagewise regression in statistics, greedy algorithms in approximation theory, boosting methods in learning, projection pursuit methods in signal processing. The various algorithms are often based on iterating the following steps, after some initialization: (1) selection of an element of the dictionary, (2) update of the solution. These schemes proceed incrementally and are not based on a global optimization procedure; a small sketch in this spirit follows below.
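A minimal sketch of a greedy scheme in the spirit of matching pursuit: at each iteration, select the atom most correlated with the current residual and update the solution. The names and the fixed iteration budget are illustrative choices, not the specific algorithm of any of the communities listed above.

```python
import numpy as np

def matching_pursuit(Phi, Y, n_iter=10):
    """Greedy sparse approximation of Y over the columns of Phi."""
    n, p = Phi.shape
    beta = np.zeros(p)
    residual = Y.astype(float).copy()
    for _ in range(n_iter):
        correlations = Phi.T @ residual              # selection step
        i = np.argmax(np.abs(correlations))
        step = correlations[i] / np.sum(Phi[:, i] ** 2)
        beta[i] += step                              # update step
        residual -= step * Phi[:, i]
    return beta
```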

Convex Relaxation. A natural approximation to L0 regularization is given by $\frac{1}{n}\sum_{j=1}^n V(y_j, \langle \beta, \Phi(x_j) \rangle) + \lambda \|\beta\|_1$, where $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$. If we choose the square loss, $\frac{1}{n}\sum_{j=1}^n (y_j - \langle \beta, \Phi(x_j) \rangle)^2 = \frac{1}{n}\|Y - \Phi\beta\|^2$. Such a scheme is related to the Basis Pursuit and Lasso algorithms.
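A minimal usage sketch of the L1-regularized square loss (the Lasso mentioned above), assuming scikit-learn; alpha plays the role of $\lambda$ and the toy data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200
Phi = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                              # only 5 active variables
Y = Phi @ beta_true + 0.01 * rng.standard_normal(n)

lasso = Lasso(alpha=0.05).fit(Phi, Y)
print(np.flatnonzero(lasso.coef_))               # mostly the first 5 variables survive
```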

Lq Regularization? One might wonder why the L1 penalty should ensure sparseness, and what happens using a more general penalty of the form $\|\beta\|_q = \left(\sum_{i=1}^p |\beta_i|^q\right)^{1/q}$ (related to bridge regression in statistics). It can be proved that $\lim_{q \to 0} \|\beta\|_q^q = \|\beta\|_0$; for $0 < q < 1$ the norm is not a convex map; for $q = 1$ it is convex, and it is strictly convex for $q > 1$. How about sparsity?
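The limiting statement is easy to check numerically; in the following illustrative snippet, $\|\beta\|_q^q$ tends to the number of nonzero coefficients as q shrinks.

```python
import numpy as np

beta = np.array([0.0, 2.0, 0.0, -0.5])               # ||beta||_0 = 2
for q in (1.0, 0.5, 0.1, 0.01):
    print(q, np.sum(np.abs(beta) ** q))               # tends to 2 as q -> 0
```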

Geometrical View

Graphical Solution in 2D

Back to L1 Regularization. We focus on the square loss, so that we now have to solve $\min_{\beta \in \mathbb{R}^p} \|Y - \Phi\beta\|^2 + \lambda \|\beta\|_1$. Though the problem is no longer hopeless, there is no straightforward way to solve it. The functional is convex but not strictly convex, so that the solution is not unique. One possible approach relies on linear or quadratic programming techniques. Using convex analysis tools we can get a constructive representation theorem.

An Iterative Thresholding Algorithm. If we let $\beta^\lambda$ be a solution of the convex relaxation problem, it can be proved that the following iterative algorithm converges to $\beta^\lambda$ as the number of iterations increases. Set $\beta^\lambda_0 = 0$ and let
$$\beta^\lambda_t = S_\lambda\left[\beta^\lambda_{t-1} + \tau \Phi^T (Y - \Phi \beta^\lambda_{t-1})\right]$$
where $\tau$ is a normalization constant ensuring $\tau\|\Phi\|^2 \le 1$, and the map $S_\lambda$ is defined component-wise as
$$S_\lambda(\beta_i) = \begin{cases} \beta_i + \lambda/2 & \text{if } \beta_i < -\lambda/2 \\ 0 & \text{if } |\beta_i| \le \lambda/2 \\ \beta_i - \lambda/2 & \text{if } \beta_i > \lambda/2 \end{cases}$$
(see Daubechies et al. 05).

Thresholding

Algorithmic Aspects.
Set $\beta^\lambda_0 = 0$
for t = 1:tmax
    $\beta^\lambda_t = S_\lambda\left[\beta^\lambda_{t-1} + \tau \Phi^T (Y - \Phi \beta^\lambda_{t-1})\right]$
The algorithm we just described is very easy to implement but can be quite heavy from a computational point of view. The iteration can be stopped when a certain precision is reached. The complexity of the algorithm is $O(t p^2)$ for each value of the regularization parameter. The regularization parameter controls the degree of sparsity of the solution.
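A minimal implementation sketch of the iteration above for the square loss. The step size tau and the penalty lam are illustrative; the threshold is scaled by tau so that the iteration targets $\|Y - \Phi\beta\|^2 + \lambda\|\beta\|_1$ for any $\tau \le 1/\|\Phi\|^2$, reducing to the $\lambda/2$ threshold of the slides when $\|\Phi\| \le 1$ and $\tau = 1$.

```python
import numpy as np

def soft_threshold(b, thresh):
    """Component-wise map S: shrink towards zero and zero out small entries."""
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

def iterative_thresholding(Phi, Y, lam, t_max=500):
    """Iterative soft-thresholding for the L1-regularized square loss."""
    tau = 1.0 / (np.linalg.norm(Phi, 2) ** 2)    # step size: tau * ||Phi||^2 <= 1
    beta = np.zeros(Phi.shape[1])
    for _ in range(t_max):
        beta = soft_threshold(beta + tau * Phi.T @ (Y - Phi @ beta),
                              tau * lam / 2.0)
        # in practice, stop when successive iterates are within a given precision
    return beta
```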

Some Final Remarks. Computational burden: algorithms for feature selection are often computationally expensive. This should be no surprise, since the problem we are trying to solve is more difficult than a simple regression/classification task. About uniqueness: the solution of L1 regularization is not unique. Note, however, that the various solutions have the same prediction properties but different selection properties. Correlated variables: if we have a group of correlated variables, the algorithm is going to select just one of them. This can be bad for interpretability but maybe good for compression. Connections: the same approach is used in MANY different problems: compressed sensing, inverse problems, linear coding...