Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda


Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016

Outline: Regression (Least-squares regression · Example: Global warming · Logistic regression) · Sparse regression (Model selection · Analysis of the lasso · Correlated predictors · Group sparsity)

Least-squares regression

Regression. Aim: predict the value of a response / dependent variable y ∈ R from p predictors / features / independent variables X_1, X_2, ..., X_p ∈ R. Methodology: 1. Fit a model f using n training examples y_1, y_2, ..., y_n, so that y_i ≈ f(X_{i1}, X_{i2}, ..., X_{ip}) for 1 ≤ i ≤ n. 2. Use the learned model f to predict from new data.

Linear regression: f is parametrized by an intercept β_0 and a vector of weights β ∈ R^p,
y_i ≈ β_0 + Σ_{j=1}^p β_j X_{ij},  1 ≤ i ≤ n,
or in matrix form
[y_1; y_2; ...; y_n] ≈ β_0 [1; 1; ...; 1] + [X_11 ... X_1p; X_21 ... X_2p; ...; X_n1 ... X_np] [β_1; β_2; ...; β_p],
i.e. y ≈ 1 β_0 + X β. Aim: learn β_0 and β from the training data.
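As a quick illustration of the linear model above, here is a minimal numpy sketch (not from the slides; the data and variable names are made up for the example) that learns β_0 and β by least squares after appending a column of ones to X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))                          # predictors
beta_true = rng.standard_normal(p)
y = 2.0 + X @ beta_true + 0.1 * rng.standard_normal(n)   # response, intercept 2

# Append a column of ones so the intercept is learned jointly with the weights
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]
print(beta0_hat, beta_hat)
```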

Linear models: signal representation (compression, denoising), x = Dc. Aim: find a sparse representation of the signal x. The columns of D are designed/learned atoms.

Linear models: inverse problems (compressed sensing, super-resolution), y = Ax. Aim: estimate the signal x from the measurements y; A models the measurement process.

Linear models: linear regression, y = Xβ. Aim: learn the relation between the response y and the predictors X_1, X_2, ..., X_p in order to predict from new data. Each column of X contains a variable that may or may not be related to the response.

Preprocessing: 1. Center each predictor column X_j by subtracting its mean, so that Σ_{i=1}^n X_{ij} = 0 for all 1 ≤ j ≤ p. 2. Normalize each predictor column X_j so that ||X_j||_2 = 1 for all 1 ≤ j ≤ p. 3. Center the response vector y so that Σ_{i=1}^n y_i = 0.
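A minimal numpy sketch of these three preprocessing steps (illustrative only; the function name and arrays are placeholders):

```python
import numpy as np

def preprocess(X, y):
    """Center and normalize the predictor columns, and center the response."""
    Xc = X - X.mean(axis=0)                # 1. zero-mean predictor columns
    Xc = Xc / np.linalg.norm(Xc, axis=0)   # 2. unit l2-norm predictor columns
    yc = y - y.mean()                      # 3. zero-mean response
    return Xc, yc
```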

Least-squares fit: minimize the l2 norm of the error over the training data,
minimize ||y − Xβ||_2.
After preprocessing β_0 = 0, so we can ignore it.

Geometric interpretation: decompose y into its projection onto the column space of X and its projection onto the orthogonal complement, y = y_X + y_{X⊥}. By Pythagoras's theorem,
||y − Xβ||_2^2 = ||y_{X⊥}||_2^2 + ||y_X − Xβ||_2^2.
If n ≥ p and X is full rank, β_ls = (X^T X)^{-1} X^T y satisfies X β_ls = y_X. After preprocessing β_0 = 0.

Probabilistic interpretation: maximum likelihood. Assumption: the linear model is correct but the examples are corrupted by additive iid Gaussian noise, y = Xβ + z. The likelihood is the probability density function of y parametrized by β,
L(β) = (1 / (2π)^{n/2}) exp( −(1/2) ||y − Xβ||_2^2 ).
The maximum-likelihood estimate of β is
β_ML = arg max_β L(β) = arg max_β log L(β) = arg min_β ||y − Xβ||_2.

Temperature predictor. A friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

Overfitting: if a model is very complex, it may overfit the data. To evaluate a model we separate the data into a training set and a test set: 1. we fit the model using the training set; 2. we evaluate the error on the test set.

Experiment: X_train, X_test, z_train and β are iid Gaussian with mean 0 and variance 1,
y_train = X_train β + z_train,  y_test = X_test β.
We use y_train and X_train to compute β_ls and report
error_train = ||X_train β_ls − y_train||_2 / ||y_train||_2,
error_test = ||X_test β_ls − y_test||_2 / ||y_test||_2.

[Figure: relative training and test error (l2 norm) of the least-squares fit as a function of n (50 to 500), with the training noise level shown for reference.]
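A sketch along the lines of this experiment (assumptions: p = 50, which the slide does not state, and the relative-error definitions above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50  # assumed; the slide does not state the number of predictors

for n in [50, 100, 200, 300, 400, 500]:
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    y_train = X_train @ beta + rng.standard_normal(n)
    y_test = X_test @ beta

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    print(f"n={n:4d}  train={err_train:.3f}  test={err_test:.3f}")
```

When n is close to p the training error is nearly zero while the test error is large, which is the overfitting behavior shown in the figure.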

Analysis. Data model: y = Xβ + z, where z is Gaussian noise with variance σ_z². Let σ_min / σ_max denote the smallest/largest singular values of X. Then, with high probability,
(1 − ε) p σ_z² / σ_max² ≤ ||β − β_ls||_2^2 ≤ (1 + ε) p σ_z² / σ_min².

Proof: β_ls = V Σ^{-1} U^T y, where X = U Σ V^T is the SVD of X, so
||β − β_ls||_2^2 = Σ_{j=1}^p ( U_j^T z / σ_j )^2.
U^T z is Gaussian with mean 0 and covariance σ_z² I, so ||U^T z||_2^2 / σ_z² is chi-square with p degrees of freedom and ||U^T z||_2^2 ≈ p σ_z² with high probability.

Consistency. Assume σ_z² = 1. If X is well conditioned and its entries have constant magnitude, σ_j ≈ √n, so
||β − β_ls||_2 ≈ √(p/n).
If the entries of β are constant, ||β||_2 ≈ √p, so
||β − β_ls||_2 / ||β||_2 ≈ 1/√n:
the least-squares estimator is consistent.

Experiment: X ∈ R^{n×p}, β ∈ R^p, z ∈ R^n iid N(0, 1). [Figure: relative coefficient error ||β_ls − β||_2 / ||β||_2 vs. n (50 to 20000) for p = 50, 100, 200, decaying like 1/√n.]

Example: Global warming

Maximum temperatures in Oxford, UK. [Figure: maximum temperatures (Celsius), 1860–2000.]

[Figure: maximum temperatures (Celsius), 1900–1905.]

Model fitted by least squares. [Figure: data and fitted model for the maximum temperatures, 1860–2000.]

[Figure: data and fitted model, 1900–1905.]

[Figure: data and fitted model, 1960–1965.]

Trend: increase of 0.75 °C / 100 years (1.35 °F). [Figure: data and linear trend for the maximum temperatures, 1860–2000.]

Model for minimum temperatures. [Figure: data and fitted model for the minimum temperatures, 1860–2000.]

[Figure: data and fitted model, 1900–1905.]

[Figure: data and fitted model, 1960–1965.]

Trend: increase of 0.88 °C / 100 years (1.58 °F). [Figure: data and linear trend for the minimum temperatures, 1860–2000.]

Logistic regression

Classification. Aim: predict the value of a binary response y ∈ {0, 1} from p predictors X_1, X_2, ..., X_p ∈ R. Methodology: 1. Fit a model f using n training examples y_1, y_2, ..., y_n, so that y_i ≈ f(X_{i1}, X_{i2}, ..., X_{ip}) for 1 ≤ i ≤ n. 2. Use the learned model f to predict from new data.

Logistic-regression model: we model the probability that y_i = 1 by
P(y_i = 1 | X_{i1}, X_{i2}, ..., X_{ip}) := g( β_0 + Σ_{j=1}^p β_j X_{ij} ) = 1 / ( 1 + exp( −β_0 − Σ_{j=1}^p β_j X_{ij} ) ),
where g is the logistic function or sigmoid, g(t) := 1 / (1 + exp(−t)).

Cost function: likelihood of the data under a Bernoulli model in which
P(y_i = 1 | X_{i1}, ..., X_{ip}) := g( β_0 + Σ_{j=1}^p β_j X_{ij} ),
P(y_i = 0 | X_{i1}, ..., X_{ip}) := 1 − g( β_0 + Σ_{j=1}^p β_j X_{ij} ).
Assuming independence,
L(β_0, β) = Π_{i=1}^n g( β_0 + Σ_{j=1}^p β_j X_{ij} )^{y_i} ( 1 − g( β_0 + Σ_{j=1}^p β_j X_{ij} ) )^{1−y_i}.

The log-likelihood is concave:
log L(β_0, β) = Σ_{i=1}^n [ y_i log g( β_0 + Σ_{j=1}^p β_j X_{ij} ) + (1 − y_i) log( 1 − g( β_0 + Σ_{j=1}^p β_j X_{ij} ) ) ].

Sparse regression

Model selection

Sparse regression. Assumption: the response only depends on a subset S of s ≪ p predictors,
y ≈ 1 β_0 + X_S β_S.
Regime of interest: p comparable to or larger than n. Model-selection problem: determine which predictors are relevant. Best-subset selection chooses among all possible sparse models, but is intractable unless s is very small.

Forward stepwise regression: greedy method similar to orthogonal matching pursuit. Initialization:
j_0 := arg max_j |⟨y, X_j⟩|
J := {j_0}
β_ls := arg min_β ||y − X_J β||_2
r^(0) := y − X_J β_ls

Iterations: for k = 1, 2, ...
j_k := arg max_{j ∉ J} |⟨y, P_{col(X_J)⊥} X_j⟩|
J := J ∪ {j_k}
β_ls := arg min_β ||y − X_J β||_2
r^(k) := y − X_J β_ls
Stop when the fit does not improve much anymore.
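A compact numpy sketch of this greedy loop (a simplified illustration rather than the slides' exact procedure; the stopping tolerance and function name are assumptions). Since r^(k) is the projection of y onto the orthogonal complement of col(X_J), maximizing |⟨y, P_{col(X_J)⊥} X_j⟩| is the same as maximizing the correlation of X_j with the current residual, which is what the code uses:

```python
import numpy as np

def forward_stepwise(X, y, max_steps=10, tol=1e-3):
    """Greedy forward selection: add the column most correlated with the
    current residual, then refit least squares on the selected set."""
    n, p = X.shape
    J = []
    residual = y.copy()
    prev_fit = np.inf
    for _ in range(max_steps):
        scores = np.abs(X.T @ residual)      # correlation with the residual
        scores[J] = -np.inf                  # never reselect a column
        J.append(int(np.argmax(scores)))
        beta_J, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
        residual = y - X[:, J] @ beta_J
        fit = np.linalg.norm(residual)
        if prev_fit - fit < tol:             # stop when the fit barely improves
            break
        prev_fit = fit
    beta = np.zeros(p)
    beta[J] = beta_J
    return beta, J
```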

The lasso: l1-norm regularized least-squares regression,
minimize (1/2n) ||y − Xβ||_2^2 + λ ||β||_1.
We assume that the data are centered, so β_0 = 0. Original formulation:
minimize ||y − Xβ||_2^2 subject to ||β||_1 ≤ τ.
The two are equivalent, but the relation between λ and τ depends on X and y.

Experiment: the entries of X_train and X_test are iid Gaussian(0, 1), z_train is iid Gaussian(0, 0.5); β has 10 nonzero entries, p = 50, n = 100,
y_train = X_train β + z_train,  y_test = X_test β.
For an estimate β̂ learnt from the training data,
error_train = ||X_train β̂ − y_train||_2 / ||y_train||_2,
error_test = ||X_test β̂ − y_test||_2 / ||y_test||_2.
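A rough reproduction of this setup with scikit-learn (a sketch under the stated parameters; the regularization value alpha = 0.1 and the interpretation of 0.5 as the noise standard deviation are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
p, n, s = 50, 100, 10
beta = np.zeros(p)
beta[rng.choice(p, s, replace=False)] = rng.standard_normal(s)

X_train = rng.standard_normal((n, p))
X_test = rng.standard_normal((n, p))
y_train = X_train @ beta + 0.5 * rng.standard_normal(n)   # 0.5 taken as noise std
y_test = X_test @ beta

for name, model in [("least squares", LinearRegression(fit_intercept=False)),
                    ("lasso", Lasso(alpha=0.1, fit_intercept=False))]:
    model.fit(X_train, y_train)
    err = np.linalg.norm(model.predict(X_test) - y_test) / np.linalg.norm(y_test)
    print(f"{name}: relative test error {err:.3f}")
```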

Lasso path. [Figure: lasso coefficients as a function of the regularization parameter (10^-4 to 10^1); the coefficients of the relevant features are highlighted.]

Training error. [Figure: relative training error (l2 norm) vs. n (50 to 100) for least squares, forward regression, and the lasso, with the noise level shown for reference.]

Test error. [Figure: relative test error (l2 norm) vs. n (50 to 100) for least squares, forward regression, and the lasso.]

Logistic regression: add an l1-norm regularization term to the negative log-likelihood; we minimize
− Σ_{i=1}^n [ y_i log g( β_0 + Σ_{j=1}^p β_j X_{ij} ) + (1 − y_i) log( 1 − g( β_0 + Σ_{j=1}^p β_j X_{ij} ) ) ] + λ ||β||_1.
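In scikit-learn, l1-regularized logistic regression is available through LogisticRegression with penalty="l1"; a minimal sketch on synthetic data (the value of C, which plays the role of 1/λ, is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0                         # only 5 relevant predictors
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ beta)))).astype(int)

# l1-penalized logistic regression; smaller C means stronger regularization
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print("nonzero coefficients:", np.sum(clf.coef_ != 0))
```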

Arrhythmia prediction: predict whether a patient has arrhythmia from n = 271 examples and p = 182 predictors: age, sex, height, weight, and features obtained from electrocardiogram recordings. The best sparse model uses around 60 predictors.

Prediction accuracy. [Figure: misclassification error (0.25–0.45) vs. log(lambda); the number of selected predictors, shown along the top, decreases from 182 to 5 as lambda grows.]

Analysis of the lasso

Least-squares analysis. Data model: y = Xβ + z, where z is Gaussian noise with variance σ_z² and β is s-sparse. Assumption: X is well conditioned and its entries have constant magnitude, so σ_min ≈ √n and the columns of X have l2 norm ≈ √n. For the least-squares estimator we have
||β − β_ls||_2 ≈ σ_z √(p/n).
If p ≈ n but s ≪ p, we should be able to do better!

Restricted eigenvalue property: there exists γ > 0 such that, for any subset T with |T| ≤ s and any v ∈ R^p satisfying ||v_{T^c}||_1 ≤ ||v_T||_1,
(1/n) ||X v||_2^2 ≥ γ ||v||_2^2.
Similar to the restricted isometry property; it may hold even if p > n!

Guarantees: if X satisfies the restricted eigenvalue property and τ = ||β||_1, the solution β_lasso of
minimize ||y − Xβ̃||_2^2 subject to ||β̃||_1 ≤ τ
satisfies
||β − β_lasso||_2 ≤ (σ_z / γ) √( 32 α s log p / n )
with probability 1 − 2 exp( −(α − 1) log p ) for any α > 1. This is close to the least-squares error we would obtain if we knew which predictors are relevant.

Proof: the error h := β − β_lasso satisfies ||h_{T^c}||_1 ≤ ||h_T||_1 (since ||β_lasso||_1 ≤ τ = ||β||_1 and β is supported on T), so by the restricted eigenvalue property
||β − β_lasso||_2^2 ≤ (1 / (γ n)) ||X h||_2^2.

By optimality of β_lasso,
||X h||_2^2 ≤ 2 z^T X h ≤ 2 ||X^T z||_∞ ||h||_1.
Since ||h_{T^c}||_1 ≤ ||h_T||_1, we have ||h||_1 ≤ 2 √s ||h||_2, so
||β − β_lasso||_2^2 ≤ (4 √s / (γ n)) ||X^T z||_∞ ||β − β_lasso||_2.

X_i^T z is Gaussian with variance σ_z² ||X_i||_2^2, so for t > 0
P( |X_i^T z| > t σ_z ||X_i||_2 ) ≤ 2 exp( −t²/2 ).
By the union bound,
P( ||X^T z||_∞ > t σ_z max_i ||X_i||_2 ) ≤ 2p exp( −t²/2 ) = 2 exp( −t²/2 + log p ).

Choosing t = √(2 α log p) with α > 1, and using max_i ||X_i||_2 = √n, we have
P( ||X^T z||_∞ > σ_z √(2 α n log p) ) ≤ 2 exp( −(α − 1) log p ).
We conclude that
||β − β_lasso||_2 ≤ (σ_z / γ) √( 32 α s log p / n )
with probability 1 − 2 exp( −(α − 1) log p ).

Correlated predictors

Correlated predictors: the error of the least-squares estimator is
||β − β_ls||_2^2 = Σ_{j=1}^p ( U_j^T z / σ_j )^2.
If the predictors are strongly correlated, some singular values can be very small, and the estimator suffers from noise amplification.

Ridge regression. Aim: control the norm of the weights,
minimize ||y − Xβ||_2^2 + λ ||β||_2^2.
Equivalent to the least-squares problem
minimize || [y; 0] − [X; √λ I] β ||_2^2.

If y = Xβ + z and X = U Σ V^T,
β_ridge = V diag( σ_j² / (σ_j² + λ) ) V^T β + V diag( σ_j / (σ_j² + λ) ) U^T z,  j = 1, ..., p.
The shrinkage factors decrease noise amplification when the singular values are small.
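A numpy sketch of this closed form via the SVD (illustrative only; it mirrors the shrinkage expression above and checks it against the normal-equations solution):

```python
import numpy as np

def ridge_svd(X, y, lam):
    """Ridge-regression estimate computed from the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = s / (s**2 + lam)            # shrunk inverse singular values
    return Vt.T @ (shrink * (U.T @ y))   # beta_ridge = V diag(shrink) U^T y

# Sanity check against the normal-equations form (X^T X + lam I)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10)); y = rng.standard_normal(30); lam = 0.5
ref = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
assert np.allclose(ridge_svd(X, y, lam), ref)
```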

Correlated predictors in model selection: the lasso controls noise amplification but tends to choose only a small subset of the relevant predictors. Aim: select all relevant predictors. Why? Interpretability and improved prediction.

Identical predictors: if two predictors are the same, X_i = X_j, then β_i and β_j should be the same. This is true for any cost function of the form
minimize (1/2n) ||y − Xβ||_2^2 + λ R(β)
as long as the regularizer R is strictly convex. It is not true for the lasso.

Experiment: the entries of z_train, Z_train and Z_test are iid Gaussian(0, 1),
y_train := X_A,train β_A + X_B,train β_B + z_train,  y_test := X_A,test β_A + X_B,test β_B,
X_train := [X_A,train X_B,train Z_train],  X_test := [X_A,test X_B,test Z_test].

The rows of X_A,train, X_B,train, X_A,test, X_B,test are independent Gaussian random vectors with mean zero and covariance matrix
Σ := [ 1 0.95 0.95 0.95 ; 0.95 1 0.95 0.95 ; 0.95 0.95 1 0.95 ; 0.95 0.95 0.95 1 ].
β_A and β_B have 4 nonzero entries each.

Lasso path. [Figure: lasso coefficient path vs. regularization parameter (10^-4 to 10^1); coefficients colored by group (Group 1, Group 2).]

Ridge-regression path. [Figure: ridge-regression coefficient path vs. regularization parameter (10^-4 to 10^5); coefficients colored by group (Group 1, Group 2).]

Elastic net: combines the lasso and ridge-regression penalties,
minimize ||y − Xβ||_2^2 + λ ( (1 − α)/2 ||β||_2^2 + α ||β||_1 ).
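scikit-learn's ElasticNet uses essentially this parametrization, with l1_ratio playing the role of α and alpha the role of λ (up to the 1/2n scaling of the data-fit term); a minimal sketch with placeholder data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
beta = np.zeros(50); beta[:8] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(100)

# l1_ratio plays the role of alpha in the slide's formula (here 0.5),
# and alpha plays the role of lambda
model = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
model.fit(X, y)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```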

Elastic-net path, α = 0.5. [Figure: elastic-net coefficient path vs. regularization parameter (10^-4 to 10^1); coefficients colored by group (Group 1, Group 2).]

Errors. [Figure: relative training and test error (l2 norm) vs. regularization parameter for least squares, ridge, elastic net, and the lasso.]

Group sparsity

Group sparse structure: the predictors may be partitioned into m groups G_1, G_2, ..., G_m,
y ≈ β_0 + Xβ = β_0 + [X_{G_1} X_{G_2} ... X_{G_m}] [β_{G_1}; β_{G_2}; ...; β_{G_m}].
Aim: select the groups that are relevant.

Multi-task learning: several responses y_1, y_2, ..., y_k are modeled with the same predictors,
Y = [y_1 y_2 ... y_k] ≈ B_0 + XB = B_0 + X [β_1 β_2 ... β_k].
Assumption: the responses depend on the same subset of predictors. Aim: learn a group-sparse model.

Mixed l1/l2 norm:
||β||_{1,2} := Σ_{i=1}^m ||β_{G_i}||_2
promotes group-sparse structure. For multitask learning the groups are the rows of B, so
||B||_{1,2} := Σ_{i=1}^p ||B_{i,:}||_2.

Geometric intuition: G_1 = {1, 2}, G_2 = {3}. [Figure: the l1-norm ball vs. the l1/l2-norm ball; figure taken from Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright.]

Group lasso:
minimize ||y − β_0 − Xβ||_2^2 + λ ||β||_{1,2}.
Multi-task lasso:
minimize ||Y − B_0 − XB||_F^2 + λ ||B||_{1,2}.

Proximal gradient method: a method to solve the optimization problem
minimize f(x) + g(x),
where f is differentiable and prox_g is tractable. Proximal-gradient iteration:
x^(0) = arbitrary initialization,
x^(k+1) = prox_{α_k g}( x^(k) − α_k ∇f(x^(k)) ).

Subdifferential of the l1/l2 norm: q ∈ R^p is a subgradient of the l1/l2 norm at β ∈ R^p if
q_{G_i} = β_{G_i} / ||β_{G_i}||_2 if β_{G_i} ≠ 0,
||q_{G_i}||_2 ≤ 1 if β_{G_i} = 0.

The proximal operator of the l1/l2 norm is block soft-thresholding,
prox_{α ||·||_{1,2}}(β) = BS_α(β), where α > 0 and
BS_α(β)_{G_i} := β_{G_i} − α β_{G_i} / ||β_{G_i}||_2 if ||β_{G_i}||_2 ≥ α,
BS_α(β)_{G_i} := 0 otherwise.

Iterative Shrinkage-Thresholding (ISTA) for the l1/l2 norm: proximal gradient method for the problem
minimize (1/2) ||Ax − y||_2^2 + λ ||x||_{1,2}.
ISTA iteration:
x^(0) = arbitrary initialization,
x^(k+1) = BS_{α_k λ}( x^(k) − α_k A^T (A x^(k) − y) ).
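A self-contained numpy sketch of this iteration (a simplified illustration, not the slides' code: the fixed step size 1/L, contiguous equal-sized groups, and a fixed number of iterations are assumptions):

```python
import numpy as np

def block_soft_threshold(beta, alpha, groups):
    """Proximal operator of the l1/l2 norm: shrink each group towards zero."""
    out = np.zeros_like(beta)
    for g in groups:
        norm = np.linalg.norm(beta[g])
        if norm >= alpha:
            out[g] = (1 - alpha / norm) * beta[g]
    return out

def ista_group(A, y, lam, groups, n_iter=500):
    """Proximal gradient method for (1/2)||Ax - y||^2 + lam * ||x||_{1,2}."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = block_soft_threshold(x - step * grad, step * lam, groups)
    return x

# Example: 20 groups of 5 coefficients, only 3 groups active
rng = np.random.default_rng(0)
groups = [np.arange(5 * i, 5 * (i + 1)) for i in range(20)]
x_true = np.zeros(100)
for g in groups[:3]:
    x_true[g] = rng.standard_normal(5)
A = rng.standard_normal((60, 100))
y = A @ x_true + 0.1 * rng.standard_normal(60)
x_hat = ista_group(A, y, lam=0.5, groups=groups)
print("active groups:", [i for i, g in enumerate(groups) if np.linalg.norm(x_hat[g]) > 1e-8])
```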

Speech denoising: STFT thresholding. [Figure: thresholded STFT coefficients, frequency vs. time.]

Speech denoising: STFT block thresholding. [Figure: block-thresholded STFT coefficients, frequency vs. time.]

Experiment: the entries of X_train and X_test are iid Gaussian(0, 1), Z_train is iid Gaussian(0, 0.5); B ∈ R^{p×k} has s nonzero rows, p = 50, n = 100,
Y_train = X_train B + Z_train,  Y_test = X_test B.
For an estimate B̂ learnt from the training data,
error_test = ||X_test B̂ − Y_test||_F / ||Y_test||_F.
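scikit-learn's MultiTaskLasso penalizes exactly this row-wise l1/l2 norm; a sketch of the comparison suggested by the experiment (the regularization value alpha = 0.1 and the interpretation of 0.5 as the noise standard deviation are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, MultiTaskLasso

rng = np.random.default_rng(0)
p, n, k, s = 50, 100, 40, 4
B = np.zeros((p, k))
B[:s, :] = rng.standard_normal((s, k))            # s nonzero rows shared by all tasks

X_train = rng.standard_normal((n, p))
X_test = rng.standard_normal((n, p))
Y_train = X_train @ B + 0.5 * rng.standard_normal((n, k))
Y_test = X_test @ B

for name, model in [("lasso", Lasso(alpha=0.1, fit_intercept=False)),
                    ("multi-task lasso", MultiTaskLasso(alpha=0.1, fit_intercept=False))]:
    model.fit(X_train, Y_train)
    err = np.linalg.norm(model.predict(X_test) - Y_test) / np.linalg.norm(Y_test)
    print(f"{name}: relative test error {err:.3f}")
```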

Test error, s = 4. [Figure: relative test error (l2 norm) vs. regularization parameter for the lasso and the multi-task lasso, with k = 4 and k = 40 tasks.]

Test error, s = 30. [Figure: relative test error (l2 norm) vs. regularization parameter for the lasso and the multi-task lasso, with k = 4 and k = 40 tasks.]

Magnitude of coefficients, s = 30, k = 40. [Figure: magnitude of the coefficients estimated by the lasso and the multi-task lasso, compared with the original coefficients.]