Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Regression Aim: Predict the value of a response / dependent variable y ∈ R from p predictors / features / independent variables X_1, X_2, ..., X_p ∈ R. Methodology: 1. Fit a model f using n training examples y_1, y_2, ..., y_n, so that y_i ≈ f(X_i1, X_i2, ..., X_ip) for 1 ≤ i ≤ n. 2. Use the learned model f to predict from new data.
Linear regression f is parametrized by an intercept β_0 and a vector of weights β ∈ R^p: y_i ≈ β_0 + Σ_{j=1}^p β_j X_ij, 1 ≤ i ≤ n. In matrix form, y ≈ 1 β_0 + X β, where row i of X ∈ R^{n×p} contains the predictor values X_i1, ..., X_ip of example i. Aim: Learn β_0 and β from the training data.
Linear models Signal representation (compression, denoising) x = D c Aim: Find sparse representation of the signal x Columns of D are designed/learned atoms
Linear models Inverse problems (compressed sensing, super-resolution) y = A x Aim: Estimate the signal x from the measurements y A models the measurement process
Linear models Linear regression y = X β Aim: Learn the relation between the response y and the predictors X 1, X 2,..., X p to predict from new data Each column of X contains a variable that may or may not be related to the response
Preprocessing 1. Center each predictor column X_j by subtracting its mean, so that Σ_{i=1}^n X_ij = 0 for all 1 ≤ j ≤ p. 2. Normalize each predictor column X_j so that ‖X_j‖_2 = 1 for all 1 ≤ j ≤ p. 3. Center the response vector y so that Σ_{i=1}^n y_i = 0.
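A minimal numpy sketch of these three steps (the function name and layout are ours):

```python
import numpy as np

def preprocess(X, y):
    """Center and normalize the predictor columns, center the response."""
    X = X - X.mean(axis=0)              # each column of X now sums to zero
    X = X / np.linalg.norm(X, axis=0)   # each column now has unit l2 norm
    y = y - y.mean()                    # centered response
    return X, y
```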
Least-squares fit Minimize the ℓ_2 norm of the error over the training data: minimize ‖y − Xβ‖_2. After preprocessing β_0 = 0, so we can ignore it.
Geometric interpretation Decompose y into its projection onto the column space of X and onto the orthogonal complement: y = y_X + y_{X⊥}. By Pythagoras's theorem, ‖y − Xβ‖_2^2 = ‖y_{X⊥}‖_2^2 + ‖y_X − Xβ‖_2^2. If n ≥ p and X is full rank, β_ls = (X^T X)^{-1} X^T y satisfies X β_ls = y_X. After preprocessing β_0 = 0.
Probabilistic interpretation: Maximum likelihood Assumption: The linear model is correct, but the examples are corrupted by additive iid Gaussian noise: y = Xβ + z. The likelihood is the probability density function of y parametrized by β: L(β) = 1/√((2π)^n) · exp(−(1/2) ‖y − Xβ‖_2^2). The maximum-likelihood estimate of β is β_ML = arg max_β L(β) = arg max_β log L(β) = arg min_β ‖y − Xβ‖_2.
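In code, the least-squares fit can be computed with a standard solver; a sketch assuming preprocessed (centered) data, so the intercept is dropped:

```python
import numpy as np

def least_squares(X, y):
    """Return beta_ls = arg min ||y - X beta||_2."""
    # np.linalg.lstsq is numerically safer than forming (X^T X)^{-1} X^T y
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_ls
```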
Temperature predictor A friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"
Overfitting If a model is very complex, it may overfit the data. To evaluate a model we separate the data into a training set and a test set: 1. We fit the model using the training set. 2. We evaluate the error on the test set.
Experiment X_train, X_test, z_train and β are iid Gaussian with mean 0 and variance 1. y_train = X_train β + z_train, y_test = X_test β. We use y_train and X_train to compute β_ls. error_train = ‖X_train β_ls − y_train‖_2 / ‖y_train‖_2, error_test = ‖X_test β_ls − y_test‖_2 / ‖y_test‖_2.
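A sketch of this simulation; the number of predictors p is not stated on the slide, so p = 50 below is an assumption:

```python
import numpy as np

def ls_train_test_errors(n, p=50, seed=0):
    """Relative training and test errors of least squares on iid Gaussian data."""
    rng = np.random.default_rng(seed)
    X_train, X_test = rng.standard_normal((n, p)), rng.standard_normal((n, p))
    beta, z_train = rng.standard_normal(p), rng.standard_normal(n)
    y_train = X_train @ beta + z_train
    y_test = X_test @ beta                       # noiseless test response
    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    return err_train, err_test
```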
Experiment [Figure: relative training and test errors (ℓ_2 norm) versus n, from 50 to 500, with the training noise level shown for reference]
Analysis Data model: y = Xβ + z, where z is Gaussian noise with variance σ_z^2. σ_min / σ_max are the smallest / largest singular values of X. With high probability, (1 − ɛ) p σ_z^2 / σ_max^2 ≤ ‖β − β_ls‖_2^2 ≤ (1 + ɛ) p σ_z^2 / σ_min^2.
Proof β_ls = V Σ^{-1} U^T y, where X = U Σ V^T, so ‖β − β_ls‖_2^2 = Σ_{j=1}^p (U_j^T z / σ_j)^2. U^T z is Gaussian with mean 0 and covariance σ_z^2 I, so ‖U^T z‖_2^2 / σ_z^2 is chi-square with p degrees of freedom and ‖U^T z‖_2^2 ≈ p σ_z^2.
Consistency Assume σ_z^2 = 1. If X is well conditioned and its entries have constant magnitude, σ_j ≈ √n, so ‖β − β_ls‖_2 ≈ √(p/n). If the entries of β have constant magnitude, ‖β‖_2 ≈ √p. The least-squares estimator is consistent: ‖β − β_ls‖_2 / ‖β‖_2 ≈ 1/√n.
Experiment: X ∈ R^{n×p}, β ∈ R^p, z ∈ R^n iid N(0, 1). [Figure: relative coefficient error (ℓ_2 norm) versus n for p = 50, 100, 200, compared with the 1/√n rate]
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Maximum temperatures in Oxford, UK [Figure: monthly maximum temperatures (Celsius), 1860–2000]
Maximum temperatures in Oxford, UK [Figure: detail of the years 1900–1905]
Model fitted by least squares [Figure: data and fitted model, 1860–2000]
Model fitted by least squares [Figure: data and fitted model, 1900–1905]
Model fitted by least squares [Figure: data and fitted model, 1960–1965]
Trend: Increase of 0.75 °C / 100 years (1.35 °F) [Figure: maximum-temperature data and fitted trend, 1860–2000]
Model for minimum temperatures [Figure: data and fitted model, 1860–2000]
Model for minimum temperatures [Figure: data and fitted model, 1900–1905]
Model for minimum temperatures [Figure: data and fitted model, 1960–1965]
Trend: Increase of 0.88 °C / 100 years (1.58 °F) [Figure: minimum-temperature data and fitted trend, 1860–2000]
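The slides do not spell out the fitted model; a plausible sketch fits an intercept, a linear trend and a yearly sinusoid to monthly temperatures by least squares and reads off the warming rate (the data array and its monthly sampling are assumptions):

```python
import numpy as np

def warming_rate(temps, t):
    """temps: monthly temperatures; t: time of each sample in years.
    Returns the fitted linear trend in degrees Celsius per 100 years."""
    # Design matrix: intercept, linear trend, yearly cosine and sine
    A = np.column_stack([np.ones_like(t), t,
                         np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
    coef, *_ = np.linalg.lstsq(A, temps, rcond=None)
    return 100 * coef[1]
```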
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Classification Aim: Predict the value of a binary response y ∈ {0, 1} from p predictors X_1, X_2, ..., X_p ∈ R. Methodology: 1. Fit a model f using n training examples y_1, y_2, ..., y_n, so that y_i ≈ f(X_i1, X_i2, ..., X_ip) for 1 ≤ i ≤ n. 2. Use the learned model f to predict from new data.
Logistic-regression model We model the probability that y_i = 1 by P(y_i = 1 | X_i1, X_i2, ..., X_ip) := g(β_0 + Σ_{j=1}^p β_j X_ij) = 1 / (1 + exp(−β_0 − Σ_{j=1}^p β_j X_ij)), where g is the logistic function or sigmoid, g(t) := 1 / (1 + exp(−t)).
Cost function Likelihood of the data under a Bernoulli model in which P(y_i = 1 | X_i1, ..., X_ip) := g(β_0 + Σ_{j=1}^p β_j X_ij) and P(y_i = 0 | X_i1, ..., X_ip) := 1 − g(β_0 + Σ_{j=1}^p β_j X_ij). Assuming independence, L(β_0, β) = Π_{i=1}^n g(β_0 + Σ_{j=1}^p β_j X_ij)^{y_i} (1 − g(β_0 + Σ_{j=1}^p β_j X_ij))^{1 − y_i}.
Cost function The log likelihood is concave: log L(β_0, β) = Σ_{i=1}^n y_i log g(β_0 + Σ_{j=1}^p β_j X_ij) + (1 − y_i) log(1 − g(β_0 + Σ_{j=1}^p β_j X_ij)).
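A direct numpy transcription of the sigmoid and the log likelihood (numerical safeguards against log(0) are omitted for brevity):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta0, beta, X, y):
    """Log likelihood of the logistic-regression model (to be maximized)."""
    prob = sigmoid(beta0 + X @ beta)
    return np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
```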
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Sparse regression Assumption: The response only depends on a subset S of s ≪ p predictors: y ≈ 1 β_0 + X_S β_S. Regime of interest: p ≥ n. Model-selection problem: determine which predictors are relevant. Best-subset selection: choose among all possible sparse models; intractable unless s is very small.
Forward stepwise regression Greedy method similar to orthogonal matching pursuit. Initialization: j_0 := arg max_j |⟨y, X_j⟩|, J := {j_0}, β_ls := arg min_β ‖y − X_J β‖_2, r^(0) := y − X_J β_ls.
Forward stepwise regression Iterations, for k = 1, 2, ...: j_k := arg max_{j ∉ J} |⟨y, P_{col(X_J)⊥} X_j⟩|, J := J ∪ {j_k}, β_ls := arg min_β ‖y − X_J β‖_2, r^(k) := y − X_J β_ls. Stop when the fit does not improve much anymore.
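A sketch of forward stepwise regression in numpy; it runs for a fixed number of steps instead of the improvement-based stopping rule on the slide:

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedily add the predictor whose projection onto the orthogonal
    complement of the selected columns is most correlated with y."""
    n, p = X.shape
    J = []
    for _ in range(n_steps):
        if J:
            Q, _ = np.linalg.qr(X[:, J])      # orthonormal basis of col(X_J)
            P_orth = np.eye(n) - Q @ Q.T      # projector onto its complement
        else:
            P_orth = np.eye(n)
        scores = [abs(y @ P_orth @ X[:, j]) if j not in J else -np.inf
                  for j in range(p)]
        J.append(int(np.argmax(scores)))
    beta_ls, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
    return J, beta_ls
```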
The lasso ℓ_1-norm regularized least-squares regression: minimize (1/2n) ‖y − Xβ‖_2^2 + λ ‖β‖_1. We assume that the data are centered, so β_0 = 0. Original formulation: minimize ‖y − Xβ‖_2^2 subject to ‖β‖_1 ≤ τ. Equivalent, but the relation between λ and τ depends on X and y.
Experiment X_train and X_test are iid Gaussian(0, 1), z_train is iid Gaussian(0, 0.5). β has 10 nonzero entries, p = 50, n = 100. y_train = X_train β + z_train, y_test = X_test β. For an estimate β̂ learnt from the training data, error_train = ‖X_train β̂ − y_train‖_2 / ‖y_train‖_2 and error_test = ‖X_test β̂ − y_test‖_2 / ‖y_test‖_2.
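A sketch of this experiment with scikit-learn's Lasso, whose objective matches the (1/2n) formulation above with alpha playing the role of λ; the noise parameter 0.5 is interpreted here as a standard deviation, and the value of alpha is our choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 50, 10
X_train, X_test = rng.standard_normal((n, p)), rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = rng.standard_normal(s)                # 10 nonzero entries
y_train = X_train @ beta + 0.5 * rng.standard_normal(n)
y_test = X_test @ beta

lasso = Lasso(alpha=0.1).fit(X_train, y_train)
err_test = np.linalg.norm(lasso.predict(X_test) - y_test) / np.linalg.norm(y_test)
```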
Lasso path [Figure: lasso coefficients as a function of the regularization parameter; the relevant features are highlighted]
Training error [Figure: relative training error (ℓ_2 norm) of least squares, forward regression and the lasso versus n, with the noise level shown for reference]
Test error [Figure: relative test error (ℓ_2 norm) of least squares, forward regression and the lasso versus n]
Logistic regression Add an ℓ_1-norm regularization term to the negative log likelihood: we minimize −Σ_{i=1}^n [ y_i log g(β_0 + Σ_{j=1}^p β_j X_ij) + (1 − y_i) log(1 − g(β_0 + Σ_{j=1}^p β_j X_ij)) ] + λ ‖β‖_1.
Arrhythmia prediction Predict whether a patient has arrhythmia from n = 271 examples and p = 182 predictors: age, sex, height, weight, and features obtained from electrocardiogram recordings. The best sparse model uses around 60 predictors.
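A scikit-learn sketch of an ℓ_1-regularized logistic fit of this kind; the arrhythmia data are not reproduced here, so synthetic stand-ins of the same size are used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 271, 182
X = rng.standard_normal((n, p))       # placeholder features
y = rng.integers(0, 2, size=n)        # placeholder binary labels

# C is the inverse of the regularization strength (roughly 1 / lambda)
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
n_selected = int((model.coef_ != 0).sum())   # predictors kept by the l1 penalty
```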
Prediction accuracy [Figure: misclassification error versus log(lambda); the numbers along the top axis (182, 172, 174, 159, 107, 62, 18, 5) are the number of nonzero coefficients]
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Least-squares analysis Data model: y = Xβ + z, where z is Gaussian noise with variance σ_z^2 and β is s-sparse. Assumption: X is well conditioned and its entries have constant magnitude, so σ_min ≈ √n and the columns of X have ℓ_2 norm ≈ √n. For the least-squares estimator, ‖β − β_ls‖_2 ≈ σ_z √(p/n). If p is of the same order as n but s ≪ p, we should do better!
Restricted eigenvalue property There exists γ > 0 such that for any subset T with |T| ≤ s and any v ∈ R^p satisfying ‖v_{T^c}‖_1 ≤ ‖v_T‖_1, we have (1/n) ‖Xv‖_2^2 ≥ γ ‖v‖_2^2. Similar to the restricted isometry property; may hold even if p > n!
Guarantees If X satisfies the RE property and τ = ‖β‖_1, the solution β_lasso to minimize ‖y − Xβ‖_2^2 subject to ‖β‖_1 ≤ τ satisfies ‖β − β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n) with probability 1 − 2 exp(−(α − 1) log p), for any α > 1. Close to the least-squares error we would get if we knew which predictors are relevant.
Proof The error h := β − β_lasso satisfies ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, so ‖β − β_lasso‖_2^2 ≤ (1/(γn)) ‖Xh‖_2^2.
Proof By optimality of β_lasso, ‖Xh‖_2^2 ≤ 2 z^T X h ≤ 2 ‖X^T z‖_∞ ‖h‖_1. Since ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, ‖h‖_1 ≤ 2√s ‖h‖_2, so ‖β − β_lasso‖_2^2 ≤ (4√s / (γn)) ‖X^T z‖_∞ ‖β − β_lasso‖_2.
Proof X_i^T z is Gaussian with mean 0 and variance σ_z^2 ‖X_i‖_2^2, so for t > 0, P(|X_i^T z| > t σ_z ‖X_i‖_2) ≤ 2 exp(−t^2/2). By the union bound, P(‖X^T z‖_∞ > t σ_z max_i ‖X_i‖_2) ≤ 2p exp(−t^2/2) = 2 exp(−t^2/2 + log p).
Proof Choosing t = √(2α log p), α > 1, and using max_i ‖X_i‖_2 = √n, we have P(‖X^T z‖_∞ > σ_z √(2αn log p)) ≤ 2 exp(−(α − 1) log p). We conclude that ‖β − β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n) with probability 1 − 2 exp(−(α − 1) log p).
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Correlated predictors Error of the least-squares estimator: ‖β − β_ls‖_2^2 = Σ_{j=1}^p (U_j^T z / σ_j)^2. If the predictors are strongly correlated, the singular values can be very small, and the estimator suffers from noise amplification.
Ridge regression Aim: Control the norm of the weights: minimize ‖y − Xβ‖_2^2 + λ^2 ‖β‖_2^2. Equivalent to the regression problem minimize ‖ [y; 0] − [X; λI] β ‖_2^2.
Ridge regression If y = Xβ + z, β_ridge = V diag(σ_1^2/(σ_1^2 + λ^2), σ_2^2/(σ_2^2 + λ^2), ..., σ_p^2/(σ_p^2 + λ^2)) V^T β + V diag(σ_1/(σ_1^2 + λ^2), σ_2/(σ_2^2 + λ^2), ..., σ_p/(σ_p^2 + λ^2)) U^T z. This decreases noise amplification when the singular values are small.
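A sketch that implements this SVD formula (with the λ^2 ‖β‖_2^2 penalty above) and checks it against the augmented least-squares formulation:

```python
import numpy as np

def ridge_svd(X, y, lam):
    """Ridge solution V diag(sigma_j / (sigma_j^2 + lam^2)) U^T y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam**2)) * (U.T @ y))

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 0.7
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
X_aug = np.vstack([X, lam * np.eye(p)])          # [X; lam I]
y_aug = np.concatenate([y, np.zeros(p)])         # [y; 0]
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
assert np.allclose(ridge_svd(X, y, lam), beta_aug)
```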
Correlated predictors in model selection Lasso controls norm amplification but tends to choose a small subset of the relevant predictors Aim: Selecting all relevant predictors Why? Interpretability + improving prediction
Identical predictors If two predictors are identical, X_i = X_j, then β_i and β_j should be the same. This is true for any cost function of the form minimize (1/2n) ‖y − Xβ‖_2^2 + λ R(β) as long as the regularizer R is strictly convex. It is not true for the lasso.
Experiment The entries of z_train, Z_train and Z_test are iid Gaussian(0, 1). y_train := X_{A,train} β_A + X_{B,train} β_B + z_train, y_test := X_{A,test} β_A + X_{B,test} β_B. X_train := [X_{A,train} X_{B,train} Z_train], X_test := [X_{A,test} X_{B,test} Z_test].
Experiment The rows of X_{A,train}, X_{B,train}, X_{A,test}, X_{B,test} are independent Gaussian random vectors with mean zero and covariance matrix Σ := [1 0.95 0.95 0.95; 0.95 1 0.95 0.95; 0.95 0.95 1 0.95; 0.95 0.95 0.95 1]. β_A and β_B have 4 nonzero entries each.
Lasso path [Figure: lasso coefficients versus the regularization parameter; the coefficients of the two correlated groups are highlighted]
Ridge-regression path [Figure: ridge coefficients versus the regularization parameter; the coefficients of the two correlated groups are highlighted]
Elastic net Combines the lasso and ridge-regression penalties: minimize ‖y − Xβ‖_2^2 + λ ( (1 − α)/2 ‖β‖_2^2 + α ‖β‖_1 ).
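scikit-learn's ElasticNet uses essentially this parametrization (alpha playing the role of λ, l1_ratio the role of α, with the data-fit term scaled by 1/(2n)); a minimal sketch with random data and parameter values of our choosing:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 50
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)

# Objective: (1/2n)||y - Xb||^2 + alpha*(l1_ratio*||b||_1 + 0.5*(1 - l1_ratio)*||b||_2^2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```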
Elastic-net path, α = 0.5 [Figure: elastic-net coefficients versus the regularization parameter; the coefficients of the two correlated groups are highlighted]
Errors [Figure: relative training and test errors (ℓ_2 norm) of least squares, ridge, elastic net and the lasso versus the regularization parameter]
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Group sparse structure The predictors may be partitioned into m groups G_1, G_2, ..., G_m: y ≈ β_0 + Xβ = β_0 + [X_{G_1} X_{G_2} ... X_{G_m}] [β_{G_1}; β_{G_2}; ...; β_{G_m}]. Aim: Select the groups that are relevant.
Multi-task learning Several responses y_1, y_2, ..., y_k are modeled with the same predictors: Y = [y_1 y_2 ... y_k] ≈ B_0 + XB = B_0 + X [β_1 β_2 ... β_k]. Assumption: The responses depend on the same subset of predictors. Aim: Learn a group-sparse model.
Mixed ℓ_1/ℓ_2 norm The mixed ℓ_1/ℓ_2 norm ‖β‖_{1,2} := Σ_{i=1}^m ‖β_{G_i}‖_2 promotes group-sparse structure. For multitask learning, ‖B‖_{1,2} := Σ_{i=1}^p ‖B_{i:}‖_2, where B_{i:} is the i-th row of B (the coefficients of predictor i across the k tasks).
Geometric intuition G_1 = {1, 2}, G_2 = {3}. [Figure: the ℓ_1-norm ball and the ℓ_1/ℓ_2-norm ball. Taken from Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright]
Group and multi-task lasso Group lasso: minimize ‖y − β_0 − Xβ‖_2^2 + λ ‖β‖_{1,2}. Multi-task lasso: minimize ‖Y − B_0 − XB‖_F^2 + λ ‖B‖_{1,2}.
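A sketch of the multi-task lasso with scikit-learn's MultiTaskLasso, which penalizes the sum of the ℓ_2 norms of the rows of the coefficient matrix; the sizes and regularization value are our choices:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, k, s = 100, 50, 4, 4
X = rng.standard_normal((n, p))
B = np.zeros((p, k))
B[:s] = rng.standard_normal((s, k))          # s nonzero rows shared by all tasks
Y = X @ B + 0.5 * rng.standard_normal((n, k))

mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)
# mtl.coef_ has shape (k, p); its nonzero columns indicate the selected predictors
selected = np.flatnonzero(np.linalg.norm(mtl.coef_, axis=0))
```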
Proximal gradient method Method to solve the optimization problem minimize f(x) + g(x), where f is differentiable and prox_g is tractable. Proximal-gradient iteration: x^(0) = arbitrary initialization, x^(k+1) = prox_{α_k g}(x^(k) − α_k ∇f(x^(k))).
Subdifferential of the ℓ_1/ℓ_2 norm q ∈ R^p is a subgradient of the ℓ_1/ℓ_2 norm at β ∈ R^p if q_{G_i} = β_{G_i} / ‖β_{G_i}‖_2 whenever β_{G_i} ≠ 0, and ‖q_{G_i}‖_2 ≤ 1 whenever β_{G_i} = 0.
Proximal operator of the ℓ_1/ℓ_2 norm The proximal operator of the ℓ_1/ℓ_2 norm is block soft-thresholding: prox_{α‖·‖_{1,2}}(β) = BS_α(β), where α > 0 and BS_α(β)_{G_i} := β_{G_i} − α β_{G_i} / ‖β_{G_i}‖_2 if ‖β_{G_i}‖_2 ≥ α, and 0 otherwise.
Iterative Shrinkage-Thresholding for the ℓ_1/ℓ_2 norm Proximal gradient method for the problem minimize (1/2) ‖Ax − y‖_2^2 + λ ‖x‖_{1,2}. ISTA iteration: x^(0) = arbitrary initialization, x^(k+1) = BS_{α_k λ}(x^(k) − α_k A^T(Ax^(k) − y)).
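A minimal numpy sketch of block soft-thresholding and the ISTA iteration above; the group structure, step size and iteration count are assumptions (the step should be at most 1/‖A^T A‖ for convergence):

```python
import numpy as np

def block_soft_threshold(x, alpha, groups):
    """BS_alpha applied group by group; `groups` is a list of index arrays."""
    out = np.zeros_like(x)
    for g in groups:
        norm = np.linalg.norm(x[g])
        if norm >= alpha and norm > 0:
            out[g] = x[g] - alpha * x[g] / norm
    return out

def ista_group(A, y, lam, groups, step, n_iter=500):
    """ISTA for (1/2)||Ax - y||_2^2 + lam * ||x||_{1,2} with a fixed step size."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = block_soft_threshold(x - step * grad, step * lam, groups)
    return x
```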
Speech denoising: STFT thresholding [Figure: spectrogram, frequency versus time]
Speech denoising: STFT block thresholding [Figure: spectrogram, frequency versus time]
Experiment X_train and X_test are iid Gaussian(0, 1), Z_train is iid Gaussian(0, 0.5). B ∈ R^{p×k} has s nonzero rows, p = 50, n = 100. Y_train = X_train B + Z_train, Y_test = X_test B. For an estimate B̂ learnt from the training data, error_test = ‖X_test B̂ − Y_test‖_F / ‖Y_test‖_F.
Test error, s = 4 [Figure: relative test error versus the regularization parameter for the lasso and the multi-task lasso, with k = 4 and k = 40 tasks]
Test error, s = 30 [Figure: relative test error versus the regularization parameter for the lasso and the multi-task lasso, with k = 4 and k = 40 tasks]
Magnitude of coefficients, s = 30, k = 40 [Figure: magnitudes of the coefficients estimated by the lasso and the multitask lasso, compared with the original B]