Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Regression Aim: Predict the value of a response / dependent variable y ∈ R from p predictors / features / independent variables X_1, X_2, ..., X_p ∈ R. Methodology: 1. Fit a model f using n training examples y_1, y_2, ..., y_n, so that y_i ≈ f(X_i1, X_i2, ..., X_ip) for 1 ≤ i ≤ n. 2. Use the learned model f to predict from new data.
Linear regression f is parametrized by an intercept β_0 and a vector of weights β ∈ R^p: y_i ≈ β_0 + Σ_{j=1}^p β_j X_ij, 1 ≤ i ≤ n. In matrix form, y ≈ 1 β_0 + X β, where row i of X ∈ R^{n×p} contains the predictor values X_i1, ..., X_ip of example i. Aim: Learn β_0 and β from the training data.
Linear models Signal representation (compression, denoising) x = D c Aim: Find sparse representation of the signal x Columns of D are designed/learned atoms
Linear models Inverse problems (compressed sensing, super-resolution) y = A x Aim: Estimate the signal x from the measurements y A models the measurement process
Linear models Linear regression y = X β Aim: Learn the relation between the response y and the predictors X 1, X 2,..., X p to predict from new data Each column of X contains a variable that may or may not be related to the response
Preprocessing 1. Center each predictor column X_j by subtracting its mean, so that Σ_{i=1}^n X_ij = 0 for all 1 ≤ j ≤ p. 2. Normalize each predictor column X_j so that ‖X_j‖_2 = 1 for all 1 ≤ j ≤ p. 3. Center the response vector y so that Σ_{i=1}^n y_i = 0.
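A minimal numpy sketch of these three steps (the function name and layout are ours):

```python
import numpy as np

def preprocess(X, y):
    """Center and normalize the predictor columns, center the response."""
    X = X - X.mean(axis=0)              # each column of X now sums to zero
    X = X / np.linalg.norm(X, axis=0)   # each column now has unit l2 norm
    y = y - y.mean()                    # centered response
    return X, y
```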
Least-squares fit Minimize the ℓ_2 norm of the error over the training data: minimize ‖y − Xβ‖_2. After preprocessing β_0 = 0, so we can ignore it.
Geometric interpretation Decompose y into its projection onto the column space of X and onto the orthogonal complement: y = y_X + y_{X⊥}. By Pythagoras's theorem, ‖y − Xβ‖_2^2 = ‖y_{X⊥}‖_2^2 + ‖y_X − Xβ‖_2^2. If n ≥ p and X is full rank, β_ls = (X^T X)^{-1} X^T y satisfies X β_ls = y_X. After preprocessing β_0 = 0.
Probabilistic interpretation: Maximum likelihood Assumption: The linear model is correct, but the examples are corrupted by additive iid Gaussian noise: y = Xβ + z. The likelihood is the probability density function of y parametrized by β: L(β) = 1/√((2π)^n) · exp(−(1/2) ‖y − Xβ‖_2^2). The maximum-likelihood estimate of β is β_ML = arg max_β L(β) = arg max_β log L(β) = arg min_β ‖y − Xβ‖_2.
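In code, the least-squares fit can be computed with a standard solver; a sketch assuming preprocessed (centered) data, so the intercept is dropped:

```python
import numpy as np

def least_squares(X, y):
    """Return beta_ls = arg min ||y - X beta||_2."""
    # np.linalg.lstsq is numerically safer than forming (X^T X)^{-1} X^T y
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_ls
```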
Temperature predictor A friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"
Overfitting If a model is very complex, it may overfit the data. To evaluate a model we separate the data into a training set and a test set: 1. We fit the model using the training set. 2. We evaluate the error on the test set.
Experiment X_train, X_test, z_train and β are iid Gaussian with mean 0 and variance 1. y_train = X_train β + z_train, y_test = X_test β. We use y_train and X_train to compute β_ls. error_train = ‖X_train β_ls − y_train‖_2 / ‖y_train‖_2, error_test = ‖X_test β_ls − y_test‖_2 / ‖y_test‖_2.
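A sketch of this simulation; the number of predictors p is not stated on the slide, so p = 50 below is an assumption:

```python
import numpy as np

def ls_train_test_errors(n, p=50, seed=0):
    """Relative training and test errors of least squares on iid Gaussian data."""
    rng = np.random.default_rng(seed)
    X_train, X_test = rng.standard_normal((n, p)), rng.standard_normal((n, p))
    beta, z_train = rng.standard_normal(p), rng.standard_normal(n)
    y_train = X_train @ beta + z_train
    y_test = X_test @ beta                       # noiseless test response
    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    return err_train, err_test
```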
Experiment [Figure: relative training and test errors (ℓ_2 norm) versus n, from 50 to 500, with the training noise level shown for reference]
Analysis Data model: y = Xβ + z, where z is Gaussian noise with variance σ_z^2. σ_min / σ_max are the smallest / largest singular values of X. With high probability, (1 − ɛ) p σ_z^2 / σ_max^2 ≤ ‖β − β_ls‖_2^2 ≤ (1 + ɛ) p σ_z^2 / σ_min^2.
Proof β_ls = V Σ^{-1} U^T y, where X = U Σ V^T, so ‖β − β_ls‖_2^2 = Σ_{j=1}^p (U_j^T z / σ_j)^2. U^T z is Gaussian with mean 0 and covariance σ_z^2 I, so ‖U^T z‖_2^2 / σ_z^2 is chi-square with p degrees of freedom and ‖U^T z‖_2^2 ≈ p σ_z^2.
Consistency Assume σ_z^2 = 1. If X is well conditioned and its entries have constant magnitude, σ_j ≈ √n, so ‖β − β_ls‖_2 ≈ √(p/n). If the entries of β have constant magnitude, ‖β‖_2 ≈ √p. The least-squares estimator is consistent: ‖β − β_ls‖_2 / ‖β‖_2 ≈ 1/√n.
Experiment: X ∈ R^{n×p}, β ∈ R^p, z ∈ R^n iid N(0, 1). [Figure: relative coefficient error (ℓ_2 norm) versus n for p = 50, 100, 200, compared with the 1/√n rate]
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Maximum temperatures in Oxford, UK [Figure: monthly maximum temperatures (Celsius), 1860–2000]
Maximum temperatures in Oxford, UK [Figure: detail of the years 1900–1905]
Model fitted by least squares [Figure: data and fitted model, 1860–2000]
Model fitted by least squares [Figure: data and fitted model, 1900–1905]
Model fitted by least squares [Figure: data and fitted model, 1960–1965]
Trend: Increase of 0.75 °C / 100 years (1.35 °F) [Figure: maximum-temperature data and fitted trend, 1860–2000]
Model for minimum temperatures [Figure: data and fitted model, 1860–2000]
Model for minimum temperatures [Figure: data and fitted model, 1900–1905]
Model for minimum temperatures [Figure: data and fitted model, 1960–1965]
Trend: Increase of 0.88 °C / 100 years (1.58 °F) [Figure: minimum-temperature data and fitted trend, 1860–2000]
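The slides do not spell out the fitted model; a plausible sketch fits an intercept, a linear trend and a yearly sinusoid to monthly temperatures by least squares and reads off the warming rate (the data array and its monthly sampling are assumptions):

```python
import numpy as np

def warming_rate(temps, t):
    """temps: monthly temperatures; t: time of each sample in years.
    Returns the fitted linear trend in degrees Celsius per 100 years."""
    # Design matrix: intercept, linear trend, yearly cosine and sine
    A = np.column_stack([np.ones_like(t), t,
                         np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
    coef, *_ = np.linalg.lstsq(A, temps, rcond=None)
    return 100 * coef[1]
```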
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Classification Aim: Predict the value of a binary response y ∈ {0, 1} from p predictors X_1, X_2, ..., X_p ∈ R. Methodology: 1. Fit a model f using n training examples y_1, y_2, ..., y_n, so that y_i ≈ f(X_i1, X_i2, ..., X_ip) for 1 ≤ i ≤ n. 2. Use the learned model f to predict from new data.
Logistic-regression model We model the probability that y_i = 1 by P(y_i = 1 | X_i1, X_i2, ..., X_ip) := g(β_0 + Σ_{j=1}^p β_j X_ij) = 1 / (1 + exp(−β_0 − Σ_{j=1}^p β_j X_ij)), where g is the logistic function or sigmoid, g(t) := 1 / (1 + exp(−t)).
Cost function Likelihood of the data under a Bernoulli model in which P(y_i = 1 | X_i1, ..., X_ip) := g(β_0 + Σ_{j=1}^p β_j X_ij) and P(y_i = 0 | X_i1, ..., X_ip) := 1 − g(β_0 + Σ_{j=1}^p β_j X_ij). Assuming independence, L(β_0, β) = Π_{i=1}^n g(β_0 + Σ_{j=1}^p β_j X_ij)^{y_i} (1 − g(β_0 + Σ_{j=1}^p β_j X_ij))^{1 − y_i}.
Cost function The log likelihood is concave: log L(β_0, β) = Σ_{i=1}^n y_i log g(β_0 + Σ_{j=1}^p β_j X_ij) + (1 − y_i) log(1 − g(β_0 + Σ_{j=1}^p β_j X_ij)).
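A direct numpy transcription of the sigmoid and the log likelihood (numerical safeguards against log(0) are omitted for brevity):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta0, beta, X, y):
    """Log likelihood of the logistic-regression model (to be maximized)."""
    prob = sigmoid(beta0 + X @ beta)
    return np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
```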
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Sparse regression Assumption: The response only depends on a subset S of s ≪ p predictors: y ≈ 1 β_0 + X_S β_S. Regime of interest: p ≥ n. Model-selection problem: determine which predictors are relevant. Best-subset selection: choose among all possible sparse models; intractable unless s is very small.
Forward stepwise regression Greedy method similar to orthogonal matching pursuit. Initialization: j_0 := arg max_j |⟨y, X_j⟩|, J := {j_0}, β_ls := arg min_β ‖y − X_J β‖_2, r^(0) := y − X_J β_ls.
Forward stepwise regression Iterations, for k = 1, 2, ...: j_k := arg max_{j ∉ J} |⟨y, P_{col(X_J)⊥} X_j⟩|, J := J ∪ {j_k}, β_ls := arg min_β ‖y − X_J β‖_2, r^(k) := y − X_J β_ls. Stop when the fit does not improve much anymore.
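A sketch of forward stepwise regression in numpy; it runs for a fixed number of steps instead of the improvement-based stopping rule on the slide:

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedily add the predictor whose projection onto the orthogonal
    complement of the selected columns is most correlated with y."""
    n, p = X.shape
    J = []
    for _ in range(n_steps):
        if J:
            Q, _ = np.linalg.qr(X[:, J])      # orthonormal basis of col(X_J)
            P_orth = np.eye(n) - Q @ Q.T      # projector onto its complement
        else:
            P_orth = np.eye(n)
        scores = [abs(y @ P_orth @ X[:, j]) if j not in J else -np.inf
                  for j in range(p)]
        J.append(int(np.argmax(scores)))
    beta_ls, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
    return J, beta_ls
```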
The lasso ℓ_1-norm regularized least-squares regression: minimize (1/2n) ‖y − Xβ‖_2^2 + λ ‖β‖_1. We assume that the data are centered, so β_0 = 0. Original formulation: minimize ‖y − Xβ‖_2^2 subject to ‖β‖_1 ≤ τ. Equivalent, but the relation between λ and τ depends on X and y.
Experiment X_train and X_test are iid Gaussian(0, 1), z_train is iid Gaussian(0, 0.5). β has 10 nonzero entries, p = 50, n = 100. y_train = X_train β + z_train, y_test = X_test β. For an estimate β̂ learnt from the training data, error_train = ‖X_train β̂ − y_train‖_2 / ‖y_train‖_2 and error_test = ‖X_test β̂ − y_test‖_2 / ‖y_test‖_2.
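A sketch of this experiment with scikit-learn's Lasso, whose objective matches the (1/2n) formulation above with alpha playing the role of λ; the noise parameter 0.5 is interpreted here as a standard deviation, and the value of alpha is our choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 50, 10
X_train, X_test = rng.standard_normal((n, p)), rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = rng.standard_normal(s)                # 10 nonzero entries
y_train = X_train @ beta + 0.5 * rng.standard_normal(n)
y_test = X_test @ beta

lasso = Lasso(alpha=0.1).fit(X_train, y_train)
err_test = np.linalg.norm(lasso.predict(X_test) - y_test) / np.linalg.norm(y_test)
```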
Lasso path [Figure: lasso coefficients as a function of the regularization parameter; the relevant features are highlighted]
Training error [Figure: relative training error (ℓ_2 norm) of least squares, forward regression and the lasso versus n, with the noise level shown for reference]
Test error [Figure: relative test error (ℓ_2 norm) of least squares, forward regression and the lasso versus n]
Logistic regression Add an ℓ_1-norm regularization term to the negative log likelihood: we minimize −Σ_{i=1}^n [ y_i log g(β_0 + Σ_{j=1}^p β_j X_ij) + (1 − y_i) log(1 − g(β_0 + Σ_{j=1}^p β_j X_ij)) ] + λ ‖β‖_1.
Arrhythmia prediction Predict whether a patient has arrhythmia from n = 271 examples and p = 182 predictors: age, sex, height, weight, and features obtained from electrocardiogram recordings. The best sparse model uses around 60 predictors.
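A scikit-learn sketch of an ℓ_1-regularized logistic fit of this kind; the arrhythmia data are not reproduced here, so synthetic stand-ins of the same size are used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 271, 182
X = rng.standard_normal((n, p))       # placeholder features
y = rng.integers(0, 2, size=n)        # placeholder binary labels

# C is the inverse of the regularization strength (roughly 1 / lambda)
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
n_selected = int((model.coef_ != 0).sum())   # predictors kept by the l1 penalty
```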
Prediction accuracy [Figure: misclassification error versus log(lambda); the numbers along the top axis (182, 172, 174, 159, 107, 62, 18, 5) are the number of nonzero coefficients]
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Least-squares analysis Data model: y = Xβ + z, where z is Gaussian noise with variance σ_z^2 and β is s-sparse. Assumption: X is well conditioned and its entries have constant magnitude, so σ_min ≈ √n and the columns of X have ℓ_2 norm ≈ √n. For the least-squares estimator, ‖β − β_ls‖_2 ≈ σ_z √(p/n). If p is of the same order as n but s ≪ p, we should do better!
Restricted eigenvalue property There exists γ > 0 such that for any subset T with |T| ≤ s and any v ∈ R^p satisfying ‖v_{T^c}‖_1 ≤ ‖v_T‖_1, we have (1/n) ‖Xv‖_2^2 ≥ γ ‖v‖_2^2. Similar to the restricted isometry property; may hold even if p > n!
Guarantees If X satisfies the RE property and τ = ‖β‖_1, the solution β_lasso to minimize ‖y − Xβ‖_2^2 subject to ‖β‖_1 ≤ τ satisfies ‖β − β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n) with probability 1 − 2 exp(−(α − 1) log p), for any α > 1. Close to the least-squares error we would get if we knew which predictors are relevant.
Proof The error h := β − β_lasso satisfies ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, so ‖β − β_lasso‖_2^2 ≤ (1/(γn)) ‖Xh‖_2^2.
Proof By optimality of β_lasso, ‖Xh‖_2^2 ≤ 2 z^T X h ≤ 2 ‖X^T z‖_∞ ‖h‖_1. Since ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, ‖h‖_1 ≤ 2√s ‖h‖_2, so ‖β − β_lasso‖_2^2 ≤ (4√s / (γn)) ‖X^T z‖_∞ ‖β − β_lasso‖_2.
Proof X_i^T z is Gaussian with mean 0 and variance σ_z^2 ‖X_i‖_2^2, so for t > 0, P(|X_i^T z| > t σ_z ‖X_i‖_2) ≤ 2 exp(−t^2/2). By the union bound, P(‖X^T z‖_∞ > t σ_z max_i ‖X_i‖_2) ≤ 2p exp(−t^2/2) = 2 exp(−t^2/2 + log p).
Proof Choosing t = √(2α log p), α > 1, and using max_i ‖X_i‖_2 = √n, we have P(‖X^T z‖_∞ > σ_z √(2αn log p)) ≤ 2 exp(−(α − 1) log p). We conclude that ‖β − β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n) with probability 1 − 2 exp(−(α − 1) log p).
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Correlated predictors Error of the least-squares estimator: ‖β − β_ls‖_2^2 = Σ_{j=1}^p (U_j^T z / σ_j)^2. If the predictors are strongly correlated, the singular values can be very small, and the estimator suffers from noise amplification.
Ridge regression Aim: Control the norm of the weights: minimize ‖y − Xβ‖_2^2 + λ^2 ‖β‖_2^2. Equivalent to the regression problem minimize ‖ [y; 0] − [X; λI] β ‖_2^2.
Ridge regression If y = Xβ + z, β_ridge = V diag(σ_1^2/(σ_1^2 + λ^2), σ_2^2/(σ_2^2 + λ^2), ..., σ_p^2/(σ_p^2 + λ^2)) V^T β + V diag(σ_1/(σ_1^2 + λ^2), σ_2/(σ_2^2 + λ^2), ..., σ_p/(σ_p^2 + λ^2)) U^T z. This decreases noise amplification when the singular values are small.
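A sketch that implements this SVD formula (with the λ^2 ‖β‖_2^2 penalty above) and checks it against the augmented least-squares formulation:

```python
import numpy as np

def ridge_svd(X, y, lam):
    """Ridge solution V diag(sigma_j / (sigma_j^2 + lam^2)) U^T y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam**2)) * (U.T @ y))

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 0.7
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
X_aug = np.vstack([X, lam * np.eye(p)])          # [X; lam I]
y_aug = np.concatenate([y, np.zeros(p)])         # [y; 0]
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
assert np.allclose(ridge_svd(X, y, lam), beta_aug)
```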
Correlated predictors in model selection Lasso controls norm amplification but tends to choose a small subset of the relevant predictors Aim: Selecting all relevant predictors Why? Interpretability + improving prediction
Identical predictors If two predictors are identical, X_i = X_j, then β_i and β_j should be the same. This is true for any cost function of the form minimize (1/2n) ‖y − Xβ‖_2^2 + λ R(β) as long as the regularizer R is strictly convex. It is not true for the lasso.
Experiment The entries of z_train, Z_train and Z_test are iid Gaussian(0, 1). y_train := X_{A,train} β_A + X_{B,train} β_B + z_train, y_test := X_{A,test} β_A + X_{B,test} β_B. X_train := [X_{A,train} X_{B,train} Z_train], X_test := [X_{A,test} X_{B,test} Z_test].
Experiment The rows of X_{A,train}, X_{B,train}, X_{A,test}, X_{B,test} are independent Gaussian random vectors with mean zero and covariance matrix Σ := [1 0.95 0.95 0.95; 0.95 1 0.95 0.95; 0.95 0.95 1 0.95; 0.95 0.95 0.95 1]. β_A and β_B have 4 nonzero entries each.
Lasso path [Figure: lasso coefficients versus the regularization parameter; the coefficients of the two correlated groups are highlighted]
Ridge-regression path [Figure: ridge coefficients versus the regularization parameter; the coefficients of the two correlated groups are highlighted]
Elastic net Combines the lasso and ridge-regression penalties: minimize ‖y − Xβ‖_2^2 + λ ( (1 − α)/2 ‖β‖_2^2 + α ‖β‖_1 ).
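scikit-learn's ElasticNet uses essentially this parametrization (alpha playing the role of λ, l1_ratio the role of α, with the data-fit term scaled by 1/(2n)); a minimal sketch with random data and parameter values of our choosing:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 50
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)

# Objective: (1/2n)||y - Xb||^2 + alpha*(l1_ratio*||b||_1 + 0.5*(1 - l1_ratio)*||b||_2^2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```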
Elastic-net path, α = 0.5 [Figure: elastic-net coefficients versus the regularization parameter; the coefficients of the two correlated groups are highlighted]
Errors [Figure: relative training and test errors (ℓ_2 norm) of least squares, ridge, elastic net and the lasso versus the regularization parameter]
Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity
Group sparse structure The predictors may be partitioned into m groups G_1, G_2, ..., G_m: y ≈ β_0 + Xβ = β_0 + [X_{G_1} X_{G_2} ... X_{G_m}] [β_{G_1}; β_{G_2}; ...; β_{G_m}]. Aim: Select the groups that are relevant.
Multi-task learning Several responses y_1, y_2, ..., y_k are modeled with the same predictors: Y = [y_1 y_2 ... y_k] ≈ B_0 + XB = B_0 + X [β_1 β_2 ... β_k]. Assumption: The responses depend on the same subset of predictors. Aim: Learn a group-sparse model.
Mixed ℓ_1/ℓ_2 norm The mixed ℓ_1/ℓ_2 norm ‖β‖_{1,2} := Σ_{i=1}^m ‖β_{G_i}‖_2 promotes group-sparse structure. For multitask learning, ‖B‖_{1,2} := Σ_{i=1}^p ‖B_{i:}‖_2, where B_{i:} is the i-th row of B (the coefficients of predictor i across the k tasks).
Geometric intuition G_1 = {1, 2}, G_2 = {3}. [Figure: the ℓ_1-norm ball and the ℓ_1/ℓ_2-norm ball. Taken from Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright]
Group and multi-task lasso Group lasso: minimize ‖y − β_0 − Xβ‖_2^2 + λ ‖β‖_{1,2}. Multi-task lasso: minimize ‖Y − B_0 − XB‖_F^2 + λ ‖B‖_{1,2}.
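A sketch of the multi-task lasso with scikit-learn's MultiTaskLasso, which penalizes the sum of the ℓ_2 norms of the rows of the coefficient matrix; the sizes and regularization value are our choices:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, k, s = 100, 50, 4, 4
X = rng.standard_normal((n, p))
B = np.zeros((p, k))
B[:s] = rng.standard_normal((s, k))          # s nonzero rows shared by all tasks
Y = X @ B + 0.5 * rng.standard_normal((n, k))

mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)
# mtl.coef_ has shape (k, p); its nonzero columns indicate the selected predictors
selected = np.flatnonzero(np.linalg.norm(mtl.coef_, axis=0))
```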
Proximal gradient method Method to solve the optimization problem minimize f(x) + g(x), where f is differentiable and prox_g is tractable. Proximal-gradient iteration: x^(0) = arbitrary initialization, x^(k+1) = prox_{α_k g}(x^(k) − α_k ∇f(x^(k))).
Subdifferential of the ℓ_1/ℓ_2 norm q ∈ R^p is a subgradient of the ℓ_1/ℓ_2 norm at β ∈ R^p if q_{G_i} = β_{G_i} / ‖β_{G_i}‖_2 whenever β_{G_i} ≠ 0, and ‖q_{G_i}‖_2 ≤ 1 whenever β_{G_i} = 0.
Proximal operator of the ℓ_1/ℓ_2 norm The proximal operator of the ℓ_1/ℓ_2 norm is block soft-thresholding: prox_{α‖·‖_{1,2}}(β) = BS_α(β), where α > 0 and BS_α(β)_{G_i} := β_{G_i} − α β_{G_i} / ‖β_{G_i}‖_2 if ‖β_{G_i}‖_2 ≥ α, and 0 otherwise.
Iterative Shrinkage-Thresholding for the ℓ_1/ℓ_2 norm Proximal gradient method for the problem minimize (1/2) ‖Ax − y‖_2^2 + λ ‖x‖_{1,2}. ISTA iteration: x^(0) = arbitrary initialization, x^(k+1) = BS_{α_k λ}(x^(k) − α_k A^T(Ax^(k) − y)).
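A minimal numpy sketch of block soft-thresholding and the ISTA iteration above; the group structure, step size and iteration count are assumptions (the step should be at most 1/‖A^T A‖ for convergence):

```python
import numpy as np

def block_soft_threshold(x, alpha, groups):
    """BS_alpha applied group by group; `groups` is a list of index arrays."""
    out = np.zeros_like(x)
    for g in groups:
        norm = np.linalg.norm(x[g])
        if norm >= alpha and norm > 0:
            out[g] = x[g] - alpha * x[g] / norm
    return out

def ista_group(A, y, lam, groups, step, n_iter=500):
    """ISTA for (1/2)||Ax - y||_2^2 + lam * ||x||_{1,2} with a fixed step size."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = block_soft_threshold(x - step * grad, step * lam, groups)
    return x
```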
Speech denoising: STFT thresholding [Figure: spectrogram, frequency versus time]
Speech denoising: STFT block thresholding [Figure: spectrogram, frequency versus time]
Experiment X_train and X_test are iid Gaussian(0, 1), Z_train is iid Gaussian(0, 0.5). B ∈ R^{p×k} has s nonzero rows, p = 50, n = 100. Y_train = X_train B + Z_train, Y_test = X_test B. For an estimate B̂ learnt from the training data, error_test = ‖X_test B̂ − Y_test‖_F / ‖Y_test‖_F.
Test error, s = 4 [Figure: relative test error versus the regularization parameter for the lasso and the multi-task lasso, with k = 4 and k = 40 tasks]
Test error, s = 30 [Figure: relative test error versus the regularization parameter for the lasso and the multi-task lasso, with k = 4 and k = 40 tasks]
Magnitude of coefficients, s = 30, k = 40 [Figure: magnitudes of the coefficients estimated by the lasso and the multitask lasso, compared with the original B]