Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda


1 Sparse regression Optimization-Based Data Analysis Carlos Fernandez-Granda 3/28/2016

2 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

3 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

4 Regression
Aim: Predict the value of a response / dependent variable y ∈ R from p predictors / features / independent variables X_1, X_2, ..., X_p ∈ R
Methodology:
1. Fit a model using n training examples y_1, y_2, ..., y_n:
   y_i ≈ f(X_i1, X_i2, ..., X_ip), 1 ≤ i ≤ n
2. Use the learned model f to predict from new data

5 Linear regression
f is parametrized by an intercept β_0 and a vector of weights β ∈ R^p:
y_i ≈ β_0 + Σ_{j=1}^p β_j X_ij, 1 ≤ i ≤ n
In matrix form, y ≈ 1 β_0 + X β, where y = (y_1, ..., y_n), X ∈ R^{n×p} has entries X_ij and 1 is the all-ones vector
Aim: Learn β_0 and β from the training data

6 Linear models Signal representation (compression, denoising) x = D c Aim: Find sparse representation of the signal x Columns of D are designed/learned atoms

7 Linear models Inverse problems (compressed sensing, super-resolution) y = A x Aim: Estimate the signal x from the measurements y A models the measurement process

8 Linear models Linear regression y = X β Aim: Learn the relation between the response y and the predictors X 1, X 2,..., X p to predict from new data Each column of X contains a variable that may or may not be related to the response

9 Preprocessing
1. Center each predictor column X_j by subtracting its mean, so that Σ_{i=1}^n X_ij = 0 for all 1 ≤ j ≤ p
2. Normalize each predictor column X_j so that ‖X_j‖_2 = 1 for all 1 ≤ j ≤ p
3. Center the response vector y so that Σ_{i=1}^n y_i = 0
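
These preprocessing steps are straightforward to carry out numerically. A minimal sketch assuming NumPy follows; the function name and the layout (predictors as columns of a matrix X) are illustrative choices, not from the slides.

```python
import numpy as np

def preprocess(X, y):
    """Center and normalize the predictor columns, center the response."""
    X = X - X.mean(axis=0)              # each column now sums to zero
    X = X / np.linalg.norm(X, axis=0)   # each column now has unit l2 norm
    y = y - y.mean()                    # response has zero mean
    return X, y
```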

10 Least-squares fit
Minimize the l2 norm of the error over the training data:
minimize ‖y - X β‖_2
After preprocessing β_0 = 0, so we can ignore it

11 Geometric interpretation
Decompose y into its projection onto the column space of X and its projection onto the orthogonal complement:
y = y_X + y_X⊥
By Pythagoras's theorem,
‖y - X β‖²_2 = ‖y_X⊥‖²_2 + ‖y_X - X β‖²_2
If n ≥ p and X is full rank, β_ls = (X^T X)^{-1} X^T y satisfies X β_ls = y_X
After preprocessing β_0 = 0
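
As a quick numerical sanity check of the closed-form solution, the following sketch (assuming NumPy; the simulated data are illustrative, not from the slides) compares it with a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Least-squares fit: minimize ||y - X beta||_2
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Closed form (X^T X)^{-1} X^T y, valid when n >= p and X is full rank
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_ls, beta_closed))  # True
```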

12 Probabilistic interpretation: Maximum likelihood
Assumption: The linear model is correct but the examples are corrupted by additive iid Gaussian noise (with unit variance),
y = X β + z
The likelihood is the probability density function of y parametrized by β:
L(β) = 1 / √((2π)^n) exp( -(1/2) ‖y - X β‖²_2 )
The maximum-likelihood estimate of β is
β_ML = arg max_β L(β) = arg max_β log L(β) = arg min_β ‖y - X β‖_2

13 Temperature predictor
A friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

14 Overfitting
If a model is very complex, it may overfit the data
To evaluate a model we separate the data into a training set and a test set:
1. We fit the model using the training set
2. We evaluate the error on the test set

15 Experiment
X_train, X_test, z_train and β are iid Gaussian with mean 0 and variance 1
y_train = X_train β + z_train
y_test = X_test β
We use y_train and X_train to compute β_ls
error_train = ‖X_train β_ls - y_train‖_2 / ‖y_train‖_2
error_test = ‖X_test β_ls - y_test‖_2 / ‖y_test‖_2
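
A minimal simulation of this experiment (a sketch assuming NumPy; the values of n, p and the number of trials are illustrative, not fixed by the slides):

```python
import numpy as np

def ls_experiment(n, p=50, trials=20, seed=0):
    """Average relative training/test error of least squares on synthetic data."""
    rng = np.random.default_rng(seed)
    err_train, err_test = [], []
    for _ in range(trials):
        X_train = rng.standard_normal((n, p))
        X_test = rng.standard_normal((n, p))
        beta = rng.standard_normal(p)
        y_train = X_train @ beta + rng.standard_normal(n)
        y_test = X_test @ beta
        beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        err_train.append(np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train))
        err_test.append(np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test))
    return np.mean(err_train), np.mean(err_test)

# As n grows, the training error rises toward the noise level while the test error falls
for n in [60, 100, 200, 500]:
    print(n, ls_experiment(n))
```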

16 Experiment
[Figure: relative error (l2 norm) versus n for the training and test sets, with the training noise level shown for reference]

17 Analysis
Data model: y = X β + z, where z is Gaussian noise with variance σ_z²
σ_min / σ_max are the smallest / largest singular values of X
(1 - ɛ) p σ_z² / σ_max² ≤ ‖β - β_ls‖²_2 ≤ (1 + ɛ) p σ_z² / σ_min²
with high probability

18 Proof
β_ls = V Σ^{-1} U^T y, where X = U Σ V^T is the SVD of X, so
‖β - β_ls‖²_2 = Σ_{j=1}^p ( U_j^T z / σ_j )²
U^T z is Gaussian with mean 0 and covariance I
‖U^T z‖²_2 is chi-square with p degrees of freedom, so ‖U^T z‖²_2 ≈ p

19 Consistency
Assume σ_z² = 1
If X is well conditioned and its entries have constant magnitude, σ_j ≈ √n, so
‖β - β_ls‖_2 ≈ √(p / n)
If the entries of β are constant, ‖β‖_2 ≈ √p
The least-squares estimator is consistent:
‖β - β_ls‖_2 / ‖β‖_2 ≈ 1 / √n

20 Experiment
X ∈ R^{n×p}, β ∈ R^p, z ∈ R^n with iid N(0, 1) entries
[Figure: relative coefficient error (l2 norm) versus n for p = 50, 100, 200, decaying like 1/√n]

21 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

22 Maximum temperatures in Oxford, UK
[Figure: maximum temperature (Celsius) over time]

23 Maximum temperatures in Oxford, UK
[Figure: maximum temperature (Celsius) over time]

24 Model fitted by least squares
[Figure: maximum temperature data (Celsius) and fitted model]

25 Model fitted by least squares
[Figure: maximum temperature data (Celsius) and fitted model]

26 Model fitted by least squares
[Figure: maximum temperature data (Celsius) and fitted model]

27 Trend: Increase of 0.75 °C / 100 years (1.35 °F)
[Figure: maximum temperature data (Celsius) and fitted trend]

28 Model for minimum temperatures
[Figure: minimum temperature data (Celsius) and fitted model]

29 Model for minimum temperatures
[Figure: minimum temperature data (Celsius) and fitted model]

30 Model for minimum temperatures
[Figure: minimum temperature data (Celsius) and fitted model]

31 Trend: Increase of 0.88 °C / 100 years (1.58 °F)
[Figure: minimum temperature data (Celsius) and fitted trend]

32 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

33 Classification
Aim: Predict the value of a binary response y ∈ {0, 1} from p predictors X_1, X_2, ..., X_p ∈ R
Methodology:
1. Fit a model using n training examples y_1, y_2, ..., y_n:
   y_i ≈ f(X_i1, X_i2, ..., X_ip), 1 ≤ i ≤ n
2. Use the learned model f to predict from new data

34 Logistic-regression model
We model the probability that y_i = 1 by
P(y_i = 1 | X_i1, X_i2, ..., X_ip) := g( β_0 + Σ_{j=1}^p β_j X_ij ) = 1 / ( 1 + exp(-β_0 - Σ_{j=1}^p β_j X_ij) )
g is the logistic function or sigmoid, g(t) := 1 / (1 + exp(-t))

35 Cost function
Likelihood of the data under a Bernoulli model in which
P(y_i = 1 | X_i1, X_i2, ..., X_ip) := g( β_0 + Σ_{j=1}^p β_j X_ij )
P(y_i = 0 | X_i1, X_i2, ..., X_ip) := 1 - g( β_0 + Σ_{j=1}^p β_j X_ij )
Assuming independence,
L(β_0, β) = Π_{i=1}^n g( β_0 + Σ_{j=1}^p β_j X_ij )^{y_i} ( 1 - g( β_0 + Σ_{j=1}^p β_j X_ij ) )^{1 - y_i}

36 Cost function
The log likelihood is concave:
log L(β_0, β) = Σ_{i=1}^n y_i log g( β_0 + Σ_{j=1}^p β_j X_ij ) + (1 - y_i) log( 1 - g( β_0 + Σ_{j=1}^p β_j X_ij ) )
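
Since the log likelihood is smooth and concave, a generic solver can maximize it. The following is a minimal sketch assuming NumPy and SciPy; the simulated data, variable names and solver choice are illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function g(t) = 1 / (1 + exp(-t))

def neg_log_likelihood(params, X, y):
    """Negative log likelihood; params = [beta_0, beta_1, ..., beta_p]."""
    beta0, beta = params[0], params[1:]
    t = beta0 + X @ beta
    # y*log g(t) + (1-y)*log(1-g(t)) simplifies to y*t - log(1 + exp(t))
    return -np.sum(y * t - np.logaddexp(0.0, t))

# Illustrative synthetic data
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -2.0, 0.0])
y = (rng.random(n) < expit(X @ beta_true)).astype(float)

res = minimize(neg_log_likelihood, np.zeros(p + 1), args=(X, y))
print(res.x)  # estimated [beta_0, beta_1, ..., beta_p]
```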

37 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

38 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

39 Sparse regression
Assumption: The response only depends on a subset S of s ≪ p predictors,
y ≈ 1 β_0 + X_S β_S
Regime of interest: p ≈ n
Model-selection problem: Determine which predictors are relevant
Best-subset selection: Choose among all possible sparse models
Intractable unless s is very small

40 Forward stepwise regression
Greedy method similar to orthogonal matching pursuit
Initialization:
j_0 := arg max_j |⟨y, X_j⟩|
J := {j_0}
β_ls := arg min_β ‖y - X_J β‖_2
r^(0) := y - X_J β_ls

41 Forward stepwise regression
Iterations: for k = 1, 2, ...
j_k := arg max_{j ∉ J} |⟨y, P_{col(X_J)⊥} X_j⟩|
J := J ∪ {j_k}
β_ls := arg min_β ‖y - X_J β‖_2
r^(k) := y - X_J β_ls
Stop when the fit does not improve much anymore
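
A sketch of this greedy procedure (assuming NumPy; the function name, the stopping tolerance and the use of a QR factorization for the projection are implementation choices, not from the slides):

```python
import numpy as np

def forward_stepwise(X, y, tol=1e-3, max_features=None):
    """Greedy forward stepwise regression."""
    n, p = X.shape
    max_features = max_features or p
    J = []
    beta_J = np.array([])
    prev_fit = np.linalg.norm(y)
    for _ in range(max_features):
        # Score each unused column by its correlation with y after projecting
        # out the columns already selected (projection onto col(X_J)⊥).
        if J:
            Q, _ = np.linalg.qr(X[:, J])
            project = lambda v: v - Q @ (Q.T @ v)
        else:
            project = lambda v: v
        scores = [abs(y @ project(X[:, j])) if j not in J else -np.inf
                  for j in range(p)]
        J.append(int(np.argmax(scores)))
        beta_J, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
        fit = np.linalg.norm(y - X[:, J] @ beta_J)
        if prev_fit - fit < tol:   # stop when the fit no longer improves much
            break
        prev_fit = fit
    return J, beta_J
```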

42 The lasso
l1-norm regularized least-squares regression:
minimize (1 / 2n) ‖y - X β‖²_2 + λ ‖β‖_1
We assume that the data are centered, so β_0 = 0
Original formulation:
minimize ‖y - X β‖_2 subject to ‖β‖_1 ≤ τ
Equivalent, but the relation between λ and τ depends on X and y
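
scikit-learn's Lasso minimizes the same 1/(2n)-scaled objective, with alpha playing the role of λ. A minimal sketch on illustrative synthetic data (the parameter values below are not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 50, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = rng.standard_normal(s)
y = X @ beta + 0.5 * rng.standard_normal(n)

# scikit-learn objective: (1 / (2 n)) ||y - X b||_2^2 + alpha ||b||_1
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(lasso.coef_))  # number of selected predictors
```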

43 Experiment
X_train and X_test are iid Gaussian (0, 1), z_train is iid Gaussian (0, 0.5)
β has 10 nonzero entries, p = 50, n = 100
y_train = X_train β + z_train
y_test = X_test β
For an estimate β̂ learnt from the training data,
error_train = ‖X_train β̂ - y_train‖_2 / ‖y_train‖_2
error_test = ‖X_test β̂ - y_test‖_2 / ‖y_test‖_2

44 Lasso path
[Figure: lasso coefficients versus regularization parameter, with the relevant features highlighted]

45 Training error
[Figure: relative training error (l2 norm) versus n for least squares, forward regression and the lasso, with the noise level shown for reference]

46 Test error
[Figure: relative test error (l2 norm) versus n for least squares, forward regression and the lasso]

47 Logistic regression
Add an l1-norm regularization term to the log likelihood; we minimize
- Σ_{i=1}^n [ y_i log g( β_0 + Σ_{j=1}^p β_j X_ij ) + (1 - y_i) log( 1 - g( β_0 + Σ_{j=1}^p β_j X_ij ) ) ] + λ ‖β‖_1
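
In scikit-learn this corresponds to LogisticRegression with an l1 penalty, where C is the inverse of the regularization strength. A minimal sketch on synthetic data (illustrative only, not the arrhythmia data of the next slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic classification data
X, y = make_classification(n_samples=271, n_features=182, n_informative=20,
                           random_state=0)

# penalty="l1" adds the l1 term; smaller C means stronger regularization
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print((model.coef_ != 0).sum(), "nonzero coefficients")
```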

48 Arrhythmia prediction
Predict whether a patient has arrhythmia from n = 271 examples and p = 182 predictors
Age, sex, height, weight
Features obtained from electrocardiogram recordings
Best sparse model uses around 60 predictors

49 Prediction accuracy
[Figure: misclassification error versus log(lambda)]

50 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

51 Least-squares analysis
Data model: y = X β + z, where z is Gaussian noise with variance σ_z² and β is s-sparse
Assumption: X is well conditioned and its entries have constant magnitude, so σ_min ≈ √n
Columns of X have l2 norm ≈ √n
For the least-squares estimator we have
‖β - β_ls‖_2 ≈ σ_z √(p / n)
If p ≈ n but s ≪ p, we should do better!

52 Restricted eigenvalue property
There exists γ > 0 such that for any v ∈ R^p and any subset T with |T| ≤ s,
if ‖v_{T^c}‖_1 ≤ ‖v_T‖_1 then (1 / n) ‖X v‖²_2 ≥ γ ‖v‖²_2
Similar to the restricted isometry property; may hold even if p > n!

53 Guarantees
If X satisfies the RE property and τ = ‖β‖_1, the solution β_lasso to
minimize ‖y - X β‖_2 subject to ‖β‖_1 ≤ τ
satisfies
‖β - β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n)
with probability 1 - 2 exp(-(α - 1) log p) for any α > 1
Close to the least-squares error if we know which predictors are relevant

54 Proof
The error h := β - β_lasso satisfies ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, so by the RE property
‖β - β_lasso‖²_2 ≤ (1 / γn) ‖X h‖²_2

55 Proof
By optimality of β_lasso,
‖X h‖²_2 ≤ 2 |z^T X h| ≤ 2 ‖X^T z‖_∞ ‖h‖_1
Since ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, ‖h‖_1 ≤ 2 √s ‖h‖_2, so
‖β - β_lasso‖_2 ≤ (4 √s / γn) ‖X^T z‖_∞

56 Proof
X_i^T z is Gaussian with variance σ_z² ‖X_i‖²_2, so for t > 0
P( |X_i^T z| > t σ_z ‖X_i‖_2 ) ≤ 2 exp(-t² / 2)
By the union bound,
P( ‖X^T z‖_∞ > t σ_z max_i ‖X_i‖_2 ) ≤ 2p exp(-t² / 2) = 2 exp(-t² / 2 + log p)

57 Proof
Choosing t = √(2 α log p) with α > 1, if max_i ‖X_i‖_2 = √n we have
P( ‖X^T z‖_∞ > σ_z √(2 α n log p) ) ≤ 2 exp(-(α - 1) log p)
We conclude that
‖β - β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n)
with probability 1 - 2 exp(-(α - 1) log p)

58 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

59 Correlated predictors
Error of the least-squares estimator:
‖β - β_ls‖²_2 = Σ_{j=1}^p ( U_j^T z / σ_j )²
If the predictors are strongly correlated, some singular values can be very small
The estimator suffers from noise amplification

60 Ridge regression
Aim: Control the norm of the weights
minimize ‖y - X β‖²_2 + λ² ‖β‖²_2
Equivalent to the regression problem
minimize ‖ [y; 0] - [X; λI] β ‖²_2
(stacking λI below X and zeros below y)
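
A sketch of this equivalence in code (assuming NumPy; lam stands for the λ above, and the comparison with the closed form (X^T X + λ² I)^{-1} X^T y is an illustrative sanity check):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate via the augmented least-squares problem above."""
    n, p = X.shape
    X_aug = np.vstack([X, lam * np.eye(p)])          # stack lam * I below X
    y_aug = np.concatenate([y, np.zeros(p)])         # stack zeros below y
    beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta

# Sanity check against the closed form (X^T X + lam^2 I)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam = 0.7
direct = np.linalg.solve(X.T @ X + lam**2 * np.eye(5), X.T @ y)
print(np.allclose(ridge(X, y, lam), direct))  # True
```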

61 Ridge regression
If y = X β + z,
β_ridge = V diag( σ_1²/(σ_1²+λ²), ..., σ_p²/(σ_p²+λ²) ) V^T β + V diag( σ_1/(σ_1²+λ²), ..., σ_p/(σ_p²+λ²) ) U^T z
Decreases noise amplification if the singular values are small

62 Correlated predictors in model selection Lasso controls norm amplification but tends to choose a small subset of the relevant predictors Aim: Selecting all relevant predictors Why? Interpretability + improving prediction

63 Identical predictors
If two predictors are the same, X_i = X_j, then β_i and β_j should be the same
True for any cost function of the form
minimize (1 / 2n) ‖y - X β‖²_2 + λ R(β)
as long as the regularizer R is strictly convex
Not true for the lasso

64 Experiment
The entries of z_train, Z_train and Z_test are iid Gaussian (0, 1)
y_train := X_A,train β_A + X_B,train β_B + z_train
y_test := X_A,test β_A + X_B,test β_B
X_train := [ X_A,train  X_B,train  Z_train ]
X_test := [ X_A,test  X_B,test  Z_test ]

65 Experiment
Rows of X_A,train, X_B,train, X_A,test, X_B,test are independent Gaussian random vectors with mean zero and covariance matrix Σ
β_A and β_B have 4 nonzero entries each

66 Lasso path
[Figure: lasso coefficients of the two groups versus regularization parameter]

67 Ridge-regression path
[Figure: ridge-regression coefficients of the two groups versus regularization parameter]

68 Elastic net
Combines the lasso and ridge-regression penalties:
minimize ‖y - X β‖²_2 + λ ( (1 - α)/2 ‖β‖²_2 + α ‖β‖_1 )
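
scikit-learn's ElasticNet combines the two penalties in the same way (with a 1/(2n)-scaled loss); l1_ratio plays the role of α and alpha the role of λ. A minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
beta = np.zeros(50)
beta[:8] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(100)

# scikit-learn objective:
# (1/(2n)) ||y - X b||_2^2 + alpha * ( l1_ratio ||b||_1 + (1 - l1_ratio)/2 ||b||_2^2 )
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.count_nonzero(enet.coef_))
```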

69 Elastic-net path, α = 0.5
[Figure: elastic-net coefficients of the two groups versus regularization parameter]

70 Errors
[Figure: relative training and test error (l2 norm) versus regularization parameter for least squares, ridge regression, the elastic net and the lasso]

71 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

72 Group sparse structure
Predictors may be partitioned into m groups G_1, G_2, ..., G_m:
y ≈ β_0 + X β = β_0 + [ X_{G_1} X_{G_2} ... X_{G_m} ] [ β_{G_1}; β_{G_2}; ...; β_{G_m} ]
Aim: Select the groups that are relevant

73 Multi-task learning
Several responses y_1, y_2, ..., y_k modeled with the same predictors:
Y = [ y_1 y_2 ... y_k ] ≈ B_0 + X B = B_0 + X [ β_1 β_2 ... β_k ]
Assumption: The responses depend on the same subset of predictors
Aim: Learn a group-sparse model

74 Mixed l1/l2 norm
The mixed l1/l2 norm
‖β‖_{1,2} := Σ_{i=1}^m ‖β_{G_i}‖_2
promotes group-sparse structure
For multitask learning,
‖B‖_{1,2} := Σ_{i=1}^p ‖B_{i,:}‖_2
where B_{i,:} is the i-th row of B

75 Geometric intuition
G_1 = {1, 2}, G_2 = {3}
[Figure: the l1-norm ball and the l1/l2-norm ball; taken from Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright]

76 Group and multi-task lasso
Group lasso:
minimize ‖y - β_0 - X β‖²_2 + λ ‖β‖_{1,2}
Multi-task lasso:
minimize ‖Y - B_0 - X B‖²_F + λ ‖B‖_{1,2}

77 Proximal gradient method
Method to solve the optimization problem
minimize f(x) + g(x)
where f is differentiable and prox_g is tractable
Proximal-gradient iteration:
x^(0) = arbitrary initialization
x^(k+1) = prox_{α_k g}( x^(k) - α_k ∇f(x^(k)) )

78 Subdifferential of the l1/l2 norm
q ∈ R^p is a subgradient of the l1/l2 norm at β ∈ R^p if
q_{G_i} = β_{G_i} / ‖β_{G_i}‖_2   if β_{G_i} ≠ 0
‖q_{G_i}‖_2 ≤ 1                   if β_{G_i} = 0

79 Proximal operator of the l1/l2 norm
The proximal operator of the l1/l2 norm is block soft-thresholding:
prox_{α ‖·‖_{1,2}}(β) = BS_α(β)
where α > 0 and
BS_α(β)_{G_i} := β_{G_i} - α β_{G_i} / ‖β_{G_i}‖_2   if ‖β_{G_i}‖_2 ≥ α
BS_α(β)_{G_i} := 0                                   otherwise

80 Iterative Shrinkage-Thresholding for the l1/l2 norm
Proximal gradient method for the problem
minimize (1/2) ‖A x - y‖²_2 + λ ‖x‖_{1,2}
ISTA iteration:
x^(0) = arbitrary initialization
x^(k+1) = BS_{α_k λ}( x^(k) - α_k A^T (A x^(k) - y) )
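
A compact sketch of block soft-thresholding and the ISTA iteration above (assuming NumPy; representing the groups as a list of index arrays and using a constant step size 1/L are implementation choices, not from the slides):

```python
import numpy as np

def block_soft_threshold(beta, alpha, groups):
    """Block soft-thresholding BS_alpha, the prox of alpha * ||.||_{1,2}."""
    out = np.zeros_like(beta)
    for g in groups:                     # each g is an array of indices
        norm_g = np.linalg.norm(beta[g])
        if norm_g > alpha:
            out[g] = (1 - alpha / norm_g) * beta[g]
    return out

def ista_group(A, y, lam, groups, step=None, iters=500):
    """ISTA for (1/2)||A x - y||_2^2 + lam ||x||_{1,2}."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = ||A||^2 (Lipschitz constant of the gradient)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = block_soft_threshold(x - step * A.T @ (A @ x - y), step * lam, groups)
    return x
```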

81 Speech denoising: STFT thresholding
[Figure: spectrogram, frequency versus time]

82 Speech denoising: STFT block thresholding
[Figure: spectrogram, frequency versus time]

83 Experiment
X_train and X_test are iid Gaussian (0, 1), Z_train is iid Gaussian (0, 0.5)
B ∈ R^{p×k} has s nonzero rows, p = 50, n = 100
Y_train = X_train B + Z_train
Y_test = X_test B
For an estimate B̂ learnt from the training data,
error_test = ‖X_test B̂ - Y_test‖_F / ‖Y_test‖_F

84 Test error, s =
[Figure: relative error (l2 norm) versus regularization parameter for the lasso (k = 4, k = 40) and the multi-task lasso (k = 4, k = 40)]

85 Test error, s = 30
[Figure: relative error (l2 norm) versus regularization parameter for the lasso (k = 4, k = 40) and the multi-task lasso (k = 4, k = 40)]

86 Magnitude of coefficients, s = 30, k = 40
[Figure: magnitude of the coefficients for the lasso estimate, the multi-task lasso estimate, and the original B]
