ORIE 4741: Learning with Big Messy Data. Regularization

Francis Moody
6 years ago
ORIE 4741: Learning with Big Messy Data
Regularization
Professor Udell
Operations Research and Information Engineering, Cornell
October 26

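Regularized empirical risk minimization

choose model by solving

$$\text{minimize} \quad \sum_{i=1}^n \ell(x_i, y_i; w) + r(w)$$

with variable $w \in \mathbb{R}^d$
- parameter vector $w \in \mathbb{R}^d$
- loss function $\ell : \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^d \to \mathbb{R}$
- regularizer $r : \mathbb{R}^d \to \mathbb{R}$

why?
- want to minimize the risk $\mathbb{E}_{(x,y) \sim P}\, \ell(x, y; w)$
- approximate it by the empirical risk $\sum_{i=1}^n \ell(x_i, y_i; w)$
- add regularizer to help model generalize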

Example: regularized least squares

find best model by solving

$$\text{minimize} \quad \sum_{i=1}^n \ell(x_i, y_i; w) + r(w)$$

with variable $w \in \mathbb{R}^d$

ridge regression, aka quadratically regularized least squares:
- loss function $\ell(x, y; w) = (y - w^T x)^2$
- regularizer $r(w) = \|w\|^2$

Regularization

why regularize?
- reduce variance of the model
- impose prior structural knowledge
- improve interpretability

why not regularize?
- Gauss-Markov theorem: least squares is the best linear unbiased estimator
- regularization adds bias

Regularizers: a tour

we might choose regularizer so models will be
- small
- sparse
- nonnegative
- smooth
- ...

comparison with forward- and backward-stepwise selection: regularized models tend to have lower variance. see Elements of Statistical Learning (Hastie, Tibshirani, Friedman) for more information.

Quadratic regularizer

quadratic regularizer

$$r(w) = \lambda \sum_{i=1}^d w_i^2$$

ridge regression

$$\text{minimize} \quad \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \sum_{i=1}^d w_i^2$$

with variable $w \in \mathbb{R}^d$

solution

$$w = (X^T X + \lambda I)^{-1} X^T y$$
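The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `ridge` and the synthetic data are hypothetical and not taken from the course demo.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve minimize ||y - Xw||^2 + lam * ||w||^2 via the normal equations."""
    d = X.shape[1]
    # Solve (X^T X + lam I) w = X^T y instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

w_ls = ridge(X, y, 0.0)       # lam = 0 recovers ordinary least squares
w_ridge = ridge(X, y, 10.0)   # lam > 0 shrinks the coefficients
```

Comparing `np.linalg.norm(w_ridge)` with `np.linalg.norm(w_ls)` shows the shrinkage: the ridge solution has the smaller norm.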

Quadratic regularizer
- shrinks coefficients towards 0
- shrinks more in the direction of the smallest singular values of $X$
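One way to see both claims is through the singular value decomposition. The notation here is an assumption introduced for this sketch, not from the slides: write the thin SVD as $X = \sum_i s_i u_i v_i^T$ with singular values $s_i > 0$.

```latex
\begin{aligned}
w_{\text{ridge}} &= (X^T X + \lambda I)^{-1} X^T y
  = \sum_{i=1}^d \frac{s_i}{s_i^2 + \lambda}\, v_i\, (u_i^T y), \\
X w_{\text{ridge}} &= \sum_{i=1}^d \frac{s_i^2}{s_i^2 + \lambda}\, u_i\, (u_i^T y),
\qquad
X w_{\text{ls}} = \sum_{i=1}^d u_i\, (u_i^T y).
\end{aligned}
```

Each component of the least squares fit is multiplied by $s_i^2 / (s_i^2 + \lambda) < 1$, and this factor is closest to 0 for the smallest $s_i$.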

Is least squares scaling invariant?

suppose Alice and Bob do the same experiment
- Alice measures distance in mm
- Bob measures distance in km

they each compute an estimator with least squares and compare their predictions

Q: Do they make the same predictions?
A: Yes!

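Least squares is scaling invariant

if $\beta \in \mathbb{R}$, $D \in \mathbb{R}^{d \times d}$ is diagonal and invertible, and Alice's measurements $(X', y')$ are related to Bob's $(X, y)$ by

$$y' = \beta y, \qquad X' = XD,$$

then the resulting least squares models are

$$w = (X^T X)^{-1} X^T y, \qquad w' = (X'^T X')^{-1} X'^T y'$$

and they make the same predictions (up to the change of units $\beta$):

$$\begin{aligned}
X' w' &= X' (X'^T X')^{-1} X'^T y' \\
&= XD (D^T X^T X D)^{-1} D^T X^T \beta y \\
&= XD D^{-1} (X^T X)^{-1} (D^T)^{-1} D^T X^T \beta y \\
&= \beta X (X^T X)^{-1} X^T y \\
&= \beta X w
\end{aligned}$$

we say least squares is invariant under scaling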
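The identity $X'w' = \beta Xw$ can be checked numerically. The sketch below uses hypothetical synthetic data and hypothetical values for $\beta$ and $D$; only the relations $y' = \beta y$ and $X' = XD$ come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 3
X = rng.standard_normal((n, d))                     # Bob's features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)  # Bob's targets

beta = 1000.0                                       # hypothetical unit change for y
D = np.diag([1000.0, 2.0, 0.5])                     # hypothetical per-feature rescaling
X_alice, y_alice = X @ D, beta * y                  # Alice's data: X' = XD, y' = beta * y

def least_squares(A, b):
    # Equals (A^T A)^{-1} A^T b when A has full column rank.
    return np.linalg.lstsq(A, b, rcond=None)[0]

w = least_squares(X, y)
w_alice = least_squares(X_alice, y_alice)

# The coefficients differ, but the predictions agree up to the factor beta.
same_predictions = np.allclose(X_alice @ w_alice, beta * (X @ w))
```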

Is ridge regression scaling invariant?

suppose Alice and Bob do the same experiment
- Alice measures distance in mm
- Bob measures distance in km

they each compute an estimator with ridge regression and compare their predictions

Q: Do they make the same predictions?
A: No!

Ridge regression is not scaling invariant

if $\beta \in \mathbb{R}$, $D \in \mathbb{R}^{d \times d}$ is diagonal, and Alice's measurements $(X', y')$ are related to Bob's $(X, y)$ by

$$y' = \beta y, \qquad X' = XD,$$

then the resulting ridge regression models are

$$w = (X^T X + \lambda I)^{-1} X^T y, \qquad w' = (X'^T X' + \lambda I)^{-1} X'^T y'$$

and the predictions are

$$Xw = X (X^T X + \lambda I)^{-1} X^T y, \qquad X'w' = X' (X'^T X' + \lambda I)^{-1} X'^T y'$$

ridge regression is not invariant under coordinate transformations
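A small numerical experiment makes the failure of invariance concrete. This is a sketch with hypothetical synthetic data; the `ridge` helper, the scaling `D`, and the value of `lam` are assumptions chosen for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve minimize ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)

beta, lam = 1.0, 5.0
D = np.diag([1000.0, 1.0, 1.0])          # only the first feature changes units
X_alice, y_alice = X @ D, beta * y

pred_bob = X @ ridge(X, y, lam)
pred_alice = X_alice @ ridge(X_alice, y_alice, lam)

# Largest disagreement between the two sets of predictions.
gap = np.max(np.abs(pred_alice - beta * pred_bob))
```

Rescaling the first feature by 1000 makes its coefficient 1000 times smaller, so the same penalty $\lambda$ barely touches it; the two models are regularized differently and `gap` is clearly nonzero.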

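Scaling and offsets

to get the same answer no matter the units of measurement, standardize the data: for each column of $X$ and of $y$
- demean: subtract column mean
- standardize: divide by column standard deviation

let

$$\mu_j = \frac{1}{n} \sum_{i=1}^n X_{ij}, \qquad \mu = \frac{1}{n} \sum_{i=1}^n y_i$$

$$\sigma_j^2 = \frac{1}{n} \sum_{i=1}^n (X_{ij} - \mu_j)^2, \qquad \sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \mu)^2$$

solve

$$\text{minimize} \quad \sum_{i=1}^n \left( \frac{y_i - \mu}{\sigma} - \sum_{j=1}^d w_j \frac{X_{ij} - \mu_j}{\sigma_j} \right)^2 + \lambda \sum_{j=1}^d w_j^2$$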
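The standardization recipe can be sketched as follows. The helper names `standardize` and `ridge`, the synthetic data, and the mm/km rescaling of the first column are assumptions made for illustration.

```python
import numpy as np

def standardize(A):
    """Subtract each column's mean and divide by its standard deviation (1/n convention)."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

def ridge(X, y, lam):
    """Solve minimize ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X_km = rng.standard_normal((40, 3)) + 5.0
y = X_km @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)

X_mm = X_km.copy()
X_mm[:, 0] *= 1e6            # same first feature, recorded in mm instead of km

lam = 5.0
w_km = ridge(standardize(X_km), standardize(y), lam)
w_mm = ridge(standardize(X_mm), standardize(y), lam)
```

After standardization the two design matrices coincide up to rounding, so `w_km` and `w_mm` agree and the choice of units no longer matters.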

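Scale the regularizer, not the data

instead of

$$\text{minimize} \quad \sum_{i=1}^n \left( \frac{y_i - \mu}{\sigma} - \sum_{j=1}^d w_j \frac{X_{ij} - \mu_j}{\sigma_j} \right)^2 + \lambda \sum_{j=1}^d w_j^2,$$

- multiply through by $\sigma^2$
- reparametrize $w'_j = \frac{\sigma}{\sigma_j} w_j$

to find the equivalent problem

$$\text{minimize} \quad \sum_{i=1}^n \left( y_i - \sum_{j=1}^d w'_j X_{ij} + c(w') \right)^2 + \lambda \sum_{j=1}^d \sigma_j^2 (w'_j)^2,$$

where $c(w')$ is some linear function of $w'$

finally absorb $c(w')$ into the constant term in the model

$$\text{minimize} \quad \|y - Xw'\|^2 + \lambda \sum_{j=1}^d \sigma_j^2 (w'_j)^2$$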
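The equivalence between standardizing the data and scaling the regularizer can be verified numerically. In this sketch (hypothetical data and variable names), centering $X$ and $y$ stands in for the unpenalized constant term that absorbs $c(w')$; that substitution is an assumption of the example, justified because the optimal unpenalized constant is $\mu - \sum_j w'_j \mu_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 60, 3, 2.0
X = rng.standard_normal((n, d)) * np.array([10.0, 1.0, 0.1]) + 3.0
y = X @ np.array([0.2, -1.0, 4.0]) + rng.standard_normal(n)

mu_j, sigma_j = X.mean(axis=0), X.std(axis=0)
mu, sigma = y.mean(), y.std()

# (a) Standardize the data, then use the plain quadratic regularizer.
Z, z = (X - mu_j) / sigma_j, (y - mu) / sigma
w = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ z)

# (b) Keep the original units, but weight the penalty on coordinate j by sigma_j^2.
Xc, yc = X - mu_j, y - mu
w_prime = np.linalg.solve(Xc.T @ Xc + lam * np.diag(sigma_j**2), Xc.T @ yc)

# The two solutions are related by the reparametrization w'_j = (sigma / sigma_j) w_j.
matches = np.allclose(w_prime, sigma * w / sigma_j)
```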

Scaling and offsets

a different solution to scaling and offsets: take the MAP view
- $r(w)$ is negative log prior on $w$
- with a gaussian prior,

$$r(w) = \sum_i \sigma_i^2 w_i^2$$

  where, up to a constant factor, $1/\sigma_i^2$ is the variance of the prior on the $i$th entry of $w$
- if you believe the noise in the $i$th feature is large, penalize the $i$th entry more ($\sigma_i$ big)
- if you believe the noise in the $i$th feature is small, penalize the $i$th entry less ($\sigma_i$ small)
- if you measure $X$ or $y$ in different units, your prior on $w$ should change accordingly

example: don't penalize the offset $w_n$ of the model ($\sigma_n = 0$)

$$r(w) = \sum_{i=1}^{n-1} w_i^2$$
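A per-coordinate penalty, including the unpenalized offset from the example, might be implemented like this. The function `ridge_weighted` and the data are hypothetical; the offset is modeled by appending a column of ones as the last feature.

```python
import numpy as np

def ridge_weighted(X, y, penalties):
    """Solve minimize ||y - Xw||^2 + sum_i penalties[i] * w[i]^2."""
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)

rng = np.random.default_rng(0)
n = 50
features = rng.standard_normal((n, 2))
X = np.column_stack([features, np.ones(n)])    # last column carries the offset
y = features @ np.array([1.0, -1.0]) + 100.0 + 0.1 * rng.standard_normal(n)

lam = 10.0
w_all = ridge_weighted(X, y, lam * np.array([1.0, 1.0, 1.0]))    # offset penalized
w_free = ridge_weighted(X, y, lam * np.array([1.0, 1.0, 0.0]))   # offset left free
```

With the offset penalized, `w_all[-1]` is pulled well below the true offset of 100, while `w_free[-1]` stays close to it.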

$\ell_1$ regularization

$\ell_1$ regularizer

$$r(w) = \lambda \sum_{i=1}^d |w_i|$$

lasso problem

$$\text{minimize} \quad \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \sum_{i=1}^d |w_i|$$

with variable $w \in \mathbb{R}^d$
- penalizes large $w$ less than ridge regression
- no closed form solution
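Because there is no closed form, the lasso is solved iteratively. The sketch below uses proximal gradient descent (ISTA), which is one common choice and not necessarily the method used in the course; the function names, step size rule, iteration count, and data are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, iters=5000):
    """Proximal gradient (ISTA) for minimize ||y - Xw||^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Hypothetical data: only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_hat = lasso_ista(X, y, lam=20.0)
```

The soft-thresholding step sets small entries to exactly zero, which is how the iterates become sparse rather than merely small.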

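Recall $\ell_p$ norms

$\ell_p$ norm $\|w\|_p$ for $p \in (0, \infty)$ is defined as

$$\|w\|_p = \left( \sum_{i=1}^d |w_i|^p \right)^{1/p}$$

examples:
- $\ell_1$ norm is $\|w\|_1 = \sum_{i=1}^d |w_i|$
- $\ell_2$ norm is $\|w\|_2 = \sqrt{\sum_{i=1}^d w_i^2}$

for $p = 0$ or $p = \infty$, $\ell_p$ norm is defined by taking limit:
- $\ell_\infty$ norm is $\|w\|_\infty = \lim_{p \to \infty} \left( \sum_{i=1}^d |w_i|^p \right)^{1/p} = \max_i |w_i|$
- $\ell_0$ norm is $\|w\|_0 = \lim_{p \to 0} \sum_{i=1}^d |w_i|^p = \mathrm{card}(w)$, number of nonzeros in $w$

note: $\ell_0$ is not actually a norm (not absolutely homogeneous since $\|\alpha w\|_0 = \|w\|_0$ for $\alpha \neq 0$)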
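A short numerical illustration of these definitions, using a hypothetical vector `w` and a hypothetical helper `lp_norm`:

```python
import numpy as np

def lp_norm(w, p):
    """(sum_i |w_i|^p)^(1/p) for finite p > 0."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([3.0, 0.0, -4.0])

l1 = lp_norm(w, 1)               # |3| + |0| + |-4| = 7
l2 = lp_norm(w, 2)               # sqrt(9 + 16) = 5
linf = np.max(np.abs(w))         # largest magnitude, 4
l0 = np.count_nonzero(w)         # 2 nonzero entries

# Large p approaches the max; for small p, sum_i |w_i|^p approaches the nonzero count.
near_inf = lp_norm(w, 200)
near_zero = np.sum(np.abs(w) ** 1e-6)

# Scaling w does not change the count of nonzeros, so l0 is not homogeneous.
l0_scaled = np.count_nonzero(5.0 * w)
```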

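$\ell_1$ regularization

why use $\ell_1$?
- best convex lower bound for $\ell_0$ on the $\ell_\infty$ unit ball
- tends to produce sparse solution

example: suppose $X_{:1} = y$, $X_{:2} = \alpha y$ for some $\alpha > 0$

fit lasso model and ridge regression model as $\lambda \to 0$

$$w^{\text{ridge}} = \lim_{\lambda \to 0} \operatorname*{argmin}_w \; \|y - Xw\|^2 + \lambda \|w\|_2^2$$

$$w^{\text{lasso}} = \lim_{\lambda \to 0} \operatorname*{argmin}_w \; \|y - Xw\|^2 + \lambda \|w\|_1$$

- as $\lambda \to 0$, solution has $w_1 + \alpha w_2 = 1$
- ridge regression minimizes $\|w\|_2^2$ subject to this constraint $\implies$ $w = (1, \alpha) / (1 + \alpha^2)$; for $\alpha = 1$, $w_1 = w_2 = \frac{1}{2}$
- lasso minimizes $|w_1| + |w_2|$ subject to this constraint $\implies$ $w_1 = 1$, $w_2 = 0$ is valid whenever $\alpha \le 1$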
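The two-column example can be reproduced numerically. This sketch assumes $\alpha = 1/2$, a small but nonzero $\lambda$, a hypothetical unit-norm $y$, and proximal gradient (ISTA) as the lasso solver; none of these specifics come from the slides.

```python
import numpy as np

alpha, lam = 0.5, 1e-2
y = np.array([0.6, 0.8])                     # hypothetical target with ||y|| = 1
X = np.column_stack([y, alpha * y])          # X[:, 0] = y, X[:, 1] = alpha * y

# Ridge: closed form, spreads the weight over both columns.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Lasso: proximal gradient (ISTA) with step 1/L, L = 2 * sigma_max(X)^2.
w_lasso = np.zeros(2)
step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
for _ in range(5000):
    v = w_lasso - step * (2.0 * X.T @ (X @ w_lasso - y))
    w_lasso = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
```

For small `lam`, `w_ridge` is close to $(1, \alpha)/(1 + \alpha^2) = (0.8, 0.4)$, while `w_lasso` puts all its weight on the first column and sets the second coefficient to exactly zero.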

Sparsity

why would you want sparsity?
- credit card application: requires less info from applicant
- medical diagnosis: easier to explain model to doctor
- genomic study: which genes to investigate?

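Convex indicator

define convex indicator $\mathbb{1} : \{\text{true}, \text{false}\} \to \mathbb{R} \cup \{\infty\}$

$$\mathbb{1}(z) = \begin{cases} 0 & z \text{ is true} \\ \infty & z \text{ is false} \end{cases}$$

define convex indicator of set $C$

$$\mathbb{1}_C(x) = \mathbb{1}(x \in C) = \begin{cases} 0 & x \in C \\ \infty & \text{otherwise} \end{cases}$$

don't confuse this with the boolean indicator $1(z)$ (no standard notation...)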

Nonnegative regularization

nonnegative regularizer

$$r(w) = \sum_{i=1}^d \mathbb{1}(w_i \ge 0)$$

nonnegative least squares problem (NNLS)

$$\text{minimize} \quad \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \sum_{i=1}^d \mathbb{1}(w_i \ge 0)$$

with variable $w \in \mathbb{R}^d$
- value is $\infty$ if $w_i < 0$, so solution is always nonnegative
- often, solution is also sparse
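Since the indicator term is either 0 or $\infty$, NNLS is the same as least squares constrained to $w \ge 0$. One simple way to solve it is projected gradient descent; the sketch below is an illustration under that assumption, with a hypothetical function name, iteration count, and data.

```python
import numpy as np

def nnls_projected_gradient(X, y, iters=5000):
    """Projected gradient for minimize ||y - Xw||^2 subject to w >= 0."""
    w = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)
        # Projection onto the nonnegative orthant is an entrywise clip at 0.
        w = np.maximum(w - step * grad, 0.0)
    return w

# Hypothetical data whose generating coefficients include a negative entry.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(60)

w_nn = nnls_projected_gradient(X, y)
```

Coordinates that unconstrained least squares would make negative get clipped to exactly 0, which is one reason NNLS solutions are often sparse.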

Nonnegative coefficients

why would you want nonnegativity?
- electricity usage: how often is device turned on?
  - n = times, d = electric devices, y = usage, X = which devices use power at which times
  - w = devices used by household
- hyperspectral imaging: which species are present?
  - n = frequencies, d = possible materials, y = observed spectrum, X = known spectrum of each material
  - w = material composition of location
- logistics: which routes to run?
  - n = locations, d = possible routes, y = demand, X = which routes visit which locations
  - w = size of truck to send on each route

Demo: Regularized Regression

RegularizedRegression.ipynb

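Smooth coefficients

smooth regularizer

$$r(w) = \sum_{i=1}^{d-1} (w_{i+1} - w_i)^2 = \|Dw\|^2$$

where $D \in \mathbb{R}^{(d-1) \times d}$ is the first order difference operator

$$D_{ij} = \begin{cases} -1 & j = i \\ 1 & j = i + 1 \\ 0 & \text{else} \end{cases}$$

smoothed least squares problem

$$\text{minimize} \quad \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \|Dw\|^2$$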
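The smoothed least squares problem is quadratic, so setting the gradient to zero gives the normal equations $(X^T X + \lambda D^T D) w = X^T y$. The sketch below builds $D$ and solves them; the function names and data are hypothetical, and the sign convention $(Dw)_i = w_{i+1} - w_i$ is an assumption (the penalty $\|Dw\|^2$ is the same with either sign).

```python
import numpy as np

def difference_operator(d):
    """(d-1) x d matrix D with (D w)_i = w_{i+1} - w_i."""
    D = np.zeros((d - 1, d))
    idx = np.arange(d - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def smoothed_least_squares(X, y, lam):
    """Solve minimize ||y - Xw||^2 + lam * ||Dw||^2 via the normal equations."""
    D = difference_operator(X.shape[1])
    return np.linalg.solve(X.T @ X + lam * (D.T @ D), X.T @ y)

# Hypothetical data with slowly varying true coefficients.
rng = np.random.default_rng(0)
d = 20
w_true = np.sin(np.linspace(0.0, np.pi, d))
X = rng.standard_normal((30, d))
y = X @ w_true + 0.5 * rng.standard_normal(30)

w_rough = smoothed_least_squares(X, y, 0.0)
w_smooth = smoothed_least_squares(X, y, 50.0)
```

The roughness `np.sum(np.diff(w) ** 2)` is smaller for `w_smooth` than for `w_rough`, since increasing $\lambda$ can only decrease the penalty term at the optimum.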

Why smooth?
- allow model to change over space or time
  - e.g., different years in tax data
- interpolates between one model and separate models for different domains
  - e.g., counties in tax data
- can couple any pairs of model coefficients, not just $(i, i + 1)$
