6. Regularized linear regression

Size: px

Start display at page:

Download "6. Regularized linear regression"

Samson Greer
6 years ago
Views:

1 Foundations of Machine Learning École Centrale Paris Fall Regularized linear regression Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe paristech.fr

2 Practical maters

3 Learning objectives Understand regularization as a means to control model complexity. Define Lasso, ridge regression, elastic net. Understand the role of the l1 and l2 norms in regularization Interpret solution paths for Lasso and ridge regression. 3

4 Regression setting features variables descriptors regressors attributes p data matrix design matrix n observations samples data points X outcome target label y 4

5 Large p, small n E.g. neuroimaging thousands of brain regions / pixels / voxels much fewer patients genetics and genomics thousands of genes, millions of SNPs... usually, at best thousands of patients 5

6 Linear regression Least-squares fit (equivalent to MLE under the assumption of Gaussian noise): The solution is uniquely defined when n > p and XTX invertible. 6

7 When X X not inversible T Pseudo-inverse Linear system of p equations: Numerical methods Gaussian elimination LU decomposition Gradient descent J(β) β 7

8 Linear regression when p >> n Simulated data: p=1000, n=100, 10 causal features True coefficients Figures: Víctor Bellón Predicted coefficients 8

9 Advantages of least-squares fit Unbiased E[ββ] = β Explicit form Computational time? 9

10 Advantages of least-squares fit Unbiased E[ββ] = β Explicit form Computational time: O(np²+p³) compute XTX invert XTX computation of XTy: O(np) computation of (XTX)-1 XTy: O(np) 10

11 Cons of least-squares fit Multicollinearity leads to high variance of the estimator Requires n > p Prediction error increases linearly as a function of p Hard to interpret when p is large Would prefer a small subset with strong effects. 11

12 Regularization 12

13 Regularization Minimize SSE + λ penalty on model complexity Biased estimator when λ 0. Trade bias for a smaller variance. λ can be set by cross-validation. Simpler model fewer parameters shrinkage: drive the coefficients of the parameters towards 0. 13

14 Ridge regression 14

15 Ridge regression Sum-of-squares penalty Compute the ridge regression estimator. 15

16 Ridge regression Sum-of-squares penalty Compute the ridge regression estimator. 16

17 feature coefficients Ridge regression solution path decreasing value of λ 17

18 Standardization What happens if we multiply xj by a constant? For standard linear regression For ridge regression 18

19 Standardization What happens if we multiply xj by a constant? For standard linear regression: For ridge regression: Not so clear, because of the penalization term Need to standardize the features average value of xj 19

20 Ridge regression Grouped selection: correlated variables get similar weights identical variables get identical weights Ridge regression shrinks coefficients towards 0 but does not result in a sparse model. Sparsity: many coefficients get a weight of 0 they can be eliminated from the model. 20

21 Lasso 21

22 Lasso L1 penalty aka basis pursuit (signal processing) no closed-form solution quadratic programming problem: equivalent to for a unique one-to-one match between t and λ. QP: maximize a quadratic form under linear constraints. 22

23 Equivalence between the formulations minimize f(β) under the constraint g(β) 0 23

24 Equivalence between the formulations minimize f(β) under the constraint g(β) 0 Case 1: the unconstraind minimum lies in the feasible region. 24

25 Equivalence between the formulations minimize f(β) under the constraint g(β) 0 Case 1: the unconstraind minimum lies in the feasible region. Case 2: it does not. feasible region iso-contours of f Where is the solution? unconstrained minimum of f 25

26 Equivalence between the formulations minimize f(β) under the constraint g(β) 0 Case 1: the unconstraind minimum lies in the feasible region. Case 2: it does not. feasible region iso-contours of f solution unconstrained minimum of f 26

27 Equivalence between the formulations minimize f(β) under the constraint g(β) 0 Case 1: the unconstraind minimum lies in the feasible region. Case 2: it does not. feasible region iso-contours of f unconstrained minimum of f The gradient is orthonormal to the curve and points towards the direction of max increase. 27

28 Equivalence between the formulations minimize f(β) under the constraint g(β) 0 Case 1: the unconstraind minimum lies in the feasible region. Case 2: it does not. Then it lies at a point where the feasible region and the iso-contours of f are tangent and the gradients are in opposite directions. The Lagrangian f(β) + λ g(β) must be minimized (λ 0) 28

29 feature coefficients Lasso solution path decreasing value of λ 29

30 Forward stepwise regression Build model sequentially, adding one variable at a time Start with the intercept At each step, add the variable that most improves the fit Stop when Greedy solution 30

31 Least Angle Regression At each step, add only as much of a variable as needed 31

32 Least Angle Regression At each step, add only as much of a variable as needed 32

33 Least Angle Regression At each step, add only as much of a variable as needed step size 33

34 Least Angle Regression At each step, add only as much of a variable as needed 34

35 Least Angle Regression At each step, add only as much of a variable as needed 35

36 Least Angle Regression At each step, add only as much of a variable as needed 36

37 Least Angle Regression At each step, add only as much of a variable as needed Maximum number of steps: max(n-1, p) 37

38 Elastic Net 38

39 Elastic Net Combine lasso and ridge regression Select variables like the lasso. Shrinks together coefficients of correlated variables like the ridge regression. 39

40 E.g. Leukemia data Elastic Net results in more non-zero coefficients than Lasso, but with smaller amplitudes. 40

41 Lq-norm regularization 41

42 Lq-norm regularization Equivalently: 42

43 Lasso vs. ridge L1 norm β2 iso-contours of the least-squares error unconstrained min. = least-squares sol. β1 feasible region: β1 + β2 t 43

44 Lasso vs. ridge L1 norm β2 iso-contours of the least-squares error L2 norm β2 unconstrained min. = least-squares sol. β1 feasible region: β1 + β2 t β1 feasible region: β1² + β2² t 44

45 Elastic net Elastic penalty 45

46 Structured regularization 46

47 Group lasso Use K predefined groups of variables that are known to work together and expected to be either all active or all inactive together. E.g. genes belonging to the same biological pathway. Size of group k Features belonging to group k 47

48 Other examples of structured penalties Overlapping groups Jacob et al. (2009). Group lasso with overlap and graph lasso. ICML. Graphs Li & Li (2010). Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. App. Stats. Trees Zhao et al. (2006). Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat. Multiple related tasks Obozinski et al. (2006). Multitask feature selection. Technical Report, UC Berkeley. 48

49 Minimize SSE + λ x regularizer Ridge Lasso randomly picks one of several correlated variables sparse LAR algorithm Elastic net gives similar weights to similar variables not very sparse analytical solution selects variables like the lasso shrinks together the coeffcients of correlated variables. Many other regularizers are possible Lp norms, groups, graphs, trees... 49

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived