Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017

Size: px

Start display at page:

Download "Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017"

Chad Crawford
5 years ago
Views:

1 Machine Learning Regularization and Feature Selection Fabio Vandin November 13,

2 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m, with z i Z = X Y, generated from distribution D) training set available to A to produce A(S); H: the hypothesis (or model) set for A loss function: l(h, (x, y)), l : H Z R + L S (h): empirical risk or training error of hypothesis h H L S (h) = 1 m m l(h, z i ) i=1 L D (h): true risk or generalization error of hypothesis h H: L D (h) = E z D[l(h,z)] 2

3 Learning Paradigm We would like A to produce A(S) such that L D (A(S)) is small, or at least close to the smallest generalization error L D (h ) achievable by the best hypothesis h in H: h = arg min h H L D(h) We have seen two learning paradigms: Empirical Risk Minimization Structural Risk Minimization Minimum Description Length 3

4 Empirical Risk Minimization (ERM): pick A(S) arg min h H L S(h) guarantees learning for PAC learnable H Minimum Description Length: pick ( ) h + ln(2/δ) A(S) arg min L S (h) + ) h H 2m where h is the length of the description of h. considers trade-off between training error and complexity We now see another learning paradigm. 4

5 Regularized Loss Minimization Assume h is defined by a vector w = (w 1,..., w d ) T R d (e.g., linear models) Regularization function R : R R Regularized Loss Minimization (RLM): pick h obtained as arg min (L S(w) + R(w)) w Intuition: R(w) is a measure of complexity of hypothesis h defined by w regularization balances between low empirical risk and less complex hypotheses We will see some of the most common regularization function 5

6 Tikhonov regularization Regularization function: R(w) = λ w 2 λ R, λ > 0 l 2 norm: w 2 = d i=1 w 2 i Therefore the learning rule is: pick ( A(S) = arg min LS (w) + λ w 2) w Intuition: w 2 measures the complexity of hypothesis defined by w λ regulates the tradeoff between the empirical risk (L S (w)) or overfitting and the complexity ( w 2 ) of the model we pick 6

7 Ridge Regression Linear regression with squared loss + Tikhonov regularization ridge regression Linear regression with squared loss: given: training set S = ((x 1, y 1 ),..., (x m, y m )), with x i R d and y i R want: w which minimizes empirical risk: w = arg min w 1 m m ( w, x i y i ) 2 i=1 equivalently, find w which minimizes the residual sum of squares RSS(w) w = arg min w RSS(w) = arg min w m ( w, x i y i ) 2 i=1 7

8 Linear regression: pick w = arg min w RSS(w) = arg min w m ( w, x i y i ) 2 i=1 Ridge regression: pick w = arg min w λ w 2 + m ( w, x i y i ) 2 i=1 8

9 Let RSS: Matrix Form x 1 x 2 X =. x m X: design matrix 9

10 Let X: design matrix RSS: Matrix Form x 1 x 2 X =. x m y 1 y 2 y =. y m 9

11 Let RSS: Matrix Form x 1 x 2 X =. x m X: design matrix we have that RSS is y 1 y 2 y =. y m 9

12 Let RSS: Matrix Form x 1 x 2 X =. x m X: design matrix y 1 y 2 y =. y m we have that RSS is m ( w, x i y i ) 2 = (y Xw) T (y Xw) i=1 9

13 Ridge Regression: Matrix Form 10 Linear regression: pick arg min (y w Xw)T (y Xw) Ridge regression: pick arg min w λ w 2 + (y Xw) T (y Xw)

14 11 Want to find w which minimizes f (w) = λ w 2 + (y Xw) T (y Xw). How? Compute gradient it to 0. f (w) w of objective function w.r.t w and compare

15 11 Want to find w which minimizes f (w) = λ w 2 + (y Xw) T (y Xw). How? Compute gradient it to 0. f (w) w of objective function w.r.t w and compare f (w) w = 2λw 2XT (y Xw)

16 11 Want to find w which minimizes f (w) = λ w 2 + (y Xw) T (y Xw). How? Compute gradient it to 0. f (w) w of objective function w.r.t w and compare f (w) w = 2λw 2XT (y Xw) Then we need to find w such that 2λw 2X T (y Xw) = 0

17 12 2λw 2X T (y Xw) = 0 is equivalent to ( ) λi + X T X w = X T y Note: X T X is positive semidefinite λi is positive definite λi + X T X is positive definite λi + X T X is invertible Ridge regression solution: w = ( λi + X T X ) 1 X T y

18 Tikhonov Regularization: Some Results 13 We?ll show: Tikhonov regularization makes the learner stable w.r.t. small perturbations of the training set, which in turn leads to small bounds on generalization error Informally: an algorithm A is stable if a small change of the training data (i.e., its input) S will lead to a small change of its output hypothesis Questions: what is a small change of the training data? what is a small change of its output hypothesis?

19 Stability 14 small change of the training data = replace one sample! Given S = (z 1,..., z m ) and additional example (i.e., pair instance label/target) z let S (i) = (z 1,..., z i 1, z, z i+1,..., z m ) small change of its output hypothesis = on-average-replace-one-stable (OAROS) Definition Let ɛ : N R be a monotonically decreasing function. We say that a learning algorithm A is on-average-replace-one-stable OAROS with rate ɛ(m) if for every distribution D: E (S,z ) D m+1,i U(m)[l(A(S (i) ), z i ) l(a(s), z i )] ɛ(m)

20 Stability Rules do not Overfit proposition If algorithm A is OAROS with rate ɛ(m) then: E S D m[l D (A(S)) L S (A(S))] ɛ(m) Proof. Since S and z are both drawn i.i.d. from D, we have that for every i: E S [L D (A(S))] = E S,z [l(a(s), z )] = E S,z [l(a(s (i) ), z i )]. On the other hand we have: E S [L S (A(S))] = E S,i [l(a(s), z i )]. The proof follows from the definition of stability. 15

21 Tikhonov Regularization is a Stabilizer 16 Proposition Assume the loss function is convex and ρ-lipschitz continuous. Then, the RLM rule with regularizer λ w 2 is OAROS with rate. It follows that for the RLM rule: 2ρ λm E S D m[l D (A(S)) L S (A(S))] 2ρ λm A function f : R d R is ρ-lipschitz continuous if for every w 1, w 2 R d we have that f (w 2 ) f (w 2 ) ρ w 1 w 2

22 Note that The Fitting-Stability Tradeoff E S [L D (A(S))] = E S [L S (A(S))] + E S [L D (A(S)) L S (A(S))] Notes: E S [L S (A(S))]: how well A fits the training set E S [L D (A(S)) L S (A(S))] = overfitting, bounded by stability of A in Tikhonov regularization, λ controls tradeoff between the two terms Questions: how do L S (A(S)) and A(S) 2 = w 2 vary as a function of λ? how may E S [L D (A(S)) L S (A(S))] change as a function of λ? How do set λ? 17

23 18 Using the fitting-stability tradeoff decomposition and the result for convex, Lipschitz losses we can prove that knowing the properties (e.g., ρ in ρ-lipschitz continuity, etc.) one can pick λ to guarantee that c E S [L D (A(S))] min L D(w) + H m where c > 0 depends on the parameters of the loss function. Question: how do we pick λ in practice?

24 l 1 Regularization 19 Regularization function: R(w) = λ w 1 λ R, λ > 0 l 1 norm: w 1 = d i=1 w i Therefore the learning rule is: pick A(S) = arg min w (L S(w) + λ w 1 ) Intuition: w 1 measures the complexity of hypothesis defined by w λ regulates the tradeoff between the empirical risk (L S (w)) or overfitting and the complexity ( w 1 ) of the model we pick

25 LASSO 20 Linear regression with squared loss + l 1 regression LASSO (least absolute shrinkage and selection operator) LASSO: pick Notes: w = arg min w λ w 1 + no closed form solution! m ( w, x i y i ) 2 l 1 norm is a convex function and squared loss is a convex problem can be solved efficiently! (true for every convex loss function) i=1 l 1 regularization often induces sparse solutions

26 LASSO and Sparse Solution 21 Ridge$Regression$ LASSO$ w i$ w i$ 1/λ$ 1/λ$

27 LASSO and Sparse Solution 21 Ridge$Regression$ LASSO$ w i$ w i$ 1/λ$ 1/λ$

28 Ridge Regression vs LASSO LASSO {w: L S (w)=α} RIDGE REGRESSION w 2 w 2 w 1 s w 2 s w 1 w 1 l 1 regularization performs a sort of feature selection 22

29 Feature Selection 23 We have a large pool of features Goal: select a small number of features that will be used by our (final) predictor Assume X = R d. Goal: learn (final) predictor using k << d predictors Motivation? prevent overfitting: less predictors hypotheses of lower complexity! predictions can be done faster useful in many applications!

30 Feature Selection: Computational Problem 24 Assume that we use the Empirical Risk Minimization (ERM) procedure. The problem of selecting k features that minimize the empirical risk can be written as: where w 0 = {i : w i 0} How can we solve it? min L S(w) subject to w 0 k w

31 How do we find the solution to the problem? 25 min L S(w) subject to w 0 k w Brute-Force Algorithm 1 sol ; 2 Foreach subset P of k features of do: i) wp best model using only the features in k ii) If L S (wp ) < L S(sol) then sol wp 3 Return sol Complexity? Learn Θ ( (d k) ) Θ ( n k) models exponential algorithm! Can we do better? Proposition The optimization problem of feature selection NP-hard.

32 Greedy Algorithms for Feature Selection 26 forward selection: start from the empty solution, add one features at the time, until solution has cardinality k (or no improvement) backward selection: start from the solution which includes all features, remove one features at the time, until solution has cardinality k (or no improvement) ( k 1 ) Complexity? Learn Θ 0=1 d i Θ (kd) models

33 Bibliography [UML] 27 Regularization and Ridge Regression: Chapter 12 no section 13.3; sectio 13.4 only up to Corollary 13.8 (excluded) Feature Selection and LASSO: Chapter 25 only section (introduction and Backward Elimination ) and

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017 Machine Learning Regularization and Feature Selection Fabio Vandin November 14, 2017 1 Regularized Loss Minimization Assume h is defined by a vector w = (w 1,..., w d ) T R d (e.g., linear models) Regularization