Machine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017



2 Regularized Loss Minimization
Assume $h$ is defined by a vector $w = (w_1, \dots, w_d)^T \in \mathbb{R}^d$ (e.g., linear models).
Regularization function $R : \mathbb{R}^d \to \mathbb{R}$.
Regularized Loss Minimization (RLM): pick $h$ obtained as $\arg\min_w \left( L_S(w) + R(w) \right)$
Intuition: $R(w)$ is a measure of the complexity of the hypothesis $h$ defined by $w$; regularization balances between low empirical risk and less complex hypotheses.
We will see some of the most common regularization functions.

3 $\ell_1$ Regularization
Regularization function: $R(w) = \lambda \|w\|_1$, with $\lambda \in \mathbb{R}$, $\lambda > 0$
$\ell_1$ norm: $\|w\|_1 = \sum_{i=1}^{d} |w_i|$
Therefore the learning rule is: pick $A(S) = \arg\min_w \left( L_S(w) + \lambda \|w\|_1 \right)$
Intuition: $\|w\|_1$ measures the complexity of the hypothesis defined by $w$; $\lambda$ regulates the tradeoff between the empirical risk ($L_S(w)$), or overfitting, and the complexity ($\|w\|_1$) of the model we pick.
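As a small illustration of the learning rule above, the sketch below evaluates $L_S(w) + \lambda \|w\|_1$ for a linear model. It assumes that $L_S$ is the mean squared error, which the slide does not specify; the function names `l1_norm` and `regularized_objective` are hypothetical.

```python
import numpy as np

def l1_norm(w):
    """Return ||w||_1, the sum of the absolute values of the entries of w."""
    return float(np.sum(np.abs(w)))

def regularized_objective(w, X, y, lam):
    """Return L_S(w) + lam * ||w||_1.

    Assumption: L_S is the mean squared error of the linear predictor <w, x>.
    """
    residuals = X @ w - y
    empirical_risk = float(np.mean(residuals ** 2))
    return empirical_risk + lam * l1_norm(w)

# Tiny example: w fits both points exactly, so only the penalty term remains.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -2.0])
w = np.array([1.0, -2.0])
print(regularized_objective(w, X, y, lam=0.1))
```

Increasing `lam` makes the same `w` more expensive, which is the tradeoff between empirical risk and complexity described in the intuition.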

4 LASSO
Linear regression with squared loss + $\ell_1$ regularization $\Rightarrow$ LASSO (least absolute shrinkage and selection operator)
LASSO: pick $w = \arg\min_w \left( \lambda \|w\|_1 + \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2 \right)$
How? Notes:
no closed form solution!
the $\ell_1$ norm is a convex function and the squared loss is convex $\Rightarrow$ the problem is convex and can be solved efficiently! (true for every convex loss function)
$\ell_1$ regularization often induces sparse solutions
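The slide does not say which algorithm is used to solve the LASSO problem. One standard option for this kind of convex objective is proximal gradient descent (ISTA), sketched below for $\lambda \|w\|_1 + \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2$. The choice of ISTA, the fixed step size, the iteration count, and the function names are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry of v toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Minimize lam * ||w||_1 + sum_i (<w, x_i> - y_i)^2 by proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    # Lipschitz constant of the gradient of the squared-loss term.
    lipschitz = 2.0 * np.linalg.norm(X, ord=2) ** 2
    if lipschitz == 0.0:
        return w  # X is all zeros: w = 0 is optimal
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)  # gradient of the squared-loss term
        w = soft_threshold(w - grad / lipschitz, lam / lipschitz)
    return w

# Synthetic data where only features 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ w_true
print(np.round(lasso_ista(X, y, lam=5.0), 3))
```

The soft-thresholding step sets small coordinates exactly to zero, which is one way to see why $\ell_1$ regularization tends to produce sparse solutions.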

5 LASSO and Sparse Solution
[Figure: two panels, "Ridge Regression" and "LASSO", each plotting the coefficients $w_i$ against $1/\lambda$.]

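6 Ridge Regression vs LASSO
[Figure: two panels in the $(w_1, w_2)$ plane, "LASSO" and "Ridge Regression", each showing level sets $\{w : L_S(w) = \alpha\}$ together with the region where the $\ell_1$ norm (LASSO) or the $\ell_2$ norm (ridge regression) of $w$ is bounded by $s$.]
$\ell_1$ regularization performs a sort of feature selection.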

7 Feature Selection
In general, in machine learning one has to decide what to use as features (= input) for learning. Even if somebody gives us a representation as a feature vector, maybe there is a better representation? What is better?
Example: features $x_1, x_2$, output $y$, with $x_1 \sim U[-1, 1]$, $y = x_1^2$, $x_2 \sim y + U[-0.01, 0.01]$.
Which feature is better: $x_1$ or $x_2$? No free lunch...
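The example can be simulated to see why the answer depends on the hypothesis class. The sketch below assumes the reading $x_1 \sim U[-1, 1]$, $y = x_1^2$, $x_2 = y + U[-0.01, 0.01]$ and compares least-squares fits; the sample size, the seed, and the helper `linear_fit_mse` are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
x1 = rng.uniform(-1.0, 1.0, size=m)
y = x1 ** 2
x2 = y + rng.uniform(-0.01, 0.01, size=m)

def linear_fit_mse(x, y):
    """Mean squared error of the best predictor a*x + b, fit by least squares."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

mse_x1 = linear_fit_mse(x1, y)               # linear predictor on x1
mse_x2 = linear_fit_mse(x2, y)               # linear predictor on x2
mse_x1_squared = linear_fit_mse(x1 ** 2, y)  # degree-2 transform of x1

print(mse_x1, mse_x2, mse_x1_squared)
```

With linear predictors, $x_2$ is the more useful feature, since $y$ is not a linear function of $x_1$. Once a quadratic transform of $x_1$ is allowed, $x_1$ predicts $y$ exactly, so neither feature is better in every setting.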

8 Feature Selection: Scenario
We have a large pool of features. Goal: select a small number of features that will be used by our (final) predictor.
Assume $\mathcal{X} = \mathbb{R}^d$. Goal: learn the (final) predictor using $k \ll d$ predictors (features).
Motivation?
prevent overfitting: fewer predictors $\Rightarrow$ hypotheses of lower complexity!
predictions can be done faster
useful in many applications!

9 Feature Selection: Computational Problem
Assume that we use the Empirical Risk Minimization (ERM) procedure. The problem of selecting $k$ features that minimize the empirical risk can be written as:
$\min_w L_S(w)$ subject to $\|w\|_0 \le k$
where $\|w\|_0 = |\{i : w_i \ne 0\}|$.
How can we solve it?
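The constraint $\|w\|_0 \le k$ just counts the features that a weight vector actually uses. A minimal sketch, with hypothetical helper names `l0_norm` and `is_feasible`:

```python
import numpy as np

def l0_norm(w):
    """Return ||w||_0, the number of nonzero entries of w."""
    return int(np.count_nonzero(w))

def is_feasible(w, k):
    """Check the constraint ||w||_0 <= k of the feature selection problem."""
    return l0_norm(w) <= k

w = np.array([0.0, 1.5, 0.0, -2.0])
print(l0_norm(w), is_feasible(w, 2), is_feasible(w, 1))
```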

10 Subset Selection
How do we find the solution to the problem below?
$\min_w L_S(w)$ subject to $\|w\|_0 \le k$
Let: $I = \{1, \dots, d\}$; given $p = \{i_1, \dots, i_k\} \subseteq I$: $H_p$ = hypotheses/models where only features $w_{i_1}, w_{i_2}, \dots, w_{i_k}$ are used.
$P^{(k)} \leftarrow \{J \subseteq I : |J| = k\}$;
foreach $p \in P^{(k)}$ do
    $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
return $h^{(k)} \leftarrow \arg\min_{p \in P^{(k)}} L_S(h_p)$;
Complexity? Learn $\Theta\left(\binom{d}{k}\right)$ models, i.e. $\Theta(d^k)$ for constant $k$ $\Rightarrow$ exponential algorithm!
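The exhaustive procedure can be sketched directly with `itertools.combinations`. The code below assumes linear predictors fit by least squares with mean squared error as $L_S$, and it indexes features from 0 rather than 1; the names `fit_on_subset` and `best_subset` are hypothetical.

```python
import itertools
import numpy as np

def fit_on_subset(X, y, subset):
    """ERM over H_p: least-squares fit using only the columns in `subset`.

    Returns (w, L_S(w)), where w is zero outside `subset`.
    """
    w = np.zeros(X.shape[1])
    cols = list(subset)
    if cols:
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        w[cols] = coef
    risk = float(np.mean((X @ w - y) ** 2))
    return w, risk

def best_subset(X, y, k):
    """Try every subset of exactly k features; return (subset, w, risk) of the best."""
    best = None
    for subset in itertools.combinations(range(X.shape[1]), k):
        w, risk = fit_on_subset(X, y, subset)
        if best is None or risk < best[2]:
            best = (subset, w, risk)
    return best

# Synthetic data: only features 1 and 3 (0-indexed) influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3]
print(best_subset(X, y, k=2)[0])
```

The loop fits one model per subset, so the number of fits grows as $\binom{d}{k}$, which is why this is only practical for small $d$ and $k$.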

11 What about finding the best subset of features (of any size)?
for $k \leftarrow 0$ to $d$ do
    $P^{(k)} \leftarrow \{J \subseteq I : |J| = k\}$;
    foreach $p \in P^{(k)}$ do
        $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
    $h^{(k)} \leftarrow \arg\min_{p \in P^{(k)}} L_S(h_p)$;
return $\arg\min_{h \in \{h^{(0)}, h^{(1)}, \dots, h^{(d)}\}} L_S(h)$
Complexity? Learn $\Theta(2^d)$ models!
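The $\Theta(2^d)$ count follows from summing the number of subsets of each size, $\sum_{k=0}^{d} \binom{d}{k} = 2^d$. A quick numerical check, with hypothetical function names:

```python
from math import comb

def models_for_size(d, k):
    """Number of models learned by exhaustive search over subsets of size exactly k."""
    return comb(d, k)

def models_for_all_sizes(d):
    """Number of models learned when every subset size k = 0, ..., d is tried."""
    return sum(comb(d, k) for k in range(d + 1))

for d in (5, 10, 20, 30):
    print(d, models_for_all_sizes(d), 2 ** d)
```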

12 Can we do better?
Proposition: The optimization problem of feature selection is NP-hard.
What can we do? Heuristic solution $\Rightarrow$ greedy algorithms

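13 Greedy Algorithms for Feature Selection
Forward Selection: start from the empty solution, add one feature at a time, until the solution has cardinality $k$.
$\mathit{sol} \leftarrow \emptyset$;
while $|\mathit{sol}| < k$ do
    foreach $i \in I \setminus \mathit{sol}$ do
        $p \leftarrow \mathit{sol} \cup \{i\}$;
        $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
    $\mathit{sol} \leftarrow \mathit{sol} \cup \arg\min_{i \in I \setminus \mathit{sol}} L_S(h_{\mathit{sol} \cup \{i\}})$;
return $\mathit{sol}$;
Complexity? Learns $\Theta(kd)$ models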
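A possible rendering of forward selection in code is sketched below. It assumes linear predictors fit by least squares, mean squared error as $L_S$, 0-indexed features, and $k \le d$; the names `empirical_risk` and `forward_selection` are hypothetical.

```python
import numpy as np

def empirical_risk(X, y, features):
    """L_S of the ERM linear predictor restricted to `features` (mean squared error)."""
    cols = sorted(features)
    if not cols:
        return float(np.mean(y ** 2))  # the all-zero predictor
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return float(np.mean((X[:, cols] @ coef - y) ** 2))

def forward_selection(X, y, k):
    """Greedily grow a feature set to size k, adding the best single feature each round."""
    d = X.shape[1]
    sol = set()
    while len(sol) < k:
        candidates = [i for i in range(d) if i not in sol]
        # Fit one model per candidate and keep the feature with the lowest risk.
        best_i = min(candidates, key=lambda i: empirical_risk(X, y, sol | {i}))
        sol.add(best_i)
    return sol

# Synthetic data: only features 1 and 3 (0-indexed) influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3]
print(forward_selection(X, y, k=2))
```

Each round fits at most $d$ models and there are $k$ rounds, matching the $\Theta(kd)$ count. Being greedy, the procedure is a heuristic and is not guaranteed to return the best subset of size $k$.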

14 Backward Selection: start from the solution which includes all features, remove one feature at a time, until the solution has cardinality $k$.
Pseudocode: analogous to forward selection [Exercise!]
Complexity? Learns $\Theta((d - k)\,d)$ models

15 Notes
We have used only the training set to select the best hypothesis... we may overfit!
Solution? Use validation! (or cross-validation)
Split the data into training data and validation data, learn models on the training data, and evaluate (= pick among different hypotheses/models) on the validation data. The algorithms are similar.

16 Subset Selection with Validation Data
$S$ = training data (from data split)
$V$ = validation data (from data split)
Using training and validation:
for $k \leftarrow 0$ to $d$ do
    $P^{(k)} \leftarrow \{J \subseteq I : |J| = k\}$;
    foreach $p \in P^{(k)}$ do
        $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
    $h^{(k)} \leftarrow \arg\min_{p \in P^{(k)}} L_V(h_p)$;
return $\arg\min_{h \in \{h^{(0)}, h^{(1)}, \dots, h^{(d)}\}} L_V(h)$

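17 Forward Selection with Validation Data
Using training and validation:
$\mathit{sol} \leftarrow \emptyset$;
while $|\mathit{sol}| < k$ do
    foreach $i \in I \setminus \mathit{sol}$ do
        $p \leftarrow \mathit{sol} \cup \{i\}$;
        $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
    $\mathit{sol} \leftarrow \mathit{sol} \cup \arg\min_{i \in I \setminus \mathit{sol}} L_V(h_{\mathit{sol} \cup \{i\}})$;
return $\mathit{sol}$;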
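The validated variant differs from plain forward selection in one place: each candidate model is fit on $S$ but scored on $V$. The sketch below assumes least-squares linear predictors, mean squared error as the loss, 0-indexed features, and a random 70/30 split; the split fraction, the seed, and all function names are hypothetical.

```python
import numpy as np

def split_train_validation(X, y, validation_fraction=0.3, seed=0):
    """Randomly split the data into training (S) and validation (V) parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_val = int(round(validation_fraction * len(y)))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

def validation_risk(X_s, y_s, X_v, y_v, features):
    """Fit on S using only `features`, then return the mean squared error on V."""
    cols = sorted(features)
    if not cols:
        return float(np.mean(y_v ** 2))  # the all-zero predictor
    coef, *_ = np.linalg.lstsq(X_s[:, cols], y_s, rcond=None)
    return float(np.mean((X_v[:, cols] @ coef - y_v) ** 2))

def forward_selection_validated(X_s, y_s, X_v, y_v, k):
    """Forward selection where each added feature is chosen by its validation risk."""
    d = X_s.shape[1]
    sol = set()
    while len(sol) < k:
        candidates = [i for i in range(d) if i not in sol]
        best_i = min(
            candidates,
            key=lambda i: validation_risk(X_s, y_s, X_v, y_v, sol | {i}),
        )
        sol.add(best_i)
    return sol

# Synthetic noisy data: only features 1 and 3 (0-indexed) influence y.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=300)
X_s, y_s, X_v, y_v = split_train_validation(X, y)
selected = forward_selection_validated(X_s, y_s, X_v, y_v, k=2)
print(selected)
```

Scoring on held-out data means a feature is only added when it helps on examples the model was not fit to, which addresses the overfitting concern raised in the notes.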

18 Backward Selection with validation: similar [Exercise]
Similar approach for all algorithms with cross-validation [Exercise]

19 Bibliography [UML]
Regularization and Ridge Regression: Chapter 12 no Section 13.3; Section 13.4 only up to Corollary 13.8 (excluded)
Feature Selection and LASSO: Chapter 25 only Section (introduction and Backward Elimination ) and
