Machine Learning
Regularization and Feature Selection
Fabio Vandin
November 14, 2017
Regularized Loss Minimization

Assume h is defined by a vector w = (w_1, ..., w_d)^T ∈ R^d (e.g., linear models).

Regularization function: R : R^d → R

Regularized Loss Minimization (RLM): pick the h obtained as

    arg min_w ( L_S(w) + R(w) )

Intuition:
- R(w) is a measure of the complexity of the hypothesis h defined by w
- regularization balances between low empirical risk and less complex hypotheses

We will see some of the most common regularization functions.
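As a concrete instance (not on the slide), here is a minimal numpy sketch of RLM with squared loss and the regularizer R(w) = λ||w||_2^2, i.e., ridge regression, whose minimizer has a closed form; the function name rlm_ridge and the data matrices are illustrative.

```python
import numpy as np

def rlm_ridge(X, y, lam):
    """RLM with squared loss and R(w) = lam * ||w||_2^2 (ridge regression):
        arg min_w ||Xw - y||^2 + lam * ||w||_2^2
    has the closed-form solution w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```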
l_1 Regularization

Regularization function: R(w) = λ||w||_1, with λ ∈ R, λ > 0

l_1 norm: ||w||_1 = Σ_{i=1}^d |w_i|

Therefore the learning rule is: pick

    A(S) = arg min_w ( L_S(w) + λ||w||_1 )

Intuition:
- ||w||_1 measures the complexity of the hypothesis defined by w
- λ regulates the tradeoff between the empirical risk L_S(w) (i.e., the risk of overfitting) and the complexity ||w||_1 of the model we pick
LASSO

Linear regression with squared loss + l_1 regularization = LASSO (least absolute shrinkage and selection operator)

LASSO: pick

    w = arg min_w ( λ||w||_1 + Σ_{i=1}^m (⟨w, x_i⟩ − y_i)^2 )

How? Notes:
- no closed form solution!
- the l_1 norm is a convex function and the squared loss is convex ⇒ convex problem, can be solved efficiently! (true for every convex loss function)
- l_1 regularization often induces sparse solutions
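Since there is no closed form, the convex objective is typically minimized iteratively. Below is a minimal sketch (not from the slides) of one standard method, proximal gradient descent (ISTA), for the objective above; the step size and number of iterations are illustrative choices.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize lam * ||w||_1 + ||Xw - y||^2 by proximal gradient (ISTA)."""
    m, d = X.shape
    # step size 1/L, with L = 2 * ||X||_2^2 a Lipschitz constant of the
    # gradient of the smooth part ||Xw - y||^2
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)            # gradient of the squared loss
        w = soft_threshold(w - eta * grad, eta * lam)
    return w
```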
LASSO and Sparse Solutions

[Figure: value of a coefficient w_i as a function of 1/λ, for Ridge Regression (left) and LASSO (right)]
Ridge Regression vs LASSO

[Figure: level sets {w : L_S(w) = α} of the empirical risk in the (w_1, w_2) plane, together with the LASSO constraint region ||w||_1 ≤ s and the Ridge Regression constraint region ||w||_2 ≤ s]

l_1 regularization performs a sort of feature selection
Feature Selection

In general, in machine learning one has to decide what to use as features (= input) for learning. Even if somebody gives us a representation as a feature vector, maybe there is a better representation? What is "better"?

Example: features x_1, x_2, output y
- x_1 ~ U[−1, 1]
- y = x_1^2
- x_2 = y + U[−0.01, 0.01]

Which feature is better: x_1 or x_2? No free lunch...
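A small numpy sketch (not from the slides) that generates data according to this example; it hints at the answer for linear predictors: y is essentially uncorrelated with x_1 (the dependence is quadratic and symmetric), but almost perfectly linearly related to x_2.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
x1 = rng.uniform(-1.0, 1.0, m)        # x_1 ~ U[-1, 1]
y = x1 ** 2                           # y = x_1^2
x2 = y + rng.uniform(-0.01, 0.01, m)  # x_2 = y + U[-0.01, 0.01]

# correlation of each feature with the output
print(np.corrcoef(x1, y)[0, 1])  # ~ 0: a linear model on x_1 alone is useless
print(np.corrcoef(x2, y)[0, 1])  # ~ 1: a linear model on x_2 alone fits y well
```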
Feature Selection: Scenario

We have a large pool of features.
Goal: select a small number of features that will be used by our (final) predictor.

Assume X = R^d. Goal: learn the (final) predictor using k << d features.

Motivation?
- prevent overfitting: fewer features ⇒ hypotheses of lower complexity!
- predictions can be done faster
- useful in many applications!
Feature Selection: Computational Problem

Assume that we use the Empirical Risk Minimization (ERM) procedure. The problem of selecting k features that minimize the empirical risk can be written as:

    min_w L_S(w)  subject to  ||w||_0 ≤ k

where ||w||_0 = |{i : w_i ≠ 0}|.

How can we solve it?
Subset Selection

How do we find the solution to the problem below?

    min_w L_S(w)  subject to  ||w||_0 ≤ k

Let:
- I = {1, ..., d};
- given p = {i_1, ..., i_k} ⊆ I: H_p = hypotheses/models where only the features i_1, i_2, ..., i_k are used

    P^(k) ← {J ⊆ I : |J| = k};
    foreach p ∈ P^(k) do
        h_p ← arg min_{h ∈ H_p} L_S(h);
    return h^(k) ← arg min_{p ∈ P^(k)} L_S(h_p);

Complexity? Learn Θ((d choose k)) = Θ(d^k) models ⇒ exponential algorithm!
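A minimal Python sketch (not from the slides) of this exhaustive procedure, assuming the models are linear predictors trained with the squared loss; the helper name best_subset_fixed_k is illustrative.

```python
from itertools import combinations
import numpy as np

def best_subset_fixed_k(X, y, k):
    """Exhaustive subset selection for a fixed cardinality k: fit a
    least-squares model on every subset of k features and keep the one
    with the lowest training error. Learns Theta((d choose k)) models."""
    d = X.shape[1]
    best_err, best_p, best_w = np.inf, None, None
    for p in combinations(range(d), k):
        Xp = X[:, list(p)]
        w, *_ = np.linalg.lstsq(Xp, y, rcond=None)  # ERM on the subset p
        err = np.sum((Xp @ w - y) ** 2)             # training error L_S(h_p)
        if err < best_err:
            best_err, best_p, best_w = err, p, w
    return best_err, best_p, best_w
```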
What about finding the best subset of features (of any size)?

    for k ← 0 to d do
        P^(k) ← {J ⊆ I : |J| = k};
        foreach p ∈ P^(k) do
            h_p ← arg min_{h ∈ H_p} L_S(h);
        h^(k) ← arg min_{p ∈ P^(k)} L_S(h_p);
    return arg min_{h ∈ {h^(0), h^(1), ..., h^(d)}} L_S(h)

Complexity? Learn Θ(2^d) models!
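The same idea over all cardinalities, as a short sketch reusing best_subset_fixed_k from the previous snippet (again assuming linear predictors with squared loss):

```python
def best_subset_any_size(X, y):
    """Best subset of any cardinality: Theta(2^d) least-squares fits in total."""
    d = X.shape[1]
    # k = 0: the empty model predicts 0, with training error ||y||^2
    best = (np.sum(y ** 2), (), np.zeros(0))
    for k in range(1, d + 1):
        cand = best_subset_fixed_k(X, y, k)
        if cand[0] < best[0]:
            best = cand
    return best
```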
Can we do better?

Proposition: The optimization problem of feature selection is NP-hard.

What can we do? Heuristic solutions ⇒ greedy algorithms
Greedy Algorithms for Feature Selection

Forward Selection: start from the empty solution, add one feature at a time, until the solution has cardinality k

    sol ← ∅;
    while |sol| < k do
        foreach i ∈ I \ sol do
            p ← sol ∪ {i};
            h_p ← arg min_{h ∈ H_p} L_S(h);
        sol ← sol ∪ { arg min_{i ∈ I \ sol} L_S(h_{sol ∪ {i}}) };
    return sol;

Complexity? Learns Θ(kd) models
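A minimal Python sketch (not from the slides) of forward selection, again assuming linear predictors with squared loss; names are illustrative.

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy forward selection: start from the empty set and, in each of
    the k rounds, add the feature that most reduces the training error of
    a least-squares fit. Learns Theta(k * d) models."""
    d = X.shape[1]
    sol = []
    while len(sol) < k:
        best_err, best_i = np.inf, None
        for i in range(d):
            if i in sol:
                continue
            p = sol + [i]
            w, *_ = np.linalg.lstsq(X[:, p], y, rcond=None)  # ERM on sol + {i}
            err = np.sum((X[:, p] @ w - y) ** 2)
            if err < best_err:
                best_err, best_i = err, i
        sol.append(best_i)
    return sol
```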
Backward Selection: start from the solution which includes all features, remove one feature at a time, until the solution has cardinality k

Pseudocode: analogous to forward selection [Exercise!]

Complexity? Learns Θ(kd) models
Notes

We have used only the training set to select the best hypothesis... we may overfit!

Solution? Use validation! (or cross-validation)

Split the data into training data and validation data, learn models on the training data, evaluate (= pick among different hypotheses/models) on the validation data. The algorithms are similar.
Subset Selection with Validation Data

S = training data (from data split)
V = validation data (from data split)

Using training and validation:

    for k ← 0 to d do
        P^(k) ← {J ⊆ I : |J| = k};
        foreach p ∈ P^(k) do
            h_p ← arg min_{h ∈ H_p} L_S(h);
        h^(k) ← arg min_{p ∈ P^(k)} L_V(h_p);
    return arg min_{h ∈ {h^(0), h^(1), ..., h^(d)}} L_V(h)
Forward Selection with Validation Data

Using training and validation:

    sol ← ∅;
    while |sol| < k do
        foreach i ∈ I \ sol do
            p ← sol ∪ {i};
            h_p ← arg min_{h ∈ H_p} L_S(h);
        sol ← sol ∪ { arg min_{i ∈ I \ sol} L_V(h_{sol ∪ {i}}) };
    return sol;
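A sketch of the same procedure with a validation split, mirroring the forward-selection snippet above (assumed linear predictors with squared loss; names illustrative):

```python
import numpy as np

def forward_selection_val(X_S, y_S, X_V, y_V, k):
    """Forward selection where each candidate model is trained on the
    training split S and scored on the validation split V."""
    d = X_S.shape[1]
    sol = []
    while len(sol) < k:
        best_err, best_i = np.inf, None
        for i in range(d):
            if i in sol:
                continue
            p = sol + [i]
            w, *_ = np.linalg.lstsq(X_S[:, p], y_S, rcond=None)  # learn on S
            err = np.sum((X_V[:, p] @ w - y_V) ** 2)             # evaluate on V
            if err < best_err:
                best_err, best_i = err, i
        sol.append(best_i)
    return sol
```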
Backward Selection with validation: similar [Exercise]

Similar approach for all algorithms with cross-validation [Exercise]
Bibliography [UML]

Regularization and Ridge Regression: Chapter 12; Chapter 13 except Section 13.3, and Section 13.4 only up to Corollary 13.8 (excluded)

Feature Selection and LASSO: Chapter 25, only Section 25.1.2 (introduction and "Backward Elimination") and Section 25.1.3