Machine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017

Regularized Loss Minimization

Assume $h$ is defined by a vector $w = (w_1, \dots, w_d)^T \in \mathbb{R}^d$ (e.g., linear models), and let $R : \mathbb{R}^d \to \mathbb{R}$ be a regularization function.

Regularized Loss Minimization (RLM): pick the $h$ obtained as
$$\arg\min_w \left( L_S(w) + R(w) \right)$$

Intuition: $R(w)$ is a measure of the complexity of the hypothesis $h$ defined by $w$; regularization balances between low empirical risk and less complex hypotheses.

We will see some of the most common regularization functions.

$\ell_1$ Regularization

Regularization function: $R(w) = \lambda \|w\|_1$, with $\lambda \in \mathbb{R}$, $\lambda > 0$.

$\ell_1$ norm: $\|w\|_1 = \sum_{i=1}^d |w_i|$

Therefore the learning rule is: pick
$$A(S) = \arg\min_w \left( L_S(w) + \lambda \|w\|_1 \right)$$

Intuition:
- $\|w\|_1$ measures the complexity of the hypothesis defined by $w$;
- $\lambda$ regulates the tradeoff between the empirical risk $L_S(w)$ (i.e., overfitting) and the complexity $\|w\|_1$ of the model we pick.
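
To make the objective concrete, here is a minimal NumPy sketch (not from the slides) of the $\ell_1$-regularized squared loss; the function name and the choice of the average squared loss for $L_S(w)$ are my own assumptions.

```python
import numpy as np

def l1_regularized_loss(w, X, y, lam):
    """L_S(w) (average squared loss) plus the l1 penalty lambda * ||w||_1."""
    residuals = X @ w - y                      # <w, x_i> - y_i for all i
    empirical_risk = np.mean(residuals ** 2)   # L_S(w)
    penalty = lam * np.sum(np.abs(w))          # lambda * ||w||_1
    return empirical_risk + penalty
```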

LASSO

Linear regression with squared loss + $\ell_1$ regularization = LASSO (least absolute shrinkage and selection operator).

LASSO: pick
$$w = \arg\min_w \left( \lambda \|w\|_1 + \sum_{i=1}^m (\langle w, x_i \rangle - y_i)^2 \right)$$
How?

Notes:
- no closed form solution!
- the $\ell_1$ norm is a convex function and the squared loss is convex, so this is a convex problem and it can be solved efficiently! (true for every convex loss function)
- $\ell_1$ regularization often induces sparse solutions
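
As a hedged illustration of "can be solved efficiently" (not part of the slides), the sketch below fits a LASSO model with scikit-learn on synthetic data where only a few features carry signal; note that scikit-learn's `Lasso` minimizes $\frac{1}{2m}\sum_i (\langle w, x_i \rangle - y_i)^2 + \alpha \|w\|_1$, i.e. the slide's objective up to a rescaling of $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first 3 of 20 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)

# scikit-learn's Lasso minimizes (1/(2m)) * sum_i (<w, x_i> - y_i)^2 + alpha * ||w||_1,
# the slide's objective up to a rescaling of lambda.
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))  # typically just {0, 1, 2}
```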

LASSO and Sparse Solution

[Figure: two panels, Ridge Regression and LASSO, each plotting a coefficient $w_i$ as a function of $1/\lambda$.]

Ridge Regression vs LASSO

[Figure: level set $\{w : L_S(w) = \alpha\}$ and the constraint region of size $s$ in the $(w_1, w_2)$ plane, for LASSO and for Ridge Regression.]

$\ell_1$ regularization performs a sort of feature selection.

Feature Selection

In general, in machine learning one has to decide what to use as features (= input) for learning. Even if somebody gives us a representation as a feature vector, maybe there is a better representation? What is "better"?

Example: features $x_1, x_2$, output $y$:
- $x_1 \sim U[-1, 1]$
- $y = x_1^2$
- $x_2 = y + U[-0.01, 0.01]$

Which feature is better: $x_1$ or $x_2$? No free lunch...
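
A small simulation (mine, not from the slides) makes the comparison concrete for linear predictors: the best linear model in $x_1$ cannot capture $x_1^2$, while a linear model in $x_2$ tracks $y$ almost exactly. Variable names and the least-squares fit are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
x1 = rng.uniform(-1, 1, size=m)
y = x1 ** 2
x2 = y + rng.uniform(-0.01, 0.01, size=m)

def linear_fit_mse(feature, target):
    """Least-squares fit of target on [feature, 1]; returns the training MSE."""
    A = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.mean((A @ coef - target) ** 2)

print("MSE using x1:", linear_fit_mse(x1, y))  # ~0.09: a line cannot capture x1^2
print("MSE using x2:", linear_fit_mse(x2, y))  # ~3e-5: x2 is (almost) y itself
```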

Feature Selection: Scenario

We have a large pool of features. Goal: select a small number of features that will be used by our (final) predictor.

Assume $X = \mathbb{R}^d$. Goal: learn the (final) predictor using $k \ll d$ features.

Motivation?
- prevent overfitting: fewer features ⇒ hypotheses of lower complexity!
- predictions can be done faster
- useful in many applications!

Feature Selection: Computational Problem

Assume that we use the Empirical Risk Minimization (ERM) procedure. The problem of selecting the $k$ features that minimize the empirical risk can be written as:
$$\min_w L_S(w) \quad \text{subject to} \quad \|w\|_0 \le k$$
where $\|w\|_0 = |\{i : w_i \neq 0\}|$.

How can we solve it?

Subset Selection

How do we find the solution to the problem below?
$$\min_w L_S(w) \quad \text{subject to} \quad \|w\|_0 \le k$$

Let:
- $I = \{1, \dots, d\}$;
- given $p = \{i_1, \dots, i_k\} \subseteq I$: $H_p$ = hypotheses/models where only the features $w_{i_1}, w_{i_2}, \dots, w_{i_k}$ are used.

Pseudocode:
- $P^{(k)} \leftarrow \{J \subseteq I : |J| = k\}$;
- foreach $p \in P^{(k)}$ do $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
- return $h^{(k)} \leftarrow \arg\min_{p \in P^{(k)}} L_S(h_p)$;

Complexity? Learn $\Theta\left(\binom{d}{k}\right) = \Theta(d^k)$ models ⇒ exponential algorithm!
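
A hedged Python sketch of this exhaustive search, for linear hypotheses fit by least squares (the function names are mine, and the slides only give pseudocode). The optional evaluation set anticipates the validation-based variant that appears later in the slides; by default the training loss is used, as here.

```python
import itertools
import numpy as np

def fit_least_squares(X, y, features):
    """ERM over linear hypotheses that use only the given feature indices."""
    w, *_ = np.linalg.lstsq(X[:, list(features)], y, rcond=None)
    return w

def squared_loss(X, y, features, w):
    """Average squared loss of the restricted linear model on (X, y)."""
    return np.mean((X[:, list(features)] @ w - y) ** 2)

def best_subset(X_train, y_train, k, X_eval=None, y_eval=None):
    """Exhaustive search over all feature subsets of size k (Theta(C(d, k)) models).

    Each subset is fit on the training data; if an evaluation set is given,
    it is used to compare subsets (otherwise the training loss is used).
    """
    if X_eval is None:
        X_eval, y_eval = X_train, y_train
    d = X_train.shape[1]
    best = None
    for p in itertools.combinations(range(d), k):
        w = fit_least_squares(X_train, y_train, p)
        loss = squared_loss(X_eval, y_eval, p, w)
        if best is None or loss < best[0]:
            best = (loss, p, w)
    return best  # (loss, selected feature indices, weights)
```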

What about finding the best subset of features (of any size)?

Pseudocode:
- for $k \leftarrow 0$ to $d$ do
  - $P^{(k)} \leftarrow \{J \subseteq I : |J| = k\}$;
  - foreach $p \in P^{(k)}$ do $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
  - $h^{(k)} \leftarrow \arg\min_{p \in P^{(k)}} L_S(h_p)$;
- return $\arg\min_{h \in \{h^{(0)}, h^{(1)}, \dots, h^{(d)}\}} L_S(h)$

Complexity? Learn $\Theta(2^d)$ models!

Can we do better?

Proposition: The optimization problem of feature selection is NP-hard.

What can we do? Heuristic solution ⇒ greedy algorithms.

Greedy Algorithms for Feature Selection

Forward Selection: start from the empty solution, add one feature at a time, until the solution has cardinality $k$.

Pseudocode:
- $sol \leftarrow \emptyset$;
- while $|sol| < k$ do
  - foreach $i \in I \setminus sol$ do
    - $p \leftarrow sol \cup \{i\}$;
    - $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
  - $sol \leftarrow sol \cup \{\arg\min_{i \in I \setminus sol} L_S(h_{sol \cup \{i\}})\}$;
- return $sol$;

Complexity? Learns $\Theta(kd)$ models.
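
A hedged Python sketch of forward selection for least-squares linear models (names are mine). The optional evaluation set is an assumption added so that the same sketch can be reused with validation data, as in the slides further below; by default candidates are compared on the training data, as on this slide.

```python
import numpy as np

def fit_and_loss(X_train, y_train, X_eval, y_eval, features):
    """Least-squares fit on the training data, average squared loss on the evaluation data."""
    cols = list(features)
    w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    return np.mean((X_eval[:, cols] @ w - y_eval) ** 2)

def forward_selection(X_train, y_train, k, X_eval=None, y_eval=None):
    """Greedy forward selection: grow sol one feature at a time (Theta(k*d) models)."""
    if X_eval is None:
        X_eval, y_eval = X_train, y_train
    d = X_train.shape[1]
    sol = []
    while len(sol) < k:
        remaining = [i for i in range(d) if i not in sol]
        best_i = min(remaining,
                     key=lambda i: fit_and_loss(X_train, y_train,
                                                X_eval, y_eval, sol + [i]))
        sol.append(best_i)
    return sol
```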

Backward Selection: start from the solution which includes all features, remove one feature at a time, until the solution has cardinality $k$.

Pseudocode: analogous to forward selection [Exercise!]

Complexity? Learns $\Theta(kd)$ models.

Notes

We have used only the training set to select the best hypothesis... we may overfit!

Solution? Use validation! (or cross-validation)

Split the data into training data and validation data; learn the models on the training data, evaluate (= pick among the different hypotheses/models) on the validation data. The algorithms are similar.

Subset Selection with Validation Data

$S$ = training data (from the data split)
$V$ = validation data (from the data split)

Using training and validation:
- for $k \leftarrow 0$ to $d$ do
  - $P^{(k)} \leftarrow \{J \subseteq I : |J| = k\}$;
  - foreach $p \in P^{(k)}$ do $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
  - $h^{(k)} \leftarrow \arg\min_{p \in P^{(k)}} L_V(h_p)$;
- return $\arg\min_{h \in \{h^{(0)}, h^{(1)}, \dots, h^{(d)}\}} L_V(h)$
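
A short hedged usage example that builds on the `best_subset` sketch given earlier (so it is not self-contained on its own): subsets are fit on $S$ and compared by their loss on $V$, for every size $k$. The data, split sizes, and variable names are my own assumptions.

```python
import numpy as np

# Synthetic data and a simple split into S (training) and V (validation).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)

X_S, y_S = X[:120], y[:120]   # S = training data (from the data split)
X_V, y_V = X[120:], y[120:]   # V = validation data (from the data split)

# best_subset (defined in the earlier sketch) fits on S and scores on V.
candidates = [best_subset(X_S, y_S, k, X_eval=X_V, y_eval=y_V)
              for k in range(X.shape[1] + 1)]          # h^(0), ..., h^(d)
val_loss, features, w = min(candidates, key=lambda c: c[0])
print("selected features:", features)
```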

Forward Selection with Validation Data

Using training and validation:
- $sol \leftarrow \emptyset$;
- while $|sol| < k$ do
  - foreach $i \in I \setminus sol$ do
    - $p \leftarrow sol \cup \{i\}$;
    - $h_p \leftarrow \arg\min_{h \in H_p} L_S(h)$;
  - $sol \leftarrow sol \cup \{\arg\min_{i \in I \setminus sol} L_V(h_{sol \cup \{i\}})\}$;
- return $sol$;
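
Reusing the `forward_selection` sketch and the $S$/$V$ split from the snippets above (both my own constructions), the validation-based variant only changes which data the candidates are compared on:

```python
# Candidate models are fit on S; the candidate added at each step is chosen by its loss on V.
selected = forward_selection(X_S, y_S, k=3, X_eval=X_V, y_eval=y_V)
print("features chosen with validation data:", selected)
```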

Backward Selection with validation: similar [Exercise]

A similar approach works for all the algorithms with cross-validation [Exercise]

Bibliography [UML]

Regularization and Ridge Regression: Chapter 13, excluding Section 13.3; Section 13.4 only up to Corollary 13.8 (excluded).

Feature Selection and LASSO: Chapter 25, only Section 25.1.2 (introduction and "Backward Elimination") and Section 25.1.3.