Is the test error unbiased for these programs? 2017 Kevin Jamieson


Is the test error unbiased for these programs?

Is the test error unbiased for this program?

Simple Variable Selection LASSO: Sparse Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 9, 2016

Sparsity

\hat{w}^{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2

A vector w is sparse if many of its entries are zero. This is very useful for many tasks, e.g.:

Efficiency: if size(w) = 100 billion, each prediction is expensive; if the predictor is part of an online system, it is too slow. If w is sparse, the prediction computation only depends on the number of non-zeros (see the sketch below).

Interpretability: which dimensions are relevant for making a prediction? E.g., which parts of the brain are associated with particular words? (Figure from Tom Mitchell.)

How do we find the best subset of features among all possible subsets?
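To make the efficiency point concrete, here is a minimal sketch (mine, not from the slides), assuming NumPy: the prediction for a sparse w can be computed by touching only its non-zero entries.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1_000_000                       # number of features
    x = rng.standard_normal(d)

    # Weight vector with only 100 non-zero entries.
    w = np.zeros(d)
    nonzero_idx = rng.choice(d, size=100, replace=False)
    w[nonzero_idx] = rng.standard_normal(100)

    pred_dense = x @ w                                  # touches all d entries
    pred_sparse = x[nonzero_idx] @ w[nonzero_idx]       # touches only 100 entries
    assert np.isclose(pred_dense, pred_sparse)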

Greedy model selection algorithm

Pick a dictionary of features, e.g., cosines of random inner products.

Greedy heuristic:
Start from an empty (or simple) set of features F_0 = ∅.
Run the learning algorithm for the current set of features F_t and obtain weights for these features.
Select the next best feature h_j(x)*, e.g., the h_j(x) that results in the lowest training error when the learner uses F_t + {h_j(x)*}.
Set F_{t+1} ← F_t + {h_j(x)*}.
Recurse.
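A minimal sketch of this greedy forward-selection loop, assuming a least-squares learner and a feature dictionary stored as the columns of a NumPy array (the function name and interface are mine, not the slides'):

    import numpy as np

    def greedy_forward_selection(X, y, k):
        """Repeatedly add the feature (column of X) that most reduces training error."""
        n, d = X.shape
        selected = []                          # current feature set F_t
        for _ in range(k):
            best_j, best_err = None, np.inf
            for j in range(d):
                if j in selected:
                    continue
                cols = selected + [j]
                # Fit least squares on the candidate feature set F_t + {j}.
                w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                err = np.sum((y - X[:, cols] @ w) ** 2)
                if err < best_err:
                    best_j, best_err = j, err
            selected.append(best_j)            # F_{t+1} = F_t + {best feature}
        return selected

Each step refits the learner from scratch on the enlarged feature set, which is why this is a heuristic rather than a search over all subsets.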

Greedy model selection

Applicable in many other settings considered later in the course:
Logistic regression: selecting features (basis functions).
Naïve Bayes: selecting (independent) features P(X_i | Y).
Decision trees: selecting leaves to expand.

It is only a heuristic! Finding the best set of k features is computationally intractable. Sometimes you can prove something strong about it.

When do we stop?

Greedy heuristic: select the next best feature h_j(x)*, e.g., the one that results in the lowest training error when the learner uses F_t + {h_j(x)*}, and recurse.

When do you stop? When the training error is low enough? When the test-set error is low enough? Using cross-validation? Is there a more principled approach?

Recall Ridge Regression

Ridge regression objective:

\hat{w}^{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2

Ridge vs. Lasso Regression

Ridge regression objective:

\hat{w}^{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2

Lasso objective:

\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_1
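As an aside (not from the slides), a minimal scikit-learn sketch illustrates the qualitative difference: with comparable regularization, the lasso drives many coefficients exactly to zero while ridge only shrinks them. Note that scikit-learn scales its penalties slightly differently from the objectives above, so the alpha values here are only illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    n, d = 100, 50
    X = rng.standard_normal((n, d))
    w_true = np.zeros(d)
    w_true[:5] = [2.0, -3.0, 1.5, 4.0, -1.0]        # only 5 relevant features
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    print("ridge non-zeros:", np.sum(np.abs(ridge.coef_) > 1e-8))   # typically all 50
    print("lasso non-zeros:", np.sum(np.abs(lasso.coef_) > 1e-8))   # close to 5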

Penalized Least Squares

Ridge: r(w) = \|w\|_2^2.   Lasso: r(w) = \|w\|_1.

\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda r(w)

For any \lambda \ge 0 for which \hat{w}_r achieves the minimum, there exists a \nu \ge 0 such that

\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 \quad \text{subject to} \quad r(w) \le \nu

Optimizing the LASSO Objective

LASSO solution:

\hat{w}^{lasso}, \hat{b}^{lasso} = \arg\min_{w,b} \sum_{i=1}^n \big(y_i - (x_i^T w + b)\big)^2 + \lambda \|w\|_1

\hat{b}^{lasso} = \frac{1}{n} \sum_{i=1}^n \big(y_i - x_i^T \hat{w}^{lasso}\big)

So, as usual, preprocess so that \frac{1}{n}\sum_{i=1}^n y_i = 0 and \frac{1}{n}\sum_{i=1}^n x_i = 0, so we don't have to worry about an offset:

\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_1

How do we solve this?
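A minimal sketch of this preprocessing step (assuming NumPy; the helper name and return convention are mine): center y and the columns of X, solve for w on the centered data, and recover the offset afterwards.

    import numpy as np

    def center_data(X, y):
        """Remove the means so the lasso can be solved without an offset term."""
        x_mean = X.mean(axis=0)
        y_mean = y.mean()
        return X - x_mean, y - y_mean, x_mean, y_mean

    # After solving for w_hat on the centered data, recover the offset as
    #   b_hat = y_mean - x_mean @ w_hat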

Coordinate Descent

Given a function, we want to find its minimum. Often it is easy to find the minimum along a single coordinate, holding the others fixed. How do we pick the next coordinate?

This is a super useful approach for *many* problems, and it converges to the optimum in some cases, such as the LASSO.
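As a warm-up illustration (mine, not from the slides), here is coordinate descent on the plain least-squares objective, where the one-dimensional minimization along coordinate j has a simple closed form:

    import numpy as np

    def coordinate_descent_least_squares(X, y, n_passes=100):
        """Minimize sum_i (y_i - x_i^T w)^2 by exact minimization over one coordinate at a time."""
        n, d = X.shape
        w = np.zeros(d)
        col_sq = (X ** 2).sum(axis=0)           # assumed non-zero for every column
        for _ in range(n_passes):
            for j in range(d):                  # sweep coordinates sequentially
                r = y - X @ w + X[:, j] * w[j]  # residual ignoring coordinate j
                w[j] = X[:, j] @ r / col_sq[j]  # exact 1-D minimizer along coordinate j
        return w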

Optimizing the LASSO Objective One Coordinate at a Time

Fix any j ∈ {1,...,d}:

\sum_{i=1}^n \big(y_i - x_i^T w\big)^2 + \lambda \|w\|_1
  = \sum_{i=1}^n \Big(y_i - \sum_{k=1}^d x_{i,k} w_k\Big)^2 + \lambda \sum_{k=1}^d |w_k|
  = \sum_{i=1}^n \Big(\Big(y_i - \sum_{k \ne j} x_{i,k} w_k\Big) - x_{i,j} w_j\Big)^2 + \lambda \sum_{k \ne j} |w_k| + \lambda |w_j|

Initialize \hat{w}_k = 0 for all k ∈ {1,...,d}.
Loop over the coordinates j ∈ {1,...,d}:

r_i^{(j)} = y_i - \sum_{k \ne j} x_{i,k} \hat{w}_k
\hat{w}_j = \arg\min_{w_j} \sum_{i=1}^n \big(r_i^{(j)} - x_{i,j} w_j\big)^2 + \lambda |w_j|

Convex Functions

Equivalent definitions of convexity:

f convex:  f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)  for all x, y and \lambda \in [0,1]
           f(y) \ge f(x) + \nabla f(x)^T (y - x)  for all x, y

Gradients lower-bound convex functions and are unique at x iff the function is differentiable at x.

Subgradients generalize gradients to non-differentiable points: any supporting hyperplane at x that lower-bounds the entire function. g is a subgradient at x if f(y) \ge f(x) + g^T (y - x) for all y.
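A worked special case (not verbatim from the slides) that is exactly what the next slides use: for f(w) = |w| and any g ∈ [-1, 1], we have |y| \ge 0 + g·(y - 0) for all y, so every such g is a subgradient at 0 and the subdifferential there is the whole interval [-1, 1]; away from 0 the only subgradient is sign(w). Scaling by λ, the subdifferential of λ|w_j| is {λ·sign(w_j)} for w_j ≠ 0 and [-λ, λ] at w_j = 0.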

Taking the Subgradient

\hat{w}_j = \arg\min_{w_j} \sum_{i=1}^n \big(r_i^{(j)} - x_{i,j} w_j\big)^2 + \lambda |w_j|

g is a subgradient at x if f(y) \ge f(x) + g^T (y - x). A convex function is minimized at w if 0 is a subgradient at w.

\partial_{w_j} \lambda |w_j| = \{\lambda\,\mathrm{sign}(w_j)\} \text{ if } w_j \ne 0, \quad [-\lambda, \lambda] \text{ if } w_j = 0
\partial_{w_j} \sum_{i=1}^n \big(r_i^{(j)} - x_{i,j} w_j\big)^2 = 2\Big(\sum_{i=1}^n x_{i,j}^2\Big) w_j - 2\sum_{i=1}^n r_i^{(j)} x_{i,j}

Setting Subgradient to 0

With a_j = 2\sum_{i=1}^n x_{i,j}^2 and c_j = 2\sum_{i=1}^n r_i^{(j)} x_{i,j}:

\partial_{w_j}\Big[\sum_{i=1}^n \big(r_i^{(j)} - x_{i,j} w_j\big)^2 + \lambda |w_j|\Big]
  = \begin{cases} a_j w_j - c_j - \lambda & \text{if } w_j < 0 \\ [-c_j - \lambda,\; -c_j + \lambda] & \text{if } w_j = 0 \\ a_j w_j - c_j + \lambda & \text{if } w_j > 0 \end{cases}

\hat{w}_j = \arg\min_{w_j} \sum_{i=1}^n \big(r_i^{(j)} - x_{i,j} w_j\big)^2 + \lambda |w_j|

w is a minimum iff 0 is a subgradient at w, which gives

\hat{w}_j = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } |c_j| \le \lambda \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases}

Soft Thresholding

\hat{w}_j = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } |c_j| \le \lambda \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases}

a_j = 2\sum_{i=1}^n x_{i,j}^2, \qquad c_j = 2\sum_{i=1}^n \Big(y_i - \sum_{k \ne j} x_{i,k} w_k\Big) x_{i,j}
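This update is a soft-thresholding of c_j. A minimal sketch as a standalone function (plain Python; the name soft_threshold_update is mine, not from the slides):

    def soft_threshold_update(c_j, a_j, lam):
        """Closed-form minimizer of sum_i (r_i - x_ij * w_j)^2 + lam * |w_j|."""
        if c_j < -lam:
            return (c_j + lam) / a_j
        if c_j > lam:
            return (c_j - lam) / a_j
        return 0.0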

Coordinate Descent for LASSO (aka Shooting Algorithm)

Initialize w = 0. Repeat until convergence:
Pick a coordinate j (at random or sequentially).
Set:

\hat{w}_j = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } |c_j| \le \lambda \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases}

where a_j = 2\sum_{i=1}^n x_{i,j}^2 and c_j = 2\sum_{i=1}^n \Big(y_i - \sum_{k \ne j} x_{i,k} \hat{w}_k\Big) x_{i,j}.

For convergence rates, see Shalev-Shwartz and Tewari, 2009. Another common technique is LARS (least angle regression and shrinkage), Efron et al., 2004.
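Putting the pieces together, here is a minimal NumPy sketch of this algorithm (mine, not the course's reference code), assuming centered data and a fixed number of sequential passes in place of a real convergence check:

    import numpy as np

    def lasso_shooting(X, y, lam, n_passes=100):
        """Coordinate descent for sum_i (y_i - x_i^T w)^2 + lam * ||w||_1."""
        n, d = X.shape
        w = np.zeros(d)
        a = 2.0 * (X ** 2).sum(axis=0)           # a_j = 2 * sum_i x_ij^2
        for _ in range(n_passes):
            for j in range(d):
                # Residual with coordinate j's current contribution removed.
                r_j = y - X @ w + X[:, j] * w[j]
                c_j = 2.0 * (X[:, j] @ r_j)
                if c_j < -lam:
                    w[j] = (c_j + lam) / a[j]
                elif c_j > lam:
                    w[j] = (c_j - lam) / a[j]
                else:
                    w[j] = 0.0
        return w

Sweeping the coordinates sequentially is the simplest choice; picking them at random, as the slide also allows, works as well.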

Recall: Ridge Coefficient Path

(Figure from the Kevin Murphy textbook.) Typical approach: select λ using cross-validation.

Now: LASSO Coefficient Path

(Figure from the Kevin Murphy textbook.)
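To reproduce this kind of coefficient-path plot yourself, here is a minimal sketch (an aside, not part of the slides) using scikit-learn's lasso_path on a standard dataset; note that scikit-learn scales its penalty by 1/(2n), so its alphas are not directly the λ above.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import lasso_path

    X, y = load_diabetes(return_X_y=True)
    alphas, coefs, _ = lasso_path(X, y)          # coefs has shape (n_features, n_alphas)

    plt.plot(np.log10(alphas), coefs.T)          # one curve per coefficient
    plt.xlabel("log10(alpha)")
    plt.ylabel("coefficient value")
    plt.title("LASSO coefficient path")
    plt.show()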

What you need to know

Variable selection: finding a sparse solution to a learning problem.
L1 regularization is one way to do variable selection; it applies beyond regression, and hundreds of other approaches are out there.
The LASSO objective is non-differentiable but convex, so use subgradients.
There is no closed-form solution for the minimization, so use coordinate descent.
The shooting algorithm is a simple approach for solving the LASSO.