
Warm up
Fix any a, b, c > 0.
1. What is the x ∈ R that minimizes ax^2 + bx + c?
   Setting the derivative to zero, 2ax + b = 0, gives x = -b / (2a).
2. What is the x ∈ R that minimizes max{-ax + b, cx}?
   The first piece is decreasing and the second is increasing, so the minimum is at their crossing -ax + b = cx, i.e., x = b / (a + c).

Warm up, continued. Fix some w ∈ R^d. Let D = {(x_i, y_i)} be drawn i.i.d. from P_XY with x_i ∈ R^d and y_i ∈ R, and consider the average squared error \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - x_i^T w)^2.

Overfitting
Machine Learning CSE546, Kevin Jamieson, University of Washington, Oct 9, 2018

Bias-Variance Tradeoff
Choice of hypothesis class introduces learning bias:
More complex class => less bias
More complex class => more variance
But in practice??
Before we saw how increasing the feature space can increase the complexity of the learned estimator:
F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots
\hat{f}^{(k)}_D = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2
Complexity grows as k grows.

Training set error as a function of model complexity
D drawn i.i.d. from P_XY;  F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots
\hat{f}^{(k)}_D = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2
TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}^{(k)}_D(x_i))^2
TRUE error: E_{XY}[(Y - \hat{f}^{(k)}_D(X))^2]

Training set error as a function of model complexity
D drawn i.i.d. from P_XY;  F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots
\hat{f}^{(k)}_D = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2
TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}^{(k)}_D(x_i))^2
TRUE error: E_{XY}[(Y - \hat{f}^{(k)}_D(X))^2]
TEST error: T drawn i.i.d. from P_XY;  \frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}^{(k)}_D(x_i))^2
Important: D \cap T = \emptyset
(Errors plotted as a function of complexity k.)

Training set error as a function of model complexity
D drawn i.i.d. from P_XY;  F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots
\hat{f}^{(k)}_D = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2
TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}^{(k)}_D(x_i))^2
TRUE error: E_{XY}[(Y - \hat{f}^{(k)}_D(X))^2]
TEST error: T drawn i.i.d. from P_XY;  \frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}^{(k)}_D(x_i))^2
Important: D \cap T = \emptyset
(Errors plotted against complexity k; each line is an i.i.d. draw of D or T. Plot from Hastie et al.)

Training set error as a function of model complexity
D drawn i.i.d. from P_XY;  F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots
\hat{f}^{(k)}_D = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2
TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}^{(k)}_D(x_i))^2
TRUE error: E_{XY}[(Y - \hat{f}^{(k)}_D(X))^2]
TEST error: T drawn i.i.d. from P_XY;  \frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}^{(k)}_D(x_i))^2
Important: D \cap T = \emptyset
TRAIN error is optimistically biased because it is evaluated on the data it trained on. TEST error is unbiased only if T is never used to train the model or even pick the complexity k.
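To make the TRAIN/TEST picture above concrete, here is a minimal numpy sketch, not taken from the slides: it uses a synthetic 1-d problem with a sin() ground truth and treats polynomials of degree k as the nested classes F_k. All constants and names are illustrative assumptions.

```python
# Sketch: TRAIN vs. TEST error as model complexity (polynomial degree k) grows.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Draw (x, y) pairs i.i.d. from a fixed P_XY (assumed sin() model + noise).
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
    return x, y

def mse(coef, x, y):
    return np.mean((y - np.polyval(coef, x)) ** 2)

x_D, y_D = sample(30)      # training set D
x_T, y_T = sample(2000)    # test set T, a fresh draw, so D and T are disjoint

for k in [1, 2, 3, 5, 7, 9]:
    coef = np.polyfit(x_D, y_D, deg=k)    # empirical risk minimizer over F_k
    print(f"k={k}  TRAIN={mse(coef, x_D, y_D):.3f}  TEST={mse(coef, x_T, y_T):.3f}")
```

Typically the TRAIN error keeps decreasing with k while the TEST error eventually turns back up, mirroring the curves discussed above.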

Test set error
Given a dataset, randomly split it into two parts: training data D and test data T. Important: D \cap T = \emptyset.
Use training data to learn the predictor, e.g., minimize
\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}^{(k)}_D(x_i))^2
and use training data to pick the complexity k.
Use test data to report predicted performance:
\frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}^{(k)}_D(x_i))^2

Regularization
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 9, 2016

Regularization in Linear Regression
Recall least squares:
\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
When x_i ∈ R^d and d > n, the objective function is flat in some directions, which implies the optimal solution is underconstrained and unstable due to the lack of curvature:
small changes in the training data result in large changes in the solution
often the magnitudes of w are very large
Regularization imposes simpler solutions via a complexity penalty.

Ridge Regression
Old least squares objective:
\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
Ridge regression objective:
\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2

Shrinkage Properties
\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2
Assume y = Xw + \epsilon with \epsilon \sim N(0, \sigma^2 I), and X^T X = n I. Then
\hat{w}_{ridge} = (X^T X + \lambda I)^{-1} X^T y = (X^T X + \lambda I)^{-1} X^T (Xw + \epsilon) = \frac{n}{n + \lambda} w + \frac{1}{n + \lambda} X^T \epsilon
E\|\hat{w}_{ridge} - w\|^2 = \frac{\lambda^2}{(n + \lambda)^2} \|w\|^2 + \frac{d n \sigma^2}{(n + \lambda)^2}
The first term is the squared bias introduced by shrinkage and the second is the variance; at \lambda = 0 this reduces to the least-squares risk d \sigma^2 / n.
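A small numpy sketch of the closed form above, under the slide's assumption X^T X = nI; the orthonormal-design construction, noise level, and λ below are illustrative assumptions. It checks that ridge is exactly the least-squares estimate shrunk by n/(n+λ).

```python
# Sketch: ridge closed form and the n/(n + lambda) shrinkage factor.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 100, 5, 0.5, 10.0

# Build X with X^T X = n * I (scaled orthonormal columns), as the slide assumes.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q
w_true = rng.standard_normal(d)
y = X @ w_true + sigma * rng.standard_normal(n)

# Ridge solution: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Under X^T X = n I the ridge estimate equals least squares shrunk by n/(n+lam).
print(np.allclose(w_ridge, n / (n + lam) * w_ls))          # expect True
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ls))       # ridge norm is smaller
```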

Ridge Regression: Effect of Regularization
\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2
The solution is indexed by the regularization parameter λ:
Larger λ: more shrinkage toward 0 (more bias, less variance)
Smaller λ: closer to least squares (less bias, more variance)
As λ → 0: \hat{w}_{ridge} → \hat{w}_{LS}
As λ → ∞: \hat{w}_{ridge} → 0

Ridge Regression: Effect of Regularization
D drawn i.i.d. from P_XY
\hat{w}^{(\lambda)}_{D,ridge} = \arg\min_w \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - x_i^T w)^2 + \lambda \|w\|_2^2
TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - x_i^T \hat{w}^{(\lambda)}_{D,ridge})^2
TRUE error: E[(Y - X^T \hat{w}^{(\lambda)}_{D,ridge})^2]
TEST error: T drawn i.i.d. from P_XY;  \frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - x_i^T \hat{w}^{(\lambda)}_{D,ridge})^2
Important: D \cap T = \emptyset

Ridge Regression: Effect of Regularization
D drawn i.i.d. from P_XY
\hat{w}^{(\lambda)}_{D,ridge} = \arg\min_w \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - x_i^T w)^2 + \lambda \|w\|_2^2
TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - x_i^T \hat{w}^{(\lambda)}_{D,ridge})^2
TRUE error: E[(Y - X^T \hat{w}^{(\lambda)}_{D,ridge})^2]
TEST error: T drawn i.i.d. from P_XY;  \frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - x_i^T \hat{w}^{(\lambda)}_{D,ridge})^2
Important: D \cap T = \emptyset
(Errors plotted against 1/λ: large λ on the left, small λ on the right. Each line is an i.i.d. draw of D or T.)

Ridge Coefficient Path
(Plot from the Kevin Murphy textbook: coefficient values as a function of 1/λ.)
Typical approach: select λ using cross validation, up next.
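A short sketch, with made-up data, of how such a coefficient path can be traced by solving the ridge normal equations over a grid of λ values; the grid and problem sizes are assumptions.

```python
# Sketch: ridge coefficient path over a grid of lambda values.
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 8
X = rng.standard_normal((n, d))
w_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.5 * rng.standard_normal(n)

lams = np.logspace(3, -3, num=30)          # from large lambda to small lambda
path = np.array([
    np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    for lam in lams
])
# Each row of `path` is w_ridge(lambda); coefficients shrink toward 0 as lambda
# grows and approach the least-squares solution as lambda -> 0.
for lam, w in zip(lams[::10], path[::10]):
    print(f"lambda={lam:8.3f}  ||w||_2={np.linalg.norm(w):.3f}")
```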

What you need to know
Regularization: penalizes complex models.
Ridge regression:
L2 penalized least-squares regression
The regularization parameter λ trades off model complexity with training error

Cross-Validation
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 9, 2016

How How How???????
How do we pick the regularization constant λ?
How do we pick the number of basis functions?
We could use the test data, but...

How How How???????
How do we pick the regularization constant λ?
How do we pick the number of basis functions?
We could use the test data, but...
Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data.

(LOO) Leave-one-out cross validation
Consider a validation set with 1 example:
D = training data
D\j = training data with the j-th data point (x_j, y_j) moved to the validation set
Learn classifier f_{D\j} on the dataset D\j
Estimate the true error as the squared error on predicting y_j: (y_j - f_{D\j}(x_j))^2
This is an unbiased estimate of error_true(f_{D\j})!

(LOO) Leave-one-out cross validation
Consider a validation set with 1 example:
D = training data
D\j = training data with the j-th data point (x_j, y_j) moved to the validation set
Learn classifier f_{D\j} on the dataset D\j
Estimate the true error as the squared error on predicting y_j: (y_j - f_{D\j}(x_j))^2
This is an unbiased estimate of error_true(f_{D\j})!
LOO cross validation: average over all data points j:
For each data point you leave out, learn a new classifier f_{D\j}
Estimate the error as:
error_{LOO} = \frac{1}{n} \sum_{j=1}^n (y_j - f_{D\j}(x_j))^2
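A naive LOO sketch for ridge regression with a fixed λ, refitting the model once per held-out point; the synthetic data and λ below are illustrative assumptions.

```python
# Sketch: naive leave-one-out cross-validation for ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 1.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.3 * rng.standard_normal(n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

errors = []
for j in range(n):
    keep = np.arange(n) != j                     # D \ j
    w = ridge_fit(X[keep], y[keep], lam)         # f_{D\j}
    errors.append((y[j] - X[j] @ w) ** 2)        # squared error on the held-out point

print("error_LOO =", np.mean(errors))
```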

LOO cross validation is an (almost) unbiased estimate of the true error of f_D!
When computing the LOOCV error, we only use n-1 data points, so it is not an estimate of the true error of learning with n data points.
It is usually pessimistic, though: learning with less data typically gives a worse answer.
LOO is almost unbiased!
Use the LOO error for model selection!!! E.g., picking λ.

Computational cost of LOO
Suppose you have 100,000 data points.
You implemented a great version of your learning algorithm: it learns in only 1 second.
Computing LOO will take about 1 day!!!

Use k-fold cross validation
Randomly divide the training data into k equal parts D_1, ..., D_k.
For each i:
Learn classifier f_{D\D_i} using the data points not in D_i
Estimate the error of f_{D\D_i} on the validation set D_i:
error_{D_i} = \frac{1}{|D_i|} \sum_{(x_j, y_j) \in D_i} (y_j - f_{D\D_i}(x_j))^2

Use k-fold cross validation
Randomly divide the training data into k equal parts D_1, ..., D_k.
For each i:
Learn classifier f_{D\D_i} using the data points not in D_i
Estimate the error of f_{D\D_i} on the validation set D_i:
error_{D_i} = \frac{1}{|D_i|} \sum_{(x_j, y_j) \in D_i} (y_j - f_{D\D_i}(x_j))^2
The k-fold cross validation error is the average over the data splits:
error_{k-fold} = \frac{1}{k} \sum_{i=1}^k error_{D_i}
k-fold cross validation properties:
Much faster to compute than LOO
More (pessimistically) biased: each fit uses much less data, only n(k-1)/k points
Usually, k = 10
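A minimal sketch of k-fold cross-validation used to pick the ridge parameter λ; the synthetic data, fold count, and candidate grid are assumptions, not from the slides.

```python
# Sketch: k-fold cross-validation to choose the ridge regularization lambda.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

folds = np.array_split(rng.permutation(n), k)        # random split into D_1..D_k

def cv_error(lam):
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)           # f_{D \ D_i}
        errs.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2)) # error_{D_i}
    return np.mean(errs)                                          # average over splits

lams = np.logspace(-3, 3, num=13)
best = min(lams, key=cv_error)
print("best lambda:", best)
```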

Recap
Given a dataset, begin by splitting it into TRAIN and TEST.
Model selection: use k-fold cross-validation on TRAIN to train the predictor and choose magic parameters such as λ.
(Diagram: TRAIN is repeatedly split into TRAIN-i / VAL-i folds; TEST is held out.)
Model assessment: use TEST to assess the accuracy of the model you output.
Never ever ever ever ever train or choose parameters based on the test data.

Example
Given 10,000-dimensional data and n examples, we pick the subset of 50 dimensions that have the highest correlation with the labels in the training set:
pick the 50 indices j that have the largest  \frac{\sum_{i=1}^n x_{i,j} y_i}{\sqrt{\sum_{i=1}^n x_{i,j}^2}}
After picking our 50 features, we then use CV to train ridge regression with regularization λ.
What's wrong with this procedure?
(Handwritten note: 1. make the CV splits; 2. then pick the 50 and perform CV on each split.)
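For comparison, here is a sketch, with hypothetical pure-noise data, in which the correlation screening is redone inside each fold using only that fold's training points; the sizes and the 50-feature budget mirror the slide, everything else is an assumption.

```python
# Sketch: correlation screening performed *inside* each CV fold.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam, n_keep = 100, 10_000, 5, 1.0, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)            # pure noise: no feature is truly predictive

def screen(X, y, n_keep):
    # Magnitude of the correlation-style score from the slide.
    score = np.abs(X.T @ y) / np.sqrt(np.sum(X ** 2, axis=0))
    return np.argsort(score)[-n_keep:]

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

folds = np.array_split(rng.permutation(n), k)
errs = []
for val_idx in folds:
    tr_idx = np.setdiff1d(np.arange(n), val_idx)
    feats = screen(X[tr_idx], y[tr_idx], n_keep)   # selection sees only this fold's TRAIN
    w = ridge_fit(X[tr_idx][:, feats], y[tr_idx], lam)
    errs.append(np.mean((y[val_idx] - X[val_idx][:, feats] @ w) ** 2))
print("CV error with screening inside the fold:", np.mean(errs))
```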

Recap
Learning is:
Collect some data, e.g., housing info and sale price.
Randomly split the dataset into TRAIN, VAL, and TEST, e.g., 80%, 10%, and 10%, respectively.
Choose a hypothesis class or model, e.g., linear with non-linear transformations.
Choose a loss function, e.g., least squares with a ridge regression penalty on TRAIN.
Choose an optimization procedure, e.g., set the derivative to zero to obtain the estimator; cross-validation on VAL to pick the number of features and the amount of regularization.
Justifying the accuracy of the estimate, e.g., report the TEST error.

Simple Variable Selection: LASSO, Sparse Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 9, 2016

Sparsity
\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
The vector w is sparse if many entries are zero. Very useful for many tasks, e.g.:
Efficiency: if size(w) = 100 billion, each prediction is expensive:
if it is part of an online system, it is too slow
if w is sparse, the prediction computation only depends on the number of non-zeros

Sparsity
\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
The vector w is sparse if many entries are zero. Very useful for many tasks, e.g.:
Efficiency: if size(w) = 100 billion, each prediction is expensive:
if it is part of an online system, it is too slow
if w is sparse, the prediction computation only depends on the number of non-zeros
Interpretability: what are the relevant dimensions for making a prediction? E.g., what are the parts of the brain associated with particular words? (Figure from Tom Mitchell)

Sparsity
\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
The vector w is sparse if many entries are zero. Very useful for many tasks, e.g.:
Efficiency: if size(w) = 100 billion, each prediction is expensive:
if it is part of an online system, it is too slow
if w is sparse, the prediction computation only depends on the number of non-zeros
Interpretability: what are the relevant dimensions for making a prediction? E.g., what are the parts of the brain associated with particular words? (Figure from Tom Mitchell)
How do we find the best subset among all possible subsets?
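A toy illustration, purely an assumption for this transcript, of why sparsity makes prediction cheap: w is stored as (index, value) pairs so only the non-zero coordinates are ever touched.

```python
# Sketch: prediction cost scales with nnz(w), not with the nominal dimension d.
import numpy as np

d = 100_000_000                                   # huge nominal dimension
nnz_idx = np.array([3, 1_000, 999_999])           # the only non-zero entries of w
nnz_val = np.array([0.5, -1.2, 2.0])
assert nnz_idx.max() < d

def predict(x_lookup):
    # x_lookup(j) returns feature j of the input x; only nnz(w) features are read.
    return sum(v * x_lookup(j) for j, v in zip(nnz_idx, nnz_val))

print(predict(lambda j: 1.0))                     # 3 multiplications, not 10^8
```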

Greedy model selection algorithm
Pick a dictionary of features, e.g., cosines of random inner products.
Greedy heuristic:
Start from an empty (or simple) set of features F_0 = ∅
Run the learning algorithm for the current set of features F_t to obtain weights for those features
Select the next best feature h_{i^*}(x), e.g., the h_j(x) that results in the lowest training error when the learner uses F_t + {h_j(x)}
F_{t+1} ← F_t + {h_{i^*}(x)}
Recurse
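A sketch of the greedy heuristic with least squares as the learner and a random feature dictionary; the data, dictionary, and number of steps are illustrative assumptions.

```python
# Sketch: greedy forward selection -- at each step add the feature whose
# inclusion gives the lowest training error.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_steps = 100, 20, 5
X = rng.standard_normal((n, d))              # columns stand in for dictionary features h_j(x)
w_true = np.zeros(d)
w_true[[2, 7, 11]] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.3 * rng.standard_normal(n)

def train_error(feats):
    Xf = X[:, feats]
    w, *_ = np.linalg.lstsq(Xf, y, rcond=None)     # least-squares learner on F_t
    return np.mean((y - Xf @ w) ** 2)

selected = []                                       # F_0 = empty set
for _ in range(n_steps):
    candidates = [j for j in range(d) if j not in selected]
    best = min(candidates, key=lambda j: train_error(selected + [j]))
    selected.append(best)                           # F_{t+1} = F_t + {h_best}
    print("selected:", selected, " train error:", round(train_error(selected), 3))
```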

Greedy model selection
Applicable in many other settings considered later in the course:
Logistic regression: selecting features (basis functions)
Naïve Bayes: selecting (independent) features P(X_i | Y)
Decision trees: selecting leaves to expand
Only a heuristic!
Finding the best set of k features is computationally intractable!
Sometimes you can prove something strong about it.

When do we stop???
Greedy heuristic:
Select the next best feature h_{i^*}(x), e.g., the h_j(x) that results in the lowest training error when the learner uses F_t + {h_j(x)}
Recurse
When do you stop???
When the training error is low enough?
When the test set error is low enough?
Using cross validation?
Is there a more principled approach?

Recall Ridge Regression
Ridge regression objective:
\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2

Ridge vs. Lasso Regression
Ridge regression objective:
\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2
Lasso objective:
\hat{w}_{lasso} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_1
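To see the qualitative difference, here is a sketch on made-up data that solves ridge in closed form and lasso by proximal gradient descent (ISTA with soft-thresholding); the step size, iteration count, λ, and data are assumptions.

```python
# Sketch: the L1 penalty zeros out coefficients, the L2 penalty only shrinks them.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 20, 20.0
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                          # truly sparse signal
y = X @ w_true + 0.5 * rng.standard_normal(n)

# Ridge: closed form.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso: ISTA on sum_i (y_i - x_i^T w)^2 + lam * ||w||_1.
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)           # 1 / Lipschitz constant of the gradient
w = np.zeros(d)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y)
    z = w - step * grad
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold

print("ridge non-zeros:", np.sum(np.abs(w_ridge) > 1e-6))      # expect all d entries non-zero
print("lasso non-zeros:", np.sum(np.abs(w) > 1e-6))            # expect only a few non-zero
```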

Penalized Least Squares
Ridge: r(w) = \|w\|_2^2      Lasso: r(w) = \|w\|_1
\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda r(w)

Penalized Least Squares
Ridge: r(w) = \|w\|_2^2      Lasso: r(w) = \|w\|_1
\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda r(w)
For any \lambda \ge 0 for which \hat{w}_r achieves the minimum, there exists a \mu \ge 0 such that
\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2   subject to   r(w) \le \mu

Penalized Least Squares
Ridge: r(w) = \|w\|_2^2      Lasso: r(w) = \|w\|_1
\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda r(w)
For any \lambda \ge 0 for which \hat{w}_r achieves the minimum, there exists a \mu \ge 0 such that
\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2   subject to   r(w) \le \mu
(Handwritten sketch: the constraint regions r(w) \le \mu for the two penalties.)