Lecture 3: Empirical Risk Minimization

Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)

A more general approach

We saw the learning algorithms Memorize and k-Nearest Neighbor. We will now discuss a more general approach to the design of learning algorithms. We use the same assumptions:
- Examples X, labels Y
- A distribution D over X × Y
- A learning algorithm gets a sample S ~ D^m and outputs ĥ_S : X → Y
- D is unknown to the learning algorithm

Choosing a prediction rule

If the algorithm knew D, it could find the optimal prediction rule: the Bayes-optimal predictor. Since S is a random sample from D, it should be similar to D.

Idea: find a prediction rule that works well on S.

The error of a prediction rule h on a sample S of size m (also called the empirical risk):

err(h, S) := (1/m) Σ_{i=1}^{m} I[h(x_i) ≠ y_i].

Empirical Risk Minimization (ERM): choose a prediction rule that minimizes err(h, S).

Both Memorize and Nearest Neighbor are ERM algorithms. What about k-Nearest Neighbors?
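To make the definition concrete, here is a minimal Python sketch (not from the lecture) of the empirical risk; the function name empirical_risk and the toy sample are choices made for this illustration.

```python
def empirical_risk(h, S):
    """err(h, S): the fraction of labeled examples (x, y) in S that h mislabels."""
    return sum(h(x) != y for x, y in S) / len(S)

# A toy sample and a constant prediction rule that always answers 1.
S = [(0.5, 1), (1.2, 0), (3.0, 1), (4.4, 0)]
always_one = lambda x: 1
print(empirical_risk(always_one, S))  # 0.5: it errs on the two examples labeled 0
```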

Overfitting

Problem: empirical risk minimization can fail miserably.

Example: the Memorize algorithm. If the training sample is of size m, there are N possible examples (customers) distributed uniformly with N ≫ m, and there are two labels (drinks), then err(ĥ_S, S) = 0, but err(ĥ_S, D) will be very large.

Overfitting: when the error on the training sample is low, but the error on the distribution is large.

Can another learning algorithm avoid this issue?
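A small simulation sketch of this failure mode (not from the lecture): it assumes N possible customers with independently coin-flipped favorite drinks, and that Memorize answers 0 for any customer it has not seen.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 10_000, 100                        # N >> m possible customers, small sample
label = rng.integers(0, 2, size=N)        # each customer's fixed drink (0 or 1)

sample = rng.integers(0, N, size=m)       # training customers, drawn uniformly
memory = {x: label[x] for x in sample}    # Memorize stores the sample verbatim
predict = lambda x: memory.get(x, 0)      # default guess 0 for unseen customers

train_err = np.mean([predict(x) != label[x] for x in sample])
true_err = np.mean([predict(x) != label[x] for x in range(N)])
print(train_err, true_err)                # ~0.0 on the sample, ~0.5 on D
```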

The No Free Lunch theorem

Theorem. Recall: X is the set of examples, Y = {0, 1} the binary labels. For any learning algorithm, if m ≤ |X|/2, there exists a distribution D over X × {0, 1} such that
- there exists a prediction rule f : X → {0, 1} with err(f, D) = 0, but
- with probability at least 1/7 over the random sample S ~ D^m, err(ĥ_S, D) ≥ 1/8.

Proof idea: fix an algorithm A; choose a uniform distribution over 2m examples; set the true labels to be the opposite of what A would guess on the examples it didn't observe.

Introducing inductive bias

By the No Free Lunch theorem, no learning algorithm gets a low error on all distributions, unless it observes almost all possible examples.

A common solution: assume something about the learning problem. Examples:
- The coffee shop: the waiter got a hint that all customers with the same hairstyle like the same drink.
- Identifying documents about economics: assume that a small number of words determines whether a document is about economics or not.
- Identifying people in photos: assume that photos of the same person are similar in a specific feature representation.

Inductive bias: restricting/directing the learning algorithm using external knowledge/assumptions about the learning problem.

Example: learning dosage safety

Learning problem: which medicine dosages are safe?
X = [0, 100] (dosage), Y = {0, 1} (causes side effects?)

A possible training sample is shown in the slides (blue: label is 0, red: label is 1).

ERM without inductive bias might return a rule that exactly fits the sample. Here err(ĥ_S, S) = 0. What do you think err(ĥ_S, D) is?

Inductive bias: limit the ERM algorithm to return only functions that describe thresholds on the line: for x ∈ X, f_a(x) := I[x ≥ a].

Now the algorithm will return a threshold rule. Again err(ĥ_S, S) = 0. What do you think err(ĥ_S, D) is this time?
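The following is an illustrative sketch (not from the slides) of ERM restricted to the threshold class on synthetic dosage data; the true threshold of 60, the sample size, and the noiseless labels are assumptions. Since err(·, S) is piecewise constant in a, it suffices to check thresholds at the sample points and the endpoints of [0, 100].

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic dosage data: side effects (label 1) occur at or above an unknown threshold.
true_a = 60.0
xs = rng.uniform(0, 100, size=30)
ys = (xs >= true_a).astype(int)

# ERM over the threshold class {f_a(x) = I[x >= a]}: only thresholds at the
# sample points (and the interval endpoints) can change the empirical risk.
candidates = np.concatenate(([0.0, 100.0], xs))
risk = lambda a: np.mean((xs >= a).astype(int) != ys)
best_a = min(candidates, key=risk)
print(best_a, risk(best_a))   # a threshold near 60 with zero empirical risk
```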

Inductive bias

Recall Empirical Risk Minimization (ERM): choose a prediction rule that minimizes err(h, S).

Inductive bias: restrict the ERM algorithm. A popular type of inductive bias: choose the prediction rule from a restricted set of functions called a hypothesis class H ⊆ Y^X.

ERM with a hypothesis class H: given a training sample S ~ D^m, output ĥ_S such that

ĥ_S ∈ argmin_{h ∈ H} err(h, S).

We will show (later in the course) that restricting the ERM to a simple H can prevent overfitting. A small finite class is always simple, but there are also simple infinite classes.
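For a finite class, ERM with a hypothesis class H can be written as a one-line search. This is a sketch under the assumption that H is a finite iterable of prediction rules and S a list of (x, y) pairs; the names erm, H, and S are placeholders for this example.

```python
def erm(H, S):
    """Return some h in the (finite) hypothesis class H minimizing err(h, S)."""
    return min(H, key=lambda h: sum(h(x) != y for x, y in S) / len(S))

# Example: thresholds on a coarse grid, applied to a tiny labeled sample.
H = [lambda x, a=a: int(x >= a) for a in range(0, 101, 5)]
S = [(10, 0), (35, 0), (62, 1), (80, 1)]
h_hat = erm(H, S)
print([h_hat(x) for x in (20, 70)])   # [0, 1]: a threshold somewhere between 35 and 62
```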

Hypothesis classes

Popular hypothesis classes:
- Thresholds (for 1-dimensional examples)
- Linear functions (for examples in R^d)
- Combinations of circles (for examples in R^d)
- Small logical formulas (for examples with binary features)
- Neural networks

Any class of prediction rules is a valid hypothesis class. Why are some classes more popular?
- Easy to work with (efficient algorithms, easy implementation)
- Suitable for many different types of problems
- Good error guarantees
- Work well in practice
- Fashionable

The Bias-Complexity tradeoff

Suppose we are using a specific hypothesis class H. Sources of prediction error in an ERM algorithm:
- Perhaps the rules in H are not very good for D. Approximation error: err_app := inf_{h ∈ H} err(h, D).
- Perhaps the error of the rule the ERM selected is far from the best in H. Estimation error: err_est := err(ĥ_S, D) − inf_{h ∈ H} err(h, D).

Total error: err(ĥ_S, D) = err_app + err_est.

If we select a richer (larger) H:
- the approximation error gets smaller (lower bias),
- the estimation error gets larger (higher statistical complexity).

There is a trade-off between the two kinds of error.
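The decomposition of the total error is an immediate consequence of the two definitions; written out explicitly:

```latex
\mathrm{err}(\hat{h}_S, D)
  = \underbrace{\inf_{h \in \mathcal{H}} \mathrm{err}(h, D)}_{\mathrm{err}_{\mathrm{app}}}
  + \underbrace{\Big(\mathrm{err}(\hat{h}_S, D) - \inf_{h \in \mathcal{H}} \mathrm{err}(h, D)\Big)}_{\mathrm{err}_{\mathrm{est}}}.
```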

The Bias-Complexity tradeoff (continued)

Recall overfitting: when err(ĥ_S, D) − err(ĥ_S, S) is large.
- Symptoms: training error (the error on S) is low, but the true error is high.
- This usually means that err_est := err(ĥ_S, D) − inf_{h ∈ H} err(h, D) is also large.
- Can happen if H is too rich (large).

Underfitting: when the approximation error is large.
- Symptoms: training error is high.
- Can happen if H is not suitable for our problem, or too simple.

Best of both worlds: a simple H which is suitable for our problem. E.g., when looking for safe medicine dosages, choose H to be the set of threshold functions x ↦ I[x ≥ a]. Selecting H can represent world knowledge that helps learning.