Empirical Risk Minimization Algorithms

Empirical Risk Minimization Algorithms, Tirgul 2, Part I, November 2016

Reminder
Domain set, X: the set of objects that we wish to label.
Label set, Y: the set of possible labels.
A prediction rule, h: X → Y, is used to label future examples. This function is called a predictor, a hypothesis, or a classifier.

Example
X = ℝ², representing 2 features of a cookie.
Y = {±1}, representing yummy or not yummy.
h(x) = 1 if x is within the inner rectangle (illustrated by a figure on the slide).

Reminder: Online Learning
For t = 1 to T:
1. The pair (x_t, y_t) ∈ X × Y is picked.
2. Predict ŷ_t using the current hypothesis.
3. Compare ŷ_t vs. y_t (the learner pays 1 if ŷ_t ≠ y_t and 0 otherwise).
4. Update the hypothesis.
Goal of the learner: make few mistakes. (A sketch of this protocol in code follows below.)
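The following Python sketch is illustrative only; the Learner interface with predict/update methods is an assumption, not part of the slides. It shows one way to run the protocol above and count mistakes:

def online_learning(learner, examples):
    """Run the online protocol over a stream of (x_t, y_t) pairs and count mistakes."""
    mistakes = 0
    for x_t, y_t in examples:            # round t: the pair (x_t, y_t) arrives
        y_hat = learner.predict(x_t)     # predict with the current hypothesis
        if y_hat != y_t:                 # the learner pays 1 on a wrong prediction
            mistakes += 1
        learner.update(x_t, y_t)         # update the hypothesis
    return mistakes

The same driver works for both learners sketched later in this section.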

Is learning possible?

Mission Impossible?

Mission Impossible?
If |X| = ∞, then for every new instance x_t the learner can't know its label and might always err.
If |X| < ∞, the learner can memorize all labels, but that isn't really learning.

Prior Knowledge
Solution: give more knowledge to the learner:
H ⊆ Y^X is a pre-defined set of classifiers.
Y^X denotes all of the functions from X to Y.

Prior Knowledge
Solution: give more knowledge to the learner:
Suppose we have a function f: X → Y that comes from the aforementioned hypothesis class H ⊆ Y^X.
We assume that the labels in our dataset were determined using f, i.e.: ∀t, f(x_t) = y_t.
Formally: the sequence (x_1, y_1), …, (x_T, y_T) is realized by H.
Assumption: the dataset is realizable (for now).
The learner knows H (but of course doesn't know f).

Will it help?
Let X = ℝ, and let H be the class of thresholds: H = {h_θ : θ ∈ ℝ}, where h_θ(x) = sign(x − θ).

Doesn't always help!
Theorem: For every learner, there exists a sequence of examples which is consistent with some f ∈ H, but on which the learner will always err.
Proof idea: An adversary answers each prediction with the opposite label: whenever the learner predicts +1 the adversary reveals −1, and vice versa. Since a threshold θ consistent with all the revealed labels always exists, the sequence remains realizable by H while the learner errs on every round.

Restriction: Hypothesis Class
Assume that H is of finite size. E.g.: H is the class of thresholds over a grid X = {0, 1/n, 2/n, …, 1}.
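As a concrete (hypothetical) illustration of such a finite class, the thresholds over the grid can be enumerated explicitly in Python; the sign convention h_θ(x) = +1 iff x ≥ θ is an assumption made for this sketch:

def threshold_class(n):
    """Enumerate H = {h_theta : theta in {0, 1/n, ..., 1}} as a list of callables."""
    grid = [i / n for i in range(n + 1)]
    # h_theta(x) = sign(x - theta); here sign(0) is taken to be +1
    return [lambda x, t=theta: 1 if x >= t else -1 for theta in grid]

H = threshold_class(10)
print(len(H))   # |H| = n + 1 = 11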

Learning Finite Hypothesis Classes
Two algorithms: Consistent and Halving.

The Consistent Learner
Initialize: V_1 = H.
For t = 1, 2, …:
  Get x_t.
  Pick some h ∈ V_t and predict ŷ_t = h(x_t).
  Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}.
(A sketch in code follows below.)
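A minimal Python sketch of the Consistent learner, assuming hypotheses are callables returning ±1 (as in the threshold example above) and that the stream is realizable by H:

class ConsistentLearner:
    """Maintains the version space V_t of hypotheses consistent with all examples seen so far."""
    def __init__(self, H):
        self.V = list(H)                   # V_1 = H

    def predict(self, x):
        # pick some h in V_t and predict h(x); under realizability V_t is never empty
        return self.V[0](x)

    def update(self, x, y):
        # V_{t+1} = {h in V_t : h(x) = y}
        self.V = [h for h in self.V if h(x) == y]

Note that on a mistake, the hypothesis used for prediction is removed from the version space; this is exactly the fact used in the analysis below.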

The Consistent Learner: Analysis (1)
Claim: V_t consists of all the functions h ∈ H that correctly predict the labels of all the examples seen so far.
Proof:
Base case: V_1 includes all of the hypotheses in H, and all of them correctly predict the labels of the examples seen so far, vacuously, because no examples have been seen yet.
Inductive step: V_{t+1} consists of all of the hypotheses in V_t, which by the inductive hypothesis correctly predict the labels of examples 1..t−1, and which in addition correctly predict the label of example t. Therefore, all hypotheses in V_{t+1} correctly predict examples 1..t.

The Consistent Learner: Analysis (2)
Theorem: The consistent learner will make at most |H| − 1 mistakes.
Proof:
Denote by M the number of mistakes the algorithm makes.
Given our realizability assumption: |V_t| ≥ 1 (because V_t must include the correct function f).
If we err at round t, then the h ∈ V_t we used for prediction will not be in V_{t+1}.
Therefore: |V_{t+1}| ≤ |V_t| − 1.

The Consistent Learner: Analysis (3)
|V_{T+1}| ≤ |H| − M, where |H| is the original size of V_1 and M is the number of rounds in which ŷ_t ≠ y_t (each such round removes at least one hypothesis).
Combining with |V_{T+1}| ≥ 1:
1 ≤ |V_{T+1}| ≤ |H| − M  ⟹  M ≤ |H| − 1.
I.e.: the consistent learner will make at most |H| − 1 mistakes.

Can we do better?

The Halving Learner
Our goal is to return the correct hypothesis (duh…).
To make the challenge easier, we receive access to the predictions of N experts (= hypotheses).

The Halving Learner
Initialize: V_1 = H.
For t = 1, 2, …:
  Get x_t.
  Predict Majority({h(x_t) : h ∈ V_t}).
  Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}.
(A sketch in code follows below.)
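A matching Python sketch of the Halving learner, under the same assumptions (hypotheses are callables returning ±1):

class HalvingLearner:
    """Predicts by a majority vote over the current version space V_t."""
    def __init__(self, H):
        self.V = list(H)                    # V_1 = H

    def predict(self, x):
        votes = sum(h(x) for h in self.V)   # labels are +/-1, so the sign gives the majority
        return 1 if votes >= 0 else -1      # breaking ties toward +1 is an arbitrary choice

    def update(self, x, y):
        # V_{t+1} = {h in V_t : h(x) = y}; on a mistake at least half of V_t is removed
        self.V = [h for h in self.V if h(x) == y]

Note that both predict and update still scan the whole version space, which is the runtime issue raised in Analysis (3).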

The Halving Learner: Analysis (1)
Theorem: The Halving learner will make at most log₂|H| mistakes.
Proof:
|V_t| ≥ 1 (as before).
For every round t in which there is a mistake, at least half of the experts are wrong and will not continue to the next round:
|V_{t+1}| ≤ |V_t| / 2.

The Halving Learner: Analysis (2)
|V_{T+1}| ≤ |H| / 2^M, where |H| is the original size of V_1 and the size is at least halved for every round in which there was a mistake.
Combining with |V_{T+1}| ≥ 1:
1 ≤ |V_{T+1}| ≤ |H| / 2^M  ⟹  2^M ≤ |H|  ⟹  M ≤ log₂|H|.
I.e., the Halving learner will make at most log₂|H| mistakes.

The Halving Learner: Analysis (3)
Halving's mistake bound grows with log₂|H|, BUT the runtime of Halving grows with |H|: on every round we have to go through the whole hypothesis set.
Learning must take computational considerations into account.