Empirical Risk Minimization Algorithms Tirgul 2 Part I November 2016
Reminder: Domain set, X: the set of objects that we wish to label. Label set, Y: the set of possible labels. A prediction rule, h: X → Y, used to label future examples. This function is called a predictor, a hypothesis, or a classifier.
Example: X = ℝ², representing 2 features of a cookie. Y = {±1}, representing yummy or not yummy. h(x) = 1 if x is within the inner rectangle:
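A minimal sketch of such a rectangle classifier in Python (the rectangle bounds below are hypothetical, chosen only for illustration):

def rectangle_classifier(x, x_min=0.0, x_max=1.0, y_min=0.0, y_max=1.0):
    # Return +1 ("yummy") if the point lies inside the rectangle, -1 otherwise.
    inside = (x_min <= x[0] <= x_max) and (y_min <= x[1] <= y_max)
    return 1 if inside else -1

print(rectangle_classifier((0.5, 0.5)))   # 1: inside the rectangle
print(rectangle_classifier((2.0, 0.5)))   # -1: outside the rectangle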
Reminder: Online Learning
For t = 1 to T:
  1. Pick the pair (x_t, y_t) ∈ X × Y
  2. Predict ŷ_t using the hypothesis
  3. Compare ŷ_t vs. y_t (the learner pays 1 if ŷ_t ≠ y_t and 0 otherwise)
  4. Update the hypothesis
Goal of the learner: make few mistakes.
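A minimal sketch of this protocol as a loop in Python; the example stream, the initial hypothesis, and the update rule are placeholders to be supplied by a concrete learner:

def online_loop(examples, hypothesis, update):
    # Run the online protocol and count mistakes (0-1 loss).
    mistakes = 0
    for x_t, y_t in examples:                       # 1. the pair (x_t, y_t) arrives
        y_hat = hypothesis(x_t)                     # 2. predict with the current hypothesis
        if y_hat != y_t:                            # 3. learner pays 1 on a mistake
            mistakes += 1
        hypothesis = update(hypothesis, x_t, y_t)   # 4. update the hypothesis
    return mistakes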
Is learning possible??
Mission Impossible?
Mission Impossible? If |X| = ∞, then for every new instance x_t, the learner can't know its label and might always err. If |X| < ∞, the learner can memorize all labels, but that isn't really learning.
Prior Knowledge. Solution: give more knowledge to the learner: H ⊆ Y^X is a pre-defined set of classifiers, where Y^X denotes the set of all functions from X to Y.
Prior Knowledge. Solution: give more knowledge to the learner: suppose we have a function f: X → Y that comes from the aforementioned hypothesis class H ⊆ Y^X. We assume that the labels in our dataset were determined using f, i.e. ∀t, f(x_t) = y_t. Formally: the sequence {(x_t, y_t)}_t is realizable by H. Assumption: the dataset is realizable (for now). The learner knows H (but of course doesn't know f).
Will it help? Let X = ℝ, and let H be the class of thresholds: H = {h_θ : θ ∈ ℝ}, where h_θ(x) = sign(x - θ).
Doesn't always help! Theorem: for every learner, there exists a sequence of examples which is consistent with some f ∈ H, but on which the learner will always err. Proof idea: at every round the adversary answers with the opposite of the learner's prediction (the learner asks "+1?", the adversary answers "-1!", and vice versa), while placing the examples so that some threshold θ remains consistent with all the labels given so far.
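A sketch of this adversary for the threshold class over ℝ (the learner's prediction function is a placeholder): the adversary keeps an interval of thresholds that are still consistent with all past labels, queries its midpoint, answers the opposite of the learner's prediction, and shrinks the interval so that consistency is preserved.

def adversarial_sequence(learner_predict, rounds=20, lo=0.0, hi=1.0):
    # Thresholds h_theta(x) = sign(x - theta), with sign(0) := +1.
    # Invariant: every theta in (lo, hi] is consistent with all labels given so far.
    mistakes = 0
    for t in range(rounds):
        x_t = (lo + hi) / 2           # adversary picks the midpoint of the interval
        y_hat = learner_predict(x_t)  # learner predicts +1 or -1
        y_t = -y_hat                  # adversary answers the opposite, so the learner errs
        if y_t == 1:
            hi = x_t                  # label +1 requires theta <= x_t
        else:
            lo = x_t                  # label -1 requires theta > x_t
        mistakes += 1
    return mistakes, (lo, hi)         # learner erred every round, yet some h_theta still fits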
Restriction: Hypothesis Class. Assume that H is of finite size, e.g. H is the class of thresholds over a grid X = {0, 1/n, 2/n, ..., 1}.
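For example, a sketch of such a finite class in Python (n is an illustrative grid resolution, and each hypothesis is a threshold classifier restricted to a grid point):

n = 10  # illustrative grid resolution

def make_threshold(theta):
    # h_theta(x) = sign(x - theta), with the convention sign(0) := +1
    return lambda x: 1 if x >= theta else -1

# One threshold per grid point: |H| = n + 1
H = [make_threshold(i / n) for i in range(n + 1)]
print(len(H))   # 11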
Learning Finite Hypothesis Classes: Consistent and Halving
The Consistent Learner
Initialize: V_1 = H
For t = 1, 2, ...:
  Get x_t
  Pick some h ∈ V_t and predict ŷ_t = h(x_t)
  Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}
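A minimal sketch of this learner in Python, assuming H is given as a finite list of functions from X to {+1, -1} and the examples arrive as (x_t, y_t) pairs:

def consistent_learner(hypotheses, examples):
    # V_1 = H; predict with an arbitrary consistent hypothesis, then filter.
    V = list(hypotheses)
    mistakes = 0
    for x_t, y_t in examples:
        h = V[0]                                 # pick some h in V_t
        y_hat = h(x_t)                           # predict y_hat_t = h(x_t)
        if y_hat != y_t:
            mistakes += 1
        V = [g for g in V if g(x_t) == y_t]      # V_{t+1}: keep only consistent hypotheses
    return mistakes

Under the realizability assumption V never becomes empty, so picking V[0] is always possible.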
The Consistent Learner: Analysis (1). Claim: V_t consists of all the functions h ∈ H that correctly predict the labels of the examples seen so far. Proof: Base case: V_1 includes all of the hypotheses in H, and all of them correctly predict the labels of the examples seen so far, since no examples have been seen yet. Inductive step: V_{t+1} consists of all of the hypotheses in V_t (which, by the induction hypothesis, correctly predict the labels of examples 1..t-1) that also correctly predict the label of example t. Therefore, all hypotheses in V_{t+1} correctly predict examples 1..t.
The Consistent Learner: Analysis (2). Theorem: the Consistent learner will make at most |H| - 1 mistakes. Proof: denote by M the number of mistakes the algorithm made. Given our realizability assumption: |V_t| ≥ 1 (because V_t must include the correct function f). If we err at round t, then the h ∈ V_t we used for prediction will not be in V_{t+1}. Therefore: |V_{t+1}| ≤ |V_t| - 1.
The Consistent Learner: Analysis (3). Since |V_1| = |H| (the original size of the version space) and each of the M rounds in which ŷ_t ≠ y_t removes at least one hypothesis, we get |V_{T+1}| ≤ |H| - M. Combining with |V_{T+1}| ≥ 1: 1 ≤ |V_{T+1}| ≤ |H| - M, hence M ≤ |H| - 1. I.e., the Consistent learner will make at most |H| - 1 mistakes.
Can we do better?
The Halving Learner. Our goal is to return the correct hypothesis (duh). To make the challenge easier, we receive access to the predictions of N experts (i.e., hypotheses), e.g. five experts predicting 1, 0, 1, 0, 1.
The Halving Learner
Initialize: V_1 = H
For t = 1, 2, ...:
  Get x_t
  Predict ŷ_t = Majority({h(x_t) : h ∈ V_t})
  Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}
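A minimal sketch of the Halving learner in Python, under the same assumptions as the Consistent learner above (H is a finite list of ±1-valued functions):

def halving_learner(hypotheses, examples):
    # V_1 = H; predict the majority vote of the version space, then filter.
    V = list(hypotheses)
    mistakes = 0
    for x_t, y_t in examples:
        votes = sum(h(x_t) for h in V)           # sum of +1/-1 votes
        y_hat = 1 if votes >= 0 else -1          # Majority({h(x_t) : h in V_t})
        if y_hat != y_t:
            mistakes += 1
        V = [g for g in V if g(x_t) == y_t]      # V_{t+1}: keep only correct hypotheses
    return mistakes

On a mistake, at least half of V_t voted for the wrong label ŷ_t, so the filtering step discards at least half of the version space.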
The Halving Learner: Analysis (1). Theorem: the Halving learner will make at most log_2 |H| mistakes. Proof: |V_t| ≥ 1 (as before). For every iteration t in which there is a mistake, at least half of the experts are wrong and will not continue to the next round: |V_{t+1}| ≤ |V_t| / 2.
The Halving Learner: Analysis (2). Since |V_1| = |H| (the original size of the version space) and every round with a mistake at least halves the version space, we get |V_{T+1}| ≤ |H| / 2^M. Combining with |V_{T+1}| ≥ 1: 1 ≤ |V_{T+1}| ≤ |H| / 2^M, hence 2^M ≤ |H| and M ≤ log_2 |H|. I.e., the Halving learner will make at most log_2 |H| mistakes.
The Halving Learner: Analysis (3). Halving's mistake bound grows with log_2 |H|, BUT: the runtime of Halving grows with |H|, since on every iteration we have to go through the whole hypothesis set. Learning must take computational considerations into account.