Online Learning
Lecture 1, Online Learning Summer School, Copenhagen 2015
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
Outline
1. The Online Learning Framework: Online Classification; Hypothesis class
2. Learning Finite Hypothesis Classes: The Consistent learner; The Halving learner
3. Structure over the hypothesis class: Halfspaces; The Ellipsoid Learner
Gentle Start: An Online Classification Game
For t = 1, 2, ...
- Environment presents a question x_t
- Learner predicts an answer ŷ_t ∈ {±1}
- Environment reveals the true label y_t ∈ {±1}
- Learner pays 1 if ŷ_t ≠ y_t and 0 otherwise
Goal of the learner: make few mistakes
Example Applications
- Weather forecasting (will it rain tomorrow?)
- Finance (buy or sell an asset?)
- Spam filtering (is this email spam?)
- Compression (what is the next symbol in a sequence?)
- Proxy for optimization (will become clear later)
When can we hope to make few mistakes?
- The task is hopeless if there is no correlation between past and future
- We make no statistical assumptions on the origin of the sequence
- We need to give more knowledge to the learner
Prior Knowledge
Recall the online game: for t = 1, 2, ...: get question x_t ∈ X, predict ŷ_t ∈ {±1}, then get y_t ∈ {±1}
The realizability-by-H assumption:
- H is a predefined set of functions from X to {±1}
- There exists f ∈ H s.t. for every t, y_t = f(x_t)
- The learner knows H (but of course doesn't know f)
Remark: what if our prior knowledge is wrong? We'll get back to this question later
Not always helpful
Let X = ℝ, and let H be the class of thresholds: H = {h_θ : θ ∈ ℝ}, where h_θ(x) = sign(x − θ)
[Figure: the real line with threshold θ; h_θ(x) = −1 to the left of θ and h_θ(x) = +1 to the right]
Theorem: for every learner there exists a sequence of examples which is consistent with some f ∈ H, but on which the learner errs on every round
Exercise: prove the theorem by showing that the environment can follow the bisection method (see the sketch below)
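A minimal sketch of the adversarial strategy in Python (the function names and the horizon argument are illustrative, not from the lecture): the environment keeps an interval of thresholds that are still consistent, queries its midpoint, and answers with the opposite of the learner's prediction.

# Bisection adversary: forces any learner to err on every round, while the
# labels remain consistent with some threshold h_theta, theta in (lo, hi).
def bisection_adversary(learner, rounds):
    # learner: any callable mapping a point x to a prediction in {-1, +1}
    lo, hi = 0.0, 1.0
    for _ in range(rounds):
        x = (lo + hi) / 2.0     # query the midpoint of the consistent interval
        y = -learner(x)         # reveal the opposite of the prediction: a mistake
        if y == 1:              # h_theta(x) = sign(x - theta) = +1  =>  theta < x
            hi = x
        else:                   # y = -1  =>  theta > x
            lo = x
    # every theta in the nonempty interval (lo, hi) is consistent with all labels

After `rounds` rounds the learner has made `rounds` mistakes, yet the labels are realizable by H, which proves the theorem.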
Learning Finite Classes
Assume that H is of finite size. E.g.:
- H is all the functions from X to {±1} that can be implemented by a Python program of length at most b
- H is thresholds over a grid X = {0, 1/n, 2/n, ..., 1}
Learning Finite Classes
The Consistent learner:
  Initialize V_1 = H
  For t = 1, 2, ...
    Get x_t
    Pick some h ∈ V_t and predict ŷ_t = h(x_t)
    Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}
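A direct sketch of the Consistent learner in Python (hypotheses are represented as callables; the names are illustrative, not from the lecture). It stores the version space explicitly, so each round costs time linear in |V_t|:

def consistent_learner(hypotheses, stream):
    # hypotheses: a finite list of functions x -> {-1, +1} (the class H)
    # stream: an iterable of (x_t, y_t) pairs, assumed realizable by H
    V = list(hypotheses)                  # version space V_1 = H
    mistakes = 0
    for x, y in stream:
        h = V[0]                          # pick some h in V_t (arbitrarily)
        if h(x) != y:
            mistakes += 1
        V = [g for g in V if g(x) == y]   # V_{t+1}: keep consistent hypotheses
    return mistakes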
Analysis
Theorem: the Consistent learner makes at most |H| − 1 mistakes
Proof. If we err at round t, then the h ∈ V_t we used for prediction will not be in V_{t+1}. Therefore |V_{t+1}| ≤ |V_t| − 1. By realizability, f ∈ V_t for every t, so |V_t| ≥ 1; hence at most |H| − 1 mistakes can occur.
Can we do better?
The Halving learner
  Initialize V_1 = H
  For t = 1, 2, ...
    Get x_t
    Predict Majority(h(x_t) : h ∈ V_t)
    Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}
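The same sketch with the majority-vote prediction (again an explicit enumeration of H, which is exactly the runtime issue raised two slides below):

def halving_learner(hypotheses, stream):
    V = list(hypotheses)                  # version space V_1 = H
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in V)
        y_hat = 1 if votes >= 0 else -1   # Majority(h(x_t) : h in V_t), ties -> +1
        if y_hat != y:
            mistakes += 1
        V = [h for h in V if h(x) == y]   # V_{t+1}: keep consistent hypotheses
    return mistakes

For example, for thresholds on the grid {0, 1/n, ..., 1} one could pass
hypotheses = [lambda x, t=k/n: 1 if x >= t else -1 for k in range(n + 1)].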
Analysis
Theorem: the Halving learner makes at most log_2(|H|) mistakes
Proof. If we err at round t, then at least half of the functions in V_t will not be in V_{t+1}. Therefore |V_{t+1}| ≤ |V_t|/2, and since |V_t| ≥ 1, at most log_2(|H|) mistakes can occur.
Corollary: the Halving learner can learn the class H of all Python programs of length < b bits while making at most b mistakes (there are fewer than 2^b such programs, so log_2(|H|) < b).
Powerful, but...
1. What if the environment is not consistent with any f ∈ H? We'll deal with this later
2. While the mistake bound of Halving grows with log_2(|H|), the runtime of Halving grows with |H|. Learning must take computational considerations into account
Efficient learning with structured H
Example: recall the class H of thresholds over a grid X = {0, 1/n, ..., 1} for some integer n ≥ 1
- Halving's mistake bound is log_2(n + 1)
- A naive implementation of Halving takes Ω(n) time per round
- How can we implement Halving efficiently?
Efficient learning with structured H
Efficient Halving for discrete thresholds:
  Initialize l_1 = 0, r_1 = 1
  For t = 1, 2, ...
    Get x_t ∈ {0, 1/n, ..., 1}
    Predict sign((x_t − l_t) − (r_t − x_t))
    Get y_t, and if x_t ∈ [l_t, r_t] update:
      if y_t = 1 then l_{t+1} = l_t, r_{t+1} = x_t
      if y_t = −1 then l_{t+1} = x_t, r_{t+1} = r_t
Exercise: show that the above is indeed an implementation of Halving and that the runtime of each iteration is O(log(n)) (see the sketch below)
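A sketch of this learner in Python (illustrative names, not from the lecture). Each round performs a constant number of arithmetic operations on numbers with O(log n) bits, which is where the O(log(n)) per-iteration runtime comes from:

def efficient_threshold_halving(stream):
    # stream: iterable of (x_t, y_t) with x_t on the grid {0, 1/n, ..., 1}
    l, r = 0.0, 1.0                            # consistent thresholds lie in [l, r]
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if (x - l) >= (r - x) else -1  # majority of thresholds in [l, r]
        if y_hat != y:
            mistakes += 1
        if l <= x <= r:                        # shrink the consistent interval
            if y == 1:
                r = x                          # y = +1  =>  theta <= x_t
            else:
                l = x                          # y = -1  =>  theta > x_t
    return mistakes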
Halfspaces
[Figure: a hyperplane with normal vector w separating the + examples from the − examples]
H = {x ↦ sign(⟨w, x⟩ + b) : w ∈ ℝ^d, b ∈ ℝ}
Inner product: ⟨w, x⟩ = wᵀx = Σ_{i=1}^d w_i x_i
w is called a weight vector and b a bias
- For d = 1, the class of halfspaces is the class of thresholds
- W.l.o.g., assume x_d = 1 for all examples; then we can treat w_d as the bias and forget about b (see the identity below)
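To spell out the reduction (a one-line identity, not shown on the slide): if x_d = 1 and we set w_d = b, then
⟨w, x⟩ = Σ_{i=1}^{d−1} w_i x_i + w_d · 1 = ⟨w_{1:d−1}, x_{1:d−1}⟩ + b,
so every (d−1)-dimensional halfspace with a bias is a homogeneous (b = 0) halfspace in dimension d.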
Using Halving to learn halfspaces on a grid
- Let us represent all numbers on the grid G = {−1, −1 + 1/n, ..., 1 − 1/n, 1}
- Then |H| = |G|^d = (2n + 1)^d
- Therefore, Halving's bound is at most d log_2(2n + 1)
- We will show an algorithm with a slightly worse mistake bound, but one that can be implemented efficiently
The Ellipsoid Learner
- Recall that Halving maintains the version space V_t, containing all hypotheses in H which are consistent with the examples observed so far
- Each halfspace hypothesis corresponds to a vector in G^d
- Instead of maintaining V_t, we will maintain an ellipsoid E_t that contains V_t
- We will show that every time we make a mistake, the volume of E_t shrinks by a factor of e^{−1/(2(d+1))}
- On the other hand, we will show that the volume of E_t cannot become too small (this is where we use the grid assumption)
Background: Balls and Ellipsoids
Let B = {w ∈ ℝ^d : ‖w‖_2 ≤ 1} be the unit ball of ℝ^d
Recall: ‖w‖_2² = ⟨w, w⟩ = wᵀw = Σ_{i=1}^d w_i²
An ellipsoid is the image of a ball under an affine mapping: given a matrix M and a vector v,
E(M, v) = {Mw + v : ‖w‖_2 ≤ 1}
[Figure: the unit ball centered at 0 is mapped by w ↦ Mw + v to an ellipsoid centered at v]
The Ellipsoid Learner
We implicitly maintain an ellipsoid E_t = E(A_t^{1/2}, w_t)
  Start with w_1 = 0, A_1 = I
  For t = 1, 2, ...
    Get x_t
    Predict ŷ_t = sign(⟨w_t, x_t⟩)
    Get y_t
    If ŷ_t ≠ y_t, update:
      w_{t+1} = w_t + (y_t A_t x_t) / ((d + 1) √(x_tᵀ A_t x_t))
      A_{t+1} = (d²/(d² − 1)) · (A_t − (2/(d + 1)) · (A_t x_t x_tᵀ A_t) / (x_tᵀ A_t x_t))
    If ŷ_t = y_t, keep w_{t+1} = w_t and A_{t+1} = A_t
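A minimal NumPy sketch of these updates (illustrative names, no numerical safeguards; it assumes d ≥ 2 so that d² − 1 > 0):

import numpy as np

def ellipsoid_learner(d, stream):
    # stream: iterable of (x_t, y_t), x_t an ndarray of shape (d,), y_t in {-1,+1}
    w = np.zeros(d)                       # center of E_t
    A = np.eye(d)                         # E_t = E(A^{1/2}, w)
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if w @ x >= 0 else -1
        if y_hat != y:                    # update only on mistakes
            mistakes += 1
            Ax = A @ x                    # A_t x_t (A is symmetric)
            xAx = x @ Ax                  # x_t^T A_t x_t
            w = w + y * Ax / ((d + 1) * np.sqrt(xAx))
            A = (d**2 / (d**2 - 1)) * (A - (2 / (d + 1)) * np.outer(Ax, Ax) / xAx)
    return mistakes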
Intuition
Suppose x_1 = (1, 0) and y_1 = 1. Then:
w_2 = (1/3, 0),  A_2 = [[4/9, 0], [0, 4/3]]
E_2 is the ellipsoid of minimum volume that contains E_1 ∩ {w : y_1 ⟨w, x_1⟩ > 0}
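This can be checked numerically with the update equations (a quick sanity check, not part of the slides):

import numpy as np
d, x, y = 2, np.array([1.0, 0.0]), 1
w, A = np.zeros(d), np.eye(d)
Ax = A @ x
xAx = x @ Ax
w = w + y * Ax / ((d + 1) * np.sqrt(xAx))
A = (d**2 / (d**2 - 1)) * (A - (2 / (d + 1)) * np.outer(Ax, Ax) / xAx)
print(w)   # [0.3333... 0.]                 i.e. (1/3, 0)
print(A)   # [[0.4444... 0.], [0. 1.3333...]]  i.e. diag(4/9, 4/3)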
Analysis
Theorem: the Ellipsoid learner makes at most 2d(2d + 2) log(n) mistakes.
The proof is based on two lemmas:
Lemma (Volume Reduction): whenever we make a mistake, Vol(E_{t+1}) ≤ Vol(E_t) · e^{−1/(2d+2)}.
Lemma (Volume can't be too small): for every t, Vol(E_t) ≥ Vol(B) · (1/n)^{2d}.
Therefore, after M mistakes: Vol(B) · (1/n)^{2d} ≤ Vol(E_t) ≤ Vol(B) · e^{−M/(2d+2)}
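Rearranging (this step is implicit on the slide): dividing by Vol(B) and taking logarithms gives
−2d log(n) ≤ −M/(2d + 2), i.e. M ≤ 2d(2d + 2) log(n),
which is the claimed bound.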
Summary
- A basic online classification model
- Prior knowledge is needed
- Learning finite hypothesis classes using Halving
- The runtime problem
- The Ellipsoid learner efficiently learns halfspaces (over a grid)
Next lectures:
- Online learnability: for which H can we guarantee a finite number of mistakes?
- Non-realizable sequences
- Beyond binary classification
- A more general online learning game
Exercises
- Prove the lower bound for thresholds on the real line (the bisection exercise)
- Show that the efficient threshold learner implements Halving with O(log(n)) runtime per iteration
- Derive the Ellipsoid update equations
- Prove the two volume lemmas from the analysis of the Ellipsoid learner
Background: Balls and Ellipsoids
Recall: E(M, v) = {Mw + v : ‖w‖_2 ≤ 1}
We deal with non-degenerate ellipsoids, i.e., M is invertible
SVD theorem: every real invertible matrix M can be decomposed as M = UDVᵀ, where U, V are orthonormal and D is diagonal with D_{i,i} > 0
Exercise: show that E(M, v) = E(UD, v) = E(UDUᵀ, v)
Therefore, we can assume w.l.o.g. that M = UDUᵀ (i.e., it is symmetric positive definite)
Exercise: show that for such M,
E(M, v) = {x : (x − v)ᵀ M^{−2} (x − v) ≤ 1}, where M^{−2} = U D^{−2} Uᵀ with (D^{−2})_{i,i} = D_{i,i}^{−2}
Volume Calculations
Let Vol(B) be the volume of the unit ball
Lemma: if M = UDUᵀ is positive definite, then
Vol(E(M, v)) = det(M) · Vol(B) = (∏_{i=1}^d D_{i,i}) · Vol(B)
Why volume shrinks
Suppose A_t = U D² Uᵀ. Define x̃_t = D Uᵀ x_t. Then:
A_{t+1} = (d²/(d² − 1)) · (A_t − (2/(d + 1)) · (A_t x_t x_tᵀ A_t)/(x_tᵀ A_t x_t))
        = (d²/(d² − 1)) · U D (I − (2/(d + 1)) · x̃_t x̃_tᵀ/‖x̃_t‖²) D Uᵀ
By Sylvester's determinant theorem, det(I + uvᵀ) = 1 + ⟨u, v⟩. Therefore,
det(A_{t+1}) = (d²/(d² − 1))^d · det(D)² · det(I − (2/(d + 1)) · x̃_t x̃_tᵀ/‖x̃_t‖²)
             = det(A_t) · (d²/(d² − 1))^d · (1 − 2/(d + 1))
Why volume shrinks
Since Vol(E_t) = √(det(A_t)) · Vol(B), we obtain:
Vol(E_{t+1})/Vol(E_t) = √(det(A_{t+1})/det(A_t))
  = (d²/(d² − 1))^{d/2} · (1 − 2/(d + 1))^{1/2}
  = (d²/(d² − 1))^{(d−1)/2} · √(d²(d − 1)/((d − 1)(d + 1)²))
  = (1 + 1/(d² − 1))^{(d−1)/2} · (1 − 1/(d + 1))
  ≤ e^{(d−1)/(2(d²−1))} · e^{−1/(d+1)} = e^{1/(2(d+1))} · e^{−1/(d+1)} = e^{−1/(2(d+1))}
where we used 1 + a ≤ e^a, which holds for all a ∈ ℝ.
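A quick numerical check of this ratio against the bound (illustrative; it reuses the update from the learner sketch above):

import numpy as np
d = 5
A = np.eye(d)
x = np.ones(d)                            # for A = I, any nonzero x gives the same ratio
Ax = A @ x
xAx = x @ Ax
A_next = (d**2 / (d**2 - 1)) * (A - (2 / (d + 1)) * np.outer(Ax, Ax) / xAx)
ratio = np.sqrt(np.linalg.det(A_next) / np.linalg.det(A))  # Vol(E_{t+1}) / Vol(E_t)
print(ratio, np.exp(-1 / (2 * (d + 1))))  # approx 0.904 <= approx 0.920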
Why volume can't be too small
Recall that y_t ⟨w*, x_t⟩ > 0 for every t. Since the coordinates of w* and x_t are on the grid G, y_t ⟨w*, x_t⟩ is a positive multiple of 1/n², so y_t ⟨w*, x_t⟩ ≥ 1/n².
Therefore, if ‖w − w*‖ < 1/n², then
y_t ⟨w, x_t⟩ = y_t ⟨w − w*, x_t⟩ + y_t ⟨w*, x_t⟩ ≥ −‖w − w*‖ · ‖x_t‖ + 1/n² > 0
(for ‖x_t‖ ≤ 1; in general, shrink the radius by a factor of ‖x_t‖)
Convince yourself (by induction) that E_t contains the ball of radius 1/n² centered around w*. It follows that
Vol(B) · (1/n)^{2d} = Vol(B) · (1/n²)^d = Vol(E((1/n²) · I, w*)) ≤ Vol(E_t)