Online Learning Summer School Copenhagen 2015 Lecture 1

Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem

Outline
1. The Online Learning Framework: online classification, hypothesis classes
2. Learning Finite Hypothesis Classes: the Consistent learner, the Halving learner
3. Structure over the hypothesis class: halfspaces, the Ellipsoid learner

Gentle Start: An Online Classification Game
For t = 1, 2, ...
- The environment presents a question x_t
- The learner predicts an answer ŷ_t ∈ {±1}
- The environment reveals the true label y_t ∈ {±1}
- The learner pays 1 if ŷ_t ≠ y_t and 0 otherwise
Goal of the learner: make few mistakes.
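A minimal sketch of this protocol in code (the names `learner` and `environment`, and the predict/update interface, are placeholders chosen here, not part of the lecture):

```python
def play(learner, environment):
    """Run the online classification game and count the learner's mistakes."""
    mistakes = 0
    for x_t, y_t in environment:          # environment yields (question, true label)
        y_hat = learner.predict(x_t)      # learner commits to an answer in {+1, -1}
        if y_hat != y_t:                  # the label is revealed only after the prediction
            mistakes += 1                 # pay 1 for every wrong answer
        learner.update(x_t, y_t)          # learner may use the revealed label
    return mistakes
```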

Example Applications
- Weather forecasting (will it rain tomorrow?)
- Finance (buy or sell an asset?)
- Spam filtering (is this email spam?)
- Compression (what is the next symbol in a sequence?)
- Proxy for optimization (this will become clear later)

When can we hope to make few mistakes?
The task is hopeless if there is no correlation between past and future. We make no statistical assumptions on the origin of the sequence, so we need to give the learner more prior knowledge.

Prior Knowledge
Recall the online game: for t = 1, 2, ...: get question x_t ∈ X, predict ŷ_t ∈ {±1}, then get y_t ∈ {±1}.
The realizability-by-H assumption:
- H is a predefined set of functions from X to {±1}
- There exists f ∈ H such that y_t = f(x_t) for every t
- The learner knows H (but of course does not know f)
Remark: what if our prior knowledge is wrong? We will get back to this question later.

Not Always Helpful
Let X = ℝ and let H be the class of thresholds: H = {h_θ : θ ∈ ℝ}, where h_θ(x) = sign(x − θ).
(Figure: h_θ outputs −1 to the left of θ and +1 to the right.)
Theorem: for every learner there exists a sequence of examples that is consistent with some f ∈ H, yet on which the learner errs on every round.
Exercise: prove the theorem by showing that the environment can follow the bisection method, as in the sketch below.
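One way to picture the exercise: an adversarial environment keeps an interval of thresholds consistent with all labels it has revealed, always asks for the midpoint, and answers the opposite of the learner's prediction. The sketch below is one possible rendering of that strategy (the function name and the predict/update interface are placeholders):

```python
def bisection_adversary(learner, rounds=20):
    """Force any threshold learner to err on every round while staying realizable."""
    lo, hi = 0.0, 1.0                 # every theta in (lo, hi) is consistent so far
    for _ in range(rounds):
        x = (lo + hi) / 2             # ask for the midpoint of the consistent interval
        y_hat = learner.predict(x)
        y = -y_hat                    # reveal the label the learner did not predict
        if y == 1:
            hi = x                    # label +1 at x means theta <= x: keep (lo, x)
        else:
            lo = x                    # label -1 at x means theta > x: keep (x, hi)
        learner.update(x, y)
    return rounds                     # by construction, one mistake per round
```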

Learning Finite Classes
Assume that H is of finite size. For example:
- H is the set of all functions from X to {±1} that can be implemented by a Python program of length at most b bits
- H is the class of thresholds over a grid X = {0, 1/n, 2/n, ..., 1}

Learning Finite Classes
The Consistent learner
Initialize V_1 = H
For t = 1, 2, ...
- Get x_t
- Pick some h ∈ V_t and predict ŷ_t = h(x_t)
- Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}
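A minimal Python sketch of the Consistent learner, with the finite class represented simply as a list of functions from X to {+1, −1} (this representation is an assumption made for illustration):

```python
class ConsistentLearner:
    def __init__(self, hypotheses):
        self.V = list(hypotheses)            # version space V_1 = H

    def predict(self, x):
        return self.V[0](x)                  # any consistent hypothesis will do

    def update(self, x, y):
        # keep only the hypotheses that agree with the revealed label
        self.V = [h for h in self.V if h(x) == y]
```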

Analysis
Theorem: the Consistent learner makes at most |H| − 1 mistakes.
Proof: if we err at round t, then the hypothesis h ∈ V_t used for prediction will not be in V_{t+1}. Therefore |V_{t+1}| ≤ |V_t| − 1.
Can we do better?

The Halving Learner
Initialize V_1 = H
For t = 1, 2, ...
- Get x_t
- Predict Majority(h(x_t) : h ∈ V_t)
- Get y_t and update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}
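A matching sketch of the Halving learner; the only change from the Consistent learner is that the prediction is the majority vote of the current version space (breaking ties toward +1 is a convention chosen here):

```python
class HalvingLearner:
    def __init__(self, hypotheses):
        self.V = list(hypotheses)            # version space V_1 = H

    def predict(self, x):
        votes = sum(h(x) for h in self.V)    # each hypothesis votes +1 or -1
        return 1 if votes >= 0 else -1       # majority vote of the version space

    def update(self, x, y):
        self.V = [h for h in self.V if h(x) == y]
```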

Analysis
Theorem: the Halving learner makes at most log₂(|H|) mistakes.
Proof: if we err at round t, then at least half of the functions in V_t will not be in V_{t+1}. Therefore |V_{t+1}| ≤ |V_t|/2.
Corollary: the Halving learner can learn the class H of all Python programs of length < b bits while making at most b mistakes.

Powerful, but...
1. What if the environment is not consistent with any f ∈ H? We will deal with this later.
2. While the mistake bound of Halving grows with log₂(|H|), the runtime of Halving grows with |H|. Learning must take computational considerations into account.

Efficient Learning with Structured H
Example: recall the class H of thresholds over a grid X = {0, 1/n, ..., 1} for some integer n ≥ 1.
- Halving's mistake bound is log₂(n + 1)
- A naive implementation of Halving takes Ω(n) time per round
How can we implement Halving efficiently?

Efficient Halving for Discrete Thresholds
Initialize l_1 = 0, r_1 = 1
For t = 1, 2, ...
- Get x_t ∈ {0, 1/n, ..., 1}
- Predict sign((x_t − l_t) − (r_t − x_t))
- Get y_t, and if x_t ∈ [l_t, r_t] update:
  - if y_t = +1 then l_{t+1} = l_t, r_{t+1} = x_t
  - if y_t = −1 then l_{t+1} = x_t, r_{t+1} = r_t
Exercise: show that the above is indeed an implementation of Halving and that the runtime of each iteration is O(log(n)); a code sketch follows.
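A sketch of this learner in code. The version space of consistent thresholds is always an interval, so tracking its two endpoints replaces enumerating all n + 1 hypotheses (the class name and the tie-breaking convention are choices made here):

```python
class ThresholdHalving:
    def __init__(self):
        self.l, self.r = 0.0, 1.0        # consistent thresholds lie between l and r

    def predict(self, x):
        # Thresholds theta <= x vote +1 (about x - l of them);
        # thresholds theta > x vote -1 (about r - x of them).
        return 1 if (x - self.l) - (self.r - x) >= 0 else -1

    def update(self, x, y):
        if self.l <= x <= self.r:
            if y == 1:
                self.r = x               # label +1 means theta <= x
            else:
                self.l = x               # label -1 means theta > x
```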

Halfspaces
(Figure: a separating hyperplane with normal vector w, positive examples on one side and negative examples on the other.)
H = {x ↦ sign(⟨w, x⟩ + b) : w ∈ ℝ^d, b ∈ ℝ}
Inner product: ⟨w, x⟩ = w^⊤x = Σ_{i=1}^d w_i x_i
w is called a weight vector and b a bias.
For d = 1, the class of halfspaces is the class of thresholds.
W.l.o.g., assume that x_d = 1 for all examples; then we can treat w_d as the bias and forget about b.

Using Halving to Learn Halfspaces on a Grid
Let us represent all numbers on the grid G = {−1, −1 + 1/n, ..., 1 − 1/n, 1}.
Then |H| = |G|^d = (2n + 1)^d, so Halving's mistake bound is at most d log₂(2n + 1).
We will show an algorithm with a slightly worse mistake bound, but one that can be implemented efficiently.

The Ellipsoid Learner
Recall that Halving maintains the version space V_t, containing all hypotheses in H that are consistent with the examples observed so far.
Each halfspace hypothesis corresponds to a vector in G^d.
Instead of maintaining V_t, we will maintain an ellipsoid E_t that contains V_t.
We will show that every time we make a mistake, the volume of E_t shrinks by a factor of e^{−1/(2d+2)}.
On the other hand, we will show that the volume of E_t cannot become too small (this is where we use the grid assumption).

Background: Balls and Ellipsoids
Let B = {w ∈ ℝ^d : ‖w‖ ≤ 1} be the unit ball of ℝ^d.
Recall: ‖w‖² = ⟨w, w⟩ = w^⊤w = Σ_{i=1}^d w_i²
An ellipsoid is the image of a ball under an affine mapping: given a matrix M and a vector v,
E(M, v) = {Mw + v : ‖w‖ ≤ 1}
(Figure: the unit ball centered at 0 mapped by w ↦ Mw + v to an ellipsoid centered at v.)

The Ellipsoid Learner
We implicitly maintain an ellipsoid E_t = E(A_t^{1/2}, w_t).
Start with w_1 = 0, A_1 = I
For t = 1, 2, ...
- Get x_t
- Predict ŷ_t = sign(⟨w_t, x_t⟩)
- Get y_t
- If ŷ_t ≠ y_t update:
  w_{t+1} = w_t + y_t A_t x_t / ((d + 1) √(x_t^⊤ A_t x_t))
  A_{t+1} = (d² / (d² − 1)) (A_t − (2 / (d + 1)) · A_t x_t x_t^⊤ A_t / (x_t^⊤ A_t x_t))
- If ŷ_t = y_t, keep w_{t+1} = w_t and A_{t+1} = A_t
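A NumPy sketch of these updates (the class name is a placeholder; treating sign(0) as −1 is a convention chosen here, and the update formulas require d ≥ 2):

```python
import numpy as np

class EllipsoidLearner:
    def __init__(self, d):
        self.d = d
        self.w = np.zeros(d)          # centre of E_t
        self.A = np.eye(d)            # E_t = E(A^{1/2}, w)

    def predict(self, x):
        return 1 if self.w @ x > 0 else -1

    def update(self, x, y):
        if self.predict(x) == y:      # no mistake: keep the ellipsoid unchanged
            return
        d, A = self.d, self.A
        Ax = A @ x                    # A_t x_t
        xAx = x @ Ax                  # x_t^T A_t x_t
        self.w = self.w + y * Ax / ((d + 1) * np.sqrt(xAx))
        self.A = (d**2 / (d**2 - 1)) * (A - (2 / (d + 1)) * np.outer(Ax, Ax) / xAx)
```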

Intuition
Suppose x_1 = (1, 0), y_1 = 1, and the learner erred. Then:
w_2 = (1/3, 0), A_2 = diag(4/9, 4/3)
E_2 is the ellipsoid of minimum volume that contains E_1 ∩ {w : y_1⟨w, x_1⟩ > 0}.
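Running the NumPy sketch above on this example reproduces these numbers (this assumes the EllipsoidLearner class and the numpy import from the previous sketch are in scope):

```python
learner = EllipsoidLearner(d=2)
learner.update(np.array([1.0, 0.0]), 1)   # sign(0) is -1 here, so this round counts as a mistake
print(learner.w)   # [0.3333... 0.]          = (1/3, 0)
print(learner.A)   # [[0.4444... 0.], [0. 1.3333...]]  = diag(4/9, 4/3)
```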

Analysis
Theorem: the Ellipsoid learner makes at most 2d(2d + 2) log(n) mistakes.
The proof is based on two lemmas.
Lemma (volume reduction): whenever we make a mistake, Vol(E_{t+1}) ≤ Vol(E_t) · e^{−1/(2d+2)}.
Lemma (volume cannot be too small): for every t, Vol(E_t) ≥ Vol(B) · (1/n)^{2d}.
Therefore, after M mistakes: Vol(B) · (1/n)^{2d} ≤ Vol(E_t) ≤ Vol(B) · e^{−M/(2d+2)}.
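Solving this chain of inequalities for M gives the theorem; the short calculation, spelled out here for completeness, is:

```latex
(1/n)^{2d} \le e^{-M/(2d+2)}
\;\Longrightarrow\;
-2d\log n \le -\frac{M}{2d+2}
\;\Longrightarrow\;
M \le 2d(2d+2)\log n .
```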

Summary
- A basic online classification model
- The need for prior knowledge
- Learning finite hypothesis classes using Halving
- The runtime problem
- The Ellipsoid learner efficiently learns halfspaces (over a grid)
Next lectures:
- Online learnability: for which H can we guarantee a finite number of mistakes?
- Non-realizable sequences
- Beyond binary classification
- A more general online learning game

Exercises
- The bisection exercise (thresholds over ℝ cannot be learned with a finite mistake bound)
- The efficient-Halving exercise (discrete thresholds)
- Derive the Ellipsoid update equations
- Prove the two lemmas of the Ellipsoid analysis (volume reduction and the volume lower bound)

Background: Balls and Ellipsoids
Recall: E(M, v) = {Mw + v : ‖w‖ ≤ 1}
We deal with non-degenerate ellipsoids, i.e., M is invertible.
SVD theorem: every real invertible matrix M can be decomposed as M = UDV^⊤, where U and V are orthonormal and D is diagonal with D_{i,i} > 0.
Exercise: show that E(M, v) = E(UD, v) = E(UDU^⊤, v).
Therefore, we can assume w.l.o.g. that M = UDU^⊤ (i.e., M is symmetric positive definite).
Exercise: show that for such M, E(M, v) = {x : (x − v)^⊤ M^{−2} (x − v) ≤ 1}, where M^{−2} = U D^{−2} U^⊤ with (D^{−2})_{i,i} = D_{i,i}^{−2}.

Volume Calculations
Let Vol(B) be the volume of the unit ball.
Lemma: if M = UDU^⊤ is positive definite, then Vol(E(M, v)) = det(M) · Vol(B) = (∏_{i=1}^d D_{i,i}) · Vol(B).

Why the Volume Shrinks
Suppose A_t = UD²U^⊤ and define x̃_t = DU^⊤x_t. Then:
A_{t+1} = (d²/(d² − 1)) (A_t − (2/(d + 1)) · A_t x_t x_t^⊤ A_t / (x_t^⊤ A_t x_t))
        = (d²/(d² − 1)) UD (I − (2/(d + 1)) · x̃_t x̃_t^⊤ / ‖x̃_t‖²) DU^⊤
By Sylvester's determinant theorem, det(I + uv^⊤) = 1 + ⟨u, v⟩. Therefore,
det(A_{t+1}) = (d²/(d² − 1))^d · det(D) · det(I − (2/(d + 1)) x̃_t x̃_t^⊤ / ‖x̃_t‖²) · det(D)
             = det(A_t) · (d²/(d² − 1))^d · (1 − 2/(d + 1))

Why the Volume Shrinks (continued)
Since Vol(E_t) = √det(A_t) · Vol(B), we obtain:
Vol(E_{t+1}) / Vol(E_t) = (d²/(d² − 1))^{d/2} · (1 − 2/(d + 1))^{1/2}
                        = (1 + 1/((d − 1)(d + 1)))^{(d−1)/2} · (1 − 1/(d + 1))
                        ≤ e^{(d−1)/(2(d²−1))} · e^{−1/(d+1)} = e^{−1/(2(d+1))},
where we used 1 + a ≤ e^a, which holds for all a ∈ ℝ.
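The reduction factor can also be sanity-checked numerically: for any mistake step, √(det(A_{t+1}) / det(A_t)) should be at most e^{−1/(2(d+1))}. A small self-contained check (the dimension and random seed are arbitrary choices made here):

```python
import numpy as np

d = 5
rng = np.random.default_rng(0)
A = np.eye(d)
x = rng.normal(size=d)                     # any non-zero instance works
Ax = A @ x
xAx = x @ Ax
A_next = (d**2 / (d**2 - 1)) * (A - (2 / (d + 1)) * np.outer(Ax, Ax) / xAx)
ratio = np.sqrt(np.linalg.det(A_next) / np.linalg.det(A))   # = Vol(E_{t+1}) / Vol(E_t)
print(ratio, np.exp(-1 / (2 * (d + 1))))   # the first number is below the second
```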

Why the Volume Cannot Be Too Small
Recall that y_t⟨w*, x_t⟩ > 0 for every t, where w* ∈ G^d is the grid vector realizing the labels. Since the entries of w* and x_t lie on the grid G, it follows that y_t⟨w*, x_t⟩ ≥ 1/n².
Therefore, if ‖w − w*‖ < 1/n² then
y_t⟨w, x_t⟩ = y_t⟨w − w*, x_t⟩ + y_t⟨w*, x_t⟩ ≥ −‖w − w*‖ · ‖x_t‖ + 1/n² > 0.
Convince yourself (by induction) that E_t contains the ball of radius 1/n² centered at w*. It follows that
Vol(B) · (1/n²)^d = Vol(E((1/n²)I, w*)) ≤ Vol(E_t).