Mistake Bound Model, Halving Algorithm, Linear Classifiers, & Perceptron

Size: px

Start display at page:

Download "Mistake Bound Model, Halving Algorithm, Linear Classifiers, & Perceptron"

Sheila Lawson
5 years ago
Views:

1 Stat 928: Statistical Learning Theory Lecture: 18 Mistake Bound Model, Halving Algorithm, Linear Classifiers, & Perceptron Instructor: Sham Kakade 1 Introduction This course will be divided into 2 parts. In each part we will make different assumptions about the data generating process: Online Learning No assumptions about data generating process. Worst case analysis. Fundamental connections to Game Theory. Statistical Learning Assume data consists of independently and identically distributed examples drawn according to some fixed but unknown distribution. Our examples will come from some space X Y. Given a data set {(x t, y t )} T (X Y) T, our goal is to predict y T +1 for a new point x T +1. A hypothesis is simply a function h : X Y. Sometimes, a hypothesis will map to a set D (for decision space) larger than Y. Depending on the nature of the set Y, we get special cases of the general prediction problem. Binary classification Y = { 1, +1} Multiclass classification Y = {1, 2,..., K} =: [K] for K 3 Regression Y = [ B, B] or Y = R A set of hypotheses is often called a hypotheses class. If the range of a hypothesis is { 1, +1} (or {0, 1}) then it also called a concept. A concept can be identified with the subset of X on which it is 1. 2 Mistake Bound Model In this model, learning proceeds in rounds, as we see examples one by one. Suppose Y = { 1, +1}. At the beginning of round t, the learning algorithm A has the hypothesis h t. In round t, we see x t and predict h t (x t ). At the end of the round, y t is revealed and A makes a mistake if h t (x t ) y t. The algorithm then updates its hypothesis to h t+1 and this continues till time T. Suppose the labels were actually produced by some function f in a given concept class C. Then it is natural to bound the total number of mistakes the learner commits, no matter how long the sequence. To this end, define 1

2 mistake(a, C) := max f C,T,x 1:T 1 [h t (x t ) f(x t )]. We can now define what it means for an algorithm to learn a class in the mistake bound model. Definition 2.1. An algorithm A learns a class C with mistake bound M iff mistake(a, C) M. Note that we are ignoring efficiency issues here. We have not said anything about the amount of computation A has to do in each round in order to update its hypothesis from h t to h t+1. Setting this issue aside for a moment, we have a remarkably simple algorithm HALVING (C) that has a mistake bound of lg( C ) for any finite concept class C. For a finite set H of hypotheses, define the hypothesis majority (H) as follows, { +1 {h H h(x) = +1} H /2, majority (H) (x) := 1 otherwise. Algorithm 1 HALVING (C) C 1 C h 1 majority (C 1 ) for t = 1 to T do Receive x t Predict h t (x t ) Receive y t C t+1 {f C t f(x t ) = y t } h t+1 majority (C t+1 ) end for Theorem 2.2. For any finite concept class C, we have mistake(halving (C), C)) lg C. Proof. The key idea is that if the algorithm makes a mistake then at least half of the hypothesis in C t are eliminated. Formally, h t (x t ) y t C t+1 C t /2. Therefore, denoting the number of mistakes up to time t by M t, we have M t := 1 [h t (x t ) y t ], C t+1 C 1 C = (1) Mt 2 2 Mt Since there is an f C which perfectly classifies all x t, we also have Combining (1) and (2), we have which gives M t lg( C ). 1 C t+1. (2) 1 C 2 Mt, 2

3 3 Linear Classifiers and Margin Let us now look at a concrete example of a concept class. Suppose X = R d and we have a vector w R d. We define the hypothesis, h w (x) = sgn(w x), where sgn(z) = 2 1 [z 0] 1 gives the sign of z. With some abuse of terminology, we will often speak of the hypothesis w when we actually mean the hypothesis h w. The class of linear classifiers in the (uncountable) concept class C lin := { h w w R d }. Note that w and cw yield the same linear classifier for any c > 0. Suppose we have a data set that is linearly separable. That is, there is a w such that, t [T ], y t = sgn(w x t ). (3) Separability means that y t (w x t ) > 0 for all t. The minimum value of this quantity over the data set is referred to as the margin. Let us make the assumption that the margin is at least for some > 0. Assumption M. There exists a w R d for which (3) holds. Further assume that for some > 0. min y t(w x t ), (4) t [T ] Define x 1:t := max t [T ] x t. We now show that under Assumption M, our simple halving algorithm can be used with a suitable finite subset of C lin to derive a mistake bound. Let W be those w such that w i is of the form m/2 x 1:T d for some m { 2 x 1:t w d/,..., 1, 0, +1,..., 2 x 1:t w d/ }. In other words, since each coordinate of w is in the range [ w, w ], we have discretized that interval at a scale of /2 x 1:T d. We want to run the halving algorithm on the (finite) concept class, C lin := {h w w W }. ( The size of this class is 4 x1:t w d d. + 1) Note that there exists a w W such that, Thus, we have, for any t [T ], i [d], w i w i /2 x 1:T d. y t ( w x t ) y t (w x t ) = w x t w x t d w i w i x t,i i=1 d i=1 /2. 2 x 1:T d x t This, together with Assumption R, implies that y t ( w x t ) /2 > 0. Thus, there exists a hypothesis in C lin that classifies the data set perfectly. Theorem 2.2 immediately gives the following corollary. 3

4 Corollary 3.1. Under Assumption M, HALVING (C lin ) makes at most ( 4d x1:t w ) d lg + 1 mistakes. This bound is nice because even though we had an uncountable concept class to begin with, the margin assumption allowed us to work with a finite subset of the concept class and we were able to derive a mistake bound. However, the result is unsatisfactory because running the halving algorithm on C lin is extremely inefficient. One might wonder if one can use the special structure of the space of linear classifiers to implement the halving algorithm more efficiently. Indeed, it possible to implement a variant of the halving algorithm efficiently using the ellipsoid method developed for the linear programming feasibility problem. Note that the mistake bound depends explicitly on the dimension d of the problem. We would also like to be able to give a dimension independent mistake bound. Indeed, a classic algorithm called PERCEPTRON has such a mistake bound. 4 The Perceptron Algorithm Algorithm 2 PERCEPTRON w 1 0 for t = 1 to T do Receive x t R d Predict sgn(w t x t ) Receive y t { 1, +1} if sgn(w t x t ) y t then w t+1 w t + y t x t else w t+1 w t end if end for The following theorem gives a dimension independent bound on the number of mistakes the PERCEPTRON algorithm makes. Theorem 4.1. Suppose Assumption M holds. Let M T := 1 [sgn(w t x t ) y t ] denote the number of mistakes the PERCEPTRON algorithm makes. Then we have, M T x 1:T 2 w 2 2. Proof. The key idea of the proof is to look at how the quantity w w t evolves over time. We first provide an lower bound for it. Define m t = 1 [sgn(w t x t ) y t ]. Note that w t+1 = w t + y t x t m t and M T = t m t. We have, w w t+1 = w w t + y t x t m t = w w t + y t (w x t )m t w w t + m t. (Assumption M) 4

5 Unwinding the recursion, we get Now, we use Cauchy-Schwarz inequality to get the upper bound, Moreover, w w T +1 w w 1 + M T = M T. (5) w w T +1 w w T +1. (6) w t+1 2 = w t + y t x t m t 2 = w t 2 + 2y t (w t x t )m t + x t 2 m t w t x 1:T 2 m t, where the last step follows because y t (w t x t ) < 0 when a mistake is made and x t x 1:T. Unwinding the recursion once again, we get, Combining (5), (6) and (7) gives, This implies that M T w 2 x 1:T 2 / 2. w T +1 2 w x 1:T 2 M T = x 1:T 2 M T. (7) M T w w T +1 w w T +1 w x 1:T M T. 5

Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri

Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri Online Learning Jordan Boyd-Graber University of Colorado Boulder LECTURE 21 Slides adapted from Mohri Jordan Boyd-Graber Boulder Online Learning 1 of 31 Motivation PAC learning: distribution fixed over