Perceptron Mistake Bounds
Mehryar Mohri and Afshin Rostamizadeh
Google Research
Courant Institute of Mathematical Sciences

Abstract. We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm. Our novel bounds generalize beyond standard margin-loss type bounds, allow for any convex and Lipschitz loss function, and admit a very simple proof.

1 Introduction

The Perceptron algorithm belongs to the broad family of on-line learning algorithms (see Cesa-Bianchi and Lugosi [2006] for a survey) and admits a large number of variants. The algorithm learns a linear separator by processing the training sample in an on-line fashion, examining a single example at each iteration [Rosenblatt, 1958]. At each round, the current hypothesis is updated if it makes a mistake, that is, if it incorrectly classifies the new training point processed. The full pseudocode of the algorithm is provided in Figure 1. In what follows, we will assume that $w_0 = 0$ and $\eta = 1$ for simplicity of presentation; the more general case admits similar guarantees, which can be derived following the same methods we present.

This paper briefly surveys some existing mistake bounds for the Perceptron algorithm and introduces new ones which can be used to derive generalization bounds in a stochastic setting. A mistake bound is an upper bound on the number of updates, or the number of mistakes, made by the Perceptron algorithm when processing a sequence of training examples. Here, the bounds will be expressed in terms of the performance of any linear separator, including the best. Such mistake bounds can be directly used to derive generalization guarantees for a combined hypothesis, using existing on-line-to-batch techniques.
2 Separable case

The seminal work of Novikoff [1962] gave the first margin-based bound for the Perceptron algorithm, one of the early results in learning theory and probably one of the first based on the notion of margin. Assuming that the data is separable with some margin $\rho$, Novikoff showed that the number of mistakes made by the Perceptron algorithm can be bounded as a function of the normalized margin $\rho/r$, where $r$ is the radius of the sphere containing the training instances. We start with a lemma that can be used to prove Novikoff's theorem and that will be used throughout.
Perceptron($w_0$)
 1  $w_1 \leftarrow w_0$            ▷ typically $w_0 = 0$
 2  for $t \leftarrow 1$ to $T$ do
 3      Receive($x_t$)
 4      $\hat{y}_t \leftarrow \mathrm{sgn}(w_t \cdot x_t)$
 5      Receive($y_t$)
 6      if ($\hat{y}_t \neq y_t$) then
 7          $w_{t+1} \leftarrow w_t + y_t x_t$    ▷ more generally $w_t + \eta y_t x_t$, $\eta > 0$
 8      else $w_{t+1} \leftarrow w_t$
 9  return $w_{T+1}$

Fig. 1. Perceptron algorithm [Rosenblatt, 1958].

Lemma 1. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing a sequence of training instances $x_1, \ldots, x_T \in \mathbb{R}^N$. Then, the following inequality holds:
$$\Big\| \sum_{t \in I} y_t x_t \Big\| \leq \sqrt{\sum_{t \in I} \|x_t\|^2}.$$

Proof. The inequality holds using the following sequence of observations:
$$\Big\| \sum_{t \in I} y_t x_t \Big\|^2 = \Big\| \sum_{t \in I} (w_{t+1} - w_t) \Big\|^2 \qquad \text{(definition of updates)}$$
$$= \|w_{T+1}\|^2 \qquad \text{(telescoping sum, } w_0 = 0\text{)}$$
$$= \sum_{t \in I} \big( \|w_{t+1}\|^2 - \|w_t\|^2 \big) \qquad \text{(telescoping sum, } w_0 = 0\text{)}$$
$$= \sum_{t \in I} \big( \|w_t + y_t x_t\|^2 - \|w_t\|^2 \big) \qquad \text{(definition of updates)}$$
$$= \sum_{t \in I} \big( \underbrace{2 y_t (w_t \cdot x_t)}_{\leq 0} + \|x_t\|^2 \big) \leq \sum_{t \in I} \|x_t\|^2.$$
The final inequality uses the fact that an update is made at round $t$ only when the current hypothesis makes a mistake, that is, $y_t (w_t \cdot x_t) \leq 0$.

The lemma can be used straightforwardly to derive the following mistake bound for the separable setting.

Theorem 1 ([Novikoff, 1962]). Let $x_1, \ldots, x_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$. Assume that there exist $\rho > 0$ and $v \in \mathbb{R}^N$, $v \neq 0$, such that for all $t \in [1, T]$,
$$\frac{y_t (v \cdot x_t)}{\|v\|} \geq \rho.$$
Then, the number of updates made by the Perceptron algorithm when processing $x_1, \ldots, x_T$ is bounded by $r^2/\rho^2$.

Proof. Let I denote the subset of the $T$ rounds at which there is an update, and let $M$ be the total number of updates, i.e., $|I| = M$. Summing up the margin inequalities yields:
$$M\rho \leq \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|} \leq \Big\| \sum_{t \in I} y_t x_t \Big\| \leq \sqrt{\sum_{t \in I} \|x_t\|^2} \leq \sqrt{M}\, r,$$
where the second inequality holds by the Cauchy-Schwarz inequality, the third by Lemma 1, and the final one by assumption. Comparing the left- and right-hand sides gives $\sqrt{M} \leq r/\rho$, that is, $M \leq r^2/\rho^2$.

3 Non-separable case

In real-world problems, the training sample processed by the Perceptron algorithm is typically not linearly separable. Nevertheless, it is possible to give a margin-based mistake bound in that general case in terms of the radius of the sphere containing the sample and the margin-based loss of an arbitrary weight vector. We present two different types of bounds: first, a bound that depends on the $L_1$-norm of the vector of $\rho$-margin hinge losses, or of the vector of more general losses that we will describe; next, a bound that depends on the $L_2$-norm of the vector of margin losses, which extends the original results presented by Freund and Schapire [1999].

3.1 $L_1$-norm mistake bounds

We first present a simple proof of a mistake bound for the Perceptron algorithm that depends on the $L_1$-norm of the losses incurred by an arbitrary weight vector, for a general definition of the loss function that covers the $\rho$-margin hinge loss. The family of admissible loss functions is quite general and defined as follows.

Definition 1 ($\gamma$-admissible loss function). A $\gamma$-admissible loss function $\phi_\gamma \colon \mathbb{R} \to \mathbb{R}$ satisfies the following conditions:
1. The function $\phi_\gamma$ is convex.
2. $\phi_\gamma$ is non-negative: $\forall x \in \mathbb{R},\ \phi_\gamma(x) \geq 0$.
3. At zero, $\phi_\gamma$ is strictly positive: $\phi_\gamma(0) > 0$.
4. $\phi_\gamma$ is $\gamma$-Lipschitz: $|\phi_\gamma(x) - \phi_\gamma(y)| \leq \gamma |x - y|$, for some $\gamma > 0$.
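As an illustration, Figure 1 (with $w_0 = 0$ and $\eta = 1$) can be sketched in a few lines of Python; the function names below are ours, not from the paper, and the update test uses the mistake condition $y_t(w_t \cdot x_t) \leq 0$ appearing in the proofs (so a point on the decision boundary counts as a mistake).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(points, epochs=1):
    """Perceptron of Figure 1 with w0 = 0, eta = 1.

    Returns the final weight vector and the list of update rounds I,
    numbered consecutively over all epochs."""
    w = [0.0] * len(points[0][0])
    updates = []  # rounds at which an update (mistake) occurred
    for e in range(epochs):
        for t, (x, y) in enumerate(points):
            # update whenever y (w . x) <= 0, i.e. the point is misclassified
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates.append(e * len(points) + t)
    return w, updates
```

On a sample separable by $v = (1, 0)$ with margin $\rho = 1$ and radius $r = 2$, one can check numerically that Lemma 1's inequality holds (with $w_0 = 0$, the final weight vector equals $\sum_{t \in I} y_t x_t$) and that the number of updates never exceeds Novikoff's bound $r^2/\rho^2 = 4$.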
These are mild conditions satisfied by many loss functions, including the hinge loss, the squared hinge loss, the Huber loss, and general $p$-norm losses over bounded domains.

Theorem 2. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any vector $u \in \mathbb{R}^N$ with $\|u\| \leq 1$ and any $\gamma$-admissible loss function $\phi_\gamma$, consider the vector of losses incurred by $u$: $L_{\phi_\gamma}(u) = [\phi_\gamma(y_t(u \cdot x_t))]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\gamma > 0, \|u\| \leq 1} \frac{\|L_{\phi_\gamma}(u)\|_1}{\phi_\gamma(0)} + \frac{\gamma}{\phi_\gamma(0)} \sqrt{\sum_{t \in I} \|x_t\|^2}. \qquad (1)$$
If we further assume that $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$, this implies
$$M_T \leq \inf_{\gamma > 0, \|u\| \leq 1} \left( \frac{\gamma r}{2 \phi_\gamma(0)} + \sqrt{ \frac{\gamma^2 r^2}{4 \phi_\gamma(0)^2} + \frac{\|L_{\phi_\gamma}(u)\|_1}{\phi_\gamma(0)} } \right)^2. \qquad (2)$$

Proof. For all $\gamma > 0$ and $u$ with $\|u\| \leq 1$, the following statements hold. By convexity of $\phi_\gamma$, we have $\frac{1}{M_T} \sum_{t \in I} \phi_\gamma(y_t u \cdot x_t) \geq \phi_\gamma(u \cdot z)$, where $z = \frac{1}{M_T} \sum_{t \in I} y_t x_t$. Then, by the Lipschitz property of $\phi_\gamma$,
$$\phi_\gamma(u \cdot z) = \phi_\gamma(u \cdot z) - \phi_\gamma(0) + \phi_\gamma(0) \geq \phi_\gamma(0) - \gamma |u \cdot z|.$$
Combining the two inequalities above and multiplying both sides by $M_T$ implies
$$M_T \phi_\gamma(0) \leq \sum_{t \in I} \phi_\gamma(y_t u \cdot x_t) + \gamma \Big| \sum_{t \in I} y_t u \cdot x_t \Big|.$$
Finally, using the Cauchy-Schwarz inequality and Lemma 1 yields
$$\Big| \sum_{t \in I} y_t u \cdot x_t \Big| = \Big| u \cdot \sum_{t \in I} y_t x_t \Big| \leq \|u\| \Big\| \sum_{t \in I} y_t x_t \Big\| \leq \sqrt{\sum_{t \in I} \|x_t\|^2},$$
which completes the proof of the first statement after re-arranging terms. If it is further assumed that $\|x_t\| \leq r$ for all $t \in I$, then this implies $M_T \phi_\gamma(0) - \gamma r \sqrt{M_T} - \|L_{\phi_\gamma}(u)\|_1 \leq 0$. Solving this quadratic expression in terms of $\sqrt{M_T}$ proves the second statement.

It is straightforward to see that the $\rho$-margin hinge loss $\phi_\rho(x) = (1 - x/\rho)_+$ is $(1/\rho)$-admissible with $\phi_\rho(0) = 1$ for all $\rho > 0$, which gives the following corollary.

Corollary 1. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+\big]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \|L_\rho(u)\|_1 + \frac{1}{\rho} \sqrt{\sum_{t \in I} \|x_t\|^2}. \qquad (3)$$
If we further assume that $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$, this implies
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{r}{2\rho} + \sqrt{ \frac{r^2}{4\rho^2} + \|L_\rho(u)\|_1 } \right)^2. \qquad (4)$$
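For a concrete check of bound (3), the sketch below runs one pass of the Perceptron ($w_0 = 0$, $\eta = 1$) on a sample that is not linearly separable through the origin, then evaluates the right-hand side of (3) for one competitor $u$ and margin $\rho$; by Corollary 1 the bound must dominate the number of updates for any such choice. The function names are ours, not from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_perceptron(points):
    """Single pass; returns the list I of rounds at which an update occurred."""
    w = [0.0] * len(points[0][0])
    I = []
    for t, (x, y) in enumerate(points):
        if y * dot(w, x) <= 0:  # mistake condition used in the proofs
            w = [wi + y * xi for wi, xi in zip(w, x)]
            I.append(t)
    return I

def l1_hinge_bound(points, I, u, rho):
    """Right-hand side of bound (3): ||L_rho(u)||_1 + sqrt(sum ||x_t||^2) / rho,
    with both the hinge losses and the radius term taken over t in I."""
    hinge = sum(max(0.0, 1.0 - points[t][1] * dot(u, points[t][0]) / rho)
                for t in I)
    radius = math.sqrt(sum(dot(points[t][0], points[t][0]) for t in I))
    return hinge + radius / rho
```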
The mistake bound (3) appears already in Cesa-Bianchi et al. [2004], but we could not find its proof either in that paper or in those it references for this bound.

Another application of Theorem 2 is to the $\rho$-margin squared hinge loss $\phi_\rho(x) = (1 - x/\rho)_+^2$. Assume that $\|x\| \leq r$; then the inequality $|y(u \cdot x)| \leq \|u\| \|x\| \leq r$ implies that the derivative of the squared hinge loss is also bounded, achieving a maximum absolute value of $|\phi'_\rho(-r)| = \frac{2}{\rho}\big(1 + \frac{r}{\rho}\big) \leq \frac{4r}{\rho^2}$ for $\rho \leq r$. Thus, the $\rho$-margin squared hinge loss is $(4r/\rho^2)$-admissible with $\phi_\rho(0) = 1$ for all $\rho > 0$. This leads to the following corollary.

Corollary 2. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$ with $\|x_t\| \leq r$ for all $t \in [1, T]$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-margin squared hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+^2\big]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \|L_\rho(u)\|_1 + \frac{4r}{\rho^2} \sqrt{\sum_{t \in I} \|x_t\|^2}. \qquad (5)$$
This also implies
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{2r^2}{\rho^2} + \sqrt{ \frac{4r^4}{\rho^4} + \|L_\rho(u)\|_1 } \right)^2. \qquad (6)$$
Theorem 2 can be similarly used to derive mistake bounds in terms of other admissible losses.

3.2 $L_2$-norm mistake bounds

The original results of this section are due to Freund and Schapire [1999]. Here, we extend their proof to derive finer mistake bounds for the Perceptron algorithm in terms of the $L_2$-norm of the vector of hinge losses of an arbitrary weight vector at points where an update is made.

Theorem 3. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+\big]_{t \in I}$. Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm can be bounded as follows:
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{\|L_\rho(u)\|_2}{2} + \sqrt{ \frac{\|L_\rho(u)\|_2^2}{4} + \frac{1}{\rho} \sqrt{\sum_{t \in I} \|x_t\|^2} } \right)^2. \qquad (7)$$
If we further assume that $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$, this implies
$$M_T \leq \inf_{\rho > 0, \|u\| \leq 1} \left( \frac{r}{\rho} + \|L_\rho(u)\|_2 \right)^2. \qquad (8)$$
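Before turning to the proof, bound (8) can be checked numerically in the same style as the $L_1$-norm bound: the helper below computes its right-hand side $(r/\rho + \|L_\rho(u)\|_2)^2$ for a candidate $u$ and $\rho$, which must dominate the observed number of updates. This is a sketch with our own names, not code from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_updates(points):
    """Single pass of the Perceptron (w0 = 0, eta = 1); returns update rounds I."""
    w = [0.0] * len(points[0][0])
    I = []
    for t, (x, y) in enumerate(points):
        if y * dot(w, x) <= 0:  # mistake condition used in the proofs
            w = [wi + y * xi for wi, xi in zip(w, x)]
            I.append(t)
    return I

def l2_hinge_bound(points, I, u, rho):
    """Right-hand side of bound (8): (r / rho + ||L_rho(u)||_2)^2,
    with r the radius of the sample and the hinge losses taken over t in I."""
    r = max(math.sqrt(dot(x, x)) for x, _ in points)
    l2 = math.sqrt(sum(
        max(0.0, 1.0 - points[t][1] * dot(u, points[t][0]) / rho) ** 2
        for t in I))
    return (r / rho + l2) ** 2
```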
Proof. We first reduce the problem to the separable case by mapping each input vector $x_t \in \mathbb{R}^N$ to a vector $x'_t \in \mathbb{R}^{N+T}$ as follows:
$$x_t = [x_{t,1}, \ldots, x_{t,N}]^\top \;\mapsto\; x'_t = [x_{t,1}, \ldots, x_{t,N}, 0, \ldots, 0, \underbrace{\Delta}_{(N+t)\text{th component}}, 0, \ldots, 0]^\top,$$
where the first $N$ components of $x'_t$ coincide with those of $x_t$ and the only other non-zero component is the $(N+t)$th component, which is set to $\Delta$, a parameter whose value will be determined later. Define $l_t$ by $l_t = \big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+$. Then, the vector $u$ is replaced by the vector $u'$ defined by
$$u' = \Big[ \frac{u_1}{Z}, \ldots, \frac{u_N}{Z}, \frac{y_1 l_1 \rho}{\Delta Z}, \ldots, \frac{y_T l_T \rho}{\Delta Z} \Big]^\top.$$
The first $N$ components of $u'$ are equal to the components of $u/Z$ and the remaining $T$ components are functions of the labels and hinge losses. The normalization factor $Z$ is chosen to guarantee that $\|u'\| = 1$: $Z = \sqrt{1 + \rho^2 \|L_\rho(u)\|_2^2 / \Delta^2}$.

Since the additional coordinates of the instances are non-zero exactly once, the predictions made by the Perceptron algorithm for $x'_t$, $t \in [1, T]$, coincide with those made in the original space for $x_t$, $t \in [1, T]$. In particular, a change made to the additional coordinates of $w$ does not affect any subsequent prediction. Furthermore, by definition of $u'$ and $x'_t$, we can write for any $t \in I$:
$$y_t (u' \cdot x'_t) = \frac{y_t (u \cdot x_t)}{Z} + \frac{l_t \rho}{Z} \geq \frac{y_t (u \cdot x_t)}{Z} + \frac{\rho - y_t(u \cdot x_t)}{Z} = \frac{\rho}{Z},$$
where the inequality results from the definition of $l_t$. Summing up these inequalities for all $t \in I$ and using Lemma 1 yields
$$M_T \frac{\rho}{Z} \leq \Big\| \sum_{t \in I} y_t x'_t \Big\| \leq \sqrt{\sum_{t \in I} \|x'_t\|^2} = \sqrt{R^2 + M_T \Delta^2},$$
where $R^2 = \sum_{t \in I} \|x_t\|^2$. Substituting the value of $Z$ and squaring both sides implies
$$M_T^2 \rho^2 \leq \Big(1 + \frac{\rho^2 \|L_\rho(u)\|_2^2}{\Delta^2}\Big)\big(R^2 + M_T \Delta^2\big) = R^2 + M_T \Delta^2 + \frac{\rho^2 \|L_\rho(u)\|_2^2 R^2}{\Delta^2} + M_T \rho^2 \|L_\rho(u)\|_2^2.$$
Now, choosing $\Delta^2 = \rho \|L_\rho(u)\|_2 R / \sqrt{M_T}$ to minimize this bound further simplifies it to
$$M_T^2 \rho^2 \leq \big( R + \sqrt{M_T}\, \rho \|L_\rho(u)\|_2 \big)^2, \quad \text{that is,} \quad M_T - \sqrt{M_T}\, \|L_\rho(u)\|_2 - \frac{R}{\rho} \leq 0.$$
Solving this second-degree inequality in $\sqrt{M_T}$ proves the first statement of the theorem. The second statement is obtained by first bounding $R$ by $r\sqrt{M_T}$ and then solving the resulting second-degree inequality.

3.3 Discussion

One natural question this survey raises is the respective quality of the $L_1$- and $L_2$-norm bounds. The comparison of (4) and (8) for the $\rho$-margin hinge loss shows that, for a fixed $\rho$, the bounds differ only in the following two quantities:
$$\min_{\|u\| \leq 1} \|L_\rho(u)\|_1 = \min_{\|u\| \leq 1} \sum_{t \in I} \Big(1 - \frac{y_t(u \cdot x_t)}{\rho}\Big)_+ \quad \text{versus} \quad \min_{\|u\| \leq 1} \|L_\rho(u)\|_2^2 = \min_{\|u\| \leq 1} \sum_{t \in I} \Big(1 - \frac{y_t(u \cdot x_t)}{\rho}\Big)_+^2.$$
These two quantities are data-dependent and in general not comparable. For a vector $u$ for which the individual losses $\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+$ are all at most one, we have $\|L_\rho(u)\|_2^2 \leq \|L_\rho(u)\|_1$, while the contrary holds if the individual losses are all larger than one.

4 Generalization Bounds

In this section, we consider the case where the training sample processed is drawn according to some distribution D. Under some mild conditions on the loss function, the hypotheses returned by an on-line learning algorithm can then be combined to define a hypothesis whose generalization error can be bounded in terms of its regret. Such a hypothesis can be determined via cross-validation [Littlestone, 1989] or using the online-to-batch theorem of Cesa-Bianchi et al. [2004]. The latter can be combined with any of the mistake bounds presented in the previous section to derive generalization bounds for the Perceptron predictor. Given $\delta > 0$, a sequence of labeled examples $(x_1, y_1), \ldots, (x_T, y_T)$, a sequence of hypotheses $h_1, \ldots, h_T$, and a loss function L, define the penalized risk minimizing hypothesis as $\hat{h} = h_{\hat{\imath}}$ with
$$\hat{\imath} = \operatorname*{argmin}_{i \in [1, T]} \frac{1}{T - i + 1} \sum_{t=i}^{T} L(y_t h_i(x_t)) + \sqrt{\frac{1}{2(T - i + 1)} \log \frac{T(T+1)}{\delta}}.$$
The following theorem gives a bound on the expected loss of $\hat{h}$ on future examples.

Theorem 4 (Cesa-Bianchi et al. [2004]). Let S be a labeled sample $((x_1, y_1), \ldots, (x_T, y_T))$ drawn i.i.d.
according to D, L a loss function bounded by one, and $h_1, \ldots, h_T$ the sequence of hypotheses generated by an on-line algorithm A sequentially processing S. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:
$$\operatorname*{E}_{(x,y) \sim D} [L(y \hat{h}(x))] \leq \frac{1}{T} \sum_{i=1}^{T} L(y_i h_i(x_i)) + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}. \qquad (9)$$
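The selection of the penalized risk minimizing hypothesis can be sketched as follows. The shape of the confidence penalty is our reconstruction of the Cesa-Bianchi et al. [2004] term from the definition above, and the function name and loss-matrix layout are ours.

```python
import math

def penalized_risk_index(losses, delta):
    """Select the penalized risk minimizing hypothesis index.

    losses[i][t] holds L(y_{t+1} h_{i+1}(x_{t+1})) with values in [0, 1];
    hypothesis h_{i+1} is scored on the remaining examples t = i, ..., T-1
    (0-indexed), i.e. those it was not trained on."""
    T = len(losses)
    best, best_val = 0, float("inf")
    for i in range(T):
        n = T - i  # number of held-out evaluation points for h_{i+1}
        emp = sum(losses[i][i:]) / n
        # confidence penalty: grows as fewer evaluation points remain
        pen = math.sqrt(math.log(T * (T + 1) / delta) / (2 * n))
        if emp + pen < best_val:
            best, best_val = i, emp + pen
    return best
```

The penalty term trades off empirical accuracy against the reliability of the estimate: a late hypothesis is evaluated on few points, so it must beat earlier hypotheses by a margin to be selected.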
KernelPerceptron($\alpha_0$)
 1  $\alpha \leftarrow \alpha_0$            ▷ typically $\alpha_0 = 0$
 2  for $t \leftarrow 1$ to $T$ do
 3      Receive($x_t$)
 4      $\hat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s K(x_s, x_t)\big)$
 5      Receive($y_t$)
 6      if ($\hat{y}_t \neq y_t$) then
 7          $\alpha_t \leftarrow \alpha_t + 1$
 8  return $\alpha$

Fig. 2. Kernel Perceptron algorithm for PDS kernel K.

Note that this theorem does not require the loss function to be convex. Thus, if L is the zero-one loss, then the empirical loss term is precisely the average number of mistakes made by the algorithm. Plugging in any of the mistake bounds from the previous sections then gives us a learning guarantee with respect to the performance of the best hypothesis as measured by a margin loss (or any $\gamma$-admissible loss if using Theorem 2). Let $\hat{w}$ denote the weight vector corresponding to the penalized risk minimizing Perceptron hypothesis chosen from all the intermediate hypotheses generated by the algorithm. Then, in view of Theorem 2, the following corollary holds.

Corollary 3. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any vector $u \in \mathbb{R}^N$ with $\|u\| \leq 1$ and any $\gamma$-admissible loss function $\phi_\gamma$, consider the vector of losses incurred by $u$: $L_{\phi_\gamma}(u) = [\phi_\gamma(y_t(u \cdot x_t))]_{t \in I}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following generalization bound holds for the penalized risk minimizing Perceptron hypothesis $\hat{w}$:
$$\Pr_{(x,y) \sim D} [y(\hat{w} \cdot x) < 0] \leq \inf_{\gamma > 0, \|u\| \leq 1} \frac{\|L_{\phi_\gamma}(u)\|_1}{\phi_\gamma(0)\, T} + \frac{\gamma}{\phi_\gamma(0)\, T} \sqrt{\sum_{t \in I} \|x_t\|^2} + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.$$
Any $\gamma$-admissible loss can be used to derive a more explicit form of this bound in special cases, in particular the hinge loss or the squared hinge loss. Using Theorem 3, we obtain the following $L_2$-norm generalization bound.

Corollary 4. Let I denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T \in \mathbb{R}^N$. For any $\rho > 0$ and any $u \in \mathbb{R}^N$ with $\|u\| \leq 1$, consider the vector of $\rho$-hinge losses incurred by $u$: $L_\rho(u) = \big[\big(1 - \frac{y_t(u \cdot x_t)}{\rho}\big)_+\big]_{t \in I}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following generalization bound holds for the penalized risk minimizing Perceptron hypothesis $\hat{w}$:
$$\Pr_{(x,y) \sim D} [y(\hat{w} \cdot x) < 0] \leq \inf_{\rho > 0, \|u\| \leq 1} \frac{1}{T} \left( \frac{\|L_\rho(u)\|_2}{2} + \sqrt{ \frac{\|L_\rho(u)\|_2^2}{4} + \frac{1}{\rho} \sqrt{\sum_{t \in I} \|x_t\|^2} } \right)^2 + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.$$

5 Kernel Perceptron algorithm

The Perceptron algorithm of Figure 1 can be straightforwardly extended to define a non-linear separator using a positive definite symmetric kernel K [Aizerman et al., 1964]. Figure 2 gives the pseudocode of that algorithm, known as the kernel Perceptron algorithm. The classifier $\mathrm{sgn}(h)$ learned by the algorithm is defined by $h \colon x \mapsto \sum_{t=1}^{T} \alpha_t y_t K(x_t, x)$. The results of the previous sections apply similarly to the kernel Perceptron algorithm, with $\|x_t\|^2$ replaced by $K(x_t, x_t)$. In particular, the quantity $\sum_{t \in I} \|x_t\|^2$ appearing in several of the learning guarantees can be replaced with the familiar trace $\mathrm{Tr}[K]$ of the kernel matrix $K = [K(x_i, x_j)]_{i,j \in I}$ over the set of points at which an update is made, a standard term appearing in margin bounds for kernel-based hypothesis sets.
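A direct transcription of Figure 2 in Python might look as follows (our naming; single pass, $\alpha_0 = 0$, with $\mathrm{sgn}(0)$ taken as $+1$). With a linear kernel it reduces to the Perceptron of Figure 1, which makes for an easy sanity check.

```python
def kernel_perceptron(data, K):
    """Kernel Perceptron of Figure 2 with alpha0 = 0.

    data is a list of (x_t, y_t) pairs; K is a PDS kernel.
    Returns the vector alpha of per-example update counts."""
    T = len(data)
    alpha = [0.0] * T
    for t, (x_t, y_t) in enumerate(data):
        # only alpha_s with s < t can be non-zero at this point in the pass
        score = sum(alpha[s] * data[s][1] * K(data[s][0], x_t) for s in range(T))
        y_hat = 1 if score >= 0 else -1  # sgn, with sgn(0) taken as +1
        if y_hat != y_t:
            alpha[t] += 1.0
    return alpha

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))
```

The learned classifier is then $x \mapsto \mathrm{sgn}\big(\sum_t \alpha_t y_t K(x_t, x)\big)$; substituting `linear_kernel` recovers exactly the mistakes the primal algorithm of Figure 1 would make on the same sequence.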
Bibliography

Mark A. Aizerman, E. M. Braverman, and Lev I. Rozonoèr. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.
Nick Littlestone. From on-line to batch learning. In COLT, pages 269-284, 1989.
Albert B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622, 1962.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationLogistic regression and linear classifiers COMS 4771
Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random
More informationCOMS 4771 Lecture Boosting 1 / 16
COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?
More informationAdvanced Machine Learning
Advanced Machine Learning Learning with Large Expert Spaces MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Problem Learning guarantees: R T = O( p T log N). informative even for N very large.
More informationMachine Learning: The Perceptron. Lecture 06
Machine Learning: he Perceptron Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu 1 McCulloch-Pitts Neuron Function 0 1 w 0 activation / output function 1 w 1 w w
More informationLearning with Large Number of Experts: Component Hedge Algorithm
Learning with Large Number of Experts: Component Hedge Algorithm Giulia DeSalvo and Vitaly Kuznetsov Courant Institute March 24th, 215 1 / 3 Learning with Large Number of Experts Regret of RWM is O( T
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationWorst-Case Analysis of the Perceptron and Exponentiated Update Algorithms
Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x))
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard and Mitch Marcus (and lots original slides by
More informationPerceptron (Theory) + Linear Regression
10601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron (Theory) Linear Regression Matt Gormley Lecture 6 Feb. 5, 2018 1 Q&A
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationExponential Weights on the Hypercube in Polynomial Time
European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts
More informationAdvanced Machine Learning
Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:
More informationLecture 16: Perceptron and Exponential Weights Algorithm
EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew
More informationOnline Learning with Abstention. Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 2016a).
+ + + + + + Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 206a). A. Further Related Work Learning with abstention is a useful paradigm in applications where the cost
More informationDeep Boosting. Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) COURANT INSTITUTE & GOOGLE RESEARCH.
Deep Boosting Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Ensemble Methods in ML Combining several base classifiers
More informationTutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning
Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret
More informationName (NetID): (1 Point)
CS446: Machine Learning (D) Spring 2017 March 16 th, 2017 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationBoosting Ensembles of Structured Prediction Rules
Boosting Ensembles of Structured Prediction Rules Corinna Cortes Google Research 76 Ninth Avenue New York, NY 10011 corinna@google.com Vitaly Kuznetsov Courant Institute 251 Mercer Street New York, NY
More informationOn a Theory of Learning with Similarity Functions
Maria-Florina Balcan NINAMF@CS.CMU.EDU Avrim Blum AVRIM@CS.CMU.EDU School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891 Abstract Kernel functions have become
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More informationName (NetID): (1 Point)
CS446: Machine Learning Fall 2016 October 25 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains four
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More informationLearning Methods for Linear Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationOnline prediction with expert advise
Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationThe FTRL Algorithm with Strongly Convex Regularizers
CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked
More informationOnline Learning with Abstention
Corinna Cortes 1 Giulia DeSalvo 1 Claudio Gentile 1 2 Mehryar Mohri 3 1 Scott Yang 4 Abstract We present an extensive study of a key problem in online learning where the learner can opt to abstain from
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October
More informationRegression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011!
Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011! 1 Todayʼs topics" Learning from Examples: brief review! Univariate Linear Regression! Batch gradient descent! Stochastic gradient
More informationOptimal Teaching for Online Perceptrons
Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,
More information