Efficient Online Bandit Multiclass Learning with Õ(√T) Regret


Alina Beygelzimer¹  Francesco Orabona²  Chicheng Zhang³

¹Yahoo Research, New York, NY  ²Stony Brook University, Stony Brook, NY  ³University of California, San Diego, La Jolla, CA. Correspondence to: Alina Beygelzimer <beygel@yahooinc.com>, Francesco Orabona <francesco@orabona.com>, Chicheng Zhang <chichengzhang@ucsd.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We present an efficient second-order algorithm with Õ((1/η)√T)¹ regret for the bandit online multiclass problem. The regret bound holds simultaneously with respect to a family of loss functions parameterized by η, for a range of η restricted by the norm of the competitor. The family of loss functions ranges from hinge loss (η = 0) to squared hinge loss (η = 1). This provides a solution to the open problem of (Abernethy, J. and Rakhlin, A. An efficient bandit algorithm for √T-regret in online multiclass prediction? In COLT, 2009). We test our algorithm experimentally, showing that it also performs favorably against earlier algorithms.

¹ Õ(·) hides logarithmic factors.

1. Introduction

In the online multiclass classification problem, the learner must repeatedly classify examples into one of K classes. At each step t, the learner observes an example x_t ∈ R^d and predicts its label ỹ_t ∈ [K]. In the full-information case, the learner observes the true label y_t ∈ [K] and incurs loss 1[ỹ_t ≠ y_t]. In the bandit version of this problem, first considered in Kakade et al. (2008), the learner only observes its incurred loss 1[ỹ_t ≠ y_t], i.e., whether or not its prediction was correct. Bandit multiclass learning is a special case of the general contextual bandit learning (Langford & Zhang, 2008) where exactly one of the losses is 0 and all other losses are 1 in every round.

The goal of the learner is to minimize its regret with respect to the best predictor in some reference class of predictors, that is, the difference between the total number of mistakes the learner makes and the total number of mistakes of the best predictor in the class. Kakade et al. (2008) proposed a bandit modification of the Multiclass Perceptron algorithm (Duda & Hart, 1973), called the Banditron, that uses a reference class of linear predictors. Note that even in the full-information setting, it is difficult to provide a true regret bound. Instead, performance bounds are typically expressed in terms of the total multiclass hinge loss of the best linear predictor, a tight upper bound on the 0-1 loss.

The Banditron, while computationally efficient, achieves only O(T^{2/3}) expected regret with respect to this loss, where T is the number of rounds. This is suboptimal, as the Exp4 algorithm of Auer et al. (2003) can achieve Õ(√T) regret for the 0-1 loss, albeit very inefficiently. Abernethy & Rakhlin (2009) posed an open problem: Is there an efficient bandit multiclass learning algorithm that achieves expected regret of Õ(√T) with respect to any reasonable loss function?

The first attempt to solve this open problem was by Crammer & Gentile (2013). Using a stochastic assumption about the mechanism generating the labels, they were able to show an Õ(√T) regret with a second-order algorithm. Later, Hazan & Kale (2011), following a suggestion by Abernethy & Rakhlin (2009), proposed the use of the log-loss coupled with a softmax prediction. The softmax depends on a parameter that controls the smoothing factor.
The value of this parameter determines the exp-concavity of the loss, allowing Hazan & Kale (2011) to prove worst-case regret bounds that range between O(log T) and O(T^{2/3}), again with a second-order algorithm. However, the choice of the smoothing factor in the loss becomes critical in obtaining strong bounds.

The original Banditron algorithm has also been extended in many ways. Wang et al. (2010) proposed a variant based on the exponentiated gradient algorithm (Kivinen & Warmuth, 1997). Valizadegan et al. (2011) proposed different strategies to adapt the exploration rate to the data in the Banditron algorithm. However, these algorithms suffer from the same theoretical shortcomings as the Banditron.

There has been significant recent focus on developing efficient algorithms for the general contextual bandit problem (Dudík et al., 2011; Agarwal et al., 2014; Rakhlin & Sridharan, 2016; Syrgkanis et al., 2016a;b).

While these works solve a more general problem that does not make assumptions on the structure of the reward vector or the policy class, their results assume that contexts or context/reward pairs are generated i.i.d., or that the contexts to arrive are known beforehand, which we do not assume here.

In this paper, we follow a different route. Instead of designing an ad-hoc loss function that allows us to prove strong guarantees, we propose an algorithm that simultaneously satisfies a regret bound with respect to all the loss functions in a family of functions that are tight upper bounds to the 0-1 loss. The algorithm, named Second Order Banditron Algorithm (SOBA), is efficient and based on the second-order Perceptron algorithm (Cesa-Bianchi et al., 2005). The regret bound is of the order of Õ(√T), providing a solution to the open problem of Abernethy & Rakhlin (2009).

2. Definitions and Settings

We first introduce our notation. Denote the rows of a matrix V ∈ R^{K×d} by v_1, v_2, ..., v_K. The vectorization of V is defined as vec(V) = [v_1, v_2, ..., v_K], which is a vector in R^{Kd}. We define the reverse operation of reshaping a Kd × 1 vector into a K × d matrix by mat(V), using a row-major order. To simplify notation, we will use V and vec(V) interchangeably throughout the paper. For matrices A and B, denote by A ⊗ B their Kronecker product. For matrices X and Y of the same dimension, denote by ⟨X, Y⟩ = Σ_{i,j} X_{i,j} Y_{i,j} their inner product. We use ‖·‖ to denote the l₂ norm of a vector, and ‖·‖_F to denote the Frobenius norm of a matrix. For a positive definite matrix A, we use ‖x‖_A = √⟨x, Ax⟩ to denote the Mahalanobis norm of x with respect to A. We use 1 to denote the vector in R^K whose entries are all 1s. We use E_{t−1}[·] to denote the conditional expectation given the observations up to time t−1 and x_t, y_t, that is, x_1, y_1, ỹ_1, ..., x_{t−1}, y_{t−1}, ỹ_{t−1}, x_t, y_t. Let [K] denote {1, ..., K}, the set of possible labels. In our setting, learning proceeds in rounds:

For t = 1, 2, ..., T:
1. The adversary presents an example x_t ∈ R^d to the learner, and commits to a hidden label y_t ∈ [K].
2. The learner predicts a label ỹ_t ~ p_t, where p_t ∈ Δ^{K−1} is a probability distribution over [K].
3. The learner receives the bandit feedback 1[ỹ_t ≠ y_t].

The goal of the learner is to minimize the total number of mistakes, M = Σ_{t=1}^T 1[ỹ_t ≠ y_t]. We will use linear predictors specified by a matrix W ∈ R^{K×d}. The prediction is given by W(x) = argmax_{i∈[K]} (Wx)_i, where (Wx)_i is the i-th element of the vector Wx, corresponding to class i. A useful notion to measure the performance of a competitor U ∈ R^{K×d} is the multiclass hinge loss

ℓ(U, (x, y)) := max_{i≠y} [1 − (Ux)_y + (Ux)_i]_+ ,    (1)

where [·]_+ = max(·, 0).

3. A History of Loss Functions

As outlined in the introduction, a critical choice in obtaining strong theoretical guarantees is the choice of the loss function. In this section we introduce and motivate a family of multiclass loss functions. In the full information setting, strong binary and multiclass mistake bounds are obtained through the use of the Perceptron algorithm (Rosenblatt, 1958). A common misunderstanding of the Perceptron algorithm is that it corresponds to a gradient descent procedure with respect to the (binary or multiclass) hinge loss. However, it is well known that the Perceptron simultaneously satisfies mistake bounds that depend on the cumulative hinge loss and also on the cumulative squared hinge loss; see for example Mohri & Rostamizadeh (2013). Note also that the squared hinge loss is not dominated by the hinge loss, so, depending on the data, one loss can be better than the other.
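To make the definitions above concrete, here is a small sketch of the prediction rule and the multiclass hinge loss (1) in Python with NumPy (the function names are ours, for illustration only):

```python
import numpy as np

def predict(W, x):
    # Prediction rule: argmax over the K entries of the score vector W x.
    return int(np.argmax(W @ x))

def multiclass_hinge(U, x, y):
    # Multiclass hinge loss (1): max_{i != y} [1 - (Ux)_y + (Ux)_i]_+
    scores = U @ x
    margin = scores[y] - np.max(np.delete(scores, y))
    return max(1.0 - margin, 0.0)
```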
We show that the Perceptron algorithm satisfies an even stronger mistake bound, with respect to the cumulative loss of any power of the multiclass hinge loss between 1 and 2.

Theorem 1. On any sequence (x_1, y_1), ..., (x_T, y_T) with ‖x_t‖ ≤ X for all t ∈ [T], and any linear predictor U ∈ R^{K×d}, the total number of mistakes M of the multiclass Perceptron satisfies, for any q ∈ [1, 2],

M ≤ M^{1−1/q} (L_{MH,q}(U))^{1/q} + ‖U‖_F X √M,

where L_{MH,q}(U) = Σ_{t=1}^T (ℓ(U, (x_t, y_t)))^q. In particular, it simultaneously satisfies the following two bounds:

M ≤ L_{MH,1}(U) + X² ‖U‖_F² + X ‖U‖_F √(L_{MH,1}(U)),
M ≤ L_{MH,2}(U) + X² ‖U‖_F² + X ‖U‖_F √(L_{MH,2}(U)).

For the proof, see Appendix B. A similar observation was made by Orabona et al. (2012), who proved a logarithmic mistake bound with respect to all loss functions in a similar family of functions smoothly interpolating between the hinge loss and the squared hinge loss. In particular, Orabona et al. (2012) introduced the following family of binary loss functions:

ℓ_η(x) := { (1 − x)(1 − ηx),   x ≤ 1
          { 0,                 x > 1.    (2)
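The following sketch implements this binary family and the multiclass extension (3) defined below (a minimal sketch under our reading of the reconstructed definition (2)):

```python
import numpy as np

def ell_eta(x, eta):
    # Binary loss (2): (1 - x)(1 - eta x) for x <= 1, and 0 for x > 1.
    # eta = 0 gives the hinge loss, eta = 1 the squared hinge loss.
    return (1.0 - x) * (1.0 - eta * x) if x <= 1.0 else 0.0

def ell_eta_multiclass(U, x, y, eta):
    # Multiclass version (3): ell_eta at the margin (Ux)_y - max_{i != y} (Ux)_i.
    scores = U @ x
    return ell_eta(scores[y] - np.max(np.delete(scores, y)), eta)
```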

Figure 1. Plot of the loss functions ℓ_η for different values of η.

where 0 ≤ η ≤ 1. Note that η = 0 recovers the binary hinge loss, and η = 1 recovers the squared hinge loss. Meanwhile, for any 0 ≤ η ≤ 1, ℓ_η(x) ≤ max{ℓ_0(x), ℓ_1(x)}, and ℓ_η is an upper bound on the 0-1 loss: 1[x < 0] ≤ ℓ_η(x). See Figure 1 for a plot of the different functions in the family. Here, we define a multiclass version of the loss in (2) as

ℓ_η(U, (x, y)) := ℓ_η((Ux)_y − max_{i≠y} (Ux)_i).    (3)

Hence, ℓ_0(U, (x, y)) = ℓ(U, (x, y)) is the classical multiclass hinge loss, and ℓ_1(U, (x, y)) = ℓ²(U, (x, y)) is the squared multiclass hinge loss.

Our algorithm has an Õ((1/η)√T) regret bound that holds simultaneously for all loss functions in this family, with η in a range that ensures that (Ux)_i − (Ux)_j ≤ 1/η for all i, j ∈ [K]. We also show that there exists a setting of the parameters of the algorithm that gives a mistake upper bound of Õ((L* T)^{1/3} + √T), where L* is the cumulative hinge loss of the competitor, which is never worse than the best bounds in Kakade et al. (2008).

4. Second Order Banditron Algorithm

This section introduces our algorithm for bandit multiclass online learning, called the Second Order Banditron Algorithm (SOBA), described in Algorithm 1. SOBA makes a prediction using the γ-greedy strategy: at each iteration t, with probability 1 − γ it predicts ŷ_t = argmax_{i∈[K]} (W_t x_t)_i; with the remaining probability γ, it selects a random action in [K]. As discussed in Kakade et al. (2008), randomization is essential for designing bandit multiclass learning algorithms: if we deterministically output a label and make a mistake, then it is hard to make an update, since we do not know the identity of y_t. However, if randomization is used, we can estimate y_t and perform online stochastic mirror descent type updates (Bubeck & Cesa-Bianchi, 2012).

Algorithm 1 Second Order Banditron Algorithm (SOBA)
Input: Regularization parameter a > 0, exploration parameter γ ∈ [0, 1].
1: Initialization: W_1 = 0, A_0 = aI, θ_0 = 0
2: for t = 1, 2, ..., T do
3:   Receive instance x_t ∈ R^d
4:   ŷ_t = argmax_{i∈[K]} (W_t x_t)_i
5:   Define p_t = (1 − γ) e_{ŷ_t} + (γ/K) 1
6:   Randomly sample ỹ_t according to p_t
7:   Receive bandit feedback 1[ỹ_t ≠ y_t]
8:   Initialize update indicator n_t = 0
9:   if ỹ_t = y_t then
10:    ȳ_t = argmax_{i∈[K]\{y_t}} (W_t x_t)_i
11:    g_t = (1/p_{t,y_t}) (e_{ȳ_t} − e_{y_t}) ⊗ x_t
12:    z_t = √(p_{t,y_t}) g_t
13:    m_t = (⟨W_t, z_t⟩² + 2⟨W_t, g_t⟩) / (1 + z_tᵀ A_{t−1}^{−1} z_t)
14:    if m_t + Σ_{s=1}^{t−1} n_s m_s ≥ 0 then
15:      Turn on update indicator: n_t = 1
16:    end if
17:  end if
18:  Update A_t = A_{t−1} + n_t z_t z_tᵀ
19:  Update θ_t = θ_{t−1} − n_t g_t
20:  Set W_{t+1} = mat(A_t^{−1} θ_t)
21: end for
Remark: the matrix A_t is of dimension Kd × Kd, and the vector θ_t is of dimension Kd; in line 20, the matrix-vector multiplication results in a Kd-dimensional vector, which is reshaped to the matrix W_{t+1} of dimension K × d.

SOBA keeps track of two model parameters: the cumulative Perceptron-style updates θ_t = −Σ_{s=1}^t n_s g_s ∈ R^{Kd} and the corrected covariance matrix A_t = aI + Σ_{s=1}^t n_s z_s z_sᵀ ∈ R^{Kd×Kd}. The classifier W_t is computed by matricizing the matrix-vector product A_{t−1}^{−1} θ_{t−1} ∈ R^{Kd}. The weight vector θ_t is standard in designing online mirror descent type algorithms (Shalev-Shwartz, 2011; Bubeck & Cesa-Bianchi, 2012). The matrix A_t is standard in designing online learning algorithms with adaptive regularization (Cesa-Bianchi et al., 2005; Crammer et al., 2009; McMahan & Streeter, 2010; Duchi et al., 2011; Orabona et al., 2015).

The algorithm updates its model (n_t = 1) only when the following conditions hold simultaneously: (1) the predicted label is correct (ỹ_t = y_t), and (2) the cumulative regularized negative margin Σ_{s=1}^{t−1} n_s m_s + m_t stays nonnegative if this update is performed.
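For concreteness, the following is a Python transcription of Algorithm 1 (a sketch under our reconstruction of the pseudocode, maintaining the full (Kd)×(Kd) inverse of A_t for clarity rather than speed; a cheaper rank-one update of the inverse is noted in Subsection 4.1):

```python
import numpy as np

class SOBA:
    def __init__(self, K, d, a=1.0, gamma=0.01, seed=0):
        self.K, self.d, self.gamma = K, d, gamma
        self.A_inv = np.eye(K * d) / a      # inverse of A_t = a I + sum_s n_s z_s z_s^T
        self.theta = np.zeros(K * d)        # theta_t = -sum_s n_s g_s
        self.W = np.zeros((K, d))           # W_t = mat(A_{t-1}^{-1} theta_{t-1})
        self.msum = 0.0                     # running sum of n_s m_s (kept >= 0)
        self.rng = np.random.default_rng(seed)

    def predict(self, x):
        yhat = int(np.argmax(self.W @ x))                   # line 4
        p = np.full(self.K, self.gamma / self.K)
        p[yhat] += 1.0 - self.gamma                         # line 5
        ytilde = int(self.rng.choice(self.K, p=p))          # line 6
        return yhat, ytilde, p

    def update(self, x, ytilde, p, correct):
        if not correct:                      # bandit feedback: only 1[ytilde != y]
            return
        y = ytilde                           # a correct guess reveals the true label
        scores = self.W @ x
        scores[y] = -np.inf
        ybar = int(np.argmax(scores))        # line 10: runner-up class
        e = np.zeros(self.K); e[ybar], e[y] = 1.0, -1.0
        g = np.kron(e, x) / p[y]             # line 11: g_t
        z = np.sqrt(p[y]) * g                # line 12: z_t
        Az = self.A_inv @ z
        denom = 1.0 + z @ Az
        w = self.W.ravel()                   # row-major vec(W_t)
        m = ((w @ z) ** 2 + 2.0 * (w @ g)) / denom          # line 13
        if self.msum + m >= 0:               # line 14: update indicator n_t = 1
            self.msum += m
            self.A_inv -= np.outer(Az, Az) / denom          # rank-one update for line 18
            self.theta -= g                                 # line 19
            self.W = (self.A_inv @ self.theta).reshape(self.K, self.d)  # line 20
```

A round then looks like: yhat, ytilde, p = alg.predict(x); alg.update(x, ytilde, p, correct=(ytilde == y)).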
Note that when the predicted label is correct, we know the identity of the true label. As we shall see, the set of iterations where n_t = 1 includes all iterations where ỹ_t = y_t ≠ ŷ_t. This fact is crucial to the mistake bound analysis.

Furthermore, there are some iterations where ỹ_t = y_t = ŷ_t but we still make an update. This idea is related to online passive-aggressive algorithms (Crammer et al., 2006; 2009) in the full information setting, where the algorithm makes an update even when it predicts correctly but the margin is too small.

Let us now describe our algorithm in more detail. Throughout, suppose all the examples are l₂-bounded: ‖x_t‖ ≤ X. As outlined above, we associate a time-varying regularizer R_t(W) = (1/2)‖W‖²_{A_t}, where A_t = aI + Σ_{s=1}^t n_s z_s z_sᵀ is a Kd × Kd matrix and z_t = √(p_{t,y_t}) g_t = (1/√(p_{t,y_t})) (e_{ȳ_t} − e_{y_t}) ⊗ x_t. Note that this time-varying regularizer is constructed from scaled versions of the updates g_t. This is critical, because in expectation this becomes the correct regularizer. Indeed, it is easy to verify that, for any U ∈ R^{Kd},

E_{t−1}[1[y_t = ỹ_t] g_t] = (e_{ȳ_t} − e_{y_t}) ⊗ x_t,
E_{t−1}[1[y_t = ỹ_t] ⟨U, z_t⟩²] = ⟨U, (e_{ȳ_t} − e_{y_t}) ⊗ x_t⟩².

In words, this means that in expectation the regularizer contains the outer products of the updates, which in turn promote the correct class and demote the wrong one. We stress that it is impossible to get the same result with the estimator proposed in Kakade et al. (2008). Also, the analysis is substantially different from the Online Newton Step approach (Hazan et al., 2007) used in Hazan & Kale (2011).

In reality, we do not make an update in all iterations in which ỹ_t = y_t, since the algorithm needs to maintain the invariant Σ_{s=1}^t m_s n_s ≥ 0, which is crucial to the proof of Lemma 2. Instead, we prove a technical lemma that gives an explicit form for the expected update n_t g_t and the expected regularization n_t z_t z_tᵀ. Define

q_t := 1[Σ_{s=1}^{t−1} n_s m_s + m_t ≥ 0],
h_t := 1[ŷ_t ≠ y_t] + q_t 1[ŷ_t = y_t].

Lemma 1. For any U ∈ R^{Kd},

E_{t−1}[n_t ⟨U, g_t⟩] = −h_t ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩,
E_{t−1}[n_t ⟨U, z_t⟩²] = h_t ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩².

The proof of Lemma 1 is deferred to the end of Subsection 4.1. Our last contribution is to show how our second-order algorithm satisfies a mistake bound for an entire family of loss functions. Finally, we relate the performance of the algorithm that predicts ŷ_t to the γ-greedy algorithm. Putting it all together, we have our expected mistake bound for SOBA.²

Theorem 2. SOBA has the following expected upper bound on the number of mistakes M, for any U ∈ R^{K×d} and any 0 < η ≤ min{1, 1/(2 max_i ‖u_i‖ X + 1)}:

E[M] ≤ L_η(U) + η a ‖U‖_F² + (K/(ηγ)) E[Σ_{t=1}^T n_t z_tᵀ A_t^{−1} z_t] + γT,

where L_η(U) := Σ_{t=1}^T ℓ_η(U, (x_t, y_t)) is the cumulative η-loss of the linear predictor U, and {u_i}_{i=1}^K are the rows of U. In particular, setting γ = O(√(dK² ln T / T)) and a = X², we have

E[M] ≤ L_η(U) + O(η X² ‖U‖_F² + (K/η) √(dT ln T)).

Note that, differently from previous analyses (Kakade et al., 2008; Crammer & Gentile, 2013; Hazan & Kale, 2011), we do not need to assume a bound on the norm of the competitor, as in the full information Perceptron and Second Order Perceptron algorithms. In Appendix A, we also present an adaptive variant of SOBA that sets the exploration rate γ_t dynamically, and achieves a regret bound within a constant factor of that obtained with the optimal tuning of γ.

We prove Theorem 2 in the next subsection, while in Subsection 4.2 we prove a mistake bound with respect to the hinge loss, which is not fully covered by Theorem 2.

4.1. Proof of Theorem 2

Throughout the proofs, U, W_t, g_t, and z_t should be thought of as Kd × 1 vectors. We first show the following lemma. Note that this is a statement over any sequence, and no expectation is taken.

Lemma 2. For any U ∈ R^{Kd}, with the notation of Algorithm 1, we have

Σ_{t=1}^T n_t (−2⟨U, g_t⟩ − ⟨U, z_t⟩²) ≤ a‖U‖_F² + Σ_{t=1}^T n_t g_tᵀ A_t^{−1} g_t.

² Throughout the paper, expectations are taken with respect to the randomization of the algorithm.
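As an aside before the proof, the conditional-expectation identities stated above for g_t and z_t are easy to verify numerically. The following snippet (our own check, not part of the paper) fixes one round and averages over the randomization of ỹ_t:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, gamma = 4, 3, 0.2
x = rng.normal(size=d)
U = rng.normal(size=K * d)
y, ybar, yhat = 2, 0, 2                      # one fixed round, with ybar != y
p = np.full(K, gamma / K); p[yhat] += 1 - gamma

e = np.zeros(K); e[ybar], e[y] = 1.0, -1.0
target = np.kron(e, x)                       # (e_ybar - e_y) tensor x

g_avg, z2_avg, n = 0.0, 0.0, 200000
for _ in range(n):
    if rng.choice(K, p=p) == y:              # indicator 1[y_t = ytilde_t]
        g_avg += (U @ target) / p[y] / n     # <U, g_t> on update rounds
        z2_avg += (U @ target) ** 2 / p[y] / n   # <U, z_t>^2 = p_y <U, g_t>^2

print(g_avg, U @ target)                     # both approx <U, (e_ybar - e_y) tensor x>
print(z2_avg, (U @ target) ** 2)             # both approx the squared inner product
```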

Proof. First, from line 14 of Algorithm 1, it can be seen (by induction) that SOBA maintains the invariant

Σ_{s=1}^t n_s m_s ≥ 0.    (4)

We next reduce the proof to the regret analysis of an online least squares problem. For iterations where n_t = 1, define α_t = 1/√(p_{t,y_t}), so that g_t = α_t z_t. From the algorithm, A_t = aI + Σ_{s=1}^t n_s z_s z_sᵀ, and W_t is the ridge regression solution based on the data collected in times 1 to t−1, i.e. W_t = −A_{t−1}^{−1}(Σ_{s=1}^{t−1} n_s g_s) = −A_{t−1}^{−1}(Σ_{s=1}^{t−1} n_s α_s z_s). By the per-step analysis of online least squares (see, e.g., Orabona et al. (2012); see Lemma 6 for a proof), we have that if an update is made at iteration t, i.e. n_t = 1, then

(1/2)(⟨W_t, z_t⟩ + α_t)² (1 − z_tᵀ A_t^{−1} z_t) − (1/2)(⟨U, z_t⟩ + α_t)² ≤ (1/2)‖U − W_t‖²_{A_{t−1}} − (1/2)‖U − W_{t+1}‖²_{A_t}.

Otherwise n_t = 0, in which case we have W_{t+1} = W_t and A_t = A_{t−1}. Denoting β_t = 1 − z_tᵀ A_t^{−1} z_t, by the Sherman–Morrison formula, β_t = 1/(1 + z_tᵀ A_{t−1}^{−1} z_t). Summing over all rounds t such that n_t = 1,

Σ_{t=1}^T n_t [(⟨W_t, z_t⟩ + α_t)² β_t − (⟨U, z_t⟩ + α_t)²] ≤ ‖U‖²_{A_0} − ‖U − W_{T+1}‖²_{A_T} ≤ a‖U‖_F².

We also have, by the definition of m_t,

(⟨W_t, z_t⟩ + α_t)² β_t − (⟨U, z_t⟩ + α_t)² = m_t − 2⟨U, g_t⟩ − ⟨U, z_t⟩² − α_t² z_tᵀ A_t^{−1} z_t.

Putting it all together, noting that α_t² z_tᵀ A_t^{−1} z_t = g_tᵀ A_t^{−1} g_t, and using the fact that Σ_t n_t m_t ≥ 0, we have the stated bound.

We can now prove the following mistake bound for the prediction ŷ_t, whose number of mistakes is defined as M̂ := Σ_{t=1}^T 1[ŷ_t ≠ y_t].

Theorem 3. For any U ∈ R^{K×d} and any 0 < η ≤ min{1, 1/(2 max_i ‖u_i‖ X + 1)}, the expected number of mistakes committed by ŷ_t can be bounded as

E[M̂] ≤ L_η(U) + η a ‖U‖_F² + (K/(ηγ)) E[Σ_{t=1}^T n_t z_tᵀ A_t^{−1} z_t]
     ≤ L_η(U) + η a ‖U‖_F² + (dK²/(ηγ)) ln(1 + 2TX²/(adK)),

where L_η(U) := Σ_{t=1}^T ℓ_η(U, (x_t, y_t)) is the cumulative η-loss of the linear predictor U.

Proof. Using Lemma 2 with U scaled by c := 2η/(1+η), we get

Σ_{t=1}^T n_t (−2c⟨U, g_t⟩ − c²⟨U, z_t⟩²) ≤ a c² ‖U‖_F² + Σ_{t=1}^T n_t g_tᵀ A_t^{−1} g_t.

Taking expectations, using Lemma 1 and the facts that p_{t,y_t} ≥ γ/K and that A_t is positive definite (so that g_tᵀ A_t^{−1} g_t = α_t² z_tᵀ A_t^{−1} z_t ≤ (K/γ) z_tᵀ A_t^{−1} z_t), we have

E[Σ_t h_t (2c⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩ − c²⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩²)] ≤ a c² ‖U‖_F² + (K/γ) E[Σ_t n_t z_tᵀ A_t^{−1} z_t].

With this choice of c, 2cz − c²z² = (4η/(1+η)²)((1+η)z − ηz²). Add the terms (4η/(1+η)²) E[Σ_t h_t] to both sides and divide both sides by 4η/(1+η)², to have

E[Σ_t h_t] ≤ E[Σ_t h_t f_η(⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩)] + η a ‖U‖_F² + ((1+η)²/(4η)) (K/γ) E[Σ_t n_t z_tᵀ A_t^{−1} z_t],

where f_η(z) := 1 − (1+η)z + ηz² and (1+η)²/4 ≤ 1 for η ≤ 1. Taking a closer look at the function f_η, we observe that the two roots of this quadratic function are 1 and 1/η, respectively. For η ≤ 1, the function is negative in (1, 1/η) and positive in (−∞, 1). Additionally, if 0 < η ≤ 1/(2 max_i ‖u_i‖ X + 1), then for all i, j ∈ [K], ⟨U, (e_i − e_j) ⊗ x_t⟩ ≤ 2 max_i ‖u_i‖ X ≤ 1/η. Therefore, we have that

f_η(⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩) = f_η((Ux_t)_{y_t} − (Ux_t)_{ȳ_t})
  ≤ ℓ_η((Ux_t)_{y_t} − (Ux_t)_{ȳ_t})
  ≤ ℓ_η((Ux_t)_{y_t} − max_{r≠y_t} (Ux_t)_r)    (5)
  = ℓ_η(U, (x_t, y_t)),

where the first equality is from algebra, the first inequality is from f_η(·) ≤ ℓ_η(·) on (−∞, 1/η], and the second inequality is from ℓ_η(·) being monotonically decreasing. Putting together the two constraints on η, and noting that M̂ ≤ Σ_t h_t, we have the first bound. The second statement follows from Lemma 3 below.
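As an implementation note, the factor β_t = 1/(1 + z_tᵀ A_{t−1}^{−1} z_t) appearing above is exactly the Sherman–Morrison correction, which is also the standard way to maintain A_t^{−1} in O((dK)²) time per update instead of recomputing an inverse. A minimal sketch (our illustration, with a random test vector):

```python
import numpy as np

def sherman_morrison(A_inv, z):
    # Given A^{-1}, return (A + z z^T)^{-1} via a rank-one correction.
    Az = A_inv @ z
    return A_inv - np.outer(Az, Az) / (1.0 + z @ Az)

# quick check against a direct inverse
rng = np.random.default_rng(0)
n = 6
A = 2.0 * np.eye(n)
z = rng.normal(size=n)
assert np.allclose(sherman_morrison(np.linalg.inv(A), z),
                   np.linalg.inv(A + np.outer(z, z)))
```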

Lemma 3. If dK ≥ 2 and T ≥ 2, then

E[Σ_{t=1}^T n_t z_tᵀ A_t^{−1} z_t] ≤ dK ln(1 + 2TX²/(adK)).

Specifically, if a = X², the right-hand side is at most 2dK ln T.

Proof. Observe that

Σ_{t=1}^T n_t z_tᵀ A_t^{−1} z_t ≤ ln(|A_T| / |A_0|) ≤ dK ln(1 + (2X²/(adK)) Σ_{t=1}^T 1[ỹ_t = y_t]/p_{t,y_t}),

where the first inequality is a well-known fact from linear algebra (e.g. Hazan et al., 2007, Lemma 11). Given that A_T is dK × dK, the second inequality comes from the fact that |A_T| is maximized when all its eigenvalues are equal to tr(A_T)/(dK) ≤ a + Σ_t n_t ‖z_t‖²/(dK) ≤ a + (2X²/(dK)) Σ_t 1[ỹ_t = y_t]/p_{t,y_t}. Finally, using Jensen's inequality and E_{t−1}[1[ỹ_t = y_t]/p_{t,y_t}] = 1, we have

E[Σ_{t=1}^T n_t z_tᵀ A_t^{−1} z_t] ≤ dK ln(1 + 2TX²/(adK)).

If a = X², then the right-hand side is dK ln(1 + 2T/(dK)), which is at most 2dK ln T under the conditions on d, K, T.

Proof of Theorem 2. Observe that, by the triangle inequality, 1[ỹ_t ≠ y_t] ≤ 1[ỹ_t ≠ ŷ_t] + 1[y_t ≠ ŷ_t]. Summing over t and taking expectations on both sides, we conclude that

E[M] ≤ E[M̂] + γT.    (6)

The first statement follows from combining the above inequality with Theorem 3. For the second statement, first note that from Theorem 3 and Equation (6) we have

E[M] ≤ L_η(U) + η a ‖U‖_F² + (dK²/(ηγ)) ln(1 + 2TX²/(adK)) + γT
     ≤ L_η(U) + η X² ‖U‖_F² + (2dK²/(ηγ)) ln T + γT,

where the second inequality is from η ≤ 1 and Lemma 3 with a = X². The statement is concluded by the setting of γ = O(√(dK² ln T / T)).

Proof of Lemma 1. We show the lemma in two steps. Let G_t := q_t 1[y_t = ỹ_t = ŷ_t] and H_t := 1[y_t = ỹ_t ≠ ŷ_t]. First, we show that n_t = G_t + H_t. Recall that SOBA maintains the invariant (4), hence Σ_{s=1}^{t−1} n_s m_s ≥ 0. From line 14 of SOBA, we see that n_t = 1 only if ỹ_t = y_t. Now consider two cases:

• y_t = ỹ_t ≠ ŷ_t. In this case, ȳ_t = ŷ_t, therefore ⟨W_t, g_t⟩ ≥ 0, making m_t ≥ 0. This implies that Σ_{s=1}^{t−1} n_s m_s + m_t ≥ 0, guaranteeing n_t = 1.
• y_t = ỹ_t = ŷ_t. In this case, n_t is set to 1 if and only if q_t = 1, i.e. Σ_{s=1}^{t−1} n_s m_s + m_t ≥ 0.

This gives that n_t = G_t + H_t. Second, we have the following two equalities:

E_{t−1}[H_t ⟨U, g_t⟩] = E_{t−1}[(1[ỹ_t = y_t]/p_{t,y_t}) 1[ŷ_t ≠ y_t] ⟨U, (e_{ȳ_t} − e_{y_t}) ⊗ x_t⟩] = −1[ŷ_t ≠ y_t] ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩,
E_{t−1}[G_t ⟨U, g_t⟩] = E_{t−1}[(1[ỹ_t = y_t]/p_{t,y_t}) 1[ŷ_t = y_t] q_t ⟨U, (e_{ȳ_t} − e_{y_t}) ⊗ x_t⟩] = −1[ŷ_t = y_t] q_t ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩.

The first statement follows from adding up the two equalities above. The proof of the second statement is identical, except for replacing ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩ with ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩².

4.2. Fall-Back Analysis

The loss function ℓ_η is an interpolation between the hinge and the squared hinge losses. Yet, the bound becomes vacuous for η → 0. Hence, in this section we show that SOBA also guarantees an Õ((L_0(U) T)^{1/3} + √T) mistake bound with respect to L_0(U), the multiclass hinge loss of the competitor, assuming L_0(U) is known. Thus the algorithm achieves a mistake guarantee no worse than the sharpest bound implicit in Kakade et al. (2008).

Theorem 4. Set a = X² and denote by M the number of mistakes made by SOBA. Then SOBA has the following guarantees:³

1. If L_0(U) ≥ (‖U‖_F + 1)² √(dK²X²T ln T), then with the parameter setting γ = min{1, (dK²X² L_0(U) ln T / T²)^{1/3}}, one has the following expected mistake bound:

E[M] ≤ L_0(U) + O((‖U‖_F² dK²X² L_0(U) T ln T)^{1/3}).

³ Assuming the knowledge of ‖U‖_F, it would be possible to reduce the dependency on ‖U‖_F in both bounds. However, such an assumption is extremely unrealistic, and we prefer not to pursue it.

2. If L_0(U) < (‖U‖_F + 1)² √(dK²X²T ln T), then with the parameter setting γ = min{1, (dK²X² ln T / T)^{1/2}}, one has the following expected mistake bound:

E[M] ≤ L_0(U) + O((‖U‖_F + 1)² X √(dK²T ln T)),

where L_0(U) := Σ_{t=1}^T ℓ_0(U, (x_t, y_t)) is the hinge loss of the linear classifier U.

Proof. Recall that M̂ is the number of mistakes made by ŷ_t, that is, Σ_{t=1}^T 1[ŷ_t ≠ y_t]. Starting from the expectation inequality preceding (5), adding the quadratic term to both sides, dividing both sides by the leading constant, and plugging in a = X², we get that for all η > 0,

E[Σ_t h_t] ≤ E[Σ_t h_t (1 − ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩) + η Σ_t h_t ⟨U, (e_{ȳ_t} − e_{y_t}) ⊗ x_t⟩²] + η X² ‖U‖_F² + (dK²/(ηγ)) ln T
  ≤ E[Σ_t ℓ_0(U, (x_t, y_t)) + 2η (Σ_t h_t + 1) ‖U‖_F² X²] + (dK²/(ηγ)) ln T,

where the second inequality uses Cauchy–Schwarz, ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩ ≤ ‖U‖_F ‖(e_{y_t} − e_{ȳ_t}) ⊗ x_t‖ ≤ √2 ‖U‖_F X, and the fact that 1 − ⟨U, (e_{y_t} − e_{ȳ_t}) ⊗ x_t⟩ ≤ ℓ(U, (x_t, y_t)). Taking

η = √((dK²/γ) ln T) / (‖U‖_F X √(2(E[Σ_t h_t] + 1))),

we have

E[Σ_t h_t] ≤ L_0(U) + 2‖U‖_F X √(2 (E[Σ_t h_t] + 1) (dK²/γ) ln T)
  ≤ L_0(U) + 2‖U‖_F X (√(E[Σ_t h_t]) + 1) √(2 (dK²/γ) ln T),

where the last inequality is due to the elementary inequality √(c + d) ≤ √c + √d, and the setting of a = X². Solving the inequality in √(E[Σ_t h_t]) and using the fact that E[M] ≤ E[M̂] + γT ≤ E[Σ_t h_t] + γT, we have

E[M] ≤ L_0(U) + γT + O((dK²‖U‖_F²X²/γ) ln T + √(L_0(U) (dK²‖U‖_F²X²/γ) ln T)).

The theorem follows from Lemma 5 in Appendix B, taking U = ‖U‖_F, H = dK²X² ln T, L = L_0(U).

5. Empirical Results

We ran experiments with SOBA to empirically validate the theoretical findings. We used three different datasets from Kakade et al. (2008): SynSep, SynNonSep, and Reuters4. The first two are synthetic, with 10^6 samples in R^{400} and 9 classes. SynSep is constructed to be linearly separable, while SynNonSep is the same dataset with 5% random label noise. Reuters4 is generated from the RCV1 dataset (Lewis et al., 2004), extracting the 665,265 examples that have exactly one label from the set {CCAT, ECAT, GCAT, MCAT}. It contains 47,236 features. We also report the performance on Covtype from the LibSVM repository.⁴ We report averages over 10 different runs.

SOBA, like the Newtron algorithm, has a quadratic complexity in the dimension of the data, while the Banditron and the Perceptron algorithms are linear. Following the long tradition of similar algorithms (Crammer et al., 2009; Duchi et al., 2011; Hazan & Kale, 2011; Crammer & Gentile, 2013), to be able to run the algorithm on large datasets we have implemented an approximate diagonal version of SOBA, named SOBAdiag. It keeps in memory just the diagonal of the matrix A_t; a sketch of this variant is given below. Following Hazan & Kale (2011), we have tested only algorithms designed to work in the fully adversarial setting. Hence, we tested the Banditron and the PNewtron, the diagonal version of the Newtron algorithm in Hazan & Kale (2011). The multiclass Perceptron algorithm was used as a full-information baseline.

In the experiments, we only changed the exploration rate γ, leaving all the other parameters the algorithms might have fixed. In particular, for the PNewtron we set α = 10, β = 0.01, and D = 1, as in Hazan & Kale (2011). In SOBA, a is fixed to 1 in all the experiments. We explore the effect of the exploration rate γ in the first row of Figure 2. We see that the PNewtron algorithm,⁵ thanks to the exploration based on the softmax prediction, can achieve very good performance for a wide range of γ. It is important to note that SOBAdiag has good performance on all four datasets for a value of γ close to 1%. For bigger values, the performance degrades because the best possible error rate is lower bounded by (1 − 1/K)γ due to exploration.

⁴ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
⁵ We were unable to make the PNewtron work on Reuters4: for any setting of γ, the error rate is never better than 57%. The reason might be that our version of the dataset RCV1 has 47,236 features, while the one reported in Kakade et al. (2008); Hazan & Kale (2011) has 346,810, hence the optimal setting of the 3 other parameters of PNewtron might be different. For this reason we prefer not to report the performance of PNewtron on Reuters4.
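The diagonal variant is a one-line change to the sketch of Algorithm 1 given earlier: the matrix A_t is replaced by its diagonal, so the inverse becomes an elementwise division (this is our reading of SOBAdiag; the paper only specifies that the diagonal of A_t is kept in memory):

```python
import numpy as np

def soba_diag_update(A_diag, theta, W, msum, x, p, y, K, d):
    # One SOBAdiag update on a round with ytilde = y (mirrors lines 10-20
    # of Algorithm 1, with A_t approximated by its diagonal).
    scores = W @ x
    scores[y] = -np.inf
    ybar = int(np.argmax(scores))
    e = np.zeros(K); e[ybar], e[y] = 1.0, -1.0
    g = np.kron(e, x) / p[y]
    z = np.sqrt(p[y]) * g
    zAz = np.sum(z * z / A_diag)            # z^T A^{-1} z for a diagonal A
    w = W.ravel()
    m = ((w @ z) ** 2 + 2.0 * (w @ g)) / (1.0 + zAz)
    if msum + m >= 0:                       # update indicator n_t = 1
        msum += m
        A_diag = A_diag + z * z             # keep only the diagonal of z z^T
        theta = theta - g
        W = (theta / A_diag).reshape(K, d)
    return A_diag, theta, W, msum
```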

Figure 2. Error rates vs. the value of the exploration rate γ (top row) and vs. the number of examples (bottom row), on SynSep, SynNonSep, Covtype, and Reuters4, for the Banditron, the PNewtron, SOBAdiag, and the full-information Perceptron; the bottom-row legends report the optimal γ used for each algorithm. The x-axis is logarithmic in all the plots, while the y-axis is logarithmic in the plots in the second row. Figure best viewed in colors.

For smaller values of exploration, the performance degrades because the algorithm does not update enough. In fact, SOBA updates only when ỹ_t = y_t, so when γ is too small the algorithm does not explore enough and remains stuck around the initial solution. Also, SOBA requires an initial number of updates to accumulate a large enough sum Σ_t n_t m_t in order to start updating also when ŷ_t is correct but the margin is too small.

The optimal setting of γ for each algorithm was then used to generate the plots in the second row of Figure 2, where we report the error rate over time. With the respective optimal settings of γ, we note that the performance of PNewtron does not seem better than that of the Multiclass Perceptron algorithm, and is on par with or worse than the Banditron's. On the other hand, SOBAdiag has the best performance among the bandit algorithms on 3 datasets out of 4.

The first dataset, SynSep, is separable, and with their optimal settings of γ all the algorithms converge at a rate of roughly O(1/t), as can be seen from the log-log plot; the bandit algorithms, however, will not converge to zero error rate, but to (1 − 1/K)γ. SOBA also has an initial phase in which the error rate is high, due to the effect mentioned above. On the second dataset, SynNonSep, SOBAdiag outperforms all the other algorithms (including the full-information Perceptron), achieving an error rate close to the noise level of 5%. This is due to SOBA being a second-order algorithm, while the Perceptron is a first-order algorithm. A similar situation is observed on Covtype. On the last dataset, Reuters4, SOBAdiag achieves performance better than the Banditron.

6. Discussion and Future Work

In this paper, we study the problem of online multiclass learning with bandit feedback. We propose SOBA, an algorithm that achieves a regret of Õ((1/η)√T) with respect to the η-loss of the competitor. This answers a COLT open problem posed by Abernethy & Rakhlin (2009). Its key ideas are to apply a novel adaptive regularizer in a second-order online learning algorithm, coupled with updates only when the predictions are correct. SOBA is shown to have competitive performance compared to its precedents on synthetic and real datasets, in some cases even better than the full-information Perceptron algorithm.

There are several open questions we wish to explore:

1. Is it possible to design efficient algorithms with mistake bounds that depend on the loss of the competitor, i.e. E[M] ≤ L_η(U) + Õ(√(dK L_η(U)) + dK)? This type of bound occurs naturally in the full information multiclass online learning setting (see e.g. Theorem 1), or in the multiarmed bandit setting, e.g., Neu (2015).

2. Are there efficient algorithms that have a finite mistake bound in the separable case? Kakade et al. (2008) provide an algorithm that performs enumeration and plurality vote to achieve a finite mistake bound in the finite dimensional setting, but unfortunately the algorithm is impractical.
Notice that it is easy to show that in SOBA, ŷ_t makes a logarithmic number of mistakes in the separable case with a constant rate of exploration; yet it is not clear how to decrease the exploration over time in order to get a logarithmic number of mistakes for ỹ_t as well.

Acknowledgments. We thank Claudio Gentile for suggesting the original plan of attack for this problem, and thank the anonymous reviewers for their thoughtful comments.

References

Abernethy, J. and Rakhlin, A. An efficient bandit algorithm for √T-regret in online multiclass prediction? In COLT, 2009.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. E. Taming the monster: a fast and simple algorithm for contextual bandits. In ICML, 2014.

Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, January 2003.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. A second-order Perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

Crammer, K. and Gentile, C. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning, 90(3):347–383, 2013.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585, 2006.

Crammer, K., Kulesza, A., and Dredze, M. Adaptive regularization of weight vectors. In NIPS, 2009.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. John Wiley, 1973.

Dudík, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In UAI, 2011.

Hazan, E. and Kale, S. Newtron: an efficient bandit algorithm for online multiclass prediction. In NIPS, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Efficient bandit algorithms for online multiclass prediction. In ICML, pp. 440–447. ACM, 2008.

Kivinen, J. and Warmuth, M. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, January 1997.

Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2008.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT, 2010.

Mohri, M. and Rostamizadeh, A. Perceptron mistake bounds. arXiv preprint, 2013.

Neu, G. First-order regret bounds for combinatorial semi-bandits. In COLT, 2015.

Orabona, F., Cesa-Bianchi, N., and Gentile, C. Beyond logarithmic bounds in online learning. In AISTATS, 2012.

Orabona, F., Crammer, K., and Cesa-Bianchi, N. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411–435, 2015.

Rakhlin, A. and Sridharan, K. BISTRO: An efficient relaxation-based method for contextual bandits. In ICML, 2016.

Rosenblatt, F. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Syrgkanis, V., Krishnamurthy, A., and Schapire, R. E. Efficient algorithms for adversarial contextual learning. In ICML, 2016a.

Syrgkanis, V., Luo, H., Krishnamurthy, A., and Schapire, R. E. Improved regret bounds for oracle-based adversarial contextual bandits. In NIPS, 2016b.

Valizadegan, H., Jin, R., and Wang, S. Learning to trade off between exploration and exploitation in multiclass bandit prediction. In KDD. ACM, 2011.

Wang, S., Jin, R., and Valizadegan, H. A potential-based framework for online multi-class learning with partial feedback. In AISTATS, 2010.


More information

Minimax strategy for prediction with expert advice under stochastic assumptions

Minimax strategy for prediction with expert advice under stochastic assumptions Minimax strategy for prediction ith expert advice under stochastic assumptions Wojciech Kotłosi Poznań University of Technology, Poland otlosi@cs.put.poznan.pl Abstract We consider the setting of prediction

More information

Discover Relevant Sources : A Multi-Armed Bandit Approach

Discover Relevant Sources : A Multi-Armed Bandit Approach Discover Relevant Sources : A Multi-Armed Bandit Approach 1 Onur Atan, Mihaela van der Schaar Department of Electrical Engineering, University of California Los Angeles Email: oatan@ucla.edu, mihaela@ee.ucla.edu

More information

Minimax Fixed-Design Linear Regression

Minimax Fixed-Design Linear Regression JMLR: Workshop and Conference Proceedings vol 40:1 14, 2015 Mini Fixed-Design Linear Regression Peter L. Bartlett University of California at Berkeley and Queensland University of Technology Wouter M.

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:

More information

Lecture 10 : Contextual Bandits

Lecture 10 : Contextual Bandits CMSC 858G: Bandits, Experts and Games 11/07/16 Lecture 10 : Contextual Bandits Instructor: Alex Slivkins Scribed by: Guowei Sun and Cheng Jie 1 Problem statement and examples In this lecture, we will be

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

Online Passive-Aggressive Active Learning

Online Passive-Aggressive Active Learning Noname manuscript No. (will be inserted by the editor) Online Passive-Aggressive Active Learning Jing Lu Peilin Zhao Steven C.H. Hoi Received: date / Accepted: date Abstract We investigate online active

More information

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems 216 IEEE 55th Conference on Decision and Control (CDC) ARIA Resort & Casino December 12-14, 216, Las Vegas, USA Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

More information

arxiv: v1 [cs.lg] 1 Jun 2016

arxiv: v1 [cs.lg] 1 Jun 2016 Improved Regret Bounds for Oracle-Based Adversarial Contextual Bandits arxiv:1606.00313v1 cs.g 1 Jun 2016 Vasilis Syrgkanis Microsoft Research Haipeng uo Princeton University Robert E. Schapire Microsoft

More information

The FTRL Algorithm with Strongly Convex Regularizers

The FTRL Algorithm with Strongly Convex Regularizers CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Bandit Online Learning on Graphs via Adaptive Optimization

Bandit Online Learning on Graphs via Adaptive Optimization Bandit Online Learning on Graphs via Adaptive Optimization Peng Yang, Peilin Zhao 2,3 and Xin Gao King Abdullah University of Science and Technology, Saudi Arabia 2 South China University of Technology,

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning. Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Online (and Distributed) Learning with Information Constraints. Ohad Shamir

Online (and Distributed) Learning with Information Constraints. Ohad Shamir Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information