Online Learning of Noisy Data with Kernels


Nicolò Cesa-Bianchi (Università degli Studi di Milano), Shai Shalev-Shwartz (The Hebrew University), Ohad Shamir (The Hebrew University)

Abstract
We study online learning when individual instances are corrupted by adversarially chosen random noise. We assume the noise distribution is unknown, and may change over time with no restriction other than having zero mean and bounded variance. Our technique relies on a family of unbiased estimators for non-linear functions, which may be of independent interest. We show that a variant of online gradient descent can learn functions in any dot-product (e.g., polynomial) or Gaussian kernel space with any analytic convex loss function. Our variant uses randomized estimates that need to query a random number of noisy copies of each instance, where with high probability this number is upper bounded by a constant. Allowing such multiple queries cannot be avoided: indeed, we show that online learning is in general impossible when only one noisy copy of each instance can be accessed.

1 Introduction
In many machine learning applications training data are typically collected by measuring certain physical quantities. Examples include bioinformatics, medical tests, robotics, and remote sensing. These measurements have errors that may be due to several reasons: sensor costs, communication constraints, or intrinsic physical limitations. In all such cases, the learner trains on a distorted version of the actual target data, which is where the learner's predictive ability is eventually evaluated. In this work we investigate the extent to which a learning algorithm can achieve a good predictive performance when training data are corrupted by noise with unknown distribution.

We prove upper and lower bounds on the learner's cumulative loss in the framework of online learning, where examples are generated by an arbitrary and possibly adversarial source. We model the measurement error via a random perturbation which affects each instance observed by the learner. We do not assume any specific property of the noise distribution other than zero mean and bounded variance. Moreover, we allow the noise distribution to change at every step in an adversarial way and fully hidden from the learner.

Our positive results are quite general: by using a randomized unbiased estimate for the loss gradient and a randomized feature mapping to estimate kernel values, we show that a variant of online gradient descent can learn functions in any dot-product (e.g., polynomial) or Gaussian RKHS under any given analytic convex loss function. Our techniques are readily extendable to other kernel types as well. In order to obtain unbiased estimates of loss gradients and kernel values, we allow the learner to query a random number of independently perturbed copies of the current unseen instance. We show how low-variance estimates can be computed using a number of queries that is constant with high probability. This is in sharp contrast with standard averaging techniques, which attempt to directly estimate the noisy instance and therefore require a sample whose size depends on the scale of the problem. Finally, we formally show that learning is impossible, even without kernels, when only one perturbed copy of each instance can be accessed. This is true for essentially any reasonable loss function.

Our paper is organized as follows. In the next subsection we discuss related work. In Sec. 2 we introduce our setting and justify some of our choices. In Sec. 4 we present our main results, but before that, in Sec. 3, we discuss the techniques used to obtain them.
In the same section, we also explain why existing techniques are insufficient to deal with our problem. The detailed proofs and subroutine implementations appear in Sec. 5, with some of the more technical lemmas and proofs relegated to [7]. We wrap up with a discussion on possible avenues for future work in Sec. 6.

1.1 Related Work
In the machine learning literature, the problem of learning from noisy examples, and, in particular, from noisy training instances, has traditionally received a lot of attention (see, for example, the recent survey [12]). On the other hand, there are comparably few theoretically principled studies on this topic. Two of them focus on models quite different from the one studied here: random attribute noise in PAC boolean learning [3, 9], and malicious noise [10, 5]. In the first case, learning is restricted to classes of boolean functions and the noise must be independent across each boolean coordinate. In the second case, an adversary is allowed to perturb a small fraction of the training examples in an arbitrary way, making learning impossible in a strong informational sense unless this perturbed fraction is very small (of the order of the desired accuracy for the predictor).

The previous work perhaps closest to the one presented here is [11], where binary classification mistake bounds are proven for the online Winnow algorithm in the presence of attribute errors. Similarly to our setting, the sequence of instances observed by the learner is chosen by an adversary. However, in [11] the noise is generated by an adversary who may change the value of each attribute in an arbitrary way. The final mistake bound, which only applies when the noiseless data sequence is linearly separable without kernels, depends on the sum of all adversarial perturbations.

2 Setting
We consider a setting where the goal is to predict values $y \in \mathbb{R}$ based on instances $x \in \mathbb{R}^d$. In this paper we focus on kernel-based linear predictors of the form $x \mapsto \langle w, \Psi(x)\rangle$, where $\Psi$ is a feature mapping into some reproducing kernel Hilbert space (RKHS). We assume there exists a kernel function that efficiently implements dot products in that space, i.e., $k(x,x') = \langle\Psi(x),\Psi(x')\rangle$. Note that a special case of this setting is linear kernels, where $\Psi$ is the identity mapping and $k(x,x') = \langle x,x'\rangle$.

The standard online learning protocol for linear prediction with kernels is defined as follows: at each round $t$, the learner picks a linear hypothesis $w_t$ from the RKHS. The adversary then picks an example $(x_t, y_t)$ and reveals it to the learner. The loss suffered by the learner is $\ell(\langle w_t, \Psi(x_t)\rangle, y_t)$, where $\ell$ is a known and fixed loss function. The goal of the learner is to minimize regret with respect to a fixed convex set of hypotheses $W$, namely

$$\sum_{t=1}^{T} \ell\big(\langle w_t, \Psi(x_t)\rangle, y_t\big) \;-\; \min_{w\in W}\sum_{t=1}^{T}\ell\big(\langle w, \Psi(x_t)\rangle, y_t\big).$$

Typically, we wish to find a strategy for the learner such that, no matter what is the adversary's strategy of choosing the sequence of examples, the expression above is sub-linear in $T$.

We now make the following twist, which limits the information available to the learner: instead of receiving $(x_t, y_t)$, the learner observes $y_t$ and is given access to an oracle $A_t$. On each call, $A_t$ returns an independent copy of $x_t + Z_t$, where $Z_t$ is a zero-mean random vector with some known finite bound on its variance, in the sense that $\mathbb{E}[\|Z_t\|^2] \le a$ for some uniform constant $a$. In general, the distribution of $Z_t$ is unknown to the learner. It might be chosen by the adversary, and change from round to round or even between consecutive calls to $A_t$. Note that here we assume that $y_t$ remains unperturbed, but we emphasize that this is just for simplicity: our techniques can be readily extended to deal with noisy values as well. The learner may call $A_t$ more than once.
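To make the oracle protocol concrete, the following is a minimal sketch of such an oracle in Python. The Gaussian noise, the class name, and the query counter are our own illustrative choices; the analysis only requires that each call returns an independent copy of $x_t + Z_t$ with $\mathbb{E}[Z_t]=0$ and $\mathbb{E}[\|Z_t\|^2]\le a$.

```python
import numpy as np

class NoisyOracle:
    """Oracle A_t: each call returns a fresh independent noisy copy of x_t.

    Illustrative sketch only: here the noise is Gaussian, but the setting only
    assumes zero mean and E[||Z_t||^2] <= a, with the distribution possibly
    changing between calls.
    """
    def __init__(self, x_t, noise_std, rng=None):
        self.x_t = np.asarray(x_t, dtype=float)
        self.noise_std = noise_std
        self.rng = rng or np.random.default_rng()
        self.num_queries = 0            # how many noisy copies were requested

    def __call__(self):
        self.num_queries += 1
        z = self.rng.normal(0.0, self.noise_std, size=self.x_t.shape)
        return self.x_t + z
```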
In fact, as we discuss later on, being able to call $A_t$ more than once is necessary for the learner to have any hope to succeed. On the other hand, if the learner calls $A_t$ an unlimited number of times, it can reconstruct $x_t$ arbitrarily well by averaging, and we are back to the standard learning setting. In this paper we focus on learning algorithms that call $A_t$ only a small, essentially constant number of times, which depends only on our choice of loss function and kernel, rather than on $T$, the norm of $x_t$, or the variance of $Z_t$, as would happen with naïve averaging techniques. Moreover, since the number of queries is bounded with very high probability, one can even produce an algorithm with an absolute bound on the number of queries, which will fail or introduce some bias with an arbitrarily small probability. For simplicity, we ignore these issues in this paper.

In this setting, we wish to minimize the regret in hindsight with respect to the unperturbed data and averaged over the noise introduced by the oracle, namely

$$\mathbb{E}\left[\sum_{t=1}^{T}\ell\big(\langle w_t,\Psi(x_t)\rangle, y_t\big)\right] \;-\; \min_{w\in W}\sum_{t=1}^{T}\ell\big(\langle w,\Psi(x_t)\rangle, y_t\big), \qquad (1)$$

where the random quantities are the predictors $w_1, w_2, \dots$ generated by the learner, which depend on the observed noisy instances. In [7], we briefly discuss alternative regret measures, and why they are unsatisfactory.

This kind of regret is relevant where we actually wish to learn from data, without the noise causing a hindrance. In particular, consider the batch setting, where the examples $\{(x_t,y_t)\}_{t=1}^{T}$ are actually sampled i.i.d. from some unknown distribution, and we wish to find a predictor which minimizes the expected loss $\mathbb{E}[\ell(\langle w,x\rangle,y)]$ with respect to new examples $(x,y)$. Using standard online-to-batch conversion techniques, if we can find an online algorithm with a sublinear bound on Eq. (1), then it is possible to construct learning algorithms for the batch setting which are robust to noise; that is, algorithms generating a predictor $w$ with close to minimal expected loss $\mathbb{E}[\ell(\langle w,x\rangle,y)]$ among all $w\in W$.

While our techniques are quite general, the exact algorithmic and theoretical results depend a lot on which loss function and kernel is used. Discussing the loss function first, we will assume that $\ell(\langle w,\Psi(x)\rangle,y)$ is a convex function of $w$ for each example $(x,y)$. Somewhat abusing notation, we assume the loss can be written either as $\ell(\langle w,\Psi(x)\rangle,y)=f(y\langle w,\Psi(x)\rangle)$ or as $\ell(\langle w,\Psi(x)\rangle,y)=f(\langle w,\Psi(x)\rangle-y)$ for some function $f$. We refer to the first type as classification losses, as it encompasses most reasonable losses for classification, where $y\in\{-1,+1\}$ and the goal is to predict the label. We refer to the second type as regression losses, as it encompasses most reasonable regression losses, where $y$ takes arbitrary real values. For simplicity, we present some of our results in terms of classification losses, but they all hold for regression losses as well with slight modifications.

We present our results under the assumption that the loss function is smooth, in the sense that $\ell'(a)$ can be written as $\sum_{n=0}^{\infty}\gamma_n a^n$ for any $a$ in its domain. This assumption holds, for instance, for the squared loss $\ell(a)=a^2$, the exponential loss $\ell(a)=e^{-a}$, and smoothed versions of loss functions such as the hinge loss and the absolute loss (we discuss examples in more detail in Subsection 4.2). This assumption can be relaxed under certain conditions, and this is further discussed in Subsection 3.2.

Turning to the issue of kernels, we note that the general presentation of our approach is somewhat hampered by the fact that it needs to be tailored to the kernel we use. In this paper, we focus on two families of kernels:
Dot-product kernels: the kernel $k(x,x')$ can be written as a function of $\langle x,x'\rangle$. Examples of such kernels are linear kernels $\langle x,x'\rangle$; homogeneous polynomial kernels $\langle x,x'\rangle^n$; inhomogeneous polynomial kernels $(1+\langle x,x'\rangle)^n$; exponential kernels $e^{\langle x,x'\rangle}$; binomial kernels $(1+\langle x,x'\rangle)^{\alpha}$; and more (see for instance [15, 17]).
Gaussian kernels: $k(x,x')=e^{-\|x-x'\|^2/\sigma^2}$ for some $\sigma^2>0$.
Again, we emphasize that our techniques are extendable to other kernel types as well.

3 Techniques
Our results are based on two key ideas: the use of online gradient descent algorithms, and the construction of unbiased gradient estimators in the kernel setting. The latter is based on a general method to build unbiased estimators for non-linear functions, which may be of independent interest.

3.1 Online Gradient Descent
There exists a well developed theory, as well as algorithms, for the standard online learning setting, where the example $(x_t,y_t)$ is revealed after each round, and for general convex loss functions. One of the simplest and most well known algorithms is the online gradient descent algorithm due to Zinkevich [18]. Since this algorithm forms a basis for our algorithm in the new setting, we briefly review it below, as adapted to our setting. The algorithm initializes the classifier $w_1=0$. At round $t$, the algorithm predicts according to $w_t$.
The predictor is then updated according to the rule $w_{t+1}=P\big(w_t-\eta_t\nabla_t\big)$, where $\eta_t$ is a suitably chosen constant which might depend on $t$; $\nabla_t=\ell'\big(y_t\langle w_t,\Psi(x_t)\rangle\big)\,y_t\,\Psi(x_t)$ is the gradient of $\ell\big(y_t\langle w,\Psi(x_t)\rangle\big)$ with respect to $w$ at $w_t$; and $P$ is a projection operator onto the convex set $W$, on whose elements we wish to achieve low regret. In particular, if we wish to compete with hypotheses of bounded squared norm $B_w$, then $P$ simply involves rescaling the norm of the predictor so as to have squared norm at most $B_w$. With this algorithm, one can prove regret bounds with respect to any $w\in W$.

A folklore result about this algorithm is that, in fact, we do not need to update the predictor by the gradient at each step. Instead, it is enough to update by some random vector of bounded variance which merely equals the gradient in expectation. This is a useful property in settings where $(x_t,y_t)$ is not revealed to the learner, and has been used before, for instance in the online bandit setting (see [6, 8, 1]). Here, we will use this property in a new way, in order to devise algorithms which are robust to noise.
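As a point of reference, here is a minimal sketch of projected online gradient descent with a stochastic (merely unbiased) gradient. The interface and function names are our own; the projection shown is the norm rescaling onto $\{w:\|w\|^2\le B_w\}$ described above.

```python
import numpy as np

def projected_ogd(examples, grad_estimate, eta, B_w, d):
    """Projected online gradient descent with (possibly stochastic) gradients.

    grad_estimate(w, example) should return a vector whose expectation is the
    true loss gradient at w; the regret analysis only needs this unbiasedness
    plus a bounded second moment.  Illustrative sketch, not the paper's code.
    """
    w = np.zeros(d)
    iterates = []
    for example in examples:
        iterates.append(w.copy())
        g = grad_estimate(w, example)      # unbiased gradient estimate
        w = w - eta * g                    # gradient step
        sq_norm = float(np.dot(w, w))
        if sq_norm > B_w:                  # project onto {w : ||w||^2 <= B_w}
            w = w * np.sqrt(B_w / sq_norm)
    return iterates
```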

When the kernel and loss function are linear (e.g., $\Psi(x)=x$ and $\ell(a)=ca+b$ for some constants $b,c$), this property already ensures that the algorithm is robust to noise without any further changes. This is because the noise injected into each $x_t$ merely causes the exact gradient to change to a random vector which is correct in expectation: if we assume $\ell$ is a classification loss, then $\mathbb{E}\big[\ell'\big(y_t\langle w_t,\Psi(\tilde x_t)\rangle\big)\Psi(\tilde x_t)\big]=\mathbb{E}[c\,\tilde x_t]=c\,x_t$. On the other hand, when we use nonlinear kernels and nonlinear loss functions, standard online gradient descent suffers from systematic and unknown biases (since the noise distribution is unknown), which prevents the method from working properly. To deal with this problem, we now turn to describe a technique for estimating expressions such as $\ell'\big(y_t\langle w_t,\Psi(x_t)\rangle\big)$ in an unbiased manner. In Subsection 3.3, we discuss how $\Psi(x_t)$ can be estimated in an unbiased manner.

3.2 Unbiased Estimators for Non-Linear Functions
Suppose that we are given access to independent copies of a real random variable $X$ with expectation $\mathbb{E}[X]$, and some real function $f$, and we wish to construct an unbiased estimate of $f(\mathbb{E}[X])$. If $f$ is a linear function, then this is easy: just sample $x$ from $X$ and return $f(x)$; by linearity, $\mathbb{E}[f(X)]=f(\mathbb{E}[X])$ and we are done. The problem becomes less trivial when $f$ is a general, nonlinear function, since usually $\mathbb{E}[f(X)]\ne f(\mathbb{E}[X])$. In fact, when $X$ takes finitely many values and $f$ is not a polynomial function, one can prove that no unbiased estimator can exist (see [14], Proposition 8 and its proof). Nevertheless, we show how in many cases one can construct an unbiased estimator of $f(\mathbb{E}[X])$, including cases covered by the impossibility result. There is no contradiction, because we do not construct a standard estimator. Usually, an estimator is a function from a given sample to the range of the parameter we wish to estimate. An implicit assumption is that the size of the sample given to it is fixed, and this is also a crucial ingredient in the impossibility result. We circumvent this by constructing an estimator based on a random number of samples.

Here is the key idea: suppose $f:\mathbb{R}\to\mathbb{R}$ is any function continuous on a bounded interval. It is well known that one can construct a sequence of polynomials $(Q_n)_{n\ge 1}$, where $Q_n$ is a polynomial of degree $n$, which converges uniformly to $f$ on the interval. If $Q_n(x)=\sum_{i=0}^{n}\gamma_{n,i}x^i$, let $Q'_n(x_1,\dots,x_n)=\sum_{i=0}^{n}\gamma_{n,i}\prod_{j=1}^{i}x_j$. Now, consider the estimator which draws a positive integer $N$ according to some distribution $\Pr(N=n)=p_n$, samples $X$ for $N$ times to get $x_1,x_2,\dots,x_N$, and returns $\frac{1}{p_N}\big(Q'_N(x_1,\dots,x_N)-Q'_{N-1}(x_1,\dots,x_{N-1})\big)$, where we assume $Q'_0=0$. The expected value of this estimator is equal to

$$\mathbb{E}_{N,x_1,\dots,x_N}\!\left[\frac{Q'_N(x_1,\dots,x_N)-Q'_{N-1}(x_1,\dots,x_{N-1})}{p_N}\right] = \sum_{n=1}^{\infty}p_n\,\frac{\mathbb{E}_{x_1,\dots,x_n}\big[Q'_n(x_1,\dots,x_n)-Q'_{n-1}(x_1,\dots,x_{n-1})\big]}{p_n} = \sum_{n=1}^{\infty}\big(Q_n(\mathbb{E}[X])-Q_{n-1}(\mathbb{E}[X])\big) = f(\mathbb{E}[X]).$$

Thus, we have an unbiased estimator of $f(\mathbb{E}[X])$.

This technique appeared in a rather obscure early 1960's paper [16] from sequential estimation theory, and appears to be little known, particularly outside the sequential estimation community. However, we believe this technique is interesting, and expect it to have useful applications for other problems as well.

While this may seem at first like a very general result, the variance of this estimator must be bounded for it to be useful. Unfortunately, this is not true for general continuous functions. More precisely, let $N$ be distributed according to $p_n$, and let $\theta$ be the value returned by the estimator. In [2], it is shown that if $X$ is a Bernoulli random variable, and if $\mathbb{E}[|\theta|N^k]<\infty$ for some integer $k$, then $f$ must be $k$ times continuously differentiable. Since $\mathbb{E}[|\theta|N^k]\le\big(\mathbb{E}[\theta^2]+\mathbb{E}[N^{2k}]\big)/2$, this means that functions $f$ which yield an estimator with finite variance, while using a number of queries with bounded variance, must be continuously differentiable.
Moreover, in case we desire the number of queries to be essentially constant (i.e., choose a distribution for $N$ with exponentially decaying tails), we must have $\mathbb{E}[N^k]<\infty$ for all $k$, which means that $f$ should be infinitely differentiable (in fact, in [2] it is conjectured that $f$ must be analytic in such cases). Thus, we focus in this paper on functions $f$ which are analytic, i.e., they can be written as $f(x)=\sum_{i=0}^{\infty}\gamma_i x^i$ for appropriate constants $\gamma_0,\gamma_1,\dots$. In that case, $Q_n$ can simply be the truncated Taylor expansion of $f$ to order $n$, i.e., $Q_n(x)=\sum_{i=0}^{n}\gamma_i x^i$. Moreover, we can pick $\Pr(N=n)\propto 1/p^n$ for any $p>1$.

So the estimator becomes the following: we sample a nonnegative integer $N$ according to $\Pr(N=n)=(p-1)/p^{n+1}$, sample $X$ independently $N$ times to get $x_1,x_2,\dots,x_N$, and return

$$\theta = \gamma_N\,\frac{p^{N+1}}{p-1}\,x_1 x_2\cdots x_N,$$

where we set $\theta=\gamma_0\,\frac{p}{p-1}$ if $N=0$.¹ We have the following:

Lemma 1. For the above estimator, it holds that $\mathbb{E}[\theta]=f(\mathbb{E}[X])$. The expected number of samples used by the estimator is $1/(p-1)$, and the probability of it being at least $z$ is $p^{-z}$. Moreover, if we assume that $f_+(x)=\sum_{n=0}^{\infty}|\gamma_n|x^n$ exists for any $x$ in the domain of interest, then $\mathbb{E}[\theta^2]\le\frac{p}{p-1}\,f_+^2\!\big(\sqrt{p\,\mathbb{E}[X^2]}\big)$.

Proof. The fact that $\mathbb{E}[\theta]=f(\mathbb{E}[X])$ follows from the discussion above. The results about the number of samples follow directly from properties of the geometric distribution. As for the second moment, $\mathbb{E}[\theta^2]$ equals

$$\mathbb{E}_{N,x_1,\dots,x_N}\!\left[\gamma_N^2\,\frac{p^{2N+2}}{(p-1)^2}\,x_1^2 x_2^2\cdots x_N^2\right] = \sum_{n=0}^{\infty}\frac{p-1}{p^{\,n+1}}\cdot\frac{p^{2n+2}}{(p-1)^2}\,\gamma_n^2\,\mathbb{E}_{x_1,\dots,x_n}\!\big[x_1^2 x_2^2\cdots x_n^2\big] = \frac{1}{p-1}\sum_{n=0}^{\infty}p^{\,n+1}\gamma_n^2\,\mathbb{E}[X^2]^n$$
$$= \frac{p}{p-1}\sum_{n=0}^{\infty}\Big(|\gamma_n|\big(\sqrt{p\,\mathbb{E}[X^2]}\big)^n\Big)^2 \;\le\; \frac{p}{p-1}\Big(\sum_{n=0}^{\infty}|\gamma_n|\big(\sqrt{p\,\mathbb{E}[X^2]}\big)^n\Big)^2 \;=\; \frac{p}{p-1}\,f_+^2\!\big(\sqrt{p\,\mathbb{E}[X^2]}\big).$$

The parameter $p$ provides a tradeoff between the variance of the estimator and the number of samples needed: the larger $p$ is, the fewer samples we need, but the estimator has more variance. In any case, the sample size distribution decays exponentially fast, so the sample size is essentially bounded.

It should be emphasized that the estimator associated with Lemma 1 is tailored for generality, and is suboptimal in some cases. For example, if $f$ is a polynomial function, then $\gamma_n=0$ for sufficiently large $n$, and there is no reason to sample $N$ from a distribution supported on all nonnegative integers; it just increases the variance. Nevertheless, in order to keep the presentation unified and general, we will always use this type of estimator. If needed, the estimator can always be optimized for specific cases.

We also note that this technique can be improved in various directions, if more is known about the distribution of $X$. For instance, if we have some estimate of the expectation and variance of $X$, then we can perform a Taylor expansion around the estimated $\mathbb{E}[X]$ rather than around $0$, and tune the probability distribution of $N$ to be different from the one we used above. These modifications can allow us to make the variance of the estimator arbitrarily small, if the variance of $X$ is small enough. Moreover, one can take polynomial approximations to $f$ which are perhaps better than truncated Taylor expansions. In this paper, for simplicity, we will ignore these potential improvements.

Finally, we note that a related result in [2] implies that it is impossible to estimate $f(\mathbb{E}[X])$ in an unbiased manner when $f$ is discontinuous, even if we allow a number of queries and estimator values which are infinite in expectation. Therefore, since the derivative of the hinge loss is not continuous, estimating the gradient of the hinge loss in an unbiased manner with arbitrary noise appears to be impossible. Thus, if online learning with noise and hinge loss is at all feasible, a rather different approach than ours will need to be taken.

¹ Admittedly, the event $N=0$ should receive zero probability, as it amounts to skipping the sampling altogether. However, setting $\Pr(N=0)=0$ appears to improve the bound in this paper only in the smaller order terms, while making the analysis in the paper more complicated.
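The following is a minimal sketch of this estimator, assuming $f$ is analytic with Taylor coefficients supplied as a function `gamma(n)`; the function names and the worked example ($f=\exp$, $X$ a noisy copy of a fixed value) are ours, chosen only to illustrate the construction above.

```python
import numpy as np

def unbiased_estimate(sample_X, gamma, p, rng=None):
    """One unbiased estimate of f(E[X]) for analytic f(x) = sum_n gamma(n) x^n.

    sample_X() returns one independent draw of X; p > 1 trades sample size
    against variance.  Draws N with Pr(N = n) = (p-1)/p^(n+1) and returns
        gamma(N) * p^(N+1)/(p-1) * x_1 * ... * x_N.
    Illustrative sketch of the construction in Subsection 3.2.
    """
    rng = rng or np.random.default_rng()
    N = rng.geometric(1.0 - 1.0 / p) - 1   # geometric on {0,1,...}: Pr(N=n)=(p-1)/p^(n+1)
    prod = 1.0
    for _ in range(N):
        prod *= sample_X()
    return gamma(N) * (p ** (N + 1) / (p - 1.0)) * prod

if __name__ == "__main__":
    # Example: f = exp, so gamma(n) = 1/n!, and X = mu + Gaussian noise.
    from math import factorial, exp
    rng = np.random.default_rng(0)
    mu = 0.3
    estimates = [unbiased_estimate(lambda: mu + rng.normal(0.0, 0.1),
                                   lambda n: 1.0 / factorial(n), p=2.0, rng=rng)
                 for _ in range(200000)]
    print(np.mean(estimates), "vs f(E[X]) =", exp(mu))   # averages converge to e^0.3
```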

3.3 Unbiasing Noise in the RKHS
The third component of our approach involves the unbiased estimation of $\Psi(x_t)$, when we only have unbiased noisy copies of $x_t$. Here again, we have a non-trivial problem, because the feature mapping $\Psi$ is usually highly non-linear, so $\mathbb{E}[\Psi(\tilde x_t)]\ne\Psi(\mathbb{E}[\tilde x_t])$ in general. Moreover, $\Psi$ is not a scalar function, so the technique of Subsection 3.2 will not work as-is.

To tackle this problem, we construct an explicit feature mapping, which needs to be tailored to the kernel we want to use. To give a very simple example, suppose we use the homogeneous 2nd-degree polynomial kernel, $k(r,s)=\langle r,s\rangle^2$. It is not hard to verify that the function $\Psi:\mathbb{R}^d\to\mathbb{R}^{d^2}$, defined via $\Psi(x)=(x_1x_1,\,x_1x_2,\dots,x_1x_d,\,x_2x_1,\dots,x_dx_d)$, is an explicit feature mapping for this kernel. Now, if we query two independent noisy copies $\tilde x,\tilde x'$ of $x$, the expectation of the random vector $(\tilde x_1\tilde x'_1,\,\tilde x_1\tilde x'_2,\dots,\tilde x_d\tilde x'_d)$ is nothing more than $\Psi(x)$. Thus, we can construct unbiased estimates of $\Psi(x)$ in the RKHS. Of course, this example pertains to a very simple RKHS with a finite dimensional representation. By a randomization trick somewhat similar to the one in Subsection 3.2, we can adapt this approach to infinite dimensional RKHS as well. In a nutshell, we represent $\Psi(x)$ as an infinite-dimensional vector, and its noisy unbiased estimate is a vector which is non-zero on only finitely many entries, using finitely many noisy queries. Moreover, inner products between these estimates can be computed efficiently, allowing us to implement the learning algorithms, and to use the resulting predictor on test instances.

4 Main Results

4.1 Algorithm
We present our algorithmic approach in a modular form. We start by introducing the main algorithm, which contains several subroutines. Then we prove our two main results, which bound the regret of the algorithm, the number of queries to the oracle, and the running time, for two types of kernels: dot-product and Gaussian (our results can be extended to other kernel types as well). In itself, the algorithm is nothing more than a standard online gradient descent algorithm with a standard $O(\sqrt T)$ regret bound. Thus, most of the proofs are devoted to a detailed discussion of how the subroutines are implemented (including explicit pseudo-code). In this section, we just describe one subroutine, based on the techniques discussed in Sec. 3. The other subroutines require a more detailed and technical discussion, and thus their implementation is described as part of the proofs in Sec. 5. In any case, the intuition behind the implementations and the techniques used are described in Sec. 3.

For simplicity, we will focus on a finite-horizon setting, where the number of online rounds $T$ is fixed and known to the learner. The algorithm can easily be modified to deal with the infinite horizon setting, where the learner needs to achieve sub-linear regret for all $T$ simultaneously. Also, for the remainder of this subsection, we assume for simplicity that $\ell$ is a classification loss, namely it can be written as a function of $y\langle w,\Psi(x)\rangle$. It is not hard to adapt the results below to the case where $\ell$ is a regression loss (a function of $\langle w,\Psi(x)\rangle-y$).

We note that at each round, the algorithm below constructs an object which we denote as $\tilde\Psi(x_t)$. This object has two interpretations here: formally, it is an element of a reproducing kernel Hilbert space (RKHS) corresponding to the kernel we use, and is equal in expectation to $\Psi(x_t)$. However, in terms of implementation, it is simply a data structure consisting of a finite set of vectors from $\mathbb{R}^d$. Thus, it can be efficiently stored in memory and handled even for infinite-dimensional RKHS.

Algorithm 1: Kernel Learning Algorithm with Noisy Input
Parameters: learning rate $\eta>0$, number of rounds $T$, sample parameter $p>1$.
Initialize: $\alpha_i:=0$ for all $i=1,\dots,T$; $\tilde\Psi(x_i):=\emptyset$ for all $i=1,\dots,T$   // $\tilde\Psi(x_i)$ is a data structure which can store a variable number of vectors in $\mathbb{R}^d$
For $t=1,\dots,T$:
  Define $w_t:=\sum_{i=1}^{t-1}\alpha_i\,\tilde\Psi(x_i)$
  Receive $A_t, y_t$   // the oracle $A_t$ provides noisy estimates of $x_t$
  Let $\tilde\Psi(x_t):=\mathrm{Map\_Estimate}(A_t,p)$   // get unbiased estimate of $\Psi(x_t)$ in the RKHS
  Let $\tilde g_t:=\mathrm{Grad\_Length\_Estimate}(A_t,y_t,p)$   // get unbiased estimate of $y_t\,\ell'(y_t\langle w_t,\Psi(x_t)\rangle)$
  Let $\alpha_t:=-\tilde g_t\,\eta/\sqrt T$   // perform gradient step
  Let $\tilde n_t:=\sum_{i=1}^{t}\sum_{j=1}^{t}\alpha_i\alpha_j\,\mathrm{Prod}\big(\tilde\Psi(x_i),\tilde\Psi(x_j)\big)$   // compute squared norm, where $\mathrm{Prod}(\tilde\Psi(x_i),\tilde\Psi(x_j))$ returns $\langle\tilde\Psi(x_i),\tilde\Psi(x_j)\rangle$
  If $\tilde n_t>B_w$:   // if the squared norm is larger than $B_w$, then project
    Let $\alpha_i:=\alpha_i\sqrt{B_w/\tilde n_t}$ for all $i=1,\dots,t$

Like $\tilde\Psi(x_t)$, $w_{t+1}$ also has two interpretations: formally, it is an element in the RKHS, as defined in the pseudocode; in terms of implementation, it is defined via the data structures $\tilde\Psi(x_1),\dots,\tilde\Psi(x_t)$ and the values of $\alpha_1,\dots,\alpha_t$ at round $t$. To apply this hypothesis on a given instance $x$, we compute $\sum_{i=1}^{t}\alpha_i\,\mathrm{Prod}(\tilde\Psi(x_i),x)$, where $\mathrm{Prod}(\tilde\Psi(x_i),x)$ is a subroutine which returns $\langle\tilde\Psi(x_i),\Psi(x)\rangle$ (pseudocode is provided as part of the proofs later on). A sketch of the main loop in code form is given below.

7 t i α t,iprod Ψx i, x, where Prod Ψx i, x is a subroutine which returns Ψx i, Ψx a seudocode is rovided as art of the roofs later on We now turn to the main results ertaining to the algorithm The first result shows what regret bound is achievable by the algorithm for any dot-roduct kernel, as well as characterize the number of oracle queries er instance, and the overall running time of the algorithm Theorem Assume that the loss function l has an analytic derivative l a γ na n for all a in its domain, and let l +a γ n a n assuming it exists Assume also that the kernel kx, x can be written as Q x, x for all x, x R d Finally, assume that E[ x t 2 ] B x for any x t returned by the oracle at round t, for all t,, T Then, for all B w > 0 and >, it is ossible to imlement the subroutines of Algorithm such that: 2 The exected number of queries to each oracle A t is The exected running time of the algorithm is O T 3 + d 2 / 2 If we run Algorithm with η B w ul + u, where u Bw QB x, then [ T ] T E ly t w t, Ψx t min ly t w, Ψx t l + u ut w : w 2 B w The exectations are with resect to the randomness of the oracles and the algorithm throughout its run We note that the distribution of the number of oracle queries can be secified exlicitly, and it decays very raidly - see the roof for details Also, for simlicity, we only bound the exected regret in the theorem above If the noise is bounded almost surely or with sub-gaussian tails rather than just bounded variance, then it is ossible to obtain similar guarantees with high robability, by relying on Azuma s inequality or variants thereof see for examle [4] We now turn to the case of Gaussian kernels Theorem 2 Assume that the loss function l has an analytic derivative l a γ na n for all a in its domain, and let l +a γ n a n assuming it exists Assume that the kernel kx, x is defined as ex x x 2 /σ 2 Finally, assume that E[ x t 2 ] B x for any x t returned by the oracle at round t, for all t,, T Then for all B w > 0 and > it is ossible to imlement the subroutines of Algorithm such that 3 2 The exected number of queries to each oracle A t is The exected running time of the algorithm is O T 3 + d 2 / If we run Algorithm with η B w ul + u, where 3 B x + 2 B x u B w ex σ 2 then [ T E ly t w t, Ψx t min w : w 2 B w ] T ly t w, Ψx t l + u ut The exectations are with resect to the randomness of the oracles and the algorithm throughout its run As in Thm, note that the number of oracle queries has a fast decaying distribution Also, note that with Gaussian kernels, σ 2 is usually chosen to be on the order of the examle s squared norms Thus, if the noise added to the examles is roortional to their original norm, we can assume that B x /σ 2 O, and thus u which aears in the bound is also bounded by a constant As reviously mentioned, most of the subroutines are described in the roofs section, as art of the roof of Thm Here, we only show how to imlement Grad Length Estimate subroutine, which returns the gradient length estimate g t The idea is based on the technique described in

We prove that $\tilde g_t$ is an unbiased estimate of $y_t\,\ell'\big(y_t\langle w_t,\Psi(x_t)\rangle\big)$, and bound $\mathbb{E}[\tilde g_t^2]$. As discussed earlier, we assume that $\ell'$ is analytic and can be written as $\ell'(a)=\sum_{n=0}^{\infty}\gamma_n a^n$.

Subroutine 1: Grad_Length_Estimate($A_t, y_t, p$)
  Sample a nonnegative integer $n$ according to $\Pr(n)=(p-1)/p^{n+1}$
  For $j=1,\dots,n$: let $\tilde\Psi(x_t)^j:=\mathrm{Map\_Estimate}(A_t,p)$   // get unbiased estimate of $\Psi(x_t)$ in the RKHS
  Return $\tilde g_t := y_t\,\gamma_n\,\frac{p^{\,n+1}}{p-1}\prod_{j=1}^{n}\Big(\sum_{i=1}^{t-1}\alpha_i\,\mathrm{Prod}\big(\tilde\Psi(x_i),\tilde\Psi(x_t)^j\big)\Big)$

Lemma 2. Assume that $\mathbb{E}[\tilde\Psi(x_t)]=\Psi(x_t)$, and that $\mathrm{Prod}(\tilde\Psi(x),\tilde\Psi(x'))$ returns $\langle\tilde\Psi(x),\tilde\Psi(x')\rangle$ for all $x,x'$. Then for any given $w_t=\alpha_1\tilde\Psi(x_1)+\dots+\alpha_{t-1}\tilde\Psi(x_{t-1})$ it holds that

$$\mathbb{E}_t[\tilde g_t]=y_t\,\ell'\big(y_t\langle w_t,\Psi(x_t)\rangle\big) \qquad\text{and}\qquad \mathbb{E}_t[\tilde g_t^2]\le\frac{p}{p-1}\Big(\ell'_+\big(\sqrt{p\,B_w B_{\tilde\Psi}}\big)\Big)^2,$$

where the expectation is with respect to the randomness of Subroutine 1, $\ell'_+(a)=\sum_{n=0}^{\infty}|\gamma_n|a^n$, and $B_{\tilde\Psi}$ is a uniform upper bound on $\mathbb{E}[\|\tilde\Psi(x_t)\|^2]$.

Proof. The result follows from Lemma 1, where $\tilde g_t$ corresponds to the estimator $\theta$, the function $f$ corresponds to $\ell'$, and the random variable $X$ corresponds to $\langle w_t,\tilde\Psi(x_t)\rangle$ (where $\tilde\Psi(x_t)$ is random and $w_t$ is held fixed). The term $\mathbb{E}[X^2]$ in Lemma 1 can be upper bounded as $\mathbb{E}_t\big[\langle w_t,\tilde\Psi(x_t)\rangle^2\big]\le\|w_t\|^2\,\mathbb{E}_t\big[\|\tilde\Psi(x_t)\|^2\big]\le B_w B_{\tilde\Psi}$.

4.2 Loss Function Examples
Theorems 1 and 2 both deal with generic loss functions $\ell$ whose derivative can be written as $\sum_{n=0}^{\infty}\gamma_n a^n$, and the regret bounds involve the functions $\ell'_+(a)=\sum_{n=0}^{\infty}|\gamma_n|a^n$. Below, we present a few examples of loss functions and their corresponding $\ell'_+$. As mentioned earlier, while the theorems in the previous subsection are stated in terms of classification losses (i.e., $\ell$ is a function of $y\langle w,\Psi(x)\rangle$), virtually identical results can be proven for regression losses (i.e., $\ell$ is a function of $\langle w,\Psi(x)\rangle-y$), so we will give examples from both families. Working out the first two examples is straightforward. The proofs of the other two appear in Sec. 5. The loss functions are illustrated graphically in Fig. 1.

Example 1. For the squared loss function, $\ell(\langle w,x\rangle,y)=(\langle w,x\rangle-y)^2$, we have $\ell'_+(u)=2u$.

Example 2. For the exponential loss function, $\ell(\langle w,x\rangle,y)=e^{-y\langle w,x\rangle}$, we have $\ell'_+(u)=e^{u}$.

Example 3. Consider a smoothed absolute loss function $\ell(\langle w,\Psi(x)\rangle-y)$, defined as an antiderivative of $\mathrm{erf}(sa)$ for some $s>0$ (see the proof for its exact analytic form). Then we have that $\ell'_+(u)\le\frac{2\big(e^{s^2u^2}-1\big)}{s\sqrt{\pi}\,u}$.

Example 4. Consider a smoothed hinge loss $\ell(y\langle w,\Psi(x)\rangle)$, defined as an antiderivative of $\big(\mathrm{erf}(sa)-1\big)/2$ for some $s>0$ (see the proof for its exact analytic form). Then we have that $\ell'_+(u)\le\frac{1}{2}+\frac{e^{s^2u^2}-1}{s\sqrt{\pi}\,u}$.

For any $s$, the loss functions in the last two examples are convex, and respectively approximate the absolute loss $|\langle w,\Psi(x)\rangle-y|$ and the hinge loss $\max\{0,\,-y\langle w,\Psi(x)\rangle\}$ arbitrarily well for large enough $s$. Fig. 1 shows these loss functions graphically. Note that $s$ need not be large in order to get a good approximation. Also, we note that both the loss itself and its gradient are computationally easy to evaluate.

Finally, we remind the reader that, as discussed in Subsection 3.2, performing an unbiased estimate of the gradient for non-differentiable losses directly (such as the hinge loss or the absolute loss) appears to be impossible in general. On the flip side, if one is willing to use a random number of queries with polynomial rather than exponential tails, then one can achieve much better sample complexity results, by focusing on loss functions (or approximations thereof) which are only differentiable to a bounded order, rather than fully analytic. This again demonstrates the tradeoff between the sample size and the amount of information that needs to be gathered on each training example.
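For concreteness, here is a small sketch of one natural closed form for the smoothed losses of Examples 3 and 4, obtained from the standard antiderivative $\int\mathrm{erf}(x)\,dx = x\,\mathrm{erf}(x)+e^{-x^2}/\sqrt{\pi}$. The particular integration constants and the function names are our own choices; they match the derivatives stated above.

```python
import math

SQRT_PI = math.sqrt(math.pi)

def smoothed_abs_loss(a, s=1.0):
    """Antiderivative of erf(s*a): approximates |a|, up to a gap that vanishes
    as s grows."""
    return a * math.erf(s * a) + math.exp(-(s * a) ** 2) / (s * SQRT_PI)

def smoothed_hinge_loss(a, s=1.0):
    """Antiderivative of (erf(s*a) - 1)/2, normalized to vanish as a -> +inf:
    approximates max(0, -a), up to a gap that vanishes as s grows."""
    return 0.5 * (a * math.erf(s * a) + math.exp(-(s * a) ** 2) / (s * SQRT_PI) - a)

def smoothed_hinge_grad(a, s=1.0):
    """Derivative of the smoothed hinge loss: tends to -1 for a < 0 and to 0
    for a > 0 as s grows."""
    return 0.5 * (math.erf(s * a) - 1.0)
```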

[Figure 1: Absolute loss and hinge loss, together with their smoothed approximations.]

4.3 One Noisy Copy is Not Enough
The previous results might lead one to wonder whether it is really necessary to query the same instance more than once. In some applications this is inconvenient, and one would prefer a method which works when just a single noisy copy of each instance is made available. In this subsection we show that, unfortunately, such a method cannot be found. Specifically, we prove that under very mild assumptions, no method can achieve sub-linear regret when it has access to just a single noisy copy of each instance. On the other hand, for the case of squared loss and linear kernels, our techniques can be adapted to work with exactly two noisy copies of each instance (see the footnote at the end of this subsection), so without further assumptions, the lower bound that we prove here is indeed tight.

For simplicity, we prove the result for linear kernels (i.e., where $k(x,x')=\langle x,x'\rangle$). It is an interesting open problem to show improved lower bounds when nonlinear kernels are used. We also note that the result crucially relies on the learner not knowing the noise distribution, and we leave to future work the investigation of what happens when this assumption is relaxed.

Theorem 3. Let $W$ be a compact convex subset of $\mathbb{R}^d$, and let $\ell(\cdot,\cdot):\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ satisfy the following: (1) it is bounded from below; (2) it is differentiable with respect to its first argument at $0$, with $\ell'(0,1)<0$. For any learning algorithm which selects hypotheses from $W$ and is allowed access to a single noisy copy of the instance at each round $t$, there exists a strategy for the adversary such that the sequence $w_1,w_2,\dots$ of predictors output by the algorithm satisfies

$$\limsup_{T\to\infty}\;\max_{w\in W}\;\frac{1}{T}\sum_{t=1}^{T}\Big(\ell\big(\langle w_t,x_t\rangle,y_t\big)-\ell\big(\langle w,x_t\rangle,y_t\big)\Big) \;>\; 0$$

with probability 1 with respect to the randomness of the oracles.

Note that condition (1) is satisfied by virtually any loss function other than the linear loss, while condition (2) is satisfied by most regression losses, and by all classification-calibrated losses, which include all reasonable losses for classification (see [13]). The most obvious example where the conditions are not satisfied is when $\ell(\cdot,\cdot)$ is a linear function. This is not surprising, because when $\ell(\cdot,\cdot)$ is linear, the learner is always robust to noise (see the discussion in Sec. 3).

The intuition of the proof is very simple: the adversary chooses beforehand whether the examples are drawn i.i.d. from a distribution $D$ and then perturbed by noise, or drawn i.i.d. from some other distribution $D'$ without adding noise. The distributions $D$, $D'$ and the noise are designed so that the examples observed by the learner are distributed in the same way irrespective of which of the two sampling strategies the adversary chooses. Therefore, it is impossible for the learner, accessing a single copy of each instance, to be statistically consistent with respect to both distributions simultaneously. As a result, the adversary can always choose a distribution on which the algorithm will be inconsistent, leading to constant regret. The full proof is presented in Section 5.3.

Footnote: in a nutshell, for squared loss and linear kernels, we just need to estimate $2\big(\langle w_t,x_t\rangle-y_t\big)x_t$ in an unbiased manner at each round $t$. This can be done by computing $2\big(\langle w_t,\tilde x_t\rangle-y_t\big)\tilde x'_t$, where $\tilde x_t,\tilde x'_t$ are two noisy copies of $x_t$.
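The two-copy trick in the footnote above is simple enough to write down directly; the sketch below (names ours) splits the two independent noisy copies between the two factors of the squared-loss gradient, which keeps the estimate unbiased.

```python
import numpy as np

def two_copy_squared_loss_gradient(w_t, y_t, oracle):
    """Unbiased gradient estimate for the squared loss with a linear kernel,
    using exactly two noisy copies of x_t.

    The true gradient is 2*(<w_t, x_t> - y_t) * x_t.  Since the two noisy
    copies are independent with mean x_t, the returned vector has exactly this
    expectation.  Illustrative sketch: `oracle()` returns a fresh noisy copy.
    """
    x1 = oracle()    # first noisy copy, used inside the inner product
    x2 = oracle()    # second noisy copy, used as the direction
    return 2.0 * (np.dot(w_t, x1) - y_t) * x2
```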

5 Proofs
Due to the lack of space, some of the proofs are given in [7].

5.1 Preliminary Result
To prove Thm. 1 and Thm. 2, we need a theorem which basically states that if all subroutines in Algorithm 1 behave as they should, then one can achieve an $O(\sqrt T)$ regret bound. This is provided in the following theorem, which is an adaptation of a standard result of online convex optimization (see, e.g., [18]). The proof is given in [7].

Theorem 4. Assume the following conditions hold with respect to Algorithm 1:
1. For all $t$, $\tilde\Psi(x_t)$ and $\tilde g_t$ are independent of each other (as random variables induced by the randomness of Algorithm 1), as well as independent of any $\tilde\Psi(x_i)$ and $\tilde g_i$ for $i<t$.
2. For all $t$, $\mathbb{E}[\tilde\Psi(x_t)]=\Psi(x_t)$, and there exists a constant $B_{\tilde\Psi}>0$ such that $\mathbb{E}[\|\tilde\Psi(x_t)\|^2]\le B_{\tilde\Psi}$.
3. For all $t$, $\mathbb{E}[\tilde g_t]=y_t\,\ell'\big(y_t\langle w_t,\Psi(x_t)\rangle\big)$, and there exists a constant $B_g>0$ such that $\mathbb{E}[\tilde g_t^2]\le B_g$.
4. For any pair of instances $x,x'$, $\mathrm{Prod}\big(\tilde\Psi(x),\tilde\Psi(x')\big)=\langle\tilde\Psi(x),\tilde\Psi(x')\rangle$.
Then if Algorithm 1 is run with $\eta=\sqrt{B_w/(B_g B_{\tilde\Psi})}$, the following inequality holds:

$$\mathbb{E}\left[\sum_{t=1}^{T}\ell\big(y_t\langle w_t,\Psi(x_t)\rangle\big)\right] - \min_{w:\|w\|^2\le B_w}\sum_{t=1}^{T}\ell\big(y_t\langle w,\Psi(x_t)\rangle\big) \;\le\; \sqrt{B_w B_g B_{\tilde\Psi}\,T},$$

where the expectation is with respect to the randomness of the oracles and the algorithm throughout its run.

5.2 Proof of Thm. 1
In this subsection, we present the proof of Thm. 1. We first show how to implement the subroutines of Algorithm 1, and prove the relevant results on their behavior. Then, we prove the theorem itself.

It is known that for $k(x,x')=Q(\langle x,x'\rangle)$ to be a valid kernel, it is necessary that $Q(\langle x,x'\rangle)$ can be written as a Taylor expansion $\sum_{n=0}^{\infty}\beta_n\langle x,x'\rangle^n$, where $\beta_n\ge 0$ (see [15]). This makes these types of kernels amenable to our techniques. We start by constructing an explicit feature mapping $\Psi$ corresponding to the RKHS induced by our kernel. For any $x,x'$, we have that

$$k(x,x')=\sum_{n=0}^{\infty}\beta_n\langle x,x'\rangle^n=\sum_{n=0}^{\infty}\beta_n\Big(\sum_{i=1}^{d}x_i x'_i\Big)^{n} = \sum_{n=0}^{\infty}\beta_n\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}x_{k_1}\cdots x_{k_n}\,x'_{k_1}\cdots x'_{k_n} = \sum_{n=0}^{\infty}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}\big(\sqrt{\beta_n}\,x_{k_1}\cdots x_{k_n}\big)\big(\sqrt{\beta_n}\,x'_{k_1}\cdots x'_{k_n}\big).$$

This suggests the following feature representation: for any $x$, $\Psi(x)$ returns an infinite-dimensional vector, indexed by $n$ and $k_1,\dots,k_n\in\{1,\dots,d\}$, with the entry corresponding to $(n,k_1,\dots,k_n)$ being $\sqrt{\beta_n}\,x_{k_1}\cdots x_{k_n}$. The dot product between $\Psi(x)$ and $\Psi(x')$ is analogous to a standard dot product between two vectors, and by the derivation above equals $k(x,x')$ as required.

We now use a slightly more elaborate variant of our unbiased estimation technique to derive an unbiased estimate of $\Psi(x)$. First, we sample $N$ according to $\Pr(N=n)=(p-1)/p^{n+1}$. Then, we query the oracle for $x$ for $N$ times to get $\tilde x^1,\dots,\tilde x^N$, and formally define $\tilde\Psi(x)$ as

$$\tilde\Psi(x)=\frac{p^{N+1}}{p-1}\,\sqrt{\beta_N}\sum_{k_1=1}^{d}\cdots\sum_{k_N=1}^{d}\tilde x^1_{k_1}\cdots\tilde x^N_{k_N}\;e_{N,k_1,\dots,k_N}, \qquad (2)$$

where $e_{n,k_1,\dots,k_n}$ represents the unit vector in the direction indexed by $(n,k_1,\dots,k_n)$ as explained above. Since the oracle queries are i.i.d., the expectation of this expression is

$$\sum_{n=0}^{\infty}\frac{p-1}{p^{\,n+1}}\cdot\frac{p^{\,n+1}}{p-1}\,\sqrt{\beta_n}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}\mathbb{E}\big[\tilde x^1_{k_1}\big]\cdots\mathbb{E}\big[\tilde x^n_{k_n}\big]\;e_{n,k_1,\dots,k_n} \;=\; \sum_{n=0}^{\infty}\sqrt{\beta_n}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}x_{k_1}\cdots x_{k_n}\;e_{n,k_1,\dots,k_n},$$

which is exactly $\Psi(x)$. We formalize the needed properties of $\tilde\Psi(x)$ in the following lemma.

Lemma 3. Assuming $\tilde\Psi(x)$ is constructed as in the discussion above, it holds that $\mathbb{E}[\tilde\Psi(x)]=\Psi(x)$ for any $x$. Moreover, if the noisy samples $\tilde x_t$ returned by the oracle $A_t$ satisfy $\mathbb{E}[\|\tilde x_t\|^2]\le B_x$, then

$$\mathbb{E}\big[\|\tilde\Psi(x_t)\|^2\big]\le\frac{p}{p-1}\,Q(pB_x),$$

where we recall that $Q$ defines the kernel by $k(x,x')=Q(\langle x,x'\rangle)$.

Proof. The first part of the lemma follows from the discussion above. As to the second part, note that by (2),

$$\mathbb{E}\big[\|\tilde\Psi(x_t)\|^2\big] = \mathbb{E}\left[\frac{p^{2N+2}}{(p-1)^2}\,\beta_N\sum_{k_1,\dots,k_N=1}^{d}\big(\tilde x^1_{t,k_1}\cdots\tilde x^N_{t,k_N}\big)^2\right] = \mathbb{E}\left[\frac{p^{2N+2}}{(p-1)^2}\,\beta_N\prod_{j=1}^{N}\|\tilde x^j_t\|^2\right]$$
$$= \sum_{n=0}^{\infty}\frac{p-1}{p^{\,n+1}}\cdot\frac{p^{2n+2}}{(p-1)^2}\,\beta_n\prod_{j=1}^{n}\mathbb{E}\big[\|\tilde x^j_t\|^2\big] \;\le\; \frac{p}{p-1}\sum_{n=0}^{\infty}\beta_n\big(pB_x\big)^n \;=\; \frac{p}{p-1}\,Q(pB_x),$$

where the second-to-last step used the fact that $\beta_n\ge 0$ for all $n$.

Of course, explicitly storing $\tilde\Psi(x)$ as defined above is infeasible, since the number of entries is huge. Fortunately, this is not needed: we just need to store $\tilde x^1_t,\dots,\tilde x^N_t$. The representation above is used implicitly when we calculate dot products between $\tilde\Psi(x)$ and other elements in the RKHS, via the subroutine Prod. We note that while $N$ is a random quantity which might be unbounded, its distribution decays exponentially fast, so the number of vectors to store is essentially bounded. After the discussion above, the pseudocode for Map_Estimate below should be self-explanatory.

Subroutine 2: Map_Estimate($A_t, p$)
  Sample a nonnegative integer $N$ according to $\Pr(N=n)=(p-1)/p^{n+1}$
  Query $A_t$ for $N$ times to get $\tilde x^1_t,\dots,\tilde x^N_t$
  Return $(\tilde x^1_t,\dots,\tilde x^N_t)$ as $\tilde\Psi(x_t)$

We now turn to the subroutine Prod, which, given two elements in the RKHS, returns their dot product. This subroutine comes in two flavors: either as a procedure defined over $\tilde\Psi(x),\tilde\Psi(x')$, returning $\langle\tilde\Psi(x),\tilde\Psi(x')\rangle$ (Subroutine 3); or as a procedure defined over $\tilde\Psi(x)$ and $x'$ (Subroutine 4), where the second element is an explicitly given vector, returning $\langle\tilde\Psi(x),\Psi(x')\rangle$. This second variant of Prod is needed when we wish to apply the learned predictor on a new given instance.

Subroutine 3: Prod($\tilde\Psi(x),\tilde\Psi(x')$)
  Let $N$ and $\tilde x^1,\dots,\tilde x^N$ be the index and vectors comprising $\tilde\Psi(x)$
  Let $N'$ and $\tilde x'^1,\dots,\tilde x'^{N'}$ be the index and vectors comprising $\tilde\Psi(x')$
  If $N\ne N'$ return $0$; else return $\beta_N\,\frac{p^{2N+2}}{(p-1)^2}\prod_{j=1}^{N}\langle\tilde x^j,\tilde x'^j\rangle$

Lemma 4. Prod($\tilde\Psi(x),\tilde\Psi(x')$) returns $\langle\tilde\Psi(x),\tilde\Psi(x')\rangle$.

Proof. Using the formal representation of $\tilde\Psi(x),\tilde\Psi(x')$ in (2), we have that $\langle\tilde\Psi(x),\tilde\Psi(x')\rangle$ is $0$ whenever $N\ne N'$, because then these two elements are composed of different unit vectors with respect to an orthogonal basis. Otherwise, we have that

$$\langle\tilde\Psi(x),\tilde\Psi(x')\rangle = \beta_N\,\frac{p^{2N+2}}{(p-1)^2}\sum_{k_1=1}^{d}\cdots\sum_{k_N=1}^{d}\tilde x^1_{k_1}\cdots\tilde x^N_{k_N}\,\tilde x'^1_{k_1}\cdots\tilde x'^N_{k_N} = \beta_N\,\frac{p^{2N+2}}{(p-1)^2}\prod_{j=1}^{N}\langle\tilde x^j,\tilde x'^j\rangle,$$

which is exactly what the algorithm returns, hence the lemma follows.
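The following is a minimal sketch of Subroutines 2 and 3, assuming the kernel's Taylor coefficients are available as a function `beta(n)`; the representation of $\tilde\Psi(x)$ is simply the list of noisy copies, as in Subroutine 2. Names and signatures are ours; the variant for an explicitly given instance (Subroutine 4 below) is analogous.

```python
import numpy as np

def map_estimate(oracle, p, rng=None):
    """Subroutine 2 sketch: an unbiased estimate of Psi(x) for a dot-product
    kernel k(x, x') = Q(<x, x'>) with Q(z) = sum_n beta(n) z^n.

    Returns the list of noisy copies, which implicitly represents ~Psi(x);
    `oracle()` returns a fresh noisy copy of x.
    """
    rng = rng or np.random.default_rng()
    N = rng.geometric(1.0 - 1.0 / p) - 1      # Pr(N = n) = (p-1)/p^(n+1)
    return [oracle() for _ in range(N)]

def prod(psi_a, psi_b, p, beta):
    """Subroutine 3 sketch: <~Psi(x), ~Psi(x')> computed from the stored copies."""
    n = len(psi_a)
    if n != len(psi_b):
        return 0.0
    out = beta(n) * p ** (2 * n + 2) / (p - 1.0) ** 2
    for a, b in zip(psi_a, psi_b):
        out *= float(np.dot(a, b))
    return out
```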

The pseudocode for calculating the dot product $\langle\tilde\Psi(x),\Psi(x')\rangle$, where $x'$ is explicitly known, is very similar, and the proof is essentially the same.

Subroutine 4: Prod($\tilde\Psi(x), x'$)
  Let $N$ and $\tilde x^1,\dots,\tilde x^N$ be the index and vectors comprising $\tilde\Psi(x)$
  Return $\beta_N\,\frac{p^{N+1}}{p-1}\prod_{j=1}^{N}\langle\tilde x^j, x'\rangle$

We are now ready to prove Thm. 1. First, regarding the expected number of queries, notice that to run Algorithm 1 we invoke Map_Estimate once and Grad_Length_Estimate once at each round $t$. Map_Estimate uses a random number $B$ of queries distributed as $\Pr(B=n)=(p-1)/p^{n+1}$, and Grad_Length_Estimate invokes Map_Estimate a random number $C$ of times, where $C$ is also distributed as $\Pr(C=n)=(p-1)/p^{n+1}$. The total number of queries is therefore $\sum_{j=1}^{C+1}B_j$, where the $B_j$ are i.i.d. copies of $B$. The expected value of this expression, using a standard result on the expected value of a sum of a random number of independent random variables, is equal to $(1+\mathbb{E}[C])\,\mathbb{E}[B_j]$, or $\frac{1}{p-1}+\frac{1}{(p-1)^2}$.

In terms of running time, we note that the expected running time of Prod is $O\big(1+\frac{d}{p-1}\big)$: this is because it performs $N$ multiplications of inner products, each with running time $O(d)$, and $\mathbb{E}[N]=\frac{1}{p-1}$. The expected running time of Map_Estimate is $O\big(1+\frac{d}{p-1}\big)$. The expected running time of Grad_Length_Estimate is $O\big(\frac{1}{p-1}\,T\,\big(1+\frac{d}{p-1}\big)\big)$, which can be written as $O\big(1+\frac{T}{p-1}+\frac{Td}{(p-1)^2}\big)$. Since Algorithm 1, at each of the $T$ rounds, calls Map_Estimate once, Grad_Length_Estimate once, Prod $O(T^2)$ times, and performs $O(T)$ other operations, we get that the overall runtime is

$$O\left(T\left(\frac{T}{p-1}+\frac{Td}{(p-1)^2}+T^2\Big(1+\frac{d}{p-1}\Big)\right)\right),$$

which we can upper bound by $O\!\left(T^3\Big(1+\frac{d\,p^2}{(p-1)^2}\Big)\right)$.

The regret bound in the theorem follows from Thm. 4, with the expressions for the constants following from Lemma 2, Lemma 3, and Lemma 4.

5.3 Proof Sketch of Thm. 3
To prove the theorem, we use a more general result which leads to non-vanishing regret, and then show that under the assumptions of Thm. 3, the result holds. The proof of the result is given in [7].

Theorem 5. Let $W$ be a compact convex subset of $\mathbb{R}^d$ and pick any learning algorithm which selects hypotheses from $W$ and is allowed access to a single noisy copy of the instance at each round $t$. If there exists a distribution over a compact subset of $\mathbb{R}^d$ such that

$$\arg\min_{w\in W}\,\mathbb{E}\big[\ell(\langle w,x\rangle,1)\big] \qquad\text{and}\qquad \arg\min_{w\in W}\,\ell\big(\langle w,\mathbb{E}[x]\rangle,1\big) \qquad (3)$$

are disjoint, then there exists a strategy for the adversary such that the sequence $w_1,w_2,\dots\in W$ of predictors output by the algorithm satisfies

$$\limsup_{T\to\infty}\;\max_{w\in W}\;\frac{1}{T}\sum_{t=1}^{T}\Big(\ell\big(\langle w_t,x_t\rangle,y_t\big)-\ell\big(\langle w,x_t\rangle,y_t\big)\Big) \;>\; 0$$

with probability 1 with respect to the randomness of the oracles.

Another way to phrase this theorem is that the regret cannot vanish if, given examples sampled i.i.d. from a distribution, the learning problem is more complicated than just finding the mean of the data. Indeed, the adversary's strategy we choose later on is simply drawing and presenting examples from such a distribution. Below, we sketch how we use Thm. 5 in order to prove Thm. 3. A full proof is provided in [7].
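As a sanity check on the per-round query count derived above, $(1+\mathbb{E}[C])\,\mathbb{E}[B]=\frac{1}{p-1}+\frac{1}{(p-1)^2}=\frac{p}{(p-1)^2}$, here is a small Monte-Carlo sketch (ours, for illustration only).

```python
import numpy as np

def simulate_query_count(p, num_rounds=200000, rng=None):
    """Monte-Carlo check of the expected per-round number of oracle queries.

    Per round, Map_Estimate is called once directly and C more times inside
    Grad_Length_Estimate, where C and each per-call query count B are all
    distributed as Pr(n) = (p-1)/p^(n+1); the expected total is p/(p-1)^2.
    """
    rng = rng or np.random.default_rng(0)
    q = 1.0 - 1.0 / p                        # success probability of the geometric draw
    totals = np.empty(num_rounds)
    for r in range(num_rounds):
        C = rng.geometric(q) - 1             # extra Map_Estimate calls
        Bs = rng.geometric(q, size=C + 1) - 1
        totals[r] = Bs.sum()
    return totals.mean()

if __name__ == "__main__":
    for p in (1.5, 2.0, 3.0):
        print(p, simulate_query_count(p), "theory:", p / (p - 1.0) ** 2)
```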

We construct a very simple one-dimensional distribution which satisfies the conditions of Thm. 5: it is simply the uniform distribution on $\{3x,-x\}$, where $x$ is the vector $(1,0,\dots,0)$. Thus, it is enough to show that

$$\arg\min_{w:\|w\|^2\le B_w}\big(\ell(3w_1,1)+\ell(-w_1,1)\big) \qquad\text{and}\qquad \arg\min_{w:\|w\|^2\le B_w}\ell(w_1,1) \qquad (4)$$

are disjoint, for some appropriately chosen $B_w$. Assuming the contrary, then under the assumptions on $\ell$, we show that the first set in Eq. (4) lies inside a bounded ball around the origin, in a way which is independent of $B_w$, no matter how large it is. Thus, if we pick $B_w$ to be large enough, and assume that the two sets in Eq. (4) are not disjoint, then there must be some $w$ such that both $\ell(3w_1,1)+\ell(-w_1,1)$ and $\ell(w_1,1)$ have a subgradient of zero at $w$. However, this can be shown to contradict the assumptions on $\ell$, leading to the desired result.

6 Future Work
There are several interesting research directions worth pursuing in the noisy learning framework introduced here. For instance, doing away with unbiasedness could lead to the design of estimators that are applicable to more types of loss functions, for which unbiased estimators may not even exist. Also, it would be interesting to show how additional information one has about the noise distribution can be used to design improved estimates, possibly in association with specific losses or kernels. Another open question is whether our lower bound (Thm. 3) can be improved when nonlinear kernels are used.

References
[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, 2008.
[2] S. Bhandari and A. Bose. Existence of unbiased estimators in sequential binomial experiments. Sankhyā: The Indian Journal of Statistics, 52, 1990.
[3] N. Bshouty, J. Jackson, and C. Tamon. Uniform-distribution attribute noise learnability. Information and Computation, 187(2), 2003.
[4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), September 2004.
[5] N. Cesa-Bianchi, E. Dichterman, P. Fischer, E. Shamir, and H. Simon. Sample-efficient strategies for learning in the presence of noise. Journal of the ACM, 46(5):684–719, 1999.
[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[7] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Online learning of noisy data with kernels. Technical report, available on arXiv.
[8] A. Flaxman, A. Tauman Kalai, and H. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of SODA, 2005.
[9] S. Goldman and R. Sloan. Can PAC learning algorithms tolerate random attribute noise? Algorithmica, 14:70–84, 1995.
[10] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 1993.
[11] N. Littlestone. Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proceedings of COLT, 1991.
[12] D. Nettleton, A. Orriols-Puig, and A. Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 2010.
[13] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.
[14] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
[15] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[16] R. Singh. Existence of unbiased estimates. Sankhyā: The Indian Journal of Statistics, 26:93–96, 1964.
[17] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[18] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML, 2003.


More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

q-ary Symmetric Channel for Large q

q-ary Symmetric Channel for Large q List-Message Passing Achieves Caacity on the q-ary Symmetric Channel for Large q Fan Zhang and Henry D Pfister Deartment of Electrical and Comuter Engineering, Texas A&M University {fanzhang,hfister}@tamuedu

More information

On a Markov Game with Incomplete Information

On a Markov Game with Incomplete Information On a Markov Game with Incomlete Information Johannes Hörner, Dinah Rosenberg y, Eilon Solan z and Nicolas Vieille x{ January 24, 26 Abstract We consider an examle of a Markov game with lack of information

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

Introduction to Probability and Statistics

Introduction to Probability and Statistics Introduction to Probability and Statistics Chater 8 Ammar M. Sarhan, asarhan@mathstat.dal.ca Deartment of Mathematics and Statistics, Dalhousie University Fall Semester 28 Chater 8 Tests of Hyotheses Based

More information

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables Imroved Bounds on Bell Numbers and on Moments of Sums of Random Variables Daniel Berend Tamir Tassa Abstract We rovide bounds for moments of sums of sequences of indeendent random variables. Concentrating

More information

Convex Games in Banach Spaces

Convex Games in Banach Spaces Technical Reort TTIC-TR-2009-6 Setember 2009 Convex Games in Banach Saces Karthik Sridharan Toyota Technological Institute at Chicago karthik@tti-c.org Ambuj Tewari Toyota Technological Institute at Chicago

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

EE 508 Lecture 13. Statistical Characterization of Filter Characteristics

EE 508 Lecture 13. Statistical Characterization of Filter Characteristics EE 508 Lecture 3 Statistical Characterization of Filter Characteristics Comonents used to build filters are not recisely redictable L C Temerature Variations Manufacturing Variations Aging Model variations

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

Estimating Time-Series Models

Estimating Time-Series Models Estimating ime-series Models he Box-Jenkins methodology for tting a model to a scalar time series fx t g consists of ve stes:. Decide on the order of di erencing d that is needed to roduce a stationary

More information

Almost 4000 years ago, Babylonians had discovered the following approximation to. x 2 dy 2 =1, (5.0.2)

Almost 4000 years ago, Babylonians had discovered the following approximation to. x 2 dy 2 =1, (5.0.2) Chater 5 Pell s Equation One of the earliest issues graled with in number theory is the fact that geometric quantities are often not rational. For instance, if we take a right triangle with two side lengths

More information

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK Towards understanding the Lorenz curve using the Uniform distribution Chris J. Stehens Newcastle City Council, Newcastle uon Tyne, UK (For the Gini-Lorenz Conference, University of Siena, Italy, May 2005)

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE. 1. Introduction

ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE. 1. Introduction ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE GUSTAVO GARRIGÓS ANDREAS SEEGER TINO ULLRICH Abstract We give an alternative roof and a wavelet analog of recent results

More information

Some results of convex programming complexity

Some results of convex programming complexity 2012c12 $ Ê Æ Æ 116ò 14Ï Dec., 2012 Oerations Research Transactions Vol.16 No.4 Some results of convex rogramming comlexity LOU Ye 1,2 GAO Yuetian 1 Abstract Recently a number of aers were written that

More information

On Wald-Type Optimal Stopping for Brownian Motion

On Wald-Type Optimal Stopping for Brownian Motion J Al Probab Vol 34, No 1, 1997, (66-73) Prerint Ser No 1, 1994, Math Inst Aarhus On Wald-Tye Otimal Stoing for Brownian Motion S RAVRSN and PSKIR The solution is resented to all otimal stoing roblems of

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Ketan N. Patel, Igor L. Markov and John P. Hayes University of Michigan, Ann Arbor 48109-2122 {knatel,imarkov,jhayes}@eecs.umich.edu

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction GOOD MODELS FOR CUBIC SURFACES ANDREAS-STEPHAN ELSENHANS Abstract. This article describes an algorithm for finding a model of a hyersurface with small coefficients. It is shown that the aroach works in

More information

An Estimate For Heilbronn s Exponential Sum

An Estimate For Heilbronn s Exponential Sum An Estimate For Heilbronn s Exonential Sum D.R. Heath-Brown Magdalen College, Oxford For Heini Halberstam, on his retirement Let be a rime, and set e(x) = ex(2πix). Heilbronn s exonential sum is defined

More information

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Location of solutions for quasi-linear elliptic equations with general gradient dependence Electronic Journal of Qualitative Theory of Differential Equations 217, No. 87, 1 1; htts://doi.org/1.14232/ejqtde.217.1.87 www.math.u-szeged.hu/ejqtde/ Location of solutions for quasi-linear ellitic equations

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Numerous alications in statistics, articularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a ij, where i indexes the rows and

More information

Estimation of Separable Representations in Psychophysical Experiments

Estimation of Separable Representations in Psychophysical Experiments Estimation of Searable Reresentations in Psychohysical Exeriments Michele Bernasconi (mbernasconi@eco.uninsubria.it) Christine Choirat (cchoirat@eco.uninsubria.it) Raffaello Seri (rseri@eco.uninsubria.it)

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

A Special Case Solution to the Perspective 3-Point Problem William J. Wolfe California State University Channel Islands

A Special Case Solution to the Perspective 3-Point Problem William J. Wolfe California State University Channel Islands A Secial Case Solution to the Persective -Point Problem William J. Wolfe California State University Channel Islands william.wolfe@csuci.edu Abstract In this aer we address a secial case of the ersective

More information

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems Int. J. Oen Problems Comt. Math., Vol. 3, No. 2, June 2010 ISSN 1998-6262; Coyright c ICSRS Publication, 2010 www.i-csrs.org Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

Stochastic integration II: the Itô integral

Stochastic integration II: the Itô integral 13 Stochastic integration II: the Itô integral We have seen in Lecture 6 how to integrate functions Φ : (, ) L (H, E) with resect to an H-cylindrical Brownian motion W H. In this lecture we address the

More information

MATH 250: THE DISTRIBUTION OF PRIMES. ζ(s) = n s,

MATH 250: THE DISTRIBUTION OF PRIMES. ζ(s) = n s, MATH 50: THE DISTRIBUTION OF PRIMES ROBERT J. LEMKE OLIVER For s R, define the function ζs) by. Euler s work on rimes ζs) = which converges if s > and diverges if s. In fact, though we will not exloit

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Impossibility of a Quantum Speed-up with a Faulty Oracle

Impossibility of a Quantum Speed-up with a Faulty Oracle Imossibility of a Quantum Seed-u with a Faulty Oracle Oded Regev Liron Schiff Abstract We consider Grover s unstructured search roblem in the setting where each oracle call has some small robability of

More information

Hotelling s Two- Sample T 2

Hotelling s Two- Sample T 2 Chater 600 Hotelling s Two- Samle T Introduction This module calculates ower for the Hotelling s two-grou, T-squared (T) test statistic. Hotelling s T is an extension of the univariate two-samle t-test

More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information

Equivalence of Wilson actions

Equivalence of Wilson actions Prog. Theor. Ex. Phys. 05, 03B0 7 ages DOI: 0.093/te/tv30 Equivalence of Wilson actions Physics Deartment, Kobe University, Kobe 657-850, Jaan E-mail: hsonoda@kobe-u.ac.j Received June 6, 05; Revised August

More information

Analysis of some entrance probabilities for killed birth-death processes

Analysis of some entrance probabilities for killed birth-death processes Analysis of some entrance robabilities for killed birth-death rocesses Master s Thesis O.J.G. van der Velde Suervisor: Dr. F.M. Sieksma July 5, 207 Mathematical Institute, Leiden University Contents Introduction

More information

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction : An Algorithm for Aroximating Sarse Kernel Reconstruction Gregory Ditzler Det. of Electrical and Comuter Engineering The University of Arizona Tucson, AZ 8572 USA ditzler@email.arizona.edu Nidhal Carla

More information

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at A Scaling Result for Exlosive Processes M. Mitzenmacher Λ J. Sencer We consider the following balls and bins model, as described in [, 4]. Balls are sequentially thrown into bins so that the robability

More information

Sets of Real Numbers

Sets of Real Numbers Chater 4 Sets of Real Numbers 4. The Integers Z and their Proerties In our revious discussions about sets and functions the set of integers Z served as a key examle. Its ubiquitousness comes from the fact

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA Proceedings of the 2011 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelsach, K. P. White, and M. Fu, eds. EFFICIENT RARE EVENT SIMULATION FOR HEAVY-TAILED SYSTEMS VIA CROSS ENTROPY Jose

More information

LEIBNIZ SEMINORMS IN PROBABILITY SPACES

LEIBNIZ SEMINORMS IN PROBABILITY SPACES LEIBNIZ SEMINORMS IN PROBABILITY SPACES ÁDÁM BESENYEI AND ZOLTÁN LÉKA Abstract. In this aer we study the (strong) Leibniz roerty of centered moments of bounded random variables. We shall answer a question

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls CIS 410/510, Introduction to Quantum Information Theory Due: June 8th, 2016 Sring 2016, University of Oregon Date: June 7, 2016 Fault Tolerant Quantum Comuting Robert Rogers, Thomas Sylwester, Abe Pauls

More information

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014 Morten Frydenberg Section for Biostatistics Version :Friday, 05 Setember 204 All models are aroximations! The best model does not exist! Comlicated models needs a lot of data. lower your ambitions or get

More information

Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Application

Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Application BULGARIA ACADEMY OF SCIECES CYBEREICS AD IFORMAIO ECHOLOGIES Volume 9 o 3 Sofia 009 Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Alication Svetoslav Savov Institute of Information

More information

Machine Learning: Homework 4

Machine Learning: Homework 4 10-601 Machine Learning: Homework 4 Due 5.m. Monday, February 16, 2015 Instructions Late homework olicy: Homework is worth full credit if submitted before the due date, half credit during the next 48 hours,

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE

More information

Best approximation by linear combinations of characteristic functions of half-spaces

Best approximation by linear combinations of characteristic functions of half-spaces Best aroximation by linear combinations of characteristic functions of half-saces Paul C. Kainen Deartment of Mathematics Georgetown University Washington, D.C. 20057-1233, USA Věra Kůrková Institute of

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

A Note on Guaranteed Sparse Recovery via l 1 -Minimization

A Note on Guaranteed Sparse Recovery via l 1 -Minimization A Note on Guaranteed Sarse Recovery via l -Minimization Simon Foucart, Université Pierre et Marie Curie Abstract It is roved that every s-sarse vector x C N can be recovered from the measurement vector

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information