Robust polynomial regression up to the information theoretic limit


58th Annual IEEE Symposium on Foundations of Computer Science

Robust polynomial regression up to the information theoretic limit

Daniel Kane
Department of Computer Science and Engineering / Department of Mathematics, University of California, San Diego, La Jolla, CA, USA
dakane@ucsd.edu

Sushrut Karmalkar, Eric Price
Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
{sushrutk, ecprice}@cs.utexas.edu

Abstract—We consider the problem of robust polynomial regression, where one receives samples that are usually within a small additive error of a target polynomial, but have a chance of being arbitrary adversarial outliers. Previously, it was known how to efficiently estimate the target polynomial only when the outlier probability was subconstant in the degree of the target polynomial. We give an algorithm that works for the entire feasible range of outlier probabilities, while simultaneously improving other parameters of the problem. We complement our algorithm, which gives a factor-2 approximation, with impossibility results that show, for example, that a 1.09 approximation is impossible even with infinitely many samples.

Keywords—polynomial regression; robust regression; robust recovery; learning; approximation

I. INTRODUCTION

Polynomial regression is the problem of finding a polynomial that passes near a collection of input points. The problem has been studied for 200 years with diverse applications [1], including computer graphics [2], machine learning [3], and statistics [4]. As with linear regression, the classic solution to polynomial regression is least squares, but this is not robust to outliers: a single outlier point can perturb the least squares solution by an arbitrary amount. Hence finding a robust method of polynomial regression is an important question.

A version of this problem was formalized by Arora and Khot [5]. We want to learn a degree-d polynomial p : [−1, 1] → R, and we observe (x_i, y_i) where x_i is drawn from some measure χ (say, uniform) and y_i = p(x_i) + w_i for some noise w_i. Each sample is an outlier independently with probability at most ρ; if the sample is not an outlier, then |w_i| ≤ σ for some σ. Other than the random choice of outlier locations, the noise is adversarial. One would like to recover p̂ with ‖p − p̂‖_∞ ≤ Cσ using as few samples as possible, with high probability.

In other contexts an ℓ_∞ requirement on the input noise would be a significant restriction, but outlier tolerance makes it much less so. For example, independent bounded-variance noise fits in this framework: by Chebyshev's inequality, the framework applies with outlier chance ρ = 1/100 and σ = 10·E[w_i^2]^{1/2}, which gives the ideal result up to constant factors. At the same time, the ℓ_∞ requirement on the input allows for the strong ℓ_∞ bound on the output.

Arora and Khot [5] showed that ρ < 1/2 is information-theoretically necessary for non-trivial recovery, and showed how to solve the problem using a linear program when ρ = 0. For ρ > 0, the RANSAC [6] technique of fitting to a subsample of the data works in polynomial time when ρ ≲ (log d)/d. Finding an efficient algorithm for ρ ≫ (log d)/d remained open until recently. Last year, Guruswami and Zuckerman [7] showed how to solve the problem in polynomial time for ρ larger than (log d)/d. Unfortunately, their result falls short of the ideal in several ways: it needs a low outlier rate (ρ < 1/log d), a bounded signal-to-noise ratio (‖p‖_∞/σ < d^{O(1)}), and has a super-constant approximation factor (‖p − p̂‖_∞ ≲ σ·(1 + ‖p‖_∞/σ)^{0.01}). It uses O(d^2 log^c d) samples from the uniform measure, or O(d log^c d) samples from the Chebyshev measure.
These deficiencies mean that the algorithm doesn't work when all the noise comes from outliers (σ = 0); the low outlier rate required also leads to a superconstant factor loss when reducing from other noise models. In this work, we give a simple algorithm that avoids all these deficiencies: it works for all ρ < 1/2; it has no restrictions on σ; and it gets a constant approximation factor C = 2 + ε. Additionally, it only uses O(d^2) samples from the uniform measure, without any log factors, and O(d log d) samples from the Chebyshev measure.

We also give lower bounds for the approximation factor C, indicating that one cannot achieve a 1 + ε approximation by showing that C > 1.09. Our lower bounds are not the best possible, and it is possible that the true lower bounds are much better.
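To make the sampling model above concrete, the following minimal Python sketch (not part of the paper; the function name, the outlier magnitudes, and the use of numpy are our own illustrative choices) draws samples with Chebyshev-distributed x_i, bounded inlier noise, and a ρ chance of an arbitrary outlier.

    import numpy as np

    def sample_robust_regression(p_coeffs, n, sigma, rho, rng):
        """Draw n samples (x_i, y_i) from the model of Section I.

        x_i ~ Chebyshev measure on [-1, 1]; inliers get bounded noise
        |w_i| <= sigma; outliers (probability rho) are arbitrary."""
        # Chebyshev measure: x = cos(theta) with theta uniform on [0, pi].
        x = np.cos(rng.uniform(0.0, np.pi, size=n))
        y = np.polyval(p_coeffs, x) + rng.uniform(-sigma, sigma, size=n)
        outlier = rng.random(n) < rho
        # Outliers may be placed anywhere; here they are wildly wrong values.
        y[outlier] = rng.uniform(-100.0, 100.0, size=outlier.sum())
        return x, y

    rng = np.random.default_rng(0)
    p = np.array([2.0, -1.0, 0.5])          # 2x^2 - x + 0.5, degree d = 2
    x, y = sample_robust_regression(p, n=200, sigma=0.1, rho=0.3, rng=rng)

Drawing x = cos(θ) with θ uniform on [0, π] is exactly the Chebyshev measure 1/(π√(1 − x^2)) that appears later in Corollary I.5.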

A. Algorithmic results

The problem formulation has randomly placed outliers, which is needed to ensure that we can estimate the polynomial locally around any point x. We use the following definition to encapsulate this requirement, after which the noise can be fully adversarial:

Definition I.1. The size-m Chebyshev partition of [−1, 1] is the set of intervals I_j = [cos(πj/m), cos(π(j−1)/m)] for j ∈ [m]. We say that a set S of samples (x_i, y_i) is α-good for the partition if, in every interval I_j, strictly less than an α fraction of the points x_i ∈ I_j are outliers.

The size-m Chebyshev partition is the set of intervals between consecutive extrema of the Chebyshev polynomial of degree m. Standard approximation theory recommends sampling functions at the roots of the Chebyshev polynomial, so intuitively a good set of samples will allow fairly accurate estimation near each of these "ideal" sampling points. All our algorithms will work for any set of α-good samples, for α < 1/2 and m sufficiently large. The sample complexities are then what is required for S to be good with high probability.

We first describe two simple algorithms that do not quite achieve the goal. We then describe how to black-box combine their results to get the full result.

ℓ1 regression: Our first result is that ℓ1 regression almost solves the problem: it satisfies all the requirements, except that the resulting error p − p̂ is bounded in ℓ1 rather than ℓ_∞:

Lemma I.2. Suppose the set S of samples is α-good for the size-m = O(d) Chebyshev partition, for constant α < 1/2. Then the result p̂ of the ℓ1 regression

p̂ = argmin over degree-d polynomials p̃ of Σ_{j∈[m]} |I_j| · mean_{x_i∈I_j} |y_i − p̃(x_i)|

satisfies ‖p − p̂‖_1 ≤ O_α(1)·σ.

This has a nice parallel to the d = 1 case, where ℓ1 regression is the canonical robust estimator. As shown by the Chebyshev polynomial in Figure 1, ℓ1 regression might not give a good solution to the ℓ_∞ problem. However, converting to an ℓ_∞ bound loses at most an O(d^2) factor. This means this lemma is already useful for an ℓ_∞ bound: one of the requirements of [7] was that ‖p‖_∞ ≤ σ·d^{O(1)}. We can avoid this by first computing the ℓ1 regression estimate p̂^{(ℓ1)} and then applying [7] to the residual p − p̂^{(ℓ1)}, which always satisfies the condition.

Figure 1 (plot titled "ℓ1 regression may yield high ℓ_∞ error"; legend: Observation, Scaled Chebyshev T_8, Zero polynomial): The blue observations are within σ = 1 of the red zero polynomial at each point, but are closer in ℓ1 to the green Chebyshev polynomial. This demonstrates that ℓ1 regression does not solve the ℓ_∞ problem even for ρ = 0.

Median-based recovery: Our second result is for a median-based approach to the problem: take the median ỹ_j of the y values within each interval I_j. Because the median is robust, this will lie within the range of inlier y values for that interval. We don't know which x ∈ I_j it corresponds to, but this turns out not to matter too much: assigning each ỹ_j to an arbitrary x̃_j ∈ I_j and applying non-robust ℓ_∞ regression gives a useful result.

Lemma I.3. Let ε, α < 1/2, and suppose the set S of samples is α-good for the size-m = O(d/ε) Chebyshev partition. Let x̃_j ∈ I_j be chosen arbitrarily, and let ỹ_j = median_{x_i∈I_j} y_i. Then the degree-d polynomial p̂ minimizing max_{j∈[m]} |p̂(x̃_j) − ỹ_j| satisfies ‖p − p̂‖_∞ ≤ (2 + ε)σ + ε‖p‖_∞.

Without the additive ε‖p‖_∞ term, which comes from not knowing the x ∈ I_j corresponding to ỹ_j, this would be exactly what we want. If σ ≪ ‖p‖_∞ this is not so good, but it still makes progress.
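As a concrete illustration of Definition I.1 and of the median step, here is a small sketch (our own, assuming numpy; the helper names are hypothetical) that builds the size-m Chebyshev partition and takes the median of the y-values in each interval, using the interval midpoints as the arbitrary representatives x̃_j.

    import numpy as np

    def chebyshev_partition(m):
        """Breakpoints cos(pi*j/m), j = 0..m, of the size-m Chebyshev
        partition of [-1, 1] (Definition I.1), in increasing order."""
        return np.cos(np.pi * np.arange(m, -1, -1) / m)

    def interval_medians(x, y, m):
        """Median of the y-values falling in each interval I_j, together
        with an arbitrary representative point (here: the midpoint)."""
        edges = chebyshev_partition(m)
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m - 1)
        reps = 0.5 * (edges[:-1] + edges[1:])
        meds = np.array([np.median(y[idx == j]) if np.any(idx == j) else 0.0
                         for j in range(m)])
        return reps, meds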
Iterating the algorithms: While neither of the above results is sufficient on its own, simply applying them iteratively on the residual left by previous rounds gives our full theorem. The result of ℓ1 regression is a p̂^{(0)} with ‖p − p̂^{(0)}‖_∞ = O(d^2 σ). Applying the median recovery algorithm to the residual points (x_i, y_i − p̂^{(0)}(x_i)), we will get p̃ so that p̂^{(1)} = p̂^{(0)} + p̃ has

‖p̂^{(1)} − p‖_∞ ≤ (2 + ε)σ + ε·O(d^2 σ).

If we continue applying the median recovery algorithm to the remaining residual, the ℓ_∞ norm of the residual will continue to decrease exponentially. After r = O(log_{1/ε} d) rounds we will reach

‖p̂^{(r)} − p‖_∞ ≤ (2 + 4ε)σ,

as desired.¹ Since each method can be computed efficiently with a linear program, this gives our main theorem:

Theorem I.4. Let ε, α < 1/2, and suppose the set S of samples is α-good for the size-m = O(d/ε) Chebyshev partition. The result p̂ of Algorithm 2 is a degree-d polynomial satisfying ‖p − p̂‖_∞ ≤ (2 + ε)σ. Its running time is that of solving O(log_{1/ε} d) linear programs, each with O(d) variables and O(|S|) constraints.

We now apply this to the robust polynomial regression problem, where we receive points (x_i, y_i) such that each x_i is drawn from some distribution, and with probability ρ it is an outlier. An adversary then picks the y_i such that, for each non-outlier i, |y_i − p(x_i)| ≤ σ. We observe the following:

Corollary I.5. Consider the robust polynomial regression problem for constant outlier chance ρ < 1/2, with points x_i drawn from the Chebyshev distribution D_c(x) = 1/(π√(1 − x^2)). Then O((d/ε) log(d/(δε))) samples suffice to recover with probability 1 − δ a degree-d polynomial p̂ with

max_{x∈[−1,1]} |p(x) − p̂(x)| ≤ (2 + ε)σ.

If x_i is drawn from the uniform distribution instead, then O((d^2/ε^2) log(1/δ)) samples suffice for the same result. In both cases, the recovery time is polynomial in the sample size.

B. Impossibility results

We show limitations on improving any of the three parameters used in Corollary I.5: the sample complexity, the approximation factor, and the requirement that ρ < 1/2.

Sample complexity: Our result uses O(d^2) samples from the uniform distribution and O(d log d) samples from the Chebyshev distribution. We show in Section V-A that both results are tight. In particular, we show that it is impossible to get any constant approximation with o(d^2) samples from the uniform distribution or o(d log d) samples from the Chebyshev distribution.

Approximation factor: Our approximation factor is 2 + ε. Even with infinitely many samples and no outliers, can one do better than a 2-approximation for ℓ_∞ regression? For comparison, in ℓ_2 regression a 1 + ε approximation is possible. For these lower bounds, it is convenient to consider the special case where the values y_i are y(x_i) for a function y with ‖p − y‖_∞ ≤ σ. We show two lower bounds related to this question.

¹ We remark that one could skip the initial ℓ1 minimization step, but the number of rounds would then depend on log(‖p‖_∞/σ), which could be unbounded. Note that this only affects running time and not the sample complexity.

Figure 2 (showing the three quadratics and the best fitting quadratic): The lower bound for 1.09-approximations via any recovery algorithm. We give three quadratics that all lie within 2σ of each other for all x, and hence the observed y(x) can be identical regardless of which quadratic is p. The region containing points within 1.09σ of all three options just barely doesn't contain a quadratic itself.

First, we show that ℓ_∞ projection — that is, the algorithm that minimizes ‖y − p̂‖_∞ over degree-d polynomials p̂ — can have error arbitrarily close to 2σ. Since our algorithm is an outlier-tolerant version of ℓ_∞ projection, we should not expect it to perform better. Second, we show that no proper learning algorithm can achieve a 1 + ε guarantee. In particular, we show for d = 2 that any algorithm with more than 2/3 success probability must have C > 1.09; this is illustrated in Figure 2. For general d, we show that C > 1 + Ω(1/d^3).

List decoding: The ρ < 1/2 requirement is obviously required for uniquely decoding p, since for ρ ≥ 1/2 the observations could come from a mixture of two completely different polynomials. But one could hope for a list decoding version of the result, outputting a small set of polynomials such that one is close to the true output.
Unfortunately, [5] showed that for a C-approximate algorithm, such a set would require size at least e^{Ω(√(d/C))} even for ρ = 1/2. We improve this lower bound to e^{Ω(d/√C)}. At least for constant C and ρ, our result is tight: Theorem I.4 implies that if no samples were outliers, then m = O(d) samples would suffice. Hence picking m samples will result in Algorithm 2 outputting a polynomial consistent with the data with (1 − ρ)^m = e^{−O(d)} probability. Repeating this would give a set of size e^{O(d)} that works with high probability.

C. Related Work

In addition to the work of [5], [7] discussed above, several papers have looked at similar problems. When σ = 0, the problem becomes the standard one of Reed-Solomon decoding with relative Hamming distance ρ. Efficient algorithms such as Berlekamp-Massey [8] give unique decoding for all ρ < 1/2. For ρ > 1/2, while unique decoding is impossible, a polynomial size list decoding is possible with sufficiently many samples for all ρ < 1 [9].

Other work has looked at robust estimation for distributions. In the field of robust statistics (see, e.g., [10]), as well as some recent papers in theoretical computer science (e.g. [11]-[13]), one would like to estimate properties of a distribution from samples that have a ρ chance of being adversarial. In some such cases, list decoding for ρ > 1/2 is possible [13].

II. PRELIMINARIES

For a function f : [−1, 1] → R and interval I ⊆ [−1, 1], we define the ℓ_q norm on the interval to be

‖f‖_{I,q} := (∫_I |f(x)|^q dx)^{1/q},

where ‖f‖_{I,∞} := sup_{x∈I} |f(x)|. We also define the overall ℓ_q norm ‖f‖_q := ‖f‖_{[−1,1],q}.

We will need the following consequence of a generalization, due to Nevai [14], of Bernstein's inequality from ℓ_∞ to ℓ_q. The proof of this lemma is in Appendix A.

Lemma II.1. Let p be a degree-d polynomial. Let I_1, ..., I_m partition [−1, 1] between the Chebyshev extrema cos(πj/m), for some m ≥ d. Let r : [−1, 1] → R be piecewise constant, so that for each I_k there exists an x_k ∈ I_k with r(x) = p(x_k) for all x ∈ I_k. Then there exists a universal constant C such that, for any q ≥ 1,

‖p − r‖_q ≤ C·(d/m)·‖p‖_q.

We recall the definition of the Chebyshev polynomials T_d(x), which we will use extensively.

Definition II.2. The Chebyshev polynomials of the first kind T_d(x) are defined by the following recurrence: T_0(x) = 1, T_1(x) = x, and T_{n+1}(x) = 2x·T_n(x) − T_{n−1}(x). They have the property that T_d(cos θ) = cos(dθ) for all θ.

III. ℓ1 REGRESSION

Lemma III.1. Suppose m ≥ C·d/ε for a large enough constant C. Then, for any set of samples x_1, ..., x_n with all S_i = {j | x_j ∈ I_i} nonempty, and any polynomial p of degree d,

Σ_i (|I_i|/|S_i|) Σ_{j∈S_i} |p(x_j)| = (1 ± ε)·‖p‖_1.

Proof: When all |S_i| = 1, this is a restatement of Lemma II.1 for q = 1. Otherwise, the LHS is the expectation of the |S_i| = 1 case, when randomly drawing a single j in each S_i.

We will show a more precise statement than Lemma I.2, which allows for a weaker ℓ1 version of the α-good requirement on the samples:

Definition III.2. For a set of samples (x_1, y_1), ..., (x_n, y_n) and the Chebyshev partition I_1, ..., I_m, define S_j = {i | x_i ∈ I_j}. We say that the samples are (α, σ) close to a polynomial p in ℓ1 if, for

e_j := min_{S⊆S_j, |S|≥(1−α)|S_j|} max_{i∈S} |p(x_i) − y_i|,

we have Σ_{j∈[m]} |I_j|·e_j ≤ σ.

If the samples are α-good, then each e_j ≤ σ, so Σ_j |I_j|·e_j ≤ (Σ_j |I_j|)·(max_j e_j) ≤ 2σ and the samples are (α, 2σ) close to p in ℓ1. We now state Lemma III.3, which is a more precise statement of Lemma I.2.

Lemma III.3. Let α < 1/2, and let m ≥ C·d/ε for a large enough constant C and some ε ≤ (1 − 2α)/4. Then, given any samples (α, σ) close to p in ℓ1, the degree-d polynomial solution to the ℓ1 regression problem

p̂ = argmin_q Σ_{j∈[m]} (|I_j|/|S_j|) Σ_{i∈S_j} |y_i − q(x_i)|

has ‖p − p̂‖_1 ≤ 4σ/(1 − 2α).
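Before the proof, here is a minimal sketch of the ℓ1 regression step above as a linear program, using scipy.optimize.linprog (the function name is hypothetical, and the per-sample weight vector — which should be set to |I_j|/|S_j| for x_i ∈ I_j, e.g. via the interval-index code in the earlier sketch — is passed in explicitly; this is our own sketch, not the authors' implementation).

    import numpy as np
    from scipy.optimize import linprog

    def l1_regression(x, y, d, weights):
        """Weighted l1 polynomial regression (Lemma III.3) as a linear program:
        minimize sum_i weights_i * |y_i - q(x_i)| over degree-d polynomials q.
        Returns the coefficients of q in the monomial basis, low degree first."""
        n = len(x)
        V = np.vander(x, d + 1, increasing=True)      # V[i, k] = x_i**k
        # Variables z = (coeffs c[0..d], slacks t[0..n-1]); minimize weights . t
        cost = np.concatenate([np.zeros(d + 1), weights])
        # Constraints encode t_i >= y_i - (V c)_i and t_i >= (V c)_i - y_i.
        A_ub = np.block([[-V, -np.eye(n)], [V, -np.eye(n)]])
        b_ub = np.concatenate([-y, y])
        bounds = [(None, None)] * (d + 1) + [(0, None)] * n
        res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:d + 1]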
Proof: In each interval, we consider the α|S_j| coordinates maximizing |y_i − p(x_i)| to be bad, and the rest to be good. Let the sets of bad and good coordinates in S_j be denoted by B_j and G_j respectively, and define

obj(f) = Σ_{j∈[m]} (|I_j|/|S_j|) Σ_{i∈S_j} |y_i − f(x_i)|.

By definition obj(p̂) ≤ obj(p). This gives us

0 ≤ obj(p) − obj(p̂)
= Σ_{j∈[m]} (|I_j|/|S_j|) Σ_{i∈S_j} (|y_i − p(x_i)| − |y_i − p̂(x_i)|)
= Σ_{j∈[m]} (|I_j|/|S_j|) [ Σ_{i∈G_j} (|y_i − p(x_i)| − |y_i − p̂(x_i)|) + Σ_{i∈B_j} (|y_i − p(x_i)| − |y_i − p̂(x_i)|) ].

From the triangle inequality, we have |y_i − p̂(x_i)| ≥ |p(x_i) − p̂(x_i)| − |y_i − p(x_i)| and |y_i − p̂(x_i)| ≥ |y_i − p(x_i)| − |p(x_i) − p̂(x_i)|.

This gives

0 ≤ Σ_{j∈[m]} (|I_j|/|S_j|) [ Σ_{i∈G_j} (2|y_i − p(x_i)| − |p̂(x_i) − p(x_i)|) + Σ_{i∈B_j} |p̂(x_i) − p(x_i)| ].

Now, recall that our samples are (α, σ) close to p in ℓ1. This means (by Definition III.2) that for any good i, |y_i − p(x_i)| ≤ e_j for a set of e_j's that satisfy Σ_j |I_j|·e_j ≤ σ. Therefore, for these e_j,

0 ≤ − Σ_{j∈[m]} (|I_j|/|S_j|) Σ_{i∈G_j} |p̂(x_i) − p(x_i)| + Σ_{j∈[m]} (|I_j|/|S_j|) Σ_{i∈B_j} |p̂(x_i) − p(x_i)| + 2 Σ_{j∈[m]} (|I_j|/|S_j|)·((1 − α)|S_j|·e_j).

Using Lemma III.1 on p − p̂ we now get

0 ≤ −(1 − α)(1 − ε)‖p − p̂‖_1 + α(1 + ε)‖p − p̂‖_1 + 2(1 − α)σ,

or

(1 − 2α − 2ε)·‖p − p̂‖_1 ≤ 2(1 − α)σ.

Hence

‖p − p̂‖_1 ≤ 2(1 − α)σ / (1 − 2α − 2ε).

For ε ≤ (1 − 2α)/4, this gives the desired ‖p − p̂‖_1 ≤ 4σ/(1 − 2α).

IV. ℓ_∞ REGRESSION

Given a set of α-good samples S with α < 1/2, our goal is to find a degree-d polynomial p̂ with ‖p − p̂‖_∞ ≤ (2 + ε)σ. We start by proving Lemma I.3, which we restate below for clarity:

Lemma I.3. Let ε, α < 1/2, and suppose the set S of samples is α-good for the size-m = O(d/ε) Chebyshev partition. Let x̃_j ∈ I_j be chosen arbitrarily, and let ỹ_j = median_{x_i∈I_j} y_i. Then the degree-d polynomial p̂ minimizing max_{j∈[m]} |p̂(x̃_j) − ỹ_j| satisfies ‖p − p̂‖_∞ ≤ (2 + ε)σ + ε‖p‖_∞.

Algorithm 1 Refinement method, analyzed in Lemma I.3 for p̂ = 0
1: procedure REFINE(S = {(x_i, y_i)}, p̂)
2:   ỹ_j ← median_{x_i∈I_j} (y_i − p̂(x_i))
3:   Choose arbitrary x̃_j ∈ I_j
4:   Fit degree-d polynomial r minimizing max_j |r(x̃_j) − ỹ_j|
5:   p̂ ← p̂ + r
6:   return p̂
7: end procedure

Algorithm 2 Complete recovery procedure
1: procedure APPROX(S)
2:   p̂^{(0)} ← result of ℓ1 regression
3:   for i ∈ [0, O(log_{1/ε} d)] do
4:     p̂^{(i+1)} ← REFINE(S, p̂^{(i)})
5:   end for
6:   return p̂^{(O(log_{1/ε} d))}
7: end procedure
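Both procedures above reduce to small linear programs. The following sketch (our own, reusing interval_medians and l1_regression from the earlier sketches, and assuming every interval receives at least one sample) mirrors Algorithms 1 and 2: an ℓ_∞ fit to the interval medians of the residual, repeated after an ℓ1 initialization.

    import numpy as np
    from scipy.optimize import linprog

    def linf_fit(xs, ys, d):
        """Fit a degree-d polynomial r minimizing max_j |r(xs[j]) - ys[j]|
        (step 4 of Algorithm 1), written as a small linear program."""
        m = len(xs)
        V = np.vander(xs, d + 1, increasing=True)
        cost = np.concatenate([np.zeros(d + 1), [1.0]])       # minimize t
        A_ub = np.block([[V, -np.ones((m, 1))], [-V, -np.ones((m, 1))]])
        b_ub = np.concatenate([ys, -ys])
        res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1) + [(0, None)],
                      method="highs")
        return res.x[:d + 1]

    def refine(x, y, p_hat, d, m):
        """One round of Algorithm 1: median residual per interval, l_inf fit."""
        reps, meds = interval_medians(x, y - np.polyval(p_hat[::-1], x), m)
        return p_hat + linf_fit(reps, meds, d)

    def approx(x, y, d, m, rounds, weights):
        """Algorithm 2: l1 initialization followed by repeated refinement."""
        p_hat = l1_regression(x, y, d, weights)
        for _ in range(rounds):
            p_hat = refine(x, y, p_hat, d, m)
        return p_hat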

Proof: Since more than half the points in any interval I_j are such that |y_i − p(x_i)| ≤ σ, and p is continuous, there must exist an x*_j ∈ I_j satisfying

|ỹ_j − p(x*_j)| ≤ σ. (1)

We now define three piecewise-constant functions, r*(x), r̂(x), and r̃(x), to be such that within each interval I_j we have r*(x) = p(x*_j), r̂(x) = p̂(x̃_j), and r̃(x) = ỹ_j. By Lemma II.1, for our choice of m we have

‖p − r*‖_∞ ≤ ε‖p‖_∞ and ‖p̂ − r̂‖_∞ ≤ ε‖p̂‖_∞. (2)

We also have by (1) that ‖r* − r̃‖_∞ ≤ σ, so by the triangle inequality

‖p − r̃‖_∞ ≤ σ + ε‖p‖_∞. (3)

Now, our choice of p̂ ensures that

‖r̂ − r̃‖_∞ = max_{j∈[m]} |p̂(x̃_j) − ỹ_j| ≤ max_{j∈[m]} |p(x̃_j) − ỹ_j| ≤ ‖p − r̃‖_∞ ≤ σ + ε‖p‖_∞.

Combining with (2) and (3) gives by the triangle inequality that

‖p − p̂‖_∞ ≤ 2σ + 2ε‖p‖_∞ + ε‖p̂‖_∞. (4)

To finish the proof, we just need a bound on ‖p̂‖_∞. Note that (4) implies ‖p̂‖_∞ ≤ 2σ + (1 + 2ε)‖p‖_∞ + ε‖p̂‖_∞, and since ε ≤ 1/2, this implies ‖p̂‖_∞ ≤ 4σ + (2 + 4ε)‖p‖_∞. Plugging into (4) and rescaling ε down by a constant factor gives the result.

Before we prove Theorem I.4, we briefly describe the algorithm. Algorithm 2 first sets its initial estimate to the polynomial produced by ℓ1 regression. It then continues to refine this estimate by using Algorithm 1 for O(log_{1/ε} d) iterations.

Theorem I.4. Let ε, α < 1/2, and suppose the set S of samples is α-good for the size-m = O(d/ε) Chebyshev partition. The result p̂ of Algorithm 2 is a degree-d polynomial satisfying ‖p − p̂‖_∞ ≤ (2 + ε)σ. Its running time is that of solving O(log_{1/ε} d) linear programs, each with O(d) variables and O(|S|) constraints.

Proof: Our algorithm proceeds by iteratively improving a polynomial approximation p̂ to p, so that ‖p − p̂‖_∞ improves at each stage. Our notation is borrowed from the statements of Algorithms 2 and 1. Let p̂^{(t)}(x) be the t-th estimate of p found by Algorithm 2 and let e_t(x) = (p − p̂^{(t)})(x) be the error of the t-th estimate p̂^{(t)}. Let r_t be the polynomial r found in the t-th iteration of Algorithm 1, i.e. r_t = argmin_r max_j |r(x̃_j) − ỹ_j| where ỹ_j = median_{x_i∈I_j} (y_i − p̂^{(t)}(x_i)), and the minimum is taken over all degree-d polynomials. Lemma I.3 now implies

‖r_t − e_t‖_∞ ≤ (2 + ε)σ + ε‖e_t‖_∞.

Observe that r_t(x) − e_t(x) = (p̂^{(t+1)}(x) − p̂^{(t)}(x)) − (p(x) − p̂^{(t)}(x)) = −e_{t+1}(x), which gives us

‖e_{t+1}‖_∞ ≤ (2 + ε)σ + ε‖e_t‖_∞.

Proceeding by induction and using the geometric series formula we see

‖p − p̂^{(t)}‖_∞ ≤ ((2 + ε)/(1 − ε))·σ + ε^t·‖e_0‖_∞ ≤ (2 + 4ε)σ + ε^t·‖e_0‖_∞.

Rescaling ε, we see that in a number of iterations logarithmic in the quality of our initial solution, we find a p̂ such that ‖p − p̂‖_∞ ≤ (2 + ε)σ.

Finally, we analyze the quality of the initial solution produced by ℓ1 regression. By Lemma I.2, ‖e_0‖_1 ≤ O(σ). Applying the Markov brothers' inequality to the degree-(d + 1) polynomial Q(x) = ∫_{−1}^{x} e_0(u) du, we get

‖e_0‖_∞ = max_{x∈[−1,1]} |Q′(x)| ≤ (d + 1)^2 · max_{x∈[−1,1]} |Q(x)| ≤ (d + 1)^2 · ‖e_0‖_1 ≤ O(d^2 σ).

Hence the number of iterations required is O(log_{1/ε} d).

Applying this to our random outlier setting, we get Corollary I.5.

Proof: It is enough to show that, for m = O(d/ε), drawing O((d/ε) log(d/(δε))) samples from the Chebyshev distribution or O((d^2/ε^2) log(1/δ)) samples from the uniform distribution gives us an α-good sample set for some α < 1/2 with probability 1 − δ. The corollary then follows from Theorem I.4.

Let k be the total number of samples taken, and let p_j denote the probability that any sampled x_i lies in the j-th interval I_j. With Chebyshev sampling, p_j = 1/m for all j. With uniform sampling, p_j = Θ(min(j, m + 1 − j)/m^2). Let X_j be the number of samples that appear in I_j, and Y_j be the number of these samples that are outliers. We have E[X_j] = k·p_j, so by a Chernoff bound,

Pr[X_j ≤ (1/2)·k·p_j] ≤ e^{−Ω(k p_j)}. (5)

The outliers are then chosen independently, with expectation ρX_j, so

Pr[Y_j ≥ αX_j | X_j] ≤ e^{−Ω((α − ρ)^2 X_j)}. (6)

Setting α = ρ/2 + 1/4, we have that, conditioned on (5) not occurring, (6) occurs with at most e^{−Ω(k p_j)} probability. If neither occurs, then less than an α fraction of the samples in I_j are outliers. Hence, by a union bound over the intervals, the samples are α-good with probability at least

1 − Σ_{j=1}^{m} 2·e^{−Ω(k p_j)}. (7)

In the Chebyshev setting, we have p_j = 1/m and k = O(m log(m/δ)), making (7) at least 1 − δ for appropriately chosen constants. In the uniform setting, we have k = O(m^2 log(1/δ)) and similarly

Σ_{j=1}^{m} e^{−Ω(k p_j)} = Σ_{j=1}^{m} e^{−Ω(log(1/δ)·min(j, m+1−j))} ≤ 2 Σ_{j=1}^{m/2} δ^j ≤ 3δ,

making (7) at least 1 − 6δ. Rescaling δ gives the result: the samples are α-good for some α < 1/2 with probability at least 1 − δ.
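The Chernoff-bound argument above is easy to check numerically. The sketch below (our own; the parameter values are illustrative) estimates the probability that a Chebyshev-distributed sample set is α-good in the sense of Definition I.1.

    import numpy as np

    def fraction_alpha_good(n, m, rho, alpha, trials, rng):
        """Empirically estimate Pr[sample set is alpha-good] for the size-m
        Chebyshev partition with n Chebyshev-distributed points and outlier
        probability rho (cf. the proof of Corollary I.5)."""
        edges = np.cos(np.pi * np.arange(m, -1, -1) / m)
        good = 0
        for _ in range(trials):
            x = np.cos(rng.uniform(0.0, np.pi, size=n))
            outlier = rng.random(n) < rho
            idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m - 1)
            counts = np.bincount(idx, minlength=m)
            bad = np.bincount(idx, weights=outlier.astype(float), minlength=m)
            # alpha-good: every interval is nonempty and has < alpha fraction outliers
            if np.all(counts > 0) and np.all(bad < alpha * counts):
                good += 1
        return good / trials

    rng = np.random.default_rng(1)
    print(fraction_alpha_good(n=2000, m=50, rho=0.2, alpha=0.35, trials=200, rng=rng))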
V. IMPOSSIBILITY RESULTS

A. Sample Complexity

Lemma V.1. If the x_i are sampled uniformly, then it is not possible to get an O(1) approximation in ℓ_∞ norm to the original function with o(d^2) samples and 1/4 failure probability.

Proof: Consider an algorithm that gives a C-approximation given s uniform samples with noise level σ = 1 and zero outliers, for C = O(1) and s = o(d^2). We will construct two polynomials with ℓ_∞ distance more than 2C, but for which the samples have a constant chance of being indistinguishable.

Define the polynomials g(x) = 0 and f(x) := T_d(x + α/d^2), where T_d is the degree-d Chebyshev polynomial of the first kind, for some d ≥ 4 and constant α = 24(2C − 1). By construction, |f(x)| ≤ 1 for x ∈ [−1, 1 − α/d^2], but f(1) > 2C because

f(1) = T_d(1 + α/d^2) = cosh(d·arcosh(1 + α/d^2)) > 2C.

The final inequality follows since T_d(1 + y) ≥ 1 + d^2·y for y ≥ 0 (by convexity of T_d on [1, ∞) and T_d′(1) = d^2), so f(1) ≥ 1 + α > 2C.

Since ‖f − g‖_∞ > 2C, no single answer can be a valid C-approximate recovery of both f and g. Suppose one always observes samples of the form (x_i, 0). These samples are within σ = 1 of both f(x) and g(x) if they lie in the region [−1, 1 − α/d^2]. The chance that all samples x_i lie in this region is at least (1 − α/d^2)^s ≥ e^{−2sα/d^2}, which is e^{−o(1)} > 1/2. Hence there is at least a 1/2 chance that the samples from f and g are indistinguishable, so the algorithm has a failure probability more than 1/4.

Figure 3: p_b(x) for d = 50 and b = 0.

Our next lower bound will use the following lemmas, proven in Appendix A.

Lemma V.2. Let d ≥ 1. For any point b ∈ [−1, 1], there exists a degree-d polynomial p_b such that p_b(b) = 1, ‖p_b‖_∞ = 1, and

|p_b(x)| ≤ 2/(d·|x − b|)

for all x ∈ [−1, 1].

Lemma V.3. For any d and α > 0, let m = d√α/π, and define b_j = −1 + 2j/m for j ∈ [m]. Consider the set of degree-2d polynomials f_S(x) = Σ_{j∈S} p_{b_j}(x)^2 for S ⊆ [m]. For any x ∈ [−1, 1], let k_x ∈ [m] minimize |b_{k_x} − x|. Then for any S ⊆ [m],

f_{{k_x}∩S}(x) ≤ f_S(x) ≤ f_{{k_x}∩S}(x) + α.

Lemma V.4. For any distribution on sets x = (x_1, ..., x_s) of s = o(d log d) sample points with independent outlier chance ρ = Ω(1), it is not possible to get a robust O(1) approximation in ℓ_∞ norm to the original function with 1/4 failure probability.

Proof: Let m = d/(π√(3C)), and let f_S and k_x be as in Lemma V.3 for α = 1/(3C). For any x, let L_j := {i ∈ [s] | k_{x_i} = j}. Since these sets are disjoint, there must exist at least m/2 different j for which |L_j| ≤ 2s/m = o(log d). Let B ⊆ [m] contain these j. We say that a given L_j is outlier-full if all of the x_i ∈ L_j are outliers, which for j ∈ B happens with probability at least ρ^{|L_j|} ≥ 2^{−o(log d)} ≥ 1/√d. Hence the probability that at least one L_j is outlier-full is at least

1 − (1 − 1/√d)^{|B|} ≥ 1 − e^{−m/(2√d)} > 0.99.

Suppose the true polynomial p to be learned is f_S for a uniformly random S ⊆ [m], and consider the following adversary. If at least one L_j is outlier-full, she arbitrarily picks one such j* and sets S′ to the symmetric difference of S and {j*}. She then flips a coin, and with 50% probability outputs (x_i, f_S(x_i)) for each i, and otherwise outputs (x_i, f_{S′}(x_i)) for each i. This is valid for σ = 1/(3C), because for each i ∈ [s], either k_{x_i} = j* (in which case x_i ∈ L_{j*} is an outlier) or {k_{x_i}} ∩ S = {k_{x_i}} ∩ S′ (in which case Lemma V.3 implies |f_S(x_i) − f_{S′}(x_i)| ≤ α = 1/(3C)).

Because the distribution on j* is independent of S, the distribution of S′ is also uniform, so the algorithm cannot distinguish whether it received f_S or f_{S′}. But ‖f_S − f_{S′}‖_∞ = ‖f_{{j*}}‖_∞ = 1 > 2Cσ, so the algorithm's output cannot satisfy both cases simultaneously. Hence, in the 99% of cases where some L_j is outlier-full, the algorithm will have 50% failure probability, for 49.5% > 1/4 overall.
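The construction in the proof of Lemma V.1 is easy to visualize numerically: a shifted Chebyshev polynomial stays bounded by 1 on almost all of [−1, 1] and blows up only at the right endpoint. The following sketch (our own; d and α are illustrative values, not the constants from the proof) evaluates it with numpy.

    import numpy as np
    from numpy.polynomial import chebyshev as C

    d, alpha = 20, 4.0                       # illustrative parameters
    Td = C.Chebyshev.basis(d)                # degree-d Chebyshev polynomial T_d
    f = lambda x: Td(x + alpha / d**2)       # the shifted polynomial from the proof

    xs = np.linspace(-1.0, 1.0 - alpha / d**2, 2001)
    print(np.max(np.abs(f(xs))))             # stays <= 1 on [-1, 1 - alpha/d^2]
    print(f(1.0))                            # blows up at x = 1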
B. Approximation Factor

Any algorithm that relies on the result of ℓ_∞ projection cannot do significantly better than a 2-approximation. There are two functions p(x) and f(x) such that |p(x) − f(x)| ≤ σ for all x ∈ [−1, 1], yet the ℓ_∞ projection of f to the space of all degree-1 polynomials is almost 2σ away in ℓ_∞ norm from p.

Figure 4 (plot of f(x), p(x), and the best-fit line q(x) = ax + b over [−1, 1], with the kink of f at x = 1 − α): q(x) makes an error of (2 − α)σ with p(x).

Lemma V.5. Let d = 1, let p(x) = σ be a constant function, and let f(x) = (2σ/α)·max(0, x − (1 − α)) where α < 1/2. Note that |p(x) − f(x)| ≤ σ for all x ∈ [−1, 1]. The result q = argmin_r ‖f − r‖_∞ of ℓ_∞ projection of f to the space of degree-1 polynomials satisfies ‖p − q‖_{[−1,1],∞} ≥ (2 − α)σ.

Proof: Observe that f(x) = 0 for x ∈ [−1, 1 − α] and f(x) = (2σ/α)(x − 1) + 2σ on [1 − α, 1]. The maximum difference between two linear functions on a closed interval is attained at the endpoints of that interval. This tells us

q(x) = argmin_r max{ |r(−1)|, |r(1 − α)|, |2σ − r(1)| }.

Let q(x) = ax + b. The minimizer q(x) will be such that f(−1) > q(−1), f(1 − α) < q(1 − α) and f(1) > q(1) (see Figure 4), and so we want

argmin_{(a,b)} max{ a − b, a + b − αa, 2σ − (a + b) }.

This function is minimized when all three terms are equal, which happens when a = σ and b = ασ/2. This gives q(x) = σx + ασ/2, and ‖q − p‖_{[−1,1],∞} = (2 − α/2)σ ≥ (2 − α)σ.

Figure 5 (plot of p_3(x) and the dashed curves p_1(x) ± 1, with marked points at x = −1, −v, v, 1): The quadratic that passes through the four points above does not satisfy the inequality ax^2 + c ≤ p_1(x) + 1 in the range [−1, −v], for v = 1/(3 + 2√2).

We now show that one cannot hope for a proper (1 + ε)-approximate algorithm, even with no outliers. We will present a set of polynomials such that no two are more than 2 apart, but for which no single polynomial lies within α > 1 of all polynomials in the set. Then an adversary with σ = 1 can output a function y(x) independent of the choice of polynomial in the set, forcing the algorithm to be α-far from some polynomial in the set.

Lemma V.6. There exist three degree-2 polynomials, all within 2 of each other over [−1, 1], such that any single quadratic function has ℓ_∞ distance more than 1.09 from one of the three.

Proof: Consider the polynomials p_1(x) = x + 1, p_2(x) = 1 − x, p_3(x) = ((3 + 2√2)/2)·(1 − x^2), and let v denote 1/(3 + 2√2) (see Figure 5). Observe that p_1 and p_2 are at distance 2 from each other at x = 1, −1. Also p_1 is at distance 2 from p_3 at x = −v, and similarly p_2 is at distance 2 from p_3 at x = v. Hence, any polynomial that wants to 1-approximate all the p_i's necessarily has to go through the points (1, 1), (−1, 1), (v, 2 − v), (−v, 2 − v). By symmetry, we know that any quadratic that goes through these will be of the form y = ax^2 + c. Substituting these values into the equation and solving for a and c, we see that a = −1/(v + 1) and c = 1 + 1/(v + 1). Since the quadratic ax^2 + c has to 1-approximate the p_i, it must be the case that ax^2 + c ≤ p_1(x) + 1 = x + 2 at all points. However, this inequality is not satisfied in the interval (−1, −v), as shown in Figure 5. This is because p_1(x) + 1 is the line between two points on the curve ax^2 + c, which is a concave function for a < 0. Running a program to optimize parameters, we see that the best approximation to these polynomials will make an error of at least 1.09 with one of the polynomials (see Figure 2).
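The three polynomials in the proof of Lemma V.6 can be checked directly; the sketch below (our own) verifies on a fine grid that each pair is at ℓ_∞ distance 2 over [−1, 1].

    import numpy as np

    # The three polynomials from the proof of Lemma V.6.
    p1 = lambda x: x + 1
    p2 = lambda x: 1 - x
    c = (3 + 2 * np.sqrt(2)) / 2
    p3 = lambda x: c * (1 - x**2)

    xs = np.linspace(-1, 1, 100001)
    for f, g in [(p1, p2), (p1, p3), (p2, p3)]:
        print(np.max(np.abs(f(xs) - g(xs))))   # each pairwise distance is 2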

Figure 6: Construction for Lemma V.7. On the left, any 1-approximator necessarily has to go through the three marked points. On the right, placing translated copies of these functions forces any 1-approximation to perform more than d oscillations.

Figure 7: One element of f_S from Lemma V.3.

Lemma V.7. There exist O(d) degree-2 polynomials such that any degree-d polynomial that tries to approximate all of them has to make an error of 1 + Ω(1/d^3) with at least one of the degree-2 polynomials.

Proof: We will consider a set of quadratics such that any two are at most distance 2σ from each other, and such that any polynomial that σ-approximates them all needs to perform more than d oscillations. As no degree-d polynomial can perform more than d oscillations, there is at least one oscillation that this polynomial does not perform, and hence the approximating polynomial will make an error of at least 1 + h, where h is the height of the oscillation.

Observe that max_x [(1 − Ax^2) − A(x ± 2c)^2] = 1 − 2Ac^2, and this is achieved at x = ∓c. This means that the set

S_0 = {1 − Ax^2, A(x − 2c)^2, A(x + 2c)^2, Ax^2 + 2Ac^2}

has every polynomial at most at distance σ := 1 − 2Ac^2 from each other. Any σ-approximator, hence, necessarily has to go through the three points shown in Figure 6. Define

S_t = {1 − A(x − 2ct)^2, A(x − 2ct − 2c)^2, A(x − 2ct + 2c)^2, A(x − 2ct)^2 + 2Ac^2}.

Now observe that any σ-approximator to S = ∪_{t∈[−1.5d, 1.5d]} S_t will necessarily have to perform more than d oscillations. We will now define the parameters such that these oscillations take place in the range [−1, 1] and every pair of functions is at distance at most 2σ from each other.

Set A = 1/(2d) and c = 1/(4d). This ensures that every pair of elements in S_t is at most at distance 1 − 1/(16d^3) from each other. We now show that if p ∈ S_t and q ∈ S_{t′} for t < t′, then ‖p − q‖_∞ ≤ 2 − 1/(16d^3). Observe that it is enough to check this for p = 1 − A(x − 2ct)^2 and q = A(x − 2ct′ + 2c)^2. Because of our choice of A and c,

1 − A(x − 2ct)^2 − A(x − 2ct′ + 2c)^2
= 1 − (1/(2d))·((x − t/(2d))^2 + (x − (t′ − 1)/(2d))^2)
≤ 1 − ((t′ − t − 1)^2)/(16d^3)
≤ 2 − 1/(16d^3).

Finally, observe that we force the approximating polynomial to perform one oscillation for every S_t, and if c = 1/(4d) the polynomial has to perform 8d/3 oscillations to σ-approximate every polynomial in S in the interval [−1, 1], because it has to perform one oscillation in every interval of length 3c. Since no degree-d polynomial can perform this many oscillations, there is at least one oscillation that it cannot perform, and so the approximating polynomial necessarily has to make an error of at least 1 + h with one of the polynomials in S, where h is the height of the oscillation, which is Ω(1/d^3).

C. List decoding

We now show that if the probability of getting a bad sample were greater than 1/2, then it is not possible to find poly(d) polynomials of degree O(d) such that one of the polynomials is close to the original polynomial.

Theorem V.8. Consider any algorithm for robust polynomial regression that returns a set L of polynomials from samples with outlier chance ρ = 1/2, such that at least one element of L is a C-approximation to the true answer with 3/4 probability. Then E[|L|] ≥ (3/4)·2^{Ω(d/√C)}.

Proof: Note that we may assume d/√C is larger than a sufficiently large constant, since otherwise the result follows from the fact that |L| ≥ 1 whenever the algorithm succeeds. Let us then set m = d/(π√(3C)), and take f_S from Lemma V.3 for α = 1/(3C). For any S ⊆ [m] and x ∈ [−1, 1], the lemma implies either

f_∅(x) ≤ f_S(x) ≤ f_∅(x) + α or f_{[m]}(x) − α ≤ f_S(x) ≤ f_{[m]}(x).

Therefore, when observing any polynomial f_S for S ⊆ [m], the adversary for σ = α can ensure that any sample point x is observed as f_∅(x) with 50% probability, and as f_{[m]}(x) with 50% probability. This is because f_S(x) is always within α of one of these, so the adversary can output that one if x is an inlier, and the other one if x is an outlier. For this adversary, the algorithm's input is independent of the choice of f_S.

Hence its distribution on the output L is also independent of f_S. But each p̂ ∈ L can only be a C-approximation to at most one f_S. Hence, if S ⊆ [m] is chosen at random,

Pr[∃ p̂ ∈ L : ‖p̂ − f_S‖_∞ ≤ Cσ] ≤ E[|L|]/2^m.

This implies the result.

ACKNOWLEDGEMENTS

We would like to thank user111 on MathOverflow [15] for pointing us to [14]. DK is supported in part by NSF Award CCF (CAREER) and a Sloan Fellowship.

REFERENCES

[1] J. Gergonne, "The application of the method of least squares to the interpolation of sequences," Historia Mathematica, vol. 1, no. 4.
[2] V. Pratt, "Direct least-squares fitting of algebraic surfaces," SIGGRAPH Comput. Graph., vol. 21, no. 4, Aug.
[3] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio, "Agnostically learning halfspaces," SIAM Journal on Computing, vol. 37, no. 6, 2008.
[4] I. B. MacNeill, "Properties of sequences of partial sums of polynomial regression residuals with applications to tests for change of regression at unknown times," The Annals of Statistics.
[5] S. Arora and S. Khot, "Fitting algebraic curves to noisy data," Journal of Computer and System Sciences, vol. 67, no. 2, 2003.
[6] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6.
[7] V. Guruswami and D. Zuckerman, "Robust Fourier and polynomial curve fitting," FOCS, 2016.
[8] L. R. Welch and E. R. Berlekamp, "Error correction for algebraic block codes," US Patent 4,633,470, Dec.
[9] V. Guruswami and M. Sudan, "Improved decoding of Reed-Solomon and algebraic-geometric codes," in Foundations of Computer Science, 1998. Proceedings. 39th Annual Symposium on. IEEE, 1998.
[10] P. J. Huber, Robust Statistics. Springer, 2011.
[11] K. A. Lai, A. B. Rao, and S. Vempala, "Agnostic estimation of mean and covariance," in Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on. IEEE, 2016.
[12] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart, "Robust estimators in high dimensions without the computational intractability," in Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on. IEEE, 2016.
[13] M. Charikar, J. Steinhardt, and G. Valiant, "Learning from untrusted data," arXiv preprint, 2016.
[14] P. G. Nevai, "Bernstein's inequality in L^p for 0 < p < 1," Journal of Approximation Theory, vol. 27, no. 3.
[15] user111, "L1 analog of Bernstein's inequality," MathOverflow. [Online]. Available: http://mathoverflow.net/q/43447
[16] N. I. Achieser, Theory of Approximation. Courier Corporation, 2013.

APPENDIX

To prove Lemma II.1, we need a generalization of Bernstein's inequality from ℓ_∞ to ℓ_q for all q > 0, which appears as Theorem 5 of [14]. In the univariate case, and setting its parameters γ = Γ = 0, it states:

Lemma A.1 (Special case of Theorem 5 of [14]). There exists a universal constant C such that, for any degree-d polynomial p(x) and any q > 0,

∫_{−1}^{1} |√(1 − x^2)·p′(x)|^q dx ≤ C·d^q·∫_{−1}^{1} |p(x)|^q dx.

Lemma II.1. Let p be a degree-d polynomial. Let I_1, ..., I_m partition [−1, 1] between the Chebyshev extrema cos(πj/m), for some m ≥ d. Let r : [−1, 1] → R be piecewise constant, so that for each I_k there exists an x_k ∈ I_k with r(x) = p(x_k) for all x ∈ I_k. Then there exists a universal constant C such that, for any q ≥ 1,

‖p − r‖_q ≤ C·(d/m)·‖p‖_q.

Proof: For any individual I_k and x_k ∈ I_k, we have by Hölder's inequality that

∫_{x∈I_k} |p(x) − p(x_k)|^q dx = ∫_{x∈I_k} |∫_{x_k}^{x} p′(y) dy|^q dx
≤ ∫_{x∈I_k} |x − x_k|^{q−1} ∫_{x_k}^{x} |p′(y)|^q dy dx
≤ |I_k|^q ∫_{x∈I_k} |p′(x)|^q dx.

Hence

∫ |p(x) − r(x)|^q dx ≤ Σ_k |I_k|^q ∫_{x∈I_k} |p′(x)|^q dx. (8)

We first consider I_2, ..., I_{m/2}, then separately consider the first interval I_1. The statement holds for the intervals I_{m/2}, ..., I_m by symmetry. For each interval I_k with k ≥ 2, we have for all x ∈ I_k

|I_k| = cos((k − 1)π/m) − cos(kπ/m) ≲ (1/m)·sin(kπ/m) ≲ (1/m)·√(1 − x^2),

where we use the notation a ≲ b to denote that there exists a universal constant C such that a ≤ Cb. Hence

∫_{x∈I_k} |p(x) − r(x)|^q dx ≤ |I_k|^q ∫_{x∈I_k} |p′(x)|^q dx ≤ (c/m)^q ∫_{x∈I_k} |√(1 − x^2)·p′(x)|^q dx,

and so by Lemma A.1, for some constant c,

∫_{I_2 ∪ ... ∪ I_{m/2}} |p(x) − r(x)|^q dx ≤ (c/m)^q ∫_{−1}^{1} |√(1 − x^2)·p′(x)|^q dx ≲ (cd/m)^q·‖p‖_q^q. (9)

Now we consider the end interval I_1. By the Markov brothers' inequality [16], ‖p′‖_∞ ≤ d^2·‖p‖_∞. Let x* ∈ [−1, 1] be such that |p(x*)| = ‖p‖_∞, and let I = {y ∈ [−1, 1] : |x* − y| ≤ 1/(2d^2)}. We have |p(y)| ≥ ‖p‖_∞/2 for all y ∈ I, so

‖p‖_q^q ≥ |I|·(‖p‖_∞/2)^q ≳ ‖p‖_∞^q/(2^q·d^2).

Hence by (8), and using that |I_1| = 1 − cos(π/m) = Θ(1/m^2),

∫_{x∈I_1} |p(x) − r(x)|^q dx ≤ |I_1|^{q+1}·‖p′‖_∞^q ≲ (1/m^2)^{q+1}·(d^2·‖p‖_∞)^q ≲ (d/m)^{2q+2}·(2c)^q·‖p‖_q^q

for some constant c. The same holds for the intervals I_{m/2}, ..., I_m by symmetry, so combining with (9) gives

∫ |p(x) − r(x)|^q dx ≲ (cd/m)^q·(1 + (d/m)^{q+2})·‖p‖_q^q,

or

‖p − r‖_q ≲ (d/m)·(1 + (d/m)^{1+2/q})·‖p‖_q.

When m ≥ d, the first term dominates, giving the result.

Lemma V.2. Let d ≥ 1. For any point b ∈ [−1, 1], there exists a degree-d polynomial p_b such that p_b(b) = 1, ‖p_b‖_∞ = 1, and

|p_b(x)| ≤ 2/(d·|x − b|)

for all x ∈ [−1, 1].

Proof: We will show a stronger form of this lemma for even d and b = 0, giving

|p_0(x)| ≤ 1/((d + 1)·|x|). (10)

By subtracting 1 from odd d, this implies the same for general d with a 1/d rather than a 1/(d + 1) term. Then for general b we take p_b(x) = p_0((x − b)/2), which satisfies the lemma. So it suffices to show (10) for even d and b = 0.

We choose p_0(x) to be

p(x) := (−1)^{d/2}·T_{d+1}(x)/((d + 1)·x),

so that

p(cos θ) = (−1)^{d/2}·cos((d + 1)θ)/((d + 1)·cos θ),

or, replacing θ with π/2 − θ and using that sin(dπ/2 + ψ) = (−1)^{d/2}·sin ψ for even d,

p(sin θ) = sin((d + 1)θ)/((d + 1)·sin θ).

Since |sin((d + 1)θ)| ≤ 1, this immediately gives (10); we just need to show p(0) = ‖p‖_∞ = 1. By L'Hôpital's rule, we have

p(0) = (d + 1)·cos((d + 1)·0) / ((d + 1)·cos 0) = 1.

If |θ| ≥ 1.1/(d + 1), then |sin θ| ≥ 1/(d + 1), and so (10) implies |p(sin θ)| ≤ 1. Since p is symmetric, all that remains is to show |p(sin θ)| ≤ 1 for 0 < θ < 1.1/(d + 1). The maximum value of p(sin θ) will either appear at an endpoint of this interval — which we have already shown is at most 1 — or at a zero of the derivative. We have

(d/dθ) p(sin θ) = [(d + 1)·cos((d + 1)θ)·(d + 1)·sin θ − sin((d + 1)θ)·(d + 1)·cos θ] / ((d + 1)·sin θ)^2
= [(d + 1)·cos((d + 1)θ)·sin θ − sin((d + 1)θ)·cos θ] / ((d + 1)·sin^2 θ).

For all ψ > 0, we have the inequalities ψ − ψ^3/6 < sin ψ < ψ and 1 − ψ^2/2 < cos ψ < 1 − ψ^2/2 + ψ^4/24.

Hence for 0 < θ < 1.1/(d + 1), the denominator is positive and the numerator is less than

(d + 1)·(1 − (d + 1)^2 θ^2/2 + (d + 1)^4 θ^4/24)·θ − ((d + 1)θ − (d + 1)^3 θ^3/6)·(1 − θ^2/2)
= θ^3·(−(d + 1)^3/2 + (d + 1)/2 + (d + 1)^3/6) + θ^5·((d + 1)^5/24 − (d + 1)^3/12)
≤ −(5/18)·(θ(d + 1))^3 + (θ(d + 1))^5/24
≤ −0.2·(θ(d + 1))^3 < 0.

Thus p(sin θ) does not have a local maximum on (0, 1.1/(d + 1)), so ‖p‖_∞ ≤ 1, finishing the proof.

Lemma V.3. For any d and α > 0, let m = d√α/π, and define b_j = −1 + 2j/m for j ∈ [m]. Consider the set of degree-2d polynomials f_S(x) = Σ_{j∈S} p_{b_j}(x)^2 for S ⊆ [m]. For any x ∈ [−1, 1], let k_x ∈ [m] minimize |b_{k_x} − x|. Then for any S ⊆ [m],

f_{{k_x}∩S}(x) ≤ f_S(x) ≤ f_{{k_x}∩S}(x) + α.

Proof: Since f_{S∪{k_x}}(x) = f_{{k_x}}(x) + f_{S\{k_x}}(x) for any S, it is sufficient to prove the result when k_x ∉ S. Then

f_S(x) ≤ Σ_{j≠k_x} p_{b_j}(x)^2 ≤ Σ_{j≠k_x} 4/(d^2·(x − b_j)^2) ≤ (4m^2/d^2)·Σ_{j≠k_x} 1/(2|j − k_x| − 1)^2 < (4m^2/d^2)·2·Σ_{j=1}^{∞} 1/(2j − 1)^2 = π^2 m^2/d^2 ≤ α,

as desired.
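The bound (10) from the proof of Lemma V.2 can also be checked numerically; the sketch below (our own, with an illustrative even degree d) evaluates p_0(x) = ±T_{d+1}(x)/((d + 1)x) and confirms that ‖p_0‖_∞ ≈ 1 and that |p_0(x)|·(d + 1)·|x| ≤ 1.

    import numpy as np
    from numpy.polynomial import chebyshev as C

    d = 10                                    # even degree, as in the proof
    T = C.Chebyshev.basis(d + 1)
    p0 = lambda x: (-1)**(d // 2) * T(x) / ((d + 1) * x)

    xs = np.linspace(-1, 1, 200001)
    xs = xs[np.abs(xs) > 1e-6]                # avoid the removable singularity at 0
    vals = p0(xs)
    print(np.max(np.abs(vals)))                           # about 1 (= ||p_0||_inf)
    print(np.max(np.abs(vals) * (d + 1) * np.abs(xs)))    # about 1, verifying (10)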


More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe PROPERTIES OF MULTIVARIATE HOMOGENEOUS ORTHOGONAL POLYNOMIALS Brahi Benouahane y Annie Cuyt? Keywords Abstract It is well-known that the denoinators of Pade approxiants can be considered as orthogonal

More information

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J.

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J. Probability and Stochastic Processes: A Friendly Introduction for Electrical and oputer Engineers Roy D. Yates and David J. Goodan Proble Solutions : Yates and Goodan,1..3 1.3.1 1.4.6 1.4.7 1.4.8 1..6

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

Complex Quadratic Optimization and Semidefinite Programming

Complex Quadratic Optimization and Semidefinite Programming Coplex Quadratic Optiization and Seidefinite Prograing Shuzhong Zhang Yongwei Huang August 4 Abstract In this paper we study the approxiation algoriths for a class of discrete quadratic optiization probles

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

OBJECTIVES INTRODUCTION

OBJECTIVES INTRODUCTION M7 Chapter 3 Section 1 OBJECTIVES Suarize data using easures of central tendency, such as the ean, edian, ode, and idrange. Describe data using the easures of variation, such as the range, variance, and

More information

arxiv: v1 [math.nt] 14 Sep 2014

arxiv: v1 [math.nt] 14 Sep 2014 ROTATION REMAINDERS P. JAMESON GRABER, WASHINGTON AND LEE UNIVERSITY 08 arxiv:1409.411v1 [ath.nt] 14 Sep 014 Abstract. We study properties of an array of nubers, called the triangle, in which each row

More information

Mixed Robust/Average Submodular Partitioning

Mixed Robust/Average Submodular Partitioning Mixed Robust/Average Subodular Partitioning Kai Wei 1 Rishabh Iyer 1 Shengjie Wang 2 Wenruo Bai 1 Jeff Biles 1 1 Departent of Electrical Engineering, University of Washington 2 Departent of Coputer Science,

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Generalized eigenfunctions and a Borel Theorem on the Sierpinski Gasket.

Generalized eigenfunctions and a Borel Theorem on the Sierpinski Gasket. Generalized eigenfunctions and a Borel Theore on the Sierpinski Gasket. Kasso A. Okoudjou, Luke G. Rogers, and Robert S. Strichartz May 26, 2006 1 Introduction There is a well developed theory (see [5,

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information