Statistical learning with Lipschitz and convex loss functions


Statistical learning with Lipschitz and convex loss functions

Geoffrey Chinot, Guillaume Lecué and Matthieu Lerasle

October 3, 2018

Abstract

We obtain risk bounds for Empirical Risk Minimizers (ERM) and minmax Median-Of-Means (MOM) estimators based on loss functions that are both Lipschitz and convex. Results for the ERM are derived without assumptions on the outputs and under subgaussian assumptions on the design as in [ ], but relaxing the global Bernstein condition of that paper into a local assumption as in [40]. Similar results are shown for minmax MOM estimators in a close setting where the design is only supposed to satisfy moment assumptions, relaxing the subgaussian hypothesis necessary for ERM. Unlike alternatives based on MOM's principle [4, 9], the analysis of minmax MOM estimators is not based on the small ball assumption (SBA) of [ ]. In particular, the basic example of nonparametric statistics where the learning class is the linear span of localized bases, which does not satisfy the SBA [37], can now be handled. Finally, minmax MOM estimators are analyzed in a setting where the local Bernstein condition is also dropped. They are shown to achieve an oracle inequality with exponentially large probability under minimal assumptions ensuring the existence of all objects.

1 Introduction

The paper studies learning problems where the loss function is simultaneously Lipschitz and convex. This situation happens in classical examples such as quantile, Huber and L1 regression, or logistic and hinge classification [40]. As the Lipschitz property allows to remove all assumptions on the outputs, these losses have been quite popular in robust statistics [7]. Empirical risk minimizers (ERM) based on Lipschitz losses such as the Huber loss have recently received important attention [44, 5, ].

Based on a dataset {(X_i, Y_i) : i = 1, ..., N} of points in X × Y, a class F of functions and a risk function R defined on F, the statistician should estimate an oracle f* ∈ argmin_{f ∈ F} R(f). The risk function R is often defined as the expectation of a known loss function ℓ : (f, (x, y)) ∈ F × (X × Y) ↦ ℓ_f(x, y) ∈ R with respect to the unknown distribution P of a random variable (X, Y) ∈ X × Y: R(f) = E[ℓ_f(X, Y)]. Hereafter, the risk is assumed to have this form for a loss function ℓ such that, for any (f, x, y), ℓ_f(x, y) = ℓ̄(f(x), y), for some function ℓ̄ : Ȳ × Y → R, where Ȳ is a convex set containing all possible values of f(x). The loss ℓ is said to be Lipschitz and convex when the following assumption holds.

Assumption 1. There exists L > 0 such that, for any y ∈ Y, ℓ̄(·, y) is L-Lipschitz and convex.

Many classical loss functions satisfy Assumption 1, as shown by the following examples.

The logistic loss, defined for any u ∈ Ȳ = R and y ∈ Y = {−1, 1} by ℓ̄(u, y) = log(1 + exp(−yu)), satisfies Assumption 1 with L = 1.

The hinge loss, defined for any u ∈ Ȳ = R and y ∈ Y = {−1, 1} by ℓ̄(u, y) = max(1 − uy, 0), satisfies Assumption 1 with L = 1.

The Huber loss, defined for any δ > 0 and (u, y) ∈ Ȳ × Y = R × R by ℓ̄(u, y) = (1/2)(y − u)^2 if |u − y| ≤ δ, and ℓ̄(u, y) = δ|y − u| − δ^2/2 if |u − y| > δ,

satisfies Assumption 1 with L = δ.

The quantile loss is defined, for any τ ∈ (0, 1) and (u, y) ∈ Ȳ × Y = R × R, by ℓ̄(u, y) = ρ_τ(u − y) where, for any z ∈ R, ρ_τ(z) = z(τ − I{z ≤ 0}). It satisfies Assumption 1 with L = 1. For τ = 1/2, the quantile loss is the L1 loss.

All along the paper, the following assumption is also granted.

Assumption 2. The class F is convex.

When (X, Y) and the data (X_i, Y_i)_{i=1,...,N} are independent and identically distributed (i.i.d.), for any f ∈ F, the empirical risk R_N(f) = (1/N) Σ_{i=1}^N ℓ_f(X_i, Y_i) is a natural estimator of R(f). The empirical risk minimizers (ERM) [4] obtained by minimizing f ∈ F ↦ R_N(f) are expected to be close to the oracle f*. This procedure and its regularized versions have been extensively studied in learning theory [0]. When the loss is both convex and Lipschitz, results have been obtained in practice [4, ] and in theory [40]. Risk bounds with exponential deviation inequalities for the ERM can be obtained without assumptions on the outputs Y, but under stronger assumptions on the design X, which must satisfy subgaussian or boundedness assumptions. Moreover, fast rates of convergence [39] can only be obtained under margin-type assumptions such as the Bernstein condition [8, 40]. The Lipschitz assumption together with global Bernstein conditions holding over the entire class F, as in [ ], implies boundedness in L_2-norm of the class F, see the discussion preceding Assumption 4 for details. This boundedness is not satisfied in linear regression with unbounded design, so the results of [ ] do not apply to this basic example. To bypass this restriction, the global condition has to be relaxed into a local one as in [5, 40], see also Assumption 4 below.

The main constraint in our results on ERM is the subgaussian assumption on the design. This constraint can be relaxed by considering alternative estimators based on the median-of-means (MOM) principle of [35, 9, 8, ] and the minmax procedure of [3, 5]. The resulting minmax MOM estimators have been introduced in [4] for least-squares regression as an alternative to other MOM-based procedures [9, 30, 3, 3]. In the case of convex and Lipschitz loss functions, these estimators satisfy the following properties: (1) as the ERM, they are efficient without assumptions on the noise, (2) they achieve optimal rates of convergence under weak stochastic assumptions on the design, relaxing the subgaussian or boundedness hypotheses used for ERM, and (3) the rates are not downgraded by the presence of some outliers in the dataset.

These improvements of MOM estimators upon ERM are not surprising. For univariate mean estimation, rate optimal subgaussian deviation bounds can be shown under minimal L_2 moment assumptions for MOM estimators [4], while the empirical mean needs each data point to have subgaussian tails to achieve such bounds [3]. In least-squares regression, MOM-based estimators [9, 30, 3, 3, 4] inherit these properties, whereas the ERM has downgraded statistical properties under moment assumptions [6]. Furthermore, MOM procedures are resistant to outliers: results hold in the O ∪ I framework of [3, 4], where inliers (or informative data), indexed by I, only satisfy weak moment assumptions, and the dataset may contain outliers, indexed by O, on which no assumption is made, see Section 3. This robustness, which almost comes for free from a technical point of view, is another important advantage of MOM estimators over ERM in practice.
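To make the median-of-means principle discussed above concrete, here is a minimal Python sketch of the MOM estimator of a univariate mean (the median of empirical means over K blocks of equal size), compared with the plain empirical mean on a heavy-tailed, corrupted sample. The block number, the Student-t inliers and the value of the outliers are illustrative choices of ours, not the paper's experimental setup.

```python
import numpy as np

def mom_mean(x, K, rng=None):
    """Median-of-means: split x into K blocks of equal size (discarding the
    remainder), average within each block, and return the median of the means."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.permutation(x)                     # random equipartition of the data
    n = (len(x) // K) * K
    block_means = x[:n].reshape(K, -1).mean(axis=1)
    return np.median(block_means)

rng = np.random.default_rng(0)
inliers = rng.standard_t(df=3, size=990) + 1.0   # heavy-tailed, true mean = 1
outliers = np.full(10, 1e4)                      # 1% wild outliers
sample = np.concatenate([inliers, outliers])

print("empirical mean :", sample.mean())                      # ruined by outliers
print("median of means:", mom_mean(sample, K=30, rng=rng))    # close to 1
```

As long as more than half of the blocks contain no outlier (here K = 30 blocks against 10 outliers), the median of block means stays close to the true mean, which is the mechanism behind the robustness claims above.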
Figure 1 illustrates this robustness: the statistical performance of standard logistic regression is strongly affected by a single corrupted observation, while the minmax MOM estimator maintains good statistical performance even with 5% of corrupted data. Compared to [9, 4], considering convex-Lipschitz losses instead of the square loss simplifies simultaneously some assumptions and the presentation of the results for MOM estimators: the L_2-assumptions on the noise in [9, 4] can be removed, and the complexity parameters driving the risk of ERM and MOM estimators only involve a single stochastic linear process, see Eq. (3) and (6) below. Also, contrary to the analysis in least-squares regression, the small ball assumption [ , 33] is not required here. Recall that this assumption states that there are absolute constants κ and β such that, for all f ∈ F, P[|f(X) − f*(X)| ≥ κ ‖f − f*‖_{L_2}] ≥ β.

All figures can be reproduced from the code available at

Figure 1: MOM logistic regression vs. logistic regression from Sklearn (p = 50 and N = 1000).

It is interesting as it involves only moments of order 1 and 2 of the functions in F. However, it does not hold in classical frameworks such as histograms, see [37, 6] and Section 5. Finally, minmax MOM estimators are studied in a framework where the Bernstein condition is dropped. In this setting, they are shown to achieve an oracle inequality with exponentially large probability, see Section 4. The results are slightly weaker in this relaxed setting: the excess risk is bounded but not the L_2 risk, and the rates of convergence are slow, in 1/√N in general. Fast rates of convergence in 1/N can still be recovered from this general result if a local Bernstein type condition is satisfied, see Section 4 for details. This last result shows that minmax MOM estimators can be safely used with Lipschitz and convex losses, assuming only that inliers are independent with moments giving sense to all objects necessary to state the results. To approximate minmax MOM estimators, an algorithm inspired from [4, 7] is also proposed. Convergence of this algorithm has been proved in [7] under strong assumptions, but, to the best of our knowledge, convergence rates have not been established. Nevertheless, the simulation study presented in Section 6 shows that the algorithm has interesting empirical performance in the setting of this paper.

The paper is organized as follows. Optimal results for the ERM are presented in Section 2. Minmax MOM estimators are introduced and analysed in Section 3 under a local Bernstein condition, and in Section 4 without the Bernstein condition. A discussion of the main assumptions is provided in Section 5, while Section 6 provides a typical example where all assumptions are satisfied and a simulation study where a natural algorithm associated with the minmax MOM estimator for the logistic loss is presented. The proofs of the main theorems are gathered in Sections A, B and C.

Notations. Let X, Y be measurable spaces. Let F be a class of measurable functions f : X → Y and let (X, Y) ∈ X × Y be a random variable with distribution P. Let µ denote the marginal distribution of X. For any probability measure Q on X × Y and any function g ∈ L_1(Q), let Qg = ∫ g(x, y) Q(dx, dy). Let ℓ : F × (X × Y) → R, (f, (x, y)) ↦ ℓ_f(x, y), denote a loss function measuring the error made when predicting y by f(x). It is always assumed that there exist a convex set Ȳ containing all possible values of f(x) for f ∈ F, x ∈ X, and a function ℓ̄ : Ȳ × Y → R such that, for any (f, x, y) ∈ F × X × Y, ℓ̄(f(x), y) = ℓ_f(x, y). Let R(f) = Pℓ_f = E[ℓ_f(X, Y)] for f in F denote the risk and let L_f = ℓ_f − ℓ_{f*} denote the excess loss. If F ⊂ L_2(P) := L_2 and Assumption 1 holds, an equivalent risk can be defined even if Y ∉ L_1. Actually, for any f_0 ∈ F, ℓ_f − ℓ_{f_0} ∈ L_1, so one can define R̄(f) = P(ℓ_f − ℓ_{f_0}). W.l.o.g. the set of risk minimizers is assumed to be reduced to a singleton, argmin_{f ∈ F} R(f) = {f*}. f* is called the oracle, as f*(X) provides the prediction of Y with minimal risk among functions in F. For any f and p > 0, let ‖f‖_{L_p} = (P|f|^p)^{1/p},

4 for any r 0, let rb L = {f F : f L r} and rs L = {f F : f L = r}. For any set H for which it makes sense, H + f = {h + f s.t h H}, H f = {h f s.t h H}. ERM in the subgaussian framework This section studies the ERM, improving some results from []. In particular, the global Bernstein condition in [] is relaxed into a local hypothesis following [40]. All along this section, data X i, Y i i= are independent and identically distributed with common distribution P. The ERM is defined for f F P l f = / i= l f X i, Y i by ˆf ERM = arg min P l f. The results for the ERM are shown under a subgaussian assumption on the design. This result is the benchmark for the following minmax MOM estimators. Definition. Let B. F is called B-subgaussian with respect to X when for all f F and all λ > E expλ fx / f L expλ B /. Assumption 3. The class F F is B-subgaussian, where F F = {f f : f, f F }. Under this subgaussian assumption, statistical complexity can be measured via Gaussian mean-widths. Definition. Let H L. Let G h h H be the canonical centered Gaussian process indexed by H in particular, the covariance structure of G h h H is given by EG h G h / = Eh X h X / for all h, h H. The Gaussian mean-width of H is wh = E h H G h. The complexity parameter driving the performance of ˆf ERM is presented in the following definition. Definition 3. The complexity parameter is defined as r θ = inf{r > 0 : 3LwF f rb L θr } where L > 0 is the Lipschitz constant from Assumption. Let A > 0. In [8], the class F is called, A-Bernstein if, for all f F, P L f AP L f. Under Assumption, F is, AL Bernstein if the following stronger assumption is satisfied f f L AP L f. This stronger version was used, for example in [] to study ERM. However, under Assumption, Eq implies that f f L AP L f AL f f L AP L f AL f f L. Therefore, f f L AL for any f F. The class F is bounded in L -norm, which is restrictive as this assumption is not verified by the class of linear functions for example. To bypass this issue, the following condition is introduced. Assumption 4. There exists a constant A > 0 such that for all f F if f f L f f L AP L f. r θ then In Assumption 4, Bernstein condition is granted in a L -neighborhood of f only. Outside of this neighborhood, there is no restriction on the excess loss. This relaxed assumption is satisfied on linear models and Lipschitz-convex loss functions under moment assumptions as will be checked in Section 5. The following theorem is the main result of this section. 4

5 Theorem. Grant Assumptions,, 3 and 4 and let θ = /A, ˆf ERM defined in satisfies, with probability larger than exp θ r θ 6 3 L, 3 ˆf ERM f L r θ and P L ˆf ERM θr θ. 4 Theorem is proved in Section A.. It holds without restrictions on the outputs Y. Theorem shows deviation bounds both in L norm and for the excess risk, which are both minimax optimal as proved in []. As in [], a similar result can be derived if the subgaussian Assumption 3 is replaced by a boundedness in L assumption. An extension of Theorem can be shown, where Assumption 4 is replaced by the following hypothesis: there exists κ such that for all f F in a L -neighborhood of f, f f κ L AP L f. The case κ = is the most classical and its analysis contains all the ingredients for the study of the general case with any parameter κ. More general Bernstein conditions can also be considered as in [40, Chapter 7]. These extensions are left to the interested reader. 3 Minmax MOM estimators This section presents and studies minmax MOM estimators, comparing them to ERM. 3. The estimators The framework of this section is a relaxed version of the i.i.d. setup considered in Section. Following [3, 4], there exists a partition O I of {,, } in two subsets which is unknown to the statistician. o assumption is granted on the set of outliers X i, Y i i O. Inliers, indexed by I, are only assumed to satisfy the following assumption. Assumption 5. X i, Y i i I are independent and for all i I, X i, Y i has distribution P i, X i has distribution µ i and for any p > 0 and any function g for which it makes sense g Lpµ i = P i g p /p. We assume that, for any i I, f f L = f f L µ i and P i L f = P L f. Assumption 5 holds in the i.i.d case but it covers other situations where informative data X i, Y i i I may have different distributions. Recall the definition of MOM estimators of univariate means. Let B k k=,...,k denote a partition of {,..., } into blocks B k of equal size /K if is not a multiple of K, just remove some data. For any function f : X Y R and k {,..., K}, let P Bk f = K/ i B k fx i, Y i. MOM estimator is the median of these empirical means: MOM K f = MedPB f,, P BK f. The estimator MOM K f achieves rate optimal subgaussian deviation bounds, assuming only that P f <, see for example [4]. The number K is a tuning parameter. The larger K, the more outliers is allowed. When K =, MOM K f is the empirical mean, when K =, the empirical median. Following [4], remark that the oracle is also solution of the following minmax problem: f arg min P l f = arg min g F P l f l g. Minmax MOM estimators are obtained by plugging MOM estimators of the unknown expectations P l f l g in this formula: ˆf arg min MOM K lf l g. 5 g F The minmax MOM construction can be applied systematically as an alternative to ERM. For instance, it yields a robust version of logistic classifiers. The minmax MOM estimator with K = is the ERM. 5

6 The linearity of the empirical process P is important to use localisation technics and derive fast rates of convergence for ERM [], improving slow rates derived with the approach of [4], see [39] for details on fast and slow rates. The idea of the minmax reformulation comes from [3], where this strategy allows to overcome the lack of linearity of some alternative robust mean estimators. [3] introduced minmax MOM estimators to least-squares regression. 3. Theoretical results 3.. Setting The assumptions required for the study of estimator 5 are essentially those of Section except for the subgaussian Assumption 3 which is replaced by the moment Assumption 5. Instead of Gaussian mean width, the complexity parameter is expressed as a fixed point of local Rademacher complexities [7, 0, 6]. Let σ i i=,..., denote i.i.d. Rademacher random variables uniformly distributed on {, }, independent from X i, Y i i I. Let { r γ = inf r > 0, J I : J, E } σ i f f X i r J γ. 6 : f f L r The outputs do not appear in the complexity parameter. This is an interesting feature of Lipschitz losses. Since Assumption 4 depends on the complexity parameter in the subgaussian framework, it is necessary to adapt the Bernstein assumption to this framework. Assumption 6. There exists a constant A > 0 such that for all f F if f f L C K,r then f f L AP L f where C K,r = max r γ, 864A L K. 7 Assumptions 6 and 4 have a similar flavor as both require the Bernstein condition in a L neighborhood of f with radius given by the rate of convergence of the associated estimator see Theorems and. For K r γ/846a L the neighborhood {f F : f f L C K,r } is a L -ball around f of radius r γ which can be of order / see Section As a consequence, Assumption 6 holds in examples where the small ball assumption does not see discussion after Assumption Main results We are now in position to state the main result regarding the statistical properties of estimator 5 under a local Bernstein condition. Theorem. Grant Assumptions,, 5 and 6 and assume that O 3/7. Let γ = /575AL and K [ 7 O /3, ]. The minmax MOM estimator ˆf defined in 5 satisfies, with probability at least i J exp K/06, 8 ˆf f L C K,r and P L ˆf 3A C K,r. 9 Suppose that K = r γ, which is possible as long as O r γ. The deviation bound is then of order r γ and the probability estimate exp r γ/06. Therefore, minmax MOM estimators achieve the same statistical bound with the same deviation as the ERM as long as r γ and r θ are of the same order. Using generic chaining [38], this comparison is true under Assumption 3. It can also be shown under weaker moment assumption, see [34] or the example of Section When r γ r θ, the bounds are rate optimal as shown in []. This is why these bounds are called rate optimal subgaussian deviation bounds. While these hold for ERM in the i.i.d. setup with subgaussian 6

7 design in the absence of outliers see Theorem, they hold for minmax MOM estimators in a setup where inliers may not be i.i.d., nor have subgaussian design and up to r γ outliers may have contaminated the dataset. This section is concluded by presenting an estimator achieving 5 simultaneously for all K. For all K {,..., } and f F, define T K f = g F MOM K lf l g and let ˆR K = {g F : T K g /3AC K,r }. 0 ow, building on the Lepskii s method, define a data-driven number of blocks ˆK = inf K {,..., } : ˆR J J=K and let f such that f ˆR J. J= ˆK Theorem 3. Grant Assumptions,, 5 and 6 and assume that O 3/7. Let γ = /575AL. The estimator f defined in is such that for all K [ 7 O /3, ], with probability at least 4 exp K/06, f f L C K,r and P L f 3A C K,r. Theorem 3 states that f achieves the results of Theorem simultaneously for all K. This extension is useful as the number O of outliers is typically unknown in practice. However, contrary to ˆf, the estimator f requires the knowledge of A and rγ. This kind of limitation is not surprising, it holds in least-squares regression [4] and comes from the univariate mean estimation case [4, Theorem 3.] A basic example The following example illustrates the optimality of the rates provided in Theorem. Lemma [9]. In the O I framework with F = { t, : t R p }, we have r γ RankΣ/γ, where Σ = E[XX T ] is the p p covariance matrix of X. The proof of Lemma is recalled in Section B for the sake of completeness. Section 5 shows that Assumptions 4 and 6 are satisfied when F = { t, : t R p } and X is a vector with i.i.d. entries having only a few finite moments. Theorem applies therefore in this setting and the Minmax MOM estimator 5 achieves the optimal fast rate of convergence RankΣ/. 4 Relaxing the Bernstein condition This section shows that minmax MOM estimators satisfy sharp oracle inequalities with exponentially large deviation under minimal stochastic assumptions insuring the existence of all objects. These results are slightly weaker than those of the previous section: the L risk is not controlled and only slow rates of convergence hold in this relaxed setting. However, the bounds are sufficiently precise to imply fast rates of convergence for the excess risk as in Theorems if a slightly stronger Bernstein condition holds. Given that data may not have the same distribution as X, Y, the following relaxed version of Assumption 5 is introduced. Assumption 7. X i, Y i i I are independent and for all i I, X i, Y i has distribution P i, X i has distribution µ i. For any i I, F L µ i and P i L f = P L f for all f F. 7

8 When Assumption 6 does not necessary hold, the localization argument has to be modified. Instead of the L -norm, the excess risk f F P L f is used to define neighborhoods around f. The associated complexity is then defined for all γ > 0 and K {,, } by { Er r γ = inf r > 0 : max γ, } 536V K r r 3 where Er = J I: J / E and V K r = max i I :P L f r :P L f r σ i f f X i J, i J K Var Pi L f. The main difference with r θ in Definition or r γ in 6 is the extra variance term V K r. Under the Bernstein condition, this term is negligible in front of the expectation term Er see [8]. In the general setting considered here, the variance term is handled in the complexity parameter. [ Theorem ] 4. Grant Assumptions,, 7 and assume that O 3/7. Let γ = /768L and K 7 O /3,. The minmax MOM estimator ˆf defined in 5 satisfies, with probability at least exp K/06, P L ˆf r γ. Recall that Assumptions and are only meaning that the loss function is convex and Lipschitz and that the class F is convex. The stochastic assumption 7 says that inliers are independent and define the same excess risk as X, Y over F. In particular, Theorem 4 holds without assumptions on the outliers X i, Y i i O or the outputs Y i i I of the inliers. Moreover, the risk bound holds with exponentially large probability without assuming sub Gaussian design, a small ball hypothesis or a Bernstein condition. This generality can be achieved by combining MOM estimators with convex-lipschitz loss functions. The following result discuss relationships between Theorems and 4. Introduce the following modification of the Bernstein condition. Assumption 8. Let γ = /768L. There exists a constant A > 0 such that for all f F if P L f C K,r then f f L AP L f where, for r γ defined in 6, r C K,r = max γ/a A, 536AL K. 4 Assumption 8 is slightly stronger than Assumption 4 since the neighborhood around f where the condition holds is slightly larger. If Assumption 8 holds then Theorem 5 implies the same statistical bounds for 5 as Theorem up to constants, as shown by the following result. Theorem 5. Grant Assumptions,, 7 and assume that O 3/7. Assume that the local Bernstein condition Assumption 8 holds. Let γ = /768L and K [ 7 O /3, ]. The minmax MOM estimator ˆf defined in 5 satisfies, with probability at least exp K/06, ˆf f L max r γ/a, 536L A K r and P L ˆf max γ/a A, 536L AK. Proof. First, V K r LV K r for all r > 0 where V K r = K/ max i I :P Lf r f f L µ i. Moreover, r Er/r and r V K r/r are non-increasing, therefore by Assumption 8 and the definition of r γ, V K r, γ E r γ/a r γ/a A A and Hence, r γ max r γ/a/a, 536L AK/. 536V K 536L AK 536A LK. 8

9 5 Bernstein s assumption This section shows that the local Bernstein condition holds for various loss functions and design X. All along the section, the oracle f and the Bayes rules which minimizes the risk f Rf over all measurable functions from X to Y are assumed to be equal. In that case, the Bernstein s condition [8] coincides with the margin assumption [39, 3]. The class F and design X are also assumed to satisfy a local L 4 /L - assumption. Let r {r θ, C K,r}, to verify Assumption 4, choose r = r θ while for Assumption 6, choose r = C K,r. Assumption 9. There exists C > 0 such that for all f F satisfying f f L r, f f L4 C f f L Assumption 9 is a local version of a L 4 /L norm equivalence assumption over F {f } which has been used for the study of MOM estimators see [9] since it implies the small ball condition. Examples of distributions satisfying the global i.e. over all F version of Assumption 9 can be found in [33]. Assumption 9 is local, it holds only in a L neighborhood of f and not on the entire set F {f }. There are situations where the constant C depends on the dimension p of the model. In that case, the results in [9, 4] provide sub-optimal statistical upper bounds. For instance, if X is uniformly distributed on [0, ] and F = { p j= α ji Aj : α j p j= Rp } where I Aj is the indicator of A j = [j /p, j/p] then for all f F, f f L4 p f f L so C = p. As shown in the following subsections, the rates given in Theorem or Theorem 3 are not deteriorated in this example. 5. Quantile loss The proof is based on [5, Lemma.]. Recall that l f x, y = y fxτ I{y fx 0} and that the Bayes estimator is the quantile q Y X=x τ of order τ of the conditional distribution of Y given X = x. Assumption 0. Let C be the constant defined in Assumption 9. There exists α > 0 such that, for all x X and for all z in R such that z f x C r we have f Y X=x z α, where f Y X=x is the conditional density function of Y given X = x. Theorem 6. Grant Assumptions 9 and 0. Assume that the oracle f is the Bayes estimator i.e. f x = qτ Y X=x for all x X. Then, the local Bernstein condition holds: for all f F such that f f L r, f f L 4 α P L f. The proof is postponed to Section C.. Consider the example from Section 3..3, assume that K RankΣ and recall that r = C K,r = r γ RankΣ/γ. If C = p, Assumption 0 holds as long as p 3 and there exists α > 0 such that, for all x X and for all z in R such that z f x, f Y X=x z α. In this situation, the rates given in Theorems and 3 are still RankΣ/. In particular, they are not deteriorated if C = p while those given in [9, 4] are of order prankσ/. This gives a partial answer, in our setting, to the issue raised in [37] regarding results based on the small ball method. 5. Huber Loss Consider the Huber loss function defined for all f F and x X, y R by l f x, y = ρ H y fx where ρ H t = t / if t δ and ρ H t = δ t δ otherwise. Introduce the following assumption. Assumption. Let C be the constant defined in Assumption 9. There exists α > 0 such that for all x X and all z in R such that z f x C r, F Y X=x z + δ F Y X=x z δ α, where F Y X=x is the conditional cumulative function of Y given X = x. Under this assumption, the local Bernstein condition can be checked as shown by the following result. 9

10 Theorem 7. Grant Assumptions 9 and. Assume that the oracle f is the Bayes rules, meaning that for all x X, f x argmin u R E[ρ H Y u X = x]. Then, for all f F, if f f L r then f f L 4/αP L f. The proof is postponed to Section C.. Theorem 6 applies to this example too. The interested reader can check that the remark following 5.3 Hinge loss Let η denote the regression function η : x X E[Y X = x] for all x X and recall that the Bayes rule is defined by f : x X sgnηx. The function f minimizes f Rf over all measurable functions from X to Y = {, } [43]. Assume that f is the Bayes rules, i.e. that f F. Then, Assumption 4 holds if the following margin assumption is satisfied, see [5] and [39, Proposition ]: The following result was proved in [5]. there exists τ > 0 such that ηx τ a.s. 5 Theorem 8. Let F be a class of functions from X to [, ]. If f x = sgnηx for all x X and 5 is satisfied, the Bernstein condition holds for the hinge loss with A = /τ. 5.4 Logistic classification Consider the logistic loss and recall that the Bayes rule, called the log-odds ratio, is defined by f : x log[ηx/ ηx] for all x X, where ηx = PY = X = x for all x X. Consider the following assumption on η. Assumption. There exists c 0 > 0 such that P + expc 0 ηx e c 0. + exp c 0 Assumption excludes trivial cases where deterministic predictors equal to or are optimal. Theorem 9. Grant Assumptions 9 and and assume that the Bayes rules is in F. Then, there exists A > 0 such that f F : f f L r, P L f A f f L. Theorem 9 is proved in Section C.3. The explicit form of the constant A is presented in the proof. A close inspection of the proof reveals that Assumption is only used to bound f X with large probability. If the oracle is bounded, the same result holds without assuming that f is the Bayes rules. 6 Simulation study This section provides a short simulation study that illustrates our theoretical findings. The section starts with an example where all assumptions of Theorem are simultaneously satisfied. 6. Application: robust classification Let X = R p, Y = {, }, let l denote the logistic loss recalled in Section. Recall that Assumption holds with L =. Let F = { t, : t R p } be a class of linear functions, so Assumption holds. Assume that X = ξ,, ξ p in R p, where ξ j p j= are centered, independent and identically distributed and that X, Y satisfies a logistic model PY = X log = X, t + ɛ. 6 PY = X 0

11 By Theorem 9, Assumptions 4 and 6 hold if f F s.t f f L r, f f L4 C f f L 7 for r {r θ, C K,r}. ow, since ξ,, ξ p are centered i.i.d. in R p, basic algebraic computations show that 7 holds if E ξ 4 /4 AE ξ /. This assumption is satisfied, for example, by student distributions T 5 that do not have a fifth moment. The results allow non-symmetric noise such as ɛ L 0, where L µ, σ denotes the log-normal distribution with mean expµ + σ/ and variance expσ expµ + σ. Consider the model 6 with X and ɛ as above. Let X i, Y i i I independent and distributed as X, Y. Let also X i, Y i i O where I O = {,, }. For any K 7 O /3, the minmax MOM estimator is solution of the following problem: ˆt MOM K arg min t R p t R p MOM K l t l t. 8 As all assumptions in Theorem are satisfied, if the number of outliers is such that O p and the number of blocks is K = p then there exist absolute positive constants a, a, a 3 and a 4 depending only on A such that, with probability larger than a exp a p, 6. Simulations ˆt MOM p t p a 3 and P Lˆt a p 4. 9 This section presents algorithms derived from the minmax MOM construction 8. These algorithms are inspired from [4]. Consider the setup of Section 6. where Theorem applies: X = ξ,, ξ p, where ξ j p j= are independent and identically distributed, with ξ T 5, ɛ L 0, and PY = X log = X, t + ɛ. PY = X Following [4], a gradient ascent-descent step is performed on the empirical incremental risk t, t P Bk l t l t constructed on the block B k of data realizing the median of the empirical incremental risk. Initial points t 0 R p and t 0 R p are taken at random. In logistic regression, the step sizes η and η are usually chosen equal to XX op /4, where X is the p matrix with row vectors equal to X,, X and op denotes the operator norm. In a corrupted environment, this choice might lead to disastrous performance. This is why η and η are computed at each iteration using only data in the median block: let B k denote the median block at the current step, then one chooses η = η = X k X k op/4 B k where X k is the B k p matrix with rows given by Xi T for i B k. In practice, K is chosen by robust cross-validation choice as in [4]. In a first approach and according to our theoretical results, the blocks are chosen at the beginning of the algorithm. As illustrated in Figure, this first strategy has some limitations. To understand the problem, for all k =,..., K, let C k denote the following set C k = {t R p : P Bk l t = Median {P B l t,..., P BK l t }}. If the minimum of t P Bk l t lies in C k, the algorithm typically converges to this minimum if one iteration enters C k. As a consequence, when the minmax MOM estimator 8 lies in another cell, the algorithm does not converge to this estimator. To bypass this issue, the partition is changed at every ascent/descent steps of the algorithm, it is chosen uniformly at random among all equipartition of the dataset. This alternative algorithm is described in Algorithm. In practice, changing the partition seems to widely accelerate the convergence see Figure.
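The following Python sketch implements the changing-blocks descent-ascent procedure just described (formalized as Algorithm 1 below) on data drawn from the model of Section 6.1, with Student-t(5) coordinates for the design and shifted log-normal noise as in the simulations. The dimensions, the choice of t*, the number of blocks and iterations, and our reading of the step-size rule (inverse smoothness of the logistic risk on the median block) are illustrative assumptions, not the exact experimental settings of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- data from the logistic model of Section 6.1 (illustrative sizes) --------
N, p = 1000, 50
t_star = rng.standard_normal(p) / np.sqrt(p)          # hypothetical target
X = rng.standard_t(df=5, size=(N, p))                  # heavy-tailed design
eps = rng.lognormal(mean=0.0, sigma=1.0, size=N)       # non-symmetric noise
probs = 1.0 / (1.0 + np.exp(-(X @ t_star + eps)))      # logit = <X, t*> + eps
y = np.where(rng.uniform(size=N) < probs, 1.0, -1.0)

def logistic_loss(u, y):
    return np.logaddexp(0.0, -y * u)

def logistic_grad(t, Xb, yb):
    # gradient of the block empirical risk t -> P_{B_k} ell_t
    margins = yb * (Xb @ t)
    return -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def mom_logistic(X, y, K, n_iter=500):
    """Descent-ascent on the median block, blocks redrawn at every iteration."""
    t = np.zeros(X.shape[1])
    t_prime = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # random equipartition into K blocks
        blocks = np.array_split(rng.permutation(len(y)), K)
        # block realizing the median of the empirical increments P_B(ell_t - ell_{t'})
        incr = [np.mean(logistic_loss(X[b] @ t, y[b])
                        - logistic_loss(X[b] @ t_prime, y[b])) for b in blocks]
        k = int(np.argsort(incr)[len(incr) // 2])
        Xk, yk = X[blocks[k]], y[blocks[k]]
        # step size: inverse smoothness of the logistic risk on the median block
        eta = 4.0 * len(yk) / np.linalg.norm(Xk.T @ Xk, ord=2)
        t = t - eta * logistic_grad(t, Xk, yk)              # descent step in t
        # ascent step in t' for P_B(ell_t - ell_{t'}), i.e. descent on its own risk
        t_prime = t_prime - eta * logistic_grad(t_prime, Xk, yk)
    return t

t_hat = mom_logistic(X, y, K=11)
print("estimation error:", np.linalg.norm(t_hat - t_star))
```

Because the median block is recomputed on a fresh random partition at each step, a single cell C_k cannot trap the iterates, which is the motivation given above for changing the blocks.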

12 Input: The number of block K, initial points t 0 and t 0 in R p and the stopping criterion ɛ > 0 Output: A solution of the minimax problem 8 while t i t i ɛ do Split the data into K disjoint blocks B k k {,,K} of equal sizes chosen at random: B B K = {,, }. 3 Find k [K] such that MOM K l ti = P Bk l ti. 4 Compute η = η = X T k X k op /4. 5 Update t i+ = t i η tp Bk l t t=ti and t i+ = t i η t P B k l t t= t i. 6 end Algorithm : Descent-ascent gradient method with blocks of data chosen at random at every steps. Simulation results are gathered in Figure. In these simultations, there is no outlier, = 000 and p = 00 with X i, Y i 000 i= i.i.d with the same distribution as X, Y. Minmax MOM estimators 8 are compared with the Logistic Regression algorithm from the scikit-learn library of [36]. The upper pictures compare performance of MOM ascent/descent algorithms with fixed and changing blocks. These pictures give an example where the fixed block algorithm is stuck into local minima and another one where it does not converge. In both cases, the changing blocks version converges to t. Running times of logistic regression LR and its MOM version MOM LR are compared in the lower picture of Figure in a dataset free from outliers. LR and MOM LR are coded with the same algorithm in this example, meaning that MOM gradient descent-ascent and simple gradient descent are performed with the same descent algorithm. As illustrated in Figure, running each step of the gradient descent on one block only and not on the whole dataset accelerates the running time. The larger the dataset, the bigger the benefit is expected. The resistance to outliers of logistic regression and its minmax MOM alternative are depicted in Figure in the introduction. We added an increasing number of outliers to the dataset. Outliers {X i, Y i, i O} in this simulation are such that X i L 0, 5 and Y i = sign X i, t + ɛ i, with ɛ i ɛ as above. Figure shows that logistic classification is mislead by a single outlier while MOM version maintains reasonable performance with up to 50 outliers i.e 5% of the database is corrupted. A byproduct of Algorithm is an outlier detection algorithm. Each data receives a score equal to the number of times it is selected in a median block in the random choice of block version of the algorithm. The first iterations may be misleading: before convergence, the empirical loss at the current point may not reveal the centrality of the data because the current point may be far from t. Simulations are run with = 00, p = 0 and 5000 iterations and therefore only the score obtained by each data in the last 4000 iterations are displayed. 3 outliers X i, Y i i {,,3} with X i = 0 p j= and Y i = sign X i, t have been introduced at number 4, 6 and 66. Figure 3 shows that these are not selected once. 7 Conclusion The paper introduces a new homogeneity argument for learning problems with convex and Lipschitz losses. This argument allows to obtain estimation rates and oracle inequalities for ERM and minmax MOM estimators improving existing results. The ERM requires subgaussian hypotheses on the design and a local Bernstein condition see Theorem, both assumptions can be removed for minmax MOM estimators see Theorem 5. The local Bernstein conditions provided in this article can be verified in several learning problems. In particular, it allows to derive optimal risk bounds in examples where analysis based on the small ball hypothesis fail. 
Minmax MOM estimators applied to convex and Lipschitz losses are efficient without assumptions on the outputs Y and under minimal L_2 assumptions on the design, and the results are robust to the presence of a few outliers in the dataset. A modification of these estimators can be implemented efficiently and confirms all these conclusions.

13 Figure : Top left and right: Comparizon of the algorithm with fixed and changing blocks. Bottom: Comparizon of running time between classical gradient descent and algorithm. In all simulation = 000, p = 00 and there is no outliers. A Proof of Theorems,, 3 and 4 A. Proof of Theorem The proof is splitted in two parts. First, we identify an event where the statistical behavior of the regularized estimator ˆf ERM can be controlled. Then, we prove that this event holds with probability at least 3. Introduce the following event: Ω := { f F f + r θb L, P P L f θr θ } where θ is a parameter appearing in the definition of r in Definition 3. Proposition. On the event Ω, one has ˆf ERM f L r θ and P L ˆf ERM θr θ. Proof. By construction, ˆf ERM satisfies P L ˆf ERM 0. Therefore, it is sufficient to show that, on Ω, if f f L > r θ, then P L f > 0. Let f F be such that f f L > r θ. By convexity of F, there exists f 0 F f + r θs L and α > such that For all i {,, }, let ψ i : R R be defined for all u R by f = f + αf 0 f. 0 ψ i u = lu + f X i, Y i lf X i, Y i. 3

14 Figure 3: Outliers Detection Procedure for = 00, p = 0 and outliers are i = 4, 6, 66 The functions ψ i are such that ψ i 0 = 0, they are convex because l is, in particular αψ i u ψ i αu for all u R and α and ψ i fx i f X i = lfx i, Y i lf X i, Y i so that the following holds: P L f = α α ψ i fxi f X i = α α i= i= ψ i αf 0 X i f X i α ψ i f 0 X i f X i = αp L f0. i= Until the end of the proof, the event Ω is assumed to hold. Since f 0 F f + r θb L, P L f0 P L f0 θr θ. Moreover, by Assumption 4, P L f 0 A f 0 f L = A r θ, thus P L f0 A θr θ. 3 From Eq. and 3, P L f > 0 since A > θ. Therefore, ˆf ERM f L r θ. This proves the L -bound. ow, as ˆf ERM f L r θ, P P L ˆf ERM θr θ. Since P L ˆf ERM 0, This show the excess risk bound. P L ˆf ERM = P L ˆf ERM + P P L ˆf ERM θr θ. Proposition shows that ˆf ERM has the risk bounds given in Theorem on the event Ω. To show that Ω holds with probability 3, recall the following results from []. Lemma. [] [Lemma 8.] Grant Assumptions and 3. Let F F with finite L -diameter d L F. For every u > 0, with probability at least exp u, P P L f L g 6L wf + ud L F. f,g F 4

15 It follows from Lemma that for any u > 0, with probability larger that exp u, P P L f P P L f L g f +r θb L f,g F f +r θb L 6L wf f r θb L + ud L F f r θb L where d L F f r θb L r θ. By definition of the complexity parameter see Eq. 3, for u = θ r θ/64l, with probability at least exp θ r θ/6 3 L, 4 for every f in F f + r θb L, P P L f θr θ. 5 Together with Proposition, this concludes the proof of Theorem. A. Proof of Theorem The proof is splitted in two parts. First, we identify an event Ω K where the statistical properties of ˆf from Theorem can be established. ext, we prove that this event holds with probability 8. Let α, θ and γ be positive numbers to be chosen later. Define 4L K C K,r = max θ α, r γ where the exact form of α, θ and γ are given in Equation 34. Set the event Ω K to be such that { Ω K = f F f + } C K,r B L, J {,..., K} : J > K/ and k J, P Bk P L f θc K,r. A.. Deterministic argument The goal of this section is to show that, on the event Ω K, ˆf f L C K,r and P L ˆf θc K,r. Lemma 3. If there exists η > 0 such that \f + C K,r B L then ˆf f L C K,r. Proof. Assume that 7 holds, then MOM K lf l f < η and MOM K lf l f η, 7 f + C K,r B L 6 inf \f + C K,r B L MOM K[l f l f ] > η. 8 Moreover, if T K f = g F MOM K [l f l g ] for all f F, then T K f = MOM K [l f l f ] f + C K,r B L MOM K [l f l f ] η. 9 \f + C K,r B L By definition of ˆf and 9, TK ˆf T K f η. Moreover, by 8, any f F \ f + C K,r B L satisfies T K f MOM K [l f l f ] > η. Therefore ˆf F f + C K,r B L. 5

16 Lemma 4. Grant Assumption 6 and assume that θ A < θ. η = θc K,r. On the event Ω K, 7 holds with Proof. Let f F be such that f f L > C K,r. By convexity of F, there exists f 0 F f + C K,r S L and α > such that f = f + αf 0 f. For all i {,..., }, let ψ i : R R be defined for all u R by ψ i u = lu + f X i, Y i lf X i, Y i. 30 The functions ψ i are convex because l is and such that ψ i 0 = 0, so αψ i u ψ i αu for all u R and α. As ψ i fx i f X i = lfx i, Y i lf X i, Y i, for any block B k, P BK L f = α ψ i fxi f X i = α ψ i αf 0 X i f X i B k α B k α i B k i B k α ψ i f 0 X i f X i = αp Bk L f0. 3 B k i B k As f 0 F f + C K,r B L, on Ω K, there are strictly more than K/ blocks B k where P Bk L f0 P L f0 θc K,r. Moreover, from Assumption 6, P L f0 A f 0 f L = A C K,r. Therefore, on strictly more than K/ blocks B k, P Bk L f0 A θc K,r. 3 From Eq. 3 and 3, there are strictly more than K/ blocks B k where P Bk L f A θc K,r. Therefore, on Ω K, as θ A < θ, \f + C K,r B L MOM K lf l f < θ A C K,r < θc K,r. In addition, on the event Ω K, for all f F f + C K,r B L, there are strictly more than K/ blocks B k where P Bk P L f θc K,r. Therefore MOM K lf l f θck,r P L f θc K,r. Lemma 5. Grant Assumption 6 and assume that θ A < θ. On the event Ω K, P L ˆf θc K,r. Proof. Assume that Ω K holds. From Lemmas 3 and 4, ˆf f L C K,r. Therefore, on strictly more than K/ blocks B k, P L ˆf P Bk L ˆf + θc K,r. In addition, by definition of ˆf and 9 for η = θc K,r, MOM K l ˆf l f MOM K l f l f θc K,r. As a consequence, there exist at least K/ blocks B k where P Bk L ˆf θc K,r. Therefore, there exists at least one block B k where both P L ˆf P Bk L ˆf + θc K,r and P Bk L ˆf θc K,r. Hence P L ˆf θc K,r. A.. Stochastic argument This section shows that Ω K holds with probability at least 8. Proposition. Grant Assumptions,, 5 and 6 and assume that βk O. Let x > 0 and assume that β α x 8γL/θ > /. Then Ω K holds with probability larger than exp x βk/. 6

17 Proof. Let F = F f + C K,r B L and set φ : t R I{t } + t I{ t } so, for all t R, I{t } φt I{t }. Let W k = X i, Y i i Bk, G f W k = P Bk P L f. Let zf = K I{ G f W k θc K,r }. k= Let K denote the set of indices of blocks which have not been corrupted by outliers, K = {k {,, K} : B k I} and let f F. Basic algebraic manipulations show that zf K φθ C K,r G f W k Eφθ C K,r G f W k Eφθ C K,r G f W k. k K k K By Assumptions and 5, using that C K,r f f L [4L K/θ α], Eφθ C K,r G f W k P G f W k θc K,r 4K θ CK,r i B k E[L f X i, Y i ] 4 θ CK,r EG f W k 4 = θ CK,r VarP Bk L f 4L K θ C K,r f f L α. Therefore, zf K α k K φθ C K,r G f W k Eφθ C K,r G f W k. 33 Using Mc Diarmid s inequality [, Theorem 6.], for all x > 0, with probability larger than exp x K /, φθ C K,r G f W k Eφθ C K,r G f W k k K x K + E k K φθ C K,r G f W k Eφθ θ C K,r G f W k Let ɛ,..., ɛ K denote independent Rademacher variables independent of the X i, Y i, i I. By Giné-Zinn symmetrization argument, k K φθ C K,r G f W k Eφθ C K,r G f W k x K + E As φ is -Lipschitz with φ0 = 0, using the contraction lemma [8, Chapter 4], k K ɛ k φθ C K,r G f W k. E k K ɛ k φθ C K,r G f W k E k K ɛ k G f W k θc K,r = E k K ɛ k P Bk P L f θc K,r. Let σ i : i k K B k be a family of independent Rademacher variables independent of ɛ k k K and X i, Y i i I. It follows from the Giné-Zinn symmetrization argument that E k K ɛ k P Bk P L f C K,r K E L f X i, Y i σ i. C K,r i k K B k 7

18 By the Lipschitz property of the loss, the contraction principle applies and E L f X i, Y i σ i C K,r i k K B k LE f f X i σ i. C K,r i k K B k To bound from above the right-hand side in the last inequality, consider two cases C K,r = r γ or C K,r = 4L K/αθ. In the first case, by definition of the complexity parameter r γ in 6, E f f X i σ i = E C K,r i k K B : f f L r γ k r γ σ i f f X i γ K K. i k K B k In the second case, σ i f f X i E C K,r i k K B k [ E : f f L r γ σ i f f X i r i k K B k γ : r γ f f L 4L K αθ f f X i σ i 4L K i k K B k αθ ]. Let f F be such that r γ f f L [4L K]/[αθ ]; by convexity of F, there exists f 0 F such that f 0 f L = r γ and f = f + αf 0 f with α = f f L / r γ. Therefore, f f X i σ i 4L K f f X i r i k K B k αθ γ σ i f f L = r i k K B k γ σ i f 0 f X i i k K B k and so : r γ f f L 4L K αθ f f X i σ i 4L K i k K B k αθ r γ : f f L = r γ By definition of r γ, it follows that E f f X i σ i C K,r γ K K. i k K B k σ i f f X i. i k K B k Therefore, as K K O βk, with probability larger than exp x βk/, for all f F such that f f L C K,r, zf K α x 8γL > K θ. 34 A..3 End of the proof of Theorem Theorem follows from Lemmas 3, 4, 5 and Proposition for the choice of constant θ = /3A α = /4, x = /4, β = 4/7 and γ = /575AL. 8

19 A.3 Proof of Theorem 3 Let K [ 7 O /3, ] and consider the event Ω K defined in 6. It follows from the proof of Lemmas 3 and 4 that T K f θc K,r on Ω K. Setting θ = /3A, on J=K Ω J, f ˆR J for all J = K,...,, so ˆR J=K J. By definition of ˆK, it follows that ˆK K and by definition of f, f ˆR K which means that T K f θc K,r. It is proved in Lemmas 3 and 4 that on Ω K, if f F satisfies f f L C K,r then T K f > θc K,r. Therefore, f f L C K,r. On Ω K, since f f L C K,r, P L f θc K,r. Hence, on J=K Ω J, the conclusions of Theorem 3 hold. Finally, by Proposition, P [ J=KΩ J ] A.4 Proof of Theorem 4 J=K exp K/06 4 exp K/06. The proof of Theorem 4 follows the same path as the one of Theorem. We only sketch the different arguments needed because of the localization by the excess loss and the lack of Bernstein condition. Define the event Ω K in the same way as Ω K in 6 where C K,r is replaced by r γ and the L localization is replaced by the excess loss localization : { } Ω K = f L F r γ, J {,..., K} : J > K/ and k J, P B k P L f /4 r γ 35 where L F r γ = {f F : P L f r γ}. Our first goal is to show that on the event Ω K, P L ˆf /4 r γ. We will then handle P[Ω K ]. Lemma 6. Grant Assumptions and. For every r 0, the set L F r := {f F : P L f r} is convex and relatively closed to F in L µ. Moreover, if f F is such that P L f > r then there exists f 0 F and P L f /r α > such that f f = αf 0 f and P L f0 = r. Proof. Let f and g be in L F r and 0 α. We have αf + αg F because F is convex and for all x X and y R, using the convexity of u lu + f x, y, we have l αf+ αg x, y l f x, y = lαf f x + αg f x + f x, y lf x, y α lf f x + f x, y lf x, y + α lg f x + f x, y lf x, y = αl f l f + αl g l f and so P L αf+ αg αp L f + αp L g. Given that P L f, P L g r we also have P L αf+ αg r. Therefore, αf + αg L F r and L F r is convex. For all f, g F, P L f P L g f f L µ so that f F P L f is continuous onto F in L µ and therefore its level sets, such as L F r, are relatively closed to F in L µ. Finally, let f F be such that P L f > r. Define α 0 = {α 0 : f + αf f L F r }. ote that P L f +αf f αp L f = r for α = r/p L f so that α 0 r/p L f. Since L F r is relatively closed to F in L µ, we have f + α 0 f f L F r and in particular α 0 < otherwise, by convexity of L F r, we would have f L F r. Moreover, by maximality of α 0, f 0 = f + α 0 f f is such that P L f0 = r and the results follows for α = α0. Lemma 7. Grant Assumptions and. On the event Ω K, P L ˆf r γ. Proof. Let f F be such that P L r > r γ. It follows from Lemma 6 that there exists α and f 0 F such that P L f0 = r γ and f f = αf 0 f. According to 3, we have for every k {,..., K}, P Bk L f αp Bk L f0. Since f 0 L F r γ, on the event Ω K, there are strictly more than K/ blocks B k 9

20 such that P Bk L f0 have P L f0 /4 r γ = 3/4 r γ and so P B k L f 3/4 r γ. As a consequence, we \L F r γ MOM K lf l f 3/4 r γ. 36 Moreover, on the event Ω K, for all f L F r γ, there are strictly more than K/ blocks B k such that P Bk L f /4 r γ P L f /4 r γ. Therefore, MOM K lf l f /4 r γ. 37 f L F r γ We conclude from 36 and 37 that MOM K lf l f /4 r γ and that every f F such that P L f > r γ satisfies MOM K lf l f 3/4 r γ. But, by definition of ˆf, we have Therefore, we necessarily have P L ˆf r γ. MOM K l ˆf l f MOM K lf l f /4 r γ. ow, we prove that Ω K is an exponentially large event using similar argument as in Proposition. Proposition 3. Grant Assumptions, and 7 and assume that βk O and β / 3γL > /. Then Ω K holds with probability larger than exp βk/5. Sketch of proof. The proof of Proposition 3 follows the same line as the one of Proposition. Let us precise the main differences. We set F = L F r γ and for all f F, z f = K k= I{ G f W k /4 r γ} where G f W k is the same quantity as in the proof of Proposition 3. Let us consider the contraction φ introduced in Proposition 3. By definition of r γ and V K, we have Eφ8 r γ G f W k P G f W k r γ 8 64K r γ 64 r γ EG f W k = 64K r γ {Var P i L f : f F, i I} Var Pi L f i B k 64K r γ {Var P i L f : P L f r γ, i I} r γ VarP B k L f Using Mc Diarmid s inequality, the Giné-Zinn symmetrization argument and the contraction lemma twice and the Lipschitz property of the loss function, such as in the proof of Proposition, we obtain with probability larger than exp K /5, for all f F, zf K / 3LK E r γ σ i f f X i i k K B k. 38 ow, it remains to use the definition of r γ to bound the expected remum in the right-hand side of 38 to get E r γ σ i f f X i i k K B k γ K K. 39 Proof of Theorem 4. The proof of Theorem 4 follows from Lemma 7 and Proposition 3 for β = 4/7 and γ = /768L. 0

21 B Proof of Lemma Proof. We have E : f f L r i= σ i f f X i = E < t, t R p :E<t,X> r i= σ i X i >. Let Σ = EX T X denote the covariance matrix of X and consider its SVD, Σ = QDQ T where Q = [Q Q p ] R p p is an orthogonal matrix and D is a diagonal p p matrix with non-negative entries. For all t R p, we have E X, t = t T Σt = p j= d j t, Qj. Then E t R p : E<t,X> r = E re < t, i= p t R : p j= d j<t,q j > r j=:d j 0 p j=:d j 0 Qj, dj σ i X i >= E p i= Moreover, for any j such that d j 0, Qj E, dj = i= k= σ i X i = E t R p : E<t,X> r Qj dj < t, Q j >, dj σ i X i r E k,l= p j=:d j 0 T Qj EXk T X Qj k = dj dj p < t, Q j > Q j, j= i= Qj, dj σ i X i i= σ l σ k Q j dj, X k Q j dj, X l = k= σ i X i. k= T Qj Qj Σ dj dj By orthonormality, Q T Q j = e j and Q T j Q = et j, then, for any j such that d j 0, i= E Q j dj, X k σ i X i Qj E, dj i= σ i X i = k= d j e T j De j =. Finally, we obtain E : f f L r i= and therefore the fixed point r γ is such that { r γ = inf r > 0, J I : J /, E inf { r > 0, J I : J /, σ i f f p X i r {dj 0} = r RankΣ j= t R p : E<t t,x> r i J r RankΣ r } J γ } σ i < X i, t t > r J γ RankΣ γ.
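As a quick numerical sanity check of Lemma 1, the Python sketch below estimates the Rademacher complexity appearing in the definition of r(γ) for a linear class with a rank-deficient covariance, using the closed form of the supremum of a linear form over the L_2-ball of the class. The Gaussian design and all numerical values are assumptions made for the illustration only; the lemma itself requires no Gaussianity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, rank, r = 1000, 50, 10, 1.0

# Rank-deficient covariance Sigma = Q diag(d) Q^T with `rank` nonzero eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
d = rng.uniform(0.5, 2.0, size=rank)
root = Q[:, :rank] * np.sqrt(d)                     # Sigma^{1/2} restricted to its range
Sigma_pinv = Q[:, :rank] @ np.diag(1.0 / d) @ Q[:, :rank].T

def rademacher_sup(n_rep=200):
    # E sup_{t : E<t,X>^2 <= r^2} (1/N) | sum_i sigma_i <X_i, t> |
    #   = r * E || Sigma^{+1/2} (1/N) sum_i sigma_i X_i ||_2  (the sum lies in range(Sigma))
    vals = []
    for _ in range(n_rep):
        X = rng.standard_normal((N, rank)) @ root.T         # X ~ N(0, Sigma)
        sigma = rng.choice([-1.0, 1.0], size=N)
        v = (sigma[:, None] * X).mean(axis=0)
        vals.append(r * np.sqrt(v @ Sigma_pinv @ v))
    return float(np.mean(vals))

# Both printed quantities should be of the same order, as predicted by Lemma 1.
print(rademacher_sup(), r * np.sqrt(rank / N))
```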


More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Encoder Based Lifelong Learning - Supplementary materials

Encoder Based Lifelong Learning - Supplementary materials Encoder Based Lifelong Learning - Supplementary materials Amal Rannen Rahaf Aljundi Mathew B. Blaschko Tinne Tuytelaars KU Leuven KU Leuven, ESAT-PSI, IMEC, Belgium firstname.lastname@esat.kuleuven.be

More information

Solving Corrupted Quadratic Equations, Provably

Solving Corrupted Quadratic Equations, Provably Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Machine Learning. Linear Models. Fabio Vandin October 10, 2017

Machine Learning. Linear Models. Fabio Vandin October 10, 2017 Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

1.1 Basis of Statistical Decision Theory

1.1 Basis of Statistical Decision Theory ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of

More information

Sequential and reinforcement learning: Stochastic Optimization I

Sequential and reinforcement learning: Stochastic Optimization I 1 Sequential and reinforcement learning: Stochastic Optimization I Sequential and reinforcement learning: Stochastic Optimization I Summary This session describes the important and nowadays framework of

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Towards stability and optimality in stochastic gradient descent

Towards stability and optimality in stochastic gradient descent Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

COMS 4771 Regression. Nakul Verma

COMS 4771 Regression. Nakul Verma COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Asymptotics of minimax stochastic programs

Asymptotics of minimax stochastic programs Asymptotics of minimax stochastic programs Alexander Shapiro Abstract. We discuss in this paper asymptotics of the sample average approximation (SAA) of the optimal value of a minimax stochastic programming

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

1-bit Matrix Completion. PAC-Bayes and Variational Approximation

1-bit Matrix Completion. PAC-Bayes and Variational Approximation : PAC-Bayes and Variational Approximation (with P. Alquier) PhD Supervisor: N. Chopin Bayes In Paris, 5 January 2017 (Happy New Year!) Various Topics covered Matrix Completion PAC-Bayesian Estimation Variational

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms university-logo Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms Andrew Barron Cong Huang Xi Luo Department of Statistics Yale University 2008 Workshop on Sparsity in High Dimensional

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Machine Learning Brett Bernstein. Recitation 1: Gradients and Directional Derivatives

Machine Learning Brett Bernstein. Recitation 1: Gradients and Directional Derivatives Machine Learning Brett Bernstein Recitation 1: Gradients and Directional Derivatives Intro Question 1 We are given the data set (x 1, y 1 ),, (x n, y n ) where x i R d and y i R We want to fit a linear

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

10-701/15-781, Machine Learning: Homework 4

10-701/15-781, Machine Learning: Homework 4 10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Boosting with Early Stopping: Convergence and Consistency

Boosting with Early Stopping: Convergence and Consistency Boosting with Early Stopping: Convergence and Consistency Tong Zhang Bin Yu Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original

More information