Statistical learning with Lipschitz and convex loss functions


Statistical learning with Lipschitz and convex loss functions

Geoffrey Chinot, Guillaume Lecué and Matthieu Lerasle

October 3, 2018

Abstract

We obtain risk bounds for Empirical Risk Minimizers (ERM) and minmax Median-Of-Means (MOM) estimators based on loss functions that are both Lipschitz and convex. Results for the ERM are derived without assumptions on the outputs and under subgaussian assumptions on the design as in [ ], but relaxing the global Bernstein condition of that paper into a local assumption as in [40]. Similar results are shown for minmax MOM estimators in a close setting where the design is only supposed to satisfy moment assumptions, relaxing the subgaussian hypothesis necessary for ERM. Unlike alternatives based on MOM's principle [4, 9], the analysis of minmax MOM estimators is not based on the small ball assumption (SBA) of [ ]. In particular, the basic example of nonparametric statistics where the learning class is the linear span of localized bases, which does not satisfy the SBA [37], can now be handled. Finally, minmax MOM estimators are analyzed in a setting where the local Bernstein condition is also dropped. They are shown to achieve an oracle inequality with exponentially large probability under minimal assumptions ensuring the existence of all objects.

1 Introduction

The paper studies learning problems where the loss function is simultaneously Lipschitz and convex. This situation happens in classical examples such as quantile, Huber and L1 regression, or logistic and hinge classification [40]. As the Lipschitz property allows to remove all assumptions on the outputs, these losses have been quite popular in robust statistics [7]. Empirical risk minimizers (ERM) based on Lipschitz losses such as the Huber loss have recently received important attention [44, 5, ].

Based on a dataset {(X_i, Y_i) : i = 1, ..., N} of points in X × Y, a class F of functions and a risk function R defined on F, the statistician should estimate an oracle f* ∈ argmin_{f ∈ F} R(f). The risk function R is often defined as the expectation of a known loss function ℓ : (f, (x, y)) ∈ F × (X × Y) ↦ ℓ_f(x, y) ∈ R with respect to the unknown distribution P of a random variable (X, Y) ∈ X × Y: R(f) = E[ℓ_f(X, Y)]. Hereafter, the risk is assumed to have this form for a loss function ℓ such that, for any (f, x, y), ℓ_f(x, y) = ℓ̄(f(x), y), for some function ℓ̄ : Ȳ × Y → R, where Ȳ is a convex set containing all possible values of f(x). The loss ℓ is said to be Lipschitz and convex when the following assumption holds.

Assumption 1. There exists L > 0 such that, for any y ∈ Y, ℓ̄(·, y) is L-Lipschitz and convex.

Many classical loss functions satisfy Assumption 1, as shown by the following examples.

The logistic loss, defined for any u ∈ Ȳ = R and y ∈ Y = {−1, 1} by ℓ̄(u, y) = log(1 + exp(−yu)), satisfies Assumption 1 with L = 1.

The hinge loss, defined for any u ∈ Ȳ = R and y ∈ Y = {−1, 1} by ℓ̄(u, y) = max(1 − uy, 0), satisfies Assumption 1 with L = 1.

The Huber loss, defined for any δ > 0 and (u, y) ∈ Ȳ × Y = R × R by ℓ̄(u, y) = (1/2)(y − u)^2 if |u − y| ≤ δ, and ℓ̄(u, y) = δ|y − u| − δ^2/2 if |u − y| > δ,

satisfies Assumption 1 with L = δ.

The quantile loss is defined, for any τ ∈ (0, 1) and (u, y) ∈ Ȳ × Y = R × R, by ℓ̄(u, y) = ρ_τ(u − y) where, for any z ∈ R, ρ_τ(z) = z(τ − I{z ≤ 0}). It satisfies Assumption 1 with L = 1. For τ = 1/2, the quantile loss is the L1 loss.

All along the paper, the following assumption is also granted.

Assumption 2. The class F is convex.

When (X, Y) and the data (X_i, Y_i)_{i=1,...,N} are independent and identically distributed (i.i.d.), for any f ∈ F, the empirical risk R_N(f) = (1/N) Σ_{i=1}^N ℓ_f(X_i, Y_i) is a natural estimator of R(f). The empirical risk minimizers (ERM) [4] obtained by minimizing f ∈ F ↦ R_N(f) are expected to be close to the oracle f*. This procedure and its regularized versions have been extensively studied in learning theory [0]. When the loss is both convex and Lipschitz, results have been obtained in practice [4, ] and in theory [40]. Risk bounds with exponential deviation inequalities for the ERM can be obtained without assumptions on the outputs Y, but under stronger assumptions on the design X, which must satisfy subgaussian or boundedness assumptions. Moreover, fast rates of convergence [39] can only be obtained under margin-type assumptions such as the Bernstein condition [8, 40]. The Lipschitz assumption together with global Bernstein conditions holding over the entire class F, as in [ ], implies boundedness in L_2-norm of the class F, see the discussion preceding Assumption 4 for details. This boundedness is not satisfied in linear regression with unbounded design, so the results of [ ] do not apply to this basic example. To bypass this restriction, the global condition has to be relaxed into a local one as in [5, 40], see also Assumption 4 below.

The main constraint in our results on ERM is the subgaussian assumption on the design. This constraint can be relaxed by considering alternative estimators based on the median-of-means (MOM) principle of [35, 9, 8, ] and the minmax procedure of [3, 5]. The resulting minmax MOM estimators have been introduced in [4] for least-squares regression as an alternative to other MOM-based procedures [9, 30, 3, 3]. In the case of convex and Lipschitz loss functions, these estimators satisfy the following properties: (1) as the ERM, they are efficient without assumptions on the noise, (2) they achieve optimal rates of convergence under weak stochastic assumptions on the design, relaxing the subgaussian or boundedness hypotheses used for ERM, and (3) the rates are not downgraded by the presence of some outliers in the dataset.

These improvements of MOM estimators upon ERM are not surprising. For univariate mean estimation, rate optimal subgaussian deviation bounds can be shown under minimal L_2 moment assumptions for MOM estimators [4], while the empirical mean needs each data point to have subgaussian tails to achieve such bounds [3]. In least-squares regression, MOM-based estimators [9, 30, 3, 3, 4] inherit these properties, whereas the ERM has downgraded statistical properties under moment assumptions [6]. Furthermore, MOM procedures are resistant to outliers: results hold in the O ∪ I framework of [3, 4], where inliers (or informative data), indexed by I, only satisfy weak moment assumptions, and the dataset may contain outliers, indexed by O, on which no assumption is made, see Section 3. This robustness, which almost comes for free from a technical point of view, is another important advantage of MOM estimators over ERM in practice.
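To make the median-of-means principle discussed above concrete, here is a minimal Python sketch of the MOM estimator of a univariate mean (the median of empirical means over K blocks of equal size), compared with the plain empirical mean on a heavy-tailed, corrupted sample. The block number, the Student-t inliers and the value of the outliers are illustrative choices of ours, not the paper's experimental setup.

```python
import numpy as np

def mom_mean(x, K, rng=None):
    """Median-of-means: split x into K blocks of equal size (discarding the
    remainder), average within each block, and return the median of the means."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.permutation(x)                     # random equipartition of the data
    n = (len(x) // K) * K
    block_means = x[:n].reshape(K, -1).mean(axis=1)
    return np.median(block_means)

rng = np.random.default_rng(0)
inliers = rng.standard_t(df=3, size=990) + 1.0   # heavy-tailed, true mean = 1
outliers = np.full(10, 1e4)                      # 1% wild outliers
sample = np.concatenate([inliers, outliers])

print("empirical mean :", sample.mean())                      # ruined by outliers
print("median of means:", mom_mean(sample, K=30, rng=rng))    # close to 1
```

As long as more than half of the blocks contain no outlier (here K = 30 blocks against 10 outliers), the median of block means stays close to the true mean, which is the mechanism behind the robustness claims above.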
Figure 1 illustrates this robustness: the statistical performance of standard logistic regression is strongly affected by a single corrupted observation, while the minmax MOM estimator maintains good statistical performance even with 5% of corrupted data. Compared to [9, 4], considering convex-Lipschitz losses instead of the square loss simplifies simultaneously some assumptions and the presentation of the results for MOM estimators: the L_2-assumptions on the noise in [9, 4] can be removed, and the complexity parameters driving the risk of ERM and MOM estimators only involve a single stochastic linear process, see Eq. (3) and (6) below. Also, contrary to the analysis in least-squares regression, the small ball assumption [ , 33] is not required here. Recall that this assumption states that there are absolute constants κ and β such that, for all f ∈ F, P[|f(X) − f*(X)| ≥ κ ‖f − f*‖_{L_2}] ≥ β.

All figures can be reproduced from the code available at

Figure 1: MOM logistic regression vs. logistic regression from Sklearn (p = 50 and N = 1000).

It is interesting as it involves only moments of order 1 and 2 of the functions in F. However, it does not hold in classical frameworks such as histograms, see [37, 6] and Section 5. Finally, minmax MOM estimators are studied in a framework where the Bernstein condition is dropped. In this setting, they are shown to achieve an oracle inequality with exponentially large probability, see Section 4. The results are slightly weaker in this relaxed setting: the excess risk is bounded but not the L_2 risk, and the rates of convergence are slow, in 1/√N in general. Fast rates of convergence in 1/N can still be recovered from this general result if a local Bernstein type condition is satisfied, see Section 4 for details. This last result shows that minmax MOM estimators can be safely used with Lipschitz and convex losses, assuming only that inliers are independent with moments giving sense to all objects necessary to state the results. To approximate minmax MOM estimators, an algorithm inspired from [4, 7] is also proposed. Convergence of this algorithm has been proved in [7] under strong assumptions, but, to the best of our knowledge, convergence rates have not been established. Nevertheless, the simulation study presented in Section 6 shows that the algorithm has interesting empirical performance in the setting of this paper.

The paper is organized as follows. Optimal results for the ERM are presented in Section 2. Minmax MOM estimators are introduced and analysed in Section 3 under a local Bernstein condition, and in Section 4 without the Bernstein condition. A discussion of the main assumptions is provided in Section 5, while Section 6 provides a typical example where all assumptions are satisfied and a simulation study where a natural algorithm associated with the minmax MOM estimator for the logistic loss is presented. The proofs of the main theorems are gathered in Sections A, B and C.

Notations. Let X, Y be measurable spaces. Let F be a class of measurable functions f : X → Y and let (X, Y) ∈ X × Y be a random variable with distribution P. Let µ denote the marginal distribution of X. For any probability measure Q on X × Y and any function g ∈ L_1(Q), let Qg = ∫ g(x, y) Q(dx, dy). Let ℓ : F × (X × Y) → R, (f, (x, y)) ↦ ℓ_f(x, y), denote a loss function measuring the error made when predicting y by f(x). It is always assumed that there exist a convex set Ȳ containing all possible values of f(x) for f ∈ F, x ∈ X, and a function ℓ̄ : Ȳ × Y → R such that, for any (f, x, y) ∈ F × X × Y, ℓ̄(f(x), y) = ℓ_f(x, y). Let R(f) = Pℓ_f = E[ℓ_f(X, Y)] for f in F denote the risk and let L_f = ℓ_f − ℓ_{f*} denote the excess loss. If F ⊂ L_2(P) := L_2 and Assumption 1 holds, an equivalent risk can be defined even if Y ∉ L_1. Actually, for any f_0 ∈ F, ℓ_f − ℓ_{f_0} ∈ L_1, so one can define R̄(f) = P(ℓ_f − ℓ_{f_0}). W.l.o.g. the set of risk minimizers is assumed to be reduced to a singleton, argmin_{f ∈ F} R(f) = {f*}. f* is called the oracle, as f*(X) provides the prediction of Y with minimal risk among functions in F. For any f and p > 0, let ‖f‖_{L_p} = (P|f|^p)^{1/p},

4 for any r 0, let rb L = {f F : f L r} and rs L = {f F : f L = r}. For any set H for which it makes sense, H + f = {h + f s.t h H}, H f = {h f s.t h H}. ERM in the subgaussian framework This section studies the ERM, improving some results from []. In particular, the global Bernstein condition in [] is relaxed into a local hypothesis following [40]. All along this section, data X i, Y i i= are independent and identically distributed with common distribution P. The ERM is defined for f F P l f = / i= l f X i, Y i by ˆf ERM = arg min P l f. The results for the ERM are shown under a subgaussian assumption on the design. This result is the benchmark for the following minmax MOM estimators. Definition. Let B. F is called B-subgaussian with respect to X when for all f F and all λ > E expλ fx / f L expλ B /. Assumption 3. The class F F is B-subgaussian, where F F = {f f : f, f F }. Under this subgaussian assumption, statistical complexity can be measured via Gaussian mean-widths. Definition. Let H L. Let G h h H be the canonical centered Gaussian process indexed by H in particular, the covariance structure of G h h H is given by EG h G h / = Eh X h X / for all h, h H. The Gaussian mean-width of H is wh = E h H G h. The complexity parameter driving the performance of ˆf ERM is presented in the following definition. Definition 3. The complexity parameter is defined as r θ = inf{r > 0 : 3LwF f rb L θr } where L > 0 is the Lipschitz constant from Assumption. Let A > 0. In [8], the class F is called, A-Bernstein if, for all f F, P L f AP L f. Under Assumption, F is, AL Bernstein if the following stronger assumption is satisfied f f L AP L f. This stronger version was used, for example in [] to study ERM. However, under Assumption, Eq implies that f f L AP L f AL f f L AP L f AL f f L. Therefore, f f L AL for any f F. The class F is bounded in L -norm, which is restrictive as this assumption is not verified by the class of linear functions for example. To bypass this issue, the following condition is introduced. Assumption 4. There exists a constant A > 0 such that for all f F if f f L f f L AP L f. r θ then In Assumption 4, Bernstein condition is granted in a L -neighborhood of f only. Outside of this neighborhood, there is no restriction on the excess loss. This relaxed assumption is satisfied on linear models and Lipschitz-convex loss functions under moment assumptions as will be checked in Section 5. The following theorem is the main result of this section. 4

5 Theorem. Grant Assumptions,, 3 and 4 and let θ = /A, ˆf ERM defined in satisfies, with probability larger than exp θ r θ 6 3 L, 3 ˆf ERM f L r θ and P L ˆf ERM θr θ. 4 Theorem is proved in Section A.. It holds without restrictions on the outputs Y. Theorem shows deviation bounds both in L norm and for the excess risk, which are both minimax optimal as proved in []. As in [], a similar result can be derived if the subgaussian Assumption 3 is replaced by a boundedness in L assumption. An extension of Theorem can be shown, where Assumption 4 is replaced by the following hypothesis: there exists κ such that for all f F in a L -neighborhood of f, f f κ L AP L f. The case κ = is the most classical and its analysis contains all the ingredients for the study of the general case with any parameter κ. More general Bernstein conditions can also be considered as in [40, Chapter 7]. These extensions are left to the interested reader. 3 Minmax MOM estimators This section presents and studies minmax MOM estimators, comparing them to ERM. 3. The estimators The framework of this section is a relaxed version of the i.i.d. setup considered in Section. Following [3, 4], there exists a partition O I of {,, } in two subsets which is unknown to the statistician. o assumption is granted on the set of outliers X i, Y i i O. Inliers, indexed by I, are only assumed to satisfy the following assumption. Assumption 5. X i, Y i i I are independent and for all i I, X i, Y i has distribution P i, X i has distribution µ i and for any p > 0 and any function g for which it makes sense g Lpµ i = P i g p /p. We assume that, for any i I, f f L = f f L µ i and P i L f = P L f. Assumption 5 holds in the i.i.d case but it covers other situations where informative data X i, Y i i I may have different distributions. Recall the definition of MOM estimators of univariate means. Let B k k=,...,k denote a partition of {,..., } into blocks B k of equal size /K if is not a multiple of K, just remove some data. For any function f : X Y R and k {,..., K}, let P Bk f = K/ i B k fx i, Y i. MOM estimator is the median of these empirical means: MOM K f = MedPB f,, P BK f. The estimator MOM K f achieves rate optimal subgaussian deviation bounds, assuming only that P f <, see for example [4]. The number K is a tuning parameter. The larger K, the more outliers is allowed. When K =, MOM K f is the empirical mean, when K =, the empirical median. Following [4], remark that the oracle is also solution of the following minmax problem: f arg min P l f = arg min g F P l f l g. Minmax MOM estimators are obtained by plugging MOM estimators of the unknown expectations P l f l g in this formula: ˆf arg min MOM K lf l g. 5 g F The minmax MOM construction can be applied systematically as an alternative to ERM. For instance, it yields a robust version of logistic classifiers. The minmax MOM estimator with K = is the ERM. 5

6 The linearity of the empirical process P is important to use localisation technics and derive fast rates of convergence for ERM [], improving slow rates derived with the approach of [4], see [39] for details on fast and slow rates. The idea of the minmax reformulation comes from [3], where this strategy allows to overcome the lack of linearity of some alternative robust mean estimators. [3] introduced minmax MOM estimators to least-squares regression. 3. Theoretical results 3.. Setting The assumptions required for the study of estimator 5 are essentially those of Section except for the subgaussian Assumption 3 which is replaced by the moment Assumption 5. Instead of Gaussian mean width, the complexity parameter is expressed as a fixed point of local Rademacher complexities [7, 0, 6]. Let σ i i=,..., denote i.i.d. Rademacher random variables uniformly distributed on {, }, independent from X i, Y i i I. Let { r γ = inf r > 0, J I : J, E } σ i f f X i r J γ. 6 : f f L r The outputs do not appear in the complexity parameter. This is an interesting feature of Lipschitz losses. Since Assumption 4 depends on the complexity parameter in the subgaussian framework, it is necessary to adapt the Bernstein assumption to this framework. Assumption 6. There exists a constant A > 0 such that for all f F if f f L C K,r then f f L AP L f where C K,r = max r γ, 864A L K. 7 Assumptions 6 and 4 have a similar flavor as both require the Bernstein condition in a L neighborhood of f with radius given by the rate of convergence of the associated estimator see Theorems and. For K r γ/846a L the neighborhood {f F : f f L C K,r } is a L -ball around f of radius r γ which can be of order / see Section As a consequence, Assumption 6 holds in examples where the small ball assumption does not see discussion after Assumption Main results We are now in position to state the main result regarding the statistical properties of estimator 5 under a local Bernstein condition. Theorem. Grant Assumptions,, 5 and 6 and assume that O 3/7. Let γ = /575AL and K [ 7 O /3, ]. The minmax MOM estimator ˆf defined in 5 satisfies, with probability at least i J exp K/06, 8 ˆf f L C K,r and P L ˆf 3A C K,r. 9 Suppose that K = r γ, which is possible as long as O r γ. The deviation bound is then of order r γ and the probability estimate exp r γ/06. Therefore, minmax MOM estimators achieve the same statistical bound with the same deviation as the ERM as long as r γ and r θ are of the same order. Using generic chaining [38], this comparison is true under Assumption 3. It can also be shown under weaker moment assumption, see [34] or the example of Section When r γ r θ, the bounds are rate optimal as shown in []. This is why these bounds are called rate optimal subgaussian deviation bounds. While these hold for ERM in the i.i.d. setup with subgaussian 6

7 design in the absence of outliers see Theorem, they hold for minmax MOM estimators in a setup where inliers may not be i.i.d., nor have subgaussian design and up to r γ outliers may have contaminated the dataset. This section is concluded by presenting an estimator achieving 5 simultaneously for all K. For all K {,..., } and f F, define T K f = g F MOM K lf l g and let ˆR K = {g F : T K g /3AC K,r }. 0 ow, building on the Lepskii s method, define a data-driven number of blocks ˆK = inf K {,..., } : ˆR J J=K and let f such that f ˆR J. J= ˆK Theorem 3. Grant Assumptions,, 5 and 6 and assume that O 3/7. Let γ = /575AL. The estimator f defined in is such that for all K [ 7 O /3, ], with probability at least 4 exp K/06, f f L C K,r and P L f 3A C K,r. Theorem 3 states that f achieves the results of Theorem simultaneously for all K. This extension is useful as the number O of outliers is typically unknown in practice. However, contrary to ˆf, the estimator f requires the knowledge of A and rγ. This kind of limitation is not surprising, it holds in least-squares regression [4] and comes from the univariate mean estimation case [4, Theorem 3.] A basic example The following example illustrates the optimality of the rates provided in Theorem. Lemma [9]. In the O I framework with F = { t, : t R p }, we have r γ RankΣ/γ, where Σ = E[XX T ] is the p p covariance matrix of X. The proof of Lemma is recalled in Section B for the sake of completeness. Section 5 shows that Assumptions 4 and 6 are satisfied when F = { t, : t R p } and X is a vector with i.i.d. entries having only a few finite moments. Theorem applies therefore in this setting and the Minmax MOM estimator 5 achieves the optimal fast rate of convergence RankΣ/. 4 Relaxing the Bernstein condition This section shows that minmax MOM estimators satisfy sharp oracle inequalities with exponentially large deviation under minimal stochastic assumptions insuring the existence of all objects. These results are slightly weaker than those of the previous section: the L risk is not controlled and only slow rates of convergence hold in this relaxed setting. However, the bounds are sufficiently precise to imply fast rates of convergence for the excess risk as in Theorems if a slightly stronger Bernstein condition holds. Given that data may not have the same distribution as X, Y, the following relaxed version of Assumption 5 is introduced. Assumption 7. X i, Y i i I are independent and for all i I, X i, Y i has distribution P i, X i has distribution µ i. For any i I, F L µ i and P i L f = P L f for all f F. 7

8 When Assumption 6 does not necessary hold, the localization argument has to be modified. Instead of the L -norm, the excess risk f F P L f is used to define neighborhoods around f. The associated complexity is then defined for all γ > 0 and K {,, } by { Er r γ = inf r > 0 : max γ, } 536V K r r 3 where Er = J I: J / E and V K r = max i I :P L f r :P L f r σ i f f X i J, i J K Var Pi L f. The main difference with r θ in Definition or r γ in 6 is the extra variance term V K r. Under the Bernstein condition, this term is negligible in front of the expectation term Er see [8]. In the general setting considered here, the variance term is handled in the complexity parameter. [ Theorem ] 4. Grant Assumptions,, 7 and assume that O 3/7. Let γ = /768L and K 7 O /3,. The minmax MOM estimator ˆf defined in 5 satisfies, with probability at least exp K/06, P L ˆf r γ. Recall that Assumptions and are only meaning that the loss function is convex and Lipschitz and that the class F is convex. The stochastic assumption 7 says that inliers are independent and define the same excess risk as X, Y over F. In particular, Theorem 4 holds without assumptions on the outliers X i, Y i i O or the outputs Y i i I of the inliers. Moreover, the risk bound holds with exponentially large probability without assuming sub Gaussian design, a small ball hypothesis or a Bernstein condition. This generality can be achieved by combining MOM estimators with convex-lipschitz loss functions. The following result discuss relationships between Theorems and 4. Introduce the following modification of the Bernstein condition. Assumption 8. Let γ = /768L. There exists a constant A > 0 such that for all f F if P L f C K,r then f f L AP L f where, for r γ defined in 6, r C K,r = max γ/a A, 536AL K. 4 Assumption 8 is slightly stronger than Assumption 4 since the neighborhood around f where the condition holds is slightly larger. If Assumption 8 holds then Theorem 5 implies the same statistical bounds for 5 as Theorem up to constants, as shown by the following result. Theorem 5. Grant Assumptions,, 7 and assume that O 3/7. Assume that the local Bernstein condition Assumption 8 holds. Let γ = /768L and K [ 7 O /3, ]. The minmax MOM estimator ˆf defined in 5 satisfies, with probability at least exp K/06, ˆf f L max r γ/a, 536L A K r and P L ˆf max γ/a A, 536L AK. Proof. First, V K r LV K r for all r > 0 where V K r = K/ max i I :P Lf r f f L µ i. Moreover, r Er/r and r V K r/r are non-increasing, therefore by Assumption 8 and the definition of r γ, V K r, γ E r γ/a r γ/a A A and Hence, r γ max r γ/a/a, 536L AK/. 536V K 536L AK 536A LK. 8

9 5 Bernstein s assumption This section shows that the local Bernstein condition holds for various loss functions and design X. All along the section, the oracle f and the Bayes rules which minimizes the risk f Rf over all measurable functions from X to Y are assumed to be equal. In that case, the Bernstein s condition [8] coincides with the margin assumption [39, 3]. The class F and design X are also assumed to satisfy a local L 4 /L - assumption. Let r {r θ, C K,r}, to verify Assumption 4, choose r = r θ while for Assumption 6, choose r = C K,r. Assumption 9. There exists C > 0 such that for all f F satisfying f f L r, f f L4 C f f L Assumption 9 is a local version of a L 4 /L norm equivalence assumption over F {f } which has been used for the study of MOM estimators see [9] since it implies the small ball condition. Examples of distributions satisfying the global i.e. over all F version of Assumption 9 can be found in [33]. Assumption 9 is local, it holds only in a L neighborhood of f and not on the entire set F {f }. There are situations where the constant C depends on the dimension p of the model. In that case, the results in [9, 4] provide sub-optimal statistical upper bounds. For instance, if X is uniformly distributed on [0, ] and F = { p j= α ji Aj : α j p j= Rp } where I Aj is the indicator of A j = [j /p, j/p] then for all f F, f f L4 p f f L so C = p. As shown in the following subsections, the rates given in Theorem or Theorem 3 are not deteriorated in this example. 5. Quantile loss The proof is based on [5, Lemma.]. Recall that l f x, y = y fxτ I{y fx 0} and that the Bayes estimator is the quantile q Y X=x τ of order τ of the conditional distribution of Y given X = x. Assumption 0. Let C be the constant defined in Assumption 9. There exists α > 0 such that, for all x X and for all z in R such that z f x C r we have f Y X=x z α, where f Y X=x is the conditional density function of Y given X = x. Theorem 6. Grant Assumptions 9 and 0. Assume that the oracle f is the Bayes estimator i.e. f x = qτ Y X=x for all x X. Then, the local Bernstein condition holds: for all f F such that f f L r, f f L 4 α P L f. The proof is postponed to Section C.. Consider the example from Section 3..3, assume that K RankΣ and recall that r = C K,r = r γ RankΣ/γ. If C = p, Assumption 0 holds as long as p 3 and there exists α > 0 such that, for all x X and for all z in R such that z f x, f Y X=x z α. In this situation, the rates given in Theorems and 3 are still RankΣ/. In particular, they are not deteriorated if C = p while those given in [9, 4] are of order prankσ/. This gives a partial answer, in our setting, to the issue raised in [37] regarding results based on the small ball method. 5. Huber Loss Consider the Huber loss function defined for all f F and x X, y R by l f x, y = ρ H y fx where ρ H t = t / if t δ and ρ H t = δ t δ otherwise. Introduce the following assumption. Assumption. Let C be the constant defined in Assumption 9. There exists α > 0 such that for all x X and all z in R such that z f x C r, F Y X=x z + δ F Y X=x z δ α, where F Y X=x is the conditional cumulative function of Y given X = x. Under this assumption, the local Bernstein condition can be checked as shown by the following result. 9

10 Theorem 7. Grant Assumptions 9 and. Assume that the oracle f is the Bayes rules, meaning that for all x X, f x argmin u R E[ρ H Y u X = x]. Then, for all f F, if f f L r then f f L 4/αP L f. The proof is postponed to Section C.. Theorem 6 applies to this example too. The interested reader can check that the remark following 5.3 Hinge loss Let η denote the regression function η : x X E[Y X = x] for all x X and recall that the Bayes rule is defined by f : x X sgnηx. The function f minimizes f Rf over all measurable functions from X to Y = {, } [43]. Assume that f is the Bayes rules, i.e. that f F. Then, Assumption 4 holds if the following margin assumption is satisfied, see [5] and [39, Proposition ]: The following result was proved in [5]. there exists τ > 0 such that ηx τ a.s. 5 Theorem 8. Let F be a class of functions from X to [, ]. If f x = sgnηx for all x X and 5 is satisfied, the Bernstein condition holds for the hinge loss with A = /τ. 5.4 Logistic classification Consider the logistic loss and recall that the Bayes rule, called the log-odds ratio, is defined by f : x log[ηx/ ηx] for all x X, where ηx = PY = X = x for all x X. Consider the following assumption on η. Assumption. There exists c 0 > 0 such that P + expc 0 ηx e c 0. + exp c 0 Assumption excludes trivial cases where deterministic predictors equal to or are optimal. Theorem 9. Grant Assumptions 9 and and assume that the Bayes rules is in F. Then, there exists A > 0 such that f F : f f L r, P L f A f f L. Theorem 9 is proved in Section C.3. The explicit form of the constant A is presented in the proof. A close inspection of the proof reveals that Assumption is only used to bound f X with large probability. If the oracle is bounded, the same result holds without assuming that f is the Bayes rules. 6 Simulation study This section provides a short simulation study that illustrates our theoretical findings. The section starts with an example where all assumptions of Theorem are simultaneously satisfied. 6. Application: robust classification Let X = R p, Y = {, }, let l denote the logistic loss recalled in Section. Recall that Assumption holds with L =. Let F = { t, : t R p } be a class of linear functions, so Assumption holds. Assume that X = ξ,, ξ p in R p, where ξ j p j= are centered, independent and identically distributed and that X, Y satisfies a logistic model PY = X log = X, t + ɛ. 6 PY = X 0

11 By Theorem 9, Assumptions 4 and 6 hold if f F s.t f f L r, f f L4 C f f L 7 for r {r θ, C K,r}. ow, since ξ,, ξ p are centered i.i.d. in R p, basic algebraic computations show that 7 holds if E ξ 4 /4 AE ξ /. This assumption is satisfied, for example, by student distributions T 5 that do not have a fifth moment. The results allow non-symmetric noise such as ɛ L 0, where L µ, σ denotes the log-normal distribution with mean expµ + σ/ and variance expσ expµ + σ. Consider the model 6 with X and ɛ as above. Let X i, Y i i I independent and distributed as X, Y. Let also X i, Y i i O where I O = {,, }. For any K 7 O /3, the minmax MOM estimator is solution of the following problem: ˆt MOM K arg min t R p t R p MOM K l t l t. 8 As all assumptions in Theorem are satisfied, if the number of outliers is such that O p and the number of blocks is K = p then there exist absolute positive constants a, a, a 3 and a 4 depending only on A such that, with probability larger than a exp a p, 6. Simulations ˆt MOM p t p a 3 and P Lˆt a p 4. 9 This section presents algorithms derived from the minmax MOM construction 8. These algorithms are inspired from [4]. Consider the setup of Section 6. where Theorem applies: X = ξ,, ξ p, where ξ j p j= are independent and identically distributed, with ξ T 5, ɛ L 0, and PY = X log = X, t + ɛ. PY = X Following [4], a gradient ascent-descent step is performed on the empirical incremental risk t, t P Bk l t l t constructed on the block B k of data realizing the median of the empirical incremental risk. Initial points t 0 R p and t 0 R p are taken at random. In logistic regression, the step sizes η and η are usually chosen equal to XX op /4, where X is the p matrix with row vectors equal to X,, X and op denotes the operator norm. In a corrupted environment, this choice might lead to disastrous performance. This is why η and η are computed at each iteration using only data in the median block: let B k denote the median block at the current step, then one chooses η = η = X k X k op/4 B k where X k is the B k p matrix with rows given by Xi T for i B k. In practice, K is chosen by robust cross-validation choice as in [4]. In a first approach and according to our theoretical results, the blocks are chosen at the beginning of the algorithm. As illustrated in Figure, this first strategy has some limitations. To understand the problem, for all k =,..., K, let C k denote the following set C k = {t R p : P Bk l t = Median {P B l t,..., P BK l t }}. If the minimum of t P Bk l t lies in C k, the algorithm typically converges to this minimum if one iteration enters C k. As a consequence, when the minmax MOM estimator 8 lies in another cell, the algorithm does not converge to this estimator. To bypass this issue, the partition is changed at every ascent/descent steps of the algorithm, it is chosen uniformly at random among all equipartition of the dataset. This alternative algorithm is described in Algorithm. In practice, changing the partition seems to widely accelerate the convergence see Figure.
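The following Python sketch implements the changing-blocks descent-ascent procedure just described (formalized as Algorithm 1 below) on data drawn from the model of Section 6.1, with Student-t(5) coordinates for the design and shifted log-normal noise as in the simulations. The dimensions, the choice of t*, the number of blocks and iterations, and our reading of the step-size rule (inverse smoothness of the logistic risk on the median block) are illustrative assumptions, not the exact experimental settings of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- data from the logistic model of Section 6.1 (illustrative sizes) --------
N, p = 1000, 50
t_star = rng.standard_normal(p) / np.sqrt(p)          # hypothetical target
X = rng.standard_t(df=5, size=(N, p))                  # heavy-tailed design
eps = rng.lognormal(mean=0.0, sigma=1.0, size=N)       # non-symmetric noise
probs = 1.0 / (1.0 + np.exp(-(X @ t_star + eps)))      # logit = <X, t*> + eps
y = np.where(rng.uniform(size=N) < probs, 1.0, -1.0)

def logistic_loss(u, y):
    return np.logaddexp(0.0, -y * u)

def logistic_grad(t, Xb, yb):
    # gradient of the block empirical risk t -> P_{B_k} ell_t
    margins = yb * (Xb @ t)
    return -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def mom_logistic(X, y, K, n_iter=500):
    """Descent-ascent on the median block, blocks redrawn at every iteration."""
    t = np.zeros(X.shape[1])
    t_prime = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # random equipartition into K blocks
        blocks = np.array_split(rng.permutation(len(y)), K)
        # block realizing the median of the empirical increments P_B(ell_t - ell_{t'})
        incr = [np.mean(logistic_loss(X[b] @ t, y[b])
                        - logistic_loss(X[b] @ t_prime, y[b])) for b in blocks]
        k = int(np.argsort(incr)[len(incr) // 2])
        Xk, yk = X[blocks[k]], y[blocks[k]]
        # step size: inverse smoothness of the logistic risk on the median block
        eta = 4.0 * len(yk) / np.linalg.norm(Xk.T @ Xk, ord=2)
        t = t - eta * logistic_grad(t, Xk, yk)              # descent step in t
        # ascent step in t' for P_B(ell_t - ell_{t'}), i.e. descent on its own risk
        t_prime = t_prime - eta * logistic_grad(t_prime, Xk, yk)
    return t

t_hat = mom_logistic(X, y, K=11)
print("estimation error:", np.linalg.norm(t_hat - t_star))
```

Because the median block is recomputed on a fresh random partition at each step, a single cell C_k cannot trap the iterates, which is the motivation given above for changing the blocks.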

12 Input: The number of block K, initial points t 0 and t 0 in R p and the stopping criterion ɛ > 0 Output: A solution of the minimax problem 8 while t i t i ɛ do Split the data into K disjoint blocks B k k {,,K} of equal sizes chosen at random: B B K = {,, }. 3 Find k [K] such that MOM K l ti = P Bk l ti. 4 Compute η = η = X T k X k op /4. 5 Update t i+ = t i η tp Bk l t t=ti and t i+ = t i η t P B k l t t= t i. 6 end Algorithm : Descent-ascent gradient method with blocks of data chosen at random at every steps. Simulation results are gathered in Figure. In these simultations, there is no outlier, = 000 and p = 00 with X i, Y i 000 i= i.i.d with the same distribution as X, Y. Minmax MOM estimators 8 are compared with the Logistic Regression algorithm from the scikit-learn library of [36]. The upper pictures compare performance of MOM ascent/descent algorithms with fixed and changing blocks. These pictures give an example where the fixed block algorithm is stuck into local minima and another one where it does not converge. In both cases, the changing blocks version converges to t. Running times of logistic regression LR and its MOM version MOM LR are compared in the lower picture of Figure in a dataset free from outliers. LR and MOM LR are coded with the same algorithm in this example, meaning that MOM gradient descent-ascent and simple gradient descent are performed with the same descent algorithm. As illustrated in Figure, running each step of the gradient descent on one block only and not on the whole dataset accelerates the running time. The larger the dataset, the bigger the benefit is expected. The resistance to outliers of logistic regression and its minmax MOM alternative are depicted in Figure in the introduction. We added an increasing number of outliers to the dataset. Outliers {X i, Y i, i O} in this simulation are such that X i L 0, 5 and Y i = sign X i, t + ɛ i, with ɛ i ɛ as above. Figure shows that logistic classification is mislead by a single outlier while MOM version maintains reasonable performance with up to 50 outliers i.e 5% of the database is corrupted. A byproduct of Algorithm is an outlier detection algorithm. Each data receives a score equal to the number of times it is selected in a median block in the random choice of block version of the algorithm. The first iterations may be misleading: before convergence, the empirical loss at the current point may not reveal the centrality of the data because the current point may be far from t. Simulations are run with = 00, p = 0 and 5000 iterations and therefore only the score obtained by each data in the last 4000 iterations are displayed. 3 outliers X i, Y i i {,,3} with X i = 0 p j= and Y i = sign X i, t have been introduced at number 4, 6 and 66. Figure 3 shows that these are not selected once. 7 Conclusion The paper introduces a new homogeneity argument for learning problems with convex and Lipschitz losses. This argument allows to obtain estimation rates and oracle inequalities for ERM and minmax MOM estimators improving existing results. The ERM requires subgaussian hypotheses on the design and a local Bernstein condition see Theorem, both assumptions can be removed for minmax MOM estimators see Theorem 5. The local Bernstein conditions provided in this article can be verified in several learning problems. In particular, it allows to derive optimal risk bounds in examples where analysis based on the small ball hypothesis fail. 
Minmax MOM estimators applied to convex and Lipschitz losses are efficient without assumptions on the outputs Y and under minimal L_2 assumptions on the design, and the results are robust to the presence of a few outliers in the dataset. A modification of these estimators can be implemented efficiently and confirms all these conclusions.

13 Figure : Top left and right: Comparizon of the algorithm with fixed and changing blocks. Bottom: Comparizon of running time between classical gradient descent and algorithm. In all simulation = 000, p = 00 and there is no outliers. A Proof of Theorems,, 3 and 4 A. Proof of Theorem The proof is splitted in two parts. First, we identify an event where the statistical behavior of the regularized estimator ˆf ERM can be controlled. Then, we prove that this event holds with probability at least 3. Introduce the following event: Ω := { f F f + r θb L, P P L f θr θ } where θ is a parameter appearing in the definition of r in Definition 3. Proposition. On the event Ω, one has ˆf ERM f L r θ and P L ˆf ERM θr θ. Proof. By construction, ˆf ERM satisfies P L ˆf ERM 0. Therefore, it is sufficient to show that, on Ω, if f f L > r θ, then P L f > 0. Let f F be such that f f L > r θ. By convexity of F, there exists f 0 F f + r θs L and α > such that For all i {,, }, let ψ i : R R be defined for all u R by f = f + αf 0 f. 0 ψ i u = lu + f X i, Y i lf X i, Y i. 3

14 Figure 3: Outliers Detection Procedure for = 00, p = 0 and outliers are i = 4, 6, 66 The functions ψ i are such that ψ i 0 = 0, they are convex because l is, in particular αψ i u ψ i αu for all u R and α and ψ i fx i f X i = lfx i, Y i lf X i, Y i so that the following holds: P L f = α α ψ i fxi f X i = α α i= i= ψ i αf 0 X i f X i α ψ i f 0 X i f X i = αp L f0. i= Until the end of the proof, the event Ω is assumed to hold. Since f 0 F f + r θb L, P L f0 P L f0 θr θ. Moreover, by Assumption 4, P L f 0 A f 0 f L = A r θ, thus P L f0 A θr θ. 3 From Eq. and 3, P L f > 0 since A > θ. Therefore, ˆf ERM f L r θ. This proves the L -bound. ow, as ˆf ERM f L r θ, P P L ˆf ERM θr θ. Since P L ˆf ERM 0, This show the excess risk bound. P L ˆf ERM = P L ˆf ERM + P P L ˆf ERM θr θ. Proposition shows that ˆf ERM has the risk bounds given in Theorem on the event Ω. To show that Ω holds with probability 3, recall the following results from []. Lemma. [] [Lemma 8.] Grant Assumptions and 3. Let F F with finite L -diameter d L F. For every u > 0, with probability at least exp u, P P L f L g 6L wf + ud L F. f,g F 4

15 It follows from Lemma that for any u > 0, with probability larger that exp u, P P L f P P L f L g f +r θb L f,g F f +r θb L 6L wf f r θb L + ud L F f r θb L where d L F f r θb L r θ. By definition of the complexity parameter see Eq. 3, for u = θ r θ/64l, with probability at least exp θ r θ/6 3 L, 4 for every f in F f + r θb L, P P L f θr θ. 5 Together with Proposition, this concludes the proof of Theorem. A. Proof of Theorem The proof is splitted in two parts. First, we identify an event Ω K where the statistical properties of ˆf from Theorem can be established. ext, we prove that this event holds with probability 8. Let α, θ and γ be positive numbers to be chosen later. Define 4L K C K,r = max θ α, r γ where the exact form of α, θ and γ are given in Equation 34. Set the event Ω K to be such that { Ω K = f F f + } C K,r B L, J {,..., K} : J > K/ and k J, P Bk P L f θc K,r. A.. Deterministic argument The goal of this section is to show that, on the event Ω K, ˆf f L C K,r and P L ˆf θc K,r. Lemma 3. If there exists η > 0 such that \f + C K,r B L then ˆf f L C K,r. Proof. Assume that 7 holds, then MOM K lf l f < η and MOM K lf l f η, 7 f + C K,r B L 6 inf \f + C K,r B L MOM K[l f l f ] > η. 8 Moreover, if T K f = g F MOM K [l f l g ] for all f F, then T K f = MOM K [l f l f ] f + C K,r B L MOM K [l f l f ] η. 9 \f + C K,r B L By definition of ˆf and 9, TK ˆf T K f η. Moreover, by 8, any f F \ f + C K,r B L satisfies T K f MOM K [l f l f ] > η. Therefore ˆf F f + C K,r B L. 5

16 Lemma 4. Grant Assumption 6 and assume that θ A < θ. η = θc K,r. On the event Ω K, 7 holds with Proof. Let f F be such that f f L > C K,r. By convexity of F, there exists f 0 F f + C K,r S L and α > such that f = f + αf 0 f. For all i {,..., }, let ψ i : R R be defined for all u R by ψ i u = lu + f X i, Y i lf X i, Y i. 30 The functions ψ i are convex because l is and such that ψ i 0 = 0, so αψ i u ψ i αu for all u R and α. As ψ i fx i f X i = lfx i, Y i lf X i, Y i, for any block B k, P BK L f = α ψ i fxi f X i = α ψ i αf 0 X i f X i B k α B k α i B k i B k α ψ i f 0 X i f X i = αp Bk L f0. 3 B k i B k As f 0 F f + C K,r B L, on Ω K, there are strictly more than K/ blocks B k where P Bk L f0 P L f0 θc K,r. Moreover, from Assumption 6, P L f0 A f 0 f L = A C K,r. Therefore, on strictly more than K/ blocks B k, P Bk L f0 A θc K,r. 3 From Eq. 3 and 3, there are strictly more than K/ blocks B k where P Bk L f A θc K,r. Therefore, on Ω K, as θ A < θ, \f + C K,r B L MOM K lf l f < θ A C K,r < θc K,r. In addition, on the event Ω K, for all f F f + C K,r B L, there are strictly more than K/ blocks B k where P Bk P L f θc K,r. Therefore MOM K lf l f θck,r P L f θc K,r. Lemma 5. Grant Assumption 6 and assume that θ A < θ. On the event Ω K, P L ˆf θc K,r. Proof. Assume that Ω K holds. From Lemmas 3 and 4, ˆf f L C K,r. Therefore, on strictly more than K/ blocks B k, P L ˆf P Bk L ˆf + θc K,r. In addition, by definition of ˆf and 9 for η = θc K,r, MOM K l ˆf l f MOM K l f l f θc K,r. As a consequence, there exist at least K/ blocks B k where P Bk L ˆf θc K,r. Therefore, there exists at least one block B k where both P L ˆf P Bk L ˆf + θc K,r and P Bk L ˆf θc K,r. Hence P L ˆf θc K,r. A.. Stochastic argument This section shows that Ω K holds with probability at least 8. Proposition. Grant Assumptions,, 5 and 6 and assume that βk O. Let x > 0 and assume that β α x 8γL/θ > /. Then Ω K holds with probability larger than exp x βk/. 6

17 Proof. Let F = F f + C K,r B L and set φ : t R I{t } + t I{ t } so, for all t R, I{t } φt I{t }. Let W k = X i, Y i i Bk, G f W k = P Bk P L f. Let zf = K I{ G f W k θc K,r }. k= Let K denote the set of indices of blocks which have not been corrupted by outliers, K = {k {,, K} : B k I} and let f F. Basic algebraic manipulations show that zf K φθ C K,r G f W k Eφθ C K,r G f W k Eφθ C K,r G f W k. k K k K By Assumptions and 5, using that C K,r f f L [4L K/θ α], Eφθ C K,r G f W k P G f W k θc K,r 4K θ CK,r i B k E[L f X i, Y i ] 4 θ CK,r EG f W k 4 = θ CK,r VarP Bk L f 4L K θ C K,r f f L α. Therefore, zf K α k K φθ C K,r G f W k Eφθ C K,r G f W k. 33 Using Mc Diarmid s inequality [, Theorem 6.], for all x > 0, with probability larger than exp x K /, φθ C K,r G f W k Eφθ C K,r G f W k k K x K + E k K φθ C K,r G f W k Eφθ θ C K,r G f W k Let ɛ,..., ɛ K denote independent Rademacher variables independent of the X i, Y i, i I. By Giné-Zinn symmetrization argument, k K φθ C K,r G f W k Eφθ C K,r G f W k x K + E As φ is -Lipschitz with φ0 = 0, using the contraction lemma [8, Chapter 4], k K ɛ k φθ C K,r G f W k. E k K ɛ k φθ C K,r G f W k E k K ɛ k G f W k θc K,r = E k K ɛ k P Bk P L f θc K,r. Let σ i : i k K B k be a family of independent Rademacher variables independent of ɛ k k K and X i, Y i i I. It follows from the Giné-Zinn symmetrization argument that E k K ɛ k P Bk P L f C K,r K E L f X i, Y i σ i. C K,r i k K B k 7

18 By the Lipschitz property of the loss, the contraction principle applies and E L f X i, Y i σ i C K,r i k K B k LE f f X i σ i. C K,r i k K B k To bound from above the right-hand side in the last inequality, consider two cases C K,r = r γ or C K,r = 4L K/αθ. In the first case, by definition of the complexity parameter r γ in 6, E f f X i σ i = E C K,r i k K B : f f L r γ k r γ σ i f f X i γ K K. i k K B k In the second case, σ i f f X i E C K,r i k K B k [ E : f f L r γ σ i f f X i r i k K B k γ : r γ f f L 4L K αθ f f X i σ i 4L K i k K B k αθ ]. Let f F be such that r γ f f L [4L K]/[αθ ]; by convexity of F, there exists f 0 F such that f 0 f L = r γ and f = f + αf 0 f with α = f f L / r γ. Therefore, f f X i σ i 4L K f f X i r i k K B k αθ γ σ i f f L = r i k K B k γ σ i f 0 f X i i k K B k and so : r γ f f L 4L K αθ f f X i σ i 4L K i k K B k αθ r γ : f f L = r γ By definition of r γ, it follows that E f f X i σ i C K,r γ K K. i k K B k σ i f f X i. i k K B k Therefore, as K K O βk, with probability larger than exp x βk/, for all f F such that f f L C K,r, zf K α x 8γL > K θ. 34 A..3 End of the proof of Theorem Theorem follows from Lemmas 3, 4, 5 and Proposition for the choice of constant θ = /3A α = /4, x = /4, β = 4/7 and γ = /575AL. 8

19 A.3 Proof of Theorem 3 Let K [ 7 O /3, ] and consider the event Ω K defined in 6. It follows from the proof of Lemmas 3 and 4 that T K f θc K,r on Ω K. Setting θ = /3A, on J=K Ω J, f ˆR J for all J = K,...,, so ˆR J=K J. By definition of ˆK, it follows that ˆK K and by definition of f, f ˆR K which means that T K f θc K,r. It is proved in Lemmas 3 and 4 that on Ω K, if f F satisfies f f L C K,r then T K f > θc K,r. Therefore, f f L C K,r. On Ω K, since f f L C K,r, P L f θc K,r. Hence, on J=K Ω J, the conclusions of Theorem 3 hold. Finally, by Proposition, P [ J=KΩ J ] A.4 Proof of Theorem 4 J=K exp K/06 4 exp K/06. The proof of Theorem 4 follows the same path as the one of Theorem. We only sketch the different arguments needed because of the localization by the excess loss and the lack of Bernstein condition. Define the event Ω K in the same way as Ω K in 6 where C K,r is replaced by r γ and the L localization is replaced by the excess loss localization : { } Ω K = f L F r γ, J {,..., K} : J > K/ and k J, P B k P L f /4 r γ 35 where L F r γ = {f F : P L f r γ}. Our first goal is to show that on the event Ω K, P L ˆf /4 r γ. We will then handle P[Ω K ]. Lemma 6. Grant Assumptions and. For every r 0, the set L F r := {f F : P L f r} is convex and relatively closed to F in L µ. Moreover, if f F is such that P L f > r then there exists f 0 F and P L f /r α > such that f f = αf 0 f and P L f0 = r. Proof. Let f and g be in L F r and 0 α. We have αf + αg F because F is convex and for all x X and y R, using the convexity of u lu + f x, y, we have l αf+ αg x, y l f x, y = lαf f x + αg f x + f x, y lf x, y α lf f x + f x, y lf x, y + α lg f x + f x, y lf x, y = αl f l f + αl g l f and so P L αf+ αg αp L f + αp L g. Given that P L f, P L g r we also have P L αf+ αg r. Therefore, αf + αg L F r and L F r is convex. For all f, g F, P L f P L g f f L µ so that f F P L f is continuous onto F in L µ and therefore its level sets, such as L F r, are relatively closed to F in L µ. Finally, let f F be such that P L f > r. Define α 0 = {α 0 : f + αf f L F r }. ote that P L f +αf f αp L f = r for α = r/p L f so that α 0 r/p L f. Since L F r is relatively closed to F in L µ, we have f + α 0 f f L F r and in particular α 0 < otherwise, by convexity of L F r, we would have f L F r. Moreover, by maximality of α 0, f 0 = f + α 0 f f is such that P L f0 = r and the results follows for α = α0. Lemma 7. Grant Assumptions and. On the event Ω K, P L ˆf r γ. Proof. Let f F be such that P L r > r γ. It follows from Lemma 6 that there exists α and f 0 F such that P L f0 = r γ and f f = αf 0 f. According to 3, we have for every k {,..., K}, P Bk L f αp Bk L f0. Since f 0 L F r γ, on the event Ω K, there are strictly more than K/ blocks B k 9

20 such that P Bk L f0 have P L f0 /4 r γ = 3/4 r γ and so P B k L f 3/4 r γ. As a consequence, we \L F r γ MOM K lf l f 3/4 r γ. 36 Moreover, on the event Ω K, for all f L F r γ, there are strictly more than K/ blocks B k such that P Bk L f /4 r γ P L f /4 r γ. Therefore, MOM K lf l f /4 r γ. 37 f L F r γ We conclude from 36 and 37 that MOM K lf l f /4 r γ and that every f F such that P L f > r γ satisfies MOM K lf l f 3/4 r γ. But, by definition of ˆf, we have Therefore, we necessarily have P L ˆf r γ. MOM K l ˆf l f MOM K lf l f /4 r γ. ow, we prove that Ω K is an exponentially large event using similar argument as in Proposition. Proposition 3. Grant Assumptions, and 7 and assume that βk O and β / 3γL > /. Then Ω K holds with probability larger than exp βk/5. Sketch of proof. The proof of Proposition 3 follows the same line as the one of Proposition. Let us precise the main differences. We set F = L F r γ and for all f F, z f = K k= I{ G f W k /4 r γ} where G f W k is the same quantity as in the proof of Proposition 3. Let us consider the contraction φ introduced in Proposition 3. By definition of r γ and V K, we have Eφ8 r γ G f W k P G f W k r γ 8 64K r γ 64 r γ EG f W k = 64K r γ {Var P i L f : f F, i I} Var Pi L f i B k 64K r γ {Var P i L f : P L f r γ, i I} r γ VarP B k L f Using Mc Diarmid s inequality, the Giné-Zinn symmetrization argument and the contraction lemma twice and the Lipschitz property of the loss function, such as in the proof of Proposition, we obtain with probability larger than exp K /5, for all f F, zf K / 3LK E r γ σ i f f X i i k K B k. 38 ow, it remains to use the definition of r γ to bound the expected remum in the right-hand side of 38 to get E r γ σ i f f X i i k K B k γ K K. 39 Proof of Theorem 4. The proof of Theorem 4 follows from Lemma 7 and Proposition 3 for β = 4/7 and γ = /768L. 0

21 B Proof of Lemma Proof. We have E : f f L r i= σ i f f X i = E < t, t R p :E<t,X> r i= σ i X i >. Let Σ = EX T X denote the covariance matrix of X and consider its SVD, Σ = QDQ T where Q = [Q Q p ] R p p is an orthogonal matrix and D is a diagonal p p matrix with non-negative entries. For all t R p, we have E X, t = t T Σt = p j= d j t, Qj. Then E t R p : E<t,X> r = E re < t, i= p t R : p j= d j<t,q j > r j=:d j 0 p j=:d j 0 Qj, dj σ i X i >= E p i= Moreover, for any j such that d j 0, Qj E, dj = i= k= σ i X i = E t R p : E<t,X> r Qj dj < t, Q j >, dj σ i X i r E k,l= p j=:d j 0 T Qj EXk T X Qj k = dj dj p < t, Q j > Q j, j= i= Qj, dj σ i X i i= σ l σ k Q j dj, X k Q j dj, X l = k= σ i X i. k= T Qj Qj Σ dj dj By orthonormality, Q T Q j = e j and Q T j Q = et j, then, for any j such that d j 0, i= E Q j dj, X k σ i X i Qj E, dj i= σ i X i = k= d j e T j De j =. Finally, we obtain E : f f L r i= and therefore the fixed point r γ is such that { r γ = inf r > 0, J I : J /, E inf { r > 0, J I : J /, σ i f f p X i r {dj 0} = r RankΣ j= t R p : E<t t,x> r i J r RankΣ r } J γ } σ i < X i, t t > r J γ RankΣ γ.
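As a quick numerical sanity check of Lemma 1, the Python sketch below estimates the Rademacher complexity appearing in the definition of r(γ) for a linear class with a rank-deficient covariance, using the closed form of the supremum of a linear form over the L_2-ball of the class. The Gaussian design and all numerical values are assumptions made for the illustration only; the lemma itself requires no Gaussianity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, rank, r = 1000, 50, 10, 1.0

# Rank-deficient covariance Sigma = Q diag(d) Q^T with `rank` nonzero eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
d = rng.uniform(0.5, 2.0, size=rank)
root = Q[:, :rank] * np.sqrt(d)                     # Sigma^{1/2} restricted to its range
Sigma_pinv = Q[:, :rank] @ np.diag(1.0 / d) @ Q[:, :rank].T

def rademacher_sup(n_rep=200):
    # E sup_{t : E<t,X>^2 <= r^2} (1/N) | sum_i sigma_i <X_i, t> |
    #   = r * E || Sigma^{+1/2} (1/N) sum_i sigma_i X_i ||_2  (the sum lies in range(Sigma))
    vals = []
    for _ in range(n_rep):
        X = rng.standard_normal((N, rank)) @ root.T         # X ~ N(0, Sigma)
        sigma = rng.choice([-1.0, 1.0], size=N)
        v = (sigma[:, None] * X).mean(axis=0)
        vals.append(r * np.sqrt(v @ Sigma_pinv @ v))
    return float(np.mean(vals))

# Both printed quantities should be of the same order, as predicted by Lemma 1.
print(rademacher_sup(), r * np.sqrt(rank / N))
```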


More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Encoder Based Lifelong Learning - Supplementary materials

Encoder Based Lifelong Learning - Supplementary materials Encoder Based Lifelong Learning - Supplementary materials Amal Rannen Rahaf Aljundi Mathew B. Blaschko Tinne Tuytelaars KU Leuven KU Leuven, ESAT-PSI, IMEC, Belgium firstname.lastname@esat.kuleuven.be

More information

Solving Corrupted Quadratic Equations, Provably

Solving Corrupted Quadratic Equations, Provably Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Machine Learning. Linear Models. Fabio Vandin October 10, 2017

Machine Learning. Linear Models. Fabio Vandin October 10, 2017 Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

1.1 Basis of Statistical Decision Theory

1.1 Basis of Statistical Decision Theory ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of

More information

Sequential and reinforcement learning: Stochastic Optimization I

Sequential and reinforcement learning: Stochastic Optimization I 1 Sequential and reinforcement learning: Stochastic Optimization I Sequential and reinforcement learning: Stochastic Optimization I Summary This session describes the important and nowadays framework of

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Towards stability and optimality in stochastic gradient descent

Towards stability and optimality in stochastic gradient descent Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

COMS 4771 Regression. Nakul Verma

COMS 4771 Regression. Nakul Verma COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Asymptotics of minimax stochastic programs

Asymptotics of minimax stochastic programs Asymptotics of minimax stochastic programs Alexander Shapiro Abstract. We discuss in this paper asymptotics of the sample average approximation (SAA) of the optimal value of a minimax stochastic programming

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

1-bit Matrix Completion. PAC-Bayes and Variational Approximation

1-bit Matrix Completion. PAC-Bayes and Variational Approximation : PAC-Bayes and Variational Approximation (with P. Alquier) PhD Supervisor: N. Chopin Bayes In Paris, 5 January 2017 (Happy New Year!) Various Topics covered Matrix Completion PAC-Bayesian Estimation Variational

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms university-logo Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms Andrew Barron Cong Huang Xi Luo Department of Statistics Yale University 2008 Workshop on Sparsity in High Dimensional

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Machine Learning Brett Bernstein. Recitation 1: Gradients and Directional Derivatives

Machine Learning Brett Bernstein. Recitation 1: Gradients and Directional Derivatives Machine Learning Brett Bernstein Recitation 1: Gradients and Directional Derivatives Intro Question 1 We are given the data set (x 1, y 1 ),, (x n, y n ) where x i R d and y i R We want to fit a linear

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

10-701/15-781, Machine Learning: Homework 4

10-701/15-781, Machine Learning: Homework 4 10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Boosting with Early Stopping: Convergence and Consistency

Boosting with Early Stopping: Convergence and Consistency Boosting with Early Stopping: Convergence and Consistency Tong Zhang Bin Yu Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original

More information