Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization

Size: px

Start display at page:

Download "Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization"

Ethan Morris
5 years ago
Views:

1 with Applications to Model Selection and Large-Scale Optiization Ibrahi Alabdulohsin 1 Abstract In this paper, we derive bounds on the utual inforation of the epirical risk iniization (ERM) procedure for both 0-1 and stronglyconvex loss classes. We prove that under the Axio of Choice, the existence of an ERM learning rule with a vanishing utual inforation is equivalent to the assertion that the loss class has a finite VC diension, thus bridging inforation theory with statistical learning theory. Siilarly, an asyptotic bound on the utual inforation is established for strongly-convex loss classes in ters of the nuber of odel paraeters. The latter result rests on a central liit theore (CLT) that we derive in this paper. In addition, we use our results to analyze the excess risk in stochastic convex optiization and unify previous works. Finally, we present two iportant applications. First, we show that the ERM of strongly-convex loss classes can be trivially scaled to big data using a naïve parallelization algorith with provable guarantees. Second, we propose a siple inforation criterion for odel selection and deonstrate experientally that it outperfors the popular Akaike s inforation criterion (AIC) and Schwarz s Bayesian inforation criterion (BIC). 1. Introduction Learning via the epirical risk iniization (ERM) of stochastic loss is a ubiquitous fraework that has been widely applied in achine learning and statistics. It is often regarded as the default strategy to use, due to its siplicity, generality, and statistical efficiency (Shalev-Shwartz et al., 2009; Koren & Levy, 2015; Vapnik, 1999; Shalev-Shwartz & Ben-David, 2014). Given a fixed hypothesis space H, 1 Saudi Araco, Dhahran 31311, Saudi Arabia. Correspondence to: Ibrahi Alabdulohsin <ibrahi.alabdulohsin@kaust.edu.sa>. Proceedings of the 35 th International Conference on Machine Learning, Stockhol, Sweden, PMLR 80, Copyright 2018 by the author(s). a doain Z, and a loss function on the product space f : H Z R, the ERM learning rule selects the hypothesis ĥ that iniizes the epirical risk: { ĥ = arg in F S (h) = 1 h H i=1 } f(h, z i ), (1) where S = (z 1,..., z ) Z is a saple of observations drawn i.i.d. fro soe unknown probability distribution D. To siplify notation, we oit the dependence of ĥ on the saple S. By contrast, the true risk iniizer is denoted h : h = arg in h H { } F (h) = E z D [f(h, z)]. (2) Hence, learning via ERM is justified if and only if F (ĥ) F (h ) + ɛ, for soe provably sall ɛ. In soe applications, such as classification, the stochastic loss of interest is often the 0-1 loss. However, iniizing the 0-1 loss is, in general, coputationally intractable even for siple hypothesis spaces, such as linear classifiers (Feldan et al., 2009). To circuvent this difficulty, a convex, surrogate loss is used instead, which is justified by consistency results when the loss is calibrated (Bartlett et al., 2006). When the stochastic loss is convex, ERM has occasionally been referred to by various naes in the literature, such as stochastic convex optiization and stochastic average approxiation (Shalev-Shwartz et al., 2009; Feldan, 2016). Exaples of the latter fraework include least squares, logistic regression, and SVM. In the literature, several approaches have been proposed for analyzing the generalization risk and, consequently, the consistency of epirical risk iniization. The ost doinant approach is unifor convergence, with seinal results for 0-1 loss classes dating to the early work of Vapnik and Chervonenkis in the 1970s (Shalev-Shwartz & Ben-David, 2014; Abu-Mostafa et al., 2012; Vapnik, 1999). The Fundaental Theore of Statistical Learning states that a hypothesis space H is agnostic PAC-learnable via ERM if and only if it is PAC-learnable at all, and that this occurs if and only if H has a finite VC diension (Shalev-Shwartz & Ben- David, 2014). This elegant characterization shows that the generalization risk for 0-1 loss classes can be bounded by coputing their VC diensions.

2 Siilarly, when the stochastic loss is both convex and of the generalized linear for, where f(h, z i ) = r(h) + g( h, φ(z i ), z i ) and r : H R is a strongly-convex regularizer, then, under ild additional conditions of soothness and boundedness, it has been shown that the true risk sub-optiality F (h) F (h ) is bounded uniforly in H by a constant ultiple of the epirical risk sub-optiality F S (h) F S (ĥ) plus an O(1/) ter (Sridharan et al., 2009). In general, it can be shown that for strongly-convex stochastic losses, the risk of the epirical risk iniizer converges to the optial risk with rate O(1/), which justifies learning via the epirical risk iniization (Shalev- Shwartz et al., 2009). Hence, the consistency of epirical risk iniization for both 0-1 loss classes and stronglyconvex loss functions is, generally, well-understood. However, any inforation-theoretic approaches have been recently proposed for analyzing achine learning algoriths. These include generalization bounds that are based on ax-inforation (Dwork et al., 2015), leave-one-out inforation (Raginsky et al., 2016), and the utual inforation (Alabdulohsin, 2015; Russo & Zou, 2016). They have found applications in iportant areas, such as in privacy and adaptive data analysis, since they naturally lead to guaranteed stability, bounded inforation leakage, and robustness against post-processing. Copared to other inforationtheoretic approaches, unifor generalization bounds using the variational inforation, also called T -inforation (Raginsky et al., 2016), yield the tightest results for bounded loss functions, as deduced by the Pinsker inequality (Cover & Thoas, 1991). It was proposed in (Alabdulohsin, 2015), who used it to prove that unifor generalization, algorithic stability, and bounded inforation were, in fact, equivalent conditions on the learning algorith. A siilar notion, called robust generalization" was later proposed in (Cuings et al., 2016), who analyzed its significance in the adaptive learning setting and showed that it could be achieved using saple copression schees, finite description lengths, and differential privacy. However, these two notions of unifor" and robust" generalization turned out to be equivalent (Alabdulohsin, 2017). To describe unifor generalization in inforal ters, suppose we have a learning algorith L : Z H, which selects a hypothesis h H according to a training saple S Z. Let the generalization risk of L w.r.t. soe bounded loss function l : H Z [0, 1] be defined by: [ R gen (L) = E S D, h L(h) LS (h) ], (3) where L(h) = E z D [l(h, z)] and L S is its epirical counterpart L S (h) = E z S [l(h, z)]. In Eq. (3), the expectation is taken over the rando choice of the saple and the internal randoness (if any) in the learning algorith. Note that l in Eq. (3) can be different fro the loss f in Eq. (1) that is optiized during the learning stage, e.g. l can be a 0-1 isclassification error rate while f is soe convex, surrogate loss. Then, L is said to generalize uniforly with rate ɛ > 0 if R gen (L) ɛ is guaranteed to hold independently of the choice of l. That is, the generalization guarantee holds uniforly in expectation across all paraetric loss functions, hence the nae. Definition 1 (Unifor Generalization). A learning algorith L : Z H generalizes uniforly with rate ɛ 0 if for all bounded paraetric losses l : H Z [0, 1], we have R gen (L) ɛ, where R gen (L) is given in Eq. (3). Inforally, Definition 1 states that once a hypothesis h is selected by a learning algorith L that achieves unifor generalization, then no adversary" can post-process the hypothesis in a anner that causes over-fitting to occur (Alabdulohsin, 2015; Cuings et al., 2016). Equivalently, unifor generalization iplies that the epirical perforance of h on the saple S is a faithful approxiation to its true risk, regardless of how that perforance is easured. For exaple, the loss function l in Eq. (3) can be the isclassification error rate, a cost-sensitive error rate in fraud detection and edical diagnosis (Elkan, 2001), or it can be the Brier score in probabilistic predictions (Kull & Flach, 2015). The generalization guarantee would hold in any case. The ain theore of (Alabdulohsin, 2015) states that the unifor generalization risk has a precise inforationtheoretic characterization: Theore 1 (Alabdulohsin, 2015). Given a fixed 0 ɛ 1 and a learning algorith L : Z H that selects a hypothesis h H according to a training saple S = {z 1,..., z }, where z i D are i.i.d., then L generalizes uniforly with rate ɛ if and only if J (h; ẑ) ɛ, where ẑ S is a single rando training exaple, J (x; y) = p(x)p(y), p(x, y) T, and q 1, q 2 T is the total variation distance between the probability easures q 1 and q 2. Throughout this paper, we will adopt the terinology used in (Alabdulohsin, 2017) and call J (x; y) the variational inforation" between the rando variables x and y. Variational inforation is an instance of the class of inforativity easures using f-divergences, for which an axioatic basis has been proposed (Csiszár, 1972; 2008). To illustrate Theore 1, consider the case of a finite hypothesis space H <. Then, a classical arguent using the union bound (Shalev-Shwartz & Ben-David, 2014) can be used to show that the generalization risk w.r.t. any fixed bounded loss l : H Z [0, 1] is Õ( log H /). This follows fro the fact that the union bound arguent does not ake any additional assuptions on the loss l beyond the fact that it has a bounded range. Hence, the unifor generalization risk is Õ( log H /). However, Theore 1 states that this bound ust also hold for the variational inforation J (h; ẑ) as well. This clai can, in fact, be ver-

3 ified readily using inforation theoretic inequalities since we have (Alabdulohsin, 2015): I(h; ẑ) I(h; S) log H J (h; ẑ) 2 2 2, where I(x; y) is the Shannon utual inforation easured in nats (i.e. using natural logariths). Here, the first inequality is due to Pinsker (Cover & Thoas, 1991), the second inequality follows because z i are i.i.d., and the last inequality holds because the Shannon utual inforation is bounded by the entropy. Note that in the latter case, only inforation-theoretic inequalities were used to recover this classical result without relying on the union bound. The generalization guarantee in Theore 1 depends on the probability distribution D. It can be ade distribution-free by defining the capacity of a learning algorith L by: C(L) = sup{j (h; ẑ) = Eẑ p(z) p(h), p(h ẑ) T }, (4) p(z) where the supreu is taken over all possible distributions of observations. This is analogous to the capacity of counication channels in inforation theory (Cover & Thoas, 1991). Then, the generalization risk of L is bounded by C(L) for any distribution D and any bounded loss function l : H Z [0, 1]. In any cases, such as in finite description lengths, countable doains, ean estiation, differential privacy, and saple copression schees, the capacity of C(L) can be tightly bounded (Alabdulohsin, 2015; 2017). In this paper, we derive new bounds for the capacity of the epirical risk iniization (ERM) procedure. Finally, we point out that even though unifor generalization guarantees are originally defined only in expectation as shown in Eq. (3), it has recently been established that these guarantees of unifor generalization in expectation also iplied a generalization with a high probability as well (Alabdulohsin, 2017). This is in stark contrast with traditional generalization in expectation guarantees that do not iply concentration. 2. Contributions The first contribution of this paper is to establish tight relations between unifor generalization and the epirical risk iniization (ERM) procedure. This will allow us to bridge inforation theory with statistical learning theory. More specifically, we will prove that under the Axio of Choice, an ERM learning rule always exists that has a vanishing learning capacity C(L) if and only if the 0-1 loss class has a finite VC diension. Second, we prove that the epirical risk iniization of strongly-convex stochastic loss also generalizes uniforly in expectation. To establish the latter result, we prove the asyptotic norality of the epirical risk iniizer over the rando choice of the training saple, which is a useful result in its own right, such as for uncertainty quantification and hypothesis testing. In the achine learning literature, central liit theores have been eployed to analyze online learning algoriths. For instance, (Polyak & Juditsky, 1992) derived a central liit theore (CLT) when using an averaging ethod to accelerate convergence. More recently, (Mandt et al., 2016) derived a central liit theore for SGD with a fixed learning rate under siplifying assuptions. Then, (Mandt et al., 2016) used their result to select the learning rate such that the trajectory of SGD generates saples fro the posterior distribution. Unlike the work in (Mandt et al., 2016), we prove our CLT without relying on soe unnecessary siplifying assuptions, such as the Gaussianity of noise. As a consequence of our work, we present a generalization bound for stochastic convex optiization, which only depends on the nuber of odel paraeters and applies to any learning task, such as regression, ulti-class classification, and ranking. Next, we use our results to analyze the excess risk in stochastic convex optiization and unify previous works. Finally, we deonstrate two iportant applications. First, we show that the ERM of strongly-convex loss classes can be trivially distributed and scaled to big data using a naïve parallelization algorith whose perforance is provably equivalent to that of the ERM learning rule. Unlike previous works, our parallelization algorith does not rely on advanced treatents, such as the ADMM procedure (Boyd et al., 2011). Second, we use our results to propose a siple inforation criterion for odel selection in ters of the nuber of odel paraeters and deonstrate experientally that it outperfors the popular Akaike s inforation criterion (AIC) (Akaike, 1998; Bishop, 2006) and Schwarz s Bayesian inforation criterion (BIC) (Schwarz, 1978). 3. Notation Our notation is fairly standard. Soe exceptions that ay require further clarification are as follows. First, when x is a rando variable whose value is drawn uniforly at rando fro a finite set S, we will write x S to denote this fact. Second, if x is a predicate, then I{x} = 1 if and only if x is true, otherwise I{x} = 0. Third, A B is a linear atrix inequality (LMI), i.e. A B is equivalent to the assertion that B A is positive seidefinite. Moreover, we will use the order in probability notation for real-valued rando variables. Here, we adopt the notation used in (Janson, 2011; Tao, 2012). In particular, let x = x n be a real-valued rando variable that depends on soe paraeter n N. Then, we will write x n = O p (f(n)) if for any δ > 0, there exists absolute constants C and n 0 such that for any fixed n n 0, the inequality x n < C f(n)

4 holds with a probability of, at least, 1 δ. In other words, the ratio x n /f(n) is stochastically bounded (Janson, 2011). Siilarly, we write x n = o p (f(n)) if x n /f(n) converges to zero in probability. As an exaple, if x N (0, I d ) is a standard ultivariate Gaussian vector, then x 2 = O p ( d) even though x 2 can be arbitrarily large. Intuitively, the probability of the event x 2 d 1 2 +ɛ when ɛ > 0 goes to zero as d so x 2 is effectively of the order O( d). Finally, let H be a fixed hypothesis space and let Z be a fixed doain of observations. Given a 0-1 loss function f : H Z {0, 1}, we will abuse terinology slightly by speaking about the VC diension" of H, when we actually ean the VC diension of the loss class {f(h, ) : h H}. Because the 0-1 loss f will be clear fro the context, this should not cause any abiguity. 4. Epirical Risk Miniization of 0-1 Loss Classes We begin by analyzing the ERM learning rule for 0-1 loss classes. Before we establish that a finite VC diension is sufficient to guarantee the existence of ERM learning rules that generalize uniforly in expectation, we first describe why ERM by itself is not sufficient even when the hypothesis space has a finite VC diension 1. Proposition 1. For any saple size 1 and a positive constant ɛ > 0, there exists a hypothesis space H, a doain Z, and a 0-1 loss function f : H Z {0, 1} such that: (1) H has a VC diension d = 1, and (2) a learning algorith L : Z H exists that outputs an epirical risk iniizer ĥ with J (ĥ; ẑ) 1 ɛ, where ẑ S is a single rando training exaple. Proposition 1 shows that one cannot obtain a non-trivial bound on the unifor generalization risk of an ERM learning rule in ters of the VC diension d and the saple size without iposing soe additional restrictions. Next, we prove that an ERM learning rule exists that satisfies the unifor generalization property if the hypothesis space has a finite VC diension. We begin by recalling a fundaental result in odern set theory. A non-epty set Q is said to be well-ordered if Q is endowed with a total order such that every nonepty subset of Q contains a least eleent. The following fundaental result is due to Ernst Zerelo. Theore 2 (Well-Ordering Theore). Under the Axio of Choice, every non-epty subset can be well-ordered. Theore 2 was proved by Zerelo in 1904 (Kologorov & Foin, 1970). 1 Detailed foral proofs are available in the suppleentary aterials. Theore 3. Given a hypothesis space H, a doain Z, and a 0-1 loss f : H Z {0, 1}, let be a well-ordering on H and let L : Z H be the learning rule that outputs the least" epirical risk iniizer to the training saple S Z according to. Then, C(L) 0 as if H has a finite VC diension. In particular: C(L) d log 2e d +, where C(L) is given by Eq. (4) and d is the VC diension of H, provided that d. Next, we prove a converse stateent. Before we do this, we present a learning proble that shows why a converse to Theore 3 is not generally possible without aking soe additional assuptions. Hence, our converse will be later established for the binary classification setting only. Exaple 1 (Integer Subset Learning Proble). Let Z = {1, 2, 3,..., d} be a finite set of positive integers. Let H = 2 Z and define the loss of a hypothesis h H to be f(h, z) = I{z / h}. Then, the VC diension is d. However, the learning rule that outputs h = Z is always an ERM learning rule that generalize uniforly with rate ɛ = 0 regardless of the saple size and the distribution of observations. The previous exaple shows that a converse to Theore 3 is not possible without iposing soe additional constraints. In particular, in the Integer Subset Learning Proble, the VC diension is not a useful easure of the coplexity of the hypothesis space H because any hypotheses doinate others (i.e. perfor better across all distributions of observations). For exaple, the hypothesis h = {1, 2, 3} doinates h = {1} because there is no distribution on observations in which h outperfors h. Even worse, the hypothesis h = Z doinates all other hypotheses. Consequently, in order to prove a lower bound for all ERM rules, we consider the standard binary classification setting. Theore 4. In any fixed doain Z = X Y, let the hypothesis space H be a concept class on X and let f(h, x, y) = I{y h(x)} be the isclassification error. Then, any ERM learning rule L w.r.t. f has a learning( capacity C(L) that is bounded fro below by C(L) , d) where is the training saple size and d is the VC diension of H. Using both Theore 3 with Theore 4, we arrive a crisp characterization of the VC diension of concept classes in ters of inforation theory. Theore 5. Given a fixed doain Z = X Y, let the hypothesis space H be a concept class on X and let f(h, x, y) = I{y h(x)} be the isclassification error. Let be the saple size. Then, the following stateents are equivalent under the Axio of Choice:

1. H adits an ERM learning rule L whose learning capacity C(L) satisfies C(L) 0 as. 2. H has a finite VC diension. Proof. The lower bound in Theore 4 holds for all ERM learning rules.

5 1. H adits an ERM learning rule L whose learning capacity C(L) satisfies C(L) 0 as. 2. H has a finite VC diension. Proof. The lower bound in Theore 4 holds for all ERM learning rules. Hence, an ERM learning rule exists that generalize uniforly with a vanishing rate across all distributions only if H has a finite VC diension. However, under the Axio of Choice, H can always be well-ordered by Theore 2 so, by Theore 3, a finite VC diension is also sufficient to guarantee the existence of a learning rule that generalize uniforly. Theore 5 presents a crisp characterization of the VC diension in ters of inforation theory. According to the theore, an ERM learning rule can be constructed that does not encode the training saple if and only if the hypothesis space has a finite VC diension. Reark 1. In (Cuings et al., 2016), it has been argued that unifor generalization, called robust generalization" in the paper, is iportant in the adaptive learning setting because it iplies that no adversary can post-process the hypothesis and causes over-fitting to occur. In other words, no adversary can use the hypothesis to infer new conclusions, which do not theselves generalize. It was shown in (Cuings et al., 2016) that this robust generalization guarantee was achievable by saple copression schees and differential privacy. Theore 3 shows that an ERM learning rule of 0-1 loss classes with finite VC diensions always exists that satisfies this robust generalization property. Reark 2. One ethod of constructing a well-ordering on a hypothesis space H is to use the fact that coputers are equipped with finite precisions. Hence, in practice, every hypothesis space is enuerable, fro which the noral ordering of the integers is a valid well-ordering. 5. Epirical Risk Miniization of Strongly-Convex Loss Classes Next, we analyze the ERM learning rule for strongly-convex loss classes. To recall, the ERM learning rule selects the hypothesis ĥ given by Eq. (1), which is identical to what was assued before. However, we now further assue that f(h, z) is γ-strongly convex, L-Lipschitz, and twicedifferentiable on its first arguent. Moreover, the hypothesis space is R d for soe finite d < Central Liit Theore We begin by establishing a central liit theore (CLT). Throughout this section, we will siplify notation by writing f i (h) = f(h, z i ). Predicted Actual Figure 1. This figure presents the predicted (left) and actual (right) covariance atrices of the epirical risk iniizer to an l 2- regularized logistic regression proble, where d = 10 and = The colorap used is parula, and siilar colors correspond to siilar values. Theore 6. Let ĥ and h be as given in Eq. (1) and Eq. (2) respectively for soe i.i.d. realizations of the stochastic loss f D. If the distribution D is supported on γ-strongly convex, L-Lipschitz, and twice differentiable loss functions, then ( ĥ h ) N ( 0, Σ ) as, where Σ is equal to: ( Ef D [ 2 f(h )] ) 1 Cov( f(h )) (E f D [ 2 f(h )] ) 1 Proof. Here is the outline of the proof. First, we use strong convexity, the first-order optiality condition of h, and Theore 6 in (Shalev-Shwartz et al., 2009) to show that ĥ h 2 = O p (1/ ). This allows us to estiate the error ter of the second-order Taylor expansion. Next, we use the continuous apping theore (Mann & Wald, 1943) and the atrix inversion lea (Hager, 1989) to show that: ĥ h = 1 ( Ef D [ 2 f(h )] ) 1 ( 1 ) f i (h )+o p i=1 Because (1/) i=1 f i(h ) is an average of i.i.d. realizations, applying the classical central liit theore, the first-order optiality condition on h, and the affine property of the ultivariate Gaussian distribution will yield the desired result. Reark 3. The CLT in Theore 6 is an asyptotic result; it holds for a sufficiently large. When the gradient f i (w) is, additionally, R-Lipschitz, it can be shown using the Matrix-Hoeffding bound (Tropp, 2012) that the noral approxiation is valid as long as L 2 /γ 2 + RL log d. To verify Theore 6 epirically, we have ipleented the l 2 -regularized logistic regression on a ixture of two Gaussians with different eans and covariance atrices, one corresponding to the positive class and one corresponding to the negative class. Both the actual and predicted covariance atrices of the epirical risk iniizer ĥ, whose randoness is derived fro the randoness of the training saple, are depicted in Fig. 1. As shown in the figure, the actual covariance atrix atches with the one predicted by Theore 6.

6 Theore 6 shows that the saple coplexity of stochastic convex optiization depends on the curvature of the risk E f D [f(h)] at its iniizer h. Next, we show that the dependence of ERM on this curvature is, in fact, optial. Proposition 2. There exists a stochastic loss f : R d R that satisfies the conditions of Theore 6 and a distribution D such that if h is an unbiased estiator of h = arg in h R d{e f D [f(h)], then the covariance of the estiator is bounded fro below (in the positive seidefinite sense) by: E S D [( h h ) ( h h )] (1/) Σ, where Σ is the covariance of ERM given by Theore The Excess Risk Theore 6 states that when the loss f is strongly convex, the epirical risk iniizer ĥ is norally distributed around the population risk iniizer h with a vanishing O(1/) covariance. This justifies learning via the epirical risk iniization rule. We use this central liit theore, next, to derive the asyptotic behavior of the excess risk F (ĥ) F (h ), where F (h) is defined in Eq. (2). In particular, we would like to deterine the ipact of regularization on the asyptotic behavior of the excess risk. We write: f(h) = λ 2 h g(h), (5) where λ 0, g is a κ-strongly-convex function for soe κ 0, and λ + κ = γ > 0. Because f satisfies the conditions of Theore 6, we know that ĥ h as. Hence, we can take the Taylor expansion of F (h) around h and write: F (ĥ) F (h ) = 1 ( 1 ) 2 (ĥ h ) T 2 F (h ) (ĥ h )+o p, which holds since F (h ) = 0 and ĥ h 2 = O p (1/ ). Using Theore 6 and siplifying yields: F (ĥ) F (h ) = 1 ( 1 ) 2 xt U T 2 F (h ) U x+o p, (6) where Σ = UU T, x = U 1 (ĥ h ), and Σ is given by Theore 6. By substituting Eq. (5), we arrive at the following corollary. Corollary 1. Let f be decoposed into the su of a regularization ter and an unregularized loss, as given by Eq. (5). Write G(h) to denote the unregularized risk: G(h) = E f D [g(h)] = E f D [f(h) (λ/2) h 2 2]. Then, under the conditions of Theore 6, G(ĥ) G(h ) converges in distribution (over the rando choice of the training saple) to λ c T x + xt Dx 2 for soe c Rd and soe D 0 that are both independent of, where x N (0, I d ) is a standard ultivariate Gaussian rando variable. We can interpret Corollary 1 in sharp Θ( ) ters. In the following rearks, both expectation" and "probability" are taken over the rando choice of the training saple: In the absence of regularization, the ERM learning rule of strongly convex loss enjoys a fast learning rate of Θ(1/) both in expectation and in probability. When regularization is used, the unregularized risk converges to its infinite-data liit with a Θ(1/) rate only in expectation. By contrast, it has a Θ(1/ ) rate of convergence in probability. Corollary 1 unifies previous results, such as those reported in (Shalev-Shwartz & Ben-David, 2014; Sridharan et al., 2009; Shalev-Shwartz et al., 2009). It also illustrates why a fast Θ(1/) learning rate in expectation, such as those reported for Exp-Concave iniization (Koren & Levy, 2015), do not necessarily correspond to fast learning rates in practice because the sae excess risk can be Θ(1/ ) in probability even if it is Θ(1/) in expectation. To validate these clais experientally, we have ipleented a linear regression proble by iniizing the Huber loss of the residual. The ean and the standard deviation of G(ĥ) G(h ) are plotted in Fig 2 against the saple size. As shown in Fig 2 (a, b), we have a fast Θ(1/) convergence both in expectation and in probability when no regularization is used. However, when regularization is used (γ = 0.01 in this experient), then the expectation of G(ĥ) G(h ) is O(1/) but its standard deviation is O(1/ ) as shown in Fig 2 (c, d). These results are in agreeent with Corollary 1. Corollary 1 provides the excess risk between the regularized epirical risk iniizer ĥ and its infinite data liit h. We can use the sae corollary to obtain a bound on the excess risk G(ĥ) inf h Rd {G(h)} using oracle inequalities in a anner that is siilar to (Sridharan et al., 2009; Shalev- Shwartz & Ben-David, 2014). Moreover, the regularization agnitude λ ay be optiized to iniize this excess risk. Because this is quite siilar to previous works, the reader is referred to (Sridharan et al., 2009; Shalev-Shwartz & Ben-David, 2014) for further details Unifor Generalization Bound Theore 6 can be a viable tool in uncertainty quantification and hypothesis testing. In this section, however, we focus on its iplication for the unifor generalization risk. We begin with the following conditional" version of the central liit theore.

7 (a) (b) (c) (d) Figure 2. This figure plots the quantity E() = G(ĥ) G(h ) vs. the saple size for a regression proble, where we iniize the Huber loss of the residual. Figures (a) and (b) display the expectation and standard deviation of E(), respectively, when γ = 0 (no regularization). By contrast, Figures (c) and (d) display the sae quantities when γ = The blue curves are the best fitted curves assuing a fast Θ(1/) convergence while the red curves are the best fitted curves assuing a standard Θ(1/ ) convergence. Proposition 3. Let ˆf D be a fixed instance of the stochastic loss, and let the training saple be S = {ˆf} {f 2, f 3,..., f } with f i D drawn i.i.d. and independently of ˆf. Let ĥ Rd be the epirical risk iniizer given by Eq. (1). Then, under the conditions of Theore 6: where: p(ĥ ˆf) N ( µ, 1 1 Σ), { [ 1 ]} µ = arg in E f D f(h) + ˆf(h) h R d 1 Here, Σ is the covariance atrix given by Theore 6. Proposition 3 states, in other words, that a single realization of the stochastic loss shifts the expectation of the epirical risk iniizer ĥ and rescales its covariance. Both Theore 6 and Proposition 3 iply that the capacity of the ERM learning rule of stochastic, strongly-convex loss classes satisfies C(L) 0 as. We establish a ore useful quantitative result in the following theore. Theore 7. Suppose that norality holds, where the epirical risk iniizer ĥ has the probability density p(ĥ) = N (µ, Σ/) with µ and Σ given by Theore 6, and for any given single realization of the stochastic loss ˆf, we also have p(ĥ ˆf) = N ( µ, Σ/( 1)) as given by Proposition 3. Then: I(ĥ; ˆf) d ( 1 ) + o, (7) where I(x; y) is the Shannon utual inforation between the rando variables x and y. Corollary 2. Under the conditions of Theore 7, the unifor generalization risk satisfies: J (ĥ; ˆf) d ( 1 ) 2 + o (8) Proof. This follows iediately fro Theore 7 and Pinsker s inequality (Cover & Thoas, 1991). 6. Applications 6.1. Large-Scale Stochastic Convex Optiization Theore 6 states that the epirical risk iniizer of strongly convex stochastic loss is norally distributed around the population risk iniizer h with covariance (1/)Σ. This justifies learning via the epirical risk iniization procedure. However, the true goal behind stochastic convex optiization in the achine learning setting is not to copute the epirical risk iniizer ĥ per se but to estiate h. The epirical risk iniizer provides such an estiate. However, a different estiator can be constructed, which is as effective as the epirical risk iniizer ĥ. Theore 8. Under the conditions of Theore 6, let S = {f 1,..., f } be i.i.d. realizations of the stochastic loss f D and fix a positive integer K 1. Let K j=1 S j be a partitioning of S into K subsets of equal size and define ĥ j to be the epirical risk iniizer for S j only. Then, h = 1 K K j=1 ĥj is asyptotically norally distributed around the population risk iniizer h with covariance (1/)Σ, where Σ is given by Theore 6. Proof. By Theore 6, every ĥj is asyptotically norally distributed around h with covariance (K/)Σ. Hence, the average of those hypotheses is asyptotically norally distributed around h with covariance (1/)Σ. Theore 8 shows that in the achine learning setting, one can trivially scale the epirical risk iniization procedure to big data using a naïve parallelization algorith. Indeed, the estiator h described in the theore is not a iniizer to the epirical risk but it is as effective as the epirical risk iniizer in estiating the population risk iniizer h. Hence, both ĥ and h enjoy the sae perforance guarantee. In the literature, ethods for scaling achine learning algoriths to big data using distributed algoriths do not always distinguish between optiization as it is used in the traditional setting vs. the optiization that is used in

8 Algorith Test Error Tie ERM ( = 50, 000) 14.8 ± 0.3% 2.93s ERM ( = 1, 000) 15.3 ± 0.5% 0.03s Parallelized (K = 50, = 1, 000) 14.8 ± 0.1% 0.03s Table 1. In this experient, we run the l 2 regularized logistic regression on the MiniBooNE Particle Identification proble (Blake & Merz, 1998). The first row corresponds to running the algorith on 50,000 training exaples, the second row is for 1,000 exaples, while the last row is for the parallelization algorith of Theore 8, which splits 50,000 exaples into 50 separate saller probles. the achine learning setting despite several calls that highlighted such a subtle distinction (Bousquet & Bottou, 2008; Shalev-Shwartz et al., 2012). One popular, and often-cited, procedure that does not ake such a distinction is the alternating direction ethod of the ultiplier (ADMM) (Boyd et al., 2011). The ADMM procedure produces a distributed algorith with essage passing for iniizing the epirical risk by reforulating stochastic convex optiization into a global consensus proble". However, the epirical risk is erely a proxy for the true risk that one seeks to iniize. Theore 8, by contrast, presents a uch sipler algorith that achieves the desired goal. Table 1 validates this clai experientally. Reark 4. The theoretical guarantee of Theore 8 is established for stochastic convex optiization only. When the stochastic loss is non-convex, such as the case in neural networks, then the parallelization ethod is likely to fail. Intuitively, the average of good solutions is not necessarily a good solution itself when the loss is non-convex Inforation Criterion for Model Selection Theore 7 shows that under the assuption of norality, which is justified by the central liit theores (Theore 6 and Proposition 3), the unifor generalization risk of the ERM learning rule of stochastic, strongly-convex loss is asyptotically bounded by d/(2). Rearkably, this bound does not depend on the strong-convexity paraeter γ, nor on any other properties of the stochastic loss f that is optiized during the training stage. In addition, because it is a unifor generalization bound, it also holds for any bounded loss function l : R d Z [0, 1]. Hence, it can serve as a siple inforation criterion for odel selection. As a result, a learning algorith should ai at iniizing the Unifor Inforation Criterion (UIC) given by: UIC = E z S [l(ĥ, z)] + d 2, (9) where d is the nuber of learned paraeters. The UIC above is analogous to the Akaike inforation criteria (AIC) (Akaike, 1998; Bishop, 2006) which seeks to iniize E z S [f(ĥ, z)] + d/. It is also siilar to Schwarz s AIC BIC UIC d Figure 3. In this experient, we have a least-squares polynoial regression where X = [ 1, +1], y = f(x) + ɛ, f is a polynoial of degree d, and ɛ is i.i.d. Gaussian noise. The odel selection proble seeks to deterine the optial polynoial degree d. In each of the three figures above, a value of d is selected, which is arked by the dashed vertical line. Then, 100 d training exaples are provided. Each curve plots an inforation criterion against d. Ideally, d iniizes the corresponding curve. As shown above, only UIC succeeds in estiating d in all probles. In this experient, the epirical risk is noralized in the range [0, 1] by dividing it by the axiu observed loss in the training saple. Bayesian inforation criterion (BIC) (Schwarz, 1978), which seeks to iniize E z S [f(ĥ, z)] + (log /2) d/. In all three criteria, the generalization risk is estiated by the nuber of learned paraeters. However, the penalty for the nuber of odel paraeters is proportional to d in both AIC and BIC, which, by the discretization trick (Shalev- Shwartz & Ben-David, 2014), is known to be pessiistic. Indeed, the discretization trick suggests that the generalization risk is proportional to O( d/), which agrees with the UIC. Experientally, the UIC succeeds when both AIC and BIC fail, as deonstrated in Fig Conclusion In this paper, we derive bounds on the utual inforation of the ERM rule for both 0-1 and strongly-convex loss classes. We prove that under the Axio of Choice, the existence of an ERM rule with a vanishing utual inforation is equivalent to the assertion that the loss class has a finite VC diension, thus bridging inforation theory with statistical learning theory. Siilarly, an asyptotic bound on the utual inforation is established for strongly-convex loss classes in ters of the nuber of odel paraeters. The latter result uses a central liit theore that we derive in this paper. After that, we prove that the ERM learning rule for strongly-convex loss classes can be trivially scaled to big data. Finally, we propose a siple inforation criterion for odel selection and deonstrate experientally that it outperfors previous works. 2 The MATLAB codes that generate Table 1 and Fig. 3 are provided in the suppleentary aterials.

9 References Abu-Mostafa, Yaser S, Magdon-Isail, Malik, and Lin, Hsuan-Tien. Learning fro data, volue 4. AMLBook New York, NY, USA:, Akaike, Hirotogu. Inforation theory and an extension of the axiu likelihood principle. In Selected Papers of Hirotugu Akaike, pp Springer, Alabdulohsin, Ibrahi M. Algorithic stability and unifor generalization. In NIPS, pp , Alabdulohsin, Ibrahi M. An inforation theoretic route fro generalization in expectation to generalization in probability. In AISTATS, Bartlett, Peter L, Jordan, Michael I, and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the Aerican Statistical Association, 101(473): , Bishop, Christopher M.. Pattern recognition and achine learning. Springer, Blake, C. L. and Merz, C. J. UCI repository of achine learning databases, Bousquet, Olivier and Bottou, Léon. The tradeoffs of large scale learning. In Advances in neural inforation processing systes, pp , Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optiization and statistical learning via the alternating direction ethod of ultipliers. Foundations and Trends R in Machine Learning, 3(1):1 122, Cover, Thoas M and Thoas, Joy A. Eleents of inforation theory. Wiley & Sons, Csiszár, Ire. A class of easures of inforativity of observation channels. Periodica Matheatica Hungarica, 2: , Csiszár, Ire. Axioatic characterizations of inforation easures. Entropy, 10: , Cuings, Rachel, Ligett, Katrina, Nissi, Kobbi, Roth, Aaron, and Wu, Zhiwei Steven. Adaptive learning with robust generalization guarantees. In COLT, pp , Dwork, Cynthia, Feldan, Vitaly, Hardt, Moritz, Pitassi, Toniann, Reingold, Oer, and Roth, Aaron. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM on Syposiu on Theory of Coputing (STOC), pp , Elkan, Charles. The foundations of cost-sensitive learning. In IJCAI, Feldan, V., Guruswai, V., Raghavendra, P., and Wu, Y. Agnostic learning of onoials by halfspaces is hard. In th Annual IEEE Syposiu on Foundations of Coputer Science, pp , Oct doi: / FOCS Feldan, Vitaly. Generalization of ERM in stochastic convex optiization: The diension strikes back. In NIPS, pp , Hager, Willia W. Updating the inverse of a atrix. SIAM review, 31(2): , Janson, Svante. Probability asyptotics: notes on notation. arxiv preprint arxiv: , Kologorov, Andreĭ Nikolaevich and Foin. Introductory real analysis. Dover Publication, Inc., Koren, Toer and Levy, Kfir. Fast rates for exp-concave epirical risk iniization. In NIPS, pp , Kull, Meelis and Flach, Peter. Novel decopositions of proper scoring rules for classification: score adjustent as precursor to calibration. In ECML-PKDD, pp Springer, Mandt, Stephan, Hoffan, Matthew D., and Blei, David M. A variational analysis of stochastic gradient algoriths. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volue 48, ICML, pp JMLR.org, URL cf?id= Mann, H. B. and Wald, A. On stochastic liit and order relationships. The Annals of Matheatical Statistics, 14(3): , ISSN URL http: // Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approxiation by averaging. SIAM J. Control Opti., 30 (4): , July ISSN doi: / URL Raginsky, Maxi, Rakhlin, Alexander, Tsao, Matthew, Wu, Yihong, and Xu, Aolin. Inforation-theoretic analysis of stability and bias of learning algoriths. In Inforation Theory Workshop (ITW), 2016 IEEE, pp IEEE, Russo, Daniel and Zou, Jaes. Controlling bias in adaptive data analysis using inforation theory. In Proceedings

10 of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Schwarz, Gideon. Estiating the diension of a odel. The annals of statistics, 6(2): , Shalev-Shwartz, Shai and Ben-David, Shai. Understanding achine learning: Fro theory to algoriths. Cabridge University Press, Shalev-Shwartz, Shai, Shair, Ohad, Srebro, Nathan, and Sridharan, Karthik. Stochastic convex optiization. In COLT, Shalev-Shwartz, Shai, Shair, Ohad, and Troer, Eran. Using ore data to speed-up training tie. In AISTATS, pp , Sridharan, Karthik, Shalev-Shwartz, Shai, and Srebro, Nathan. Fast rates for regularized objectives. In NIPS, pp , Tao, Terence. Topics in rando atrix theory. Aerican Matheatical Society, Tropp, Joel A. User-friendly tail bounds for sus of rando atrices. Foundations of coputational atheatics, 12 (4): , Vapnik, Vladiir N. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5): , 1999.

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher