Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees

Jean Honorio, CSAIL, MIT, Cambridge, MA 02139, USA.  Tommi Jaakkola, CSAIL, MIT, Cambridge, MA 02139, USA.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Abstract

We analyze the expected risk of linear classifiers for a fixed weight vector in the minimax setting. That is, we analyze the worst-case risk among all data distributions with a given mean and covariance. We provide a simpler proof of the tight polynomial-tail bound for general random variables. For sub-Gaussian random variables, we derive a novel tight exponential-tail bound. We also provide new PAC-Bayes finite-sample guarantees when training data is available. Our minimax generalization bounds are dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples.

1 Introduction

Linear classifiers are the cornerstone of several applications in machine learning. The generalization ability of classifiers has long been studied in both statistical and computational learning theory. Several general frameworks have been applied in order to analyze the generalization error of generic classifiers (some do not apply to linear classifiers), such as empirical risk minimization, structural risk minimization, VC dimension, covering numbers and Rademacher complexity. Additionally, several notions of stability that guarantee generalization were introduced in [5, 18, 20, 21]. We refer the interested reader to the survey article [4] for additional information.

For linear classifiers, sharp margin bounds are provided in [2] using Rademacher complexity, and in [15, 17] using the PAC-Bayes theorem. Later, [11] provided sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which led to several generalization bounds, such as margin bounds, PAC-Bayes bounds and risk bounds for vectors with bounded norm. More recently, [8] generalized the KL divergence in the PAC-Bayes theorem to arbitrary convex functions, and [10] provided dimensionality-dependent PAC-Bayes margin bounds. In a different line of work, [19] proved generalization of sparse linear classifiers (under $\ell_1$-regularization) by using the covering number bounds in [23].

Let $\hat{f}$ be a classifier learnt from available training samples (drawn from some arbitrary data distribution). Let $f^*$ be the optimal classifier in the asymptotic setting (intuitively speaking, when an infinite amount of data is available). Generalization bounds are usually termed as a uniform convergence statement that holds for all classifiers (see [11] for instance). Alternatively, by using a symmetrization argument and by optimality of the empirical minimizer, we can state that with probability at least $1 - \delta$:

$R(\hat{f}) \le R(f^*) + g(m, \delta)$    (1)

where $R(f)$ is the expected risk of the classifier $f$ with respect to the data distribution (and with respect to the posterior for PAC-Bayes), and $g(m, \delta)$ is a function that decreases with respect to $m$ and $\delta$. Very often, we have that $g(m, \delta) \in O(1/\sqrt{m})$ and $g(m, \delta) \in O(\sqrt{\log(1/\delta)})$, but other rates are also possible [4].

Generalization bounds are stated as in eq.(1) with respect to the unknown quantity $R(f^*)$. In this paper, we are interested in developing a tight bound for the expected risk $R(f)$ of a linear classifier $f$. This allows us to find a closed-form expression for the bound of $R(f^*)$ and also to study the behavior of $R(\hat{f})$ when training data is available. We study the expected risk of linear classifiers under two scenarios: general random variables and sub-Gaussian variates.
Many features used in real-world classification problems follow sub-Gaussianity assumptions. For instance, in the computer vision literature, histogram features are used for object classification as in [6, 22]. The class of sub-Gaussian variates includes, for instance, Gaussian variables, any bounded random variable (e.g. Bernoulli, multinomial, uniform), any random variable with strictly log-concave density, and any finite mixture of sub-Gaussian variables. Stronger assumptions have been previously used for the analysis of linear classifiers. In the generalization bound analysis of [11], the authors assumed boundedness, while in the active learning analysis of [1] the authors assumed a log-concave distribution.

In this paper, we provide tight bounds for the expected risk of linear classifiers. Interestingly, the Fisher linear discriminant objective function appears in the different scenarios under our analysis, although our results apply to any linear classifier. In our tightness proof, we construct a family of data distributions where the expected risk bound holds with equality. Our constructions do not rely on conditional distributions of $x$ given the class $y$ (i.e. $P(x \mid y = -1)$ and $P(x \mid y = +1)$) that are Gaussian or that have equal covariances. This implies that there is a whole family of non-trivial distributions in which the Fisher discriminant is asymptotically as good as any other linear classifier. When training data is available, we provide novel minimax generalization bounds that are dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples. From a technical point of view, we do not require boundedness of the input data as in [11], but note that we analyze a new problem. As part of our analysis, we derive novel PAC-Bayes bounds for not-everywhere-bounded functions. Additionally, while PAC-Bayes bounds usually depend on the Kullback-Leibler divergence (which is natural for our analysis of sub-Gaussian variates), we provide bounds that depend on the chi-squared divergence (which we believe is natural for our analysis of general random variables).

2 Preliminaries

Consider a binary classification problem with $n$ features. That is, each sample is a pair $(x, y)$ containing a class label $y \in \{-1, +1\}$ and a vector of features $x \in \mathbb{R}^n$. Let $Z$ be the probability distribution of $(x, y)$; that is, $(x, y) \sim Z$. A linear classifier $f$ with weight vector $w \in \mathbb{R}^{n+1}$ has the following decision function:

$f(x \mid w) = \mathrm{sgn}(w^\top [x; 1])$    (2)

where $[x; 1]$ denotes the vector $x$ augmented with a constant entry equal to 1. The linear classifier $f$ makes a mistake whenever the sign of $y$ does not match the sign of $f(x \mid w)$, or equivalently, whenever:

$y\, w^\top [x; 1] \le 0$    (3)

In this paper, we analyze the probability of the above expression with respect to the data distribution. That is, we analyze the expected risk of a linear classifier. Without loss of generality, let $z = y\,[x; 1]$. Clearly, eq.(3) holds if and only if $w^\top z \le 0$. We assume that $z$ has mean $\mu$ and positive definite covariance matrix $\Sigma$. That is:

$\mu \equiv \mathbb{E}[z]$  and  $\Sigma \equiv \mathbb{E}[(z - \mu)(z - \mu)^\top]$    (4)

With some abuse of notation, we write $z \sim Z(\mu, \Sigma)$ in order to denote that the distribution $Z$ of $z$ has mean $\mu$ and covariance $\Sigma$. Additionally, we define $\Omega(\mu, \Sigma)$ to be the family of all distributions with mean $\mu$ and covariance $\Sigma$. Thus, $Z \in \Omega(\mu, \Sigma)$ is an equivalent way to denote that the distribution $Z$ has mean $\mu$ and covariance $\Sigma$. Next, we introduce the following Fisher function of the weight vector $w$, given the mean $\mu$ and covariance $\Sigma$ of the data distribution:

$F(w \mid \mu, \Sigma) \equiv \dfrac{(w^\top \mu)^2}{w^\top \Sigma w}$    (5)

The above expression is indeed the objective function of the Fisher linear discriminant, and it appears in the different scenarios under our analysis (cf. Theorems 1 and 4).
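To make the notation concrete, here is a small Python sketch (not from the paper; the vectors, covariance and weights are arbitrary illustrative values) of the decision function in eq.(2) and the Fisher function in eq.(5).

```python
import numpy as np

def predict(w, x):
    """Linear decision function f(x | w) = sgn(w' [x; 1]), eq. (2)."""
    return np.sign(w @ np.append(x, 1.0))

def fisher(w, mu, Sigma):
    """Fisher function F(w | mu, Sigma) = (w' mu)^2 / (w' Sigma w), eq. (5)."""
    return float(w @ mu) ** 2 / float(w @ Sigma @ w)

# Toy example: z = y [x; 1] with some assumed mean and covariance.
mu = np.array([1.0, 0.5, 0.8])          # mean of z (illustrative values)
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.5, 0.1],
                  [0.0, 0.1, 0.5]])     # positive definite covariance of z
w = np.array([1.0, 0.3, 0.2])           # a weight vector with w' mu > 0

print("F(w | mu, Sigma) =", fisher(w, mu, Sigma))
print("prediction for x = [0.4, -0.2]:", predict(w, np.array([0.4, -0.2])))
```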
3 Bounds for the Expected Risk

The following minimax result was found by [16] and rediscovered by [3]. The proofs in both papers are more complex than the ones we provide here, and their main result is stated as follows. Let $\Omega(\mu, \Sigma)$ be the family of all distributions with mean $\mu$ and covariance $\Sigma$. For a fixed weight vector $w$ such that $w^\top \mu > 0$, we have:

$\sup_{Z \in \Omega(\mu,\Sigma)} P_{z \sim Z}[w^\top z \le 0] = \dfrac{1}{1 + F(w \mid \mu, \Sigma)}$    (6)

(See Appendix A for the relationship between this specific expression and the results in [3, 16].) The minimax expression in eq.(6) motivated the development of minimax probability machines [13, 14], which only need access to the means and covariances for training. The learning algorithm in [13, 14] uses this bound for each class separately. That is, [13, 14] use the bound in eq.(6) with $x$ separately (for class $y = -1$ and for class $y = +1$) instead of with $z$, as in our setting. Note that [12, 14] also propose a learning algorithm called the single-class minimax probability machine, mainly motivated by the quantile estimation problem. For our consistency analysis, the use of a single bound with $z$ allows us to provide closed-form expressions for the upper bound of the expected risk.

Note that the minimax bound in eq.(6) provides an upper bound for every distribution $Z$, and it shows that the bound is tight. In Theorems 1 and 2, we reproduce the same polynomial-tail bounds by following a simpler approach compared to the semidefinite optimization of [3] and the multivariate constructions of [16]. Our bound and tightness analysis relies on the analysis of a related univariate problem. In Theorems 4 and 5, we use a similar approach as we followed for general random variables, and derive novel exponential-tail bounds for sub-Gaussian variates.

3.1 General Random Variables

First, we provide a bound on the expected risk of a fixed linear classifier for general random variables.

Theorem 1. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. The expected risk of a linear classifier with weight vector $w$ such that $w^\top \mu > 0$ is upper-bounded as follows:

$P_{z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le \dfrac{1}{1 + F(w \mid \mu, \Sigma)}$    (7)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}[s] = w^\top \mu > 0$ and $\sigma_s^2 \equiv \mathbb{E}[(s - \mu_s)^2] = w^\top \Sigma w$. By the one-sided Chebyshev inequality for $\varepsilon = \mu_s > 0$, we have:

$P[w^\top z \le 0] = P[s \le 0] = P[\mu_s - s \ge \varepsilon] \le \dfrac{1}{1 + (\varepsilon/\sigma_s)^2} = \dfrac{1}{1 + (\mu_s/\sigma_s)^2}$    (8)

By replacing $\mu_s$ and $\sigma_s$, we prove our claim.

Next, we show that the above bound is tight. That is, there is a data distribution and a weight vector for which equality holds.

Theorem 2. The upper bound provided in Theorem 1 is tight. That is, given an arbitrary mean $\mu$ and covariance $\Sigma$, there is a distribution $Z$ with mean $\mu$ and covariance $\Sigma$, and a weight vector $w$, for which the bound holds with equality. More formally:

$(\exists Z \in \Omega(\mu, \Sigma))(\exists w \in \mathbb{R}^{n+1})\quad P_{z \sim Z}[w^\top z \le 0] = \dfrac{1}{1 + F(w \mid \mu, \Sigma)}$    (9)

where $\Omega(\mu, \Sigma)$ is the family of all distributions with mean $\mu$ and covariance $\Sigma$.

Proof. We show that, given some arbitrary mean $\mu$ and covariance $\Sigma$, we can construct a distribution $Z(\mu, \Sigma)$ and a weight vector $w$ such that the provided bound holds with equality. We prove this by providing a specific univariate three-points distribution, which we later use for constructing a multivariate three-planes distribution.

First, we focus on the bound for a single scalar variable $s$ in eq.(8). Our goal is to construct a three-points distribution where $P[\mu_s - s \ge \varepsilon] = \frac{1}{1 + (\varepsilon/\sigma_s)^2}$. Thus, we want to show that the one-sided Chebyshev inequality is tight. In fact, equality holds for the following distribution. Let $q(\varepsilon) = \frac{1}{4}\varepsilon(\varepsilon + \sqrt{\varepsilon^2 + 8})$ for some constant $\varepsilon \in (0, 1/3)$ and $\beta > 0$, and let $s$ be distributed as follows:

$s = \begin{cases} \beta - 1, & \text{with probability } q(\varepsilon) \\ \beta, & \text{with probability } 1 - 2q(\varepsilon) \\ \beta + 1, & \text{with probability } q(\varepsilon) \end{cases}$    (10)

Note that $\mu_s \equiv \mathbb{E}[s] = \beta > 0$ and $\sigma_s^2 \equiv \mathbb{E}[(s - \mu_s)^2] = 2q(\varepsilon)$.

Second, we construct a canonical three-planes distribution. Consider a distribution $V$ of random vectors $v$, where $v_1 = (s - \beta)/\sqrt{2q(\varepsilon)}$ and $(v_2, \dots, v_{n+1})$ follows a distribution with mean $0$ and covariance $I$. Since $v_1$ has zero mean and unit variance, and it is independent of $(v_2, \dots, v_{n+1})$, we have $\mathbb{E}[v] = 0$ and $\mathbb{E}[v v^\top] = I$. Let $\alpha = (\sqrt{2q(\varepsilon)}, 0, \dots, 0)$; we have $\alpha^\top v + \beta = s$ and therefore $\alpha^\top v + \beta$ is distributed as in eq.(10).

Finally, we construct a general three-planes distribution. For a given mean $\mu$ and covariance $\Sigma$, we construct a random variable $z$ with the required mean and covariance from $v$ as follows. Define the random variable $z = \Sigma^{1/2} v + \mu$. It is easy to verify that $\mathbb{E}[v] = 0$ and $\mathbb{E}[v v^\top] = I$ if and only if $\mathbb{E}[z] = \mu$ and $\mathbb{E}[(z - \mu)(z - \mu)^\top] = \Sigma$.
Note that $\alpha^\top v + \beta = w^\top z$ for $w = \Sigma^{-1/2} \alpha$ and $\beta = w^\top \mu = \alpha^\top \Sigma^{-1/2} \mu > 0$. For $\varepsilon = \mu_s > 0$, we have:

$P[w^\top z \le 0] = P[\alpha^\top v + \beta \le 0] = P[s \le 0] = P[\mu_s - s \ge \varepsilon] = \dfrac{1}{1 + (\varepsilon/\sigma_s)^2}$

and we prove our claim.
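As a quick numerical illustration of Theorem 1 (not part of the paper), the following sketch samples a Gaussian member of $\Omega(\mu, \Sigma)$ and checks that the empirical risk of a fixed linear classifier stays below the minimax bound $1/(1 + F(w \mid \mu, \Sigma))$; the dimension, moments and weight vector are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                        # dimension of z (illustrative)
mu = rng.normal(size=n) + 0.5
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)                  # positive definite covariance

w = np.linalg.solve(Sigma, mu)               # any w with w' mu > 0 works; this choice maximizes F
assert w @ mu > 0

F = (w @ mu) ** 2 / (w @ Sigma @ w)          # Fisher function, eq. (5)
bound = 1.0 / (1.0 + F)                      # minimax bound, eq. (6) / Theorem 1

z = rng.multivariate_normal(mu, Sigma, size=200_000)   # one member of Omega(mu, Sigma)
empirical_risk = np.mean(z @ w <= 0)
print(f"empirical risk = {empirical_risk:.4f}  <=  bound = {bound:.4f}")
```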

3.2 Sub-Gaussian Random Variables

In this paper, we make use of the following definition of sub-Gaussianity of vectors by [7, 9].

Definition 3. A random vector $z \sim Z$ is sub-Gaussian if for all $w \in \mathbb{R}^{n+1}$, the random variable $s = w^\top z$ with mean $\mu_s = \mathbb{E}_{z \sim Z}[w^\top z]$ and variance $\sigma_s^2 = \mathbb{E}_{z \sim Z}[(w^\top z - \mu_s)^2]$ is sub-Gaussian with parameter $\sigma_s$. That is:

$(\forall w \in \mathbb{R}^{n+1})\quad \mathbb{E}_{z \sim Z}[e^{w^\top z - \mu_s}] \le e^{\frac{1}{2}\sigma_s^2}$    (11)

First, we provide a bound on the expected risk of a fixed linear classifier for sub-Gaussian random variables.

Theorem 4. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. Assume $z$ is a sub-Gaussian vector. The expected risk of a linear classifier with weight vector $w$ such that $w^\top \mu > 0$ is upper-bounded as follows:

$P_{z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le e^{-\frac{1}{2} F(w \mid \mu, \Sigma)}$    (12)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}[s] = w^\top \mu > 0$ and $\sigma_s^2 \equiv \mathbb{E}[(s - \mu_s)^2] = w^\top \Sigma w$. Furthermore, by Definition 3, the random variable $s$ is sub-Gaussian with parameter $\sigma_s$. By the one-sided Chernoff bound for sub-Gaussian variables and $\varepsilon = \mu_s > 0$, we have:

$P[w^\top z \le 0] = P[s \le 0] = P[\mu_s - s \ge \varepsilon] \le e^{-\frac{1}{2}(\varepsilon/\sigma_s)^2} = e^{-\frac{1}{2}(\mu_s/\sigma_s)^2}$    (13)

By replacing $\mu_s$ and $\sigma_s$, we prove our claim.

Next, we show that the above bound is tight. That is, there is a data distribution and a weight vector for which equality holds.

Theorem 5. The upper bound provided in Theorem 4 is tight. That is, given an arbitrary mean $\mu$ and covariance $\Sigma$, there is a distribution $Z$ with mean $\mu$ and covariance $\Sigma$, and a weight vector $w$, for which the bound holds with equality. More formally:

$(\exists Z \in \Omega(\mu, \Sigma))(\exists w \in \mathbb{R}^{n+1})\quad P_{z \sim Z}[w^\top z \le 0] = e^{-\frac{1}{2} F(w \mid \mu, \Sigma)}$    (14)

where $\Omega(\mu, \Sigma)$ is the family of all distributions with mean $\mu$ and covariance $\Sigma$.

Proof. A minor change to the three-points argument in Theorem 2 is needed. That is, we focus on the bound for a single scalar variable $s$ in eq.(13). Let $W(a)$ be the Lambert function; that is, $W(a)$ is the solution $t$ of the equation $a = t e^t$. Our goal is to construct a three-points distribution where $P[\mu_s - s \ge \varepsilon] = e^{-\frac{1}{2}(\varepsilon/\sigma_s)^2}$. Thus, we want to show that the one-sided Chernoff bound is tight. In fact, equality holds for the following distribution. Let $q(\varepsilon) = e^{W(-\varepsilon^2/4)}$ for some constant $\varepsilon \in (0, 2/\sqrt{e})$, and let $s$ be distributed as in eq.(10). Recall that bounded variables, such as the specifically constructed $s$, are sub-Gaussian. The rest of the proof follows as in Theorem 2, with the additional assumption that $v$ is a sub-Gaussian vector.
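For intuition about the two bounds (a sketch, not from the paper), the snippet below compares the polynomial-tail bound of Theorem 1 with the exponential-tail bound of Theorem 4 against the exact risk of a Gaussian $z$, which is sub-Gaussian and satisfies $P[w^\top z \le 0] = \Phi(-\sqrt{F})$; the mean, covariance and weight vector are arbitrary.

```python
import math
import numpy as np

def fisher(w, mu, Sigma):
    # F(w | mu, Sigma) = (w' mu)^2 / (w' Sigma w), eq. (5)
    return float(w @ mu) ** 2 / float(w @ Sigma @ w)

mu = np.array([0.8, 0.4, 1.0])               # illustrative mean of z
Sigma = np.diag([1.0, 2.0, 0.5])             # illustrative covariance of z
w = np.array([1.0, 0.5, 0.5])                # any w with w' mu > 0

F = fisher(w, mu, Sigma)
exact = 0.5 * math.erfc(math.sqrt(F / 2.0))  # P[s <= 0] = Phi(-sqrt(F)) for Gaussian z
poly_bound = 1.0 / (1.0 + F)                 # Theorem 1 (any distribution in Omega(mu, Sigma))
exp_bound = math.exp(-F / 2.0)               # Theorem 4 (sub-Gaussian z)
print(f"F = {F:.3f}  exact = {exact:.4f}  exp-tail = {exp_bound:.4f}  poly-tail = {poly_bound:.4f}")
```

Neither bound dominates the other for all values of $F$: the exponential-tail bound is tighter once $F$ is large enough, while the polynomial-tail bound can be smaller for small $F$.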
4 PAC-Bayes Finite-Sample Bounds

In this section, we provide finite-sample guarantees for our previously derived asymptotic bounds. Our minimax generalization bounds are dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples, for both general and sub-Gaussian random variables.

First, we provide a brief introduction to the PAC-Bayes framework in the context of linear classifiers. Let $H$ be a set of linear classifiers, that is, a set of weight vectors $w$. After observing a training set, the task of the learner is to choose a posterior distribution $Q$ of weight vectors with support $H$, such that the Bayes classifier has the smallest possible risk. The Bayes linear classifier $f$ has the following decision function:

$f(x \mid Q) = \mathrm{sgn}\big(\mathbb{E}_{w \sim Q}[w^\top [x; 1]]\big)$    (15)

The output of the (deterministic) Bayes classifier is closely related to the output of the (stochastic) Gibbs classifier. The Gibbs linear classifier randomly chooses a (deterministic) classifier $w$ according to $Q$ in order to classify $x$. The expected risk of the Gibbs linear classifier is thus given by the probability of eq.(3), with respect to the data distribution and the posterior distribution $Q$. PAC-Bayes guarantees are given with respect to a prior distribution $P$ of weight vectors, also of support $H$, and mostly focus on the analysis of the Gibbs classifier.

As noted in [8], the expected risk of the Bayes classifier is at most twice the expected risk of the Gibbs classifier. Thus, any upper bound on the latter provides an upper bound on the former. The factor of 2 can sometimes be reduced to $1 + \epsilon$, as shown in [15]. In our analysis, we chose a specific $n$-dimensional ellipsoid as the support:

$H = \{w \mid w^\top (\Sigma + \mu\mu^\top) w = 1\}$    (16)

The latter is for convenience. Since the Fisher function uses only linear and quadratic terms, any distribution with support $\mathbb{R}^{n+1}$ can be reparametrized with respect to $H$ and produce the same results. For instance, the Fisher function is scale-independent with respect to $w$. Intuitively speaking, the reparametrization can be performed by integrating the mass of the distribution of support $\mathbb{R}^{n+1}$ over every direction independently. Indeed, the distributions $P$ and $Q$ could also be described with respect to angles. Fortunately, we do not need such a reparametrization for our analysis.
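The following sketch (not from the paper; the posterior $Q$ and its support points are hypothetical) contrasts the Bayes decision rule of eq.(15) with the stochastic Gibbs classifier it is related to, for a discrete $Q$ over three weight vectors.

```python
import numpy as np

rng = np.random.default_rng(4)

# A discrete posterior Q over three weight vectors (hypothetical values).
W = np.array([[ 1.0, 0.2,  0.1],
              [ 0.8, 0.4, -0.1],
              [ 1.2, 0.0,  0.2]])
q_probs = np.array([0.5, 0.3, 0.2])

def bayes_predict(x):
    """Bayes classifier, eq. (15): sign of the Q-average of w' [x; 1]."""
    xa = np.append(x, 1.0)
    return np.sign(q_probs @ (W @ xa))

def gibbs_predict(x):
    """Gibbs classifier: draw one w ~ Q, then predict with it."""
    xa = np.append(x, 1.0)
    w = W[rng.choice(len(W), p=q_probs)]
    return np.sign(w @ xa)

x = np.array([0.3, -0.5])
print("Bayes prediction:", bayes_predict(x))
print("Gibbs predictions over 5 draws:", [gibbs_predict(x) for _ in range(5)])
```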

Next, we introduce the following Gibbs-Fisher function of weight vectors $w$ sampled from a distribution $Q$, given the mean $\mu$ and covariance $\Sigma$ of the data distribution:

$F(Q \mid \mu, \Sigma) \equiv \dfrac{(\mathbb{E}_{w \sim Q}[w^\top \mu])^2}{\mathbb{E}_{w \sim Q}[w^\top (\Sigma + \mu\mu^\top) w] - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2}$    (17)

The above expression appears in the bound on the expected risk of the Gibbs linear classifier, in the different scenarios under our analysis (cf. Theorems 6 and 11). Our exponential-tail bounds for sub-Gaussian variates depend on the Kullback-Leibler divergence, while our polynomial-tail bounds for general random variables depend on the chi-squared divergence. Given two distributions $P$ and $Q$ with densities $p$ and $q$, the Kullback-Leibler and chi-squared divergences are defined as $\mathrm{KL}(Q\|P) = \mathbb{E}_{w \sim Q}[\log \frac{q(w)}{p(w)}]$ and $\chi^2(Q\|P) = \mathbb{E}_{w \sim Q}[\frac{q(w)}{p(w)}] - 1$, respectively.

Our proof strategy is to first show concentration of the projected mean and variance, and then use those results in order to show concentration of the Gibbs-Fisher function.

4.1 General Random Variables

First, we provide a bound on the expected risk of the Gibbs linear classifier for general random variables.

Theorem 6. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. Let $Q$ be the probability distribution of the random weight vector $w$. The expected risk of a linear classifier drawn from $Q$ such that $\mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ is upper-bounded as follows:

$P_{w \sim Q,\, z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le \dfrac{1}{1 + F(Q \mid \mu, \Sigma)}$    (18)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}_{w \sim Q, z \sim Z}[s] = \mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ and $\sigma_s^2 \equiv \mathbb{E}_{w \sim Q, z \sim Z}[(s - \mu_s)^2] = \mathbb{E}_{w \sim Q}[w^\top (\Sigma + \mu\mu^\top) w] - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2$. The rest of the proof follows as in Theorem 1.

Next, we show PAC-Bayes concentration of the projected mean for general random variables.

Lemma 7. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ be the empirical mean computed from those samples. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \sqrt{\dfrac{\chi^2(Q\|P) + 1}{\delta m}}$    (19)

Proof. Note that the random variable $\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2]$    (20)

Define the random variable $s = w^\top (z - \mu)$. It is easy to verify that $\mathbb{E}_{z \sim Z}[s] = 0$ and $\mathbb{E}_{z \sim Z}[s^2] = w^\top \Sigma w \le w^\top (\Sigma + \mu\mu^\top) w = 1$ in the support $H$ as in eq.(16). Note that $w^\top (\hat{\mu} - \mu) = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(20) as follows:

$\mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[\big(\tfrac{1}{m}\textstyle\sum_{i} s^{(i)}\big)^2\big] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[\tfrac{1}{m^2}\big(\textstyle\sum_{i} (s^{(i)})^2 + \sum_{i \ne j} s^{(i)} s^{(j)}\big)\big] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z \sim Z}[s^2]/m \le 1/m$

By putting everything together, we have:

$\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2] \le \dfrac{1}{\delta m}$    (21)

By Jensen's and Cauchy-Schwarz inequalities, we have:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \mathbb{E}_{w \sim Q}[|w^\top (\mu - \hat{\mu})|] = \mathbb{E}_{w \sim Q}\Big[\sqrt{\tfrac{p(w)}{q(w)}}\, |w^\top (\mu - \hat{\mu})|\, \sqrt{\tfrac{q(w)}{p(w)}}\Big] \le \sqrt{\mathbb{E}_{w \sim Q}\big[\tfrac{p(w)}{q(w)} (w^\top (\mu - \hat{\mu}))^2\big]}\, \sqrt{\mathbb{E}_{w \sim Q}\big[\tfrac{q(w)}{p(w)}\big]} = \sqrt{\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2]\, (\chi^2(Q\|P) + 1)}$

By using our bound in eq.(21), we prove our claim.
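As a sanity check of the Markov-inequality step (not from the paper), the sketch below instantiates Lemma 7 with a point-mass prior and posterior $P = Q = \delta_{w_0}$, so that $\chi^2(Q\|P) = 0$: over repeated draws of $m$ samples, the event $|w_0^\top(\mu - \hat{\mu})| > \sqrt{1/(\delta m)}$ should occur with frequency at most $\delta$. The Gaussian data distribution and the choice of $w_0 \in H$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, delta, trials = 4, 200, 0.1, 2000

mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
S = Sigma + np.outer(mu, mu)

w0 = rng.normal(size=n)
w0 /= np.sqrt(w0 @ S @ w0)          # place w0 on the support H: w0' (Sigma + mu mu') w0 = 1

bound = np.sqrt(1.0 / (delta * m))  # eq. (21) with chi^2(Q||P) = 0 (point-mass P = Q)
exceed = 0
for _ in range(trials):
    z = rng.multivariate_normal(mu, Sigma, size=m)
    mu_hat = z.mean(axis=0)
    if abs(w0 @ (mu - mu_hat)) > bound:
        exceed += 1
print(f"empirical exceedance = {exceed / trials:.4f}  (should be <= delta = {delta})")
```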

In what follows, we show PAC-Bayes concentration of the projected variance for general random variables.

Lemma 8. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ has a bounded fourth order moment, that is, $\mathbb{E}[(z^\top (\Sigma + \mu\mu^\top)^{-1} z)^2] \le K$. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1] \le \sqrt{\dfrac{K\,(\chi^2(Q\|P) + 1)}{\delta m}}$    (22)

Proof. Note that the random variable $\mathbb{E}_{w \sim P}[(w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)^2]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[(w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)^2] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[(w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)^2]$    (23)

Define the random variable $s = (w^\top z)^2$. Note that in the support $H$ as in eq.(16), we have $\mathbb{E}_{z \sim Z}[s] = w^\top (\Sigma + \mu\mu^\top) w = 1$, and thus $\mathbb{E}_{z \sim Z}[s - 1] = 0$. Additionally, let $S \equiv \Sigma + \mu\mu^\top$; by our assumption of a bounded fourth order moment, we have:

$\mathbb{E}_{z \sim Z}[s^2] = \mathbb{E}_{z \sim Z}[(w^\top z)^4] = \mathbb{E}_{z \sim Z}[((S^{1/2} w)^\top (S^{-1/2} z))^4] \le \|S^{1/2} w\|_2^4\, \mathbb{E}_{z \sim Z}[\|S^{-1/2} z\|_2^4] = (w^\top S w)^2\, \mathbb{E}_{z \sim Z}[(z^\top S^{-1} z)^2] \le K$

Note that $w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(23). The rest of the proof follows similarly as in Lemma 7.

Finally, we show PAC-Bayes concentration of the Gibbs-Fisher function for general random variables. Our minimax generalization bound is dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples.

Theorem 9. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ has a bounded fourth order moment, that is, $\mathbb{E}[(z^\top (\Sigma + \mu\mu^\top)^{-1} z)^2] \le K$. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \dfrac{1}{1 + F(Q \mid \mu, \Sigma)} - \dfrac{1}{1 + F(Q \mid \hat{\mu}, \hat{\Sigma})} \le \sqrt{\dfrac{18 \max(1, K)\,(\chi^2(Q\|P) + 1)}{\delta m}} + O\!\left(\dfrac{1}{m}\right)$    (24)

Proof. In order to obtain concentration of the projected mean and variance simultaneously, we apply the union bound to Lemmas 7 and 8. That is, with probability at least $1 - \delta$, letting $\varepsilon = \sqrt{\frac{2 \max(1, K)(\chi^2(Q\|P) + 1)}{\delta m}}$, we have:

$(\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]| \le \varepsilon, \qquad (\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1]| \le \varepsilon$    (25)

With some algebra, we can prove that for any $Q$, $\mu$, $\Sigma$ and $S = \Sigma + \mu\mu^\top$, we have:

$\dfrac{1}{1 + F(Q \mid \mu, \Sigma)} = 1 - \dfrac{(\mathbb{E}_{w \sim Q}[w^\top \mu])^2}{\mathbb{E}_{w \sim Q}[w^\top S w]}$    (26)

Let $\hat{S} \equiv \hat{\Sigma} + \hat{\mu}\hat{\mu}^\top$, $\alpha \equiv \mathbb{E}_{w \sim Q}[w^\top \mu]$, $\hat{\alpha} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{\mu}]$ and $\hat{\beta} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{S} w]$. The concentration results in eq.(25) are equivalent to $|\alpha - \hat{\alpha}| \le \varepsilon$ and $|\hat{\beta} - 1| \le \varepsilon$. Additionally, by eq.(26), and since $\mathbb{E}_{w \sim Q}[w^\top S w] = 1$ for any $Q$ of support $H$, the left-hand side of eq.(24) is equal to $\hat{\alpha}^2/\hat{\beta} - \alpha^2$.

For any $Q$, $\mu$, $\Sigma$ and $S = \Sigma + \mu\mu^\top$, by positive semidefiniteness of $\Sigma$, we have:

$(\forall w)\ 0 \le w^\top \Sigma w \ \Rightarrow\ 0 \le \mathbb{E}_{w \sim Q}[w^\top \Sigma w] \ \Rightarrow\ 0 \le \mathbb{E}_{w \sim Q}[w^\top (S - \mu\mu^\top) w] \ \Rightarrow\ \mathbb{E}_{w \sim Q}[(w^\top \mu)^2] \le \mathbb{E}_{w \sim Q}[w^\top S w] \ \Rightarrow\ \mathbb{E}_{w \sim Q}[(w^\top \mu)^2]/\mathbb{E}_{w \sim Q}[w^\top S w] \le 1$

Note that $0 \le (\mathbb{E}_{w \sim Q}[w^\top \mu])^2 \le \mathbb{E}_{w \sim Q}[(w^\top \mu)^2]$, and therefore $\hat{\alpha}^2/\hat{\beta} \in [0, 1]$ and $\alpha^2 \in [0, 1]$. Since $|\alpha - \hat{\alpha}| \le \varepsilon$ and $|\alpha| \le 1$, we have:

$\alpha^2 - \hat{\alpha}^2 = 2\alpha(\alpha - \hat{\alpha}) - (\alpha - \hat{\alpha})^2 \le 2\varepsilon + \varepsilon^2$

Similarly:

$\hat{\alpha}^2 - \alpha^2 = 2\hat{\alpha}(\hat{\alpha} - \alpha) - (\hat{\alpha} - \alpha)^2 \le 2(|\alpha| + \varepsilon)\varepsilon + \varepsilon^2 \le 2\varepsilon + 3\varepsilon^2$

Therefore, $|\alpha^2 - \hat{\alpha}^2| \le 2\varepsilon + 3\varepsilon^2$. Finally, since $|\alpha - \hat{\alpha}| \le \varepsilon$, $|\hat{\beta} - 1| \le \varepsilon$ and $\hat{\alpha}^2/\hat{\beta} \le 1$, we have:

$\hat{\alpha}^2/\hat{\beta} - \alpha^2 \le |\hat{\alpha}^2/\hat{\beta} - \hat{\alpha}^2| + |\hat{\alpha}^2 - \alpha^2| = (\hat{\alpha}^2/\hat{\beta})\,|1 - \hat{\beta}| + |\hat{\alpha}^2 - \alpha^2| \le \varepsilon + 2\varepsilon + 3\varepsilon^2 = 3\varepsilon + O(1/m)$

and we prove our claim.
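To see the concentration described by Theorem 9 numerically (an illustration, not from the paper), the sketch below evaluates the Gibbs-Fisher function of eq.(17) for a fixed discrete posterior $Q$ on $H$, using both the population moments and empirical moments from growing sample sizes; the data distribution and the support points of $Q$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
S = Sigma + np.outer(mu, mu)

# Discrete posterior Q: a few weight vectors placed on H = {w : w' S w = 1}, with w' mu > 0.
W = rng.normal(size=(3, n))
W = np.array([w / np.sqrt(w @ S @ w) * np.sign(w @ mu) for w in W])
q_probs = np.array([0.5, 0.3, 0.2])

def gibbs_fisher(W, q_probs, mu, Sigma):
    """Gibbs-Fisher function F(Q | mu, Sigma), eq. (17), for a discrete Q."""
    S = Sigma + np.outer(mu, mu)
    a = q_probs @ (W @ mu)                              # E_Q[w' mu]
    b = q_probs @ np.einsum('ij,jk,ik->i', W, S, W)     # E_Q[w' S w]
    return a ** 2 / (b - a ** 2)

for m in [100, 1000, 10000]:
    z = rng.multivariate_normal(mu, Sigma, size=m)
    mu_hat, Sigma_hat = z.mean(axis=0), np.cov(z, rowvar=False, bias=True)
    F_true = gibbs_fisher(W, q_probs, mu, Sigma)
    F_emp = gibbs_fisher(W, q_probs, mu_hat, Sigma_hat)
    gap = 1 / (1 + F_true) - 1 / (1 + F_emp)
    print(f"m={m:6d}  F(Q|mu,Sigma)={F_true:.3f}  F(Q|mu_hat,Sigma_hat)={F_emp:.3f}  gap={gap:+.4f}")
```

As $m$ grows, the gap between the two plug-in risk bounds shrinks, consistent with the $O(1/\sqrt{m})$ rate of the theorem.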

4.2 Sub-Gaussian Random Variables

In this paper, we introduce the following definition of sub-Gaussianity of vectors.

Definition 10. Let $H$ be a bounded set. A random vector $z \sim Z$ is $H$-sub-Gaussian if for all distributions $Q$ of support $H$ and $w \sim Q$, the random variable $s = w^\top z$ with mean $\mu_s = \mathbb{E}_{w \sim Q, z \sim Z}[w^\top z]$ and variance $\sigma_s^2 = \mathbb{E}_{w \sim Q, z \sim Z}[(w^\top z - \mu_s)^2]$ is sub-Gaussian with parameter $\sigma_s$. That is:

$(\forall Q)\quad \mathbb{E}_{w \sim Q, z \sim Z}[e^{w^\top z - \mu_s}] \le e^{\frac{1}{2}\sigma_s^2}$    (27)

First, we provide a bound on the expected risk of the Gibbs linear classifier for sub-Gaussian random variables.

Theorem 11. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. Let $Q$ be the probability distribution of the random weight vector $w$ of support $H$. Assume $z$ is an $H$-sub-Gaussian vector. The expected risk of a linear classifier drawn from $Q$ such that $\mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ is upper-bounded as follows:

$P_{w \sim Q,\, z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le e^{-\frac{1}{2} F(Q \mid \mu, \Sigma)}$    (28)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}_{w \sim Q, z \sim Z}[s] = \mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ and $\sigma_s^2 \equiv \mathbb{E}_{w \sim Q, z \sim Z}[(s - \mu_s)^2] = \mathbb{E}_{w \sim Q}[w^\top (\Sigma + \mu\mu^\top) w] - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2$. Furthermore, by Definition 10, the random variable $s$ is sub-Gaussian with parameter $\sigma_s$. The rest of the proof follows as in Theorem 4.

Next, we show PAC-Bayes concentration of the projected mean for sub-Gaussian random variables.

Lemma 12. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ be the empirical mean computed from those samples. Assume $z$ is a sub-Gaussian vector. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \dfrac{1}{\sqrt{m}}\left(\mathrm{KL}(Q\|P) + \log\dfrac{e^{1/2}}{\delta}\right)$    (29)

In the lemma above, while we concentrate on upper-bounding $\mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]$ for all $Q$, a similar argument also bounds $\mathbb{E}_{w \sim Q}[w^\top (\hat{\mu} - \mu)]$.

Proof. Let $t \in \mathbb{R}$ be a constant. Note that the random variable $\mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}]$    (30)

Define the random variable $s = w^\top (z - \mu)$. It is easy to verify that $\mathbb{E}_{z \sim Z}[s] = 0$. As we argued in Theorem 4, the variable $s$ is sub-Gaussian with parameter $\sigma_s$ where $\sigma_s^2 = w^\top \Sigma w$. Furthermore, $\sigma_s^2 \le w^\top (\Sigma + \mu\mu^\top) w = 1$ in the support $H$ as in eq.(16). Note that $w^\top (\hat{\mu} - \mu) = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(30) as follows:

$\mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[e^{-\frac{t}{m}\sum_{i} s^{(i)}}\big] = \mathbb{E}_{w \sim P} \prod_{i=1}^m \mathbb{E}_{z \sim Z}\big[e^{-\frac{t}{m} s}\big] \le \mathbb{E}_{w \sim P} \prod_{i=1}^m e^{\frac{t^2}{2m^2}} = e^{\frac{t^2}{2m}}$

By taking the logarithm on each side of eq.(30), and since $\mathbb{E}_{w \sim P}[f(w)] = \mathbb{E}_{w \sim Q}[\frac{p(w)}{q(w)} f(w)]$ for every distribution $Q$ of support $H$ and function $f$, we have:

$(\forall Q)\quad \log \mathbb{E}_{w \sim Q}\Big[\tfrac{p(w)}{q(w)}\, e^{t\, w^\top (\mu - \hat{\mu})}\Big] \le \log\Big(\tfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}]\Big)$

By using Jensen's inequality, we can lower-bound the left-hand side of the above expression:

$(\forall Q)\quad \log \mathbb{E}_{w \sim Q}\Big[\tfrac{p(w)}{q(w)}\, e^{t\, w^\top (\mu - \hat{\mu})}\Big] \ge \mathbb{E}_{w \sim Q}\Big[\log\Big(\tfrac{p(w)}{q(w)}\, e^{t\, w^\top (\mu - \hat{\mu})}\Big)\Big] = -\mathrm{KL}(Q\|P) + t\, \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]$

By putting everything together, we have:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \dfrac{1}{t}\left(\mathrm{KL}(Q\|P) + \log\dfrac{e^{t^2/(2m)}}{\delta}\right)$

By setting $t = \sqrt{m}$, we prove our claim.

In what follows, we show PAC-Bayes concentration of the projected variance for sub-Gaussian random variables. In the following lemma, while we concentrate on upper-bounding $\mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1]$ for all $Q$, a similar argument also bounds $\mathbb{E}_{w \sim Q}[1 - w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w]$.
Lemma 13. Let $z^{(1)}, \dots, z^{(m)}$ be $m \ge 16$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ is a sub-Gaussian vector. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1] \le \dfrac{1}{\sqrt{m}}\left(\mathrm{KL}(Q\|P) + \log\dfrac{e^{16}}{\delta}\right)$    (31)

Proof. Let $t \in \mathbb{R}$ be a constant. Note that the random variable $\mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}]$    (32)

Define the random variable $s = (w^\top z)^2$. As we argued in Theorem 4, the random variable $w^\top z$ is sub-Gaussian with parameter $\sigma$ where $\sigma^2 = w^\top \Sigma w$. Furthermore, $\sigma^2 \le w^\top (\Sigma + \mu\mu^\top) w = 1$ in the support $H$ as in eq.(16). Additionally, in the support $H$, we have $\mathbb{E}_{z \sim Z}[s] = w^\top (\Sigma + \mu\mu^\top) w = 1$, and thus $\mathbb{E}_{z \sim Z}[s - 1] = 0$. Furthermore, $s$ is sub-exponential since it is the square of a sub-Gaussian variable. In what follows, we use a bound on the moment generating function of a sub-exponential variable that resembles that of sub-Gaussian variables. (See Appendix B for details.) Note that $w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(32) as follows:

$\mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[e^{\frac{t}{m}\sum_{i} (s^{(i)} - 1)}\big] = \mathbb{E}_{w \sim P} \prod_{i=1}^m \mathbb{E}_{z \sim Z}\big[e^{\frac{t}{m}(s - 1)}\big] \le \mathbb{E}_{w \sim P} \prod_{i=1}^m e^{\frac{16 t^2}{m^2}} = e^{\frac{16 t^2}{m}}$  for $t/m \le 1/4$

The rest of the proof follows similarly as in Lemma 12. Note that since we set $t = \sqrt{m}$, the condition $t/m \le 1/4$ for applying the bound on the moment generating function of the sub-exponential variable $s$ leads to the condition $m \ge 16$.

Finally, we show PAC-Bayes concentration of the Gibbs-Fisher function for sub-Gaussian random variables. Our minimax generalization bound is dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples.

Theorem 14. Let $z^{(1)}, \dots, z^{(m)}$ be $m \ge 16$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ is a sub-Gaussian vector. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad e^{-\frac{1}{2} F(Q \mid \mu, \Sigma)} - e^{-\frac{1}{2} F(Q \mid \hat{\mu}, \hat{\Sigma})} \le \sqrt{\dfrac{36}{m}}\left(\mathrm{KL}(Q\|P) + \log\dfrac{4e^{16}}{\delta}\right) + O\!\left(\dfrac{1}{m}\right)$    (33)

Proof. In order to obtain two-sided concentration of both the projected mean and variance simultaneously, we apply the union bound to the one-sided results in Lemmas 12 and 13. That is, with probability at least $1 - \delta$, letting $\varepsilon = \frac{1}{\sqrt{m}}\big(\mathrm{KL}(Q\|P) + \log\frac{4e^{16}}{\delta}\big)$, we have:

$(\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]| \le \varepsilon, \qquad (\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1]| \le \varepsilon$    (34)

With some algebra, we can prove that for any $Q$, $\mu$, $\Sigma$ and $S = \Sigma + \mu\mu^\top$, we have:

$e^{-\frac{1}{2} F(Q \mid \mu, \Sigma)} = \exp\!\left(-\dfrac{1}{2} \cdot \dfrac{(\mathbb{E}_{w \sim Q}[w^\top \mu])^2 / \mathbb{E}_{w \sim Q}[w^\top S w]}{1 - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2 / \mathbb{E}_{w \sim Q}[w^\top S w]}\right)$    (35)

Let $\hat{S} \equiv \hat{\Sigma} + \hat{\mu}\hat{\mu}^\top$, $\alpha \equiv \mathbb{E}_{w \sim Q}[w^\top \mu]$, $\hat{\alpha} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{\mu}]$ and $\hat{\beta} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{S} w]$. The concentration results in eq.(34) are equivalent to $|\alpha - \hat{\alpha}| \le \varepsilon$ and $|\hat{\beta} - 1| \le \varepsilon$. Note that by eq.(35), and since $\mathbb{E}_{w \sim Q}[w^\top S w] = 1$ for any $Q$ of support $H$, the left-hand side of eq.(33) is equal to $\exp\!\big(-\frac{\alpha^2}{2(1 - \alpha^2)}\big) - \exp\!\big(-\frac{\hat{\alpha}^2/\hat{\beta}}{2(1 - \hat{\alpha}^2/\hat{\beta})}\big)$. Furthermore, by Lipschitz continuity:

$\exp\!\left(-\dfrac{\alpha^2}{2(1 - \alpha^2)}\right) - \exp\!\left(-\dfrac{\hat{\alpha}^2/\hat{\beta}}{2(1 - \hat{\alpha}^2/\hat{\beta})}\right) \le 2\,\big|\hat{\alpha}^2/\hat{\beta} - \alpha^2\big|$

The rest of the proof follows as in Theorem 9 to show that $|\hat{\alpha}^2/\hat{\beta} - \alpha^2| \le 3\varepsilon + O(1/m)$.
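To get a feel for the constants in Theorem 14 (a sketch under assumed values, not from the paper), the snippet below evaluates the penalty term appearing in the theorem for a hypothetical discrete prior and posterior over a small set of weight vectors; the $e^{16}$ factor makes the bound meaningful only for fairly large $m$, although the $O(1/\sqrt{m})$ decay is already visible.

```python
import numpy as np

def kl_discrete(q, p):
    """KL(Q||P) for discrete distributions over the same finite support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([1/3, 1/3, 1/3])        # prior P over three weight vectors on H (assumed)
q = np.array([0.7, 0.2, 0.1])        # posterior Q chosen after seeing data (assumed)
delta = 0.05
kl = kl_discrete(q, p)

for m in [16, 100, 1000, 10000]:
    # log(4 e^16 / delta) = 16 + log(4 / delta)
    eps = np.sqrt(36.0 / m) * (kl + 16 + np.log(4 / delta))
    print(f"m={m:6d}  KL(Q||P)={kl:.3f}  Theorem-14 penalty ~ {eps:.3f}")
```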
5 Concluding Remarks

There are several ways of extending this research. While in this paper we focused on the PAC-Bayes framework, one future goal is to provide guarantees for all weight vectors $w$. In order to obtain bounds that are either independent of, or logarithmically dependent on, the dimension, it might be necessary to produce novel bounds for the Rademacher and/or Gaussian complexity of not-everywhere-bounded functions. Finally, since our bounds on the expected risk are worst-case (among all data distributions with a given mean and covariance), it would be interesting to analyze their applicability to scenarios where each training sample may come from a different distribution, as well as to transfer learning.

References

[1] M. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. COLT, 2013.
[2] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.
[3] D. Bertsimas, I. Popescu, and J. Sethuraman. Moment problems and semidefinite optimization. Handbook of Semidefinite Optimization, 2000.
[4] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Advanced Lectures on Machine Learning, 2004.
[5] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2002.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[7] R. Fukuda. Exponential integrability of sub-Gaussian vectors. Probability Theory and Related Fields, 1990.
[8] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. ICML, 2009.
[9] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Communications in Probability, 2012.
[10] C. Jin and L. Wang. Dimensionality dependent PAC-Bayes margin bound. NIPS, 2012.
[11] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. NIPS, 2008.
[12] G. Lanckriet, L. El Ghaoui, and M. Jordan. Robust novelty detection with single-class MPM. NIPS, 2002.
[13] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. Minimax probability machine. NIPS, 2001.
[14] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. JMLR, 2002.
[15] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. NIPS, 2002.
[16] A. Marshall and I. Olkin. Multivariate Chebyshev inequalities. The Annals of Mathematical Statistics, 1960.
[17] D. McAllester. Simplified PAC-Bayesian margin bounds. COLT, 2003.
[18] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 2006.
[19] A. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. ICML, 2004.
[20] S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 2005.
[21] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. JMLR, 2010.
[22] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for nonparametric object and scene recognition. PAMI, 2008.
[23] T. Zhang. Covering number bounds of certain regularized linear function classes. JMLR, 2002.

A Proof of the Expression in Equation (6)

In this section, we restate the results provided in [3, 16] in order to obtain eq.(6). We follow the well-known Lagrangian duality approach as in Appendix A of [14]. The following result is provided in [3, 16]. Let $\Omega(\mu, \Sigma)$ be the family of all distributions with mean $\mu$ and covariance $\Sigma$. For a fixed weight vector $a$ and constant $b$, we have:

$\sup_{Z \in \Omega(\mu,\Sigma)} P_{z \sim Z}[a^\top z \ge b] = \dfrac{1}{1 + d^2}$,  where  $d^2 = \inf_{a^\top z \ge b}\ (z - \mu)^\top \Sigma^{-1} (z - \mu)$

Let $a = -w$ and $b = 0$. We have:

$\sup_{Z \in \Omega(\mu,\Sigma)} P_{z \sim Z}[w^\top z \le 0] = \dfrac{1}{1 + d^2}$,  where  $d^2 = \inf_{w^\top z \le 0}\ (z - \mu)^\top \Sigma^{-1} (z - \mu)$

Note that if $w^\top \mu \le 0$, then we can just take $z = \mu$ and obtain $d^2 = 0$, which is certainly the optimum because $d^2 \ge 0$ due to positive definiteness of $\Sigma$. In what follows, we assume $w^\top \mu > 0$, as required in eq.(6). We are interested in the value of $d^2$. That is, we seek a closed-form solution of the primal problem:

$\min_{w^\top z \le 0}\ (z - \mu)^\top \Sigma^{-1} (z - \mu)$    (36)

which has the following Lagrangian:

$L(z, \lambda) = (z - \mu)^\top \Sigma^{-1} (z - \mu) + \lambda\, w^\top z$

By optimality arguments (i.e. $\partial L/\partial z = 0$), we have that $L$ is minimized at $z^* = -\frac{\lambda}{2} \Sigma w + \mu$. Therefore, the Lagrange dual function is given by:

$g(\lambda) = \inf_z L(z, \lambda) = L(z^*, \lambda) = -\dfrac{\lambda^2}{4} w^\top \Sigma w + \lambda\, w^\top \mu$

Consequently, the dual problem of eq.(36) is:

$\max_{\lambda \ge 0}\ g(\lambda)$

Again, by optimality arguments (i.e. $\partial g/\partial \lambda = 0$), we have that $g$ is maximized at $\lambda^* = \frac{2 w^\top \mu}{w^\top \Sigma w}$. Note that $\lambda^* \ge 0$ since $w^\top \mu > 0$. Finally:

$d^2 = \max_{\lambda \ge 0} g(\lambda) = g(\lambda^*) = \dfrac{(w^\top \mu)^2}{w^\top \Sigma w} \equiv F(w \mid \mu, \Sigma)$

B Moment Generating Function of the Square of a Sub-Gaussian Variable

Let $s$ be a sub-Gaussian variable with parameter $\sigma_s$ and mean $\mu_s = \mathbb{E}[s]$. By sub-Gaussianity, we know that the moment generating function is bounded as follows:

$(\forall t \in \mathbb{R})\quad \mathbb{E}[e^{t(s - \mu_s)}] \le e^{\frac{1}{2} t^2 \sigma_s^2}$

Our goal is to find a similar bound for the moment generating function of the sub-exponential variable $v = s^2$. Let $\Gamma(r)$ be the Gamma function. The moments of the sub-Gaussian variable $s$ are bounded as follows:

$(\forall r \ge 0)\quad \mathbb{E}[|s|^r] \le r\, 2^{r/2}\, \sigma_s^r\, \Gamma(r/2)$

Let $\mu_v = \mathbb{E}[v]$. By a power series expansion, and since $\Gamma(r) = (r-1)!$ for an integer $r$, we have:

$\mathbb{E}[e^{t(v - \mu_v)}] = 1 + t\,\mathbb{E}[v - \mu_v] + \sum_{r \ge 2} \dfrac{t^r\, \mathbb{E}[(v - \mu_v)^r]}{r!} \le 1 + \sum_{r \ge 2} \dfrac{t^r\, \mathbb{E}[|s|^{2r}]}{r!} \le 1 + \sum_{r \ge 2} \dfrac{t^r\, 2r\, 2^r\, \sigma_s^{2r}\, \Gamma(r)}{r!} = 1 + \sum_{r \ge 2} t^r\, 2^{r+1}\, \sigma_s^{2r} = 1 + \dfrac{8 t^2 \sigma_s^4}{1 - 2 t \sigma_s^2}$

By making $t \le 1/(4\sigma_s^2)$, we have $1/(1 - 2 t \sigma_s^2) \le 2$. Finally, since $(\forall \alpha)\ 1 + \alpha \le e^\alpha$, we have that for a sub-Gaussian variable $s$ with parameter $\sigma_s$:

$(\forall t \le 1/(4\sigma_s^2))\quad \mathbb{E}[e^{t(s^2 - \mathbb{E}[s^2])}] \le e^{16 t^2 \sigma_s^4}$    (37)

Thus, we obtained a bound for the moment generating function of the sub-exponential variable $s^2$, which is similar to that of sub-Gaussian variables but holds only for a small range of $t$.
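The closed form $d^2 = F(w \mid \mu, \Sigma)$ derived in Appendix A can be checked directly from the KKT minimizer $z^* = \mu - \frac{w^\top \mu}{w^\top \Sigma w}\,\Sigma w$; the sketch below (not from the paper, with arbitrary $\mu$, $\Sigma$ and $w$) verifies that the constraint is active at $z^*$ and that the attained objective equals the Fisher function.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
mu = rng.normal(size=n) + 1.0
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
w = rng.normal(size=n)
if w @ mu <= 0:                      # ensure w' mu > 0, as required
    w = -w

# Minimizer of (z - mu)' Sigma^{-1} (z - mu) subject to w' z <= 0 (Appendix A):
# z* = mu - (w' mu / (w' Sigma w)) Sigma w, with the constraint active (w' z* = 0).
lam_half = (w @ mu) / (w @ Sigma @ w)
z_star = mu - lam_half * (Sigma @ w)

d2 = (z_star - mu) @ np.linalg.solve(Sigma, z_star - mu)
F = (w @ mu) ** 2 / (w @ Sigma @ w)
print(f"w' z* = {w @ z_star:.2e} (active constraint),  d^2 = {d2:.6f},  F(w|mu,Sigma) = {F:.6f}")
```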


More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Collision-based Testers are Optimal for Uniformity and Closeness

Collision-based Testers are Optimal for Uniformity and Closeness Electronic Colloquiu on Coputational Coplexity, Report No. 178 (016) Collision-based Testers are Optial for Unifority and Closeness Ilias Diakonikolas Theis Gouleakis John Peebles Eric Price USC MIT MIT

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

arxiv: v1 [stat.ml] 27 Oct 2018

arxiv: v1 [stat.ml] 27 Oct 2018 Stein Variational Gradient Descent as Moent Matching arxiv:80.693v stat.ml] 27 Oct 208 Qiang Liu, Dilin Wang Departent of Coputer Science The University of Texas at Austin Austin, TX 7872 {lqiang, dilin}@cs.utexas.edu

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

On the Impact of Kernel Approximation on Learning Accuracy

On the Impact of Kernel Approximation on Learning Accuracy On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu

More information

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Statistica Sinica 6 016, 1709-178 doi:http://dx.doi.org/10.5705/ss.0014.0034 AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Nilabja Guha 1, Anindya Roy, Yaakov Malinovsky and Gauri

More information

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions Kernel Choice and Classifiability for RKHS Ebeddings of Probability Distributions Bharath K. Sriperubudur Departent of ECE UC San Diego, La Jolla, USA bharathsv@ucsd.edu Kenji Fukuizu The Institute of

More information

A Simple Homotopy Algorithm for Compressive Sensing

A Simple Homotopy Algorithm for Compressive Sensing A Siple Hootopy Algorith for Copressive Sensing Lijun Zhang Tianbao Yang Rong Jin Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China Departent of Coputer

More information

Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality Counication Lower Bounds for Statistical Estiation Probles via a Distributed Data Processing Inequality Mark Braveran 1, Ankit Garg 1, Tengyu Ma 1, Huy L. Nguyen, and David P. Woodruff 3 1 Princeton University

More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Siulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtan, E. Zhou, T. Huschka, and S. E. Chick, eds. THE EMPIRICAL LIKELIHOOD APPROACH TO SIMULATION INPUT UNCERTAINTY

More information

Learning with Deep Cascades

Learning with Deep Cascades Learning with Deep Cascades Giulia DeSalvo 1, Mehryar Mohri 1,2, and Uar Syed 2 1 Courant Institute of Matheatical Sciences, 251 Mercer Street, New Yor, NY 10012 2 Google Research, 111 8th Avenue, New

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200 66-686 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information