Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees

Jean Honorio, CSAIL, MIT, Cambridge, MA 02139, USA.  Tommi Jaakkola, CSAIL, MIT, Cambridge, MA 02139, USA.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Abstract

We analyze the expected risk of linear classifiers for a fixed weight vector in the minimax setting. That is, we analyze the worst-case risk among all data distributions with a given mean and covariance. We provide a simpler proof of the tight polynomial-tail bound for general random variables. For sub-Gaussian random variables, we derive a novel tight exponential-tail bound. We also provide new PAC-Bayes finite-sample guarantees when training data is available. Our minimax generalization bounds are dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples.

1 Introduction

Linear classifiers are the cornerstone of several applications in machine learning. The generalization ability of classifiers has long been studied in both statistical and computational learning theory. Several general frameworks have been applied in order to analyze the generalization error of generic classifiers (some do not apply to linear classifiers), such as empirical risk minimization, structural risk minimization, VC dimension, covering numbers and Rademacher complexity. Additionally, several notions of stability that guarantee generalization were introduced in [5, 18, 20, 21]. We refer the interested reader to the survey article [4] for additional information.

For linear classifiers, sharp margin bounds are provided in [2] using Rademacher complexity, and in [15, 17] using the PAC-Bayes theorem. Later, [11] provided sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which led to several generalization bounds, such as margin bounds, PAC-Bayes bounds and risk bounds for vectors with bounded norm. More recently, [8] generalized the KL divergence in the PAC-Bayes theorem to arbitrary convex functions, and [10] provided dimensionality-dependent PAC-Bayes margin bounds. In a different line of work, [19] proved generalization of sparse linear classifiers (under $\ell_1$-regularization) by using the covering number bounds in [23].

Let $\hat{f}$ be a classifier learnt from available training samples (drawn from some arbitrary data distribution). Let $f^*$ be the optimal classifier in the asymptotic setting (intuitively speaking, when an infinite amount of data is available). Generalization bounds are usually termed as a uniform convergence statement that holds for all classifiers (see [11] for instance). Alternatively, by using a symmetrization argument and by optimality of the empirical minimizer, we can state that with probability at least $1 - \delta$:

$R(\hat{f}) \le R(f^*) + g(m, \delta)$    (1)

where $R(f)$ is the expected risk of the classifier $f$ with respect to the data distribution (and with respect to the posterior for PAC-Bayes), and $g(m, \delta)$ is a function that decreases with respect to $m$ and $\delta$. Very often, we have that $g(m, \delta) \in O(1/\sqrt{m})$ and $g(m, \delta) \in O(\sqrt{\log(1/\delta)})$, but other rates are also possible [4].

Generalization bounds are stated as in eq.(1) with respect to the unknown quantity $R(f^*)$. In this paper, we are interested in developing a tight bound for the expected risk $R(f)$ of a linear classifier $f$. This allows us to find a closed-form expression for the bound of $R(f^*)$ and also to study the behavior of $R(\hat{f})$ when training data is available. We study the expected risk of linear classifiers under two scenarios: general random variables and sub-Gaussian variates.
Many features used in real-world classification problems follow sub-Gaussianity assumptions. For instance, in the computer vision literature, histogram features are used for object classification as in [6, 22]. The class of sub-Gaussian variates includes, for instance, Gaussian variables, any bounded random variable (e.g. Bernoulli, multinomial, uniform), any random variable with strictly log-concave density, and any finite mixture of sub-Gaussian variables. Stronger assumptions have been previously used for the analysis of linear classifiers. In the generalization bound analysis of [11], the authors assumed boundedness, while in the active learning analysis of [1] the authors assumed a log-concave distribution.

In this paper, we provide tight bounds for the expected risk of linear classifiers. Interestingly, the Fisher linear discriminant objective function appears in the different scenarios under our analysis, although our results apply to any linear classifier. In our tightness proof, we construct a family of data distributions where the expected risk bound holds with equality. Our constructions do not rely on conditional distributions of $x$ given the class $y$ (i.e. $P(x \mid y = -1)$ and $P(x \mid y = +1)$) that are Gaussian or that have equal covariances. This implies that there is a whole family of non-trivial distributions in which the Fisher discriminant is asymptotically as good as any other linear classifier. When training data is available, we provide novel minimax generalization bounds that are dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples. From a technical point of view, we do not require boundedness of the input data as in [11], but note that we analyze a new problem. As part of our analysis, we derive novel PAC-Bayes bounds for not-everywhere-bounded functions. Additionally, while PAC-Bayes bounds usually depend on the Kullback-Leibler divergence (which is natural for our analysis of sub-Gaussian variates), we provide bounds that depend on the chi-squared divergence (which we believe is natural for our analysis of general random variables).

2 Preliminaries

Consider a binary classification problem with $n$ features. That is, each sample is a pair $(x, y)$ containing a class label $y \in \{-1, +1\}$ and a vector of features $x \in \mathbb{R}^n$. Let $Z$ be the probability distribution of $(x, y)$; that is, $(x, y) \sim Z$. A linear classifier $f$ with weight vector $w \in \mathbb{R}^{n+1}$ has the following decision function:

$f(x \mid w) = \mathrm{sgn}(w^\top [x; 1])$    (2)

where $[x; 1]$ denotes the vector $x$ augmented with a constant entry equal to 1. The linear classifier $f$ makes a mistake whenever the sign of $y$ does not match the sign of $f(x \mid w)$, or equivalently, whenever:

$y\, w^\top [x; 1] \le 0$    (3)

In this paper, we analyze the probability of the above expression with respect to the data distribution. That is, we analyze the expected risk of a linear classifier. Without loss of generality, let $z = y\,[x; 1]$. Clearly, eq.(3) holds if and only if $w^\top z \le 0$. We assume that $z$ has mean $\mu$ and positive definite covariance matrix $\Sigma$. That is:

$\mu \equiv \mathbb{E}[z]$  and  $\Sigma \equiv \mathbb{E}[(z - \mu)(z - \mu)^\top]$    (4)

With some abuse of notation, we write $z \sim Z(\mu, \Sigma)$ in order to denote that the distribution $Z$ of $z$ has mean $\mu$ and covariance $\Sigma$. Additionally, we define $\Omega(\mu, \Sigma)$ to be the family of all distributions with mean $\mu$ and covariance $\Sigma$. Thus, $Z \in \Omega(\mu, \Sigma)$ is an equivalent way to denote that the distribution $Z$ has mean $\mu$ and covariance $\Sigma$. Next, we introduce the following Fisher function of the weight vector $w$, given the mean $\mu$ and covariance $\Sigma$ of the data distribution:

$F(w \mid \mu, \Sigma) \equiv \dfrac{(w^\top \mu)^2}{w^\top \Sigma w}$    (5)

The above expression is indeed the objective function of the Fisher linear discriminant, and it appears in the different scenarios under our analysis (cf. Theorems 1 and 4).
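To make the notation concrete, here is a small Python sketch (not from the paper; the vectors, covariance and weights are arbitrary illustrative values) of the decision function in eq.(2) and the Fisher function in eq.(5).

```python
import numpy as np

def predict(w, x):
    """Linear decision function f(x | w) = sgn(w' [x; 1]), eq. (2)."""
    return np.sign(w @ np.append(x, 1.0))

def fisher(w, mu, Sigma):
    """Fisher function F(w | mu, Sigma) = (w' mu)^2 / (w' Sigma w), eq. (5)."""
    return float(w @ mu) ** 2 / float(w @ Sigma @ w)

# Toy example: z = y [x; 1] with some assumed mean and covariance.
mu = np.array([1.0, 0.5, 0.8])          # mean of z (illustrative values)
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.5, 0.1],
                  [0.0, 0.1, 0.5]])     # positive definite covariance of z
w = np.array([1.0, 0.3, 0.2])           # a weight vector with w' mu > 0

print("F(w | mu, Sigma) =", fisher(w, mu, Sigma))
print("prediction for x = [0.4, -0.2]:", predict(w, np.array([0.4, -0.2])))
```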
3 Bounds for the Expected Risk

The following minimax result was found by [16] and rediscovered by [3]. The proofs in both papers are more complex than the ones we provide here, and their main result is stated as follows. Let $\Omega(\mu, \Sigma)$ be the family of all distributions with mean $\mu$ and covariance $\Sigma$. For a fixed weight vector $w$ such that $w^\top \mu > 0$, we have:

$\sup_{Z \in \Omega(\mu,\Sigma)} P_{z \sim Z}[w^\top z \le 0] = \dfrac{1}{1 + F(w \mid \mu, \Sigma)}$    (6)

(See Appendix A for the relationship between this specific expression and the results in [3, 16].) The minimax expression in eq.(6) motivated the development of minimax probability machines [13, 14], which only need access to the means and covariances for training. The learning algorithm in [13, 14] uses this bound for each class separately. That is, [13, 14] use the bound in eq.(6) with $x$ separately (for class $y = -1$ and for class $y = +1$) instead of with $z$, as in our setting. Note that [12, 14] also propose a learning algorithm called the single-class minimax probability machine, mainly motivated by the quantile estimation problem. For our consistency analysis, the use of a single bound with $z$ allows us to provide closed-form expressions for the upper bound of the expected risk.

Note that the minimax bound in eq.(6) provides an upper bound for every distribution $Z$, and it shows that the bound is tight. In Theorems 1 and 2, we reproduce the same polynomial-tail bounds by following a simpler approach compared to the semidefinite optimization of [3] and the multivariate constructions of [16]. Our bound and tightness analysis relies on the analysis of a related univariate problem. In Theorems 4 and 5, we use a similar approach as we followed for general random variables, and derive novel exponential-tail bounds for sub-Gaussian variates.

3.1 General Random Variables

First, we provide a bound on the expected risk of a fixed linear classifier for general random variables.

Theorem 1. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. The expected risk of a linear classifier with weight vector $w$ such that $w^\top \mu > 0$ is upper-bounded as follows:

$P_{z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le \dfrac{1}{1 + F(w \mid \mu, \Sigma)}$    (7)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}[s] = w^\top \mu > 0$ and $\sigma_s^2 \equiv \mathbb{E}[(s - \mu_s)^2] = w^\top \Sigma w$. By the one-sided Chebyshev inequality for $\varepsilon = \mu_s > 0$, we have:

$P[w^\top z \le 0] = P[s \le 0] = P[\mu_s - s \ge \varepsilon] \le \dfrac{1}{1 + (\varepsilon/\sigma_s)^2} = \dfrac{1}{1 + (\mu_s/\sigma_s)^2}$    (8)

By replacing $\mu_s$ and $\sigma_s$, we prove our claim.

Next, we show that the above bound is tight. That is, there is a data distribution and a weight vector for which equality holds.

Theorem 2. The upper bound provided in Theorem 1 is tight. That is, given an arbitrary mean $\mu$ and covariance $\Sigma$, there is a distribution $Z$ with mean $\mu$ and covariance $\Sigma$, and a weight vector $w$, for which the bound holds with equality. More formally:

$(\exists Z \in \Omega(\mu, \Sigma))(\exists w \in \mathbb{R}^{n+1})\quad P_{z \sim Z}[w^\top z \le 0] = \dfrac{1}{1 + F(w \mid \mu, \Sigma)}$    (9)

where $\Omega(\mu, \Sigma)$ is the family of all distributions with mean $\mu$ and covariance $\Sigma$.

Proof. We show that, given some arbitrary mean $\mu$ and covariance $\Sigma$, we can construct a distribution $Z(\mu, \Sigma)$ and a weight vector $w$ such that the provided bound holds with equality. We prove this by providing a specific univariate three-points distribution, which we later use for constructing a multivariate three-planes distribution.

First, we focus on the bound for a single scalar variable $s$ in eq.(8). Our goal is to construct a three-points distribution where $P[\mu_s - s \ge \varepsilon] = \frac{1}{1 + (\varepsilon/\sigma_s)^2}$. Thus, we want to show that the one-sided Chebyshev inequality is tight. In fact, equality holds for the following distribution. Let $q(\varepsilon) = \frac{1}{4}\varepsilon(\varepsilon + \sqrt{\varepsilon^2 + 8})$ for some constant $\varepsilon \in (0, 1/3)$ and $\beta > 0$, and let $s$ be distributed as follows:

$s = \begin{cases} \beta - 1, & \text{with probability } q(\varepsilon) \\ \beta, & \text{with probability } 1 - 2q(\varepsilon) \\ \beta + 1, & \text{with probability } q(\varepsilon) \end{cases}$    (10)

Note that $\mu_s \equiv \mathbb{E}[s] = \beta > 0$ and $\sigma_s^2 \equiv \mathbb{E}[(s - \mu_s)^2] = 2q(\varepsilon)$.

Second, we construct a canonical three-planes distribution. Consider a distribution $V$ of random vectors $v$, where $v_1 = (s - \beta)/\sqrt{2q(\varepsilon)}$ and $(v_2, \dots, v_{n+1})$ follows a distribution with mean $0$ and covariance $I$. Since $v_1$ has zero mean and unit variance, and it is independent of $(v_2, \dots, v_{n+1})$, we have $\mathbb{E}[v] = 0$ and $\mathbb{E}[v v^\top] = I$. Let $\alpha = (\sqrt{2q(\varepsilon)}, 0, \dots, 0)$; we have $\alpha^\top v + \beta = s$ and therefore $\alpha^\top v + \beta$ is distributed as in eq.(10).

Finally, we construct a general three-planes distribution. For a given mean $\mu$ and covariance $\Sigma$, we construct a random variable $z$ with the required mean and covariance from $v$ as follows. Define the random variable $z = \Sigma^{1/2} v + \mu$. It is easy to verify that $\mathbb{E}[v] = 0$ and $\mathbb{E}[v v^\top] = I$ if and only if $\mathbb{E}[z] = \mu$ and $\mathbb{E}[(z - \mu)(z - \mu)^\top] = \Sigma$.
Note that $\alpha^\top v + \beta = w^\top z$ for $w = \Sigma^{-1/2} \alpha$ and $\beta = w^\top \mu = \alpha^\top \Sigma^{-1/2} \mu > 0$. For $\varepsilon = \mu_s > 0$, we have:

$P[w^\top z \le 0] = P[\alpha^\top v + \beta \le 0] = P[s \le 0] = P[\mu_s - s \ge \varepsilon] = \dfrac{1}{1 + (\varepsilon/\sigma_s)^2}$

and we prove our claim.
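As a quick numerical illustration of Theorem 1 (not part of the paper), the following sketch samples a Gaussian member of $\Omega(\mu, \Sigma)$ and checks that the empirical risk of a fixed linear classifier stays below the minimax bound $1/(1 + F(w \mid \mu, \Sigma))$; the dimension, moments and weight vector are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                        # dimension of z (illustrative)
mu = rng.normal(size=n) + 0.5
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)                  # positive definite covariance

w = np.linalg.solve(Sigma, mu)               # any w with w' mu > 0 works; this choice maximizes F
assert w @ mu > 0

F = (w @ mu) ** 2 / (w @ Sigma @ w)          # Fisher function, eq. (5)
bound = 1.0 / (1.0 + F)                      # minimax bound, eq. (6) / Theorem 1

z = rng.multivariate_normal(mu, Sigma, size=200_000)   # one member of Omega(mu, Sigma)
empirical_risk = np.mean(z @ w <= 0)
print(f"empirical risk = {empirical_risk:.4f}  <=  bound = {bound:.4f}")
```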

3.2 Sub-Gaussian Random Variables

In this paper, we make use of the following definition of sub-Gaussianity of vectors by [7, 9].

Definition 3. A random vector $z \sim Z$ is sub-Gaussian if for all $w \in \mathbb{R}^{n+1}$, the random variable $s = w^\top z$ with mean $\mu_s = \mathbb{E}_{z \sim Z}[w^\top z]$ and variance $\sigma_s^2 = \mathbb{E}_{z \sim Z}[(w^\top z - \mu_s)^2]$ is sub-Gaussian with parameter $\sigma_s$. That is:

$(\forall w \in \mathbb{R}^{n+1})\quad \mathbb{E}_{z \sim Z}[e^{w^\top z - \mu_s}] \le e^{\frac{1}{2}\sigma_s^2}$    (11)

First, we provide a bound on the expected risk of a fixed linear classifier for sub-Gaussian random variables.

Theorem 4. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. Assume $z$ is a sub-Gaussian vector. The expected risk of a linear classifier with weight vector $w$ such that $w^\top \mu > 0$ is upper-bounded as follows:

$P_{z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le e^{-\frac{1}{2} F(w \mid \mu, \Sigma)}$    (12)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}[s] = w^\top \mu > 0$ and $\sigma_s^2 \equiv \mathbb{E}[(s - \mu_s)^2] = w^\top \Sigma w$. Furthermore, by Definition 3, the random variable $s$ is sub-Gaussian with parameter $\sigma_s$. By the one-sided Chernoff bound for sub-Gaussian variables and $\varepsilon = \mu_s > 0$, we have:

$P[w^\top z \le 0] = P[s \le 0] = P[\mu_s - s \ge \varepsilon] \le e^{-\frac{1}{2}(\varepsilon/\sigma_s)^2} = e^{-\frac{1}{2}(\mu_s/\sigma_s)^2}$    (13)

By replacing $\mu_s$ and $\sigma_s$, we prove our claim.

Next, we show that the above bound is tight. That is, there is a data distribution and a weight vector for which equality holds.

Theorem 5. The upper bound provided in Theorem 4 is tight. That is, given an arbitrary mean $\mu$ and covariance $\Sigma$, there is a distribution $Z$ with mean $\mu$ and covariance $\Sigma$, and a weight vector $w$, for which the bound holds with equality. More formally:

$(\exists Z \in \Omega(\mu, \Sigma))(\exists w \in \mathbb{R}^{n+1})\quad P_{z \sim Z}[w^\top z \le 0] = e^{-\frac{1}{2} F(w \mid \mu, \Sigma)}$    (14)

where $\Omega(\mu, \Sigma)$ is the family of all distributions with mean $\mu$ and covariance $\Sigma$.

Proof. A minor change to the three-points argument in Theorem 2 is needed. That is, we focus on the bound for a single scalar variable $s$ in eq.(13). Let $W(a)$ be the Lambert function; that is, $W(a)$ is the solution $t$ of the equation $a = t e^t$. Our goal is to construct a three-points distribution where $P[\mu_s - s \ge \varepsilon] = e^{-\frac{1}{2}(\varepsilon/\sigma_s)^2}$. Thus, we want to show that the one-sided Chernoff bound is tight. In fact, equality holds for the following distribution. Let $q(\varepsilon) = e^{W(-\varepsilon^2/4)}$ for some constant $\varepsilon \in (0, 2/\sqrt{e})$, and let $s$ be distributed as in eq.(10). Recall that bounded variables, such as the specifically constructed $s$, are sub-Gaussian. The rest of the proof follows as in Theorem 2, with the additional assumption that $v$ is a sub-Gaussian vector.
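For intuition about the two bounds (a sketch, not from the paper), the snippet below compares the polynomial-tail bound of Theorem 1 with the exponential-tail bound of Theorem 4 against the exact risk of a Gaussian $z$, which is sub-Gaussian and satisfies $P[w^\top z \le 0] = \Phi(-\sqrt{F})$; the mean, covariance and weight vector are arbitrary.

```python
import math
import numpy as np

def fisher(w, mu, Sigma):
    # F(w | mu, Sigma) = (w' mu)^2 / (w' Sigma w), eq. (5)
    return float(w @ mu) ** 2 / float(w @ Sigma @ w)

mu = np.array([0.8, 0.4, 1.0])               # illustrative mean of z
Sigma = np.diag([1.0, 2.0, 0.5])             # illustrative covariance of z
w = np.array([1.0, 0.5, 0.5])                # any w with w' mu > 0

F = fisher(w, mu, Sigma)
exact = 0.5 * math.erfc(math.sqrt(F / 2.0))  # P[s <= 0] = Phi(-sqrt(F)) for Gaussian z
poly_bound = 1.0 / (1.0 + F)                 # Theorem 1 (any distribution in Omega(mu, Sigma))
exp_bound = math.exp(-F / 2.0)               # Theorem 4 (sub-Gaussian z)
print(f"F = {F:.3f}  exact = {exact:.4f}  exp-tail = {exp_bound:.4f}  poly-tail = {poly_bound:.4f}")
```

Neither bound dominates the other for all values of $F$: the exponential-tail bound is tighter once $F$ is large enough, while the polynomial-tail bound can be smaller for small $F$.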
4 PAC-Bayes Finite-Sample Bounds

In this section, we provide finite-sample guarantees for our previously derived asymptotic bounds. Our minimax generalization bounds are dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples, for both general and sub-Gaussian random variables.

First, we provide a brief introduction to the PAC-Bayes framework in the context of linear classifiers. Let $H$ be a set of linear classifiers, that is, a set of weight vectors $w$. After observing a training set, the task of the learner is to choose a posterior distribution $Q$ of weight vectors with support $H$, such that the Bayes classifier has the smallest possible risk. The Bayes linear classifier $f$ has the following decision function:

$f(x \mid Q) = \mathrm{sgn}\big(\mathbb{E}_{w \sim Q}[w^\top [x; 1]]\big)$    (15)

The output of the (deterministic) Bayes classifier is closely related to the output of the (stochastic) Gibbs classifier. The Gibbs linear classifier randomly chooses a (deterministic) classifier $w$ according to $Q$ in order to classify $x$. The expected risk of the Gibbs linear classifier is thus given by the probability of eq.(3), with respect to the data distribution and the posterior distribution $Q$. PAC-Bayes guarantees are given with respect to a prior distribution $P$ of weight vectors, also of support $H$, and mostly focus on the analysis of the Gibbs classifier.

As noted in [8], the expected risk of the Bayes classifier is at most twice the expected risk of the Gibbs classifier. Thus, any upper bound on the latter provides an upper bound on the former. The factor of 2 can sometimes be reduced to $1 + \epsilon$, as shown in [15]. In our analysis, we chose a specific $n$-dimensional ellipsoid as the support:

$H = \{w \mid w^\top (\Sigma + \mu\mu^\top) w = 1\}$    (16)

The latter is for convenience. Since the Fisher function uses only linear and quadratic terms, any distribution with support $\mathbb{R}^{n+1}$ can be reparametrized with respect to $H$ and produce the same results. For instance, the Fisher function is scale-independent with respect to $w$. Intuitively speaking, the reparametrization can be performed by integrating the mass of the distribution of support $\mathbb{R}^{n+1}$ over every direction independently. Indeed, the distributions $P$ and $Q$ could also be described with respect to angles. Fortunately, we do not need such a reparametrization for our analysis.
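The following sketch (not from the paper; the posterior $Q$ and its support points are hypothetical) contrasts the Bayes decision rule of eq.(15) with the stochastic Gibbs classifier it is related to, for a discrete $Q$ over three weight vectors.

```python
import numpy as np

rng = np.random.default_rng(4)

# A discrete posterior Q over three weight vectors (hypothetical values).
W = np.array([[ 1.0, 0.2,  0.1],
              [ 0.8, 0.4, -0.1],
              [ 1.2, 0.0,  0.2]])
q_probs = np.array([0.5, 0.3, 0.2])

def bayes_predict(x):
    """Bayes classifier, eq. (15): sign of the Q-average of w' [x; 1]."""
    xa = np.append(x, 1.0)
    return np.sign(q_probs @ (W @ xa))

def gibbs_predict(x):
    """Gibbs classifier: draw one w ~ Q, then predict with it."""
    xa = np.append(x, 1.0)
    w = W[rng.choice(len(W), p=q_probs)]
    return np.sign(w @ xa)

x = np.array([0.3, -0.5])
print("Bayes prediction:", bayes_predict(x))
print("Gibbs predictions over 5 draws:", [gibbs_predict(x) for _ in range(5)])
```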

Next, we introduce the following Gibbs-Fisher function of weight vectors $w$ sampled from a distribution $Q$, given the mean $\mu$ and covariance $\Sigma$ of the data distribution:

$F(Q \mid \mu, \Sigma) \equiv \dfrac{(\mathbb{E}_{w \sim Q}[w^\top \mu])^2}{\mathbb{E}_{w \sim Q}[w^\top (\Sigma + \mu\mu^\top) w] - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2}$    (17)

The above expression appears in the bound on the expected risk of the Gibbs linear classifier, in the different scenarios under our analysis (cf. Theorems 6 and 11). Our exponential-tail bounds for sub-Gaussian variates depend on the Kullback-Leibler divergence, while our polynomial-tail bounds for general random variables depend on the chi-squared divergence. Given two distributions $P$ and $Q$ with densities $p$ and $q$, the Kullback-Leibler and chi-squared divergences are defined as $\mathrm{KL}(Q\|P) = \mathbb{E}_{w \sim Q}[\log \frac{q(w)}{p(w)}]$ and $\chi^2(Q\|P) = \mathbb{E}_{w \sim Q}[\frac{q(w)}{p(w)}] - 1$, respectively.

Our proof strategy is to first show concentration of the projected mean and variance, and then use those results in order to show concentration of the Gibbs-Fisher function.

4.1 General Random Variables

First, we provide a bound on the expected risk of the Gibbs linear classifier for general random variables.

Theorem 6. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. Let $Q$ be the probability distribution of the random weight vector $w$. The expected risk of a linear classifier drawn from $Q$ such that $\mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ is upper-bounded as follows:

$P_{w \sim Q,\, z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le \dfrac{1}{1 + F(Q \mid \mu, \Sigma)}$    (18)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}_{w \sim Q, z \sim Z}[s] = \mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ and $\sigma_s^2 \equiv \mathbb{E}_{w \sim Q, z \sim Z}[(s - \mu_s)^2] = \mathbb{E}_{w \sim Q}[w^\top (\Sigma + \mu\mu^\top) w] - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2$. The rest of the proof follows as in Theorem 1.

Next, we show PAC-Bayes concentration of the projected mean for general random variables.

Lemma 7. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ be the empirical mean computed from those samples. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \sqrt{\dfrac{\chi^2(Q\|P) + 1}{\delta m}}$    (19)

Proof. Note that the random variable $\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2]$    (20)

Define the random variable $s = w^\top (z - \mu)$. It is easy to verify that $\mathbb{E}_{z \sim Z}[s] = 0$ and $\mathbb{E}_{z \sim Z}[s^2] = w^\top \Sigma w \le w^\top (\Sigma + \mu\mu^\top) w = 1$ in the support $H$ as in eq.(16). Note that $w^\top (\hat{\mu} - \mu) = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(20) as follows:

$\mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[\big(\tfrac{1}{m}\textstyle\sum_{i} s^{(i)}\big)^2\big] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[\tfrac{1}{m^2}\big(\textstyle\sum_{i} (s^{(i)})^2 + \sum_{i \ne j} s^{(i)} s^{(j)}\big)\big] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z \sim Z}[s^2]/m \le 1/m$

By putting everything together, we have:

$\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2] \le \dfrac{1}{\delta m}$    (21)

By Jensen's and Cauchy-Schwarz inequalities, we have:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \mathbb{E}_{w \sim Q}[|w^\top (\mu - \hat{\mu})|] = \mathbb{E}_{w \sim Q}\Big[\sqrt{\tfrac{p(w)}{q(w)}}\, |w^\top (\mu - \hat{\mu})|\, \sqrt{\tfrac{q(w)}{p(w)}}\Big] \le \sqrt{\mathbb{E}_{w \sim Q}\big[\tfrac{p(w)}{q(w)} (w^\top (\mu - \hat{\mu}))^2\big]}\, \sqrt{\mathbb{E}_{w \sim Q}\big[\tfrac{q(w)}{p(w)}\big]} = \sqrt{\mathbb{E}_{w \sim P}[(w^\top (\mu - \hat{\mu}))^2]\, (\chi^2(Q\|P) + 1)}$

By using our bound in eq.(21), we prove our claim.
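As a sanity check of the Markov-inequality step (not from the paper), the sketch below instantiates Lemma 7 with a point-mass prior and posterior $P = Q = \delta_{w_0}$, so that $\chi^2(Q\|P) = 0$: over repeated draws of $m$ samples, the event $|w_0^\top(\mu - \hat{\mu})| > \sqrt{1/(\delta m)}$ should occur with frequency at most $\delta$. The Gaussian data distribution and the choice of $w_0 \in H$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, delta, trials = 4, 200, 0.1, 2000

mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
S = Sigma + np.outer(mu, mu)

w0 = rng.normal(size=n)
w0 /= np.sqrt(w0 @ S @ w0)          # place w0 on the support H: w0' (Sigma + mu mu') w0 = 1

bound = np.sqrt(1.0 / (delta * m))  # eq. (21) with chi^2(Q||P) = 0 (point-mass P = Q)
exceed = 0
for _ in range(trials):
    z = rng.multivariate_normal(mu, Sigma, size=m)
    mu_hat = z.mean(axis=0)
    if abs(w0 @ (mu - mu_hat)) > bound:
        exceed += 1
print(f"empirical exceedance = {exceed / trials:.4f}  (should be <= delta = {delta})")
```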

In what follows, we show PAC-Bayes concentration of the projected variance for general random variables.

Lemma 8. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ has a bounded fourth order moment, that is, $\mathbb{E}[(z^\top (\Sigma + \mu\mu^\top)^{-1} z)^2] \le K$. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1] \le \sqrt{\dfrac{K\,(\chi^2(Q\|P) + 1)}{\delta m}}$    (22)

Proof. Note that the random variable $\mathbb{E}_{w \sim P}[(w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)^2]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[(w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)^2] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[(w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)^2]$    (23)

Define the random variable $s = (w^\top z)^2$. Note that in the support $H$ as in eq.(16), we have $\mathbb{E}_{z \sim Z}[s] = w^\top (\Sigma + \mu\mu^\top) w = 1$, and thus $\mathbb{E}_{z \sim Z}[s - 1] = 0$. Additionally, let $S \equiv \Sigma + \mu\mu^\top$; by our assumption of a bounded fourth order moment, we have:

$\mathbb{E}_{z \sim Z}[s^2] = \mathbb{E}_{z \sim Z}[(w^\top z)^4] = \mathbb{E}_{z \sim Z}[((S^{1/2} w)^\top (S^{-1/2} z))^4] \le \|S^{1/2} w\|_2^4\, \mathbb{E}_{z \sim Z}[\|S^{-1/2} z\|_2^4] = (w^\top S w)^2\, \mathbb{E}_{z \sim Z}[(z^\top S^{-1} z)^2] \le K$

Note that $w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(23). The rest of the proof follows similarly as in Lemma 7.

Finally, we show PAC-Bayes concentration of the Gibbs-Fisher function for general random variables. Our minimax generalization bound is dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples.

Theorem 9. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ has a bounded fourth order moment, that is, $\mathbb{E}[(z^\top (\Sigma + \mu\mu^\top)^{-1} z)^2] \le K$. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \dfrac{1}{1 + F(Q \mid \mu, \Sigma)} - \dfrac{1}{1 + F(Q \mid \hat{\mu}, \hat{\Sigma})} \le \sqrt{\dfrac{18 \max(1, K)\,(\chi^2(Q\|P) + 1)}{\delta m}} + O\!\left(\dfrac{1}{m}\right)$    (24)

Proof. In order to obtain concentration of the projected mean and variance simultaneously, we apply the union bound to Lemmas 7 and 8. That is, with probability at least $1 - \delta$, letting $\varepsilon = \sqrt{\frac{2 \max(1, K)(\chi^2(Q\|P) + 1)}{\delta m}}$, we have:

$(\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]| \le \varepsilon, \qquad (\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1]| \le \varepsilon$    (25)

With some algebra, we can prove that for any $Q$, $\mu$, $\Sigma$ and $S = \Sigma + \mu\mu^\top$, we have:

$\dfrac{1}{1 + F(Q \mid \mu, \Sigma)} = 1 - \dfrac{(\mathbb{E}_{w \sim Q}[w^\top \mu])^2}{\mathbb{E}_{w \sim Q}[w^\top S w]}$    (26)

Let $\hat{S} \equiv \hat{\Sigma} + \hat{\mu}\hat{\mu}^\top$, $\alpha \equiv \mathbb{E}_{w \sim Q}[w^\top \mu]$, $\hat{\alpha} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{\mu}]$ and $\hat{\beta} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{S} w]$. The concentration results in eq.(25) are equivalent to $|\alpha - \hat{\alpha}| \le \varepsilon$ and $|\hat{\beta} - 1| \le \varepsilon$. Additionally, by eq.(26), and since $\mathbb{E}_{w \sim Q}[w^\top S w] = 1$ for any $Q$ of support $H$, the left-hand side of eq.(24) is equal to $\hat{\alpha}^2/\hat{\beta} - \alpha^2$.

For any $Q$, $\mu$, $\Sigma$ and $S = \Sigma + \mu\mu^\top$, by positive semidefiniteness of $\Sigma$, we have:

$(\forall w)\ 0 \le w^\top \Sigma w \ \Rightarrow\ 0 \le \mathbb{E}_{w \sim Q}[w^\top \Sigma w] \ \Rightarrow\ 0 \le \mathbb{E}_{w \sim Q}[w^\top (S - \mu\mu^\top) w] \ \Rightarrow\ \mathbb{E}_{w \sim Q}[(w^\top \mu)^2] \le \mathbb{E}_{w \sim Q}[w^\top S w] \ \Rightarrow\ \mathbb{E}_{w \sim Q}[(w^\top \mu)^2]/\mathbb{E}_{w \sim Q}[w^\top S w] \le 1$

Note that $0 \le (\mathbb{E}_{w \sim Q}[w^\top \mu])^2 \le \mathbb{E}_{w \sim Q}[(w^\top \mu)^2]$, and therefore $\hat{\alpha}^2/\hat{\beta} \in [0, 1]$ and $\alpha^2 \in [0, 1]$. Since $|\alpha - \hat{\alpha}| \le \varepsilon$ and $|\alpha| \le 1$, we have:

$\alpha^2 - \hat{\alpha}^2 = 2\alpha(\alpha - \hat{\alpha}) - (\alpha - \hat{\alpha})^2 \le 2\varepsilon + \varepsilon^2$

Similarly:

$\hat{\alpha}^2 - \alpha^2 = 2\hat{\alpha}(\hat{\alpha} - \alpha) - (\hat{\alpha} - \alpha)^2 \le 2(|\alpha| + \varepsilon)\varepsilon + \varepsilon^2 \le 2\varepsilon + 3\varepsilon^2$

Therefore, $|\alpha^2 - \hat{\alpha}^2| \le 2\varepsilon + 3\varepsilon^2$. Finally, since $|\alpha - \hat{\alpha}| \le \varepsilon$, $|\hat{\beta} - 1| \le \varepsilon$ and $\hat{\alpha}^2/\hat{\beta} \le 1$, we have:

$\hat{\alpha}^2/\hat{\beta} - \alpha^2 \le |\hat{\alpha}^2/\hat{\beta} - \hat{\alpha}^2| + |\hat{\alpha}^2 - \alpha^2| = (\hat{\alpha}^2/\hat{\beta})\,|1 - \hat{\beta}| + |\hat{\alpha}^2 - \alpha^2| \le \varepsilon + 2\varepsilon + 3\varepsilon^2 = 3\varepsilon + O(1/m)$

and we prove our claim.
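To see the concentration described by Theorem 9 numerically (an illustration, not from the paper), the sketch below evaluates the Gibbs-Fisher function of eq.(17) for a fixed discrete posterior $Q$ on $H$, using both the population moments and empirical moments from growing sample sizes; the data distribution and the support points of $Q$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
S = Sigma + np.outer(mu, mu)

# Discrete posterior Q: a few weight vectors placed on H = {w : w' S w = 1}, with w' mu > 0.
W = rng.normal(size=(3, n))
W = np.array([w / np.sqrt(w @ S @ w) * np.sign(w @ mu) for w in W])
q_probs = np.array([0.5, 0.3, 0.2])

def gibbs_fisher(W, q_probs, mu, Sigma):
    """Gibbs-Fisher function F(Q | mu, Sigma), eq. (17), for a discrete Q."""
    S = Sigma + np.outer(mu, mu)
    a = q_probs @ (W @ mu)                              # E_Q[w' mu]
    b = q_probs @ np.einsum('ij,jk,ik->i', W, S, W)     # E_Q[w' S w]
    return a ** 2 / (b - a ** 2)

for m in [100, 1000, 10000]:
    z = rng.multivariate_normal(mu, Sigma, size=m)
    mu_hat, Sigma_hat = z.mean(axis=0), np.cov(z, rowvar=False, bias=True)
    F_true = gibbs_fisher(W, q_probs, mu, Sigma)
    F_emp = gibbs_fisher(W, q_probs, mu_hat, Sigma_hat)
    gap = 1 / (1 + F_true) - 1 / (1 + F_emp)
    print(f"m={m:6d}  F(Q|mu,Sigma)={F_true:.3f}  F(Q|mu_hat,Sigma_hat)={F_emp:.3f}  gap={gap:+.4f}")
```

As $m$ grows, the gap between the two plug-in risk bounds shrinks, consistent with the $O(1/\sqrt{m})$ rate of the theorem.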

4.2 Sub-Gaussian Random Variables

In this paper, we introduce the following definition of sub-Gaussianity of vectors.

Definition 10. Let $H$ be a bounded set. A random vector $z \sim Z$ is $H$-sub-Gaussian if for all distributions $Q$ of support $H$ and $w \sim Q$, the random variable $s = w^\top z$ with mean $\mu_s = \mathbb{E}_{w \sim Q, z \sim Z}[w^\top z]$ and variance $\sigma_s^2 = \mathbb{E}_{w \sim Q, z \sim Z}[(w^\top z - \mu_s)^2]$ is sub-Gaussian with parameter $\sigma_s$. That is:

$(\forall Q)\quad \mathbb{E}_{w \sim Q, z \sim Z}[e^{w^\top z - \mu_s}] \le e^{\frac{1}{2}\sigma_s^2}$    (27)

First, we provide a bound on the expected risk of the Gibbs linear classifier for sub-Gaussian random variables.

Theorem 11. Let $z = y\,[x; 1]$ be a random variable with mean $\mu$ and covariance $\Sigma$. Let $Q$ be the probability distribution of the random weight vector $w$ of support $H$. Assume $z$ is an $H$-sub-Gaussian vector. The expected risk of a linear classifier drawn from $Q$ such that $\mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ is upper-bounded as follows:

$P_{w \sim Q,\, z \sim Z(\mu,\Sigma)}[w^\top z \le 0] \le e^{-\frac{1}{2} F(Q \mid \mu, \Sigma)}$    (28)

Proof. Define the random variable $s = w^\top z$. It is easy to verify that $\mu_s \equiv \mathbb{E}_{w \sim Q, z \sim Z}[s] = \mathbb{E}_{w \sim Q}[w^\top \mu] > 0$ and $\sigma_s^2 \equiv \mathbb{E}_{w \sim Q, z \sim Z}[(s - \mu_s)^2] = \mathbb{E}_{w \sim Q}[w^\top (\Sigma + \mu\mu^\top) w] - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2$. Furthermore, by Definition 10, the random variable $s$ is sub-Gaussian with parameter $\sigma_s$. The rest of the proof follows as in Theorem 4.

Next, we show PAC-Bayes concentration of the projected mean for sub-Gaussian random variables.

Lemma 12. Let $z^{(1)}, \dots, z^{(m)}$ be $m$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ be the empirical mean computed from those samples. Assume $z$ is a sub-Gaussian vector. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \dfrac{1}{\sqrt{m}}\left(\mathrm{KL}(Q\|P) + \log\dfrac{e^{1/2}}{\delta}\right)$    (29)

In the lemma above, while we concentrate on upper-bounding $\mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]$ for all $Q$, a similar argument also bounds $\mathbb{E}_{w \sim Q}[w^\top (\hat{\mu} - \mu)]$.

Proof. Let $t \in \mathbb{R}$ be a constant. Note that the random variable $\mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}]$    (30)

Define the random variable $s = w^\top (z - \mu)$. It is easy to verify that $\mathbb{E}_{z \sim Z}[s] = 0$. As we argued in Theorem 4, the variable $s$ is sub-Gaussian with parameter $\sigma_s$ where $\sigma_s^2 = w^\top \Sigma w$. Furthermore, $\sigma_s^2 \le w^\top (\Sigma + \mu\mu^\top) w = 1$ in the support $H$ as in eq.(16). Note that $w^\top (\hat{\mu} - \mu) = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(30) as follows:

$\mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[e^{-\frac{t}{m}\sum_{i} s^{(i)}}\big] = \mathbb{E}_{w \sim P} \prod_{i=1}^m \mathbb{E}_{z \sim Z}\big[e^{-\frac{t}{m} s}\big] \le \mathbb{E}_{w \sim P} \prod_{i=1}^m e^{\frac{t^2}{2m^2}} = e^{\frac{t^2}{2m}}$

By taking the logarithm on each side of eq.(30), and since $\mathbb{E}_{w \sim P}[f(w)] = \mathbb{E}_{w \sim Q}[\frac{p(w)}{q(w)} f(w)]$ for every distribution $Q$ of support $H$ and function $f$, we have:

$(\forall Q)\quad \log \mathbb{E}_{w \sim Q}\Big[\tfrac{p(w)}{q(w)}\, e^{t\, w^\top (\mu - \hat{\mu})}\Big] \le \log\Big(\tfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t\, w^\top (\mu - \hat{\mu})}]\Big)$

By using Jensen's inequality, we can lower-bound the left-hand side of the above expression:

$(\forall Q)\quad \log \mathbb{E}_{w \sim Q}\Big[\tfrac{p(w)}{q(w)}\, e^{t\, w^\top (\mu - \hat{\mu})}\Big] \ge \mathbb{E}_{w \sim Q}\Big[\log\Big(\tfrac{p(w)}{q(w)}\, e^{t\, w^\top (\mu - \hat{\mu})}\Big)\Big] = -\mathrm{KL}(Q\|P) + t\, \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]$

By putting everything together, we have:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})] \le \dfrac{1}{t}\left(\mathrm{KL}(Q\|P) + \log\dfrac{e^{t^2/(2m)}}{\delta}\right)$

By setting $t = \sqrt{m}$, we prove our claim.

In what follows, we show PAC-Bayes concentration of the projected variance for sub-Gaussian random variables. In the following lemma, while we concentrate on upper-bounding $\mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1]$ for all $Q$, a similar argument also bounds $\mathbb{E}_{w \sim Q}[1 - w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w]$.
Lemma 13. Let $z^{(1)}, \dots, z^{(m)}$ be $m \ge 16$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ is a sub-Gaussian vector. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad \mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1] \le \dfrac{1}{\sqrt{m}}\left(\mathrm{KL}(Q\|P) + \log\dfrac{e^{16}}{\delta}\right)$    (31)

Proof. Let $t \in \mathbb{R}$ be a constant. Note that the random variable $\mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}]$ is non-negative. By Markov's inequality, with probability at least $1 - \delta$:

$\mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}] \le \dfrac{1}{\delta}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}]$    (32)

Define the random variable $s = (w^\top z)^2$. As we argued in Theorem 4, the random variable $w^\top z$ is sub-Gaussian with parameter $\sigma$ where $\sigma^2 = w^\top \Sigma w$. Furthermore, $\sigma^2 \le w^\top (\Sigma + \mu\mu^\top) w = 1$ in the support $H$ as in eq.(16). Additionally, in the support $H$, we have $\mathbb{E}_{z \sim Z}[s] = w^\top (\Sigma + \mu\mu^\top) w = 1$, and thus $\mathbb{E}_{z \sim Z}[s - 1] = 0$. Furthermore, $s$ is sub-exponential since it is the square of a sub-Gaussian variable. In what follows, we use a bound on the moment generating function of a sub-exponential variable that resembles that of sub-Gaussian variables. (See Appendix B for details.) Note that $w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w = \frac{1}{m}\sum_{i=1}^m s^{(i)}$. Furthermore, $s^{(1)}, \dots, s^{(m)}$ are independent since $z^{(1)}, \dots, z^{(m)}$ are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(32) as follows:

$\mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\, \mathbb{E}_{w \sim P}[e^{t (w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1)}] = \mathbb{E}_{w \sim P}\, \mathbb{E}_{z^{(1)} \dots z^{(m)} \sim Z}\big[e^{\frac{t}{m}\sum_{i} (s^{(i)} - 1)}\big] = \mathbb{E}_{w \sim P} \prod_{i=1}^m \mathbb{E}_{z \sim Z}\big[e^{\frac{t}{m}(s - 1)}\big] \le \mathbb{E}_{w \sim P} \prod_{i=1}^m e^{\frac{16 t^2}{m^2}} = e^{\frac{16 t^2}{m}}$  for $t/m \le 1/4$

The rest of the proof follows similarly as in Lemma 12. Note that since we set $t = \sqrt{m}$, the condition $t/m \le 1/4$ for applying the bound on the moment generating function of the sub-exponential variable $s$ leads to the condition $m \ge 16$.

Finally, we show PAC-Bayes concentration of the Gibbs-Fisher function for sub-Gaussian random variables. Our minimax generalization bound is dimensionality-independent and $O(1/\sqrt{m})$ for $m$ samples.

Theorem 14. Let $z^{(1)}, \dots, z^{(m)}$ be $m \ge 16$ samples independently drawn from $Z(\mu, \Sigma)$, an arbitrary distribution with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ and $\hat{\Sigma}$ be the empirical mean and covariance computed from those samples. Assume $z$ is a sub-Gaussian vector. For any prior distribution $P$ of support $H$ as in eq.(16), with probability at least $1 - \delta$:

$(\forall Q)\quad e^{-\frac{1}{2} F(Q \mid \mu, \Sigma)} - e^{-\frac{1}{2} F(Q \mid \hat{\mu}, \hat{\Sigma})} \le \sqrt{\dfrac{36}{m}}\left(\mathrm{KL}(Q\|P) + \log\dfrac{4e^{16}}{\delta}\right) + O\!\left(\dfrac{1}{m}\right)$    (33)

Proof. In order to obtain two-sided concentration of both the projected mean and variance simultaneously, we apply the union bound to the one-sided results in Lemmas 12 and 13. That is, with probability at least $1 - \delta$, letting $\varepsilon = \frac{1}{\sqrt{m}}\big(\mathrm{KL}(Q\|P) + \log\frac{4e^{16}}{\delta}\big)$, we have:

$(\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\mu - \hat{\mu})]| \le \varepsilon, \qquad (\forall Q)\quad |\mathbb{E}_{w \sim Q}[w^\top (\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top) w - 1]| \le \varepsilon$    (34)

With some algebra, we can prove that for any $Q$, $\mu$, $\Sigma$ and $S = \Sigma + \mu\mu^\top$, we have:

$e^{-\frac{1}{2} F(Q \mid \mu, \Sigma)} = \exp\!\left(-\dfrac{1}{2} \cdot \dfrac{(\mathbb{E}_{w \sim Q}[w^\top \mu])^2 / \mathbb{E}_{w \sim Q}[w^\top S w]}{1 - (\mathbb{E}_{w \sim Q}[w^\top \mu])^2 / \mathbb{E}_{w \sim Q}[w^\top S w]}\right)$    (35)

Let $\hat{S} \equiv \hat{\Sigma} + \hat{\mu}\hat{\mu}^\top$, $\alpha \equiv \mathbb{E}_{w \sim Q}[w^\top \mu]$, $\hat{\alpha} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{\mu}]$ and $\hat{\beta} \equiv \mathbb{E}_{w \sim Q}[w^\top \hat{S} w]$. The concentration results in eq.(34) are equivalent to $|\alpha - \hat{\alpha}| \le \varepsilon$ and $|\hat{\beta} - 1| \le \varepsilon$. Note that by eq.(35), and since $\mathbb{E}_{w \sim Q}[w^\top S w] = 1$ for any $Q$ of support $H$, the left-hand side of eq.(33) is equal to $\exp\!\big(-\frac{\alpha^2}{2(1 - \alpha^2)}\big) - \exp\!\big(-\frac{\hat{\alpha}^2/\hat{\beta}}{2(1 - \hat{\alpha}^2/\hat{\beta})}\big)$. Furthermore, by Lipschitz continuity:

$\exp\!\left(-\dfrac{\alpha^2}{2(1 - \alpha^2)}\right) - \exp\!\left(-\dfrac{\hat{\alpha}^2/\hat{\beta}}{2(1 - \hat{\alpha}^2/\hat{\beta})}\right) \le 2\,\big|\hat{\alpha}^2/\hat{\beta} - \alpha^2\big|$

The rest of the proof follows as in Theorem 9 to show that $|\hat{\alpha}^2/\hat{\beta} - \alpha^2| \le 3\varepsilon + O(1/m)$.
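To get a feel for the constants in Theorem 14 (a sketch under assumed values, not from the paper), the snippet below evaluates the penalty term appearing in the theorem for a hypothetical discrete prior and posterior over a small set of weight vectors; the $e^{16}$ factor makes the bound meaningful only for fairly large $m$, although the $O(1/\sqrt{m})$ decay is already visible.

```python
import numpy as np

def kl_discrete(q, p):
    """KL(Q||P) for discrete distributions over the same finite support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([1/3, 1/3, 1/3])        # prior P over three weight vectors on H (assumed)
q = np.array([0.7, 0.2, 0.1])        # posterior Q chosen after seeing data (assumed)
delta = 0.05
kl = kl_discrete(q, p)

for m in [16, 100, 1000, 10000]:
    # log(4 e^16 / delta) = 16 + log(4 / delta)
    eps = np.sqrt(36.0 / m) * (kl + 16 + np.log(4 / delta))
    print(f"m={m:6d}  KL(Q||P)={kl:.3f}  Theorem-14 penalty ~ {eps:.3f}")
```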
5 Concluding Remarks

There are several ways of extending this research. While in this paper we focused on the PAC-Bayes framework, one future goal is to provide guarantees for all weight vectors $w$. In order to obtain bounds that are either independent of, or logarithmically dependent on, the dimension, it might be necessary to produce novel bounds for the Rademacher and/or Gaussian complexity of not-everywhere-bounded functions. Finally, since our bounds on the expected risk are worst-case (among all data distributions with a given mean and covariance), it would be interesting to analyze their applicability to scenarios where each training sample may come from a different distribution, as well as to transfer learning.

References

[1] M. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. COLT, 2013.
[2] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.
[3] D. Bertsimas, I. Popescu, and J. Sethuraman. Moment problems and semidefinite optimization. Handbook of Semidefinite Optimization, 2000.
[4] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Advanced Lectures on Machine Learning, 2004.
[5] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2002.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[7] R. Fukuda. Exponential integrability of sub-Gaussian vectors. Probability Theory and Related Fields, 1990.
[8] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. ICML, 2009.
[9] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Communications in Probability, 2012.
[10] C. Jin and L. Wang. Dimensionality dependent PAC-Bayes margin bound. NIPS, 2012.
[11] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. NIPS, 2008.
[12] G. Lanckriet, L. El Ghaoui, and M. Jordan. Robust novelty detection with single-class MPM. NIPS, 2002.
[13] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. Minimax probability machine. NIPS, 2001.
[14] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. JMLR, 2002.
[15] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. NIPS, 2002.
[16] A. Marshall and I. Olkin. Multivariate Chebyshev inequalities. The Annals of Mathematical Statistics, 1960.
[17] D. McAllester. Simplified PAC-Bayesian margin bounds. COLT, 2003.
[18] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 2006.
[19] A. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. ICML, 2004.
[20] S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 2005.
[21] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. JMLR, 2010.
[22] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for nonparametric object and scene recognition. PAMI, 2008.
[23] T. Zhang. Covering number bounds of certain regularized linear function classes. JMLR, 2002.

A Proof of the Expression in Equation (6)

In this section, we restate the results provided in [3, 16] in order to obtain eq.(6). We follow the well-known Lagrangian duality approach as in Appendix A of [14]. The following result is provided in [3, 16]. Let $\Omega(\mu, \Sigma)$ be the family of all distributions with mean $\mu$ and covariance $\Sigma$. For a fixed weight vector $a$ and constant $b$, we have:

$\sup_{Z \in \Omega(\mu,\Sigma)} P_{z \sim Z}[a^\top z \ge b] = \dfrac{1}{1 + d^2}$,  where  $d^2 = \inf_{a^\top z \ge b}\ (z - \mu)^\top \Sigma^{-1} (z - \mu)$

Let $a = -w$ and $b = 0$. We have:

$\sup_{Z \in \Omega(\mu,\Sigma)} P_{z \sim Z}[w^\top z \le 0] = \dfrac{1}{1 + d^2}$,  where  $d^2 = \inf_{w^\top z \le 0}\ (z - \mu)^\top \Sigma^{-1} (z - \mu)$

Note that if $w^\top \mu \le 0$, then we can just take $z = \mu$ and obtain $d^2 = 0$, which is certainly the optimum because $d^2 \ge 0$ due to positive definiteness of $\Sigma$. In what follows, we assume $w^\top \mu > 0$, as required in eq.(6). We are interested in the value of $d^2$. That is, we seek a closed-form solution of the primal problem:

$\min_{w^\top z \le 0}\ (z - \mu)^\top \Sigma^{-1} (z - \mu)$    (36)

which has the following Lagrangian:

$L(z, \lambda) = (z - \mu)^\top \Sigma^{-1} (z - \mu) + \lambda\, w^\top z$

By optimality arguments (i.e. $\partial L/\partial z = 0$), we have that $L$ is minimized at $z^* = -\frac{\lambda}{2} \Sigma w + \mu$. Therefore, the Lagrange dual function is given by:

$g(\lambda) = \inf_z L(z, \lambda) = L(z^*, \lambda) = -\dfrac{\lambda^2}{4} w^\top \Sigma w + \lambda\, w^\top \mu$

Consequently, the dual problem of eq.(36) is:

$\max_{\lambda \ge 0}\ g(\lambda)$

Again, by optimality arguments (i.e. $\partial g/\partial \lambda = 0$), we have that $g$ is maximized at $\lambda^* = \frac{2 w^\top \mu}{w^\top \Sigma w}$. Note that $\lambda^* \ge 0$ since $w^\top \mu > 0$. Finally:

$d^2 = \max_{\lambda \ge 0} g(\lambda) = g(\lambda^*) = \dfrac{(w^\top \mu)^2}{w^\top \Sigma w} \equiv F(w \mid \mu, \Sigma)$

B Moment Generating Function of the Square of a Sub-Gaussian Variable

Let $s$ be a sub-Gaussian variable with parameter $\sigma_s$ and mean $\mu_s = \mathbb{E}[s]$. By sub-Gaussianity, we know that the moment generating function is bounded as follows:

$(\forall t \in \mathbb{R})\quad \mathbb{E}[e^{t(s - \mu_s)}] \le e^{\frac{1}{2} t^2 \sigma_s^2}$

Our goal is to find a similar bound for the moment generating function of the sub-exponential variable $v = s^2$. Let $\Gamma(r)$ be the Gamma function. The moments of the sub-Gaussian variable $s$ are bounded as follows:

$(\forall r \ge 0)\quad \mathbb{E}[|s|^r] \le r\, 2^{r/2}\, \sigma_s^r\, \Gamma(r/2)$

Let $\mu_v = \mathbb{E}[v]$. By a power series expansion, and since $\Gamma(r) = (r-1)!$ for an integer $r$, we have:

$\mathbb{E}[e^{t(v - \mu_v)}] = 1 + t\,\mathbb{E}[v - \mu_v] + \sum_{r \ge 2} \dfrac{t^r\, \mathbb{E}[(v - \mu_v)^r]}{r!} \le 1 + \sum_{r \ge 2} \dfrac{t^r\, \mathbb{E}[|s|^{2r}]}{r!} \le 1 + \sum_{r \ge 2} \dfrac{t^r\, 2r\, 2^r\, \sigma_s^{2r}\, \Gamma(r)}{r!} = 1 + \sum_{r \ge 2} t^r\, 2^{r+1}\, \sigma_s^{2r} = 1 + \dfrac{8 t^2 \sigma_s^4}{1 - 2 t \sigma_s^2}$

By making $t \le 1/(4\sigma_s^2)$, we have $1/(1 - 2 t \sigma_s^2) \le 2$. Finally, since $(\forall \alpha)\ 1 + \alpha \le e^\alpha$, we have that for a sub-Gaussian variable $s$ with parameter $\sigma_s$:

$(\forall t \le 1/(4\sigma_s^2))\quad \mathbb{E}[e^{t(s^2 - \mathbb{E}[s^2])}] \le e^{16 t^2 \sigma_s^4}$    (37)

Thus, we obtained a bound for the moment generating function of the sub-exponential variable $s^2$, which is similar to that of sub-Gaussian variables but holds only for a small range of $t$.
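The closed form $d^2 = F(w \mid \mu, \Sigma)$ derived in Appendix A can be checked directly from the KKT minimizer $z^* = \mu - \frac{w^\top \mu}{w^\top \Sigma w}\,\Sigma w$; the sketch below (not from the paper, with arbitrary $\mu$, $\Sigma$ and $w$) verifies that the constraint is active at $z^*$ and that the attained objective equals the Fisher function.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
mu = rng.normal(size=n) + 1.0
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)
w = rng.normal(size=n)
if w @ mu <= 0:                      # ensure w' mu > 0, as required
    w = -w

# Minimizer of (z - mu)' Sigma^{-1} (z - mu) subject to w' z <= 0 (Appendix A):
# z* = mu - (w' mu / (w' Sigma w)) Sigma w, with the constraint active (w' z* = 0).
lam_half = (w @ mu) / (w @ Sigma @ w)
z_star = mu - lam_half * (Sigma @ w)

d2 = (z_star - mu) @ np.linalg.solve(Sigma, z_star - mu)
F = (w @ mu) ** 2 / (w @ Sigma @ w)
print(f"w' z* = {w @ z_star:.2e} (active constraint),  d^2 = {d2:.6f},  F(w|mu,Sigma) = {F:.6f}")
```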


More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Collision-based Testers are Optimal for Uniformity and Closeness

Collision-based Testers are Optimal for Uniformity and Closeness Electronic Colloquiu on Coputational Coplexity, Report No. 178 (016) Collision-based Testers are Optial for Unifority and Closeness Ilias Diakonikolas Theis Gouleakis John Peebles Eric Price USC MIT MIT

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

arxiv: v1 [stat.ml] 27 Oct 2018

arxiv: v1 [stat.ml] 27 Oct 2018 Stein Variational Gradient Descent as Moent Matching arxiv:80.693v stat.ml] 27 Oct 208 Qiang Liu, Dilin Wang Departent of Coputer Science The University of Texas at Austin Austin, TX 7872 {lqiang, dilin}@cs.utexas.edu

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

On the Impact of Kernel Approximation on Learning Accuracy

On the Impact of Kernel Approximation on Learning Accuracy On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu

More information

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Statistica Sinica 6 016, 1709-178 doi:http://dx.doi.org/10.5705/ss.0014.0034 AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Nilabja Guha 1, Anindya Roy, Yaakov Malinovsky and Gauri

More information

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions Kernel Choice and Classifiability for RKHS Ebeddings of Probability Distributions Bharath K. Sriperubudur Departent of ECE UC San Diego, La Jolla, USA bharathsv@ucsd.edu Kenji Fukuizu The Institute of

More information

A Simple Homotopy Algorithm for Compressive Sensing

A Simple Homotopy Algorithm for Compressive Sensing A Siple Hootopy Algorith for Copressive Sensing Lijun Zhang Tianbao Yang Rong Jin Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China Departent of Coputer

More information

Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality Counication Lower Bounds for Statistical Estiation Probles via a Distributed Data Processing Inequality Mark Braveran 1, Ankit Garg 1, Tengyu Ma 1, Huy L. Nguyen, and David P. Woodruff 3 1 Princeton University

More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Siulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtan, E. Zhou, T. Huschka, and S. E. Chick, eds. THE EMPIRICAL LIKELIHOOD APPROACH TO SIMULATION INPUT UNCERTAINTY

More information

Learning with Deep Cascades

Learning with Deep Cascades Learning with Deep Cascades Giulia DeSalvo 1, Mehryar Mohri 1,2, and Uar Syed 2 1 Courant Institute of Matheatical Sciences, 251 Mercer Street, New Yor, NY 10012 2 Google Research, 111 8th Avenue, New

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200 66-686 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information