Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees
Jean Honorio, CSAIL, MIT, Cambridge, MA 02139, USA
Tommi Jaakkola, CSAIL, MIT, Cambridge, MA 02139, USA

Abstract

We analyze the expected risk of linear classifiers for a fixed weight vector in the minimax setting. That is, we analyze the worst-case risk among all data distributions with a given mean and covariance. We provide a simpler proof of the tight polynomial-tail bound for general random variables. For sub-Gaussian random variables, we derive a novel tight exponential-tail bound. We also provide new PAC-Bayes finite-sample guarantees when training data is available. Our minimax generalization bounds are dimensionality-independent and O(1/√m) for m samples.

(Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.)

1 Introduction

Linear classifiers are the cornerstone of several applications in machine learning. The generalization ability of classifiers has long been studied in both statistical and computational learning theory. Several general frameworks have been applied in order to analyze the generalization error of generic classifiers (some do not apply to linear classifiers), such as empirical risk minimization, structural risk minimization, VC dimension, covering numbers and Rademacher complexity. Additionally, several notions of stability that guarantee generalization were introduced in [5, 18, 20, 21]. We refer the interested reader to the survey article [4] for additional information. For linear classifiers, sharp margin bounds are provided in [2] using Rademacher complexity, and in [15, 17] using the PAC-Bayes theorem. Later, [11] provided sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which led to several generalization bounds, such as margin bounds, PAC-Bayes bounds and risk bounds for vectors with bounded norm.
More recently, [8] generalized the KL divergence in the PAC-Bayes theorem to arbitrary convex functions; and [10] provided dimensionality-dependent PAC-Bayes margin bounds. In a different line of work, [19] proved generalization of sparse linear classifiers (under l1-regularization) by using the covering number bounds in [23].

Let f̂ be a classifier learnt from m available training samples (drawn from some arbitrary data distribution). Let f* be the optimal classifier in the asymptotic setting (intuitively speaking, when an infinite amount of data is available). Generalization bounds are usually termed as a uniform convergence statement that holds for all classifiers (see [11] for instance). Alternatively, by using a symmetrization argument and by optimality of the empirical minimizer, we can state that with probability at least 1 − δ:

R(f̂) ≤ R(f*) + g(m, δ)    (1)

where R(f) is the expected risk of the classifier f with respect to the data distribution (and with respect to the posterior for PAC-Bayes), and g(m, δ) is a function that decreases with respect to m and δ. Very often, we have that g(m, δ) ∈ O(1/√m) and g(m, δ) ∈ O(log(1/δ)), but other rates are also possible [4].

Generalization bounds are stated as in eq.(1) with respect to the unknown quantity R(f*). In this paper, we are interested in developing a tight bound for the expected risk R(f) of a linear classifier f. This allows us to find a closed-form expression for the bound of R(f*) and also to study the behavior of R(f̂) when training data is available.

We study the expected risk of linear classifiers under two scenarios: general random variables and sub-Gaussian variates. Many features used in real-world classification problems satisfy sub-Gaussianity assumptions. For instance, in the computer vision literature, histogram features are used for object classification as in [6, 22]. The class of sub-Gaussian variates includes, for instance, Gaussian variables, any bounded random variable (e.g. Bernoulli, multinomial, uniform), any random variable with strictly log-concave density, and any finite mixture of sub-Gaussian variables. Stronger assumptions have been previously used for the analysis of linear classifiers. In the generalization bound analysis of [11], the authors assumed boundedness; while in the active learning analysis of [1] the authors assumed a log-concave distribution.

In this paper, we provide tight bounds for the expected risk of linear classifiers. Interestingly, the Fisher linear discriminant objective function appears in the different scenarios under our analysis, although our results apply to any linear classifier. In our tightness proof, we construct a family of data distributions where the expected risk bound holds with equality. Our constructions do not rely on conditional distributions of x given the class y (i.e. P(x | y = −1) and P(x | y = +1)) that are Gaussian or that have equal covariances. This implies that there is a whole family of non-trivial distributions in which the Fisher discriminant is asymptotically as good as any other linear classifier. When training data is available, we provide novel minimax generalization bounds that are dimensionality-independent and O(1/√m) for m samples. From a technical point of view, we do not require boundedness of the input data as in [11], but note that we analyze a new problem. As part of our analysis, we derive novel PAC-Bayes bounds for not-everywhere-bounded functions.
Additionally, while PAC-Bayes bounds usually depend on the Kullback-Leibler divergence (which is natural for our analysis of sub-Gaussian variates), we provide bounds that depend on the Chi-squared divergence (which we believe is natural for our analysis of general random variables).

2 Preliminaries

Consider a binary classification problem with n features. That is, each sample is a pair (x, y) containing a class label y ∈ {−1, +1} and a vector of features x ∈ R^n. Let Z be the probability distribution of (x, y). That is, (x, y) ~ Z. A linear classifier f with weight vector w ∈ R^{n+1} has the following decision function:

f(x | w) = sgn(w^T [x; 1])    (2)

where [x; 1] denotes x augmented with a constant unit entry. The linear classifier f makes a mistake whenever the sign of y does not match the sign of f(x | w), or equivalently, whenever:

y w^T [x; 1] ≤ 0    (3)

In this paper, we analyze the probability of the above expression with respect to the data distribution. That is, we analyze the expected risk of a linear classifier. Without loss of generality, let z = y [x; 1]. Clearly, eq.(3) holds if and only if w^T z ≤ 0. We assume that z has mean µ and positive definite covariance matrix Σ. That is:

µ ≡ E[z]  and  Σ ≡ E[(z − µ)(z − µ)^T]    (4)

With some abuse of notation, we write z ~ Z(µ, Σ) in order to denote that the distribution Z of z has mean µ and covariance Σ. Additionally, we define Ω(µ, Σ) to be the family of all distributions with mean µ and covariance Σ. Thus, Z ∈ Ω(µ, Σ) is an equivalent way to denote that the distribution Z has mean µ and covariance Σ.

Next, we introduce the following Fisher function of the weight vector w, given the mean µ and covariance Σ of the data distribution:

F(w | µ, Σ) ≡ (w^T µ)² / (w^T Σ w)    (5)

The above expression is indeed the objective function of the Fisher linear discriminant, and it appears in the different scenarios under our analysis (cf. Theorems 1 and 4).

3 Bounds for the Expected Risk

The following minimax result was found by [16] and rediscovered by [3]. Proofs in both papers are more complex than the ones we provide here, and their main result is stated as follows.
Let Ω(µ, Σ) be the family of all distributions with mean µ and covariance Σ. For a fixed weight vector w such that w^T µ > 0, we have:

sup_{Z ∈ Ω(µ,Σ)} P_{z~Z}[w^T z ≤ 0] = 1 / (1 + F(w | µ, Σ))    (6)

(See Appendix A for the relationship between this specific expression and the results in [3, 16].)

The minimax expression in eq.(6) motivated the development of minimax probability machines [13, 14], which only need access to the means and covariances for training. The learning algorithm in [13, 14] uses this bound for each class separately. That is, [13, 14] use the bound in eq.(6) with x (separately for class y = −1 and for class y = +1) instead of with z, as in our setting. Note that [12, 14] also propose a learning algorithm called the single-class minimax probability
machine, mainly motivated by the quantile estimation problem. For our consistency analysis, the use of a single bound with z allows us to provide closed-form expressions of the upper bound of the expected risk.

Note that the minimax bound in eq.(6) provides an upper bound for every distribution Z, and it shows that the bound is tight. In Theorems 1 and 2, we reproduce the same polynomial-tail bounds by following a simpler approach compared to semidefinite optimization [3] and multivariate constructions [16]. Our bound and tightness analysis relies on the analysis of a related univariate problem. In Theorems 4 and 5, we use a similar approach as we followed for the general random variables, and derive novel exponential-tail bounds for sub-Gaussian variates.

3.1 General Random Variables

First, we provide a bound of the expected risk of a fixed linear classifier for general random variables.

Theorem 1. Let z = y [x; 1] be a random variable with mean µ and covariance Σ. The expected risk of a linear classifier with weight vector w such that w^T µ > 0 is upper-bounded as follows:

P_{z~Z(µ,Σ)}[w^T z ≤ 0] ≤ 1 / (1 + F(w | µ, Σ))    (7)

Proof. Define the random variable s = w^T z. It is easy to verify that µ_s ≡ E[s] = w^T µ > 0 and σ_s² ≡ E[(s − µ_s)²] = w^T Σ w. By the one-sided Chebyshev inequality for ε = µ_s > 0, we have:

P[w^T z ≤ 0] = P[s ≤ 0] = P[µ_s − s ≥ ε] ≤ 1 / (1 + (ε/σ_s)²) = 1 / (1 + (µ_s/σ_s)²)    (8)

By replacing µ_s and σ_s, we prove our claim.

Next, we show that the above bound is tight. That is, there is a data distribution and weight vector for which equality holds.

Theorem 2. The upper bound provided in Theorem 1 is tight. That is, given an arbitrary mean µ and covariance Σ, there is a distribution Z with mean µ and covariance Σ, and a weight vector w, for which the bound holds with equality. More formally:

(∃ Z ∈ Ω(µ, Σ), w ∈ R^{n+1})  P_{z~Z}[w^T z ≤ 0] = 1 / (1 + F(w | µ, Σ))    (9)

where Ω(µ, Σ) is the family of all distributions with mean µ and covariance Σ.

Proof.
We show that given some arbitrary mean µ and covariance Σ, we can construct a distribution Z(µ, Σ) and a weight vector w such that the provided bound holds with equality. We prove this by providing a specific univariate three-points distribution, which we later use for constructing a multivariate three-planes distribution.

First, we focus on the bound for a single scalar variable s in eq.(8). Our goal is to construct a three-points distribution where P[µ_s − s ≥ ε] = 1 / (1 + (ε/σ_s)²). Thus, we want to show that the one-sided Chebyshev inequality is tight. In fact, equality holds for the following distribution. Let q(ε) = (1/4) ε (−ε + √(ε² + 8)) for some constant ε ∈ (0, 1/3), and for β > 0 let s be distributed as follows:

s = β − 1, with probability q(ε)
s = β, with probability 1 − 2q(ε)    (10)
s = β + 1, with probability q(ε)

Note that µ_s ≡ E[s] = β > 0 and σ_s² ≡ E[(s − µ_s)²] = 2q(ε).

Second, we construct a canonical three-planes distribution. Consider a distribution V of random vectors v, where v₁ = (s − β)/√(2q(ε)) and (v₂, ..., v_{n+1}) follows a distribution with mean 0 and covariance I. Since v₁ has zero mean and unit variance, and is independent of (v₂, ..., v_{n+1}), we have E[v] = 0 and E[vv^T] = I. Let α = (√(2q(ε)), 0, ..., 0); we have α^T v + β = s, and therefore α^T v + β is distributed as in eq.(10).

Finally, we construct a general three-planes distribution. For a given mean µ and covariance Σ, we construct a random variable z with the required mean and covariance from v as follows. Define the random variable z = Σ^{1/2} v + µ. It is easy to verify that E[v] = 0 and E[vv^T] = I if and only if E[z] = µ and E[(z − µ)(z − µ)^T] = Σ. Note that α^T v + β = w^T z for w = Σ^{−1/2} α and β = w^T µ = α^T Σ^{−1/2} µ > 0. For ε = µ_s > 0, we have:

P[w^T z ≤ 0] = P[α^T v + β ≤ 0] = P[s ≤ 0] = P[µ_s − s ≥ ε] = 1 / (1 + (ε/σ_s)²)

and we prove our claim.

3.2 Sub-Gaussian Random Variables

In this paper, we make use of the following definition of sub-Gaussianity of vectors by [7, 9].

Definition 3. A random vector z ~ Z is sub-Gaussian if for all w ∈ R^{n+1}, the random variable
s = w^T z with mean µ_s = E_{z~Z}[w^T z] and variance σ_s² = E_{z~Z}[(w^T z − µ_s)²] is sub-Gaussian with parameter σ_s. That is:

(∀ w ∈ R^{n+1})  E_{z~Z}[e^{w^T z − µ_s}] ≤ e^{σ_s²/2}    (11)

First, we provide a bound of the expected risk of a fixed linear classifier for sub-Gaussian random variables.

Theorem 4. Let z = y [x; 1] be a random variable with mean µ and covariance Σ. Assume z is a sub-Gaussian vector. The expected risk of a linear classifier with weight vector w such that w^T µ > 0 is upper-bounded as follows:

P_{z~Z(µ,Σ)}[w^T z ≤ 0] ≤ e^{−F(w|µ,Σ)/2}    (12)

Proof. Define the random variable s = w^T z. It is easy to verify that µ_s ≡ E[s] = w^T µ > 0 and σ_s² ≡ E[(s − µ_s)²] = w^T Σ w. Furthermore, by Definition 3, the random variable s is sub-Gaussian with parameter σ_s. By the one-sided Chernoff bound for sub-Gaussian variables and ε = µ_s > 0, we have:

P[w^T z ≤ 0] = P[s ≤ 0] = P[µ_s − s ≥ ε] ≤ e^{−(ε/σ_s)²/2} = e^{−(µ_s/σ_s)²/2}    (13)

By replacing µ_s and σ_s, we prove our claim.

Next, we show that the above bound is tight. That is, there is a data distribution and weight vector for which equality holds.

Theorem 5. The upper bound provided in Theorem 4 is tight. That is, given an arbitrary mean µ and covariance Σ, there is a distribution Z with mean µ and covariance Σ, and a weight vector w, for which the bound holds with equality. More formally:

(∃ Z ∈ Ω(µ, Σ), w ∈ R^{n+1})  P_{z~Z}[w^T z ≤ 0] = e^{−F(w|µ,Σ)/2}    (14)

where Ω(µ, Σ) is the family of all distributions with mean µ and covariance Σ.

Proof. A minor change to the three-points argument in Theorem 2 is needed. That is, we focus on the bound for a single scalar variable s in eq.(13). Let W(a) be the Lambert function. That is, W(a) is the solution t of the equation a = t e^t. Our goal is to construct a three-points distribution where P[µ_s − s ≥ ε] = e^{−(ε/σ_s)²/2}. Thus, we want to show that the one-sided Chernoff bound is tight. In fact, equality holds for the following distribution.
Let q(ε) = e^{W(−ε²/4)} for some constant ε ∈ (0, 2/√e), and let s be distributed as in eq.(10). Recall that bounded variables, such as the specifically constructed s, are sub-Gaussian. The rest of the proof follows as in Theorem 2, with the additional assumption that v is a sub-Gaussian vector.

4 PAC-Bayes Finite-Sample Bounds

In this section, we provide finite-sample guarantees for our previously derived asymptotic bounds. Our minimax generalization bounds are dimensionality-independent and O(1/√m) for m samples, for both general and sub-Gaussian random variables.

First, we provide a brief introduction to the PAC-Bayes framework in the context of linear classifiers. Let H be a set of linear classifiers, that is, a set of weight vectors w. After observing a training set, the task of the learner is to choose a posterior distribution Q of weight vectors with support H, such that the Bayes classifier has the smallest possible risk. The Bayes linear classifier f has the following decision function:

f(x | Q) = sgn(E_{w~Q}[w^T [x; 1]])    (15)

The output of the (deterministic) Bayes classifier is closely related to the output of the (stochastic) Gibbs classifier. The Gibbs linear classifier randomly chooses a (deterministic) classifier w according to Q in order to classify x. The expected risk of the Gibbs linear classifier is thus given by the probability of eq.(3), with respect to the data distribution and the posterior distribution Q. PAC-Bayes guarantees are given with respect to a prior distribution P of weight vectors, also with support H, and mostly focus on the analysis of the Gibbs classifier. As noted in [8], the expected risk of the Bayes classifier is at most twice the expected risk of the Gibbs classifier. Thus, any upper bound of the latter provides an upper bound of the former. The factor of 2 can sometimes be reduced to 1 + ε, as shown in [15].

In our analysis, we chose a specific n-dimensional ellipsoid as the support:

H = {w | w^T (Σ + µµ^T) w = 1}    (16)

The latter is for convenience.
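The convenience of this normalization can be illustrated numerically: rescaling any weight vector onto the ellipsoid H of eq.(16) leaves the Fisher function of eq.(5) unchanged. A minimal sketch in Python (the helper names and the example numbers are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3                                    # features; weight vectors live in R^(n+1)
mu = rng.normal(size=n + 1)
A = rng.normal(size=(n + 1, n + 1))
Sigma = A @ A.T + np.eye(n + 1)          # positive definite covariance

def fisher(w, mu, Sigma):
    """Fisher function F(w | mu, Sigma) = (w^T mu)^2 / (w^T Sigma w), eq.(5)."""
    return float(w @ mu) ** 2 / float(w @ Sigma @ w)

def project_to_H(w, mu, Sigma):
    """Rescale w so that w^T (Sigma + mu mu^T) w = 1, i.e. w lies on H, eq.(16)."""
    S = Sigma + np.outer(mu, mu)
    return w / np.sqrt(float(w @ S @ w))

w = rng.normal(size=n + 1)
w_H = project_to_H(w, mu, Sigma)

# Scale-invariance: F is unchanged by the rescaling onto H.
assert np.isclose(fisher(w, mu, Sigma), fisher(w_H, mu, Sigma))
# w_H indeed lies on the ellipsoid H.
S = Sigma + np.outer(mu, mu)
assert np.isclose(float(w_H @ S @ w_H), 1.0)
```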
Since the Fisher function uses only linear and quadratic terms, any distribution with support R^{n+1} can be reparametrized with respect to H and produce the same results. For instance, the Fisher function is scale-independent with respect to w. Intuitively speaking, the reparametrization can be performed by integrating the mass of the distribution with support R^{n+1} over every direction independently.
Indeed, the distributions P and Q could also be described with respect to angles. Fortunately, we do not need such a reparametrization for our analysis.

Next, we introduce the following Gibbs-Fisher function of weight vectors w sampled from a distribution Q, given the mean µ and covariance Σ of the data distribution:

F(Q | µ, Σ) ≡ (E_{w~Q}[w^T µ])² / (E_{w~Q}[w^T (Σ + µµ^T) w] − (E_{w~Q}[w^T µ])²)    (17)

The above expression appears in the bound of the expected risk of the Gibbs linear classifier, in the different scenarios under our analysis (cf. Theorems 6 and 11). Our exponential-tail bounds for sub-Gaussian variates depend on the Kullback-Leibler divergence, while our polynomial-tail bounds for general random variables depend on the Chi-squared divergence. Given two distributions P and Q with densities p and q, the Kullback-Leibler and Chi-squared divergences are defined as KL(Q‖P) = E_{w~Q}[log(q(w)/p(w))] and χ²(Q‖P) = E_{w~Q}[q(w)/p(w)] − 1, respectively.

Our proof strategy is to first show concentration of the projected mean and variance, and then use those results in order to show concentration of the Gibbs-Fisher function.

4.1 General Random Variables

First, we provide a bound of the expected risk of the Gibbs linear classifier for general random variables.

Theorem 6. Let z = y [x; 1] be a random variable with mean µ and covariance Σ. Let Q be the probability distribution of the random weight vector w. The expected risk of a linear classifier drawn from Q such that E_{w~Q}[w^T µ] > 0 is upper-bounded as follows:

P_{w~Q, z~Z(µ,Σ)}[w^T z ≤ 0] ≤ 1 / (1 + F(Q | µ, Σ))    (18)

Proof. Define the random variable s = w^T z. It is easy to verify that µ_s ≡ E_{w~Q,z~Z}[s] = E_{w~Q}[w^T µ] > 0 and σ_s² ≡ E_{w~Q,z~Z}[(s − µ_s)²] = E_{w~Q}[w^T (Σ + µµ^T) w] − (E_{w~Q}[w^T µ])². The rest of the proof follows as in Theorem 1.

Next, we show PAC-Bayes concentration of the projected mean for general random variables.

Lemma 7. Let z^(1), ..., z^(m) be m samples independently drawn from Z(µ, Σ), an arbitrary distribution with mean µ and covariance Σ.
Let µ̂ be the empirical mean computed from those m samples. For any prior distribution P with support H as in eq.(16), with probability at least 1 − δ:

(∀ Q)  E_{w~Q}[w^T (µ̂ − µ)] ≤ √( (χ²(Q‖P) + 1) / (mδ) )    (19)

Proof. Note that the random variable E_{w~P}[(w^T (µ̂ − µ))²] is non-negative. By Markov's inequality, with probability at least 1 − δ:

E_{w~P}[(w^T (µ̂ − µ))²] ≤ (1/δ) E_{z^(1)...z^(m)~Z} E_{w~P}[(w^T (µ̂ − µ))²]    (20)

Define the random variable s = w^T (z − µ). It is easy to verify that E_{z~Z}[s] = 0 and E_{z~Z}[s²] = w^T Σ w ≤ w^T (Σ + µµ^T) w = 1 on the support H as in eq.(16). Note that w^T (µ̂ − µ) = (1/m) Σ_{i=1..m} s^(i). Furthermore, s^(1), ..., s^(m) are independent since z^(1), ..., z^(m) are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(20) as follows:

E_{z^(1)...z^(m)~Z} E_{w~P}[(w^T (µ̂ − µ))²]
= E_{w~P} E_{z^(1)...z^(m)~Z}[ ((1/m) Σ_i s^(i))² ]
= E_{w~P} E_{z^(1)...z^(m)~Z}[ (1/m²) (Σ_i (s^(i))² + Σ_{i≠j} s^(i) s^(j)) ]
= E_{w~P} E_{z~Z}[s²] / m ≤ 1/m

where the cross terms vanish by independence and E_{z~Z}[s] = 0. By putting everything together, we have:

E_{w~P}[(w^T (µ̂ − µ))²] ≤ 1/(mδ)    (21)

By Jensen's and Cauchy-Schwarz inequalities, we have:

(∀ Q)  E_{w~Q}[w^T (µ̂ − µ)] ≤ E_{w~Q}[ |w^T (µ̂ − µ)| ]
= E_{w~Q}[ √(p(w)/q(w)) |w^T (µ̂ − µ)| √(q(w)/p(w)) ]
≤ √( E_{w~Q}[(p(w)/q(w)) (w^T (µ̂ − µ))²] ) √( E_{w~Q}[q(w)/p(w)] )
= √( E_{w~P}[(w^T (µ̂ − µ))²] (χ²(Q‖P) + 1) )

By using our bound in eq.(21), we prove our claim.

In what follows, we show PAC-Bayes concentration of the projected variance for general random variables.

Lemma 8. Let z^(1), ..., z^(m) be m samples independently drawn from Z(µ, Σ), an arbitrary distribution with mean µ and covariance Σ. Let µ̂ and Σ̂ be the empirical mean and covariance computed from those m samples. Assume z has a bounded fourth-order moment, that is, E[(z^T (Σ + µµ^T)^{−1} z)²] ≤ K. For any prior
distribution P with support H as in eq.(16), with probability at least 1 − δ:

(∀ Q)  E_{w~Q}[w^T (Σ̂ + µ̂µ̂^T) w − 1] ≤ √( K (χ²(Q‖P) + 1) / (mδ) )    (22)

Proof. Note that the random variable E_{w~P}[(w^T (Σ̂ + µ̂µ̂^T) w − 1)²] is non-negative. By Markov's inequality, with probability at least 1 − δ:

E_{w~P}[(w^T (Σ̂ + µ̂µ̂^T) w − 1)²] ≤ (1/δ) E_{z^(1)...z^(m)~Z} E_{w~P}[(w^T (Σ̂ + µ̂µ̂^T) w − 1)²]    (23)

Define the random variable s = (w^T z)² − 1. Note that on the support H as in eq.(16), we have E_{z~Z}[s] = w^T (Σ + µµ^T) w − 1 = 0. Additionally, let S = Σ + µµ^T; by our assumption of a bounded fourth-order moment, we have:

E_{z~Z}[s²] ≤ E_{z~Z}[(w^T z)^4] = E_{z~Z}[((S^{1/2} w)^T (S^{−1/2} z))^4]
≤ ‖S^{1/2} w‖₂^4 E_{z~Z}[‖S^{−1/2} z‖₂^4] = (w^T S w)² E_{z~Z}[(z^T S^{−1} z)²] ≤ K

Note that w^T (Σ̂ + µ̂µ̂^T) w − 1 = (1/m) Σ_{i=1..m} s^(i). Furthermore, s^(1), ..., s^(m) are independent since z^(1), ..., z^(m) are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(23). The rest of the proof follows similarly as in Lemma 7.

Finally, we show PAC-Bayes concentration of the Gibbs-Fisher function for general random variables. Our minimax generalization bound is dimensionality-independent and O(1/√m) for m samples.

Theorem 9. Let z^(1), ..., z^(m) be m samples independently drawn from Z(µ, Σ), an arbitrary distribution with mean µ and covariance Σ. Let µ̂ and Σ̂ be the empirical mean and covariance computed from those m samples. Assume z has a bounded fourth-order moment, that is, E[(z^T (Σ + µµ^T)^{−1} z)²] ≤ K. For any prior distribution P with support H as in eq.(16), with probability at least 1 − δ:

(∀ Q)  1/(1 + F(Q | µ̂, Σ̂)) ≤ 1/(1 + F(Q | µ, Σ)) + √( 18 max(1, K) (χ²(Q‖P) + 1) / (mδ) ) + O(1/m)    (24)

Proof. In order to obtain concentration of the projected mean and variance simultaneously, we apply the union bound to Lemmas 7 and 8.
That is, with probability at least 1 − δ, letting ε = √( 2 max(1, K) (χ²(Q‖P) + 1) / (mδ) ), we have:

(∀ Q)  |E_{w~Q}[w^T (µ̂ − µ)]| ≤ ε
(∀ Q)  |E_{w~Q}[w^T (Σ̂ + µ̂µ̂^T) w − 1]| ≤ ε    (25)

With some algebra, we can prove that for any Q, µ, Σ and S = Σ + µµ^T, we have:

1/(1 + F(Q | µ, Σ)) = 1 − (E_{w~Q}[w^T µ])² / E_{w~Q}[w^T S w]    (26)

Let Ŝ = Σ̂ + µ̂µ̂^T, α = E_{w~Q}[w^T µ], α̂ = E_{w~Q}[w^T µ̂] and β̂ = E_{w~Q}[w^T Ŝ w]. The concentration results in eq.(25) are equivalent to |α̂ − α| ≤ ε and |β̂ − 1| ≤ ε. Additionally, by eq.(26) and since E_{w~Q}[w^T S w] = 1 on the support H, the difference between the left-hand side of eq.(24) and the first term of its right-hand side equals α² − α̂²/β̂.

For any Q, µ, Σ and S = Σ + µµ^T, by positive semidefiniteness of Σ, we have:

(∀ w)  0 ≤ w^T Σ w
⟹ 0 ≤ E_{w~Q}[w^T Σ w]
⟹ 0 ≤ E_{w~Q}[w^T (S − µµ^T) w]
⟹ E_{w~Q}[(w^T µ)²] ≤ E_{w~Q}[w^T S w]

Note that 0 ≤ (E_{w~Q}[w^T µ])² ≤ E_{w~Q}[(w^T µ)²], and therefore α̂²/β̂ ∈ [0, 1] and α² ∈ [0, 1]. Since |α̂ − α| ≤ ε and α ≤ 1, we have:

α̂² − α² = 2α(α̂ − α) + (α̂ − α)² ≤ 2ε + ε²

Similarly:

α² − α̂² = 2α̂(α − α̂) + (α − α̂)² ≤ 2(α + ε)ε + ε² ≤ 2(1 + ε)ε + ε² = 2ε + 3ε²

Therefore, |α̂² − α²| ≤ 2ε + 3ε². Finally, since |α̂ − α| ≤ ε, |β̂ − 1| ≤ ε and α̂²/β̂ ≤ 1, we have:

|α̂²/β̂ − α²| ≤ |α̂²/β̂ − α̂²| + |α̂² − α²| = (α̂²/β̂)|β̂ − 1| + |α̂² − α²| ≤ ε + 2ε + 3ε² = 3ε + O(1/m)

and we prove our claim.
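The identity in eq.(26), on which the proof above pivots, can be sanity-checked numerically for a discrete posterior Q supported on H; below, Q is simply a uniform distribution over a few sampled weight vectors (an illustrative sketch with names of our choosing, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 3
mu = rng.normal(size=n + 1)
A = rng.normal(size=(n + 1, n + 1))
Sigma = A @ A.T + np.eye(n + 1)
S = Sigma + np.outer(mu, mu)

# A discrete Q: uniform over k weight vectors rescaled onto H = {w : w^T S w = 1}.
k = 5
W = rng.normal(size=(k, n + 1))
W = W / np.sqrt(np.einsum('ij,jl,il->i', W, S, W))[:, None]

alpha = np.mean(W @ mu)                             # E_Q[w^T mu]
beta = np.mean(np.einsum('ij,jl,il->i', W, S, W))   # E_Q[w^T S w] (equals 1 on H)

F_Q = alpha ** 2 / (beta - alpha ** 2)              # Gibbs-Fisher function, eq.(17)

# Identity (26): 1 / (1 + F(Q|mu,Sigma)) = 1 - (E_Q[w^T mu])^2 / E_Q[w^T S w]
assert np.isclose(1.0 / (1.0 + F_Q), 1.0 - alpha ** 2 / beta)
```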
4.2 Sub-Gaussian Random Variables

In this paper, we introduce the following definition of sub-Gaussianity of vectors.

Definition 10. Let H be a bounded set. A random vector z ~ Z is H-sub-Gaussian if for all distributions Q with support H and w ~ Q, the random variable s = w^T z with mean µ_s = E_{w~Q,z~Z}[w^T z] and variance σ_s² = E_{w~Q,z~Z}[(w^T z − µ_s)²] is sub-Gaussian with parameter σ_s. That is:

(∀ Q)  E_{w~Q,z~Z}[e^{w^T z − µ_s}] ≤ e^{σ_s²/2}    (27)

First, we provide a bound of the expected risk of the Gibbs linear classifier for sub-Gaussian random variables.

Theorem 11. Let z = y [x; 1] be a random variable with mean µ and covariance Σ. Let Q be the probability distribution of the random weight vector w with support H. Assume z is an H-sub-Gaussian vector. The expected risk of a linear classifier drawn from Q such that E_{w~Q}[w^T µ] > 0 is upper-bounded as follows:

P_{w~Q, z~Z(µ,Σ)}[w^T z ≤ 0] ≤ e^{−F(Q|µ,Σ)/2}    (28)

Proof. Define the random variable s = w^T z. It is easy to verify that µ_s ≡ E_{w~Q,z~Z}[s] = E_{w~Q}[w^T µ] > 0 and σ_s² ≡ E_{w~Q,z~Z}[(s − µ_s)²] = E_{w~Q}[w^T (Σ + µµ^T) w] − (E_{w~Q}[w^T µ])². Furthermore, by Definition 10, the random variable s is sub-Gaussian with parameter σ_s. The rest of the proof follows as in Theorem 4.

Next, we show PAC-Bayes concentration of the projected mean for sub-Gaussian random variables. In the following lemma, while we concentrate on upper-bounding E_{w~Q}[w^T (µ̂ − µ)] for all Q, a similar argument also bounds E_{w~Q}[w^T (µ − µ̂)].

Lemma 12. Let z^(1), ..., z^(m) be m samples independently drawn from Z(µ, Σ), an arbitrary distribution with mean µ and covariance Σ. Let µ̂ be the empirical mean computed from those m samples. Assume z is a sub-Gaussian vector. For any prior distribution P with support H as in eq.(16), with probability at least 1 − δ:

(∀ Q)  E_{w~Q}[w^T (µ̂ − µ)] ≤ (1/√m) ( KL(Q‖P) + log(e^{1/2}/δ) )    (29)

Proof. Let t ∈ R be a constant. Note that the random variable E_{w~P}[e^{t w^T (µ̂ − µ)}] is non-negative.
By Markov's inequality, with probability at least 1 − δ:

E_{w~P}[e^{t w^T (µ̂ − µ)}] ≤ (1/δ) E_{z^(1)...z^(m)~Z} E_{w~P}[e^{t w^T (µ̂ − µ)}]    (30)

Define the random variable s = w^T (z − µ). It is easy to verify that E_{z~Z}[s] = 0. As we argued in Theorem 4, the variable s is sub-Gaussian with parameter σ_s where σ_s² = w^T Σ w. Furthermore, σ_s² ≤ w^T (Σ + µµ^T) w = 1 on the support H as in eq.(16). Note that w^T (µ̂ − µ) = (1/m) Σ_{i=1..m} s^(i). Furthermore, s^(1), ..., s^(m) are independent since z^(1), ..., z^(m) are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(30) as follows:

E_{z^(1)...z^(m)~Z} E_{w~P}[e^{t w^T (µ̂ − µ)}]
= E_{w~P} E_{z^(1)...z^(m)~Z}[e^{(t/m) Σ_i s^(i)}]
= E_{w~P} Π_{i=1..m} E_{z~Z}[e^{(t/m) s}]
≤ E_{w~P} Π_{i=1..m} e^{t²/(2m²)} = e^{t²/(2m)}

By taking the logarithm on each side of eq.(30), and since E_{w~P}[f(w)] = E_{w~Q}[(p(w)/q(w)) f(w)] for every distribution Q and function f : R^{n+1} → R, we have:

(∀ Q)  log E_{w~Q}[(p(w)/q(w)) e^{t w^T (µ̂ − µ)}] ≤ log( (1/δ) E_{z^(1)...z^(m)~Z} E_{w~P}[e^{t w^T (µ̂ − µ)}] )

By using Jensen's inequality, we can lower-bound the left-hand side of the above expression:

(∀ Q)  log E_{w~Q}[(p(w)/q(w)) e^{t w^T (µ̂ − µ)}] ≥ E_{w~Q}[ log((p(w)/q(w)) e^{t w^T (µ̂ − µ)}) ] = −KL(Q‖P) + t E_{w~Q}[w^T (µ̂ − µ)]

By putting everything together, we have:

(∀ Q)  E_{w~Q}[w^T (µ̂ − µ)] ≤ (1/t) ( KL(Q‖P) + log(e^{t²/(2m)}/δ) )

By setting t = √m, we prove our claim.

In what follows, we show PAC-Bayes concentration of the projected variance for sub-Gaussian random variables. In the following lemma, while we concentrate on upper-bounding E_{w~Q}[w^T (Σ̂ + µ̂µ̂^T) w − 1] for all Q, a similar argument also bounds E_{w~Q}[1 − w^T (Σ̂ + µ̂µ̂^T) w].

Lemma 13. Let z^(1), ..., z^(m) be m ≥ 16 samples independently drawn from Z(µ, Σ), an arbitrary distribution with mean µ and covariance Σ. Let µ̂ and Σ̂ be the empirical mean and covariance computed from those m samples. Assume z is a sub-Gaussian vector. For any prior distribution P with support H as in eq.(16),
with probability at least 1 − δ:

(∀ Q)  E_{w~Q}[w^T (Σ̂ + µ̂µ̂^T) w − 1] ≤ (1/√m) ( KL(Q‖P) + log(e^{16}/δ) )    (31)

Proof. Let t ∈ R be a constant. Note that the random variable E_{w~P}[e^{t (w^T (Σ̂ + µ̂µ̂^T) w − 1)}] is non-negative. By Markov's inequality, with probability at least 1 − δ:

E_{w~P}[e^{t (w^T (Σ̂ + µ̂µ̂^T) w − 1)}] ≤ (1/δ) E_{z^(1)...z^(m)~Z} E_{w~P}[e^{t (w^T (Σ̂ + µ̂µ̂^T) w − 1)}]    (32)

Define the random variable s = (w^T z)² − 1. As we argued in Theorem 4, the random variable w^T z is sub-Gaussian with parameter σ where σ² = w^T Σ w. Furthermore, σ² ≤ w^T (Σ + µµ^T) w = 1 on the support H as in eq.(16). Additionally, on the support H, we have E_{z~Z}[s] = w^T (Σ + µµ^T) w − 1 = 0, and furthermore s is sub-exponential since it is the square of a sub-Gaussian variable. In what follows, we use a bound for the moment generating function of a sub-exponential variable that resembles that of sub-Gaussian variables. (See Appendix B for details.) Note that w^T (Σ̂ + µ̂µ̂^T) w − 1 = (1/m) Σ_{i=1..m} s^(i). Furthermore, s^(1), ..., s^(m) are independent since z^(1), ..., z^(m) are independent. Thus, we can upper-bound the expected value on the right-hand side of eq.(32) as follows:

E_{z^(1)...z^(m)~Z} E_{w~P}[e^{t (w^T (Σ̂ + µ̂µ̂^T) w − 1)}]
= E_{w~P} E_{z^(1)...z^(m)~Z}[e^{(t/m) Σ_i s^(i)}]
= E_{w~P} Π_{i=1..m} E_{z~Z}[e^{(t/m) s}]
≤ E_{w~P} Π_{i=1..m} e^{16 t²/m²} = e^{16 t²/m}  for t/m ≤ 1/4

The rest of the proof follows similarly as in Lemma 12. Note that since we set t = √m, the condition t/m ≤ 1/4 for applying the bound for the moment generating function of the sub-exponential variable s leads to the condition m ≥ 16.

Finally, we show PAC-Bayes concentration of the Gibbs-Fisher function for sub-Gaussian random variables. Our minimax generalization bound is dimensionality-independent and O(1/√m) for m samples.

Theorem 14. Let z^(1), ..., z^(m) be m ≥ 16 samples independently drawn from Z(µ, Σ), an arbitrary distribution with mean µ and covariance Σ. Let µ̂ and Σ̂ be the empirical mean and covariance computed from those m samples. Assume z is a sub-Gaussian vector.
For any prior distribution P with support H as in eq.(16), with probability at least 1 − δ:

(∀ Q)  e^{−F(Q|µ̂,Σ̂)/2} ≤ e^{−F(Q|µ,Σ)/2} + (6/√m) ( KL(Q‖P) + log(4e^{16}/δ) ) + O(1/m)    (33)

Proof. In order to obtain two-sided concentration of both the projected mean and the projected variance simultaneously, we apply the union bound to the one-sided results in Lemmas 12 and 13. That is, with probability at least 1 − δ, letting ε = (1/√m) ( KL(Q‖P) + log(4e^{16}/δ) ), we have:

(∀ Q)  |E_{w~Q}[w^T (µ̂ − µ)]| ≤ ε
(∀ Q)  |E_{w~Q}[w^T (Σ̂ + µ̂µ̂^T) w − 1]| ≤ ε    (34)

With some algebra, we can prove that for any Q, µ, Σ and S = Σ + µµ^T, we have:

e^{−F(Q|µ,Σ)/2} = exp( −(1/2) (E_{w~Q}[w^T µ])² / (E_{w~Q}[w^T S w] − (E_{w~Q}[w^T µ])²) )    (35)

Let Ŝ = Σ̂ + µ̂µ̂^T, α = E_{w~Q}[w^T µ], α̂ = E_{w~Q}[w^T µ̂] and β̂ = E_{w~Q}[w^T Ŝ w]. The concentration results in eq.(34) are equivalent to |α̂ − α| ≤ ε and |β̂ − 1| ≤ ε. Note that by eq.(35), and since E_{w~Q}[w^T S w] = 1 on the support H, the left-hand side of eq.(33) equals exp( −(α̂²/β̂) / (2(1 − α̂²/β̂)) ), while the first term of its right-hand side equals exp( −α² / (2(1 − α²)) ). Furthermore, by Lipschitz continuity of the function x ↦ exp(−x/(2(1 − x))) on [0, 1):

| exp( −(α̂²/β̂) / (2(1 − α̂²/β̂)) ) − exp( −α² / (2(1 − α²)) ) | ≤ 2 |α̂²/β̂ − α²|

The rest of the proof follows as in Theorem 9, showing that |α̂²/β̂ − α²| ≤ 3ε + O(1/m).

5 Concluding Remarks

There are several ways of extending this research. While in this paper we focused on the PAC-Bayes framework, one future goal is to provide guarantees for all weight vectors w. In order to obtain bounds that are either independent of, or logarithmically dependent on, the dimension, it might be necessary to produce novel bounds for the Rademacher and/or Gaussian complexity of not-everywhere-bounded functions. Finally, since our bounds on the expected risk are worst-case (among all data distributions with a given mean and covariance), it would be interesting to analyze their applicability to scenarios where each training sample may come from a different distribution, as well as for transfer learning.
References

[1] M. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. COLT, 2013.
[2] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.
[3] D. Bertsimas, I. Popescu, and J. Sethuraman. Moment problems and semidefinite optimization. Handbook of Semidefinite Optimization, 2000.
[4] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Advanced Lectures on Machine Learning, 2004.
[5] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2002.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[7] R. Fukuda. Exponential integrability of sub-Gaussian vectors. Probability Theory and Related Fields, 1990.
[8] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. ICML, 2009.
[9] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Communications in Probability, 2012.
[10] C. Jin and L. Wang. Dimensionality dependent PAC-Bayes margin bound. NIPS, 2012.
[11] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. NIPS, 2008.
[12] G. Lanckriet, L. El Ghaoui, and M. Jordan. Robust novelty detection with single-class MPM. NIPS, 2002.
[13] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. Minimax probability machine. NIPS, 2001.
[14] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. JMLR, 2002.
[15] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. NIPS, 2002.
[16] A. Marshall and I. Olkin. Multivariate Chebyshev inequalities. The Annals of Mathematical Statistics, 1960.
[17] D. McAllester. Simplified PAC-Bayesian margin bounds. COLT, 2003.
[18] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 2006.
[19] A. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. ICML, 2004.
[20] S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 2005.
[21] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. JMLR, 2010.
[22] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for nonparametric object and scene recognition. PAMI, 2008.
[23] T. Zhang. Covering number bounds of certain regularized linear function classes. JMLR, 2002.
A Proof of Expression in Equation (6)

In this section, we restate the results provided in [3, 16] in order to obtain eq.(6). We follow the well-known Lagrangian duality approach as in Appendix A of [14]. The following result is provided in [3, 16]. Let Ω(µ, Σ) be the family of all distributions with mean µ and covariance Σ. For a fixed weight vector a and constant b, we have:

\sup_{Z \in \Omega(\mu, \Sigma)} P_{z \sim Z}[a^T z \le b] = \frac{1}{1 + d^2},  where  d^2 = \inf_{a^T z \le b} (z - \mu)^T \Sigma^{-1} (z - \mu)

Let a = w and b = 0. We have:

\sup_{Z \in \Omega(\mu, \Sigma)} P_{z \sim Z}[w^T z \le 0] = \frac{1}{1 + d^2},  where  d^2 = \inf_{w^T z \le 0} (z - \mu)^T \Sigma^{-1} (z - \mu)

Note that if w^T µ ≤ 0, then we can just take z = µ and obtain d² = 0, which is certainly the optimum because d² ≥ 0 due to the positive definiteness of Σ. In what follows, we assume w^T µ > 0, as required in eq.(6). We are interested in the value of d². That is, we seek a closed-form solution of the primal problem:

\min_{w^T z \le 0} (z - \mu)^T \Sigma^{-1} (z - \mu)   (36)

which has the following Lagrangian:

L(z, \lambda) = (z - \mu)^T \Sigma^{-1} (z - \mu) + \lambda w^T z

By optimality arguments (i.e. ∂L/∂z = 0), we have that L is minimized at z^* = \mu - \frac{\lambda}{2} \Sigma w. Therefore, the Lagrange dual function is given by:

g(\lambda) = \inf_z L(z, \lambda) = L(z^*, \lambda) = -\frac{\lambda^2}{4} w^T \Sigma w + \lambda w^T \mu

Consequently, the dual problem of eq.(36) is:

\max_{\lambda \ge 0} g(\lambda)

Again, by optimality arguments (i.e. ∂g/∂λ = 0), we have that g is maximized at \lambda^* = \frac{2 w^T \mu}{w^T \Sigma w}. Note that λ* ≥ 0 since w^T µ > 0. Finally:

d^2 = \max_{\lambda \ge 0} g(\lambda) = g(\lambda^*) = \frac{(w^T \mu)^2}{w^T \Sigma w} \equiv F_w(\mu, \Sigma)

B Moment Generating Function of the Square of a Sub-Gaussian Variable

Let s be a sub-Gaussian variable with parameter σ_s and mean µ_s = E[s]. By sub-Gaussianity, we know that the moment generating function is bounded as follows:

(\forall t \in \mathbb{R})  E[e^{t (s - \mu_s)}] \le e^{\frac{1}{2} t^2 \sigma_s^2}

Our goal is to find a similar bound for the moment generating function of the sub-exponential variable v = s². Let Γ(r) be the Gamma function; the moments of the sub-Gaussian variable s are bounded as follows:

(\forall r > 0)  E[|s|^r] \le r \, 2^{r/2} \sigma_s^r \, \Gamma(r/2)

Let µ_v = E[v].
By power series expansion, and since Γ(r) = (r − 1)! for an integer r, we have:

E[e^{t (v - \mu_v)}] = 1 + t E[v - \mu_v] + \sum_{r=2}^{\infty} \frac{t^r E[(v - \mu_v)^r]}{r!}
 \le 1 + \sum_{r=2}^{\infty} \frac{t^r E[|s|^{2r}]}{r!}
 \le 1 + \sum_{r=2}^{\infty} \frac{t^r \, 2r \, 2^r \sigma_s^{2r} \, \Gamma(r)}{r!}
 = 1 + \sum_{r=2}^{\infty} t^r \, 2^{r+1} \sigma_s^{2r}
 = 1 + \frac{8 t^2 \sigma_s^4}{1 - 2 t \sigma_s^2}

By making |t| ≤ 1/(4σ_s²), we have 1/(1 − 2tσ_s²) ≤ 2. Finally, since (∀α) 1 + α ≤ e^α, we have that for a sub-Gaussian variable s with parameter σ_s:

(\forall |t| \le 1/(4 \sigma_s^2))  E[e^{t (s^2 - E[s^2])}] \le e^{16 t^2 \sigma_s^4}   (37)

Thus, we obtained a bound for the moment generating function of the sub-exponential variable s² that is similar to that of sub-Gaussian variables, but holds only for a small range of t.
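As a concrete sanity check of eq.(37): when s is Gaussian with mean zero and variance σ_s² (hence sub-Gaussian with parameter σ_s), the moment generating function of s² is available in closed form, E[e^{t(s² − σ_s²)}] = e^{−tσ_s²}/√(1 − 2tσ_s²) for t < 1/(2σ_s²). The Python sketch below (our illustration, not part of the original proof) verifies that this closed form stays below e^{16 t² σ_s⁴} on a grid of |t| ≤ 1/(4σ_s²); the helper names are ours.

```python
import math

def gaussian_mgf_centered_square(t: float, sigma: float) -> float:
    """E[exp(t * (s^2 - sigma^2))] for Gaussian s ~ N(0, sigma^2);
    closed form valid for t < 1/(2 * sigma^2)."""
    return math.exp(-t * sigma**2) / math.sqrt(1.0 - 2.0 * t * sigma**2)

def sub_exponential_bound(t: float, sigma: float) -> float:
    """Right-hand side of eq.(37): exp(16 * t^2 * sigma^4)."""
    return math.exp(16.0 * t**2 * sigma**4)

sigma = 1.7
t_max = 1.0 / (4.0 * sigma**2)
# Check the bound on a symmetric grid of t in [-t_max, t_max].
ok = all(
    gaussian_mgf_centered_square(t, sigma) <= sub_exponential_bound(t, sigma)
    for t in (i / 100.0 * t_max for i in range(-100, 101))
)
print(ok)  # True
```

The check is loose at the endpoints (e.g. at t = 1/(4σ_s²) the closed form is about 1.10 while the bound is e ≈ 2.72), consistent with the constant 16 in eq.(37) not being optimized.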