A Strongly Quasiconvex PAC-Bayesian Bound
1 A Strongly Quasiconvex PAC-Bayesian Bound
Yevgeny Seldin
NIPS-2017 Workshop on (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights
Based on joint work with Niklas Thiemann, Christian Igel, and Olivier Wintenberger (ALT 2017)
2-3 Quick Summary
Two major ways to convexify classification with the 0-1 loss:
- Convexify the loss
- Work in the space of distributions over H (PAC-Bayes)
We propose:
- A relaxation of the PAC-Bayes-kl bound (Seeger, 2002) and an alternating minimization procedure
- Sufficient conditions for strong quasiconvexity of the bound, which guarantee convergence to the global minimum
- A construction of a hypothesis space tailored for the bound
In our experiments, rigorous minimization of the bound was competitive with cross-validation in tuning the trade-off between complexity and empirical performance.
4 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
5-6 Randomized Classifiers
Let ρ be a distribution over H. At each round of the game:
1. Pick h ∈ H according to ρ(h)
2. Observe x
3. Return h(x)
Expected loss of ρ: E_{h~ρ}[L(h)] = E_ρ[L(h)]
Empirical loss of ρ on a sample S: E_{h~ρ}[L̂(h,S)] = E_ρ[L̂(h,S)]
7-8 Approximation-Estimation Perspective (Bias-Variance)
Randomized classification: avoid selection when it is not necessary.
If L̂(h,S) ≈ L̂(h′,S) and π(h) ≈ π(h′), take ρ(h) ≈ ρ(h′).
Reduced variance at the same bias level.
9 Kullback-Leibler (KL) Divergence = Relative Entropy
KL divergence: let ρ and π be two distributions over H, then
KL(ρ‖π) = E_ρ[ln(ρ(h)/π(h))].
Binary kl divergence: for two Bernoulli random variables with biases p and q,
kl(p‖q) = KL([p, 1−p] ‖ [q, 1−q]).
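As a concrete illustration (not part of the slides), a minimal Python sketch of these two quantities for a finite H; the function names are ours:

```python
import numpy as np

def kl_divergence(rho, pi):
    """KL(rho || pi) for discrete distributions given as probability vectors over H."""
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    support = rho > 0  # terms with rho(h) = 0 contribute 0 by convention
    return float(np.sum(rho[support] * np.log(rho[support] / pi[support])))

def binary_kl(p, q):
    """kl(p || q) between Bernoulli random variables with biases p and q."""
    return kl_divergence([p, 1.0 - p], [q, 1.0 - q])
```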
10-11 PAC-Bayes-kl Inequality
Theorem (Seeger, 2002). For any prior π over H and any δ ∈ (0,1), with probability greater than 1 − δ over a random draw of a sample S, for all distributions ρ over H simultaneously:
kl(E_ρ[L̂(h,S)] ‖ E_ρ[L(h)]) ≤ (KL(ρ‖π) + ln(2√n/δ)) / n.
Challenge: the bound is not convex in ρ.
Common heuristic: replace it with a parametrized trade-off β n E_ρ[L̂(h,S)] + KL(ρ‖π) and tune β by cross-validation.
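To turn the kl form into an explicit upper bound on E_ρ[L(h)], one can invert the binary kl numerically. A hedged sketch of this standard step (our own illustration, reusing binary_kl from above):

```python
def kl_inverse_upper(p_hat, eps, tol=1e-9):
    """Largest q >= p_hat with kl(p_hat || q) <= eps, found by bisection."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(p_hat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# E_rho[L(h)] <= kl_inverse_upper(E_rho[hat L(h, S)], (KL(rho || pi) + ln(2 sqrt(n) / delta)) / n)
```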
12 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
13-14 Relaxation of PAC-Bayes-kl
Based on a refined Pinsker's inequality.
Theorem (PAC-Bayes-λ Inequality). For any prior π and any δ ∈ (0,1), with probability greater than 1 − δ, for all ρ and λ ∈ (0,2) simultaneously:
E_ρ[L(h)] ≤ E_ρ[L̂(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n).
For the optimal λ this leads to
E_ρ[L(h)] ≤ E_ρ[L̂(h,S)] + √(2 E_ρ[L̂(h,S)] (KL(ρ‖π) + ln(2√n/δ)) / n) + 2 (KL(ρ‖π) + ln(2√n/δ)) / n,
giving a fast convergence rate when the empirical loss is small.
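A hedged numerical sketch of the PAC-Bayes-λ bound for a finite hypothesis set (names are illustrative; kl_divergence is from the earlier sketch):

```python
import numpy as np

def pac_bayes_lambda_bound(rho, pi, emp_losses, n, delta, lam):
    """Evaluate the right-hand side of the PAC-Bayes-lambda bound."""
    assert 0.0 < lam < 2.0
    emp = float(np.dot(rho, emp_losses))  # E_rho[hat L(h, S)]
    complexity = kl_divergence(rho, pi) + np.log(2.0 * np.sqrt(n) / delta)
    return emp / (1.0 - lam / 2.0) + complexity / (lam * (1.0 - lam / 2.0) * n)
```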
15-18 Alternating Minimization of PAC-Bayes-λ
F(ρ,λ) = E_ρ[L̂(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n)
For a fixed λ the bound is convex in ρ and minimized by
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}].
For a fixed ρ the bound is convex in λ and minimized by
λ = 2 / (√(2n E_ρ[L̂(h,S)] / (KL(ρ‖π) + ln(2√n/δ)) + 1) + 1).
F(ρ,λ) is not necessarily jointly convex in ρ and λ.
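A hedged sketch of the alternating minimization over a finite H (our own illustration; emp_losses holds L̂(h,S) per hypothesis, and kl_divergence / pac_bayes_lambda_bound come from the sketches above):

```python
import numpy as np

def optimal_rho(pi, emp_losses, n, lam):
    """rho_lambda(h) proportional to pi(h) * exp(-lam * n * hat L(h, S)), computed in log space."""
    log_w = np.log(pi) - lam * n * np.asarray(emp_losses)
    log_w -= log_w.max()  # shift for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def optimal_lambda(rho, pi, emp_losses, n, delta):
    """Closed-form minimizer of the bound in lambda for a fixed rho."""
    emp = float(np.dot(rho, emp_losses))
    comp = kl_divergence(rho, pi) + np.log(2.0 * np.sqrt(n) / delta)
    return 2.0 / (np.sqrt(2.0 * n * emp / comp + 1.0) + 1.0)

def alternating_minimization(pi, emp_losses, n, delta, lam=1.0, tol=1e-8, max_iter=1000):
    """Alternate the two closed-form updates until the bound stops decreasing."""
    bound = np.inf
    for _ in range(max_iter):
        rho = optimal_rho(pi, emp_losses, n, lam)
        lam = optimal_lambda(rho, pi, emp_losses, n, delta)
        new_bound = pac_bayes_lambda_bound(rho, pi, emp_losses, n, delta, lam)
        converged = bound - new_bound < tol
        bound = new_bound
        if converged:
            break
    return rho, lam, bound
```

Each update has a closed form, so every iteration is a single pass over the m empirical losses.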
19-21 Simplification 1
F(ρ,λ) = E_ρ[L̂(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n)
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}]
Plugging ρ_λ into F gives
F(λ) = F(ρ_λ, λ) = E_{ρ_λ}[L̂(h,S)] / (1 − λ/2) + (KL(ρ_λ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n),
a one-dimensional function.
22-24 Simplification 2
F(λ) = F(ρ_λ, λ) = E_{ρ_λ}[L̂(h,S)] / (1 − λ/2) + (KL(ρ_λ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n)
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}]
KL(ρ_λ‖π) = E_{ρ_λ}[ln(ρ_λ(h)/π(h))] = E_{ρ_λ}[ln(e^{−nλL̂(h,S)} / E_π[e^{−nλL̂(h,S)}])] = −nλ E_{ρ_λ}[L̂(h,S)] − ln E_π[e^{−nλL̂(h,S)}]
Substituting gives
F(λ) = (−ln E_π[e^{−nλL̂(h,S)}] + ln(2√n/δ)) / (nλ(1 − λ/2)).
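A hedged sketch of F(λ) as a one-dimensional function, using a log-sum-exp for ln E_π[e^{−nλL̂(h,S)}] (illustrative, not from the slides):

```python
import numpy as np
from scipy.special import logsumexp

def F(lam, pi, emp_losses, n, delta):
    """F(lambda) after both simplifications, for a finite hypothesis set."""
    log_mgf = logsumexp(np.log(pi) - n * lam * np.asarray(emp_losses))  # ln E_pi[e^{-n lam hat L}]
    return (-log_mgf + np.log(2.0 * np.sqrt(n) / delta)) / (n * lam * (1.0 - lam / 2.0))
```

Plotting F over a grid of λ in (0, 2) is an easy way to inspect its (quasi)convexity on a given problem.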
25 Strong Quasiconvexity - Sufficient Condition
Theorem (Strong Quasiconvexity). If at least one of the two conditions
2 KL(ρ_λ‖π) + ln(4n/δ²) > λ²n² Var_{ρ_λ}[L̂(h,S)]
or
E_{ρ_λ}[L̂(h,S)] > (1 − λ) n Var_{ρ_λ}[L̂(h,S)]
is satisfied for all λ ∈ [√(ln(2√n/δ)/n), 1], then F(λ) is strongly quasiconvex for λ ∈ (0, 1] and alternating minimization converges to the global minimum of F.
26 Proof Highlights
F(λ) = (−ln E_π[e^{−nλL̂(h,S)}] + ln(2√n/δ)) / (nλ(1 − λ/2))
Show that the second derivative of F(λ) is positive at all stationary points, using
−(1/n) d/dλ ln E_π[e^{−nλL̂(h,S)}] = E_{ρ_λ}[L̂(h,S)]
(1/n) d²/dλ² ln E_π[e^{−nλL̂(h,S)}] = n Var_{ρ_λ}[L̂(h,S)].
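These two identities can be checked numerically; a hedged finite-difference sketch on a toy hypothesis set (our own illustration, all values arbitrary):

```python
import numpy as np
from scipy.special import logsumexp

def log_mgf(lam, pi, losses, n):
    """ln E_pi[exp(-n * lam * hat L(h, S))]."""
    return logsumexp(np.log(pi) - n * lam * np.asarray(losses))

pi = np.full(5, 0.2)                               # uniform prior over 5 hypotheses
losses = np.array([0.10, 0.15, 0.22, 0.30, 0.38])  # toy empirical losses
n, lam, h = 50, 0.5, 1e-4

rho = np.exp(np.log(pi) - n * lam * losses - log_mgf(lam, pi, losses, n))
mean = rho @ losses
var = rho @ (losses - mean) ** 2

d1 = (log_mgf(lam + h, pi, losses, n) - log_mgf(lam - h, pi, losses, n)) / (2 * h)
d2 = (log_mgf(lam + h, pi, losses, n) - 2 * log_mgf(lam, pi, losses, n)
      + log_mgf(lam - h, pi, losses, n)) / h ** 2

print(np.isclose(-d1 / n, mean))                 # -(1/n) d/dlam ln E_pi[...] = E_{rho_lam}[hat L]
print(np.isclose(d2 / n, n * var, rtol=1e-3))    # (1/n) d^2/dlam^2 ln E_pi[...] = n Var_{rho_lam}[hat L]
```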
27-28 Weak Separation - Sufficient Condition for Strong Quasiconvexity
Theorem (Weak Separation). Let H be finite with |H| = m, let π be uniform, and let h* be an empirical loss minimizer. Define separation thresholds a < b, explicit quantities of order √(ln(mn/δ)/n). If the number of hypotheses for which L̂(h,S) ∈ (L̂(h*,S) + a, L̂(h*,S) + b) is sufficiently small (the theorem gives an explicit threshold), then F(λ) is strongly quasiconvex and alternating minimization converges to the global minimum.
29 Proof Highlights
By the Strong Quasiconvexity Theorem, if Var_{ρ_λ}[L̂(h,S)] is small, then F(λ) is strongly quasiconvex.
Let Δ_h = L̂(h,S) − L̂(h*,S). Then
Var_{ρ_λ}[L̂(h,S)] ≤ E_{ρ_λ}[Δ_h²] = Σ_h ρ_λ(h) Δ_h² = (Σ_h Δ_h² e^{−nλΔ_h}) / (Σ_h e^{−nλΔ_h}).
30-32 Breaking the Quasiconvexity
It is possible to break the quasiconvexity, but one has to work hard for it.
For example, taking n = 200, δ = 0.25, a very large m, Δ_h = 0.1, and a uniform π breaks it.
In all our experiments F(λ) was convex, even when the weak separation sufficient condition was violated, so it might be possible to relax the sufficient condition further.
33 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
34-36 Challenge
Computation of the normalization of ρ_λ can be prohibitively expensive:
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}] = π(h) e^{−λn L̂(h,S)} / Σ_{h′} π(h′) e^{−λn L̂(h′,S)}
Parametrization of ρ may break the convexity.
Solution: work with a finite H. We need a powerful finite H.
37-39 Construction of a Finite Sample-Dependent H
- Select m = |H| subsamples of r points each.
- Train a model h on its r points and validate it on the remaining n − r points.
- Validation loss: L̂^val(h).
Adapted bound:
E_ρ[L(h)] ≤ E_ρ[L̂^val(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√(n − r + 1)/δ)) / ((n − r)λ(1 − λ/2))
Special case: k-fold cross-validation. Most computational advantage is achieved by inverse cross-validation (train on a single small fold, validate on the remaining folds).
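A hedged sketch of this construction with scikit-learn SVMs (hypothetical helper names; the subsampling scheme and kernel settings are illustrative, not the authors' exact setup):

```python
import numpy as np
from sklearn.svm import SVC

def build_hypothesis_set(X, y, m, r, seed=0):
    """Train m weak SVMs on random subsamples of size r; validate each on the held-out points."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hypotheses, val_losses = [], []
    for _ in range(m):
        idx = rng.choice(n, size=r, replace=False)     # assumes each subsample contains both classes
        mask = np.ones(n, dtype=bool)
        mask[idx] = False
        h = SVC(kernel="rbf").fit(X[idx], y[idx])
        val_losses.append(np.mean(h.predict(X[mask]) != y[mask]))  # 0-1 validation loss
        hypotheses.append(h)
    return hypotheses, np.array(val_losses)

# The adapted bound is then minimized with n - r in place of n, e.g. (d is the feature dimension):
# hs, losses = build_hypothesis_set(X, y, m=len(y), r=d + 1)
# rho, lam, bound = alternating_minimization(np.full(len(hs), 1 / len(hs)), losses,
#                                            n=len(y) - (d + 1), delta=0.05)
```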
40 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
41-42 Experiments
We compare:
- Kernel SVM trained by cross-validation
- ρ-weighting of multiple weak SVMs, each trained on d + 1 samples
* More precisely, we apply ρ-weighted majority-vote aggregation
MV_ρ(x) = sign(Σ_h ρ(h) h(x)),
but in our case there was no significant difference between L(MV_ρ) and E_ρ[L(h)].
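A hedged sketch of the ρ-weighted majority vote over the hypothesis set built above (labels assumed to be ±1; names are illustrative):

```python
import numpy as np

def rho_weighted_majority_vote(hypotheses, rho, X):
    """MV_rho(x) = sign(sum_h rho(h) * h(x)) for labels in {-1, +1}."""
    votes = np.stack([h.predict(X) for h in hypotheses])  # shape (m, n_points)
    return np.sign(rho @ votes)                           # ties (value 0) can be broken arbitrarily
```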
43-45 Rough Runtime Comparison
k-fold cross-validation of kernel SVMs:
k·n^{2+} (training) + V·k·n^{2+} (validation),
where n^{2+} denotes the superquadratic cost of training a kernel SVM on n points.
PAC-Bayesian aggregation of kernel SVMs, for r = d + 1 and m = n:
m·r^{2+} (training) + m·r·n (validation) + A (aggregation), where m·r·n ≈ d·n² and A is the cost of the aggregation step.
Computational speed-up!
46 Experiments: results on four datasets: (a) Ionosphere, n = 200, r = d + 1 = 35; (b) Waveform, n = 2000, r = d + 1 = 41; (c) Breast cancer, n = 340, r = d + 1 = 11; (d) AvsB, n = 1000, r = d + 1 = 17.
47-48 Summary
We proposed:
- A relaxation of the PAC-Bayes-kl bound (Seeger, 2002)
- An alternating minimization procedure
- Sufficient conditions for strong quasiconvexity, which guarantee convergence to the global minimum
- A construction of H
In our experiments, rigorous minimization of the bound was competitive with cross-validation in tuning the trade-off between complexity and empirical performance.
Rigorous minimization of a theoretical bound competitive with cross-validation!
49-52 What's next?
Improved sufficient conditions:
- In practice the bound was strongly convex even when the weak separation sufficient condition was violated; relax the sufficient condition.
- We dropped some terms when going from the Strong Quasiconvexity Theorem (re-shown on slide 50) to the Weak Separation Condition.
Improved analysis of the weighted majority vote:
- Combine the results with improved analyses of the weighted majority vote (the "C-bound"):
  Lacasse, Laviolette, Marchand, Germain, and Usunier, NIPS 2007;
  Laviolette, Marchand, and Roy, ICML 2011;
  Germain, Lacasse, Laviolette, Marchand, and Roy, JMLR 2015.