Variations sur la borne PAC-bayésienne


1 Variations sur la borne PAC-bayésienne. Pascal Germain, INRIA Paris, Équipe SIERRA. Seminar of the Département d'informatique et de génie logiciel, Université Laval, July 11, 2016.

2 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

3 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

4 Definitions.
Learning example: an example (x, y) ∈ X × Y is a description-label pair.
Data-generating distribution: each example is an i.i.d. observation from a distribution D on X × Y.
Learning sample: S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ∼ D^n.
Predictors (or hypotheses): h : X → Y, h ∈ H.
Learning algorithm: A(S) → h.
Loss function: ℓ : H × X × Y → ℝ.
Empirical loss: L_S^ℓ(h) = (1/n) Σ_{i=1}^{n} ℓ(h, x_i, y_i).
Generalization loss: L_D^ℓ(h) = E_{(x,y)∼D} ℓ(h, x, y).

5 PAC-Bayesian Theory. Initiated by McAllester (1999), the PAC-Bayesian theory gives PAC generalization guarantees to Bayesian-like algorithms.
PAC guarantees (Probably Approximately Correct): with probability at least 1 − δ, the loss of predictor h is less than ε:
Pr_{S∼D^n}( L_D^ℓ(h) ≤ ε(L_S^ℓ(h), n, δ, ...) ) ≥ 1 − δ.
Bayesian flavor: given a prior distribution P on H and a posterior distribution Q on H,
Pr_{S∼D^n}( L_D^ℓ(h) ≤ ε(L_S^ℓ(h), n, δ, P, ...) ) ≥ 1 − δ.

6 A Classical PAC-Bayesian Theorem. PAC-Bayesian theorem (adapted from McAllester 1999, 2003): for any distribution D on X × Y, for any set of predictors H, for any loss ℓ : H × X × Y → [0, 1], for any distribution P on H, for any δ ∈ (0, 1], we have
Pr_{S∼D^n}( for all Q on H : E_{h∼Q} L_D^ℓ(h) ≤ E_{h∼Q} L_S^ℓ(h) + sqrt( (1/(2n)) [ KL(Q||P) + ln(2√n/δ) ] ) ) ≥ 1 − δ,
where KL(Q||P) = E_{h∼Q} ln( Q(h)/P(h) ) is the Kullback-Leibler divergence.
Training bound: gives generalization guarantees not based on a testing sample. Valid for all posteriors Q on H: an inspiration for conceiving new learning algorithms.
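As a quick illustration, here is a minimal Python sketch (not from the talk; the function name and the numbers in the example are illustrative) of how the right-hand side of this bound can be evaluated.

```python
import math

def mcallester_bound(emp_loss, kl, n, delta=0.05):
    """Right-hand side of the classical PAC-Bayesian bound:
    E_Q L_D <= E_Q L_S + sqrt((KL(Q||P) + ln(2*sqrt(n)/delta)) / (2n))."""
    complexity = kl + math.log(2.0 * math.sqrt(n) / delta)
    return emp_loss + math.sqrt(complexity / (2.0 * n))

# Hypothetical values: empirical Gibbs loss 0.08, KL(Q||P) = 5 nats, n = 1000 examples.
print(mcallester_bound(0.08, 5.0, 1000))
```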

7 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

8 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

9 Majority Vote Classifiers. Consider a binary classification problem, where Y = {−1, +1} and the set H contains binary voters h : X → {−1, +1}.
Weighted majority vote: to predict the label of x ∈ X, the classifier asks for the prevailing opinion:
B_Q(x) = sgn( E_{h∼Q} h(x) ).
Many learning algorithms output majority vote classifiers: AdaBoost, Random Forests, Bagging, ...

10 A Surrogate Loss.
Majority vote risk: R_D(B_Q) = Pr_{(x,y)∼D}( B_Q(x) ≠ y ) = E_{(x,y)∼D} I[ y · E_{h∼Q} h(x) ≤ 0 ], where I[a] = 1 if predicate a is true and I[a] = 0 otherwise.
Gibbs risk / linear loss: the stochastic Gibbs classifier G_Q, on input x, draws h ∈ H according to Q and outputs h(x):
R_D(G_Q) = E_{(x,y)∼D} E_{h∼Q} I[ h(x) ≠ y ] = E_{h∼Q} L_D^{ℓ01}(h), where ℓ01(h, x, y) = I[ h(x) ≠ y ].
Factor two: it is well known that R_D(B_Q) ≤ 2 R_D(G_Q). See Germain, Lacasse, Laviolette, Marchand, and Roy (2015, JMLR) for an extensive study.
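A minimal Python sketch (not part of the slides; the voter outputs, labels, and weights below are synthetic) that computes both risks for a finite set of voters and checks the factor-two relation.

```python
import numpy as np

def majority_vote_and_gibbs_risk(votes, y, q):
    """votes: (m, n) array of voter outputs in {-1,+1} (m voters, n examples);
    y: (n,) true labels in {-1,+1}; q: (m,) posterior weights summing to 1."""
    margin = y * (q @ votes)              # E_{h~Q} y*h(x) for each example
    r_bq = np.mean(margin <= 0)           # majority vote risk R(B_Q), ties counted as errors
    r_gq = np.mean(q @ (votes != y))      # Gibbs risk R(G_Q)
    return r_bq, r_gq

rng = np.random.default_rng(0)
votes = rng.choice([-1, 1], size=(5, 200))
y = rng.choice([-1, 1], size=200)
q = np.full(5, 0.2)
r_bq, r_gq = majority_vote_and_gibbs_risk(votes, y, q)
assert r_bq <= 2 * r_gq   # the "factor two" relation
```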

11 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

12 A General PAC-Bayesian Theorem.
Δ-function: a «distance» between R_S(G_Q) and R_D(G_Q); a convex function Δ : [0, 1] × [0, 1] → ℝ.
General theorem (Bégin et al. 2014, 2016; Germain 2015): for any distribution D on X × Y, for any set H of voters, for any distribution P on H, for any δ ∈ (0, 1], and for any Δ-function, we have, with probability at least 1 − δ over the choice of S ∼ D^n, for all Q on H:
Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ],
where I_Δ(n) = sup_{r∈[0,1]} Σ_{k=0}^{n} Bin(k; n, r) e^{n Δ(k/n, r)}, with Bin(k; n, r) = C(n,k) r^k (1−r)^{n−k}.
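As an illustration of how I_Δ(n) can be evaluated in practice, here is a small Python sketch (not from the talk) that approximates the supremum by a grid search over r, computed in log space; it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import rel_entr, logsumexp

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q, p), a common choice of Delta-function."""
    return rel_entr(q, p) + rel_entr(1.0 - q, 1.0 - p)

def I_Delta(n, Delta, grid=1000):
    """Grid-search approximation of I_Delta(n) = sup_r sum_k Bin(k;n,r) * exp(n*Delta(k/n, r))."""
    ks = np.arange(n + 1)
    best = -np.inf
    for r in np.linspace(1e-6, 1.0 - 1e-6, grid):
        log_terms = binom.logpmf(ks, n, r) + n * Delta(ks / n, r)
        best = max(best, logsumexp(log_terms))
    return np.exp(best)

# With Delta := kl, I_Delta(n) is of order sqrt(n) (Maurer 2004 gives at most 2*sqrt(n)).
print(I_Delta(100, kl_bernoulli), 2 * np.sqrt(100))
```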

13 General theorem:
Pr_{S∼D^n}( for all Q on H : Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ] ) ≥ 1 − δ.
Interpretation: the bound on R_D(G_Q) is the largest r such that Δ( R_S(G_Q), r ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ].

14 General theorem: Pr_{S∼D^n}( for all Q on H : Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ] ) ≥ 1 − δ.
Proof ideas.
Change of measure inequality: for any P and Q on H, and for any measurable function φ : H → ℝ, we have E_{h∼Q} φ(h) ≤ KL(Q||P) + ln E_{h∼P} e^{φ(h)}.
Markov's inequality: Pr( X ≥ a ) ≤ (E X)/a, equivalently Pr( X ≤ (E X)/δ ) ≥ 1 − δ.
Probability of observing k misclassifications among n examples: given a voter h, the number of errors on S is a binomial variable of n trials with success probability L_D^{ℓ01}(h):
Pr_{S∼D^n}( L_S^{ℓ01}(h) = k/n ) = C(n,k) ( L_D^{ℓ01}(h) )^k ( 1 − L_D^{ℓ01}(h) )^{n−k} = Bin( k; n, L_D^{ℓ01}(h) ).

15 General theorem: Pr_{S∼D^n}( for all Q on H : Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ] ) ≥ 1 − δ.
Proof.
n Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) )
≤ E_{h∼Q} n Δ( L_S^ℓ(h), L_D^ℓ(h) )   (Jensen's inequality)
≤ KL(Q||P) + ln E_{h∼P} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )}   (change of measure)
≤ KL(Q||P) + ln( (1/δ) E_{S∼D^n} E_{h∼P} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )} )   (Markov's inequality, with probability ≥ 1 − δ)
= KL(Q||P) + ln( (1/δ) E_{h∼P} E_{S∼D^n} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )} )   (expectation swap)
= KL(Q||P) + ln( (1/δ) E_{h∼P} Σ_{k=0}^{n} Bin( k; n, L_D^ℓ(h) ) e^{n Δ( k/n, L_D^ℓ(h) )} )   (binomial law)
≤ KL(Q||P) + ln( (1/δ) sup_{r∈[0,1]} Σ_{k=0}^{n} Bin( k; n, r ) e^{n Δ( k/n, r )} )   (supremum over risk)
= KL(Q||P) + ln( (1/δ) I_Δ(n) ).

16 General theorem: Pr_{S∼D^n}( for all Q on H : Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ] ) ≥ 1 − δ.
Corollary. [...] with probability at least 1 − δ over the choice of S ∼ D^n, for all Q on H:
(a) kl( R_S(G_Q), R_D(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln(2√n/δ) ]   (Langford and Seeger 2001)
(b) R_D(G_Q) ≤ R_S(G_Q) + sqrt( (1/(2n)) [ KL(Q||P) + ln(2√n/δ) ] )   (McAllester 1999, 2003)
(c) R_D(G_Q) ≤ (1/(1 − e^{−c})) [ 1 − exp( −c R_S(G_Q) − (1/n) [ KL(Q||P) + ln(1/δ) ] ) ]   (Catoni 2007)
(d) R_D(G_Q) ≤ R_S(G_Q) + (1/λ) [ KL(Q||P) + ln(1/δ) + f(λ, n) ]   (Alquier et al. 2015)
with the Δ-functions kl(q, p) = q ln(q/p) + (1−q) ln((1−q)/(1−p)) ≥ 2(q − p)², Δ_c(q, p) = −ln[ 1 − (1 − e^{−c}) p ] − c q, and Δ_λ(q, p) = (λ/n)(p − q).
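In practice, bound (a) is used by inverting the binary kl: the bound on R_D(G_Q) is the largest p satisfying the inequality. A small Python sketch of this inversion by bisection (not from the slides; the numbers in the example are hypothetical):

```python
import math

def kl_bin(q, p):
    """Binary KL divergence kl(q, p), clipped away from 0 and 1 for numerical safety."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def seeger_bound(emp_risk, kl_qp, n, delta=0.05, iters=60):
    """Invert bound (a): the largest p with kl(R_S(G_Q), p) <= rhs upper-bounds R_D(G_Q)."""
    rhs = (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n
    lo, hi = emp_risk, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_bin(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical values: empirical Gibbs risk 0.08, KL(Q||P) = 5 nats, n = 1000.
print(seeger_bound(0.08, 5.0, 1000))
```

For the same inputs this is never looser than the McAllester relaxation (b), since (b) follows from (a) through the inequality kl(q, p) ≥ 2(q − p)².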

17 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

18 Transductive Learning.
Assumption: examples are drawn without replacement from a finite set Z of size N.
S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ⊆ Z, U = {(x_{n+1}, ·), (x_{n+2}, ·), ..., (x_N, ·)} = Z \ S.
Inductive learning: n draws with replacement according to D → binomial law. Transductive learning: n draws without replacement from Z → hypergeometric law.
Theorem (Bégin et al. 2014): for any set Z of N examples, [...] with probability at least 1 − δ over the choice of n examples among Z, for all Q on H:
Δ( R_S(G_Q), R_Z(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( T_Δ(n, N)/δ ) ],
where T_Δ(n, N) = max_{K=0,...,N} Σ_{k=max[0, K+n−N]}^{min[n, K]} [ C(K,k) C(N−K, n−k) / C(N,n) ] e^{n Δ(k/n, K/N)}.

19 Theorem: Pr_{S∼[Z]^n}( for all Q on H : Δ( R_S(G_Q), R_Z(G_Q) ) ≤ (1/n) [ KL(Q||P) + ln( T_Δ(n, N)/δ ) ] ) ≥ 1 − δ.
Proof.
n Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_Z^ℓ(h) )
≤ E_{h∼Q} n Δ( L_S^ℓ(h), L_Z^ℓ(h) )   (Jensen's inequality)
≤ KL(Q||P) + ln E_{h∼P} e^{n Δ( L_S^ℓ(h), L_Z^ℓ(h) )}   (change of measure)
≤ KL(Q||P) + ln( (1/δ) E_{S∼[Z]^n} E_{h∼P} e^{n Δ( L_S^ℓ(h), L_Z^ℓ(h) )} )   (Markov's inequality, with probability ≥ 1 − δ)
= KL(Q||P) + ln( (1/δ) E_{h∼P} E_{S∼[Z]^n} e^{n Δ( L_S^ℓ(h), L_Z^ℓ(h) )} )   (expectation swap)
= KL(Q||P) + ln( (1/δ) E_{h∼P} Σ_k [ C(N L_Z^ℓ(h), k) C(N − N L_Z^ℓ(h), n−k) / C(N,n) ] e^{n Δ( k/n, L_Z^ℓ(h) )} )   (hypergeometric law)
≤ KL(Q||P) + ln( (1/δ) max_{K=0,...,N} Σ_k [ C(K,k) C(N−K, n−k) / C(N,n) ] e^{n Δ( k/n, K/N )} )   (supremum over risk)
= KL(Q||P) + ln( (1/δ) T_Δ(n, N) ).

20 A New Transductive Bound for the Gibbs Risk.
Corollary (Bégin et al. 2014): [...] with probability at least 1 − δ over the choice of n examples among Z, for all Q on H:
R_Z(G_Q) ≤ R_S(G_Q) + sqrt( ((1 − n/N)/(2n)) [ KL(Q||P) + ln( 3 ln(n) √(n(1 − n/N)) / δ ) ] ).
For comparison, the earlier transductive PAC-Bayesian bound of Derbeko et al. (2004) has the same overall form but involves additional terms and larger constants.

21 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

22 A New Change of Measure.
Kullback-Leibler change of measure inequality: for any P and Q on H, and for any φ : H → ℝ, we have E_{h∼Q} φ(h) ≤ KL(Q||P) + ln E_{h∼P} e^{φ(h)}.
Rényi change of measure inequality (Atar and Merhav 2015): for any P and Q on H, any φ : H → ℝ, and for any α > 1, we have
(α/(α−1)) ln E_{h∼Q} φ(h) ≤ D_α(Q||P) + ln E_{h∼P} φ(h)^{α/(α−1)},
with D_α(Q||P) = (1/(α−1)) ln[ E_{h∼P} ( Q(h)/P(h) )^α ] ≥ KL(Q||P), and lim_{α→1} D_α(Q||P) = KL(Q||P).
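A short Python sketch (not from the talk) of the Rényi divergence for discrete distributions, illustrating that D_α(Q||P) decreases toward KL(Q||P) as α → 1; it assumes strictly positive Q and P, and the toy distributions below are arbitrary.

```python
import numpy as np

def renyi_divergence(q, p, alpha):
    """D_alpha(Q||P) = 1/(alpha-1) * ln sum_h P(h) * (Q(h)/P(h))**alpha, for alpha > 1."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.log(np.sum(p * (q / p) ** alpha)) / (alpha - 1.0)

def kl_divergence(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

q = np.array([0.7, 0.2, 0.1])
p = np.array([1/3, 1/3, 1/3])
for alpha in (2.0, 1.5, 1.1, 1.01):
    print(alpha, renyi_divergence(q, p, alpha))  # decreases toward KL as alpha -> 1
print("KL:", kl_divergence(q, p))
```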

23 Rényi-Based General Theorem.
Theorem (Bégin et al. 2016): [...] for any α > 1, with probability at least 1 − δ over the choice of S ∼ D^n, for all Q on H:
ln Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/α') [ D_α(Q||P) + ln( I_Δ^R(n, α)/δ ) ],
where α' := α/(α−1) > 1 and I_Δ^R(n, α) = sup_{r∈[0,1]} Σ_{k=0}^{n} Bin(k; n, r) Δ(k/n, r)^{α'}.

24 Rényi-Based General Theorem: Pr_{S∼D^n}( for all Q on H : ln Δ( R_S(G_Q), R_D(G_Q) ) ≤ (1/α') [ D_α(Q||P) + ln( I_Δ^R(n, α)/δ ) ] ) ≥ 1 − δ.
Proof. Let α' := α/(α−1).
α' ln Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) )
≤ α' ln E_{h∼Q} Δ( L_S^ℓ(h), L_D^ℓ(h) )   (Jensen's inequality)
≤ D_α(Q||P) + ln E_{h∼P} Δ( L_S^ℓ(h), L_D^ℓ(h) )^{α'}   (change of measure)
≤ D_α(Q||P) + ln( (1/δ) E_{S∼D^n} E_{h∼P} Δ( L_S^ℓ(h), L_D^ℓ(h) )^{α'} )   (Markov's inequality, with probability ≥ 1 − δ)
= D_α(Q||P) + ln( (1/δ) E_{h∼P} E_{S∼D^n} Δ( L_S^ℓ(h), L_D^ℓ(h) )^{α'} )   (expectation swap)
= D_α(Q||P) + ln( (1/δ) E_{h∼P} Σ_{k=0}^{n} Bin( k; n, L_D^ℓ(h) ) Δ( k/n, L_D^ℓ(h) )^{α'} )   (binomial law)
≤ D_α(Q||P) + ln( (1/δ) sup_{r∈[0,1]} Σ_{k=0}^{n} Bin( k; n, r ) Δ( k/n, r )^{α'} )   (supremum over risk)
= D_α(Q||P) + ln( (1/δ) I_Δ^R(n, α) ).

25 Empirical Study. Majority votes of 500 decision trees on the Mushroom dataset, with weak decision trees and strong decision trees. [Figure: for each setting, the value of R_D(G_Q) and the contribution of each proof step (Jensen's inequality, change of measure, Markov's inequality, supremum over risk) to the bound, comparing KL(Q||P) vs. D_α(Q||P), each with Δ := 2(q − p)² and with Δ := kl(q, p).]

26 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

27 PAC-Bayesian Bounds for Regression.
Lemma (Maurer 2004): for any ℓ : H × X × Y → [0, 1] and any convex Δ : [0, 1] × [0, 1] → ℝ,
E_{S∼D^n} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )} ≤ Σ_{k=0}^{n} Bin( k; n, L_D^ℓ(h) ) e^{n Δ( k/n, L_D^ℓ(h) )}.
General theorem for regression with bounded losses: for any distribution D on X × Y, for any set H of predictors, for any loss ℓ : H × X × Y → [0, 1], for any distribution P on H, for any δ ∈ (0, 1], and for any Δ-function, we have, with probability at least 1 − δ over the choice of S ∼ D^n, for all Q on H:
Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ].

28 General theorem for regression with bounded losses: Pr_{S∼D^n}( for all Q on H : Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ] ) ≥ 1 − δ.
Proof.
n Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) )
≤ E_{h∼Q} n Δ( L_S^ℓ(h), L_D^ℓ(h) )   (Jensen's inequality)
≤ KL(Q||P) + ln E_{h∼P} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )}   (change of measure)
≤ KL(Q||P) + ln( (1/δ) E_{S∼D^n} E_{h∼P} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )} )   (Markov's inequality, with probability ≥ 1 − δ)
= KL(Q||P) + ln( (1/δ) E_{h∼P} E_{S∼D^n} e^{n Δ( L_S^ℓ(h), L_D^ℓ(h) )} )   (expectation swap)
≤ KL(Q||P) + ln( (1/δ) E_{h∼P} Σ_{k=0}^{n} Bin( k; n, L_D^ℓ(h) ) e^{n Δ( k/n, L_D^ℓ(h) )} )   (Maurer's lemma)
≤ KL(Q||P) + ln( (1/δ) sup_{r∈[0,1]} Σ_{k=0}^{n} Bin( k; n, r ) e^{n Δ( k/n, r )} )   (supremum over risk)
= KL(Q||P) + ln( (1/δ) I_Δ(n) ).

29 PAC-Bayesian Bounds for Regression. General theorem for regression with bounded losses: Pr_{S∼D^n}( for all Q on H : Δ( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) ) ≤ (1/n) [ KL(Q||P) + ln( I_Δ(n)/δ ) ] ) ≥ 1 − δ.
Corollary. [...] with probability at least 1 − δ over the choice of S ∼ D^n, for all Q on H:
(a) kl( E_{h∼Q} L_S^ℓ(h), E_{h∼Q} L_D^ℓ(h) ) ≤ (1/n) [ KL(Q||P) + ln(2√n/δ) ]   (Langford and Seeger 2001)
(b) E_{h∼Q} L_D^ℓ(h) ≤ E_{h∼Q} L_S^ℓ(h) + sqrt( (1/(2n)) [ KL(Q||P) + ln(2√n/δ) ] )   (McAllester 1999, 2003)
(c) E_{h∼Q} L_D^ℓ(h) ≤ (1/(1 − e^{−c})) [ 1 − exp( −c E_{h∼Q} L_S^ℓ(h) − (1/n) [ KL(Q||P) + ln(1/δ) ] ) ]   (Catoni 2007)
(d) E_{h∼Q} L_D^ℓ(h) ≤ E_{h∼Q} L_S^ℓ(h) + (1/λ) [ KL(Q||P) + ln(1/δ) + f(λ, n) ]   (Alquier et al. 2015)

30 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

31 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

32 Optimal Gibbs Posterior.
Corollary. [...] with probability at least 1 − δ over the choice of S ∼ D^n, for all Q on H:
(c) E_{h∼Q} L_D^ℓ(h) ≤ (1/(1 − e^{−c})) [ 1 − exp( −c E_{h∼Q} L_S^ℓ(h) − (1/n) [ KL(Q||P) + ln(1/δ) ] ) ]   (Catoni 2007)
(d) E_{h∼Q} L_D^ℓ(h) ≤ E_{h∼Q} L_S^ℓ(h) + (1/λ) [ KL(Q||P) + ln(1/δ) + f(λ, n) ]   (Alquier et al. 2015)
From an algorithm design perspective, Corollary (c) suggests optimizing the trade-off
c n E_{h∼Q} L_S^ℓ(h) + KL(Q||P),
which also minimizes (d), with λ := c n. The optimal Gibbs posterior is given by
Q*_c(h) = (1/Z_S) P(h) e^{−c n L_S^ℓ(h)}.
See Catoni 2007, Alquier et al. 2015, ...
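A minimal Python sketch (not from the slides) of this optimal Gibbs posterior over a finite set of hypotheses, computed in log space for numerical stability; the prior, the empirical losses, and the constants below are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def gibbs_posterior(log_prior, emp_losses, c, n):
    """Optimal Gibbs posterior over a finite hypothesis set:
    Q*_c(h) proportional to P(h) * exp(-c * n * L_S(h))."""
    log_unnorm = np.asarray(log_prior) - c * n * np.asarray(emp_losses)
    return np.exp(log_unnorm - logsumexp(log_unnorm))

# Toy example: 4 hypotheses, uniform prior, empirical losses measured on n = 100 examples.
log_prior = np.log(np.full(4, 0.25))
emp_losses = np.array([0.30, 0.12, 0.10, 0.45])
print(gibbs_posterior(log_prior, emp_losses, c=1.0, n=100))
```

Larger c n concentrates the posterior on low-empirical-loss hypotheses; smaller c n keeps it close to the prior, which is exactly the trade-off appearing in the bound.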

33 Tying the Concepts. Let us denote by Θ the set of all possible model parameters.
Bayes' rule: p(θ|X, Y) = p(θ) p(Y|X, θ) / p(Y|X), where X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}.
p(θ) is the prior over Θ (similar to P over H); p(θ|X, Y) is the posterior over Θ (similar to Q over H); p(Y|X, θ) is the likelihood of the parameters θ given the sample S.
Negative log-likelihood loss function: ℓ_nll(θ, x, y) = ln( 1/p(y|x, θ) ). Then
L_S^{ℓnll}(θ) = (1/n) Σ_{i=1}^{n} ℓ_nll(θ, x_i, y_i) = −(1/n) Σ_{i=1}^{n} ln p(y_i|x_i, θ) = −(1/n) ln p(Y|X, θ).

34 Rediscovering the Marginal Likelihood. With the negative log-likelihood loss, the Bayesian and PAC-Bayesian posteriors align:
p(θ|X, Y) = p(θ) p(Y|X, θ) / p(Y|X) = P(θ) e^{−n L_S^{ℓnll}(θ)} / Z_S = Q*(θ).
The normalization constant Z_S corresponds to the marginal likelihood:
Z_S = p(Y|X) = ∫_Θ P(θ) e^{−n L_S^{ℓnll}(θ)} dθ.
Putting this posterior back inside the PAC-Bayesian bounds, we obtain
n E_{θ∼Q*} L_S^{ℓnll}(θ) + KL(Q*||P) = E_{θ∼Q*} [ n L_S^{ℓnll}(θ) + ln( Q*(θ)/P(θ) ) ] = E_{θ∼Q*} [ n L_S^{ℓnll}(θ) + ln( e^{−n L_S^{ℓnll}(θ)} / Z_S ) ] = ln(1/Z_S) = −ln Z_S.

35 From the Marginal Likelihood to PAC-Bayesian Bounds.
Corollary (Germain, Bach, Lacoste, Lacoste-Julien 2016): given a data distribution D, a parameter set Θ, a prior distribution P over Θ, a δ ∈ (0, 1], if ℓ_nll lies in [a, b], we have, with probability at least 1 − δ over the choice of S ∼ D^n,
(c) E_{θ∼Q*} L_D^{ℓnll}(θ) ≤ a + ((b − a)/(1 − e^{a−b})) [ 1 − e^{a} (δ Z_S)^{1/n} ],
(d) E_{θ∼Q*} L_D^{ℓnll}(θ) ≤ (1/2)(b − a)² − (1/n) ln(δ Z_S).
Take-home message: maximizing the marginal likelihood minimizes these PAC-Bayesian bounds under the negative log-likelihood loss function.
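A small Python sketch (not from the talk) evaluating bounds (c) and (d) from a given log marginal likelihood; all numeric values below are hypothetical, and the loss range [a, b] is assumed known.

```python
import math

def bound_c(ln_zs, n, a, b, delta=0.05):
    """Bound (c): a + (b-a)/(1-exp(a-b)) * (1 - exp(a) * (delta*Z_S)**(1/n))."""
    nth_root = math.exp((math.log(delta) + ln_zs) / n)   # (delta * Z_S)**(1/n)
    return a + (b - a) / (1.0 - math.exp(a - b)) * (1.0 - math.exp(a) * nth_root)

def bound_d(ln_zs, n, a, b, delta=0.05):
    """Bound (d): (b-a)^2 / 2 - (1/n) * ln(delta * Z_S)."""
    return 0.5 * (b - a) ** 2 - (math.log(delta) + ln_zs) / n

# Hypothetical values: log marginal likelihood -120 on n = 100 examples, nll loss in [0, 4].
print(bound_c(-120.0, 100, 0.0, 4.0), bound_d(-120.0, 100, 0.0, 4.0))
```

Both expressions depend on the data only through ln Z_S, which is why a larger marginal likelihood directly translates into a smaller risk bound.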

36 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

37 Model Comparison. Consider a discrete set of L models {M_i}_{i=1}^{L} with parameter sets {Θ_i}_{i=1}^{L}, a prior p(M_i) over these models, and, for each model M_i, a prior p(θ|M_i) = P_i(θ) over Θ_i.
Bayes' rule: p(θ|X, Y, M_i) = p(θ|M_i) p(Y|X, θ, M_i) / p(Y|X, M_i),
where the model evidence is p(Y|X, M_i) = ∫_{Θ_i} p(θ|M_i) p(Y|X, θ, M_i) dθ = Z_{S,i}.

38 Bayesian Model Selection. [Slide reproduced from Zoubin Ghahramani's MLSS 2012 talk.]

39 Frequentist Bounds for Bayesian Model Selection. An alternative explanation for the Bayesian Occam's razor phenomenon...
Corollary (Germain, Bach, et al. 2016): [...] with probability at least 1 − δ over the choice of S ∼ D^n, for all i ∈ {1, ..., L}:
(c) E_{θ∼Q*_i} L_D^{ℓnll}(θ) ≤ a + ((b − a)/(1 − e^{a−b})) [ 1 − e^{a} (δ Z_{S,i}/L)^{1/n} ],
(d) E_{θ∼Q*_i} L_D^{ℓnll}(θ) ≤ (1/2)(b − a)² − (1/n) ln(δ Z_{S,i}/L).

40 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

41 Bayesian Linear Regression. Consider a mapping function φ : X → ℝ^d. Given (x, y) ∈ X × Y, model parameters θ := w ∈ ℝ^d and a fixed noise σ, we consider the likelihood
p(y|x, w) = N( y | w·φ(x), σ² ) = (1/√(2πσ²)) e^{−(1/(2σ²)) (y − w·φ(x))²}.
Thus, the negative log-likelihood loss function is
ℓ_nll(w, x, y) = ln( 1/p(y|x, w) ) = (1/2) ln(2πσ²) + (1/(2σ²)) (y − w·φ(x))².
We also consider an isotropic Gaussian prior of mean 0 and variance σ_P²:
p(w|σ_P) = N( w | 0, σ_P² I ) = (1/(2πσ_P²)^{d/2}) e^{−(1/(2σ_P²)) ||w||²}.

42 Bayesian Linear Regression. The optimal Gibbs posterior is given by
Q*(w) = p(w|X, Y, σ, σ_P) = p(w|σ_P) p(Y|X, w, σ) / p(Y|X, σ, σ_P) = N( w | ŵ, A^{−1} ),
where A := (1/σ²) Φᵀ Φ + (1/σ_P²) I and ŵ := (1/σ²) A^{−1} Φᵀ y (with Φ the design matrix whose rows are φ(x_i)ᵀ and y the vector of labels).
The negative log marginal likelihood is
−ln Z_S(σ, σ_P) = (1/(2σ²)) ||y − Φŵ||² + (n/2) ln(2πσ²) + (1/(2σ_P²)) ||ŵ||² + (1/2) ln|A| + d ln σ_P
= [ n L_S^{ℓnll}(ŵ) + (1/(2σ²)) tr(Φᵀ Φ A^{−1}) ] + [ (1/(2σ_P²)) ( tr(A^{−1}) − d σ_P² + ||ŵ||² ) + (1/2) ln|A| + d ln σ_P ],
where the first bracket equals n E_{w∼Q*} L_S^{ℓnll}(w) and the second equals KL( N(ŵ, A^{−1}) || N(0, σ_P² I) ).
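A short Python sketch (not from the slides) computing ŵ, A, and −ln Z_S for this model; the toy data and variable names are illustrative only.

```python
import numpy as np

def bayes_linreg_evidence(Phi, y, sigma, sigma_p):
    """Posterior N(w_hat, A^{-1}) and negative log marginal likelihood -ln Z_S
    for Bayesian linear regression with isotropic Gaussian prior N(0, sigma_p^2 I)."""
    n, d = Phi.shape
    A = Phi.T @ Phi / sigma**2 + np.eye(d) / sigma_p**2
    w_hat = np.linalg.solve(A, Phi.T @ y) / sigma**2
    residual = y - Phi @ w_hat
    neg_ln_zs = (residual @ residual / (2 * sigma**2)
                 + n / 2 * np.log(2 * np.pi * sigma**2)
                 + w_hat @ w_hat / (2 * sigma_p**2)
                 + 0.5 * np.linalg.slogdet(A)[1]
                 + d * np.log(sigma_p))
    return w_hat, A, neg_ln_zs

# Toy check on synthetic data (w_hat coincides with the ridge solution for lambda = sigma^2/sigma_p^2).
rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 3))
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w_hat, A, neg_ln_zs = bayes_linreg_evidence(Phi, y, sigma=0.1, sigma_p=1.0)
print(w_hat, neg_ln_zs)
```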

43 Fitting y = sin(x) + ε with polynomial models (inspired by Bishop 2006). Illustrates the decomposition of the marginal likelihood into the empirical loss and the KL divergence:
−ln Z_S = n E_{θ∼Q*} L_S^{ℓnll}(θ) + KL(Q*||P).
[Figure: left, the fitted polynomial models of degree d = 1, ..., 7 against sin(x) on [0, 2π]; right, −ln Z_{X,Y}, KL(ρ̂||π), n E_{θ∼ρ̂} L_{X,Y}^{ℓnll}(θ) and n E_{θ∼ρ̂} L_D^{ℓnll}(θ) as functions of the model degree d.]

44 Plan. 1. Introduction. 2. PAC-Bayesian Theory: Majority Vote Classifiers; A General PAC-Bayesian Theorem; Transductive Bounds; Rényi-Based Bounds; Regression Bounds. 3. PAC-Bayesian Theory Meets Bayesian Inference: PAC-Bayesian Marginal Likelihood; Model Comparison; Toy Experiments: Linear Regression. 4. Conclusion and Future Works.

45 Conclusion and Future Works.
I talked about... A general theorem from which we recover existing results; a modular proof, easy to adapt to various frameworks; a direct link between PAC-Bayesian (frequentist) bounds and Bayesian model selection.
I did not talk about... Our learning algorithms inspired by PAC-Bayesian bounds, see Germain, Lacasse, Laviolette, and Marchand (2009, ICML) and Germain, Habrard, et al. (2016, ICML); our PAC-Bayesian theorems for unbounded losses, see Germain, Bach, et al. (2016, arXiv).
I plan to... Study other Bayesian techniques from a PAC-Bayes perspective (empirical Bayes, variational Bayes, etc.).

46 References I
Alquier, Pierre, James Ridgway, and Nicolas Chopin (2015). "On the properties of variational approximations of Gibbs posteriors". In: ArXiv e-prints.
Atar, Rami and Neri Merhav (2015). "Information-theoretic applications of the logarithmic probability comparison bound". In: IEEE International Symposium on Information Theory (ISIT).
Bégin, Luc, Pascal Germain, François Laviolette, and Jean-Francis Roy (2014). "PAC-Bayesian Theory for Transductive Learning". In: AISTATS.
Bégin, Luc, Pascal Germain, François Laviolette, and Jean-Francis Roy (2016). "PAC-Bayesian Bounds based on the Rényi Divergence". In: AISTATS.
Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Catoni, Olivier (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Vol. 56. Institute of Mathematical Statistics.
Derbeko, Philip, Ran El-Yaniv, and Ron Meir (2004). "Explicit Learning Curves for Transduction and Application to Clustering and Compression Algorithms". In: J. Artif. Intell. Res. (JAIR) 22.
Germain, Pascal (2015). "Généralisations de la théorie PAC-bayésienne pour l'apprentissage inductif, l'apprentissage transductif et l'adaptation de domaine". PhD thesis. Université Laval.

47 References II
Germain, Pascal, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien (2016). "PAC-Bayesian Theory Meets Bayesian Inference". In: ArXiv e-prints.
Germain, Pascal, Amaury Habrard, François Laviolette, and Émilie Morvant (2016). "A New PAC-Bayesian Perspective on Domain Adaptation". In: ICML.
Germain, Pascal, Alexandre Lacasse, François Laviolette, and Mario Marchand (2009). "PAC-Bayesian learning of linear classifiers". In: ICML.
Germain, Pascal, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy (2015). "Risk Bounds for the Majority Vote: From a PAC-Bayesian Analysis to a Learning Algorithm". In: JMLR 16.
Langford, John and Matthias Seeger (2001). "Bounds for averaging classifiers". Tech. rep. Carnegie Mellon, Department of Computer Science.
Maurer, Andreas (2004). "A Note on the PAC-Bayesian Theorem". In: CoRR cs.lg/.
McAllester, David (1999). "Some PAC-Bayesian Theorems". In: Machine Learning.
McAllester, David (2003). "PAC-Bayesian Stochastic Model Selection". In: Machine Learning.
