Online Bayesian Passive-Aggressive Learning


Tianlin Shi and Jun Zhu (Tsinghua University). ICML 2014. A full journal version is available.

Outline
Introduction: Motivation, Framework
Theory: Connection to PA, Regret Analysis
Practice: Extension, Latent Space, Topic Modeling
Experiment

Introduction / Motivation: The Big Data Challenge
Huge amounts of data push the limits of existing methods and systems.
In real applications, data arrive as streams and are potentially infinite.
Data are complex, calling for latent variables and hierarchical modeling.

Introduction / Motivation: Story I. Online Learning
[Shalev-Shwartz 07. Online learning: Theory, algorithms and applications.]
Setting: an online learner A receives an incoming data stream (x_0, y_0), (x_1, y_1), (x_2, y_2), ...
At each round t, the learner A
1. predicts with the current model w_t;
2. incurs loss l(w_t; x_t, y_t);
3. learns how to predict, i.e., updates the model.
(Figure: the online learner A alternates between predicting with model w_t and learning from the loss l(w_t; x_t, y_t) on the data stream {(x_t, y_t)}.)
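To make the protocol concrete, here is a minimal sketch of the predict/incur-loss/learn loop; the Learner interface (predict/update) is a hypothetical illustration, not code from the paper.

```python
# Minimal sketch of the online learning protocol above; illustrative only.
def run_online(learner, stream, loss):
    total_loss = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)      # 1. predict with current model w_t
        total_loss += loss(y_hat, y_t)    # 2. incur instantaneous loss
        learner.update(x_t, y_t)          # 3. learn how to predict
    return total_loss
```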

Introduction / Motivation: Online Passive-Aggressive (PA) Algorithms
[Crammer et al. 06. Online Passive-Aggressive Algorithms.]
Weight update: obtain

    w_{t+1} = \arg\min_w \; \tfrac{1}{2}\|w - w_t\|^2 + 2c\,\ell_\epsilon(w; x_t, y_t)    (1)

The first term is a regularizer that keeps the new weight close to w_t (passive); the second is a large-margin hinge loss that pushes the new weight to fit the current example (aggressive).
(Figure: geometric illustration of the PA update.)
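As a concrete instance (a sketch, not the paper's code), the classic PA-II-style closed-form update from Crammer et al. solves a soft-margin problem of the form of (1) for the hinge loss; the soft-margin weight below plays the role of 2c up to constants.

```python
import numpy as np

def pa2_update(w, x, y, c=1.0, eps=1.0):
    """One PA-II-style closed-form update (after Crammer et al. 06)."""
    loss = max(0.0, eps - y * np.dot(w, x))        # hinge loss l_eps(w; x, y)
    tau = loss / (np.dot(x, x) + 1.0 / (2.0 * c))  # closed-form dual step size
    return w + tau * y * x                         # passive when loss == 0
```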

Introduction / Motivation: Why Online Learning Works
Notion of regret: the cumulative loss measured against the best fixed model w* chosen with hindsight of all the data:

    \mathrm{Regret} = \sum_{t=0}^{T-1} \ell(w_t; x_t, y_t) - \sum_{t=0}^{T-1} \ell(w^*; x_t, y_t)    (2)

Regret bound. Example: in PA with the squared hinge loss \ell = \ell_\epsilon^2,

    \sum_{t=0}^{T-1} \ell(w_t) \le cR \sum_{t=0}^{T-1} \ell(w^*) + \text{const}    (3)

so the online learner implicitly minimizes the empirical loss.
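The sketch below accumulates the empirical regret of eq. (2) for the PA learner above against a fixed comparator; the toy data and the comparator w_star are hypothetical, and pa2_update is reused from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])        # fixed comparator w* with hindsight
X = rng.normal(size=(1000, 2))
Y = np.sign(X @ w_star)

def hinge(w, x, y, eps=1.0):
    return max(0.0, eps - y * np.dot(w, x))

w = np.zeros(2)
regret = 0.0
for x, y in zip(X, Y):
    regret += hinge(w, x, y) - hinge(w_star, x, y)  # eq. (2), accumulated
    w = pa2_update(w, x, y)                         # PA learner from above
```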

Introduction / Motivation: Story II. Sequential Bayesian Inference
Maintain a distribution over w. At each round t,
1. let the current distribution q_t(w) be the prior;
2. for each incoming data point (x_t, y_t), apply Bayes' theorem:

    q_{t+1}(w) \propto q_t(w)\,p(x_t \mid w)    (4)

3. if exact inference is intractable, use an approximation of q_{t+1}(w).
   [Broderick et al. 13. Streaming Variational Bayes.]
Advantages: captures uncertainty; flexible enough to model the underlying structure; can be nonparametric. (A conjugate toy example is sketched below.)
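A minimal sketch of the sequential update (4) in a conjugate toy model, where x_t ~ N(w, sigma^2) with a Gaussian prior on w, so q_t(w) stays Gaussian and the update is closed-form; the model and numbers are purely illustrative, not from the paper.

```python
def gaussian_seq_update(mu, var, x_t, noise_var=1.0):
    """Fold one observation into q_t(w) = N(mu, var); returns q_{t+1}."""
    new_var = 1.0 / (1.0 / var + 1.0 / noise_var)   # precisions add
    new_mu = new_var * (mu / var + x_t / noise_var)  # precision-weighted mean
    return new_mu, new_var

mu, var = 0.0, 10.0                  # prior q_0(w)
for x_t in [0.9, 1.2, 1.1]:          # incoming stream
    mu, var = gaussian_seq_update(mu, var, x_t)
```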

Introduction / Motivation: The Gap
Story I, online learning (PA), copes with huge, streaming data, but:
1. keeps only a single point estimate of the model;
2. ignores the latent structure underlying complex data.
Story II, sequential Bayesian inference, copes with complex data, but:
1. lacks discriminative power;
2. does not directly minimize loss (or regret).

Introduction / Motivation: A New Perspective
Intuition: for each incoming data point (x_t, y_t),
1. infer the latent variable z_t from x_t;
2. make a prediction ŷ_t;
3. the learner incurs an instantaneous loss;
4. learn both how to infer and how to predict.
(Figure: illustration of the combined inference-and-prediction loop.)

Introduction / Framework: Bayesian Passive-Aggressive Learning (BayesPA)
1. At round t, the learner makes predictions via a distribution q_t(w).
2. It incurs an instantaneous loss combining both
   the Bayesian log-loss  -\mathbb{E}_{q(w)}[\log p(x_t \mid w)]  and
   a decision-theoretic loss  \ell(q(w); x_t, y_t).
3. Learning rule:

    q_{t+1}(w) = \arg\min_{q(w) \in \mathcal{F}_t} \; \mathrm{KL}[q(w)\,\|\,q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t \mid w)] + 2c\,\ell_\epsilon(q(w); x_t, y_t)    (5)

where the KL term is the regularizer and the remaining two terms form the instantaneous loss.

Introduction / Framework: Interpretation
Rewrite the learning rule as

    q_{t+1}(w) = \arg\min_{q(w) \in \mathcal{F}} \; \mathrm{KL}[q(w)\,\|\,q_t(w)\,p(x_t \mid w)] + 2c\,\ell(q(w); x_t, y_t)    (6)

where q_t(w)\,p(x_t \mid w) is the unnormalized Bayes posterior: BayesPA projects the sequential Bayesian update onto distributions that also achieve small decision-theoretic loss.
(Figure: illustration of the KL projection.)

Introduction / Framework: BayesPA Classifiers, Two Types
1. Averaging classifier
   Prediction rule: ŷ_t = sign \mathbb{E}_{q(w)}[w^\top x_t].
   Averaged hinge loss: \ell_\epsilon^{ave}(q(w); x_t, y_t) = (\epsilon - y_t \mathbb{E}_{q(w)}[w^\top x_t])_+.
2. Gibbs classifier
   First randomly draw w ~ q_t(w), then decide ŷ_t = sign(w^\top x_t).
   Expected hinge loss: \ell_\epsilon^{gibbs}(q(w); x_t, y_t) = \mathbb{E}_{q(w)}[(\epsilon - y_t w^\top x_t)_+].
Lemma:

    \ell_\epsilon^{gibbs}(q(w); x_t, y_t) \ge \ell_\epsilon^{ave}(q(w); x_t, y_t)    (7)

(The lemma is an instance of Jensen's inequality, since the hinge (\cdot)_+ is convex.)
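A quick numerical check of the lemma under a Gaussian q(w); the distribution and data point are hypothetical, and the Gibbs loss is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, cov = np.array([0.5, -0.2]), 0.3 * np.eye(2)     # q(w) = N(mu, cov)
x, y, eps = np.array([1.0, 2.0]), 1.0, 1.0

l_ave = max(0.0, eps - y * mu @ x)                   # hinge of the mean score
ws = rng.multivariate_normal(mu, cov, size=100_000)  # draws w ~ q(w)
l_gibbs = np.maximum(0.0, eps - y * ws @ x).mean()   # mean of the hinge
assert l_gibbs >= l_ave - 1e-3                       # Jensen's inequality (7)
```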

Introduction / Framework: Update Rules
Averaging classifier:

    q_{t+1}(w) \propto \underbrace{q_t(w)}_{\text{prior}} \, \underbrace{p(x_t \mid w)}_{\text{likelihood}} \, \underbrace{\exp(\tau_t y_t w^\top x_t)}_{\text{pseudo-likelihood}}    (8)

where the dual slack variable \tau_t enforces the max-margin constraint.
Gibbs classifier:

    q_{t+1}(w) \propto q_t(w)\,p(x_t \mid w)\,\exp\big(-2c\,(\epsilon - y_t w^\top x_t)_+\big)    (9)
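When q_t(w) is Gaussian and the generative term p(x_t | w) is dropped (a purely discriminative setting, an assumption of this sketch), the pseudo-likelihood in (8) is log-linear in w, so the update stays Gaussian in closed form: the covariance is unchanged and the mean shifts along Sigma x. Here tau stands in for the solution of the dual problem.

```python
import numpy as np

def bayespa_avg_update(mu, Sigma, x, y, tau):
    """One averaging update (eq. 8) for q_t(w) = N(mu, Sigma), no likelihood term.
    N(mu, Sigma) * exp(tau * y * w.x) is Gaussian with the same Sigma."""
    return mu + tau * y * (Sigma @ x), Sigma
```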

Theory: Connection to PA; Regret Analysis

Theory / Connection to PA
Theorem. BayesPA subsumes online PA.
Proof idea: with a Gaussian q_t(w) and no latent structure, show that the update from the posterior mean \mu_t to \mu_{t+1} follows exactly the PA rule (1).

Theory / Regret Analysis
In BayesPA the decision is a distribution q(w); regret is the cumulative loss against the best fixed distribution p^*(w) chosen with hindsight of all the data:

    \mathrm{Regret} = \sum_{t=0}^{T-1} \ell(q_t(w); x_t, y_t) - \sum_{t=0}^{T-1} \ell(p^*(w); x_t, y_t)    (10)

where the per-round loss is

    \ell(q(w); x_t, y_t) = -\mathbb{E}_{q(w)}[\log p(x_t \mid w)] + 2c\,\ell_\epsilon(q(w); x_t, y_t)    (11)

Theory / Regret Analysis: A Regret Bound for BayesPA
Consider the exponential-family perturbation

    q_t^{\tau,u}(w) \propto q_t(w)\,\exp\big(u\,U(w, x_t, y_t) + \tau\,T(w, x_t, y_t)\big),

with sufficient statistics U(w, x_t, y_t) = \log p(x_t \mid w) and T(w, x_t, y_t) = y_t w^\top x_t.

Theorem. Suppose the Fisher information satisfies J_t \succeq \lambda_t I. Then

    \mathrm{Regret} \le \mathrm{KL}[p^*(w)\,\|\,p_0(w)] + \frac{1}{2}\sum_{t=0}^{T-1} \frac{1 + c^2}{\lambda_t}    (12)

Remark. If the Bayesian CLT holds (so \lambda_t grows linearly in t), BayesPA achieves the optimal O(\log T) regret.

Practice: Extension; Latent Space; Topic Modeling

Practice / Extension: Some Extensions
Mini-batches. For X_t = \{x_d\}_{d \in B_t}, Y_t = \{y_d\}_{d \in B_t}:

    \min_{q \in \mathcal{F}_t} \; \mathrm{KL}[q(w)\,\|\,q_t(w)] - \mathbb{E}_{q(w)}[\log p(X_t \mid w)] + 2c\,\ell_\epsilon(q(w); X_t, Y_t),

where \log p(X_t \mid w) = \sum_{d \in B_t} \log p(x_d \mid w) and \ell_\epsilon(q(w); X_t, Y_t) = \sum_{d \in B_t} \ell_\epsilon(q(w); x_d, y_d).

Multi-task learning. With multiple labels (x_t, y_t^1, \ldots, y_t^T):

    \min_{q \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w, M, H_t)\,\|\,q_t(w, M)\,p(X_t \mid w, M, H_t)\big] + 2c \sum_{\tau=1}^{T} \ell_\epsilon(q(w, M, H_t); X_t, Y_t^\tau)
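For the Gaussian averaging update sketched earlier, mini-batching is straightforward: the per-example linear pseudo-likelihoods multiply, so the mean shift is the sum of the per-example shifts. As before, the tau values stand in for the dual solution, and the no-likelihood-term assumption carries over.

```python
def bayespa_avg_update_batch(mu, Sigma, X, Y, taus):
    """Mini-batch version of bayespa_avg_update; same assumptions, a sketch."""
    shift = sum(tau * y * (Sigma @ x) for x, y, tau in zip(X, Y, taus))
    return mu + shift, Sigma
```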

Practice / Latent Space: BayesPA with Latent Space
(Figure: global variables (w, M) and local variables H_t; for t = 1, 2, ..., (a) draw a mini-batch (X_t, Y_t) and infer the hidden structure q*(H_t), then (b) update the distribution of the global variables.)
Formulation. Maintain a distribution q(w, M) over the global variables and solve

    \min_{q \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w, M, H_t)\,\|\,q_t(w, M)\,p(X_t \mid w, M, H_t)\big] + 2c\,\ell_\epsilon(q(w, M, H_t); X_t, Y_t)    (13)

Problem: how to efficiently infer the marginal q(w, M)?

Practice / Latent Space: Approximate Inference in BayesPA
Variational inference idea: restrict \mathcal{F}_t to a tractable family.
1. Structured mean field: q(w, M, H_t) = q(w)\,q(M)\,q(H_t).
2. Solve the BayesPA problem within this family.
3. Output q_{t+1}(w, M) = q^*(w)\,q^*(M).
(Figure: per round, infer q*(H_t) from the mini-batch, then update the distribution q*(w, M) of the global variables.)
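A skeleton of the resulting alternating scheme; the three update callables are placeholders for the model-specific closed-form mean-field solutions, not functions from the paper, and the inner-loop count is an arbitrary choice.

```python
def bayespa_latent_step(q_w, q_M, batch, update_local, update_w, update_M,
                        n_inner=5):
    """One online round of structured mean field; a sketch with placeholders."""
    for _ in range(n_inner):
        q_H = update_local(q_w, q_M, batch)  # infer hidden structure q*(H_t)
        q_w = update_w(q_M, q_H, batch)      # refresh global classifier factor
        q_M = update_M(q_w, q_H, batch)      # refresh global model factor
    return q_w, q_M                          # becomes q_{t+1}(w, M)
```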

Practice / Topic Modeling: Application to Topic Modeling
Review: maximum entropy discriminant LDA (MedLDA) couples an LDA topic model with a max-margin classifier on the topic representations.

Practice / Topic Modeling: Online Max-Margin Topic Models
Online MedLDA. For MedLDA, latent BayesPA concretizes to

    \min_{q} \; \mathrm{KL}\big[q(w, \Phi, Z_t)\,\|\,q_t(w, \Phi)\,p_0(Z_t)\,p(X_t \mid \Phi, Z_t)\big] + 2c \sum_{d \in B_t} \ell_\epsilon(w; z_d, y_d)    (14)
    \text{s.t. } q(w, \Phi, Z_t) \in \mathcal{P}.

In our paper:
we derive efficient algorithms for online MedLDA with both the averaged loss \ell_\epsilon^{ave} and the Gibbs loss \ell_\epsilon^{gibbs};
we provide a nonparametric extension, the max-margin hierarchical Dirichlet process topic model (MedHDP), and, applying BayesPA to nonparametric models, derive efficient algorithms for online MedHDP.

Experiment

Experiment: Classification on 20NG
Dataset: 20 Newsgroups.
1. 20 categories of documents.
2. Training/testing split: 11269/7505.
3. One-vs-all for the multiple categories.
Baselines:
batch MedLDA (Zhu et al. 2009, 2013) and batch MedHDP;
sparse stochastic LDA (Mimno et al. 2012) and truncation-free HDP (Wang & Blei 2012).

Experiment: Results of Online MedLDA
(Figure: error rate vs. number of passes through the dataset for paMedLDA_ave, paMedLDA_gibbs, MedLDA_gibbs, gMedLDA, and spLDA+SVM.)
(Figure: accuracy and running time vs. number of topics for the same methods.)


Experiment: Sensitivity to Batch Size
(Figure: test error and CPU seconds (log scale) as a function of batch size.)

Experiment: Online MedHDP on 20NG
(Figure: accuracy and running time vs. number of topics for paMedHDP_ave, paMedHDP_gibbs, bMedHDP, and tfHDP+SVM.)

Experiment: Multi-Task Learning on Wikipedia
Dataset:
Training data: 1.1M Wikipedia articles, 20 categories.
Test data: 5000 Wikipedia articles.
Vocabulary: 917,683 unique items.
Result: (Figure: F1 score vs. training time in seconds for paMedLDAmt_ave, paMedHDPmt_ave, paMedLDAmt_gibbs, paMedHDPmt_gibbs, MedLDAmt, and MedHDPmt.)

Contribution
A generic online learning framework (BayesPA) that:
extends Passive-Aggressive learning to the Bayesian setting;
generalizes sequential Bayesian inference to discriminative learning;
yields efficient online learning algorithms for max-margin topic models.
