Online Bayesian Passive-Aggressive Learning


Tianlin Shi and Jun Zhu (Tsinghua University). ICML 2014. A full journal version is available.

Outline
Introduction: Motivation, Framework
Theory: Connection to PA, Regret Analysis
Practice: Extension, Latent Space, Topic Modeling
Experiment

Introduction / Motivation: The Big Data Challenge
Huge amounts of data push the limits of existing methods and systems.
In real applications, data arrive as streams and are potentially infinite.
Data are complex, calling for latent variables and hierarchical modeling.

Introduction / Motivation: Story I. Online Learning
[Shalev-Shwartz 07. Online learning: Theory, algorithms and applications.]
Setting: an online learner A receives an incoming data stream (x_0, y_0), (x_1, y_1), (x_2, y_2), ...
At each round t, the learner A
1. predicts with the current model w_t;
2. incurs loss l(w_t; x_t, y_t);
3. learns how to predict, i.e., updates the model.
(Figure: the online learner A alternates between predicting with model w_t and learning from the loss l(w_t; x_t, y_t) on the data stream {(x_t, y_t)}.)
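To make the protocol concrete, here is a minimal sketch of the predict/incur-loss/learn loop; the Learner interface (predict/update) is a hypothetical illustration, not code from the paper.

```python
# Minimal sketch of the online learning protocol above; illustrative only.
def run_online(learner, stream, loss):
    total_loss = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)      # 1. predict with current model w_t
        total_loss += loss(y_hat, y_t)    # 2. incur instantaneous loss
        learner.update(x_t, y_t)          # 3. learn how to predict
    return total_loss
```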

Introduction / Motivation: Online Passive-Aggressive (PA) Algorithms
[Crammer et al. 06. Online Passive-Aggressive Algorithms.]
Weight update: obtain

    w_{t+1} = \arg\min_w \; \tfrac{1}{2}\|w - w_t\|^2 + 2c\,\ell_\epsilon(w; x_t, y_t)    (1)

The first term is a regularizer that keeps the new weight close to w_t (passive); the second is a large-margin hinge loss that pushes the new weight to fit the current example (aggressive).
(Figure: geometric illustration of the PA update.)
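As a concrete instance (a sketch, not the paper's code), the classic PA-II-style closed-form update from Crammer et al. solves a soft-margin problem of the form of (1) for the hinge loss; the soft-margin weight below plays the role of 2c up to constants.

```python
import numpy as np

def pa2_update(w, x, y, c=1.0, eps=1.0):
    """One PA-II-style closed-form update (after Crammer et al. 06)."""
    loss = max(0.0, eps - y * np.dot(w, x))        # hinge loss l_eps(w; x, y)
    tau = loss / (np.dot(x, x) + 1.0 / (2.0 * c))  # closed-form dual step size
    return w + tau * y * x                         # passive when loss == 0
```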

Introduction / Motivation: Why Online Learning Works
Notion of regret: the cumulative loss measured against the best fixed model w* chosen with hindsight of all the data:

    \mathrm{Regret} = \sum_{t=0}^{T-1} \ell(w_t; x_t, y_t) - \sum_{t=0}^{T-1} \ell(w^*; x_t, y_t)    (2)

Regret bound. Example: in PA with the squared hinge loss \ell = \ell_\epsilon^2,

    \sum_{t=0}^{T-1} \ell(w_t) \le cR \sum_{t=0}^{T-1} \ell(w^*) + \text{const}    (3)

so the online learner implicitly minimizes the empirical loss.
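The sketch below accumulates the empirical regret of eq. (2) for the PA learner above against a fixed comparator; the toy data and the comparator w_star are hypothetical, and pa2_update is reused from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])        # fixed comparator w* with hindsight
X = rng.normal(size=(1000, 2))
Y = np.sign(X @ w_star)

def hinge(w, x, y, eps=1.0):
    return max(0.0, eps - y * np.dot(w, x))

w = np.zeros(2)
regret = 0.0
for x, y in zip(X, Y):
    regret += hinge(w, x, y) - hinge(w_star, x, y)  # eq. (2), accumulated
    w = pa2_update(w, x, y)                         # PA learner from above
```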

Introduction / Motivation: Story II. Sequential Bayesian Inference
Maintain a distribution over w. At each round t,
1. let the current distribution q_t(w) be the prior;
2. for each incoming data point (x_t, y_t), apply Bayes' theorem:

    q_{t+1}(w) \propto q_t(w)\,p(x_t \mid w)    (4)

3. if exact inference is intractable, use an approximation of q_{t+1}(w).
   [Broderick et al. 13. Streaming Variational Bayes.]
Advantages: captures uncertainty; flexible enough to model the underlying structure; can be nonparametric. (A conjugate toy example is sketched below.)
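A minimal sketch of the sequential update (4) in a conjugate toy model, where x_t ~ N(w, sigma^2) with a Gaussian prior on w, so q_t(w) stays Gaussian and the update is closed-form; the model and numbers are purely illustrative, not from the paper.

```python
def gaussian_seq_update(mu, var, x_t, noise_var=1.0):
    """Fold one observation into q_t(w) = N(mu, var); returns q_{t+1}."""
    new_var = 1.0 / (1.0 / var + 1.0 / noise_var)   # precisions add
    new_mu = new_var * (mu / var + x_t / noise_var)  # precision-weighted mean
    return new_mu, new_var

mu, var = 0.0, 10.0                  # prior q_0(w)
for x_t in [0.9, 1.2, 1.1]:          # incoming stream
    mu, var = gaussian_seq_update(mu, var, x_t)
```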

Introduction / Motivation: The Gap
Story I, online learning (PA), copes with huge, streaming data, but:
1. keeps only a single point estimate of the model;
2. ignores the latent structure underlying complex data.
Story II, sequential Bayesian inference, copes with complex data, but:
1. lacks discriminative power;
2. does not directly minimize loss (or regret).

Introduction / Motivation: A New Perspective
Intuition: for each incoming data point (x_t, y_t),
1. infer the latent variable z_t from x_t;
2. make a prediction ŷ_t;
3. the learner incurs an instantaneous loss;
4. learn both how to infer and how to predict.
(Figure: illustration of the combined inference-and-prediction loop.)

Introduction / Framework: Bayesian Passive-Aggressive Learning (BayesPA)
1. At round t, the learner makes predictions via a distribution q_t(w).
2. It incurs an instantaneous loss combining both
   the Bayesian log-loss  -\mathbb{E}_{q(w)}[\log p(x_t \mid w)]  and
   a decision-theoretic loss  \ell(q(w); x_t, y_t).
3. Learning rule:

    q_{t+1}(w) = \arg\min_{q(w) \in \mathcal{F}_t} \; \mathrm{KL}[q(w)\,\|\,q_t(w)] - \mathbb{E}_{q(w)}[\log p(x_t \mid w)] + 2c\,\ell_\epsilon(q(w); x_t, y_t)    (5)

where the KL term is the regularizer and the remaining two terms form the instantaneous loss.

Introduction / Framework: Interpretation
Rewrite the learning rule as

    q_{t+1}(w) = \arg\min_{q(w) \in \mathcal{F}} \; \mathrm{KL}[q(w)\,\|\,q_t(w)\,p(x_t \mid w)] + 2c\,\ell(q(w); x_t, y_t)    (6)

where q_t(w)\,p(x_t \mid w) is the unnormalized Bayes posterior: BayesPA projects the sequential Bayesian update onto distributions that also achieve small decision-theoretic loss.
(Figure: illustration of the KL projection.)

Introduction / Framework: BayesPA Classifiers, Two Types
1. Averaging classifier
   Prediction rule: ŷ_t = sign \mathbb{E}_{q(w)}[w^\top x_t].
   Averaged hinge loss: \ell_\epsilon^{ave}(q(w); x_t, y_t) = (\epsilon - y_t \mathbb{E}_{q(w)}[w^\top x_t])_+.
2. Gibbs classifier
   First randomly draw w ~ q_t(w), then decide ŷ_t = sign(w^\top x_t).
   Expected hinge loss: \ell_\epsilon^{gibbs}(q(w); x_t, y_t) = \mathbb{E}_{q(w)}[(\epsilon - y_t w^\top x_t)_+].
Lemma:

    \ell_\epsilon^{gibbs}(q(w); x_t, y_t) \ge \ell_\epsilon^{ave}(q(w); x_t, y_t)    (7)

(The lemma is an instance of Jensen's inequality, since the hinge (\cdot)_+ is convex.)
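A quick numerical check of the lemma under a Gaussian q(w); the distribution and data point are hypothetical, and the Gibbs loss is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, cov = np.array([0.5, -0.2]), 0.3 * np.eye(2)     # q(w) = N(mu, cov)
x, y, eps = np.array([1.0, 2.0]), 1.0, 1.0

l_ave = max(0.0, eps - y * mu @ x)                   # hinge of the mean score
ws = rng.multivariate_normal(mu, cov, size=100_000)  # draws w ~ q(w)
l_gibbs = np.maximum(0.0, eps - y * ws @ x).mean()   # mean of the hinge
assert l_gibbs >= l_ave - 1e-3                       # Jensen's inequality (7)
```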

Introduction / Framework: Update Rules
Averaging classifier:

    q_{t+1}(w) \propto \underbrace{q_t(w)}_{\text{prior}} \, \underbrace{p(x_t \mid w)}_{\text{likelihood}} \, \underbrace{\exp(\tau_t y_t w^\top x_t)}_{\text{pseudo-likelihood}}    (8)

where the dual slack variable \tau_t enforces the max-margin constraint.
Gibbs classifier:

    q_{t+1}(w) \propto q_t(w)\,p(x_t \mid w)\,\exp\big(-2c\,(\epsilon - y_t w^\top x_t)_+\big)    (9)
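When q_t(w) is Gaussian and the generative term p(x_t | w) is dropped (a purely discriminative setting, an assumption of this sketch), the pseudo-likelihood in (8) is log-linear in w, so the update stays Gaussian in closed form: the covariance is unchanged and the mean shifts along Sigma x. Here tau stands in for the solution of the dual problem.

```python
import numpy as np

def bayespa_avg_update(mu, Sigma, x, y, tau):
    """One averaging update (eq. 8) for q_t(w) = N(mu, Sigma), no likelihood term.
    N(mu, Sigma) * exp(tau * y * w.x) is Gaussian with the same Sigma."""
    return mu + tau * y * (Sigma @ x), Sigma
```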

Theory: Connection to PA; Regret Analysis

Theory / Connection to PA
Theorem. BayesPA subsumes online PA.
Proof idea: with a Gaussian q_t(w) and no latent structure, show that the update from the posterior mean \mu_t to \mu_{t+1} follows exactly the PA rule (1).

Theory / Regret Analysis
In BayesPA the decision is a distribution q(w); regret is the cumulative loss against the best fixed distribution p^*(w) chosen with hindsight of all the data:

    \mathrm{Regret} = \sum_{t=0}^{T-1} \ell(q_t(w); x_t, y_t) - \sum_{t=0}^{T-1} \ell(p^*(w); x_t, y_t)    (10)

where the per-round loss is

    \ell(q(w); x_t, y_t) = -\mathbb{E}_{q(w)}[\log p(x_t \mid w)] + 2c\,\ell_\epsilon(q(w); x_t, y_t)    (11)

Theory / Regret Analysis: A Regret Bound for BayesPA
Consider the exponential-family perturbation

    q_t^{\tau,u}(w) \propto q_t(w)\,\exp\big(u\,U(w, x_t, y_t) + \tau\,T(w, x_t, y_t)\big),

with sufficient statistics U(w, x_t, y_t) = \log p(x_t \mid w) and T(w, x_t, y_t) = y_t w^\top x_t.

Theorem. Suppose the Fisher information satisfies J_t \succeq \lambda_t I. Then

    \mathrm{Regret} \le \mathrm{KL}[p^*(w)\,\|\,p_0(w)] + \frac{1}{2}\sum_{t=0}^{T-1} \frac{1 + c^2}{\lambda_t}    (12)

Remark. If the Bayesian CLT holds (so \lambda_t grows linearly in t), BayesPA achieves the optimal O(\log T) regret.

Practice: Extension; Latent Space; Topic Modeling

Practice / Extension: Some Extensions
Mini-batches. For X_t = \{x_d\}_{d \in B_t}, Y_t = \{y_d\}_{d \in B_t}:

    \min_{q \in \mathcal{F}_t} \; \mathrm{KL}[q(w)\,\|\,q_t(w)] - \mathbb{E}_{q(w)}[\log p(X_t \mid w)] + 2c\,\ell_\epsilon(q(w); X_t, Y_t),

where \log p(X_t \mid w) = \sum_{d \in B_t} \log p(x_d \mid w) and \ell_\epsilon(q(w); X_t, Y_t) = \sum_{d \in B_t} \ell_\epsilon(q(w); x_d, y_d).

Multi-task learning. With multiple labels (x_t, y_t^1, \ldots, y_t^T):

    \min_{q \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w, M, H_t)\,\|\,q_t(w, M)\,p(X_t \mid w, M, H_t)\big] + 2c \sum_{\tau=1}^{T} \ell_\epsilon(q(w, M, H_t); X_t, Y_t^\tau)
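For the Gaussian averaging update sketched earlier, mini-batching is straightforward: the per-example linear pseudo-likelihoods multiply, so the mean shift is the sum of the per-example shifts. As before, the tau values stand in for the dual solution, and the no-likelihood-term assumption carries over.

```python
def bayespa_avg_update_batch(mu, Sigma, X, Y, taus):
    """Mini-batch version of bayespa_avg_update; same assumptions, a sketch."""
    shift = sum(tau * y * (Sigma @ x) for x, y, tau in zip(X, Y, taus))
    return mu + shift, Sigma
```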

Practice / Latent Space: BayesPA with Latent Space
(Figure: global variables (w, M) and local variables H_t; for t = 1, 2, ..., (a) draw a mini-batch (X_t, Y_t) and infer the hidden structure q*(H_t), then (b) update the distribution of the global variables.)
Formulation. Maintain a distribution q(w, M) over the global variables and solve

    \min_{q \in \mathcal{F}_t} \; \mathrm{KL}\big[q(w, M, H_t)\,\|\,q_t(w, M)\,p(X_t \mid w, M, H_t)\big] + 2c\,\ell_\epsilon(q(w, M, H_t); X_t, Y_t)    (13)

Problem: how to efficiently infer the marginal q(w, M)?

Practice / Latent Space: Approximate Inference in BayesPA
Variational inference idea: restrict \mathcal{F}_t to a tractable family.
1. Structured mean field: q(w, M, H_t) = q(w)\,q(M)\,q(H_t).
2. Solve the BayesPA problem within this family.
3. Output q_{t+1}(w, M) = q^*(w)\,q^*(M).
(Figure: per round, infer q*(H_t) from the mini-batch, then update the distribution q*(w, M) of the global variables.)
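A skeleton of the resulting alternating scheme; the three update callables are placeholders for the model-specific closed-form mean-field solutions, not functions from the paper, and the inner-loop count is an arbitrary choice.

```python
def bayespa_latent_step(q_w, q_M, batch, update_local, update_w, update_M,
                        n_inner=5):
    """One online round of structured mean field; a sketch with placeholders."""
    for _ in range(n_inner):
        q_H = update_local(q_w, q_M, batch)  # infer hidden structure q*(H_t)
        q_w = update_w(q_M, q_H, batch)      # refresh global classifier factor
        q_M = update_M(q_w, q_H, batch)      # refresh global model factor
    return q_w, q_M                          # becomes q_{t+1}(w, M)
```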

Practice / Topic Modeling: Application to Topic Modeling
Review: maximum entropy discriminant LDA (MedLDA) couples an LDA topic model with a max-margin classifier on the topic representations.

Practice / Topic Modeling: Online Max-Margin Topic Models
Online MedLDA. For MedLDA, latent BayesPA concretizes to

    \min_{q} \; \mathrm{KL}\big[q(w, \Phi, Z_t)\,\|\,q_t(w, \Phi)\,p_0(Z_t)\,p(X_t \mid \Phi, Z_t)\big] + 2c \sum_{d \in B_t} \ell_\epsilon(w; z_d, y_d)    (14)
    \text{s.t. } q(w, \Phi, Z_t) \in \mathcal{P}.

In our paper:
we derive efficient algorithms for online MedLDA with both the averaged loss \ell_\epsilon^{ave} and the Gibbs loss \ell_\epsilon^{gibbs};
we provide a nonparametric extension, the max-margin hierarchical Dirichlet process topic model (MedHDP), and, applying BayesPA to nonparametric models, derive efficient algorithms for online MedHDP.

Experiment

Experiment: Classification on 20NG
Dataset: 20 Newsgroups.
1. 20 categories of documents.
2. Training/testing split: 11269/7505.
3. One-vs-all for the multiple categories.
Baselines:
batch MedLDA (Zhu et al. 2009, 2013) and batch MedHDP;
sparse stochastic LDA (Mimno et al. 2012) and truncation-free HDP (Wang & Blei 2012).

Experiment: Results of Online MedLDA
(Figure: error rate vs. number of passes through the dataset for paMedLDA_ave, paMedLDA_gibbs, MedLDA_gibbs, gMedLDA, and spLDA+SVM.)
(Figure: accuracy and running time vs. number of topics for the same methods.)


Experiment: Sensitivity to Batch Size
(Figure: test error and CPU seconds (log scale) as a function of batch size.)

Experiment: Online MedHDP on 20NG
(Figure: accuracy and running time vs. number of topics for paMedHDP_ave, paMedHDP_gibbs, bMedHDP, and tfHDP+SVM.)

Experiment: Multi-Task Learning on Wikipedia
Dataset:
Training data: 1.1M Wikipedia articles, 20 categories.
Test data: 5000 Wikipedia articles.
Vocabulary: 917,683 unique items.
Result: (Figure: F1 score vs. training time in seconds for paMedLDAmt_ave, paMedHDPmt_ave, paMedLDAmt_gibbs, paMedHDPmt_gibbs, MedLDAmt, and MedHDPmt.)

Contribution
A generic online learning framework (BayesPA) that:
extends Passive-Aggressive learning to the Bayesian setting;
generalizes sequential Bayesian inference to discriminative learning;
yields efficient online learning algorithms for max-margin topic models.
