Automated Scalable Bayesian Inference via Data Summarization


1 Automated Scalable Bayesian Inference via Data Summarization. Tamara Broderick, ITT Career Development Assistant Professor, MIT. With: Trevor Campbell, Jonathan H. Huggins.

2-7 Bayesian inference. Applications: fuel consumption [Chati, Balakrishnan 2017; image: Julian Herzog 2016], microcredit [Meager 2018a], cybersecurity [Webb, Caverlee, Pu 2006; map: amcharts.com 2016]. Desiderata: point estimates with coherent uncertainties; interpretable, complex, modular models; use of prior information. Challenge: existing methods can be slow, tedious, unreliable. Our proposal: use efficient data summaries for scalable, automated algorithms with error bounds for finite data.

8-9 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

10-27 Bayesian inference. Posterior: $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$, illustrated on the running Normal and Phishing examples with data $(x_n, y_n)$; the Normal example has an exact posterior [Bishop 2006]. MCMC: eventually accurate but can be slow [Bardenet, Doucet, Holmes 2017]. (Mean-field) variational Bayes, (MF)VB: fast, streaming, distributed (3.6M Wikipedia articles, 32 cores, ~1 hour) [Broderick, Boyd, Wibisono, Wilson, Jordan 2013], but the MFVB approximation $q^*(\theta)$ can misestimate the exact posterior $p(\theta \mid x)$ and lacks quality guarantees; LRVB offers a correction [MacKay 2003; Bishop 2006; Wang, Titterington 2004; Turner, Sahani 2011; Fosdick 2013; Dunson 2014; Bardenet, Doucet, Holmes 2017; Opper, Winther 2003; Giordano, Broderick, Jordan 2015, 2017]. Automation: e.g. Stan, NUTS, ADVI [Hoffman, Gelman 2014; Kucukelbir, Tran, Ranganath, Gelman, Blei 2017].
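To make the running Normal example concrete, the following minimal NumPy sketch (ours, not from the talk; the unit observation variance and standard Normal prior are illustrative assumptions) computes the exact conjugate posterior that approximate methods like MCMC and (MF)VB target in harder models.

```python
import numpy as np

# Illustrative toy model: y_n ~ N(theta, 1) with prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
N = 10_000
y = rng.normal(2.0, 1.0, size=N)

# Conjugacy gives p(theta | y) in closed form:
# posterior precision = prior precision + N; posterior mean = sum(y) / (N + 1).
post_var = 1.0 / (1.0 + N)
post_mean = y.sum() * post_var
print(f"exact posterior: N({post_mean:.3f}, {post_var:.2e})")
```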

28-29 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

30-41 Bayesian coresets. Observe: redundancies can exist even if the data isn't tall (e.g., the Football vs. Curling image classes). Coresets: pre-process the data to get a smaller, weighted data set, with theoretical guarantees on quality [Agarwal et al 2005; Feldman & Langberg 2011]. Previous heuristics: data squashing, big-data GPs [DuMouchel et al 1999; Madigan et al 2002]. Cf. subsampling. How to develop coresets for Bayes? [Huggins, Campbell, Broderick 2016; Campbell, Broderick 2017; Campbell, Broderick 2018]

42-49 Bayesian coresets. Posterior: $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$. Log likelihood: $\mathcal{L}_n(\theta) := \log p(y_n \mid \theta)$, $\mathcal{L}(\theta) := \sum_{n=1}^{N} \mathcal{L}_n(\theta)$. Coreset log likelihood: $\mathcal{L}(w, \theta) := \sum_{n=1}^{N} w_n \mathcal{L}_n(\theta)$ subject to $\|w\|_0 \ll N$. An ε-coreset satisfies $\|\mathcal{L}(w) - \mathcal{L}\| \le \epsilon$, and the resulting approximate posterior is close in Wasserstein distance: $d_{W_j}(p_w(\theta \mid y), p(\theta \mid y)) \le C_j \|\mathcal{L}(w) - \mathcal{L}\|_{\hat\pi, F}$ for $j \in \{1, 2\}$.
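The objects above translate directly into code. Below is a hedged sketch (our illustration: the toy Normal model, the particular sparse weight vector, and the sup-over-a-grid stand-in for the talk's norm are all assumptions) that evaluates a coreset log likelihood $\mathcal{L}(w, \theta)$ against the full-data $\mathcal{L}(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
y = rng.normal(2.0, 1.0, size=N)            # toy model: L_n(theta) = -(y_n - theta)^2 / 2
thetas = np.linspace(0.0, 4.0, 101)         # grid on which we compare log likelihoods

def L_full(theta):
    # L(theta) = sum_n L_n(theta)
    return -0.5 * np.sum((y - theta) ** 2)

def L_coreset(w, theta):
    # L(w, theta) = sum_n w_n L_n(theta), with w sparse: ||w||_0 << N
    return -0.5 * np.sum(w * (y - theta) ** 2)

# Example sparse weights: 100 points, each upweighted by N / 100.
w = np.zeros(N)
idx = rng.choice(N, size=100, replace=False)
w[idx] = N / 100.0

# Empirical epsilon over the grid (a stand-in for the norm used in the talk).
eps = max(abs(L_coreset(w, t) - L_full(t)) for t in thetas)
print(f"||w||_0 = {np.count_nonzero(w)}, grid error = {eps:.1f}")
```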

50-51 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

52-67 Uniform subsampling revisited (Normal and Phishing examples). Uniform subsampling might miss important data, and it yields noisy estimates; the plots compare subsample sizes M = 10, M = 100, M = 1000.
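The "noisy estimates" point is easy to see numerically. In this hedged sketch (same illustrative Normal model as above; the N/M reweighting and the repetition count are our choices), uniform subsampling gives approximate posterior means whose spread shrinks only slowly with the subsample size M.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
y = rng.normal(2.0, 1.0, size=N)
exact = y.sum() / (N + 1.0)                  # exact conjugate posterior mean

def subsampled_posterior_mean(M):
    # Uniformly subsample M points, weight each by N/M, then apply conjugacy.
    sub = rng.choice(y, size=M, replace=False)
    return (N / M) * sub.sum() / (N + 1.0)

for M in (10, 100, 1000):
    draws = [subsampled_posterior_mean(M) for _ in range(200)]
    print(f"M={M:5d}: exact={exact:.3f}, "
          f"subsampled={np.mean(draws):.3f} +/- {np.std(draws):.3f}")
```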

68-69 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

70-88 Importance sampling (Normal and Phishing examples). Define $\sigma := \sum_{n=1}^{N} \|\mathcal{L}_n\|$ and sampling probabilities $p_n := \|\mathcal{L}_n\| / \sigma$, so each point is sampled in proportion to its influence $\|\mathcal{L}_n\|$. Construction: 1. take the data; 2. compute the importance weights; 3. draw an importance sample of size M; 4. invert the weights. Thm sketch (CB). Fix $\delta \in (0, 1)$. With probability $1 - \delta$, after M iterations, $\|\mathcal{L}(w) - \mathcal{L}\| \le \frac{\sigma}{\sqrt{M}}\left(1 + \sqrt{2 \log \frac{1}{\delta}}\right)$. Still noisy estimates at M = 10, M = 100, M = 1000.
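The four steps above can be sketched as follows (our illustration: taking $\|\mathcal{L}_n\| = |y_n|$ as the per-point norm in the toy Normal model is an assumption standing in for the talk's Hilbert norm).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 10_000, 100
y = rng.normal(2.0, 1.0, size=N)             # step 1: the data

norms = np.abs(y)                            # step 2: importance weights ||L_n||
sigma = norms.sum()                          # sigma = sum_n ||L_n||
p = norms / sigma                            # p_n = ||L_n|| / sigma

idx = rng.choice(N, size=M, replace=True, p=p)   # step 3: importance sample
w = np.zeros(N)
np.add.at(w, idx, sigma / (M * norms[idx]))      # step 4: invert the weights

# Unbiasedness check: the weighted coreset sum matches the full-data sum.
print(f"full sum = {y.sum():.1f}, coreset sum = {(w * y).sum():.1f}")
```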

89-93 Hilbert coresets. Want a good coreset: $\min_{w \in \mathbb{R}^N} \|\mathcal{L}(w) - \mathcal{L}\|^2$ s.t. $w \ge 0$, $\|w\|_0 \le M$. [Figure: the likelihoods $\exp(\mathcal{L}(\theta))$ and $\exp(\mathcal{L}_n(\theta))$ over data $y_n$, viewed as vectors $\mathcal{L}$ and $\mathcal{L}_n$.] The key is to consider the (residual) error direction: this is a sparse optimization problem.

94-95 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

96-103 Frank-Wolfe: convex optimization on a polytope $\mathcal{D}$ [Jaggi 2013]. Repeat: 1. find the gradient; 2. find the argmin point of the linearization over $\mathcal{D}$; 3. do a line search between the current point and the argmin point. This yields a convex combination of M vertices after M - 1 steps. Our problem: $\min_{w \in \mathbb{R}^N} \|\mathcal{L}(w) - \mathcal{L}\|^2$, relaxing the constraint $w \ge 0$, $\|w\|_0 \le M$ to the polytope $\mathcal{D} := \left\{ w \in \mathbb{R}^N : \sum_{n=1}^{N} \sigma_n w_n = \sigma,\; w \ge 0 \right\}$. Thm sketch (CB). After M iterations, $\|\mathcal{L}(w) - \mathcal{L}\|$ is $O(1/\sqrt{M})$, with problem-dependent constants.
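A hedged Frank-Wolfe sketch for this objective is below. Our illustration represents each $\mathcal{L}_n$ by a finite-dimensional vector of gradients at a few posterior-like samples (a stand-in for the talk's Hilbert-space construction); the toy model, sample count J, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J, M = 10_000, 20, 50
y = rng.normal(2.0, 1.0, size=N)
thetas = rng.normal(2.0, 0.1, size=J)            # draws from a cheap posterior approximation

V = (y[:, None] - thetas[None, :]) / np.sqrt(J)  # row n: vector representing L_n
sig = np.linalg.norm(V, axis=1) + 1e-12          # sigma_n = ||L_n|| (guarded against 0)
L = V.sum(axis=0)                                # vector representing the full L

w = np.zeros(N)
f = int(np.argmax(V @ L / sig))                  # initialize at the best-aligned vertex
w[f] = sig.sum() / sig[f]
for _ in range(M - 1):
    r = L - V.T @ w                              # residual: the error direction
    f = int(np.argmax(V @ r / sig))              # vertex most aligned with the residual
    d = np.zeros(N)
    d[f] = sig.sum() / sig[f]                    # vertex of {sum_n sigma_n w_n = sigma, w >= 0}
    g = V.T @ (d - w)
    step = float(np.clip((g @ r) / (g @ g), 0.0, 1.0))   # exact line search
    w = w + step * (d - w)

err = np.linalg.norm(V.T @ w - L) / np.linalg.norm(L)
print(f"||w||_0 = {np.count_nonzero(w)}, relative error = {err:.4f}")
```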

104-108 Gaussian model (simulated): 10K points; norms and inference available in closed form. Comparison of uniform subsampling, importance sampling, and Frank-Wolfe at M = 5, M = 50, M = 500. [Figure: Frank-Wolfe attains lower error at matched coreset sizes.]

109 Logistic regression (simulated): 10K points; general inference. Uniform subsampling vs. importance sampling vs. Frank-Wolfe at M = 10, M = 100, M = 1000.

110 Poisson regression (simulated): 10K points; general inference. Uniform subsampling vs. importance sampling vs. Frank-Wolfe at M = 10, M = 100, M = 1000.

111 Real data experiments: logistic regression and Poisson regression. [Figure: Frank-Wolfe attains lower error than uniform subsampling.]

112-113 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

114-121 Data summarization. An exponential family likelihood admits sufficient statistics: $p(y_{1:N} \mid x_{1:N}, \theta) = \prod_{n=1}^{N} \exp\left[T(y_n, x_n) \cdot \eta(\theta)\right] = \exp\left[\left\{\sum_{n=1}^{N} T(y_n, x_n)\right\} \cdot \eta(\theta)\right]$, so inference is scalable, single-pass, streaming, distributed, and complementary to MCMC. But often there are no simple sufficient statistics, e.g. for Bayesian logistic regression, GLMs, and deeper models, where the likelihood is $p(y_{1:N} \mid x_{1:N}, \theta) = \prod_{n=1}^{N} \frac{1}{1 + \exp(-y_n x_n \cdot \theta)}$. Our proposal: (polynomial) approximate sufficient statistics.
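As a concrete instance of the proposal, the sketch below builds degree-2 polynomial approximate sufficient statistics for logistic regression, in the spirit of PASS-GLM [Huggins, Adams, Broderick 2017]; the fitting interval, least-squares polynomial fit, and toy data are our assumptions, not the paper's exact recipe.

```python
import numpy as np

# Fit phi(s) = -log(1 + exp(-s)) by a quadratic on an interval (our choice).
s = np.linspace(-4.0, 4.0, 401)
a2, a1, a0 = np.polyfit(s, -np.log1p(np.exp(-s)), 2)   # phi(s) ~ a0 + a1 s + a2 s^2

rng = np.random.default_rng(4)
N, d = 100_000, 10
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true)), 1.0, -1.0)

# One streaming pass. Since y_n^2 = 1, the approximate log likelihood
#   sum_n phi(y_n x_n . theta) ~ N a0 + a1 (sum_n y_n x_n) . theta
#                                + a2 theta^T (sum_n x_n x_n^T) theta
# depends on the data only through the O(d^2) summaries t1 and t2.
t1 = (y[:, None] * X).sum(axis=0)
t2 = X.T @ X

def approx_loglik(theta):
    return N * a0 + a1 * (t1 @ theta) + a2 * (theta @ t2 @ theta)

print(f"approximate log likelihood at theta_true: {approx_loglik(theta_true):.1f}")
```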

122 Data summarization: 6M data points, 1000 features; streaming, distributed, with minimal communication (22 cores, 16 sec); finite-data guarantees on the Wasserstein distance to the exact posterior [Huggins, Adams, Broderick 2017].

123-125 Conclusions. Data summarization yields scalable, automated approximate Bayes algorithms with error bounds on quality for finite data; the summaries, coresets and approximate sufficient statistics, get more accurate with more computational investment. This is a start, with lots of potential improvements and directions [Campbell, Broderick 2018].

126 References (1/5)
T Campbell and T Broderick. Automated scalable Bayesian inference via Hilbert coresets. Under review; arXiv preprint.
T Campbell and T Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. ICML 2018, to appear.
JH Huggins, T Campbell, and T Broderick. Coresets for scalable Bayesian logistic regression. NIPS 2016.
JH Huggins, RP Adams, and T Broderick. PASS-GLM: Polynomial approximate sufficient statistics for scalable Bayesian GLM inference. NIPS 2017.
Code links in the papers.

127 References (2/5)
PK Agarwal, S Har-Peled, and KR Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry 52 (2005).
R Bardenet, A Doucet, and C Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18.1 (2017).
T Broderick, N Boyd, A Wibisono, AC Wilson, and MI Jordan. Streaming variational Bayes. NIPS 2013.
CM Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
W DuMouchel, C Volinsky, T Johnson, C Cortes, and D Pregibon. Squashing flat files flatter. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1999.
D Dunson. Robust and scalable approach to Bayesian inference. Talk at ISBA 2014.
D Feldman and M Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing. ACM, 2011.
B Fosdick. Modeling Heterogeneity within and between Matrices and Arrays, Chapter 4.7. PhD Thesis, University of Washington, 2013.
RJ Giordano, T Broderick, and MI Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. NIPS 2015.
R Giordano, T Broderick, R Meager, J Huggins, and MI Jordan. Fast robustness quantification with variational Bayes. ICML 2016 Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016.

128 References (3/5)
MD Hoffman and A Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(1), 2014.
M Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. ICML 2013.
A Kucukelbir, R Ranganath, A Gelman, and D Blei. Automatic variational inference in Stan. NIPS 2015.
A Kucukelbir, D Tran, R Ranganath, A Gelman, and DM Blei. Automatic differentiation variational inference. Journal of Machine Learning Research 18.1 (2017).
DJC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
D Madigan, N Raghavan, W DuMouchel, M Nason, C Posse, and G Ridgeway. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery 6(2), 2002.
M Opper and O Winther. Variational linear response. NIPS 2003.
Stan (open source software).
RE Turner and M Sahani. Two problems with variational expectation maximisation for time-series models. In D Barber, AT Cemgil, and S Chiappa, editors, Bayesian Time Series Models, 2011.
B Wang and M Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. AISTATS 2004.

129 Application References (4/5)
YS Chati and H Balakrishnan. A Gaussian process regression approach to model aircraft engine fuel flow rate. ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 2017.
R Meager. Understanding the impact of microcredit expansions: a Bayesian hierarchical analysis of 7 randomized experiments. AEJ: Applied, to appear, 2018a.
R Meager. Aggregating distributional treatment effects: a Bayesian hierarchical analysis of the microcredit literature. Working paper, 2018b.
S Webb, J Caverlee, and C Pu. Introducing the Webb Spam Corpus: using spam to identify web spam automatically. CEAS 2006.

130 Additional image references (5/5)
amcharts. Visited Countries Map. Accessed 2016.
J. Herzog. 3 June 2016, 17:17:30. File:Airbus_A _F-WWCF_MSN002_ILA_Berlin_2016_17.jpg (Creative Commons Attribution 4.0 International License).

131-138 Practicalities. Choice of norm: e.g. the (weighted) Fisher information distance $\|\mathcal{L}(w) - \mathcal{L}\|^2_{\hat\pi, F} := \mathbb{E}_{\hat\pi}\left[\|\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(w, \theta)\|_2^2\right]$, with associated inner product $\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F} := \mathbb{E}_{\hat\pi}\left[\nabla\mathcal{L}_n(\theta)^\top \nabla\mathcal{L}_m(\theta)\right]$. Random feature projection: $\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F} \approx \frac{D}{J} \sum_{j=1}^{J} (\nabla\mathcal{L}_n(\theta_j))_{d_j} (\nabla\mathcal{L}_m(\theta_j))_{d_j}$, where $d_j \overset{\text{iid}}{\sim} \mathrm{Unif}\{1, \dots, D\}$ and $\theta_j \overset{\text{iid}}{\sim} \hat\pi$. Thm sketch (CB). With high probability and large enough J, a good coreset after random feature projection is a good coreset for $(\mathcal{L}_n)_{n=1}^{N}$.
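The random feature projection above reduces each $\mathcal{L}_n$ to a finite vector whose inner products estimate $\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F}$. A hedged sketch (our toy D-dimensional Normal-mean model, where $\nabla \mathcal{L}_n(\theta) = y_n - \theta$, and our choice of $\hat\pi$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, J = 2_000, 5, 500
Y = rng.normal(0.0, 1.0, size=(N, D))              # toy data; grad L_n(theta) = y_n - theta

m, sd = Y.mean(axis=0), 0.05                       # cheap approximation pihat = N(m, sd^2 I)
d_j = rng.integers(0, D, size=J)                   # d_j ~ Unif{1, ..., D}
theta_j = m + sd * rng.normal(size=(J, D))         # theta_j ~ pihat

# v_n[j] = sqrt(D/J) (grad L_n(theta_j))_{d_j}, so v_n . v_m estimates <L_n, L_m>.
V = np.sqrt(D / J) * (Y[:, d_j] - theta_j[np.arange(J), d_j][None, :])

# Closed-form check for this toy model:
# E_pihat[grad L_0 . grad L_1] = (y_0 - m).(y_1 - m) + D sd^2.
exact = (Y[0] - m) @ (Y[1] - m) + D * sd**2
print(f"projected <L_0, L_1> = {V[0] @ V[1]:.3f}, exact = {exact:.3f}")
```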

139-146 Full pipeline. [Figure: the end-to-end pipeline and its cost, built around the cheap approximation $\hat\pi$, vs. the O(NT) cost of inference on the full data.] Can make streaming, distributed.


More information

Fast yet Simple Natural-Gradient Variational Inference in Complex Models

Fast yet Simple Natural-Gradient Variational Inference in Complex Models Fast yet Simple Natural-Gradient Variational Inference in Complex Models Mohammad Emtiyaz Khan RIKEN Center for AI Project, Tokyo, Japan http://emtiyaz.github.io Joint work with Wu Lin (UBC) DidrikNielsen,

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Tokamak profile database construction incorporating Gaussian process regression

Tokamak profile database construction incorporating Gaussian process regression Tokamak profile database construction incorporating Gaussian process regression A. Ho 1, J. Citrin 1, C. Bourdelle 2, Y. Camenen 3, F. Felici 4, M. Maslov 5, K.L. van de Plassche 1,4, H. Weisen 6 and JET

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

RANDOM TOPICS. stochastic gradient descent & Monte Carlo

RANDOM TOPICS. stochastic gradient descent & Monte Carlo RANDOM TOPICS stochastic gradient descent & Monte Carlo MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

Stochastic Annealing for Variational Inference

Stochastic Annealing for Variational Inference Stochastic Annealing for Variational Inference arxiv:1505.06723v1 [stat.ml] 25 May 2015 San Gultekin, Aonan Zhang and John Paisley Department of Electrical Engineering Columbia University Abstract We empirically

More information