Automated Scalable Bayesian Inference via Data Summarization


1 Automated Scalable Bayesian Inference via Data Summarization. Tamara Broderick, ITT Career Development Assistant Professor, MIT. With: Trevor Campbell, Jonathan H. Huggins.

2-7 Bayesian inference. Applications: fuel consumption [Chati, Balakrishnan 2017; image: Julian Herzog 2016], microcredit [Meager 2018a], cybersecurity [Webb, Caverlee, Pu 2006; map: amcharts.com 2016]. Desiderata: point estimates with coherent uncertainties; interpretable, complex, modular models; use of prior information. Challenge: existing methods can be slow, tedious, unreliable. Our proposal: use efficient data summaries for scalable, automated algorithms with error bounds for finite data.

8-9 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

10-27 Bayesian inference. Posterior: $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$, illustrated on the running Normal and Phishing examples with data $(x_n, y_n)$; the Normal example has an exact posterior [Bishop 2006]. MCMC: eventually accurate but can be slow [Bardenet, Doucet, Holmes 2017]. (Mean-field) variational Bayes, (MF)VB: fast, streaming, distributed (3.6M Wikipedia articles, 32 cores, ~1 hour) [Broderick, Boyd, Wibisono, Wilson, Jordan 2013], but the MFVB approximation $q^*(\theta)$ can misestimate the exact posterior $p(\theta \mid x)$ and lacks quality guarantees; LRVB offers a correction [MacKay 2003; Bishop 2006; Wang, Titterington 2004; Turner, Sahani 2011; Fosdick 2013; Dunson 2014; Bardenet, Doucet, Holmes 2017; Opper, Winther 2003; Giordano, Broderick, Jordan 2015, 2017]. Automation: e.g. Stan, NUTS, ADVI [Hoffman, Gelman 2014; Kucukelbir, Tran, Ranganath, Gelman, Blei 2017].
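To make the running Normal example concrete, the following minimal NumPy sketch (ours, not from the talk; the unit observation variance and standard Normal prior are illustrative assumptions) computes the exact conjugate posterior that approximate methods like MCMC and (MF)VB target in harder models.

```python
import numpy as np

# Illustrative toy model: y_n ~ N(theta, 1) with prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
N = 10_000
y = rng.normal(2.0, 1.0, size=N)

# Conjugacy gives p(theta | y) in closed form:
# posterior precision = prior precision + N; posterior mean = sum(y) / (N + 1).
post_var = 1.0 / (1.0 + N)
post_mean = y.sum() * post_var
print(f"exact posterior: N({post_mean:.3f}, {post_var:.2e})")
```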

28-29 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

30-41 Bayesian coresets. Observe: redundancies can exist even if the data isn't tall (e.g., the Football vs. Curling image classes). Coresets: pre-process the data to get a smaller, weighted data set, with theoretical guarantees on quality [Agarwal et al 2005; Feldman & Langberg 2011]. Previous heuristics: data squashing, big-data GPs [DuMouchel et al 1999; Madigan et al 2002]. Cf. subsampling. How to develop coresets for Bayes? [Huggins, Campbell, Broderick 2016; Campbell, Broderick 2017; Campbell, Broderick 2018]

42-49 Bayesian coresets. Posterior: $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$. Log likelihood: $\mathcal{L}_n(\theta) := \log p(y_n \mid \theta)$, $\mathcal{L}(\theta) := \sum_{n=1}^{N} \mathcal{L}_n(\theta)$. Coreset log likelihood: $\mathcal{L}(w, \theta) := \sum_{n=1}^{N} w_n \mathcal{L}_n(\theta)$ subject to $\|w\|_0 \ll N$. An ε-coreset satisfies $\|\mathcal{L}(w) - \mathcal{L}\| \le \epsilon$, and the resulting approximate posterior is close in Wasserstein distance: $d_{W_j}(p_w(\theta \mid y), p(\theta \mid y)) \le C_j \|\mathcal{L}(w) - \mathcal{L}\|_{\hat\pi, F}$ for $j \in \{1, 2\}$.
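The objects above translate directly into code. Below is a hedged sketch (our illustration: the toy Normal model, the particular sparse weight vector, and the sup-over-a-grid stand-in for the talk's norm are all assumptions) that evaluates a coreset log likelihood $\mathcal{L}(w, \theta)$ against the full-data $\mathcal{L}(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
y = rng.normal(2.0, 1.0, size=N)            # toy model: L_n(theta) = -(y_n - theta)^2 / 2
thetas = np.linspace(0.0, 4.0, 101)         # grid on which we compare log likelihoods

def L_full(theta):
    # L(theta) = sum_n L_n(theta)
    return -0.5 * np.sum((y - theta) ** 2)

def L_coreset(w, theta):
    # L(w, theta) = sum_n w_n L_n(theta), with w sparse: ||w||_0 << N
    return -0.5 * np.sum(w * (y - theta) ** 2)

# Example sparse weights: 100 points, each upweighted by N / 100.
w = np.zeros(N)
idx = rng.choice(N, size=100, replace=False)
w[idx] = N / 100.0

# Empirical epsilon over the grid (a stand-in for the norm used in the talk).
eps = max(abs(L_coreset(w, t) - L_full(t)) for t in thetas)
print(f"||w||_0 = {np.count_nonzero(w)}, grid error = {eps:.1f}")
```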

50-51 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

52-67 Uniform subsampling revisited (Normal and Phishing examples). Uniform subsampling might miss important data, and it yields noisy estimates; the plots compare subsample sizes M = 10, M = 100, M = 1000.
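The "noisy estimates" point is easy to see numerically. In this hedged sketch (same illustrative Normal model as above; the N/M reweighting and the repetition count are our choices), uniform subsampling gives approximate posterior means whose spread shrinks only slowly with the subsample size M.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
y = rng.normal(2.0, 1.0, size=N)
exact = y.sum() / (N + 1.0)                  # exact conjugate posterior mean

def subsampled_posterior_mean(M):
    # Uniformly subsample M points, weight each by N/M, then apply conjugacy.
    sub = rng.choice(y, size=M, replace=False)
    return (N / M) * sub.sum() / (N + 1.0)

for M in (10, 100, 1000):
    draws = [subsampled_posterior_mean(M) for _ in range(200)]
    print(f"M={M:5d}: exact={exact:.3f}, "
          f"subsampled={np.mean(draws):.3f} +/- {np.std(draws):.3f}")
```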

68-69 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

70-88 Importance sampling (Normal and Phishing examples). Define $\sigma := \sum_{n=1}^{N} \|\mathcal{L}_n\|$ and sampling probabilities $p_n := \|\mathcal{L}_n\| / \sigma$, so each point is sampled in proportion to its influence $\|\mathcal{L}_n\|$. Construction: 1. take the data; 2. compute the importance weights; 3. draw an importance sample of size M; 4. invert the weights. Thm sketch (CB). Fix $\delta \in (0, 1)$. With probability $1 - \delta$, after M iterations, $\|\mathcal{L}(w) - \mathcal{L}\| \le \frac{\sigma}{\sqrt{M}}\left(1 + \sqrt{2 \log \frac{1}{\delta}}\right)$. Still noisy estimates at M = 10, M = 100, M = 1000.
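The four steps above can be sketched as follows (our illustration: taking $\|\mathcal{L}_n\| = |y_n|$ as the per-point norm in the toy Normal model is an assumption standing in for the talk's Hilbert norm).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 10_000, 100
y = rng.normal(2.0, 1.0, size=N)             # step 1: the data

norms = np.abs(y)                            # step 2: importance weights ||L_n||
sigma = norms.sum()                          # sigma = sum_n ||L_n||
p = norms / sigma                            # p_n = ||L_n|| / sigma

idx = rng.choice(N, size=M, replace=True, p=p)   # step 3: importance sample
w = np.zeros(N)
np.add.at(w, idx, sigma / (M * norms[idx]))      # step 4: invert the weights

# Unbiasedness check: the weighted coreset sum matches the full-data sum.
print(f"full sum = {y.sum():.1f}, coreset sum = {(w * y).sum():.1f}")
```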

89-93 Hilbert coresets. Want a good coreset: $\min_{w \in \mathbb{R}^N} \|\mathcal{L}(w) - \mathcal{L}\|^2$ s.t. $w \ge 0$, $\|w\|_0 \le M$. [Figure: the likelihoods $\exp(\mathcal{L}(\theta))$ and $\exp(\mathcal{L}_n(\theta))$ over data $y_n$, viewed as vectors $\mathcal{L}$ and $\mathcal{L}_n$.] The key is to consider the (residual) error direction: this is a sparse optimization problem.

94-95 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

96-103 Frank-Wolfe: convex optimization on a polytope $\mathcal{D}$ [Jaggi 2013]. Repeat: 1. find the gradient; 2. find the argmin point of the linearization over $\mathcal{D}$; 3. do a line search between the current point and the argmin point. This yields a convex combination of M vertices after M - 1 steps. Our problem: $\min_{w \in \mathbb{R}^N} \|\mathcal{L}(w) - \mathcal{L}\|^2$, relaxing the constraint $w \ge 0$, $\|w\|_0 \le M$ to the polytope $\mathcal{D} := \left\{ w \in \mathbb{R}^N : \sum_{n=1}^{N} \sigma_n w_n = \sigma,\; w \ge 0 \right\}$. Thm sketch (CB). After M iterations, $\|\mathcal{L}(w) - \mathcal{L}\|$ is $O(1/\sqrt{M})$, with problem-dependent constants.
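A hedged Frank-Wolfe sketch for this objective is below. Our illustration represents each $\mathcal{L}_n$ by a finite-dimensional vector of gradients at a few posterior-like samples (a stand-in for the talk's Hilbert-space construction); the toy model, sample count J, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J, M = 10_000, 20, 50
y = rng.normal(2.0, 1.0, size=N)
thetas = rng.normal(2.0, 0.1, size=J)            # draws from a cheap posterior approximation

V = (y[:, None] - thetas[None, :]) / np.sqrt(J)  # row n: vector representing L_n
sig = np.linalg.norm(V, axis=1) + 1e-12          # sigma_n = ||L_n|| (guarded against 0)
L = V.sum(axis=0)                                # vector representing the full L

w = np.zeros(N)
f = int(np.argmax(V @ L / sig))                  # initialize at the best-aligned vertex
w[f] = sig.sum() / sig[f]
for _ in range(M - 1):
    r = L - V.T @ w                              # residual: the error direction
    f = int(np.argmax(V @ r / sig))              # vertex most aligned with the residual
    d = np.zeros(N)
    d[f] = sig.sum() / sig[f]                    # vertex of {sum_n sigma_n w_n = sigma, w >= 0}
    g = V.T @ (d - w)
    step = float(np.clip((g @ r) / (g @ g), 0.0, 1.0))   # exact line search
    w = w + step * (d - w)

err = np.linalg.norm(V.T @ w - L) / np.linalg.norm(L)
print(f"||w||_0 = {np.count_nonzero(w)}, relative error = {err:.4f}")
```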

104-108 Gaussian model (simulated): 10K points; norms and inference available in closed form. Comparison of uniform subsampling, importance sampling, and Frank-Wolfe at M = 5, M = 50, M = 500. [Figure: Frank-Wolfe attains lower error at matched coreset sizes.]

109 Logistic regression (simulated): 10K points; general inference. Uniform subsampling vs. importance sampling vs. Frank-Wolfe at M = 10, M = 100, M = 1000.

110 Poisson regression (simulated): 10K points; general inference. Uniform subsampling vs. importance sampling vs. Frank-Wolfe at M = 10, M = 100, M = 1000.

111 Real data experiments: logistic regression and Poisson regression. [Figure: Frank-Wolfe attains lower error than uniform subsampling.]

112-113 Roadmap: Approximate Bayes review; the core of the data set; uniform data subsampling isn't enough; importance sampling for coresets; optimization for coresets; approximate sufficient statistics.

114-121 Data summarization. An exponential family likelihood admits sufficient statistics: $p(y_{1:N} \mid x_{1:N}, \theta) = \prod_{n=1}^{N} \exp\left[T(y_n, x_n) \cdot \eta(\theta)\right] = \exp\left[\left\{\sum_{n=1}^{N} T(y_n, x_n)\right\} \cdot \eta(\theta)\right]$, so inference is scalable, single-pass, streaming, distributed, and complementary to MCMC. But often there are no simple sufficient statistics, e.g. for Bayesian logistic regression, GLMs, and deeper models, where the likelihood is $p(y_{1:N} \mid x_{1:N}, \theta) = \prod_{n=1}^{N} \frac{1}{1 + \exp(-y_n x_n \cdot \theta)}$. Our proposal: (polynomial) approximate sufficient statistics.
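As a concrete instance of the proposal, the sketch below builds degree-2 polynomial approximate sufficient statistics for logistic regression, in the spirit of PASS-GLM [Huggins, Adams, Broderick 2017]; the fitting interval, least-squares polynomial fit, and toy data are our assumptions, not the paper's exact recipe.

```python
import numpy as np

# Fit phi(s) = -log(1 + exp(-s)) by a quadratic on an interval (our choice).
s = np.linspace(-4.0, 4.0, 401)
a2, a1, a0 = np.polyfit(s, -np.log1p(np.exp(-s)), 2)   # phi(s) ~ a0 + a1 s + a2 s^2

rng = np.random.default_rng(4)
N, d = 100_000, 10
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true)), 1.0, -1.0)

# One streaming pass. Since y_n^2 = 1, the approximate log likelihood
#   sum_n phi(y_n x_n . theta) ~ N a0 + a1 (sum_n y_n x_n) . theta
#                                + a2 theta^T (sum_n x_n x_n^T) theta
# depends on the data only through the O(d^2) summaries t1 and t2.
t1 = (y[:, None] * X).sum(axis=0)
t2 = X.T @ X

def approx_loglik(theta):
    return N * a0 + a1 * (t1 @ theta) + a2 * (theta @ t2 @ theta)

print(f"approximate log likelihood at theta_true: {approx_loglik(theta_true):.1f}")
```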

122 Data summarization: 6M data points, 1000 features; streaming, distributed, with minimal communication (22 cores, 16 sec); finite-data guarantees on the Wasserstein distance to the exact posterior [Huggins, Adams, Broderick 2017].

123-125 Conclusions. Data summarization yields scalable, automated approximate Bayes algorithms with error bounds on quality for finite data; the summaries, coresets and approximate sufficient statistics, get more accurate with more computational investment. This is a start, with lots of potential improvements and directions [Campbell, Broderick 2018].

126 References (1/5)
T Campbell and T Broderick. Automated scalable Bayesian inference via Hilbert coresets. Under review; arXiv preprint.
T Campbell and T Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. ICML 2018, to appear.
JH Huggins, T Campbell, and T Broderick. Coresets for scalable Bayesian logistic regression. NIPS 2016.
JH Huggins, RP Adams, and T Broderick. PASS-GLM: Polynomial approximate sufficient statistics for scalable Bayesian GLM inference. NIPS 2017.
Code links in the papers.

127 References (2/5)
PK Agarwal, S Har-Peled, and KR Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry 52 (2005).
R Bardenet, A Doucet, and C Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18.1 (2017).
T Broderick, N Boyd, A Wibisono, AC Wilson, and MI Jordan. Streaming variational Bayes. NIPS 2013.
CM Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
W DuMouchel, C Volinsky, T Johnson, C Cortes, and D Pregibon. Squashing flat files flatter. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1999.
D Dunson. Robust and scalable approach to Bayesian inference. Talk at ISBA 2014.
D Feldman and M Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing. ACM, 2011.
B Fosdick. Modeling Heterogeneity within and between Matrices and Arrays, Chapter 4.7. PhD Thesis, University of Washington, 2013.
RJ Giordano, T Broderick, and MI Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. NIPS 2015.
R Giordano, T Broderick, R Meager, J Huggins, and MI Jordan. Fast robustness quantification with variational Bayes. ICML 2016 Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016.

128 References (3/5)
MD Hoffman and A Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(1), 2014.
M Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. ICML 2013.
A Kucukelbir, R Ranganath, A Gelman, and D Blei. Automatic variational inference in Stan. NIPS 2015.
A Kucukelbir, D Tran, R Ranganath, A Gelman, and DM Blei. Automatic differentiation variational inference. Journal of Machine Learning Research 18.1 (2017).
DJC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
D Madigan, N Raghavan, W DuMouchel, M Nason, C Posse, and G Ridgeway. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery 6(2), 2002.
M Opper and O Winther. Variational linear response. NIPS 2003.
Stan (open source software).
RE Turner and M Sahani. Two problems with variational expectation maximisation for time-series models. In D Barber, AT Cemgil, and S Chiappa, editors, Bayesian Time Series Models, 2011.
B Wang and M Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. AISTATS 2004.

129 Application References (4/5)
YS Chati and H Balakrishnan. A Gaussian process regression approach to model aircraft engine fuel flow rate. ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 2017.
R Meager. Understanding the impact of microcredit expansions: a Bayesian hierarchical analysis of 7 randomized experiments. AEJ: Applied, to appear, 2018a.
R Meager. Aggregating distributional treatment effects: a Bayesian hierarchical analysis of the microcredit literature. Working paper, 2018b.
S Webb, J Caverlee, and C Pu. Introducing the Webb Spam Corpus: using spam to identify web spam automatically. CEAS 2006.

130 Additional image references (5/5)
amcharts. Visited Countries Map. Accessed 2016.
J. Herzog. 3 June 2016, 17:17:30. File:Airbus_A _F-WWCF_MSN002_ILA_Berlin_2016_17.jpg (Creative Commons Attribution 4.0 International License).

131-138 Practicalities. Choice of norm: e.g. the (weighted) Fisher information distance $\|\mathcal{L}(w) - \mathcal{L}\|^2_{\hat\pi, F} := \mathbb{E}_{\hat\pi}\left[\|\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(w, \theta)\|_2^2\right]$, with associated inner product $\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F} := \mathbb{E}_{\hat\pi}\left[\nabla\mathcal{L}_n(\theta)^\top \nabla\mathcal{L}_m(\theta)\right]$. Random feature projection: $\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F} \approx \frac{D}{J} \sum_{j=1}^{J} (\nabla\mathcal{L}_n(\theta_j))_{d_j} (\nabla\mathcal{L}_m(\theta_j))_{d_j}$, where $d_j \overset{\text{iid}}{\sim} \mathrm{Unif}\{1, \dots, D\}$ and $\theta_j \overset{\text{iid}}{\sim} \hat\pi$. Thm sketch (CB). With high probability and large enough J, a good coreset after random feature projection is a good coreset for $(\mathcal{L}_n)_{n=1}^{N}$.
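The random feature projection above reduces each $\mathcal{L}_n$ to a finite vector whose inner products estimate $\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat\pi, F}$. A hedged sketch (our toy D-dimensional Normal-mean model, where $\nabla \mathcal{L}_n(\theta) = y_n - \theta$, and our choice of $\hat\pi$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, J = 2_000, 5, 500
Y = rng.normal(0.0, 1.0, size=(N, D))              # toy data; grad L_n(theta) = y_n - theta

m, sd = Y.mean(axis=0), 0.05                       # cheap approximation pihat = N(m, sd^2 I)
d_j = rng.integers(0, D, size=J)                   # d_j ~ Unif{1, ..., D}
theta_j = m + sd * rng.normal(size=(J, D))         # theta_j ~ pihat

# v_n[j] = sqrt(D/J) (grad L_n(theta_j))_{d_j}, so v_n . v_m estimates <L_n, L_m>.
V = np.sqrt(D / J) * (Y[:, d_j] - theta_j[np.arange(J), d_j][None, :])

# Closed-form check for this toy model:
# E_pihat[grad L_0 . grad L_1] = (y_0 - m).(y_1 - m) + D sd^2.
exact = (Y[0] - m) @ (Y[1] - m) + D * sd**2
print(f"projected <L_0, L_1> = {V[0] @ V[1]:.3f}, exact = {exact:.3f}")
```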

139-146 Full pipeline. [Figure: the end-to-end pipeline and its cost, built around the cheap approximation $\hat\pi$, vs. the O(NT) cost of inference on the full data.] Can make streaming, distributed.


More information

Fast yet Simple Natural-Gradient Variational Inference in Complex Models

Fast yet Simple Natural-Gradient Variational Inference in Complex Models Fast yet Simple Natural-Gradient Variational Inference in Complex Models Mohammad Emtiyaz Khan RIKEN Center for AI Project, Tokyo, Japan http://emtiyaz.github.io Joint work with Wu Lin (UBC) DidrikNielsen,

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Tokamak profile database construction incorporating Gaussian process regression

Tokamak profile database construction incorporating Gaussian process regression Tokamak profile database construction incorporating Gaussian process regression A. Ho 1, J. Citrin 1, C. Bourdelle 2, Y. Camenen 3, F. Felici 4, M. Maslov 5, K.L. van de Plassche 1,4, H. Weisen 6 and JET

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

RANDOM TOPICS. stochastic gradient descent & Monte Carlo

RANDOM TOPICS. stochastic gradient descent & Monte Carlo RANDOM TOPICS stochastic gradient descent & Monte Carlo MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

Stochastic Annealing for Variational Inference

Stochastic Annealing for Variational Inference Stochastic Annealing for Variational Inference arxiv:1505.06723v1 [stat.ml] 25 May 2015 San Gultekin, Aonan Zhang and John Paisley Department of Electrical Engineering Columbia University Abstract We empirically

More information