MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

1 MAD-Bayes: MAP-based Asymptotic Derivations from Bayes
Tamara Broderick, Brian Kulis, Michael I. Jordan

2 Clusters
[Figure: pictures grouped into clusters labeled Cat, Dog, and Mouse.]

3 Clusters
[Figure: Picture 1 to Picture 7 assigned to the clusters Cat, Dog, Mouse, Lizard, and Sheep.]

4 Features
[Figure: Picture 1 to Picture 7 assigned to the features Cat, Dog, Mouse, Lizard, and Sheep.]
Many other possible latent structures in data

6 How do we learn latent structure?

K-means
- Fast
- Can parallelize
- Straightforward
- Only works for K clusters

Nonparametric Bayes
- Modular (general latent structure)
- Flexible (K can grow as data grows)
- Coherent treatment of uncertainty

But...
- E.g., Silicon Valley: can have petabytes of data
- Practitioners turn to what runs

17 MAD-Bayes

Perspectives
- Bayesian nonparametrics assists the optimization-based inference community
  - New, modular, flexible, nonparametric objectives & regularizers
  - Alternative perspective: fast initialization

Inspiration
- Consider a finite Gaussian mixture model
- The steps of the EM algorithm limit to the steps of the K-means algorithm as the Gaussian variance is taken to 0
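The limit named under "Inspiration" can be illustrated numerically. The sketch below is not from the slides: it assumes an equal-weight mixture of spherical Gaussians with shared standard deviation `sigma`, and the function name `responsibilities` and the toy data are made up for illustration. It shows the E-step responsibilities hardening into nearest-center assignments as `sigma` shrinks.

```python
import numpy as np

def responsibilities(x, centers, sigma):
    """E-step for an equal-weight spherical Gaussian mixture (assumed model)."""
    # Squared distance from every point to every center, shape (N, K).
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Log of the unnormalized responsibilities; subtract the row max for stability.
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical toy data: three points, two centers.
x = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])

# Soft assignments sharpen as the variance is taken toward 0.
for sigma in (5.0, 1.0, 0.1):
    print(sigma, np.round(responsibilities(x, centers, sigma), 3))

# Hard K-means assignment: index of the nearest center.
hard = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
```

For very small `sigma` each row of the responsibility matrix puts essentially all of its mass on the nearest center, which is the assignment step of K-means.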

22 MAD-Bayes

The MAD-Bayes idea
- Start with nonparametric Bayes model
- Take a similar limit to get a K-means-like objective

25 K-means

K-means clustering problem: minimize the sum of squared distances from data points to cluster centers.

K-means objective:

$$\min \sum_{n=1}^{N} \lVert x_n - \text{center}_n \rVert^2$$

32 Lloyd's algorithm

Iterate until no changes:
1. For n = 1, ..., N
   - Assign point n to a cluster
2. Update cluster means
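The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names `lloyd` and `kmeans_objective`, the `max_iter` cap, and the toy data are assumptions, and the per-point loop over n is vectorized into a single distance computation.

```python
import numpy as np

def kmeans_objective(x, centers, assign):
    """Sum of squared distances from each point to its assigned center."""
    return float(((x - centers[assign]) ** 2).sum())

def lloyd(x, init_centers, max_iter=100):
    centers = np.array(init_centers, dtype=float)
    assign = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest center.
        sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = sq_dist.argmin(axis=1)
        # Stop when no assignment changed since the previous pass.
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # Step 2: move each center to the mean of its points.
        for k in range(len(centers)):
            members = x[assign == k]
            if len(members) > 0:  # leave an empty cluster's center in place
                centers[k] = members.mean(axis=0)
    return centers, assign

# Hypothetical toy data: two well-separated pairs of points.
x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centers, assign = lloyd(x, init_centers=x[[0, 2]])
print(centers, assign, kmeans_objective(x, centers, assign))
```

The number of clusters is fixed by the number of initial centers, which is the "only works for K clusters" limitation noted earlier in the talk.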

43 Bayesian model

Nonparametric: the number of parameters can grow with the number of data points

50 MAD-Bayes
- Maximum a Posteriori (MAP) is an optimization problem: $\arg\max_{\text{parameters}} P(\text{parameters} \mid \text{data})$
- We take a limit of the objective (posterior) and get one like K-means
  - Small-variance asymptotics
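As a worked illustration of small-variance asymptotics, consider the simplest case mentioned earlier in the talk, a finite Gaussian mixture. The notation here is assumed for the sketch and does not appear on the slides: K equal-weight components with means $\mu_k$, shared covariance $\sigma^2 I_D$, and assignments $z_n$.

```latex
% Joint density of data and assignments (assumed equal-weight spherical mixture):
p(x, z \mid \mu) = \prod_{n=1}^{N} \frac{1}{K}\,(2\pi\sigma^2)^{-D/2}
  \exp\!\left( -\frac{\lVert x_n - \mu_{z_n} \rVert^2}{2\sigma^2} \right)

% Rescale the negative log; for fixed sigma^2 > 0 this keeps the same optimizer:
-2\sigma^2 \log p(x, z \mid \mu)
  = \sum_{n=1}^{N} \lVert x_n - \mu_{z_n} \rVert^2
  + \sigma^2 N \left( D \log(2\pi\sigma^2) + 2 \log K \right)

% As sigma^2 -> 0, sigma^2 log(sigma^2) -> 0, so the second term vanishes:
\lim_{\sigma^2 \to 0} \left[ -2\sigma^2 \log p(x, z \mid \mu) \right]
  = \sum_{n=1}^{N} \lVert x_n - \mu_{z_n} \rVert^2
```

The limit is the K-means objective, so maximizing the rescaled posterior over assignments and means becomes the K-means problem in this special case.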

54 MAD-Bayes

Bayesian posterior -> K-means-like objectives
- Mixture of K Gaussians -> K-means
- Dirichlet process mixture -> Unbounded number of clusters
- Hierarchical Dirichlet process -> Multiple data sets share cluster centers
- Beta process -> Features
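For the Dirichlet process mixture row, the resulting procedure is the DP-means algorithm of Kulis and Jordan (listed in the references). The sketch below is an illustration under assumptions, not their code: it takes the objective to be the K-means cost plus a penalty `lam_sq` per cluster, starts from a single cluster at the global mean, and opens a new cluster whenever a point's squared distance to every center exceeds `lam_sq`. The name `dp_means` and the toy data are hypothetical.

```python
import numpy as np

def dp_means(x, lam_sq, max_iter=100):
    """K-means-like clustering where the number of clusters is not fixed."""
    centers = [x.mean(axis=0)]            # assumed start: one global cluster
    assign = np.zeros(len(x), dtype=int)
    for _ in range(max_iter):
        changed = False
        for n in range(len(x)):
            sq_dist = [float(((x[n] - c) ** 2).sum()) for c in centers]
            k = int(np.argmin(sq_dist))
            if sq_dist[k] > lam_sq:
                # Too far from every center: open a new cluster at this point.
                centers.append(x[n].copy())
                k = len(centers) - 1
            if assign[n] != k:
                assign[n] = k
                changed = True
        # Drop clusters that lost all their points, relabel, and update means.
        used = np.unique(assign)
        assign = np.searchsorted(used, assign)
        centers = [x[assign == j].mean(axis=0) for j in range(len(used))]
        if not changed:
            break
    centers = np.array(centers)
    objective = float(((x - centers[assign]) ** 2).sum()) + lam_sq * len(centers)
    return centers, assign, objective

# Hypothetical toy data: two well-separated pairs of points.
x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centers, assign, objective = dp_means(x, lam_sq=4.0)
print(len(centers), assign, objective)
```

A larger `lam_sq` makes new clusters more expensive and so yields fewer of them; the penalty plays the role that a fixed K plays in ordinary K-means.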

60 Features
[Figure: feature matrix Z with rows Point 1 to Point 7 and columns Feature 1 to Feature 5; later builds of the slide add the label A.]

63 MAD-Bayes

Bayesian posterior:

$$
\begin{aligned}
P(Z, A \mid X) \propto{} & \frac{1}{(2\pi\sigma^2)^{ND/2}} \exp\left\{ -\frac{1}{2\sigma^2} \operatorname{tr}\left((X - ZA)^\top (X - ZA)\right) \right\} \\
& \cdot \frac{\gamma^{K^+} \exp\left\{ -\sum_{n=1}^{N} \frac{\gamma}{n} \right\}}{\prod_{h=1}^{H} \tilde{K}_h!} \prod_{k=1}^{K^+} \frac{(S_{N,k} - 1)!\,(N - S_{N,k})!}{N!} \\
& \cdot \frac{1}{(2\pi\rho^2)^{K^+ D/2}} \exp\left\{ -\frac{1}{2\rho^2} \operatorname{tr}(A^\top A) \right\}
\end{aligned}
$$

70 MAD-Bayes

BP-means objective:

$$\operatorname*{argmin}_{K^+, Z, A} \; \operatorname{tr}\left[(X - ZA)^\top (X - ZA)\right] + K^+ \lambda^2$$

BP-means algorithm. Iterate until no changes:
1. For n = 1, ..., N
   - Assign point n to features
   - Create a new feature if it lowers the objective
2. Update feature means: $A \leftarrow (Z^\top Z)^{-1} Z^\top X$
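The BP-means steps can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation, and several details are assumptions: the start from a single feature shared by all points with the data mean as its value, the rule that a new feature is created from a point's residual when the squared residual exceeds `lam_sq`, the removal of unused features, and the use of a pseudoinverse in place of $(Z^\top Z)^{-1} Z^\top$ so that a singular $Z^\top Z$ does not raise an error. The names `bp_means` and `bp_objective` are hypothetical.

```python
import numpy as np

def bp_objective(x, z, a, lam_sq):
    """tr[(X - ZA)^T (X - ZA)] plus lam_sq per feature."""
    resid = x - z @ a
    return float((resid ** 2).sum()) + lam_sq * z.shape[1]

def bp_means(x, lam_sq, max_iter=50):
    n_points = x.shape[0]
    z = np.ones((n_points, 1))               # assumed start: one shared feature
    a = x.mean(axis=0, keepdims=True)        # its value is the data mean
    for _ in range(max_iter):
        z_old = z.copy()
        for n in range(n_points):
            # Step 1a: switch each feature on or off to best fit point n.
            for k in range(z.shape[1]):
                errs = []
                for value in (0.0, 1.0):
                    z[n, k] = value
                    errs.append(((x[n] - z[n] @ a) ** 2).sum())
                z[n, k] = float(np.argmin(errs))
            # Step 1b: a new feature costs lam_sq and removes this residual.
            resid = x[n] - z[n] @ a
            if (resid ** 2).sum() > lam_sq:
                new_col = np.zeros((n_points, 1))
                new_col[n, 0] = 1.0
                z = np.hstack([z, new_col])
                a = np.vstack([a, resid])
        # Drop features no point uses, then Step 2: least-squares update of A.
        z = z[:, z.sum(axis=0) > 0]
        a = np.linalg.pinv(z) @ x
        if z.shape == z_old.shape and np.array_equal(z, z_old):
            break
    return z, a

# Hypothetical toy data: two points at the origin, two at (4, 0).
x = np.array([[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [4.0, 0.0]])
z, a = bp_means(x, lam_sq=1.0)
print(z, a, bp_objective(x, z, a, lam_sq=1.0))
```

Like Lloyd's algorithm, this kind of coordinate descent only reaches a local optimum that depends on the initialization and the order in which points are visited, so several restarts are a natural way to use it.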

77 MAD-Bayes

Griffiths & Ghahramani (2006) computer vision problem: tabletop data

Bayesian posterior (Gibbs sampler): 8.5 * 10^3 sec
BP-means algorithm: 0.36 sec

Still faster by an order of magnitude if restarted 1000 times
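The restart claim can be checked with simple arithmetic on the two timings from the slide. The only assumption in this sketch is that each of the 1000 restarts costs the same 0.36 seconds as the single reported run.

```python
gibbs_seconds = 8.5e3        # Gibbs sampler time from the slide
bp_means_seconds = 0.36      # single BP-means run from the slide
restarts = 1000

# Assumed: every restart takes as long as the one reported run.
total_bp_seconds = restarts * bp_means_seconds
speedup = gibbs_seconds / total_bp_seconds

print(f"{total_bp_seconds:.0f} s for all restarts, {speedup:.1f}x faster than Gibbs")
```

Under that assumption the restarts take about 360 seconds in total, which is still more than ten times faster than the sampler.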

78 MAD-Bayes

Parallelism and optimistic concurrency control

                     DP-means alg.   BP-means alg.
# data points        134M            8M
Time per iteration   5.5 min         4.3 min

80 MAD-Bayes conclusions
- We provide new optimization objectives and regularizers
  - In fact, general means of obtaining more
  - Straightforward, fast algorithms
- Can the promise of nonparametric Bayes be fulfilled?

84 References
- T. Broderick, B. Kulis, and M. I. Jordan. MAD-Bayes: MAP-based asymptotic derivations from Bayes. In International Conference on Machine Learning.
- X. Pan, J. E. Gonzalez, S. Jegelka, T. Broderick, and M. I. Jordan. Optimistic concurrency control for distributed unsupervised learning. In Neural Information Processing Systems.
- T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems.

85 Further References
- T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems, 2006.
- N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3).
- J. F. C. Kingman. The representation of partition structures. Journal of the London Mathematical Society, 2(2):374.
- B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning.
- J. Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2).
- R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, 2007.
