1 Bregman Divergences. Barnabás Póczos. RLAI Tea Talk, UofA, Edmonton, Aug 5, 2008

2 Contents: Bregman divergences (definition, properties); Bregman matrix divergences; relation to the exponential family; applications (generalization of PCA to the exponential family, Generalized² Linear² Models, clustering / co-clustering with Bregman divergences, generalized nonnegative matrix factorization); conclusion. 2

3 Bregman Divergences (squared Euclidean distance). Squared Euclidean distance is a Bregman divergence. (The upcoming figures are borrowed from Dhillon.) 3
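For reference, the standard definition (presumably the formula shown on this slide): for a differentiable, strictly convex generator φ,
\[ D_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle , \]
and with \(\varphi(x) = \|x\|^2\) this reduces to \(D_\varphi(x, y) = \|x - y\|^2\), the squared Euclidean distance.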

4 Bregman Divergences (KL-divergence) Generalized Relative Entropy (also called generalized KL-divergence) is another Bregman divergence 4
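Again filling in the standard formula: the generator \(\varphi(x) = \sum_i x_i \log x_i\) on the nonnegative orthant yields
\[ D_\varphi(x, y) = \sum_i x_i \log\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i , \]
which equals the ordinary KL divergence when x and y are probability vectors.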

5 Bregman Divergences (Itakura-Saito) Itakura-Saito distance is another Bregman divergence (used in signal processing, also known as Burg entropy) 5
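For reference, the generator here is \(\varphi(x) = -\sum_i \log x_i\) (the Burg entropy), giving
\[ D_\varphi(x, y) = \sum_i \left( \frac{x_i}{y_i} - \log\frac{x_i}{y_i} - 1 \right) , \]
the Itakura-Saito distance.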

6 Examples of Bregman Divergences 6

7 Properties of the Bregman Divergences. [Figure: triangle diagram with points x, y, z, sides a, b, c, and angle γ.] Euclidean special case: 7
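The identity the figure presumably illustrates is the standard three-point property,
\[ D_\varphi(x, z) = D_\varphi(x, y) + D_\varphi(y, z) - \langle x - y,\, \nabla\varphi(z) - \nabla\varphi(y) \rangle , \]
which in the Euclidean special case \(\varphi = \|\cdot\|^2\) is exactly the law of cosines.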

8 Properties of the Bregman Divergences. Nearness in Bregman divergence: the Bregman projection of y onto a convex set Ω. When Ω is an affine set, the Pythagorean theorem holds with equality. Generalized Pythagorean theorem (note that it points in the opposite direction to a triangle inequality): 8
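A hedged reconstruction of the statement: writing \(P_\Omega(y) = \arg\min_{\omega \in \Omega} D_\varphi(\omega, y)\) for the Bregman projection, for every \(x \in \Omega\)
\[ D_\varphi(x, y) \;\ge\; D_\varphi\big(x, P_\Omega(y)\big) + D_\varphi\big(P_\Omega(y), y\big) , \]
with equality when Ω is an affine set.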

9 (Regular) Exponential Families 9
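The density formula is presumably the usual natural-parameter form: with log-partition function ψ and base measure \(p_0\),
\[ p_\psi(x; \theta) = \exp\big( \langle x, \theta \rangle - \psi(\theta) \big)\, p_0(x) , \]
with expectation parameter \(\mu = \nabla\psi(\theta)\).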

10 Gaussian Distribution. Note: Gaussian distribution ↔ squared loss from the expected value µ. 10

11 Poisson Distribution. The Poisson distribution is a member of the exponential family, with expected value µ = λ. Is there a divergence associated with the Poisson distribution? Yes: p(x) can be rewritten so that the divergence appears explicitly. Implication: Poisson distribution ↔ relative entropy. 11
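A sketch of the rewriting (the slide's own formula was not captured): with the generator \(\varphi(x) = x\log x - x\) and \(\mu = \lambda\),
\[ \frac{\lambda^x e^{-\lambda}}{x!} \;=\; \exp\big( -D_\varphi(x, \mu) \big)\, b_\varphi(x), \qquad D_\varphi(x, \mu) = x\log\frac{x}{\mu} - x + \mu, \quad b_\varphi(x) = \frac{x^x e^{-x}}{x!} , \]
so the divergence associated with the Poisson family is the (generalized) relative entropy.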

12 Exponential Distribution. The exponential distribution is a member of the exponential family, with expected value µ = 1/λ. Is there a divergence associated with the exponential distribution? Yes: p(x) can be rewritten so that the divergence appears explicitly. Implication: exponential distribution ↔ Itakura-Saito distance. 12

13 Fenchel Conjugate. Definition: the Fenchel conjugate of a function f is defined below. Properties of the Fenchel conjugate: 13
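The standard definition and the properties most relevant here:
\[ f^*(y) = \sup_x \big( \langle x, y \rangle - f(x) \big) ; \]
for closed convex f, \(f^{**} = f\), and for differentiable, strictly convex f, \(\nabla f^* = (\nabla f)^{-1}\).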

14 Bregman Divergences and the Exponential Family Bijection Theorem 14
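A rough statement of the theorem (Banerjee et al., JMLR 2005): every regular exponential family corresponds to a uniquely determined Bregman divergence, via
\[ \exp\big( \langle x, \theta \rangle - \psi(\theta) \big)\, p_0(x) \;=\; \exp\big( -D_\varphi(x, \mu) \big)\, b_\varphi(x), \qquad \varphi = \psi^*, \;\; \mu = \nabla\psi(\theta) , \]
and conversely every regular Bregman divergence arises from some exponential family in this way.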

16 Bregman Matrix Divergences An immediate solution would be the componentwise sum of Bregman divergences. However, we can get more interesting divergences using the general definition. 16
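Spelling out the general definition referred to here, with the trace inner product \(\langle X, Y \rangle = \operatorname{tr}(X^* Y)\):
\[ D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \big\langle \nabla\varphi(Y),\, X - Y \big\rangle , \]
where φ is a strictly convex, differentiable function of a matrix argument, typically defined through the eigenvalues of a Hermitian matrix.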

17 Bregman Divergences of Hermitian Matrices. A complex square matrix A is Hermitian if A = A*. The eigenvalues of a Hermitian matrix are real. 17

18 Burg Matrix Divergence (Logdet divergence) 18
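Filling in the missing formula: taking \(\varphi(X) = -\log\det X\) on positive definite matrices gives the Burg (LogDet) matrix divergence
\[ D_{\mathrm{Burg}}(X, Y) = \operatorname{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n . \]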

19 Von Neumann Matrix Divergence 19
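Likewise, with \(\varphi(X) = \operatorname{tr}(X \log X - X)\) on positive semidefinite matrices one obtains the von Neumann divergence
\[ D_{\mathrm{vN}}(X, Y) = \operatorname{tr}\big( X \log X - X \log Y - X + Y \big) . \]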

20 Applications: matrix inequalities. Hadamard inequality (with proof). Another inequality (with proof); what is more, here we can arbitrarily permute the eigenvalues! 20
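The proofs were equations on the slide; a sketch of the kind of argument presumably intended for the first one: Hadamard's inequality \(\det A \le \prod_i A_{ii}\) for positive definite A follows from the nonnegativity of the Burg divergence between A and its diagonal part,
\[ 0 \;\le\; D_{\mathrm{Burg}}\big(A, \operatorname{diag}(A)\big) = \operatorname{tr}\big(A\,\operatorname{diag}(A)^{-1}\big) - \log\det\big(A\,\operatorname{diag}(A)^{-1}\big) - n = \log\frac{\prod_i A_{ii}}{\det A} . \]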

21 Applications of Bregman divergences. Clustering: partition the columns of a data matrix so that similar columns are in the same partition (Banerjee et al., JMLR 2005). Co-clustering: simultaneously partition both the rows and columns of a data matrix (Banerjee et al., JMLR 2007). Low-rank matrix approximation: exponential family PCA (Collins et al., NIPS 2001), non-negative matrix factorization (Dhillon & Sra, NIPS 2005). Generalized² Linear² Models and POMDPs (Gordon, NIPS 2002). Online learning (Warmuth, COLT 2000). Metric nearness problem: given a matrix of distances, find the nearest matrix of distances such that all distances satisfy the triangle inequality (Dhillon et al., 2004). 21

22 Generalized² Linear² Models, (GL)²M. Goal: generalize GLM. Special cases: PCA, SVD, exponential-family PCA, Infomax ICA, linear regression, nonnegative matrix factorization. 22

23 What is a good loss function? With the Euclidean metric as the loss function, predicting 1010 instead of 1000 is just as bad as predicting 3 instead of -7. Sigmoid regression with the squared loss can have exponentially many local minima in the dimension. The log loss function is convex in θ! We say that f(z) and the log loss match each other. 23

24 Searching for matching loss 24

25 Searching for matching loss 25
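These two slides were formulas in the original; the standard matching-loss construction (in the style of Kivinen & Warmuth) that they presumably present is: for an increasing transfer function \(f = F'\), the loss that matches f is
\[ M_f(\hat y, y) = \int_{f^{-1}(y)}^{f^{-1}(\hat y)} \big( f(z) - y \big)\, dz = D_F\big( f^{-1}(\hat y),\, f^{-1}(y) \big) , \]
which is convex in the parameters whenever \(\hat y = f(\theta^\top x)\). The identity transfer function gives half the squared loss, and the sigmoid gives the log (entropic) loss.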

26 Special cases Thus, Log loss, entropic loss Other special cases: 26

27 Logistic regression 27

28 The (GL)²M algorithm. GLM goal, GLM cost; (GL)²M goal, (GL)²M cost. The (GL)²M algorithm, fixed-point equations: 28
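A rough, schematic reconstruction of the missing goals and costs (not Gordon's exact notation): GLM fits a weight vector w to data \((x_i, y_i)\) by minimizing \(\sum_i M_f\big(f(x_i^\top w), y_i\big)\), whereas (GL)²M also learns the inputs, fitting two factors to a data matrix Y:
\[ \min_{U, V} \;\sum_{i,j} M_f\big( f\big((UV)_{ij}\big),\, Y_{ij} \big) \;+\; \text{regularizers on } U \text{ and } V . \]
The fixed-point equations come from setting the partial derivatives with respect to U and V to zero.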

29 Robot Navigation. A corridor in the CMU CS building, with an initial belief spread over both ends (the robot doesn't know which end it is at). Belief space = R^550 (275 positions × 2 orientations). The robot can: - sense both side walls - compute an accurate estimate of its lateral position. The robot cannot: - resolve its position along the corridor, unless it's near an observable feature - tell whether it's pointing left or right. 29

30 Robot Navigation. The belief space is large, but sparse and compressible; the belief vectors lie on a nonlinear manifold. This method can be used for planning, too. They factored a matrix of 400 beliefs using feature-space ranks l = 3, 4, 5, with f(z) = exp(z), H* = 10^-12 ||V||^2, G* = ||U||^2 + (U). [Figure: a belief vector from the belief-tracking algorithm, and its reconstructions using ranks l = 3, 4, 5.] With PCA, they need 85 dimensions to match the (GL)²M rank-5 decomposition and 25 dimensions for the rank-3 decomposition. 30

31 Nonnegative matrix factorization Goal: Cost functions: Algorithms: 31
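The cost functions and update rules were equations on the slide; as a concrete illustration of the squared-error special case (not the general Bregman version discussed in the talk), here is a minimal sketch of the classic Lee-Seung multiplicative updates, with the factorization written V ≈ WH and the function name chosen purely for this example:

import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Minimal Lee-Seung NMF for the squared Frobenius cost ||V - W H||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative and never increase the cost.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Tiny usage example on a random nonnegative matrix.
V = np.abs(np.random.default_rng(1).random((20, 30)))
W, H = nmf_multiplicative(V, k=5)
print(np.linalg.norm(V - W @ H))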

32 Nonnegative matrix factorization, results. [Figure: basis images learned with sparseness constraints vs. without constraints; CBCL face image database; P. Hoyer's sparse NMF algorithm.] 32

33 Exponential Family PCA PCA 1 PCA 2 Cost function Special case 33
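The cost function here was an equation; in Collins, Dasgupta & Schapire's formulation it is, up to terms that do not depend on the parameters, the exponential-family negative log-likelihood with a low-rank natural-parameter matrix \(\Theta = A V\):
\[ L(\Theta) = \sum_{i,j} \big( G(\theta_{ij}) - x_{ij}\, \theta_{ij} \big), \qquad \Theta = A V , \]
which for the Gaussian case \(G(\theta) = \theta^2 / 2\) reduces to ordinary PCA with squared loss.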

34 Exponential family PCA, Results 34

35 Clustering with Bregman Divergences 35
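The algorithm itself did not survive on this slide; a minimal sketch in the spirit of Banerjee et al. (JMLR 2005): Bregman hard clustering is Lloyd's k-means with assignments made by the chosen Bregman divergence, while cluster representatives remain plain means, since the mean minimizes the expected Bregman divergence for every generator. The generalized KL divergence below is just one concrete choice for this example.

import numpy as np

def generalized_kl(x, mu, eps=1e-12):
    """Generalized KL divergence D(x, mu), summed over the last axis."""
    x = x + eps
    mu = mu + eps
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_kmeans(X, k, divergence=generalized_kl, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center in the chosen Bregman divergence.
        d = divergence(X[:, None, :], centers[None, :, :])
        labels = np.argmin(d, axis=1)
        # Update step: the arithmetic mean is optimal for any Bregman divergence.
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Usage on nonnegative toy data (the generalized KL divergence needs nonnegative entries).
X = np.abs(np.random.default_rng(2).random((200, 4)))
labels, centers = bregman_kmeans(X, k=3)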

36 The Original Problem of Bregman 36
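For context (the slide text was not captured): Bregman's original 1967 problem was finding a point in the intersection of convex sets \(\Omega_1, \dots, \Omega_m\) by cyclic projections, where each projection is a Bregman projection,
\[ x_{t+1} = \arg\min_{\omega \in \Omega_{(t \bmod m) + 1}} D_\varphi(\omega, x_t) , \]
and under suitable conditions (for example, linear constraint sets and an appropriate starting point) the iterates converge to the solution of \(\min_x \varphi(x)\) subject to the constraints.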

37 Conclusion. Introduced the Bregman divergence, its relationship to the exponential family, and its generalization to matrices. Applications: matrix inequalities, exponential family PCA, NMF, GLM, clustering / biclustering, online learning. Bregman divergences suggest new algorithms, and lots of existing algorithms turn out to be special cases. A matching loss function can help to decrease the number of local minima. 37

38 References
Matrix Nearness Problems with Bregman Divergences. I. S. Dhillon and J. A. Tropp. SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 4, November.
A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. Journal of Machine Learning Research (JMLR), vol. 8, August 2007.
Clustering with Bregman Divergences. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Journal of Machine Learning Research (JMLR), vol. 6, October 2005.
A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 2004.
Clustering with Bregman Divergences. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Proceedings of the Fourth SIAM International Conference on Data Mining, April 2004.
Nonnegative Matrix Approximation: Algorithms and Applications. S. Sra and I. S. Dhillon. UTCS Technical Report #TR-06-27, June 2006.
Generalized Nonnegative Matrix Approximations with Bregman Divergences. I. S. Dhillon and S. Sra. NIPS, Vancouver, Canada, December 2005 (also appears as UTCS Technical Report #TR-05-31, June 1, 2005).

39 PPT slide sources: Irina Rish (Bregman Divergences in Clustering and Dimensionality Reduction); Manfred K. Warmuth (COLT 2000); Inderjit S. Dhillon (Machine Learning with Bregman Divergences; Low-Rank Kernel Learning with Bregman Matrix Divergences; Matrix Nearness Problems Using Bregman Divergences; Information Theoretic Clustering, Co-clustering and Matrix Approximations). 39
