
Bregman Divergences. Barnabás Póczos. RLAI Tea Talk, UofA, Edmonton. Aug 5, 2008

Contents
- Bregman divergences: definition, properties
- Bregman matrix divergences
- Relation to the exponential family
- Applications:
  - generalization of PCA to the exponential family
  - Generalized² Linear² Models
  - clustering / co-clustering with Bregman divergences
  - generalized nonnegative matrix factorization
- Conclusion

Bregman Divergences (Euclidean distance). Squared Euclidean distance is a Bregman divergence. (The upcoming figures are borrowed from Dhillon.)
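The defining formula was an image on the slide; the standard definition it showed is, for a strictly convex, differentiable generator φ,

d_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y), x - y \rangle,

the gap between φ at x and its first-order Taylor expansion around y. With \varphi(x) = \|x\|_2^2 we get \nabla\varphi(y) = 2y and

d_\varphi(x, y) = \|x\|^2 - \|y\|^2 - \langle 2y, x - y \rangle = \|x - y\|^2,

the squared Euclidean distance.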

Bregman Divergences (KL divergence). Generalized relative entropy (also called generalized KL divergence) is another Bregman divergence.
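Reconstructing the formula from the same recipe: with the negative Shannon entropy \varphi(x) = \sum_i x_i \log x_i as generator,

d_\varphi(x, y) = \sum_i x_i \log\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i,

which reduces to the ordinary KL divergence when x and y are probability vectors.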

Bregman Divergences (Itakura-Saito). The Itakura-Saito distance is another Bregman divergence, used in signal processing; its generator is also known as the Burg entropy.
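With the Burg entropy \varphi(x) = -\sum_i \log x_i as generator, \nabla\varphi(y)_i = -1/y_i, so

d_\varphi(x, y) = \sum_i \left( \frac{x_i}{y_i} - \log\frac{x_i}{y_i} - 1 \right),

the Itakura-Saito distance.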

Examples of Bregman Divergences
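The examples table itself was an image; the standard generator/divergence pairs it presumably collected are:

\varphi(x) = \|x\|_2^2                →  d_\varphi(x, y) = \|x - y\|_2^2   (squared Euclidean)
\varphi(x) = x^T A x, A \succ 0       →  d_\varphi(x, y) = (x - y)^T A (x - y)   (Mahalanobis)
\varphi(x) = \sum_i x_i \log x_i      →  d_\varphi(x, y) = \sum_i x_i \log(x_i/y_i) - x_i + y_i   (generalized KL)
\varphi(x) = -\sum_i \log x_i         →  d_\varphi(x, y) = \sum_i x_i/y_i - \log(x_i/y_i) - 1   (Itakura-Saito)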

Properties of the Bregman Divergences. [Figure: a triangle with vertices x, y, z, sides a, b, c and angle γ, illustrating the three-point property.] Euclidean special case: the law of cosines.
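The property the figure illustrates is the generalized law of cosines (three-point property):

d_\varphi(x, z) = d_\varphi(x, y) + d_\varphi(y, z) - \langle x - y, \nabla\varphi(z) - \nabla\varphi(y) \rangle,

which for \varphi = \|\cdot\|^2 is exactly c^2 = a^2 + b^2 - 2ab\cos\gamma.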

Properties of the Bregman Divergences. Nearness in Bregman divergence: the Bregman projection of y onto a convex set Ω. Generalized Pythagoras theorem: when Ω is an affine set, the theorem holds with equality. Note that the inequality runs opposite to the triangle inequality.
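In formulas: with \omega^* = \arg\min_{\omega \in \Omega} d_\varphi(\omega, y) the Bregman projection of y onto the convex set \Omega, for every x \in \Omega

d_\varphi(x, y) \ge d_\varphi(x, \omega^*) + d_\varphi(\omega^*, y),

with equality when \Omega is affine. A triangle inequality would bound the left side from above; here it is bounded from below.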

(Regular) Exponential Families
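The definition from the slide, reconstructed: a regular exponential family with convex cumulant (log-partition) function \psi has densities

p_\psi(x; \theta) = \exp\big( \langle x, \theta \rangle - \psi(\theta) \big)\, p_0(x),

with natural parameter \theta and expectation parameter \mu = \nabla\psi(\theta).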

Gaussian Distribution. Note: Gaussian distribution ↔ squared loss from the expected value µ.
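Spelling out the correspondence (unit variance for simplicity):

p(x; \mu) = \frac{1}{\sqrt{2\pi}} e^{-(x-\mu)^2/2} = \exp\big( -d_\varphi(x, \mu) \big)\, b(x),  with  d_\varphi(x, \mu) = \tfrac{1}{2}(x - \mu)^2,  b(x) = 1/\sqrt{2\pi},

so maximizing the likelihood in \mu is minimizing the squared loss.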

Poisson Distribution. The Poisson distribution is a member of the exponential family; its expected value is µ = λ. Is there a divergence associated with the Poisson distribution? Yes! p(x) can be rewritten so that the implication follows: Poisson distribution ↔ relative entropy.
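The rewriting, reconstructed: with \mu = \lambda,

p(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \exp\Big( -\big( x \log\tfrac{x}{\mu} - x + \mu \big) \Big)\, \frac{x^x e^{-x}}{x!},

so the exponent is the (generalized) relative entropy d_\varphi(x, \mu) = x \log(x/\mu) - x + \mu.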

Exponential Distribution. The exponential distribution is a member of the exponential family; its expected value is µ = 1/λ. Is there a divergence associated with the exponential distribution? Yes! p(x) can be rewritten so that the implication follows: exponential distribution ↔ Itakura-Saito distance.
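The rewriting, reconstructed: with \mu = 1/\lambda,

p(x; \lambda) = \lambda e^{-\lambda x} = \exp\Big( -\big( \tfrac{x}{\mu} - \log\tfrac{x}{\mu} - 1 \big) \Big)\, \frac{1}{e\,x},

so the exponent is the Itakura-Saito distance d_\varphi(x, \mu) = x/\mu - \log(x/\mu) - 1.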

Fenchel Conjugate. Definition: the Fenchel conjugate of a function f is defined as shown below, followed by its basic properties.
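f^*(y) = \sup_x \big( \langle x, y \rangle - f(x) \big).

Standard properties (the ones the bijection theorem below needs): f^* is always convex; f^{**} = f for convex, closed f; and for differentiable, strictly convex f the gradients are inverse maps, \nabla f^* = (\nabla f)^{-1}.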

Bregman Divergences and the Exponential Family: the Bijection Theorem
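The statement, reconstructed from Banerjee, Merugu, Dhillon and Ghosh (JMLR 2005): regular exponential families and regular Bregman divergences are in bijection. Taking \varphi = \psi^* (the Fenchel conjugate of the cumulant function) and \mu = \nabla\psi(\theta),

p_\psi(x; \theta) = \exp\big( -d_\varphi(x, \mu) \big)\, b_\varphi(x).

The Gaussian, Poisson, and exponential examples above are instances of this identity.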


Bregman Matrix Divergences. An immediate solution would be the componentwise sum of scalar Bregman divergences; however, we can get more interesting divergences using the general definition.

Bregman Divergences of Hermitian Matrices. A complex square matrix A is Hermitian if A = A*. The eigenvalues of a Hermitian matrix are real. Let the generator act on a Hermitian matrix through its eigenvalues, as sketched below.
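A sketch of the lost construction, following the standard spectral route: for a strictly convex scalar function f, define \varphi(A) = \sum_i f(\lambda_i(A)) on Hermitian matrices, where \lambda_i(A) are the eigenvalues. The matrix Bregman divergence is then

D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \mathrm{tr}\big( \nabla\varphi(Y)^* (X - Y) \big).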

Burg Matrix Divergence (LogDet Divergence)
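Reconstructing the formula: with \varphi(X) = -\log\det X (that is, f(\lambda) = -\log\lambda, the Burg entropy of the spectrum), \nabla\varphi(Y) = -Y^{-1}, and for n × n positive definite X, Y

D_{Burg}(X, Y) = \mathrm{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n.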

Von Neumann Matrix Divergence
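With \varphi(X) = \mathrm{tr}(X \log X - X) (that is, f(\lambda) = \lambda\log\lambda - \lambda, the von Neumann entropy up to sign), \nabla\varphi(Y) = \log Y, and

D_{vN}(X, Y) = \mathrm{tr}\big( X \log X - X \log Y - X + Y \big).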

Applications: Matrix Inequalities. Hadamard inequality: for a positive definite matrix A, det A ≤ ∏ᵢ aᵢᵢ. Proof: see below. Another inequality [formula lost in transcription]; what is more, there we can arbitrarily permute the eigenvalues!
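A proof along the lines the slide likely took, via nonnegativity of the Burg divergence: for positive definite A, take X = A and Y = diag(A). Then \mathrm{tr}(A\,\mathrm{diag}(A)^{-1}) = \sum_i a_{ii}/a_{ii} = n, so

0 \le D_{Burg}(A, \mathrm{diag}(A)) = n - \log\det A + \log\prod_i a_{ii} - n,

which rearranges to \det A \le \prod_i a_{ii}.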

Applications of Bregman divergences
- Clustering: partition the columns of a data matrix so that similar columns fall in the same partition (Banerjee et al., JMLR 2005)
- Co-clustering: simultaneously partition both the rows and the columns of a data matrix (Banerjee et al., JMLR 2007)
- Low-rank matrix approximation: exponential family PCA (Collins et al., NIPS 2001); nonnegative matrix factorization (Dhillon & Sra, NIPS 2005)
- Generalized² Linear² Models and POMDPs (Gordon, NIPS 2002)
- Online learning (Warmuth, COLT 2000)
- Metric nearness: given a matrix of distances, find the nearest matrix of distances such that all distances satisfy the triangle inequality (Dhillon et al., 2004)

Generalized² Linear² Models, (GL)²M. Goal: see the sketch below. Special cases: GLM, PCA / SVD, exponential-family PCA, infomax ICA, linear regression, nonnegative matrix factorization.
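A hedged reconstruction of the goal, in the spirit of Gordon (NIPS 2002): approximate a data matrix X by f(UV), with f a transfer function applied componentwise and U, V low-rank factors, by minimizing a matching loss plus regularizers,

\min_{U, V} \sum_{ij} \Big( F\big( (UV)_{ij} \big) - X_{ij}\,(UV)_{ij} \Big) + G^*(U) + H^*(V),  where f = F'.

Up to terms depending only on X, the first sum is a Bregman divergence between X and f(UV). A GLM is the special case where one factor is fixed; f = identity with squared loss gives PCA/SVD.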

What is a good loss function? The Euclidean metric as a loss function: predicting 1010 instead of 1000 is penalized exactly as much as predicting 3 instead of -7. Sigmoid regression with the squared loss can have exponentially many local minima in the dimension. The log loss function, in contrast, is convex in θ! We say that f(z) and the log loss match each other.

Searching for matching loss
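The construction these slides carried (the formulas were images) is the standard matching-loss recipe of Helmbold, Kivinen and Warmuth: for an increasing transfer function f with antiderivative F (F' = f), define the loss between a target y = f(z) and a prediction \hat{y} = f(\hat{z}) as

M_f(y, \hat{y}) = \int_{f^{-1}(y)}^{f^{-1}(\hat{y})} \big( f(v) - y \big)\, dv = d_F(\hat{z}, z),

a Bregman divergence in the pre-activations. Since \partial^2 M_f / \partial \hat{z}^2 = f'(\hat{z}) \ge 0, the loss is convex in \hat{z} = \theta^T x, hence convex in \theta.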

Special cases. Thus, for the sigmoid transfer function the matching loss is the log loss (entropic loss). Other special cases follow by plugging other transfer functions into the same recipe; f(z) = z recovers the squared loss.

Logistic regression
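Logistic regression is exactly this matching pair: sigmoid transfer plus entropic loss, so the objective

\min_\theta \sum_i \Big( -y_i \log \sigma(\theta^T x_i) - (1 - y_i) \log\big( 1 - \sigma(\theta^T x_i) \big) \Big)

is convex in \theta.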

(GL)²M Algorithm. GLM goal and cost; (GL)²M goal and cost; the fixed-point equations of the (GL)²M algorithm. [Formulas lost in transcription.]
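Gordon's actual fixed-point updates were on the slide as formulas and are lost; as a stand-in, here is a minimal NumPy sketch of the same objective (sigmoid transfer, matching loss, no regularizers) optimized by plain alternating gradient steps. The function name glm2_factorize and all parameter values are illustrative, not from the talk.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm2_factorize(X, rank, lr=0.1, iters=2000, seed=0):
    # Cost: sum_ij F((UV)_ij) - X_ij (UV)_ij with F(z) = log(1 + e^z);
    # its gradient in Z = U V is sigmoid(Z) - X (the matching-loss residual).
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = 0.01 * rng.standard_normal((m, rank))
    V = 0.01 * rng.standard_normal((rank, n))
    for _ in range(iters):
        R = sigmoid(U @ V) - X    # residual = gradient w.r.t. Z
        U -= lr * R @ V.T         # descend in U with V fixed
        R = sigmoid(U @ V) - X
        V -= lr * U.T @ R         # descend in V with U fixed
    return U, V

# Toy usage on a random binary matrix:
X = (np.random.default_rng(1).random((20, 30)) < 0.5).astype(float)
U, V = glm2_factorize(X, rank=3)
print(np.abs(sigmoid(U @ V) - X).mean())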

Robot Navigation. A corridor in the CMU CS building, with an initial belief spread over both ends (the robot doesn't know which end it is at). Belief space = R^550: 550 states (275 positions × 2 orientations). The robot can: sense both side walls; compute an accurate estimate of its lateral position. The robot cannot: resolve its position along the corridor, unless it is near an observable feature; tell whether it is pointing left or right.

Robot Navigation. The belief space is large, but sparse and compressible; the belief vectors lie on a nonlinear manifold. This method can be used for planning, too. They factored a matrix of 400 beliefs using feature-space ranks l = 3, 4, 5, with f(z) = exp(z) and regularizers H* = 10^{-12} ‖V‖², G* = 10^{-12} ‖U‖² plus an additional term in U (illegible in the transcript). Figures: a belief vector from the belief-tracker algorithm, and its reconstructions using ranks l = 3, 4, 5. With PCA, they need 85 dimensions to match the (GL)²M rank-5 decomposition, and 25 dimensions for the rank-3 decomposition.

Nonnegative Matrix Factorization. Goal: approximate a nonnegative matrix X by a product UV of elementwise-nonnegative low-rank factors. Cost functions: Bregman divergences D_φ(X, UV), with the squared Frobenius norm and the generalized KL divergence as the classic special cases. Algorithms: multiplicative updates, among others.
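A minimal NumPy sketch of the classic Lee-Seung multiplicative updates for the squared-Euclidean cost; the slide's own algorithm list is lost, so this is the standard baseline rather than necessarily what was shown.

import numpy as np

def nmf(X, rank, iters=500, eps=1e-9, seed=0):
    # Multiplicative updates minimizing ||X - U V||_F^2 with U, V >= 0.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, rank))
    V = rng.random((rank, n))
    for _ in range(iters):
        # Each factor is multiplied by a nonnegative ratio, so
        # nonnegativity is preserved and the cost never increases.
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
    return U, V

X = np.abs(np.random.default_rng(1).standard_normal((30, 20)))
U, V = nmf(X, rank=5)
print(np.linalg.norm(X - U @ V))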

Nonnegative matrix factorization, results. Basis images learned from the CBCL face image database, with sparseness constraints (P. Hoyer's sparse NMF algorithm) and without constraints.

Exponential Family PCA. Two formulations of PCA; the exponential-family cost function; the Gaussian special case. [Formulas lost in transcription.]
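The cost function of Collins, Dasgupta and Schapire (NIPS 2001), reconstructed: replace PCA's implicit Gaussian log-likelihood by any exponential-family log-likelihood, constraining the natural parameters to a low-rank matrix \Theta = UV:

\min_{U, V} \sum_{ij} \big( \psi(\Theta_{ij}) - X_{ij}\,\Theta_{ij} \big)  =  \min_{U, V} \sum_{ij} d_\varphi\big( X_{ij}, \nabla\psi(\Theta_{ij}) \big) + \mathrm{const},

by the bijection theorem above. The Gaussian case \psi(\theta) = \theta^2/2 recovers ordinary PCA.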

Exponential family PCA, results

Clustering with Bregman Divergences
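The slide's content is lost; the key fact from Banerjee et al. (JMLR 2005) is that k-means works verbatim with any Bregman divergence, because the arithmetic mean is always the divergence-optimal single representative of a cluster: \mu = \arg\min_c E[d_\varphi(X, c)]. A minimal sketch; bregman_kmeans and gen_kl are illustrative names, not from the talk.

import numpy as np

def bregman_kmeans(X, k, d, iters=100, seed=0):
    # k-means where d(X, mu) returns each row's divergence to center mu.
    # The centroid update is the plain mean regardless of d.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        D = np.stack([d(X, c) for c in centers], axis=1)
        labels = D.argmin(axis=1)            # divergence-nearest center
        centers = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

def gen_kl(X, mu):
    # Generalized KL divergence of each row of X to mu (positive data).
    return np.sum(X * np.log(X / mu) - X + mu, axis=1)

X = np.random.default_rng(1).random((200, 5)) + 0.1
labels, centers = bregman_kmeans(X, k=3, d=gen_kl)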

The Original Problem of Bregman: finding a point in the intersection of convex sets by successive Bregman projections onto them (Bregman, 1967).

Conclusion
- Introduced the Bregman divergence and its relationship to the exponential family
- Generalization to matrices
- Applications: matrix inequalities, exponential family PCA, NMF, GLM, clustering / biclustering, online learning
- Bregman divergences suggest new algorithms, and lots of existing algorithms turn out to be special cases
- A matching loss function can help decrease the number of local minima

References
- I. S. Dhillon and J. A. Tropp. Matrix Nearness Problems with Bregman Divergences. SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 4, pages 1120-1146, November 2007.
- A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. Journal of Machine Learning Research (JMLR), vol. 8, pages 1919-1986, August 2007.
- A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research (JMLR), vol. 6, pages 1705-1749, October 2005.
- A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 509-514, August 2004.
- A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. Proceedings of the Fourth SIAM International Conference on Data Mining, pages 234-245, April 2004.
- S. Sra and I. S. Dhillon. Nonnegative Matrix Approximation: Algorithms and Applications. UTCS Technical Report #TR-06-27, June 2006.
- I. S. Dhillon and S. Sra. Generalized Nonnegative Matrix Approximations with Bregman Divergences. NIPS, pages 283-290, Vancouver, Canada, December 2005. (Also appears as UTCS Technical Report #TR-05-31, June 1, 2005.)

PPT slide sources:
- Irina Rish, Bregman Divergences in Clustering and Dimensionality Reduction
- Manfred K. Warmuth, COLT 2000
- Inderjit S. Dhillon: Machine Learning with Bregman Divergences; Low-Rank Kernel Learning with Bregman Matrix Divergences; Matrix Nearness Problems Using Bregman Divergences; Information Theoretic Clustering, Co-clustering and Matrix Approximations