Computer Vision Group Prof. Daniel Cremers. 14. Clustering



Motivation. Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain. Clustering is unsupervised learning, i.e. it tries to learn only from the data. Main idea: find a similarity measure and group similar data objects together. Clustering is a very old research field, and many approaches have been suggested. Main problem in most methods: how to find a good number of clusters.

Clustering using Mixture Models. The full posterior of the Gaussian mixture model is
$p(X, Z, \mu, \Sigma, \pi) = p(X \mid Z, \mu, \Sigma)\, p(Z \mid \pi)\, p(\pi)\, p(\mu, \Sigma)$,
i.e. the product of the data likelihood (Gaussian), the correspondence probabilities (multinomial), the mixture prior (Dirichlet) and the parameter prior (Gauss-inverse-Wishart). In this model we use $\mu = (\mu_1, \dots, \mu_K)$, $\pi = (\pi_1, \dots, \pi_K)$ and $\theta_k = (\mu_k, \Sigma_k)$. Simplification for now: assume the $\Sigma_k$ are known, thus $\theta_k = \mu_k$. (Figure: graphical model with plates over the N data points and the K components.)

Clustering using Mixture Models. Given this model, we can create new samples:
1. Sample $\pi$ and the $\theta_k$ from their priors.
2. Sample the correspondences $z_i$.
3. Sample the data points $x_i$.
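This sampling procedure can be written out directly; here is a minimal sketch in Python/NumPy, assuming a 1-D mixture with a known, shared standard deviation. All hyperparameter values and names are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 3, 200                      # number of components, number of data points
alpha = np.ones(K)                 # Dirichlet prior on the mixing weights
mu0, tau, sigma = 0.0, 5.0, 0.5    # prior mean/std of the means, known noise std

# 1. Sample mixture weights and component means from their priors
pi = rng.dirichlet(alpha)          # pi ~ Dir(alpha)
mu = rng.normal(mu0, tau, size=K)  # mu_k ~ N(mu0, tau^2)

# 2. Sample a correspondence z_i for every data point
z = rng.choice(K, size=N, p=pi)    # z_i ~ Mult(pi)

# 3. Sample the data points given their assignments
x = rng.normal(mu[z], sigma)       # x_i ~ N(mu_{z_i}, sigma^2)
```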

Clustering using Mixture Models. An equivalent formulation of this model is:
1. Sample $\pi$ and the $\theta_k$ from their priors.
2. Sample the parameters $\bar\theta_i$ from $p(\bar\theta_i \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k\, \delta(\theta_k, \bar\theta_i)$.
3. Sample the data point from $p(x_i \mid \bar\theta_i)$.

Clustering using Mixture Models. What is the difference in this model? There is one parameter $\bar\theta_i$ for each observation $x_i$; intuitively, we first sample the location of the cluster and then the data that corresponds to it. In general, we use the notation
$\pi \sim \mathrm{Dir}(\alpha/K \cdot \mathbf{1})$, $\theta_k \sim H(\lambda)$ (base distribution), $\bar\theta_i \sim G$, where $G(\bar\theta_i) = \sum_{k=1}^{K} \pi_k\, \delta(\theta_k, \bar\theta_i)$.
However: we need to know K.

The Dirichlet Process. So far, we assumed that K is known. To extend that to infinity, we use a trick. Definition: A Dirichlet process (DP) is a distribution over probability measures $G$, i.e. $G(\theta) \ge 0$ and $\int G(\theta)\, d\theta = 1$. If for any finite partition $(T_1, \dots, T_K)$ it holds that
$(G(T_1), \dots, G(T_K)) \sim \mathrm{Dir}(\alpha H(T_1), \dots, \alpha H(T_K)),$
then $G$ is sampled from a Dirichlet process. Notation: $G \sim \mathrm{DP}(\alpha, H)$, where $\alpha$ is the concentration parameter and $H$ is the base measure.

Repetition. The Dirichlet distribution is defined as
$\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad \alpha_0 = \sum_{k=1}^{K} \alpha_k, \quad 0 \le \mu_k \le 1, \quad \sum_{k=1}^{K} \mu_k = 1.$
It is the conjugate prior for the multinomial distribution; there, the parameters $\alpha_k$ can be interpreted as the effective number of observations for every state. (Figure: the simplex for K = 3.)

Some Examples. (Figure: Dirichlet densities over the simplex for $\alpha = (2, 2, 2)$ and $\alpha = (0.1, 0.1, 0.1)$.) The magnitude of $\alpha$ controls the strength of the distribution ("peakedness"); the individual $\alpha_k$ control the location of the peak.
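As a quick numerical illustration of this effect (a minimal sketch, not part of the slides; all values are arbitrary): a symmetric $\alpha$ with large entries yields samples near the centre of the simplex, while small entries yield samples that put nearly all mass on one component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare a "flat" and a "peaked" symmetric Dirichlet (illustrative values).
for alpha in ([2.0, 2.0, 2.0], [0.1, 0.1, 0.1]):
    samples = rng.dirichlet(alpha, size=5)
    print("alpha =", alpha)
    print(np.round(samples, 2))   # each row is a distribution: entries sum to 1
```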

Intuitive Interpretation. Every sample from a Dirichlet distribution is a vector of K positive values that sum up to 1, i.e. the sample itself is a finite distribution. Accordingly, a sample from a Dirichlet process is an infinite (but still discrete!) distribution. (Figure: a base distribution, here a Gaussian, and infinitely many weighted point masses that sum up to 1.)

Construction of a Dirichlet Process. The Dirichlet process is only defined implicitly, i.e. we can test whether a given probability measure is sampled from a DP, but we cannot yet construct one. A DP can be constructed using the stick-breaking analogy: imagine a stick of length 1; we select a random number β between 0 and 1 from a Beta distribution; we break the stick at π = β · (length of stick); we repeat this infinitely often.

The Stick-Breaking Construction. (Figure: pieces $\pi_1, \pi_2, \dots$ broken off a unit stick, and the resulting weights for $\alpha = 2$ and $\alpha = 5$.) Formally, we have
$\beta_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l) = \beta_k \Big(1 - \sum_{l=1}^{k-1} \pi_l\Big).$
Now we define $G(\theta) = \sum_{k=1}^{\infty} \pi_k\, \delta(\theta_k, \theta)$ with $\theta_k \sim H$. Then $G \sim \mathrm{DP}(\alpha, H)$.
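A minimal sketch of this construction, truncated at a finite number of sticks (the truncation level, the base measure and all names are illustrative choices, not part of the slides):

```python
import numpy as np

def sample_dp(alpha, base_sampler, truncation=1000, rng=None):
    # Draw an approximate (truncated) sample G ~ DP(alpha, H) via stick-breaking;
    # returns the atom locations theta_k and their weights pi_k.
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=truncation)                  # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                                    # pi_k = beta_k * prod_l (1 - beta_l)
    atoms = base_sampler(truncation, rng)                          # theta_k ~ H
    return atoms, weights

# Example: base measure H = standard Gaussian, concentration alpha = 2
atoms, weights = sample_dp(2.0, lambda n, rng: rng.normal(0.0, 1.0, size=n))
print(weights[:10], weights.sum())   # weights decay quickly and sum to (almost) 1
```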

The Chinese Restaurant Process. Consider a restaurant with infinitely many tables. Every time a new customer comes in, he sits at an occupied table with probability proportional to the number of people sitting at that table, but he may choose to sit at a new table, with decreasing probability as more customers enter the room.

The Chinese Restaurant Process. It can be shown that the probability for a new customer is
$p(\bar\theta_{N+1} = \theta \mid \bar\theta_{1:N}, \alpha, H) = \frac{1}{\alpha + N}\Big(\alpha H(\theta) + \sum_{k=1}^{K} N_k\, \delta(\theta_k, \theta)\Big).$
This means that currently occupied tables are more likely to get new customers ("rich get richer"). The number of occupied tables grows logarithmically with the number of customers.
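A minimal sketch of drawing table assignments from this process (illustrative Python, not from the slides):

```python
import numpy as np

def crp(num_customers, alpha, rng=None):
    # Sample table assignments from a Chinese restaurant process with concentration alpha.
    rng = rng or np.random.default_rng()
    counts = []        # number of customers per occupied table
    assignments = []
    for n in range(num_customers):
        # P(existing table k) = N_k / (n + alpha),  P(new table) = alpha / (n + alpha)
        probs = np.array(counts + [alpha], dtype=float) / (n + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)       # the customer opens a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = crp(1000, alpha=2.0)
print(len(counts))   # number of occupied tables grows roughly logarithmically in N
```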

The DP for Mixture Modeling. Using the stick-breaking construction, we see that we can extend mixture model clustering to the situation where K goes to infinity. The algorithm can be implemented using Gibbs sampling. (Figure: DP mixture clustering of a 2-D data set after different numbers of Gibbs iterations.)

Affinity Propagation. Often, we are only given a similarity matrix for the data points. The idea of affinity propagation is to determine cluster centers ("exemplars") that explain the other data points in an optimal way. This is similar to k-medoids, but the algorithm is more robust against local minima. Idea: each data point must choose another data point as its exemplar; some points will choose themselves as exemplars. The number of clusters is then found automatically.

Affinity Propagation. Input: similarity values s(i, j). Initialize the responsibilities r(i, j) and the availabilities a(i, j) to 0. Then, until convergence:
recompute the responsibilities: $r(i, j) = s(i, j) - \max_{j' \ne j} \{ a(i, j') + s(i, j') \}$
recompute the availabilities: $a(i, j) = \min\Big\{ 0,\; r(j, j) + \sum_{i' \notin \{i, j\}} \max\{0, r(i', j)\} \Big\}$ (for $i \ne j$)
The j that maximizes r(i, j) + a(i, j) is the exemplar of i.
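The two updates can be vectorized; below is an illustrative NumPy sketch (a sketch, not a reference implementation), assuming a dense similarity matrix S whose diagonal holds the self-similarities ("preferences"). The self-availability update, which the slide does not spell out, follows the standard formulation of the algorithm.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    # Affinity propagation on an N x N similarity matrix S.
    N = S.shape[0]
    R = np.zeros((N, N))   # responsibilities r(i, j)
    A = np.zeros((N, N))   # availabilities a(i, j)
    for _ in range(iters):
        # r(i,j) = s(i,j) - max_{j' != j} [ a(i,j') + s(i,j') ]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(N), idx]
        AS[np.arange(N), idx] = -np.inf
        second_max = np.max(AS, axis=1)
        max_other = np.tile(first_max[:, None], (1, N))
        max_other[np.arange(N), idx] = second_max
        R = damping * R + (1 - damping) * (S - max_other)

        # a(i,j) = min{0, r(j,j) + sum_{i' not in {i,j}} max(0, r(i',j))} for i != j,
        # a(j,j) = sum_{i' != j} max(0, r(i',j))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col_sums = Rp.sum(axis=0)
        A_new = np.minimum(0, col_sums[None, :] - Rp)
        np.fill_diagonal(A_new, col_sums - R.diagonal())
        A = damping * A + (1 - damping) * A_new
    return np.argmax(R + A, axis=1)   # exemplar of i: the j maximizing r(i,j) + a(i,j)
```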

Affinity Propagation. Intuitively, the responsibility measures how much i thinks that j would be a good exemplar, and the availability measures how strongly j thinks it should be an exemplar for i. The algorithm can be shown to be equivalent to max-product loopy belief propagation. Convergence is not guaranteed, but with damping, oscillations can be avoided. The number of clusters can be controlled by the self-similarity.

Affinity Propagation. (Figure: initialization, iterations 1-6 and convergence of affinity propagation on a 2-D point set, with non-exemplars and exemplars marked. Colours: how much each point wants to be an exemplar. Edge strengths: how much a point wants to belong to a cluster.)

Spectral Clustering. Consider an undirected graph that connects all data points; the edge weights $w_{ij}$ are the similarities ("closeness") and are collected in the weight matrix $W$. We define the weighted degree of a node as the sum of all outgoing edges, $d_i = \sum_{j=1}^{N} w_{ij}$, and the degree matrix as $D = \mathrm{diag}(d_1, \dots, d_N)$.

Spectral Clustering. The graph Laplacian is defined as $L = D - W$. This matrix has the following property: the constant vector $\mathbf{1}$ is an eigenvector with eigenvalue 0.

Spectral Clustering. The graph Laplacian is defined as $L = D - W$. This matrix has the following properties: the constant vector $\mathbf{1}$ is an eigenvector with eigenvalue 0, and the matrix is symmetric and positive semi-definite.

Spectral Clustering. The graph Laplacian is defined as $L = D - W$. This matrix has the following properties: the constant vector $\mathbf{1}$ is an eigenvector with eigenvalue 0, and the matrix is symmetric and positive semi-definite. With these properties we can show: Theorem: The eigenspace of $L$ for the eigenvalue 0 is spanned by the indicator vectors $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_K}$, where $A_1, \dots, A_K$ are the K connected components of the graph.

The Algorithm. Input: similarity matrix W. Compute L = D - W. Compute the eigenvectors that correspond to the K smallest eigenvalues and stack them as columns in a matrix U. Treat each row of U as a K-dimensional data point and cluster the N rows with K-means. The cluster assignment of row i is the cluster assignment of the original data point i.
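A minimal sketch of the unnormalized version of this algorithm (illustrative Python; the Gaussian-kernel similarity, its bandwidth and the use of scikit-learn's KMeans for the last step are assumptions, not prescribed by the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, K, bandwidth=1.0):
    # Unnormalized spectral clustering of the rows of X into K clusters.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * bandwidth ** 2))   # fully connected similarity graph
    np.fill_diagonal(W, 0.0)

    D = np.diag(W.sum(axis=1))                       # degree matrix
    L = D - W                                        # graph Laplacian

    # Eigenvectors of the K smallest eigenvalues, stacked as columns of U
    eigvals, eigvecs = np.linalg.eigh(L)             # eigh sorts eigenvalues in ascending order
    U = eigvecs[:, :K]

    # Each row of U is a K-dimensional embedding of one data point
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```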

An Example. (Figure: the same 2-D data set clustered with K-means and with spectral clustering.) Spectral clustering can handle complex problems such as this one. The complexity of the algorithm is $O(N^3)$, because it has to solve an eigenvector problem, but there are efficient variants of the algorithm.

Further Remarks. To account for nodes that are highly connected, we can use a normalized version of the graph Laplacian. Two different methods exist:
$L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W$
$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
These have eigenspaces similar to those of the original Laplacian L. Clustering results tend to be better than with the unnormalized Laplacian.

Another Small Example. (Figure only.)

Hierarchical Clustering. Often, we want to have nested clusters instead of a flat clustering. Two possible methods: bottom-up (agglomerative) clustering and top-down (divisive) clustering. Both methods take a dissimilarity matrix as input. Bottom-up merges points into clusters; top-down splits clusters into sub-clusters. Both are heuristics, and there is no clear objective function. They always produce a clustering (also for noise).

Agglomerative Clustering. Start with N clusters, each containing exactly one data point. At each step, merge the two most similar groups. Repeat until there is a single group. (Figure: a small 2-D example and the corresponding dendrogram.)

Linkage. In agglomerative clustering, it is important to define a distance measure between two clusters. There are three different methods: single linkage considers the two closest elements from both clusters and uses their distance; complete linkage considers the two farthest elements from both clusters; average linkage uses the average distance between pairs of points from both clusters. Depending on the application, one linkage should be preferred over the other.
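For reference, the whole agglomerative procedure, including all three linkage choices, is available in SciPy; a minimal sketch with random data (assuming SciPy is installed; all values are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # 20 random 2-D points

Z = linkage(X, method='single')                    # also: 'complete', 'average'
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 clusters
print(labels)
```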

Single Linkage. The distance is based on $d_{\mathrm{SL}}(G, H) = \min_{i \in G,\, i' \in H} d_{i,i'}$. The resulting dendrogram is a minimum spanning tree, i.e. it minimizes the sum of the edge weights. Thus, we can compute the clustering in $O(N^2)$ time. (Figure: single-link dendrogram.)

Complete Linkage. The distance is based on $d_{\mathrm{CL}}(G, H) = \max_{i \in G,\, i' \in H} d_{i,i'}$. Complete linkage fulfills the compactness property, i.e. all points in a group should be similar to each other. It tends to produce clusters with smaller diameter. (Figure: complete-link dendrogram.)

Average Linkage. The distance is based on $d_{\mathrm{avg}}(G, H) = \frac{1}{n_G n_H} \sum_{i \in G} \sum_{i' \in H} d_{i,i'}$. It is a good compromise between single and complete linkage. However, it is sensitive to changes of the measurement scale. (Figure: average-link dendrogram.)

Divisive Clustering. Start with all data in a single cluster and recursively divide each cluster into two child clusters. Problem: the optimal split is hard to find. Idea: take the cluster with the largest diameter and split it using K-means with K = 2. Or: use a minimum spanning tree and cut the links with the largest dissimilarity. In general, there are two advantages: it can be faster, and it is more globally informed (not as myopic as bottom-up).

Choosing the Number of Clusters. As in general, choosing the number of clusters is hard. When a dendrogram is available, a gap can be detected in the lengths of the links; this represents the dissimilarity between merged groups. However, in real data this can be hard to detect. There are Bayesian techniques to address this problem (Bayesian hierarchical clustering).

Evaluation of Clustering Algorithms. Clustering is unsupervised: evaluating the output is hard, because no ground truth is given. Intuitively, points in a cluster should be similar and points in different clusters dissimilar. However, better methods use external information, such as labels or a reference clustering. Then we can compare clusterings with the labels using different metrics, e.g. purity or mutual information.

Purity. Define $N_{ij}$ as the number of objects in cluster i that belong to class j, and $N_i = \sum_{j=1}^{C} N_{ij}$ as the number of objects in cluster i. With $p_{ij} = N_{ij} / N_i$, the purity of cluster i is $p_i = \max_j p_{ij}$, and the overall purity is $\mathrm{purity} = \sum_i \frac{N_i}{N}\, p_i$. Purity ranges from 0 (bad) to 1 (good). But: a clustering with each object in its own cluster has a purity of 1. (Figure: example clustering with purity $\approx 0.71$.)
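A minimal sketch of this computation (illustrative Python, assuming non-negative integer cluster and class labels):

```python
import numpy as np

def purity(clusters, classes):
    # Overall purity: sum_i (N_i / N) * max_j (N_ij / N_i) = (1/N) * sum_i max_j N_ij
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    total = 0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        total += np.bincount(members).max()   # size of the largest class in cluster c
    return total / len(clusters)

print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0]))   # 4/6 ~ 0.67
```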

Mutual Information. Let U and V be two clusterings. Define $p_{UV}(i, j) = \frac{|u_i \cap v_j|}{N}$, the probability that a randomly chosen point belongs to cluster $u_i$ in U and to $v_j$ in V, and $p_U(i) = \frac{|u_i|}{N}$, the probability that a point is in $u_i$. Then
$I(U, V) = \sum_{i=1}^{R} \sum_{j=1}^{C} p_{UV}(i, j) \log \frac{p_{UV}(i, j)}{p_U(i)\, p_V(j)}.$
This can be normalized to account for many small clusters with low entropy.
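A minimal sketch of this formula (illustrative Python; the normalization step, e.g. by the cluster entropies, is left out):

```python
import numpy as np

def mutual_information(U, V):
    # Mutual information between two clusterings given as label arrays of equal length.
    U, V = np.asarray(U), np.asarray(V)
    N = len(U)
    I = 0.0
    for i in np.unique(U):
        for j in np.unique(V):
            p_uv = np.sum((U == i) & (V == j)) / N    # p_UV(i, j)
            if p_uv > 0:
                p_u, p_v = np.mean(U == i), np.mean(V == j)
                I += p_uv * np.log(p_uv / (p_u * p_v))
    return I

print(mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))   # identical partitions: log 2
```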

Summary. Several clustering methods were covered: the Dirichlet process mixture model does not require the number of clusters to be known and is fully Bayesian; affinity propagation is an iterative approach where exemplars are determined as cluster centers; spectral clustering uses the graph Laplacian and performs an eigenvector analysis; hierarchical approaches can be bottom-up or top-down. Evaluation methods for clustering are hard to find; some are based on purity or mutual information.