Clustering using Mixture Models


Clustering using Mixture Models

The full posterior of the Gaussian Mixture Model is

p(X, Z, µ, Σ, π) = p(X | Z, µ, Σ) · p(Z | π) · p(π) · p(µ, Σ)

where p(X | Z, µ, Σ) is the data likelihood (Gaussian), p(Z | π) the correspondence probability (Multinomial), p(π) the mixture prior (Dirichlet), and p(µ, Σ) the parameter prior (Gauss-inverse-Wishart).

(Graphical model: z_i and x_i sit in a plate over the N data points; µ_k, Σ_k in a plate over the K clusters.)

Given this model, we can create new samples:
1. Sample π and µ_k, Σ_k from the priors
2. Sample the correspondences z_i
3. Sample the data points x_i
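The three sampling steps can be sketched in NumPy. This is a minimal illustration, not code from the lecture: the hyperparameter values are assumptions, and for simplicity the covariances are fixed to the identity instead of being drawn from a Gauss-inverse-Wishart prior.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, D = 3, 500, 2  # clusters, samples, dimensions

# Hypothetical hyperparameters: symmetric Dirichlet prior on the mixture
# weights, wide Gaussian prior on the cluster means.
alpha = np.full(K, 2.0)

# 1. Sample the parameters pi and mu_k from the priors
pi = rng.dirichlet(alpha)                 # mixture weights, sum to 1
mu = rng.normal(0.0, 5.0, size=(K, D))    # one mean per cluster

# 2. Sample the correspondences z_i ~ Mult(pi)
z = rng.choice(K, size=N, p=pi)

# 3. Sample the data points x_i ~ N(mu_{z_i}, I)
x = mu[z] + rng.standard_normal((N, D))

print(x.shape)   # (500, 2)
```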

Repetition: MAP for Regression

In MLE, we searched for parameters w that maximize the data likelihood. Now we additionally assume a Gaussian prior on w. Using this, we can compute the posterior via Bayes' rule:

p(w | x, t) ∝ p(t | w, x) · p(w)    (Posterior ∝ Likelihood × Prior)

Maximizing this posterior is called Maximum A-Posteriori (MAP) estimation.
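For a linear model with Gaussian noise and a zero-mean Gaussian prior, maximizing this posterior has a closed form (ridge regression). A small sketch under assumed toy data and assumed precisions alpha and beta:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data: t = 2*x + noise (assumed example, not from the lecture)
x = rng.uniform(-1.0, 1.0, size=50)
t = 2.0 * x + 0.1 * rng.standard_normal(50)

Phi = np.column_stack([np.ones_like(x), x])  # design matrix: bias + x

beta = 1.0 / 0.1**2   # noise precision (treated as known here)
alpha = 1.0           # prior precision: p(w) = N(w; 0, alpha^{-1} I)

# MAP estimate: argmax_w p(t | w, x) p(w) gives regularized normal equations
w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(2),
                        beta * Phi.T @ t)
print(w_map)  # close to [0, 2], slightly shrunk toward zero by the prior
```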

Generalization: The Bayesian Approach

This idea can be generalized. Given a data-dependent likelihood term, find an appropriate prior distribution; multiplying both yields the (unnormalized) posterior via Bayes' rule. The main benefit is less overfitting. However: how should we define the prior? An often-used principle is conjugacy.

Conjugate Priors

A conjugate prior distribution allows us to represent the posterior in the same functional (closed) form as the prior, e.g.:

N(w; 0, σ₂² I_M) · N(t; Φw, σ₁² I) ∝ N(w; µ, Σ)

(Gaussian prior × Gaussian likelihood → Gaussian posterior)

Common pairs of likelihood and conjugate prior:

Likelihood                      Conjugate Prior
Normal with known variance      Normal
Binomial                        Beta
Multinomial                     Dirichlet
Multivariate Normal             Normal-inverse-Wishart
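All rows of the table share one mechanism: a conjugate update just adds observation counts to the prior parameters. Sketching the Binomial–Beta row with made-up numbers:

```python
# Beta-Binomial conjugacy (one row of the table): a Beta(a, b) prior on the
# success probability combined with k successes in n Binomial trials gives
# a Beta posterior again -- only the parameters change.
a, b = 2.0, 2.0   # hypothetical prior pseudo-counts
n, k = 10, 7      # hypothetical observation: 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)      # posterior: Beta(a + k, b + n - k)
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, post_mean)
```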

Multinomial

Given K clusters and the probabilities µ_1, …, µ_K of these clusters, with Σ_{k=1}^K µ_k = 1, the probability that out of N samples exactly m_k fall into cluster k is:

p(m_1, …, m_K | µ, N) = N! / (m_1! ⋯ m_K!) · ∏_{k=1}^K µ_k^{m_k}

This is called the multinomial distribution. In our case:

p(Z | µ) = ∏_{n=1}^N ∏_{k=1}^K µ_k^{z_{nk}} = ∏_{k=1}^K µ_k^{m_k}
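The coefficient and product in the formula translate directly into code; a small sketch with assumed cluster probabilities:

```python
from math import factorial, prod

def multinomial_pmf(m, mu):
    """p(m_1,...,m_K | mu, N) = N!/(m_1!...m_K!) * prod_k mu_k^{m_k}."""
    N = sum(m)
    coef = factorial(N) // prod(factorial(mk) for mk in m)
    return coef * prod(p ** mk for p, mk in zip(mu, m))

mu = [0.5, 0.3, 0.2]   # assumed cluster probabilities, sum to 1
m = [3, 1, 1]          # counts over N = 5 samples
print(multinomial_pmf(m, mu))   # 20 * 0.5^3 * 0.3 * 0.2 = 0.15
```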

The Dirichlet Distribution

The Dirichlet distribution is defined as:

Dir(µ | α) = Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K)) · ∏_{k=1}^K µ_k^{α_k − 1}

where α_0 = Σ_{k=1}^K α_k, 0 ≤ µ_k ≤ 1, and Σ_{k=1}^K µ_k = 1.

It is the conjugate prior for the multinomial distribution. There, the parameter α_k can be interpreted as the effective number of observations for every state.

(Figure: the simplex of valid µ for K = 3.)
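Conjugacy with the multinomial makes the posterior update a simple count addition, which is exactly why the α_k behave like effective observation counts. A sketch with an assumed prior and assumed counts:

```python
import numpy as np

# Dirichlet-multinomial conjugacy: a Dir(mu | alpha) prior plus observed
# counts m_k yields the posterior Dir(mu | alpha + m), so each alpha_k acts
# like an effective number of prior observations for state k.
alpha = np.array([2.0, 2.0, 2.0])   # hypothetical prior
m = np.array([10.0, 3.0, 1.0])      # hypothetical observed cluster counts

alpha_post = alpha + m              # Dir(mu | 12, 5, 3)
post_mean = alpha_post / alpha_post.sum()
print(alpha_post)
print(post_mean)                    # roughly [0.6, 0.25, 0.15]
```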

Some Examples

(Plots of Dir(µ | α) over the K = 3 simplex for α = (2, 2, 2), α = (20, 2, 2), and α = (0.1, 0.1, 0.1).)

α_0 controls the strength of the distribution ("peakedness"); the individual α_k control the location of the peak.
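The effect of α can be checked empirically by drawing many samples for each setting shown in the plots; a sketch using NumPy's Dirichlet sampler with an assumed sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha_0 controls the "peakedness": large alpha_0 concentrates samples near
# the distribution's mean, while alpha_k < 1 pushes mass to the corners of
# the simplex (sparse weight vectors).
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=5000)
peaked = rng.dirichlet([20.0, 20.0, 20.0], size=5000)
print(sparse[:, 0].std(), peaked[:, 0].std())   # large spread vs. small spread

# Asymmetric alpha moves the peak: alpha = (20, 2, 2) puts most of the mass
# on the first component (its mean is 20/24).
skewed = rng.dirichlet([20.0, 2.0, 2.0], size=5000)
print(skewed.mean(axis=0))
```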

Clustering using Mixture Models

With these distributions in place, recall the full posterior of the Gaussian Mixture Model:

p(X, Z, µ, Σ, π) = p(X | Z, µ, Σ) · p(Z | π) · p(π) · p(µ, Σ)

where p(X | Z, µ, Σ) is the data likelihood (Gaussian), p(Z | π) the correspondence probability (Multinomial), p(π) the mixture prior (Dirichlet), and p(µ, Σ) the parameter prior (Gauss-inverse-Wishart).

Given this model, we can create new samples:
1. Sample π and µ_k, Σ_k from the priors
2. Sample the correspondences z_i
3. Sample the data points x_i

Clustering using Mixture Models

The full posterior of the Gaussian Mixture Model is

  p(X, Z, μ, Σ, π) = p(X | Z, μ, Σ) p(Z | π) p(π) p(μ, Σ)

with the factors: data likelihood (Gaussian), correspondence probability (Multinomial), mixture prior (Dirichlet), and parameter prior (Gauss-Inverse-Wishart). In generative form:

  π ~ Dir(α/K, …, α/K)
  z_i ~ Mult(π)
  θ_k = (μ_k, Σ_k) ~ NIW(β)
  x_i ~ N(θ_{z_i}),  i = 1, …, N

Clustering using Mixture Models

The full posterior of the Gaussian Mixture Model is

  p(X, Z, μ, Σ, π) = p(X | Z, μ, Σ) p(Z | π) p(π) p(μ, Σ)

An equivalent formulation of this model is:
1. Sample π and the component parameters θ_k from their priors
2. Sample the per-point parameter θ_i from the discrete distribution G defined by the π_k and θ_k
3. Sample the data point x_i ~ N(θ_i)
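As an illustration, the generative process above can be sketched in NumPy. This is a toy 1-D example with K = 3 fixed components; the component prior is simplified to independent normals instead of a full Gauss-Inverse-Wishart, and all parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N, alpha = 3, 1000, 1.0

# Mixture weights from a symmetric Dirichlet prior: pi ~ Dir(alpha/K, ..., alpha/K)
pi = rng.dirichlet(np.full(K, alpha / K))

# Component parameters (a simplified prior in place of NIW, for illustration)
mu = rng.normal(0.0, 5.0, size=K)
sigma = np.abs(rng.normal(1.0, 0.2, size=K))

# Correspondence variables z_i ~ Mult(pi), then x_i ~ N(mu_{z_i}, sigma_{z_i})
z = rng.choice(K, size=N, p=pi)
x = rng.normal(mu[z], sigma[z])

print(x.shape, np.bincount(z, minlength=K))
```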

Clustering using Mixture Models

What is the difference in that model?
- there is one parameter θ_i for each observation x_i
- intuitively: we first sample the location of the cluster and then the data that corresponds to it

In general, we use the notation:

  π ~ Dir(α/K · 1)
  θ_k ~ H(λ)        (base distribution)
  θ_i ~ G,  where  G(θ) = Σ_{k=1}^K π_k δ(θ, θ_k)

However: we need to know K!

The Dirichlet Process

So far, we assumed that K is known. To extend that to infinity, we use a trick.

Definition: A Dirichlet process (DP) is a distribution over probability measures G, i.e. G(θ) ≥ 0 and ∫ G(θ) dθ = 1. If for any finite partition (T_1, …, T_K) it holds that

  (G(T_1), …, G(T_K)) ~ Dir(α H(T_1), …, α H(T_K)),

then G is sampled from a Dirichlet process.

Notation: G ~ DP(α, H), where α is the concentration parameter and H is the base measure.

Intuitive Interpretation

Every sample from a Dirichlet distribution is a vector of K positive values that sum up to 1, i.e. the sample itself is a finite distribution. Accordingly, a sample from a Dirichlet process is an infinite (but still discrete!) distribution.

[Figure: a base distribution (here a Gaussian) and a DP sample consisting of infinitely many point masses that sum up to 1]

Construction of a Dirichlet Process

The Dirichlet process is only defined implicitly, i.e. we can test whether a given probability measure is sampled from a DP, but we cannot yet construct one.

A DP can be constructed using the stick-breaking analogy:
- imagine a stick of length 1
- we select a random number β between 0 and 1 from a Beta distribution
- we break the stick at π = β · (length of stick)
- we repeat this infinitely often on the remaining part

The Stick-Breaking Construction

Formally, we have

  β_k ~ Beta(1, α)
  π_k = β_k ∏_{l=1}^{k−1} (1 − β_l) = β_k (1 − Σ_{l=1}^{k−1} π_l)

Now we define

  G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k),   θ_k ~ H

then: G ~ DP(α, H)

[Figure: the stick-breaking of the unit interval into weights π_1, π_2, … via β_1, β_2, …, and example weight sequences for α = 2 and α = 5]
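A truncated stick-breaking draw can be sketched in NumPy. The truncation level T and the standard-normal base measure are illustrative choices, not part of the construction itself:

```python
import numpy as np

def stick_breaking(alpha, T, rng):
    """Draw a truncated sample (weights, atoms) from DP(alpha, H),
    with H = N(0, 1) as an illustrative base measure."""
    beta = rng.beta(1.0, alpha, size=T)                        # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    pi = beta * remaining                                      # pi_k = beta_k * prod_{l<k} (1 - beta_l)
    atoms = rng.normal(0.0, 1.0, size=T)                       # theta_k ~ H
    return pi, atoms

rng = np.random.default_rng(0)
pi, atoms = stick_breaking(alpha=2.0, T=1000, rng=rng)
print(pi.sum())  # close to 1 for large T
```

For large T the leftover stick length is vanishingly small, so the truncated weights sum to essentially 1.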

The Chinese Restaurant Process

Consider a restaurant with infinitely many tables. Every time a new customer comes in, he sits at an occupied table with probability proportional to the number of people sitting at that table, but he may also choose to sit at a new table, with decreasing probability as more customers enter the room.

The Chinese Restaurant Process

It can be shown that the probability for a new customer is

  p(θ_{N+1} = θ | θ_{1:N}, α, H) = 1/(α + N) · ( α H(θ) + Σ_{k=1}^K N_k δ(θ, θ_k) )

This means that currently occupied tables are more likely to get new customers ("rich get richer"). The number of occupied tables grows logarithmically with the number of customers.
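A short simulation of the table-assignment process, as a sketch; the logarithmic growth of the table count is only observed empirically here:

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha, rng):
    """Simulate CRP seating; returns the table index of each customer."""
    tables = []    # tables[k] = number of customers at table k
    seating = []
    for n in range(n_customers):
        # Occupied table k with prob N_k / (alpha + n), new table with prob alpha / (alpha + n)
        probs = np.array(tables + [alpha], dtype=float) / (alpha + n)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # join an occupied table
        seating.append(k)
    return np.array(seating), tables

rng = np.random.default_rng(0)
seating, tables = chinese_restaurant_process(10000, alpha=2.0, rng=rng)
print(len(tables))  # roughly alpha * log(N) occupied tables
```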

The DP for Mixture Modeling

Using the stick-breaking construction, we see that we can extend the mixture-model clustering to the situation where K goes to infinity. The algorithm can be implemented using Gibbs sampling.

[Figure: DP mixture clustering of 2-D data after 50 and 100 Gibbs iterations, together with the resulting cluster weights]

Questions

What if the clusters cannot be approximated well by Gaussians? Can we formulate an algorithm that only relies on pairwise similarities?

One example of such an algorithm is Spectral Clustering.

Spectral Clustering

Consider an undirected graph that connects all data points. The edge weights w_ij are the similarities ("closeness"), collected in the weight matrix W. We define the weighted degree of a node as the sum of all outgoing edge weights:

  d_i = Σ_{j=1}^N w_ij,   D = diag(d_1, …, d_N)

Spectral Clustering

The Graph Laplacian is defined as:

  L = D − W

This matrix has the following properties:
- the all-ones vector 1 is an eigenvector with eigenvalue 0
- the matrix is symmetric and positive semi-definite

With these properties we can show:

Theorem: The set of eigenvectors of L with eigenvalue 0 is spanned by the indicator vectors 1_{A_1}, …, 1_{A_K}, where the A_k are the K connected components of the graph.
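These properties of the graph Laplacian can be checked numerically on a tiny example. The similarity matrix below is made up; its graph has two connected components, so eigenvalue 0 appears twice:

```python
import numpy as np

# Hypothetical similarity graph with two connected components: {0, 1} and {2, 3}
W = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)   # L is symmetric, so eigh applies
print(np.round(eigvals, 6))            # two zero eigenvalues, one per component

# The all-ones vector is an eigenvector with eigenvalue 0
print(np.allclose(L @ np.ones(4), 0))
```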

The Algorithm

Input: similarity matrix W
- Compute L = D − W
- Compute the eigenvectors that correspond to the K smallest eigenvalues
- Stack these vectors as columns in a matrix U
- Treat each row of U as a K-dimensional data point
- Cluster the N rows with K-means clustering

The indices of the rows that correspond to the resulting clusters are those of the original data points.
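The steps above can be sketched in NumPy. A minimal K-means (with farthest-point initialization, an implementation choice for stability on tiny inputs) stands in for a library call, and the similarity graph is a made-up toy example:

```python
import numpy as np

def kmeans(X, K, n_iter=50):
    """Minimal K-means with farthest-point initialization (for illustration)."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=-1), axis=1)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return labels

def spectral_clustering(W, K):
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigvecs[:, :K]                     # eigenvectors of the K smallest eigenvalues
    return kmeans(U, K)                    # cluster the rows of U

# Toy similarity graph: points {0, 1, 2} tightly connected, {3, 4} tightly connected
W = np.zeros((5, 5))
for i, j, w in [(0, 1, .9), (1, 2, .8), (0, 2, .7), (3, 4, .9), (2, 3, .05)]:
    W[i, j] = W[j, i] = w

labels = spectral_clustering(W, K=2)
print(labels)
```

The weak bridge (2, 3) keeps the graph connected, so the second-smallest eigenvector (rather than a second zero eigenvector) separates the two groups.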

An Example

[Figure: K-means clustering vs. spectral clustering applied to the same 2-D data set]

Spectral clustering can handle complex problems such as this one. The complexity of the algorithm is O(N³), because it has to solve an eigenvector problem, but there are efficient variants of the algorithm.

Further Remarks

To account for nodes that are highly connected, we can use a normalized version of the graph Laplacian. Two different methods exist:

  L_rw = D^{-1} L = I − D^{-1} W
  L_sym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}

These have eigenspaces similar to those of the original Laplacian L. Clustering results tend to be better than with the unnormalized Laplacian. The number of clusters K can be found using the eigen-gap heuristic.

Eigen-Gap Heuristic

- Compute all eigenvalues of the graph Laplacian
- Sort them in increasing order
- Usually, there is a big jump between two consecutive eigenvalues
- The corresponding number K is a good choice for the estimated number of clusters
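On a toy two-group graph (the similarity values are made up), the heuristic reads K off the largest gap in the sorted spectrum:

```python
import numpy as np

# Hypothetical similarity graph: two tight triangles joined by one weak edge
W = np.zeros((6, 6))
edges = [(0, 1, .9), (1, 2, .8), (0, 2, .9),
         (3, 4, .9), (4, 5, .8), (3, 5, .9),
         (2, 3, .05)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W
eigvals = np.sort(np.linalg.eigvalsh(L))   # ascending eigenvalues

gaps = np.diff(eigvals)                    # gaps between consecutive eigenvalues
K = int(np.argmax(gaps)) + 1               # number of eigenvalues before the largest gap
print(K)
```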

Hierarchical Clustering

Often, we want to have nested clusters instead of a flat clustering. Two possible methods:
- bottom-up or agglomerative clustering
- top-down or divisive clustering

Both methods take a dissimilarity matrix as input. Bottom-up merges points into ever larger clusters; top-down splits clusters into sub-clusters. Both are heuristics; there is no clear objective function, and they always produce a clustering (even for pure noise).

Agglomerative Clustering

- Start with N clusters, each containing exactly one data point
- At each step, merge the two most similar groups
- Repeat until there is a single group

[Figure: five 2-D data points and the dendrogram produced by successively merging them]

Linkage

In agglomerative clustering, it is important to define a distance measure between two clusters. There are three different methods:
- Single linkage: considers the two closest elements from both clusters and uses their distance
- Complete linkage: considers the two farthest elements from both clusters
- Average linkage: uses the average distance between pairs of points from both clusters

Depending on the application, one linkage should be preferred over the other.

Single Linkage

The distance is based on

  d_SL(G, H) = min_{i∈G, i'∈H} d_{i,i'}

The resulting dendrogram is a minimum spanning tree, i.e. it minimizes the sum of the edge weights. Thus, we can compute the clustering in O(N²) time.

[Figure: single-linkage dendrogram]

Complete Linkage

The distance is based on

  d_CL(G, H) = max_{i∈G, i'∈H} d_{i,i'}

Complete linkage fulfills the compactness property, i.e. all points in a group should be similar to each other. It tends to produce clusters with smaller diameter.

[Figure: complete-linkage dendrogram]

Average Linkage

The distance is based on

  d_avg(G, H) = (1 / (n_G n_H)) Σ_{i∈G} Σ_{i'∈H} d_{i,i'}

It is a good compromise between single and complete linkage. However, it is sensitive to changes of the measurement scale.

[Figure: average-linkage dendrogram]
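The three linkage distances can all be computed from the block of pairwise distances between two clusters. A sketch on two made-up 1-D point sets:

```python
import numpy as np

def linkage_distances(G, H):
    """Single, complete, and average linkage between 1-D point sets G and H."""
    # Pairwise distance block d_{i,i'} for i in G, i' in H
    d = np.abs(G[:, None] - H[None, :])
    return d.min(), d.max(), d.mean()

G = np.array([0.0, 1.0, 2.0])
H = np.array([5.0, 6.0])

d_sl, d_cl, d_avg = linkage_distances(G, H)
print(d_sl, d_cl, d_avg)  # 3.0 6.0 4.5
```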

Divisive Clustering

- Start with all data in a single cluster
- Recursively divide each cluster into two child clusters

Problem: the optimal split is hard to find.
- Idea: pick the cluster with the largest diameter and split it using K-means with K = 2
- Or: build a minimum spanning tree and cut the links with the largest dissimilarity

In general, two advantages:
- it can be faster
- it is more globally informed (not myopic like bottom-up)

Choosing the Number of Clusters

As in general, choosing the number of clusters is hard. When a dendrogram is available, a gap can be detected in the lengths of the links; this gap represents the dissimilarity between merged groups. However, in real data this can be hard to detect. There are Bayesian techniques to address this problem (Bayesian hierarchical clustering).

Evaluation of Clustering Algorithms

Clustering is unsupervised: evaluation of the output is hard, because no ground truth is given. Intuitively, points in a cluster should be similar and points in different clusters dissimilar. However, better methods use external information, such as labels or a reference clustering. Then we can compare clusterings with the labels using different metrics, e.g.
- purity
- mutual information

Purity

Define N_ij as the number of objects in cluster i that are in class j, and N_i = Σ_{j=1}^C N_ij as the number of objects in cluster i. Define

  p_ij = N_ij / N_i,   p_i = max_j p_ij

The overall purity is

  Purity = Σ_i (N_i / N) p_i

Purity ranges from 0 (bad) to 1 (good). But: a clustering with each object in its own cluster has a purity of 1.

[Figure: example clustering with Purity = 0.71]
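Purity can be computed directly from a contingency table of cluster vs. class counts; the table below is a made-up example:

```python
import numpy as np

def purity(N):
    """Purity from a contingency table N[i, j] = # objects in cluster i with class j.
    Equals sum_i (N_i / N) * max_j (N_ij / N_i) = sum_i max_j N_ij / N."""
    return N.max(axis=1).sum() / N.sum()

# Hypothetical contingency table: 3 clusters x 3 classes
N = np.array([[5, 1, 0],
              [1, 4, 1],
              [0, 2, 3]])

print(purity(N))  # (5 + 4 + 3) / 17
```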

Mutual Information

Let U and V be two clusterings. Define the probability that a randomly chosen point belongs to cluster u_i in U and to v_j in V:

  p_UV(i, j) = |u_i ∩ v_j| / N

Also, the probability that a point is in u_i:

  p_U(i) = |u_i| / N

Then the mutual information is

  I(U, V) = Σ_{i=1}^R Σ_{j=1}^C p_UV(i, j) log [ p_UV(i, j) / (p_U(i) p_V(j)) ]

This can be normalized to account for many small clusters with low entropy.
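A direct implementation of the definition from two label vectors, as a sketch; the label vectors are made up:

```python
import numpy as np

def mutual_information(u, v):
    """I(U, V) in nats, computed from two clustering label vectors of equal length."""
    n = len(u)
    mi = 0.0
    for i in np.unique(u):
        for j in np.unique(v):
            p_uv = np.sum((u == i) & (v == j)) / n   # p_UV(i, j)
            if p_uv > 0:
                p_u = np.sum(u == i) / n             # p_U(i)
                p_v = np.sum(v == j) / n             # p_V(j)
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi

u = np.array([0, 0, 0, 1, 1, 1])
v = np.array([1, 1, 1, 0, 0, 0])  # the same partition, just relabeled
print(mutual_information(u, v))   # log 2: identical partitions into two equal halves
```

Note that mutual information is invariant to relabeling the clusters, which is exactly why it is useful for comparing two clusterings.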

Summary

Several clustering methods exist:
- K-means clustering and Expectation-Maximization, both based on Gaussian Mixture Models; K-means uses hard assignments, whereas EM uses soft assignments and also estimates the covariances
- The Dirichlet Process is a non-parametric model to perform clustering without specifying K
- Spectral clustering uses the graph Laplacian and performs an eigenvector analysis
- Major problem: most clustering algorithms require the number of clusters to be given