Clustering & Unsupervised Learning. Nuno Vasconcelos (Ken Kreutz-Delgado), UCSD

Statistical Learning. Goal: given a relationship between a feature vector x and a vector y, and i.i.d. data samples (x_i, y_i), find an approximating function f(x) ≈ y, i.e. a mapping x → f(·) → ŷ = f(x) ≈ y. This is called training or learning. Two major types of learning: Unsupervised Classification (aka Clustering): only X is known. Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

Unsupervised Learning: Clustering. Why learn without supervision? In many problems labels are not available, or are impossible or expensive to get. E.g. in the hand-written digits example, a human sat in front of the computer for hours to label all those examples. For other problems, the classes to be labeled depend on the application. A good example is image segmentation: if you want to know whether this is an image of the wild or of a big city, there is probably no need to segment. If you want to know whether there is an animal in the image, then you would segment. Unfortunately, the segmentation mask is usually not available.

Review of Supervised Classification. Although our focus is on clustering, let us start by reviewing supervised classification. To implement the optimal decision rule for a supervised classification problem, we need to: collect a labeled i.i.d. training data set D = {(x_1, y_1), ..., (x_n, y_n)}, where x_j is a vector of observations and y_j is the associated class label, and then learn a probability model for each class. This involves estimating P_{X|Y}(x|i) and P_Y(i) for each class i.

Supervised Classification. This can be done by Maximum Likelihood Estimation. MLE has two steps: 1) Choose a parametric model for each class pdf: $P_{X|Y}(x|i;\theta_i)$, $\theta_i \in \Theta$. 2) Select the parameters of class $i$ to be the ones that maximize the probability of the i.i.d. data from that class:
$$\hat{\theta}_i = \arg\max_{\theta \in \Theta} P_{X|Y}(D^{(i)}|i;\theta) = \arg\max_{\theta \in \Theta} \log P_{X|Y}(D^{(i)}|i;\theta)$$

Maximum Likelihood Estimation. We have seen that MLE can be a straightforward procedure. In particular, if the pdf is twice differentiable, the solutions are parameter values such that
$$\nabla_\theta P_{X|Y}(D^{(i)}|i;\hat{\theta}_i) = 0$$
$$p^T\,\nabla^2_\theta P_{X|Y}(D^{(i)}|i;\hat{\theta}_i)\,p \le 0, \quad \forall p$$
You always have to check the second-order condition for a maximum. We must also find an MLE for the class probabilities P_Y(i). But here there is not much choice of probability model, e.g. Bernoulli: the ML estimate is the fraction of training points in the class.

Maximum Likelihood Estimation. We have worked out the Gaussian case in detail. Let $D^{(i)} = \{x_1^{(i)}, \dots, x_{n_i}^{(i)}\}$ be the set of examples from class $i$. The ML estimates for class $i$ are
$$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{n}$$
There are many other distributions for which we can derive a similar set of equations, but the Gaussian case is particularly relevant for clustering (more on this later).
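As a concrete illustration of these class-conditional ML estimates, here is a minimal NumPy sketch (the array names X, y and the function name are illustrative, not from the slides):

```python
import numpy as np

def gaussian_mle_per_class(X, y):
    """ML estimates of mean, covariance and prior for each class label in y."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                      # examples from class c
        mu = Xc.mean(axis=0)                # sample mean
        diff = Xc - mu
        Sigma = diff.T @ diff / len(Xc)     # ML covariance (divide by n_c, not n_c - 1)
        prior = len(Xc) / n                 # fraction of training points in class c
        params[c] = (mu, Sigma, prior)
    return params
```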

Supervised Learning via MLE. This gives probability models for each of the classes. Now we use the fact that, assuming the zero/one loss, the optimal decision rule (the Bayes decision rule, BDR) is the MAP rule:
$$i^*(x) = \arg\max_i P_{Y|X}(i|x)$$
which can also be written as
$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x|i) + \log P_Y(i)\right]$$
This completes the process of supervised learning of a BDR. We now have a rule for classifying any (unlabeled) future measurement x.

Gaussian Classifier. In the Gaussian case the BDR is
$$i^*(x) = \arg\min_i \left[ d_i^2(x, \mu_i) + \alpha_i \right]$$
with
$$d_i^2(x,y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log\left[(2\pi)^d |\Sigma_i|\right] - 2\log P_Y(i)$$
[Figure: discriminant surface where P_{Y|X}(1|x) = 0.5.] This can be seen as finding the nearest class neighbor using a "funny" metric: each class has its own squared distance, which is the Mahalanobis-squared distance for that class plus the constant α_i. We effectively have different metrics in different regions of the space.
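A minimal sketch of this quadratic discriminant rule, reusing the per-class estimates computed in the previous sketch (names are illustrative):

```python
import numpy as np

def gaussian_bdr(x, params):
    """Classify x by minimizing the Mahalanobis-squared distance plus alpha_i."""
    best_class, best_score = None, np.inf
    d = len(x)
    for c, (mu, Sigma, prior) in params.items():
        diff = x - mu
        d2 = diff @ np.linalg.solve(Sigma, diff)                   # Mahalanobis squared
        # alpha_i = log[(2*pi)^d |Sigma_i|] - 2 log P_Y(i), computed via slogdet for stability
        alpha = d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1] - 2 * np.log(prior)
        score = d2 + alpha
        if score < best_score:
            best_class, best_score = c, score
    return best_class
```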

Gaussian Classifier. A special case of interest is when all classes have the same covariance, Σ_i = Σ:
$$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$
with
$$d^2(x,y) = (x-y)^T \Sigma^{-1} (x-y), \qquad \alpha_i = -2\log P_Y(i)$$
[Figure: discriminant surface where P_{Y|X}(1|x) = 0.5.] Note: α_i can be dropped when all classes have equal probability. This is then close to the nearest-neighbor classifier with Mahalanobis distance; however, instead of finding the nearest neighbor, it looks for the nearest class prototype or template μ_i.

Gaussian Classifier, Σ_i = Σ, for two classes (detection). One important property of this case is that the decision boundary is a hyperplane. This can be shown by computing the set of points x such that
$$d^2(x, \mu_0) + \alpha_0 = d^2(x, \mu_1) + \alpha_1$$
and showing that they satisfy
$$w^T(x - x_0) = 0$$
This is the equation of a hyperplane with normal w. The point x_0 can be any fixed point on the hyperplane, but it is standard to choose the one with minimum norm, in which case w and x_0 are parallel. [Figure: two-class data x_1, ..., x_n with the hyperplane through x_0, normal w, and the discriminant P_{Y|X}(1|x) = 0.5.]
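For reference, a brief worked derivation under the definitions above (the particular x_0 shown lies on the line through the class means; it is a valid boundary point but not the minimum-norm choice mentioned in the slide):
$$d^2(x,\mu_0) - d^2(x,\mu_1) = 2(\mu_1-\mu_0)^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1}\mu_0 - \mu_1^T \Sigma^{-1}\mu_1$$
Setting $d^2(x,\mu_0)+\alpha_0 = d^2(x,\mu_1)+\alpha_1$ and collecting terms gives $w^T(x - x_0) = 0$ with
$$w = \Sigma^{-1}(\mu_0 - \mu_1), \qquad x_0 = \tfrac{1}{2}(\mu_0+\mu_1) - \frac{\alpha_1-\alpha_0}{2\,(\mu_0-\mu_1)^T\Sigma^{-1}(\mu_0-\mu_1)}\,(\mu_0-\mu_1)$$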

Gaussian Classifier. If all the covariances are the identity, Σ_i = I, then
$$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$
with
$$d^2(x,y) = \|x - y\|^2, \qquad \alpha_i = -2\log P_Y(i)$$
This is just (Euclidean-distance) template matching with the class means as templates. E.g. for digit classification, the class means (templates) are the average images of each digit. [Figure: the class-mean digit templates.] Compare the complexity of template matching to that of nearest neighbors!

Unsupervised Classification - Clustering. In a clustering problem we do not have labels in the training set. We can try to estimate both the class labels and the class pdf parameters. Here is a strategy: assume k classes, with pdfs initialized to randomly chosen parameter values; then iterate between two steps: 1) Apply the optimal decision rule for the (estimated) class pdfs; this assigns each point to one of the clusters, creating pseudo-labeled data. 2) Update the pdf estimates by doing parameter estimation within each estimated (pseudo-labeled) class/cluster found in step 1.

Unsupervised Classification - Clustering. Natural question: what probability model do we assume? Let's start as simple as possible. Assume k Gaussian classes with identity covariances and equal P_Y(i). Each class has an unknown mean (prototype) μ_i which must be learned. The resulting clustering algorithm is the k-means algorithm: start with some initial estimate of the μ_i (e.g. random, but distinct); then iterate between 1) BDR classification using the current estimates of the k class means:
$$i^*(x) = \arg\min_{1 \le i \le k} \|x - \mu_i\|^2$$
2) Re-estimation of the k class means:
$$\mu_i^{new} = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j^{(i)}, \qquad i = 1, \dots, k$$
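A minimal NumPy sketch of these two alternating steps (function and variable names are illustrative, not part of the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """Alternate BDR assignment and mean re-estimation for k identity-covariance Gaussians."""
    mu = X[rng.choice(len(X), size=k, replace=False)]      # distinct initial prototypes
    for _ in range(n_iter):
        # 1) BDR step: assign each point to the nearest class mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2) Re-estimation step: each mean becomes the average of its assigned points
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels
```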

K-means iterations illustrated step by step on a 2-D example over several slides (figures courtesy of Andrew Moore, CMU).

K-means Clustering. The name comes from the fact that we are trying to learn the k means (mean values) of k assumed clusters. It is optimal if you want to minimize the expected value of the squared error between a vector x and the template to which x is assigned. K-means results in a Voronoi tessellation of the feature space. Problems: How many clusters? (i.e., what is k?) Various methods are available: Bayesian information criterion, Akaike information criterion, minimum description length; guessing can work pretty well. The algorithm converges to a local minimum solution only. How does one initialize? Random can be pretty bad; mean splitting can be significantly better.

Growing k via Mean Splitting. Let k = 1. Compute the sample mean of all points, μ^{(1)}. (The superscript denotes the current value of k.) To initialize the means for k = 2, perturb the mean μ^{(1)} randomly:
$$\mu_1^{(2)} = \mu^{(1)}, \qquad \mu_2^{(2)} = (1+\varepsilon)\,\mu^{(1)}, \qquad \varepsilon \ll 1$$
Then run k-means until convergence for k = 2. Initialize the means for k = 4:
$$\mu_1^{(4)} = \mu_1^{(2)}, \quad \mu_2^{(4)} = (1+\varepsilon)\,\mu_1^{(2)}, \quad \mu_3^{(4)} = \mu_2^{(2)}, \quad \mu_4^{(4)} = (1+\varepsilon)\,\mu_2^{(2)}$$
Then run k-means until convergence for k = 4. Etc.
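A sketch of this doubling scheme, assuming a k-means loop that accepts initial means (eps and the function names are illustrative):

```python
import numpy as np

def kmeans_from_init(X, mu, n_iter=100):
    """Run the k-means loop starting from the given initial means."""
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                       for i in range(len(mu))])
    return mu, labels

def kmeans_mean_splitting(X, target_k, eps=1e-3):
    """Grow k by doubling: split each converged mean into itself and a perturbed copy."""
    mu = X.mean(axis=0, keepdims=True)          # k = 1: the overall sample mean
    labels = np.zeros(len(X), dtype=int)
    while len(mu) < target_k:
        mu = np.vstack([mu, (1.0 + eps) * mu])  # each mean -> (mu, (1 + eps) * mu)
        mu, labels = kmeans_from_init(X, mu)    # run k-means to (approximate) convergence
    return mu, labels
```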

Deleting Empty Clusters. Empty clusters can be a source of algorithmic difficulties. Therefore, at the end of each iteration of k-means, check the number of elements in each cluster; if it is too low, throw the cluster away and reinitialize its mean with a perturbed version of the mean of the most populated cluster. Note that there are alternative names: in the compression literature this is known as the Generalized Lloyd Algorithm (this is actually the right name, since Lloyd was the first to invent it). It is also known as (data) Vector Quantization and is used in the design of vector quantizers.

Vector Quantization is a popular data compression technique: find a codebook of prototypes for the vectors to compress, and instead of transmitting each vector, transmit the codebook index. Image compression example: each pixel has 3 color components (requiring 3 bytes of information); instead, find the optimal 256 color prototypes (256 values ~ 1 byte of information)!

Vector Quantization. We now have an image compression scheme. Each pixel has 3 color components (1 byte per color = 3 bytes total needed). Instead, find the nearest-neighbor template among 256 colors and transmit the template index; since there are only 256 templates, only one byte is needed. Using the index, the decoder looks up the prototype in its table. By sacrificing a little bit of distortion, we saved 2 bytes per pixel!
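A sketch of this color-quantization scheme, using the k-means routine from the earlier sketch as the codebook design step (the image shape and function names are illustrative):

```python
import numpy as np

def vq_compress_image(img, codebook_size=256):
    """Design a color codebook with k-means and encode each pixel as a codebook index."""
    pixels = img.reshape(-1, 3).astype(float)           # one 3-D color vector per pixel
    codebook, labels = kmeans(pixels, k=codebook_size)  # k-means = generalized Lloyd design
    indices = labels.astype(np.uint8)                   # 1 byte per pixel instead of 3
    return codebook, indices, img.shape

def vq_decompress_image(codebook, indices, shape):
    """Decoder: look up each transmitted index in the codebook table."""
    return codebook[indices].reshape(shape).astype(np.uint8)
```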

K-means. There are many other applications of k-means, e.g. image segmentation: decompose each image into component objects by running k-means on the colors and looking at the assignments. E.g., the pixels assigned to the red cluster tend to be from the booth. [Figure: color-based segmentation example.]

K-means. We can also use texture information in addition to color; there are many methods for clustering using texture metrics. Here are some results. Note that this is not the state of the art in image segmentation, but it gives a good idea of what k-means can do.

Extensions to basic K-means. There are many extensions to the basic k-means algorithm. One of the most important applications is to the problem of learning accurate approximations to general, nontrivial pdfs. Remember that the decision rule
$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x|i) + \log P_Y(i)\right]$$
is optimal only if the true probabilities P_{X|Y}(x|i) are correctly estimated. This often turns out to be impossible when we use overly simple parametric models like the Gaussian: often the true probability is too complicated for any simple model to hold accurately, and even if simple models provide good local approximations, there are usually multiple clusters when we take a global view. These weaknesses can be addressed by the use of mixture distributions and the Expectation-Maximization (EM) algorithm.

Mixture Distributions. Consider the following problem: certain types of traffic are banned from a bridge, and we want an automatic detector/classifier to see if the ban is holding. A sensor measures vehicle weight, and we want to classify each vehicle into class = OK or class = Banned. We know that in each class there are multiple sub-classes, e.g. OK = {compact, sedan, station wagon, SUV} and Banned = {truck, bus, semi}. Each of the sub-classes is close to Gaussian, but for the whole class we get a multimodal density. [Figure: multimodal class-conditional density.]

Mixture distributions. This distribution is a mixture: the overall shape is determined by a number of (sub)class densities. We introduce a random variable Z to account for this. A value Z = c points to class c and thus picks out the c-th component density of the mixture. E.g. a Gaussian mixture:
$$P_X(x;\theta) = \sum_{c=1}^{C} \pi_c\, G(x; \mu_c, \Sigma_c)$$
where C is the number of mixture components, π_c is the c-th component weight, and the c-th mixture component is a Gaussian pdf.

Mixture Distributions. Learning a mixture density is a type of soft clustering problem. For each training point x_k we need to figure out from which component class Z_k = Z(x_k) = j it was drawn. Once we know how points are assigned to a component j, we can estimate the component-j pdf parameters. This could be done with k-means. A more general algorithm is Expectation-Maximization (EM). A key difference from k-means: we never hard-assign the points x_k. In the expectation step we compute the posterior probabilities that a point x_k belongs to class j, for every j, conditioned on all the data D. But we do not make a hard decision! (E.g., we do not assign the point x_k to a single class via the MAP rule.) Instead, in the maximization step, the point x_k participates in all classes to a degree weighted by the posterior class probabilities.

Expectation-Maximization (EM). The EM Algorithm: 1. Start with an initial parameter vector estimate θ^{(0)}. 2. E-step: given the current parameters θ^{(i)} and the observations in D, estimate the indicator functions χ(z_k = j) via the conditional expectation
$$h_{kj} = E\{\chi(z_k = j)\,|\,D;\theta^{(i)}\} = E\{\chi(z_k = j)\,|\,x_k;\theta^{(i)}\}$$
3. M-step: weighting the data x_k by h_{kj}, we have a complete-data MLE problem for each class j, i.e. maximize the class-j likelihoods over the parameters and re-compute θ^{(i+1)}. 4. Go to 2. In graphical form: the M-step estimates the parameters θ^{(i+1)}, the E-step fills in the class assignments h_{kj}, and the two steps alternate.

Expectation-Maximization (EM). Note that for any mixture density we have (from Bayes' rule):
$$h_{kj} = E\{\chi(Z_k = j)\,|\,x_k;\theta^{(i)}\} = P_{Z|X}(Z_k = j\,|\,x_k;\theta^{(i)}) = \frac{P_{X|Z}(x_k\,|\,Z_k = j;\theta^{(i)})\,\pi_j^{(i)}}{\sum_{c=1}^{C} P_{X|Z}(x_k\,|\,Z_k = c;\theta^{(i)})\,\pi_c^{(i)}}$$
and the expected number of points in class j is
$$\hat{n}_j = E\Big\{\sum_{k=1}^{n}\chi(Z_k = j)\,\Big|\,x;\theta^{(i)}\Big\} = \sum_{k=1}^{n} h_{kj}, \qquad n = \sum_{j=1}^{C} n_j = \sum_{j=1}^{C}\hat{n}_j$$

Expectation-Maximization (EM). In particular, for a Gaussian mixture we have:
Expectation step:
$$h_{kj} = P_{Z|X}(Z_k = j\,|\,x_k;\theta^{(i)}) = \frac{G(x_k;\mu_j^{(i)},\sigma_j^{2(i)})\,\pi_j^{(i)}}{\sum_{c=1}^{C} G(x_k;\mu_c^{(i)},\sigma_c^{2(i)})\,\pi_c^{(i)}}$$
Maximization step:
$$\hat{n}_j = \sum_{k=1}^{n} h_{kj}, \qquad \pi_j^{(i+1)} = \frac{\hat{n}_j}{n}$$
$$\mu_j^{(i+1)} = \frac{1}{\hat{n}_j}\sum_{k=1}^{n} h_{kj}\,x_k, \qquad \sigma_j^{2(i+1)} = \frac{1}{\hat{n}_j}\sum_{k=1}^{n} h_{kj}\,\big(x_k - \mu_j^{(i+1)}\big)^2$$
Compare to the single (non-mixture) Gaussian MLE solution shown earlier: they are equivalent solutions when h_{kj} is the hard indicator function which selects class-labeled data.
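A minimal NumPy sketch of these updates for a 1-D Gaussian mixture with scalar variances, as in the slide (the function names and initialization are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, C, n_iter=200, rng=np.random.default_rng(0)):
    """EM for a 1-D mixture of C Gaussians with weights pi, means mu, variances var."""
    n = len(x)
    pi = np.full(C, 1.0 / C)
    mu = rng.choice(x, size=C, replace=False)    # initialize means at random data points
    var = np.full(C, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities h[k, j] = P(Z_k = j | x_k; current parameters)
        h = pi * gaussian_pdf(x[:, None], mu, var)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        n_hat = h.sum(axis=0)
        pi = n_hat / n
        mu = (h * x[:, None]).sum(axis=0) / n_hat
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / n_hat
    return pi, mu, var
```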

Expectation-Maximization (EM). Note the differences between EM and k-means: in the E-step, h_{kj} is not hard-limited to 0 or 1 (doing so would make the M-step exactly the same as k-means), and we get estimates of the class covariances and class probabilities automatically. k-means can be seen as a greedy version of EM: at each iteration, for each point, we make a hard decision (the optimal MAP BDR for identity covariances and equal class priors), but this does not take into account the information in the points we throw away, i.e. potentially all points carry information about all (sub)classes. Note: if the hard assignment is best, EM will learn it. To get a feeling for EM you can use http://www-cse.ucsd.edu/users/bayrakt/java/em/

END