Lecture Nov


Bryce Green

1 Lecture 18 Nov

2 Review. Clustering: grouping similar objects into clusters. Hierarchical clustering: the agglomerative approach (HAC) iteratively merges similar clusters; different linkage algorithms compute the distances among clusters. Non-hierarchical clustering: K-means starts with a set of initial seeds (centers) and iteratively goes through reassignment and recentering steps until convergence.

3 More about K-means. It always converges (fast). It converges to a local optimum. Different initial seeds lead to different local optima; to address this, run many random restarts and pick the best with respect to MSE, or separate the initial seeds far apart. Other problems: it is best suited for cases where clusters are all spherical and similar in size, and it does not allow an object to partially belong to multiple clusters.
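The restart strategy above can be sketched as follows. This is a minimal illustration, assuming one-dimensional data; the function names `kmeans_1d` and `best_of_restarts`, the toy data, and the choice of 20 restarts are all hypothetical and not from the lecture.

```python
import random

def kmeans_1d(points, k, rng, max_iter=100):
    """Plain K-means on 1-D data; returns (centers, mse)."""
    centers = rng.sample(points, k)  # initial seeds drawn from the data
    for _ in range(max_iter):
        # Reassignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        # Recentering step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    mse = sum(min((x - c) ** 2 for c in centers) for x in points) / len(points)
    return centers, mse

def best_of_restarts(points, k, n_restarts=20, seed=0):
    """Run K-means from several random seeds and keep the lowest-MSE run."""
    rng = random.Random(seed)
    runs = [kmeans_1d(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9]
centers, mse = best_of_restarts(points, k=3)
```

A single run can get stuck, for example when two seeds land in the same group; keeping the run with the lowest MSE is what protects against that.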

4 Soft vs. hard clustering. Hard clustering: each data point is deterministically assigned to one and only one cluster. But in reality clusters may overlap. Soft clustering: data points are assigned to clusters with certain probabilities.

5 How can we extend K-means to make soft clustering? Given a set of cluster centers $\mu_1, \mu_2, \ldots, \mu_k$, instead of directly assigning all data points to their closest clusters, we can assign them partially based on the distances. If each point only partially belongs to a particular cluster, when computing the centroid, should we still use it as if it were fully there?

6 Gaussian for representing a cluster. What exactly is a cluster? Intuitively it is a tightly packed, ball-shaped thing. We can use a Gaussian (normal) distribution to describe it. Let's first review what a Gaussian distribution is.

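7 Side track: Gaussian distribution. Univariate Gaussian distribution: $N(\mu, \sigma^2)$, where $\mu$ is the mean (center of the mass) and $\sigma^2$ is the variance ($\sigma$ is the standard deviation, the spread of the mass). Multivariate Gaussian distribution: $N(\mu, \Sigma)$, where $\mu = (\mu_1, \mu_2)$ is the mean vector and $\Sigma$ is the covariance matrix:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$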

8 Mixture of Gaussians. Assume that we have k clusters in our data. Each cluster contains data generated from a Gaussian distribution. Overall process of generating data: first randomly select one of the clusters according to a prior distribution of the clusters, then draw a random sample from the Gaussian distribution of that particular cluster. This is similar to the generative model we learned for the Bayes classifier. The difference? Here we don't know the cluster membership of each data point (unsupervised).
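The two-step generative process can be sketched as below. This is only an illustration, assuming one-dimensional Gaussians; the function name `sample_mixture` and the example priors, means, and standard deviations are made up for the sketch.

```python
import random

def sample_mixture(priors, means, stds, n, seed=0):
    """Draw n (cluster index, value) pairs from a 1-D mixture of Gaussians."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: pick a cluster according to the prior distribution.
        i = rng.choices(range(len(priors)), weights=priors)[0]
        # Step 2: draw a sample from that cluster's Gaussian.
        x = rng.gauss(means[i], stds[i])
        samples.append((i, x))
    return samples

samples = sample_mixture(priors=[0.3, 0.7], means=[0.0, 10.0],
                         stds=[1.0, 1.0], n=1000)
```

In the clustering setting only the values are observed; the cluster index returned here is exactly the hidden membership that has to be inferred.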

9 Clustering using mixture of Gaussian models. Given a set of data points, and assuming that we know there are k clusters in the data, we need to: assign the data points to the k clusters (soft assignment), and learn the Gaussian distribution parameters for each cluster, $\mu$ and $\Sigma$.

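10 A simpler problem. If we know the parameters of each Gaussian, $(\mu_1, \Sigma_1); (\mu_2, \Sigma_2); \ldots; (\mu_K, \Sigma_K)$, we can compute the probability of each data point $x_j$ belonging to each cluster $C_i$:
$$P(C_i \mid x_j) = \frac{P(x_j \mid C_i)\,P(C_i)}{P(x_j)} \propto \alpha_i \, \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right]$$
where $\alpha_i = P(C_i)$ is the cluster prior and $d$ is the dimension of the data. This is the same as making a prediction with the Bayes classifier.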
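As a concrete instance of this computation, here is a small sketch for two-dimensional data, using the closed-form inverse of a 2x2 covariance matrix. The helper names `gaussian_pdf_2d` and `posteriors` and the example parameters are assumptions made for illustration only.

```python
import math

def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian at x; cov is ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), with the 2x2 inverse
    # written out as (1 / det) * [[d, -b], [-c, a]].
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def posteriors(x, priors, means, covs):
    """Return [P(C_i | x)] for every cluster i via Bayes' rule."""
    joint = [p * gaussian_pdf_2d(x, m, s) for p, m, s in zip(priors, means, covs)]
    total = sum(joint)  # P(x), the normalizing constant
    return [j / total for j in joint]

identity = ((1.0, 0.0), (0.0, 1.0))
priors = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 0.0)]
covs = [identity, identity]

p_mid = posteriors((2.0, 0.0), priors, means, covs)   # halfway between the centers
p_near = posteriors((0.0, 0.0), priors, means, covs)  # on top of the first center
```

A point halfway between two otherwise identical clusters is split evenly, while a point sitting on one center belongs to that cluster almost entirely.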

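11 Another simpler problem. If we know which points belong to cluster $C_i$, we can estimate the Gaussian parameters easily, where $n_i$ is the number of points in $C_i$ and $n$ is the total number of points. Cluster prior:
$$\hat{\alpha}_i = \frac{n_i}{n}$$
Cluster mean:
$$\hat{\mu}_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$$
Cluster covariance:
$$\hat{\Sigma}_i = \frac{1}{n_i} \sum_{x_j \in C_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T$$
What we have is slightly different: for each data point $x_j$, we have $P(C_i \mid x_j)$ for $i = 1, 2, \ldots, K$.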

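12 Modifications. Each hard count over the members of a cluster is replaced by a sum over all points weighted by $P(C_i \mid x_j)$. Cluster prior:
$$\hat{\alpha}_i = \frac{1}{n} \sum_{j=1,\ldots,n} P(C_i \mid x_j)$$
Cluster mean: $\hat{\mu}_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$ becomes
$$\hat{\mu}_i = \frac{\sum_{j=1,\ldots,n} P(C_i \mid x_j)\, x_j}{\sum_{j=1,\ldots,n} P(C_i \mid x_j)}$$
Cluster covariance: $\hat{\Sigma}_i = \frac{1}{n_i} \sum_{x_j \in C_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T$ becomes
$$\hat{\Sigma}_i = \frac{\sum_{j=1,\ldots,n} P(C_i \mid x_j)\,(x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T}{\sum_{j=1,\ldots,n} P(C_i \mid x_j)}$$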
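The weighted estimates can be sketched in a few lines. This is a minimal version assuming one-dimensional data, so the covariance reduces to a scalar variance; the function name `m_step` and the layout of `resp` (one row per point, one column per cluster) are choices made for this sketch.

```python
def m_step(xs, resp):
    """Weighted estimates; resp[j][i] holds P(C_i | x_j).

    Returns (priors, means, variances), one entry per cluster.
    """
    n, k = len(xs), len(resp[0])
    priors, means, variances = [], [], []
    for i in range(k):
        weight = sum(resp[j][i] for j in range(n))  # soft cluster size
        mean = sum(resp[j][i] * xs[j] for j in range(n)) / weight
        var = sum(resp[j][i] * (xs[j] - mean) ** 2 for j in range(n)) / weight
        priors.append(weight / n)
        means.append(mean)
        variances.append(var)
    return priors, means, variances

xs = [0.0, 2.0, 10.0, 12.0]
# Memberships of exactly 0 or 1 recover the hard-assignment estimates.
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
priors, means, variances = m_step(xs, resp)
```

With 0/1 memberships the soft cluster size equals the number of points in the cluster, so the formulas fall back to the ordinary per-cluster mean and variance.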

13 A procedure similar to K-means. Randomly initialize the Gaussian parameters. Repeat until convergence: 1. Compute $P(C_i \mid x_j)$ for all data points and all clusters. This is called the E-step, for it computes the expected values of the cluster memberships for each data point. 2. Re-compute the parameters of each Gaussian. This is called the M-step, for it performs maximum likelihood estimation of the parameters.
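Putting the two steps together, the loop might look like the sketch below. It assumes one-dimensional data and makes several choices the slides do not specify: initial means are sampled from the data, every cluster starts with the overall data variance, convergence is judged by the change in log-likelihood falling below a hypothetical tolerance `tol`, and a small floor `min_var` keeps a variance from collapsing to zero.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, k, seed=0, tol=1e-6, max_iter=200, min_var=1e-6):
    """Returns (log-likelihood from the last E-step, priors, means, variances)."""
    rng = random.Random(seed)
    n = len(xs)
    # Random initialization (assumed scheme): means from the data,
    # a shared variance, and uniform priors.
    means = rng.sample(xs, k)
    overall_mean = sum(xs) / n
    variances = [sum((x - overall_mean) ** 2 for x in xs) / n] * k
    priors = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: resp[j][i] = P(C_i | x_j); the log-likelihood comes for free.
        resp, ll = [], 0.0
        for x in xs:
            joint = [priors[i] * gaussian_pdf(x, means[i], variances[i])
                     for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
            ll += math.log(total)
        if ll - prev_ll < tol:
            break  # log-likelihood has stopped improving
        prev_ll = ll
        # M-step: weighted maximum likelihood estimates.
        for i in range(k):
            weight = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / weight
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / weight
            variances[i] = max(var, min_var)
            priors[i] = weight / n
    return ll, priors, means, variances

# Toy data: two well-separated groups.
data_rng = random.Random(42)
xs = ([data_rng.gauss(0.0, 1.0) for _ in range(100)]
      + [data_rng.gauss(10.0, 1.0) for _ in range(100)])
# Several random starts; keep the run with the highest log-likelihood.
best = max((em_gmm_1d(xs, k=2, seed=s) for s in range(20)),
           key=lambda run: run[0])
```

Runs whose initial means both fall in the same group tend to stall with two nearly identical, very wide Gaussians, which is why the usage above compares several starts rather than trusting one.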

21 Q: Why are these two points red when they appear to be closer to blue?

22 K-means is a special case. We get K-means if we make the following restrictions: all Gaussians have the identity covariance matrix (i.e., spherical Gaussians), and the E-step uses hard assignment, assigning each data point to its most likely cluster.
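A quick way to see the equivalence is to compare the two assignment rules directly. The sketch below assumes one-dimensional data with unit variance and, in addition, equal cluster priors (an assumption added here, not stated on the slide); the function names and test points are hypothetical.

```python
def most_likely_cluster(x, means):
    """Hard E-step under unit variance and equal priors (both assumed)."""
    # Log posterior up to an additive constant shared by all clusters.
    log_post = [-0.5 * (x - m) ** 2 for m in means]
    return max(range(len(means)), key=lambda i: log_post[i])

def nearest_center(x, means):
    """The K-means reassignment rule."""
    return min(range(len(means)), key=lambda i: (x - means[i]) ** 2)

means = [0.0, 3.0, 7.5]
xs = [-1.0, 1.4, 1.6, 5.0, 5.3, 9.0]
agree = all(most_likely_cluster(x, means) == nearest_center(x, means) for x in xs)
```

Because the log density is a decreasing function of the squared distance, picking the most likely cluster and picking the closest center are the same decision under these restrictions.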

23 Behavior of EM. It is guaranteed to converge. In practice it may converge slowly; one can stop early if the change in log-likelihood is smaller than a threshold. Like K-means, it converges to a local optimum. Multiple restarts are recommended.
