Lecture 18 Nov 07 2008

Review: Clustering
Grouping similar objects into clusters.
Hierarchical clustering: the agglomerative approach (HAC) iteratively merges similar clusters; different linkage algorithms compute the distances among clusters.
Non-hierarchical clustering: K-means starts with a set of initial seeds (centers) and iteratively goes through reassignment and re-centering steps until convergence.

More about K-means
It always converges (fast).
It converges to a local optimum, and different initial seeds lead to different local optima. To address this (a sketch of the first remedy follows this slide):
- run many random restarts and pick the best result w.r.t. MSE;
- choose initial seeds that are far apart.
Other problems:
- it is best suited for cases where the clusters are all spherical and similar in size;
- it does not allow an object to partially belong to multiple clusters.
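One way to realize the random-restart remedy, as a minimal numpy sketch (the function names and the simple seeding scheme are mine, not from the lecture):

```python
import numpy as np

def kmeans_once(X, k, rng, n_iter=100):
    """One K-means run from a random seeding; returns (centers, labels, MSE)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Reassignment step: each point goes to its closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Re-centering step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its old center).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centers, labels, d2[np.arange(len(X)), labels].mean()

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run K-means several times and keep the solution with the lowest MSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_restarts)),
               key=lambda run: run[2])
```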

Soft vs. hard clustering
Hard clustering: each data point is deterministically assigned to one and only one cluster. But in reality clusters may overlap.
Soft clustering: data points are assigned to clusters with certain probabilities.

How can we extend K-means to do soft clustering?
Given a set of cluster centers μ_1, μ_2, ..., μ_k, instead of directly assigning every data point to its closest cluster, we can assign it partially based on the distances (one possible weighting is sketched below).
If each point only partially belongs to a particular cluster, then when computing the centroid, should we still use it as if it were fully there?
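As one illustrative answer, assuming the weights decay with squared distance (this particular weighting is my choice; the slide only poses the question):

```python
import numpy as np

def soft_weights(X, centers):
    """Partial assignments: w[i, j] decreases with the squared distance from
    point i to center j and sums to 1 over the clusters."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d2 -= d2.min(axis=1, keepdims=True)      # shift for numerical stability
    w = np.exp(-d2)                          # one illustrative decay function
    return w / w.sum(axis=1, keepdims=True)

def soft_centroids(X, w):
    """Each point contributes to every centroid in proportion to its weight."""
    return (w.T @ X) / w.sum(axis=0)[:, None]
```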

Gaussians for representing a cluster
What exactly is a cluster? Intuitively, it is a tightly packed, ball-shaped thing. We can use a Gaussian (normal) distribution to describe it. Let's first review what a Gaussian distribution is.

Side track: the Gaussian distribution
Univariate Gaussian distribution N(μ, σ^2): μ is the mean (center of the mass), σ^2 is the variance (spread of the mass).
Multivariate Gaussian distribution N(μ, Σ): μ = (μ_1, μ_2) is the mean vector and Σ is the covariance matrix, e.g. in two dimensions

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$
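For reference, the corresponding density functions (standard formulas, added here; d is the dimension of x):

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$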

Mixture of Gaussians
Assume that we have k clusters in our data, and that each cluster contains data generated from a Gaussian distribution.
Overall process of generating data:
- first, randomly select one of the clusters according to a prior distribution over the clusters;
- then, draw a random sample from the Gaussian distribution of that particular cluster.
This is similar to the generative model we learned for the Bayes classifier. The difference? Here we don't know the cluster membership of each data point (unsupervised).
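To make the generative process concrete, a minimal sketch that samples from a two-component mixture (the mixture weights, means, and covariances below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture: prior over clusters, cluster means, cluster covariances.
priors = np.array([0.4, 0.6])
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def sample_mixture(n):
    """Pick a cluster from the prior, then sample from that cluster's Gaussian."""
    X = np.empty((n, 2))
    z = rng.choice(len(priors), size=n, p=priors)   # hidden cluster memberships
    for i, j in enumerate(z):
        X[i] = rng.multivariate_normal(means[j], covs[j])
    return X, z

X, z = sample_mixture(500)   # in clustering, z would not be observed
```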

Clustering using mixture-of-Gaussians models
Given a set of data points, and assuming that we know there are k clusters in the data, we need to:
- assign the data points to the k clusters (soft assignment);
- learn the Gaussian distribution parameters of each cluster: μ and Σ.

A simpler problem
If we know the parameters of each Gaussian, (μ_1, Σ_1), (μ_2, Σ_2), ..., (μ_K, Σ_K), together with the cluster priors α_j = P(C_j), we can compute the probability of each data point x_i belonging to each cluster:

$$P(C_j \mid x_i) = \frac{P(x_i \mid C_j)\, P(C_j)}{P(x_i)}, \qquad P(x_i \mid C_j) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\right]$$

This is the same as making a prediction with the Bayes classifier.
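A minimal sketch of this computation, assuming SciPy is available (function and variable names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, priors, means, covs):
    """E-step: P(C_j | x_i) for every point i and cluster j, via Bayes rule."""
    n, k = X.shape[0], len(priors)
    lik = np.empty((n, k))
    for j in range(k):
        # P(x_i | C_j): Gaussian density of cluster j evaluated at every point
        lik[:, j] = multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
    joint = lik * priors                              # P(x_i | C_j) P(C_j)
    return joint / joint.sum(axis=1, keepdims=True)   # divide by P(x_i)
```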

Another simpler problem
If we know which points belong to which cluster, we can estimate the Gaussian parameters easily:

Cluster prior: $$\hat{\alpha}_j = \frac{n_j}{n}, \qquad n_j = |C_j|$$

Cluster mean: $$\hat{\mu}_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i$$

Cluster covariance: $$\hat{\Sigma}_j = \frac{1}{n_j} \sum_{x_i \in C_j} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T$$

What we have is slightly different: for each data point x_i we only have the soft memberships P(C_j | x_i) for j = 1, 2, ..., K.

Modifications
Replace the hard counts with the soft memberships P(C_j | x_i); each point now contributes to every cluster in proportion to its membership probability (a numpy sketch of these updates follows).

Cluster prior: $$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} P(C_j \mid x_i)$$

Cluster mean: $$\hat{\mu}_j = \frac{\sum_{i=1}^{n} P(C_j \mid x_i)\, x_i}{\sum_{i=1}^{n} P(C_j \mid x_i)}$$

Cluster covariance: $$\hat{\Sigma}_j = \frac{\sum_{i=1}^{n} P(C_j \mid x_i)\, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T}{\sum_{i=1}^{n} P(C_j \mid x_i)}$$
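A minimal numpy sketch of these weighted updates (my own names; R is the n-by-k matrix of memberships P(C_j | x_i)):

```python
import numpy as np

def m_step(X, R):
    """M-step: re-estimate priors, means, covariances from soft memberships R."""
    n, d = X.shape
    Nk = R.sum(axis=0)                       # soft count of points per cluster
    priors = Nk / n                          # cluster priors
    means = (R.T @ X) / Nk[:, None]          # weighted means, shape (k, d)
    covs = []
    for j in range(R.shape[1]):
        diff = X - means[j]
        covs.append((R[:, j, None] * diff).T @ diff / Nk[j])   # weighted covariance
    return priors, means, covs
```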

A procedure similar to K-means
Randomly initialize the Gaussian parameters, then repeat until convergence:
1. Compute P(C_j | x_i) for all data points and all clusters. This is called the E-step, because it computes the expected values of the cluster memberships for each data point.
2. Re-compute the parameters of each Gaussian. This is called the M-step, because it performs maximum-likelihood estimation of the parameters.
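Putting the two steps together, a compact end-to-end sketch (my own code, not from the lecture; it runs a fixed number of iterations for simplicity and assumes SciPy is available):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, k, n_iter=100, seed=0):
    """Fit a k-component Gaussian mixture to X (n x d) with plain EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random initialization: uniform priors, means at random data points, identity covariances.
    priors = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = [np.eye(d) for _ in range(k)]

    for _ in range(n_iter):
        # E-step: R[i, j] = P(C_j | x_i)
        lik = np.column_stack([multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
                               for j in range(k)])
        joint = lik * priors
        R = joint / joint.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood re-estimation
        Nk = R.sum(axis=0)
        priors = Nk / n
        means = (R.T @ X) / Nk[:, None]
        covs = [((R[:, j, None] * (X - means[j])).T @ (X - means[j])) / Nk[j]
                for j in range(k)]
    return priors, means, covs, R
```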

Q: Why are these two points red when they appear to be closer to blue?

K-means is a special case
We get K-means if we make the following restrictions:
- all Gaussians have the identity covariance matrix (i.e., spherical Gaussians);
- use hard assignment in the E-step, assigning each data point to its most likely cluster.

Behavior of EM
It is guaranteed to converge. In practice it may converge slowly; one can stop early if the change in log-likelihood is smaller than a threshold.
Like K-means, it converges to a local optimum, so multiple restarts are recommended (a brief example of both remedies follows).
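Both remedies are available as library options; a minimal sketch assuming scikit-learn is installed (the parameter values are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

# tol: stop early once the per-iteration gain in average log-likelihood falls below it.
# n_init: run EM from several random initializations and keep the best fit.
gmm = GaussianMixture(n_components=3, tol=1e-3, n_init=10, random_state=0).fit(X)
soft_assignments = gmm.predict_proba(X)   # P(C_j | x_i) for every data point
```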