Clustering Techniques for Information Retrieval
Berlin Chen
Department of Computer Science & Information Engineering, National Taiwan Normal University
References:
1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Chapters 16 & 17)
2. Modern Information Retrieval, Chapters 5 & 7
3. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Jeff A. Bilmes, U.C. Berkeley TR-97-021
Clustering
Place similar objects in the same group and assign dissimilar objects to different groups (typically using a distance measure, such as Euclidean distance)
Word clustering
Neighbor overlap: words occur with similar left and right neighbors (such as "in" and "on")
Document clustering
Documents with similar topics or concepts are put together
Nevertheless, clustering cannot give a comprehensive description of an object
How to label objects shown on a visual display is a difficult problem
Clustering vs. Classification
Classification is supervised and requires a set of labeled training instances for each group (class)
Learning with a teacher
Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
Also called automatic or unsupervised classification
Types of Clustering Algorithms
Two types of structures produced by clustering algorithms: flat (non-hierarchical) clustering and hierarchical clustering
Flat clustering
Simply consists of a certain number of clusters, and the relation between clusters is often undetermined
Measurement: construction error minimization or probabilistic optimization
Hierarchical clustering
A hierarchy with the usual interpretation that each node stands for a sub-cluster of its mother node
The leaves of the tree are the single objects
Each node represents the cluster that contains all the objects of its descendants
Measurement: similarities of instances
Hard Assignment vs. Soft Assignment (1/2)
Another important distinction between clustering algorithms is whether they perform soft or hard assignment
Hard assignment
Each object (or document in the context of IR) is assigned to one and only one cluster
Soft assignment (probabilistic approach)
Each object may be assigned to multiple clusters
An object x_i has a probability distribution P(·|x_i) over clusters c_j, where P(c_j|x_i) is the probability that x_i is a member of c_j
Somewhat more appropriate in many tasks such as NLP and IR
Hard Assignment vs. Soft Assignment (2/2)
Hierarchical clustering usually adopts hard assignment
In flat clustering, both types of assignments are common
Summarized Attributes of Clustering Algorithms (1/2)
Hierarchical clustering
Preferable for detailed data analysis
Provides more information than flat clustering
No single best algorithm (each algorithm is seemingly only applicable/optimal for some applications)
Less efficient than flat clustering (minimally has to compute an n × n matrix of similarity coefficients)
Summarized Attributes of Clustering Algorithms (2/2)
Flat clustering
Preferable if efficiency is a consideration or data sets are very large
K-means is the conceptually simplest method and should probably be used first on new data because its results are often sufficient
K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
The EM algorithm is the method of choice when more flexibility is needed: it can accommodate definitions of clusters and allocation of objects based on complex probabilistic models
Its extensions can be used to handle topological/hierarchical orders of samples
E.g., Probabilistic Latent Semantic Analysis (PLSA)
Some Applications of Clustering in IR (1/5)
Cluster Hypothesis (for IR): documents in the same cluster behave similarly with respect to relevance to information needs
Possible applications of clustering in IR
These possible applications differ in
The collection of documents to be clustered
The aspect of the IR system to be improved
Some Applications of Clustering in IR (2/5)
1. Whole corpus analysis/navigation
Better user interface: users often prefer browsing over searching, because they are unsure about which search terms to use
E.g., the scatter-gather approach (for a collection of New York Times articles)
Some Applications of Clustering in IR (3/5)
2. Improve recall in search applications
Achieve better search results by
Alleviating the term-mismatch (synonym) problem facing the vector space model
First, identify an initial set of documents that match the query (i.e., contain some of the query words)
Then, add other documents from the same clusters even if they have low similarity to the query
Estimating the collection model of the language modeling (LM) retrieval approach more accurately
P(Q|M_D) = Π_{i=1}^{N} [λ P(w_i|M_D) + (1 − λ) P(w_i|M_C)]
The collection model can be estimated from the cluster the document D belongs to, instead of the entire collection:
P(Q|M_D) = Π_{i=1}^{N} [λ P(w_i|M_D) + (1 − λ) P(w_i|M_Cluster(D))]
Some Applications of Clustering in IR (4/5)
3. Better navigation of search results
Result-set clustering
Effective "user recall" will be higher
E.g., http://clusty.com
Some Applications of Clustering in IR (5/5)
4. Speed up the search process
For retrieval models using exhaustive matching (computing the similarity of the query to every document) without efficient inverted index support
E.g., latent semantic analysis (LSA), language modeling (LM)
Solution: cluster-based retrieval
First find the clusters that are closest to the query, and then only consider documents from these clusters
Within this much smaller set, we can compute similarities exhaustively and rank documents in the usual way
Evaluation of Clustering (1/2)
Internal criterion for the quality of a clustering result
The typical objective is to attain
High intra-cluster similarity (documents within a cluster are similar)
Low inter-cluster similarity (documents from different clusters are dissimilar)
The measured quality depends on both the document representation and the similarity measure used
Good scores on an internal criterion do not necessarily translate into good effectiveness in an application
Evaluation of Clustering (2/2)
External criterion for the quality of a clustering result
Evaluate how well the clustering matches the gold-standard classes produced by human judges
That is, the quality is measured by the ability of the clustering algorithm to discover some or all of the hidden patterns or latent (true) classes
Two common criteria
Purity
Rand Index (RI)
Purity (1/2)
Each cluster is first assigned to the class which is most frequent in the cluster
Then, the accuracy of the assignment is measured by counting the number of correctly assigned documents and dividing by the sample size
Purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|
Ω = {ω_1, ω_2, …, ω_K}: the set of clusters
C = {c_1, c_2, …, c_J}: the set of classes
N: the sample size
Example: Purity = (1/17)(5 + 4 + 3) ≈ 0.71
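The purity computation above can be sketched in a few lines of Python. This is a toy sketch; the `clusters`/`classes` label lists below are an assumed reconstruction of the slide's 17-document example (majority counts 5, 4, 3).

```python
from collections import Counter

def purity(clusters, classes):
    """Assign each cluster its majority class, count correct items, divide by N."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        # gold labels of the items that fell into cluster c
        members = [classes[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]  # majority-class count
    return total / n

# Reconstructed slide example: 17 documents, majority counts 5, 4, 3
clusters = [1]*6 + [2]*6 + [3]*5
classes  = (['x']*5 + ['o'] +                 # cluster 1: 5 x, 1 o
            ['x'] + ['o']*4 + ['d'] +         # cluster 2: 1 x, 4 o, 1 d
            ['x']*2 + ['d']*3)                # cluster 3: 2 x, 3 d
print(round(purity(clusters, classes), 2))    # -> 0.71
```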
Purity (2/2)
High purity is easy to achieve for a large number of clusters
Purity will be 1 if each document gets its own cluster
Therefore, purity cannot be used to trade off the quality of the clustering against the number of clusters
Rand Index (1/3)
Measures the similarity between the clusters and the classes in the ground truth
Considers the assignments of all possible N(N−1)/2 pairs of N distinct documents in the clustering and the true classes
Number of document pairs:
Same cluster in clustering, same class in ground truth: TP (True Positive)
Same cluster in clustering, different classes in ground truth: FP (False Positive)
Different clusters in clustering, same class in ground truth: FN (False Negative)
Different clusters in clustering, different classes in ground truth: TN (True Negative)
RI = (TP + TN) / (TP + FP + FN + TN)
Rand Index (2/3)
Example (17 documents, clusters ω_1, ω_2, ω_3 of sizes 6, 6, 5):
TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40 (all positive pairs)
TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
FP = 40 − 20 = 20
FN = 24
TN = 72 (all negative pairs: 24 + 72 = 96)
All pairs: N(N−1)/2 = 17 · 16 / 2 = 136
RI = (20 + 72) / 136 ≈ 0.68
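The pair counts above can be checked by brute force over all N(N−1)/2 pairs. A minimal sketch, reusing the same reconstructed 17-document example as before:

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """Count (TP, FP, FN, TN) over all N(N-1)/2 document pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1                      # correctly put in the same cluster
        elif same_cluster:
            fp += 1                      # wrongly put in the same cluster
        elif same_class:
            fn += 1                      # wrongly separated
        else:
            tn += 1                      # correctly separated
    return tp, fp, fn, tn

def rand_index(clusters, classes):
    tp, fp, fn, tn = pair_counts(clusters, classes)
    return (tp + tn) / (tp + fp + fn + tn)

clusters = [1]*6 + [2]*6 + [3]*5
classes  = ['x']*5 + ['o'] + ['x'] + ['o']*4 + ['d'] + ['x']*2 + ['d']*3
print(pair_counts(clusters, classes))           # -> (20, 20, 24, 72)
print(round(rand_index(clusters, classes), 2))  # -> 0.68
```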
Rand Index (3/3)
The Rand index has a value between 0 and 1
0 indicates that the clusters and the classes in the ground truth do not agree on any pair of points (documents)
1 indicates that the clusters and the classes in the ground truth are exactly the same
F-Measure Based on Rand Index
F-measure: weighted harmonic mean of precision (P) and recall (R)
P = TP / (TP + FP), R = TP / (TP + FN)
F_b = (b² + 1) P R / (b² P + R)
If we want to penalize false negatives (FN) more strongly than false positives (FP), we can set b > 1 (separating similar documents is sometimes worse than putting dissimilar documents in the same cluster)
That is, giving more weight to recall (R)
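Plugging in the pair counts from the Rand-index example gives a concrete F value. A small sketch (the pair counts TP=20, FP=20, FN=24 are taken from the preceding example):

```python
def f_measure(tp, fp, fn, b=1.0):
    """F_b from pair counts; b > 1 gives more weight to recall."""
    p = tp / (tp + fp)   # pairwise precision
    r = tp / (tp + fn)   # pairwise recall
    return (b*b + 1) * p * r / (b*b * p + r)

# Pair counts from the Rand-index example: TP=20, FP=20, FN=24
print(round(20/40, 2), round(20/44, 2))        # P, R -> 0.5 0.45
print(round(f_measure(20, 20, 24), 3))         # F_1 -> 0.476
print(round(f_measure(20, 20, 24, b=5), 3))    # F_5 -> 0.456 (closer to R)
```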
Normalized Mutual Information (NMI)
NMI is an information-theoretical measure
NMI(Ω, C) = I(Ω; C) / [(H(Ω) + H(C)) / 2]
I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) log [P(ω_k ∩ c_j) / (P(ω_k) P(c_j))], with the ML estimate P(ω_k ∩ c_j) = |ω_k ∩ c_j| / N
H(Ω) = −Σ_k P(ω_k) log P(ω_k), with the ML estimate P(ω_k) = |ω_k| / N
NMI will have a value between 0 and 1
Unnormalized mutual information has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better
The normalization by entropy fixes this, since H(Ω) tends to increase with the number of clusters
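The NMI definition above can be computed directly from the cluster and class label lists. A minimal sketch using natural logarithms and ML estimates, again on the reconstructed 17-document example:

```python
from math import log
from collections import Counter

def nmi(clusters, classes):
    """NMI(Omega, C) = I(Omega; C) / ((H(Omega) + H(C)) / 2), ML estimates."""
    n = len(clusters)
    pc = Counter(clusters)                 # cluster sizes |omega_k|
    pk = Counter(classes)                  # class sizes |c_j|
    joint = Counter(zip(clusters, classes))  # |omega_k intersect c_j|
    mi = sum((v / n) * log((v / n) / ((pc[a] / n) * (pk[b] / n)))
             for (a, b), v in joint.items())
    entropy = lambda cnt: -sum((v / n) * log(v / n) for v in cnt.values())
    return mi / ((entropy(pc) + entropy(pk)) / 2)

clusters = [1]*6 + [2]*6 + [3]*5
classes  = ['x']*5 + ['o'] + ['x'] + ['o']*4 + ['d'] + ['x']*2 + ['d']*3
print(round(nmi(clusters, classes), 2))  # -> 0.36
```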
Summary of External Evaluation Measures
(comparison table figure)
Flat Clustering
Flat Clustering
Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
In a multi-pass manner (recursion/iterations)
Problems associated with non-hierarchical clustering
When to stop? (possible criteria: group-average similarity, likelihood, mutual information)
What is the right number of clusters (cluster cardinality)?
Hierarchical clustering is also faced with this problem
Algorithms introduced here
The K-means algorithm
The EM algorithm
The K-means Algorithm (1/10)
Also called Linde-Buzo-Gray (LBG) in signal processing
A hard clustering algorithm
Defines clusters by the center of mass of their members
Objects (e.g., documents) should be represented in vector form
The K-means algorithm can also be regarded as
A kind of vector quantization
Map from a continuous space (high resolution) to a discrete space (low resolution)
E.g., color quantization: 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), a compression rate of 3
m_k: cluster centroid or reference vector, code word, code vector
The K-means Algorithm (2/10)
Total reconstruction error (RSS: residual sum of squares):
E({m_k} | X) = Σ_t Σ_k b_k^t ‖x^t − m_k‖², where b_k^t = 1 if ‖x^t − m_k‖ = min_l ‖x^t − m_l‖, and 0 otherwise
b_k^t and m_k are unknown in advance
b_k^t depends on m_k, and this optimization problem cannot be solved analytically
The K-means Algorithm (3/10)
Initialization: a set of initial cluster centers {m_k} is needed
Recursion:
Assign each object x^t to the cluster whose center is closest: b_k^t = 1 if ‖x^t − m_k‖ = min_l ‖x^t − m_l‖, 0 otherwise
Then, re-compute the center of each cluster as the centroid or mean (average) of its members: m_k = Σ_t b_k^t x^t / Σ_t b_k^t
These two steps are repeated until {m_k} stabilizes (a stopping criterion)
Or, we can instead use the medoid as the cluster center (a medoid is one of the objects in the cluster that is closest to the centroid)
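The two-step recursion above (assign to nearest center, then recompute centroids) can be sketched in plain Python. A toy sketch, not the slide's pseudo-code; the sample points and the fixed random seed are assumptions for the demonstration:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: assign each point to its nearest center, recompute
    centroids as cluster means, repeat until the centers stabilize."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # seeds picked at random
    for _ in range(iters):
        # assignment step: b_k^t = 1 for the closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # update step: centroid (mean) of each cluster's members
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:                       # centers stabilized -> stop
            break
        centers = new
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
print(sorted(kmeans(pts, 2)))  # two centroids, near (0.33, 0.33) and (9.33, 9.33)
```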
The K-means Algorithm (4/10)
Algorithm (pseudo-code figure)
The K-means Algorithm (5/10)
Example 1 (figure)
The K-means Algorithm (6/10)
Example 2 (figure: clustering documents into topics such as government, finance, sports, research, name)
The K-means Algorithm (7/10)
Complexity: O(IKNM)
I: iterations; K: cluster number; N: object number; M: object dimensionality
Choice of initial cluster centers (seeds) is important
Pick at random
Or, calculate the mean m of all data and generate k initial centers m_k by adding small random vectors δ_k to the mean m
Or, project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center m_k
Or, use another method such as a hierarchical clustering algorithm on a subset of the objects
E.g., the buckshot algorithm uses group-average agglomerative clustering on a random sample of the data whose size is the square root of the complete set
The K-means Algorithm (8/10)
Poor seeds will result in sub-optimal clustering
The K-means Algorithm (9/10)
How to break ties in case there are several centers with the same distance from an object
E.g., randomly assign the object to one of the candidate clusters (or assign the object to the cluster with the lowest index)
Or, perturb objects slightly
Possible applications of the K-means algorithm
Clustering
Vector quantization
A preprocessing stage before classification or regression
Map from the original space to an l-dimensional space/hypercube, l = log2(number of clusters)
Nodes on the hypercube
A linear classifier
The K-means Algorithm (10/10)
E.g., the LBG algorithm: M → 2M centers at each iteration
By Linde, Buzo, and Gray
(Figure: the global mean is split into two cluster means, each of which is then split again, and so on)
Total reconstruction error (residual sum of squares):
E({m_k} | X) = Σ_t Σ_k b_k^t ‖x^t − m_k‖²
The EM Algorithm (1/3)
EM (Expectation-Maximization) algorithm
A kind of model-based clustering
Can also be viewed as a generalization of K-means
Each cluster is a model for generating the data
The centroid is a good representative for each model
Generating an object (e.g., a document) consists of first picking a centroid at random and then adding some noise
If the noise is normally distributed, the procedure will result in clusters of spherical shape
Physical models for EM
Discrete: mixture of multinomial distributions
Continuous: mixture of Gaussian distributions
The EM Algorithm (2/3)
EM is a soft version of K-means
Each object could be a member of multiple clusters
Clustering as estimating a mixture of (continuous) probability distributions:
P(x_i|Θ) = Σ_{k=1}^{K} P(x_i|Θ_k) P(Θ_k)
Continuous case (a mixture of Gaussians):
P(x_i|Θ_k) = (1 / ((2π)^{m/2} |Σ_k|^{1/2})) exp(−(1/2)(x_i − μ_k)^T Σ_k^{−1} (x_i − μ_k))
Likelihood function for the data samples X = {x_1, x_2, …, x_n}, assuming the x_i's are independent and identically distributed (i.i.d.):
P(X|Θ) = Π_{i=1}^{n} P(x_i|Θ) = Π_{i=1}^{n} Σ_{k=1}^{K} P(x_i|Θ_k) P(Θ_k)
Classification: assign x_i to the cluster k* = arg max_k P(Θ_k|x_i)
The EM Algorithm (3/3)
(figure)
Maximum Likelihood Estimation (MLE) (1/2)
Hard assignment
Example: cluster ω_1 contains four observations {B, B, W, W}:
P(B|ω_1) = 2/4 = 0.5
P(W|ω_1) = 2/4 = 0.5
Maximum Likelihood Estimation (2/2)
Soft assignment (posterior weights per observation: state ω_1: 0.7, 0.4, 0.9, 0.5; state ω_2: 0.3, 0.6, 0.1, 0.5)
P(ω_1) = (0.7 + 0.4 + 0.9 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5 + 0.3 + 0.6 + 0.1 + 0.5) = 2.5/4 = 0.625
P(ω_2) = 1 − P(ω_1) = 0.375
P(B|ω_1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
P(W|ω_1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
P(B|ω_2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 ≈ 0.27
P(W|ω_2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 ≈ 0.73
Expectation-Maximization Updating Formulas (1/3)
Expectation step:
P(Θ_k|x_i) = P(x_i|Θ_k) P(Θ_k) / Σ_{l=1}^{K} P(x_i|Θ_l) P(Θ_l)
Compute the likelihood that each cluster generated the object (document vector) x_i
Expectation-Maximization Updating Formulas (2/3)
Maximization step:
Mixture weight: P̂(Θ_k) = (1/n) Σ_{i=1}^{n} P(Θ_k|x_i)
Mean of Gaussian: μ̂_k = Σ_{i=1}^{n} P(Θ_k|x_i) x_i / Σ_{i=1}^{n} P(Θ_k|x_i)
Expectation-Maximization Updating Formulas (3/3)
Covariance matrix of Gaussian:
Σ̂_k = Σ_{i=1}^{n} P(Θ_k|x_i) (x_i − μ̂_k)(x_i − μ̂_k)^T / Σ_{i=1}^{n} P(Θ_k|x_i)
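The E-step and M-step formulas above can be exercised on a one-dimensional, two-component Gaussian mixture. A toy sketch only (the sample data, initialization from the data extremes, and fixed iteration count are assumptions, not part of the slides):

```python
from math import exp, pi, sqrt

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    w = [0.5, 0.5]                 # mixture weights P(theta_k)
    mu = [min(xs), max(xs)]        # crude initialization from the data spread
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: posterior P(theta_k | x_i) for each sample
        post = []
        for x in xs:
            p = [w[k] / sqrt(2 * pi * var[k]) * exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in range(2)]
            s = sum(p)
            post.append([pk / s for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in post)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(post, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(post, xs)) / nk
    return w, mu, var

xs = [0.0, 0.2, -0.2, 5.0, 5.2, 4.8]
w, mu, var = em_gmm_1d(xs)
print([round(m, 2) for m in mu])  # component means near 0.0 and 5.0
```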
More Facts about the EM Algorithm
The initial cluster distributions can be estimated using the K-means algorithm, which EM can then "soften up"
The procedure terminates when the likelihood function P(X|Θ) converges or a maximum number of iterations is reached
Hierarchical Clustering
Hierarchical Clustering
Can be done in either a bottom-up or top-down manner
Bottom-up (agglomerative) 凝集的
Start with individual objects and try to group the most similar ones
E.g., the pair with the minimum distance d(x, y) apart, i.e., the maximum similarity sim(x, y)
(distance measures will be discussed later on)
The procedure terminates when one cluster containing all objects has been formed
Top-down (divisive) 分裂的
Start with all objects in a group and divide them into groups so as to maximize within-group similarity
Hierarchical Agglomerative Clustering (HAC)
A bottom-up approach
Assume a similarity measure for determining the similarity of two objects
Start with all objects in separate clusters (singletons) and then repeatedly join the two clusters that have the greatest similarity, until only one cluster survives
The history of merging/clustering forms a binary tree or hierarchy
HAC: Algorithm
Initialization (for tree leaves): each object is a cluster
At each step, the two most similar clusters are merged as a new cluster, and the original two clusters are removed
(c_i denotes a specific cluster here)
Distance Metrics
Euclidean distance (L2 norm): L2(x, y) = sqrt(Σ_{i=1}^{m} (x_i − y_i)²)
Make sure that all attributes/dimensions have the same scale (or the same variance)
L1 norm (city-block distance): L1(x, y) = Σ_{i=1}^{m} |x_i − y_i|
Cosine similarity (transformed to a distance by subtracting from 1): 1 − (x · y) / (|x| |y|), ranged between 0 and 1
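The three metrics above translate directly into code. A minimal sketch over plain tuples of floats:

```python
from math import sqrt

def l2(x, y):
    """Euclidean distance (L2 norm)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """City-block distance (L1 norm)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cosine similarity; 0 for parallel vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = sqrt(sum(a * a for a in x)), sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

x, y = (1.0, 0.0), (0.0, 1.0)
print(l2(x, y), l1(x, y), cosine_distance(x, y))  # -> 1.4142135623730951 2.0 1.0
```

Note how the two unit vectors are at L2 distance sqrt(2) but L1 distance 2, which is why attributes should share a common scale before mixing metrics.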
Measures of Cluster Similarity (1/9)
Especially for the bottom-up approaches
1. Single-link clustering
The similarity between two clusters is the similarity of the two closest objects in the clusters
Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
sim(ω_i, ω_j) = max_{x∈ω_i, y∈ω_j} sim(x, y)
Elongated clusters are achieved
cf. the minimal spanning tree
Measures of Cluster Similarity (2/9)
2. Complete-link clustering
The similarity between two clusters is the similarity of their two most dissimilar members
sim(ω_i, ω_j) = min_{x∈ω_i, y∈ω_j} sim(x, y)
Sphere-shaped clusters are achieved
Preferable for most IR and NLP applications
More sensitive to outliers
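Both linkage criteria can be plugged into the same naive HAC loop. A toy sketch (quadratic brute-force merging over 1-D sample points, not an efficient implementation):

```python
def hac(points, linkage, k=1):
    """Naive HAC: start from singletons, repeatedly merge the closest pair
    of clusters under the given linkage, until k clusters remain."""
    d = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    if linkage == 'single':        # distance of the two CLOSEST members
        cdist = lambda a, b: min(d(p, q) for p in a for q in b)
    else:                          # 'complete': the two most dissimilar members
        cdist = lambda a, b: max(d(p, q) for p in a for q in b)
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cdist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

pts = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,)]
print([sorted(c) for c in hac(pts, 'single', k=2)])
# -> [[(0.0,), (1.0,), (2.0,)], [(8.0,), (9.0,)]]
```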
Measures of Cluster Similarity (3/9)
(Figure: single link vs. complete link)
Measures of Cluster Similarity (4/9)
(figure)
Measures of Cluster Similarity (5/9)
3. Group-average agglomerative clustering
A compromise between single-link and complete-link clustering
The similarity between two clusters is the average similarity between members
If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity:
sim(x, y) = cos(x, y) = x · y / (|x| |y|) = x · y for length-normalized vectors
Measures of Cluster Similarity (6/9)
3. Group-average agglomerative clustering (cont.)
The average similarity SIM between vectors in a cluster ω_j is defined as
SIM(ω_j) = (1 / (|ω_j| (|ω_j| − 1))) Σ_{x∈ω_j} Σ_{y∈ω_j, y≠x} sim(x, y)
The sum of members in a cluster ω_j: s(ω_j) = Σ_{x∈ω_j} x
Express SIM(ω_j) in terms of s(ω_j):
s(ω_j) · s(ω_j) = Σ_{x∈ω_j} Σ_{y∈ω_j} x · y = |ω_j| (|ω_j| − 1) SIM(ω_j) + |ω_j|
⇒ SIM(ω_j) = (s(ω_j) · s(ω_j) − |ω_j|) / (|ω_j| (|ω_j| − 1))
(x, y: length-normalized vectors)
Measures of Cluster Similarity (7/9)
3. Group-average agglomerative clustering (cont.)
When merging two clusters ω_i and ω_j, the cluster sum vectors s(ω_i) and s(ω_j) are known in advance
s(ω_new) = s(ω_i) + s(ω_j), |ω_new| = |ω_i| + |ω_j|
The average similarity for their union will be
SIM(ω_new) = ((s(ω_i) + s(ω_j)) · (s(ω_i) + s(ω_j)) − (|ω_i| + |ω_j|)) / ((|ω_i| + |ω_j|)(|ω_i| + |ω_j| − 1))
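The sum-vector identity above is easy to verify numerically: brute-force averaging over all ordered pairs and the closed form via s · s must agree for length-normalized vectors. A small sketch with assumed sample vectors:

```python
from math import sqrt

def normalize(v):
    n = sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)

def sim_brute(cluster):
    """Average pairwise dot product over all (x, y) with x != y."""
    m = len(cluster)
    return sum(sum(a * b for a, b in zip(x, y))
               for x in cluster for y in cluster if x is not y) / (m * (m - 1))

def sim_from_sum(cluster):
    """Same quantity via the cluster sum vector: (s.s - m) / (m(m-1))."""
    m = len(cluster)
    s = [sum(coords) for coords in zip(*cluster)]     # s(omega) = sum of members
    return (sum(a * a for a in s) - m) / (m * (m - 1))

cluster = [normalize(v) for v in [(1.0, 2.0), (2.0, 1.0), (3.0, 1.0)]]
print(abs(sim_brute(cluster) - sim_from_sum(cluster)) < 1e-12)  # -> True
```

The trick works because each x · x term equals 1 for unit vectors, so s · s over-counts the pairwise sum by exactly |ω|.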
Measures of Cluster Similarity (8/9)
4. Centroid clustering
The similarity of two clusters is defined as the similarity of their centroids:
sim(ω_i, ω_j) = ((1/N_i) Σ_{x∈ω_i} x) · ((1/N_j) Σ_{y∈ω_j} y) = s(ω_i) · s(ω_j) / (N_i N_j)
Measures of Cluster Similarity (9/9)
Graphical summary of four cluster similarity measures (figure)
Example: Word Clustering
Words (objects) are described and clustered using a set of features and values
E.g., the left and right neighbors of tokens of words
Higher nodes: decreasing similarity
"be" has the least similarity with the other 21 words!
Divisive Clustering (1/2)
A top-down approach
Start with all objects in a single cluster
At each iteration, select the least coherent cluster and split it
Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved
The history of clustering forms a binary tree or hierarchy
Divisive Clustering (2/2)
To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
Single-link measure
Complete-link measure
Group-average measure
How to split a cluster
Splitting is also a clustering task (finding two sub-clusters)
Any clustering algorithm can be used for the splitting operation, e.g.,
Bottom-up (agglomerative) algorithms
Non-hierarchical clustering algorithms (e.g., K-means)
Divisive Clustering: Algorithm
At each step, split the least coherent cluster
Generate two new clusters and remove the original one
(c_u denotes a specific cluster here)
Hierarchical Document Organization (1/7)
Explore the probabilistic latent topical information: TMM/PLSA approach
Two-dimensional tree structure for organized topics:
P(w_i|D_j) = Σ_{k=1}^{K} P(T_k|D_j) Σ_{l=1}^{K} P(T_l|T_k) P(w_i|T_l)
P(T_l|T_k) = E(T_k, T_l) / Σ_{s=1}^{K} E(T_k, T_s)
E(T_k, T_l) = exp(−dist²(T_k, T_l) / (2σ²)), dist(T_k, T_l) = sqrt((x_k − x_l)² + (y_k − y_l)²)
Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map
When a cluster has many documents, we can further analyze it into another map on the next layer
Hierarchical Document Organization (2/7)
The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection:
L = Σ_{i=1}^{N} Σ_{j=1}^{J} c(w_j, D_i) log P(w_j|D_i)
  = Σ_{i=1}^{N} Σ_{j=1}^{J} c(w_j, D_i) log [Σ_{k=1}^{K} P(T_k|D_i) Σ_{l=1}^{K} P(T_l|T_k) P(w_j|T_l)]
EM training can be performed with the updating formulas:
P̂(w_j|T_k) = Σ_{i=1}^{N} c(w_j, D_i) P(T_k|w_j, D_i) / Σ_{m=1}^{J} Σ_{i=1}^{N} c(w_m, D_i) P(T_k|w_m, D_i)
P̂(T_k|D_i) = Σ_{j=1}^{J} c(w_j, D_i) P(T_k|w_j, D_i) / c(D_i)
where P(T_k|w_j, D_i) = P(T_k|D_i) Σ_{l=1}^{K} P(T_l|T_k) P(w_j|T_l) / Σ_{k'=1}^{K} P(T_k'|D_i) Σ_{l=1}^{K} P(T_l|T_k') P(w_j|T_l)
Hierarchical Document Organization (3/7)
Criterion for topic word selection:
S(w_j, T_k) = Σ_{i=1}^{N} c(w_j, D_i) P(T_k|D_i) / Σ_{i=1}^{N} c(w_j, D_i) [1 − P(T_k|D_i)]
Hierarchical Document Organization (4/7)
Example (figure)
Hierarchical Document Organization (5/7)
Example (cont., figure)
Hierarchical Document Organization (6/7)
Self-Organizing Map (SOM)
A recursive regression process
Mapping layer: weight vectors m_i = [m_{i,1}, m_{i,2}, …, m_{i,n}]^T
Input layer: input vector x = [x_1, x_2, …, x_n]^T
m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) − m_i(t)]
c(x) = arg min_i ‖x − m_i‖
h_{c(x),i}(t) = α(t) exp(−‖r_{c(x)} − r_i‖² / (2σ²(t)))
Hierarchical Document Organization (7/7)
Results:
Model | Iterations | dist_Between / dist_Within
TMM   | 10  | 1.965
TMM   | 20  | 2.0650
TMM   | 30  | 1.9477
TMM   | 40  | 1.975
SOM   | 100 | 2.0604
R = dist_Between / dist_Within, where
dist_Between = Σ_i Σ_j f_Between(D_i, D_j) / Σ_i Σ_j C_Between(D_i, D_j)
dist_Within = Σ_i Σ_j f_Within(D_i, D_j) / Σ_i Σ_j C_Within(D_i, D_j)
f_Between(D_i, D_j) = dist_Map(D_i, D_j) if T_{r,i} ≠ T_{r,j}, 0 otherwise
C_Between(D_i, D_j) = 1 if T_{r,i} ≠ T_{r,j}, 0 otherwise
f_Within(D_i, D_j) = dist_Map(D_i, D_j) if T_{r,i} = T_{r,j}, 0 otherwise
C_Within(D_i, D_j) = 1 if T_{r,i} = T_{r,j}, 0 otherwise
dist_Map(D_i, D_j) = sqrt((x_i − x_j)² + (y_i − y_j)²)
(T_{r,i} denotes the top-layer topic to which D_i is assigned; a larger ratio R indicates better separation between clusters on the map)