Clustering Techniques for Information Retrieval

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

References:
1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Chapters 16 & 17)
2. Modern Information Retrieval, Chapters 5 & 7
3. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Jeff A. Bilmes, U.C. Berkeley TR-97-021

Clustering

- Place similar objects in the same group and assign dissimilar objects to different groups (typically using a distance measure, such as Euclidean distance)
  - Word clustering
    - Neighbor overlap: words occur with similar left and right neighbors (such as "in" and "on")
  - Document clustering
    - Documents with similar topics or concepts are put together
- But clustering cannot give a comprehensive description of the object
  - How to label objects shown on the visual display is a difficult problem

Clustering vs. Classification

- Classification is supervised and requires a set of labeled training instances for each group (class)
  - Learning with a teacher
- Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
  - Also called automatic or unsupervised classification

Types of Clustering Algorithms

- Two types of structures produced by clustering algorithms
  - Flat or non-hierarchical clustering
  - Hierarchical clustering
- Flat clustering
  - Simply consists of a certain number of clusters, and the relation between clusters is often undetermined
  - Measurement: construction error minimization or probabilistic optimization
- Hierarchical clustering
  - A hierarchy with the usual interpretation that each node stands for a subclass of its mother's node
    - The leaves of the tree are the single objects
    - Each node represents the cluster that contains all the objects of its descendants
  - Measurement: similarities of instances

Hard Assignment vs. Soft Assignment (1/2)

- Another important distinction between clustering algorithms is whether they perform soft or hard assignment
- Hard Assignment
  - Each object (or document in the context of IR) is assigned to one and only one cluster
- Soft Assignment (probabilistic approach)
  - Each object may be assigned to multiple clusters
  - An object x_i has a probability distribution P(.|x_i) over clusters c_j, where P(c_j|x_i) is the probability that x_i is a member of c_j
  - Is somewhat more appropriate in many tasks such as NLP, IR, ...

Hard Assignment vs. Soft Assignment (2/2)

- Hierarchical clustering usually adopts hard assignment
- While in flat clustering, both types of assignments are common

Summarized Attributes of Clustering Algorithms (1/2)

- Hierarchical Clustering
  - Preferable for detailed data analysis
  - Provides more information than flat clustering
  - No single best algorithm (each of the algorithms is only optimal for some applications)
  - Less efficient than flat clustering (minimally have to compute an n x n matrix of similarity coefficients)

Summarized Attributes of Clustering Algorithms (2/2)

- Flat Clustering
  - Preferable if efficiency is a consideration or data sets are very large
  - K-means is a conceptually simple and feasible method and should probably be used first on a new data set, because its results are often sufficient
  - K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  - The EM algorithm is then the method of choice: it can accommodate definitions of clusters and allocations of objects based on complex probabilistic models
    - Its extensions can be used to handle topological/hierarchical orders of samples
      - E.g., Probabilistic Latent Semantic Analysis (PLSA)

Some Applications of Clustering in IR (1/5)

- Cluster Hypothesis (for IR): Documents in the same cluster behave similarly with respect to relevance to information needs
- Possible applications of clustering in IR
  - These possible applications differ in
    - The collection of documents to be clustered
    - The aspect of the IR system to be improved

Some Applications of Clustering in IR (2/5)

1. Whole corpus analysis/navigation
   - Better user interface: users often prefer browsing over searching, because they are unsure about which search terms to use
   - E.g., the scatter-gather approach (for a collection of New York Times)

Some Applications of Clustering in IR (3/5)

2. Improve recall in search applications
   - Achieve better search results by
     - Alleviating the term-mismatch (synonym) problem facing the vector space model
     - Estimating the collection model of the language modeling (LM) retrieval approach more accurately:

       P(Q | M_D) = prod_{i=1..N} [ lambda * P(w_i | M_D) + (1 - lambda) * P(w_i | M_C) ]

     - The collection model M_C can be estimated from the cluster the document D belongs to, instead of the entire collection
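The interpolated query-likelihood score above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation; the function name `query_likelihood` and the toy document/background values are hypothetical, and `background` stands in for the collection (or cluster) model M_C:

```python
from collections import Counter

def query_likelihood(query, doc, background, lam=0.5):
    """P(Q|M_D) = prod_i [ lam * P(w_i|M_D) + (1 - lam) * P(w_i|M_C) ]."""
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 1.0
    for w in query:
        p_doc = doc_tf[w] / doc_len      # ML estimate of P(w|M_D)
        p_bg = background.get(w, 0.0)    # P(w|M_C): collection or cluster model
        score *= lam * p_doc + (1 - lam) * p_bg
    return score

# Toy example: the background model could instead be estimated from the
# cluster containing the document, as the slide suggests.
doc = ["car", "insurance", "car", "repair"]
background = {"car": 0.2, "insurance": 0.1, "auto": 0.05}
s = query_likelihood(["car", "insurance"], doc, background)
```

With lam = 0.5, the score here is (0.5*0.5 + 0.5*0.2) * (0.5*0.25 + 0.5*0.1) = 0.35 * 0.175.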

Some Applications of Clustering in IR (4/5)

3. Better navigation of search results
   - Result set clustering
   - Effective "user recall" will be higher
   - E.g., http://clusty.com

Some Applications of Clustering in IR (5/5)

4. Speed up the search process
   - For retrieval models using exhaustive matching (computing the similarity of the query to every document) without efficient inverted index supports
     - E.g., latent semantic analysis (LSA), language modeling (LM)
   - Solution: cluster-based retrieval
     - First find the clusters that are closest to the query and then only consider documents from these clusters

Evaluation of Clustering (1/2)

- Internal criterion for the quality of a clustering result
  - The typical objective is to attain
    - High intra-cluster similarity (documents within a cluster are similar)
    - Low inter-cluster similarity (documents from different clusters are dissimilar)
  - The measured quality depends on both the document representation and the similarity measure used
  - Good scores on an internal criterion do not necessarily translate into good effectiveness in an application

Evaluation of Clustering (2/2)

- External criterion for the quality of a clustering result
  - Evaluate how well the clustering matches the gold standard classes produced by human judges
  - That is, the quality is measured by the ability of the clustering algorithm to discover some or all of the hidden patterns or latent (true) classes
- Two common criteria
  - Purity
  - Rand Index (RI)

Purity (1/2)

- Each cluster is first assigned to the class which is most frequent in the cluster
- Then, the accuracy of the assignment is measured by counting the number of correctly assigned documents and dividing by the sample size

  Purity(Omega, Gamma) = (1/N) * sum_k max_j |omega_k ∩ c_j|

  where Omega = {omega_1, omega_2, ..., omega_K} is the set of clusters, Gamma = {c_1, c_2, ..., c_J} is the set of classes, and N is the sample size

- Example (three clusters whose majority classes contribute 5, 4, and 3 members among 17 documents):

  Purity(Omega, Gamma) = (1/17) * (5 + 4 + 3) ≈ 0.71
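The purity computation can be sketched directly from the definition. This is a minimal illustration; the concrete `labels`/`clusters` layout below is a hypothetical arrangement chosen to match the counts in the slide's 17-document example (majority classes of sizes 5, 4, and 3):

```python
from collections import Counter

def purity(clusters, labels):
    """Purity(Omega, Gamma): assign each cluster to its majority class,
    then divide the number of correctly assigned documents by N."""
    n = sum(len(c) for c in clusters)
    correct = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                  for c in clusters)
    return correct / n

# Hypothetical layout matching the slide's counts:
# cluster 1: 5 x + 1 o, cluster 2: 1 x + 4 o + 1 d, cluster 3: 2 x + 3 d
labels = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
clusters = [list(range(0, 6)), list(range(6, 12)), list(range(12, 17))]
p = purity(clusters, labels)  # (5 + 4 + 3) / 17
```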

Purity (2/2)

- High purity is easy to achieve for a large number of clusters (!)
  - Purity will be 1 if each document gets its own cluster
- Therefore, purity cannot be used to trade off the quality of the clustering against the number of clusters

Rand Index (1/3)

- Measures the similarity between the clusters and the classes in ground truth
  - Consider the assignments of all possible N(N-1)/2 pairs of N distinct documents in the clustering and the true classes

                                        Same cluster in clustering | Different clusters in clustering
  Same class in ground truth            TP (True Positive)         | FN (False Negative)
  Different classes in ground truth     FP (False Positive)        | TN (True Negative)

  RI = (TP + TN) / (TP + FP + FN + TN)

Rand Index (2/3)

- Worked example (the same 17 documents as in the purity example; cluster sizes 6, 6, 5):

  All positive pairs: TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40
  TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
  => FP = 40 - 20 = 20
  FN = 24, TN = 72

  All pairs: N(N-1)/2 = 17 * 16 / 2 = 136

  RI = (TP + TN) / (all pairs) = (20 + 72) / 136 ≈ 0.68
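The pair counting above can be verified mechanically. This sketch enumerates all N(N-1)/2 pairs; the `labels`/`clusters` layout is the same hypothetical arrangement matching the slide's example counts:

```python
from itertools import combinations

def rand_index(clusters, labels):
    """RI = (TP + TN) / (TP + FP + FN + TN) over all N(N-1)/2 pairs."""
    cluster_of = {i: k for k, c in enumerate(clusters) for i in c}
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = cluster_of[i] == cluster_of[j]
        same_class = labels[i] == labels[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn), (tp, fp, fn, tn)

# Same 17-document example as for purity
labels = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
clusters = [list(range(0, 6)), list(range(6, 12)), list(range(12, 17))]
ri, counts = rand_index(clusters, labels)  # counts = (TP, FP, FN, TN)
```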

Rand Index (3/3)

- The Rand index has a value between 0 and 1
  - 0 indicates that the clusters and the classes in ground truth do not agree on any pair of points (documents)
  - 1 indicates that the clusters and the classes in ground truth are exactly the same

F-Measure Based on Rand Index

- F-Measure: harmonic mean of precision (P) and recall (R)

  P = TP / (TP + FP),  R = TP / (TP + FN)

  F_b = (b^2 + 1) * P * R / (b^2 * P + R)

- If we want to penalize false negatives (FN) more strongly than false positives (FP), then we can set b > 1 (separating similar documents is sometimes worse than putting dissimilar documents in the same cluster)
  - That is, giving more weight to recall (R)
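A small sketch of F_b on the pair counts from the Rand-index example (TP = 20, FP = 20, FN = 24); the function name is hypothetical:

```python
def f_measure(tp, fp, fn, b=1.0):
    """F_b = (b^2 + 1) P R / (b^2 P + R), the weighted harmonic mean of
    precision P = TP/(TP+FP) and recall R = TP/(TP+FN)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (b * b + 1) * p * r / (b * b * p + r)

f1 = f_measure(20, 20, 24)           # balanced: P = 0.5, R = 20/44
f5 = f_measure(20, 20, 24, b=5.0)    # b > 1 weights recall more heavily
```

Since recall is lower than precision on this example, increasing b pulls the score toward the (lower) recall.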

Normalized Mutual Information (NMI)

- NMI is an information-theoretical measure

  NMI(Omega; C) = I(Omega; C) / [ (H(Omega) + H(C)) / 2 ]

  I(Omega; C) = sum_k sum_j P(omega_k ∩ c_j) * log [ P(omega_k ∩ c_j) / (P(omega_k) * P(c_j)) ]
              = sum_k sum_j (|omega_k ∩ c_j| / N) * log [ N * |omega_k ∩ c_j| / (|omega_k| * |c_j|) ]   (ML estimates)

  H(Omega) = - sum_k P(omega_k) * log P(omega_k) = - sum_k (|omega_k| / N) * log (|omega_k| / N)   (ML estimate)

- NMI will have a value between 0 and 1
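The NMI formula translates directly into code using counts as ML estimates. A minimal sketch (the function name is hypothetical; the `labels`/`clusters` layout is the same hypothetical arrangement used for the purity example, and the base of the logarithm cancels in the ratio):

```python
from collections import Counter
from math import log

def nmi(clusters, labels):
    """NMI(Omega; C) = I(Omega; C) / ((H(Omega) + H(C)) / 2),
    with all probabilities replaced by relative frequencies."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(labels)
    mi = h_omega = 0.0
    for c in clusters:
        p_w = len(c) / n
        h_omega -= p_w * log(p_w)
        for cls, joint in Counter(labels[i] for i in c).items():
            # joint/n = P(omega ∩ c); the log argument is N|omega ∩ c|/(|omega||c|)
            mi += (joint / n) * log(joint * n / (len(c) * class_sizes[cls]))
    h_c = -sum((m / n) * log(m / n) for m in class_sizes.values())
    return mi / ((h_omega + h_c) / 2)

labels = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
clusters = [list(range(0, 6)), list(range(6, 12)), list(range(12, 17))]
v = nmi(clusters, labels)
```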

Summary of External Evaluation Measures

Flat Clustering

Flat Clustering

- Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)
- Problems associated with non-hierarchical clustering
  - When to stop?
  - What is the right number of clusters (cluster cardinality)?
    - E.g., judged by group-average similarity, likelihood, mutual information
  - Hierarchical clustering also has to face this problem
- Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm (1/10)

- Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Define clusters by the center of mass of their members
  - Objects (e.g., documents) should be represented in vector form
- The K-means algorithm also can be regarded as
  - A kind of vector quantization
    - Map from a continuous space (high resolution) to a discrete space (low resolution)
    - E.g., color quantization: 24 bits/pixel (16 million colors) => 8 bits/pixel (256 colors), a compression rate of 3
  - X = {x^t}_{t=1..N} => index; m_j, j = 1..k: cluster centroid or reference vector (code word, code vector)

The K-means Algorithm (2/10)

- Total reconstruction error (RSS: residual sum of squares):

  E({m_j} | X) = sum_t sum_j b_j^t * ||x^t - m_j||^2,

  where the "automatic label" b_j^t = 1 if ||x^t - m_j|| = min_l ||x^t - m_l||, and b_j^t = 0 otherwise

- b_j^t and m_j are unknown in advance
- b_j^t depends on m_j, and this optimization problem cannot be solved analytically

The K-means Algorithm (3/10)

- Initialization
  - A set of initial cluster centers {m_j} is needed
- Recursion
  - Assign each object x^t to the cluster whose center is closest:

    b_j^t = 1 if ||x^t - m_j|| = min_l ||x^t - m_l||, 0 otherwise

  - Then, re-compute the center of each cluster as the centroid or mean (average) of its members:

    m_j = (sum_t b_j^t * x^t) / (sum_t b_j^t)

  - Using the medoid as the cluster center? (a medoid is one of the objects in the cluster that is closest to the centroid)
- These two steps are repeated until m_j stabilizes

The K-means Algorithm (4/10)

- Algorithm
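The two alternating steps of the recursion can be sketched in plain Python. This is a minimal illustration of the algorithm on toy 2-D points, not the lecture's pseudocode; the function name and the random-seed initialization are assumptions:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: assign each point to its nearest center (the hard
    labels b_j^t), then recompute each center as the mean of its members."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    rng = random.Random(seed)
    centers = rng.sample(points, k)       # seeds: k randomly chosen objects
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                  # assignment step
            j = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[j].append(p)
        new_centers = [                   # re-estimation step (centroids)
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:        # m_j has stabilized
            break
        centers = new_centers
    return centers, clusters

# Two well-separated 2-D blobs of three points each
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(pts, k=2)
```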

The K-means Algorithm (5/10)

- Example

The K-means Algorithm (6/10)

- Example: document clusters labeled government, finance, sports, research, name

The K-means Algorithm (7/10)

- Complexity: O(IKNM)
  - I: iterations; K: cluster number; N: object number; M: object dimensionality
- Choice of initial cluster centers (seeds) is important
  - Pick at random
  - Or, calculate the mean m of all data and generate k initial centers m_j by adding small random vectors to the mean (m ± delta)
  - Or, project data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as an initial center m_j
  - Or, use another method such as a hierarchical clustering algorithm on a subset of the objects
    - E.g., the buckshot algorithm uses group-average agglomerative clustering on a random sample of the data whose size is the square root of the complete set

The K-means Algorithm (8/10)

- Poor seeds will result in sub-optimal clustering

The K-means Algorithm (9/10)

- How to break ties when there are several centers with the same distance from an object
  - E.g., randomly assign the object to one of the candidate clusters (or assign the object to the cluster with the lowest index)
  - Or, perturb objects slightly
- Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    - Map from the original space to an l-dimensional space/hypercube, l = log(k) (k clusters)
    - Nodes on the hypercube: a linear classifier

The K-means Algorithm (10/10)

- E.g., the LBG algorithm (by Linde, Buzo, and Gray)
  - M => 2M clusters at each iteration: starting from the global mean, each cluster mean {mu_j, Sigma_j, omega_j} is split into two, and the clusters are re-estimated
  - Total reconstruction error (residual sum of squares):

    E({m_j} | X) = sum_t sum_j b_j^t * ||x^t - m_j||^2

The EM Algorithm (1/3)

- EM (Expectation-Maximization) algorithm
  - A kind of model-based clustering
  - Also can be viewed as a generalization of K-means
  - Each cluster is a "model" for generating the data
    - The centroid is a good representative for each model
    - Generating an object (e.g., document) consists of first picking a centroid at random and then adding some noise
      - If the noise is normally distributed, the procedure will result in clusters of spherical shape
- Physical models for EM
  - Discrete: mixture of multinomial distributions
  - Continuous: mixture of Gaussian distributions

The EM Algorithm (2/3)

- EM is a soft version of K-means
  - Each object could be the member of multiple clusters
- Clustering as estimating a mixture of (continuous) probability distributions
- Likelihood function for the data samples X = {x_1, x_2, ..., x_n}, assumed independent, identically distributed (i.i.d.):

  P(X | Theta) = prod_{i=1..n} P(x_i | Theta) = prod_{i=1..n} sum_{k=1..K} P(x_i | omega_k; Theta) * P(omega_k | Theta)

- Continuous case (a mixture of Gaussians):

  P(x_i | omega_k; Theta) = (2*pi)^(-m/2) * |Sigma_k|^(-1/2) * exp( -(1/2) * (x_i - mu_k)^T * Sigma_k^(-1) * (x_i - mu_k) )

- Classification:

  max_k P(omega_k | x_i, Theta) = max_k [ P(x_i | omega_k; Theta) * P(omega_k | Theta) / P(x_i | Theta) ]
                                = max_k P(x_i | omega_k; Theta) * P(omega_k | Theta)

The EM Algorithm (3/3)

Maximum Likelihood Estimation (MLE) (1/2)

- Hard assignment
  - E.g., for cluster omega_1 containing four observations (two B's and two W's):

    P(B | omega_1) = 2/4 = 0.5
    P(W | omega_1) = 2/4 = 0.5

Maximum Likelihood Estimation (2/2)

- Soft assignment: four observations (B, W, B, W) with memberships 0.7, 0.4, 0.9, 0.5 in state omega_1 and 0.3, 0.6, 0.1, 0.5 in state omega_2

  P(omega_1) = (0.7 + 0.4 + 0.9 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5 + 0.3 + 0.6 + 0.1 + 0.5) = 2.5/4 = 0.625
  P(omega_2) = 1 - P(omega_1) = 0.375

  P(B | omega_1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
  P(W | omega_1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36

  P(B | omega_2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 ≈ 0.27
  P(W | omega_2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 ≈ 0.73
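The soft-count arithmetic above can be checked in a few lines. A minimal sketch, assuming the observation sequence B, W, B, W with the memberships shown on the slide:

```python
# Memberships of the four observations in cluster omega_1;
# observation i contributes g to omega_1 and (1 - g) to omega_2.
g1 = [0.7, 0.4, 0.9, 0.5]
obs = ["B", "W", "B", "W"]

p_w1 = sum(g1) / len(g1)                                        # P(omega_1)
p_b_w1 = sum(g for g, o in zip(g1, obs) if o == "B") / sum(g1)  # P(B|omega_1)
p_w_w1 = sum(g for g, o in zip(g1, obs) if o == "W") / sum(g1)  # P(W|omega_1)

g2 = [1 - g for g in g1]                                        # omega_2 counts
p_b_w2 = sum(g for g, o in zip(g2, obs) if o == "B") / sum(g2)  # P(B|omega_2)
```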

Expectation-Maximization Updating Formulas (1/3)

- Expectation: compute the likelihood that each cluster generates a document vector x_i

  gamma_k^(i) = P(x_i | omega_k, Theta) * P(omega_k | Theta) / sum_{l=1..K} P(x_i | omega_l, Theta) * P(omega_l | Theta)

Expectation-Maximization Updating Formulas (2/3)

- Maximization
  - Mixture weight:

    P_hat(omega_k) = (sum_{i=1..n} gamma_k^(i)) / n

  - Mean of Gaussian:

    mu_hat_k = (sum_{i=1..n} gamma_k^(i) * x_i) / (sum_{i=1..n} gamma_k^(i))

Expectation-Maximization Updating Formulas (3/3)

- Covariance matrix of Gaussian:

  Sigma_hat_k = (sum_{i=1..n} gamma_k^(i) * (x_i - mu_hat_k)(x_i - mu_hat_k)^T) / (sum_{i=1..n} gamma_k^(i))
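The three updating formulas can be sketched for the 1-D case, where the covariance matrix reduces to a scalar variance. This is a minimal illustration on toy data, not the lecture's implementation; the function names and the two-blob example are assumptions:

```python
from math import exp, pi, sqrt

def gaussian(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_step(xs, weights, means, variances):
    """One E-step + M-step for a 1-D Gaussian mixture, following the
    updating formulas above (gamma = posterior cluster membership)."""
    n, k = len(xs), len(means)
    # E-step: gamma[i][j] = P(omega_j | x_i)
    gamma = []
    for x in xs:
        num = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
        z = sum(num)
        gamma.append([v / z for v in num])
    # M-step: re-estimate mixture weights, means, and variances
    nj = [sum(gamma[i][j] for i in range(n)) for j in range(k)]
    weights = [nj[j] / n for j in range(k)]
    means = [sum(gamma[i][j] * xs[i] for i in range(n)) / nj[j] for j in range(k)]
    variances = [sum(gamma[i][j] * (xs[i] - means[j]) ** 2 for i in range(n)) / nj[j]
                 for j in range(k)]
    return weights, means, variances

# Two clearly separated groups of points on the real line
xs = [0.0, 0.2, 0.1, 4.0, 4.2, 3.9]
w, m, v = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
for _ in range(50):
    w, m, v = em_step(xs, w, m, v)
```

After a few iterations the two component means settle near the two group centers (0.1 and about 4.03), with roughly equal mixture weights.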

More Facts about the EM Algorithm

- The initial cluster distributions can be estimated using the K-means algorithm, which EM can then "soften up"
- The procedure terminates when the likelihood function P(X | Theta) has converged or the maximum number of iterations is reached

Hierarchical Clustering

Hierarchical Clustering

- Can be done in either bottom-up or top-down manner
- Bottom-up (agglomerative, 凝集的)
  - Start with individual objects and group the most similar ones
    - E.g., with the minimum distance apart: sim(x, y) = 1 / (1 + dist(x, y))  (distance measures will be discussed later on)
  - The procedure terminates when one cluster containing all objects has been formed
- Top-down (divisive, 分裂的)
  - Start with all objects in a group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

- A bottom-up approach
- Assume a similarity measure for determining the similarity of two objects
- Start with all objects in separate clusters (singletons) and then repeatedly join the two clusters that have the most similarity, until only one cluster survives
- The history of merging/clustering forms a binary tree or hierarchy

HAC: Algorithm

- Initialization (for tree leaves): each object is a cluster
- At each iteration, the two most similar clusters are merged as a new cluster, and the original two clusters are removed
  (c_i denotes a specific cluster here)
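The merge loop above can be sketched naively in Python. This is a minimal illustration, not the lecture's pseudocode: the cluster-similarity function is passed in as a parameter (here a single-link measure on 1-D points, using negative distance as similarity), and all names are assumptions:

```python
def hac(items, sim):
    """Naive HAC: start with singletons, repeatedly merge the two most
    similar clusters, and record the merge history (a binary tree)."""
    clusters = [[i] for i in range(len(items))]
    history = []
    while len(clusters) > 1:
        best = None                       # (similarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim(clusters[a], clusters[b], items)
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return history

def single_link(ca, cb, items):
    """Similarity of the two closest members (negative 1-D distance)."""
    return max(-abs(items[i] - items[j]) for i in ca for j in cb)

hist = hac([0.0, 0.3, 2.0, 2.2], single_link)
```

On these four points the first merge joins the closest pair (2.0 and 2.2), then 0.0 and 0.3, and finally the two remaining clusters.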

Distance Metrics

- Euclidean distance (L2 norm):

  L2(x, y) = sqrt( sum_{i=1..m} (x_i - y_i)^2 )

  - Make sure that all attributes/dimensions have the same scale (or the same variance)
- L1 norm (city-block distance):

  L1(x, y) = sum_{i=1..m} |x_i - y_i|

- Cosine similarity (transformed to a distance by subtracting from 1):

  1 - (x . y) / (||x|| * ||y||)

  - ranges between 0 and 2
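The three metrics translate directly into code. A minimal sketch (function names are assumptions):

```python
from math import sqrt

def l2(x, y):
    """Euclidean distance (L2 norm)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """City-block distance (L1 norm)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cosine similarity; ranges between 0 and 2."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

d = cosine_distance((1.0, 0.0), (-1.0, 0.0))  # opposite directions
```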

Measures of Cluster Similarity (1/9)

- Especially for the bottom-up approaches

1. Single-link clustering
  - The similarity between two clusters is the similarity of the two closest objects in the clusters:

    sim(omega_i, omega_j) = max_{x in omega_i, y in omega_j} sim(x, y)   (greatest similarity)

  - Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
  - Elongated clusters are achieved (cf. the minimal spanning tree)

Measures of Cluster Similarity (2/9)

2. Complete-link clustering
  - The similarity between two clusters is the similarity of their two most dissimilar members:

    sim(omega_i, omega_j) = min_{x in omega_i, y in omega_j} sim(x, y)   (least similarity)

  - Sphere-shaped clusters are achieved
  - Preferable for most IR and NLP applications
  - More sensitive to outliers

Measures of Cluster Similarity (3/9)

- (figure: single link vs. complete link)

Measures of Cluster Similarity (4/9)

Measures of Cluster Similarity (5/9)

3. Group-average agglomerative clustering
  - A compromise between single-link and complete-link clustering
  - The similarity between two clusters is the average similarity between members
  - If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity:

    sim(x, y) = cos(x, y) = (x . y) / (||x|| * ||y||) = x . y   (for length-normalized vectors)

Measures of Cluster Similarity (6/9)

3. Group-average agglomerative clustering (cont.)
  - The average similarity SIM between vectors in a cluster omega_j is defined as:

    SIM(omega_j) = (1 / (|omega_j| * (|omega_j| - 1))) * sum_{x in omega_j} sum_{y in omega_j, y != x} sim(x, y)

  - The sum of the members in a cluster omega_j: s(omega_j) = sum_{x in omega_j} x
  - Express SIM(omega_j) in terms of s(omega_j):

    s(omega_j) . s(omega_j) = sum_{x in omega_j} x . s(omega_j)
                            = |omega_j| * (|omega_j| - 1) * SIM(omega_j) + sum_{x in omega_j} x . x
                            = |omega_j| * (|omega_j| - 1) * SIM(omega_j) + |omega_j|

    => SIM(omega_j) = (s(omega_j) . s(omega_j) - |omega_j|) / (|omega_j| * (|omega_j| - 1))

Measures of Cluster Similarity (7/9)

3. Group-average agglomerative clustering (cont.)
  - As merging two clusters omega_i and omega_j, the cluster sum vectors s(omega_i) and s(omega_j) are known in advance:

    s(omega_new) = s(omega_i) + s(omega_j),  |omega_new| = |omega_i| + |omega_j|

  - The average similarity for the union will be:

    SIM(omega_i ∪ omega_j) = ( (s(omega_i) + s(omega_j)) . (s(omega_i) + s(omega_j)) - (|omega_i| + |omega_j|) )
                             / ( (|omega_i| + |omega_j|) * (|omega_i| + |omega_j| - 1) )
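The sum-vector identity behind the fast algorithm is easy to verify numerically: computing SIM from s(omega) must agree with the brute-force average over all distinct member pairs. A minimal sketch (function names and the toy cluster are assumptions):

```python
from itertools import product
from math import sqrt

def normalize(v):
    n = sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)

def sim_fast(cluster):
    """SIM(omega) = (s . s - |omega|) / (|omega| * (|omega| - 1)),
    where s is the sum of the length-normalized member vectors."""
    s = tuple(sum(coords) for coords in zip(*cluster))
    m = len(cluster)
    return (sum(a * a for a in s) - m) / (m * (m - 1))

def sim_brute(cluster):
    """Average pairwise cosine similarity over distinct member pairs."""
    m = len(cluster)
    tot = sum(sum(a * b for a, b in zip(x, y))
              for x, y in product(cluster, repeat=2) if x is not y)
    return tot / (m * (m - 1))

cluster = [normalize(v) for v in [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0)]]
t_fast, t_brute = sim_fast(cluster), sim_brute(cluster)
```

The agreement follows from s . s = sum_x sum_y x . y, which splits into the m(m-1) distinct pairs plus the m self-similarities x . x = 1.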

Measures of Cluster Similarity (8/9)

4. Centroid clustering
  - The similarity of two clusters is defined as the similarity of their centroids:

    sim(omega_i, omega_j) = mu(omega_i) . mu(omega_j)
                          = ( (1/N_i) * sum_{x in omega_i} x ) . ( (1/N_j) * sum_{y in omega_j} y )
                          = (1 / (N_i * N_j)) * sum_{x in omega_i} sum_{y in omega_j} x . y

Measures of Cluster Similarity (9/9)

- Graphical summary of four cluster similarity measures

Example: Word Clustering

- Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words
- Higher nodes: decreasing similarity
- "be" has least similarity with the other words!

Divisive Clustering (1/2)

- A top-down approach
- Start with all objects in a single cluster
- At each iteration, select the least coherent cluster and split it
- Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved
- The history of clustering forms a binary tree or hierarchy

Divisive Clustering (2/2)

- To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure
- How to split a cluster
  - This is also a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    - Bottom-up (agglomerative) algorithms
    - Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering: Algorithm

- At each iteration, split the least coherent cluster
- Generate two new clusters and remove the original one
  (c_u denotes a specific cluster here)

Hierarchical Document Organization (1/7)

- Explore the probabilistic latent topical information
  - TMM/PLSA approach
- Two-dimensional tree structure for organized topics:

  P(w_i | D_j) = sum_{k=1..K} P(T_k | D_j) * sum_{l=1..K} P(T_l | T_k) * P(w_i | T_l)

  P(T_l | T_k) = E(T_k, T_l) / sum_{s=1..K} E(T_k, T_s)

  E(T_k, T_l) = (1 / (sqrt(2*pi) * sigma)) * exp( -dist(T_k, T_l)^2 / (2 * sigma^2) )

  dist(T_k, T_l) = sqrt( (x_k - x_l)^2 + (y_k - y_l)^2 )

- Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
- Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map
- When a cluster has many documents, we can further analyze it into another map on the next layer

Hierarchical Document Organization (2/7)

- The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection:

  L_T = sum_{j=1..N} sum_{i=1..J} c(w_i, D_j) * log P(w_i | D_j)

- EM training can be performed:

  P_hat(w_i | T_k) = sum_{j=1..N} c(w_i, D_j) * P(T_k | w_i, D_j)
                     / sum_{m=1..J} sum_{j=1..N} c(w_m, D_j) * P(T_k | w_m, D_j)

  P_hat(T_k | D_j) = sum_{i=1..J} c(w_i, D_j) * P(T_k | w_i, D_j) / c(D_j)

  where

  P(T_k | w_i, D_j) = [ sum_{l=1..K} P(w_i | T_l) * P(T_l | T_k) ] * P(T_k | D_j)
                      / sum_{k'=1..K} [ sum_{l=1..K} P(w_i | T_l) * P(T_l | T_k') ] * P(T_k' | D_j)

Hierarchical Document Organization (3/7)

- Criterion for topic word selection:

  S(w_i, T_k) = ( sum_{j=1..N} c(w_i, D_j) * P(T_k | D_j) ) / ( sum_{j=1..N} c(w_i, D_j) * [1 - P(T_k | D_j)] )

Hierarchical Document Organization (4/7)

- Example

Hierarchical Document Organization (5/7)

- Example (cont.)

Hierarchical Document Organization (6/7)

- Self-Organizing Map (SOM)
  - A recursive regression process
  - Mapping layer: weight vectors m_i = [m_{i1}, m_{i2}, ..., m_{in}]^T
  - Input layer: input vector x = [x_1, x_2, ..., x_n]^T

  m_i(t+1) = m_i(t) + h_{c(x),i}(t) * [x(t) - m_i(t)]

  where c(x) = argmin_i || x - m_i ||

  h_{c(x),i}(t) = alpha(t) * exp( - || r_{c(x)} - r_i ||^2 / (2 * sigma^2(t)) )

  (r_i denotes the position of unit i on the map; alpha(t) is the learning rate)
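One SOM update step can be sketched directly from the recursion above. This is a minimal illustration on a tiny 1-D map with 2-D weight vectors; the function name, map layout, and parameter values are assumptions:

```python
from math import exp

def som_step(weights, coords, x, alpha, sigma):
    """One SOM update: find the best-matching unit c(x), then move every
    unit toward x with neighborhood strength h_{c(x),i}."""
    # c(x) = argmin_i ||x - m_i||
    c = min(range(len(weights)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))
    new_weights = []
    for i, m in enumerate(weights):
        # neighborhood kernel on the map coordinates r_{c(x)}, r_i
        d2 = sum((a - b) ** 2 for a, b in zip(coords[c], coords[i]))
        h = alpha * exp(-d2 / (2 * sigma ** 2))
        new_weights.append(tuple(mj + h * (xj - mj) for mj, xj in zip(m, x)))
    return new_weights, c

# A 1-D map of three units with 2-D weight vectors
weights = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
coords = [(0.0,), (1.0,), (2.0,)]   # positions of the units on the map
weights, c = som_step(weights, coords, x=(2.1, 1.9), alpha=0.5, sigma=1.0)
```

The input (2.1, 1.9) is closest to the third unit, which therefore moves halfway toward it; its map neighbors move less, in proportion to the kernel h.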

Hierarchical Document Organization (7/7)

- Results: TMM and SOM are compared over training iterations by the ratio

  R = dist_Between / dist_Within

  where dist_Between is the average map distance between topic pairs belonging to different classes, and dist_Within is the average map distance between topic pairs belonging to the same class:

  dist_Map(T_i, T_j) = sqrt( (x_i - x_j)^2 + (y_i - y_j)^2 )

  dist_Between = sum_{i,j} C_Between(i, j) * dist_Map(T_i, T_j) / sum_{i,j} C_Between(i, j),
    where C_Between(i, j) = 1 if T_i and T_j belong to different classes, 0 otherwise

  dist_Within = sum_{i,j} C_Within(i, j) * dist_Map(T_i, T_j) / sum_{i,j} C_Within(i, j),
    where C_Within(i, j) = 1 if T_i and T_j belong to the same class, 0 otherwise