INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/


1 INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 22/26: Hierarchical Clustering
Paul Ginsparg, Cornell University, Ithaca, NY
17 Nov

2 Overview
1. Recap
2. Introduction to Hierarchical clustering

3 Outline
1. Recap
2. Introduction to Hierarchical clustering

4 Applications of clustering in IR

Application: Search result clustering
  What is clustered: search results
  Benefit: more effective information presentation to user

Application: Scatter-Gather
  What is clustered: (subsets of) collection
  Benefit: alternative user interface ("search without typing")

Application: Collection clustering
  What is clustered: collection
  Benefit: effective information presentation for exploratory browsing
  Example: McKeown et al. 2002, news.google.com

Application: Cluster-based retrieval
  What is clustered: collection
  Benefit: higher efficiency: faster search
  Example: Salton

5 K-means algorithm

K-means({x_1, ..., x_N}, K)
 1  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
 2  for k ← 1 to K
 3  do µ_k ← s_k
 4  while stopping criterion has not been met
 5  do for k ← 1 to K
 6     do ω_k ← {}
 7     for n ← 1 to N
 8     do j ← arg min_j' |µ_j' − x_n|
 9        ω_j ← ω_j ∪ {x_n}   (reassignment of vectors)
10     for k ← 1 to K
11     do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x   (recomputation of centroids)
12  return {µ_1, ..., µ_K}
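
As a concreteness check, here is a minimal runnable sketch of this loop in Python/NumPy. A fixed iteration count stands in for the stopping criterion, and all names are illustrative rather than taken from the slides:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    # X: (N, d) array of document vectors; K: number of clusters
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # SelectRandomSeeds
    for _ in range(iters):            # stopping criterion: fixed iteration count
        # reassignment of vectors: each x_n joins its closest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation of centroids: mean of each cluster's members
        for k in range(K):
            if np.any(assign == k):   # guard against an empty cluster
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign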

6 Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized. It is not very robust: it's easy to get a suboptimal clustering. Better heuristics:
- Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has good coverage of the document space)
- Use hierarchical clustering to find good seeds (next class)
- Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with lowest RSS (see the sketch below)
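
A hedged sketch of that last heuristic, reusing the kmeans function from the sketch above (rss, the sample X, and K = 4 are illustrative):

X = np.random.rand(100, 5)   # illustrative document vectors

def rss(X, mu, assign):
    # residual sum of squares: total squared distance of every vector
    # to the centroid it is assigned to
    return ((X - mu[assign]) ** 2).sum()

mu, assign = min((kmeans(X, K=4, seed=i) for i in range(10)),
                 key=lambda result: rss(X, *result))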

7 External criterion: Purity

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes. For each cluster ω_k: find the class c_j with the most members n_kj in ω_k. Sum all n_kj and divide by the total number of points N.
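
The same computation as a small Python function (the representation of clusterings as parallel label lists is an assumption for illustration):

from collections import Counter

def purity(omega, c):
    # omega: cluster id for each point; c: class label for each point
    members = {}
    for w, label in zip(omega, c):
        members.setdefault(w, []).append(label)
    # each cluster contributes the count of its most frequent class
    return sum(Counter(ms).most_common(1)[0][1]
               for ms in members.values()) / len(c)

# e.g. purity([1, 1, 1, 2, 2], ['x', 'x', 'y', 'y', 'y']) -> (2 + 2) / 5 = 0.8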

8 Discussion 6

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Usenix OSDI '04, papers/dean/dean.pdf

See also (Jan 2009): part of lectures on the "google technology stack" (including PageRank, etc.)

9 Some Questions

- Who are the authors? When was it written? When was the work started?
- What is the problem they were trying to solve?
- Is there a compiler that will automatically parallelize the most general program?
- How does the example in section 2.1 work?
- What are other examples of algorithms amenable to the MapReduce methodology?
- What's going on in Figure 1? What happens between the map and reduce steps?

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

10 Wordcount example

from a.txt: The quick brown fox jumped over the lazy grey dogs.
b.txt: That's one small step for a man, one giant leap for mankind.
c.txt: Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go.

11 Map

mapper('a.txt', i['a.txt']) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

import string   # needed for maketrans / punctuation

def mapper(input_key, input_value):
    # emit (word, 1) for every word in the lowercased, punctuation-free value
    return [(word, 1) for word in remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    # Python 2 idiom: translate with a deletechars argument
    return s.translate(string.maketrans("", ""), string.punctuation)

12 Output of the map phase

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]

13 Combine gives

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
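
This grouping step (the shuffle between map and reduce) can be sketched as a small helper; group_values is an illustrative name, not part of the original example:

def group_values(pairs):
    # shuffle step: collect every value emitted for the same intermediate key
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups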

14 Output of the reduce phase

def reducer(intermediate_key, intermediate_value_list):
    # collapse each key's list of counts into a single total
    return (intermediate_key, sum(intermediate_value_list))

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
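
Putting the pieces together, a single-machine driver with the same shape as the map(k1,v1) → list(k2,v2), reduce(k2,list(v2)) flow above; map_reduce is an illustrative helper, and i is the filename-to-contents dictionary from the mapper example:

def map_reduce(i, mapper, reducer):
    # i: dict mapping input keys (filenames) to input values (contents)
    intermediate = []
    for key, value in i.items():
        intermediate.extend(mapper(key, value))   # map phase
    groups = group_values(intermediate)           # shuffle phase
    return [reducer(key, values) for key, values in groups.items()]  # reduce phase

# e.g. map_reduce({'a.txt': ..., 'b.txt': ..., 'c.txt': ...}, mapper, reducer)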

15 PageRank example, P_jk = A_jk / d_j

Input (key, value) to MapReduce:
- key = id j of the webpage
- value contains data describing the page: current r_j, out-degree d_j, and a list [k_1, k_2, ..., k_{d_j}] of pages to which it links

For each of the latter pages k_a, a = 1, ..., d_j, the mapper outputs an intermediate key-value pair (k_a, r_j/d_j), where r_j/d_j is the contribution to the PageRank of page k_a from page j: it combines the probability r_j of starting at page j with the probability 1/d_j of the random websurfer moving from j to k_a.

Between the map and reduce phases, MapReduce collects all intermediate values corresponding to any given intermediate key k (the list of all probabilities of moving to page k). The reducer sums up these probabilities, outputting the result as the second entry in the pair (k, r_k), giving the entries of rP = r, as desired.
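
A hedged sketch of that mapper and reducer (damping/teleportation and dangling pages are ignored, and the names are illustrative):

def pr_mapper(j, page):
    # page = (r_j, d_j, links): current rank, out-degree, outlink list
    r, d, links = page
    return [(k, r / d) for k in links]    # contribution r_j/d_j to each k_a

def pr_reducer(k, contributions):
    # new r_k: sum of the probabilities of arriving at page k
    return (k, sum(contributions))

With pages a dict {j: (r_j, d_j, links)}, one iteration of r ← rP is then map_reduce(pages, pr_mapper, pr_reducer) using the driver sketched earlier.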

16 k-means clustering, e.g., Netflix data

Goal: find similar movies from ratings provided by users.

Vector model:
- Give each movie a vector
- Make one dimension per user
- Put origin at average rating (so poor is negative)
- Normalize all vectors to unit length (cosine similarity)

Issues:
- Users are biased in the movies they rate
+ Addresses different numbers of raters
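
One possible reading of this vector model in NumPy; the slide does not say whether the average rating is global or per-movie, so a global average is assumed here, and the ratings-matrix layout is illustrative:

import numpy as np

def movie_vectors(R):
    # R: (movies, users) ratings matrix, NaN where a user did not rate a movie
    V = R - np.nanmean(R)                 # put the origin at the average rating
    V = np.nan_to_num(V)                  # unrated entries sit at the origin
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(norms == 0, 1.0, norms)   # unit length for cosine similarity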

17 k-means clustering

Goal: cluster similar data points.

Approach: given data points and a distance function d,
- select k centroids µ_a
- assign each x_i to its closest centroid µ_a
- minimize Σ_{a,i} d(x_i, µ_a)

Algorithm:
1. randomly pick centroids, possibly from the data points
2. assign points to closest centroid
3. average assigned points to obtain new centroids
4. repeat 2, 3 until nothing changes

Issues:
- takes superpolynomial time on some inputs
- not guaranteed to find optimal solution
+ converges quickly in practice

18 Iterative MapReduce (from ) [figure]

19 Outline
1. Recap
2. Introduction to Hierarchical clustering

20 Hierarchical clustering

Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:

TOP
├─ regions: Kenya, China, UK, France
└─ industries: coffee, poultry, oil & gas

We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.

21 Hierarchical agglomerative clustering (HAC)

HAC creates a hierarchy in the form of a binary tree. It assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at four different cluster similarity measures.

22 Hierarchical agglomerative clustering (HAC)

- Start with each document in a separate cluster
- Then repeatedly merge the two clusters that are most similar
- Until there is only one cluster

The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram.

23 A dendrogram

[figure: dendrogram over 30 Reuters headlines, from "Ag trade reform." through "Fed keeps interest rates steady"]

The history of mergers can be read off from left to right. The vertical line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.

24 Divisive clustering

Divisive clustering is top-down: an alternative to HAC (which is bottom-up).
- Start with all docs in one big cluster
- Then recursively split clusters
- Eventually each node forms a cluster on its own

Bisecting K-means at the end. For now: HAC (= bottom-up).

25 Naive HAC algorithm

SimpleHAC(d_1, ..., d_N)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i] ← Sim(d_n, d_i)
 4     I[n] ← 1   (keeps track of active clusters)
 5  A ← []   (collects clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7  do ⟨i,m⟩ ← arg max_{⟨i,m⟩: i≠m ∧ I[i]=1 ∧ I[m]=1} C[i][m]
 8     A.Append(⟨i,m⟩)   (store merge)
 9     for j ← 1 to N
10     do   (use i as representative for ⟨i,m⟩)
11        C[i][j] ← Sim(⟨i,m⟩, j)
12        C[j][i] ← Sim(⟨i,m⟩, j)
13     I[m] ← 0   (deactivate cluster)
14  return A
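
A runnable NumPy sketch of SimpleHAC, using the single-link rule C[i][j] ← max(C[i][j], C[m][j]) as the similarity of the merged cluster (the four alternatives are defined on the next slide); S is an assumed precomputed similarity matrix:

import numpy as np

def simple_hac(S):
    # S: (N, N) precomputed pairwise document similarity matrix
    N = len(S)
    C = S.astype(float)
    I = np.ones(N, dtype=bool)        # active-cluster flags
    A = []                            # clustering as a sequence of merges
    for _ in range(N - 1):
        M = np.where(np.outer(I, I), C, -np.inf)   # consider only active pairs
        np.fill_diagonal(M, -np.inf)               # exclude i == m
        i, m = np.unravel_index(M.argmax(), M.shape)
        A.append((i, m))              # store merge; i represents <i, m>
        C[i, :] = np.maximum(C[i, :], C[m, :])     # single-link Sim update
        C[:, i] = C[i, :]
        I[m] = False                  # deactivate cluster m
    return A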

26 Computational complexity of the naive algorithm

First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations:
- We scan the O(N × N) similarities to find the maximum similarity.
- We merge the two clusters with maximum similarity.
- We compute the similarity of the new cluster with all other (surviving) clusters.

There are O(N) iterations, each performing an O(N × N) scan operation. Overall complexity is O(N^3). We'll look at more efficient algorithms later.

27 Key question: How to define cluster similarity

- Single-link: maximum similarity. Maximum similarity of any two documents.
- Complete-link: minimum similarity. Minimum similarity of any two documents.
- Centroid: average intersimilarity. Average similarity of all document pairs (but excluding pairs of docs in the same cluster). This is equivalent to the similarity of the centroids.
- Group-average: average intrasimilarity. Average similarity of all document pairs, including pairs of docs in the same cluster.
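
These criteria correspond closely to linkage methods in SciPy's hierarchical clustering module. Note that scipy.cluster.hierarchy.linkage works with distances rather than similarities (so "single" is a minimum-distance, i.e. maximum-similarity, rule), and its "average" method averages only between-cluster pairs (UPGMA), a close relative of the group-average criterion above. The data and cut level here are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 10)              # 30 illustrative document vectors
for method in ("single", "complete", "centroid", "average"):
    Z = linkage(X, method=method)       # merge history, like A in SimpleHAC
    flat = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 flat clusters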

28 Cluster similarity: Example [figure]

29 Single-link: Maximum similarity [figure]

30 Complete-link: Minimum similarity [figure]

31 Centroid: Average intersimilarity [figure]
intersimilarity = similarity of two documents in different clusters

32 Group average: Average intrasimilarity [figure]
intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster

33 Cluster similarity: Larger Example [figure]

34 Single-link: Maximum similarity [figure]

35 Complete-link: Minimum similarity [figure]

36 Centroid: Average intersimilarity [figure]

37 Group average: Average intrasimilarity [figure]
