INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/


1 INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 22/26: Hierarchical Clustering
Paul Ginsparg, Cornell University, Ithaca, NY
17 Nov

2 Overview
1. Recap
2. Introduction to Hierarchical clustering

3 Outline
1. Recap
2. Introduction to Hierarchical clustering

4 Applications of clustering in IR

Application: Search result clustering
  What is clustered: search results
  Benefit: more effective information presentation to user

Application: Scatter-Gather
  What is clustered: (subsets of) collection
  Benefit: alternative user interface ("search without typing")

Application: Collection clustering
  What is clustered: collection
  Benefit: effective information presentation for exploratory browsing
  Example: McKeown et al. 2002, news.google.com

Application: Cluster-based retrieval
  What is clustered: collection
  Benefit: higher efficiency: faster search
  Example: Salton

5 K-means algorithm

K-means({x_1, ..., x_N}, K)
 1  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
 2  for k ← 1 to K
 3  do µ_k ← s_k
 4  while stopping criterion has not been met
 5  do for k ← 1 to K
 6     do ω_k ← {}
 7     for n ← 1 to N
 8     do j ← arg min_j' |µ_j' − x_n|
 9        ω_j ← ω_j ∪ {x_n}   (reassignment of vectors)
10     for k ← 1 to K
11     do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x   (recomputation of centroids)
12  return {µ_1, ..., µ_K}
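
As a concreteness check, here is a minimal runnable sketch of this loop in Python/NumPy. A fixed iteration count stands in for the stopping criterion, and all names are illustrative rather than taken from the slides:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    # X: (N, d) array of document vectors; K: number of clusters
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # SelectRandomSeeds
    for _ in range(iters):            # stopping criterion: fixed iteration count
        # reassignment of vectors: each x_n joins its closest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation of centroids: mean of each cluster's members
        for k in range(K):
            if np.any(assign == k):   # guard against an empty cluster
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign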

6 Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized. It is not very robust: it's easy to get a suboptimal clustering. Better heuristics:
- Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has good coverage of the document space)
- Use hierarchical clustering to find good seeds (next class)
- Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with lowest RSS (see the sketch below)
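
A hedged sketch of that last heuristic, reusing the kmeans function from the sketch above (rss, the sample X, and K = 4 are illustrative):

X = np.random.rand(100, 5)   # illustrative document vectors

def rss(X, mu, assign):
    # residual sum of squares: total squared distance of every vector
    # to the centroid it is assigned to
    return ((X - mu[assign]) ** 2).sum()

mu, assign = min((kmeans(X, K=4, seed=i) for i in range(10)),
                 key=lambda result: rss(X, *result))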

7 External criterion: Purity

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes. For each cluster ω_k: find the class c_j with the most members n_kj in ω_k. Sum all n_kj and divide by the total number of points N.
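
The same computation as a small Python function (the representation of clusterings as parallel label lists is an assumption for illustration):

from collections import Counter

def purity(omega, c):
    # omega: cluster id for each point; c: class label for each point
    members = {}
    for w, label in zip(omega, c):
        members.setdefault(w, []).append(label)
    # each cluster contributes the count of its most frequent class
    return sum(Counter(ms).most_common(1)[0][1]
               for ms in members.values()) / len(c)

# e.g. purity([1, 1, 1, 2, 2], ['x', 'x', 'y', 'y', 'y']) -> (2 + 2) / 5 = 0.8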

8 Discussion 6

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Usenix OSDI '04, papers/dean/dean.pdf

See also (Jan 2009): part of lectures on the "google technology stack" (including PageRank, etc.)

9 Some Questions

- Who are the authors? When was it written? When was the work started?
- What is the problem they were trying to solve?
- Is there a compiler that will automatically parallelize the most general program?
- How does the example in section 2.1 work?
- What are other examples of algorithms amenable to the MapReduce methodology?
- What's going on in Figure 1? What happens between the map and reduce steps?

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

10 Wordcount example

from a.txt: The quick brown fox jumped over the lazy grey dogs.
b.txt: That's one small step for a man, one giant leap for mankind.
c.txt: Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go.

11 Map

mapper('a.txt', i['a.txt']) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

import string   # needed for maketrans / punctuation

def mapper(input_key, input_value):
    # emit (word, 1) for every word in the lowercased, punctuation-free value
    return [(word, 1) for word in remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    # Python 2 idiom: translate with a deletechars argument
    return s.translate(string.maketrans("", ""), string.punctuation)

12 Output of the map phase

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]

13 Combine gives

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
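
This grouping step (the shuffle between map and reduce) can be sketched as a small helper; group_values is an illustrative name, not part of the original example:

def group_values(pairs):
    # shuffle step: collect every value emitted for the same intermediate key
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups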

14 Output of the reduce phase

def reducer(intermediate_key, intermediate_value_list):
    # collapse each key's list of counts into a single total
    return (intermediate_key, sum(intermediate_value_list))

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
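
Putting the pieces together, a single-machine driver with the same shape as the map(k1,v1) → list(k2,v2), reduce(k2,list(v2)) flow above; map_reduce is an illustrative helper, and i is the filename-to-contents dictionary from the mapper example:

def map_reduce(i, mapper, reducer):
    # i: dict mapping input keys (filenames) to input values (contents)
    intermediate = []
    for key, value in i.items():
        intermediate.extend(mapper(key, value))   # map phase
    groups = group_values(intermediate)           # shuffle phase
    return [reducer(key, values) for key, values in groups.items()]  # reduce phase

# e.g. map_reduce({'a.txt': ..., 'b.txt': ..., 'c.txt': ...}, mapper, reducer)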

15 PageRank example, P_jk = A_jk / d_j

Input (key, value) to MapReduce:
- key = id j of the webpage
- value contains data describing the page: current r_j, out-degree d_j, and a list [k_1, k_2, ..., k_{d_j}] of pages to which it links

For each of the latter pages k_a, a = 1, ..., d_j, the mapper outputs an intermediate key-value pair (k_a, r_j/d_j), where r_j/d_j is the contribution to the PageRank of page k_a from page j: it combines the probability r_j of starting at page j with the probability 1/d_j of the random websurfer moving from j to k_a.

Between the map and reduce phases, MapReduce collects all intermediate values corresponding to any given intermediate key k (the list of all probabilities of moving to page k). The reducer sums up these probabilities, outputting the result as the second entry in the pair (k, r_k), giving the entries of rP = r, as desired.
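
A hedged sketch of that mapper and reducer (damping/teleportation and dangling pages are ignored, and the names are illustrative):

def pr_mapper(j, page):
    # page = (r_j, d_j, links): current rank, out-degree, outlink list
    r, d, links = page
    return [(k, r / d) for k in links]    # contribution r_j/d_j to each k_a

def pr_reducer(k, contributions):
    # new r_k: sum of the probabilities of arriving at page k
    return (k, sum(contributions))

With pages a dict {j: (r_j, d_j, links)}, one iteration of r ← rP is then map_reduce(pages, pr_mapper, pr_reducer) using the driver sketched earlier.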

16 k-means clustering, e.g., Netflix data

Goal: find similar movies from ratings provided by users.

Vector model:
- Give each movie a vector
- Make one dimension per user
- Put origin at average rating (so poor is negative)
- Normalize all vectors to unit length (cosine similarity)

Issues:
- Users are biased in the movies they rate
+ Addresses different numbers of raters
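
One possible reading of this vector model in NumPy; the slide does not say whether the average rating is global or per-movie, so a global average is assumed here, and the ratings-matrix layout is illustrative:

import numpy as np

def movie_vectors(R):
    # R: (movies, users) ratings matrix, NaN where a user did not rate a movie
    V = R - np.nanmean(R)                 # put the origin at the average rating
    V = np.nan_to_num(V)                  # unrated entries sit at the origin
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(norms == 0, 1.0, norms)   # unit length for cosine similarity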

17 k-means clustering

Goal: cluster similar data points.

Approach: given data points and a distance function d,
- select k centroids µ_a
- assign each x_i to its closest centroid µ_a
- minimize Σ_{a,i} d(x_i, µ_a)

Algorithm:
1. randomly pick centroids, possibly from the data points
2. assign points to closest centroid
3. average assigned points to obtain new centroids
4. repeat 2, 3 until nothing changes

Issues:
- takes superpolynomial time on some inputs
- not guaranteed to find optimal solution
+ converges quickly in practice

18 Iterative MapReduce (from ) [figure]

19 Outline
1. Recap
2. Introduction to Hierarchical clustering

20 Hierarchical clustering

Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:

TOP
├─ regions: Kenya, China, UK, France
└─ industries: coffee, poultry, oil & gas

We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.

21 Hierarchical agglomerative clustering (HAC)

HAC creates a hierarchy in the form of a binary tree. It assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at four different cluster similarity measures.

22 Hierarchical agglomerative clustering (HAC)

- Start with each document in a separate cluster
- Then repeatedly merge the two clusters that are most similar
- Until there is only one cluster

The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram.

23 A dendrogram

[figure: dendrogram over 30 Reuters headlines, from "Ag trade reform." through "Fed keeps interest rates steady"]

The history of mergers can be read off from left to right. The vertical line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.

24 Divisive clustering

Divisive clustering is top-down: an alternative to HAC (which is bottom-up).
- Start with all docs in one big cluster
- Then recursively split clusters
- Eventually each node forms a cluster on its own

Bisecting K-means at the end. For now: HAC (= bottom-up).

25 Naive HAC algorithm

SimpleHAC(d_1, ..., d_N)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i] ← Sim(d_n, d_i)
 4     I[n] ← 1   (keeps track of active clusters)
 5  A ← []   (collects clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7  do ⟨i,m⟩ ← arg max_{⟨i,m⟩: i≠m ∧ I[i]=1 ∧ I[m]=1} C[i][m]
 8     A.Append(⟨i,m⟩)   (store merge)
 9     for j ← 1 to N
10     do   (use i as representative for ⟨i,m⟩)
11        C[i][j] ← Sim(⟨i,m⟩, j)
12        C[j][i] ← Sim(⟨i,m⟩, j)
13     I[m] ← 0   (deactivate cluster)
14  return A
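
A runnable NumPy sketch of SimpleHAC, using the single-link rule C[i][j] ← max(C[i][j], C[m][j]) as the similarity of the merged cluster (the four alternatives are defined on the next slide); S is an assumed precomputed similarity matrix:

import numpy as np

def simple_hac(S):
    # S: (N, N) precomputed pairwise document similarity matrix
    N = len(S)
    C = S.astype(float)
    I = np.ones(N, dtype=bool)        # active-cluster flags
    A = []                            # clustering as a sequence of merges
    for _ in range(N - 1):
        M = np.where(np.outer(I, I), C, -np.inf)   # consider only active pairs
        np.fill_diagonal(M, -np.inf)               # exclude i == m
        i, m = np.unravel_index(M.argmax(), M.shape)
        A.append((i, m))              # store merge; i represents <i, m>
        C[i, :] = np.maximum(C[i, :], C[m, :])     # single-link Sim update
        C[:, i] = C[i, :]
        I[m] = False                  # deactivate cluster m
    return A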

26 Computational complexity of the naive algorithm

First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations:
- We scan the O(N × N) similarities to find the maximum similarity.
- We merge the two clusters with maximum similarity.
- We compute the similarity of the new cluster with all other (surviving) clusters.

There are O(N) iterations, each performing an O(N × N) scan operation. Overall complexity is O(N^3). We'll look at more efficient algorithms later.

27 Key question: How to define cluster similarity

- Single-link: maximum similarity. Maximum similarity of any two documents.
- Complete-link: minimum similarity. Minimum similarity of any two documents.
- Centroid: average intersimilarity. Average similarity of all document pairs (but excluding pairs of docs in the same cluster). This is equivalent to the similarity of the centroids.
- Group-average: average intrasimilarity. Average similarity of all document pairs, including pairs of docs in the same cluster.
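
These criteria correspond closely to linkage methods in SciPy's hierarchical clustering module. Note that scipy.cluster.hierarchy.linkage works with distances rather than similarities (so "single" is a minimum-distance, i.e. maximum-similarity, rule), and its "average" method averages only between-cluster pairs (UPGMA), a close relative of the group-average criterion above. The data and cut level here are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 10)              # 30 illustrative document vectors
for method in ("single", "complete", "centroid", "average"):
    Z = linkage(X, method=method)       # merge history, like A in SimpleHAC
    flat = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 flat clusters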

28 Cluster similarity: Example [figure]

29 Single-link: Maximum similarity [figure]

30 Complete-link: Minimum similarity [figure]

31 Centroid: Average intersimilarity [figure]
intersimilarity = similarity of two documents in different clusters

32 Group average: Average intrasimilarity [figure]
intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster

33 Cluster similarity: Larger Example [figure]

34 Single-link: Maximum similarity [figure]

35 Complete-link: Minimum similarity [figure]

36 Centroid: Average intersimilarity [figure]

37 Group average: Average intrasimilarity [figure]
