INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
1 INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/. IR 22/26: Hierarchical Clustering. Paul Ginsparg, Cornell University, Ithaca, NY, 17 Nov. 1/ 37
2 Overview 1 Recap 2 Introduction to Hierarchical clustering 2/ 37
3 Outline 1 Recap 2 Introduction to Hierarchical clustering 3/ 37
4 Applications of clustering in IR

Application               | What is clustered?       | Benefit                                                      | Example
Search result clustering  | search results           | more effective information presentation to user             |
Scatter-Gather            | (subsets of) collection  | alternative user interface: "search without typing"         |
Collection clustering     | collection               | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com
Cluster-based retrieval   | collection               | higher efficiency: faster search                             | Salton

4/ 37
5 K-means algorithm

K-means({x_1, ..., x_N}, K)
 1  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
 2  for k ← 1 to K
 3  do µ_k ← s_k
 4  while stopping criterion has not been met
 5  do for k ← 1 to K
 6     do ω_k ← {}
 7     for n ← 1 to N
 8     do j ← arg min_{j'} |µ_{j'} − x_n|
 9        ω_j ← ω_j ∪ {x_n}   (reassignment of vectors)
10     for k ← 1 to K
11     do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x   (recomputation of centroids)
12  return {µ_1, ..., µ_K}

5/ 37
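In runnable form, the K-means slide above can be sketched as follows (a minimal Python version; function and variable names are illustrative, not from the slides):

```python
import random

def kmeans(xs, k, iters=100, seed=0):
    """K-means on points given as tuples of numbers."""
    rng = random.Random(seed)
    centroids = rng.sample(xs, k)                       # SelectRandomSeeds
    for _ in range(iters):
        # Reassignment: attach each vector to its closest centroid.
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
            clusters[j].append(x)
        # Recomputation: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                            # stopping criterion: no change
            break
        centroids = new
    return centroids
```

With two well-separated pairs of points and k = 2, the centroids converge to the two pair means within a few iterations.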
6 Initialization of K-means Random seed selection is just one of many ways K-means can be initialized. Random seed selection is not very robust: it's easy to get a suboptimal clustering. Better heuristics: Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has good coverage of the document space) Use hierarchical clustering to find good seeds (next class) Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS 6/ 37
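The last heuristic (select the clustering with lowest RSS among i seedings) might be sketched like this; `rss` and `pick_lowest_rss` are illustrative names, not slide terminology:

```python
def rss(points, centroids):
    # Residual sum of squares: each point's squared distance to its
    # closest centroid, summed over all points (the K-means objective).
    return sum(min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids)
               for x in points)

def pick_lowest_rss(points, candidate_centroid_sets):
    # Given the centroids from i different random seedings,
    # keep the clustering with the lowest RSS.
    return min(candidate_centroid_sets, key=lambda cs: rss(points, cs))
```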
7 External criterion: Purity

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes. For each cluster ω_k: find the class c_j with the most members n_kj in ω_k. Sum all n_kj and divide by the total number of points. 7/ 37
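As code, the purity computation is short (a sketch; here each cluster is given directly as a list of gold class labels):

```python
from collections import Counter

def purity(clusters):
    # clusters: one list of gold class labels per cluster ω_k.
    # For each cluster, count its most frequent class; sum these counts
    # and divide by the total number of points N.
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n
```

For example, three clusters whose majority classes contribute 5, 4, and 3 points out of 17 give purity 12/17 ≈ 0.71.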
8 Discussion 6: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", USENIX OSDI '04, papers/dean/dean.pdf. See also (Jan 2009): part of lectures on the "google technology stack" (including PageRank, etc.) 8/ 37
9 Some Questions Who are the authors? When was it written? When was the work started? What is the problem they were trying to solve? Is there a compiler that will automatically parallelize the most general program? How does the example in section 2.1 work? What are other examples of algorithms amenable to the map/reduce methodology? What's going on in Figure 1? What happens between the map and reduce steps?

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

9/ 37
10 Wordcount example
a.txt: "The quick brown fox jumped over the lazy grey dogs."
b.txt: "That's one small step for a man, one giant leap for mankind."
c.txt: "Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go."
10/ 37
11 Map
mapper('a.txt', i['a.txt']) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

def mapper(input_key, input_value):
    return [(word, 1) for word in remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans('', ''), string.punctuation)

11/ 37
12 Output of the map phase
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]
12/ 37
13 Combine gives
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
13/ 37
14 Output of the reduce phase

def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]

14/ 37
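The whole pipeline (map, the grouping that happens between the phases, then reduce) can be glued together in a few lines. This is a Python 3 rendering of the slides' snippets (in Python 3, `str.maketrans` takes the characters to delete as a third argument); `shuffle` and `word_count` are illustrative names:

```python
import string
from collections import defaultdict

def mapper(input_key, input_value):
    # Emit (word, 1) for every word, lowercased and with punctuation stripped.
    table = str.maketrans('', '', string.punctuation)
    return [(w, 1) for w in input_value.lower().translate(table).split()]

def shuffle(pairs):
    # What MapReduce does between the map and reduce phases:
    # group all intermediate values by their intermediate key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))

def word_count(files):
    # files: dict of filename -> file contents.
    intermediate = [kv for name, text in files.items() for kv in mapper(name, text)]
    return dict(reducer(k, vs) for k, vs in shuffle(intermediate).items())
```

Running it on the three example files reproduces the reduce-phase output above, e.g. 'the' counted 3 times and 'one' counted twice.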
15 PageRank example, P_jk = A_jk / d_j
Input (key, value) to MapReduce: key = id j of the webpage; value contains data describing the page: current r_j, out-degree d_j, and a list [k_1, k_2, ..., k_{d_j}] of pages to which it links.
For each of the latter pages k_a, a = 1, ..., d_j, the mapper outputs an intermediate key-value pair (k_a, r_j/d_j) (where r_j/d_j is the contribution to the PageRank of page k_a from page j, and corresponds to a random websurfer moving from j to k_a: it combines the probability r_j of starting at page j with the probability 1/d_j of moving from j to k_a).
Between the map and reduce phases, MapReduce collects all intermediate values corresponding to any given intermediate key k (the list of all probabilities of moving to page k). The reducer sums up the probabilities, outputting the result as the second entry in the pair (k, r_k), giving the entries of r P = r, as desired. 15/ 37
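One such map+reduce round might look like this in Python (a sketch under the slide's simplifications: no teleportation and no handling of dangling pages; `pagerank_round` is an illustrative name):

```python
from collections import defaultdict

def pagerank_round(pages):
    """One map+reduce round of r <- r P.
    pages: dict mapping page id j to (r_j, [k_1, ..., k_dj])."""
    # Map: page j emits (k_a, r_j / d_j) for each page k_a it links to.
    intermediate = []
    for j, (r_j, links) in pages.items():
        d_j = len(links)
        for k_a in links:
            intermediate.append((k_a, r_j / d_j))
    # Shuffle: collect all contributions addressed to the same page k.
    contributions = defaultdict(list)
    for k, p in intermediate:
        contributions[k].append(p)
    # Reduce: sum the probabilities to obtain the new r_k.
    return {k: sum(ps) for k, ps in contributions.items()}
```

For a three-page graph where pages 2 and 3 both link only to page 1, and page 1 links to both of them, one round starting from the uniform vector gives r_1 = 2/3 and r_2 = r_3 = 1/6.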
16 k-means clustering, e.g., Netflix data
Goal: find similar movies from ratings provided by users.
Vector model: give each movie a vector; make one dimension per user; put the origin at the average rating (so "poor" is negative); normalize all vectors to unit length (cosine similarity).
Issues:
- users are biased in the movies they rate
+ addresses different numbers of raters
16/ 37
17 k-means clustering
Goal: cluster similar data points.
Approach: given data points and a distance function, select k centroids µ_a; assign each x_i to its closest centroid µ_a; minimize Σ_{a,i} d(x_i, µ_a).
Algorithm:
1. randomly pick centroids, possibly from the data points
2. assign points to the closest centroid
3. average the assigned points to obtain new centroids
4. repeat steps 2 and 3 until nothing changes
Issues:
- takes superpolynomial time on some inputs
- not guaranteed to find an optimal solution
+ converges quickly in practice
17/ 37
18 Iterative MapReduce (from ) 18/ 37
19 Outline 1 Recap 2 Introduction to Hierarchical clustering 19/ 37
20 Hierarchical clustering
Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:
TOP
  regions: Kenya, China, UK, France
  industries: coffee, poultry, oil & gas
We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering. 20/ 37
21 Hierarchical agglomerative clustering (HAC) HAC creates a hierarchy in the form of a binary tree. It assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at four different cluster similarity measures. 21/ 37
22 Hierarchical agglomerative clustering (HAC) Start with each document in a separate cluster Then repeatedly merge the two clusters that are most similar Until there is only one cluster The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram. 22/ 37
23 A dendrogram
[Figure: dendrogram of 30 Reuters news stories, with leaves such as "Ag trade reform.", "Back to school spending is up", "Lloyd's CEO questioned", ..., "Fed keeps interest rates steady".]
The history of mergers can be read off from left to right. The vertical line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering. 23/ 37
24 Divisive clustering Divisive clustering is top-down. Alternative to HAC (which is bottom up). Divisive clustering: Start with all docs in one big cluster Then recursively split clusters Eventually each node forms a cluster on its own. Bisecting K-means at the end For now: HAC (= bottom-up) 24/ 37
25 Naive HAC algorithm

SimpleHAC(d_1, ..., d_N)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i] ← Sim(d_n, d_i)
 4     I[n] ← 1   (keeps track of active clusters)
 5  A ← []   (collects clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7  do ⟨i, m⟩ ← arg max_{⟨i,m⟩: i ≠ m, I[i]=1, I[m]=1} C[i][m]
 8     A.Append(⟨i, m⟩)   (store merge)
 9     for j ← 1 to N
10     do   (use i as representative for ⟨i, m⟩)
11        C[i][j] ← Sim(⟨i, m⟩, j)
12        C[j][i] ← Sim(⟨i, m⟩, j)
13     I[m] ← 0   (deactivate cluster)
14  return A

25/ 37
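The pseudocode above leaves Sim(⟨i, m⟩, j) abstract; as a runnable sketch, here is the same algorithm with the single-link choice Sim(⟨i, m⟩, j) = max(C[i][j], C[m][j]) plugged in (an assumption for illustration, not the slides' only option):

```python
def simple_hac(C):
    """Naive O(N^3) HAC over an N x N document similarity matrix C,
    using the single-link similarity for merged clusters.
    Returns the merge sequence A as a list of (i, m) pairs."""
    N = len(C)
    C = [row[:] for row in C]          # work on a copy
    I = [1] * N                        # active-cluster flags
    A = []                             # merge history
    for _ in range(N - 1):
        # Scan all pairs of distinct active clusters for maximum similarity.
        i, m = max(((a, b) for a in range(N) for b in range(N)
                    if a != b and I[a] and I[b]),
                   key=lambda p: C[p[0]][p[1]])
        A.append((i, m))               # store merge; i represents <i, m>
        for j in range(N):
            C[i][j] = C[j][i] = max(C[i][j], C[m][j])
        I[m] = 0                       # deactivate cluster m
    return A
```

On four documents where 0/1 and 2/3 are the similar pairs, the merge sequence is (0,1), then (2,3), then the two merged clusters.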
26 Computational complexity of the naive algorithm First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations: we scan the O(N × N) similarities to find the maximum similarity; we merge the two clusters with maximum similarity; and we compute the similarity of the new cluster with all other (surviving) clusters. There are O(N) iterations, each performing an O(N × N) scan operation. Overall complexity is O(N³). We'll look at more efficient algorithms later. 26/ 37
27 Key question: How to define cluster similarity
Single-link (maximum similarity): maximum similarity of any two documents.
Complete-link (minimum similarity): minimum similarity of any two documents.
Centroid (average inter-similarity): average similarity of all document pairs (but excluding pairs of docs in the same cluster). This is equivalent to the similarity of the centroids.
Group-average (average intra-similarity): average similarity of all document pairs, including pairs of docs in the same cluster.
27/ 37
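The four measures are easy to state as code. A sketch, assuming documents are vectors and similarity is the dot product (as for length-normalized vectors in the cosine model); function names are illustrative:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def single_link(A, B):      # maximum similarity of any cross pair
    return max(dot(a, b) for a in A for b in B)

def complete_link(A, B):    # minimum similarity of any cross pair
    return min(dot(a, b) for a in A for b in B)

def centroid_sim(A, B):     # similarity of the two centroids
    cA = [sum(col) / len(A) for col in zip(*A)]
    cB = [sum(col) / len(B) for col in zip(*B)]
    return dot(cA, cB)

def group_average(A, B):    # average over all pairs in the merged cluster,
    docs = list(A) + list(B)  # including same-cluster pairs, self-pairs excluded
    n = len(docs)
    total = sum(dot(x, y) for i, x in enumerate(docs)
                for j, y in enumerate(docs) if i != j)
    return total / (n * (n - 1))
```

Note that `centroid_sim` equals the average of the cross-pair similarities, which is exactly the equivalence the slide states.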
28 Cluster similarity: Example [figure] 28/ 37
29 Single-link: Maximum similarity [figure] 29/ 37
30 Complete-link: Minimum similarity [figure] 30/ 37
31 Centroid: Average intersimilarity (intersimilarity = similarity of two documents in different clusters) [figure] 31/ 37
32 Group average: Average intrasimilarity (intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster) [figure] 32/ 37
33 Cluster similarity: Larger Example [figure] 33/ 37
34 Single-link: Maximum similarity [figure] 34/ 37
35 Complete-link: Minimum similarity [figure] 35/ 37
36 Centroid: Average intersimilarity [figure] 36/ 37
37 Group average: Average intrasimilarity [figure] 37/ 37
More informationLearning Decision Trees
Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2
More informationCORRECTNESS OF A GOSSIP BASED MEMBERSHIP PROTOCOL BY (ANDRÉ ALLAVENA, ALAN DEMERS, JOHN E. HOPCROFT ) PRATIK TIMALSENA UNIVERSITY OF OSLO
CORRECTNESS OF A GOSSIP BASED MEMBERSHIP PROTOCOL BY (ANDRÉ ALLAVENA, ALAN DEMERS, JOHN E. HOPCROFT ) PRATIK TIMALSENA UNIVERSITY OF OSLO OUTLINE q Contribution of the paper q Gossip algorithm q The corrected
More informationIrreversibility. Have you ever seen this happen? (when you weren t asleep or on medication) Which stage never happens?
Lecture 5: Statistical Processes Random Walk and Particle Diffusion Counting and Probability Microstates and Macrostates The meaning of equilibrium 0.10 0.08 Reading: Elements Ch. 5 Probability (N 1, N
More informationMITOCW ocw f99-lec23_300k
MITOCW ocw-18.06-f99-lec23_300k -- and lift-off on differential equations. So, this section is about how to solve a system of first order, first derivative, constant coefficient linear equations. And if
More informationClustering analysis of vegetation data
Clustering analysis of vegetation data Valentin Gjorgjioski 1, Sašo Dzeroski 1 and Matt White 2 1 Jožef Stefan Institute Jamova cesta 39, SI-1000 Ljubljana Slovenia 2 Arthur Rylah Institute for Environmental
More informationMarkov Models. CS 188: Artificial Intelligence Fall Example. Mini-Forward Algorithm. Stationary Distributions.
CS 88: Artificial Intelligence Fall 27 Lecture 2: HMMs /6/27 Markov Models A Markov model is a chain-structured BN Each node is identically distributed (stationarity) Value of X at a given time is called
More informationExploratory Hierarchical Clustering for Management Zone Delineation in Precision Agriculture
Exploratory Hierarchical Clustering for Management Zone Delineation in Precision Agriculture Georg Ruß, Rudolf Kruse Otto-von-Guericke-Universität Magdeburg, Germany {russ,kruse}@iws.cs.uni-magdeburg.de
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 13: Query Expansion and Probabilistic Retrieval Paul Ginsparg Cornell University,
More informationEconomics 2: Growth (Growth in the Solow Model)
Economics 2: Growth (Growth in the Solow Model) Lecture 3, Week 7 Solow Model - I Definition (Solow Model I) The most basic Solow model with no population growth or technological progress. Solow Model
More informationb + O(n d ) where a 1, b > 1, then O(n d log n) if a = b d d ) if a < b d O(n log b a ) if a > b d
CS161, Lecture 4 Median, Selection, and the Substitution Method Scribe: Albert Chen and Juliana Cook (2015), Sam Kim (2016), Gregory Valiant (2017) Date: January 23, 2017 1 Introduction Last lecture, we
More informationLecture 2: Data Analytics of Narrative
Lecture 2: Data Analytics of Narrative Data Analytics of Narrative: Pattern Recognition in Text, and Text Synthesis, Supported by the Correspondence Analysis Platform. This Lecture is presented in three
More informationhttp://xkcd.com/1570/ Strategy: Top Down Recursive divide-and-conquer fashion First: Select attribute for root node Create branch for each possible attribute value Then: Split
More informationToday s Outline. CS 362, Lecture 13. Matrix Chain Multiplication. Paranthesizing Matrices. Matrix Multiplication. Jared Saia University of New Mexico
Today s Outline CS 362, Lecture 13 Jared Saia University of New Mexico Matrix Multiplication 1 Matrix Chain Multiplication Paranthesizing Matrices Problem: We are given a sequence of n matrices, A 1, A
More informationECEN 689 Special Topics in Data Science for Communications Networks
ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 8 Random Walks, Matrices and PageRank Graphs
More informationCluster Analysis (Sect. 9.6/Chap. 14 of Wilks) Notes by Hong Li
77 Cluster Analysis (Sect. 9.6/Chap. 14 of Wilks) Notes by Hong Li 1) Introduction Cluster analysis deals with separating data into groups whose identities are not known in advance. In general, even the
More informationMITOCW MIT8_01F16_L12v01_360p
MITOCW MIT8_01F16_L12v01_360p Let's look at a typical application of Newton's second law for a system of objects. So what I want to consider is a system of pulleys and masses. So I'll have a fixed surface
More informationDecision Tree And Random Forest
Decision Tree And Random Forest Dr. Ammar Mohammed Associate Professor of Computer Science ISSR, Cairo University PhD of CS ( Uni. Koblenz-Landau, Germany) Spring 2019 Contact: mailto: Ammar@cu.edu.eg
More informationAdvanced Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras
Advanced Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Lecture - 3 Simplex Method for Bounded Variables We discuss the simplex algorithm
More information