COMPSCI 514: Algorithms for Data Science


1 COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018

2 Lecture 9 Similarity Queries

3 Few words about the exam. The exam is on Thursday (Oct 4), two days from now, in this room at this time. Duration: 1 hour. Syllabus: up to the end of the last class. Focus: clustering. An extensive list of chapters was provided on Piazza; you should concentrate on the topics that were discussed in detail in class.

4 From the two books. From the Blum et al. book: Ch 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9.1, 3.9.2. All exercises in Ch 3 are relevant. Ch 4.1, 4.8. Ex , 4.11, 4.12, 4.25, 4, Ch 7.1, 7.2, 7.3, 7.4, 7.5. Ex , From MMDS (exercises are at the end of each section): Ch 5.1, 5.5; Ch 7.1, 7.3, 7.4; Ch 10.4; Ch 11.1, 11.2, 11.3. Only things that are covered in class are in the syllabus. Concentrate on topics that were not covered by the first homework. It is fine if you learn something by mistake.

5 Finding similar items [Hays and Efros, SIGGRAPH 2007] J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

6 Finding near-neighbors in a high dimensional space. Pages with similar words: for data de-duplication, classification by topics. Customers who purchased similar products: product classification. Images with similar features. The space is high dimensional. The database can be huge.

7 Finding near-neighbors in a high dimensional space. Near neighbors: points that are only a small distance apart. We have to define a distance or similarity, for example the Jaccard distance/similarity. For any two sets A and B, the Jaccard similarity is $|A \cap B| / |A \cup B|$. For any two sets A and B, the Jaccard distance is $1 - |A \cap B| / |A \cup B|$.

8 Jaccard metric. For any two sets A and B, the Jaccard similarity is $|A \cap B| / |A \cup B|$. For any two sets A and B, the Jaccard distance is $1 - |A \cap B| / |A \cup B|$.
Excerpt from the book shown on the slide: "... into one of set intersection, we use a technique called shingling, which is introduced in Section 3.2. Jaccard Similarity of Sets: The Jaccard similarity of sets S and T is $|S \cap T| / |S \cup T|$, that is, the ratio of the size of the intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T by SIM(S, T). Example 3.1: In Fig. 3.1 we see two sets S and T. There are three elements in their intersection and a total of eight elements that appear in S or T or both. Thus, SIM(S, T) = 3/8."
Figure 3.1: Two sets with Jaccard similarity 3/8 (similarity = 3/8; distance = 5/8).
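The two definitions above can be checked numerically. The following is a minimal Python sketch; the function names and the concrete sets S and T are illustrative assumptions (Fig. 3.1 does not list its elements), chosen only so that the intersection has three elements and the union has eight.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B|.

    Treating two empty sets as having similarity 1.0 is a convention
    assumed here, not something the slides specify.
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets: 3 shared elements, 8 elements overall (as in Example 3.1).
S = {1, 2, 3, 4, 5}
T = {3, 4, 5, 6, 7, 8}

print(jaccard_similarity(S, T))  # 0.375, i.e. 3/8
print(jaccard_distance(S, T))    # 0.625, i.e. 5/8
```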

9 Running example: Similarity of documents. Jaccard distance for textual similarity. To find plagiarism: there is no simple automated process (the documents are not exactly the same). To find the same news from different sources. To match mirror pages.

10 Documents to sets: Shingling. Jaccard distance for textual similarity. Document: a string of characters. k-shingle: a substring of length k within the document. Example: abcdabd. Set of 2-shingles: {ab, bc, cd, da, bd}. Bag of 2-shingles: {ab, bc, cd, da, ab, bd}. We might remove whitespace altogether. We might replace a sequence of whitespaces by only one whitespace.
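A short Python sketch of character shingling, reproducing the abcdabd example above. The function names are assumptions made for illustration, and a `Counter` is used here as one possible way to represent the bag.

```python
from collections import Counter


def shingle_set(doc: str, k: int) -> set:
    """Set of all length-k substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) of length-k substrings: repeats are counted."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


def normalize_whitespace(doc: str) -> str:
    """Replace every run of whitespace by a single space."""
    return " ".join(doc.split())


print(sorted(shingle_set("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
print(shingle_bag("abcdabd", 2)["ab"])    # 2, since "ab" occurs twice
```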

11 Shingling: choosing the shingle size. k = 1: bad. k should be picked large enough that the probability of any given shingle appearing in any given document is low. k very large: not enough statistics, also bad. For large documents, k = 9 is considered safe. Hashing shingles: give each shingle a number, and then, instead of the set of shingles, keep the set of numbers.
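One way to "give each shingle a number" is sketched below in Python. The choice of `zlib.crc32` as the hash is an assumption made for illustration (it is deterministic and returns a 32-bit integer); the slides do not name a particular hash function.

```python
import zlib


def hashed_shingles(doc: str, k: int = 9) -> set:
    """Map each k-shingle to a 32-bit integer and keep the set of integers.

    zlib.crc32 is an illustrative choice of hash, not one prescribed by
    the lecture. Distinct shingles may collide, so the result can be
    smaller than the set of shingles.
    """
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


numbers = hashed_shingles("abcdabd", k=2)
print(len(numbers))  # at most 5, the number of distinct 2-shingles
```

Storing 4-byte integers instead of 9-character strings reduces memory, and the Jaccard similarity can then be computed on the sets of integers.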

12 Shingles built from words. First rule: avoid stop words ("and", "such", "to", etc.). Use a stop word followed by the next two or three words, e.g. "and other world leaders".
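The second option above (a stop word followed by the next few words) might be sketched as follows. The stop-word list, the function name, and the sample sentence are all illustrative assumptions; a real system would use a much longer stop-word list.

```python
# Tiny illustrative stop-word list taken from the slide's examples.
STOP_WORDS = {"and", "such", "to"}


def stopword_shingles(text: str, n_following: int = 2) -> set:
    """Shingles made of a stop word plus the next n_following words."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + 1 + n_following])
        for i, word in enumerate(words)
        # Keep only shingles that have all n_following words available.
        if word in STOP_WORDS and i + n_following < len(words)
    }


sentence = "he spoke and other world leaders listened to the long speech"
print(sorted(stopword_shingles(sentence)))
# ['and other world', 'to the long']
```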

13 Summarizing the shingles. How to go from Sets to Signatures. The characteristic matrix of sets. U = {a, b, c, d, e}; S1 = {a, d}; S2 = {c} etc.
Excerpt from the book shown on the slide: "3.3.1 Matrix Representation of Sets. Before explaining how it is possible to construct small signatures from large sets, it is helpful to visualize a collection of sets as their characteristic matrix. The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which elements of the sets are drawn. There is a 1 in row r and column c if the element for row r is a member of the set for column c. Otherwise the value in position (r, c) is 0."
Element  S1  S2  S3  S4
a        1   0   0   1
b        0   0   1   0
c        0   1   0   1
d        1   0   1   1
e        0   0   1   0
Figure 3.2: A matrix representing four sets.
"Example 3.6: In Fig. 3.2 is an example of a matrix representing sets chosen from the universal set {a, b, c, d, e}. Here, S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, and S4 = {a, c, d}. The top row and leftmost columns are not part of the matrix, but are present only to remind us what the rows and columns represent."
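The construction of the characteristic matrix can be sketched in a few lines of Python, using the four sets of Example 3.6. The function name and the list-of-lists layout are assumptions made for illustration.

```python
def characteristic_matrix(universe: list, sets: list) -> list:
    """Rows follow `universe`, columns follow `sets`.

    Entry (r, c) is 1 if the element for row r belongs to the set for
    column c, and 0 otherwise.
    """
    return [[1 if element in s else 0 for s in sets] for element in universe]


universe = ["a", "b", "c", "d", "e"]
sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]  # S1..S4

matrix = characteristic_matrix(universe, sets)
for element, row in zip(universe, matrix):
    print(element, row)
# a [1, 0, 0, 1]
# b [0, 0, 1, 0]
# c [0, 1, 0, 1]
# d [1, 0, 1, 1]
# e [0, 0, 1, 0]
```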

14 Sets to Signatures: Minhashing. Hash the columns of the characteristic matrix. How? Pick a permutation of the rows. The hash value of a column is the element in the first row, in the permuted order, where the column has a 1.

15 Minhashing. Permutation of the rows: b, e, a, d, c. h(S1) = a; h(S2) = c; h(S3) = b; h(S4) = a.
Excerpt from the book shown on the slide: "In this section, we shall learn how a minhash is computed in principle, and in later sections we shall see how a good approximation to minhash is computed in practice. To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1. Example 3.7: Let us suppose we pick the order of rows beadc for the matrix of Fig. 3.2. This permutation defines a minhash function h that maps sets to rows. Let us compute the minhash value of set S1 according to h. The first column, which is the column for set S1, has 0 in row b, so we proceed to row e, the second in the permuted order. There is again a 0 in the column for S1, so we proceed to row a, where we find a 1. Thus, h(S1) = a."
Element  S1  S2  S3  S4
b        0   0   1   0
e        0   0   1   0
a        1   0   0   1
d        1   0   1   1
c        0   1   0   1
Figure 3.3: A permutation of the rows of Fig. 3.2.
"It is important to remember that the characteristic matrix is unlikely to be the way the data is stored, but it is useful as a way to visualize the data. For one reason not to store data as a matrix, these matrices are almost always sparse (they have many more 0's than 1's) in practice. It saves space to represent a sparse matrix of 0's and 1's by the positions in which the 1's appear. For another ..."
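The minhash computation of Example 3.7 can be reproduced with a short Python sketch. Sets are stored directly as Python sets (that is, by the positions of their 1's rather than as matrix columns); the function name and the `None` return value for an empty set are assumptions made for illustration.

```python
def minhash(permuted_rows, s: set):
    """Return the first element, in permuted row order, that belongs to s.

    Returning None for an empty set is a convention assumed here.
    """
    for element in permuted_rows:
        if element in s:
            return element
    return None


order = list("beadc")  # the permutation used in Example 3.7
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}

for name, s in sets.items():
    print(name, minhash(order, s))
# S1 a
# S2 c
# S3 b
# S4 a
```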

16 Why Minhashing? For any two sets S, T, the probability that their minhashes (under a uniformly random permutation) are equal is $P(h(S) = h(T)) = \mathrm{Sim}(S, T)$, the Jaccard similarity of the two sets. Why?
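One standard way to see this: under a uniformly random permutation, the first row in which either column has a 1 is equally likely to be any element of S ∪ T, and the two minhashes agree exactly when that element lies in S ∩ T. The Python sketch below checks the claim exactly on the small example by enumerating all 5! = 120 permutations of the rows. The function names are illustrative assumptions, and the sketch assumes both sets are non-empty.

```python
from fractions import Fraction
from itertools import permutations


def minhash(permuted_rows, s: set):
    """First element in permuted order that belongs to s (s must be non-empty)."""
    return next(e for e in permuted_rows if e in s)


def collision_probability(universe, s: set, t: set) -> Fraction:
    """Exact P(h(S) = h(T)) over all permutations of the universe."""
    perms = list(permutations(universe))
    equal = sum(1 for p in perms if minhash(p, s) == minhash(p, t))
    return Fraction(equal, len(perms))


def jaccard(s: set, t: set) -> Fraction:
    return Fraction(len(s & t), len(s | t))


universe = "abcde"
S1, S3, S4 = {"a", "d"}, {"b", "d", "e"}, {"a", "c", "d"}

print(collision_probability(universe, S1, S4), jaccard(S1, S4))  # 2/3 2/3
print(collision_probability(universe, S1, S3), jaccard(S1, S3))  # 1/4 1/4
```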

17 Minhash signature. Take n (say 100) random permutations. Each defines a minhash function: $h_1, h_2, \ldots, h_n$. The minhash signature of S is the vector $[h_1(S), h_2(S), \ldots, h_n(S)]$. n is much smaller than the number of rows of the characteristic matrix.
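A minimal Python sketch of a signature built from n explicit random permutations. The function names, the `seed` parameter, and the use of `random.Random.shuffle` are assumptions made for illustration; explicitly shuffling the rows is only feasible for a small universe. The fraction of positions where two signatures agree serves as an estimate of the Jaccard similarity, since each position agrees with probability Sim(S, T).

```python
import random


def minhash_signature(s: set, universe: list, n: int = 100, seed: int = 0) -> list:
    """Signature [h_1(S), ..., h_n(S)] from n random row permutations.

    The same `universe` order and `seed` must be used for every set so
    that all signatures share the same n permutations. s must be non-empty.
    """
    rng = random.Random(seed)
    rows = list(universe)
    signature = []
    for _ in range(n):
        rng.shuffle(rows)  # next random permutation of the rows
        signature.append(next(r for r in rows if r in s))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions in which the two minhashes agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


universe = list("abcde")
sig1 = minhash_signature({"a", "d"}, universe)
sig2 = minhash_signature({"c"}, universe)
sig4 = minhash_signature({"a", "c", "d"}, universe)

print(estimated_similarity(sig1, sig4))  # an estimate of Sim(S1, S4) = 2/3
print(estimated_similarity(sig1, sig2))  # 0.0, because S1 and S2 are disjoint
```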


CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction Tim Roughgarden & Gregory Valiant April 12, 2017 1 The Curse of Dimensionality in the Nearest Neighbor Problem Lectures #1 and

More information

MATH 122 SYLLBAUS HARVARD UNIVERSITY MATH DEPARTMENT, FALL 2014

MATH 122 SYLLBAUS HARVARD UNIVERSITY MATH DEPARTMENT, FALL 2014 MATH 122 SYLLBAUS HARVARD UNIVERSITY MATH DEPARTMENT, FALL 2014 INSTRUCTOR: HIRO LEE TANAKA UPDATED THURSDAY, SEPTEMBER 4, 2014 Location: Harvard Hall 102 E-mail: hirohirohiro@gmail.com Class Meeting Time:

More information

Intelligent Data Analysis Lecture Notes on Document Mining

Intelligent Data Analysis Lecture Notes on Document Mining Intelligent Data Analysis Lecture Notes on Document Mining Peter Tiňo Representing Textual Documents as Vectors Our next topic will take us to seemingly very different data spaces - those of textual documents.

More information

Section 4.6 Negative Exponents

Section 4.6 Negative Exponents Section 4.6 Negative Exponents INTRODUCTION In order to understand negative exponents the main topic of this section we need to make sure we understand the meaning of the reciprocal of a number. Reciprocals

More information

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing) CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose

More information

March 19 - Solving Linear Systems

March 19 - Solving Linear Systems March 19 - Solving Linear Systems Welcome to linear algebra! Linear algebra is the study of vectors, vector spaces, and maps between vector spaces. It has applications across data analysis, computer graphics,

More information

Lecture 2e Row Echelon Form (pages 73-74)

Lecture 2e Row Echelon Form (pages 73-74) Lecture 2e Row Echelon Form (pages 73-74) At the end of Lecture 2a I said that we would develop an algorithm for solving a system of linear equations, and now that we have our matrix notation, we can proceed

More information

MA 1128: Lecture 08 03/02/2018. Linear Equations from Graphs And Linear Inequalities

MA 1128: Lecture 08 03/02/2018. Linear Equations from Graphs And Linear Inequalities MA 1128: Lecture 08 03/02/2018 Linear Equations from Graphs And Linear Inequalities Linear Equations from Graphs Given a line, we would like to be able to come up with an equation for it. I ll go over

More information

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data Elena Baralis and Tania Cerquitelli Politecnico di Torino Data set types Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures Ordered Spatial Data Temporal Data Sequential

More information

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Stat 609: Mathematical Statistics I (Fall Semester, 2016) Introduction

Stat 609: Mathematical Statistics I (Fall Semester, 2016) Introduction Stat 609: Mathematical Statistics I (Fall Semester, 2016) Introduction Course information Instructor Professor Jun Shao TA Mr. Han Chen Office 1235A MSC 1335 MSC Phone 608-262-7938 608-263-5948 Email shao@stat.wisc.edu

More information

Today s Menu. Administrativia Two Problems Cutting a Pizza Lighting Rooms

Today s Menu. Administrativia Two Problems Cutting a Pizza Lighting Rooms Welcome! L01 Today s Menu Administrativia Two Problems Cutting a Pizza Lighting Rooms Administrativia Course page: https://www.cs.duke.edu/courses/spring13/compsci230/ Who we are: Instructor: TA: UTAs:

More information

Intelligent Data Analysis. Mining Textual Data. School of Computer Science University of Birmingham

Intelligent Data Analysis. Mining Textual Data. School of Computer Science University of Birmingham Intelligent Data Analysis Mining Textual Data Peter Tiňo School of Computer Science University of Birmingham Representing documents as numerical vectors Use a special set of terms T = {t 1, t 2,..., t

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Testing a Hash Function using Probability

Testing a Hash Function using Probability Testing a Hash Function using Probability Suppose you have a huge square turnip field with 1000 turnips growing in it. They are all perfectly evenly spaced in a regular pattern. Suppose also that the Germans

More information

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Optimal Data-Dependent Hashing for Approximate Near Neighbors Optimal Data-Dependent Hashing for Approximate Near Neighbors Alexandr Andoni 1 Ilya Razenshteyn 2 1 Simons Institute 2 MIT, CSAIL April 20, 2015 1 / 30 Nearest Neighbor Search (NNS) Let P be an n-point

More information

I am trying to keep these lessons as close to actual class room settings as possible.

I am trying to keep these lessons as close to actual class room settings as possible. Greetings: I am trying to keep these lessons as close to actual class room settings as possible. They do not intend to replace the text book actually they will involve the text book. An advantage of a

More information

Select/Special Topics in Atomic Physics Prof. P. C. Deshmukh Department of Physics Indian Institute of Technology, Madras

Select/Special Topics in Atomic Physics Prof. P. C. Deshmukh Department of Physics Indian Institute of Technology, Madras Select/Special Topics in Atomic Physics Prof. P. C. Deshmukh Department of Physics Indian Institute of Technology, Madras Lecture - 9 Angular Momentum in Quantum Mechanics Dimensionality of the Direct-Product

More information

Perform the same three operations as above on the values in the matrix, where some notation is given as a shorthand way to describe each operation:

Perform the same three operations as above on the values in the matrix, where some notation is given as a shorthand way to describe each operation: SECTION 2.1: SOLVING SYSTEMS OF EQUATIONS WITH A UNIQUE SOLUTION In Chapter 1 we took a look at finding the intersection point of two lines on a graph. Chapter 2 begins with a look at a more formal approach

More information

Ch. 7 Statistical Intervals Based on a Single Sample

Ch. 7 Statistical Intervals Based on a Single Sample Ch. 7 Statistical Intervals Based on a Single Sample Before discussing the topics in Ch. 7, we need to cover one important concept from Ch. 6. Standard error The standard error is the standard deviation

More information

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization Tim Roughgarden February 28, 2017 1 Preamble This lecture fulfills a promise made back in Lecture #1,

More information

MATH 433 Applied Algebra Lecture 22: Review for Exam 2.

MATH 433 Applied Algebra Lecture 22: Review for Exam 2. MATH 433 Applied Algebra Lecture 22: Review for Exam 2. Topics for Exam 2 Permutations Cycles, transpositions Cycle decomposition of a permutation Order of a permutation Sign of a permutation Symmetric

More information

Decision T ree Tree Algorithm Week 4 1

Decision T ree Tree Algorithm Week 4 1 Decision Tree Algorithm Week 4 1 Team Homework Assignment #5 Read pp. 105 117 of the text book. Do Examples 3.1, 3.2, 3.3 and Exercise 3.4 (a). Prepare for the results of the homework assignment. Due date

More information

Last Time. x + 3y = 6 x + 2y = 1. x + 3y = 6 y = 1. 2x + 4y = 8 x 2y = 1. x + 3y = 6 2x y = 7. Lecture 2

Last Time. x + 3y = 6 x + 2y = 1. x + 3y = 6 y = 1. 2x + 4y = 8 x 2y = 1. x + 3y = 6 2x y = 7. Lecture 2 January 9 Last Time 1. Last time we ended with saying that the following four systems are equivalent in the sense that we can move from one system to the other by a special move we discussed. (a) (b) (c)

More information

Section F Ratio and proportion

Section F Ratio and proportion Section F Ratio and proportion Ratio is a way of comparing two or more groups. For example, if something is split in a ratio 3 : 5 there are three parts of the first thing to every five parts of the second

More information

CSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides

CSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides CSE 494/598 Lecture-6: Latent Semantic Indexing LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Homework-1 and Quiz-1 Project part-2 released

More information