Database Systems CSE 514

Size: px
Start display at page:

Download "Database Systems CSE 514"

Transcription

1 Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter

2 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section The final exam is next Tuesday, in class CSEP514 - Winter

3 Data Cleaning Cover slides 2-20 from the Data Cleaning Tutorial CSEP514 - Winter

4 Deduplication Other names: entity resolution, record linkage,... Cover (quickly!) slides of the ER tutorial CSEP514 - Winter

5 Record Linkage The same object is represented in different ways in two relations Problem: find matching pairs Examples Typos: Woshington v.s. Washington Different naming conventions: IBM v.s. International Business Machines Corporation 5

6 Company1 Example Company1 CName1 Microsoft Corp Apple Computer Apples, Pears, and More... Other attributes CName2 Microsoft Inc Apple Corporation Apples and Pears Farm Other attributes... 6

7 Fellegi & Sunter Model Each relation has several attributes: R(A1,A2,...,Ak) S(A1,A2,...,Ak) Fix two records r and s: we have a score of similarity for each pair of attributes: sim(r.a1,s.a1), sim(r.a2,s.a2),..., sim(r.ak,s.ak) Using these scores, decide if r = s or r s This lecture will discuss only single attribute 7

8 Similarity Join another name for Fuzzy Matching SELECT * FROM Company1, Company2 WHERE cname1 cname2 Intended meaning: retrieve all pairs that are sufficiently similar 8

9 Outline for Today Define similarity s1 s2 Jaccard, Hamming, Edit distance Given a set of sets s1, s2,... compute efficiently all pairs (si,sj) s.t. si sj Min-hashes Locality Sensitive Hashing CSEP514 - Winter

10 Similarity Functions CSEP514 - Winter

11 What is Similar? Similarity function sim(s1,s2): Sim(s1,s2) > k means s1, s2 are similar Distance function dist(s1,s2): Dist(s1,s2) < k means s1, s2 are similar 11

12 Two Approaches Q-grams Simple, and can lead to efficient optimizations Edit Distance More accurate, but less amenable to optimizations 12

13 Q-Grams Given a string s, a q-gram is a substring of length q Usually q = 3 washington woshington s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} Variation: may include beginning and end: ##w, #wa, on$, n$$

14 Hamming Distance H(s1, s2) = s1 Δ s2 = s1-s2 + s2-s1 s1 s1 is similar to s2 if H(s1, s2) < c, for some constant c s2

15 Hamming Distance s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} H(s1, s2) =? 15

16 Hamming Distance s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} H(s1, s2) = 4 16

17 Jaccard Similarity J(s1,s2) = s1 s2 / s1 s2 Note: J(s1,s2) [0,1] s1 s1 is similar to s2 if J(s1, s2) > k, for some constant k s2

18 Jaccard Similarity s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} J(s1, s2) =? 18

19 Jaccard Similarity s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} J(s1, s2) = 6/10 =

20 They are related! Suppose s1 = s2 = L s1 s2 = L H(s1,s2) s1 s2 = L + H(s1,s2) J(s1,s2) = (L H(s1,s2)) / (L + H(s1,s2)) 20

21 Representing q-grams Company (id, name, ) CQ(id, qgram) Id Name... 1 Washingto n The q-gram schema Id Qgram 1 was 1 ash 1 shi 1 hin

22 Naive Similarity Joins SELECT * FROM Company1, Company2 WHERE cname1 cname2 > 6 Company1(id1, cname1, ) Company2(id2, cname2, ) CQ1(id1, qgram1) CQ2(id2, qgram2) Rewrite the query over the q-gram schema 22

23 Naive Similarity Joins SELECT * FROM Company1, Company2 WHERE cname1 cname2 > 6 SELECT x.*, y.* FROM Company1 x, Company2 y, CQ1 u, CQ2 v WHERE x.id = u.id and y.id = v.id and u.qgram = v.qgram GROUP BY x.id, y.id HAVING count(*) > 6 23

24 Edit Distance Sometimes none of the similarity/distance measures are good enough Need to use edit distance Models more accurately the typing mistakes Slower than Jaccard or Hamming Use Jaccard or Hamming for pruning, then compute edit distance to remove false positives 24

25 The Levenstein Distance Distance is shortest sequence of edit commands that transform s to t. Simplest set of operations: Copy character from s over to t Delete a character in s (cost 1) Insert a character in t (cost 1) Substitute one character for another (cost 1)

26 Levenstein distance - example distance( William Cohen, Willliam Cohon ) s1 W I L L I A M _ C O H E N s2 op W I L L L I A M _ C O H O N C C C C I C C C C C C C S C cost

27 Levenstein distance - example distance( William Cohen, Willliam Cohon ) s1 W I L L gap I A M _ C O H E N s2 W I L L L I A M _ C O H O N C C C C I C C C C C C C S C op cost

28 Computing Levenstein distance 1/4 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //insert D(i,j-1)+1 //delete

29 Computing Levenstein distance 2/4 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j

30 Computing Levenstein distance 3/4 D(i-1,j-1) + d(si,tj) //subst/copy D(i,j)= min D(i-1,j)+1 //insert D(i,j-1)+1 //delete C O H E N M C C O H N = D(s,t)

31 Computing Levenstein distance 4/4 D(i-1,j-1) + d(si,tj) //subst/copy D(i,j) = min D(i-1,j)+1 //insert D(i,j-1)+1 //delete A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1) C O H E N M C C O H N

32 Needleman-Wunch distance D(i-1,j-1) + d(si,tj) //subst/copy D(i,j) = min D(i-1,j) + G //insert D(i,j-1) + G //delete d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutibility, etc) G = gap cost William Cohen Wukkuan Cigeb

33 Pre-filtering If editdistance(s1, s2) k, then Qgrams(s1) Qgrams(s2) v Where v = max( s1, s2 ) 1 (k-1) * q Where q = length of the q-gram Thus, can use any of the signature schemes 33

34 Jaccard Recap: Jaccard similarity: J(S, T) = S T / S T Problem: given large collection of sets S1, S2,, Sn, and given a threshold s, find all pairs Si, Sj s.t. J(Si, Sj) > s 34

35 Similarity Joins CSEP514 - Winter

36 Similarity joins Problem: We are given n sets s1, s2,..., sn Find quickly all pairs s.t. J(Si,Sj) > s False positives, false negatives are OK 36

37 Application 1: Collaborative Filtering We have n customers: 1, 2,, n Each customer i buys a set of items Si We would like to recommend items bought by customer j, if J(Si, Sj) > s 37

38 Application 2: Similar Documents Given n documents d1, d2,, dn Let Si be the set of q-grams for document i Want to find all pairs of similar documents, i.e. for which J(Si, Sj) > s 38

39 Example You work for a copyright violation detection company Customer: has 10 6 documents Web: has pages Your job is to find almost identical documents 39

40 The Signature Method For each Si, compute a signature Sig(Si) s.t. J(Si,Sj) > s çè Sig(Si) Sig(Sj) emptyset With high probability 40

41 The Signature Method Step 1: compute all pairs i,j for which Sig(Si) Sig(Sj) emptyset This is a join operation! Step 2: for all such pairs, return (i,j) if J(Si,Sj) > s Hopefully only a few such pairs Both false positives and false negatives are possible 41

42 Discussion We will assume that each Si is a string, and we associate it with the set of its q-grams The signature Sig(Si) is also a set of something (to be explained) CSEP514 - Winter

43 Signature 1 = the q-grams! Obviously: s1 s2 > k è s1 s2 emptyset SELECT DISTINCT x.*, y.* FROM Company1 x, Company2 y, CQ1 u, CQ2 v WHERE x.id = u.id and y.id = v.id and u.qgram = v.qgram Lots of false positives 43

44 Signature 2: Prefix Filter Suppose all sets si have same cardinality L Order the q-grams lexicographically Define sig(s) = {the first L-k+1 smallest q- grams in q} 44

45 Signature 2: Prefix Filter Example: L = 8, k = 6, L k + 1 = 3 s1 = {was, ash, shi, hin, ing, ngt, gto, ton} = {ash, gto, hin, ing, ngt, shi, ton, was} Sig(s1) = {ash, gto, hin} 45

46 Signature 2: Prefix Filter Fact: if s1 s2 > k then sig(s1) sig(s2) emptyset Why? 46

47 Signature 3: Minhash Let π be an arbitrary permutation of the domain For each i, let: mh(si) = {the smallest element in Si according to π} 47

48 Example The entire domain is {a,b,c,d,e,f,g,h} The set Si is Si = {a,b,c,e,f} Suppose we choose the permutation: π = d,g,c,h,b,f,a,e Then what is mh(si) =? 48

49 Minhash Main property: Probability(mh(Si) = mh(sj)) = J(Si,Sj) Why? 49

50 Warmup Question Choose a random permutation π of {a,b,c,z} S= a b c d What is the probability that mh(s) = c? 50

51 Warmup Question Choose a random permutation π of {a,b,c,z} S= a b c d What is the probability that mh(s) = c? Answer: P = ¼ (each of a,b,c,d can be the min) 51

52 Main Property Sj Si f a b c e d What is Prob(mh(Si) = mh(sj))? 52

53 Main Property Sj Si f a b c e d What is Prob(mh(Si) = mh(sj))? 53 Answer: 2/6 = J(Si,Sj)

54 Computing Minhashes We use a hash function (which we assume is random) mh(s) { v = ; forall x in S do if h(x) < v then {v = h(x); y = x;} return y; } 54

55 Example The set Si is Si = {a,b,c,e,f} Compute h: h(a)=77,h(b)=55,h(c)=33,h(e)=88,h(f)=66 Then what is mh(si) =? 55

56 Using Minhashes for Fuzzy Join Recall: we have n sets S1,, Sn Want to return all pairs s.t. J(Si,Sj) > s Compute mh(s1),, mh(sn) Return pairs for which mh(si)=mh(sj) The probability of returning (Si,Sj) is J(Si,Sj) Probability of a false positive = J(Si,Sj) Probability of a false negative = 1-J(Si,Sj) 56

57 Example We have n = 1,000,000 records S1,..., Sn We want pairs s.t. J(Si,Sj) > 0.99 There are ordered pairs (Si,Sj) Suppose 2,000,000 pairs have J(Si,Sj)>0.99 the others have J(Si,Sj) = 0.5 # false positives = 0.5 * This is bad # false negatives = 0.01 * 2,000,000 = 20,000

58 Naïve Idea Use r independent hash functions h 1,, h r For each Si, compute a block of minhashes: MH(Si) = the r minhashes for each j=1,,r Return all pairs where MH(Si)=MH(Sj) The probability of returning (Si,Sj) is (J(Si,Sj)) r Probability of a false positive = (J(Si,Sj)) r Probability of a false negative = 1-(J(Si,Sj)) r 58

59 Example r = 2 The set Si is Si = {a,b,c,e,f} Compute h1: h1(a)=77,h1(b)=55,h1(c)=33,h1(e)=88,h1(f)=66 Compute h2: h2(a)=22,h2(b)=66,h2(c)=55,h2(e)=11,h2(f)=44 Then what is MH(Si) =? 59

60 Example r = 2 The set Si is Si = {a,b,c,e,f} Compute h1: h1(a)=77,h1(b)=55,h1(c)=33,h1(e)=88,h1(f)=66 Compute h2: h2(a)=22,h2(b)=66,h2(c)=55,h2(e)=11,h2(f)=44 Then what is MH(Si) =? Answer: MH(Si) = ce (an ordered pair) 60

61 Signature 4: LSH Locality Sensitive Hashing b blocks of minhashes MH 1,, MH b Each MH k has size r Total size of the signature: m = b*r Return pairs (Si,Sj) s.t. there exists k MH k (S i ) = MH k (S j ) Still a single simple join! 61

62 Example n=7 m=6 b=3 r=2 S1 S2 S3 S4 S5 S6 S r rows Each rectangle (=band) becomes one new hash value

63 Analysis Goal: want to compute the probability that Sig(MH(Si)) Sig(MH(Sj)) emptyset, as a function of s = J(Si, Sj) Let s = J(Si, Sj) 63

64 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal?

65 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? 7 7 Answer: s 65

66 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? r rows 66

67 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? r rows Answer: s r 67

68 Analysis Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands 68

69 Analysis Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands Answer: 1-(1-s r 69) b

70 This is precisely the probability that Sig(Si) Sig(Sj ) emptyset Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands Answer: 1-(1-s r ) b 70

71 Analysis 1 P(s) = 1-(1-s r ) b 0 s 0 (1/b) 1/r 71 1

72 Putting it together You work for a copyright violation detection company Customers: have documents 1, 2, 3,., 10 6 Web: has pages 1, 2, 3,., Your job is to find almost identical documents 72

73 Doc DocID Qgram 1 To 1 o b 1 be 1 be 1 e o 1 or 1 or 1 r n 1 no Step 1: Q-grams Web URL Qgram abc.com Oba abc.com Bam abc.com ama abc.com ma abc.com a h bcd.com

74 Step 2: Compute m Min-hashes docid mh1 mh2 mh3 mh4 mh To que Ion or Not url mh1 mh2 mh3 mh4 mh50 0 abc.com sec ret def ens bcd.com Note: m = a few hundreds 74

75 Step 3: Compute Signatures docid h(mh1,,mh20) h(mh21,,mh40) docsig docid Sig websig url Sig @1 abc.com @ @2 abc.com 3232@2 1 abc.com @25 abc.com @ @1 bcd.com 23423@

76 Step 4: Find docs with common signatures SELECT DISTINCT docsig.docid, websig.url FROM docsig, websig WHERE docsig.sig = websig.sig Note: this is a SINGLE JOIN query!! Still need to filter out the false positives (easy) 76

CSE 321 Discrete Structures

CSE 321 Discrete Structures CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold

More information

Finding similar items

Finding similar items Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Similarity Search. Stony Brook University CSE545, Fall 2016

Similarity Search. Stony Brook University CSE545, Fall 2016 Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages

More information

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties Outline Approximation: Theory and Algorithms Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 3 March 13, 2009 2 3 Nikolaus Augsten (DIS) Approximation: Theory and

More information

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

Why duplicate detection?

Why duplicate detection? Near-Duplicates Detection Naama Kraus Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze Some slides are courtesy of Kira Radinsky Why duplicate detection?

More information

B490 Mining the Big Data

B490 Mining the Big Data B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations

More information

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 4 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Algorithms for Data Science: Lecture on Finding Similar Items

Algorithms for Data Science: Lecture on Finding Similar Items Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar

More information

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes

More information

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee

More information

COMPSCI 514: Algorithms for Data Science

COMPSCI 514: Algorithms for Data Science COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018 Lecture 9 Similarity Queries Few words about the exam The exam is Thursday (Oct 4) in two days In

More information

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya CS62: Scalable Data Mining Similarity Search and Hashing Sourangshu Bha>acharya Finding Similar Items Distance Measures Goal: Find near-neighbors in high-dim. space We formally define near neighbors as

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based

More information

1 Finding Similar Items

1 Finding Similar Items 1 Finding Similar Items This chapter discusses the various measures of distance used to find out similarity between items in a given set. After introducing the basic similarity measures, we look at how

More information

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here) 4/0/9 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 Piazza Recitation session: Review of linear algebra Location: Thursday, April, from 3:30-5:20 pm in SIG 34

More information

Analysis and Design of Algorithms Dynamic Programming

Analysis and Design of Algorithms Dynamic Programming Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................

More information

Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance

Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance Sharon Goldwater (based on slides by Philipp Koehn) 20 September 2018 Sharon Goldwater ANLP Lecture

More information

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Minimum Edit Distance. Defini'on of Minimum Edit Distance Minimum Edit Distance Defini'on of Minimum Edit Distance How similar are two strings? Spell correc'on The user typed graffe Which is closest? graf gra@ grail giraffe Computa'onal Biology Align two sequences

More information

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance Jingbo Shang, Jian Peng, Jiawei Han University of Illinois, Urbana-Champaign May 6, 2016 Presented by Jingbo Shang 2 Outline

More information

Unsupervised Vocabulary Induction

Unsupervised Vocabulary Induction Infant Language Acquisition Unsupervised Vocabulary Induction MIT (Saffran et al., 1997) 8 month-old babies exposed to stream of syllables Stream composed of synthetic words (pabikumalikiwabufa) After

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

7 Distances. 7.1 Metrics. 7.2 Distances L p Distances

7 Distances. 7.1 Metrics. 7.2 Distances L p Distances 7 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards

More information

Bloom Filters, Minhashes, and Other Random Stuff

Bloom Filters, Minhashes, and Other Random Stuff Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?

More information

Bloom Filters and Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,

More information

Bio nformatics. Lecture 3. Saad Mneimneh

Bio nformatics. Lecture 3. Saad Mneimneh Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per

More information

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009 Outline Approximation: Theory and Algorithms The Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 1 Nikolaus Augsten (DIS) Approximation: Theory and

More information

More Dynamic Programming

More Dynamic Programming Algorithms & Models of Computation CS/ECE 374, Fall 2017 More Dynamic Programming Lecture 14 Tuesday, October 17, 2017 Sariel Har-Peled (UIUC) CS374 1 Fall 2017 1 / 48 What is the running time of the following?

More information

Algorithms Exam TIN093 /DIT602

Algorithms Exam TIN093 /DIT602 Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405

More information

OECD QSAR Toolbox v.4.1. Tutorial illustrating new options for grouping with metabolism

OECD QSAR Toolbox v.4.1. Tutorial illustrating new options for grouping with metabolism OECD QSAR Toolbox v.4.1 Tutorial illustrating new options for grouping with metabolism Outlook Background Objectives Specific Aims The exercise Workflow 2 Background Grouping with metabolism is a procedure

More information

Lecture 5: The Shift-And Method

Lecture 5: The Shift-And Method Biosequence Algorithms, Spring 2005 Lecture 5: The Shift-And Method Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 5: Shift-And p.1/19 Seminumerical String Matching Most

More information

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Alex Lascarides (Slides from Alex Lascarides and Sharon Goldwater) 2 February 2019 Alex Lascarides FNLP Lecture

More information

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo CSE 431/531: Analysis of Algorithms Dynamic Programming Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Paradigms for Designing Algorithms Greedy algorithm Make a

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08

More information

Approximation: Theory and Algorithms

Approximation: Theory and Algorithms Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation:

More information

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser Algorithms and Data Structures Complexity of Algorithms Ulf Leser Content of this Lecture Efficiency of Algorithms Machine Model Complexity Examples Multiplication of two binary numbers (unit cost?) Exact

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org Customer

More information

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances 6 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards

More information

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

DATA MINING LECTURE 3. Frequent Itemsets Association Rules DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.

More information

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing) CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

More Dynamic Programming

More Dynamic Programming CS 374: Algorithms & Models of Computation, Spring 2017 More Dynamic Programming Lecture 14 March 9, 2017 Chandra Chekuri (UIUC) CS374 1 Spring 2017 1 / 42 What is the running time of the following? Consider

More information

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

MA 162: Finite Mathematics - Section 3.3/4.1

MA 162: Finite Mathematics - Section 3.3/4.1 MA 162: Finite Mathematics - Section 3.3/4.1 Fall 2014 Ray Kremer University of Kentucky October 6, 2014 Announcements: Homework 3.3 due Tuesday at 6pm. Homework 4.1 due Friday at 6pm. Exam scores were

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

Chemistry Day 39. Friday, December 14 th Monday, December 17 th, 2018

Chemistry Day 39. Friday, December 14 th Monday, December 17 th, 2018 Chemistry Day 39 Friday, December 14 th Monday, December 17 th, 2018 Do-Now: Reactions Quiz Do-Now 1. Write down today s FLT 2. Copy: KCl + H 2 O à? 3. Identify the type of reaction in #2. 4. Predict the

More information

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Lecture : p he biological problem p lobal alignment p Local alignment p Multiple alignment 6 Background: comparative genomics p Basic question in biology: what properties

More information

Samson Zhou. Pattern Matching over Noisy Data Streams

Samson Zhou. Pattern Matching over Noisy Data Streams Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

1 Maintaining a Dictionary

1 Maintaining a Dictionary 15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition

More information

Dynamic Programming. Prof. S.J. Soni

Dynamic Programming. Prof. S.J. Soni Dynamic Programming Prof. S.J. Soni Idea is Very Simple.. Introduction void calculating the same thing twice, usually by keeping a table of known results that fills up as subinstances are solved. Dynamic

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

CS 584 Data Mining. Association Rule Mining 2

CS 584 Data Mining. Association Rule Mining 2 CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M

More information

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Lecture 24: Bloom Filters. Wednesday, June 2, 2010 Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture

More information

GIS Lecture 4: Data. GIS Tutorial, Third Edition GIS 1

GIS Lecture 4: Data. GIS Tutorial, Third Edition GIS 1 GIS Lecture 4: Data GIS 1 Outline Data Types, Tables, and Formats Geodatabase Tabular Joins Spatial Joins Field Calculator ArcCatalog Functions GIS 2 Data Types, Tables, Formats GIS 3 Directly Loadable

More information

Sequence Alignment. Johannes Starlinger

Sequence Alignment. Johannes Starlinger Sequence Alignment Johannes Starlinger his Lecture Approximate String Matching Edit distance and alignment Computing global alignments Local alignment Johannes Starlinger: Bioinformatics, Summer Semester

More information

Hashing Data Structures. Ananda Gunawardena

Hashing Data Structures. Ananda Gunawardena Hashing 15-121 Data Structures Ananda Gunawardena Hashing Why do we need hashing? Many applications deal with lots of data Search engines and web pages There are myriad look ups. The look ups are time

More information

CS 580: Algorithm Design and Analysis

CS 580: Algorithm Design and Analysis CS 58: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 28 Announcement: Homework 3 due February 5 th at :59PM Midterm Exam: Wed, Feb 2 (8PM-PM) @ MTHW 2 Recap: Dynamic Programming

More information

732A61/TDDD41 Data Mining - Clustering and Association Analysis

732A61/TDDD41 Data Mining - Clustering and Association Analysis 732A61/TDDD41 Data Mining - Clustering and Association Analysis Lecture 6: Association Analysis I Jose M. Peña IDA, Linköping University, Sweden 1/14 Outline Content Association Rules Frequent Itemsets

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

Similarity Join Size Estimation using Locality Sensitive Hashing

Similarity Join Size Estimation using Locality Sensitive Hashing Similarity Join Size Estimation using Locality Sensitive Hashing Hongrae Lee, Google Inc Raymond Ng, University of British Columbia Kyuseok Shim, Seoul National University Highly Similar, but not Identical,

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

How to Make or Plot a Graph or Chart in Excel

How to Make or Plot a Graph or Chart in Excel This is a complete video tutorial on How to Make or Plot a Graph or Chart in Excel. To make complex chart like Gantt Chart, you have know the basic principles of making a chart. Though I have used Excel

More information

Lecture 4: September 19

Lecture 4: September 19 CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang

More information

Searching Algorithms. CSE21 Winter 2017, Day 3 (B00), Day 2 (A00) January 13, 2017

Searching Algorithms. CSE21 Winter 2017, Day 3 (B00), Day 2 (A00) January 13, 2017 Searching Algorithms CSE21 Winter 2017, Day 3 (B00), Day 2 (A00) January 13, 2017 Selection Sort (MinSort) Pseudocode Rosen page 203, exercises 41-42 procedure selection sort(a 1, a 2,..., a n : real numbers

More information

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction. Microsoft Research Asia September 5, 2005 1 2 3 4 Section I What is? Definition is a technique for efficiently recurrence computing by storing partial results. In this slides, I will NOT use too many formal

More information

Lecture 2: Pairwise Alignment. CG Ron Shamir

Lecture 2: Pairwise Alignment. CG Ron Shamir Lecture 2: Pairwise Alignment 1 Main source 2 Why compare sequences? Human hexosaminidase A vs Mouse hexosaminidase A 3 www.mathworks.com/.../jan04/bio_genome.html Sequence Alignment עימוד רצפים The problem:

More information

Lecture 8 HASHING!!!!!

Lecture 8 HASHING!!!!! Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)

More information

Mathematics Enhancement Programme

Mathematics Enhancement Programme UNIT 9 Lesson Plan 1 Perimeters Perimeters 1 1A 1B Units of length T: We're going to measure the lengths of the sides of some shapes. Firstly we need to decide what units to use - what length-units do

More information

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2) The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

CSE 105 Theory of Computation

CSE 105 Theory of Computation CSE 105 Theory of Computation http://www.jflap.org/jflaptmp/ Professor Jeanne Ferrante 1 Undecidability Today s Agenda Review: The TM Acceptance problem, A TM The Halting Problem for TM s Other problems

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

6.830 Lecture 11. Recap 10/15/2018

6.830 Lecture 11. Recap 10/15/2018 6.830 Lecture 11 Recap 10/15/2018 Celebration of Knowledge 1.5h No phones, No laptops Bring your Student-ID The 5 things allowed on your desk Calculator allowed 4 pages (2 pages double sided) of your liking

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

Recursion: Introduction and Correctness

Recursion: Introduction and Correctness Recursion: Introduction and Correctness CSE21 Winter 2017, Day 7 (B00), Day 4-5 (A00) January 25, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Today s Plan From last time: intersecting sorted lists and

More information

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University Association Rules CS 5331 by Rattikorn Hewett Texas Tech University 1 Acknowledgements Some parts of these slides are modified from n C. Clifton & W. Aref, Purdue University 2 1 Outline n Association Rule

More information

Today: Amortized Analysis

Today: Amortized Analysis Today: Amortized Analysis COSC 581, Algorithms March 6, 2014 Many of these slides are adapted from several online sources Today s class: Chapter 17 Reading Assignments Reading assignment for next class:

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

CSE 105 Theory of Computation

CSE 105 Theory of Computation CSE 105 Theory of Computation http://www.jflap.org/jflaptmp/ Professor Jeanne Ferrante 1 Undecidability Today s Agenda Review and More Problems A Non-TR Language Reminders and announcements: HW 7 (Last!!)

More information

Sequence modelling. Marco Saerens (UCL) Slides references

Sequence modelling. Marco Saerens (UCL) Slides references Sequence modelling Marco Saerens (UCL) Slides references Many slides and figures have been adapted from the slides associated to the following books: Alpaydin (2004), Introduction to machine learning.

More information

Hash-based Indexing: Application, Impact, and Realization Alternatives

Hash-based Indexing: Application, Impact, and Realization Alternatives : Application, Impact, and Realization Alternatives Benno Stein and Martin Potthast Bauhaus University Weimar Web-Technology and Information Systems Text-based Information Retrieval (TIR) Motivation Consider

More information

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10 PageRank Ryan Tibshirani 36-462/36-662: Data Mining January 24 2012 Optional reading: ESL 14.10 1 Information retrieval with the web Last time we learned about information retrieval. We learned how to

More information