Database Systems CSE 514
|
|
- Ginger Mills
- 5 years ago
- Views:
Transcription
1 Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter
2 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section The final exam is next Tuesday, in class CSEP514 - Winter
3 Data Cleaning Cover slides 2-20 from the Data Cleaning Tutorial CSEP514 - Winter
4 Deduplication Other names: entity resolution, record linkage,... Cover (quickly!) slides of the ER tutorial CSEP514 - Winter
5 Record Linkage The same object is represented in different ways in two relations Problem: find matching pairs Examples Typos: Woshington v.s. Washington Different naming conventions: IBM v.s. International Business Machines Corporation 5
6 Company1 Example Company1 CName1 Microsoft Corp Apple Computer Apples, Pears, and More... Other attributes CName2 Microsoft Inc Apple Corporation Apples and Pears Farm Other attributes... 6
7 Fellegi & Sunter Model Each relation has several attributes: R(A1,A2,...,Ak) S(A1,A2,...,Ak) Fix two records r and s: we have a score of similarity for each pair of attributes: sim(r.a1,s.a1), sim(r.a2,s.a2),..., sim(r.ak,s.ak) Using these scores, decide if r = s or r s This lecture will discuss only single attribute 7
8 Similarity Join another name for Fuzzy Matching SELECT * FROM Company1, Company2 WHERE cname1 cname2 Intended meaning: retrieve all pairs that are sufficiently similar 8
9 Outline for Today Define similarity s1 s2 Jaccard, Hamming, Edit distance Given a set of sets s1, s2,... compute efficiently all pairs (si,sj) s.t. si sj Min-hashes Locality Sensitive Hashing CSEP514 - Winter
10 Similarity Functions CSEP514 - Winter
11 What is Similar? Similarity function sim(s1,s2): Sim(s1,s2) > k means s1, s2 are similar Distance function dist(s1,s2): Dist(s1,s2) < k means s1, s2 are similar 11
12 Two Approaches Q-grams Simple, and can lead to efficient optimizations Edit Distance More accurate, but less amenable to optimizations 12
13 Q-Grams Given a string s, a q-gram is a substring of length q Usually q = 3 washington woshington s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} Variation: may include beginning and end: ##w, #wa, on$, n$$
14 Hamming Distance H(s1, s2) = s1 Δ s2 = s1-s2 + s2-s1 s1 s1 is similar to s2 if H(s1, s2) < c, for some constant c s2
15 Hamming Distance s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} H(s1, s2) =? 15
16 Hamming Distance s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} H(s1, s2) = 4 16
17 Jaccard Similarity J(s1,s2) = s1 s2 / s1 s2 Note: J(s1,s2) [0,1] s1 s1 is similar to s2 if J(s1, s2) > k, for some constant k s2
18 Jaccard Similarity s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} J(s1, s2) =? 18
19 Jaccard Similarity s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} J(s1, s2) = 6/10 =
20 They are related! Suppose s1 = s2 = L s1 s2 = L H(s1,s2) s1 s2 = L + H(s1,s2) J(s1,s2) = (L H(s1,s2)) / (L + H(s1,s2)) 20
21 Representing q-grams Company (id, name, ) CQ(id, qgram) Id Name... 1 Washingto n The q-gram schema Id Qgram 1 was 1 ash 1 shi 1 hin
22 Naive Similarity Joins SELECT * FROM Company1, Company2 WHERE cname1 cname2 > 6 Company1(id1, cname1, ) Company2(id2, cname2, ) CQ1(id1, qgram1) CQ2(id2, qgram2) Rewrite the query over the q-gram schema 22
23 Naive Similarity Joins SELECT * FROM Company1, Company2 WHERE cname1 cname2 > 6 SELECT x.*, y.* FROM Company1 x, Company2 y, CQ1 u, CQ2 v WHERE x.id = u.id and y.id = v.id and u.qgram = v.qgram GROUP BY x.id, y.id HAVING count(*) > 6 23
24 Edit Distance Sometimes none of the similarity/distance measures are good enough Need to use edit distance Models more accurately the typing mistakes Slower than Jaccard or Hamming Use Jaccard or Hamming for pruning, then compute edit distance to remove false positives 24
25 The Levenstein Distance Distance is shortest sequence of edit commands that transform s to t. Simplest set of operations: Copy character from s over to t Delete a character in s (cost 1) Insert a character in t (cost 1) Substitute one character for another (cost 1)
26 Levenstein distance - example distance( William Cohen, Willliam Cohon ) s1 W I L L I A M _ C O H E N s2 op W I L L L I A M _ C O H O N C C C C I C C C C C C C S C cost
27 Levenstein distance - example distance( William Cohen, Willliam Cohon ) s1 W I L L gap I A M _ C O H E N s2 W I L L L I A M _ C O H O N C C C C I C C C C C C C S C op cost
28 Computing Levenstein distance 1/4 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //insert D(i,j-1)+1 //delete
29 Computing Levenstein distance 2/4 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j
30 Computing Levenstein distance 3/4 D(i-1,j-1) + d(si,tj) //subst/copy D(i,j)= min D(i-1,j)+1 //insert D(i,j-1)+1 //delete C O H E N M C C O H N = D(s,t)
31 Computing Levenstein distance 4/4 D(i-1,j-1) + d(si,tj) //subst/copy D(i,j) = min D(i-1,j)+1 //insert D(i,j-1)+1 //delete A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1) C O H E N M C C O H N
32 Needleman-Wunch distance D(i-1,j-1) + d(si,tj) //subst/copy D(i,j) = min D(i-1,j) + G //insert D(i,j-1) + G //delete d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutibility, etc) G = gap cost William Cohen Wukkuan Cigeb
33 Pre-filtering If editdistance(s1, s2) k, then Qgrams(s1) Qgrams(s2) v Where v = max( s1, s2 ) 1 (k-1) * q Where q = length of the q-gram Thus, can use any of the signature schemes 33
34 Jaccard Recap: Jaccard similarity: J(S, T) = S T / S T Problem: given large collection of sets S1, S2,, Sn, and given a threshold s, find all pairs Si, Sj s.t. J(Si, Sj) > s 34
35 Similarity Joins CSEP514 - Winter
36 Similarity joins Problem: We are given n sets s1, s2,..., sn Find quickly all pairs s.t. J(Si,Sj) > s False positives, false negatives are OK 36
37 Application 1: Collaborative Filtering We have n customers: 1, 2,, n Each customer i buys a set of items Si We would like to recommend items bought by customer j, if J(Si, Sj) > s 37
38 Application 2: Similar Documents Given n documents d1, d2,, dn Let Si be the set of q-grams for document i Want to find all pairs of similar documents, i.e. for which J(Si, Sj) > s 38
39 Example You work for a copyright violation detection company Customer: has 10 6 documents Web: has pages Your job is to find almost identical documents 39
40 The Signature Method For each Si, compute a signature Sig(Si) s.t. J(Si,Sj) > s çè Sig(Si) Sig(Sj) emptyset With high probability 40
41 The Signature Method Step 1: compute all pairs i,j for which Sig(Si) Sig(Sj) emptyset This is a join operation! Step 2: for all such pairs, return (i,j) if J(Si,Sj) > s Hopefully only a few such pairs Both false positives and false negatives are possible 41
42 Discussion We will assume that each Si is a string, and we associate it with the set of its q-grams The signature Sig(Si) is also a set of something (to be explained) CSEP514 - Winter
43 Signature 1 = the q-grams! Obviously: s1 s2 > k è s1 s2 emptyset SELECT DISTINCT x.*, y.* FROM Company1 x, Company2 y, CQ1 u, CQ2 v WHERE x.id = u.id and y.id = v.id and u.qgram = v.qgram Lots of false positives 43
44 Signature 2: Prefix Filter Suppose all sets si have same cardinality L Order the q-grams lexicographically Define sig(s) = {the first L-k+1 smallest q- grams in q} 44
45 Signature 2: Prefix Filter Example: L = 8, k = 6, L k + 1 = 3 s1 = {was, ash, shi, hin, ing, ngt, gto, ton} = {ash, gto, hin, ing, ngt, shi, ton, was} Sig(s1) = {ash, gto, hin} 45
46 Signature 2: Prefix Filter Fact: if s1 s2 > k then sig(s1) sig(s2) emptyset Why? 46
47 Signature 3: Minhash Let π be an arbitrary permutation of the domain For each i, let: mh(si) = {the smallest element in Si according to π} 47
48 Example The entire domain is {a,b,c,d,e,f,g,h} The set Si is Si = {a,b,c,e,f} Suppose we choose the permutation: π = d,g,c,h,b,f,a,e Then what is mh(si) =? 48
49 Minhash Main property: Probability(mh(Si) = mh(sj)) = J(Si,Sj) Why? 49
50 Warmup Question Choose a random permutation π of {a,b,c,z} S= a b c d What is the probability that mh(s) = c? 50
51 Warmup Question Choose a random permutation π of {a,b,c,z} S= a b c d What is the probability that mh(s) = c? Answer: P = ¼ (each of a,b,c,d can be the min) 51
52 Main Property Sj Si f a b c e d What is Prob(mh(Si) = mh(sj))? 52
53 Main Property Sj Si f a b c e d What is Prob(mh(Si) = mh(sj))? 53 Answer: 2/6 = J(Si,Sj)
54 Computing Minhashes We use a hash function (which we assume is random) mh(s) { v = ; forall x in S do if h(x) < v then {v = h(x); y = x;} return y; } 54
55 Example The set Si is Si = {a,b,c,e,f} Compute h: h(a)=77,h(b)=55,h(c)=33,h(e)=88,h(f)=66 Then what is mh(si) =? 55
56 Using Minhashes for Fuzzy Join Recall: we have n sets S1,, Sn Want to return all pairs s.t. J(Si,Sj) > s Compute mh(s1),, mh(sn) Return pairs for which mh(si)=mh(sj) The probability of returning (Si,Sj) is J(Si,Sj) Probability of a false positive = J(Si,Sj) Probability of a false negative = 1-J(Si,Sj) 56
57 Example We have n = 1,000,000 records S1,..., Sn We want pairs s.t. J(Si,Sj) > 0.99 There are ordered pairs (Si,Sj) Suppose 2,000,000 pairs have J(Si,Sj)>0.99 the others have J(Si,Sj) = 0.5 # false positives = 0.5 * This is bad # false negatives = 0.01 * 2,000,000 = 20,000
58 Naïve Idea Use r independent hash functions h 1,, h r For each Si, compute a block of minhashes: MH(Si) = the r minhashes for each j=1,,r Return all pairs where MH(Si)=MH(Sj) The probability of returning (Si,Sj) is (J(Si,Sj)) r Probability of a false positive = (J(Si,Sj)) r Probability of a false negative = 1-(J(Si,Sj)) r 58
59 Example r = 2 The set Si is Si = {a,b,c,e,f} Compute h1: h1(a)=77,h1(b)=55,h1(c)=33,h1(e)=88,h1(f)=66 Compute h2: h2(a)=22,h2(b)=66,h2(c)=55,h2(e)=11,h2(f)=44 Then what is MH(Si) =? 59
60 Example r = 2 The set Si is Si = {a,b,c,e,f} Compute h1: h1(a)=77,h1(b)=55,h1(c)=33,h1(e)=88,h1(f)=66 Compute h2: h2(a)=22,h2(b)=66,h2(c)=55,h2(e)=11,h2(f)=44 Then what is MH(Si) =? Answer: MH(Si) = ce (an ordered pair) 60
61 Signature 4: LSH Locality Sensitive Hashing b blocks of minhashes MH 1,, MH b Each MH k has size r Total size of the signature: m = b*r Return pairs (Si,Sj) s.t. there exists k MH k (S i ) = MH k (S j ) Still a single simple join! 61
62 Example n=7 m=6 b=3 r=2 S1 S2 S3 S4 S5 S6 S r rows Each rectangle (=band) becomes one new hash value
63 Analysis Goal: want to compute the probability that Sig(MH(Si)) Sig(MH(Sj)) emptyset, as a function of s = J(Si, Sj) Let s = J(Si, Sj) 63
64 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal?
65 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? 7 7 Answer: s 65
66 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? r rows 66
67 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? r rows Answer: s r 67
68 Analysis Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands 68
69 Analysis Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands Answer: 1-(1-s r 69) b
70 This is precisely the probability that Sig(Si) Sig(Sj ) emptyset Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands Answer: 1-(1-s r ) b 70
71 Analysis 1 P(s) = 1-(1-s r ) b 0 s 0 (1/b) 1/r 71 1
72 Putting it together You work for a copyright violation detection company Customers: have documents 1, 2, 3,., 10 6 Web: has pages 1, 2, 3,., Your job is to find almost identical documents 72
73 Doc DocID Qgram 1 To 1 o b 1 be 1 be 1 e o 1 or 1 or 1 r n 1 no Step 1: Q-grams Web URL Qgram abc.com Oba abc.com Bam abc.com ama abc.com ma abc.com a h bcd.com
74 Step 2: Compute m Min-hashes docid mh1 mh2 mh3 mh4 mh To que Ion or Not url mh1 mh2 mh3 mh4 mh50 0 abc.com sec ret def ens bcd.com Note: m = a few hundreds 74
75 Step 3: Compute Signatures docid h(mh1,,mh20) h(mh21,,mh40) docsig docid Sig websig url Sig @1 abc.com @ @2 abc.com 3232@2 1 abc.com @25 abc.com @ @1 bcd.com 23423@
76 Step 4: Find docs with common signatures SELECT DISTINCT docsig.docid, websig.url FROM docsig, websig WHERE docsig.sig = websig.sig Note: this is a SINGLE JOIN query!! Still need to filter out the false positives (easy) 76
CSE 321 Discrete Structures
CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold
More informationFinding similar items
Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationSimilarity Search. Stony Brook University CSE545, Fall 2016
Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages
More informationOutline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties
Outline Approximation: Theory and Algorithms Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 3 March 13, 2009 2 3 Nikolaus Augsten (DIS) Approximation: Theory and
More informationCOMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from
COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard
More informationCS246 Final Exam, Winter 2011
CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including
More informationHigh Dimensional Search Min- Hashing Locality Sensi6ve Hashing
High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of
More informationWhy duplicate detection?
Near-Duplicates Detection Naama Kraus Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze Some slides are courtesy of Kira Radinsky Why duplicate detection?
More informationB490 Mining the Big Data
B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations
More informationDATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 4 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationAlgorithms for Data Science: Lecture on Finding Similar Items
Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar
More informationTheory of LSH. Distance Measures LS Families of Hash Functions S-Curves
Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes
More informationToday s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara
Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee
More informationCOMPSCI 514: Algorithms for Data Science
COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018 Lecture 9 Similarity Queries Few words about the exam The exam is Thursday (Oct 4) in two days In
More informationSlides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationCS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya
CS62: Scalable Data Mining Similarity Search and Hashing Sourangshu Bha>acharya Finding Similar Items Distance Measures Goal: Find near-neighbors in high-dim. space We formally define near neighbors as
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More information1 Finding Similar Items
1 Finding Similar Items This chapter discusses the various measures of distance used to find out similarity between items in a given set. After introducing the basic similarity measures, we look at how
More informationPiazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)
4/0/9 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 Piazza Recitation session: Review of linear algebra Location: Thursday, April, from 3:30-5:20 pm in SIG 34
More informationAnalysis and Design of Algorithms Dynamic Programming
Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................
More informationAccelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance
Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance Sharon Goldwater (based on slides by Philipp Koehn) 20 September 2018 Sharon Goldwater ANLP Lecture
More informationMinimum Edit Distance. Defini'on of Minimum Edit Distance
Minimum Edit Distance Defini'on of Minimum Edit Distance How similar are two strings? Spell correc'on The user typed graffe Which is closest? graf gra@ grail giraffe Computa'onal Biology Align two sequences
More informationMACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance
MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance Jingbo Shang, Jian Peng, Jiawei Han University of Illinois, Urbana-Champaign May 6, 2016 Presented by Jingbo Shang 2 Outline
More informationUnsupervised Vocabulary Induction
Infant Language Acquisition Unsupervised Vocabulary Induction MIT (Saffran et al., 1997) 8 month-old babies exposed to stream of syllables Stream composed of synthetic words (pabikumalikiwabufa) After
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More information7 Distances. 7.1 Metrics. 7.2 Distances L p Distances
7 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards
More informationBloom Filters, Minhashes, and Other Random Stuff
Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationBio nformatics. Lecture 3. Saad Mneimneh
Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per
More informationOutline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009
Outline Approximation: Theory and Algorithms The Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 1 Nikolaus Augsten (DIS) Approximation: Theory and
More informationMore Dynamic Programming
Algorithms & Models of Computation CS/ECE 374, Fall 2017 More Dynamic Programming Lecture 14 Tuesday, October 17, 2017 Sariel Har-Peled (UIUC) CS374 1 Fall 2017 1 / 48 What is the running time of the following?
More informationAlgorithms Exam TIN093 /DIT602
Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405
More informationOECD QSAR Toolbox v.4.1. Tutorial illustrating new options for grouping with metabolism
OECD QSAR Toolbox v.4.1 Tutorial illustrating new options for grouping with metabolism Outlook Background Objectives Specific Aims The exercise Workflow 2 Background Grouping with metabolism is a procedure
More informationLecture 5: The Shift-And Method
Biosequence Algorithms, Spring 2005 Lecture 5: The Shift-And Method Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 5: Shift-And p.1/19 Seminumerical String Matching Most
More informationFoundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM
Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Alex Lascarides (Slides from Alex Lascarides and Sharon Goldwater) 2 February 2019 Alex Lascarides FNLP Lecture
More informationCSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo
CSE 431/531: Analysis of Algorithms Dynamic Programming Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Paradigms for Designing Algorithms Greedy algorithm Make a
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08
More informationApproximation: Theory and Algorithms
Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation:
More informationAlgorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser
Algorithms and Data Structures Complexity of Algorithms Ulf Leser Content of this Lecture Efficiency of Algorithms Machine Model Complexity Examples Multiplication of two binary numbers (unit cost?) Exact
More informationIn-Depth Assessment of Local Sequence Alignment
2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.
More informationCS425: Algorithms for Web Scale Data
CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org Customer
More information6 Distances. 6.1 Metrics. 6.2 Distances L p Distances
6 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards
More informationDATA MINING LECTURE 3. Frequent Itemsets Association Rules
DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.
More informationCS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)
CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationMore Dynamic Programming
CS 374: Algorithms & Models of Computation, Spring 2017 More Dynamic Programming Lecture 14 March 9, 2017 Chandra Chekuri (UIUC) CS374 1 Spring 2017 1 / 42 What is the running time of the following? Consider
More informationEfficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints
Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition
More information3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT
3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode
More informationInteger Sorting on the word-ram
Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words
More informationHash tables. Hash tables
Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key
More informationMA 162: Finite Mathematics - Section 3.3/4.1
MA 162: Finite Mathematics - Section 3.3/4.1 Fall 2014 Ray Kremer University of Kentucky October 6, 2014 Announcements: Homework 3.3 due Tuesday at 6pm. Homework 4.1 due Friday at 6pm. Exam scores were
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More informationChemistry Day 39. Friday, December 14 th Monday, December 17 th, 2018
Chemistry Day 39 Friday, December 14 th Monday, December 17 th, 2018 Do-Now: Reactions Quiz Do-Now 1. Write down today s FLT 2. Copy: KCl + H 2 O à? 3. Identify the type of reaction in #2. 4. Predict the
More informationBioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter
Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationHash tables. Hash tables
Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key
More informationSearching Sear ( Sub- (Sub )Strings Ulf Leser
Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf
More informationIntroduction to Bioinformatics
Introduction to Bioinformatics Lecture : p he biological problem p lobal alignment p Local alignment p Multiple alignment 6 Background: comparative genomics p Basic question in biology: what properties
More informationSamson Zhou. Pattern Matching over Noisy Data Streams
Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt
More informationPairwise & Multiple sequence alignments
Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived
More informationComparing whole genomes
BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will
More informationAlgorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview
Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More informationDynamic Programming. Prof. S.J. Soni
Dynamic Programming Prof. S.J. Soni Idea is Very Simple.. Introduction void calculating the same thing twice, usually by keeping a table of known results that fills up as subinstances are solved. Dynamic
More informationImplementing Approximate Regularities
Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National
More informationCS 584 Data Mining. Association Rule Mining 2
CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationGIS Lecture 4: Data. GIS Tutorial, Third Edition GIS 1
GIS Lecture 4: Data GIS 1 Outline Data Types, Tables, and Formats Geodatabase Tabular Joins Spatial Joins Field Calculator ArcCatalog Functions GIS 2 Data Types, Tables, Formats GIS 3 Directly Loadable
More informationSequence Alignment. Johannes Starlinger
Sequence Alignment Johannes Starlinger his Lecture Approximate String Matching Edit distance and alignment Computing global alignments Local alignment Johannes Starlinger: Bioinformatics, Summer Semester
More informationHashing Data Structures. Ananda Gunawardena
Hashing 15-121 Data Structures Ananda Gunawardena Hashing Why do we need hashing? Many applications deal with lots of data Search engines and web pages There are myriad look ups. The look ups are time
More informationCS 580: Algorithm Design and Analysis
CS 58: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 28 Announcement: Homework 3 due February 5 th at :59PM Midterm Exam: Wed, Feb 2 (8PM-PM) @ MTHW 2 Recap: Dynamic Programming
More information732A61/TDDD41 Data Mining - Clustering and Association Analysis
732A61/TDDD41 Data Mining - Clustering and Association Analysis Lecture 6: Association Analysis I Jose M. Peña IDA, Linköping University, Sweden 1/14 Outline Content Association Rules Frequent Itemsets
More informationEvolutionary Tree Analysis. Overview
CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based
More informationSimilarity Join Size Estimation using Locality Sensitive Hashing
Similarity Join Size Estimation using Locality Sensitive Hashing Hongrae Lee, Google Inc Raymond Ng, University of British Columbia Kyuseok Shim, Seoul National University Highly Similar, but not Identical,
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationHow to Make or Plot a Graph or Chart in Excel
This is a complete video tutorial on How to Make or Plot a Graph or Chart in Excel. To make complex chart like Gantt Chart, you have know the basic principles of making a chart. Though I have used Excel
More informationLecture 4: September 19
CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang
More informationSearching Algorithms. CSE21 Winter 2017, Day 3 (B00), Day 2 (A00) January 13, 2017
Searching Algorithms CSE21 Winter 2017, Day 3 (B00), Day 2 (A00) January 13, 2017 Selection Sort (MinSort) Pseudocode Rosen page 203, exercises 41-42 procedure selection sort(a 1, a 2,..., a n : real numbers
More informationDynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.
Microsoft Research Asia September 5, 2005 1 2 3 4 Section I What is? Definition is a technique for efficiently recurrence computing by storing partial results. In this slides, I will NOT use too many formal
More informationLecture 2: Pairwise Alignment. CG Ron Shamir
Lecture 2: Pairwise Alignment 1 Main source 2 Why compare sequences? Human hexosaminidase A vs Mouse hexosaminidase A 3 www.mathworks.com/.../jan04/bio_genome.html Sequence Alignment עימוד רצפים The problem:
More informationLecture 8 HASHING!!!!!
Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)
More informationMathematics Enhancement Programme
UNIT 9 Lesson Plan 1 Perimeters Perimeters 1 1A 1B Units of length T: We're going to measure the lengths of the sides of some shapes. Firstly we need to decide what units to use - what length-units do
More informationThe Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)
The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set
More informationNearest Neighbor Search with Keywords in Spatial Databases
776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,
More informationCSE 105 Theory of Computation
CSE 105 Theory of Computation http://www.jflap.org/jflaptmp/ Professor Jeanne Ferrante 1 Undecidability Today s Agenda Review: The TM Acceptance problem, A TM The Halting Problem for TM s Other problems
More informationJeffrey D. Ullman Stanford University
Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some
More information6.830 Lecture 11. Recap 10/15/2018
6.830 Lecture 11 Recap 10/15/2018 Celebration of Knowledge 1.5h No phones, No laptops Bring your Student-ID The 5 things allowed on your desk Calculator allowed 4 pages (2 pages double sided) of your liking
More informationReductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York
Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association
More informationRecursion: Introduction and Correctness
Recursion: Introduction and Correctness CSE21 Winter 2017, Day 7 (B00), Day 4-5 (A00) January 25, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Today s Plan From last time: intersecting sorted lists and
More informationAssociation Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University
Association Rules CS 5331 by Rattikorn Hewett Texas Tech University 1 Acknowledgements Some parts of these slides are modified from n C. Clifton & W. Aref, Purdue University 2 1 Outline n Association Rule
More informationToday: Amortized Analysis
Today: Amortized Analysis COSC 581, Algorithms March 6, 2014 Many of these slides are adapted from several online sources Today s class: Chapter 17 Reading Assignments Reading assignment for next class:
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
More informationCSE 105 Theory of Computation
CSE 105 Theory of Computation http://www.jflap.org/jflaptmp/ Professor Jeanne Ferrante 1 Undecidability Today s Agenda Review and More Problems A Non-TR Language Reminders and announcements: HW 7 (Last!!)
More informationSequence modelling. Marco Saerens (UCL) Slides references
Sequence modelling Marco Saerens (UCL) Slides references Many slides and figures have been adapted from the slides associated to the following books: Alpaydin (2004), Introduction to machine learning.
More informationHash-based Indexing: Application, Impact, and Realization Alternatives
: Application, Impact, and Realization Alternatives Benno Stein and Martin Potthast Bauhaus University Weimar Web-Technology and Information Systems Text-based Information Retrieval (TIR) Motivation Consider
More informationPageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10
PageRank Ryan Tibshirani 36-462/36-662: Data Mining January 24 2012 Optional reading: ESL 14.10 1 Information retrieval with the web Last time we learned about information retrieval. We learned how to
More information