Similarity Search. Stony Brook University CSE545, Fall 2016
1 Similarity Search Stony Brook University CSE545, Fall 2016
2 Finding Similar Items Applications: Document Similarity (mirrored web-pages, plagiarism, similar news); Recommendations (online purchases, movie ratings); Entity Resolution; Fingerprint Matching
3 Finding Similar Items: What we will cover Set Similarity Shingling Minhashing Locality-sensitive hashing Embeddings Distance Metrics High-Degree of Similarity
4 Document Similarity Challenge: How to represent the document in a way that can be efficiently encoded and compared?
5 Shingles Goal: Convert documents to sets
6 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams ) - sequence of k characters
7 Shingles Goal: Convert documents to sets. k-shingles (aka character n-grams) - sequences of k characters. E.g. k=2, doc= abcdabd, shingles(doc, 2) = {ab, bc, cd, da, bd}
8 Shingles Goal: Convert documents to sets. k-shingles (aka character n-grams) - sequences of k characters. E.g. k=2, doc= abcdabd, shingles(doc, 2) = {ab, bc, cd, da, bd}. Similar documents will have many common shingles; changing words or their order has minimal effect. In practice use 5 < k < 10.
9 Shingles Goal: Convert documents to sets. k-shingles (aka character n-grams) - sequences of k characters. k should be large enough that any given shingle appearing in a document is highly unlikely (e.g. < .1% chance). Can hash large shingles to smaller values (e.g. 9-shingles into 4 bytes). Can also use words (aka n-grams). Similar documents will have many common shingles; changing words or their order has minimal effect. In practice use 5 < k < 10.
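As a sketch, the shingling step above can be written in a few lines of Python (the function names and the particular hashing scheme are illustrative, not from the slides):

```python
def shingles(doc, k):
    """Return the set of k-shingles (character k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc, k, buckets=1 << 32):
    """Hash each shingle into a 4-byte integer to save space."""
    return {hash(s) % buckets for s in shingles(doc, k)}

print(sorted(shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```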
10 Shingles Problem: Even with hashing, sets of shingles are large (e.g. 4 bytes per shingle => ~4x the size of the document).
11 Minhashing Goal: Convert sets to shorter ids, signatures
12 Minhashing - Background Goal: Convert sets to shorter ids, signatures. Characteristic Matrix: one row per element (shingle), one column per set; an entry is 1 if the set contains the element, 0 otherwise. Often very sparse! (lots of zeros) (Leskovec et al., 2014)
14 Minhashing - Background Example characteristic matrix over sets S1, S2:
     S1  S2
  ab  1   1
  bc  0   1
  de  0   1
  ah  1   1
  ha  0   0
  ed  1   1
  ca  0   1
Jaccard Similarity: sim(S1, S2) = (# rows both have a 1) / (# rows at least one has a 1) = 3/6
17 Minhashing - Background How many different rows are possible? 1, 1 -- type a; 1, 0 -- type b; 0, 1 -- type c; 0, 0 -- type d. sim(S1, S2) = a / (a + b + c)
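The Jaccard similarity above can be computed directly from the set representation; as a sketch, assuming S1 and S2 hold the shingles with a 1 in their example columns:

```python
def jaccard(s1, s2):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

S1 = {"ab", "ah", "ed"}
S2 = {"ab", "bc", "de", "ah", "ed", "ca"}
print(jaccard(S1, S2))  # 0.5, i.e. 3/6
```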
20 Minhashing Characteristic Matrix: now four sets S1, S2, S3, S4 over the rows ab, bc, de, ah, ha, ed, ca. (Leskovec et al., 2014)
21 Minhashing Minhash function h: based on a permutation of the rows of the characteristic matrix, h maps each set to the first row (in permuted order) in which the set has a 1.
22 Minhashing Example permuted order: ha 1, ed 2, ab 3, bc 4, ca 5, ah 6, de 7. Applying h: h(S1) = ed #permuted row 2; h(S2) = ha #permuted row 1; h(S3) = ed #permuted row 2; h(S4) = ha #permuted row 1
27 Minhashing Signature matrix M: record the first permuted row in which each set had a 1. For this permutation the signature row is h1 = [2, 1, 2, 1] for S1, S2, S3, S4.
30 Minhashing Repeating with a second and a third permutation adds rows h2 and h3 to the signature matrix, each recording for every set the first permuted row containing a 1. (Leskovec et al., 2014)
35 Minhashing Property of the signature matrix: for any minhash function h_i (i.e. any row), the probability that h_i(S1) = h_i(S2) is the same as Sim(S1, S2).
36 Minhashing Thus, the similarity of signatures S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree.
37 Minhashing Estimate with a random sample of permutations (i.e. ~100).
38 Minhashing In the small example: Estimated Sim(S1, S3) = # rows where the signatures agree / # rows = 2/3. Real Sim(S1, S3) = Type a / (a + b + c) = 3/4. Try Sim(S1, S4) and Sim(S1, S2). (Leskovec et al., 2014)
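Estimating similarity from signatures is just column agreement; a minimal sketch (the 3-row signature values below are illustrative, standing in for the slides' small example):

```python
def signature_similarity(sig1, sig2):
    """Fraction of minhash rows on which two signatures agree."""
    assert len(sig1) == len(sig2)
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

# Two hypothetical 3-row signatures that agree on 2 of their 3 rows:
print(signature_similarity([2, 1, 2], [2, 1, 4]))  # 2/3 ≈ 0.667
```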
41 Minhashing To implement Problem: Can't actually do permutations (huge space). Can't randomly grab rows according to an order (random disk seeks = slow!)
42 Minhashing Solution: Use random hash functions (known as efficient minhashing). Setup: Pick ~100 hash functions, hashes. Store M[i][s] = a potential minimum h_i(r) #initialized to infinity (num hashes x num sets)
43 Minhashing Algorithm:
for r in rows of cm:                      #cm is characteristic matrix
    compute h_i(r) for all i in hashes    #produces 100 precomputed values
    for each set s in row r:
        if cm[r][s] == 1:
            for i in hashes:              #check which hash produces smallest value
                if h_i(r) < M[i][s]:
                    M[i][s] = h_i(r)
45 Minhashing What hash functions to use? Start with a decent function, e.g. h1(x) = ascii(string) % large_prime_number. Add a random multiple and addition: h2(x) = (a * ascii(string) + b) % large_prime_number
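Putting the setup, algorithm, and hash family together, a sketch in Python (the data layout is an assumption: `cm` is stored sparsely as, for each set, the indices of the rows containing a 1):

```python
import random

PRIME = 2_147_483_647  # a large prime, used as the hash modulus

def make_hashes(n, seed=0):
    """n random hash functions of the form h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_signatures(cm, num_rows, hashes):
    """cm: dict set_name -> set of row indices where that set has a 1.
    Returns dict set_name -> signature (list of minhash values)."""
    M = {s: [float("inf")] * len(hashes) for s in cm}  # initialized to infinity
    for r in range(num_rows):
        hvals = [(a * r + b) % PRIME for a, b in hashes]  # precompute h_i(r)
        for s, rows in cm.items():
            if r in rows:                     # cm[r][s] == 1
                for i, hv in enumerate(hvals):
                    if hv < M[s][i]:          # keep the smallest value seen
                        M[s][i] = hv
    return M

# Signature agreement approximates Jaccard similarity:
hashes = make_hashes(100)
cm = {"S1": set(range(50)), "S2": set(range(25)) | set(range(50, 75))}
M = minhash_signatures(cm, 75, hashes)
agree = sum(a == b for a, b in zip(M["S1"], M["S2"])) / 100
print(agree)  # close to the true Jaccard similarity, 25/75 = 1/3
```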
47 Minhashing Problem: Even with hashing, sets of shingles are large (e.g. 4 bytes per shingle => ~4x the size of the document). New Problem: Even if the signatures are small, it can be computationally expensive to find similar pairs. E.g. 1,000,000 documents: 1,000,000 choose 2 ≈ 500,000,000,000 pairs
48 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity.
49 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. If we wanted the similarity for all pairs of documents, could anything be done?
50 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. Approach: Hash multiple times: similar items are likely in the same bucket.
51 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. Approach: Hash multiple times: similar items are likely in the same bucket. Approach from MinHash: Hash columns of signature matrix Candidate pairs end up in the same bucket. (LSH is a type of near-neighbor search)
52 Locality-Sensitive Hashing Step 1: Add bands: divide the signature matrix into b bands of r rows each. (Leskovec et al., 2014)
53 Locality-Sensitive Hashing Step 1: Add bands. The choice of b and r can be tuned to catch most true positives with the fewest false positives. (Leskovec et al., 2014)
54 Locality-Sensitive Hashing Step 2: Hash columns within bands. (Leskovec et al., 2014)
57 Locality-Sensitive Hashing Step 2: Hash columns within bands. Criteria for being a candidate pair: they end up in the same bucket for at least 1 band. (Leskovec et al., 2014)
58 Locality-Sensitive Hashing Step 2: Hash columns within bands. Simplification: assume there are enough buckets compared to rows per band that columns must be identical within a band in order to hash to the same bucket. Thus, we only need to check whether columns are identical within a band. (Leskovec et al., 2014)
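The two banding steps above can be sketched as follows (the representation is an assumption: signatures as lists, with each band hashed by using the tuple of its values as a dictionary key, matching the simplification that identical band columns share a bucket):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """signatures: dict doc_id -> minhash signature of length bands * rows_per_band.
    Returns pairs whose signatures are identical in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[band].append(doc)          # same band values -> same bucket
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
print(lsh_candidate_pairs(sigs, bands=2, rows_per_band=2))  # {('A', 'B')}
```

Only A and B become a candidate pair: they agree in the first band even though their full signatures differ.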
59 Realistic Example: Probabilities of agreement. 100,000 documents; 100 random permutations/hash functions/rows => with 4-byte integers, 40MB to hold the signature matrix => still, 100k choose 2 is a lot (~5 billion pairs). 20 bands of 5 rows. Want 80% Jaccard Similarity.
61 Realistic Example: For a pair at 80% similarity, for any row P(S1 == S2) = .8. P(S1 == S2 within a given band) = .8^5 = .328 => P(S1 != S2 within a given band) = 1 - .328 = .672. P(S1 != S2 in every band) = .672^20 ≈ .00035, so such a pair becomes a candidate with probability ≈ .99965.
63 Realistic Example: What if wanting 40% Jaccard Similarity?
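The calculation above generalizes to the standard LSH S-curve, 1 - (1 - s^r)^b, for a pair with similarity s under b bands of r rows; a quick check of both cases:

```python
def candidate_probability(sim, bands, rows_per_band):
    """P(a pair with Jaccard similarity sim becomes a candidate pair):
    1 - (1 - sim**r)**b, the LSH S-curve."""
    return 1 - (1 - sim ** rows_per_band) ** bands

print(round(candidate_probability(0.8, 20, 5), 5))  # 0.99964 -- nearly always caught
print(round(candidate_probability(0.4, 20, 5), 3))  # 0.186 -- usually filtered out
```

With 20 bands of 5 rows, 80%-similar pairs are almost never missed, while 40%-similar pairs are candidates less than a fifth of the time.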
64 Document Similarity Pipeline: Shingling -> Minhashing -> Locality-sensitive hashing
65 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).
66 Distance Metrics Typical properties of a distance metric d: d(x, x) = 0; d(x, y) = d(y, x); d(x, y) <= d(x, z) + d(z, y)
67 Distance Metrics There are other metrics of similarity, e.g.: Euclidean Distance ("L2 Norm"), Cosine Distance, Edit Distance, Hamming Distance
70 Locality Sensitive Hashing - Theory LSH can be generalized to many distance metrics by converting the output to a probability and providing a lower bound on the probability of being similar. E.g. for Euclidean distance: choose random lines (analogous to hash functions in minhashing); project the two points onto each line; the points match if they fall within the same interval.
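A minimal sketch of that Euclidean-distance variant (random Gaussian directions stand in for the "random lines", and the bucket width is an assumed tuning parameter, not a value from the slides):

```python
import random

def random_line(dim, rng):
    """A random direction in dim-dimensional space (a Gaussian vector)."""
    return [rng.gauss(0, 1) for _ in range(dim)]

def project_bucket(point, line, width):
    """Project point onto line and bucket the projection into intervals
    of the given width; nearby points tend to share buckets."""
    projection = sum(p * l for p, l in zip(point, line))
    return int(projection // width)

rng = random.Random(42)
lines = [random_line(3, rng) for _ in range(4)]  # analogous to hash functions
p = [1.0, 2.0, 3.0]
signature = [project_bucket(p, line, 1.0) for line in lines]
print(signature)  # one bucket id per random line
```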
More informationHigh Dimensional Geometry, Curse of Dimensionality, Dimension Reduction
Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,
More informationDATA MINING LECTURE 3. Frequent Itemsets Association Rules
DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.
More informationIntrusion Detection and Malware Analysis
Intrusion Detection and Malware Analysis IDS feature extraction Pavel Laskov Wilhelm Schickard Institute for Computer Science Metric embedding of byte sequences Sequences 1. blabla blubla blablabu aa 2.
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationFrequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:
Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify
More informationD B M G Data Base and Data Mining Group of Politecnico di Torino
Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket
More informationAn Efficient Partition Based Method for Exact Set Similarity Joins
An Efficient Partition Based Method for Exact Set Similarity Joins Dong Deng Guoliang Li He Wen Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China. {dd11,wenhe1}@mails.tsinghua.edu.cn;{liguoliang,fengjh}@tsinghua.edu.cn
More informationAssociation Rules. Fundamentals
Politecnico di Torino Politecnico di Torino 1 Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket counter Association rule
More informationDATA MINING - 1DL360
DATA MINING - 1DL36 Fall 212" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht12 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala
More informationa. Define a function called an inner product on pairs of points x = (x 1, x 2,..., x n ) and y = (y 1, y 2,..., y n ) in R n by
Real Analysis Homework 1 Solutions 1. Show that R n with the usual euclidean distance is a metric space. Items a-c will guide you through the proof. a. Define a function called an inner product on pairs
More informationThe Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)
The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationD B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.
Definitions Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Itemset is a set including one or more items Example: {Beer, Diapers} k-itemset is an itemset that contains k
More informationD B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example
Association rules Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket
More informationCISC 4631 Data Mining
CISC 4631 Data Mining Lecture 02: Data Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) 1 10 What is Data? Collection of data objects and their attributes An attribute
More informationarxiv: v1 [cs.db] 2 Sep 2014
An LSH Index for Computing Kendall s Tau over Top-k Lists Koninika Pal Saarland University Saarbrücken, Germany kpal@mmci.uni-saarland.de Sebastian Michel Saarland University Saarbrücken, Germany smichel@mmci.uni-saarland.de
More informationFast Approximate k-way Similarity Search. Anshumali Shrivastava. Ping Li
Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS 2013 1 Fast Approximate k-way Similarity Search Anshumali Shrivastava Dept. of Computer Science Cornell University
More informationCSE 215: Foundations of Computer Science Recitation Exercises Set #5 Stony Brook University. Name: ID#: Section #: Score: / 4
CSE 215: Foundations of Computer Science Recitation Exercises Set #5 Stony Brook University Name: ID#: Section #: Score: / 4 Unit 10: Proofs by Contradiction and Contraposition 1. Prove the following statement
More informationDATA MINING - 1DL360
DATA MINING - DL360 Fall 200 An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht0 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala
More information6664/01 Edexcel GCE Core Mathematics C2 Bronze Level B4
Paper Reference(s) 6664/01 Edexcel GCE Core Mathematics C Bronze Level B4 Time: 1 hour 30 minutes Materials required for examination papers Mathematical Formulae (Green) Items included with question Nil
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More informationTopic 4 Randomized algorithms
CSE 103: Probability and statistics Winter 010 Topic 4 Randomized algorithms 4.1 Finding percentiles 4.1.1 The mean as a summary statistic Suppose UCSD tracks this year s graduating class in computer science
More informationMetric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)
Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy
More informationCS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =
CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14 1 Probability First, recall a couple useful facts from last time about probability: Linearity of expectation: E(aX + by ) = ae(x)
More informationError Detection and Correction: Small Applications of Exclusive-Or
Error Detection and Correction: Small Applications of Exclusive-Or Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Exclusive-Or (XOR,
More informationCSE446: non-parametric methods Spring 2017
CSE446: non-parametric methods Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Linear Regression: What can go wrong? What do we do if the bias is too strong? Might want
More informationNotion of Distance. Metric Distance Binary Vector Distances Tangent Distance
Notion of Distance Metric Distance Binary Vector Distances Tangent Distance Distance Measures Many pattern recognition/data mining techniques are based on similarity measures between objects e.g., nearest-neighbor
More informationThe Problem with ASICBOOST
The Problem with ACBOOT Jeremy Rubin April 11, 2017 1 ntro Recently there has been a lot of interesting material coming out regarding AC- BOOT. ACBOOT is a mining optimization that allows one to mine many
More informationLecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.
CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the
More informationLasso Regression: Regularization for feature selection
Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 Feature selection task 1 Why might you want to perform feature selection? Efficiency: - If size(w)
More information2 How many distinct elements are in a stream?
Dealing with Massive Data January 31, 2011 Lecture 2: Distinct Element Counting Lecturer: Sergei Vassilvitskii Scribe:Ido Rosen & Yoonji Shin 1 Introduction We begin by defining the stream formally. Definition
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining
More informationSOLUTIONS OF 2012 MATH OLYMPICS LEVEL II T 3 T 3 1 T 4 T 4 1
SOLUTIONS OF 0 MATH OLYMPICS LEVEL II. If T n = + + 3 +... + n and P n = T T T 3 T 3 T 4 T 4 T n T n for n =, 3, 4,..., then P 0 is the closest to which of the following numbers? (a).9 (b).3 (c) 3. (d).6
More informationData Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining
Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 10 What is Data? Collection of data objects
More informationData Mining Recitation Notes Week 3
Data Mining Recitation Notes Week 3 Jack Rae January 28, 2013 1 Information Retrieval Given a set of documents, pull the (k) most similar document(s) to a given query. 1.1 Setup Say we have D documents
More informationFrom Non-Negative Matrix Factorization to Deep Learning
The Math!! From Non-Negative Matrix Factorization to Deep Learning Intuitions and some Math too! luissarmento@gmailcom https://wwwlinkedincom/in/luissarmento/ October 18, 2017 The Math!! Introduction Disclaimer
More informationLasso Regression: Regularization for feature selection
Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 1 Feature selection task 2 1 Why might you want to perform feature selection? Efficiency: - If
More informationLocality Sensitive Hashing
Locality Sensitive Hashing February 1, 016 1 LSH in Hamming space The following discussion focuses on the notion of Locality Sensitive Hashing which was first introduced in [5]. We focus in the case of
More informationGetting To Know Your Data
Getting To Know Your Data Road Map 1. Data Objects and Attribute Types 2. Descriptive Data Summarization 3. Measuring Data Similarity and Dissimilarity Data Objects and Attribute Types Types of data sets
More informationAssociation Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci
Association Rules Information Retrieval and Data Mining Prof. Matteo Matteucci Learning Unsupervised Rules!?! 2 Market-Basket Transactions 3 Bread Peanuts Milk Fruit Jam Bread Jam Soda Chips Milk Fruit
More informationPaper Reference. Ruler graduated in centimetres and millimetres, protractor, compasses, pen, HB pencil, eraser. Tracing paper may be used.
Centre No. Candidate No. Paper Reference 1 3 8 0 3 H Paper Reference(s) 1380/3H Edexcel GCSE Mathematics (Linear) 1380 Paper 3 (Non-Calculator) Vectors Past Paper Questions Arranged by Topic Surname Signature
More informationExercises for Unit V (Introduction to non Euclidean geometry)
Exercises for Unit V (Introduction to non Euclidean geometry) V.1 : Facts from spherical geometry Ryan : pp. 84 123 [ Note : Hints for the first two exercises are given in math133f07update08.pdf. ] 1.
More informationA General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY
A General-Purpose Counting Filter: Making Every Bit Count Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY Approximate Membership Query (AMQ) insert(x) ismember(x)
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #12: Frequent Itemsets Seoul National University 1 In This Lecture Motivation of association rule mining Important concepts of association rules Naïve approaches for
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationx + 2y + 3z = 8 x + 3y = 7 x + 2z = 3
Chapter 2: Solving Linear Equations 23 Elimination Using Matrices As we saw in the presentation, we can use elimination to make a system of linear equations into an upper triangular system that is easy
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More information