
1 SIMILARITY SEARCH: The Metric Space Approach. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

2 Table of Contents. Part I: Metric searching in a nutshell — foundations of metric space searching; survey of existing approaches. Part II: Metric searching in large collections — centralized index structures; approximate similarity search; parallel and distributed indexes.

3 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

4 Distance searching problem. A search problem is characterized by the data type, the method of comparison, and the query formulation. Extensibility: a single indexing technique applicable to many specific search problems quite different in nature.

5 Distance searching problem. Traditional search: exact (partial, range) retrieval over sortable domains of data (numbers, strings). Perspective search: proximity, similarity, dissimilarity, distance; domains need not be sortable (Hamming distance, color histograms).

6 Distance searching problem. Definition (divide and conquer): let D be a domain and d a distance measure on objects from D. Given a set X ⊆ D of n elements, preprocess or structure the data so that proximity queries are answered efficiently.

7 Distance searching problem. The metric space as a similarity-search abstraction: only distances are used for searching; there are no coordinates, hence no data-space partitioning. Vector versus metric spaces — three reasons for metric indexes: sometimes there is no other possibility; comparable performance for the special cases; high extensibility.

8 Metric space. Metric space M = (D, d): data domain D and a total (distance) function d: D × D → ℝ (metric function or metric). The metric space postulates: non-negativity: ∀x, y ∈ D: d(x, y) ≥ 0; symmetry: ∀x, y ∈ D: d(x, y) = d(y, x); identity: ∀x, y ∈ D: d(x, y) = 0 ⟺ x = y; triangle inequality: ∀x, y, z ∈ D: d(x, z) ≤ d(x, y) + d(y, z).

9 Metric space. Another specification: (p1) non-negativity: ∀x, y ∈ D: d(x, y) ≥ 0; (p2) symmetry: ∀x, y ∈ D: d(x, y) = d(y, x); (p3) reflexivity: ∀x ∈ D: d(x, x) = 0; (p4) positiveness: ∀x, y ∈ D: x ≠ y ⇒ d(x, y) > 0; (p5) triangle inequality: ∀x, y, z ∈ D: d(x, z) ≤ d(x, y) + d(y, z).

10 Pseudo metric. Property (p4) does not hold. If all objects at distance 0 are treated as a single object, we get a metric space. To be proved: d(x, y) = 0 ⇒ ∀z ∈ D: d(x, z) = d(y, z). Since d(x, z) ≤ d(x, y) + d(y, z) = d(y, z) and d(y, z) ≤ d(y, x) + d(x, z) = d(x, z), we get d(x, z) = d(y, z).

11 Quasi metric. Property (p2 – symmetry) does not hold, e.g., locations in cities with one-way streets. Transformation to a metric space: d_sym(x, y) = d_asym(x, y) + d_asym(y, x).

12 Super metric. Also called the ultra metric. A stronger constraint on (p5): ∀x, y, z ∈ D: d(x, z) ≤ max{ d(x, y), d(y, z) }. Every triangle has at least two sides of equal length, i.e., is isosceles. Used in evolutionary biology (phylogenetic trees).

13 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

14 Distance measures. Discrete: functions which return only a small (predefined) set of values. Continuous: functions for which the cardinality of the set of returned values is very large or infinite.

15 Minkowski distances. Also called the L_p metrics, defined on n-dimensional vectors: L_p(x, y) = (Σ_{i=1..n} |x_i − y_i|^p)^{1/p}.

16 Special cases: L_1 — Manhattan (city-block) distance; L_2 — Euclidean distance; L_∞ — maximum (infinity) distance, L_∞(x, y) = max_{i=1..n} |x_i − y_i|. (Figure: unit circles of L_1, L_2, L_6 and L_∞.)
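As a quick illustration, a minimal Python sketch of the L_p family (the function name is ours, not the book's):

```python
def minkowski(x, y, p):
    """L_p distance between two equally long vectors x and y."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))  # L-infinity
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# On the same pair of vectors, L_1 >= L_2 >= ... >= L_infinity:
x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.5)
print(minkowski(x, y, 1), minkowski(x, y, 2), minkowski(x, y, float("inf")))
```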

17 Quadratic form distance. Handles correlated dimensions (cross-talk), e.g., in color histograms: d_M(x, y) = √((x − y)^T · M · (x − y)), where M is a positive semi-definite n × n matrix. If M = diag(w_1, …, w_n), we get the weighted Euclidean distance: d_M(x, y) = √(Σ_{i=1..n} w_i (x_i − y_i)²).
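A sketch in Python with NumPy (the example matrix below is an arbitrary positive semi-definite choice of ours):

```python
import numpy as np

def quadratic_form_distance(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)); M must be positive semi-definite."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ M @ diff))

# With M = diag(w) this reduces to the weighted Euclidean distance.
M = np.diag([1.0, 2.0, 0.5])
print(quadratic_form_distance([1, 0, 0], [0, 1, 0], M))
```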

18 Example. 3-dimensional vectors over the blue, red and orange color components: pure red: v_red = (0, 1, 0); pure orange: v_orange = (0, 0, 1); pure blue: v_blue = (1, 0, 0). Under the Euclidean distance, the blue and orange images are equidistant from the red one: L_2(v_red, v_orange) = L_2(v_red, v_blue) = √2.

19 Example (continued). Human color perception: red and orange are more alike than red and blue. Matrix specification (rows and columns ordered blue, red, orange; the off-diagonal 0.9 expresses the red–orange correlation): M = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.9], [0.0, 0.9, 1.0]]. Then the distance of red and orange is √0.2 ≈ 0.45, while the distance of red and blue is √2 ≈ 1.41.

20 Edit distance. Also called the Levenshtein distance: the minimum number of atomic operations needed to transform string x into string y. Insert character c into string x at position i: ins(x, i, c) = x_1 x_2 … x_{i−1} c x_i … x_n. Delete the character at position i of string x: del(x, i) = x_1 x_2 … x_{i−1} x_{i+1} … x_n. Replace the character at position i of x with c: replace(x, i, c) = x_1 x_2 … x_{i−1} c x_{i+1} … x_n.

21 Edit distance – weights. If the weights (costs) of the insert and delete operations differ, the edit distance is not symmetric. Example: w_insert = 2, w_delete = 1, w_replace = 1. Then d_edit(combine, combination) = 9 (replacement e→a, insertion of "tion"), but d_edit(combination, combine) = 5 (replacement a→e, deletion of "tion").
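A minimal dynamic-programming sketch of the weighted edit distance (the function name and defaults are ours; the cost values in the demo are the slide's):

```python
def edit_distance(x, y, w_ins=1, w_del=1, w_rep=1):
    """Weighted Levenshtein distance from x to y via dynamic programming."""
    m, n = len(x), len(y)
    # D[i][j] = cost of transforming x[:i] into y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * w_del
    for j in range(1, n + 1):
        D[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if x[i - 1] == y[j - 1] else w_rep
            D[i][j] = min(D[i - 1][j] + w_del,      # delete from x
                          D[i][j - 1] + w_ins,      # insert into x
                          D[i - 1][j - 1] + rep)    # replace (or match)
    return D[m][n]

# Reproduces the asymmetric example above:
print(edit_distance("combine", "combination", w_ins=2))  # 9
print(edit_distance("combination", "combine", w_ins=2))  # 5
```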

22 Edit distance – generalizations. The cost of replacing one character by another can vary: the weight of a→b may differ from that of a→c. If the weights are symmetric (a→b costs the same as b→a), the result is still a metric. Edit distance can also be generalized to tree structures.

23 Jaccard's coefficient. A distance measure for sets A and B: d(A, B) = 1 − |A ∩ B| / |A ∪ B|. The Tanimoto similarity for vectors: d_TS(x, y) = 1 − (x · y) / (‖x‖² + ‖y‖² − x · y), where x · y is the scalar product and ‖x‖ is the Euclidean norm.
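Both measures in a short Python sketch (function names ours):

```python
def jaccard_distance(A, B):
    """d(A, B) = 1 - |A intersect B| / |A union B| for sets A, B."""
    return 1.0 - len(A & B) / len(A | B)

def tanimoto_distance(x, y):
    """1 - x.y / (||x||^2 + ||y||^2 - x.y) for real vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    return 1.0 - dot / (nx + ny - dot)

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
print(tanimoto_distance([1.0, 0.0], [1.0, 1.0])) # 0.5
```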

24 Hausdorff distance. A distance measure for sets that compares elements by a distance d_e. It measures the extent to which each point of the "model" set A lies near some point of the "image" set B, and vice versa. Two sets are within Hausdorff distance r from each other if and only if any point of one set is within distance r from some point of the other set.

25 Hausdorff distance (cont.): d_p(x, B) = inf_{y ∈ B} d_e(x, y); d_p(A, y) = inf_{x ∈ A} d_e(x, y); d_s(A, B) = sup_{x ∈ A} d_p(x, B); d_s(B, A) = sup_{y ∈ B} d_p(A, y); d(A, B) = max{ d_s(A, B), d_s(B, A) }.
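A direct (quadratic-time) Python sketch of these definitions, with the element distance d_e left as a parameter (names ours):

```python
def hausdorff(A, B, d_e):
    """Maximum of the two directed Hausdorff distances between finite sets."""
    d_s_AB = max(min(d_e(x, y) for y in B) for x in A)  # sup_x inf_y
    d_s_BA = max(min(d_e(x, y) for x in A) for y in B)  # sup_y inf_x
    return max(d_s_AB, d_s_BA)

euclid = lambda a, b: abs(a - b)  # 1-D element distance for the demo
print(hausdorff({0.0, 1.0}, {0.5, 4.0}, euclid))  # 3.0
```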

26 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

27 Similarity Queries: range query, nearest neighbor query, reverse nearest neighbor query, similarity join, combined queries, complex queries.

28 Similarity Range Query. Range query: R(q, r) = { x ∈ X | d(q, x) ≤ r }; e.g., all museums up to 2 km from my hotel.

29 Nearest Neighbor Query. The nearest neighbor query: NN(q) = x such that x ∈ X and ∀y ∈ X: d(q, x) ≤ d(q, y). The k-nearest neighbor query: k-NN(q, k) = A such that A ⊆ X, |A| = k, and ∀x ∈ A, y ∈ X − A: d(q, x) ≤ d(q, y); e.g., the five closest museums to my hotel (k = 5).

30 Reverse Nearest Neighbor. k-RNN(q) = { R ⊆ X | ∀x ∈ R: q ∈ k-NN(x) ∧ ∀x ∈ X − R: q ∉ k-NN(x) }; e.g., all hotels with a specific museum as the nearest cultural heritage site.

31 Example of 2-RNN. Objects o_4, o_5 and o_6 have q among their two nearest neighbors. (Figure: objects o_1 … o_6 around the query q.)

32 Similarity Queries. Similarity join of two data sets X ⊆ D, Y ⊆ D: J(X, Y, μ) = { (x, y) ∈ X × Y | d(x, y) ≤ μ }, with μ ≥ 0. Similarity self join: X = Y; e.g., pairs of hotels and museums which are a five-minute walk apart.

33 Combined Queries. Range + nearest neighbors: k-NN(q, r) = { R ⊆ X | |R| = k ∧ ∀x ∈ R, y ∈ X − R: d(q, x) ≤ d(q, y) ∧ d(q, x) ≤ r }. Nearest neighbor + similarity joins: by analogy.

34 Complex Queries. Find the best matches of circular-shape objects with red color. The best match for circular shape or red color alone need not be the best combined match!

35 The A_0 Algorithm. For each predicate i, objects are delivered in decreasing similarity; incrementally build sets X_i with the best matches until the intersection ∩_i X_i contains k objects. For all o ∈ ∪_i X_i, consider all query predicates and establish the final rank (fuzzy algebra, weighted sets, etc.).

36 Foundations of Metric Space Searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

37 Partitioning Principles. Given a set X ⊆ D in M = (D, d), three basic partitioning principles have been defined: ball partitioning, generalized hyper-plane partitioning, and excluded middle partitioning.

38 Ball partitioning. Inner set: { x ∈ X | d(p, x) ≤ d_m }; outer set: { x ∈ X | d(p, x) > d_m }, for a pivot p and partitioning radius d_m.

39 Multi-way ball partitioning. Inner set: { x ∈ X | d(p, x) ≤ d_m1 }; middle set: { x ∈ X | d_m1 < d(p, x) ≤ d_m2 }; outer set: { x ∈ X | d(p, x) > d_m2 }.

40 Generalized Hyper-plane Partitioning. For two pivots p_1, p_2: { x ∈ X | d(p_1, x) ≤ d(p_2, x) } and { x ∈ X | d(p_1, x) > d(p_2, x) }.

41 Excluded Middle Partitioning. Inner set: { x ∈ X | d(p, x) ≤ d_m − ρ }; outer set: { x ∈ X | d(p, x) > d_m + ρ }; excluded set (of width 2ρ): otherwise.
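The three principles in one small Python sketch (pivots and thresholds are parameters; all names ours):

```python
def ball_partition(X, d, p, d_m):
    inner = [x for x in X if d(p, x) <= d_m]
    outer = [x for x in X if d(p, x) > d_m]
    return inner, outer

def hyperplane_partition(X, d, p1, p2):
    left = [x for x in X if d(p1, x) <= d(p2, x)]
    right = [x for x in X if d(p1, x) > d(p2, x)]
    return left, right

def excluded_middle_partition(X, d, p, d_m, rho):
    inner = [x for x in X if d(p, x) <= d_m - rho]
    outer = [x for x in X if d(p, x) > d_m + rho]
    excluded = [x for x in X if d_m - rho < d(p, x) <= d_m + rho]
    return inner, outer, excluded
```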

42 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

43 Basic Strategies. The costs to answer a query are influenced by the partitioning principle and the query execution algorithm. Sequential organization & range query R(q, r): all database objects are consecutively scanned and the distances d(q, o) are evaluated; whenever d(q, o) ≤ r, o is reported in the result.

44 Basic Strategies (cont.). Sequential organization & k-NN query, e.g., 3-NN(q): initially, take the first k objects and order them with respect to their distance from q. All other objects are consecutively scanned and d(q, o) is evaluated. If d(q, o_i) < d(q, o_k), then o_i is inserted at the correct position in the answer and the last neighbor o_k is eliminated.
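Both sequential strategies sketched in Python, following the slides' wording (names ours):

```python
def seq_range_search(X, d, q, r):
    """Sequential scan: report every o with d(q, o) <= r."""
    return [o for o in X if d(q, o) <= r]

def seq_knn_search(X, d, q, k):
    """Sequential scan: keep the k closest objects seen so far."""
    answer = sorted(X[:k], key=lambda o: d(q, o))    # first k objects, ordered
    for o in X[k:]:
        if d(q, o) < d(q, answer[-1]):               # better than current k-th
            answer[-1] = o                           # eliminate the last neighbor
            answer.sort(key=lambda o: d(q, o))       # restore the order
    return answer
```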

45 Hypothetical Index Organization. A hierarchy of entries (nodes) N = (G, R(G)): G = { e | e is an object or e is another entry }, and the bounding region R(G) covers all elements of G — e.g., a ball region: ∀o: d(o, p) ≤ r. Each element belongs to exactly one G, and there is one root entry. Any similarity query Q returns a set of objects; we can define R(Q) as the region covering all objects in the response.

46 Example of Index Organization. Using ball regions: the root node organizes four objects (o_1, o_2, o_3, o_4) and two ball regions; the child ball regions hold two objects (o_8, o_9) and three objects (o_5, o_6, o_7), respectively.

47 Range Search Algorithm. Given Q = R(q, r): start at the root. In the current node N = (G, R(G)), process all elements: for an object element o_j ∈ G, if d(q, o_j) ≤ r, report o_j on the output; for a non-object element N′ = (G′, R(G′)) ∈ G, if R(G′) and R(Q) intersect, recursively search in N′.

48 Range Search Algorithm (cont.). For R(q, r): start by inspecting the elements of B_1; B_3 is not intersected; inspect the elements of B_2; the search is complete. Response = { o_8, o_9 }.
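A recursive Python sketch of the range search above. The node layout (objects plus child ball regions) mirrors the hypothetical index organization; all names are ours:

```python
class Node:
    def __init__(self, pivot, radius, objects=(), children=()):
        self.pivot, self.radius = pivot, radius   # bounding ball R(G)
        self.objects = list(objects)              # object elements of G
        self.children = list(children)            # non-object elements of G

def range_search(node, d, q, r, out):
    for o in node.objects:
        if d(q, o) <= r:                          # object qualifies
            out.append(o)
    for child in node.children:
        if d(q, child.pivot) <= r + child.radius: # R(G') intersects R(Q)
            range_search(child, d, q, r, out)
    return out
```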

49 Nearest Neighbor Search Algorithm. No query radius is given, and we do not know the distance to the k-th nearest neighbor in advance. To allow filtering of unnecessary branches, the query radius is defined as the distance to the current k-th neighbor. A priority queue PR is maintained; it contains the regions that may include objects relevant to the query, sorted by decreasing relevance.

50 NN Search Algorithm (cont.). Given Q = k-NN(q). Assumptions: the query region R(Q) is limited by the distance r to the current k-th neighbor in the response; whenever PR is updated, its entries are sorted by decreasing proximity to q; objects in the response are sorted by increasing distance to q, and the response contains at most k objects. Initialization: put the root node into PR; pick k database objects at random and insert them into the response.

51 NN Search Algorithm (cont.). While PR is not empty, repeat: pick an entry N = (G, R(G)) from PR. For each object element o_j ∈ G: if d(q, o_j) ≤ r, add o_j to the response, update r and R(Q), and remove from PR the entries that can no longer intersect the query. For each non-object element N′ = (G′, R(G′)) ∈ G: if R(G′) and R(Q) intersect, insert N′ into PR. On termination, the response contains the k nearest neighbors of q.
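A compact Python sketch of this loop over the Node tree above, using heapq for PR and a max-heap for the current answer (the answer starts empty and fills lazily instead of with random objects; names ours):

```python
import heapq

def knn_search(root, d, q, k):
    """Best-first k-NN search over a ball-region tree."""
    pr = [(0.0, 0, root)]              # (lower bound of region, tiebreak, node)
    answer = []                        # max-heap: (-d(q,o), tiebreak, o)
    counter = 1
    while pr:
        bound, _, node = heapq.heappop(pr)
        r = -answer[0][0] if len(answer) == k else float("inf")
        if bound > r:                  # no pending region can improve the answer
            break
        for o in node.objects:
            dist = d(q, o)
            if len(answer) < k or dist < r:
                heapq.heappush(answer, (-dist, counter, o)); counter += 1
                if len(answer) > k:
                    heapq.heappop(answer)          # drop the former k-th neighbor
                r = -answer[0][0] if len(answer) == k else float("inf")
        for child in node.children:
            lower = max(d(q, child.pivot) - child.radius, 0.0)
            if lower <= r:             # region still intersects the query ball
                heapq.heappush(pr, (lower, counter, child)); counter += 1
    return sorted((-neg, o) for neg, _, o in answer)
```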

52 NN Search Algorithm (cont.). 3-NN(q): pick three random objects; process B_1; skip B_3; process B_2; PR is empty, so quit. Final result: { o_4, o_8, o_9 }.

53 Incremental Similarity Search. The hypothetical index structure is slightly modified: elements of type 0 are objects e_0; elements e_1 are ball regions containing only objects, i.e., elements e_0 (B_2, B_3); elements e_2 contain elements e_0 and e_1 (e.g., B_1). Elements have associated distance functions from the query object q: d_0(q, e_0) for elements of type e_0, and d_t(q, e_t) for elements of type e_t — e.g., d_t(q, e_t) = d(q, p) − r when e_t is a ball with center p and radius r. For correctness: d_t(q, e_t) ≤ d_0(q, e_0) for every object e_0 contained in e_t.

54 Incremental NN Search. Based on a priority queue PR again: each element e_t in PR also knows its distance d_t(q, e_t), and the entries in the queue are sorted with respect to these distances. Initialization: insert the root element with distance 0 into PR.

55 Incremental NN Search (cont.). While PR is not empty, do: e_t ← the first element of PR; if t = 0 (e_t is an object), report e_t as the next nearest neighbor; else insert each child element e_l of e_t, with its distance d_l(q, e_l), into PR.
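A Python sketch of the incremental variant: one heap holds both regions and objects, keyed by their (lower-bound) distances, and objects are yielded in increasing distance from q (names ours, reusing the Node class from the range-search sketch):

```python
import heapq

def incremental_nn(root, d, q):
    """Yield (object, distance) pairs in increasing distance from q."""
    pr = [(0.0, 0, False, root)]       # (distance, tiebreak, is_object, element)
    counter = 1
    while pr:
        dist, _, is_object, elem = heapq.heappop(pr)
        if is_object:
            yield elem, dist           # the next nearest neighbor
            continue
        for o in elem.objects:         # child elements of type e_0
            heapq.heappush(pr, (d(q, o), counter, True, o)); counter += 1
        for child in elem.children:    # child elements of type e_t, t > 0
            lower = max(d(q, child.pivot) - child.radius, 0.0)
            heapq.heappush(pr, (lower, counter, False, child)); counter += 1

# e.g., take only the first three neighbors:
# import itertools; print(list(itertools.islice(incremental_nn(tree, d, q), 3)))
```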

56 Incremental NN Search (cont.). NN(q): the queue is initialized with B_1 and then repeatedly expanded; regions and objects are interleaved in it according to their distances. Response order: o_8, o_9, o_1, o_2, o_5, o_4, o_6, o_3, o_7.

57 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

58 Avoiding Distance Computations. In metric spaces the distance measure is expensive, e.g., the edit distance or the quadratic form distance. Limiting the number of distance evaluations speeds up the processing of similarity queries. Pruning strategies: object-pivot, range-pivot, pivot-pivot, double-pivot, and pivot filtering.

59 Explanatory Example. An index structure is built over 11 objects {o_1, …, o_11} and applies ball partitioning with pivots p_1, p_2, p_3. For the range query R(q, r), a sequential scan needs 11 distance computations; the reported objects are {o_4, o_6}. (Figure: the ball-partitioned space and the corresponding tree with leaves o_4 o_6 o_10 | o_1 o_5 o_11 | o_2 o_9 | o_3 o_7 o_8.)

60 Object-Pivot Distance Constraint. Usually applied in leaf nodes. Assume the left-most leaf is visited: the distances from q to o_4, o_6, o_10 must be computed. During insertion, the distances from p_2 to o_4, o_6, o_10 were already computed.

61 Object-Pivot Constraint (cont.). Having d(p_2, o_4), d(p_2, o_6), d(p_2, o_10) and d(p_2, q), some distance calculations can be omitted. When estimating d(q, o_10) using only these distances, we cannot determine the exact position of o_10 — it can lie anywhere on a circle around p_2.

62 Object-Pivot Constraint (cont.). o_10 has two extreme positions on that circle. The lower bound on d(q, o_10) is d(p_2, o_10) − d(q, p_2); if it is greater than the query radius, the object cannot qualify (o_10). The upper bound on d(q, o_10) is d(q, p_2) + d(p_2, o_10); if it is less than the query radius, the object directly qualifies (o_6)!

63 Object-Pivot Constraint (summary). Given a metric space M = (D, d) and three objects q, p, o ∈ D, the distance d(q, o) can be constrained by: |d(q, p) − d(p, o)| ≤ d(q, o) ≤ d(q, p) + d(p, o).
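As a sketch (names ours), the bound as a function usable for pruning in a range query R(q, r):

```python
def object_pivot_bounds(d_qp, d_po):
    """|d(q,p) - d(p,o)| <= d(q,o) <= d(q,p) + d(p,o)."""
    return abs(d_qp - d_po), d_qp + d_po

lo, hi = object_pivot_bounds(d_qp=5.0, d_po=2.0)
# lo > r  -> discard o without computing d(q, o)
# hi <= r -> report o without computing d(q, o)
```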

64 Range-Pivot Distance Constraint. Some structures do not store all distances between the database objects o_i and a pivot p; only a range [r_l, r_h] of the distances between p and all o_i is stored. Assume the left-most leaf is to be entered: using the range of distances to the leaf objects, we can decide whether to enter the leaf or not.

65 Range-Pivot Constraint (cont.). Knowing the interval [r_l, r_h] of distances in the leaf, we can optimize: the lower bound is r_l − d(q, p_2); if it is greater than the query radius r, no object in the leaf can qualify. The upper bound is r_h + d(q, p_2); if it is less than the query radius r, all objects in the leaf qualify!

66 Range-Pivot Constraint (cont.). We have considered only one position of q; three positions relative to the range [r_l, r_h] are possible. (Figure: q below r_l, inside the range, and above r_h.)

67 Range-Pivot Constraint (summary). Given a metric space M = (D, d) and objects p, o ∈ D such that r_l ≤ d(o, p) ≤ r_h, and given q ∈ D with known d(q, p), the distance d(q, o) is restricted by: max{ d(q, p) − r_h, r_l − d(q, p), 0 } ≤ d(q, o) ≤ d(q, p) + r_h.
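Again as a small sketch (names ours):

```python
def range_pivot_bounds(d_qp, r_l, r_h):
    """max{d(q,p)-r_h, r_l-d(q,p), 0} <= d(q,o) <= d(q,p)+r_h."""
    return max(d_qp - r_h, r_l - d_qp, 0.0), d_qp + r_h
```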

68 Pivot-Pivot Distance Constraint. In internal nodes we can do more. Assume the root node is examined: we can apply the range-pivot constraint to decide which sub-trees must be visited. The ranges are known, since during the building phase all data objects were compared with p_1.

69 Pivot-Pivot Constraint (cont.). Suppose we have followed the left branch (to p_2). Knowing the distance d(p_1, p_2) and using d(q, p_1), we can apply the object-pivot constraint: d(q, p_2) ∈ [r′_l, r′_h]. We also know the range of distances in p_2's sub-trees: d(o, p_2) ∈ [r_l, r_h].

70 Pivot-Pivot Constraint (cont.). Having d(q, p_2) ∈ [r′_l, r′_h] and d(o, p_2) ∈ [r_l, r_h]: if both ranges intersect, the lower bound on d(q, o) is 0! The upper bound is r_h + r′_h.

71 Pivot-Pivot Constraint (cont.). If the ranges do not intersect, there are two possibilities. The first: [r′_l, r′_h] is below [r_l, r_h]; then the lower bound is r_l − r′_h, while the upper bound remains r_h + r′_h. The second is the inverse — the lower bound is r′_l − r_h.

72 Pivot-Pivot Constraint (summary). Given a metric space M = (D, d) and objects q, p, o ∈ D such that r_l ≤ d(o, p) ≤ r_h and r′_l ≤ d(q, p) ≤ r′_h, the distance d(q, o) can be restricted by: max{ r_l − r′_h, r′_l − r_h, 0 } ≤ d(q, o) ≤ r_h + r′_h.
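The same bound as a sketch (names ours; rq_l, rq_h stand for r′_l, r′_h):

```python
def pivot_pivot_bounds(r_l, r_h, rq_l, rq_h):
    """max{r_l - r'_h, r'_l - r_h, 0} <= d(q,o) <= r_h + r'_h."""
    return max(r_l - rq_h, rq_l - r_h, 0.0), r_h + rq_h
```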

73 Double-Pivot Distance Constraint. The previous constraints use just one pivot along with ball partitioning. Applying generalized hyper-plane partitioning, we have two pivots. No upper bound on d(q, o) can be defined!

74 Double-Pivot Constraint (cont.). If q and o are in different subspaces, the lower bound is (d(q, p_1) − d(q, p_2)) / 2. A hyperbola marks the positions with a constant lower bound; moving q up (so that its visual distance from the equidistant line is preserved) decreases the lower bound. If q and o are in the same subspace, the lower bound is zero.

75 Double-Pivot Constraint (summary). Given a metric space M = (D, d) and objects o, p_1, p_2 ∈ D such that d(o, p_1) ≤ d(o, p_2), and given a query object q ∈ D with known d(q, p_1) and d(q, p_2), the distance d(q, o) can be lower-bounded by: max{ (d(q, p_1) − d(q, p_2)) / 2, 0 } ≤ d(q, o).
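And the corresponding one-liner sketch (name ours):

```python
def double_pivot_lower_bound(d_qp1, d_qp2):
    """Lower bound on d(q,o) for o with d(o,p1) <= d(o,p2); no upper bound exists."""
    return max((d_qp1 - d_qp2) / 2.0, 0.0)
```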

76 Pivot Filtering. An extended object-pivot constraint that uses more pivots and the triangle inequality for pruning. All distances between the objects and a pivot p are known. Prune object o ∈ X if either d(p, o) < d(p, q) − r or d(p, o) > d(p, q) + r holds.

77 Pivot Filtering (cont.). Filtering with two pivots: only the objects in the dark blue region (the intersection of the two rings) have to be checked. The effectiveness improves with more pivots.

78 Pivot Filtering (summary). Given a metric space M = (D, d) and a set of pivots P = { p_1, p_2, p_3, …, p_n }, we define a mapping function Ψ: (D, d) → (ℝ^n, L_∞) as follows: Ψ(o) = (d(o, p_1), …, d(o, p_n)). Then we can bound the distance d(q, o) from below: L_∞(Ψ(o), Ψ(q)) ≤ d(q, o).

79 Pivot Filtering (considerations). Given a range query R(q, r), we want to report all objects o such that d(q, o) ≤ r. Applying pivot filtering, we can discard the objects for which L_∞(Ψ(o), Ψ(q)) > r holds, i.e., for which the lower bound on d(q, o) is greater than r. The mapping Ψ is contractive: no eliminated object can qualify, but some retained objects need not be relevant — these have to be checked against the original function d(·).
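A sketch of the whole filter step in Python, with the precomputed pivot distances represented by a dict (all names ours):

```python
def pivot_filter_range(X, d, q, r, pivots, pdist):
    """pdist[o] = (d(o,p_1), ..., d(o,p_n)), precomputed at insertion time."""
    psi_q = [d(q, p) for p in pivots]           # n distance computations
    result = []
    for o in X:
        lower = max(abs(a - b) for a, b in zip(pdist[o], psi_q))
        if lower <= r and d(q, o) <= r:         # d(q, o) computed only for survivors
            result.append(o)
    return result
```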

80 Constraints & Explanatory Example 1. Range query R(q, r) = {o_4, o_6, o_8}. Sequential scan: 11 distance computations. With no constraint, the tree needs 3 + 8 distance computations (3 to the pivots, 8 to leaf objects).

81 Constraints & Explanatory Example 2. Range query R(q, r) = {o_4, o_6, o_8}. With only the object-pivot constraint in leaves: 3 + 2 distance computations — o_6 is included without computing d(q, o_6), and o_10, o_2, o_9, o_3, o_7 are eliminated without computing their distances.

82 Constraints & Explanatory Example 3. Range query R(q, r) = {o_4, o_6, o_8}. With only the range-pivot constraint: 3 + 6 distance computations — o_2, o_9 are pruned. With range-pivot + pivot-pivot: still 3 + 6 distance computations.

83 Constraints & Explanatory Example 4. Range query R(q, r) = {o_4, o_6, o_8}. Assume the objects know their distances to the pivots along the paths to the root. With only pivot filtering: 3 + 3 distance computations (to o_4, o_6, o_8). With all constraints together: 3 + 2 distance computations (to o_4, o_8).

84 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

85 Metric Space Transformation. Changing one metric space into another: transformation of the original objects, changing the metric function, or transforming both the function and the objects. Motivations: metric space embedding, a cheaper distance function, user-defined search functions.

86 Metric Space Transformation. Given M_1 = (D_1, d_1) and M_2 = (D_2, d_2), a transformation is a function f: D_1 → D_2. In general, ∀o_1, o_2 ∈ D_1: d_1(o_1, o_2) ≠ d_2(f(o_1), f(o_2)) — the transformed distances need not be equal to the original ones.

87 Lower-Bounding Metric Functions. Bounds on transformations are exploitable by index structures. Having functions d_1, d_2: D × D → ℝ, d_1 is a lower-bounding distance function of d_2 if ∀o_1, o_2 ∈ D: d_1(o_1, o_2) ≤ d_2(o_1, o_2).

88 Lower-Bounding Functions (cont.). Scaling factor: some metric functions cannot be bounded directly, but we can bound them after reducing by a factor s, 0 < s < 1: ∀o_1, o_2 ∈ D: s·d_1(o_1, o_2) ≤ d_2(o_1, o_2), i.e., s·d_1 is a lower-bounding function of d_2. The maximum of all possible values of s is called the optimal scaling factor.

89 Example of Lower-Bounding Functions: L_p metrics. Any L_p metric is lower-bounding an L_p′ metric whenever p ≥ p′. Let x, y be two vectors in a 2-D space: d_{L_2}(x, y) = √((x_1 − y_1)² + (x_2 − y_2)²) and d_{L_1}(x, y) = |x_1 − y_1| + |x_2 − y_2|; L_1 is always at least as big as L_2.

90 Example of Lower-Bounding Functions: Quadratic Form Distance Function. It is bounded by a scaled L_2 norm; the optimal scaling factor is s_optim = √(min_i { λ_i }), where the λ_i denote the eigenvalues of the quadratic-form function matrix.

91 User-defined Metric Functions. Different users have different preferences: some people prefer a car's speed, others prefer lower prices, etc. Preferences might be complex (color histograms, data-mining systems) and can be learnt automatically from the previous behavior of a user.

92 User-defined Metric Functions. Preferences are expressed as another distance function d_u, which can differ from user to user — for example, different matrices for quadratic form distance functions. The database is indexed with a fixed metric d_b. A lower-bounding metric function d_p lower-bounds both d_b and d_u; it is applied during the search and can exploit the properties of the index structure.

93 User-defined Metric Functions. Searching using d_p: search the index but use d_p instead of d_b. This is possible because ∀o_1, o_2 ∈ D: d_p(o_1, o_2) ≤ d_b(o_1, o_2) — every object that would match the similarity query using d_b will certainly match with d_p. The false-positives in the result are filtered afterwards using d_u, which is possible because ∀o_1, o_2 ∈ D: d_p(o_1, o_2) ≤ d_u(o_1, o_2).

94 Embedding the Metric Space. Transform the metric space so that a cheaper metric function d_2 approximates the original d_1 distances: d_2(f(o_1), f(o_2)) ≤ d_1(o_1, o_2). Drawbacks: objects must be transformed using the function f, and false-positives must be pruned using the original metric function.

95 Embedding Examples: Lipschitz Embedding. A mapping to an n-dimensional vector space whose coordinates correspond to chosen subsets S_i of objects; an object is then a vector of distances to the closest object of each coordinate set S_i: f(o) = (d(o, S_1), d(o, S_2), …, d(o, S_n)). The transformation is very expensive; the SparseMap extension reduces this cost.

96 Embedding Examples: Karhunen-Loève transformation. A linear transformation of vector spaces and a dimensionality reduction technique, similar to Principal Component Analysis. It projects an object o onto the first k < n basis vectors V = { v_1, v_2, …, v_n }. The transformation is contractive; it is used in the FastMap technique.

97 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

98 Principles of Approx. Similarity Search. Approximate similarity search overcomes the problems of exact similarity search with traditional access methods: moderate improvement of performance with respect to the sequential scan, and the dimensionality curse. Exact similarity search returns mathematically precise result sets; but similarity is often subjective, so in some cases approximate result sets also satisfy the user's needs.

99 Principles of Approx. Similarity Search (cont.). Approximate similarity search processes a query faster at the price of imprecision in the returned result sets. This is useful, for instance, in interactive systems: similarity search is typically an iterative process in which users submit several queries before being satisfied, so fast approximate answers to the intermediate queries are useful. Improvements of up to two orders of magnitude are possible.

100 Approx. Similarity Search: Basic Strategies. Space transformation: distance-preserving transformations, where distances in the transformed space are smaller than in the original space — possible false hits. Examples: dimensionality reduction techniques such as KLT, DFT, DCT, DWT, and VA-files. We will not discuss this approximation strategy in detail.

101 Basic Strategies (cont.). Reducing the subsets of data to be examined: unpromising data is not accessed, so false dismissals can occur. This strategy is discussed more deeply in the following slides.

102 Reducing the Volume of Examined Data. Possible strategies: early termination strategies — a search algorithm may stop before all the needed data has been accessed; relaxed branching strategies — data regions overlapping the query region can be discarded, depending on a specific relaxed pruning strategy.

103 Early Termination Strategies. Exact similarity search algorithms are iterative processes in which the current result set is improved at each step; they stop when no further improvement is possible.

104 Early Termination Strategies (cont.). Approximate similarity search algorithms use a relaxed stop condition that halts the algorithm when there is little chance of improving the current result. The hypothesis is that a good approximation is obtained after a few iterations, and that further steps would consume most of the total search cost while only marginally improving the result set.

105 Early Termination Strategies (cont.). (Figure: distance to the current result as a function of the iteration number.)

106 Relaxed Branching Strategies. Exact similarity search algorithms access all data regions overlapping the query region and discard all the others. Approximate algorithms use a relaxed pruning condition that rejects regions overlapping the query region when it detects a low likelihood that data objects are contained in the intersection. This is particularly useful and effective with access methods based on hierarchical decomposition of the space.

107 Approximate Search: Example. A hypothetical index structure with three ball regions: B_1 holds o_1, o_2, o_3; B_2 holds o_4, o_5, o_6, o_7; B_3 holds o_8, o_9, o_10, o_11.

108 Approximate Search: Range Query. Given a range query: access B_1 and report o_1 — if early termination stopped now, we would lose objects. Access B_2 and report o_4, o_5 — if early termination stopped now, we would not lose anything. Access B_3 — nothing to report; a relaxed branching strategy may discard this region, and we do not lose anything.

109 Approximate Search: 2-NN Query. Given a 2-NN query: access B_1 — neighbors: o_1, o_3; if early termination stopped now, we would lose objects. Access B_2 — neighbors: o_4, o_5; if early termination stopped now, we would not lose anything. Access B_3 — neighbors: o_4, o_5, no change; a relaxed branching strategy may discard this region, and we do not lose anything.

110 Measures of Performance. Performance assessment of approximate similarity search should consider both the improvement in efficiency and the accuracy of the approximate results. Typically there is a trade-off between the two: high improvement in efficiency is obtained at the cost of accuracy in the results. Good approximate search algorithms should offer high improvement in efficiency together with high accuracy in the results.

111 Measures of Performance: Improvement in Efficiency. Improvement in Efficiency (IE) is expressed as the ratio between the costs of the exact and approximate execution of a query Q: IE = Cost(Q) / Cost^A(Q), where Cost and Cost^A denote the number of disk accesses (or, alternatively, the number of distance computations) for the precise and approximate execution of Q, respectively; Q is a range or k-nearest-neighbors query.

112 Improvement in Efficiency (cont.). IE = 10 means that the approximate execution is 10 times faster. Example: exact execution 6 minutes, approximate execution 36 seconds.

113 Measures of Performance: Precision and Recall. Widely used in Information Retrieval as a performance assessment. Precision: the ratio between the retrieved qualifying objects and the total objects retrieved. Recall: the ratio between the retrieved qualifying objects and the total qualifying objects.

114 Precision and Recall (cont.). Accuracy can be quantified with precision (P) and recall (R): P = |S ∩ S^A| / |S^A| and R = |S ∩ S^A| / |S|, where S is the set of qualifying objects (those retrieved by the precise algorithm) and S^A is the set of actually retrieved objects (those retrieved by the approximate algorithm).
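In Python this is a one-liner each (a sketch; names ours):

```python
def precision_recall(S, S_A):
    """S: exact result set, S_A: approximate result set."""
    hits = len(S & S_A)
    return hits / len(S_A), hits / len(S)

P, R = precision_recall({1, 2, 3, 4}, {2, 3, 4, 9})
print(P, R)  # 0.75 0.75
```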

115 Precision and Recall (cont.). They are very intuitive, but in our context their interpretation is not obvious and can be misleading! For approximate range search we typically have S^A ⊆ S, so precision is always 1 in this case. Results of k-NN(q) always have size k, so precision is always equal to recall in this case. And every element has the same importance: losing the first object rather than the 1000th one makes no difference.

116 Precision and Recall (cont.). Suppose a 10-NN(q) where S contains the ten nearest neighbors: in S^A1 the 1st nearest neighbor is missing, while in S^A2 the 10th nearest neighbor is missing. In both cases P = R = 0.9; however, S^A2 can be considered better than S^A1.

117 Precision and Recall (cont.). Suppose a 1-NN(q) with S = {1}: S^A1 = {2} — just one object was skipped; S^A2 = {10000} — the first 9999 objects were skipped. In both cases P = R = 0; however, S^A1 can be considered much better than S^A2.

118 Measures of Performance: Relative Error on Distances. Another way to assess accuracy is the relative error on distances (ED), which compares the distances from a query object to the objects in the approximate and exact results: ED = d(o^A, q) / d(o^N, q) − 1, where o^A and o^N are the approximate and the actual nearest neighbors, respectively. Generalisation to the j-th nearest neighbor: ED_j = d(o_j^A, q) / d(o_j^N, q) − 1.
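A sketch of ED_j over two ranked result lists (names ours):

```python
def relative_distance_errors(d, q, exact, approx):
    """ED_j = d(o_j^A, q) / d(o_j^N, q) - 1 for each rank j."""
    return [d(o_a, q) / d(o_n, q) - 1.0
            for o_n, o_a in zip(exact, approx)]
```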

119 Relative Error on Distances (cont.). It has a drawback: it does not take the distribution of distances into account. Example 1: the difference in distance from the query object to o^N and o^A is large (compared to the range of distances); if the algorithm misses o^N and takes o^A, ED is large even though just one object has been missed.

120 Relative Error on Distances (cont.). Example 2: almost all objects have the same (large) distance from the query object; choosing the farthest rather than the nearest neighbor would produce a small ED even though almost all objects have been missed. (Figure: distance density with d(o^N, q) and d(o^A, q) both lying in the dense region.)

121 Measures of Performance: Error on Position. Accuracy can also be measured as the Error on Position (EP), i.e., the discrepancy between the ranks in the approximate and exact results. It is obtained using the Spearman Footrule Distance (SFD): SFD = Σ_{i=1..|X|} |S_1(o_i) − S_2(o_i)|, where |X| is the dataset's cardinality and S_i(o) is the position of object o in the ordered list S_i.

122 Error on Position (cont.). SFD computes the correlation of two ordered lists and requires both lists to have identical elements. For partial lists, the Induced Footrule Distance (IFD) is used: IFD = Σ_{i=1..|S^A|} |OX(o_i) − S^A(o_i)|, where OX is the list containing the entire dataset ordered with respect to q, and S^A is the approximate result ordered with respect to q.

123 Error on Position (cont.). The position in the approximate result is always smaller than or equal to the one in the exact result: S^A is a sub-list of OX, so S^A(o) ≤ OX(o). A normalisation factor |S^A| · |X| can also be used. The error on position (EP) is then defined as EP = Σ_{i=1..|S^A|} (OX(o_i) − S^A(o_i)) / (|S^A| · |X|).
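A sketch of EP with ranks encoded directly as integers (names ours); the demo reproduces the two examples below:

```python
def error_on_position(exact_ranks, dataset_size):
    """exact_ranks[i] = 1-based rank in OX of the i-th approximate result."""
    k = len(exact_ranks)
    return sum(ox - (i + 1) for i, ox in enumerate(exact_ranks)) / (k * dataset_size)

# |X| = 10000, 10-NN:
print(error_on_position(list(range(2, 12)), 10000))         # 1st NN missing  -> 0.0001
print(error_on_position(list(range(1, 10)) + [11], 10000))  # 10th NN missing -> 0.00001
```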

124 Error on Position (cont.). Suppose |X| = 10000 and consider a 10-NN(q): in S^A1 the object at rank 1 is missing, in S^A2 the object at rank 10 is missing. As intuition suggests: in case of S^A1, EP = 10 / (10 · 10000) = 0.0001; in case of S^A2, EP = 1 / (10 · 10000) = 0.00001.

125 Error on Position (cont.). Suppose |X| = 10000 and consider a 1-NN(q) with S = {1}: S^A1 = {2} — just one object was skipped; S^A2 = {10000} — the first 9999 objects were skipped. As intuition suggests: in case of S^A1, EP = (2 − 1) / (1 · 10000) = 0.0001; in case of S^A2, EP = (10000 − 1) / (1 · 10000) = 0.9999.

126 Foundations of metric space searching: 1. distance searching problem in metric spaces, 2. metric distance measures, 3. similarity queries, 4. basic partitioning principles, 5. principles of similarity query execution, 6. policies to avoid distance computations, 7. metric space transformations, 8. principles of approximate similarity search, 9. advanced issues

127 Statistics on Metric Datasets. Statistical characteristics of datasets form the basis of performance optimisation in databases. Statistical information is used for cost models and access-structure tuning. Typical statistical information: histograms of frequency values for records in databases; the distribution of data when data is represented in a vector space.

128 Statistics on Metric Datasets (cont.). Histograms and data distributions cannot be used in generic metric spaces: we can only rely on distances, since no coordinate system is available. Statistics useful for similarity-search techniques in metric spaces are the distance density and distance distribution, the homogeneity of viewpoints, and the proximity of ball regions.

129 Data Density vs. Distance Density. Data density (applicable just in vector spaces) characterizes how the data is placed in the space; the coordinates of objects are needed to get their positions. Distance density (applicable in generic metric spaces) characterizes the distances among objects; no coordinates are needed — just a distance function is required.

130 Data Density vs. Distance Density (cont.). (Figure: data density over the coordinates x_1, x_2 versus the distance density from the object p.)

131 Distance Distribution and Distance Density. The distance distribution with respect to the object p (the viewpoint of p) is F_p(x) = Pr{ D_p ≤ x } = Pr{ d(p, o) ≤ x }, where D_p is a random variable corresponding to the distance d(p, o) and o is a random object of the metric space. The distance density from the object p can be obtained as the derivative of the distribution.

132 Distance Distribution and Distance Density (cont.). The overall distance distribution (informally) is the probability of distances among objects: F(x) = Pr{ d(o_1, o_2) ≤ x }, where o_1 and o_2 are random objects of the metric space.

133 Homogeneity of Viewpoints. A viewpoint (the distance distribution from p) differs from viewpoint to viewpoint: distances from different objects are distributed differently. A viewpoint also differs from the overall distance distribution, which characterizes the entire set of possible distances. However, the overall distance distribution can be used in place of any viewpoint if the dataset is probabilistically homogeneous, i.e., when the discrepancy between the various viewpoints is small.

134 Homogeneity of Viewpoints (cont.). The index of Homogeneity of Viewpoints (HV) for a metric space M = (D, d) is HV(M) = 1 − avg_{p_1, p_2 ∈ D} δ(F_{p_1}, F_{p_2}), where p_1 and p_2 are random objects and the discrepancy between two viewpoints is δ(F_{p_i}, F_{p_j}) = avg_{x ∈ [0, d^+]} |F_{p_i}(x) − F_{p_j}(x)|, with F_{p_i} the viewpoint of p_i.
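A Monte-Carlo sketch of estimating HV by sampling pivot pairs and distance thresholds (the sample sizes are arbitrary choices of ours, as are all names):

```python
import random

def viewpoint(p, sample, d):
    """Empirical distribution F_p over a sample of objects."""
    dists = [d(p, o) for o in sample]
    return lambda x: sum(v <= x for v in dists) / len(dists)

def homogeneity(X, d, d_plus, pairs=50, xs=50):
    """HV(M) ~ 1 - average discrepancy between sampled viewpoints."""
    total = 0.0
    for _ in range(pairs):
        p1, p2 = random.sample(X, 2)
        F1, F2 = viewpoint(p1, X, d), viewpoint(p2, X, d)
        grid = [random.uniform(0, d_plus) for _ in range(xs)]
        total += sum(abs(F1(x) - F2(x)) for x in grid) / xs
    return 1.0 - total / pairs
```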

135 Homogeneity of Viewpoints (cont.). If HV(M) ≈ 1, the overall distance distribution can reliably be used to replace any viewpoint.

136 Proximity of Ball Regions. The proximity of two regions is a measure that estimates the number of objects contained in their overlap. It is used in: region splitting for partitioning — after splitting one region, the new regions should share as few objects as possible; disk allocation — enhancing performance by distributing data over several disks; approximate search — in a relaxed branching strategy, a region is accessed only if there is a high probability of objects in the intersection.

137 Proximity of Ball Regions (cont.). In Euclidean spaces it is easy: compute the data distribution and integrate it over the regions' intersection. In metric spaces coordinates cannot be used and the data distribution cannot be exploited; the distance density/distribution is the only available statistical information.

138 Proximity of Ball Regions: Partitioning. Queries usually follow the data distribution. Partition the data so as to avoid overlaps, i.e., accessing both regions. (Figure: low overlap on the left vs. high overlap on the right.)

139 Proximity of Ball Regions: Data Allocation. Regions sharing many objects should be placed on different disk units (declustering), because there is a high probability of their being accessed together by the same query.

140 Proximity of Ball Regions: Approximate Search. Skip visiting regions where there is a low chance of finding objects relevant to the query.

141 Proximity of Metric Ball Regions. Given two ball regions R_1 = (p_1, r_1) and R_2 = (p_2, r_2), we define proximity as prox(R_1, R_2) = Pr{ d(p_1, o) ≤ r_1 ∧ d(p_2, o) ≤ r_2 }. In real-life datasets the distance distribution does not depend on specific objects — real datasets have a high index of homogeneity — so we define the overall proximity: prox_z(r_1, r_2) = Pr{ d(p_1, o) ≤ r_1 ∧ d(p_2, o) ≤ r_2 | d(p_1, p_2) = z }.

142 Proximity of Metric Ball Regions (cont.). Overall proximity: with the random variables D_1 = d(p_1, o), D_2 = d(p_2, o) and the pivot distance D_z = z, the triangle inequality restricts the possible values: |D_1 − D_2| ≤ z ≤ D_1 + D_2 for any object o ∈ D. Proximity is the probability that an object o appears in the intersection.

143 Proximity: Computational Difficulties. Let D_1 = d(p_1, o), D_2 = d(p_2, o) and D_z = d(p_1, p_2) be random variables. The overall proximity can be mathematically evaluated as prox_z(r_1, r_2) = ∫_0^{r_1} ∫_0^{r_2} f_{D_1, D_2 | D_z}(x, y | z) dy dx. An analytic formula for the joint conditional density f_{D_1, D_2 | D_z} is not known for generic metric spaces.

144 Proximity: Computational Difficulties (cont.). Idea: replace the joint conditional density f_{D_1, D_2 | D_z}(x, y | z) with the joint density f_{D_1, D_2}(x, y); however, these densities are different. The joint density is easier to obtain: f_{D_1, D_2}(x, y) = f_{D_1}(x) · f_{D_2}(y), and if the overall density is used, f_{D_1, D_2}(x, y) = f(x) · f(y). The original expression can then only be approximated.

145 Proximity: Considerations (cont.). The joint conditional density is zero when x, y and z do not satisfy the triangle inequality — such distances simply cannot exist in a metric space. (Figure: the joint conditional density over the x-y plane, zero outside the triangle-inequality region.)

146 Proximity: Considerations (cont.). The joint density is not restricted in this way. Idea: approximate the joint conditional density by dragging the values lying outside the borders of the triangle inequality onto the border. (Figure: the unrestricted joint density over the x-y plane.)

147 Proximity: Approximation. Proximity can be computed in O(n) with high precision, where n is the number of samples used for the integral computation of f(x). The distance density and distribution are the only information that needs to be pre-computed and stored. (Figure: the integration area split into a bounded area and an external area.)

148 Performance Prediction. The distance distribution can be used for performance prediction of similarity-search access methods: estimating the number of accessed subsets, the number of distance computations, and the number of objects retrieved. Suppose a dataset was partitioned into m subsets, every subset bounded by a ball region R_i = (p_i, r_i), 1 ≤ i ≤ m, with pivot p_i and radius r_i.

149 Performance Prediction: Range Search. A range query R(q, r_q) will access a subset bounded by the region R_i if the region intersects the query, i.e., if d(q, p_i) ≤ r_i + r_q. The probability for a random region R = (p, r) to be accessed is Pr{ d(q, p) ≤ r + r_q } = F_q(r + r_q) ≈ F(r + r_q), where p is the random center of the region, F_q is q's viewpoint, and the dataset is assumed to be highly homogeneous.

150 Performance Prediction: Range Search (cont.). The expected number of accessed subsets is obtained by summing the probabilities of accessing each subset: subsets(R(q, r_q)) ≈ Σ_{i=1..m} F(r_i + r_q), provided that we have a data structure to maintain the r_i's.

151 Performance Prediction: Range Search (cont.). The expected number of distance computations is obtained by summing the sizes of the subsets, weighted by their access probabilities: distances(R(q, r_q)) ≈ Σ_{i=1..m} |R_i| · F(r_i + r_q). The expected size of the result is given simply as objects(R(q, r_q)) ≈ n · F(r_q), where n is the cardinality of the entire dataset.
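The three estimators together in Python, with F supplied as an (empirical) distance distribution function (a sketch; names ours):

```python
def predict_range_query(F, radii, sizes, n, r_q):
    """F: overall distance distribution, radii: the r_i of the bounding regions,
    sizes: the subset cardinalities |R_i|, n: total dataset cardinality."""
    subsets = sum(F(r_i + r_q) for r_i in radii)
    distances = sum(size * F(r_i + r_q) for r_i, size in zip(radii, sizes))
    objects = n * F(r_q)
    return subsets, distances, objects
```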

152 Performance Prediction: Range Search (cont.). A data structure maintaining the radii and the cardinalities of all bounding regions is needed; the size of this information can become unacceptable, since it grows linearly with the size of the dataset.

More information

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering

More information

B490 Mining the Big Data

B490 Mining the Big Data B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations

More information

Nearest Neighbor Search with Keywords

Nearest Neighbor Search with Keywords Nearest Neighbor Search with Keywords Yufei Tao KAIST June 3, 2013 In recent years, many search engines have started to support queries that combine keyword search with geography-related predicates (e.g.,

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

α-acyclic Joins Jef Wijsen May 4, 2017

α-acyclic Joins Jef Wijsen May 4, 2017 α-acyclic Joins Jef Wijsen May 4, 2017 1 Motivation Joins in a Distributed Environment Assume the following relations. 1 M[NN, Field of Study, Year] stores data about students of UMONS. For example, (19950423158,

More information

Shape Matching: A Metric Geometry Approach. Facundo Mémoli. CS 468, Stanford University, Fall 2008.

Shape Matching: A Metric Geometry Approach. Facundo Mémoli. CS 468, Stanford University, Fall 2008. Shape Matching: A Metric Geometry Approach Facundo Mémoli. CS 468, Stanford University, Fall 2008. 1 General stuff: class attendance is mandatory will post possible projects soon a couple of days prior

More information

LOCALITY SENSITIVE HASHING FOR BIG DATA

LOCALITY SENSITIVE HASHING FOR BIG DATA 1 LOCALITY SENSITIVE HASHING FOR BIG DATA Wei Wang, University of New South Wales Outline 2 NN and c-ann Existing LSH methods for Large Data Our approach: SRS [VLDB 2015] Conclusions NN and c-ann Queries

More information

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality Measurement and Data Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality Importance of Measurement Aim of mining structured data is to discover relationships that

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

Composite Quantization for Approximate Nearest Neighbor Search

Composite Quantization for Approximate Nearest Neighbor Search Composite Quantization for Approximate Nearest Neighbor Search Jingdong Wang Lead Researcher Microsoft Research http://research.microsoft.com/~jingdw ICML 104, joint work with my interns Ting Zhang from

More information

Convex and Semidefinite Programming for Approximation

Convex and Semidefinite Programming for Approximation Convex and Semidefinite Programming for Approximation We have seen linear programming based methods to solve NP-hard problems. One perspective on this is that linear programming is a meta-method since

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang /

CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang / CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors Furong Huang / furongh@cs.umd.edu What we know so far Decision Trees What is a decision tree, and how to induce it from

More information

Math 4A Notes. Written by Victoria Kala Last updated June 11, 2017

Math 4A Notes. Written by Victoria Kala Last updated June 11, 2017 Math 4A Notes Written by Victoria Kala vtkala@math.ucsb.edu Last updated June 11, 2017 Systems of Linear Equations A linear equation is an equation that can be written in the form a 1 x 1 + a 2 x 2 +...

More information

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. SAMs - Detailed outline. Spatial Access Methods - problem

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. SAMs - Detailed outline. Spatial Access Methods - problem Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #26: Spatial Databases (R&G ch. 28) SAMs - Detailed outline spatial access methods problem dfn R-trees Faloutsos 2

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Learning Methods for Linear Detectors

Learning Methods for Linear Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2

More information

Application: Bucket Sort

Application: Bucket Sort 5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014 Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Clustering. Stephen Scott. CSCE 478/878 Lecture 8: Clustering. Stephen Scott. Introduction. Outline. Clustering.

Clustering. Stephen Scott. CSCE 478/878 Lecture 8: Clustering. Stephen Scott. Introduction. Outline. Clustering. 1 / 19 sscott@cse.unl.edu x1 If no label information is available, can still perform unsupervised learning Looking for structural information about instance space instead of label prediction function Approaches:

More information

Chapter 5 Divide and Conquer

Chapter 5 Divide and Conquer CMPT 705: Design and Analysis of Algorithms Spring 008 Chapter 5 Divide and Conquer Lecturer: Binay Bhattacharya Scribe: Chris Nell 5.1 Introduction Given a problem P with input size n, P (n), we define

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Relational Nonlinear FIR Filters. Ronald K. Pearson

Relational Nonlinear FIR Filters. Ronald K. Pearson Relational Nonlinear FIR Filters Ronald K. Pearson Daniel Baugh Institute for Functional Genomics and Computational Biology Thomas Jefferson University Philadelphia, PA Moncef Gabbouj Institute of Signal

More information

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved. Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments

Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments Reynold Cheng, Dmitri V. Kalashnikov Sunil Prabhakar The Hong Kong Polytechnic University, Hung Hom, Kowloon,

More information

arxiv:cs/ v1 [cs.cg] 7 Feb 2006

arxiv:cs/ v1 [cs.cg] 7 Feb 2006 Approximate Weighted Farthest Neighbors and Minimum Dilation Stars John Augustine, David Eppstein, and Kevin A. Wortman Computer Science Department University of California, Irvine Irvine, CA 92697, USA

More information

Descriptive Data Summarization

Descriptive Data Summarization Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Nearest Neighbor Search for Relevance Feedback

Nearest Neighbor Search for Relevance Feedback Nearest Neighbor earch for Relevance Feedbac Jelena Tešić and B.. Manjunath Electrical and Computer Engineering Department University of California, anta Barbara, CA 93-9 {jelena, manj}@ece.ucsb.edu Abstract

More information

Projective Clustering by Histograms

Projective Clustering by Histograms Projective Clustering by Histograms Eric Ka Ka Ng, Ada Wai-chee Fu and Raymond Chi-Wing Wong, Member, IEEE Abstract Recent research suggests that clustering for high dimensional data should involve searching

More information

ECON 5111 Mathematical Economics

ECON 5111 Mathematical Economics Test 1 October 1, 2010 1. Construct a truth table for the following statement: [p (p q)] q. 2. A prime number is a natural number that is divisible by 1 and itself only. Let P be the set of all prime numbers

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Math General Topology Fall 2012 Homework 1 Solutions

Math General Topology Fall 2012 Homework 1 Solutions Math 535 - General Topology Fall 2012 Homework 1 Solutions Definition. Let V be a (real or complex) vector space. A norm on V is a function : V R satisfying: 1. Positivity: x 0 for all x V and moreover

More information

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems 1 Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems Mauro Franceschelli, Andrea Gasparri, Alessandro Giua, and Giovanni Ulivi Abstract In this paper the formation stabilization problem

More information

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Dimensionality Reduction

Dimensionality Reduction Dimensionality Reduction Given N vectors in n dims, find the k most important axes to project them k is user defined (k < n) Applications: information retrieval & indexing identify the k most important

More information

Random projection trees and low dimensional manifolds. Sanjoy Dasgupta and Yoav Freund University of California, San Diego

Random projection trees and low dimensional manifolds. Sanjoy Dasgupta and Yoav Freund University of California, San Diego Random projection trees and low dimensional manifolds Sanjoy Dasgupta and Yoav Freund University of California, San Diego I. The new nonparametrics The new nonparametrics The traditional bane of nonparametric

More information

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G Data Base and Data Mining Group of Politecnico di Torino Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

Algebraic Methods in Combinatorics

Algebraic Methods in Combinatorics Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets

More information

Inderjit Dhillon The University of Texas at Austin

Inderjit Dhillon The University of Texas at Austin Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous: MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is

More information

Association Rules. Fundamentals

Association Rules. Fundamentals Politecnico di Torino Politecnico di Torino 1 Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket counter Association rule

More information

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances 6 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards

More information

Math 320-2: Midterm 2 Practice Solutions Northwestern University, Winter 2015

Math 320-2: Midterm 2 Practice Solutions Northwestern University, Winter 2015 Math 30-: Midterm Practice Solutions Northwestern University, Winter 015 1. Give an example of each of the following. No justification is needed. (a) A metric on R with respect to which R is bounded. (b)

More information

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM SPATIAL DATA MINING Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM INTRODUCTION The main difference between data mining in relational DBS and in spatial DBS is that attributes of the neighbors

More information

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions. Definitions Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Itemset is a set including one or more items Example: {Beer, Diapers} k-itemset is an itemset that contains k

More information

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example Association rules Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and

More information

1 9/5 Matrices, vectors, and their applications

1 9/5 Matrices, vectors, and their applications 1 9/5 Matrices, vectors, and their applications Algebra: study of objects and operations on them. Linear algebra: object: matrices and vectors. operations: addition, multiplication etc. Algorithms/Geometric

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

The value of a problem is not so much coming up with the answer as in the ideas and attempted ideas it forces on the would be solver I.N.

The value of a problem is not so much coming up with the answer as in the ideas and attempted ideas it forces on the would be solver I.N. Math 410 Homework Problems In the following pages you will find all of the homework problems for the semester. Homework should be written out neatly and stapled and turned in at the beginning of class

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

Vector Spaces, Orthogonality, and Linear Least Squares

Vector Spaces, Orthogonality, and Linear Least Squares Week Vector Spaces, Orthogonality, and Linear Least Squares. Opening Remarks.. Visualizing Planes, Lines, and Solutions Consider the following system of linear equations from the opener for Week 9: χ χ

More information

2 Systems of Linear Equations

2 Systems of Linear Equations 2 Systems of Linear Equations A system of equations of the form or is called a system of linear equations. x + 2y = 7 2x y = 4 5p 6q + r = 4 2p + 3q 5r = 7 6p q + 4r = 2 Definition. An equation involving

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 10 What is Data? Collection of data objects

More information

Extra Problems for Math 2050 Linear Algebra I

Extra Problems for Math 2050 Linear Algebra I Extra Problems for Math 5 Linear Algebra I Find the vector AB and illustrate with a picture if A = (,) and B = (,4) Find B, given A = (,4) and [ AB = A = (,4) and [ AB = 8 If possible, express x = 7 as

More information

Johns Hopkins Math Tournament Proof Round: Automata

Johns Hopkins Math Tournament Proof Round: Automata Johns Hopkins Math Tournament 2018 Proof Round: Automata February 9, 2019 Problem Points Score 1 10 2 5 3 10 4 20 5 20 6 15 7 20 Total 100 Instructions The exam is worth 100 points; each part s point value

More information

High-Dimensional Indexing by Distributed Aggregation

High-Dimensional Indexing by Distributed Aggregation High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas

More information

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1) Correlated subqueries Query Optimization CPS Advanced Database Systems SELECT CID FROM Course Executing correlated subquery is expensive The subquery is evaluated once for every CPS course Decorrelate!

More information

Department of Mathematics, University of California, Berkeley. GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2014

Department of Mathematics, University of California, Berkeley. GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2014 Department of Mathematics, University of California, Berkeley YOUR 1 OR 2 DIGIT EXAM NUMBER GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2014 1. Please write your 1- or 2-digit exam number on

More information

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Issues Concerning Dimensionality and Similarity Search

Issues Concerning Dimensionality and Similarity Search Issues Concerning Dimensionality and Similarity Search Jelena Tešić, Sitaram Bhagavathy, and B. S. Manjunath Electrical and Computer Engineering Department University of California, Santa Barbara, CA 936-956

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers Page 1 Theorems Wednesday, May 9, 2018 12:53 AM Theorem 1.11: Greatest-Lower-Bound Property Suppose is an ordered set with the least-upper-bound property Suppose, and is bounded below be the set of lower

More information