Nearest Neighbor Searching Under Uncertainty. Wuzhou Zhang Supervised by Pankaj K. Agarwal Department of Computer Science Duke University

1 Nearest Neighbor Searching Under Uncertainty Wuzhou Zhang Supervised by Pankaj K. Agarwal Department of Computer Science Duke University

2 Nearest Neighbor Searching (NNS)
S: a set of n points in $\mathbb{R}^2$.
q: any query point in $\mathbb{R}^2$.
Find the closest point $p \in S$ to q.
Applications: Pattern Recognition, Data Compression, Statistical Classification, Clustering, Databases, Information Retrieval, Computer Vision, etc.

3 Nearest Neighbor Searching Under Uncertainty
$S = \{P_1, P_2, \ldots, P_n\}$, a set of uncertain points in $\mathbb{R}^2$.
$P_i$ is represented by a probability density function (pdf).
Discrete pdf: $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}^2$.
Continuous pdf: Gaussian distribution.

4 Nearest Neighbor In Expectation
q: query point.
Expected distance from q to $P_i$: $Ed(q, P_i) = \sum_{j=1}^{k} w_{i,j}\, d(q, p_{i,j})$.
Problem: find its expected nearest neighbor $P^*$, with $Ed(q, P^*) = \min_{P \in S} Ed(q, P)$.
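The definition above can be evaluated directly by a linear scan. The following is a minimal sketch, assuming each uncertain point is stored as a pair (locations, weights) with weights summing to 1 and d taken as the Euclidean distance; the names `expected_distance`, `expected_nn` and the sample data are illustrative, not from the slides.

```python
import math

def expected_distance(q, points, weights):
    """Ed(q, P_i) = sum_j w_ij * d(q, p_ij), with d the Euclidean distance."""
    return sum(w * math.dist(q, p) for p, w in zip(points, weights))

def expected_nn(q, uncertain_points):
    """Index of the uncertain point with the smallest expected distance to q.

    Scans all n uncertain points with k locations each, so O(nk) per query.
    """
    return min(range(len(uncertain_points)),
               key=lambda i: expected_distance(q, *uncertain_points[i]))

# Hypothetical input: two uncertain points, two equally likely locations each.
S = [
    ([(0.0, 0.0), (4.0, 0.0)], [0.5, 0.5]),
    ([(1.0, 1.0), (1.0, 2.0)], [0.5, 0.5]),
]
q = (1.0, 0.0)
nearest = expected_nn(q, S)
```

This scan is the baseline that the later slides try to beat with preprocessing.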

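5 Bisector In Case Of Gaussian
For Gaussian distribution, bisector is a line!
Each uncertain point has a bivariate Gaussian pdf $f$ (the second one, $f'$, has parameters $\mu_1', \mu_2', \sigma_1', \sigma_2', \rho'$):
$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_1)^2}{\sigma_1^2} + \frac{(y-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}\right]\right)$$
The bisector is the set of points $q = (q_x, q_y)$ satisfying
$$\iint f(x, y)\sqrt{(q_x-x)^2+(q_y-y)^2}\,dx\,dy = \iint f'(x, y)\sqrt{(q_x-x)^2+(q_y-y)^2}\,dx\,dy$$
Hard to get explicit formula!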

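6 Squared Distance Function
If we use $(q_x-x)^2+(q_y-y)^2$ instead of $\sqrt{(q_x-x)^2+(q_y-y)^2}$, the bisector is simple and beautiful!
For two Gaussians with means $(\mu_1, \mu_2)$, $(\mu_1', \mu_2')$ and standard deviations $(\sigma_1, \sigma_2)$, $(\sigma_1', \sigma_2')$:
$$2(\mu_1'-\mu_1)q_x + 2(\mu_2'-\mu_2)q_y + \mu_1^2 + \mu_2^2 - {\mu_1'}^2 - {\mu_2'}^2 + \sigma_1^2 + \sigma_2^2 - {\sigma_1'}^2 - {\sigma_2'}^2 = 0$$
In case of discrete pdf, $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}^2$, the bisector is also a line!
In both cases, compute the Voronoi diagram and solve it optimally!
However, the squared distance is not a metric!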
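The bisector is a line because the expected squared distance splits into the squared distance to the mean plus a term that does not depend on q: $E\|q - X\|^2 = \|q - \mu\|^2 + E\|X - \mu\|^2$. The sketch below checks this identity numerically for a discrete pdf; the function names and the sample data are illustrative assumptions, and the weights are assumed to sum to 1.

```python
def mean_and_spread(points, weights):
    """Weighted mean and total variance E||X - mean||^2 of a discrete pdf."""
    mx = sum(w * x for (x, _), w in zip(points, weights))
    my = sum(w * y for (_, y), w in zip(points, weights))
    spread = sum(w * ((x - mx) ** 2 + (y - my) ** 2)
                 for (x, y), w in zip(points, weights))
    return (mx, my), spread

def expected_sq_distance_direct(q, points, weights):
    """Direct evaluation: sum_j w_j * ||q - p_j||^2."""
    return sum(w * ((q[0] - x) ** 2 + (q[1] - y) ** 2)
               for (x, y), w in zip(points, weights))

def expected_sq_distance_closed(q, points, weights):
    """Closed form: ||q - mean||^2 + total variance."""
    (mx, my), spread = mean_and_spread(points, weights)
    return (q[0] - mx) ** 2 + (q[1] - my) ** 2 + spread

# Hypothetical discrete pdf with three locations.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
weights = [0.5, 0.25, 0.25]
q = (3.0, 1.0)
direct = expected_sq_distance_direct(q, points, weights)
closed = expected_sq_distance_closed(q, points, weights)
```

Equating the closed form for two uncertain points cancels the quadratic term in q, which leaves a linear equation of the same shape as the Gaussian bisector above.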

7 Sampling Continuous Distributions
Sometimes working on continuous distributions is hard.
For example, suppose a uniform distribution is given by
$$f(x, y) = \begin{cases} \dfrac{1}{(b-a)(d-c)} & \text{for } (x, y) \in [a, b] \times [c, d] \\ 0 & \text{otherwise} \end{cases}$$
In the $L_1$ metric, given $\varepsilon_r$, the number of points to be sampled: $k \ge \frac{1}{\varepsilon_r}$.
Generalization in $\mathbb{R}^d$: $k \ge \left(\frac{1}{\varepsilon_r}\right)^d$.
Lower bounds on other metrics and distributions are also possible.
Let's focus on discrete pdf then.
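To illustrate replacing a continuous pdf by a discrete one, the sketch below samples the uniform distribution on $[a, b] \times [c, d]$ at the midpoints of an m-by-m grid (so k = m² equally weighted samples) and compares the sampled expected $L_1$ distance with the exact value. The midpoint-grid scheme and all function names are assumptions made for this example; the slides do not say which sampling scheme is meant.

```python
def uniform_expected_abs(q, lo, hi):
    """Exact E|q - X| for X uniform on [lo, hi]."""
    mid = (lo + hi) / 2
    if q <= lo:
        return mid - q
    if q >= hi:
        return q - mid
    return ((q - lo) ** 2 + (hi - q) ** 2) / (2 * (hi - lo))

def exact_expected_l1(q, a, b, c, d):
    """Exact expected L1 distance to a uniform point on [a, b] x [c, d].

    The L1 distance is a sum over coordinates, so the expectation separates.
    """
    return uniform_expected_abs(q[0], a, b) + uniform_expected_abs(q[1], c, d)

def grid_sample(a, b, c, d, m):
    """Midpoints of an m x m grid, each with weight 1/m^2 (assumed scheme)."""
    xs = [a + (i + 0.5) * (b - a) / m for i in range(m)]
    ys = [c + (j + 0.5) * (d - c) / m for j in range(m)]
    w = 1.0 / (m * m)
    return [((x, y), w) for x in xs for y in ys]

def sampled_expected_l1(q, samples):
    """Expected L1 distance to the discrete pdf given by weighted samples."""
    return sum(w * (abs(q[0] - x) + abs(q[1] - y)) for (x, y), w in samples)

q = (0.3, 0.6)
exact = exact_expected_l1(q, 0.0, 1.0, 0.0, 1.0)
err_coarse = abs(exact - sampled_expected_l1(q, grid_sample(0.0, 1.0, 0.0, 1.0, 2)))
err_fine = abs(exact - sampled_expected_l1(q, grid_sample(0.0, 1.0, 0.0, 1.0, 8)))
```

For this query the error shrinks as the grid is refined, which is the trade-off between sample count and accuracy that the slide refers to.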

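8 Expected Nearest Neighbor In L1 Metric (Manhattan metric)
$P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}^2$
$$Ed(q, P_i) = \sum_{j=1}^{k} w_{i,j}\, d(q, p_{i,j}) = \sum_{j=1}^{k} w_{i,j}\left(|x_q - x_{p_{i,j}}| + |y_q - y_{p_{i,j}}|\right)$$
Naive algorithm: O(nk) space, O(nk) query time.
Depending on the quadrant of q relative to $p_{i,j}$, the term $|x_q - x_{p_{i,j}}| + |y_q - y_{p_{i,j}}|$ equals:
upper left: $-x_q + x_{p_{i,j}} + y_q - y_{p_{i,j}}$
upper right: $x_q - x_{p_{i,j}} + y_q - y_{p_{i,j}}$
lower left: $-x_q + x_{p_{i,j}} - y_q + y_{p_{i,j}}$
lower right: $x_q - x_{p_{i,j}} - y_q + y_{p_{i,j}}$
So $f_i(x_q, y_q) = Ed(q, P_i)$ is linear within each region!
Number of regions for $P_i$: $O(k^2)$. Total: $O(nk^2)$.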
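The piecewise linearity can be made concrete: inside one grid cell the sign of every $x_q - x_{p_{i,j}}$ and $y_q - y_{p_{i,j}}$ is fixed, so the weighted sum collapses to a single plane $a x_q + b y_q + c$. The sketch below computes those coefficients for the cell containing a given query and compares them with direct evaluation; `linear_piece`, `expected_l1` and the data are illustrative names and values, not taken from the slides.

```python
def linear_piece(q, points, weights):
    """Coefficients (a, b, c) with f_i(x, y) = a*x + b*y + c on the cell containing q."""
    a = b = c = 0.0
    for (px, py), w in zip(points, weights):
        sx = 1.0 if q[0] >= px else -1.0  # sign of (x_q - x_p) throughout the cell
        sy = 1.0 if q[1] >= py else -1.0  # sign of (y_q - y_p) throughout the cell
        a += w * sx
        b += w * sy
        c -= w * (sx * px + sy * py)
    return a, b, c

def expected_l1(q, points, weights):
    """Direct evaluation of the expected L1 distance."""
    return sum(w * (abs(q[0] - px) + abs(q[1] - py))
               for (px, py), w in zip(points, weights))

# Hypothetical uncertain point with k = 3 locations.
points = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
weights = [0.2, 0.3, 0.5]

q1 = (1.5, 2.0)
a, b, c = linear_piece(q1, points, weights)

# A second query in the same cell (1 <= x <= 2, 1 <= y <= 3) lies on the same plane.
q2 = (1.2, 2.5)
value_on_plane = a * q2[0] + b * q2[1] + c
```

With k locations there are at most k + 1 intervals per axis, which gives the $O(k^2)$ cells per uncertain point stated above.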

9 Expected Nearest Neighbor In L1 Metric (cont.)
Figure: graph of $f_i(x_q, y_q)$ over the $(x_q, y_q)$-plane.
Source: Range Searching on Uncertain Data [P. K. Agarwal et al. 2009]

10 Geometric Reduction
$U = \{x_1 < x_2 < \cdots < x_u\}$, $u \le nk$, the set of $x_{p_{i,j}}$.
$V = \{y_1 < y_2 < \cdots < y_v\}$, $v \le nk$, the set of $y_{p_{i,j}}$.
$R = \{r_1, r_2, \ldots, r_t\}$, $t = O(nk^2)$, the set of rectangles in the xy-projections of $f_i(x_q, y_q)$, over all $P_i$.
$\varphi_m$: the plane that contains the rectangular piece of $f_i(x_q, y_q)$ whose projection is $r_m \in R$.
Associate $P_i$ with $r_m$ and $\varphi_m$.
Given $q = (x_q, y_q)$, among all $r_m \in R$ that contain q, we wish to report the one for which $\varphi_m$ lies lowest.
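The reduction can be sketched without any search structure: list every (rectangle, plane) pair, then answer a query by scanning the rectangles that contain q and keeping the lowest plane. This is a brute-force illustration under assumed names (`build_pieces`, `lowest_plane_query`) and assumed data; it is not the data structure the slides go on to build.

```python
import math

def build_pieces(uncertain_points):
    """List of (rectangle, plane, i): rectangle = (x_lo, x_hi, y_lo, y_hi), plane = (a, b, c)."""
    pieces = []
    for i, (points, weights) in enumerate(uncertain_points):
        xs = [-math.inf] + sorted({p[0] for p in points}) + [math.inf]
        ys = [-math.inf] + sorted({p[1] for p in points}) + [math.inf]
        for x_lo, x_hi in zip(xs, xs[1:]):
            for y_lo, y_hi in zip(ys, ys[1:]):
                a = b = c = 0.0
                for (px, py), w in zip(points, weights):
                    # Each location lies on a cell border or outside the cell,
                    # so the signs are constant over the whole rectangle.
                    sx = 1.0 if px <= x_lo else -1.0
                    sy = 1.0 if py <= y_lo else -1.0
                    a += w * sx
                    b += w * sy
                    c -= w * (sx * px + sy * py)
                pieces.append(((x_lo, x_hi, y_lo, y_hi), (a, b, c), i))
    return pieces

def lowest_plane_query(q, pieces):
    """(height, i) of the lowest plane among rectangles containing q, by linear scan."""
    best = None
    for (x_lo, x_hi, y_lo, y_hi), (a, b, c), i in pieces:
        if x_lo <= q[0] <= x_hi and y_lo <= q[1] <= y_hi:
            h = a * q[0] + b * q[1] + c
            if best is None or h < best[0]:
                best = (h, i)
    return best

# Hypothetical input: two uncertain points, two equally likely locations each.
S = [
    ([(0.0, 0.0), (4.0, 0.0)], [0.5, 0.5]),
    ([(1.0, 1.0), (1.0, 2.0)], [0.5, 0.5]),
]
pieces = build_pieces(S)
height, index = lowest_plane_query((1.0, 0.0), pieces)
```

The height returned is the expected $L_1$ distance to the reported uncertain point, so the lowest plane over q identifies the expected nearest neighbor.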

11 Building Block: Half-Space Intersection and Convex Hulls
Upper hulls correspond to lower envelopes; an example in 2D.
Source: pages 252-253, Computational Geometry: Algorithms and Applications, 3rd Edition [Mark de Berg et al.]

12 Segment-tree Based Data Structures for Expected-NN In L1 Metric
$U = \{x_1 < x_2 < \cdots < x_u\}$, $u \le nk$.
A family $\mathcal{I}$ of $O(nk)$ canonical intervals.

13 Segment-tree Based Data Structures for Expected-NN In L1 Metric (cont.)
Each rectangle $r_m \in R$ can be partitioned into a set $C[r_m]$ of $O(\log^2(nk))$ canonical rectangles.
For each $C \in \mathcal{I} \times \mathcal{I}$, $\Phi_C = \{\varphi_m \mid C \in C[r_m],\ r_m \in R\}$.
$\sum_C |\Phi_C| = O(nk^2 \log^2(nk))$.
For each non-empty $\Phi_C$, we compute the lower envelope $LE_C$, store its xy-projection $S_C$ (a planar map), and associate $\varphi_m$ with the corresponding face $f \in S_C$.
$\sum_C |S_C| = O(nk^2 \log^2(nk))$.
Expected preprocessing time: $O(nk^2 \log^3(nk))$.
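The partition into canonical rectangles follows from the standard segment-tree decomposition applied to each axis. The sketch below assumes the coordinates have already been replaced by integer slot indices and that intervals are half-open; the function names are illustrative.

```python
def canonical_nodes(lo, hi, node_lo, node_hi, out=None):
    """Split [lo, hi) into maximal segment-tree nodes of the tree over [node_lo, node_hi)."""
    if out is None:
        out = []
    if hi <= node_lo or node_hi <= lo:
        return out                       # no overlap with this node
    if lo <= node_lo and node_hi <= hi:
        out.append((node_lo, node_hi))   # node fully covered: it is canonical
        return out
    mid = (node_lo + node_hi) // 2
    canonical_nodes(lo, hi, node_lo, mid, out)
    canonical_nodes(lo, hi, mid, node_hi, out)
    return out

def canonical_rectangles(x_range, y_range, size):
    """Canonical rectangles of x_range x y_range: one per pair of canonical intervals."""
    xs = canonical_nodes(x_range[0], x_range[1], 0, size)
    ys = canonical_nodes(y_range[0], y_range[1], 0, size)
    return [(cx, cy) for cx in xs for cy in ys]

x_parts = canonical_nodes(1, 7, 0, 8)
rects = canonical_rectangles((1, 7), (2, 4), 8)
```

Each axis contributes a logarithmic number of canonical intervals, so their product gives the $O(\log^2(nk))$ canonical rectangles per $r_m$.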

14 Segment-tree Based Data Structures for Expected-NN In L1 Metric (cont.)
Given $q = (x_q, y_q)$, first find $\mathcal{I}_{x_q} \subseteq \mathcal{I}$ and $\mathcal{I}_{y_q} \subseteq \mathcal{I}$, $O(\log(nk))$ canonical intervals each.
For each C in $\mathcal{I}_{x_q} \times \mathcal{I}_{y_q}$, do the point location query on $S_C$.
Query time: $O(\log^3(nk))$.

Summary of the result:
                      Size of data structure     Preprocessing time         Query time
General k             $O(nk^2 \log^2(nk))$       $O(nk^2 \log^3(nk))$       $O(\log^3(nk))$
Take k as a constant  $O(n \log^2 n)$            $O(n \log^3 n)$            $O(\log^3 n)$
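The canonical intervals containing a query coordinate are the nodes on one root-to-leaf path of the segment tree. The sketch below lists that path for integer slot indices, assuming a balanced tree over half-open intervals; `stabbing_path` is an illustrative name.

```python
def stabbing_path(pos, node_lo, node_hi):
    """Segment-tree nodes over [node_lo, node_hi) that contain slot pos, root first."""
    path = [(node_lo, node_hi)]
    while node_hi - node_lo > 1:
        mid = (node_lo + node_hi) // 2
        if pos < mid:
            node_hi = mid
        else:
            node_lo = mid
        path.append((node_lo, node_hi))
    return path

x_path = stabbing_path(5, 0, 8)
y_path = stabbing_path(2, 0, 8)
# One point-location query would be issued per pair of canonical intervals.
query_cells = [(cx, cy) for cx in x_path for cy in y_path]
```

A logarithmic number of nodes per axis gives $O(\log^2(nk))$ pairs, and one logarithmic-time point location per pair accounts for the $O(\log^3(nk))$ query bound.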

15 Approximate L2 Metric
For any polygon P that contains the origin, define $d_P : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^+$ as
$$d_P(a, b) = \min\{\lambda \mid b \in \lambda P + a\}$$
It's a metric when P is centrally symmetric!
$m = 2/\delta$, P the regular m-gon of circumscribing radius 1, centered at the origin.
$d_P(a, b) \le (1+\delta)\, d(a, b)$.
$d_P(a, b) = \min_{1 \le i \le m} d_i(a, b)$, since $d_i(a, b)$ is finite iff $b \in C_i + a$.
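For a regular m-gon, $d_P$ can be evaluated in closed form. The sketch below uses the support-function form (the maximum projection onto the outward edge normals, divided by the inradius $\cos(\pi/m)$), which for a convex polygon containing the origin gives the same value as the cone-by-cone definition on the slide. The vertex placement at angle 0 and the function name are assumptions, and m is left as a parameter.

```python
import math

def polygon_distance(a, b, m):
    """d_P(a, b) for P = regular m-gon, circumradius 1, centered at the origin.

    Assumes one vertex at angle 0, so edge i has outward normal at angle (2i + 1) * pi / m.
    """
    vx, vy = b[0] - a[0], b[1] - a[1]
    inradius = math.cos(math.pi / m)   # distance from the origin to every edge
    best = 0.0
    for i in range(m):
        theta = (2 * i + 1) * math.pi / m
        best = max(best, vx * math.cos(theta) + vy * math.sin(theta))
    return best / inradius

a, b = (0.0, 0.0), (3.0, 4.0)
euclid = math.dist(a, b)
approx = polygon_distance(a, b, 16)
```

Because P is inscribed in the unit circle, the value is never smaller than the Euclidean distance and never larger than it divided by $\cos(\pi/m)$; for even m the polygon is centrally symmetric, so the function is symmetric in a and b.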

16 Approximate L2 Metric (cont.)
More complex!

17 Future Work
Approximate the expected NN in the L2 metric. Work harder in the near future!
Study the complexity of the expected Voronoi diagram.
Study the probability case.

18 Thanks!
Main References:
[1] Pankaj K. Agarwal, Siu-Wing Cheng, Yufei Tao, Ke Yi: Indexing uncertain data. PODS 2009: 137-146
[2] Pankaj K. Agarwal, Lars Arge, Jeff Erickson: Indexing Moving Points. J. Comput. Syst. Sci. 66(1) (2003)

Nearest-Neighbor Searching Under Uncertainty

Nearest-Neighbor Searching Under Uncertainty Wuzhou Zhang Joint work with Pankaj K. Agarwal, Alon Efrat, and Swaminathan Sankararaman. To appear in PODS 2012. Nearest-Neighbor Searching S: a set of n points