QUALITY AND EFFICIENCY IN KERNEL DENSITY ESTIMATES FOR LARGE DATA

Size: px

Start display at page:

Download "QUALITY AND EFFICIENCY IN KERNEL DENSITY ESTIMATES FOR LARGE DATA"

Philomena Newman
6 years ago
Views:

1 QUALITY AND EFFICIENCY IN KERNEL DENSITY ESTIMATES FOR LARGE DATA Yan Zheng Jeffrey Jestes Jeff M. Philips Feifei Li University of Utah

2 Salt Lake City Sigmod 2014 Snowbird

3 Kernel Density Estimates Let P be an input point set of size n in domain µ from an unknown distribution f. P 2 R 1 or P 2 R 2 in our experiments. Kernel Density Estimates (KDE) function approximates the density of f at any possible input point x 2 µ KDE P (x) = 1 X K(p, x) P p2p KDE represents a continuous distribution from a finite set of points.

4 Kernel Density Estimates

Drawback of Histograms Relative Frequency Query value change significantly across the boundary of the bin 0.

5 Drawback of Histograms Relative Frequency Query value change significantly across the boundary of the bin The choice of origin can have quite an effect Relative Frequency Annual Income (K) Annual Income (K)

6 Example of Kernel Density Estimates 0.02 Kernel Density Estimate Annual Income (K)

7 Examples of kernels (2D) Gaussian: Triangle: Epanechnikov: Ball K(p, x) = exp K(p, x) = 3 2 max K(p, x) = K(p, x) = 2 max 2 kp xk 2 n 0, 1 n 0, ( 1/ 2 if kp xk < 0 otherwise kp xk o kp xk 2 2 o Gaussian Triangle Epanechnikov Ball

8 Comparing of Different Kernels 0.02 Gaussian Triangular Epan Triangular Gaussian Epan Bal Kernel Density Estimate Annual Income (K)

9 Approximate Kernel Density Estimates Kernel Density Estimates kde P (x) = 1 P P p2p K(p, x) KDE is very expensive for large dataset. Each query takes O(n) " -approximation " Given input P, and error, the goal is to produce a small set Q to ensure: max kde P (x) kde Q (x) = kkde P kde Q k 1 apple " x2r d

MergeReduce(MR) framework Initialization phase: Arbitrarily decompose P into disjoint sets of size k Combination phase Merge Step k+k->2k Reduce

10 MergeReduce(MR) framework Initialization phase: Arbitrarily decompose P into disjoint sets of size k Combination phase Merge Step k+k->2k Reduce Step 2k->k (Matching Pairs & Randomly Select one) Proceeds in log(n/k) rounds P 1 P 2 P 3 P 4 P i P j P 1,P 2,...,P n/k P 12 P 12 P 34 P 34 P ij P 1234 Q

11 Matching Min-cost Matching [Phillips SODA13] Edmonds Blossom algorithm 2 approximation Greedy algorithm Introduce More Efficient Matching Edge Map E M = {e(p, q) (p, q) 2 M} Given a disk B, E M \ B C M,B = P = {e(p, q) \ B e(p, q) 2 E M } e(p,q)2e M \B kp qk2 C M = max B C M,B. Min-cost Matching C M = O(1) Our Matching C M = O(log(1/"))

Grid Matching Min-cost matching C M = O(1) Grid Matching C M = O(log(1/")) Running time O(n log(1/")) Starting with i = 0, construct Gi, length l ",i = p 2 "2 i 2 Inside of each cell,

12 Grid Matching Min-cost matching C M = O(1) Grid Matching C M = O(log(1/")) Running time O(n log(1/")) Starting with i = 0, construct Gi, length l ",i = p 2 "2 i 2 Inside of each cell, match points arbitrarily. Only the unmatched points survive to the next round. Each cell in Gi +1 is the union of 4 cells from Gi After log(1/ ")+1rounds, match points left arbitrarily.

13 Grid Matching Point set P R 2 with n points, we can construct Q giving an ε-approximate KDE in O(n log 1 " ) running time and Q = O(1 " log n 1 log1.5 " ) Using Grid-MR With Random sampling (RS). Random sample a set P from P with size O((1/" 2 ) log(1/ )) O(n + 1 " 2 log 1 " ) running time and Q = O(1 1 " log2.5 " ) Using Grid-MR+RS

Repeat this discarding of half the points until the remaining set is sufficiently small.

14 Deterministic Z-Order Selection The levels of the Z-order curve are reminiscent of the grids Zrandom Compute the Z-order of all points, and of every two points discard one at random. Repeat this discarding of half the points until the remaining set is sufficiently small. Zorder Select one point from each set of P /k points in the sorted order using ε-approximate quantiles algorithm

15 Other Baseline Methods Improved fast gauss transform (IFGT) K-center clustering Hermite expansion For n points, m query points, reduce time from to No error vs. size or error vs. time guarantees. Only works for Gaussian Kernel Kernel herding (KH) Explore the reproducing kernel Hilbert space. It adds at each step the single point p 2 P \ Q to Q which most decreases Running Time Bounds only `2 kkde Q kde P k 2 O( Q n) error. O(mn) O(m + n)

16 Centralized Experiments Data sets: OpenStreetMap 160million records in 6.6GB Default setting 10 million records 10 random trials = = 200 on a domain 50,000 50,000 Test points 4000 randomly from P 1000 from the domain of P

17 Centralized Experiments Compare with Baseline Methods Grid-MR Blossom-MR Greedy-MR Kernel Herding Using Random Sampling as preprocessing step. Grid-MR Zorder Grid-MR+RS Zorder+RS O(n log(1/")) O(n 3 ) O(n 2 log n) Time (seconds) err Time (seconds) G-MR B-MR KH Greedy-MR (a) Construction time vs. err G-MR G-MR+RS Z Z+RS err (b) Construction time vs. err

18 Centralized Experiments Our construction time and size beats IFGT Just random sampling is slightly faster, but worse size bounds Time (seconds) G-MR+RS Z+RS IFGT RS Size (bytes) G-MR+RS Z+RS IFGT RS err (a) Construction time vs. err err (b) Size

19 Communication (bytes) Distributed Experiments Default setting: 50 million records G-MR+RS Z+RS RS err (a) Communication cost vs. err Size (bytes) G-MR+RS Z+RS RS err (b) Size vs. err Best small approximate, easy to construct representation of data for kernel density estimates.

20 Thank you

22 Observed error vs. guaranteed error 10 7 Grid RS+Grid zkern RS+zKern 10 8 err (b) err vs. "

23 Compare with Random Sampling for High Accuracy Time (seconds) G-MR+RS Z+RS RS Size (bytes) G-MR+RS Z+RS RS (x10 3 ) (x10 3 ) (b) Size vs. " (b) Size vs. " Time (seconds) G-MR+RS Z+RS RS err (c) Size vs. err Size (bytes) G-MR+RS Z+RS RS err (d) Size vs. err

Geometric Inference on Kernel Density Estimates

Geometric Inference on Kernel Density Estimates Bei Wang 1 Jeff M. Phillips 2 Yan Zheng 2 1 Scientific Computing and Imaging Institute, University of Utah 2 School of Computing, University of Utah Feb