Output-Sensitive Algorithms for Computing Nearest-Neighbour Decision Boundaries

David Bremner¹, Erik Demaine², Jeff Erickson³, John Iacono⁴, Stefan Langerman⁵, Pat Morin⁶, and Godfried Toussaint⁷

1 Faculty of Computer Science, University of New Brunswick, bremner@unb.ca
2 MIT Laboratory for Computer Science, edemaine@mit.edu
3 Computer Science Department, University of Illinois, jeffe@cs.uiuc.edu
4 Polytechnic University, jiacono@poly.edu
5 Chargé de recherches du FNRS, Université Libre de Bruxelles, stefan.langerman@ulb.ac.be
6 School of Computer Science, Carleton University, morin@cs.carleton.ca
7 School of Computer Science, McGill University, godfried@cs.mcgill.ca

Abstract. Given a set R of red points and a set B of blue points, the nearest-neighbour decision rule classifies a new point q as red (respectively, blue) if the closest point to q in R ∪ B comes from R (respectively, B). This rule implicitly partitions space into a red set and a blue set that are separated by a red-blue decision boundary. In this paper we develop output-sensitive algorithms for computing this decision boundary for point sets on the line and in ℝ². Both algorithms run in time O(n log k), where k is the number of points that contribute to the decision boundary. This running time is the best possible when parameterizing with respect to n and k.

1 Introduction

Let S be a set of n points in the plane that is partitioned into a set of red points denoted by R and a set of blue points denoted by B. The nearest-neighbour decision rule classifies a new point q as the color of the closest point to q in S. The nearest-neighbour decision rule is popular in pattern recognition as a means of learning by example. For this reason, the set S is often referred to as a training set. Several properties make the nearest-neighbour decision rule quite attractive, including its intuitive simplicity and the theorem that the asymptotic error rate of the nearest-neighbour rule is bounded from above by twice the Bayes error rate [6, 8, 16]. (See [17] for an extensive survey of the nearest-neighbour decision rule and its relatives.) Furthermore, for point sets in small dimensions, there are efficient and practical algorithms for preprocessing a set S so that the nearest neighbour of a query point q can be found quickly.

This research was partly funded by the Alexander von Humboldt Foundation and The Natural Sciences and Engineering Research Council of Canada.

The nearest-neighbour decision rule implicitly partitions the plane into a red set and a blue set that meet at a red-blue decision boundary. One attractive aspect of the nearest-neighbour decision rule is that it is often possible to reduce the size of the training set S without changing the decision boundary. To see this, consider the Voronoĭ diagram of S, which partitions the plane into convex (possibly unbounded) polygonal Voronoĭ cells, where the Voronoĭ cell of point p ∈ S is the set of all points that are closer to p than to any other point in S (see Figure 1.a). If the Voronoĭ cell of a red point r is completely surrounded by the Voronoĭ cells of other red points then the point r can be removed from S and this will not change the classification of any point in the plane (see Figure 1.b). We say that these points do not contribute to the decision boundary, and the remaining points contribute to the decision boundary.

Fig. 1. The Voronoĭ diagram (a) before Voronoĭ condensing and (b) after Voronoĭ condensing. Note that the decision boundary (in bold) is unaffected by Voronoĭ condensing. Note: In this figure, and all other figures, red points are denoted by white circles and blue points are denoted by black disks.

The preceding discussion suggests that one approach to reducing the size of the training set S is to simply compute the Voronoĭ diagram of S and remove any points of S whose Voronoĭ cells are surrounded by Voronoĭ cells of the same color. Indeed, this method is referred to as Voronoĭ condensing [18]. There are several O(n log n) time algorithms for computing the Voronoĭ diagram of a set of points in the plane, so Voronoĭ condensing can be implemented to run in O(n log n) time.⁸ However, in this paper we show that we can do significantly better when the number of points that contribute to the decision boundary is small. Indeed, we show how to do Voronoĭ condensing in O(n log k) time, where k is the number of points that contribute to the decision boundary (i.e., the number of points of S that remain after Voronoĭ condensing).
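As a point of reference, the definition of Voronoĭ condensing can be checked by brute force on small inputs. The Python sketch below is neither the O(n log k) algorithm of this paper nor an O(n log n) Voronoĭ-diagram algorithm. It is an O(n⁴) illustration that assumes general position (no four points on a common circle, not all points collinear) and uses the duality between Voronoĭ edges and Delaunay edges. The function names `circumcircle` and `condense_brute_force`, and the numerical tolerances, are our own illustrative choices.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (ux, uy, r2): center and squared radius of the circle through
    a, b and c, or None if the three points are (nearly) collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def condense_brute_force(points, colors):
    """Return the sorted indices of the points that are endpoints of a
    Delaunay edge joining two differently colored points."""
    n = len(points)
    keep = set()
    for i, j, k in combinations(range(n), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        # (i, j, k) is a Delaunay triangle iff no other point lies strictly
        # inside its circumcircle.
        empty = all(
            (points[t][0] - ux) ** 2 + (points[t][1] - uy) ** 2 >= r2 - 1e-9
            for t in range(n)
            if t not in (i, j, k)
        )
        if not empty:
            continue
        for u, v in ((i, j), (j, k), (i, k)):
            if colors[u] != colors[v]:
                keep.update((u, v))
    return sorted(keep)
```

For example, with five red points, one of which is enclosed by the other four, and a single distant blue point, the enclosed red point and the red point farthest from the blue point are dropped. A sketch like this is mainly useful as a test oracle for faster implementations.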
Algorithms such as the one developed here, in which both the size of the input and the size of the output play a role in the running time, are referred to as output-sensitive algorithms.

⁸ Historically, the first efficient algorithm for specifically computing the nearest-neighbour decision boundary is due to Dasarathy and White [7] and runs in O(n⁴) time. The first O(n log n) time algorithm for computing the Voronoĭ diagram of a set of n points in the plane is due to Shamos [15].

Readers familiar with the literature on output-sensitive convex hull algorithms may recognize the expression O(n log k) as the running time of optimal algorithms for computing convex hulls of n-point sets with k extreme points, in 2 or 3 dimensions [2, 4, 5, 13, 19]. This is no coincidence. Given a set of n points in ℝ², we can color them all red and add three blue points at infinity (see Figure 2). In this set, the only points that contribute to the nearest-neighbour decision boundary are the three blue points and the red points on the convex hull of the original set. Thus, identifying the points that contribute to the nearest-neighbour decision boundary is at least as difficult as computing the extreme points of a set.

Fig. 2. The relationship between convex hulls and decision boundaries. Each vertex of the convex hull of R contributes to the decision boundary.

Observe that, once the size of the training set has been reduced by Voronoĭ condensing, the condensed set can be preprocessed in O(k log k) time to answer nearest-neighbour queries in O(log k) time per query. This makes it possible to do nearest-neighbour classifications in O(log k) time. Alternatively, the algorithm we describe for computing the nearest-neighbour decision boundary actually produces an explicit description of the boundary (of size O(k)) that can be preprocessed in O(k) time by Kirkpatrick's point-location algorithm [12] to allow nearest-neighbour classification in O(log k) time.

The remainder of this paper is organized as follows: In Section 2 we describe an algorithm for computing the nearest-neighbour decision boundary of points on a line that runs in O(n log k) time. In Section 3 we present an algorithm for points in the plane that also runs in O(n log k) time. Finally, in Section 4 we summarize and conclude with open problems.

2 A 1-Dimensional Algorithm

In the 1-dimensional version of the nearest-neighbour decision boundary problem, the input set S consists of n real numbers. Imagine sorting S, so that $S = \{s_1, \ldots, s_n\}$ where $s_i < s_{i+1}$ for all $1 \le i < n$. The decision boundary consists of all pairs $(s_i, s_{i+1})$ where $s_i$ is red and $s_{i+1}$ is blue, or vice-versa. Thus, this problem is solvable in linear time if the points of S are sorted. Since sorting the elements of S can be done using any number of O(n log n) time sorting algorithms, this immediately implies an O(n log n) time algorithm. Next, we give an algorithm that runs in O(n log k) time and is similar in spirit to Hoare's quicksort [11].

To find the decision boundary in O(n log k) time, we begin by computing the median element $m = s_{\lceil n/2 \rceil}$ in O(n) time using any one of the existing linear-time median finding algorithms (see [3]). Using an additional O(n) time, we split S into the sets $S_1 = \{s_1, \ldots, s_{\lceil n/2 \rceil - 1}\}$ and $S_2 = \{s_{\lceil n/2 \rceil + 1}, \ldots, s_n\}$ by comparing each element of S to the median element m. At the same time we also find $s_{\lceil n/2 \rceil - 1}$ and $s_{\lceil n/2 \rceil + 1}$ by finding the maximum and minimum elements of $S_1$ and $S_2$, respectively. We then check if $(s_{\lceil n/2 \rceil - 1}, m)$ and/or $(m, s_{\lceil n/2 \rceil + 1})$ are part of the decision boundary and report them if necessary.

At this point, a standard divide-and-conquer algorithm would recurse on both $S_1$ and $S_2$ to give an O(n log n) time algorithm.
However, we can improve on this by observing that it is not necessary to recurse on a subproblem if it contains only elements of one color, since it will not contribute a pair to the decision boundary. Therefore, we recurse on each of $S_1$ and $S_2$ only if they contain at least one red element and one blue element.

The correctness of the above algorithm is clear. To analyze its running time we observe that the running time is bounded by the recurrence

$$T(n, k) \le O(n) + T(n/2, l) + T(n/2, k - l),$$

where l is the number of points that contribute to the decision boundary in $S_1$ and where T(1, k) = O(1) and T(n, 0) = O(n). An easy inductive argument that uses the concavity of the logarithm shows that this recurrence is maximized when l = k/2, in which case the recurrence solves to O(n log k) [5].

Theorem 1 The nearest-neighbour decision boundary of a set of n real numbers can be computed in O(n log k) time, where k is the number of elements that contribute to the decision boundary.
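The pruned divide-and-conquer of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a tuned implementation: coordinates are assumed distinct, the median is found with a randomized quickselect (expected linear time) in place of the deterministic linear-time selection of [3], and the names `select` and `decision_boundary_1d` are hypothetical.

```python
import random


def select(values, rank):
    """Return the element of 0-based rank `rank` among distinct values,
    using randomized quickselect (expected linear time)."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        lower = [v for v in values if v < pivot]
        if rank < len(lower):
            values = lower
            continue
        upper = [v for v in values if v > pivot]
        if rank >= len(values) - len(upper):
            rank -= len(values) - len(upper)
            values = upper
            continue
        return pivot


def decision_boundary_1d(points):
    """points: list of (coordinate, color) pairs with distinct coordinates.
    Returns the sorted list of consecutive pairs with different colors."""
    pairs = set()

    def recurse(pts):
        # Prune: a subproblem with fewer than two colors (including an
        # empty one) cannot contain a boundary pair.
        if len({color for _, color in pts}) < 2:
            return
        m = select([x for x, _ in pts], (len(pts) - 1) // 2)
        mid = next(p for p in pts if p[0] == m)
        left = [p for p in pts if p[0] < m]
        right = [p for p in pts if p[0] > m]
        if left:
            pred = max(left, key=lambda p: p[0])   # predecessor of the median
            if pred[1] != mid[1]:
                pairs.add((pred, mid))
        if right:
            succ = min(right, key=lambda p: p[0])  # successor of the median
            if succ[1] != mid[1]:
                pairs.add((mid, succ))
        recurse(left)
        recurse(right)

    recurse(list(points))
    return sorted(pairs)
```

Calling `decision_boundary_1d([(1, 'r'), (2, 'r'), (3, 'b'), (4, 'b'), (5, 'r'), (0, 'r')])` reports the two pairs (2, 3) and (4, 5), each given as a pair of (coordinate, color) tuples. The color test at the top of `recurse` is what separates this sketch from a plain O(n log n) divide-and-conquer.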

3 A 2-Dimensional Algorithm

In the 2-dimensional nearest-neighbour decision boundary problem the Voronoĭ cells of S are (possibly unbounded) convex polygons and the goal is to find all Voronoĭ edges that bound two cells whose defining points have different colors. Throughout this section we will assume that the points of S are in general position so that no four points of S lie on a common circle. This assumption is not very restrictive, since general position can be simulated using infinitesimal perturbations of the input points.

It will be more convenient to present our algorithm using the terminology of Delaunay triangulations. A Delaunay triangle in S is a triangle whose vertices $(v_1, v_2, v_3)$ are in S and such that the circle with $v_1$, $v_2$ and $v_3$ on its boundary does not contain any point of S in its interior. A Delaunay triangulation of S is a partitioning of the convex hull of S into Delaunay triangles. Alternatively, a Delaunay edge is a line segment whose vertices $(v_1, v_2)$ are in S and such that there exists a circle with $v_1$ and $v_2$ on its boundary that does not contain any point of S in its interior. When S is in general position, the Delaunay triangulation of S is unique and contains all triangles whose edges are Delaunay edges (see [14]). It is well known that the Delaunay triangulation and the Voronoĭ diagram are dual in the sense that two points of S are joined by an edge in the Delaunay triangulation if and only if their Voronoĭ cells share an edge.

We call a Delaunay triangle or Delaunay edge bichromatic if its set of defining vertices contains at least one red and at least one blue point of S. Thus, the problem of computing the nearest-neighbour decision boundary is equivalent to the problem of finding all bichromatic Delaunay edges.

3.1 The High Level Algorithm

In the next few sections, we will describe an algorithm that, given a value κ ≥ k, finds the set of all bichromatic Delaunay triangles in S in O((κ² + n) log κ) time, which for κ ≤ √n simplifies to O(n log κ).
To obtain an algorithm that runs in O(n log k) time, we repeatedly guess the value of κ, run the algorithm until we find the entire decision boundary or until it determines that κ < k and, in the latter case, restart the algorithm with a larger value of κ. If we ever reach a point where the value of κ exceeds √n then we stop the entire algorithm and run an O(n log n) time algorithm to compute the entire Delaunay triangulation of S. The values of κ that we use are $\kappa = 2^{2^i}$ for $i = 0, 1, 2, \ldots, \lceil \log\log n \rceil$. Since the algorithm will terminate once κ ≥ k or κ ≥ √n, the total cost of all runs of the algorithm is therefore

$$T(n, k) = \sum_{i=0}^{\lceil \log\log k \rceil} O(n \log 2^{2^i}) = \sum_{i=0}^{\lceil \log\log k \rceil} O(n 2^i) = O(n \log k),$$

as required.
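The guess-and-restart strategy can be sketched as a small driver. In this Python sketch the two callbacks are hypothetical placeholders, not part of the paper: `bounded_solver(kappa)` stands for the κ-bounded algorithm of the following sections and is assumed to return `None` when it determines that κ < k, and `fallback_solver()` stands for any O(n log n) algorithm that computes the full Delaunay triangulation.

```python
def kappa_schedule(n):
    """Yield the guesses kappa = 2^(2^i) for i = 0, 1, 2, ... for as long
    as kappa does not exceed sqrt(n)."""
    i = 0
    while True:
        kappa = 2 ** (2 ** i)
        if kappa * kappa > n:
            return
        yield kappa
        i += 1


def run_with_guesses(n, bounded_solver, fallback_solver):
    """Try doubly exponentially growing guesses for kappa; fall back to the
    full computation once the guesses exceed sqrt(n)."""
    for kappa in kappa_schedule(n):
        result = bounded_solver(kappa)  # None signals that kappa < k
        if result is not None:
            return result
    return fallback_solver()
```

For n = 10⁶ the schedule is 2, 4, 16, 256. Because log κ doubles from one guess to the next, the costs O(n log κ) of the successive runs form a geometric series dominated by the last run, which is the content of the sum above.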

3.2 Pivots

A key subroutine in our algorithm is the pivot⁹ operation illustrated in Figure 3. A pivot in the set of points S takes as input a ray and reports the largest circle whose center is on the ray, has the origin of the ray on its boundary and has no point of S in its interior. We will make use of the following data structuring result, due to Chan [4]. For completeness, we also include a proof.

⁹ The term pivot comes from linear programming. The relationship between a (polar dual) linear programming pivot and the circular pivot described here is evident when we consider the parabolic lifting that transforms the problem of computing a 2-dimensional Delaunay triangulation to that of computing a 3-dimensional convex hull of a set of points on the paraboloid z = x² + y². In this case, the circle is the projection of the intersection of a plane with the paraboloid.

Fig. 3. A pivot operation.

Lemma 1 (Chan 1996) Let S be a set of n points in ℝ². Then, for any integer 1 ≤ m ≤ n, there exists a data structure of size O(n) that can be constructed in O(n log m) time, and that can perform pivots in S in O((n/m) log m) time per pivot.

Proof. Dobkin and Kirkpatrick [9, 10] show how to preprocess a set S of n points in O(n log n) time to answer pivot queries in O(log n) time per query. Chan's data structure simply partitions S into n/m groups each of size m and then uses the Dobkin-Kirkpatrick data structure on each group. The time to build all n/m data structures is (n/m) × O(m log m) = O(n log m). To perform a query, we simply query each of the n/m data structures in O(log m) time per data structure and report the smallest circle found, for a query time of (n/m) × O(log m) = O((n/m) log m).

In the following, we will be using Lemma 1 with a value of m = κ², so that the time to construct the data structure is O(n log κ) and the query time is O((n/κ²) log κ). We will use two such data structures, one for performing pivots in the set R of red points and one for performing pivots in the set B of blue points.

3.3 Finding the First Edge

The first step in our algorithm is to find a single bichromatic edge of the Delaunay triangulation. Refer to Figure 4. To do this, we begin by choosing any red point r and any blue point b. We then perform a pivot in the set B along the ray with origin r that contains b. This gives us a circle C that has no blue points in its interior and has r as well as some blue point b′ (possibly b′ = b) on its boundary. Next, we perform a pivot in the set R along the ray originating at b′ and passing through the center of C. This gives us a circle $C_1$ that has no point of S in its interior and has b′ and some red point r′ (possibly r′ = r) on its boundary. Therefore, (r′, b′) is a bichromatic edge in the Delaunay triangulation of S.

Fig. 4. The (a) first and (b) second pivot used to find a bichromatic edge (r′, b′).

The above argument shows how to find a bichromatic Delaunay edge using only 2 pivots, one in R and one in B. The second part of the argument also implies the following useful lemma.

Lemma 2 If there is a circle with a red point r and a blue point b on its boundary, and no red (respectively, blue) points in its interior, then r (respectively, b) contributes to the decision boundary.

3.4 Finding More Points

Let Q be the set of points that contribute to the decision boundary, i.e., the set of points that are the vertices of bichromatic triangles in the Delaunay triangulation of S. Suppose that we have already found a set P ⊆ Q and we wish to either (1) find a new point p ∈ Q \ P or (2) verify that P = Q. To do this, we will make use of the augmented Delaunay triangulation of P (see Figure 5). This is the Delaunay triangulation of $P \cup \{v_1, v_2, v_3\}$, where $v_1$, $v_2$, and $v_3$ are three black points at infinity. For any triangle t, we use the notation C(t) to denote the circle whose boundary contains the three vertices of t (note that if t contains a black point then C(t) is a halfplane). The following lemma allows us to tell when we have found the entire set of points Q that contribute to the decision boundary.

Lemma 3 Let P ⊆ Q. The following statements are equivalent:

1. For every triangle t in the augmented Delaunay triangulation of P, if t has a blue (respectively, red) vertex then C(t) does not have a red (respectively, blue) point of S in its interior.
2. P = Q.

Fig. 5. The augmented Delaunay triangulation of S.

Proof. First we show that if Statement 1 of the lemma is not true, then Statement 2 is also not true, i.e., P ≠ Q. Suppose there is some triangle t in the augmented Delaunay triangulation of P such that t has a blue vertex b and C(t) contains a red point of S in its interior. Pivot in R along the ray originating at b and passing through the center of C(t) (see Figure 6). This will give a circle C with b and some red point r ∉ P on its boundary and with no red points in its interior. Therefore, by Lemma 2, r contributes to the decision boundary and is therefore in Q, so P ≠ Q. A symmetric argument applies when t has a red vertex r and C(t) contains a blue vertex in its interior.

Fig. 6. If Statement 1 of Lemma 3 is not true then P ≠ Q.

Next we show that if Statement 2 of the lemma is not true then Statement 1 is not true. Suppose that P ≠ Q. Let r be a point in Q \ P and, without loss of generality, assume r is a red point. Since r is in Q, there is a circle C with r and some blue point b on its boundary and with no points of S in its interior. We will use r and b to show that the augmented Delaunay triangulation of P contains a triangle t such that either (1) b is a vertex of t and C(t) contains r in its interior, or (2) C(t) contains both r and b in its interior. In either case, Statement 1 of the lemma is not true because of triangle t.

Refer to Figure 7 for what follows. Consider the largest circle $C_1$ that is concentric with C and that contains no point of P in its interior (this circle is at least as large as C). The circle $C_1$ will have at least one point $p_1$ of P on its boundary (it could be that $p_1 = b$, if b ∈ P). Next, perform a pivot in P along the ray originating at $p_1$ and containing the center of $C_1$. This will give a circle $C_2$ that contains $C_1$ and with two points $p_1$ and $p_2$ of $P \cup \{v_1, v_2, v_3\}$ on its boundary and with no points of $P \cup \{v_1, v_2, v_3\}$ in its interior. Therefore, $(p_1, p_2)$ is an edge in the augmented Delaunay triangulation of P.

The edge $(p_1, p_2)$ partitions the interior of $C_2$ into two pieces, one that contains r and one that does not. It is possible to move the center of $C_2$ along the perpendicular bisector of $(p_1, p_2)$ maintaining $p_1$ and $p_2$ on the boundary of $C_2$. There are two directions in which the center of $C_2$ can be moved to accomplish this. In one direction, say d, the part of the interior that contains r only increases, so move the center in this direction until a third point $p_3 \in P \cup \{v_1, v_2, v_3\}$ is on the boundary of $C_2$. The resulting circle has the points $p_1$, $p_2$, and $p_3$ on its boundary and no points of P in its interior, so $p_1$, $p_2$ and $p_3$ are the vertices of a triangle t in the augmented Delaunay triangulation of P. The circumcircle C(t) contains r in its interior and contains b either in its interior or on its boundary.
In either case, t contradicts Statement 1, as promised.

Note that the first paragraph in the proof of Lemma 3 gives a method of testing whether P = Q, and when this is not the case, of finding a point in Q \ P. For each triangle t in the augmented Delaunay triangulation of P, if t contains a blue vertex b then perform a pivot in R along the ray originating at b and passing through the center of C(t). If the result of this pivot is C(t), then do nothing. Otherwise, the pivot finds a circle C with no red points in its interior and that has one blue point b and one red point r ∉ P on its boundary. By Lemma 2, the point r must be in Q. If t contains a red vertex, repeat the above procedure swapping the roles of red and blue. If both pivots (from the red point and the blue point) find the circle C(t), then we have verified Statement 1 of Lemma 3 for the triangle t.

The above procedure performs at most two pivots for each triangle t in the augmented Delaunay triangulation of P. Therefore, this procedure performs O(|P|) = O(κ) pivots. Since we repeat this procedure at most κ times before deciding that κ < k, we perform O(κ²) pivots, at a total cost of O(κ² · (n/κ²) log κ) = O(n log κ). The only other work done by the algorithm is that of recomputing the augmented Delaunay triangulation of P each time we add a new vertex to P. Since each such computation takes O(|P| log |P|) time and |P| ≤ κ, the total amount of work done in computing all these triangulations is O(κ² log κ).
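To make the pivot operation of Section 3.2 and the two-pivot search of Section 3.3 concrete, the following Python sketch implements a pivot by brute force in O(n) time per query. It deliberately does not use the Dobkin-Kirkpatrick or Chan data structures of Lemma 1. It relies on the observation that a circle of radius t through the ray origin o, centered at o + t·d for a unit direction d, contains a point p strictly in its interior exactly when t > |p − o|² / (2 (p − o)·d). The names `pivot` and `first_bichromatic_edge`, and the choice of the first red and first blue point as the starting pair, are illustrative assumptions.

```python
import math


def pivot(points, origin, through):
    """Brute-force pivot in `points` along the ray from `origin` towards
    `through`. Returns (center, radius, hit), where `hit` is the point that
    stops the circle from growing; center and hit are None and radius is
    math.inf when the circle degenerates to a halfplane."""
    ox, oy = origin
    dx, dy = through[0] - ox, through[1] - oy
    norm = math.hypot(dx, dy)
    dx, dy = dx / norm, dy / norm
    best_t, best_p = math.inf, None
    for p in points:
        vx, vy = p[0] - ox, p[1] - oy
        dot = vx * dx + vy * dy
        if dot <= 0:
            continue  # p (or the origin itself) never enters the circle
        t = (vx * vx + vy * vy) / (2 * dot)  # radius at which p is reached
        if t < best_t:
            best_t, best_p = t, p
    if best_p is None:
        return None, math.inf, None
    return (ox + best_t * dx, oy + best_t * dy), best_t, best_p


def first_bichromatic_edge(red, blue):
    """Find one bichromatic Delaunay edge (r', b') using two pivots."""
    r, b = red[0], blue[0]
    # Pivot in B from r towards b: circle C with r and b1 on its boundary
    # and no blue point in its interior.
    center_c, _, b1 = pivot(blue, r, b)
    # Pivot in R from b1 towards the center of C: circle C1 inside C with
    # b1 and r1 on its boundary and no point of S in its interior.
    _, _, r1 = pivot(red, b1, center_c)
    return r1, b1
```

With `red = [(0, 0), (2, 0.3), (-2, 0.1), (0.2, 2), (-0.1, -2)]` and `blue = [(10, 0.5)]`, the search starts from the red point (0, 0) and returns the edge between (2, 0.3) and (10, 0.5). Replacing the linear scan in `pivot` by the grouped structure of Lemma 1 changes only the query time, not the logic of the search.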

Fig. 7. If P ≠ Q then Statement 1 of Lemma 3 is not true. The left column (1) corresponds to the case where b ∉ P and the right column (2) corresponds to the case where b ∈ P.

In summary, we have an algorithm that given S and κ decides whether the condensed set Q of points in S that contribute to the decision boundary has size at most κ, and if so, computes Q. This algorithm runs in O((κ² + n) log κ) time. By trying increasingly large values of κ as described in Section 3.1 we obtain our main theorem.

Theorem 2 The nearest-neighbour decision boundary of a set of n points in ℝ² can be computed in O(n log k) time, where k is the number of points that contribute to the decision boundary.

Remark: Theorem 2 extends to the case where there are more than 2 color classes and our goal is to find all Voronoĭ edges bounding two cells of different color. The only modification required is that, for each color class, R, we use two pivoting data structures, one for R and one for S \ R. When performing pivots from a point in R, we use the data structure for pivots in S \ R. Otherwise, the details of the algorithm are identical.

Remark: In the pattern-recognition community pattern classification rules are often implemented as neural networks. In the terminology of neural networks, Theorem 2 states that it is possible, in O(n log k) time, to design a simple one-layer neural network that implements the nearest-neighbour decision rule and uses only k McCulloch-Pitts neurons (threshold logic units).

4 Conclusions

We have given O(n log k) time algorithms for computing nearest-neighbour decision boundaries in 1 and 2 dimensions, where k is the number of points that contribute to the decision boundary. A standard application of Ben-Or's lower-bound technique [1] shows that even the 1-dimensional algorithm is optimal in the algebraic decision tree model of computation.

We have not studied algorithms for dimensions d ≥ 3. In this case, it is not even clear what the term output-sensitive means. Should k be the number of points that contribute to the decision boundary, or should k be the complexity of the decision boundary? In the first case, k ≤ n for any dimension d, while in the second case, k could be as large as $\Omega(n^{\lceil d/2 \rceil})$.
To the best of our knowledge, both are open problems.

References

1. M. Ben-Or. Lower bounds for algebraic computation trees (preliminary report). In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, pages 80–86, 1983.
2. B. K. Bhattacharya and S. Sen. On a simple, practical, optimal, output-sensitive randomized planar convex hull algorithm. Journal of Algorithms, 25(1):177–193, 1997.
3. M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7:448–461, 1973.
4. T. M. Chan. Optimal output-sensitive convex hull algorithms in two and three dimensions. Discrete & Computational Geometry, 16:361–368, 1996.
5. T. M. Chan, J. Snoeyink, and C. K. Yap. Primal dividing and dual pruning: Output-sensitive construction of four-dimensional polytopes and three-dimensional Voronoi diagrams. Discrete & Computational Geometry, 18:433–454, 1997.
6. T. M. Cover and P. E. Hart. Nearest neighbour pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
7. B. Dasarathy and L. J. White. A characterization of nearest-neighbour rule decision surfaces and a new approach to generate them. Pattern Recognition, 10:41–46, 1978.
8. L. Devroye. On the inequality of Cover and Hart. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3:75–78, 1981.
9. D. P. Dobkin and D. G. Kirkpatrick. Fast detection of polyhedral intersection. Theoretical Computer Science, 27:241–253, 1983.
10. D. P. Dobkin and D. G. Kirkpatrick. A linear algorithm for determining the separation of convex polyhedra. Journal of Algorithms, 6:381–392, 1985.
11. C. A. R. Hoare. ACM Algorithm 64: Quicksort. Communications of the ACM, 4(7):321, 1961.
12. D. G. Kirkpatrick. Optimal search in planar subdivisions. SIAM Journal on Computing, 12(1):28–35, 1983.
13. D. G. Kirkpatrick and R. Seidel. The ultimate planar convex hull algorithm? SIAM Journal on Computing, 15(1):287–299, 1986.
14. F. P. Preparata and M. I. Shamos. Computational Geometry. Springer-Verlag, 1985.
15. M. I. Shamos. Geometric complexity. In Proceedings of the 7th ACM Symposium on the Theory of Computing (STOC 1975), pages 224–253, 1975.
16. C. Stone. Consistent nonparametric regression. Annals of Statistics, 8:1348–1360, 1977.
17. G. T. Toussaint. Proximity graphs for instance-based learning. Manuscript, 2003.
18. G. T. Toussaint, B. K. Bhattacharya, and R. S. Poulsen. The application of Voronoi diagrams to non-parametric decision rules. In Proceedings of Computer Science and Statistics: 16th Symposium of the Interface, 1984.
19. R. Wenger. Randomized quick hull. Algorithmica, 17:322–329, 1997.