A Las Vegas approximation algorithm for metric 1-median selection


A Las Vegas approximation algorithm for metric 1-median selection

Ching-Lueh Chang
Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan. Email: clchang@saturn.yzu.edu.tw
(Supported in part by the Ministry of Science and Technology of Taiwan under grant 105-2221-E-155-047-.)

February 2017

Abstract. Given an n-point metric space, consider the problem of finding a point with the minimum sum of distances to all points. We show that this problem has a randomized algorithm that always outputs a (2 + ε)-approximate solution in an expected O(n/ε²) time for each constant ε > 0. Inheriting Indyk's [9] algorithm, our algorithm outputs a (1 + ε)-approximate 1-median in O(n/ε²) time with probability Ω(1).

1 Introduction

A metric space is a nonempty set M endowed with a metric, i.e., a function d: M × M → [0, ∞) such that d(x, y) = 0 if and only if x = y (identity of indiscernibles), d(x, y) = d(y, x) (symmetry), and d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality) for all x, y, z ∈ M [13]. For all n ∈ Z⁺, define [n] ≜ {1, 2, ..., n}. Given n ∈ Z⁺ and oracle access to a metric d: [n] × [n] → [0, ∞), metric 1-median asks for argmin_{y∈[n]} ∑_{x∈[n]} d(y, x), breaking ties arbitrarily. It generalizes the classical median selection on the real line and has a brute-force Θ(n²)-time algorithm. More generally, metric k-median asks for c₁, c₂, ..., c_k ∈ [n] minimizing ∑_{x∈[n]} min_{i∈[k]} d(x, c_i). Because d(·, ·) defines Θ(n²) nonzero distances, only o(n²)-time algorithms are said to run in sublinear time [8].
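To make the problem statement concrete, here is a minimal Python sketch of the brute-force Θ(n²) baseline just mentioned; the five-point line metric is a hypothetical toy example of ours, not from the paper.

```python
def brute_force_1_median(n, d):
    # Evaluate sum_x d(y, x) for every candidate y: Theta(n^2) oracle queries.
    return min(range(n), key=lambda y: sum(d(y, x) for x in range(n)))

# Toy metric: points on the real line with d(x, y) = |p[x] - p[y]|.
p = [0.0, 1.0, 2.0, 7.0, 11.0]
d = lambda x, y: abs(p[x] - p[y])
print(brute_force_1_median(len(p), d))  # -> 2 (objective 17, the minimum)
```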

For all α ≥ 1, an α-approximate 1-median is a point p ∈ [n] satisfying

∑_{x∈[n]} d(p, x) ≤ α · min_{y∈[n]} ∑_{x∈[n]} d(y, x).

For all ε > 0, metric 1-median has a Monte Carlo (1 + ε)-approximation O(n/ε²)-time algorithm [8, 9]. Guha et al. [7] show that metric k-median has a Monte Carlo, O(exp(O(1/ε)))-approximation, O(nk log n)-time, O(n^ε)-space and one-pass algorithm for all small k as well as a deterministic, O(exp(O(1/ε)))-approximation, O(n^{1+ε})-time, O(n^ε)-space and one-pass algorithm. Given n points in R^D with D ≥ 1, the Monte Carlo algorithms of Kumar et al. [10] find a (1 + ε)-approximate 1-median in O(D · exp(1/ε^{O(1)})) time and a (1 + ε)-approximate solution to metric k-median in O(Dn · exp((k/ε)^{O(1)})) time. All randomized O(1)-approximation algorithms for metric k-median take Ω(nk) time [7, 11]. Chang [2] shows that metric 1-median has a deterministic, (2h)-approximation, O(hn^{1+1/h})-time and nonadaptive algorithm for all constants h ∈ Z⁺ \ {1}, generalizing the results of Chang [1] and Wu [15]. On the other hand, he disproves the existence of deterministic (2h − ε)-approximation O(n^{1+1/(h−1)}/h)-time algorithms for all constants h ∈ Z⁺ \ {1} and ε > 0 [3, 4].

In social network analysis, the closeness centrality of a point v is the reciprocal of the average distance from v to all points [14]. So metric 1-median asks for a point with the maximum closeness centrality. Given oracle access to a graph metric, the Monte Carlo algorithms of Goldreich and Ron [6] and Eppstein and Wang [5] estimate the closeness centrality of a given point and those of all points, respectively.

All known sublinear-time algorithms for metric 1-median are either deterministic or Monte Carlo, the latter having a positive probability of failure. For example, Indyk's Monte Carlo (1 + ε)-approximation algorithm outputs with a positive probability a solution without approximation guarantees. In contrast, we show that metric 1-median has a randomized algorithm that always outputs a (2 + ε)-approximate solution in expected O(n/ε²) time for all constants ε > 0. So, excluding the known deterministic algorithms (which are Las Vegas only in the degenerate sense), this paper gives the first Las Vegas approximation algorithm for metric 1-median with an expected sublinear running time. Note that deterministic sublinear-time algorithms for metric 1-median can be 4-approximate but not (4 − ε)-approximate for any constant ε > 0 [1, 4]. So our approximation ratio of 2 + ε beats that of any deterministic sublinear-time algorithm. Inheriting Indyk's algorithm, our algorithm outputs a (1 + ε)-approximate 1-median in O(n/ε²) time with probability Ω(1) for all constants ε > 0.

Below is our high-level and inaccurate sketch of proof, where ε, δ > 0 are small constants:

(i) Run Indyk's algorithm to find a probably (1 + ε/10^10)-approximate 1-median, z. Then let r ≜ ∑_{x∈[n]} d(z, x)/n be the average distance from z to all points.

(ii) For all R > 0, denote by B(z, R) the open ball with center z and radius R. Use the triangle inequality (with details omitted here) to show z to be a solution no worse than the points in [n] \ B(z, 8r), i.e.,

∑_{x∈[n]} d(z, x) ≤ inf_{y∈[n]\B(z,8r)} ∑_{x∈[n]} d(y, x). (1)

(iii) Take a uniformly random bijection π: [|B(z, δnr)|] → B(z, δnr). Then observe that

min_{y∈B(z,8r)} ∑_{x∈B(z,δnr)} d(y, x) ≥ min_{y∈B(z,8r)} ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} (d(y, π(2i−1)) + d(y, π(2i))) (2)
≥ ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)), (3)

where the first (resp., second) inequality follows from the injectivity of π (resp., the triangle inequality).

(iv) Assume B(z, δnr) = [n] for simplicity. So by inequalities (1)–(3), if the following inequality holds, then it serves as a witness that z is (2 + ε)-approximate:

∑_{x∈[n]} d(z, x) ≤ (2 + ε) ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)). (4)

To guarantee outputting a (2 + ε)-approximate 1-median, output z only when inequality (4) holds. Restart from item (i) whenever inequality (4) is false.

More details of item (iv) follow: For a 1-median z̃ of B(z, δnr), it will be easy to show

∑_{x∈B(z,δnr)} d(z̃, x) ≤ (2 + o(1)) E[∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i))]. (5)

(Though not directly stated in later sections, this is a consequence of Lemmas 7 and 12 in Sec. 4.) When z in item (i) is indeed (1 + ε/10^10)-approximate,

∑_{x∈[n]} d(z, x) ≤ (1 + ε/10^10) ∑_{x∈[n]} d(z̃, x). (6)

Assuming B(z, δnr) = [n], inequalities (2)–(4) imply ∑_{x∈[n]} d(z, x) ≤ (2 + ε) ∑_{x∈[n]} d(y, x) for all y ∈ B(z, 8r). Furthermore, ∑_{x∈[n]} d(z, x) ≤ ∑_{x∈[n]} d(y, x) for all y ∈ [n] \ B(z, 8r) by inequality (1).

Assuming B(z, δnr) = [n], inequalities (5)–(6) make inequality (4) hold with high probability as long as ∑_i d(π(2i−1), π(2i)) is highly concentrated around its expectation. The need for such concentration is why we restrict the radius of the codomain of π to be δnr in item (iii): large distances ruin concentration bounds. To accommodate for the points in [n] \ B(z, δnr), our witness for the approximation ratio of z actually differs slightly from inequality (4), unlike in item (iv). (Our witness for the approximation ratio of z is as in line 6 of Las_Vegas_median in Fig. 1.)
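Inequalities (2)–(3) are the crux of the witness: by the triangle inequality, the total length of any collection of disjoint pairs lower-bounds ∑_x d(y, x) for every y simultaneously. Below is a minimal Python sanity check of this fact on a hypothetical random planar instance (the point set, seed, and names are ours, not the paper's).

```python
import random

def random_pairing_witness(indices, d):
    # Uniformly random bijection pi, then sum d(pi(2i-1), pi(2i)) as in (3).
    pi = indices[:]
    random.shuffle(pi)
    return sum(d(pi[2 * i], pi[2 * i + 1]) for i in range(len(pi) // 2))

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(1000)]
d = lambda i, j: ((pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2) ** 0.5
idx = list(range(len(pts)))
w = random_pairing_witness(idx, d)
for y in random.sample(idx, 5):
    # For disjoint pairs, d(u, v) <= d(y, u) + d(y, v); summing shows the
    # witness never exceeds sum_x d(y, x), for every y, as in (2)-(3).
    assert sum(d(y, x) for x in idx) >= w
print(w)
```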

2 Definitions and preliminaries

For a metric space ([n], d), x ∈ [n] and R > 0, define B(x, R) ≜ {y ∈ [n] | d(x, y) < R} to be the open ball with center x and radius R. For brevity, B²(x, R) ≜ B(x, R) × B(x, R). The pairs in B²(x, R) are ordered. An algorithm A with oracle access to d: [n] × [n] → [0, ∞) is denoted by A^d and may query d on any (x, y) ∈ [n] × [n] for d(x, y). In this paper, all Landau symbols (such as O(·), o(·), Θ(·) and Ω(·)) are with respect to n.

The following result is due to Indyk.

Fact 1 ([8, 9]). For all ε > 0, metric 1-median has a Monte Carlo (1 + ε)-approximation O(n/ε²)-time algorithm with a failure probability of at most 1/e.

Henceforth, denote Indyk's algorithm in Fact 1 by Indyk_median. It is given n ∈ Z⁺, ε > 0 and oracle access to a metric d: [n] × [n] → [0, ∞). By convention, denote the expected value and the variance of a random variable X by E[X] and var(X), respectively.

Chebyshev's inequality ([12]). Let X be a random variable with a finite expected value and a finite nonzero variance. Then for all k ≥ 1,

Pr[|X − E[X]| ≥ k √var(X)] ≤ 1/k².
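Since only distance queries are charged, the query count is the natural cost measure in this oracle model. The wrapper below makes A^d concrete; it is our own illustration, not anything defined in the paper.

```python
class CountingOracle:
    """Wraps a metric d: [n] x [n] -> [0, inf) and counts distance queries."""

    def __init__(self, d):
        self._d = d
        self.queries = 0

    def __call__(self, x, y):
        self.queries += 1          # one unit of cost per oracle access
        return self._d(x, y)

# Example with the discrete metric: a brute-force scan costs n^2 queries,
# which is exactly what sublinear-time algorithms must avoid.
n = 100
oracle = CountingOracle(lambda x, y: 0 if x == y else 1)
best = min(range(n), key=lambda y: sum(oracle(y, x) for x in range(n)))
print(best, oracle.queries)        # -> 0 10000
```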

3 Algorithm and approximation ratio

Throughout this paper, take any small constant ε > 0, e.g., ε ≜ 10^−100. By line 1 of Las_Vegas_median in Fig. 1, δ > 0 is likewise a small constant. The following lemma implies that z in line 3 of Las_Vegas_median is a solution (to metric 1-median) no worse than those in [n] \ B(z, 8r), where r is as in line 4.

1: Find δ > 0 such that 2 + ε ≥ 2/(1 − 200√δ);
2: while true do
3:   z ← Indyk_median^d(n, ε/10^10);
4:   r ← ∑_{x∈[n]} d(z, x)/n;
5:   Pick a uniformly random bijection π: [|B(z, δnr)|] → B(z, δnr);
6:   if ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r) ≥ (1 − 200√δ)nr/2 then
7:     return z;
8:   end if
9: end while

Figure 1: Algorithm Las_Vegas_median with oracle access to a metric d: [n] × [n] → [0, ∞) and with inputs n ∈ Z⁺ and a small constant ε > 0
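Below is a rough Python sketch of Las_Vegas_median as reconstructed in Fig. 1. The subroutine indyk_median_stub is a placeholder of ours (best of a few uniformly random candidates, evaluated exactly); it is not Indyk's algorithm, which the paper uses only as a black box with the guarantee of Fact 1. The demo parameters are deliberately loose: the paper's guarantees are asymptotic and assume n ≥ 1/δ + 4.

```python
import random

def las_vegas_median(n, d, eps):
    # Line 1: pick delta > 0 with 2 + eps >= 2/(1 - 200*sqrt(delta)).
    sqrt_delta = eps / (200 * (2 + eps))
    delta = sqrt_delta ** 2

    def indyk_median_stub(candidates):
        # Placeholder for Indyk_median (Fact 1), NOT his algorithm:
        # return the best of a few uniformly random candidates.
        cand = [random.randrange(n) for _ in range(candidates)]
        return min(cand, key=lambda y: sum(d(y, x) for x in range(n)))

    while True:                                               # line 2
        z = indyk_median_stub(max(1, int(1 / eps ** 2)))      # line 3
        r = sum(d(z, x) for x in range(n)) / n                # line 4
        ball = [x for x in range(n) if d(z, x) < delta * n * r]
        random.shuffle(ball)                                  # line 5, Knuth shuffle
        witness = sum(d(ball[2 * i], ball[2 * i + 1])
                      for i in range(len(ball) // 2))
        witness += sum(d(z, x) - 8 * r
                       for x in range(n) if d(z, x) >= delta * n * r)
        if witness >= (1 - 200 * sqrt_delta) * n * r / 2:     # line 6
            return z                                          # line 7

# Demo on points spread over [0, 1]; eps is huge here so that 1/delta + 4 <= n.
random.seed(1)
pts = [random.random() for _ in range(200000)]
d = lambda x, y: abs(pts[x] - pts[y])
z = las_vegas_median(len(pts), d, eps=18.0)
print(z, sum(d(z, x) for x in range(len(pts))) / len(pts))
```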

Lemma 2. In each iteration of the while loop of Las_Vegas_median,

inf_{y∈[n]\B(z,8r)} ∑_{x∈[n]} d(y, x) ≥ 7 ∑_{x∈[n]} d(z, x).

Proof. For each y ∈ [n] \ B(z, 8r),

∑_{x∈[n]} d(y, x) ≥ ∑_{x∈[n]} (d(y, z) − d(z, x)) ≥ ∑_{x∈[n]} (8r − d(z, x)) = 8nr − ∑_{x∈[n]} d(z, x) = 7 ∑_{x∈[n]} d(z, x),

where the first inequality follows from the triangle inequality, the second follows from y ∉ B(z, 8r) and the last equality follows from line 4 of Las_Vegas_median.

Lemma 3. When line 7 of Las_Vegas_median is run,

min_{y∈B(z,8r)} ∑_{x∈[n]} d(y, x) ≥ ((1 − 200√δ)/2) ∑_{x∈[n]} d(z, x).

Proof. Pick any y ∈ B(z, 8r). We have

∑_{x∈B(z,δnr)} d(y, x) ≥ ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} (d(y, π(2i−1)) + d(y, π(2i))) ≥ ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)), (7)

where the first and the second inequalities follow from the injectivity of π in line 5 of Las_Vegas_median and the triangle inequality, respectively. (Note that π(1), π(2), ..., π(2⌊|B(z, δnr)|/2⌋) are distinct elements of B(z, δnr).) Furthermore,

∑_{x∈[n]\B(z,δnr)} d(y, x) ≥ ∑_{x∈[n]\B(z,δnr)} (d(z, x) − d(y, z)) ≥ ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r), (8)

where the first and the second inequalities follow from the triangle inequality and y ∈ B(z, 8r), respectively. Summing up inequalities (7)–(8),

∑_{x∈[n]} d(y, x) ≥ ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r).

This and lines 6–7 of Las_Vegas_median imply ∑_{x∈[n]} d(y, x) ≥ (1 − 200√δ)nr/2 when line 7 is run. Finally, nr = ∑_{x∈[n]} d(z, x) by line 4.

Lemmas 2–3 and line 1 of Las_Vegas_median yield the following.

Lemma 4. When line 7 of Las_Vegas_median is run,

(2 + ε) min_{y∈[n]} ∑_{x∈[n]} d(y, x) ≥ ∑_{x∈[n]} d(z, x),

i.e., z is a (2 + ε)-approximate 1-median.

By Lemma 4, Las_Vegas_median outputs a (2 + ε)-approximate 1-median at termination.

4 Probability of termination in any iteration

This section analyzes the probability of running line 7 in any particular iteration of the while loop of Las_Vegas_median. The following lemma uses an easy averaging argument.

Lemma 5. |[n] \ B(z, δnr)| ≤ 1/δ and, therefore, |B(z, δnr)| ≥ n − 1/δ = (1 − o(1))n.

Proof. Clearly,

∑_{x∈[n]} d(z, x) ≥ ∑_{x∈[n]\B(z,δnr)} d(z, x) ≥ |[n] \ B(z, δnr)| · δnr.

Then use line 4 of Las_Vegas_median.

Henceforth, assume n ≥ 1/δ + 4 without loss of generality; otherwise, find a 1-median by brute force. So |B(z, δnr)| ≥ 4 by Lemma 5. Define

r̃ ≜ (1/|B²(z, δnr)|) ∑_{(u,v)∈B²(z,δnr)} d(u, v) (9)

to be the average distance in B(z, δnr).

Lemma 6. r̃ ≤ 2r.

Proof. By equation (9) and the triangle inequality,

r̃ ≤ (1/|B²(z, δnr)|) ∑_{(u,v)∈B²(z,δnr)} (d(z, u) + d(z, v)) = (1/|B²(z, δnr)|) (|B(z, δnr)| ∑_{u∈B(z,δnr)} d(z, u) + |B(z, δnr)| ∑_{v∈B(z,δnr)} d(z, v)) = (2/|B(z, δnr)|) ∑_{u∈B(z,δnr)} d(z, u). (10)

Obviously, the average distance from z to the points in B(z, δnr) is at most that from z to all points, i.e.,

(1/|B(z, δnr)|) ∑_{u∈B(z,δnr)} d(z, u) ≤ (1/n) ∑_{u∈[n]} d(z, u). (11)

Inequalities (10)–(11) and line 4 of Las_Vegas_median complete the proof.

To analyze the probability that the condition in line 6 of Las_Vegas_median holds, we shall derive a concentration bound for ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)), whose expected value and variance are examined in the next four lemmas.

Lemma 7. With expectations taken over π,

E[∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i))] = (1 ± o(1)) nr̃/2. (12)

Proof. For each i ∈ [⌊|B(z, δnr)|/2⌋], {π(2i−1), π(2i)} is a uniformly random size-2 subset of B(z, δnr) by line 5 of Las_Vegas_median. Therefore,

E[d(π(2i−1), π(2i))] = (1/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{distinct u,v∈B(z,δnr)} d(u, v) (13)
= (1/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{(u,v)∈B²(z,δnr)} d(u, v) = (1 + o(1)) r̃, (14)

where the second (resp., last) equality follows from the identity of indiscernibles (resp., equation (9) and Lemma 5). Finally, use equations (13)–(14), the linearity of expectation and Lemma 5.

Clearly,

E[(∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)))²] = E[∑_{i,j∈[⌊|B(z,δnr)|/2⌋]} d(π(2i−1), π(2i)) d(π(2j−1), π(2j))] (15)
= ∑_{distinct i,j∈[⌊|B(z,δnr)|/2⌋]} E[d(π(2i−1), π(2i)) d(π(2j−1), π(2j))] + ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} E[d²(π(2i−1), π(2i))], (16)

where the last equality follows from the linearity of expectation and the separation of pairs (i, j) according to whether i = j.

Lemma 8. With expectations taken over π,

∑_{distinct i,j∈[⌊|B(z,δnr)|/2⌋]} E[d(π(2i−1), π(2i)) d(π(2j−1), π(2j))] ≤ (1/4)(1 + o(1)) n²r̃².

Proof. Pick any distinct i, j ∈ [⌊|B(z, δnr)|/2⌋]. By line 5 of Las_Vegas_median, {π(2i−1), π(2i), π(2j−1), π(2j)} is a uniformly random size-4 subset of B(z, δnr). So

E[d(π(2i−1), π(2i)) d(π(2j−1), π(2j))] = (1/(|B(z, δnr)| (|B(z, δnr)| − 1) (|B(z, δnr)| − 2) (|B(z, δnr)| − 3))) ∑_{distinct u,v,x,y∈B(z,δnr)} d(u, v) d(x, y).

Clearly,

∑_{distinct u,v,x,y∈B(z,δnr)} d(u, v) d(x, y) ≤ ∑_{u,v,x,y∈B(z,δnr)} d(u, v) d(x, y) = (∑_{(u,v)∈B²(z,δnr)} d(u, v)) (∑_{(x,y)∈B²(z,δnr)} d(x, y)).

In summary,

∑_{distinct i,j∈[⌊|B(z,δnr)|/2⌋]} E[d(π(2i−1), π(2i)) d(π(2j−1), π(2j))] ≤ ⌊|B(z, δnr)|/2⌋² (∑_{(u,v)∈B²(z,δnr)} d(u, v))² / (|B(z, δnr)| (|B(z, δnr)| − 1) (|B(z, δnr)| − 2) (|B(z, δnr)| − 3)).

Together with Lemma 5 and equation (9), this completes the proof.

Lemma 9. With expectations taken over π,

∑_{i=1}^{⌊|B(z,δnr)|/2⌋} E[d²(π(2i−1), π(2i))] ≤ (1 + o(1)) (δn²r̃r + 2δ²nr²). (17)

Proof. By line 5 of Las_Vegas_median, {π(2i−1), π(2i)} is a uniformly random size-2 subset of B(z, δnr) for each i ∈ [⌊|B(z, δnr)|/2⌋]. Therefore,

∑_{i=1}^{⌊|B(z,δnr)|/2⌋} E[d²(π(2i−1), π(2i))] (18)
= ⌊|B(z, δnr)|/2⌋ (1/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{distinct u,v∈B(z,δnr)} d²(u, v)
= (⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{(u,v)∈B²(z,δnr)} d²(u, v).

For all u, v ∈ B(z, δnr),

d(u, v) ≤ d(z, u) + d(z, v) < δnr + δnr = 2δnr, (19)

where the first inequality follows from the triangle inequality. By equations (9) and (18)–(19), the left-hand side of inequality (17) cannot exceed the optimal value of the following problem, called max_square_sum: Find d_{u,v} ∈ R for all (u, v) ∈ B²(z, δnr) to

maximize (⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{(u,v)∈B²(z,δnr)} d²_{u,v} (20)
subject to ∑_{(u,v)∈B²(z,δnr)} d_{u,v} = |B²(z, δnr)| r̃, (21)
∀(u, v) ∈ B²(z, δnr), 0 ≤ d_{u,v} ≤ 2δnr. (22)

Above, constraint (21) (resp., (22)) mimics equation (9) (resp., inequality (19) and the non-negativeness of distances). Appendix A bounds the optimal value of max_square_sum from above by

(⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) (⌊|B²(z, δnr)| r̃/(2δnr)⌋ + 1) (2δnr)².

This evaluates to be at most (1 + o(1))(δn²r̃r + 2δ²nr²) by Lemma 5.

Recall that the variance of any random variable X equals E[X²] − (E[X])².

Lemma 10. With variances taken over π,

var(∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i))) ≤ (2 + o(1)) δn²r².

Proof. By equations (15)–(16) and Lemmas 8–9,

E[(∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)))²] ≤ (1/4)(1 + o(1)) n²r̃² + (1 + o(1))(δn²r̃r + 2δ²nr²).

This and Lemma 7 imply

var(∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i))) ≤ o(1) n²r̃² + (1 + o(1))(δn²r̃r + 2δ²nr²).

Finally, invoke Lemma 6.

Lemma 11. For all k > 1,

Pr[∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) ≤ (1 ± o(1)) nr̃/2 − k √((2 + o(1))δ) nr] ≤ 1/k²,

where the probability is taken over π.

Proof. Use Chebyshev's inequality and Lemmas 7 and 10.
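Lemmas 7–11 together say that the pair sum concentrates around nr̃/2, with a standard deviation of only O(√δ)·nr. A quick Monte Carlo sanity check of that claim in Python, using a hypothetical one-dimensional point set as a stand-in for B(z, δnr):

```python
import random, statistics

random.seed(2)
pts = [random.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in for B(z, δnr)
d = lambda u, v: abs(u - v)

def pair_sum(points):
    pi = points[:]
    random.shuffle(pi)                                 # the random bijection pi
    return sum(d(pi[2 * i], pi[2 * i + 1]) for i in range(len(pi) // 2))

samples = [pair_sum(pts) for _ in range(200)]
# Equation (9): r~ is the average distance over ordered pairs of the ball.
r_tilde = sum(d(u, v) for u in pts for v in pts) / len(pts) ** 2
print(statistics.mean(samples), len(pts) * r_tilde / 2)      # close, as in Lemma 7
print(statistics.stdev(samples) / statistics.mean(samples))  # small relative spread
```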

Lemma. For all k >, Pr d (π (i ), π (i)) ( ± o()) nr k ( + o()) δ nr k, where the probability is taken over π. Proof. Use Chebyshev s inequality and Lemmas 7 and 0. Let z B(z, δnr) be a -median of B(z, δnr), i.e., z argmin d (y, x), y B(z,δnr) breaking ties arbitrarily. So by the averaging argument, d (z, x) B(z, δnr) y B(z,δnr) d (y, x). (3) Lemma. d (z, x) nr. Proof. We have d (z, x) (3) Clearly, B (z, δnr) n. B(z, δnr) Lemma 3. For all sufficiently large n, d (z, z) 8r. d (u, v) (9) B (z, δnr) r. Proof. We have d (z, x) (d (z, z) d (z, x)) (4) d (z, z) d (z, x) d (z, z) nr B (z, δnr) d (z, z) nr,

Lemma 14. For all sufficiently large n,

∑_{x∈[n]} d(z̃, x) ≤ nr̃ + 16r/δ + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r).

Proof. By the triangle inequality,

∑_{x∈[n]\B(z,δnr)} d(z̃, x) ≤ ∑_{x∈[n]\B(z,δnr)} (d(z̃, z) + d(z, x)) ≤ (by Lemma 13) ∑_{x∈[n]\B(z,δnr)} (8r + d(z, x)) ≤ (by Lemma 5) 16r/δ + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r).

Now sum up the above with the inequality in Lemma 12.

Lemma 15. For all sufficiently large n and with probability greater than 1/2,

∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r) ≥ (1 − 200√δ) nr/2, (26)

where the probability is taken over π and the internal coin tosses of Indyk_median in line 3 of Las_Vegas_median.

Proof. By Lemma 11 with k = 5,

∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) > (1 ± o(1)) nr̃/2 − 5 √((2 + o(1))δ) nr (27)

with probability at least 1 − 1/25. By Fact 1 and line 3 of Las_Vegas_median,

∑_{x∈[n]} d(z, x) ≤ (1 + ε/10^10) min_{y∈[n]} ∑_{x∈[n]} d(y, x) (28)
≤ (1 + ε/10^10) ∑_{x∈[n]} d(z̃, x) (29)

with probability at least 1 − 1/e. Now by the union bound, inequalities (27)–(29) hold simultaneously with probability at least 1 − 1/25 − 1/e > 1/2. It remains to derive inequality (26) from inequalities (27)–(29) for all sufficiently large n. Line 4 of Las_Vegas_median, inequalities (28)–(29) and Lemma 14 give

nr ≤ (1 + ε/10^10) (nr̃ + 16r/δ + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r)). (30)

This and inequality (27) imply

nr ≤ (1 + ε/10^10) ((2 ± o(1)) (∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) + 5 √((2 + o(1))δ) nr) + 16r/δ + ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r)). (31)

(To see this, rewrite inequality (27) as nr̃ < (2 ± o(1)) (∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) + 5 √((2 + o(1))δ) nr).) Clearly, 16r/δ ≤ 0.01 √δ nr for all sufficiently large n. So inequality (31) implies, for all sufficiently large n and after laborious calculations,

nr − (1 + ε/10^10)(21 + o(1)) √δ nr ≤ (1 + ε/10^10)(2 ± o(1)) ∑_{i=1}^{⌊|B(z,δnr)|/2⌋} d(π(2i−1), π(2i)) + (1 + ε/10^10) ∑_{x∈[n]\B(z,δnr)} (d(z, x) − 8r).

This implies inequality (26) for all sufficiently large n. (Divide both sides by (1 + ε/10^10)(2 ± o(1)) so that the coefficient before ∑_i d(π(2i−1), π(2i)) becomes 1 in the right-hand side. Then verify the left-hand side, which is now (nr − (1 + ε/10^10)(21 + o(1))√δ nr)/((1 + ε/10^10)(2 ± o(1))), to be at least (1 − 200√δ)nr/2 for all sufficiently large n; note that ε/10^10 < √δ by line 1 of Las_Vegas_median.)

Lemma 15 and lines 6–7 of Las_Vegas_median show the probability of termination in any iteration to be Ω(1). Because the proof of Lemma 15 implies that inequalities (26)–(29) hold simultaneously with probability Ω(1) in any iteration of Las_Vegas_median, it happens with probability Ω(1) that in the first iteration, z is returned in line 7 (because of inequality (26)) and is (1 + ε/10^10)-approximate (because of inequality (28)). So Las_Vegas_median outputs a (1 + ε/10^10)-approximate 1-median with probability Ω(1) in the first iteration. In summary, we have the following.

Lemma 16. The first iteration of the while loop of Las_Vegas_median outputs a (1 + ε)-approximate 1-median with probability Ω(1).

5 Putting things together

We now show that metric 1-median has a Las Vegas (2 + ε)-approximation algorithm with an expected O(n/ε²) running time for all constants ε > 0. Our algorithm also outputs a (1 + ε)-approximate 1-median in time O(n/ε²) with probability Ω(1).

Theorem 17. For each constant ε > 0, metric 1-median has a randomized algorithm that (1) always outputs a (2 + ε)-approximate solution in an expected O(n/ε²) time and that (2) outputs a (1 + ε)-approximate solution in time O(n/ε²) with probability Ω(1).

Proof. By Lemma 4, Las_Vegas_median outputs a (2 + ε)-approximate 1-median at termination. To prevent Las_Vegas_median from running forever, find a 1-median by brute force (which obviously takes O(n²) time) after n² steps of computation. By Fact 1, line 3 of Las_Vegas_median takes O(n/ε²) time. Line 5 takes time O(|B(z, δnr)|) = O(n) by the Knuth shuffle. Clearly, the other lines also take O(n) time. Consequently, each iteration of the while loop of Las_Vegas_median takes O(n/ε²) time. By Lemma 15 and lines 6–7, Las_Vegas_median runs for at most 1/Ω(1) = O(1) iterations in expectation. So its expected running time is O(1) · O(n/ε²) = O(n/ε²). Having shown each iteration of Las_Vegas_median to take O(n/ε²) time, establish condition (2) of the theorem with Lemma 16.

By Fact 1, Indyk_median satisfies condition (2) in Theorem 17. But it does not satisfy condition (1).

We briefly justify the optimality of the ratio of 2 + ε in Theorem 17. Let A be a randomized algorithm that always outputs a (2 − ε)-approximate 1-median. Furthermore, denote by p ∈ [n] (resp., Q ⊆ [n] × [n]) the output (resp., the set of queries, as unordered pairs) of A^{d₁}(n), where d₁ is the discrete metric (i.e., d₁(x, y) ≜ 1 and d₁(x, x) ≜ 0 for all distinct x, y ∈ [n]). Without loss of generality, assume (p, y) ∈ Q for all y ∈ [n] \ {p} by adding dummy queries. So A knows that

∑_{y∈[n]} d₁(p, y) = n − 1. (32)

Furthermore, assume that A never queries for the distance from a point to itself. In the sequel, consider the case that |Q| < ε(n − 1)²/4. By the averaging argument, there exists a point p̂ ∈ [n] \ {p} involved in at most 2|Q|/(n − 1) queries in Q. Clearly, A cannot exclude the possibility that d(p̂, y) = 1/2 for all y ∈ [n] \ {p̂} satisfying (p̂, y) ∉ Q. In summary, A cannot rule out the case that

∑_{y∈[n]} d(p̂, y) ≤ (2|Q|/(n − 1)) · 1 + (n − 1 − 2|Q|/(n − 1)) · (1/2) < (1/2 + ε/4)(n − 1). (33)

Equations (32)–(33) contradict the guarantee that p is (2 − ε)-approximate. In summary, any randomized algorithm that always outputs a (2 − ε)-approximate 1-median must always make at least ε(n − 1)²/4 = Ω(εn²) queries given oracle access to the discrete metric.
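The adversarial construction above is easy to verify mechanically. The Python sketch below (hypothetical sizes and query set of our choosing) checks that the modified distance function is still a metric, agrees with the discrete metric on every queried pair, and gives the hidden point p̂ an objective close to half of p's, so a certified ratio below 2 is impossible with few queries.

```python
def check_metric(n, d):
    # Brute-force verification of the metric axioms.
    for x in range(n):
        assert d(x, x) == 0
        for y in range(n):
            assert d(x, y) == d(y, x) and (d(x, y) > 0) == (x != y)
            for w in range(n):
                assert d(x, w) <= d(x, y) + d(y, w)

n, p, p_hat = 40, 0, 7
queried = {frozenset((p, y)) for y in range(n) if y != p}  # all pairs at p
d1 = lambda x, y: 0 if x == y else 1                       # discrete metric
# Unqueried pairs touching p_hat may secretly have distance 1/2.
d2 = lambda x, y: (0 if x == y else
                   0.5 if p_hat in (x, y) and frozenset((x, y)) not in queried
                   else 1)
check_metric(n, d2)                        # still a metric: 1 <= 1/2 + 1/2
assert d2(p, p_hat) == d1(p, p_hat) == 1   # consistent with every query made
obj = lambda d, y: sum(d(y, x) for x in range(n))
print(obj(d2, p), obj(d2, p_hat))          # 39 vs 20: ratio close to 2
```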

A Analyzing max_square_sum

Max_square_sum has an optimal solution, denoted {d̃_{u,v} ∈ R}_{(u,v)∈B²(z,δnr)}, because its feasible solutions (i.e., those satisfying constraints (21)–(22)) form a closed and bounded subset of R^{|B²(z,δnr)|}. (Recall from elementary mathematical analysis that a continuous real-valued function on a closed and bounded subset of R^k has a maximum value, where k < ∞.) Note that {d̃_{u,v}} must be feasible to max_square_sum. Below is a consequence of constraint (21).

Lemma A.1.

|{(u, v) ∈ B²(z, δnr) | d̃_{u,v} = 2δnr}| ≤ ⌊|B²(z, δnr)| r̃/(2δnr)⌋. (34)

Proof. Clearly,

|B²(z, δnr)| r̃ = (by (21)) ∑_{(u,v)∈B²(z,δnr)} d̃_{u,v} ≥ |{(u, v) ∈ B²(z, δnr) | d̃_{u,v} = 2δnr}| · 2δnr.

Furthermore, the left-hand side of inequality (34) is an integer.

Lemma A.2.

|{(u, v) ∈ B²(z, δnr) | d̃_{u,v} > 0}| ≤ ⌊|B²(z, δnr)| r̃/(2δnr)⌋ + 1.

Proof. Assume otherwise. Then

|{(u, v) ∈ B²(z, δnr) | (d̃_{u,v} > 0) ∧ (d̃_{u,v} ≠ 2δnr)}|
≥ |{(u, v) ∈ B²(z, δnr) | d̃_{u,v} > 0}| − |{(u, v) ∈ B²(z, δnr) | d̃_{u,v} = 2δnr}|
≥ (⌊|B²(z, δnr)| r̃/(2δnr)⌋ + 2) − ⌊|B²(z, δnr)| r̃/(2δnr)⌋ = 2,

where the second inequality follows from the assumption and Lemma A.1.

So by constraint (22) (and the feasibility of {d̃_{u,v}} to max_square_sum),

|{(u, v) ∈ B²(z, δnr) | 0 < d̃_{u,v} < 2δnr}| ≥ 2.

Consequently, there exist distinct (x, y), (x′, y′) ∈ B²(z, δnr) satisfying

0 < d̃_{x,y}, d̃_{x′,y′} < 2δnr. (35)

By symmetry, assume d̃_{x,y} ≥ d̃_{x′,y′}. By inequality (35), there exists a small real number β > 0 such that increasing d̃_{x,y} by β and simultaneously decreasing d̃_{x′,y′} by β will preserve constraints (21)–(22). I.e., the solution {d̂_{u,v} ∈ R} defined below is feasible to max_square_sum:

d̂_{u,v} ≜ d̃_{x,y} + β, if (u, v) = (x, y);
d̂_{u,v} ≜ d̃_{x′,y′} − β, if (u, v) = (x′, y′);
d̂_{u,v} ≜ d̃_{u,v}, otherwise. (36)

Clearly, objective (20) w.r.t. {d̂_{u,v}} exceeds that w.r.t. {d̃_{u,v}} by

(⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) ((d̂_{x,y})² + (d̂_{x′,y′})² − (d̃_{x,y})² − (d̃_{x′,y′})²)
= (by (36)) (⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) ((d̃_{x,y} + β)² + (d̃_{x′,y′} − β)² − (d̃_{x,y})² − (d̃_{x′,y′})²)
= (⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) (2β d̃_{x,y} − 2β d̃_{x′,y′} + 2β²) > 0,

where the inequality holds because d̃_{x,y} ≥ d̃_{x′,y′} and β > 0. In summary, {d̂_{u,v}} is a feasible solution achieving a greater objective (20) than the optimal solution {d̃_{u,v}} does, a contradiction.

We now bound the optimal value of max_square_sum.

Theorem A.3. The optimal value of max_square_sum is at most

(⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) (⌊|B²(z, δnr)| r̃/(2δnr)⌋ + 1) (2δnr)².
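The exchange argument in Lemma A.2 is the standard "push mass to the extremes" step: subject to a fixed total and the cap 2δnr, the sum of squares is maximized by saturating as many coordinates as possible and dumping the remainder into at most one more. A small numeric sanity check in Python under hypothetical values of the total mass and cap (our choices, for illustration only):

```python
import random

def extremal(num, total, cap):
    # Cap as many coordinates as possible, at most one partial coordinate:
    # exactly the structure forced by Lemmas A.1-A.2.
    full, rest = divmod(total, cap)
    sol = [cap] * int(full) + ([rest] if rest else [])
    return sol + [0.0] * (num - len(sol))

def random_feasible(num, total, cap):
    # A crude random feasible point: rescale random weights, clip at the cap,
    # then greedily redistribute any mass lost to clipping.
    w = [random.random() for _ in range(num)]
    s = sum(w)
    sol = [min(cap, total * x / s) for x in w]
    deficit = total - sum(sol)
    for i in range(num):
        add = min(cap - sol[i], deficit)
        sol[i] += add
        deficit -= add
    return sol

random.seed(3)
num, total, cap = 50, 90.0, 4.0          # need total <= num * cap
best = sum(x * x for x in extremal(num, total, cap))
for _ in range(1000):
    assert sum(x * x for x in random_feasible(num, total, cap)) <= best + 1e-9
print(best)                               # 22 coords at 4, one at 2: 356.0
```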

Proof. W.r.t. the optimal (and thus feasible) solution {d̃_{u,v}}_{(u,v)∈B²(z,δnr)}, objective (20) equals

(⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{(u,v)∈B²(z,δnr)} χ[d̃_{u,v} > 0] d̃²_{u,v}
≤ (by (22)) (⌊|B(z, δnr)|/2⌋/(|B(z, δnr)| (|B(z, δnr)| − 1))) ∑_{(u,v)∈B²(z,δnr)} χ[d̃_{u,v} > 0] (2δnr)²,

where χ[P] ≜ 1 if P is true and χ[P] ≜ 0 otherwise, for any predicate P. Now invoke Lemma A.2.

References

[1] C.-L. Chang. Deterministic sublinear-time approximations for metric 1-median selection. Information Processing Letters, 113(8):288–292, 2013.

[2] C.-L. Chang. A deterministic sublinear-time nonadaptive algorithm for metric 1-median selection. Theoretical Computer Science, 602:149–157, 2015.

[3] C.-L. Chang. Metric 1-median selection: Query complexity vs. approximation ratio. In Proceedings of the 22nd International Computing and Combinatorics Conference, pages 131–142, Ho Chi Minh City, Vietnam, 2016. Full version at https://arxiv.org/abs/1509.05662.

[4] C.-L. Chang. A lower bound for metric 1-median selection. Journal of Computer and System Sciences, 84:44–51, 2017.

[5] D. Eppstein and J. Wang. Fast approximation of centrality. Journal of Graph Algorithms and Applications, 8(1):39–45, 2004.

[6] O. Goldreich and D. Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.

[7] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.

[8] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428–434, 1999.

[9] P. Indyk. High-dimensional computational geometry. PhD thesis, Stanford University, 2000.

[10] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM, 57(2):5, 2010.

[11] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1–3):35–60, 2004.

[12] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, Cambridge, UK, 1995.

[13] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 3rd edition, 1976.

[14] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[15] B.-Y. Wu. On approximating metric 1-median in sublinear time. Information Processing Letters, 114(4):163–166, 2014.