Fundamenta Informaticae XXI (2001) 1001–1006
IOS Press

Approximate Sorting

Joachim Giesen
Max Planck Institute for Computer Science
Saarbrücken, Germany

Eva Schuberth
Institute for Theoretical Computer Science
ETH Zurich, Switzerland

Miloš Stojaković
Department of Mathematics and Informatics
University of Novi Sad, Serbia

Partly supported by the Swiss National Science Foundation under the grant "Robust Algorithms for Conjoint Analysis".
Corresponding author.
Partly supported by the Ministry of Science and Environmental Protection, Republic of Serbia, and Provincial Secretariat for Science, Province of Vojvodina.

Abstract. We show that any comparison based, randomized algorithm to approximate any given ranking of $n$ items within expected Spearman's footrule distance $n^2/\nu(n)$ needs at least $n(\min\{\log \nu(n), \log n\} - 6)$ comparisons in the worst case. This bound is tight up to a constant factor since there exists a deterministic algorithm that shows that $6n \log \nu(n)$ comparisons are always sufficient.

Keywords: algorithms, sorting, ranking, Spearman's footrule metric, Kendall's tau metric

1. Introduction

Our motivation to study approximate sorting comes from the following market research application. We want to find out how a respondent ranks a set of products. In order to simulate real buying situations the respondent is presented pairs of products out of which he has to choose the one that he prefers, i.e., he has to perform paired comparisons. The respondent's ranking is then reconstructed from the sequence of his choices. That is, a procedure that presents a sequence of product pairs to the respondent in order to obtain the product ranking is nothing else than a comparison based sorting algorithm. We can measure the efficiency of such an algorithm in terms of the number of (pairwise) comparisons needed in order to obtain the ranking. The information theoretic lower bound on sorting [6] states that there is no procedure that can determine a ranking by posing less than $n \log \frac{n}{e}$ paired comparison questions to the respondent, i.e., in general $\Omega(n \log n)$ comparisons are needed. Even for only moderately large $n$ that easily is too much since respondents often get worn out after a certain number of questions (independent of $n$) and do not answer further questions faithfully anymore. On the other hand, it might be enough to know the respondent's ranking approximately. In this paper we pursue the question of how many comparisons are necessary and sufficient in order to approximately rank $n$ products.

In order to give sense to the term approximately we need some metric to compare rankings. Assume that we are dealing with $n$ products. Since a ranking is a permutation of the products, this means that we need a metric on the permutation group $S_n$. Not all of the metrics, e.g., the Hamming distance that counts how many products are ranked differently, are meaningful for our application. For example, if in the respondent's ranking one exchanges every second product with its predecessor, then the resulting ranking has maximal Hamming distance to the original one. Nevertheless, this ranking still tells a lot about the respondent's preferences. In marketing applications Kendall's tau metric [3] is frequently used since it seems to capture the intuitive notion of closeness of two rankings and also arises naturally in the statistics of certain random rankings [7].

Our results. Instead of working with Kendall's metric we use Spearman's footrule metric [3] which essentially is equivalent to Kendall's metric, since the two metrics are within a constant factor of each other [3]. The maximal distance between any two rankings of $n$ products in Spearman's footrule metric is less than $n^2$.
We show that in order to obtain a ranking at distance $n^2/\nu(n)$ to the actual ranking, with any strategy, a respondent has in general to perform at least $n(\min\{\log \nu(n), \log n\} - 6)$ comparisons in the worst case, i.e., there is an instance for which any comparison based algorithm performs at least $n(\min\{\log \nu(n), \log n\} - 6)$ comparisons. Moreover, if we allow the strategy to be randomized such that the obtained ranking is at expected distance $n^2/\nu(n)$ to the respondent's ranking, we can show that the same bound on the minimum number of comparisons holds. On the other hand, there is a deterministic strategy (algorithm) that shows that $6n \log \nu(n)$ comparisons are always sufficient.

Related work. At first glance our work seems related to work done on pre-sorting. In pre-sorting the goal is to pre-process the data such that fewer comparisons are needed afterwards to sort them. For example in [4] it is shown that with $O(1)$ pre-processing one can save $\Theta(n)$ comparisons for Quicksort on average. Pre-processing can be seen as computing a partial order on the data that helps for a given sorting algorithm to reduce the number of necessary comparisons. The structural quantity that determines how many comparisons are needed in general to find the ranking given a partial order is the number of linear extensions of the partial order, i.e., the number of rankings consistent with the partial order. Actually, the logarithm of this number is a lower bound on the number of comparisons needed in general [5]. Here we study another structural measure, namely, the maximum diameter in the Spearman's metric of the set of rankings consistent with a partial order. Our results show that with $o(n \log n)$ comparisons one can make this diameter asymptotically smaller than the diameter of the set of all rankings. That is not the case for the number of linear extensions which stays in $\Theta(2^{n \log n})$.

Notation. The logarithm $\log$ in this paper is assumed to be binary, and by $\mathrm{id}$ we denote the identity (increasing) permutation of $[n]$.
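
To make the comparison of metrics concrete, the following Python sketch evaluates the Hamming distance, Spearman's footrule and Kendall's tau on the example from the introduction, in which every second product is exchanged with its predecessor. The function names and the encoding of a ranking as a list of ranks (entry $i$ holds the rank of product $i$) are assumptions made for this illustration only; they do not appear in the paper.

```python
from itertools import combinations


def footrule(p, q):
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hamming(p, q):
    """Number of products that the two rankings rank differently."""
    return sum(a != b for a, b in zip(p, q))


def kendall_tau(p, q):
    """Number of product pairs that the two rankings order differently."""
    return sum(
        (p[i] - p[j]) * (q[i] - q[j]) < 0
        for i, j in combinations(range(len(p)), 2)
    )


n = 10  # assumed even, so that the products pair up
identity = list(range(1, n + 1))
# Exchange every second product with its predecessor: 2 1 4 3 6 5 ...
swapped = [i + 1 if i % 2 == 1 else i - 1 for i in identity]

print(hamming(swapped, identity))      # 10: every position differs
print(footrule(swapped, identity))     # 10: each rank is off by exactly 1
print(kendall_tau(swapped, identity))  # 5: one inverted pair per exchange
```

For this ranking the Hamming distance is maximal, while the footrule distance is far below its maximum $n^2/2 = 50$. The values also agree with the relation $K \le D \le 2K$ between Kendall's tau $K$ and the footrule $D$ from [3].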

2. Lower Bound

Here, we show that in order to obtain a ranking reasonably close to the actual ranking, a respondent has to perform a substantial number of comparisons in the worst case. More precisely, for any (possibly randomized) comparison based algorithm that outputs a ranking at distance $n^2/\nu(n)$ to the actual ranking, there is an instance for which it performs (in expectation) at least $n(\min\{\log \nu(n), \log n\} - 6)$ comparisons.

The distance of an approximate ranking from the actual ranking will be measured in Spearman's footrule metric,
$$D(\pi, \mathrm{id}) = D(\pi) = \sum_{i=1}^{n} |i - \pi(i)|,$$
where $\pi(i)$ is the rank of the element of rank $i$ in the approximate ranking, i.e., $|i - \pi(i)|$ measures deviation of the approximated rank from the actual rank. Note that for any ranking the distance in the Spearman's footrule metric to $\mathrm{id}$ is at most $n^2/2$.

For $r > 0$, by $B_D(\mathrm{id}, r)$ we denote the ball centered at $\mathrm{id}$ of radius $r$ with respect to the Spearman's footrule metric, so
$$B_D(\mathrm{id}, r) := \{\pi \in S_n : D(\pi, \mathrm{id}) \le r\}.$$
Next we estimate the number of permutations in a ball of radius $r$.

Lemma 2.1.
$$\left(\frac{2e(r+n)}{n}\right)^n \ge |B_D(\mathrm{id}, r)|.$$

Proof: Every permutation $\pi \in S_n$ is uniquely determined by the sequence $\{\pi(i) - i\}_i$. Hence, for any sequence of non-negative integers $d_i$, $i = 1, \ldots, n$, there are at most $2^n$ permutations $\pi \in S_n$ satisfying $|\pi(i) - i| = d_i$. If $D(\pi, \mathrm{id}) \le r$, then $\sum_i |\pi(i) - i| \le r$. Since the number of sequences of $n$ non-negative integers whose sum is at most $r$ is $\binom{r+n}{n}$, we have
$$|B_D(\mathrm{id}, r)| \le \binom{r+n}{n} 2^n \le \left(\frac{2e(r+n)}{n}\right)^n.$$

Using the previous lemma and Yao's Principle [8], we give a lower bound for the worst case running time of any (randomized) comparison based approximate sorting algorithm.

Theorem 2.1. Let $A$ be a randomized approximate sorting algorithm based on comparisons, let $\nu = \nu(n)$ be a function, and let $r = r(n) = \frac{n^2}{\nu(n)}$. If for every input permutation $\pi \in S_n$ the expected Spearman's footrule distance of the output to $\mathrm{id}$ is at most $r$, then the algorithm performs at least $n(\min\{\log \nu, \log n\} - 6)$ comparisons in expectation in the worst case.

Proof: Let $k$ be the smallest integer such that $A$ performs at most $k$ comparisons for every input. For a contradiction, let us assume that $k < n(\min\{\log \nu, \log n\} - 6)$. First, we are going to prove
$$\frac{1}{2}\, n! > 2^k \left(\frac{2e(2r+n)}{n}\right)^n. \tag{1}$$
Since $\log \nu - 6 > k/n$, we have $\frac{\nu}{2^6} > 2^{k/n}$ and since $\nu = \frac{n^2}{r}$ we get
$$\frac{n}{2e \cdot 2r} > 2^{k/n}\, \frac{2e}{n}. \tag{2}$$
On the other hand, from $\log n - 6 > k/n$ we get $\frac{n}{2^6} > 2^{k/n}$ implying
$$\frac{n}{2e \cdot n} > 2^{k/n}\, \frac{2e}{n}. \tag{3}$$
Putting (2) and (3) together, we obtain
$$\frac{n}{2e(2r+n)} > 2^{k/n}\, \frac{e}{n}.$$
Hence
$$\frac{1}{2}\, n! > \left(\frac{n}{e}\right)^n > 2^k \left(\frac{2e(2r+n)}{n}\right)^n,$$
proving (1).

We denote by $R$ the source of random bits for $A$. One can see $R$ as the set of all infinite 0-1 sequences, and then the algorithm is given a random element of $R$ along with the input. For a permutation $\pi \in S_n$ and $\alpha \in R$, we denote by $A(\pi, \alpha)$ the output of the algorithm with input $\pi$ and random bits $\alpha$.

We fix $\alpha \in R$ and run the algorithm for every permutation $\pi \in S_n$. Note that with the random bits fixed the algorithm is deterministic. For every comparison made by the algorithm there are two possible outcomes. We partition the set of all permutations $S_n$ into classes such that all permutations in a class have the same outcomes of all the comparisons the algorithm makes. Since there is no randomness involved, we have that for every class $C$ there exists a $\sigma \in S_n$ such that for every $\pi \in C$ we have $A(\pi, \alpha) = \sigma \circ \pi$, where $\circ$ is the multiplication in the permutation group $S_n$. In particular, this implies that the set $\{A(\pi, \alpha) : \pi \in C\}$ is of size $|C|$.

On the other hand, since the algorithm in this setting is deterministic and the number of comparisons of the algorithm is at most $k$, there can be at most $2^k$ classes. Hence, each permutation in $S_n$ is the output for at most $2^k$ different input permutations. From Lemma 2.1 we have $|B_D(\mathrm{id}, 2r)| \le \left(\frac{2e(2r+n)}{n}\right)^n$ and this together with (1) implies that at least
$$n! - 2^k \left(\frac{2e(2r+n)}{n}\right)^n > \frac{1}{2}\, n!$$
input permutations have output at distance to $\mathrm{id}$ more than $2r$.

Now, if both the random bits $\alpha \in R$ and the input permutation $\pi \in S_n$ are chosen at random, the expected distance of the output $A(\pi, \alpha)$ to $\mathrm{id}$ is more than $r$. Therefore, there exists a permutation $\pi_0$ such that for a randomly chosen $\alpha \in R$ the expected distance $D(A(\pi_0, \alpha), \mathrm{id})$ is more than $r$. Contradiction.
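
The counting bound of Lemma 2.1, on which the proof above rests, can be checked by brute force for small $n$. The Python sketch below enumerates $S_n$ for $n = 6$ and compares $|B_D(\mathrm{id}, r)|$ with the two bounds appearing in the proof of the lemma. The helper names `footrule_to_id` and `ball_size` are hypothetical, and the check illustrates one small case only; it is not a proof.

```python
from itertools import permutations
from math import comb, e


def footrule_to_id(perm):
    """Spearman's footrule distance of perm (ranks 1..n) to the identity."""
    return sum(abs(rank - i) for i, rank in enumerate(perm, start=1))


def ball_size(n, r):
    """Brute-force |B_D(id, r)| by enumerating all of S_n (small n only)."""
    return sum(footrule_to_id(p) <= r for p in permutations(range(1, n + 1)))


n = 6
for r in range(0, n * n // 2 + 1, 3):
    exact = ball_size(n, r)
    counting_bound = comb(r + n, n) * 2 ** n      # binom(r+n, n) * 2^n
    closed_form = (2 * e * (r + n) / n) ** n      # (2e(r+n)/n)^n
    assert exact <= counting_bound <= closed_form
    print(r, exact, counting_bound, round(closed_form))
```

The bound is far from tight for such small $n$; the proof of Theorem 2.1 only needs it to grow more slowly than $n!/2^k$.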

3. Algorithm

The idea of ASORT algorithm is to partition the products into a sorted sequence of equal-sized bins such that the elements in each bin have smaller rank than any element in subsequent bins. It is based on a well-studied variation of Quicksort algorithm in which the median is chosen to be the pivot element (see, e.g., [2]). The output of the algorithm is the sequence of bins. Note that we do not specify the ordering of elements inside each bin, but consider any ranking consistent with the ordering of the bins. As it turns out, any such ranking approximates the actual ranking of the elements in terms of Spearman's footrule metric well.

The algorithm ASORT iteratively performs a number of median searches, each time placing the median into the right position in the ranking. Here the median of $n$ elements is defined to be the element of rank $\lfloor \frac{n+1}{2} \rfloor$.

    ASORT (B : set, m : int)
    1  $B_{01}$ := B    // $B_{ij}$ is the j-th bin in the i-th round
    2  for i := 1 to m do
    3      for j := 1 to $2^{i-1}$ do
    4          compute the median of $B_{(i-1)j}$
    5          $B_{i(2j-1)}$ := {x ∈ $B_{(i-1)j}$ | x ≤ median}
    6          $B_{i(2j)}$ := {x ∈ $B_{(i-1)j}$ | x > median}
    7      end for
    8  end for
    9  return $B_{m1}, \ldots, B_{m(2^m)}$

To compute the median in line 4 and to partition the elements in lines 5 and 6 we use the deterministic algorithm by Blum et al. [1] that performs at most $5.73n$ comparisons in order to compute the median of $n$ elements and to partition them according to the median. We note that in putting the algorithm ASORT to practice one may want to use a different median algorithm, like, e.g., RANDOMIZEDSELECT [2].

In each round, the sum of the cardinalities of all the bins is $n$. Hence, one round takes at most $5.73n$ comparisons. As the algorithm runs for $m$ rounds overall, the total number of comparisons is less than $6nm$.

Theorem 3.1. Let $r = \frac{n^2}{\nu(n)}$. Any ranking consistent with the ordering of the bins computed by ASORT in $\log \nu(n)$ rounds, i.e., with less than $6n \log \nu(n)$ comparisons, has a Spearman's footrule distance of at most $r$ to the actual ranking of the elements from $B$.
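Proof: The distance of the actual ranking of the elements in $B$ to any ranking consistent with the ordering of the bins computed by ASORT in $m$ rounds can be bounded by
$$\left(\frac{n}{2^m}\right)^2 2^{m-1} = \frac{n^2}{2^{m+1}},$$
since each of the $2^m$ bins holds about $n/2^m$ elements and two rankings of the $s$ elements of one bin are at Spearman's footrule distance at most $s^2/2$. Plugging in $m = \log \nu(n)$, we see that the distance is at most $r$. As we saw earlier, the algorithm performs at most $6nm = 6n \log \nu(n)$ comparisons.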
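
The following Python sketch illustrates ASORT and the distance bound of Theorem 3.1 on a random input. It is an illustration under stated assumptions, not the algorithm analyzed above: instead of the deterministic median algorithm of Blum et al. [1] it uses a simple randomized quickselect, so the worst-case bound of $5.73n$ comparisons per round does not apply to it. The names `Counter`, `split_smallest` and `asort` are hypothetical, and products are identified with their true ranks $1, \ldots, n$.

```python
import random


class Counter:
    """Counts the paired comparisons posed to the respondent."""

    def __init__(self):
        self.count = 0

    def less(self, a, b):
        self.count += 1
        return a < b


def split_smallest(items, k, less):
    """Return (k smallest items, the rest) via randomized quickselect."""
    if k <= 0:
        return [], list(items)
    if k >= len(items):
        return list(items), []
    pivot_index = random.randrange(len(items))
    pivot = items[pivot_index]
    rest = items[:pivot_index] + items[pivot_index + 1:]
    lower, upper = [], []
    for x in rest:
        (lower if less(x, pivot) else upper).append(x)
    if len(lower) >= k:
        low, high = split_smallest(lower, k, less)
        return low, high + [pivot] + upper
    low, high = split_smallest(upper, k - len(lower) - 1, less)
    return lower + [pivot] + low, high


def asort(items, m, less):
    """Return 2**m bins; every element of a bin ranks below all later bins."""
    bins = [list(items)]
    for _ in range(m):
        next_bins = []
        for b in bins:
            # The median is the element of rank floor((|b| + 1) / 2);
            # the lower bin keeps everything up to and including it.
            lower, upper = split_smallest(b, (len(b) + 1) // 2, less)
            next_bins += [lower, upper]
        bins = next_bins
    return bins


n, m = 256, 4
counter = Counter()
items = list(range(1, n + 1))  # each product is represented by its true rank
random.shuffle(items)
bins = asort(items, m, counter.less)

# One ranking consistent with the bin order: bins concatenated, inner order arbitrary.
approx = [x for b in bins for x in b]
distance = sum(abs(pos - x) for pos, x in enumerate(approx, start=1))
assert distance <= n * n / 2 ** m  # r = n^2 / nu(n) with nu(n) = 2^m
print(distance, n * n // 2 ** m, counter.count)
```

The printed values are the footrule distance of the output, the radius $r = n^2/2^m$, and the number of comparisons used in this particular run; the last value varies between runs because the pivot choice is random.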

Acknowledgments. We are indebted to Jiří Matoušek for comments and insights that made this paper possible.

References

[1] Blum, M., Floyd, R. W., Pratt, V., Rivest, R. L., Tarjan, R. E.: Linear time bounds for median computations, STOC '72: Proceedings of the fourth annual ACM symposium on Theory of computing, ACM Press, 1972.
[2] Cormen, T. H., Leiserson, C. E., Rivest, R. L.: Introduction to Algorithms, 2nd ed., The MIT Press/McGraw-Hill, 2001.
[3] Diaconis, P., Graham, R. L.: Spearman's Footrule as a Measure of Disarray, Journal of the Royal Statistical Society, 39(2), 1977, 262–268.
[4] Hwang, H. K., Yang, B. Y., Yeh, Y. N.: Presorting algorithms: an average-case point of view, Theoretical Computer Science, 242(1-2), 2000, 29–40.
[5] Kahn, J., Kim, J. H.: Entropy and Sorting, STOC '92: Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, ACM Press, 1992.
[6] Knuth, D. E.: The Art of Computer Programming, vol. 3, Addison Wesley, 1973.
[7] Mallows, C. L.: Non-null ranking models, Biometrika, 44, 1957, 114–130.
[8] Yao, A. C.: Probabilistic computations: Towards a unified measure of complexity, FOCS '77: Proceedings of 18th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, 1977.