This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the author's institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or TeX form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier's archiving and manuscript policies are encouraged to visit:

Computer Methods and Programs in Biomedicine 112 (2013) 135-145. Journal homepage: www.intl.elsevierhealth.com/journals/cmpb


Privacy-preserving Kruskal-Wallis test

Suxin Guo a,*, Sheng Zhong b, Aidong Zhang a

a Department of Computer Science and Engineering, SUNY at Buffalo, United States
b State Key Laboratory for Novel Software Technology, Nanjing University, China
* Corresponding author.

Article history: Received 4 January 2013. Received in revised form 17 May 2013. Accepted 28 May 2013.

Keywords: Data security, Statistical test, Kruskal-Wallis test

Abstract

Statistical tests are powerful tools for data analysis. The Kruskal-Wallis test is a non-parametric statistical test that evaluates whether two or more samples are drawn from the same distribution. It is commonly used in various areas, but its use is sometimes impeded by privacy issues in fields such as biomedical research and clinical data analysis, because the data contain confidential information. In this work, we give a privacy-preserving solution for the Kruskal-Wallis test that enables two or more parties to jointly perform the test on the union of their data without compromising their data privacy. To the best of our knowledge, this is the first work that solves the privacy issues in the use of the Kruskal-Wallis test on distributed data.

© 2013 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Statistical hypothesis tests are very widely used for data analysis. Popular statistical tests include the t-test [1], ANOVA [2], the Kruskal-Wallis test [3], and the Wilcoxon rank sum test [4]. Although these four tests differ, they serve the same goal, which is to find out whether the samples come from the same population. The t-test and ANOVA are parametric tests and assume that the data are normally distributed. The non-parametric equivalents of these two tests are the Wilcoxon rank sum test, also known as the Mann-Whitney U test [5], and the Kruskal-Wallis test, respectively. They do not assume that the data are normally distributed. The t-test can only compare two samples, and ANOVA extends it to multiple samples. Similarly, the Kruskal-Wallis test generalizes the Wilcoxon rank sum test from two samples to multiple samples. The four tests thus do similar things under different assumptions. The non-parametric tests perform better when the data are not normally distributed, and are especially suitable when the data size is small (<25 per sample group) [6].

Although the Kruskal-Wallis test is a helpful tool in many areas, its use is sometimes impeded by privacy concerns due to the confidential information in the data, especially in clinical and biomedical research. For example, suppose several hospitals conduct a study and measure the INR (International Normalized Ratio) values of their patients, so that each hospital holds a set of INR values. The hospitals want to perform the Kruskal-Wallis test to check whether their values follow the same trend. In this case, the set of INR values of each hospital is treated as a sample. To conduct the Kruskal-Wallis test, all samples must be known, which means that the hospitals have to share their data with each other. The problem is that it might be improper for the hospitals to share their samples, because the data contain private information about patients. Currently there is no method that enables the Kruskal-Wallis test to be conducted on such distributed data with privacy concerns.

To solve this problem, we propose a privacy-preserving algorithm that allows the Kruskal-Wallis test to be applied to samples distributed among different parties without revealing each party's private information to the others. Because non-parametric tests are similar to one another, our method can also help in the design of privacy-preserving solutions for other non-parametric tests. For example, the Wilcoxon rank sum test and the Kruskal-Wallis test are used for two samples and for two or more samples, respectively, and are essentially the same in the two-sample case [3]. So our algorithm also solves the privacy issue of the Wilcoxon rank sum test to some extent.

The rest of this paper is organized as follows. In Section 2, we present the related work. Section 3 provides the technical preliminaries, including background on the Kruskal-Wallis test and the cryptographic tools we need. We propose the basic algorithm and the complete algorithm in Sections 4 and 5, respectively. The basic algorithm shows how to conduct the Kruskal-Wallis test securely when there are no ties in the data. The complete algorithm builds on the basic algorithm and takes ties into consideration. In Section 6, we present the experimental results, and Section 7 concludes the paper.

2. Related work

In recent years, due to the increasing awareness of privacy problems, many data analysis methods have been enhanced to be privacy-preserving, including many popular data mining and machine learning algorithms. Most of these approaches fall into two categories. Approaches in the first category protect data privacy with data perturbation techniques, such as randomization [7,8], rotation [9] and resampling [10]. Since the original data are changed, these approaches usually lose some accuracy.
The methods in the second category are generally based on Secure Multiparty Computation (SMC) and apply cryptographic techniques to protect data during the computations [11,12]. Such methods usually cause no accuracy loss but have a higher computational cost. In our case, since the Kruskal-Wallis test is often used on small data sets, we choose the second way, which is to protect privacy with cryptographic tools. It enables us to achieve higher accuracy at an affordable computational cost.

In the cryptographic category, some SMC tools are very commonly used, such as secure sum [13], secure comparison [14,15], secure division [16], secure scalar product [13,16,17], secure matrix multiplication [18-20], and secure set operations [13]. Many data mining and machine learning algorithms have been extended with privacy solutions, such as decision tree classification [11,21], k-means clustering [22,23], and gradient descent methods [24], but only a few works study the privacy issues in statistical tests. Ref. [25] gives a privacy-preserving algorithm to compare survival curves with the logrank test. Ref. [26] presents a privacy-preserving solution to perform the permutation test securely on distributed data. To the best of our knowledge, no prior work studies the privacy issues of the Kruskal-Wallis test on distributed data, and our work is the first to do so.

3. Technical preliminaries

3.1. The Kruskal-Wallis test

We first review the Kruskal-Wallis test. The test, as proposed by Kruskal and Wallis [3], evaluates whether two or more samples are from the same distribution. The null hypothesis is that all the samples come from the same distribution. Suppose we have $k$ samples, each containing a set of values. To perform the Kruskal-Wallis test, we first rank all the values together, without considering which sample each value belongs to, and then compute the sum of the ranks of the values within each sample, so that each sample has its own rank sum.
If there are no ties among the values, the test statistic is

\[ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1), \tag{1} \]

where $N$ is the total number of values in all samples, $n_i$ is the number of values in the $i$th sample, and $R_i$ is the sum of ranks in the $i$th sample. After calculating $H$, we compare it to a value $\chi^2_{\alpha:k-1}$, which can be found in a table of the chi-squared probability distribution with $k-1$ degrees of freedom and $\alpha$ as the desired significance. If $H \ge \chi^2_{\alpha:k-1}$, the null hypothesis is rejected. Otherwise, it is not rejected.

If there are ties among the values, the calculation of the test statistic changes slightly. First, when ranking all the values, each group of tied values is assigned the average of the ranks that these values would have received without ties. For example, suppose we have values {1, 3, 3, 5} with one tie of two 3's. Ignoring the tie, their ranks would be 1, 2, 3, 4, respectively. After changing the ranks of the tied values to their average, the ranks become 1, 2.5, 2.5, 4. We then compute $H$ with these new ranks. Besides adjusting the ranks, we also need to divide $H$ by

\[ C = 1 - \frac{\sum_{i=1}^{g} (t_i^3 - t_i)}{N^3 - N}, \tag{2} \]

where $g$ is the number of groups of tied values and $t_i$ is the number of tied values in the $i$th group. For the above example {1, 3, 3, 5}, we have only one group of two tied values, so $g = 1$ and $t_1 = 2$. To sum up, when ties exist, we adjust the ranks of the tied values, and the test statistic is

\[ H_c = \frac{H}{C}. \tag{3} \]

In fact, Eq. (3) is the general solution that holds whether or not there are ties. If there are no ties, $C = 1$ and thus $H_c = H$.
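As a plaintext reference point for Eqs. (1)-(3), the following is a minimal sketch of the statistic with no privacy protection at all. The function names `average_ranks` and `kruskal_wallis` are illustrative choices, not names from the paper, and the sketch assumes that not all values are tied (otherwise $C = 0$).

```python
from collections import Counter

def average_ranks(values):
    """Map each distinct value to its rank; tied values share the average rank."""
    ordered = sorted(values)
    counts = Counter(ordered)
    first_rank = {}
    for position, v in enumerate(ordered, start=1):
        first_rank.setdefault(v, position)      # smallest rank in the tie group
    # Ranks in a tie group are consecutive, so the average is first + (t - 1) / 2.
    return {v: first_rank[v] + (counts[v] - 1) / 2 for v in counts}

def kruskal_wallis(samples):
    """Return (H, C, H_c) following Eqs. (1), (2) and (3)."""
    pooled = [v for sample in samples for v in sample]
    n_total = len(pooled)
    rank_of = average_ranks(pooled)
    weighted = sum(sum(rank_of[v] for v in s) ** 2 / len(s) for s in samples)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    c = 1 - ties / (n_total ** 3 - n_total)
    return h, c, h / c
```

For the example values {1, 3, 3, 5}, split (as an assumed illustration) into the samples {1, 3} and {3, 5}, this gives rank sums 3.5 and 6.5, so $H = 1.35$, $C = 0.9$ and $H_c = 1.5$.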

3.2. Privacy protection of the Kruskal-Wallis test

As in the hospital example in the introduction, we assume that each party has a sample and that the parties want to conduct a Kruskal-Wallis test jointly to find out whether their samples follow the same distribution, without revealing their data to one another. Our solution is based on the semi-honest model, which is widely used in the cryptographic category of privacy-preserving methods [27,11,28,29,24,13,16,30]. In this model, all parties strictly follow the protocol, but may attempt to derive the private information of other parties from the intermediate results they obtain during the execution of the protocol.

3.3. Cryptographic tools

3.3.1. Homomorphic cryptographic scheme

An additive homomorphic asymmetric cryptographic system is used to encrypt and decrypt the data in our work. A cryptographic scheme that encrypts an integer $x$ as $E(x)$ is additive homomorphic if there are operators $\oplus$ and $\otimes$ such that, for any two integers $x_1$, $x_2$ and a constant $a$,

\[ E(x_1 + x_2) = E(x_1) \oplus E(x_2), \qquad E(a \cdot x_1) = a \otimes E(x_1). \]

This means that, with an additive homomorphic cryptographic system, we can compute the encrypted sum of integers directly from their encryptions, without decrypting the integers first. In an asymmetric cryptographic system, we have a pair of keys: a public key for encryption and a private key for decryption.

3.3.2. ElGamal cryptographic system

There are several additive homomorphic cryptographic schemes [30,31]. In this work, we apply a variant of the ElGamal scheme [32], which is semantically secure under the Diffie-Hellman assumption [33]. The original ElGamal cryptographic system is a multiplicative homomorphic asymmetric cryptographic system. With this system, the encryption of a cleartext $m$ is the pair

\[ E(m) = (m \cdot y^r,\; g^r), \]

where $g$ is a generator, $x$ is the private key, $y = g^x$ is the public key, and $r$ is a random integer.
We call the first part of the pair $c_1$ and the second part $c_2$, so $c_1 = m \cdot y^r$ and $c_2 = g^r$. To decrypt $E(m)$, we compute $s = c_2^x = g^{rx} = g^{xr} = y^r$. We then compute $c_1 \cdot s^{-1} = m \cdot y^r \cdot y^{-r} = m$ to recover the cleartext $m$.

In the variant of the ElGamal scheme we use, the cleartext $m$ is encrypted as

\[ E(m) = (g^m \cdot y^r,\; g^r). \]

The only difference between the original ElGamal scheme and this variant is that $m$ in the first part is changed to $g^m$. With this change, the variant is an additive homomorphic cryptosystem such that

\[ E(x_1 + x_2) = E(x_1) \cdot E(x_2), \qquad E(a \cdot x_1) = E(x_1)^a, \]

where the product and the power are taken component-wise on the ciphertext pair. To decrypt $E(m)$, we follow the same procedure as in the original ElGamal algorithm, but this time we obtain $g^m$ instead of $m$. To get $m$ from $g^m$, we perform an exhaustive search, trying every possible $m$ and looking for the one that matches $g^m$. This exhaustive search is limited to a small range of possible plaintexts, so the time needed is reasonable.

In our work, the private key is shared by all the parties and no party knows the complete private key. The parties must cooperate to perform decryptions, and the ciphertexts can be exposed to any party, because no party can decrypt them without the help of the others. The private key is shared as follows. Suppose there are two parties, A and B. A holds one part of the private key, $x_1$, and B holds the other part, $x_2$, such that $x_1 + x_2 = x$, where $x$ is the complete private key. In the decryption, we need to compute $s = c_2^x = c_2^{x_1 + x_2} = c_2^{x_1} c_2^{x_2}$. Party A computes $s_1 = c_2^{x_1}$ and party B computes $s_2 = c_2^{x_2}$, so that $s = s_1 s_2$. We need $c_1 s^{-1} = c_1 (s_1 s_2)^{-1} = c_1 s_1^{-1} s_2^{-1}$. Party A computes $c_1 s_1^{-1}$ and sends it to party B. Party B then computes $c_1 s_1^{-1} s_2^{-1} = c_1 s^{-1} = g^m$ and sends it to A. In this way both parties obtain the decrypted result. Because party B performs its decryption step last, it obtains the final result first.
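The variant ElGamal scheme and the two-party decryption described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the group parameters `P`, `Q`, `G` are small values chosen here only so the example runs quickly, the function names are hypothetical, and both parties' steps are simulated in a single process.

```python
import random

# Toy parameters, assumed for illustration only (far too small for real use):
# P = 2*Q + 1 with P and Q prime; G generates the subgroup of order Q modulo P.
P, Q, G = 10007, 5003, 4

def encrypt(m, y):
    """Variant ElGamal encryption: E(m) = (g^m * y^r, g^r)."""
    r = random.randrange(1, Q)
    return (pow(G, m, P) * pow(y, r, P) % P, pow(G, r, P))

def add(c, d):
    """E(x1 + x2): component-wise product of two ciphertexts."""
    return (c[0] * d[0] % P, c[1] * d[1] % P)

def scalar_mul(c, a):
    """E(a * x1): component-wise a-th power of a ciphertext."""
    return (pow(c[0], a, P), pow(c[1], a, P))

def inverse(s):
    """Modular inverse via Fermat's little theorem (P is prime)."""
    return pow(s, P - 2, P)

def shared_decrypt(c, x1, x2, bound=1000):
    """Decrypt with key shares x1 (party A) and x2 (party B)."""
    c1, c2 = c
    from_a = c1 * inverse(pow(c2, x1, P)) % P     # party A: c1 * s1^-1, sent to B
    g_m = from_a * inverse(pow(c2, x2, P)) % P    # party B: c1 * s1^-1 * s2^-1 = g^m
    acc = 1
    for m in range(bound + 1):                    # exhaustive search over small plaintexts
        if acc == g_m:
            return m
        acc = acc * G % P
    raise ValueError("plaintext outside the search range")

# Key shares: the full key x = x1 + x2 is never assembled by either party.
x1, x2 = random.randrange(1, Q), random.randrange(1, Q)
y = pow(G, x1, P) * pow(G, x2, P) % P             # joint public key y = g^(x1 + x2)
```

With these definitions, `shared_decrypt(add(encrypt(7, y), encrypt(5, y)), x1, x2)` recovers the sum 12, and `shared_decrypt(scalar_mul(encrypt(7, y), 3), x1, x2)` recovers 21, without any plaintext being multiplied or added directly.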
If party B does not send the result to A, the decrypted result is known only to party B. The order of the parties can be changed, so if the result should be known to only one party, that party should perform its decryption step last.

3.3.3. Secure comparison

We apply the secure comparison protocol proposed in [15] to compare two values from different parties securely. The inputs of this algorithm are two integers $a$ and $b$ held by different parties. The output is an encryption of 1 if $a > b$, or an encryption of 0 otherwise. The basic idea of the secure comparison algorithm is as follows. Let the binary representations of $a$ and $b$ be $a_l, \ldots, a_1$ and $b_l, \ldots, b_1$, where $a_1$ and $b_1$ are the least significant bits. If $a > b$, there is a pivot bit $i$ such that $b_i - a_i + 1 = 0$ and $a_j \text{ XOR } b_j = a_j + b_j - 2 a_j b_j = 0$ for every $i < j \le l$. The protocol uses homomorphic encryption to check whether the pivot bit exists. This method can determine whether $a > b$, but it cannot directly determine whether $a \ge b$. So when we want to know whether $a \ge b$, we compare $2a + 1$ and $2b$ instead of $a$ and $b$. If $2a + 1 > 2b$, then, since both $a$ and $b$ are integers, we can conclude that $a \ge b$.

4. The basic algorithm of privacy-preserving Kruskal-Wallis test

In this part, we present the basic algorithm for computing the $H$ statistic of the Kruskal-Wallis test securely, without considering ties. The complete algorithm, which also deals with ties, is discussed in the next section. To make the presentation clear, we first give the algorithm for performing the test between two parties, and then extend it to the multiparty case.
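The pivot-bit condition and the $2a + 1 > 2b$ trick from Section 3.3.3 can be checked in plaintext. The sketch below only verifies the arithmetic; it performs no encryption and is not the protocol of [15]. The function names and the fixed bit width `nbits` are assumptions made for illustration.

```python
def greater_than(a, b, nbits):
    """Return 1 if a > b, using the pivot-bit condition on nbits-bit integers."""
    bits_a = [(a >> k) & 1 for k in range(nbits)]   # index 0 is the least significant bit
    bits_b = [(b >> k) & 1 for k in range(nbits)]
    for i in range(nbits):
        is_pivot = bits_b[i] - bits_a[i] + 1 == 0   # holds only when a_i = 1 and b_i = 0
        higher_equal = all(
            bits_a[j] + bits_b[j] - 2 * bits_a[j] * bits_b[j] == 0   # a_j XOR b_j = 0
            for j in range(i + 1, nbits)
        )
        if is_pivot and higher_equal:
            return 1
    return 0

def greater_or_equal(a, b, nbits):
    """Return 1 if a >= b by comparing 2a + 1 with 2b (one extra bit is needed)."""
    return greater_than(2 * a + 1, 2 * b, nbits + 1)
```

In the secure protocol the same expressions would be evaluated over encrypted bits, which is why they are written with additions and multiplications rather than with Python's comparison operators.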

Suppose there are two parties, A and B. Party A has sample $S_1$, which contains $n_1$ values, and party B has sample $S_2$, which contains $n_2$ values. The total number of values is $N = n_1 + n_2$. The basic structure of the algorithm is as follows:

1. For each value in each party, count how many values in its own party (including itself) are smaller than or equal to it. Encrypt these counts.
2. For each value in each party, compare it with all the values in the other party using the secure comparison algorithm. Then, by adding up the comparison results, count how many values in the other party are smaller than or equal to it. Since the results of the secure comparison algorithm are ciphertexts, these counts are also ciphertexts.
3. For each value in each party, add the above two counts securely to get the total number of values in both parties that are smaller than or equal to it, which is its rank in ciphertext. Then, for each party, add all the encrypted ranks of its values to obtain the encrypted rank sum of this party. Call the rank sums of the two parties $R_1$ and $R_2$, respectively.
4. With the encrypted rank sums of both parties, compute the $H$ statistic with Eq. (1).

Step 4 raises a problem: to calculate $H$, we need the squared rank sums of both parties, $R_1^2$ and $R_2^2$. Since we only have the encrypted rank sums $E(R_1)$ and $E(R_2)$, we have to compute $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$. This is not easy, because we are using an additive homomorphic system, which does not support the direct multiplication of two encrypted integers. So we need to develop an algorithm to solve it. Let us explain each step in detail.

4.1. Secure computation of the rank sums

To compute the rank of a value, we just need to count how many values in both parties are smaller than or equal to it.
For example, with values {5, 6, 7}, the rank of 5 is 1, because only one value is smaller than or equal to it, namely itself ($5 \le 5$). The rank of 6 is 2, since two values are smaller than or equal to it ($5 \le 6$ and $6 \le 6$). Similarly, the rank of 7 is 3.

For each value in each party, counting how many values in its own party are smaller than or equal to it is simple: the party compares it with all of its own values locally. Counting the smaller or equal values in the other party is less straightforward. The value must also be compared with all values in the other party, and these comparisons must be conducted with the secure comparison algorithm.

Suppose the values in party A are $a_1, a_2, \ldots, a_{n_1}$, and the values in party B are $b_1, b_2, \ldots, b_{n_2}$. For each value $a_i$ ($i = 1, 2, \ldots, n_1$), we compare it with every value in party B using the secure comparison protocol. After these $n_2$ secure comparisons, we have $n_2$ results, each of which is an encryption of 0 or 1 ($E(0)$ or $E(1)$). For each value $b_j$ ($j = 1, 2, \ldots, n_2$), the comparison result between $a_i$ and $b_j$ is $E(1)$ if $a_i \ge b_j$ and $E(0)$ otherwise. Since the results are ciphertexts, no party knows what they are. The sum of the $n_2$ results is the encrypted number of values in party B that are smaller than or equal to $a_i$. We call it $E(R_B(a_i))$. The number of values in party A that are smaller than or equal to $a_i$ can be computed locally. It is named $R_A(a_i)$, and we encrypt it to get $E(R_A(a_i))$. The sum of $R_A(a_i)$ and $R_B(a_i)$ is the rank of $a_i$, denoted $R(a_i)$. The encryption of this rank, $E(R(a_i))$, can be computed from $E(R_A(a_i))$ and $E(R_B(a_i))$ with the additive homomorphic system that we use. In this way, we obtain the encryptions of the ranks of all values from both parties: $E(R(a_1)), E(R(a_2)), \ldots, E(R(a_{n_1})), E(R(b_1)), E(R(b_2)), \ldots, E(R(b_{n_2}))$.
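The counting view of ranks can be sketched in plaintext as below. This is only an illustration of the arithmetic: the cross-party count, which the protocol obtains by summing encrypted comparison results, is computed here in the clear, and the function name `rank_sums` is a hypothetical choice. The sketch assumes there are no ties.

```python
def rank_sums(sample_a, sample_b):
    """Return (R_1, R_2), computing each rank as a count of smaller-or-equal values."""
    def count_le(value, values):
        return sum(1 for u in values if u <= value)

    # R(a_i) = R_A(a_i) + R_B(a_i): local count plus cross-party count.
    ranks_a = [count_le(a, sample_a) + count_le(a, sample_b) for a in sample_a]
    ranks_b = [count_le(b, sample_b) + count_le(b, sample_a) for b in sample_b]
    return sum(ranks_a), sum(ranks_b)
```

For the values {5, 6, 7} with party A holding {5, 7} and party B holding {6} (an assumed split), the ranks are 1, 3 and 2, so `rank_sums([5, 7], [6])` returns `(4, 2)`. The two rank sums always add up to $N(N+1)/2$ when there are no ties.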
Then $E(R_1)$ and $E(R_2)$, the encryptions of the rank sums of parties A and B, respectively, can be computed from these encrypted ranks, because $R_1 = R(a_1) + R(a_2) + \cdots + R(a_{n_1})$ and $R_2 = R(b_1) + R(b_2) + \cdots + R(b_{n_2})$.

4.2. Secure computation of the squared rank sums

We need to compute $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$. Since the additive homomorphic cryptosystem does not support the direct multiplication of two encrypted integers, we present an algorithm for this task. To compute $E(ab)$ from $E(a)$ and $E(b)$, which are known to both parties, we first make one of the integers additively shared by the two parties. For example, we make $a$ additively shared so that party A holds an integer $a_A$ and party B holds an integer $a_B$ with $a_A + a_B = a$. The shares $a_A$ and $a_B$ can be obtained from $E(a)$ as follows. Party A randomly generates an integer $a_A$ and computes $E(a_A)$. Then party A computes $E(a - a_A) = E(a_B)$ from $E(a)$ and $E(a_A)$ and sends it to party B, and the two parties cooperate to decrypt $E(a_B)$. During the decryption, we make sure that the result $a_B$ is known only to party B. This can be achieved with the cryptographic system that we use, as explained in Section 3.

After A gets $a_A$ and B gets $a_B$, the two parties can compute $E(a_A b)$ and $E(a_B b)$, respectively. This can be done with the additive homomorphic system from $a_A$, $a_B$ and $E(b)$, because $a_A$ and $a_B$ are both plaintext integers. What we want is $E(ab) = E((a_A + a_B) b) = E(a_A b + a_B b)$. Since $E(a_A b)$ is held by party A and $E(a_B b)$ is held by party B, the two parties should exchange their values so that both of them can compute the final result $E(ab)$. But exchanging the values directly may cause privacy loss. For example, if party A gives $E(a_A b)$ to party B, then, since $E(a_A b) = E(b)^{a_A}$ with the variant of the ElGamal system we use and $E(b)$ is known to party B, party B can derive some information about $a_A$ from $E(a_A b)$.
So before the two parties calculate $E(a_A b)$ and $E(a_B b)$ and exchange their values, they rerandomize their copies of $E(b)$. Rerandomization changes the random numbers $r$ used in the encryptions, so the resulting ciphertexts differ from the original ones. To make the presentation clear, we call the rerandomized copies of $E(b)$ in party A and party B $E'(b)$ and $E''(b)$, respectively. Parties A and B then calculate $E'(a_A b) = E'(b)^{a_A}$ and $E''(a_B b) = E''(b)^{a_B}$, respectively, and exchange these values. Since the encryptions have changed, the parties cannot derive information from the values they receive from each other. For example, although party B gets $E'(a_A b)$ from A, $E'(a_A b) = E'(b)^{a_A}$ and party B does not know $E'(b)$, because it is the rerandomization done by party A. So B cannot derive $a_A$. After the exchange, both parties hold $E'(a_A b)$ and $E''(a_B b)$, and each can compute $E(ab) = E(a_A b + a_B b)$ by itself. The rerandomizations do not affect the calculation of the encrypted sums. In this way, both parties can get $E(ab)$ from $E(a)$ and $E(b)$. Algorithm 1 shows the main procedure of this encrypted multiplication.

Algorithm 1. Encrypted multiplication of two integers
Input. Encryptions of integers $a$ and $b$, $E(a)$ and $E(b)$, that are known to both parties;
Output. The encryption of $a \cdot b$, $E(ab)$;
1: Party A generates a random integer $a_A$ and computes $E(a_A)$;
2: Party A computes $E(a - a_A)$ and sends it to party B;
3: The two parties jointly decrypt $E(a - a_A)$, and only party B gets the result $a - a_A = a_B$;
4: Parties A and B rerandomize $E(b)$ and get $E'(b)$ and $E''(b)$, respectively;
5: Parties A and B calculate $E'(a_A b)$ and $E''(a_B b)$, respectively, and exchange the two values;
6: Parties A and B each compute $E(ab) = E(a_A b + a_B b)$;

4.3. Secure computation of H

With Algorithm 1 we can get $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$. Because we assume there are two parties, the $H$ statistic is calculated as

\[ H = \frac{12}{N(N+1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} \right) - 3(N+1), \]

where $N$, $n_1$ and $n_2$ are constants known to both parties. From $E(R_1^2)$ and $E(R_2^2)$, both parties can compute $E(R_1^2 n_2 + R_2^2 n_1)$. They then jointly decrypt it and get $R_1^2 n_2 + R_2^2 n_1$. The final result is calculated as

\[ H = \frac{12}{N(N+1)\, n_1 n_2} \left( R_1^2 n_2 + R_2^2 n_1 \right) - 3(N+1). \]

The reason why we compute $R_1^2 n_2 + R_2^2 n_1$ and then divide it by $n_1 n_2$, instead of computing $R_1^2 / n_1 + R_2^2 / n_2$ directly, is that the cryptographic system we use only supports operations on non-negative integers.
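The steps of Algorithm 1 can be sketched on a toy instance of the variant ElGamal scheme. Everything below is an illustrative assumption rather than the authors' implementation: the small group parameters `P`, `Q`, `G`, the function names, the use of a single key `x` in `joint_decrypt` as a stand-in for the two-party decryption, and the choice to take the share $a_B = a - a_A$ modulo the group order `Q` so that it is never negative. Both parties are simulated in one process.

```python
import random

# Toy group, assumed for illustration: P = 2*Q + 1, both prime; G has order Q mod P.
P, Q, G = 10007, 5003, 4

def enc(m, y):
    """E(m) = (g^m * y^r, g^r), with the plaintext reduced modulo Q."""
    r = random.randrange(1, Q)
    return (pow(G, m % Q, P) * pow(y, r, P) % P, pow(G, r, P))

def add(c, d):
    return (c[0] * d[0] % P, c[1] * d[1] % P)

def scale(c, a):
    return (pow(c[0], a, P), pow(c[1], a, P))

def rerandomize(c, y):
    """Add a fresh encryption of 0, which changes r but not the plaintext."""
    return add(c, enc(0, y))

def joint_decrypt(c, x):
    """Stand-in for the two-party decryption; x plays the role of x1 + x2."""
    g_m = c[0] * pow(pow(c[1], x, P), P - 2, P) % P
    acc = 1
    for m in range(Q):                      # exhaustive search for m with g^m = g_m
        if acc == g_m:
            return m
        acc = acc * G % P
    raise ValueError("plaintext not found")

def encrypted_multiply(e_a, e_b, x, y):
    """Compute E(ab) from E(a) and E(b); requires a*b < Q in this toy group."""
    a_A = random.randrange(Q)               # step 1: party A's random share
    e_aB = add(e_a, enc(-a_A, y))           # step 2: E(a - a_A), sent to B
    a_B = joint_decrypt(e_aB, x)            # step 3: only B learns a_B
    eb_A = rerandomize(e_b, y)              # step 4: E'(b) at party A
    eb_B = rerandomize(e_b, y)              #         E''(b) at party B
    part_A = scale(eb_A, a_A)               # step 5: E'(a_A b)
    part_B = scale(eb_B, a_B)               #         E''(a_B b)
    return add(part_A, part_B)              # step 6: E(a_A b + a_B b) = E(ab)
```

As a usage example, with a key `x` and `y = pow(G, x, P)`, decrypting `encrypted_multiply(enc(50, y), enc(50, y), x, y)` yields 2500, which is how a squared rank sum would be obtained from an encrypted rank sum.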
To avoid decimal fractions in the encryptions, we therefore compute $R_1^2 n_2 + R_2^2 n_1$ and apply the division only after decryption.

4.4. The summarized algorithm

The main steps of the algorithm are summarized in Algorithm 2.

Algorithm 2. The basic algorithm of privacy-preserving Kruskal-Wallis test
Input. Party A has sample $S_1$, which contains $n_1$ values, and party B has sample $S_2$, which contains $n_2$ values. The total number of values is $N = n_1 + n_2$;
Output. The statistic $H$;
1: for each value $a_i$ in party A do
2: Calculate its encrypted rank $E(R(a_i))$;
3: end for
4: for each value $b_j$ in party B do
5: Calculate its encrypted rank $E(R(b_j))$;
6: end for
7: Compute the encrypted rank sums $E(R_1)$ and $E(R_2)$, where $R_1 = \sum_{i=1}^{n_1} R(a_i)$ and $R_2 = \sum_{j=1}^{n_2} R(b_j)$;
8: Calculate $E(R_1^2)$ and $E(R_2^2)$ from $E(R_1)$ and $E(R_2)$ with Algorithm 1;
9: Calculate $E(R_1^2 n_2 + R_2^2 n_1)$ and decrypt it;
10: Compute $H$ from $R_1^2 n_2 + R_2^2 n_1$;

4.5. Extension to multiparty

The extension of the algorithm from the two-party case to the multiparty case is straightforward. In the two-party case, to get the rank of a value, we count the number of values that are smaller than or equal to it in its own party and in the other party, using the secure comparison protocol for the latter. Similarly, in the multiparty case, we count the number of values that are smaller than or equal to it in its own party and in every other party, with the help of the secure comparison protocol. After the encrypted ranks of every value in every party have been computed, the encrypted rank sums are calculated, just as in the two-party case. Then the encrypted squared rank sums $E(R_1^2), E(R_2^2), \ldots, E(R_k^2)$ can be computed with Algorithm 1. They are known to all the parties. Just as we compute $E(R_1^2 n_2 + R_2^2 n_1)$ when there are two parties, for $k$ parties we compute $E(R_1^2\, n_2 n_3 \cdots n_k + R_2^2\, n_1 n_3 \cdots n_k + \cdots + R_k^2\, n_1 n_2 \cdots n_{k-1})$.
We decrypt it and divide the decrypted result by $n_1 n_2 \cdots n_k$, instead of by $n_1 n_2$ as in the two-party case. The final result $H$ is then calculated.

5. The complete algorithm of privacy-preserving Kruskal-Wallis test

In this section, we present the privacy-preserving Kruskal-Wallis test that takes ties into account.

5.1. Modifying the data to eliminate ties

Before explaining the complete algorithm, we give a simpler method for dealing with tied values: modify the values slightly to eliminate the ties, and then apply the basic algorithm to the modified data. Since the data are modified slightly, this method causes a slight loss of accuracy.

To eliminate ties between parties, we proceed as follows. If there are two parties, every value in the first party is multiplied by 10 and then 0 is added to it, and every value in the second party is multiplied by 10 and then 1 is added to it. For example, suppose $a_i$ belongs to the first party and $b_j$ belongs to the second party. We set $a_i' = a_i \times 10 + 0$ and $b_j' = b_j \times 10 + 1$. In this way, the ties between the two parties are eliminated and the ranks of the other values are not affected. If there are more than two parties, the data are modified similarly, depending on the number of parties. For example, if there are ten parties, we still multiply every value in every party by 10, and add 0 to the values in the first party, 1 to the values in the second party, and so on, up to 9 for the values in the tenth party. If there are 100 parties, we multiply every value by 100 and add 0 through 99 to the values of the first through 100th party, respectively.

To deal with ties within a party, we do not need to modify the data. We can ignore these ties when calculating the ranks. For example, suppose one party has three values, {1, 1, 1}. With our algorithm, the ranks are calculated by counting the number of smaller or equal values. For these three values, the numbers of smaller or equal values in their own party are 3, 3 and 3. We change them to 1, 2 and 3, respectively. This is easy to do, because every party knows the ties within its own data. After the local counts have been changed, the counts of smaller or equal values from the other parties are added to get the ranks. The ranks do not contain any ties, because both the ties within each party and the ties between parties have been removed. After these modifications, we can apply the basic algorithm, which deals with data without ties.

5.2. The complete algorithm

Here we present the complete algorithm, which works for data containing ties.
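The tie-elimination preprocessing of Section 5.1 can be sketched as follows. The function names are hypothetical, the values are assumed to be integers, and the multiplier is taken to be the smallest power of 10 that is at least the number of parties, which matches the examples given for 2, 10 and 100 parties.

```python
def break_cross_party_ties(samples):
    """Scale every value and add the party index so no value is shared across parties."""
    multiplier = 10
    while multiplier < len(samples):
        multiplier *= 10
    return [[v * multiplier + index for v in sample]
            for index, sample in enumerate(samples)]

def local_counts_without_ties(sample):
    """Local smaller-or-equal counts, with tied values given distinct consecutive counts."""
    order = sorted(range(len(sample)), key=lambda i: sample[i])   # stable sort
    counts = [0] * len(sample)
    for position, i in enumerate(order, start=1):
        counts[i] = position
    return counts
```

For two parties holding {1, 4} and {4, 5}, `break_cross_party_ties` gives {10, 40} and {41, 51}, so the shared value 4 no longer ties. For a party holding {1, 1, 1}, `local_counts_without_ties` returns the counts 1, 2, 3 instead of 3, 3, 3.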
As in the previous section, the algorithm is first presented under the assumption that there are only two parties and is then extended to the multiparty case. As mentioned in Section 3, when there are ties in the data, the calculation of the statistic changes in two respects: the ranks of the tied values must be adjusted when computing $H$, and $H$ must be divided by $C$. Both are discussed in detail below.

5.2.1. Adjustment of the ranks of tied values

The ranks of each group of tied values should be changed to the average of the ranks that these tied values would have received without ties. We use an example to show the basic idea behind this adjustment. Suppose the values {1, 2, 3, 4, 4, 4, 4, 4} are distributed in two samples held by two parties. Party A has sample $S_1$, which contains the values {1, 2, 4, 4}, and party B has sample $S_2$, which contains the values {3, 4, 4, 4}. Ignoring the tie, the ranks of the values {1, 2, 3, 4, 4, 4, 4, 4} are 1, 2, 3, 4, 5, 6, 7, 8, respectively. The five 4's are tied and their ranks are 4, 5, 6, 7, 8. The largest rank in this tie is 8 and the smallest rank is 4. The average of the ranks is 6, and it can be calculated by taking the average of the largest rank 8 and the smallest rank 4. This is because the ranks of the values in a tie form an arithmetic sequence, so the average of all values in the sequence is the same as the average of the smallest and the largest values. After changing the ranks of the tied values to their average, the ranks should be 1, 2, 3, 6, 6, 6, 6, 6.

In our algorithm, since we calculate the rank of each value by counting the values that are smaller than or equal to it, the ranks are 1, 2, 3, 8, 8, 8, 8, 8, because for each 4 there are 8 values smaller than or equal to it. So with our algorithm, the rank assigned to each group of tied values is actually the largest rank in the tie.
We need to add some steps to our algorithm to change the ranks from 1, 2, 3, 8, 8, 8, 8, 8 to 1, 2, 3, 6, 6, 6, 6, 6. The basic idea is as follows. Since the rank assigned to the values in each tie is the largest rank in the tie, we only need to get the smallest rank in the tie and take the average of the largest rank and the smallest rank. To get the smallest rank from the largest rank, we need to know the number of values in the tie. With the largest rank named $l$, the smallest rank named $s$, and the number of values in the tie named $t$, we have $s = l - t + 1$. In our example, the tie contains 5 values, with 8 as the largest rank and 4 as the smallest rank, and indeed $8 - 5 + 1 = 4$. So, to change the ranks from 1, 2, 3, 8, 8, 8, 8, 8 to 1, 2, 3, 6, 6, 6, 6, 6, we need to get the number of values in each tie, compute the smallest rank in each tie, and take the average of the largest and smallest ranks.

We assume that every value is in a tie and calculate the number of values in each value's tie. In our example, value 1 is in a tie that contains only 1 value, and so are values 2 and 3. Each value 4 is in a tie that contains 5 values. So for the values {1, 2, 3, 4, 4, 4, 4, 4}, the numbers of values in each value's tie are 1, 1, 1, 5, 5, 5, 5, 5. Then, for each value, we compute the smallest rank in its tie with $s = l - t + 1$. For value 1, the smallest rank is $1 - 1 + 1 = 1$. For value 2, the smallest rank is $2 - 1 + 1 = 2$. For value 3, the smallest rank is $3 - 1 + 1 = 3$. For each value 4, the smallest rank is $8 - 5 + 1 = 4$. So the smallest ranks for the eight values are 1, 2, 3, 4, 4, 4, 4, 4. With the largest ranks 1, 2, 3, 8, 8, 8, 8, 8, we get the averaged ranks 1, 2, 3, 6, 6, 6, 6, 6. We can see that, for the values 1, 2 and 3, which are not tied, assuming that they are in ties containing 1 value does not affect their calculated ranks.
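The rank adjustment just described can be written out in plaintext as a short sketch. It only reproduces the arithmetic $s = l - t + 1$ followed by averaging; in the actual protocol these quantities are encrypted or distributed. The function name `adjusted_ranks` is an illustrative choice.

```python
def adjusted_ranks(values):
    """Average ranks from largest-rank counts, treating every value as being in a tie."""
    largest = [sum(1 for u in values if u <= v) for v in values]    # l: count of values <= v
    tie_size = [sum(1 for u in values if u == v) for v in values]   # t: values equal to v
    smallest = [l - t + 1 for l, t in zip(largest, tie_size)]       # s = l - t + 1
    return [(l + s) / 2 for l, s in zip(largest, smallest)]
```

For the example values {1, 2, 3, 4, 4, 4, 4, 4}, this returns the ranks 1, 2, 3, 6, 6, 6, 6, 6. Untied values have $t = 1$, so $s = l$ and their ranks are unchanged.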
The reason for this assumption is that, although we show all the values, ranks and tie sizes together in cleartext to make the idea easier to understand, in the real setting they are encrypted or distributed, and no party has complete information about them. So no party knows whether a value is in a tie or not. For example, party A has one value 1, and this value is not in a tie within party A. But A does not know whether party B has the value 1, so A does not know whether the value 1 is in a tie globally. Therefore all values are assumed to be in ties.

Having explained the basic idea of the rank adjustment, we now show the steps by which the two parties perform the adjustment securely. We follow the basic algorithm in Section 4 to get the rank of each value in each party. Here the ranks are the numbers of smaller or equal values, which are the largest ranks in each tie. To count the smaller or equal values for value $a_i$ in party A, it is compared with the values in both party A and party B. When comparing $a_i$ with the values in party A, we also count the number of values in party A that are equal to $a_i$ and name it $T_A(a_i)$. As mentioned in Section 4, when comparing $a_i$ with every value in party B securely, each comparison result is an encryption of 0 or 1: if $b_j \le a_i$, the comparison result between $b_j$ and $a_i$ is $E(1)$, and otherwise it is $E(0)$. The sum of these results

is the encrypted number of values in party B that are smaller than or equal to $a_i$. Here we keep all the comparison results between every pair of $a_i$ and $b_j$ in an $n_1 \times n_2$ table, such that the element in the row of $a_i$ and the column of $b_j$ is the comparison result between $a_i$ and $b_j$, which is $E(1)$ if $b_j \le a_i$ and $E(0)$ otherwise. Table 1 is an example of such an $n_1 \times n_2$ table.

Table 1 An example table.
         b_1    ...    b_j    ...    b_n2
a_1
...
a_i                    E(1)
...
a_n1

Similarly, to count the smaller or equal values of value $b_j$ in party B, we compare it with the values in both party A and party B. When comparing $b_j$ with the values in party B, we also count the number of values in party B that are equal to $b_j$ and name it $T_B(b_j)$. When comparing $b_j$ with the values in party A securely, the comparison result is defined differently from the previous case: the comparison result between $b_j$ and $a_i$ is $E(1)$ if $a_i \le b_j$ and $E(0)$ otherwise. We also keep these comparison results in an $n_1 \times n_2$ table. The two tables storing the comparison results are not the same. In the first table, the element in the row of $a_i$ and the column of $b_j$ is $E(1)$ if $b_j \le a_i$ and $E(0)$ otherwise, while in the second table it is $E(1)$ if $b_j \ge a_i$ and $E(0)$ otherwise.

We now introduce a third $n_1 \times n_2$ table in which each element is the secure sum of the two corresponding elements in the first and second tables. For example, if the element in the row of $a_i$ and the column of $b_j$ is $E(1)$ in the first table and $E(0)$ in the second table, then the corresponding element in the third table is $E(1 + 0)$. The values in the third table are either $E(1)$ or $E(2)$. If $a_i < b_j$, the value in the second table is $E(1)$ and the value in the first table is $E(0)$. Thus, the value in the third table is $E(1)$. If $a_i > b_j$, the value in the first table is $E(1)$ and the value in the second table is $E(0)$.
Thus, the value in the third table is also $E(1)$. If $a_i = b_j$, both values in the first and second tables are $E(1)$ and the value in the third table is $E(2)$. To sum up, the element in the row of $a_i$ and the column of $b_j$ in the third table is $E(1)$ if $a_i \ne b_j$ and $E(2)$ if $a_i = b_j$. We securely subtract 1 from every element in the third table. Then the element in the row of $a_i$ and the column of $b_j$ in the new table is $E(0)$ if $a_i \ne b_j$ and $E(1)$ if $a_i = b_j$. This new table contains the information about equal values between the two parties. The sum of all the values in the row of $a_i$ is the encrypted number of values in party B that are equal to $a_i$, which is named $E(T_B(a_i))$. The sum of all the values in the column of $b_j$ is the encrypted number of values in party A that are equal to $b_j$, which is named $E(T_A(b_j))$. Since parties A and B have computed $T_A(a_i)$ and $T_B(b_j)$, respectively, the two numbers can be encrypted and added to $E(T_B(a_i))$ and $E(T_A(b_j))$, respectively, to get $E(T(a_i)) = E(T_A(a_i) + T_B(a_i))$ and $E(T(b_j)) = E(T_A(b_j) + T_B(b_j))$.

For each value $a_i$ ($i = 1, 2, \ldots, n_1$) in party A, we have $E(R(a_i))$, which is the encrypted largest rank in the tie of $a_i$, and $E(T(a_i))$, which is the encrypted number of values in the tie of $a_i$, that is, the number of values equal to $a_i$ in both parties. For each value $b_j$ ($j = 1, 2, \ldots, n_2$) in party B, we have the corresponding numbers $E(R(b_j))$ and $E(T(b_j))$. To get the averaged rank for each value, we need to know the smallest rank in each value's tie. The smallest ranks can be calculated from the largest ranks and the numbers of values in the ties. For each value $a_i$ ($i = 1, 2, \ldots, n_1$) in party A, the encrypted smallest rank $E(S(a_i))$ in the tie of $a_i$ is $E(R(a_i) - T(a_i) + 1)$, and the encrypted adjusted rank of $a_i$ is $E((S(a_i) + R(a_i))/2)$, which is the average of the largest and the smallest ranks.
To avoid decimal fractions in the ciphertext, we only calculate $E(S(a_i) + R(a_i))$, and the division by 2 is applied after the final decryption. For each value $b_j$ ($j = 1, 2, \ldots, n_2$) in party B, the encrypted smallest rank $E(S(b_j))$ in the tie of $b_j$ is $E(R(b_j) - T(b_j) + 1)$, and the encrypted adjusted rank of $b_j$ is $E((S(b_j) + R(b_j))/2)$. Again we calculate $E(S(b_j) + R(b_j))$ and apply the division by 2 after the final decryption. In this way, we can adjust the rank of every value, and the rank sums are calculated from these new ranks. Note that if a value is not tied with others, the adjustment does not change its rank. The complete algorithm for calculating $H$ is summarized in Algorithm 3.

Algorithm 3. The complete algorithm of privacy-preserving Kruskal-Wallis test
Input. Party A has sample $S_1$ which contains $n_1$ values, and party B has sample $S_2$ which contains $n_2$ values. The total number of values is $N = n_1 + n_2$;
Output. The statistic $H$;
1: for each value $a_i$ in party A do
2:   Calculate its encrypted rank $E(R(a_i))$ and record the secure comparison results;
3: end for
4: for each value $b_j$ in party B do
5:   Calculate its encrypted rank $E(R(b_j))$ and record the secure comparison results;
6: end for
7: From the secure comparison results, get the information about equal values between the two parties;
8: for each value $a_i$ in party A do
9:   Calculate the encrypted number of values equal to it, $E(T(a_i))$;
10:   Calculate the encrypted smallest rank in its tie, $E(S(a_i))$, from $E(T(a_i))$ and $E(R(a_i))$;
11:   Calculate its encrypted averaged rank;
12: end for
13: for each value $b_j$ in party B do
14:   Calculate the encrypted number of values equal to it, $E(T(b_j))$;
15:   Calculate the encrypted smallest rank in its tie, $E(S(b_j))$, from $E(T(b_j))$ and $E(R(b_j))$;
16:   Calculate its encrypted averaged rank;
17: end for
18: Do the remaining calculations to compute $H$ as in Algorithm 2 with the encrypted averaged ranks;
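The data flow of steps 1 to 17 of Algorithm 3 can be traced with a cleartext simulation. This is only a sketch under strong assumptions: plain integers stand in for the ciphertexts $E(\cdot)$, ordinary addition stands in for the secure sum, and the function name `doubled_adjusted_ranks` is hypothetical. It shows how the two comparison tables yield the equality table, the tie sizes $T$, the smallest ranks $S$, and the sums $S + R$ that are halved only after decryption.

```python
def doubled_adjusted_ranks(a, b):
    """Cleartext stand-in for the two-party rank adjustment.

    Returns S + R (twice the adjusted rank) for every value of party A and
    of party B, mirroring the choice to divide by 2 only after decryption.
    """
    n1, n2 = len(a), len(b)
    # First table: entry (i, j) is 1 if b_j <= a_i, else 0.
    first = [[int(bj <= ai) for bj in b] for ai in a]
    # Second table: entry (i, j) is 1 if a_i <= b_j, else 0.
    second = [[int(ai <= bj) for bj in b] for ai in a]
    # Third table minus 1: entry (i, j) is 1 exactly when a_i == b_j.
    equal = [[first[i][j] + second[i][j] - 1 for j in range(n2)]
             for i in range(n1)]

    doubled_a = []
    for i, ai in enumerate(a):
        rank = sum(1 for x in a if x <= ai) + sum(first[i])    # R(a_i)
        tie = sum(1 for x in a if x == ai) + sum(equal[i])     # T_A + T_B
        smallest = rank - tie + 1                              # S(a_i)
        doubled_a.append(smallest + rank)

    doubled_b = []
    for j, bj in enumerate(b):
        rank = (sum(1 for x in b if x <= bj)
                + sum(second[i][j] for i in range(n1)))        # R(b_j)
        tie = (sum(1 for x in b if x == bj)
               + sum(equal[i][j] for i in range(n1)))          # T_B + T_A
        smallest = rank - tie + 1                              # S(b_j)
        doubled_b.append(smallest + rank)
    return doubled_a, doubled_b


da, db = doubled_adjusted_ranks([1, 2, 4, 4], [3, 4, 4, 4])
print([x / 2 for x in da], [x / 2 for x in db])
```

For the running example, halving the returned sums gives the adjusted ranks 1, 2, 6, 6 for party A and 3, 6, 6, 6 for party B.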

To extend the adjustment from two parties to multiple parties, we just need to create a table containing the information about equal values for each pair of parties during the computation of ranks. For each value, we calculate the encrypted number of values equal to it by collecting information from all the tables in which it is involved. Then the encrypted smallest rank in its tie and the averaged rank can be computed, and the following steps are the same as in the multiparty extension of Algorithm 2.

Calculation of C

In most cases, dividing $H$ by $C$ makes little change in the final result. If the number of tied values is not more than 1/4 of the total number of values, the division does not change the result by more than 10% for some degrees of freedom and significance [3]. To calculate $C$ securely for two parties A and B, we need the tie information computed in the adjustment of ranks: $E(T(a_i))$ for each value $a_i$ ($i = 1, 2, \ldots, n_1$) in party A and $E(T(b_j))$ for each value $b_j$ ($j = 1, 2, \ldots, n_2$) in party B. From Eq. (2), we have

$$C = 1 - \frac{\sum_i (t_i^3 - t_i)}{N^3 - N},$$

where $t_i$ is the number of values in the $i$th tie. To compute $C$ securely, we treat $T(a_i)$ of each distinct $a_i$ and $T(b_j)$ of each distinct $b_j$ as $t_i$. For the values that are not tied with others, their $T$ values are equal to 1, and since $1^3 - 1 = 0$, adding them does not affect the value of $C$. For the tied values, their $T$ values should be considered just once in the calculation of $C$, so we consider the $T$ values of the distinct values in each party. In the example used before, where party A has the values {1, 2, 4, 4} and party B has the values {3, 4, 4, 4}, for party A we only consider $T(1) = 1$, $T(2) = 1$ and $T(4) = 5$. For party B, we consider $T(3) = 1$ and $T(4) = 5$. Here all the $T$ values are encrypted and no party knows the exact numbers. $C$ can be securely computed from the encryptions of the $t_i$.
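The effect of this bookkeeping on $C$ can be checked in cleartext for the running example. The sketch below assumes unencrypted data and hypothetical helper names (`correction_factor`, `t_a`, `t_b`). It evaluates the formula for $C$ three ways: with each tie counted exactly once, with the $T$ values of the distinct values of both parties, and with those of party A only. Because $t^3 - t \ge 0$, adding terms can only lower $C$ and omitting terms can only raise it.

```python
from collections import Counter


def correction_factor(tie_sizes, n_total):
    """C = 1 - sum(t^3 - t) / (N^3 - N) for the given tie sizes."""
    return 1 - sum(t**3 - t for t in tie_sizes) / (n_total**3 - n_total)


a = [1, 2, 4, 4]   # sample of party A
b = [3, 4, 4, 4]   # sample of party B
n_total = len(a) + len(b)

global_counts = Counter(a + b)                    # T(v) over both parties
t_a = [global_counts[v] for v in sorted(set(a))]  # T(1), T(2), T(4)
t_b = [global_counts[v] for v in sorted(set(b))]  # T(3), T(4)

exact_c = correction_factor(global_counts.values(), n_total)
both_parties_c = correction_factor(t_a + t_b, n_total)  # T(4) appears twice
party_a_only_c = correction_factor(t_a, n_total)        # ties only in B dropped

print(exact_c, both_parties_c, party_a_only_c)
```

In this example the tie of value 4 is held by both parties, so the two-party sum includes $5^3 - 5 = 120$ twice, while the party-A-only sum happens to coincide with the exact one because party B holds no tie of its own.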
$E(t_i^3)$ is calculated from $E(t_i)$ with Algorithm 1, and then $E(\sum_i (t_i^3 - t_i))$ can be computed. The problem is that, although only the $T$ values of the distinct values in each party are included in the calculation of $C$, there are still duplicates. Considering only the distinct values in each party ensures that the ties within a party are counted only once, but it cannot eliminate the duplicated ties between parties. In the above example, $T(4)$ is counted twice because the tie of value 4 exists in both parties. We call the set of ties that exist only in party A $T_A$, the set of ties that exist only in party B $T_B$, and the set of ties that exist in both parties $T_{AB}$. We want the information about $T_A$, $T_B$ and $T_{AB}$ to be included in $C$ just once. With the above solution, $T_{AB}$ is counted twice. If we consider only $T(a_i)$ for each value $a_i$ ($i = 1, 2, \ldots, n_1$) in party A and do not add $T(b_j)$ for each value $b_j$ ($j = 1, 2, \ldots, n_2$) in party B, $T_A$ and $T_{AB}$ are considered once but the information of $T_B$ is lost. We cannot add the information of only $T_B$ without adding $T_{AB}$, because no party knows whether a tie in it is local or global.

We have not worked out a solution that calculates $C$ exactly. The two solutions mentioned above either add extra tie information or lose some tie information when calculating $C$. But together they give a range for $C$ and thereby limit the loss of accuracy: since every term $t_i^3 - t_i$ is non-negative, counting $T_{AB}$ twice can only enlarge the sum and thus yields a lower bound of $C$, while dropping $T_B$ can only shrink the sum and thus yields an upper bound of $C$.

Table 2 The BMI dataset.
Asians       Indians      Malays
32 (15)      26.4 (11)    24.9 (8)
30.1 (14)    23.1 (2)     25.3 (9)
27.6 (12)    23.5 (4)     23.8 (5)
26.2 (10)    24.6 (7)     22.1 (1)
28.2 (13)    24.3 (6)     23.4 (3)

We use an example to show the extension of the calculation of $C$ from two parties to the multiparty case. Suppose there are three parties, A, B and C. Similar to the two-party case, we denote by $T_A$, $T_B$ and $T_C$ the sets of ties that exist only in party A, B and C, respectively. $T_{AB}$ is the set of ties in parties A and B. $T_{AC}$, $T_{BC}$ and $T_{ABC}$ are defined in the same way.
For each pair of parties, we have a table storing the information about tied values between the two parties. The three tables are named Table(AB), Table(AC) and Table(BC), respectively. We collect the tie information of each distinct value in party A from all the tables that involve A, which are Table(AB) and Table(AC). This gives us the information about all the ties that appear in party A, which are $T_A$, $T_{AB}$, $T_{AC}$ and $T_{ABC}$. Then we disregard party A and the tables involving A, and collect the tie information of each distinct value in party B from all the remaining tables that involve B, which is only Table(BC). With this step, we add the information about all the ties that appear in party B but not in A, which are $T_B$ and $T_{BC}$. Then we encounter the same problem as in the two-party case: if we stop here, the tie information of $T_C$ is lost; if we add the tie information of each distinct value in party C from a table involving C, such as Table(AC), both $T_C$ and $T_{AC}$ are added, and thus $T_{AC}$ is counted twice.

When there are $k$ parties, we follow the same procedure: we get the information about all the ties that appear in the first party, then add the information about the ties in the second party, and so on. When it comes to the last party, we either lose the information about the ties appearing only in the last party, or add duplicate information about the ties appearing in both the last party and some other party. This gives us an upper bound and a lower bound of $C$.

6. Experiments

The experimental results are presented in this section. All the algorithms are implemented with the Crypto++ library in the C++ language, and the communications between parties are implemented with the socket API. The experiments are conducted on a Red Hat server with GHz CPUs and 24 GB of memory. We use the two datasets from [34] to test the accuracy of our algorithms. The first dataset, as shown in Table 2, contains 3 samples with equal sizes.
Note that the term "sample" is used differently in this paper than in many other papers. Each sample here is the set of data held by one party, and the number of samples is the number of parties. In this dataset, the data are simulated Body Mass Index (BMI) values for subjects of 3 different races from a suburb of San Francisco. Here the BMI values for the subjects of each race form a sample. There is no tie in this dataset, and the rank of every value is given in parentheses.
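As a cleartext check of the ranks listed in Table 2, the following sketch ranks each BMI value by counting the pooled values that are smaller than or equal to it, as the basic algorithm does, and then evaluates $H$. The formula used, $H = \frac{12}{N(N+1)} \sum_k R_k^2 / n_k - 3(N+1)$ with rank sums $R_k$ and sample sizes $n_k$, is the standard Kruskal-Wallis statistic for data without ties and is assumed here to match the definition in Section 3. The variable names are illustrative and no encryption is involved.

```python
# Values of Table 2 (the BMI dataset), one list per sample.
bmi = {
    "Asians": [32, 30.1, 27.6, 26.2, 28.2],
    "Indians": [26.4, 23.1, 23.5, 24.6, 24.3],
    "Malays": [24.9, 25.3, 23.8, 22.1, 23.4],
}

pooled = [v for sample in bmi.values() for v in sample]


def counting_rank(value):
    """Rank = number of pooled values smaller than or equal to `value`."""
    return sum(1 for w in pooled if w <= value)


ranks = {name: [counting_rank(v) for v in sample]
         for name, sample in bmi.items()}

n_total = len(pooled)
# Standard Kruskal-Wallis statistic for data without ties (assumed form).
h_stat = (12 / (n_total * (n_total + 1))
          * sum(sum(r) ** 2 / len(r) for r in ranks.values())
          - 3 * (n_total + 1))

print(ranks)
print(round(h_stat, 4))
```

The computed ranks agree with the values in parentheses in Table 2, which is expected because the dataset contains no ties and the counting rank then coincides with the ordinary rank.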

The second dataset is presented in Table 3. It contains 4 samples and their sizes are not all equal. Each sample is a set of simulated International Normalized Ratio (INR) values of patients in one hospital. The ranks are given in parentheses. There are ties in the data, and the tied ranks are set in bold in the original table.

Table 3 The INR dataset.
Hospital A     Hospital B     Hospital C     Hospital D
1.68 (1)       1.71 (6)       1.74 (13.5)    1.71 (6)
1.69 (2)       1.73 (10)      1.75 (16)      1.71 (6)
1.70 (3.5)     1.74 (13.5)    1.77 (18)      1.74 (13.5)
1.70 (3.5)     1.74 (13.5)    1.78 (20)      1.79 (22)
1.72 (8)       1.78 (20)      1.80 (23.5)    1.81 (26)
1.73 (10)      1.78 (20)      1.81 (26)      1.85 (29)
1.73 (10)      1.80 (23.5)    1.84 (28)      1.87 (30)
1.76 (17)      1.81 (26)      1.91 (31)

Since our secure algorithm only deals with non-negative integers, each value in dataset 1 is multiplied by 10 and each value in dataset 2 is multiplied by 100. This step changes all the values to non-negative integers without changing the ranks of the values, so it does not affect the result of the Kruskal-Wallis test, which is calculated from the ranks.

The accuracy of our basic algorithm for data without ties is 100%. This is shown with dataset 1. We provide the $H$ values calculated in both the two-party and multiparty scenarios in Table 4. In the two-party case, we take the first two samples of dataset 1 and calculate the $H$ value on these two samples. In the multiparty case, the $H$ value is calculated on all three samples of dataset 1.

Our algorithms for data with ties cause some accuracy loss. There are two methods to deal with tied values. The first one is to modify the data slightly to eliminate the ties and then compute $H$ with the basic algorithm. Accuracy loss occurs because the data are changed. The second method is to keep the data unchanged, but adjust the ranks and divide $H$ by $C$. Here the accuracy loss comes from the calculation of $C$.
Because we can compute an upper bound and a lower bound for $C$, we can also get an upper bound and a lower bound for the final result $H_c$. We test the two methods with dataset 2 and the results are shown in Table 5. Here we again take the first two samples of dataset 2 to test the two-party case and all four samples of dataset 2 to test the multiparty case. As the results show, the second method has better accuracy than the first one. In the two-party case, although the first two samples of dataset 2 contain many ties (9 out of 16 values are in ties), the two bounds are both very close to the accurate result. In the multiparty case, both the upper and lower bounds are equal to the accurate result. This is because the two bounds are calculated by either disregarding the ties that appear only in the last sample, or counting the ties between the last and the first samples twice. In this dataset, the last sample does not contain any tie that appears only in it, and there is no tie between the last sample and the first sample. So with this dataset, the two bounds are equal to the accurate result.

We now turn to the computation overheads of the algorithms. In Fig. 1 we compare the running times of the proposed algorithms for different data sizes in the two-party scenario. The running times are in seconds. The execution times of the basic algorithm for data without ties and of the first method for data with ties are very close. This is because in the first method of dealing with ties, we eliminate the ties and then follow the same procedure as the basic algorithm. The second method for data with ties takes more time than the first one, mostly because the adjustment of ranks takes time. We also show the overheads in the multiparty case with datasets 1 and 2.
The execution time of the basic algorithm on dataset 1 is:
Running time for 2 samples: 5 s
Running time for 3 samples: 17 s

The execution time of the first method for data containing ties on dataset 2 is:
Running time for 2 samples: 15 s
Running time for 3 samples: 67 s
Running time for 4 samples: 599 s

The execution time of the second method for data containing ties on dataset 2 is:
Running time for 2 samples: 26 s
Running time for 3 samples: 169 s
Running time for 4 samples: 2159 s

Table 4 Kruskal-Wallis test result on data without ties.
Columns: 2 samples, 3 samples
Rows: H calculated by the original Kruskal-Wallis test; H calculated by our basic algorithm

Table 5 Kruskal-Wallis test result on data with ties.
Columns: 2 samples, 4 samples
Rows: H_c calculated by the original Kruskal-Wallis test; H calculated from modified data (the first method); the upper bound of H_c (the second method); the lower bound of H_c (the second method)
