A Privacy Preserving Markov Model for Sequence Classification

Suxin Guo
Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, U.S.A.

Sheng Zhong
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Aidong Zhang
Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, U.S.A.

ABSTRACT

Sequence classification has attracted much interest in recent years because it differs from traditional classification tasks and has wide applications in many fields, such as bioinformatics. Because it is not easy to define explicit features for sequence data, as is done in traditional feature-based classification, many methods have been developed to exploit the particular characteristics of sequences. One common way of classifying sequence data is to use probabilistic generative models, such as the Markov model, to learn the probability distribution of sequences in each class.

Privacy is an issue that should be considered in research on sequence classification. In many cases, especially in bioinformatics, the sequence data contains sensitive information, which obstructs the mining of the data. For example, the DNA and protein sequences of individuals are highly sensitive and should not be released without protection. In the real world, however, data is usually distributed among different parties, and training only on their own data may not give the parties strong enough models. This raises a problem when several parties, each holding a set of sequences, want to learn Markov models on the union of their data but do not want to reveal their data to others because of privacy concerns. In this paper, we address this problem and propose a method to train Markov models, from first order up to order k where k > 1, on sequence data distributed among parties without revealing each party's private sequences to the others. We apply homomorphic encryption to protect the sensitive information.
Categories and Subject Descriptors
E.3 [Data]: Data Encryption - public key cryptosystems; I.5.2 [Pattern Recognition]: Design Methodology - classifier design and evaluation; J.3 [Computer Applications]: Life and Medical Science - biology and genetics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
BCB '13, September, Washington, DC, USA. Copyright 2013 ACM /13/09...$

General Terms
Algorithms, Security

Keywords
Data Security, Markov Model, Sequence Classification

1. INTRODUCTION

Sequence classification has been a hot topic in many fields, including bioinformatics, text mining, speech recognition, and others. In this work, we focus on the classification of symbolic sequences, where a sequence is an ordered list of symbols drawn from a finite alphabet, such as DNA and protein sequences. For example, protein sequences are composed of symbols from an alphabet of 20 amino acids. The task of sequence classification is to train a classifier that assigns class labels to sequences.

In traditional feature-based classification tasks, a sample typically has a vector of features that represents it. For sequence data, however, the features are not explicit, so traditional feature-based classifiers cannot be used directly for sequence classification. Many methods have been developed to take advantage of the particular characteristics of sequences to improve classification performance. One common way of doing this is to train probabilistic generative models on sequence data.
The Markov model is one of the most popular such models because it captures the probability distribution of each class well and because of its cost efficiency and decent accuracy.

Privacy must be taken into consideration in sequence classification, especially in the bioinformatics field. DNA and protein sequences are usually collected from particular individuals and thus contain sensitive information about those people, such as genetic markers for diseases [19]. For this reason, the sequences are mostly anonymized. Even after anonymization, however, the sequences remain under threat of re-identification. For example, in many cases a sequence can be de-anonymized and linked to its human contributor by recognizing certain markers [19].

The same type of sequence data is usually generated or collected by more than one organization, so the data is likely to be distributed among different organizations. If the organizations use only their own data to learn the classifiers, there is no privacy violation. But this is not practical, because with limited data the learned model may not be strong enough. A more reasonable way is for the organizations to collaborate and learn the models on the union of their data. This is where the privacy problem arises, since no party is willing to reveal its data to the others. For example, since various kinds of cancer are related to mutations in human protein sequences, medical institutes may collect both normal and mutated protein sequences from their patients and volunteer donors so that they can learn on the data. Then, for newly arriving sequences, the institutes can identify whether they carry a mutation or not. It is clearly better for the institutes to cooperate and learn on the union of their data, because they obtain stronger models in this way. But the private information within their sequences may stop them from sharing the data.

In this paper, we address the problem of learning Markov models on sequence data distributed among different parties with privacy concerns, and propose a method to learn the models without revealing each party's sequences. We deal not only with ordinary Markov models of the first order but also extend the method to preserve privacy for Markov models of order k, where k is larger than one. We make this extension because [2] has shown that Markov models of higher order can improve the accuracy of sequence classification.

The rest of this paper is organized as follows. We present the related work in Section 2 and the technical preliminaries in Section 3, which includes the background on the Markov model and the cryptographic tools we need. The details of our method are explained in Section 4. In Section 5 we show the experimental results, and finally Section 6 concludes the paper.

2. RELATED WORK

In recent decades, people have gradually become aware of the privacy problems that lie in data analysis methods. Many data mining and machine learning algorithms have been extended to be privacy preserving. Most of these approaches fall into two categories.
The methods in the first category protect privacy with data perturbation techniques such as randomization [1, 17], rotation [4], and resampling [16]. Because the original data is perturbed, methods of this kind usually suffer some loss of accuracy. The approaches in the second category apply cryptographic techniques to protect data during the computations [23, 10]. Because the sensitive information is encrypted rather than changed, these approaches typically lose no accuracy. Our work follows the second approach and applies homomorphic encryption to protect data.

In the cryptographic category, several secure building blocks are very commonly used, such as secure sum [5], secure comparison [6, 7], secure division [8], secure scalar product [5, 8, 12], secure matrix multiplication [ ], etc. The data mining and machine learning algorithms that have been enhanced with privacy solutions include decision tree classification [23, 28], k-means clustering [30, 18], gradient descent methods [31], and others.

Our work is not the first to consider the privacy problem for the Markov model. [22] proposed a method to outsource Markov chains securely without revealing sensitive information. But our problem setting is different from theirs. They consider the scenario in which the Markov model has already been learned and is known to one party, while another party has the test queries that are to be tested against the model. Both parties encrypt their own information and send it to an untrusted server, which performs the testing procedure securely. In our case, the Markov model is not known at the beginning, and our goal is to learn the model from training data distributed among different parties. All the computations are done by the data owners, not by a server. Hence the method from [22] cannot be directly applied to our setting.

3. TECHNICAL PRELIMINARIES

3.1 Markov Model for Sequence Classification

We briefly introduce the Markov model and how it is used for sequence classification. We start with the ordinary first order Markov model and then explain the model of order k, where k > 1.

3.1.1 Markov Model of the First Order

We have a set of states of size $m$, denoted by $\Sigma$, which can be regarded as an alphabet. A Markov chain is a sequence of states drawn from the state alphabet that has the Markov property: each state depends only on its previous state and not on any others. For a sequence $S$ of length $n$, we denote the $i$-th element of $S$ by $S_i$ and the value of the $i$-th element by $s_i$ [2], so that each $s_i$ is from the state alphabet. With the Markov property we have:

$$P(S_{i+1} = s_{i+1} \mid S_i = s_i, S_{i-1} = s_{i-1}, \ldots, S_1 = s_1) = P(S_{i+1} = s_{i+1} \mid S_i = s_i).$$

That is, the probability of state $s_{i+1}$ given all the previous states is the same as the probability of state $s_{i+1}$ given only state $s_i$. Thus the probability of sequence $S$ is:

$$P(S) = P(S_n = s_n \mid S_{n-1} = s_{n-1})\, P(S_{n-1} = s_{n-1} \mid S_{n-2} = s_{n-2}) \cdots P(S_1 = s_1) = P(S_1 = s_1) \prod_{i=2}^{n} P(S_i = s_i \mid S_{i-1} = s_{i-1}).$$

We simplify the notation of the above equation as follows:

$$P(S) = P(s_n \mid s_{n-1})\, P(s_{n-1} \mid s_{n-2}) \cdots P(s_1) = P(s_1) \prod_{i=2}^{n} P(s_i \mid s_{i-1}). \tag{1}$$

To train the Markov models for sequence classification, each element in the alphabet of the sequences is considered a state. For example, in the classification of protein sequence data, each of the 20 amino acids is treated as a state, and $\Sigma$ is the set of amino acids, of size 20. We need to calculate the following probabilities.

For any state $s_a$ in the state alphabet, its prior probability is:

$$P(s_a) = \frac{\mathrm{count}(s_a)}{\sum_{s_j \in \Sigma} \mathrm{count}(s_j)}$$

where $\mathrm{count}(s_a)$ is the number of times $s_a$ appears in the training set and $\sum_{s_j \in \Sigma} \mathrm{count}(s_j)$ is the total number of times that all the states in the alphabet appear in the training set, which in this case is the total size of the training set, that is, the sum of the lengths of all the sequences in the set.

For any two states $s_a$ and $s_b$ in the state alphabet, the transition probability that $s_b$ occurs given $s_a$ is:

$$P(s_b \mid s_a) = \frac{\mathrm{count}(s_a s_b)}{\mathrm{count}(s_a)}$$

where $\mathrm{count}(s_a s_b)$ is the number of times $s_a$ is followed by $s_b$ in the training set. After we calculate the transition probability of every pair of states in the alphabet $\Sigma$, we obtain an $m$ by $m$ transition matrix, and the training process is complete.

To test a sequence against a Markov model and examine how likely it is that the sequence was generated by this model, we follow Equation 1 and calculate the product of all the needed probabilities, which can be found in the transition matrix and the priors. For each class, such a Markov model is trained on the data of that class. Then, to test a sequence and identify its class label, we calculate its probability under every class model and assign it to the class with the highest probability.

3.1.2 Markov Model of Order k

The Markov model of order k is an extension of the Markov model of order 1 in which each state depends on its previous k states, not just one. With this extension, the probability of sequence $S$ becomes:

$$P(S) = P(s_1 s_2 \ldots s_k) \prod_{i=k+1}^{n} P(s_i \mid s_{i-1} s_{i-2} \ldots s_{i-k}).$$

In this case the priors are not the probabilities of single states but the probabilities of k-grams, where a k-gram is a combination of k symbols from the alphabet. For any k-gram $s_1 \ldots s_k$, its prior probability is:

$$P(s_1 \ldots s_k) = \frac{\mathrm{count}(s_1 \ldots s_k)}{\mathrm{count}(\text{all } k\text{-grams})}$$

where $\mathrm{count}(s_1 \ldots s_k)$ is the number of times the k-gram $s_1 \ldots s_k$ appears in the training set at any position, and $\mathrm{count}(\text{all } k\text{-grams})$ is the total number of times that all possible k-grams appear in the training set.

For any state $s_a$ and any k-gram $s_1 \ldots s_k$, the transition probability that $s_a$ occurs given $s_1 \ldots s_k$ is:

$$P(s_a \mid s_1 \ldots s_k) = \frac{\mathrm{count}(s_1 \ldots s_k s_a)}{\mathrm{count}(s_1 \ldots s_k)}$$

where $\mathrm{count}(s_1 \ldots s_k s_a)$ is the number of times $s_1 \ldots s_k$ is followed by $s_a$ in the training set. In this case the transition matrix is not of size $m$ by $m$ but of size $m^k$ by $m$, because the number of all possible k-grams is $m^k$.

The remaining procedure is the same as for the first order Markov model. We again train a model for each class, and the test sequences are tested against every class model.

3.2 Privacy Protection of the Markov Model

We assume that each party has a set of sequences and that the parties want to learn the Markov models collaboratively on the union of their data. We develop our secure solution under the semi-honest model, which is widely used in articles in this area [ ]. In this model, the parties are assumed to be honest but curious: they follow the protocols correctly, but they may try to derive the private information of others from the intermediate results they obtain during the execution of the protocols. This is a reasonable assumption in privacy preserving data mining, because the goal of all the parties is to obtain accurate mining results, so they are not willing to corrupt the protocols and get invalid results.

3.3 Cryptographic Tools

3.3.1 Homomorphic Cryptographic Scheme

In this paper, we apply an additively homomorphic asymmetric cryptographic system to perform the encryptions and decryptions of the data. An asymmetric cryptographic system has a pair of keys: a public key for encryption and a private key for decryption. We denote the encryption of integer $x_1$ by $E(x_1)$. A cryptographic scheme is additively homomorphic if there are operators $\oplus$ and $\otimes$ such that for any two integers $x_1$, $x_2$ and any constant $a$ we have:

$$E(x_1 + x_2) = E(x_1) \oplus E(x_2), \qquad E(a \cdot x_1) = a \otimes E(x_1).$$
This means that with an additively homomorphic cryptographic system, we can compute the encrypted sum of integers directly from the encryptions of these integers.

3.3.2 ElGamal Cryptographic System

There are several additively homomorphic cryptographic schemes [32, 26]. In this work, we apply a variant of the ElGamal scheme [11], which is semantically secure under the Diffie-Hellman Assumption [3]. The original ElGamal cryptographic system is a multiplicatively homomorphic asymmetric cryptographic system. With this system, the encryption of an integer $f$ is the pair:

$$E(f) = (f \cdot y^r,\; g^r)$$

where $g$ is a generator, $x$ is the private key, $y = g^x$ is the public key, and $r$ is a random integer. We call the first part of the pair $c_1$ and the second part $c_2$, so that $c_1 = f \cdot y^r$ and $c_2 = g^r$. To decrypt $E(f)$, we compute $s = c_2^x = g^{rx} = g^{xr} = y^r$. We then compute $c_1 \cdot s^{-1} = f \cdot y^r \cdot y^{-r}$, which gives the cleartext $f$.

In the variant of the ElGamal scheme that we use, the integer $f$ is encrypted as:

$$E(f) = (g^f \cdot y^r,\; g^r).$$

The only difference between the original ElGamal scheme and this variant is that $f$ in the first part is changed to $g^f$. With this change, the variant is an additively homomorphic cryptosystem such that:

$$E(x_1 + x_2) = E(x_1) \cdot E(x_2), \qquad E(a \cdot x_1) = E(x_1)^a.$$

To decrypt $E(f)$, we follow the same procedure as in the original ElGamal algorithm. Because of the change, however, the decryption yields $g^f$ instead of $f$. To obtain $f$ from $g^f$, we have to perform an exhaustive search: we try every possible $f$ and look for the one whose power $g^f$ matches the decrypted value.
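The variant described above can be illustrated with a small sketch. The Python code below is a toy and not the authors' Crypto++ implementation: the group parameters `P`, `Q`, `G`, the function names, and the `max_plain` search bound are all assumptions chosen for readability, and the parameters are far too small to be secure.

```python
import secrets

# Toy group parameters (assumption, illustration only): P = 2*Q + 1 with P and Q
# prime, and G = 4 generating the subgroup of prime order Q.
P = 2039
Q = 1019
G = 4


def keygen():
    """Return (x, y): private key x and public key y = g^x."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def encrypt(f, y):
    """Encrypt integer f as (g^f * y^r, g^r) with fresh randomness r."""
    r = secrets.randbelow(Q - 1) + 1
    return (pow(G, f, P) * pow(y, r, P) % P, pow(G, r, P))


def add(ct1, ct2):
    """E(x1 + x2): component-wise product of two ciphertexts."""
    return (ct1[0] * ct2[0] % P, ct1[1] * ct2[1] % P)


def scalar_mul(ct, a):
    """E(a * x1): component-wise a-th power of a ciphertext."""
    return (pow(ct[0], a, P), pow(ct[1], a, P))


def decrypt(ct, x, max_plain):
    """Recover f by computing g^f = c1 * s^-1 and searching f in [0, max_plain]."""
    c1, c2 = ct
    s = pow(c2, x, P)                 # s = c2^x = y^r
    g_f = c1 * pow(s, -1, P) % P      # modular inverse via pow (Python 3.8+)
    candidate = 1                     # g^0
    for f in range(max_plain + 1):
        if candidate == g_f:
            return f
        candidate = candidate * G % P
    raise ValueError("plaintext outside the searched range")


if __name__ == "__main__":
    x, y = keygen()
    total = add(encrypt(3, y), encrypt(4, y))
    print(decrypt(total, x, max_plain=100))
    print(decrypt(scalar_mul(encrypt(5, y), 6), x, max_plain=100))
```

In this sketch, the sum of the encryptions of 3 and 4 decrypts to 7, and the sixth power of the encryption of 5 decrypts to 30. The search loop makes the cost of decryption proportional to the size of the plaintext range.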

Please note that the time needed for this exhaustive search is reasonable, because we only need to search all possible values of the plaintext, which is not a large range in our case.

We assume that the private key is additively shared by all the parties and that no party knows the complete private key. The parties need to coordinate with each other to perform decryptions, and the ciphertexts can be exposed to every party because no party can decrypt them without the help of the others. The private key is shared in the following way. Suppose there are two parties, A and B. A holds one part of the private key, $x_A$, and B holds the other part, $x_B$, such that $x_A + x_B = x$, where $x$ is the complete private key. In the decryption we need to compute $s = c_2^{x} = c_2^{x_A + x_B} = c_2^{x_A} \cdot c_2^{x_B}$. Party A calculates $s_A = c_2^{x_A}$ and party B calculates $s_B = c_2^{x_B}$, so that $s = s_A \cdot s_B$. We then need $c_1 \cdot s^{-1} = c_1 \cdot (s_A \cdot s_B)^{-1} = c_1 \cdot s_A^{-1} \cdot s_B^{-1}$. Party A computes $c_1 \cdot s_A^{-1}$ and sends it to party B. Party B then computes $c_1 \cdot s_A^{-1} \cdot s_B^{-1} = c_1 \cdot s^{-1} = g^f$ and sends it to A. In this way, both parties obtain the decrypted result. Since party B performs its part of the decryption later, it obtains the final result first. If it does not send the result to A, the decrypted result is known only to party B. The order of the parties in the decryption can be changed, so if the result should be known to only one party, that party should perform its part of the decryption last.

3.3.3 Secure Scalar Product Computation

We apply the secure scalar product protocol in [12] to compute the scalar product of two vectors. Given the two d-dimensional vectors $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ from party A and $\mathbf{y} = (y_1, y_2, \ldots, y_d)$ from party B, the protocol securely computes the scalar product as two shares, $p_A + p_B = \mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 + \cdots + x_d y_d$, where $p_A$ is held by party A and $p_B$ is held by party B.

3.3.4 Secure Logsum Computation

In this work we also need the secure logsum computation proposed in [27]. The inputs are two d-dimensional vectors, $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ from party A and $\mathbf{y} = (y_1, y_2, \ldots, y_d)$ from party B, such that $\mathbf{x} + \mathbf{y} = \log \mathbf{z} = (\log z_1, \log z_2, \ldots, \log z_d)$. The outputs are two additive shares, $s_A$ held by party A and $s_B$ held by party B, such that

$$s_A + s_B = \log\Big(\sum_{i=1}^{d} z_i\Big) = \log\Big(\sum_{i=1}^{d} 10^{x_i + y_i}\Big).$$

The basic idea of the secure logsum algorithm is as follows. First, party A computes the vector $10^{\mathbf{x} - q}$, where $q$ is a random number generated by A, and party B computes the vector $10^{\mathbf{y}}$. Second, the two parties apply the secure scalar product protocol to calculate the scalar product of the two vectors $10^{\mathbf{x} - q}$ and $10^{\mathbf{y}}$. The result $\varphi = \sum_{i=1}^{d} 10^{x_i + y_i - q}$ is known only to party B. Finally, party B computes $s_B = \log \varphi = \log\big(\sum_{i=1}^{d} 10^{x_i + y_i}\big) - q$ and party A holds $s_A = q$, so that $s_A + s_B = \log\big(\sum_{i=1}^{d} 10^{x_i + y_i}\big) = \log\big(\sum_{i=1}^{d} z_i\big)$.

4. PRIVACY PRESERVING MARKOV MODEL FOR SEQUENCE CLASSIFICATION

In this section we present how to securely learn the Markov models for sequence classification on data distributed between two parties, A and B. The method extends to the case in which the number of parties is larger than two; for simplicity, we consider only the two-party case here. We start with the first order Markov model and then extend it to the Markov model of order k, where k > 1.

4.1 Markov Model of the First Order

As mentioned in Section 3, training the Markov model for each class consists of counting the occurrences of single states and of combinations of states in the class, and calculating the prior and transition probabilities. Let $C$ be the set of all class labels, of size $l$. Then for each class value $c_j \in C$ we compute the prior probabilities of the states and the transition probabilities.
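As a point of reference, the per-class counting and probability estimation recalled above can be sketched in the clear, without any privacy protection. This is a minimal sketch rather than the paper's implementation: the function names `train_first_order`, `log_likelihood`, and `classify`, as well as the toy DNA strings, are assumptions made for illustration. Log-probabilities are summed instead of multiplying the probabilities of Equation 1, which is equivalent for comparing classes.

```python
import math
from collections import Counter


def train_first_order(sequences):
    """Estimate priors P(s_a) and transitions P(s_b | s_a) from one class."""
    state_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        state_counts.update(seq)                 # count(s_a)
        pair_counts.update(zip(seq, seq[1:]))    # count(s_a s_b)
    total = sum(state_counts.values())           # sum of all sequence lengths
    priors = {s: c / total for s, c in state_counts.items()}
    trans = {(a, b): c / state_counts[a] for (a, b), c in pair_counts.items()}
    return priors, trans


def log_likelihood(seq, model):
    """log P(S) = log P(s_1) + sum over i >= 2 of log P(s_i | s_{i-1})."""
    priors, trans = model
    p = priors.get(seq[0], 0.0)
    if p == 0.0:
        return -math.inf                         # unseen first state
    score = math.log(p)
    for a, b in zip(seq, seq[1:]):
        p = trans.get((a, b), 0.0)
        if p == 0.0:
            return -math.inf                     # unseen transition
        score += math.log(p)
    return score


def classify(seq, models):
    """Assign seq to the class whose model gives the highest probability."""
    return max(models, key=lambda c: log_likelihood(seq, models[c]))


if __name__ == "__main__":
    # Hypothetical toy training data, one list of sequences per class.
    models = {
        "at_rich": train_first_order(["AATTA", "ATATA"]),
        "gc_rich": train_first_order(["GGCCG", "GCGCG"]),
    }
    print(classify("ATTA", models))
    print(classify("GCCG", models))
```

In the distributed setting, each party would compute only the two `Counter` objects on its own data; the divisions that turn counts into probabilities are the part that requires the secure protocol.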
For any state $s_a$ in the state alphabet, its prior probability in class $c_j$ is:

$$P(s_a \mid c_j) = \frac{\mathrm{count}(s_a, c_j)}{\mathrm{count}(c_j)}$$

where $\mathrm{count}(s_a, c_j)$ is the number of times $s_a$ appears in the sequences belonging to class $c_j$, and $\mathrm{count}(c_j)$ is the total number of times that all the states in the alphabet appear in the sequences belonging to class $c_j$, which in this case is the sum of the lengths of all the sequences belonging to class $c_j$.

When the data is distributed between parties A and B, we have:

$$\mathrm{count}(s_a, c_j) = \mathrm{count}_A(s_a, c_j) + \mathrm{count}_B(s_a, c_j)$$

where $\mathrm{count}_A(s_a, c_j)$ is the number of times $s_a$ appears in the sequences belonging to class $c_j$ in the data of party A, and $\mathrm{count}_B(s_a, c_j)$ is the corresponding number in the data of party B. To get the total number of occurrences of $s_a$, we add up the numbers of times it appears at both parties. Similarly, we have:

$$\mathrm{count}(c_j) = \mathrm{count}_A(c_j) + \mathrm{count}_B(c_j).$$

So the prior probability of state $s_a$ in class $c_j$ is:

$$P(s_a \mid c_j) = \frac{\mathrm{count}(s_a, c_j)}{\mathrm{count}(c_j)} = \frac{\mathrm{count}_A(s_a, c_j) + \mathrm{count}_B(s_a, c_j)}{\mathrm{count}_A(c_j) + \mathrm{count}_B(c_j)}$$

where $\mathrm{count}_A(s_a, c_j)$ and $\mathrm{count}_A(c_j)$ are held by party A, and $\mathrm{count}_B(s_a, c_j)$ and $\mathrm{count}_B(c_j)$ are held by party B.

Although the two parties can encrypt their own values and exchange them, it is still hard to calculate $P(s_a \mid c_j)$, because an additively homomorphic cryptosystem does not support the secure division of one encrypted integer by another. We therefore calculate $\log P(s_a \mid c_j)$ instead of $P(s_a \mid c_j)$, which turns the division into a subtraction:

$$\log P(s_a \mid c_j) = \log\big(\mathrm{count}_A(s_a, c_j) + \mathrm{count}_B(s_a, c_j)\big) - \log\big(\mathrm{count}_A(c_j) + \mathrm{count}_B(c_j)\big).$$

The problem then becomes how to securely calculate $\log(a + b)$, where $a$ is held by party A and $b$ is held by party B. Here we use the secure logsum protocol, which takes two d-dimensional vectors $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ from party A and $\mathbf{y} = (y_1, y_2, \ldots, y_d)$ from party B as input, where $\mathbf{x} + \mathbf{y} = \log \mathbf{z} = (\log z_1, \log z_2, \ldots, \log z_d)$, and outputs two additive shares, $s_A$ held by party A and $s_B$ held by party B, such that $s_A + s_B = \log\big(\sum_{i=1}^{d} z_i\big) = \log\big(\sum_{i=1}^{d} 10^{x_i + y_i}\big)$.

We feed the secure logsum protocol with two 2-dimensional vectors: $\mathbf{x} = (\log a, 0)$ from party A and $\mathbf{y} = (0, \log b)$ from party B. In this case $\mathbf{x} + \mathbf{y} = \log \mathbf{z} = (\log z_1, \log z_2) = (\log a, \log b)$, so the output of the secure logsum protocol is $s_A + s_B = \log(z_1 + z_2) = \log(a + b)$.

Following this procedure, $\log\big(\mathrm{count}_A(s_a, c_j) + \mathrm{count}_B(s_a, c_j)\big)$ and $\log\big(\mathrm{count}_A(c_j) + \mathrm{count}_B(c_j)\big)$ are calculated by the two parties with the secure logsum protocol and shared as follows:

$$\log\big(\mathrm{count}_A(s_a, c_j) + \mathrm{count}_B(s_a, c_j)\big) = s_1^A + s_1^B$$
$$\log\big(\mathrm{count}_A(c_j) + \mathrm{count}_B(c_j)\big) = s_2^A + s_2^B$$

where $s_1^A$ and $s_2^A$ are held by party A, and $s_1^B$ and $s_2^B$ are held by party B. Then we have:

$$\log P(s_a \mid c_j) = (s_1^A + s_1^B) - (s_2^A + s_2^B) = (s_1^A - s_2^A) + (s_1^B - s_2^B).$$

Party A can compute $s_1^A - s_2^A$ and party B can compute $s_1^B - s_2^B$. The two parties then exchange these two values, so that both of them obtain $\log P(s_a \mid c_j)$ and can calculate $P(s_a \mid c_j)$.

For any two states $s_a$ and $s_b$ in the state alphabet, the transition probability that $s_b$ occurs given $s_a$ in class $c_j$ is:

$$P(s_b \mid s_a, c_j) = \frac{\mathrm{count}(s_a s_b, c_j)}{\mathrm{count}(s_a, c_j)}$$

where $\mathrm{count}(s_a s_b, c_j)$ is the number of times $s_a$ is followed by $s_b$ in the sequences belonging to class $c_j$. Following the same procedure as for the prior probabilities, both parties can obtain the transition probabilities securely:

$$P(s_b \mid s_a, c_j) = \frac{\mathrm{count}_A(s_a s_b, c_j) + \mathrm{count}_B(s_a s_b, c_j)}{\mathrm{count}_A(s_a, c_j) + \mathrm{count}_B(s_a, c_j)}.$$

With all the prior probabilities and transition probabilities computed, both parties obtain the Markov models of every class. Since the models are known, every party can test its own sequences against the models by itself. The training process of the privacy preserving Markov model of the first order is summarized in Algorithm 1.

Algorithm 1 Privacy Preserving Markov Model of Order 1
Input: Party A has a set of sequences $D_A$ and party B has a set of sequences $D_B$;
Output: The Markov models of every class, where each model contains the prior probabilities of every state and the transition matrix;
1: for each class $c_j$ do
2:   Party A counts the total number of times that all the states in the alphabet appear in the sequences in $D_A$ that belong to class $c_j$, which is $\mathrm{count}_A(c_j)$, and party B counts $\mathrm{count}_B(c_j)$ from $D_B$ in the same way;
3:   for each state $s_a$ in the state alphabet do
4:     Party A counts the occurrences of $s_a$ in the sequences in $D_A$ that belong to class $c_j$, $\mathrm{count}_A(s_a, c_j)$, and party B counts $\mathrm{count}_B(s_a, c_j)$ from $D_B$ in the same way;
5:     Parties A and B jointly compute the logarithm of the prior probability of $s_a$ in $c_j$, $\log P(s_a \mid c_j)$, from their counts with the help of the secure logsum protocol, and then compute $P(s_a \mid c_j)$;
6:     for each state $s_b$ in the state alphabet do
7:       Party A counts the number of times $s_a$ is followed by $s_b$ in the sequences in $D_A$ that belong to class $c_j$, $\mathrm{count}_A(s_a s_b, c_j)$, and party B counts $\mathrm{count}_B(s_a s_b, c_j)$ from $D_B$ in the same way;
8:       Parties A and B jointly compute the logarithm of the transition probability that $s_b$ occurs given $s_a$ in class $c_j$, $\log P(s_b \mid s_a, c_j)$, from their counts with the help of the secure logsum protocol, and then compute $P(s_b \mid s_a, c_j)$;
9:     end for
10:   end for
11: end for

4.2 Markov Model of Order k

The training process of the Markov model of order k follows the same pattern as the training process of the Markov model of order 1. For any k-gram $s_1 \ldots s_k$, its prior probability in class $c_j$ is:

$$P(s_1 \ldots s_k \mid c_j) = \frac{\mathrm{count}(s_1 \ldots s_k, c_j)}{\mathrm{count}(\text{all } k\text{-grams in } c_j)}$$

where $\mathrm{count}(s_1 \ldots s_k, c_j)$ is the number of times the k-gram $s_1 \ldots s_k$ appears in the sequences belonging to class $c_j$ at any position, and $\mathrm{count}(\text{all } k\text{-grams in } c_j)$ is the total number of times that all possible k-grams appear in the sequences belonging to class $c_j$. The two parties can compute the probability securely from their counts with the same method as in the training of the first order Markov model:

$$P(s_1 \ldots s_k \mid c_j) = \frac{\mathrm{count}_A(s_1 \ldots s_k, c_j) + \mathrm{count}_B(s_1 \ldots s_k, c_j)}{\mathrm{count}_A(\text{all } k\text{-grams in } c_j) + \mathrm{count}_B(\text{all } k\text{-grams in } c_j)}.$$

For any state $s_a$ and any k-gram $s_1 \ldots s_k$, the transition probability that $s_a$ occurs given $s_1 \ldots s_k$ in class $c_j$ is:

$$P(s_a \mid s_1 \ldots s_k, c_j) = \frac{\mathrm{count}(s_1 \ldots s_k s_a, c_j)}{\mathrm{count}(s_1 \ldots s_k, c_j)}$$

where $\mathrm{count}(s_1 \ldots s_k s_a, c_j)$ is the number of times $s_1 \ldots s_k$ is followed by $s_a$ in the sequences belonging to class $c_j$. The probability can be computed by:

$$P(s_a \mid s_1 \ldots s_k, c_j) = \frac{\mathrm{count}_A(s_1 \ldots s_k s_a, c_j) + \mathrm{count}_B(s_1 \ldots s_k s_a, c_j)}{\mathrm{count}_A(s_1 \ldots s_k, c_j) + \mathrm{count}_B(s_1 \ldots s_k, c_j)}.$$

The training process of the privacy preserving Markov model of order k is summarized in Algorithm 2.

Algorithm 2 Privacy Preserving Markov Model of Order k
Input: Party A has a set of sequences $D_A$ and party B has a set of sequences $D_B$;
Output: The Markov models of every class, where each model contains the prior probabilities of every k-gram and the transition matrix;
1: for each class $c_j$ do
2:   Party A counts the total number of times that all possible k-grams appear in the sequences in $D_A$ that belong to class $c_j$, which is $\mathrm{count}_A(\text{all } k\text{-grams in } c_j)$, and party B counts $\mathrm{count}_B(\text{all } k\text{-grams in } c_j)$ from $D_B$ in the same way;
3:   for each possible k-gram $s_1 \ldots s_k$ do
4:     Party A counts the occurrences of $s_1 \ldots s_k$ in the sequences in $D_A$ that belong to class $c_j$, $\mathrm{count}_A(s_1 \ldots s_k, c_j)$, and party B counts $\mathrm{count}_B(s_1 \ldots s_k, c_j)$ from $D_B$ in the same way;
5:     Parties A and B jointly compute the logarithm of the prior probability of $s_1 \ldots s_k$ in $c_j$, $\log P(s_1 \ldots s_k \mid c_j)$, from their counts with the help of the secure logsum protocol, and then compute $P(s_1 \ldots s_k \mid c_j)$;
6:     for each state $s_a$ in the state alphabet do
7:       Party A counts the number of times $s_1 \ldots s_k$ is followed by $s_a$ in the sequences in $D_A$ that belong to class $c_j$, $\mathrm{count}_A(s_1 \ldots s_k s_a, c_j)$, and party B counts $\mathrm{count}_B(s_1 \ldots s_k s_a, c_j)$ from $D_B$ in the same way;
8:       Parties A and B jointly compute the logarithm of the transition probability that $s_a$ occurs given $s_1 \ldots s_k$ in class $c_j$, $\log P(s_a \mid s_1 \ldots s_k, c_j)$, from their counts with the help of the secure logsum protocol, and then compute $P(s_a \mid s_1 \ldots s_k, c_j)$;
9:     end for
10:   end for
11: end for

5. EXPERIMENTS

The experimental results are presented in this section. All the algorithms are implemented in C++ with the Crypto++ library, and the communication between parties is implemented with the socket API. The experiments are conducted on a Red Hat server with 16 x 2.27 GHz CPUs and 24 GB of memory.

We use two real-world datasets to test our algorithms. The first dataset, which is from [25], is a set of inorganic materials binding peptide sequences. It contains 25 quartz-binding peptide sequences, each of which is either a strong binder or a weak binder. There are 10 strong binder sequences and 15 weak binder sequences. All of the sequences have the same length, which is 12. The second dataset is the SCOP (structural classification of proteins) dataset from [24]. The approach in [21] is applied to preprocess the data, and we get proteins from seven families, which are A, B, C, D, E, F, and G. Here we pick protein sequences from classes A and G to test our algorithm. Class A contains 23 sequences of different lengths; the shortest has length 160 and the longest has length 177. Class G consists of 20 protein sequences with lengths from 45 to 83.

We present the performance of our algorithms in two aspects: accuracy and running time. Since our work focuses only on the privacy preservation of the Markov models, we evaluate the accuracy of our approach by comparing the result of our privacy preserving approach with the result of the ideal centralized algorithm. Protecting data with cryptographic approaches should not lose any information, so in principle the privacy preserving approach should give exactly the same result as the original algorithm. There is, however, a practical issue that reduces the accuracy of the privacy preserving approach. The cryptosystem we are using only supports operations on non-negative integers, but in the secure logsum protocol we need to perform calculations like $10^{\mathbf{x} - q}$, which may introduce real numbers that are not integers. When encrypting such real numbers, we convert them into integers by multiplying them by a magnitude that is a power of 10 and then rounding the products to integers. After the decryption, we divide the numbers by the magnitude to recover the original numbers. This is the only step that causes accuracy loss in this work.

Clearly, the larger the magnitude we use, the smaller the accuracy loss. In Table 1 we show how the errors, which represent the differences between the results of the privacy preserving approach and the results of the ideal centralized approach, decrease as the magnitude increases. There are two kinds of errors: the probability errors and the classification errors. We define a probability error $e_p$ to be the relative error between a probability $p_1$ calculated with the privacy preserving approach and a probability $p_2$ calculated with the centralized approach, such that $e_p = |p_2 - p_1| / p_2$. We obtain such an error for each prior probability and transition probability computed in the Markov models. The probability errors in Table 1 are the averages of these errors, calculated for each dataset and for each magnitude. The classification error is defined as $e_c = n_e / n$, where $n_e$ is the number of test sequences assigned to different classes by the two approaches and $n$ is the total number of test sequences.

Since each test sequence is assigned to the class with the highest probability, the comparisons among probabilities, rather than the values of the probabilities themselves, matter most in the classification. Thus the classification result is not much affected by the probability errors. We make the following observation from the experimental results in Table 1: although there are probability errors, the classification results are always correct on these two datasets.

Table 1: Errors in the Probabilities and the Classification Results
Magnitude    Probability Error          Classification Error
             Dataset 1    Dataset 2     Dataset 1    Dataset 2
...          ...          ...           ...          ...

Table 1 shows that as the magnitude becomes larger, the probability errors become smaller. For people who want to learn perfect models, a very large magnitude may seem a good choice. But a large magnitude also causes high computation cost and long training time, so we need to find a balance between accuracy and efficiency. Table 2 presents how the running time increases with the magnitude. We obtain these durations by training a first order Markov model for one class on each of the two datasets.

The running time is affected not only by the magnitude but also by the order of the Markov models.
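The effect of the magnitude on accuracy discussed in this section can be reproduced in miniature by simulating the share arithmetic of the secure logsum protocol in the clear. The sketch below is an illustration under stated assumptions, not the authors' implementation: the plain integer dot product stands in for the secure scalar product protocol, both parties are assumed to scale their vectors by the same magnitude, the blinding value `q` is fixed so that the run is repeatable, and the function names are hypothetical.

```python
import math


def logsum_shares(x, y, magnitude, q):
    """Return shares (s_a, s_b) with s_a + s_b ~ log10(sum_i 10^(x_i + y_i))."""
    # Party A blinds its vector with q, scales it, and rounds to integers.
    u = [round(10 ** (xi - q) * magnitude) for xi in x]
    # Party B scales and rounds its own vector.
    v = [round(10 ** yi * magnitude) for yi in y]
    # Stand-in for the secure scalar product (exact integer arithmetic).
    phi_scaled = sum(ui * vi for ui, vi in zip(u, v))
    # Both vectors were scaled, so the product carries magnitude squared.
    phi = phi_scaled / magnitude ** 2
    s_b = math.log10(phi)     # share of party B
    s_a = q                   # share of party A
    return s_a, s_b


def log_of_sum(a, b, magnitude, q=0.37):
    """Approximate log10(a + b) where a is held by A and b by B."""
    x = [math.log10(a), 0.0]  # vector of party A
    y = [0.0, math.log10(b)]  # vector of party B
    s_a, s_b = logsum_shares(x, y, magnitude, q)
    return s_a + s_b


if __name__ == "__main__":
    exact = math.log10(30 + 12)
    for magnitude in (10, 10 ** 3, 10 ** 6):
        approx = log_of_sum(30, 12, magnitude)
        print(magnitude, abs(approx - exact) / exact)
```

Running the loop shows the relative error shrinking as the magnitude grows, which matches the trend the section describes; the specific error values depend on the assumed `q` and counts and are not the paper's measurements.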
When the value of k increases, the training time of a Markov model of order k also increases. Table 3 shows the training time of a Markov model of order k for k = 1 and k = 2. For each k, the total training time of a model is $t$; the time spent on the secure logsum calculations is denoted by $t_l$ and the time spent on other communications is denoted by $t_c$. All the times in Table 3 are obtained when the magnitude is set to

Table 2: Running Time Affected by the Magnitude. Columns: Magnitude; Running Time (Dataset 1, Dataset 2). Dataset 2 running times, one per magnitude row: 70s, 72s, 84s, 208s.

Table 3: Running Time Affected by the Order
Dataset | Order 1: $t_l$, $t_c$, $t$ | Order 2: $t_l$, $t_c$, $t$
Dataset 1 | 42s, 34s, 77s | 830s, 672s, 1503s
Dataset 2 | 50s, 34s, 84s | 841s, 673s, 1515s

Table 3 shows that, when training a Markov model, the time spent on the secure logsum calculations and other communications, $t_l + t_c$, dominates the total time $t$. The time of the local operations, such as the parties computing the counts in their own data, is negligible. Hence the training time depends little on the size of the training data and mainly on the value of k and the size of the alphabet m. This is because the number of secure logsum calculations and communications, which dominates the overall time, is determined by the number of probabilities to be computed, including the prior probabilities and the transition probabilities. As can be seen in Section 4, the calculation of each probability requires one secure logsum computation and some communications between the parties.

The number of probabilities to be computed is determined by k and m. The number of prior probabilities for a Markov model of order k is $m^k$, because we need to compute a prior probability for each k-gram and the number of all possible k-grams is $m^k$. The number of transition probabilities is $m^k \cdot m = m^{k+1}$, because there is a transition probability for every pair of a k-gram and a state. When the value of k increases, the number of probabilities increases exponentially. Denote the number of probabilities when the value of k is i by $np_i$; then $np_{i+1} = np_i \cdot m$.
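As a worked check of this count, the following sketch computes the number of probabilities for the 20-letter amino acid alphabet and compares the predicted growth factor with the timings reported in Table 3. The function name and the dictionary layout are hypothetical; the timings are copied from Table 3.

```python
def num_probabilities(m, k):
    """Number of probabilities for an order-k Markov model over m symbols."""
    priors = m ** k            # one prior probability per k-gram
    transitions = m ** k * m   # one transition per (k-gram, state) pair
    return priors + transitions


m = 20  # size of the amino acid alphabet
np_1 = num_probabilities(m, 1)
np_2 = num_probabilities(m, 2)
print(np_1, np_2, np_2 // np_1)  # prints: 420 8400 20

# (t_l, t_c, t) in seconds, copied from Table 3.
table3 = {
    "Dataset 1": {1: (42, 34, 77), 2: (830, 672, 1503)},
    "Dataset 2": {1: (50, 34, 84), 2: (841, 673, 1515)},
}

# Ratio of each order-2 time to its order-1 counterpart.
ratios = {
    name: [t2 / t1 for t1, t2 in zip(times[1], times[2])]
    for name, times in table3.items()
}
for name, values in ratios.items():
    print(name, [f"{r:.1f}" for r in values])
```

The predicted factor is exactly m = 20, and the measured ratios fall somewhat below it, between roughly 17 and 20.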
Thus the training time when k = i + 1 should also be about m times the training time when k = i. Table 3 supports this conclusion: all the times $t_l$, $t_c$ and $t$ for order 2 are around 20 times their counterparts for order 1, where 20 is the size of the amino acid alphabet shared by the two datasets. On the other hand, when the value of k is fixed, the difference between the times taken to train a model on the two datasets is not significant, even though the sizes of the two datasets are very different. This is because the size of the data does not affect the running time as much as the size of the alphabet does.

From the above discussion, we can see that it is not affordable for ordinary computers to train a Markov model of order k when k is very large. This problem lies not only in the privacy preserving solution but also in the original Markov model of order k [2]. Fortunately, the Markov model of order k can give decent accuracy even when k is very small.

6. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a method that enables two parties to securely train Markov models of order k on the union of their sequence data without revealing either party's information to the other. We evaluated the method on two real-world datasets and showed that the information loss in our privacy preserving algorithm is very low. We also analyzed the running time of the algorithm. Although we focus on the sequence classification task here, the proposed privacy preserving Markov model method can be extended and used in other fields, and this will be our future work.

7. ACKNOWLEDGMENTS

The materials published in this paper are partially supported by the National Science Foundation under Grants No No and No

REFERENCES

[1] R. Agrawal and R. Srikant. Privacy-Preserving Data Mining.
[2] C. Andorf, A. Silvescu, D. Dobbs, and V. Honavar. Learning classifiers for assigning protein sequences to gene ontology functional families. In Proceedings of the Fifth International Conference on Knowledge Based Computer Systems (KBCS).
[3] D. Boneh. The Decision Diffie-Hellman Problem, volume 1423. Springer-Verlag.
[4] K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM '05, Washington, DC, USA. IEEE Computer Society.
[5] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2).
[6] I. Damgard, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally Secure Constant-Rounds Multi-party Computation for Equality, Comparison, Bits and Exponentiation, volume 3876. Springer.
[7] I. Damgard, M. Geisler, and M. Kroigard. Homomorphic encryption and secure comparison. International Journal of Applied Cryptography, 1.
[8] W. Du and M. Atallah. Privacy-Preserving Cooperative Statistical Analysis, page 102. IEEE Computer Society.
[9] W. Du, Y. Y. S. Han, and S. Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification, volume 233. Lake Buena Vista, Florida.
[10] W. Du and Z. Zhan. Building decision tree classifier on private data. Reproduction.
[11] T. ElGamal. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Transactions on Information Theory, 31(4).
[12] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikäinen. On private scalar product computation for privacy-preserving data mining. Science, 3506.
[13] O. Goldreich. Foundations of Cryptography, volume 1. Cambridge University Press.
[14] S. Han and W. K. Ng. Privacy-preserving linear fisher discriminant analysis. In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '08, Berlin, Heidelberg. Springer-Verlag.
[15] S. Han, W. K. Ng, and P. S. Yu. Privacy-preserving singular value decomposition. IEEE 25th International Conference on Data Engineering.
[16] G. R. Heer. A bootstrap procedure to preserve statistical confidentiality in contingency tables. In Proceedings of the International Seminar on Statistical Confidentiality.
[17] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05.
[18] G. Jagannathan and R. N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. ACM.
[19] S. Jha, L. Kruger, and V. Shmatikov. Towards practical privacy for genomic computation. IEEE Symposium on Security and Privacy (SP 2008).
[20] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16(9).
[21] A. Kumar and L. Cowen. Augmented training of hidden Markov models to recognize remote homologs via simulated evolution. Bioinformatics, 25(13).
[22] P. Lin and K. S. Candan. Access-private outsourcing of Markov chain and random walk based data analysis applications. In Proceedings of the 22nd International Conference on Data Engineering Workshops.
[23] Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3).
[24] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology.
[25] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala. A novel knowledge-based approach to design inorganic-binding peptides. Bioinformatics, 23(21).
[26] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. Computer, 1592.
[27] P. Smaragdis and M. Shashanka. A framework for secure speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4).
[28] Z. Teng and W. Du. A hybrid multi-group privacy-preserving approach for building decision trees. In Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '07, Berlin, Heidelberg. Springer-Verlag.
[29] J. Vaidya and C. Clifton. Privacy-preserving outlier detection, volume 41. IEEE.
[30] J. Vaidya, W. Lafayette, and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. Security.
[31] L. Wan, W. K. Ng, S. Han, and V. C. S. Lee. Privacy-preservation for gradient descent methods. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07.
[32] S. Zhong. Privacy-preserving algorithms for distributed mining of frequent itemsets. Information Sciences, 177(2).
