Privacy-Preserving Bayesian Network Learning From Heterogeneous Distributed Data


Jianjie Ma and Krishnamoorthy Sivakumar
School of EECS, Washington State University, Pullman, WA, USA

Abstract. In this paper, we propose a post randomization technique to learn a Bayesian network (BN) from distributed heterogeneous data in a privacy-sensitive fashion. In this setting, two or more parties own sensitive data but want to learn a Bayesian network from the combined data. We consider both structure and parameter learning for the BN. The only information required from the data set is a set of sufficient statistics for learning both the network structure and the parameters. The proposed method estimates these sufficient statistics from the randomized data; the estimated sufficient statistics are then used to learn a BN. For structure learning, we face the familiar extra-link problem, since estimation errors tend to break the conditional independence among the variables. We propose modifications of the score functions used for BN learning to solve this problem. We show both theoretically and experimentally that post randomization is an efficient, flexible, and easy-to-use method to learn a Bayesian network from privacy-sensitive data.

Index Terms: Privacy Preserving Data Mining, Bayesian Network, Post Randomization

I. INTRODUCTION

Privacy-preserving data mining deals with the problem of building accurate data mining models over aggregate data while protecting privacy at the level of individual records. There are two main approaches to privacy-preserving data mining. One approach is to perturb or randomize the data before sending it to the data miner; the perturbed or randomized data are then used to learn or mine the models and patterns [1]. Evfimievski et al. [5] proposed a select-a-size randomization technique for privacy-preserving mining of association rules. Du et al. [4] suggested using randomized response techniques for privacy-preserving data mining and constructed decision trees from randomized data. The other approach is to use secure multiparty computation (SMC) to enable two or more parties to build data models without any party learning anything about the other parties' data [9]. Though the SMC approach is appealing in its generality and simplicity, specific and efficient protocols have to be developed for data mining purposes, since generic SMC is inefficient for data mining applications [3]. All currently available SMC techniques are based on a semi-honest model.

A. Related Work

Privacy-preserving Bayesian network learning is a more recent topic. Wright and Yang [15] discuss privacy-preserving Bayesian network structure computation on distributed heterogeneous data, while Meng et al. [12] have considered the privacy-sensitive Bayesian network parameter learning problem. The underlying method in both works is to convert the computations required for BN learning into a series of inner product computations and then to use a secure inner product computation method proposed elsewhere: an SMC-based method is used in [15], whereas [12] uses a method based on the random projection technique proposed in [10]. The number of secure computation operations increases exponentially with the number of possible configurations of the problem variables. The accuracy of the structure learned using [15] is also not clear, since their method is based on an approximated score function, which might cause erroneous links.
Our experiments show that the extra-link problem is severe even with small estimation errors. The existing literature on privacy-preserving Bayesian network learning focuses on multiparty models. In addition to this model, our paper also considers a model in which a data miner actually does all the computations and learning for the participating parties. SMC-based methods have two drawbacks: (a) an unrealistic semi-honest model assumption, and (b) large volumes of cooperative or synchronized computations among the parties involved, most of which are overhead due to the privacy requirement. Other related works include [4], [5], [14], all of which consider the case where there is a data miner who does all the learning. In [14], [5], the focus is on association rule mining: [5] uses a select-a-size randomization, while Rizvi and Haritsa [14] proposed a randomization scheme called MASK, based on a simple probabilistic distortion of user data. Post randomization gives a general framework for randomizing categorical data after the data are collected; select-a-size randomization can be considered a special, intelligently designed post randomization, and MASK [14] is a post randomization technique for binary variables. In [14], [5] the data is randomized by record, which effectively randomizes all variables simultaneously and thereby introduces unnecessary randomness. The proposed post randomization method provides a more flexible way to cope with situations where different variables have different privacy requirements.

[Fig. 1. (Left) Binary randomization. (Right) Ternary symmetric randomization.]

B. Our Contribution

This paper uses post randomization techniques to preserve privacy during Bayesian network learning. Gouweleeuw et al. [6] introduced post randomization for statistical databases for information disclosure control, where the technique has proved effective. We explore the possibility of using post randomization in privacy-preserving data mining, taking privacy-preserving Bayesian network learning as an example. We consider two privacy-preserving Bayesian network learning setups on distributed heterogeneous data, where different sets of variables are collected at different sites. We develop estimators for the frequency counters used in learning a Bayesian network (both structure and parameters), and expressions for their covariance, based on the randomized data. Our experiments show that post randomization is an efficient, flexible, and easy-to-use method to learn a Bayesian network from privacy-sensitive data. Using post randomization in privacy-preserving data mining overcomes the inherent drawbacks of the SMC method and provides reasonable privacy and accuracy: a malicious party in SMC may be able to obtain private information of other parties, whereas it can only obtain randomized data if post randomization has been applied. Post randomization also provides a general framework for randomizing categorical data in privacy-preserving data mining.

II. POST RANDOMIZATION AND ITS PRIVACY ANALYSIS

A. Post Randomization

Consider a data set D with a set of variables X_1, X_2, ..., X_n, where X_i takes discrete values from a set S_i whose cardinality is K_i. Post randomization for variable X_i is a (random) mapping R_i : S_i -> S_i, based on a set of transition probabilities p^i_{lm} = P(\tilde{X}_i = k_m | X_i = k_l), where k_m, k_l are in S_i and \tilde{X}_i denotes the (randomized) variable corresponding to X_i. The transition probability p^i_{lm} is the probability that a variable with original value k_l is randomized to the value k_m. Let P_i = {p^i_{lm}} denote the K_i x K_i matrix that has p^i_{lm} as its (l, m)th entry. The randomized data set is \tilde{D} = {\tilde{y}_1, ..., \tilde{y}_N}, where \tilde{y}_i is an instance of the randomized variables {\tilde{X}_1, ..., \tilde{X}_n}. For example, binary randomization can be used if the variable is binary, and ternary symmetric randomization is a choice if the variable is ternary; both are shown in Fig. 1. We can apply the same randomization scheme independently to all of the variables (uniform randomization of the data set). Alternatively, we can use a non-uniform randomization, where different post randomization schemes are applied to different variables, independently. For example, we can choose different randomization parameters p_1 and p_2 for two binary variables whose privacy requirements differ. Non-uniform randomization is effective when different variables require different levels of privacy; as a special case, it covers variables with no privacy requirement, which need not be randomized at all. From the above, if variable X_i takes K_i values (or categories), the dimension of P_i is K_i x K_i. With larger K_i, more randomization is introduced into variable X_i in general. This is good from a privacy point of view; however, the variances of the estimators of the frequency counters will also be larger for a given number of training samples.
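To make the operator concrete, the following minimal Python sketch (ours; the function names, the NumPy dependency, and the parameter values are illustrative assumptions, not part of the paper) applies the two schemes of Fig. 1 independently to each record of a column:

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_symmetric(p: float) -> np.ndarray:
    """Transition matrix that flips a binary value with probability p."""
    return np.array([[1.0 - p, p],
                     [p, 1.0 - p]])

def ternary_symmetric(p: float) -> np.ndarray:
    """Transition matrix that moves a ternary value to each of the other
    two categories with probability p (it stays put with probability 1 - 2p)."""
    return np.full((3, 3), p) + (1.0 - 3.0 * p) * np.eye(3)

def post_randomize(column: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Post-randomize one categorical column: a record with value k is
    replaced by a draw from row k of the transition matrix P."""
    return np.array([rng.choice(P.shape[1], p=P[k]) for k in column])

# Non-uniform randomization: different schemes for different variables.
X = rng.integers(0, 2, size=1000)                    # a binary variable
L = rng.integers(0, 3, size=1000)                    # a ternary variable
X_tilde = post_randomize(X, binary_symmetric(0.25))
L_tilde = post_randomize(L, ternary_symmetric(0.15))
```

A uniform randomization would simply reuse one matrix for every variable. In either case, the matrices used must accompany the randomized data, since the estimators of Section IV depend on them.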
One solution to this problem is to partition the K_i categories of variable X_i into several groups such that a value in one group can only be randomized to a value in the same group; the matrix P_i then becomes block diagonal. Post randomization can also be applied to several variables simultaneously. For example, the variables X_i and X_j can be randomized simultaneously according to transition probabilities P(\tilde{X}_i = l_1, \tilde{X}_j = l_2 | X_i = k_1, X_j = k_2). We can treat variables randomized simultaneously as a combined variable when estimating the frequency counters. Proper simultaneous randomization can avoid possible inconsistencies in the randomized data set that independent randomization might cause.

B. Privacy Analysis of Post Randomization

We consider the notion of privacy introduced by Evfimievski et al. [5] in terms of an amplification factor gamma. The gamma-amplification of [5] is proposed in a framework where every data record must be randomized with a factor less than gamma, to limit the privacy breach, before the data are sent to the data miner. In this paper, however, we use the amplification gamma purely as a worst-case quantification of privacy for a designed post randomization scheme. We first briefly review the notion of gamma-amplification in our context of post randomization. A post randomization operator for variable X_i with transition probability matrix P_i is at most gamma-amplifying for \tilde{X}_i = k if

    P_i(k_1, k) / P_i(k_2, k) <= gamma  for all k_1, k_2 in S_i,

where k is in S_i and |S_i| = K_i. A post randomization operator is at most gamma-amplifying for variable X_i if it is at most gamma-amplifying for every k in S_i. An upward rho_1-to-rho_2 privacy breach occurs when the posterior belief P(X_i = k' | \tilde{X}_i = k) >= rho_2 while the prior belief P(X_i = k') <= rho_1. A downward rho_1-to-rho_2 privacy breach occurs when the posterior belief P(X_i != k' | \tilde{X}_i = k) >= rho_2 while the prior belief P(X_i != k') <= rho_1. If the randomization operator is at most gamma-amplifying for X_i, revealing \tilde{X}_i will cause neither an upward nor a downward rho_1-to-rho_2 privacy breach if rho_2(1 - rho_1) / (rho_1(1 - rho_2)) > gamma. Clearly, the smaller the value of gamma, the better the worst-case privacy; ideally we would like gamma = 1. Interested readers can refer to [5] for a detailed discussion of gamma-amplification.
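The worst-case factor can be read directly off a transition matrix by maximizing the ratio of any two entries within each column. A small sketch (ours; the helper name is hypothetical):

```python
import numpy as np

def gamma_amplification(P: np.ndarray) -> float:
    """Worst-case amplification of a PRAM matrix, P[l, m] = Pr(X~=m | X=l):
    the largest ratio P[k1, k] / P[k2, k] over randomized categories k and
    original categories k1, k2 (infinite if a column mixes zero and
    non-zero entries)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = P[:, None, :] / P[None, :, :]   # ratios[k1, k2, k]
    return float(np.nanmax(ratios))

P = np.array([[0.75, 0.25],
              [0.25, 0.75]])                     # binary symmetric, p = 0.25
print(gamma_amplification(P))                    # 3.0
```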

For binary symmetric randomization (Fig. 1) with p_{01} = p_{10} = p <= 0.5, it is easy to see that the operator is at most gamma-amplifying for gamma = (1 - p)/p. For ternary symmetric randomization with parameter p <= 1/3, the amplification is at most gamma = (1 - 2p)/p. The at-most-gamma amplification provides a worst-case quantification of privacy: for a given gamma and prior belief rho_1, we can find a rho_2 satisfying rho_2(1 - rho_1)/(rho_1(1 - rho_2)) = gamma, and no privacy breach with posterior belief greater than rho_2 can occur. However, at-most-gamma amplification does not, by itself, describe how much privacy is preserved in general. Besides gamma, we therefore use K = min_k #{k' : P(\tilde{X}_i = k | X_i = k') > 0}, the minimum number of categories that can be randomized to a category k under the designed post randomization, where the minimum is taken over all categories of X_i. This K indicates the privacy preserved in general; it is similar to the K of K-anonymity [13], but in a probabilistic sense.

III. FRAMEWORK OF PRIVACY-PRESERVING BN LEARNING USING POST RANDOMIZATION

The problem of Bayesian network learning is to find a network G and corresponding parameters that best match a given training data set D = {y_1, y_2, ..., y_N}, where each record y_i is an instance of the variables {X_1, X_2, ..., X_n}. Each party in our case observes a subset of the variables {X_1, X_2, ..., X_n}. We note that all the information required for parameter and structure learning consists of the sufficient statistics N_{ijk} of each candidate structure (only the fixed structure G for parameter learning) computed from the training data D, where N_{ijk} is the number of records in which variable X_i is in its kth configuration and its parents Pa(X_i) are in their jth configuration. Therefore, the problem of (privacy-sensitive) Bayesian network learning is equivalent to the problem of calculating the sufficient statistics (in a privacy-sensitive manner). In this paper, we estimate those sufficient statistics from the randomized data. We consider the following two setups: (I) several parties want to learn a global Bayesian network but are concerned about the privacy of their individual data, corresponding to the multiparty model of SMC; (II) all parties send their randomized data to a data miner who does the learning.

A. Parameter Learning

For parameter learning, the structure G is assumed fixed and known to every party. For setup I, we use the definitions of cross variable and cross parent from [12]; sensitive cross parents are the only variables that need to be randomized in setup I. Learning parameters for setup I can be done as follows, for each party a_i:
(1) Randomize the sensitive cross parents belonging to party a_i according to their respective privacy requirements, using post randomization as described in Section II. Randomizations are done independently for each (combined) variable and each record.
(2) Send the randomized cross parents of party a_i that are needed by party a_j to party a_j, together with the probability transition matrices used.
(3) Learn the parameters of the local variables of party a_i. This step does not involve randomized data.
(4) Estimate the sufficient statistics N_{ijk} for each cross variable at site a_i, using the local data and the randomized parent data from other parties.
(5) Estimate the parameters of the cross variables using the estimated sufficient statistics.
(6) Share the parameters with all other parties.
In setup II, every party randomizes all of its sensitive variables according to their respective privacy requirements using post randomization (similar to the randomization of cross parents in setup I). The randomized data and the corresponding probability transition matrices are then sent to the data miner.
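In setup II the party-side work is just randomize-and-ship. A minimal sketch (ours; the dictionary layout and names are assumptions, and post_randomize is the helper from the Section II sketch):

```python
import numpy as np

def randomize_party_data(data: dict, schemes: dict):
    """Randomize every sensitive variable of one party independently.

    `schemes` maps a sensitive variable's name to its transition matrix;
    variables without an entry are not sensitive and pass through as-is.
    Both the randomized columns and the matrices are shipped to the miner,
    since the estimators of Section IV need the matrices.
    """
    randomized = {name: (post_randomize(col, schemes[name])
                         if name in schemes else col)
                  for name, col in data.items()}
    return randomized, schemes
```

Setup I differs only in which variables are randomized (the sensitive cross parents) and in the fact that steps (3)-(6) above are executed by the parties themselves.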
The data miner then estimates the sufficient statistics N_{ijk} and the parameters of each node X_i from the randomized data. The details of the estimation of the sufficient statistics N_{ijk}, and of parameter learning from the estimated sufficient statistics, are described in Sections IV and V, respectively.

B. Structure Learning

During the search for a BN structure (a directed acyclic graph, DAG) that best fits the data, we perform randomization and estimation of sufficient statistics as described in Section III-A for each candidate structure, treated as a fixed structure G. The estimated sufficient statistics are used to calculate the score of the candidate structure G, and we choose a structure with maximum score. We use the K2 algorithm to search for a DAG that approximately maximizes the score. The details of structure learning from randomized data using the K2 algorithm are presented in Section VI.

IV. ESTIMATION OF SUFFICIENT STATISTICS FROM RANDOMIZED DATA

From Section III, the problem of privacy-preserving Bayesian network learning decomposes into a series of estimations of the N_{ijk} for each node X_i and each candidate structure. The parents Pa(X_i) of node X_i are given by the (candidate) structure G. Consider the following general case: node X_i has cardinality K_i and Q parent nodes Pa(X_i) = {Pa_i(1), Pa_i(2), ..., Pa_i(Q)} in the candidate structure, where parent Pa_i(q) has cardinality K_{Pa_i(q)}. These variables can be arbitrarily vertically partitioned across parties in both setups, and the randomization of each (combined) variable may also group the categories of the variable. Our discussion below is independent of the specific partitioning and grouping of the variables. Because of simultaneous randomization, we have the following cases for estimating the N_{ijk} from the randomized data \tilde{D} (note that only variables belonging to the same party can be randomized simultaneously in both setups):
(a) X_i and all of its parents are randomized independently;
(b) some parents of X_i are randomized simultaneously;
(c) X_i is randomized simultaneously with some of its parents;
(d) X_i is randomized simultaneously with other variables (not its parents).
For cases (b) and (c), we can treat the simultaneously randomized variables as a combined variable, as the sketch below illustrates.
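The combined-variable reduction is mechanical: a value pair of two columns with cardinalities (K_a, K_b) is re-indexed as one category of a K_a * K_b-valued variable (a sketch under our own index convention):

```python
import numpy as np

def fuse_columns(a: np.ndarray, b: np.ndarray, card_b: int) -> np.ndarray:
    """Fuse two categorical columns into one combined variable: the pair
    (k, j) becomes category k * card_b + j, so the joint transition matrix
    of a simultaneous randomization is an ordinary PRAM matrix over the
    fused categories."""
    return a * card_b + b
```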

For example, if node X_i is randomized simultaneously with one of its parents Pa_i(1), then N_{ijk} is the number of records such that (X_i; Pa_i(1)) = (k; j_1), Pa_i(2) = j_2, ..., Pa_i(Q) = j_Q, where (X_i; Pa_i(1)) is a combined variable. Thus we can estimate the N_{ijk} from the randomized data by treating (X_i; Pa_i(1)) as a single variable with cardinality K_i * K_{Pa_i(1)}. For case (d), since N_{ijk} does not involve the variable randomized simultaneously with X_i, we can obtain the marginal transition probability matrix from the given transition probability matrix, which is defined for the combined variable. Hence, without loss of generality, we consider case (a) only. Suppose the transition probability matrices of X_i and its parents are P_i, P_{Pa_i(1)}, ..., P_{Pa_i(Q)}, respectively. The problem is to estimate the sufficient statistics N_{ijk} from the randomized data. We denote by Pa(X_i) the compound variable comprising all the parents of node X_i; Pa(X_i) takes J_i = prod_{q=1}^{Q} K_{Pa_i(q)} different values. The following notation is used in the sequel. A tilde denotes a quantity computed from the randomized data \tilde{D}, and a hat denotes an estimate of the corresponding quantity. N_{ijk} is as defined in Section III and N_{ij} = sum_{k=1}^{K_i} N_{ijk}, where K_i is the cardinality of variable X_i. N_i is the J_i K_i-dimensional vector of N_{ijk} values, that is, N_i = (N_{i11}, N_{i12}, ..., N_{i1K_i}, N_{i21}, ..., N_{iJ_iK_i})^t, where t denotes matrix transpose; N_i(l), for 1 <= l <= J_i K_i, is the number of records in which {X_i, Pa(X_i)} takes the lth configuration, where l = K_i(j - 1) + k for the j and k such that X_i is in its kth configuration and Pa(X_i) in its jth configuration. \tilde{N}_{ijk}, \tilde{N}_{ij}, and \tilde{N}_i are defined like N_{ijk}, N_{ij}, and N_i, but on the randomized data \tilde{D}. \hat{N}_{ijk}, \hat{N}_{ij}, and \hat{N}_i are the estimates of N_{ijk}, N_{ij}, and N_i, respectively. Given a training data set D with N records, a candidate structure G, and a randomization scheme characterized by the probability transition matrices P_i, P_{Pa_i(1)}, ..., P_{Pa_i(Q)}, we have the following theorems.

Theorem 1.
(a) E[\tilde{N}_i | D] = P^t N_i, where P = P_i (x) P_{Pa} and P_{Pa} = P_{Pa_i(1)} (x) P_{Pa_i(2)} (x) ... (x) P_{Pa_i(Q)}, with (x) denoting the Kronecker matrix product.
(b) Let Y^i_{ml} be the binomial random variable giving the number of records for which {\tilde{X}_i, \tilde{Pa}(X_i)} is in the lth configuration while {X_i, Pa(X_i)} is in the mth configuration. Then Y^i_{ml} ~ B(N_i(m), pi) with pi = P(m, l), the (m, l)th element of the matrix P defined in (a), where B denotes the binomial distribution. Moreover,

    Cov{Y^i_{m l_1}, Y^i_{n l_2}} = N_i(m) P(m, l_1)(1 - P(m, l_1))  if n = m, l_1 = l_2;
                                   -N_i(m) P(m, l_1) P(m, l_2)      if n = m, l_1 != l_2;
                                   0                                 if n != m.

(c) For l = 1, 2, ..., J_i K_i, \tilde{N}_i(l) = sum_{m=1}^{J_i K_i} Y^i_{ml}. Moreover, Cov{\tilde{N}_i | D} = sum_{l=1}^{J_i K_i} N_i(l) V_l, where V_l is a K_i J_i x K_i J_i covariance matrix whose (l_1, l_2)th element is

    V_l(l_1, l_2) = P(l, l_1)(1 - P(l, l_1))  if l_1 = l_2;
                    -P(l, l_1) P(l, l_2)       if l_1 != l_2.

Proofs are omitted here due to page limitations; interested readers can refer to a longer version of this paper [11] for details. The following theorem establishes the bias and variance of the estimator \hat{N}_i = (P^t)^{-1} \tilde{N}_i.

Theorem 2. \hat{N}_i = (P^t)^{-1} \tilde{N}_i is an unbiased estimator of N_i, and Cov{\hat{N}_i | D} = (P^{-1})^t Cov{\tilde{N}_i | D} P^{-1}, where P and Cov{\tilde{N}_i | D} are given in Theorem 1.

A binomial distribution B(n, p) can be approximated by a normal distribution N(np, np(1 - p)) when n is large. Since Bayesian network learning usually involves a relatively large sample size, the distribution of \tilde{N}_i(l) is well approximated by a normal distribution, by Theorem 1(c).
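Theorems 1 and 2 translate directly into code. The sketch below (ours; the helper names and the count layout are assumptions) inverts the combined transition matrix to undo the randomization in expectation. Conventions for the Kronecker product vary, so the factors are ordered here so that the child's category varies fastest in the count vector, matching the layout l = K_i(j - 1) + k defined above:

```python
import numpy as np
from functools import reduce

def estimate_counts(N_tilde: np.ndarray,
                    P_child: np.ndarray,
                    P_parents: list) -> np.ndarray:
    """Unbiased estimator of Theorem 2: N_hat = (P^t)^{-1} N_tilde.

    N_tilde holds the counts of (X_i, Pa(X_i)) configurations in the
    randomized data, with the child's category varying fastest within
    each parent configuration; the Kronecker factors match that layout.
    """
    P = reduce(np.kron, P_parents + [P_child])
    return np.linalg.solve(P.T, N_tilde)

# Binary child flipped with p = 0.25, one ternary parent with p = 0.15.
P_x = np.array([[0.75, 0.25], [0.25, 0.75]])
P_pa = np.full((3, 3), 0.15) + (1 - 3 * 0.15) * np.eye(3)
N_tilde = np.array([310., 95., 240., 80., 180., 95.])   # illustrative counts
N_hat = estimate_counts(N_tilde, P_x, [P_pa])
```

The covariance of Theorem 2 is obtained from the same matrix P and the expression in Theorem 1(c).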
By Theorem 2, \tilde{N}_i and \hat{N}_i can be approximated by J_i K_i-dimensional jointly normal random vectors; in particular, \hat{N}_i ~ N(N_i, Cov{\hat{N}_i | D}), where Cov{\hat{N}_i | D} is given by Theorem 2.

V. PARAMETER ESTIMATOR AND ITS DISTRIBUTION

The maximum likelihood (ML) estimate of a BN parameter from the estimated sufficient statistics is \hat{theta}^{ML}_{ijk} = \hat{N}_{ijk} / \hat{N}_{ij} = \hat{N}_{ijk} / sum_{k=1}^{K_i} \hat{N}_{ijk}, and the maximum a posteriori (MAP) estimate is \hat{theta}_{ijk} = (alpha_{ijk} + \hat{N}_{ijk}) / (alpha_{ij} + \hat{N}_{ij}), where the alpha_{ijk} come from the assumed Dirichlet prior for theta_{ij}, that is, P(theta_{ij} | G) = Dirichlet(alpha_{ij1}, ..., alpha_{ijK_i}). Here we use the ML estimator and analyze its performance; results for the MAP estimator can be obtained in a similar fashion. The estimated parameter is a ratio of two dependent (approximately normal) random variables with non-zero means. The exact distribution of the ratio W = X_1/X_2 of two normal variables with (X_1, X_2) ~ N(mu_1, mu_2, sigma_1^2, sigma_2^2, rho), for arbitrary means, variances, and correlation, is well known [7], but the general form is quite complicated. An approximation to the exact distribution when P(X_2 > 0) is close to 1, i.e., when mu_2/sigma_2 is large, is also given in [7]: Z = (mu_2 W - mu_1) / (sigma_1 sigma_2 a(W)) is approximately standard normal if mu_2/sigma_2 is large, where a(w) = (w^2/sigma_1^2 - 2 rho w/(sigma_1 sigma_2) + 1/sigma_2^2)^{1/2}. A Taylor series expansion of Z around mu_1/mu_2 then shows that the distribution of W can be approximated by a normal distribution with mean mu_1/mu_2 and variance (mu_1^2/mu_2^2)(sigma_1^2/mu_1^2 - 2 rho sigma_1 sigma_2/(mu_1 mu_2) + sigma_2^2/mu_2^2). We have \hat{theta}^{ML}_{ijk} = \hat{N}_{ijk} / \hat{N}_{ij}, and the jointly normal pair (\hat{N}_{ijk}, \hat{N}_{ij}) has (approximate) distribution N(mu_1, mu_2, sigma_1^2, sigma_2^2, rho) with mu_1 = N_{ijk} and mu_2 = N_{ij}; sigma_1, sigma_2, and rho can be obtained from Theorem 2. From Theorems 1 and 2, mu_2/sigma_2 is of the order of sqrt(N_{ij}) for a given data set and transition probability matrix P, so mu_2/sigma_2 is large even for relatively small sample sizes. Hence the simpler approximation of the ratio distribution works well for us. The normal approximation via the Taylor expansion is likewise applicable here, since \hat{N}_{ijk} <= \hat{N}_{ij}.
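Given the estimated count vector, both estimators are one reshape away (a sketch; the symmetric Dirichlet prior in map_parameters is our simplifying assumption):

```python
import numpy as np

def ml_parameters(N_hat: np.ndarray, K_i: int) -> np.ndarray:
    """theta_ijk = N_hat_ijk / N_hat_ij: reshape the count vector into a
    (parent configuration) x (child category) table and normalize rows."""
    counts = N_hat.reshape(-1, K_i)
    return counts / counts.sum(axis=1, keepdims=True)

def map_parameters(N_hat: np.ndarray, K_i: int, alpha: float = 1.0) -> np.ndarray:
    """MAP estimate under a symmetric Dirichlet(alpha, ..., alpha) prior:
    theta_ijk = (alpha + N_hat_ijk) / (K_i * alpha + N_hat_ij)."""
    counts = N_hat.reshape(-1, K_i) + alpha
    return counts / counts.sum(axis=1, keepdims=True)

theta_hat = ml_parameters(N_hat, K_i=2)   # N_hat from the previous sketch
```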

Hence, for a given data set D and probability transition matrix P, \hat{theta}^{ML}_{ijk} = \hat{N}_{ijk}/\hat{N}_{ij} can be approximated by a normal distribution with mean theta_{ijk} = N_{ijk}/N_{ij} and variance theta_{ijk}^2 (sigma_1^2/N_{ijk}^2 - 2 rho sigma_1 sigma_2/(N_{ijk} N_{ij}) + sigma_2^2/N_{ij}^2). This variance is of the order 1/N, since the sigma's are of the order sqrt(N) by Theorem 2.

VI. STRUCTURE LEARNING FROM RANDOMIZED DATA

The problem of learning a Bayesian network structure from sample data D is to find a network structure G that best matches D; our problem here is to learn the network structure from the randomized data \tilde{D}. We use the K2 algorithm for structure learning. K2 is a greedy search algorithm that searches for a DAG G that (approximately) maximizes a score function Score(D, G). The two score functions we discuss in this paper are the Bayesian score and the BIC/MDL score. The Bayesian score for a node is

    Score(X_i, Pa(X_i)) = prod_{j=1}^{J_i} [Gamma(alpha_{ij}) / Gamma(alpha_{ij} + N_{ij})] prod_{k=1}^{K_i} [Gamma(alpha_{ijk} + N_{ijk}) / Gamma(alpha_{ijk})].

The BIC/MDL score for a node is

    Score(X_i, Pa(X_i)) = sum_{k=1}^{K_i} sum_{j=1}^{J_i} N_{ijk} log(theta_{ijk}) - (log N / 2) #(X_i, Pa(X_i)),

where #(X_i, Pa(X_i)) is the number of parameters needed to represent P(X_i | Pa(X_i)). The decomposability of these score functions makes a single operation in the K2 algorithm the addition of one parent to a variable. The addition of a parent corresponds to two candidate structures G_1 and G_2. By comparing P_old = Score(\tilde{D}, X_i, Pa(X_i)) and P_new = Score(\tilde{D}, X_i, Pa(X_i) + {Z}), where Z is a new candidate parent for variable X_i, the K2 algorithm decides whether there is a link between the candidate parent Z and node X_i. We use the two sets of estimated sufficient statistics \hat{N}_{ijk} to calculate P_old and P_new; the estimation is done as described in Section IV for the structures G_1 and G_2, and the difference between P_old and P_new is caused only by the sufficient statistics associated with node X_i. The framework of structure learning for the two setups was discussed in Section III-B. One problem with structure learning is that estimation errors in the sufficient statistics tend to cause extra links: links that appear in the structure learned from the randomized data \tilde{D} but not in the structure learned from the original data D. Missing links, that is, links learned from the original data D but not from the randomized data \tilde{D}, happen only when the randomization is relatively large. The extra-link problem is not difficult to understand from the statistical definition of independence. For example, given sample data from two independent discrete random variables A and B, we conclude statistically that A and B are independent if P(A = i, B = j) = P(A = i)P(B = j) +/- epsilon for all i, j, where P(A = i, B = j), P(A = i), and P(B = j) are estimated from their respective relative frequencies in the sample and epsilon depends on the specific independence test used. If the samples of A and B are post-randomized, the relative frequencies can only be estimated from the available randomized samples. The estimation errors tend to cause P(A = i, B = j) > P(A = i)P(B = j) + epsilon or P(A = i, B = j) < P(A = i)P(B = j) - epsilon for some i and j; with the same independence test, we then tend to conclude that A and B are dependent. A similar argument holds for the conditional independences encoded by a Bayesian network structure. Our experiments show that extra links appear even for relatively small estimation errors.
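The effect is easy to reproduce in simulation (an illustrative sketch, ours): two independent binary variables are randomized, the joint counts are corrected with (P^t)^{-1} on each axis, and the gap between the corrected joint and the product of its marginals is typically noticeably larger than on the raw data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10_000, 0.25
P = np.array([[1 - p, p], [p, 1 - p]])

A = rng.integers(0, 2, n)                      # A and B are independent
B = rng.integers(0, 2, n)
A_t = np.where(rng.random(n) < p, 1 - A, A)    # binary symmetric PRAM
B_t = np.where(rng.random(n) < p, 1 - B, B)

def independence_gap(a, b, correct=False):
    """max_ij |P(A=i, B=j) - P(A=i) P(B=j)| from (optionally corrected) counts."""
    joint = np.histogram2d(a, b, bins=2)[0]
    if correct:                                # undo the randomization in expectation
        joint = np.linalg.inv(P.T) @ joint @ np.linalg.inv(P)
    pa, pb = joint.sum(1) / n, joint.sum(0) / n
    return np.abs(joint / n - np.outer(pa, pb)).max()

print(independence_gap(A, B), independence_gap(A_t, B_t, correct=True))
```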
Such estimation error in independence testing typically causes Score(C, {A, B}) > Score(C, {A}) even though C is actually independent of B given A; an extra link B -> C then usually results. The above discussion suggests that, to learn correct structures from randomized data, we should penalize complex structures. For the Bayesian score, we propose adding a parent only when P_new > eta * P_old, where eta is a suitable threshold (the scores are computed in log form and are negative, so an eta slightly below 1 tightens the acceptance bar). For the BIC/MDL score, we propose increasing the penalty term (description length); that is,

    Score_B(X_i, Pa(X_i)) = sum_{k=1}^{K_i} sum_{j=1}^{J_i} N_{ijk} log(theta_{ijk}) - C' (log N / 2) #(X_i, Pa(X_i))

for some C' > 1. Experiments show that the threshold eta and the constant C' depend on the level of randomization: the more the randomization, the larger the deviation of eta from 1 and the larger C' should be. The relationship between eta (or C') and the randomization, for the available samples, is a topic for further research; we believe optimal choices of eta and C' exist for a designed randomization scheme. A minimal sketch of both modifications follows.
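In this sketch (ours; the constants are illustrative, and the count guard reflects the fact that the inverted estimates of Section IV can be slightly negative):

```python
import numpy as np

def accept_parent(log_p_new: float, log_p_old: float, eta: float = 0.99) -> bool:
    """Modified K2 acceptance rule: add the candidate parent only when
    P_new > eta * P_old.  The log scores are negative, so eta slightly
    below 1 raises the bar and suppresses estimation-error extra links."""
    return log_p_new > eta * log_p_old

def bic_score(N_hat: np.ndarray, K_i: int, N: int, C: float = 1.0) -> float:
    """BIC/MDL score with the description-length penalty scaled by C
    (C = 1 recovers the usual score; C > 1 penalizes extra parents harder)."""
    counts = np.maximum(N_hat.reshape(-1, K_i), 1e-9)  # guard zero/negative estimates
    theta = counts / counts.sum(axis=1, keepdims=True)
    loglik = float((counts * np.log(theta)).sum())
    n_params = counts.shape[0] * (K_i - 1)
    return loglik - C * 0.5 * np.log(N) * n_params
```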

VII. EXPERIMENTAL RESULTS

We now present experimental results that demonstrate the accuracy of the BN learning algorithm at different levels of randomization. In setup I, fewer variables are randomized than in setup II, so the experimental results for setup I are always better than those for setup II; hence we present only the results for setup II here.

A. Parameter Learning: Non-uniform Randomization

In this experiment, we use the Bayesian network shown in Fig. 2, where the variables are distributed over three parties. All variables are binary except L and B, which are ternary. The conditional probabilities of the different nodes are also shown in the figure. 10,000 samples were generated from this Bayesian network to form the data set D.

[Fig. 2. Bayesian network for Experiment 1, with its conditional probability tables; the variables are distributed over three sites.]

The data set was then randomized according to the scheme described in Table I, where the variables T, S, and G were considered not sensitive and hence were not randomized. Note that this is a non-uniform randomization, with different levels of randomization for different variables; the corresponding at-most-gamma amplifications are also shown in Table I. K = 2 for the binary randomizations and K = 3 for the ternary randomizations.

[Table I. Randomization performed to the variables: A, D: binary symmetric, gamma = 3; L, B: ternary symmetric, gamma = 4.67; E: binary symmetric, gamma = 4; X: binary symmetric, gamma = 3; C, F: binary non-symmetric, gamma = 9.]

Table II shows the parameters of all nodes learnt from the randomized data using the algorithm of Section III for setup II. All values in the table are averages over 50 runs, with the corresponding standard deviations indicated in parentheses. It is clear from Table II that the proposed algorithms accurately learn the BN parameters.

[Table II. Mean and standard deviation (over 50 runs) of the parameters learnt from the randomized data.]

B. Parameter Learning: Uniform Randomization

In this experiment, we use a uniform randomization and test the accuracy of the BN parameters as a function of the randomization parameter p. We used the Bayesian network shown in Fig. 3, in which all nodes are binary, and we consider the parameters of node D, which has parents at a different site. 5,000 samples were generated from this Bayesian network, and binary symmetric randomization with parameter p was used to randomize the variables.

[Fig. 3. Bayesian network for Experiment 2, with binary nodes distributed over two sites.]

Fig. 4 (top) plots the estimated parameters \hat{P}(D | B, C) as a function of p (the mean of the estimates over 10 runs is plotted, along with the mean plus one standard deviation). It is clear from the figure that for randomization parameter p up to about 0.25 we can estimate the BN parameters with almost no error, and even for values up to p = 0.3 we get reasonably good parameter estimates. From a privacy perspective, p = 0.25 corresponds to an amplification factor of gamma = 0.75/0.25 = 3, with K = 2. We point out that this is achieved with 5,000 samples; under the same accuracy requirement, p can intuitively be pushed closer to 0.5 if more samples are available, and the closer p is to 0.5, the closer gamma is to 1. Another way to assess the performance of the algorithms is to determine the maximum level of randomization usable for a given accuracy. To that end, for a given level of required estimation accuracy, defined in terms of an absolute parameter estimation error threshold c, let p* be the smallest value of p (obtained by averaging over ten runs) for which the absolute estimation error exceeds c. Fig. 4 (bottom) shows the variation of p* as a function of the sample size N for c = 1%; note that c = 1% corresponds to a very small parameter estimation error.

[Fig. 4. (Top) Estimated parameters \hat{P}(D | B, C) vs. p. (Bottom) p* vs. N, for c = 1%.]
C. Structure Learning

In this experiment, we test the accuracy of BN structure learning from randomized data. 10,000 samples from the BN in Fig. 3 were used, and all variables were randomized using binary symmetric randomization with parameter p. The K2 algorithm, with threshold eta for the Bayesian score or penalty term C' for the BIC/MDL score, was used to learn the BN structure from the randomized data. We quantify the error in structure learning with two different error measures: (a) the sum of missing links and extra links, and (b) the KL distance between the joint probability of the learnt BN and that of the true BN. The latter incorporates errors in the structure as well as the parameters, and may therefore be the better measure; a brute-force computation of it is sketched below.
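Measure (b) can be computed by enumeration for networks of this size (a sketch, ours; the two networks are represented by callables that return the joint probability of a full configuration, which for a BN is the product of its CPT entries):

```python
import numpy as np
from itertools import product

def kl_joint(p_true, p_learnt, cards) -> float:
    """KL distance D(P_true || P_learnt) between the joint distributions
    of two Bayesian networks over the same variables, by enumerating
    every configuration; `cards` lists each variable's cardinality.
    p_learnt is assumed positive everywhere, e.g. via MAP smoothing."""
    kl = 0.0
    for config in product(*(range(k) for k in cards)):
        pt, pl = p_true(config), p_learnt(config)
        if pt > 0.0:
            kl += pt * np.log(pt / pl)
    return kl
```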

[Fig. 5. Structure learning using the Bayesian score, as a function of the randomization parameter p, for eta = 1, eta = 0.99, and a variable eta: (left) number of links in error; (right) KL distance.]

Fig. 5 (left) shows the number of links in error (the sum of missing links and extra links) as a function of the randomization parameter p, for the case of the Bayesian score; Fig. 5 (right) depicts a similar graph for the KL distance. Three different choices of the threshold were considered: eta = 1, eta = 0.99, and a variable eta depending on the randomization parameter p, chosen as follows: eta = 0.99 if p <= 0.05, eta = 0.98 if 0.05 < p <= 0.15, and eta = 0.97 if 0.15 < p <= 0.3. We have similar results for the BIC/MDL score with three different values of the penalty term: C' = 1, C' = 4, and C' = 8; due to page limitations, the BIC/MDL results are not presented here. We add that the structure error was always contributed by extra links, with just one exception (Bayesian score with variable eta), where we had one missing link. It can be seen from the graphs that a variable value of eta gives better results than a fixed value. For the Bayesian score, we can use a randomization level of p = 0.2 with little error in structure learning; from a privacy perspective, this corresponds to gamma = 4 with K = 2. The performance with the BIC/MDL score is even better: we can use a randomization level of p = 0.25 with little error in structure learning, which corresponds to gamma = 3. We also point out that this is achieved with a sample size of 10,000 points; with more samples, it is intuitive that we can still obtain small errors in structure learning with more randomization, i.e., with p closer to 0.5.

VIII. DISCUSSION AND CONCLUSIONS

We have proposed a post randomization technique to learn the structure and parameters of a Bayesian network from distributed heterogeneous data. Our method estimates the sufficient statistics from the randomized data, and these are subsequently used to learn the BN. For structure learning, we used a modified score function to deal with the familiar extra-link problem. Experimental results with different levels of randomization and different sample sizes show that our method is capable of accurately estimating the BN. We quantified the privacy of our randomization scheme using the concept of gamma-amplification, together with a measure K similar to that of K-anonymity, and showed that we obtain a fairly good level of privacy. We believe post randomization can easily be extended to many other privacy-preserving data mining applications whose computations also depend only on a set of sufficient statistics, such as decision tree learning. The randomization in our experiments was applied to individual variables; however, it can also be applied to combined variables, which might include all the variables of one party. Combining all the variables of one party into one combined variable might prevent privacy breaches between the variables at the same party.

IX. ACKNOWLEDGEMENTS

This work was supported by the United States National Science Foundation Grant IIS. We would like to thank Dr. Kargupta for many fruitful discussions and ideas.
REFERENCES
[1] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the SIGMOD Conference on Management of Data, May 2000.
[2] R. Chen, K. Sivakumar, and H. Kargupta, "Collective mining of Bayesian networks from distributed heterogeneous data," Knowledge and Information Systems Journal (accepted).
[3] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, "Tools for privacy preserving distributed data mining," ACM SIGKDD Explorations, 4(2):28-34, 2003.
[4] W. Du and Z. Zhan, "Using randomized response techniques for privacy-preserving data mining," in Proceedings of the 9th ACM SIGKDD, Washington, DC, USA, August 2003.
[5] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting privacy breaches in privacy preserving data mining," in Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003.
[6] J. M. Gouweleeuw, P. Kooiman, L. C. R. J. Willenborg, and P.-P. de Wolf, "Post randomisation for statistical disclosure control: theory and implementation," Journal of Official Statistics, 14(4), 1998.
[7] D. V. Hinkley, "On the ratio of two correlated normal random variables," Biometrika, 56(3), 1969.
[8] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "On the privacy preserving properties of random data perturbation techniques," in Proceedings of the IEEE International Conference on Data Mining, pages 99-106, Melbourne, FL, November 2003.
[9] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology - CRYPTO, pages 36-54, 2000.
[10] K. Liu, H. Kargupta, and J. Ryan, "Multiplicative noise, random projection, and privacy preserving data mining from distributed multi-party data," in communication, 2003.
[11] J. Ma and K. Sivakumar, "Privacy-preserving Bayesian network learning using post randomization," in preparation, 2006.
[12] D. Meng, K. Sivakumar, and H. Kargupta, "Privacy-sensitive Bayesian network parameter learning," in Proceedings of the Fourth IEEE International Conference on Data Mining, Brighton, UK, November 2004.
[13] L. Sweeney, "k-anonymity: a model for protecting privacy," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.
[14] S. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[15] R. Wright and Z. Yang, "Privacy preserving Bayesian network structure computation on distributed heterogeneous data," in Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.


More information

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population Chater 7 and s Selecting a Samle Point Estimation Introduction to s of Proerties of Point Estimators Other Methods Introduction An element is the entity on which data are collected. A oulation is a collection

More information

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem An Ant Colony Otimization Aroach to the Probabilistic Traveling Salesman Problem Leonora Bianchi 1, Luca Maria Gambardella 1, and Marco Dorigo 2 1 IDSIA, Strada Cantonale Galleria 2, CH-6928 Manno, Switzerland

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

STK4900/ Lecture 7. Program

STK4900/ Lecture 7. Program STK4900/9900 - Lecture 7 Program 1. Logistic regression with one redictor 2. Maximum likelihood estimation 3. Logistic regression with several redictors 4. Deviance and likelihood ratio tests 5. A comment

More information

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression Journal of Modern Alied Statistical Methods Volume Issue Article 7 --03 A Comarison between Biased and Unbiased Estimators in Ordinary Least Squares Regression Ghadban Khalaf King Khalid University, Saudi

More information

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014 Morten Frydenberg Section for Biostatistics Version :Friday, 05 Setember 204 All models are aroximations! The best model does not exist! Comlicated models needs a lot of data. lower your ambitions or get

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

An Analysis of TCP over Random Access Satellite Links

An Analysis of TCP over Random Access Satellite Links An Analysis of over Random Access Satellite Links Chunmei Liu and Eytan Modiano Massachusetts Institute of Technology Cambridge, MA 0239 Email: mayliu, modiano@mit.edu Abstract This aer analyzes the erformance

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables High-dimensional Ordinary Least-squares Projection for Screening Variables Xiangyu Wang and Chenlei Leng arxiv:1506.01782v1 [stat.me] 5 Jun 2015 Abstract Variable selection is a challenging issue in statistical

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

Collaborative Place Models Supplement 1

Collaborative Place Models Supplement 1 Collaborative Place Models Sulement Ber Kaicioglu Foursquare Labs ber.aicioglu@gmail.com Robert E. Schaire Princeton University schaire@cs.rinceton.edu David S. Rosenberg P Mobile Labs david.davidr@gmail.com

More information

1. Introduction. 2. Background of elliptic curve group. Identity-based Digital Signature Scheme Without Bilinear Pairings

1. Introduction. 2. Background of elliptic curve group. Identity-based Digital Signature Scheme Without Bilinear Pairings Identity-based Digital Signature Scheme Without Bilinear Pairings He Debiao, Chen Jianhua, Hu Jin School of Mathematics Statistics, Wuhan niversity, Wuhan, Hubei, China, 43007 Abstract: Many identity-based

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

arxiv: v2 [math.ac] 5 Jan 2018

arxiv: v2 [math.ac] 5 Jan 2018 Random Monomial Ideals Jesús A. De Loera, Sonja Petrović, Lily Silverstein, Desina Stasi, Dane Wilburne arxiv:70.070v [math.ac] Jan 8 Abstract: Insired by the study of random grahs and simlicial comlexes,

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

arxiv: v3 [physics.data-an] 23 May 2011

arxiv: v3 [physics.data-an] 23 May 2011 Date: October, 8 arxiv:.7v [hysics.data-an] May -values for Model Evaluation F. Beaujean, A. Caldwell, D. Kollár, K. Kröninger Max-Planck-Institut für Physik, München, Germany CERN, Geneva, Switzerland

More information

Slash Distributions and Applications

Slash Distributions and Applications CHAPTER 2 Slash Distributions and Alications 2.1 Introduction The concet of slash distributions was introduced by Kafadar (1988) as a heavy tailed alternative to the normal distribution. Further literature

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

A Simple Weight Decay Can Improve. Abstract. It has been observed in numerical simulations that a weight decay can improve

A Simple Weight Decay Can Improve. Abstract. It has been observed in numerical simulations that a weight decay can improve In Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson and R.P. Limann, eds. Morgan Kaumann Publishers, San Mateo CA, 1995,. 950{957. A Simle Weight Decay Can Imrove Generalization

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

Machine Learning: Homework 4

Machine Learning: Homework 4 10-601 Machine Learning: Homework 4 Due 5.m. Monday, February 16, 2015 Instructions Late homework olicy: Homework is worth full credit if submitted before the due date, half credit during the next 48 hours,

More information

Bayesian Spatially Varying Coefficient Models in the Presence of Collinearity

Bayesian Spatially Varying Coefficient Models in the Presence of Collinearity Bayesian Satially Varying Coefficient Models in the Presence of Collinearity David C. Wheeler 1, Catherine A. Calder 1 he Ohio State University 1 Abstract he belief that relationshis between exlanatory

More information

Lecture: Condorcet s Theorem

Lecture: Condorcet s Theorem Social Networs and Social Choice Lecture Date: August 3, 00 Lecture: Condorcet s Theorem Lecturer: Elchanan Mossel Scribes: J. Neeman, N. Truong, and S. Troxler Condorcet s theorem, the most basic jury

More information

Bilinear Entropy Expansion from the Decisional Linear Assumption

Bilinear Entropy Expansion from the Decisional Linear Assumption Bilinear Entroy Exansion from the Decisional Linear Assumtion Lucas Kowalczyk Columbia University luke@cs.columbia.edu Allison Bisho Lewko Columbia University alewko@cs.columbia.edu Abstract We develo

More information

Published: 14 October 2013

Published: 14 October 2013 Electronic Journal of Alied Statistical Analysis EJASA, Electron. J. A. Stat. Anal. htt://siba-ese.unisalento.it/index.h/ejasa/index e-issn: 27-5948 DOI: 1.1285/i275948v6n213 Estimation of Parameters of

More information

SUPER-GEOMETRIC CONVERGENCE OF A SPECTRAL ELEMENT METHOD FOR EIGENVALUE PROBLEMS WITH JUMP COEFFICIENTS *

SUPER-GEOMETRIC CONVERGENCE OF A SPECTRAL ELEMENT METHOD FOR EIGENVALUE PROBLEMS WITH JUMP COEFFICIENTS * Journal of Comutational Mathematics Vol.8, No.,, 48 48. htt://www.global-sci.org/jcm doi:.48/jcm.9.-m6 SUPER-GEOMETRIC CONVERGENCE OF A SPECTRAL ELEMENT METHOD FOR EIGENVALUE PROBLEMS WITH JUMP COEFFICIENTS

More information

GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION E. G. MANSOORI, M. J. ZOLGHADRI, S. D. KATEBI, H. MOHABATKAR, R. BOOSTANI AND M. H.

GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION E. G. MANSOORI, M. J. ZOLGHADRI, S. D. KATEBI, H. MOHABATKAR, R. BOOSTANI AND M. H. Iranian Journal of Fuzzy Systems Vol. 5, No. 2, (2008). 21-33 GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION E. G. MANSOORI, M. J. ZOLGHADRI, S. D. KATEBI, H. MOHABATKAR, R. BOOSTANI AND M. H. SADREDDINI

More information

Analysis of some entrance probabilities for killed birth-death processes

Analysis of some entrance probabilities for killed birth-death processes Analysis of some entrance robabilities for killed birth-death rocesses Master s Thesis O.J.G. van der Velde Suervisor: Dr. F.M. Sieksma July 5, 207 Mathematical Institute, Leiden University Contents Introduction

More information

Introduction to Probability for Graphical Models

Introduction to Probability for Graphical Models Introduction to Probability for Grahical Models CSC 4 Kaustav Kundu Thursday January 4, 06 *Most slides based on Kevin Swersky s slides, Inmar Givoni s slides, Danny Tarlow s slides, Jaser Snoek s slides,

More information

Coding Along Hermite Polynomials for Gaussian Noise Channels

Coding Along Hermite Polynomials for Gaussian Noise Channels Coding Along Hermite olynomials for Gaussian Noise Channels Emmanuel A. Abbe IG, EFL Lausanne, 1015 CH Email: emmanuel.abbe@efl.ch Lizhong Zheng LIDS, MIT Cambridge, MA 0139 Email: lizhong@mit.edu Abstract

More information

LINEAR SYSTEMS WITH POLYNOMIAL UNCERTAINTY STRUCTURE: STABILITY MARGINS AND CONTROL

LINEAR SYSTEMS WITH POLYNOMIAL UNCERTAINTY STRUCTURE: STABILITY MARGINS AND CONTROL LINEAR SYSTEMS WITH POLYNOMIAL UNCERTAINTY STRUCTURE: STABILITY MARGINS AND CONTROL Mohammad Bozorg Deatment of Mechanical Engineering University of Yazd P. O. Box 89195-741 Yazd Iran Fax: +98-351-750110

More information

On parameter estimation in deformable models

On parameter estimation in deformable models Downloaded from orbitdtudk on: Dec 7, 07 On arameter estimation in deformable models Fisker, Rune; Carstensen, Jens Michael Published in: Proceedings of the 4th International Conference on Pattern Recognition

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition TNN-2007-P-0332.R1 1 Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers

More information