LoPub: High-Dimensional Crowdsourced Data Publication with Local Differential Privacy

Xuebin Ren, Chia-Mu Yu, Weiren Yu, Shusen Yang, Xinyu Yang, Julie A. McCann, and Philip S. Yu

Abstract. High-dimensional crowdsourced data collected from numerous users produces rich knowledge for our society. However, it also brings unprecedented privacy threats to the participants. Local privacy, a variant of differential privacy, has been proposed to eliminate such privacy concerns. Unfortunately, achieving local privacy on high-dimensional crowdsourced data raises great challenges in terms of both computational efficiency and effectiveness. To this end, based on the Expectation Maximization (EM) algorithm and Lasso regression, we first propose efficient multi-dimensional joint distribution estimation algorithms that maintain local privacy. Then, we develop LoPub, a Locally privacy-preserving high-dimensional data Publication algorithm, by taking advantage of our distribution estimation techniques. In particular, both correlations and joint distributions among multiple attributes are identified to reduce the dimensionality of crowdsourced data, thus achieving both efficiency and effectiveness in high-dimensional data publication. To the best of our knowledge, this is the first work addressing high-dimensional crowdsourced data publication with local privacy. Extensive experiments on real-world datasets demonstrate that our multivariate distribution estimation scheme significantly outperforms existing estimation schemes in terms of both communication overhead and estimation speed, and confirm that our LoPub scheme can keep, on average, 80% and 60% accuracy over the published approximate datasets in terms of SVM and random forest classification, respectively.
Index Terms: Local privacy, high-dimensional data, crowdsourced data, data publication

1 INTRODUCTION

With the development of various integrated sensors and crowd sensing systems [19], crowdsourced information from all aspects can be collected and analyzed to better produce rich knowledge about the group, which can benefit everyone in the crowdsourced system [2]. Particularly, with multi-dimensional crowdsourced data (data with multiple attributes), a lot of potential information and patterns behind the data can be mined or extracted to provide accurate dynamics and reliable prediction for both groups and individuals. However, the participants' privacy can still be easily inferred or identified due to the publication of crowdsourced data [15], [33], especially high-dimensional data, even when some existing privacy-preserving schemes and end-to-end encryption are used. The reasons for privacy leaks are two-fold:

Non-local Privacy. Most existing solutions for privacy protection focus on centralized datasets under the assumption that the server is trusted. However, despite the privacy protection against difference and inference attacks from aggregate queries, an individual's data may still suffer from privacy leakage before aggregation because of the lack of local privacy [17], [7] on the user side.

Curse of High-dimensionality. With the increase of data dimensions, some existing privacy-preserving techniques like differential privacy [8], if straightforwardly applied to multiple attributes with high correlations, will become vulnerable [25], [35], thereby increasing the success ratio of many inference attacks like cross-checking. Even worse, according

X. Ren, S. Yang, and X. Yang are with Xi'an Jiaotong University. E-mails: {xb.ren@stu, shusenyang@mail, yxyphd@mail}.xjtu.edu.cn
C.-M. Yu is with National Chung Hsing University. E-mail: chimayu@gmail.com
W. Yu is with Imperial College London and Aston University. E-mails: weiren.yu@imperial.ac.uk, w.yu3@aston.ac.uk
J. McCann is with Imperial College London. E-mail:
j.mccann@imperial.ac.uk
P. Yu is with University of Illinois at Chicago. E-mail: psyu@uic.edu

to the composition theorem [26], differential privacy degrades exponentially when multiple correlated queries are processed. In addition to this privacy vulnerability, the large scale of the various data records collected from many distributed users can exacerbate the inefficiency of data processing. Especially in IoT applications, the ubiquitous but resource-constrained sensors require extremely high efficiency and low overhead. For example, privacy-preserving real-time pricing mechanisms require not only effective privacy guarantees for individuals' electricity usage but also fast response to the dynamic changes of demand and supply in the smart grid [24]. Thus, it is important to provide an efficient privacy-preserving method to publish crowdsourced high-dimensional data.

Contributions. To address the above concerns, this paper makes the following contributions.

We are the first to address the problem of high-dimensional crowdsourced data publication with local privacy, to the best of our knowledge.

We propose a locally privacy-preserving scheme for crowdsensing systems to collect and build high-dimensional data from distributed users. Particularly, differential privacy is directly achieved for each distributed user. Then, based on EM and Lasso regression, we propose efficient algorithms for multivariate joint distribution estimation.

By taking advantage of specific marginal distributions from the locally privacy-preserved data after dimensionality and sparsity reduction, we propose the LoPub solution that can generate an approximation of the original crowdsourced data with the guarantee of local privacy.

We implemented and evaluated our schemes on real-world datasets. Experimental results confirm the efficiency and effectiveness of our proposed distribution estimation and data release mechanisms.
Due to the page limit, some detailed examples and explanations that are not presented in this paper can be found in our full-length preprint technical report [28].

Fig. 1: Main procedures of high-dimensional data publishing with non-local (ǫ = ǫ1 + ǫ2) privacy

2 RELATED WORK

2.1 Privacy in Centralized Setting

Differential privacy [8] forms a mathematical foundation for privacy protection by imposing proper randomness on statistical query results. Examples of the use of differential privacy include privacy-preserving data aggregation, where the differential privacy of individuals can be guaranteed by injecting carefully-calibrated Laplacian noise [5], [13], [18], [22], [35]. For privacy-preserving low-dimensional data publication, to show crowd statistics and draw the correlations between attributes, both the differentially privacy-preserving histogram (univariate distribution) [3] and contingency table [27] are widely investigated. However, the techniques for non-interactive differential privacy [9], [1] in these works suffer from the curse of dimensionality [35], [5]. Particularly, the composition theorems [26] have pointed out that the privacy levels degrade when multiple related queries are processed. To deal with the correlations in high-dimensional data, different schemes (e.g., approximations via low-dimensional data clusters) have been proposed [5], [6], [18], [21], [32], [35]. Among them, the state-of-the-art scheme [5] proposed to reduce the dimension by using a junction tree to model the correlations. Moreover, Su et al. [31] proposed a multi-party setting to publish a synthetic dataset from multiple data curators. However, their multi-party computation can only protect privacy between data servers, and individuals' local privacy cannot be guaranteed. Due to the lack of a local privacy guarantee, these works, as summarized in Figure 1, may be exposed to insider attackers, and thus cannot be directly applied to crowdsourced systems.

2.2 Privacy in Distributed Setting

The schemes mentioned above mainly deal with centralized datasets. Nonetheless, there could be scenarios where distributed users contribute to the aggregate statistics.
Despite the privacy protection against difference and inference attacks from aggregate queries, an individual's data may also suffer from privacy leakage before aggregation [11]. Hence, local privacy [7], [16], [17] has been proposed to provide privacy guarantees for distributed users. In addition, local privacy from the end user can ensure the consistency of the privacy guarantees when there are multiple accesses to users' data, in contrast to non-local privacy schemes that have to properly split and assign privacy budgets to different steps [5], [21], [35]. In existing work [15], [12], [14], local privacy is implemented with the randomized response technique [34]. However, the correlations and sparsity in high-dimensional data are not well considered, which causes low scalability and utility for high-dimensional data [25], [35].

Fig. 2: An architecture of distributed high-dimensional private data collection and publication

Different from these works, we propose a novel mechanism to publish high-dimensional crowdsourced data with local privacy for individuals. We compare our work with three similar existing solutions in Table 1. More specifically, our method has lower communication costs and lower time and storage complexity, compared to state-of-the-art approaches.

TABLE 1: Comparison of LoPub with existing methods

Comparison        | LoPub (ours)  | RAPPOR [12]   | EM [14]       | JTree [5]
Local privacy     | Y             | Y             | Y             | N
High dimension    | Y             | N             | N             | Y
Communication     | O(Σ_j |Ω_j|)  | O(Π_j |Ω_j|)  | O(Π_j |Ω_j|)  | -
Time complexity   | Low           | Large         | Large         | -
Space complexity  | Low           | Large         | Large         | -

|Ω_j| is the domain size of the j-th dimension.

3 SYSTEM MODEL

Our system model is depicted in Figure 2, where a number of users and a central server constitute a crowdsourcing system. The users generate multi-dimensional data records, and then send these data to the central server.
The server gathers all the data and estimates the high-dimensional crowdsourced data distribution with local privacy, aiming to release a privacy-preserving dataset to third parties for conducting data analysis. In this paper, we mainly focus on data privacy, and thus the detailed network model is omitted.

Problem Statement. Given a collection of data records with d attributes from different users, our goal is to help the central server publish a synthetic dataset that has the approximate joint distribution of the d attributes with local privacy. Formally, let N be the total number of users (i.e., data records [1]) and sufficiently large. Let X = {X^1, X^2, ..., X^N} be the crowdsourced dataset, where X^i denotes the data record from the ith user. We assume that there are d attributes A = {A_1, A_2, ..., A_d} in X. Then each data record X^i can be represented as X^i = {x^i_1, x^i_2, ..., x^i_d}, where x^i_j denotes the jth element of the ith user's record. For each attribute A_j (j = 1, 2, ..., d), we denote Ω_j = {ω_j^1, ω_j^2, ..., ω_j^{|Ω_j|}} as the domain of A_j, where ω_j^i is the ith possible attribute value of A_j and |Ω_j| is the cardinality of Ω_j. With the above notations, our problem can be formulated as follows. Given a dataset X with local privacy, we aim to release an approximate dataset X' with the same attributes A and N user records such that

P_X(A_1 ... A_d) ≈ P_X'(A_1 ... A_d),   (1)

[1] For brevity, we assume that each user sends only one data record to the central server.

where P_X(A_1 ... A_d) ≜ P_X(x^i_1 = ω_1, ..., x^i_d = ω_d), for i = 1, ..., N and ω_1 ∈ Ω_1, ..., ω_d ∈ Ω_d, is defined as the d-dimensional joint distribution on X.

To focus our research on data privacy, we assume that the central server and users are all honest-but-curious, in the sense that they will honestly follow the protocols in the system without maliciously manipulating their received data. However, they may be curious about others' data and may even collude to infer others' data. In addition, the central server and users share the same public information, such as the privacy-preserving protocols (including the hash functions used).

4 PRELIMINARIES

4.1 Differential Privacy

Differential privacy is the de facto standard for providing privacy guarantees [8]. It limits the adversary's ability to infer the participation or absence of any user in a dataset by adding carefully calibrated noise (e.g., Laplacian noise [8]) to query results. An algorithm M is ǫ-differentially private if for all neighboring datasets D_1 and D_2 that differ on a single element (e.g., the data of one person), and all subsets S of the image of M,

Pr[M(D_1) ∈ S] ≤ e^ǫ Pr[M(D_2) ∈ S],   (2)

where ǫ is the privacy budget that specifies the level of privacy protection, and a smaller ǫ means better privacy. According to the composition theorem [29], an extra privacy budget will be required when multiple related queries are sequentially applied to differential privacy mechanisms.

4.2 Local Differential Privacy

Generally, differential privacy research focuses on centralized databases and implicitly assumes a trusted server. Aiming to eliminate this assumption, local differential privacy (or simply local privacy) has been proposed for crowdsourced systems to provide a stringent privacy guarantee in which data contributors trust no one [7], [17].
In particular, for any user i, a mechanism M satisfies ǫ-local privacy if for any two data records X^i, Y^i ∈ Ω_1 × ... × Ω_d, and for any possible privacy-preserving output X̂^i ∈ Range(M),

Pr[M(X^i) = X̂^i] ≤ e^ǫ Pr[M(Y^i) = X̂^i],   (3)

where the probability is taken over M's randomness, and ǫ has a similar impact on privacy as in ordinary differential privacy (Equation (2)).

The simplest form of local privacy is the randomized response [34], which has been widely used in surveys of people's "yes" or "no" opinions about a private issue. Participants of the survey are required to give their true answers with a certain probability, or random answers with the remaining probability. Due to the randomness, the surveyor cannot determine an individual's true answer (i.e., local privacy is guaranteed) but can still predict the true proportions of the alternative answers. Recently, RAPPOR has been proposed for statistics aggregation [12]. The basic idea of RAPPOR is to extend the randomized response technique via long binary strings that uniquely represent an arbitrary domain. However, it is not directly applicable to multi-dimensional data with a large domain size, since the binary strings grow exponentially in length with the number of dimensions. To address this problem, Fanti et al. [14] propose an association learning scheme, which extends the 1-dimensional RAPPOR to estimate the 2-dimensional joint distribution. However, the sparsity in the multi-dimensional domain and the way it iteratively scans RAPPOR strings mean that it incurs considerable computational complexity.

5 LOPUB: HIGH-DIMENSIONAL DATA PUBLICATION WITH LOCAL PRIVACY

We propose LoPub, a novel solution to achieve high-dimensional crowdsourced data publication with local privacy. In this section, we first introduce the basic idea behind LoPub and then elaborate the algorithmic procedures in more detail.
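Before detailing LoPub, the classical randomized response primitive it builds on (Section 4.2) can be sketched in a few lines of Python. This is our own illustration, not the authors' code: the truth probability p, the population rate, and the sample size are all illustrative. The surveyor never learns an individual's answer, yet recovers the aggregate proportion by inverting the known randomization bias.

```python
import random

def randomized_response(truth: bool, p: float) -> bool:
    """Answer truthfully with probability p, otherwise answer uniformly at random."""
    if random.random() < p:
        return truth
    return random.random() < 0.5

def estimate_proportion(answers, p: float) -> float:
    """Invert the bias: E[observed 'yes' rate] = p*pi + (1 - p)/2."""
    observed = sum(answers) / len(answers)
    return (observed - (1 - p) / 2) / p

random.seed(0)
true_rate = 0.3  # illustrative true proportion of 'yes' in the population
answers = [randomized_response(random.random() < true_rate, p=0.6)
           for _ in range(200000)]
est = estimate_proportion(answers, p=0.6)
```

With 200,000 simulated respondents, the debiased estimate lands within a fraction of a percentage point of the true 30% rate, even though each individual answer is deniable.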
5.1 Basic idea

Privacy-preserving high-dimensional crowdsourced data publication aims at releasing an approximate dataset with statistical information similar to the source data (i.e., in terms of the statistical distribution defined in Equation (1)) while guaranteeing local privacy. This problem can be considered in four stages. First, to achieve local privacy, some local transformation should be deployed on the user side to cloak individuals' original data records. Then, the central server needs to obtain the statistical information, i.e., the distribution of the original data. There are two plausible solutions. One is to obtain the 1-dimensional distribution on each attribute independently. Unfortunately, the lack of consideration of correlations between dimensions will lose the utility of the original dataset. Another is to consider all attributes as one and compute the d-dimensional joint distribution. However, due to combinations, the possible domain will increase exponentially with the number of dimensions, thus leading to both low scalability and signal-to-noise-ratio problems [35]. Therefore, the next crucial problem is to find a solution for reducing the dimensionality while keeping the necessary correlations. Finally, with the statistical distribution information on the low-dimensional data, how to synthesize a new dataset is the remaining problem.

To this end, we present LoPub, a locally privacy-preserving data publication scheme for high-dimensional crowdsourced data. Figure 3 shows the overview of LoPub, which mainly consists of four mechanisms: local privacy protection, multi-dimensional distribution estimation, dimensionality reduction, and data synthesizing.

1) Local Privacy Protection. We first propose the local transformation process that adopts the randomized response technique to cloak the original multi-dimensional data records on distributed users, so as to provide local privacy for all individuals in the crowdsourced system.
Particularly, we locally transform each attribute value into a random bit string. Then, the locally privacy-preserved data is sent to and aggregated at the central server.

2) Multi-dimensional Distribution Estimation. We then propose multi-dimensional joint distribution estimation schemes to obtain both the joint and marginal probability distributions on multi-dimensional data. Inspired by [14], we first extend the EM-based approach for high-dimensional

Fig. 3: An overview of LoPub

distribution estimation. However, such a straightforward extension does not consider the sparsity in high-dimensional data, which will lead to high complexity for distribution estimation. To guarantee fast estimation, we then present a Lasso-based approach, at the cost of slight accuracy degradation. Finally, we propose a hybrid approach striking a balance between accuracy and efficiency.

3) Dimensionality Reduction. Based on the multi-dimensional distribution information, we then propose to reduce the dimensionality by identifying mutually correlated attributes among all dimensions and splitting the high-dimensional attributes into several compact low-dimensional attribute clusters. In this paper, considering the heterogeneous attributes, we adopt mutual information and an undirected dependency graph to measure and model the correlations of attributes, respectively. Then, we propose to split the attributes according to the junction tree built from the dependency graph. In addition, we also propose a heuristic pruning scheme to further boost the process of correlation identification.

4) Synthesizing the New Dataset. Finally, we propose to sample each low-dimensional dataset according to the connectivity of the attribute clusters and the estimated joint or conditional distribution on each attribute cluster, thus synthesizing a new privacy-preserving dataset.

5.2 Local Transformation for High-dimensional Data Records

5.2.1 Design Rationale

A common framework of locally private distribution estimation is that each individual user applies a local transformation on the data for privacy protection and then sends the transformed data to the server. The server estimates the joint distribution according to the transformed data. Local transformation in our design includes two key steps: one is mapping to Bloom filters and the other is adding randomness.
Particularly, Bloom filters over an attribute domain Ω with multiple hash functions can hash all the variables in the domain into a pre-defined space. Thus, the unique bit strings are the representative features of the original report. Then, after privacy protection by randomized responses, a large number of samples with various levels of noise are generated by individual users. After aggregation, the central server obtains a large sample space with random noise. As a result, one may estimate the distribution from the noised sample space by taking advantage of machine learning techniques such as the EM algorithm and regression analysis.

Under the above framework, a key observation can be made: if features are mutually independent, the combinations of features from different candidate sets are also mutually independent. Therefore, when the Bloom filters of each attribute are mutually independent (i.e., no collisions on any bits), the Cartesian products of Bloom filters of different attributes are also mutually independent. In this sense, with mutually independent Bloom filter features, existing machine learning techniques like EM and Lasso regression are effective for the multivariate distribution estimation. Some notations used in this paper are listed in Table 2.

TABLE 2: Notation

N: number of users (data records) in the system
X: entire crowdsourced dataset on the server side
X^i: data record from the ith user
x^i_j: jth element of X^i
d: number of attributes in X
R: set of all attribute clusters
A_j: jth attribute of X
Ω_j: domain of A_j
ω_j: candidate attribute value in Ω_j
H_j(x): hash functions for A_j that map x into a Bloom filter
s^i_j: Bloom filter of x^i_j (s^i_j = H_j(x^i_j))
s^i_j[b]: bth bit of s^i_j
ŝ^i_j: randomized Bloom filter of s^i_j
ŝ^i_j[b]: bth bit of ŝ^i_j
m_j: length of s^i_j
f: probability of flipping a bit of a Bloom filter

5.2.2 Algorithmic Procedures of Local Transformation

Before describing the distribution estimation, we present the details of the local transformation for high-dimensional crowdsourced data.
In essence, the local transformation consists of three steps:

1) For the ith user, we have an original data record X^i = {x^i_1, x^i_2, ..., x^i_d} with d attributes. For each attribute A_j (j = 1, ..., d), we employ h hash functions H_j(·) to map x^i_j to a length-m_j bit string s^i_j (called a Bloom filter); i.e., we calculate s^i_j = H_j(x^i_j), j = 1, ..., d.

2) Each bit s^i_j[b] (b = 1, 2, ..., m_j) in s^i_j is randomly flipped into 0 or 1 according to the following rule:

ŝ^i_j[b] = { s^i_j[b], with probability 1 − f;
             1,        with probability f/2;
             0,        with probability f/2,   (4)

where f ∈ [0, 1] is a user-controlled flipping probability that quantifies the level of randomness for local privacy.

3) After deriving the randomized Bloom filters ŝ^i_j (j = 1, ..., d), we concatenate ŝ^i_1, ..., ŝ^i_d to obtain a stochastic (Σ_{j=1}^d m_j)-bit vector,

[ŝ^i_1[1], ..., ŝ^i_1[m_1], ..., ŝ^i_d[1], ..., ŝ^i_d[m_d]]   (5)

and send it to the server. Detailed examples illustrating the above procedures can be found in [28].

Parameter Setup: According to the characteristics of the Bloom filter [3], given the false positive probability p and the number |Ω_j| of elements to be inserted, the optimal length m_j of the Bloom filter can be calculated as

m_j = (ln(1/p) / (ln 2)^2) · |Ω_j|.   (6)

Furthermore, the optimal number h_j of hash functions in the Bloom filter is

h_j = (m_j / |Ω_j|) · ln 2 = ln(1/p) / ln 2.   (7)

So, the optimal h = ln(1/p) / ln 2 is used for all dimensions.

Privacy Analysis: Because the local transformation is performed by each individual user and no one else can obtain the original record X^i, local privacy can be easily achieved, and we only have to analyze the privacy guarantee on the user side. In addition, since both the hash operations and the randomized responses on all attributes are independent, the local transformation consumes no extra privacy budget as the number of dimensions d increases, as pointed out by the composition theorem [26]. According to the conclusion in [12], the differential privacy obtained on the user side is

ǫ = 2h ln((2 − f) / f),   (8)

where h is the number of hash functions in the Bloom filter and f is the probability that a bit is flipped. Overall, since the same transformation is done by all users independently, this ǫ-local privacy guarantee is equivalent for all distributed users.

Communication Overhead:

Theorem 1: The minimal communication cost C_LoPub after the local transformation is

C_LoPub = Σ_{j=1}^d m_j = (ln(1/p) / (ln 2)^2) Σ_{j=1}^d |Ω_j|.   (9)

Proof: If we assume that the domain of each attribute is publicly known by both the users and the server, then the communication cost of non-private collection is basically Σ_{j=1}^d ln|Ω_j|, which is related to the domain size. Nevertheless, in our method, due to local privacy, the communication cost is Σ_{j=1}^d m_j, which is related to the lengths of the Bloom filters, because only randomly flipped bit strings (not the original data) are sent.
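As a concrete illustration of steps 1-3 and the parameter setup above, here is a minimal Python sketch of our own (not the authors' implementation). The h hash functions H_j are emulated with salted SHA-256 digests, and the record, domain sizes, p, and f are all illustrative assumptions:

```python
import hashlib
import math
import random

def bloom_params(domain_size: int, p: float):
    """Optimal filter length (Equation (6)) and hash count (Equation (7))."""
    m = math.ceil(math.log(1 / p) / math.log(2) ** 2 * domain_size)
    h = round(math.log(1 / p) / math.log(2))
    return m, h

def local_epsilon(h: int, f: float) -> float:
    """Per-user privacy budget of the randomized response (Equation (8))."""
    return 2 * h * math.log((2 - f) / f)

def bloom_filter(value: str, m: int, h: int, salt: str):
    """Step 1: map an attribute value into a length-m Bloom filter via h salted hashes."""
    bits = [0] * m
    for k in range(h):
        digest = hashlib.sha256(f"{salt}|{k}|{value}".encode()).hexdigest()
        bits[int(digest, 16) % m] = 1
    return bits

def randomize(bits, f: float):
    """Step 2: flip each bit according to Equation (4)."""
    out = []
    for b in bits:
        r = random.random()
        out.append(b if r < 1 - f else int(r < 1 - f / 2))
    return out

def local_transform(record: dict, p: float, f: float):
    """Step 3: concatenate the randomized Bloom filters of all attributes (Equation (5))."""
    report = []
    for attr, (value, domain_size) in record.items():
        m, h = bloom_params(domain_size, p)
        report.extend(randomize(bloom_filter(str(value), m, h, attr), f))
    return report

random.seed(0)
# One hypothetical user record: (value, |Omega_j|) per attribute.
report = local_transform({"age": ("30-39", 8), "job": ("engineer", 16)}, p=0.02, f=0.5)
```

With p = 0.02 this gives h = 6 hashes per attribute, so by Equation (8) a flipping probability f = 0.5 yields ǫ = 12 ln 3 ≈ 13.2 per user; increasing f lowers ǫ (stronger privacy) at the cost of noisier estimates.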
For comparison, under the same conditions, when RAPPOR [12] is directly applied to k-dimensional data, all |Ω_1| × ... × |Ω_k| candidate values are regarded as a single 1-dimensional domain, and the cost is then

C_RAPPOR = (ln(1/p) / (ln 2)^2) Π_{j=1}^k |Ω_j|,   (10)

where Π_{j=1}^k |Ω_j| is due to the size of the candidate set Ω_1 × ... × Ω_k. The difference between Equations (9) and (10) arises because our LoPub, unlike straightforward RAPPOR, exploits the mutual independence between multiple attributes.

5.3 Multivariate Distribution Estimation with Local Privacy

After receiving the randomized bit strings, the central server can aggregate them and estimate their joint distribution. For example, an EM-based estimation algorithm [14] was proposed to estimate the 2-dimensional joint distribution. However, due to its high complexity and overheads, it is only preferable for low dimensions with small domains, which is impractical for many real-world datasets with high dimensions. Therefore, we also propose a Lasso regression based algorithm with high efficiency, as well as a hybrid algorithm that achieves a balance between efficiency and accuracy.

5.3.1 EM-based Distribution Estimation

Here, we first extend the EM-based estimation [14] to k-dimensional datasets (2 ≤ k ≤ d) and then elaborate its computational complexity to show its inefficiency on high-dimensional crowdsourced data. Before illustrating the algorithm, we first introduce the following notations. Without loss of generality, we consider k specified attributes A_1, A_2, ..., A_k and their index collection C = {1, 2, ..., k}. For simplicity, the event A_j = ω_j or x_j = ω_j is abbreviated as ω_j. For example, the prior probability P(x_1 = ω_1, x_2 = ω_2, ..., x_k = ω_k) can be simplified to P(ω_1 ω_2 ... ω_k) or P(ω_C). Algorithm 1 depicts the extended EM-based approach for estimating the k-dimensional joint distribution. More specifically, it consists of the following five main steps.
Algorithm 1 EM-based k-dimensional Joint Distribution Estimation (EM_JD)

Require: C: attribute index cluster, i.e., C = {1, 2, ..., k};
  A_j: the k attributes (1 ≤ j ≤ k);
  Ω_j: domain of A_j (1 ≤ j ≤ k);
  ŝ^i_j: observed Bloom filters (1 ≤ i ≤ N, 1 ≤ j ≤ k);
  f: flipping probability;
  δ: convergence accuracy.
Ensure: P(A_C): joint distribution of the k attributes specified by C.
1: initialize P_0(ω_C) = 1 / (Π_{j∈C} |Ω_j|)
2: for each i = 1, ..., N do
3:   for each j ∈ C do
4:     compute P(ŝ^i_j | ω_j) = Π_{b=1}^{m_j} (f/2)^{|ŝ^i_j[b] − H_j(ω_j)[b]|} (1 − f/2)^{1 − |ŝ^i_j[b] − H_j(ω_j)[b]|}
5:   end for
6:   compute P(ŝ^i_C | ω_C) = Π_{j∈C} P(ŝ^i_j | ω_j)
7: end for
8: initialize t = 0 /* number of iterations */
9: repeat
10:   for each i = 1, ..., N do
11:     for each ω_C ∈ Ω_1 × Ω_2 × ... × Ω_k do
12:       compute P_t(ω_C | ŝ^i_C) = P_t(ω_C) P(ŝ^i_C | ω_C) / Σ_{ω_C} P_t(ω_C) P(ŝ^i_C | ω_C)
13:     end for
14:   end for
15:   set P_{t+1}(ω_C) = (1/N) Σ_{i=1}^N P_t(ω_C | ŝ^i_C)
16:   update t = t + 1
17: until max_{ω_C} |P_t(ω_C) − P_{t−1}(ω_C)| ≤ δ
18: return P(A_C) = P_t(ω_C)

1) Before executing the EM procedure, we set the uniform distribution P(ω_1 ω_2 ... ω_k) = 1 / (Π_{j=1}^k |Ω_j|) as the initial prior probability.

2) According to Equation (4), each bit s^i_j[b] will be flipped with probability f/2. Thus, by comparing the

bits H_j(ω_j) with the randomized bits, the conditional probability P(ŝ^i_j | ω_j) can be computed (see line 4 of Algorithm 1).

3) Due to the independence between attributes (and their Bloom filters), the joint conditional probability can be easily calculated by combining the individual attributes; i.e., P(ŝ^i_C | ω_C) = Π_{j∈C} P(ŝ^i_j | ω_j).

4) Given all the conditional distributions of one particular combination of bit strings, the corresponding posterior probability can be computed by Bayes' theorem,

P_t(ω_C | ŝ^i_C) = P_t(ω_C) P(ŝ^i_C | ω_C) / Σ_{ω_C} P_t(ω_C) P(ŝ^i_C | ω_C),   (11)

where P_t(ω_C) = P_t(ω_1 ω_2 ... ω_k) is the k-dimensional joint probability at the tth iteration.

5) After identifying the posterior probability for each user, we calculate the mean of the posterior probabilities over the large number of users to update the prior probability. The updated prior probability is then used to compute the posterior probability in the next iteration.

The above EM-like procedure is executed iteratively until convergence, i.e., until the maximum difference between two estimations is smaller than the specified threshold. The algorithm can converge to a good estimation when the initial value is well chosen. However, EM-based k-dimensional joint distribution estimation can fail by converging to a local optimum. Especially when k increases, there will be many local optima that prevent good convergence, because the sample space of all combinations in Ω_{j1} × Ω_{j2} × ... × Ω_{jk} explodes exponentially.

Complexity: Before the analysis of complexity, we should note that the number of user records N needs to be sufficiently large according to the analysis in [12], i.e., N ≫ v^k, where v denotes the average size of Ω_j; otherwise it is difficult to estimate reliably from a small sample space with a low signal-to-noise ratio.

Theorem 2: Suppose that the average length of m_j is m and the average |Ω_j| is v. Then, the time complexity of Algorithm 1 is

O(Nkmv^k + tNv^{2k}).
(12)

Proof: EM-based estimation scans all N users' bit strings, each of length km, one by one to compute the conditional probabilities for the v^k different combinations, so this part of the time complexity can be estimated as O(N(km)(v^k)). Also, over t iterations, computing the posterior probability of each combination when observing each bit string incurs a time complexity of O(tN(v^k)^2). As a consequence, the overall time complexity is O(tNv^{2k} + Nkmv^k).

Theorem 3: The space complexity of Algorithm 1 is

O(Nkm + 2Nv^k).   (13)

Proof: In Algorithm 1, the necessary storage includes the N users' bit strings of length km, which is O(Nkm). The prior probabilities on the k dimensions take O(v^k). The conditional and posterior probabilities over the v^k candidates for all bit strings take O(2Nv^k). So, the overall complexity is O(Nkm + 2Nv^k + v^k) = O(Nkm + 2Nv^k), since N is the dominant variable.

According to Theorems 2 and 3, the time and space overheads can be daunting when either N or k is large. This makes the performance of EM-based k-dimensional distribution estimation degrade dramatically, rendering it inapplicable to high-dimensional data.

5.3.2 Lasso-based Distribution Estimation

To improve the efficiency of the k-dimensional joint distribution estimation, we present a Lasso regression-based algorithm here. As mentioned in Section 5.2.1, the bit strings are the representative features of the original report. After randomized response and flipping, a large number of noisy samples are generated by individual users. More precisely, one may consider that the central server receives a large number of samples from a specific distribution, however, with random noise. In this sense, one may estimate the distribution from the noisy sample space by taking advantage of linear regression y = Mβ, where M contains the predictor variables, y is the response variable, and β is the regression coefficient vector.
The use of Bloom filters can guarantee that the features (predictor variables M) re-extracted at the server are the same as those extracted by the users. Moreover, the response variable y can be estimated from the randomized bit strings according to the statistical characteristics of the known f. Therefore, the only problem is to find a good solution to the linear regression y = Mβ. Obviously, k-dimensional data may incur an output domain Ω_1 × ... × Ω_k with size |Ω_1| × ... × |Ω_k|, which increases exponentially with k. With a fixed number N of entries in the dataset X, the frequencies of many combinations ω_1 ω_2 ... ω_k ∈ Ω_1 × ... × Ω_k are rather small or even zero. So, M is sparse, and only part of the sparse but effective predictor variables need to be chosen. Otherwise, general linear regression techniques will lead to an overfitting problem. Here, we resort to Lasso regression, which effectively solves the sparse linear regression by choosing predictor variables.

Algorithm 2 Lasso-based k-dimensional Joint Distribution Estimation (Lasso_JD)

Require: C: attribute index cluster, i.e., {1, 2, ..., k};
  A_j: the k attributes (1 ≤ j ≤ k);
  Ω_j: domain of A_j (1 ≤ j ≤ k);
  ŝ^i_j: observed Bloom filters (1 ≤ i ≤ N, 1 ≤ j ≤ k);
  f: flipping probability.
Ensure: P(A_C): joint distribution of the k attributes specified by C.
1: for each j ∈ C do
2:   for each b = 1, 2, ..., m_j do
3:     compute ŷ_j[b] = Σ_{i=1}^N ŝ^i_j[b]
4:     compute y_j[b] = (ŷ_j[b] − Nf/2) / (1 − f)
5:   end for
6:   set H_j(Ω_j) = {H_j(ω) | ω ∈ Ω_j}
7: end for
8: set y = [y_1[1], ..., y_1[m_1], y_2[1], ..., y_2[m_2], ..., y_k[1], ..., y_k[m_k]]
9: set M = [H_1(Ω_1) × H_2(Ω_2) × ... × H_k(Ω_k)]
10: compute β = Lasso_regression(M, y)
11: return P(A_C) = β / N

Our Lasso-based estimation is described in Algorithm 2 and consists of the following four major steps.

1) After receiving all the randomized Bloom filters, for each bit b of each attribute j, the server counts the number of 1s as ŷ_j[b] = Σ_{i=1}^N ŝ^i_j[b].
2) The true count sum of each bit, y_j[b], can be estimated as y_j[b] = (ŷ_j[b] − fN/2)/(1 − f), by inverting the randomized response applied to the true count.

Fig. 4: Illustration of Lasso_JD

These count sums of all bits form a vector y of length Σ_{j=1}^{k} m_j.
3) To construct the features of the overall candidate set of attribute combinations ω_1...ω_k, the Bloom filters on each domain Ω_j are re-implemented by the server with the same hash functions H_j(·). Suppose all distinct Bloom filters on Ω_j are H_j(Ω_j) = {H_j(ω) | ω ∈ Ω_j}, where they are mutually orthogonal. The candidate set of Bloom filters is then M = [H_1(Ω_1) H_2(Ω_2) ... H_k(Ω_k)], and the members of M remain mutually orthogonal.
4) Fit a Lasso regression model to the counter vector y and the candidate matrix M, and then take the non-zero coefficients as the corresponding frequencies of each candidate string. By reshaping the coefficient vector into a k-dimensional matrix in natural order and dividing by N, we derive the k-dimensional joint distribution estimate P(A_1A_2...A_k). For example, in Figure 4, we fit a linear regression to y_{12} and the candidate matrix M to estimate the joint distribution P(A_1A_2).

Generally, the regression operation, the core of the estimation, loses accuracy only when there are many collisions between Bloom filter strings. However, as mentioned in Section 5.2.1, if there is no collision in the bit strings of each single dimension, then there is no collision in the concatenated bit strings of different dimensions. In fact, the probability of collision in concatenated bit strings does not increase with the number of dimensions. For example, if the collision rate of the Bloom filter in one dimension is p, then the collision rate decreases to p^k when we concatenate the bit strings of k dimensions. Therefore, we only need to choose proper m and h according to Equations (6) and (7) to lower the collision probability for each dimension, and we are then guaranteed a proper estimation for multiple dimensions.

Complexity: Compared with Algorithm 1, our Lasso-based estimation effectively reduces both the time and space complexity.
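Steps 1)-4) can be sketched end to end as follows. This is a minimal illustration on a toy orthogonal candidate matrix, not the paper's implementation: the tiny non-negative coordinate-descent solver stands in for any off-the-shelf Lasso library, and all names are ours.

```python
def debias(counts, n, f):
    """Step 2: invert the randomized response, y = (ŷ − fN/2) / (1 − f)."""
    return [(c - f * n / 2) / (1 - f) for c in counts]

def lasso_nonneg(M, y, alpha, iters=200):
    """Coordinate descent for min (1/2n)||y − Mβ||² + α||β||₁ with β ≥ 0,
    a stand-in for Lasso_regression(M, y) in Algorithm 2."""
    n, p = len(M), len(M[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            col = [M[i][j] for i in range(n)]
            # partial residual excluding coordinate j
            r = [y[i] - sum(M[i][t] * beta[t] for t in range(p) if t != j)
                 for i in range(n)]
            rho = sum(col[i] * r[i] for i in range(n)) / n
            z = sum(c * c for c in col) / n
            beta[j] = max(0.0, (rho - alpha) / z) if z > 0 else 0.0
    return beta

# four candidate strings, each hashed to 3 disjoint bits of a 12-bit vector
M = [[1 if 3 * j <= i < 3 * (j + 1) else 0 for j in range(4)] for i in range(12)]
beta_true = [50.0, 30.0, 0.0, 20.0]        # true candidate counts, N = 100
y = [sum(M[i][j] * beta_true[j] for j in range(4)) for i in range(12)]
est = lasso_nonneg(M, y, alpha=0.01)       # recovers ≈ [50, 30, 0, 20]
dist = [b / 100 for b in est]              # step 4: P(A_C) = β / N
```

With orthogonal candidate columns the solver recovers the sparse frequencies almost exactly and drives the zero-frequency candidate to zero; a real deployment would tune the regularization weight and use an optimized solver.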
Theorem 4: The time complexity of Algorithm 2 is

O(v^{3k} + kmv^{2k} + Nkm). (14)

Proof: Algorithm 2 involves two parts. To compute the bit counter vector, N bit strings, each of length km, are summed up; this operation incurs at most O(Nkm). The Lasso regression with v^k candidates (the total domain size) and km samples (the length of the bit counter vector) has complexity O((v^k)^3 + (v^k)^2(km)). Under the general assumption that N dominates Equation (14), the complexity in Equation (14) is much less than that of Equation (12) in Theorem 2.

Theorem 5: The space complexity of Algorithm 2 is

O(Nkm + v^k km). (15)

Proof: In Algorithm 2, the storage overhead consists of three parts: the users' bit strings, O(Nkm); a count vector of size O(km); and the candidate bit matrix M of size O(kmv^k). Therefore, the overall space complexity of our proposed Lasso-based estimation algorithm is O(Nkm + km + v^k km) = O(Nkm + v^k km), which is also smaller than Equation (13) since N is dominant. The empirical results are shown in Section 6.

The efficiency comes from the fact that the N bit strings are scanned only once to compute the count sums, after which a single Lasso regression is fitted to estimate the distribution. In addition, Lasso regression extracts the important (i.e., frequent) features with high probability, which fits well with the sparsity of high-dimensional data.

5.3.3 Hybrid Algorithm

Recall that, with sufficient samples, the EM-based estimation demonstrates good convergence but also high complexity. On the other hand, the Lasso-based estimation can be very efficient, with a slight accuracy deviation compared with the EM-based algorithm. The high complexity of the EM-based algorithm stems from two parts. First, it iteratively scans the users' reports and builds a prior distribution table of size O(Nv^k); for each record of the table, one has to compare m_j bits.
However, when the dimension is high, the combinations in Ω_1 × ... × Ω_k are very sparse, with many zero entries. Second, the uniformly random initial assignment leads to slow convergence.

To strike a balance between the EM-based and Lasso-based estimations, we propose a hybrid algorithm, Lasso+EM_JD (Algorithm 3), which first eliminates the redundant candidates and estimates the initial values with the Lasso-based algorithm, and then refines the result to convergence with the EM-based algorithm. The hybrid algorithm has two advantages:
1) The sparse candidates are selected by the Lasso-based estimation algorithm, so the EM-based algorithm needs to compute the conditional probabilities only on these sparse candidates instead of all candidates, which greatly reduces both time and space complexity.
2) The Lasso-based algorithm gives a good initial estimate of the joint distribution. Compared with randomly assigned initial values, initializing with the Lasso-based estimate further boosts the convergence of the EM algorithm, which is sensitive to the initial values, especially when the candidate space is sparse.

Theorem 6: The time complexity of Algorithm 3 is

O((v^{3k} + kmv^{2k} + Nkm) + (tN(v′)^2 + Nkmv′)), (16)

where v′ is the average number of sparse items in Ω_1 × ... × Ω_k, and v′ < v^k.

Algorithm 3 Lasso+EM k-dimensional Joint Distribution (Lasso+EM_JD)
Require: A_j: k-dimensional attributes (1 ≤ j ≤ k); Ω_j: domain of A_j (1 ≤ j ≤ k); ŝ^i_j: observed Bloom filters (1 ≤ i ≤ N, 1 ≤ j ≤ k); f: flipping probability.
Ensure: P(A_1A_2...A_k): k-dimensional joint distribution.
1: compute P′(ω_1ω_2...ω_k) = Lasso_JD(A_j, Ω_j, {ŝ^i_j}^N_{i=1}, f)
2: set C′ = {x | x ∈ C, P′(x) = 0}
3: for each i = 1,...,N do
4:   for each j = 1,...,k do
5:     compute P(ŝ^i_j | ω_j) = ∏_{b=1}^{m_j} (f/2)^{ŝ^i_j[b]} (1 − f/2)^{1 − ŝ^i_j[b]}
6:   end for
7:   if ω_1ω_2...ω_k ∈ C′ then
8:     P(ŝ^i_1ŝ^i_2...ŝ^i_k | ω_1ω_2...ω_k) = 0
9:   else
10:    compute P(ŝ^i_1ŝ^i_2...ŝ^i_k | ω_1ω_2...ω_k) = ∏_{j=1}^{k} P(ŝ^i_j | ω_j)
11:  end if
12: end for
13: initialize t = 0 /* number of iterations */
14: repeat
15: ...
16: ... /* (similar to Algorithm 1) */
17: ...
18: until P_t(ω_1ω_2...ω_k) converges
19: return P(A_1A_2...A_k) = P_t(ω_1ω_2...ω_k)

Proof: See Theorems 2 and 4; the only difference is that, after the Lasso-based estimation, only the sparse items in Ω_1 × ... × Ω_k are selected.

Theorem 7: The space complexity of Algorithm 3 is

O(Nkm + v^k km + 2Nv′). (17)

Proof: See Theorems 3 and 5.

5.4 Dimension Reduction with Local Privacy

5.4.1 Dimension Reduction via 2-dimensional Joint Distribution Estimation

The key to reducing the dimensionality of a high-dimensional dataset is to find compact clusters, within which all attributes are tightly correlated with, or dependent on, each other. Inspired by [35], [5], but without spending extra privacy budget on dimension reduction, our dimension reduction on the locally once-for-all privacy-preserved data records consists of the following three steps:
1) Pairwise Correlation Computation. We use mutual information to measure pairwise correlations between attributes. The mutual information is calculated as

I_{m,n} = Σ_{i∈Ω_m} Σ_{j∈Ω_n} p_{ij} ln(p_{ij} / (p_i p_j)), (18)

where Ω_m and Ω_n are the domains of attributes A_m and A_n, respectively, and p_i and p_j denote the probability that A_m takes the i-th value in Ω_m and the probability that A_n takes the j-th value in Ω_n, respectively.
Then, p_{ij} is their joint probability. In particular, p_{ij} can be efficiently obtained with our proposed multi-dimensional joint distribution estimation algorithms in Section 5.3, i.e., the hybrid estimation of Algorithm 3. Without loss of generality, the term JD refers to these multi-dimensional joint distribution estimation algorithms. As the corresponding marginal distributions, p_i and p_j can then either be learned from p_{ij} or estimated via the 2-dimensional joint distribution of A_i (or A_j) with itself.
2) Dependency Graph Construction. A dependency graph depicts the correlations among attributes: each attribute A_j is a node, and an edge between two nodes A_m and A_n indicates that attributes A_m and A_n are correlated. Based on mutual information, the dependency graph of the attributes is constructed as follows. First, an adjacency matrix G_{d×d} (the dependency graph of all d attributes) is initialized with all 0s. Then, every attribute pair (A_m, A_n) is examined by comparing its mutual information with a threshold τ_{m,n}, defined as

τ_{m,n} = min(|Ω_m| − 1, |Ω_n| − 1) · φ²/2, (19)

where φ (0 ≤ φ ≤ 1) is a flexible parameter determining the desired correlation level; normally, φ = 0.2 represents the basic correlation level. G_{m,n} and G_{n,m} are both set to 1 if and only if I_{m,n} > τ_{m,n}.
3) Compact Clusters Building. By triangulation, the dependency graph G_{d×d} can be transformed into a junction tree, in which each node represents an attribute A_j. Then, based on the junction tree algorithm, several clusters C_1, C_2, ..., C_l can be obtained as the compact clusters of attributes, within which the attributes are mutually correlated. Hence, the whole attribute set is divided into several compact attribute clusters, and the number of dimensions is effectively reduced. Detailed examples can be found in [28].
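The edge test of steps 1) and 2) is a direct transcription of Equations (18) and (19). The sketch below (with hypothetical helper names and toy 2×2 joint tables) shows that an independent pair yields I = 0 and no edge, while a perfectly correlated pair yields I = ln 2 and an edge at φ = 0.2.

```python
import math

def mutual_information(joint):
    """Eq. (18): I(X;Y) = sum_ij p_ij * ln(p_ij / (p_i * p_j)) from a joint table."""
    rows = [sum(r) for r in joint]           # marginal p_i
    cols = [sum(c) for c in zip(*joint)]     # marginal p_j
    return sum(p * math.log(p / (rows[i] * cols[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

def correlated(joint, phi=0.2):
    """Eq. (19) edge test: I_{m,n} > tau = min(|Ω_m|-1, |Ω_n|-1) * phi^2 / 2."""
    tau = min(len(joint) - 1, len(joint[0]) - 1) * phi ** 2 / 2
    return mutual_information(joint) > tau

independent = [[0.25, 0.25], [0.25, 0.25]]   # I = 0, below tau = 0.02
dependent = [[0.5, 0.0], [0.0, 0.5]]         # I = ln 2 ≈ 0.69, above tau
```

In Algorithm 4 this test is run once per attribute pair to fill the adjacency matrix G_{d×d}.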
Algorithm 4 Dimension reduction with local privacy
Require: A_j: k-dimensional attributes (1 ≤ j ≤ k); Ω_j: domain of A_j (1 ≤ j ≤ k); ŝ^i_j: observed Bloom filters (1 ≤ i ≤ N, 1 ≤ j ≤ k); f: flipping probability; φ: dependency degree.
Ensure: C_1, C_2, ..., C_l: attribute index clusters.
1: initialize G_{d×d} = 0
2: for each j = 1,2,...,d do
3:   estimate P(A_j) by JD (i.e., Lasso+EM_JD, Algorithm 3)
4: end for
5: for each attribute m = 1,2,...,d−1 do
6:   for each attribute n = m+1, m+2,...,d do
7:     estimate P(A_m A_n) by JD
8:     compute I_{m,n} = Σ_{i∈Ω_m} Σ_{j∈Ω_n} p_{ij} ln(p_{ij}/(p_i p_j))
9:     compute τ_{m,n} = min(|Ω_m| − 1, |Ω_n| − 1) · φ²/2
10:    if I_{m,n} > τ_{m,n} then
11:      set G_{m,n} = G_{n,m} = 1
12:    end if
13:  end for
14: end for
15: build the dependency graph from G_{d×d}
16: triangulate the dependency graph into a junction tree
17: split the junction tree into several cliques C_1, C_2, ..., C_l with the elimination algorithm
18: return C = {C_1, C_2, ..., C_l}

Theorem 8: The time complexity of Algorithm 4 is

O(d²(v^6 + 2mv^4 + 2Nm + tN(v′)² + 2Nmv′)). (20)

Proof: The core of the dimension reduction process is the d(d−1)/2 runs of 2-dimensional joint distribution estimation. The complexity of each 2-dimensional joint distribution estimation follows from Equation (16) with k = 2 when adopting the hybrid algorithm (Algorithm 3). The complexity of building the junction tree on the d×d dependency graph is negligible compared with the joint distribution estimation.

Theorem 9: The space complexity of Algorithm 4 is

O(2Nm + 2v²m + 2Nv′). (21)

Proof: When we compute the mutual correlation between any pair of attributes, a 2-dimensional joint distribution estimation is triggered, with space complexity O(2Nm + 2mv² + 2Nv′), obtained by substituting k = 2 into Equation (17). This maximum complexity dominates Algorithm 4. The space complexity of building the junction tree on the d×d dependency graph is negligible compared with the joint distribution estimation.

5.4.2 Entropy-based Pruning Scheme

In existing work [18], [32] on homogeneous data, correlations can simply be captured by distance or similarity metrics [36]. In our work, however, mutual information is used to measure general correlations, since heterogeneous attributes (i.e., attributes with different domains) are also considered. As shown in Equation (18), calculating the mutual information of variables X and Y inevitably requires the joint probability over their joint combinations, making the pairwise computation of dependency necessary. Although mutual information is already simpler than the Kendall rank coefficients used in the similar work [21], we also propose a pruning-based heuristic to speed up this pairwise correlation learning process. Intuitively, there are different situations in Algorithm 4:
1. When φ = 0 or φ = 1, all attributes are considered mutually correlated or mutually independent, respectively. Thus, there is no need to compute pairwise correlations.
2. As φ increases (0 < φ < 1), fewer dependencies are included in the adjacency matrix G_{d×d} of the dependency graph, which becomes sparser. This also means that we may selectively neglect some pairs.
Inspired by the relationship between mutual information and information entropy², we first heuristically filter out the portion of attributes A_x with the least relative information entropy RH(A_x) = H(A_x)/|Ω_x|, and then verify the mutual information among the remaining attributes, thus reducing the number of pairwise computations.
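The pruning heuristic can be sketched as follows. This is an illustrative reading of lines 4-7 of Algorithm 5, with hypothetical function names and toy marginals: attributes are ranked by relative entropy RH(A) = H(A)/|Ω|, and only the top (1 − φ) fraction is kept for pairwise testing.

```python
import math

def relative_entropy(dist):
    """RH(A) = H(A) / |Ω|: marginal entropy normalized by the domain size."""
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / len(dist)

def prune(attrs, phi):
    """Keep the top (1 - phi) fraction of attributes by relative entropy;
    low-entropy (near-constant) attributes are unlikely to pass the
    mutual-information test and are filtered out up front."""
    ranked = sorted(attrs, key=lambda a: relative_entropy(attrs[a]), reverse=True)
    keep = max(1, int(len(ranked) * (1 - phi)))
    return ranked[:keep]

attrs = {
    "near_constant": [0.98, 0.02],   # RH ≈ 0.049, pruned
    "uniform_bin":   [0.5, 0.5],     # RH ≈ 0.347
    "uniform_4":     [0.25] * 4,     # RH ≈ 0.347
}
kept = prune(attrs, phi=0.2)
```

Only the kept attributes enter the O(d²) pairwise loop of Algorithm 4, which is where the savings come from.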
Furthermore, the adjacency matrix G_{d×d} of the dependency graph varies across datasets. For example, G_{d×d} is rarely sparse for binary datasets but very sparse for non-binary datasets. Based on this observation, we can further simplify the calculation by searching for independencies in binary datasets and for dependencies in non-binary datasets. Concretely, for a binary dataset we first set all entries of G_{d×d} to 1s and start from the attributes with the least relative information entropy RH(A_x) = H(A_x)/|Ω_x| to find the uncorrelated attributes; for a non-binary dataset, we first set G_{d×d} to all 0s and start from the attributes with the largest average entropy to find the correlated attributes.

5.5 Synthesizing New Dataset

For brevity, we first define A_C = {A_j | j ∈ C} and X̂_C = {x_j | j ∈ C}. The process of synthesizing the new dataset via sampling is then shown in the following Algorithm 6.

² The relationship between mutual information and information entropy can be represented as I(X;Y) = H(X) + H(Y) − H(X,Y), where H(X) denotes the information entropy of variable X, and H(X,Y) the joint entropy of X and Y.

Algorithm 5 Entropy-based Pruning Scheme
Require: A_j: k-dimensional attributes (1 ≤ j ≤ k); Ω_j: domain of A_j (1 ≤ j ≤ k); ŝ^i_j: observed Bloom filters (1 ≤ i ≤ N, 1 ≤ j ≤ k); f: flipping probability; φ: dependency degree.
Ensure: G_{d×d}: adjacency matrix of the dependency graph of the attributes A_j (j = 1,2,...,d).
1: initialize G_{d×d} = 0
2: for each j = 1,2,...,k do
3:   compute P(A_j) = JD(A_j, Ω_j, {ŝ^i_j}^N_{i=1}, f)
4:   compute RH(A_j) = −(1/|Ω_j|) Σ_{p∈P(A_j)} p log p
5: end for
6: sort list_A = {A_1, A_2, ..., A_j} according to entropy H(A_j)
7: pick the first length(list_A) · (1 − φ) items from list_A as a new list list_A′
8: ...
9: compute the pairwise mutual information among list_A′ and set the dependency graph G_{d×d} as in Algorithm 4.
10: return G_{d×d}

Algorithm 6 New Dataset Synthesizing
Require: C: a collection of attribute index clusters C_1,...,C_l; A_j: k-dimensional attributes (1 ≤ j ≤ k); Ω_j: domain of A_j (1 ≤ j ≤ k); ŝ^i_j: observed Bloom filters (1 ≤ i ≤ N, 1 ≤ j ≤ k); f: flipping probability.
Ensure: X̂: synthetic dataset of X.
1: initialize R = ∅
2: repeat
3:   randomly choose an attribute index cluster C ∈ C
4:   estimate the joint distribution P(A_C) by JD
5:   sample X̂_C according to P(A_C)
6:   C = C \ C, R = R ∪ C, D = {D ∈ C | D ∩ R ≠ ∅}
7:   for each D ∈ D do
8:     estimate the joint distribution P(A_D) by JD
9:     obtain the conditional distribution P(A_{D\R} | A_{D∩R}) from P(A_D)
10:    sample X̂_{D\R} according to P(A_{D\R} | A_{D∩R}) and X̂_{D∩R}
11:    C = C \ D, R = R ∪ D, D = {D ∈ C | D ∩ R ≠ ∅}
12:  end for
13: until C = ∅
14: return X̂

We first initialize a set R to keep the sampled attribute indexes. Then, we randomly choose an attribute index cluster C, estimate its joint distribution, and sample new data X̂ on the attributes A_j, j ∈ C. Next, we move C from the cluster collection C into R, and find the connected components D of C. Each cluster D in the connected components is traversed and sampled as follows: first, estimate the joint distribution on the attributes A_D with our proposed distribution estimations and obtain the conditional distribution P(A_{D\R} | A_{D∩R}); then, sample X̂_{D\R} according to this conditional distribution and the already sampled data X̂_{D∩R}. After the traversal of D, the attributes in the first connected component have been sampled. We then randomly choose a cluster from the remaining C to sample the attributes of the next connected component, until all clusters are sampled. Finally, a new synthetic dataset X̂ is generated according to the correlations and distributions estimated from the original dataset X.

Theorem 10: The time complexity of Algorithm 6 is

O(l(v^{3k} + kmv^{2k} + Nkm + tN(v′)² + Nkmv′)), (22)

where l is the number of clusters after dimension reduction and k here refers to the average number of dimensions in these clusters.

Fig. 5: Main procedures of high-dimensional data publishing with ε-local privacy

Proof: The core of the dataset synthesizing is essentially l runs of k-dimensional joint distribution estimation.

Theorem 11: The space complexity of Algorithm 6 is

O(Nkm + v^k km + 2Nv′ + Nd). (23)

Proof: Each time, a k-dimensional joint distribution estimation algorithm (with space complexity O(Nkm + v^k km + 2Nv′)) is run to draw new data. A new dataset of size O(Nd) is maintained during synthesizing.

The overall process of LoPub is summarized in Figure 5. Clearly, all the steps are conducted on the locally privacy-preserved data. Therefore, compared with the existing non-local privacy schemes in Figure 1, LoPub provides a consistent local privacy guarantee for all crowdsourced users, thus avoiding insider attacks and multiple assignments of privacy budget.

6 EVALUATION

In this section, we conduct extensive experiments on real datasets to demonstrate the efficiency of our algorithms in terms of computation time and accuracy. We used three real-world datasets: Retail [1], Adult [4], and TPC-E [2]. Retail is part of a retail market basket dataset; each record contains the distinct items purchased in a shopping visit. Adult was extracted from the 1994 US Census and contains personal information such as gender, salary, and education level. TPC-E contains the trade records of the Trade type, Security, and Security status tables in the TPC-E benchmark. It should be noted that some continuous domains were binned in preprocessing for simplicity.

Datasets | Type | # Records (N) | # Attributes (d) | Domain Size
Retail | Binary | 27,… | … | …
Adult | Integer | 45,… | … | …
TPC-E | Mixed | 4,… | … | …

All the experiments were run on a machine with an Intel Core i5-5200U CPU at 2.2 GHz and 8 GB RAM, running Windows 7. We simulated the crowdsourced environment as follows. First, each user reads their data record individually and locally transforms it into privacy-preserving bit strings.
Then, the crowdsourced bit strings are gathered by the central server for synthesizing and publishing the high-dimensional dataset. LoPub is realized by combining the distribution estimation and data synthesizing techniques. Thus, we implemented different LoPub realizations in Python 2.7 with the following three strategies:
1) EM_JD, the generalized EM-based multivariate joint distribution estimation algorithm;
2) Lasso_JD, our proposed Lasso-based multivariate joint distribution estimation algorithm;
3) Lasso+EM_JD, our proposed hybrid estimation algorithm, which uses Lasso_JD to filter out candidates to reduce the complexity and to replace the initial values to boost the convergence of EM_JD.

It is worth mentioning that we compare only the above algorithms, since our algorithm adopts a novel local privacy paradigm on high-dimensional data. Other competitors either provide non-local privacy [5], [35], [21] or work on low-dimensional data [12], [14], [16], and are therefore not comparable. For fair comparison, we randomly chose 10 combinations of k attributes from the d-dimensional data. For simplicity, we sampled 30-50% of the data from the Retail dataset and 10% of the data from the Adult and TPC-E datasets, respectively.³

The efficiency of our algorithms is measured by computation time and accuracy. The computation time includes CPU time and IO cost. Each set of experiments is run 10 times, and the average running time is reported. To measure accuracy, we use the average variant distance (AVD), as suggested in [5], to quantify the closeness between the estimated joint distribution P(ω) and the original joint distribution Q(ω) on the three datasets. The AVD error is defined as

Dist_AVD(P, Q) = (1/2) Σ_{ω∈Ω} |P(ω) − Q(ω)|. (24)

The default parameters are as follows. For the binary dataset Retail, the maximum number of bits and the number of hash functions used in the Bloom filter are m = 32 and h = 4, respectively.
For the non-binary datasets Adult and TPC-E, the maximum number of bits and the number of hash functions used in the Bloom filter are m = 128 and h = 4, respectively. The convergence gap is set to 0.1 for fast convergence.

6.1 Multivariate Distribution Estimation

Here, we show the performance of our proposed distribution estimation algorithms in terms of both efficiency and effectiveness. Efficiency is measured by computation time, and effectiveness by estimation accuracy.

6.1.1 Computation Time

We first evaluate the computation time of EM_JD, Lasso_JD, and Lasso+EM_JD for the k-dimensional joint distribution estimation on the three real datasets. Figures 6 and 7 compare the computation time on the binary dataset Retail with k = 3 and k = 5, respectively. It can be seen that, for each dimension k, Lasso_JD is consistently much faster than EM_JD and Lasso+EM_JD, especially when k is large. This is because EM_JD has to repeatedly scan each user's bit string. In particular, the time consumption of EM_JD increases with f, because more iterations are needed to reach the fixed convergence gap. In contrast, Lasso_JD uses regression to estimate the joint distribution more efficiently. Furthermore, the complexity of Lasso+EM_JD is much less than that of EM_JD, as the initial estimate from Lasso_JD greatly reduces the candidate attribute space and the number of iterations needed. As k grows, the computation time of Lasso_JD increases slowly, unlike EM_JD, which has a dramatic increase. This is because the

³ It should be noted that, with sampled data, the differential privacy level can be further enhanced [23]; sampling is used here only for simplicity.
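The AVD metric of Equation (24) reduces to a few lines. The sketch below uses illustrative toy distributions, not the paper's datasets; the helper name is ours.

```python
def avd(p, q):
    """Eq. (24): Dist_AVD(P, Q) = (1/2) * sum over ω of |P(ω) − Q(ω)|."""
    assert set(p) == set(q), "distributions must share the same domain Ω"
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)

true_dist = {"a": 0.5, "b": 0.3, "c": 0.2}
est_dist = {"a": 0.4, "b": 0.4, "c": 0.2}
err = avd(true_dist, est_dist)  # about 0.1 of the probability mass is misplaced
```

AVD is half the L1 distance between the two distributions, so it ranges from 0 (identical) to 1 (disjoint support) and reads directly as the fraction of misplaced probability mass.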


More information

OBSERVER/KALMAN AND SUBSPACE IDENTIFICATION OF THE UBC BENCHMARK STRUCTURAL MODEL

OBSERVER/KALMAN AND SUBSPACE IDENTIFICATION OF THE UBC BENCHMARK STRUCTURAL MODEL OBSERVER/KALMAN AND SUBSPACE IDENTIFICATION OF THE UBC BENCHMARK STRUCTURAL MODEL Dionisio Bernal, Burcu Gunes Associate Proessor, Graduate Student Department o Civil and Environmental Engineering, 7 Snell

More information

Evaluating Probabilistic Queries over Imprecise Data

Evaluating Probabilistic Queries over Imprecise Data Evaluating Probabilistic Queries over Imprecise Data Reynold Cheng Dmitri V. Kalashnikov Sunil Prabhakar Department o Computer Science, Purdue University Email: {ckcheng,dvk,sunil}@cs.purdue.edu http://www.cs.purdue.edu/place/

More information

Exact Inference: Variable Elimination

Exact Inference: Variable Elimination Readings: K&F 9.2 9. 9.4 9.5 Exact nerence: Variable Elimination ecture 6-7 Apr 1/18 2011 E 515 tatistical Methods pring 2011 nstructor: u-n ee University o Washington eattle et s revisit the tudent Network

More information

[Title removed for anonymity]

[Title removed for anonymity] [Title removed for anonymity] Graham Cormode graham@research.att.com Magda Procopiuc(AT&T) Divesh Srivastava(AT&T) Thanh Tran (UMass Amherst) 1 Introduction Privacy is a common theme in public discourse

More information

Distributed Optimization Methods for Wide-Area Damping Control of Power System Oscillations

Distributed Optimization Methods for Wide-Area Damping Control of Power System Oscillations Preprints o the 9th World Congress The International Federation o Automatic Control Cape Town, South Arica. August 4-9, 4 Distributed Optimization Methods or Wide-Area Damping Control o Power System Oscillations

More information

DETC A GENERALIZED MAX-MIN SAMPLE FOR RELIABILITY ASSESSMENT WITH DEPENDENT VARIABLES

DETC A GENERALIZED MAX-MIN SAMPLE FOR RELIABILITY ASSESSMENT WITH DEPENDENT VARIABLES Proceedings o the ASME International Design Engineering Technical Conerences & Computers and Inormation in Engineering Conerence IDETC/CIE August 7-,, Bualo, USA DETC- A GENERALIZED MAX-MIN SAMPLE FOR

More information

Fluctuationlessness Theorem and its Application to Boundary Value Problems of ODEs

Fluctuationlessness Theorem and its Application to Boundary Value Problems of ODEs Fluctuationlessness Theorem and its Application to Boundary Value Problems o ODEs NEJLA ALTAY İstanbul Technical University Inormatics Institute Maslak, 34469, İstanbul TÜRKİYE TURKEY) nejla@be.itu.edu.tr

More information

Supplement for In Search of the Holy Grail: Policy Convergence, Experimentation and Economic Performance Sharun W. Mukand and Dani Rodrik

Supplement for In Search of the Holy Grail: Policy Convergence, Experimentation and Economic Performance Sharun W. Mukand and Dani Rodrik Supplement or In Search o the Holy Grail: Policy Convergence, Experimentation and Economic Perormance Sharun W. Mukand and Dani Rodrik In what ollows we sketch out the proos or the lemma and propositions

More information

Provable Seconde Preimage Resistance Revisited

Provable Seconde Preimage Resistance Revisited Provable Seconde Preimage Resistance Revisited Charles Bouillaguet 1 Bastien Vayssiere 2 1 LIFL University o Lille, France 2 PRISM University o Versailles, France SAC 2013 1 / 29 Cryptographic Hash Functions

More information

3770 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 24, NO. 6, DECEMBER Muhammad Shahzad and Alex X. Liu

3770 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 24, NO. 6, DECEMBER Muhammad Shahzad and Alex X. Liu 3770 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 24, NO. 6, DECEMBER 2016 Fast and Reliable Detection and Identiication o Missing RFID Tags in the Wild Abstract Radio-requency identiication RFID) systems

More information

Estimation and detection of a periodic signal

Estimation and detection of a periodic signal Estimation and detection o a periodic signal Daniel Aronsson, Erik Björnemo, Mathias Johansson Signals and Systems Group, Uppsala University, Sweden, e-mail: Daniel.Aronsson,Erik.Bjornemo,Mathias.Johansson}@Angstrom.uu.se

More information

A Systematic Approach to Frequency Compensation of the Voltage Loop in Boost PFC Pre- regulators.

A Systematic Approach to Frequency Compensation of the Voltage Loop in Boost PFC Pre- regulators. A Systematic Approach to Frequency Compensation o the Voltage Loop in oost PFC Pre- regulators. Claudio Adragna, STMicroelectronics, Italy Abstract Venable s -actor method is a systematic procedure that

More information

Received: 30 July 2017; Accepted: 29 September 2017; Published: 8 October 2017

Received: 30 July 2017; Accepted: 29 September 2017; Published: 8 October 2017 mathematics Article Least-Squares Solution o Linear Dierential Equations Daniele Mortari ID Aerospace Engineering, Texas A&M University, College Station, TX 77843, USA; mortari@tamu.edu; Tel.: +1-979-845-734

More information

Hao Ren, Wim J. van der Linden and Qi Diao

Hao Ren, Wim J. van der Linden and Qi Diao psychometrika vol. 82, no. 2, 498 522 June 2017 doi: 10.1007/s11336-017-9553-1 CONTINUOUS ONLINE ITEM CALIBRATION: PARAMETER RECOVERY AND ITEM UTILIZATION Hao Ren, Wim J. van der Linden and Qi Diao PACIFIC

More information

Benny Pinkas Bar Ilan University

Benny Pinkas Bar Ilan University Winter School on Bar-Ilan University, Israel 30/1/2011-1/2/2011 Bar-Ilan University Benny Pinkas Bar Ilan University 1 Extending OT [IKNP] Is fully simulatable Depends on a non-standard security assumption

More information

Additional exercises in Stationary Stochastic Processes

Additional exercises in Stationary Stochastic Processes Mathematical Statistics, Centre or Mathematical Sciences Lund University Additional exercises 8 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

More information

Power Spectral Analysis of Elementary Cellular Automata

Power Spectral Analysis of Elementary Cellular Automata Power Spectral Analysis o Elementary Cellular Automata Shigeru Ninagawa Division o Inormation and Computer Science, Kanazawa Institute o Technology, 7- Ohgigaoka, Nonoichi, Ishikawa 92-850, Japan Spectral

More information

NONPARAMETRIC PREDICTIVE INFERENCE FOR REPRODUCIBILITY OF TWO BASIC TESTS BASED ON ORDER STATISTICS

NONPARAMETRIC PREDICTIVE INFERENCE FOR REPRODUCIBILITY OF TWO BASIC TESTS BASED ON ORDER STATISTICS REVSTAT Statistical Journal Volume 16, Number 2, April 2018, 167 185 NONPARAMETRIC PREDICTIVE INFERENCE FOR REPRODUCIBILITY OF TWO BASIC TESTS BASED ON ORDER STATISTICS Authors: Frank P.A. Coolen Department

More information

Telescoping Decomposition Method for Solving First Order Nonlinear Differential Equations

Telescoping Decomposition Method for Solving First Order Nonlinear Differential Equations Telescoping Decomposition Method or Solving First Order Nonlinear Dierential Equations 1 Mohammed Al-Reai 2 Maysem Abu-Dalu 3 Ahmed Al-Rawashdeh Abstract The Telescoping Decomposition Method TDM is a new

More information

THE use of radio frequency channels assigned to primary. Traffic-Aware Channel Sensing Order in Dynamic Spectrum Access Networks

THE use of radio frequency channels assigned to primary. Traffic-Aware Channel Sensing Order in Dynamic Spectrum Access Networks EEE JOURNAL ON SELECTED AREAS N COMMUNCATONS, VOL. X, NO. X, X 01X 1 Traic-Aware Channel Sensing Order in Dynamic Spectrum Access Networks Chun-Hao Liu, Jason A. Tran, Student Member, EEE, Przemysław Pawełczak,

More information

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x)

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x) Solving Nonlinear Equations & Optimization One Dimension Problem: or a unction, ind 0 such that 0 = 0. 0 One Root: The Bisection Method This one s guaranteed to converge at least to a singularity, i not

More information

On the Girth of (3,L) Quasi-Cyclic LDPC Codes based on Complete Protographs

On the Girth of (3,L) Quasi-Cyclic LDPC Codes based on Complete Protographs On the Girth o (3,L) Quasi-Cyclic LDPC Codes based on Complete Protographs Sudarsan V S Ranganathan, Dariush Divsalar and Richard D Wesel Department o Electrical Engineering, University o Caliornia, Los

More information

AH 2700A. Attenuator Pair Ratio for C vs Frequency. Option-E 50 Hz-20 khz Ultra-precision Capacitance/Loss Bridge

AH 2700A. Attenuator Pair Ratio for C vs Frequency. Option-E 50 Hz-20 khz Ultra-precision Capacitance/Loss Bridge 0 E ttenuator Pair Ratio or vs requency NEEN-ERLN 700 Option-E 0-0 k Ultra-precision apacitance/loss ridge ttenuator Ratio Pair Uncertainty o in ppm or ll Usable Pairs o Taps 0 0 0. 0. 0. 07/08/0 E E E

More information

Better Than Advertised: Improved Collision-Resistance Guarantees for MD-Based Hash Functions

Better Than Advertised: Improved Collision-Resistance Guarantees for MD-Based Hash Functions Better Than Advertised: Improved Collision-Resistance Guarantees or MD-Based Hash Functions Mihir Bellare University o Caliornia San Diego La Jolla, Caliornia mihir@eng.ucsd.edu Joseph Jaeger University

More information

Objectives. By the time the student is finished with this section of the workbook, he/she should be able

Objectives. By the time the student is finished with this section of the workbook, he/she should be able FUNCTIONS Quadratic Functions......8 Absolute Value Functions.....48 Translations o Functions..57 Radical Functions...61 Eponential Functions...7 Logarithmic Functions......8 Cubic Functions......91 Piece-Wise

More information

Finite Dimensional Hilbert Spaces are Complete for Dagger Compact Closed Categories (Extended Abstract)

Finite Dimensional Hilbert Spaces are Complete for Dagger Compact Closed Categories (Extended Abstract) Electronic Notes in Theoretical Computer Science 270 (1) (2011) 113 119 www.elsevier.com/locate/entcs Finite Dimensional Hilbert Spaces are Complete or Dagger Compact Closed Categories (Extended bstract)

More information

Introduction to Simulation - Lecture 2. Equation Formulation Methods. Jacob White. Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy

Introduction to Simulation - Lecture 2. Equation Formulation Methods. Jacob White. Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy Introduction to Simulation - Lecture Equation Formulation Methods Jacob White Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy Outline Formulating Equations rom Schematics Struts and Joints

More information

Provably Secure Double-Block-Length Hash Functions in a Black-Box Model

Provably Secure Double-Block-Length Hash Functions in a Black-Box Model Provably Secure Double-Block-ength Hash Functions in a Black-Box Model Shoichi Hirose Graduate School o Inormatics, Kyoto niversity, Kyoto 606-8501 Japan hirose@i.kyoto-u.ac.jp Abstract. In CRYPTO 89,

More information

Constrained Keys for Invertible Pseudorandom Functions

Constrained Keys for Invertible Pseudorandom Functions Constrained Keys or Invertible Pseudorandom Functions Dan Boneh, Sam Kim, and David J. Wu Stanord University {dabo,skim13,dwu4}@cs.stanord.edu Abstract A constrained pseudorandom unction (PRF) is a secure

More information

Equidistant Polarizing Transforms

Equidistant Polarizing Transforms DRAFT 1 Equidistant Polarizing Transorms Sinan Kahraman Abstract arxiv:1708.0133v1 [cs.it] 3 Aug 017 This paper presents a non-binary polar coding scheme that can reach the equidistant distant spectrum

More information

Supplementary Information Reconstructing propagation networks with temporal similarity

Supplementary Information Reconstructing propagation networks with temporal similarity Supplementary Inormation Reconstructing propagation networks with temporal similarity Hao Liao and An Zeng I. SI NOTES A. Special range. The special range is actually due to two reasons: the similarity

More information

A Particle Swarm Optimization Algorithm for Neighbor Selection in Peer-to-Peer Networks

A Particle Swarm Optimization Algorithm for Neighbor Selection in Peer-to-Peer Networks A Particle Swarm Optimization Algorithm or Neighbor Selection in Peer-to-Peer Networks Shichang Sun 1,3, Ajith Abraham 2,4, Guiyong Zhang 3, Hongbo Liu 3,4 1 School o Computer Science and Engineering,

More information

The concept of limit

The concept of limit Roberto s Notes on Dierential Calculus Chapter 1: Limits and continuity Section 1 The concept o limit What you need to know already: All basic concepts about unctions. What you can learn here: What limits

More information

Lectures 1&2: Introduction to Secure Computation, Yao s and GMW Protocols

Lectures 1&2: Introduction to Secure Computation, Yao s and GMW Protocols CS 294 Secure Computation January 19, 2016 Lectures 1&2: Introduction to Secure Computation, Yao s and GMW Protocols Instructor: Sanjam Garg Scribe: Pratyush Mishra 1 Introduction Secure multiparty computation

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

ELEG 3143 Probability & Stochastic Process Ch. 4 Multiple Random Variables

ELEG 3143 Probability & Stochastic Process Ch. 4 Multiple Random Variables Department o Electrical Engineering University o Arkansas ELEG 3143 Probability & Stochastic Process Ch. 4 Multiple Random Variables Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Two discrete random variables

More information

Computing proximal points of nonconvex functions

Computing proximal points of nonconvex functions Mathematical Programming manuscript No. (will be inserted by the editor) Warren Hare Claudia Sagastizábal Computing proximal points o nonconvex unctions the date o receipt and acceptance should be inserted

More information

A Simple Explanation of the Sobolev Gradient Method

A Simple Explanation of the Sobolev Gradient Method A Simple Explanation o the Sobolev Gradient Method R. J. Renka July 3, 2006 Abstract We have observed that the term Sobolev gradient is used more oten than it is understood. Also, the term is oten used

More information

A Method for Assimilating Lagrangian Data into a Shallow-Water-Equation Ocean Model

A Method for Assimilating Lagrangian Data into a Shallow-Water-Equation Ocean Model APRIL 2006 S A L M A N E T A L. 1081 A Method or Assimilating Lagrangian Data into a Shallow-Water-Equation Ocean Model H. SALMAN, L.KUZNETSOV, AND C. K. R. T. JONES Department o Mathematics, University

More information

Stochastic Game Approach for Replay Attack Detection

Stochastic Game Approach for Replay Attack Detection Stochastic Game Approach or Replay Attack Detection Fei Miao Miroslav Pajic George J. Pappas. Abstract The existing tradeo between control system perormance and the detection rate or replay attacks highlights

More information

Design and Optimal Configuration of Full-Duplex MAC Protocol for Cognitive Radio Networks Considering Self-Interference

Design and Optimal Configuration of Full-Duplex MAC Protocol for Cognitive Radio Networks Considering Self-Interference Received November 8, 015, accepted December 10, 015, date o publication December 17, 015, date o current version December 8, 015. Digital Object Identiier 10.1109/ACCE.015.509449 Design and Optimal Coniguration

More information

Percentile Policies for Inventory Problems with Partially Observed Markovian Demands

Percentile Policies for Inventory Problems with Partially Observed Markovian Demands Proceedings o the International Conerence on Industrial Engineering and Operations Management Percentile Policies or Inventory Problems with Partially Observed Markovian Demands Farzaneh Mansouriard Department

More information

The Deutsch-Jozsa Problem: De-quantization and entanglement

The Deutsch-Jozsa Problem: De-quantization and entanglement The Deutsch-Jozsa Problem: De-quantization and entanglement Alastair A. Abbott Department o Computer Science University o Auckland, New Zealand May 31, 009 Abstract The Deustch-Jozsa problem is one o the

More information

Feedback Linearization

Feedback Linearization Feedback Linearization Peter Al Hokayem and Eduardo Gallestey May 14, 2015 1 Introduction Consider a class o single-input-single-output (SISO) nonlinear systems o the orm ẋ = (x) + g(x)u (1) y = h(x) (2)

More information

Solving Multi-Mode Time-Cost-Quality Trade-off Problem in Uncertainty Condition Using a Novel Genetic Algorithm

Solving Multi-Mode Time-Cost-Quality Trade-off Problem in Uncertainty Condition Using a Novel Genetic Algorithm International Journal o Management and Fuzzy Systems 2017; 3(3): 32-40 http://www.sciencepublishinggroup.com/j/ijms doi: 10.11648/j.ijms.20170303.11 Solving Multi-Mode Time-Cost-Quality Trade-o Problem

More information

Published in the American Economic Review Volume 102, Issue 1, February 2012, pages doi: /aer

Published in the American Economic Review Volume 102, Issue 1, February 2012, pages doi: /aer Published in the American Economic Review Volume 102, Issue 1, February 2012, pages 594-601. doi:10.1257/aer.102.1.594 CONTRACTS VS. SALARIES IN MATCHING FEDERICO ECHENIQUE Abstract. Firms and workers

More information

Chapter 2. Basic concepts of probability. Summary. 2.1 Axiomatic foundation of probability theory

Chapter 2. Basic concepts of probability. Summary. 2.1 Axiomatic foundation of probability theory Chapter Basic concepts o probability Demetris Koutsoyiannis Department o Water Resources and Environmental Engineering aculty o Civil Engineering, National Technical University o Athens, Greece Summary

More information

On Security Arguments of the Second Round SHA-3 Candidates

On Security Arguments of the Second Round SHA-3 Candidates On Security Arguments o the Second Round SA-3 Candidates Elena Andreeva Andrey Bogdanov Bart Mennink Bart Preneel Christian Rechberger March 19, 2012 Abstract In 2007, the US National Institute or Standards

More information

The Analysis of Electricity Storage Location Sites in the Electric Transmission Grid

The Analysis of Electricity Storage Location Sites in the Electric Transmission Grid Proceedings o the 2010 Industrial Engineering Research Conerence A. Johnson and J. Miller, eds. The Analysis o Electricity Storage Location Sites in the Electric Transmission Grid Thomas F. Brady College

More information

BANDELET IMAGE APPROXIMATION AND COMPRESSION

BANDELET IMAGE APPROXIMATION AND COMPRESSION BANDELET IMAGE APPOXIMATION AND COMPESSION E. LE PENNEC AND S. MALLAT Abstract. Finding eicient geometric representations o images is a central issue to improve image compression and noise removal algorithms.

More information

CMPUT651: Differential Privacy

CMPUT651: Differential Privacy CMPUT65: Differential Privacy Homework assignment # 2 Due date: Apr. 3rd, 208 Discussion and the exchange of ideas are essential to doing academic work. For assignments in this course, you are encouraged

More information

A Brief Survey on Semi-supervised Learning with Graph Regularization

A Brief Survey on Semi-supervised Learning with Graph Regularization 000 00 002 003 004 005 006 007 008 009 00 0 02 03 04 05 06 07 08 09 020 02 022 023 024 025 026 027 028 029 030 03 032 033 034 035 036 037 038 039 040 04 042 043 044 045 046 047 048 049 050 05 052 053 A

More information

Common Errors: How to (and Not to) Control for Unobserved Heterogeneity *

Common Errors: How to (and Not to) Control for Unobserved Heterogeneity * Common Errors: How to (and ot to) Control or Unobserved Heterogeneity * Todd A. Gormley and David A. Matsa June 6 0 Abstract Controlling or unobserved heterogeneity (or common errors ) such as industry-speciic

More information

arxiv:quant-ph/ v2 12 Jan 2006

arxiv:quant-ph/ v2 12 Jan 2006 Quantum Inormation and Computation, Vol., No. (25) c Rinton Press A low-map model or analyzing pseudothresholds in ault-tolerant quantum computing arxiv:quant-ph/58176v2 12 Jan 26 Krysta M. Svore Columbia

More information

A PROBABILISTIC POWER DOMAIN ALGORITHM FOR FRACTAL IMAGE DECODING

A PROBABILISTIC POWER DOMAIN ALGORITHM FOR FRACTAL IMAGE DECODING Stochastics and Dynamics, Vol. 2, o. 2 (2002) 6 73 c World Scientiic Publishing Company A PROBABILISTIC POWER DOMAI ALGORITHM FOR FRACTAL IMAGE DECODIG V. DRAKOPOULOS Department o Inormatics and Telecommunications,

More information

Semideterministic Finite Automata in Operational Research

Semideterministic Finite Automata in Operational Research Applied Mathematical Sciences, Vol. 0, 206, no. 6, 747-759 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/0.2988/ams.206.62 Semideterministic Finite Automata in Operational Research V. N. Dumachev and

More information

CORRESPONDENCE ANALYSIS

CORRESPONDENCE ANALYSIS CORRESPONDENCE ANALYSIS INTUITIVE THEORETICAL PRESENTATION BASIC RATIONALE DATA PREPARATION INITIAL TRANSFORAMATION OF THE INPUT MATRIX INTO PROFILES DEFINITION OF GEOMETRIC CONCEPTS (MASS, DISTANCE AND

More information

RESOLUTION MSC.362(92) (Adopted on 14 June 2013) REVISED RECOMMENDATION ON A STANDARD METHOD FOR EVALUATING CROSS-FLOODING ARRANGEMENTS

RESOLUTION MSC.362(92) (Adopted on 14 June 2013) REVISED RECOMMENDATION ON A STANDARD METHOD FOR EVALUATING CROSS-FLOODING ARRANGEMENTS (Adopted on 4 June 203) (Adopted on 4 June 203) ANNEX 8 (Adopted on 4 June 203) MSC 92/26/Add. Annex 8, page THE MARITIME SAFETY COMMITTEE, RECALLING Article 28(b) o the Convention on the International

More information

Reliability Assessment with Correlated Variables using Support Vector Machines

Reliability Assessment with Correlated Variables using Support Vector Machines Reliability Assessment with Correlated Variables using Support Vector Machines Peng Jiang, Anirban Basudhar, and Samy Missoum Aerospace and Mechanical Engineering Department, University o Arizona, Tucson,

More information

Estimation of Sample Reactivity Worth with Differential Operator Sampling Method

Estimation of Sample Reactivity Worth with Differential Operator Sampling Method Progress in NUCLEAR SCIENCE and TECHNOLOGY, Vol. 2, pp.842-850 (2011) ARTICLE Estimation o Sample Reactivity Worth with Dierential Operator Sampling Method Yasunobu NAGAYA and Takamasa MORI Japan Atomic

More information

Basic properties of limits

Basic properties of limits Roberto s Notes on Dierential Calculus Chapter : Limits and continuity Section Basic properties o its What you need to know already: The basic concepts, notation and terminology related to its. What you

More information

Probabilistic Optimisation applied to Spacecraft Rendezvous on Keplerian Orbits

Probabilistic Optimisation applied to Spacecraft Rendezvous on Keplerian Orbits Probabilistic Optimisation applied to pacecrat Rendezvous on Keplerian Orbits Grégory aive a, Massimiliano Vasile b a Université de Liège, Faculté des ciences Appliquées, Belgium b Dipartimento di Ingegneria

More information

Simpler Functions for Decompositions

Simpler Functions for Decompositions Simpler Functions or Decompositions Bernd Steinbach Freiberg University o Mining and Technology, Institute o Computer Science, D-09596 Freiberg, Germany Abstract. This paper deals with the synthesis o

More information

Aggregate Growth: R =αn 1/ d f

Aggregate Growth: R =αn 1/ d f Aggregate Growth: Mass-ractal aggregates are partly described by the mass-ractal dimension, d, that deines the relationship between size and mass, R =αn 1/ d where α is the lacunarity constant, R is the

More information

Comptes rendus de l Academie bulgare des Sciences, Tome 59, 4, 2006, p POSITIVE DEFINITE RANDOM MATRICES. Evelina Veleva

Comptes rendus de l Academie bulgare des Sciences, Tome 59, 4, 2006, p POSITIVE DEFINITE RANDOM MATRICES. Evelina Veleva Comtes rendus de l Academie bulgare des ciences Tome 59 4 6 353 36 POITIVE DEFINITE RANDOM MATRICE Evelina Veleva Abstract: The aer begins with necessary and suicient conditions or ositive deiniteness

More information

SEPARATED AND PROPER MORPHISMS

SEPARATED AND PROPER MORPHISMS SEPARATED AND PROPER MORPHISMS BRIAN OSSERMAN Last quarter, we introduced the closed diagonal condition or a prevariety to be a prevariety, and the universally closed condition or a variety to be complete.

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information