arxiv: v1 [cs.db] 1 Aug 2012


Functional Mechanism: Regression Analysis under Differential Privacy

Jun Zhang (1), Zhenjie Zhang (2), Xiaokui Xiao (1), Yin Yang (2), Marianne Winslett (2, 3)

(1) School of Computer Engineering, Nanyang Technological University, {jzhang027,xkxiao}@ntu.edu.sg
(2) Advanced Digital Sciences Center, Illinois at Singapore Pte. Ltd., {zhenjie,yin.yang}@adsc.com.sg
(3) Department of Computer Science, University of Illinois at Urbana-Champaign, winslett@illinois.edu

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 11. Copyright 2012 VLDB Endowment.

ABSTRACT

ϵ-differential privacy is the state-of-the-art model for releasing sensitive information while protecting privacy. Numerous methods have been proposed to enforce ϵ-differential privacy in various analytical tasks, e.g., regression analysis. Existing solutions for regression analysis, however, are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce ϵ-differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to the two most widely used regression models, namely, linear regression and logistic regression. Both theoretical analysis and thorough experimental evaluations show that the functional mechanism is highly effective and efficient, and that it significantly outperforms existing solutions.

1. INTRODUCTION

Releasing sensitive data while protecting privacy has been a subject of active research for the past few decades. One state-of-the-art approach to the problem is ϵ-differential privacy, which works by injecting random noise into the released statistical results computed from the underlying sensitive data, such that the distribution of the noisy results is relatively insensitive to any change of a single record in the original dataset. This ensures that the adversary cannot infer any information about any particular record with high confidence (controlled by parameter ϵ), even if he/she possesses all the remaining tuples of the sensitive data. Meanwhile, the noisy results should be close to the unperturbed ones in order to be useful in practice. Hence, the goal of an ϵ-differentially private data publication mechanism is to maximize result accuracy, while satisfying the privacy guarantees.

The best strategy to enforce ϵ-differential privacy depends upon the nature of the statistical analysis that will be performed using the noisy data. This paper focuses on regression analysis, which identifies the correlations between different attributes based on the input data. Figure 1 illustrates the two most commonly used types of regression, namely, linear regression and logistic regression. Specifically, linear regression finds the linear relationship between the input attributes that best fits the input data. In the example shown in Figure 1(a), there are two attributes, age and medical expenses; the data records are shown as dots. The regression result is a straight line with minimum overall distance to the data points, which expresses the value of one attribute as a linear function of the other one.

[Figure 1: Two examples of regression problems. (a) Linear Regression; (b) Logistic Regression.]
Figure 1(b) shows an example of logistic regression, where there are two classes of data: diabetes patients (shown as black dots) and those without diabetes (white dots). The goal is to predict the probability of having diabetes, given a patient's other attributes (i.e., age and cholesterol level in our example). The result of this logistic regression can be expressed as a straight line; the probability of a patient getting diabetes is calculated based on which side of the line the patient lies on, and the patient's distance to the line. In particular, if a patient's age and cholesterol level correspond to a point that falls exactly on the straight line, then his/her probability of having diabetes is predicted to be 50%. We present the mathematical details of these two types of regression in Section 3.

Although regression is a very common type of analysis in practice (especially on medical data), so far there is only a narrow selection of methods for ϵ-differentially private regression. The main challenge lies in the fact that regression involves solving an optimization problem. The relationship between the optimization results and the original data is difficult to analyze; consequently, it is hard to decide on the minimum amount of noise necessary to make the optimization results differentially private. Most existing solutions for ϵ-differential privacy are designed for releasing simple aggregates (e.g., counts), or structures that can be decomposed into such aggregates, e.g., trees or histograms. One way to adapt these solutions to regression analysis is through synthetic data generation (e.g., [7]), which generates synthetic data in a differentially private way based on the original sensitive data. The resulting synthetic dataset can be used for any subsequent analysis. However, due to its generic nature, this methodology often injects an unnecessarily large amount of noise, as shown in our experiments. To our knowledge, the only known solutions that target regression are [4, 5, 6, 27], which, however, are either limited to non-standard types of regression analysis or unable to produce accurate regression results, as will be shown in Sections 2 and 7.

Motivated by this, we propose the functional mechanism, a general framework for enforcing ϵ-differential privacy on analyses that involve solving an optimization problem. The main idea is to enforce ϵ-differential privacy by perturbing the objective function of the optimization problem, rather than its results. Publishing the results of the perturbed optimization problem then naturally satisfies ϵ-differential privacy as well. Note that, unlike previous work [4, 5], which relies on some special properties of the objective function, our functional mechanism generally applies to all forms of optimization functions.

Perturbing objective functions is inherently more challenging than perturbing scalar aggregate values, for two reasons. First, injecting noise into a function is non-trivial; as we show in the paper, simply adding noise to each coefficient of a function often leads to unbearably high noise levels, which in turn leads to nearly useless results. Second, not all noisy functions are valid objective functions; in particular, some noisy functions lead to unbounded results, and some others have multiple local minima.
The proposed functional mechanism solves these problems through a set of novel and non-trivial algorithms that perform random perturbations in the functional space. As case studies, we apply the functional mechanism to both linear and logistic regression. We prove that for both types of regression, the noise scale required by the proposed methods is constant with respect to the cardinality of the training set. Extensive experiments using real data demonstrate that the functional mechanism achieves highly accurate regression results with prediction power comparable to the unperturbed results, and that it significantly outperforms the existing solutions.

The remainder of the paper is organized as follows. Section 2 reviews related studies on differential privacy. Section 3 formally defines our problems. Section 4 describes the basic framework of the functional mechanism, and applies it to enforce ϵ-differential privacy on linear regression. Section 5 extends the mechanism to handle more complex objective functions, and solves the problem of differentially private logistic regression. Section 6 presents a post-processing module to ensure that the perturbed objective function has a unique optimal solution. Section 7 contains an extensive set of experimental evaluations. Finally, Section 8 concludes the paper with directions for future work.

2. RELATED WORK

Dwork et al. [9] propose ϵ-differential privacy and show that it can be enforced using the Laplace mechanism, which supports any queries whose outputs are real numbers (see Section 3 for details). This mechanism is widely adopted in the existing work, but most adoptions are restricted to aggregate queries (e.g., counts) or queries that can be reduced to simple aggregates. In particular, Hay et al. [3], Li et al. [7], Xiao et al. [30], and Cormode et al. [6] present methods for minimizing the worst-case error of a given set of count queries; Barak et al. [2] and Ding et al. [8] consider the publication of data cubes; Xu et al. [3] and Li et al. [8] focus on publishing histograms; McSherry and Mironov [20], Rastogi and Nath [24], and McSherry and Mahajan [9] devise methods for releasing counts on particular types of data, such as time series.

Complementary to the Laplace mechanism, McSherry and Talwar [2] propose the exponential mechanism, which works for any queries whose output spaces are discrete. This enables differentially private solutions for various interesting problems where the outputs are not real numbers. For instance, the exponential mechanism has been applied to the publication of auction results [2], coresets [10], frequent patterns [3], decision trees [], support vector machines [25], and synthetic datasets [7, 6].

Nevertheless, neither the Laplace mechanism nor the exponential mechanism can be easily adopted for regression analysis. The reason is that both mechanisms require a careful sensitivity analysis of the target problem, i.e., an analysis of how much the problem output would change when an arbitrary tuple in the input data is modified. Unfortunately, such an analysis is rather difficult for regression tasks, due to the complex correlations between regression inputs and outputs. To the best of our knowledge, the only existing work that targets regression analysis is by Chaudhuri et al. [4, 5], Smith [27], and Lei [6].

Specifically, Chaudhuri et al. [4, 5] show that, when the cost function of a regression task is convex and doubly differentiable, the regression can be performed with a differentially private algorithm based on objective perturbation. The algorithm, however, is inapplicable to standard logistic regression, as the cost function of logistic regression does not satisfy the convexity requirement. Instead, Chaudhuri et al. demonstrate that their algorithm can address a non-standard type of logistic regression with a modified input (see Section 3 for details). Nevertheless, it is unclear whether the modified logistic regression is useful in practice.
Smith [27] proposes a general framework for statistical analysis that utilizes both the Laplace mechanism and the exponential mechanism. However, the framework requires that the output space of the statistical analysis be bounded, which renders it inapplicable to both linear and logistic regression. For example, if we perform a linear regression on a three-dimensional dataset, the output would be two real numbers (i.e., the slopes of the regression plane on two different dimensions), both of which have an unbounded domain $(-\infty, +\infty)$ (see Section 3 for the details of linear regression).

Lei [6] proposes a regression method that avoids conducting sensitivity analysis directly on the regression outputs. In a nutshell, the method first employs the Laplace mechanism to produce a noisy multi-dimensional histogram of the input data. After that, it produces a synthetic dataset that matches the statistics in the noisy histogram, without looking at the original dataset. Finally, it utilizes the synthetic data to compute the regression results. Observe that the privacy guarantee of this method is solely decided by the procedure that generates the noisy histogram: the subsequent parts of the algorithm rely only on the histogram instead of the original data, and hence, they do not reveal any information about the input dataset beyond what the noisy histogram reveals. This makes it much easier to enforce ϵ-differential privacy, since the multi-dimensional histogram consists of only counts, which can be processed with the Laplace mechanism in a differentially private manner. Nevertheless, as will be shown in our experiments, Lei's method [6] (referred to as DPME) is restricted to datasets with small dimensionality. This is because, when the dimensionality of the input data increases, the method generates a noisy histogram with a coarser granularity, which in turn leads to inaccurate synthetic data and regression results.
In summary, none of the existing solutions produces satisfactory results for linear or logistic regression. Finally, it is worth mentioning that there exists a relaxed version of ϵ-differential privacy called (ϵ, δ)-differential privacy [23].

Under this privacy notion, a randomized algorithm is considered privacy preserving if it achieves ϵ-differential privacy with a high probability (decided by δ). This relaxed notion is useful in scenarios where ϵ-differential privacy is too strict to allow any meaningful results to be released (see [2, 5] for examples). As we will show in this paper, however, linear and logistic regression can be conducted effectively under ϵ-differential privacy, i.e., we do not need to resort to (ϵ, δ)-differential privacy to achieve meaningful regression results.

3. PRELIMINARIES

Let $D$ be a database that contains $n$ tuples $t_1, t_2, \ldots, t_n$ and $d + 1$ attributes $X_1, X_2, \ldots, X_d, Y$. For each tuple $t_i = (x_{i1}, x_{i2}, \ldots, x_{id}, y_i)$, we assume without loss of generality that $\sqrt{\sum_{j=1}^{d} x_{ij}^2} \le 1$. Our objective is to construct a regression model from $D$ that enables us to predict any tuple's value on $Y$ based on its values on $X_1, X_2, \ldots, X_d$, i.e., we aim to obtain a function $\rho$ that (i) takes $(x_{i1}, x_{i2}, \ldots, x_{id})$ as input and (ii) outputs a prediction of $y_i$ that is as accurate as possible.

Depending on the nature of the regression model, the function $\rho$ can be of various types, and it is always parameterized with a vector $\omega$ of real numbers. For example, for linear regression, $\rho$ is a linear function of $x_{i1}, x_{i2}, \ldots, x_{id}$, and the model parameter $\omega$ is a $d$-dimensional vector whose $j$-th ($j \in \{1, \ldots, d\}$) number equals the weight of $x_{ij}$ in the function. To evaluate whether $\omega$ leads to an accurate model, we have a cost function $f$ that (i) takes $t_i$ and $\omega$ as input and (ii) outputs a score that measures the difference between the original and predicted values of $y_i$ given $\omega$ as the model parameters. The optimal model parameter $\omega^*$ is defined as:

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} f(t_i, \omega).$$

Without loss of generality, we assume that $\omega$ contains $d$ values $\omega_1, \ldots, \omega_d$. In addition, we consider that $f(t_i, \omega)$ can be written as a function of $\omega_k$ ($k \in \{1, \ldots, d\}$) given $t_i$, as is the case for most regression tasks. We focus on two commonly used regression models, namely, linear regression and logistic regression, as defined in the following.
For convenience, we abuse notation and use $x_i$ ($i \in \{1, \ldots, n\}$) to denote $(x_{i1}, x_{i2}, \ldots, x_{id})$, and we use $(x_i, y_i)$ to denote $t_i$.

DEFINITION 1 (LINEAR REGRESSION). Assume without loss of generality that the attribute $Y$ in $D$ has a domain $[-1, 1]$. A linear regression on $D$ returns a prediction function $\rho(x_i) = x_i^T \omega^*$, where $\omega^*$ is a vector of $d$ real numbers that minimizes a cost function $f(t_i, \omega) = (y_i - x_i^T \omega)^2$, i.e.,

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} (y_i - x_i^T \omega)^2.$$

In other words, linear regression expresses the value of $Y$ as a linear function of the values of $X_1, \ldots, X_d$, such that the sum of squared errors of the predicted $Y$ values is minimized (see footnote 2).

Footnote 1 (on the assumption $\sqrt{\sum_{j=1}^{d} x_{ij}^2} \le 1$): This assumption can be easily enforced by changing each $x_{ij}$ to $\frac{x_{ij} - \alpha_j}{(\beta_j - \alpha_j) \cdot \sqrt{d}}$, where $\alpha_j$ and $\beta_j$ denote the minimum and maximum values in the domain of $X_j$.

DEFINITION 2 (LOGISTIC REGRESSION). Assume that the attribute $Y$ in $D$ has a boolean domain $\{0, 1\}$. A logistic regression on $D$ returns a prediction function, which predicts $y_i = 1$ with probability $\rho(x_i) = \exp(x_i^T \omega^*) / (1 + \exp(x_i^T \omega^*))$, where $\omega^*$ is a vector of $d$ real numbers that minimizes a cost function $f(t_i, \omega) = \log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega$. That is,

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} \left( \log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega \right).$$

For example, assume that $D$ contains three attributes $X_1$, $X_2$, and $Y$, such that $X_1$ (resp. $X_2$) represents a person's age (resp. body mass index), and $Y$ indicates whether or not the person has diabetes. In that case, a logistic regression on the database would return a function that maps a person's age and body mass index to the probability that he/she has diabetes, i.e., $\rho(x_i) = \Pr[y_i = 1]$.

This formulation of logistic regression is used extensively in the medical and social science fields to predict whether a certain event will occur given some observed variables. In [4, 5], Chaudhuri et al. consider a non-standard type of logistic regression with modified inputs.
In particular, they assume that for each tuple $t_i$, its value on $Y$ is not a boolean value that indicates whether $t_i$ satisfies a certain condition; instead, they assume that $y_i$ equals the probability that the condition is satisfied given $x_i$. For instance, if we were to use Chaudhuri et al.'s method to predict whether a person has diabetes based on his/her age and body mass index, then we would need a dataset that gives an accurate likelihood of diabetes for every possible (age, body mass index) combination. This requirement is rather impractical, as real datasets are often sparse (due to the curse of dimensionality or the existence of large-domain attributes) and the likelihoods are not measurable. Furthermore, Chaudhuri et al.'s method cannot be applied to datasets where $Y$ is a boolean attribute, since their method relies on a convexity property of the cost function. In standard logistic regression, the cost function $\log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega$ (or $\log(1 + \exp(-y_i x_i^T \omega))$ in [4, 5]) does not meet this assumption.

To ensure privacy protection, we require that the regression analysis be performed with an algorithm that satisfies ϵ-differential privacy, which is defined based on the concept of neighbor databases, i.e., databases that have the same cardinality but differ in one and only one tuple.

DEFINITION 3 (ϵ-DIFFERENTIAL PRIVACY [9]). A randomized algorithm $A$ satisfies ϵ-differential privacy, iff for any output $O$ of $A$ and for any two neighbor databases $D_1$ and $D_2$, we have

$$\Pr[A(D_1) = O] \le e^{\epsilon} \cdot \Pr[A(D_2) = O].$$

By Definition 3, if an algorithm $A$ satisfies ϵ-differential privacy for an ϵ close to 0, then the probability distribution of $A$'s output is roughly the same for any two input databases that differ in one tuple. This indicates that the output of $A$ does not reveal significant information about any particular tuple in the input, and hence, privacy is preserved.
Footnote 2 (on Definition 1): There is a more general form of linear regression with an objective function $(\omega^*, \alpha^*) = \arg\min_{\omega, \alpha} \sum_{i=1}^{n} (y_i - x_i^T \omega - \alpha)^2$. We focus only on the type of linear regression in Definition 1 for ease of exposition, but our solution can be easily extended to the more general variant.

As will be shown in Section 4, our solution is built upon the Laplace mechanism [9], which is a differentially private framework that can be used to answer any query $Q$ on $D$ whose output is a vector of real numbers. In particular, the mechanism exploits the sensitivity of $Q$, which is defined as

$$S(Q) = \max_{D_1, D_2} \| Q(D_1) - Q(D_2) \|_1, \qquad (1)$$

where $D_1$ and $D_2$ are any two neighbor databases, and $\| Q(D_1) - Q(D_2) \|_1$ is the $L_1$ distance between $Q(D_1)$ and $Q(D_2)$. Intuitively, $S(Q)$ captures the maximum change that could occur in the output of $Q$ when one tuple in the input data is replaced. Given $S(Q)$, the Laplace mechanism ensures ϵ-differential privacy by injecting noise $\eta$ into each value in the output of $Q(D)$, such that $\eta$ is drawn i.i.d. from a Laplace distribution with zero mean and scale $S(Q)/\epsilon$ (see [9] for details):

$$\mathrm{pdf}(\eta) = \frac{\epsilon}{2 S(Q)} \exp\left( -|\eta| \cdot \frac{\epsilon}{S(Q)} \right).$$

In the rest of the paper, we use $\mathrm{Lap}(s)$ to denote a random variable drawn from a Laplace distribution with zero mean and scale $s$. For ease of reference, we summarize in Table 1 all notations that will be frequently used.

4. FUNCTIONAL MECHANISM

This section presents the Functional Mechanism (FM), a general framework for regression analysis under ϵ-differential privacy. Section 4.1 introduces the details of the framework, while Section 4.2 illustrates how to apply FM to linear regression.

4.1 Perturbation of Objective Function

Roughly speaking, FM is an extension of the Laplace mechanism that (i) does not inject noise directly into the regression results, but (ii) ensures privacy by perturbing the optimization goal of regression analysis. To explain, recall that a regression task on a database $D$ returns a model parameter $\omega^*$ that minimizes an optimization function $f_D(\omega) = \sum_{t_i \in D} f(t_i, \omega)$. Direct publication of $\omega^*$ would violate ϵ-differential privacy, since $\omega^*$ reveals information about $f_D(\omega)$ and $D$. One may attempt to address this issue by adding noise to $\omega^*$ using the Laplace mechanism; however, this solution requires an analysis of the sensitivity of $\omega^*$ (see Equation 1), which is rather challenging given the complex correlation between $D$ and $\omega^*$.
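To make the Laplace mechanism of Section 3 concrete, the following is a minimal Python sketch of it. The function names, the seed, and the example count query are assumptions introduced here for illustration and do not come from the paper. The sampler relies on the standard fact that the difference of two i.i.d. exponential variables with mean $s$ follows a zero-mean Laplace distribution with scale $s$.

```python
import random

def sample_laplace(scale, rng):
    """Draw one value from a zero-mean Laplace distribution with the given scale (> 0)."""
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(answer, sensitivity, epsilon, rng):
    """Perturb each coordinate of a real-valued query answer with Lap(S(Q)/epsilon)."""
    scale = sensitivity / epsilon
    return [value + sample_laplace(scale, rng) for value in answer]

# Hypothetical example: a single count whose value changes by at most 1
# when one tuple is replaced, so S(Q) = 1 is assumed here.
rng = random.Random(2012)
noisy_count = laplace_mechanism([120.0], sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller ϵ gives a larger noise scale $S(Q)/\epsilon$, which is the privacy and accuracy trade-off that the rest of this section tries to manage for regression.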
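Instead of injecting noise directly into $\omega^*$, FM achieves ϵ-differential privacy by (i) perturbing the objective function $f_D(\omega)$ and then (ii) releasing the model parameter $\bar{\omega}$ that minimizes the perturbed objective function $\bar{f}_D(\omega)$ (instead of the original one). A key issue here is: how can we perturb $f_D(\omega)$ in a differentially private manner, given that $f_D(\omega)$ can be a complicated function of $\omega$? We address this issue by exploiting the polynomial representation of $f_D(\omega)$, as will be shown in the following.

Recall that $\omega$ is a vector that contains $d$ values $\omega_1, \ldots, \omega_d$. Let $\phi(\omega)$ denote a product of $\omega_1, \ldots, \omega_d$, namely, $\phi(\omega) = \omega_1^{c_1} \cdot \omega_2^{c_2} \cdots \omega_d^{c_d}$ for some $c_1, \ldots, c_d \in \mathbb{N}$. Let $\Phi_j$ ($j \in \mathbb{N}$) denote the set of all products of $\omega_1, \ldots, \omega_d$ with degree $j$, i.e.,

$$\Phi_j = \left\{ \omega_1^{c_1} \omega_2^{c_2} \cdots \omega_d^{c_d} \;\middle|\; \sum_{l=1}^{d} c_l = j \right\}. \qquad (2)$$

For example, $\Phi_0 = \{1\}$, $\Phi_1 = \{\omega_1, \ldots, \omega_d\}$, and $\Phi_2 = \{\omega_i \cdot \omega_j \mid i, j \in [1, d]\}$. By the Stone-Weierstrass Theorem [26], any continuous and differentiable $f(t_i, \omega)$ can always be written as a (potentially infinite) polynomial of $\omega_1, \ldots, \omega_d$, i.e., for some $J \in [0, \infty]$, we have

$$f(t_i, \omega) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \lambda_{\phi t_i} \phi(\omega), \qquad (3)$$

where $\lambda_{\phi t_i} \in \mathbb{R}$ denotes the coefficient of $\phi(\omega)$ in the polynomial. Similarly, $f_D(\omega)$ can also be expressed as a polynomial of $\omega_1, \ldots, \omega_d$.

Table 1: Table of notations

  Notation : Description
  $D$ : database of $n$ records
  $t_i = (x_i, y_i)$ : the $i$-th tuple in $D$
  $d$ : the number of values in the vector $x_i$
  $\omega$ : the parameter vector of the regression model
  $f(t_i, \omega)$ : the cost function of the regression model, which evaluates whether a model parameter $\omega$ leads to an accurate prediction for a tuple $t_i = (x_i, y_i)$
  $f_D(\omega)$ : $f_D(\omega) = \sum_{t_i \in D} f(t_i, \omega)$
  $\omega^*$ : $\omega^* = \arg\min_{\omega} f_D(\omega)$
  $\bar{f}_D(\omega)$ : a noisy version of $f_D(\omega)$
  $\bar{\omega}$ : $\bar{\omega} = \arg\min_{\omega} \bar{f}_D(\omega)$
  $\tilde{f}_D(\omega)$ : the Taylor expansion of $f_D(\omega)$
  $\tilde{\omega}$ : $\tilde{\omega} = \arg\min_{\omega} \tilde{f}_D(\omega)$
  $\hat{f}_D(\omega)$ : a low-order approximation of $\tilde{f}_D(\omega)$
  $\hat{\omega}$ : $\hat{\omega} = \arg\min_{\omega} \hat{f}_D(\omega)$
  $\phi(\omega)$ : a product of one or more values in $\omega$, e.g., $\omega_1^3 \cdot \omega_2$
  $\Phi_j$ : the set of all possible $\phi(\omega)$ of order $j$
  $\lambda_{\phi t_i}$ : the polynomial coefficient of $\phi(\omega)$ in $f(t_i, \omega)$

Given the above polynomial representation of $f_D(\omega)$, we perturb $f_D(\omega)$ by injecting Laplace noise into its polynomial coefficients, and then derive the model parameter $\bar{\omega}$ that minimizes the perturbed function $\bar{f}_D(\omega)$, as shown in Algorithm 1.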
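As an illustration of the monomial sets $\Phi_j$ in Equation 2 and the coefficient representation in Equation 3, the Python sketch below enumerates each monomial as an exponent tuple $(c_1, \ldots, c_d)$ and evaluates a polynomial stored as a mapping from exponent tuples to coefficients. The function names and the dictionary layout are assumptions made for this sketch, not part of the paper.

```python
from itertools import combinations_with_replacement

def monomials_of_degree(d, j):
    """Return Phi_j as exponent tuples (c_1, ..., c_d) with c_1 + ... + c_d = j."""
    exponents = []
    # Each multiset of j variable indices corresponds to exactly one monomial.
    for combo in combinations_with_replacement(range(d), j):
        c = [0] * d
        for index in combo:
            c[index] += 1
        exponents.append(tuple(c))
    return exponents

def evaluate(coefficients, omega):
    """Evaluate sum over phi of lambda_phi * phi(omega) for {exponents: lambda_phi}."""
    total = 0.0
    for exps, lam in coefficients.items():
        term = lam
        for w, c in zip(omega, exps):
            term *= w ** c
        total += term
    return total
```

For $d = 2$, `monomials_of_degree(2, 2)` lists the exponent tuples of $\omega_1^2$, $\omega_1 \omega_2$ and $\omega_2^2$, and `monomials_of_degree(2, 0)` contains only the constant monomial.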
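The correctness of Algorithm 1 is based on the following lemma and theorem.

LEMMA 1. Let $D$ and $D'$ be any two neighbor databases. Let $f_D(\omega)$ and $f_{D'}(\omega)$ be the objective functions of regression analysis on $D$ and $D'$, respectively, and denote their polynomial representations as follows:

$$f_D(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \sum_{t_i \in D} \lambda_{\phi t_i} \phi(\omega), \qquad f_{D'}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \sum_{t'_i \in D'} \lambda_{\phi t'_i} \phi(\omega).$$

Then, we have the following inequality:

$$\sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \sum_{t'_i \in D'} \lambda_{\phi t'_i} \right\|_1 \le 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t} \|_1,$$

where $t$ is an arbitrary tuple.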

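Algorithm 1 Functional Mechanism (Database $D$, objective function $f_D(\omega)$, privacy budget $\epsilon$)
  1: Set $\Delta = 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t} \|_1$
  2: for each $0 \le j \le J$ do
  3:     for each $\phi \in \Phi_j$ do
  4:         set $\bar{\lambda}_{\phi} = \sum_{t_i \in D} \lambda_{\phi t_i} + \mathrm{Lap}(\Delta / \epsilon)$
  5:     end for
  6: end for
  7: Let $\bar{f}_D(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \bar{\lambda}_{\phi} \phi(\omega)$
  8: Compute $\bar{\omega} = \arg\min_{\omega} \bar{f}_D(\omega)$
  9: Return $\bar{\omega}$

PROOF (of Lemma 1). Without loss of generality, assume that $D$ and $D'$ differ in the last tuple. Let $t_n$ (resp. $t'_n$) be the last tuple in $D$ (resp. $D'$). Then,

$$\sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \sum_{t'_i \in D'} \lambda_{\phi t'_i} \right\|_1 = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t_n} - \lambda_{\phi t'_n} \|_1 \le \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t_n} \|_1 + \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t'_n} \|_1 \le 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t} \|_1.$$

THEOREM 1. Algorithm 1 satisfies ϵ-differential privacy.

PROOF. Let $D$ and $D'$ be two neighbor databases. Without loss of generality, assume that $D$ and $D'$ differ in the last tuple. Let $t_n$ (resp. $t'_n$) be the last tuple in $D$ (resp. $D'$). Let $\Delta$ be calculated as on Line 1 of Algorithm 1, and let $\bar{f}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \bar{\lambda}_{\phi} \phi(\omega)$ be the output of Line 7 of the algorithm. Since each coefficient is perturbed independently with Laplace noise of scale $\Delta / \epsilon$, we have

$$\frac{\Pr\{\bar{f}(\omega) \mid D\}}{\Pr\{\bar{f}(\omega) \mid D'\}} = \prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \frac{\exp\left( -\frac{\epsilon}{\Delta} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \bar{\lambda}_{\phi} \right\|_1 \right)}{\exp\left( -\frac{\epsilon}{\Delta} \left\| \sum_{t'_i \in D'} \lambda_{\phi t'_i} - \bar{\lambda}_{\phi} \right\|_1 \right)}$$

$$\le \prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \exp\left( \frac{\epsilon}{\Delta} \left\| \sum_{t_i \in D} \lambda_{\phi t_i} - \sum_{t'_i \in D'} \lambda_{\phi t'_i} \right\|_1 \right) = \prod_{j=1}^{J} \prod_{\phi \in \Phi_j} \exp\left( \frac{\epsilon}{\Delta} \| \lambda_{\phi t_n} - \lambda_{\phi t'_n} \|_1 \right)$$

$$= \exp\left( \frac{\epsilon}{\Delta} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t_n} - \lambda_{\phi t'_n} \|_1 \right) \le \exp\left( \frac{\epsilon}{\Delta} \cdot 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t} \|_1 \right) \text{ (by Lemma 1)} = \exp(\epsilon).$$

In other words, the computation of $\bar{f}(\omega)$ ensures ϵ-differential privacy. The final result of Algorithm 1 is derived from $\bar{f}(\omega)$ without using any additional information from the original database. Therefore, Algorithm 1 is ϵ-differentially private.

One potential issue with Algorithm 1 is that the optimization on the noisy objective function $\bar{f}_D(\omega)$ (see Line 8) can be unbounded when the amount of noise inserted is sufficiently large, leading to meaningless regression results. We address this issue later in Section 6.

In the following, we provide a convergence analysis of Algorithm 1, showing that its output is arbitrarily close to the actual minimizer of $f_D(\omega)$ when the database cardinality $n$ is sufficiently large. Our analysis focuses on the averaged objective function $\frac{1}{n} f_D(\omega)$ instead of $f_D(\omega)$, since the latter monotonically increases with $n$. Assume we have a series of databases $\{D^1, D^2, \ldots, D^n, \ldots\}$, where each $D^j$ contains $j$ tuples, all drawn from a fixed but unknown distribution with probability distribution function $p(t)$.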
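We have the following lemma.

LEMMA 2. If $\lambda_{\phi t}$ is a bounded real number in $(-\infty, +\infty)$ for any $t$ and $\phi \in \cup_{j=1}^{J} \Phi_j$, there exists a polynomial $g(\omega)$ with constant coefficients such that $\lim_{n \to \infty} \frac{1}{n} f_{D^n}(\omega) = g(\omega)$.

PROOF. Based on our polynomial representation scheme, $\frac{1}{n} f_{D^n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left( \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i} \right) \phi(\omega)$, where each $t_i$ is an i.i.d. sample from $p(t)$. When the database cardinality $n$ approaches $+\infty$, we rewrite $\frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i}$ as follows:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i} = \int_{t} \lambda_{\phi t} \, p(t) \, dt = E(\lambda_{\phi t}) = c_{\phi}. \qquad (4)$$

By the assumption that $\lambda_{\phi t}$ is bounded, $c_{\phi} = E(\lambda_{\phi t})$ is a constant that always exists for any $\phi \in \cup_{j=1}^{J} \Phi_j$. Thus, $\lim_{n \to \infty} \frac{1}{n} f_{D^n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} c_{\phi} \phi(\omega)$. This completes the proof, by letting $g(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} c_{\phi} \phi(\omega)$.

THEOREM 2. When database cardinality $n \to +\infty$, the output $\bar{\omega}$ of Algorithm 1 satisfies $g(\bar{\omega}) = \min_{\omega} g(\omega)$, if $\lambda_{\phi t}$ is bounded for any $t$ and $\phi \in \cup_{j=1}^{J} \Phi_j$.

PROOF. To prove this theorem, we first show that $\lim_{n \to \infty} \frac{1}{n} \bar{f}_{D^n}(\omega) = g(\omega)$ for any $\omega$. Given Algorithm 1 with input dataset $D^n$, objective function $f_{D^n}(\omega)$ and privacy budget $\epsilon$, the averaged perturbed objective function is

$$\frac{1}{n} \bar{f}_{D^n}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \frac{\bar{\lambda}_{\phi}}{n} \phi(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \frac{1}{n} \left( \sum_{i=1}^{n} \lambda_{\phi t_i} + \mathrm{Lap}\left( \frac{\Delta}{\epsilon} \right) \right) \phi(\omega).$$

When $n \to +\infty$, we have

$$\lim_{n \to \infty} \frac{1}{n} \left( \sum_{i=1}^{n} \lambda_{\phi t_i} + \mathrm{Lap}\left( \frac{\Delta}{\epsilon} \right) \right) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi t_i} + \lim_{n \to \infty} \frac{1}{n} \mathrm{Lap}\left( \frac{\Delta}{\epsilon} \right) = c_{\phi} + \lim_{n \to \infty} \mathrm{Lap}\left( \frac{\Delta}{n \epsilon} \right).$$

When $\Delta$ and $\epsilon > 0$ are both finite real numbers, it follows that $\lim_{n \to \infty} \mathrm{Lap}\left( \frac{\Delta}{n \epsilon} \right) = 0$. This leads to

$$\lim_{n \to \infty} \frac{1}{n} \bar{f}_{D^n}(\omega) = g(\omega). \qquad (5)$$

Since Equation 5 holds for any $\omega$, we obtain $g(\bar{\omega}) = \min_{\omega} g(\omega)$ by showing that $\lim_{n \to \infty} \frac{1}{n} \bar{f}_{D^n}(\bar{\omega}) = \min_{\omega} \lim_{n \to \infty} \frac{1}{n} \bar{f}_{D^n}(\omega)$, which follows from the definition of $\bar{\omega}$ in Algorithm 1.

Combining the results of Lemma 2 and Theorem 2, we conclude that the output of Algorithm 1 approaches the minimizer of $f_{D^n}(\omega)$ when database cardinality $n \to +\infty$.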
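The steps of Algorithm 1 and the convergence behaviour above can be sketched for the simplest case: a single parameter $\omega$ and the squared-error cost $(y - x\omega)^2$ of Definition 1. This is an illustrative sketch under stated assumptions rather than the authors' implementation. The function names are hypothetical, the closed-form minimizer is used only because the noisy objective is a quadratic, and the value $\Delta = 8$ used in the example assumes $|x| \le 1$ and $|y| \le 1$, so that Line 1 gives $2 \max (y^2 + 2|xy| + x^2) = 8$.

```python
import random

def laplace(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def squared_error_coefficients(x, y):
    """Coefficients of (y - x*w)^2 as a polynomial in w: [constant, linear, quadratic]."""
    return [y * y, -2.0 * x * y, x * x]

def functional_mechanism_1d(data, epsilon, delta, rng):
    """Sketch of Algorithm 1 for a degree-2 objective in one parameter w."""
    # Lines 2-6: sum the per-tuple coefficients, then add Lap(delta/epsilon) to each sum.
    sums = [0.0, 0.0, 0.0]
    for x, y in data:
        for k, lam in enumerate(squared_error_coefficients(x, y)):
            sums[k] += lam
    noisy = [s + laplace(delta / epsilon, rng) for s in sums]
    # Line 8: minimise the noisy quadratic c + b*w + a*w^2 in closed form.
    c, b, a = noisy
    if a <= 0:
        return None  # unbounded noisy objective; the paper defers this case to Section 6
    return -b / (2.0 * a)

# Hypothetical data: many copies of one tuple, whose exact minimiser is w = 0.5.
example = functional_mechanism_1d([(0.5, 0.25)] * 1000, epsilon=1.0, delta=8.0,
                                  rng=random.Random(7))
```

Because the noise scale $\Delta / \epsilon$ does not grow with $n$ while the coefficient sums do, the noisy minimizer drifts toward the exact one as more tuples are added, which is the intuition behind Theorem 2.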

4.2 Application to Linear Regression

After presenting the general framework, we next apply FM to linear regression as defined in Definition 1. In linear regression, recall that $t_i = (x_i, y_i)$ is the $i$-th tuple in database $D$ with $\sqrt{\sum_{j=1}^{d} x_{ij}^2} \le 1$ and $y_i \in [-1, 1]$; $\omega$ is a $d$-dimensional vector that contains the model parameters. The expansion of the objective function $f_D(\omega)$ for linear regression is

$$f_D(\omega) = \sum_{t_i \in D} (y_i - x_i^T \omega)^2 = \sum_{t_i \in D} y_i^2 - \sum_{j=1}^{d} \left( 2 \sum_{t_i \in D} y_i x_{ij} \right) \omega_j + \sum_{1 \le j, l \le d} \left( \sum_{t_i \in D} x_{ij} x_{il} \right) \omega_j \omega_l.$$

Therefore, $f_D(\omega)$ only involves monomials in $\Phi_0$, $\Phi_1$ and $\Phi_2$. Since each $x_i$ lies in the $d$-dimensional unit sphere and $y_i \in [-1, 1]$, given the objective function of linear regression, Line 1 of Algorithm 1 can calculate the parameter $\Delta$ as

$$\Delta = 2 \max_{t = (x, y)} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \| \lambda_{\phi t} \|_1 \le 2 \max_{t = (x, y)} \left( y^2 + 2 \sum_{j=1}^{d} y x_{(j)} + \sum_{1 \le j, l \le d} x_{(j)} x_{(l)} \right) \le 2 (1 + 2d + d^2),$$

where $t$ is an arbitrary tuple and $x_{(j)}$ denotes the $j$-th dimension of vector $x$. Thus, Algorithm 1 adds $\mathrm{Lap}(2 (d + 1)^2 / \epsilon)$ noise to each coefficient, and the optimization on $\omega$ is run on the noisy objective function.

For example, assume that we have a two-dimensional database $D$ with three tuples: $(x_1, y_1) = (1, 0.4)$, $(x_2, y_2) = (0.9, 0.3)$, and $(x_3, y_3) = (0.5, 1)$. The objective function for linear regression is $f_D(\omega) = 2.06 \omega^2 - 2.34 \omega + 1.25$, with optimal $\omega^* = 117/206 \approx 0.568$. If we apply Algorithm 1 on $D$, then Line 1 of Algorithm 1 would set $\Delta = 2 (d + 1)^2 = 8$, and then generate the noisy objective function $\bar{f}_D(\omega)$. Figure 2 shows an example of $f_D(\omega)$ and $\bar{f}_D(\omega)$. Notice that the global optimum of $\bar{f}_D(\omega)$ is close to the original $\omega^*$ when the coefficients are approximately preserved.

The analysis for linear regression is fairly simple, because the objective function is itself a polynomial in $\omega$. For other regression tasks (e.g., logistic regression), Algorithm 1 cannot be directly applied, as the objective function may not be a polynomial of finite order. In the next section, we present a solution to this problem.

5. POLYNOMIAL APPROXIMATION OF OBJECTIVE FUNCTIONS

For Algorithm 1 to work, it is crucial that the polynomial form of the objective function $f_D(\omega)$ contains only terms with bounded degrees.
While this condition holds for certain types of regression analysis (e.g., linear regression, as we have shown in Section 4), there exist regression tasks where the condition cannot be satisfied (e.g., logistic regression). To address this issue, this section presents a method for deriving an approximate polynomial form of $f_D(\omega)$ based on Taylor expansions. For ease of exposition, we focus on logistic regression, but our method can be adopted for other types of regression tasks as well.

[Figure 2: Example of objective function for linear regression and its noisy version obtained by FM. Curves: $f_D(\omega)$ and $\bar{f}_D(\omega)$.]

5.1 Expansion

Consider the cost function $f(t_i, \omega)$ of regression analysis. Assume that there exist $2m$ functions $f_1, \ldots, f_m$ and $g_1, \ldots, g_m$, such that (i) $f(t_i, \omega) = \sum_{l=1}^{m} f_l(g_l(t_i, \omega))$, and (ii) each $g_l$ is a polynomial function of $\omega_1, \ldots, \omega_d$. As will be shown shortly, such a decomposition of $f(t_i, \omega)$ is useful for handling logistic regression.

Given the above decomposition of $f(t_i, \omega)$, we can apply the Taylor expansion to each $f_l$ to obtain the following equation:

$$\tilde{f}(t_i, \omega) = \sum_{l=1}^{m} \sum_{k=0}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k, \qquad (6)$$

where each $z_l$ is a real number. Accordingly, the objective function $\tilde{f}_D(\omega)$ can be written as:

$$\tilde{f}_D(\omega) = \sum_{i=1}^{n} \sum_{l=1}^{m} \sum_{k=0}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k. \qquad (7)$$

To explain how Equations 6 and 7 relate to logistic regression, recall that the cost function of logistic regression is $f(t_i, \omega) = \log(1 + \exp(x_i^T \omega)) - y_i x_i^T \omega$. Let $f_1$, $f_2$, $g_1$, and $g_2$ be four functions defined as follows:

$$g_1(t_i, \omega) = x_i^T \omega, \quad g_2(t_i, \omega) = y_i x_i^T \omega, \quad f_1(z) = \log(1 + \exp(z)), \quad f_2(z) = -z.$$

Then, we have $f(t_i, \omega) = f_1(g_1(t_i, \omega)) + f_2(g_2(t_i, \omega))$. By Equations 6 and 7,

$$\tilde{f}_D(\omega) = \sum_{i=1}^{n} \sum_{l=1}^{2} \sum_{k=0}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k. \qquad (8)$$

Since $f_2(z) = -z$, we have $f_2^{(k)} = 0$ for any $k > 1$. Given this fact and by setting $z_l = 0$, Equation 8 can be simplified to

$$\tilde{f}_D(\omega) = \sum_{i=1}^{n} \sum_{k=0}^{\infty} \frac{f_1^{(k)}(0)}{k!} \left( x_i^T \omega \right)^k - \left( \sum_{i=1}^{n} y_i x_i^T \right) \omega. \qquad (9)$$

There are two complications in Equation 9 that prevent us from applying it to private logistic regression. First, the equation involves an infinite summation. Second, the term $f_1^{(k)}(0)$ involved in the equation does not have a closed-form solution.
To address these two issues, we present an approximate approach that reduces the degree of the summation, and the approach only requires the value of $f_1^{(k)}(0)$ for $k = 0, 1, 2$, i.e., $f_1^{(0)}(0) = \log 2$, $f_1^{(1)}(0) = \frac{1}{2}$, and $f_1^{(2)}(0) = \frac{1}{4}$.
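The three derivative values above can be sanity-checked numerically. The Python sketch below builds the degree-2 Taylor polynomial of $f_1(z) = \log(1 + \exp(z))$ around $z = 0$ from those values and compares it with $f_1$ itself. The function names, the finite-difference step and the interval $[-1, 1]$ used for the comparison are choices made for this illustration, not part of the paper.

```python
import math

def f1(z):
    """f_1(z) = log(1 + exp(z)), the non-polynomial part of the logistic cost."""
    return math.log(1.0 + math.exp(z))

def taylor2_f1(z):
    """Degree-2 Taylor polynomial of f1 around 0: log 2 + (1/2) z + (1/4)/2! z^2."""
    return math.log(2.0) + 0.5 * z + 0.125 * z * z

def numeric_derivatives_at_zero(h=1e-4):
    """Central finite-difference estimates of f1'(0) and f1''(0)."""
    first = (f1(h) - f1(-h)) / (2.0 * h)
    second = (f1(h) - 2.0 * f1(0.0) + f1(-h)) / (h * h)
    return first, second

# Largest absolute gap between f1 and its truncation on a grid over [-1, 1].
grid = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
max_gap = max(abs(f1(z) - taylor2_f1(z)) for z in grid)
```

On this interval the gap between $f_1$ and its truncation is on the order of $10^{-3}$, which gives a feel for why discarding the higher-order terms can be tolerable when the argument $x_i^T \omega$ stays small.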

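5.2 Approximation

Our approximation approach works by truncating the Taylor series in Equation 9 to remove all polynomial terms with order larger than 2. This leads to a new objective function with only low-order polynomials, as follows:

$$\hat{f}_D(\omega) = \sum_{l=1}^{m} \sum_{i=1}^{n} \hat{f}_l(g_l(t_i, \omega)) = \sum_{l=1}^{m} \sum_{i=1}^{n} \sum_{k=0}^{2} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k. \qquad (10)$$

A natural question is: how much error would the above approximation approach incur? The following lemmata provide the answer.

LEMMA 3. Let $\tilde{\omega} = \arg\min_{\omega} \tilde{f}_D(\omega)$ and $\hat{\omega} = \arg\min_{\omega} \hat{f}_D(\omega)$. Let $L = \max_{\omega} \left( \tilde{f}_D(\omega) - \hat{f}_D(\omega) \right)$ and $S = \min_{\omega} \left( \tilde{f}_D(\omega) - \hat{f}_D(\omega) \right)$. We have the following inequality:

$$\tilde{f}_D(\hat{\omega}) - \tilde{f}_D(\tilde{\omega}) \le L - S. \qquad (11)$$

PROOF. Observe that $L \ge \tilde{f}_D(\hat{\omega}) - \hat{f}_D(\hat{\omega})$ and $S \le \tilde{f}_D(\tilde{\omega}) - \hat{f}_D(\tilde{\omega})$. Therefore,

$$\tilde{f}_D(\hat{\omega}) - \hat{f}_D(\hat{\omega}) - \tilde{f}_D(\tilde{\omega}) + \hat{f}_D(\tilde{\omega}) \le L - S.$$

In addition, $\hat{f}_D(\hat{\omega}) - \hat{f}_D(\tilde{\omega}) \le 0$, because $\hat{\omega}$ minimizes $\hat{f}_D$. Hence, Equation 11 holds.

Lemma 3 shows that the error incurred by truncating the Taylor series depends on the maximum and minimum values of $\tilde{f}_D(\omega) - \hat{f}_D(\omega)$. To quantify the magnitude of the error, we first rewrite $\tilde{f}_D(\omega) - \hat{f}_D(\omega)$ in a form similar to Equation 8:

$$\tilde{f}_D(\omega) - \hat{f}_D(\omega) = \sum_{l=1}^{m} \sum_{i=1}^{n} \sum_{k=3}^{\infty} \frac{f_l^{(k)}(z_l)}{k!} \left( g_l(t_i, \omega) - z_l \right)^k. \qquad (12)$$

To derive the minimum and maximum values of the function above, we look into the remainder of the Taylor expansion. The following lemma provides lower and upper bounds on $\tilde{f}_D(\omega) - \hat{f}_D(\omega)$, which is a well-known result [].

LEMMA 4. For any $z \in [z_l - 1, z_l + 1]$, $\frac{1}{n} \left( \tilde{f}_D(\omega) - \hat{f}_D(\omega) \right)$ must be in the interval

$$\left[ \sum_{l} \frac{\min_{z} f_l^{(3)}(z) (z - z_l)^3}{6}, \; \sum_{l} \frac{\max_{z} f_l^{(3)}(z) (z - z_l)^3}{6} \right].$$

By combining Lemmata 3 and 4, we can easily calculate the error incurred by our approximation approach. In particular, the error only depends on the structure of the function, and is independent of the characteristics of the dataset. Furthermore, the average error of the approximation is always bounded, since

$$\frac{1}{n} \left( \tilde{f}_D(\hat{\omega}) - \tilde{f}_D(\tilde{\omega}) \right) \le \sum_{l} \frac{\max_{z} f_l^{(3)}(z) (z - z_l)^3 - \min_{z} f_l^{(3)}(z) (z - z_l)^3}{6}.$$

[Figure 3: Example of objective function for logistic regression and its polynomial approximation. Curves: $f_D(\omega)$ and $\hat{f}_D(\omega)$.]

The above analysis applies to the case of logistic regression as follows. First, for the function $f_1(z) = \log(1 + \exp(z))$, we have

$$f_1^{(3)}(z) = \frac{\exp(z) - \exp(z)^2}{(1 + \exp(z))^3}.$$

It can be verified that, for $z \in [-1, 1]$, $\min_{z} f_1^{(3)}(z) = \frac{e - e^2}{(1 + e)^3}$ and $\max_{z} f_1^{(3)}(z) = \frac{e^2 - e}{(1 + e)^3}$.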
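The two extrema quoted above can be checked numerically. The following Python sketch evaluates $f_1^{(3)}$ on a grid over $[-1, 1]$ (the interval given by Lemma 4 with $z_1 = 0$) and compares the grid extrema with the closed forms. It then forms the difference of the two extrema divided by 6, which is the quantity that Lemma 4 turns into the average-error bound when $|z - z_1| \le 1$. The grid resolution and all variable names are assumptions of this sketch.

```python
import math

def f1_third(z):
    """Third derivative of log(1 + exp(z)): (e^z - e^{2z}) / (1 + e^z)^3."""
    e = math.exp(z)
    return (e - e * e) / (1.0 + e) ** 3

# Evaluate on an evenly spaced grid over [-1, 1], endpoints included.
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
lowest = min(f1_third(z) for z in grid)
highest = max(f1_third(z) for z in grid)

# Closed form quoted in the text for the maximum; the minimum is its negative.
closed_form = (math.e ** 2 - math.e) / (1.0 + math.e) ** 3

# Spread of the Lagrange-remainder bounds with |z - z_1|^3 <= 1.
error_bound = (highest - lowest) / 6.0
```

On this grid the extrema are attained at the endpoints $z = -1$ and $z = 1$, in agreement with the closed forms, and the resulting bound does not depend on the dataset.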
Thus, the average error of the approximation is at most

$$\frac{1}{n}\big(\tilde{f}_D(\hat{\omega}) - \tilde{f}_D(\tilde{\omega})\big) \le \frac{e^2 - e}{6(1 + e)^3} - \frac{e - e^2}{6(1 + e)^3} = \frac{e^2 - e}{3(1 + e)^3}.$$

In other words, the error of the approximation on logistic regression is a small constant. However, because of this error, there does not exist a convergence result similar to the one stated in Theorem 2. That is, there is a gap between the results from our approximation approach and those from a standard regression algorithm. To illustrate this, let us consider a two-dimensional database $D$ with three tuples, $(x_1, y_1) = (-0.5, 1)$, $(x_2, y_2) = (0, 0)$, and $(x_3, y_3) = (1, 1)$. Figure 3 illustrates the objective function of logistic regression $f_D(\omega)$ as well as its approximation $\hat{f}_D(\omega)$. As will be shown in our experiments, however, our approximation approach still leads to accurate regression results.

5.3 Application to Logistic Regression

Algorithm 2 presents an extension of Algorithm 1 that incorporates our polynomial approximation approach. In particular, given a dataset $D$, Algorithm 2 first constructs a new objective function $\hat{f}_D(\omega)$ that approximates the original one, and then feeds the new objective function as input to Algorithm 1. The model parameter returned from Algorithm 1 is then output as the final result of Algorithm 2. It can be verified that Algorithm 2 guarantees ϵ-differential privacy. This follows from the fact that (i) Algorithm 1 guarantees ϵ-differential privacy for any given objective function, regardless of whether it is an approximation, and (ii) the output of Algorithm 2 is directly obtained from Algorithm 1.

To apply Algorithm 2 for logistic regression, we set

$$\hat{f}_D(\omega) = \sum_{i=1}^{n}\sum_{k=0}^{2} \frac{f_1^{(k)}(0)}{k!}\big(x_i^T\omega\big)^k - \left(\sum_{i=1}^{n} y_i x_i^T\right)\omega.$$

This is by Equation 8 and the fact that $\hat{f}_D(\omega)$ retains only the low order terms in $\tilde{f}_D(\omega)$. After that, when $\hat{f}_D(\omega)$ is fed as part of the input to Algorithm 1, Line 1 of Algorithm 1 would calculate the

parameter $\Delta$ as:

$$\Delta = 2\max_{t=(x,y)}\left(\frac{f_1^{(1)}(0)}{1!}\sum_{j=1}^{d} x_{(j)} + \frac{f_1^{(2)}(0)}{2!}\sum_{j,l} x_{(j)}x_{(l)} + y\sum_{j=1}^{d} x_{(j)}\right) \le 2\left(\frac{d}{2} + \frac{d^2}{8} + d\right) = \frac{d^2}{4} + 3d,$$

where $t$ is an arbitrary tuple and $x_{(j)}$ denotes the $j$-th dimension of the vector $x$. Recall that Algorithm 1 injects Laplace noise with scale $\Delta/\epsilon$ to the coefficients of the objective function (see Line 4 of Algorithm 1). Therefore, $\Delta = d^2/4 + 3d$ indicates that the amount of noise injected by our algorithm is only related to $d$ and is independent of the dataset cardinality.

6. AVOIDING UNBOUNDED NOISY OBJECTIVE FUNCTIONS

As shown in the previous sections, FM achieves ϵ-differential privacy by injecting Laplace noise into the coefficients of the objective functions of optimization problems. The injection of noise, however, may render the objective function unbounded, i.e., there may not exist any optimal solution for the noisy objective function. For instance, if we fit a linear regression model on a two dimensional dataset, the objective function would be a quadratic function $f_D(\omega) = a\omega^2 + b\omega + c$ with a minimum point (see Figure 2 for an example). If we add noise into the coefficients of $f_D(\omega)$, however, the resulting objective function may no longer have a minimum, i.e., when coefficient $a$ becomes non-positive after noise injection. In that case, there does not exist a solution to the optimization problem.

One simple approach to address the above issue is to re-run FM whenever the noisy objective function is unbounded, until we obtain a solution to the optimization problem. This approach, as shown in the following lemma, ensures differential privacy but incurs two times the privacy cost of FM.

LEMMA 5. Let $A^*$ be an algorithm that repeats Algorithm 1 with privacy budget ϵ on a dataset, until the output of Algorithm 1 corresponds to a bounded objective function. Then, $A^*$ satisfies $(2\epsilon)$-differential privacy.

PROOF. Let $D$ and $D'$ be any two neighbor datasets, $A$ be Algorithm 1, and $O$ be any output of $A$.
Since $A$ ensures ϵ-differential privacy (see Theorem 1), we have

$$e^{-\epsilon}\Pr[A(D') = O] \le \Pr[A(D) = O] \le e^{\epsilon}\Pr[A(D') = O]. \qquad (12)$$

Let $\mathcal{O}^+$ be the set of outputs by Algorithm 1 that correspond to bounded objective functions. For any $O^+ \in \mathcal{O}^+$, we have

$$\Pr[A^*(D) = O^+] = \frac{\Pr[A(D) = O^+]}{\sum_{O' \in \mathcal{O}^+}\Pr[A(D) = O']} \le \frac{e^{\epsilon}\Pr[A(D') = O^+]}{e^{-\epsilon}\sum_{O' \in \mathcal{O}^+}\Pr[A(D') = O']} = e^{2\epsilon}\Pr[A^*(D') = O^+],$$

where the inequality is by Eqn. 12.

Algorithm 2 Functional Mechanism (Database $D$, objective function $f_D(\omega)$, privacy budget ε)
1: Decompose the function $f(t_i, \omega) = \sum_{l=1}^{m} f_l\big(g_l(t_i, \omega)\big)$.
2: Build a new objective function $\hat{f}_D(\omega)$, such that $\hat{f}_D(\omega) = \sum_{l=1}^{m}\sum_{i=1}^{n}\sum_{k=0}^{2} \frac{f_l^{(k)}(z_l)}{k!}\big(g_l(t_i, \omega) - z_l\big)^k$.
3: Run Algorithm 1 with input $D$, $\hat{f}_D(\omega)$, ε.
4: Return $\bar{\omega}$ from Algorithm 1.

Although repeating FM provides a quick fix to obtain bounded objective functions, it leads to sub-optimal results as it entails a considerably higher privacy cost than FM does. To address this issue, we propose two methods to avoid unbounded objective functions in linear and logistic regressions, as will be detailed in Sections 6.1 and 6.2.

6.1 Regularization

As shown in Sections 4 and 5, given a linear or logistic regression task, FM would transform the objective function into a quadratic polynomial $\hat{f}_D(\omega)$, after which it injects noise into the coefficients of $\hat{f}_D(\omega)$ to ensure privacy. Let $\hat{f}_D(\omega) = \omega^T M \omega + \alpha\omega + \beta$ be the matrix representation of the quadratic polynomial, and $\bar{f}_D(\omega) = \omega^T M^* \omega + \alpha^*\omega + \beta^*$ be the noisy version of $\hat{f}_D(\omega)$ after injection of Laplace noise. Then, $M$ must be symmetric and positive definite [28]. To ensure that the objective function remains bounded after noise injection, it suffices to make $M^*$ also symmetric and positive definite [28].

The symmetry of $M^*$ can be easily achieved by (i) adding noise to the upper triangular part of the matrix and (ii) copying each entry to its counterpart in the lower triangular part. In contrast, it is rather challenging to ensure that $M^*$ is positive definite. To our knowledge, there is no existing method for transforming a positive definite matrix into another positive definite matrix in a differentially private manner.
To circumvent this, we adopt a heuristic approach called regularization from the literature of regression analysis [4,29]. In particular, we add a positive constant $\lambda$ to each entry in the main diagonal of $M^*$, such that the noisy objective function becomes

$$\bar{f}_D(\omega) = \omega^T (M^* + \lambda I)\,\omega + \alpha^*\omega + \beta^*, \qquad (13)$$

where $I$ is a $d \times d$ identity matrix, and $\alpha^*$ and $\beta^*$ are the noisy versions of $\alpha$ and $\beta$, respectively. Although regularization is mostly used in regression analysis to avoid overfitting [4,29], it also helps achieve a bounded $\bar{f}_D(\omega)$. To illustrate this, consider that we perform linear regression on a two dimensional database. We have $d = 1$. In addition, each of $\omega$, $M^* + \lambda I$, $\alpha^*$, and $\beta^*$ contains only one value (see Figure 2 for an example). Accordingly, the noisy objective function $\bar{f}_D(\omega)$ would be a quadratic function with one variable. Such a function has a minimum if and only if $M^* + \lambda I$ is positive. Intuitively, we can ensure this as long as $\lambda$ is large enough to mitigate the noise injected in $M^*$.

In general, for any $d$, a reasonably large $\lambda$ makes it more likely that all eigenvalues of $M^* + \lambda I$ are positive, in which case $M^* + \lambda I$ would be positive definite. Meanwhile, as long as $\lambda$ does not overwhelm the signal in $M^*$, it would not significantly degrade the quality of the solution to the regression problem. In our experiments, we observe that a good choice of $\lambda$ equals 4 times the standard deviation of the Laplace noise added into $M$. Note that setting $\lambda$ to this value does not degrade the privacy guarantee of FM, since the standard deviation of the Laplace noise does not reveal any information about the original dataset.
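The symmetric noise injection and the regularization step can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: noise is added directly to the entries of the upper triangle of $M$ with scale $\Delta/\epsilon$, the sensitivity uses the logistic-regression bound $\Delta = d^2/4 + 3d$ from Section 5.3, and $\lambda$ is set to 4 times the Laplace standard deviation (which is $\sqrt{2}$ times the scale). All function names are hypothetical.

```python
import math
import random

def laplace(rng, scale):
    """One Laplace(0, scale) sample, drawn as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def logistic_sensitivity(d):
    """Delta = d^2/4 + 3d, the bound quoted in Section 5.3."""
    return d * d / 4.0 + 3.0 * d

def noisy_regularized_matrix(M, epsilon, delta, rng, lam_factor=4.0):
    """Return (M* + lambda*I, lambda) for a symmetric d x d matrix M.

    Assumption: each upper-triangular entry receives Laplace(delta/epsilon) noise.
    """
    d = len(M)
    scale = delta / epsilon
    lam = lam_factor * math.sqrt(2.0) * scale      # 4 x std of Laplace(scale)
    out = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            noisy = M[i][j] + laplace(rng, scale)  # noise on the upper triangle only
            out[i][j] = noisy
            out[j][i] = noisy                      # mirror to keep the matrix symmetric
        out[i][i] += lam                           # regularization on the diagonal
    return out, lam

if __name__ == "__main__":
    rng = random.Random(0)
    M = [[1.0, 0.2], [0.2, 2.0]]
    noisy, lam = noisy_regularized_matrix(M, epsilon=0.8, delta=logistic_sensitivity(2), rng=rng)
    print("lambda =", lam)
    print(noisy)
```

Because $\lambda$ depends only on $d$ and ϵ, computing it does not touch the data, which matches the argument in the text that the choice of $\lambda$ costs no additional privacy.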

Although regularization increases the chance of obtaining a bounded objective function, there is still a certain probability that the noisy objective function does not have a minimum even after regularization. This motivates our second approach, spectral trimming, as will be explained in Section 6.2.

6.2 Spectral Trimming

Let $\bar{f}_D(\omega) = \omega^T (M^* + \lambda I)\,\omega + \alpha^*\omega + \beta^*$ be the noisy objective function with regularization. As we have discussed in Section 6.1, $M^* + \lambda I$ is symmetric due to the symmetry of $M^*$. In addition, $\bar{f}_D(\omega)$ is unbounded if and only if $M^* + \lambda I$ is not positive definite, which holds if and only if at least one eigenvalue of $M^* + \lambda I$ is not positive [28]. In other words, to transform an unbounded $\bar{f}_D(\omega)$ into a bounded one, it suffices to get rid of the non-positive eigenvalues of $M^* + \lambda I$.

Let $Q^T \Lambda Q$ be the eigen-decomposition of $M^* + \lambda I$, i.e., $Q$ is a $d \times d$ matrix where each row is an eigenvector of $M^* + \lambda I$, and $\Lambda$ is a diagonal matrix where the $i$-th diagonal element is the eigenvalue of $M^* + \lambda I$ corresponding to the eigenvector in the $i$-th row of $Q$. We have $Q^T Q = I$. Accordingly,

$$\bar{f}_D(\omega) = \omega^T (Q^T \Lambda Q)\,\omega + \alpha^* (Q^T Q)\,\omega + \beta^*.$$

Suppose that the $i$-th diagonal element $e_i$ of $\Lambda$ is not positive. Then, we would remove $e_i$ from $\Lambda$, which results in a $(d-1) \times (d-1)$ diagonal matrix. In addition, we would also delete the $i$-th row in $Q$, so that $Q^T \Lambda Q$ would still be well-defined. In general, if $\Lambda$ contains $k$ non-positive diagonal elements, then removal of all those elements would transform $\Lambda$ into a $(d-k) \times (d-k)$ matrix, which we denote as $\Lambda'$. Accordingly, $Q$ becomes a $(d-k) \times d$ matrix, which we denote as $Q'$. The noisy objective function then becomes

$$\bar{f}_D(\omega) = \omega^T (Q'^T \Lambda' Q')\,\omega + \alpha^* (Q'^T Q')\,\omega + \beta^*. \qquad (14)$$

We rewrite $\bar{f}_D(\omega)$ as a function of $Q'\omega$:

$$\bar{g}_D(Q'\omega) = (Q'\omega)^T \Lambda' (Q'\omega) + \alpha^* Q'^T (Q'\omega) + \beta^*,$$

which is a bounded function of $Q'\omega$ since all eigenvalues in $\Lambda'$ are positive. We compute the vector $V$ that minimizes $\bar{g}_D(V)$, and then derive $\omega$ by solving $Q'\omega = V$ (note that the solution to this equation is not unique). In summary, we delete non-positive elements in $\Lambda$ to obtain a bounded objective function, based on which we derive the model parameters.
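The trimming procedure can be sketched in pure Python for small matrices. Two choices below are assumptions made for illustration and are not taken from the paper: the eigen-decomposition uses a textbook Jacobi rotation routine (`jacobi_eigh`, a hypothetical helper), and among the non-unique solutions of $Q'\omega = V$ the sketch returns $\omega = Q'^T V$, the minimum-norm one.

```python
import math

def jacobi_eigh(A, sweeps=100, tol=1e-22):
    """Eigen-decomposition of a small symmetric matrix by Jacobi rotations.

    Returns (eigenvalues, Q), where row i of Q is the eigenvector of eigenvalues[i].
    """
    n = len(A)
    A = [row[:] for row in A]                      # work on a copy
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.hypot(t, 1.0)
                s = t * c
                for k in range(n):                 # A <- A P
                    A[k][p], A[k][q] = c * A[k][p] - s * A[k][q], s * A[k][p] + c * A[k][q]
                for k in range(n):                 # A <- P^T A
                    A[p][k], A[q][k] = c * A[p][k] - s * A[q][k], s * A[p][k] + c * A[q][k]
                for k in range(n):                 # V <- V P (columns are eigenvectors)
                    V[k][p], V[k][q] = c * V[k][p] - s * V[k][q], s * V[k][p] + c * V[k][q]
    eigenvalues = [A[i][i] for i in range(n)]
    Q = [[V[k][i] for k in range(n)] for i in range(n)]
    return eigenvalues, Q

def spectral_trim_solution(A, alpha):
    """Minimize the trimmed objective and return (omega, number of kept eigenvalues).

    A plays the role of M* + lambda*I and alpha the role of alpha*.
    """
    eigenvalues, Q = jacobi_eigh(A)
    d = len(A)
    omega = [0.0] * d
    kept = 0
    for e, q in zip(eigenvalues, Q):
        if e <= 0.0:
            continue                               # non-positive eigenvalue: trimmed
        kept += 1
        b = sum(a * qj for a, qj in zip(alpha, q)) # linear coefficient of V_i
        v = -b / (2.0 * e)                         # minimizes e*v^2 + b*v
        for j in range(d):
            omega[j] += v * q[j]                   # accumulate omega = Q'^T V
    return omega, kept

if __name__ == "__main__":
    print(spectral_trim_solution([[1.0, 2.0], [2.0, 1.0]], [-6.0, 0.0]))
```

When no eigenvalue is trimmed, the result coincides with the usual minimizer of the quadratic, $-\frac{1}{2}A^{-1}\alpha^T$.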
Intuitively, the non-positive elements in $\Lambda$ are mostly due to noise, and hence, removing them from $\Lambda$ would not incur significant loss of useful information. Therefore, the objective function in Equation 14 may still lead to accurate model parameters. The removal of non-positive elements from $\Lambda$ does not violate ϵ-differential privacy, as the removing procedure depends only on $M^*$ (which is differentially private) instead of the input database.

7. EXPERIMENTS

This section experimentally evaluates the performance of FM against four approaches, namely, DPME [6], Filter-Priority (FP) [7], NoPrivacy, and Truncated. As explained in Section 2, DPME is the state-of-the-art method for regression analysis under ϵ-differential privacy, while FP is an ϵ-differentially private technique for generating synthetic data that can also be used for regression tasks. NoPrivacy and Truncated are two algorithms that perform regression analysis but do not enforce ϵ-differential privacy: NoPrivacy directly outputs the model parameters that minimize the objective function, and Truncated returns the parameters obtained from an approximate objective function with truncated polynomial terms (see Section 5). We include Truncated in the experiments, so as to investigate the error incurred by the low-order approximation approach proposed in Section 5.

Table 2: Experimental parameters and values

Parameter                    Range and Default Value
Data Subset Sampling Rate    0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
Dataset Dimensionality       5, 8, 11, 14
Privacy Budget ϵ             3.2, 1.6, 0.8, 0.4, 0.2, 0.1

For DPME and FP, we use the implementations provided by their respective authors, and we set all internal parameters (e.g., the granularity of noisy histograms used by DPME) to their recommended values. All experiments are conducted using Matlab (version 7.2) on a computer with a 2.4GHz CPU and 32GB RAM.

We use two datasets from the Integrated Public Use Microdata Series [22], US and Brazil, which contain 370,000 and 90,000 census records collected in the US and Brazil, respectively.
There are 13 attributes in each dataset, namely, Age, Gender, Marital Status, Education, Disability, Nativity, Working Hours per Week, Number of Years Residing in the Current Location, Ownership of Dwelling, Family Size, Number of Children, Number of Automobiles, and Annual Income. Among these attributes, Marital Status is the only categorical attribute whose domain contains more than 2 values, i.e., Single, Married, and Divorced/Widowed. Following common practice in regression analysis, we transform Marital Status into two binary attributes, Is Single and Is Married (an individual divorced or widowed would have false on both of these attributes). With this transformation, both of our datasets become 14 dimensional.

We conduct regression analysis on each dataset to predict the value of Annual Income using the remaining attributes. For logistic regression, we convert Annual Income into a binary attribute: values higher than a predefined threshold are mapped to 1, and 0 otherwise. Accordingly, when we use a logistic model to classify a tuple $t$, we predict the Annual Income of $t$ to be 1 if $\frac{\exp(x^T\omega)}{1 + \exp(x^T\omega)} > 0.5$ (see Definition 2), where $\omega$ is the model parameter, and $x$ is a vector that contains the values of $t$ on all attributes except Annual Income. We measure the accuracy of a logistic model by its misclassification rate, i.e., the fraction of tuples that are incorrectly classified. The accuracy of a linear model, on the other hand, is measured by the mean square error of the predicted values, i.e., $\frac{1}{n}\sum_{i=1}^{n}\big(y_i - x_i^T\omega\big)^2$, where $n$ is the number of tuples in the dataset, $y_i$ is the Annual Income value of the $i$-th tuple, $x_i$ is a vector that contains the other attribute values of the tuple, and $\omega$ is the model parameter.

In each experiment, we perform 5-fold cross-validation 50 times for each algorithm, and we report the average results. We vary three different parameters, i.e., the dataset size, the dataset dimensionality, and privacy budget ϵ.
In particular, we generate random subsets of the tuples in the US and Brazil datasets, with the sampling rate varying from 0.1 to 1. To vary the dataset dimensionality, we select three subsets of the attributes in each dataset for classification. The first subset contains 5 attributes: Age, Gender, Education, Family Size, and Annual Income. The second subset consists of 8 attributes: the aforementioned five attributes, as well as Nativity, Ownership of Dwelling, and Number of Automobiles. The third subset contains all attributes in the second subset, as well as Is Single, Is Married, and Number of Children. Table 2 summarizes the parameter values, with the default values in bold.
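The two accuracy measures used in the experiments can be written down directly from their definitions. The sketch below is illustrative only; the function names and the toy inputs are assumptions, and the authors' own experiments were run in Matlab.

```python
import math

def logistic_probability(x, w):
    """exp(x.w) / (1 + exp(x.w)), evaluated in an overflow-safe way."""
    z = sum(xj * wj for xj, wj in zip(x, w))
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def misclassification_rate(X, y, w):
    """Fraction of tuples whose predicted label (1 if probability > 0.5) differs from y."""
    wrong = 0
    for xi, yi in zip(X, y):
        predicted = 1 if logistic_probability(xi, w) > 0.5 else 0
        if predicted != yi:
            wrong += 1
    return wrong / len(y)

def mean_square_error(X, y, w):
    """(1/n) * sum_i (y_i - x_i . w)^2 for a linear model w."""
    total = 0.0
    for xi, yi in zip(X, y):
        prediction = sum(xj * wj for xj, wj in zip(xi, w))
        total += (yi - prediction) ** 2
    return total / len(y)

if __name__ == "__main__":
    # Toy one-dimensional inputs, invented purely for illustration.
    print(misclassification_rate([[1.0], [-1.0], [2.0]], [1, 0, 0], [1.0]))
    print(mean_square_error([[1.0], [2.0]], [1.0, 3.0], [1.0]))
```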

Figure 4: Regression accuracy vs. dataset dimensionality. Panels: (a) US-Linear, (b) Brazil-Linear, (c) US-Logistic, (d) Brazil-Logistic. Axes: mean square error (linear) or misclassification rate (logistic) against dimensionality.

Figure 5: Regression accuracy vs. dataset cardinality. Panels: (a) US-Linear, (b) Brazil-Linear, (c) US-Logistic, (d) Brazil-Logistic. Axes: mean square error (linear) or misclassification rate (logistic) against sampling rate.

7.1 Accuracy vs. Dataset Dimensionality

Figures 4a and 4b illustrate the linear regression error of each algorithm as a function of the dataset dimensionality. We omit Truncated in the figures, as our approximation approach in Section 5 is required only for logistic regression but not linear regression. Observe that FM consistently outperforms FP and DPME, and its regression accuracy is almost identical to that of NoPrivacy. In contrast, FP and DPME incur significant errors, especially when the dataset dimensionality is large.

Figures 4c and 4d show the error of each algorithm for logistic regression. The error of Truncated is comparable to that of NoPrivacy, which demonstrates the effectiveness of our low-order approximation approach that truncates the polynomial representation of the objective function. The error of FM is slightly higher than that of Truncated, but it is still much smaller than the errors of FP and DPME.

7.2 Accuracy vs. Dataset Cardinality

Figure 5 shows the regression error of each algorithm as a function of the dataset cardinality. For both regression tasks and for both datasets, FM outperforms FP and DPME by considerable margins. In addition, for linear regression, the difference in accuracy between FM and NoPrivacy is negligible; meanwhile, their accuracy remains stable with varying number of records in the database, except when the sampling rate equals 0.1
(the smallest value used in all experiments). In contrast, the performance of FP and DPME improves with the dataset cardinality, which is consistent with the theoretical results in [7] and [6]. Nevertheless, even when we use all tuples in the dataset, the accuracy of FP and DPME is still much worse than that of FM and NoPrivacy.

For logistic regression, there is a gap between the accuracy of FM and that of NoPrivacy and Truncated, but the gap shrinks rapidly with the increase of dataset cardinality. The errors of FP and DPME also decrease when the dataset cardinality increases, but they remain considerably higher than the error of FM in all cases.

7.3 Accuracy vs. Privacy Budget

Figure 6 plots the regression error of each algorithm as a function of the privacy budget ϵ. The errors of NoPrivacy and Truncated remain unchanged for all ϵ, as neither of them enforces ϵ-differential privacy. All of the other three methods incur higher errors when ϵ decreases, as a smaller ϵ requires a larger amount of noise to be injected. FM outperforms FP and DPME in all cases, and it is relatively robust against the change of ϵ. In contrast, FP and DPME produce much less accurate regression results, especially when ϵ is small.

7.4 Computation Time

Finally, Figures 7-9 report the average running time of each algorithm. Due to the space constraint, we only report the results for logistic regression; the results for linear regression are qualitatively similar. Overall, the running time of FM is at least one order of magnitude lower than that of NoPrivacy, which in turn is about two times faster than FP and DPME. The efficiency of FM is mainly due to its low-order approximation module, which truncates the polynomial representation of the objective function and retains only the first and second order terms. As a consequence, FM computes the optimization results by solving a multi-variate quadratic optimization problem, for which Matlab has an efficient solution.
In contrast, all other methods require solving the original optimization problem of logistic regression, which has a complicated objective function that renders the solving process time consuming. In addition, FP and DPME require additional time to generate synthetic data, leading to even higher computation cost.
