Hoeffding's Inequality
Probabilistic Method: Hoeffding's Inequality and Differential Privacy

Lecturer: Hubert Chan    Date: 27 May 22

1 Hoeffding's Inequality

1.1 Approximate Counting by Random Sampling

Suppose there is a bag containing red balls and blue balls. You would like to estimate the fraction of red balls in the bag. However, you are only allowed to sample randomly from the bag with replacement. The straightforward method is to make $T$ random samples and use the fraction of red balls in the random samples as an estimate of the true fraction. We are interested in estimation with a certain additive error, i.e., if the true fraction is $p$, it is enough to return a number in the range $[p - \epsilon, p + \epsilon]$. The question is: how many samples are enough so that with probability at least $1 - \delta$, our estimate is within an additive error of $\epsilon > 0$ from the true value?

Let $Y := \sum_{t \in [T]} Z_t$, where $Z_t$ is the $\{0, 1\}$-random variable that takes value $1$ if the $t$-th sample is red. The estimator is $\frac{Y}{T}$. Suppose the true fraction of red balls in the bag is $p$. Then, we want to find how large $T$ needs to be such that the following is satisfied:

$\Pr[|\frac{Y}{T} - p| \ge \epsilon] \le \delta.$

Theorem 1.1 (Hoeffding's Inequality) Suppose $X_1, X_2, \ldots, X_n$ are independent real-valued random variables, such that for each $i$, $X_i$ takes values from the interval $[a_i, b_i]$. Let $Y := \sum_i X_i$. Then, for all $\alpha > 0$,

$\Pr[|Y - E[Y]| \ge n\alpha] \le 2 \exp\Big(-\frac{2 n^2 \alpha^2}{\sum_i R_i^2}\Big),$

where $R_i := b_i - a_i$.

Using Hoeffding's Inequality, with $X_i \in [0, 1]$ and $E[X_i] = p$, we have

$\Pr[|\frac{1}{T} \sum_i X_i - p| \ge \alpha] \le 2 \exp(-2 T \alpha^2).$

Hence, in order to estimate the fraction of red balls with additive error at most $\alpha$ and failure probability at most $\delta$, it suffices to use $T = \Theta(\frac{1}{\alpha^2} \log \frac{1}{\delta})$ samples.

1.2 Proof of Hoeffding's Inequality

We use the technique of moment generating functions to prove Hoeffding's Inequality. For simplicity, we prove a slightly weaker result:

$\Pr[|Y - E[Y]| \ge n\alpha] \le 2 \exp\Big(-\frac{n^2 \alpha^2}{2 \sum_i R_i^2}\Big).$

Here are the three steps.
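The sampling scheme above is easy to simulate. The sketch below is illustrative only: the `estimate_fraction` helper and the bag contents are hypothetical, and the sample count follows the Hoeffding bound $T \ge \ln(2/\delta) / (2\alpha^2)$.

```python
import math
import random

def estimate_fraction(bag, alpha, delta, rng):
    """Estimate the fraction of red balls with additive error alpha
    and failure probability delta, sampling with replacement."""
    # Hoeffding: T >= ln(2/delta) / (2 * alpha^2) samples suffice.
    T = math.ceil(math.log(2 / delta) / (2 * alpha ** 2))
    reds = sum(rng.choice(bag) == "red" for _ in range(T))
    return reds / T, T

rng = random.Random(0)
bag = ["red"] * 300 + ["blue"] * 700        # true fraction p = 0.3
estimate, T = estimate_fraction(bag, alpha=0.05, delta=0.01, rng=rng)
print(T, round(estimate, 2))
```

With $\alpha = 0.05$ and $\delta = 0.01$ this draws $T = 1060$ samples, and the returned estimate should land within $0.05$ of $p = 0.3$ except with small probability.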
1.2.1 Transform the Inequality into a Convenient Form

We use the inequality

$\Pr[|Y - E[Y]| \ge n\alpha] \le \Pr[Y - E[Y] \ge n\alpha] + \Pr[Y - E[Y] \le -n\alpha],$

and give bounds for the two probabilities separately. Here, we consider $\Pr[Y - E[Y] \ge n\alpha]$. The other case is similar.

Observe that the expectation of $Y$ does not appear in the upper bound. It would be more convenient to first translate the variable $X_i$. Define $Z_i := X_i - E[X_i]$. Observe that $E[Z_i] = 0$. Moreover, since both $X_i$ and $E[X_i]$ are in the interval $[a_i, b_i]$, it follows that $Z_i$ takes values in the interval $[-R_i, R_i]$. For $t > 0$, we have

$\Pr[Y - E[Y] \ge n\alpha] = \Pr[t \sum_i Z_i \ge t n \alpha].$

1.2.2 Using the Moment Generating Function and Independence

Applying the exponential function to both sides of the inequality, we follow the standard calculation, using independence of the $Z_i$'s:

$\Pr[t \sum_i Z_i \ge t n \alpha] = \Pr[\exp(t \sum_i Z_i) \ge \exp(t n \alpha)] \le \exp(-t n \alpha) \cdot E[\exp(t \sum_i Z_i)] = \exp(-t n \alpha) \cdot \prod_i E[\exp(t Z_i)].$

The next step is the most technical part of the proof. Recall that we want to find some appropriate function $g_i(t)$ such that $E[\exp(t Z_i)] \le \exp(g_i(t))$. All we know about $Z_i$ is that $E[Z_i] = 0$ and $Z_i$ takes values in $[-R_i, R_i]$. Think of a simple but non-trivial random variable that has mean $0$ and takes values in $[-R_i, R_i]$. Consider $\hat{Z}_i$ that takes value $R_i$ with probability $\frac{1}{2}$ and $-R_i$ with probability $\frac{1}{2}$. Then, it follows that

$E[\exp(t \hat{Z}_i)] = \frac{1}{2}(e^{t R_i} + e^{-t R_i}) \le \exp(\frac{1}{2} t^2 R_i^2),$

where the last inequality follows from the fact that for all real $x$, $\frac{1}{2}(e^x + e^{-x}) \le e^{x^2 / 2}$. Therefore, for such a simple $\hat{Z}_i$, we have a nice bound $E[\exp(t \hat{Z}_i)] \le \exp(g_i(t))$, where $g_i(t) := \frac{1}{2} t^2 R_i^2$.

Using Convexity to Show that the Extreme Points are the Worst Case Scenario. Intuitively, $\hat{Z}_i$ is the worst case scenario. Since we wish to show measure concentration, it is a bad case if there is a lot of variation in $Z_i$. However, we have the requirement that $E[Z_i] = 0$ and $Z_i \in [-R_i, R_i]$. Hence, intuitively, $Z_i$ has the most variation if it takes values only at the extreme points, each with probability $\frac{1}{2}$.
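The elementary inequality $\frac{1}{2}(e^x + e^{-x}) \le e^{x^2/2}$ used above follows by comparing Taylor series term by term, since $\frac{x^{2k}}{(2k)!} \le \frac{x^{2k}}{2^k k!}$. A quick numerical sanity check:

```python
import math

# Check cosh(x) = (e^x + e^{-x})/2 <= exp(x^2 / 2) on a grid over [-5, 5].
# Term by term: x^(2k)/(2k)! <= x^(2k)/(2^k * k!) because (2k)! >= 2^k * k!.
worst = max(math.cosh(k / 100) / math.exp((k / 100) ** 2 / 2)
            for k in range(-500, 501))
print(worst)   # the ratio never exceeds 1 (equality at x = 0)
```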
We formalize this intuition using the convexity of the exponential function.

Definition 1.2 (Convex Function) A function $f : \mathbb{R} \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}$ and $0 \le \lambda \le 1$,

$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$

Fact 1.3 If a function $f$ is twice differentiable and $f''(x) \ge 0$, then $f$ is convex.

We use the fact that the function $x \mapsto e^{tx}$ is convex. First, we need to express $Z_i$ as a convex combination of the end points, i.e., we want to find $\lambda$ such that $Z_i = \lambda R_i + (1 - \lambda)(-R_i)$. We have $\lambda := \frac{Z_i + R_i}{2 R_i}$. Using the convexity of the function $x \mapsto e^{tx}$, we have

$\exp(t Z_i) = \exp(t(\lambda R_i + (1 - \lambda)(-R_i))) \le \lambda \exp(t R_i) + (1 - \lambda) \exp(-t R_i) = \big(\frac{1}{2} + \frac{Z_i}{2 R_i}\big) \exp(t R_i) + \big(\frac{1}{2} - \frac{Z_i}{2 R_i}\big) \exp(-t R_i).$

Taking expectation on both sides, and observing that $E[Z_i] = 0$, we have

$E[\exp(t Z_i)] \le \frac{1}{2}(\exp(t R_i) + \exp(-t R_i)) = E[\exp(t \hat{Z}_i)],$

as required.

1.2.3 Picking the Best Value for $t$

We have shown that for all $t > 0$,

$\Pr[Y - E[Y] \ge n\alpha] \le \exp\Big(\frac{1}{2} t^2 \sum_i R_i^2 - n \alpha t\Big),$

and we want to find a value of $t$ that minimizes the right hand side. Note that the exponent is a quadratic function in $t$, and hence is minimized when $t := \frac{n \alpha}{\sum_i R_i^2} > 0$. Hence, in this case, we have

$\Pr[Y - E[Y] \ge n\alpha] \le \exp\Big(-\frac{n^2 \alpha^2}{2 \sum_i R_i^2}\Big),$

as required.

1.3 Exercise

Estimating the Unknown Fraction of Red Balls. Suppose a bag contains an unknown number of red balls (assume there is at least one red ball) and you are only allowed to sample with replacement uniformly at random from the bag. Design an algorithm that, given $0 < \epsilon, \delta < 1$, with failure probability at most $\delta$, returns an estimate of the fraction of red balls with multiplicative error at most $\epsilon$, i.e., if the real fraction is $p$, the algorithm returns a number $\hat{p}$ such that $|\hat{p} - p| \le \epsilon p$. Give the number of random samples used by your algorithm.

2 Differential Privacy

2.1 Motivation

A statistical database is a database used for statistical analysis. For example, a database containing information about graduates of a certain university can answer questions like: what is the average salary of the graduates? Statistical databases are widely used due to the enormous social value they provide: the previously mentioned database benefits society in helping students to choose
whether to go to that university, or how funding should be distributed among universities. However, the statistics released might cause leakage of sensitive information. Therefore, a big challenge is to maintain individual privacy, while providing useful aggregate statistical information about a certain group.

In 1977 the statistician Tore Dalenius gave a privacy goal for statistical databases: anything that can be learned about a member of the statistical database should also be learnable without access to the database. However, as the following example illustrates, as long as the statistical database is useful, the goal is not achievable if the adversary has auxiliary information (information not obtained from the statistical database). Suppose we know that some student X's salary is a known amount more than the average; then we can learn his/her salary by querying the average salary. Note that in this case, even if X does not join the database, we can still approximately know the average salary and hence learn his/her sensitive information.

Notation. Let $U$ be the set of possible user data points. A database of $n$ users contains the data points for each user and can be viewed as a vector in $U^n$. We use $D := U^n$ to denote the collection of all possible databases with $n$ users. We consider whether releasing the output of some function $f : D \to O$ will compromise an individual's privacy.

Example. Suppose there are $n$ users, each of which holds a private bit from $U = \{0, 1\}$. Given a database $X \in U^n$, suppose the function of interest is $\mathrm{sum}(X) = \sum_{i=1}^n X_i$. At first sight, it might seem that releasing the sum does not violate an individual's privacy, because the sum does not directly reveal any individual's private bit. However, in reality, a user's data can be in many databases. Suppose users $1$ to $n - 1$ also participate in another database which also releases the sum. Then, combining the results of the two sums, the private bit of user $n$ can be accurately calculated!
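The attack in the example is one subtraction. A sketch with hypothetical private bits:

```python
# Differencing attack: the sums of two databases that differ only in
# whether user n participates reveal user n's private bit exactly.
bits = [1, 0, 1, 1, 0, 1]          # hypothetical private bits; user n is last
sum_with_n = sum(bits)             # released by the database with users 1..n
sum_without_n = sum(bits[:-1])     # released by the database with users 1..n-1
recovered = sum_with_n - sum_without_n
print(recovered == bits[-1])       # True: user n's bit is recovered
```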
Differential privacy defines privacy in a different sense: to minimize the increased risk of sensitive information leakage due to one's joining the statistical database. A private mechanism that answers queries to statistical databases can achieve this goal by introducing randomness, such that when two databases differ in only one single user, the outputs produced have similar distributions. Note that a differentially private mechanism encourages individuals to participate in statistical databases, thus enhancing the social benefit provided by them.

2.2 Formal Definition

Let $U$ be the set of possible data points and $D := U^n$ be the collection of databases with $n$ users. Two databases $X_1 \in D$ and $X_2 \in D$ are called neighboring (denoted by $X_1 \sim X_2$) if they differ in at most one coordinate.

Definition 2.1 ($\epsilon$-differential privacy) A randomized mechanism (function) $M : D \to O$ preserves $\epsilon$-differential privacy if for any two neighboring databases $X_1 \sim X_2$, and any possible set of outputs $S \subseteq O$, the following holds:

$\Pr[M(X_1) \in S] \le \exp(\epsilon) \cdot \Pr[M(X_2) \in S],$

where the randomness comes from the coin flips of $M$.

Remark 2.2 We observe the following.
1. Since $M$ is a randomized mechanism, $M(X)$ is a distribution over outputs in $O$.

2. Interchanging the roles of $X_1$ and $X_2$, we also have: $\Pr[M(X_2) \in S] \le \exp(\epsilon) \cdot \Pr[M(X_1) \in S]$.

3. If $O$ is a countable set, then we also have the inequality for each $x \in O$: $\Pr[M(X_1) = x] \le \exp(\epsilon) \cdot \Pr[M(X_2) = x]$.

4. The inequality means that the distributions $M(X_1)$ and $M(X_2)$ are close, and hence by observing the output, it is not possible to tell for certain whether the output comes from database $X_1$ or $X_2$.

2.3 Achieving Differential Privacy

Given a set $D$ of databases, a deterministic function $f : D \to \mathbb{R}^d$, and a privacy parameter $\epsilon$, we want to find an $\epsilon$-differentially private version of $f$, i.e., to find a random function $\hat{f} : D \to \mathbb{R}^d$ that has the following properties:

1. Privacy. The function $\hat{f}$ preserves $\epsilon$-differential privacy.

2. Utility. For any database $X \in D$, with high probability, $\hat{f}(X)$ is close to $f(X)$.

Observe that a trivial constant function $\hat{f}$ (which ignores the database) satisfies $\epsilon$-differential privacy. However, this will not be very useful. Here is one way to define utility.

Definition 2.3 ($(\lambda, \delta)$-useful) Let $0 < \delta < 1$ and $\lambda > 0$ be utility parameters. Let $f : D \to \mathbb{R}^d$ be a deterministic function and $\hat{f} : D \to \mathbb{R}^d$ be a randomized function. We say $\hat{f}$ is $(\lambda, \delta)$-useful with respect to $f$, if for any database $X \in D$, with probability at least $1 - \delta$, for each $i \in [d]$, $|\hat{f}_i(X) - f_i(X)| \le \lambda$.

2.4 Achieving differential privacy by adding Laplace noise

One way to convert a function $f : D \to \mathbb{R}^d$ into a differentially private version is to add independent random noise to each of its coordinates. Intuitively, the more $f(X)$ changes when we change one coordinate of $X$, the larger the random noise needed to hide the difference. We use $\ell_1$-sensitivity to formally measure the maximum difference between the values of $f$ on any two neighboring databases.

Definition 2.4 ($\ell_1$-Sensitivity) Let $f : D \to \mathbb{R}^d$ be a deterministic function. The $\ell_1$-sensitivity of $f$, denoted by $\Delta f$, is

$\Delta f := \max_{X_1 \sim X_2} \|f(X_1) - f(X_2)\|_1 = \max_{X_1 \sim X_2} \sum_{i=1}^d |f_i(X_1) - f_i(X_2)|.$

We use random variables sampled from Laplace distributions as the random noise.
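For a concrete instance, the $\ell_1$-sensitivity of the sum function over $U = \{0, 1\}$ can be found by brute force for small $n$; the exhaustive search below is purely illustrative.

```python
from itertools import product

# Brute-force the l1-sensitivity of sum(X) over U = {0,1} for small n:
# maximize |sum(X1) - sum(X2)| over neighboring databases X1 ~ X2.
n = 4
delta_f = 0
for X1 in product([0, 1], repeat=n):
    for i in range(n):                          # flip coordinate i -> a neighbor
        X2 = X1[:i] + (1 - X1[i],) + X1[i + 1:]
        delta_f = max(delta_f, abs(sum(X1) - sum(X2)))
print(delta_f)   # 1: changing one user's bit moves the sum by at most 1
```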
Definition 2.5 (Laplace Distribution) Let $b > 0$. We denote by $\mathrm{Lap}(b)$ the Laplace distribution whose probability density function at $z$ is $\frac{1}{2b} \exp(-\frac{|z|}{b})$.

The Laplace distribution has the following properties.

Theorem 2.6 Let $b > 0$ and let $\gamma$ be a random variable sampled from $\mathrm{Lap}(b)$. Then,

1. $E[\gamma] = 0$,

2. $\mathrm{var}[\gamma] = 2 b^2$,

3. for any $\lambda > 0$, $\Pr[|\gamma| > \lambda] = \exp(-\frac{\lambda}{b})$.

Proof: Since the density $\frac{1}{2b} e^{-|z|/b}$ is symmetric about $0$, the integrand $z \cdot \frac{1}{2b} e^{-|z|/b}$ of $E[\gamma]$ is odd and absolutely integrable, so $E[\gamma] = 0$.

For the second moment, integrating by parts twice,

$E[\gamma^2] = \int_{-\infty}^{\infty} z^2 \cdot \frac{1}{2b} e^{-|z|/b} \, dz = \frac{1}{b} \int_0^{\infty} z^2 e^{-z/b} \, dz = \frac{1}{b} \cdot 2 b^3 = 2 b^2.$

Hence, $\mathrm{var}[\gamma] = E[\gamma^2] - E[\gamma]^2 = 2 b^2$.

For the tail bound, using symmetry again,

$\Pr[|\gamma| > \lambda] = \Pr[\gamma > \lambda] + \Pr[\gamma < -\lambda] = 2 \int_{\lambda}^{\infty} \frac{1}{2b} e^{-z/b} \, dz = e^{-\lambda/b}.$
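The three properties in Theorem 2.6 are easy to check empirically. The sketch below samples $\mathrm{Lap}(b)$ by inverse-CDF transform; the `sample_laplace` helper is my own, not part of the notes.

```python
import math
import random

def sample_laplace(b, rng):
    # Inverse-CDF sampling: for U ~ Uniform(-1/2, 1/2),
    # -b * sgn(U) * ln(1 - 2|U|) has the Lap(b) distribution.
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(42)
b, N = 2.0, 200_000
xs = [sample_laplace(b, rng) for _ in range(N)]
mean = sum(xs) / N                                  # ~ 0
var = sum(x * x for x in xs) / N - mean ** 2        # ~ 2 b^2 = 8
tail = sum(abs(x) > 3 * b for x in xs) / N          # ~ exp(-3) ~ 0.0498
print(round(mean, 2), round(var, 1), round(tail, 3))
```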
Note that if $b$ is small, the mass is highly concentrated around $0$. Hence, the Laplace distribution can only be used to privatize functions with small sensitivity. However, the highly concentrated mass also implies good utility. The following theorem shows that choosing $b := \frac{\Delta f}{\epsilon}$ is enough to preserve privacy, and with high probability the additive error incurred by the random noise is small.

Theorem 2.7 Let $f : D \to \mathbb{R}^d$ be a deterministic function, $\epsilon > 0$ be the privacy parameter and $0 < \delta < 1$ be the failure probability. Let $\gamma_1, \gamma_2, \ldots, \gamma_d$ be random variables independently sampled from $\mathrm{Lap}(\frac{\Delta f}{\epsilon})$. Then, the randomized function $\hat{f}$ such that $\hat{f}_i(X) := f_i(X) + \gamma_i$ for all $i \in [d]$

1. preserves $\epsilon$-differential privacy,

2. is $(\frac{\Delta f}{\epsilon} \ln \frac{d}{\delta}, \delta)$-useful with respect to $f$.

Proof: Let $X_1 \in D$ and $X_2 \in D$ be two neighboring databases. Let $z \in \mathbb{R}^d$ be a vector. We abuse notation a little and use $\Pr[\hat{f}(X_1) = z]$ and $\Pr[\hat{f}(X_2) = z]$ to denote densities instead of probabilities. Hence, we have

$\frac{\Pr[\hat{f}(X_1) = z]}{\Pr[\hat{f}(X_2) = z]} = \frac{\Pr[\bigwedge_{i=1}^d f_i(X_1) + \gamma_i = z_i]}{\Pr[\bigwedge_{i=1}^d f_i(X_2) + \gamma_i = z_i]} = \prod_{i=1}^d \frac{\Pr[\gamma_i = z_i - f_i(X_1)]}{\Pr[\gamma_i = z_i - f_i(X_2)]} = \prod_{i=1}^d \frac{\exp(-\epsilon |z_i - f_i(X_1)| / \Delta f)}{\exp(-\epsilon |z_i - f_i(X_2)| / \Delta f)} = \exp\Big(\frac{\epsilon}{\Delta f} \sum_{i=1}^d \big(|z_i - f_i(X_2)| - |z_i - f_i(X_1)|\big)\Big) \le \exp\Big(\frac{\epsilon}{\Delta f} \sum_{i=1}^d |f_i(X_1) - f_i(X_2)|\Big) \le \exp\Big(\frac{\epsilon}{\Delta f} \cdot \Delta f\Big) = \exp(\epsilon),$

where the first inequality is the triangle inequality and the second follows from the definition of $\Delta f$. For any measurable subset $S \subseteq \mathbb{R}^d$,

$\Pr[\hat{f}(X_1) \in S] = \int_S \Pr[\hat{f}(X_1) = z] \, dz \le \exp(\epsilon) \int_S \Pr[\hat{f}(X_2) = z] \, dz = \exp(\epsilon) \cdot \Pr[\hat{f}(X_2) \in S].$

Hence, the privacy guarantee is proved.

By Property 3 of Theorem 2.6, we know that for all $i \in [d]$,

$\Pr[|\gamma_i| > \frac{\Delta f}{\epsilon} \ln \frac{d}{\delta}] = \exp\Big(-\frac{\Delta f}{\epsilon} \ln \frac{d}{\delta} \Big/ \frac{\Delta f}{\epsilon}\Big) = \frac{\delta}{d}.$

Hence, by a union bound over $i \in [d]$, we know that

$\Pr[\exists i \in [d] : |\gamma_i| > \frac{\Delta f}{\epsilon} \ln \frac{d}{\delta}] \le \delta.$
Let $X \in D$ be any database. Note that $|\hat{f}_i(X) - f_i(X)| = |\gamma_i|$ for all $i \in [d]$. Hence, by the union bound,

$\Pr[\exists i \in [d] : |\hat{f}_i(X) - f_i(X)| > \frac{\Delta f}{\epsilon} \ln \frac{d}{\delta}] \le \delta,$

which is equivalent to

$\Pr[\forall i \in [d] : |\hat{f}_i(X) - f_i(X)| \le \frac{\Delta f}{\epsilon} \ln \frac{d}{\delta}] \ge 1 - \delta.$

Thus, the utility guarantee is proved.

2.5 Example

Consider the previous example where there are $n$ users, each of which holds a private bit from $\{0, 1\}$, and we would like to calculate how many of them hold a one. In other words, we would like to calculate $\sum_{i=1}^n X_i$, where $X_i$ is the private bit of the $i$-th user. There is an untrusted aggregator who can calculate the sum of $n$ real numbers. To compute the sum, the users need to send information about their private bits to the aggregator, and the aggregator then does the calculation for them. However, since the aggregator might be malicious, the users do not want the aggregator to know their real private bits.

To protect the users' privacy, the information they send to the aggregator should be $\epsilon$-differentially private. Note that the sensitivity of each user's private bit is one. Hence, a user can first add Laplace noise sampled from $\mathrm{Lap}(\frac{1}{\epsilon})$ to their private bit, and then send the noisy bit to the aggregator. The aggregator releases the sum of the noisy bits as an estimate of the real sum. Observe that the output is the real sum plus $n$ independent random variables sampled from $\mathrm{Lap}(\frac{1}{\epsilon})$. It can be shown that the additive error of this estimate is $O(\frac{\sqrt{n}}{\epsilon})$ with high probability. This is left as an exercise.

2.6 Exercises

1. In this question, we derive a measure concentration result for independent random variables drawn from the Laplace distribution. We show that with high probability, the sum of independent Laplace random variables is concentrated around its mean, $0$. We use moment generating functions in a Chernoff-like argument. Let $\gamma_1, \gamma_2, \ldots, \gamma_n$ be $n$ independent random variables, where $\gamma_i$ is sampled from $\mathrm{Lap}(b_i)$, and let $Y := \sum_{i=1}^n \gamma_i$.

(a) Prove that for each $\gamma_i$, the moment generating function is $E[\exp(h \gamma_i)] = \frac{1}{1 - h^2 b_i^2}$, where $|h| < \frac{1}{b_i}$.

(b) Show that $E[\exp(h \gamma_i)] \le \exp(2 h^2 b_i^2)$, if $|h| < \frac{1}{\sqrt{2} b_i}$.
Hint: for $0 \le x < \frac{1}{2}$, we have $\frac{1}{1 - x} \le 1 + 2x \le \exp(2x)$.

(c) Let $M := \max_{i \in [n]} b_i$ and let $\nu \ge \sqrt{\sum_{i=1}^n b_i^2}$. Prove that for $0 < \lambda < \frac{2\sqrt{2} \nu^2}{M}$,

$\Pr[|Y| > \lambda] \le 2 \exp\Big(-\frac{\lambda^2}{8 \nu^2}\Big).$

(d) Suppose $0 < \delta < 1$ and $\nu > \max\big\{\sqrt{\sum_{i=1}^n b_i^2}, \, M \sqrt{\ln \frac{2}{\delta}}\big\}$. Prove that $\Pr[|Y| > \nu \sqrt{8 \ln \frac{2}{\delta}}] \le \delta$.

(e) Prove that $\Pr[|Y| > \sqrt{8 \sum_{i=1}^n b_i^2} \, \ln \frac{2}{\delta}] \le \delta$.
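The $O(\sqrt{n}/\epsilon)$ error claim from the example in Section 2.5, which Exercise 1 makes precise, can be checked by simulation. The sketch below uses my own stdlib-only Laplace sampler and hypothetical parameters, and compares the observed error of the noisy aggregated sum against the bound of the form $\sqrt{8 \sum_i b_i^2} \ln(2/\delta)$ derived in the exercise.

```python
import math
import random

def sample_laplace(b, rng):
    # Inverse-CDF sampling of Lap(b) from a Uniform(-1/2, 1/2) draw.
    u = rng.random() - 0.5
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(1)
n, eps = 10_000, 0.5
bits = [rng.randint(0, 1) for _ in range(n)]
# Each user perturbs their own bit with Lap(1/eps) before sending it.
reports = [x + sample_laplace(1 / eps, rng) for x in bits]
error = abs(sum(reports) - sum(bits))   # a sum of n independent Lap(1/eps)
# Exercise 1 gives |error| <= sqrt(8 n / eps^2) * ln(2/delta) w.p. >= 1 - delta.
bound = math.sqrt(8 * n / eps ** 2) * math.log(2 / 0.01)    # delta = 0.01
print(error <= bound, round(error / (math.sqrt(n) / eps), 2))
```

The second printed value is the error in units of $\sqrt{n}/\epsilon$; it stays of constant order as $n$ grows, matching the claim.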
More informationLecture 23: Alternation vs. Counting
CS 710: Complexity Theory 4/13/010 Lecture 3: Alternation vs. Counting Instructor: Dieter van Melkebeek Scribe: Jeff Kinne & Mushfeq Khan We introduced counting complexity classes in the previous lecture
More informationNotes on Complexity Theory Last updated: November, Lecture 10
Notes on Complexity Theory Last updated: November, 2015 Lecture 10 Notes by Jonathan Katz, lightly edited by Dov Gordon. 1 Randomized Time Complexity 1.1 How Large is BPP? We know that P ZPP = RP corp
More informationLecture 9: March 26, 2014
COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 9: March 26, 204 Spring 204 Scriber: Keith Nichols Overview. Last Time Finished analysis of O ( n ɛ ) -query
More informationLecture 5: Probabilistic tools and Applications II
T-79.7003: Graphs and Networks Fall 2013 Lecture 5: Probabilistic tools and Applications II Lecturer: Charalampos E. Tsourakakis Oct. 11, 2013 5.1 Overview In the first part of today s lecture we will
More information1 Introduction (January 21)
CS 97: Concrete Models of Computation Spring Introduction (January ). Deterministic Complexity Consider a monotonically nondecreasing function f : {,,..., n} {, }, where f() = and f(n) =. We call f a step
More information1 Large Deviations. Korea Lectures June 2017 Joel Spencer Tuesday Lecture III
Korea Lectures June 017 Joel Spencer Tuesday Lecture III 1 Large Deviations We analyze how random variables behave asymptotically Initially we consider the sum of random variables that take values of 1
More information10 Lorentz Group and Special Relativity
Physics 129 Lecture 16 Caltech, 02/27/18 Reference: Jones, Groups, Representations, and Physics, Chapter 10. 10 Lorentz Group and Special Relativity Special relativity says, physics laws should look the
More informationWhy (Bayesian) inference makes (Differential) privacy easy
Why (Bayesian) inference makes (Differential) privacy easy Christos Dimitrakakis 1 Blaine Nelson 2,3 Aikaterini Mitrokotsa 1 Benjamin Rubinstein 4 Zuhe Zhang 4 1 Chalmers university of Technology 2 University
More informationNotes for Lecture A can repeat step 3 as many times as it wishes. We will charge A one unit of time for every time it repeats step 3.
COS 533: Advanced Cryptography Lecture 2 (September 18, 2017) Lecturer: Mark Zhandry Princeton University Scribe: Mark Zhandry Notes for Lecture 2 1 Last Time Last time, we defined formally what an encryption
More informationSTRONG NORMALITY AND GENERALIZED COPELAND ERDŐS NUMBERS
#A INTEGERS 6 (206) STRONG NORMALITY AND GENERALIZED COPELAND ERDŐS NUMBERS Elliot Catt School of Mathematical and Physical Sciences, The University of Newcastle, Callaghan, New South Wales, Australia
More informationThe Fitness Level Method with Tail Bounds
The Fitness Level Method with Tail Bounds Carsten Witt DTU Compute Technical University of Denmark 2800 Kgs. Lyngby Denmark arxiv:307.4274v [cs.ne] 6 Jul 203 July 7, 203 Abstract The fitness-level method,
More information[Title removed for anonymity]
[Title removed for anonymity] Graham Cormode graham@research.att.com Magda Procopiuc(AT&T) Divesh Srivastava(AT&T) Thanh Tran (UMass Amherst) 1 Introduction Privacy is a common theme in public discourse
More informationSection 2.1: Reduce Rational Expressions
CHAPTER Section.: Reduce Rational Expressions Section.: Reduce Rational Expressions Ojective: Reduce rational expressions y dividing out common factors. A rational expression is a quotient of polynomials.
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More informationHandout 5. α a1 a n. }, where. xi if a i = 1 1 if a i = 0.
Notes on Complexity Theory Last updated: October, 2005 Jonathan Katz Handout 5 1 An Improved Upper-Bound on Circuit Size Here we show the result promised in the previous lecture regarding an upper-bound
More informationMachine Learning Theory Lecture 4
Machine Learning Theory Lecture 4 Nicholas Harvey October 9, 018 1 Basic Probability One of the first concentration bounds that you learn in probability theory is Markov s inequality. It bounds the right-tail
More informationLecture notes on OPP algorithms [Preliminary Draft]
Lecture notes on OPP algorithms [Preliminary Draft] Jesper Nederlof June 13, 2016 These lecture notes were quickly assembled and probably contain many errors. Use at your own risk! Moreover, especially
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationThe expansion of random regular graphs
The expansion of random regular graphs David Ellis Introduction Our aim is now to show that for any d 3, almost all d-regular graphs on {1, 2,..., n} have edge-expansion ratio at least c d d (if nd is
More informationLecture 4: Hashing and Streaming Algorithms
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 4: Hashing and Streaming Algorithms Lecturer: Shayan Oveis Gharan 01/18/2017 Scribe: Yuqing Ai Disclaimer: These notes have not been subjected
More informationLattice Cryptography
CSE 06A: Lattice Algorithms and Applications Winter 01 Instructor: Daniele Micciancio Lattice Cryptography UCSD CSE Many problems on point lattices are computationally hard. One of the most important hard
More informationAnswering Many Queries with Differential Privacy
6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan
More informationRepresentation theory of SU(2), density operators, purification Michael Walter, University of Amsterdam
Symmetry and Quantum Information Feruary 6, 018 Representation theory of S(), density operators, purification Lecture 7 Michael Walter, niversity of Amsterdam Last week, we learned the asic concepts of
More informationCompactness vs Collusion Resistance in Functional Encryption
Compactness vs Collusion Resistance in Functional Encryption Baiyu Li Daniele Micciancio April 10, 2017 Astract We present two general constructions that can e used to comine any two functional encryption
More informationLecture 14: Secure Multiparty Computation
600.641 Special Topics in Theoretical Cryptography 3/20/2007 Lecture 14: Secure Multiparty Computation Instructor: Susan Hohenberger Scribe: Adam McKibben 1 Overview Suppose a group of people want to determine
More informationUniversity of Regina. Lecture Notes. Michael Kozdron
University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating
More informationCS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms
CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms Tim Roughgarden March 3, 2016 1 Preamble In CS109 and CS161, you learned some tricks of
More information1. Implement AdaBoost with boosting stumps and apply the algorithm to the. Solution:
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 October 31, 2016 Due: A. November 11, 2016; B. November 22, 2016 A. Boosting 1. Implement
More informationStephen F Austin. Exponents and Logarithms. chapter 3
chapter 3 Starry Night was painted by Vincent Van Gogh in 1889. The brightness of a star as seen from Earth is measured using a logarithmic scale. Exponents and Logarithms This chapter focuses on understanding
More informationLecture 11: Key Agreement
Introduction to Cryptography 02/22/2018 Lecture 11: Key Agreement Instructor: Vipul Goyal Scribe: Francisco Maturana 1 Hardness Assumptions In order to prove the security of cryptographic primitives, we
More informationExponential Tail Bounds
Exponential Tail Bounds Mathias Winther Madsen January 2, 205 Here s a warm-up problem to get you started: Problem You enter the casino with 00 chips and start playing a game in which you double your capital
More informationExpectation of geometric distribution
Expectation of geometric distribution What is the probability that X is finite? Can now compute E(X): Σ k=1f X (k) = Σ k=1(1 p) k 1 p = pσ j=0(1 p) j = p 1 1 (1 p) = 1 E(X) = Σ k=1k (1 p) k 1 p = p [ Σ
More information