Privacy in Statistical Databases

Individuals x_1, x_2, ..., x_n contribute their data to a server/agency, which answers queries from users: government, researchers, businesses (or a malicious adversary).

What information can be released? Two conflicting goals:
Utility: users can extract global statistics.
Privacy: individual information stays hidden.
How can these be made precise? (And how context-dependent must they be?)

Why not use crypto definitions?

Cryptography successfully defined concepts such as encryption and secure function evaluation. Recall encryption:

Semantic security: for any function f, any distribution on messages, and any efficient algorithm A, there exists an efficient algorithm A′ such that
    Pr[ A(PK, Enc_PK(m)) = f(m) ] ≤ Pr[ A′(PK) = f(m) ] + ε.

Indistinguishability: for any message m, no efficient adversary can tell apart an encryption of m from an encryption of a default message:
    Enc_PK(0) ≈ Enc_PK(m).

The adversary's information is quantified precisely. Note that encryption must be randomized.

Encryption: Real vs Ideal worlds

Real world: Alice sends Bob an encryption of a 100-bit message m; the adversary sees the ciphertext.
Ideal world: Alice tells the adversary "I am sending Bob a message of 100 bits" and nothing else.
How can you simulate the ideal world, i.e., make the ideal world look like the real world? Have Alice send an encryption of 00...0 (100 zeros). No adversary can tell the real world from the simulation, and clearly the simulation leaks no information about m!

Notes about these definitions

Security is a property of the algorithm used for encryption; you can't point at a particular string and say it is secure.
The adversary's information and abilities are quantified precisely.
Because we allow the adversary side information about the message, all the security resides in the secret key and the randomness used for encryption.

Secure Function Evaluation

a.k.a. multi-party computation: several parties, each with input x_i, want to compute a function f(x_1, x_2, ..., x_n).
Ideal world: all parties hand their inputs to a trusted party, who computes f(x_1, ..., x_n) and releases the result.
There exist secure protocols for this task. Idea: a simulator can generate a dummy transcript given only the value of f.
Privacy application: use SFE protocols to jointly data-mine (horizontally or vertically partitioned data). Lots of papers (see optional topics).

Example: Shamir secret-sharing

Given a secret value s, distribute shares s_1, s_2, ..., s_n such that any single participant learns nothing about s, but any two participants can reconstruct it.
Idea: pick a random line that passes through (0, s); s_1 = the value at x=1, s_2 = the value at x=2, and so on.
Shares of two secrets s and t can be added to obtain shares of s+t. Shares can also be multiplied to obtain shares of s*t, but now you have a degree-2 polynomial that passes through (0, s*t).
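A minimal sketch of this two-out-of-n scheme in Python, working over a prime field as Shamir's construction requires. The modulus P, the secrets, and the share count are illustrative choices, and pow(·, -1, P) needs Python 3.8+:

import random

P = 2**61 - 1  # illustrative prime modulus; any prime larger than the secrets works

def share(s, n):
    """Split secret s into n shares using a random line through (0, s) mod P."""
    a = random.randrange(P)                      # random slope
    return [(x, (s + a * x) % P) for x in range(1, n + 1)]

def reconstruct(share1, share2):
    """Any two shares determine the line; evaluate it at x = 0 to recover the secret."""
    (x1, y1), (x2, y2) = share1, share2
    slope = (y2 - y1) * pow(x2 - x1, -1, P) % P  # modular inverse of the run
    return (y1 - slope * x1) % P

shares_s = share(42, n=5)
shares_t = share(100, n=5)
assert reconstruct(shares_s[0], shares_s[3]) == 42
# Adding shares pointwise yields shares of s + t, since the two lines add:
shares_sum = [(x, (ys + yt) % P) for (x, ys), (_, yt) in zip(shares_s, shares_t)]
assert reconstruct(shares_sum[0], shares_sum[1]) == 142

A single share (x, s + a·x mod P) is uniformly distributed whatever s is, which is why one participant alone learns nothing.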

Why not use crypto definitions?

Attempt #1. Def'n: for every entry i, no information about x_i is leaked (as if it were encrypted).
Problem: then no information at all can be revealed! There is a tradeoff between privacy and utility.

Attempt #2. Agree on summary statistics f(DB) that are deemed safe. Def'n: no information is leaked except f(DB).
Problem: why is f(DB) safe to release? This is the tautology trap. (Also: how do you figure out what f is?)

Why not use crypto definitions?

Problem: crypto makes sense in settings where the line between inside and outside is well-defined. E.g., a psychologist: inside = psychologist and patient; outside = everyone else.
In statistical databases, the line between inside and outside is fuzzy.

Achieving Differential Privacy

Examples
Intuitions for privacy
Why crypto def's don't apply
A Partial* Selection of Definitions
  Two straw men
  Blending into the Crowd
  An impossibility result
  Attribute Disclosure and Differential Privacy
Conclusions

Criteria for a definition: understandable; clear about the adversary's goals and prior knowledge / side information.

(* partial = incomplete and biased)

Straw man #1: Exact Disclosure

Setting: DB = (x_1, x_2, ..., x_n); a sanitizer San, with its own random coins, answers queries 1, ..., T from an adversary A.
Def'n: safe if the adversary cannot learn any entry exactly.
This leads to nice (but hard) combinatorial problems.
It does not preclude learning a value with 99% certainty, or narrowing it down to a small interval.
Historically, the focus was on auditing interactive queries; the difficulty is understanding the relationships between queries, e.g. two queries with a small difference (see the sketch below).
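To see why the relationships matter, here is the classic differencing attack as a tiny Python sketch; the table and the names are hypothetical:

# Each query alone is a harmless-looking aggregate, but their difference
# discloses one entry exactly.
salaries = {"alice": 95_000, "bob": 62_000, "carol": 71_000}

q1 = sum(salaries.values())                               # sum over everyone
q2 = sum(v for k, v in salaries.items() if k != "alice")  # everyone but Alice
print(q1 - q2)   # 95000: Alice's salary, recovered exactly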

Straw man #2: Learning the distribution

Assume x_1, ..., x_n are drawn i.i.d. from an unknown distribution.
Def'n: San is safe if it only reveals the distribution.
Implied approach: learn the distribution, then release a description of it, or re-sample points from it (sketched below).
Problem: the tautology trap again. The estimate of the distribution depends on the data, so why is releasing it safe?
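A sketch of the implied approach, with synthetic data and an assumed Gaussian model, which makes the tautology concrete:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=170.0, scale=8.0, size=1000)   # synthetic "heights"

# "Release only the distribution": fit a model and re-sample from it.
mu, sigma = data.mean(), data.std(ddof=1)
synthetic = rng.normal(mu, sigma, size=data.size)

# But the released parameters (mu, sigma) are themselves functions of the
# data -- the definition presupposes, rather than proves, that they are safe.
print(mu, sigma)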

Two Intuitions for Data Privacy

"If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
Learning more about me should be hard.

"Privacy is protection from being brought to the attention of others." [Gavison]
Safety is blending into a crowd.

Blending into a Crowd

Intuition: I am safe in a group of k or more (the value of k varies).
Many variations on the theme. The adversary wants a predicate g such that 0 < #{i : g(x_i) = true} < k; such a g is called a breach of privacy (a checker for this condition is sketched below).
Why? Fundamental: R. Gavison's "protection from being brought to the attention of others"; a rare property helps re-identify someone. Implicit: information about a large group is public, e.g. liver problems are more prevalent among diabetics.
Two variants: frequency in the DB vs. frequency in the underlying population.
How can we capture this?
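The breach condition is easy to state operationally; a small Python checker, over a made-up table of (hair, eye) records:

def is_breach(g, rows, k):
    """g breaches privacy if it isolates a nonempty group of fewer than k rows,
    i.e. 0 < #{i : g(x_i) = true} < k."""
    count = sum(1 for x in rows if g(x))
    return 0 < count < k

rows = [("blond", "blue")] * 2 + [("blond", "brown")] * 10 + [("brown", "brown")] * 12
print(is_breach(lambda r: r == ("blond", "blue"), rows, k=5))   # True: only 2 rows match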

Cluster Definitions

Given a sanitization S, look at the set of all databases consistent with S.
Def'n: safe if no predicate is a breach for all consistent databases.

k-anonymity [L. Sweeney]: the sanitization is a histogram of the data. Partition the domain D into bins B_1 ∪ B_2 ∪ ... ∪ B_t and output the bins and cardinalities f_j = #(DB ∩ B_j). Safe if for all j, either f_j ≥ k or f_j = 0.

Cell bound methods [statistics, 1990s]: the sanitization consists of marginal sums. Let f_z = #{i : x_i = z}; then San(DB) = various sums of the f_z. Safe if for all z, either there exists a consistent DB with f_z ≥ k, or f_z = 0 in all consistent DBs. E.g., released margins constrain the cells of a contingency table to the bracketed ranges:

             brown        blue         Σ
  blond      2  [0,12]    10 [0,12]    12
  brown      12 [0,14]    6  [0,16]    18
  Σ          14           16           30

Large literature using algebraic and combinatorial techniques.
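A toy Python version of the histogram-style sanitization, using the bin counts from the table above. Real k-anonymity systems generalize quasi-identifiers rather than simply suppressing small cells, so this shows only the flavor of the definition:

from collections import Counter

def k_anonymous_histogram(rows, k):
    """Output bin cardinalities f_j, keeping a bin only if f_j >= k."""
    counts = Counter(rows)
    return {bin_: c for bin_, c in counts.items() if c >= k}

rows = ([("blond", "brown")] * 2 + [("blond", "blue")] * 10
        + [("brown", "brown")] * 12 + [("brown", "blue")] * 6)
print(k_anonymous_histogram(rows, k=5))   # the 2-person (blond, brown) bin is suppressed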

Cluster Definitions: Issues

If k is small: "all three Canadians in CSE sing in a choir."
Semantics? Probability is not considered: what if I have side information?
The algorithm for making decisions is not considered.
What adversary does this apply to?

Security for Bayesian adversaries

Goal: the adversary outputs a point z ∈ D; score = 1/f_z if f_z > 0, and 0 otherwise.
Def'n: a sanitization is safe if E(score) ≤ ε.

Procedure:
Assume you know the adversary's prior distribution over databases.
Given a candidate output (e.g. a set of marginal sums), update the prior conditioned on the output (via Bayes' rule).
If max_z E(score | output) < ε, then release; else consider a new set of marginal sums.

Extensive literature on computing the expected value (see Yosi's talk).

Security for Bayesian adversaries: Issues

Restricts the type of predicates the adversary can choose.
Must know the prior distribution: can one scheme work for many distributions?
The sanitizer works harder than the adversary.
The conditional probabilities don't consider previous iterations; cf. simulatability [KMN 05]. Can this be fixed (with efficient computations)?
(A toy version of the release procedure is sketched below.)
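A toy, brute-force Python rendering of the procedure, assuming a deterministic San and a database space small enough to enumerate; real instances cannot do this, which is part of the "sanitizer works harder" complaint:

from itertools import product

def posterior(prior, output, san):
    """Bayes update: condition the prior on San(db) = output."""
    w = {db: p for db, p in prior.items() if san(db) == output}
    total = sum(w.values())
    return {db: p / total for db, p in w.items()}

def max_expected_score(post, domain):
    """max over guesses z of E[score], where score = 1/f_z if f_z > 0, else 0."""
    def e_score(z):
        return sum(p / db.count(z) for db, p in post.items() if z in db)
    return max(e_score(z) for z in domain)

# Databases are 3 bits with a uniform prior; the candidate output is their sum.
domain = (0, 1)
prior = {db: 1 / 8 for db in product(domain, repeat=3)}
post = posterior(prior, output=1, san=sum)
eps = 0.9
print(max_expected_score(post, domain) < eps)   # release only if True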

Blending into a Crowd: pros and cons

Intuition: I am safe in a group of k or more.
Pros: appealing intuition for privacy; seems fundamental; mathematically interesting; meaningful statements are possible!
Cons: does it rule out learning facts about a particular individual? And all results seem to make strong assumptions on the adversary's prior distribution. Is this necessary? (Yes: see the impossibility result below.)

Overview

Examples
Intuitions for privacy
Why crypto def's don't apply
A Partial* Selection of Definitions
  Two straw men
  Blending into the Crowd
  An impossibility result
  Attribute Disclosure and Differential Privacy
Conclusions

(* partial = incomplete and biased)

An impossibility result

An abstract schema for a definition: define a "privacy breach"; then require that for all distributions on databases and all adversaries A, there exists an A′ such that
    Pr[ A(San(DB)) = breach ] − Pr[ A′() = breach ] ≤ ε.

Theorem [Dwork-Naor]: for any "reasonable" notion of breach, if San(DB) contains any information about DB, then some adversary breaks this definition.

Example: the adversary knows that Alice is 2 inches shorter than the average Lithuanian, but how tall are Lithuanians? With the sanitized database, the probability of guessing Alice's height goes up. Theorem: this is unavoidable.

Proof sketch

Suppose that (i) if DB is uniform, then the mutual information I(DB; San(DB)) > 0, and (ii) a breach is predicting a predicate g(DB).
Pick a hash function h: {databases} → {0,1}^{H(DB | San)}, and take the prior distribution to be uniform conditioned on h(DB) = z.
Then h(DB) = z by itself gives no information about g(DB), but San(DB) and h(DB) = z together determine DB, and hence g(DB): an adversary with this side information breaches, while without San(DB) it cannot.
[DN] vastly generalize this.

Preventing Attribute Disclosure

Same interactive setting: DB = (x_1, ..., x_n), and San (with random coins) answers the adversary's queries.
A large class of definitions: safe if the adversary can't learn too much about any entry. E.g.:
The adversary cannot narrow x_i down to a small interval.
For uniform x_i, the mutual information I(x_i; San(DB)) ≤ ε.
How can we decide among these definitions?

Differential Privacy

The Lithuanians example: the adversary learns Alice's height even if Alice is not in the DB.
Intuition [DM]: whatever is learned would be learned regardless of whether or not Alice participates.
Dual: whatever is already known, the situation won't get worse.

Approach: Indistinguishability

Run the mechanism A (with its own local random coins) on a database x = (x_1, ..., x_n) and on a neighbor x′, where x′ is a neighbor of x if they differ in one row. Neighboring databases should induce close distributions on transcripts.

Definition: A is ε-indistinguishable if, for all neighbors x, x′ and for all subsets S of transcripts,
    Pr[ A(x) ∈ S ] ≤ (1 + ε) · Pr[ A(x′) ∈ S ].

(A checker for finite output spaces is sketched below.)
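For a mechanism with a finite transcript space the definition can be checked pointwise, since a (1+ε) bound on every singleton transcript extends to every set S by summing. A Python sketch, using one-bit randomized response as the mechanism; the truth probability p = 0.75 is an arbitrary choice:

def is_indistinguishable(dist_x, dist_x_nbr, eps):
    """dist_x maps each transcript t to Pr[A(x) = t]; check the (1 + eps)
    bound in both directions on every transcript."""
    transcripts = set(dist_x) | set(dist_x_nbr)
    return all(
        dist_x.get(t, 0.0) <= (1 + eps) * dist_x_nbr.get(t, 0.0)
        and dist_x_nbr.get(t, 0.0) <= (1 + eps) * dist_x.get(t, 0.0)
        for t in transcripts
    )

p = 0.75                          # report the true bit with probability p
dist_0 = {"0": p, "1": 1 - p}     # single-row database holding 0
dist_1 = {"0": 1 - p, "1": p}     # its neighbor, holding 1
print(is_indistinguishable(dist_0, dist_1, eps=2.0))   # True: 0.75 <= 3 * 0.25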

Approach: Indistinguishability (why this measure?)

Note that ε has to be non-negligible here: by the triangle inequality, databases differing in many rows are within a factor of roughly e^{εn}, so if ε < 1/n then all databases look alike and users get no information!
Why a multiplicative measure rather than statistical difference? Statistical difference doesn't make sense with ε > 1/n: e.g., the mechanism "choose a random i and release (i, x_i)" has statistical difference only 1/n between neighbors, yet it compromises someone's privacy with probability 1.

Differential Privacy

Another interpretation [DM]: you learn the same things about me regardless of whether I am in the database.
Suppose you know I am the height of the median Canadian. You could then learn my height from the database, but it didn't matter whether or not my data was part of it. Has my privacy been compromised? No!

Differential Privacy: games

Define n+1 games.
Game 0: the adversary interacts with San(DB).
For each i, let DB_{-i} = (x_1, ..., x_{i-1}, 0, x_{i+1}, ..., x_n); in Game i, the adversary interacts with San(DB_{-i}).
Bayesian adversary: given a transcript S and a prior distribution p(·) on DB, this defines n+1 posterior distributions p_0(· | S), ..., p_n(· | S).

Differential Privacy: definition

Definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, ..., n,
    StatDiff( p_0(· | S), p_i(· | S) ) ≤ ε.

Note that the prior distribution may be far from both posteriors. How can we satisfy this?

Indistinguishability ⇒ Differential Privacy

Recall the definition: San is safe if for all priors p(·), all transcripts S, and all i = 1, ..., n, StatDiff( p_0(· | S), p_i(· | S) ) ≤ ε.
We can use indistinguishability: for every S and DB, the likelihoods Pr[A(DB) = S] and Pr[A(DB_{-i}) = S] are within a (1+ε) factor of each other, and this implies StatDiff( p_0(· | S), p_i(· | S) ) ≤ ε (the Bayes-rule step is reconstructed below).
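The middle step, lost in the transcription, is Bayes' rule; the following is a reconstruction consistent with the claim, not the slide's exact formula. For every database x,

    p_0(x \mid S) = \frac{\Pr[A(x) = S]\; p(x)}{\sum_y \Pr[A(y) = S]\; p(y)},
    \qquad
    p_i(x \mid S) = \frac{\Pr[A(x_{-i}) = S]\; p(x)}{\sum_y \Pr[A(y_{-i}) = S]\; p(y)}.

Since each x is a neighbor of x_{-i}, indistinguishability bounds every likelihood ratio Pr[A(x) = S] / Pr[A(x_{-i}) = S] between 1/(1+ε) and (1+ε), so the two posteriors agree term by term up to (1+ε)^2 factors, and their statistical difference is of order ε, matching the slide's claim up to constants.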

Output Perturbation

The user asks "tell me f(x)", and the server (with its local random coins) returns f(x) + noise.
Intuition: f(x) can be released accurately when f is insensitive to individual entries x_1, x_2, ..., x_n.

Global sensitivity (a Lipschitz constant of f):
    GS_f = max over neighbors x, x′ of || f(x) − f(x′) ||_1
Example: GS_average = 1/n.

Theorem: if A(x) = f(x) + Lap(GS_f / ε), then A is ε-indistinguishable.
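A minimal sketch of the theorem's mechanism in Python; the database, ε, and the use of NumPy's default generator are illustrative:

import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(f_value, gs_f, eps):
    """Release f(x) + Lap(GS_f / eps) noise, coordinate-wise."""
    return f_value + rng.laplace(loc=0.0, scale=gs_f / eps, size=np.shape(f_value))

x = rng.random(1000)    # hypothetical database of values in [0, 1]
# Changing one row moves the average by at most 1/n, so GS_average = 1/n:
noisy_avg = laplace_mechanism(x.mean(), gs_f=1 / x.size, eps=0.1)
print(noisy_avg)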

Global Sensitivity: Noise Distribution

Theorem: if A(x) = f(x) + Lap(GS_f / ε), then A is ε-indistinguishable.
The Laplace distribution Lap(λ) has density h(y) ∝ e^{−||y||_1 / λ}.
Sliding property of Lap(GS_f / ε): h(y) / h(y+δ) ≤ e^{ε ||δ||_1 / GS_f} for all y, δ.
Proof idea: the output densities of A(x) and A(x′) are shifted copies of the same Laplace density, shifted by δ = f(x) − f(x′) with ||δ||_1 ≤ GS_f; the sliding property then bounds their ratio by e^ε.
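Spelled out (a reconstruction consistent with the slide's statement, with the normalizing constant dropped):

    h(y) \propto e^{-\varepsilon \lVert y \rVert_1 / GS_f}
    \quad\Longrightarrow\quad
    \frac{h(y)}{h(y+\delta)}
      = e^{\varepsilon (\lVert y+\delta \rVert_1 - \lVert y \rVert_1) / GS_f}
      \le e^{\varepsilon \lVert \delta \rVert_1 / GS_f},

by the triangle inequality; taking δ = f(x) − f(x′) with ||δ||_1 ≤ GS_f bounds the ratio of the two output densities by e^ε everywhere.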

Examples of low global sensitivity

Example: GS_average = 1/n if x ∈ [0,1]^n, so it suffices to add noise Lap(1/(εn)).
Comparison: to estimate a frequency (e.g. the proportion of diabetics) in the underlying population, you already incur sampling noise on the order of 1/√n, which swamps the added 1/(εn) noise.
Many natural functions have low GS, e.g.: histograms and contingency tables; the covariance matrix; distance to a property; functions that can be approximated from a random sample [BDMN].
Many data-mining and learning algorithms access the data via a sequence of low-sensitivity questions, e.g. the perceptron, some EM algorithms, SQ learning algorithms.
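Histograms make this concrete. Under the change-one-row notion of neighbors used in these slides, one row moves two bin counts by 1 each, so the whole vector of counts has L1 sensitivity 2 (a constant the slide leaves implicit); one ε then pays for all bins at once. A sketch:

import numpy as np
from collections import Counter

rng = np.random.default_rng()

def noisy_histogram(rows, bins, eps):
    """Release every bin count with independent Lap(2 / eps) noise."""
    counts = Counter(rows)
    return {b: counts.get(b, 0) + rng.laplace(scale=2 / eps) for b in bins}

rows = ["blond"] * 12 + ["brown"] * 18
print(noisy_histogram(rows, bins=["blond", "brown", "red"], eps=0.5))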

Why does this help?

With relatively little noise one can release:
Averages
Contingency tables
Matrix decompositions
Certain types of clustering

Differential Privacy Protocols

Output perturbation (release f(x) + noise):
  Sum queries [DiN 03, DwN 04, BDMN 05]
  Sensitivity frameworks [DMNS 06, NRS 07]
Input perturbation ("randomized response"; sketched below):
  Frequent item sets [EGS 03]
  (Various learning results)
Lower bounds:
  Limits on communication models: noninteractive [DMNS 06], local [NSW 07]
  Limits on accuracy: many good answers allow reconstructing the database [DiNi 03, DMT 07]
  Necessity of differential guarantees [DN]
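A sketch of the input-perturbation idea: this is the textbook one-bit randomized response rather than any specific protocol from the papers cited above, and the flip probability and population frequency are arbitrary choices:

import random

def randomized_response(bit, p=0.75):
    """Report the true bit with probability p, otherwise its flip; each report
    is indistinguishable up to a factor p / (1 - p)."""
    return bit if random.random() < p else 1 - bit

true_bits = [random.random() < 0.3 for _ in range(100_000)]   # ~30% have the property
reports = [randomized_response(int(b)) for b in true_bits]

# E[report] = p*q + (1-p)*(1-q) for true frequency q; invert to estimate q:
r = sum(reports) / len(reports)
q_hat = (r - 0.25) / 0.5
print(round(q_hat, 3))   # close to 0.3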


More information

1 Differential Privacy and Statistical Query Learning

1 Differential Privacy and Statistical Query Learning 10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 5: December 07, 015 1 Differential Privacy and Statistical Query Learning 1.1 Differential Privacy Suppose

More information

YALE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE

YALE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE YALE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE CPSC 467a: Cryptography and Computer Security Notes 23 (rev. 1) Professor M. J. Fischer November 29, 2005 1 Oblivious Transfer Lecture Notes 23 In the locked

More information

Lectures 1&2: Introduction to Secure Computation, Yao s and GMW Protocols

Lectures 1&2: Introduction to Secure Computation, Yao s and GMW Protocols CS 294 Secure Computation January 19, 2016 Lectures 1&2: Introduction to Secure Computation, Yao s and GMW Protocols Instructor: Sanjam Garg Scribe: Pratyush Mishra 1 Introduction Secure multiparty computation

More information

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module No. # 01 Lecture No. # 08 Shannon s Theory (Contd.)

More information

Lattice Cryptography

Lattice Cryptography CSE 06A: Lattice Algorithms and Applications Winter 01 Instructor: Daniele Micciancio Lattice Cryptography UCSD CSE Many problems on point lattices are computationally hard. One of the most important hard

More information

Lecture 24: Goldreich-Levin Hardcore Predicate. Goldreich-Levin Hardcore Predicate

Lecture 24: Goldreich-Levin Hardcore Predicate. Goldreich-Levin Hardcore Predicate Lecture 24: : Intuition A One-way Function: A function that is easy to compute but hard to invert (efficiently) Hardcore-Predicate: A secret bit that is hard to compute Theorem (Goldreich-Levin) If f :

More information

Notes for Lecture 17

Notes for Lecture 17 U.C. Berkeley CS276: Cryptography Handout N17 Luca Trevisan March 17, 2009 Notes for Lecture 17 Scribed by Matt Finifter, posted April 8, 2009 Summary Today we begin to talk about public-key cryptography,

More information

Report on Differential Privacy

Report on Differential Privacy Report on Differential Privacy Lembit Valgma Supervised by Vesal Vojdani December 19, 2017 1 Introduction Over the past decade the collection and analysis of personal data has increased a lot. This has

More information

ID-based Encryption Scheme Secure against Chosen Ciphertext Attacks

ID-based Encryption Scheme Secure against Chosen Ciphertext Attacks ID-based Encryption Scheme Secure against Chosen Ciphertext Attacks ongxing Lu and Zhenfu Cao Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030, P.. China {cao-zf,

More information

Lecture 26: Arthur-Merlin Games

Lecture 26: Arthur-Merlin Games CS 710: Complexity Theory 12/09/2011 Lecture 26: Arthur-Merlin Games Instructor: Dieter van Melkebeek Scribe: Chetan Rao and Aaron Gorenstein Last time we compared counting versus alternation and showed

More information

Lecture 9 Julie Staub Avi Dalal Abheek Anand Gelareh Taban. 1 Introduction. 2 Background. CMSC 858K Advanced Topics in Cryptography February 24, 2004

Lecture 9 Julie Staub Avi Dalal Abheek Anand Gelareh Taban. 1 Introduction. 2 Background. CMSC 858K Advanced Topics in Cryptography February 24, 2004 CMSC 858K Advanced Topics in Cryptography February 24, 2004 Lecturer: Jonathan Katz Lecture 9 Scribe(s): Julie Staub Avi Dalal Abheek Anand Gelareh Taban 1 Introduction In previous lectures, we constructed

More information

1 Number Theory Basics

1 Number Theory Basics ECS 289M (Franklin), Winter 2010, Crypto Review 1 Number Theory Basics This section has some basic facts about number theory, mostly taken (or adapted) from Dan Boneh s number theory fact sheets for his

More information

CSCI 5520: Foundations of Data Privacy Lecture 5 The Chinese University of Hong Kong, Spring February 2015

CSCI 5520: Foundations of Data Privacy Lecture 5 The Chinese University of Hong Kong, Spring February 2015 CSCI 5520: Foundations of Data Privacy Lecture 5 The Chinese University of Hong Kong, Spring 2015 3 February 2015 The interactive online mechanism that we discussed in the last two lectures allows efficient

More information

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha Locally Differentially Private Protocols for Frequency Estimation Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha Differential Privacy Differential Privacy Classical setting Differential Privacy

More information

Scribe for Lecture #5

Scribe for Lecture #5 CSA E0 235: Cryptography 28 January 2016 Scribe for Lecture #5 Instructor: Dr. Arpita Patra Submitted by: Nidhi Rathi 1 Pseudo-randomness and PRG s We saw that computational security introduces two relaxations

More information

8 Security against Chosen Plaintext

8 Security against Chosen Plaintext 8 Security against Chosen Plaintext Attacks We ve already seen a definition that captures security of encryption when an adversary is allowed to see just one ciphertext encrypted under the key. Clearly

More information

Privacy-preserving Data Mining

Privacy-preserving Data Mining Privacy-preserving Data Mining What is [data] privacy? Privacy and Data Mining Privacy-preserving Data mining: main approaches Anonymization Obfuscation Cryptographic hiding Challenges Definition of privacy

More information

Modern symmetric-key Encryption

Modern symmetric-key Encryption Modern symmetric-key Encryption Citation I would like to thank Claude Crepeau for allowing me to use his slide from his crypto course to mount my course. Some of these slides are taken directly from his

More information

Lectures 2+3: Provable Security

Lectures 2+3: Provable Security Lectures 2+3: Provable Security Contents 1 Motivation 1 2 Syntax 3 3 Correctness 5 4 Security Definitions 6 5 Important Cryptographic Primitives 8 6 Proofs of Security 10 7 Limitations of Provable Security

More information