Differentially Private Machine Learning


1 Differentially Private Machine Learning Theory, Algorithms, and Applications Kamalika Chaudhuri (UCSD) Anand D. Sarwate (Rutgers)

2 Logistics and Goals. Tutorial time: 2 hr (15 min break after the first hour). What this tutorial will do: motivate and define differential privacy, and provide an overview of common methods and tools to design differentially private ML algorithms. What this tutorial will not do: provide detailed results on the state of the art in differentially private versions of specific problems.

3 Learning Outcomes At the end of the tutorial, you should be able to: Explain the definition of differential privacy, Design basic differentially private machine learning algorithms using standard tools, Try different approaches for introducing differential privacy into optimization methods, Understand the basics of privacy risk accounting, Understand how these ideas blend together in more complex systems.

4 Motivating Differential Privacy

5 Sensitive Data Medical Records Genetic Data Search Logs

6-7 AOL Violates Privacy

8 Netflix Violates Privacy [NS08] Knowing 2-8 of Alice's movie ratings and their dates reveals: whether Alice is in the dataset or not, and Alice's other movie ratings.

9 High-dimensional Data is Unique. Example: UCSD Employee Salary Table, with columns Position, Gender, Department, Ethnicity, Salary. The row (Faculty, Female, CSE, SE Asian, -) matches exactly one employee (Kamalika)!

10 Simply anonymizing data is unsafe!

11 Disease Association Studies [WLWTZ09] Given correlations (R² values) between genetic markers in the Cancer group and in the Healthy group, Alice's DNA reveals whether Alice is in the Cancer set or the Healthy set.

12 Simply anonymizing data is unsafe! Statistics on small data sets are unsafe! There is a three-way tradeoff between privacy, data size, and accuracy.

13 Privacy Definition

14 The Setting. A private (sensitive, non-public) data set D is fed to a privacy-preserving sanitizer; everything the sanitizer releases across the privacy barrier is public: summary statistics, a synthetic dataset, a data release, or an ML model.

15 Property of the Sanitizer. The same picture: D stays behind the privacy barrier, and only the sanitizer's outputs (summary statistics, synthetic dataset, data release, ML model) become public. The sanitizer must keep aggregate information computable while protecting individual information (robust to side-information).

16 Differential Privacy. Running the algorithm on the data with a person included and with that person removed should yield (nearly) the same outcome distribution: participation of a person does not change the outcome. Since a person has agency, they can decide to participate in a dataset or not.

17 Adversary's Prior Knowledge: A's genetic profile; A smokes.

18-19 Case 1: The study reveals that A has cancer; the adversary, who knows A's genetic profile and that A smokes, learns A's cancer status. [The study violates A's privacy.] Case 2: The study reveals that smoking causes cancer; the adversary concludes that A, a smoker, probably has cancer. [The study does not violate privacy.]

20 Differential Privacy [DMNS06] The algorithm's outcome distribution is (nearly) the same whether or not any one person's data is included: participation of a person does not change the outcome. Since a person has agency, they can decide to participate in a dataset or not.

21-22 How to ensure this? Through randomness: A(Data + person 1) and A(Data + person 2) are random variables with close distributions. Randomness: added by the randomized algorithm A. Closeness: the likelihood ratio at every point is bounded.

23-24 Differential Privacy [DMNS06] For all D, D′ that differ in one person's value, if A is an ε-differentially private randomized algorithm, then

$\sup_t \log \frac{p(A(D) = t)}{p(A(D') = t)} \le \varepsilon,$

i.e., the max-divergence of p(A(D)) and p(A(D′)) is at most ε.

25 Approx. Differential Privacy [DKM+06] For all D, D′ that differ in one person's value, if A is an (ε, δ)-differentially private randomized algorithm, then

$\max_{S:\ \Pr(A(D) \in S) > \delta} \log \frac{\Pr(A(D) \in S) - \delta}{\Pr(A(D') \in S)} \le \varepsilon.$

26 Properties of Differential Privacy

27 Property 1: Post-processing Invariance. A differentially private algorithm with parameters (ε₁, δ₁) sits behind the privacy barrier; any non-private post-processing of its output (data derivatives consumed by legitimate users or by an adversary) remains (ε₁, δ₁)-private. Risk doesn't increase if you don't touch the data again.

28 Property 2: Graceful Composition. Running A₁ with (ε₁, δ₁), A₂ with (ε₂, δ₂), and so on, on the same private data set D is $\left(\sum_i \varepsilon_i, \sum_i \delta_i\right)$-differentially private: the total privacy loss is the sum of the privacy losses. (Better composition is possible; coming up later.)

29 How to achieve Differential Privacy?

30 Tools for Differentially Private Algorithm Design. Global Sensitivity Method [DMNS06]. Exponential Mechanism [MT07]. Many others we will not cover [DL09, NRS07, …].

31 Global Sensitivity Method [DMNS06] Problem: given a function f and a sensitive dataset D, find a differentially private approximation to f(D). Example: f(D) = mean of the data points in D.

32-36 The Global Sensitivity Method [DMNS06] Given: a function f and a sensitive dataset D. Define: dist(D, D′) = the number of records in which D and D′ differ. Global sensitivity of f:

$S(f) = \max_{D, D' \in \mathrm{Domain}:\ \mathrm{dist}(D, D') = 1} |f(D) - f(D')|$

37 Laplace Mechanism. Global sensitivity of f is $S(f) = \max_{\mathrm{dist}(D,D')=1} |f(D) - f(D')|$. Output $f(D) + Z$, where $Z \sim \mathrm{Lap}(S(f)/\varepsilon)$, i.e., the Laplace distribution with density $p(z) \propto e^{-|z|\varepsilon/S(f)}$; this is ε-differentially private.
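As a concrete illustration, here is a minimal sketch of the Laplace mechanism in Python (assuming NumPy; `sensitivity` stands for S(f) and must be supplied by the analyst):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value + Lap(S(f)/epsilon) noise; epsilon-DP provided
    sensitivity upper-bounds |f(D) - f(D')| over neighboring datasets."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```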

38 Gaussian Mechanism. Global sensitivity of f is $S(f) = \max_{\mathrm{dist}(D,D')=1} |f(D) - f(D')|$. Output $f(D) + Z \cdot S(f)$, where $Z \sim \mathcal{N}\!\left(0,\ \frac{2\ln(1.25/\delta)}{\varepsilon^2}\right)$; this is (ε, δ)-differentially private.

39-41 Example 1: Mean. f(D) = Mean(D), where each record is a scalar in [0, 1]. Global sensitivity of f = 1/n. Laplace mechanism: output $\mathrm{Mean}(D) + Z$, where $Z \sim \frac{1}{n\varepsilon}\,\mathrm{Lap}(0, 1)$.
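A usage sketch for this example (hypothetical data; it reuses the `laplace_mechanism` helper from the sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.uniform(size=1000)        # n = 1000 records, each in [0, 1]
epsilon = 0.5
# S(mean) = 1/n for records bounded in [0, 1]
private_mean = laplace_mechanism(D.mean(), sensitivity=1.0 / len(D),
                                 epsilon=epsilon, rng=rng)
```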

42 How to get Differential Privacy? Global Sensitivity Method [DMNS06], with two variants: Laplace and Gaussian. Exponential Mechanism [MT07]. Many others we will not cover [DL09, NRS07, …].

43 Exponential Mechanism [MT07] Problem: given a function f(w, D) and sensitive data D, find a differentially private approximation to $w^* = \operatorname{argmax}_w f(w, D)$. Example: f(w, D) = accuracy of classifier w on dataset D.

44 The Exponential Mechanism [MT07] Suppose that for any w, $|f(w, D) - f(w, D')| \le S$ when D and D′ differ in one record. Sample w from $p(w) \propto e^{\varepsilon f(w, D)/2S}$ for ε-differential privacy. [Figure: the sampling distribution Pr(w) concentrates around $\operatorname{argmax}_w f(w, D)$.]

45 Example: Parameter Tuning. Given validation data D and k classifiers w₁, …, w_k, (privately) find the classifier with the highest accuracy on D. Here f(w, D) = classification accuracy of w on D. For any w and any D, D′ that differ by one record, $|f(w, D) - f(w, D')| \le \frac{1}{|D|}$. So the exponential mechanism outputs $w_i$ with probability $\Pr(w_i) \propto e^{\varepsilon |D| f(w_i, D)/2}$. [CMS11, CV13]
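A minimal sketch of this selection step (assuming NumPy; `accuracies` would hold f(wᵢ, D) for the k candidates, and the sensitivity of accuracy is 1/|D| as above; the names are illustrative):

```python
import numpy as np

def exponential_mechanism(scores, sensitivity, epsilon, rng=None):
    """Sample index i with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity)); epsilon-DP."""
    rng = rng or np.random.default_rng()
    logits = epsilon * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()            # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# e.g., picking among k classifiers scored on n validation points:
# best = exponential_mechanism(accuracies, sensitivity=1.0 / n, epsilon=0.5)
```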

46 Summary Motivation What is differential privacy? Basic differential privacy algorithm design tools: The Global Sensitivity Method Laplace Mechanism Gaussian Mechanism Exponential Mechanism

47 Differential privacy in estimation and prediction

48 Estimation and prediction problems. A private data set D of sample size n passes through an (ε, δ)-private estimator to produce ŵ, a DP estimate of $\operatorname{argmin}_w E[f(w, z)]$; quality is measured by the risk functional $E[f(\hat{w}, z)]$. Statistical estimation: estimate a parameter or predictor using private data that has good expected performance on future data. Goal: a good privacy-accuracy-sample-size tradeoff.

49 Privacy and accuracy make different assumptions about the data. Privacy: differential privacy makes no assumptions on the data distribution; privacy holds unconditionally. Accuracy: accuracy is measured w.r.t. a true population distribution: expected excess statistical risk.

50 Statistical Learning as Risk Minimization.

$w^* = \operatorname{argmin}_w \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

Empirical Risk Minimization (ERM) is a common paradigm for prediction problems. Produces a predictor w for a label/response y given a vector of features/covariates x. Typically uses a convex loss function and a regularizer to prevent overfitting.

51 Why is ERM not private? Neighboring private data sets D and D′ produce different minimizers w and w′, so it is easy for an adversary observing the output to tell the difference between D and D′. [CMS11, RBHT12]

52 Kernel learning: even worse. Kernel-based methods produce a classifier that is a function of the data points:

$w(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$

Even an adversary with black-box access to w could potentially learn those points. [CMS11]

53 Privacy is compatible with learning Good learning algorithms generalize to the population distribution, not individuals. Stable learning algorithms generalize [BE02]. Differential privacy can be interpreted as a form of stability that also implies generalization [DFH+15,BNS+16]. Two parts of the same story: Privacy implies generalization asymptotically. Tradeoffs between privacy-accuracy-sample size for finite n.

54 Revisiting ERM.

$w^* = \operatorname{argmin}_w \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

Learning using (convex) optimization uses three steps: 1. read in the data (input perturbation); 2. form the objective function (objective perturbation); 3. perform the minimization (output perturbation). We can try to introduce privacy in each step!

55 Privacy in ERM: options. The privacy barrier can sit at different points in the pipeline: input perturbation (DP preprocessing of the private database D by noise addition or randomization, yielding a sanitized dataset for a non-private learning algorithm), objective perturbation (DP optimization on the raw data), and output perturbation (non-private optimization followed by noise addition or randomization of the result ŵ); each placement carries its own (ε, δ) guarantee.

56 Local Privacy. Data contributors sanitize their data before collection, so the privacy barrier sits between each contributor and the (non-private) aggregation algorithm. Classical technique: randomized response [W65]. An interactive variant can be minimax optimal [DJW13].
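A minimal sketch of classical randomized response [W65] for a single private bit: report the truth with probability e^ε/(1+e^ε), which gives ε-local differential privacy, then debias the aggregate (binary data assumed; helper names are illustrative):

```python
import numpy as np

def randomized_response(bit, epsilon, rng=None):
    """Report the true bit w.p. e^eps / (1 + e^eps); flip it otherwise."""
    rng = rng or np.random.default_rng()
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the population mean from randomized reports."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)
```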

57 Input Perturbation. The private database D passes through (ε, δ)-DP preprocessing to produce a sanitized dataset, on which non-private learning yields ŵ. Input perturbation: add noise to the input data. Advantages: easy to implement, and it results in a reusable sanitized data set. [DJW13, FTS17]

58 Output Perturbation. Run non-private preprocessing and optimization on D to get w*, then add noise or randomize to release ŵ across the (ε, δ) privacy barrier. Compute the minimizer and add noise: this does not require re-engineering baseline algorithms, and the noise depends on the sensitivity of the argmin. [CMS11, RBHT12]
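A hedged sketch of output perturbation in the spirit of [CMS11]. Assumptions not stated on the slide: the loss is 1-Lipschitz, features satisfy ||xᵢ|| ≤ 1, and the regularizer is (λ/2)||w||²; under those conditions the L2-sensitivity of the argmin is 2/(nλ), and we add noise with density proportional to exp(−ε||b||/sensitivity):

```python
import numpy as np

def gamma_laplace_noise(d, scale, rng):
    """Sample b in R^d with density proportional to exp(-||b|| / scale):
    uniform direction, Gamma(shape=d, scale) norm."""
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=scale) * direction

def output_perturbation(w_hat, n, lam, epsilon, rng=None):
    """w_hat: non-private regularized ERM minimizer. Sensitivity 2/(n*lam)
    assumed (1-Lipschitz loss, ||x_i|| <= 1, (lam/2)||w||^2 regularizer)."""
    rng = rng or np.random.default_rng()
    sensitivity = 2.0 / (n * lam)
    return w_hat + gamma_laplace_noise(len(w_hat), sensitivity / epsilon, rng)
```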

59 Objective Perturbation. The private database D goes through non-private preprocessing, then (ε, δ)-DP optimization produces ŵ; the privacy barrier sits at the objective.

60 Objective Perturbation.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

A. Add a random term to the objective: $\hat{w}_{\mathrm{priv}} = \operatorname{argmin}_w J(w) + w^\top b$. B. Do a randomized approximation of the objective: $\hat{w}_{\mathrm{priv}} = \operatorname{argmin}_w \hat{J}(w)$. The randomness depends on the sensitivity properties of J(w). [CMS11, ZZX+12]

61 Sensitivity of the argmin.

$\hat{w}_{\mathrm{priv}} = \operatorname{argmin}_w J(w) + w^\top b$

Non-private optimization solves $\nabla J(w) = 0 \Rightarrow w^*$. Generate a vector analogue of Laplace noise: $b \sim p(z) \propto e^{-(\varepsilon/2)\|z\|}$. Private optimization solves $\nabla J(w) = -b \Rightarrow \hat{w}_{\mathrm{priv}}$. We have to bound the sensitivity of the gradient.

62 Theoretical bounds on excess risk. Up to logarithmic factors, the excess risk scales as $\tilde{O}\!\big(\sqrt{d \log(1/\delta)}/(n\varepsilon)\big)$ and $\tilde{O}\!\big(d/(n\varepsilon)\big)$ for objective and output perturbation [CMS11, BST14], and $\tilde{O}\!\big(d/(n\varepsilon^{3/2})\big)$ for input perturbation with quadratic loss [FTS17]. The same important parameters appear throughout: privacy parameters (ε, δ), data dimension d, data bounds $\|x_i\| \le B$, analytical properties of the loss (Lipschitz constant, smoothness), and the regularization parameter.

63 Typical empirical results [CMS11] [JT14] In general: objective perturbation empirically outperforms output perturbation; Gaussian mechanisms with (ε, δ) guarantees outperform Laplace-like mechanisms with ε-guarantees; the loss vs. non-private methods is very dataset-dependent.

64 Gaps between theory and practice. Theoretical analysis is for fixed privacy parameters: how should we choose them in practice? Given a data set, can I tell what the privacy-utility-sample-size tradeoff is? What about more general optimization problems/algorithms? What about scaling (computationally) to large data sets?

65 Summary. Training does not on its own guarantee privacy. There are many ways to incorporate DP into prediction and learning using ERM, with different privacy-accuracy-sample-size tradeoffs. Good DP algorithms should generalize since they learn about populations, not individuals. Theory and experiment show that (ε, δ)-DP algorithms have better accuracy than ε-DP algorithms, at the cost of a weaker privacy guarantee.

66 Break

67 Differential privacy and optimization algorithms

68 Scaling up private optimization. Large data sets are challenging for optimization: batch methods are not feasible. Using more data can make our tradeoffs look better: better privacy and accuracy. Online learning involves multiple releases: potential for more privacy loss. Goal: guarantee privacy using the optimization algorithm itself.

69 Stochastic Gradient Descent. Stochastic gradient descent (SGD) is a moderately popular method for optimization. Stochastic gradients are random, hence already noisy: already private? Optimization is iterative: intermediate results leak information.

70 Non-private SGD.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

$w_0 = 0$. For $t = 1, 2, \ldots, T$: select a random data point $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$ and take a gradient step,

$g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1}), \qquad w_t = \Pi_W(w_{t-1} - \eta_t g_t).$

Output $\hat{w} = w_T$.

71 Private SGD with noise.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

$w_0 = 0$. For $t = 1, 2, \ldots, T$: select a random data point $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$ and add noise to the gradient,

$z_t \sim p_{(\varepsilon,\delta)}(z), \qquad \hat{g}_t = z_t + \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1}), \qquad w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t).$

Output $\hat{w} = w_T$. [SCS15]
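An illustrative sketch of the noisy-gradient loop (a sketch, not the algorithm of any one paper: per-example gradients are clipped to norm `clip` so each step has bounded sensitivity, and Gaussian noise of scale `sigma * clip` is added; calibrating `sigma` and T to a target (ε, δ) requires a composition analysis like the one later in this tutorial):

```python
import numpy as np

def noisy_sgd(grad_fn, data, w0, T, lr, clip, sigma, rng=None):
    """grad_fn(w, x) returns the gradient of the loss at one example x.
    Projection onto the constraint set W is omitted for brevity."""
    rng = rng or np.random.default_rng()
    w = w0.copy()
    for _ in range(T):
        x = data[rng.integers(len(data))]            # i_t ~ Unif{1, ..., n}
        g = grad_fn(w, x)
        g = g * min(1.0, clip / np.linalg.norm(g))   # bound per-step sensitivity
        g = g + rng.normal(scale=sigma * clip, size=g.shape)
        w = w - lr * g
    return w
```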

72 Choosing a noise distribution. Laplace mechanism: $p(z) \propto e^{-(\varepsilon/2)\|z\|}$ gives ε-DP. Gaussian mechanism: $p(z)$ i.i.d. $\mathcal{N}\!\left(0,\ \tilde{O}\!\left(\frac{\log(1/\delta)}{\varepsilon^2}\right)\right)$ gives (ε, δ)-DP. We have to choose the noise according to the sensitivity of the gradient:

$\max_{D, D'} \max_w \|\nabla J(w; D) - \nabla J(w; D')\|.$

Sensitivity depends on the data distribution, the Lipschitz parameter of the loss, etc.

73 Private SGD with randomized selection.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

$w_0 = 0$. For $t = 1, 2, \ldots, T$: select a random data point $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$, compute $g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$, and randomly select an unbiased gradient estimate $\hat{g}_t \sim p_{(\varepsilon,\delta),g}(z)$; then $w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t)$. Output $\hat{w} = w_T$. [DJW13]

74 Randomized directions [DJW13] First round the gradient to $z = \pm \frac{L}{\|g\|} g$, taking the $+$ sign w.p. $\frac{1}{2} + \frac{\|g\|}{2L}$. Then select the hemisphere (of the radius-B sphere) in the direction of z w.p. $\frac{e^\varepsilon}{1+e^\varepsilon}$, or the opposite hemisphere w.p. $\frac{1}{1+e^\varepsilon}$, and pick uniformly from that hemisphere to take a step. Need to have control of gradient norms: $\|g\| \le L$. Keep some probability of going in the wrong direction.

75 Why does DP-SGD work? Noisy gradient ($\hat{g} = g + z_t$): choose the noise distribution using the sensitivity of the gradient. Random gradient: randomly select a direction biased towards the true gradient. Both methods: guarantee DP at each iteration, and ensure an unbiased estimate of g to guarantee convergence.

76 Making DP-SGD more practical. SGD is robust to noise: true up to a point; for small epsilon (more privacy), the gradients can become too noisy. Solution 1: more iterations ([BST14]: need $O(n^2)$). Solution 2: use standard tricks: mini-batching, etc. [SCS13]. Solution 3: use better analysis to show the privacy loss is not so bad [BST14][ACG+16].

77 Randomly sampling data can amplify privacy. Suppose we have an algorithm A which is ε-differentially private for ε ≤ 1. Sample γn entries of D uniformly at random and run A on those. The resulting randomized method guarantees 2γε-differential privacy. [BBKN14, BST14]
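A tiny worked instance of this claim (a sketch; the 2γε bound is stated for ε ≤ 1):

```python
def amplified_epsilon(eps, gamma):
    """Overall privacy of running an eps-DP algorithm (eps <= 1) on a
    uniformly random gamma-fraction subsample of the data."""
    assert eps <= 1.0, "the 2*gamma*eps bound is stated for eps <= 1"
    return 2.0 * gamma * eps

# e.g., eps = 1.0 on a 1% subsample: amplified_epsilon(1.0, 0.01) == 0.02
```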

78 Summary. Stochastic gradient descent can be made differentially private in several ways by randomizing the gradient. Keeping gradient estimates unbiased will help ensure convergence. Standard approaches for variance reduction/stability (such as minibatching) can help with performance. Random subsampling of the data can amplify privacy guarantees.

79 Accounting for total privacy risk

80 Measuring total privacy loss. Post-processing invariance: risk doesn't increase if you don't touch the data again. But more complex algorithms have multiple stages, all stages have to guarantee DP, and we need a way to do privacy accounting: what is lost over time/multiple queries?

81 A simple example. A pipeline in which every stage touches the private data set D across the privacy barrier: Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄).

82 Composition property of differential privacy. Basic composition: privacy loss is additive. Apply R algorithms with $(\varepsilon_i, \delta_i)$, $i = 1, 2, \ldots, R$; the total privacy loss is

$\left( \sum_{i=1}^{R} \varepsilon_i,\ \sum_{i=1}^{R} \delta_i \right).$

This is a worst-case analysis: each result exposes the worst privacy risk.

83 What composition says about multi-stage methods. For the pipeline above (Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄)), the total privacy loss is the sum of the privacy losses.

84 An open question: privacy allocation across stages. Composition means we have a privacy budget. How should we allocate privacy risk across different stages of a pipeline? Noisy features + accurate training? Clean features + sloppy training? It's application dependent! Still an open question.

85 A closer look at ε and δ. Gaussian noise of a given variance produces a spectrum of (ε, δ) guarantees: the likelihood ratio $p(t \mid D)/p(t \mid D')$ is bounded by $\exp(\varepsilon_1)$ outside a region of probability δ₁, and by a larger $\exp(\varepsilon_2)$ outside a smaller region of probability δ₂, so the same mechanism is simultaneously (ε₁, δ₁)-DP and (ε₂, δ₂)-DP.

86 Privacy loss as a random variable. The spectrum of (ε, δ) guarantees means we can trade off ε and δ when analyzing a particular mechanism. The actual privacy loss is a random variable that depends on D:

$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t).$

87 Random privacy loss.

$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t).$

Even after taking the max over (D, D′), the loss is still a random variable. Sequentially computing functions on private data is like sequentially sampling independent privacy losses. Concentration of measure shows that the loss is much closer to its expectation.

88 Strong composition bounds. Given only the (ε, δ) guarantees for k algorithms operating on the data, composition again gives a family of (ε, δ) tradeoffs: for each i,

$\underbrace{(\varepsilon, \delta), (\varepsilon, \delta), \ldots, (\varepsilon, \delta)}_{k \text{ times}} \;\Longrightarrow\; \Big( (k - 2i)\varepsilon,\ 1 - (1 - \delta)^k (1 - \delta_i) \Big), \qquad \delta_i = \frac{\sum_{\ell=0}^{i-1} \binom{k}{\ell} \left( e^{(k-\ell)\varepsilon} - e^{(k-2i+\ell)\varepsilon} \right)}{(1 + e^{\varepsilon})^k}.$

We can quantify privacy loss by choosing any valid (ε, δ) pair. [DRV10, KOV15]

89 Moments accountant. A sequence of mechanisms A₁, A₂, … applied to the private data set D composes into a single mechanism A with overall parameters (ε, δ). Basic idea: directly calculate the parameters (ε, δ) from composing a sequence of mechanisms. More efficient than the composition theorems. [ACG+16]

90 How to Compose Directly? Given datasets D and D′ with one different record and a mechanism A, define the privacy loss random variable as

$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)}, \quad \text{w.p. } p(A(D) = t).$

Properties of $Z_{D,D'}$ are related to the privacy loss of A: if the max absolute value of $Z_{D,D'}$ over all D, D′ is ε, then A is (ε, 0)-differentially private.

91 How to Compose Directly? With the privacy loss random variable $Z_{D,D'}$ defined as above, the challenge is to reason about the worst case over all D, D′. Key idea in [ACG+16]: use moment generating functions.

92 Accounting for Moments. Three steps: 1. Calculate moment generating functions for A₁, A₂, … 2. Compose. 3. Calculate the final privacy parameters (ε, δ). [ACG+16]

93 1. Stepwise Moments. Define the stepwise moment at time t of $A_t$ at any s:

$\alpha_{A_t}(s) = \sup_{D, D'} \log E\!\left[ e^{s Z_{D,D'}} \right]$

(where D and D′ differ by one record). [ACG+16]

94 2. Compose. Theorem: Suppose A = (A₁, …, A_T). For any s:

$\alpha_A(s) \le \sum_{t=1}^{T} \alpha_{A_t}(s).$

[ACG+16]

95 3. Final Calculation. Theorem: For any ε, the mechanism A is (ε, δ)-DP for

$\delta = \min_s \exp\!\left( \alpha_A(s) - s\varepsilon \right).$

Use the theorem to find the best ε for a given δ, from closed form or by searching over s₁, s₂, …, s_k. [ACG+16]

96 Example: composing the Gaussian mechanism. Suppose each $A_t$ answers a query with global sensitivity 1 by adding N(0, 1) noise.

97 1. Stepwise Moments. With $A_t$ answering a query of global sensitivity 1 by adding N(0, 1) noise, simple algebra gives, for any s:

$\alpha_{A_t}(s) = \frac{s(s+1)}{2}.$

98 2. Compose. With $A_t$ as above,

$\alpha_A(s) \le \sum_{t=1}^{T} \alpha_{A_t}(s) = \frac{T s(s+1)}{2}.$

99 3. Final Calculation. Find the lowest δ for a given ε (or vice versa) by solving

$\delta = \min_s \exp\!\left( T s(s+1)/2 - s\varepsilon \right).$

In this case, the solution can be found in closed form.
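A short sketch of this final step for the composed Gaussian example (it assumes the stepwise bound α(s) = T·s(s+1)/2 derived above, and searches a grid of s rather than using the closed form):

```python
import numpy as np

def delta_for_epsilon(T, eps, s_max=200):
    """delta = min_s exp(alpha_A(s) - s*eps) with alpha_A(s) = T*s*(s+1)/2."""
    s = np.arange(1, s_max + 1, dtype=float)
    exponents = T * s * (s + 1) / 2.0 - s * eps
    return float(np.exp(exponents.min()))

# e.g., delta_for_epsilon(T=1, eps=5.0) evaluates the tradeoff at eps = 5
```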

100 How does it compare? [Figure: ε vs. number of rounds of composition for the strong composition bound [DRV10] and the moments accountant; both grow better than linearly, with the moments accountant giving the smaller ε.] [ACG+16]

101 How does it compare on real data? [Figure: EM for a mixture of Gaussians (MOG) with the Gaussian mechanism, δ = 10⁻⁴.] [PFCW17]

102 Summary Practical machine learning looks at the data many times. Post-processing invariance means we just have to track the cumulative privacy loss. Good composition methods use the fact that the actual privacy loss may behave much better than the worst-case bound. The Moments Accountant method tracks the actual privacy loss more accurately: better analysis for better privacy guarantees.

103 Applications to modern machine learning

104 When is differential privacy practical? Differential privacy is best suited for understanding population-level statistics and structure: inferences about the population should not depend strongly on individuals. Large sample sizes usually mean lower sensitivity and less noise. To build and analyze systems we have to leverage post-processing invariance and composition properties.

105 Differential privacy in practice. Google: RAPPOR for tracking statistics in Chrome. Apple: various iPhone usage statistics. Census: the 2020 US Census will use differential privacy. Mostly focused on count and average statistics.

106 Challenges for machine learning applications Differentially private ML is complicated because real ML algorithms are complicated: Multi-stage pipelines, parameter tuning, etc. Need to play around with the data before committing to a particular pipeline/algorithm. Modern ML approaches (= deep learning) have many parameters and less theoretical guidance.

107 Some selected examples. For today, we will describe some recent examples: 1. Differentially private deep learning [ACG+16]. 2. Differential privacy and Bayesian inference.

108 Differential privacy and deep learning Main idea: train a deep network using differentially private SGD and use moments accountant to track privacy loss. Additional components: gradient clipping, minibatching, data augmentation, etc. [ACG+16]

109 Overview of the algorithm. Select random lots $L_1, L_2, \ldots$ from the private data set D, compute per-example gradients, clip them and add noise, and update the parameters to get new network weights; over the T iterations, the moments accountant converts the accumulated loss into the final (ε, δ) at the privacy barrier. [ACG+16]
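A condensed sketch of the per-lot update in the style of [ACG+16] (assumptions: `per_example_grads` is the list of gradients for one lot, `clip_norm` is the clipping threshold C, and `noise_mult` is the noise multiplier σ; the moments-accountant bookkeeping is separate):

```python
import numpy as np

def dp_sgd_lot_update(per_example_grads, clip_norm, noise_mult, rng=None):
    """Clip each example's gradient to norm C, sum, add N(0, (sigma*C)^2 I)
    noise, and average over the lot."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip_norm / np.linalg.norm(g))
               for g in per_example_grads]
    noise = rng.normal(scale=noise_mult * clip_norm, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
```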

110 Effectiveness of DP deep learning. Empirical results on MNIST and CIFAR: training and test error come close to baseline non-private deep learning methods. To get moderate loss in performance, epsilon and delta are not negligible. [ACG+16]

111 Moving forward in deep learning This is a good proof of concept for differential privacy for deep neural networks. There are lots of interesting ways to expand this. Just used one NN model: what about other architectures? RNNs? GANs? Can regularization methods for deep learning (e.g. dropout) help with privacy? What are good rules of thumb for lot/batch size, learning rate, # of hidden units, etc? [ACG+16]

112 Differentially Private Bayesian Inference. Data X = {x₁, x₂, …} and a model class Θ are related through the likelihood p(x | θ). Prior π(θ) + data X = posterior p(θ | X). Goal: find a differentially private approximation to the posterior.

113 Differentially Private Bayesian Inference General methods for private posterior approximation A Special Case: Exponential Families Variational Inference

114 How to make the posterior private? Option 1: direct posterior sampling [DMNR14]. Not differentially private except under restrictive conditions: likelihood ratios may be unbounded! [GSC17] The answer changes under a newer relaxation, Rényi differential privacy [M17].

115 How to make the posterior private? Option 2: the One Posterior Sample (OPS) method [WFS15]. 1. Truncate the posterior so that the likelihood ratio is bounded in the truncated region. 2. Raise the truncated posterior to a higher temperature.

116 How to make the posterior private? Option 2: the One Posterior Sample (OPS) method. Advantage: general. Pitfalls: intractable (only the exact distribution is private); low statistical efficiency even for large n.

117 How to make the posterior private? Option 3: approximate the OPS distribution via stochastic gradient MCMC [WFS15]. Advantage: the noise added during stochastic gradient MCMC contributes to privacy. Disadvantage: statistical efficiency is lower than exact OPS.

118 Differentially Private Bayesian Inference General methods for private posterior approximation A Special Case: Exponential Families Variational Inference

119 Exponential Family Posteriors. The (non-private) posterior comes from an exponential family:

$p(\theta \mid x) \propto e^{\eta(\theta)^\top \left( \sum_i T(x_i) \right) - B(\theta)}$

given data x₁, x₂, … The posterior depends on the data only through the sufficient statistic T.

120 Exponential Family Posteriors. With the posterior as above and sufficient statistic T, private sampling works in two steps: 1. If T is bounded, add noise to $\sum_i T(x_i)$ to get a private version $T'$. 2. Sample from the perturbed posterior:

$p(\theta \mid x) \propto e^{\eta(\theta)^\top T' - B(\theta)}.$

[ZRD16, FGWC16]
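A minimal instance of this recipe for a Beta-Bernoulli posterior (a sketch in the spirit of [ZRD16, FGWC16]: the sufficient statistic Σᵢ xᵢ has sensitivity 1, so Laplace noise of scale 1/ε makes it ε-DP, and sampling from the perturbed posterior is free post-processing; clipping the noisy count to a valid range is a pragmatic choice, not from the slides):

```python
import numpy as np

def private_beta_posterior_sample(x, epsilon, a0=1.0, b0=1.0, rng=None):
    """x: array of 0/1 observations. Perturb T = sum(x) with Lap(1/eps)
    noise, then draw theta from the perturbed Beta posterior."""
    rng = rng or np.random.default_rng()
    n = len(x)
    T_noisy = np.sum(x) + rng.laplace(scale=1.0 / epsilon)
    T_noisy = float(np.clip(T_noisy, 0.0, n))  # keep a valid count (heuristic)
    return rng.beta(a0 + T_noisy, b0 + n - T_noisy)
```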

121 How well does it work? [Figure: test-set log-likelihood vs. total ε for a non-private HMM, non-private naive Bayes, the Laplace-mechanism HMM, and the OPS HMM (truncation multiplier = 100).] The method is statistically efficient; performance is worse than non-private but better than OPS. One can do inference in relatively complex systems by building on this method, e.g., time-series clustering with HMMs.

122 Differentially Private Bayesian Inference General methods for private posterior approximation A Special Case: Exponential Families Variational Inference

123 Variational Inference Key Idea: Start with a stochastic variational inference method, and make each step private by adding Laplace noise. Use moments accountant and subsampling to track privacy loss. [JDH16, PFCW16]

124 Summary Two examples of differentially private complex machine learning algorithms Deep learning Bayesian inference

125 Summary, conclusions, and where to look next

126 Summary 1. Differential privacy: basic definitions and mechanisms 2. Differential privacy and statistical learning: ERM and SGD 3. Composition and tracking privacy loss 4. Applications of differential privacy in ML: deep learning and Bayesian methods

127 Things we didn't cover: synthetic data generation; interactive data analysis; statistical/estimation theory and fundamental limits; feature learning and dimensionality reduction; systems questions for large-scale deployment; and many others.

128 Where to learn more Several video lectures and other more technical introductions available from the Simons Institute for the Theory of Computing: Monograph by Dwork and Roth:

129 Final Takeaways Differential privacy measures the risk incurred by algorithms operating on private data. Commonly-used tools in machine learning can be made differentially private. Accounting for total privacy loss can enable more complex private algorithms. Still lots of work to be done in both theory and practice.

130 Thanks! This work was supported by National Science Foundation (IIS , CCF , SaTC ) National Institutes of Health (1R01DA A1) Office of Naval Research (N ) DARPA and the US Navy (N C-4070) Google Faculty Research Award
