Differentially Private Machine Learning


1 Differentially Private Machine Learning Theory, Algorithms, and Applications Kamalika Chaudhuri (UCSD) Anand D. Sarwate (Rutgers)

2 Logistics and Goals. Tutorial time: 2 hr (15 min break after the first hour). What this tutorial will do: motivate and define differential privacy, and provide an overview of common methods and tools to design differentially private ML algorithms. What this tutorial will not do: provide detailed results on the state of the art in differentially private versions of specific problems.

3 Learning Outcomes At the end of the tutorial, you should be able to: Explain the definition of differential privacy, Design basic differentially private machine learning algorithms using standard tools, Try different approaches for introducing differential privacy into optimization methods, Understand the basics of privacy risk accounting, Understand how these ideas blend together in more complex systems.

4 Motivating Differential Privacy

5 Sensitive Data Medical Records Genetic Data Search Logs

6-7 AOL Violates Privacy

8 Netflix Violates Privacy [NS08] Knowing 2-8 of Alice's movie ratings and their dates reveals: whether Alice is in the dataset or not, and Alice's other movie ratings.

9 High-dimensional Data is Unique. Example: UCSD Employee Salary Table, with columns Position, Gender, Department, Ethnicity, Salary. The row (Faculty, Female, CSE, SE Asian, -) matches exactly one employee (Kamalika)!

10 Simply anonymizing data is unsafe!

11 Disease Association Studies [WLWTZ09] Given correlations (R² values) between genetic markers in the Cancer group and in the Healthy group, Alice's DNA reveals whether Alice is in the Cancer set or the Healthy set.

12 Simply anonymizing data is unsafe! Statistics on small data sets are unsafe! There is a three-way tradeoff between privacy, data size, and accuracy.

13 Privacy Definition

14 The Setting. A private (sensitive, non-public) data set D is fed to a privacy-preserving sanitizer; everything the sanitizer releases across the privacy barrier is public: summary statistics, a synthetic dataset, a data release, or an ML model.

15 Property of the Sanitizer. The same picture: D stays behind the privacy barrier, and only the sanitizer's outputs (summary statistics, synthetic dataset, data release, ML model) become public. The sanitizer must keep aggregate information computable while protecting individual information (robust to side-information).

16 Differential Privacy. Running the algorithm on the data with a person included and with that person removed should yield (nearly) the same outcome distribution: participation of a person does not change the outcome. Since a person has agency, they can decide to participate in a dataset or not.

17 Adversary's Prior Knowledge: A's genetic profile; A smokes.

18-19 Case 1: The study reveals that A has cancer; the adversary, who knows A's genetic profile and that A smokes, learns A's cancer status. [The study violates A's privacy.] Case 2: The study reveals that smoking causes cancer; the adversary concludes that A, a smoker, probably has cancer. [The study does not violate privacy.]

20 Differential Privacy [DMNS06] The algorithm's outcome distribution is (nearly) the same whether or not any one person's data is included: participation of a person does not change the outcome. Since a person has agency, they can decide to participate in a dataset or not.

21-22 How to ensure this? Through randomness: A(Data + person 1) and A(Data + person 2) are random variables with close distributions. Randomness: added by the randomized algorithm A. Closeness: the likelihood ratio at every point is bounded.

23-24 Differential Privacy [DMNS06] For all D, D′ that differ in one person's value, if A is an ε-differentially private randomized algorithm, then

$\sup_t \log \frac{p(A(D) = t)}{p(A(D') = t)} \le \varepsilon,$

i.e., the max-divergence of p(A(D)) and p(A(D′)) is at most ε.

25 Approx. Differential Privacy [DKM+06] For all D, D′ that differ in one person's value, if A is an (ε, δ)-differentially private randomized algorithm, then

$\max_{S:\ \Pr(A(D) \in S) > \delta} \log \frac{\Pr(A(D) \in S) - \delta}{\Pr(A(D') \in S)} \le \varepsilon.$

26 Properties of Differential Privacy

27 Property 1: Post-processing Invariance. A differentially private algorithm with parameters (ε₁, δ₁) sits behind the privacy barrier; any non-private post-processing of its output (data derivatives consumed by legitimate users or by an adversary) remains (ε₁, δ₁)-private. Risk doesn't increase if you don't touch the data again.

28 Property 2: Graceful Composition. Running A₁ with (ε₁, δ₁), A₂ with (ε₂, δ₂), and so on, on the same private data set D is $\left(\sum_i \varepsilon_i, \sum_i \delta_i\right)$-differentially private: the total privacy loss is the sum of the privacy losses. (Better composition is possible; coming up later.)

29 How to achieve Differential Privacy?

30 Tools for Differentially Private Algorithm Design. Global Sensitivity Method [DMNS06]. Exponential Mechanism [MT07]. Many others we will not cover [DL09, NRS07, …].

31 Global Sensitivity Method [DMNS06] Problem: given a function f and a sensitive dataset D, find a differentially private approximation to f(D). Example: f(D) = mean of the data points in D.

32-36 The Global Sensitivity Method [DMNS06] Given: a function f and a sensitive dataset D. Define: dist(D, D′) = the number of records in which D and D′ differ. Global sensitivity of f:

$S(f) = \max_{D, D' \in \mathrm{Domain}:\ \mathrm{dist}(D, D') = 1} |f(D) - f(D')|$

37 Laplace Mechanism. Global sensitivity of f is $S(f) = \max_{\mathrm{dist}(D,D')=1} |f(D) - f(D')|$. Output $f(D) + Z$, where $Z \sim \mathrm{Lap}(S(f)/\varepsilon)$, i.e., the Laplace distribution with density $p(z) \propto e^{-|z|\varepsilon/S(f)}$; this is ε-differentially private.
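As a concrete illustration, here is a minimal sketch of the Laplace mechanism in Python (assuming NumPy; `sensitivity` stands for S(f) and must be supplied by the analyst):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value + Lap(S(f)/epsilon) noise; epsilon-DP provided
    sensitivity upper-bounds |f(D) - f(D')| over neighboring datasets."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```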

38 Gaussian Mechanism. Global sensitivity of f is $S(f) = \max_{\mathrm{dist}(D,D')=1} |f(D) - f(D')|$. Output $f(D) + Z \cdot S(f)$, where $Z \sim \mathcal{N}\!\left(0,\ \frac{2\ln(1.25/\delta)}{\varepsilon^2}\right)$; this is (ε, δ)-differentially private.

39-41 Example 1: Mean. f(D) = Mean(D), where each record is a scalar in [0, 1]. Global sensitivity of f = 1/n. Laplace mechanism: output $\mathrm{Mean}(D) + Z$, where $Z \sim \frac{1}{n\varepsilon}\,\mathrm{Lap}(0, 1)$.
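A usage sketch for this example (hypothetical data; it reuses the `laplace_mechanism` helper from the sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.uniform(size=1000)        # n = 1000 records, each in [0, 1]
epsilon = 0.5
# S(mean) = 1/n for records bounded in [0, 1]
private_mean = laplace_mechanism(D.mean(), sensitivity=1.0 / len(D),
                                 epsilon=epsilon, rng=rng)
```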

42 How to get Differential Privacy? Global Sensitivity Method [DMNS06], with two variants: Laplace and Gaussian. Exponential Mechanism [MT07]. Many others we will not cover [DL09, NRS07, …].

43 Exponential Mechanism [MT07] Problem: given a function f(w, D) and sensitive data D, find a differentially private approximation to $w^* = \operatorname{argmax}_w f(w, D)$. Example: f(w, D) = accuracy of classifier w on dataset D.

44 The Exponential Mechanism [MT07] Suppose that for any w, $|f(w, D) - f(w, D')| \le S$ when D and D′ differ in one record. Sample w from $p(w) \propto e^{\varepsilon f(w, D)/2S}$ for ε-differential privacy. [Figure: the sampling distribution Pr(w) concentrates around $\operatorname{argmax}_w f(w, D)$.]

45 Example: Parameter Tuning. Given validation data D and k classifiers w₁, …, w_k, (privately) find the classifier with the highest accuracy on D. Here f(w, D) = classification accuracy of w on D. For any w and any D, D′ that differ by one record, $|f(w, D) - f(w, D')| \le \frac{1}{|D|}$. So the exponential mechanism outputs $w_i$ with probability $\Pr(w_i) \propto e^{\varepsilon |D| f(w_i, D)/2}$. [CMS11, CV13]
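A minimal sketch of this selection step (assuming NumPy; `accuracies` would hold f(wᵢ, D) for the k candidates, and the sensitivity of accuracy is 1/|D| as above; the names are illustrative):

```python
import numpy as np

def exponential_mechanism(scores, sensitivity, epsilon, rng=None):
    """Sample index i with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity)); epsilon-DP."""
    rng = rng or np.random.default_rng()
    logits = epsilon * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()            # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# e.g., picking among k classifiers scored on n validation points:
# best = exponential_mechanism(accuracies, sensitivity=1.0 / n, epsilon=0.5)
```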

46 Summary Motivation What is differential privacy? Basic differential privacy algorithm design tools: The Global Sensitivity Method Laplace Mechanism Gaussian Mechanism Exponential Mechanism

47 Differential privacy in estimation and prediction

48 Estimation and prediction problems. A private data set D of sample size n passes through an (ε, δ)-private estimator to produce ŵ, a DP estimate of $\operatorname{argmin}_w E[f(w, z)]$; quality is measured by the risk functional $E[f(\hat{w}, z)]$. Statistical estimation: estimate a parameter or predictor using private data that has good expected performance on future data. Goal: a good privacy-accuracy-sample-size tradeoff.

49 Privacy and accuracy make different assumptions about the data. Privacy: differential privacy makes no assumptions on the data distribution; privacy holds unconditionally. Accuracy: accuracy is measured w.r.t. a true population distribution: expected excess statistical risk.

50 Statistical Learning as Risk Minimization.

$w^* = \operatorname{argmin}_w \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

Empirical Risk Minimization (ERM) is a common paradigm for prediction problems. Produces a predictor w for a label/response y given a vector of features/covariates x. Typically uses a convex loss function and a regularizer to prevent overfitting.

51 Why is ERM not private? Neighboring private data sets D and D′ produce different minimizers w and w′, so it is easy for an adversary observing the output to tell the difference between D and D′. [CMS11, RBHT12]

52 Kernel learning: even worse. Kernel-based methods produce a classifier that is a function of the data points:

$w(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$

Even an adversary with black-box access to w could potentially learn those points. [CMS11]

53 Privacy is compatible with learning Good learning algorithms generalize to the population distribution, not individuals. Stable learning algorithms generalize [BE02]. Differential privacy can be interpreted as a form of stability that also implies generalization [DFH+15,BNS+16]. Two parts of the same story: Privacy implies generalization asymptotically. Tradeoffs between privacy-accuracy-sample size for finite n.

54 Revisiting ERM.

$w^* = \operatorname{argmin}_w \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

Learning using (convex) optimization uses three steps: 1. read in the data (input perturbation); 2. form the objective function (objective perturbation); 3. perform the minimization (output perturbation). We can try to introduce privacy in each step!

55 Privacy in ERM: options. The privacy barrier can sit at different points in the pipeline: input perturbation (DP preprocessing of the private database D by noise addition or randomization, yielding a sanitized dataset for a non-private learning algorithm), objective perturbation (DP optimization on the raw data), and output perturbation (non-private optimization followed by noise addition or randomization of the result ŵ); each placement carries its own (ε, δ) guarantee.

56 Local Privacy. Data contributors sanitize their data before collection, so the privacy barrier sits between each contributor and the (non-private) aggregation algorithm. Classical technique: randomized response [W65]. An interactive variant can be minimax optimal [DJW13].
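A minimal sketch of classical randomized response [W65] for a single private bit: report the truth with probability e^ε/(1+e^ε), which gives ε-local differential privacy, then debias the aggregate (binary data assumed; helper names are illustrative):

```python
import numpy as np

def randomized_response(bit, epsilon, rng=None):
    """Report the true bit w.p. e^eps / (1 + e^eps); flip it otherwise."""
    rng = rng or np.random.default_rng()
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the population mean from randomized reports."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)
```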

57 Input Perturbation. The private database D passes through (ε, δ)-DP preprocessing to produce a sanitized dataset, on which non-private learning yields ŵ. Input perturbation: add noise to the input data. Advantages: easy to implement, and it results in a reusable sanitized data set. [DJW13, FTS17]

58 Output Perturbation. Run non-private preprocessing and optimization on D to get w*, then add noise or randomize to release ŵ across the (ε, δ) privacy barrier. Compute the minimizer and add noise: this does not require re-engineering baseline algorithms, and the noise depends on the sensitivity of the argmin. [CMS11, RBHT12]
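A hedged sketch of output perturbation in the spirit of [CMS11]. Assumptions not stated on the slide: the loss is 1-Lipschitz, features satisfy ||xᵢ|| ≤ 1, and the regularizer is (λ/2)||w||²; under those conditions the L2-sensitivity of the argmin is 2/(nλ), and we add noise with density proportional to exp(−ε||b||/sensitivity):

```python
import numpy as np

def gamma_laplace_noise(d, scale, rng):
    """Sample b in R^d with density proportional to exp(-||b|| / scale):
    uniform direction, Gamma(shape=d, scale) norm."""
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=scale) * direction

def output_perturbation(w_hat, n, lam, epsilon, rng=None):
    """w_hat: non-private regularized ERM minimizer. Sensitivity 2/(n*lam)
    assumed (1-Lipschitz loss, ||x_i|| <= 1, (lam/2)||w||^2 regularizer)."""
    rng = rng or np.random.default_rng()
    sensitivity = 2.0 / (n * lam)
    return w_hat + gamma_laplace_noise(len(w_hat), sensitivity / epsilon, rng)
```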

59 Objective Perturbation. The private database D goes through non-private preprocessing, then (ε, δ)-DP optimization produces ŵ; the privacy barrier sits at the objective.

60 Objective Perturbation.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

A. Add a random term to the objective: $\hat{w}_{\mathrm{priv}} = \operatorname{argmin}_w J(w) + w^\top b$. B. Do a randomized approximation of the objective: $\hat{w}_{\mathrm{priv}} = \operatorname{argmin}_w \hat{J}(w)$. The randomness depends on the sensitivity properties of J(w). [CMS11, ZZX+12]

61 Sensitivity of the argmin.

$\hat{w}_{\mathrm{priv}} = \operatorname{argmin}_w J(w) + w^\top b$

Non-private optimization solves $\nabla J(w) = 0 \Rightarrow w^*$. Generate a vector analogue of Laplace noise: $b \sim p(z) \propto e^{-(\varepsilon/2)\|z\|}$. Private optimization solves $\nabla J(w) = -b \Rightarrow \hat{w}_{\mathrm{priv}}$. We have to bound the sensitivity of the gradient.

62 Theoretical bounds on excess risk. Up to logarithmic factors, the excess risk scales as $\tilde{O}\!\big(\sqrt{d \log(1/\delta)}/(n\varepsilon)\big)$ and $\tilde{O}\!\big(d/(n\varepsilon)\big)$ for objective and output perturbation [CMS11, BST14], and $\tilde{O}\!\big(d/(n\varepsilon^{3/2})\big)$ for input perturbation with quadratic loss [FTS17]. The same important parameters appear throughout: privacy parameters (ε, δ), data dimension d, data bounds $\|x_i\| \le B$, analytical properties of the loss (Lipschitz constant, smoothness), and the regularization parameter.

63 Typical empirical results [CMS11] [JT14] In general: objective perturbation empirically outperforms output perturbation; Gaussian mechanisms with (ε, δ) guarantees outperform Laplace-like mechanisms with ε-guarantees; the loss vs. non-private methods is very dataset-dependent.

64 Gaps between theory and practice. Theoretical analysis is for fixed privacy parameters: how should we choose them in practice? Given a data set, can I tell what the privacy-utility-sample-size tradeoff is? What about more general optimization problems/algorithms? What about scaling (computationally) to large data sets?

65 Summary. Training does not on its own guarantee privacy. There are many ways to incorporate DP into prediction and learning using ERM, with different privacy-accuracy-sample-size tradeoffs. Good DP algorithms should generalize since they learn about populations, not individuals. Theory and experiment show that (ε, δ)-DP algorithms have better accuracy than ε-DP algorithms, at the cost of a weaker privacy guarantee.

66 Break

67 Differential privacy and optimization algorithms

68 Scaling up private optimization. Large data sets are challenging for optimization: batch methods are not feasible. Using more data can make our tradeoffs look better: better privacy and accuracy. Online learning involves multiple releases: potential for more privacy loss. Goal: guarantee privacy using the optimization algorithm itself.

69 Stochastic Gradient Descent. Stochastic gradient descent (SGD) is a moderately popular method for optimization. Stochastic gradients are random, hence already noisy: already private? Optimization is iterative: intermediate results leak information.

70 Non-private SGD.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

$w_0 = 0$. For $t = 1, 2, \ldots, T$: select a random data point $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$ and take a gradient step,

$g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1}), \qquad w_t = \Pi_W(w_{t-1} - \eta_t g_t).$

Output $\hat{w} = w_T$.

71 Private SGD with noise.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

$w_0 = 0$. For $t = 1, 2, \ldots, T$: select a random data point $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$ and add noise to the gradient,

$z_t \sim p_{(\varepsilon,\delta)}(z), \qquad \hat{g}_t = z_t + \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1}), \qquad w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t).$

Output $\hat{w} = w_T$. [SCS15]
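An illustrative sketch of the noisy-gradient loop (a sketch, not the algorithm of any one paper: per-example gradients are clipped to norm `clip` so each step has bounded sensitivity, and Gaussian noise of scale `sigma * clip` is added; calibrating `sigma` and T to a target (ε, δ) requires a composition analysis like the one later in this tutorial):

```python
import numpy as np

def noisy_sgd(grad_fn, data, w0, T, lr, clip, sigma, rng=None):
    """grad_fn(w, x) returns the gradient of the loss at one example x.
    Projection onto the constraint set W is omitted for brevity."""
    rng = rng or np.random.default_rng()
    w = w0.copy()
    for _ in range(T):
        x = data[rng.integers(len(data))]            # i_t ~ Unif{1, ..., n}
        g = grad_fn(w, x)
        g = g * min(1.0, clip / np.linalg.norm(g))   # bound per-step sensitivity
        g = g + rng.normal(scale=sigma * clip, size=g.shape)
        w = w - lr * g
    return w
```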

72 Choosing a noise distribution. Laplace mechanism: $p(z) \propto e^{-(\varepsilon/2)\|z\|}$ gives ε-DP. Gaussian mechanism: $p(z)$ i.i.d. $\mathcal{N}\!\left(0,\ \tilde{O}\!\left(\frac{\log(1/\delta)}{\varepsilon^2}\right)\right)$ gives (ε, δ)-DP. We have to choose the noise according to the sensitivity of the gradient:

$\max_{D, D'} \max_w \|\nabla J(w; D) - \nabla J(w; D')\|.$

Sensitivity depends on the data distribution, the Lipschitz parameter of the loss, etc.

73 Private SGD with randomized selection.

$J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, (x_i, y_i)) + R(w)$

$w_0 = 0$. For $t = 1, 2, \ldots, T$: select a random data point $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$, compute $g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$, and randomly select an unbiased gradient estimate $\hat{g}_t \sim p_{(\varepsilon,\delta),g}(z)$; then $w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t)$. Output $\hat{w} = w_T$. [DJW13]

74 Randomized directions [DJW13] First round the gradient to $z = \pm \frac{L}{\|g\|} g$, taking the $+$ sign w.p. $\frac{1}{2} + \frac{\|g\|}{2L}$. Then select the hemisphere (of the radius-B sphere) in the direction of z w.p. $\frac{e^\varepsilon}{1+e^\varepsilon}$, or the opposite hemisphere w.p. $\frac{1}{1+e^\varepsilon}$, and pick uniformly from that hemisphere to take a step. Need to have control of gradient norms: $\|g\| \le L$. Keep some probability of going in the wrong direction.

75 Why does DP-SGD work? Noisy gradient ($\hat{g} = g + z_t$): choose the noise distribution using the sensitivity of the gradient. Random gradient: randomly select a direction biased towards the true gradient. Both methods: guarantee DP at each iteration, and ensure an unbiased estimate of g to guarantee convergence.

76 Making DP-SGD more practical. SGD is robust to noise: true up to a point; for small epsilon (more privacy), the gradients can become too noisy. Solution 1: more iterations ([BST14]: need $O(n^2)$). Solution 2: use standard tricks: mini-batching, etc. [SCS13]. Solution 3: use better analysis to show the privacy loss is not so bad [BST14][ACG+16].

77 Randomly sampling data can amplify privacy. Suppose we have an algorithm A which is ε-differentially private for ε ≤ 1. Sample γn entries of D uniformly at random and run A on those. The resulting randomized method guarantees 2γε-differential privacy. [BBKN14, BST14]
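A tiny worked instance of this claim (a sketch; the 2γε bound is stated for ε ≤ 1):

```python
def amplified_epsilon(eps, gamma):
    """Overall privacy of running an eps-DP algorithm (eps <= 1) on a
    uniformly random gamma-fraction subsample of the data."""
    assert eps <= 1.0, "the 2*gamma*eps bound is stated for eps <= 1"
    return 2.0 * gamma * eps

# e.g., eps = 1.0 on a 1% subsample: amplified_epsilon(1.0, 0.01) == 0.02
```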

78 Summary. Stochastic gradient descent can be made differentially private in several ways by randomizing the gradient. Keeping gradient estimates unbiased will help ensure convergence. Standard approaches for variance reduction/stability (such as minibatching) can help with performance. Random subsampling of the data can amplify privacy guarantees.

79 Accounting for total privacy risk

80 Measuring total privacy loss. Post-processing invariance: risk doesn't increase if you don't touch the data again. But more complex algorithms have multiple stages, all stages have to guarantee DP, and we need a way to do privacy accounting: what is lost over time/multiple queries?

81 A simple example. A pipeline in which every stage touches the private data set D across the privacy barrier: Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄).

82 Composition property of differential privacy. Basic composition: privacy loss is additive. Apply R algorithms with $(\varepsilon_i, \delta_i)$, $i = 1, 2, \ldots, R$; the total privacy loss is

$\left( \sum_{i=1}^{R} \varepsilon_i,\ \sum_{i=1}^{R} \delta_i \right).$

This is a worst-case analysis: each result exposes the worst privacy risk.

83 What composition says about multi-stage methods. For the pipeline above (Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄)), the total privacy loss is the sum of the privacy losses.

84 An open question: privacy allocation across stages. Composition means we have a privacy budget. How should we allocate privacy risk across different stages of a pipeline? Noisy features + accurate training? Clean features + sloppy training? It's application dependent! Still an open question.

85 A closer look at ε and δ. Gaussian noise of a given variance produces a spectrum of (ε, δ) guarantees: the likelihood ratio $p(t \mid D)/p(t \mid D')$ is bounded by $\exp(\varepsilon_1)$ outside a region of probability δ₁, and by a larger $\exp(\varepsilon_2)$ outside a smaller region of probability δ₂, so the same mechanism is simultaneously (ε₁, δ₁)-DP and (ε₂, δ₂)-DP.

86 Privacy loss as a random variable. The spectrum of (ε, δ) guarantees means we can trade off ε and δ when analyzing a particular mechanism. The actual privacy loss is a random variable that depends on D:

$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t).$

87 Random privacy loss.

$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t).$

Even after taking the max over (D, D′), the loss is still a random variable. Sequentially computing functions on private data is like sequentially sampling independent privacy losses. Concentration of measure shows that the loss is much closer to its expectation.

88 Strong composition bounds. Given only the (ε, δ) guarantees for k algorithms operating on the data, composition again gives a family of (ε, δ) tradeoffs: for each i,

$\underbrace{(\varepsilon, \delta), (\varepsilon, \delta), \ldots, (\varepsilon, \delta)}_{k \text{ times}} \;\Longrightarrow\; \Big( (k - 2i)\varepsilon,\ 1 - (1 - \delta)^k (1 - \delta_i) \Big), \qquad \delta_i = \frac{\sum_{\ell=0}^{i-1} \binom{k}{\ell} \left( e^{(k-\ell)\varepsilon} - e^{(k-2i+\ell)\varepsilon} \right)}{(1 + e^{\varepsilon})^k}.$

We can quantify privacy loss by choosing any valid (ε, δ) pair. [DRV10, KOV15]

89 Moments accountant. A sequence of mechanisms A₁, A₂, … applied to the private data set D composes into a single mechanism A with overall parameters (ε, δ). Basic idea: directly calculate the parameters (ε, δ) from composing a sequence of mechanisms. More efficient than the composition theorems. [ACG+16]

90 How to Compose Directly? Given datasets D and D′ with one different record and a mechanism A, define the privacy loss random variable as

$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)}, \quad \text{w.p. } p(A(D) = t).$

Properties of $Z_{D,D'}$ are related to the privacy loss of A: if the max absolute value of $Z_{D,D'}$ over all D, D′ is ε, then A is (ε, 0)-differentially private.

91 How to Compose Directly? With the privacy loss random variable $Z_{D,D'}$ defined as above, the challenge is to reason about the worst case over all D, D′. Key idea in [ACG+16]: use moment generating functions.

92 Accounting for Moments. Three steps: 1. Calculate moment generating functions for A₁, A₂, … 2. Compose. 3. Calculate the final privacy parameters (ε, δ). [ACG+16]

93 1. Stepwise Moments. Define the stepwise moment at time t of $A_t$ at any s:

$\alpha_{A_t}(s) = \sup_{D, D'} \log E\!\left[ e^{s Z_{D,D'}} \right]$

(where D and D′ differ by one record). [ACG+16]

94 2. Compose. Theorem: Suppose A = (A₁, …, A_T). For any s:

$\alpha_A(s) \le \sum_{t=1}^{T} \alpha_{A_t}(s).$

[ACG+16]

95 3. Final Calculation. Theorem: For any ε, the mechanism A is (ε, δ)-DP for

$\delta = \min_s \exp\!\left( \alpha_A(s) - s\varepsilon \right).$

Use the theorem to find the best ε for a given δ, from closed form or by searching over s₁, s₂, …, s_k. [ACG+16]

96 Example: composing the Gaussian mechanism. Suppose each $A_t$ answers a query with global sensitivity 1 by adding N(0, 1) noise.

97 1. Stepwise Moments. With $A_t$ answering a query of global sensitivity 1 by adding N(0, 1) noise, simple algebra gives, for any s:

$\alpha_{A_t}(s) = \frac{s(s+1)}{2}.$

98 2. Compose. With $A_t$ as above,

$\alpha_A(s) \le \sum_{t=1}^{T} \alpha_{A_t}(s) = \frac{T s(s+1)}{2}.$

99 3. Final Calculation. Find the lowest δ for a given ε (or vice versa) by solving

$\delta = \min_s \exp\!\left( T s(s+1)/2 - s\varepsilon \right).$

In this case, the solution can be found in closed form.
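A short sketch of this final step for the composed Gaussian example (it assumes the stepwise bound α(s) = T·s(s+1)/2 derived above, and searches a grid of s rather than using the closed form):

```python
import numpy as np

def delta_for_epsilon(T, eps, s_max=200):
    """delta = min_s exp(alpha_A(s) - s*eps) with alpha_A(s) = T*s*(s+1)/2."""
    s = np.arange(1, s_max + 1, dtype=float)
    exponents = T * s * (s + 1) / 2.0 - s * eps
    return float(np.exp(exponents.min()))

# e.g., delta_for_epsilon(T=1, eps=5.0) evaluates the tradeoff at eps = 5
```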

100 How does it compare? [Figure: ε vs. number of rounds of composition for the strong composition bound [DRV10] and the moments accountant; both grow better than linearly, with the moments accountant giving the smaller ε.] [ACG+16]

101 How does it compare on real data? [Figure: EM for a mixture of Gaussians (MOG) with the Gaussian mechanism, δ = 10⁻⁴.] [PFCW17]

102 Summary Practical machine learning looks at the data many times. Post-processing invariance means we just have to track the cumulative privacy loss. Good composition methods use the fact that the actual privacy loss may behave much better than the worst-case bound. The Moments Accountant method tracks the actual privacy loss more accurately: better analysis for better privacy guarantees.

103 Applications to modern machine learning

104 When is differential privacy practical? Differential privacy is best suited for understanding population-level statistics and structure: inferences about the population should not depend strongly on individuals. Large sample sizes usually mean lower sensitivity and less noise. To build and analyze systems we have to leverage post-processing invariance and composition properties.

105 Differential privacy in practice. Google: RAPPOR for tracking statistics in Chrome. Apple: various iPhone usage statistics. Census: the 2020 US Census will use differential privacy. Mostly focused on count and average statistics.

106 Challenges for machine learning applications Differentially private ML is complicated because real ML algorithms are complicated: Multi-stage pipelines, parameter tuning, etc. Need to play around with the data before committing to a particular pipeline/algorithm. Modern ML approaches (= deep learning) have many parameters and less theoretical guidance.

107 Some selected examples. For today, we will describe some recent examples: 1. Differentially private deep learning [ACG+16]. 2. Differential privacy and Bayesian inference.

108 Differential privacy and deep learning Main idea: train a deep network using differentially private SGD and use moments accountant to track privacy loss. Additional components: gradient clipping, minibatching, data augmentation, etc. [ACG+16]

109 Overview of the algorithm. Select random lots $L_1, L_2, \ldots$ from the private data set D, compute per-example gradients, clip them and add noise, and update the parameters to get new network weights; over the T iterations, the moments accountant converts the accumulated loss into the final (ε, δ) at the privacy barrier. [ACG+16]
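A condensed sketch of the per-lot update in the style of [ACG+16] (assumptions: `per_example_grads` is the list of gradients for one lot, `clip_norm` is the clipping threshold C, and `noise_mult` is the noise multiplier σ; the moments-accountant bookkeeping is separate):

```python
import numpy as np

def dp_sgd_lot_update(per_example_grads, clip_norm, noise_mult, rng=None):
    """Clip each example's gradient to norm C, sum, add N(0, (sigma*C)^2 I)
    noise, and average over the lot."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip_norm / np.linalg.norm(g))
               for g in per_example_grads]
    noise = rng.normal(scale=noise_mult * clip_norm, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
```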

110 Effectiveness of DP deep learning. Empirical results on MNIST and CIFAR: training and test error come close to baseline non-private deep learning methods. To get moderate loss in performance, epsilon and delta are not negligible. [ACG+16]

111 Moving forward in deep learning This is a good proof of concept for differential privacy for deep neural networks. There are lots of interesting ways to expand this. Just used one NN model: what about other architectures? RNNs? GANs? Can regularization methods for deep learning (e.g. dropout) help with privacy? What are good rules of thumb for lot/batch size, learning rate, # of hidden units, etc? [ACG+16]

112 Differentially Private Bayesian Inference. Data X = {x₁, x₂, …} and a model class Θ are related through the likelihood p(x | θ). Prior π(θ) + data X = posterior p(θ | X). Goal: find a differentially private approximation to the posterior.

113 Differentially Private Bayesian Inference General methods for private posterior approximation A Special Case: Exponential Families Variational Inference

114 How to make the posterior private? Option 1: direct posterior sampling [DMNR14]. Not differentially private except under restrictive conditions: likelihood ratios may be unbounded! [GSC17] The answer changes under a newer relaxation, Rényi differential privacy [M17].

115 How to make the posterior private? Option 2: the One Posterior Sample (OPS) method [WFS15]. 1. Truncate the posterior so that the likelihood ratio is bounded in the truncated region. 2. Raise the truncated posterior to a higher temperature.

116 How to make the posterior private? Option 2: the One Posterior Sample (OPS) method. Advantage: general. Pitfalls: intractable (only the exact distribution is private); low statistical efficiency even for large n.

117 How to make the posterior private? Option 3: approximate the OPS distribution via stochastic gradient MCMC [WFS15]. Advantage: the noise added during stochastic gradient MCMC contributes to privacy. Disadvantage: statistical efficiency is lower than exact OPS.

118 Differentially Private Bayesian Inference General methods for private posterior approximation A Special Case: Exponential Families Variational Inference

119 Exponential Family Posteriors. The (non-private) posterior comes from an exponential family:

$p(\theta \mid x) \propto e^{\eta(\theta)^\top \left( \sum_i T(x_i) \right) - B(\theta)}$

given data x₁, x₂, … The posterior depends on the data only through the sufficient statistic T.

120 Exponential Family Posteriors. With the posterior as above and sufficient statistic T, private sampling works in two steps: 1. If T is bounded, add noise to $\sum_i T(x_i)$ to get a private version $T'$. 2. Sample from the perturbed posterior:

$p(\theta \mid x) \propto e^{\eta(\theta)^\top T' - B(\theta)}.$

[ZRD16, FGWC16]
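A minimal instance of this recipe for a Beta-Bernoulli posterior (a sketch in the spirit of [ZRD16, FGWC16]: the sufficient statistic Σᵢ xᵢ has sensitivity 1, so Laplace noise of scale 1/ε makes it ε-DP, and sampling from the perturbed posterior is free post-processing; clipping the noisy count to a valid range is a pragmatic choice, not from the slides):

```python
import numpy as np

def private_beta_posterior_sample(x, epsilon, a0=1.0, b0=1.0, rng=None):
    """x: array of 0/1 observations. Perturb T = sum(x) with Lap(1/eps)
    noise, then draw theta from the perturbed Beta posterior."""
    rng = rng or np.random.default_rng()
    n = len(x)
    T_noisy = np.sum(x) + rng.laplace(scale=1.0 / epsilon)
    T_noisy = float(np.clip(T_noisy, 0.0, n))  # keep a valid count (heuristic)
    return rng.beta(a0 + T_noisy, b0 + n - T_noisy)
```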

121 How well does it work? [Figure: test-set log-likelihood vs. total ε for a non-private HMM, non-private naive Bayes, the Laplace-mechanism HMM, and the OPS HMM (truncation multiplier = 100).] The method is statistically efficient; performance is worse than non-private but better than OPS. One can do inference in relatively complex systems by building on this method, e.g., time-series clustering with HMMs.

122 Differentially Private Bayesian Inference General methods for private posterior approximation A Special Case: Exponential Families Variational Inference

123 Variational Inference Key Idea: Start with a stochastic variational inference method, and make each step private by adding Laplace noise. Use moments accountant and subsampling to track privacy loss. [JDH16, PFCW16]

124 Summary Two examples of differentially private complex machine learning algorithms Deep learning Bayesian inference

125 Summary, conclusions, and where to look next

126 Summary 1. Differential privacy: basic definitions and mechanisms 2. Differential privacy and statistical learning: ERM and SGD 3. Composition and tracking privacy loss 4. Applications of differential privacy in ML: deep learning and Bayesian methods

127 Things we didn't cover: synthetic data generation; interactive data analysis; statistical/estimation theory and fundamental limits; feature learning and dimensionality reduction; systems questions for large-scale deployment; and many others.

128 Where to learn more Several video lectures and other more technical introductions available from the Simons Institute for the Theory of Computing: Monograph by Dwork and Roth:

129 Final Takeaways Differential privacy measures the risk incurred by algorithms operating on private data. Commonly-used tools in machine learning can be made differentially private. Accounting for total privacy loss can enable more complex private algorithms. Still lots of work to be done in both theory and practice.

130 Thanks! This work was supported by National Science Foundation (IIS , CCF , SaTC ) National Institutes of Health (1R01DA A1) Office of Naval Research (N ) DARPA and the US Navy (N C-4070) Google Faculty Research Award
