Communication-efficient and Differentially-private Distributed SGD


Ananda Theertha Suresh, with Naman Agarwal, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. Google Research, November 16, 2018.

Outline
  Motivation: distributed SGD, federated learning
  Problem formulation: distributed SGD → distributed mean estimation
  Communication efficiency: three schemes, optimality
  Differential privacy: privacy via the Binomial mechanism
  What is new in the distributed SGD setting?

Server-based distributed SGD
  A tool for computational efficiency
  Workers are efficient
  Well-designed architecture; e.g., data is i.i.d.

Client-based distributed SGD
  Data is inherently distributed; e.g., across cellphones
  Clients (cellphones) are not computationally efficient
  No carefully designed architecture; e.g., data is not i.i.d.
  Constrained resources; e.g., memory, bandwidth, power
  Privacy; e.g., sensitive data
  Federated learning: Konečný et al 17

Distributed SGD over clients: challenges
Standard approach: client $i$ sends its gradient $g_i^t$.
Uplink communication cost:
  To send $g_i^t$ to constant accuracy: $d \log d$ bits
  Expensive for large models with millions of parameters
  Suppose $g_i^t$ is a vector of 32-bit floats: $32d$ bits
  Medium LSTM model for PTB: 13.5 million parameters ≈ 50 MB
This work:
  How to reduce the uplink communication cost? (Uplink is more expensive than downlink.)
  What type of privacy guarantees can we give the users?
  Can we do both?

Distributed SGD formulation
Goal: given $m$ functions $F_1(w), F_2(w), \ldots, F_m(w)$ on $m$ clients, minimize
  $F(w) = \frac{1}{m} \sum_{i=1}^m F_i(w)$
Algorithm:
  Initialize $w$ to $w_0$
  For $t$ from 0 to $T$:
    Randomly select $n$ clients
    Selected clients compute gradients $g_i^t = \nabla F_i(w_t)$
    Selected clients send their gradients to the server
    The server computes the average $g^t = \frac{1}{n} \sum_i g_i^t$ and updates the model by $w_{t+1} = w_t - \eta_t g^t$
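
To make the loop concrete, here is a minimal NumPy sketch of the algorithm above; the quadratic client objectives, sizes, and step size are illustrative assumptions for the example, not the setting used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, T, eta = 100, 10, 20, 200, 0.1   # assumed toy sizes

# Illustrative client objectives: F_i(w) = 0.5 * ||w - c_i||^2,
# so grad F_i(w) = w - c_i and the global optimum is the mean of the c_i.
centers = rng.normal(size=(m, d))

def client_gradient(i, w):
    return w - centers[i]

w = np.zeros(d)
for t in range(T):
    selected = rng.choice(m, size=n, replace=False)   # randomly select n clients
    grads = np.stack([client_gradient(i, w) for i in selected])
    g = grads.mean(axis=0)                            # server averages the gradients
    w = w - eta * g                                   # w_{t+1} = w_t - eta_t * g^t

print("distance to optimum:", np.linalg.norm(w - centers.mean(axis=0)))
```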

Distributed SGD guarantees
Non-convex problems (Ghadimi et al 13): after $T$ steps,
  $\mathbb{E}_{t \sim [T]} \|\nabla F(w_t)\|_2^2 \lesssim \frac{\sigma}{\sqrt{T}}$, where $\sigma^2 = \max_{1 \le t \le T} \mathbb{E}\|g^t(w_t) - \nabla F(w_t)\|_2^2$
Similar results hold for convex and strongly convex problems.
SGD guarantees depend on the mean squared error (MSE) of the gradient estimates.
Guarantees with post-processing: if each $g_i^t$ is replaced by $\hat{g}_i^t$ such that $\mathbb{E}[\hat{g}_i^t] = g_i^t$, and $\hat{g}^t = \frac{1}{n} \sum_i \hat{g}_i^t$, then the same bound holds with
  $\sigma^2 = \max_{1 \le t \le T} \mathbb{E}\|\hat{g}^t(w_t) - \nabla F(w_t)\|_2^2$

Distributed mean estimation
Setting: $n$ vectors $X^n \stackrel{\text{def}}{=} X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ that reside on $n$ clients
Goal: estimate the mean of the vectors, $\bar{X} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n X_i$
Applications:
  Distributed SGD ($X_i$ is the gradient)
  Distributed power iteration
  Distributed Lloyd's algorithm
  ...

Communication efficiency: approach
Problem statement: $n$ vectors $X^n \stackrel{\text{def}}{=} X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ reside on $n$ clients; estimate the mean $\bar{X} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n X_i$
Communication protocol:
  Client $i$ transmits a compressed or private vector $q(X_i)$
  The server estimates the mean by some function of $q(X_1), q(X_2), \ldots, q(X_n)$

Definitions
$\pi$: communication protocol; $\hat{\bar{X}}$: estimated mean
Mean squared error (MSE): $\mathcal{E}(\pi, X^n) = \mathbb{E}\big[\|\hat{\bar{X}} - \bar{X}\|_2^2\big]$, where the expectation is over the randomization in the protocol
Communication cost: $C_i(\pi, X_i)$ is the expected number of bits sent by the $i$-th client, and the total number of bits transmitted by all clients is
  $C(\pi, X^n) = \sum_{i=1}^n C_i(\pi, X_i)$
Question: what is the MSE ($\mathcal{E}$) vs communication ($C$) trade-off?
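
These two quantities are easy to estimate empirically for any concrete protocol. Below is a small helper under assumed interfaces: `encode(x, rng)` returns a bit count and a payload, and `decode(payload)` returns a $d$-dimensional vector; both names are placeholders for whatever scheme is plugged in.

```python
import numpy as np

def evaluate_protocol(X, encode, decode, trials=100, seed=0):
    """Monte-Carlo estimate of the MSE E[||Xhat - Xbar||^2] and total bits C.

    X: (n, d) array of client vectors.
    encode(x, rng) -> (bits, payload); decode(payload) -> d-dim vector.
    """
    rng = np.random.default_rng(seed)
    true_mean = X.mean(axis=0)
    mses, bit_totals = [], []
    for _ in range(trials):
        decoded, bits = [], 0
        for x in X:
            b, payload = encode(x, rng)
            bits += b
            decoded.append(decode(payload))
        est_mean = np.mean(decoded, axis=0)
        mses.append(np.sum((est_mean - true_mean) ** 2))
        bit_totals.append(bits)
    return np.mean(mses), np.mean(bit_totals)
```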

Minimax formulation
No assumptions on $X_1, X_2, \ldots, X_n$.
To characterize optimality, let $B^d$ be the unit Euclidean ball in $d$ dimensions.
Minimax MSE:
  $\mathcal{E}(c) \stackrel{\text{def}}{=} \min_{\pi: C(\pi) \le c} \; \max_{X^n \subseteq B^d} \mathcal{E}(\pi, X^n)$

Related works
Garg et al 14, Braverman et al 16:
  Assume $X_1, X_2, \ldots, X_n$ are generated i.i.d. from a distribution, typically Gaussian
  Estimate the distribution mean instead of the empirical mean
  Algorithms are tailored to the assumptions; e.g., to estimate the mean of a Gaussian with unit variance:
    Compute $\hat{p} = \Pr(X_i \ge 0)$
    Output $\hat{\mu}$ such that $\Pr(N(\hat{\mu}, 1) > 0) = \hat{p}$
We assume no distribution over the data and want the empirical mean.
Alistarh et al 16: use quantization and Elias coding.

Three protocols
  Stochastic uniform quantization: binary and k-level
  Stochastic rotated quantization: quantization error significantly reduced by a random rotation
  Variable length coding: optimal communication-MSE trade-off

Stochastic binary quantization
For client $i$ with vector $X_i \in \mathbb{R}^d$, let
  $X_i^{\max} = \max_{1 \le j \le d} X_i(j)$ and $X_i^{\min} = \min_{1 \le j \le d} X_i(j)$
The quantized value for each coordinate $j$ is
  $Y_i(j) = \begin{cases} X_i^{\max} & \text{w.p. } \frac{X_i(j) - X_i^{\min}}{X_i^{\max} - X_i^{\min}} \\ X_i^{\min} & \text{otherwise} \end{cases}$
Observe that $\mathbb{E}\,Y_i(j) = X_i(j)$.

Stochastic binary quantization
Estimated mean: $\hat{\bar{X}} = \frac{1}{n} \sum_{i=1}^n Y_i$
Communication cost: $d + \tilde{O}(1)$ bits per client, and hence $C = n\,(d + \tilde{O}(1))$
Mean squared error: $\mathcal{E} = O\!\Big(\frac{d}{n} \cdot \frac{1}{n} \sum_{i=1}^n \|X_i\|_2^2\Big)$
So $d$ bits per client gives MSE $O(d/n)$.

Stochastic k-level quantization
Divide the interval $[X_i^{\min}, X_i^{\max}]$ into $k$ levels and round each coordinate stochastically to one of its two nearest levels.
Communication cost: $d \log_2 k + \tilde{O}(1)$ bits per client, and hence $C = n\,(d \log_2 k + \tilde{O}(1))$
Mean squared error: $\mathcal{E} = O\!\Big(\frac{d}{n(k-1)^2} \cdot \frac{1}{n} \sum_{i=1}^n \|X_i\|_2^2\Big)$
So $d \log_2 k$ bits per client gives MSE $O(d/(nk^2))$.
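
A sketch of the stochastic k-level quantizer described above ($k = 2$ recovers the binary scheme); the uniform level placement follows the slide, while the packing of indices into $\lceil \log_2 k \rceil$-bit words is left abstract.

```python
import numpy as np

def klevel_quantize(x, k, rng):
    """Stochastically round each coordinate to one of k uniform levels in
    [x.min(), x.max()]; the dequantized result is an unbiased estimate of x."""
    lo, hi = x.min(), x.max()
    if hi == lo:                          # constant vector: nothing to quantize
        return np.zeros_like(x, dtype=int), lo, hi
    step = (hi - lo) / (k - 1)
    pos = (x - lo) / step                 # position measured in levels
    low = np.floor(pos)
    prob_up = pos - low                   # round up with this probability
    idx = low + (rng.random(x.shape) < prob_up)
    idx = np.clip(idx, 0, k - 1)          # guard against float round-off
    return idx.astype(int), lo, hi        # idx costs ceil(log2(k)) bits/coordinate

def klevel_dequantize(idx, lo, hi, k):
    return lo + idx * (hi - lo) / (k - 1)

# Unbiasedness check: averaging many independent quantizations approaches x.
rng = np.random.default_rng(1)
x = rng.normal(size=5)
samples = [klevel_dequantize(*klevel_quantize(x, 4, rng), 4) for _ in range(20000)]
print(np.abs(np.mean(samples, axis=0) - x).max())   # close to 0
```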

Two improvements
  1. Stochastic rotated k-level quantization: rotate the vectors before quantization
  2. Variable length coding: use Huffman/arithmetic coding plus universal compression

Stochastic rotated k-level quantization
Observation: the smaller $X_i^{\max} - X_i^{\min}$, the smaller the error.
A random rotation reduces $X_i^{\max} - X_i^{\min}$ to $O\!\Big(\sqrt{\frac{\log d}{d}}\, \|X_i\|_2\Big)$.
Each client:
  Rotates its vector using a random rotation matrix: $Z_i = R X_i$
  Quantizes each coordinate of $Z_i$ into $k$ levels to obtain $Y_i$
Server:
  Estimates the mean by $\hat{\bar{X}} = R^{-1} \frac{1}{n} \sum_{i=1}^n Y_i$
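
A sketch of the full rotate-quantize-unrotate pipeline, reusing the k-level quantizer sketched earlier and sampling an explicit random rotation via QR decomposition; this is the slow $O(d^2)$ variant that the next slide speeds up, and the shared rotation is an assumption of the sketch (clients and server draw it from common randomness).

```python
import numpy as np

def random_rotation(d, rng):
    """Sample a d x d random orthogonal matrix (Haar-distributed up to sign fix)."""
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    return Q * np.sign(np.diag(R))        # sign fix makes the rotation uniform

def rotated_mean_estimate(X, k, rng):
    """Clients rotate with a shared R and quantize; the server averages and unrotates."""
    n, d = X.shape
    R = random_rotation(d, rng)           # shared between clients and server
    Y = []
    for x in X:
        z = R @ x                                     # Z_i = R X_i
        idx, lo, hi = klevel_quantize(z, k, rng)      # quantize each coordinate of Z_i
        Y.append(klevel_dequantize(idx, lo, hi, k))
    return R.T @ np.mean(Y, axis=0)       # R^{-1} = R^T for a rotation
```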

Stochastic rotated k-level quantization
Communication cost: $d \log_2 k + \tilde{O}(1)$ bits per client, so $C = n\,(d \log_2 k + \tilde{O}(1))$
Mean squared error: $\mathcal{E} = O\!\Big(\frac{\log d}{n(k-1)^2} \cdot \frac{1}{n} \sum_{i=1}^n \|X_i\|_2^2\Big)$
With $O(d \log k)$ bits per client, the MSE improves from $O(d/(nk^2))$ to $O(\log d/(nk^2))$.
But rotating a high-dimensional vector is slow: $O(d^2)$ time!

Fast random rotation
Use the structured rotation $R = HD$, where
  $H$: Walsh-Hadamard matrix
  $D$: diagonal matrix with independent Rademacher ($\pm 1$) entries
Matrix-vector multiplication time: $O(d \log d)$
Used in other contexts: Ailon et al 09, Yu et al 16
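
A sketch of the structured rotation $R = HD$: an iterative fast Walsh-Hadamard transform plus a random sign flip, assuming $d$ is a power of two (otherwise pad with zeros). It can replace `random_rotation` in the pipeline sketched above.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2."""
    y = x.astype(float).copy()
    d = len(y)
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = y[start:start + h].copy()
            b = y[start + h:start + 2 * h].copy()
            y[start:start + h] = a + b
            y[start + h:start + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)                 # normalization makes H/sqrt(d) orthonormal

def hd_rotate(x, signs):
    return fwht(signs * x)                # R x = H D x

def hd_unrotate(z, signs):
    return signs * fwht(z)                # R^{-1} z = D H z (normalized H and D are involutions)

# signs is the diagonal of D, shared between clients and server.
rng = np.random.default_rng(2)
d = 8
signs = rng.choice([-1.0, 1.0], size=d)   # Rademacher entries
x = rng.normal(size=d)
assert np.allclose(hd_unrotate(hd_rotate(x, signs), signs), x)
```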

Two improvements
  1. Stochastic rotated k-level quantization: rotates the vectors before quantization
  2. Variable length coding: uses Huffman/arithmetic coding plus universal compression; information-theoretically optimal

Variable length coding
Previous schemes used a fixed $\log_2 k$ bits for each dimension; variable length coding uses a different number of bits per dimension.
Key idea: use fewer bits for more frequent bins. Describe the quantized distribution, then encode using arithmetic coding for that distribution.
Communication cost: arithmetic/Huffman coding plus universal compression yields
  $C \le n \Big( O\big(d\,(1 + \log(k^2/d + 1))\big) + \tilde{O}(1) \Big)$
For $k \le \sqrt{d}$: bits per client drop from $O(d \log k)$ to $O(d)$, with MSE $O(d/(nk^2))$.
For $k = \sqrt{d}$: $O(d)$ bits per client and MSE $O(1/n)$.
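
The coding step itself is standard; the rough sketch below estimates the per-client cost by Huffman-coding the empirical distribution of the quantized level indices (arithmetic coding would get even closer to the entropy). The quantizer is the one sketched earlier, and the sizes and the Gaussian input are assumptions for illustration, not the talk's setup.

```python
import heapq
from collections import Counter
import numpy as np

def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code on the given counts."""
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                    # a single symbol still needs 1 bit
        return {s: 1 for s in counts}
    tie = len(heap)                       # tiebreaker so dicts are never compared
    while len(heap) > 1:
        c1, _, l1 = heapq.heappop(heap)
        c2, _, l2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**l1, **l2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

rng = np.random.default_rng(3)
d, k = 1024, 32
x = rng.normal(size=d)
idx, lo, hi = klevel_quantize(x, k, rng)      # from the earlier sketch
counts = Counter(idx.tolist())
lengths = huffman_code_lengths(counts)
bits = sum(counts[s] * lengths[s] for s in counts)
print(f"fixed length: {d * int(np.ceil(np.log2(k)))} bits, "
      f"variable length: {bits} bits (plus table / O~(1) overhead)")
```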

Information theoretic optimality
What is the best protocol for any worst-case dataset? Variable length coding combined with client sampling is optimal.
Min-max result: let $t < 1$ be a constant. For communication cost $c \le t\,n\,d$,
  $\mathcal{E}(c) = \min_{\pi: C(\pi) \le c} \; \max_{X^n \subseteq B^d} \mathcal{E}(\pi, X^n) = \Theta\!\Big(\min\Big(1, \frac{d}{c}\Big)\Big)$
The product of communication cost and MSE scales linearly with $d$.
The lower bound follows from results in Zhang et al NIPS 13.

Experiments: Lloyd's algorithm (k-means)
MNIST, m = 60K, d = 784, n = 10 clients, 10 centers, k = 16 levels
[Plots: k-means objective vs. number of iterations and vs. communication cost (bits per dimension), comparing stochastic k-level, rotated, variable length, and no compression]

Experiments: power iteration (PCA)
MNIST, m = 60K, d = 784, n = 10 clients, k = 16 levels
[Plots: error vs. number of iterations and vs. communication cost (bits per dimension), comparing stochastic k-level, rotated, variable length, and no compression]

Experiments: distributed SGD
CIFAR, m = 50K, d > 10^6, n = 100 clients (500 examples each), k = 2 levels

Communication efficiency: conclusion
Three approaches for compressed distributed mean estimation, without any assumption on the data distribution. For k = 2:
  Stochastic k-level: d bits per client, MSE O(d/n)
  Rotated: d bits per client, MSE O(log(d)/n)
  Variable length*: O(d) bits per client, MSE O(1/n)
  (*Communication optimal in the minimax sense)
Privacy?

Differential privacy
Communication protocol:
  Client $i$ transmits a compressed / private vector $q(X_i)$
  The server estimates the mean by some function of $q(X_1), \ldots, q(X_n)$; let the estimate be $\hat{\bar{X}}$
Differential privacy:
  Consider two datasets differing in a single user's vector: $X^n = X_1, X_2, \ldots, X_n$ and $X'^n = X_1', X_2, \ldots, X_n$
  A mean estimator is $(\epsilon, \delta)$ differentially private if, for any set $S$,
    $\Pr\big(\hat{\bar{X}}(X^n) \in S\big) \le e^{\epsilon} \Pr\big(\hat{\bar{X}}(X'^n) \in S\big) + \delta$
  No one can identify the data $X_i$ of a single user.

Gaussian mechanism
Algorithm: the server computes the average $\hat{\bar{X}}$ using $q(X_1), q(X_2), \ldots, q(X_n)$ and releases $\hat{\bar{X}} + N(0, \sigma^2 I_d)$, where $\sigma \propto \frac{1}{n\epsilon}\sqrt{\log\frac{1}{\delta}}$
Guarantee: the released model is $(\epsilon, \delta)$ differentially private.
Issues:
  The server may not add the noise
  The server knows the true $\hat{\bar{X}}$

Distributed Gaussian mechanism
Algorithm:
  Each client sends $g(X_i) = q(X_i) + N(0, \sigma_n^2 I_d)$
  The server computes the average $\frac{1}{n} \sum_{i=1}^n g(X_i)$
Analysis:
  The average of Gaussians is a Gaussian
  Total noise variance: $\frac{1}{n}\sigma_n^2 = \sigma^2$ (i.e., $\sigma_n = \sqrt{n}\,\sigma$), so the learned model is $(\epsilon, \delta)$ differentially private
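
A minimal sketch of the distributed Gaussian mechanism, assuming unit-norm client vectors and skipping the quantization step $q(\cdot)$: each client adds per-coordinate noise with variance $n\sigma^2$ so that the server-side average carries $N(0, \sigma^2)$ noise, as in the central mechanism. The value of sigma below is illustrative, not a calibrated privacy parameter.

```python
import numpy as np

def distributed_gaussian_mean(X, sigma, rng):
    """Each of the n clients adds N(0, n*sigma^2) noise per coordinate; the
    average of the noisy reports then has noise variance sigma^2, matching
    the central Gaussian mechanism."""
    n, d = X.shape
    sigma_n = np.sqrt(n) * sigma
    reports = X + rng.normal(scale=sigma_n, size=(n, d))   # g(X_i) = X_i + N(0, sigma_n^2 I)
    return reports.mean(axis=0)

rng = np.random.default_rng(4)
n, d = 100, 16
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # assume unit-norm client vectors
est = distributed_gaussian_mean(X, sigma=0.05, rng=rng)
print(np.linalg.norm(est - X.mean(axis=0)))
```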

Differential privacy guarantees
If the server is negligent but not malicious:
  The estimate of the average is differentially private
  The learned model is private and safe to release
If the clients do not trust the server:
  Secure aggregation: a cryptographic scheme that ensures the server learns only the average $\hat{\bar{X}}$
  The server does not learn anything about individual users
Issue: this algorithm is not communication efficient, and the cryptographic protocol operates over discrete values.

Distributed Gaussian mechanism: attempt 2
Quantized Gaussians:
  Each client computes $g(X_i) = X_i + N(0, \sigma_n^2 I_d)$, quantizes it to obtain $q(g(X_i))$, and sends the quantized value
  The server computes the average $\frac{1}{n} \sum_{i=1}^n q(g(X_i))$
Differential privacy:
  The sum of quantized Gaussians is not Gaussian
  A proof is needed that quantization does not affect privacy

Binomial mechanism
Algorithm:
  Client $i$ computes the compressed vector $q(X_i)$, adds Binomial noise $Z_i \sim \mathrm{Bin}(m, p)$ to each coordinate, and transmits $q(X_i) + Z_i$
Advantages:
  A Binomial random variable takes a finite number of bits to send; communication cost $\le n\, d \log_2(m + k)$
  The sum of Binomial noises $\sum_i Z_i$ is also Binomial
  By the CLT, Binomial ≈ Gaussian
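
A sketch of this step, representing $q(X_i)$ directly by its integer level indices and, for simplicity, assuming all clients share one quantization range [lo, hi]; the parameters m and p are placeholders, not calibrated for a target (epsilon, delta). The server subtracts the known noise mean m*p to keep the estimate unbiased.

```python
import numpy as np

def binomial_mechanism_report(idx, m, p, rng):
    """Client side: add Bin(m, p) noise to each quantized coordinate. The
    report stays integer-valued, so it fits secure aggregation and costs
    about log2(m + k) bits per coordinate."""
    noise = rng.binomial(m, p, size=idx.shape)
    return idx + noise

def aggregate(reports, m, p, lo, hi, k):
    """Server side: average the integer reports, subtract the known noise
    mean m*p, and map level indices back to values."""
    avg_idx = np.mean(reports, axis=0) - m * p
    return lo + avg_idx * (hi - lo) / (k - 1)

rng = np.random.default_rng(5)
k, m, p, lo, hi = 16, 32, 0.5, -1.0, 1.0
n, d = 50, 8
X = rng.uniform(lo, hi, size=(n, d))
step = (hi - lo) / (k - 1)
reports = []
for x in X:
    pos = (x - lo) / step                              # shared-range stochastic rounding
    idx = np.floor(pos) + (rng.random(d) < (pos - np.floor(pos)))
    idx = np.clip(idx, 0, k - 1).astype(int)
    reports.append(binomial_mechanism_report(idx, m, p, rng))
est = aggregate(np.stack(reports), m, p, lo, hi, k)
print(np.abs(est - X.mean(axis=0)).max())
```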

Binomial mechanism results
Gaussian mechanism (Dwork et al 06): $N(0, \sigma^2)$ is $(\epsilon, \delta)$ differentially private for
  $\epsilon = \frac{2\sqrt{2\log\frac{1.25}{\delta}}}{\sigma}$
Binomial mechanism: $\mathrm{Bin}(N, p)$ is $(\epsilon, \delta)$ differentially private for
  $\epsilon = \frac{2\sqrt{2\log\frac{1.25}{\delta}}}{\sigma} + \tilde{O}\!\Big(\frac{1}{\sigma^2}\Big)$, where $\sigma = \sqrt{Np(1-p)}$
In the high-privacy regime ($\epsilon \to 0$, $\sigma \to \infty$), the two mechanisms coincide.

cpSGD
Rotated quantization + Binomial mechanism achieves the same MSE as the Gaussian mechanism, with a communication cost of
  $n\, d \Big( \log_2\Big(1 + \frac{d}{n\epsilon^2}\Big) + O\Big(\log\log\frac{nd}{\epsilon\delta}\Big) \Big)$
If $n \gtrsim d$, then $O\big(\log\log\frac{nd}{\epsilon\delta}\big)$ bits per coordinate suffice.

Conclusions and future directions
Conclusions:
  Distributed SGD → distributed mean estimation
  Minimax optimal scheme for distributed mean estimation
  Communication-efficient DP via the Binomial mechanism
Future directions:
  Distributed SGD: exploit correlation between gradients over time
  Distributed SGD: better privacy algorithms
  Distributed mean estimation: competitive / instance-optimal guarantees
Thank you!

Paper: cpSGD: Communication-efficient and differentially-private distributed SGD. Naman Agarwal, Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan. arXiv:1805.10559v1 [stat.ML], 27 May 2018.
