Communication-efficient and Differentially-private Distributed SGD


Ananda Theertha Suresh, with Naman Agarwal, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. Google Research, November 16, 2018.

Outline
  Motivation: distributed SGD, federated learning
  Problem formulation: distributed SGD → distributed mean estimation
  Communication efficiency: three schemes, optimality
  Differential privacy: privacy via the Binomial mechanism
  What is new in the distributed SGD setting?

Server-based distributed SGD
  A tool for computational efficiency
  Workers are efficient
  Well-designed architecture; e.g., data is i.i.d.

Client-based distributed SGD
  Data is inherently distributed; e.g., across cellphones
  Clients (cellphones) are not computationally efficient
  No carefully designed architecture; e.g., data is not i.i.d.
  Constrained resources; e.g., memory, bandwidth, power
  Privacy; e.g., sensitive data
  Federated learning: Konečný et al 17

Distributed SGD over clients: challenges
Standard approach: client $i$ sends its gradient $g_i^t$.
Uplink communication cost:
  To send $g_i^t$ to constant accuracy: $d \log d$ bits
  Expensive for large models with millions of parameters
  Suppose $g_i^t$ is a vector of 32-bit floats: $32d$ bits
  Medium LSTM model for PTB: 13.5 million parameters ≈ 50 MB
This work:
  How to reduce the uplink communication cost? (Uplink is more expensive than downlink.)
  What type of privacy guarantees can we give the users?
  Can we do both?

Distributed SGD formulation
Goal: given $m$ functions $F_1(w), F_2(w), \ldots, F_m(w)$ on $m$ clients, minimize
  $F(w) = \frac{1}{m} \sum_{i=1}^m F_i(w)$
Algorithm:
  Initialize $w$ to $w_0$
  For $t$ from 0 to $T$:
    Randomly select $n$ clients
    Selected clients compute gradients $g_i^t = \nabla F_i(w_t)$
    Selected clients send their gradients to the server
    The server computes the average $g^t = \frac{1}{n} \sum_i g_i^t$ and updates the model by $w_{t+1} = w_t - \eta_t g^t$
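
To make the loop concrete, here is a minimal NumPy sketch of the algorithm above; the quadratic client objectives, sizes, and step size are illustrative assumptions for the example, not the setting used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, T, eta = 100, 10, 20, 200, 0.1   # assumed toy sizes

# Illustrative client objectives: F_i(w) = 0.5 * ||w - c_i||^2,
# so grad F_i(w) = w - c_i and the global optimum is the mean of the c_i.
centers = rng.normal(size=(m, d))

def client_gradient(i, w):
    return w - centers[i]

w = np.zeros(d)
for t in range(T):
    selected = rng.choice(m, size=n, replace=False)   # randomly select n clients
    grads = np.stack([client_gradient(i, w) for i in selected])
    g = grads.mean(axis=0)                            # server averages the gradients
    w = w - eta * g                                   # w_{t+1} = w_t - eta_t * g^t

print("distance to optimum:", np.linalg.norm(w - centers.mean(axis=0)))
```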

Distributed SGD guarantees
Non-convex problems (Ghadimi et al 13): after $T$ steps,
  $\mathbb{E}_{t \sim [T]} \|\nabla F(w_t)\|_2^2 \lesssim \frac{\sigma}{\sqrt{T}}$, where $\sigma^2 = \max_{1 \le t \le T} \mathbb{E}\|g^t(w_t) - \nabla F(w_t)\|_2^2$
Similar results hold for convex and strongly convex problems.
SGD guarantees depend on the mean squared error (MSE) of the gradient estimates.
Guarantees with post-processing: if each $g_i^t$ is replaced by $\hat{g}_i^t$ such that $\mathbb{E}[\hat{g}_i^t] = g_i^t$, and $\hat{g}^t = \frac{1}{n} \sum_i \hat{g}_i^t$, then the same bound holds with
  $\sigma^2 = \max_{1 \le t \le T} \mathbb{E}\|\hat{g}^t(w_t) - \nabla F(w_t)\|_2^2$

Distributed mean estimation
Setting: $n$ vectors $X^n \stackrel{\text{def}}{=} X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ that reside on $n$ clients
Goal: estimate the mean of the vectors, $\bar{X} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n X_i$
Applications:
  Distributed SGD ($X_i$ is the gradient)
  Distributed power iteration
  Distributed Lloyd's algorithm
  ...

Communication efficiency: approach
Problem statement: $n$ vectors $X^n \stackrel{\text{def}}{=} X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ reside on $n$ clients; estimate the mean $\bar{X} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n X_i$
Communication protocol:
  Client $i$ transmits a compressed or private vector $q(X_i)$
  The server estimates the mean by some function of $q(X_1), q(X_2), \ldots, q(X_n)$

Definitions
$\pi$: communication protocol; $\hat{\bar{X}}$: estimated mean
Mean squared error (MSE): $\mathcal{E}(\pi, X^n) = \mathbb{E}\big[\|\hat{\bar{X}} - \bar{X}\|_2^2\big]$, where the expectation is over the randomization in the protocol
Communication cost: $C_i(\pi, X_i)$ is the expected number of bits sent by the $i$-th client, and the total number of bits transmitted by all clients is
  $C(\pi, X^n) = \sum_{i=1}^n C_i(\pi, X_i)$
Question: what is the MSE ($\mathcal{E}$) vs communication ($C$) trade-off?
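
These two quantities are easy to estimate empirically for any concrete protocol. Below is a small helper under assumed interfaces: `encode(x, rng)` returns a bit count and a payload, and `decode(payload)` returns a $d$-dimensional vector; both names are placeholders for whatever scheme is plugged in.

```python
import numpy as np

def evaluate_protocol(X, encode, decode, trials=100, seed=0):
    """Monte-Carlo estimate of the MSE E[||Xhat - Xbar||^2] and total bits C.

    X: (n, d) array of client vectors.
    encode(x, rng) -> (bits, payload); decode(payload) -> d-dim vector.
    """
    rng = np.random.default_rng(seed)
    true_mean = X.mean(axis=0)
    mses, bit_totals = [], []
    for _ in range(trials):
        decoded, bits = [], 0
        for x in X:
            b, payload = encode(x, rng)
            bits += b
            decoded.append(decode(payload))
        est_mean = np.mean(decoded, axis=0)
        mses.append(np.sum((est_mean - true_mean) ** 2))
        bit_totals.append(bits)
    return np.mean(mses), np.mean(bit_totals)
```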

Minimax formulation
No assumptions on $X_1, X_2, \ldots, X_n$.
To characterize optimality, let $B^d$ be the unit Euclidean ball in $d$ dimensions.
Minimax MSE:
  $\mathcal{E}(c) \stackrel{\text{def}}{=} \min_{\pi: C(\pi) \le c} \; \max_{X^n \subseteq B^d} \mathcal{E}(\pi, X^n)$

Related works
Garg et al 14, Braverman et al 16:
  Assume $X_1, X_2, \ldots, X_n$ are generated i.i.d. from a distribution, typically Gaussian
  Estimate the distribution mean instead of the empirical mean
  Algorithms are tailored to the assumptions; e.g., to estimate the mean of a Gaussian with unit variance:
    Compute $\hat{p} = \Pr(X_i \ge 0)$
    Output $\hat{\mu}$ such that $\Pr(N(\hat{\mu}, 1) > 0) = \hat{p}$
We assume no distribution over the data and want the empirical mean.
Alistarh et al 16: use quantization and Elias coding.

Three protocols
  Stochastic uniform quantization: binary and k-level
  Stochastic rotated quantization: quantization error significantly reduced by a random rotation
  Variable length coding: optimal communication-MSE trade-off

Stochastic binary quantization
For client $i$ with vector $X_i \in \mathbb{R}^d$, let
  $X_i^{\max} = \max_{1 \le j \le d} X_i(j)$ and $X_i^{\min} = \min_{1 \le j \le d} X_i(j)$
The quantized value for each coordinate $j$ is
  $Y_i(j) = \begin{cases} X_i^{\max} & \text{w.p. } \frac{X_i(j) - X_i^{\min}}{X_i^{\max} - X_i^{\min}} \\ X_i^{\min} & \text{otherwise} \end{cases}$
Observe that $\mathbb{E}\,Y_i(j) = X_i(j)$.

Stochastic binary quantization
Estimated mean: $\hat{\bar{X}} = \frac{1}{n} \sum_{i=1}^n Y_i$
Communication cost: $d + \tilde{O}(1)$ bits per client, and hence $C = n\,(d + \tilde{O}(1))$
Mean squared error: $\mathcal{E} = O\!\Big(\frac{d}{n} \cdot \frac{1}{n} \sum_{i=1}^n \|X_i\|_2^2\Big)$
So $d$ bits per client gives MSE $O(d/n)$.

Stochastic k-level quantization
Divide the interval $[X_i^{\min}, X_i^{\max}]$ into $k$ levels and round each coordinate stochastically to one of its two nearest levels.
Communication cost: $d \log_2 k + \tilde{O}(1)$ bits per client, and hence $C = n\,(d \log_2 k + \tilde{O}(1))$
Mean squared error: $\mathcal{E} = O\!\Big(\frac{d}{n(k-1)^2} \cdot \frac{1}{n} \sum_{i=1}^n \|X_i\|_2^2\Big)$
So $d \log_2 k$ bits per client gives MSE $O(d/(nk^2))$.
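
A sketch of the stochastic k-level quantizer described above ($k = 2$ recovers the binary scheme); the uniform level placement follows the slide, while the packing of indices into $\lceil \log_2 k \rceil$-bit words is left abstract.

```python
import numpy as np

def klevel_quantize(x, k, rng):
    """Stochastically round each coordinate to one of k uniform levels in
    [x.min(), x.max()]; the dequantized result is an unbiased estimate of x."""
    lo, hi = x.min(), x.max()
    if hi == lo:                          # constant vector: nothing to quantize
        return np.zeros_like(x, dtype=int), lo, hi
    step = (hi - lo) / (k - 1)
    pos = (x - lo) / step                 # position measured in levels
    low = np.floor(pos)
    prob_up = pos - low                   # round up with this probability
    idx = low + (rng.random(x.shape) < prob_up)
    idx = np.clip(idx, 0, k - 1)          # guard against float round-off
    return idx.astype(int), lo, hi        # idx costs ceil(log2(k)) bits/coordinate

def klevel_dequantize(idx, lo, hi, k):
    return lo + idx * (hi - lo) / (k - 1)

# Unbiasedness check: averaging many independent quantizations approaches x.
rng = np.random.default_rng(1)
x = rng.normal(size=5)
samples = [klevel_dequantize(*klevel_quantize(x, 4, rng), 4) for _ in range(20000)]
print(np.abs(np.mean(samples, axis=0) - x).max())   # close to 0
```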

Two improvements
  1. Stochastic rotated k-level quantization: rotate the vectors before quantization
  2. Variable length coding: use Huffman/arithmetic coding plus universal compression

Stochastic rotated k-level quantization
Observation: the smaller $X_i^{\max} - X_i^{\min}$, the smaller the error.
A random rotation reduces $X_i^{\max} - X_i^{\min}$ to $O\!\Big(\sqrt{\frac{\log d}{d}}\, \|X_i\|_2\Big)$.
Each client:
  Rotates its vector using a random rotation matrix: $Z_i = R X_i$
  Quantizes each coordinate of $Z_i$ into $k$ levels to obtain $Y_i$
Server:
  Estimates the mean by $\hat{\bar{X}} = R^{-1} \frac{1}{n} \sum_{i=1}^n Y_i$
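
A sketch of the full rotate-quantize-unrotate pipeline, reusing the k-level quantizer sketched earlier and sampling an explicit random rotation via QR decomposition; this is the slow $O(d^2)$ variant that the next slide speeds up, and the shared rotation is an assumption of the sketch (clients and server draw it from common randomness).

```python
import numpy as np

def random_rotation(d, rng):
    """Sample a d x d random orthogonal matrix (Haar-distributed up to sign fix)."""
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    return Q * np.sign(np.diag(R))        # sign fix makes the rotation uniform

def rotated_mean_estimate(X, k, rng):
    """Clients rotate with a shared R and quantize; the server averages and unrotates."""
    n, d = X.shape
    R = random_rotation(d, rng)           # shared between clients and server
    Y = []
    for x in X:
        z = R @ x                                     # Z_i = R X_i
        idx, lo, hi = klevel_quantize(z, k, rng)      # quantize each coordinate of Z_i
        Y.append(klevel_dequantize(idx, lo, hi, k))
    return R.T @ np.mean(Y, axis=0)       # R^{-1} = R^T for a rotation
```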

Stochastic rotated k-level quantization
Communication cost: $d \log_2 k + \tilde{O}(1)$ bits per client, so $C = n\,(d \log_2 k + \tilde{O}(1))$
Mean squared error: $\mathcal{E} = O\!\Big(\frac{\log d}{n(k-1)^2} \cdot \frac{1}{n} \sum_{i=1}^n \|X_i\|_2^2\Big)$
With $O(d \log k)$ bits per client, the MSE improves from $O(d/(nk^2))$ to $O(\log d/(nk^2))$.
But rotating a high-dimensional vector is slow: $O(d^2)$ time!

Fast random rotation
Use the structured rotation $R = HD$, where
  $H$: Walsh-Hadamard matrix
  $D$: diagonal matrix with independent Rademacher ($\pm 1$) entries
Matrix-vector multiplication time: $O(d \log d)$
Used in other contexts: Ailon et al 09, Yu et al 16
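
A sketch of the structured rotation $R = HD$: an iterative fast Walsh-Hadamard transform plus a random sign flip, assuming $d$ is a power of two (otherwise pad with zeros). It can replace `random_rotation` in the pipeline sketched above.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2."""
    y = x.astype(float).copy()
    d = len(y)
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = y[start:start + h].copy()
            b = y[start + h:start + 2 * h].copy()
            y[start:start + h] = a + b
            y[start + h:start + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)                 # normalization makes H/sqrt(d) orthonormal

def hd_rotate(x, signs):
    return fwht(signs * x)                # R x = H D x

def hd_unrotate(z, signs):
    return signs * fwht(z)                # R^{-1} z = D H z (normalized H and D are involutions)

# signs is the diagonal of D, shared between clients and server.
rng = np.random.default_rng(2)
d = 8
signs = rng.choice([-1.0, 1.0], size=d)   # Rademacher entries
x = rng.normal(size=d)
assert np.allclose(hd_unrotate(hd_rotate(x, signs), signs), x)
```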

Two improvements
  1. Stochastic rotated k-level quantization: rotates the vectors before quantization
  2. Variable length coding: uses Huffman/arithmetic coding plus universal compression; information-theoretically optimal

Variable length coding
Previous schemes used a fixed $\log_2 k$ bits for each dimension; variable length coding uses a different number of bits per dimension.
Key idea: use fewer bits for more frequent bins. Describe the quantized distribution, then encode using arithmetic coding for that distribution.
Communication cost: arithmetic/Huffman coding plus universal compression yields
  $C \le n \Big( O\big(d\,(1 + \log(k^2/d + 1))\big) + \tilde{O}(1) \Big)$
For $k \le \sqrt{d}$: bits per client drop from $O(d \log k)$ to $O(d)$, with MSE $O(d/(nk^2))$.
For $k = \sqrt{d}$: $O(d)$ bits per client and MSE $O(1/n)$.
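
The coding step itself is standard; the rough sketch below estimates the per-client cost by Huffman-coding the empirical distribution of the quantized level indices (arithmetic coding would get even closer to the entropy). The quantizer is the one sketched earlier, and the sizes and the Gaussian input are assumptions for illustration, not the talk's setup.

```python
import heapq
from collections import Counter
import numpy as np

def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code on the given counts."""
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                    # a single symbol still needs 1 bit
        return {s: 1 for s in counts}
    tie = len(heap)                       # tiebreaker so dicts are never compared
    while len(heap) > 1:
        c1, _, l1 = heapq.heappop(heap)
        c2, _, l2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**l1, **l2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

rng = np.random.default_rng(3)
d, k = 1024, 32
x = rng.normal(size=d)
idx, lo, hi = klevel_quantize(x, k, rng)      # from the earlier sketch
counts = Counter(idx.tolist())
lengths = huffman_code_lengths(counts)
bits = sum(counts[s] * lengths[s] for s in counts)
print(f"fixed length: {d * int(np.ceil(np.log2(k)))} bits, "
      f"variable length: {bits} bits (plus table / O~(1) overhead)")
```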

Information theoretic optimality
What is the best protocol for any worst-case dataset? Variable length coding combined with client sampling is optimal.
Min-max result: let $t < 1$ be a constant. For communication cost $c \le t\,n\,d$,
  $\mathcal{E}(c) = \min_{\pi: C(\pi) \le c} \; \max_{X^n \subseteq B^d} \mathcal{E}(\pi, X^n) = \Theta\!\Big(\min\Big(1, \frac{d}{c}\Big)\Big)$
The product of communication cost and MSE scales linearly with $d$.
The lower bound follows from results in Zhang et al NIPS 13.

Experiments: Lloyd's algorithm (k-means)
MNIST, m = 60K, d = 784, n = 10 clients, 10 centers, k = 16 levels
[Plots: k-means objective vs. number of iterations and vs. communication cost (bits per dimension), comparing stochastic k-level, rotated, variable length, and no compression]

Experiments: power iteration (PCA)
MNIST, m = 60K, d = 784, n = 10 clients, k = 16 levels
[Plots: error vs. number of iterations and vs. communication cost (bits per dimension), comparing stochastic k-level, rotated, variable length, and no compression]

Experiments: distributed SGD
CIFAR, m = 50K, d > 10^6, n = 100 clients (500 examples each), k = 2 levels

Communication efficiency: conclusion
Three approaches for compressed distributed mean estimation, without any assumption on the data distribution. For k = 2:
  Stochastic k-level: d bits per client, MSE O(d/n)
  Rotated: d bits per client, MSE O(log(d)/n)
  Variable length*: O(d) bits per client, MSE O(1/n)
  (*Communication optimal in the minimax sense)
Privacy?

Differential privacy
Communication protocol:
  Client $i$ transmits a compressed / private vector $q(X_i)$
  The server estimates the mean by some function of $q(X_1), \ldots, q(X_n)$; let the estimate be $\hat{\bar{X}}$
Differential privacy:
  Consider two datasets differing in a single user's vector: $X^n = X_1, X_2, \ldots, X_n$ and $X'^n = X_1', X_2, \ldots, X_n$
  A mean estimator is $(\epsilon, \delta)$ differentially private if, for any set $S$,
    $\Pr\big(\hat{\bar{X}}(X^n) \in S\big) \le e^{\epsilon} \Pr\big(\hat{\bar{X}}(X'^n) \in S\big) + \delta$
  No one can identify the data $X_i$ of a single user.

Gaussian mechanism
Algorithm: the server computes the average $\hat{\bar{X}}$ using $q(X_1), q(X_2), \ldots, q(X_n)$ and releases $\hat{\bar{X}} + N(0, \sigma^2 I_d)$, where $\sigma \propto \frac{1}{n\epsilon}\sqrt{\log\frac{1}{\delta}}$
Guarantee: the released model is $(\epsilon, \delta)$ differentially private.
Issues:
  The server may not add the noise
  The server knows the true $\hat{\bar{X}}$

Distributed Gaussian mechanism
Algorithm:
  Each client sends $g(X_i) = q(X_i) + N(0, \sigma_n^2 I_d)$
  The server computes the average $\frac{1}{n} \sum_{i=1}^n g(X_i)$
Analysis:
  The average of Gaussians is a Gaussian
  Total noise variance: $\frac{1}{n}\sigma_n^2 = \sigma^2$ (i.e., $\sigma_n = \sqrt{n}\,\sigma$), so the learned model is $(\epsilon, \delta)$ differentially private
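
A minimal sketch of the distributed Gaussian mechanism, assuming unit-norm client vectors and skipping the quantization step $q(\cdot)$: each client adds per-coordinate noise with variance $n\sigma^2$ so that the server-side average carries $N(0, \sigma^2)$ noise, as in the central mechanism. The value of sigma below is illustrative, not a calibrated privacy parameter.

```python
import numpy as np

def distributed_gaussian_mean(X, sigma, rng):
    """Each of the n clients adds N(0, n*sigma^2) noise per coordinate; the
    average of the noisy reports then has noise variance sigma^2, matching
    the central Gaussian mechanism."""
    n, d = X.shape
    sigma_n = np.sqrt(n) * sigma
    reports = X + rng.normal(scale=sigma_n, size=(n, d))   # g(X_i) = X_i + N(0, sigma_n^2 I)
    return reports.mean(axis=0)

rng = np.random.default_rng(4)
n, d = 100, 16
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # assume unit-norm client vectors
est = distributed_gaussian_mean(X, sigma=0.05, rng=rng)
print(np.linalg.norm(est - X.mean(axis=0)))
```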

Differential privacy guarantees
If the server is negligent but not malicious:
  The estimate of the average is differentially private
  The learned model is private and safe to release
If the clients do not trust the server:
  Secure aggregation: a cryptographic scheme that ensures the server learns only the average $\hat{\bar{X}}$
  The server does not learn anything about individual users
Issue: this algorithm is not communication efficient, and the cryptographic protocol operates over discrete values.

Distributed Gaussian mechanism: attempt 2
Quantized Gaussians:
  Each client computes $g(X_i) = X_i + N(0, \sigma_n^2 I_d)$, quantizes it to obtain $q(g(X_i))$, and sends the quantized value
  The server computes the average $\frac{1}{n} \sum_{i=1}^n q(g(X_i))$
Differential privacy:
  The sum of quantized Gaussians is not Gaussian
  A proof is needed that quantization does not affect privacy

Binomial mechanism
Algorithm:
  Client $i$ computes the compressed vector $q(X_i)$, adds Binomial noise $Z_i \sim \mathrm{Bin}(m, p)$ to each coordinate, and transmits $q(X_i) + Z_i$
Advantages:
  A Binomial random variable takes a finite number of bits to send; communication cost $\le n\, d \log_2(m + k)$
  The sum of Binomial noises $\sum_i Z_i$ is also Binomial
  By the CLT, Binomial ≈ Gaussian
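
A sketch of this step, representing $q(X_i)$ directly by its integer level indices and, for simplicity, assuming all clients share one quantization range [lo, hi]; the parameters m and p are placeholders, not calibrated for a target (epsilon, delta). The server subtracts the known noise mean m*p to keep the estimate unbiased.

```python
import numpy as np

def binomial_mechanism_report(idx, m, p, rng):
    """Client side: add Bin(m, p) noise to each quantized coordinate. The
    report stays integer-valued, so it fits secure aggregation and costs
    about log2(m + k) bits per coordinate."""
    noise = rng.binomial(m, p, size=idx.shape)
    return idx + noise

def aggregate(reports, m, p, lo, hi, k):
    """Server side: average the integer reports, subtract the known noise
    mean m*p, and map level indices back to values."""
    avg_idx = np.mean(reports, axis=0) - m * p
    return lo + avg_idx * (hi - lo) / (k - 1)

rng = np.random.default_rng(5)
k, m, p, lo, hi = 16, 32, 0.5, -1.0, 1.0
n, d = 50, 8
X = rng.uniform(lo, hi, size=(n, d))
step = (hi - lo) / (k - 1)
reports = []
for x in X:
    pos = (x - lo) / step                              # shared-range stochastic rounding
    idx = np.floor(pos) + (rng.random(d) < (pos - np.floor(pos)))
    idx = np.clip(idx, 0, k - 1).astype(int)
    reports.append(binomial_mechanism_report(idx, m, p, rng))
est = aggregate(np.stack(reports), m, p, lo, hi, k)
print(np.abs(est - X.mean(axis=0)).max())
```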

Binomial mechanism results
Gaussian mechanism (Dwork et al 06): $N(0, \sigma^2)$ is $(\epsilon, \delta)$ differentially private for
  $\epsilon = \frac{2\sqrt{2\log\frac{1.25}{\delta}}}{\sigma}$
Binomial mechanism: $\mathrm{Bin}(N, p)$ is $(\epsilon, \delta)$ differentially private for
  $\epsilon = \frac{2\sqrt{2\log\frac{1.25}{\delta}}}{\sigma} + \tilde{O}\!\Big(\frac{1}{\sigma^2}\Big)$, where $\sigma = \sqrt{Np(1-p)}$
In the high-privacy regime ($\epsilon \to 0$, $\sigma \to \infty$), the two mechanisms coincide.

cpSGD
Rotated quantization + Binomial mechanism achieves the same MSE as the Gaussian mechanism, with a communication cost of
  $n\, d \Big( \log_2\Big(1 + \frac{d}{n\epsilon^2}\Big) + O\Big(\log\log\frac{nd}{\epsilon\delta}\Big) \Big)$
If $n \gtrsim d$, then $O\big(\log\log\frac{nd}{\epsilon\delta}\big)$ bits per coordinate suffice.

Conclusions and future directions
Conclusions:
  Distributed SGD → distributed mean estimation
  Minimax optimal scheme for distributed mean estimation
  Communication-efficient DP via the Binomial mechanism
Future directions:
  Distributed SGD: exploit correlation between gradients over time
  Distributed SGD: better privacy algorithms
  Distributed mean estimation: competitive / instance-optimal guarantees
Thank you!

Paper: cpSGD: Communication-efficient and differentially-private distributed SGD. Naman Agarwal, Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan. arXiv:1805.10559v1 [stat.ML], 27 May 2018.
