Communication-efficient and Differentially-private Distributed SGD
Transcription
1/36 Communication-efficient and Differentially-private Distributed SGD. Ananda Theertha Suresh, with Naman Agarwal, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan. Google Research, November 16, 2018
2/36 Outline: Motivation (distributed SGD, federated learning); Problem formulation (distributed SGD as distributed mean estimation); Communication efficiency (three schemes, optimality); Differential privacy (privacy via the Binomial mechanism). What is new in distributed SGD?
3/36 Server-based distributed SGD: a tool for computational efficiency. Workers are efficient, and the architecture is well designed, e.g., data is i.i.d.
4/36 Client-based distributed SGD: data is inherently distributed, e.g., on cellphones. Clients (cellphones) are not computationally efficient; the architecture is not well designed, e.g., data is not i.i.d.; resources are constrained, e.g., memory, bandwidth, power; and privacy matters, e.g., sensitive data. Federated learning: Konečný et al. '17.
5/36 Distributed SGD over clients: challenges. Standard approach: client i sends its gradient g_i^t. Uplink communication cost: sending g_i^t to constant accuracy takes d log d bits, which is expensive for large models with millions of parameters. Suppose g_i^t is a vector of 32-bit floats: 32d bits; a medium LSTM model for PTB has 13.5 million parameters, i.e., about 50 MB. This work: How do we reduce the uplink communication cost? (The uplink is more expensive than the downlink.) What type of privacy guarantees can we give the users? Can we do both?
6/36 Distributed SGD formulation. Goal: given m functions F_1(w), F_2(w), ..., F_m(w) on m clients, minimize (1/m) sum_{i=1}^m F_i(w). Algorithm: initialize w to w^0; for t from 0 to T: randomly select n clients; the selected clients compute gradients g_i^t = ∇F_i(w^t) and send them to the server; the server computes the average g^t = (1/n) sum_i g_i^t and updates the model by w^{t+1} = w^t - η_t g^t.
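To make the round concrete, here is a minimal numpy sketch of the update; the helper names, the toy quadratic objectives, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sgd_round(w, grad_fns, n_selected, lr, rng):
    """One round of server-orchestrated distributed SGD."""
    m = len(grad_fns)
    selected = rng.choice(m, size=n_selected, replace=False)  # randomly select n clients
    grads = [grad_fns[i](w) for i in selected]                # clients compute g_i^t
    g_avg = np.mean(grads, axis=0)                            # server averages the gradients
    return w - lr * g_avg                                     # w^{t+1} = w^t - eta_t * g^t

# Toy example: client i holds F_i(w) = 0.5 * ||w - c_i||^2, so grad F_i(w) = w - c_i.
rng = np.random.default_rng(0)
centers = rng.normal(size=(100, 10))
grad_fns = [lambda w, c=c: w - c for c in centers]
w = np.zeros(10)
for t in range(200):
    w = sgd_round(w, grad_fns, n_selected=10, lr=0.1, rng=rng)
print(np.linalg.norm(w - centers.mean(axis=0)))   # approaches the minimizer, the mean of the c_i
```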
7/36 Distributed SGD guarantees. Non-convex problems (Ghadimi et al. '13): after T steps, E_{t in [T]} ||∇F(w^t)||_2^2 ≲ σ / sqrt(T), where σ^2 = max_{1 ≤ t ≤ T} E ||g^t(w^t) - ∇F(w^t)||_2^2. Similar results hold for convex and strongly convex problems. SGD guarantees depend on the MSE of the gradient estimates. Guarantees with post-processing: if each g_i^t is replaced by an estimate g̃_i^t such that E[g̃_i^t] = g_i^t, and g̃^t = (1/n) sum_i g̃_i^t, then the same bound holds with σ^2 = max_{1 ≤ t ≤ T} E ||g̃^t(w^t) - ∇F(w^t)||_2^2.
8/36 Distributed mean estimation. Setting: n vectors X^n := X_1, X_2, ..., X_n ∈ R^d residing on n clients. Goal: estimate the mean of the vectors, X̄ := (1/n) sum_{i=1}^n X_i. Applications: distributed SGD (X_i is the gradient), distributed power iteration, distributed Lloyd's algorithm, ...
9/36 Communication efficiency: approach. Problem statement: n vectors X^n := X_1, X_2, ..., X_n ∈ R^d reside on n clients; estimate the mean X̄ := (1/n) sum_{i=1}^n X_i. Communication protocol: client i transmits a compressed or private vector q(X_i), and the server estimates the mean by some function of q(X_1), q(X_2), ..., q(X_n).
10/36 Definitions. π: communication protocol; X̂: estimated mean. Mean squared error (MSE): E(π, X^n) = E[ ||X̂ - X̄||_2^2 ], where the expectation is over the randomization in the protocol. Communication cost: C_i(π, X_i) is the expected number of bits sent by the i-th client, and the total number of bits transmitted by all clients is C(π, X^n) = sum_{i=1}^n C_i(π, X_i). Question: what is the trade-off between the MSE (E) and the communication cost (C)?
11/36 Minimax formulation. No assumptions on X_1, X_2, ..., X_n. To characterize optimality, let B_d be the unit Euclidean ball in d dimensions. Minimax MSE: E(c) := min_{π : C(π) ≤ c} max_{X^n ⊂ B_d} E(π, X^n).
12/36 Related works. Garg et al. '14, Braverman et al. '16: assume X_1, X_2, ..., X_n are generated i.i.d. from a distribution, typically Gaussian, and estimate the distribution mean instead of the empirical mean. Their algorithms are tailored to these assumptions; e.g., to estimate the mean of a Gaussian with unit variance, compute p̂ = Pr(X_i ≥ 0) and output μ̂ such that Pr(N(μ̂, 1) > 0) = p̂. We assume no distribution over the data and want the empirical mean. Alistarh et al. '16: use quantization and Elias coding.
13/36 Three protocols. Stochastic uniform quantization (binary and k-level). Stochastic rotated quantization: the quantization error is significantly reduced by a random rotation. Variable-length coding: optimal communication-MSE trade-off.
14/36 Stochastic binary quantization. For client i with vector X_i ∈ R^d, let X_i^max = max_{1 ≤ j ≤ d} X_i(j) and X_i^min = min_{1 ≤ j ≤ d} X_i(j). The quantized value of each coordinate j is Y_i(j) = X_i^max with probability (X_i(j) - X_i^min) / (X_i^max - X_i^min), and X_i^min otherwise. Observe that E[Y_i(j)] = X_i(j).
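As a concrete illustration (my own sketch, not the authors' code), here is a small numpy version of the binary quantizer with a Monte Carlo check of unbiasedness; the helper name and parameter values are assumptions.

```python
import numpy as np

def binary_quantize(x, rng):
    """Stochastic binary quantization: each coordinate becomes x_max or x_min,
    with probabilities chosen so that the quantized vector is unbiased."""
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:                      # degenerate case: all coordinates equal
        return x.copy()
    p = (x - x_min) / (x_max - x_min)       # Pr[Y(j) = x_max]
    return np.where(rng.random(x.shape) < p, x_max, x_min)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
samples = np.stack([binary_quantize(x, rng) for _ in range(20000)])
print(x)
print(samples.mean(axis=0))                # close to x, illustrating E[Y(j)] = X(j)
```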
15/36 Stochastic binary quantization. Estimated mean: X̂ = (1/n) sum_{i=1}^n Y_i. Communication cost: d + Õ(1) bits per client, hence C = n (d + Õ(1)). Mean squared error: E = O( (d/n) · (1/n) sum_{i=1}^n ||X_i||_2^2 ). So d bits per client give an MSE of O(d/n).
16/36 Stochastic k-level quantization. Divide the interval [X_i^min, X_i^max] into k levels. Communication cost: d log_2 k + Õ(1) bits per client, hence C = n (d log_2 k + Õ(1)). Mean squared error: E = O( (d / (n (k-1)^2)) · (1/n) sum_{i=1}^n ||X_i||_2^2 ). So d log_2 k bits per client give an MSE of O(d / (n k^2)).
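The binary scheme generalizes directly; a hedged sketch of the k-level quantizer follows (binary quantization is the special case k = 2), again with illustrative names and values.

```python
import numpy as np

def k_level_quantize(x, k, rng):
    """Stochastic k-level quantization: snap each coordinate to one of k equally
    spaced levels in [x_min, x_max], randomized so that the result is unbiased."""
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        return x.copy()
    step = (x_max - x_min) / (k - 1)
    t = (x - x_min) / step                          # position measured in step units
    lower = np.floor(t)
    round_up = rng.random(x.shape) < (t - lower)    # round up with probability equal to the remainder
    return x_min + (lower + round_up) * step

rng = np.random.default_rng(0)
x = rng.normal(size=6)
est = np.mean([k_level_quantize(x, k=4, rng=rng) for _ in range(20000)], axis=0)
print(np.abs(est - x).max())                        # small: the quantizer is unbiased
```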
17/36 Two improvements. 1. Stochastic rotated k-level quantization: rotate the vectors before quantization. 2. Variable-length coding: use Huffman/arithmetic coding plus universal compression.
18/36 Stochastic rotated k-level quantization. Observation: the smaller X_i^max - X_i^min, the smaller the error. A random rotation reduces X_i^max - X_i^min to O( sqrt(log d / d) · ||X_i||_2 ). Each client rotates its vector using a random rotation matrix, Z_i = R X_i, and quantizes each coordinate of Z_i to k levels to obtain Y_i. The server estimates the mean by X̂ = R^{-1} (1/n) sum_{i=1}^n Y_i.
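A hedged end-to-end sketch of this client/server pipeline on deliberately "spiky" data, where a plain quantizer suffers; the data, dimensions, and the dense QR-based rotation are my own illustrative choices (the next slide replaces the O(d^2) rotation with a fast structured one), and the quantizer from the earlier sketch is repeated so the snippet runs standalone.

```python
import numpy as np

def k_level_quantize(x, k, rng):
    # same stochastic k-level quantizer as in the earlier sketch
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (k - 1)
    t = (x - lo) / step
    lower = np.floor(t)
    return lo + (lower + (rng.random(x.shape) < t - lower)) * step

rng = np.random.default_rng(0)
n, d, k = 50, 256, 4

R, _ = np.linalg.qr(rng.normal(size=(d, d)))         # shared random rotation (dense: O(d^2) to apply)
X = 0.05 * rng.normal(size=(n, d))
X[np.arange(n), rng.integers(0, d, size=n)] += 1.0   # one large coordinate per client: hard for plain quantization

rotated = np.stack([k_level_quantize(R @ x, k, rng) for x in X])   # clients: rotate, then quantize
plain = np.stack([k_level_quantize(x, k, rng) for x in X])         # clients: quantize directly
true_mean = X.mean(axis=0)
print("with rotation   :", np.linalg.norm(R.T @ rotated.mean(axis=0) - true_mean))
print("without rotation:", np.linalg.norm(plain.mean(axis=0) - true_mean))
```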
19/36 Stochastic rotated k-level quantization. Communication cost: d log_2 k + Õ(1) bits per client, C = n (d log_2 k + Õ(1)). Mean squared error: E = O( (log d / (n (k-1)^2)) · (1/n) sum_{i=1}^n ||X_i||_2^2 ). So with O(d log k) bits per client, the MSE improves from O(d / (n k^2)) to O(log d / (n k^2)). But rotating a high-dimensional vector is slow: O(d^2)!
20/36 Fast random rotation. Use the structured rotation R = HD, where H is a Walsh-Hadamard matrix and D is a diagonal matrix with independent Rademacher entries. Matrix-vector multiplication time: O(d log d). Used in other contexts: Ailon et al. '09, Yu et al. '16.
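A minimal sketch of applying R = HD with an in-place fast Walsh-Hadamard transform; it assumes d is a power of two (otherwise one would pad with zeros), and the helper names are mine rather than the paper's.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d); d must be a power of two."""
    y = x.copy()
    d, h = len(y), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(d)

rng = np.random.default_rng(0)
d = 1024                                      # assumed power of two; pad otherwise
signs = rng.choice([-1.0, 1.0], size=d)       # diagonal D with independent Rademacher entries

def rotate(x):                                # Z = H D x
    return fwht(signs * x)

def derotate(z):                              # H and D are their own inverses, so x = D H z
    return signs * fwht(z)

x = rng.normal(size=d)
print(np.allclose(derotate(rotate(x)), x))    # True: the structured map is an orthogonal rotation
```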
21/36 Two improvements. 1. Stochastic rotated k-level quantization: rotates the vectors before quantization. 2. Variable-length coding: uses Huffman/arithmetic coding plus universal compression; information-theoretically optimal.
22/36 Variable-length coding. Previous schemes used a fixed log_2 k bits for each dimension; variable-length coding uses a different number of bits for each dimension. Key idea: use fewer bits for the more frequent bins. Describe the quantized distribution, then encode using arithmetic coding with respect to that distribution. Communication cost: arithmetic/Huffman coding plus universal compression yields C ≤ n ( O(d (1 + log(k^2/d + 1))) + Õ(1) ). For k ≤ sqrt(d): O(d log k) bits per client improve to O(d), with MSE O(d / (n k^2)). For k = sqrt(d): O(d) bits per client and MSE O(1/n).
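A back-of-the-envelope illustration (my own, not the paper's encoder) of why this helps: the quantized levels of a typical bell-shaped vector are far from uniformly used, so their empirical entropy (roughly the per-coordinate cost of arithmetic coding) sits well below the fixed-length cost of log_2 k bits. Deterministic rounding is used only to keep the sketch short; the actual scheme quantizes stochastically and compresses the level sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1 << 12, 16
x = rng.normal(size=d)
x /= np.linalg.norm(x)

lo, hi = x.min(), x.max()
levels = np.round((x - lo) / (hi - lo) * (k - 1)).astype(int)   # integer level per coordinate
probs = np.bincount(levels, minlength=k) / d                    # empirical level frequencies
entropy = -(probs[probs > 0] * np.log2(probs[probs > 0])).sum()
print(f"fixed-length: {np.log2(k):.2f} bits/coordinate, empirical entropy: {entropy:.2f} bits/coordinate")
```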
23/36 Information-theoretic optimality. What is the best protocol for any worst-case dataset? Variable-length coding combined with client sampling is optimal. Minimax result: let t < 1 be a constant; for communication cost c ≤ t n d, E(c) = min_{π : C(π) ≤ c} max_{X^n ⊂ B_d} E(π, X^n) = Θ( min(1, d/c) ). The product of communication cost and MSE scales linearly with d. The lower bound follows from results in Zhang et al., NIPS '13.
24/36 Experiments: Lloyd's algorithm (k-means). MNIST, m = 60K, d = 784, n = 10 clients, 10 centers, k = 16 levels. [Plots: objective vs. number of iterations and objective vs. communication cost (bits per dimension), comparing stochastic k-level, rotated, variable-length, and no compression.]
25/36 Experiments: power iteration (PCA). MNIST, m = 60K, d = 784, n = 10 clients, k = 16 levels. [Plots: error vs. number of iterations and error vs. communication cost (bits per dimension), comparing stochastic k-level, rotated, variable-length, and no compression.]
26/36 Experiments: distributed SGD. CIFAR, m = 50K, d > 10^6, n = 100 clients (500 examples each), k = 2.
27/36 Communication efficiency: conclusion. Three approaches for compressed distributed mean estimation without any assumption on the data distribution. For k = 2: stochastic k-level quantization uses d bits per client with MSE O(d/n); rotated quantization uses d bits per client with MSE O(log(d)/n); variable-length coding* uses O(d) bits per client with MSE O(1/n). (*Communication-optimal in the minimax sense.) What about privacy?
28/36 Differential privacy. Communication protocol: client i transmits a compressed/private vector q(X_i); the server estimates the mean by some function of q(X_1), ..., q(X_n); let the estimate be X̂. Differential privacy: consider two datasets differing in one user's vector, X^n = X_1, X_2, ..., X_n and X'^n = X'_1, X_2, ..., X_n. A mean estimator is (ε, δ)-differentially private if for any set S, Pr(X̂(X^n) ∈ S) ≤ e^ε Pr(X̂(X'^n) ∈ S) + δ. Then no one can identify the X_i of a single user.
29/36 Gaussian mechanism. Algorithm: the server computes the average X̂ using q(X_1), q(X_2), ..., q(X_n) and releases X̂ + N(0, σ^2 I_d), where σ ∝ (1/(nε)) sqrt(log(1/δ)). Guarantee: the released model is (ε, δ)-differentially private. Issues: the server may not add the noise, and the server knows the true X̂.
30/36 Distributed Gaussian mechanism. Algorithm: each client sends g(X_i) = q(X_i) + N(0, σ_n^2 I_d), and the server computes the average (1/n) sum_{i=1}^n g(X_i). Analysis: the average of Gaussians is Gaussian, and the total noise variance is σ_n^2 / n = σ^2, so the learned model is (ε, δ)-differentially private.
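A hedged numerical sketch of this split: each client perturbs its own vector so that, after the server averages, the total noise matches the centrally added N(0, σ^2 I_d). The value of σ and the data are illustrative, and q(·) is taken to be the identity for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 10, 0.05

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # clients hold unit-norm vectors

sigma_client = np.sqrt(n) * sigma                        # per-client std, so that variance / n = sigma^2
noisy = X + rng.normal(scale=sigma_client, size=(n, d))  # each client adds its own Gaussian noise
estimate = noisy.mean(axis=0)                            # ideally computed under secure aggregation
print(np.linalg.norm(estimate - X.mean(axis=0)))         # total perturbation of order sigma * sqrt(d)
```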
31/36 Differential privacy guarantees. If the server is negligent but not malicious: the estimate of the average is differentially private, and the learned model is private and safe to release. If the clients do not trust the server: secure aggregation, a cryptographic scheme, ensures that the server learns only the average X̂ and nothing about individual users. Issue: the algorithm is not communication efficient, and the cryptographic protocol operates over discrete values.
32/36 Distributed Gaussian mechanism: attempt 2. Quantized Gaussians: each client computes g(X_i) = X_i + N(0, σ_n^2 I_d), quantizes it to obtain q(g(X_i)), and the server computes the average (1/n) sum_{i=1}^n q(g(X_i)). Differential privacy: the sum of quantized Gaussians is not Gaussian, so one needs a proof that quantization does not affect privacy.
33/36 Binomial mechanism. Algorithm: client i computes the compressed vector q(X_i), adds Binomial noise Z_i ~ Bin(m, p) per coordinate, and transmits q(X_i) + Z_i. Advantages: a Binomial random variable takes finitely many bits to send; the communication cost is n d log_2(m + k); the sum of Binomial noises sum_i Z_i is also Binomial; and by the CLT, Binomial ≈ Gaussian.
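A minimal sketch of the client-side step, under assumed values of k, m, and p (the paper chooses them so that the aggregated noise matches the Gaussian mechanism's variance); deterministic rounding stands in for the stochastic quantizer to keep the snippet short.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, p = 16, 128, 0.5

def client_message(x):
    lo, hi = x.min(), x.max()
    levels = np.round((x - lo) / (hi - lo) * (k - 1)).astype(int)   # integer codes in {0, ..., k-1}
    noise = rng.binomial(m, p, size=x.shape)                        # Z_i ~ Bin(m, p) per coordinate
    return levels + noise                                           # integers in {0, ..., k-1+m}: finitely many bits

x = rng.normal(size=8)
print(client_message(x))   # the server subtracts the known noise mean m*p and rescales before averaging
```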
34/36 Binomial mechanism results. Gaussian mechanism (Dwork et al. '06): N(0, σ^2) is (ε, δ)-differentially private for ε = sqrt(2 log(1.25/δ)) / σ. Binomial mechanism: Bin(N, p) is (ε, δ)-differentially private for ε = sqrt(2 log(1.25/δ)) / σ + Õ(1/σ^2), where σ = sqrt(N p (1-p)). In the high-privacy regime (ε → 0, σ → ∞), the two mechanisms coincide.
35/36 cpSGD: rotated quantization + Binomial mechanism. Achieves the same MSE as the Gaussian mechanism with a communication cost of n d ( log_2(1 + d/(n ε^2)) + O(log log(nd/(εδ))) ). If n ≳ d, then log log(nd/(εδ)) bits per coordinate suffice.
36/36 Conclusions and future directions. Conclusions: distributed SGD reduces to distributed mean estimation; a minimax-optimal scheme for distributed mean estimation; communication-efficient differential privacy via the Binomial mechanism. Future directions: distributed SGD exploiting correlation between gradients over time; better privacy algorithms for distributed SGD; competitive / instance-optimal distributed mean estimation. Thank you!
Reference: Naman Agarwal, Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. arXiv:1805.10559 [stat.ML], May 2018.