ECS289: Scalable Machine Learning

1 ECS289: Scalable Machine Learning
Cho-Jui Hsieh
UC Davis
Nov 2, 2016

2 Outline
SGD-type algorithms for deep learning
Parallel SGD for deep learning

3 Perceptron
Prediction value for a training sample: prediction = f(w^T x_i + w_0)
x_i: data
w: parameters to be learned
f(z) = max(0, z): activation function
(Figure from Shlens and Toderici, ICIP 2016 slides)

4 Neural Network: A Single Hidden Layer
Prediction value for a training sample: prediction = h_2(W_2 h_1(W_1 x_i)) := f(x_i; W)
W = {W_i}: parameters to be learned
h_1, h_2: activation functions
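
To make the single-hidden-layer prediction concrete, here is a minimal NumPy sketch; the layer sizes and the choice of ReLU for h_1 and identity for h_2 are illustrative assumptions, not from the slide.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def predict(x, W1, W2):
    # f(x; W) = h_2(W_2 h_1(W_1 x)); here h_1 = ReLU, h_2 = identity (assumed)
    return W2 @ relu(W1 @ x)

# Example with assumed dimensions: 10 input features, 5 hidden units, 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 10))
W2 = rng.normal(size=(1, 5))
x_i = rng.normal(size=10)
print(predict(x_i, W1, W2))
```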

5 Neural Network: Deep Network
Minimize the following loss function:
min_W (1/n) \sum_{i=1}^n \ell(f(x_i; W), y_i),
where f(x_i; W) := W_k h_{k-1}(W_{k-1} \cdots W_2 h_1(W_1 x_i)): a k-layer neural network
\ell(\cdot): loss function
W can be sparse and customized, corresponding to the network structure
Famous examples:
Convolutional network (extracting image features)
Recurrent network & LSTM (long short-term memory): learning from sequence data

6 Convolutional Network
Convolution operator: an effective way to extract image features
(Figure from Shlens and Toderici, ICIP 2016 slides)
Before deep learning: convolutional kernels were designed by hand
Deep learning: features and classifiers are learned together
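
A minimal NumPy sketch of the 2D convolution operator (written as cross-correlation, as most deep learning libraries do); the Sobel kernel below is just an example of a hand-designed filter from the pre-deep-learning era.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D 'valid' convolution (implemented as cross-correlation, as in most DL libraries)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# A hand-designed edge-detection kernel (Sobel), as used before deep learning
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]], dtype=float)
image = np.random.default_rng(0).random((8, 8))
print(conv2d_valid(image, sobel_x).shape)  # (6, 6)
```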

7 Convolutional Network
A convolution layer
Convolutional network (e.g., AlexNet)

8 Stochastic Gradient Descent
Neural network objective function:
min_W (1/n) \sum_{i=1}^n \ell(f(x_i; W), y_i),
f(x_i; W) := W_k h_{k-1}(W_{k-1} \cdots W_2 h_1(W_1 x_i))
Stochastic gradient descent: for each iteration
1. Randomly sample a small subset S
2. Compute g = (1/|S|) \sum_{i \in S} \nabla_W \ell(f(x_i; W), y_i)
3. Update W \leftarrow W - \alpha_t g
Gradients are computed by the chain rule (back-propagation).
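
A minimal sketch of the three steps above in NumPy; grad_loss (the per-example gradient, produced by back-propagation in practice) and the data arrays X, y are assumed placeholders.

```python
import numpy as np

def sgd(W, X, y, grad_loss, step_size, n_iters, batch_size=32, seed=0):
    """Minibatch SGD: W <- W - alpha_t * (average gradient over a random subset S)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for t in range(n_iters):
        S = rng.choice(n, size=batch_size, replace=False)                   # 1. sample a subset
        g = np.mean([grad_loss(W, X[i], y[i]) for i in S], axis=0)          # 2. average gradient
        W = W - step_size * g                                               # 3. update
    return W
```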

9 Other SGD Variants
AdaGrad (2011): adaptive step size for each parameter
Update rule at the t-th iteration:
Compute g = \nabla \ell_i(x)
Estimate the second moment: G_{ii} \leftarrow G_{ii} + g_i^2 for all i
Parameter update: x_i \leftarrow x_i - (\eta / \sqrt{G_{ii}}) g_i for all i
Adam (Kingma and Ba, 2015): maintain both first and second moments
Update rule at the t-th iteration:
Compute g = \nabla \ell_i(x)
m \leftarrow \beta_1 m + (1 - \beta_1) g
\hat{m} = m / (1 - \beta_1^t) (estimate of first moment)
v \leftarrow \beta_2 v + (1 - \beta_2) g^2
\hat{v} = v / (1 - \beta_2^t) (estimate of second moment)
x \leftarrow x - \eta \hat{m} / (\sqrt{\hat{v}} + \epsilon)
(All the operations are element-wise)
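
NumPy sketches of one AdaGrad step and one Adam step as written above; the default hyperparameter values are the commonly used ones and are assumptions, not part of the slide.

```python
import numpy as np

def adagrad_step(x, g, G, eta=0.01, eps=1e-8):
    """One AdaGrad update; G accumulates squared gradients per coordinate."""
    G = G + g ** 2
    x = x - eta * g / (np.sqrt(G) + eps)
    return x, G

def adam_step(x, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m, v are running first/second moment estimates, t starts at 1."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```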

10 Other Optimization Algorithms?
Coordinate descent: can we update a subset of links at a time? (Yes)
But the gradient has to be computed by the chain rule (back-propagation)
Time for updating one coordinate ≈ time for updating all coordinates (?)
Second order methods with the exact Hessian?
Hessian-vector products can be computed efficiently by the chain rule
(Pearlmutter, Fast Exact Multiplication by the Hessian, 1994)
(Martens, Deep Learning via Hessian-free Optimization, ICML 2010): better performance on auto-encoders
(Kiros, Training Neural Networks with Stochastic Hessian-Free Optimization, 2013): better performance on conv-networks

11 Other Optimization Algorithms?
Second order methods with an approximate Hessian (quasi-Newton)
L-BFGS (quasi-Newton): uses gradients to approximate the Hessian
(Le et al., On Optimization Methods for Deep Learning, ICML 2011): shows better performance on auto-encoders
Other second order methods?
Subsampled gradient + subsampled Hessian?

12 Parallel SGD for Deep Learning

13 Synchronized Parallel: Minibatch SGD
GPU parallelism:
Convolution operator: matrix-matrix product
Easy to parallelize on a GPU
Each machine/core/GPU: compute g_k = (1/|S_k|) \sum_{i \in S_k} \nabla_W \ell(f(x_i; W), y_i)
Average the results (synchronization): g = (1/K) \sum_{k=1}^K g_k
Update the parameters: W \leftarrow W - \alpha_t g
Slow in distributed systems due to the synchronization step.
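
A sequential simulation of one synchronized minibatch-SGD step; the K worker loops would run in parallel (e.g., on K GPUs) in practice, and grad_loss is an assumed placeholder for the back-propagated per-example gradient.

```python
import numpy as np

def sync_minibatch_sgd_step(W, X, y, grad_loss, step_size, K=4, batch_per_worker=32, seed=0):
    """One synchronized step: each of K (simulated) workers averages gradients on its own
    minibatch, then the K results are averaged before a single parameter update."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    worker_grads = []
    for k in range(K):                                   # in practice these run in parallel
        S_k = rng.choice(n, size=batch_per_worker, replace=False)
        g_k = np.mean([grad_loss(W, X[i], y[i]) for i in S_k], axis=0)
        worker_grads.append(g_k)
    g = np.mean(worker_grads, axis=0)                    # synchronization (all-reduce)
    return W - step_size * g
```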

14 Asynchronous Parallel: Downpour-SGD
(Dean et al., Large Scale Distributed Deep Networks, NIPS 2012)
Parameter server: a centralized server that maintains/updates the latest parameters
Update rule for each machine, for t = 1, 2, ...:
1. Read the latest W from the parameter server
2. Compute the local SGD update: \Delta W = -\alpha_t (1/|S|) \sum_{i \in S} \nabla_W \ell(f(x_i; W), y_i)
3. Send \Delta W to the parameter server
Parameter server (usually multiple machines):
Maintains the unique global parameter W
Gets \Delta W from local workers and updates W \leftarrow W + \Delta W
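
A toy in-process sketch of the Downpour-SGD roles; in the real system the parameter server is sharded across multiple machines and many workers call read/push asynchronously, while grad_loss and the data arrays are assumed placeholders.

```python
import numpy as np

class ParameterServer:
    """Toy in-process parameter server: holds the global W and applies pushed updates."""
    def __init__(self, W0):
        self.W = W0.astype(float).copy()

    def read(self):
        return self.W.copy()

    def push(self, delta_W):
        self.W += delta_W                                   # W <- W + Delta W

def downpour_worker(server, X, y, grad_loss, step_size, n_steps, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(n_steps):
        W = server.read()                                   # 1. read latest W
        S = rng.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_loss(W, X[i], y[i]) for i in S], axis=0)
        server.push(-step_size * g)                         # 2-3. send Delta W = -alpha_t * g
```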

15 Downpour-SGD

16 Consensus ADMM
Summarized in (Boyd et al., 2011)
Reformulate the ERM problem (K: number of partitions/machines):
min_{w_1, ..., w_K} \sum_{k=1}^K f_k(w_k)  such that  w_k = z for all k
Augmented Lagrangian:
L({w_k, \lambda_k}, z) = \sum_{k=1}^K ( f_k(w_k) + \langle \lambda_k, w_k - z \rangle + (\beta/2) \|w_k - z\|^2 )
Main idea: iteratively update w_k, z, and \lambda_k.

17 ADMM
For each iteration t:
1. Update local models (in parallel): w_k^{t+1} = \arg\min_w f_k(w) + \langle \lambda_k^t, w \rangle + (\beta/2) \|w - z^t\|^2
2. Update the global model (synchronization): z^{t+1} = \arg\min_z \sum_{k=1}^K ( -\langle \lambda_k^t, z \rangle + (\beta/2) \|w_k^{t+1} - z\|^2 )
3. Update dual variables (in parallel): \lambda_k^{t+1} = \lambda_k^t + \beta (w_k^{t+1} - z^{t+1})
The global model update has a closed form solution: z^{t+1} = (1/K) \sum_{k=1}^K ( w_k^{t+1} + (1/\beta) \lambda_k^t )
Usually slower convergence, and a synchronization step is needed.
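
A small sketch of consensus ADMM where each local objective is an assumed quadratic, f_k(w) = (1/2)||A_k w - b_k||^2, so every step has a closed form; this only illustrates the three updates above, not the neural-network case.

```python
import numpy as np

def consensus_admm(A_list, b_list, beta=1.0, n_iters=100):
    """Consensus ADMM for K local least-squares problems f_k(w) = 0.5*||A_k w - b_k||^2."""
    K, d = len(A_list), A_list[0].shape[1]
    w = [np.zeros(d) for _ in range(K)]
    lam = [np.zeros(d) for _ in range(K)]
    z = np.zeros(d)
    for _ in range(n_iters):
        # 1. local updates (parallelizable): min_w f_k(w) + <lam_k, w> + (beta/2)||w - z||^2
        for k in range(K):
            A, b = A_list[k], b_list[k]
            w[k] = np.linalg.solve(A.T @ A + beta * np.eye(d),
                                   A.T @ b - lam[k] + beta * z)
        # 2. global update (synchronization), closed form: average of w_k + lam_k / beta
        z = np.mean([w[k] + lam[k] / beta for k in range(K)], axis=0)
        # 3. dual updates (parallelizable)
        for k in range(K):
            lam[k] = lam[k] + beta * (w[k] - z)
    return z

# Toy usage: split one regression problem across K = 3 machines
rng = np.random.default_rng(0)
A = rng.normal(size=(90, 5)); x_true = rng.normal(size=5); b = A @ x_true
A_list = np.split(A, 3); b_list = np.split(b, 3)
print(np.round(consensus_admm(A_list, b_list) - x_true, 4))  # should be approximately zero
```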

18 Elastic Averaging SGD (EASGD)
Proposed by (Zhang et al., Deep Learning with Elastic Averaging SGD, NIPS 2015)
A reformulation (simpler than ADMM):
min_{w_1, ..., w_K, z} \sum_{k=1}^K ( f_k(w_k) + (\rho/2) \|w_k - z\|^2 )
Gradient descent:
w_k^{t+1} = w_k^t - \eta ( \nabla f_k(w_k^t) + \rho (w_k^t - z^t) )
z^{t+1} = z^t - \eta \sum_{k=1}^K \rho (z^t - w_k^t) = (1 - \eta\rho K) z^t + \eta\rho K ( (1/K) \sum_{k=1}^K w_k^t )

19 Asynchronous Version
Each worker repeats the following operations (\alpha = \eta\rho):
w \leftarrow w_k
If t_k \% \tau == 0:
  w_k \leftarrow w_k - \alpha (w - z)
  z \leftarrow z + \alpha (w - z)
w_k \leftarrow w_k - \eta g_k
t_k \leftarrow t_k + 1
Convergence & convergence rate (?)
Can add a momentum term.
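
A single-process sketch of the asynchronous EASGD worker loop above; grad_f and data_stream are assumed placeholders, and in a real deployment z lives on a parameter server that each worker reads and writes asynchronously.

```python
import numpy as np

def easgd_worker(w_k, z, grad_f, data_stream, eta, rho, tau, n_steps):
    """Asynchronous EASGD worker loop (in-process sketch): z is the center variable."""
    alpha = eta * rho
    t_k = 0
    for _ in range(n_steps):
        w = w_k.copy()                         # snapshot of the local parameters
        if t_k % tau == 0:                     # every tau steps, communicate with the center
            w_k = w_k - alpha * (w - z)        # pull the local model toward z
            z = z + alpha * (w - z)            # push z toward the local model
        g_k = grad_f(w, next(data_stream))     # stochastic gradient on a local minibatch
        w_k = w_k - eta * g_k                  # local SGD step
        t_k += 1
    return w_k, z
```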

20 Elastic Averaging SGD: CIFAR-10
10 classes, 60,000 32x32 colour images in total.
(Figures from Zhang et al., Deep Learning with Elastic Averaging SGD)

21 Elastic Averaging SGD: ImageNet
1000 classes, > 1 million 221x221 colour images.
(Figures from Zhang et al., Deep Learning with Elastic Averaging SGD)

22 ADMM for Decoupling Layers
Proposed by (Taylor et al., Training Neural Networks Without Gradients: A Scalable ADMM Approach, ICML 2016)
Another way to rewrite a neural network with L layers:
min_{{W_i},{A_i},{Z_i}} \ell(Z_L, Y)
s.t. Z_i = W_i A_{i-1} for i = 1, ..., L
     A_i = h_i(Z_i) for i = 1, ..., L-1
where h_i(\cdot) are activation functions, and columns of A_i, Z_i correspond to training samples
ADMM-like formulation:
min_{{W_i},{A_i},{Z_i}} \ell(Z_L, Y) + \langle Z_L, \lambda \rangle + \sum_{i=1}^{L} \beta \|Z_i - W_i A_{i-1}\|_F^2 + \sum_{i=1}^{L-1} \beta \|A_i - h_i(Z_i)\|_F^2

23 ADMM for Decoupling Layers
Update of Z_i: element-wise problem, each one a single-variable optimization
Update of A_i: least squares problem, decomposed into columns (one per training sample)
Update of W_i: least squares problem:
Compute A_{i-1} A_{i-1}^T in a distributed way
W_i \leftarrow Z_i A_{i-1}^T (A_{i-1} A_{i-1}^T)^{-1}
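
A NumPy sketch of the W_i update above; the tiny ridge term reg is an added assumption for numerical stability, and the sizes in the toy check are illustrative.

```python
import numpy as np

def update_W(Z_i, A_prev, reg=1e-8):
    """Least-squares update W_i = Z_i A_{i-1}^T (A_{i-1} A_{i-1}^T)^{-1}.
    Columns of Z_i and A_prev are training samples; reg adds a small ridge for stability."""
    d_prev = A_prev.shape[0]
    gram = A_prev @ A_prev.T + reg * np.eye(d_prev)   # (A_{i-1} A_{i-1}^T), computable distributedly
    return (Z_i @ A_prev.T) @ np.linalg.inv(gram)

# Toy check with assumed sizes: 4 output units, 6 input units, 50 samples
rng = np.random.default_rng(0)
A_prev = rng.normal(size=(6, 50))
W_true = rng.normal(size=(4, 6))
Z_i = W_true @ A_prev
print(np.allclose(update_W(Z_i, A_prev), W_true, atol=1e-4))  # True
```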

24 Coming Up
Proposal discussions on Thursday (in my office)
