Deep Learning Lecture 2

1 Fall 2016 Machine Learning CMPSCI 689 Deep Learning Lecture 2 Sridhar Mahadevan Autonomous Learning Lab UMass Amherst COLLEGE

2 Outline of lecture. New types of units: convolutional units, rectified linear units. Faster gradient methods: SGD, AdaGrad, ADAM, batch normalization. Software packages: TensorFlow, Keras, Caffe, Theano, etc.

3 Representations in Animals (Wang et al., Nature Neuroscience, July 2014)

4 Convolutional Neural Networks for Atari (Yann LeCun, 1990; Mnih et al., Nature 2015)

5 Convolutional NNs. Before: input layer, hidden layer, output layer. Now: [figure]. Fei-Fei Li & Andrej Karpathy Lecture Jan 2015

6 Filters in CNNs. Convolutional Neural Networks are just Neural Networks, BUT: 1. Local connectivity. Nothing changes. These form a single [1 x 1 x depth] depth column in the output volume. Fei-Fei Li & Andrej Karpathy Lecture Jan 2015

7 Convolution. Replicate this column of hidden neurons across space, with some stride. 7x7 input; assume 3x3 connectivity, stride 1. Fei-Fei Li & Andrej Karpathy Lecture Jan 2015
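
As a rough illustration of the 7x7 input with 3x3 connectivity and stride 1, the sketch below computes the output width with the usual formula (N - F) / stride + 1 and slides a single filter over the input. It assumes no padding; the function names `conv_output_size` and `conv2d_valid` and the all-ones input and filter are illustrative choices, not part of the lecture.

```python
def conv_output_size(n, f, stride=1):
    """Output width for an n x n input and an f x f filter, assuming no padding."""
    if (n - f) % stride != 0:
        raise ValueError("filter does not tile the input with this stride")
    return (n - f) // stride + 1


def conv2d_valid(image, kernel, stride=1):
    """Slide one filter over a 2D list (cross-correlation, no kernel flip)."""
    n, f = len(image), len(kernel)
    out = conv_output_size(n, f, stride)
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            total = 0.0
            # Each output neuron sees only a local f x f patch of the input.
            for a in range(f):
                for b in range(f):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        result.append(row)
    return result


image = [[1.0] * 7 for _ in range(7)]   # hypothetical 7x7 input of ones
kernel = [[1.0] * 3 for _ in range(3)]  # hypothetical 3x3 filter of ones
feature_map = conv2d_valid(image, kernel, stride=1)
print(len(feature_map), len(feature_map[0]))  # 5 5
```

The same filter weights are reused at every position, which is what "replicating the column across space" amounts to.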

9 Filters learned for Breakout

10 1990 vs Year 2014. 1990: Learning to drive. 2014: GoogLeNet, VGG, MSRA. Layer legend: Convolution, Pooling, Softmax, Other. [Szegedy arxiv 2014] [Simonyan arxiv 2014] [He arxiv 2014]

11 Gradient of Sigmoid: $\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$. Vanishing gradient problem!
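
A small numeric sketch of why the sigmoid gradient vanishes: the derivative never exceeds 0.25, so a product of such factors across layers shrinks geometrically. The depth of 10 is an arbitrary choice, and the weight matrices that also enter backpropagation are ignored here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), which equals e^{-x} / (1 + e^{-x})^2
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)     # the derivative is largest at x = 0
depth = 10                   # hypothetical number of stacked sigmoid layers
upper_bound = peak ** depth  # best-case product of the per-layer factors
print(peak, upper_bound)
```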

12 Rectified Linear Units: $f(x) = \max(x, 0)$, $\hat{f}(x) = \log(e^x + 1)$
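
The two functions on this slide can be compared directly. The sketch below evaluates both at a few sample points; the overflow-safe rewriting of log(e^x + 1) and the sample points are implementation choices made here, not something the slide specifies.

```python
import math


def relu(x):
    return max(x, 0.0)


def softplus(x):
    # log(e^x + 1), rewritten as max(x, 0) + log(1 + e^{-|x|}) to avoid overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for x in (-5.0, 0.0, 5.0):
    print(x, relu(x), softplus(x))
```

The smooth version sits slightly above the rectifier everywhere, and the gap is largest at x = 0, where it equals log 2.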

13 RLUs for speech recognition Zeiler et al. Fig. 3. Frame accuracy as a function of time for a 4 hidden layer neural net trained with either logistic or ReLUs and using as optimizer either SGD or SGD with Adagrad (ADG).

14 Sparse propagation Glorot et al., 2011

15 Faster Gradient Methods. Standard method used in deep learning: stochastic gradient method (SGD). Faster variants: RMSPROP, ADAM, ADAGRAD, batch normalization. Riemannian gradient methods.

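16 Rates of Convergence. Let $\{x^\nu\} \subset \mathbb{R}^n$ and $\bar{x} \in \mathbb{R}^n$ be such that $x^\nu \to \bar{x}$. We say that $x^\nu \to \bar{x}$ at a linear rate if $\limsup_{\nu \to \infty} \frac{\|x^{\nu+1} - \bar{x}\|}{\|x^\nu - \bar{x}\|} < 1$. The convergence is said to be superlinear if this limsup is 0. The convergence is said to be quadratic if $\limsup_{\nu \to \infty} \frac{\|x^{\nu+1} - \bar{x}\|}{\|x^\nu - \bar{x}\|^2} < \infty$.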

17 Example. Let $\gamma \in (0, 1)$. $\{\gamma^n\}$ converges linearly to zero, but not superlinearly. $\{\gamma^{n^2}\}$ converges superlinearly to 0, but not quadratically. $\{\gamma^{2^n}\}$ converges quadratically to zero. Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence. $\gamma = \frac{1}{2}$ gives $\gamma^n = 2^{-n}$, $\gamma^{n^2} = 2^{-n^2}$, $\gamma^{2^n} = 2^{-2^n}$.
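
The three example sequences can be checked numerically against the definitions of linear, superlinear, and quadratic convergence. The sketch below uses a base of 1/2 as in the example; the name `gamma` for the base and the helper names are assumptions made for illustration.

```python
gamma = 0.5  # base of the three example sequences, all converging to 0


def linear(n):
    return gamma ** n


def superlinear(n):
    return gamma ** (n * n)


def quadratic(n):
    return gamma ** (2 ** n)


for n in range(1, 6):
    r_lin = linear(n + 1) / linear(n)               # constant, equal to gamma
    r_sup = superlinear(n + 1) / superlinear(n)     # gamma^(2n+1), tends to 0
    r_quad = quadratic(n + 1) / quadratic(n) ** 2   # constant, equal to 1
    print(n, r_lin, r_sup, r_quad)
```

The first ratio stays below 1 without shrinking (linear only), the second shrinks to 0 (superlinear), and the third is the ratio against the squared error, which stays bounded (quadratic).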

18 Trust Region vs. Line Search. Slow convergence of steepest descent methods. [Figure: contours of $f$ and of the model $m_k$, showing the trust region, the trust region step, and the line search direction.]

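19 Hessian. Many unconstrained optimization methods use the second-order derivative of a smooth function, $\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$. Examples: Newton's method, modified Newton methods.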

20 Rosenbrock Problem: $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$, $\nabla f(x, y) = \begin{pmatrix} -2(1 - x) - 400x(y - x^2) \\ 200(y - x^2) \end{pmatrix}$

21 Newton's Method Derivation. Quadratic approximation around the current point: $f_k(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k)$. Set the derivative to 0: $\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$. This gives us Newton's update rule: $x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$
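
To make the update rule concrete, here is a minimal sketch that applies it to the Rosenbrock function $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$. The starting point (-1.2, 1), the stopping tolerance, and the closed-form 2x2 solve are choices made for this illustration; no line search or Hessian modification is used, so this is the pure Newton iteration only.

```python
def rosenbrock_grad(x, y):
    """Gradient of (1 - x)^2 + 100 (y - x^2)^2."""
    return (-2.0 * (1.0 - x) - 400.0 * x * (y - x * x), 200.0 * (y - x * x))


def rosenbrock_hess(x, y):
    """Hessian entries (h_xx, h_xy, h_yy); the matrix is symmetric."""
    return (2.0 - 400.0 * (y - x * x) + 800.0 * x * x, -400.0 * x, 200.0)


def newton_rosenbrock(x, y, max_iter=50, tol=1e-10):
    """Iterate x_{k+1} = x_k - H^{-1} g until the gradient norm drops below tol."""
    for k in range(max_iter):
        gx, gy = rosenbrock_grad(x, y)
        if (gx * gx + gy * gy) ** 0.5 < tol:
            return x, y, k
        hxx, hxy, hyy = rosenbrock_hess(x, y)
        det = hxx * hyy - hxy * hxy
        # Newton step p = -H^{-1} g, using the explicit inverse of a 2x2 matrix.
        px = -(hyy * gx - hxy * gy) / det
        py = -(hxx * gy - hxy * gx) / det
        x, y = x + px, y + py
    return x, y, max_iter


x_star, y_star, iterations = newton_rosenbrock(-1.2, 1.0)
print(x_star, y_star, iterations)
```

Without a line search the objective is not guaranteed to decrease at every step; the iteration still reaches the minimizer (1, 1) from this starting point.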

22 Convergence of Newton's Method. Theorem 3.5. Suppose that $f$ is twice differentiable and that the Hessian $\nabla^2 f(x)$ is Lipschitz continuous (see (A.42)) in a neighborhood of a solution $x^*$ at which the sufficient conditions (Theorem 2.4) are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k$ is given by (3.30). Then (i) if the starting point $x_0$ is sufficiently close to $x^*$, the sequence of iterates converges to $x^*$; (ii) the rate of convergence of $\{x_k\}$ is quadratic; and (iii) the sequence of gradient norms $\{\|\nabla f_k\|\}$ converges quadratically to zero. Lipschitz continuity: $\|f(x_1) - f(x_0)\| \le L \|x_1 - x_0\|$ for all $x_0, x_1 \in N$. Proof: see Nocedal and Wright's book.

23 Newton's Method. $\min f(x) := x^2 + e^x$, with update $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$. [Table of iterates with columns $x$ and $f'(x)$.]
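
The one-dimensional Newton iteration for $f(x) = x^2 + e^x$ is short enough to run by hand. The sketch below prints each iterate together with $f'(x)$; the starting point 1.0 and the number of steps are assumptions.

```python
import math


def f_prime(x):
    return 2.0 * x + math.exp(x)


def f_second(x):
    return 2.0 + math.exp(x)


def newton_1d(x, steps=8):
    """Apply x <- x - f'(x) / f''(x) and record (x, f'(x)) at each iterate."""
    history = [(x, f_prime(x))]
    for _ in range(steps):
        x = x - f_prime(x) / f_second(x)
        history.append((x, f_prime(x)))
    return x, history


x_min, history = newton_1d(1.0)  # hypothetical starting point
for x, g in history:
    print(f"{x: .12f}  {g: .3e}")
```

Since $f''(x) > 2$ everywhere, the function is strictly convex and the step is always well defined; the printed derivative values shrink roughly quadratically once the iterate is close to the minimizer.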

24 Steepest Descent. [Table of iterates with columns $k$, $x_k$, $f(x_k)$, $f'(x_k)$, $s$.]
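
For comparison with Newton's method, the following sketch runs steepest descent on $f(x) = x^2 + e^x$ and records the same quantities as the table columns. Treating $s$ as a step size chosen by backtracking (halving until a sufficient-decrease test passes) is an assumption; the starting point, the constant `c`, and the tolerance are also illustrative.

```python
import math


def f(x):
    return x * x + math.exp(x)


def f_prime(x):
    return 2.0 * x + math.exp(x)


def steepest_descent(x, max_iter=100, c=1e-4, tol=1e-10):
    """Gradient descent with a backtracking (Armijo) step size s."""
    rows = []
    for k in range(max_iter):
        g = f_prime(x)
        if abs(g) < tol:
            break
        s = 1.0
        # Halve s until f decreases by at least c * s * g^2.
        while f(x - s * g) > f(x) - c * s * g * g:
            s *= 0.5
        rows.append((k, x, f(x), g, s))
        x = x - s * g
    return x, rows


x_min, rows = steepest_descent(1.0)  # hypothetical starting point
for k, xk, fk, gk, s in rows[:5]:
    print(k, round(xk, 6), round(fk, 6), round(gk, 6), s)
print(len(rows), x_min)
```

The iteration reaches the same minimizer, but it needs noticeably more iterations than the Newton update because the error shrinks by a roughly constant factor per step.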

25 ADAM (Kingma and Ba, ICLR 2015)
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$ we denote $\beta_1$ and $\beta_2$ to the power $t$.
Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
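
The update steps of Algorithm 1 translate almost line for line into code. The sketch below is a plain-Python version run on a deterministic toy quadratic rather than a stochastic objective; the toy function, the larger step size of 0.05, and the fixed iteration count are assumptions for illustration.

```python
import math


def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run a fixed number of Adam updates on a list of parameters."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first moment estimate
    v = [0.0] * len(theta)  # second raw moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def toy_loss(theta):
    # Hypothetical objective with minimizer (3, -1).
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2


def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


start = [0.0, 0.0]
first = adam(toy_grad, start, alpha=0.001, steps=1)
final = adam(toy_grad, start, alpha=0.05, steps=2000)
print(first, final, toy_loss(final))
```

After the very first update each parameter has moved by almost exactly the step size, regardless of the gradient's magnitude, because the bias-corrected ratio reduces to the sign of the gradient.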

26 Figure 3: Convolutional neural networks training cost. (left) T

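27 (Ioffe and Szegedy, ICML 2015)
Input: Values of $x$ over a mini-batch: $B = \{x_{1 \ldots m}\}$; Parameters to be learned: $\gamma, \beta$
Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$
  $\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$ // mini-batch mean
  $\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ // mini-batch variance
  $\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ // normalize
  $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$ // scale and shift
Algorithm 1: Batch Normalizing Transform, applied to activation $x$ over a mini-batch.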
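
The four steps of the Batch Normalizing Transform can be sketched for a single scalar activation over one mini-batch. This covers only the training-time forward pass; the value of `eps`, the sample batch, and the function name are assumptions, and the running statistics used at inference time are not shown.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m                                       # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m               # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # normalize
    return [gamma * xh + beta for xh in x_hat]             # scale and shift


batch = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations
ys = batch_norm(batch, gamma=2.0, beta=5.0)
out_mean = sum(ys) / len(ys)
out_var = sum((y - out_mean) ** 2 for y in ys) / len(ys)
print(ys, out_mean, out_var)
```

The output mean equals the shift parameter and the output variance is close to the square of the scale parameter, slightly reduced by `eps`.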

28 Figure 1. (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles, (b) without BN and (c) with BN. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.

29 Mirror Descent = Natural Gradient! (Thomas, Dabney, Mahadevan, Giguere, NIPS 2013). Natural gradient (Amari): $x_{k+1} = x_k - \alpha_k G_k^{-1} \nabla f(x_k)$. Mirror descent (Nemirovski and Yudin): the update at step $k$ is $x_{k+1} = \nabla \psi_k^* (\nabla \psi_k(x_k) - \alpha_k \nabla f(x_k))$, with $\psi_k$ continuously differentiable and strongly convex. We show these 30-year-old techniques are closely related!

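30 Mirror Descent (Nemirovski and Yudin, 1980s). [Figure: a point $x_t$ in the primal space $X \subset \mathbb{R}^n$ is mapped into the dual space, a gradient step is taken there, and the result is mapped back to the primal space to give $x_{t+1}$.] $x_{k+1} = \nabla \psi_k^* (\nabla \psi_k(x_k) - \alpha_k \nabla f(x_k))$, with $\psi_k$ continuously differentiable and strongly convex.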

31 Mirror Descent unifies many methods: regular gradient; natural gradient; exponentiated gradient, Winnow, multiplicative methods; sparse regression methods; boosting; many online learning algorithms.
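
Two of the listed special cases are easy to check with the mirror descent update $x_{k+1} = \nabla \psi^* (\nabla \psi(x_k) - \alpha \nabla f(x_k))$. The sketch below assumes a coordinate-wise separable $\psi$, so the mirror maps are scalar functions applied per coordinate: $\psi(x) = \frac{1}{2}\|x\|^2$ recovers the regular gradient step, and the unnormalized negative entropy $\psi(x) = \sum_i (x_i \log x_i - x_i)$ gives a multiplicative (exponentiated gradient) step. The function names and sample vectors are illustrative.

```python
import math


def mirror_step(x, grad, alpha, grad_psi, grad_psi_conj):
    """One mirror descent step for a coordinate-wise separable psi."""
    # Map to the dual space, take the gradient step there, then map back.
    dual = [grad_psi(xi) - alpha * gi for xi, gi in zip(x, grad)]
    return [grad_psi_conj(d) for d in dual]


def identity(z):
    return z


x = [0.2, 0.3, 0.5]  # hypothetical current point (positive entries)
g = [1.0, -1.0, 0.0]  # hypothetical gradient at x
alpha = 0.1

# psi(x) = 0.5 * ||x||^2: both mirror maps are the identity.
gd = mirror_step(x, g, alpha, identity, identity)

# Unnormalized negative entropy: grad psi = log, grad psi* = exp.
eg = mirror_step(x, g, alpha, math.log, math.exp)

print(gd)
print(eg)
```

With the entropy choice each coordinate is multiplied by $e^{-\alpha g_i}$, so positive entries stay positive without any explicit projection.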

32 Natural Neural Networks. [Figure: the parameter update from $\theta_t$ to $\theta_{t+1}$, with maps involving $F(\theta_t)^{\frac{1}{2}}$.] Proposed by Google DeepMind; builds on our work on the equivalence of mirror descent and natural gradient methods.

33 GPU Machine I Built at Home: 20-core Intel Xeon E v4; three GPUs: two Nvidia 1080 and one Nvidia Titan X

34 CIFAR 10 Image Dataset

35 Keras program for MNIST MLP network

36 Protocol buffers (Caffe model format): com/protocol-buffers/docs/overview

37 Specifying LeNet in Caffe

Data layer:

    name: "LeNet"
    layer {
      name: "mnist"
      type: "Data"
      data_param {
        source: "mnist_train_lmdb"
        backend: LMDB
        batch_size: 64
        scale: 0.00390625
      }
      top: "data"
      top: "label"
    }

Convolution layer:

    layer {
      name: "conv1"
      type: "Convolution"
      param { lr_mult: 1 }
      param { lr_mult: 2 }
      convolution_param {
        num_output: 20
        kernel_size: 5
        stride: 1
        weight_filler { type: "xavier" }
        bias_filler { type: "constant" }
      }
      bottom: "data"
      top: "conv1"
    }

38 Max Pooling and RLU Layer

Pooling layer:

    layer {
      name: "pool1"
      type: "Pooling"
      pooling_param {
        kernel_size: 2
        stride: 2
        pool: MAX
      }
      bottom: "conv1"
      top: "pool1"
    }

RLU layer:

    layer {
      name: "relu1"
      type: "ReLU"
      bottom: "ip1"
      top: "ip1"
    }

39 Loss Layer

    layer {
      name: "loss"
      type: "SoftmaxWithLoss"
      bottom: "ip2"
      bottom: "label"
    }

40 MNIST solver in Caffe

    # The train/test net protocol buffer definition
    net: "examples/mnist/lenet_train_test.prototxt"
    # test_iter specifies how many forward passes the test should carry out.
    # In the case of MNIST, we have test batch size 100 and 100 test iterations,
    # covering the full 10,000 testing images.
    test_iter: 100
    # Carry out testing every 500 training iterations.
    test_interval: 500
    # The base learning rate, momentum and the weight decay of the network.
    base_lr: 0.01
    momentum: 0.9
    weight_decay: 0.0005
    # The learning rate policy
    lr_policy: "inv"
    gamma: 0.0001
    power: 0.75
    # Display every 100 iterations
    display: 100
    # The maximum number of iterations
    max_iter: 10000
    # snapshot intermediate results
    snapshot: 5000
    snapshot_prefix: "examples/mnist/lenet"
    # solver mode: CPU or GPU
    solver_mode: GPU

41 Running LeNet on Caffe

    I :20: layer_factory.hpp:75] Creating layer mnist
    I :20: net.cpp:110] Creating Layer mnist
    I :20: net.cpp:432] mnist -> data
    I :20: net.cpp:432] mnist -> label
    I :20: db_lmdb.cpp:22] Opened lmdb examples/mnist/mnist_test_lmdb
    I :20: data_layer.cpp:44] output data size: 100,1,28,28
    I :20: net.cpp:155] Setting up mnist
    I :20: net.cpp:163] Top shape: (78400)
    I :20: net.cpp:163] Top shape: 100 (100)
    I :20: layer_factory.hpp:75] Creating layer label_mnist_1_split
    I :20: net.cpp:110] Creating Layer label_mnist_1_split
    I :20: net.cpp:476] label_mnist_1_split <- label
    I :20: net.cpp:432] label_mnist_1_split -> label_mnist_1_split_0
    I :20: net.cpp:432] label_mnist_1_split -> label_mnist_1_split_1
    I :20: solver.cpp:266] Learning Rate Policy: inv
    I :20: solver.cpp:310] Iteration 0, Testing net (#0)
    I :20: solver.cpp:359] Test net output #0: accuracy =
    I :20: solver.cpp:359] Test net output #1: loss = (* 1 = loss)
    I :20: solver.cpp:222] Iteration 0, loss =
    I :20: solver.cpp:238] Train net output #0: loss = (* 1 = loss)
    I :20: solver.cpp:291] Iteration 10000, loss =
    I :20: solver.cpp:310] Iteration 10000, Testing net (#0)
    I :20: solver.cpp:359] Test net output #0: accuracy =
    I :20: solver.cpp:359] Test net output #1: loss = (* 1 = loss)
    I :20: solver.cpp:296] Optimization Done.
    I :20: caffe.cpp:184] Optimization Done.

    real 0m34.403s
    user 0m27.744s
    sys  0m25.308s

42 Most demanding computationally, but fewest parameters. Least demanding computationally, but most parameters.

45
GPUs | Batch size   | Cross-entropy | Top-1 error | Time   | Speedup
1    | (128, 128)   |               | %           | 98.05h | 1x
2    | (256, 256)   |               | %           | 50.24h | 1.95x
2    | (256, 128)   |               | %           | 50.90h | 1.93x
4    | (512, 512)   |               | %           | 26.20h | 3.74x
4    | (512, 128)   |               | %           | 26.78h | 3.66x
8    | (1024, 1024) |               | %           | 15.68h | 6.25x
8    | (1024, 128)  |               | %           | 15.91h | 6.16x
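
The speedup column follows from the training times: each entry is the single-GPU time divided by the multi-GPU time. The sketch below recomputes it from the times in the table and also derives a per-GPU efficiency, which is an extra quantity added here for illustration and not part of the table.

```python
# (number of GPUs, batch size pair, training time in hours), taken from the table
runs = [
    (1, (128, 128), 98.05),
    (2, (256, 256), 50.24),
    (2, (256, 128), 50.90),
    (4, (512, 512), 26.20),
    (4, (512, 128), 26.78),
    (8, (1024, 1024), 15.68),
    (8, (1024, 128), 15.91),
]

baseline_hours = runs[0][2]  # single-GPU training time

speedups = []
for gpus, batch, hours in runs:
    speedup = baseline_hours / hours
    efficiency = speedup / gpus  # fraction of ideal linear scaling
    speedups.append(round(speedup, 2))
    print(gpus, batch, f"{speedup:.2f}x", f"{efficiency:.0%}")
```

Efficiency falls as GPUs are added, which is the usual sign of growing communication and synchronization overhead.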
