Fall 2016 Machine Learning CMPSCI 689. Deep Learning Lecture 2. Sridhar Mahadevan, Autonomous Learning Lab, UMass Amherst.
Outline of lecture: new types of units (convolutional units, rectified linear units); faster gradient methods (SGD, AdaGrad, ADAM, batch normalization); software packages (TensorFlow, Keras, Caffe, Theano, etc.)
Representations in Animals (Wang et al., Nature Neuroscience, July 2014)
Convolutional Neural Networks for Atari (Yann LeCun, 1990; Mnih et al., Nature 2015)
Convolutional NNs. Before: input layer, hidden layer, output layer (fully connected). Now: layers arranged as 3D volumes of neurons (figure). (Slide credit: Fei-Fei Li & Andrej Karpathy, CS231n.)
Filters in CNNs. Convolutional Neural Networks are just Neural Networks, BUT: 1. Local connectivity: each hidden unit sees only a local patch of the 32x32x3 input volume; otherwise nothing changes. The hidden units at one spatial location form a single [1 x 1 x depth] column in the output volume.
Convolution. Replicate this column of hidden neurons across space, with some stride. Example: 7x7 input, 3x3 connectivity, stride 1.
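To make the replication concrete, here is a minimal sketch (the helper name is mine, not from the slides) of the output-size arithmetic for a convolutional layer; for the 7x7 input with 3x3 connectivity and stride 1 above, it gives a 5x5 grid of hidden columns.

def conv_output_size(input_size, filter_size, stride=1, pad=0):
    """Number of spatial positions a filter visits along one dimension."""
    return (input_size + 2 * pad - filter_size) // stride + 1

# 7x7 input, 3x3 local connectivity, stride 1 -> 5x5 output grid
print(conv_output_size(7, 3, stride=1))   # 5
# stride 2 -> 3x3 output grid
print(conv_output_size(7, 3, stride=2))   # 3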
Filters learned for Breakout
1990 vs. 2015. Year 2014: GoogLeNet [Szegedy, arXiv 2014], VGG [Simonyan, arXiv 2014], MSRA [He, arXiv 2014]; learning to drive. Layer types shown in the architecture diagrams: Convolution, Pooling, Softmax, Other.
Gradient of Sigmoid: $\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$. The gradient is at most 0.25 (at $x = 0$) and decays toward zero for large $|x|$: the vanishing gradient problem!
Rectified Linear Units: $f(x) = \max(x, 0)$, with smooth (softplus) approximation $\hat{f}(x) = \log(e^x + 1)$.
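A minimal NumPy sketch of these activations (function names are mine): the sigmoid gradient never exceeds 0.25 and vanishes for large $|x|$, while the ReLU and its softplus approximation keep a unit gradient on the positive side.

import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)            # peaks at 0.25, vanishes for large |x|

def relu(x):
    return np.maximum(x, 0.0)       # f(x) = max(x, 0)

def softplus(x):
    return np.log1p(np.exp(x))      # smooth approximation log(e^x + 1)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(x))              # ~[4.5e-05, 0.197, 0.25, 0.197, 4.5e-05]
print(relu(x), softplus(x))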
ReLUs for speech recognition (Zeiler et al.). Fig. 3: Frame accuracy as a function of training time for a 4-hidden-layer neural net trained with either logistic or ReLU units, using as optimizer either SGD or SGD with AdaGrad (ADG).
Sparse propagation Glorot et al., 2011
Faster Gradient Methods. Standard method used in deep learning: the stochastic gradient method (SGD). Faster variants: RMSProp, ADAM, AdaGrad, batch normalization, Riemannian gradient methods.
Rates of Convergence. Let $\{x_k\} \subset \mathbb{R}^n$ and $x^* \in \mathbb{R}^n$ be such that $x_k \to x^*$. We say that $x_k \to x^*$ at a linear rate if $\limsup_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} < 1$. The convergence is said to be superlinear if this limsup is 0. The convergence is said to be quadratic if $\limsup_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} < \infty$.
Example. Let $\beta \in (0, 1)$. $\{\beta^n\}$ converges linearly to zero, but not superlinearly. $\{\beta^{n^2}\}$ converges superlinearly to 0, but not quadratically. $\{\beta^{2^n}\}$ converges quadratically to zero. Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence. $\beta = \frac{1}{2}$ gives $\beta^n = 2^{-n}$, $\beta^{n^2} = 2^{-n^2}$, $\beta^{2^n} = 2^{-2^n}$.
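A quick numerical check of the three rates (a throwaway script, not from the slides): with $\beta = 1/2$ the linear sequence gains about one bit of accuracy per step, the superlinear one gains $n$ bits at step $n$, and the quadratic one doubles its bits each step.

import numpy as np

beta = 0.5
for n in range(1, 7):
    linear      = beta ** n          # 2^-n
    superlinear = beta ** (n ** 2)   # 2^-(n^2)
    quadratic   = beta ** (2 ** n)   # 2^-(2^n)
    print(n, linear, superlinear, quadratic)
# The linear sequence shrinks by a fixed factor 1/2 per step; the superlinear
# ratio goes to 0; the quadratic error is roughly the square of the previous one.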
Trust Region vs. Line Search. Motivation: slow convergence of steepest descent methods. (Figure: line search direction versus trust region step, shown against the contours of the model $m_k$ and of $f$.)
Hessian. Many unconstrained optimization methods use the second-order derivative of a smooth function: Newton's method, modified Newton methods.
$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$
Rosenbrock Problem: $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$, with gradient $\nabla f(x, y) = \begin{pmatrix} -2(1 - x) - 400x(y - x^2) \\ 200(y - x^2) \end{pmatrix}$ and Hessian $\nabla^2 f(x, y) = \begin{pmatrix} 2 - 400(y - 3x^2) & -400x \\ -400x & 200 \end{pmatrix}$. (Surface plot over $x, y \in [-1, 1]$.)
Newton's Method Derivation. Quadratic approximation around the current point: $f_k(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)$. Set the derivative to 0: $\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$. This gives us Newton's update rule: $x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$.
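A minimal sketch of this update applied to the Rosenbrock function from the earlier slide, using its closed-form gradient and Hessian (pure Newton, no line search or trust region; variable names are mine).

import numpy as np

def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad(x, y):
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def hessian(x, y):
    return np.array([[2 - 400 * (y - 3 * x ** 2), -400 * x],
                     [-400 * x,                    200.0]])

p = np.array([-1.0, 1.0])                 # a common starting point
for k in range(10):
    g = grad(*p)
    H = hessian(*p)
    p = p - np.linalg.solve(H, g)         # x_{k+1} = x_k - H^{-1} grad
    print(k, p, rosenbrock(*p))
# converges to the minimizer (1, 1), where f = 0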
Convergence of Newton's Method. Theorem 3.5. Suppose that $f$ is twice differentiable and that the Hessian $\nabla^2 f(x)$ is Lipschitz continuous (see (A.42)) in a neighborhood of a solution $x^*$ at which the sufficient conditions (Theorem 2.4) are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k$ is given by (3.30), the Newton step. Then (i) if the starting point $x_0$ is sufficiently close to $x^*$, the sequence of iterates converges to $x^*$; (ii) the rate of convergence of $\{x_k\}$ is quadratic; and (iii) the sequence of gradient norms $\{\|\nabla f_k\|\}$ converges quadratically to zero. Lipschitz continuity: $\|f(x_1) - f(x_0)\| \le L \|x_1 - x_0\|$ for all $x_0, x_1 \in \mathcal{N}$. Proof: see Nocedal and Wright's book.
Newton's Method. $\min f(x) := x^2 + e^x$, with update $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$. Iterates (the minimizer is $x^* \approx -0.3517337$):
x_k            f'(x_k)
1              4.7182818
0              1
-1/3           0.0498646
-0.3516893     0.00012
-0.3517337     0.00000000064
Steepest Descent on the same function $f(x) = x^2 + e^x$ (step size reduced by halving, $s$ = number of reductions):
k    x_k          f(x_k)      f'(x_k)      s
0    1            3.7182818   4.7182818    0
1    0            1           1            0
2    -0.5         0.8565307   -0.3934693   1
3    -0.25        0.8413008   0.2788008    2
4    -0.375       0.8279143   -0.0627107   3
5    -0.34075     0.8273473   0.0297367    5
6    -0.356375    0.8272131   -0.01254     6
7    -0.3485625   0.8271976   0.0085768    7
8    -0.3524688   0.8271848   -0.001987    8
9    -0.3514922   0.8271841   0.0006528    10
10   -0.3517364   0.827184    -0.0000072   12
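A throwaway script (my own, not from the slides) reproducing the flavor of these two tables on $f(x) = x^2 + e^x$: Newton reaches $x^* \approx -0.3517$ in a handful of iterations, while steepest descent with a simple halving (backtracking) step size creeps in at a roughly linear rate. The exact step-size rule behind the $s$ column is not shown on the slide, so the halving scheme here is an assumption.

import numpy as np

f   = lambda x: x ** 2 + np.exp(x)
df  = lambda x: 2 * x + np.exp(x)      # f'(x)
d2f = lambda x: 2 + np.exp(x)          # f''(x)

# Newton's method: x_{k+1} = x_k - f'(x_k)/f''(x_k)
x = 1.0
for k in range(5):
    x -= df(x) / d2f(x)
    print("Newton", k, x, df(x))       # reaches x* ~ -0.3517 in ~4 steps

# Steepest descent with a halving (backtracking) step size
x = 1.0
for k in range(12):
    s, d = 1.0, -np.sign(df(x))        # unit descent direction
    while f(x + s * d) >= f(x):        # halve the step until f decreases
        s *= 0.5
    x += s * d
    print("SD", k, x, f(x))            # slow, roughly linear progress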
ADAM (Kingma and Ba, ICLR 2015). Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$ we denote $\beta_1$ and $\beta_2$ to the power $t$.
Require: $\alpha$: stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: exponential decay rates for the moment estimates
Require: $f(\theta)$: stochastic objective function with parameters $\theta$
Require: $\theta_0$: initial parameter vector
$m_0 \leftarrow 0$ (initialize 1st moment vector)
$v_0 \leftarrow 0$ (initialize 2nd moment vector)
$t \leftarrow 0$ (initialize timestep)
while $\theta_t$ not converged do
  $t \leftarrow t + 1$
  $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (get gradients w.r.t. stochastic objective at timestep $t$)
  $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (update biased first moment estimate)
  $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (update biased second raw moment estimate)
  $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (compute bias-corrected first moment estimate)
  $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (compute bias-corrected second raw moment estimate)
  $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update parameters)
end while
return $\theta_t$ (resulting parameters)
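A minimal NumPy sketch of this update for a single parameter vector (my own translation of the pseudocode above, using the stated defaults; `grad_fn` is a placeholder name for the stochastic gradient oracle, and the toy problem below is mine).

import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Adam on one parameter vector: biased first/second moment estimates
    plus the bias-corrected update at every step."""
    theta = theta0.copy()
    m = np.zeros_like(theta)                   # 1st moment vector
    v = np.zeros_like(theta)                   # 2nd moment vector
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                     # stochastic gradient at theta
        m = beta1 * m + (1 - beta1) * g        # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# toy usage: minimize ||theta - 3||^2 with noisy gradients
rng = np.random.default_rng(0)
grad_fn = lambda th: 2 * (th - 3.0) + 0.1 * rng.standard_normal(th.shape)
print(adam(grad_fn, np.zeros(5), alpha=0.01, num_steps=2000))   # each entry ~ 3.0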
Figure 3 (Kingma and Ba): Convolutional neural networks training cost.
(Ioffe and Szegedy, ICML 2015)
Input: values of $x$ over a mini-batch: $\mathcal{B} = \{x_{1 \ldots m}\}$; parameters to be learned: $\gamma, \beta$
Output: $\{y_i = \text{BN}_{\gamma,\beta}(x_i)\}$
$\mu_\mathcal{B} \leftarrow \frac{1}{m} \sum_{i=1}^m x_i$   // mini-batch mean
$\sigma_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^m (x_i - \mu_\mathcal{B})^2$   // mini-batch variance
$\hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$   // normalize
$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \text{BN}_{\gamma,\beta}(x_i)$   // scale and shift
Algorithm 1: Batch Normalizing Transform, applied to activation $x$ over a mini-batch.
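A NumPy sketch of the forward transform above for a mini-batch of activations (training mode only; the function name, epsilon value, and toy data are my choices, and the running statistics used at test time are omitted).

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch.
    x: (m, d) activations; gamma, beta: (d,) learned scale and shift."""
    mu = x.mean(axis=0)                       # mini-batch mean
    var = x.var(axis=0)                       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

# toy usage
x = np.random.randn(64, 10) * 5.0 + 2.0       # poorly scaled activations
y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per unit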
Figure 1. (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as the {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.
Mirror Descent = Natural Gradient! (Thomas, Dabney, Mahadevan, Giguere, NIPS 2013.) Natural gradient (Amari): $x_{k+1} = x_k - \alpha_k G_k^{-1} \nabla f(x_k)$, the natural gradient descent update at step $k$ with metric $G_k$. Mirror descent (Nemirovski and Yudin): $x_{k+1} = \nabla \psi_k^* \left( \nabla \psi_k(x_k) - \alpha_k \nabla f(x_k) \right)$, where $\psi_k$ is a continuously differentiable and strongly convex distance-generating function. We show these 30-year-old techniques are closely related!
Mirror Descent (Nemirovski and Yudin, 1980s). (Figure: the current point $x_t$ in the primal space $X \subseteq \mathbb{R}^n$ is mapped to the dual space by $\nabla \psi$, a gradient step is taken there, and the result is mapped back by $\nabla \psi^*$ and projected onto $X$ to give $x_{t+1}$.) Update: $x_{k+1} = \nabla \psi_k^* \left( \nabla \psi_k(x_k) - \alpha_k \nabla f(x_k) \right)$, with $\psi_k$ continuously differentiable and strongly convex.
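A small sketch (my own, on the probability simplex) of the update above with the negative-entropy mirror map, whose gradient maps are, up to constants, log and softmax; this instance is the exponentiated-gradient method listed on the next slide. Names and the toy problem are assumptions.

import numpy as np

def mirror_descent_simplex(grad_fn, x0, alpha=0.1, num_steps=100):
    """Mirror descent with the negative-entropy mirror map on the simplex:
    dual step  theta = log(x) - alpha * grad,
    primal map x <- exp(theta) / sum(exp(theta))  (exponentiated gradient)."""
    x = x0.copy()
    for _ in range(num_steps):
        g = grad_fn(x)
        theta = np.log(x) - alpha * g        # gradient step in the dual space
        w = np.exp(theta - theta.max())      # map back (stabilized softmax)
        x = w / w.sum()
    return x

# toy usage: minimize a linear loss c.x over the probability simplex
c = np.array([3.0, 1.0, 2.0])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, alpha=0.5, num_steps=200)
print(x.round(4))    # mass concentrates on the smallest-cost coordinate (index 1)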
Mirror Descent unifies many methods Regular gradient Natural gradient Exponentiated gradient, Winnow, multiplicative methods Sparse regression methods Boosting Many online learning algorithms
Natural Neural Networks. (Diagram relating parameters $\theta_t$, $\theta_{t+1}$ through the inverse square root of the Fisher information, $F(\theta_t)^{-1/2}$.) Proposed by Google DeepMind; builds on our work on the equivalence of mirror descent and natural gradient methods.
GPU Machine I Built at Home: 20-core Intel Xeon E5-2697 v4; three GPUs: 2x Nvidia GTX 1080 and 1x Nvidia Titan X.
CIFAR-10 Image Dataset
Keras program for MNIST MLP network
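The program itself did not survive extraction; below is a minimal sketch of such a Keras MLP in the Keras 1.x-era Sequential API (layer sizes, dropout rates, and epoch count are my choices, not necessarily the ones from the slide; the argument nb_epoch was later renamed epochs).

import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import np_utils

# load and flatten MNIST, scale pixels to [0, 1]
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_test = X_test.reshape(10000, 784).astype('float32') / 255
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

# two ReLU hidden layers with dropout, softmax output
model = Sequential()
model.add(Dense(512, input_shape=(784,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=128, nb_epoch=20,
          validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test accuracy:', score[1])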
Specifying LeNet in Caffe (protocol buffers: https://developers.google.com/protocol-buffers/docs/overview)

Data layer:
name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "mnist_train_lmdb"
    backend: LMDB
    batch_size: 64
  }
}

Convolution layer:
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}
Max Pooling and ReLU Layers

Max pooling layer:
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}

ReLU layer:
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
Loss Layer

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
}
MNIST solver in Caffe # The train/test net protocol buffer definition net: "examples/mnist/lenet_train_test.prototxt" # test_iter specifies how many forward passes the test should carry out. # In the case of MNIST, we have test batch size 100 and 100 test iterations, # covering the full 10,000 testing images. test_iter: 100 # Carry out testing every 500 training iterations. test_interval: 500 # The base learning rate, momentum and the weight decay of the network. base_lr: 0.01 momentum: 0.9 weight_decay: 0.0005 # The learning rate policy lr_policy: "inv" gamma: 0.0001 power: 0.75 # Display every 100 iterations display: 100 # The maximum number of iterations max_iter: 10000 # snapshot intermediate results snapshot: 5000 snapshot_prefix: "examples/mnist/lenet" # solver mode: CPU or GPU solver_mode: GPU
Running LeNet on Caffe I0917 19:20:26.375691 26575 layer_factory.hpp:75] Creating layer mnist I0917 19:20:26.375877 26575 net.cpp:110] Creating Layer mnist I0917 19:20:26.375903 26575 net.cpp:432] mnist -> data I0917 19:20:26.375928 26575 net.cpp:432] mnist -> label I0917 19:20:26.378226 26581 db_lmdb.cpp:22] Opened lmdb examples/mnist/mnist_test_lmdb I0917 19:20:26.378762 26575 data_layer.cpp:44] output data size: 100,1,28,28 I0917 19:20:26.380553 26575 net.cpp:155] Setting up mnist I0917 19:20:26.380594 26575 net.cpp:163] Top shape: 100 1 28 28 (78400) I0917 19:20:26.380614 26575 net.cpp:163] Top shape: 100 (100) I0917 19:20:26.380635 26575 layer_factory.hpp:75] Creating layer label_mnist_1_split I0917 19:20:26.380668 26575 net.cpp:110] Creating Layer label_mnist_1_split I0917 19:20:26.380686 26575 net.cpp:476] label_mnist_1_split <- label I0917 19:20:26.380707 26575 net.cpp:432] label_mnist_1_split -> label_mnist_1_split_0 I0917 19:20:26.380738 26575 net.cpp:432] label_mnist_1_split -> label_mnist_1_split_1 I0917 19:20:26.405414 26575 solver.cpp:266] Learning Rate Policy: inv I0917 19:20:26.406183 26575 solver.cpp:310] Iteration 0, Testing net (#0) I0917 19:20:26.601101 26575 solver.cpp:359] Test net output #0: accuracy = 0.0777 I0917 19:20:26.601132 26575 solver.cpp:359] Test net output #1: loss = 2.3651 (* 1 = 2.3651 loss) I0917 19:20:26.604207 26575 solver.cpp:222] Iteration 0, loss = 2.34867 I0917 19:20:26.604233 26575 solver.cpp:238] Train net output #0: loss = 2.34867 (* 1 = 2.34867 loss) I0917 19:20:59.081962 26575 solver.cpp:291] Iteration 10000, loss = 0.00325083 I0917 19:20:59.081985 26575 solver.cpp:310] Iteration 10000, Testing net (#0) I0917 19:20:59.215575 26575 solver.cpp:359] Test net output #0: accuracy = 0.9904 I0917 19:20:59.215605 26575 solver.cpp:359] Test net output #1: loss = 0.0291382 (* 1 = 0.0291382 loss) I0917 19:20:59.215615 26575 solver.cpp:296] Optimization Done. I0917 19:20:59.215622 26575 caffe.cpp:184] Optimization Done. real user sys 0m34.403s 0m27.744s 0m25.308s
Convolutional layers: most demanding computationally, but fewest parameters. Fully connected layers: least demanding computationally, but most parameters.
GPUs   Batch size     Cross-entropy   Top-1 error   Time     Speedup
1      (128, 128)     2.611           42.33%        98.05h   1x
2      (256, 256)     2.624           42.63%        50.24h   1.95x
2      (256, 128)     2.614           42.27%        50.90h   1.93x
4      (512, 512)     2.637           42.59%        26.20h   3.74x
4      (512, 128)     2.625           42.44%        26.78h   3.66x
8      (1024, 1024)   2.678           43.28%        15.68h   6.25x
8      (1024, 128)    2.651           42.86%        15.91h   6.16x