Deep Learning Lecture 2

Fall 2016, Machine Learning (CMPSCI 689). Deep Learning Lecture 2. Sridhar Mahadevan, Autonomous Learning Lab, UMass Amherst.

Outline of lecture. New types of units: convolutional units, rectified linear units. Faster gradient methods: SGD, AdaGrad, ADAM, batch normalization. Software packages: TensorFlow, Keras, Caffe, Theano, etc.

Representations in Animals (Wang et al., Nature Neuroscience, July 2014)

Convolutional Neural Networks for Atari (Yann LeCun, 1990; Mnih et al., Nature 2015)

Convolutional NNs. Before: input layer, hidden layer, output layer. Now: [ConvNet architecture figure]. (Slide credit: Fei-Fei Li & Andrej Karpathy, 21 Jan 2015)

Filters in CNNs. Convolutional Neural Networks are just Neural Networks, BUT: 1. Local connectivity (nothing else changes). For a 32x32x3 input, the hidden units looking at one spatial location form a single [1 x 1 x depth] depth column in the output volume. (Slide credit: Fei-Fei Li & Andrej Karpathy, 21 Jan 2015)

Convolution. Replicate this column of hidden neurons across space, with some stride. Example: 7x7 input, 3x3 connectivity, stride 1. (Slide credit: Fei-Fei Li & Andrej Karpathy, 21 Jan 2015)
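To see how many such columns fit, the output spatial size follows the standard convolution formula; a quick sketch (conv_output_size is just an illustrative helper, not from the slides):

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 7x7 input, 3x3 connectivity, stride 1 -> a 5x5 grid of depth columns
print(conv_output_size(7, 3, 1))  # 5
```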

Filters learned for Breakout

1990 vs. 2015. By 2014: GoogLeNet [Szegedy, arXiv 2014], VGG [Simonyan, arXiv 2014], MSRA [He, arXiv 2014]; learning to drive. [Chart legend: convolution, pooling, softmax, other.]

Gradient of Sigmoid: $\sigma'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2}$. The derivative is at most 0.25 (attained at $x = 0$) and decays to zero as $|x|$ grows: the vanishing gradient problem!
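A quick numeric check of how fast the sigmoid gradient shrinks (a small sketch, not from the slides):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
# sigma'(x) = sigma(x) * (1 - sigma(x)) = e^{-x} / (1 + e^{-x})^2
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}   sigma'(x) = {dsigmoid(x):.6f}")
# x =  0.0   sigma'(x) = 0.250000
# x = 10.0   sigma'(x) = 0.000045
```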

Rectified Linear Units: $f(x) = \max(x, 0)$, with smooth (softplus) approximation $\hat{f}(x) = \log(e^x + 1)$.
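Both units are one-liners in code (a minimal sketch):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)        # f(x) = max(x, 0)
softplus = lambda x: np.log1p(np.exp(x))   # smooth approximation log(e^x + 1)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(softplus(x))   # close to relu(x) away from 0
```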

RLUs for speech recognition (Zeiler et al.). Fig. 3: Frame accuracy as a function of time for a 4-hidden-layer neural net trained with either logistic units or ReLUs, using as optimizer either SGD or SGD with AdaGrad (ADG).

Sparse propagation Glorot et al., 2011

Faster Gradient Methods. Standard method used in deep learning: the stochastic gradient method (SGD). Faster variants: RMSProp, ADAM, AdaGrad, batch normalization, Riemannian gradient methods.

Rates of Convergence. Let $\{x_k\} \subset \mathbb{R}^n$ and $x^* \in \mathbb{R}^n$ be such that $x_k \to x^*$. We say that $x_k \to x^*$ at a linear rate if $\limsup_{k \to \infty} \dfrac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} < 1$. The convergence is said to be superlinear if this limsup is 0. The convergence is said to be quadratic if $\limsup_{k \to \infty} \dfrac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} < \infty$.

Example. Let $\sigma \in (0, 1)$. $\{\sigma^n\}$ converges linearly to zero, but not superlinearly. $\{\sigma^{n^2}\}$ converges superlinearly to 0, but not quadratically. $\{\sigma^{2^n}\}$ converges quadratically to zero. Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence. For $\sigma = \tfrac{1}{2}$: $\sigma^n = 2^{-n}$, $\sigma^{n^2} = 2^{-n^2}$, $\sigma^{2^n} = 2^{-2^n}$.
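A small numeric illustration of the example above for $\sigma = 1/2$ (a sketch, not from the slides):

```python
# Compare linear (2^{-n}), superlinear (2^{-n^2}), and quadratic (2^{-2^n}) sequences
sigma = 0.5
for n in range(1, 6):
    linear = sigma ** n
    superlinear = sigma ** (n * n)
    quadratic = sigma ** (2 ** n)
    print(f"n={n}: linear={linear:.3e}  superlinear={superlinear:.3e}  quadratic={quadratic:.3e}")
```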

Trust Region vs. Line Search. Slow convergence of steepest descent motivates two families of methods: trust region and line search. [Figure: line search direction and trust region step, shown against the contours of the model $m_k$ and the contours of $f$.]

Hessian. Many unconstrained optimization methods (Newton's method, modified Newton methods) use the second-order derivative of a smooth function:

$$\nabla^2 f(x) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$

Rosenbrock Problem: $f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$. [Surface plot of $f$ over $[-1, 1] \times [-1, 1]$, together with its Hessian $\nabla^2 f(x, y)$.]

Newton's Method Derivation. Quadratic approximation around the current point:
  $f_k(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)$
Set the derivative to 0:
  $\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$
This gives us Newton's update rule:
  $x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$
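A minimal NumPy sketch of this update (the callables grad and hess are illustrative assumptions):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x_{k+1} = x_k - (Hessian)^{-1} gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # solve the Newton system
    return x

# Example: f(x) = x^2 + e^x, the one-dimensional example used on a later slide
grad = lambda x: np.array([2 * x[0] + np.exp(x[0])])
hess = lambda x: np.array([[2 + np.exp(x[0])]])
print(newton(grad, hess, [1.0]))   # approx [-0.3517337]
```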

Convergence of Newton's Method. Theorem 3.5 (Nocedal and Wright). Suppose that $f$ is twice differentiable and that the Hessian $\nabla^2 f(x)$ is Lipschitz continuous (see (A.42)) in a neighborhood of a solution $x^*$ at which the sufficient conditions (Theorem 2.4) are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k$ is the Newton step given by (3.30). Then (i) if the starting point $x_0$ is sufficiently close to $x^*$, the sequence of iterates converges to $x^*$; (ii) the rate of convergence of $\{x_k\}$ is quadratic; and (iii) the sequence of gradient norms $\{\|\nabla f_k\|\}$ converges quadratically to zero. Lipschitz continuity: $\|f(x_1) - f(x_0)\| \le L \|x_1 - x_0\|$, for all $x_0, x_1 \in N$. Proof: see Nocedal and Wright's book.

Newton's Method. Minimize $f(x) := x^2 + e^x$ with the update $x_{k+1} = x_k - \dfrac{f'(x_k)}{f''(x_k)}$:

       x             f'(x)
       1             4.7182818
       0             1
    -1/3             0.0498646
   -0.3516893        0.00012
   -0.3517337        0.00000000064

Steepest Descent (on the same problem, $f(x) = x^2 + e^x$):

    k    x_k           f(x_k)       f'(x_k)       s
    0     1            3.7182818     4.7182818     0
    1     0            1             1             0
    2    -0.5          0.8565307    -0.3934693     1
    3    -0.25         0.8413008     0.2788008     2
    4    -0.375        0.8279143    -0.0627107     3
    5    -0.34075      0.8273473     0.0297367     5
    6    -0.356375     0.8272131    -0.01254       6
    7    -0.3485625    0.8271976     0.0085768     7
    8    -0.3524688    0.8271848    -0.001987      8
    9    -0.3514922    0.8271841     0.0006528    10
   10    -0.3517364    0.827184     -0.0000072    12
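For comparison, a steepest-descent sketch on the same function, with a simple backtracking step size (an assumption; the slide does not state its step-size rule, so the iterates will not match the table exactly):

```python
import numpy as np

f = lambda x: x**2 + np.exp(x)
df = lambda x: 2 * x + np.exp(x)

def steepest_descent(x, iters=20):
    for k in range(iters):
        g = df(x)
        step = 1.0
        # Halve the step until the objective decreases (crude backtracking)
        while f(x - step * g) >= f(x) and step > 1e-12:
            step *= 0.5
        x = x - step * g
        print(f"k={k+1:2d}  x={x: .7f}  f(x)={f(x):.7f}")
    return x

steepest_descent(1.0)   # drifts toward the minimizer near -0.3517
```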

ADAM (Kingma and Ba, ICLR 2015)

Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. We denote by $\beta_1^t$ and $\beta_2^t$ the values $\beta_1$ and $\beta_2$ raised to the power $t$.

Require: $\alpha$: stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: exponential decay rates for the moment estimates
Require: $f(\theta)$: stochastic objective function with parameters $\theta$
Require: $\theta_0$: initial parameter vector
  $m_0 \leftarrow 0$ (initialize 1st moment vector)
  $v_0 \leftarrow 0$ (initialize 2nd moment vector)
  $t \leftarrow 0$ (initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update parameters)
  end while
  return $\theta_t$ (resulting parameters)
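A minimal NumPy sketch of the Adam update (not the authors' reference implementation), applied to the toy objective from the earlier slides:

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # biased 1st moment estimate
    v = np.zeros_like(theta)   # biased 2nd raw moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Deterministic gradient of f(x) = x^2 + e^x; approaches the minimizer near -0.35
print(adam(lambda x: 2 * x + np.exp(x), [1.0], alpha=0.01, steps=2000))
```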

[Figure 3 from the Adam paper: convolutional neural networks training cost.]

(Ioffe and Szegedy, ICML 2015) Algorithm 1: Batch Normalizing Transform, applied to activation $x$ over a mini-batch.

Input: values of $x$ over a mini-batch: $\mathcal{B} = \{x_{1 \dots m}\}$; parameters to be learned: $\gamma, \beta$
Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$
  $\mu_{\mathcal{B}} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$  (mini-batch mean)
  $\sigma_{\mathcal{B}}^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$  (mini-batch variance)
  $\hat{x}_i \leftarrow \dfrac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$  (normalize)
  $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$  (scale and shift)
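A NumPy sketch of the training-time transform above (inference uses running averages of the mini-batch statistics, which is omitted here):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for x of shape (batch, features)."""
    mu = x.mean(axis=0)                       # mini-batch mean
    var = x.var(axis=0)                       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

x = 3.0 * np.random.randn(64, 10) + 5.0
y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly 0s and 1s
```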

[Figure 1 from the paper.] (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.

Mirror Descent = Natural Gradient! (Thomas, Dabney, Mahadevan, Giguere, NIPS 2013)

Natural gradient (Amari), the descent update at step $k$ with metric $G_k$:
  $x_{k+1} = x_k - \alpha_k G_k^{-1} \nabla f(x_k)$

Mirror Descent (Nemirovski and Yudin), with $\psi_k$ continuously differentiable and strongly convex:
  $x_{k+1} = \nabla\psi_k^* \big( \nabla\psi_k(x_k) - \alpha_k \nabla f(x_k) \big)$

We show these 30-year-old techniques are closely related!

Mirror Descent (Nemirovski and Yudin, 1980s)

[Diagram: the current iterate $x_t$ in the primal space $X \subseteq \mathbb{R}^n$ is mapped to the dual space via $\nabla\psi(x_t)$, a gradient step is taken there to give $\nabla\psi(y_{t+1})$, and the result is mapped back (and projected onto $X$) to give $x_{t+1}$.]

  $x_{k+1} = \nabla\psi_k^* \big( \nabla\psi_k(x_k) - \alpha_k \nabla f(x_k) \big)$, with $\psi_k$ continuously differentiable and strongly convex.
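As a concrete instance (a sketch under the assumption of the negative-entropy mirror map $\psi(x) = \sum_i x_i \log x_i$ on the probability simplex), the update reduces to the multiplicative, exponentiated-gradient rule listed on the next slide:

```python
import numpy as np

def mirror_descent_entropy(grad, x0, alpha=0.1, steps=100):
    """Mirror descent with the negative-entropy mirror map (exponentiated gradient)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x * np.exp(-alpha * grad(x))   # gradient step in the dual space
        x = x / x.sum()                    # back to the simplex (normalization)
    return x

# Minimize f(x) = <c, x> over the simplex: the mass concentrates on argmin(c)
c = np.array([3.0, 1.0, 2.0])
print(mirror_descent_entropy(lambda x: c, np.ones(3) / 3, alpha=0.5))
```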

Mirror Descent unifies many methods: regular gradient descent, natural gradient, exponentiated gradient (Winnow and other multiplicative methods), sparse regression methods, boosting, and many online learning algorithms.

Natural Neural Networks. [Diagram relating $\theta_t$ and $\theta_{t+1}$ through $F(\theta_t)^{1/2}$ and $F(\theta_t)^{-1/2}$.] Proposed by Google DeepMind; builds on our work on the equivalence of mirror descent and natural gradient methods.

GPU machine I built at home: 20-core Intel Xeon E5-2697 v4; three GPUs: 2x Nvidia 1080, 1x Nvidia Titan X.

CIFAR-10 Image Dataset

Keras program for MNIST MLP network
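The program itself appeared as a screenshot; here is a minimal sketch in the same spirit, written against the tensorflow.keras API (the exact architecture and hyperparameters on the slide are not recoverable, so these choices are assumptions):

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Dense(512, activation="relu", input_shape=(784,)),
    Dropout(0.2),
    Dense(512, activation="relu"),
    Dropout(0.2),
    Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=5, validation_split=0.1)
print("Test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```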

Specifying LeNet in Caffe (layers are defined as protocol buffers: https://developers.google.com/protocol-buffers/docs/overview)

  name: "LeNet"

  # Data layer
  layer {
    name: "mnist"
    type: "Data"
    top: "data"
    top: "label"
    data_param {
      source: "mnist_train_lmdb"
      backend: LMDB
      batch_size: 64
      scale: 0.00390625
    }
  }

  # Convolution layer
  layer {
    name: "conv1"
    type: "Convolution"
    bottom: "data"
    top: "conv1"
    param { lr_mult: 1 }
    param { lr_mult: 2 }
    convolution_param {
      num_output: 20
      kernel_size: 5
      stride: 1
      weight_filler { type: "xavier" }
      bias_filler { type: "constant" }
    }
  }

Max Pooling and RLU Layer

  # Pooling layer
  layer {
    name: "pool1"
    type: "Pooling"
    bottom: "conv1"
    top: "pool1"
    pooling_param {
      kernel_size: 2
      stride: 2
      pool: MAX
    }
  }

  # RLU (ReLU) layer
  layer {
    name: "relu1"
    type: "ReLU"
    bottom: "ip1"
    top: "ip1"
  }

Loss Layer

  # Loss layer
  layer {
    name: "loss"
    type: "SoftmaxWithLoss"
    bottom: "ip2"
    bottom: "label"
  }

MNIST solver in Caffe

  # The train/test net protocol buffer definition
  net: "examples/mnist/lenet_train_test.prototxt"
  # test_iter specifies how many forward passes the test should carry out.
  # In the case of MNIST, we have test batch size 100 and 100 test iterations,
  # covering the full 10,000 testing images.
  test_iter: 100
  # Carry out testing every 500 training iterations.
  test_interval: 500
  # The base learning rate, momentum and the weight decay of the network.
  base_lr: 0.01
  momentum: 0.9
  weight_decay: 0.0005
  # The learning rate policy
  lr_policy: "inv"
  gamma: 0.0001
  power: 0.75
  # Display every 100 iterations
  display: 100
  # The maximum number of iterations
  max_iter: 10000
  # snapshot intermediate results
  snapshot: 5000
  snapshot_prefix: "examples/mnist/lenet"
  # solver mode: CPU or GPU
  solver_mode: GPU

Running LeNet on Caffe

  I0917 19:20:26.375691 26575 layer_factory.hpp:75] Creating layer mnist
  I0917 19:20:26.375877 26575 net.cpp:110] Creating Layer mnist
  I0917 19:20:26.375903 26575 net.cpp:432] mnist -> data
  I0917 19:20:26.375928 26575 net.cpp:432] mnist -> label
  I0917 19:20:26.378226 26581 db_lmdb.cpp:22] Opened lmdb examples/mnist/mnist_test_lmdb
  I0917 19:20:26.378762 26575 data_layer.cpp:44] output data size: 100,1,28,28
  I0917 19:20:26.380553 26575 net.cpp:155] Setting up mnist
  I0917 19:20:26.380594 26575 net.cpp:163] Top shape: 100 1 28 28 (78400)
  I0917 19:20:26.380614 26575 net.cpp:163] Top shape: 100 (100)
  I0917 19:20:26.380635 26575 layer_factory.hpp:75] Creating layer label_mnist_1_split
  I0917 19:20:26.380668 26575 net.cpp:110] Creating Layer label_mnist_1_split
  I0917 19:20:26.380686 26575 net.cpp:476] label_mnist_1_split <- label
  I0917 19:20:26.380707 26575 net.cpp:432] label_mnist_1_split -> label_mnist_1_split_0
  I0917 19:20:26.380738 26575 net.cpp:432] label_mnist_1_split -> label_mnist_1_split_1
  I0917 19:20:26.405414 26575 solver.cpp:266] Learning Rate Policy: inv
  I0917 19:20:26.406183 26575 solver.cpp:310] Iteration 0, Testing net (#0)
  I0917 19:20:26.601101 26575 solver.cpp:359] Test net output #0: accuracy = 0.0777
  I0917 19:20:26.601132 26575 solver.cpp:359] Test net output #1: loss = 2.3651 (* 1 = 2.3651 loss)
  I0917 19:20:26.604207 26575 solver.cpp:222] Iteration 0, loss = 2.34867
  I0917 19:20:26.604233 26575 solver.cpp:238] Train net output #0: loss = 2.34867 (* 1 = 2.34867 loss)
  I0917 19:20:59.081962 26575 solver.cpp:291] Iteration 10000, loss = 0.00325083
  I0917 19:20:59.081985 26575 solver.cpp:310] Iteration 10000, Testing net (#0)
  I0917 19:20:59.215575 26575 solver.cpp:359] Test net output #0: accuracy = 0.9904
  I0917 19:20:59.215605 26575 solver.cpp:359] Test net output #1: loss = 0.0291382 (* 1 = 0.0291382 loss)
  I0917 19:20:59.215615 26575 solver.cpp:296] Optimization Done.
  I0917 19:20:59.215622 26575 caffe.cpp:184] Optimization Done.

  real  0m34.403s
  user  0m27.744s
  sys   0m25.308s

Convolutional layers: most demanding computationally, but fewest parameters. Fully connected layers: least demanding computationally, but most parameters.

  GPUs   Batch size     Cross-entropy   Top-1 error   Time     Speedup
  1      (128, 128)     2.611           42.33%        98.05h   1x
  2      (256, 256)     2.624           42.63%        50.24h   1.95x
  2      (256, 128)     2.614           42.27%        50.90h   1.93x
  4      (512, 512)     2.637           42.59%        26.20h   3.74x
  4      (512, 128)     2.625           42.44%        26.78h   3.66x
  8      (1024, 1024)   2.678           43.28%        15.68h   6.25x
  8      (1024, 128)    2.651           42.86%        15.91h   6.16x