Introduction to (Convolutional) Neural Networks

Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018

Syllabus
1 Motivation and Definition
2 Universal Approximation
3 Backpropagation
4 Stochastic Gradient Descent
5 The Basic Recipe
6 Going Deep
7 Convolutional Neural Networks
8 What I didn't tell you

1 Motivation and Definition

Which Method to Choose?
We have seen linear regression, kernel regression, regularization, K-PCA, K-SVM, ... and there exist a zillion other methods. Is there a universally best method?

No Free Lunch Theorem
No Free Lunch Theorem [Wolpert(1996)], informal version: Of course not!
Proof of the theorem: If $\rho$ is completely arbitrary and nothing is known about it, we cannot possibly infer anything about $\rho$ from samples $((x_i, y_i))_{i=1}^m$.
Every algorithm has a specific preference, for example as specified through the hypothesis class $\mathcal{H}$. In this sense, all categories are artificial!
We want our algorithm to reproduce the artificial categories produced by our brain, so let's build a hypothesis class that mimics our thinking!

Neuroscience
The Brain as Biological Neural Network
In neuroscience, a biological neural network is a series of interconnected neurons whose activation defines a recognizable linear pathway. The interface through which neurons interact with their neighbors usually consists of several axon terminals connected via synapses to dendrites on other neurons. If the sum of the input signals into one neuron surpasses a certain threshold, the neuron sends an action potential (AP) at the axon hillock and transmits this electrical signal along the axon.
Source: Wikipedia

Neurons
Recall: If the sum of the input signals into one neuron surpasses a certain threshold, [...] the neuron transmits this [...] signal [...].

Artificial Neurons
[Figure: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feed a summation node $\Sigma$ with threshold $b$. The activation produces the output $y = 1$ if $\sum_i x_i w_i - b > 0$ and $y = 0$ if $\sum_i x_i w_i - b \le 0$.]
Artificial Neuron: An artificial neuron with weights $w_1, \dots, w_s$, bias $b$ and activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is defined as the function
$$f(x_1, \dots, x_s) = \sigma\Big(\sum_{i=1}^s x_i w_i - b\Big).$$

Activation Functions
Figure: Heaviside activation function (as in biological motivation)
Figure: Sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$
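The artificial neuron and the two activation functions above can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical values of the inputs, weights and threshold are assumptions chosen for the example, not part of the slides.

```python
import math

def heaviside(t):
    """Heaviside step: 1 if t > 0, else 0 (the threshold rule of the neuron)."""
    return 1.0 if t > 0 else 0.0

def sigmoid(t):
    """Sigmoid activation 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def neuron(x, w, b, activation):
    """Artificial neuron: activation(sum_i x_i * w_i - b)."""
    pre_activation = sum(x_i * w_i for x_i, w_i in zip(x, w)) - b
    return activation(pre_activation)

# Hypothetical inputs, weights and threshold, chosen only for illustration.
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.1]
b = 0.1

step_output = neuron(x, w, b, heaviside)    # fires because the weighted sum exceeds b
smooth_output = neuron(x, w, b, sigmoid)    # smooth value strictly between 0 and 1
```

With the Heaviside activation the neuron either fires or stays silent, while the sigmoid gives a differentiable output, which is what gradient-based training needs.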

Artificial Neural Networks
Artificial neural networks consist of a graph connecting artificial neurons!
The dynamics are difficult to model, due to loops, etc.

Artificial Feedforward Neural Networks
Use a directed, acyclic graph!

Artificial Feedforward Neural Networks
Definition: Let $L, d, N_1, \dots, N_L \in \mathbb{N}$. A map $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
$$\Phi(x) = A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))), \quad x \in \mathbb{R}^d,$$
is called a neural network. It is composed of affine linear maps $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$ (where $N_0 = d$), and a non-linear function $\sigma$, often referred to as the activation function, acting component-wise. Here, $d$ is the dimension of the input layer, $L$ denotes the number of layers, $N_1, \dots, N_{L-1}$ stand for the dimensions of the $L-1$ hidden layers, and $N_L$ is the dimension of the output layer.
An affine map $A : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$ is given by $x \mapsto Wx + b$ with weight matrix $W \in \mathbb{R}^{N_l \times N_{l-1}}$ and bias vector $b \in \mathbb{R}^{N_l}$.
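A minimal sketch of the definition above, written with plain Python lists. The helper names `affine`, `init_network` and `forward`, the Gaussian random initialization and the layer sizes are assumptions made for this illustration. Each weight matrix is stored as a list of rows, so that it has $N_l$ rows and $N_{l-1}$ columns.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def affine(W, b, x):
    """Affine map x -> W x + b, with W stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def init_network(layer_sizes, rng):
    """One (W_l, b_l) pair per layer; W_l has N_l rows and N_{l-1} columns."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        params.append((W, b))
    return params

def forward(params, x, activation=sigmoid):
    """Phi(x) = A_L(sigma(A_{L-1}(... sigma(A_1(x))))); no activation after A_L."""
    a = x
    for W, b in params[:-1]:
        a = [activation(z) for z in affine(W, b, a)]
    W_last, b_last = params[-1]
    return affine(W_last, b_last, a)

# Hypothetical topology: d = 2, N_1 = N_2 = 3, N_3 = 2 (so L = 3).
rng = random.Random(0)
params = init_network([2, 3, 3, 2], rng)
output = forward(params, [0.5, -1.0])
```

Note that the activation is applied after every affine map except the last one, exactly as in the formula for $\Phi$.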

On the Biological Motivation
Artificial (feedforward) neural networks should not be confused with a model for our brain:
- Neurons are more complicated than simply weighted linear combinations
- Our brain is not feedforward
- Biological neural networks evolve with time (neuronal plasticity)
- ...
Artificial feedforward neural networks constitute a mathematically and computationally convenient but very simplistic mathematical construct which is inspired by our understanding of how the brain works.

Terminology
Neural Network Learning: Use neural networks of a fixed topology as hypothesis class for regression or classification tasks. This requires optimizing the weights and bias parameters.
Deep Learning: Neural network learning with neural networks consisting of many (e.g., 3) layers.

2 Universal Approximation

Approximation Question
Main Approximation Problem: Can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?
Surely not! Suppose that $\sigma$ is a polynomial of degree $r$. Then $\sigma(Ax)$ is a polynomial of degree $\le r$ for all affine maps $A$, and therefore any neural network with one hidden layer and activation function $\sigma$ is a polynomial of degree $\le r$ (with $L$ layers the degree is at most $r^{L-1}$). For a fixed depth, polynomials of bounded degree cannot approximate every continuous function arbitrarily well.
Main Approximation Problem: Under which conditions on the activation function $\sigma$ can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?
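To make the polynomial obstruction concrete, here is a small worked case. It assumes $d = 1$, one hidden layer with $N_1$ neurons, the activation $\sigma(t) = t^2$, hidden-layer parameters $w_i, b_i$ and an output map with coefficients $c_i$ and bias $c_0$. These symbols are chosen for the illustration only.

```latex
\Phi(x) = c_0 + \sum_{i=1}^{N_1} c_i \,(w_i x - b_i)^2
        = \Big(\sum_{i=1}^{N_1} c_i w_i^2\Big) x^2
          - 2 \Big(\sum_{i=1}^{N_1} c_i w_i b_i\Big) x
          + \Big(c_0 + \sum_{i=1}^{N_1} c_i b_i^2\Big).
```

However large $N_1$ is chosen, $\Phi$ remains a polynomial of degree at most 2, so adding neurons does not enlarge the set of functions that can be represented.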

Universal Approximation Theorem
Theorem: Suppose that $\sigma : \mathbb{R} \to \mathbb{R}$ is continuous and not a polynomial, and fix $d \ge 1$, $L \ge 2$, $N_L \in \mathbb{N}$ and a compact subset $K \subset \mathbb{R}^d$. Then for any continuous $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ and any $\varepsilon > 0$ there exist $N_1, \dots, N_{L-1} \in \mathbb{N}$ and affine linear maps $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$, such that the neural network
$$\Phi(x) = A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))), \quad x \in \mathbb{R}^d,$$
approximates $f$ to within accuracy $\varepsilon$, i.e.,
$$\sup_{x \in K} |f(x) - \Phi(x)| \le \varepsilon.$$
Neural networks are universal approximators, and one hidden layer is enough if the number of nodes is sufficient!

Proof of the Universal Approximation Theorem
For simplicity we only consider the case of one hidden layer, i.e., $L = 2$, and one output neuron, i.e., $N_L = 1$:
$$\Phi(x) = \sum_{i=1}^{N_1} c_i \sigma(w_i \cdot x - b_i), \quad w_i \in \mathbb{R}^d, \; c_i, b_i \in \mathbb{R}.$$
We will show the following.
Theorem: For $d \in \mathbb{N}$ and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous consider
$$\mathcal{R}(\sigma, d) := \operatorname{span}\{\sigma(w \cdot x - b) : w \in \mathbb{R}^d, \; b \in \mathbb{R}\}.$$
Then $\mathcal{R}(\sigma, d)$ is dense in $C(\mathbb{R}^d)$ if and only if $\sigma$ is not a polynomial.

Proof for $d = 1$ and $\sigma$ smooth
- If $\sigma$ is not a polynomial, there exists $x_0 \in \mathbb{R}$ with $\sigma^{(k)}(-x_0) \ne 0$ for all $k \in \mathbb{N}$.
- Constant functions can be approximated because $\sigma(hx - x_0) \to \sigma(-x_0) \ne 0$ as $h \to 0$.
- Linear functions can be approximated because $\frac{1}{h}\big(\sigma((\lambda + h)x - x_0) - \sigma(\lambda x - x_0)\big) \to x \sigma'(-x_0)$ as $h, \lambda \to 0$.
- By the same argument, polynomials in $x$ can be approximated.
- The Stone-Weierstrass Theorem yields the result.
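The step for linear functions can be checked numerically. The sketch below assumes the sigmoid as the smooth activation and picks $x_0 = 1$, $\lambda = 0$ and $h = 10^{-5}$ as illustrative values. It compares the difference quotient of two neurons with the limit $x \sigma'(-x_0)$ on a few sample points.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_prime(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def difference_quotient(x, x0, lam, h):
    """(sigma((lam + h) x - x0) - sigma(lam x - x0)) / h: a combination of two neurons."""
    return (sigmoid((lam + h) * x - x0) - sigmoid(lam * x - x0)) / h

# Illustrative choices (assumptions): shift x0, sample points, step h, and lam = 0.
x0 = 1.0
xs = [-2.0, -1.0, 0.0, 0.5, 2.0]
h = 1e-5
lam = 0.0

# Largest deviation from the limiting linear function x * sigma'(-x0).
max_error = max(
    abs(difference_quotient(x, x0, lam, h) - x * sigmoid_prime(-x0))
    for x in xs
)
```

The deviation shrinks proportionally to $h$, which is the sense in which a linear function lies in the closure of the span of neurons.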

General $d$
Note that the functions
$$\operatorname{span}\{g(w \cdot x - b) : w \in \mathbb{R}^d, \; b \in \mathbb{R}, \; g \in C(\mathbb{R}) \text{ arbitrary}\}$$
are dense in $C(\mathbb{R}^d)$ (just take $g$ as $\sin(w \cdot x)$, $\cos(w \cdot x)$, as in the Fourier series case).
First approximate $f \in C(\mathbb{R}^d)$ by
$$\sum_{i=1}^N d_i g_i(v_i \cdot x - e_i), \quad v_i \in \mathbb{R}^d, \; d_i, e_i \in \mathbb{R}, \; g_i \in C(\mathbb{R}).$$
Then apply our univariate result to approximate the univariate functions $t \mapsto g_i(t - e_i)$ using neural networks.

The case that $\sigma$ is nonsmooth
- Pick a family $(g_\varepsilon)_{\varepsilon > 0}$ of mollifiers, i.e., $\lim_{\varepsilon \to 0} \sigma * g_\varepsilon = \sigma$ uniformly on compacta.
- Apply the previous result to the smooth function $\sigma * g_\varepsilon$ and let $\varepsilon \to 0$.

3 Backpropagation

Regression/Classification with Neural Networks
Neural Network Hypothesis Class: Given $d, L, N_1, \dots, N_L$ and $\sigma$, define the associated hypothesis class
$$\mathcal{H}_{[d, N_1, \dots, N_L], \sigma} := \{A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))) : A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l} \text{ affine linear}\}.$$
Typical Regression/Classification Task: Given data $z = ((x_i, y_i))_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, find the empirical regression function
$$f_z \in \operatorname{argmin}_{f \in \mathcal{H}_{[d, N_1, \dots, N_L], \sigma}} \sum_{i=1}^m \mathcal{L}(f, x_i, y_i),$$
where $\mathcal{L} : C(\mathbb{R}^d) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}_+$ is the loss function (in least squares problems we have $\mathcal{L}(f, x, y) = |f(x) - y|^2$).
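The objective in the task above is just a sum of per-sample losses. The following sketch evaluates it for the least-squares loss and, as a toy stand-in for the hypothesis class, takes the minimum over a finite list of candidate functions. The function names, the two candidates and the data are assumptions for illustration only; a real hypothesis class is parametrized by weights and cannot be enumerated this way.

```python
def squared_loss(f, x, y):
    """Least-squares loss |f(x) - y|^2 for vector-valued f(x) and y."""
    return sum((p - t) ** 2 for p, t in zip(f(x), y))

def empirical_risk(f, samples, loss=squared_loss):
    """Sum of the loss over the samples z = ((x_i, y_i))_{i=1}^m."""
    return sum(loss(f, x, y) for x, y in samples)

def best_hypothesis(candidates, samples):
    """argmin of the empirical risk over a finite list of candidates."""
    return min(candidates, key=lambda f: empirical_risk(f, samples))

def identity(x):
    return [x[0]]

def double(x):
    return [2.0 * x[0]]

# Hypothetical data generated by y = 2 x.
samples = [([0.0], [0.0]), ([1.0], [2.0]), ([2.0], [4.0])]
chosen = best_hypothesis([identity, double], samples)
```

Here `double` reproduces the data exactly and is selected, while `identity` accumulates a positive risk.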

Example: Handwritten Digits
MNIST Database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/
- Every image is given as a $28 \times 28$ matrix $x \in \mathbb{R}^{28 \times 28} \cong \mathbb{R}^{784}$.
- Every label is given as a 10-dimensional vector $y \in \mathbb{R}^{10}$ describing the probability of each digit.
- Given labeled training data $(x_i, y_i)_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$.
- Fix the network topology, e.g., the number of layers (for example $L = 3$) and the numbers of neurons ($N_1 = 20$, $N_2 = 20$).
- The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_{[784, 20, 20, 10], \sigma}$.
But how? The problem is non-linear and non-convex.
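To get a feeling for the size of this optimization problem, the sketch below counts the weights and biases of the topology $[784, 20, 20, 10]$ and shows one possible encoding of images and labels. The helper names are assumptions, and the one-hot label (probability 1 on the true digit) is one natural choice for a known label, not something the slides prescribe.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases of the affine maps between consecutive layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def one_hot(digit, num_classes=10):
    """Label vector that puts probability 1 on the given digit."""
    return [1.0 if k == digit else 0.0 for k in range(num_classes)]

def flatten(image):
    """Turn a 28 x 28 nested list into a vector of length 784."""
    return [pixel for row in image for pixel in row]

layer_sizes = [784, 20, 20, 10]
num_parameters = count_parameters(layer_sizes)   # 15700 + 420 + 210

blank_image = [[0.0] * 28 for _ in range(28)]    # placeholder image, not MNIST data
x = flatten(blank_image)
y = one_hot(3)
```

Even this small network has 16330 free parameters, which is why a systematic way of computing gradients is needed.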

Gradient Descent: The Simplest Optimization Method
Gradient Descent
- The gradient of $F : \mathbb{R}^N \to \mathbb{R}$ is defined by
$$\nabla F(u) = \Big(\frac{\partial F(u)}{\partial (u)_1}, \dots, \frac{\partial F(u)}{\partial (u)_N}\Big)^T.$$
- Gradient descent with stepsize $\eta > 0$ is defined by
$$u_{n+1} \leftarrow u_n - \eta \nabla F(u_n).$$
- Converges (slowly) to a stationary point of $F$.
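The update rule above translates directly into code. The sketch below applies it to a simple convex quadratic whose minimizer is known. The objective, the stepsize and the number of steps are assumptions chosen so that the iteration visibly converges; they are not taken from the slides.

```python
def gradient_descent(grad, u0, step_size, num_steps):
    """Iterate u_{n+1} = u_n - eta * grad F(u_n) and return the final iterate."""
    u = list(u0)
    for _ in range(num_steps):
        g = grad(u)
        u = [u_k - step_size * g_k for u_k, g_k in zip(u, g)]
    return u

def grad_F(u):
    """Gradient of the illustrative objective F(u) = (u_1 - 1)^2 + 2 (u_2 + 3)^2."""
    return [2.0 * (u[0] - 1.0), 4.0 * (u[1] + 3.0)]

# The unique stationary point of F is (1, -3).
u_final = gradient_descent(grad_F, u0=[0.0, 0.0], step_size=0.1, num_steps=200)
```

For the neural network problem the same loop applies, with `grad` replaced by the gradient of the summed loss with respect to all weights and biases.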

Backprop
In our problem: $F = \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$ and $u = ((W_l, b_l))_{l=1}^L$.
Since
$$\nabla_{((W_l, b_l))_{l=1}^L} F = \sum_{i=1}^m \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i),$$
we need to determine (for $(x, y) \in \mathbb{R}^d \times \mathbb{R}^{N_L}$ fixed)
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_l)_{i,j}}, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_l)_i}, \quad l = 1, \dots, L.$$
For simplicity suppose that $\mathcal{L}(f, x, y) = |f(x) - y|^2$, so that
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_l)_{i,j}} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (W_l)_{i,j}}, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_l)_i} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (b_l)_i}.$$

Worked example: a network with $d = 2$, $N_1 = N_2 = 3$ and $N_3 = 2$.
$$x = \begin{pmatrix} (x)_1 \\ (x)_2 \end{pmatrix}, \quad a_1 = \sigma(z_1) = \sigma(W_1 x + b_1), \quad a_2 = \sigma(z_2) = \sigma(W_2 a_1 + b_2), \quad \Phi(x) = z_3 = W_3 a_2 + b_3,$$
with
$$W_1 = \begin{pmatrix} (W_1)_{1,1} & (W_1)_{1,2} \\ (W_1)_{2,1} & (W_1)_{2,2} \\ (W_1)_{3,1} & (W_1)_{3,2} \end{pmatrix}, \quad b_1 = \begin{pmatrix} (b_1)_1 \\ (b_1)_2 \\ (b_1)_3 \end{pmatrix}, \quad W_2 = \begin{pmatrix} (W_2)_{1,1} & (W_2)_{1,2} & (W_2)_{1,3} \\ (W_2)_{2,1} & (W_2)_{2,2} & (W_2)_{2,3} \\ (W_2)_{3,1} & (W_2)_{3,2} & (W_2)_{3,3} \end{pmatrix}, \quad b_2 = \begin{pmatrix} (b_2)_1 \\ (b_2)_2 \\ (b_2)_3 \end{pmatrix},$$
$$W_3 = \begin{pmatrix} (W_3)_{1,1} & (W_3)_{1,2} & (W_3)_{1,3} \\ (W_3)_{2,1} & (W_3)_{2,2} & (W_3)_{2,3} \end{pmatrix}, \quad b_3 = \begin{pmatrix} (b_3)_1 \\ (b_3)_2 \end{pmatrix}.$$
Derivatives of the output with respect to a single weight of the last layer:
$$\frac{\partial (z_3)_1}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}} \big((W_3)_{1,1} (a_2)_1 + (W_3)_{1,2} (a_2)_2 + (W_3)_{1,3} (a_2)_3 + (b_3)_1\big) = (a_2)_2,$$
$$\frac{\partial (z_3)_2}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}} \big((W_3)_{2,1} (a_2)_1 + (W_3)_{2,2} (a_2)_2 + (W_3)_{2,3} (a_2)_3 + (b_3)_2\big) = 0.$$
In general,
$$\frac{\partial (z_3)_k}{\partial (W_3)_{i,j}} = \begin{cases} (a_2)_j & i = k \\ 0 & i \ne k \end{cases}, \quad \frac{\partial (z_3)_k}{\partial (b_3)_i} = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases}.$$
Collecting these entries, where the $(i,j)$ entry below is the vector $\partial \Phi(x) / \partial (W_3)_{i,j}$ and the $i$-th entry is the vector $\partial \Phi(x) / \partial (b_3)_i$:
$$\frac{\partial \Phi(x)}{\partial W_3} = \begin{pmatrix} \left(\begin{smallmatrix} (a_2)_1 \\ 0 \end{smallmatrix}\right) & \left(\begin{smallmatrix} (a_2)_2 \\ 0 \end{smallmatrix}\right) & \left(\begin{smallmatrix} (a_2)_3 \\ 0 \end{smallmatrix}\right) \\ \left(\begin{smallmatrix} 0 \\ (a_2)_1 \end{smallmatrix}\right) & \left(\begin{smallmatrix} 0 \\ (a_2)_2 \end{smallmatrix}\right) & \left(\begin{smallmatrix} 0 \\ (a_2)_3 \end{smallmatrix}\right) \end{pmatrix}, \quad \frac{\partial \Phi(x)}{\partial b_3} = \begin{pmatrix} \left(\begin{smallmatrix} 1 \\ 0 \end{smallmatrix}\right) \\ \left(\begin{smallmatrix} 0 \\ 1 \end{smallmatrix}\right) \end{pmatrix}.$$

Backprop: Last Layer
$$\frac{\partial \mathcal{L}(f,x,y)}{\partial (W_L)_{i,j}} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}}, \quad \frac{\partial \mathcal{L}(f,x,y)}{\partial (b_L)_i} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i}.$$
Let $f(x) = W_L \sigma(W_{L-1}(\dots) + b_{L-1}) + b_L$. It follows that
$$\frac{\partial f(x)}{\partial (W_L)_{i,j}} = (0, \dots, \underbrace{\sigma(W_{L-1}(\dots) + b_{L-1})_j}_{i\text{-th entry}}, \dots, 0)^T, \quad \frac{\partial f(x)}{\partial (b_L)_i} = (0, \dots, \underbrace{1}_{i\text{-th entry}}, \dots, 0)^T,$$
and therefore
$$2 (f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}} = 2 (f(x) - y)_i \, \sigma(W_{L-1}(\dots) + b_{L-1})_j, \quad 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i} = 2 (f(x) - y)_i.$$
In matrix notation:
$$\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L} = \underbrace{2 (f(x) - y)}_{\delta_L} \Big(\underbrace{\sigma(\overbrace{W_{L-1}(\dots) + b_{L-1}}^{z_{L-1}})}_{a_{L-1}}\Big)^T, \quad \frac{\partial \mathcal{L}(f,x,y)}{\partial b_L} = 2 (f(x) - y) = \delta_L.$$
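The last-layer formulas can be sanity-checked against a finite difference. In the sketch below the previous-layer activation $a_{L-1}$ is treated as a fixed vector, and all numerical values and helper names are assumptions for illustration. The check compares the entry of $\delta_L a_{L-1}^T$ belonging to $(W_L)_{1,2}$ with a central difference of the loss.

```python
def output(W_L, b_L, a_prev):
    """f(x) = W_L a_{L-1} + b_L, with W_L stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b for row, b in zip(W_L, b_L)]

def loss(W_L, b_L, a_prev, y):
    """Least-squares loss |f(x) - y|^2."""
    return sum((f_i - y_i) ** 2 for f_i, y_i in zip(output(W_L, b_L, a_prev), y))

def last_layer_gradients(W_L, b_L, a_prev, y):
    """Return (dL/dW_L, dL/db_L) = (delta_L a_{L-1}^T, delta_L)."""
    delta = [2.0 * (f_i - y_i) for f_i, y_i in zip(output(W_L, b_L, a_prev), y)]
    grad_W = [[d_i * a_j for a_j in a_prev] for d_i in delta]
    return grad_W, delta

# Hypothetical values: 2 outputs, 3 activations coming from the previous layer.
W_L = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b_L = [0.05, -0.05]
a_prev = [0.7, 0.2, 0.9]
y = [1.0, 0.0]

grad_W, grad_b = last_layer_gradients(W_L, b_L, a_prev, y)

# Central finite difference for the entry (W_L)_{1,2}, i.e. index [0][1] in Python.
eps = 1e-6
W_plus = [row[:] for row in W_L]
W_minus = [row[:] for row in W_L]
W_plus[0][1] += eps
W_minus[0][1] -= eps
numeric = (loss(W_plus, b_L, a_prev, y) - loss(W_minus, b_L, a_prev, y)) / (2 * eps)
```

Because the loss is quadratic in $W_L$, the central difference agrees with the analytic entry up to rounding error.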

Backprop: Second-to-last Layer
Define $a_{l+1} = \sigma(z_{l+1})$ where $z_{l+1} = W_{l+1} a_l + b_{l+1}$, $a_0 = x$, $f(x) = z_L$.
We have computed $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L}$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_L}$.
Then, use the chain rule:
$$\begin{aligned}
\frac{\partial \mathcal{L}(f, x, y)}{\partial W_{L-1}} &= \frac{\partial \mathcal{L}(f, x, y)}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial W_{L-1}} = 2 (f(x) - y)^T W_L \frac{\partial a_{L-1}}{\partial W_{L-1}} \\
&= 2 (f(x) - y)^T W_L \operatorname{diag}(\sigma'(z_{L-1})) \frac{\partial z_{L-1}}{\partial W_{L-1}} \\
&= \underbrace{\operatorname{diag}(\sigma'(z_{L-1})) \, W_L^T \, 2 (f(x) - y)}_{\delta_{L-1}} \, a_{L-2}^T.
\end{aligned}$$
The factor $\frac{\partial z_{L-1}}{\partial W_{L-1}}$ is computed in the same way as for the last layer.
Similar arguments yield $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_{L-1}} = \delta_{L-1}$.

The Backprop Algorithm

1 Calculate $a_l = \sigma(z_l)$, $z_l = A_l(a_{l-1})$ for $l = 1, \dots, L$, $a_0 = x$ (forward pass).
2 Set $\delta_L = 2\,(f(x) - y)$.
3 Then $\frac{\partial L(f,x,y)}{\partial b_L} = \delta_L$ and $\frac{\partial L(f,x,y)}{\partial W_L} = \delta_L\, a_{L-1}^T$.
4 for $l$ from $L-1$ to $1$ do:
    $\delta_l = \mathrm{diag}(\sigma'(z_l))\, W_{l+1}^T\, \delta_{l+1}$
    Then $\frac{\partial L(f,x,y)}{\partial b_l} = \delta_l$ and $\frac{\partial L(f,x,y)}{\partial W_l} = \delta_l\, a_{l-1}^T$.
5 return $\frac{\partial L(f,x,y)}{\partial b_l}$, $\frac{\partial L(f,x,y)}{\partial W_l}$, $l = 1, \dots, L$.
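The steps of the algorithm can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes a sigmoid activation, the squared loss $\|f(x) - y\|^2$ suggested by the factor $2(f(x) - y)$, affine maps $z_l = W_l a_{l-1} + b_l$, and no activation on the last layer ($f(x) = z_L$). The function names `forward` and `backprop` and the list-of-lists matrix layout are choices made for this example.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def matvec(W, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(weights, biases, x):
    # f(x) = z_L: sigmoid on all layers except the last one.
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = [s + bi for s, bi in zip(matvec(W, a), b)]
        a = z if l == len(weights) - 1 else [sigmoid(t) for t in z]
    return a


def backprop(weights, biases, x, y):
    # Step 1 (forward pass): store a_0, ..., a_{L-1} and z_1, ..., z_L.
    activations, zs = [x], []
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = [s + bi for s, bi in zip(matvec(W, activations[-1]), b)]
        zs.append(z)
        if l < len(weights) - 1:
            activations.append([sigmoid(t) for t in z])

    # Step 2: delta_L = 2 (f(x) - y).
    delta = [2.0 * (f - t) for f, t in zip(zs[-1], y)]

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # Steps 3 and 4: index l here is 0-based, so weights[l] is W_{l+1}.
    for l in range(len(weights) - 1, -1, -1):
        grads_b[l] = delta
        grads_W[l] = [[d * a for a in activations[l]] for d in delta]  # delta a^T
        if l > 0:
            W = weights[l]
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(W[0]))]  # W^T delta
            delta = [bk * sigmoid(t) * (1.0 - sigmoid(t))
                     for bk, t in zip(back, zs[l - 1])]  # diag(sigma'(z)) W^T delta
    # Step 5.
    return grads_W, grads_b


# Tiny network with layer sizes [2, 2, 1] (illustrative values).
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
biases = [[0.0, 0.1], [0.2]]
grads_W, grads_b = backprop(weights, biases, [1.0, 2.0], [0.5])
```

A finite-difference comparison of a few entries against `forward` is a cheap way to check such a routine before using it for training.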

Computational Graphs

Automatic Differentiation

4 Stochastic Gradient Descent

The Complexity of Gradient Descent

Recall that one gradient descent step requires the calculation of

$$\sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i),$$

and each of the summands requires one backpropagation run.

Thus, the total complexity of one gradient descent step is equal to $m \cdot \mathrm{complexity(backprop)}$.

The complexity of backprop is asymptotically equal to the number of DOFs of the network:

$$\mathrm{complexity(backprop)} \sim \sum_{l=1}^L \left( N_{l-1} N_l + N_l \right).$$
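To make the DOF count concrete, the sum $\sum_l (N_{l-1} N_l + N_l)$ can be evaluated for a given list of layer widths. The helper name `count_dofs` and the example widths are illustrative assumptions, not part of the slides' notation.

```python
def count_dofs(layer_sizes):
    # layer_sizes = [N_0, N_1, ..., N_L]; each layer contributes
    # N_{l-1} * N_l weights and N_l biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# A small fully connected network with one hidden layer.
print(count_dofs([784, 30, 10]))
```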

An Example

The ImageNet database consists of 1.2 million images and 1000 categories.

AlexNet, a neural network with 160 million DOFs, is one of the most successful annotation methods.

One step of gradient descent requires $2 \cdot 10^{14}$ flops (and memory units)!
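The flop estimate follows from the complexity formula $m \cdot \mathrm{complexity(backprop)}$ with the numbers quoted on this slide, taking the DOF count as a proxy for the cost of one backprop run:

```latex
m \cdot \mathrm{complexity(backprop)}
  \approx (1.2 \cdot 10^{6}) \cdot (1.6 \cdot 10^{8})
  = 1.92 \cdot 10^{14}
  \approx 2 \cdot 10^{14}.
```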

Stochastic Gradient Descent (SGD)

Approximate

$$\sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i) \quad \text{by} \quad \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i)$$

for some $i$ chosen uniformly at random from $\{1, \dots, m\}$.

In expectation we have

$$\mathbb{E}\, \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i) = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i).$$

The SGD Algorithm

Goal: Find a stationary point of the function $F = \sum_{i=1}^m F_i : \mathbb{R}^N \to \mathbb{R}$.

1 Set starting value $u_0$ and $n = 0$
2 while (error is large) do:
    Pick $i \in \{1, \dots, m\}$ uniformly at random
    update $u_{n+1} = u_n - \eta\, \nabla F_i(u_n)$
    $n = n + 1$
3 return $u_n$
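A minimal sketch of this loop in Python is given below. It replaces the "error is large" test by a fixed number of steps, which is an assumption made to keep the example short; the callback `grad_fi(i, u)`, returning $\nabla F_i(u)$, and the toy objective are likewise hypothetical.

```python
import random


def sgd(grad_fi, m, u0, eta, num_steps, seed=0):
    # grad_fi(i, u) must return the gradient of F_i at u as a list.
    rng = random.Random(seed)
    u = list(u0)
    for _ in range(num_steps):
        i = rng.randrange(m)                 # pick i uniformly at random
        g = grad_fi(i, u)
        u = [uj - eta * gj for uj, gj in zip(u, g)]
    return u


# Toy problem: F_i(u) = (u - c_i)^2, so F is minimized at the mean of the c_i.
centers = [1.0, 2.0, 3.0, 4.0]
u_final = sgd(lambda i, u: [2.0 * (u[0] - centers[i])],
              len(centers), [0.0], eta=0.01, num_steps=5000)
print(u_final)
```

With a constant learning rate the iterate does not settle exactly at the minimizer but keeps fluctuating around it.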

Typical Behavior

Figure: Comparison between GD and SGD. $m$ steps of SGD are counted as one iteration.

Initially very fast convergence, followed by stagnation!

Minibatch SGD

For every $\{i_1, \dots, i_K\} \subset \{1, \dots, m\}$ chosen uniformly at random, it holds that

$$\mathbb{E}\, \frac{1}{K} \sum_{k=1}^K \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_{i_k}, y_{i_k}) = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i),$$

i.e., we have an unbiased estimator for the gradient.

$K = 1$: SGD
$K > 1$: Minibatch SGD with batch size $K$.
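The unbiasedness claim can be checked exactly on a toy example by enumerating every size-$K$ subset. In the sketch below, scalar values stand in for the per-sample gradients; the numbers are arbitrary and the helper name `minibatch_means` is an assumption for this example.

```python
from fractions import Fraction
from itertools import combinations


def minibatch_means(values, K):
    # Mean over every subset {i_1, ..., i_K} of the index set.
    return [sum(batch, Fraction(0)) / K for batch in combinations(values, K)]


# Scalars standing in for the m = 5 per-sample gradients.
values = [Fraction(v) for v in (3, -1, 4, 1, -5)]
full_mean = sum(values) / len(values)

means = minibatch_means(values, 2)
expected_minibatch_mean = sum(means) / len(means)  # uniform over all subsets
print(expected_minibatch_mean == full_mean)
```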

Some Heuristics

The sample mean $\frac{1}{K} \sum_{k=1}^K \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_{i_k}, y_{i_k})$ is itself a random variable that has expected value $\frac{1}{m} \sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i)$.

In order to assess the deviation of the sample mean from its expected value we may compute its standard deviation $\sigma / \sqrt{K}$, where $\sigma$ is the standard deviation of $i \mapsto \nabla_{((W_l,b_l))_{l=1}^L} L(f, x_i, y_i)$.

Increasing the batch size by a factor 100 improves the standard deviation by only a factor 10, while the complexity increases by a factor 100!

Common batch sizes for large models: $K = 16, 32$.
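Writing $K$ for the batch size, the factor-10 statement is just the square-root scaling of the standard deviation of a sample mean (assuming independently drawn samples):

```latex
\frac{\sigma}{\sqrt{100\,K}} = \frac{1}{10} \cdot \frac{\sigma}{\sqrt{K}},
\qquad \text{while} \qquad
\mathrm{cost}(100\,K) = 100 \cdot \mathrm{cost}(K).
```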

5 The Basic Recipe

The basic Neural Network Recipe for Learning

1 Neuro-inspired model
2 Backprop
3 Minibatch SGD

Now let's try classifying handwritten digits!

Results

MNIST dataset, 30 epochs, learning rate $\eta = 3.0$, minibatch size $K = 10$, training set size $m = 50000$, test set size $= 10000$.

Network size [784, 30, 10]: classification accuracy 94.84%.
Network size [784, 30, 30, 10]: classification accuracy 95.81%.
Network size [784, 30, 30, 30, 10]: classification accuracy 95.07%.

Deep learning might not help after all...

6 Going Deep (?)

Problems with Deep Networks

Overfitting (as usual...)
Vanishing/Exploding Gradient Problem

Dealing with Overfitting: Regularization

Rather than minimizing

$$\sum_{i=1}^m L(f, x_i, y_i),$$

minimize for example

$$\sum_{i=1}^m L(f, x_i, y_i) + \lambda\, \Omega((W_l)_{l=1}^L), \qquad \Omega((W_l)_{l=1}^L) = \sum_{l,i,j} |(W_l)_{i,j}|^p.$$

The gradient update has to be augmented by

$$\lambda\, \frac{\partial}{\partial (W_l)_{i,j}} \Omega((W_l)_{l=1}^L) = \lambda\, p\, |(W_l)_{i,j}|^{p-1}\, \mathrm{sgn}((W_l)_{i,j}).$$
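The penalty and its gradient translate directly into code. The sketch below assumes $p \ge 1$ (for $p < 1$ the expression is singular at zero weights) and stores each weight matrix as a list of rows; the function names are illustrative.

```python
def sign(w):
    return (w > 0) - (w < 0)


def penalty(weights, p):
    # Omega = sum over all layers l and entries (i, j) of |W_l[i][j]|^p.
    return sum(abs(w) ** p for W in weights for row in W for w in row)


def penalty_grad(weights, p, lam):
    # lam * p * |w|^(p-1) * sgn(w), entry by entry; assumes p >= 1.
    return [[[lam * p * abs(w) ** (p - 1) * sign(w) for w in row]
             for row in W]
            for W in weights]


weights = [[[1.0, -3.0], [0.0, 2.0]]]
print(penalty(weights, 2))
print(penalty_grad(weights, 2, 0.5))
```

These terms are added to the backprop gradients of the loss before the (S)GD update.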

Sparsity-Promoting Regularization

Since

$$\lim_{p \to 0} \sum_{l,i,j} |(W_l)_{i,j}|^p = \#\,\text{nonzero weights},$$

regularization with $p \le 1$ promotes sparse connectivity (and hence small memory requirements)!

Dropout

During each feedforward/backprop step, drop nodes with probability $p$.

After training, multiply all weights by $1 - p$, the probability that a node is kept.

The final output is an average over many sparse network models.
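A rough sketch of the two phases is shown below. It makes its convention explicit: `p_drop` is the probability that a node is removed, so a node survives with probability `1 - p_drop`, and the weights are rescaled by that keep probability after training. Both function names are hypothetical.

```python
import random


def dropout_forward(activations, p_drop, rng):
    # Training time: zero each node independently with probability p_drop.
    mask = [0.0 if rng.random() < p_drop else 1.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def scale_weights_for_inference(W, p_drop):
    # After training: rescale by the keep probability 1 - p_drop so that
    # expected pre-activations match those seen during training.
    keep = 1.0 - p_drop
    return [[keep * w for w in row] for row in W]


rng = random.Random(0)
dropped, mask = dropout_forward([0.3, 0.7, 0.1, 0.9], 0.5, rng)
print(dropped, mask)
print(scale_weights_for_inference([[2.0, 4.0]], 0.5))
```

The same mask has to be reused in the backward pass of that step, so that dropped nodes receive no gradient.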

Dataset Augmentation

Use invariances in the dataset to generate more data!

Sometimes noise is also added to the weights to favour robust stationary points.

The Vanishing Gradient Problem

Figure: Extremely Deep Network

$$\Phi(x) = w_5\, \sigma(w_4\, \sigma(w_3\, \sigma(w_2\, \sigma(w_1 x + b_1) + b_2) + b_3) + b_4) + b_5$$

$$\frac{\partial \Phi(x)}{\partial b_1} = \prod_{l=2}^{L} w_l \cdot \prod_{l=1}^{L-1} \sigma'(z_l)$$

If $\sigma(x) = \frac{1}{1+e^{-x}}$, it holds that $|\sigma'(x)| \le 2\, e^{-|x|}$, and thus

$$\left| \frac{\partial \Phi(x)}{\partial b_1} \right| \le 2^{L-1} \prod_{l=2}^{L} |w_l| \cdot e^{-\sum_{l=1}^{L-1} |z_l|}.$$

Bottom layers will learn much slower than top layers and not contribute to learning.

Is depth a nuisance!?
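The decay of $\partial \Phi / \partial b_1$ with depth can be observed numerically on the scalar chain network. In this sketch `ws` and `bs` hold $w_1, \dots, w_L$ and $b_1, \dots, b_L$; the choice of unit weights and zero biases in the loop is purely illustrative.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_prime(t):
    s = sigmoid(t)
    return s * (1.0 - s)


def phi(ws, bs, x):
    # Phi(x) = w_L sigma(... sigma(w_1 x + b_1) ...) + b_L
    a = x
    for w, b in zip(ws[:-1], bs[:-1]):
        a = sigmoid(w * a + b)
    return ws[-1] * a + bs[-1]


def dphi_db1(ws, bs, x):
    # prod_{l=2}^{L} w_l * prod_{l=1}^{L-1} sigma'(z_l)
    grad, a = 1.0, x
    for l in range(len(ws) - 1):
        z = ws[l] * a + bs[l]
        grad *= sigmoid_prime(z) * ws[l + 1]
        a = sigmoid(z)
    return grad


for depth in (2, 5, 10, 20):
    print(depth, dphi_db1([1.0] * depth, [0.0] * depth, 0.0))
```

Since $\sigma'(z) \le 1/4$ for the sigmoid, every additional layer shrinks the gradient by at least a factor 4 when the weights have magnitude at most 1.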

Dealing with the Vanishing Gradient Problem

Use an activation function with large gradient.

ReLU: The Rectified Linear Unit is defined as

$$\mathrm{ReLU}(x) := \begin{cases} x & x > 0 \\ 0 & \text{else} \end{cases}$$
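As a small illustration, ReLU and its derivative can be written as below. The value of the derivative at $x = 0$ is a convention (set to 0 here), and the comparison with the sigmoid derivative is an added example.

```python
import math


def relu(x):
    return x if x > 0 else 0.0


def relu_prime(x):
    # Derivative is 1 on the active side; 0 at x = 0 by convention.
    return 1.0 if x > 0 else 0.0


def sigmoid_prime(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# Product of ten activation derivatives along an active path (z = 2.0).
print(relu_prime(2.0) ** 10, sigmoid_prime(2.0) ** 10)
```

On active paths the ReLU factors in the backprop product equal 1, so they do not shrink the gradient the way sigmoid factors do.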

7 Convolutional Neural Networks

Is there a cat in this image?

Suppose we have a cat-filter

$$W = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix}.$$

(Cat-Selection) C-S Inequality: For any

$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix}$$

we have

$$X \cdot W \le (X \cdot X)^{1/2}\, (W \cdot W)^{1/2}$$

with equality if and only if $X$ is parallel to a cat.

Perform the cat-test on all $3 \times 3$ image subpatches!
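The cat-test amounts to sliding the filter over the image and comparing $X \cdot W$ with its Cauchy-Schwarz upper bound. The sketch below uses the normalized score $X \cdot W / (\|X\|\,\|W\|)$, which equals 1 exactly when the patch is a positive multiple of the filter; the names `match_score` and `best_patch` and the toy filter are assumptions for illustration.

```python
import math


def inner(X, W):
    # Entrywise inner product X . W of two equally sized patches.
    return sum(x * w for rx, rw in zip(X, W) for x, w in zip(rx, rw))


def match_score(X, W):
    denom = math.sqrt(inner(X, X)) * math.sqrt(inner(W, W))
    return inner(X, W) / denom if denom > 0 else 0.0


def best_patch(image, W):
    # Test every F x F subpatch and return (score, row, column) of the best.
    F = len(W)
    best = None
    for i in range(len(image) - F + 1):
        for j in range(len(image[0]) - F + 1):
            patch = [row[j:j + F] for row in image[i:i + F]]
            s = match_score(patch, W)
            if best is None or s > best[0]:
                best = (s, i, j)
    return best


W = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
image = [[0] * 5 for _ in range(5)]
for k in range(3):
    for l in range(3):
        image[1 + k][2 + l] = 2 * W[k][l]   # hide a scaled copy of W
print(best_patch(image, W))
```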

Convolution

Definition: Suppose that $X, Y \in \mathbb{R}^{n \times n}$. Then $Z = X * Y \in \mathbb{R}^{n \times n}$ is defined as

$$Z[i, j] = \sum_{k,l=0}^{n-1} X[i-k, j-l]\, Y[k, l],$$

where periodization or zero-padding of $X, Y$ is used if $i-k$ or $j-l$ is not in $\{0, \dots, n-1\}$.

Efficient computation is possible via FFT (or directly if $X$ or $Y$ are sparse)!
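A direct evaluation of the definition, using periodization for out-of-range indices, might look as follows. This is the naive $O(n^4)$ version intended only to make the index pattern explicit; the function name `conv2d_periodic` is an assumption.

```python
def conv2d_periodic(X, Y):
    # Z[i][j] = sum_{k,l} X[i-k][j-l] * Y[k][l], indices taken modulo n.
    n = len(X)
    return [[sum(X[(i - k) % n][(j - l) % n] * Y[k][l]
                 for k in range(n) for l in range(n))
             for j in range(n)]
            for i in range(n)]


X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delta = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]   # identity element of convolution
shift = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]   # shifts columns by one (cyclically)
print(conv2d_periodic(X, delta))
print(conv2d_periodic(X, shift))
```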

Example: Detecting Vertical Edges

Given the vertical-edge-detection filter

$$W = \begin{pmatrix} 1 & 2 & 1 \\ 1 & 2 & 1 \\ 1 & 2 & 1 \end{pmatrix}$$

[Figure: an image convolved with $W$ = the filtered image]

Introducing Convolutional Nodes

A convolutional node accepts as input a stack of images, e.g. $X \in \mathbb{R}^{n_1 \times n_2 \times S}$.

Given a filter $W \in \mathbb{R}^{F \times F \times S}$, where $F$ is the spatial extent, and a bias $b \in \mathbb{R}$, it computes a matrix

$$Z = W *_{12} X + b, \qquad W *_{12} X := \sum_{i=1}^{S} X[:, :, i] * W[:, :, i].$$

Introducing Convolutional Layers

A convolutional layer consists of $K$ convolutional nodes $((W_i, b_i))_{i=1}^K \subset \mathbb{R}^{F \times F \times S} \times \mathbb{R}$ and produces as output a stack $Z \in \mathbb{R}^{n_1 \times n_2 \times K}$ via

$$Z[:, :, i] = W_i *_{12} X + b_i.$$

A convolutional layer can be written as a conventional neural network layer!
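A plain-Python sketch of a convolutional node and layer follows. It assumes periodic boundary handling, stores the input as nested lists `X[a][b][s]` and each filter as `W[k][l][s]`, and adds the bias once per node; the function names are hypothetical.

```python
def conv_node(X, W, b):
    # Z[i][j] = b + sum_s sum_{k,l} X[i-k][j-l][s] * W[k][l][s] (periodic).
    n1, n2, S = len(X), len(X[0]), len(X[0][0])
    F = len(W)
    return [[b + sum(X[(i - k) % n1][(j - l) % n2][s] * W[k][l][s]
                     for k in range(F) for l in range(F) for s in range(S))
             for j in range(n2)]
            for i in range(n1)]


def conv_layer(X, nodes):
    # nodes is a list of K pairs (W_i, b_i); output has shape n1 x n2 x K.
    maps = [conv_node(X, W, b) for W, b in nodes]
    n1, n2 = len(X), len(X[0])
    return [[[m[i][j] for m in maps] for j in range(n2)] for i in range(n1)]


# 2 x 2 input with S = 2 channels, and K = 2 nodes with 1 x 1 filters.
X = [[[1, 10], [2, 20]], [[3, 30], [4, 40]]]
nodes = [([[[1, 1]]], 0.0),      # sums both channels
         ([[[1, 0]]], 0.5)]      # picks channel 0 and adds a bias
print(conv_layer(X, nodes))
```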

Activation Layers

The activation layer is defined in the same way as before, e.g., $Z \in \mathbb{R}^{n_1 \times n_2 \times K}$ is mapped to $A = \mathrm{ReLU}(Z)$, where ReLU is applied component-wise.

Pooling Layers

Reduce dimensionality after filtering.

Definition: A pooling operator $R$ acts layer-wise on a tensor $X \in \mathbb{R}^{n_1 \times n_2 \times S}$ to result in a tensor $R(X) \in \mathbb{R}^{m_1 \times m_2 \times S}$, where $m_1 < n_1$ and $m_2 < n_2$.

Downsampling
Max-pooling
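Both pooling variants are easy to sketch for a single channel; applying them to each of the $S$ channels separately gives the layer-wise operator $R$. The window size and stride of 2 are illustrative defaults, and non-overlapping windows are an assumption of this example.

```python
def downsample(X, stride=2):
    # Keep every stride-th row and column.
    return [row[::stride] for row in X[::stride]]


def max_pool(X, size=2):
    # Maximum over non-overlapping size x size windows.
    n1, n2 = len(X), len(X[0])
    return [[max(X[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, n2 - size + 1, size)]
            for i in range(0, n1 - size + 1, size)]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(downsample(X))
print(max_pool(X))
```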

Convolutional Neural Networks (CNNs)

Definition: A CNN with $L$ layers consists of $L$ iterative applications of a convolutional layer, followed by an activation layer, (possibly) followed by a pooling layer.

Typical architectures consist of a CNN (as a feature extractor), followed by a fully connected NN (as a classifier).

Figure: LeNet (1998, LeCun et al.): the first successful CNN architecture, used for reading handwritten digits

Feature Extractor vs. Classifier

[Figure: feature extractor followed by a classifier with output labels D. Trump, B. Sanders, A. Merkel, B. Johnson]

Python Library

TensorFlow, developed by Google Brain, based on symbolic computational graphs: www.tensorflow.org.

8 What I didn't tell you

1 Data structures & algorithms for efficient deep learning (computational graphs, automatic differentiation, adaptive learning rate, hardware, ...)
2 Things to do besides regression or classification
3 Finetuning (choice of activation function, choice of loss function, ...)
4 More sophisticated training procedures for feature extractor layers (Autoencoder, Restricted Boltzmann Machines, ...)
5 Recurrent Neural Networks

Questions?