Neural Network Back-propagation HYUNG IL KOO
Hidden Layer Representations. Backpropagation can discover useful intermediate representations at the hidden layers of the network, representations that capture the properties of the input space most relevant to learning the target function. With more layers of units, more complex features can be learned. However, the representations formed at the hidden layers are often hard for humans to interpret.
Basic Math
Optimization: find x that minimizes f(x). If f(x) is differentiable, a necessary condition is f′(x) = 0. But in many cases, solving this equation is still a difficult problem.
Gradient descent (figure: the curve y = f(x) with successive iterates x₁, x₂, x₃ moving downhill toward the minimum).
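As a minimal sketch of the idea (not from the slides), the update x ← x − μ f′(x) is iterated until it stops changing; the objective and step size below are illustrative assumptions.

```python
# Minimal 1-D gradient descent sketch (illustrative f(x) and step size).
def f(x):
    return (x - 3.0) ** 2 + 1.0      # example objective, minimum at x = 3

def df(x):
    return 2.0 * (x - 3.0)           # its derivative

x = 0.0                              # initial guess
mu = 0.1                             # learning rate (step size)
for _ in range(100):
    x = x - mu * df(x)               # move against the gradient
print(x, f(x))                       # x is close to 3, f(x) close to 1
```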
Chain Rule. Single variable: dw/dt = (dw/dx)(dx/dt). Multiple variables: if w = f(x, y, z), then dw/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt) + (∂f/∂z)(dz/dt), and Δw ≈ (∂f/∂x)Δx + (∂f/∂y)Δy + (∂f/∂z)Δz.
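A quick numeric illustration of the multivariable chain rule (a toy f and parameterization of my own choosing, not from the slides): the chain-rule value of dw/dt matches a finite-difference estimate.

```python
import math

# w = f(x, y, z) = x*y + z^2, with x = t, y = t^2, z = sin(t)
def dw_dt_chain(t):
    x, y, z = t, t**2, math.sin(t)
    # (df/dx)(dx/dt) + (df/dy)(dy/dt) + (df/dz)(dz/dt)
    return y * 1.0 + x * 2.0 * t + 2.0 * z * math.cos(t)

def dw_dt_numeric(t, h=1e-6):
    def w(t):
        x, y, z = t, t**2, math.sin(t)
        return x * y + z**2
    return (w(t + h) - w(t - h)) / (2 * h)

print(dw_dt_chain(1.3), dw_dt_numeric(1.3))   # the two values agree
```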
Feed-forward neural network
Feed-forward network example: 1st layer (diagram: inputs plus a bias unit +1, connected through weights w₁₁, w₁₂, w₂₁, w₂₂, w₃₁, w₃₂ to the 1st hidden layer, followed by a non-linear function).
Feed-forward network example: 2nd layer (diagram: the 1st hidden layer plus a bias unit +1, connected through weights u₁₁, u₁₂, u₁₃, u₂₁, u₂₂, u₂₃ to the 2nd hidden layer).
Forward propagation (diagram: Layer 1 → Layer 2 → Softmax, i.e., the 1st hidden layer, the 2nd hidden layer, and the output).
Forward propagation, matrix representation (the same Layer 1 → Layer 2 → Softmax pipeline written with weight matrices and bias vectors).
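A small numpy sketch of the forward pass in matrix form (the layer sizes, tanh non-linearity, and random weights are my own assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 inputs, a hidden layer of 5 units, 3 output classes.
W1, b1 = rng.standard_normal((5, 4)), np.zeros((5, 1))   # 1st layer weights
U,  b2 = rng.standard_normal((3, 5)), np.zeros((3, 1))   # 2nd layer weights

def softmax(z):
    e = np.exp(z - z.max(axis=0))    # subtract the max for numerical stability
    return e / e.sum(axis=0)

x  = rng.standard_normal((4, 1))     # input column vector
h1 = np.tanh(W1 @ x + b1)            # 1st hidden layer: affine map + non-linearity
z2 = U @ h1 + b2                     # 2nd layer pre-activation
y  = softmax(z2)                     # output probabilities
print(y.ravel(), y.sum())            # probabilities sum to 1
```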
Back-propagation algorithm
Weight update method. Parameter update by gradient descent: w_new = w_old − μ ∂L/∂w, where L is the loss function and μ is the learning rate.
Dataflow diagram (Layer 1 → Layer 2 → Softmax → Output, compared against the ground truth).
Back-propagation step: the loss function. The softmax output is compared with the ground truth to compute the loss L, e.g., cross entropy: L = −Σᵢ tᵢ log yᵢ.
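A small sketch of the cross-entropy loss between a softmax output and a one-hot ground truth (the numbers are illustrative placeholders):

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])        # softmax output (example values)
t = np.array([1.0, 0.0, 0.0])        # one-hot ground truth

# Cross entropy: L = -sum_i t_i * log(y_i)
L = -np.sum(t * np.log(y + 1e-12))   # small epsilon avoids log(0)
print(L)                             # ~0.357 = -log(0.7)
```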
Overview (diagram: forward pass through Layer 1, Layer 2, and Softmax to the output, which is compared with the one-hot ground truth, e.g., (1, 0), to compute the loss; the loss is then propagated backward through the same layers).
Back-propagation: 2nd layer. Given the gradient arriving from the softmax/loss block, Layer 2 has to do two things: update its own weights, and propagate the error back to the 1st hidden layer.
Error propagation (feed-forward network). Diagram of the 1st and 2nd hidden layers with bias units (+1); the gradients with respect to z₁ and z₂ are provided by the upper layer.
Weight updates (feed-forward network). Same diagram; the gradients with respect to z₁ and z₂ are provided by the upper layer.
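To make the two operations concrete, here is a numpy sketch of one hidden layer that receives the gradients ∂L/∂z from its upper layer, propagates the error to its inputs, and updates its weights (the shapes, tanh activation, gradient values, and learning rate are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, 3)), np.zeros((2, 1))   # layer weights and bias
x  = rng.standard_normal((3, 1))                       # layer input
mu = 0.01                                              # learning rate

# Forward pass: y = W x + b, z = tanh(y)
y = W @ x + b
z = np.tanh(y)

# Gradients dL/dz1, dL/dz2 arriving from the upper layer (example values)
dL_dz = np.array([[0.3], [-0.5]])

# Error propagation through the non-linearity and the weights
dL_dy = (1.0 - np.tanh(y) ** 2) * dL_dz   # diag(phi'(y)) dL/dz
dL_dx = W.T @ dL_dy                       # gradient passed to the lower layer

# Weight update: w_ij_new = w_ij_old - mu * dL/dw_ij
dL_dW = dL_dy @ x.T
dL_db = dL_dy
W -= mu * dL_dW
b -= mu * dL_db
```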
Back-propagation: 1st layer. Layer 1 has to do the same two things: update its weights, w_ij_new = w_ij_old − μ ∂L/∂w_ij, and propagate the error further back. (Could the input itself be updated as well? See the input-optimization section below.)
Weight updates (feed-forward network): w_ij_new = w_ij_old − μ ∂L/∂w_ij. Diagram of the 1st hidden layer with a bias unit (+1); the gradients with respect to y₁, y₂, and y₃ are provided by the upper layer.
Block-based perspective
Basic Math. Let Y = XᵀAX. Perturbing X: Y + ΔY = (X + ΔX)ᵀA(X + ΔX) = XᵀAX + ΔXᵀAX + XᵀAΔX + ΔXᵀAΔX ≈ XᵀAX + Xᵀ(A + Aᵀ)ΔX, so ∂Y/∂X = Xᵀ(A + Aᵀ). Perturbing A: Y + ΔY = Xᵀ(A + ΔA)X = XᵀAX + XᵀΔAX, so ΔY = XᵀΔAX = tr(XᵀΔAX) = tr(XXᵀΔA), hence ∂Y/∂A = XXᵀ.
Block-based representation. Forward: Y = WX + b, Z = φ(Y). Backward: ∂L/∂X = Wᵀ ∂L/∂Y, ∂L/∂Y = diag(φ′(Y)) ∂L/∂Z, ∂L/∂W = (∂L/∂Y) Xᵀ, ∂L/∂b = ∂L/∂Y.
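A numpy sketch of the block as a forward/backward pair implementing the equations above (using tanh as φ, which is my own choice for illustration):

```python
import numpy as np

def forward(W, b, X):
    Y = W @ X + b                              # affine part of the block
    Z = np.tanh(Y)                             # element-wise non-linearity phi
    return Y, Z

def backward(W, X, Y, dL_dZ):
    dL_dY = (1.0 - np.tanh(Y) ** 2) * dL_dZ    # diag(phi'(Y)) dL/dZ
    dL_dX = W.T @ dL_dY                        # error propagated to the input
    dL_dW = dL_dY @ X.T                        # weight gradient
    dL_db = dL_dY                              # bias gradient
    return dL_dX, dL_dW, dL_db
```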
Forward propagation (block-based representation): Layer 1 → Layer 2 → Softmax → Output, with each layer drawn as a block.
Backward propagation, 2nd layer (block-based representation). The output is compared with the ground truth; the resulting gradient flows back through the softmax block into the Layer 2 block, which performs error propagation toward Layer 1 and computes its own weight update.
Backward propagation, 1st layer (block-based representation). The gradient propagated from Layer 2 reaches the Layer 1 block, which performs error propagation toward the input and computes its own weight update.
Input Optimization while fixing all weights
Input update (feed-forward network). Diagram of the 1st hidden layer with a bias unit (+1); the gradients with respect to y₁, y₂, and y₃ are provided by the upper layer.
Input optimization example (building image): the network output is compared with the desired output to compute the loss, and the input is updated as I_new = I_old − μ ∂L/∂I while all weights stay fixed.
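A toy numpy sketch of input optimization: the weights of a small fixed network are frozen and only the input is updated by gradient descent to reduce the loss. The network size, target output, loss, and step size are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))          # fixed (pre-trained) weights
target = np.array([[1.0], [0.0]])        # desired output
I = rng.standard_normal((3, 1))          # input to be optimized
mu = 0.1                                 # learning rate

for _ in range(200):
    y = np.tanh(W @ I)                       # forward pass with fixed W
    dL_dy = y - target                       # gradient of L = 0.5*||y - target||^2
    dL_dI = W.T @ ((1.0 - y**2) * dL_dy)     # backprop to the input only
    I = I - mu * dL_dI                       # I_new = I_old - mu * dL/dI
print(np.tanh(W @ I).ravel())                # output approaches the target
```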
Inceptionism: Going Deeper into Neural Networks https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
Inceptionism: Going Deeper into Neural Networks
DEEP INSIDE CONVOLUTIONAL NETWORKS: Inputs maximizing class score
Inputs maximizing class score. Objective function (to be maximized): S_c(I) − λ‖I‖₂². K. Simonyan, A. Vedaldi, A. Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR Workshop 2014.
Maximizing the class score: objective function S_c(I) − λ‖I‖₂² (to be maximized), illustrated for the goose class.
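A sketch of maximizing S_c(I) − λ‖I‖² by gradient ascent on the input. The score function below is a stand-in toy linear model, not the ConvNet used in the paper; λ and the step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10)          # toy linear class-score model: S_c(I) = w . I
I = np.zeros(10)                     # start from a zero "image"
lam, mu = 0.01, 0.1                  # regularization weight and step size

for _ in range(100):
    grad = w - 2.0 * lam * I         # d/dI [ S_c(I) - lam * ||I||^2 ]
    I = I + mu * grad                # gradient *ascent*: the objective is maximized
print(w @ I - lam * I @ I)           # objective value after optimization
```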
Inputs maximizing class score (generated images for the classes dumbbell, cup, dalmatian, bell pepper, lemon, husky).
Inputs maximizing class score (generated images for the classes computer keyboard, kit fox, limousine, washing machine, goose, ostrich).
DEEP INSIDE CONVOLUTIONAL NETWORKS: Saliency visualization
Saliency visualization. Linear score model for class c: S_c(I) ≈ wᵀI + b, where w gives the importance of the corresponding pixels of I for class c.
Saliency visualization (example: objective function for the dog class).
Saliency visualization (dog class). Differentiating the objective function gives w = ∂S_c/∂I evaluated at I = I₀.
Saliency visualization: the saliency map for I₀ is built from the magnitudes of the gradient ∂S_c/∂I at I = I₀.
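A minimal numpy sketch of turning the class-score gradient at I₀ into a saliency map by taking per-pixel gradient magnitudes (the gradient values here are random placeholders; in practice they would come from one backward pass of the ConvNet):

```python
import numpy as np

# Placeholder for dS_c/dI at I = I_0, shape (H, W, 3) for an RGB image
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 4, 3))

# Saliency map: for each pixel, take the maximum absolute gradient over channels
saliency = np.max(np.abs(grad), axis=2)
print(saliency.shape)    # (4, 4): one importance value per pixel
```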
Saliency-map examples for the classes yacht, dog, monkey, washing machine, cow, building.
A NEURAL ALGORITHM OF ARTISTIC STYLE
Loss. Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A Neural Algorithm of Artistic Style." arXiv preprint arXiv:1508.06576 (2015).
Loss
Artistic style
Artistic style
Backups
Vector-by-scalar
Scalar-by-vector
Vector-by-vector
Scalar-by-matrix
Why ∂L/∂W = (∂L/∂Y) Xᵀ? We want ∂L/∂W satisfying ΔL = tr((∂L/∂W)ᵀ ΔW). [Intuitive derivation] From Y = WX, ΔY = ΔW X, so ΔL = (∂L/∂Y)ᵀ ΔY = (∂L/∂Y)ᵀ ΔW X = tr((∂L/∂Y)ᵀ ΔW X) = tr(X (∂L/∂Y)ᵀ ΔW), hence ∂L/∂W = (∂L/∂Y) Xᵀ.
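The identity can also be checked numerically. Below is a small finite-difference check with a toy loss L = ½‖Y‖², so that ∂L/∂Y = Y; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W, X = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))

def loss(W):
    Y = W @ X
    return 0.5 * np.sum(Y**2)        # L = 0.5 * ||Y||^2, so dL/dY = Y

analytic = (W @ X) @ X.T             # dL/dW = (dL/dY) X^T

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9: the two gradients match
```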
Block-based representation with a separate bias block. Forward: Y = WX, Z = Y + b, V = φ(Z). Backward: ∂L/∂X = Wᵀ ∂L/∂Y, ∂L/∂Y = ∂L/∂Z, ∂L/∂Z = diag(φ′(Z)) ∂L/∂V, ∂L/∂W = (∂L/∂Y) Xᵀ, ∂L/∂b = ∂L/∂Z.
Proof (element-wise operation). With zᵢ = φ(yᵢ) for i = 1, …, n, we have ∂L/∂yᵢ = φ′(yᵢ) ∂L/∂zᵢ; stacking these over i gives ∂L/∂Y = diag(φ′(Y)) ∂L/∂Z.
Proof. From Y = WX: ΔY = W ΔX, so ΔL = (∂L/∂Y)ᵀ ΔY = (∂L/∂Y)ᵀ W ΔX = tr((Wᵀ ∂L/∂Y)ᵀ ΔX), hence ∂L/∂X = Wᵀ ∂L/∂Y. Likewise ΔY = ΔW X gives ΔL = tr((∂L/∂Y)ᵀ ΔW X) = tr(X (∂L/∂Y)ᵀ ΔW), hence ∂L/∂W = (∂L/∂Y) Xᵀ.
New Layer Design
New layer addition (diagram: a newly added Layer m with inputs α₁, α₂, α₃ and outputs β₁, β₂, β₃ is inserted between the 1st layer and the Nth layer, followed by Softmax and the output).
New layer design (Layer m). Forward pass: compute the outputs β₁, β₂, β₃ from the inputs α₁, α₂, α₃. Backward pass: given dL/dβ₁, dL/dβ₂, dL/dβ₃ from the upper layer, compute the derivatives with respect to the data, dL/dα₁, dL/dα₂, dL/dα₃, and update the layer's weights (if it has any).
Max pooling layer. Example: 2×2 max pooling.
Input:
 5  6  3  9
 2  1  4 11
17 14  3 21
11 13 12  9
Output:
 6 11
17 21
Derivatives of max. Forward pass: z = max(x, y). Backward pass: dL/dx = (dL/dz)(dz/dx) and dL/dy = (dL/dz)(dz/dy), where dz/dx = 1 if x attains the maximum and 0 otherwise (and similarly for dz/dy), so dL/dz is routed entirely to the input that was the maximum.
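A numpy sketch of a 2×2 max-pooling layer with the forward/backward interface described above: the forward pass keeps the maximum of each 2×2 block, and the backward pass routes dL/dz back to the position of that maximum. Non-overlapping windows with stride 2 and an even-sized input are assumed.

```python
import numpy as np

def maxpool2x2_forward(x):
    """x: (H, W) with even H and W. Returns pooled output and an argmax mask."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x)                   # remembers where each max came from
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            block = x[i:i+2, j:j+2]
            out[i // 2, j // 2] = block.max()
            r, c = np.unravel_index(block.argmax(), (2, 2))
            mask[i + r, j + c] = 1.0
    return out, mask

def maxpool2x2_backward(dL_dout, mask):
    """Route the upstream gradient to the max positions; all others get 0."""
    return np.repeat(np.repeat(dL_dout, 2, axis=0), 2, axis=1) * mask

x = np.array([[ 5,  6,  3,  9],
              [ 2,  1,  4, 11],
              [17, 14,  3, 21],
              [11, 13, 12,  9]], dtype=float)
out, mask = maxpool2x2_forward(x)
print(out)                                    # [[ 6. 11.] [17. 21.]] as in the slide
print(maxpool2x2_backward(np.ones_like(out), mask))
```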