Neural Network Back-propagation HYUNG IL KOO
Hidden Layer Representations. Backpropagation can discover useful intermediate representations at the hidden layers of the network, representations that capture the properties of the input space most relevant to learning the target function. With more layers of units, more complex features can be learned. However, the representations formed at the hidden layers are often hard for humans to interpret.
Basic Math
Optimization: find x that minimizes f(x). If f(x) is differentiable, a necessary condition is f′(x) = 0. But in many cases, solving this equation is still a difficult problem.
Gradient descent (figure: the curve y = f(x) with successive iterates x₁, x₂, x₃ moving downhill toward the minimum).
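As a minimal sketch of the idea (not from the slides), the update x ← x − μ f′(x) is iterated until it stops changing; the objective and step size below are illustrative assumptions.

```python
# Minimal 1-D gradient descent sketch (illustrative f(x) and step size).
def f(x):
    return (x - 3.0) ** 2 + 1.0      # example objective, minimum at x = 3

def df(x):
    return 2.0 * (x - 3.0)           # its derivative

x = 0.0                              # initial guess
mu = 0.1                             # learning rate (step size)
for _ in range(100):
    x = x - mu * df(x)               # move against the gradient
print(x, f(x))                       # x is close to 3, f(x) close to 1
```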
Chain Rule. Single variable: dw/dt = (dw/dx)(dx/dt). Multiple variables: if w = f(x, y, z), then dw/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt) + (∂f/∂z)(dz/dt), and Δw ≈ (∂f/∂x)Δx + (∂f/∂y)Δy + (∂f/∂z)Δz.
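A quick numeric illustration of the multivariable chain rule (a toy f and parameterization of my own choosing, not from the slides): the chain-rule value of dw/dt matches a finite-difference estimate.

```python
import math

# w = f(x, y, z) = x*y + z^2, with x = t, y = t^2, z = sin(t)
def dw_dt_chain(t):
    x, y, z = t, t**2, math.sin(t)
    # (df/dx)(dx/dt) + (df/dy)(dy/dt) + (df/dz)(dz/dt)
    return y * 1.0 + x * 2.0 * t + 2.0 * z * math.cos(t)

def dw_dt_numeric(t, h=1e-6):
    def w(t):
        x, y, z = t, t**2, math.sin(t)
        return x * y + z**2
    return (w(t + h) - w(t - h)) / (2 * h)

print(dw_dt_chain(1.3), dw_dt_numeric(1.3))   # the two values agree
```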
Feed-forward neural network
Feed-forward network example: 1st layer (diagram: inputs plus a bias unit +1, connected through weights w₁₁, w₁₂, w₂₁, w₂₂, w₃₁, w₃₂ to the 1st hidden layer, followed by a non-linear function).
Feed-forward network example: 2nd layer (diagram: the 1st hidden layer plus a bias unit +1, connected through weights u₁₁, u₁₂, u₁₃, u₂₁, u₂₂, u₂₃ to the 2nd hidden layer).
Forward propagation (diagram: Layer 1 → Layer 2 → Softmax, i.e., the 1st hidden layer, the 2nd hidden layer, and the output).
Forward propagation, matrix representation (the same Layer 1 → Layer 2 → Softmax pipeline written with weight matrices and bias vectors).
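A small numpy sketch of the forward pass in matrix form (the layer sizes, tanh non-linearity, and random weights are my own assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 inputs, a hidden layer of 5 units, 3 output classes.
W1, b1 = rng.standard_normal((5, 4)), np.zeros((5, 1))   # 1st layer weights
U,  b2 = rng.standard_normal((3, 5)), np.zeros((3, 1))   # 2nd layer weights

def softmax(z):
    e = np.exp(z - z.max(axis=0))    # subtract the max for numerical stability
    return e / e.sum(axis=0)

x  = rng.standard_normal((4, 1))     # input column vector
h1 = np.tanh(W1 @ x + b1)            # 1st hidden layer: affine map + non-linearity
z2 = U @ h1 + b2                     # 2nd layer pre-activation
y  = softmax(z2)                     # output probabilities
print(y.ravel(), y.sum())            # probabilities sum to 1
```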
Back-propagation algorithm
Weight update method. Parameter update by gradient descent: w_new = w_old − μ ∂L/∂w, where L is the loss function and μ is the learning rate.
Dataflow diagram (Layer 1 → Layer 2 → Softmax → Output, compared against the ground truth).
Back-propagation step: the loss function. The softmax output is compared with the ground truth to compute the loss L, e.g., cross entropy: L = −Σᵢ tᵢ log yᵢ.
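A small sketch of the cross-entropy loss between a softmax output and a one-hot ground truth (the numbers are illustrative placeholders):

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])        # softmax output (example values)
t = np.array([1.0, 0.0, 0.0])        # one-hot ground truth

# Cross entropy: L = -sum_i t_i * log(y_i)
L = -np.sum(t * np.log(y + 1e-12))   # small epsilon avoids log(0)
print(L)                             # ~0.357 = -log(0.7)
```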
Overview (diagram: forward pass through Layer 1, Layer 2, and Softmax to the output, which is compared with the one-hot ground truth, e.g., (1, 0), to compute the loss; the loss is then propagated backward through the same layers).
Back-propagation: 2nd layer. Given the gradient arriving from the softmax/loss block, Layer 2 has to do two things: update its own weights, and propagate the error back to the 1st hidden layer.
Error propagation (feed-forward network). Diagram of the 1st and 2nd hidden layers with bias units (+1); the gradients with respect to z₁ and z₂ are provided by the upper layer.
Weight updates (feed-forward network). Same diagram; the gradients with respect to z₁ and z₂ are provided by the upper layer.
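To make the two operations concrete, here is a numpy sketch of one hidden layer that receives the gradients ∂L/∂z from its upper layer, propagates the error to its inputs, and updates its weights (the shapes, tanh activation, gradient values, and learning rate are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, 3)), np.zeros((2, 1))   # layer weights and bias
x  = rng.standard_normal((3, 1))                       # layer input
mu = 0.01                                              # learning rate

# Forward pass: y = W x + b, z = tanh(y)
y = W @ x + b
z = np.tanh(y)

# Gradients dL/dz1, dL/dz2 arriving from the upper layer (example values)
dL_dz = np.array([[0.3], [-0.5]])

# Error propagation through the non-linearity and the weights
dL_dy = (1.0 - np.tanh(y) ** 2) * dL_dz   # diag(phi'(y)) dL/dz
dL_dx = W.T @ dL_dy                       # gradient passed to the lower layer

# Weight update: w_ij_new = w_ij_old - mu * dL/dw_ij
dL_dW = dL_dy @ x.T
dL_db = dL_dy
W -= mu * dL_dW
b -= mu * dL_db
```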
Back-propagation: 1st layer. Layer 1 has to do the same two things: update its weights, w_ij_new = w_ij_old − μ ∂L/∂w_ij, and propagate the error further back. (Could the input itself be updated as well? See the input-optimization section below.)
Weight updates (feed-forward network): w_ij_new = w_ij_old − μ ∂L/∂w_ij. Diagram of the 1st hidden layer with a bias unit (+1); the gradients with respect to y₁, y₂, and y₃ are provided by the upper layer.
Block-based perspective
Basic Math. Let Y = XᵀAX. Perturbing X: Y + ΔY = (X + ΔX)ᵀA(X + ΔX) = XᵀAX + ΔXᵀAX + XᵀAΔX + ΔXᵀAΔX ≈ XᵀAX + Xᵀ(A + Aᵀ)ΔX, so ∂Y/∂X = Xᵀ(A + Aᵀ). Perturbing A: Y + ΔY = Xᵀ(A + ΔA)X = XᵀAX + XᵀΔAX, so ΔY = XᵀΔAX = tr(XᵀΔAX) = tr(XXᵀΔA), hence ∂Y/∂A = XXᵀ.
Block-based representation. Forward: Y = WX + b, Z = φ(Y). Backward: ∂L/∂X = Wᵀ ∂L/∂Y, ∂L/∂Y = diag(φ′(Y)) ∂L/∂Z, ∂L/∂W = (∂L/∂Y) Xᵀ, ∂L/∂b = ∂L/∂Y.
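A numpy sketch of the block as a forward/backward pair implementing the equations above (using tanh as φ, which is my own choice for illustration):

```python
import numpy as np

def forward(W, b, X):
    Y = W @ X + b                              # affine part of the block
    Z = np.tanh(Y)                             # element-wise non-linearity phi
    return Y, Z

def backward(W, X, Y, dL_dZ):
    dL_dY = (1.0 - np.tanh(Y) ** 2) * dL_dZ    # diag(phi'(Y)) dL/dZ
    dL_dX = W.T @ dL_dY                        # error propagated to the input
    dL_dW = dL_dY @ X.T                        # weight gradient
    dL_db = dL_dY                              # bias gradient
    return dL_dX, dL_dW, dL_db
```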
Forward propagation (block-based representation): Layer 1 → Layer 2 → Softmax → Output, with each layer drawn as a block.
Backward propagation, 2nd layer (block-based representation). The output is compared with the ground truth; the resulting gradient flows back through the softmax block into the Layer 2 block, which performs error propagation toward Layer 1 and computes its own weight update.
Backward propagation, 1st layer (block-based representation). The gradient propagated from Layer 2 reaches the Layer 1 block, which performs error propagation toward the input and computes its own weight update.
Input Optimization while fixing all weights
Input update (feed-forward network). Diagram of the 1st hidden layer with a bias unit (+1); the gradients with respect to y₁, y₂, and y₃ are provided by the upper layer.
Input optimization example (building image): the network output is compared with the desired output to compute the loss, and the input is updated as I_new = I_old − μ ∂L/∂I while all weights stay fixed.
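A toy numpy sketch of input optimization: the weights of a small fixed network are frozen and only the input is updated by gradient descent to reduce the loss. The network size, target output, loss, and step size are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))          # fixed (pre-trained) weights
target = np.array([[1.0], [0.0]])        # desired output
I = rng.standard_normal((3, 1))          # input to be optimized
mu = 0.1                                 # learning rate

for _ in range(200):
    y = np.tanh(W @ I)                       # forward pass with fixed W
    dL_dy = y - target                       # gradient of L = 0.5*||y - target||^2
    dL_dI = W.T @ ((1.0 - y**2) * dL_dy)     # backprop to the input only
    I = I - mu * dL_dI                       # I_new = I_old - mu * dL/dI
print(np.tanh(W @ I).ravel())                # output approaches the target
```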
Inceptionism: Going Deeper into Neural Networks https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
Inceptionism: Going Deeper into Neural Networks
DEEP INSIDE CONVOLUTIONAL NETWORKS: Inputs maximizing class score
Inputs maximizing class score. Objective function (to be maximized): S_c(I) − λ‖I‖₂². K. Simonyan, A. Vedaldi, A. Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR Workshop 2014.
Maximizing the class score: objective function S_c(I) − λ‖I‖₂² (to be maximized), illustrated for the goose class.
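A sketch of maximizing S_c(I) − λ‖I‖² by gradient ascent on the input. The score function below is a stand-in toy linear model, not the ConvNet used in the paper; λ and the step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10)          # toy linear class-score model: S_c(I) = w . I
I = np.zeros(10)                     # start from a zero "image"
lam, mu = 0.01, 0.1                  # regularization weight and step size

for _ in range(100):
    grad = w - 2.0 * lam * I         # d/dI [ S_c(I) - lam * ||I||^2 ]
    I = I + mu * grad                # gradient *ascent*: the objective is maximized
print(w @ I - lam * I @ I)           # objective value after optimization
```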
Inputs maximizing class score (generated images for the classes dumbbell, cup, dalmatian, bell pepper, lemon, husky).
Inputs maximizing class score (generated images for the classes computer keyboard, kit fox, limousine, washing machine, goose, ostrich).
DEEP INSIDE CONVOLUTIONAL NETWORKS: Saliency visualization
Saliency visualization. Linear score model for class c: S_c(I) ≈ wᵀI + b, where w gives the importance of the corresponding pixels of I for class c.
Saliency visualization (example: objective function for the dog class).
Saliency visualization (dog class). Differentiating the objective function gives w = ∂S_c/∂I evaluated at I = I₀.
Saliency visualization: the saliency map for I₀ is built from the magnitudes of the gradient ∂S_c/∂I at I = I₀.
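A minimal numpy sketch of turning the class-score gradient at I₀ into a saliency map by taking per-pixel gradient magnitudes (the gradient values here are random placeholders; in practice they would come from one backward pass of the ConvNet):

```python
import numpy as np

# Placeholder for dS_c/dI at I = I_0, shape (H, W, 3) for an RGB image
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 4, 3))

# Saliency map: for each pixel, take the maximum absolute gradient over channels
saliency = np.max(np.abs(grad), axis=2)
print(saliency.shape)    # (4, 4): one importance value per pixel
```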
Saliency-map examples for the classes yacht, dog, monkey, washing machine, cow, building.
A NEURAL ALGORITHM OF ARTISTIC STYLE
Loss. Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A Neural Algorithm of Artistic Style." arXiv preprint arXiv:1508.06576 (2015).
Loss
Artistic style
Artistic style
Backups
Vector-by-scalar
Scalar-by-vector
Vector-by-vector
Scalar-by-matrix
Why ∂L/∂W = (∂L/∂Y) Xᵀ? We want ∂L/∂W satisfying ΔL = tr((∂L/∂W)ᵀ ΔW). [Intuitive derivation] From Y = WX, ΔY = ΔW X, so ΔL = (∂L/∂Y)ᵀ ΔY = (∂L/∂Y)ᵀ ΔW X = tr((∂L/∂Y)ᵀ ΔW X) = tr(X (∂L/∂Y)ᵀ ΔW), hence ∂L/∂W = (∂L/∂Y) Xᵀ.
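The identity can also be checked numerically. Below is a small finite-difference check with a toy loss L = ½‖Y‖², so that ∂L/∂Y = Y; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W, X = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))

def loss(W):
    Y = W @ X
    return 0.5 * np.sum(Y**2)        # L = 0.5 * ||Y||^2, so dL/dY = Y

analytic = (W @ X) @ X.T             # dL/dW = (dL/dY) X^T

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9: the two gradients match
```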
Block-based representation with a separate bias block. Forward: Y = WX, Z = Y + b, V = φ(Z). Backward: ∂L/∂X = Wᵀ ∂L/∂Y, ∂L/∂Y = ∂L/∂Z, ∂L/∂Z = diag(φ′(Z)) ∂L/∂V, ∂L/∂W = (∂L/∂Y) Xᵀ, ∂L/∂b = ∂L/∂Z.
Proof (element-wise operation). With zᵢ = φ(yᵢ) for i = 1, …, n, we have ∂L/∂yᵢ = φ′(yᵢ) ∂L/∂zᵢ; stacking these over i gives ∂L/∂Y = diag(φ′(Y)) ∂L/∂Z.
Proof. From Y = WX: ΔY = W ΔX, so ΔL = (∂L/∂Y)ᵀ ΔY = (∂L/∂Y)ᵀ W ΔX = tr((Wᵀ ∂L/∂Y)ᵀ ΔX), hence ∂L/∂X = Wᵀ ∂L/∂Y. Likewise ΔY = ΔW X gives ΔL = tr((∂L/∂Y)ᵀ ΔW X) = tr(X (∂L/∂Y)ᵀ ΔW), hence ∂L/∂W = (∂L/∂Y) Xᵀ.
New Layer Design
New layer addition (diagram: a newly added Layer m with inputs α₁, α₂, α₃ and outputs β₁, β₂, β₃ is inserted between the 1st layer and the Nth layer, followed by Softmax and the output).
New layer design (Layer m). Forward pass: compute the outputs β₁, β₂, β₃ from the inputs α₁, α₂, α₃. Backward pass: given dL/dβ₁, dL/dβ₂, dL/dβ₃ from the upper layer, compute the derivatives with respect to the data, dL/dα₁, dL/dα₂, dL/dα₃, and update the layer's weights (if it has any).
Max pooling layer. Example: 2×2 max pooling.
Input:
 5  6  3  9
 2  1  4 11
17 14  3 21
11 13 12  9
Output:
 6 11
17 21
Derivatives of max. Forward pass: z = max(x, y). Backward pass: dL/dx = (dL/dz)(dz/dx) and dL/dy = (dL/dz)(dz/dy), where dz/dx = 1 if x attains the maximum and 0 otherwise (and similarly for dz/dy), so dL/dz is routed entirely to the input that was the maximum.
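A numpy sketch of a 2×2 max-pooling layer with the forward/backward interface described above: the forward pass keeps the maximum of each 2×2 block, and the backward pass routes dL/dz back to the position of that maximum. Non-overlapping windows with stride 2 and an even-sized input are assumed.

```python
import numpy as np

def maxpool2x2_forward(x):
    """x: (H, W) with even H and W. Returns pooled output and an argmax mask."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x)                   # remembers where each max came from
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            block = x[i:i+2, j:j+2]
            out[i // 2, j // 2] = block.max()
            r, c = np.unravel_index(block.argmax(), (2, 2))
            mask[i + r, j + c] = 1.0
    return out, mask

def maxpool2x2_backward(dL_dout, mask):
    """Route the upstream gradient to the max positions; all others get 0."""
    return np.repeat(np.repeat(dL_dout, 2, axis=0), 2, axis=1) * mask

x = np.array([[ 5,  6,  3,  9],
              [ 2,  1,  4, 11],
              [17, 14,  3, 21],
              [11, 13, 12,  9]], dtype=float)
out, mask = maxpool2x2_forward(x)
print(out)                                    # [[ 6. 11.] [17. 21.]] as in the slide
print(maxpool2x2_backward(np.ones_like(out), mask))
```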