Neural Networks: Backpropagation

Seung-Hoon Na
Department of Computer Science, Chonbuk National University
2018.10.25
Jacobian matrix

Two functions f(x, y), g(x, y) of two parameters x, y:
\[
f(x, y) = 3x^2 y, \qquad g(x, y) = 5xy + y^3
\]
Jacobian matrix (numerator layout):
\[
J = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\[2pt] \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6xy & 3x^2 \\ 5y & 5x + 3y^2 \end{bmatrix}
\]
Jacobian matrix (denominator layout):
\[
J^{\mathsf T} = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial x} \\[2pt] \frac{\partial f(x,y)}{\partial y} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix}
\]
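A quick numerical sanity check of this Jacobian (a minimal NumPy sketch of my own, not from the slides; central finite differences approximate each column):

```python
import numpy as np

def F(p):
    # Stack f and g into one vector-valued function of (x, y).
    x, y = p
    return np.array([3 * x**2 * y, 5 * x * y + y**3])

def analytic_J(p):
    # Numerator-layout Jacobian from the slide.
    x, y = p
    return np.array([[6 * x * y, 3 * x**2],
                     [5 * y,     5 * x + 3 * y**2]])

p = np.array([2.0, 3.0])
eps = 1e-6
# One column of the Jacobian per input variable, by central differences.
num_J = np.column_stack([(F(p + eps * e) - F(p - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
assert np.allclose(num_J, analytic_J(p), atol=1e-4)
```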
Jacobian: Generalization

y = f(x): a vector of m scalar-valued functions, each taking a vector x of n inputs:
\[
y_1 = f_1(\mathbf{x}), \;\ldots,\; y_m = f_m(\mathbf{x})
\]
The Jacobian matrix has m rows, one per equation:
\[
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\ \vdots \\ \frac{\partial}{\partial \mathbf{x}} f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}
\]
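The general definition translates directly into a reusable finite-difference routine; a hedged sketch (the helper name `numerical_jacobian` is mine):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at x,
    with one row per output f_i and one column per input x_j."""
    x = np.asarray(x, dtype=float)
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
            for e in np.eye(x.size)]
    return np.column_stack(cols)
```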
Jacobian: Generalization

[Figure slide; illustration only.]
Vector chain rule

The Jacobian of a composition is the product of two other Jacobians:
\[
\frac{\partial}{\partial \mathbf{x}} \mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \cdots & \frac{\partial f_1}{\partial g_k} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial g_1} & \cdots & \frac{\partial f_m}{\partial g_k} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial g_k}{\partial x_1} & \cdots & \frac{\partial g_k}{\partial x_n} \end{bmatrix}
\]
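The product rule can be checked numerically; the sketch below uses the `numerical_jacobian` helper above with example functions of my own choosing:

```python
import numpy as np

g = lambda x: np.array([x[0] + x[1], x[0] * x[1]])       # g: R^2 -> R^2
f = lambda u: np.array([u[0]**2, u[0] + u[1], u[1]**3])  # f: R^2 -> R^3

x0 = np.array([1.5, -0.5])
J_f = numerical_jacobian(f, g(x0))                # df/dg at g(x0), 3 x 2
J_g = numerical_jacobian(g, x0)                   # dg/dx at x0,   2 x 2
J_fg = numerical_jacobian(lambda x: f(g(x)), x0)  # composite,     3 x 2
assert np.allclose(J_fg, J_f @ J_g, atol=1e-4)    # chain rule holds
```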
Vector chain rule: Example

y = f(x):
\[
\mathbf{y} = \begin{bmatrix} y_1(x) \\ y_2(x) \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} \ln(x^2) \\ \sin(3x) \end{bmatrix}
\]
y = f(g(x)): introduce two intermediate variables g_1, g_2:
\[
\mathbf{g} = \begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ 3x \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} f_1(\mathbf{g}) \\ f_2(\mathbf{g}) \end{bmatrix} = \begin{bmatrix} \ln(g_1) \\ \sin(g_2) \end{bmatrix} \tag{1}
\]
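Carrying the chain rule through to the answer (a completion of the example, not shown on the slide itself):
\[
\frac{\partial \mathbf{y}}{\partial x} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x} = \begin{bmatrix} \frac{1}{g_1} & 0 \\ 0 & \cos(g_2) \end{bmatrix}\begin{bmatrix} 2x \\ 3 \end{bmatrix} = \begin{bmatrix} \frac{2x}{g_1} \\ 3\cos(g_2) \end{bmatrix} = \begin{bmatrix} \frac{2}{x} \\ 3\cos(3x) \end{bmatrix}
\]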
Jacobian: Generalization

[Figure slide; illustration only.]
MLP with single hidden layer: Notation

For simplicity, the network has a single hidden layer only.
- o_k: k-th output unit; h_j: j-th hidden unit; x_i: i-th input
- u_kj: weight between the j-th hidden unit and the k-th output unit
- w_ji: weight between the i-th input and the j-th hidden unit
- Bias terms are absorbed into the weight matrices
MLP with single hidden layer: Matrix notation
\[
\mathbf{h} = \max(W\mathbf{x}, \mathbf{0}), \qquad \mathbf{o} = \mathrm{softmax}(U\mathbf{h})
\]
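The forward pass in code, as a minimal NumPy sketch (the shapes and the naive softmax are my assumptions; biases are taken as absorbed into W and U, as the notation slide states):

```python
import numpy as np

def forward(W, U, x):
    h = np.maximum(W @ x, 0.0)  # hidden layer: ReLU of the affine map
    v = U @ h                   # output scores
    e = np.exp(v)
    o = e / e.sum()             # softmax over the scores
    return h, o
```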
Typical Setting for Classification

- K: the number of labels
- Input layer: input values (raw features)
- Output layer: scores of the labels
- Softmax layer:
\[
\mathbf{o} = \mathrm{softmax}(\mathbf{v}), \qquad o_k = \frac{\exp(v_k)}{\sum_i \exp(v_i)} = \frac{\exp(v_k)}{Z}
\]
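In practice the softmax is computed in a numerically stable form by shifting by max(v) before exponentiating; the shift cancels between numerator and denominator, so o_k is unchanged (standard practice, not on the slide):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift prevents overflow; cancels in the ratio
    return e / e.sum()
```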
Learning as Optimization

- Training data: T = {(x_1, y_1), …, (x_N, y_N)}
  - x_i: i-th input feature vector
  - y_i (or \mathbf{y}_i): i-th target label
- Parameters: θ = {W, U}: the input-to-hidden and hidden-to-output weight matrices
- Objective (loss) function: take the negative log-likelihood (NLL) as the empirical risk:
\[
J(\theta) = \mathrm{Loss}(T, \theta) = -\sum_{(x,y) \in T} \log P(y \mid x)
\]
- Training process, known as empirical risk minimization:
\[
\theta^* = \operatorname*{argmin}_{\theta} J(\theta)
\]
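The NLL over a set of examples in code, as a small sketch (integer labels index the softmax outputs; the names are mine):

```python
import numpy as np

def nll(probs, labels):
    # probs: N x K matrix of softmax outputs; labels: N integer targets.
    # Sum over examples of -log P(y_i | x_i).
    return -np.log(probs[np.arange(len(labels)), labels]).sum()
```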
Optimization by Gradient Method

Gradient descent:
\[
\theta \leftarrow \theta - \eta \nabla_\theta \, \mathbb{E}_{(x,y)}\left[-\log P(y \mid x)\right]
\]
- Batch algorithm: expectations over the training set are required, but computing them exactly is very expensive because every example in the entire dataset must be evaluated.
- Minibatch algorithm: in practice, we estimate these expectations by randomly sampling a small number of examples from the dataset and averaging over only those examples.
- Using the exact gradient over a large set of examples does not reduce the estimation error enough to justify its cost: convergence is slow.
Stochastic Gradient Method

1. Randomly sample a minibatch of m examples {(x_i, y_i)} from the training data.
2. Define the NLL on the minibatch:
\[
J(\theta) = -\sum_{1 \le i \le m} \log P(y_i \mid x_i)
\]
3. Compute the derivative ∂J/∂W for each W ∈ θ.
4. Update each weight matrix W ∈ θ:
\[
W \leftarrow W - \eta \frac{\partial J}{\partial W}
\]
Iterate the above procedure until a stopping criterion is satisfied; see the sketch below.
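The procedure as a training loop, sketched under the assumption of a `grad_loss(theta, batch)` routine that returns one gradient per parameter (the routine, the names, and the fixed-step stopping rule are mine):

```python
import numpy as np

def sgd(theta, data, grad_loss, lr=0.1, batch_size=32, n_steps=1000):
    # theta: dict of weight matrices, e.g. {'W': ..., 'U': ...}
    rng = np.random.default_rng(0)
    for _ in range(n_steps):  # fixed step count as the stopping criterion
        idx = rng.choice(len(data), size=batch_size, replace=False)
        grads = grad_loss(theta, [data[i] for i in idx])
        for name in theta:
            theta[name] -= lr * grads[name]  # W <- W - eta dJ/dW
    return theta
```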
Logistic regression for binary classification

(x, y): a training example for binary classification, where y ∈ {0, 1}.
Logistic regression function:
\[
o(\mathbf{x}) = \sigma\left(\mathbf{w}^{\mathsf T}\mathbf{x} + b\right)
\]
which is rewritten as:
\[
z = \mathbf{w}^{\mathsf T}\mathbf{x} + b, \qquad o = \sigma(z)
\]
J: the log-likelihood on (x, y):
\[
J = y \log(o) + (1 - y)\log(1 - o), \qquad \frac{\partial J}{\partial o} = \frac{y}{o} - \frac{1 - y}{1 - o}
\]
Logistic regression: Deriv of J wrt w
\[
\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^{\mathsf T}, \qquad \frac{\partial o}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right) = o(1 - o)
\]
All together, these lead to:
\[
\frac{\partial J}{\partial \mathbf{w}} = \frac{\partial J}{\partial o}\frac{\partial o}{\partial z}\frac{\partial z}{\partial \mathbf{w}} = \left(\frac{y}{o} - \frac{1 - y}{1 - o}\right) o(1 - o)\, \mathbf{x}^{\mathsf T} = (y - o)\,\mathbf{x}^{\mathsf T}
\]
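The whole derivation collapses to one line of code. This sketch keeps the slide's convention that J is a log-likelihood, so the returned vector is the ascent direction (descend on −J if treating it as a loss):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_J_w(w, b, x, y):
    o = sigmoid(w @ x + b)
    return (y - o) * x  # dJ/dw = (y - o) x^T, per the derivation above

# Gradient-ascent step on the log-likelihood J:
# w = w + lr * grad_J_w(w, b, x, y)
```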
Logistic regression for multi-class classification

- K: the number of labels
- (x, k): a training example where k ∈ {1, …, K}
Logistic regression function:
\[
\mathbf{o}(\mathbf{x}) = \mathrm{softmax}(W\mathbf{x} + \mathbf{b})
\]
which is rewritten as:
\[
\mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{o} = \mathrm{softmax}(\mathbf{z})
\]
where softmax(z) = exp(z)/Σ_i exp(z_i) = exp(z)/Z.
J: the log-likelihood on (x, k):
\[
J = \mathbf{y}^{\mathsf T} \log(\mathbf{o})
\]
where y is the one-hot encoding of the target label: y = [0 ⋯ 1 ⋯ 0]^T with y_i = I(i = k), k being the target label.
Logistic regression: Deriv of J wrt W
\[
\frac{\partial \mathbf{o}}{\partial \mathbf{z}} = \left[\frac{\partial o_j}{\partial z_i}\right]_{ij} = \left[\frac{\exp(z_j)\left(I(i = j)\sum_k \exp(z_k) - \exp(z_i)\right)}{\left(\sum_k \exp(z_k)\right)^2}\right]_{ij} = \left[\frac{\exp(z_j)\left(I(i = j)\, Z - \exp(z_i)\right)}{Z^2}\right]_{ij} = \left[o_j I(i = j) - o_i o_j\right]_{ij}
\]
Thus, we have:
\[
\frac{\partial J}{\partial \mathbf{z}} = \frac{\partial J}{\partial \mathbf{o}}\frac{\partial \mathbf{o}}{\partial \mathbf{z}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix} \left[\frac{\exp(z_j)\left(I(i = j)\, Z - \exp(z_i)\right)}{Z^2}\right]_{ij} = \begin{bmatrix} I(1 = k) - o_1 & \cdots & 1 - o_k & \cdots & I(K = k) - o_K \end{bmatrix}
\]
Logistic regression: Deriv of J wrt W (Cont.)

Given z = Wx + b:
\[
z_i = W_i \mathbf{x} + b_i, \qquad \frac{\partial z_i}{\partial W_i} = \mathbf{x}^{\mathsf T}
\]
where W_i is the i-th row vector of W.
\[
\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial z_i}\frac{\partial z_i}{\partial W_i} = \left(I(i = k) - o_i\right)\mathbf{x}^{\mathsf T}
\]
Finally, this leads to:
\[
\frac{\partial J}{\partial W} = \begin{bmatrix} \partial J/\partial W_1 \\ \vdots \\ \partial J/\partial W_K \end{bmatrix} = \begin{bmatrix} \left(I(1 = k) - o_1\right)\mathbf{x}^{\mathsf T} \\ \vdots \\ \left(I(K = k) - o_K\right)\mathbf{x}^{\mathsf T} \end{bmatrix} = \left(\frac{\partial J}{\partial \mathbf{z}}\right)^{\mathsf T} \mathbf{x}^{\mathsf T}
\]
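In code, ∂J/∂W is an outer product of the error vector and the input; a sketch reusing the stable `softmax` defined earlier (the one-hot construction from the label k is spelled out):

```python
import numpy as np

def grad_J_W(W, b, x, k):
    o = softmax(W @ x + b)
    y = np.zeros_like(o)
    y[k] = 1.0                 # one-hot target, y_i = I(i = k)
    return np.outer(y - o, x)  # dJ/dW = (dJ/dz)^T x^T
```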
MLP with single hidden layer: Log-likelihood
\[
\mathbf{h} = \max(W\mathbf{x}, \mathbf{0}), \qquad \mathbf{o} = \mathrm{softmax}(U\mathbf{h})
\]
J: the log-likelihood on a single example (x, y):
\[
J = \mathbf{y}^{\mathsf T} \log(\mathbf{o})
\]
y: one-hot encoding of the target label, y = [0 ⋯ 1 ⋯ 0]^T with y_i = I(i = k), where k is the target label.
Derivative of J wrt Output layer
\[
\mathbf{v} = U\mathbf{h}, \qquad \mathbf{o} = \mathrm{softmax}(\mathbf{v}) = \frac{\exp(\mathbf{v})}{\sum_i \exp(v_i)} = \frac{\exp(\mathbf{v})}{Z}, \qquad J = \mathbf{y}^{\mathsf T}\log(\mathbf{o})
\]
\[
\frac{\partial J}{\partial \mathbf{o}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix}
\]
\[
\frac{\partial \mathbf{o}}{\partial \mathbf{v}} = \left[\frac{\partial o_j}{\partial v_i}\right]_{ij} = \left[\frac{\exp(v_j)\left(I(i = j)\sum_k \exp(v_k) - \exp(v_i)\right)}{\left(\sum_k \exp(v_k)\right)^2}\right]_{ij} = \left[\frac{\exp(v_j)\left(I(i = j)\, Z - \exp(v_i)\right)}{Z^2}\right]_{ij}
\]
Derivative of J wrt Output layer (Cont.)
\[
\frac{\partial J}{\partial \mathbf{v}} = \frac{\partial J}{\partial \mathbf{o}}\frac{\partial \mathbf{o}}{\partial \mathbf{v}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix}\left[\frac{\exp(v_j)\left(I(i = j)\, Z - \exp(v_i)\right)}{Z^2}\right]_{ij}
\]
\[
= \begin{bmatrix} 0 & \cdots & \frac{Z}{\exp(v_k)} & \cdots & 0 \end{bmatrix}\left[\frac{\exp(v_j)\left(I(i = j)\, Z - \exp(v_i)\right)}{Z^2}\right]_{ij} = \begin{bmatrix} -\frac{\exp(v_1)}{Z} & \cdots & 1 - \frac{\exp(v_k)}{Z} & \cdots & -\frac{\exp(v_K)}{Z} \end{bmatrix} = \begin{bmatrix} -o_1 & \cdots & 1 - o_k & \cdots & -o_K \end{bmatrix}
\]
Let δ^(o) be the error signal for the output layer:
\[
\delta^{(o)} := \frac{\partial J}{\partial \mathbf{v}} = \begin{bmatrix} -o_1 & \cdots & 1 - o_k & \cdots & -o_K \end{bmatrix}
\]
Here, note that δ^(o) is a row vector.
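Since δ^(o)_i = I(i = k) − o_i, the error signal is just the one-hot target minus the softmax output; a small sketch that keeps it an explicit row vector, matching the slide's convention:

```python
import numpy as np

def delta_o(o, k):
    y = np.zeros_like(o)
    y[k] = 1.0
    # Row vector dJ/dv = [-o_1, ..., 1 - o_k, ..., -o_K].
    return (y - o).reshape(1, -1)
```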
Error propagation to hidden layer

Hidden-to-output: v = Uh:
\[
\frac{\partial \mathbf{v}}{\partial \mathbf{h}} = U
\]
Let δ^(h) be the error signal for the hidden layer:
\[
\delta^{(h)} := \frac{\partial J}{\partial \mathbf{h}} = \frac{\partial J}{\partial \mathbf{v}}\frac{\partial \mathbf{v}}{\partial \mathbf{h}} = \delta^{(o)} U
\]
Deriv of J wrt hidden-output weight matrix U

Hidden-to-output: v = Uh:
\[
v_i = U_i \mathbf{h} = \sum_j u_{ij} h_j, \qquad \frac{\partial v_i}{\partial U_i} = \mathbf{h}^{\mathsf T}
\]
where U_i is the i-th row vector of U.
\[
\frac{\partial J}{\partial U_i} = \frac{\partial J}{\partial v_i}\frac{\partial v_i}{\partial U_i} = \delta^{(o)}_i \mathbf{h}^{\mathsf T}
\]
\[
\frac{\partial J}{\partial U} = \begin{bmatrix} \partial J/\partial U_1 \\ \vdots \\ \partial J/\partial U_K \end{bmatrix} = \begin{bmatrix} \delta^{(o)}_1 \mathbf{h}^{\mathsf T} \\ \vdots \\ \delta^{(o)}_K \mathbf{h}^{\mathsf T} \end{bmatrix} = \delta^{(o)\mathsf T} \mathbf{h}^{\mathsf T}
\]
Error prop to input layer

Input-to-hidden layer: z = Wx, h = max(z, 0):
\[
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \begin{bmatrix} I(z_1 > 0) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & I(z_m > 0) \end{bmatrix} = \mathrm{diag}\left(I(z_i > 0)\right), \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = W
\]
Let δ^(z) be the error signal for the pre-activated hidden layer:
\[
\delta^{(z)} := \frac{\partial J}{\partial \mathbf{z}} = \delta^{(h)}\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \delta^{(h)}\,\mathrm{diag}\left(I(z_i > 0)\right)
\]
Let δ^(x) be the error signal for the input layer:
\[
\delta^{(x)} := \frac{\partial J}{\partial \mathbf{x}} = \delta^{(h)}\frac{\partial \mathbf{h}}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \delta^{(h)}\,\mathrm{diag}\left(I(z_i > 0)\right) W
\]
Deriv of J wrt input-hidden weight matrix W

Input-to-hidden layer: z = Wx:
\[
z_i = W_i \mathbf{x} = \sum_j w_{ij} x_j, \qquad \frac{\partial z_i}{\partial W_i} = \mathbf{x}^{\mathsf T}
\]
where W_i is the i-th row vector of W.
\[
\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial z_i}\frac{\partial z_i}{\partial W_i} = \delta^{(z)}_i \mathbf{x}^{\mathsf T}
\]
\[
\frac{\partial J}{\partial W} = \begin{bmatrix} \partial J/\partial W_1 \\ \vdots \\ \partial J/\partial W_m \end{bmatrix} = \begin{bmatrix} \delta^{(z)}_1 \mathbf{x}^{\mathsf T} \\ \vdots \\ \delta^{(z)}_m \mathbf{x}^{\mathsf T} \end{bmatrix} = \delta^{(z)\mathsf T} \mathbf{x}^{\mathsf T}
\]
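Putting the pieces together, a minimal backward pass for the single-hidden-layer MLP (a sketch under the slides' conventions: row-vector error signals, biases absorbed into W and U):

```python
import numpy as np

def backward(W, U, x, k):
    # Forward pass: h = max(Wx, 0), o = softmax(Uh).
    z = W @ x
    h = np.maximum(z, 0.0)
    v = U @ h
    e = np.exp(v - v.max())
    o = e / e.sum()

    # Output error signal (row vector): delta_o = dJ/dv = y - o.
    y = np.zeros_like(o)
    y[k] = 1.0
    d_o = (y - o)[None, :]

    d_h = d_o @ U              # delta_h = delta_o U
    d_z = d_h * (z > 0.0)      # delta_z = delta_h diag(I(z_i > 0))

    dJ_dU = d_o.T @ h[None, :] # dJ/dU = delta_o^T h^T
    dJ_dW = d_z.T @ x[None, :] # dJ/dW = delta_z^T x^T
    return dJ_dU, dJ_dW
```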
Discussion

- Here, the error signals such as δ^(o), δ^(h) are row vectors.
- But we can instead define δ^(o), δ^(h) as column vectors and derive backprop again; in this case, only a slight modification of the error propagation is necessary.