Neural Networks: Backpropagation

1 Neural Networks: Backpropagation
Seung-Hoon Na, Department of Computer Science, Chonbuk National University

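2 Jacobian matrix
Two functions $f(x, y)$ and $g(x, y)$ of two parameters $x, y$:
$$f(x, y) = 3x^2 y, \qquad g(x, y) = 5xy + y^3$$
Jacobian matrix (numerator layout):
$$J = \begin{bmatrix} \nabla f(x, y) \\ \nabla g(x, y) \end{bmatrix} = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6yx & 3x^2 \\ 5y & 5x + 3y^2 \end{bmatrix}$$
Jacobian matrix (denominator layout):
$$J^T = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix}$$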
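The numerator-layout Jacobian of this example can be checked numerically. The following is a minimal sketch, not part of the slides: the helper names (`analytic_jacobian`, `numeric_jacobian`, `transpose`) and the evaluation point (1, 2) are chosen here for illustration only, and the numerical version uses central finite differences.

```python
def f(x, y):
    return 3 * x**2 * y

def g(x, y):
    return 5 * x * y + y**3

def analytic_jacobian(x, y):
    # Numerator layout: one row per function, one column per variable.
    return [[6 * x * y, 3 * x**2],
            [5 * y, 5 * x + 3 * y**2]]

def numeric_jacobian(x, y, eps=1e-6):
    # Central finite differences, one row per function (illustrative helper).
    rows = []
    for fn in (f, g):
        d_dx = (fn(x + eps, y) - fn(x - eps, y)) / (2 * eps)
        d_dy = (fn(x, y + eps) - fn(x, y - eps)) / (2 * eps)
        rows.append([d_dx, d_dy])
    return rows

def transpose(m):
    # Denominator layout is the transpose of the numerator layout.
    return [list(col) for col in zip(*m)]

J = analytic_jacobian(1.0, 2.0)
J_num = numeric_jacobian(1.0, 2.0)
J_denominator = transpose(J)
```

At (1, 2) the analytic Jacobian is [[12, 3], [10, 17]], and the finite-difference estimate should agree to several decimal places.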

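3 Jacobian: Generalization
$\mathbf{y} = \mathbf{f}(\mathbf{x})$: a vector of $m$ scalar-valued functions, each of which takes a vector $\mathbf{x}$ of length $n$:
$$y_1 = f_1(\mathbf{x}), \quad \ldots, \quad y_m = f_m(\mathbf{x})$$
Jacobian matrix: it has $m$ rows for the $m$ equations.
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\ \vdots \\ \frac{\partial}{\partial \mathbf{x}} f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_1} f_1(\mathbf{x}) & \cdots & \frac{\partial}{\partial x_n} f_1(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} f_m(\mathbf{x}) & \cdots & \frac{\partial}{\partial x_n} f_m(\mathbf{x}) \end{bmatrix}$$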

4 Jacobian: Generalization

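5 Vector chain rule
The Jacobian of a composition is the product of two other Jacobians:
$$\frac{\partial}{\partial \mathbf{x}} \mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \cdots & \frac{\partial f_1}{\partial g_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial g_1} & \cdots & \frac{\partial f_m}{\partial g_k} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_k}{\partial x_1} & \cdots & \frac{\partial g_k}{\partial x_n} \end{bmatrix}$$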

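6 Vector chain rule: Example
$\mathbf{y} = \mathbf{f}(x)$:
$$\mathbf{y} = \begin{bmatrix} y_1(x) \\ y_2(x) \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} \ln(x^2) \\ \sin(3x) \end{bmatrix}$$
$\mathbf{y} = \mathbf{f}(\mathbf{g}(x))$: introduce two intermediate variables $g_1, g_2$:
$$\mathbf{g} = \begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ 3x \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} f_1(\mathbf{g}) \\ f_2(\mathbf{g}) \end{bmatrix} = \begin{bmatrix} \ln(g_1) \\ \sin(g_2) \end{bmatrix}$$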
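Carrying the example through gives the product of the two Jacobians. This is a worked sketch that treats $x$ as a scalar, so $\partial \mathbf{g} / \partial x$ is a column vector; the slide itself stops at the definitions of the intermediate variables.

$$\frac{\partial \mathbf{y}}{\partial x} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial x} = \begin{bmatrix} \frac{1}{g_1} & 0 \\ 0 & \cos(g_2) \end{bmatrix} \begin{bmatrix} 2x \\ 3 \end{bmatrix} = \begin{bmatrix} \frac{2x}{x^2} \\ 3\cos(3x) \end{bmatrix} = \begin{bmatrix} \frac{2}{x} \\ 3\cos(3x) \end{bmatrix}$$

The result agrees with differentiating $\ln(x^2)$ and $\sin(3x)$ directly.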

7 Jacobian: Generalization

8 MLP with single hidden layer: Notation
For simplicity, the network has a single hidden layer only.
$o_k$: $k$-th output unit, $h_j$: $j$-th hidden unit, $x_i$: $i$-th input
$u_{kj}$: weight between the $j$-th hidden unit and the $k$-th output unit
$w_{ji}$: weight between the $i$-th input and the $j$-th hidden unit
Bias terms are also contained in the weights.

9 MLP with single hidden layer: Matrix notation
$$\mathbf{h} = \max(W\mathbf{x}, 0)$$
$$\mathbf{o} = \mathrm{softmax}(U\mathbf{h})$$

10 Typical Setting for Classification
$K$: the number of labels
Input layer: input values (raw features)
Output layer: scores of labels
Softmax layer: $\mathbf{o} = \mathrm{softmax}(\mathbf{v})$
$$o_k = \frac{\exp(v_k)}{\sum_i \exp(v_i)} = \frac{\exp(v_k)}{Z}$$
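The softmax layer can be sketched in a few lines of plain Python. One detail here is an addition rather than something stated on the slide: the maximum score is subtracted before exponentiation, which leaves the output unchanged but avoids overflow for large scores.

```python
import math

def softmax(v):
    # Shifting by max(v) does not change the result, because the factor
    # exp(-max(v)) cancels between numerator and denominator.
    m = max(v)
    exps = [math.exp(v_k - m) for v_k in v]
    Z = sum(exps)  # normalizer, the Z of the slide (after the shift)
    return [e / Z for e in exps]

o = softmax([1.0, 2.0, 3.0])
o_large = softmax([1000.0, 1000.0])  # would overflow without the shift
```

The outputs are positive and sum to one, so they can be read as label probabilities.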

11 Learning as Optimization
Training data: $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$
$\mathbf{x}_i$: $i$-th input feature vector
$y_i$ (or $\mathbf{y}_i$): $i$-th target label
Parameter: $\theta = \{W, U\}$
Weight matrices: input-to-hidden and hidden-to-output
Objective function (loss function): take the negative log-likelihood (NLL) as the empirical risk
$$J(\theta) = \mathrm{Loss}(T, \theta) = -\sum_{(\mathbf{x}, y) \in T} \log P(y \mid \mathbf{x})$$
Training process, known as empirical risk minimization:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} J(\theta)$$

12 Optimization by Gradient Method
Gradient descent:
$$\theta \leftarrow \theta - \eta \nabla_{\theta} \mathbb{E}_{(\mathbf{x}, y)} [-\log P(y \mid \mathbf{x})]$$
Batch algorithm: expectations over the training set are required. Computing these expectations exactly is very expensive, because it requires evaluating the model on every example in the entire dataset.
Minibatch algorithm: in practice, we estimate these expectations by randomly sampling a small number of examples from the dataset and averaging over only those examples.
Computing the exact gradient from a large number of examples does not significantly reduce the estimation error compared with a minibatch estimate, so training with the exact gradient converges slowly for the computation spent.

13 Stochastic Gradient Method
1 Randomly sample a minibatch of $m$ samples $\{(\mathbf{x}_i, y_i)\}$ from the training data
2 Define the NLL for $\{(\mathbf{x}_i, y_i)\}$:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i)$$
3 Compute the derivatives $\frac{\partial J}{\partial W}$ for each $W \in \theta$
4 Update the weight matrix for each $W \in \theta$:
$$W \leftarrow W - \eta \frac{\partial J}{\partial W}$$
Iterate the above procedure until the stopping criterion is satisfied.
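The four steps can be sketched on a toy problem. Everything specific below is an assumption made for illustration, not taken from the slides: a one-dimensional logistic regression model with scalar parameters `w` and `b` stands in for the weight matrices, the dataset is synthetic, the stopping criterion is a fixed number of steps, and the names `nll_grad`, `mean_nll`, and `sgd` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_grad(w, b, batch):
    # Average gradient of -log P(y|x) over the minibatch (steps 2 and 3).
    gw = gb = 0.0
    for x, y in batch:
        o = sigmoid(w * x + b)
        gw += (o - y) * x
        gb += (o - y)
    m = len(batch)
    return gw / m, gb / m

def mean_nll(w, b, data):
    total = 0.0
    for x, y in data:
        o = sigmoid(w * x + b)
        total -= math.log(o) if y == 1 else math.log(1.0 - o)
    return total / len(data)

def sgd(data, eta=0.5, batch_size=4, steps=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):                    # fixed step count as stopping rule
        batch = rng.sample(data, batch_size)  # step 1: sample a minibatch
        gw, gb = nll_grad(w, b, batch)        # steps 2-3: NLL and its gradient
        w -= eta * gw                         # step 4: gradient update
        b -= eta * gb
    return w, b

# Synthetic data: label 1 for positive x, label 0 for negative x.
data = [(x / 10.0, 1 if x > 0 else 0) for x in range(-20, 21) if x != 0]
w, b = sgd(data)
```

After training, `mean_nll(w, b, data)` should be lower than its value at the initial parameters `w = 0, b = 0`.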

14 Logistic regression for binary classification
$(\mathbf{x}, y)$: a training example for binary classification, where $y \in \{0, 1\}$
Logistic regression function:
$$o(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
which is rewritten as:
$$z = \mathbf{w}^T \mathbf{x} + b, \qquad o = \sigma(z)$$
$J$: the log-likelihood on $(\mathbf{x}, y)$
$$J = y \log(o) + (1 - y) \log(1 - o)$$
$$\frac{\partial J}{\partial o} = \frac{y}{o} - \frac{1 - y}{1 - o}$$

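15 Logistic regression: Derivative of J wrt w
$$\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^T$$
$$\frac{\partial o}{\partial z} = \sigma(z)(1 - \sigma(z)) = o(1 - o)$$
All together, these lead to:
$$\frac{\partial J}{\partial \mathbf{w}} = \frac{\partial J}{\partial o} \frac{\partial o}{\partial z} \frac{\partial z}{\partial \mathbf{w}} = \left( \frac{y}{o} - \frac{1 - y}{1 - o} \right) o(1 - o)\, \mathbf{x}^T = (y - o)\, \mathbf{x}^T$$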
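The closed form $(y - o)\,\mathbf{x}^T$ can be compared against a finite-difference estimate of the log-likelihood gradient. This is a minimal sketch; the parameter values and the helper names (`grad_w`, `numeric_grad_w`) are made up for the check and do not come from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, b, x, y):
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    o = sigmoid(z)
    return y * math.log(o) + (1 - y) * math.log(1 - o)

def grad_w(w, b, x, y):
    # Analytic gradient of the log-likelihood: (y - o) x
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    o = sigmoid(z)
    return [(y - o) * x_i for x_i in x]

def numeric_grad_w(w, b, x, y, eps=1e-6):
    # Central finite differences, one weight at a time.
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        diff = log_likelihood(w_plus, b, x, y) - log_likelihood(w_minus, b, x, y)
        grads.append(diff / (2 * eps))
    return grads

w, b, x, y = [0.5, -0.3], 0.1, [1.0, 2.0], 1
analytic = grad_w(w, b, x, y)
numeric = numeric_grad_w(w, b, x, y)
```

The two lists should match to within the finite-difference error.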

16 Logistic regression for multi-class classification
$K$: the number of labels
$(\mathbf{x}, k)$: a training example, where $k \in \{1, \ldots, K\}$
Logistic regression function:
$$\mathbf{o}(\mathbf{x}) = \mathrm{softmax}(W\mathbf{x} + \mathbf{b})$$
which is rewritten as:
$$\mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{o} = \mathrm{softmax}(\mathbf{z})$$
where
$$\mathrm{softmax}(\mathbf{z}) = \frac{\exp(\mathbf{z})}{\sum_i \exp(z_i)} = \frac{\exp(\mathbf{z})}{Z}$$
$J$: the log-likelihood on $(\mathbf{x}, k)$
$$J = \mathbf{y}^T \log(\mathbf{o})$$
where $\mathbf{y}$ is the one-hot encoding of the target label:
$$\mathbf{y} = [0 \; \cdots \; 1 \; \cdots \; 0]^T, \qquad y_i = I(i = k)$$
where $k$ is the target label.

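17 Logistic regression: Derivative of J wrt W
$$\frac{\partial J}{\partial \mathbf{o}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix}$$
$$\frac{\partial \mathbf{o}}{\partial \mathbf{z}} = \left[ \frac{\partial o_j}{\partial z_i} \right]_{ij} = \left[ \frac{\exp(z_j) \left( I(i = j) \sum_{l} \exp(z_l) - \exp(z_i) \right)}{\left( \sum_{l} \exp(z_l) \right)^2} \right]_{ij} = \left[ \frac{\exp(z_j) \left( I(i = j)\, Z - \exp(z_i) \right)}{Z^2} \right]_{ij} = \left[ o_j I(i = j) - o_i o_j \right]_{ij}$$
Thus, we have:
$$\frac{\partial J}{\partial \mathbf{z}} = \frac{\partial J}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{z}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix} \left[ \frac{\exp(z_j) \left( I(i = j)\, Z - \exp(z_i) \right)}{Z^2} \right]_{ij} = \begin{bmatrix} I(1 = k) - o_1 & \cdots & 1 - o_k & \cdots & I(K = k) - o_K \end{bmatrix}$$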

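18 Logistic regression: Derivative of J wrt W (Cont.)
Given $\mathbf{z} = W\mathbf{x} + \mathbf{b}$:
$$\frac{\partial z_i}{\partial W_i} = \mathbf{x}^T$$
where $W_i$ is the $i$-th row vector of $W$.
$$\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial z_i} \frac{\partial z_i}{\partial W_i} = (I(i = k) - o_i)\, \mathbf{x}^T$$
Finally, this leads to:
$$\frac{\partial J}{\partial W} := \begin{bmatrix} \frac{\partial J}{\partial W_1} \\ \vdots \\ \frac{\partial J}{\partial W_K} \end{bmatrix} = \begin{bmatrix} (I(1 = k) - o_1)\, \mathbf{x}^T \\ \vdots \\ (I(K = k) - o_K)\, \mathbf{x}^T \end{bmatrix} = \begin{bmatrix} I(1 = k) - o_1 \\ \vdots \\ I(K = k) - o_K \end{bmatrix} \mathbf{x}^T = \left( \frac{\partial J}{\partial \mathbf{z}} \right)^T \mathbf{x}^T$$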
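The row-by-row result, $(I(i = k) - o_i)\,\mathbf{x}^T$ for row $i$ of $W$, can be checked with finite differences. The sketch below uses plain Python lists; the weights, input, and target index are arbitrary illustrative values, the label `k` is 0-indexed (unlike the 1-indexed slides), and the helper names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability
    exps = [math.exp(z_i - m) for z_i in z]
    Z = sum(exps)
    return [e / Z for e in exps]

def forward(W, b, x):
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return softmax(z)

def log_likelihood(W, b, x, k):
    # J = y^T log(o) with one-hot y reduces to log(o_k).
    return math.log(forward(W, b, x)[k])

def grad_W(W, b, x, k):
    # Row i of dJ/dW is (I(i = k) - o_i) x^T.
    o = forward(W, b, x)
    return [[((1.0 if i == k else 0.0) - o_i) * x_j for x_j in x]
            for i, o_i in enumerate(o)]

def numeric_grad_W(W, b, x, k, eps=1e-6):
    grad = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[i])):
            old = W[i][j]
            W[i][j] = old + eps
            plus = log_likelihood(W, b, x, k)
            W[i][j] = old - eps
            minus = log_likelihood(W, b, x, k)
            W[i][j] = old  # restore the original weight
            row.append((plus - minus) / (2 * eps))
        grad.append(row)
    return grad

W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b = [0.0, 0.1, -0.1]
x = [1.0, -2.0]
k = 2  # 0-indexed target label
analytic = grad_W(W, b, x, k)
numeric = numeric_grad_W(W, b, x, k)
```

Because the entries $I(i = k) - o_i$ sum to zero over $i$, each column of the gradient also sums to zero, which gives a second quick sanity check.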

19 MLP with single hidden layer: Log-likelihood
$$\mathbf{h} = \max(W\mathbf{x}, 0)$$
$$\mathbf{o} = \mathrm{softmax}(U\mathbf{h})$$
$J$: the log-likelihood on a single example $(\mathbf{x}, \mathbf{y})$
$$J = \mathbf{y}^T \log(\mathbf{o})$$
$\mathbf{y}$: one-hot encoding of the target label,
$$\mathbf{y} = [0 \; \cdots \; 1 \; \cdots \; 0]^T, \qquad y_i = I(i = k)$$
where $k$ is the target label.

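20 Derivative of J wrt Output layer
$$\mathbf{v} = U\mathbf{h}$$
$$\mathbf{o} = \mathrm{softmax}(\mathbf{v}) = \frac{\exp(\mathbf{v})}{\sum_i \exp(v_i)} = \frac{\exp(\mathbf{v})}{Z}$$
$$J = \mathbf{y}^T \log(\mathbf{o})$$
$$\frac{\partial J}{\partial \mathbf{o}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix}$$
$$\frac{\partial \mathbf{o}}{\partial \mathbf{v}} = \left[ \frac{\partial o_j}{\partial v_i} \right]_{ij} = \left[ \frac{\exp(v_j) \left( I(i = j) \sum_{l} \exp(v_l) - \exp(v_i) \right)}{\left( \sum_{l} \exp(v_l) \right)^2} \right]_{ij} = \left[ \frac{\exp(v_j) \left( I(i = j)\, Z - \exp(v_i) \right)}{Z^2} \right]_{ij}$$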

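21 Derivative of J wrt Output layer (Cont.)
$$\frac{\partial J}{\partial \mathbf{v}} = \frac{\partial J}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{v}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix} \left[ \frac{\exp(v_j) \left( I(i = j)\, Z - \exp(v_i) \right)}{Z^2} \right]_{ij}$$
$$= \begin{bmatrix} 0 & \cdots & \frac{Z}{\exp(v_k)} & \cdots & 0 \end{bmatrix} \left[ \frac{\exp(v_j) \left( I(i = j)\, Z - \exp(v_i) \right)}{Z^2} \right]_{ij}$$
$$= \begin{bmatrix} -\frac{\exp(v_1)}{Z} & \cdots & 1 - \frac{\exp(v_k)}{Z} & \cdots & -\frac{\exp(v_K)}{Z} \end{bmatrix} = \begin{bmatrix} -o_1 & \cdots & 1 - o_k & \cdots & -o_K \end{bmatrix}$$
Let $\delta^{(o)}$ be the error signal for the output layer:
$$\delta^{(o)} := \frac{\partial J}{\partial \mathbf{v}} = \begin{bmatrix} -o_1 & \cdots & 1 - o_k & \cdots & -o_K \end{bmatrix}$$
Here, note that $\delta^{(o)}$ is a row vector.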

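22 Error propagation to hidden layer
Hidden-to-output: $\mathbf{v} = U\mathbf{h}$
$$\frac{\partial \mathbf{v}}{\partial \mathbf{h}} = U$$
Let $\delta^{(h)}$ be the error signal for the hidden layer:
$$\delta^{(h)} := \frac{\partial J}{\partial \mathbf{h}} = \frac{\partial J}{\partial \mathbf{v}} \frac{\partial \mathbf{v}}{\partial \mathbf{h}} = \delta^{(o)} U$$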

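23 Derivative of J wrt hidden-output weight matrix U
Hidden-to-output: $\mathbf{v} = U\mathbf{h}$
$$v_i = U_i \mathbf{h} = \sum_j u_{ij} h_j, \qquad \frac{\partial v_i}{\partial U_i} = \mathbf{h}^T$$
where $U_i$ is the $i$-th row vector of $U$.
$$\frac{\partial J}{\partial U_i} = \frac{\partial J}{\partial v_i} \frac{\partial v_i}{\partial U_i} = \delta^{(o)}_i \mathbf{h}^T$$
$$\frac{\partial J}{\partial U} := \begin{bmatrix} \frac{\partial J}{\partial U_1} \\ \vdots \\ \frac{\partial J}{\partial U_K} \end{bmatrix} = \begin{bmatrix} \delta^{(o)}_1 \mathbf{h}^T \\ \vdots \\ \delta^{(o)}_K \mathbf{h}^T \end{bmatrix} = \delta^{(o)T} \mathbf{h}^T$$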

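24 Error propagation to input layer
Input-to-hidden layer:
$$\mathbf{z} = W\mathbf{x}, \qquad \mathbf{h} = \max(\mathbf{z}, 0)$$
$$\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}(I(z_i > 0)) = \begin{bmatrix} I(z_1 > 0) & & 0 \\ & \ddots & \\ 0 & & I(z_m > 0) \end{bmatrix}, \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = W$$
Let $\delta^{(z)}$ be the error signal for the pre-activated hidden layer:
$$\delta^{(z)} := \frac{\partial J}{\partial \mathbf{z}} = \frac{\partial J}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \delta^{(h)}\, \mathrm{diag}(I(z_i > 0))$$
Let $\delta^{(x)}$ be the error signal for the input layer:
$$\delta^{(x)} := \frac{\partial J}{\partial \mathbf{x}} = \frac{\partial J}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \delta^{(h)}\, \mathrm{diag}(I(z_i > 0))\, W$$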

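25 Derivative of J wrt input-hidden weight matrix W
Input-to-hidden layer: $\mathbf{z} = W\mathbf{x}$
$$z_i = W_i \mathbf{x} = \sum_j w_{ij} x_j, \qquad \frac{\partial z_i}{\partial W_i} = \mathbf{x}^T$$
where $W_i$ is the $i$-th row vector of $W$.
$$\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial z_i} \frac{\partial z_i}{\partial W_i} = \delta^{(z)}_i \mathbf{x}^T$$
$$\frac{\partial J}{\partial W} := \begin{bmatrix} \frac{\partial J}{\partial W_1} \\ \vdots \\ \frac{\partial J}{\partial W_m} \end{bmatrix} = \begin{bmatrix} \delta^{(z)}_1 \mathbf{x}^T \\ \vdots \\ \delta^{(z)}_m \mathbf{x}^T \end{bmatrix} = \delta^{(z)T} \mathbf{x}^T$$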
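Putting the error signals together, the whole backward pass for the single-hidden-layer MLP can be sketched and verified with a finite-difference gradient check. This is an illustrative sketch under stated assumptions, not the author's implementation: it uses plain Python lists, omits bias terms, takes a 0-indexed target label `k`, and the weights and input are arbitrary values chosen so that no pre-activation sits at the ReLU kink.

```python
import math

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(v):
    m = max(v)  # shift for numerical stability
    exps = [math.exp(a - m) for a in v]
    Z = sum(exps)
    return [e / Z for e in exps]

def forward(W, U, x):
    z = matvec(W, x)                  # z = Wx
    h = [max(z_i, 0.0) for z_i in z]  # h = max(z, 0)
    v = matvec(U, h)                  # v = Uh
    o = softmax(v)                    # o = softmax(v)
    return z, h, v, o

def log_likelihood(W, U, x, k):
    return math.log(forward(W, U, x)[3][k])

def backward(W, U, x, k):
    z, h, v, o = forward(W, U, x)
    # delta_o = dJ/dv = y - o, with y the one-hot target
    delta_o = [(1.0 if i == k else 0.0) - o_i for i, o_i in enumerate(o)]
    # delta_h = delta_o U (row vector times matrix)
    delta_h = [sum(delta_o[i] * U[i][j] for i in range(len(U)))
               for j in range(len(h))]
    # delta_z = delta_h diag(I(z_i > 0))
    delta_z = [d if z_j > 0 else 0.0 for d, z_j in zip(delta_h, z)]
    grad_U = [[d_i * h_j for h_j in h] for d_i in delta_o]  # delta_o^T h^T
    grad_W = [[d_j * x_i for x_i in x] for d_j in delta_z]  # delta_z^T x^T
    return grad_W, grad_U

def numeric_grad(W, U, x, k, which, eps=1e-6):
    M = W if which == "W" else U
    grad = [[0.0] * len(row) for row in M]
    for i in range(len(M)):
        for j in range(len(M[i])):
            old = M[i][j]
            M[i][j] = old + eps
            plus = log_likelihood(W, U, x, k)
            M[i][j] = old - eps
            minus = log_likelihood(W, U, x, k)
            M[i][j] = old  # restore the original weight
            grad[i][j] = (plus - minus) / (2 * eps)
    return grad

W = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]  # 3 hidden units, 2 inputs
U = [[0.3, -0.2, 0.5], [-0.6, 0.4, 0.1]]    # 2 outputs, 3 hidden units
x = [1.0, 2.0]
k = 1  # 0-indexed target label
grad_W, grad_U = backward(W, U, x, k)
num_W = numeric_grad(W, U, x, k, "W")
num_U = numeric_grad(W, U, x, k, "U")
```

With these values the first hidden unit has a negative pre-activation, so its row of `grad_W` is zero, which is the effect of the $\mathrm{diag}(I(z_i > 0))$ factor.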

26 Discussion
Here, error signals such as $\delta^{(o)}$ and $\delta^{(h)}$ are row vectors.
However, we can define $\delta^{(o)}$ and $\delta^{(h)}$ as column vectors and derive backpropagation again. In that case, only a slight modification of the error propagation is necessary.
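As a sketch of what that modification might look like (an illustration obtained by transposing the row-vector results, not a derivation given in the slides), write $\tilde{\delta}$ for the column-vector versions of the error signals:

$$\tilde{\delta}^{(o)} = \mathbf{y} - \mathbf{o}, \qquad \tilde{\delta}^{(h)} = U^T \tilde{\delta}^{(o)}, \qquad \tilde{\delta}^{(z)} = \mathrm{diag}(I(z_i > 0))\, \tilde{\delta}^{(h)}$$

$$\frac{\partial J}{\partial U} = \tilde{\delta}^{(o)} \mathbf{h}^T, \qquad \frac{\partial J}{\partial W} = \tilde{\delta}^{(z)} \mathbf{x}^T$$

The weight gradients keep the same shape as before; only the propagation step changes, from right-multiplication by $U$ to left-multiplication by $U^T$.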
