More on Neural Networks

Yujia Yan, Fall 2018

Linear Regression

$$y = Wx + b \tag{1}$$

Polynomial Regression

$$y = W\phi(x) + b \tag{2}$$

where $\phi(x)$ gives the polynomial basis, e.g., $[x_1, x_1^2, x_2, x_2^2, \ldots]^T$.

Adaptive Basis Regression

$$y = W_2 f(W_1 x + b_1) + b_2 \tag{3}$$

This is exactly a neural network with a single hidden layer.
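The three models differ only in the features that feed the final linear map. The NumPy sketch below is one way to see this side by side; the shapes, the random parameters, and the choice of tanh for $f$ are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # input vector, dimension 3 (assumed)

# (1) Linear regression: y = W x + b
W, b = rng.normal(size=(2, 3)), rng.normal(size=2)
y_linear = W @ x + b

# (2) Polynomial regression: fixed basis phi(x) = [x_1, x_1^2, x_2, x_2^2, ...]
def phi(x):
    return np.stack([x, x**2], axis=1).reshape(-1)   # interleave x_i and x_i^2

W_poly = rng.normal(size=(2, 6))
y_poly = W_poly @ phi(x) + b

# (3) Adaptive basis regression: the basis f(W1 x + b1) has trainable parameters
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
y_adaptive = W2 @ np.tanh(W1 @ x + b1) + b2
```

In (2) the basis is fixed in advance, while in (3) the parameters `W1` and `b1` that define the basis would be fitted together with `W2` and `b2`.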

The Universal Approximator

Feedforward neural network with a single hidden layer:

$$y = W_2 f(W_1 x + b_1) + b_2 \tag{4}$$

The Universal Approximation Theorem states that, provided the nonlinear activation function $f(\cdot)$ fulfills some mild conditions, this network can approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy if the hidden layer is wide enough.
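The effect of width can be illustrated numerically, although an experiment is of course not a proof of the theorem. The sketch below is a simplification: the hidden-layer parameters are drawn at random and kept fixed, and only the output layer is fitted by least squares. The target function, the tanh activation, and the widths are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
target = np.sin(2.0 * x)                     # hypothetical target function

# Random hidden-layer parameters for up to 64 units; only W2 and b2 are fitted.
w1 = rng.normal(scale=2.0, size=64)
b1 = rng.normal(scale=2.0, size=64)

def fit_error(width):
    """RMS error of the best least-squares fit using the first `width` hidden units."""
    hidden = np.tanh(np.outer(x, w1[:width]) + b1[:width])   # shape (200, width)
    design = np.column_stack([hidden, np.ones_like(x)])      # extra column for b2
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.sqrt(np.mean((design @ coef - target) ** 2))

errors = {width: fit_error(width) for width in (2, 8, 64)}
```

Because the narrower models reuse a subset of the wider model's hidden units, the least-squares error cannot get worse as the width grows.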

Neural Networks: Going Deep

Feedforward neural network with a single hidden layer:

$$y = W_2 f(W_1 x + b_1) + b_2 \tag{5}$$

Interpretation: we compute the similarity between each $w_i$ in $W_1 = [w_1, w_2, \ldots]$ and $x$ by taking inner products, and the results form a basis for regression. However, this is not an efficient way to memorize many patterns.

Solution: go deep.

$$y = W_N(\ldots W_3 f(W_2 f(W_1 x + b_1) + b_2) + b_3 \ldots) \tag{6}$$

The total number of patterns that can be memorized grows exponentially with the number of layers! But sometimes width is also needed (why?).
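The nested expression in (6) is just a loop over layers. A minimal sketch, assuming tanh as the nonlinearity and arbitrary layer sizes (the function name `forward` and the final bias term are illustrative choices, not taken from the slides):

```python
import numpy as np

def forward(x, layers, f=np.tanh):
    """Apply f(W h + b) for every (W, b) pair except the last, which stays linear."""
    h = x
    for W, b in layers[:-1]:
        h = f(W @ h + b)          # hidden layers apply the nonlinearity
    W, b = layers[-1]
    return W @ h + b              # final linear map (a bias is included here)

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 2]              # input, two hidden layers, output (assumed sizes)
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=3)
y = forward(x, layers)
```

With a single `(W, b)` pair the function reduces to linear regression, and with two pairs it is the single-hidden-layer network of (5).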

Neural Networks: A Composition of Operations

In fact, we can compose any operations into a neural network, as long as we know how to train it.

From the computational perspective, a neural network can be viewed as a set of operations composed within a computational graph.

Example: for training, we evaluate $L(W_2 f(W_1 x + b_1) + b_2, Y_{GT})$, where $L$ is the loss function and $Y_{GT}$ is the fitting target. The corresponding computational graph is:

[Figure: computational graph. X and W_1 feed matmul; its output and b_1 feed add; the sum feeds f(·); its output and W_2 feed matmul; that output and b_2 feed add; the result and Y_GT feed L(·, ·).]

Neural Networks: Computational Graph

In the computational graph of $L(W_2 f(W_1 x + b_1) + b_2, Y_{GT})$:

1. Nodes without incoming edges are variables.
2. Nodes with incoming edges are operations, which produce intermediate variables.
3. An edge var → Op means var is an argument of Op.
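These three rules map directly onto a small data structure. The sketch below encodes the example graph as a dictionary from each operation to its arguments; the node names (`matmul1`, `add1`, and so on) are labels invented here, and `graphlib` from the Python standard library (3.9 or later) is used only to obtain an evaluation order.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each operation node maps to the nodes it takes as arguments (edge: arg -> op).
graph = {
    "matmul1": ["W1", "X"],
    "add1": ["matmul1", "b1"],
    "f": ["add1"],
    "matmul2": ["W2", "f"],
    "add2": ["matmul2", "b2"],
    "L": ["add2", "Y_GT"],
}

all_nodes = set(graph) | {arg for args in graph.values() for arg in args}
variables = sorted(n for n in all_nodes if n not in graph)   # no incoming edges
operations = list(graph)                                     # have incoming edges

# A valid evaluation order visits every argument before the operation that uses it.
order = list(TopologicalSorter(graph).static_order())
```

Reversing `order` gives a sequence in which every operation appears before its arguments, which is the direction gradients travel.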

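A Little Bit of Vector Calculus

Jacobian matrix: for $y \in \mathbb{R}^M$ and $x \in \mathbb{R}^N$,

$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N} \end{bmatrix}$$

Example: for $y = Ax$, where $A$ is a matrix and $x$ is a vector,

$$\frac{\partial y}{\partial x} = A$$

Gradient: we use the column vector convention for the gradient of a scalar $L$:

$$\nabla_x L = \begin{bmatrix} \frac{\partial L}{\partial x_1} \\ \vdots \\ \frac{\partial L}{\partial x_N} \end{bmatrix} = \left(\frac{\partial L}{\partial x}\right)^T$$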
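Both conventions can be checked numerically with finite differences. The helper `numerical_jacobian` below is a hypothetical utility written for this illustration, and the scalar function $L = \sum_m y_m^2$ is an arbitrary example.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian with entry [m, n] = d y_m / d x_n."""
    y = func(x)
    jac = np.zeros((y.size, x.size))
    for n in range(x.size):
        step = np.zeros_like(x)
        step[n] = eps
        jac[:, n] = (func(x + step) - func(x - step)) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)

jac = numerical_jacobian(lambda v: A @ v, x)          # should match A

# Gradient of the scalar L = sum(y^2) in the column convention: (dL/dx)^T
row = numerical_jacobian(lambda v: np.atleast_1d(np.sum((A @ v) ** 2)), x)
grad = row.T[:, 0]
```

The Jacobian of a scalar function is a 1 x N row, so transposing it gives the N-element gradient used in the rest of the derivation.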

Neural Networks: Training

Training is performed by minimizing the loss function with stochastic gradient descent:

$$\theta \leftarrow \theta - \mu \sum_{\text{batch}} \nabla_\theta L$$

where $\theta$ denotes the parameters of the model and $\mu$ is the step size.

It is called stochastic because the gradient with respect to $\theta$ is evaluated only over a small random subset of the data (a minibatch).

Many different methods exist to average and scale the gradient when updating the parameters: Adam, RMSProp, SGD with momentum, etc.
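A minimal sketch of plain minibatch SGD on a linear model, under several assumptions: the data is synthetic and noise-free, the loss is half the squared error, and the minibatch gradient is averaged rather than summed (the difference can be absorbed into the step size). The step size, batch size, and number of steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model (hypothetical example data).
true_w, true_b = np.array([2.0, -1.0]), 0.5
X = rng.normal(size=(512, 2))
Y = X @ true_w + true_b

w, b = np.zeros(2), 0.0          # parameters theta = (w, b)
mu, batch_size = 0.1, 32         # step size and minibatch size (assumed values)

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # random minibatch
    residual = X[idx] @ w + b - Y[idx]
    # Gradient of 0.5 * mean(residual^2) over the minibatch
    grad_w = X[idx].T @ residual / batch_size
    grad_b = residual.mean()
    w -= mu * grad_w             # theta <- theta - mu * gradient
    b -= mu * grad_b
```

Each step sees only 32 of the 512 examples, yet the parameters still move toward the values that generated the data.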

Automatic Reverse-Mode Differentiation

How do we calculate the gradient?

Explicitly storing the computational graph and the value of every node allows gradients to be computed automatically from the end to the beginning (in reverse topological order). This is known as automatic reverse-mode differentiation, or backpropagation.

Modern deep learning frameworks:

1. TensorFlow, MXNet: build the graph first and then perform computation using the graph (static graph).
2. PyTorch, DyNet, TensorFlow eager: record the graph while doing computation (dynamic graph).
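The dynamic-graph idea can be sketched in a few lines of plain Python. The `Value` class below is a toy written for this illustration and is not the API of any of the frameworks named above: it handles scalars only, records each node's parents together with the local derivative while the forward computation runs, and then sweeps the recorded graph in reverse topological order.

```python
import math

class Value:
    """A scalar node that records its parents and local derivatives as it is computed."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents        # tuple of (parent_node, d self / d parent)

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data, ((self, other.data), (other, self.data)))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, ((self, 1.0 - t * t),))

    def backward(self):
        # Build a topological order by depth-first search, then sweep it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad   # accumulate over all consumers

w, x, b = Value(0.5), Value(2.0), Value(-0.3)
y = (w * x + b).tanh()
y.backward()
```

After `y.backward()`, the attributes `w.grad`, `x.grad`, and `b.grad` hold the derivatives of `y` with respect to each input.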

Chain Rule with a Computational Graph

Assume all nodes (variables or intermediate variables) except the last node (a scalar function) are vectors. Consider one node Var in the computational graph that is an argument of N operations:

Var → op_1(..., Var, ...), ..., op_N(..., Var, ...)

Then

$$\nabla_{\text{Var}} L = \sum_{i} \left(\frac{\partial\, \text{op}_i}{\partial\, \text{Var}}\right)^T \nabla_{\text{op}_i} L$$

where $\frac{\partial\, \text{op}_i}{\partial\, \text{Var}}$ is the Jacobian matrix.

To implement reverse-mode AD, we need to store the intermediate values of all nodes, which usually takes a lot of memory. Also, propagating the gradient along the edges is multiplicative, so it is easy to get overflow (gradient explosion) or underflow (gradient vanishing).
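The sum over consumers can be checked on a small example. In the sketch below a vector Var feeds two operations, a matrix product and an elementwise square; both operations, the loss, and all sizes are hypothetical choices made so that the Jacobians are easy to write down.

```python
import numpy as np

rng = np.random.default_rng(0)
var = rng.normal(size=3)
A = rng.normal(size=(2, 3))
c = rng.normal(size=2)

def loss(v):
    op1 = A @ v            # first consumer of Var
    op2 = v ** 2           # second consumer of Var
    return c @ op1 + np.sum(op2)

# Jacobians of each consumer with respect to Var
jac_op1 = A
jac_op2 = np.diag(2.0 * var)

# Gradients of L with respect to each consumer's output
grad_op1 = c
grad_op2 = np.ones(3)

# Chain rule: sum over consumers of (Jacobian)^T @ (upstream gradient)
grad_var = jac_op1.T @ grad_op1 + jac_op2.T @ grad_op2

# Finite-difference check
eps = 1e-6
numeric = np.array([(loss(var + eps * e) - loss(var - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
```

Dropping either term of the sum would give a gradient that disagrees with the finite-difference estimate.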

Elementwise Nonlinear Function

Most nonlinear functions used in neural networks are elementwise, or can be constructed from a combination of matrix multiplications and elementwise nonlinear functions.

For an elementwise function $f$, the Jacobian is diagonal, so the terms within the chain rule can be computed element-wise:

$$(\nabla_{\text{Var}} L)_i = \frac{\partial f_i}{\partial \text{Var}_i} (\nabla_f L)_i$$
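A short check that the diagonal Jacobian never needs to be materialized, assuming tanh as the elementwise function and a random vector standing in for the upstream gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
var = rng.normal(size=5)
upstream = rng.normal(size=5)            # stands in for the gradient of L w.r.t. f(Var)

local = 1.0 - np.tanh(var) ** 2          # derivative of tanh at each element

grad_full = np.diag(local).T @ upstream  # explicit diagonal Jacobian, O(n^2) memory
grad_fast = local * upstream             # same result, elementwise, O(n) memory
```

For a layer with n units, the elementwise form stores n numbers instead of an n x n matrix.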

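Vectorization

What if a variable is not a vector, e.g., a matrix or a tensor?

We vectorize it (assuming column vectors): $\text{vec}(\cdot)$ stacks the columns of its argument into one column vector, e.g.,

$$\text{vec}\begin{bmatrix} a & b \\ c & d \end{bmatrix} = [a, c, b, d]^T$$

Usually this is already the storage layout of the matrix or tensor, so there is no additional cost.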
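In NumPy the column-stacking vec can be written as a reshape. One caveat for this sketch: NumPy arrays are row-major by default, so `order="F"` is needed to stack columns, and whether the operation is free depends on the array's actual memory layout. The helper name `vec` is defined here for illustration.

```python
import numpy as np

def vec(m):
    """Stack the columns of `m` into a 1-D array (column-major order)."""
    return np.asarray(m).reshape(-1, order="F")

M = np.array([[1, 2],
              [3, 4]])
v = vec(M)                                   # columns [1, 3] and [2, 4] stacked
restored = v.reshape(M.shape, order="F")     # inverse operation
```

The same `order="F"` argument undoes the vectorization, which is how a vectorized gradient is turned back into a matrix.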

Vectorization of Matrix-Matrix Multiplication

Matrix multiplication is important because it represents the largest portion of operations in a neural network (linear layers, convolution layers, etc.). We use the identity

$$\text{vec}(ABC) = (C^T \otimes A)\,\text{vec}(B)$$

where $\otimes$ is the Kronecker product:

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1N}B \\ \vdots & \ddots & \vdots \\ a_{M1}B & \cdots & a_{MN}B \end{bmatrix}$$

Good news: typically, there is no need to calculate the Kronecker product explicitly.
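The identity is easy to confirm numerically with `np.kron`. The matrix sizes below are arbitrary, and `vec` is the column-stacking helper assumed throughout.

```python
import numpy as np

def vec(m):
    """Column-stacking vectorization."""
    return m.reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 5))

lhs = vec(A @ B @ C)                 # 10 entries, since ABC is 2 x 5
kron = np.kron(C.T, A)               # (C^T kron A) has shape (10, 12)
rhs = kron @ vec(B)
```

Even for these tiny matrices the Kronecker product has 120 entries, which hints at why it is avoided in practice.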

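Vectorization of Matrix-Matrix Multiplication

Example: assume $A$ and $X$ are $M \times N$ and $N \times P$ matrices, respectively. Then

$$\text{vec}(AX) = \text{vec}(I_M A X) = (X^T \otimes I_M)\,\text{vec}(A)$$

so we have

$$\frac{\partial\,\text{vec}(AX)}{\partial\,\text{vec}(A)} = X^T \otimes I_M$$

To propagate the gradient from $AX$ to $A$, denote $\text{vec}(\delta_{AX}) = \nabla_{\text{vec}(AX)} L$, where $\delta_{AX}$ has the same shape as $AX$. Then

$$\text{vec}(\delta_A) = (X^T \otimes I_M)^T \text{vec}(\delta_{AX}) = (X \otimes I_M)\,\text{vec}(\delta_{AX}) = \text{vec}(\delta_{AX} X^T)$$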
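The two routes to $\delta_A$ can be compared directly. In this sketch the sizes and the random upstream gradient `delta_AX` are assumptions, and the explicit Kronecker route is included only as a check, since the shortcut is what one would actually compute.

```python
import numpy as np

def vec(m):
    """Column-stacking vectorization."""
    return m.reshape(-1, order="F")

rng = np.random.default_rng(0)
M, N, P = 2, 3, 4
A = rng.normal(size=(M, N))
X = rng.normal(size=(N, P))
delta_AX = rng.normal(size=(M, P))       # upstream gradient, same shape as AX

# Explicit route: transpose of the Jacobian (X^T kron I_M) applied to vec(delta_AX)
jacobian = np.kron(X.T, np.eye(M))                       # shape (M*P, M*N)
delta_A_kron = (jacobian.T @ vec(delta_AX)).reshape(M, N, order="F")

# Shortcut: a single matrix product, no Kronecker product needed
delta_A = delta_AX @ X.T
```

The shortcut costs one M x P by P x N product, while the explicit route builds an (MP) x (MN) matrix first.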

Vectorization of Matrix-Matrix Multiplication: The Other Side

Similarly, to propagate the gradient from $AX$ to $X$:

$$\text{vec}(\delta_X) = \text{vec}(A^T \delta_{AX})$$

If we view multiplication by $A$ as an operator inside a neural network, then the gradient is propagated by applying its transposed operator.

This applies to all finite-dimensional linear operators used in machine learning. For example, to calculate the gradient of a convolution, we need to use the transposed convolution.
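A transposed operator $A^T$ is characterized by the identity $\langle AX, \delta \rangle = \langle X, A^T \delta \rangle$, which gives a simple numerical test. The sketch below applies it to the matrix case and to a 1-D "full" convolution, used here as a simplified stand-in for a convolution layer (real layers add channels, strides, and padding). For that case the transposed operator is a "valid" cross-correlation with the same kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix case: forward is X -> A X, backward is delta -> A^T delta.
A = rng.normal(size=(2, 3))
X = rng.normal(size=(3, 4))
delta_AX = rng.normal(size=(2, 4))
delta_X = A.T @ delta_AX

lhs_matrix = np.sum((A @ X) * delta_AX)      # <A X, delta>
rhs_matrix = np.sum(X * delta_X)             # <X, A^T delta>

# Convolution case: "full" 1-D convolution with a fixed kernel is linear in the signal.
kernel = rng.normal(size=3)
signal = rng.normal(size=6)
delta_out = rng.normal(size=signal.size + kernel.size - 1)

forward = np.convolve(signal, kernel, mode="full")
# Transposed operator: "valid" cross-correlation with the same kernel.
delta_signal = np.correlate(delta_out, kernel, mode="valid")

lhs_conv = forward @ delta_out
rhs_conv = signal @ delta_signal
```

In both cases the backward pass reuses the forward operator's parameters (`A` or `kernel`) and only changes how they are applied.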

Congratulations!

Now you know how these deep learning frameworks work, and in principle how to implement your own.

[Figure labels: Model, GPU Workstation, Data, Human Effort, Electricity]
