Deep Learning & Artificial Intelligence WS 2018/2019

Bridget Nelson
5 years ago
1 Deep Learning & Artificial Intelligence WS 2018/2019

2 Linear Regression

3 Model

4 Model

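5 Error Function: Squared Error. $E = \frac{1}{2}(\hat{y} - y)^2$, where $\hat{y}$ is the prediction and $y$ is the ground truth / target. The factor $\frac{1}{2}$ has no special meaning except that it makes the gradients look nicer.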
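A minimal sketch of the squared error and its gradient, assuming the usual form E = 1/2 (prediction - target)^2 that the remark about nicer gradients suggests. The function names are hypothetical and not taken from the slides.

```python
import numpy as np

def squared_error(y_pred, y_true):
    """Squared error with the conventional factor 1/2 (assumed form)."""
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def squared_error_grad(y_pred, y_true):
    """Gradient w.r.t. the prediction: the 1/2 cancels the exponent 2."""
    return y_pred - y_true

example_error = squared_error(np.array([3.0]), np.array([1.0]))
example_grad = squared_error_grad(np.array([3.0]), np.array([1.0]))
```

Without the factor 1/2 the gradient would carry an extra factor 2, which changes nothing about the location of the minimum.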

6 Objective Function: with a single example; with a set of examples

7 Objective Function Solution

8 Closed Form Solution

9 Closed Form Solution: fast to compute; only exists for some models and error functions; must be determined manually
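For linear regression with squared error, the closed form is commonly written as the normal equations. The sketch below assumes that form; the data, the names `X_aug` and `theta`, and the trick of appending a column of ones for the bias are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 examples, 3 features (synthetic)
true_w = np.array([2.0, -1.0, 0.5])     # hypothetical ground-truth weights
true_b = 0.3                            # hypothetical ground-truth bias
y = X @ true_w + true_b                 # noise-free targets

# Append a column of ones so the bias is learned as an extra weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: (X^T X) theta = X^T y, solved without an explicit inverse.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_hat, b_hat = theta[:-1], theta[-1]
```

Solving the linear system with `np.linalg.solve` is generally preferable to forming the matrix inverse explicitly.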

10 Gradient Descent

11 Gradient Descent
1. Initialize the parameters at random
2. Compute the error
3. Compute the gradients w.r.t. the parameters
4. Apply the gradient descent update rule
5. Go back to 2. and repeat until the error no longer decreases
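The five steps above can be sketched for linear regression as follows. The update rule is assumed to be the standard one (parameter minus learning rate times gradient); the learning rate `lr`, the stopping tolerance, and the synthetic data are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 0.7     # synthetic, noise-free targets

w = rng.normal(size=2)                  # 1. initialize at random
b = 0.0
lr = 0.1                                # hypothetical learning rate
prev_error = np.inf

for step in range(10_000):
    residual = X @ w + b - y
    error = 0.5 * np.mean(residual ** 2)        # 2. compute error
    if prev_error - error < 1e-12:              # 5. stop once the error stalls
        break
    grad_w = X.T @ residual / len(y)            # 3. gradients w.r.t. parameters
    grad_b = residual.mean()
    w -= lr * grad_w                            # 4. apply the update rule
    b -= lr * grad_b
    prev_error = error
```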

12 Computing Gradients

13 Computing Gradients (Kronecker delta)

14 Computing Gradients

15 Computing Gradients

16 Gradient Descent (Result)
1. Initialize the parameters at random
2. Compute the error
3. Compute the gradients w.r.t. the parameters
4. Apply the gradient descent update rule
5. Go back to 2. and repeat until the error no longer decreases

17 Probabilistic Interpretation: error term that captures unmodeled effects or random noise

18 Probabilistic Interpretation: error term that captures unmodeled effects or random noise

19 Probabilistic Interpretation: error term that captures unmodeled effects or random noise

20 Likelihood

21 Maximum Likelihood

22 Log-Likelihood

23 Maximum Log-Likelihood
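The chain from likelihood to maximum log-likelihood can be written out as below. This is a sketch under the common assumption that the error term is zero-mean Gaussian with fixed variance; the symbols $\theta$, $x^{(i)}$, $y^{(i)}$, $\epsilon^{(i)}$ and $\sigma^2$ are assumed notation, since the slide formulas are not preserved here.

```latex
\begin{align*}
y^{(i)} &= \theta^\top x^{(i)} + \epsilon^{(i)},
  \qquad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) \\
L(\theta) &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2}{2\sigma^2}\right) \\
\log L(\theta) &= -\frac{N}{2}\log\bigl(2\pi\sigma^2\bigr)
  - \frac{1}{\sigma^2} \cdot \frac{1}{2}\sum_{i=1}^{N} \bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2 \\
\arg\max_{\theta} \log L(\theta) &= \arg\min_{\theta}
  \frac{1}{2}\sum_{i=1}^{N} \bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2
\end{align*}
```

Under this assumption, maximizing the log-likelihood is equivalent to minimizing the squared-error objective.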

24 Neural Networks & Backpropagation

25 Error Function: prediction vs. ground truth / target

26 Simple Fully-Connected Neural Network

27 Objective Function: with a single example; with a set of examples

28 Gradients: Towards Backpropagation

29 Gradients: Towards Backpropagation. Note: number of neurons of the layer (excluding the bias unit 1)

30 Gradients: Towards Backpropagation

31 Gradients: Towards Backpropagation. Can you derive it on your own?

32 Gradients: Towards Backpropagation

33 Gradients: Towards Backpropagation

34 Gradients: Towards Backpropagation

35 Gradients: Towards Backpropagation

36 Gradients: Towards Backpropagation

37 Gradients: Towards Backpropagation. Can you derive it on your own?

38 Backpropagation: delta messages
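A minimal sketch of delta messages, assuming a network with one tanh hidden layer, a linear output layer, and squared error with factor 1/2. Here a delta is taken to be the derivative of the error w.r.t. a layer's pre-activation; the layer sizes and variable names are hypothetical, and the slides' own notation may differ. A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # input
y = rng.normal(size=2)                       # target
W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)) * 0.5, np.zeros(2)

def forward(W1, b1, W2, b2, x):
    z1 = W1 @ x + b1                         # hidden pre-activation
    a1 = np.tanh(z1)                         # hidden activation
    y_hat = W2 @ a1 + b2                     # linear output
    return z1, a1, y_hat

def loss(W1, b1, W2, b2, x, y):
    return 0.5 * np.sum((forward(W1, b1, W2, b2, x)[2] - y) ** 2)

z1, a1, y_hat = forward(W1, b1, W2, b2, x)

delta2 = y_hat - y                           # delta at the linear output layer
delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)   # message passed back through tanh

grad_W2, grad_b2 = np.outer(delta2, a1), delta2
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# Central finite difference on a single weight as a sanity check.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, b1, W2, b2, x, y)
           - loss(W1_minus, b1, W2, b2, x, y)) / (2 * eps)
```

Each weight gradient is the outer product of the layer's delta with the layer's input, which is why the deltas are the only quantities that need to be propagated backwards.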

39 Activation Functions & Vanishing Gradients

40 Common Activation Functions

41 Common Activation Functions: small or even tiny gradient

42 Vanishing Gradients: element-wise multiplication with small or even tiny gradients for each layer. In a neural network with many layers, the gradients of the objective function w.r.t. the weights of a layer close to the inputs may become near zero, and the gradient descent updates will starve.
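To get a feeling for the numbers, the sketch below uses the sigmoid as an example activation (an assumption; the slide does not name one here). Its derivative is at most 0.25, so the product of activation derivatives across layers shrinks at least geometrically. The weight matrices are deliberately ignored, so this is only an upper bound on that one factor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0.
max_grad = sigmoid_grad(0.0)

# Best-case product of activation derivatives after `depth` layers.
upper_bounds = {depth: max_grad ** depth for depth in (1, 5, 10, 20)}
```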

43 Weight Initialization

44 The Importance of Weight Initialization. Figure: simple CNN trained on MNIST for 12 epochs; 10-batch rolling average of training loss. Image Source:

45 The Importance of Weight Initialization: initialization with 0 values is ALWAYS WRONG! A 0 here means everything is 0, which means no error signal. How to initialize properly?
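The effect can be checked on a tiny example. The sketch assumes one tanh hidden layer, a linear output and squared error; sizes and names are hypothetical. With all parameters at zero, every weight gradient is exactly zero, so gradient descent cannot move the weights (in this particular sketch only the output bias would still receive a gradient).

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])               # arbitrary input
y = np.array([1.0, -1.0])                    # arbitrary target
W1, b1 = np.zeros((4, 3)), np.zeros(4)       # all-zero initialization
W2, b2 = np.zeros((2, 4)), np.zeros(2)

a1 = np.tanh(W1 @ x + b1)                    # hidden activations are all 0
y_hat = W2 @ a1 + b2                         # prediction is all 0

delta2 = y_hat - y                           # error signal at the output
delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)   # blocked by the zero weights

grad_W2 = np.outer(delta2, a1)               # zero because a1 is zero
grad_W1 = np.outer(delta1, x)                # zero because delta1 is zero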

46 Information Flow in a Neural Network. Consider a network with 5 hidden layers, 100 neurons per hidden layer, and the identity function as the hidden layer activation function. Let's omit the bias term for simplicity (commonly initialized with all 0s).
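The setup above can be simulated directly. The sketch assumes a 100-dimensional standard normal input and zero-mean Gaussian weights; the three standard deviations are hypothetical values chosen to show shrinking, exploding and roughly preserved activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 5                            # 100 neurons, 5 hidden layers

def final_activation_std(weight_std):
    """Std of the last hidden layer's activations for Gaussian weights."""
    a = rng.normal(size=(n, 1000))           # 1000 random input vectors
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(n, n))
        a = W @ a                            # identity activation, no bias
    return a.std()

too_small = final_activation_std(0.01)             # activations shrink
too_large = final_activation_std(1.0)              # activations explode
preserving = final_activation_std(1.0 / np.sqrt(n))  # roughly unit scale
```

Each layer multiplies the activation variance by roughly n times the weight variance, so only a weight variance near 1/n keeps the scale stable across layers.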

47 Information Flow in a Neural Network Image Source:

48 Information Flow in a Neural Network. What's the explanation for the previous image? One layer with some activation function and without the bias term:

49 Information Flow in a Neural Network

50 Information Flow in a Neural Network

51 Information Flow in a Neural Network. Term (1) tends to 0 when either term (2) or term (3) tends to 0. Goal: preserve the variance of activations throughout the network.

52 Information Flow in a Neural Network: variance approximation is possible when the pre-activations of the neurons are close to zero.

53 Variance: basic properties of variance for independent random variables with expected value = 0

54 Variance of Activations: random variables

55 Variance of Activations

56 Variance of Activations: variance preservation

57 Variance of Error Contribution

58 Variance of Error Contribution

59 Variance of Error Contribution: assumption

60 Variance of Error Contribution: random variables

61 Variance of Error Contribution

62 Variance of Error Contribution: variance preservation

63 Glorot Initialization. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.
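A minimal sketch of the uniform variant from the cited paper, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), giving a weight variance of 2 / (fan_in + fan_out). The function name and the layer sizes are hypothetical.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix with Glorot uniform scaling."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = glorot_uniform(fan_in=300, fan_out=100, rng=rng)

# Compromise between preserving forward and backward variance.
target_var = 2.0 / (300 + 100)
empirical_var = W.var()
```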

64 Optimization Methods

65 Gradient Descent: a too large learning rate causes zig-zagging; a too small learning rate causes starvation. Martens, J. (2010). Deep Learning via Hessian-Free Optimization. In Proceedings of the 27th International Conference on Machine Learning.

66 Batch Gradient Descent: update based on the entire training data set; susceptible to converging to local minima; expensive and inefficient for large training data sets

67 Stochastic Gradient Descent (SGD): update based on a single example; more robust against local minima; noisy updates, hence a small learning rate

68 Mini-Batch Gradient Descent: update based on multiple examples; more robust against local minima; more stable than stochastic gradient descent; most common; often also called SGD despite using multiple examples
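The three variants differ only in how many examples enter each update. The sketch below shows the mini-batch case on synthetic linear regression data; `batch_size`, `lr` and the number of epochs are hypothetical values. Setting `batch_size = 1` would give stochastic gradient descent and `batch_size = len(y)` batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
true_w = np.array([1.0, -2.0, 0.5])          # hypothetical ground truth
y = X @ true_w                               # noise-free targets

w = rng.normal(size=3)                       # random initialization
lr, batch_size = 0.1, 32                     # hypothetical hyperparameters

for epoch in range(50):
    order = rng.permutation(len(y))          # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / len(idx)  # gradient on this mini-batch only
        w -= lr * grad
```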

69 Gradient Descent with Momentum: momentum dampens oscillations; the gradient is computed before momentum is applied. Typical momentum term:

70 Gradient Descent with Nesterov Momentum: the gradient is computed after momentum is applied; the anticipated update from momentum is used to include knowledge of momentum in the gradient; typically preferred over vanilla momentum
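The difference between the two momentum variants is only where the gradient is evaluated. The sketch below uses a hypothetical ill-conditioned quadratic as the objective; the learning rate `lr` and momentum term `mu` are illustrative values, not the ones from the slides.

```python
import numpy as np

A = np.diag([1.0, 25.0])                     # ill-conditioned quadratic (toy)

def grad(theta):
    """Gradient of f(theta) = 0.5 * theta^T A theta."""
    return A @ theta

def run(nesterov, lr=0.02, mu=0.9, steps=300):
    theta = np.array([1.0, 1.0])
    v = np.zeros(2)                          # velocity (accumulated updates)
    for _ in range(steps):
        # Vanilla: gradient at the current point.
        # Nesterov: gradient at the point the momentum step would reach.
        point = theta + mu * v if nesterov else theta
        v = mu * v - lr * grad(point)
        theta = theta + v
    return theta

theta_momentum = run(nesterov=False)
theta_nesterov = run(nesterov=True)
```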

71 AdaGrad: adaptive (per-weight) learning rates; learning rates of frequently occurring features are reduced while learning rates of infrequent features remain large; monotonically decreasing learning rates; suited for sparse data. Typical learning rate:
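A minimal sketch of the AdaGrad update, which divides each weight's step by the square root of its accumulated squared gradients. The toy gradient stream (one feature active at every step, one at every tenth step) and the values of `lr` and `eps` are hypothetical.

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.01, eps=1e-8):
    """One AdaGrad update; `cache` holds the per-weight sum of squared gradients."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

theta = np.zeros(2)
cache = np.zeros(2)
for step in range(100):
    # Feature 0 is active at every step, feature 1 only at every tenth step.
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    theta, cache = adagrad_step(theta, grad, cache)

# Per-weight effective learning rate after 100 steps.
effective_lr = 0.01 / (np.sqrt(cache) + 1e-8)
```

Because the cache only grows, each effective learning rate can only decrease over time, and it decreases faster for the frequently active feature.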

72 RMSProp Typical hyperparameters:
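A minimal sketch of the RMSProp update in its commonly stated form, which replaces AdaGrad's growing sum with an exponential moving average of squared gradients. The values of `lr`, `decay` and `eps` are commonly used ones and are assumptions here, since the slide's values are not preserved.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update; `avg_sq` is a moving average of squared gradients."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

theta = np.array([1.0])
avg_sq = np.zeros(1)
for _ in range(100):
    grad = np.array([1.0])                   # constant toy gradient
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq)
```

With a constant gradient the moving average settles at the squared gradient instead of growing without bound, so the effective learning rate does not decay towards zero.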

73 Adam: often used these days. Typical hyperparameters:
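A minimal sketch of one Adam step with bias correction. The defaults shown (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly cited ones and are an assumption here, since the slide's values are not preserved; the function name is hypothetical.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad         # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2    # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for zero init
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta_after, m, v = adam_step(
    theta=np.array([1.0]), grad=np.array([5.0]),
    m=np.zeros(1), v=np.zeros(1), t=1,
)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by about `lr` regardless of the gradient's magnitude.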

74 Computation Graphs

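75 Matrix-Vector Multiplication. Figure: computation graph for y = MATMUL(W, x), with W a float MATRIX, x a float VECTOR and y a float VECTOR. Legend: each symbolic variable has a SYMBOL and a TYPE with a data type; MATMUL is an OPERATION.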

76 Indexing. Figure: computation graph for the INDEXING operation B = A[i], with A a float MATRIX, i an int VECTOR and B a float MATRIX.

77 Graph Optimization. Figure: computation graph over the float SCALAR variables x, y and z with the operations MULTIPLY and DIVIDE, shown before and after OPTIMIZATION.

78 Automatic Differentiation. Figure: computation graph for y = SQUARE(x) with x and y float SCALARs; GRAD(y, x) yields dy/dx = MULTIPLY(2, x), where 2 is a float SCALAR constant.
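The idea that GRAD returns a new graph rather than a number can be sketched with a few node classes. This is a toy illustration with hypothetical class names, not the API of any particular framework.

```python
class Const:
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value
    def grad(self, wrt):
        return Const(0.0)

class Var:
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]
    def grad(self, wrt):
        return Const(1.0 if self is wrt else 0.0)

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)
    def grad(self, wrt):
        return Add(self.left.grad(wrt), self.right.grad(wrt))

class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) * self.right.eval(env)
    def grad(self, wrt):
        # Product rule, expressed as new graph nodes.
        return Add(Multiply(self.left.grad(wrt), self.right),
                   Multiply(self.left, self.right.grad(wrt)))

class Square:
    def __init__(self, arg):
        self.arg = arg
    def eval(self, env):
        return self.arg.eval(env) ** 2
    def grad(self, wrt):
        # d/dx arg^2 = (2 * arg) * d(arg)/dx, again as graph nodes.
        return Multiply(Multiply(Const(2.0), self.arg), self.arg.grad(wrt))

x = Var("x")
y = Square(x)                  # y = SQUARE(x)
dy_dx = y.grad(x)              # symbolic graph equivalent to MULTIPLY(2, x)
value = y.eval({"x": 3.0})
slope = dy_dx.eval({"x": 3.0})
```

A real framework would additionally simplify the resulting graph, for example by removing the multiplication by the constant 1.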

79 Neural Network Layers. Figure: the DENSE layer op expands to z = ADD(MATMUL(W, x), b) followed by a = TANH(z), with W a float MATRIX and x, b, z and a float VECTORs.
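Read as code, the dense layer graph corresponds to the short function below. This is a NumPy sketch with a hypothetical function name; the values of `W`, `x` and `b` are arbitrary.

```python
import numpy as np

def dense(W, x, b):
    """Dense layer as a composition of primitive operations."""
    z = W @ x + b              # ADD(MATMUL(W, x), b)
    a = np.tanh(z)             # TANH(z)
    return a

W = np.eye(2)                  # float matrix
x = np.array([0.0, 1.0])       # float vector
b = np.array([0.0, -1.0])      # float vector
a = dense(W, x, b)
```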
