Lecture 35: Optimization and Neural Nets

1 Lecture 35: Optimization and Neural Nets CS 4670/5670 Sean Bell DeepDream [Google, Inceptionism: Going Deeper into Neural Networks, blog 2015]

2 Aside: CNN vs ConvNet. Note: Many papers use either phrase, but ConvNet is the preferred term, since CNN clashes with that other thing called CNN.

3 (Review) Softmax Classifier: $p_{i,j} = \frac{e^{s_{i,j}}}{\sum_k e^{s_{i,k}}}$, $L_i = -\log p_{i,y_i}$. Q2: At initialization, W is small and thus $s \approx 0$. What is the loss L? Slide from Karpathy 2016
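As a quick numerical check of Q2, here is a minimal sketch; the sizes N = 5 and M = 10 and the labels are illustrative assumptions, not values from the lecture. With all scores equal to zero, every class gets probability 1/M, so each $L_i$ should come out as $\log M$.

```python
import numpy as np

N, M = 5, 10                      # assumed sizes: 5 examples, 10 classes
s = np.zeros((N, M))              # scores at initialization (s ~= 0)
y = np.arange(N) % M              # arbitrary ground-truth labels

# Softmax probabilities: p[i, j] = exp(s[i, j]) / sum_k exp(s[i, k])
p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)

# Per-example loss: L_i = -log p[i, y_i]
L = -np.log(p[np.arange(N), y])
print(L.mean(), np.log(M))        # the two numbers agree
```

This is a handy sanity check in practice: if the first reported loss is far from $\log M$, something in the setup is probably off.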

8 Softmax vs SVM Loss Softmax: SVM: Slide from Karpathy 2016

15 Softmax Classifier. Let's code this up in NumPy: $p_{i,j} = \frac{e^{s_{i,j}}}{\sum_k e^{s_{i,k}}}$. Doesn't work; what's the problem?
- What if the value 1000 appears in s? Overflow: we get inf/inf = NaN
- What if the largest value is -1000? Underflow: we get 0/0 = NaN
This expression is numerically unstable
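The two failure cases can be reproduced with a direct translation of the formula. This is a sketch only: the function name `softmax_naive` and the test scores are assumptions for illustration, not the code from the original slide.

```python
import numpy as np

def softmax_naive(s):
    """Direct translation of the formula; s has shape (N, M)."""
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

# Silence the floating-point warnings so the NaNs are easy to see.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    # exp(1000) overflows to inf, so the first entry is inf / inf
    p_big = softmax_naive(np.array([[1000.0, 0.0, -5.0]]))
    # every exp underflows to 0, so every entry is 0 / 0
    p_small = softmax_naive(np.array([[-1000.0, -1001.0, -1002.0]]))

print(p_big)    # contains NaN
print(p_small)  # all NaN
```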

21 Softmax Classifier. Observation: subtracting a constant C from every score does not change p:
$p_{i,j} = \frac{e^{s_{i,j} - C}}{\sum_k e^{s_{i,k} - C}} = \frac{e^{-C} e^{s_{i,j}}}{e^{-C} \sum_k e^{s_{i,k}}} = \frac{e^{s_{i,j}}}{\sum_k e^{s_{i,k}}}$
If we choose C to be the max score, then it works:
- If a large value appears in s, it is shifted to 0, so its exponential becomes 1 and all the others are at most 1 (avoiding overflow)
- If all values in s are large negative, then they are shifted up towards 0 (avoiding underflow)
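A minimal sketch of the max-shift trick follows. The name `softmax_stable` and the test scores are illustrative assumptions; the shift is taken per row, so each example uses its own C.

```python
import numpy as np

def softmax_stable(s):
    """Softmax over each row of s (shape (N, M)), shifting by the row max."""
    shifted = s - s.max(axis=1, keepdims=True)   # largest entry in each row becomes 0
    e = np.exp(shifted)                          # every entry is now at most 1
    return e / e.sum(axis=1, keepdims=True)      # denominator is at least 1

s = np.array([[1000.0, 0.0, -5.0],
              [-1000.0, -1001.0, -1002.0]])
p = softmax_stable(s)
print(p)               # finite probabilities, no NaN
print(p.sum(axis=1))   # each row sums to 1
```

Adding any constant to a row of `s` leaves the output unchanged, which is exactly the observation above.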

22 Optimization How do we minimize the loss L?

23 Idea #1: Random Search

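33 Idea #2: Follow the Slope. In 1D, the derivative of a function is: $\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$. In multiple dimensions, the gradient is the vector of partial derivatives. We could estimate it directly with finite differences, but this is super slow: it needs one extra loss evaluation per weight. Slide from Karpathy 2016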
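The direct estimate can be sketched as below. The function name `numerical_gradient`, the step `h = 1e-5`, and the toy loss are assumptions chosen for illustration; the loop makes the cost visible, since the loss is re-evaluated once for every weight.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Estimate df/dw one coordinate at a time with forward differences."""
    grad = np.zeros_like(w)
    fw = f(w)                          # loss at the current point
    for idx in np.ndindex(w.shape):    # one extra loss evaluation per weight
        old = w[idx]
        w[idx] = old + h
        grad[idx] = (f(w) - fw) / h
        w[idx] = old                   # restore the original value
    return grad

# Toy loss (an assumption for illustration): f(w) = sum(w^2), whose gradient is 2w.
f = lambda w: np.sum(w ** 2)
w = np.array([[1.0, -2.0], [0.5, 3.0]])
g = numerical_gradient(f, w)
print(g)
print(2 * w)   # analytic gradient for comparison
```

The estimate is approximate as well as slow, which is why it is mostly useful for checking an analytic gradient.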

35 Idea #2: Follow the Slope. We can just compute the gradient directly, using calculus. Slide from Karpathy 2016

36 Idea #2: Follow the Slope Slide from Karpathy 2016

41 Gradient Descent. The step size is called the learning rate. Slide from Karpathy 2016
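A minimal sketch of the gradient descent loop, under stated assumptions: the quadratic loss, its target point, the learning rate of 0.1 and the 100 steps are all illustrative choices, not the classifier loss from the lecture.

```python
import numpy as np

def loss_and_grad(w):
    """Toy quadratic loss (assumed for illustration) and its analytic gradient."""
    target = np.array([3.0, -1.0])
    diff = w - target
    return np.sum(diff ** 2), 2 * diff

w = np.zeros(2)
learning_rate = 0.1             # the step size
for step in range(100):
    loss, grad = loss_and_grad(w)
    w -= learning_rate * grad   # step in the direction of the negative gradient

print(w, loss)   # w approaches the minimizer [3, -1]
```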

43 Mini-batch Gradient Descent. Note: In practice we use fancier update rules (momentum, RMSprop, Adam, etc.). Slide from Karpathy 2016
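The mini-batch variant can be sketched as follows, using the plain update rule rather than momentum or Adam. Everything concrete here is an assumption for illustration: the synthetic least-squares problem, the batch size of 32, the learning rate and the step count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = X @ w_true exactly, so the minimizer is known.
N, D = 1000, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(D)
learning_rate, batch_size = 0.1, 32
for step in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error, computed on the batch only.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= learning_rate * grad

print(w)   # close to w_true
```

Each step looks at only 32 of the 1000 examples, so the gradient is a noisy estimate of the full one, but it is much cheaper to compute.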

44 Mini-batch Gradient Descent Typical loss Slide from Karpathy 2016

45 Setting the learning rate. Learning rate: 1e6. What could go wrong? (Figure: loss L plotted against a weight in W.)

51 Setting the learning rate Typical loss (more on this later) Slide from Karpathy 2016

57 Classification: Overview. Training: Training Images + Training Labels -> Gradient Descent -> Trained Classifier. Testing: Test Image -> Trained Classifier -> Prediction: outdoor

58 Neural Networks (First we'll cover Neural Nets, then build up to Convolutional Neural Nets)

59 Feature hierarchy with ConvNets End-to-end models [Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, 2013]

60 Slide: R. Fergus

64 Inspiration from Biology. Neural nets are loosely inspired by biology, but they certainly are not a model of how the brain works, or even how neurons work. Figure: Andrej Karpathy

66 Simple Neural Net: 1 Layer. Let's consider a simple 1-layer network: x -> [Wx + b] -> f. This is the same as what you saw last class: $f = Wx + b$

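69 1 Layer Neural Net. Block Diagram: x (input) -> [Wx + b] -> f (class scores). Expanded Block Diagram: $\underbrace{W}_{M \times D}\,\underbrace{x}_{D \times 1} + \underbrace{b}_{M \times 1} = \underbrace{f}_{M \times 1}$, with M classes, D features, 1 example. NumPy: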
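The NumPy snippet from this slide did not survive in the transcript. A minimal sketch of the single-example case, with assumed sizes and random values, might look like this:

```python
import numpy as np

M, D = 3, 4                      # assumed sizes: 3 classes, 4 features
rng = np.random.default_rng(0)
W = rng.normal(size=(M, D))      # weights, M x D
b = rng.normal(size=(M, 1))      # bias, M x 1
x = rng.normal(size=(D, 1))      # one example as a column vector, D x 1

f = W.dot(x) + b                 # class scores, M x 1
print(f.shape)
```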

77 1 Layer Neural Net. How do we process N inputs at once? It's most convenient to have the first dimension (row) represent which example we are looking at, so we need to transpose everything:
$\underbrace{X}_{N \times D}\,\underbrace{W}_{D \times M} + \underbrace{\begin{bmatrix} b^\top \\ \vdots \\ b^\top \end{bmatrix}}_{N \times M} = \underbrace{F}_{N \times M}$, with M classes, D features, N examples.
Note: Often, if the weights are transposed, they are still called W
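A sketch of the batched, transposed form, with assumed sizes and random values. The last lines check that row i of F agrees with the single-example formula applied to example i.

```python
import numpy as np

N, D, M = 5, 4, 3                # assumed sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))      # each row is one example
W = rng.normal(size=(D, M))      # transposed weights: each column is one class
b = rng.normal(size=M)           # one bias per class

F = X.dot(W) + b[np.newaxis, :]  # b is repeated for every row; F is N x M

# Row 0 of F matches the single-example formula W^T x_0 + b.
f0 = W.T.dot(X[0]) + b
print(F.shape, np.allclose(F[0], f0))
```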

81 1 Layer Neural Net
- X (N x D): each row is one input example
- W (D x M): each column is the weights for one class
- F (N x M): each row is the predicted scores for one example

86 1 Layer Neural Net. Implementing this with NumPy: First attempt, let's try this (X is N x D, W is D x M, b has length M). Doesn't work; why?
- NumPy needs to know how to expand b from 1D to 2D
- This is called broadcasting

90 1 Layer Neural Net. Implementing this with NumPy: What does np.newaxis do? Given b = [0, 1, 2], it can make b a row vector or make b a column vector.

93 1 Layer Neural Net. Implementing this with NumPy: What does np.newaxis do? Row vector: repeated along rows. Column vector: repeated along columns.
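The original snippets are not preserved here, so the following is a sketch of what `np.newaxis` does to b = [0, 1, 2] and how each shape broadcasts against a 2D array. The 3 x 3 zero array `Z` is an assumption used only to make the repetition visible.

```python
import numpy as np

b = np.array([0, 1, 2])          # shape (3,)
row = b[np.newaxis, :]           # shape (1, 3): a row vector
col = b[:, np.newaxis]           # shape (3, 1): a column vector

Z = np.zeros((3, 3), dtype=int)
print(Z + row)   # the row [0, 1, 2] is copied into every row
print(Z + col)   # the column [0, 1, 2] is copied into every column

# NumPy aligns trailing dimensions first, so a 1-D b behaves like the row vector.
print(np.array_equal(Z + b, Z + row))
```

Because of that trailing-dimension rule, adding a length-M vector to an N x M array already treats it as a row; writing `b[np.newaxis, :]` spells the intent out, and the column form is the one that must always be explicit.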

100 2 Layer Neural Net. What if we just added another layer? x -> [$W^{(1)} x + b^{(1)}$] -> h -> [$W^{(2)} h + b^{(2)}$] -> f. Let's expand out the equation:
$f = W^{(2)} (W^{(1)} x + b^{(1)}) + b^{(2)} = (W^{(2)} W^{(1)}) x + (W^{(2)} b^{(1)} + b^{(2)})$
But this is just the same as a 1 layer net with: $W = W^{(2)} W^{(1)}$, $b = W^{(2)} b^{(1)} + b^{(2)}$. We need a non-linear operation between the layers
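The collapse can be confirmed numerically. In this sketch the sizes and the random weights are assumptions; the point is only that the two-layer output and the single-layer output agree.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, M = 4, 5, 3                          # assumed sizes: features, hidden units, classes
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(M, H)), rng.normal(size=M)
x = rng.normal(size=D)

# Two linear layers applied one after the other.
f_two = W2.dot(W1.dot(x) + b1) + b2

# The equivalent single layer.
W = W2.dot(W1)
b = W2.dot(b1) + b2
f_one = W.dot(x) + b

print(np.allclose(f_two, f_one))
```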

108 Nonlinearities
Sigmoid: historically popular. 2 big problems:
- Not zero-centered
- Saturates
Tanh:
- Zero-centered
- But also saturates
ReLU: best in practice for classification
- No saturation (for positive inputs)
- Very efficient

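118 Nonlinearities: Saturation. What happens if we reach this part (the flat tails of the sigmoid)? $\sigma(x) = \frac{1}{1 + e^{-x}}$, $\frac{\partial \sigma}{\partial x} = \sigma(x)(1 - \sigma(x))$. Saturation: the gradient is (nearly) zero!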
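A small sketch that evaluates the sigmoid gradient at a few points; the sample inputs are assumptions for illustration. The gradient peaks at x = 0 and shrinks quickly as |x| grows, which is the saturation described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
g = sigmoid_grad(x)
print(g)   # largest at x = 0, tiny at |x| = 10
```

When a unit sits in the flat region, almost no gradient flows back through it, so the weights feeding it barely update.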

120 Nonlinearities [Krizhevsky 2012] (AlexNet): tanh vs. ReLU. In practice, ReLU converges ~6x faster than tanh for classification problems

124 ReLU in NumPy. Many ways to write ReLU; these are all equivalent:
(a) Elementwise max; 0 gets broadcast to match the size of h1.
(b) Make a boolean mask where negative values are True, and then set those entries in h1 to 0.
(c) Make a boolean mask where positive values are True, and then do an elementwise multiplication (since int(True) = 1).
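The three variants can be sketched as below. The slide's own snippets are not preserved, so the exact expressions and the sample `h1` are assumptions; variant (b) works on a copy here so that all three results can be compared.

```python
import numpy as np

h1 = np.array([[-1.5, 0.0, 2.0],
               [3.0, -0.5, 1.0]])

# (a) Elementwise max; the scalar 0 is broadcast against h1.
relu_a = np.maximum(h1, 0)

# (b) Boolean mask of negative entries, set to 0 (on a copy, to keep h1 intact).
relu_b = h1.copy()
relu_b[relu_b < 0] = 0

# (c) Boolean mask of positive entries, used as a 0/1 multiplier.
relu_c = h1 * (h1 > 0)

print(relu_a)
```

Note that `np.maximum` is the elementwise function; `np.max` would instead reduce the array to a single value.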

132 2 Layer Neural Net. (Layer 1) x -> [$W^{(1)} x + b^{(1)}$] -> (Nonlinearity) max(0, ·) -> $h^{(1)}$ ("hidden activations") -> (Layer 2) [$W^{(2)} h^{(1)} + b^{(2)}$] -> f. Let's expand out the equation:
$f = W^{(2)} \max(0, W^{(1)} x + b^{(1)}) + b^{(2)}$
Now it no longer simplifies, yay. Note: any nonlinear function will prevent this collapse, but not all nonlinear functions actually work in practice
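A sketch of the two-layer forward pass for a batch, using the row-per-example convention from the 1-layer slides (so the weights are stored transposed). The sizes, random weights and zero biases are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, M = 5, 4, 6, 3                       # assumed sizes
X = rng.normal(size=(N, D))                   # each row is one example
W1, b1 = rng.normal(size=(D, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, M)), np.zeros(M)

h1 = np.maximum(X.dot(W1) + b1, 0)            # layer 1 followed by ReLU: hidden activations
F = h1.dot(W2) + b2                           # layer 2: class scores, N x M
print(h1.shape, F.shape)
```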

133 2 Layer Neural Net. Note: Traditionally, the nonlinearity was considered part of the layer and is called an "activation function". In this class, we will consider them separate layers, but be aware that many others consider them part of the layer

137 Neural Net: Graphical Representation (figures: 2 layers, 3 layers)
- Called fully connected because every output depends on every input.
- Also called affine or inner product

138 Questions?

140 Neural Networks, More generally: x -> Function -> $h^{(1)}$ -> Function -> $h^{(2)}$ -> ... -> f
Each Function can be:
- Fully connected layer
- Nonlinearity (ReLU, Tanh, Sigmoid)
- Convolution
- Pooling (Max, Avg)
- Vector normalization (L1, L2)
- Invent new ones

147 Neural Networks, More generally: x (input) -> Function (parameters $\theta^{(1)}$) -> $h^{(1)}$ (hidden activation) -> Function (parameters $\theta^{(2)}$) -> $h^{(2)}$ -> ... -> f (scores); f and the ground truth labels y feed into the loss L.
Here, $\theta$ represents whatever parameters that layer is using (e.g. for a fully connected layer, $\theta^{(1)} = \{W^{(1)}, b^{(1)}\}$).
Recall: the loss L measures how far the predictions f are from the labels y. The most common loss is Softmax.
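One way to sketch this general picture is a list of (function, parameters) pairs applied in order, followed by a softmax loss. All names here (`fully_connected`, `relu`, `softmax_loss`, `layers`), the sizes and the small random weights are hypothetical choices for illustration, not an interface from the lecture.

```python
import numpy as np

def fully_connected(h, theta):
    W, b = theta                  # theta = {W, b} for this layer
    return h.dot(W) + b

def relu(h, theta=None):          # a nonlinearity has no parameters
    return np.maximum(h, 0)

def softmax_loss(F, y):
    """Mean of -log p[i, y_i], using the max shift for numerical stability."""
    shifted = F - F.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
N, D, H, M = 4, 5, 6, 3           # assumed sizes
X = rng.normal(size=(N, D))       # inputs, one example per row
y = rng.integers(0, M, size=N)    # ground truth labels

# A network is a list of (function, parameters) pairs applied in order.
layers = [
    (fully_connected, (0.01 * rng.normal(size=(D, H)), np.zeros(H))),
    (relu, None),
    (fully_connected, (0.01 * rng.normal(size=(H, M)), np.zeros(M))),
]

h = X
for function, theta in layers:
    h = function(h, theta)        # each output is the next function's input
F = h                             # scores
L = softmax_loss(F, y)            # loss compares scores against labels
print(L)
```

With weights this small the scores are near zero, so the loss lands close to $\log M$.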
