UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 2
[Roadmap diagram: input x = (x1, x2, x3) → weights W1, W2, w3 → output ŷ] 3
Logistic Regression
P(Y=1 | x) = e^(wx+b) / (1 + e^(wx+b))
Sigmoid Function (aka logistic, logit, S, soft-step) 4
Expanded Logistic Regression
Input x (p = 3) plus a +1 bias input. Multiply by weights w1, w2, w3 and bias b, apply the summing function to get z, then the sigmoid function to get ŷ = P(Y=1 | x, w).
z = w^T x + b  (1x1 = 1xp · px1),  ŷ = sigmoid(z) = e^z / (1 + e^z) 5
Neuron [diagram: inputs x1, x2, x3 multiplied by weights w1, w2, w3, summed together with bias b1 to give z, then passed through a sigmoid] 6
Neurons 7 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Neuron
Input x (x1, x2, x3) is multiplied by weights w1, w2, w3, summed to z, and passed through a sigmoid to give ŷ. From here on, we leave out the bias for simplicity.
z = w^T x  (1x1 = 1xp · px1),  ŷ = sigmoid(z) = e^z / (1 + e^z) 8
Block View of a Neuron
Input x → dot product with the parameterized block w → z → sigmoid → output ŷ.
z = w^T x  (1x1 = 1xp · px1),  ŷ = sigmoid(z) = e^z / (1 + e^z) 9
Neuron Representation
[diagram: inputs x1, x2, x3, weights w1, w2, w3, summation, sigmoid → ŷ]
The linear transformation and nonlinearity together are typically considered a single neuron. 10
Neuron Representation (block view): x → (· w) → sigmoid → ŷ
The linear transformation and nonlinearity together are typically considered a single neuron. 11
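A minimal sketch of this single neuron in NumPy, under the slide's conventions (z = w^T x + b, ŷ = sigmoid(z)); the function names and example numbers below are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # linear transformation followed by the sigmoid nonlinearity
    z = np.dot(w, x) + b
    return sigmoid(z)

# example with p = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
print(neuron(x, w, b=0.05))   # a probability-like output in (0, 1)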
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 12
1-Layer Neural Network (with 4 neurons)
Input x (x1, x2, x3) is multiplied by the weight matrix W to give the vector z; each of the 4 neurons applies a summation (linear part) followed by a sigmoid to produce an output ŷ. Linear + sigmoid together form 1 layer. 14
1-Layer Neural Network (with 4 neurons)
With p = 3 inputs and d = 4 neurons:
z = W^T x  (dx1 = dxp · px1),  ŷ = sigmoid(z) = e^z / (1 + e^z), applied element-wise on the vector z (dx1). 15
Block View of a Neural Network
Input x → dot product with W (W is now a matrix, so z is now a vector) → z → sigmoid → output ŷ.
z = W^T x  (dx1 = dxp · px1),  ŷ = sigmoid(z) = e^z / (1 + e^z) 17
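A sketch of the same 1-layer network in NumPy, following the shapes above (W is p x d so that z = W^T x is d-dimensional); the random initialization is illustrative and the bias is omitted as in the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, d = 3, 4                      # input dimension and number of neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(p, d))      # weight matrix
x = rng.normal(size=p)

z = W.T @ x                      # d-dimensional pre-activation vector
y_hat = sigmoid(z)               # element-wise sigmoid gives d outputs
print(y_hat.shape)               # (4,)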
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 18
Multi-Layer Neural Network (Multi-Layer Perceptron (MLP) Network)
The weight subscript represents the layer number. Input x (x1, x2, x3) → hidden layer (weights W1) → output layer (weights w2) → ŷ. This is a 2-layer NN. 20
Multi-Layer Neural Network (MLP)
Input x (x1, x2, x3) → 1st hidden layer (W1) → 2nd hidden layer (W2) → output layer (w3) → ŷ. This is a 3-layer NN. 21
Multi-Layer Neural Network (MLP)
z1 = W1^T x,  h1 = sigmoid(z1)   (hidden layer 1)
z2 = W2^T h1, h2 = sigmoid(z2)   (hidden layer 2)
z3 = w3^T h2, ŷ = sigmoid(z3)    (output) 22
Multi-Class Output MLP
z1 = W1^T x,  h1 = sigmoid(z1)
z2 = W2^T h1, h2 = sigmoid(z2)
z3 = W3^T h2, ŷ = sigmoid(z3)    (W3 is now a matrix, so ŷ is a vector with one entry per class) 23
Block View Of MLP
x → (· W1) → z1 → sigmoid → h1 (1st hidden layer) → (· W2) → z2 → sigmoid → h2 (2nd hidden layer) → (· W3) → z3 → sigmoid → ŷ (output layer) 24
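A sketch of the forward pass this block view describes, in NumPy; the layer sizes and random weights are illustrative, and biases are omitted as in the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, W2, W3):
    # each layer: linear transformation followed by an element-wise sigmoid
    h1 = sigmoid(W1.T @ x)       # 1st hidden layer
    h2 = sigmoid(W2.T @ h1)      # 2nd hidden layer
    y_hat = sigmoid(W3.T @ h2)   # output layer
    return y_hat

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))     # 3 inputs -> 5 hidden units
W2 = rng.normal(size=(5, 4))     # 5 -> 4 hidden units
W3 = rng.normal(size=(4, 2))     # 4 -> 2 outputs
x = rng.normal(size=3)
print(mlp_forward(x, W1, W2, W3))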
Deep Neural Networks (i.e., more than 1 hidden layer)
Researchers have successfully trained object classifiers with networks over 1000 layers deep. 25
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 26
[Roadmap diagram: input x = (x1, x2, x3) → weights W1, W2, w3 → output ŷ → loss E(ŷ)] 27
Binary Classification Loss
ŷ = P(y=1 | x, W). This example is for a single sample x with true output y.
E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ) 28
Regression Loss
E = loss = 1/2 (y - ŷ)^2, where y is the true output. 29
Multi-Class Classification Loss
ŷ_i = P(y_i = 1 | x), with K = 3 classes here. The outputs z1, z2, z3 are passed through the softmax function, a normalizing function which converts each class output to a probability.
E = loss = -Σ_j y_j ln ŷ_j, for j = 1...K, where y_j = 0 for all classes except the true class. 30
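Minimal NumPy sketches of the three losses above for a single sample; the function names are mine, not from the slides.

import numpy as np

def binary_cross_entropy(y, y_hat):
    # E = -y log(y_hat) - (1 - y) log(1 - y_hat), with y in {0, 1}
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def squared_error(y, y_hat):
    # regression loss: E = 1/2 (y - y_hat)^2
    return 0.5 * (y - y_hat) ** 2

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, z):
    # multi-class loss: E = -sum_j y_j ln(softmax(z)_j)
    return -np.sum(y_onehot * np.log(softmax(z)))

print(binary_cross_entropy(1, 0.9))                                   # small loss: confident, correct
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 2.0, -1.0]))) # K = 3 example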
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 31
Training Neural Networks
How do we learn the optimal weights W_L for our task? Gradient descent: W_L(t+1) = W_L(t) - η ∂E/∂W_L(t).
But how do we get the gradients of the lower layers? Backpropagation!
Repeated application of the chain rule of calculus. Locally minimizes the objective. Requires all blocks of the network to be differentiable.
33 LeCun et al. Efficient BackProp. 1998
Backpropagation Intro (worked example; figures from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf)
Backpropagation Intro
Tells us: increasing x by 1 decreases f by roughly 4, i.e., the gradient of f with respect to x is -4. 46 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Backpropagation (binary classification example)
Example on a 1-hidden-layer NN for binary classification: input x (x1, x2, x3), hidden weights W1, output weights w2, ŷ = P(y=1 | x, W). 47
Backpropagation (binary classification example)
As blocks: x → (· W1) → z1 → sigmoid → h1 → (· w2) → z2 → sigmoid → ŷ, with loss E.
Gradient descent to minimize the loss needs the gradients ∂E/∂w2 and ∂E/∂W1. Need to find these! Exploit the chain rule:
∂E/∂w2 = (∂E/∂ŷ)(∂ŷ/∂z2)(∂z2/∂w2)
∂E/∂W1 = (∂E/∂ŷ)(∂ŷ/∂z2)(∂z2/∂h1)(∂h1/∂z1)(∂z1/∂W1)
The factors (∂E/∂ŷ)(∂ŷ/∂z2) in the second line were already computed for ∂E/∂w2, so they can be reused when moving down to the lower layer.
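A sketch of the full forward and backward pass for this example in NumPy, using the conventions above (z1 = W1^T x, h1 = sigmoid(z1), z2 = w2^T h1, ŷ = sigmoid(z2)) and assuming the binary cross-entropy loss; it also uses the standard simplification dE/dz2 = ŷ - y for sigmoid combined with cross-entropy. Names and sizes are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, w2):
    # forward pass
    z1 = W1.T @ x            # hidden pre-activation, shape (d,)
    h1 = sigmoid(z1)         # hidden activation
    z2 = w2 @ h1             # output pre-activation (scalar)
    y_hat = sigmoid(z2)      # predicted P(y = 1 | x)
    E = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # backward pass: chain rule, reusing quantities from the forward pass
    dz2 = y_hat - y                   # dE/dz2 for sigmoid + cross-entropy
    dw2 = dz2 * h1                    # dE/dw2
    dh1 = dz2 * w2                    # dE/dh1 (reuses dz2 computed above)
    dz1 = dh1 * h1 * (1 - h1)         # dE/dz1 (sigmoid local gradient)
    dW1 = np.outer(x, dz1)            # dE/dW1, same shape as W1
    return E, dW1, dw2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=4)
E, dW1, dw2 = forward_backward(np.array([1.0, -2.0, 0.5]), 1, W1, w2)
print(E, dW1.shape, dw2.shape)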
Local-ness of Backpropagation
[diagram: a block f with input activations x and output y; in the forward pass each block computes its output, and in the backward pass it multiplies the incoming gradient by its local gradients to pass a gradient back to its inputs] 63 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Example: Sigmoid Block
sigmoid(x) = σ(x) = e^x / (1 + e^x); its local gradient is dσ/dx = σ(x)(1 - σ(x)). 65
Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions)
x → (· W1) → z1 → sigmoid → h1 (1st hidden layer) → (· W2) → z2 → sigmoid → h2 (2nd hidden layer) → (· W3) → z3 → sigmoid → ŷ (output layer)
Want to find the optimal weights W to minimize some loss function E! 66
Backprop Whiteboard Demo
A 2-input, 2-hidden-unit, 1-output network with weights w1...w6 and biases b1, b2, b3.
f1: z1 = x1·w1 + x2·w3 + b1,  z2 = x1·w2 + x2·w4 + b2
f2: h1 = exp(z1) / (1 + exp(z1)),  h2 = exp(z2) / (1 + exp(z2))
f3: ŷ = h1·w5 + h2·w6 + b3
f4: E = (y - ŷ)^2
Update rule: w(t+1) = w(t) - η ∂E/∂w(t).  ∂E/∂w = ?? 67
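One way to answer the "??" for this exact network, sketched in NumPy; the parameter values are made up, and a finite-difference check (not part of the slide) is included to confirm the chain-rule gradients.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def demo_grads(x1, x2, y, w1, w2, w3, w4, w5, w6, b1, b2, b3):
    # forward pass (f1-f4 from the slide)
    z1 = x1 * w1 + x2 * w3 + b1
    z2 = x1 * w2 + x2 * w4 + b2
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y_hat = h1 * w5 + h2 * w6 + b3
    E = (y - y_hat) ** 2

    # backward pass: repeated application of the chain rule
    dy_hat = 2 * (y_hat - y)                  # dE/d(y_hat)
    dw5, dw6, db3 = dy_hat * h1, dy_hat * h2, dy_hat
    dz1 = dy_hat * w5 * h1 * (1 - h1)         # through the sigmoid of unit 1
    dz2 = dy_hat * w6 * h2 * (1 - h2)         # through the sigmoid of unit 2
    dw1, dw3, db1 = dz1 * x1, dz1 * x2, dz1
    dw2, dw4, db2 = dz2 * x1, dz2 * x2, dz2
    return E, {'w1': dw1, 'w2': dw2, 'w3': dw3, 'w4': dw4,
               'w5': dw5, 'w6': dw6, 'b1': db1, 'b2': db2, 'b3': db3}

params = dict(x1=1.0, x2=-2.0, y=1.0, w1=0.1, w2=0.2, w3=-0.3,
              w4=0.4, w5=0.5, w6=-0.6, b1=0.0, b2=0.0, b3=0.0)
E, grads = demo_grads(**params)

# numerical check of dE/dw1 by finite differences
eps = 1e-6
E_plus, _ = demo_grads(**dict(params, w1=params['w1'] + eps))
print(grads['w1'], (E_plus - E) / eps)   # the two values should nearly match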
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 68
Nonlinearity Functions (i.e., transfer or activation functions)
[neuron diagram: inputs x1, x2, x3 and a +1 bias input multiplied by weights w1, w2, w3 and b1, passed to the summing function producing z, then through the sigmoid function] 69
Nonlinearity Functions (aka transfer or activation functions)
[comparison table of activation functions: name, plot, equation, and derivative w.r.t. x; one entry is annotated "usually works best in practice"]
https://en.wikipedia.org/wiki/activation_function#comparison_of_activation_functions
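The individual rows of the table are not recoverable from the extracted text; as an illustrative stand-in, here are three standard activations (sigmoid, tanh, ReLU) with their derivatives in NumPy.

import numpy as np

def sigmoid(x):           # equation: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):         # derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

def tanh(x):              # equation: tanh(x)
    return np.tanh(x)

def d_tanh(x):            # derivative: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu(x):              # equation: max(0, x)
    return np.maximum(0, x)

def d_relu(x):            # derivative: 1 where x > 0, else 0
    return (x > 0).astype(float)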
Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 73
Neural Net Pipeline
1. Initialize weights
2. For each batch of input samples S:
   a. Run the network forward on S to compute outputs and the loss
   b. Run the network backward using the outputs and loss to compute gradients
   c. Update the weights using SGD (or a similar method)
3. Repeat step 2 until the loss converges
A runnable sketch of this loop follows. 74
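A minimal sketch of the pipeline in NumPy for a 1-hidden-layer binary classifier like the backprop example above; the toy data, layer sizes, learning rate, and epoch count are all illustrative choices, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy binary labels

# 1. initialize weights (small random values)
W1 = 0.1 * rng.normal(size=(3, 8))
w2 = 0.1 * rng.normal(size=8)
lr, batch_size = 0.5, 20

for epoch in range(50):                              # 3. repeat until convergence
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):       # 2. for each batch S
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # 2a. forward: compute outputs (loss would be the mean cross-entropy)
        H1 = sigmoid(Xb @ W1)
        y_hat = sigmoid(H1 @ w2)
        # 2b. backward: compute gradients with the chain rule
        dz2 = (y_hat - yb) / len(yb)
        dw2 = H1.T @ dz2
        dZ1 = (dz2[:, None] * w2[None, :]) * H1 * (1 - H1)
        dW1 = Xb.T @ dZ1
        # 2c. update weights with SGD
        w2 -= lr * dw2
        W1 -= lr * dW1

p = sigmoid(sigmoid(X @ W1) @ w2)
print("training accuracy:", np.mean((p > 0.5) == (y == 1)))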
Non-Convexity of Neural Nets
In very high dimensions, there exist many local minima which are about the same.
Pascanu et al. On the saddle point problem for non-convex optimization. 2014. 75
Building Deep Neural Nets
[diagram: a differentiable block f mapping input x to output y, used as the building unit of larger networks] 76 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Building Deep Neural Nets GoogLeNet for Object Classification 77
Block Example Implementation 78 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
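The referenced slide shows a small example implementation of such a block with forward and backward methods; here is a comparable sketch (not the slide's code) for a multiply block, with class and method names of my choosing.

import numpy as np

class MultiplyBlock:
    # a block computes its output in forward() and, in backward(),
    # multiplies the incoming gradient by its local gradients
    def forward(self, x, w):
        self.x, self.w = x, w           # cache inputs for the backward pass
        return x * w

    def backward(self, dz):
        dx = dz * self.w                # d(x*w)/dx = w
        dw = dz * self.x                # d(x*w)/dw = x
        return dx, dw

block = MultiplyBlock()
z = block.forward(np.array([1.0, 2.0]), np.array([-3.0, 0.5]))
dx, dw = block.backward(np.ones(2))     # pretend dE/dz = 1
print(z, dx, dw)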
Advantage of Neural Nets
As long as it is fully differentiable, we can train the model to automatically learn features for us. 79
Advanced Deep Learning Models: Convolutional Neural Networks & Recurrent Neural Networks Most slides from http://cs231n.stanford.edu/ 80
Convolutional Neural Networks (aka CNNs and ConvNets) 81
Challenges in Visual Recognition
Problems with Fully Connected Networks on Images
Each hidden unit connects to every pixel: for a 1000x1000 image and 1M hidden units, that is 10^12 parameters. But spatial correlation is local, so it makes more sense to connect units only to local regions. And how do we deal with the multiple dimensions of the input (width, height, and the R, G, B channels)?
Convolutional Layer
Convolution
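A naive NumPy sketch of the 2-D convolution (implemented here as cross-correlation, as ConvNet layers usually are) for a single channel and a single filter, with no padding and stride 1; the tiny example filter is an illustrative choice.

import numpy as np

def conv2d(image, kernel):
    # slide the kernel over the image and take a dot product at each location
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, -1.0]])      # a tiny horizontal-difference filter
print(conv2d(image, edge_filter).shape)    # (5, 4) activation map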
Neuron View of Convolutional Layer
Convolutional Layer
Neuron View of Convolutional Layer 96
Convolutional Neural Networks
Convolutional Neural Networks http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ 100
Pooling Layer 101
Pooling Layer (Max Pooling example) 102
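A sketch of 2x2 max pooling with stride 2 in NumPy for a single channel; the window size, stride, and example values are illustrative, not taken from the slide.

import numpy as np

def max_pool(x, size=2, stride=2):
    H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # keep only the largest activation in each window
            out[i, j] = np.max(x[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))   # [[6., 8.], [3., 4.]]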
History of ConvNets
1998: Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner]
2012: ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, 2012] 103
Recurrent Neural Networks 108
Standard Feed-Forward Neural Network
Recurrent Neural Networks (RNNs): RNNs can handle …
Recurrent Neural Networks (RNNs)
[diagram comparing a traditional feed-forward neural network (input → hidden → output) with a recurrent neural network, whose hidden layer feeds back into itself and which can predict a vector at each timestep]
Recurrent Neural Networks
[diagram: the hidden state h_t is computed from the previous hidden state h_{t-1} and the current input]
Vanilla Recurrent Neural Network
[diagram: h_t as a function of h_{t-1} and the current input x_t] 116
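A sketch of a single step of the vanilla RNN in NumPy; the tanh update h_t = tanh(W_hh h_{t-1} + W_xh x_t) is the usual form, assumed here rather than read off the slide, and sizes and initialization are illustrative.

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # new hidden state from the previous hidden state and the current input
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # output (prediction) at this timestep
    y_t = W_hy @ h_t
    return h_t, y_t

rng = np.random.default_rng(0)
hidden, in_dim, out_dim = 8, 5, 3
W_xh = 0.1 * rng.normal(size=(hidden, in_dim))
W_hh = 0.1 * rng.normal(size=(hidden, hidden))
W_hy = 0.1 * rng.normal(size=(out_dim, hidden))

h = np.zeros(hidden)
for x_t in rng.normal(size=(4, in_dim)):   # unroll over a length-4 sequence
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
print(h.shape, y.shape)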
Character Level Language Model with an RNN
Example: Generating Shakespeare with RNNs 124
Example: Generating C Code with RNNs 125
Long Short Term Memory Networks (LSTMs)
Plain recurrent networks suffer from the vanishing gradient problem, so they aren't able to model long-term dependencies in sequences. LSTMs use gating units to learn when to remember; a sketch of one LSTM step follows.
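A sketch of a single LSTM step in NumPy showing the gating units (input, forget, and output gates); the exact parameterization, sizes, and initialization are standard/illustrative assumptions, not taken from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # one weight matrix W produces all four gate pre-activations at once
    concat = np.concatenate([h_prev, x_t])
    i, f, o, g = np.split(W @ concat + b, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates take values in (0, 1)
    g = np.tanh(g)                                 # candidate cell update
    c_t = f * c_prev + i * g                       # forget old memory / write new memory
    h_t = o * np.tanh(c_t)                         # expose part of the memory as the hidden state
    return h_t, c_t

hidden, in_dim = 6, 4
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4 * hidden, hidden + in_dim))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=in_dim), h, c, W, b)
print(h.shape, c.shape)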
RNNs and CNNs Together 128