Neural Networks
Yan Shao
Department of Linguistics and Philology, Uppsala University
7 December 2016
Outline Part 1
- Introduction
- Feedforward Neural Networks
- Stochastic Gradient Descent
- Computational Graph & Backpropagation
- Dropout

Outline Part 2
- Word Embeddings
- Recurrent Neural Network
- Some Use Cases in NLP
- TensorFlow and the Assignment
Introduction: History
- 1940s: first neural network computing model
- 1950s: two-layer network, the perceptron
- 1980s: backpropagation
- 2009 (2011) - now: recent great success in deep learning
Introduction: Deep Learning Revolution
Introduction: Deep Reinforcement Learning
Introduction: Language Modelling, Question Answering, Speech Recognition (Black Mirror, Season 2, Be Right Back)
Introduction: Language Modelling, Recurrent Neural Network (http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
Introduction: Deep Convolutional Neural Networks (https://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/)
Introduction: Machine Translation (https://research.googleblog.com/2016/09/a-neural-network-for-machine.html)
Feedforward Neural Network: Linear Perceptron, $f(x_i) = \sum_i w_i x_i + b$ (1) (Figure: inputs #1-#4 feeding a single output unit)
Feedforward Neural Network: Linear Perceptron, $y = \sum_i w_i x_i + b$ (2) (Figure: inputs #1-#4 feeding a single output unit)
Feedforward Neural Network: Linear Perceptron, $y_j = \sum_i w_{ij} x_i + b_j$ (3) (Figure: inputs #1-#4 feeding a layer of output units)
Feedforward Neural Network: $y_j = \sum_i w_{ij} x_i + b_j$, $y_k = \sum_j w_{jk} y_j + b_k$ (4) (Figure: inputs #1-#4, a hidden layer, and an output layer)
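To make the index notation of equation (3) concrete, here is a minimal NumPy sketch of one linear layer; the layer sizes and variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

# One linear layer: y_j = sum_i w_ij * x_i + b_j
rng = np.random.default_rng(0)
n_in, n_out = 4, 3                  # four inputs, three output units
W = rng.normal(size=(n_in, n_out))  # weights w_ij
b = np.zeros(n_out)                 # biases b_j
x = np.array([1.0, 2.0, 3.0, 4.0])  # inputs x_i

y = x @ W + b                       # computes y_j for all j at once
print(y.shape)                      # (3,)
```

Stacking a second layer as in equation (4) is just another `y @ W2 + b2` on top.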
Introduction: Non-linear Activation, $y = \frac{1}{1+e^{-t}}$ (Sigmoid Function) (Figure: plot of the sigmoid curve)
Introduction: Non-linear Activation, $y = \tanh x$ (Hyperbolic Tangent) (Figure: plot of the tanh curve)
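As a quick sketch (NumPy; the function names are my own), both activations are one-liners:

```python
import numpy as np

def sigmoid(t):
    # y = 1 / (1 + e^(-t)), squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def tanh(t):
    # y = tanh(t), squashes any real input into (-1, 1)
    return np.tanh(t)

t = np.linspace(-2.0, 2.0, 5)
print(sigmoid(t))  # e.g. sigmoid(0) = 0.5
print(tanh(t))     # e.g. tanh(0) = 0.0
```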
Feedforward Neural Network: Stacked Linear Layers, $y_j = \sum_i w_{ij} x_i + b_j$, $y_k = \sum_j w_{jk} y_j + b_k$ (5) (Figure: two stacked linear layers; note that their composition is still a linear function of the input)
Feedforward Neural Network: Add Non-linear Activation, $y^h_j = h(\sum_i w_{ij} x_i + b_j)$, $y_k = \sum_j w_{jk} y^h_j + b_k$ (6) (Figure: the hidden units now apply the non-linearity $h$)
Feedforward Neural Network: Softmax function (normalized exponential function), $\sigma(z_j) = \frac{e^{z_j}}{\sum_k e^{z_k}}$
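A minimal NumPy sketch of the softmax; subtracting the maximum first is a standard numerical-stability trick I am adding here, not something from the slides (it cancels in the ratio):

```python
import numpy as np

def softmax(z):
    # sigma(z_j) = e^(z_j) / sum_k e^(z_k)
    e = np.exp(z - np.max(z))  # shift by max(z) to avoid overflow
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p, p.sum())  # non-negative values summing to 1.0
```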
Feedforward Neural Network: Add Softmax, $y^h_j = h(\sum_i w_{ij} x_i + b_j)$, $y^g_k = g(\sum_j w_{jk} y^h_j + b_k)$ (7) (Figure: softmax $\sigma$ applied at each output unit)
Feedforward Neural Network: $y^h_j = h(\sum_i w_{ij} x_i + b_j)$, $y^g_k = g(\sum_j w_{jk} y^h_j + b_k)$ (8) (Figure: the same network with Input Layer, Hidden Layer, and Output Layer labelled, $\sigma$ on the output units)
Feedforward Neural Network: $y^h_j = h(\sum_i w_{ij} x_i + b_j)$, $y^g_k = g(\sum_j w_{jk} y^h_j + b_k)$ (9). Here $j$ ranges over the hidden units, so its range equals the size of the hidden layer. Theorem (universal approximation): if the hidden layer is big enough, we can approximate any vector-valued function to any degree of precision.
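Putting equations (7)-(9) together, a hedged end-to-end sketch of the forward pass in NumPy; the layer sizes and the choice of tanh for $h$ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 5, 3

W1 = rng.normal(size=(n_in, n_hid));  b1 = np.zeros(n_hid)
W2 = rng.normal(size=(n_hid, n_out)); b2 = np.zeros(n_out)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x):
    y_h = np.tanh(x @ W1 + b1)     # y_h_j = h(sum_i w_ij x_i + b_j)
    return softmax(y_h @ W2 + b2)  # y_g_k = g(sum_j w_jk y_h_j + b_k)

print(forward(np.array([1.0, 0.5, -0.5, 2.0])))  # a distribution over 3 classes
```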
Feedforward Neural Network: Training
- Training data: inputs $x_i$, outputs $y_i$
- Loss function $L(\theta; y_i, \hat{y}_i)$
- (Stochastic) Gradient Descent
- Backpropagation
Feedforward Neural Network: Loss Function $L(\theta; y_i, \hat{y}_i)$, a.k.a. cost function, energy function, objective function, where $\hat{y}_i = \delta(x_i; \theta)$ is the network's prediction.
Mean Square Error: $\mathrm{MSE} = \frac{1}{n} \sum_i (\hat{y}_i - y_i)^2$
Cross Entropy: $H = -\sum_i y_i \log(\hat{y}_i)$
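A small sketch of both losses in NumPy; the epsilon inside the log is my addition to avoid log(0), not part of the definitions:

```python
import numpy as np

def mse(y_hat, y):
    # MSE = (1/n) * sum_i (y_hat_i - y_i)^2
    return np.mean((y_hat - y) ** 2)

def cross_entropy(y_hat, y, eps=1e-12):
    # H = -sum_i y_i * log(y_hat_i), with y a one-hot target vector
    return -np.sum(y * np.log(y_hat + eps))

y     = np.array([0.0, 1.0, 0.0])  # one-hot target
y_hat = np.array([0.2, 0.7, 0.1])  # predicted distribution
print(mse(y_hat, y), cross_entropy(y_hat, y))
```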
Gradient Descent: To minimise the loss function $L(\theta; y_i, \hat{y}_i)$, at every step $t$ we pick the training sample(s) $i$ and update the parameters $\theta$ as:
$\nabla L_t = \frac{\partial L(\theta; y_i, \hat{y}_i)}{\partial \theta}\Big|_{\theta=\theta_t}$
$\theta_{t+1} = \theta_t - \eta \nabla L_t$
where $\eta$ (the learning rate) is a very small value. We stop iterating when we reach an optimum (local or global). Example Link
How can we efficiently compute the gradient $\nabla L_t$?
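As a sketch, here is that update rule as an SGD loop on a linear least-squares toy problem; the model, data, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.01                          # learning rate
for t in range(5000):
    i = rng.integers(len(X))        # pick one training sample at random
    y_hat = X[i] @ theta
    grad = 2.0 * (y_hat - y[i]) * X[i]  # dL/dtheta for L = (y_hat - y)^2
    theta = theta - eta * grad          # theta_{t+1} = theta_t - eta * grad
print(theta)  # approximately [1.0, -2.0, 0.5]
```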
Computational Graph & Backpropagation: Derivative? Partial Derivative?
Computational Graph & Backpropagation: Partial derivatives.
Sum rule: $\frac{\partial}{\partial a}(a + b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1$
Product rule: $\frac{\partial}{\partial u}(uv) = u \frac{\partial v}{\partial u} + v \frac{\partial u}{\partial u} = v$
Computational Graph & Backpropagation: We can represent numeric computations of any degree of complexity as a data-flow graph whose nodes are computational operations. A simple example: $e = (a + b) \times (b + 1)$, where $a$ and $b$ are inputs. This expression can be rewritten as:
$c = a + b$
$d = b + 1$
$e = c \times d$
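A hedged sketch of such a graph in plain Python, a toy stand-in for what Theano or TensorFlow build internally; the class names are made up for illustration:

```python
# Each node stores how to compute its value from its input nodes.
class Input:
    def __init__(self, v): self.v = v
    def value(self): return self.v

class Add:
    def __init__(self, x, y): self.x, self.y = x, y
    def value(self): return self.x.value() + self.y.value()

class Mul:
    def __init__(self, x, y): self.x, self.y = x, y
    def value(self): return self.x.value() * self.y.value()

# e = (a + b) * (b + 1)
a, b = Input(2.0), Input(1.0)
c = Add(a, b)            # c = a + b
d = Add(b, Input(1.0))   # d = b + 1
e = Mul(c, d)            # e = c * d
print(e.value())         # 6.0
```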
Computational Graph & Backpropagation: Represent the expression as a graph. (https://colah.github.io/posts/2015-08-backprop/)
Computational Graph & Backpropagation: Let's set $a = 2$, $b = 1$. (https://colah.github.io/posts/2015-08-backprop/)
Computational Graph & Backpropagation: Add partial derivatives on the edges. (https://colah.github.io/posts/2015-08-backprop/)
Computational Graph & Backpropagation: $\frac{\partial e}{\partial a} = 1 \cdot 2 = 2$; $\frac{\partial e}{\partial b} = 1 \cdot 2 + 1 \cdot 3 = 5$ (https://colah.github.io/posts/2015-08-backprop/)
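Checking those numbers in a few lines (my own sketch, summing over paths with the chain rule):

```python
a, b = 2.0, 1.0
c, d = a + b, b + 1.0          # c = 3, d = 2, so e = c * d = 6

# Local derivatives on the graph's edges:
de_dc, de_dd = d, c            # from e = c * d
dc_da, dc_db, dd_db = 1.0, 1.0, 1.0

# Sum over all paths from each input to e:
de_da = dc_da * de_dc                  # 1 * 2 = 2
de_db = dc_db * de_dc + dd_db * de_dd  # 1 * 2 + 1 * 3 = 5
print(de_da, de_db)                    # 2.0 5.0, matching the slide
```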
Computational Graph & Backpropagation: Factoring Paths (https://colah.github.io/posts/2015-08-backprop/)
$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$
Computational Graph & Backpropagation: Forward vs. Backward (https://colah.github.io/posts/2015-08-backprop/)
Computational Graph & Backpropagation: Forward Propagation (https://colah.github.io/posts/2015-08-backprop/)
Computational Graph & Backpropagation: Backpropagation (https://colah.github.io/posts/2015-08-backprop/)
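To make the forward/backward contrast concrete, here is a sketch on the same toy graph (the function names are mine): forward-mode pushes a derivative from one input towards the outputs, so it needs one pass per input, while reverse-mode pulls derivatives from the output back to all inputs in a single pass, which is why backpropagation is the right direction for networks with many parameters and one loss.

```python
def forward_mode(a, b, da, db):
    # Propagate derivatives forward alongside values: one INPUT per pass.
    c, dc = a + b, da + db
    d, dd = b + 1.0, db
    e, de = c * d, dc * d + dd * c
    return e, de

def reverse_mode(a, b):
    # Forward pass for values, then one backward pass for ALL inputs.
    c, d = a + b, b + 1.0
    de_dc, de_dd = d, c                 # local derivatives of e = c * d
    de_da = de_dc * 1.0                 # path through c
    de_db = de_dc * 1.0 + de_dd * 1.0   # paths through c and d
    return de_da, de_db

print(forward_mode(2.0, 1.0, 1.0, 0.0)[1])  # de/da = 2.0 (one pass per input)
print(reverse_mode(2.0, 1.0))               # (2.0, 5.0) in a single pass
```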
Computational Graph & Backpropagation: Employing the computational graph with backpropagation makes training neural networks dramatically faster: compared with differentiating one parameter at a time, the speed-up grows with the number of parameters, so for large networks it can be a factor of millions. When we train our neural networks with the standard deep learning libraries (Theano, Torch, TensorFlow, etc.), such a graph is always running behind the scenes.
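For instance, a minimal sketch in the TensorFlow 1.x-era API that was current when these slides were written (the layer sizes and training data are assumptions): the graph is declared first, and gradients and updates are derived from it automatically by backpropagation.

```python
import numpy as np
import tensorflow as tf  # 1.x-era graph API

x = tf.placeholder(tf.float32, [None, 4])   # inputs
y = tf.placeholder(tf.float32, [None, 3])   # one-hot targets

W1 = tf.Variable(tf.random_normal([4, 5])); b1 = tf.Variable(tf.zeros([5]))
W2 = tf.Variable(tf.random_normal([5, 3])); b2 = tf.Variable(tf.zeros([3]))

h     = tf.tanh(tf.matmul(x, W1) + b1)        # hidden layer
y_hat = tf.nn.softmax(tf.matmul(h, W2) + b2)  # softmax output layer

loss = -tf.reduce_sum(y * tf.log(y_hat + 1e-12))  # cross entropy
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    xs = np.random.rand(8, 4)
    ys = np.eye(3)[np.random.randint(3, size=8)]
    for _ in range(100):  # each run backpropagates through the graph above
        sess.run(train_step, feed_dict={x: xs, y: ys})
```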
Dropout: Overfitting
Dropout: A regularisation technique to mitigate overfitting. Randomly drop units (along with their connections) from the neural network. It is only applied during training.
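A minimal sketch of the dropout operation $d(\cdot)$ in NumPy; the "inverted dropout" rescaling by $1/(1-\text{rate})$ is a common convention I am assuming here, so that no rescaling is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(y_h, rate=0.5, training=True):
    # Zero each unit independently with probability `rate`, training only.
    if not training:
        return y_h
    mask = rng.random(y_h.shape) >= rate
    return y_h * mask / (1.0 - rate)   # rescale the surviving units

y_h = np.array([0.3, -1.2, 0.8, 0.5, -0.1])
print(dropout(y_h))                    # roughly half the units zeroed
print(dropout(y_h, training=False))    # identity at test time
```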
Dropout: $y^h_j = h(\sum_i w_{ij} x_i + b_j)$, $y^g_k = g(\sum_j w_{jk} y^h_j + b_k)$ (10) (Figure: the full network, no units dropped)
Dropout: $y^h_j = d(h(\sum_i w_{ij} x_i + b_j))$, dropout rate = 0.5; $y^g_k = g(\sum_j w_{jk} y^h_j + b_k)$ (11) (Figure: half of the hidden units dropped)
Dropout: equations (12)-(14) repeat the same network; at each training step a different random subset of hidden units is dropped.
Next time
- Word Embeddings
- Recurrent Neural Network
- Some Use Cases in NLP
- TensorFlow and the Assignment
Further Readings
Deep Learning online courses:
- Stanford: http://cs224d.stanford.edu/
- Cambridge: https://youtu.be/plhfwt7vaew?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu
More on computational graphs and backpropagation:
- https://colah.github.io/posts/2015-08-backprop/