Introduction to Neural Networks
Philipp Koehn
4 April 2015
Linear Models
We previously used a weighted linear combination of feature values h_j and weights λ_j:
score(λ, d_i) = Σ_j λ_j h_j(d_i)
Such models can be illustrated as a network
Limits of Linearity
We can give each feature a weight
But we cannot model more complex value relationships, e.g.:
- any value in the range [0;5] is equally good
- values over 8 are bad
- higher than 10 is not worse
XOR
Linear models cannot model XOR
[Figure: the four input points in a 2x2 grid, labeled good / bad / bad / good; no straight line separates the good corners from the bad ones]
Multiple Layers
Add an intermediate (hidden) layer of processing (each arrow is a weight)
Have we gained anything so far?
Non-Linearity
Instead of computing a linear combination
score(λ, d_i) = Σ_j λ_j h_j(d_i)
add a non-linear function:
score(λ, d_i) = f( Σ_j λ_j h_j(d_i) )
Popular choices: tanh(x) and sigmoid(x) = 1 / (1 + e^{-x})
(sigmoid is also called the logistic function)
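With a hidden layer and a non-linear activation, XOR becomes representable. The following is a minimal sketch (not from the slides) with hand-picked, purely illustrative weights: the hidden units roughly compute OR and AND, and the output fires only when they disagree.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hand-picked (hypothetical) weights: h1 acts like OR, h2 like AND,
# and the output is on only when h1 is on and h2 is off -- i.e. XOR.
def xor_net(x1, x2):
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)   # ~OR
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)   # ~AND
    return sigmoid(20 * h1 - 20 * h2 - 10)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b), 3))  # ~0, ~1, ~1, ~0
```

Scaling the weights up (here by 20) pushes the sigmoids toward 0 or 1, so the network behaves almost like a logic circuit.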
Deep Learning
More layers = deep learning
example
Simple Neural Network
[Figure: a network with two input nodes and a bias unit, two hidden nodes and a bias unit, and one output node; weights 3.7, 3.7 and 2.9, 2.9 from the inputs to the hidden nodes, -1.5 and -4.5 from the bias unit to the hidden nodes, 4.5 and -5.2 from the hidden nodes to the output, -2.0 from the bias unit to the output]
One innovation: bias units (no inputs, always value 1)
Sample Input
Try out two input values: 1.0 and 0.0
Hidden unit computation:
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × -1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × -4.5) = sigmoid(-1.6) = 1 / (1 + e^{1.6}) = 0.17
Computed Hidden
The hidden node values are 0.90 and 0.17
Compute Output
Output unit computation:
sigmoid(0.90 × 4.5 + 0.17 × -5.2 + 1 × -2.0) = sigmoid(1.17) = 1 / (1 + e^{-1.17}) = 0.76
Computed Output
The computed output value is 0.76
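The forward pass above can be reproduced with a few lines of numpy; this is a sketch (not from the slides), assuming the bias weights -1.5, -4.5, and -2.0 used in the computations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 0.0, 1.0])            # two inputs plus the bias unit (always 1)
W_hidden = np.array([[3.7, 3.7, -1.5],
                     [2.9, 2.9, -4.5]])
w_output = np.array([4.5, -5.2, -2.0])   # weights from hidden nodes and bias unit

h = sigmoid(W_hidden @ x)                    # -> approximately [0.90, 0.17]
y = sigmoid(w_output @ np.append(h, 1.0))    # -> approximately 0.76
print(h, y)
```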
why neural networks?
Neuron in the Brain
The human brain is made up of about 100 billion neurons
[Figure: a neuron, with soma, nucleus, dendrites, axon, and axon terminals labeled]
Neurons receive electric signals at the dendrites and send them to the axon
Neural Communication
The axon of the neuron is connected to the dendrites of many other neurons
[Figure: a synapse, with axon terminal, synaptic vesicles, neurotransmitters and their transporters, voltage-gated Ca++ channel, synaptic cleft, receptors, postsynaptic density, and dendrite labeled]
The Brain vs. Artificial Neural Networks
Similarities:
- neurons, connections between neurons
- learning = change of connections, not change of neurons
- massive parallel processing
But artificial neural networks are much simpler:
- computation within a neuron is vastly simplified
- discrete time steps
- typically some form of supervised learning with a massive number of stimuli
back-propagation training
Error
[Figure: the example network with inputs 1.0 and 0.0, hidden values 0.90 and 0.17, and output 0.76]
Computed output: y = 0.76
Correct output: t = 1.0
How do we adjust the weights?
Key Concepts
Gradient descent:
- error is a function of the weights
- we want to reduce the error
- gradient descent: move towards the error minimum
- compute the gradient to get the direction to the error minimum
- adjust weights towards the direction of lower error
Back-propagation:
- first adjust the last set of weights
- propagate the error back to each previous layer
- adjust their weights
Derivative of Sigmoid
Sigmoid: sigmoid(x) = 1 / (1 + e^{-x})
Reminder, quotient rule:
( f(x) / g(x) )' = ( g(x) f'(x) - f(x) g'(x) ) / g(x)^2
Derivative:
d sigmoid(x) / dx = d/dx [ 1 / (1 + e^{-x}) ]
= ( 0 × (1 + e^{-x}) - (-e^{-x}) × 1 ) / (1 + e^{-x})^2
= ( 1 / (1 + e^{-x}) ) × ( e^{-x} / (1 + e^{-x}) )
= sigmoid(x) ( 1 - sigmoid(x) )
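A quick way to double-check this derivation (not part of the slides) is to differentiate symbolically, for example with sympy:

```python
import sympy as sp

x = sp.symbols('x')
sigmoid = 1 / (1 + sp.exp(-x))

# The derivative should simplify to sigmoid(x) * (1 - sigmoid(x)).
print(sp.simplify(sp.diff(sigmoid, x) - sigmoid * (1 - sigmoid)))  # 0
```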
Final Layer Update
Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½ (t - y)²
Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy · dy/ds · ds/dw_k
Final Layer Update (1)
Error E is defined with respect to y:
dE/dy = d/dy ½ (t - y)² = -(t - y)
Final Layer Update (2)
y with respect to s is sigmoid(s):
dy/ds = d sigmoid(s)/ds = sigmoid(s) (1 - sigmoid(s)) = y (1 - y)
Final Layer Update (3)
s is the weighted linear combination of the hidden node values h_k:
ds/dw_k = d/dw_k Σ_k w_k h_k = h_k
Putting it All Together
Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy · dy/ds · ds/dw_k = -(t - y) · y(1 - y) · h_k
(the first factor is the error, the second factor y(1 - y) is the derivative of the sigmoid, written y')
Weight adjustment will be scaled by a fixed learning rate µ:
∆w_k = µ (t - y) y' h_k
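To sanity-check this derivative, the analytic gradient -(t - y) y(1 - y) h_k can be compared with a finite-difference approximation of dE/dw_k. The sketch below (not part of the slides) uses the hidden values and final-layer weights of the running example purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.array([0.90, 0.17, 1.0])   # hidden node values, last entry is the bias unit
w = np.array([4.5, -5.2, -2.0])   # final-layer weights
t = 1.0                           # target output

def error(w):
    y = sigmoid(w @ h)
    return 0.5 * (t - y) ** 2

# Analytic gradient: dE/dw_k = -(t - y) * y * (1 - y) * h_k
y = sigmoid(w @ h)
analytic = -(t - y) * y * (1 - y) * h

# Central finite differences
eps = 1e-6
numeric = np.array([
    (error(w + eps * np.eye(3)[k]) - error(w - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric))  # True
```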
Multiple Output Nodes
Our example only had one output node
Typically neural networks have multiple output nodes
Error is computed over all j output nodes:
E = Σ_j ½ (t_j - y_j)²
Weights k → j are adjusted according to the node they point to:
∆w_{j←k} = µ (t_j - y_j) y_j' h_k
Hidden Layer Update
In a hidden layer, we do not have a target output value
But we can compute how much each node contributed to downstream error
Definition of the error term of each node:
δ_j = (t_j - y_j) y_j'
Back-propagate the error term (why this way? there is math to back it up...):
δ_i = ( Σ_j w_{j←i} δ_j ) y_i'
Universal update formula:
∆w_{j←k} = µ δ_j h_k
Our Example
[Figure: the example network with nodes labeled A, B, C (inputs, C = bias unit), D, E, F (hidden layer, F = bias unit), and G (output); inputs 1.0 and 0.0, hidden values 0.90 and 0.17, output 0.76]
Computed output: y = 0.76
Correct output: t = 1.0
Final layer weight updates (learning rate µ = 10):
δ_G = (t - y) y' = (1 - 0.76) × 0.181 = 0.0434
∆w_{GD} = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.39
∆w_{GE} = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
∆w_{GF} = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434
After these updates the final-layer weights become 4.89, -5.126, and -1.566
Hidden Layer Updates
Hidden node D:
δ_D = ( Σ_j w_{j←i} δ_j ) y_D' = w_{GD} δ_G y_D' = 4.5 × 0.0434 × 0.0898 = 0.0175
∆w_{DA} = µ δ_D h_A = 10 × 0.0175 × 1.0 = 0.175
∆w_{DB} = µ δ_D h_B = 10 × 0.0175 × 0.0 = 0
∆w_{DC} = µ δ_D h_C = 10 × 0.0175 × 1 = 0.175
Hidden node E:
δ_E = ( Σ_j w_{j←i} δ_j ) y_E' = w_{GE} δ_G y_E' = -5.2 × 0.0434 × 0.2055 = -0.0464
∆w_{EA} = µ δ_E h_A = 10 × -0.0464 × 1.0 = -0.464
etc.
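Putting the forward pass and both update rules together, one gradient descent step for this network can be written as a short numpy sketch (not from the slides; it assumes the weights listed earlier). The output-layer and node-D numbers are close to the slides; small differences come from the rounding of intermediate values in the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

mu = 10.0                                  # learning rate from the slides
x = np.array([1.0, 0.0, 1.0])              # inputs A, B plus bias unit C
W1 = np.array([[3.7, 3.7, -1.5],           # weights into hidden node D
               [2.9, 2.9, -4.5]])          # weights into hidden node E
w2 = np.array([4.5, -5.2, -2.0])           # weights D, E, bias F -> output G
t = 1.0                                    # target output

# Forward pass
h = np.append(sigmoid(W1 @ x), 1.0)        # hidden values plus bias unit F
y = sigmoid(w2 @ h)                        # ~0.76

# Error term at the output and final-layer update
delta_G = (t - y) * y * (1 - y)            # ~0.042 (slides: 0.0434 from rounded values)
w2_new = w2 + mu * delta_G * h             # ~[4.88, -5.13, -1.58]

# Back-propagated error terms for hidden nodes D and E, and their update
delta_hidden = w2[:2] * delta_G * h[:2] * (1 - h[:2])   # delta for D ~0.017
W1_new = W1 + mu * np.outer(delta_hidden, x)

print(y, delta_G, w2_new)
```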
some additional aspects
Initialization of Weights
Weights are initialized randomly, e.g., uniformly from the interval [-0.01, 0.01]
Glorot and Bengio (2010) suggest
- for shallow neural networks: [ -1/√n, 1/√n ], where n is the size of the previous layer
- for deep neural networks: [ -√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1}) ], where n_j is the size of the previous layer and n_{j+1} the size of the next layer
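As an illustration, a weight matrix could be initialized like this; the function name and layer sizes below are made up for the sketch, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_in, n_out, shallow=False):
    # Glorot & Bengio (2010): +-1/sqrt(n) for shallow nets,
    # +-sqrt(6)/sqrt(n_j + n_{j+1}) for deep nets.
    if shallow:
        limit = 1.0 / np.sqrt(n_in)
    else:
        limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = init_weights(200, 200)   # e.g. one 200-node layer feeding another
```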
Neural Networks for Classification
Predict class: one output node per class
Training data output: one-hot vector, e.g., y = (0, 0, 1)^T
Prediction:
- predicted class is the output node y_i with the highest value
- obtain a posterior probability distribution by soft-max:
softmax(y_i) = e^{y_i} / Σ_j e^{y_j}
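A numerically stable soft-max and the prediction rule might look like this in numpy (the score values are made up for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; does not change the result.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.array([1.2, 0.3, 2.5])    # hypothetical output-node values
probs = softmax(scores)               # posterior distribution, sums to 1
predicted_class = int(np.argmax(probs))
```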
Speedup: Momentum Term
Updates may move a weight slowly in one direction
To speed this up, we can keep a memory of prior updates ∆w_{j←k}(n-1) ...
... and add these to any new updates (with decay factor ρ):
∆w_{j←k}(n) = µ δ_j h_k + ρ ∆w_{j←k}(n-1)
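A minimal sketch of the momentum update for a single weight; the learning rate, decay factor, and the δ_j, h_k values are illustrative, not from the slides.

```python
mu, rho = 0.1, 0.9        # learning rate and decay factor (illustrative values)

def momentum_step(w, delta_j, h_k, prev_update):
    # Delta w(n) = mu * delta_j * h_k + rho * Delta w(n-1)
    update = mu * delta_j * h_k + rho * prev_update
    return w + update, update

w, prev_update = 0.5, 0.0
for step in range(3):
    w, prev_update = momentum_step(w, delta_j=0.05, h_k=0.9, prev_update=prev_update)
    print(step, w)        # successive updates grow as momentum accumulates
```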
computational aspects
Vector and Matrix Multiplications
- Forward computation: s = W h
- Activation function: y = sigmoid(s)
- Error term: δ = (t - y) · sigmoid'(s)
- Propagation of error term: δ_i = W^T δ_{i+1} · sigmoid'(s)
- Weight updates: ∆W = µ δ h^T
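These equations translate almost directly into numpy. The sketch below (not from the slides) uses made-up layer sizes and random weights; the transpose in the propagation step follows from the forward computation s = W h.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(s):
    y = sigmoid(s)
    return y * (1 - y)

mu = 1.0
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))  # hypothetical layer sizes
h0 = np.array([1.0, 0.0, 1.0])                             # input vector
t = np.array([1.0, 0.0])                                   # target vector

# Forward computation: s = W h, y = sigmoid(s)
s1 = W1 @ h0; h1 = sigmoid(s1)
s2 = W2 @ h1; y = sigmoid(s2)

# Error term and its propagation to the previous layer
delta2 = (t - y) * sigmoid_prime(s2)
delta1 = (W2.T @ delta2) * sigmoid_prime(s1)

# Weight updates: Delta W = mu * delta h^T
W2 = W2 + mu * np.outer(delta2, h1)
W1 = W1 + mu * np.outer(delta1, h0)
```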
GPU
Neural network layers may have, say, 200 nodes
Computations such as W h then require 200 × 200 = 40,000 multiplications
Graphics Processing Units (GPUs) are designed for such computations:
- image rendering requires such vector and matrix operations
- massively multi-core, but lean processing units
- example: the NVIDIA Tesla K20c GPU provides 2496 thread processors
Extensions to C support programming of GPUs, such as CUDA
Theano
GPU library for Python
Homepage: http://deeplearning.net/software/theano/
See the web site for a sample implementation of back-propagation training
Used to implement
- neural network language models
- neural machine translation (Bahdanau et al., 2015)