Introduction to Neural Networks

Introduction to Neural Networks Philipp Koehn 3 October 207

Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models can be illustrated as a network

Limits of Linearity 2 We can give each feature a weight But not more complex value relationships, e.g, any value in the range [0;5] is equally good values over 8 are bad higher than 0 is not worse

XOR 3 Linear models cannot model XOR good bad bad good

Multiple Layers 4 Add an intermediate ( hidden ) layer of processing (each arrow is a weight) Have we gained anything so far?

Non-Linearity 5 Instead of computing a linear combination score(λ, d i ) = j λ j h j (d i ) Add a non-linear function Popular choices score(λ, d i ) = f ( λ j h j (d i ) ) tanh(x) sigmoid(x) = +e x relu(x) = max(0,x) j (sigmoid is also called the logistic function )

Deep Learning 6 More layers = deep learning

What Depths Holds 7 Each layer is a processing step Having multiple processing steps allows complex functions Metaphor: NN and computing circuits computer = sequence of Boolean gates neural computer = sequence of layers Deep neural networks can implement complex functions e.g., sorting on input values

8 example

Simple Neural Network 9 3.7 3.7 2.9 2.9 -.5-4.6 4.5-5.2-2.0 One innovation: bias units (no inputs, always value )

Sample Input 0.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6 4.5-5.2-2.0 Try out two input values Hidden unit computation sigmoid(.0 3.7 + 0.0 3.7 +.5) = sigmoid(2.2) = = 0.90 + e 2.2 sigmoid(.0 2.9 + 0.0 2.9 + 4.5) = sigmoid(.6) = = 0.7 + e.6

Computed Hidden.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6.90.7 4.5-5.2-2.0 Try out two input values Hidden unit computation sigmoid(.0 3.7 + 0.0 3.7 +.5) = sigmoid(2.2) = = 0.90 + e 2.2 sigmoid(.0 2.9 + 0.0 2.9 + 4.5) = sigmoid(.6) = = 0.7 + e.6

Compute Output 2.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6.90.7 4.5-5.2-2.0 Output unit computation sigmoid(.90 4.5 +.7 5.2 + 2.0) = sigmoid(.7) = = 0.76 + e.7

Computed Output 3.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6.90.7 4.5-5.2-2.0.76 Output unit computation sigmoid(.90 4.5 +.7 5.2 + 2.0) = sigmoid(.7) = = 0.76 + e.7

Output for all Binary Inputs 4 Input x 0 Input x Hidden h 0 Hidden h Output y 0 0 0 0.2 0.02 0.8 0 0 0.88 0.27 0.74 0 0.73 0.2 0.74 0.99 0.73 0.33 0 Network implements XOR hidden node h 0 is OR hidden node h is AND final layer operation is h 0 h Power of deep neural networks: chaining of processing steps just as: more Boolean circuits more complex computations possible

5 why neural networks?

Neuron in the Brain 6 The human brain is made up of about 00 billion neurons Dendrite Axon terminal Soma Nucleus Axon Neurons receive electric signals at the dendrites and send them to the axon

Neural Communication 7 The axon of the neuron is connected to the dendrites of many other neurons Neurotransmitter Voltage gated Ca++ channel Synaptic vesicle Neurotransmitter transporter Axon terminal Postsynaptic density Receptor Synaptic cleft Dendrite

The Brain vs. Artificial Neural Networks 8 Similarities Neurons, connections between neurons Learning = change of connections, not change of neurons Massive parallel processing But artificial neural networks are much simpler computation within neuron vastly simplified discrete time steps typically some form of supervised learning with massive number of stimuli

9 back-propagation training

Error 20.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6.90.7 4.5-5.2-2.0.76 Computed output: y =.76 Correct output: t =.0 How do we adjust the weights?

Key Concepts 2 Gradient descent error is a function of the weights we want to reduce the error gradient descent: move towards the error minimum compute gradient get direction to the error minimum adjust weights towards direction of lower error Back-propagation first adjust last set of weights propagate error back to each previous layer adjust their weights

Gradient Descent 22 error(λ) gradient = λ optimal λ current λ

Gradient Descent 23 Optimum Gradient for w Combined Gradient Current Point Gradient for w2

Derivative of Sigmoid Sigmoid sigmoid(x) = + e x 24 Reminder: quotient rule (f(x) ) = g(x)f (x) f(x)g (x) g(x) g(x) 2 Derivative d sigmoid(x) dx = d dx + e x = 0 ( e x ) ( e x ) ( + e x ) 2 = = ( e x ) + e x + e x ( + e x ) + e x = sigmoid(x)( sigmoid(x))

Final Layer Update 25 Linear combination of weights s = k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k

Final Layer Update () 26 Linear combination of weights s = k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k Error E is defined with respect to y de dy = d dy 2 (t y)2 = (t y)

Final Layer Update (2) 27 Linear combination of weights s = k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k y with respect to x is sigmoid(s) dy ds = d sigmoid(s) ds = sigmoid(s)( sigmoid(s)) = y( y)

Final Layer Update (3) 28 Linear combination of weights s = k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k x is weighted linear combination of hidden node values h k ds dw k = d dw k k w k h k = h k

Putting it All Together 29 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k = (t y) y( y) h k error derivative of sigmoid: y Weight adjustment will be scaled by a fixed learning rate µ w k = µ (t y) y h k

Multiple Output Nodes 30 Our example only had one output node Typically neural networks have multiple output nodes Error is computed over all j output nodes E = j 2 (t j y j ) 2 Weights k j are adjusted according to the node they point to w j k = µ(t j y j ) y j h k

Hidden Layer Update 3 In a hidden layer, we do not have a target output value But we can compute how much each node contributed to downstream error Definition of error term of each node δ j = (t j y j ) y j Back-propagate the error term (why this way? there is math to back it up...) δ i = ( ) w j i δ j j y i Universal update formula w j k = µ δ j h k

Our Example 32 A B C.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6 D E F.90.7 4.5-5.2-2.0 G.76 Computed output: y =.76 Correct output: t =.0 Final layer weight updates (learning rate µ = 0) δ G = (t y) y = (.76) 0.8 =.0434 w GD = µ δ G h D = 0.0434.90 =.39 w GE = µ δ G h E = 0.0434.7 =.074 w GF = µ δ G h F = 0.0434 =.434

Our Example 33 A B C.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6 D E F.90.7 4.89 4.5-5.26-5.2 -.566-2.0 G.76 Computed output: y =.76 Correct output: t =.0 Final layer weight updates (learning rate µ = 0) δ G = (t y) y = (.76) 0.8 =.0434 w GD = µ δ G h D = 0.0434.90 =.39 w GE = µ δ G h E = 0.0434.7 =.074 w GF = µ δ G h F = 0.0434 =.434

Hidden Layer Updates 34 A B C.0 0.0 3.7 3.7 2.9 2.9 -.5-4.6 D E F.90.7 4.89 4.5-5.26-5.2 -.566-2.0 Hidden node D ( ) δ D = j w j iδ j y D = w GD δ G y D = 4.5.0434.0898 =.075 w DA = µ δ D h A = 0.075.0 =.75 w DB = µ δ D h B = 0.075 0.0 = 0 w DC = µ δ D h C = 0.075 =.75 Hidden node E ( ) δ E = j w j iδ j y E = w GE δ G y E = 5.2.0434 0.2055 =.0464 w EA = µ δ E h A = 0.0464.0 =.464 etc. G.76

35 some additional aspects

Initialization of Weights 36 Weights are initialized randomly e.g., uniformly from interval [ 0.0, 0.0] Glorot and Bengio (200) suggest for shallow neural networks [ n, ] n n is the size of the previous layer for deep neural networks [ 6 nj + n j+, 6 nj + n j+ ] n j is the size of the previous layer, n j size of next layer

Neural Networks for Classification 37 Predict class: one output node per class Training data output: One-hot vector, e.g., y = (0, 0, ) T Prediction predicted class is output node y i with highest value obtain posterior probability distribution by soft-max softmax(y i ) = ey i j ey j

Problems with Gradient Descent Training 38 error(λ) λ Too high learning rate

Problems with Gradient Descent Training 39 error(λ) λ Bad initialization

Problems with Gradient Descent Training 40 error(λ) local optimum global optimum λ Local optimum

Speedup: Momentum Term 4 Updates may move a weight slowly in one direction To speed this up, we can keep a memory of prior updates w j k (n )... and add these to any new updates (with decay factor ρ) w j k (n) = µ δ j h k + ρ w j k (n )

Adagrad 42 Typically reduce the learning rate µ over time at the beginning, things have to change a lot later, just fine-tuning Adapting learning rate per parameter Adagrad update based on error E with respect to the weight w at time t = g t = de dw w t = µ t τ= g2 τ g t

Dropout 43 A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test) Solution: regularization, e.g., keeping weights from having extreme values Dropout: randomly remove some hidden units during training mask: set of hidden units dropped randomly generate, say, 0 20 masks alternate between the masks during training Why does that work? bagging, ensemble,...

Mini Batches 44 Each training example yields a set of weight updates w i. Batch up several training examples sum up their updates apply sum to model Mostly done or speed reasons

45 computational aspects

Vector and Matrix Multiplications 46 Forward computation: s = W h Activation function: y = sigmoid( h) Error term: δ = ( t y) sigmoid ( s) Propagation of error term: δ i = W δ i+ sigmoid ( s) Weight updates: W = µ δ h T

GPU 47 Neural network layers may have, say, 200 nodes Computations such as W h require 200 200 = 40, 000 multiplications Graphics Processing Units (GPU) are designed for such computations image rendering requires such vector and matrix operations massively mulit-core but lean processing units example: NVIDIA Tesla K20c GPU provides 2496 thread processors Extensions to C to support programming of GPUs, such as CUDA

Toolkits 48 Theano Tensorflow (Google) PyTorch (Facebook) MXNet (Amazon) DyNet