INFOB2KI 2017-2018, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Artificial Neural Networks. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
Outline
- Biological neural networks
- Artificial NN basics: perceptrons, multi-layer networks
- Training ANNs
- Combination with other ML techniques: NN and Evolutionary Computing, NN and Reinforcement Learning (e.g. deep learning)
(Artificial) Neural Networks
- Supervised learning technique: error-driven classification
- Output is determined from a weighted set of inputs
- Training updates the weights
- Used in games for e.g.: selecting a weapon, selecting an item to pick up, steering a car on a circuit, recognizing characters, recognizing faces
Biological Neural Nets: pigeons as art experts (Watanabe et al. 1995). Experiment: put a pigeon in a Skinner box; present paintings by two different artists (e.g. Chagall / Van Gogh); reward pecking when a particular artist (e.g. Van Gogh) is presented.
Results from the experiment: pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on). Discrimination was still 85% successful for previously unseen paintings by the artists.
Praise to neural nets
- Pigeons have acquired knowledge about art
- Pigeons do not simply memorise the pictures
- They can extract and recognise patterns (the "style")
- They generalise from the already seen to make predictions
Pigeons have learned. Can one implement this using an artificial neural network?
Inspiration from biology: if a pigeon can do it, how hard can it be? ANNs are biologically inspired. ANNs are not duplicates of brains (and don't try to be).
(Natural) Neurons. Natural neurons receive signals through synapses (~ inputs). If the signals are strong enough (~ above some threshold), the neuron is activated and emits a signal through the axon (~ output). (Figure: natural neuron vs. artificial neuron (node).)
McCulloch & Pitts model (1943): "A logical calculus of the ideas immanent in nervous activity". A linear combiner followed by a hard delimiter; aka linear threshold gate, threshold logic unit.
- n binary inputs x_i and 1 binary output y
- n weights w_i ∈ {−1, 1}
- Linear combiner: z = Σ_i w_i x_i
- Hard delimiter: unit step function at threshold θ, i.e. y = 1 if z ≥ θ, y = 0 if z < θ
Rosenblatt's Perceptron (1958): an enhanced version of the McCulloch-Pitts artificial neuron.
- n+1 real-valued inputs: x_1 ... x_n and 1 bias b; binary output y
- real-valued weights w_i
- Linear combiner: z = Σ_i w_i x_i + b
- g(z): (hard delimiter) unit step function at threshold 0, i.e. y = 1 if z ≥ 0, y = 0 if z < 0
Classification: feedforward. The algorithm for computing outputs from inputs in perceptron neurons is the feedforward algorithm. Example: inputs 4 (w = 2) and −3 (w = 4) give weighted input z = 2·4 + 4·(−3) = 8 − 12 = −4; since z < 0, the activation is g(z) = 0.
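The feedforward computation above can be sketched in a few lines; this is a minimal sketch, with illustrative function names:

```python
# A minimal sketch of a perceptron's feedforward step:
# linear combiner followed by a unit step at threshold 0.
def perceptron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

# The example above: z = 2*4 + 4*(-3) = -4 < 0
print(perceptron([4, -3], [2, 4], 0))  # -> 0
```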
Bias & threshold implementation. A bias can be incorporated in three different ways, with the same effect on the output: as an explicit bias term b added to the combiner, as an extra input x_0 = 1 with weight w_0 = b, or as a threshold θ = −b on the step function. Alternatively, a threshold θ can be incorporated in three analogous ways, with the same effect on the output.
Single-layer perceptron. Rosenblatt's perceptron is the building block of the single-layer perceptron, the simplest feedforward neural network (figure: input nodes 1, 2 connect through weights w13, w14, w23, w24 to a single layer of neurons 3, 4 with outputs y1, y2).
- alternative hard-limiting activation functions g(z) are possible, e.g. the sign function: y = 1 if z ≥ 0, y = −1 if z < 0
- can have multiple independent outputs
- the adjustable weights can be trained using training data: the perceptron learning rule adjusts the weights w_1 ... w_n such that the inputs x_1 ... x_n give rise to the desired output(s) y
Perceptron learning: idea. Minimize the error in the output through gradient descent. Squared error, per output: E = ½ (d − y)², where d is the desired output. Change each weight by a term proportional to the gradient; if the (non-differentiable) activation is replaced with y = g(z) = z, this gives the proportional change Δw_i = α (d − y) x_i, with learning rate α > 0. NB: in the book the learning rate is called Gain, with notation η.
Perceptron learning
1. Initialize weights and threshold (or bias) to random numbers; choose a learning rate 0 < α ≤ 1.
2. For each training input t = <x_1, ..., x_n>: calculate the output y(t) and error e(t) = d(t) − y(t), where d(t) is the desired output; adjust all n weights using the perceptron learning rule: w_i ← w_i + α · e(t) · x_i.
3. One pass through all training inputs is an epoch. If the weights changed for any t, run another epoch; if all weights are unchanged (or another stopping rule applies): ready.
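The steps above can be sketched as follows; the initial weights, threshold and learning rate match the AND example on the following slides (α = 0.1 is inferred from the 0.1-sized weight updates there):

```python
# Sketch of the perceptron learning rule, threshold form.
def train_perceptron(samples, alpha=0.1, w=(0.3, -0.1), theta=0.2):
    w = list(w)
    changed = True
    while changed:                      # one pass of this loop = one epoch
        changed = False
        for x, d in samples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            y = 1 if z >= theta else 0  # hard-limiting activation
            e = d - y                   # error e(t) = d(t) - y(t)
            if e != 0:
                for i in range(len(w)):
                    w[i] += alpha * e * x[i]  # perceptron learning rule
                changed = True
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.2 else 0
       for x, _ in AND])  # -> [0, 0, 0, 1]
```

The exact final weights depend on floating-point rounding, but the learned classifier separates AND correctly.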
Example: AND learning (1). Desired output of logical AND, given 2 binary inputs:
x1 x2 | d
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1
Example AND (2). Init: choose weights w_i and threshold θ randomly in [−0.5, 0.5], here w_1 = 0.3, w_2 = −0.1, θ = 0.2; set α = 0.1; use the step function: return 0 if z < θ, 1 if z ≥ θ. (Alternative: use bias b = −θ with the unit step function.) For t_1 = (0, 0): z = 0 < 0.2, so y = 0 and e(t_1) = d(t_1) − y = 0 − 0 = 0. Done with t_1, for now.
Example AND (3). For t_2 = (0, 1): z = w_2 = −0.1 < 0.2, so y = 0 and e(t_2) = 0 − 0 = 0. Done with t_2, for now.
Example AND (4). For t_3 = (1, 0): z = w_1 = 0.3 ≥ 0.2, so y = 1 and e(t_3) = 0 − 1 = −1. Update: Δw_1 = α · e(t_3) · x_1 = −0.1, so w_1 ← 0.2. Done with t_3, for now.
Example AND (5). For t_4 = (1, 1): z = w_1 + w_2 = 0.2 − 0.1 = 0.1 < 0.2, so y = 0 and e(t_4) = 1 − 0 = 1. Update: Δw_i = α · e(t_4) · x_i = 0.1, so w_1 ← 0.3 and w_2 ← 0. Done with t_4 and the first epoch.
Example (6): 4 epochs later. With w_1 = 0.1, w_2 = 0.1 and θ = 0.2 the algorithm has converged, i.e. the weights do not change any more: the algorithm has correctly learned the AND function.
AND example (7): results. For all four inputs the output y equals the desired output d:
x1 x2 | d | y
 0  0 | 0 | 0
 0  1 | 0 | 0
 1  0 | 0 | 0
 1  1 | 1 | 1
Learned function/decision boundary: 0.1·x_1 + 0.1·x_2 = 0.2, or x_1 + x_2 = 2: a linear classifier.
Perceptron learning: properties. We do gradient descent in a space without local optima. Complete: yes — if α is sufficiently small (or the initial weights sufficiently large) and the examples come from a linearly separable function, then perceptron learning converges to a solution. Optimal: no (the weights serve to correctly separate the seen inputs; there are no guarantees for unseen inputs close to the decision boundaries).
Limitation of the perceptron: example XOR.
x1 x2 | d
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
The two output classes cannot be separated with a single linear function: XOR is not linearly separable.
Solving XOR using 2 layers of McCulloch & Pitts models. XOR can be computed by a small network of threshold units, all with θ = 1 and weights ±1: hidden node 3 fires for x_1 AND NOT x_2 (weights 1, −1), hidden node 4 fires for x_2 AND NOT x_1 (weights −1, 1), and output node 5 computes their OR (weights 1, 1). (Plots: the decision regions of nodes 3, 4 and 5 in the x_1-x_2 plane.)
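The network above can be sketched directly; the exact wiring in the slide figure is partly lost, so this is one choice consistent with the visible weights (±1) and thresholds (θ = 1):

```python
# A McCulloch-Pitts threshold unit.
def mp_unit(inputs, weights, theta):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

def xor(x1, x2):
    h3 = mp_unit([x1, x2], [1, -1], 1)   # fires for x1 AND NOT x2
    h4 = mp_unit([x1, x2], [-1, 1], 1)   # fires for x2 AND NOT x1
    return mp_unit([h3, h4], [1, 1], 1)  # OR of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # prints the XOR truth table: 0, 1, 1, 0
```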
Types of decision regions
Multi-layer networks. Input nodes x_1, x_2, x_3 feed a hidden layer of neurons, which feeds an output neuron layer (y_1, y_2, y_3). This type of network is also called a feedforward network.
- the hidden layer captures nonlinearities
- more than 1 hidden layer is possible, but often reducible to 1 hidden layer
- introduced in the '50s, but not studied until the '80s
Training Multi-Layer Networks. MLNs are trained using back-propagation: input signals x_1, x_2, x_3 flow forward through the network to outputs y_1, y_2, y_3, and error signals flow backward.
Training Multi-Layer Networks I. Similar to the perceptron learning rule, but now the error has to be distributed over the hidden nodes (node i feeding node j). Squared error, per output: E = ½ (d − y)². We need a continuous activation function.
Continuous activation functions. As a continuous activation function, we can use a smoothed version of the step function: a sigmoid, e.g. the logistic sigmoid g(z) = 1 / (1 + e^(−z)).
Continuous artificial neurons. A linear combiner followed by a sigmoid function: weighted input z = Σ_i w_i x_i; activation (logistic sigmoid) y = g(z) = 1 / (1 + e^(−z)).
Example. Inputs 3 (w = 2) and −2 (w = 4) give weighted input z = 2·3 + 4·(−2) = 6 − 8 = −2; activation g(−2) = 1 / (1 + e^2) ≈ 0.119.
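The continuous-neuron example above, as a quick check:

```python
import math

# Logistic sigmoid activation g(z) = 1 / (1 + e^-z).
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 2 * 3 + 4 * (-2)          # weighted input: 6 - 8 = -2
print(round(logistic(z), 3))  # -> 0.119
```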
Training Multi-Layer Networks. Squared error, per output: E = ½ (d − y)². The output of node i is input for node j. Error terms:
- for node o in the output layer: δ_o = y_o (1 − y_o)(d_o − y_o)
- for node j in a previous (hidden) layer: δ_j = y_j (1 − y_j) Σ_o w_jo δ_o
NB: previous = closer to the input layer.
Backpropagation
1. Initialize weights and threshold (or bias) to random numbers; choose a learning rate 0 < α ≤ 1.
2. For each training input t = <x_1, ..., x_n>: calculate the output y(t) and error e(t) = d(t) − y(t); recursively adjust each weight on the link from node i to node j: Δw_ij = α · y_i · δ_j, where
   - if j is an output node: δ_j = y_j (1 − y_j) e(t)
   - if j is a hidden node: δ_j = y_j (1 − y_j) Σ_o w_jo δ_o
3. If the weights changed for any t, repeat; if all weights are unchanged (or another stopping rule applies): ready.
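One step of this procedure can be sketched for a small 2-2-1 network with logistic activations; the network shape, weights and α below are illustrative (biases omitted), not from the slides:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, d, w_hidden, w_out, alpha=0.5):
    # forward pass
    h = [logistic(sum(w[i] * x[i] for i in range(len(x)))) for w in w_hidden]
    y = logistic(sum(w_out[j] * h[j] for j in range(len(h))))
    # output node: delta_o = y (1 - y)(d - y)
    delta_o = y * (1 - y) * (d - y)
    # hidden nodes: delta_j = y_j (1 - y_j) * w_jo * delta_o
    delta_h = [h[j] * (1 - h[j]) * w_out[j] * delta_o for j in range(len(h))]
    # weight updates: delta w_ij = alpha * y_i * delta_j
    new_w_out = [w_out[j] + alpha * h[j] * delta_o for j in range(len(h))]
    new_w_hidden = [[w_hidden[j][i] + alpha * x[i] * delta_h[j]
                     for i in range(len(x))] for j in range(len(h))]
    return new_w_hidden, new_w_out, y
```

Repeating such steps over all training inputs (one epoch) gradually reduces the error; a single step already moves the output toward the desired value.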
Training for XOR. Network: inputs x_1 (node 1) and x_2 (node 2), hidden nodes 3 and 4, output node 5, with W13 = 10, W14 = −5, W23 = −5, W24 = 10, W35 = 5, W45 = 5. Activation function for nodes 3-5: logistic sigmoid 1 / (1 + e^(−(z−θ))) with θ = 6; set α = 0.9. To simplify computation, if the absolute value of e(t) < 0.1, we consider the outcome correct. For input (0, 0): y_3 ≈ 0.002, y_4 ≈ 0.002, y_5 ≈ 0.003, so e(t) = 0 − 0.003. With the sigmoid as approximation of the step function, we consider this outcome correct: no weight updates required for the first case, for now.
Training for XOR, input (0, 1). Forward pass: y_3 ≈ 0.000, y_4 = 0.982, y_5 = 0.252, so e(t) = 1 − 0.252 = 0.748. Backward pass:
- δ_5 = y_5 (1 − y_5) e ≈ 0.141
- Δw_35 = α y_3 δ_5 ≈ 0.000; Δw_45 = α y_4 δ_5 ≈ 0.125
- δ_3 = y_3 (1 − y_3) w_35 δ_5 ≈ 0.000; δ_4 = y_4 (1 − y_4) w_45 δ_5 ≈ 0.012
- Δw_13 = α y_1 δ_3 = α x_1 δ_3 = 0 = Δw_14; Δw_23 = α x_2 δ_3 ≈ 0.000; Δw_24 = α x_2 δ_4 ≈ 0.011
Training for XOR, input (0, 1), continued. Adjust the weights that require changing: Δw_45 ≈ 0.125, so update w_45 to 5.125; Δw_24 ≈ 0.011, so update w_24 to 10.011. Recomputing with the new weights gives y_5 = 0.276 and e(t) = 1 − 0.276 = 0.724.
After many training examples. Final weights: W13 = 12, W14 = −13, W23 = −11, W24 = 13, W35 = 13, W45 = 13. E.g. for input (0, 1): y_3 ≈ 0.000, y_4 = 0.999, y_5 = 0.999, so e(t) = 1 − 0.999 = 0.001. Outputs for all four cases:
x1 x2 | d | y
 0  0 | 0 | 0.003
 0  1 | 1 | 0.999
 1  0 | 1 | 0.999
 1  1 | 0 | 0.003
e(t) < 0.1 for all cases: we can consider these outcomes correct.
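The trained XOR network above can be evaluated directly. The activation used here, a logistic sigmoid with threshold θ = 6, is an assumption reconstructed from the reported node activations:

```python
import math

# Assumed activation for nodes 3-5: g(z) = 1 / (1 + e^-(z - theta)), theta = 6.
def g(z, theta=6):
    return 1.0 / (1.0 + math.exp(-(z - theta)))

def xor_net(x1, x2):
    y3 = g(12 * x1 - 11 * x2)    # hidden node 3: W13 = 12, W23 = -11
    y4 = g(-13 * x1 + 13 * x2)   # hidden node 4: W14 = -13, W24 = 13
    return g(13 * y3 + 13 * y4)  # output node 5: W35 = W45 = 13

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b), 3))  # -> 0.003, 0.999, 0.999, 0.003
```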
Properties of MLNs
- Boolean functions: every boolean function f: {0,1}^k → {0,1} can be represented using a single hidden layer
- Continuous functions: every bounded piecewise-continuous function can be approximated with arbitrarily small error with one hidden layer; any continuous function can be approximated to arbitrary accuracy with two hidden layers
- Learning: not efficient (in fact intractable, regardless of the method); no guarantee of convergence
Example: Voice Recognition. Task: learn to discriminate between two different voices saying "Hello". Data: sources: Steve Simpson, David Raubenheimer; format: frequency distribution (60 bins); analogy: the cochlea.
Example: Voice Recognition. Network architecture: feedforward network with 60 input nodes (one for each frequency bin), 6 hidden nodes, and 2 output nodes (0 1 for "Steve", 1 0 for "David").
Example: Voice Recognition. Presenting the data ("Steve", "David"): feedforward.
Example: Voice Recognition. Presenting the data: feedforward through the untrained network. Outputs for "Steve": 0.43, 0.26; for "David": 0.73, 0.55.
Example: Voice Recognition. Calculate the error (desired minus actual, per output). Steve: |0 − 0.43| = 0.43, |1 − 0.26| = 0.74. David: |1 − 0.73| = 0.27, |0 − 0.55| = 0.55.
Example: Voice Recognition. Backpropagate the total error and adjust the weights: total error 0.43 + 0.74 = 1.17 for "Steve", 0.27 + 0.55 = 0.82 for "David".
Example: Voice Recognition. Repeat the process (a sweep) for all training pairs: present the data, calculate the error, backpropagate the error, adjust the weights. Repeat this process multiple times. (Plot: total error against number of sweeps.)
Example: Voice Recognition. Presenting the data (trained network): outputs for "Steve": 0.01, 0.99; for "David": 0.99, 0.01.
Example: Voice Recognition. Results: performance of the trained network. Discrimination accuracy between known "Hello"s: 100%. Discrimination accuracy between new "Hello"s: 100%.
Example: Voice Recognition. Results (ctnd.): the network has learnt to generalise from the original data; networks with different weight settings can have the same functionality; trained networks concentrate on the lower frequencies; the network is robust against non-functioning nodes.
Applications of feed-forward nets
- Classification, pattern recognition, diagnosis: character recognition (both printed and handwritten), face recognition, speech recognition, object classification by means of salient features, analysis of signals to determine their nature and source
- Regression and forecasting: in particular non-linear functions and time series
Examples: sonar mine/rock recognition (Gorman & Sejnowski, 1988), navigation of a car (Pomerleau, 1989), stock market prediction, pronunciation (NETtalk: Sejnowski & Rosenberg, 1987)
More Neural Networks. Acyclic: feedforward. Cyclic: recurrent.
(Natural) Neurons revisited. Humans have ~10^10 neurons and ~10^15 dendrites: don't even think about creating an ANN of this size. Most ANNs do not have feedback loops in the network structure (exception: recurrent NNs). The ANN activation function is (probably) much simpler than what happens in the biological neuron.
Learning NNs using Evolution: https://www.youtube.com/watch?v=ts8qll 3NXk https://www.youtube.com/watch?v=s9y_i9vy8qw
Deep learning. Source: NIPS 2015 tutorial by Y. LeCun.
NN as function approximator. A NN can be used as a black box that represents (an approximation of) a function. This can be used in combination with other learning methods, e.g. use a NN to represent the Q-function in Q-learning.
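The idea can be sketched with the simplest case, a single-layer (linear) network as Q-function approximator; the feature size, action count and parameters below are illustrative. The TD error plays the role of the output error e(t) in the weight update:

```python
n_features, n_actions = 4, 2
# one linear output unit per action; weights start at zero
w = [[0.0] * n_features for _ in range(n_actions)]

def q(features, a):
    """Approximate Q(s, a) as a weighted sum of state features."""
    return sum(w[a][i] * features[i] for i in range(n_features))

def td_update(features, a, reward, next_features, alpha=0.1, gamma=0.9):
    target = reward + gamma * max(q(next_features, b) for b in range(n_actions))
    error = target - q(features, a)             # TD error: "desired - actual"
    for i in range(n_features):
        w[a][i] += alpha * error * features[i]  # delta-rule weight update

f, nf = [1, 0, 0, 0], [0, 1, 0, 0]
td_update(f, 0, reward=1.0, next_features=nf)
print(q(f, 0))  # -> 0.1 (Q moved toward the target)
```

A deeper network would replace the linear combiner here, with the TD error backpropagated through it.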
NN + Q-learning
AlphaGo (DeepMind/Google): https://www.youtube.com/watch?v=mzpw10dpheq