Introduction and Perceptron Learning

Artificial Neural Networks
Introduction and Perceptron Learning

CPSC 565, Winter 2003
Christian Jacob
Department of Computer Science
University of Calgary, Canada

CPSC 565 - Winter 2003 - Emergent Computing

The Brain Paradigm

Example: Visual Cortex of a Cat

[Image 1]

[Image 2]

ANN Image Processing Example

Brains vs. Digital Computers

- Computers require hundreds of cycles to simulate the firing of a neuron. (What does the "firing" pattern of a neuron look like?)
- Computers are good at symbol processing.
  - Are "life" and "mind" reducible to "symbol processing"?
- Brains perform extremely well at highly parallel pattern recognition tasks:
  - face recognition,
  - language processing,
  - language understanding (!),
  - creativity, inventing, use of tools, ...
  - self-reflection, self-awareness, ...

Computers versus Human Brains

Human Brain:
- Grown by cell differentiation and iterated cell division (instead of being constructed from pre-fabricated building blocks)
- Rather simple "processing elements"
- High degree of interconnectivity (adaptive!)
- Adaptive and hierarchical architecture
- Highly parallel and distributed information processing
- Redundant information storage and processing
- Functionality is both pre-programmed (to some degree) and "programmable"
- "Algorithms" are designed through learning, not programming.

Computers versus Human Brains: Hard- / Software and Processing

                          Computer                           Human Brain
Computational units       1 CPU, 10^5 gates                  10^11 neurons
Storage units             10^9 bits RAM, 10^10 bits disk     10^11 neurons, 10^14 synapses
Cycle time                10^-9 sec                          10^-3 sec
Neuron updates per sec    10^5                               10^14

Networks of Neurons

Dendrites, Synapses, Cell Body, Axon

Dendrites, synapses, cell body, and axon are the four elements that are usually adopted from the biological model in order to build artificial neural networks.

Artificial neurons for computing will have
- input channels,
- a cell body, and
- an output channel.

Synapses are simulated by contact points between the cell body and input or output connections. A weight is associated with each of these points.

Figure 1. A typical motor neuron

Transmission of Information

A fundamental problem of any information processing system is the way in which information is transmitted through the system. Neurons transmit information using electrical signals. In biological structures, however, this cannot be done by simple electronic transport as in metallic cables. Evolution arrived at another solution involving ions and semi-permeable membranes.

Charged Cells

Our body consists mainly of water, 55% of which is contained within the cells and 45% of which forms their environment. The cells preserve their identity and biological components by enclosing the protoplasm in a membrane. Membranes are made of a double layer of molecules that forms a diffusion barrier.

Some salts present in our body dissolve in the intra- and extracellular fluid and dissociate into negative and positive ions. The ions that play an important role for neurons and their information processing are
- sodium ions ($\mathrm{Na}^+$), chloride ions ($\mathrm{Cl}^-$), potassium ions ($\mathrm{K}^+$), and calcium ions ($\mathrm{Ca}^{2+}$).

The membranes of the cells exhibit different degrees of permeability for each of these ions.

The permeability is determined by the number and size of pores in the membrane, the so-called ionic channels. The specific permeability of the membrane leads to different distributions of ions in the interior and the exterior of the cells.

Action Potential

In particular, differences in membrane permeability lead to the interior of neurons being negatively charged with respect to the extracellular fluid. An action potential is produced by an initial depolarization of the cell membrane.

Figure 2. Typical form of the action potential

The potential increases from -70 mV to +40 mV. After some time, the potential becomes negative again, but it overshoots. Gradually, the cell recovers and the cell membrane returns to the initial potential.

Transmission of an Action Potential

Figure 3. Transmission of an action potential

Information Processing at the Synapses

Neurons transmit information using action potentials. The processing of this information at the interfaces between neurons, the synapses, involves a combination of electrical and chemical processes.

Directed Transmission of Information

Synapses determine a direction for the transmission of information.

Signals flow from one cell to another in a well-defined manner.

Figure 4. Chemical signaling at the synapse

When an electric impulse arrives at the synapse, the synaptic vesicles fuse with the cell membrane. The transmitters flow into the synaptic gap and some attach to the ionic channels. This opens the ionic channels so that more ions can flow from the exterior to the interior of the cell, which alters the cell's potential. If the potential of the cell's interior is increased, this helps to prepare an action potential, and the synapse causes an excitation of the cell.

Storage of Information and Learning

NMDA receptors (NMDA = N-methyl-D-aspartate) help to explain some forms of learning (among many others) in neurons. NMDA receptors are ionic channels that are permeable to different kinds of ions (sodium, calcium, or potassium ions).

Figure 5. Unblocking of an NMDA receptor

These channels are blocked by a magnesium ion, so that the permeability for sodium and potassium is low.

If the cell has reached a certain excitation level, the ionic channels lose the magnesium ions and become unblocked. The permeability for $\mathrm{Ca}^{2+}$ ions increases immediately, which starts a chain of reactions resulting in a durable change of the threshold level of the cell.

Artificial Neural Networks: Introductory Concepts

Definition of an ANN

- A neural network is a system composed of (usually a large number of) simple processing elements (neurons).
- Ideally, the processing elements operate asynchronously and in parallel.
- ANNs can be used to acquire (through training, learning), store, and utilize experiential knowledge.

Mathematically, a neural network is a "mapping machine" capable of modeling a function

$$F : \mathbb{R}^n \to \mathbb{R}^m$$

That is, a network maps an n-dimensional real input vector $(x_1, x_2, \dots, x_n)$ to an m-dimensional real output vector $(y_1, y_2, \dots, y_m)$.

ANN Architectures

Feed-forward networks:
- Neurons are arranged in layers.
- Links only follow one direction, namely from the input layer to the output layer.
- Usually, a unit is linked only to units in the following layer(s).
- Units within the same layer are not linked.
- Signal (and error) propagation as well as weight updating can proceed uniformly from the input to the output layer.
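The mapping view of a feed-forward network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `layer` and `feed_forward`, the sigmoid activation, and all weight values are hypothetical and do not come from the slides.

```python
import math

def sigmoid(x):
    """Smooth activation function, used here only as an example."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(weights, inputs):
    """Evaluate one layer; each row of `weights` holds the incoming weights of one unit."""
    return [sigmoid(sum(w * a for w, a in zip(row, inputs))) for row in weights]

def feed_forward(layers, x):
    """Propagate the input vector x layer by layer, from input to output."""
    activations = list(x)
    for weights in layers:
        activations = layer(weights, activations)
    return activations

# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 output units,
# i.e. a mapping from R^3 to R^2.
hidden = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.6]]
output = [[1.0, -1.0], [0.2, 0.7]]

y = feed_forward([hidden, output], [1.0, 0.0, -1.0])
print(y)  # two real numbers, each strictly between 0 and 1
```

Units within a layer never read each other's activations here, which mirrors the rule that units within the same layer are not linked.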

Figure 6. Example of a feed-forward network with a single hidden layer

Recurrent networks:
- Links can exist between any pair of neurons and can form arbitrary topologies.
- Can implement more complex neural architectures.
- Internal states with memory can be modelled.
- A stable internal state and output might not be reached.

Figure 7. McCulloch-Pitts network for a binary scaler. For example, it translates the binary sequence 00110110 into the sequence 00100100.

A Generic Neuron Model

Generic Model of a Neuron Processing Unit

A typical model of a neural processing unit: [figure]

A more detailed model of a neural processing unit: [figure]

Input function:

$$in_i = \sum_j w_{ji}\, a_j$$

Activation function:

$$g(in_i) = g\Big(\sum_j w_{ji}\, a_j\Big)$$

Output function:

$$a_i = \mathrm{out}(g(in_i)) = \mathrm{out}\Big(g\Big(\sum_j w_{ji}\, a_j\Big)\Big)$$

Activation Functions

(1) Step Function:

$$\mathrm{step}(x, t) = \begin{cases} 1 & \text{if } x \ge t \\ 0 & \text{if } x < t \end{cases}$$

[Plot: step(x, t) for t = 1, x from -4 to 4]

(2) Sign Function:

$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$$

[Plot: sign(x), x from -4 to 4]

(3) Sigmoid Function:

$$\mathrm{sigmoid}(x, a) = \frac{1}{1 + e^{-a x}}$$

[Plot: sigmoid(x, a), x from -4 to 4]

The parameter $a$ determines the slope of the sigmoid function:
- $0.1 \le a \le 1$: [Plot: sigmoid(x, a) shown for a = 0.1, x from -4 to 4]
- $1 \le a \le 10$: [Plot: sigmoid(x, a) shown for a = 1, x from -4 to 4]

This allows the sigmoid function to approximate the step function (for large $a$) and, when rescaled to the range from -1 to 1, the sign function.
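The generic unit and the three activation functions above can be sketched as follows. The names `unit_output` and the default identity output function are assumptions made for this illustration; the formulas themselves follow the definitions on the slides.

```python
import math

def step(x, t):
    """Step function with threshold t."""
    return 1 if x >= t else 0

def sign(x):
    """Sign function with sign(0) = 1, as defined above."""
    return 1 if x >= 0 else -1

def sigmoid(x, a=1.0):
    """Sigmoid function; the parameter a controls the slope at x = 0."""
    return 1.0 / (1.0 + math.exp(-a * x))

def unit_output(weights, activations, g, out=lambda v: v):
    """Generic unit: input function (weighted sum), activation g, output function out."""
    in_i = sum(w * a for w, a in zip(weights, activations))
    return out(g(in_i))

# A larger slope parameter pushes the sigmoid towards the step function step(x, 0).
for a in (0.1, 1.0, 10.0):
    print(a, [round(sigmoid(x, a), 3) for x in (-2, -1, 0, 1, 2)])

# Rescaling to the range (-1, 1) gives an approximation of the sign function.
print([round(2 * sigmoid(x, 10.0) - 1, 3) for x in (-2, -1, 1, 2)])
```

Passing `g` and `out` as arguments keeps the unit generic, so the same weighted sum can be combined with any of the activation functions.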

Neurons in Action: Logic Gates

Neurons as Logic Gates

Individual units, representing Boolean functions, can act as logic gates, given appropriate thresholds and weights.

Activation function: step(x, t)

[Plot: step(x, t) for t = 1, x from -4 to 4]

(1) Which logic function? Two inputs with weights w = 1 and w = 1, threshold t = 1.5

(2) Which logic function? Two inputs with weights w = 1 and w = 1, threshold t = 0.5

(3) Which logic function? One input with weight w = -1, threshold t = -0.5

Specific Neuron Models

McCulloch-Pitts Units

McCulloch-Pitts processing units are the simplest neuron models. They produce and transmit only binary information.

Figure 8. McCulloch-Pitts unit

The rule for evaluating the input of a McCulloch-Pitts unit (MP unit) is as follows:
- The MP unit gets two sorts of input:
  - input $x_1, x_2, \dots, x_n$ through $n$ excitatory edges
  - input $y_1, y_2, \dots, y_m$ through $m$ inhibitory edges.
- If $m \ge 1$ and at least one of the signals $y_1, y_2, \dots, y_m$ is 1, the unit is inhibited and the output is 0.
- Otherwise, the total excitation $x = \sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$ is computed and compared to the threshold $\theta$:

$$\text{output} = \begin{cases} 1 & \text{if } x \ge \theta \\ 0 & \text{if } x < \theta \end{cases}$$
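The three threshold units from "Neurons as Logic Gates" and the McCulloch-Pitts evaluation rule can be tried out with a short sketch. The function names `threshold_gate` and `mp_unit` are hypothetical; the weights and thresholds are the ones given for units (1) to (3).

```python
from itertools import product

def step(x, t):
    """Step activation with threshold t."""
    return 1 if x >= t else 0

def threshold_gate(weights, t, inputs):
    """Weighted unit with step activation, as in 'Neurons as Logic Gates'."""
    return step(sum(w * x for w, x in zip(weights, inputs)), t)

def mp_unit(excitatory, inhibitory, theta):
    """McCulloch-Pitts unit: any active inhibitory input forces the output to 0."""
    if any(y == 1 for y in inhibitory):
        return 0
    return 1 if sum(excitatory) >= theta else 0

# Units (1) and (2): two inputs with weight 1 each, thresholds 1.5 and 0.5.
for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2,
          threshold_gate((1, 1), 1.5, (x1, x2)),   # behaves like AND
          threshold_gate((1, 1), 0.5, (x1, x2)))   # behaves like OR

# Unit (3): one input with weight -1 and threshold -0.5 behaves like NOT.
for x in (0, 1):
    print(x, threshold_gate((-1,), -0.5, (x,)))

# The same negation as an MP unit: one inhibitory edge, threshold 0.
print([mp_unit([], [x], 0) for x in (0, 1)])
```

MP units have no weights, so negation has to be expressed through an inhibitory edge instead of a negative weight.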

Conjunction and Disjunction

Figure 9. Generalized AND and OR gates as McCulloch-Pitts units

Negation and More Logical Functions

Figure 10. Logical functions and their realizations as McCulloch-Pitts neurons

What Do MP Units Compute?

For visualization purposes, we consider the function space of logical functions of three variables.

Figure 11. Function values of a logical function of three variables $(x_1, x_2, x_3)$

McCulloch-Pitts units divide the input space into two half-spaces. For a given input $(x_1, x_2, x_3)$ and a threshold $\theta$, the condition $x_1 + x_2 + x_3 \ge \theta$ is tested. It is true for all points on one side of the plane defined by $x_1 + x_2 + x_3 = \theta$ and false for all points on the other side.

Figure 12. Separation of the input space for the OR function

The majority function (with threshold $\theta = 2$) of three variables divides the input space in a similar manner, but the separating plane is given by the equation $x_1 + x_2 + x_3 = 2$.

Figure 13. Separating planes of the OR and majority functions

The planes are always parallel in the case of McCulloch-Pitts units.
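The half-space test can be checked on the eight corners of the unit cube. The sketch below assumes an MP unit without inhibitory edges; the name `mp_threshold` is hypothetical, and the thresholds 1 (OR) and 2 (majority) are the ones discussed above.

```python
from itertools import product

def mp_threshold(x, theta):
    """Output 1 if the point x lies on or above the plane x1 + x2 + x3 = theta."""
    return 1 if sum(x) >= theta else 0

corners = list(product((0, 1), repeat=3))

print("x1 x2 x3  OR  MAJ")
for x in corners:
    print(*x, "", mp_threshold(x, 1), "", mp_threshold(x, 2))

# Both planes share the normal vector (1, 1, 1), which is why they are parallel;
# only the threshold, and hence the offset of the plane, differs.
or_side = [x for x in corners if mp_threshold(x, 1)]
majority_side = [x for x in corners if mp_threshold(x, 2)]
print(len(or_side), len(majority_side))
```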

The Perceptron

Today, the perceptron is one of the classic models of neural network processing elements and architectures. Its use in practical applications is limited. However, due to its simplicity (both in its structure and in its learning algorithm), it provides a good model for studying the basics and problems of connectionist information processing.

The Classical Perceptron

The perceptron was probably the first computation device inspired by neural networks. It was developed in 1958 by the American psychologist Frank Rosenblatt, who used it for image processing and image classification tasks.

Figure 14. The classical perceptron architecture as proposed by Frank Rosenblatt

Minsky-and-Papert Perceptron

Minsky and Papert distilled the essential features from Rosenblatt's model in order to study the computational capabilities of the perceptron under different assumptions.

A retina is directly connected to logic elements called predicates, each of which can compute a single bit according to its input. These predicates can be as computationally complex as we like. For example, each predicate could perform a filter function on the pixel image.

Figure 15. Predicates and weights of a perceptron.

Each predicate, however, is limited in its diameter or in the number of its input pixels. No predicate sees the whole retina. A threshold unit, which receives weighted inputs from the predicates, is used to compute the final output of the perceptron.

Limitations*

A Perceptron Cell

[Figure: a perceptron cell with inputs $x_1, x_2, \dots, x_n$ weighted by $w_1, w_2, \dots, w_n$, an additional input $x_{n+1} = 1$ weighted by $w_{n+1} = -\theta$, threshold 0, and output $y$]

Perceptron with a Bias

In many cases it is more convenient to deal with perceptrons of threshold zero only. This corresponds to linear separations which go through the origin of the input space.

Any perceptron with threshold $\theta$ can be converted into an equivalent perceptron with threshold zero, which has an additional input, called the bias, weighted by $-\theta$.

Figure 17. A perceptron with a bias

Most learning algorithms can be stated more concisely by transforming thresholds into biases. The input and weight vectors must be extended:
- extended input vector: $(x_1, x_2, \dots, x_n, 1)$
- extended weight vector: $(w_1, w_2, \dots, w_n, w_{n+1})$ with $w_{n+1} = -\theta$.

From Inputs to Output

The perceptron calculates its output value as follows:

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n+1} w_i \cdot x_i \ge 0 \\ 0 & \text{if } \sum_{i=1}^{n+1} w_i \cdot x_i < 0 \end{cases}$$
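The equivalence between a threshold and a bias weight can be checked with a small sketch. The function names are hypothetical, and the example weights (1, 1) with threshold 1.5 are chosen only for illustration.

```python
def perceptron_threshold(w, x, theta):
    """Perceptron with explicit threshold: fires if w . x >= theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def perceptron_bias(w_ext, x):
    """Threshold-zero perceptron on the extended input vector (x1, ..., xn, 1)."""
    x_ext = list(x) + [1]
    return 1 if sum(wi * xi for wi, xi in zip(w_ext, x_ext)) >= 0 else 0

# Example values (assumed for illustration): two inputs, threshold 1.5.
w, theta = [1.0, 1.0], 1.5
w_ext = w + [-theta]          # extended weight vector with w_{n+1} = -theta

points = [(0, 0), (1, 0), (0, 1), (1, 1)]
for p in points:
    print(p, perceptron_threshold(w, p, theta), perceptron_bias(w_ext, p))

# Both formulations agree on every input point.
same = all(perceptron_threshold(w, p, theta) == perceptron_bias(w_ext, p)
           for p in points)
print(same)
```

Moving the threshold into the weight vector means a learning rule only has to adjust one vector, which is the reason the learning algorithms are stated in the extended form.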

What Do Perceptrons Compute?

Geometric Interpretation

A simple perceptron is a computing unit with threshold $\theta$. Receiving the $n$ real inputs $x_1, x_2, \dots, x_n$ through edges with the associated weights $w_1, w_2, \dots, w_n$, a perceptron computes its output as follows:

$$\text{output} = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge \theta \\ 0 & \text{otherwise.} \end{cases}$$

The perceptron thus separates the input space into two half-spaces. The following figure shows this separation of the input space for the weights $(w_1, w_2) = (0.9, 2)$ and the threshold $\theta = 1$.

Figure 18. Separation of input space with a perceptron testing the condition $0.9\, x_1 + 2\, x_2 \ge 1$

Linearly Separable Functions

A perceptron network is capable of computing any logical function. If we reduce the network to a single perceptron, which functions are still computable?

The 16 Boolean functions of two variables:

x1 x2   f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15
 0  0    0  1  0  1  0  1  0  1  0  1   0   1   0   1   0   1
 0  1    0  0  1  1  0  0  1  1  0  0   1   1   0   0   1   1
 1  0    0  0  0  0  1  1  1  1  0  0   0   0   1   1   1   1
 1  1    0  0  0  0  0  0  0  0  1  1   1   1   1   1   1   1

Perceptron-computable functions are those for which the points whose function value is 0 can be separated from the points whose function value is 1 using a single line.

Figure 19. Linear separations of input space corresponding to OR and AND

Two sets of points $A$ and $B$ in an n-dimensional space are called linearly separable if $n + 1$ real numbers $w_1, \dots, w_{n+1}$ exist, such that every point $(x_1, x_2, \dots, x_n) \in A$ satisfies $\sum_{i=1}^{n} w_i x_i \ge w_{n+1}$ and every point $(x_1, x_2, \dots, x_n) \in B$ satisfies $\sum_{i=1}^{n} w_i x_i < w_{n+1}$.

Duality of Input Space and Weight Space

Figure 20. Duality of input and weight space

The computation performed by a perceptron can be visualized as a linear separation of input space. When trying to find the appropriate weights for a perceptron, the search process can be better visualized in weight space.
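Which of the 16 functions a single perceptron can compute may be explored by brute force. The sketch below tries a small grid of weights and thresholds (the grid values are an assumption chosen for this illustration) and records every truth table it produces, numbered as in the table above: the value of $f_k$ at the j-th input row is bit j of k.

```python
from itertools import product

# Input rows in the same order as the table: (0,0), (0,1), (1,0), (1,1).
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def truth_index(outputs):
    """Index k of the function f_k whose value at POINTS[j] is bit j of k."""
    return sum(bit << j for j, bit in enumerate(outputs))

found = set()
for w1, w2, theta in product((-1, 0, 1), (-1, 0, 1), (-1.5, -0.5, 0.5, 1.5)):
    outputs = [1 if w1 * x1 + w2 * x2 >= theta else 0 for x1, x2 in POINTS]
    found.add(truth_index(outputs))

missing = sorted(set(range(16)) - found)
print(len(found), missing)  # 14 [6, 9]
```

On this grid, 14 of the 16 functions appear. A grid search alone proves nothing about the two it misses, but $f_6$ (XOR) and $f_9$ (its negation) are the two functions known not to be linearly separable, so no choice of weights and threshold would produce them.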

The Error Function in Weight Space

Assume that the set $A$ of input vectors in n-dimensional space must be separated from the set $B$ of input vectors such that a perceptron computes the binary function $f_w$ with

$$f_w(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \in B. \end{cases}$$

The function $f_w$ depends on the set $w = (w_1, \dots, w_N)$ of weights (including the threshold). The error function value is the number of false classifications for a particular weight vector $w$:

$$E(w) = \sum_{x \in A} (1 - f_w(x)) + \sum_{x \in B} f_w(x).$$

Since $E(w)$ is positive or zero, we want to reach the global minimum where $E(w) = 0$. Consequently, the aim of perceptron learning is to find a weight vector for which $E(w) = 0$.

The optimization problem, which the learning algorithm has to solve, can be understood as a descent on the error surface.

Figure 21. Error function for the AND function (for a perceptron with two inputs $(x_1, x_2)$ and constant threshold $\theta = 1$).

Here is an example of such a path through an iteration of weight settings $w_0, w_1, w_2, w^*$.
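The error function can be evaluated directly by counting misclassified points. In this sketch the threshold is stored as the last entry of the weight tuple, which is an assumed convention for the example, and the sets A and B encode the AND function with constant threshold 1.

```python
def f_w(w, x):
    """Perceptron output for weights w = (w1, ..., wn, theta); theta is the last entry."""
    *weights, theta = w
    return 1 if sum(wi * xi for wi, xi in zip(weights, x)) >= theta else 0

def error(w, A, B):
    """Number of false classifications: points of A mapped to 0 plus points of B mapped to 1."""
    return sum(1 - f_w(w, x) for x in A) + sum(f_w(w, x) for x in B)

# AND function: A holds the point with value 1, B the points with value 0.
A = [(1, 1)]
B = [(0, 0), (1, 0), (0, 1)]

print(error((1, 1, 1), A, B))      # 2: (1,0) and (0,1) are wrongly mapped to 1
print(error((0, 0, 1), A, B))      # 1: (1,1) is wrongly mapped to 0
print(error((0.6, 0.6, 1), A, B))  # 0: a point inside the solution region
```

Because E(w) only counts errors, it is piecewise constant in weight space, so the "descent" moves between flat regions instead of following a smooth gradient.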

Figure 22. Iteration steps to the region of minimal error

The Perceptron Learning Algorithm

Optimization Problem: Definition

The optimization problem, which the learning algorithm has to solve, can be understood as a descent on the error surface. But we can also look at the problem as a search for an inner point of the solution region (a polytope in the case of the perceptron).

For example, let's have a look at the separation corresponding to the AND function:

$$P = \{(1, 1)\}$$
$$N = \{(0, 0), (1, 0), (0, 1)\}$$

Here $P$ and $N$ are the two sets of points to be separated. The set $P$ must be classified in the positive and the set $N$ in the negative half-space.

Optimization Problem: Analytical Solution

Three weights $w_1$, $w_2$ and $w_3 = -\theta$ are needed to implement the desired separation with a generic perceptron. With the extended input vector ($x_3 = 1$), the following four inequalities have to be fulfilled for the AND function:

$$(0, 0, 1) \cdot (w_1, w_2, w_3) < 0$$
$$(1, 0, 1) \cdot (w_1, w_2, w_3) < 0$$
$$(0, 1, 1) \cdot (w_1, w_2, w_3) < 0$$
$$(1, 1, 1) \cdot (w_1, w_2, w_3) > 0$$

These inequalities can be written in a simpler matrix form, after multiplying the first three by -1:

$$\begin{pmatrix} 0 & 0 & -1 \\ -1 & 0 & -1 \\ 0 & -1 & -1 \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix} > \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$

This can be written as $A \cdot w > 0$, where $A$ is the $4 \times 3$ matrix and $w$ the weight vector (written as a column vector).

This inequality describes all points in the interior of a convex polytope. The sides of the polytope are delimited by the planes defined by each of the inequalities above. Any point in the interior of the polytope represents a solution for the learning problem.

Figure 23. Solution polytope for the AND function in weight space
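Whether a given weight vector lies inside the solution polytope can be checked row by row. The sketch below stores the four inequalities for the AND function as the rows of a 4×3 matrix; the name `in_solution_polytope` and the two candidate weight vectors are assumptions made for the example.

```python
# Rows of the 4x3 matrix; the first three rows are the N-inequalities multiplied by -1.
ROWS = [
    (0, 0, -1),    # -(0, 0, 1) . w > 0
    (-1, 0, -1),   # -(1, 0, 1) . w > 0
    (0, -1, -1),   # -(0, 1, 1) . w > 0
    (1, 1, 1),     #  (1, 1, 1) . w > 0
]

def in_solution_polytope(w):
    """True if every row product is strictly positive, i.e. w solves the AND problem."""
    return all(sum(a * wi for a, wi in zip(row, w)) > 0 for row in ROWS)

# w3 = -theta, so (1, 1, -1.5) is a unit with weights 1, 1 and threshold 1.5.
print(in_solution_polytope((1, 1, -1.5)))  # True: this unit computes AND
print(in_solution_polytope((1, 1, -0.5)))  # False: threshold 0.5 gives OR instead
```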

Optimization Problem: Learning Algorithm

The following procedure describes the learning algorithm for a single perceptron cell. Given are two sets of points $P$ and $N$, which the perceptron should learn to classify.

- Start: Generate an initial vector of weights $w_0$. Set $t = 0$ and $w = w_0$.
- Testing: Select $x \in P \cup N$.
    If $x \in P$ and $w_t \cdot x > 0$: goto Test for End
    If $x \in P$ and $w_t \cdot x \le 0$: goto Addition
    If $x \in N$ and $w_t \cdot x < 0$: goto Test for End
    If $x \in N$ and $w_t \cdot x \ge 0$: goto Subtraction
- Addition: $w_{t+1} = w_t + x$; $t = t + 1$; goto Testing
- Subtraction: $w_{t+1} = w_t - x$; $t = t + 1$; goto Testing
- Test for End: Are all $x \in P \cup N$ correctly classified?
    Yes: END
    No: goto Testing

Note: The perceptron learning procedure is guaranteed to terminate only if the point sets are linearly separable. Otherwise no weight vector classifies all points correctly, and the procedure never reaches END.
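The procedure above can be sketched in Python for the AND example. Two choices here are assumptions made for the illustration and are not prescribed by the slides: the points are visited cyclically instead of being selected arbitrarily, and a step limit `max_steps` guards against inputs that are not linearly separable. The points are given as extended input vectors with a constant third component of 1.

```python
def dot(w, x):
    """Scalar product of weight vector and (extended) input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def all_correct(w, P, N):
    """Test for End: P must lie strictly on the positive, N strictly on the negative side."""
    return all(dot(w, x) > 0 for x in P) and all(dot(w, x) < 0 for x in N)

def train_perceptron(P, N, w0, max_steps=10_000):
    """Perceptron learning; returns the final weights and the number of corrections."""
    w = list(w0)
    examples = [(x, True) for x in P] + [(x, False) for x in N]
    corrections = 0
    for step in range(max_steps):
        if all_correct(w, P, N):
            return w, corrections
        x, positive = examples[step % len(examples)]   # cyclic selection (assumption)
        s = dot(w, x)
        if positive and s <= 0:                        # Addition
            w = [wi + xi for wi, xi in zip(w, x)]
            corrections += 1
        elif not positive and s >= 0:                  # Subtraction
            w = [wi - xi for wi, xi in zip(w, x)]
            corrections += 1
    raise RuntimeError("no separating weight vector found within max_steps")

# AND function with extended input vectors (x1, x2, 1).
P = [(1, 1, 1)]
N = [(0, 0, 1), (1, 0, 1), (0, 1, 1)]

w, corrections = train_perceptron(P, N, w0=(0, 0, 0))
print(w, corrections)
```

The last component of the returned weight vector plays the role of $-\theta$, so the learned threshold can be read off directly.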

Example

The following example illustrates the convergence behavior of the perceptron learning algorithm.

Figure 24. Initial configuration

Figure 25. After correction with $x_1$

Figure 26. After correction with $x_3$

Figure 27. After correction with $x_1$

Adaptive "Programming" of ANNs through Learning

ANN Learning

A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.

Figure 28. Learning process in a parametric system (stages shown: Testing Input/Output Examples, Calculating Network Errors, Changing Network Parameters)

In some learning algorithms, examples of the desired input-output mapping are presented to the network. A correction step is executed iteratively until the network learns to produce the desired response.

Learning Schemes

Supervised Learning

Some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected (= learning algorithm) according to the magnitude of the error.

- Error-correction Learning: The magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights.
  Examples: Perceptron learning, backpropagation.
- Reinforcement Learning: After each presentation of an input-output example we only know whether the network produces the desired result or not. The weights are updated based on this Boolean decision (true or false).
  Example: Learning how to ride a bike.

Unsupervised Learning

For a given input, the exact numerical output a network should produce is unknown. Since no "teacher" is available, the network must organize itself (e.g., in order to associate clusters with units).

Examples: Clustering with self-organizing feature maps, Kohonen networks.

Figure 29. Three clusters and a classifier network
