Artificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso


Fall, 2018

Outline

- Introduction
  - A Brief History
- ANN Architecture
  - Terminology
- Multi-Layer Perceptrons (MLP)
  - Features of MLP
  - Universal Approximation
- Radial Basis Function (RBF) Network
- Optimization
  - Back-Prop
- Discussion

Introduction

- Neural Networks (NNs) are networks of neurons, for example, as found in real (i.e., biological) brains.
- Artificial Neural Networks (ANNs) are networks of artificial neurons, and hence constitute crude approximations to parts of real brains. They may be physical devices or simulated on conventional computers.
- From a practical point of view, an ANN is just a parallel computational system consisting of many simple processing elements connected together in a specific way in order to perform a particular task.
- One should never lose sight of how crude the approximations are, and how oversimplified our ANNs are compared to real brains.

Human Brain & Neuron Cells

- Neural networks: the flow of information includes feed-forward, feedback, and activation.
- The directed connections in an ANN resemble synapses.

Some Numbers about Brain vs. Computer

- There are approximately 10 billion neurons in the human cortex, compared with tens of thousands of processors in the most powerful parallel computers.
- Each biological neuron is connected to several thousand other neurons, similar to the connectivity in powerful parallel computers.
- Lack of processing units can be compensated by speed. The typical operating speed of biological neurons is measured in milliseconds ($10^{-3}$ s), while a silicon chip can operate in nanoseconds ($10^{-9}$ s).
- The human brain is extremely energy efficient, using far less energy per operation per second than the best computers today, which use around $10^{-6}$ joules per operation per second.
- Brains have been evolving for tens of millions of years; computers have been evolving for tens of decades.

A Brief History

- 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model.
- Hebb published The Organization of Behavior, in which the Hebbian learning rule was proposed.
- Rosenblatt introduced Perceptrons.
- Minsky and Papert's book Perceptrons demonstrated the limitations of Perceptrons, and almost the entire field went into hibernation.
- Hopfield published on Hopfield networks.
- Kohonen developed the Self-Organising Maps (SOM) that now bear his name.
- The back-propagation learning algorithm for Multi-Layer Perceptrons (MLP) was rediscovered and the field took off again.
- 1990s: The sub-field of Radial Basis Function Networks (RBFN) was developed.
- 2000s: The power of ensembles of ANN and SVM became apparent.
- 2010s: Deep Learning revives the field.

ANN Architecture

Terms in ANN

- Units or neurons
  - Input units can be connected to hidden units or to output units (e.g., a skip network).
  - Hidden units can be connected to other hidden units or to output units.
  - Output units cannot be connected to other units.
- Connections (directional synapses)
  - The number of connections may depend on computing capacity. For example, the total number of connections in a network cannot exceed roughly 32,000 in SAS Enterprise Miner.

Layers and Weights

- Layers: all the units in a given layer share certain characteristics. For example:
  - All the input units in a given layer have the same measurement level and the same method of standardization.
  - All the units in a given hidden layer have the same combination function and the same activation function.
  - All the units in a given output layer have the same combination function, activation function, and error function.
- Weights (e.g., coefficients in linear combinations)
- Bias and altitude (e.g., the intercept)

Several Functions in ANN

- Combination functions
  - MLP: a linear combination of the values feeding into the unit.
  - RBF: the squared Euclidean distance between the vector of weights and the vector of values feeding into the unit, multiplied by the squared bias value (scale factor or inverse width).
- Activation function: the value produced by the combination function is transformed by an activation function, e.g., sigmoid, exponential, or softmax.
- Error function: the loss function, e.g., SSE in least squares or the (negative) log-likelihood function.
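The combination and activation functions listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example numbers are hypothetical, and the radial combination is negated so that an exponential activation yields a Gaussian bump, which matches the ordinary RBF formula later in the slides.

```python
import math

def linear_combination(x, w, bias):
    """MLP-style combination: bias + sum_j w_j * x_j."""
    return bias + sum(wj * xj for wj, xj in zip(w, x))

def radial_combination(x, w, bias):
    """RBF-style combination: -(bias^2) * squared Euclidean distance between x and w."""
    sq_dist = sum((xj - wj) ** 2 for xj, wj in zip(x, w))
    return -(bias ** 2) * sq_dist

def logistic(z):
    """Sigmoid (logistic) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax activation; subtracting the maximum avoids overflow."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.5]
z_mlp = linear_combination(x, w, bias=0.1)   # value fed to a sigmoid in an MLP
z_rbf = radial_combination(x, w, bias=1.0)   # value fed to exp() in an RBF network
print(logistic(z_mlp), math.exp(z_rbf), softmax([z_mlp, z_rbf]))
```

The activation is applied to the scalar returned by the combination function, so the two choices can be mixed independently when a network is specified.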

MLP - the Graphical Representation

An MLP can be viewed as a multi-stage regression/classification model, with each stage corresponding to a parametric projection pursuit regression (PPR).

Multi-Layer Perceptrons (MLP)

The MLP is the most popular form of neural network architecture and is the default architecture in many neural network software packages.

Features: an MLP
- has any number of inputs;
- has one or more hidden layers with any number of units;
- uses linear combination functions in the hidden and output layers;
- uses sigmoid activation functions in the hidden layers;
- has any number of outputs with any activation function;
- has connections between the input layer and the first hidden layer, between the hidden layers, and between the last hidden layer and the output layer.
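A minimal sketch of the forward pass implied by the feature list above, assuming logistic hidden units and an identity output activation. The layer representation (a list of `(bias, weights)` pairs per unit) and all numeric values are hypothetical choices made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_input(bias, weights, values):
    """Linear combination function: bias + sum_j w_j * value_j."""
    return bias + sum(w * v for w, v in zip(weights, values))

def mlp_forward(x, hidden_layers, output_layer):
    """Propagate x through the sigmoid hidden layers, then a linear output layer."""
    activations = list(x)
    for layer in hidden_layers:
        activations = [sigmoid(unit_input(b, ws, activations)) for b, ws in layer]
    # Identity output activation (an assumption; the slides allow any output activation).
    return [unit_input(b, ws, activations) for b, ws in output_layer]

# Hypothetical network: 2 inputs, one hidden layer with 3 units, 1 output.
hidden_layers = [
    [(0.0, [1.0, 1.0]), (0.5, [-1.0, 2.0]), (-0.5, [0.3, 0.3])],
]
output_layer = [(0.1, [1.0, -2.0, 0.5])]
y_hat = mlp_forward([2.0, -1.0], hidden_layers, output_layer)
print(y_hat)
```

Adding a hidden layer only means appending another list of units to `hidden_layers`, with each unit holding one weight per unit in the previous layer.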

MLP as Universal Approximator

- Given enough data, enough hidden units, and enough training time, an MLP with just one hidden layer can learn to approximate virtually any function to any degree of accuracy. Such networks are known as universal approximators. (A statistical analogy is approximating a function with nth-order polynomials.)
- There are situations where a network with two or more hidden layers may require fewer hidden units and weights than a network with one hidden layer, so using extra hidden layers can sometimes improve generalization.

MLP Function Approximation Illustration

Consider the simple regression scenario where we have only one predictor X. The general idea of function approximation in an MLP can be understood as follows:

- Approximate a given function by a step function.
- Replace the step functions with smooth sigmoid functions.
- The entire process can be represented via an MLP structure.

Figure: Step Function Approximation

MLP Graphical Structure

The approximating step function can be written as

$$y_0 + y_1 I(x \ge x_1) + \cdots + y_4 I(x \ge x_4) = y_0 + y_1 I(-x_1 + x \ge 0) + \cdots + y_4 I(-x_4 + x \ge 0).$$

Sigmoid Approximation to the Threshold Function

The threshold function can be approximated with a smooth sigmoid function. Many smooth sigmoid functions $s(\cdot)$ are available; the logistic (expit) function is one of them. Namely,

$$I(x > c) = I(x - c > 0) \approx \frac{\exp(x - c)}{1 + \exp(x - c)} = s(x - c).$$

In the example, replacing all step functions with expit functions yields

$$\hat{y}(x) = y_0 + y_1 s(-x_1 + x) + \cdots + y_4 s(-x_4 + x),$$

which is exactly an MLP model.
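The replacement of step functions by expit functions can be checked numerically. The sketch below uses hypothetical knots and jump sizes, and it adds a steepness parameter `scale` that is not in the slides: `scale=1` gives the formula above, while a larger value shows how a steeper sigmoid approaches the step function.

```python
import math

def expit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def step_approx(x, y0, jumps, knots):
    """y0 + sum_k y_k * I(x >= x_k)."""
    return y0 + sum(y for y, c in zip(jumps, knots) if x >= c)

def sigmoid_approx(x, y0, jumps, knots, scale=1.0):
    """y0 + sum_k y_k * s(scale * (x - x_k)); scale is an added assumption."""
    return y0 + sum(y * expit(scale * (x - c)) for y, c in zip(jumps, knots))

# Hypothetical knots x_1..x_4 and jumps y_1..y_4.
knots = [1.0, 2.0, 3.0, 4.0]
jumps = [0.5, 1.0, -0.7, 0.3]
y0 = 0.2

for x in [0.0, 1.5, 2.5, 3.5, 5.0]:
    print(x,
          step_approx(x, y0, jumps, knots),
          sigmoid_approx(x, y0, jumps, knots),             # as on the slide
          sigmoid_approx(x, y0, jumps, knots, scale=50.0))  # steeper sigmoids
```

In MLP terms, each knot corresponds to a hidden-unit bias, `scale` to the input-to-hidden weight, and the jumps to the hidden-to-output weights.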

Figure: Sigmoid Function Approximation

Radial Basis Function (RBF) Network

First proposed by Broomhead and Lowe (1988).

Features: an RBF network
- has any number of inputs;
- typically has only one hidden layer with any number of units;
- uses radial combination functions in the hidden layer, based on the squared Euclidean distance between the input vector and the weight vector;
- typically uses exponential or softmax activation functions in the hidden layer (inducing two types), in which case the network is a Gaussian RBF network;
- uses linear combination functions in the output layer;
- can have multiple outputs, with the output activation function selected according to the response type.

Two Types of Gaussian RBF Networks

- The first type, the ordinary RBF (ORBF) network, uses the exponential activation function, so the activation of a unit is a Gaussian bump as a function of the inputs.
- The second type, the normalized RBF (NRBF) network, uses the softmax activation function, so the activations of all the hidden units are normalized to sum to one.
- While the distinction between these two types of Gaussian RBF architectures is sometimes mentioned in the NN literature, its importance has rarely been appreciated except by Tao (1993) and Werntges (1993).

Ordinary RBF Networks

Going from the input layer $x$ to the hidden layer, let $u_k$ be a unit in the only hidden layer of an RBF network. In an ordinary RBF network,

$$u_k = \exp\left(-w_{0k}^2 \, \|x - w_k\|^2\right), \quad k = 1, \ldots, K,$$

where $w_k = (w_{1k}, \ldots, w_{pk})^T$.

- An RBF network resembles kernel regression, in particular local linear regression, in some ways.
- The weight vector $w_k$ plays the role of an anchor/center point (or a typical observation) in each of the $K$ clusters.
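As a rough illustration of the ordinary RBF hidden unit, the following sketch evaluates the Gaussian bump at a few points. The function name, centers, and bias values are hypothetical; only the formula itself comes from the slide.

```python
import math

def orbf_unit(x, center, w0):
    """Ordinary RBF unit: exp(-w0^2 * ||x - center||^2)."""
    sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, center))
    return math.exp(-(w0 ** 2) * sq_dist)

# Two hypothetical hidden units (K = 2) for p = 2 inputs.
centers = [[0.0, 0.0], [3.0, 3.0]]
biases = [1.0, 0.5]   # w_{0k}: larger values give narrower bumps

for x in ([0.0, 0.0], [1.0, 0.0], [3.0, 3.0]):
    u = [orbf_unit(x, c, w0) for c, w0 in zip(centers, biases)]
    print(x, u)
```

Each unit equals 1 at its own center and decays toward 0 with distance, which is the local behavior that motivates the kernel regression analogy.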

Normalized RBF Networks

In a normalized RBF network, first define

$$e_k = \exp\left[f \ln(a_k) - w_{0k}^2 \, \|x - w_k\|^2\right], \quad k = 1, \ldots, K.$$

- The altitude parameter $a_k$ represents the maximum height of the $k$-th component Gaussian function.
- The constant $f$, the fan-in, is the number of connections to the neuron.

Additional normalization is applied to obtain the units at the hidden layer:

$$u_k := \frac{e_k}{\sum_{k'=1}^{K} e_{k'}}, \quad \text{so that} \quad \sum_{k=1}^{K} u_k = 1.$$

This amounts to activating with the softmax function, which is also routinely used in multinomial logistic regression.
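The normalization step can be sketched as a softmax over the exponents $f \ln(a_k) - w_{0k}^2 \|x - w_k\|^2$. In this illustration the fan-in `f` is taken to be the number of inputs, which is an assumption about how the connections are counted; the function name and all numeric values are hypothetical.

```python
import math

def nrbf_units(x, centers, biases, altitudes, fan_in):
    """Normalized RBF units: softmax of f*ln(a_k) - w0k^2 * ||x - w_k||^2."""
    logits = []
    for center, w0, a in zip(centers, biases, altitudes):
        sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, center))
        logits.append(fan_in * math.log(a) - (w0 ** 2) * sq_dist)
    m = max(logits)                       # subtract the max for numerical stability
    e = [math.exp(l - m) for l in logits]
    total = sum(e)
    return [ek / total for ek in e]

# Hypothetical network with K = 3 hidden units and p = 2 inputs.
centers = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
biases = [1.0, 1.0, 1.0]
altitudes = [1.0, 1.0, 2.0]
x = [1.0, 0.0]
u = nrbf_units(x, centers, biases, altitudes, fan_in=len(x))
print(u, sum(u))
```

Subtracting the largest exponent before exponentiating leaves the ratios unchanged, so the result is the same as the formula above but safer in floating point.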

RBF Network - Output Layer

In the output layer, linear combinations are used as the combination function, i.e.,

$$\hat{y} = w_0 + w_1 u_1 + \cdots + w_K u_K.$$

The output activation function in RBF networks is customarily the identity function. Using an identity output activation function is a computational convenience in training, but it is possible and often desirable to use other output activation functions, as in an MLP.

RBF Networks vs. MLP

- MLPs are said to be distributed-processing networks because the effect of a hidden unit can be distributed over the entire input space.
- Gaussian RBF networks, on the other hand, are said to be local-processing networks because the effect of a hidden unit is usually concentrated in a local area centered at the weight vector.

An Example: RBF & MLP

MLP:

$$g_0^{-1}(E(y)) = w_0 + w_1 u_1 + w_2 u_2$$
$$u_1 = \tanh(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$
$$u_2 = \tanh(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$

Ordinary RBF:

$$g_0^{-1}(E(y)) = w_0 + w_1 u_1 + w_2 u_2$$
$$u_1 = \exp\left[-w_{01}^2 \left\{(x_1 - w_{11})^2 + (x_2 - w_{21})^2 + (x_3 - w_{31})^2\right\}\right]$$
$$u_2 = \exp\left[-w_{02}^2 \left\{(x_1 - w_{12})^2 + (x_2 - w_{22})^2 + (x_3 - w_{32})^2\right\}\right]$$

Normalized RBF:

$$g_0^{-1}(E(y)) = w_1 u_1 + w_2 u_2 + w_3 u_3$$
$$u_k := \frac{e_k}{e_1 + e_2 + e_3}, \quad k = 1, 2, 3$$
$$e_k = \exp\left[f \ln(a_k) - w_{0k}^2 \left\{(x_1 - w_{1k})^2 + (x_2 - w_{2k})^2 + (x_3 - w_{3k})^2\right\}\right]$$

Training Neural Networks

- Backprop (back-propagation) provides an efficient way of computing derivatives (including the gradient, Jacobian, and Hessian).
- Use the derivatives in optimization algorithms.
  - Steepest descent: Batch Prop and Incremental Prop (stochastic gradient descent or online learning).
  - Other optimization methods could be more reliable: Levenberg-Marquardt, quasi-Newton (e.g., BFGS and L-BFGS), and conjugate gradient.
- Minimizing the empirical risk function for an ANN is a nonconvex problem. Global optimization techniques may be helpful.
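To make the role of backprop concrete, here is a minimal sketch for a one-predictor MLP with one logistic hidden layer, an identity output, and loss equal to half the SSE. The parameter layout, starting values, learning rate, and toy data are all hypothetical, and the finite-difference helper is included only to check the back-propagated gradient; this is not the training procedure of any particular software.

```python
import copy
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(params, xs, ys):
    """Return the loss 0.5 * SSE and its gradient, computed by back-propagation."""
    b, v, a, c = params["b"], params["v"], params["a"], params["c"]
    K = len(b)
    loss = 0.0
    grad = {"b": [0.0] * K, "v": [0.0] * K, "a": [0.0] * K, "c": 0.0}
    for x, y in zip(xs, ys):
        # Forward pass: hidden activations, then the linear output.
        h = [sigmoid(b[k] + v[k] * x) for k in range(K)]
        y_hat = c + sum(a[k] * h[k] for k in range(K))
        # Backward pass: push the error signal from the output to the hidden layer.
        delta = y_hat - y
        loss += 0.5 * delta ** 2
        grad["c"] += delta
        for k in range(K):
            grad["a"][k] += delta * h[k]
            dz = delta * a[k] * h[k] * (1.0 - h[k])  # chain rule through the sigmoid
            grad["b"][k] += dz
            grad["v"][k] += dz * x
    return loss, grad

def finite_difference(params, xs, ys, name, k, eps=1e-6):
    """Central-difference derivative w.r.t. params[name][k]; used only as a check."""
    def loss_at(shift):
        p = copy.deepcopy(params)
        p[name][k] += shift
        return loss_and_grad(p, xs, ys)[0]
    return (loss_at(eps) - loss_at(-eps)) / (2.0 * eps)

def batch_gradient_descent(params, xs, ys, lr=0.02, steps=300):
    """Batch Prop: one steepest-descent update per pass over all the data."""
    p = copy.deepcopy(params)
    history = []
    for _ in range(steps):
        loss, g = loss_and_grad(p, xs, ys)
        history.append(loss)
        for name in ("b", "v", "a"):
            p[name] = [w - lr * gw for w, gw in zip(p[name], g[name])]
        p["c"] -= lr * g["c"]
    return p, history

# Toy data and hypothetical starting values.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [math.sin(x) for x in xs]
init = {"b": [0.1, -0.2], "v": [0.5, -0.3], "a": [0.3, 0.2], "c": 0.0}

_, grad = loss_and_grad(init, xs, ys)
fitted, history = batch_gradient_descent(init, xs, ys)
print(grad["v"][0], finite_difference(init, xs, ys, "v", 0))
print(history[0], history[-1])
```

Incremental Prop would apply the same update after each observation instead of after the full pass. Because the problem is nonconvex, different starting values can lead to different local minima.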

Other Issues

- ANN is not an off-the-shelf method and entails careful data preparation:
  - normalize input variables;
  - handle nominal/ordinal variables;
  - impute missing values.
- Training a good ANN model takes considerable experience since there are so many parameters to be tuned, e.g., number of layers, number of hidden units, starting values for weights, tuning parameter for regularization, and learning rate (step size in gradient descent).
- ANN lacks interpretability, although efforts have been made in terms of variable importance ranking, confidence intervals for weights, etc.
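The three data preparation steps can be sketched with the standard library alone. The choices below (mean imputation, z-score standardization with the sample standard deviation, and one-hot coding of nominal levels) are common defaults assumed for illustration; the slides do not prescribe specific methods, and the helper names are hypothetical.

```python
import math

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def one_hot(labels):
    """Code a nominal variable as 0/1 indicator columns, one per sorted level."""
    levels = sorted(set(labels))
    return [[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels]

# Hypothetical raw columns.
age = [31.0, None, 45.0, 52.0]
color = ["red", "blue", "red", "green"]

age_ready = standardize(impute_mean(age))
color_ready = one_hot(color)
print(age_ready)
print(color_ready)
```

In a real analysis the imputation and scaling constants would normally be estimated on the training data and then reused for validation and test data.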

Thanks!
