Chapter ML:VI (continued)

VI. Neural Networks
- Perceptron Learning
- Gradient Descent
- Multilayer Perceptron
- Radial Basis Functions
Definition 1 (Linear Separability)

Two sets of feature vectors, $X_0$, $X_1$, of a $p$-dimensional feature space are called linearly separable if $p + 1$ real numbers, $\theta, w_1, \ldots, w_p$, exist such that the following holds:

1. $\forall x \in X_0:\; \sum_{j=1}^{p} w_j x_j < \theta$
2. $\forall x \in X_1:\; \sum_{j=1}^{p} w_j x_j \geq \theta$
[Figure: two point sets in the $(x_1, x_2)$-plane; left panel: a linearly separable configuration, right panel: a configuration that is not linearly separable.]
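A minimal Python sketch that checks the two conditions of Definition 1 for a given weight vector and threshold; the point sets, weights, and threshold below are illustrative choices, not taken from the slides.

```python
def separates(w, theta, X0, X1):
    """Check Definition 1: sum_j w_j*x_j < theta for all x in X0,
    and sum_j w_j*x_j >= theta for all x in X1."""
    def score(x):
        return sum(wj * xj for wj, xj in zip(w, x))
    return all(score(x) < theta for x in X0) and all(score(x) >= theta for x in X1)

# Two linearly separable sets in the plane (p = 2), with a separating
# hyperplane given by w = (1, 1) and theta = 2:
X0 = [(0.0, 0.0), (1.0, 0.0)]   # class 0
X1 = [(2.0, 2.0), (3.0, 1.0)]   # class 1
print(separates((1.0, 1.0), 2.0, X0, X1))   # True
```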
Separability

The XOR function yields the smallest example of two sets that are not linearly separable:

x_1  x_2  XOR (class)
 0    0    0
 1    0    1
 0    1    1
 1    1    0

[Figure: the four points of the XOR truth table on the grid $x_1 \in \{0, 1\}$, $x_2 \in \{0, 1\}$; no single line separates class 0 from class 1.]
Separability (continued)

The XOR function yields the smallest example of two sets that are not linearly separable (truth table and figure as above). Two ideas to overcome this limitation:

- specification of several hyperplanes
- combination of several perceptrons
Separability (continued)

Layered combination of several perceptrons: the multilayer perceptron. The minimal multilayer perceptron that is able to handle the XOR problem:

[Figure: network diagram with the inputs $x_1$, $x_2$ feeding two threshold units $(\Sigma, \theta)$ in a hidden layer, whose outputs feed a single output threshold unit $(\Sigma, \theta)$ that emits one of the two class values.]
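A minimal Python sketch of such a network, using Heaviside threshold units; the concrete weights and thresholds are one possible hand-picked choice and are not given on the slide.

```python
def step(z):
    """Heaviside threshold unit."""
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # hidden unit 1: fires for x1 OR x2
    h2 = step(x1 + x2 - 1.5)     # hidden unit 2: fires for x1 AND x2
    return step(h1 - h2 - 0.5)   # output unit: OR and not AND, i.e. XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))   # reproduces the XOR truth table
```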
Remarks:

- The multilayer perceptron was presented by Rumelhart and McClelland in 1986. Earlier, but unnoticed, was similar research work by Werbos [1974] and Parker [1982].
- Compared to a single perceptron, the multilayer perceptron poses a significantly more challenging training (= learning) problem, which requires continuous threshold functions and sophisticated learning strategies.
- Marvin Minsky and Seymour Papert demonstrated in 1969, using the XOR problem, the limitations of single perceptrons. Moreover, they assumed that extensions of the perceptron architecture (such as the multilayer perceptron) would be similarly limited as a single perceptron, a fatal mistake: in fact, they brought the research in this field to a halt that lasted 17 years.
Computation in the Network

A perceptron with a continuous, non-linear threshold function:

[Figure: perceptron with inputs $x_0 = 1$ (bias weight $w_0 = -\theta$) and $x_1, \ldots, x_p$ with weights $w_1, \ldots, w_p$, a summation unit $\Sigma$, a threshold comparison against 0, and the output $y$.]

The sigmoid function $\sigma(z)$ as threshold function:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}$, where $\dfrac{d\sigma(z)}{dz} = \sigma(z) \cdot (1 - \sigma(z))$
Computation in the Network (continued)

Computation of the perceptron output $y(x)$ via the sigmoid function $\sigma$:

$y(x) = \sigma(w^T x) = \dfrac{1}{1 + e^{-w^T x}}$

[Plot: $\sigma$ over the weighted sum $\sum_{j=0}^{p} w_j x_j$, rising from 0 to 1.]

An alternative to the sigmoid function is the tanh function:

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \dfrac{e^{2x} - 1}{e^{2x} + 1}$

[Plot: tanh over the weighted sum, rising from −1 to 1.]
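A short Python sketch of the sigmoid, its derivative, and the perceptron output defined above; the function names are our own. It also checks the identity $\tanh(z) = 2\sigma(2z) - 1$ connecting the two threshold functions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative: dσ/dz = σ(z) · (1 − σ(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def perceptron_output(w, x):
    """y(x) = σ(wᵀx); w and x include the bias components w_0 and x_0 = 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

z = 0.7
print(math.tanh(z), 2 * sigmoid(2 * z) - 1)   # both ≈ 0.6044
```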
Computation in the Network (continued)

Distinguish units (nodes, perceptrons) of type input, hidden, and output:

[Figure: feed-forward network with input layer $U_I$ (units $x_0 = 1, x_1, \ldots, x_p$), a hidden layer $U_H$ of $\Sigma$-units, and an output layer $U_O$ of $\Sigma$-units with outputs $y_1, y_2, \ldots, y_k$.]

Notation:

$U_I$, $U_H$, $U_O$        Sets with units of type input, hidden, and output
$x_{u \to v}$              Input value for unit $v$, provided via the output of unit $u$
$w_{uv}$, $\Delta w_{uv}$  Weight and weight adaptation for the edge connecting units $u$ and $v$
$\delta_u$                 Classification error of unit $u$
$y_u$                      Output value of unit $u$
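A possible forward pass for this layered structure, sketched in Python; the layer sizes and weight values are placeholders, and each weight row carries the bias weight $w_0$ for the constant input $x_0 = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x):
    """One layer: each row of W holds the weights of one unit,
    with the bias weight in position 0 (input x_0 = 1)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, [1.0] + x))) for row in W]

def forward(W_hidden, W_output, x):
    y_H = layer(W_hidden, x)      # U_I -> U_H
    return layer(W_output, y_H)   # U_H -> U_O

W_hidden = [[0.1, 0.4, -0.3], [-0.2, 0.5, 0.8]]   # two hidden units, p = 2
W_output = [[0.3, -0.6, 0.7]]                      # k = 1 output unit
print(forward(W_hidden, W_output, [1.0, 0.0]))
```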
Remarks:

- The units of the input layer, $U_I$, perform no computations at all; they merely distribute the input values to the next layer.
- The non-linear characteristic of the sigmoid function makes networks possible that can approximate every function. To achieve this flexibility, only three active layers are required, i.e., two layers with hidden units and one layer with output units. Keywords: universal approximator, [Kolmogorov Theorem, 1957]
- The generalized delta rule allows for a backpropagation of the classification error and hence the training of multi-layered networks.
- Gradient descent is based on the classification error of the entire network and hence considers the entire network weight vector.
- Multilayer perceptrons are also called multilayer networks or (artificial) neural networks, abbreviated as ANN.
Classification Error

The classification error $Err(\mathbf{w})$ is computed as a sum over the $|U_O| = k$ network outputs:

$Err(\mathbf{w}) = \dfrac{1}{2} \sum_{(\mathbf{x},\, c(\mathbf{x})) \in D} \; \sum_{v \in U_O} (c_v(\mathbf{x}) - y_v(\mathbf{x}))^2$

Due to its complex form, $Err(\mathbf{w})$ may contain various local minima:

[Plot: error surface $Err(\mathbf{w})$ over two weight dimensions, showing several local minima.]
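Computed directly from this definition, a sketch; the `predict` function, which returns the $k$ network outputs for an example, is an assumption (e.g., the forward pass sketched earlier).

```python
def classification_error(D, predict):
    """Err(w) = 1/2 * sum over (x, c) in D of sum over outputs of (c_v - y_v)^2.

    D:        list of (x, c) pairs, c a list of k target values in {0, 1}
    predict:  maps x to the list of k network outputs y_v(x)
    """
    err = 0.0
    for x, c in D:
        y = predict(x)
        err += 0.5 * sum((cv - yv) ** 2 for cv, yv in zip(c, y))
    return err
```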
Weight Adaptation: Incremental Gradient Descent

Algorithm: MPT (Multilayer Perceptron Training)
Input:   D    Training examples of the form $(\mathbf{x}, c(\mathbf{x}))$ with $|\mathbf{x}| = p + 1$, $c(\mathbf{x}) \in \{0, 1\}^k$.
         η    Learning rate, a small positive constant.
Output:  w    Weights for the units in $U_I$, $U_H$, $U_O$.

 1. initialize_random_weights($U_I$, $U_H$, $U_O$), $t = 0$
 2. REPEAT
 3.   $t = t + 1$
 4.   FOREACH $(\mathbf{x}, c(\mathbf{x})) \in D$ DO
 5.     FOREACH $u \in U_H$ DO $y_u$ = propagate($\mathbf{x}$, $u$)      // layer 1
 6.     FOREACH $v \in U_O$ DO $y_v$ = propagate($\mathbf{y}_H$, $v$)    // layer 2
 7.     FOREACH $v \in U_O$ DO $\delta_v = y_v \cdot (1 - y_v) \cdot (c_v(\mathbf{x}) - y_v)$   // backpropagate layer 2
 8.     FOREACH $u \in U_H$ DO $\delta_u = y_u \cdot (1 - y_u) \cdot \sum_{v \in U_O} w_{uv} \cdot \delta_v$   // backpropagate layer 1
 9.     FOREACH $w_{jk}$, $(j, k) \in (U_I \times U_H) \cup (U_H \times U_O)$ DO
10.       $\Delta w_{jk} = \eta \cdot \delta_k \cdot x_{j \to k}$
11.       $w_{jk} = w_{jk} + \Delta w_{jk}$
12.     ENDDO
13.   ENDDO
14. UNTIL(convergence($D$, $\mathbf{y}_O(D)$) OR $t > t_{max}$)
15. return($\mathbf{w}$)
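A compact Python rendering of the MPT pseudocode for one hidden layer. The initialization range, the values of η and $t_{max}$, and the replacement of the convergence test by a fixed iteration budget are illustrative simplifications.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(D, p, n_hidden, k, eta=0.5, t_max=10000):
    """Incremental gradient descent as in MPT; x in D carries the p feature
    values, and the bias input x_0 = 1 is prepended explicitly below."""
    W_H = [[random.uniform(-0.5, 0.5) for _ in range(p + 1)] for _ in range(n_hidden)]
    W_O = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(k)]
    for t in range(t_max):
        for x, c in D:
            xb = [1.0] + list(x)
            y_H = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W_H]  # layer 1
            yb = [1.0] + y_H
            y_O = [sigmoid(sum(w * yi for w, yi in zip(row, yb))) for row in W_O]  # layer 2
            # Backpropagate layer 2: delta_v = y_v * (1 - y_v) * (c_v - y_v)
            d_O = [y * (1 - y) * (cv - y) for y, cv in zip(y_O, c)]
            # Backpropagate layer 1: delta_u = y_u * (1 - y_u) * sum_v w_uv * delta_v
            d_H = [y_H[u] * (1 - y_H[u]) * sum(W_O[v][u + 1] * d_O[v] for v in range(k))
                   for u in range(n_hidden)]
            # Weight adaptation: Delta w_jk = eta * delta_k * x_(j->k)
            for v in range(k):
                for j in range(n_hidden + 1):
                    W_O[v][j] += eta * d_O[v] * yb[j]
            for u in range(n_hidden):
                for j in range(p + 1):
                    W_H[u][j] += eta * d_H[u] * xb[j]
    return W_H, W_O

# Example: learning XOR with two hidden units and k = 1 output.
D = [((0, 0), [0]), ((0, 1), [1]), ((1, 0), [1]), ((1, 1), [0])]
W_H, W_O = train_mlp(D, p=2, n_hidden=2, k=1)
```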
Weight Adaptation: Momentum Term

Momentum idea: a weight adaptation in iteration $t$ considers the adaptation in iteration $t - 1$:

$\Delta w_{uv}(t) = \eta \cdot \delta_v \cdot x_{u \to v} + \alpha \cdot \Delta w_{uv}(t-1)$

The term $\alpha$, $0 \leq \alpha < 1$, is called momentum.

Effects:

- Due to the adaptation inertia, local minima can be overcome.
- If the direction of the descent does not change, the adaptation increment and, as a consequence, the speed of convergence are increased.
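A numeric sketch of the momentum update for a single weight; the values of η and α and the constant stand-in gradient are illustrative. It shows the increments growing while the descent direction stays the same.

```python
eta, alpha = 0.5, 0.9
w, dw_prev = 0.0, 0.0
for t in range(5):
    grad = 1.0                          # stand-in for delta_v * x_(u->v)
    dw = eta * grad + alpha * dw_prev   # Delta w(t) = eta*grad + alpha*Delta w(t-1)
    w, dw_prev = w + dw, dw
    print(t, round(dw, 4))              # 0.5, 0.95, 1.355, 1.7195, 2.0476
```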
Neural Networks: Additional Sources on the Web

Application and implementation:

- JNNS. Java Neural Network Simulator.
  http://www-ra.informatik.uni-tuebingen.de/software/javanns
- SNNS. Stuttgart Neural Network Simulator.
  http://www-ra.informatik.uni-tuebingen.de/software/snns