Neural Networks 2007 Task Sheet 2

University of Zurich                        Prof. Dr. Rolf Pfeifer, pfeifer@ifi.unizh.ch
Department of Informatics, AI Lab           Matej Hoffmann, hoffmann@ifi.unizh.ch
Andreasstrasse 15                           Marc Ziegler, mziegler@ifi.unizh.ch
8050 Zürich                                 Jonas Ruesch, ruesch@ifi.unizh.ch

Due date: May 4, 2007

Student ID:
First name:
Family name:

Note: The purpose of this task sheet is to help you familiarize yourself with the basic concepts that will be used throughout the class. The questions are very similar to the ones you will find in the final exam. Please write your name and student ID number in the respective fields on the title page, and most importantly: please try to write legibly, as we will not give you points for an answer that we cannot decipher! If you need additional sheets of paper, please staple them to the task sheets.

Points: ___ of 30
Question 1 (3 points)

Consider the following simple multilayer perceptron (MLP).

[Figure 1: Multi-layer perceptron for the OR function, with inputs x_1 and x_2, hidden unit output O_h, output unit output O_o, and weights w_0 to w_4.]

Initial weights (from the figure):

w_4 = 0.2, w_3 = -0.1, w_2 = 0, w_1 = 0.1, w_0 = 0.2

Compute the individual steps of the back-propagation algorithm. Assume a learning rate $\eta = 1$ and sigmoid activation functions

$g(h) = \frac{1}{1 + e^{-2\beta h}}$

in the nodes, with gain $\beta = 0.5$. Calculate the complete cycle for the input pattern (0,1) and for the input pattern (1,0).
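To make the requested computation concrete, here is a minimal numerical sketch of one back-propagation cycle in Python; a sketch for orientation, not the graded hand computation. It assumes a simplified 2-1-1 chain without biases, in which x_1 and x_2 feed the hidden unit and only the hidden unit feeds the output; this wiring, and the choice of which of w_0 to w_4 enter it, are assumptions, so adapt them to the actual topology of Figure 1.

```python
import numpy as np

BETA, ETA = 0.5, 1.0          # gain and learning rate from the task

def g(h):
    """Sigmoid with gain beta: g(h) = 1 / (1 + exp(-2*beta*h))."""
    return 1.0 / (1.0 + np.exp(-2.0 * BETA * h))

def g_prime(h):
    """Derivative of the sigmoid above: 2*beta*g(h)*(1 - g(h))."""
    return 2.0 * BETA * g(h) * (1.0 - g(h))

# Hypothetical wiring (an assumption; check against Figure 1):
#   hidden net input  h_h = w1*x1 + w2*x2, hidden output  O_h = g(h_h)
#   output net input  h_o = w4*O_h,        network output O_o = g(h_o)
w1, w2, w4 = 0.1, 0.0, 0.2

def backprop_cycle(x1, x2, target):
    global w1, w2, w4
    # --- forward pass ---
    h_h = w1 * x1 + w2 * x2
    O_h = g(h_h)
    h_o = w4 * O_h
    O_o = g(h_o)
    # --- backward pass (delta rule) ---
    delta_o = g_prime(h_o) * (target - O_o)   # output-layer delta
    delta_h = g_prime(h_h) * w4 * delta_o     # hidden-layer delta
    # --- weight updates ---
    w4 += ETA * delta_o * O_h
    w1 += ETA * delta_h * x1
    w2 += ETA * delta_h * x2
    return O_o

# OR function: both requested input patterns have target 1.
backprop_cycle(0, 1, 1)
backprop_cycle(1, 0, 1)
```

The intermediate quantities (net inputs, activations, deltas, weight changes) are exactly the values the task asks you to write down for the patterns (0,1) and (1,0).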
Question 2 (3 points)

Consider an MLP with two inputs, one output and one hidden layer. Use back-propagation as the learning algorithm and sigmoid activation functions in the nodes. Test the learning performance of networks with 1, 3 and 10 nodes in the hidden layer on the AND and the XOR function (create the pattern sets as well; a sketch of the pattern sets follows after Question 3), for the learning rates and momentum terms according to the table below. Solve this question with the Java NN-Simulator (see our class website). Note that the weights have to be reset before each new test run. Stop training if the total error is smaller than 0.1 or after 1000 epochs. Write down the number of steps and the total error for each combination. The initial weights are randomly distributed, so we recommend running the simulation more than once to avoid misleading results.

Hidden   Learning   Momentum   Steps   Total error   Steps   Total error
units    rate       term       AND     AND           XOR     XOR
1        0.1        0.0        1000    0.25
1        0.1        0.9
1        0.8        0.0
1        0.8        0.9
3        0.1        0.0
3        0.1        0.9                              380     0.1
3        1.0        0.0
3        1.0        0.9
10       0.1        0.0
10       0.1        0.9
10       0.8        0.0
10       0.8        0.9

Question 3 (a–d: 1 point each)

Now think of an MLP with 8 input neurons, n hidden neurons and 8 output neurons. The network should map inputs to identical outputs (i.e. input patterns and desired output patterns are the same; this is called self-supervised learning). The input patterns are defined to contain only one "1"; all other entries are "0". What is the minimum number of hidden neurons required to solve and to learn the task?

a) Solve the question by thinking. What is the minimal number of hidden neurons needed to solve the task? Why?

b) Use the Java Simulator to verify a) by trying to learn the task, and describe the results. How many hidden neurons do you need?

c) How does the learning speed change with a larger number of hidden neurons?
- It increases.
- It decreases.
- The number of hidden neurons has no influence on the learning speed.

d) What could be an application for this kind of network?
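For Question 2, the pattern sets are just the truth tables of the two Boolean functions. A minimal sketch, written here as Python lists of ((x1, x2), target) pairs; the input format expected by the Java NN-Simulator may differ.

```python
# Each pattern is ((x1, x2), target); targets follow the truth tables.
AND_PATTERNS = [
    ((0, 0), 0),
    ((0, 1), 0),
    ((1, 0), 0),
    ((1, 1), 1),
]

XOR_PATTERNS = [
    ((0, 0), 0),
    ((0, 1), 1),
    ((1, 0), 1),
    ((1, 1), 0),
]
```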
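For Question 3, the pattern set consists of the 8 one-hot vectors mapped onto themselves. A minimal sketch of how this self-supervised pattern set can be generated, assuming numpy; training the 8-n-8 network itself then proceeds with back-propagation as in Question 2.

```python
import numpy as np

N = 8  # number of input and output neurons

# One-hot input patterns: the rows of the identity matrix,
# e.g. (1, 0, 0, 0, 0, 0, 0, 0).
inputs = np.eye(N)
# Self-supervised: the desired outputs are identical to the inputs.
targets = inputs.copy()

for x, t in zip(inputs, targets):
    print(x, "->", t)
```

The hidden layer acts as a bottleneck through which all 8 patterns must be encoded and decoded; that observation is a good starting point for part a).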
Question 4 (2 points)

What is the effect of a momentum term on the back-propagation algorithm? What are its advantages? Disadvantages? Use the 1-d error landscape shown below in your explanation (i.e. indicate the gradient descent with and without momentum).

[Figure: 1-d error landscape for Question 4 (not reproduced here).]
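For reference, a minimal sketch of the momentum update rule as it is commonly written, $\Delta w(t) = -\eta\, \partial E/\partial w + \alpha\, \Delta w(t-1)$; the symbol $\alpha$ for the momentum term, the example values, and the demo error function below are illustrative assumptions, not prescribed by the task.

```python
ETA, ALPHA = 0.1, 0.9   # learning rate and momentum term (example values)

def momentum_step(w, dw_prev, grad_E):
    """One gradient-descent step with momentum.

    grad_E(w) is assumed to return dE/dw at w.
    """
    dw = -ETA * grad_E(w) + ALPHA * dw_prev   # current step + memory of last step
    return w + dw, dw

# Demo on a 1-d quadratic error E(w) = w**2, where dE/dw = 2*w.
w, dw = 5.0, 0.0
for _ in range(100):
    w, dw = momentum_step(w, dw, lambda w: 2.0 * w)
print(w)  # oscillates, then settles near the minimum at w = 0
```

With ALPHA = 0 this reduces to plain gradient descent; comparing the two trajectories on the 1-d error landscape is exactly what the question asks for.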
Question 5 (a–f: 1 point each)

The back-propagation algorithm is a so-called gradient descent method. To compute the gradient you need the derivative g'(h) of the activation function g(h).

a) Compute the derivative of the following activation function (sigmoid function):

   $g(h) = \frac{1}{1 + e^{-2\beta h}}$

b) Prove that the derivative can be expressed as:

   $g'(h) = 2\beta\, g(h)\,(1 - g(h))$

c) Compute the derivative of the following activation function (hyperbolic tangent):

   $g(h) = \tanh(\beta h)$

d) Show that the derivative can be expressed as:

   $g'(h) = \beta\,(1 - g^2(h))$

e) What is the relation between these two activation functions? (Look at the graphs of the two functions.)

f) Why are these especially suited for the back-propagation algorithm?

Hints:

   $\tanh(\beta h) = \frac{e^{2\beta h} - 1}{e^{2\beta h} + 1}$

   $\frac{d}{dh}\, e^{ah} = \left[e^{ah}\right]' = a\, e^{ah}$

   $\frac{d}{dh}\, \frac{1}{f(h)} = \left[\frac{1}{f(h)}\right]' = -\frac{f'(h)}{f^2(h)}$

   $\frac{d}{dh}\, f(g(h)) = \left[f(g(h))\right]' = f'(g(h))\, g'(h)$

   $\frac{d}{dh}\, \frac{f(h)}{g(h)} = \left[\frac{f(h)}{g(h)}\right]' = \frac{f'(h)\, g(h) - f(h)\, g'(h)}{g^2(h)}$

Question 6 (3 points)

Look at the back-propagation algorithm with the sigmoid function. What happens to the derivative when the activation in a node becomes 0 or 1 (check the result from Question 5a)? What happens to the weight changes $\Delta w_{ij}$? Where do you see problems, and how would you solve them?
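As a numerical sanity check for Questions 5a/5b and Question 6, the sketch below compares the closed-form derivative $2\beta\, g(h)(1 - g(h))$ with a finite-difference estimate and evaluates both where the activation is near 0 and near 1. This is an illustration, not a substitute for the analytical proof.

```python
import numpy as np

BETA = 0.5

def g(h):
    # Sigmoid with gain beta, as in Questions 1 and 5a.
    return 1.0 / (1.0 + np.exp(-2.0 * BETA * h))

def g_prime_closed(h):
    # Closed form from Question 5b: 2*beta*g(h)*(1 - g(h)).
    return 2.0 * BETA * g(h) * (1.0 - g(h))

def g_prime_numeric(h, eps=1e-6):
    # Central finite-difference estimate of g'(h).
    return (g(h + eps) - g(h - eps)) / (2.0 * eps)

# At h = +-10 the activation is close to 1 or 0, respectively.
for h in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"h={h:6.1f}  g={g(h):.6f}  "
          f"g'(closed)={g_prime_closed(h):.6f}  "
          f"g'(numeric)={g_prime_numeric(h):.6f}")
```

Note what the printed derivative does when g(h) is close to 0 or 1; that observation is the starting point for Question 6.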
Question 7 (a: 1 point, b: 2 points)

Run the Cascade Correlation applet, which is on the website. Be sure to uncheck the option for running the standard back-propagation algorithm on the same problem. Clicking "next" will display a pull-down menu for selecting one of the pre-defined problems. The initial weights are randomly distributed, so we recommend running the simulation more than once to avoid misleading results. Load the parity (4 bits) problem in the cascade correlation simulator with the default parameter settings. Set the score threshold parameter to 0.2.

a) Try the patience parameter set to 2, 5, and then 10. What is the effect of the patience parameter on the error curve?

b) Restart the simulator and compare cascade correlation with back-propagation using different parameters (e.g. different problem sets, different learning rates, etc.). What can you observe?

Question 8 (2 points)

The recruiting and training of hidden units occurs over several phases, known as output phases. Learning continues in this fashion until the network error is reduced to the extent that all output units have activations within a certain range of their targets on all training patterns. This range is a parameter called:

- Score threshold
- Activation threshold
- Learning rate
- Patience

Question 9 (a: 2 points, b: 2 points)

Load the continuous XOR problem, using the same parameter settings as in Question 7.

a) How many hidden (recruited) neurons are required to reduce the error to less than 0.08?

- 4
- 6
- 25
- 5

b) For what value of the score threshold are 5 or more hidden nodes recruited?

- 0.1
- 0.2
- 0.3
- 0.4