CSE 190 Fall 2015 Midterm
November 18, 2015

DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can't even say "see ya at Porter's!" (Especially now that UCSD, in their infinite wisdom, kicked them off campus...what were they thinking???)

There are 5 problems: make sure you have all of them - AFTER YOU ARE TOLD TO START! Read each question carefully. Remain calm at all times!

Problem | Type                         | Points | Score
1       | True/False                   | 15     |
2       | Short Answer                 | 20     |
3       | Multiple Choice              | 10     |
4       | The Delta Rule               | 10     |
5       | Forward/Backward Propagation | 15     |
Total   |                              | 70     |
Problem 1: True/False (15 pts: +1 for correct, -0.5 for incorrect, 0 for no answer)

If you would like to justify an answer, feel free.

1. Similar to learning in neural networks with the backpropagation procedure, the perceptron learning algorithm also ensures that the network output will move nearer to the target at each iteration.

2. Following the perceptron learning algorithm, a perceptron is guaranteed to perfectly learn a given linearly separable data set within a finite number of training steps.

3. The sigmoid function, y = g(x) = 1/(1 + e^(-w^T x)), may be simply interpreted as the probability of the input, x, given the output, y.

4. It is best to have as many hidden units as there are patterns to be learned by a multilayer neural network.

5. Robbie Jacobs's adaptive learning rate method resulted in a different learning rate for every weight in the network.

6. The backpropagation procedure is a powerful optimization technique that can be applied to hidden activation functions like sigmoid, tanh, and binary threshold.

7. Stochastic gradient descent will typically provide a more accurate estimate of the gradient of a loss function than the full gradient calculated over all examples - that is why this method is generally preferred.

8. Overfitting occurs when the model learns the regularities present only in the training data; in other words, the model fits the sampling error of the training set.

9. In backpropagation learning, we should start with a small learning rate and slowly increase it during the learning process.

10. People use the Rectified Linear Unit (ReLU) as an activation function in deep networks because 1) it works; and 2) it makes computing the slope trivial.

11. While implementing backpropagation, it is a mistake to compute the deltas for a layer, change the weights, and then propagate the deltas back to the next layer.

12. Unfortunately, minibatch learning is difficult to parallelize.

13. In a deep neural network, while the error surface may be very complicated and nonconvex, locally it may be well-approximated by a quadratic surface.

14. A convolutional neural network learns features with shared weights (filters) in order to reduce the number of free parameters.

15. One of the biggest puzzles in machine learning is who hid the hidden layers, and why. Wherever they are, they are probably buried deep, very deep. Some suspect Wally did it.
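As a quick illustration of the ReLU statement above, a minimal sketch (the function names are my own, not the exam's) of why the ReLU slope is trivial to compute:

```python
# ReLU and its derivative: the slope is 1 for positive inputs, 0 otherwise.
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # The subgradient at exactly 0 is conventionally taken as 0 here.
    return 1.0 if x > 0 else 0.0

assert relu(3.0) == 3.0 and relu(-2.0) == 0.0
assert relu_grad(3.0) == 1.0 and relu_grad(-2.0) == 0.0
```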
Problem 2: Short answer (20 pts) Only a very brief explanation is necessary!

a) (2 pts) Explain why dropout in a neural network acts as a regularizer.

b) (2 pts) Explain why backpropagation of the deltas is a linear operation.

c) (3 pts) Describe two distinct advantages of stochastic gradient descent over the batch method.

d) (2 pts) Fill in the value for w in this example of gradient descent in E(w). Calculate the weight for Iteration 2 of gradient descent where the step size is η = 1.0 and the momentum coefficient is 0.5. Assume the momentum is initialized to 0.0.

Iteration | w   | ∂E/∂w
0         | 1.0 | 1.0
1         | 2.0 | 0.5
2         |     | 0.25

e) (2 pts) Explain why we should use weight decay when training a neural network.

f) (3 pts) A graduate student is competing in the ImageNet Challenge with 1000 classes; however, he is puzzled as to why his network doesn't work. He has two tanh hidden units in the final layer before the 1000-way output, but does not think this is a problem, since he has many units and layers leading up to this point. Explain the error in his thinking.
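The update rule needed for part (d) can be sketched as follows. This is one common form of momentum (the exam may define the signs differently, so this is a sketch of the mechanics, not the answer to part (d)); the numbers below are hypothetical, not the table's:

```python
# Gradient descent with momentum for a single weight:
#   delta <- -eta * grad + mu * delta_prev ;  w <- w + delta
def momentum_step(w, grad, delta_prev, eta=1.0, mu=0.5):
    delta = -eta * grad + mu * delta_prev
    return w + delta, delta

# Hypothetical illustration:
w, delta = 2.0, 0.0
w, delta = momentum_step(w, grad=0.5, delta_prev=delta)   # w = 1.5, delta = -0.5
w, delta = momentum_step(w, grad=0.25, delta_prev=delta)  # w = 1.0, delta = -0.5
```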
g) (4 pts) In the Efficient Backprop paper, preprocessing of the input data is recommended. Illustrate this process by starting with an elongated, oval-shaped cloud of points tilted at about 45 degrees, and showing the effect of the mean cancellation step, the PCA step, and the variance scaling step (so you should end up with 4 pictures from start to finish).

h) (2 pts) What is wrong with using the logistic sigmoid in the hidden layers of a deep network? Give at least two reasons why it should be avoided.
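The three steps in part (g) can be sketched numerically. This is a rough pure-Python illustration for an assumed 2-D point cloud (the data and variable names are mine, not the exam's): mean cancellation, rotation onto the principal axes, then per-component variance scaling.

```python
import math

# A tilted, elongated cloud of 2-D points (hypothetical data).
pts = [(1.0, 1.2), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1), (5.0, 4.8)]
n = len(pts)

# 1) Mean cancellation: subtract the mean from every point.
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
centered = [(x - mx, y - my) for x, y in pts]

# 2) PCA: rotate onto the principal axes of the 2x2 covariance matrix
#    [[a, b], [b, c]]; the first principal axis has angle theta.
a = sum(x * x for x, _ in centered) / n
c = sum(y * y for _, y in centered) / n
b = sum(x * y for x, y in centered) / n
theta = 0.5 * math.atan2(2 * b, a - c)
cos_t, sin_t = math.cos(theta), math.sin(theta)
rotated = [(x * cos_t + y * sin_t, -x * sin_t + y * cos_t) for x, y in centered]

# 3) Variance scaling: divide each component by its standard deviation.
sx = math.sqrt(sum(x * x for x, _ in rotated) / n) or 1.0
sy = math.sqrt(sum(y * y for _, y in rotated) / n) or 1.0
scaled = [(x / sx, y / sy) for x, y in rotated]
```

After these steps the cloud is centered at the origin, axis-aligned, and has unit variance in each direction, which is the shape the Efficient Backprop recommendation aims for.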
Problem 3: Multiple Choice (10 pts, 2 each)

a. Which of the following is the delta rule for the hidden units?
   i.   δ_i = (t_i − y_i)
   ii.  δ_j = Σ_k w_jk δ_k
   iii. δ_j = g′(a_j) Σ_k w_jk δ_k

b. In a convolutional neural network, the image is of dimension 100 × 100 and one of the learned filters is of dimension 10 × 10 with a stride of 5. The resulting feature map of this filter over the image will have dimension:
   i.   21 × 21
   ii.  19 × 19
   iii. 5 × 5
   iv.  20 × 20
   v.   100 × 5 × 5

c. Assume we have an error function E and modify our cost function C by adding an L2 weight penalty, or specifically C = E + (λ/2) Σ_j w_j². The cost function is minimized with respect to w_i when:
   i.   w_i = −(1/λ) ∂E/∂w_i
   ii.  w_i = +∂E/∂w_i
   iii. w_i = −(λ/2) ∂E/∂w_i
   iv.  w_i = (∂E/∂w_i)²
   v.   w_i = 0
which describes how our weight magnitude should vary. HINT: recall that C is minimized when its derivative is 0.

d. The best objective function for classification is:
   i.   Sum Squared Error
   ii.  Cross-Entropy
   iii. Rectified Linear Unit
   iv.  Logistic
   v.   Funny tanh
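The arithmetic behind part (b) can be checked with a one-liner, assuming "valid" convolution with no padding (the standard output-size formula):

```python
# Feature-map size for a square image, square filter, given stride:
#   output = (input - filter) // stride + 1
def feature_map_size(n, k, stride):
    return (n - k) // stride + 1

size = feature_map_size(100, 10, 5)   # (100 - 10) // 5 + 1 = 19
```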
e. Suppose we have a 3-dimensional input x = (x_1, x_2, x_3) connected to 4 neurons with the exact same weights w = (w_1, w_2, w_3), where: x_1 = 2, w_1 = 1, x_2 = 1, w_2 = 0.5, x_3 = 1, w_3 = 0, and the bias b = 0.5. We calculate the output of each of the four neurons using the input x, weights w, and bias b. If y_1 = 0.95, y_2 = 3, y_3 = 1, y_4 = 3, then a valid guess for the neuron types of y_1, y_2, y_3, and y_4 is:
   i.   Rectified Linear, Logistic Sigmoid, Binary Threshold, Linear
   ii.  Linear, Binary Threshold, Logistic Sigmoid, Rectified Linear
   iii. Logistic Sigmoid, Linear, Binary Threshold, Rectified Linear
   iv.  Rectified Linear, Linear, Binary Threshold, Logistic Sigmoid
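Since all four neurons in part (e) share the same weights and bias, they all see the same net input, and each activation function can be evaluated directly. A sketch of that computation (variable names are mine):

```python
import math

# Net input is the same for all four neurons: a = w . x + b
x = (2.0, 1.0, 1.0)
w = (1.0, 0.5, 0.0)
b = 0.5
a = sum(wi * xi for wi, xi in zip(w, x)) + b   # 2.0 + 0.5 + 0.0 + 0.5 = 3.0

logistic  = 1.0 / (1.0 + math.exp(-a))   # sigmoid(3.0), roughly 0.95
linear    = a                            # 3.0
threshold = 1.0 if a > 0 else 0.0        # 1.0
relu      = max(0.0, a)                  # 3.0
```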
Problem 4: The delta rule (10 pts)

Derive the delta rule for the case of a single-layer network with a linear output and the sum squared error loss function. To make this as simple as possible, assume we are doing this for one input-output pattern p (then we can simply add these up over all of the patterns). So, starting with:

SSE^p = (1/2)(t^p − y^p)²    (1)

and

y^p = Σ_{j=1}^{d} w_j x_j    (2)

derive that:

∂SSE^p/∂w_i = −(t^p − y^p) x_i    (3)
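For reference, the chain-rule steps connecting equations (1) and (2) to equation (3) can be written out as:

```latex
\frac{\partial\,\mathrm{SSE}^p}{\partial w_i}
  = \frac{1}{2}\cdot 2\,(t^p - y^p)\,\frac{\partial (t^p - y^p)}{\partial w_i}
  = -(t^p - y^p)\,\frac{\partial y^p}{\partial w_i}
  = -(t^p - y^p)\,x_i
```

where ∂y^p/∂w_i = x_i follows from equation (2), because only the i-th term of the sum depends on w_i, and the target t^p does not depend on the weights.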
Problem 5: Forward/Backward Propagation (15 pts)

Consider the simple neural network in Figure 1 with the corresponding initial weights and biases in Figure 2. Weights are indicated as numbers along connections and biases are indicated as numbers within a node. All units use the sigmoid activation function g(a) = f(a) = 1/(1 + e^(−a)) and the cost function is the Cross-Entropy Loss.

On the following page, fill in the three panels.

1. (4 pts) In the first panel, record the a_i's in each of the nodes.
2. (3 pts) In the second panel, record z_i = g(a_i) for each of the nodes. You may use the table of approximate sigmoid activation values on the next page.
3. (5 pts) In the third panel, compute the δ for each node. Do this for training example X = (1.0, 1.0) with target t = 0.85.

Update the weights. (3 pts) Given the δ's you computed, use gradient descent to calculate the new weight from hidden unit 1 (H1) to the output (OUT) (currently 1.0). Use gradient descent with no momentum and learning rate η = 1.0.

Figure 1. Figure 2.
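The shape of the computation Problem 5 asks for can be sketched generically. The hidden weights below are hypothetical (Figures 1 and 2 are not reproduced here); only the input X = (1.0, 1.0), the target t = 0.85, the H1-to-OUT weight of 1.0, and η = 1.0 come from the problem. With a sigmoid output and cross-entropy loss, the output delta simplifies to y − t:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical tiny network: two inputs -> one hidden unit -> one output.
x = (1.0, 1.0)
w_h, b_h = (0.5, -0.5), 0.0    # hypothetical hidden weights / bias
w_out, b_out = 1.0, 0.0        # H1 -> OUT weight from the problem; bias assumed 0
t = 0.85                       # target from the problem statement

# Forward pass: net inputs a, activations z = g(a).
a_h = w_h[0] * x[0] + w_h[1] * x[1] + b_h
z_h = sigmoid(a_h)
a_out = w_out * z_h + b_out
y = sigmoid(a_out)

# Backward pass: sigmoid + cross-entropy gives delta_out = y - t,
# and the hidden delta folds in the sigmoid slope z(1 - z).
delta_out = y - t
delta_h = z_h * (1.0 - z_h) * w_out * delta_out

# Gradient-descent update of the H1 -> OUT weight, no momentum, eta = 1.0.
eta = 1.0
w_out_new = w_out - eta * delta_out * z_h
```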