CSE 250a. Assignment 4. Out: Tue Oct 26. Due: Tue Nov 2



4.1 Noisy-OR model

[Figure: belief network in which the root nodes X_1, X_2, X_3, ..., X_d are the parents of the node Y.]

For the belief network of binary random variables shown above, consider the noisy-OR conditional probability table (CPT):

$$P(Y=1 \mid X_1=x_1, \ldots, X_n=x_n) = 1 - \prod_i (1-p_i)^{x_i}.$$

The parameters of this CPT are the conditional probabilities $p_i = P(Y=1 \mid X_i=1, X_j=0 \text{ for all } j \neq i)$. In this question, you will consider how to learn them by gradient ascent.

Consider a data set of i.i.d. examples $\{\vec{x}_t, y_t\}_{t=1}^T$, where $\vec{x}_t = (x_{1t}, x_{2t}, \ldots, x_{nt})$ denotes the observed vector of values from the $t$th example for the root nodes in the network. Also, as shorthand, let $q_t = P(Y=1 \mid \vec{X}=\vec{x}_t)$. Show that the gradient of the conditional log-likelihood $L = \sum_t \log P(y_t \mid \vec{x}_t)$ is given by:

$$\frac{\partial L}{\partial p_i} = \sum_{t=1}^T \left(\frac{x_{it}}{1-p_i}\right)\left(\frac{y_t - q_t}{q_t}\right).$$

Intuitively, this result shows that the differences between observed values $y_t$ and predictions $q_t$ appear as error signals for learning.
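A numerical sanity check can complement the derivation asked for in 4.1. The sketch below is not part of the assignment: it evaluates the stated gradient formula on a small, made-up data set and compares it with a central finite difference of the log-likelihood. The function names, the toy parameters `toy_p`, and the toy examples `toy_data` are all hypothetical.

```python
import math

def noisy_or_prob(p, x):
    """Return q = P(Y=1 | x) = 1 - prod_i (1 - p_i)^{x_i} for a binary vector x."""
    prod = 1.0
    for p_i, x_i in zip(p, x):
        if x_i == 1:
            prod *= 1.0 - p_i
    return 1.0 - prod

def noisy_or_log_likelihood(p, data):
    """Conditional log-likelihood L = sum_t log P(y_t | x_t)."""
    total = 0.0
    for x, y in data:
        q = noisy_or_prob(p, x)
        total += math.log(q) if y == 1 else math.log(1.0 - q)
    return total

def noisy_or_gradient(p, data):
    """Gradient from the stated formula: sum_t x_it / (1 - p_i) * (y_t - q_t) / q_t."""
    grad = [0.0] * len(p)
    for x, y in data:
        q = noisy_or_prob(p, x)
        for i, x_i in enumerate(x):
            if x_i == 1:  # terms with x_it = 0 contribute nothing
                grad[i] += (y - q) / (q * (1.0 - p[i]))
    return grad

def finite_difference(p, data, i, h=1e-6):
    """Central-difference estimate of dL/dp_i."""
    up = list(p)
    up[i] += h
    down = list(p)
    down[i] -= h
    return (noisy_or_log_likelihood(up, data)
            - noisy_or_log_likelihood(down, data)) / (2.0 * h)

# Hypothetical toy problem: three parents, four (x_t, y_t) examples.
toy_p = [0.3, 0.6, 0.8]
toy_data = [((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 1, 0), 1), ((0, 0, 1), 0)]

for i, g in enumerate(noisy_or_gradient(toy_p, toy_data)):
    print(i, g, finite_difference(toy_p, toy_data, i))
```

The toy examples are chosen so that every $q_t$ lies strictly between 0 and 1; an example with all parents off and $y_t = 1$ has zero probability under the model and would make the log-likelihood undefined.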

4.2 Multinomial logistic regression

[Figure: belief network in which the nodes X_1, X_2, X_3, ..., X_d are the parents of the node Y.]

A simple generalization of logistic regression is to predict a discrete (but non-binary) label $Y \in \{1, 2, \ldots, m\}$ from a real-valued vector $\vec{X} \in \mathbb{R}^d$. For the belief network shown above, consider the following parameterized conditional probability table (CPT):

$$P(Y=i \mid \vec{X}=\vec{x}) = \frac{e^{\vec{w}_i \cdot \vec{x}}}{\sum_{j=1}^m e^{\vec{w}_j \cdot \vec{x}}}.$$

The parameters of this CPT are the weight vectors $\vec{w}_i$, which must be learned for each possible label. The denominator normalizes the distribution so that the elements of the CPT sum to one.

Consider a training set of $T$ labeled examples $\{(\vec{x}_t, y_t)\}_{t=1}^T$. As shorthand, let $y_{it} \in \{0, 1\}$ denote the target assignment matrix defined by:

$$y_{it} = \begin{cases} 1 & \text{if } y_t = i, \\ 0 & \text{otherwise.} \end{cases}$$

Also, let $p_{it} \in [0, 1]$ denote the conditional probability that the model classifies the $t$th example by the $i$th possible label:

$$p_{it} = \frac{e^{\vec{w}_i \cdot \vec{x}_t}}{\sum_{j=1}^m e^{\vec{w}_j \cdot \vec{x}_t}}.$$

The weight vectors can be obtained by maximum likelihood estimation using gradient ascent. Show that the gradient of the conditional log-likelihood $L = \sum_t \log P(y_t \mid \vec{x}_t)$ is given by:

$$\frac{\partial L}{\partial \vec{w}_i} = \sum_t (y_{it} - p_{it})\, \vec{x}_t.$$

Again, this result shows that the differences between observed values $y_{it}$ and predictions $p_{it}$ appear as error signals for learning.
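As with the noisy-OR model, the gradient stated in 4.2 can be checked numerically before it is used for learning. The following is a minimal sketch on made-up data, not a solution to the exercise. It assumes labels are stored 0-based (0, ..., m-1) rather than 1, ..., m, and the names `softmax_probs`, `softmax_gradient`, `toy_W`, `toy_X`, and `toy_labels` are hypothetical.

```python
import math

def softmax_probs(W, x):
    """Return [p_1, ..., p_m], where p_i is proportional to exp(w_i . x)."""
    scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in W]
    top = max(scores)  # subtracting the max leaves the ratios unchanged and avoids overflow
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def softmax_log_likelihood(W, X, labels):
    """Conditional log-likelihood L = sum_t log P(y_t | x_t)."""
    return sum(math.log(softmax_probs(W, x)[y]) for x, y in zip(X, labels))

def softmax_gradient(W, X, labels):
    """Stated gradient: grad[i] = sum_t (y_it - p_it) x_t, one row per label."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, y in zip(X, labels):
        p = softmax_probs(W, x)
        for i in range(len(W)):
            error = (1.0 if y == i else 0.0) - p[i]  # y_it - p_it
            for k, x_k in enumerate(x):
                grad[i][k] += error * x_k
    return grad

# Hypothetical toy problem: m = 3 labels, d = 2 inputs, T = 3 examples.
toy_W = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.2]]
toy_X = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]
toy_labels = [0, 2, 1]  # assumption: labels are 0-based in this sketch

for row in softmax_gradient(toy_W, toy_X, toy_labels):
    print(row)
```

Because $\sum_i (y_{it} - p_{it}) = 0$ for every example, the rows of the gradient sum to the zero vector, which gives a second quick consistency check.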

4.3 Convergence of gradient descent

One way to gain intuition for gradient descent is to analyze its behavior in simple settings. For a one-dimensional function $f(x)$ over the real line, gradient descent takes the form:

$$x_{n+1} = x_n - \eta f'(x_n).$$

(a) Consider minimizing the function $f(x) = \frac{\alpha}{2}(x - x_*)^2$ by gradient descent, where $\alpha > 0$. Derive an expression for the error $\varepsilon_n = x_n - x_*$ at the $n$th iteration in terms of the initial error $\varepsilon_0$ and the step size $\eta > 0$.

(b) For what values of the step size $\eta$ does the update rule converge to the minimum at $x_*$? What step size leads to the fastest convergence, and how is it related to $f''(x_n)$?

In practice, the gradient descent learning rule is often modified to dampen oscillations at the end of the learning procedure. A common variant of gradient descent involves adding a so-called momentum term:

$$\vec{x}_{n+1} = \vec{x}_n - \eta \nabla f + \beta(\vec{x}_n - \vec{x}_{n-1}),$$

where $\beta > 0$. Intuitively, the name arises because the optimization continues of its own momentum (stepping in the same direction as its previous update) even when the gradient vanishes. In one dimension, this learning rule simplifies to:

$$x_{n+1} = x_n - \eta f'(x_n) + \beta(x_n - x_{n-1}).$$

(c) Consider minimizing the quadratic function in part (a) by gradient descent with a momentum term. Again, let $\varepsilon_n = x_n - x_*$ denote the error at the $n$th iteration. Show that the error in this case satisfies the recursion relation:

$$\varepsilon_{n+1} = (1 - \alpha\eta + \beta)\,\varepsilon_n - \beta\,\varepsilon_{n-1}.$$

(d) Suppose that the second derivative $\alpha = f''(x_*)$ is given by $\alpha = 1$, the learning rate by $\eta = \frac{4}{9}$, and the momentum parameter by $\beta = \frac{1}{9}$. Show that one solution to the recursion in part (c) is given by:

$$\varepsilon_n = c^n \varepsilon_0,$$

where $\varepsilon_0$ is the initial error and $c$ is a numerical constant to be determined. (Other solutions are also possible, depending on the way that the momentum term is defined at time $t = 0$; do not concern yourself with this.) How does this rate of convergence compare to that of gradient descent with the same learning rate ($\eta = \frac{4}{9}$) but no momentum parameter ($\beta = 0$)?
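The update rules in 4.3 are easy to simulate, which can help build intuition before working through the algebra. The sketch below runs the one-dimensional momentum update on the quadratic from part (a) and records the errors. It assumes $x_{-1} = x_0$, so that the first momentum term is zero; this is one possible convention for the initial step, not something the assignment specifies. The function name and the values `x_star=0.0` and `x0=1.0` are hypothetical.

```python
def run_gradient_descent(alpha, eta, beta, x_star, x0, steps):
    """Return the errors eps_n = x_n - x_star for n = 0, ..., steps.

    Minimizes f(x) = (alpha / 2) * (x - x_star)**2 with step size eta and
    momentum parameter beta (beta = 0 gives plain gradient descent).
    """
    x_prev = x0  # assumption: x_{-1} = x_0, so the first momentum term vanishes
    x = x0
    errors = [x - x_star]
    for _ in range(steps):
        grad = alpha * (x - x_star)  # f'(x) for the quadratic
        x_next = x - eta * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
        errors.append(x - x_star)
    return errors

# Parameter values from part (d); starting point and minimum are made up.
with_momentum = run_gradient_descent(1.0, 4.0 / 9.0, 1.0 / 9.0, x_star=0.0, x0=1.0, steps=10)
no_momentum = run_gradient_descent(1.0, 4.0 / 9.0, 0.0, x_star=0.0, x0=1.0, steps=10)

for n, (e_m, e_p) in enumerate(zip(with_momentum, no_momentum)):
    print(n, e_m, e_p)
```

Printing successive ratios `errors[n + 1] / errors[n]` for each run is a simple way to compare the two rates of convergence empirically.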

4.4 Newton's method

One way to gain intuition for Newton's method is to analyze its behavior in simple settings. For a twice-differentiable function $f(x)$ over the real line, Newton's method takes the form:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.$$

(a) Consider the function $f(x) = x_* \log(x_*/x) - x_* + x$, where $x_* > 0$. Show that the minimum occurs at $x = x_*$, and sketch the function in the region $|x - x_*| < x_*$.

(b) Consider minimizing the function in part (a) by Newton's method. Derive an expression for the relative error $r_n = (x_n - x_*)/x_*$ at the $n$th iteration in terms of the initial relative error $r_0$. Note the rapid convergence (which is typical of Newton's method). For what range of initial values (for $x_0$) does Newton's method converge to the correct answer?

(c) Consider the polynomial function $f(x) = (x - x_*)^{2k}$ for positive integers $k$, whose minimum occurs at $x = x_*$. Suppose that Newton's method is used to minimize this function, starting from some initial estimate $x_0$. Derive an expression for the error $\varepsilon_n = x_n - x_*$ at the $n$th iteration in terms of the initial error $\varepsilon_0$.

(d) For the function in part (c), how many iterations of Newton's method are required to reduce the initial error by a constant factor $\delta < 1$, such that $|\varepsilon_n| \le \delta\,|\varepsilon_0|$? Starting from your previous answer, show that $n \ge (2k-1)\log(1/\delta)$ iterations are sufficient. (Hint: use the inequality that $\log z \le z - 1$ for $z > 0$.)
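The Newton iteration in 4.4 can be tried out directly on both functions. The sketch below is an illustration under assumed values, not a derivation: it takes $x_* = 2$, supplies the first and second derivatives by hand (for part (a), $f'(x) = 1 - x_*/x$ and $f''(x) = x_*/x^2$; for part (c) with $k = 2$, $f'(x) = 4(x - x_*)^3$ and $f''(x) = 12(x - x_*)^2$), and prints the iterates. The function name, `X_STAR`, and the starting points are hypothetical.

```python
def newton_minimize(f_prime, f_double_prime, x0, steps):
    """Return the iterates x_0, ..., x_steps of x <- x - f'(x) / f''(x)."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - f_prime(x) / f_double_prime(x)
        xs.append(x)
    return xs

X_STAR = 2.0  # assumed location of the minimum for both examples

# Part (a): f(x) = x_* log(x_* / x) - x_* + x
log_iterates = newton_minimize(
    lambda x: 1.0 - X_STAR / x,
    lambda x: X_STAR / x ** 2,
    x0=1.0,
    steps=6,
)

# Part (c) with k = 2: f(x) = (x - x_*)^4
poly_iterates = newton_minimize(
    lambda x: 4.0 * (x - X_STAR) ** 3,
    lambda x: 12.0 * (x - X_STAR) ** 2,
    x0=3.0,
    steps=6,
)

for n, (a, b) in enumerate(zip(log_iterates, poly_iterates)):
    print(n, (a - X_STAR) / X_STAR, b - X_STAR)
```

The two columns behave very differently: the relative error for the function in part (a) shrinks much faster per step than the error for the quartic, which is the contrast that parts (b) through (d) ask you to quantify.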

4.5 Stock market prediction

In this problem, you will apply a simple linear model to predicting the stock market. From the course web site, download the files nasdaq00.txt and nasdaq01.txt, which contain the NASDAQ indices at the close of business days in 2000 and 2001.

[Figure: NASDAQ index (price, 1K to 5K) versus year, with the TRAIN and TEST periods marked.]

(a) How accurately can the index on one day be predicted by a linear combination of the four preceding indices? Using only data from the year 2000, compute the linear coefficients $(w_1, w_2, w_3, w_4)$ that maximize the conditional log probability $L = \sum_t \log P(x_t \mid x_{t-1}, x_{t-2}, x_{t-3}, x_{t-4})$, where:

$$P(x_t \mid x_{t-1}, x_{t-2}, x_{t-3}, x_{t-4}) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(x_t - w_1 x_{t-1} - w_2 x_{t-2} - w_3 x_{t-3} - w_4 x_{t-4}\right)^2\right],$$

and the sum is over business days in the year 2000 (starting from the fifth day).

(b) For the coefficients estimated in part (a), compare the model's performance (in terms of mean squared error) on the data from the years 2000 and 2001. Would you recommend this linear model for stock market prediction?

Turn in your source code, your solution for the linear coefficients, and your results for the mean squared prediction errors. You may program in the language of your choice, and you may solve the required system of linear equations either by hand or by using built-in routines (e.g., in Matlab, Maple, Mathematica, etc.).
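Maximizing the Gaussian log probability in 4.5 is equivalent to minimizing the sum of squared prediction errors, which leads to a 4x4 system of linear equations in the coefficients. The sketch below shows one way this could be set up and solved with only the Python standard library. It runs on a synthetic random-walk series, not on the NASDAQ files, since their exact layout is not described here; all function names and `toy_series` are hypothetical.

```python
import random

def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def fit_linear_predictor(series, order=4):
    """Least-squares coefficients w_1..w_order for x_t ~ sum_k w_k x_{t-k}."""
    A = [[0.0] * order for _ in range(order)]
    b = [0.0] * order
    for t in range(order, len(series)):
        past = [series[t - k] for k in range(1, order + 1)]
        for i in range(order):
            b[i] += series[t] * past[i]
            for j in range(order):
                A[i][j] += past[i] * past[j]
    return solve_linear_system(A, b)

def mean_squared_error(series, w):
    """Mean squared prediction error, starting from the first predictable day."""
    order = len(w)
    squared = []
    for t in range(order, len(series)):
        prediction = sum(w[k - 1] * series[t - k] for k in range(1, order + 1))
        squared.append((series[t] - prediction) ** 2)
    return sum(squared) / len(squared)

# Synthetic stand-in for an index series (NOT real NASDAQ data).
rng = random.Random(0)
toy_series = [100.0]
for _ in range(199):
    toy_series.append(toy_series[-1] + rng.gauss(0.0, 1.0))

toy_w = fit_linear_predictor(toy_series)
print(toy_w, mean_squared_error(toy_series, toy_w))
```

To use the same functions for parts (a) and (b), one would fit on the 2000 series and then call `mean_squared_error` on both the 2000 and 2001 series with the fitted coefficients; reading the files is left out because it depends on their format.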

4.6 Handwritten digit classification

In this problem, you will use logistic regression to classify images of handwritten digits. From the course web site, download the files digits3a.txt, digits3b.txt, digits5a.txt, and digits5b.txt. These files contain data for binary images of handwritten digits. Each image is an 8x8 bitmap represented in the files by one line of text. Some of the examples are shown in the following figure.

(a) Perform a logistic regression (using gradient ascent or Newton's method) on the images in files digits3a.txt and digits5a.txt. Indicate clearly the algorithm used, and provide evidence that it has converged (or nearly converged) by plotting or printing out the log-likelihood on several iterations of the algorithm, as well as the percent error rate on the images in these files. Also, print out the 64 elements of your solution for the weight vector as an 8x8 matrix.

(b) Use the model learned in part (a) to label the images in the files digits3b.txt and digits5b.txt. Report your percent error rate on these images.

Again, turn in your source code. You may program in the language of your choice.
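The training loop for part (a) of 4.6 could be organized along the following lines. This is a minimal sketch using batch gradient ascent with a fixed step size, run on a tiny made-up data set with three binary features instead of the 64-pixel digit images; it does not read the assignment files. It assumes one class is encoded as label 1 and the other as label 0 and uses no bias term; the function names, the step size `eta=0.1`, and `toy_X`/`toy_y` are hypothetical choices.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(w_k * x_k for w_k, x_k in zip(w, x))

def logistic_log_likelihood(w, X, y):
    # log sigma(z) = -softplus(-z) and log(1 - sigma(z)) = -softplus(z)
    return sum(-softplus(-dot(w, x)) if label == 1 else -softplus(dot(w, x))
               for x, label in zip(X, y))

def error_rate(w, X, y):
    """Percent of examples whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for x, label in zip(X, y)
                if (sigmoid(dot(w, x)) >= 0.5) != (label == 1))
    return 100.0 * wrong / len(y)

def gradient_ascent(X, y, eta=0.1, iterations=200):
    """Batch gradient ascent on the log-likelihood, starting from w = 0."""
    w = [0.0] * len(X[0])
    history = [logistic_log_likelihood(w, X, y)]
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = label - sigmoid(dot(w, x))  # observed value minus prediction
            for k, x_k in enumerate(x):
                grad[k] += err * x_k
        w = [w_k + eta * g_k for w_k, g_k in zip(w, grad)]
        history.append(logistic_log_likelihood(w, X, y))
    return w, history

# Hypothetical toy data: three binary "pixels" per example, labels in {0, 1}.
toy_X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
toy_y = [1, 1, 0, 0]

toy_w, toy_history = gradient_ascent(toy_X, toy_y)
print(toy_history[0], toy_history[-1], error_rate(toy_w, toy_X, toy_y))
```

Recording the log-likelihood at every iteration, as `history` does here, provides the kind of convergence evidence that part (a) asks for. On real data the step size usually needs tuning, and a step that is too large can make the log-likelihood decrease between iterations.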
