Stat751 / CSI771 Midterm October 15, 2015 Solutions, Comments


1. (13 pts) Consider the beta distribution with PDF

   f(x) = Γ(α+β)/(Γ(α)Γ(β)) x^(α−1) (1−x)^(β−1),  0 ≤ x < 1,
   f(x) = 0 otherwise,

for fixed constants 0 < α, β. Now, assume that you can generate random deviates U_i from a U(0,1) distribution. These are the standard things you can get from a simple random number generator in almost any programming system; in R it is what we get from runif. Describe very carefully how you would generate one random deviate from this beta distribution using an acceptance/rejection method with one or more U_i from U(0,1). You can use a uniform distribution as the majorizing distribution. You can also assume that you can evaluate the gamma function, so just use expressions such as Γ(α+β), Γ(α), and Γ(β).

First, we determine a good distribution to use as the majorizing distribution. Any distribution with finite range that we can generate variates from would work. How good a given distribution is for this purpose would depend on α and β, which are not given in this problem. The problem states that you can just use the uniform distribution. In the absence of knowledge of α and β, that's probably as good as anything. In the common notation that I used in class, this means g(y) = I_[0,1](y).

Now, we need a number c such that

   c ≥ Γ(α+β)/(Γ(α)Γ(β)) x^(α−1) (1−x)^(β−1)  for 0 ≤ x ≤ 1.

Finding such a c is not hard; since x^(α−1)(1−x)^(β−1) ≤ 1 (assuming α, β ≥ 1, so that the density is bounded), one such c is just Γ(α+β)/(Γ(α)Γ(β)). Of course, we want the smallest such c. Even this is not hard. In grading your work, I only looked for your c, however you chose it or defined it.

Once we've got the majorizing function and the c, we're ready to go. I subtracted 3 points if you were not explicit in defining your c (though not necessarily its actual value) and in defining your g(y); there really is not much to g(y) in this problem, since it is just the uniform density.

Now, the steps are straightforward:

1. Generate u_1 and u_2 from U(0,1).
2. If u_2 ≤ f(u_1)/c, then accept u_1 as the desired variate; otherwise, go back to 1.
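As a concrete illustration, here is a minimal R sketch of these steps. The function name rbeta_ar is mine, not part of the problem, and the sketch assumes α, β ≥ 1 so that the uniform majorization above is valid:

rbeta_ar <- function(alpha, beta) {
  # c = Gamma(alpha+beta)/(Gamma(alpha)Gamma(beta)) bounds f on [0,1]
  cc <- gamma(alpha + beta)/(gamma(alpha)*gamma(beta))
  repeat {
    u1 <- runif(1)                                 # step 1: candidate from U(0,1)
    u2 <- runif(1)                                 # step 1: accept/reject uniform
    f  <- cc*u1^(alpha - 1)*(1 - u1)^(beta - 1)    # beta PDF at u1
    if (u2 <= f/cc) return(u1)                     # step 2: accept, or repeat
  }
}

With this (non-optimal) c, the expected number of (u_1, u_2) pairs per accepted deviate is c; the smallest valid c, the maximum of the PDF, would accept more often.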

2. (13 pts) Show that the least-squares estimator for β in the linear model y ≈ Xβ, where y is an observed n-vector and X is a corresponding n × m matrix of observations, is given by β̂ = X⁺y, where X⁺ is the Moore-Penrose inverse of X.

I like this formula! It looks like the solution to a full-rank, consistent system:

   c = Az  ⟹  z = A⁻¹c,   and   y ≈ Xβ  ⟹  β̂ = X⁺y.

Although the problem did not specify that X be of full column rank (and the result is true even if X is not of full rank), the steps are simpler if we assume that, and in the following I will make that assumption. No one had any problems with the rank of X, one way or the other. Also, in the following, let n and m represent the dimensions; that is, assume that X is n × m and the other matrices and vectors are of the implied sizes.

The least-squares estimator for β is the value β̂ that minimizes the expression

   (y − Xβ)ᵀ(y − Xβ),

which is the residual sum of squares. The optimal value for β can be obtained in different ways. One way is by using calculus to show that a minimizer β̂ must satisfy the condition XᵀXβ̂ = Xᵀy. Another way is by expanding the expression for the residual sum of squares and showing that if Xᵀ(y − Xβ̂) = 0, then β̂ must be the optimal solution. Following this approach, we take a candidate solution (the one we want to prove) and show that the residuals are orthogonal to the columns of X.

With all of that as a preface, I will give three different proofs that the optimal value can be expressed as β̂ = X⁺y.

(a) We obtain an expression for β̂ = Ay using calculus, and then show that A = X⁺. There are two ways to do this:
    i. Show that A satisfies the four properties that define X⁺.
    ii. Use the QR decomposition of X to show that A = X⁺.
(b) Take β̂ = X⁺y and show that it satisfies a necessary and sufficient condition for it to be the minimizer. The condition is the orthogonality of the residuals to the columns of X.

**************** Now let's do it each way ****************

(a) First, by taking the first and second derivatives with respect to β, we find that a minimizer β̂ must satisfy the condition

   XᵀXβ̂ = Xᵀy,

which yields the solution

   β̂ = (XᵀX)⁻¹Xᵀy;

that is, A = (XᵀX)⁻¹Xᵀ.

i. The four properties uniquely determine X⁺:
   A. XX⁺X = X
   B. X⁺XX⁺ = X⁺
   C. XX⁺ is symmetric.
   D. X⁺X is symmetric.
So here we go, with A = (XᵀX)⁻¹Xᵀ in the role of X⁺:
   A. X(XᵀX)⁻¹XᵀX = X
   B. (XᵀX)⁻¹XᵀX(XᵀX)⁻¹Xᵀ = (XᵀX)⁻¹Xᵀ
   C. X(XᵀX)⁻¹Xᵀ is symmetric (take its transpose).
   D. (XᵀX)⁻¹XᵀX is symmetric.
Therefore, (XᵀX)⁻¹Xᵀ = X⁺.

ii. Now, we use the QR decomposition of X to show that A = X⁺. This is the way I did it in class. Form the QR decomposition of X, X = QR, where we can write

   R = [ R₁ ]
       [ 0  ],

where R₁ is an m × m upper triangular matrix. The squared residual norm can now be written as

   (y − Xb)ᵀ(y − Xb) = (y − QRb)ᵀ(y − QRb)
                     = (Qᵀy − Rb)ᵀ(Qᵀy − Rb)
                     = (c₁ − R₁b)ᵀ(c₁ − R₁b) + c₂ᵀc₂,

where c₁ is a vector with m elements and c₂ is a vector with n − m elements, such that Qᵀy = (c₁, c₂). Because the squared norm is nonnegative, the minimum of the residual norm occurs when (c₁ − R₁b)ᵀ(c₁ − R₁b) = 0; that is, when c₁ − R₁b = 0, or

   R₁β̂ = c₁.

Because R₁ is triangular, the system is easy to solve:

   β̂ = R₁⁻¹c₁.

Now,

   X⁺ = [ R₁⁻¹  0 ] Qᵀ.

This important expression of the Moore-Penrose inverse was the key to solving this problem in this way. Therefore,

   β̂ = X⁺y.

We also see that the minimum of the residual norm, or the residual sum of squares, is c₂ᵀc₂.

(b) Finally, as another way, we take β̂ = X⁺y as a candidate and show that it satisfies the condition of the orthogonality of the residuals to the columns of X; that is,

   Xᵀ(y − XX⁺y) = 0.

Using the properties of X⁺, we have

   Xᵀ(y − XX⁺y) = Xᵀy − XᵀXX⁺y
                = Xᵀy − Xᵀ(XX⁺)ᵀy        (because of symmetry)
                = Xᵀy − Xᵀ(X⁺)ᵀXᵀy
                = Xᵀy − Xᵀ(Xᵀ)⁺Xᵀy       (property of Moore-Penrose inverses and transposes)
                = Xᵀy − Xᵀy              (property of Moore-Penrose inverses)
                = 0.
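As a quick numerical sanity check of the result (not part of the required proof), the following R sketch compares the QR-based least-squares solution with (XᵀX)⁻¹Xᵀy for an arbitrary full-column-rank X:

set.seed(1)
n <- 10; m <- 3
X <- matrix(rnorm(n*m), n, m)          # random full-column-rank X
y <- rnorm(n)
betaQR <- qr.coef(qr(X), y)            # least squares via QR
Xplus  <- solve(t(X) %*% X) %*% t(X)   # (X^T X)^{-1} X^T, which equals X^+ here
round(betaQR - Xplus %*% y, 10)        # agrees to rounding error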

3. (13 pts) Describe how you would evaluate the integrals below using Monte Carlo. Assume that you have a source of uniform U(0,1) random numbers; that is, you can get a sample x_1, x_2, ..., x_m. Since your result is an estimate, also give a formula for an estimate of the variance of your estimator. (Although I have told you not to use Monte Carlo when you can evaluate something analytically, and these simple integrals could be evaluated analytically, use Monte Carlo anyway.) Give formulas for your estimates and for your estimates of the variance of your estimator.

(a) ∫₀² x² e^(−x/2) dx

You should, of course, use a good PDF decomposition of x² e^(−x/2) so as to be more efficient. The simplest decomposition is just

   2x² e^(−x/2) · (1/2),

where the second factor is just the uniform PDF over [0, 2]. This is the one I'd probably use. Using the uniform, a Monte Carlo estimate of the integral is just

   t = (2/m) Σ_{i=1}^m (2u_i)² e^(−u_i),

where the u_i are iid U(0,1). An estimate of the variance is

   (1/m) Σ_{i=1}^m ( 2(2u_i)² e^(−u_i) − t )² / (m − 1),

where t is the estimate of the integral. Notice also that the problem stated only that you have a source of U(0,1) random numbers, so this PDF requires no real transformation.

Other possibilities would be to use the exponential(2) distribution truncated at 2, or to use the gamma(3,2) distribution, also truncated at 2. In either case, the first question would be how to get random variables from the distribution of interest. The exponential is easy, just using the inverse CDF; but the gamma is rather difficult. Of course, if you assume that you have R, you could use qgamma(u, 3, 2). The truncation involved with each of the latter distributions would likely make them less efficient. Use of the exponential would certainly be more efficient than use of the gamma, however.

(b) ∫₀^∞ sin(x) e^(−x) dx

Because the integral is improper, you must use a distribution with infinite support. A simple one is the exponential with parameter 1. We can generate an exponential from a uniform u as −log(u). Hence, a Monte Carlo estimate of the integral is

   t = (1/m) Σ_{i=1}^m sin(−log(u_i)),

where the u_i are iid U(0,1), and an estimate of the variance is

   (1/m) Σ_{i=1}^m ( sin(−log(u_i)) − t )² / (m − 1).
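For concreteness, here is a small R sketch of both estimates and their variance estimates; the sample size m = 100000 is an arbitrary choice:

m <- 100000
u <- runif(m)
# (a): x = 2u ~ U(0,2); integrand/PDF ratio is 2(2u)^2 exp(-u)
h1 <- 2*(2*u)^2*exp(-u)
t1 <- mean(h1)        # estimate of the integral
v1 <- var(h1)/m       # estimate of the variance of t1
# (b): x = -log(u) ~ exponential(1); ratio is sin(-log(u))
h2 <- sin(-log(u))
t2 <- mean(h2)        # estimate of the integral (true value is 1/2)
v2 <- var(h2)/m
c(t1, v1, t2, v2)

Note that var(h1)/m matches the variance formula above, since R's var uses the m − 1 divisor.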

4. (13 pts) Outline how you would design and conduct a Monte Carlo study to compare the performance of the standard two-sample t test for equality of means of two normal populations with Welch's test when the variances of the underlying distributions are unequal. You do not need to know how to perform these two tests; just assume you have programs that will perform the two tests at a given significance level α. That is, given two datasets, your programs will return a value of "reject" or "don't reject". Treat this as a factorial experiment. Identify the factors and the factor levels you would use (just make some reasonable choices). Then describe the steps you would follow.

The response of interest is the performance of the tests. The subject of the tests is the difference in the means. The difference in the means can be measured in various ways, such as an arithmetic difference or a ratio, in either case possibly scaled by a standard deviation. The performance of either test is its power over some range of differences. The treatments are the two tests. The factors of interest are

a. the differences in the means (possibly scaled),
b. the differences in the variances,
c. the sample sizes.

The general approach would be to choose one population as N(0,1) and the second as N(µ, σ²). We see that all possible ranges are encompassed by the ranges [0, ∞) for µ and (0, ∞) for σ². More realistically, we may choose [0, 3σ] for µ after choosing [1/9, 9] for σ². A sketch of such a study is given below.
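Here is a rough R sketch of the factorial study. The factor levels, the number of replications, and α = 0.05 are illustrative choices, not the only reasonable ones; R's t.test happens to perform both tests via its var.equal argument:

set.seed(1)
nrep  <- 1000                  # Monte Carlo replications per cell
alpha <- 0.05
mus <- c(0, 0.5, 1)            # factor a: difference in means
s2s <- c(1/9, 1, 9)            # factor b: variance of second population
ns  <- c(10, 30)               # factor c: common sample size
for (mu in mus) for (s2 in s2s) for (n in ns) {
  rej <- matrix(0, nrep, 2)
  for (r in 1:nrep) {
    x <- rnorm(n)
    y <- rnorm(n, mu, sqrt(s2))
    rej[r,1] <- t.test(x, y, var.equal = TRUE)$p.value  < alpha   # standard t
    rej[r,2] <- t.test(x, y, var.equal = FALSE)$p.value < alpha   # Welch
  }
  cat(mu, s2, n, colMeans(rej), "\n")  # estimated size (mu = 0) or power
}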

5. (22 pts) Consider the model

   y_i = α e^(βx_i) + ɛ_i,

where α and β are unknown constants, and ɛ_i is a random variable with expected value of 0 and constant variance. Assume that we have pairs of observations (y_1, x_1), ..., (y_n, x_n), and that the ɛ_i's for the observations are independent.

(a) Estimation by least squares.

i. What is the objective function; that is, what is the function of α and β that is to be minimized?

First of all, notice that if you linearize this by taking logs, you are changing the model. We can write the objective function in the form of sums of individual elements or in a vector notation, where, when x is an n-vector, we adopt the notation e^(bx) to represent the n-vector whose i-th element is e^(bx_i). In vector notation, the objective function is

   f(a, b) = (y − a e^(bx))ᵀ(y − a e^(bx)).

In the form of sums of individual elements, the objective function is

   f(a, b) = Σ_{i=1}^n (y_i − a e^(bx_i))².

ii. What is the gradient of the objective function?

Using the vector form, the gradient is

   g_f = ∇f = [ ∂f/∂a ]   [ −2 (e^(bx))ᵀ (y − a e^(bx))          ]
              [ ∂f/∂b ] = [ −2a (diag(x) e^(bx))ᵀ (y − a e^(bx)) ].

iii. What is the Hessian of the objective function?

   H_f = ∇g_f = [ ∂²f/∂a²    ∂²f/∂a∂b ]
                [ ∂²f/∂b∂a   ∂²f/∂b²  ],

whose entries are the partial derivatives with respect to a and b of the two gradient components above. This was messy, and if you had the formulas right, I gave full credit.

iv. Given a starting point, what is the Newton step to move to a new solution?

Let (a_0, b_0) be given. Then

   (a_1, b_1)ᵀ = (a_0, b_0)ᵀ − H_f(a_0, b_0)⁻¹ g_f(a_0, b_0).
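To avoid coding the messy Hessian by hand, one could carry out the Newton step numerically. The following R sketch uses the numDeriv package (assumed installed); the function name newton_step is mine:

library(numDeriv)
f <- function(p, xdat, ydat) sum((ydat - p[1]*exp(p[2]*xdat))^2)  # objective in (a, b)
newton_step <- function(p0, xdat, ydat) {
  g <- grad(f, p0, xdat = xdat, ydat = ydat)     # numerical gradient at (a0, b0)
  H <- hessian(f, p0, xdat = xdat, ydat = ydat)  # numerical Hessian at (a0, b0)
  p0 - solve(H, g)                               # (a1, b1) = (a0, b0) - H^{-1} g
}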

(b) Estimation by maximum likelihood.

i. What else would you need to know or assume?

You would need to know the distribution of the random variables ɛ_i. This means the multivariate distribution; notice that nothing was stated about the relationships of the ɛ_i to each other.

Make an appropriate assumption to satisfy the need referred to in the previous question (the specific assumption is not important). Assume that they are iid N(0, σ²); that is, the multivariate distribution is N_n(0, σ²I_n). We can represent the PDF of this distribution as

   f(ɛ) = (2σ²π)^(−n/2) e^(−ɛᵀɛ/(2σ²)).

Now, based on that assumption, in the following, describe how you would proceed to compute the MLEs of α and β.

ii. What is the objective function; that is, what is the function of α and β that is to be optimized?

There are actually three variables, α, β, and σ². As it turns out, however, the optimal values of α and β are not affected by the value of σ². The objective function is the likelihood function:

   L(α, β; x, y) = (2σ²π)^(−n/2) e^(−(y − αe^(βx))ᵀ(y − αe^(βx))/(2σ²)),

or, equivalently, the log-likelihood

   l_L(α, β; x, y) = const − (y − αe^(βx))ᵀ(y − αe^(βx))/(2σ²).

Maximizing either one in α and β is the same as minimizing (y − αe^(βx))ᵀ(y − αe^(βx)), the least-squares objective.

iii. What is the gradient of the objective function? This is the same as for least squares.

iv. What is the Hessian of the objective function? This is the same as for least squares.

v. Given a starting point, what is the Newton step to move to a new solution? This is the same as for least squares.

It is well known that least squares is ML if the distribution is normal, the error is additive, and the model is linear. It is also the case here because of the form of the model.

6. (13 pts) Given the three linearly independent vectors in 5-space:

   x_1 = (1, 1, 1, 2, 0)
   x_2 = (1, 0, 0, 1, 0)
   x_3 = (1, 0, 1, 1, 1)

Form three orthonormal vectors z_1, z_2, and z_3 that span the same space. (I intended for my numbers to work out evenly, but they don't, so when you have a square root, just show it as such, and don't worry about the computations.)

The method to use is Gram-Schmidt. I was very lenient in grading this one. It gets pretty messy, but here are some expressions:

   z_1 = (1, 1, 1, 2, 0)/√7,
   z_2 = ((1, 0, 0, 1, 0) − 3(1, 1, 1, 2, 0)/7)/a,
   z_3 = ((1, 0, 1, 1, 1) − 4(1, 1, 1, 2, 0)/7 − b z_2)/c,

where a and c are the norms of the respective numerators, and b is the inner product of z_2 and the z_3 before adjustment.

Some of you made the first vector from the third vector; that is, you smartly chose

   z_1 = x_3/‖x_3‖,

because ‖x_3‖ is an integer.

Rather than doing it in the manner indicated above, it is actually better to accumulate the third vector in two steps (and the fourth, if there were one, in three steps, and so on). Here's some R code to do it:

m <- 3
n <- 5
z1 <- c(1,1,1,2,0)
z2 <- c(1,0,0,1,0)
z3 <- c(1,0,1,1,1)
Z <- cbind(z1,z2,z3)
Z[1:n,1] <- Z[1:n,1]/sqrt(sum(Z[1:n,1]^2))
for (k in 2:m) {
  for (j in k:m) {
    Z[1:n,j] <- Z[1:n,j] - sum(Z[1:n,k-1]*Z[1:n,j])*Z[1:n,k-1]
  }
  Z[1:n,k] <- Z[1:n,k]/sqrt(sum(Z[1:n,k]^2))
}
# Check it:
round(t(Z) %*% Z, 10)
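As a cross-check (an alternative, not the method asked for), R's QR factorization also produces an orthonormal basis for the same column space, though some columns may come out with opposite signs:

qr.Q(qr(cbind(c(1,1,1,2,0), c(1,0,0,1,0), c(1,0,1,1,1))))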

7. (13 pts) Given the vector x = (1, 2, 0, 2, 0), describe how you would reflect this vector into the vector x̃ = (3, 0, 0, 0, 0).

The reflection is achieved by the Householder matrix I − 2uuᵀ, where

   u = (1 − ‖x‖, 2, 0, 2, 0)/‖(1 − ‖x‖, 2, 0, 2, 0)‖ = (−2, 2, 0, 2, 0)/√12,

so x̃ = (I − 2uuᵀ)x. In R,

x <- c(1,2,0,2,0)
u <- c(-2,2,0,2,0)/sqrt(12)
H <- matrix(c(rep(c(1,0,0,0,0,0),4),1),nrow=5) - 2*u%*%t(u)
round(H %*% x, 10)

yields

     [,1]
[1,]    3
[2,]    0
[3,]    0
[4,]    0
[5,]    0
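The same construction reflects any vector onto a multiple of the first unit vector e_1. Here is a small general R sketch; the function name householder is mine:

householder <- function(x) {
  v <- x
  v[1] <- v[1] - sqrt(sum(x^2))   # v = x - ||x|| e_1
  v/sqrt(sum(v^2))                # unit Householder vector u
}
householder(c(1,2,0,2,0))         # recovers (-2, 2, 0, 2, 0)/sqrt(12)

(Numerical libraries usually choose the sign of ‖x‖ to avoid cancellation in v[1]; the plain choice here matches the problem.)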
