Stat 502X Exam 1 Spring 2014

I have neither given nor received unauthorized assistance on this exam.

Name Signed ____________________  Date ____________
Name Printed ____________________

This is a long exam consisting of 11 parts. I'll score it at 10 points per problem/part and add your best 8 scores to get an exam score (out of 80 points possible). Some parts will go faster than others, and you'll be wise to do them first.

1. I want to fit a function to $N$ data pairs $(x_i, y_i)$ that is linear outside the interval $(0,1)$, is quadratic in each of the intervals $(0,.5)$ and $(.5,1)$, and has a first derivative for all $x$ (has no sharp corners). Specify below 4 functions $h_1(x)$, $h_2(x)$, $h_3(x)$, and $h_4(x)$ and one linear constraint on the coefficients $\beta_0, \beta_1, \beta_2, \beta_3$, and $\beta_4$ so that the function

$$y = \beta_0 + \beta_1 h_1(x) + \beta_2 h_2(x) + \beta_3 h_3(x) + \beta_4 h_4(x)$$

is of the desired form.
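One standard construction uses truncated quadratic ("hinge-squared") basis functions, each of which has a continuous first derivative everywhere. The sketch below is offered only as an illustration of the form being asked for, not as the unique intended answer; the particular basis and closing constraint are this editor's assumptions.

```r
# A sketch: truncated quadratic basis with knots at 0, .5, and 1.
# (x - k)_+^2 is differentiable everywhere, so f has no sharp corners.
pos2 <- function(x, knot) pmax(x - knot, 0)^2

h1 <- function(x) x
h2 <- function(x) pos2(x, 0)
h3 <- function(x) pos2(x, 0.5)
h4 <- function(x) pos2(x, 1)

# f is linear for x < 0 and quadratic on (0,.5) and (.5,1); linearity for
# x > 1 requires the quadratic pieces to cancel there, i.e. the single
# linear constraint beta2 + beta3 + beta4 = 0.
f <- function(x, beta) {
  beta[1] + beta[2]*h1(x) + beta[3]*h2(x) + beta[4]*h3(x) + beta[5]*h4(x)
}
curve(f(x, c(1, 2, 1, -3, 2)), from = -1, to = 2)  # here 1 + (-3) + 2 = 0
```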

2. Consider a joint pdf (for $(x,y) \in (0,1) \times (0,\infty)$) of the form

$$g(x,y) = \frac{1}{2x}\exp\left(-\frac{y}{2x}\right) \quad \text{for } 0 < x < 1 \text{ and } 0 < y$$

($x \sim \mathrm{U}(0,1)$ and conditional on $x$, the variable $y$ is exponential with mean $2x$.)

10 pts a) Find the linear function of $x$ (say $\beta_0 + \beta_1 x$) that minimizes $E\left(y - (\beta_0 + \beta_1 x)\right)^2$. (The averaging is over the joint distribution of $(x,y)$.) Find the optimizing intercept and slope.
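A quick simulation can sanity-check the population answer. This sketch assumes the mean-$2x$ reading of the (garbled) pdf above; since the conditional mean $E[y \mid x] = 2x$ is itself linear in $x$, the best linear predictor should be exactly that.

```r
# Monte Carlo check (assuming y | x is exponential with mean 2x):
# least squares on a huge sample should approach the population-optimal
# intercept and slope.
set.seed(1)
N <- 1e6
x <- runif(N)
y <- rexp(N, rate = 1 / (2 * x))  # exponential with mean 2x given x
coef(lm(y ~ x))                   # should be close to intercept 0, slope 2
```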

b) Suppose that a training set consists of $N$ data pairs $(x_i, y_i)$ that are independently drawn from the distribution specified on the previous page, and that least squares is used to fit a predictor $\hat{f}_N(x) = a_N + b_N x$ to the training data. Suppose that it's possible to argue (don't try to do so here) that the least squares coefficients $a_N$ and $b_N$ converge (in a proper probabilistic sense) to your optimizers from a) as $N \to \infty$. Then for large $N$, about what value of (SEL) training error do you expect to observe under this scenario? (Give a number.)
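Under the same mean-$2x$ reading as above, the limiting training error is $E\,\mathrm{Var}(y \mid x)$, which the following sketch checks numerically (the exact value under that assumption is $E[(2x)^2] = 4/3$).

```r
# Hedged numerical check: training error of the least squares line on a
# large sample versus E[Var(y|x)] = E[(2x)^2] under the mean-2x assumption.
set.seed(2)
N <- 1e6
x <- runif(N)
y <- rexp(N, rate = 1 / (2 * x))
fit <- lm(y ~ x)
mean(resid(fit)^2)                            # observed (SEL) training error
integrate(function(t) (2 * t)^2, 0, 1)$value  # exact limit: 4/3
```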

3. Suppose that (unbeknownst to a statistical learner) $x \sim \mathrm{U}(0,1)$ and $E[y \mid x] = I[.45 < x < .55]$ (that is, the conditional mean of $y$ given $x$ is 1 when $.45 < x < .55$ and is 0 otherwise). A 3-nearest-neighbor predictor, $\hat{f}_N$, is based on $N$ data pairs, and $\hat{f}_N(.5)$ has conditional means given the values of the inputs in the training set:

$$E\left[\hat{f}_N(.5) \,\middle|\, x_1, \ldots, x_N\right] = \begin{cases} 0 & \text{if no } x_i \text{ is in } (.45,.55) \\ 1/3 & \text{if one } x_i \text{ is in } (.45,.55) \\ 2/3 & \text{if two } x_i\text{'s are in } (.45,.55) \\ 1 & \text{if three or more } x_i\text{'s are in } (.45,.55) \end{cases}$$

What is the value of the bias of the nearest neighbor predictor at $x = .5$? Does this bias go to 0 as $N$ gets big? Argue carefully one way or the other.
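Since each $x_i$ lands in $(.45,.55)$ with probability $.1$, the count $M$ of training inputs in that window is $\mathrm{Binomial}(N, .1)$, and the bias at $.5$ follows from the case list above. A small sketch (this editor's illustration, not part of the exam):

```r
# Bias of the 3-NN predictor at x = .5: M ~ Binomial(N, .1) counts inputs
# in (.45,.55), and E fhat(.5) = (1/3)P(M=1) + (2/3)P(M=2) + P(M>=3).
bias_at_half <- function(N) {
  p <- dbinom(0:2, N, 0.1)               # P(M=0), P(M=1), P(M=2)
  Efhat <- (1/3) * p[2] + (2/3) * p[3] + (1 - sum(p))
  Efhat - 1                              # true conditional mean at .5 is 1
}
sapply(c(5, 10, 50, 100, 500), bias_at_half)  # shrinks toward 0 as N grows
```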

4. Below is a JMP report from the fit of a two-layer feed-forward neural network to $N = 100$ pairs $(x_i, y_i)$. What value is predicted for $x = .5$? (Plug in, but you need not simplify. The "activation function" here is $u \mapsto \tanh(u)$.)

[JMP report not reproduced in this transcription.]
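For reference, the arithmetic such a report asks for has the following shape. All weights in this sketch are hypothetical placeholders, since the actual JMP estimates did not survive transcription.

```r
# A single-hidden-layer ("two-layer") network with tanh activations predicts
#   yhat = b0 + sum_k b_k * tanh(a0_k + a1_k * x).
# The weights below are made-up stand-ins, not values from the JMP report.
predict_nn <- function(x, a0, a1, b0, b) b0 + sum(b * tanh(a0 + a1 * x))
predict_nn(0.5,
           a0 = c(0.1, -0.4, 0.7),   # hidden-node intercepts (hypothetical)
           a1 = c(1.2, -2.0, 0.5),   # hidden-node slopes (hypothetical)
           b0 = 0.3, b = c(0.8, -1.1, 0.6))
```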

5. An interesting data set from the Technical University of Delft relates a measure of yacht performance, $y$, to several shape characteristics of yachts. This problem concerns predicting (at a particular "Froude number") $\log y$ as a function of $p = 5$ inputs.

10 pts a) Below is a summary of such an analysis. Use the information below to predict $\log y$ if $x_1 = 5.0$, $x_2 = 0.565$, $x_3 = 5.10$, $x_4 = 3.94$, and $x_5$ = [value lost in transcription].

> MARSfitlogY <- earth(logy ~ . - logy, degree = 2, data = Yachts, pmethod = "none", trace = 2)
x is a 22 by 5 matrix: 1=x1, 2=x2, 3=x3, 4=x4, 5=x5
y is a 22 by 1 matrix: 1=logy
Forward pass: minspan 4, endspan 9
[forward-pass trace (GRSq, RSq, DeltaRSq, Pred, PredName, Cut, Terms, ParentTerm columns) lost in transcription]
Reached min GRSq (GRSq < -10) at 7 terms
After forward pass: GRSq and RSq [values lost]
Forward pass complete: 7 terms
Prune method "none" penalty 3 nprune 7: selected 7 of 7 terms, and 3 of 5 predictors
After backward pass: GRSq and RSq [values lost]

> summary(MARSfitlogY)
Call: earth(formula = logy ~ . - logy, data = Yachts, trace = 2, degree = 2, pmethod = "none")

Coefficients (numeric values lost in transcription) for the terms:
(Intercept); h(x ...); h(0.565-x2); h(x4-3.99); h(3.99-x4); h(x ...) * h(0.565-x2); h(-2.3-x1) * h(0.565-x2)

Selected 7 of 7 terms, and 3 of 5 predictors
Importance: x4, x2, x1, x3-unused, x5-unused
Number of terms at each degree of interaction: [lost]
GCV [lost]   RSS [lost]   GRSq [lost]   RSq [lost]
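Whatever the lost coefficients were, part a) only requires evaluating the selected hinge terms at the given inputs and forming the weighted sum. A hedged sketch (the beta values are made-up stand-ins, and the two terms whose cuts were lost are omitted):

```r
# MARS/earth predictions are sums of coefficient * hinge-term products,
# with hinge h(u) = max(u, 0). The beta values here are hypothetical;
# the printed coefficient column did not survive transcription.
h <- function(u) pmax(u, 0)
predict_mars <- function(x1, x2, x4, beta) {
  terms <- c(1,                              # intercept
             h(0.565 - x2),
             h(x4 - 3.99),
             h(3.99 - x4),
             h(-2.3 - x1) * h(0.565 - x2))   # degree-2 interaction
  sum(beta * terms)
}
predict_mars(x1 = 5.0, x2 = 0.565, x4 = 3.94, beta = c(2, -1, 3, -2, 0.5))
```

Note that at $x_2 = 0.565$ every term involving $h(0.565 - x_2)$ evaluates to zero, which simplifies the hand computation considerably.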

After standardizing the 5 input variables and centering the response (for the yacht data), the following are some results from R. (The input matrix is called CX and the response vector is W.)

> round(svd(CX)$u, 3)
[22 by 5 matrix of left singular vectors; entries lost in transcription]
> round(svd(CX)$v, 3)
[5 by 5 matrix of right singular vectors; entries lost in transcription]
> round(svd(CX)$d, 3)
[5 singular values; lost in transcription]
> round(CX %*% svd(CX)$v, 3)
[22 by 5 matrix of principal component scores; entries lost in transcription]
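For orientation (this editor's note, assuming CX and W as defined above): the pieces printed here are exactly the SVD ingredients used by the ridge, PCR, and PLS questions that follow.

```r
# svd(CX) returns U, d, V with CX = U diag(d) V'; the columns of CX %*% V
# (equivalently U %*% diag(d)) are the principal-component scores.
s <- svd(CX)
all.equal(CX, s$u %*% diag(s$d) %*% t(s$v))  # TRUE up to rounding
Z <- CX %*% s$v                              # the PC scores printed above
```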

> a <- CX %*% t(CX) %*% W
> round(a, 3)
[22 by 1 vector; entries lost in transcription]
> t(W) %*% a
[1 by 1 value; lost in transcription]
> t(a) %*% a
[1 by 1 value; lost in transcription]
> round(t(W) %*% svd(CX)$u, 3)
[1 by 5 vector; entries lost in transcription]

10 pts b) What are the effective degrees of freedom associated with a ridge regression in this context, using a ridge parameter of 2?
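For part b), the standard identity expresses ridge effective degrees of freedom through the singular values of the (standardized) input matrix. A sketch, assuming the printed singular values are available as svd(CX)$d:

```r
# Effective degrees of freedom of ridge regression with penalty lambda:
#   df(lambda) = sum_j d_j^2 / (d_j^2 + lambda).
ridge_df <- function(d, lambda) sum(d^2 / (d^2 + lambda))
ridge_df(svd(CX)$d, lambda = 2)
```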

c) Give the $M = 1$ component PCR and PLS vectors of predicted values $\hat{W}^{\mathrm{PCR}}$ and $\hat{W}^{\mathrm{PLS}}$.
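Both one-component fits are projections of W onto a single score vector, which is why the quantities printed earlier are the relevant pieces: t(W) %*% svd(CX)$u supplies the PCR numerator, while t(W) %*% a and t(a) %*% a supply the PLS numerator and denominator. A sketch under the same assumptions about CX and W:

```r
# One-component PCR: project W onto the first left singular vector u1.
s <- svd(CX)
u1 <- s$u[, 1]
W_pcr <- u1 * sum(u1 * W)

# One-component PLS: the first score vector is a = CX t(CX) W, and the
# fitted vector is the projection of W onto a.
a <- CX %*% t(CX) %*% W
W_pls <- a * as.numeric(t(W) %*% a) / as.numeric(t(a) %*% a)
```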

d) Below is some output from fitting a lasso regression to these data, using cross-validation to choose a lasso penalty coefficient.

> cv.out = cv.glmnet(CX, W, alpha = 1)
> plot(cv.out)
> bestlam = cv.out$lambda.min
> bestlam
[1] [value lost in transcription]
> lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)
> lasso.coef
6 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)  [value of order e-16; lost in transcription]
V1           .
V2           [value of order e-02; lost in transcription]
V3           .
V4           [value of order e-02; lost in transcription]
V5           .
> betahat <- c(lasso.coef[2:6])
> round(CX %*% betahat, 3)
[22 by 1 vector of fitted values; entries lost in transcription]
> t(CX %*% betahat - W) %*% (CX %*% betahat - W)
[1 by 1 value; lost in transcription]

What penalized value of training error is associated with this lasso fit (including the penalty)?
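Assembling the asked-for penalized training error from the printed pieces is mechanical once a convention for the lasso criterion is fixed. A sketch using glmnet's own gaussian objective, $(1/2N)\,\mathrm{RSS} + \lambda \sum_j |\beta_j|$ (if the course instead defines the criterion as $\mathrm{RSS} + \lambda \sum_j |\beta_j|$, drop the $1/(2N)$ factor):

```r
# Penalized training error for the lasso fit, in glmnet's convention
# (gaussian loss, alpha = 1); N = 22 observations here.
rss <- as.numeric(t(CX %*% betahat - W) %*% (CX %*% betahat - W))
rss / (2 * 22) + bestlam * sum(abs(betahat))
```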

6. Below is a particular smoother matrix, S, for $p = 1$ data at values $x = 0, .1, .2, .3, \ldots, .9, 1.0$. (The labeling convention used below is $x_1 = 0$, $x_2 = .1$, $x_3 = .2, \ldots, x_{11} = 1.0$.)

[11 by 11 smoother matrix; entries lost in transcription]

What effective degrees of freedom are associated with this smoother? Approximately what bandwidth is associated with this smoother? For training data as below, what is $\hat{f}(.4)$?

[table of training pairs (x, y); values lost in transcription]
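Even with the printed numbers lost, the recipe for a linear smoother is standard. A sketch, assuming S holds the 11 by 11 matrix and y the training responses in the x-order given above:

```r
# For a linear smoother fhat = S %*% y:
sum(diag(S))              # effective degrees of freedom = trace(S)
as.numeric(S[5, ] %*% y)  # fhat(.4), since x5 = .4 in the stated labeling
# The bandwidth can be eyeballed from how far nonzero weights extend
# from the diagonal in each row of S.
```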

7. $K$-fold cross-validation is (as you know by now) probably the most common and effective known means of assessing "test error" for purposes of choosing between predictors. $K = 5$ and $K = 10$ are common choices. $K = N$ is the case of leave-one-out (LOO) cross-validation. For present purposes, ignore issues of computational complexity. Why would you expect LOO cross-validation to often be preferable to $K = 10$ in terms of bias for assessing test error? How do you explain that (despite consideration of bias) $K = 10$ can often be preferable to LOO? (Again, do not consider computational issues.)
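For concreteness, here is a minimal sketch of the two schemes on simulated data (this editor's illustration): $K$-fold CV trains on a fraction $(K-1)/K$ of the data, so $K = 10$ evaluates slightly smaller training sets than the full fit (a source of bias), while LOO trains on $N - 1$ points but averages $N$ highly correlated fits (a source of variance).

```r
# K-fold CV estimate of test MSE for a simple linear fit; K = N gives LOO.
cv_mse <- function(x, y, K) {
  folds <- sample(rep(1:K, length.out = length(x)))
  mean(sapply(1:K, function(k) {
    fit <- lm(y ~ x, subset = folds != k)
    mean((y[folds == k] - predict(fit, data.frame(x = x[folds == k])))^2)
  }))
}
set.seed(3)
x <- runif(50); y <- 2 * x + rnorm(50)
c(tenfold = cv_mse(x, y, 10), loo = cv_mse(x, y, 50))
```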
