Stat 602 Exam 1 Spring 2017 (corrected version)


Gary Martin
5 years ago

I have neither given nor received unauthorized assistance on this exam.

Name Signed        Date

Name Printed

This is a very long Exam. You surely won't be able to finish all of it. Do parts of it that look like they will go quickly. Point values are indicated and I'll score it out of 100 (not 120).

1. If, in a classification problem, all $N$ inputs $x_i \in \mathbb{R}^p$ are distinct, a default random forest (one with $n_{\min} = 1$) will typically have err $= 0$ (a 0 training error rate for 0-1 loss) unless a "small" maximum tree depth is set.

7 pts a) Why is this? Explain!

6 pts b) Does this mean that the out-of-bag error rate will be 0? Explain!

7 pts c) Does this mean that the out-of-bag error rate is unreliable as representing likely random forest performance? Explain!
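For part b), one relevant quantity is how often a given training case lands in a bootstrap sample, since out-of-bag predictions for a case use only trees grown without it. A minimal sketch, assuming each tree is grown on a bootstrap sample of size N drawn with replacement from the N cases (the function name `in_bag_probability` is illustrative and not from the exam):

```python
def in_bag_probability(n):
    """Probability that a fixed case appears at least once in a
    bootstrap sample of size n drawn with replacement from n cases."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# Assumed sample sizes, chosen only for illustration.
for n in (5, 50, 500):
    print(n, round(in_bag_probability(n), 4))
```

The value settles near 1 - exp(-1) as N grows, so roughly a third of the trees never see any particular case.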

2. Consider a 1-d N-W smoothing problem on $[0, 2]$ for values of $x_j = .1(j-1)$ for $j = 1, 2, \ldots, 21$. Suppose that one uses weights

$$w(i,j) = \begin{cases} .5 & \text{if } i = j \\ .25 & \text{if } |i-j| = 1 \\ 0 & \text{otherwise} \end{cases}$$

to make smoothed values

$$\hat{y}_j = \sum_{i=1}^{21} w(i,j)\, y_i$$

except for the "edge" cases where we'll take $\hat{y}_1 = .5y_1 + .5y_2$ and $\hat{y}_{21} = .5y_{20} + .5y_{21}$.

6 pts a) For $S$ the smoother matrix to be applied to a vector of observations $Y = (y_1, y_2, \ldots, y_{21})'$ to get smoothed values, what are effective degrees of freedom?

__ pts b) What are (except for the "edge" cases, now with indices $j = 1, 2, 20$, and $21$) the weights, say $w_2(i,j)$, used to make "doubly smoothed" values via two successive applications of the original smoothing? That is, $\hat{Y} = SSY$. What (approximately, you don't need to get exactly the right terms for the edge cases) are effective degrees of freedom for $SS$?
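The smoother matrix described in this problem is small enough to build directly. The sketch below assumes that effective degrees of freedom means the trace of the smoother matrix, and that row j of S holds the weights applied to the observations to form the j-th smoothed value; the helper names (`smoother_matrix`, `matmul`, `trace`) are illustrative.

```python
def smoother_matrix(n=21):
    """n x n matrix S with interior rows (.25, .5, .25) and the two
    edge rows (.5, .5) described in the problem."""
    S = [[0.0] * n for _ in range(n)]
    for j in range(n):
        if j == 0:
            S[j][0], S[j][1] = 0.5, 0.5
        elif j == n - 1:
            S[j][n - 2], S[j][n - 1] = 0.5, 0.5
        else:
            S[j][j - 1], S[j][j], S[j][j + 1] = 0.25, 0.5, 0.25
    return S


def matmul(A, B):
    """Plain matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


S = smoother_matrix()
SS = matmul(S, S)
print("tr(S)  =", trace(S))
print("tr(SS) =", trace(SS))
print("interior row of SS:", SS[10][8:13])
```

The printed interior row shows the five non-zero weights that two passes of the three-point smoother produce away from the edges.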

7 pts c) Consider local linear regression in this same context, where the original weights are used and thus (except for edge cases) the slope and intercept used to make $\hat{y}_j$ are determined by minimizing

$$.25\left(y_{j-1} - (\beta_0 + \beta_1 x_{j-1})\right)^2 + .5\left(y_j - (\beta_0 + \beta_1 x_j)\right)^2 + .25\left(y_{j+1} - (\beta_0 + \beta_1 x_{j+1})\right)^2$$

(or equivalently 4 times this quantity). Ultimately (again except for edge cases) what weights go into a smoother matrix for an "equivalent N-W kernel smoother" in this case? (It may be helpful to recall that OLS for SLR produces $b_1 = \sum_{i=1}^{N} (y_i - \bar{y})(x_i - \bar{x}) \big/ \sum_{i=1}^{N} (x_i - \bar{x})^2$ and $b_0 = \bar{y} - b_1\bar{x}$.)
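A weighted least squares line evaluated at a target point is linear in the responses, so its "equivalent kernel" weights can be computed exactly. The following is a sketch under stated assumptions: three neighboring grid points with spacing 0.1 stand in for an interior case, and the function name `local_linear_weights` is hypothetical.

```python
from fractions import Fraction as F


def local_linear_weights(xs, ws, x0):
    """Weights l_i such that the weighted least squares line fitted to
    (xs, ys) with case weights ws has value sum(l_i * y_i) at x0."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    # Fitted value at x0 is ybar_w + b1 * (x0 - xbar), linear in the y_i.
    return [w / sw + w * (x - xbar) * (x0 - xbar) / sxx
            for w, x in zip(ws, xs)]


# An interior target point with its two neighbors (assumed values).
xs = [F(9, 10), F(1), F(11, 10)]
ws = [F(1, 4), F(1, 2), F(1, 4)]
weights = local_linear_weights(xs, ws, F(1))
print(weights)
```

With equally spaced inputs and symmetric weights, the weighted mean of the inputs coincides with the target point, so the slope term contributes nothing at that point.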

3. Below are $N = 8$ training cases $(x_i, y_i)$ for $x_i \in [0, 1]$ and a corresponding "design matrix" holding values of the first 8 Haar basis functions (in the order $\varphi, \psi, \psi_{1,0}, \psi_{1,1}, \psi_{2,0}, \psi_{2,1}, \psi_{2,2}, \psi_{2,3}$) for the $x_i$. (The X matrix below is not as was printed. This is correct. The one on exam night was off by multipliers in the first 4 columns. ;+{ )

[The displayed vectors $x$ and $y$ and the $8 \times 8$ matrix $X$ are not legible in this copy.]

a) Find the OLS prediction vector $\hat{y}^{\text{OLS}}$ here. (This is trivial. Note that the 8 columns of $X$ are orthogonal.)
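Because the exam's own numbers are not legible here, the following sketch uses stand-in values to illustrate the orthogonality remark in part a). Everything specific is an assumption: the inputs are taken to be the midpoints (2i-1)/16, the scaling is taken to be psi_{m,k}(x) = sqrt(2^m) psi(2^m x - k), and the response vector is made up for illustration.

```python
import math


def haar_mother(t):
    """Mother Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0


def haar_row(x):
    """Values at x of phi, psi, psi_{1,0}, psi_{1,1}, psi_{2,0..3}
    under the assumed scaling sqrt(2^m) * psi(2^m x - k)."""
    row = [1.0, haar_mother(x)]
    for m in (1, 2):
        for k in range(2 ** m):
            row.append(math.sqrt(2 ** m) * haar_mother(2 ** m * x - k))
    return row


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


xs = [(2 * i - 1) / 16 for i in range(1, 9)]    # hypothetical inputs
X = [haar_row(x) for x in xs]
cols = [[X[i][j] for i in range(8)] for j in range(8)]

# Largest absolute inner product between two different columns.
gram_off_diag = max(abs(dot(cols[j], cols[k]))
                    for j in range(8) for k in range(8) if j != k)

y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]    # hypothetical responses
# With orthogonal columns each OLS coefficient is a one-column projection.
coef = [dot(c, y) / dot(c, c) for c in cols]
yhat = [sum(X[i][j] * coef[j] for j in range(8)) for i in range(8)]
print(gram_off_diag, yhat)
```

Under these assumptions the eight orthogonal columns span all of 8-dimensional space, so the fitted vector reproduces the responses up to rounding.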

b) Find the 1-component PLS prediction vector $\hat{y}^{\text{PLS}}$ here.

c) After normalizing the predictors (so that the $\mathbb{R}^8$ norm of each column of the normalized $X$ is 1) find the LASSO prediction vector $\hat{y}^{\text{LASSO}}$ for the penalty parameter $\lambda = 0$. (This is confusing as stated, because one should center the vector of responses, remove the first column of $X$ and work with an $8 \times 7$ matrix of inputs, and this wasn't spelled out.)

d) Using the normalized version of the predictors referred to in part c) find a vector of coefficients $\beta$ that minimizes

$$(y - X\beta)'(y - X\beta) + \beta'\, \mathrm{diag}(0, 0, 0, 0, 4, 4, 4, 4)\, \beta$$

(This, too, is confusing as stated because of the question of whether or not the response vector has been centered and the column of 1s removed from $X$. If this has been done, the dimension of the diagonal matrix above is wrong.)

4. Consider the 2-class classification model with the coding $y \in \{-1, 1\}$ and (for sake of concreteness) $x \in \mathbb{R}^1$. As is more or less standard, for $g(x)$ a generic voting function we'll consider the classifier

$$f(x) = \mathrm{sign}(g(x))$$

Another (besides those mentioned in class) "function loss" sometimes discussed is

$$h(v) = (1 - v)^2$$

10 pts a) Carefully derive the function $g^{\text{opt}}(x)$ optimizing $E\,h(y\,g(x))$ over choices of $g$.

b) To the extent possible, simplify a good upper bound on the 0-1 loss error rate of a classifier $f(x)$ made from your $g^{\text{opt}}(x)$ from part a).
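A derivation for part a) can be sanity-checked numerically by minimizing the conditional expected loss at a fixed x. This is a rough check and not the requested derivation; it assumes the conditional probability P[y = 1 | x] takes a known value p, and the names `expected_loss` and `grid_minimizer` are illustrative.

```python
def expected_loss(g, p):
    """Conditional expected loss E[h(y g) | x] for h(v) = (1 - v)^2
    when P[y = 1 | x] = p and y takes values in {-1, 1}."""
    return p * (1.0 - g) ** 2 + (1.0 - p) * (1.0 + g) ** 2


def grid_minimizer(p, steps=2000):
    """Minimize the conditional expected loss over a grid in [-1.5, 1.5]."""
    grid = [-1.5 + 3.0 * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda g: expected_loss(g, p))


# Compare the grid minimizer with 2p - 1 for a few assumed values of p.
for p in (0.1, 0.5, 0.8):
    print(p, grid_minimizer(p), 2 * p - 1)
```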

10 pts c) Suppose that in pursuit of a good classifier, one wishes to optimize an empirical version of $E\,h(y\,g(x))$, based on a training set of size $N$, over the class of functions of the form

$$g(x \mid \beta_0, \beta_1) = 2\Phi(\beta_0 + \beta_1 x) - 1,$$

penalized by $\lambda\beta_1^2$ for a $\lambda \geq 0$. ($\Phi$ is the standard normal cdf.) In as simple a form as possible, give two equations to be solved simultaneously to do this fitting.

d) Suppose that as a matter of fact the two class-conditional densities operating in the model are

$$p(x \mid -1) = I[0 < x < 1] \quad \text{and} \quad p(x \mid 1) = 6x(1 - x)\, I[0 < x < 1]$$

and that ultimately what is desired is a good ordering function $O(x)$, one that produces a small value of the "AUC" criterion. Do you expect the methodology of part c) to produce a function $g(x \mid \hat{\beta}_0, \hat{\beta}_1)$ that would be a good choice of $O(x)$? Explain carefully.
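For part d), candidate ordering functions can be compared numerically. The sketch below rests on several assumptions: one class has the Uniform(0,1) density and the other has density 6x(1-x) on (0,1); the AUC of an ordering function is taken to be the probability that a case from the second class receives a larger value than an independent case from the first (ties counting one half); and a midpoint rule on an assumed grid of 400 cells is accurate enough. The function name `auc` is illustrative.

```python
def auc(order, n=400):
    """Midpoint-rule approximation to P[order(X1) > order(X0)], ties
    counting one half, with X0 ~ Uniform(0, 1) and X1 having density
    6x(1 - x) on (0, 1)."""
    xs = [(i + 0.5) / n for i in range(n)]
    w1 = [6.0 * x * (1.0 - x) for x in xs]
    total = sum(w1)
    w1 = [w / total for w in w1]      # normalize the quadrature weights
    scores = [order(x) for x in xs]
    value = 0.0
    for i in range(n):
        for j in range(n):
            if scores[i] > scores[j]:
                value += w1[i] / n
            elif scores[i] == scores[j]:
                value += 0.5 * w1[i] / n
    return value


# Any increasing function of x orders cases exactly as x itself does.
auc_monotone = auc(lambda x: x)
# Ordering by the ratio of the two assumed densities.
auc_ratio = auc(lambda x: 6.0 * x * (1.0 - x))
print(round(auc_monotone, 3), round(auc_ratio, 3))
```

The first ordering stands in for any fitted function that is monotone in x, which is the form the class of functions in part c) takes whenever the slope coefficient is non-zero.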

10 pts 5. Suppose that in a toy 2-class classification model with $p = 1$ using the $y \in \{-1, 1\}$ coding one has $N = 5$ training cases in the small table below.

[The table of $y$ and $x$ values is not legible in this copy.]

In a gradient boosting exercise with the hinge loss $\sum_{i=1}^{5} \left(1 - y_i\, g(x_i)\right)_+$ and base functions $I[x < c]$ and $I[x > c]$ for constants $c$, suppose that one has a current function version $g_m(x)$ [its formula is not legible in this copy]. As completely as is possible in the current context, describe how you will produce $g_{m+1}$. (Show some specific calculations, not just general formulas.)

6. In the class notes there is an assertion that for a finite set $\mathcal{B}$, say $\mathcal{B} = \{b_1, b_2, \ldots, b_m\}$, one kernel function on subsets of $\mathcal{B}$ is

$$K(A, B) = 2^{|A \cap B|},$$

for $|A|$ the number of elements in $A \subset \mathcal{B}$. ($\mathcal{B}$ could, for example, be a list of attributes that an item might or might not possess.)

a) Prove that $K$ is a kernel function using the "kernel mechanics" facts in the notes. (Hint: You may find it useful to associate with each $A \subset \mathcal{B}$ an $m$-dimensional vector of 0s and 1s, call it $x_A \in \{0, 1\}^m$, with $x_{Al} = 1$ exactly when $b_l \in A$.)

Let $T(A)(\cdot) = K(A, \cdot) = 2^{|A \cap \cdot|}$ map subsets of $\mathcal{B}$ to real-valued functions of subsets of $\mathcal{B}$.

b) In the abstract space $\mathcal{A}$ (of real-valued functions of subsets of $\mathcal{B}$) what is the distance between $T(A)$ and $T(B)$, $\|T(A) - T(B)\|_{\mathcal{A}}$?

For $N$ training "vectors" $(A_i, y_i)$ ($A_i \subset \mathcal{B}$ and $y_i \in \mathbb{R}$) consider the corresponding $N$ points in $\mathcal{A} \times \mathbb{R}$, namely $(T(A_i), y_i)$. Define a $k$-neighborhood $N_k(V)$ of a point (function) $V \in \mathcal{A}$ to be a set of $k$ points (functions) $T(A_i)$ with smallest $\|T(A_i) - V\|_{\mathcal{A}}$.

c) Carefully describe a SEL knn predictor of $y$, $f(V)$, mapping elements $V$ of $\mathcal{A}$ to real numbers $\hat{y}$ in $\mathbb{R}$. Then describe as completely as possible the corresponding predictor $f(T(A))$ mapping $A \subset \mathcal{B}$ to $\hat{y} \in \mathbb{R}$.

d) A more direct method of producing a kind of knn predictor of $y$ is to take account of the hint for part a) and for subsets $A$ and $C$ of $\mathcal{B}$, associate $m$-vectors of 0s and 1s respectively $x_A$ and $x_C$ and define a distance between sets $A$ and $C$ as the Euclidean distance between $x_A$ and $x_C$. This typically produces a different predictor than the one in part c). Argue this point by considering distances from $x_A$ to $x_C$ and from $x_A$ to $x_D$ in $\mathbb{R}^m$ and from $T(A)$ to $T(C)$ and from $T(A)$ to $T(D)$ in the space $\mathcal{A}$ for cases with $|A| = 10$, $|C| = 4$, $|D| = 5$, $|A \cap C| = 2$, and $|A \cap D| = 3$. (There was a slight typo in the list of set sizes on the original version.)
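The two kinds of distance in part d) depend only on set sizes, so they can be tabulated directly. This sketch assumes the usual reproducing-kernel identity that the squared distance between T(A) and T(C) is K(A,A) + K(C,C) - 2K(A,C); the function names are illustrative.

```python
def euclid_sq(size_a, size_c, size_ac):
    """Squared Euclidean distance between the 0-1 indicator vectors of
    two sets, given their sizes and the size of their intersection."""
    return size_a + size_c - 2 * size_ac


def kernel_sq(size_a, size_c, size_ac):
    """Squared distance K(A,A) + K(C,C) - 2K(A,C) for the kernel
    K(A, C) = 2 ** |A intersect C|."""
    return 2 ** size_a + 2 ** size_c - 2 * 2 ** size_ac


# Set sizes from part d): |A| = 10, |C| = 4, |D| = 5,
# |A and C| = 2, |A and D| = 3.
print("Euclidean squared:", euclid_sq(10, 4, 2), euclid_sq(10, 5, 3))
print("Kernel squared:   ", kernel_sq(10, 4, 2), kernel_sq(10, 5, 3))
```

Comparing the two printed pairs shows which of C and D is the nearer neighbor of A under each notion of distance.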
