5.2 Expounding on the Admissibility of Shrinkage Estimators


STAT 383C: Statistical Modeling I, Fall 2015
Lecture 5, September 15
Lecturer: Purnamrita Sarkar
Scribe: Ryan O'Donnell

Disclaimer: These scribe notes have been slightly proofread and may have typos etc.
Note: The latex template was borrowed from EECS, UC Berkeley.

5.1 Quick Note on MLE Existence

Two examples were provided to highlight the possibility that an MLE may not exist.

Example 1: Let $X \sim N(\mu, \sigma^2)$ and let $\theta = (\mu, \sigma)$ with $\theta \in \mathbb{R} \times \mathbb{R}_+$. The usual pdf for the normal is used here:
\[
f(x, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\log f(x, \theta) = c - \frac{(x-\mu)^2}{2\sigma^2} - \log(\sigma).
\]
Differentiating the log-likelihood with respect to $\sigma$ and setting the derivative to zero, the MLE would satisfy
\[
0 = \frac{(x-\mu)^2}{\hat\sigma^3} - \frac{1}{\hat\sigma}
\quad\Longrightarrow\quad
\hat\sigma^2 = (x-\mu)^2.
\]
But if $x = \mu$, then $\hat\sigma = 0$, which is not allowed. So the MLE does not exist in this case.

Example 2: Let $X_i \sim \text{Uniform}(0, \theta)$. If the interval included its boundary, then clearly the MLE would be $\hat\theta = \max_i x_i$. But since this interval does not include its boundary, the MLE cannot be the maximum, and therefore an MLE does not exist.

5.2 Expounding on the Admissibility of Shrinkage Estimators

As was previously mentioned, it is somewhat difficult to understand intuitively why these particular shrinkage estimators can do better than the MLE. The discussion below begins with the Bayesian approach to the problem. We start with the model
\[
X \mid \theta \sim N(\theta, I), \qquad \theta \sim N(0, \tau^2 I).
\]

The posterior mean under this model is just
\[
\theta_{\text{post}} = \left(1 - \frac{1}{1+\tau^2}\right) X = \frac{\tau^2}{1+\tau^2}\, X.
\]
With that, we can aim to show that the MSE of this posterior estimator is smaller than the MSE of the MLE, which here is $X$ itself. For the moment, we will assume we know $\tau$ exactly. This allows for an easier proof that the MSE decreases. In reality, we could perhaps estimate it from the data, though the classical Bayesian approach would not allow for this, as it violates the idea of a prior distribution.

The MSE of the posterior mean is
\[
E[(\theta_{\text{post}} - \theta)^T (\theta_{\text{post}} - \theta)]
= E\left[\left(\frac{\tau^2}{1+\tau^2} X - \theta\right)^T \left(\frac{\tau^2}{1+\tau^2} X - \theta\right)\right].
\tag{5.1}
\]
For ease of notation, let
\[
c = \frac{1}{1+\tau^2}.
\tag{5.2}
\]
Therefore, the above becomes
\[
\begin{aligned}
E[(X - \theta - cX)^T (X - \theta - cX)]
&= E[(X-\theta)^T (X-\theta)] + c^2 E[X^T X] - 2c\,E[X^T (X - \theta)] \\
&= MSE(X) - (2c - c^2)\,E[X^T X] + 2c\,E[X^T \theta].
\end{aligned}
\]
To show that the above is indeed smaller than $MSE(X)$, it is easiest to break it up into pieces. First, recall the law of iterated expectations. Using this law,
\[
E[X] = E[E[X \mid \theta]] = E[\theta] = 0,
\]
\[
\operatorname{var}[X] = E[\operatorname{var}(X \mid \theta)] + \operatorname{var}(E[X \mid \theta]) = E[I] + \operatorname{var}(\theta) = (1+\tau^2)\, I.
\]
As it turns out, $X$ is also marginally normal, with the parameters derived above:
\[
X \sim N(0, (1+\tau^2) I).
\]
This is useful because it implies that $X^T X / (1+\tau^2)$ has a chi-squared distribution with $p$ degrees of freedom. So, by properties of the chi-squared distribution,
\[
E[X^T X] = (1+\tau^2)\, p.
\tag{5.3}
\]
Combining this with the original definition of $c$ shows that
\[
(2c - c^2)\,E[X^T X] = c(2 - c)\,E[X^T X] = \frac{1+2\tau^2}{(1+\tau^2)^2}\,(1+\tau^2)\, p = \frac{1+2\tau^2}{1+\tau^2}\, p.
\]
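The marginal moment $E[X^T X] = (1+\tau^2)p$ can be sanity-checked by simulation. The following is a minimal sketch, not something from the lecture: the function name `simulate_marginal_norm` and the values of `p`, `tau`, `n_draws`, and `seed` are illustrative assumptions.

```python
import numpy as np

def simulate_marginal_norm(p=10, tau=2.0, n_draws=100_000, seed=0):
    """Draw (theta, X) from the hierarchical model and return the Monte Carlo
    average of X^T X together with the closed form (1 + tau^2) * p.
    All default values are hypothetical choices for illustration."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, tau, size=(n_draws, p))      # theta ~ N(0, tau^2 I)
    x = theta + rng.normal(0.0, 1.0, size=(n_draws, p))  # X | theta ~ N(theta, I)
    mc_mean = np.mean(np.sum(x**2, axis=1))              # average of X^T X over draws
    closed_form = (1.0 + tau**2) * p
    return mc_mean, closed_form

mc_mean, closed_form = simulate_marginal_norm()
print(f"Monte Carlo E[X^T X] = {mc_mean:.3f}, closed form = {closed_form:.3f}")
```

With enough draws the two numbers should agree closely; the gap shrinks at the usual $1/\sqrt{n}$ Monte Carlo rate.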

For the next part, the law of iterated expectations and the chi-squared distribution are again very useful. The bulk of the work comes from simply applying the law:
\[
E[\theta^T X] = E[E[\theta^T X \mid \theta]] = E[\theta^T E[X \mid \theta]] = E[\theta^T \theta] = \sum_i E[\theta_i^2] = \tau^2 p.
\]
Combining this with the original definition of $c$ shows that
\[
2c\,E[\theta^T X] = \frac{2\tau^2}{1+\tau^2}\, p.
\]
Now, if we combine these two facts with the original definition of $c$, we can simplify our original expression for the MSE:
\[
\begin{aligned}
MSE(\theta_{\text{post}})
&= MSE(X) - (2c - c^2)\,E[X^T X] + 2c\,E[X^T \theta] \\
&= MSE(X) - \frac{1+2\tau^2}{1+\tau^2}\, p + \frac{2\tau^2}{1+\tau^2}\, p \\
&= MSE(X) - \frac{p}{1+\tau^2}.
\end{aligned}
\]
So, as long as we know $\tau$, we have found a way to create a shrinkage estimator with a smaller MSE than the MLE. Note that the expectation here averages over $\theta$ drawn from the prior as well as over $X$. Also, this posterior mean approach creates something that is similar to the James-Stein estimator.

However, this example was not entirely realistic. What if we did not know $\tau$? Would we still do better than the MLE? It turns out that if we use some $Y$ to estimate the shrinkage factor, we arrive at the James-Stein estimator. Recall the following:
\[
\hat\theta_{\text{post}} = \left(1 - \frac{1}{1+\tau^2}\right) X.
\]
If we don't know $\tau$, we must estimate it. Consider a random variable $Y$ such that
\[
E[Y] = \frac{1}{1+\tau^2}.
\]
Now, let
\[
V = \frac{X^T X}{1+\tau^2}.
\]
By definition, $V$ has a chi-squared distribution with $p$ degrees of freedom, as it is equal to $\sum_i \left( X_i / \sqrt{1+\tau^2} \right)^2$. Now, take $1/V$. This has the inverse chi-squared distribution. By properties of the inverse chi-squared (for $p > 2$),
\[
E\left[\frac{1}{V}\right] = \frac{1}{p-2} = E\left[\frac{1+\tau^2}{X^T X}\right].
\]
Now, notice the following:
\[
E\left[\frac{p-2}{X^T X}\right] = (p-2)\,E\left[\frac{1}{X^T X}\right] = (p-2)\,\frac{1}{(p-2)(1+\tau^2)} = \frac{1}{1+\tau^2}.
\]
Therefore, since this yields the desired expectation,
\[
Y = \frac{p-2}{X^T X}.
\]
Now, using this value of $Y$ as an estimator for $\frac{1}{1+\tau^2}$ yields the following, which is equivalent to the James-Stein estimator:
\[
\hat\theta_{\text{empirical Bayes}} = \left(1 - \frac{p-2}{X^T X}\right) X.
\]
We call this empirical Bayes since here we used a Bayesian model and then played frequentist by estimating the hyperparameter using the data.
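To make the comparison concrete, the three estimators discussed in this section (the MLE $X$, the posterior mean with known $\tau$, and the James-Stein / empirical Bayes estimator) can be compared by Monte Carlo under the hierarchical model. This is a rough sketch under assumed settings; `compare_risks` and its default arguments are hypothetical names and values chosen only for illustration.

```python
import numpy as np

def compare_risks(p=10, tau=1.0, n_draws=100_000, seed=0):
    """Estimate E||estimate - theta||^2, averaged over theta ~ N(0, tau^2 I)
    and X | theta ~ N(theta, I), for three estimators of theta."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, tau, size=(n_draws, p))
    x = theta + rng.normal(size=(n_draws, p))

    c = 1.0 / (1.0 + tau**2)
    posterior_mean = (1.0 - c) * x                       # uses the true tau
    y = (p - 2) / np.sum(x**2, axis=1, keepdims=True)    # data-based estimate of c
    james_stein = (1.0 - y) * x                          # requires p > 2

    def mse(estimate):
        return np.mean(np.sum((estimate - theta) ** 2, axis=1))

    return {
        "mle": mse(x),
        "posterior_mean": mse(posterior_mean),
        "james_stein": mse(james_stein),
    }

risks = compare_risks()
for name, value in risks.items():
    print(f"{name:15s} {value:.3f}")
```

Under these settings the derivation above predicts a risk near $p$ for the MLE and near $p\,\tau^2/(1+\tau^2)$ for the posterior mean; the James-Stein estimate should land between the two, since it pays a price for not knowing $\tau$.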

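5.3 Linear Regression

5.3.1 Model and MLE

Here is a linear model for linear regression. Let's first write it for one data pair $(x, y)$:
\[
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon, \qquad \epsilon \sim N(0, \sigma^2).
\]
Now, for $n$ data points $(x_i, y_i)$, where $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$, we can write this in matrix form by stacking the data points as the rows of a matrix $\mathbf{X}$, with a leading column of ones for the intercept, so that $x_{ij}$ is the $j$-th feature of the $i$-th data point. Then, writing $\mathbf{y}$, $\boldsymbol{\beta}$ and $\boldsymbol{\epsilon}$ as column vectors, the matrix form of the linear regression model is
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},
\]
where
\[
\mathbf{y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix},
\quad
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix},
\quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix},
\quad\text{and}\quad
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}.
\]
Assume that each $\epsilon_i$ is normally distributed with mean zero and variance $\sigma^2$, independently, and so $\boldsymbol{\epsilon} \sim N(0, \sigma^2 \mathbf{I})$. We will now calculate the MLE $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$. We are using the notation where lowercase bold letters denote vectors and capital bold letters denote matrices.
\[
f(\mathbf{y}, \boldsymbol{\beta}) \propto \exp\left(-\frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{2\sigma^2}\right).
\]
Taking the log, we get, up to an additive constant,
\[
-\frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{2\sigma^2}.
\tag{5.4}
\]
Same drill: differentiate and set it to zero.
\[
\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0
\quad\Longrightarrow\quad
\mathbf{X}^T\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{y}
\quad\Longrightarrow\quad
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
\]

5.3.2 Relation to least squares

Let's say we wanted to calculate an estimate that minimizes the residual sum of squares (RSS):
\[
\boldsymbol{\beta}_{LS} = \arg\min_{\boldsymbol{\beta}'} RSS(\boldsymbol{\beta}') := \arg\min_{\boldsymbol{\beta}'} \sum_i (y_i - x_i^T \boldsymbol{\beta}')^2,
\]
where $x_i^T$ here denotes the $i$-th row of $\mathbf{X}$. As it turns out, $RSS(\boldsymbol{\beta}')$ is none other than $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}')^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}')$. But remember, because the noise terms are all drawn independently from the same mean-zero normal distribution, maximizing the log-likelihood boils down to minimizing the RSS. So in this special case, the least squares estimate is identical to the MLE.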
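The equivalence between the normal-equations solution and the least squares minimizer can be checked numerically. The sketch below uses simulated data; the sample size, coefficient values, and noise level are assumptions made purely for illustration, and `np.linalg.lstsq` stands in for a generic least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # (beta_0, ..., beta_p), hypothetical values
sigma = 0.5                                  # hypothetical noise standard deviation

features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])  # leading column of ones for the intercept
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

# MLE via the normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares estimate from a generic solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss(beta):
    """Residual sum of squares (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return residual @ residual

print("normal equations:", beta_hat)
print("lstsq:           ", beta_ls)
print("RSS at beta_hat:", rss(beta_hat), " RSS at beta_true:", rss(beta_true))
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice; the closed form $(X^T X)^{-1} X^T y$ is the same quantity.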

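5.3.3 Expectation and Variance of $\hat{\boldsymbol{\beta}}$

Now, we want to find $E[\hat{\boldsymbol{\beta}}]$ and $\operatorname{var}[\hat{\boldsymbol{\beta}}]$. Let's put down some ground rules for taking expectations of vector-valued random variables. Say $\mathbf{z} = \mathbf{A}\mathbf{y}$, where $\mathbf{A}$ is a fixed matrix. Then
\[
E[\mathbf{z}] = \mathbf{A}\,E[\mathbf{y}] \quad\text{and}\quad \operatorname{var}(\mathbf{z}) = \mathbf{A}\operatorname{var}(\mathbf{y})\,\mathbf{A}^T.
\]
Recall that $E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}$ and $\operatorname{var}(\mathbf{y}) = \sigma^2 \mathbf{I}$. Then
\[
E[\hat{\boldsymbol{\beta}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\,\boldsymbol{\beta} = \boldsymbol{\beta},
\]
\[
\operatorname{var}[\hat{\boldsymbol{\beta}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \operatorname{var}[\mathbf{y}]\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}.
\]
Conclusion:
\[
\hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta},\ \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\right).
\]
Note: this is not approximate, but exact!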
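The mean and covariance of $\hat{\boldsymbol{\beta}}$ can also be checked by repeatedly redrawing the noise for a fixed design matrix. This is only an illustrative sketch: the function `sampling_distribution`, the design, the coefficients, and the number of replications are all assumed values rather than anything from the lecture.

```python
import numpy as np

def sampling_distribution(n=50, sigma=1.0, n_reps=20_000, seed=0):
    """Simulate beta_hat = (X^T X)^{-1} X^T y over many noise draws with X fixed.
    Returns the true beta, the empirical mean and covariance of beta_hat,
    and the theoretical covariance sigma^2 (X^T X)^{-1}."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, -2.0, 0.5])                           # hypothetical coefficients
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design matrix
    xtx_inv = np.linalg.inv(X.T @ X)
    A = xtx_inv @ X.T                                           # beta_hat = A y

    noise = rng.normal(0.0, sigma, size=(n_reps, n))
    Y = X @ beta + noise      # each row is one simulated response vector y
    beta_hats = Y @ A.T       # each row is the corresponding beta_hat

    emp_mean = beta_hats.mean(axis=0)
    emp_cov = np.cov(beta_hats, rowvar=False)
    theo_cov = sigma**2 * xtx_inv
    return beta, emp_mean, emp_cov, theo_cov

beta, emp_mean, emp_cov, theo_cov = sampling_distribution()
print("true beta:      ", beta)
print("mean of beta_hat:", emp_mean)
print("max |empirical cov - theoretical cov|:", np.abs(emp_cov - theo_cov).max())
```

The empirical mean should sit close to $\boldsymbol{\beta}$ and the empirical covariance close to $\sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}$, with the remaining discrepancy attributable to Monte Carlo error.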
