Lecture 5: A step back

Wilfrid Thornton

2 Last time

Last time we talked about a practical application of the shrinkage idea, introducing James-Stein estimation and its extension. We saw our first connection between shrinkage and selection (there will be a couple more, stay tuned!). We then motivated the shrinkage idea from the standpoint of a penalized least squares criterion.

3 Regression revisited

Recall the basic assumptions for the normal linear model. We have a response $y$ and (possible) predictor variables $x_1, x_2, \ldots, x_p$. We hypothesize a linear model for the response

$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

where $\epsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.

4 Regression revisited

Then, we collect data... We have $n$ observations $y_1, \ldots, y_n$, and each response $y_i$ is associated with the predictors $x_{i1}, x_{i2}, \ldots, x_{ip}$. Then, according to our linear model,

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

and we assume the $\epsilon_i$ are independent, each having a normal distribution with mean 0 and variance $\sigma^2$.
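To make the data-collection step concrete, here is a minimal simulation sketch of the normal linear model. The sample size, the "true" coefficients, and the error standard deviation are all hypothetical values chosen for illustration; nothing in the lecture fixes them.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable

n, p = 100, 3                       # hypothetical sample size and number of predictors
beta = np.array([2.0, -1.0, 0.5])   # hypothetical "true" coefficients
sigma = 1.0                         # hypothetical error standard deviation

X = rng.normal(size=(n, p))         # row i holds x_i1, ..., x_ip
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # independent N(0, sigma^2) errors
y = X @ beta + eps                  # y_i = beta_1 x_i1 + ... + beta_p x_ip + eps_i
```

Drawing the predictors from a normal distribution is only a convenience here; the model itself places distributional assumptions on the errors, not on the predictors.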

5 Regression revisited

To estimate the coefficients $\beta_1, \beta_2, \ldots, \beta_p$ we turn to OLS, ordinary least squares (this is also maximum likelihood under our normal linear model). We want to choose $\beta_1, \beta_2, \ldots, \beta_p$ to minimize the OLS criterion

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}]^2$$

You recall that this is just the sum of squared errors that we incur if we predict $y_i$ with $\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$.
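The criterion is easy to evaluate directly. The sketch below computes it for a small made-up data set; the function name `ols_criterion` and the toy numbers are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def ols_criterion(beta, X, y):
    """Sum of squared errors when y_i is predicted by sum_k beta_k * x_ik."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Hypothetical toy data: n = 4 observations, p = 2 predictors.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # constructed so that y = 1*x1 + 2*x2 exactly

perfect = ols_criterion(np.array([1.0, 2.0]), X, y)   # coefficients that fit exactly
naive = ols_criterion(np.array([0.0, 0.0]), X, y)     # predicts 0 for every y_i
print(perfect, naive)
```

With all coefficients set to zero the criterion is simply the sum of the squared responses, which gives a useful reference point for how much a fit has reduced the error.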

6 The OLS criterion as a function

As a function of the variables $\beta_1, \beta_2, \ldots, \beta_p$, the OLS criterion

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}]^2$$

is quadratic.

7 One predictor

So, if we only have a single predictor variable, we have the usual univariate quadratic polynomial

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1}]^2 = \beta_1^2 \sum_i x_{i1}^2 - 2\beta_1 \sum_i y_i x_{i1} + \sum_i y_i^2 = a\beta_1^2 - b\beta_1 + c$$

8 One predictor

In this case, the function looks something like the bowl shape on the right. The minimum occurs where the slope is zero; that is, where the derivative of the function $a\beta_1^2 - b\beta_1 + c$ is zero.

[Figure: the OLS criterion plotted against beta 1]
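As a sanity check on the one-predictor algebra, the sketch below expands the criterion into a quadratic in $\beta_1$ and locates the flat spot. It assumes the coefficients are $a = \sum x_{i1}^2$, $b = 2\sum y_i x_{i1}$ and $c = \sum y_i^2$, which is what expanding the square gives; the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                          # hypothetical single predictor
y = 1.5 * x + rng.normal(scale=0.5, size=20)     # hypothetical responses

# Coefficients of the quadratic a*beta^2 - b*beta + c obtained by expanding the square.
a = np.sum(x**2)
b = 2.0 * np.sum(y * x)
c = np.sum(y**2)

def criterion(beta1):
    """OLS criterion for a single predictor."""
    return np.sum((y - beta1 * x) ** 2)

# The direct sum and the expanded quadratic agree on a grid of beta values.
beta_grid = np.linspace(-3.0, 3.0, 7)
direct = np.array([criterion(b1) for b1 in beta_grid])
quadratic = a * beta_grid**2 - b * beta_grid + c
assert np.allclose(direct, quadratic)

# The derivative 2*a*beta - b is zero at beta = b / (2a), the bottom of the bowl.
beta1_hat = b / (2.0 * a)
print(beta1_hat)
```

Because $a$ is a sum of squares it is positive whenever the predictor is not identically zero, so the parabola opens upward and the flat spot is a minimum.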

9 Two predictors

With two predictors, we again get a quadratic function; this time it looks like a bowl

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]^2$$

To find the minimum, we search for the place where the partial derivatives with respect to $\beta_1$ and $\beta_2$ are both zero.

[Figure: the OLS criterion plotted over beta 1 and beta 2]

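10 Two predictors

Following our noses a bit, we can just directly take the partial derivatives.

For $\beta_1$:

$$\frac{\partial}{\partial \beta_1} \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]^2 = -2 \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]\, x_{i1}$$

$$= -2 \sum_i y_i x_{i1} + 2\beta_1 \sum_i x_{i1}^2 + 2\beta_2 \sum_i x_{i2} x_{i1}$$

For $\beta_2$:

$$\frac{\partial}{\partial \beta_2} \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]^2 = -2 \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]\, x_{i2}$$

$$= -2 \sum_i y_i x_{i2} + 2\beta_1 \sum_i x_{i1} x_{i2} + 2\beta_2 \sum_i x_{i2}^2$$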
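One way to double-check hand-computed partial derivatives is to compare them with finite differences. The sketch below does this for the two-predictor criterion, using the form $-2\sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]\,x_{ik}$ for the partial with respect to $\beta_k$. The data, the evaluation point, and the step size `h` are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)                                   # hypothetical predictors
x2 = rng.normal(size=n)
y = 1.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)   # hypothetical responses

def criterion(b1, b2):
    """Two-predictor OLS criterion."""
    return np.sum((y - b1 * x1 - b2 * x2) ** 2)

def partials(b1, b2):
    """Analytic partial derivatives with respect to beta_1 and beta_2."""
    r = y - b1 * x1 - b2 * x2
    return -2.0 * np.sum(r * x1), -2.0 * np.sum(r * x2)

# Central differences at an arbitrary point (b1, b2).
b1, b2, h = 0.3, -0.7, 1e-4
numeric_d1 = (criterion(b1 + h, b2) - criterion(b1 - h, b2)) / (2.0 * h)
numeric_d2 = (criterion(b1, b2 + h) - criterion(b1, b2 - h)) / (2.0 * h)
analytic_d1, analytic_d2 = partials(b1, b2)

print(numeric_d1, analytic_d1)
print(numeric_d2, analytic_d2)
```

For a quadratic function the central difference has no truncation error, so any disagreement beyond floating-point rounding would point to a mistake in the algebra.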

11 Two predictors

Again, the minimum of our bowl occurs at the flat spot, where these derivatives are zero. This gives us two linear equations in two unknowns

$$\beta_1 \sum_i x_{i1}^2 + \beta_2 \sum_i x_{i2} x_{i1} = \sum_i y_i x_{i1}$$

and

$$\beta_1 \sum_i x_{i1} x_{i2} + \beta_2 \sum_i x_{i2}^2 = \sum_i y_i x_{i2}$$

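12 Two predictors

Writing these two equations using matrix notation we find

$$\begin{pmatrix} \sum_i x_{i1}^2 & \sum_i x_{i1} x_{i2} \\ \sum_i x_{i1} x_{i2} & \sum_i x_{i2}^2 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum_i y_i x_{i1} \\ \sum_i y_i x_{i2} \end{pmatrix}$$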

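13 Two predictors

To make the final step, let's define the design matrix, a vector representing our responses, and a vector to hold our parameter estimates

$$X = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$$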

14 Two predictors

With these definitions, you should convince yourself that our linear equations from two slides back are equivalent to the form

$$X^t X \beta = X^t y$$

and that we can solve them by inverting $X^t X$, or

$$\hat\beta = (X^t X)^{-1} X^t y$$

where we have defined $\hat\beta = (\hat\beta_1, \hat\beta_2)^t$.
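The "convince yourself" step can also be checked numerically. The sketch below builds the 2 x 2 system from explicit sums, confirms that it matches $X^t X$ and $X^t y$, and then solves for the coefficients. The simulated data are hypothetical, and comparing against `np.linalg.lstsq` is just an independent cross-check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 2))                                     # design matrix: n rows, 2 columns
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=n)   # hypothetical responses

# The two linear equations written out with sums over i.
x1, x2 = X[:, 0], X[:, 1]
A = np.array([[np.sum(x1 * x1), np.sum(x1 * x2)],
              [np.sum(x1 * x2), np.sum(x2 * x2)]])
rhs = np.array([np.sum(y * x1), np.sum(y * x2)])

# They coincide with the matrix products X^t X and X^t y.
assert np.allclose(A, X.T @ X)
assert np.allclose(rhs, X.T @ y)

# Solve X^t X beta = X^t y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Independent cross-check with NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```

The sketch calls `np.linalg.solve` instead of forming the inverse explicitly; the two are mathematically equivalent here, and solving the system is the usual numerical choice.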

15 That was last quarter... now for something new

Last time, we discussed how a penalized least squares criterion can give rise to simple shrinkage estimates; that is, the formula

$$\tilde\beta_k = \frac{1}{1+\lambda}\,\hat\beta_k$$

can be derived if we minimize the following penalized criterion (where we assume the predictors are orthonormal)

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}]^2 + \lambda \sum_{k=1}^p \beta_k^2$$

While that gives us some basic mechanics of how we encounter simple shrinkage, it moves all the mystery from the original formula for shrinking to this funny penalty.
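For orthonormal predictors, setting the gradient of the penalized criterion to zero gives the OLS coefficients divided by $1 + \lambda$. The sketch below checks this numerically. Building orthonormal columns with a QR decomposition, the value of `lam`, and the simulated responses are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 3, 2.0            # hypothetical sizes and penalty weight

# The Q factor of a reduced QR decomposition has orthonormal columns, so X^t X = I.
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = Q
y = rng.normal(size=n)            # hypothetical responses

beta_ols = X.T @ y                       # OLS solution when X^t X = I
beta_shrunk = beta_ols / (1.0 + lam)     # simple shrinkage of each coefficient

# Gradient of the penalized criterion at the shrunken coefficients.
grad = -2.0 * X.T @ (y - X @ beta_shrunk) + 2.0 * lam * beta_shrunk
print(grad)
```

Every coefficient is scaled by the same factor $1/(1+\lambda)$, so in the orthonormal case this penalty shrinks all coefficients proportionally and never sets one exactly to zero for finite $\lambda$.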

16 Penalties

As we mentioned last lecture, there is nothing special about the orthogonal problems when it comes to introducing penalties like this. Before we have a better sense of why we want to penalize in the first place, let's make sure we understand how it changes our OLS solutions.

17 Penalties

The new criterion we want to minimize is

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}]^2 + \lambda \sum_{k=1}^p \beta_k^2$$

Notice that it is again a quadratic function! By adding a penalty that is the sum of squared terms, we are just adding two quadratics. So our path to generate the normal equations ought to be similar; that is, we have a bowl and we want to find the minimum of the bowl.

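18 Two predictors

Following our noses a bit, we can just directly take the partial derivatives.

For $\beta_1$:

$$\frac{\partial}{\partial \beta_1} \Big\{ \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]^2 + \lambda(\beta_1^2 + \beta_2^2) \Big\} = -2 \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]\, x_{i1} + 2\lambda\beta_1$$

$$= -2 \sum_i y_i x_{i1} + 2\beta_1 \Big( \sum_i x_{i1}^2 + \lambda \Big) + 2\beta_2 \sum_i x_{i2} x_{i1}$$

For $\beta_2$:

$$\frac{\partial}{\partial \beta_2} \Big\{ \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]^2 + \lambda(\beta_1^2 + \beta_2^2) \Big\} = -2 \sum_i [y_i - \beta_1 x_{i1} - \beta_2 x_{i2}]\, x_{i2} + 2\lambda\beta_2$$

$$= -2 \sum_i y_i x_{i2} + 2\beta_1 \sum_i x_{i1} x_{i2} + 2\beta_2 \Big( \sum_i x_{i2}^2 + \lambda \Big)$$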

19 Two predictors

Again, the minimum of our bowl occurs at the flat spot, where these derivatives are zero. This gives us two linear equations in two unknowns

$$\beta_1 \Big( \sum_i x_{i1}^2 + \lambda \Big) + \beta_2 \sum_i x_{i1} x_{i2} = \sum_i y_i x_{i1}$$

and

$$\beta_1 \sum_i x_{i1} x_{i2} + \beta_2 \Big( \sum_i x_{i2}^2 + \lambda \Big) = \sum_i y_i x_{i2}$$

20 Two predictors

You should now be able to repeat the steps we followed for the unpenalized case to derive the associated linear equations we have to solve; namely

$$(X^t X + \lambda I_{2 \times 2})\,\beta = X^t y$$
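The penalized equations can be solved the same way as the unpenalized ones. The sketch below is a minimal illustration; the helper name `ridge_solve`, the simulated data, and the value of `lam` are hypothetical. It also checks that the gradient of the penalized criterion vanishes at the solution and that a penalty weight of zero recovers ordinary least squares.

```python
import numpy as np

def ridge_solve(X, y, lam):
    """Solve (X^t X + lam * I) beta = X^t y for beta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n = 60
X = rng.normal(size=(n, 2))                                    # hypothetical design matrix
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.5, size=n)   # hypothetical responses

lam = 5.0
beta_penalized = ridge_solve(X, y, lam)
beta_ols = ridge_solve(X, y, 0.0)        # lam = 0 gives back the OLS equations

# Gradient of the penalized criterion at the penalized solution.
grad = -2.0 * X.T @ (y - X @ beta_penalized) + 2.0 * lam * beta_penalized
print(beta_ols, beta_penalized, grad)
```

Adding $\lambda$ to the diagonal also makes the system solvable for any $\lambda > 0$, even when $X^t X$ itself is singular.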

21 Penalties

We saw in the context of the orthogonal problem that the amount of shrinkage was controlled by $\lambda$; the larger it is, the more we shrink the coefficients toward zero. This is true for the general case as well, and you can see it either by looking at the penalty directly or by considering an equivalent form of the problem.
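A quick numerical illustration of the role of $\lambda$: the sketch below solves the penalized equations $(X^t X + \lambda I)\beta = X^t y$ for an increasing grid of penalty weights and records the sum of squared coefficients. The data and the grid of $\lambda$ values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 4
X = rng.normal(size=(n, p))                                   # hypothetical, non-orthogonal predictors
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(size=n)  # hypothetical responses

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    # Penalized solution for this value of lambda.
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    norms.append(float(np.sum(beta**2)))
    print(f"lambda = {lam:7.1f}   sum of squared coefficients = {norms[-1]:.4f}")
```

The sum of squared coefficients decreases as $\lambda$ grows. Individual coefficients need not shrink monotonically when the predictors are correlated, which is one way the general case differs from the orthogonal one.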

22 Penalties

Minimizing the penalized OLS criterion

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}]^2 + \lambda \sum_{k=1}^p \beta_k^2$$

is equivalent to minimizing the ordinary OLS criterion

$$\sum_{i=1}^n [y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}]^2$$

subject to the constraint that

$$\sum_{k=1}^p \beta_k^2 \le s$$

for some value of $s$.
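The equivalence can be probed numerically, at least in one direction. The sketch below takes the penalized solution for a given $\lambda$, sets $s$ equal to its sum of squared coefficients, and checks that no randomly drawn coefficient vector satisfying the constraint achieves a smaller residual sum of squares. The data, the value of `lam`, and the number of random candidates are hypothetical, and a random search is only a spot check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 60, 2, 10.0
X = rng.normal(size=(n, p))                           # hypothetical design matrix
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)    # hypothetical responses

def rss(beta):
    """Ordinary (unpenalized) OLS criterion."""
    r = y - X @ beta
    return float(r @ r)

# Penalized solution and the constraint level s that it implies.
beta_penalized = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
s = float(beta_penalized @ beta_penalized)

# Random candidates with sum of squares at most s: unit directions times radii <= sqrt(s).
directions = rng.normal(size=(5000, p))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = np.sqrt(s) * rng.uniform(size=(5000, 1))
candidates = directions * radii

best_candidate_rss = min(rss(b) for b in candidates)
print(rss(beta_penalized), best_candidate_rss)
```

The reason no candidate can win: for any $\beta$ with $\sum_k \beta_k^2 \le s$, the penalized criterion at $\beta$ is at least its value at the penalized solution, and the penalty term at $\beta$ is no larger than $\lambda s$, so the residual sum of squares at $\beta$ cannot be smaller.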
