STAT 135 Lab 13 (Review): Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter, May 5, 2015
Linear Regression Review
Linear Regression Review

$$y = \beta_0 + \beta_1 x + \epsilon$$

Suppose that there is some global linear relationship between height and weight in the population. For example,

$$\text{Weight} = \beta_0 + \beta_1 \,\text{Height} + \epsilon$$

Obviously height and weight don't fall on a perfectly straight line: there is some random noise/deviation from the line ($\epsilon$ is a random variable with $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2$). Moreover, we don't observe everyone in the population, so there is no way we can figure out exactly what $\beta_0$ and $\beta_1$ are, but what we can do is get estimates from a sample.
Linear Regression Review

We observe a sample (blue points) and fit the best line through the sample. The red fitted line corresponds to

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated from our sample using least squares (finding the $\beta$ values that minimize the distance from the observed points to the line).
Linear Regression Review

We showed that we can write our linear regression model in matrix form:

$$Y = X\beta + \epsilon$$

and the least squares estimate of $\beta$ is given by

$$\hat{\beta} = \arg\min_\beta \|Y - X\beta\|_2^2 = (X^TX)^{-1}X^TY$$

We further showed that $\hat{\beta}$ is unbiased, $E(\hat{\beta}) = \beta$, and that its variance is given by

$$Var(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$$
Linear Regression Review

The residuals are then defined to be the observed difference between the true $y$ (blue dot) and the fitted $\hat{y}$ (corresponding point on the red line):

$$e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$$

which is different from the unobservable error

$$\epsilon_i = y_i - \beta_0 - \beta_1 x_i$$

(this corresponds to the deviation of the observed $y_i$ from the true population line, which we don't know!)
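To make the least squares recipe concrete, here is a minimal Python/NumPy sketch (not part of the original lab): it simulates a sample, computes $\hat{\beta} = (X^TX)^{-1}X^TY$ from the normal equations, and forms the residuals. The sample size, true coefficients, and noise level are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations from y = 2 + 3x + noise
# (the true beta values and sigma are arbitrary choices for this sketch)
n = 100
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 2.0, size=n)             # noise with E(eps) = 0
y = 2 + 3 * x + eps

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values and residuals
y_hat = X @ beta_hat
residuals = y - y_hat

print("beta_hat:", beta_hat)                  # should be close to (2, 3)
print("sum of residuals:", residuals.sum())   # ~0 when an intercept is included
```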
Multivariate Random Variables
Multivariate Random Variables

Suppose we have n random variables, $Y_1, \dots, Y_n$, such that $E(Y_i) = \mu_i$ and $Cov(Y_i, Y_j) = \sigma_{ij}$. Suppose that we want to consider the $Y_i$'s jointly as a single multivariate random variable,

$$Y = (Y_1, \dots, Y_n)$$

Then this multivariate random variable $Y$ has mean vector and covariance matrix:

$$\text{mean vector: } E(Y) = \mu = (\mu_1, \dots, \mu_n)$$

$$\text{covariance matrix: } Cov(Y) = \Sigma_{Y,Y} = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} & \dots & \sigma_{1,n} \\ \sigma_{2,1} & \sigma_2^2 & \dots & \sigma_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n,1} & \sigma_{n,2} & \dots & \sigma_n^2 \end{pmatrix}$$
Multivariate Random Variables

$$E(Y) = \mu \qquad Cov(Y) = \Sigma$$

Suppose we have the following linear transformation of $Y$:

$$Z = b + AY$$

where $b$ is a fixed vector and $A$ a fixed matrix. What is the mean vector and covariance matrix of $Z$?

$$E(Z) = E(b + AY) = b + AE(Y) = b + A\mu$$

$$Cov(Z) = Cov(b + AY) = Cov(AY) = A\,Cov(Y)\,A^T = A\Sigma A^T$$
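A quick way to see these two formulas in action is a small Monte Carlo check. The following Python/NumPy sketch (illustrative only, with made-up values for $\mu$, $\Sigma$, $b$, and $A$) compares the empirical moments of $Z = b + AY$ with $b + A\mu$ and $A\Sigma A^T$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary mean vector and covariance matrix for this sketch
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.4],
                  [0.1, 0.4, 1.5]])

# Fixed linear transformation Z = b + A Y
b = np.array([10.0, -5.0])
A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])

# Simulate many draws of Y and transform them
Y = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = b + Y @ A.T

# Empirical moments should be close to the theoretical formulas
print("E(Z) empirical:  ", Z.mean(axis=0))
print("E(Z) theoretical:", b + A @ mu)
print("Cov(Z) empirical:\n", np.cov(Z, rowvar=False))
print("Cov(Z) theoretical:\n", A @ Sigma @ A.T)
```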
Multivariate Random Variables

Suppose that $Y$ is a multivariate random vector, $Y = (Y_1, \dots, Y_n)$. Recall that for scalars, a quadratic transformation of $y$ is given by $x = ay^2 \in \mathbb{R}$. The equivalent form for vectors is

$$X = Y^TAY \in \mathbb{R}$$

This is referred to as a random quadratic form in $Y$.
Multivariate Random Variables

$X = Y^TAY \in \mathbb{R}$ is a random quadratic form in $Y$. How can we calculate the expected value of the quadratic form of a random variable? We can use the following facts:

1. Trace and expectation are both linear operators (and so can be interchanged).
2. $X = Y^TAY \in \mathbb{R}$ is a real number (not a matrix or vector), and if $x \in \mathbb{R}$ then $tr(x) = x$.
3. If $A$ and $B$ are matrices of appropriate dimension, then $tr(AB) = tr(BA)$, where the trace of a matrix is the sum of its diagonal entries.
Multivariate Random Variables

So, using:

1. Trace and expectation are both linear operators (and so can be interchanged).
2. If $x \in \mathbb{R}$ then $tr(x) = x$.
3. If $A$ and $B$ are matrices, then $tr(AB) = tr(BA)$.

we have

$$
\begin{aligned}
E\left[Y^TAY\right] &= E\left[tr(Y^TAY)\right] && \text{fact (2)} \\
&= E\left[tr(AYY^T)\right] && \text{fact (3)} \\
&= tr\left(E\left[AYY^T\right]\right) && \text{fact (1)} \\
&= tr\left(AE\left[YY^T\right]\right) && \text{fact (1)} \\
&= tr\left(A\left(\Sigma + \mu\mu^T\right)\right) && \text{calculating expectation} \\
&= tr(A\Sigma) + tr(A\mu\mu^T) && \text{fact (1)} \\
&= tr(A\Sigma) + tr(\mu^TA\mu) && \text{fact (3)} \\
&= tr(A\Sigma) + \mu^TA\mu && \text{fact (2)}
\end{aligned}
$$
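The identity $E[Y^TAY] = tr(A\Sigma) + \mu^TA\mu$ can likewise be checked by simulation. This Python/NumPy sketch uses arbitrary illustrative choices of $\mu$, $\Sigma$, and $A$ (none of them from the lab).

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary mu, Sigma, and symmetric A for this sketch
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])

# Simulate the quadratic form Y^T A Y for many draws of Y
Y = rng.multivariate_normal(mu, Sigma, size=500_000)
quad_forms = np.einsum("ij,jk,ik->i", Y, A, Y)   # Y_i^T A Y_i for each draw

print("E[Y'AY] empirical:     ", quad_forms.mean())
print("tr(A Sigma) + mu'A mu: ", np.trace(A @ Sigma) + mu @ A @ mu)
```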
Example: Residual sum of squares

The expected value of a quadratic form of a random variable is given by

$$E\left[Y^TAY\right] = tr(A\Sigma) + \mu^TA\mu$$

We can use this to show that an unbiased estimate of $\sigma^2$ is given by

$$\hat{\sigma}^2 = \frac{RSS}{n - p}$$

where

$$RSS = \|e\|_2^2 = e^Te$$

is a quadratic form.
Example: Residual sum of squares

Recall that the residuals are defined by

$$e = Y - \hat{Y} = Y - X\hat{\beta} = Y - X(X^TX)^{-1}X^TY = (I - X(X^TX)^{-1}X^T)Y = P_XY$$

where $P_X := I - X(X^TX)^{-1}X^T$ is a projection matrix onto the space orthogonal to $X$ (check: $P_XX = 0$). Note that a projection matrix is a matrix, $A$, that satisfies

$$A^T = A \qquad AA = A^2 = A$$
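These properties are easy to verify numerically. The sketch below (illustrative, with an arbitrary simulated design matrix) builds $P_X$ and checks $P_XX = 0$, symmetry, idempotence, and, anticipating the next slides, $tr(P_X) = n - p$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary design matrix for this sketch: n = 20 observations, p = 3 columns
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# P_X = I - X (X^T X)^{-1} X^T projects onto the space orthogonal to the columns of X
P_X = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(P_X @ X, 0))            # P_X X = 0
print(np.allclose(P_X.T, P_X))            # symmetric: P_X^T = P_X
print(np.allclose(P_X @ P_X, P_X))        # idempotent: P_X^2 = P_X
print(np.isclose(np.trace(P_X), n - p))   # tr(P_X) = n - p
```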
Example: Residual sum of squares

$$e = (I - X(X^TX)^{-1}X^T)Y = P_XY$$

Recall that the residual sum of squares is given by

$$RSS = \|e\|_2^2 = e^Te = (P_XY)^T(P_XY) = Y^TP_X^TP_XY = Y^TP_XY$$

Thus, since $E\left[Y^TAY\right] = tr(A\Sigma) + \mu^TA\mu$, the expected RSS is given by

$$E(RSS) = E(Y^TP_XY) = \sigma^2 tr(P_X) + E(Y)^TP_XE(Y) \qquad (\Sigma = \sigma^2 I_n)$$

$$= \sigma^2(n - p)$$

How did we get that last equality!?
Example: Residual sum of squares

$$E(RSS) = \sigma^2 tr(P_X) + E(Y)^TP_XE(Y) = \sigma^2(n - p)$$

To get the last equality, we use our newly acquired knowledge of projection matrices and the trace operator:

$P_X$ is a projection matrix orthogonal to $X$, thus $P_XE(Y) = P_XX\beta = 0$, so $E(Y)^TP_XE(Y) = 0$.

The trace of an $m \times m$ identity matrix is $m$, so

$$
\begin{aligned}
tr(P_X) &= tr(I_n - X(X^TX)^{-1}X^T) \\
&= n - tr(X(X^TX)^{-1}X^T) && \text{(linearity of trace)} \\
&= n - tr((X^TX)^{-1}X^TX) && (tr(AB) = tr(BA)) \\
&= n - tr(I_p) \\
&= n - p
\end{aligned}
$$
Example: Residual sum of squares

Note that we have shown that

$$E(RSS) = \sigma^2(n - p)$$

which tells us that

$$\hat{\sigma}^2 = \frac{RSS}{n - p}$$

is an unbiased estimate of $\sigma^2$.
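A short simulation (with illustrative values of $n$, $p$, $\sigma$, and $\beta$, not from the lab) suggests how this unbiasedness shows up in practice: the average of $RSS/(n-p)$ over many simulated datasets should sit close to $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulation check: E[RSS / (n - p)] should equal sigma^2
n, p, sigma = 50, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0, 0.5])

sigma2_hats = []
for _ in range(5000):
    y = X @ beta + rng.normal(0, sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    sigma2_hats.append(rss / (n - p))

print("mean of sigma^2 estimates:", np.mean(sigma2_hats))  # close to 2.25
print("true sigma^2:             ", sigma ** 2)
```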
Prediction
Prediction

Suppose we want to predict/fit the responses $Y_1, \dots, Y_n$ corresponding to the observed predictors $x_1, \dots, x_n$ (each $x_i$ is a $1 \times p$ vector), which form the rows of the design matrix $X$. Then our predicted $Y_i$'s (note that these are the same $Y$'s used to define the model) are given by

$$\hat{Y}_i = x_i\hat{\beta} = x_i(X^TX)^{-1}X^TY$$

Since these $\hat{Y}_i$'s are random variables, we might be interested in calculating the variance of these predictions.
Prediction

Our predicted $Y_i$'s are given by

$$\hat{Y}_i = x_i\hat{\beta} = x_i(X^TX)^{-1}X^TY$$

The variance of the $\hat{Y}_i$'s is given by

$$
\begin{aligned}
Var(\hat{Y}_i) &= Var(x_i(X^TX)^{-1}X^TY) \\
&= x_i(X^TX)^{-1}X^T \, Var(Y) \, X(X^TX)^{-1}x_i^T \\
&= \sigma^2 x_i(X^TX)^{-1}X^TX(X^TX)^{-1}x_i^T \\
&= \sigma^2 x_i(X^TX)^{-1}x_i^T
\end{aligned}
$$
Prediction

$$Var(\hat{Y}_i) = \sigma^2 x_i(X^TX)^{-1}x_i^T$$

Thus the average variance across the n observations is:

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n Var(\hat{Y}_i) &= \frac{\sigma^2}{n}\sum_{i=1}^n x_i(X^TX)^{-1}x_i^T \\
&= \frac{\sigma^2}{n}\, tr(X(X^TX)^{-1}X^T) \\
&= \frac{\sigma^2}{n}\, tr((X^TX)^{-1}X^TX) \\
&= \frac{\sigma^2}{n}\, tr(I_p) \\
&= \sigma^2 \frac{p}{n}
\end{aligned}
$$

So the more variables we have in our model, the more variable our fitted/predicted values will be!
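The $\sigma^2 p/n$ result can be confirmed directly, since $Var(\hat{Y}_i)$ is $\sigma^2$ times the $i$-th diagonal entry of the hat matrix $H = X(X^TX)^{-1}X^T$ and $tr(H) = p$. The sketch below uses arbitrary illustrative values of $n$, $p$, and $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Check that the average variance of the fitted values equals sigma^2 * p / n
n, p, sigma = 200, 6, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Var(Y_hat_i) = sigma^2 * x_i (X^T X)^{-1} x_i^T, i.e. sigma^2 times the
# diagonal of the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
avg_var = sigma ** 2 * np.mean(np.diag(H))

print("average prediction variance:", avg_var)
print("sigma^2 * p / n:            ", sigma ** 2 * p / n)
```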
Logistic Regression
Logistic Regression

For linear regression, we have $Y = X\beta + \epsilon$ and $Y$ is continuous, such that

$$Y_i \mid x_i \sim N(\beta^Tx_i, \sigma^2)$$

where $x_i$ is the $i$th row/observation in $X$. We want to use logistic regression if $Y$ doesn't take continuous values, but is instead binary, i.e. $Y_i \in \{0, 1\}$. The above linear model no longer makes sense, since a linear combination of our (continuous) $x$'s is very unlikely to be equal to 0 or 1.

First step: think of a distribution for $Y_i \mid x_i$ that makes more sense for binary $Y$...
Logistic Regression

Since $Y_i$ is either 0 or 1, we can model it as a Bernoulli random variable

$$Y_i \mid x_i \sim \text{Bernoulli}(p(x_i, \beta))$$

where the probability that $Y_i = 1$ is given by some function of our predictors, $p(x_i, \beta)$. We need our probability $p(x_i, \beta)$ to be in $[0, 1]$ and for it to depend on the linear combination of our predictors, $\beta^Tx_i$, in some way. We choose:

$$p(x_i, \beta) = \frac{e^{\beta^Tx_i}}{1 + e^{\beta^Tx_i}}$$

This is a nice choice since

$\beta^Tx_i \geq 0$ implies that $p(x_i, \beta) \geq 0.5$, so $Y_i = 1$ is most likely

$\beta^Tx_i \leq 0$ implies that $p(x_i, \beta) \leq 0.5$, so $Y_i = 0$ is most likely
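As a concrete illustration (not from the original lab), here is a small Python sketch of this probability function and the 0.5 threshold behaviour. The $\beta$ vector and the example rows of $X$ are made-up values, and the code uses the algebraically equivalent form $1/(1 + e^{-\beta^Tx})$ for numerical stability.

```python
import numpy as np

def logistic_prob(X, beta):
    """p(x_i, beta) = exp(beta^T x_i) / (1 + exp(beta^T x_i)) for each row x_i of X."""
    eta = X @ beta                       # linear predictor beta^T x_i
    return 1.0 / (1.0 + np.exp(-eta))    # equivalent, numerically safer form

# Illustrative values: intercept plus two predictors
beta = np.array([0.5, 2.0, -1.0])
X = np.array([[1.0, 1.0, 0.0],     # beta^T x =  2.5 -> p > 0.5, predict Y = 1
              [1.0, 0.0, 2.0],     # beta^T x = -1.5 -> p < 0.5, predict Y = 0
              [1.0, 0.25, 0.5]])   # beta^T x =  0.5 -> p slightly above 0.5

p = logistic_prob(X, beta)
print(p)
print("predictions:", (p > 0.5).astype(int))
```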
Logistic Regression

We have the following logistic regression model:

$$Y_i \mid x_i \sim \text{Bernoulli}\left(\frac{e^{\beta^Tx_i}}{1 + e^{\beta^Tx_i}}\right)$$

If we had a new sample $x$ and we knew the $\beta$ vector, we could calculate the probability that the corresponding $Y = 1$:

$$P(Y = 1 \mid x) = p(x, \beta) = \frac{e^{\beta^Tx}}{1 + e^{\beta^Tx}}$$

If this probability is greater than 0.5, we would set $\hat{Y} = 1$. Otherwise, we would set $\hat{Y} = 0$. But we don't know $\beta$: we need to estimate the unknown $\beta$ vector (just like with linear regression).
Logistic Regression

Let's try to calculate the MLE for $\beta$. The likelihood function is given by

$$
\begin{aligned}
lik(\beta) &= \prod_{i=1}^n P(y_i \mid X_i = x_i, \beta) \\
&= \prod_{i=1}^n \left(\frac{e^{\beta^Tx_i}}{1 + e^{\beta^Tx_i}}\right)^{y_i}\left(\frac{1}{1 + e^{\beta^Tx_i}}\right)^{1 - y_i} \\
&= \prod_{i=1}^n \frac{e^{\beta^Tx_iy_i}}{1 + e^{\beta^Tx_i}}
\end{aligned}
$$

So the log-likelihood is given by

$$l(\beta) = \sum_{i=1}^n \left[\beta^Tx_iy_i - \log\left(1 + e^{\beta^Tx_i}\right)\right]$$
Logistic Regression

The log-likelihood is given by

$$l(\beta) = \sum_{i=1}^n \left[\beta^Tx_iy_i - \log\left(1 + e^{\beta^Tx_i}\right)\right]$$

Differentiating with respect to $\beta$ gives

$$
\begin{aligned}
\nabla l(\beta) &= \sum_{i=1}^n \left[x_iy_i - \frac{x_ie^{\beta^Tx_i}}{1 + e^{\beta^Tx_i}}\right] \\
&= \sum_{i=1}^n x_i(y_i - p(x_i, \beta)) \\
&= X^T(y - p)
\end{aligned}
$$

where $p = (p(x_1, \beta), p(x_2, \beta), \dots, p(x_n, \beta))$. Setting $\nabla l(\beta) = 0$ does not yield a closed form solution for $\beta$, so we need to use an iterative approach such as Newton's method.
Logistic Regression

Newton's method: Suppose that we want to find $x$ such that $f(x) = 0$. Then we could use the iterative approximation given by

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

and the more iterations we conduct, the closer we get to the root. We need to find the vector $\beta$ such that $\nabla l(\beta) = 0$. To do this, we need to conduct a vector version of Newton's method:

$$\beta_{n+1} = \beta_n - \left(\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\nabla l(\beta)$$
Logistic Regression

Newton's method for scalars: find $x$ such that $f(x) = 0$ by

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Vector version of Newton's method for $\beta$:

$$\beta_{n+1} = \beta_n - \left(\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\nabla l(\beta)$$

$\nabla l(\beta)$ plays the role of $f(x)$: it is the vector version of the first derivative

$\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}$ plays the role of $f'(x)$: it is the vector version of the second derivative
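Here is a minimal sketch of the scalar version. The `newton` helper, the test function $f(x) = x^2 - 2$ (whose root is $\sqrt{2}$), and the tolerance are illustrative choices, not part of the lab.

```python
import math

def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Scalar Newton's method: iterate x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Illustrative example: find the root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root, math.sqrt(2))
```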
Logistic Regression

The first derivative (gradient vector) of $l(\beta)$ with respect to the vector $\beta$:

$$\nabla l(\beta) = \begin{pmatrix} \frac{\partial l}{\partial\beta_0} \\ \frac{\partial l}{\partial\beta_1} \\ \vdots \\ \frac{\partial l}{\partial\beta_p} \end{pmatrix}$$

The second derivative (Hessian matrix) of $l(\beta)$ with respect to the vector $\beta$:

$$\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T} = \begin{pmatrix} \frac{\partial^2 l}{\partial\beta_0\partial\beta_0} & \frac{\partial^2 l}{\partial\beta_0\partial\beta_1} & \dots & \frac{\partial^2 l}{\partial\beta_0\partial\beta_p} \\ \frac{\partial^2 l}{\partial\beta_1\partial\beta_0} & \frac{\partial^2 l}{\partial\beta_1\partial\beta_1} & \dots & \frac{\partial^2 l}{\partial\beta_1\partial\beta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 l}{\partial\beta_p\partial\beta_0} & \frac{\partial^2 l}{\partial\beta_p\partial\beta_1} & \dots & \frac{\partial^2 l}{\partial\beta_p\partial\beta_p} \end{pmatrix}$$
Logistic Regression

Recall that we were trying to find the MLE for $\beta$ in logistic regression using Newton's method:

$$\beta_{n+1} = \beta_n - \left(\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\nabla l(\beta)$$

We already showed that

$$\nabla l(\beta) = X^T(y - p)$$

and one can show that

$$\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T} = -X^TWX$$

where $W$ is a diagonal matrix whose diagonal entries are given by

$$W_{ii} = Var(Y_i) = p(x_i, \beta)(1 - p(x_i, \beta))$$
Logistic Regression

So our iterative procedure becomes

$$
\begin{aligned}
\beta_{n+1} &= \beta_n + (X^TWX)^{-1}X^T(y - p) \\
&= (X^TWX)^{-1}X^TW(X\beta_n + W^{-1}(y - p)) \\
&= (X^TWX)^{-1}X^TWz
\end{aligned}
$$

where $z = X\beta_n + W^{-1}(y - p)$.

Recall the least squares linear regression estimator $\hat{\beta} = (X^TX)^{-1}X^Ty$ and compare it with our iterative logistic regression estimator $\beta_{n+1} = (X^TWX)^{-1}X^TWz$. We say that the estimate has been reweighted by $W$, and so this estimator is referred to as iteratively reweighted least squares.
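The update above translates almost line for line into code. The following Python/NumPy sketch is a simplified illustration (simulated data, a fixed number of iterations, and no convergence or separation checks) of the IRLS recursion $\beta \leftarrow (X^TWX)^{-1}X^TWz$; all names and values here are my own choices, not from the lab.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Fit logistic regression by Newton's method / IRLS:
    beta <- (X'WX)^{-1} X'W z,  with z = X beta + W^{-1}(y - p)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(x_i, beta) for each i
        W = p * (1 - p)                       # diagonal entries of W
        z = X @ beta + (y - p) / W            # working response
        XtW = X.T * W                         # same as X.T @ diag(W)
        beta = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
    return beta

# Illustrative simulated data
rng = np.random.default_rng(6)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.5, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

print(fit_logistic_irls(X, y))   # should be close to beta_true
```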
The δ-method
The δ-method

The δ-method tells us that if we have a sequence of random variables, $X_n$, such that

$$\sqrt{n}(X_n - \theta) \rightarrow N(0, \sigma^2)$$

then, for any differentiable function $g(\cdot)$ with $g'(\theta) \neq 0$, we have that

$$\sqrt{n}(g(X_n) - g(\theta)) \rightarrow N(0, \sigma^2(g'(\theta))^2)$$

The most common reason we use the δ-method is to identify the variance of a function of a random variable.
Example: The δ-method

Suppose that $X_n \sim \text{Binom}(n, p)$. Then since $\frac{X_n}{n}$ is the mean of $n$ iid Bernoulli($p$) random variables, the central limit theorem tells us that

$$\sqrt{n}\left(\frac{X_n}{n} - p\right) \rightarrow N(0, p(1-p))$$

and that $Var\left(\frac{X_n}{n}\right) = \frac{p(1-p)}{n}$.

δ-method: for any differentiable function $g(\cdot)$ with $g'(p) \neq 0$, we have

$$\sqrt{n}\left(g\left(\frac{X_n}{n}\right) - g(p)\right) \rightarrow N(0, p(1-p)(g'(p))^2)$$

For example, we can use this to calculate $Var\left(\log\left(\frac{X_n}{n}\right)\right)$.
Example: The δ-method

$$\sqrt{n}\left(g\left(\frac{X_n}{n}\right) - g(p)\right) \rightarrow N(0, p(1-p)(g'(p))^2)$$

For example, we can use this to calculate $Var\left(\log\left(\frac{X_n}{n}\right)\right)$. Select

$$g(x) = \log(x)$$

so differentiating gives

$$g'(x) = \frac{1}{x}$$

and the δ-method says:

$$\sqrt{n}\left(\log\left(\frac{X_n}{n}\right) - \log(p)\right) \rightarrow N\left(0, \frac{p(1-p)}{p^2}\right)$$
Example: The δ-method

The δ-method says:

$$\sqrt{n}\left(\log\left(\frac{X_n}{n}\right) - \log(p)\right) \rightarrow N\left(0, \frac{p(1-p)}{p^2}\right)$$

so we can see that the variance of $\log\left(\frac{X_n}{n}\right)$ is approximately

$$Var\left(\log\left(\frac{X_n}{n}\right)\right) \approx \frac{1-p}{pn}$$
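A quick simulation (with arbitrary illustrative values of $n$ and $p$, not from the lab) suggests how well this δ-method approximation works.

```python
import numpy as np

rng = np.random.default_rng(7)

# Check the delta-method approximation Var(log(X_n / n)) ~ (1 - p) / (p * n)
n, p = 500, 0.3
X = rng.binomial(n, p, size=200_000)
log_phat = np.log(X / n)   # for these n, p the chance of X = 0 is negligible

print("empirical variance:   ", log_phat.var())
print("delta-method variance:", (1 - p) / (p * n))
```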