STAT 135 Lab 13 (Review): Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression, and the δ-method


Rebecca Barter, May 5, 2015

Linear Regression Review

y = β_0 + β_1 x + ε

Suppose that there is some global linear relationship between height and weight in the population, for example

Weight = β_0 + β_1 · Height + ε

Obviously height and weight don't fall on a perfectly straight line: there is some random noise/deviation from the line (ε is a random variable with E(ε) = 0 and Var(ε) = σ²). Moreover, we don't observe everyone in the population, so there is no way to know β_0 and β_1 exactly, but what we can do is estimate them from a sample.

We observe a sample (the blue points in the slide's figure) and fit the best line through the sample. The red fitted line corresponds to

ŷ = β̂_0 + β̂_1 x

where β̂_0 and β̂_1 are estimated from our sample using least squares (choosing the β values that minimize the sum of squared vertical distances from the observed points to the line).

We showed that we can write our linear regression model in matrix form,

Y = Xβ + ε,

and that the least squares estimate of β is

β̂ = argmin_β ‖Y − Xβ‖²₂ = (X^T X)^{-1} X^T Y

We further showed that β̂ is unbiased, E(β̂) = β, and that its covariance matrix is

Var(β̂) = σ² (X^T X)^{-1}

The residuals are then defined to be the observed differences between the true y (blue dot) and the fitted ŷ (the corresponding point on the red line):

e_i = y_i − ŷ_i = y_i − β̂_0 − β̂_1 x_i

which is different from the unobservable error

ε_i = y_i − β_0 − β_1 x_i

(the deviation of the observed y_i from the true population line, which we don't know!)
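As a quick sanity check of the formulas above, here is a minimal NumPy sketch (not from the original lab) that simulates a small dataset, computes β̂ = (X^T X)^{-1} X^T Y, and forms the residuals; the sample size, noise level, and true coefficients are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 2.0
beta_true = np.array([1.0, 3.0])          # illustrative intercept and slope

x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with an intercept column
Y = X @ beta_true + rng.normal(0, sigma, size=n)

# least squares estimate: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# residuals e_i = y_i - y_hat_i (distinct from the unobservable errors)
residuals = Y - X @ beta_hat

print(beta_hat)                            # should be close to beta_true
```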

Multivariate Random Variables

Suppose we have n random variables Y_1, ..., Y_n, with E(Y_i) = µ_i and Cov(Y_i, Y_j) = σ_{ij}. Suppose we want to consider the Y_i's jointly as a single multivariate random variable (random vector)

Y = (Y_1, ..., Y_n)

Then this multivariate random variable Y has mean vector

E(Y) = µ = (µ_1, ..., µ_n)

and covariance matrix

Cov(Y) = Σ_{Y,Y} =
  [ σ_1²     σ_{1,2}   ...  σ_{1,n} ]
  [ σ_{2,1}  σ_2²      ...  σ_{2,n} ]
  [   ...      ...     ...    ...   ]
  [ σ_{n,1}  σ_{n,2}   ...  σ_n²    ]

Let E(Y) = µ and Cov(Y) = Σ, and suppose we have the following linear transformation of Y:

Z = b + AY

where b is a fixed vector and A a fixed matrix. What are the mean vector and covariance matrix of Z?

E(Z) = E(b + AY) = b + A E(Y) = b + Aµ

Cov(Z) = Cov(b + AY) = Cov(AY) = A Cov(Y) A^T = A Σ A^T
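A minimal simulation sketch (not part of the original slides) checking E(Z) = b + Aµ and Cov(Z) = AΣA^T; the particular µ, Σ, b, and A below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
b = np.array([1.0, 2.0])
A = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0, 0.0]])

Y = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of Y
Z = b + Y @ A.T                                         # Z = b + A Y for each draw

print(Z.mean(axis=0), b + A @ mu)                       # empirical mean vs b + A mu
print(np.cov(Z.T), A @ Sigma @ A.T)                     # empirical cov vs A Sigma A^T
```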

Suppose that Y is a multivariate random vector, Y = (Y_1, ..., Y_n). Recall that for scalars, a quadratic transformation of y is given by x = ay² ∈ ℝ. The equivalent form for vectors is

X = Y^T A Y ∈ ℝ

This is referred to as a random quadratic form in Y.

X = Y^T A Y ∈ ℝ is a random quadratic form in Y. How can we calculate the expected value of a quadratic form of a random vector? We can use the following facts:

1. Trace and expectation are both linear operators (and so can be interchanged).
2. X = Y^T A Y ∈ ℝ is a real number (not a matrix or vector), and if x ∈ ℝ then tr(x) = x.
3. If A and B are matrices of appropriate dimension, then tr(AB) = tr(BA), where the trace of a matrix is the sum of its diagonal entries.

Using facts (1)-(3):

E[Y^T A Y] = E[tr(Y^T A Y)]          (fact 2)
           = E[tr(A Y Y^T)]          (fact 3)
           = tr(E[A Y Y^T])          (fact 1)
           = tr(A E[Y Y^T])          (fact 1)
           = tr(A (Σ + µµ^T))        (since E[Y Y^T] = Σ + µµ^T)
           = tr(AΣ) + tr(Aµµ^T)      (fact 1)
           = tr(AΣ) + tr(µ^T A µ)    (fact 3)
           = tr(AΣ) + µ^T A µ        (fact 2)
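A quick Monte Carlo check of E[Y^T A Y] = tr(AΣ) + µ^T A µ, again with arbitrary illustrative choices of µ, Σ, and a symmetric A (this sketch is an addition, not from the original lab):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 2.0, 0.2],
              [0.0, 0.2, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=500_000)
quad = np.einsum('ij,jk,ik->i', Y, A, Y)       # Y^T A Y for each draw

print(quad.mean())                              # empirical expectation
print(np.trace(A @ Sigma) + mu @ A @ mu)        # tr(A Sigma) + mu^T A mu
```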

Example: Residual sum of squares

The expected value of a quadratic form of a random vector is

E[Y^T A Y] = tr(AΣ) + µ^T A µ

We can use this to show that an unbiased estimate of σ² is given by

σ̂² = RSS / (n − p)

where the residual sum of squares, RSS = ‖e‖²₂ = e^T e, is a quadratic form.

Recall that the residuals are defined by

e = Y − Ŷ = Y − Xβ̂ = Y − X(X^T X)^{-1} X^T Y = (I − X(X^T X)^{-1} X^T) Y = P_X Y

where P_X := I − X(X^T X)^{-1} X^T is the projection matrix onto the space orthogonal to the columns of X (check: P_X X = 0). Note that a projection matrix is a matrix A that satisfies

A^T = A,   AA = A² = A

Since e = (I − X(X^T X)^{-1} X^T) Y = P_X Y, the residual sum of squares is

RSS = ‖e‖²₂ = e^T e = (P_X Y)^T (P_X Y) = Y^T P_X^T P_X Y = Y^T P_X Y

Thus, since E[Y^T A Y] = tr(AΣ) + µ^T A µ, the expected RSS is

E(RSS) = E(Y^T P_X Y) = σ² tr(P_X) + E(Y)^T P_X E(Y)     (using Σ = σ² I_n)
       = σ² (n − p)

How did we get that last equality!?

E(RSS) = σ² tr(P_X) + E(Y)^T P_X E(Y) = σ² (n − p)

To get the last equality, we use our newly acquired knowledge of projection matrices and the trace operator:

P_X is the projection matrix onto the space orthogonal to X, thus P_X E(Y) = P_X Xβ = 0, so E(Y)^T P_X E(Y) = 0.

The trace of an m × m identity matrix is m, so

tr(P_X) = tr(I_n − X(X^T X)^{-1} X^T)
        = n − tr(X(X^T X)^{-1} X^T)      (linearity of trace)
        = n − tr((X^T X)^{-1} X^T X)     (tr(AB) = tr(BA))
        = n − tr(I_p)
        = n − p

Note that we have shown that

E(RSS) = σ² (n − p)

which tells us that

σ̂² = RSS / (n − p)

is an unbiased estimate of σ².
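A small simulation sketch (added here for illustration, with made-up n, p, β, and σ) suggesting that RSS/(n − p) is indeed close to σ² on average:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 4, 1.5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

est = []
for _ in range(5_000):
    Y = X @ beta + rng.normal(0, sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    rss = np.sum((Y - X @ beta_hat) ** 2)
    est.append(rss / (n - p))              # candidate unbiased estimate of sigma^2

print(np.mean(est), sigma ** 2)            # the two should be close
```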

Prediction

Suppose we want to predict/fit the responses Y_1, ..., Y_n corresponding to the observed predictors x_1, ..., x_n (each x_i is a 1 × p vector), which form the rows of the design matrix X. Then our predicted Y_i's (note that these are the same Y's used to fit the model) are given by

Ŷ_i = x_i β̂ = x_i (X^T X)^{-1} X^T Y

Since these Ŷ_i's are random variables, we might be interested in calculating the variance of these predictions.

Our predicted Y_i's are given by Ŷ_i = x_i β̂ = x_i (X^T X)^{-1} X^T Y, so the variance of each Ŷ_i is

Var(Ŷ_i) = Var(x_i (X^T X)^{-1} X^T Y)
         = x_i (X^T X)^{-1} X^T Var(Y) X (X^T X)^{-1} x_i^T
         = σ² x_i (X^T X)^{-1} X^T X (X^T X)^{-1} x_i^T
         = σ² x_i (X^T X)^{-1} x_i^T

Prediction V ar(ŷi) = σ 2 x i (X T X) 1 x T i Thus the average variance for across the n observations is: 1 n n i=1 V ar(ŷi) = σ2 n n x i (X T X) 1 x T i i=1 = σ2 n tr(x(xt X) 1 X T ) = σ2 n tr((xt X) 1 X T X) = σ2 n tr(i p) = σ 2 p n So the more variables we have in our model, the more variable our fitted/predicted values will be!

Logistic Regression

For linear regression, we have Y = Xβ + ε and Y is continuous, with

Y_i | x_i ~ N(β^T x_i, σ²)

where x_i is the ith row/observation of X. We want to use logistic regression when Y doesn't take continuous values but is instead binary, i.e. Y_i ∈ {0, 1}. The above linear model no longer makes sense, since a linear combination of our (continuous) x's is very unlikely to be exactly 0 or 1. First step: think of a distribution for Y_i | x_i that makes more sense for binary Y...

Since Y_i is either 0 or 1, we can model it as a Bernoulli random variable,

Y_i | x_i ~ Bernoulli(p(x_i, β))

where the probability that Y_i = 1 is given by some function of our predictors, p(x_i, β). We need this probability p(x_i, β) to lie in [0, 1] and to depend on the linear combination of our predictors, β^T x_i, in some way. We choose

p(x_i, β) = e^{β^T x_i} / (1 + e^{β^T x_i})

This is a nice choice since

β^T x_i ≥ 0 implies that p(x_i, β) ≥ 0.5, so Y_i = 1 is more likely
β^T x_i < 0 implies that p(x_i, β) < 0.5, so Y_i = 0 is more likely

We have the following logistic regression model:

Y_i | x_i ~ Bernoulli( e^{β^T x_i} / (1 + e^{β^T x_i}) )

If we had a new sample x and we knew the β vector, we could calculate the probability that the corresponding Y = 1:

P(Y = 1 | x) = p(x, β) = e^{β^T x} / (1 + e^{β^T x})

If this probability is greater than 0.5, we would set Ŷ = 1. Otherwise, we would set Ŷ = 0. But we don't know β: we need to estimate the unknown β vector (just as with linear regression).
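A minimal sketch of this classification rule, assuming β is already known; the particular β and x values below are made up for illustration.

```python
import numpy as np

def logistic_prob(x, beta):
    """P(Y = 1 | x) = e^{beta^T x} / (1 + e^{beta^T x})."""
    eta = x @ beta
    return np.exp(eta) / (1.0 + np.exp(eta))

beta = np.array([-1.0, 0.8])       # illustrative coefficients (intercept, slope)
x_new = np.array([1.0, 2.5])       # new observation, with a leading 1 for the intercept

p = logistic_prob(x_new, beta)
y_hat = int(p > 0.5)               # classify as 1 if P(Y = 1 | x) > 0.5
print(p, y_hat)
```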

Let's try to calculate the MLE for β. The likelihood function is given by

lik(β) = Π_{i=1}^n P(y_i | X_i = x_i, β)
       = Π_{i=1}^n ( e^{β^T x_i} / (1 + e^{β^T x_i}) )^{y_i} ( 1 / (1 + e^{β^T x_i}) )^{1 − y_i}

so that the log-likelihood is given by

l(β) = Σ_{i=1}^n [ y_i β^T x_i − log(1 + e^{β^T x_i}) ]

The log-likelihood is given by

l(β) = Σ_{i=1}^n [ y_i β^T x_i − log(1 + e^{β^T x_i}) ]

Differentiating with respect to β gives

∇l(β) = Σ_{i=1}^n [ x_i y_i − x_i e^{β^T x_i} / (1 + e^{β^T x_i}) ]
      = Σ_{i=1}^n x_i (y_i − p(x_i, β))
      = X^T (y − p)

where p = (p(x_1, β), p(x_2, β), ..., p(x_n, β)). Setting ∇l(β) = 0 does not yield a closed-form solution for β, so we need to use an iterative approach such as Newton's method.

Newton's method: suppose that we want to find x such that f(x) = 0. Then we can use the iterative approximation given by

x_{n+1} = x_n − f(x_n) / f'(x_n)

and the more iterations we conduct, the closer we get to the root. We need to find the vector β such that ∇l(β) = 0. To do this, we need to conduct a vector version of Newton's method:

β_{n+1} = β_n − ( ∂²l(β)/∂β∂β^T )^{-1} ∇l(β)

Newton's method for scalars: find x such that f(x) = 0 by iterating

x_{n+1} = x_n − f(x_n) / f'(x_n)

Vector version of Newton's method for β:

β_{n+1} = β_n − ( ∂²l(β)/∂β∂β^T )^{-1} ∇l(β)

Here ∇l(β) is the vector version of the first derivative f(x), and the Hessian ∂²l(β)/∂β∂β^T is the vector version of the second derivative f'(x).
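For intuition, here is a tiny sketch of the scalar Newton update applied to the illustrative root-finding problem f(x) = x² − 2 (this example is an addition, not from the slides):

```python
# scalar Newton's method on f(x) = x^2 - 2, whose positive root is sqrt(2)
f = lambda x: x ** 2 - 2
f_prime = lambda x: 2 * x

x = 1.0                              # starting guess
for _ in range(6):
    x = x - f(x) / f_prime(x)        # Newton update
print(x)                             # approximately 1.41421356...
```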

The first derivative (gradient vector) of l(β) with respect to the vector β:

∇l(β) = ( ∂l/∂β_0, ∂l/∂β_1, ..., ∂l/∂β_p )^T

The second derivative (Hessian matrix) of l(β) with respect to the vector β:

∂²l(β)/∂β∂β^T =
  [ ∂²l/∂β_0∂β_0  ∂²l/∂β_0∂β_1  ...  ∂²l/∂β_0∂β_p ]
  [ ∂²l/∂β_1∂β_0  ∂²l/∂β_1∂β_1  ...  ∂²l/∂β_1∂β_p ]
  [      ...            ...      ...       ...    ]
  [ ∂²l/∂β_p∂β_0  ∂²l/∂β_p∂β_1  ...  ∂²l/∂β_p∂β_p ]

Recall that we were trying to find the MLE for β in logistic regression using Newton's method,

β_{n+1} = β_n − ( ∂²l(β)/∂β∂β^T )^{-1} ∇l(β)

We already showed that

∇l(β) = X^T (y − p)

and one can show that

∂²l(β)/∂β∂β^T = −X^T W X

where W is a diagonal matrix whose diagonal entries are given by

W_ii = Var(Y_i) = p(x_i, β)(1 − p(x_i, β))

So our iterative procedure becomes

β_{n+1} = β_n + (X^T W X)^{-1} X^T (y − p)
        = (X^T W X)^{-1} X^T W (Xβ_n + W^{-1}(y − p))
        = (X^T W X)^{-1} X^T W z

where z = Xβ_n + W^{-1}(y − p). Recall the least squares linear regression estimator

β̂ = (X^T X)^{-1} X^T y

and compare it with our iterative logistic regression estimator

β_{n+1} = (X^T W X)^{-1} X^T W z

We say that the estimate has been reweighted by W, and so this estimator is referred to as iteratively reweighted least squares.
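A minimal NumPy sketch of the iteratively reweighted least squares update on simulated data; the data-generating values are made up for illustration, and no convergence checks or step-size safeguards are included.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
beta_true = np.array([-0.5, 1.2, -0.8])                  # illustrative true coefficients

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

beta = np.zeros(X.shape[1])                               # starting value
for _ in range(25):                                       # IRLS / Newton iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))                   # p(x_i, beta)
    W = p * (1 - p)                                       # diagonal entries of W
    z = X @ beta + (y - p) / W                            # working response z
    # weighted least squares step: beta = (X^T W X)^{-1} X^T W z
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(beta)                                               # should be close to beta_true
```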

The δ-method

The δ-method tells us that if we have a sequence of random variables X_n such that

√n (X_n − θ) → N(0, σ²)     (1)

then, for any differentiable function g with g'(θ) ≠ 0, we have that

√n (g(X_n) − g(θ)) → N(0, σ² (g'(θ))²)     (2)

The most common reason we use the δ-method is to identify the (approximate) variance of a function of a random variable.

Example: The δ-method

Suppose that X_n ~ Binom(n, p). Then since X_n/n is the mean of n iid Bernoulli(p) random variables, the central limit theorem tells us that

√n (X_n/n − p) → N(0, p(1 − p))

and that Var(X_n/n) = p(1 − p)/n.

δ-method: for any differentiable function g with g'(p) ≠ 0, we have

√n (g(X_n/n) − g(p)) → N(0, p(1 − p)(g'(p))²)

For example, we can use this to calculate Var(log(X_n/n)).

√n (g(X_n/n) − g(p)) → N(0, p(1 − p)(g'(p))²)

For example, we can use this to calculate Var(log(X_n/n)). Select

g(x) = log(x)

so differentiating gives

g'(x) = 1/x

and the δ-method says:

√n (log(X_n/n) − log(p)) → N(0, p(1 − p)/p²)

The δ-method says:

√n (log(X_n/n) − log(p)) → N(0, p(1 − p)/p²)

so we can see that the (approximate) variance of log(X_n/n) is

Var(log(X_n/n)) ≈ (1 − p)/(pn)
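A quick simulation sketch (added for illustration) checking the (1 − p)/(pn) approximation; the values of n and p are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 0.3

X = rng.binomial(n, p, size=200_000)        # many replicates of X_n ~ Binom(n, p)
log_phat = np.log(X / n)                    # log(X_n / n); with these n, p the chance X = 0 is negligible

print(log_phat.var())                       # empirical variance
print((1 - p) / (p * n))                    # delta-method approximation
```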