Multiple Regression
One of the most widely used tools in statistical analysis. The matrix expressions for multiple regression are the same as for simple linear regression.
Need for Several Predictor Variables
Often the response is best understood as a function of multiple input quantities. Examples:
Spam filtering: regress the probability of an email being a spam message against thousands of input variables.
Football prediction: regress the probability of a goal in some short time span against the current state of the game.
First-Order Model with Two Predictor Variables
When there are two predictor variables $X_1$ and $X_2$, the regression model
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$
is called a first-order model with two predictor variables. A first-order model is linear in the predictor variables. $X_{i1}$ and $X_{i2}$ are the values of the two predictor variables in the $i$th trial.
Functional Form of the Regression Surface
Assuming the noise is zero in expectation,
$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
This regression function has the form of a plane, e.g. $E(Y) = 10 + 2X_1 + 5X_2$.
Loess example
Meaning of Regression Coefficients
$\beta_0$ is the intercept of the regression plane: the mean response when both $X_1$ and $X_2$ are zero. $\beta_1$ indicates the change in the mean response $E(Y)$ per unit increase in $X_1$ when $X_2$ is held constant, and $\beta_2$ plays the analogous role for $X_2$. Example: fix $X_2 = 2$. Then
$$E(Y) = 10 + 2X_1 + 5(2) = 20 + 2X_1$$
With $X_2 = 2$ the intercept changes, but the function is clearly still linear in $X_1$. In other words, all one-dimensional restrictions of the regression surface are lines.
Terminology
1. When the effect of $X_1$ on the mean response does not depend on the level of $X_2$ (and vice versa), the two predictor variables are said to have additive effects, or not to interact.
2. The parameters $\beta_1$ and $\beta_2$ are sometimes called partial regression coefficients.
Comments
1. A planar response surface may not always be appropriate, but even when it is not, it is often a good approximation to the regression function in local regions of the input space.
2. The meaning of the parameters can be determined by taking partial derivatives of the regression function with respect to each one.
First-Order Model with More than Two Predictor Variables
Let there be $p-1$ predictor variables. Then
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$
which can also be written as
$$Y_i = \beta_0 + \sum_{k=1}^{p-1} \beta_k X_{ik} + \varepsilon_i$$
and, defining $X_{i0} = 1$, can also be written as
$$Y_i = \sum_{k=0}^{p-1} \beta_k X_{ik} + \varepsilon_i \qquad \text{where } X_{i0} = 1$$
Geometry of the Response Surface
In this setting the response surface is a hyperplane. This is difficult to visualize, but the same intuitions hold: fixing all but one input variable, each $\beta_k$ tells how much the mean response increases or decreases per unit increase in that one input variable.
General Linear Regression Model
We have arrived at the general linear regression model. In general, the variables $X_1, \dots, X_{p-1}$ in the regression model do not have to represent different predictor variables, nor do they all have to be quantitative (continuous). The general model is
$$Y_i = \sum_{k=0}^{p-1} \beta_k X_{ik} + \varepsilon_i \qquad \text{where } X_{i0} = 1$$
with response function, when $E(\varepsilon_i) = 0$,
$$E(Y) = \beta_0 + \beta_1 X_1 + \dots + \beta_{p-1} X_{p-1}$$
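To make the notation concrete, here is a minimal numpy sketch. All data values and coefficients below are made up for illustration; it simply builds a design matrix that includes the constant column $X_{i0} = 1$ and evaluates the mean response $\sum_k \beta_k X_{ik}$ for every observation.

```python
import numpy as np

# Hypothetical data: n = 5 observations on p - 1 = 2 predictor variables.
X_pred = np.array([[2.0, 1.0],
                   [4.0, 3.0],
                   [6.0, 2.0],
                   [8.0, 5.0],
                   [10.0, 4.0]])

# Prepend the constant column X_i0 = 1 so that beta_0 acts as the intercept.
n = X_pred.shape[0]
X = np.column_stack([np.ones(n), X_pred])

# Hypothetical parameter vector (beta_0, beta_1, beta_2).
beta = np.array([10.0, 2.0, 5.0])

# Mean response E(Y) = X beta, i.e. sum_k beta_k * X_ik for each observation i.
EY = X @ beta
print(EY)
```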
Qualitative (Discrete) Predictor Variables
Until now we have (implicitly) focused on quantitative (continuous) predictor variables. Qualitative (discrete) predictor variables often arise in the real world. Examples:
Patient sex: male/female/other
Goal scored in last minute: yes/no
Etc.
Example
Regression model to predict the length of hospital stay ($Y$) based on the age ($X_1$) and gender ($X_2$) of the patient. Define gender as
$$X_2 = \begin{cases} 1 & \text{if patient is female} \\ 0 & \text{if patient is male} \end{cases}$$
and use the standard first-order regression model
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$
Example cont.
where $X_{i1}$ is the patient's age and $X_{i2}$ is the patient's gender. If $X_2 = 0$, the response function is $E(Y) = \beta_0 + \beta_1 X_1$; otherwise it is $E(Y) = (\beta_0 + \beta_2) + \beta_1 X_1$, which is simply a parallel linear response function with a different intercept.
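A minimal sketch of how the 0/1 indicator enters the design matrix. The ages, genders, and coefficient values below are invented for illustration only.

```python
import numpy as np

age = np.array([34.0, 52.0, 47.0, 61.0])        # X_1: patient age
female = np.array([1, 0, 1, 0], dtype=float)     # X_2: 1 if female, 0 if male

X = np.column_stack([np.ones(age.size), age, female])

beta = np.array([3.0, 0.05, 1.2])                # hypothetical (beta_0, beta_1, beta_2)

# For male patients (X_2 = 0):   E(Y) = beta_0 + beta_1 * age
# For female patients (X_2 = 1): E(Y) = (beta_0 + beta_2) + beta_1 * age,
# i.e. a parallel line with a shifted intercept.
print(X @ beta)
```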
Polynomial Regression
Polynomial regression models are special cases of the general linear regression model. They can contain squared and higher-order terms of the predictor variables, so the response function becomes curvilinear. For example,
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$$
which clearly has the same form as the general regression model (with $X_{i1} = X_i$ and $X_{i2} = X_i^2$).
General Regression
The general linear regression model also covers:
Transformed variables: $\log Y$, $1/Y$
Interaction effects: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i$
Combinations: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i2} + \beta_4 X_{i1} X_{i2} + \varepsilon_i$
Key point: all are linear in the parameters!
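Because all of these variants are linear in the parameters, they differ only in how the columns of the design matrix are constructed. A sketch with made-up data:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.5, 1.5, 1.0, 2.0, 2.5])
n = x1.size

# Interaction model: columns 1, X_1, X_2, X_1*X_2
X_inter = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Combination model: columns 1, X_1, X_1^2, X_2, X_1*X_2
X_comb = np.column_stack([np.ones(n), x1, x1**2, x2, x1 * x2])

# Either matrix plugs into the same least squares machinery,
# because the model remains linear in beta.
print(X_inter.shape, X_comb.shape)
```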
General Regression Model in Matrix Terms
$$y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} \qquad
X = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix} \qquad
\beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_{p-1} \end{bmatrix} \qquad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
General Linear Regression in Matrix Terms
With $E(\varepsilon) = 0$, $y = X\beta + \varepsilon$, and
$$\sigma^2\{\varepsilon\} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$
we have $E(y) = X\beta$ and $\sigma^2\{y\} = \sigma^2 I$.
Least Squares Solution
The matrix normal equations can be derived directly from the minimization of
$$Q = (y - X\beta)'(y - X\beta)$$
with respect to $\beta$.
Least Squares Solution
We can solve the normal equations
$$X'Xb = X'y$$
(if the inverse of $X'X$ exists) by premultiplying both sides:
$$(X'X)^{-1}X'Xb = (X'X)^{-1}X'y$$
and since $(X'X)^{-1}X'X = I$ we have
$$b = (X'X)^{-1}X'y$$
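A minimal numpy check of this closed form on simulated data. In practice a solver such as np.linalg.lstsq is preferred over forming the inverse explicitly, since it is numerically safer; both give the same answer here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                   # p counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([10.0, 2.0, 5.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Normal equations solution b = (X'X)^{-1} X'y.
b = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, numerically preferable solution.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)
print(np.allclose(b, b_lstsq))   # True
```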
Fitted Values and Residuals
Let the vector of fitted values be
$$\hat{y} = \begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{bmatrix}$$
In matrix notation we then have $\hat{y} = Xb$.
Hat Matrix: Puts the Hat on y
We can also express the fitted values directly in terms of the $X$ and $y$ matrices:
$$\hat{y} = X(X'X)^{-1}X'y$$
and we can further define $H$, the hat matrix:
$$\hat{y} = Hy \qquad H = X(X'X)^{-1}X'$$
The hat matrix plays an important role in diagnostics for regression analysis.
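A short sketch on simulated data showing that $Hy$ reproduces the fitted values $Xb$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
b = np.linalg.inv(X.T @ X) @ X.T @ y           # least squares coefficients

print(np.allclose(H @ y, X @ b))               # True: y_hat = Hy = Xb
```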
Hat Matrix Properties
1. The hat matrix is symmetric.
2. The hat matrix is idempotent, i.e. $HH = H$.
Important idempotent matrix property: for a symmetric and idempotent matrix $A$, $\mathrm{rank}(A) = \mathrm{trace}(A)$, the number of non-zero eigenvalues of $A$.
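These properties are easy to verify numerically; for a design matrix with $p$ columns (of full rank), the trace of $H$, and hence its rank, equals $p$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))       # symmetric
print(np.allclose(H @ H, H))     # idempotent
print(np.trace(H))               # approximately p = 4 (= rank of H)
```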
Residuals
The residuals, like the fitted values $\hat{y}$, can be expressed as linear combinations of the response variable observations $Y_i$:
$$e = y - \hat{y} = y - Hy = (I - H)y$$
Also remember that $e = y - \hat{y} = y - Xb$; these are equivalent.
Covariance of Residuals
Starting with $e = (I - H)y$, we see that
$$\sigma^2\{e\} = (I - H)\,\sigma^2\{y\}\,(I - H)'$$
but $\sigma^2\{y\} = \sigma^2\{\varepsilon\} = \sigma^2 I$, which means that
$$\sigma^2\{e\} = \sigma^2 (I - H) I (I - H) = \sigma^2 (I - H)(I - H)$$
and since $I - H$ is idempotent (check), we have
$$\sigma^2\{e\} = \sigma^2 (I - H)$$
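A small Monte Carlo sketch, on a simulated design with assumed $\sigma = 1$, comparing the empirical covariance of the residuals across repeated noise draws with $\sigma^2(I - H)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 20, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                               # I - H

reps = 100_000
eps = sigma * rng.normal(size=(reps, n))        # each row is one noise vector
Y = X @ beta + eps                              # each row is one sample of y
E = Y @ M.T                                     # residuals e = (I - H) y, one per row

emp_cov = np.cov(E, rowvar=False)
print(np.max(np.abs(emp_cov - sigma**2 * M)))   # small (Monte Carlo error)
```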
ANOVA
We can express the ANOVA results in matrix form as well, starting with
$$SSTO = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{\left(\sum Y_i\right)^2}{n}$$
where $\sum Y_i^2 = y'y$ and $\frac{\left(\sum Y_i\right)^2}{n} = \frac{1}{n} y'Jy$, with $J$ the $n \times n$ matrix of all ones, leaving
$$SSTO = y'y - \frac{1}{n} y'Jy$$
SSE
Remember that
$$SSE = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$$
In matrix form this is
$$SSE = e'e = (y - Xb)'(y - Xb) = y'y - 2b'X'y + b'X'Xb$$
$$= y'y - 2b'X'y + b'X'X(X'X)^{-1}X'y = y'y - 2b'X'y + b'IX'y$$
which when simplified yields
$$SSE = y'y - b'X'y$$
or, remembering that $b = (X'X)^{-1}X'y$,
$$SSE = y'y - y'X(X'X)^{-1}X'y$$
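A quick numerical check, on simulated data, that the elementwise and matrix forms of SSE agree:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

b = np.linalg.inv(X.T @ X) @ X.T @ y
e = y - X @ b

sse_resid  = e @ e                                            # sum of squared residuals
sse_matrix = y @ y - b @ X.T @ y                              # y'y - b'X'y
sse_hat    = y @ y - y @ X @ np.linalg.inv(X.T @ X) @ X.T @ y # y'y - y'X(X'X)^{-1}X'y

print(np.allclose(sse_resid, sse_matrix), np.allclose(sse_resid, sse_hat))
```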
SSR
We know that $SSR = SSTO - SSE$, where
$$SSTO = y'y - \frac{1}{n} y'Jy \qquad \text{and} \qquad SSE = y'y - b'X'y$$
From this,
$$SSR = b'X'y - \frac{1}{n} y'Jy$$
and replacing $b$ as before,
$$SSR = y'X(X'X)^{-1}X'y - \frac{1}{n} y'Jy$$
Quadratic Forms
The ANOVA sums of squares can be interpreted as quadratic forms. An example of a quadratic form is
$$5Y_1^2 + 6Y_1 Y_2 + 4Y_2^2$$
Note that this can be expressed in matrix notation as
$$\begin{bmatrix} Y_1 & Y_2 \end{bmatrix} \begin{bmatrix} 5 & 3 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = y'Ay$$
where $A$ is always (in the case of a quadratic form) a symmetric matrix. The two off-diagonal terms each equal half the coefficient of the cross-product, because the $Y_1 Y_2$ and $Y_2 Y_1$ contributions combine (multiplication is commutative).
Quadratic Forms
In general, a quadratic form is defined by
$$y'Ay = \sum_i \sum_j a_{ij} Y_i Y_j \qquad \text{where } a_{ij} = a_{ji}$$
with $A$ the matrix of the quadratic form. The ANOVA sums SSTO, SSE, and SSR can all be arranged into quadratic forms:
$$SSTO = y'\left(I - \tfrac{1}{n}J\right)y$$
$$SSE = y'(I - H)y$$
$$SSR = y'\left(H - \tfrac{1}{n}J\right)y$$
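A sketch verifying, on simulated data, that these quadratic forms reproduce SSTO, SSE, and SSR computed from their elementwise definitions, and that SSTO = SSE + SSR:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n))
I = np.eye(n)

ssto = y @ (I - J / n) @ y
sse  = y @ (I - H) @ y
ssr  = y @ (H - J / n) @ y

y_hat = H @ y
print(np.allclose(ssto, np.sum((y - y.mean())**2)))
print(np.allclose(sse,  np.sum((y - y_hat)**2)))
print(np.allclose(ssr,  np.sum((y_hat - y.mean())**2)))
print(np.allclose(ssto, sse + ssr))
```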
Inference
We can derive the sampling variance of the estimator $b$ of the $\beta$ vector by remembering that
$$b = (X'X)^{-1}X'y = Ay$$
where $A$ is the constant matrix
$$A = (X'X)^{-1}X' \qquad A' = X(X'X)^{-1}$$
Using the standard matrix covariance operator, we see that
$$\sigma^2\{b\} = A\,\sigma^2\{y\}\,A'$$
Variance of b
Since $\sigma^2\{y\} = \sigma^2 I$, we can write
$$\sigma^2\{b\} = (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}I = \sigma^2 (X'X)^{-1}$$
Of course,
$$E(b) = E\left((X'X)^{-1}X'y\right) = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta$$
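In practice $\sigma^2$ is unknown and is estimated by $MSE = SSE/(n - p)$. A sketch, on simulated data, of the estimated covariance matrix of $b$, with coefficient standard errors taken from its diagonal:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([10.0, 2.0, 5.0]) + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

mse = (e @ e) / (n - p)              # unbiased estimate of sigma^2
cov_b = mse * XtX_inv                # estimated sigma^2{b} = MSE (X'X)^{-1}
se_b = np.sqrt(np.diag(cov_b))       # standard errors of the coefficients

print(b)
print(se_b)
```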