
Statistical Methods in Business
Lecture 5. Linear Regression

We like to capture and represent the relationship between a set of possible causes and their response by using a statistical predictive model. Let there be k possible causes, and let the value of a possible cause be denoted by X_j, where the index j = 1, 2, 3, ..., k points at a particular possible cause.

Linear Regression Model: Assume

Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_k X_k + ε,   ε ~ N(0, σ²).

Here Y denotes the value of the response, which is generated by a linear combination of all possible causes {X_j} together with other factors contributing to the response behavior; these other factors are assumed to be independent of the cause factors and are denoted by the independent random variable ε. We also assume that this random factor behaves according to a normal (a.k.a. Gaussian) probability law: it contributes nothing to the average response behavior in the long run, yet it generates a constant variance for the response at all possible combinations of the cause values.

We wish to collect independent observations and construct a regression equation for the collected data set. Assume that we have finitely many independent observations, say n of them, in our data set. Each observation informs us about the values of each possible cause and their response:

Observation   (X_1    X_2    ...   X_k    Y)     (values)
1st           (X_11   X_12   ...   X_1k   Y_1)
2nd           (X_21   X_22   ...   X_2k   Y_2)
⋮
nth           (X_n1   X_n2   ...   X_nk   Y_n)

DATA SET

Now we can fit the model with our data set and produce n equations:

Y_1 = β_0 + β_1 X_11 + β_2 X_12 + ... + β_k X_1k + ε_1
Y_2 = β_0 + β_1 X_21 + β_2 X_22 + ... + β_k X_2k + ε_2
⋮
Y_n = β_0 + β_1 X_n1 + β_2 X_n2 + ... + β_k X_nk + ε_n

Here we know all the Y_i and X_ij values, for i = 1, 2, ..., n and j = 1, 2, ..., k, from the values recorded in our data set. But the β_j values, for j = 0, 1, 2, ..., k, are (assumed) unknown constant numbers. Therefore we will search for the best possible numerical values to estimate the unknown β_j values, using our data set information.

Now we can represent the MODEL AND DATA SET INFORMATION FITTED TOGETHER with a simple algebraic notation. Let

y = (Y_1, Y_2, ..., Y_n)^T

be the column listing of all the recorded values of the response in the data set. This is a column vector of the recorded responses, one coordinate per observation. We call y the response vector. Similarly, let

    [ 1   X_11   X_12   ...   X_1k ]
X = [ 1   X_21   X_22   ...   X_2k ]
    [ ⋮                            ]
    [ 1   X_n1   X_n2   ...   X_nk ]

collect all observed values of the possible causes in the data set, so that each row (line) starts with a constant one, followed by the ordered list of numerical values of the possible causes in that observation. This matrix of possible cause values, X, is known as the design matrix.
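As a concrete illustration (not from the lecture notes), the design matrix can be assembled in a few lines of Python with NumPy. The data values below are made up for the example.

```python
import numpy as np

# Hypothetical data set: n = 5 observations, k = 2 possible causes.
X_causes = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])   # response vector

# Design matrix: each row starts with a constant 1, followed by the
# observed values of the possible causes for that observation.
X = np.column_stack([np.ones(len(y)), X_causes])
```

The resulting X is the n × (k + 1) matrix described above, ready for the normal equations.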

Also, let β = (β_0, β_1, β_2, ..., β_k)^T be the column listing of all unknown parameters, known as the parameter vector. Also, let ε = (ε_1, ε_2, ..., ε_n)^T be the column listing of all error terms, known as the error vector.

Thus we can represent the MODEL AND DATA FITTED TOGETHER, all n equations at once, as

y = X β + ε.

This linear algebraic equation can easily be utilized in order to search for and find a solution for the unknown β. Our objective is to find the best possible numerical values for β, such that the estimated β minimizes the prediction error. The condition for minimizing the prediction error is that the predicted response vector, ŷ, is the orthogonal projection of the response vector, y, onto the space spanned by the possible-cause information (the column space of X). Then the magnitude of the prediction error between the actual response vector and the predicted response vector is minimized, i.e. X ⊥ ε̂, or

X^T ε̂ = 0.

Since y = X β + ε, we have ε = y − X β. Thus the minimum-prediction-error condition yields the best possible numerical estimates for β:

X^T (y − X β̂) = 0
X^T y − X^T X β̂ = 0
X^T y = X^T X β̂.

(X^T X) is a square, positive definite, (k+1) × (k+1) matrix; therefore it has an inverse, (X^T X)^{-1}. Consequently,

(X^T X)^{-1} (X^T y) = (X^T X)^{-1} (X^T X) β̂.

Since (X^T X)^{-1} (X^T X) = I_{(k+1)×(k+1)}, the identity matrix, we get

(X^T X)^{-1} (X^T y) = β̂.

Therefore, the best estimated parameter vector that minimizes the magnitude of the prediction error is

β̂ = (X^T X)^{-1} X^T y.
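A minimal sketch of the estimation step, using toy numbers that are not from the lecture. `np.linalg.solve` solves the normal equations (X^T X) β̂ = X^T y directly, which is algebraically identical to applying (X^T X)^{-1} but numerically preferable to forming the inverse.

```python
import numpy as np

# Toy design matrix (column of ones plus k = 2 causes) and response vector.
X = np.array([[1, 1.0, 2.0],
              [1, 2.0, 1.0],
              [1, 3.0, 4.0],
              [1, 4.0, 3.0],
              [1, 5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])

# Normal equations: (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Minimum-prediction-error condition: residuals orthogonal to X,
# i.e. X^T (y - X beta_hat) should be (numerically) the zero vector.
eps_hat = y - X @ beta_hat
orthogonality = X.T @ eps_hat
```

The orthogonality check mirrors the condition X^T ε̂ = 0 derived above.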

Let β̂ = (b_0, b_1, b_2, ..., b_k)^T. Now we can construct the regression equation for the data set by using these estimated parameter values:

Ŷ = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k.

Thus we can estimate the response values by using the possible cause values recorded in the data set:

Ŷ_1 = b_0 + b_1 X_11 + b_2 X_12 + ... + b_k X_1k
Ŷ_2 = b_0 + b_1 X_21 + b_2 X_22 + ... + b_k X_2k
⋮
Ŷ_n = b_0 + b_1 X_n1 + b_2 X_n2 + ... + b_k X_nk

Or, better yet, we can write a column list of these estimated response values, ŷ = (Ŷ_1, Ŷ_2, ..., Ŷ_n)^T, and thus we have

ŷ = X β̂.

If we look at the differences between what is observed and what is estimated for the response, we generate residuals:

e_1 = Y_1 − Ŷ_1
e_2 = Y_2 − Ŷ_2
⋮
e_n = Y_n − Ŷ_n

Hence we can write a column list of these residuals as our residual vector, e = (e_1, e_2, ..., e_n)^T. Therefore,

e = y − ŷ.

On the other hand, we have ŷ = X β̂ = X (X^T X)^{-1} X^T y. The orthogonal projection operator

H = X (X^T X)^{-1} X^T

is called the HAT MATRIX, since it puts a hat on the response vector: ŷ = H y. This implies that the residual vector is also generated by the data set values:

e = y − H y = (I − H) y.

We can measure the variations in the data set for the response values and construct performance measures for our regression equation, to answer two questions:

Question One: How close are our predictions to reality?
Question Two: How reliable are our predictions?

Now we will answer these two questions. Measures of variation in the data for the response values:

(TOTAL)       SST = Σ_{i=1}^n (Y_i − Ȳ)²,    df_T = n − 1,   where Ȳ = (1/n) Σ_{i=1}^n Y_i.
(Regression)  SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)²,    df_R = k.
(Error)       SSE = Σ_{i=1}^n (Y_i − Ŷ_i)²,   df_E = n − k − 1.

Here SST = SSR + SSE and df_T = df_R + df_E.
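The hat matrix and the sum-of-squares decomposition can be verified numerically. This sketch (toy data, not from the lecture) checks that H is an orthogonal projection and that SST = SSR + SSE holds.

```python
import numpy as np

X = np.array([[1, 1.0, 2.0],
              [1, 2.0, 1.0],
              [1, 3.0, 4.0],
              [1, 4.0, 3.0],
              [1, 5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])

# Hat matrix H = X (X^T X)^{-1} X^T: puts a hat on the response vector.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                      # fitted responses, y_hat = H y
e = y - y_hat                      # residual vector, e = (I - H) y

# Sums of squares and their decomposition.
SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum(e ** 2)
```

Because H is an orthogonal projection, H·H = H, and the decomposition SST = SSR + SSE follows.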

Variance estimators for the response:

MSR = SSR / df_R,   MSE = SSE / df_E   (where MSE = σ̂²).

MSR estimates the variance of the response generated by the regression equation, and MSE estimates the variance of the response generated by randomness.

Performance measures for our regression equation:

(A) Standard Error of Prediction: S_{Y·X_1 X_2 ... X_k} = √MSE (σ̂, the estimated standard deviation). This is the estimated standard deviation for the response, and it indicates the average distance between the actual response and the response estimated by our regression equation. The standard error answers the first question.

(B) Coefficient of Determination, a.k.a. R-SQUARE: r² = SSR / SST. This ratio indicates what proportion of the variation in the response behavior can be explained by the regression equation. This r² is the squared linear correlation coefficient between the response and the set of all possible causes, such that r = cos θ.

[Figure: the actual response vector y, its projection ŷ (the projected response) onto the (k + 1)-dimensional linear, or affine, space, and the angle θ between them.]
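These performance measures follow directly from the sums of squares. A short sketch (same toy data as before, made up for illustration), including the adjusted R-Square discussed next:

```python
import numpy as np

X = np.array([[1, 1.0, 2.0],
              [1, 2.0, 1.0],
              [1, 3.0, 4.0],
              [1, 4.0, 3.0],
              [1, 5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])
n, k = X.shape[0], X.shape[1] - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = SST - SSE

MSE = SSE / (n - k - 1)            # estimated variance of the response
s = np.sqrt(MSE)                   # standard error of prediction, sqrt(MSE)
r2 = SSR / SST                     # coefficient of determination
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Note that r2_adj is never larger than r2; the adjustment penalizes extra predictors.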

Here the angle θ is the angle between the response vector y and the space X spanned by the possible-cause information. The coefficient of determination answers the second question.

If we wish to compare many different linear regression models, then we use the adjusted form of R-Square:

r²_adj = 1 − (1 − r²) · (n − 1)/(n − k − 1).

r²_adj is a function of both the number of predictors (possible causes) used in a model and the number of observations in the data set.

We wish to use our regression equation only if we have evidence of agreement between our model and our data set. That is, we investigate our data set for evidence of our model assumptions: a linear relationship between the predictors and the response, normal behavior of the response, constant variance (homoscedasticity) of the response, and independence of the observations. Therefore, we investigate the data set for sufficient evidence of the following assumptions of the model:

1. Normality
2. Homoscedasticity
3. Linearity

Also, we investigate the data set for evidence of the independence of the observations.

1. The normality assumption indicates that the recorded response values in the data set are subject to a normal probability law. Our investigative method is to employ the normal probability plot of the recorded response values. If this plot results in a line, or a graph that is not significantly different from a line, we take this as sufficient evidence that the normality assumption is satisfied. Also, a goodness-of-fit test, such as the chi-square test, may be used to inspect the normality of the recorded response values.

2. The homoscedasticity, or constant variance, assumption for the response may be inspected by using the plot of the residuals versus the recorded response values. If we see the same spread of residuals as the recorded response values change from a small level to a high level, then we take it as sufficient evidence of homoscedasticity.

3. The linearity assumption is investigated by looking for evidence of a linear relationship between the recorded response values and the recorded collection of all possible cause values. This investigation is known as the F-test.

We prepare a summary of the data set information, called the ANOVA TABLE:

SOURCE       df          SS     MS
Regression   k           SSR    MSR
Error        n − k − 1   SSE    MSE
TOTAL        n − 1       SST

where SST = SSR + SSE and df_T = df_R + df_E. Also, MSR = SSR/df_R = SSR/k, and MSE = SSE/df_E = SSE/(n − k − 1).

Thus we construct a hypothesis test for a linear relationship between the response and the set of all possible causes (called the F-test):

H_0: β_1 = β_2 = ... = β_k = 0. (The null hypothesis declares a belief that there is no linear relation between the response and the cause factors, and that the response is generated by pure randomness.)

H_1: At least one β_j is significantly different from zero. (That is, not all coefficients of the possible causes equal zero. The alternative hypothesis declares a belief that there is a linear relationship between the response and the set of all possible causes.)

Level of significance: α.

Test statistic: F_STAT = MSR / MSE ~ F_{df_R, df_E}.

F_CRIT = F_{α; df_R, df_E} = F_{α; k, n−k−1}.

p-value = P(F > F_STAT), by using the F_{k, n−k−1} probability distribution.

Decision rule: If F_STAT > F_CRIT, then reject H_0. Or, if p-value < α, then reject H_0.

Decision, Case A: We cannot reject H_0 at the α level of significance. Thus we do not have sufficient evidence of a linear relationship between the response and the set of all possible causes.

Decision, Case B: We reject H_0 at the α level of significance. We are (1 − α)·100% confident that there is sufficient evidence of a linear relationship between the response event and the set of all possible causes.
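The ANOVA quantities and the F statistic can be computed directly from the fit. A sketch with the same made-up toy data; the comparison against F_CRIT (or a p-value) is left to an F table or a statistics library.

```python
import numpy as np

# Toy data (made up for illustration): n = 5 observations, k = 2 causes.
X = np.array([[1, 1.0, 2.0],
              [1, 2.0, 1.0],
              [1, 3.0, 4.0],
              [1, 4.0, 3.0],
              [1, 5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])
n, k = X.shape[0], X.shape[1] - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = SST - SSE

MSR = SSR / k                      # regression mean square
MSE = SSE / (n - k - 1)            # error mean square
F_stat = MSR / MSE
# Reject H0 if F_stat exceeds the tabled F_{alpha; k, n-k-1}
# (the p-value can be obtained, e.g., with scipy.stats.f.sf(F_stat, k, n - k - 1)).
```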

If we decide that there is sufficient evidence of a linear relationship between the response and the set of all possible causes at a desired level of significance, α, then we investigate each and every possible cause for evidence of a linear relationship with the response event, at the same level of significance used in the F-test. This individual investigation of each possible cause identifies which possible cause information is necessary (to explain the response behavior) and which possible cause information is not needed in the regression equation. We can employ one of three alternative investigative tools for the individual inspection of X_j, for j = 1, 2, 3, ..., k:

I) t-TEST for β_j:

H_0: β_j = 0. (This statement assumes that there is no linear relationship between the response event and possible cause number j.)

H_1: β_j ≠ 0.

Level of significance: α (the same as was used in the F-TEST).

Test statistic: t_STAT = b_j / S_bj ~ t_{n−k−1}, where S_bj = S_{Y·X_1 X_2 ... X_k} / √SSX_j, S_{Y·X_1 X_2 ... X_k} = √MSE, SSX_j = Σ_{i=1}^n (X_ij − X̄_j)², and X̄_j = (1/n) Σ_{i=1}^n X_ij.

t_CRIT = t_{α/2; n−k−1}.

p-value = 2 P(T > |t_STAT|), by using the t_{n−k−1} probability distribution.

Decision rule: If |t_STAT| > t_CRIT, then reject H_0. Or, if p-value < α, then reject H_0.

Decision, Case A: We cannot reject H_0 at the α level of significance. Hence we have no evidence for the necessity of the knowledge of possible cause number j in order to estimate the response behavior. Thus we can eliminate the employment of X_j.
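A sketch of the coefficient t statistics with the same toy data. Note it uses the general matrix form of the standard error, √(MSE · [(X^T X)^{-1}]_jj), which is what statistical software reports; this is not the per-predictor formula written above, but serves the same role in the test statistic.

```python
import numpy as np

X = np.array([[1, 1.0, 2.0],
              [1, 2.0, 1.0],
              [1, 3.0, 4.0],
              [1, 4.0, 3.0],
              [1, 5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])
n, p = X.shape                     # p = k + 1 parameters

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
MSE = e @ e / (n - p)

# Standard errors from the general matrix form sqrt(MSE * [(X^T X)^{-1}]_jj).
se_b = np.sqrt(MSE * np.diag(np.linalg.inv(X.T @ X)))
t_stat = beta_hat / se_b           # compare |t_stat[j]| with t_{alpha/2; n-k-1}
```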

Decision, Case B: We reject H_0 at the α level of significance. Thus we are (1 − α)·100% confident that there is a linear relationship between possible cause number j and the response event. Thus we need to know X_j.

II) (1 − α)·100% confidence interval estimator of β_j, j = 1, 2, ..., k:

[b_j − t_CRIT · S_bj,  b_j + t_CRIT · S_bj]

where t_CRIT = t_{α/2; n−k−1} and S_bj = S_{Y·X_1 X_2 ... X_k} / √SSX_j, using the same α used in the F-TEST before. If this interval includes zero, then there is no sufficient evidence that β_j is significantly different from zero; therefore we do not need to know X_j. Otherwise, if this interval does not include zero, then there is strong evidence that β_j is significantly different from zero, with (1 − α) confidence; hence we need to know X_j.

III) Partial F-Test for X_j, j = 1, 2, 3, ..., k:

H_0: The knowledge of X_j does not contribute significantly to explaining the response event.

H_1: The knowledge of X_j does contribute significantly to explaining the response event.

Level of significance: α (the same used in the F-TEST).

Test statistic:

F_STAT = [SSR(ALL Xs) − SSR(ALL Xs EXCEPT X_j)] / MSE,   F_STAT ~ F_{1, n−k−1}.

Here SSR(ALL Xs) is the SSR for a regression equation that utilizes the knowledge of all the predictors, {X_1, X_2, ..., X_k}, and SSR(ALL Xs EXCEPT X_j) is the SSR for a regression equation that uses all the predictors except the knowledge of X_j: {X_1, X_2, ..., X_{j−1}, X_{j+1}, ..., X_k}. We measure the degree of contribution made by the knowledge of X_j to the prediction of the response event as the difference between the information generated by the full model and the information generated by the reduced model:

SSR(X_j | ALL Xs EXCEPT X_j) = SSR(ALL Xs) − SSR(ALL Xs EXCEPT X_j).

Also, we have F_CRIT = F_{α; 1, n−k−1}, and p-value = P(F > F_STAT), by using the F_{1, n−k−1} probability distribution.
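A sketch of the confidence interval estimator with the same toy data. The critical value below is hard-coded from a t table for this specific toy case (α = 0.05, n − k − 1 = 2 degrees of freedom); in practice it would come from a table or scipy.stats.t.ppf.

```python
import numpy as np

X = np.array([[1, 1.0, 2.0],
              [1, 2.0, 1.0],
              [1, 3.0, 4.0],
              [1, 4.0, 3.0],
              [1, 5.0, 5.0]])
y = np.array([3.6, 4.4, 7.5, 8.6, 10.9])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
MSE = e @ e / (n - p)
se_b = np.sqrt(MSE * np.diag(np.linalg.inv(X.T @ X)))

# t_{alpha/2; n-k-1} for alpha = 0.05 and 2 df, from a t table.
t_crit = 4.303

lower = beta_hat - t_crit * se_b
upper = beta_hat + t_crit * se_b
# If [lower[j], upper[j]] contains zero, there is no sufficient evidence
# that beta_j differs from zero, so X_j is not needed at this alpha.
```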

Decision rule: If F_STAT > F_CRIT, then reject H_0. Or, if p-value < α, then reject H_0.

Decision, Case A: We cannot reject H_0 at the α level of significance. Thus the knowledge of X_j is not needed.

Decision, Case B: We reject H_0 at the α level of significance. Hence we are (1 − α)·100% confident that the knowledge of X_j is necessary to predict the response event.

Here we generate another critical piece of information about the degree of contribution made by the knowledge of X_j, measured by the COEFFICIENT OF PARTIAL DETERMINATION:

r²_{Y j · {ALL Xs EXCEPT X_j}} = SSR(X_j | ALL Xs EXCEPT X_j) / [SST − SSR(ALL Xs) + SSR(X_j | ALL Xs EXCEPT X_j)].

This information is necessary to associate a monetary value with the knowledge of X_j.

IV) The independence assumption for the data set observations can be satisfied by a random selection of observations collected in the same time period. Otherwise, if the data set is a Time Series, a.k.a. Historical Data, then we investigate for sufficient evidence of positive autocorrelation between the data set observations by utilizing a Durbin-Watson test.

Interaction Effect: We may introduce an interaction effect between predictors as a multiplicative effect, generating a new predictor that represents a multiplication of individual predictors, for any desired combination of the original predictors. This new predictor is treated like any other individual one, but represents one interaction effect between the assumed combination of original predictors constituting the new predictor term.

A linear regression model accommodates quantitative variables and qualitative variables without any difficulty.

Categorical Variables: Represented by an indicator function, also known as a characteristic function, used for identifying membership in a set.
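The Durbin-Watson statistic mentioned in IV) is simple to compute from the residuals. The residual values below are made up to illustrate a positively autocorrelated (time-ordered) pattern.

```python
import numpy as np

# Hypothetical residuals from a regression on time-ordered (historical) data.
e = np.array([0.5, 0.3, -0.1, -0.4, -0.2, 0.1, 0.4, 0.2])

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals, divided by the sum of squared residuals. Values near 2 suggest
# no autocorrelation; values well below 2 suggest positive autocorrelation.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
# Here dw ≈ 0.72, well below 2, signalling positive autocorrelation.
```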

Definition: Let A be a set of elements. If an element, denoted by R, belongs to the set A, then the characteristic function of the set A takes the value one for this element. Otherwise, if R does not belong to the set A, the characteristic function of the set A takes the value zero for this element:

χ_A(R) = 1, if R ∈ A;   χ_A(R) = 0, if R ∉ A.

The characteristic function of the set A, denoted by χ_A, is also called the indicator function, and denoted by Ι_A.

In the context of a Bernoulli trial, we use the term dummy variable to identify an outcome, such that:

X = 1, if an observation is a SUCCESS;   X = 0, if an observation is a FAILURE.

Here, success and failure are generic terms that indicate a dichotomy: two distinct events that are mutually exclusive and collectively exhaustive.

Now we may introduce an application of the linear regression model to a very important decision-making activity, "How do we estimate the probability of success?" (generated by many possible causes), for binomial experiments.
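Coding a dummy variable is a one-liner in practice. A sketch with a hypothetical categorical outcome:

```python
import numpy as np

# Hypothetical categorical outcomes: two mutually exclusive,
# collectively exhaustive events (a dichotomy).
outcome = np.array(["SUCCESS", "FAILURE", "SUCCESS", "SUCCESS", "FAILURE"])

# Dummy (indicator/characteristic) variable: 1 for SUCCESS, 0 for FAILURE.
X_dummy = (outcome == "SUCCESS").astype(float)
```

The resulting column can be placed in the design matrix alongside quantitative predictors.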


More information

Lecture 5: Linear Regression

Lecture 5: Linear Regression EAS31136/B9036: Statistics in Earth & Atmospheric Sciences Lecture 5: Linear Regression Instructor: Prof. Johnny Luo www.sci.ccny.cuny.edu/~luo Dates Topic Reading (Based on the 2 nd Edition of Wilks book)

More information

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables Regression Analysis Regression: Methodology for studying the relationship among two or more variables Two major aims: Determine an appropriate model for the relationship between the variables Predict the

More information

STAT Chapter 11: Regression

STAT Chapter 11: Regression STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Concordia University (5+5)Q 1.

Concordia University (5+5)Q 1. (5+5)Q 1. Concordia University Department of Mathematics and Statistics Course Number Section Statistics 360/1 40 Examination Date Time Pages Mid Term Test May 26, 2004 Two Hours 3 Instructor Course Examiner

More information

What is a Hypothesis?

What is a Hypothesis? What is a Hypothesis? A hypothesis is a claim (assumption) about a population parameter: population mean Example: The mean monthly cell phone bill in this city is μ = $42 population proportion Example:

More information

Chapter 3 Multiple Regression Complete Example

Chapter 3 Multiple Regression Complete Example Department of Quantitative Methods & Information Systems ECON 504 Chapter 3 Multiple Regression Complete Example Spring 2013 Dr. Mohammad Zainal Review Goals After completing this lecture, you should be

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

STAT 511. Lecture : Simple linear regression Devore: Section Prof. Michael Levine. December 3, Levine STAT 511

STAT 511. Lecture : Simple linear regression Devore: Section Prof. Michael Levine. December 3, Levine STAT 511 STAT 511 Lecture : Simple linear regression Devore: Section 12.1-12.4 Prof. Michael Levine December 3, 2018 A simple linear regression investigates the relationship between the two variables that is not

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Bayesian Analysis LEARNING OBJECTIVES. Calculating Revised Probabilities. Calculating Revised Probabilities. Calculating Revised Probabilities

Bayesian Analysis LEARNING OBJECTIVES. Calculating Revised Probabilities. Calculating Revised Probabilities. Calculating Revised Probabilities Valua%on and pricing (November 5, 2013) LEARNING OBJECTIVES Lecture 7 Decision making (part 3) Regression theory Olivier J. de Jong, LL.M., MM., MBA, CFD, CFFA, AA www.olivierdejong.com 1. List the steps

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

Chapter 16. Simple Linear Regression and dcorrelation

Chapter 16. Simple Linear Regression and dcorrelation Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

STA 108 Applied Linear Models: Regression Analysis Spring Solution for Homework #6

STA 108 Applied Linear Models: Regression Analysis Spring Solution for Homework #6 STA 8 Applied Linear Models: Regression Analysis Spring 011 Solution for Homework #6 6. a) = 11 1 31 41 51 1 3 4 5 11 1 31 41 51 β = β1 β β 3 b) = 1 1 1 1 1 11 1 31 41 51 1 3 4 5 β = β 0 β1 β 6.15 a) Stem-and-leaf

More information

Chapter 14 Student Lecture Notes 14-1

Chapter 14 Student Lecture Notes 14-1 Chapter 14 Student Lecture Notes 14-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter 14 Multiple Regression Analysis and Model Building Chap 14-1 Chapter Goals After completing this

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

Inference for Regression Inference about the Regression Model and Using the Regression Line

Inference for Regression Inference about the Regression Model and Using the Regression Line Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

Business Statistics. Lecture 10: Correlation and Linear Regression

Business Statistics. Lecture 10: Correlation and Linear Regression Business Statistics Lecture 10: Correlation and Linear Regression Scatterplot A scatterplot shows the relationship between two quantitative variables measured on the same individuals. It displays the Form

More information

BNAD 276 Lecture 10 Simple Linear Regression Model

BNAD 276 Lecture 10 Simple Linear Regression Model 1 / 27 BNAD 276 Lecture 10 Simple Linear Regression Model Phuong Ho May 30, 2017 2 / 27 Outline 1 Introduction 2 3 / 27 Outline 1 Introduction 2 4 / 27 Simple Linear Regression Model Managerial decisions

More information

Summary of Chapters 7-9

Summary of Chapters 7-9 Summary of Chapters 7-9 Chapter 7. Interval Estimation 7.2. Confidence Intervals for Difference of Two Means Let X 1,, X n and Y 1, Y 2,, Y m be two independent random samples of sizes n and m from two

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Chapter 7 Student Lecture Notes 7-1

Chapter 7 Student Lecture Notes 7-1 Chapter 7 Student Lecture Notes 7- Chapter Goals QM353: Business Statistics Chapter 7 Multiple Regression Analysis and Model Building After completing this chapter, you should be able to: Explain model

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). For example P(X.04) =.8508. For z < 0 subtract the value from,

More information

The simple linear regression model discussed in Chapter 13 was written as

The simple linear regression model discussed in Chapter 13 was written as 1519T_c14 03/27/2006 07:28 AM Page 614 Chapter Jose Luis Pelaez Inc/Blend Images/Getty Images, Inc./Getty Images, Inc. 14 Multiple Regression 14.1 Multiple Regression Analysis 14.2 Assumptions of the Multiple

More information

Confidence Interval for the mean response

Confidence Interval for the mean response Week 3: Prediction and Confidence Intervals at specified x. Testing lack of fit with replicates at some x's. Inference for the correlation. Introduction to regression with several explanatory variables.

More information

Lecture 9 SLR in Matrix Form

Lecture 9 SLR in Matrix Form Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.

More information

NATCOR Regression Modelling for Time Series

NATCOR Regression Modelling for Time Series Universität Hamburg Institut für Wirtschaftsinformatik Prof. Dr. D.B. Preßmar Professor Robert Fildes NATCOR Regression Modelling for Time Series The material presented has been developed with the substantial

More information

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.

More information

Model Building Chap 5 p251

Model Building Chap 5 p251 Model Building Chap 5 p251 Models with one qualitative variable, 5.7 p277 Example 4 Colours : Blue, Green, Lemon Yellow and white Row Blue Green Lemon Insects trapped 1 0 0 1 45 2 0 0 1 59 3 0 0 1 48 4

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Notes for Week 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1

Notes for Week 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1 Notes for Wee 13 Analysis of Variance (ANOVA) continued WEEK 13 page 1 Exam 3 is on Friday May 1. A part of one of the exam problems is on Predictiontervals : When randomly sampling from a normal population

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore What is Multiple Linear Regression Several independent variables may influence the change in response variable we are trying to study. When several independent variables are included in the equation, the

More information

Correlation and the Analysis of Variance Approach to Simple Linear Regression

Correlation and the Analysis of Variance Approach to Simple Linear Regression Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation

More information

Lecture 2. The Simple Linear Regression Model: Matrix Approach

Lecture 2. The Simple Linear Regression Model: Matrix Approach Lecture 2 The Simple Linear Regression Model: Matrix Approach Matrix algebra Matrix representation of simple linear regression model 1 Vectors and Matrices Where it is necessary to consider a distribution

More information

Unbalanced Data in Factorials Types I, II, III SS Part 1

Unbalanced Data in Factorials Types I, II, III SS Part 1 Unbalanced Data in Factorials Types I, II, III SS Part 1 Chapter 10 in Oehlert STAT:5201 Week 9 - Lecture 2 1 / 14 When we perform an ANOVA, we try to quantify the amount of variability in the data accounted

More information

Simple linear regression

Simple linear regression Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

STAT 350 Final (new Material) Review Problems Key Spring 2016

STAT 350 Final (new Material) Review Problems Key Spring 2016 1. The editor of a statistics textbook would like to plan for the next edition. A key variable is the number of pages that will be in the final version. Text files are prepared by the authors using LaTeX,

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Variance Decomposition and Goodness of Fit

Variance Decomposition and Goodness of Fit Variance Decomposition and Goodness of Fit 1. Example: Monthly Earnings and Years of Education In this tutorial, we will focus on an example that explores the relationship between total monthly earnings

More information

Inference for Regression Simple Linear Regression

Inference for Regression Simple Linear Regression Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating

More information

Unit 27 One-Way Analysis of Variance

Unit 27 One-Way Analysis of Variance Unit 27 One-Way Analysis of Variance Objectives: To perform the hypothesis test in a one-way analysis of variance for comparing more than two population means Recall that a two sample t test is applied

More information

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal Econ 3790: Business and Economics Statistics Instructor: Yogesh Uppal yuppal@ysu.edu Sampling Distribution of b 1 Expected value of b 1 : Variance of b 1 : E(b 1 ) = 1 Var(b 1 ) = σ 2 /SS x Estimate of

More information

Variance Decomposition in Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017

Variance Decomposition in Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017 Variance Decomposition in Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017 PDF file location: http://www.murraylax.org/rtutorials/regression_anovatable.pdf

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Linear Regression Analysis - Chapters 3 and 4 in Dielman Artin Department of Statistical Science September 15, 2009 Outline 1 Simple Linear Regression Analysis 2 Using

More information

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression Chapter 12 12-1 North Seattle Community College BUS21 Business Statistics Chapter 12 Learning Objectives In this chapter, you learn:! How to use regression analysis to predict the value of a dependent

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

STAT 212 Business Statistics II 1

STAT 212 Business Statistics II 1 STAT 1 Business Statistics II 1 KING FAHD UNIVERSITY OF PETROLEUM & MINERALS DEPARTMENT OF MATHEMATICAL SCIENCES DHAHRAN, SAUDI ARABIA STAT 1: BUSINESS STATISTICS II Semester 091 Final Exam Thursday Feb

More information