Topic 1. Definitions


1. Scalar
A scalar is a number.

2. Vector
A vector is a column of numbers.

3. Linear combination
A linear combination is a scalar times a vector, plus a scalar times a vector, plus a scalar times a vector... etc.

4. Adding two vectors
To add two (column) vectors, they must be of the same length, i.e., have the same number of observations. Adding two vectors is accomplished by adding the contents of each row, one at a time, to form a new vector: A + B = C.

5. Multiplying a scalar by a vector
To multiply a scalar by a vector, one simply creates a new vector the same length as the old vector, where every new value is the value in the old vector multiplied by the scalar. Combining the two operations gives linear combinations such as 2A + 3B = D.
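The definitions above can be sketched in a few lines of plain Python. The vectors A and B here are made-up examples, not data from the handout.

```python
# A minimal sketch of the definitions above, using plain Python lists
# as (column) vectors. A and B are hypothetical example vectors.
A = [1, 2, 3]
B = [4, 5, 6]

# Adding two vectors of the same length: add row by row.
C = [a + b for a, b in zip(A, B)]          # A + B = C

# Multiplying a scalar by a vector: multiply every entry by the scalar.
twoA = [2 * a for a in A]

# A linear combination: scalar times a vector plus scalar times a vector.
D = [2 * a + 3 * b for a, b in zip(A, B)]  # 2A + 3B = D
```

Note that the linear combination D uses both operations at once: each row of D is 2 times the row of A plus 3 times the row of B.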

Topic 2. Linear Independence

1. Definition
A set of vectors is said to be linearly independent if no vector in the set can be expressed as a linear combination of the others in the set.

2. How to Determine Linear Independence
When faced with a set of vectors, it will sometimes be necessary to determine how many of the vectors are linearly independent. The steps below can be followed:
1) Determine if any vector in the set of N can be written as a linear combination of the rest. If not, there are N linearly independent vectors. (Stop.)
2) If any one vector can be expressed as a linear combination of the rest (any scalars, including 0, are permissible), then eliminate that vector.
3) Of the remaining N-1 vectors, determine if any one can be written as a linear combination of the rest. If not, then among the N vectors, N-1 are linearly independent. If yes, eliminate the vector and proceed with the set of N-2 vectors.
4) Continue the process until all vectors remaining in the set are linearly independent. If k vectors have been eliminated, there are (N-k) vectors that are linearly independent.

3. Examples
1) Vectors A, B, C
2) Vectors A, B, C, D, E
3) Vectors A, B, C, D, E
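The elimination procedure above counts how many vectors survive; numerically, that count is the rank of the matrix whose columns are the vectors. A sketch with NumPy (the example vectors are hypothetical, not the A..E from the examples):

```python
# Counting linearly independent vectors via matrix rank.
# A, B, C are hypothetical example vectors; C is deliberately a linear
# combination of A and B, so only 2 of the 3 are independent.
import numpy as np

A = np.array([1.0, 0.0, 2.0])
B = np.array([0.0, 1.0, 3.0])
C = A + 2 * B                      # C = 1*A + 2*B, a linear combination

vectors = np.column_stack([A, B, C])
# The matrix rank equals the (N - k) count the elimination steps reach.
n_independent = int(np.linalg.matrix_rank(vectors))
```

Here the procedure would eliminate C (it can be written as A + 2B), leaving 2 linearly independent vectors, which is exactly the rank.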

Topic 3. Simple Example Media Test

1. Simple Example Media, No Unit Vector: Problem

A) Full Model

The Data: Sales, Media

S = a1 M1 + a2 M2
Sales = a1 M1 + a2 M2 + Error

1) Fill in the 7 values for M1 and M2 above.
2) Solve for a1 and a2: a1 = ____  a2 = ____
3) Fill in the 7 values in the error vector above.
4) Solve for the error sum-of-squares for the Full Model. ESS_F = ____.
5) Write down:
Value of expected value for Sales for media 1: ____
Value of expected value for Sales for media 2: ____

Topic 3. Media Test (Continued)

B) Restricted Model

S = bU
Sales = bU + Error

1) Fill in the 7 values for U above.
2) Solve for b: b = ____
3) Fill in the 7 values in the error vector above.
4) Solve for the error sum-of-squares for the Restricted Model. ESS_R = ____.
5) Write down:
Value of expected value for sales for media 1: ____
Value of expected value for sales for media 2: ____

C) Solve for F

F = [(ESS_R - ESS_F) / (NL_F - NL_R)] / [ESS_F / (NOB - NL_F)]

Answer for F: ____

Topic 3. Media Test (Continued)

2. Simple Example Media, No Unit Vector: Answer

The Data: Sales, Media

A) Full Model

1) & 3) The model: Sales = a1 M1 + a2 M2 + Error
2) a1 = 37, a2 = 44
4) ESS_F = 16
5) Value of expected value for Sales for Media 1: 37
Value of expected value for Sales for Media 2: 44

Topic 3. Media Test (Continued)

B) Restricted Model

Linear restriction: a1 = a2 = b
Sales = a1 M1 + a2 M2 = b M1 + b M2 = b(M1 + M2) = bU

1) & 3) The model: Sales = bU + Error
2) b = 41
4) ESS_R = 100
5) Value of expected value for Sales for Media 1: 41
Value of expected value for Sales for Media 2: 41

C) Solve for F

F = [(ESS_R - ESS_F) / (NL_F - NL_R)] / [ESS_F / (NOB - NL_F)]
  = [(100 - 16) / (2 - 1)] / [16 / (7 - 2)]
  = 84 / 3.2
  = 26.25
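The full-versus-restricted arithmetic above can be replicated in a few lines of NumPy. The seven individual sales figures below are hypothetical stand-ins (the handout's raw data listing is not reproduced here); they are chosen so the group means come out to the printed answers a1 = 37 and a2 = 44.

```python
# A sketch of the media F test. Sales values are hypothetical, chosen to
# be consistent with the printed answers (a1 = 37, a2 = 44, b = 41).
import numpy as np

sales = np.array([35.0, 37.0, 39.0, 42.0, 44.0, 44.0, 46.0])
M1    = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # media 1 rows
M2    = 1.0 - M1                                        # media 2 rows

# Full model: S = a1*M1 + a2*M2, fit by least squares.
X_full = np.column_stack([M1, M2])
coef, *_ = np.linalg.lstsq(X_full, sales, rcond=None)
ess_f = float(np.sum((sales - X_full @ coef) ** 2))

# Restricted model: S = b*U, so b is just the overall mean.
b = float(sales.mean())
ess_r = float(np.sum((sales - b) ** 2))

# F = [(ESS_R - ESS_F)/(NL_F - NL_R)] / [ESS_F/(NOB - NL_F)]
F = ((ess_r - ess_f) / (2 - 1)) / (ess_f / (7 - 2))
```

With these (assumed) data the fit reproduces the worked answers: a1 = 37, a2 = 44, b = 41, ESS_F = 16, ESS_R = 100, and F = 26.25.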

Topic 3. Media Test (Continued)

3. Simple Example Media, With Unit Vector: Problem

A) Full Model

S = a0 U + a3 M1
Sales = a0 U + a3 M1 + Error

1) Fill in the 7 values for U and M1 above.
2) Solve for a0 and a3: a0 = ____  a3 = ____
3) Fill in the 7 values in the error vector above.
4) Solve for the error sum-of-squares for this Full Model. ESS_F = ____.
5) Write down:
Value of expected value for sales, media 1: ____
Value of expected value for sales, media 2: ____

B) Restricted Model: Note: it is the same as the Restricted Model in the above example with no unit vector.

C) Solve for F. F = ____

Topic 3. Media Test (Continued)

4. Simple Example Media, With Unit Vector: Answer

A) Full Model

1) & 3) Sales = a0 U + a3 M1 + Error
2) a0 = 44, a3 = -7
4) ESS_F = 16
5) Value of expected value for Sales for Media 1: a0(1) + a3(1) = 44(1) - 7(1) = 37
Value of expected value for Sales for Media 2: a0(1) + a3(0) = a0 = 44

B) Restricted Model
Linear restriction: a3 = 0
S = a0 U (Same as in the example with no unit vector.)

C) Solve for F.
F = [(100 - 16) / (2 - 1)] / [16 / (7 - 2)] = 84 / 3.2 = 26.25
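The unit-vector form of the full model is just a reparameterization: it gives the same fit, and the same error sum-of-squares, as the no-unit-vector form. A sketch, using the same hypothetical sales figures as before (chosen to match the printed group means of 37 and 44):

```python
# A sketch showing the unit-vector parameterization S = a0*U + a3*M1
# reproduces a0 = 44 and a3 = -7, and the same ESS as the two-dummy form.
# The sales values are hypothetical but consistent with the answers above.
import numpy as np

sales = np.array([35.0, 37.0, 39.0, 42.0, 44.0, 44.0, 46.0])
U     = np.ones(7)                                      # unit vector
M1    = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # media 1 rows

X = np.column_stack([U, M1])
(a0, a3), *_ = np.linalg.lstsq(X, sales, rcond=None)
ess_f = float(np.sum((sales - X @ np.array([a0, a3])) ** 2))
```

Here a0 is the media 2 mean (44) and a3 is the media 1 mean minus the media 2 mean (37 - 44 = -7), so the expected values, and the fit, are unchanged.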

Topic 3. Media Test (Continued)

5. Simple Example Media, With Unit Vector: SPSS

A) Data: the seven rows of sales and media values.

Topic 3. Media Test (Continued)

B) Regression Output

Variables Entered/Removed: MEDIA entered (Method: Enter). Dependent Variable: SALES.

Model Summary: R = .917; predictors: (Constant), MEDIA.

ANOVA (Dependent Variable: SALES; predictors: (Constant), MEDIA):
Regression: Sum of Squares 84 (this is ESS_R - ESS_F), df 1, Mean Square 84
Residual:   Sum of Squares 16 (this is ESS_F), df 5, Mean Square 3.2
Total:      Sum of Squares 100 (this is ESS_R), df 6
F = 26.25

Coefficients (Dependent Variable: SALES), unstandardized B:
(Constant): 44 (this is a0)
MEDIA: -7 (this is a3)

Topic 4. Simple Example Price, 3 Levels

(A) The Model

Now we illustrate the one independent variable test using a simple example with only seven observations. The dependent variable is sales and the independent variable is price with 3 levels: $5, $10, or $15. Here is the raw data.

Trade Area   Unit Sales   Price Charged
1            140          $5
2            136          $5
3            122          $10
4            124          $10
5            104          $15
6            108          $15
7            106          $15

Full Model

1. Full Model in theory
We construct the Full Model in which we allow a different estimate for sales at each level of price:

S = a1 P5 + a2 P10 + a3 P15 .. (1)

2. Full Model in SPSS
Since SPSS automatically adds the unit vector to our model, we must drop one of the three binary predictor vectors, either P5, P10 or P15. (We must drop one of the vectors because Unit = P5 + P10 + P15, and this would introduce a linear dependency into the model. Dropping the vector is not a problem, however, because model (2) below and model (1) above are equivalent models.)

S = a0 U + a1 P5 + a2 P10 .. (2)

In this model (2), which we call our full model, sales is the dependent variable (measured at the interval or ratio level), U is the unit vector of all ones, a0 is the weight on the unit vector,

Topic 4. Price, 3 Levels (Continued)

P5 is a binary predictor vector, a1 is the weight on P5, P10 is a binary predictor vector, and a2 is its weight.

What is in P5? P5 has ones and zeros. It has a one when the sales for the row came from a trade area where we charged $5 and a zero otherwise. What is in P10? P10 has ones and zeros. It has a one when the sales for the row came from a trade area where we charged $10 and a zero otherwise.

3. Converting raw observations into the Full Model
The above observations about P5 and P10 are illustrated in the following complete depiction of the model.

S = a0 U + a1 P5 + a2 P10

(The binary predictor vectors P5 and P10 are created using the recode function in SPSS and the data on whether $5 or $10 was charged.)

4. Calculating expected value (EV) of Sales using the Full Model
From our full model, what is the expected value of sales given that $5 (or $10) was charged? To answer that question we first have to say what is in each vector of the model under the condition that $5 was charged. U has a 1, P5 has a 1 and P10 has a 0.

EV(S: $5) = a0 U + a1 P5 + a2 P10 = a0(1) + a1(1) + a2(0) = a0 + a1

When $10 was charged, U still has a 1, but P5 has a 0 and P10 has a 1.

EV(S: $10) = a0 U + a1 P5 + a2 P10 = a0(1) + a1(0) + a2(1) = a0 + a2

Topic 4. Price, 3 Levels (Continued)

When $15 was charged, U still has a 1, but P5 has a 0, and P10 also has a 0.

EV(S: $15) = a0 U + a1 P5 + a2 P10 = a0(1) + a1(0) + a2(0) = a0

5. Parameter estimation in SPSS
The parameters a0, a1 and a2 are estimated in SPSS by using the Regression function under the Statistics menu, where sales is the dependent variable and P5 and P10 are the independent variables. (Don't worry about where they come from; this will be explained later.) In our example, a0 = 106, a1 = 32, a2 = 17. (Details as per the SPSS outputs shown later.) So our full model can be written as:

S = (106)U + (32)P5 + (17)P10

6. Restating the model with error term E1
Using the parameter estimates, we can restate the model with the error term E1 as follows.

S = (106)U + (32)P5 + (17)P10 + E1

How can we get the values of E1 in the above model? First, we need to get the expected values of Sales at the different price levels: $5, $10, and $15. To do so, we simply plug the parameter estimates into our solutions in 4 earlier.

EV(S: $5) = a0 + a1, so our estimate for sales at $5 is 106 + 32 = 138.
EV(S: $10) = a0 + a2, so our estimate at $10 is 106 + 17 = 123.
EV(S: $15) = a0, so our estimate at $15 is 106.

Then, by comparing the estimated value and the raw observation in each row, we can get the values of E1.

Topic 4. Price, 3 Levels (Continued)

7. Error sum-of-squares of full model (ESS_F)
The error sum-of-squares of our full model is simply the sum of the squared errors in E1:

ESS_F = (2)² + (-2)² + (-1)² + (1)² + (-2)² + (2)² + (0)² = 4 + 4 + 1 + 1 + 4 + 4 + 0 = 18

Restricted Model

1. The hypothesis in our test
The hypothesis we wish to test is whether our sample could have come from a population where there is no relationship between price and sales. Put another way: whether our sample could have come from a population where the sales at all three price levels were equal. In other words, if there is no relationship between price and sales, the expected value of sales would be the same at the different price levels. So we can state this hypothesis in null form:

EV(S: $5) = EV(S: $10) = EV(S: $15)

Substituting the appropriate parameters, it can be rewritten as:

a0 + a1 = a0 + a2 = a0

Note that the one and only condition under which the above equation is true is where a1 = a2 = 0.

2. Linear restriction
So the linear restriction we impose on the Full Model (2) is a1 = a2 = 0. This is what we are testing. Could our sample have come from a population where a1 = a2 = 0?

3. Restricted model
The linear restriction gives us our restricted model

S = a0' U .. (3)

(We write a0' because when SPSS runs the restricted model with just a weight on the unit vector, the value for a0' in such a model will almost always be different than the value for a0 in (2), the full model. The least squares estimate for a0' in (3) is 120. Note: for a model in which only the unit vector is present, the weight on the unit vector will simply be the average of the dependent variable.)

Topic 4. Price, 3 Levels (Continued)

Rewriting (3) with the error vector E2, we have:

S = 120 U + E2

Note that in this model we estimate sales to be the same (every time our estimate is 120), regardless of the price charged.

4. Error sum-of-squares of restricted model (ESS_R)
The error sum-of-squares of our restricted model is simply the sum of the squared errors in E2:

ESS_R = (20)² + (16)² + (2)² + (4)² + (-16)² + (-12)² + (-14)² = 400 + 256 + 4 + 16 + 256 + 144 + 196 = 1,272

F Statistic Calculation

1. Now we calculate our F statistic with the following numbers: there are 3 linearly independent predictor vectors in the full model (NL_F = 3); there is 1 linearly independent predictor vector in the restricted model (NL_R = 1); there are 7 observations in our example (NOB = 7). The error sum-of-squares of the full model is 18 (ESS_F = 18); the error sum-of-squares of the restricted model is 1,272 (ESS_R = 1,272).

F = [(1,272 - 18) / (3 - 1)] / [18 / (7 - 3)] = (1,254 / 2) / (18 / 4) = 627 / 4.5 ≈ 139.33
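The whole Topic 4 calculation can be replicated with NumPy, assuming the seven trade-area sales implied by the error vectors above (140 and 136 at $5; 122 and 124 at $10; 104, 108 and 106 at $15):

```python
# A sketch of the full and restricted fits for the price example.
import numpy as np

sales = np.array([140.0, 136.0, 122.0, 124.0, 104.0, 108.0, 106.0])
p5    = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
p10   = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
U     = np.ones(7)

# Full model (2): S = a0*U + a1*P5 + a2*P10
X = np.column_stack([U, p5, p10])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)   # a0, a1, a2
ess_f = float(np.sum((sales - X @ coef) ** 2))

# Restricted model (3): S = a0'*U, i.e. every estimate is the overall mean.
ess_r = float(np.sum((sales - sales.mean()) ** 2))

# F = [(ESS_R - ESS_F)/(NL_F - NL_R)] / [ESS_F/(NOB - NL_F)]
F = ((ess_r - ess_f) / (3 - 1)) / (ess_f / (7 - 3))
```

This reproduces a0 = 106, a1 = 32, a2 = 17, ESS_F = 18, ESS_R = 1,272, and F = 627/4.5 ≈ 139.33.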

Topic 4. Price, 3 Levels (Continued)

2. Interpretation
Using the SPSS output, we can get the probability that we would observe an F of 139.33 or larger in a sample taken from a population where the true F is zero; it is a very, very low number, less than .0005. Since these odds are so small, we conclude that our sample did not come from a population where the linear restriction is true (equivalent to saying the F in the population is not zero). So, if the linear restriction is not true in the population, this means that sales are different when the price is different, and we must think carefully about the price we charge. Perhaps we can use cost and margin data to figure out the optimal price.

(B) SPSS

Data

area  sales  price  p5  p10
1     140    5      1   0
2     136    5      1   0
3     122    10     0   1
4     124    10     0   1
5     104    15     0   0
6     108    15     0   0
7     106    15     0   0

Topic 4. Price, 3 Levels (Continued)

Output Interpretation

Regression

Variables Entered/Removed: P10, P5 entered (Method: Enter). Dependent Variable: SALES.

Model Summary: R = .993; predictors: (Constant), P10, P5.

ANOVA (Dependent Variable: SALES; predictors: (Constant), P10, P5):
Regression: Sum of Squares 1,254 (this is ESS_R - ESS_F), df 2, Mean Square 627
Residual:   Sum of Squares 18 (this is ESS_F), df 4, Mean Square 4.5
Total:      Sum of Squares 1,272 (this is ESS_R), df 6
F = 139.33, Sig. = .000

Coefficients (Dependent Variable: SALES), unstandardized B:
(Constant): 106 (this is a0)
P5: 32 (this is a1)
P10: 17 (this is a2)

Topic 5. More On The Concept of Linear Models And F Statistics

Test Example Data Base

Assume we are working with the data from the example file in which we test marketing a new product. In the test market, we systematically varied prices ($5 or $10), advertising (equivalent to $10,000 per market area or $20,000 per market area) and secret ingredient X (essentially, at 4 different levels). There were 96 different test market areas, each roughly equivalent in terms of size, income, and all other relevant characteristics, and we recorded the sales of the product in each market area after a suitable interval. In this data set, it is easy to identify the dependent variable (Sales) because everything else was part of the carefully controlled experiment. So what we want to do is to test for relationships between each of the controlled variables and Sales. Does price affect sales? Does advertising affect sales? Does the level of the secret ingredient affect sales? Assume for the moment that our data is really data from a population of interest (and not the sample that it is).

The Logic

What would be true if there were a relationship between the dependent variable Sales and the independent variable Price? We would observe that for different values of Price we obtained different values for Sales. If this were a product that was price sensitive, then we would expect Sales to be higher when Price was lower. Since we are dealing with 48 observations for each price level, we would expect the average Sales for the price of $5 to be higher than the average Sales for the price of $10. One way to test this would be to calculate the 2 averages and compare them. (Remember, we assumed this was our population of interest, so if the averages are different, we conclude there exists a relationship.) But simply comparing averages will not work for all of the hypotheses we wish to test. There are many fairly complex hypotheses we wish to test that require us to think differently than simply in terms of averages.

Linear Models

1. The Concept of a Linear Model
In Topic 1, we introduced the concept of a linear combination of a set of vectors. It is simply the sum of a weight times a vector, plus a weight times a vector, ...etc. Put most simply: a linear model is a linear combination of a set of predictor vectors.

Topic 5. Linear Models & F Test (Continued)

It is a model in the sense that it is intended to reproduce (or fit) the values for one variable (we call it the dependent variable) given the values on one or more other variables (we call them the independent variables). For example, we might create a linear model to predict Sales as a function of Price. Or Advertising. Or Price and Advertising. Or Price, Advertising and our secret Ingredient X.

2. Full and Restricted Models
To test our hypotheses, we need to create 2 models -- a full model and a restricted model -- and compare them in terms of their fit to a set of data. The restricted model is created by imposing a linear restriction on the weights in the full model. If the linear restriction is true, then the restricted model will fit the data almost as well as the full model. If the linear restriction is not true, then the restricted model will not fit the data as well as the full model.

3. Example Demonstration
Now we use an example to illustrate the full and restricted models. Suppose we wish to test for a relationship between Price and Sales.

Full model
We know that Price has 2 levels ($5 and $10). So we first create a full model in which we express Sales as a function of Price. Because Price has 2 levels, we form 2 predictor vectors: one to be associated with Sales values that resulted when Price was at $5, and the other to be associated with Sales values that resulted when Price was at $10. The predictor vectors will be binary, i.e., they will contain zeros or ones, and they can be thought of as membership vectors in the sense that they indicate whether a particular sales result is a "member of" the $5 price condition or the $10 price condition.

The full model looks like this:

S = a1(P5) + a2(P10)

Where:
S is the sales value;
P5 is the binary predictor vector which will contain i) a one if the observed sales value came from a test market area where $5 was charged, and ii) a zero otherwise;
P10 is the binary predictor vector which will contain i) a one if the observed sales value came from a test market area where $10 was charged, and ii) a zero otherwise;
a1 is the weight (to be estimated) for predictor vector P5; and
a2 is the weight (to be estimated) for predictor vector P10.

Topic 5. Linear Models & F Test (Continued)

If we submitted the above model and data to a software package, it would produce estimates for a1 and a2 equal to 134.83 and 122.3, respectively. (Incidentally, it turns out in this simple case that the estimate for a1 will be equal to the average sales when the price is $5 and the estimate for a2 will be equal to the average sales when the price is $10.)

Some definitions:
i) We call a1 the expected value for sales at a price of $5. We call 134.83 the value of the expected value for sales at a price of $5.
ii) We call a2 the expected value for sales at a price of $10. We call 122.3 the value of the expected value for sales at a price of $10.

Full model with an error vector
Because we almost never have a model which fits the data perfectly, we must add an error vector E1 to our model. So the full model with an error vector looks like this:

S = a1(P5) + a2(P10) + E1

Using the estimates of a1 and a2, as well as the observations from our data base, we can get the values of this error vector:

S = 134.83(P5) + 122.3(P10) + E1

To calculate the value for the error term in row 1 we would have:

102 = 134.83(0) + 122.3(1) + e1

For rows 2, 3, 4, ..., 95, and 96 we would have, in the same way:

120 = 134.83(0) + 122.3(1) + e2

137 = 134.83(1) + 122.3(0) + e3

and so on for rows 4 through 95, down to row 96, which again has a 0 in P5 and a 1 in P10.

Topic 5. Linear Models & F Test (Continued)

Focus, for a moment, on the error vector. The weights, a1 = 134.83 and a2 = 122.3, are chosen so as to minimize the sum of the squares of the error terms. There is no other set of values for a1 and a2 that would produce a lower error sum of squares. The error sum-of-squares is a measure of how well our model "fits" the data. In our full model, the error sum-of-squares ESS_F comes to a little over 8,000.

3. Restricted model
Remember that our model has allowed for one estimate for sales at a price of $5 (134.83) and another estimate for sales at a price of $10 (122.3). The fact that there is a difference between the averages suggests there is a relationship. But our way of testing this is to now create a restricted model which does not allow for differences in estimates for sales at price = $5 and price = $10, and compare the new error sum-of-squares to the old error sum-of-squares.

To create our restricted model, we need to impose a linear restriction on the weights in the full model that embodies our hypothesis. In this case our hypothesis (stated in the "null" form) is that there is no relationship between the price charged and the resulting sales. In terms of the expected values for the full model, our hypothesis is that the expected value for sales at price = $5 is equal to the expected value for sales at price = $10:

EV(S:P5) = EV(S:P10)

But in our full model, the expected value for sales at price = $5 is a1 and the expected value for sales at price = $10 is a2. So in terms of the weights, the hypothesis is represented by the linear restriction:

a1 = a2

Now, if we impose the linear restriction on the weights in the full model (let a1 = a2 = c), we get our restricted model (with error vector):

S = c(P5 + P10) + E2

Topic 5. Linear Models & F Test (Continued)

But P5 + P10 gives us a vector of all ones. We label such a vector the unit vector, U. So our restricted model is S = c(U) + E2. Our least-squares estimate for c is 128.57. (Incidentally, when the restricted model is just the unit vector, the weight will be the average of all of the values for the dependent variable; with 48 observations at each price, that average falls halfway between 134.83 and 122.3.) The error sum-of-squares for the restricted model, ESS_R, comes to over 12,000.

4. Analysis
By imposing the linear restriction, the ESS went from a little over 8,000 to over 12,000. Thus, the restricted model is not nearly as good a fit as the full model.

F Statistics

1. The concept
But we can't use ESS alone as our index of fit. Differences between ESS for a full model and a restricted model, although affected by differences in fit, can also be affected by differences in the number of parameters being estimated. For this reason we need to construct an index which takes all relevant factors into consideration and provides one single summary of the difference between the full model and the restricted model. We call our index the F statistic, and it is calculated using the following formula:

F = [(ESS_R - ESS_F) / (NL_F - NL_R)] / [ESS_F / (NOB - NL_F)]

Where:
ESS_R: the error sum-of-squares for the restricted model;
ESS_F: the error sum-of-squares for the full model;
NL_F: the number of linearly independent predictor vectors in the full model;
NL_R: the number of linearly independent predictor vectors in the restricted model;
NOB: the number of observations on which the two models are based.

Note that, all other things equal, the greater the difference between ESS_R and ESS_F, the greater will be the value for F. Also note that when ESS_R = ESS_F (ESS_R can never be less than ESS_F), F equals zero.
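The formula above is easy to wrap as a small helper function; a sketch:

```python
def f_statistic(ess_r, ess_f, nl_f, nl_r, nob):
    """F = [(ESS_R - ESS_F)/(NL_F - NL_R)] / [ESS_F/(NOB - NL_F)].

    ess_r, ess_f: error sums-of-squares of the restricted and full models;
    nl_f, nl_r: counts of linearly independent predictor vectors;
    nob: number of observations both models were fit to.
    """
    numerator = (ess_r - ess_f) / (nl_f - nl_r)
    denominator = ess_f / (nob - nl_f)
    return numerator / denominator
```

Plugging in the Topic 4 price example (ESS_R = 1,272; ESS_F = 18; NL_F = 3; NL_R = 1; NOB = 7) gives f_statistic(1272, 18, 3, 1, 7) ≈ 139.33, and whenever ESS_R equals ESS_F the function returns 0, as noted above.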

Topic 5. Linear Models & F Test (Continued)

2. Sampling error concern
Now suppose we reintroduce the fact that our data is really a sample. If no relationship exists between price and sales in the population, then ESS_R will equal ESS_F in the population. That is, the average sales for both price levels will be the same. Therefore, it won't matter whether we allow 2 estimates (as we do in the full model) or 1 estimate (as we do in the restricted model). If ESS_R = ESS_F in the population, then F = 0 in the population. Thus, when there is no relationship between Price and Sales in the population, the F will be zero. But because we are taking samples, it would be possible for us to obtain sample F values that were not zero, even though the true F for the population was zero. So we need to know the sampling distribution for the F statistic. The sampling distribution for F depends on degrees of freedom. But this time, instead of only one, there are 2: DF1 and DF2.

DF1 = NL_F - NL_R, the denominator in the numerator of the formula for F;
DF2 = NOB - NL_F, the denominator in the denominator of the formula for F.

Once we know DF1 and DF2, we can draw the sampling distribution for F.

3. Example Demonstration
In our example,

F = [(ESS_R - ESS_F) / (2 - 1)] / [ESS_F / (96 - 2)]

The probability that, with DF1 = 1 and DF2 = 94, we would get an F this large or larger in a sample taken from a population where the true F was 0, is .0001. Since this probability is so low, we can conclude that our linear restriction a1 = a2 is probably not true in the population from which this sample was taken. Thus, the average sales in the population where we charge $5 would not be the same as the average sales where we charge $10, so there must be a relationship between price and sales.

Topic 6. Steps For One Variable Test

Suggested Steps for Conducting a One Independent Variable Test

1. Pick two variables where you believe one variable is dependent on (i.e., is possibly caused by) the other. Label the two variables as dependent and independent, respectively. The dependent variable must be at least interval scaled. (An exception will be made in this class for the Fail3/Fail4 database, where the dependent variable is binary, 1 or 0.)

2. Now inspect the values for the dependent variable. If a plot of the values for the dependent variable reveals that a few values are clearly outliers -- that is, a few are very large or very small and clearly set apart from the rest of the observations -- then create a new working file in which the entire row for each of these outlier observations has been deleted.

3. With the observations that remain after step 2, now focus on the values for the independent variable. If the independent variable is nominal and/or takes on only a few discrete values, then proceed to step 4. But if the independent variable is continuous, then try to divide its values into roughly 4 to 7 groups with equal interval widths. To group your observations:
a) Decide on the number of groups you would like to have;
b) Ignoring the extreme values of the independent variable, calculate the interval width as (Max - Min)/(# of intervals desired).

4. For each different group on the independent variable, use the recode feature to create a binary predictor vector (a membership vector). Make sure to recode missing values on the independent variable for a row into missing values in the binary predictor vector for that row.

5. Make certain you have at least 5 observations per group. If you don't, you need to recode differently and go back to step 4. Checking for at least 5 observations per group can be accomplished by running frequencies or descriptives on the binary predictor vectors.

6. Use regression under the Statistics menu to run the model.

7. Pull the appropriate numbers from the output to complete the tables illustrated in the example one-variable test assignment.
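Steps 3 and 4 above (equal-width grouping of a continuous independent variable, then membership vectors that preserve missing values) can be sketched outside SPSS as well. The values below are hypothetical, and the sketch skips the refinement of ignoring extreme values before computing the width:

```python
# Equal-width grouping and binary (membership) predictor vectors.
# x holds hypothetical independent-variable values; None marks missing.
x = [3.1, 4.7, None, 8.2, 9.9, 6.0, 5.5, 7.3]

present = [v for v in x if v is not None]
n_groups = 4
lo = min(present)
width = (max(present) - lo) / n_groups   # (Max - Min) / (# of intervals)

def group_of(v):
    if v is None:
        return None                       # missing stays missing (step 4)
    g = int((v - lo) / width)
    return min(g, n_groups - 1)           # the maximum lands in the last group

# One binary predictor vector per group, with missing carried through.
bpvs = {g: [None if v is None else int(group_of(v) == g) for v in x]
        for g in range(n_groups)}
```

Step 5's check is then just counting the ones in each vector; any group with fewer than 5 observations signals that the recode should be redone with wider intervals.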

Topic 7. Two Independent Variable Test

(A) The Two Independent Variable Test With Binary Predictor Vectors

1. In this test, we select two independent variables and create binary predictor vectors for both. When you create the binary predictor vectors, make sure they contain either ones, zeros, or the system-missing value indicator. Create such vectors for all levels of both variables, not just the N-1 levels you have been creating.

2. Now create new binary predictor vectors by multiplying (Transform/Compute) every binary predictor vector on the first independent variable by every binary predictor vector on the second independent variable. If the first variable has N levels and the second has K levels, in this step you will be creating N times K vectors. For example, in the test data, two levels on price crossed with four levels on X gives 8 new binary predictor vectors. P5x1 would be a vector with ones when the sales came from a trade area where $5 was charged and xlevel was 1 gram; it would have zeros otherwise. P5x2 would have ones where $5 was charged and there were 1.5 grams of the secret ingredient. This would continue all the way up to P10x4, which is the last of the eight vectors and would have ones where $10 was charged and there were 3 grams of the secret ingredient. If you were testing a 2-level variable by a 2-level variable, you would be creating 4 binary predictor vectors.

3. Now you need to create the full model (Model 1). To run the model we must drop one of our 8 (xlevel by price example) binary predictor vectors because SPSS is going to add the unit vector. We run this model by submitting the N*K - 1 predictor vectors. We get this model's error sum-of-squares from the residual line of the output and the parameters from the output just like before.

4. SPSS automatically tests the linear restriction that all parameters (except the weight on the unit vector) are zero.

5. We now want to test to see if price mattered in our Full Model, and we perform that test by forcing the information on price out of our model and just running with the binary predictor vectors for xlevel. The results for this model (Model 2) are compared to the results of the Full Model in the form of an F test, using the F tables.

6. To test to see if xlevel mattered in our Full Model, we force the information on xlevel out of our model and see how much worse the model (Model 3) with just price fits, in the form of an F test.
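Step 2's crossing of the two sets of binary predictor vectors is just elementwise multiplication: a row gets a 1 only when it belongs to both conditions. A sketch with tiny hypothetical data (4 observations, a 2-level price variable crossed with a 2-level xlevel variable, giving 2 x 2 = 4 interaction vectors):

```python
# Crossing binary predictor vectors by elementwise multiplication.
# Hypothetical 4-row example: price in {5, 10}, xlevel in {1, 2}.
p5  = [1, 1, 0, 0]
p10 = [0, 0, 1, 1]
x1  = [1, 0, 1, 0]
x2  = [0, 1, 0, 1]

cross = {}
for pname, p in {"p5": p5, "p10": p10}.items():
    for xname, xv in {"x1": x1, "x2": x2}.items():
        # A 1 only where the row is a member of BOTH conditions.
        cross[pname + xname] = [a * b for a, b in zip(p, xv)]
```

With N = 2 price levels and K = 2 xlevels, cross ends up holding the full N times K set of membership vectors, exactly as step 2 prescribes.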

Topic 7. Two Independent Variable Test (Continued)

7. Note: If you have some rows missing data on one of your two independent variables but not the other, it will be necessary to create the binary predictor vectors for Models 2 and 3 by summing up the appropriate vectors from the full model. This is the only way that all 3 models will be run with exactly the same set of observations. For example, to create the vectors for running just price, you need to recreate p5 by summing p5x1+p5x2+p5x3+p5x4. x1 would be created by summing p5x1+p10x1, and so on.

(B) SPSS Outputs

Descriptives
Because the mean of a column of 1s and 0s is the proportion of 1s, the descriptives output can be used to easily calculate how many 1s are in each binary predictor vector. For example, 96 x .1250 = 12. Thus all 8 binary predictor vectors (BPVs) have exactly 12 observations. Given our knowledge of the data, this is what we would have expected.

Descriptive Statistics: N = 96 for each of P5X1, P5X2, P5X3, P5X4, P10X1, P10X2, P10X3, P10X4; each has Mean = .1250.

Topic 7. Two Independent Variable Test (Continued)

Model 1 Regression

Variables Entered/Removed: P10X3, P10X2, P10X1, P5X4, P5X3, P5X2, P5X1 entered (Method: Enter). Dependent Variable: SALES.

Model Summary: R = .89; predictors: (Constant), P10X3, P10X2, P10X1, P5X4, P5X3, P5X2, P5X1.

ANOVA (Dependent Variable: SALES): Regression (df 7), Residual (df 88), Total (df 95). The Residual sum-of-squares is the full model's ESS.

Coefficients (Dependent Variable: SALES): unstandardized B values for (Constant), P5X1, P5X2, P5X3, P5X4, P10X1, P10X2, P10X3. (P10X4 is the dropped vector.)

Topic 7. Two Independent Variable Test (Continued)

Model 2 Regression

Variables Entered/Removed: RXLEVEL3, RXLEVEL2, RXLEVEL1 entered (Method: Enter). Dependent Variable: SALES.

Model Summary: R = .639; predictors: (Constant), RXLEVEL3, RXLEVEL2, RXLEVEL1.

ANOVA (Dependent Variable: SALES): Regression (df 3), Residual (df 92), Total (df 95). The Residual sum-of-squares is Model 2's ESS.

Coefficients (Dependent Variable: SALES): unstandardized B values for (Constant), RXLEVEL1, RXLEVEL2, RXLEVEL3.

Topic 7. Two Independent Variable Test (Continued)

Model 3 Regression

Variables Entered/Removed: P5 entered (Method: Enter). Dependent Variable: SALES.

Model Summary: R = .407; predictors: (Constant), P5.

ANOVA (Dependent Variable: SALES): Regression (df 1), Residual (df 94), Total (df 95). The Residual sum-of-squares is Model 3's ESS.

Coefficients (Dependent Variable: SALES): unstandardized B values for (Constant) and P5.

Topic 8. F Tables

Steps In Testing Hypotheses Using The F Tables

1. Run the Full and Restricted Models, calculate the F statistic using the appropriate error sums-of-squares, and note the two degrees of freedom. Say our sample's calculated F was 7.2.

2. Pick the probability you wish to use for this test: .01, .025, .05, or .10, then use the 1% table, the 2.5% table, the 5% table, or the 10% table, respectively.

3. For your test's degrees of freedom, look up the F value.

4. What does the F value from the table indicate? Say we are working with the 5% table and the F we pull from the table is 3.0. This means that (only) 5% of the time would one get an F of 3.0 or larger from a sample taken from a population where the true F was 0. Put another way: we would say (only) 5% of the time would one get an F of 3.0 or larger from a sample taken from a population where the linear restriction on the parameters of the Full Model to get the Restricted Model was true. Put another way (in the case of the one independent variable test): (only) 5% of the time would one get an F of 3.0 or larger from a sample taken from a population where the average for the dependent variable was the same across all levels of the independent variable.

5. Since our calculated F of 7.2 is larger than the table F of 3.0, the odds in all three statements of step 4 above are less than 5% for our sample.

6. Thus, we can conclude that: the F for the population from which our sample came is probably not zero. Or: our sample probably did not come from a population where the linear restriction is true. Or (in the case of the one independent variable test): our sample probably did not come from a population where the average for the dependent variable was the same across all levels of the independent variable.

7. Thus, we conclude there is probably a relationship between the two variables in the population from which our sample came.

Topic 9: Test For Linearity

The Logic

In the test for linearity, we first specify a full model in which we create binary predictor vectors for each of several different levels (at least three) of an independent variable. We want to find out if constant increases in the independent variable result in constant increases in the dependent variable. For example, assume that when the value of the independent variable increases from 1 to 2 (an increase of 1 unit), the value of the dependent variable increases from 15 to 30 (an increase of 15 units). If it is also true that for any other one-unit increase in the independent variable the dependent variable increases by approximately 15 units, and for 1/2-unit increases in the independent variable the dependent variable increases by 7.5 units, then the relationship is probably linear. But to know whether this sample could have come from a population where the relationship is linear, we must do a statistical test.

Hypothesis for Test

In reality we don't believe that XLEVEL is linearly related to Sales. But the null hypothesis is that XLEVEL is linearly related to Sales. We will test this hypothesis by comparing, with an F statistic, the ESS for the full model (with binary vectors) to the ESS of a restricted model in which the relationship is forced to be linear.

Full Model

1. Full model with unit vector

The dependent variable is Sales; the independent variable is Xlevel, with 4 levels. The full model with unit vector is:

S = a0*U + a2*X2 + a3*X3 + a4*X4

where X2 contains a 1 if the sales figure came from an area where the level of ingredient "X" was 1.5, and a zero otherwise; and so on through X4. X2 has 24 1's and 72 0's. The same is true for X3 and X4. The unit vector, of course, has all 1's.
2. Expected value of Sales at each xlevel in full model

EV(S: X1) = a0(1) + a2(0) + a3(0) + a4(0) = a0
EV(S: X2) = a0(1) + a2(1) + a3(0) + a4(0) = a0 + a2
EV(S: X3) = a0(1) + a2(0) + a3(1) + a4(0) = a0 + a3
EV(S: X4) = a0(1) + a2(0) + a3(0) + a4(1) = a0 + a4
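As a sketch of how this full model behaves, the following Python fragment builds the binary membership vectors and fits the model by least squares. The sales figures are simulated stand-ins; only the design (24 observations at each of the four X levels, 96 in all) follows the text. With a unit vector plus a dummy for each level beyond the first, the fitted expected values are exactly the group means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the course data: 96 sales figures, 24 at each
# of the four levels of ingredient "X" (1, 1.5, 2, 3 grams).
xlevel = np.repeat([1.0, 1.5, 2.0, 3.0], 24)
sales = 40 + 5 * xlevel + rng.normal(0, 2, size=96)

# Full model with unit vector: S = a0*U + a2*X2 + a3*X3 + a4*X4,
# where X2..X4 are binary membership vectors for levels 2..4.
U = np.ones_like(xlevel)
X2 = (xlevel == 1.5).astype(float)
X3 = (xlevel == 2.0).astype(float)
X4 = (xlevel == 3.0).astype(float)
design = np.column_stack([U, X2, X3, X4])

coef, *_ = np.linalg.lstsq(design, sales, rcond=None)
a0, a2, a3, a4 = coef

# The expected values derived above: a0, a0 + a2, a0 + a3, a0 + a4.
print(a0, a0 + a2, a0 + a3, a0 + a4)
```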

3. SPSS output, full model

[SPSS regression output for the full model: Variables Entered/Removed (predictors XL2, XL3, XL4; dependent variable SALES), Model Summary (R = .639, R Square = .408), ANOVA table giving the full model's ESS F as the Residual sum of squares, and the Coefficients table giving the estimates of a0 (Constant), a2 (XL2), a3 (XL3), and a4 (XL4).]

4. For this full model, the R-square is .408, the adjusted R-square is .389, and the standard error of the estimate is given in the Model Summary. Using the parameters estimated by SPSS, we can calculate the expected value of Sales at each level:

EV(S: X1) = a0
EV(S: X2) = a0 + a2
EV(S: X3) = a0 + a3
EV(S: X4) = a0 + a4

Restricted Model

1. Deriving the restriction. If the relationship is linear, each step up in Xlevel must raise the expected value of Sales in proportion to the size of the step (.5 unit from level 1 to 1.5, .5 unit from 1.5 to 2, and 1 unit from 2 to 3):

EV(S:X2) - EV(S:X1) = (a0 + a2) - (a0) = a2 = .5c
EV(S:X3) - EV(S:X2) = (a0 + a3) - (a0 + a2) = a3 - a2 = .5c
EV(S:X4) - EV(S:X3) = (a0 + a4) - (a0 + a3) = a4 - a3 = c

so that

a2 = .5c
a3 = a2 + .5c = .5c + .5c = c
a4 = a3 + c = c + c = 2c

2. Restricted model

S = a0'U + .5cX2 + cX3 + 2cX4 = a0'U + c(.5X2 + X3 + 2X4)

3. Expected value of Sales at each xlevel in restricted model

EV(S': X1) = a0'(1) + c[.5(0) + (0) + 2(0)] = a0'
EV(S': X2) = a0'(1) + c[.5(1) + (0) + 2(0)] = a0' + .5c
EV(S': X3) = a0'(1) + c[.5(0) + (1) + 2(0)] = a0' + c
EV(S': X4) = a0'(1) + c[.5(0) + (0) + 2(1)] = a0' + 2c
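The restricted model can be fit the same way. This sketch (again with made-up data) shows that the three binary vectors collapse into the single predictor .5X2 + X3 + 2X4, which is simply the distance of each Xlevel from the first level.

```python
import numpy as np

rng = np.random.default_rng(1)
xlevel = np.repeat([1.0, 1.5, 2.0, 3.0], 24)
sales = 40 + 5 * xlevel + rng.normal(0, 2, size=96)

U = np.ones_like(xlevel)
X2 = (xlevel == 1.5).astype(float)
X3 = (xlevel == 2.0).astype(float)
X4 = (xlevel == 3.0).astype(float)

# Linear restriction: a2 = .5c, a3 = c, a4 = 2c, so the three binary
# vectors collapse into one predictor, .5*X2 + X3 + 2*X4, which equals
# xlevel - 1 (the distance of each level from the first level).
linvec = 0.5 * X2 + X3 + 2.0 * X4

design_r = np.column_stack([U, linvec])
(a0_r, c), *_ = np.linalg.lstsq(design_r, sales, rcond=None)

# Expected values under the restriction fall on a straight line:
# a0', a0' + .5c, a0' + c, a0' + 2c.
print(a0_r, a0_r + 0.5 * c, a0_r + c, a0_r + 2 * c)
```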

4. SPSS output, restricted model

[SPSS regression output for the restricted model: Variables Entered/Removed (predictor LINVECT; dependent variable SALES), Model Summary (R = .346), ANOVA table giving the restricted model's ESS R as the Residual sum of squares, and the Coefficients table giving the estimates of a0' (Constant) and c (LINVECT).]

4. For this restricted model, the R-square is .120, the adjusted R-square is .110, and the standard error of estimate is given in the Model Summary. Using the parameters estimated by SPSS (the slope estimate for LINVECT is c = 5.462), we can calculate the expected value of Sales:

EV(S': X1) = a0'
EV(S': X2) = a0' + .5c = a0' + .5(5.462) = a0' + 2.731
EV(S': X3) = a0' + c = a0' + 5.462
EV(S': X4) = a0' + 2c = a0' + 2(5.462) = a0' + 10.924

Analysis

1. Expected value of Sales

[Figure: Test for Linearity. Expected value of Sales (in 000's) at each XLEVEL (1 gram, 1.5 grams, 2 grams, 3 grams), plotted for the Full Model and the Restricted Model.]

2. F-statistic calculation

With NLF = 4, NLR = 2, and NOB = 96, the degrees of freedom are NLF - NLR = 2 and NOB - NLF = 92, and

F = [(ESS R - ESS F) / (NLF - NLR)] / [ESS F / (NOB - NLF)]
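As a numeric sketch of this calculation (the ESS figures below are hypothetical; the degrees of freedom are those of the Xlevel example):

```python
# Degrees of freedom for the linearity test, from the two models above:
NLF, NLR, NOB = 4, 2, 96          # parameters in full and restricted models; observations
df1 = NLF - NLR                   # numerator df: 2
df2 = NOB - NLF                   # denominator df: 92

# Hypothetical error sums-of-squares, for illustration only.
ESS_R, ESS_F = 9500.0, 7800.0

F = ((ESS_R - ESS_F) / df1) / (ESS_F / df2)
print(df1, df2, round(F, 2))
```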

3. Conclusion

Reject the linear restriction. We would observe an F of 4.88 or larger (with degrees of freedom df1 = 2 and df2 = 92) from a sample taken from a population where the true F was zero only 1% of the time. Since the computed F above is much larger, we would observe it, or one larger, even less often. Therefore, the sample probably did not come from a population where the linear restriction would have been true. The relationship is not linear, so the peak at the third level of XLEVEL (about two units of the secret ingredient), after which sales go down, is probably not just a chance occurrence. There does appear to be an ideal level of "X".

Topic 10. Steps For Linearity Test

Suggested Steps For Conducting The Linearity Test

1. Pick two variables that you have already established are related to each other. As always in linear models, the dependent variable must be measured at the interval or ratio level. (We still include the exception where the dependent is two-level, reflected as 1 or 0, as in Fail3/Fail4.) In this test, however, the independent variable should be measured at the interval or ratio level as well. (If the independent variable is only ordinal but has 6 or more levels, i.e., 6 or more binary predictor vectors, then an exception can be made that allows use of this ordinal-level variable in the linearity test, but be sure to tell the reader.)

2. Now inspect the values for the dependent variable. If a plot of the values for the dependent variable reveals that a few values are clearly outliers (that is, a few are very large or very small and clearly set apart from the rest of the observations), then create a new working file in which the entire row for each of these outlier observations has been deleted.

3. With the observations that remain after step 2, now focus on the values for the independent variable. If there exist natural breaks or meaningful categories, then use these natural breaks or meaningful categories to decide on the separate levels for the independent variable. You are seeking 4 to 7 different groups. The absolute minimum number of different groups for running the linearity test is 3. If there are no natural breaks or logically meaningful categories, then: a) decide on the number of groups you would like to have; b) ignoring the extreme values of the independent variable, calculate the interval width as (Max - Min)/(# of intervals desired).

4. For each different group on the independent variable, use the recode feature to first create a recoded independent variable.
Then use the recode feature and the recoded independent variable to create a binary predictor vector (a membership vector) for each level on the independent variable. Make sure to recode missing values on the independent variable for a row into missing values in the binary predictor vector for that row.

8. Make certain you have at least 5 observations per group. If you don't, you need to recode differently and go back to step 4. Checking for at least 5 observations per group can be accomplished by running frequencies or descriptives on the binary predictor vectors.

9. Use regression under the statistics menu to run this model with binary predictor vectors.

Topic 10. Steps For Linearity Test (Continued)

10. Pull ESS (residual) as the ESS for this full model, along with R-square, adjusted R-square, and the parameter estimates from the output. Use the parameter estimates and the values of 1 or 0 in the predictor vectors to calculate the expected values for the full model for each level on the independent variable.

11. For each level on the independent variable, write down a number that serves to indicate a typical number on the independent variable for that level. This can be the midpoint of the numbers in the range, or the average of the values for the independent variable in that range; or, if the distribution of values in the range is skewed, you can make a rough estimate of where the median would be.

12. Use the indicator value (from step 11) for each range to create LINVEC. For the Xlevel example, remember that the indicator values were 1, 1.5, 2, and 3, and LINVEC had 0, .5, 1, and 2; essentially, LINVEC will have a 0 for the first level, and then the difference from the first level to each of the other levels. For levels 2 through 4 in the Xlevel example, LINVEC has 1.5 - 1 = .5; 2 - 1 = 1; 3 - 1 = 2.

13. Run the restricted model with LINVEC as the only independent variable. Of course SPSS adds the unit vector.

14. Pull ESS (residual) as the ESS for this restricted model, along with R-square, adjusted R-square, and the two parameter estimates from the output. Use the parameter estimates and the values in U and LINVEC for the various levels to calculate the expected values for this restricted model.

15. Calculate the F statistic. Select the critical probability you are using for this test and go to the table (one of 4: p = .01, p = .025, p = .05, or p = .10) for that critical probability. Use the 2 degrees of freedom for your calculated F to find the critical F in the table.

16. Compare your calculated F to the critical F in the table. If your calculated F is larger than the critical F, reject the linear restriction; if not, accept the linear restriction.
If you reject, you are concluding that the sample probably did not come from a population where the relationship is linear. If you accept, you are concluding that the sample probably did come from a population where the relationship is linear.
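The grouping and recoding steps above can be sketched outside SPSS. This Python fragment, with made-up data, cuts a continuous independent variable into equal-width intervals, builds a binary membership vector for each group, checks the minimum group size, and forms the LINVEC-style predictor from the group midpoints.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 100, size=200)          # a continuous independent variable

# No natural breaks, so cut into equal-width intervals.
n_groups = 4
edges = np.linspace(x.min(), x.max(), n_groups + 1)
group = np.clip(np.digitize(x, edges[1:-1]), 0, n_groups - 1)  # recoded IV: 0..3

# Binary (membership) predictor vectors, one per group.
binaries = np.column_stack([(group == g).astype(float) for g in range(n_groups)])

# Check the per-group counts (the "at least 5 observations" rule).
counts = binaries.sum(axis=0)
assert (counts >= 5).all(), "recode differently and regroup"

# Indicator values (group midpoints) and the LINVEC-style predictor:
# the distance of each group's indicator value from the first level's.
midpoints = (edges[:-1] + edges[1:]) / 2
linvec = midpoints[group] - midpoints[0]
print(counts, midpoints.round(1))
```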

Topic 11: Building Regression Models

Introduction

In previous assignments, the models have been constructed with binary predictor vectors to represent different levels of an independent variable. The independent variable could be any level of measurement: nominal, ordinal, interval, or ratio. If the variable was continuous, it had to be recoded based on ranges, which then constituted each of the levels on the independent variable, and these were reflected in binary predictor vectors. The models we will now build, however, can have both binary predictor vectors and the raw values of the independent variables. We may also create what are called pseudo-variables by squaring the independent variables or taking the product between pairs of independent variables.

Variables In The Model

The Dependent Variable

The dependent variable should be measured at the interval or ratio level. (Exceptions are sometimes made if the dependent is nominal but only has two levels of 0 and 1, or if the variable is ordinal but has many levels.) It is a good idea to examine a histogram (or frequency distribution) of the dependent variable and identify any outliers. Outliers are values extremely far removed (either much lower or much higher) from the bulk of roughly continuous values. In examining the distribution of the dependent variable, it is also sometimes useful to have the mean and standard deviation of the dependent variable and consider how many standard deviations away from the mean a particular value is. Essentially, if you have a few values on the dependent variable that are very far removed from the other values, then you probably need to delete these observations from the dataset before building your regression model.
The Independent Variables

The independent variables with which you build your regression model can contain either the original, raw values of the independent variables, or binary predictor vectors representing each of several different levels of an independent variable. If the independent variable is nominal, or if it is ordinal with only a few levels, then binary predictor vectors must be created to represent the different levels. If the independent variable is interval or ratio, then you can use either the raw values or binary predictor vectors representing the various levels.

Curvilinear Relationships

If you use the raw values on the independent variable, then you are essentially assuming that the best way to capture the relationship between the dependent variable and the independent variable is linear, or in the form of a straight line. If, however, you believe the nature of the relationship between the two variables is best represented by a curve, then you need to include both the raw values on the independent variable and a pseudo-variable that contains the square of the raw values of the independent variable.

Interaction Effects

If one has several independent variables in a regression model, one way to describe the modeling of the effect of the independent variables on the dependent is that the effects are additive. That is, we can capture the overall effect by simply adding up the various effects from each variable separately. Sometimes, however, the combined or joint effects of the independent variables cannot be captured in a simple additive form. In these cases one needs to construct new pseudo-variables that are the product of two independent variables.

Example Database

We will use the testdata (X level) database to illustrate the points above, as well as how to interpret the output and how to build a regression model.

Test Objectives

Recall that in testdata we had 96 observations. Unit sales is our dependent variable, and advertising, price, and X level are our independent variables. We wish to build and test regression models to accomplish two objectives: 1) we want to test whether, from the experiment-generated data, advertising, price, or X level have an impact on sales; and 2) we want to build a model that will allow us to predict sales for any particular combination of values for advertising, price, and X level.

Create New Independent Variables

Based on our earlier tests, we have good reason to believe that Sales may be related to X level in a curvilinear fashion.
So, before running the regression, we create a new variable called XLSQ, which is the square of X level. We also have reason to believe that to fully capture the effect of X level and price in terms of the way they affect sales, we need an interaction term. So, we create another new variable called XLPR, which is the product of price and X level.
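Creating these pseudo-variables can be sketched as follows. The variable names XLSQ and XLPR follow the text; the price data here are made up, and advertising is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
xlevel = np.repeat([1.0, 1.5, 2.0, 3.0], 24)   # 96 observations, as in testdata
price = rng.uniform(4.0, 6.0, size=96)         # hypothetical price values

# Pseudo-variables for the regression model:
xlsq = xlevel ** 2        # squared term, allowing a curvilinear effect of X level
xlpr = xlevel * price     # interaction term between X level and price

# A design matrix combining raw values and pseudo-variables (plus the unit vector).
design = np.column_stack([np.ones(96), xlevel, xlsq, price, xlpr])
print(design.shape)
```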


More information

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006 Chapter 17 Simple Linear Regression and Correlation 17.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

16.400/453J Human Factors Engineering. Design of Experiments II

16.400/453J Human Factors Engineering. Design of Experiments II J Human Factors Engineering Design of Experiments II Review Experiment Design and Descriptive Statistics Research question, independent and dependent variables, histograms, box plots, etc. Inferential

More information

Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons

Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons 1. Suppose we wish to assess the impact of five treatments while blocking for study participant race (Black,

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Multiple linear regression

Multiple linear regression Multiple linear regression Course MF 930: Introduction to statistics June 0 Tron Anders Moger Department of biostatistics, IMB University of Oslo Aims for this lecture: Continue where we left off. Repeat

More information

Survey on Population Mean

Survey on Population Mean MATH 203 Survey on Population Mean Dr. Neal, Spring 2009 The first part of this project is on the analysis of a population mean. You will obtain data on a specific measurement X by performing a random

More information

Sociology 593 Exam 1 February 17, 1995

Sociology 593 Exam 1 February 17, 1995 Sociology 593 Exam 1 February 17, 1995 I. True-False. (25 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When he plotted

More information

2/26/2017. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

2/26/2017. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 When and why do we use logistic regression? Binary Multinomial Theory behind logistic regression Assessing the model Assessing predictors

More information

12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL)

12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL) 12.12 Model Building, and the Effects of Multicollinearity (Optional) 1 Although Excel and MegaStat are emphasized in Business Statistics in Practice, Second Canadian Edition, some examples in the additional

More information

Advanced Regression Topics: Violation of Assumptions

Advanced Regression Topics: Violation of Assumptions Advanced Regression Topics: Violation of Assumptions Lecture 7 February 15, 2005 Applied Regression Analysis Lecture #7-2/15/2005 Slide 1 of 36 Today s Lecture Today s Lecture rapping Up Revisiting residuals.

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

16.3 One-Way ANOVA: The Procedure

16.3 One-Way ANOVA: The Procedure 16.3 One-Way ANOVA: The Procedure Tom Lewis Fall Term 2009 Tom Lewis () 16.3 One-Way ANOVA: The Procedure Fall Term 2009 1 / 10 Outline 1 The background 2 Computing formulas 3 The ANOVA Identity 4 Tom

More information

Regression With a Categorical Independent Variable

Regression With a Categorical Independent Variable Regression With a Independent Variable Lecture 10 November 5, 2008 ERSH 8320 Lecture #10-11/5/2008 Slide 1 of 54 Today s Lecture Today s Lecture Chapter 11: Regression with a single categorical independent

More information

Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology

Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology Psychology 308c Dale Berger Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology This example illustrates modeling an interaction with centering and transformations.

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Analysis of 2x2 Cross-Over Designs using T-Tests

Analysis of 2x2 Cross-Over Designs using T-Tests Chapter 234 Analysis of 2x2 Cross-Over Designs using T-Tests Introduction This procedure analyzes data from a two-treatment, two-period (2x2) cross-over design. The response is assumed to be a continuous

More information

- measures the center of our distribution. In the case of a sample, it s given by: y i. y = where n = sample size.

- measures the center of our distribution. In the case of a sample, it s given by: y i. y = where n = sample size. Descriptive Statistics: One of the most important things we can do is to describe our data. Some of this can be done graphically (you should be familiar with histograms, boxplots, scatter plots and so

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators

More information

Simple linear regression

Simple linear regression Simple linear regression Business Statistics 41000 Fall 2015 1 Topics 1. conditional distributions, squared error, means and variances 2. linear prediction 3. signal + noise and R 2 goodness of fit 4.

More information

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n = Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College An example ANOVA situation Example (Treating Blisters) Subjects: 25 patients with blisters Treatments: Treatment A, Treatment

More information

Statistics Primer. A Brief Overview of Basic Statistical and Probability Principles. Essential Statistics for Data Analysts Using Excel

Statistics Primer. A Brief Overview of Basic Statistical and Probability Principles. Essential Statistics for Data Analysts Using Excel Statistics Primer A Brief Overview of Basic Statistical and Probability Principles Liberty J. Munson, PhD 9/19/16 Essential Statistics for Data Analysts Using Excel Table of Contents What is a Variable?...

More information

Do students sleep the recommended 8 hours a night on average?

Do students sleep the recommended 8 hours a night on average? BIEB100. Professor Rifkin. Notes on Section 2.2, lecture of 27 January 2014. Do students sleep the recommended 8 hours a night on average? We first set up our null and alternative hypotheses: H0: μ= 8

More information

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 29 Multivariate Linear Regression- Model

More information

Analysis of Variance

Analysis of Variance Statistical Techniques II EXST7015 Analysis of Variance 15a_ANOVA_Introduction 1 Design The simplest model for Analysis of Variance (ANOVA) is the CRD, the Completely Randomized Design This model is also

More information

Uni- and Bivariate Power

Uni- and Bivariate Power Uni- and Bivariate Power Copyright 2002, 2014, J. Toby Mordkoff Note that the relationship between risk and power is unidirectional. Power depends on risk, but risk is completely independent of power.

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

Is economic freedom related to economic growth?

Is economic freedom related to economic growth? Is economic freedom related to economic growth? It is an article of faith among supporters of capitalism: economic freedom leads to economic growth. The publication Economic Freedom of the World: 2003

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

Daniel Boduszek University of Huddersfield

Daniel Boduszek University of Huddersfield Daniel Boduszek University of Huddersfield d.boduszek@hud.ac.uk Introduction to moderator effects Hierarchical Regression analysis with continuous moderator Hierarchical Regression analysis with categorical

More information

Finding Relationships Among Variables

Finding Relationships Among Variables Finding Relationships Among Variables BUS 230: Business and Economic Research and Communication 1 Goals Specific goals: Re-familiarize ourselves with basic statistics ideas: sampling distributions, hypothesis

More information

Retrieve and Open the Data

Retrieve and Open the Data Retrieve and Open the Data 1. To download the data, click on the link on the class website for the SPSS syntax file for lab 1. 2. Open the file that you downloaded. 3. In the SPSS Syntax Editor, click

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression 1 Correlation indicates the magnitude and direction of the linear relationship between two variables. Linear Regression: variable Y (criterion) is predicted by variable X (predictor)

More information

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Overfitting Categorical Variables Interaction Terms Non-linear Terms Linear Logarithmic y = a +

More information

appstats27.notebook April 06, 2017

appstats27.notebook April 06, 2017 Chapter 27 Objective Students will conduct inference on regression and analyze data to write a conclusion. Inferences for Regression An Example: Body Fat and Waist Size pg 634 Our chapter example revolves

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS 1a) The model is cw i = β 0 + β 1 el i + ɛ i, where cw i is the weight of the ith chick, el i the length of the egg from which it hatched, and ɛ i

More information

Bivariate Relationships Between Variables

Bivariate Relationships Between Variables Bivariate Relationships Between Variables BUS 735: Business Decision Making and Research 1 Goals Specific goals: Detect relationships between variables. Be able to prescribe appropriate statistical methods

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Chapter 16. Simple Linear Regression and dcorrelation

Chapter 16. Simple Linear Regression and dcorrelation Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Chapter Goals. To understand the methods for displaying and describing relationship among variables. Formulate Theories.

Chapter Goals. To understand the methods for displaying and describing relationship among variables. Formulate Theories. Chapter Goals To understand the methods for displaying and describing relationship among variables. Formulate Theories Interpret Results/Make Decisions Collect Data Summarize Results Chapter 7: Is There

More information