Analisi Statistica per le Imprese

, Analisi Statistica per le Imprese Dip. di Economia Politica e Statistica 4.3. 1 / 33

You should be able to:, Underst model building using multiple regression analysis Apply multiple regression analysis to business decision-making situations Analyze interpret the computer output for a multiple regression model Test the signicance of the independent variables in a multiple regression model Incorporate qualitative variables into the regression model by using dummy variables 2 / 33

The systematic part, One may need a mathematical model to quantify the existing relationship between a response variable Y k explicative variables X 1,...,X k y = f (X 1,...X k ) The multiple linear regression model species the functional relationship as f (X 1,...X k ) = β 1 X 1 + β 2 X 2 +...β k X k (1) Geometrically this corresponds to a hyper-plan in k dimensions. The model is extremely useful because: 1 It has an intuitive geometrical interpretation 2 It has a simple estimation of the model's parameters 3 / 33

The stochastic part, The model always includes a rom component, that identies the stochastic component. This can be expressed as follows: Y = f (X 1,...X k ) + }{{}}{{} ε systematic stochastic (2) 4 / 33

The specication Stard notation: for each statistiocal unit i=1...n, Y i = β 1 X i1 + β 2 X i2 +...β k X ik + ε i (3) Matrix notation Y = Xβ + ε (4) Y : (n 1) vector of n dependent variable observations X : (n k) matrix of k regressors with each n observations β : (k 1) vector of k parameters ε : (n 1) vector of n 5 / 33

The Matrices, X 11 X 12 X 1k X 21 X 22 X 2k X =...... ; β = X n1 X n2 X nk Y 1 ε 1 Y 2 y =. ; ε = ε 2. Y n ε n β 1 β 2 The matrix X will have a unitary rst column if the model is with intercept. In this case the intercept would be β 1 in the multidimensional notation. β k 6 / 33

Stard Assumptions, Assumptions The functional relationship must be linear Covariates must have a deterministic nature The X matrix has full rank The error term has a null expected value E [ε i ] = 0 The error term is homosckedastic: Var [ε i ] = σ 2 The error terms are not correlated: Cov[ε i ε j ] i j = 0 7 / 33

OLS estimator, The OLS estimator in multiple linear regression is the vector β that minimize the following function of β, where X i is the i-th row of the X matrix. So min n i=1 e 2 i = min ( Y i X i β) 2 (5) ( e = Y X β ) (6) ( ) 1 β = X X X Y β (7) Notice that the matrix X X (cross product matrix) must be of full rank in order to be invertible. 8 / 33

OLS estimator properties, β is BLUE (Best Linear Unbiased Estimator) ( ( ) 1 ) One can notice that X X X is a matrix of constant elements, therefore β is a linear transformation of Y. One can prove that β is a correct estimator as follows ( ) 1 ( ) β = X 1 ( ) X X Y = X 1 X X (Xβ + ε) = β + X X X ε (8) ] ( ) 1 E [ β = β + X X X E [ε] = β (9) We could also prove that in the class of the linear unbiased estimator is the one presenting the minimum variance. 9 / 33

Variance of the OLS estimator, The variance of the OLS estimator is calculated as follows: ) [ Var ( β = E ( β β )( β ] ) β (10) ) [ ( ) 1 ( ) ] Var ( β = E X 1 X X εε X X X (11) ) ( ) 1 [ Var ( β = X X X E εε ] ( X X X) ( 1 ) 1 ( ) = σ 2 X X X X X X so for each parameter estimator (12) ) Var ( β ( 1 = σ 2 X X) (13) ) ( ) 1 Var ( βj = σ 2 X X jj (14) 10 / 33

Empirical example, A distributor of frozen desert pies wants to evaluate factors that inuence dem Dependent variable: y= Pie sales (units per week) Independent variables: x 1 =Price ($) x 2 =Advertising ($100's) Data is collected for 15 weeks The OLS estimates gives the following estimated model: ŷ = 306 24x 1 + 74x 2 (15) 11 / 33

Interpretation of the estimated coecient, each β j estimates the average value of Y changes by β j units for each 1 unit increase in X j, holding all other independent variables constant example: β1 = 24 then sales (y) are expected to decrease, on average, by 24 pies per week for each $1 increase in selling price (x 1 ), net of the eects due to advertising (x 2 ) β2 = 74 then sales (y) is expected to increase, on example: Intercept average, by 74 pies per week for each $100 increase in advertising (x 2 ), net of the eects due to price (x 1 ) the estimated average value of y when all X variables are zero. 12 / 33

Using the model to make predictions, We can calculate the predicted sales per week given the selling price ($5) advertising expenses ($350): ŷ = 306 24(5) + 74(3.5) = 445 (16) We predict to sell 445 pies per week. 13 / 33

The σ 2 estimator, [ ( e = Y X β ) ] 1 = Xβ + ε X X X X (Xβ + ε) (17) ( ) 1 e = Xβ + ε Xβ X X X X ε (18) ( ( ) 1 ) e = 1 X X X X ε = M x ε (19) M x is a idempotent symmetric matrix. This implies that: M x = M x = M k x; k > 0 (20) 14 / 33

Theσ 2 estimator, we can prove that Q = e e = ε M xm x ε = ε M x ε (21) E [Q] = σ 2 (n k) (22) so the unbiased estimator of the parameter σ 2 is [ ] Q E = σ 2 σ 2 = Q n k n k = e e n k (23) 15 / 33

ANOVA, Adding to the stard assumptions the following The error term has a normal distribution so:ε i N(0,σ 2 ) Reminding that follows β = β 1 β 2 =. β k ( ) 1 X X X Y (24) Y i = N ( Xβ,σ 2) (25) ( ( ) 1 β i = N β i,σ 2 X X ( β ( ) ) 1 = N β,σ 2 X X ii ) (26) (27) 16 / 33

ANOVA, In order to test the meaning of the whole model, we need to test the hypothesis system H 0 : β 1 = β 2 ==...β k = 0 H 1 : almeno un β j 0 the test is F = ESS /(k 1) = R2 /(k 1) F RSS/(n k) (1 R 2 )/n k (k 1,n k) 17 / 33

ANOVA, SS d.f. Mean Square (MS) β X y/(k 1) ESS = β X y = y yr 2 k-1 Residual RSS = e e = y y ( 1 R 2) n-k e e/(n k) Total TSS = y y = y 2 i n-1 Table 1: Construct the F statistic F = ESS /(k 1) RSS/(n k) Find the 95 th or the 99 th quantile of the distribution F (k 1),(n k) If F > F (1 α);(k 1),(n k) one rejects 18 / 33

R 2, R 2 = R 2 = ESS TSS 0 (28) TSS RSS TSS = 1 e2 i Y 2 i 1 (29) 0 R 2 1 (30) 19 / 33

Adjusted R 2, R 2 never decreases when a new X variable is added to the model this can be a disadvantage when comparing dierent models What is the net eect of adding a new variable? we lose a degree of freedom when a new X variable is added did the new X variable add enough explanatory power to oset the loss of one degree of freedom? The adjusted R 2 adjusts for the number of variables (k). R 2 adj = 1 e2 i/(n k) Y 2 i/(n 1) (n 1) ( = 1 1 + R 2 ) (31) (n k) 20 / 33

Empirical example: ANOVA, SS d.f. MS 29460 2 14730 Residual 27033 12 2252 Total 56493 14 Table 2: R 2 = 29460/56493 = 0.52 (32) 52% of the variation in pie sales is explained by the variation in price advertising R 2 adj = 0.44 (33) 44% of the variation in pie sales is explained by the variation in price advertising, taking into account the sample size the number of independent variables 21 / 33

Empirical example: is the model signicant?, F-test for the overall signicance of the model Shows if there is a linear relationship between all of the X variables considered together Y use F test statistic in the estimated model the empirical value of F is equal to 6.54 the critical value of F 0.05;2,12 is equal to 3.88 6.54 > 3.88 therefore the regression model explains a signicant portion of the variation in pie sales (there is evidence that at least one independent variable aects y) 22 / 33

Single parameter t-test, ( ( ) ) 1 β N β,σ 2 X X To test the hypothesis if the individual variable X i has a signicant eect on Y we have to test H 0 :β i =0 (no linear relationship) H 1 :β i 0 (linear relationship does exist between X i Y) if σ 2 is known, under H 0 : (34) β i 0 σ 2 (X X) 1 N (0,1) (35) ii 23 / 33

Single parameter t-test, Generally σ 2 is unknown we have to use its estimator σ 2 = e e n k Therefore the Stard Errors of β is (36) se β = σ 2 (X X) 1 (37) 24 / 33

Single parameter t-test, under H 0 we have: β i 0 σ 2 (X X) 1 ii e e σ 2 /(n k) = β i se β t n k (38) Find the 95 th or the 99 th quantile of the distribution t (n k) if t > t (1 α/2);(n k) one rejects H 0 25 / 33

Empirical example: are individual variables signicant?, coecients Stard Error t Intercept 306 114.25 2.67 Price -24 10.83-2.21 Advertising 74 25.96 2.85 Table 3: At α =0.05 signicant level, the t-value for price is -2.21 >t (α/2;12) = 2.1788 so we refuse H 0 At α =0.05 signicant level, the t-value for advertising is 2.85 >t (α/2;12) =2.1788 so we refuse H 0 There is evidence that both Price Advertising aect pie sales at α =0.05 26 / 33

, Until now we have assumed that in Y = Xβ + u, X are cardinal variables. One can also use categorial explanatory (i.e. dummy) variables that identify specic factors depending on categories: Temporal eects Spacial eects Qualitative variables We suppose that the Dummies inuence just the model intercept (not the slopes). There will be dierent regression interecepts corresponding to the dierent groups/situations, if the dummy variable is signicant. 27 / 33

, Lets assume the following generic model: Ŷ = β 0 + β 1 X 1 + β 2 X 2 (39) Where Y is the pie sale, X 1 is the price X 2 is a holiday indicator function, that will be equal to 1 when a holiday has occurred during the week, equal to 0 when there is no holiday during the week. 28 / 33

, Ŷ = β 0 + β 1 X 1 + β 2 (1) = (β 0 + β 2 ) + β 1 X 1 Ŷ = β 0 + β 1 X 1 + β 2 (0) = { (β 0 ) }} { + β 1 {}}{ X 1 different intercept same slope (40) y sales Holiday No Holiday If H 0 : β 2 = 0 is rejected Holiday has a signicant eect on pie sales x 1 price 29 / 33

, Example: Sales = 300 30(Price) + 15(Holiday) (41) Sales = number of pies sold per week Price = pie price in $ { 1 if holiday has occurred Holiday = 0 if holiday has not occurred As we see the dummy coecient β 2 = 15. This implies that on average 15 more pies were sold in weeks with holidays then in weeks without holidays, given the same price. 30 / 33

, The number of dummy variables must always be one less then the number of discriminated levels. Imagine that we are analyzing the house market have three dierent housing levels {ranch, split level, condo}. Example: Ŷ = β 0 + β 1 X 1 + β 2 X 2 + β 3 X 3 (42) Y = house price X 1 = { square meters 1 if ranch X 2 = β 2 impact of ranch vs. condo 0 if not { 1 if split level X 3 = β 3 split level vs. condo 0 if not 31 / 33

, Suppose the estimated equation is: The we will have as follows: For a condo (x 2 = x 3 = 0): For a ranch (x 3 = 0): Ŷ = 20 + 0.05X 1 + 24X 2 + 15X 3 (43) ŷ = 20 + 0.05X 1 (44) For a split level (x 2 = 0): ŷ = 44 + 0.05X 1 (45) ŷ = 35 + 0.05X 1 (46) 32 / 33

, Pindyck R. Rubinfeld D., 1998 Econometrics s Economic Forecasts (4-th Ed.) De Luca Amedeo,1996 Marketing Bancario e Metodi Statistici Applicati, vol. 1, Franco Angeli. 33 / 33