STATISTICS Multiple regression
Problem: explain the price of a ski pass.
SPSS results

Coefficients (dependent variable: day ski-pass price, PFJ)

Model                      B        Std. Error   Beta    t        Sig.
(Constant)                 88.459   4.596                19.248   0.000
Number of ski runs (PIS)   0.873    0.076        0.760   11.470   0.000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

ANOVA (predictors: (Constant), number of ski runs; dependent variable: PFJ)

Source       Sum of Squares   df   Mean Square   F         Sig.
Regression   80825.82         1    80825.82      131.558   0.000
Residual     58980.067        96   614.376
Total        139805.9         97

Fitted model: $PFJ = 88.459 + 0.873\,PIS + \varepsilon$
Direct calculation for R?

SPSS: Model Summary (predictors: (Constant), number of ski runs)

R       R Square   Adjusted R Square   Std. Error of the Estimate
0.760   0.578      0.574               24.78660
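The direct calculation can be sketched in a few lines of Python. The two sums of squares are the ones printed in the SPSS ANOVA table of the simple regression; the variable names are only illustrative.

```python
# Sums of squares read from the SPSS ANOVA table (simple regression on nb pistes).
reg_ss = 80825.82    # regression sum of squares
res_ss = 58980.067   # residual sum of squares
total_ss = reg_ss + res_ss

# R^2 is the share of the total variation explained by the regression.
r_squared = reg_ss / total_ss

# In simple regression R is the absolute value of the correlation, so R = sqrt(R^2).
r = r_squared ** 0.5

print(round(r_squared, 3), round(r, 3))
```

The two printed values can be compared with the R Square and R columns of the Model Summary.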
Multiple regression: maybe there are other variables that explain PFJ.
Multiple regression

Objective: explain PFJ by the variables

1. AST: altitude of the resort
2. REM: number of ski lifts (remonte-pente)
3. API: altitude of the ski runs
4. PIS: number of ski runs (pistes)
5. KMF: cross-country ski trails in km (ski de fond)
6. LIT: number of beds
7. HOT: number of hotels

What is the best model available?

We look for a model of the form

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

where the $\beta_j$ are fixed (but unknown) parameters and $\varepsilon$ is a random term with mean 0 and standard deviation $\sigma$.
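We define the vectors

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

and the matrix

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{i1} & \cdots & x_{ik} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}$$

so that $Y = X\beta + \varepsilon$.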
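We look for $b_0, b_1, \dots, b_k$ such that

$$\sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik})^2$$

is minimal. Equivalently, the fitted vector $\hat{Y} = Xb$ is the orthogonal projection of $Y$ onto $W$, where $W$ is the vector subspace spanned by the vectors $\mathbf{1}, x_1, x_2, \dots, x_k$.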
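A minimal sketch of this minimisation with NumPy, on hypothetical toy data (six observations, two explanatory variables; none of these numbers come from the ski data). It checks the projection property: the residual vector is orthogonal to every column of the design matrix.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, k = 2 explanatory variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([6.1, 6.9, 13.2, 13.8, 20.1, 20.9])

# Design matrix X: a column of ones (for b_0) followed by the k variables.
X = np.column_stack([np.ones(len(y)), x])

# Least-squares coefficients b minimise the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b            # projection of y onto the space spanned by the columns of X
residuals = y - y_hat

print(b)
print(X.T @ residuals)   # numerically zero: residuals are orthogonal to 1, x_1, x_2
```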
SPSS results

Coefficients (dependent variable: day ski-pass price, PFJ)

Model                            B           Std. Error   Beta     t        Sig.
(Constant)                       52.646      11.404                4.616    0.000
Altitude of the resort (AST)     -2.01E-03   0.006        -0.019   -0.309   0.758
Number of ski lifts (REM)        -1.11E-02   0.060        -0.012   -0.185   0.853
Altitude of the ski runs (API)   1.885E-02   0.006        0.269    3.247    0.002
Number of ski runs (PIS)         0.430       0.092        0.375    4.702    0.000
Cross-country trails, km (KMF)   -7.40E-02   0.051        -0.079   -1.455   0.149
Number of beds (LIT)             1.60E-03    0.000        0.434    5.058    0.000
Number of hotels (HOT)           -3.71E-02   0.169        -0.014   -0.220   0.827

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

ANOVA (predictors: (Constant), HOT, REM, KMF, AST, API, PIS, LIT; dependent variable: PFJ)

Source       Sum of Squares   df   Mean Square   F        Sig.
Regression   113065.9         7    16152.278     54.365   0.000
Residual     26739.942        90   297.110
Total        139805.9         97
With the least squares method:

$PFJ = 52.6459 - 0.002\,AST - 0.0111\,REM + 0.0189\,API + 0.4302\,PIS - 0.0740\,KMF + 0.0016\,LIT - 0.037\,HOT + \text{residual}$

We can still use $R^2 = RegSS / TSS$. Here $R^2 = 113065.9 / 139805.9 \approx 0.81$: about 81% of the information (of the price) is explained by this model.
Precision of the model?
As in simple regression, $\varepsilon$ follows a normal distribution $N(0, \sigma)$.
We can estimate $\sigma$ from the data.
Estimation of $\sigma$ (standard deviation of the residual)

Estimation of $\sigma^2$: $\hat\sigma^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n - k - 1}$

Estimation of $\sigma$: $\hat\sigma = \sqrt{\hat\sigma^2}$
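Forecast interval for $y_i$

Model: $Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i$

Forecast of $y_i$: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_k x_{ik}$

Simplified formula: $\hat{y}_i \pm 2\hat\sigma$

95% of the $e_i$ are in $[-2\hat\sigma;\ 2\hat\sigma]$.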
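A minimal sketch in Python of the estimate of σ and of the simplified forecast interval. It uses the residual sum of squares of the seven-variable model (26739.942, with 90 residual degrees of freedom, so n = 98 and k = 7); the predicted price of 150 is a hypothetical value chosen only to illustrate the call.

```python
import math

res_ss = 26739.942   # residual sum of squares from the SPSS ANOVA table
n, k = 98, 7         # 98 resorts (total df = 97), 7 explanatory variables

sigma2_hat = res_ss / (n - k - 1)   # estimate of sigma^2
sigma_hat = math.sqrt(sigma2_hat)   # estimate of sigma

def forecast_interval(y_hat, sigma=sigma_hat):
    """Simplified 95% forecast interval: y_hat +/- 2 * sigma_hat."""
    return (y_hat - 2 * sigma, y_hat + 2 * sigma)

low, high = forecast_interval(150.0)  # 150.0 is a hypothetical predicted price
print(round(sigma_hat, 3), round(low, 1), round(high, 1))
```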
Isn't there a problem? Look again at the SPSS coefficients table of the model with all seven variables.
We do not use R in multiple regression.
What can we say about each coefficient relative to the others?
What can we say about the signs of the coefficients? Ex: number of ski lifts (nb de remontées).
Do all the variables seem useful?
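Is the contribution of $X_j$ significant?

Model: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_j X_j + \dots + \beta_k X_k + \varepsilon$

Test: $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$

Statistic used: $t_j = \hat\beta_j / s_j$, where $s_j$ is the estimated standard deviation of $\hat\beta_j$:

$$s_j = \sqrt{\frac{\hat\sigma^2}{\left(1 - R^2(X_j;\ \text{other } X)\right) \sum_i (x_{ij} - \bar{x}_j)^2}}$$

Rejection of $H_0$ at risk $\alpha$: reject $H_0$ if $|t_j| \ge t_{1-\alpha/2}(n-k-1)$, the quantile of a Student distribution with $n-k-1$ degrees of freedom.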
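What is the significance level (Sig.)?
It is the smallest value of $\alpha$ for which $H_0$ is rejected.

[Figure: Student distribution; the two tails beyond $-|t_j|$ and $|t_j|$ each have area Sig/2, compared with the tails of area $\alpha/2$ beyond $\pm t_{1-\alpha/2}(n-k-1)$]

We can reject "$H_0: \beta_j = 0$" at risk $\alpha$ if Sig $\le \alpha$.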
What are the significant variables in this model ($\alpha = 0.05$)? Use the SPSS coefficients table of the model with all seven variables and the critical value $t_{0.975}(90) = 1.987$.
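The comparison can be sketched as follows in Python. The t statistics are those of the seven-variable model; the values for REM and KMF are assumed readings of the table (taken as B divided by Std. Error), and the dictionary name is only illustrative.

```python
T_CRIT = 1.987  # t_{0.975}(90), the critical value given on the slide

# t statistics of the full model (REM and KMF are assumed readings, B / Std. Error).
t_stats = {
    "AST": -0.309,
    "REM": -0.185,
    "API": 3.247,
    "PIS": 4.702,
    "KMF": -1.455,
    "LIT": 5.058,
    "HOT": -0.220,
}

# A variable is significant at alpha = 0.05 when |t_j| >= t_{0.975}(90).
significant = [name for name, t in t_stats.items() if abs(t) >= T_CRIT]
print(significant)
```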
Selection of variables: backward method

Step 1: fit the model with all the variables.
Step 2: remove the variable $X_j$ with the smallest contribution: $|t_j|$ minimum, i.e. Sig($t_j$) maximum.
Fit the new model, and repeat until all the variables in the model are significant (default value in SPSS: Sig($t_j$) $\le$ 0.1).
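A minimal sketch of the backward method with NumPy, under two stated assumptions: the data are synthetic (hypothetical variables x1 to x4, of which only x1 and x3 influence y), and the stopping rule uses a fixed critical value `t_crit = 1.66` (roughly the two-sided 10% Student quantile for about 90 degrees of freedom) in place of SPSS's Sig threshold. It is an illustration of the procedure, not SPSS's implementation.

```python
import numpy as np

def ols_t_stats(X, y):
    """Coefficients and t statistics for y = X b + e (X includes the constant)."""
    n, p = X.shape                       # p = k + 1
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (n - p)     # estimate of sigma^2
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return b, b / se

def backward_selection(x, y, names, t_crit=1.66):
    """Drop the variable with the smallest |t| until all |t| >= t_crit."""
    keep = list(range(x.shape[1]))
    while keep:
        X = np.column_stack([np.ones(len(y)), x[:, keep]])
        _, t = ols_t_stats(X, y)
        t_vars = np.abs(t[1:])           # the constant is never removed
        worst = int(np.argmin(t_vars))
        if t_vars[worst] >= t_crit:
            break                        # every remaining variable is significant
        del keep[worst]
    return [names[j] for j in keep]

# Hypothetical synthetic data: y depends on x1 and x3 only.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
y = 3.0 + 2.0 * x[:, 0] - 1.5 * x[:, 2] + rng.normal(scale=0.5, size=100)

selected = backward_selection(x, y, ["x1", "x2", "x3", "x4"])
print(selected)
```

Removing one variable at a time matters: the t statistics of the remaining variables change after each removal, so they are recomputed at every step.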
Coefficients (dependent variable: day ski-pass price, PFJ)

Model 1                          B           Std. Error   Beta     t        Sig.
(Constant)                       52.646      11.404                4.616    0.000
Altitude of the resort (AST)     -2.01E-03   0.006        -0.019   -0.309   0.758
Number of ski lifts (REM)        -1.11E-02   0.060        -0.012   -0.185   0.853
Altitude of the ski runs (API)   1.885E-02   0.006        0.269    3.247    0.002
Number of ski runs (PIS)         0.430       0.092        0.375    4.702    0.000
Cross-country trails, km (KMF)   -7.40E-02   0.051        -0.079   -1.455   0.149
Number of beds (LIT)             1.60E-03    0.000        0.434    5.058    0.000
Number of hotels (HOT)           -3.71E-02   0.169        -0.014   -0.220   0.827

Model 2                          B           Std. Error   Beta     t        Sig.
(Constant)                       52.431      11.285                4.646    0.000
Altitude of the resort (AST)     -2.08E-03   0.006        -0.020   -0.323   0.748
Altitude of the ski runs (API)   1.902E-02   0.006        0.272    3.336    0.001
Number of ski runs (PIS)         0.419       0.069        0.365    6.058    0.000
Cross-country trails, km (KMF)   -7.32E-02   0.050        -0.078   -1.452   0.150
Number of beds (LIT)             1.60E-03    0.000        0.434    5.083    0.000
Number of hotels (HOT)           -3.59E-02   0.168        -0.014   -0.214   0.831

Model 3                          B           Std. Error   Beta     t        Sig.
(Constant)                       51.557      10.463                4.928    0.000
Altitude of the resort (AST)     -1.74E-03   0.006        -0.016   -0.281   0.780
Altitude of the ski runs (API)   1.917E-02   0.006        0.274    3.404    0.001
Number of ski runs (PIS)         0.420       0.069        0.366    6.120    0.000
Cross-country trails, km (KMF)   -7.25E-02   0.050        -0.077   -1.448   0.151
Number of beds (LIT)             1.569E-03   0.000        0.423    6.196    0.000

Model 4                          B           Std. Error   Beta     t        Sig.
(Constant)                       51.128      10.299                4.964    0.000
Altitude of the ski runs (API)   1.827E-02   0.005        0.261    3.958    0.000
Number of ski runs (PIS)         0.421       0.068        0.367    6.173    0.000
Cross-country trails, km (KMF)   -7.31E-02   0.050        -0.078   -1.469   0.145
Number of beds (LIT)             1.589E-03   0.000        0.429    6.574    0.000

Model 5                          B           Std. Error   Beta     t        Sig.
(Constant)                       42.096      8.312                 5.065    0.000
Altitude of the ski runs (API)   2.16E-02    0.004        0.309    5.34     0.000
Number of ski runs (PIS)         0.415       0.069        0.361    6.057    0.000
Number of beds (LIT)             1.456E-03   0.000        0.393    6.460    0.000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
ANOVA (dependent variable: day ski-pass price, PFJ)

Model   Source       Sum of Squares   df   Mean Square   F         Sig.
1       Regression   113065.9         7    16152.278     54.365    0.000
        Residual     26739.942        90   297.110
        Total        139805.9         97
2       Regression   113055.7         6    18842.624     64.100    0.000
        Residual     26750.147        91   293.958
        Total        139805.9         97
3       Regression   113042.3         5    22608.463     77.717    0.000
        Residual     26763.571        92   290.908
        Total        139805.9         97
4       Regression   113019.4         4    28254.854     98.098    0.000
        Residual     26786.474        93   288.027
        Total        139805.9         97
5       Regression   112398.2         3    37466.080     128.497   0.000
        Residual     27407.648        94   291.571
        Total        139805.9         97

Predictors:
Model 1: (Constant), HOT, REM, KMF, AST, API, PIS, LIT
Model 2: (Constant), HOT, KMF, AST, API, PIS, LIT
Model 3: (Constant), KMF, AST, API, PIS, LIT
Model 4: (Constant), KMF, API, PIS, LIT
Model 5: (Constant), API, PIS, LIT

Compare the precision of the models: which one will you choose?
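One way to compare the precision of the five models is to compute the estimate of σ (the square root of the residual mean square) and the adjusted R² for each. The sketch below takes the residual sums of squares and degrees of freedom from the ANOVA table; the last decimals for models 2 and 3 are assumed readings, and the variable names are illustrative.

```python
import math

TOTAL_SS = 139805.9   # total sum of squares, identical for every model
N = 98                # number of resorts (total df = 97)

# model number -> (residual sum of squares, residual degrees of freedom)
models = {
    1: (26739.942, 90),
    2: (26750.147, 91),
    3: (26763.571, 92),
    4: (26786.474, 93),
    5: (27407.648, 94),
}

sigma_hat = {}
for m, (res_ss, df) in models.items():
    sigma_hat[m] = math.sqrt(res_ss / df)                  # precision of the model
    adj_r2 = 1 - (res_ss / df) / (TOTAL_SS / (N - 1))      # adjusted R^2
    print(m, round(sigma_hat[m], 3), round(adj_r2, 3))

# Model with the smallest estimated sigma, i.e. the narrowest forecast interval.
best = min(sigma_hat, key=sigma_hat.get)
print(best)
```

The model with the smallest estimated σ need not be the one in which every variable is significant, so both criteria are worth looking at.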
How can we get a better model?
Study the outliers ($|e_i| > 2\hat\sigma$). We can exclude them and fit a new model.
What else can we do?
We can also build new variables. Ex:

[Figure: scatter plot of VENTES (sales) against PRIX (price)]
This relation looks like $Y = aX^2 + bX + c$.
We create the variables $X_1 = X$ and $X_2 = X^2$, and we study the model given by SPSS: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$.
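A minimal sketch with NumPy of this construction, on hypothetical data (prices expressed in thousands and sales generated from an exact parabola, so the fit can be checked by hand). The quadratic model is still linear in its coefficients, which is why ordinary least squares applies once the squared variable has been created.

```python
import numpy as np

# Hypothetical data: price in thousands, sales following an exact parabola.
price = np.array([3.2, 3.4, 3.6, 3.8, 4.0, 4.2])
sales = 50.0 - 30.0 * (price - 3.7) ** 2

# New variables: X1 = X and X2 = X^2, plus a column of ones for b0.
X = np.column_stack([np.ones_like(price), price, price ** 2])

# Ordinary least squares on the constructed variables.
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # b0, b1, b2
```

Expressing the price in thousands is a choice made here to keep the columns of X on comparable scales; it is not part of the original example.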
Model Summary (predictors: (Constant), PRIX2, PRIX)

R       R Square   Adjusted R Square   Std. Error of the Estimate
0.917   0.842      0.815               6.1512

ANOVA (predictors: (Constant), PRIX2, PRIX; dependent variable: VENTES)

Source       Sum of Squares   df   Mean Square   F        Sig.
Regression   2415.293         2    1207.647      31.917   0.000
Residual     454.040          12   37.837
Total        2869.333         14

Coefficients (dependent variable: VENTES)

Model        B           Std. Error   Beta      t        Sig.
(Constant)   -2720.94    416.556                -6.532   0.000
PRIX         1.550       0.230        27.523    6.745    0.000
PRIX2        -2.17E-04   0.000        -28.004   -6.863   0.000
[Figure: VENTES against PRIX, observed values and fitted quadratic curve]
Casewise Diagnostics (dependent variable: VENTES)

Case Number   Std. Residual   VENTES   Predicted Value   Residual
1             0.365           30.00    27.7525           2.2475
2             -0.694          30.00    34.2688           -4.2688
3             -0.764          35.00    39.7011           -4.7011
4             0.155           45.00    44.0495           0.9505
5             0.895           55.00    49.4942           5.5058
6             0.082           50.00    49.4942           0.5058
7             0.552           54.00    50.6031           3.3969
8             0.564           53.00    49.5315           3.4685
9             0.589           51.00    47.3760           3.6240
10            0.140           45.00    44.1365           0.8635
11            -1.529          25.00    34.4056           -9.4056
12            -1.287          20.00    27.9141           -7.9141
13            -1.449          19.00    27.9141           -8.9141
14            1.028           18.00    11.6793           6.3207
15            1.353           20.00    11.6793           8.3207
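The outlier rule $|e_i| > 2\hat\sigma$ can be checked on these residuals with a short Python sketch. The residuals are read from the Casewise Diagnostics table (some digits are assumed readings, chosen so that each residual equals VENTES minus the predicted value); n = 15 cases and k = 2 variables (PRIX, PRIX2).

```python
import math

# Residuals of the 15 cases (VENTES minus predicted value).
residuals = [2.2475, -4.2688, -4.7011, 0.9505, 5.5058,
             0.5058, 3.3969, 3.4685, 3.6240, 0.8635,
             -9.4056, -7.9141, -8.9141, 6.3207, 8.3207]
n, k = len(residuals), 2

# Estimate of sigma from the residual sum of squares.
sigma_hat = math.sqrt(sum(e * e for e in residuals) / (n - k - 1))

# Standardized residuals, as in the "Std. Residual" column.
std_residuals = [e / sigma_hat for e in residuals]

# Cases flagged as outliers by the rule |e_i| > 2 * sigma_hat.
outliers = [(i + 1, e) for i, e in enumerate(residuals) if abs(e) > 2 * sigma_hat]

print(round(sigma_hat, 4), outliers)
```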
Ex 2: graph of salary against age

[Figures: scatter plots of salary against age. Blue: men; green: women.]
Ex 2: graph salary/age
Basic model: $Salary = b_0 + b_1\,Age + \text{residual}$
How can we distinguish between men and women?
What about this model?
$Salary = b_0' + b_1'\,Age + b_2'\,Woman + \text{residual}$
Better model:
$Salary = b_0'' + b_1''\,Age + b_2''\,Woman + b_3''\,Age \times Woman + \text{residual}$

Coefficients (dependent variable: Salary)

Model 1      B          Std. Error   Beta     t        Sig.
(Constant)   4443.283   3201.172              1.388    0.177
Woman        2623.08    4925.435     0.123    0.533    0.599
womanage     -336.4     125.0        -0.630   -2.689   0.012
Age          888.759    80.068       0.921    11.100   0.000

Model 2      B          Std. Error   Beta     t        Sig.
(Constant)   5551.632   2401.664              2.312    0.028
womanage     -272.134   34.112       -0.510   -7.978   0.000
Age          862.111    61.701       0.893    13.972   0.000

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       0.944   0.891      0.878               3737.07
2       0.943   0.889      0.882               3688.953

Predictors, Model 1: (Constant), Age, Woman, womanage
Predictors, Model 2: (Constant), Age, womanage
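The effect of the interaction variable can be seen by writing the two fitted lines of Model 2, one for men (Woman = 0) and one for women (Woman = 1). The sketch below uses the Model 2 coefficients as assumed readings of the SPSS table (5551.632, 862.111 and -272.134); the function name and the age of 40 are illustrative.

```python
# Assumed readings of the Model 2 coefficients (constant, Age, womanage).
B0, B_AGE, B_WOMANAGE = 5551.632, 862.111, -272.134

def predicted_salary(age, woman):
    """Predicted salary; woman is the dummy variable (1 = woman, 0 = man)."""
    return B0 + B_AGE * age + B_WOMANAGE * age * woman

# Same intercept for both groups, but a different slope with respect to age.
slope_men = B_AGE
slope_women = B_AGE + B_WOMANAGE

# Predicted gap between a man and a woman at a given age (here 40, hypothetical).
gap_at_40 = predicted_salary(40, 0) - predicted_salary(40, 1)
print(round(slope_men, 3), round(slope_women, 3), round(gap_at_40, 2))
```

In this model the gap between the two lines grows in proportion to age, since only the slope differs between the groups.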