STATISTICS: Multiple regression

Problem: explain the price of a day ski pass (prix forfait jour, PFJ).


SPSS results (simple regression)

Coefficients (a. Dependent Variable: prix forfait jour)

Variable       B        Std. Error   Beta     t        Sig.
(Constant)   88.459       4.596              19.248    .000
nb pistes      .873        .076      .760    11.470    .000

ANOVA (a. Predictors: (Constant), nb pistes; b. Dependent Variable: prix forfait jour)

             Sum of Squares   df   Mean Square      F       Sig.
Regression       80825.82      1     80825.82     131.558   .000
Residual         58980.067    96       614.376
Total           139805.9      97

PFJ = 88.459 + 0.873 PIS + ε

Direct calculation for R? SPSS:

Model Summary (a. Predictors: (Constant), nb pistes)

   R      R Square   Adjusted R Square   Std. Error of the Estimate
 .760a      .578           .574                  24.78660
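
As a quick illustration (not part of the original slides), the same kind of fit and R² can be reproduced in Python; the resort data are not distributed with these notes, so the arrays below are synthetic stand-ins for PIS and PFJ.

```python
# Sketch: simple linear regression of day-pass price (PFJ) on number of
# ski runs (PIS). The resort data set is not included, so pis and pfj
# below are illustrative placeholders, not the real values.
import numpy as np

rng = np.random.default_rng(0)
pis = rng.integers(10, 120, size=98)                  # number of ski runs (synthetic)
pfj = 88.5 + 0.87 * pis + rng.normal(0, 25, size=98)  # day-pass price (synthetic)

# Least-squares fit: PFJ = b0 + b1 * PIS + e
b1, b0 = np.polyfit(pis, pfj, deg=1)
fitted = b0 + b1 * pis
residuals = pfj - fitted

# R^2 = RegSS / TSS = 1 - RSS / TSS
tss = np.sum((pfj - pfj.mean()) ** 2)
rss = np.sum(residuals ** 2)
r_squared = 1 - rss / tss
print(f"PFJ = {b0:.3f} + {b1:.3f} * PIS,  R^2 = {r_squared:.3f}")
```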

Multiple regression: maybe there are other variables that help explain PFJ.

Multiple regression. Objective: explain PFJ by the variables

1. AST  Altitude of the resort (altitude station)
2. REM  Number of ski tows (remonte-pente)
3. API  Altitude of the ski runs (altitude des pistes)
4. PIS  Number of ski runs (pistes)
5. KMF  Kilometres of cross-country trails (ski de fond)
6. LIT  Number of beds (lits)
7. HOT  Number of hotels

What is the best model available? We look for a model of the form

Y = β0 + β1 X1 + β2 X2 + … + βk Xk + ε

where the βj are fixed (but unknown) parameters and ε is a random term with mean 0 and standard deviation σ.

We define the following vectors and matrix:

Y = (Y1, Y2, …, Yn)ᵀ,  β = (β0, β1, …, βk)ᵀ,  ε = (ε1, ε2, …, εn)ᵀ,

and the design matrix X, whose i-th row is (1, x_i1, …, x_ik), so that the model can be written

Y = X β + ε.

We must find b0, b1, …, bk that minimise

Σ_{i=1}^{n} (y_i − b0 − b1 x_i1 − … − bk x_ik)².

Geometrically, the fitted vector ŷ = X b is the orthogonal projection of y onto W, where W is the vector subspace spanned by the vectors 1, x1, x2, …, xk.
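
A minimal sketch of this least-squares criterion in matrix form, assuming nothing beyond NumPy; the function name and the toy data are ours, not from the course.

```python
# Minimal sketch: least-squares estimation in the matrix form Y = X b + e.
# b minimises sum_i (y_i - b0 - b1*x_i1 - ... - bk*x_ik)^2.
import numpy as np

def ols_fit(X_vars: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X_vars: (n, k) matrix of explanatory variables; returns b = (b0, b1, ..., bk)."""
    n = len(y)
    X = np.column_stack([np.ones(n), X_vars])  # prepend the column of ones (intercept)
    # Solve min ||y - X b||^2, equivalent to the normal equations X'X b = X'y
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# Tiny illustrative example with k = 2 explanatory variables
rng = np.random.default_rng(1)
X_vars = rng.normal(size=(50, 2))
y = 3.0 + 1.5 * X_vars[:, 0] - 2.0 * X_vars[:, 1] + rng.normal(0, 0.5, size=50)
print(ols_fit(X_vars, y))   # approximately [3.0, 1.5, -2.0]
```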

SPSS results

Coefficients (a. Dependent Variable: prix forfait jour)

Variable                       B         Std. Error   Beta      t       Sig.
(Constant)                  52.646        11.404              4.616    .000
altitude de la station      -2.0E-03        .006      -.09     -.309    .758
nb de remontees             -1.1E-02        .060      -.02     -.185    .853
altitude des pistes          1.885E-02      .006       .269    3.247    .002
nb de pistes                  .430          .092       .375    4.702    .000
kilometrage ski de fond     -7.40E-02       .051      -.079   -1.455    .149
nb de lits                   1.60E-03       .000       .434    5.058    .000
nb d'hotels                 -3.7E-02        .169      -.04     -.220    .827

ANOVA (a. Predictors: (Constant), nb d'hotels, nb de remontees, kilometrage ski de fond, altitude de la station, altitude des pistes, nb de pistes, nb de lits; b. Dependent Variable: prix forfait jour)

             Sum of Squares   df   Mean Square      F       Sig.
Regression      113065.9       7     16152.278    54.365    .000
Residual         26739.942    90       297.110
Total           139805.9      97

With the least squares method:

PFJ = 52.646 − 0.002 AST − 0.011 REM + 0.0189 API + 0.430 PIS − 0.0740 KMF + 0.0016 LIT − 0.037 HOT + residual

We can still use R² = RegSS / TSS. Here R² ≈ 0.81: about 81% of the information (of the price) is explained by this model.
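
For readers who want to reproduce this kind of output outside SPSS, here is a hedged sketch with statsmodels; the DataFrame is synthetic and the short column names (PFJ, AST, REM, API, PIS, KMF, LIT, HOT) are simply our shorthand for the slide's variables.

```python
# Sketch: fitting the 7-variable model. The resort data are not distributed
# with these notes, so the DataFrame below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 98
df = pd.DataFrame({
    "AST": rng.uniform(900, 2000, n),      # altitude de la station
    "REM": rng.integers(5, 60, n),         # nb de remontees
    "API": rng.uniform(1500, 3500, n),     # altitude des pistes
    "PIS": rng.integers(10, 120, n),       # nb de pistes
    "KMF": rng.uniform(0, 100, n),         # km de ski de fond
    "LIT": rng.integers(500, 40000, n),    # nb de lits
    "HOT": rng.integers(1, 60, n),         # nb d'hotels
})
df["PFJ"] = (52.6 + 0.019 * df["API"] + 0.43 * df["PIS"] + 0.0016 * df["LIT"]
             + rng.normal(0, 17, n))       # synthetic prix forfait jour

model = smf.ols("PFJ ~ AST + REM + API + PIS + KMF + LIT + HOT", data=df).fit()
print(model.params)       # b0, b1, ..., b7 (least-squares estimates)
print(model.rsquared)     # R^2 = RegSS / TSS
```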

Precision of the model? As in simple regression, we assume that ε follows a normal distribution N(0, σ). We can estimate σ from the data.

Estimation of σ (standard deviation of the residuals)

Estimation of σ²:  σ̂² = (1 / (n − k − 1)) Σ_{i=1}^{n} e_i²

Estimation of σ:   σ̂ = √σ̂²

Forecast interval for y_i

Model: Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i

Prediction of y_i: ŷ_i = β̂0 + β̂1 x_i1 + … + β̂k x_ik

Simplified formula: ŷ_i ± 2σ̂  (95% of the e_i lie in [−2σ̂ ; 2σ̂])
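
A small sketch of how σ̂ and the simplified ŷ_i ± 2σ̂ interval can be computed; the data and variable names are illustrative placeholders, not the ski-resort data.

```python
# Sketch: estimating sigma from the residuals and forming the simplified
# forecast interval y_hat_i +/- 2*sigma_hat.
import numpy as np

def residual_sd(e: np.ndarray, k: int) -> float:
    """sigma_hat = sqrt( sum(e_i^2) / (n - k - 1) ) for a model with k predictors."""
    n = len(e)
    return np.sqrt(np.sum(e ** 2) / (n - k - 1))

# Example with one predictor (k = 1), reusing a simple least-squares fit
rng = np.random.default_rng(3)
x = rng.uniform(0, 100, 98)
y = 88.5 + 0.87 * x + rng.normal(0, 25, 98)
b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
sigma_hat = residual_sd(y - fitted, k=1)

# Simplified 95% forecast interval for each observation
lower, upper = fitted - 2 * sigma_hat, fitted + 2 * sigma_hat
print(sigma_hat, np.mean((y >= lower) & (y <= upper)))  # coverage close to 0.95
```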

Isn't there a problem?

(Coefficients table for the full model repeated from above; Dependent Variable: prix forfait jour.)

We don't use R in multiple regression. What can we say about each coefficient, taking the others into account? What can we say about the signs of the coefficients (e.g. nb de remontées)? Do all the variables seem useful?

Is the contribution of X_j significant?

Model: Y = β0 + β1 X1 + … + βj Xj + … + βk Xk + ε

Test: H0: βj = 0 against H1: βj ≠ 0

Statistic used: t_j = β̂j / s_j, where s_j = standard error of β̂j = √[ σ̂² / ( (1 − R²(Xj; other X's)) · Σ_i (x_ij − x̄_j)² ) ]

Rejection of H0 with risk α: reject H0 if |t_j| ≥ t_{1−α/2}(n − k − 1), the fractile (quantile) of a Student t distribution.

What is the significance level (Sig)? It is the smallest value of α for which H0 is rejected.

[Student t density: an area of Sig/2 lies in each tail beyond ±t_j, to be compared with the critical values ±t_{1−α/2}(n − k − 1).]

We can reject H0: βj = 0 with risk α if Sig ≤ α.

What are the significant variables in this model (α = 0.05)?

(Same coefficients table as above; Dependent Variable: prix forfait jour.)

t_{0.975}(90) = 1.987
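
The Sig column and the critical value quoted above can be checked from the Student t distribution with n − k − 1 = 98 − 7 − 1 = 90 degrees of freedom; this sketch uses SciPy and the t value 3.247 read from the coefficients table.

```python
# Sketch: how the Sig column and the critical value are obtained from the
# Student t distribution with 90 residual degrees of freedom.
from scipy import stats

df_resid = 90

# Two-sided significance level for "altitude des pistes" (t = 3.247)
t_api = 3.247
sig_api = 2 * stats.t.sf(abs(t_api), df_resid)
print(round(sig_api, 3))                       # ~0.002, as in the SPSS output

# Critical value for alpha = 0.05: reject H0 if |t_j| >= t_{0.975}(90)
print(round(stats.t.ppf(0.975, df_resid), 3))  # ~1.987
```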

Selection of variables: the backward method.

Step 1: fit the model with all the variables.
Step 2: remove the variable Xj with the smallest contribution (|t_j| minimum, i.e. Sig(t_j) maximum) and compute a new model.
Repeat until all the remaining variables are significant (default removal criterion in SPSS: Sig(t_j) ≥ 0.10).
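
A simplified sketch of this backward procedure based on the p-values Sig(t_j); SPSS's own criterion is the probability of F-to-remove, but the idea is the same. The threshold, function name and column names are assumptions for illustration.

```python
# Simplified sketch of backward elimination on p-values (threshold 0.10).
# `data` must contain the response column plus the candidate predictors.
import statsmodels.formula.api as smf

def backward_select(data, response, predictors, threshold=0.10):
    """Drop the least significant variable until all remaining Sig(t_j) < threshold."""
    remaining = list(predictors)
    while remaining:
        formula = f"{response} ~ " + " + ".join(remaining)
        fit = smf.ols(formula, data=data).fit()
        pvals = fit.pvalues.drop("Intercept")   # Sig(t_j) for each predictor
        worst = pvals.idxmax()
        if pvals[worst] < threshold:            # every variable is significant: stop
            return fit
        remaining.remove(worst)                 # remove the weakest contributor, refit
    return smf.ols(f"{response} ~ 1", data=data).fit()

# Example call (using a DataFrame like the synthetic `df` built earlier):
# best = backward_select(df, "PFJ", ["AST", "REM", "API", "PIS", "KMF", "LIT", "HOT"])
# print(best.params, best.rsquared)
```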

Backward selection in SPSS (Coefficients, Dependent Variable: prix forfait jour): five successive models are fitted. Model 1 contains all seven variables; Model 2 drops nb de remontees; Model 3 also drops nb d'hotels; Model 4 also drops altitude de la station; Model 5 also drops kilometrage ski de fond. The final model (Model 5) keeps altitude des pistes, nb de pistes and nb de lits, all with Sig = .000.

ANOVA for the five models (Dependent Variable: prix forfait jour). All five overall F tests have Sig = .000.

Model   Residual df   Residual Mean Square
  1          90             297.110
  2          91             293.958
  3          92             290.908
  4          93             288.027
  5          94             291.571
(Total df = 97.)

Predictors:
a. Model 1: (Constant), nb d'hotels, nb de remontees, kilometrage ski de fond, altitude de la station, altitude des pistes, nb de pistes, nb de lits
b. Model 2: (Constant), nb d'hotels, kilometrage ski de fond, altitude de la station, altitude des pistes, nb de pistes, nb de lits
c. Model 3: (Constant), kilometrage ski de fond, altitude de la station, altitude des pistes, nb de pistes, nb de lits
d. Model 4: (Constant), kilometrage ski de fond, altitude des pistes, nb de pistes, nb de lits
e. Model 5: (Constant), altitude des pistes, nb de pistes, nb de lits

Compare the precision of the models: which one will you choose?

How can we get a better model? Study the outliers (observations with |e_i| > 2σ̂): we can exclude them and run a new model. What else can we do?
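
A short sketch of the outlier rule |e_i| > 2σ̂ followed by a refit without the flagged points; the data are synthetic and the simple one-predictor model is only for illustration.

```python
# Sketch: flag observations with |e_i| > 2*sigma_hat and refit without them.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 100, 98)
y = 88.5 + 0.87 * x + rng.normal(0, 25, 98)
y[:3] += 150                                             # a few artificial outliers

b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (len(y) - 2))   # n - k - 1 with k = 1

keep = np.abs(resid) <= 2 * sigma_hat                    # outliers: |e_i| > 2*sigma_hat
print(f"excluded {np.sum(~keep)} outliers")
b1_new, b0_new = np.polyfit(x[keep], y[keep], deg=1)     # refitted model
print(b0, b1, "->", b0_new, b1_new)
```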

We can also build new variables. Ex:

[Scatter plot of VENTES (sales, roughly 20 to 60) against PRIX (price, 3200 to 4200).]

This relationship looks like Y = aX² + bX + c. We create the variables X1 = X and X2 = X², and study the model fitted by SPSS:

Ŷ = b0 + b1 X1 + b2 X2
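
A sketch of the same quadratic trick in Python; the sales/price data from the slide are not reproduced here, so the values are synthetic and only mimic the shape of the plot.

```python
# Sketch: build X1 = X and X2 = X^2, then fit Y = b0 + b1*X1 + b2*X2.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
prix = rng.uniform(3200, 4200, 15)                        # price range as in the plot
ventes = 50 - ((prix - 3700) / 200) ** 2 + rng.normal(0, 3, 15)  # synthetic sales

df = pd.DataFrame({"VENTES": ventes, "PRIX": prix})
df["PRIX2"] = df["PRIX"] ** 2                             # the new variable X2 = X^2

quad = smf.ols("VENTES ~ PRIX + PRIX2", data=df).fit()
print(quad.params)      # b0, b1, b2
print(quad.rsquared)    # compare with the R^2 reported by SPSS
```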

Model Summary (a. Predictors: (Constant), PRIX2, PRIX)

   R      R Square   Adjusted R Square   Std. Error of the Estimate
 .917a      .842           .815                    6.1512

ANOVA (a. Predictors: (Constant), PRIX2, PRIX; b. Dependent Variable: VENTES)

             Sum of Squares   df   Mean Square      F       Sig.
Regression       2415.293      2     1207.647     31.917    .000
Residual          454.040     12       37.837
Total            2869.333     14

Coefficients (a. Dependent Variable: VENTES)

Variable        B          Std. Error    Beta        t        Sig.
(Constant)   -2720.94       416.556                -6.532     .000
PRIX             1.550         .230      27.523     6.745     .000
PRIX2           -2.17E-04      .000     -28.004    -6.863     .000

[Plot of VENTES against PRIX (3200 to 4200): observed values (Observé) with the fitted quadratic curve (Quadratique).]

Casewise Diagnostics (a. Dependent Variable: VENTES)

Case   Std. Residual   VENTES   Predicted Value   Residual
  1        .365         30.00       27.7525         2.2475
  2       -.694         30.00       34.2688        -4.2688
  3       -.764         35.00       39.7011        -4.7011
  4        .155         45.00       44.0495          .9505
  5        .895         55.00       49.4942         5.5058
  6        .082         50.00       49.4942          .5058
  7        .552         54.00       50.6031         3.3969
  8        .564         53.00       49.5315         3.4685
  9        .589         51.00       47.3760         3.6240
 10        .140         45.00       44.1365          .8635
 11      -1.529         25.00       34.4056        -9.4056
 12      -1.287         20.00       27.9140        -7.9140
 13      -1.449         19.00       27.9140        -8.9140
 14       1.028         18.00       11.6793         6.3207
 15       1.353         20.00       11.6793         8.3207

Ex 2: graph of salary against age

[Scatter plot of salary against age. Blue: men, green: women.]

[Second scatter plot of salary against age. Blue: men, green: women.]

Ex 2: graph of salary against age. Basic model: Salary = b0 + b1·Age + residual. How can we distinguish between men and women in the model?

What about this model?

Salary = b0′ + b1′·Age + b2′·Woman + residual

Better model:

Salary = b0″ + b1″·Age + b2″·Woman + b3″·Age×Woman + residual

SPSS output (Dependent Variable: Salary). In Model 1 (Age, Woman, Age×Woman), the Woman term is not significant (t = .533, Sig = .599) and neither is the constant (t = 1.388), while Age (Sig = .000) and the interaction womanage = Age×Woman (t = -2.689) are significant. Model 2 therefore keeps only Age and Age×Woman: (Constant) Sig = .028, womanage Sig = .000, Age Sig = .000.

Model Summary: Model 1: R = .944, R² = .891, adjusted R² = .878, Std. Error of the Estimate = 3737.07. Model 2: R = .943, R² = .889, adjusted R² = .882, Std. Error of the Estimate = 3688.953.
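
A sketch of this dummy-plus-interaction model; the salary data are not included, so the DataFrame below is synthetic and only illustrates the coding of Woman as 0/1 and of the Age×Woman term.

```python
# Sketch: Salary = b0 + b1*Age + b2*Woman + b3*Age*Woman + residual,
# with Woman coded 0/1. Data are synthetic placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 40
age = rng.uniform(22, 60, n)
woman = rng.integers(0, 2, n)                          # 0 = man, 1 = woman
salary = 5000 + 900 * age - 300 * age * woman + rng.normal(0, 3500, n)

df = pd.DataFrame({"Salary": salary, "Age": age, "Woman": woman})

# Age:Woman is the interaction term (a different slope for women)
full = smf.ols("Salary ~ Age + Woman + Age:Woman", data=df).fit()
print(full.pvalues)    # Woman alone may not be significant, as on the slide

reduced = smf.ols("Salary ~ Age + Age:Woman", data=df).fit()
print(reduced.params)  # model 2: intercept, Age slope, and the Age*Woman adjustment
```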
