Using the Regression Model in multivariate data analysis

Similar documents
DISCRIMINANT ANALYSIS IN THE STUDY OF ROMANIAN REGIONAL ECONOMIC DEVELOPMENT

ROMANIA S TOURISTIC REGIONALIZATION - TOURISTIC DEVELOPMENT INDICATOR -

The Geography of Regional Clusters in Romania and Their Importance for Entrepreneurial Activities

The influence of cluster type economic agglomerations on the entrepreneurship, in Romania

Administratia Nationala Apele Romane. Proposed alternatives for ecological restoration of Lower Danube in Romania.

THE ROMANIAN POPULATION BY GENDER AND AGE GROUPS IN 2011

40 TH CONGRESS OF THE EUROPEAN REGIONAL SCIENCE ASSOCIATION THE ROMANIAN REGIONS COMPETITIVENESS

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Country Romania Romania Romania Romania Romania Romania Romania Romania IIRUC SERVICE IIRUC SERVICE

TRANSPORT SYSTEM, TELECOMMUNICATION NETWORKS AND PUBLIC SERVICES IN ROMANIA BETWEEN 1990 AND 2000

Available online at (Elixir International Journal) Statistics. Elixir Statistics 49 (2012)

COMMERCIAL FACILITIES AND URBAN REGENERATION

Geographical Economics and Per Capita GDP Growth in Romania *

Ref.: Spring SOS3003 Applied data analysis for social science Lecture note

Endogenous regional growth and foreign trade, in Romania

Transportation Planning and Balanced Growth in Romania

Multiple Regression. More Hypothesis Testing. More Hypothesis Testing The big question: What we really want to know: What we actually know: We know:

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

Multiple linear regression S6

Assumptions of the error term, assumptions of the independent variables

INFLUENCE OF CLUSTER TYPE BUSINESS AGGLOMERATIONS FOR DEVELOPMENT OF ENTREPRENEURIAL ACTIVITIES STUDY ABOUT ROMANIA

LUNGIMEA CĂILOR DE TRANSPORT LA SFÂRŞITUL ANULUI 2015

COMPETITIVENESS OF DEVELOPING REGIONS IN ROMANIA

The Model Building Process Part I: Checking Model Assumptions Best Practice

Working together Revista de cercetare [i interven]ie social\

Econometric Analysis of Some Economic Indicators Influencing Nigeria s Economy.

Multiple Regression and Model Building (cont d) + GIS Lecture 21 3 May 2006 R. Ryznar

URBAN SPRAWL THE LEGAL CONTEXT AND TERRITORIAL PRACTICES IN ROMANIA

PROJECT FOR A FUNCTIONAL REGIONALIZATION OF ROMANIA

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

Regression Diagnostics Procedures

Regression ( Kemampuan Individu, Lingkungan kerja dan Motivasi)

Answer all questions from part I. Answer two question from part II.a, and one question from part II.b.

ECONOMETRIC ANALYSIS OF THE COMPANY ON STOCK EXCHANGE

Parametric Test. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 1984.

Multiple Regression Analysis

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables.

Econometrics Part Three

Multivariate Correlational Analysis: An Introduction

AN EXPLORATORY STUDY OF THE REGIONAL CONTEXT OF COMPETITIVE DEVELOPMENT IN ROMANIA. Valentin COJANU

ECON 497 Midterm Spring

Single and multiple linear regression analysis

FinQuiz Notes

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

A Study of Hotel Performance Under Urbanization in China

Correlation and Regression Bangkok, 14-18, Sept. 2015

Quantitative Methods I: Regression diagnostics

Assoc.Prof.Dr. Wolfgang Feilmayr Multivariate Methods in Regional Science: Regression and Correlation Analysis REGRESSION ANALYSIS

Ridge Regression. Summary. Sample StatFolio: ridge reg.sgp. STATGRAPHICS Rev. 10/1/2014

Inferences for Regression

INTRODUCTORY REGRESSION ANALYSIS

Sociology 593 Exam 2 Answer Key March 28, 2002

Regression Analysis By Example

Multiple Linear Regression II. Lecture 8. Overview. Readings

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I

Lecture 19 Multiple (Linear) Regression

Multivariate Lineare Modelle

Multiple Linear Regression I. Lecture 7. Correlation (Review) Overview

ECON 4230 Intermediate Econometric Theory Exam

y response variable x 1, x 2,, x k -- a set of explanatory variables

Relationship between ridge regression estimator and sample size when multicollinearity present among regressors

CHAPTER 6: SPECIFICATION VARIABLES

CHAPTER 5 LINEAR REGRESSION AND CORRELATION

Introduction to statistical modeling

Unit 11: Multiple Linear Regression

LECTURE 11. Introduction to Econometrics. Autocorrelation

CERTIFICATE. Str. Aurel Vlaicu, nr Satu Mare Romania

9. Linear Regression and Correlation

Seismic report - manual

SPSS Output. ANOVA a b Residual Coefficients a Standardized Coefficients

REVIEW 8/2/2017 陈芳华东师大英语系

VARIANCE ANALYSIS OF WOOL WOVEN FABRICS TENSILE STRENGTH USING ANCOVA MODEL

The Effect of Multicollinearity on Prediction in Regression Models Daniel J. Mundfrom Michelle DePoy Smith Lisa W. Kay Eastern Kentucky University

DEMAND ESTIMATION (PART III)

Multiple Linear Regression II. Lecture 8. Overview. Readings

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I

CHANGES IN THE STRUCTURE OF POPULATION AND HOUSING FUND BETWEEN TWO CENSUSES 1 - South Muntenia Development Region

Sustainable Regional Development Policy in Romania - Coordinates

STAT 212 Business Statistics II 1

Analysing data: regression and correlation S6 and S7

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

ASSESSMENT OF FUNCTIONAL POLICENTRICITY IN THE ROMANIAN COUNTY RESIDENCE MUNICIPALITIES

Counties of Romania List

Multiple Linear Regression

Bivariate Regression Analysis. The most useful means of discerning causality and significance of variables

Area1 Scaled Score (NAPLEX) .535 ** **.000 N. Sig. (2-tailed)

A particularly nasty aspect of this is that it is often difficult or impossible to tell if a model fails to satisfy these steps.

Chapter 10 Correlation and Regression

Electronic Research Archive of Blekinge Institute of Technology

Inter Item Correlation Matrix (R )

Linear Regression Models

Introduction to Regression

Regression Analysis. BUS 735: Business Decision Making and Research

Equation Number 1 Dependent Variable.. Y W's Childbearing expectations

ECON 497: Lecture 4 Page 1 of 1

Types of Statistical Tests DR. MIKE MARRAPODI

Multiple Regression. Peerapat Wongchaiwat, Ph.D.

Lecture 4: Regression Analysis

Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology

Transcription:

Bulletin of the Transilvania University of Braşov Series V: Economic Sciences Vol. 10 (59) No. 1-2017 Using the Regression Model in multivariate data analysis Cristinel CONSTANTIN 1 Abstract: This paper is about an instrumental research regarding the using of Linear Regression Model for data analysis. The research uses a model based on real data and stress the necessity of a correct utilisation of such models in order to obtain accurate information for the decision makers. The main scope is to help practitioners and researchers in their efforts to build prediction models based on linear regressions. The conclusion reveals the necessity to use quantitative data for a correct model specification and to validate the model according to the assumptions of the least squares method. Key-words: Regression model, multivariate analysis, model validation, predictions, GDP 1. Introduction The present paper is a part of a series of instrumental researches meant to review the main multivariate data analysis models. The research is based on the exemplification of using the Multiple Linear Regression Model starting from the model specification and continuing with the validation of this one. The main issues related to this model are underlined in order to stress the importance of a correct utilisation in the process of data analysis. 2. The Multiple Linear Regression Model According to Kutner, et al. (2005), regression analysis has three major purposes: (1) description, (2) control, and (3) prediction. Thus the regression model could be used to describe the relationship between different variables, to control and predict the evolution of a dependent variable according to the evolution of one or more variables used as predictors. One of the most popular models is the linear one, which starts from the assumption of a linear relationship between the analysed variables. If we take into consideration a dependent variable (Y) and an independent one (X), it would be 1 Transilvania University of Braşov, cristinel.constantin@unitbv.ro

28 Bulletin of the Transilvania University of Braşov - Vol. 10 (59), No. 1-2017 Series V supposed that the mean of the dependent variable is placed on a straight line determined by the variation of the independent variable. In this respect, two parameters, β 0 and β 1, which determine a straight line, can be calculated based on the observed data. The observations of the dependent variable Y i is supposed to deviate from the mean with a random error denoted by ε i (Rawlings, Pantula and Dickey, 1998). Thus, the statistical model of simple regression is: Y i = β 0 + β 1X i + ε i (1) In practice, the variation of dependent variable is determined by more than one predictor, so that a multiple regression model is used by adding more independent variables to the above equation. Y i = β 0 + β 1X 1i + β 2X 2i + β 3X 3i + β kx ki + ε i (2) The estimation of model s parameters (β β k) is made by using the least squares method, which can be applied only when the number of observed values for the analysed variables (n) is higher than the number of independent variables (k). Other assumptions should be also considered: the errors (ε i) have a null mean and a constant variance, the errors are not auto-correlated and the independent variables (X) are not correlated each other (Montgomery, Peck and Vining, 2006). The variables used in the regression model, both dependent and independent ones, have to be quantifiable (Saunders, Lewis and Thornhill, 2007). In this respect, a ratio scale is used for measurement, which allows the calculation of means and variation indicators. Moreover, in order to make predictions the selected variables have to be in a cause-effect relationship- i.e. the independent variables determine the evolution of the dependent variable. Thus, a strong statistical relationship between variables is not enough and the causal relationship should be interpreted with caution, using supplementary analyses and references to theories (Kutner, et al. 2005). Conducting multiple regression analysis involves several steps, starting with the estimation of the model s parameters, by using the least squares method, which continues with the test for parameters significance and other tests conducted for verifying the model s assumptions mentioned above (Malhotra, 2004). Finally, the coefficient of determination (R 2 ) can be calculated in order to measure the strength of association. The predicted values of the dependent variable (Ŷ) can also be obtained based on the values of independent variables. 2. Using the Regression Model in data analysis In the followings we are going to exemplify the applying of the Multiple Linear Regression Model using the IBM SPSS system (Statistical Package for Social Sciences). The dependent variable (Y) is the GDP recorded in 2014 by every county of Roma-

C. CONSTANTIN: Using the Regression Model in multivariate data analysis 29 nia. The main economic problem related to this indicator is the huge difference that exists among Romanian counties (see Fig. 1). County Covasna Mehedinti Tulcea Salaj Calarasi Vaslui Giurgiu Ialomita Botosani Teleorman Vrancea Bistrita-Nasaud Caras-Severin Harghita Braila Satu Mare Olt Valcea Neamt Buzau Gorj Maramures Hunedoara Alba Dambovita Suceava Galati Bacau Arad Sibiu Mures Bihor Dolj Ilfov Arges Iasi Brasov Timis Cluj Prahova Constanta Bucharest 0 20000 40000 60000 80000 100000 120000 140000 160000 GDP (Mill. Lei) Fig. 1. The levels of GDP recorded in 2014 by Romanian counties (Bucharest included)

30 Bulletin of the Transilvania University of Braşov - Vol. 10 (59), No. 1-2017 Series V We can notice that the capital city recorded a very high value of GDP, which is higher than the sum of the first 5 counties but there are also significant discrepancies between counties. Starting from this problem we tried to identify the influence factors that contribute to a higher or a lower level of GDP. In this respect, a Multiple Linear Regression Model has been used having as predictors the following independent variables (see Fig. 2): the population size (Population), the labour resources (Labour_res), the number of active companies (Company_ number) and the employed population (Employed_pop). These predictors have been chosen starting from the supposition that the population is the main determinant of the total consumption and contribute to a large extent to the production process through the labour resources and the employed population. Another predictor, the number of active companies, can have a direct influence on the production and consequently on the GDP value. As Bucharest recorded an extreme value, it has been excluded from the analysis. Fig. 2. The linear dependences between the GDP and every predictor (year 2014)

C. CONSTANTIN: Using the Regression Model in multivariate data analysis 31 In Fig. 2 we can observe the linear dependence between the GDP and every predictor, with determination coefficients (R 2 ) higher than 0.7. Therefore, we can consider that every variable by itself explains more than 70% of the GDP variation. The highest influence is given by the number of companies, which explains 85.5% of the GDP variation. But the relevance of the statistical relationship is not enough for a scientific explanation of the causal relationship. In practice, an economic phenomenon is influenced by a mix of factors and the use of Multiple Regression Model is more suitable. Thus we have applied such a model on the above variables, all predictors being included together in the analysis (see Table. 1). Model Unstandardized Coefficients B Std. Error Std. Coeff. Beta t Sig. Correlations Collinearity Statistics Zeroordeer- Partial Part Tol- VIF ance (Constant) -2065,7 1491,5-1,38,175 Population -,047,036 -,95-1,29,205,85 -,21 -,07,006 172,8 Labour_res 95,069 56,85 1,29 1,67,103,88,26,09,005 191,6 Company_ number,668,25,49 2,60,013,92,39,14,087 11,4 Employed_ pop 11,660 30,51,10,38,705,90,06,02,042 23,5 Table 1. The regression coefficients and multicolinearity The results show that even if every variable by itself is strongly correlated with the dependent variable, when they are put together in the model, the majority of regression coefficients become insignificant. According to the results of t-student test, the significance level (Sig.) is smaller than 0.05 only for the Company_number. For the rest of variables, Sig.> 0.05 and the coefficients cannot be considered significantly different from zero. This anomaly appears due to a high multicollinearity between variables. This phenomenon is presented in the column Partial correlation, where the partial correlation coefficients have quite small values. These coefficients represent the correlation between two variables, which remains after removing their mutual correlation with other variables included in model. The multicolinearity is also revealed in the last two columns of the table. Small values of Tolerance and big values of Variance Inflation Factor (VIF) underline a low contribution of the variables to the model. It means that the model has computational problem and the predictors have to be reconsidered. In order to avoid such problems it is recommended to use a selection method of predictors. One of the best methods is Stepwise selection. This one includes the predictors in model step by step, starting with the variable that has the highest influence on the dependent variable. The variables with small contribution to the variance explanation are excluded from model.

32 Bulletin of the Transilvania University of Braşov - Vol. 10 (59), No. 1-2017 Series V Model Unstandardized Co- Std. t Sig. Correlations Collinearity Sta- efficients Coeff. tistics B Std. Error Beta Zero- Par Part Toler- VIF order tial ance (Constant) 231,501 923,835,2,803 1,246,082,925 15,1,000,925,92,925 1,000 1,000 (Constant) -2631,5 1302,41-2,0,050 Company_number Company_number,87,15,649 5,8,000,925,68,328,255 3,915 Labour_res 23,411 8,09,320 2,8,006,880,42,162,255 3,915 Table 2. The results of regression model with stepwise selection The results of stepwise selection are presented in Table 2. The variable Company_number has been selected at the first step and Labour_res at the second step. The rest of variables have been rejected. We can observe that by including the second variable the tolerance decreased and the VIF increased due to a certain multicolinearity but the regression coefficients are both significant as Sig.< 0.05. Model R R Square Adjusted R Square Std. Error of the Estimate Durbin-Watson 1,925 a,855,851 2982,58941 2,939 b,881,875 2735,12359 1,881 a. Predictors: (Constant), Company_number b. Predictors: (Constant), Company_number, Labour_res c. Dependent Variable: GDP Table 3. The coefficient of determination and error autocorrelation testing The coefficient of determination (R 2 ) for the model with two independent variables is 0.855, which means that these variables explain 85.5% of the GDP variation (see Table 3). The result of Durbin-Watson test (DW) is also provided. This is a tool used to test the error autocorrelation, as one of the regression assumption is that the errors are independent (there is no correlation between errors). The interpretation of this test s

C. CONSTANTIN: Using the Regression Model in multivariate data analysis 33 results is made by comparing the calculated value (d) with two critical values from DW table (d L and d U), which lies between 0 and 4. The hypothesis of autocorrelation is rejected if d U < d < 4-d U. In our case for a significance level α = 0.05, 2 predictors and 41 observations, the critical values are: d L=1.391 and d U=1.600. As the calculated value presented in Table 3, d=1.881 is higher than d U and lower than 4-d U, we can reject the hypothesis of error autocorrelation. Another assumption of the model is the absence of heteroscedasticity, which means that the errors have a constant variance. There is no direct method of identifying heteroscedasticity but some visual methods and empirical tests could be used. One of the visual methods is to plot the standardized residuals (ZRESID) on the standardized predicted values (ZPRED) in a scatterplot diagram (see Fig. 3). This diagram can be made easily with the SPSS just when we perform the regression model by pressing the Plots button. Fig. 3. The relationship between standardized predicted values and residuals As the plot fans out, having some outlier values, we can conclude that there is a sign of heteroscedasticity. In such cases the estimators could be biased or inconsistent so that the results should be interpreted with prudence or new independent variables should be found.

34 Bulletin of the Transilvania University of Braşov - Vol. 10 (59), No. 1-2017 Series V 3. Discussions and conclusions As we presented in the above analysis, the Linear Regression Model is a powerful method used for the description of relationships between variables and for predictions but the results have to be interpreted with caution. First of all, it is very important to have a proper specification of the model taking into account the nature of the variables and of the relationships between these ones. In this respect, both the independent and dependent variables have to be measured with metric scales. Sometimes binary variable, named dummy variables, could be used as independent variables. As regards the relationships between variables, these ones have to be linear but a pure statistical dependence is not enough. The dependence relationship must be based on a theory if we want to use the model for predictions. After the model specification, some validation procedures are necessary in order to accept the model s hypotheses. If certain hypotheses are rejected, the model should be reconsidered or the results should be considered with maximum precaution. In the above example, we can find that increasing the number of companies and labour resources will lead to higher levels of GDP. Thus new investments are necessary for the regional economic development. Based on different values of the independent variables, predicted values of the GDP could be calculated but we have to take care that the model is susceptible of heteroscedasticity. The results should be also cross validated using other samples from different years. The results of this instrumental research could be useful both for practitioners and academic researchers in their efforts to build prediction models using the Linear Regression. The overall conclusion is that a superficial using of this model, without a proper validation, could lead to wrong conclusions and predictions. Consequently bad decisions could be made starting from inaccurate information. 6. References Kutner, M., Nachtsheim, C., Neter, J. and Li, W., 2005. Applied linear statistical models, 5 th ed. New York: McGraw-Hill Irwin. Malhotra, N., 2004. Marketing research. An applied orientation, 4 th ed. New Jersey: Pearson Education. Montgomery, G., Peck, E. and Vining, G., 2006. Introduction to Linear Regression Analysis, 4 th ed. New Jersey: John Wiley & Sons. Rawlings, J., Pentula, S. and Dickey, D., 1998. Applied regression analysis: a research tool. 2 nd ed. New York: Springer-Verlag. Saunders, M., Lewis, Ph. and Thornhill, A., 2007. Research methods for business students, 4 th ed. Harlow: Pearson Education.