Assoc. Prof. Dr. Wolfgang Feilmayr
Multivariate Methods in Regional Science: Regression and Correlation Analysis


REGRESSION ANALYSIS

Regression analysis can be broadly defined as the analysis of statistical relationships between one dependent and one or more independent variables. Although the terms dependent and independent variable are conventional, they do not necessarily imply causality, no matter how strong the statistical relationship might be. In some instances there are strong a priori grounds for specifying a cause-and-effect relationship and thus for selecting a dependent variable and one or more independent variables. In many other situations this is not so easy: regional science, and social science generally, are characterized by relationships between variables which are unclear, ambiguous, or which may even exhibit two-way dependence. All variables involved have to be quantitative.

Examples

- How do income and duration of holidays influence holiday expenses?
- What is the relationship between the occupancy rate of tourist beds and the number of tourist facilities?
- How will potential building plots in certain communities develop in the future?

Purposes of Regression Analysis

(1) Analysis of the relationship between dependent and independent variables, i.e. estimation of the regression coefficients b_j.

(2) Examination of whether the estimated relationship is significant and whether it can be extended from the sample to the statistical population, if sample and population are not identical. Significance is also an indicator of whether the number of observations is large enough in relation to the number of independent variables.

y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n + u

y    dependent variable
x_i  regressors (independent variables)
b_i  regression coefficients
u    stochastic term (residual)

Methods of estimation

1. Least squares estimation (Kleinst-Quadrat-Schätzung)
2. Maximum likelihood

It can be shown that in the case of linear regression analysis the two methods lead to the same results. We restrict ourselves here to the first method and detail least squares estimation for the case of one independent variable (simple regression). In this case the dependent and the independent variable can be depicted as points in R². Graphically, the purpose of simple regression analysis can be described as "fitting" the regression line to the scattergram of the variables.

[Figure: scattergram of the two variables with fitted regression line]

Least Squares Estimation

Minimize the sum of the squared deviations from the regression line!

y = a + bx + u
\hat{y} = a + bx        (\hat{y} ... predicted value for y)
u = y - \hat{y}

\sum u_i^2 = \sum (y_i - a - b x_i)^2 \rightarrow \min

\sum u_i^2 = \sum y_i^2 - 2a \sum y_i - 2b \sum x_i y_i + 2ab \sum x_i + N a^2 + b^2 \sum x_i^2

Differentiating with respect to a and b and setting each derivative to zero:

\partial \sum u_i^2 / \partial a = -2 \sum y_i + 2b \sum x_i + 2Na = 0
\partial \sum u_i^2 / \partial b = -2 \sum x_i y_i + 2a \sum x_i + 2b \sum x_i^2 = 0

This yields the so-called normal equations (Normalgleichungen):

aN + b \sum x_i = \sum y_i
a \sum x_i + b \sum x_i^2 = \sum x_i y_i

from which the regression coefficients can be derived:

a = \frac{\sum y_i - b \sum x_i}{N}   and   b = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum x_i^2 - (\sum x_i)^2}

N ... number of cases

Example: Let the y_i be real estate prices (ATS/m²) in Vienna. We would like to know how they are influenced by accessibility (the x_i represent the distance from the city center in minutes).

y_i      x_i   x_i²   x_i y_i    \hat{y}_i   residual
10000    25    625     250000    14027       -4027
40000     5     25     200000    41283       -1283
25000    10    100     250000    34469       -9469
45000     5     25     225000    41283        3717
50000     0      0          0    48097        1903
30000    20    400     600000    20841        9159
sum:
200000   65   1175    1525000                    0

b = \frac{6 \cdot 1525000 - 200000 \cdot 65}{6 \cdot 1175 - 65^2} = -1362.8

a = \frac{200000 + 1362.8 \cdot 65}{6} = 48097

\hat{y} = 48097 - 1362.8 x
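The computation can be checked with a few lines of code. A minimal sketch in Python (NumPy assumed available) that recomputes a and b for this example from the formulas above:

```python
import numpy as np

# Worked example: y = real estate prices (ATS/m^2),
# x = distance from the city center (minutes).
y = np.array([10000, 40000, 25000, 45000, 50000, 30000], dtype=float)
x = np.array([25, 5, 10, 5, 0, 20], dtype=float)
N = len(y)

# a and b from the normal equations derived above
b = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
a = (np.sum(y) - b * np.sum(x)) / N

print(round(a), round(b, 1))        # 48097 -1362.8
y_hat = a + b * x                   # fitted values
print((y - y_hat).sum().round(6))   # residuals sum to 0
```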

Least Squares Estimation for k Variables (Multiple Regression Analysis)

Let Y be the column vector of the n observations y_i; let X be the n × (k+1) matrix of the independent variables (note: the x_{i0} are all equal to 1); let \hat{\beta} be the (k+1) column vector of the estimates of β and e the column vector of the n residuals:

Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_{10} & x_{11} & \dots & x_{1k} \\ x_{20} & x_{21} & \dots & x_{2k} \\ \vdots & & & \vdots \\ x_{n0} & x_{n1} & \dots & x_{nk} \end{pmatrix}, \quad
\hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_k \end{pmatrix}, \quad
e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}

Then we may write the multiple regression model as

Y = X \hat{\beta} + e

and we have to minimize

\sum_{i=1}^{n} e_i^2 = e'e

e'e = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta}

To find the value of \hat{\beta} which minimizes the sum of the squared residuals, we differentiate:

\frac{\partial (e'e)}{\partial \hat{\beta}} = -2X'Y + 2X'X\hat{\beta}

Setting this expression to zero gives:

\hat{\beta} = (X'X)^{-1} X'Y
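As a sketch, this closed form translates directly into NumPy (the function name ols is ours, not SPSS's):

```python
import numpy as np

# OLS in matrix form: beta_hat = (X'X)^(-1) X'Y. A leading column of ones
# is appended so that beta[0] is the intercept.
def ols(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.solve(X.T @ X, X.T @ y)  # numerically safer than inverting X'X
    e = y - X @ beta                          # residual vector
    return beta, e

# check against the simple-regression example above
x = np.array([25, 5, 10, 5, 0, 20], dtype=float)
y = np.array([10000, 40000, 25000, 45000, 50000, 30000], dtype=float)
beta, e = ols(x[:, None], y)
print(beta.round(1))                          # [ 48097.3  -1362.8]
```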

Analysis of Variance (of the Regression Model)

Method: decomposition of variance

\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2

Total variance = Regression variance + Error variance

SS_tot = SS_reg + SS_err

r^2 = SS_reg / SS_tot = SS_reg / (SS_reg + SS_err) \in [0, 1]

It can be shown that the coefficient of determination r² (goodness of fit; Bestimmtheitsmaß) equals the square of the correlation coefficient r between y and x.

Example: Calculate the coefficient of determination for the regression of the real estate prices (\bar{y} = 33334):

y_i     \hat{y}_i   (\hat{y}_i - \bar{y})²   (y_i - \hat{y}_i)²   (y_i - \bar{y})²
10000   14027        372760249                16216729              544475556
40000   41283         63186601                 1646089               44435556
25000   34469          1288225                89661961               69455556
45000   41283         63186601                13816089              136095556
50000   48097        217946169                 3621409              277755556
30000   20841        156075049                83887281               11115556
sum                  874442894               208849558             1083333336

r² = 874442894 / 1083333336 = 0.807
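A short sketch that reproduces the decomposition for this example (Python/NumPy; fitted values from the estimates above):

```python
import numpy as np

# Decomposition of variance for the worked example above.
y = np.array([10000, 40000, 25000, 45000, 50000, 30000], dtype=float)
x = np.array([25, 5, 10, 5, 0, 20], dtype=float)
y_hat = 48097.345 - 1362.832 * x     # fitted values from the estimation above

ss_tot = np.sum((y - y.mean())**2)   # total variance
ss_reg = np.sum((y_hat - y.mean())**2)
ss_err = np.sum((y - y_hat)**2)
print(round(ss_reg / ss_tot, 3))     # r^2 = 0.807
```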

Inferences in Regression Analysis

1. Inferences concerning r²

Null hypothesis: there is no relationship (in the population) between the dependent and the independent variables.

F-test: if the empirical F-value F_emp is greater than the tabulated F-value F_tab, the null hypothesis has to be rejected; the influence is then significant at the respective level of significance (usually between 90% and 99.9%).

F_{emp} = \frac{r^2 / M}{(1 - r^2)/(N - M - 1)}

M ... number of independent variables
N ... number of observations (sample size)

2. Inferences concerning the regression coefficients

Null hypothesis: the regression coefficients equal zero (in the population); the respective variables have no influence.

T-test: if the absolute value of the empirical T-value T_emp is greater than the tabulated T-value T_tab, the null hypothesis has to be rejected; the influence of the independent variable under consideration is then significant at the respective level of significance (usually between 90% and 99.9%).

T_{emp} = \frac{b_j}{S_{b_j}}

b_j      regression coefficient of variable j
S_{b_j}  standard error of b_j
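Both tests can be sketched in Python (SciPy assumed), here with the numbers of the worked example (r² = 0.807, N = 6, M = 1):

```python
import numpy as np
from scipy import stats

# F-test for r^2 and t-test for a coefficient.
r2, N, M = 0.807, 6, 1
F_emp = (r2 / M) / ((1 - r2) / (N - M - 1))   # ~16.7
F_tab = stats.f.ppf(0.95, M, N - M - 1)       # critical value at the 95% level
print(F_emp > F_tab)                          # True -> relationship is significant

# With a single regressor, F_emp = T_emp^2, so the t-test follows directly.
T_emp = np.sqrt(F_emp)
T_tab = stats.t.ppf(0.975, N - M - 1)         # two-tailed critical value, 95%
print(abs(T_emp) > T_tab)                     # True
```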

The Basic Assumptions of the Simple Linear Regression Model

1. For the i-th level of the independent variable x_i the expected value of the error e_i is equal to zero. This is usually expressed as E(e_i) = 0, for i = 1, ..., N.

2. The variance of the error component e_i is constant for all levels of x_i; that is, V(e_i) = σ², for i = 1, ..., N.

3. The values of the error component for any two e_i and e_j are pairwise uncorrelated.

4. The error components e_i are normally distributed.

Gauß-Markov Theorem

Given the four assumptions of the simple linear regression model, the least squares estimators a and b are unbiased and have the minimum variance among all linear unbiased estimators of the regression coefficients.

Violations of the Basic Assumptions of the Linear Regression Model

1. Multicollinearity

Assumption: the exogenous variables have to be independent of each other (must not be correlated).

Measurement: calculation of "tolerance values"

Tolerance of x_j: 1 - r_j²

r_j² ... coefficient of determination of a regression with x_j as dependent and all other variables as independent variables. Tolerance values close to zero indicate multicollinearity.

Recommendations:
(1) Elimination of variables with low tolerance values
(2) Add new observations to your sample
(3) Factor analysis
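A minimal sketch of the tolerance computation (Python/NumPy; X is the matrix of independent variables without the constant column):

```python
import numpy as np

# Tolerance of x_j = 1 - R_j^2, where R_j^2 is the coefficient of
# determination from regressing x_j on all other independent variables.
# Values near zero signal multicollinearity.
def tolerances(X):
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
        tol[j] = 1 - r2_j
    return tol
```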

2. Autocorrelation

Assumption: the residuals are uncorrelated (assumption 3).

Autocorrelation occurs if the deviations from the regression line are not random, but depend, for example, on the deviations of previous observations. This is the case with time-related autocorrelation.

Measurement: Durbin-Watson test

d = \frac{\sum (e_k - e_{k-1})^2}{\sum e_k^2}

e_k ... residual of observation k
d   ... Durbin-Watson coefficient

For values close to 2, no autocorrelation has to be suspected. Values close to 0 indicate positive, values close to 4 negative autocorrelation.

Recommendation: search for those variables which are responsible for the autocorrelation and incorporate them into the model; autocorrelation may be an indicator of a nonlinear relationship.

More important in regional science is spatial autocorrelation. It occurs if one observation value depends on the value of a neighbouring observation (example: the price of a real estate depends on the prices paid in the neighbourhood). Spatial autocorrelation can be detected by visualising the residuals on maps or by performing special contiguity tests.
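The Durbin-Watson coefficient is a one-liner; a sketch:

```python
import numpy as np

# Durbin-Watson coefficient: close to 2 -> no autocorrelation,
# close to 0 -> positive, close to 4 -> negative autocorrelation.
# e must be the residuals in observation (e.g. time) order.
def durbin_watson(e):
    return np.sum(np.diff(e)**2) / np.sum(e**2)
```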

3. Heteroscedasticity

Assumption: the variance of the residuals must be constant (assumption 2).

The residuals must not be influenced by the magnitude or by the sequence of the observations of the dependent variable (example: increasing measurement errors due to decreasing attention of the person collecting the data).

Measurement: calculating the variances

Recommendation: heteroscedasticity often is an indicator of a nonlinear relationship.

Recommendations for Regression Analysis

1. The problem which has to be analysed should be specified carefully.

2. The sample should be sufficiently large. The number of observations should be at least twice the number of variables in the model.

3. At the beginning, hypotheses about the expected relationships (strength and sign) should be formulated.

4. After the estimation of the regression function, first of all the significance of the coefficient of determination has to be examined. If it is not significant, the regression hypothesis has to be rejected. Otherwise the regression coefficients have to be examined logically (sign, magnitude) and statistically (significance).

5. Check whether the estimated regression function violates the assumptions of the linear regression model.

6. If necessary, variables have to be dropped from the equation or new variables have to be added.

The model formulation should be done iteratively: the results from one step are used to formulate new hypotheses, which are examined in a further step.

7. This iterative process is supported by most computer-based statistical software systems (SPSS, SAS). One method is "forward selection". Here the first variable considered for entry into the equation is the one with the largest positive or negative correlation with the dependent variable. The F-test for the hypothesis that the coefficient of the entered variable is zero is then calculated. To determine whether this variable (and each succeeding variable) is entered, the F-value is compared to an established criterion (either a tabulated F-value (FIN) or a p-value (PIN)). The process terminates when no significant independent variables are left. While forward selection starts with no independent variables in the equation and sequentially enters them, backward elimination starts with all variables in the equation and sequentially removes them. Instead of entry criteria, removal criteria are used. "Stepwise selection" is a combination of the backward and forward procedures and is probably the most commonly used method. Variables, once entered, may be removed again if they do not meet the defined criteria.
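A simplified sketch of forward selection (Python/SciPy). Note the assumption: the overall F-test of the growing model is used as the entry criterion here, whereas SPSS evaluates the F-change of the entering variable:

```python
import numpy as np
from scipy import stats

# Minimal forward-selection sketch. X: (n x k) regressor matrix without
# constant; PIN: entry p-value criterion (SPSS's default PIN is 0.05).
def forward_selection(X, y, PIN=0.05):
    n, k = X.shape
    selected = []
    ss_tot = np.sum((y - y.mean())**2)
    while len(selected) < k:
        best = None
        for j in range(k):
            if j in selected:
                continue
            cols = selected + [j]
            Xc = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
            r2 = 1 - np.sum((y - Xc @ beta)**2) / ss_tot
            m = len(cols)
            F = (r2 / m) / ((1 - r2) / (n - m - 1))
            if best is None or F > best[1]:
                best = (j, F)
        # enter the best candidate only if its model is significant
        p = 1 - stats.f.cdf(best[1], len(selected) + 1, n - len(selected) - 2)
        if p >= PIN:
            break
        selected.append(best[0])
    return selected
```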

CORRELATION ANALYSIS

Correlation analysis investigates the statistical correlation between variables without explicitly distinguishing between dependent and independent variables.

Examples:
- Is there any relationship between personal income and holiday expenses?
- Is there any relationship between the marks in mathematics and in computer science?

1. Bivariate Correlation between Metric Variables

r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

r_xy ... correlation coefficient (Pearson's product-moment correlation coefficient) between the variables x and y

-1 \le r_{xy} \le +1

Inferences in Correlation Analysis

Null hypothesis: there is no relationship between the two variables.

T-test: if the absolute value of the empirical T-value T_emp is greater than the tabulated T-value T_tab, the null hypothesis has to be rejected; the relationship is then significant at the respective level of significance (usually between 90% and 99.9%).

T_{emp} = \frac{r_{xy} \sqrt{n - 2}}{\sqrt{1 - r_{xy}^2}}

r_xy ... correlation coefficient
n    ... number of observations
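A sketch of the coefficient and its test, reusing the worked example data (Python/SciPy):

```python
import numpy as np
from scipy import stats

# Pearson correlation coefficient and its t-test, as defined above.
def pearson_r(x, y):
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

x = np.array([25, 5, 10, 5, 0, 20], dtype=float)   # example data from above
y = np.array([10000, 40000, 25000, 45000, 50000, 30000], dtype=float)
r = pearson_r(x, y)
n = len(x)
T_emp = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
T_tab = stats.t.ppf(0.975, n - 2)
print(round(r, 3), abs(T_emp) > T_tab)   # -0.898 True  (r^2 = 0.807)
```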

2. Multiple Correlation between Metric Variables

In order to measure the relationship between one variable on the one side and several variables on the other side, multiple correlation has to be applied. In the case of three variables the multiple correlation coefficient is defined as follows:

R_{x.yz} = \sqrt{\frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}}

R_x.yz ... multiple correlation coefficient between x and (y, z)

Note: if the direction of cause and effect is known, one should perform a multiple regression analysis instead. It can be shown that the square of the multiple correlation coefficient equals the coefficient of determination.

3. Partial Correlation between Metric Variables

In many cases the correlation between two variables is also influenced by other variables. To account for this, a partial correlation coefficient can be defined. It measures the relationship between two variables while controlling for other variables.

For example, if you wish to control for a third variable z, the partial correlation coefficient is defined as:

r_{xy.z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}

r_xy.z ... partial correlation coefficient between x and y controlling for z

4. Correlation between Ordinal Variables (Rank Correlation Coefficients)

To measure the relationship between ordinally scaled variables (example: the ranks of location quality before and after some infrastructure investments), rank correlation coefficients are appropriate. One of the most commonly used measures of ordinal association is Spearman's r_s:

r_s = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)}

r_s ... Spearman's rank correlation coefficient
D_i ... difference of the ranks of observation i
n   ... number of observations
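Both coefficients in a minimal sketch (Python/SciPy; the function names are ours):

```python
import numpy as np
from scipy import stats

# Partial correlation of x and y controlling for z, computed from the
# three pairwise Pearson coefficients.
def partial_r(rxy, rxz, ryz):
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# Spearman's r_s; the simple formula assumes untied ranks
# (stats.rankdata assigns average ranks to ties).
def spearman_rs(x, y):
    D = stats.rankdata(x) - stats.rankdata(y)
    n = len(x)
    return 1 - 6 * np.sum(D**2) / (n * (n**2 - 1))
```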

5. Correlation between Nominal Variables (Nominal Scale Measures)

The analysis of the dependency of two nominally scaled variables is identical with the analysis of contingency tables. Contingency tables are matrices where the rows represent the categories of the first attribute and the columns the categories of the second attribute. The matrix elements f_ij contain the counts of observations with the first attribute i and the second attribute j:

F = \begin{pmatrix} f_{11} & f_{12} & R_1 \\ f_{21} & f_{22} & R_2 \\ C_1 & C_2 & N \end{pmatrix}

f_ij ... matrix elements
R_i  ... sums of rows
C_j  ... sums of columns

Example: relationship between sex and nationality

nation   male   female   sum
A          50       78   128
D          80       32   112
NL         23       90   113
other      43       12    55
sum       196      212   408

The relationship between the two variables can be measured with the so-called χ²-test. The χ²-statistic is defined as:

\chi^2 = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{(f_{ij} - F_{ij})^2}{F_{ij}}

F_ij ... expected frequency of f_ij, with F_ij = (R_i · C_j)/N

For the example:

F = \begin{pmatrix} 61 & 67 \\ 53 & 58 \\ 54 & 59 \\ 26 & 29 \end{pmatrix}

\chi^2 = 84.38

The difficulty with χ² is that its value in any contingency table is directly proportional to the sample size N: two tables with identically proportional cell frequencies will have different χ² values. One way of overcoming this problem is to use the φ (phi) coefficient:

\varphi = \sqrt{\chi^2 / N}

Inference test: if the empirical χ²-value is greater than the tabulated χ²-value, the null hypothesis that there is no relationship has to be rejected; the relationship is then significant at the respective level of significance (usually between 90% and 99.9%).
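A sketch that reproduces the example (Python/NumPy). Note that the exact expected frequencies give χ² = 83.39, while the 84.38 above results from rounding the F_ij to integers:

```python
import numpy as np

# Chi-square test for the nationality-by-sex table above:
# expected frequencies F_ij = R_i * C_j / N, then phi = sqrt(chi2 / N).
f = np.array([[50, 78],
              [80, 32],
              [23, 90],
              [43, 12]], dtype=float)
R = f.sum(axis=1, keepdims=True)   # row sums
C = f.sum(axis=0, keepdims=True)   # column sums
N = f.sum()
F = R @ C / N                      # expected frequencies
chi2 = np.sum((f - F)**2 / F)
phi = np.sqrt(chi2 / N)
print(round(chi2, 2), round(phi, 3))   # 83.39 0.452
```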

Recommendations for Correlation Analysis

Before calculating correlation coefficients it is necessary to examine whether a formal correlation exists between the two variables (if, for example, the variables are percentages summing up to 100%). If this can be excluded, the case of inhomogeneity correlation should be checked. Here the observations come from subpopulations lying in different parts of the coordinate system. If the subpopulations are not distinguished, correlation effects may occur which are not consistent with the relationships within the subpopulations.

In such a case a positive correlation between life expectancy and car ownership might be identified which is inconsistent with the relationship within the subpopulations. Finally it should be checked whether joint correlation exists, as it occurs for example between the size and the weight of individual persons. If all these "pathological" cases can be excluded, it can be assumed that the calculated correlations indicate a real relationship between the variables under consideration.

SPSS-PROCEDURES FOR REGRESSION ANALYSIS

[Screenshots of the SPSS Linear Regression dialog boxes]

With Method you can choose an estimation method. The default is forced entry: all variables are entered in a single step (Enter). The options Forward, Backward and Stepwise correspond to the methods described earlier.


Additional Information Concerning the Interpretation of the SPSS Regression Output

The statistic adjusted R² attempts to correct R² to more closely reflect the goodness of fit of the model in the population. Adjusted R² is given by

R_a^2 = R^2 - \frac{M(1 - R^2)}{N - M - 1}

where M is the number of independent variables in the equation.

Since the population variance of the errors, σ², is not known, it must also be estimated. The usual estimate of σ² is (in the case of a simple regression)

S^2 = \frac{\sum_i (Y_i - B_0 - B_1 X_i)^2}{N - 2}

The positive square root of S² is termed the standard error of estimate, or the standard deviation of the residuals. The estimated standard errors of the parameters of the regression are displayed in the third column of the coefficients table and labeled SE B.

It is also inappropriate to interpret the B's as indicators of the relative importance of variables. The actual magnitude of the coefficients depends on the units in which the variables are measured. Only if all independent variables are measured in the same units (years, for example) are their coefficients directly comparable. When variables differ substantially in units of measurement, the sheer magnitude of their coefficients does not reveal anything real about relative importance. One way to make regression coefficients somewhat more comparable is to calculate beta weights, which are the coefficients of the independent variables when all variables are expressed in standardized (Z-score) form. The beta coefficients can be calculated directly from the unstandardized regression coefficients using

beta_k = B_k \frac{S_k}{S_Y}

where S_k is the standard deviation of the k-th independent variable and S_Y that of the dependent variable. However, the values of the beta coefficients, like the B's, are contingent on the other independent variables in the equation. They are also affected by the correlations of the independent variables and do not in any absolute sense reflect the importance of the various independent variables.
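As a sketch, the beta weights of the SPSS example in the next section can be recomputed from B, S_k and S_Y (values taken from the output below; small deviations are rounding effects):

```python
import numpy as np

# Beta weights from unstandardized coefficients: beta_k = B_k * S_k / S_Y.
B   = np.array([76.380, -2.575, -5.413, 0.001577])      # ABIPRO, LANDESH, LANDP, UEGES1
S_k = np.array([5.169, 73.747313, 15.9676, 180096.55])  # std. deviations of the regressors
S_Y = 972.775650                                        # std. deviation of MEAND
print((B * S_k / S_Y).round(3))                         # [ 0.406 -0.195 -0.089  0.292]
```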

SPSS-OUTPUT FROM REGRESSION ANALYSIS

Descriptive Statistics

Variable   Mean        Std. Deviation   N
MEAND      1046,594     972,775650      878
ABIPRO       10,3747      5,169         878
LANDESH     140,535      73,747313      878
LANDP        16,8689     15,9676        878
UEGES1    79850,45    180096,55         878

MEAND    average price (ATS/m²) for building land
ABIPRO   percentage of higher educated people
LANDESH  distance to the federal capital (min)
LANDP    percentage of agricultural employees
UEGES1   number of nights (tourism) per year

Correlations (Pearson, N = 878)

           MEAND   ABIPRO   LANDESH   LANDP   UEGES1
MEAND      1,000    ,536     -,327    -,402    ,347
ABIPRO      ,536   1,000     -,294    -,501    ,099
LANDESH    -,327   -,294     1,000     ,5      ,027
LANDP      -,402   -,501      ,5      1,000   -,5
UEGES1      ,347    ,099      ,027    -,5     1,000

All correlations are significant at the 0,01 level (one-tailed), except LANDESH-UEGES1 (p = ,211); ABIPRO-UEGES1 has p = ,002.

Model Summary (b)

R      R Square   Adjusted R Square   Std. Error of the Estimate
,647a  ,418       ,415                743,765810

a. Predictors: (Constant), UEGES1, LANDESH, ABIPRO, LANDP
b. Dependent Variable: MEAND

ANOVA (b)

            Sum of Squares   df    Mean Square   F         Sig.
Regression  3,47E+08           4   86741433      156,803   ,000a
Residual    4,83E+08         873   553187,580
Total       8,30E+08         877

Coefficients (a) (Dependent Variable: MEAND)

             B           Std. Error   Beta    t         Sig.
(Constant)   581,495     102,885               5,652    ,000
ABIPRO        76,380       5,755      ,405    13,271    ,000
LANDESH       -2,575        ,359     -,195    -7,172    ,000
LANDP         -5,413       1,867     -,089    -2,899    ,004
UEGES1       1,577E-03      ,000      ,292    10,978    ,000

Casewise Diagnostics (a) (Dependent Variable: MEAND)

Case Number   Std. Residual   MEAND      Predicted Value   Residual
155           -3,139           900,000   3234,6844         -2334,68
1311           3,156          3612,500   1265,0844          2347,416
1984           5,558          5875,000   1741,0513          4133,949
1995           3,528          4562,500   1938,6090          2623,891
008            4,906          5900,000   2250,9701          3649,030
05             3,841          4750,000   1893,2865          2856,714
033            3,564          4388,413   1737,5084          2650,904
037            4,248          4937,500   1778,2253          3159,275

Residuals Statistics (a) (Dependent Variable: MEAND)

                       Minimum     Maximum    Mean       Std. Deviation   N
Predicted Value        -248,678    4840,075   1046,594   628,989646       878
Residual              -2334,68     4133,949   -1,2E-13   742,06771        878
Std. Predicted Value     -2,059       6,031       ,000     1,000          878
Std. Residual            -3,139       5,558       ,000      ,998          878

[Histogram of the regression standardized residuals (Dependent Variable: MEAND); Std. Dev = 1,00, Mean = 0,00, N = 878]

[Scatterplot: MEAND vs. regression standardized residual (Dependent Variable: MEAND)]

Example for Simulation

What is the virtual price for a community with 20% higher educated people, 20 min distance from the federal capital, 10% agricultural employees and 50,000 nights per year?

P = 581,5 + 76,38 · 20 - 2,575 · 20 - 5,413 · 10 + 0,001577 · 50000 = 2082,3 ATS
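A one-line check of this computation (Python; coefficients from the output above):

```python
# Predicted building-land price (ATS/m^2) for ABIPRO = 20, LANDESH = 20,
# LANDP = 10, UEGES1 = 50000, using the estimated coefficients.
P = 581.495 + 76.380*20 - 2.575*20 - 5.413*10 + 0.001577*50000
print(round(P, 1))   # 2082.3
```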

SPSS-PROCEDURES FOR CORRELATION ANALYSIS

[Screenshots of the SPSS correlation dialog boxes]


SPSS-OUTPUT FROM CORRELATION ANALYSIS

1. Pearson Correlation

Correlations (Pearson; ** = significant at the 0.01 level, 2-tailed)

                           Fahrzeit nach Wien im IV   BEZH      LANDESH   MEAND
Fahrzeit nach Wien im IV   1,000                       ,100**    ,061**    ,31**
BEZH                        ,100**                    1,000      ,330**   -,167**
LANDESH                     ,061**                     ,330**   1,000     -,297**
MEAND                       ,31**                     -,167**   -,297**   1,000

N = 331 for Fahrzeit nach Wien im IV (travel time to Vienna by public transport), N = 355 for BEZH and LANDESH, N = 191 for MEAND.

2. Rank (Nonparametric) Correlation

Correlations between WI and SO (** = significant at the 0.01 level, 2-tailed; N = 331)

Kendall's tau_b   ,73**
Spearman's rho    ,746**

WI   3 types of winter tourism (1 = weak; 2 = medium; 3 = high)
SO   3 types of summer tourism (1 = weak; 2 = medium; 3 = high)

SPSS-PROCEDURES FOR PARTIAL CORRELATION ANALYSIS

[Screenshots of the SPSS partial correlations dialog boxes]


SPSS-OUTPUT FROM PARTIAL CORRELATION ANALYSIS

- - -  P A R T I A L   C O R R E L A T I O N   C O E F F I C I E N T S  - - -

Controlling for: ABIPRO

           MEAND                LANDESH
MEAND      1,0000 (0) P= ,      -,188 (188) P= ,000
LANDESH    -,188 (188) P= ,000  1,0000 (0) P= ,

(Coefficient / (D.F.) / 2-tailed significance; ", " is printed if a coefficient cannot be computed)

Note that the correlation coefficient between building land price and distance to the federal capital is smaller in absolute value if you control for the percentage of higher educated people (-0.188 vs. -0.297).

SPSS-PROCEDURES FOR CONTINGENCY ANALYSIS

[Screenshots of the SPSS Crosstabs dialog boxes]


A graphical display is available by activating Display clustered bar charts. If you are interested in the crosstabulation statistical measures but don't want to display the actual tables, you can choose Suppress tables.


Finally you can modify the table format by clicking on Format in the Crosstabs dialog box.

SPSS-OUTPUT FROM CONTINGENCY ANALYSIS

HEIZ * ZUST Crosstabulation

[SPSS crosstabulation of HEIZ (type of heating system, 15 categories: 1-4 and 6-16) by ZUST (state of repair, categories 1-3), N = 5114. For each cell the output shows the count, the expected count, the percentages within HEIZ, within ZUST and of the total, and the residual (count minus expected count). Column totals: ZUST 1 = 3452 (67,5%), ZUST 2 = 1227 (24,0%), ZUST 3 = 435 (8,5%).]

Chi-Square Tests

                               Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square             1264,185a    28   ,000
Likelihood Ratio                100,515     28   ,000
Linear-by-Linear Association    243,283      1   ,000
N of Valid Cases               5114

a. 7 cells (15,6%) have an expected count less than 5. The minimum expected count is ,17.

Symmetric Measures

                                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Nominal by Nominal     Phi                    ,497                                          ,000
                       Cramer's V             ,352                                          ,000
                       Contingency Coefficient ,445                                         ,000
Interval by Interval   Pearson's R            ,218    ,014                   15,981         ,000c
Ordinal by Ordinal     Spearman Correlation   ,208    ,014                   15,206         ,000c
N of Valid Cases       5114

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

[Clustered bar chart: counts per HEIZ category, grouped by ZUST (1, 2, 3)]

In this example (crosstabulation of the type of heating system and the state of repair of Vienna apartments) we find a significant relationship between the two variables, as Pearson's chi-square as well as phi are highly significant. Looking closer at the crosstabulation we see that central heating systems (categories 2 to 4) are predominantly associated with a "very good" state of repair (positive residuals).