pp. 45-54
http://dx.doi.org/10.14257/ijseia.2013.7.5.05

A Case Study on the Application of Computational Intelligence to Identifying Relationships between Land use Characteristics and Damages caused by Natural Hazards: A SVR Approach

Jae Heon Shim¹ and Sangyong Kim²*
¹ Institute of Environmental Studies, Pusan National University, Busan, Republic of Korea
² School of Construction Management and Engineering, University of Reading, Reading, UK
¹ shim@pusan.ac.kr, ² rd026992@reading.ac.uk

Abstract

This paper examines the application of a support vector regression (SVR) approach to identifying relationships between land use characteristics and damages caused by natural hazards. Our empirical results show that a SVR model outperforms a multiple ordinary least squares (OLS) regression model in predictive performance. The SVR model also reveals nonlinear relationships between land use characteristics and damages.

Keywords: SVR; Land use; Natural hazards; OLS; Nonlinear relationships

1. Introduction

Cities in South Korea have industrialized and urbanized over a relatively short period compared with those of other countries. They are not well prepared for natural hazards such as typhoons and floods and are quite vulnerable to unexpected events, even though damages caused by natural disasters have steadily increased due to climate change. Recently, a wide range of studies related to natural hazards has been conducted in South Korea. In the field of land use planning, this line of research has generally identified factors that have a significant effect on damages and has tried to provide effective measures for reducing them. For example, Shim et al. [1] examined the effect of land use characteristics on property damages caused by natural disasters.
Their empirical results indicate that the size of the urbanized area, the population density, and the sizes of industrial districts, agricultural districts, bare land, and levees, among other factors, have a significant influence on damages. Based on these findings, they proposed the following mitigation measures: structural facilities such as dams and levees need to be fully provided in industrial districts; a more systematic approach is required for unstructured and unkept bare land; and the local governments in the study area need to take into account the wider role of green spaces such as parks, grasslands, and wetlands in mitigating flood losses. Shim et al. [2] classified cities in a metropolitan area based on natural hazard vulnerability. Specifically, they reduced the variables related to vulnerability to a few significant factors through a principal component analysis and then classified the cities through a cluster analysis. They proposed differentiated mitigation measures for the classified cities based on vulnerability. A report by the Korea Environment Institute [3] identified the effect of green infrastructure, such as parks, urban forests, and green roofs, on property damages. According to the report, a 1% increase in the size of green infrastructure is estimated to decrease damages by 6.4%. Choi [4] focused on the relationship between the size of an urbanized area and damages. The results show that an increase in the size of an urbanized area leads to an increase in damages.

Methodologically, the aforementioned research and a considerable number of other recent studies tend to depend on traditional statistical methods such as ordinary least squares (OLS) regression. This study investigates the application of a support vector regression (SVR) approach, which is widely known as a nonlinear data mining tool, to identifying relationships between land use characteristics and damages caused by natural hazards.

* Corresponding author.
ISSN: 1738-9984 IJSEIA
Copyright © 2013 SERSC

2. Support Vector Machine for Regression

The support vector machine (SVM) has proven to be a powerful tool in pattern recognition, including classification, regression, and function approximation [5]. Traditional training techniques usually focus on minimizing the empirical risk, i.e., the classification error on the training data. SVM, however, aims to minimize the structural risk, that is, a probabilistic upper bound on the generalization error rather than the error on the training data alone. This has been shown to deliver higher performance than the traditional empirical risk minimization methods used by many learning machines [6, 7]. Data classification and regression, two critical components of computer science, are used in increasingly broad and general applications. Support vector classification (SVC) is founded on the principle of structural risk minimization. SVC trains on existing data and then selects several support vectors from the training data to represent the whole data set. The concept of SVR is similar to that of SVC.
It maps regression problems from a low-dimensional to a high-dimensional vector space, where the support vectors are identified and an appropriate linear regression equation can be obtained. The scheme of SVM is shown in Figure 1.

Figure 1. Scheme of SVM

The SVM was originally designed for binary classification: it constructs an optimal hyper-plane such that the margin of separation between two classes, negative and positive, is maximized [8]. In the general case where the data are not linearly separable, SVM uses nonlinear machines to find a hyper-plane that minimizes the number of errors on the training set, as shown in Figure 2 [9, 10]. The aim of SVM training is to find the linear separating
hyper-plane that gives the maximum margin between two hyper-planes, so as to maximize generalization performance [11].

Figure 2. Optimal Separating Hyper-plane

The SV algorithm can construct a variety of learning machines using different kernel functions, as shown in Figure 3, some of which coincide with classical architectures.

Figure 3. Architecture of SVM

Three kernel functions have proven acceptable in practice [9, 12, 13]. They are described as follows:

The polynomial kernel

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^n \tag{1}$$

where the degree of the polynomial n is a user-defined value.

The radial basis function

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \tag{2}$$
where γ is a user-defined value.

The two-layer neural network (sigmoid) kernel

$$K(x_i, x_j) = \tanh(k\, x_i \cdot x_j - \delta)^n \tag{3}$$

where k and δ are user-defined values.

3. Data Construction

The data¹ comprise details of 64 administrative districts within the Seoul Metropolitan Area (SMA) in 2010. The SMA contains three major regions: Seoul Metropolitan City, Incheon Metropolitan City, and Gyeonggi province. The output variable is the amount of damages (DAMAGE) caused by natural hazards such as wind and floods. The input variables are as follows: the population density (DENSITY), the size of a levee (LEVEE), the size of an industrial district (INDUSTRIAL), the size of an agricultural district (AGRICULTURAL), the size of a river and a stream (RIVER), the size of a park (PARK), the size of a natural grassland (GRASSLAND), the size of an inland wetland (WETLAND), and the size of an unstructured bare land (BARE). The data on the population density and the size of a park were collected from the database of the Korean Statistical Information Service. The land use data were calculated through a GIS (Geographic Information Systems) spatial analysis of the land cover map provided by the Ministry of Environment, as shown in Figure 4. The descriptive statistics for the data set are given in Table 1.

Table 1. Descriptive Statistics of Data

Variable                      Mean      Std. Dev.  Min.   Max.
Output variable
DAMAGE (1,000 USD)            1,713.44  2,069.40   15.60  11,660.56
Input variable
DENSITY (1,000 people/km²)    16.93     10.78      1.52   41.28
INDUSTRIAL (km²)              4.45      7.01       0.00   36.86
AGRICULTURAL (km²)            46.88     77.95      0.00   317.02
RIVER (km²)                   7.13      11.05      0.00   46.97
PARK (km²)                    2.48      2.20       0.00   9.51
GRASSLAND (km²)               1.03      1.89       0.00   8.90
WETLAND (km²)                 1.38      2.34       0.00   10.55
BARE (km²)                    5.01      8.05       0.17   53.89
LEVEE (km²)                   0.42      0.58       0.00   2.53
* 1 USD ($) = 1,000 Korean Won (₩)

¹ These data were used in the final model of a study conducted by Shim et al. [1].
According to their empirical results, regional climate-related variables such as the annual rainfall, the rainfall intensity, and the maximum instantaneous wind velocity did not reach statistical significance in their cross-sectional analysis.
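The variables in Table 1 span very different ranges, so bringing them onto a common [0, 1] scale is a natural pre-processing step. The following Python sketch applies min-max rescaling to a single variable, using the minimum, mean, and maximum of AGRICULTURAL from Table 1 as sample inputs. The function name `min_max_scale` and the handling of a constant column are illustrative assumptions and are not taken from the original study.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Sample inputs: minimum, mean, and maximum of AGRICULTURAL (km^2) from Table 1.
agricultural = [0.00, 46.88, 317.02]
scaled = min_max_scale(agricultural)
print([round(v, 4) for v in scaled])
```

After rescaling, the smallest observed value maps to 0 and the largest to 1, so a wide-ranging variable such as AGRICULTURAL no longer dominates a narrow one such as LEVEE purely because of its units.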
Figure 4. Land use Patterns in the SMA

4. Case Study

4.1. Data Pre-processing and Model Requirements Setup

All values of the input variables were rescaled to the range 0 to 1, using the maximum and minimum values of the respective variables as defined in Equation 4, in order to put the variables on a comparable scale and reduce processing time [11, 13, 14]. For example, the variable AGRICULTURAL (0-317.02) has a wide range of values compared with LEVEE (0-2.53) or GRASSLAND (0-8.90).

$$x_n = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \quad (i = 1, 2, 3, \ldots, k) \tag{4}$$

where x_n is the normalized value of the original value x_i, x_max is the maximum value of the input data, and x_min is the minimum.

The normalized data were divided by random selection into a training set (85%) and a test set (15%) in order to carry out a SVR approach. The two sets therefore consisted of 54 and 10 cases, respectively. An optimal kernel function and an appropriate set of related parameters need to be determined in order to structure SVR models. Unlike other traditional computational intelligence techniques, a SVR approach can search for them within pre-specified conditions. Linear, polynomial, radial basis, and sigmoid kernel functions were tried in the preliminary optimization process. The results indicated that the polynomial kernel function was suitable for the training set, based on the mean squared error (MSE). The potential range of functional forms and related parameters was then narrowed down by an iterative optimization process. Table 2 shows the best kernel functions and related parameters with the lowest MSE. Therefore, the SVR model trained
under the selected kernel function and related parameters was used to verify its predictive performance on the test set, which was not used in the training phase.

Table 2. Results of Optimization Process

MSE        C        epsilon  Kernel      d  s         c
165543.13  358.379  0.011    polynomial  5  2.274850  0.793634
376431.84  316.080  0.021    polynomial  4  4.549547  0.295267
467180.38  330.653  0.057    polynomial  4  2.952208  0.102695
438081.75  306.589  0.065    polynomial  4  3.054903  0.104678
473183.16  418.638  0.082    polynomial  5  2.028565  0.231178
67111.18   455.107  0.090    polynomial  5  2.468642  0.160680
79389.55   373.424  0.110    polynomial  5  2.405927  0.319987
465785.38  90.014   0.121    polynomial  5  3.611713  0.778985
303159.25  216.742  0.127    polynomial  5  2.274850  0.446181
448332.34  437.117  0.136    polynomial  4  4.175848  0.364696
235935.50  187.002  0.157    polynomial  5  2.532121  1.010468
208370.80  284.280  0.167    polynomial  5  2.866604  0.482650
239782.28  11.490   0.168    polynomial  5  4.365215  1.708884
312395.38  158.803  0.209    polynomial  5  2.560198  1.070589
87270.85   233.436  0.211    polynomial  5  2.254860  0.613727
464443.06  55.422   0.278    polynomial  5  4.006928  0.805383
269098.94  134.129  0.289    polynomial  5  2.730949  1.157414
47050.95   121.784  0.301    polynomial  5  3.208411  0.183416
98848.49   398.114  0.324    polynomial  4  4.384899  0.056764
249646.27  344.829  0.332    polynomial  5  2.198553  0.891751
401783.00  436.811  0.341    polynomial  4  3.959624  0.551927
343449.47  459.166  0.368    polynomial  4  2.909635  0.059206
428256.25  209.143  0.441    polynomial  4  3.357036  0.137486
23352.66   49.013   0.454    polynomial  5  4.026002  0.048524
40004.70   285.287  0.485    polynomial  4  4.250618  0.175939
50018.35   31.999   0.486    polynomial  5  3.706015  0.911283
471668.75  131.687  0.491    polynomial  4  4.346294  0.710929
436356.28  55.757   0.519    polynomial  4  4.274728  0.921354
196192.41  78.021   0.545    polynomial  4  4.637288  0.304575
101780.32  273.705  0.558    polynomial  4  3.339183  0.412610

Note. C: regularization parameter; d: polynomial degree; s: slope; c: constant term

4.2. Empirical Results

The damages predicted by the SVR and multiple OLS² regression models are presented in Table 3, and their distribution is illustrated graphically in Figure 5. As shown in the figure, the distribution of the SVR-predicted damages is closer to that of the actual damages than the distribution of the OLS-predicted damages.

² The linear functional form was used in the OLS regression analysis. We tried the linear, semi-log, inverse semi-log, and double-log functional forms and concluded, based on performance, that the linear form was suitable for the training set.
Table 3. Damages Predicted by SVR and OLS Regression Models

Case  Actual   SVR-predicted  OLS-predicted
1     1394.03  1061.11        622.31
2     1084.20  808.61         1312.20
3     3047.51  3187.20        2174.31
4     1109.87  739.08         1347.21
5     1437.65  1013.03        2415.64
6     2177.28  1548.55        1456.70
7     2199.76  865.89         1254.24
8     1803.84  1650.53        2044.37
9     3014.17  2240.55        2963.68
10    1383.30  1533.57        2143.55

Figure 5. Distribution of Damages Predicted by SVR and OLS Regression Models

The forecasting accuracy was measured by the following established error criteria: root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Table 4 shows that the SVR gives more precise predictions than the OLS on all accuracy measures.

Table 4. Comparison of the Predictive Performances between SVR and OLS Regression Models

Accuracy measure  SVR     OLS
RMSE              578.20  668.91
MAE               458.34  580.56
MAPE (%)          25.14   34.05
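The accuracy measures in Table 4 can be recomputed from the ten test cases in Table 3. The sketch below is a plain-Python illustration using the standard definitions of RMSE, MAE, and MAPE. The function names are chosen here for illustration, and expressing MAPE in percent is an assumption that matches the magnitudes reported in Table 4.

```python
import math

# Test-set values transcribed from Table 3.
actual = [1394.03, 1084.20, 3047.51, 1109.87, 1437.65,
          2177.28, 2199.76, 1803.84, 3014.17, 1383.30]
svr = [1061.11, 808.61, 3187.20, 739.08, 1013.03,
       1548.55, 865.89, 1650.53, 2240.55, 1533.57]
ols = [622.31, 1312.20, 2174.31, 1347.21, 2415.64,
       1456.70, 1254.24, 2044.37, 2963.68, 2143.55]


def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / len(y)


for name, pred in (("SVR", svr), ("OLS", ols)):
    print(f"{name}: RMSE={rmse(actual, pred):.2f} "
          f"MAE={mae(actual, pred):.2f} MAPE={mape(actual, pred):.2f}")
```

Applied to the Table 3 values, these definitions give results that agree with Table 4 up to rounding, which also confirms that the two tables are mutually consistent.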
4.3. Relationships between Land use Characteristics and Damages

Relationships between land use characteristics and damages were analyzed by varying the value of one land use variable while keeping the other variables constant in the SVR model. Figure 6 shows the variations in damages relative to the size of an industrial district (INDUSTRIAL) for three randomly chosen cases in the test set. As the size of an industrial district increases, the damages increase, as expected. The results reveal a positive and nonlinear relationship between the size of an industrial district and damages. Figure 7 shows the variations in damages relative to the size of an agricultural district (AGRICULTURAL) for three other cases in the test set. The curves also indicate a nonlinear relationship between the size of an agricultural district and damages. In the same way, a SVR approach makes it possible to identify relationships between the remaining land use variables and damages.

Figure 6. Variations in Damages Relative to the Size of an Industrial District

Figure 7. Variations in Damages Relative to the Size of an Agricultural District
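The one-variable-at-a-time analysis described in this section can be sketched in a few lines. The Python example below sweeps one input of a fitted model while holding the others fixed. The names `sweep_feature` and `toy_model` and all numeric values are hypothetical; they only illustrate the procedure and do not reproduce the SVR model or the curves in Figures 6 and 7.

```python
def sweep_feature(predict, base_case, index, grid):
    """Return predictions as one feature varies and the others stay fixed."""
    results = []
    for value in grid:
        case = list(base_case)  # copy so the base case is not modified
        case[index] = value
        results.append(predict(case))
    return results


def toy_model(x):
    # Hypothetical stand-in for a trained regressor:
    # nonlinear in x[0], linear in x[1].
    return 100.0 + 50.0 * x[0] ** 2 + 10.0 * x[1]


base_case = [0.2, 0.5]            # normalized inputs of one hypothetical district
grid = [i / 4 for i in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0
curve = sweep_feature(toy_model, base_case, index=0, grid=grid)

for value, damage in zip(grid, curve):
    print(f"{value:.2f} -> {damage:.2f}")
```

Passing the prediction function of a trained model as `predict` and repeating the sweep for several cases would give one curve per case, which is the kind of plot the figures above describe.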
5. Conclusions

This paper explores the application of a SVR approach to identifying relationships between land use characteristics and damages caused by natural hazards. Our empirical results show that a SVR model is superior to a multiple OLS regression model in predictive performance on all accuracy measures used. The SVR model also identified nonlinear relationships between damages and land use characteristics such as the size of an industrial district and the size of an agricultural district. We think that the better performance of the SVR model comes from its capacity to capture nonlinear relationships between diverse land use characteristics and damages.

However, although a SVR model is proficient at discovering complicated nonlinear relationships and provides comparatively accurate predictions, it is not desirable to compare SVR and OLS regression models on predictive performance alone, because an OLS regression model has the advantage that its process and results are easily interpretable. An OLS regression model could therefore be complementary to a SVR approach. Lastly, a separate validation dataset was not used in our empirical analysis due to the lack of data, although setting one aside is advisable in order to avoid overfitting to a specific training dataset. Nevertheless, this paper can broaden the scope of applications of computational intelligence in natural hazards research.

Acknowledgements

This paper is an expanded version of a paper entitled "A case study on the effect of land-use characteristics on damages caused by natural hazards in South Korea", presented at the international conferences ASEA and DRBC 2012 by Jae Heon Shim, Kwang-Woo Nam, and Sung-Ho Lee. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013 R1A1A2013676).

References

[1] J. H. Shim, K. W. Nam and S. H. Lee, A case study on the effect of land-use characteristics on damages caused by natural hazards in South Korea, Communications in Computer and Information Science, Springer Berlin Heidelberg, vol. 340, (2012), pp. 287-292.
[2] J. H. Shim, J. E. Kim and S. H. Lee, Classification of cities in the metropolitan area based on natural hazard vulnerability, Journal of the Korea Academia-Industrial Cooperation Society, vol. 13, no. 11, (2012), pp. 5534-5541.
[3] Korea Environment Institute, Urban renewal strategy for adapting to climate change: Use of green infrastructure on flood mitigation, (2011).
[4] C. I. Choi, A study on natural hazards vulnerability in urban area by urban land use change: In case of Kyonggi province, Journal of Korean Planners Association, vol. 38, no. 2, (2003), pp. 35-48.
[5] V. N. Vapnik, The nature of statistical learning theory, 2nd ed., Springer, New York, (1995).
[6] C. Li and J. Li, Support vector machines approach to conditional simulation of non-gaussian stochastic process, Journal of Computing in Civil Engineering, vol. 26, no. 1, (2012), pp. 131-140.
[7] G. H. Kim, J. M. Shin, S. Kim and Y. Shin, Comparison of school building construction costs estimation methods using regression analysis, neural network, and support vector machine, Journal of Building Construction and Planning Research, vol. 1, no. 1, (2013), pp. 1-7.
[8] H. Guo, H. P. Liu and L. Wang, Method for selecting parameters of least squares support vector machines and application, Journal of System Simulation, vol. 18, no. 7, (2006), pp. 2033-2036.
[9] D. Mattera and S. Haykin, Support vector machines for dynamic reconstruction of a chaotic system, Advances in Kernel Methods, 1st ed., MIT Press, Massachusetts, (1999), pp. 211-241.
[10] K. C. Lam, M. C. K. Lam and D. Wang, Efficacy of using support vector machine in a contractor prequalification decision model, Journal of Computing in Civil Engineering, vol. 24, no. 3, (2010), pp. 273-280.
[11] J. Kim, S. Kim and L. Tang, A case study on the determination of building materials using a support vector machine, Journal of Computing in Civil Engineering, Accepted article, doi:10.1061/(asce)cp.1943-5487.0000259, (2012).
[12] Z. Huang, H. Chen, C. J. Hsu, W. H. Chen and S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study, Decision Support Systems, vol. 37, no. 4, (2004), pp. 543-558.
[13] U. Y. Park and G. H. Kim, A study on predicting construction cost of apartment projects based on support vector regression at the early project stage, Journal of Architectural Institute of Korea, vol. 23, no. 4, (2007), pp. 165-172.
[14] J. H. Shim, A differential pricing model for industrial land based on locational characteristics, Journal of Korean Society of Civil Engineers, vol. 31, no. 2D, (2011), pp. 303-314.

Authors

Jae Heon Shim obtained his Ph.D. in urban planning at Pusan National University in 2009. He is now a research professor in the Institute of Environmental Studies at Pusan National University. His research interests include land use planning and urban development.

Sangyong Kim is a Ph.D. candidate at the School of Construction Management and Engineering, University of Reading. His research interests include cost estimation, information technology in construction, decision-making and analysis by artificial intelligence, and its applications in construction.