Land Features Extraction from Landsat TM Image Using Decision Tree Method

www.ijrsa.org International Journal of Remote Sensing Applications (IJRSA) Volume 6, 2016
doi: 10.14355/ijrsa.2016.06.011

Jason Yang *1 and Feihong Wang 2
1 Department of Geography, Ball State University, Muncie, IN 47306, USA
2 Provincial Center for Remote Sensing of Shanxi, Taiyuan, Shanxi 030001, China
*1 jyang@bsu.edu; 2 fhwang_flying@126.com

Abstract
In this paper we present a method based on the decision tree to extract land feature information for the urban area of Taiyuan City in China. One Landsat TM image acquired on September 23, 2010, covering the entire city of Taiyuan, was processed to extract information. A digital elevation model (DEM) and derived index images for water, vegetation, and crop land were used to construct the decision tree. Six general land categories of the study area, namely water body, developed land, bare land, grass land, forest land, and crop land, were classified using the established decision tree. The results were evaluated using high resolution satellite imagery and reported in a confusion matrix table. An overall accuracy of 89.52% with a kappa statistic of 0.87 was obtained using our method, which is higher than those from other traditional methods.

Keywords
Landsat TM; Decision Tree; Spectral Characteristics; Normalized Index; Taiyuan

Introduction
Many techniques have been developed in remote sensing to extract land feature information from satellite images, including thematic information acquisition, dynamic change prediction, and the development of thematic maps. All of these methods are inseparable from the image classification process. Image classification is done through feature selection on the spectral and spatial information of objects in remotely sensed images. According to the rules of the algorithm, each pixel in the image is assigned to one of several categories.
However, because of the phenomena of "different objects with the same spectrum" and "same object with different spectra", using only spectral characteristics or image brightness values to extract features rarely meets the requirements for both accuracy and precision. In the late 1990s, the emergence of data mining techniques brought recognition to the decision tree method, introducing new ideas and methods to remote sensing image classification. The idea of using a decision tree to identify and classify objects was first reported by Hunt, Marin, and Stone (1966). Decision tree classification uses remote sensing images and other spatial data to construct classification rules (the decision tree) for grouping pixels, drawing on expert knowledge of the subject, mathematics and statistics, and inductive reasoning. The process is intuitive, clear, and computationally efficient when using multi-source spatial data (Sharma, Ghosh, and Joshi, 2013). Decision tree classification techniques have been used for a wide range of classification problems (Pal and Mather, 2003; Shen, Wang, and Luo, 2007; Kandrika and Roy, 2008). Huang, Zhou, and Wu (2009) used spectrally enhanced data and texture statistics of the first principal component as ancillary classification variables and, combined with geometric feature information of urban wetlands, built a wetland decision tree for Shanghai's wetland information extraction and classification. Ma and Yang (2009) built a decision tree with the aid of an expert system to extract information on saline land of different degrees over the Yili region near the reclaimed area in Xinjiang Uygur Autonomous Region. Cai and Wei (2009) established a decision tree based on the analysis of wave bands, spectrum values, and the normalized difference vegetation index (NDVI) to extract information on aquatic vegetation in Taihu Lake.
Punia, Joshi, and Porwal (2011) used C5.0 based decision tree classifiers to classify IRS-P6 AWiFS data and reported very high accuracy. Qian, Yu, and Jia (2013) used NDVI and Digital Elevation Model (DEM) to build a decision tree, effectively extracting the desert grassland types.

In this paper, a simple decision tree was constructed to classify the Landsat TM image of Taiyuan into six general land features. Surface reflectance, normalized index images including NDVI, the normalized difference water index (NDWI), and the cropland index (CI), as well as slope information extracted from a digital elevation model, were used to establish the decision tree nodes. The decision branch points were then defined and optimized through further analysis of how the distribution of each feature relates to terrain characteristics. Finally, the decision tree was used to classify the Landsat TM image of the study area to obtain a more accurate land cover map.

Decision Tree Method
The decision tree method includes two main processes: learning and classification. The learning process is machine-driven, using inductive learning on the training samples to generate the decision tree classification rules (Jensen, 2015). This process draws on information theory to abstract a complex decision-making process into rules or judgements that are easy to understand and express. It finds the attribute field with the largest amount of information in the sample database and forms a series of rules that establish the decision tree nodes. The input of a decision tree learning algorithm consists of the attributes and attribute values of the training samples, and the output is a decision tree. The classification process assigns each pixel in the image to a category using the decision tree.
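As a hedged illustration of the information-theoretic attribute selection described above, the following Python sketch computes Shannon entropy and the information gain of a binary threshold split on a single attribute. The sample values, class labels, candidate thresholds, and the function names `entropy` and `information_gain` are hypothetical. It sketches the general principle only and is not the procedure used to build the tree in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting samples at `value > threshold`."""
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    below = [lab for v, lab in zip(values, labels) if v <= threshold]
    total = len(labels)
    weighted = (len(above) / total) * entropy(above) \
        + (len(below) / total) * entropy(below)
    return entropy(labels) - weighted


# Hypothetical training samples: one index value per pixel and its class.
sample_values = [0.45, 0.38, 0.31, 0.05, -0.10, -0.22]
sample_labels = ["water", "water", "water", "land", "land", "land"]

# Evaluate a few candidate thresholds and keep the most informative one.
candidates = [0.0, 0.2, 0.4]
best = max(
    candidates,
    key=lambda t: information_gain(sample_values, sample_labels, t),
)
print(best, information_gain(sample_values, sample_labels, best))
```

A learning algorithm would repeat this search over every attribute and candidate threshold at each node, and then recurse into the two resulting subsets.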
The decision tree is usually generated in a top-down, recursive manner: 1) select the optimal attribute as the tree node by a chosen criterion, 2) compare attribute values at the node, 3) determine the downward branches from the node according to the attribute values of each training sample, 4) establish the lower nodes and branches within each branch, 5) stop the growth of the tree under certain conditions, 6) obtain the conclusion at each leaf node, and finally 7) form the decision tree. After the decision tree is generated by learning from the training samples, an unknown sample set is classified according to its attribute values (Rokach and Maimon, 2008; Li, Liu, and Hang, 2011). Figure 1 shows the basic process of decision tree learning and classification.

FIG. 1 THE BASIC PROCESS OF DECISION TREE LEARNING AND CLASSIFICATION

Study Area and Data

Study Area
The study area of this research is Taiyuan, the capital city of Shanxi Province, China, which is about 500 km southwest of Beijing (Figure 2). Taiyuan extends from 111°30′ to 113°09′ east longitude and from 37°27′ to 38°25′ north latitude. Located in the central area of Shanxi Province, Taiyuan is surrounded by mountains except on its southern side. The main peak is Han Mountain, standing at 1,591 meters above sea level, which

is located in the eastern Taihang Mountain, while Bei Mountain is located in the western Yunzhong Mountains range. Bei Mountain stands at 2,659 meters above sea level, which makes it the highest peak of the territory. The terrain of the city is shaped like a dustpan running from north to south. The Taiyuan basin is a flat and fertile area lying between the eastern and western mountains at an altitude of about 800 meters. The Fen River runs through the city from north to south.

FIG. 2 LOCATION OF TAIYUAN CITY, SHANXI, CHINA

Data and Preprocessing
One scene of multispectral Landsat Thematic Mapper (TM) imagery covering the entire Taiyuan area was used as the data source in this study. The image has seven spectral bands with 30-m spatial resolution and was acquired on September 23, 2010. Table 1 describes the TM sensor's bands and their spectral and spatial resolutions. Twenty ground control points (GCPs) were sampled on both the original image and a reference image to perform geometric correction, with an overall correction accuracy of 0.5 pixel. Then, the Area of Interest (AOI) was subset using the administrative vector boundary of Taiyuan city, as shown in Figure 3.

FIG. 3 AREA OF INTEREST, TAIYUAN CITY, IN RGB = 432

Other ancillary geospatial data used in this study include the 1:250,000 fundamental geographic vector data for the city boundary, the 30-m Digital Elevation Model (DEM) that can be acquired from the international scientific data

service platform for extracting slope and aspect information, and a high resolution ZY-3 satellite image for accuracy assessment (Table 1).

TABLE 1 SPECTRAL AND SPATIAL RESOLUTION OF LANDSAT TM AND ZY-3 SATELLITE IMAGES
Band     Landsat TM, µm (resolution)   ZY-3, µm (resolution)
1 Blue   0.45-0.52 (30 m)              0.45-0.52 (6 m)
2 Green  0.52-0.60 (30 m)              0.52-0.59 (6 m)
3 Red    0.63-0.69 (30 m)              0.63-0.69 (6 m)
4 NIR    0.76-0.90 (30 m)              0.77-0.89 (6 m)
5 MIR    1.55-1.75 (30 m)
6 TIR    10.4-12.5 (120 m)
7 MIR    2.08-2.35 (30 m)

Decision Tree Classification
Defining decision nodes that separate the various land features is essential to the accuracy of the decision tree classification results. Since the differences among land features can be enlarged by a normalized operation, normalized index images designed for single-feature extraction were used as a basis for establishing the decision tree in this study. In order to reduce the complexity of node expressions, simplify decision-making expressions, and articulate the classification strategy, all image processing in this study was carried out in ENVI, a commonly used remote sensing software package.

Creation of Decision Tree
Considering the surface features of the study area, six general categories were predefined in this study: water body, forest land, grass land, crop land, developed land, and bare land. In the classification process, water and land were first separated using a modified normalized difference water index (MNDWI). The land was then divided into vegetated and non-vegetated areas using the normalized difference vegetation index (NDVI). Lastly, non-vegetated areas were classified into developed land and bare land, and vegetated areas were classified into forest land, grass land, and crop land using slope information, a cropland index (CI), and other spectral and spatial characteristics.
1) Establishment of Index Nodes
Normalized index calculation simplifies the decision-making process and plays a key role in applying the decision tree method. Many normalized indices for water, land, and vegetation have been developed in remote sensing image classification, including the normalized difference water index (NDWI) (Xu, 2006), NDVI (Hansen, Dubayah, and DeFries, 1996), the Green Vegetation Index (GVI) (Li, Liu, and Li, 2006), and the Cropland Index (CI) (Pan, Du, and Luo, 2009). In this study, in order to improve the contrast among classes, three normalized indices, MNDWI, NDVI, and CI, were selected to establish the decision tree nodes.

a. Normalized Difference Water Index
McFeeters (1996) first proposed a normalized difference water index (NDWI) to highlight water in the image and increase the difference between water and land. NDWI can be calculated by the formula below:

NDWI = (Green - NIR) / (Green + NIR) [1]

where Green and NIR are the reflectance in a green band and a near-infrared band, respectively. However, when constructing the NDWI, McFeeters considered only the vegetation factor and ignored the effect of developed land. In practice, developed land such as buildings also has a very high gray level in the NDWI image. Therefore, to increase the difference between water and other land features, we adopted the modified NDWI (MNDWI) proposed by Xu (2006) to extract the water signal in this study. MNDWI can be expressed as

the formula below:

MNDWI = (Green - MIR) / (Green + MIR) [2]

where MIR is the reflectance in a mid-infrared band. As shown in Figure 4, areas with high brightness values are water bodies. The difference between water and buildings is significantly enhanced, and the degree of confusion between them is reduced. To determine the decision node, we compared the index image with high resolution ZY-3 images and set the threshold value to 0.2 in our study. That is, any pixel in the MNDWI image with a value greater than 0.2 is classified as water body; otherwise it is classified as land.

FIG. 4 THE MODIFIED NORMALIZED DIFFERENCE WATER INDEX (MNDWI) IMAGE OF THE STUDY AREA

b. Normalized Difference Vegetation Index
NDVI was first used by Rouse et al. in 1974 to monitor vegetation systems in the Great Plains and has been widely used by many scientists since then (Galvao et al., 2005, 2009; Zhang, Zhao, and Li, 2008; Vina et al., 2011; Li et al., 2014). NDVI reflects the growth state and distribution of vegetation by exploiting the fact that vegetation has higher reflectance in the near-infrared band than in the visible bands. On the NDVI scale, positive values normally indicate the presence of vegetation, while negative values indicate non-vegetated surfaces. NDVI can be calculated with the following formula:

NDVI = (NIR - Red) / (NIR + Red) [3]

where Red is the reflectance in a red band.

FIG. 5 THE NORMALIZED DIFFERENCE VEGETATION INDEX (NDVI) IMAGE OF THE STUDY AREA

The NDVI image shown in Figure 5 indicates that areas of high

brightness value are vegetated surfaces. Similarly, in order to define the decision tree node, we compared and interpreted the index image against high resolution ZY-3 images and determined the threshold value between vegetated and non-vegetated areas, which was set to 0.15 in this study. That is, any pixel with an NDVI value greater than 0.15 is classified as vegetated; pixels with an NDVI value less than or equal to 0.15 are classified as non-vegetated.

c. Cropland Index
Since the brightness of crop land is uneven in the NDVI image, a cropland index (CI) was adopted to separate crop land from the other vegetated classes, forest land and grass land (Zhu, Li, and Ye, 2011). This index is calculated by dividing the difference between the NIR and Red bands by one hundred and then multiplying by the NDVI, as shown in formula 4 below:

CI = NDVI × (NIR - Red) / 100 [4]

Using the high resolution ZY-3 image, we determined that the threshold range for crop land is 0.14 to 0.3. That is, any pixel in the cropland index image with a value between 0.14 and 0.3 is classified as crop land. The cropland index image is shown in Figure 6.

FIG. 6 THE CROPLAND INDEX (CI) IMAGE FOR THE STUDY AREA

2) Establishment of Spectral Characteristics Nodes
Since the surface features are extracted based on their spectral characteristics, the spectral profile of each land feature was also examined. Figure 7 shows spectral curves of typical, representative sample points for each of the six land covers.

FIG. 7 THE SPECTRUM CURVES OF SOME TYPICAL LAND FEATURES
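The three index images used as decision nodes (Equations 2 to 4) reduce to simple per-pixel band arithmetic. The sketch below is a minimal illustration in plain Python, assuming band values are supplied as numbers and that a zero denominator should yield 0.0. The function names and the sample pixel values are hypothetical, and the study itself carried out these calculations in ENVI.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b); 0.0 for a zero denominator (an assumption)."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def mndwi(green, mir):
    """Modified NDWI, Equation 2: (Green - MIR) / (Green + MIR)."""
    return normalized_difference(green, mir)


def ndvi(nir, red):
    """NDVI, Equation 3: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def cropland_index(nir, red):
    """Cropland index, Equation 4: NDVI * (NIR - Red) / 100."""
    return ndvi(nir, red) * (nir - red) / 100.0


# Hypothetical band values for a single vegetated pixel.
green, red, nir, mir = 30.0, 25.0, 75.0, 60.0
print(mndwi(green, mir))         # (30 - 60) / (30 + 60)
print(ndvi(nir, red))            # (75 - 25) / (75 + 25)
print(cropland_index(nir, red))  # NDVI * (75 - 25) / 100
```

Applied to whole images, the same expressions would be evaluated band-wise for every pixel before the thresholds are tested.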

As can be seen from Figure 7, grass land, crop land, and forest land all belong to the vegetated areas, so their spectral curves are very similar, particularly from band 1 to band 4. However, the reflectance of grass land is higher than that of forest land in band 5 and band 7. This small difference was used to distinguish forest land from grass land. After repeated experiments and field observation, the threshold values for grass land were determined as TM5 > 55 and TM7 > 27. That is, for vegetated areas other than crop land, a pixel that meets both TM5 > 55 and TM7 > 27 is grass land; otherwise it is forest land. The spectral values of non-vegetated features are much higher in all bands than those of vegetated features and water, especially in the TM5 band. The threshold separating bare land from developed land was determined as 85 in the TM5 band based on field observation and tests. That is, for non-vegetated areas, a pixel is bare land when its TM5 value is larger than 85; otherwise it is developed land.

3) Establishment of Terrain Nodes
Under different terrain conditions, the distribution and trends of surface features differ. Therefore, terrain characteristics can also be useful in distinguishing surface features. Elevation, slope, and aspect all have some impact on the distribution of surface features. In general, areas with small slope and lower elevation are good for cultivated land. If this and other agricultural production conditions are met, the area is suitable for agriculture, especially crop production. The slope of the study area was extracted from the 30-m DEM data using method 4 discussed by Robert, Weih, and Tabitha (2004).
This method is also called the plane algorithm method; it calculates the slope of the four surrounding right triangle planes that have the centre as a common point. Through field observations, we found that there is almost no crop land in the Taiyuan region with a slope larger than 6, so the crop lands in our study are defined as relatively flat areas with a slope less than 6.

Results and Discussion

Decision Tree
The decision tree module in ENVI 4.6 was used to establish the decision tree according to the nodes obtained above. Figure 8 shows the final decision tree established for image classification in this study.

FIG. 8 THE DECISION TREE ESTABLISHED IN THIS STUDY

Running the decision tree classifier produced the classification map of the study area. After that, category merge processing (primary/secondary analysis) was applied to remove isolated individual pixels from the map. The final classification map with six general classes based on the proposed decision tree is shown in Figure 9. It can be seen from Figure 9 that Taiyuan is surrounded by mountains on all sides except the south. On its western, northern, and eastern sides, the mountains are mainly covered by forest land. The Taiyuan basin, in the middle of the map, is the centre of economic and human activity and is composed mainly of developed land and crop land, plus a small portion of bare land and grass land. In terms of water bodies, the Fen River runs through the city, while Jinyang Lake, the largest lake in the territory, is located in the southwest of the basin.
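Taken together, the thresholds reported in the preceding sections can be written as a nested set of rules. The following Python sketch is one possible reading of those rules for a single pixel, not a reproduction of the tree in Figure 8. The order in which the crop land and grass land tests are applied, the combination of the CI range with the slope limit, the inclusive CI bounds, and the function and parameter names are all assumptions.

```python
def classify_pixel(mndwi, ndvi, ci, tm5, tm7, slope):
    """Assign one of six land classes using the thresholds given in the text.

    The nesting order is an assumed reading of the described rules.
    Slope uses the same (unstated) unit as the threshold of 6 in the text.
    """
    if mndwi > 0.2:
        return "water body"
    if ndvi > 0.15:
        # Vegetated branch: crop land is tested first (assumed order).
        if 0.14 <= ci <= 0.3 and slope < 6:
            return "crop land"
        if tm5 > 55 and tm7 > 27:
            return "grass land"
        return "forest land"
    # Non-vegetated branch: TM5 separates bare land from developed land.
    if tm5 > 85:
        return "bare land"
    return "developed land"


# Hypothetical pixels chosen so that each branch is exercised once.
examples = [
    dict(mndwi=0.35, ndvi=-0.10, ci=0.00, tm5=10, tm7=5, slope=0),
    dict(mndwi=-0.30, ndvi=0.50, ci=0.25, tm5=60, tm7=30, slope=2),
    dict(mndwi=-0.30, ndvi=0.40, ci=0.10, tm5=60, tm7=30, slope=10),
    dict(mndwi=-0.30, ndvi=0.60, ci=0.40, tm5=40, tm7=15, slope=20),
    dict(mndwi=-0.20, ndvi=0.05, ci=0.00, tm5=95, tm7=60, slope=3),
    dict(mndwi=-0.20, ndvi=0.05, ci=0.00, tm5=70, tm7=50, slope=1),
]
for px in examples:
    print(classify_pixel(**px))
```

Writing the rules as ordered early returns mirrors a binary decision tree: each test removes one class from consideration, so every later test only needs to separate the classes that remain.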

FIG. 9 THE CLASSIFICATION RESULT OF THE STUDY AREA USING THE DECISION TREE METHOD

An accuracy assessment was conducted using the high resolution ZY-3 satellite image and Google Earth imagery obtained during the same period. A total of 682 samples were randomly generated on both the classification map and the ZY-3 image using ENVI software, and a confusion matrix table was created to report accuracies. The results show that the overall accuracy of this method is 89.52%, with a kappa coefficient of 0.87, which is higher than most traditional image classification methods (Table 2). From a producer's perspective, water body, developed land, and forest land are classified most accurately (> 90%). Grass land has the lowest producer's accuracy (73%) because of its small proportion and its spectral similarity to forest land. From a user's perspective, developed land and forest land have the highest accuracies (> 95%), while bare land and grass land have the lowest user's accuracies, around 70%. Overall, the decision tree method in this paper is easy to use and yields a satisfactory classification accuracy.

TABLE 2 THE CONFUSION MATRIX AND CLASSIFICATION ACCURACY USING THE DECISION TREE METHOD
Classes              Water Body  Developed Land  Bare Land  Grass Land  Forest Land  Crop Land  Total  User's Accuracy
Water Body           94          0               0          0           0            0          105    88.6%
Developed Land       2           93              9          1           0            0          81     98.8%
Bare Land            0           0               80         1           0            0          102    71.6%
Grass Land           1           5               10         73          9            4          129    70.5%
Forest Land          2           2               0          24          91           10         89     96.6%
Crop Land            1           0               1          1           0            86         105    88.6%
Total                100         100             100        100         100          100
Producer's Accuracy  94.0%       93.0%           80.0%      73.0%       91.0%        86.0%
Overall Accuracy: 89.52%; Kappa statistic: 0.87

Conclusion
This paper discusses a decision tree classification method using a Landsat TM image, along with three normalized index images extracted from the TM image and slope information extracted from DEM data.
The decision tree used in this study can likely be applied to other locations with minor modifications. The method can be implemented easily and quickly by individuals with moderate technical experience. In future studies, texture and shape information of the land features can be extracted and combined with more field-collected reference data to improve the classification accuracy in urban environments.

REFERENCES
[1] Cai D. and Wei W. Study on Remote Sensing Information Extraction of Aquatic Vegetation Based on Decision Tree. Journal of Anhui Agricultural Sciences, 37(16) (2009): 7615-7616.

[2] Galvao L.S., Formaggio A.R., and Tisot D.A. Discrimination of sugarcane varieties in southeastern Brazil with EO-1 Hyperion data. Remote Sensing of Environment, 94(4) (2005): 523-534.
[3] Galvao L.S., Roberts D.A., Formaggio A., Numata I., and Breuning F. View angle effects on the discrimination of soybean varieties and on the relationship between vegetation indices and yield using off-nadir Hyperion data. Remote Sensing of Environment, 113 (2009): 846-856.
[4] Huang Y., Zhou Y.X., and Wu W. Shanghai Urban Wetland Extraction and Classification with Remote Sensed Imageries Based on a Decision Tree Model. Journal of Jilin University (Earth Science Edition), 39(6) (2009): 1156-1162.
[5] Hansen M., Dubayah R., and DeFries R. Classification Trees, an Alternative to Traditional Land Cover Classifiers. International Journal of Remote Sensing, 17 (1996): 1075-1081.
[6] Hunt E.B., Marin J., and Stone P.T. Experiments in Induction. Academic Press, New York, 1966.
[7] Jensen J.R. Introductory digital image processing: A remote sensing perspective, 2nd Edition. Upper Saddle River, New Jersey: Prentice Hall Inc., 2015.
[8] Kandrika S. and Roy P.S. Land use land cover classification of Orissa using multi-temporal IRS-P6 AWiFS data: A decision tree approach. Int. J. Appl. Earth Obs. Geoinf., 10(2) (2008): 186-193.
[9] Li J.L., Liu X.M., and Li H.P. Extraction of Texture Feature and Identification Method of Land use Information from SPOT 5 Image. Journal of Remote Sensing, 10(6) (2006): 927-932.
[10] Li Q., Cao X., Jia K., Zhang M., and Dong Q. Crop type identification by integration of high-spatial resolution multispectral data with features extracted from coarse-resolution time-series vegetation index data. International Journal of Remote Sensing, accessed February 28, 2015, DOI: 10.1080/01431161.2014.943325.
[11] Li Y.F., Liu G.H., and Hang C.
Exploring Landscapes Based on Decision Tree Classification in the Diqin Region, Yunnan Province. Resources Science, 33(2) (2011): 328-334.
[12] Ma H.Q. and Yang X.H. Extraction of Soil Salinization from Remote Sensing Information Based on Knowledge Discovery over the Arid Area: A Case Study on Yili newly reclaimed Area. Resources Science, 31(12) (2009): 2065-2071.
[13] McFeeters S.K. The use of normalized difference water index (NDWI) in the delineation of open water features. International Journal of Remote Sensing, 17(7) (1996): 1425-1432.
[14] Pal M. and Mather P.M. An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sensing of Environment, 86 (2003): 554-565.
[15] Pan C., Du P.J., and Luo Y. Decision tree classification of remote sensing images based on vegetation indices. Journal of Computer Applications, 29(3) (2009): 777-780.
[16] Punia M., Joshi P.K., and Porwal M.C. Decision tree classification of land use land cover for Delhi, India using IRS-P6 AWiFS data. Expert Syst. Appl., 38(5) (2011): 5577-5583.
[17] Qian Y.R., Yu J., and Jia Z.H. The classification strategy of desert grassland based on decision tree using remote sensing image. Journal of Northwest A & F University (Natural Science Edition), 41(2) (2013): 1-8.
[18] Robert C., Weih, Jr., and Tabitha L. M. Modelling slope in a geographic information system. Journal of the Arkansas Academy of Science, 58 (2004): 100-108.
[19] Rokach L. and Maimon O. Data mining with decision trees: theory and applications. World Scientific Pub Co Inc., ISBN 978-9812771711, 2008.
[20] Rouse J.W., Haas R.H., Schell J.A., and Deering D.W. Monitoring vegetation systems in the Great Plains with ERTS. Proceedings, 3rd Earth Resource Technology Satellite (ERTS) Symposium, 1 (1974): 48-62.
[21] Sharma R., Ghosh A., and Joshi P.K. Decision tree approach for classification of remotely sensed satellite data using open source support. J. Earth Syst. Sci., 5 (2013): 1237-1247.
[22] Shen W.M., Wang W.J., and Luo H.J. Classification Methods of Remote Sensing Image Based on Decision Tree Technologies. Remote Sensing Technology and Application, 22(3) (2007): 333-338.

[23] Vina A., Gitelson A.A., Nguy-Robertson A.L., and Peng Y. Comparison of different vegetation indices for the remote assessment of green leaf area index of crops. Remote Sensing of Environment, 115 (2011): 3468-3478.
[24] Xu H.Q. Modification of normalized difference water index (NDWI) to enhance open water features in remotely sensed imagery. International Journal of Remote Sensing, 27(14) (2006): 3025-3033.
[25] Zhang Y.C., Zhao Z.Q., and Li S.C. Indicating variation of surface vegetation cover using SPOT NDVI in the northern part of North China. Geographical Research, 27(4) (2008): 745-754.
[26] Zhu J.H., Li J.F., and Ye J. Land Use Information Extraction from Remote Sensing Data Based on Decision Tree Tool. Geomatics and Information Science of Wuhan University, 36(3) (2011): 301-305.