Using a K-Means Clustering Algorithm to Examine Patterns of Pedestrian Involved Crashes in Honolulu, Hawaii


Journal of Advanced Transportation, Vol. 41, No. 1, pp. 69-89
www.advanced-transport.com

Karl Kim and Eric Y. Yamashita, University of Hawaii, Hawaii, USA
Received: April 2005. Accepted: August 2005.

The purpose of this paper is twofold: 1) to describe a statistical technique known as K-means clustering in terms of its advantages and disadvantages in safety research; and 2) to use this method to analyze spatial patterns of pedestrian-involved crashes in Honolulu. K-means, a partitioning clustering technique, provides a powerful tool for analyzing and visualizing spatial patterns. While there are other techniques, one advantage of the K-means approach is that it is well established and has been used in many applications other than traffic safety. In this paper, we compare it to hierarchical clustering techniques and suggest that both are useful in the arsenal of spatial analytic tools for safety research.

Keywords: K-Means Clustering, Pedestrian-Involved Crashes, Visualizing Spatial Patterns

Introduction

According to the National Highway Traffic Safety Administration, on average, a pedestrian is killed in a traffic crash in the United States every 109 minutes and injured every 7 minutes (NHTSA, 2002). Not only does this pose a heavy burden on society in terms of the number of casualties, but it also suggests a need to better understand and address the problem of pedestrian safety. A critical aspect of improving safety involves knowing the spatial pattern of accidents. Techniques of spatial statistical analysis have evolved over the past several years. With the advent of GIS (geographic information systems) and the increased availability of both digital map files and geocoded data (events assigned to locations), there is both increased need and opportunity for the integration of spatial and statistical analysis. In this paper, we focus on a well-established statistical clustering technique (K-means clustering) and describe how it can be used to analyze the locations and patterns of traffic accidents.

Cluster analyses are a set of statistical techniques that group items together on the basis of similarities and/or dissimilarities. According to Cameron (1997), clustering techniques are an important tool for analyzing traffic accidents, as these methods are able to identify groups of road users, vehicles and road segments that would be suitable targets for countermeasures. There has been much research on the causal factors associated with accidents and on the reduction of crashes and the resulting injuries and fatalities. There have also been recent efforts to use techniques of geographical analysis to inform understanding of the spatial patterns of motor vehicle accidents (Levine, Kim and Nitz, 1995). Ng, Hung and Wong (2002) used a combination of cluster analysis, regression analysis and Geographic Information System (GIS) techniques to group homogeneous accident data together, estimate the number of traffic accidents and assess the risk of traffic accidents in a study area. Schneider et al. (2004) analyzed pedestrian crash risk on the University of North Carolina campus by comparing point distributions, using kernel density estimation techniques to create a probability surface of crashes. In addition to the kernel density function, they also examined other techniques, such as the nearest neighbor routine, to identify clusters (Schneider et al., 2004). The research in this paper focuses specifically on the use of the K-Means technique for analyzing pedestrian crash patterns.

Traffic accidents, such as collisions with pedestrians, can be visualized as an event (or events) occurring at a point location.
Figure 1 shows the locations of all pedestrian crashes occurring in Honolulu. One of the real challenges, of course, is to make sense of the scatter of points in order to identify places of concern, or hot spots, where potential intervention is warranted. In some instances, there are simply too many locations where crashes or other incidents can occur, and there is a need to reduce the large number of locations to a more manageable subset. Data reduction can involve focusing only on the most problematic sites. Another approach involves aggregating the individual crashes and their locations and assigning them to a zone, such as a census tract or traffic assignment zone (see Figure 2), or to a roadway segment (see Figure 3).
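As a rough illustration of zone-based aggregation, the sketch below assigns each crash point to a zone and counts crashes per zone. It is only a minimal sketch under stated assumptions: the zones here are hypothetical square grid cells rather than census tracts or traffic assignment zones, and the function names, coordinates, and cell size are made up for the example.

```python
from collections import Counter

def zone_of(x, y, cell_size):
    """Hypothetical zone id: (column, row) of the square grid cell containing (x, y)."""
    return (int(x // cell_size), int(y // cell_size))

def crashes_per_zone(points, cell_size):
    """Tabulate how many crash points fall in each zone."""
    return Counter(zone_of(x, y, cell_size) for x, y in points)

# Made-up crash coordinates in feet, for illustration only.
crashes = [(120.0, 80.0), (130.0, 95.0), (480.0, 510.0), (90.0, 40.0), (1020.0, 30.0)]
counts = crashes_per_zone(crashes, cell_size=500.0)

# List zones from most to fewest crashes.
for zone, n in counts.most_common():
    print(zone, n)
```

With real zone boundaries, the grid lookup would be replaced by a point-in-polygon test in a GIS package; the counting step stays the same.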

Figure 1. Collisions Involving Pedestrians on Oahu: Point Pattern Distribution

Figure 2. Pedestrian Involved Crashes by Census Block Groups: Zone Based Analyses

Often, the type of accident and the frequency of occurrence can be used to reduce the number of locations to a more manageable critical few. In Figure 3, the pedestrian crashes have been reduced to only those occurring at non-intersection locations (mid-block crossings) and posted to roadway segments. This type of approach could also be used in conjunction with other data to establish threshold levels for reporting, such as including only crashes that involve fatalities or a high degree of injury. Figure 4, for example, contains only those pedestrian crashes requiring EMS transport.

Figure 3. Pedestrian Involved Crashes by Road Networks

Another approach involves examining the mode, or frequency of occurrence, at particular locations and identifying hot spots or high frequency locations, as in Figure 5 for pedestrian accidents. In this map, a search radius of 500 feet is drawn around each location, and all of the incidents that fall within this area are tabulated and included in the map total. This approach, also referred to as a fuzzy mode method, can be useful in identifying the critical locations. Similar to black spots (Maher et al., 1988), this type of spatial analysis is useful in terms of program or facility planning and management.

Depending on the purpose of the study, aggregation methods allow one to associate the events with other site-specific characteristics at the same scale, such as census information or traffic volume data. While it is useful to be able to aggregate individual events into zones or segments for prioritization of actions or programs, and it is certainly easier to manage the analysis of grouped data, it is also important to remember that the aggregation of data may mask underlying patterns. When data are recorded as individual occurrences, a better understanding of the shape and scale of the data can be obtained. The basic question that then develops out of these forms of analysis is: what kind of spatial pattern is formed by particular subsets of accidents? Are these accidents clustered into specific, key locations, or are they distributed more evenly or regularly across space?

Figure 4. Pedestrian Involved Crashes Requiring EMS Transport
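The search-radius tabulation used for the hot spot map can be sketched as follows. This is a minimal illustration, not the CrimeStat implementation: the function name `fuzzy_mode` and the coordinates are hypothetical, distances are plain Euclidean distances in feet, and each location's count includes the incident at that location itself.

```python
import math

def fuzzy_mode(points, radius):
    """For each location, count all incidents (itself included) within `radius`.

    Returns (location, count) pairs sorted from highest to lowest count.
    """
    results = []
    for (x0, y0) in points:
        n = sum(1 for (x, y) in points if math.hypot(x - x0, y - y0) <= radius)
        results.append(((x0, y0), n))
    results.sort(key=lambda r: r[1], reverse=True)
    return results

# Made-up incident coordinates in feet: four nearby crashes and one far away.
incidents = [(0.0, 0.0), (100.0, 0.0), (0.0, 300.0), (390.0, 300.0), (5000.0, 5000.0)]
ranked = fuzzy_mode(incidents, radius=500.0)

for location, count in ranked:
    print(location, count)
```

In this toy example the same four incidents are tabulated once for each of the four overlapping search radii, so every one of the four nearby locations reports a count of four.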

Figure 5. Pedestrian Involved Crashes: A Hot Spot Analysis

The focus of our research in this paper is on pedestrian accidents. According to the National Highway Traffic Safety Administration (NHTSA), 4,808 pedestrians were killed and 71,000 were injured in the United States in 2002 (NHTSA, 2002). In 2002, there were 119 traffic fatalities in Hawaii, and 33 (27.7%) of those fatalities were pedestrians (NHTSA, 2002). Nationwide, almost 175,000 pedestrians died in motor vehicle crashes between 1975 and 2000 (NHTSA, 2003). In this paper, we describe the use of a non-hierarchical clustering routine called K-Means clustering to analyze patterns in the distribution of pedestrian involved crashes in Honolulu, Hawaii. We compare this technique to a commonly used hierarchical clustering method known as the nearest neighbor routine.

Within the field of statistics and pattern recognition, various clustering algorithms have been developed, including point location, hierarchical clustering, partitioning, and density techniques. Most clustering routines are used in an unsupervised fashion, where there is a set of data that must be grouped according to some notion of similarity. In hierarchical clustering, the final number of clusters is not known ahead of time. The routine starts with each case in a separate cluster and then combines the clusters sequentially, reducing the number of clusters at each step until only one cluster is left (Hartigan et al., 1979). In many applications, however, the researcher possesses some background knowledge that is useful in clustering the data. When this is the case, partitioning clustering routines may be particularly useful. In partitioning clustering routines, like K-Means, the number of clusters is specified a priori, and points are assigned to one and only one group. Developed in 1967 by MacQueen, the K-Means algorithm was later improved by Hartigan et al. (1979). It is commonly used to partition a data set into k groups (MacQueen, 1967); its goal is to divide the objects into k clusters such that a measure of dispersion about the cluster centroids, typically the sum of squared distances from each object to the centroid of its cluster, is minimized. In the next section, the basic approaches to clustering are described and compared.

Clustering Routines

Point Clustering Routines

Point location techniques can be divided into mode and fuzzy mode routines. The mode is the location with the largest number of incidents (Levine, 1999). It is a simple measure that provides the frequency of occurrence at unique locations. At a small geographic scale, this routine can provide a quick means of identifying the location with the highest number of accidents based on frequency of occurrence. However, the usefulness of the mode routine depends on the degree of resolution or scale at which one wants to analyze the incidents, because it represents each location as one and only one point. For example, in an examination of pedestrian-involved crashes on a roadway, the mode routine may not capture an entire intersection: if the measurement of each crash was very precise, each crash could have a slightly different location. An alternative to the mode is the fuzzy mode.
The fuzzy mode allows one to capture incidents around or near the location by examining a radius around the location. The idea behind this method is to allow identification of locations where a number of incidents may occur, but where there may not be precision in the measurement (Levine, 1999). However, this routine has its limitations as well. Because it relies on a search radius around incident locations, incidents are counted multiple times, once for each radius they fall within (Levine, 1999). This can change the apparent frequency of incidents and the ranking of hot spot locations, and it produces small hot spot areas rather than exact locations.

Hierarchical Clustering Routines

The most familiar hierarchical clustering technique is the nearest neighbor routine. This routine clusters points based on a criterion, such as a threshold distance. The clustering is repeated until either all the points are grouped into a single cluster or the clustering criterion fails (Levine, 1999). After a threshold distance is defined, the routine compares it to the distances between all pairs of points. The algorithm only selects points to be clustered if the points are closer than the threshold distance. Other criteria, such as a minimum number of points, can also be used to determine a cluster. Only points that fit both criteria will be clustered at the first level (Figure 6). The first level clusters are then clustered into second order clusters (Figure 6), and so on, until no more clustering is possible: either all clusters converge into one cluster (Figure 6) or the clustering criterion fails (Levine, 1999).

Distributions that are made up of many incidents will tend to have smaller threshold distances. Hence, what counts as a hot spot depends on the surrounding distribution and not on the number of incidents alone, so the routine does not produce a consistent definition of a hot spot area. A meaningful cluster size based on user judgment may be an arbitrary construct that is not based on any statistical means of defining regularity. The technique produces results that are subject to user manipulation. Like other clustering techniques, there is no theoretical rationale behind the clusters, only empirical estimates of cluster locations. As with other exploratory data techniques, there is a need to better understand why the clusters are occurring or why they could be related.
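To make the two clustering criteria concrete, the sketch below groups points that are linked by pairwise distances shorter than a threshold and keeps only groups with a minimum number of points, then repeats the step on the first-level cluster centers. It is a simplified, single-linkage style illustration under stated assumptions, not the CrimeStat routine: the function names, the coordinates, and the fixed thresholds at each level are all hypothetical.

```python
import math

def threshold_clusters(points, threshold, min_points):
    """Group points linked by pairwise distances below `threshold`;
    keep only groups containing at least `min_points` points."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return [g for g in groups.values() if len(g) >= min_points]

def centroid(group):
    """Mean X and Y of a group of points."""
    n = len(group)
    return (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)

# Made-up coordinates: two tight groups of three points and one isolated point.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]

first_level = threshold_clusters(pts, threshold=2.0, min_points=3)
centers = [centroid(g) for g in first_level]

# Second-order clusters are formed from the first-level centers.
second_level = threshold_clusters(centers, threshold=20.0, min_points=2)

print(len(first_level), "first-order clusters;", len(second_level), "second-order cluster(s)")
```

The isolated point satisfies neither criterion and is left unclustered, which is one way this routine differs from K-Means, where every point is assigned to a cluster.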

Figure 6. Nearest Neighbor Hierarchical Cluster

Partitioning Clustering Routines

The K-Means clustering routine is a partitioning routine in which the data are grouped into K groups, where K is defined by the user (Hartigan et al., 1979; MacQueen, 1967). The routine finds the best positioning of the K centers and then assigns each point to the nearest center. The theory behind the K-Means procedure is straightforward, but the implementation is complicated. K-Means is an attempt to find K locations such that the sum of the distances from every point to the nearest of the K centers is minimized (Levine, 1999). In theory, the routine tries every combination of K objects, where the K objects are a subset of the total population of incidents (N), and measures the distance from every incident point to every one of the K locations. The particular combination that provides the minimal sum of all distances (or all squared distances) is considered the best solution (Levine, 1999). Because the number of combinations grows very quickly with N and K, practical implementations rely on iterative approximations rather than an exhaustive search.

As noted earlier, the researcher often possesses some background knowledge that is useful in clustering the data, and partitioning routines can make use of it. In partitioning clustering routines, like K-Means, the number of clusters is chosen before the procedure starts. One of the challenges of this technique involves selecting an appropriate number of clusters. Too many will lead to defining patterns that do not exist, and too few will lead to poor differentiation among distinctly different neighborhoods.

Much like the nearest neighbor hierarchical clustering routine, the K-Means clustering routine assigns points to one, and only one, cluster. Unlike the nearest neighbor routine, however, all points are assigned to clusters; hence there is no hierarchy in the clusters (Levine, 1999). This routine is useful when the user wants to control the grouping. For example, one may want to analyze where the clusters occur for each census tract or district.
Also, if time series data exist, analysis may be conducted to see if clusters have shifted. The nearest neighbor hierarchical routine produces solutions based solely on proximity with most clusters being very small, whereas the K-Means routine allows the user to control the size of the clusters. The K-Means routine provides the ability to fine tune a particular model of clusters to fit a pattern that is known; as such it is more useful as an exploratory tool.
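The "in theory" description of K-Means, in which every combination of K objects is tried as a set of centers, can be written down directly for a very small data set. The sketch below is only an illustration of that idea under stated assumptions: the function names and coordinates are hypothetical, each point is measured to its nearest candidate center, and candidate centers are restricted to the data points themselves.

```python
import itertools
import math

def total_distance(points, centers, squared=False):
    """Sum over all points of the distance (or squared distance) to the nearest center."""
    total = 0.0
    for p in points:
        d = min(math.dist(p, c) for c in centers)
        total += d * d if squared else d
    return total

def exhaustive_k_centers(points, k, squared=False):
    """Try every combination of k data points as centers and keep the best one."""
    best = min(itertools.combinations(points, k),
               key=lambda cs: total_distance(points, cs, squared))
    return best, total_distance(points, best, squared)

# Made-up points along a line: two obvious groups of three.
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
best_centers, best_total = exhaustive_k_centers(pts, k=2)
print(best_centers, best_total)

# Number of candidate center sets for 349 locations and 10 centers,
# the sizes that appear later in this paper.
print(math.comb(349, 10))
```

Six points and two centers give only 15 combinations, but the count printed on the last line shows why an exhaustive search is not a practical option for a data set of a few hundred crashes.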

Data and Methods

The data in this project were compiled as part of the Hawaii CODES (Crash Outcome Data Evaluation System) Project, funded by the U.S. Department of Transportation, National Highway Traffic Safety Administration. The intent of the CODES Project was to collect accident data, link it to various registries, administrative databases, and spatial files, and develop and perform various analyses related to traffic safety. In this analysis, a comprehensive database of police crash reports, linked to EMS transport files, was geocoded and analyzed using various GIS and statistical software packages. The dataset included a total of 349 cases of pedestrian involved crash locations. Other papers describe the linkage procedures (Kim and Nitz, 1994) as well as the geocoding efforts (Levine and Kim, 1999) and the development of a traffic safety GIS in Hawaii (Kim and Levine, 1997).

For this analysis, we obtained motor vehicle crash data recorded by police accident scene investigators for all incidents involving an injury or property damage in excess of $1,000 (the threshold was recently raised to $3,000). Special police investigators are trained to estimate damage at the scene. Using information from the crash records (street address, intersection, etc.), the locations were matched to a dictionary of addresses and locations based on a number of different files (Census TIGER, U.S. Geological Survey digital line graph files, and cadastral files maintained by the City and County of Honolulu). Coordinates for each crash were then determined. A spatial database for all accidents, based on both the map data and the crash attributes, was constructed using Arc/Info and ArcView.

While there have been long-standing concerns about the quality of police collected crash data (O'Day, 1993), there is reason to believe that the quality of both accident and spatial data is higher in Hawaii than in other places.
To begin with, there are only four counties with four police departments in the entire state. All of the police officers receive the standardized crash report training, and special accident investigators are used for the more serious crashes. Because of the limited land area and roadways, it has been easier to build a comprehensive, inclusive transportation GIS covering all streets and roads. Because of initiatives such as the federally funded Hawaii CODES Project, there has been better integration of roadway, environmental, and administrative databases. In addition to the usual spatial databases, the project had access to comprehensive air photographs, satellite imagery, and other spatial databases that have been shared by various agencies and research entities within the state. There has been much opportunity to validate and update the spatial databases within Hawaii.

Investigations of accident clustering may be used either to generate or to test hypotheses regarding the causation of accidents. As with other methods, the approach is based on comparing observed versus expected patterns. The null hypothesis is that the observations are spatially randomly distributed. The alternative hypothesis may take several forms. A one-tailed example would be that cases of pedestrian involved crashes are grouped more closely than expected, or that particular units contain more cases than expected under the assumption of randomness. With a two-tailed test, on the other hand, the alternative hypothesis may be that the cases are either clustered or dispersed. The type of pattern formed may be used to derive further hypotheses regarding the factors associated with pedestrian involved crashes.

K-means is an unsupervised learning algorithm (MacQueen, 1967). It follows a simple procedure to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different starting locations lead to different results. A better choice is therefore to place them as far away from each other as possible (MacQueen, 1967). The algorithm in its simplest form comprises the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated. The K-Means clustering routine was not designed to show the relationship between clusters. Instead, K-Means clusters are constructed so that the average behavior in each group is distinct from any of the other groups (MacQueen, 1967). For example, in a time series experiment you could use K-Means clustering to identify unique classes of pedestrian involved crashes that are determined in a time dependent manner.
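The four steps listed above can be sketched in a few lines. This is a minimal batch version under stated assumptions, not the routine used in CrimeStat: the function name `k_means` and the coordinates are hypothetical, distances are Euclidean, and the initial centroids are simply a random sample of the data points rather than points chosen to be as far apart as possible.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Batch K-Means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumption: a random sample of the data).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recalculate the position of each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty group
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up coordinates: two well-separated groups of four points.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, labels = k_means(pts, k=2)
print(sorted(centroids), labels)
```

Because the result can depend on the starting centroids, it is common to repeat the run with several seeds and keep the solution with the smallest value of the metric being minimized.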

In the K-Means routine, a simple and widely used squared-error cost function is employed to measure the distance, defined as (MacQueen, 1967):

$$E = \sum_{j=1}^{k} \sum_{i=1}^{N} \left( \lVert v_i - c_j \rVert \right)^2 \tag{1}$$

where $N$ and $k$ are the number of data samples and the number of centers, respectively, and $v_i$ is the data sample, in this case the location (coordinates) of the $i$th crash belonging to center $c_j$, so that each crash contributes only through the center to which it belongs. Here $\lVert \cdot \rVert$ is taken to be the Euclidean norm, but other distances could be used (Manhattan distance or road network distance measures). During the clustering process, the centers are adjusted according to a certain set of rules such that, by searching for the center $c_j$ as the data are presented, the total distance in equation (1) is minimized. The Euclidean distances between the data sample and all the centers are calculated, and the nearest center is updated according to (MacQueen, 1967):

$$\Delta c_z(t) = \eta(t) \left[ v(t) - c_z(t-1) \right] \tag{2}$$

where $z$ indicates the nearest center to the data $v(t)$ and $\Delta c_z(t) = c_z(t) - c_z(t-1)$ is the change in that center. Notice that the centers and the data are written in terms of time $t$, where $c_z(t-1)$ represents the center location at the previous clustering step. The adaptation rate, $\eta(t)$, can be selected in a number of ways. MacQueen (1967) set $\eta(t) = 1/n_z(t)$, where $n_z(t)$ is the number of data samples that have been assigned to the center up to time $t$.

Because this research is focused on traffic accidents occurring on roadways, future research may be conducted on modifying the K-Means clustering algorithm to take advantage of shortest path distances between the data sample and centers along a network, as opposed to Euclidean distances. Many statistical programs can perform the K-Means routine, such as the K-Means function in SPSS and the FASTCLUS procedure in SAS, but these packages were not developed to examine spatial patterns within the data.
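The sequential update in equation (2), with the adaptation rate set to one over the number of samples assigned to the center so far, can be illustrated with a short sketch. It assumes the update is applied as an increment to the nearest center, so that each center remains the running mean of its samples; the function names and coordinates are hypothetical, and the squared-error function follows equation (1) with each sample measured to its nearest center.

```python
import math

def macqueen_update(centers, counts, v):
    """Move the center nearest to sample v toward v with rate 1 / n_z."""
    z = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
    counts[z] += 1
    eta = 1.0 / counts[z]                       # adaptation rate eta(t) = 1 / n_z(t)
    cx, cy = centers[z]
    centers[z] = (cx + eta * (v[0] - cx),       # c_z(t) = c_z(t-1) + eta * [v(t) - c_z(t-1)]
                  cy + eta * (v[1] - cy))
    return z

def squared_error(points, centers):
    """Equation (1): sum of squared distances from each sample to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Assumption: the first two samples seed the two centers, each counted once.
centers = [(0.0, 0.0), (10.0, 10.0)]
counts = [1, 1]

# Two further made-up samples arrive one at a time.
for sample in [(2.0, 0.0), (1.0, 3.0)]:
    macqueen_update(centers, counts, sample)

print(centers, counts)
print(squared_error([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)], centers))
```

After both updates the first center sits at the mean of the three samples assigned to it, which is what the choice of adaptation rate is meant to achieve.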
For this paper we used the CrimeStat program because of its ease of use, combined with its capability to work with GIS mapping software to display the results spatially (Levine, 1999). It should be noted that an early, prototype version of some of the algorithms contained in CrimeStat was developed as part of Hawaii PointStat, a Unix-based program for doing point statistics developed at the University of Hawaii under the auspices of the CODES Project.

Results: K-Means Clustering of Pedestrian Accidents

The results of our K-Means analysis of pedestrian accidents in Honolulu are presented in Figure 7 and Table 1. Several runs of the K-Means routine were performed using 2, 5, 8, 10, and 12 clusters. Given the size and spatial distribution of the dataset, 10 clusters was the upper limit that could be distinguished before the procedure identified a cluster with only two crashes. In addition to showing the centroid (the mean location for each cluster in terms of X and Y coordinates), the summary table also shows the degree of rotation, the length of the X and Y axes, and the area in square miles covered by the standard deviational ellipse. Also included is the number of points associated with each of the 10 clusters.

It is, perhaps, most useful to refer first to the map itself to see the location of the 10 clusters. The clusters correspond, roughly, to population areas. Previous research has shown that pedestrian accidents are associated not just with population density, but also with various socio-economic attributes as well as key roadway factors including volume, speed, and roadway type (Levine et al., 1995b). Clearly, the surrounding land uses are important as well, as these will determine the nature, extent, volume, and intensity of conflict between pedestrians and motorized traffic. Note that the two clusters with the largest number of crashes, Cluster 1 and Cluster 3, are located along the southern coast of the island of Oahu, the most urbanized and developed area of the county. Note also that after the first two large clusters, which have 103 and 77 locations, respectively, associated with them, the number of points begins to drop off quickly: 24, 16, 9, 7, etc. The strength of the K-Means approach lies in the ability to evaluate very quickly the density and intensity of the effect.
Obviously, a cluster in which there are 104 accidents in a 4.8 square mile area is, relatively speaking, more intense than one having 7 incidents spread out over 34.8 square miles. Whether the former is more serious than the latter, however, will depend on the levels of exposure; without information on traffic and pedestrian activity levels, it may be difficult to make meaningful comparisons and decisions regarding where to prioritize accident remedial work. The other graphic information conveyed on the map is the direction of the standard deviational ellipse, from which the general direction of travel can be inferred. The more elongated the ellipse, the more directionality is associated with the spatial pattern of accidents. It should be noted, however, that the ellipse represents an abstraction of the cluster; the cluster itself is not elliptical in shape. The clusters show the main estimated centers of pedestrian crashes, while the orientation of the ellipses symbolizes the density and directionality of the crash locations constrained to the roadway network.

To understand the value of the K-Means approach, it is useful to compare the map in Figure 7 with the information contained in the previous maps. Compared to Figure 1, which illustrates all pedestrian crashes, the K-Means partitioning allows for the identification of key districts or locations of concern. By visually presenting data on clusters or groupings, the larger, overall problem of pedestrian safety can be broken down into more manageable units. It becomes possible to focus on other socio-economic correlates associated with the intensity of pedestrian crashes, perhaps at the block group level as shown in Figure 2. Overlaying the clusters on the roadway network could also provide a way of isolating key roadway segments for intervention. Using a combination of the cluster analysis and the roadway analysis (Figure 3), it is possible to target key areas for traffic enforcement or perhaps engineering solutions such as traffic calming or the installation of new traffic control devices. With additional information about injury severity, a new cluster analysis could have been performed to show the key clusters where the most serious injuries have occurred. One of the reasons for focusing on pedestrian accidents, however, is that virtually any time a vehicle hits a pedestrian, there is an injury.
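The orientation and elongation of a standard deviational ellipse can be illustrated with a short sketch. The Python code below is only one common way to derive such an ellipse, assuming the axes are the principal axes of the sample covariance matrix and that coordinates are already in a projected, distance-based system. The function name `deviational_ellipse` is hypothetical, and the software used for the analysis in this paper may define rotation and axis scaling differently.

```python
import math


def deviational_ellipse(points):
    """Return (mean_x, mean_y, angle_deg, major_sd, minor_sd) for 2-D points.

    Assumption: the ellipse axes are the principal axes of the sample
    covariance matrix; other software may scale or rotate differently.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n

    # Eigenvalues of [[sxx, sxy], [sxy, syy]] are the variances along
    # the major and minor axes of the ellipse.
    half_trace = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    major_sd = math.sqrt(half_trace + spread)
    minor_sd = math.sqrt(max(half_trace - spread, 0.0))

    # Orientation of the major axis, counter-clockwise from the X axis.
    angle_deg = math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))
    return mean_x, mean_y, angle_deg, major_sd, minor_sd


# Made-up crash locations strung along a roughly east-west road.
road_crashes = [(0.0, 0.0), (2.0, 0.1), (4.0, -0.1), (6.0, 0.0)]
print(deviational_ellipse(road_crashes))
```

A large ratio of the major to the minor standard deviation corresponds to the elongated, strongly directional ellipses described above, while a ratio near one indicates little directionality.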
Comparing Figure 5 and Figure 7 demonstrates the differences between overall or global solutions (as in the hot spot analysis) and one based upon a partitioning of the data into a set of clusters. Using both globally determined hot spot analysis and partitioned cluster analysis provides, perhaps, the best of both worlds. In addition to the hot spot identified in Figure 7, the K-Means Clustering approach also helps to identify other areas of concern that may warrant further investigation, analysis, problem identification, and countermeasure design.
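The procedure described in this section, repeating the K-Means routine for several values of K and watching for clusters that become too small to be meaningful, can be sketched as follows. This is a minimal illustration in Python, not the routine used for the analysis. The function names `kmeans` and `smallest_cluster_by_k` are hypothetical, seeds are assumed to be randomly chosen data points, and the sample coordinates are invented.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Basic K-Means on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: seeds are random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep the seed of an empty cluster
        if new_centers == centers:
            break  # no center moved, so the partition is stable
        centers = new_centers
    return centers, labels


def smallest_cluster_by_k(points, k_values, seed=0):
    """Map each K to the number of points in its smallest cluster."""
    result = {}
    for k in k_values:
        _, labels = kmeans(points, k, seed=seed)
        result[k] = min(labels.count(c) for c in range(k))
    return result


# Invented coordinates forming two compact groups.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(smallest_cluster_by_k(sample, [2, 3, 4]))
```

When the smallest cluster shrinks to only a couple of points as K grows, that value of K is probably splitting the data more finely than the dataset supports, which is the stopping rule applied informally above.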

Table 1. K-Means Clustering Results

K-Means Clustering: Sample Size = 255, Clusters = 10, Iterations = 6

Cluster  Mean X      Mean Y    Rotation  X-Axis (mi.)  Y-Axis (mi.)  Area (sq. mi.)  Points
1        -157.83008  21.29211  14.9696   1.94105       0.80107       4.88491         104
2        -158.00162  21.39216  83.27666  1.35679       2.84513       12.1272         24
3        -157.87304  21.32945  38.75594  2.05215       0.88072       5.67802         77
4        -157.76333  21.40559  47.81827  4.83258       1.53022       23.23174        16
5        -158.01347  21.31833  44.25596  0.5056        0.1           0.15884         2
6        -158.14943  21.39008  47.77556  3.6045        0.77282       8.75133         5
7        -158.1914   21.44953  44.71923  1.5266        0.22411       1.07483         6
8        -158.02804  21.49882  87.86322  1.35405       1.2094        5.14463         9
9        -158.11726  21.58496  41.04701  1.30088       3.6051        14.73345        5
10       -157.92981  21.64536  39.4556   5.37345       2.0667        34.88841        7
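The figures in Table 1 can be checked and turned into a simple intensity measure. The Python sketch below uses only the axis lengths, areas, and point counts from the table. It assumes the tabulated area is that of an ellipse whose semi-axes are the listed X and Y axis lengths, an assumption that reproduces the published areas to within rounding. The helper names `ellipse_area` and `crash_density` are hypothetical.

```python
import math

# (cluster, x_axis_mi, y_axis_mi, area_sq_mi, points) copied from Table 1.
TABLE_1 = [
    (1, 1.94105, 0.80107, 4.88491, 104),
    (2, 1.35679, 2.84513, 12.1272, 24),
    (3, 2.05215, 0.88072, 5.67802, 77),
    (4, 4.83258, 1.53022, 23.23174, 16),
    (5, 0.5056, 0.1, 0.15884, 2),
    (6, 3.6045, 0.77282, 8.75133, 5),
    (7, 1.5266, 0.22411, 1.07483, 6),
    (8, 1.35405, 1.2094, 5.14463, 9),
    (9, 1.30088, 3.6051, 14.73345, 5),
    (10, 5.37345, 2.0667, 34.88841, 7),
]


def ellipse_area(x_axis, y_axis):
    """Area of an ellipse, treating the two axis lengths as semi-axes."""
    return math.pi * x_axis * y_axis


def crash_density(points, area_sq_mi):
    """Crashes per square mile of the standard deviational ellipse."""
    return points / area_sq_mi


for cluster, x_axis, y_axis, area, points in TABLE_1:
    print(f"Cluster {cluster:2d}: computed area {ellipse_area(x_axis, y_axis):8.3f}, "
          f"tabulated area {area:8.3f}, "
          f"density {crash_density(points, area):6.2f} crashes per sq. mi.")
```

Density computed this way is only a rough indicator. As noted in the Results, a meaningful comparison between clusters also requires exposure data on traffic and pedestrian activity.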

Figure 7. K-Means Clustering of Pedestrian Involved Crashes

Discussion

The K-Means Clustering Algorithm provides an improvement over the simplistic pin-map approach to crash location analysis. While there always may be some inclination towards data reduction as a way of simplifying the problem, being able to quickly identify salient clusters or groupings of points is quite useful in the effort to prioritize locations for further analysis or possible intervention. At the heart of the difference between the K-Means approach and the hierarchical approaches to clustering is the issue of local versus global optimization. With K-Means, points are initially assigned to the nearest of K seed locations to form initial clusters; local optimization then occurs, which assigns each point to the nearest of the K clusters. This is different from hierarchical clustering techniques such as the nearest neighbor technique, which compares all points to a threshold distance level. Only points that fit the specified distance criteria are clustered in the first iteration. The routine then conducts more clustering to produce a hierarchy of clusters (second and third order clusters, etc.). With hierarchical clustering approaches, the user specifies a probability value or confidence level for the mean random distance between point pairs and the minimum number of points required for each cluster.

Levine (1999) points out that there are four advantages to the hierarchical clustering techniques. The first is that they can be used to identify small geographic areas where there are concentrated incidents. The second is that the technique can be applied to the entire, larger area and need not necessarily be restricted to a smaller geographic unit. The third is that small clusters can be seen through the second and higher order clusters (see Figure 6). The fourth is that each of the levels may imply different levels of action. Just as there are strategies that may make sense for the smallest level areas, such as the introduction of crosswalks or traffic control devices, there may be strategies that make sense for larger areas (third and fourth order clusters), such as the enforcement of speed limits or other area-wide strategies. There may be, therefore, situations and conditions under which a hierarchical or global optimization approach may be more appropriate than K-Means analysis. Of course, there is nothing preventing the spatial analyst from doing both: using a global approach like the fuzzy mode algorithm or a nearest neighbor hierarchical technique together with a partitioning method like the K-Means algorithm. Use the global approach to see where potential clusters or groups may exist, and then refine the spatial partitioning with repeated iterations and experiments using the K-Means approach.
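To make the contrast with K-Means concrete, the threshold-based idea behind nearest neighbor hierarchical clustering can be sketched in Python. This is a deliberately simplified stand-in, not the CrimeStat routine. It takes the threshold distance directly rather than deriving it from a probability level, it groups points that are linked by chains of neighbors within that distance, and all function names are hypothetical.

```python
import math


def threshold_clusters(points, threshold, min_points):
    """Group points linked by chains of neighbors within `threshold`.

    Simplified stand-in for one pass of a nearest neighbor hierarchical
    routine: groups with fewer than `min_points` members are dropped.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group's representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return [g for g in groups.values() if len(g) >= min_points]


def centroid(group):
    """Mean X and Y of a group of points."""
    return (sum(x for x, _ in group) / len(group),
            sum(y for _, y in group) / len(group))


def second_order(first_order_groups, threshold, min_points=2):
    """Cluster the centroids of first-order clusters to get the next level."""
    centers = [centroid(g) for g in first_order_groups]
    return threshold_clusters(centers, threshold, min_points)


# Invented coordinates: two tight groups and one isolated point.
sample = [(0, 0), (0.1, 0), (0, 0.1), (1, 1), (1.1, 1), (1, 1.1), (5, 5)]
first = threshold_clusters(sample, threshold=0.2, min_points=3)
print(len(first), "first-order clusters")
print(len(second_order(first, threshold=2.0)), "second-order cluster(s)")
```

Unlike K-Means, this approach leaves isolated points unclustered and does not require the number of clusters in advance; the threshold and the minimum cluster size play that role instead.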
Part of the value lies in the visual display of information, and part lies in the statistical value of identifying centroid locations and relating the scatter, dispersion, or standard deviation of points to the identified groupings. Of course, the biggest challenge may involve identifying an appropriate number, K, of groupings. Some of this comes with experience and experimentation with varying the number of means, but some of it also comes from an understanding of real world conditions. There may be distinct communities or spatial groupings that correspond to physical boundaries or other underlying spatial distributions. There may be political or jurisdictional zones that are also relevant in identifying the number of clusters. Finally, there may be resource constraints or other factors such that it would be impractical to consider, at least initially, more than K partitions.

Conclusions

In recent years, there have been a number of important studies linking spatial and statistical analysis. One of the key considerations that policy analysts and planners must contend with is the determination of clusters or groupings of related events. The presence of a cluster indicates the need for further investigation and possible intervention. K-Means provides a relatively easy, powerful tool for isolating and describing the existence of spatial clusters. While there are alternative clustering techniques, this paper presents a case for using K-Means. It provides a robust, easy to explain approach for doing cluster analysis. Moreover, there are numerous statistical packages that support the generation of K-Means statistics. While the algorithms vary between packages, the general approaches as described in this paper are similar. There are times, however, when a global approach may be more appropriate than a local approach. K-Means should not be seen as a substitute for all clustering routines. It has its appropriate place and use. In the realm of pedestrian safety, the K-Means algorithm would be especially appropriate for locating compact, localized clusters. Pedestrian activity generally occurs over a smaller distance than travel by motorized modes. One could, conceivably, divide the larger area under consideration into smaller subsets and then partition the information according to K groupings. Also, various strategies for enforcement, engineering, and education in the effort to improve pedestrian safety might best be implemented in terms of the locally identified areas or subzones associated with a high incidence of pedestrian accidents.
As such, K-Means is not just a useful exploratory data analysis tool; it could also be applied to evaluating the effectiveness of efforts to reduce injuries, fatalities, and incidents across time and space. Future work includes research on the determination of the optimal number of clusters to represent the data under investigation and the applicability of weighting intensity variables. Also, further comparison of hierarchical clustering routines versus partitioning techniques such as K-Means clustering would seem to benefit efforts to understand and improve pedestrian safety.

References

Cameron, M. 1997. Accident Data Analysis to Develop Target Groups for Countermeasures. Monash University Accident Research Centre. Reports 46 and 47.
Hartigan, J. and Wong, M.A. 1979. A K-Means Clustering Algorithm. Applied Statistics. 28: 100-108.
Kim, Karl and Lawrence Nitz. 1994. Applications of Automated Records Linkage Software in Traffic Records Analysis. Transportation Research Record. National Research Council. 1467: 50-55.
Kim, Karl and Ned Levine. 1997. Using GIS to Improve Highway Safety. Computers, Environment and Urban Systems. 20, 4/5: 289-302.
Levine, N., Kim, K.E., and Nitz, L.H. 1995b. Spatial Analysis of Honolulu Motor Vehicle Crashes: I. Spatial Patterns. Accident Analysis and Prevention. 27: 663-674.
Levine, Ned and Karl Kim. 1999. The Location of Motor Vehicle Crashes in Honolulu: A Methodology for Geocoding Intersections. Computers, Environment and Urban Systems. 22, 6: 557-576.
Levine, N. 1999. CrimeStat Spatial Statistics Program: Version 2.0 Manual. National Institute of Justice. http://www.icpsr.umich.edu/nacjd/crimestat.html#download
MacQueen, J.B. 1967. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Symposium on Math, Statistics, and Probability. 281-297. Berkeley, CA: University of California Press.
Maher, M.J., and Mountain, L.J. 1988. The Identification of Accident Blackspots: A Comparison of Current Methods. Accident Analysis and Prevention. 20: 143-151.
National Highway Traffic Safety Administration, NHTSA. 2002. Traffic Safety Facts 2002 - Pedestrians. DOT HS 809 614. U.S. Department of Transportation. Washington, DC.
National Highway Traffic Safety Administration, NHTSA. 2003. Pedestrian Roadway Fatalities. DOT HS 809 456. Department of Transportation. Washington, DC.
National Highway Traffic Safety Administration, NHTSA. 1999. Literature Review on Vehicle Travel Speeds and Pedestrian Injuries. DOT HS 809 021. U.S. Department of Transportation. Washington, DC.
Ng, K., Hung, W., and Wong, W. 2002. An Algorithm for Assessing the Risk of Traffic Accident. Journal of Safety Research. 33: 387-410.
O'Day, J. 1993. Accident Data Quality. National Cooperative Highway Research Program. National Academy Press. Washington, DC.
Schneider, R.J., Ryznar, R.M., and Khattak, A.J. 2004. An Accident Waiting to Happen: A Spatial Approach to Proactive Pedestrian Planning. Accident Analysis and Prevention. 36: 193-211.