Chicago Crime Category Classification (CCCC)

Sahil Agarwal 1, Shalini Kedlaya 2 and Ujjwal Gulecha 3

Abstract: It is crucial to keep a city safe from crime. If the police could be given information about the types of crime that occur over time and across locations, they would be better equipped to fight crime. We use Chicago crime data to predict the category of a crime based on its date, time and a few geographically relevant features such as latitude and longitude. We tried various models, including Multinomial Naive Bayes, Decision Trees, Multinomial Logistic Regression and a Random Forest classifier, to achieve this classification.

I. INTRODUCTION

We use various statistical classification models to predict the category of a crime based on the date, time and a few geographically relevant features. We used the Chicago crime dataset, which we obtained from Kaggle [1]. Due to limits on computational capacity we restricted our analysis and modeling to data for 2015 and 2016. The following sections describe the dataset, our analysis of the data, the various models and the results.

II. THE DATASET

A. Information about Data

The dataset consists of information about crimes that occurred in Chicago during 2015-2016. There are roughly 0.5 million data points. Each data point consists of the 21 fields shown in Table I.

B. Data representation

Our goal is to predict the Primary Type of a crime given a set of features. Description, IUCR and FBI Code are directly indicative of the primary type and cannot be used as features. ID and Case Number are unique to each crime incident and hence add no predictive value. Arrest and Updated On are determined only after a crime is committed and are not available at prediction time. Location, latitude and longitude correspond to the X, Y coordinates as defined for Chicago.
ID: Unique identifier for the record.
Case Number: The Chicago Police Department RD Number.
Date: Date when the incident occurred, in mm/dd/yyyy.
Block: The partially redacted address where the incident occurred.
IUCR: The Illinois Uniform Crime Reporting code.
Primary Type: The primary description of the IUCR code.
Description: The secondary description of the IUCR code.
Location Description: Description of the location where the incident occurred.
Arrest: Indicates whether an arrest was made.
Domestic: Indicates whether the incident was domestic-related.
Beat: The beat (the smallest police geographic area) where the incident occurred.
District: The police district where the incident occurred.
Ward: The ward (City Council district) where the incident occurred.
Community Area: The community area where the incident occurred; Chicago has 77 community areas.
FBI Code: The crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS).
X Coordinate: The x coordinate of the incident location, in the State Plane Illinois East NAD 1983 projection.
Y Coordinate: The y coordinate of the incident location, in the State Plane Illinois East NAD 1983 projection.
Year: Year the incident occurred.
Updated On: Date and time the record was last updated.
Latitude: The latitude of the location where the incident occurred.
Longitude: The longitude of the location where the incident occurred.

TABLE I: Fields and Descriptions

1 saa034@eng.ucsd.edu [A53100622]
2 skedlaya@eng.ucsd.edu [A53090836]
3 ugulecha@eng.ucsd.edu [A53091469]
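The column filtering described in section II-B could be sketched with pandas as below. This is a minimal sketch: the tiny inline frame and its values are stand-ins for the Kaggle CSV, and the column names follow Table I.

```python
import pandas as pd

# Toy frame standing in for the Kaggle CSV; only a subset of Table I's fields.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Case Number": ["HX1", "HX2", "HX3"],
    "Primary Type": ["THEFT", "BATTERY", "THEFT"],
    "Description": ["...", "...", "..."],
    "IUCR": ["0810", "0460", "0820"],
    "FBI Code": ["06", "08B", "06"],
    "Arrest": [False, True, False],
    "Updated On": ["01/01/2017"] * 3,
    "Year": [2014, 2015, 2016],
})

# Drop fields that leak the label (Description, IUCR, FBI Code), the unique
# identifiers (ID, Case Number), and fields unknown at prediction time.
leaky_or_useless = ["Description", "IUCR", "FBI Code", "ID", "Case Number",
                    "Arrest", "Updated On"]
df = df.drop(columns=leaky_or_useless)

# Restrict to 2015-2016 as in the report.
df = df[df["Year"].isin([2015, 2016])]
print(df["Primary Type"].tolist())  # -> ['BATTERY', 'THEFT']
```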

Primary Category           | Occurrences | Percentage of total crime
Theft                      | 118459      | 22.42%
Battery                    | 99147       | 18.76%
Criminal Damage            | 59647       | 11.29%
Narcotics                  | 36246       | 6.86%
Assault                    | 35750       | 6.76%
Other Offense              | 34574       | 6.54%
Deceptive Practice         | 32656       | 6.18%
Burglary                   | 27423       | 5.19%
Robbery                    | 21599       | 4.09%
Motor Vehicle Theft        | 21439       | 4.06%
Criminal Trespass          | 12707       | 2.4%
Weapons Violation          | 6792        | 1.29%
Offense Involving Children | 4458        | 0.84%
Public Peace Violation     | 4023        | 0.76%
Crim Sexual Assault        | 2784        | 0.53%
Total                      | 528457      |

TABLE II: Different crimes and their frequency

Fig. 1: Total crime, Thefts, Battery and Narcotics cases for every month

C. Data Analysis

We performed some basic data analysis to understand trends in the data. Theft is the most common type of crime, comprising 22.42% of all crime, followed by battery at 18.76%, criminal damage at 11.29%, narcotics at 6.86% and assault at 6.76%. The occurrences of the 15 most common categories of crime over the two years are shown in Table II. Figures 1 and 2 show the variation of total crime and several representative categories over the months of the year and the hours of the day respectively. These follow common-sense patterns: total crime in Chicago dips during the winter months because even criminals feel cold, and total crime is lowest between 5 am and 6 am because the bad elements of society also need their sleep. The category of crime is not influenced significantly by the month, so we predicted that the month would not play an important role in category prediction. It is, however, influenced by the time of day: for example, battery occurs more frequently than theft in the late hours of the day, as shown in Figure 2. Geographical location also influences the frequency and nature of crime. Figure 3 shows the heat map of crime in Chicago in the year 2016, and heat maps for thefts and narcotics are shown in Figure 4. The red concentration in the narcotics-crimes map is in the infamous Far West Side of Chicago.
More cases of theft occur in affluent neighborhoods and shopping districts (the red concentration in Figure 4a is the Near North, a prime shopping and dining area).

Fig. 2: Total crime, Thefts, Battery and Criminal damage cases for every hour

Fig. 3: Heat map of crime in Chicago
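The per-month and per-hour counts behind Figures 1 and 2 can be computed with a simple groupby. A minimal sketch on a toy frame; the column names and values are assumptions standing in for the parsed dataset:

```python
import pandas as pd

# Toy frame: each row is one incident with its parsed month and hour.
df = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "NARCOTICS", "BATTERY"],
    "Month": [1, 1, 1, 7, 7],
    "Hour": [5, 14, 23, 23, 5],
})

# Count incidents per (month, category) and per (hour, category);
# unstack turns the categories into columns, filling absent pairs with 0.
monthly = df.groupby(["Month", "Primary Type"]).size().unstack(fill_value=0)
hourly = df.groupby(["Hour", "Primary Type"]).size().unstack(fill_value=0)
print(int(monthly.loc[1, "THEFT"]))  # -> 2
```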

Fig. 4: Heat maps of different types of crimes ((a) Thefts, (b) Narcotics)

III. PREDICTIVE TASK AND IMPLEMENTATION DETAILS

Our task is to predict the primary crime type based on the features described in the previous sections.

A. Dataset

In our experimental stage, our laptops were unable to handle the more than 1.4 million data points in the full Kaggle dataset, so we restricted ourselves to crime data from 2015 and 2016. This subset has around 0.5 million samples, which is a good number, and is fairly recent and therefore more representative. We did a 60:20:20 random split for training, validation and testing respectively.

B. Performance evaluation and baseline

We chose a very natural performance measure for multi-class classification: the accuracy of predictions against the actual category (y value). To judge our models, we developed a baseline: predict the most common crime category in the training set (Theft) for every data point in the test set. This baseline obtained an accuracy of 22%, as expected.

C. Theorizing an upper bound on accuracy

We conducted a few initial experiments using a Multinomial Naive Bayes classifier with simple feature selection. We improved on the baseline performance by a few percentage points but plateaued around 28%. We then attempted to find out the reason. We looked at the crime categories for every hour of every month in every police beat (274 of them). No single category dominated by far: looping over all beats, months and hours, we found that the most common category accounts for, on average, only 44% of the crimes in a given beat, month and hour. So if we are given a location, a time of day and a period of time (like a month) and asked to predict a category, we cannot do better than 44%. This became our Holy Grail for category prediction. For comparison we also predicted the top two and top three categories; if any of them matches the actual value, we count it as a success. For top 3, our initial experiments gave an accuracy of 62%, which was quite promising.

D. Pre-processing data for feature extraction

Since all the features were categorical, we used one-hot encoding to represent them. We also removed the data points with no location given; there were fewer than 100 such samples. The following features had to be pre-processed before use in our experiments:

Date: the month was extracted from the date to verify that it does not add value to category prediction, as shown in Figure 1.
Time: the hour of the day is indicative of certain kinds of crime, as shown in Figure 2.
Location: location is a major factor influencing the type of crime. Figures 4a and 4b showed us the localization of different crimes, so we used several alternative representations of a sample's location:
1) Beat - 274 unique police beats defined by the Chicago police.
2) x-y coordinate grid - locations represented by x-y coordinates, divided into a 9x9 grid.
3) K-means - K-means clustering (number of clusters = 25) on the x-y coordinates; the cluster assignment was used as a feature.
4) Block - 2132 unique block names.
5) Type of location - 125 types of locations.
6) Community Area - 77 unique community areas.
7) Ward - 50 unique wards.
8) District - 23 unique districts.
Domestic crime: True/False.

IV. CLASSIFICATION MODELS AND THEIR PERFORMANCE

We evaluated several models: Multinomial Naive Bayes, Decision Trees and the Random Forest classifier. Other models, such as Multinomial Logistic Regression and the multi-class SVM classifier, were unsuitable for this task and are described in section D.
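The 44% ceiling from section III-C can be estimated with one grouped computation: group crimes by (beat, month, hour) and average the share of each group's most common category. A minimal sketch on a toy frame; the column names and values are assumptions.

```python
import pandas as pd

# Toy frame: two (beat, month, hour) groups with different dominant categories.
df = pd.DataFrame({
    "Beat": [111, 111, 111, 222, 222],
    "Month": [1, 1, 1, 1, 1],
    "Hour": [5, 5, 5, 23, 23],
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "NARCOTICS", "NARCOTICS"],
})

def dominant_share(group):
    # Fraction of the group's crimes taken by its most common category.
    return group["Primary Type"].value_counts(normalize=True).iloc[0]

shares = df.groupby(["Beat", "Month", "Hour"]).apply(dominant_share)
upper_bound = shares.mean()
print(upper_bound)  # -> (2/3 + 1) / 2 ≈ 0.833 here; about 0.44 on the real data
```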

All models were evaluated against the same baseline, which predicts THEFT, the most common crime, and achieves an accuracy of 22%. The performance of our classifiers on the training and validation sets is given in Tables III, IV and V. We evaluate the accuracy of our top prediction, top 2 predictions and top 3 predictions, written as top1/top2/top3.

A. Multinomial Naive Bayes Classifier

The Multinomial Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with the strong assumption that all features are conditionally independent. The model is computationally inexpensive, which worked well for us given our limited CPU power and memory. The short training time meant that we could evaluate many combinations of features. Our dataset had multiple representations of location; by taking only a limited number of location features and combining them with other features such as time and domestic crime, we could build feature sets that were closer to conditionally independent. Our experimental results also show that the feature sets that best satisfied the conditional-independence assumption performed best under the Multinomial Naive Bayes model. Table III lists the feature sets that were tried; 10 different experiments were run, and the train and validation accuracies are listed in the table.

B. Decision Tree Classifier

A decision tree is built on the training dataset considering all features. Decision tree models are robust to noisy data and can learn expressions over features that have no direct connection with one another. This suits our dataset, which has features like time, location and domestic crime that are not directly related to each other. Our dataset is also noisy: the time is approximated in cases where the exact time is unknown, and we had to remove several data points whose location information was missing.
Decision trees can also mirror human decision-making better than many other approaches. A decision tree takes many hyperparameters. We initially ran our experiments with the parameters [max depth=50, min samples split=30, min samples leaf=20], which gave an accuracy of 38.04%. We then performed a grid search and found the best parameters to be [max depth=150, min samples split=70, min samples leaf=40], which improved the accuracy to 38.49%. Table V lists the feature sets used with these two parameter settings; 5 different experiments were conducted with each setting, and the train and validation accuracies are listed in the table.

C. Random Forest Classifier

A random forest builds a large number of decision trees. For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Categorical variables also increase the computational cost of building the trees. The same properties of the dataset that help decision trees also help the random forest. We initially ran our experiments with the parameters [n estimators = 70, min samples split = 30, bootstrap = True, max depth = 50, min samples leaf = 25], which gave an accuracy of 35.7%. We then performed a grid search and found the best parameters to be [n estimators = 150, min samples split = 60, bootstrap = True, max depth = 70, min samples leaf = 45], which improved the accuracy to 35.9%. Even with these optimized parameters we still observed overfitting: the train accuracy was as high as 38.92%, but the test data performed poorly. Table IV lists the feature sets used with the two parameter settings; 5 different experiments were conducted with each setting, and the train and validation accuracies are listed in the table.
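The grid searches described above for the two tree-based models could be run with scikit-learn's GridSearchCV. A minimal sketch for the decision tree, on toy binary features standing in for the one-hot encodings (the data and its shape are assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))   # binary stand-ins for one-hot features
y = rng.integers(0, 3, size=200)         # 3 toy crime categories

# Candidate values per hyperparameter; the search tries every combination
# and keeps the one with the best cross-validated accuracy.
param_grid = {
    "max_depth": [50, 150],
    "min_samples_split": [30, 70],
    "min_samples_leaf": [20, 40],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3).fit(X, y)
print(search.best_params_)  # the best-scoring combination
```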
D. Unsuitable Models

While Multinomial Logistic Regression works well when features are categorical, it is very expensive to train on a dataset with a large number of classes. Given this and the size of our training data, it was not practical to include it in our experiments. We faced a similar problem with the multi-class SVM classifier, which failed to fit the model in a reasonable time frame because of the large number of samples. The Gaussian Naive Bayes model is suitable for continuous data with a Gaussian distribution. Since our features do not follow this distribution, the Gaussian Naive Bayes model performed very poorly, giving an accuracy of 1.6%.

V. RELATED LITERATURE

We used the dataset from Kaggle [1]. This dataset was not used in a Kaggle competition; rather, it is one of the datasets in Kaggle's datasets collection. A similar dataset that has been analyzed and studied
a lot is the San Francisco Crime Dataset [2], which has features very similar to ours. We read some previous years' submissions for CSE 255 and submissions on Kaggle. Various models, such as Multinomial Naive Bayes, Decision Trees and Random Forest regressors, were used. Reading these reports inspired us to represent our geographical features in a grid-based system and also to use k-means. We also found a book on predictive policing with descriptions of various models used for predictive analytics in crime analysis, and we used it as inspiration when designing the models in this report [3]. The conclusion we could draw was that it is hard to predict a single category of crime given date, time and location, and that this is independent of the dataset, i.e. San Francisco or Chicago. Highly ranked Kaggle entries for San Francisco had accuracies around 23% [4]. This supports our hypothesis that, given only date, time and location, one cannot achieve a good accuracy by predicting a single label.

VI. RESULTS AND CONCLUSION

A. Performance of models

For Multinomial Naive Bayes, in Table III, row indexes 1, 2 and 4, the accuracies on the test set were 34.66/50.82/61.76%, 37.52/54.93/66.55% and 37.33/54.56/66.59% respectively. For Decision Trees, in Table V, row indexes 1, 2 and 4, the accuracies on the test set were 37.11/53.21/64.90%, 32.1/46.11/57.05% and 37.96/54.61/66.10% respectively. For Decision Trees, in Table V, row indexes 6, 7 and 9, the accuracies on the test set were 38.04/54.39/66.12%, 37.85/54.21/66.10% and 38.30/55.22/66.98% respectively. For the Random Forest classifier, in Table IV, row indexes 1, 2 and 4, the accuracies on the test set were 35.55/52.30/63.55%, 35.50/52.33/64.34% and 35.20/51.12/62.66% respectively.
For Random Forest classifiers, in Table IV, row indexes 6, 7 and 9, the accuracies on the test set were 35.44/52.22/63.34%, 35.23/51.89/64.05% and 35.65/51.63/63.77% respectively. The Decision Tree model with max depth 150, min samples split 70 and min samples leaf 40 (shown in red in Table V) performed the best among all models we considered. We think this classifier does better with more depth than the previous model because it can have longer root-to-leaf paths that account for the different feature values needed to reach a classification. The model did not overfit at this depth, as the depth was significantly less than the total number of features in any experiment. We also think that larger min samples split and min samples leaf values helped in better classification of the major crimes in the dataset. From existing documentation, we had expected the Random Forest classifier to perform better; having a large number of categorical features encoded as one-hot may cause it to find patterns in the training data that do not exist in the test data.

B. Interpretation of features

Adding the day of the month as a feature changed the accuracy by about ±0.5% across different combinations of features. From this we gather that days do not add useful information to the predictive task, and the slight variation in accuracy may come from the random shuffling of the data; we do not expect crime to vary with the day of the month. From our data analysis (Fig. 1) we saw that while the total number of crimes changes across months, the category of crime is not influenced by the month. We also saw that the category of crime changes depending on the hour of the day (Fig. 2). We validated this through our experiments: adding month as a feature only reduced the accuracy, while adding hour increased it. Of all the features representing location, adding beat provided the most information.
Beats are geographical areas defined by the police, and we think they were drawn according to the category and concentration of crimes. Other location features that we generated, such as the grid, the K-means clusters, block and type of location, improved the accuracy marginally when used together; they provide little additional information not already covered by beat. Other location features such as ward, district and community area reduced the accuracy. This could be because adding more categorical features that categorize location in different ways over-complicates the model and violates Occam's razor.

Conclusion (TL;DR): Our best model was the Decision Tree classifier, and the best feature representation included one or a few elements each for time, geographic location, type of location and whether the crime was domestic (going with Occam's razor).

The accuracy for top1/top2/top3 category predictions was 38.30/55.22/66.98%. We conclude that it is hard to predict the category of crime given date, time and location, which mirrors the conclusion drawn in the work on the San Francisco dataset. This further reinforces our hypothesis that we cannot get an accuracy better than 44%.

REFERENCES

[1] Crimes in Chicago. https://www.kaggle.com/currie32/crimes-in-chicago. Accessed: 2017-03-04.
[2] San Francisco crime classification. https://www.kaggle.com/c/sf-crime. Accessed: 2017-03-04.
[3] Walter L. Perry, Brian McInnis, and John S. Hollywood. Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations. Santa Monica, CA: RAND Corporation, 2013.
[4] Shen Ting Ang, Weichen Wang, and Silvia Chyou. San Francisco crime classification. CSE 255 Fall 2015, 2015.

Index | Features | Training set [%] | Validation set [%]
1 | month + days + hour + block + location + domestic + beat + district + ward + ... | 34.72/50.89/62.01 | 34.32/50.25/61.86
2 | month + days + hour + block + location + beat + district + ward + community area + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 31.64/47.89/59.74 | 31.12/47.56/59.24
3 | month + hour + block + location + domestic + beat + district + ward + community area + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 34.64/50.49/62.24 | 34.32/50.23/61.87
4 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 37.62/55.09/67.04 | 37.18/54.65/66.67
5 | month + hour + location + domestic + beat + district + ward + community area + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 34.77/50.01/61.77 | 34.13/49.61/61.16
6 | month + hour + days + location + domestic + beat + district + ward + community area + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 34.44/49.99/61.55 | 34.07/49.54/61.06
7 | month + hour + block + location + domestic + beat | 38.01/55.33/67.21 | 37.54/54.96/66.89
8 | month + days + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 37.77/55.01/67.01 | 37.21/54.67/66.63
9 | hour + block + location + domestic + beat + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 37.66/54.88/66.88 | 37.07/54.47/66.52
10 | month + hour + block + weekday + location + domestic + beat + kmeans (for x-y coordinates) | 34.66/50.55/62.12 | 34.32/50.25/61.86

TABLE III: Results of experiments using the Multinomial Naive Bayes classifier (accuracies as top1/top2/top3, in percent)

Hyperparameters: n estimators = 70, min samples split = 30, bootstrap = True, max depth = 50, min samples leaf = 25
Index | Features | Training set [%] | Validation set [%]
1 | month + days + hour + block + location + domestic + beat + district + ward + ... | 35.87/52.55/63.98 | 35.61/52.31/63.61
2 | month + hour + block + location + domestic + beat | 35.85/52.65/64.70 | 35.70/52.54/64.49
3 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 35.38/51.27/62.97 | 35.32/50.95/62.81
4 | hour + block + location + domestic + beat + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 35.60/51.55/63.01 | 35.35/51.17/62.87
5 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 35.65/50.91/63.44 | 35.47/50.78/63.04

Hyperparameters: n estimators = 150, min samples split = 60, bootstrap = True, max depth = 70, min samples leaf = 45
Index | Features | Training set [%] | Validation set [%]
6 | month + days + hour + block + location + domestic + beat + district + ward + ... | 35.81/52.58/63.75 | 35.70/52.36/63.62
7 | month + hour + block + location + domestic + beat | 35.70/52.57/64.71 | 35.63/52.35/64.21
8 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 35.81/52.57/64.71 | 35.78/52.22/64.07
9 | hour + block + location + domestic + beat + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 35.90/52.47/64.55 | 35.83/52.16/64.07
10 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 35.72/52.57/63.81 | 35.48/51.57/63.43

TABLE IV: Results of experiments using the Random Forest classifier (accuracies as top1/top2/top3, in percent)

Hyperparameters: max depth = 50, min samples split = 30, min samples leaf = 20
Index | Features | Training set [%] | Validation set [%]
1 | month + days + hour + block + location + domestic + beat + district + ward + ... | 39.5/57.2/69.51 | 37.34/53.88/65.16
2 | month + hour + block + location + domestic + beat | 35.4/49.4/59.2 | 32.29/46.30/57.41
3 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 35/48.4/58.1 | 31.85/45.08/54.98
4 | hour + block + location + domestic + beat + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 40.82/59.37/71.47 | 38.04/54.74/66.20
5 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 39.59/59.33/69.11 | 37.67/53.91/65.25

Hyperparameters: max depth = 150, min samples split = 70, min samples leaf = 40
Index | Features | Training set [%] | Validation set [%]
6 | month + days + hour + block + location + domestic + beat + district + ward + ... | 39.88/56.52/69.31 | 38.30/54.97/66.41
7 | month + hour + block + location + domestic + beat | 40.01/57.94/69.21 | 38.08/54.87/66.70
8 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 39.53/57.51/69.21 | 38.24/54.98/66.65
9 | hour + block + location + domestic + beat + grid (for x-y coordinates) + kmeans (for x-y coordinates) | 40.05/58.22/70.32 | 38.49/55.34/67.09
10 | month + hour + block + location + domestic + beat + kmeans (for x-y coordinates) | 39.98/57.94/68.55 | 38.24/54.98/66.65

TABLE V: Results of experiments using the Decision Tree classifier (accuracies as top1/top2/top3, in percent)