DEVELOPMENT OF CRASH PREDICTION MODEL USING MULTIPLE REGRESSION ANALYSIS Harshit Gupta 1, Dr. Siddhartha Rokade 2 1

Similar documents
Multiple linear regression S6

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

New Achievement in the Prediction of Highway Accidents

Correlation and Regression Bangkok, 14-18, Sept. 2015

The relationship between urban accidents, traffic and geometric design in Tehran

Single and multiple linear regression analysis

Regression Model Building

CRP 272 Introduction To Regression Analysis

Chapter 14 Student Lecture Notes 14-1

Final Exam - Solutions

y response variable x 1, x 2,, x k -- a set of explanatory variables

MACRO-LEVEL ANALYSIS OF THE IMPACTS OF URBAN FACTORS ON TAFFIC CRASHES: A CASE STUDY OF CENTRAL OHIO

EVALUATION OF SAFETY PERFORMANCES ON FREEWAY DIVERGE AREA AND FREEWAY EXIT RAMPS. Transportation Seminar February 16 th, 2009

Correlation and simple linear regression S5

Spatiotemporal Analysis of Urban Traffic Accidents: A Case Study of Tehran City, Iran

LINEAR REGRESSION CRASH PREDICTION MODELS: ISSUES AND PROPOSED SOLUTIONS

Evaluation of fog-detection and advisory-speed system

School of Mathematical Sciences. Question 1. Best Subsets Regression

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

DEVELOPING DECISION SUPPORT TOOLS FOR THE IMPLEMENTATION OF BICYCLE AND PEDESTRIAN SAFETY STRATEGIES

The simple linear regression model discussed in Chapter 13 was written as

Introducing Generalized Linear Models: Logistic Regression

Simple Linear Regression

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues

Review of Multiple Regression

ST430 Exam 2 Solutions

Multiple Regression Part I STAT315, 19-20/3/2014

TOPIC 9 SIMPLE REGRESSION & CORRELATION

16.400/453J Human Factors Engineering. Design of Experiments II

Regression. Estimation of the linear function (straight line) describing the linear component of the joint relationship between two variables X and Y.

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel

Chapter 3 Multiple Regression Complete Example

Geospatial Big Data Analytics for Road Network Safety Management

9. Linear Regression and Correlation

ENHANCING ROAD SAFETY MANAGEMENT WITH GIS MAPPING AND GEOSPATIAL DATABASE

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Multiple Linear Regression II. Lecture 8. Overview. Readings

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

Regression. Marc H. Mehlman University of New Haven

Univariate analysis. Simple and Multiple Regression. Univariate analysis. Simple Regression How best to summarise the data?

Parametric Test. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 1984.

Daniel Boduszek University of Huddersfield

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

For Bonnie and Jesse (again)

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Checking model assumptions with regression diagnostics

EFFECT OF HIGHWAY GEOMETRICS ON ACCIDENT MODELING

Unit 6 - Introduction to linear regression

Prediction of Bike Rental using Model Reuse Strategy

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

Regression ( Kemampuan Individu, Lingkungan kerja dan Motivasi)

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

General linear models. One and Two-way ANOVA in SPSS Repeated measures ANOVA Multiple linear regression

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Analysis & Prediction of Road Accident Data for Indian States

Correlation and regression

Evaluation of Road Safety in Portugal: A Case Study Analysis. Instituto Superior Técnico

STAT 3900/4950 MIDTERM TWO Name: Spring, 2015 (print: first last ) Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis

Spatial Variation in Local Road Pedestrian and Bicycle Crashes

Multiple Regression and Model Building (cont d) + GIS Lecture 21 3 May 2006 R. Ryznar

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Area1 Scaled Score (NAPLEX) .535 ** **.000 N. Sig. (2-tailed)

Practical Statistics for the Analytical Scientist Table of Contents

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

A discussion on multiple regression models

Chapter 9. Correlation and Regression

Introduction to Regression

Factors Affecting the Severity of Injuries Sustained in Collisions with Roadside Objects

Big Data Analysis with Apache Spark UC#BERKELEY

Analysing data: regression and correlation S6 and S7

EM375 STATISTICS AND MEASUREMENT UNCERTAINTY CORRELATION OF EXPERIMENTAL DATA

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population

M & M Project. Think! Crunch those numbers! Answer!

Statistical Model Of Road Traffic Crashes Data In Anambra State, Nigeria: A Poisson Regression Approach

Chapter 7 Student Lecture Notes 7-1

AN ARTIFICIAL NEURAL NETWORK MODEL FOR ROAD ACCIDENT PREDICTION: A CASE STUDY OF KHULNA METROPOLITAN CITY

Project Report for STAT571 Statistical Methods Instructor: Dr. Ramon V. Leon. Wage Data Analysis. Yuanlei Zhang

Road Accident Analysis and Severity prediction Model On State Highway-5 (Halol-Shamlaji Section)

Multiple Regression. More Hypothesis Testing. More Hypothesis Testing The big question: What we really want to know: What we actually know: We know:

AP Statistics Bivariate Data Analysis Test Review. Multiple-Choice

This gives us an upper and lower bound that capture our population mean.

Modeling Spatial Relationships using Regression Analysis

Draft of an article prepared for the Encyclopedia of Social Science Research Methods, Sage Publications. Copyright by John Fox 2002

Econometrics 1. Lecture 8: Linear Regression (2) 黄嘉平

Chapter 10 Correlation and Regression

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

Multiple Linear Regression. Chapter 12

STATISTICS 479 Exam II (100 points)

Chapter 9 - Correlation and Regression

SPSS Guide For MMI 409

QUANTITATIVE STATISTICAL METHODS: REGRESSION AND FORECASTING JOHANNES LEDOLTER VIENNA UNIVERSITY OF ECONOMICS AND BUSINESS ADMINISTRATION SPRING 2013

Review of Statistics 101

12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL)

SPSS LAB FILE 1

Do not copy, post, or distribute

Transcription:

DEVELOPMENT OF CRASH PREDICTION MODEL USING MULTIPLE REGRESSION ANALYSIS Harshit Gupta 1, Dr. Siddhartha Rokade 2 1 PG Student, 2 Assistant Professor, Department of Civil Engineering, Maulana Azad National Institute of Technology, Bhopal Abstract The purpose of the study is to develop a model for prediction of crashes in urban medium size cities. In this paper Crash prediction model (CPM) is developed using multiple regression analysis. A model is a simplified representation of a real world process. It should be representative in the sense that it should contain the salient features of the phenomena under study. In general, one of the objectives in modeling is to have a simple model to explain a complex phenomenon. Keywords: Crash prediction model (CPM), Multiple regression Analysis, Medium size city 1. INTRODUCTION Road accidents are increasing every year. The total number of road accidents increased marginally from 4,86,476 in 2013 to 4,89,400 in 2014 (MoRT&H,2015). This is probably an underestimate. The actual numbers of injuries are much higher than official statics, as not all injuries are reported to the police stations. Road crashes are an outcome of various factors, some of which are the length of road network, vehicle population, human population and enforcement of road safety regulations etc. Bhopal city is selected for development of CPM. About 3500 traffic crashes occurred every year in Bhopal city. Crash Prediction s (CPMs) is developed to know the expected number of crashes on a road in a certain time frame. Crash frequency depends on Segment length, the traffic flow and several risk factors. No. of accidents is called the dependent (variable). The other components like traffic volume, carriage way width are incorporated in the CPM as independent model variables, also called predictors. SPSS software is used for regression analysis. 2. DESCRIPTION OF DATA To develop a crash prediction model independent and dependent variables are needed. The selection of appropriate variables for model development is an important step as these variables significantly influence the accuracy and validity of model. The collected data was divided into two parts that is dependent data and independent data. a) Dependent data Road accident data b) Independent data Factors affecting the crashes like segment length, carriageway width, median width etc. These variables are selected on the basis former literatures and data of these variables are collected from the selected segments. Traffic volume and road geometric design characteristics are related to accidents. In many research results show that changing in geometric design characteristics and traffic volume accident rate change. Traffic volume is also an important parameter. Traffic volume on selected segment is counted manual at peak hour. Peak hour of each segment is decided on the basis of expert opinion. 2.1. Collection of Accident Data A 6 year (20112015 & 2016) Accident data was collected in form the police department of the Bhopal city. The data is digitized in three categories; accident details, vehicle details, and victim details (include Gender and age of victim). Different type of land uses are selected for the study like Shops, Block of flats, Industrial, Residential, neighborhood, scattered housing, open spaces, Market areas (CBDs) etc. 82

2.2. Road characteristics data The road characteristics data are collected in the year 2016 The road characteristics data segment length, road width, traffic volume, median width, no. access road and pedestrian volume to be incorporated in the model analysis. Due to limitations, the road design elements like vertical alignment, horizontal alignment, lighting, road markings, and haptic feedbacks attributes could not be included in the model analyses. Environmental factors, vehicular factor, human factor are also not included in this analysis. A total of 76 segments were selected for development of accident prediction model. The sections that were considered included the following 1. The length of the access roads varied from 100 m to 1400 m with a mean of 413 m and total length of segments is 32,210 meters 2. The Peak hour traffic volume varied from 1810 to 4560 PCU with mean of 3144. 3. The number of crashes over 6 years varied from 10 to 39 crashes in a section with mean of 20. Total number of crashes was 1558. 4. The no. of access road varied from 1 to 8 with the mean of 2.75 5. The pedestrian volume varied from 13 to 345 with the mean of 145 3. MODEL DEVELOPMENT Multiple regression is an extension of simple linear regression. If the independent variables are two or more in numbers, then the analysis is known as multiple linear regression.the main idea of multiple linear regression method is to build correlation analysis between dependent and independent variables. The function will be the following form: Y=B 0+B 1X 1+B 2X 2+B 3X 1+ +B mx m Where, Y= Dependent variable B0= regression constant, B1, B2, B3...Bm = Regression coefficients of the respective m independent variables. X1, X2, X3,..Xm = Independent variables. A statistic test of the model was necessary, including determination coefficient test (R 2 test), significance test of regression coefficient (ttest), and significance test of regression equation (F test). If the significant test of regression equation failed, it was possible that important factors were missed during the selection of independent variables, or the relationship between independent and dependent variables was nonlinear, in which situation the model should be rebuilt (Feng et al., 2016). The multiple linear regression analysis makes several key assumptions such as linear relationship, multicollinearity, normality, independence and homoscedasticity between the model variables. Finally these assumptions should also be examined for developed model. It assumes that, there is a linear relationship between the dependent variable and each of independent variables and the dependent variable and independent variable collectively. 3.1. Correlation Analysis Bivariate correlation analysis is done by SPSS. Pearson correlation coefficient is selected for analysis. Correlation is a unit free measure of relationship between two variables and take value in [1, +1], where r is close to +1(1), there is strong and positive (negative) relationship and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. It measures only linear relationship. In table 1 seems that the correlation test value of independent variable with dependent variable. Sign () indicates a Correlation is significant at the 0.01 level where the sign () Indicate a moderate Correlation is significant at the 0.05 level. Table 1 shows parameters having positive signs with traffic accident. It is presented that variable has positive correlation which means that variables are perfectly related in a positive linear sense if the correlation value is positive and negative linear sense when correlation value is negative. Table 1: Correlation Value of Independent Variable (X) With Dependent Variable(Y) Sr. No. 1. 2. 3. 4. 5. Independent Variable Segment length Carriageway Width Pedestrian Volume Traffic Volume No. of Access Road Symbol Correlation Test value SL 0.295 CW 0.425 PV 0.249 TV 0.440 AP 0.232 83

3.2. Multicollinearity analyses In statistics, multicollinearity or collinearity is a phenomenon in which two or more predictor variables (Independent variables) in a MLRM are highly correlated. This means, one can be linearly predicted from the others with a substantial degree of accuracy. Multicollinearity poses a problem only for multiple regression analyses. Collinearity is between 1 to +1. Perfect collinearity means, two predictors that are perfectly correlated have a correlation coefficient is 1). If there is perfect collinearity between predictors it becomes impossible to obtain unique regression coefficients because there are an infinite number of combinations of coefficients that would work equally well. Put simply, if we have two predictors that are perfectly correlated, then the values of b for each variable are interchangeable. (Field 2005) One way of identifying multicollinearity is to scan a correlation matrix of all of the predictor variables and see if any correlate very highly (by very highly mean correlations of above.80 or.90). SPSS produces various collinearity diagnostics, one of which is the variance inflation factor (VIF). The VIF indicates whether a predictor has a strong linear relationship with the other predictor(s). Table 2: Correlation Matrix of Selected Independent Variable Predicto TV AL AP PV CW r TV 1.052.111.089.483 SL AP PV.052 1.111.089 CW.483.481.416.481 1.338.416.338.178.043 1.203.178.043.203 1 In this research correlation matrix is used to detect the problem of multicollinearity. The values of correlation between the independent variables are larger than 0.5 indicate the multicollinearity otherwise there is no multicollinearity. Table 2 shows the correlation values of selected independent variables with each other. It seems that the multicollinearity is absent. In the above section it is presented that there is no coloration between selected variable. If collinearity present than we remove variables with multicollinearity. All variables are selected for model formulation. 3.3. Formulation by Multiple linear regression analysis using SPSS Parameters which are used in this study for development of model are identified using literature reviews. The parameter include as an input for model development are: Segment Length i.e. Length of the road segment were an accident occurred Traffic Volume i.e. Peak hour traffic volume (PCU/hour) on carriageway Carriageway width i.e. Width of road on which a vehicles are not restricted by any physical barriers to move laterally. Access Road i.e. No. of access road at road segment. Pedestrian volume i.e. Total number of pedestrian on the road Vehicle Accident i.e. Accident of vehicle on the selected segments. I. Summary: Table 3 shows the model summary. The "R" column represents the value of R, the multiple correlation coefficients. The "R Square" column represents the R 2 value (also called the coefficient of determination). Table 3: Summary of MLR R R Square Adjusted R Square Std. Error of the Estimate 1.686 a.471.433 4.789 It is the proportion of variance in the dependent variable that can be explained by the independent variable. II Selection of Best The most sensitive approach to selecting a subset of important variables in a complex linear model is to compare all possible subsets. This procedure simply fits all the possible regression models (i.e., all possible combinations of predictors) and chooses the best one (or more than one) based on the criteria described below. 1. High R, R 2 and Adjusted R 2 value 2. Low significance F value and Minimum standard error 84

3. Low p value for the coefficients of independent variables and y intercept III. Formulation of Equation Segment length, traffic volume and pedestrian volume have high value in comparison of other parameters. Regression coefficients of respective variables SL, TV, PV are in three decimals. To obtain a proper regression coefficient SL, TV, PV are divided by respectively 10, 100 and 10. The Coefficients part of the output gives us the values that we need in order to write the regression equation. The accident Prediction model for selected road segments using the multiple regression analysis is presented in equation 5.1 TC = 2.179 + 1.164(CW) + 0.0114(SL) + 0.525(AP) + 0.0312(PV) + 0.00122(TV) Where, 2.179 = regression constant or intercept 1.164, 0.0114, 0.525, 0.0312 and 0.00122 = regression coefficients of their respective variables. CW = Width of road on which a vehicles are not restricted by any physical barriers to move laterally SL = Segment Length AP = No. of access road PV = Pedestrian Volume TV = Peak hour Traffic Volume in PCU/hour 3.4. Validation and Assumptions of MLR ANOVA, Ftest, Ttest and normality check in multiple linear regression. 1. ANOVA: Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. Analysis of Variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. The ANOVA calculations for multiple regression are nearly identical to the calculations for simple linear regression, except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model. Table 4 shows the output of the ANOVA analysis. Result of ANOVA test is f (5, 70) = 12.451 which is at the significance level 0.000 (Pvalue =.000), P value is less than 0.005. It means that the regression equation is significant at 95 % confidence limit in levels of variability within a regression model. Table 4: ANOVA, Multiple Linear Regression Analysis in SPSS Sum of Squares df Mean Square F Sig. Regression 1427.68 5 285.538 12.451.000 b Residual 1605.31 70 22.933 Total 3033.00 75 2. Examining Normality The normal probability plot shows the theoretical percentiles of a normal distribution versus the actual percentiles of the standard residuals with the same variance and mean. If the collected data are normally distributed, then the observed values (the dots on the chart) should fall closely along the straight line (meaning that the observed values are the same as you would expect to get from a normally distributed data set). In broad, the more closely the points are clustered about the 45degree line, the stronger the indication supporting the normality assumption. Any substantial curvature in the plot is evidence that the residuals have not come from a normal distribution Figure 5.6: Normal PP Plot of Regression Residual In multiple regression analysis the accurate relationship between variable can be estimate if the relationships between variable are linear. If the relationship between dependent and independent variable is not a linear relationship, the results of the regression analysis will underestimate the true relationship. A preferable method of detection linearity is examination of residual plots between dependent and independent. The first step of assessment of assumption begins with initial descriptive plots of dependent Variable versus each of the explanatory variables, separately. 85

4. CONCLUSION This research developed a crash prediction model by multiple linear regression analysis. Predictor variable carriage way widths, segment length, traffic volume, pedestrian volume, no. of access road are selected for the study. The risk is increasing as the traffic volume, pedestrian volume, carriageway width, segment length and no. of access road increases. Previous study shows that presence of median decrease the risk of accident. High R 2 value is needed for the model selection. Low R 2 value shows either some predictor variables are missing or the relation between dependent and independent variable is not linear or not follow normal distribution. The accident prediction models can be used to decrease the risk on the road and safety performance. In terms of future work, it is suggested to taking other predictor variables such as horizontal and vertical alignment, median width, sidewalk width, environmental factors etc. CPM can be developed by regression analysis such as poisson regression, negative binomial regression, zero inflated regression model etc. REFERENCES [1] Feng S., Li Z., Ci Y. and Zhang G. (2016), Risk factors affecting fatal bus accident severity: Their impact on different types of bus drivers, Accident Analysis & Prevention, 86, pp.2939. [2] MORT&H, (2015), Road Accident Statistics, Ministry of Road Transport and Highways, Government of India, New Delhi [3] Field A. (2005), Discovering statistics using SPSS, SAGE Publications Ltd ISBN 9781 847879066 [4] IBM Software Group (2015), IBM SPSS Advanced Statistics 22 Available from: http://library.uvm.edu/services/statistics/spss2 2Manuals/IBM%20SPSS%20Advanced %20Statistics.pdf [5] Ranjitkar P and Chengye P. (2013), ling motorway accidents using negative binomial regression, Eastern Asia Society for Transportation Studies Vol. 9. [6] Sawalha Z. and Sayed T. (2003), Statistical issues in traffic accident modeling, Canadian Journal of Civil Engineering, 33(9), P11151124 [7] Prajapati, P., Tiwari, G., Evaluating Safety of Urban Arterial Roads of Medium Sized Indian City. Proceedings of the Eastern Asia Society for Transportation Studies, Vol.9, 2013 [8] WHO (2015), Global status report on road safety 2015 86