Railway passenger train delay prediction via neural network model


Masoud Yaghini*, Mohammad M. Khoshraftar and Masoud Seyedabadi
School of Railway Engineering, Iran University of Science and Technology, Tehran, Iran

SUMMARY

The aim of this paper is to present an artificial neural network model that predicts the delays of passenger trains in Iranian Railways with high accuracy. In the proposed model, we use three different methods to define the inputs: normalized real number, binary coding, and binary set encoding. One of the great challenges in using neural networks is designing a suitable network for a specific task. To find an appropriate architecture, three strategies, called quick, dynamic, and multiple, are investigated. To prevent the proposed model from overfitting, we follow cross validation and divide the existing passenger train delay data set into three subsets: a training set, a validation set, and a test set. To evaluate the proposed model, we compare the results of the three input methods and the three architectures with each other and with common prediction methods such as decision trees and multinomial logistic regression. When comparing the different neural networks, we consider the training time, the accuracy on the test data set, and the network size. When comparing the neural networks with other well-known prediction methods, we consider the training time and the accuracy on the test data set. To make a fair comparison among all models, we sketch a time-accuracy graph. The results reveal that the proposed model has higher accuracy.

KEYWORDS: Neural network; Prediction model; Passenger train delays; Iranian Railways

1. INTRODUCTION

Delay is one of the major issues in railway systems all over the world. According to the British National Audit Office (NAO) [11], incidents such as infrastructure faults, fleet problems, fatalities, and trespass still cause significant delays to the travelling public and great cost to the railway. For example, in a single year, 0.8 million incidents led to 14 million minutes of delay to franchised passenger rail services in Great Britain, costing a minimum of £1 billion (averaging around £73 for each minute of delay) in the time lost to passengers; of these incidents, 1,376 each led to over 1,000 minutes of delay. Managing the consequences of incidents and getting trains running normally again is vital to reducing delays, so predicting passenger train delays is an important but very difficult task [11]. Because of the importance of this matter, Iranian Railways always registers and analyzes its delay data with respect to the date, cause, and duration of delays. In this research, the registered data of passenger train delays in Iranian Railways from 2005 to the end of 2009 are used. According to this database, the average delay from 2005 to the end of 2009 was 18,174 hours per year, or 30 minutes per passenger train.

A literature review revealed that little research has been done on forecasting passenger train delays. Carey and Kwiecinski [2] develop a simple stochastic method for knock-on train delays; a knock-on delay is the part of a train's delay that is caused by other trains in front of it. Huisman and Boucherie [8] develop a stochastic model to predict the delays of trains with different speeds. Their model can capture both scheduled and unscheduled train movements, and a case study of a railway section in the Dutch railway network illustrates its practical value for both long- and short-term railway planning [8]. Peters et al. [12] develop an intelligent delay predictor for real-time delay monitoring and timetable optimization in train networks. This system processes the existing delays in the network to generate delay predictions for dependent trains in the near future. Their rule-based system was compared with a specially developed neural network in order to evaluate the accuracy and the abstraction ability of such an artificially intelligent component.

Yuan [16] develops an improved stochastic model for train delays and delay propagation in stations. The most important scientific contribution of this research is an innovative analytical probability model that accurately predicts the knock-on delays of trains, including the impact on train punctuality at stations, based on an extension of the blocking time theory of railway operations to stochastic phenomena [16]. Yuan [17] develops a model that deals with stochastic dependence in the modeling of train delays and delay propagation. The proposed model can be used for assessing timetable stability and predicting train punctuality given primary delays, and model validation reveals that the delay estimates match real-world data very well. Briggs and Beck [1] demonstrate that the distribution of train delays on the British railway network is accurately described by q-exponential functions, using data on departure times at 23 major stations for the period September 2005 to October 2006. Daamen et al. [5] propose a method to predict knock-on delays in an accurate and non-discriminative way, distinguishing two main classes of knock-on delays: hindrance at conflicting track sections and waiting for scheduled connections in stations.

Flight delay prediction is a topic related to this research. Zonglei et al. [18] develop a new method based on machine learning to predict large-scale flight delays. The method first applies k-means, an unsupervised learning algorithm, to the flight delay data to obtain a standard for each class of delay. With these delay classes, a supervised learning method can then be used to build an alarm model; for the supervised learning they use decision trees, BP neural networks, and naive Bayes. Zonglei et al. [19] develop another method to forecast flight delays, based on a content-based recommendation system. In the forecast model, flight delay events and airports are mapped to users and items, respectively, the core concepts of a recommendation system. Following the propagation of a delay, the method alerts the target airport by monitoring the status of related airports, and the observed status is compared with historical data in order to predict the seriousness of the delay. Jianli et al. [9] develop a new method to predict flight delays and delay propagation at an airport.

Their method is based on extended cellular automata (ECA), which extend cellular automata (CA) by extending the components of a cell and the definition of cellular neighbors. Long and Hasan [10] develop a simulation model to estimate flight delays and cancellations in all operating conditions, including off-nominal conditions such as inclement weather. It also explicitly models the impacts of flight delays on downstream operations, such as delayed departures, flight cancellations, and ground delay programs (GDP).

Another related topic is the prediction of bus arrival times. Effective prediction of bus arrival times is important to advanced traveler information systems (ATIS). Automatic passenger counter (APC) systems have been implemented in various public transit systems to obtain bus occupancy along with other information such as location and travel time. Such information has great potential as input data for a variety of applications, including performance evaluation, operations management, and service planning. Chen et al. [3] propose a dynamic model for predicting bus arrival times that uses data collected by a real-world APC system. It consists of two major elements: an artificial neural network model for predicting bus travel time between time points for a trip occurring at a given time of day, day of week, and weather condition, and a Kalman-filter-based dynamic algorithm that adjusts the arrival time prediction using up-to-the-minute bus location information. Their test runs show that this model is quite powerful in modeling variations in bus arrival times along the service route. Chen et al. [4] use automatic passenger counter data collected by New Jersey Transit and propose neural network based models to predict bus arrival times; their test runs show that the predicted travel times generated by the models are reasonably close to the actual arrival times. Yu et al. [15] propose a hybrid model, based on a support vector machine (SVM) and a Kalman filtering technique, to predict bus arrival times. In the model, the SVM predicts the baseline travel times on the basis of historical trips occurring at a given time of day, the weather conditions, the route segment, the travel times on the current segment, and the latest travel times on the predicted segment.

The Kalman-filter-based dynamic algorithm then uses the latest bus arrival information, together with the estimated baseline travel times, to predict the arrival times at the next point. The results show that the hybrid model is feasible and applicable to bus arrival time forecasting, and that it generally provides better performance than artificial neural network (ANN) based methods.

With accurate predictions of train delays, the railway operator can construct a suitable timetable. This minimizes delays and prevents errors and problems in railway plans, so passenger trains can be expected to have minimum travel times. Developing a highly accurate neural network based prediction of passenger train delays is the aim of our research. To evaluate the proposed model, a comparison among the accuracies of the proposed model, a decision tree, and multinomial logistic regression has been made; the evaluation reveals that the proposed model has the highest accuracy.

The outline of this paper is as follows. Section 2 presents an explanation of the gathered passenger delay data. Section 3 gives a brief description of the weight update method for neural networks and then explains how the different methods find a suitable structure for the proposed neural network model. In Section 4, data partitioning, neural network structures, data discretization, and the transformation of data into binary set encoding are explained. Section 5 presents the results obtained with the proposed model. Section 6 gives a summary, conclusions, and some hints for future research.

2. DATA UNDERSTANDING

Delay in a passenger train means that the train has not arrived at its prescheduled time. Train delay does not include scheduled stopping time; for example, in Iranian Railways, stopping time at intermediate stations for praying or for boarding and alighting passengers is excluded from the delay time. The important causes of passenger train delays are:

- Delay at the origin. The difference between the actual and the scheduled departure time of the train.
- Crossing with another passenger or freight train. This happens when trains running in opposing directions pass each other at places where loops or sidings are available; it is essentially the train's waiting time for line clearance.
- Unscheduled waiting time at overtaking points. The train's waiting time for the arrival and passing of another train that shares its path, according to its priority.
- Engine breakdown. Many kinds of problems can cause the passenger train's engine to stop working properly during travel.
- Other trains' engine breakdowns. Engine damage on other trains, which has a negative effect on the travel time of this train.
- Other causes. Wagon breakdowns, infrastructure faults such as track and signal failures, and non-scheduled stops for praying are the other causes of delay time.

In Iranian Railways, the data of passenger train delays are registered and recorded every day. At the end of each week, these files are merged to create weekly delay data, and at the end of each month the weekly files are merged to create monthly delay data. In this research, the monthly files from 2005 to the end of 2009 are used; in each year, the number of patterns represents the number of trains dispatched that year. Table 1 summarizes these data. As shown in table 1, 2008 has the maximum total delay, and 2007 also has a large total delay. The average delay per train in 2008 is the quotient of the total delay divided by the number of dispatched trains, which equals 39.7 minutes according to table 1; concerning average delay, 2008 has the greatest value. Note that 2009 also has a large total delay but, by comparison, a better average delay. Figure 1 shows the monthly average delay per passenger train from 2005 to 2009; the maximum and minimum monthly averages belong to the 10th and the 1st month, respectively.

Figure 2 shows the seasonal average delay per passenger train from 2005 to 2009; the maximum and minimum seasonal averages are related to the summer and spring seasons, respectively.

3. THE BASICS OF THE PROPOSED MODEL

3.1. Notations

In this section we introduce the basics of the proposed model. The notations used in this section are as follows.

$\sigma(x)$: the activation function for hidden and output neurons.
$w_{ij}$: the weight from unit $i$ to unit $j$.
$\Delta w_{ij}(t)$: the change value for $w_{ij}$ at iteration $t$.
$\eta$: the learning rate.
$\delta_{pj}$: the propagated error at unit $j$ for pattern $p$.
$o_{pi}$: the output of unit $i$ for pattern $p$.
$\alpha$: the momentum parameter.
$d$: the eta decay (the number of decay epochs).
$t_{pj}$: the target value of output $j$ for pattern $p$.
$\Theta_i$: the number of input units.
$\Theta_o$: the number of output units.
$M(t)$: the movement vector, based on the changes to the weights over cycle $t$.
$C(t)$: the change vector, based on the momentum at cycle $t$.
$W(t)$: the vector of weights at cycle $t$.
$m(t)$: the index of training acceleration at cycle $t$.

3.2. Weight updating procedure

Artificial neural networks are among the most important data mining techniques and are used with both supervised and unsupervised learning. In this paper, a special kind of feedforward neural network, called the multilayer perceptron (MLP), is used. The network is trained with the back-propagation algorithm with a momentum term. The activation function for hidden and output units is the standard sigmoid function, shown in equation (1).

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$. (1)

The initial weights of the network are set to random values in the interval $-0.5 \le w_{ij} \le 0.5$. For each pattern of the training data set, information flows through the network to generate a prediction. The prediction is compared to the target value found in the training data for the current pattern, and the difference is propagated back through the network to update the weights. More precisely, the change value $\Delta w_{ij}$ for updating the weights is calculated as in equation (2).

$\Delta w_{ij}(t+1) = \eta \, \delta_{pj} \, o_{pi} + \alpha \, \Delta w_{ij}(t)$. (2)

where $\eta$ is the learning rate, $\delta_{pj}$ is the propagated error, $o_{pi}$ is the output of unit $i$ for pattern $p$, $\alpha$ is the momentum parameter, and $\Delta w_{ij}(t)$ is the change value for $w_{ij}$ at the previous epoch. The value of $\alpha$ is fixed during training, but the value of $\eta$ varies across epochs. $\eta$ starts at the user-specified initial value, decreases logarithmically to the value $\eta_{low}$, reverts to the value $\eta_{high}$, and then decreases again toward $\eta_{low}$. The value of $\eta$ is calculated as in equation (3).

$\eta(t) = \eta(t-1) \cdot \exp\big(\log(\eta_{low} / \eta_{high}) / d\big)$. (3)

where $d$ is the user-specified number of eta decay epochs. If $\eta(t-1) < \eta_{low}$, then $\eta(t) = \eta_{high}$, and this cycle continues epoch by epoch until training is complete. The back-propagated error value $\delta_{pj}$ is calculated based on where the connection lies in the network.
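The update rule and the cyclic eta schedule can be summarized in a few lines of code. The following is a minimal sketch in Python/NumPy, not the implementation used in the paper; the function names and the assumed array shapes are ours.

```python
# A minimal sketch of equations (2) and (3); the weight matrix is assumed to
# be a NumPy array of shape (n_upstream, n_downstream), and names are ours.
import numpy as np

def update_weights(w, prev_change, eta, alpha, delta_pj, o_pi):
    """One momentum-based back-propagation step, equation (2)."""
    change = eta * np.outer(o_pi, delta_pj) + alpha * prev_change
    return w + change, change

def next_eta(eta_prev, eta_low, eta_high, d):
    """Cyclic logarithmic eta decay, equation (3)."""
    if eta_prev < eta_low:
        return eta_high                  # revert and start a new decay cycle
    return eta_prev * np.exp(np.log(eta_low / eta_high) / d)
```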

For connections to output units, it is calculated as in equation (4).

$\delta_{pj} = (t_{pj} - o_{pj}) \, o_{pj} (1 - o_{pj})$. (4)

For the other units, it is calculated as in equation (5).

$\delta_{pj} = o_{pj} (1 - o_{pj}) \sum_k \delta_{pk} w_{kj}$. (5)

where $t_{pj}$ is the target value of output $j$ for pattern $p$. Weights are updated immediately as each pattern is presented to the network during training.

3.3. The architecture of the proposed model

Finding a network architecture suited to the existing data is one of the controversial issues in neural networks. Several methods are used in this research to cope with this problem, and the method with the highest accuracy is finally selected as the best one. These methods are named quick, dynamic, and multiple.

In the quick method, only a single neural network is trained. By default, the network has one hidden layer containing $\max((\Theta_i + \Theta_o)/20, 3)$ units, where $\Theta_i$ is the number of input units and $\Theta_o$ is the number of output units.

In the dynamic method, the topology of the network changes during training, with units added to improve performance until the network achieves the desired accuracy. There are two stages to dynamic training: finding the topology and training the final network. In the first stage, a network with two hidden layers of two units each is built, the initial learning rate is set to 0.05 with $\alpha = 0.09$, and the initial network is trained as usual through one epoch. Two copies of the initial network, a left and a right network, are then created, and one unit is added to the second hidden layer of the right network. Both augmented networks are trained through one epoch, and the overall error of each is determined. If the left network has the lower error, it is kept and one unit is added to the first hidden layer of the right network. If the right network has the lower error, the left network is replaced with a copy of the right network, and a unit is added to the second hidden layer of the right network. Both networks are trained through another epoch, and the training/augmentation epoch is repeated until the stopping criteria are met.
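The grow-and-compare search of the dynamic method can be rendered schematically as below. This is a sketch of the procedure just described, not the actual tool's implementation; every helper passed in (make_net, train_one_epoch, overall_error, stop) and the add_unit method are hypothetical stand-ins.

```python
# A schematic sketch of the dynamic topology search; all helpers and the
# add_unit method are assumed stand-ins, not a real API.
import copy

def dynamic_search(make_net, train_one_epoch, overall_error, stop):
    left = make_net(hidden=[2, 2])       # two hidden layers, two units each
    train_one_epoch(left)                # initial epoch, eta=0.05, alpha=0.09
    right = copy.deepcopy(left)
    right.add_unit(layer=2)              # grow the second hidden layer
    while not stop(left, right):
        train_one_epoch(left)
        train_one_epoch(right)
        if overall_error(left) <= overall_error(right):
            right.add_unit(layer=1)      # left wins: grow right's first layer
        else:
            left = copy.deepcopy(right)  # right wins: adopt it as the new left
            right.add_unit(layer=2)      # and grow its second layer again
    return left
```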

To adjust the learning rate at each epoch, two vectors are computed. The first is the movement vector, $M(t)$, based on the changes to the weights over the epoch. The second is the change vector, $C(t)$, based on the momentum at the current epoch. $M(t)$ and $C(t)$ are computed according to equations (6) and (7).

$M(t) = 2\,(W(t) - W(t-1))$. (6)

$C(t) = 0.8\, C(t-1) + M(t)$. (7)

where $W(t)$ is the vector of weights at epoch $t$ and $W(t-1)$ is the vector of weights at the previous epoch. The ratio of the magnitudes of these vectors is defined as in equation (8).

$m(t) = \| M(t) \| \,/\, \| C(t) \|$. (8)

$m(t)$ is an index of the acceleration of training. If the index is less than $1 + \|C(t)\|/10$, training is slowing and the learning rate is increased by a factor of 1.2. If the index is greater than 5.0, training is accelerating, and eta is decreased by a factor of $m(t)/4$. After a good topology has been found, the final network is trained in the normal back-propagation manner, with the initial learning rate set to 0.02 and $\alpha = 0.09$.

In the multiple method, multiple networks are trained in a pseudo-parallel fashion and the network with the highest accuracy is considered as the final model. In the first step, several single-layer networks are generated with different numbers of hidden units, from 3 up to the number of input units. These networks always go up to 12 units (even if there are fewer than 12 input units) and never grow larger than 60 units. A network is generated for each number of hidden units in the sequence 3, 4, 7, 12, and so on, with the increment at each step being two larger than the previous increment. For each single-layer network, a set of two-layer networks is also created. The first layer has the same number of hidden units as the single-layer network, and the number of units in the second layer varies across networks: a network is generated for each number of second-layer units in the sequence 2, 5, 10, 17, and so on, up to the number of units in the first hidden layer, again with the increment at each step being two larger than the previous increment [14].
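The unit-count sequences of the multiple method follow a simple rule: the increment grows by two at each step. A small sketch, with a purely illustrative number of input units:

```python
# Sketch of the hidden-unit sequences of the multiple method; n_inputs = 25
# is illustrative only.
def unit_sequence(start, first_step, limit):
    """Yield start, then keep adding a step that grows by 2 each time."""
    value, step = start, first_step
    while value <= limit:
        yield value
        value += step
        step += 2

n_inputs = 25
limit = max(min(n_inputs, 60), 12)          # always reach 12, never exceed 60
for h1 in unit_sequence(3, 1, limit):       # 3, 4, 7, 12, 19, ...
    for h2 in unit_sequence(2, 3, h1):      # 2, 5, 10, 17, ... up to h1
        print(f"candidate network: hidden layers ({h1}, {h2})")
```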

4. DATA PARTITIONING AND NEURAL NETWORK STRUCTURE

4.1. Data partitioning

When training a prediction model, overfitting, i.e. learning an over-specific representation of the training data, is one of the common problems, especially with large data sets [13]. Overfitting harms the ability of the prediction model to forecast new patterns and causes incorrect or lower-than-expected predictions. Given the very large volume of data, more than 30,000 patterns per year, a cross-validation technique can prevent such negative effects. This technique partitions the data into three samples, called the training set, the validation set, and the test set. The validation set is used as a pseudo test set to evaluate the quality of the network during training; such an evaluation is called cross validation. It allows training the model with the training set, refining the model with the second set, and testing the results with the third set. This reduces the size of each partition accordingly, but it is well suited to a very large data set. Training proceeds toward a minimum of the error on the training set, but only until the error on the validation set reaches its minimum; this identifies the point of overfitting. In this paper, 60% of the data is used as the training set, 10% as the validation set, and the remaining 30% as the test set.
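As an illustration, the three-way split can be coded as below. This is a minimal sketch assuming the patterns are rows of a NumPy array; it mirrors the 60/10/30 partition but is not tied to the tool used in the paper.

```python
# A minimal sketch of the 60/10/30 partition; data is any NumPy array of
# patterns, and the seed is illustrative.
import numpy as np

def partition(data, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(0.6 * len(data))
    n_valid = int(0.1 * len(data))
    train = data[idx[:n_train]]
    valid = data[idx[n_train:n_train + n_valid]]  # monitored to stop training
    test = data[idx[n_train + n_valid:]]          # held out for final accuracy
    return train, valid, test
```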

4.2. Neural network output units

In the existing data set, the passenger train delay attribute has real number values, saved in minutes. A data discretization technique is used to reduce the number of values of this continuous attribute by dividing its range into intervals; interval labels are then used to replace the actual data values. Replacing the numerous values of a continuous attribute with a small number of interval labels reduces and simplifies the original data and leads to a concise, easy-to-use, knowledge-level representation of the mining results. Discretization techniques can be categorized by how the discretization is performed, for example whether it uses class information or in which direction it proceeds (top-down vs. bottom-up). If the discretization process uses class information, it is called supervised discretization; otherwise, it is unsupervised. If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split points, removes some by merging neighboring values to form intervals, and then applies this process recursively to the resulting intervals [7].

In this research, binning is used as the discretization technique. Binning does not use class information and is therefore an unsupervised discretization technique; it is a top-down splitting technique based on a specified number of bins. We use equal-width binning to discretize the delay values: the data are partitioned into 10 classes, and each class constitutes one output unit of the neural network. Figure 3 shows the histogram of these classes. The proposed model thus predicts one of these classes, i.e. an approximation of the delay time.
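For concreteness, equal-width binning into ten classes can be sketched as follows. The example delay values and the clamping of the maximum into the last bin are illustrative details of ours, not taken from the paper.

```python
# A minimal sketch of unsupervised equal-width binning into 10 classes;
# the example delay values (in minutes) are illustrative.
import numpy as np

def equal_width_labels(delays, n_bins=10):
    lo, hi = delays.min(), delays.max()
    width = (hi - lo) / n_bins
    labels = ((delays - lo) // width).astype(int)
    return np.minimum(labels, n_bins - 1)   # put the maximum into the last bin

delays = np.array([5.0, 12.0, 45.0, 90.0, 300.0])
print(equal_width_labels(delays))           # class label per delay value
```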

The origin-destination (O-D) pair of the train, the day, month, and year of dispatching (i.e. the dispatching date), and the railway corridor that the train passes through on its route to the destination are the fields used as the input of the model. In the existing database there are 278 O-D pairs, and an exclusive integer number is assigned to each of them. There are also 9 railway corridors, and, as for the O-D pairs, an integer number is assigned to each. One of the aforementioned ten delay classes is the network output: the output layer has 10 binary output units, each representing one class of delay.

With Pearson's chi-square, which tests the dependency between two categorical attributes, one can measure the pairwise dependencies among the attributes [6]. If the value of the chi-square statistic lies in the interval $[0, \chi^2_{\alpha}((r-1)(c-1))]$, the hypothesis of independence between the two attributes is accepted; otherwise it is rejected. Here $\alpha$ is the level of significance (here $\alpha = 0.05$) and $r$ and $c$ are the numbers of rows and columns, respectively. The dependency between two categorical attributes is considered high if the probability of independence falls below 0.05. The values of Pearson's chi-square test statistics for delay versus corridor, day, month, year, and O-D pair are given in table 2.

4.3. Input units of neural network

To find the best input forms for the proposed neural network, three different approaches to the input attributes are considered: normalized real number, binary coding, and binary set encoding inputs. In the first approach, called normalized real number, five input units are considered: input unit one is allotted to the O-D pair, two to the corridor, three to the day, four to the month, and five to the year. In the second approach, called binary set encoding, the integer value assigned to each attribute value is first encoded as a binary number; $\log_2(\text{max number} + 1)$ binary attributes are considered, where max number is the greatest integer among the different values, and if this number of binary attributes is not an integer, it is rounded up to the nearest integer. The encoded digits are then assigned to new attributes such that each attribute takes exactly one digit of the binary value. For example, for the value 50, figure 4 shows the procedure of encoding to a binary string (supposing that the maximum value of the attribute is 70). In the third approach, binary coding, one binary input unit marks the presence or absence of each distinct value of a categorical attribute; e.g. for an attribute with 10 distinct values, 10 separate binary attributes (one per value) are considered, and for each pattern exactly one of them takes the value 1 while the others take 0. The O-D pair attribute has 278 distinct values, so it needs 278 binary input units, whereas with binary set encoding it needs just nine.
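The figure 4 example (the value 50 with an attribute maximum of 70) can be reproduced with a short sketch; the ceiling operation rounds the bit count up exactly as described above.

```python
# A small sketch of binary set encoding: an integer category value becomes
# ceil(log2(max_value + 1)) binary attributes.
import math

def binary_set_encode(value, max_value):
    n_bits = math.ceil(math.log2(max_value + 1))   # round up if fractional
    return [(value >> i) & 1 for i in reversed(range(n_bits))]

print(binary_set_encode(50, 70))   # 7 bits: [0, 1, 1, 0, 0, 1, 0]
```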

Binary coding thus causes a dramatic growth of the neural network structure, whereas the network size decreases significantly with normalized input values. Table 3 compares the neural network sizes for the different input approaches. Note that the comparison in table 3 is made only for the quick method; the other two methods do not have a fixed, predefined structure, so these computations are impossible in advance. One can only say that, because of the two hidden layers, the final network structure will be larger.

5. EXPERIMENTAL RESULTS

5.1. The results on the whole data set

A personal computer with a 2.83 GHz 64-bit quad-core CPU and 4 GB of main memory, running SPSS Clementine 12.0 [14], is used to obtain all the results for the proposed model and the other prediction methods. Table 4 represents the results of the three structural methods of the neural network models on the database. Of the 179,982 patterns, 107,918 are used as the training set, 53,837 as the test set, and the remaining 18,227 as the validation set. To reduce the effect of random parameter initialization on the prediction ability of the models, we run each model 100 times independently and take the average of the results. As observed from table 4, the greatest prediction accuracy on the test set, 92.18%, is achieved by the binary inputs with the quick method. However, binary inputs make the network structure very large; the model therefore needs more memory and, consequently, more time to solve the problem. The lowest prediction accuracy on the test set, 60.30%, belongs to the normalized real number inputs with the multiple method. As the results reveal, one cannot say that a single method is definitely the best; an analysis of accuracy, time, and network structure size is needed. To evaluate the acquired results, decision tree and multinomial logistic regression models are used. In the predictions with the decision tree and logistic regression, the normalized input values are used as the model input and, instead of ten output units, a single output is used to predict the ten classes, with the integer numbers 1 to 10 representing the class values. For decision tree induction, the C5.0 algorithm is applied.
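The 100-run averaging protocol used throughout the experiments can be sketched generically as below; build_and_train and evaluate are stand-ins for whatever modeling tool is used, not an actual Clementine API.

```python
# A hedged sketch of averaging over 100 independent runs to damp the effect
# of random weight initialization; both callables are hypothetical stand-ins.
import statistics

def average_accuracy(build_and_train, evaluate, runs=100):
    scores = []
    for seed in range(runs):
        model = build_and_train(seed)    # seed varies the weight initialization
        scores.append(evaluate(model))   # accuracy on the held-out test set
    return statistics.mean(scores)
```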

Table 5 represents the results of the decision tree and logistic regression models. To make the comparison easier, table 6 presents the results of the different models in terms of training time and accuracy on the test set. Two integer numbers are attached to each result in table 6: one shows the rank of the model's accuracy and the other the rank of its training time among the other models. To make a fair comparison among the models, a training time versus test accuracy graph is sketched in figure 5, in which the horizontal axis represents the model training time and the vertical axis the accuracy of the model on the test set; the position of each model can thus be seen at a glance. According to the dispersion of the points, the prediction methods can be divided into three classes. In figure 5, the neural network models with binary set encoding inputs (for all structural methods) and with normalized real number inputs (for the quick and dynamic methods) have low training times and high prediction accuracies. The decision tree, the multinomial logistic regression, and the neural network model with normalized real number inputs and the multiple method have low training times and low prediction accuracies. All three structural methods of the neural network model with binary set encoding inputs lie in the first region, which shows the efficiency of the proposed model. Despite their high prediction accuracy, the neural network models with binary inputs lie in the second region because of their long training times. Because of their low prediction accuracy, all the other models lie in the fourth region.

5.2. The periodical analysis

In order to see how well a trained model can predict the delay times of trains in subsequent periods, four models are constructed in this section. Note that in this research one year is considered as one period; other periods, such as a month or a week, could also be used.

Each of these models, called models one to four, is trained and fitted with the data of the preceding years in order to predict the expected delays in the following year. In this section, the binary set encoding scheme is used. The first model is trained with the data of 2005, i.e. the data of 2005 are used as the training and validation set, and this trained model is then used for prediction in 2006, i.e. the data of 2006 are used as the test data set. Model two is trained with the data of 2005 to the end of 2006 and used for prediction in 2007. The third model is trained with the data of 2005 to the end of 2007 and used for prediction in 2008. The fourth model is trained with the data of 2005 to the end of 2008 and used for prediction in 2009. In each case, 75% of the training patterns are used for training and the remaining 25% for validation. Table 7 shows the number of patterns for each model. Table 8 represents the results of models one to four; each model is trained for at most 250 epochs over the training set. To reduce the effect of random parameter initialization on the prediction ability of the models, all results are averaged over 100 independent runs. These results reveal the effectiveness and efficiency of the proposed model in predicting the delay times of trains in subsequent periods.
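The rolling train-on-the-past, test-on-the-next-year protocol of this subsection can be sketched as follows; the DataFrame layout and the train_mlp/accuracy callables are assumptions of ours, supplied for illustration.

```python
# A schematic sketch of the periodical evaluation: train on all years before
# the test year (75/25 train/validation split) and test on the following year.
import pandas as pd

def periodical_eval(df, train_mlp, accuracy,
                    test_years=(2006, 2007, 2008, 2009)):
    """df needs a 'year' column; train_mlp and accuracy are supplied callables."""
    for year in test_years:
        history = df[df["year"] < year].sample(frac=1, random_state=0)
        cut = int(0.75 * len(history))           # 75% training patterns
        train, valid = history.iloc[:cut], history.iloc[cut:]
        model = train_mlp(train, valid)
        yield year, accuracy(model, df[df["year"] == year])
```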

5.3. Analysis on disaggregated data

To improve the understanding of the quality of the results, the main data set is partitioned into k different categories, called category 1 to category k, and k different neural network models, one per category, are trained. According to table 1, the year 2009 does not follow the previous delay pattern; therefore, to prove the fitting ability of the models on Iranian Railways data, 2009 is considered as the test set and none of the delay patterns of this year are used to train the models. The patterns of 2005 to 2008 are used as the training and validation sets (the periods preceding this test data set); e.g. model 1, which is trained with the patterns of category 1 for 2005 to the end of 2008, is used to predict one of the ten delay class labels in category 1 for 2009. As mentioned in Section 4.2, there are nine railway corridors in the data set, and these corridors are used to create nine different categories. Note that another attribute (such as the O-D pair, day, month, or year, or a combination of them) could also be used to form the k categories. With respect to the number of railway corridors, nine disaggregated test data sets are created from the 2009 data and nine disaggregated training and validation data sets from the 2005 to 2008 data. Nine different models, one for each training and validation set, are trained in order to predict the expected delays in the corresponding test data set. 75 percent of each training data set is used for training and the remaining 25 percent for validation. Table 9 shows the number of patterns for each model, and table 10 represents the results of models one to nine. Each model is trained for at most 250 epochs over the training set. To reduce the effect of random parameter initialization on the prediction ability of the models, all results are averaged over 100 independent runs. The averages over the structural methods reveal that the disaggregated models improve the quality of the results in comparison with the whole-data-set results (table 4) and the periodical results (table 8).

6. CONCLUSIONS

In this paper, a neural network model with high accuracy for predicting passenger train delays in Iranian Railways is presented. In the proposed model, three different approaches are used to define the inputs: normalized real number, binary coding, and binary set encoding values. To find an appropriate architecture for the passenger train delay prediction neural network, various strategies are investigated. For predicting passenger train delays, the registered data of Iranian Railways from 2005 to the end of 2009 are used. To evaluate the quality of the results, decision tree and multinomial logistic regression models are used for comparison; the comparisons reveal that the proposed model has high accuracy and low training time and, consequently, good solution quality. Delay prediction makes it easier for railway operators to construct a suitable timetable and minimizes delays, errors, and problems in future railway plans, so that passenger trains can be expected to have minimum travel times. Future research will progress in two directions: improving training time and improving prediction accuracy. The model accuracy may be improved through metaheuristic methods, such as genetic algorithms, simulated annealing, or hybrid algorithms, to find a better network architecture. The training time may be improved through other metaheuristic methods, such as particle swarm optimization or continuous ant colony optimization.

REFERENCES

[1] Briggs, K., Beck, C. (2007). Modelling train delays with q-exponential functions, Statistical Mechanics and its Applications, vol. 378, 15 May.
[2] Carey, M., Kwiecinski, A. (1994). Stochastic approximation to the effects of headways on knock-on delays of trains, Transportation Research Part B, vol. 28, August.
[3] Chen, M., Liu, X., Xia, J., Chien, S. (2004). A dynamic bus-arrival time prediction model based on APC data, Computer-Aided Civil and Infrastructure Engineering, vol. 19, no. 5.
[4] Chen, M., Yaw, J., Chien, S., Liu, X. (2007). Using automatic passenger counter data in bus arrival time prediction, Journal of Advanced Transportation, vol. 41, no. 3.
[5] Daamen, W., Goverde, R., Hansen, I. (2009). Non-discriminatory automatic registration of knock-on train delays, Networks and Spatial Economics, vol. 9, November.
[6] Freedman, D., Pisani, R., Purves, R. (2007). Statistics, 4th edition, W. W. Norton & Company.
[7] Han, J., Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann Publishers.
[8] Huisman, T., Boucherie, R. (2001). Running times on railway sections with heterogeneous train traffic, Transportation Research Part B, vol. 35, March.

18 makes it easier for railway operators to do a suitable timetable and minimizes delays, errors, and problems for the future railway plans and hope passenger trains have a minimum travel time. Future research will progress in two directions: improving training time and improving prediction accuracy. The model accuracy may be improved through metaheuristic methods such as genetic algorithms or simulated annealing or hybrid algorithms to find a better network architecture. Training time can be improved through other meta-heuristic methods such as particle swarm optimization or continuous ant colony optimization. [1] [2] [3] [4] [5] [6] [7] [8] REFERENCE Briggs, K., Beck, C. (2007). Modeling train delays with q exponential functions, Statistical Mechanics and its Applications, 15, May, vol. 378, pp Carey, M., Kwiecinski, A. (2007). Stochastic approximation to the effects of headways on knock on delays of trains, Transportation Research, vol. 28, August, pp Chen, M., Liu, X., Xia, J., Chien, S.,(2004). A Dynamic Bus-Arrival Time Prediction Model Based on APC Data, Computer-Aided Civil and Infrastructure Engineering, vol. 19, no.5, p.p Chen, M., Yaw, J., Chien, S., Liu, X. (2007). Using automatic passenger counter data in bus arrival time prediction, Journal of Advanced Transportation, vol. 41, no.3, p.p Daamen, W., Goverde, R., Hansen. (2009). Non Discriminatory Automatic Registration of Knock On Train Delays, Networks and Spatial Economics, vol. 9, 23, November, pp Freedman, D., Pisani, R., Puves, R.(2007). Statistics, 4th edition, W. W. Norton & Company. Han, J., Kamber, M. (2006). Data Mining Concepts and Techniques, 2nd edition, Morgan Kaufmann Publishers. Huisman T., Boucherie, R. (2001). Running times on railway sections 18

19 [9] [10] [11] [12] [13] [14] [15] [16] [17] with heterogeneous traintraffic, Transportation Research, vol. 35, March, pp Jianli, D., Yuecheng, Y., Jiandong, W. (2009). A Model for Predicting Flight Delay and Delay Propagation Based on ParallelCellular Automata, International Colloquium on Computing, Communication, Control, and Management, vol. 1, 29, September, pp Long, D., Hasan, S.(2009). Improved Prediction of Flight Delays Using the LMINET2System Wide Simulation Model, Aviation Technology, Integration, and Operations Conference (ATIO), 23, September. National Audit Office. (2008). Reducing passenger rail delays by better management of incidents, HC 308 Session , Report by the Comptroller and Auditor General. Peters, J., Emig, B., Jung, M., Schmidt, S. (2005). Prediction of Delays in Public Transportation using Neural Networks, International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, vol. 2, pp Prechelt, L. (1994). PROBEN1 a set of neural network benchmark problems and benchmarking rules, Faculty Informatics, University of Karlsruhe, Germany, Technical Report. 21/94. SPSS, Clementine 12.0 Algorithms Guide, Yu, B., Yang, Z-Z., Chen, K., Yu, B.(2010). Hybrid model for prediction of bus arrival times at next station, Journal of Advanced Transportation, vol. 44, no.3, p.p Yuan, J. (2006). Stochastic Modeling of Train Delays and Delay Propagation in Stations, PhD dissertation, Delft University of Technology, Faculty of Civil Engineering and Geosciences, Department of Transportation and Planning. Yuan, J. (2009). Dealing with stochastic dependence in the modeling of train delays and delay propagation, International Conference on Transportation Engineering. 19

[18] Zonglei, L., Jiandong, W., Guansheng, Z. (2008). A new method to alarm large scale of flights delay based on machine learning, International Symposium on Knowledge Acquisition and Modeling.
[19] Zonglei, L., Jiandong, W., Tao, X. (2009). A new method for flight delays forecast based on the recommendation system, International Conference on Computing, Communication, Control and Management, vol. 1.

Table 1: The summary of passenger train delay data. Columns: year (2005 to 2009, with a sum/average row); number of dispatched trains; total delay (min); average total delay (min/train).

Table 2: Pearson's chi-square test statistics. Columns: attribute pair; chi-square test statistic value; degrees of freedom; probability of independence. Rows: delay vs. corridor, day, month, year, and origin-destination.

Table 3: Comparison of neural network sizes for the three approaches to defining the input units (quick method). Rows: units at the input layer; units at the hidden layer; units at the output layer; connections between the input and hidden layers; connections between the hidden and output layers; total connections. Columns: normalized input; binary set encoding input; binary input.

Table 4: The average results of 100 independent runs of each neural network model. Columns: input value type (numeric, binary, binary set); structural method (quick, dynamic, multiple); training time (sec); number correct and accuracy on the training, test, and validation sets.

Table 5: The results of prediction with the decision tree and multinomial logistic regression algorithms. Columns: algorithm; training time (sec); number correct and accuracy on the training, test, and validation sets.

Table 6: A comparison among all results for ranking. Columns: method (neural network, decision tree, logistic regression); input value type; structural method; method code; training time (sec); test accuracy; accuracy rank; training time rank.

Table 7: The number of patterns for each model. Columns: model (one to four); training and validation data set; test data set; number of training patterns; number of validation patterns; number of test patterns.

Table 8: The results of models one to four. Columns: model; structural method (quick, dynamic, multiple); training time (sec); number correct and accuracy on the training, validation, and test sets.

Table 9: The number of patterns for each model. Columns: model (one to nine); number of training patterns; number of validation patterns; number of test patterns.

Table 10: The results of models one to nine. Columns: model (one to nine, plus the average); structural method (quick, dynamic, multiple); training time (sec); number correct and accuracy on the training, validation, and test sets.


More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Eung Je Woo Department of Biomedical Engineering Impedance Imaging Research Center (IIRC) Kyung Hee University Korea ejwoo@khu.ac.kr Neuron and Neuron Model McCulloch and Pitts

More information

MLPR: Logistic Regression and Neural Networks

MLPR: Logistic Regression and Neural Networks MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28 Outline 1 Logistic Regression 2 Multi-layer

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Outline. MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition. Which is the correct model? Recap.

Outline. MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition. Which is the correct model? Recap. Outline MLPR: and Neural Networks Machine Learning and Pattern Recognition 2 Amos Storkey Amos Storkey MLPR: and Neural Networks /28 Recap Amos Storkey MLPR: and Neural Networks 2/28 Which is the correct

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Part 8: Neural Networks

Part 8: Neural Networks METU Informatics Institute Min720 Pattern Classification ith Bio-Medical Applications Part 8: Neural Netors - INTRODUCTION: BIOLOGICAL VS. ARTIFICIAL Biological Neural Netors A Neuron: - A nerve cell as

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

Classification with Perceptrons. Reading:

Classification with Perceptrons. Reading: Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters

More information

Supplementary Technical Details and Results

Supplementary Technical Details and Results Supplementary Technical Details and Results April 6, 2016 1 Introduction This document provides additional details to augment the paper Efficient Calibration Techniques for Large-scale Traffic Simulators.

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Travel Time Calculation With GIS in Rail Station Location Optimization

Travel Time Calculation With GIS in Rail Station Location Optimization Travel Time Calculation With GIS in Rail Station Location Optimization Topic Scope: Transit II: Bus and Rail Stop Information and Analysis Paper: # UC8 by Sutapa Samanta Doctoral Student Department of

More information

ECE521 Lecture 7/8. Logistic Regression

ECE521 Lecture 7/8. Logistic Regression ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION 4.1 Overview This chapter contains the description about the data that is used in this research. In this research time series data is used. A time

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

michele piana dipartimento di matematica, universita di genova cnr spin, genova

michele piana dipartimento di matematica, universita di genova cnr spin, genova michele piana dipartimento di matematica, universita di genova cnr spin, genova first question why so many space instruments since we may have telescopes on earth? atmospheric blurring if you want to

More information

Analysis of Disruption Causes and Effects in a Heavy Rail System

Analysis of Disruption Causes and Effects in a Heavy Rail System Third LACCEI International Latin American and Caribbean Conference for Engineering and Technology (LACCEI 25) Advances in Engineering and Technology: A Global Perspective, 8-1 June 25, Cartagena de Indias,

More information

Real-Time Travel Time Prediction Using Multi-level k-nearest Neighbor Algorithm and Data Fusion Method

Real-Time Travel Time Prediction Using Multi-level k-nearest Neighbor Algorithm and Data Fusion Method 1861 Real-Time Travel Time Prediction Using Multi-level k-nearest Neighbor Algorithm and Data Fusion Method Sehyun Tak 1, Sunghoon Kim 2, Kiate Jang 3 and Hwasoo Yeo 4 1 Smart Transportation System Laboratory,

More information

Internet Engineering Jacek Mazurkiewicz, PhD

Internet Engineering Jacek Mazurkiewicz, PhD Internet Engineering Jacek Mazurkiewicz, PhD Softcomputing Part 11: SoftComputing Used for Big Data Problems Agenda Climate Changes Prediction System Based on Weather Big Data Visualisation Natural Language

More information

Metaheuristics and Local Search

Metaheuristics and Local Search Metaheuristics and Local Search 8000 Discrete optimization problems Variables x 1,..., x n. Variable domains D 1,..., D n, with D j Z. Constraints C 1,..., C m, with C i D 1 D n. Objective function f :

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Chapter 10 Logistic Regression

Chapter 10 Logistic Regression Chapter 10 Logistic Regression Data Mining for Business Intelligence Shmueli, Patel & Bruce Galit Shmueli and Peter Bruce 2010 Logistic Regression Extends idea of linear regression to situation where outcome

More information

Chapter 1 Introduction

Chapter 1 Introduction Chapter 1 Introduction 1.1 Introduction to Chapter This chapter starts by describing the problems addressed by the project. The aims and objectives of the research are outlined and novel ideas discovered

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron

More information

DM534 - Introduction to Computer Science

DM534 - Introduction to Computer Science Department of Mathematics and Computer Science University of Southern Denmark, Odense October 21, 2016 Marco Chiarandini DM534 - Introduction to Computer Science Training Session, Week 41-43, Autumn 2016

More information

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

More information

Risk Analysis for Assessment of Vegetation Impact on Outages in Electric Power Systems. T. DOKIC, P.-C. CHEN, M. KEZUNOVIC Texas A&M University USA

Risk Analysis for Assessment of Vegetation Impact on Outages in Electric Power Systems. T. DOKIC, P.-C. CHEN, M. KEZUNOVIC Texas A&M University USA 21, rue d Artois, F-75008 PARIS CIGRE US National Committee http : //www.cigre.org 2016 Grid of the Future Symposium Risk Analysis for Assessment of Vegetation Impact on Outages in Electric Power Systems

More information

Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi

More information

Neural Networks DWML, /25

Neural Networks DWML, /25 DWML, 2007 /25 Neural networks: Biological and artificial Consider humans: Neuron switching time 0.00 second Number of neurons 0 0 Connections per neuron 0 4-0 5 Scene recognition time 0. sec 00 inference

More information

is called an integer programming (IP) problem. model is called a mixed integer programming (MIP)

is called an integer programming (IP) problem. model is called a mixed integer programming (MIP) INTEGER PROGRAMMING Integer Programming g In many problems the decision variables must have integer values. Example: assign people, machines, and vehicles to activities in integer quantities. If this is

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE 4: Linear Systems Summary # 3: Introduction to artificial neural networks DISTRIBUTED REPRESENTATION An ANN consists of simple processing units communicating with each other. The basic elements of

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

A Hybrid Method of CART and Artificial Neural Network for Short-term term Load Forecasting in Power Systems

A Hybrid Method of CART and Artificial Neural Network for Short-term term Load Forecasting in Power Systems A Hybrid Method of CART and Artificial Neural Network for Short-term term Load Forecasting in Power Systems Hiroyuki Mori Dept. of Electrical & Electronics Engineering Meiji University Tama-ku, Kawasaki

More information

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches

Metaheuristics and Local Search. Discrete optimization problems. Solution approaches Discrete Mathematics for Bioinformatics WS 07/08, G. W. Klau, 31. Januar 2008, 11:55 1 Metaheuristics and Local Search Discrete optimization problems Variables x 1,...,x n. Variable domains D 1,...,D n,

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Neural Networks Task Sheet 2. Due date: May

Neural Networks Task Sheet 2. Due date: May Neural Networks 2007 Task Sheet 2 1/6 University of Zurich Prof. Dr. Rolf Pfeifer, pfeifer@ifi.unizh.ch Department of Informatics, AI Lab Matej Hoffmann, hoffmann@ifi.unizh.ch Andreasstrasse 15 Marc Ziegler,

More information

Advanced statistical methods for data analysis Lecture 2

Advanced statistical methods for data analysis Lecture 2 Advanced statistical methods for data analysis Lecture 2 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Artificial neural networks

Artificial neural networks Artificial neural networks Chapter 8, Section 7 Artificial Intelligence, spring 203, Peter Ljunglöf; based on AIMA Slides c Stuart Russel and Peter Norvig, 2004 Chapter 8, Section 7 Outline Brains Neural

More information