A new insight into prediction modeling system


Integrated Design and Process Technology, IDPT-2003
Printed in the United States of America, December, Society for Design and Process Science

A new insight into prediction modeling system

Sang C. Suh, Sam I. Saffer, Dan Li, Jingmiao Gao
Department of Computer Science, Texas A&M University-Commerce, Commerce, Texas, USA

ABSTRACT

Data mining and forecasting have attracted a great deal of research interest in time series analysis. This paper brings a new insight into how to select among different time series forecasting models according to different situations and data patterns. The contribution of this research is to facilitate the design of prediction modeling systems for a wide range of forecasting purposes.

Keywords: forecasting, prediction, time series, stochastic, data mining and knowledge discovery, intelligent database, smoothing, ARIMA, Box-Jenkins, regression, neural network, expert system

NOMENCLATURE

F_t = forecast for period t
I_t = smoothed seasonal index for period t
S_t = smoothed value at end of period t
S'_t = single smoothed value
S''_t = double smoothed value
SSE = sum of squared errors (explained)
SST = total sum of squares
SSU = unexplained variation
TC_t = the combined trend and cyclical component
Y_t = actual value in period t
Ŷ_t = fitted or forecasted value at time t
c = seasonal index smoothing constant
m = forecast horizon
t = period t
α = (level) smoothing constant
β = trend smoothing constant (beta)
σ = standard deviation

INTRODUCTION

Forecasting and prediction have attracted increasing attention from researchers, industries, and governments in recent years. Many studies and models have achieved great success in particular fields. Examples include the X-11 model [19] for forecasting unemployment rates and GNP, weather forecasts, neural networks for inventory forecasting [10], EMA for forecasting commodity sales and for web access prediction [2][4], decision trees from data mining for forecasting credit card risk, and Fourier analysis for pattern forecasting in a mobile stock system [7]. But model selection has been a painful issue in forecasting. When we select a model for a forecasting task, we need to consider the model from every angle, hypothesize how different variables affect each other, and so on. The rule of thumb is to follow the carpenter's rule: measure twice, cut once. Do not be in a rush to apply a model and run the program. There is no single best forecasting model; different forecasting models may generate quite different results when applied in different fields, situations, and environments. The following remark by G. Box gives us an insight into the problem of formalizing any type of prediction model: "All models are wrong, but some are useful." - George Box (1994)

The purpose of this paper is to describe different forecasting models and compare them under different conditions so as to bring a new insight into prediction modeling systems. Our goal is to facilitate forecasting system analysis and design, to help analysts and designers understand different prediction models, and to show which forecasting model is best suited and under what circumstances it should be used. The structure of the paper is as follows. In section 2, we present basic concepts of time series analysis and list some frequently used formulas. In section 3, we describe different time series forecasting models and present the advantages, disadvantages, and constraints of each model. In section 4, we compare the different models from several angles.
In section 5, we devise our new prediction modeling system and explain its procedure in detail. In section 6, we conclude the paper and discuss future work.

BACKGROUND NOTIONS

In time series analysis, there is a fundamental distinction between the terms process and realization. The actual values in an observed time series are the realization of some underlying process that generates those values.

We refer to the underlying process as the stochastic (a process whose outcomes are governed by probability) or probabilistic generating process (probabilistic and stochastic are synonyms). In time series analysis, data sequences are often decomposed into patterns, including random, trend, seasonal, and cyclical patterns, so that forecast = trend + season + cycle + random. Trends and seasons are deterministic, while cycles and the random component are non-deterministic.

Most models are based on sample statistics because a true population statistic is usually unobservable, and checking the entire population is expensive and statistically unnecessary. We therefore have to clearly understand the difference between samples and populations, and the difference between the number of samples and the number of samplings. Generally, it is recommended that at least 60 observations (i.e., five seasons) be stored for monthly series, regardless of the forecasting method.

When you select a forecasting model, determining whether it is appropriate for your specific problem or data series, and whether it is the best candidate, becomes the main issue in building the whole prediction modeling system. Many statistical criteria can be used for measuring the errors of different models under different circumstances; the commonly used criteria are shown in the Formulas in the appendix.

DIFFERENT METHODS

Univariate time series methods

SMA (simple moving averages)

SMA is useful in modeling a random series (i.e., one without trend or seasonality) because it averages or smoothes the most recent actual values to remove the unwanted randomness. Moving average models work well with patternless demands. A patternless series is one that does not have a trend or seasonality in it; such a series can have smooth or erratic variations (i.e., with or without high autocorrelation). SMA assumes that a future value will equal an average of past values:

SMA(m) = (X_t + X_{t-1} + ... + X_{t-m+1}) / m    (1)

The most distinct disadvantage of all moving average or smoothing methods is that they do not model seasonality or trend; a general forecasting method should be able to model seasonal and trending series. Historically, another disadvantage of moving averages has been that all data needed to calculate the average must be stored and processed. For example, if a 52-week moving average is used, a database of 52 observations must be maintained. Although this created a serious data storage problem several decades ago, when computer memory and calculations were expensive, it is no longer a matter of great concern. When using a moving average, it is difficult to determine the optimal number of periods to include in the average; this is not a problem for exponential smoothing, which is as accurate as moving averages while being more computationally efficient. In general, the optimal number of periods in an SMA is the one that minimizes the residual standard error (RSE, see equation 22 in the appendix). Longer moving averages smooth the randomness of the series more. A long-period moving average is the more accurate forecasting model, yielding the lowest RSE, when a series is very random and erratic (an erratic series is one not possessing high levels of autocorrelation). In contrast, if the randomness is very smooth, with highly autocorrelated walks away from an overall mean, the series should be modeled using a shorter-period average, which will yield a lower RSE.
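As a concrete illustration of equation (1), the following minimal Python sketch computes an m-period moving-average forecast; the function name and sample data are our own illustrations, not from the paper.

```python
# Minimal sketch of a simple moving average (SMA) forecast, equation (1).
# The function name and example data are illustrative, not from the paper.

def sma_forecast(series, m):
    """Forecast the next value as the average of the last m observations."""
    if len(series) < m:
        raise ValueError("need at least m observations")
    return sum(series[-m:]) / m

demand = [52, 48, 50, 53, 49, 51, 50, 52]   # a patternless (no trend, no season) series
print(sma_forecast(demand, m=4))             # average of the last 4 values
```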
A random walk is a typical example of such a smooth, highly autocorrelated series (i.e., a very common pattern of prices for items sold in large competitive markets, such as stock, bond, and commodity markets). It may also describe the demand for very high-volume products that are not trending or seasonal. The pattern is referred to as a random walk because it randomly and smoothly walks up and down without any repeating pattern. Random walks are characterized by high degrees of autocorrelation.

Exponential smoothing

As mentioned above, when computer storage capacity was expensive, exponentially weighted moving averages (i.e., exponential smoothing) were advantageous. They are still very popular today, even though storage cost is no longer of much concern. Exponential smoothing refers to a set of forecasting methods, several of which are widely used.

1. Single exponential smoothing (SES)

It is easy to apply because forecasts require only three pieces of data: the most recent forecast, the most recent actual, and a smoothing constant. The smoothing constant (α) determines the weight given to the most recent past observations and therefore controls the rate of smoothing or averaging.

F_{t+1} = F_t + α (Y_t - F_t)    (2)

The best alpha should be chosen on the basis of the minimal sum of squared errors. Note that (Y_t - F_t) is the forecast error.

GIVEN: F_{t+1} = α Y_t + (1 - α) F_t
THEN: F_{t+1} = α Y_t + α (1 - α) Y_{t-1} + α (1 - α)^2 Y_{t-2} + α (1 - α)^3 Y_{t-3} + ... + α (1 - α)^{t-1} Y_1 + (1 - α)^t F_1    (3)

SES works only for patterns with no trend and no seasonality.
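A minimal sketch of single exponential smoothing (equation 2) follows; the smoothing constant, the seed forecast, and the sample series are illustrative assumptions.

```python
# Single exponential smoothing (SES), equation (2): F[t+1] = F[t] + alpha * (Y[t] - F[t]).
# alpha, the seed forecast, and the data are illustrative choices, not prescribed by the paper.

def ses_forecast(y, alpha=0.3):
    """Return the one-step-ahead forecast after smoothing the whole series."""
    f = y[0]                       # seed the first forecast with the first actual
    for actual in y:
        f = f + alpha * (actual - f)
    return f

y = [20, 22, 21, 23, 24, 23, 25]
print(ses_forecast(y, alpha=0.3))
```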

2. Double moving averages / linear moving averages

This method calculates a second moving average from the original moving average so as to eliminate the systematic error. Advantages: When trend and randomness are the only significant demand patterns, this is a useful method; because it smoothes large random variations, it is less influenced by outliers than the method of first differences. Disadvantages: In general, it is too simplistic to be used by itself; it does not model the seasonality of the series; and we again face the problem of determining the optimal number of periods to use in the double moving averages.

3. Brown's Double Exponential Smoothing [1][17]

It uses a single coefficient (α) for both smoothing operations. The method computes the difference between the single and double smoothed values as a measure of trend, and then adds this value to the single smoothed value together with an adjustment for the current trend. Brown's double exponential smoothing formula:

S'_t = α Y_t + (1 - α) S'_{t-1}
S''_t = α S'_t + (1 - α) S''_{t-1}
a_t = S'_t + (S'_t - S''_t) = 2 S'_t - S''_t
b_t = [α / (1 - α)] (S'_t - S''_t)
F_{t+m} = a_t + b_t m    (4)

Advantages: It models the trend and level of a time series; it is computationally more efficient than double moving averages; it requires less data than double moving averages; and because only one parameter is used, parameter optimization is simple. Disadvantages: There is some loss of flexibility because the best smoothing constants for the level and trend may not be equal; it does not model the seasonality of a series, so it is not recommended unless the data is first de-seasonalized.

4. Holt's Two-Parameter Trend Model

It uses two smoothing constants (α and β), one for the trend and the other for the level of the series. Holt's two-parameter trend model formula:

S_t = α Y_t + (1 - α)(S_{t-1} + b_{t-1})
b_t = β (S_t - S_{t-1}) + (1 - β) b_{t-1}
F_{t+m} = S_t + b_t m    (5)

Advantages: It eliminates the natural lag of single smoothing, as does Brown's DES; it is more flexible in that the level and trend can be smoothed with different weights. Disadvantages: It requires that two parameters be optimized, so the search for the best combination of parameters is more complex than for a single parameter; it does not model the seasonality of a series.

5. Winters' Three-Parameter Exponential Smoothing

It uses three smoothing constants: one for the trend, one for the level of the series, and one for the seasonality adjustment. Winters' three-parameter exponential smoothing formula:

Y_{t+1} = (S_t + b_t) I_{t-L+1} + e_{t+1}    (6)

where
S_t = smoothed non-seasonal level of the series at end of period t
b_t = smoothed trend in period t
I_{t-L+1} = smoothed seasonal index for period t + 1

S_t = α (Y_t / I_{t-L}) + (1 - α)(S_{t-1} + b_{t-1})
b_t = β (S_t - S_{t-1}) + (1 - β) b_{t-1}
I_t = c (Y_t / S_t) + (1 - c) I_{t-L}    (7)

where
I_{t-L} = smoothed seasonal index L periods ago
L = length of the seasonal cycle (e.g., 12 months or four quarters)

Advantages: It models trend, seasonality, and randomness using an efficient exponential smoothing process; parameters can be updated using computationally efficient algorithms; and its forecasting equations are easily interpreted and understood by management, so it is popular in some commercial forecasting systems. Disadvantages: It may be too complex for data that do not have identifiable trends and seasonality; the simultaneous determination of the optimal values of the three smoothing parameters in Winters' model may take more computational time than regression, Fourier series analysis, etc.
Outliers can have dramatic effects on forecasts and should be eliminated. Data requirements: Because it models seasonality, its data requirements are greater than those of the previous methods above. To adequately measure seasonality, at least three seasons of monthly data (36 months), four or five seasons of quarterly data (16 to 20 quarters), and three seasons of weekly data (156 weeks) are suggested minimums.
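Before moving on to decomposition methods, here is a minimal sketch of Holt's two-parameter trend model (equation 5); the smoothing constants, the initialization of level and trend, and the sample data are illustrative assumptions. Winters' method extends the same recursion with a seasonal index.

```python
# Holt's two-parameter trend model, equation (5).
# alpha, beta, the initialization, and the data are illustrative assumptions.

def holt_forecast(y, alpha=0.5, beta=0.3, m=1):
    """Return the m-step-ahead forecast S_t + b_t * m."""
    level = y[0]
    trend = y[1] - y[0]            # simple initial trend estimate
    for actual in y[1:]:
        prev_level = level
        level = alpha * actual + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend * m

y = [10, 12, 13, 15, 16, 18, 19, 21]
print(holt_forecast(y, m=3))       # forecast three periods ahead
```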

Decomposition - X-11

One of the most highly developed decomposition methods is the Census Method II, X-11 procedure. This is the primary method that the federal government uses to de-seasonalize the many macroeconomic time series (e.g., GDP, unemployment, etc.) reported in the media. The predecessor of this method was developed by the National Bureau of Economic Research during the 1920s to forecast economic time series. It is the computerization of the classical decomposition method; because of its complexity, the Census Method II is infeasible without a computer. The method decomposes a time series into three components, where TC is the combined effect of the trend and cycle components. The components are estimated as follows:

- Estimate the TC component via a centered moving average of Y
- Estimate SI as the ratio Y/TC
- Adjust SI for extreme points (find extreme points and smooth them by a moving average)
- Estimate S by averaging SI
- Estimate TCI by the ratio Y/S
- Estimate I by the ratio of TCI to TC

This yields a forecasting model using trend-cyclical and seasonal factors that can be the basis for a forecast; that is, Ŷ_t = TC_t S_t. The above six steps are repeated more than once for refinement during the following four major phases. First phase: the data are adjusted for trading days in a month or quarter. Second phase: preliminary estimates of the seasonal indexes are made. Third phase: the preliminary seasonal indexes are refined. Final phase: several summary statistics are generated as diagnostic tools.

In implementations of this method, the standard output involves 17 to 27 tables, the long output 27 to 39 tables, and the full output 44 to 59 tables. Plots of the seasonal indexes, the trend-cycle curves, and other graphs are extremely useful in tracking trends and estimating turning points. In addition, this method is a very good way to identify and adjust outliers in a time series. Because of its heavy computer requirements, the method is not used routinely in automated forecasting systems; however, it is a cost-effective method for important time series. It is also limited to time series with sufficient history (at least three years of monthly data). Its principal business usage is in monthly and quarterly forecasting of seasonal time series for 1 to 10 years ahead and in analyzing trend and seasonal factors for use in other forecasting methods.

Because of the manner in which seasonal indexes and trends are calculated, overfitting can be a problem with decomposition methods. That is, a large random error in only one period will influence the estimates of the seasonal and trend values. Therefore, when calculating seasonal indexes, it is important to eliminate outliers and extraordinarily high or low indexes; a modified mean or a median of the seasonal indexes should then be used. When de-seasonalizing data and calculating trends, division by abnormally low seasonal indexes might generate extremely large forecasts, which creates large errors when forecasting. Consequently, it is important to check for outliers in each step of the method. It is also very difficult to simultaneously decompose trend and seasonality in a series when only a few seasonal cycles exist.
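To make the six estimation steps concrete, the following sketch performs a heavily simplified classical (ratio-to-moving-average) decomposition; it is not the X-11 procedure itself, and the series, period length, and function name are illustrative assumptions.

```python
# Very simplified classical (ratio-to-moving-average) decomposition.
# Illustration of the six estimation steps only; NOT the X-11 procedure.
import numpy as np

def classical_decompose(y, period=4):
    y = np.asarray(y, dtype=float)
    half = period // 2
    # Step 1: estimate the trend-cycle TC with a centered moving average of Y.
    tc = np.convolve(y, np.ones(period) / period, mode="valid")
    y_mid = y[half: half + len(tc)]                  # actuals aligned with the TC estimate
    # Step 2: seasonal-irregular ratios SI = Y / TC.
    si = y_mid / tc
    # Steps 3-4: average SI season by season to get the seasonal indexes S
    # (a robust implementation would first trim extreme SI values).
    seasonal = np.array([si[i::period].mean() for i in range(period)])
    seasonal /= seasonal.mean()                      # normalize the indexes to average 1
    s = seasonal[(np.arange(len(y)) - half) % period]
    # Steps 5-6: de-seasonalized series TCI = Y / S; the leftover irregular is TCI / TC.
    tci = y / s
    return tc, s, tci

y = [10, 14, 8, 12, 11, 15, 9, 13, 12, 16, 10, 14]   # quarterly-like toy data
tc, s, tci = classical_decompose(y, period=4)
print(np.round(s[:4], 2))                             # one cycle of seasonal indexes
```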
Because of these problems (sensitivity to outliers and to short histories), the R² of the fitted decomposition model may be quite high relative to the actual R² values realized when forecasting. Also, the multiperiod-ahead forecasts of a decomposition model may be greatly influenced by future changes in the cyclical influences. Consequently, we forecast only one to two seasonal cycles into the future with decomposition methods, and very cautiously; we are even more cautious in forecasting several years into the future. Although this method has several disadvantages, it is useful in conjunction with other models of trend and cyclical variations. Thus, the more sophisticated the decomposition program, the better.

Fourier Series Analysis (FSA)

FSA is a method that models trend, seasonality, and cyclical movements using trigonometric sine and cosine functions. It is used in automated forecasting systems; however, it is not without its detractors. The combining function in Fourier series analysis is

Ŷ_t = a_0 + b_0 t + Σ_k [ a_k cos(kωt) + b_k sin(kωt) ]    (8)

where
Ŷ_t = fitted or forecasted value at time t
a_0 = constant used to set the level of the series
b_0 = trend estimate of the series
a_1, b_1, a_2, ... = coefficients defining the amplitudes and phases
ω = 2πf/n, known as omega
k = harmonic of ω
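A model of this form can be fitted by ordinary least squares on trigonometric regressors. The sketch below is a minimal illustration; the series, the number of harmonics, and the variable names are our assumptions.

```python
# Least-squares fit of a Fourier-series model (level + trend + k harmonics), cf. equation (8).
# The series, number of harmonics, and names are illustrative.
import numpy as np

def fit_fourier(y, n_harmonics=2):
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n)
    omega = 2 * np.pi / n                        # fundamental frequency
    cols = [np.ones(n), t]                       # a0 (level) and b0 (trend) terms
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(k * omega * t))       # a_k terms
        cols.append(np.sin(k * omega * t))       # b_k terms
    X = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs, X @ coeffs                    # coefficients and fitted values

y = 50 + 0.5 * np.arange(24) + 5 * np.sin(2 * np.pi * np.arange(24) / 12)
coeffs, fitted = fit_fourier(y, n_harmonics=2)
print(np.round(coeffs, 2))
```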

Disadvantages:

Overfitting and the number of parameters to include. As coefficients are added to a model, it becomes more complex, more time-consuming to estimate, and more accurate in fitting the past, but eventually less accurate in forecasting the future. The additional terms that increase historical fit might not improve accuracy, a situation called overfitting. Added trigonometric terms might only model events such as outliers that occurred in the past but are not going to occur in the future. Thus the significance test on the choice of frequencies to include in the model is an important tool in model selection; however, it is not an infallible one. Overfitting is a general problem no matter which forecasting approach is used.

Interpretability. The primary disadvantage of the FSA method is that it may be difficult to relate the model parameters to commonly used or intuitive descriptions of seasonal profiles. We have found graphs very useful in describing the seasonal behavior of FSA. Also, these seasonal profiles can be used to express the amplitude of each frequency as either an additive amount or a percentage of the trend.

Equal weight to all observations. An FSA model uses least squares regression to determine seasonality or cyclical variations. One of the attributes of least squares is that equal weight is given to all errors; a recent error (e.g., t = 46) is squared just as a distant error (e.g., t = 2) is. Consequently, FSA models do not give more weight to the more recent actuals than to the distant actuals, as is done in exponential smoothing. Exponential smoothing methods such as Winters' give more weight to the more recent actual values, and this is often advantageous because the recent past is normally a better predictor of the immediate future.

Advantages: An advantage of FSA over other trend-seasonal methods is that its coefficients are statistically independent of each other. In general, if all amplitudes are insignificant, then all will be dropped and the model will no longer include seasonal patterns. Finally, if the trend parameter is found to be insignificant, it too can be dropped to yield a simple level model. Most other methods require refitting the model when one or more terms is dropped. This makes FSA a versatile approach to modeling time series in operational forecasting systems. Periodograms are useful in discerning white noise, non-stationarity, and seasonality. However, FSA and periodograms are not commonly used in time series work, despite the fact that they provide useful information.

Auto-Regressive Integrated Moving Averages (ARIMA) [14][15][16][5][6]

ARIMA (Box-Jenkins) is a method that models a series using trend, seasonal, and smoothing coefficients based on moving averages, auto-regression, and difference equations. It is an accurate and very versatile approach. The purpose of time series analysis is to use the realization of a process (i.e., the sample) to identify a model of the ARIMA process (i.e., the population) that generates the series. The procedures used to build such models are broadly referred to as Box-Jenkins or ARIMA model-building methods. ARIMA model building is an empirically driven methodology of systematically identifying, estimating, diagnosing, and forecasting time series. The Box-Jenkins modeling procedure is shown in Fig. 1 in the appendix.
ARIMA(p, d, q) includes a class of models:

MA(q) = ARIMA(0, 0, q). Suppose that {Y_t} is a discrete purely random process with mean zero; then a process {X_t} is said to be a moving average process of order q (MA(q)) if

X_t = Y_t + β_1 Y_{t-1} + ... + β_q Y_{t-q}    (9)

where {β_i} are constants and {Y_t} is a sequence of random variables.

AR(p) = ARIMA(p, 0, 0). Suppose that {Y_t} is a discrete purely random process with mean zero; then a process {X_t} is said to be an autoregressive process of order p (AR(p)) if

X_t = a_1 X_{t-1} + ... + a_p X_{t-p} + Y_t    (10)

where {a_i} are constants; {X_t} is regressed on its own past values rather than on independent variables.

ARMA(p, q) = ARIMA(p, 0, q). Broadly speaking, an MA(q) process explains the present as the mixture of q random impulses, while an AR(p) process builds the present in terms of the past p events. A mixed auto-regressive moving-average process containing p AR terms and q MA terms is said to be an ARMA process of order (p, q). It is given by

X_t = a_1 X_{t-1} + ... + a_p X_{t-p} + Y_t + β_1 Y_{t-1} + ... + β_q Y_{t-q}    (11)

where {X_t} is the original series and {Y_t} is a series of unknown random errors assumed to follow the normal probability distribution.

ARIMA(p, d, q). Here d stands for differencing, which makes it possible to describe certain types of non-stationary time series. Such a model is called ARIMA (Auto-Regressive Integrated Moving Average) because the stationary model that is fitted to the differenced data has to be summed, or integrated, to provide a model for the non-stationary data. The general ARIMA(p, d, q) process is of the form

W_t = a_1 W_{t-1} + ... + a_p W_{t-p} + Y_t + β_1 Y_{t-1} + ... + β_q Y_{t-q}    (12)

where W_t is the d-th difference of the original series.
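Before turning to the practical limitations of the approach, the following sketch fits the autoregressive part (equation 10) by ordinary least squares on lagged values. A full Box-Jenkins treatment would also difference the data (the "I" in ARIMA) and estimate MA terms; the data and names here are illustrative.

```python
# Fitting an AR(p) model (equation 10) by ordinary least squares on lagged values.
# This sketch covers only the autoregressive part of ARIMA; names and data are illustrative.
import numpy as np

def fit_ar(x, p):
    """Return AR coefficients a_1..a_p estimated by least squares."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[p - i - 1: len(x) - i - 1] for i in range(p)])  # lags 1..p
    y = x[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def forecast_ar(x, coeffs):
    """One-step-ahead forecast a_1 * x[t] + a_2 * x[t-1] + ..."""
    p = len(coeffs)
    return float(np.dot(coeffs, np.asarray(x)[-1: -p - 1: -1]))

x = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 1.1, 1.2, 1.0, 1.4, 1.2, 1.3]
a = fit_ar(x, p=2)
print(np.round(a, 3), round(forecast_ar(x, a), 3))
```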

Limitations:

Short-term forecasting. We emphasize short-term forecasting because most ARIMA models place heavy emphasis on the recent past rather than the distant past. This emphasis on the recent past means that long-term forecasts from ARIMA models are less reliable than short-term forecasts.

Data types. The UBJ method applies to either discrete or continuous data; however, it deals only with data measured at equally spaced, discrete time intervals. Data measured at discrete time intervals can arise in two ways. First, a variable may be accumulated through time and the total recorded periodically. Second, data of this type can arise when a variable is sampled periodically. Suppose an investment analyst records the closing price of a stock at the end of each week; the variable is being sampled at an instant in time rather than being accumulated through time.

Sample size. Building an ARIMA model requires an adequate sample size; 50 observations is the minimum required number.

Stationary data series only. A stationary time series has a mean, variance, and autocorrelation function that are essentially constant through time.

Advantages: The concepts associated with UBJ models are derived from a solid foundation of classical probability theory and mathematical statistics. ARIMA models are a family of models, not just a single model; the family includes the AR, MA, and ARMA classes as special cases. That is why it is often called the best model in stochastic modeling: it combines many of the historically popular univariate methods. Box and Jenkins developed a strategy that guides the analyst in choosing one or more appropriate models out of this larger family. It can be shown that an appropriate ARIMA model produces optimal univariate forecasts; that is, no other standard single-series model can give forecasts with a smaller mean-squared forecast error (i.e., forecast error variance). In sum, there seems to be general agreement among knowledgeable professionals that properly built ARIMA models can handle a wider variety of situations and provide more accurate short-term forecasts than any other standard single-series technique. However, the construction of proper ARIMA models may require more experience and computer time than some historically popular univariate methods.

Other statistical / quantitative / data-mining methods

Multiple linear regression

The general multiple-regression model is

Y = a + b_1 X_1 + b_2 X_2 + ... + b_n X_n + e    (13)

The BLUE rules describe the desirable properties of the estimates: Best, meaning minimum variance (efficient); Linear, meaning linear in the coefficients, not necessarily in the variables; and Unbiased Estimates. The multiple-regression modeling process is shown in Fig. 4 in the appendix. This method is hard to implement in an automatic modeling system: it may face multicollinearity, serial correlation, heteroscedasticity, and similar problems, which require numerous tests and human judgment.

Induction / decision trees [1]

This is a method (also referred to as decision trees) that generates rules (decision rules) for predicting dependencies. It is commonly used in target marketing, credit risk analysis, fraud detection, help desk / problem diagnosis, and customer retention. The following simple credit risk analysis forecasting example shows how the method generates useful information for credit card companies (see Table 1 in the appendix). According to the table, we can construct the cross-tabulation table of independent vs. dependent columns for the root node (Table 2 in the appendix) and draw the decision tree (see Fig. 2 in the appendix).
So we can generate the decision rules:

If Debt = High, then Risk = Poor (20%)
If Debt = Low && Income = High, then Risk = Poor (20%)
If Debt = Low && Income = Low && Married = No, then Risk = Poor (20%)
If Debt = Low && Income = Low && Married = Yes && Credit Card = No, then Risk = Good (10%)

Factor selection may cause either under-fitting or over-fitting; we recommend using statistical tools to analyze the correlations between the independent and dependent variables. It is hard to apply this method directly to a time series data sequence. It corresponds to a kind of semi-strong form of the efficient market hypothesis, which states that the dependent variable fully reflects all relevant publicly available information (the independent variables) rather than only its own past series.
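The decision rules listed above translate directly into code. The sketch below scores one applicant against those rules; the field names and the fallback behavior are our own illustrative choices.

```python
# The four decision rules generated above, written as a scoring function.
# Field names and the "Unknown" fallback are illustrative, not from the paper.

def credit_risk(applicant):
    debt, income = applicant["debt"], applicant["income"]
    married, card = applicant["married"], applicant["credit_card"]
    if debt == "High":
        return "Poor"
    if debt == "Low" and income == "High":
        return "Poor"
    if debt == "Low" and income == "Low" and not married:
        return "Poor"
    if debt == "Low" and income == "Low" and married and not card:
        return "Good"
    return "Unknown"   # combinations not covered by the extracted rules

print(credit_risk({"debt": "Low", "income": "Low", "married": True, "credit_card": False}))
```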

Artificial Neural Networks (ANNs) [8][9][10][18]

ANNs are a highly accurate class of models for prediction and forecasting. We introduce ANNs and use one of the most commonly used, versatile learning architectures: the multi-layer feed-forward, back-propagation network. Known data is fed through nodes to train the model; once trained, new data is input and scored. ANNs are typically used in sales forecasting, fraud detection, target marketing, credit card validation, credit application acceptance, stock and bond price prediction, etc.

Steps in the development of an artificial neural network:

1) Determine the structure of the ANN based on some underlying theory about what influences the dependent variable. This step is the most important part of the forecasting procedure. In general, it involves choosing the input variables, the number of input nodes, the number of hidden layers and nodes, the transfer function type, and the number of output nodes.

2) Divide the input and output data into two groups: the first to be used to train the network, the second to validate the network in an out-of-sample experiment or forecast.

3) Scale all input variables and the desired output variable to the range 0 to 1.

4) Set initial weights and start a training epoch using the training data set by repeating steps (5) to (13). An epoch is the calculation of errors and the adjustment of weights obtained by processing (i.e., taking one pass through) all observations in the training set. The initial values can influence the speed and the RMS that result from the training process; most programs allow weights to be initialized with all zeros or with random numbers from, for example, -1 to 1.

5) Input the scaled variables.

6) Distribute the scaled inputs to each hidden node. In general, each hidden node receives all scaled input variables, which results in parallel processing of all inputs at multiple nodes. In some cases, these inputs are distributed to each output-layer node as well as to the hidden-layer nodes. Note that the output O_j of an input node equals its input I_j; thus O_0 = I_0 and O_1 = I_1.

7) Weight and sum the inputs at the receiving nodes. At each hidden node, the outputs of the input nodes are weighted and summed. Thus the input to node 3 is

I_3 = W_03 O_0 + W_13 O_1    (14)

8) Transform hidden-node inputs to outputs. At each hidden node, the weighted inputs are transformed into an output in the range 0 to 1:

O_3 = 1 / (1 + e^{-I_3})    (15)

where, as always, e is the natural number 2.71828...

9) Weight and sum the hidden-node outputs as inputs to the output nodes. The outputs of the hidden nodes and of any bias nodes are weighted and summed. As we will see, a bias node operates much like the constant term in regression analysis.

I_4 = W_24 O_2 + W_34 O_3    (16)

10) Transform the inputs at the output nodes. At output node 4, the weighted input I_4 is transformed into the output O_4 in the range 0 to 1. This is the final output of the ANN:

O_4 = 1 / (1 + e^{-I_4})    (17)

11) Calculate the output errors. The scaled output value O_4 is compared to the scaled desired output value D_4 and the error is calculated:

e_i = D_4 - O_4    (18)

where i is the observation number in the training set.

12) Backpropagate the errors to adjust the weights. Based on the errors of step (11), the weights throughout the network are modified so as to move toward minimizing the RMS value.

13) Continue the epoch. Repeat steps (5) to (12) for all observations in the input data set. As mentioned, each pass through all observations is called an epoch. When all observations have been processed (i.e., one epoch is completed), go to step (14).

14) Calculate the epoch RMS value of the errors.
If this RMS is low enough, stop and go to step (15). If the RMS is not low enough, repeat steps (5) to (14) until the stopping condition is reached:

RMS = sqrt( Σ e_i^2 / n_t )    (19)

where
RMS = approximately the residual standard error
e_i = error for each observation in the latest epoch
n_t = number of observations in the epoch / training set

15) Judge out-of-sample validity. Having trained the ANN using one set of input data, it should now be validated using out-of-sample data; that is, use the ANN trained in steps (1) to (14) to predict the withheld output variables. If the out-of-sample RMS is consistent with the training RMS, the model appears valid. If the model is not valid, repeat the experiment after you: a) try different initial values for the weights; b) redesign the ANN (i.e., use fewer or more layers or nodes); c) try a different ANN method; or d) reject ANNs as a viable method.

16) Use the model in forecasting, being careful to monitor it with tracking signals and other devices.

There are few well-established guidelines on choosing the best ANN architecture. The architecture includes all of the considerations above: the number of input variables and nodes, the number of hidden layers and nodes, and the number of output nodes, as well as the choice of transfer function and the values of other coefficients. In general, there is nothing that precludes having more than one hidden layer.
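The training loop of steps (5) to (14) can be summarized in a few lines of NumPy. The sketch below uses a single hidden layer with logistic transfer functions (equations 15 and 17), the error of equation (18), and the epoch RMS of equation (19); for brevity it updates the weights once per epoch over the whole batch and omits bias nodes. The architecture sizes, learning rate, and data are illustrative assumptions, not the paper's configuration.

```python
# Minimal feed-forward / back-propagation sketch of steps (5)-(14).
# Batch (once-per-epoch) weight updates, no bias nodes; all sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 2))                       # scaled inputs in [0, 1]
D = ((X[:, 0] + X[:, 1]) / 2).reshape(-1, 1)  # scaled desired outputs in [0, 1]

W1 = rng.uniform(-1, 1, (2, 2))               # input -> hidden weights
W2 = rng.uniform(-1, 1, (2, 1))               # hidden -> output weights
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):                     # one pass through all observations = one epoch
    H = sigmoid(X @ W1)                       # hidden outputs (eq. 15)
    O = sigmoid(H @ W2)                       # network output (eq. 17)
    E = D - O                                 # output errors (eq. 18)
    # Backpropagate: gradients of the squared error through the logistic units.
    dO = E * O * (1 - O)
    dH = (dO @ W2.T) * H * (1 - H)
    W2 += lr * H.T @ dO
    W1 += lr * X.T @ dH
    rms = np.sqrt(np.mean(E ** 2))            # epoch RMS (eq. 19)
    if rms < 0.01:                            # stop when the RMS is low enough
        break

print(epoch, round(float(rms), 4))
```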

It is hard to determine the number of hidden layers that gives the best-fitted model, so it is possible to end up with a model that is too complex or too simple. If the ANN is too complex, the fitted weights will match the input data too closely instead of modeling the underlying patterns; fortunately, this overfitting can be detected during step 15. If the neural network is too simplistic, the training and validation RMS values will both be too high.

COMPARISONS AND ANALYSIS

As we have described in the previous sections, there is no single best forecasting model. Each model may best fit a specific situation characterized by horizon length, degree of automation in development, pattern-recognition ability, number of observations required, etc. Table 4 and Table 5 in the appendix show a comparison of the different forecasting models. Every model has its own advantages, disadvantages, and constraints.

ANN models, like all forecasting methods, are not always the best approach to forecasting. In fact, the experience to date gives mixed reviews of the effectiveness of ANNs. Sharda and Patil (1990) used 75 time series from the M-competition and compared their results to those of an expert ARIMA forecasting system. Out of the 75 series, ANNs performed better on 39 series and worse on 36. We believe that it is difficult to consistently outperform a good human ARIMA model builder (which implies moderately heavy use of skilled manpower). In general, smoothing methods such as Winters', Fourier series analysis, and ARIMA can be more accurate for immediate- to intermediate-term forecasting than multivariate methods. Of course, the virtue of having a broad class of models from which to choose is that, on average, this should lead to more accurate forecasts. Certainly it would be expected that a sensible use of the ARIMA model-building approach should, in the aggregate, produce forecasts of higher quality than those deriving from the imposition of a single model structure on every time series. However, empirical studies by Makridakis and Hibon (1979) and by Makridakis et al. (1982) report little or no improvement in forecast accuracy when using ARIMA models rather than some simpler extrapolative techniques.

The degree to which the estimation of forecasting methods can be automated decreases for the methods lower in Fig. 3 in the appendix. This is because the complexity and subjectivity of the forecasting methods evolve from the highly programmable to the ill-structured qualitative methods; as the complexity of a method increases, its automation becomes less likely (e.g., it is not normally possible or cost-effective to automate multiple-regression modeling methods). The strengths of the ARIMA model-building approach lie in its flexibility, the relative accuracy of the resulting forecasts, and the possibility of extension to the analysis of multiple time series. It is not easy to develop automatic ARIMA model-building programs that can produce satisfactory forecasts.

PREDICTION MODELING SYSTEM

We have done extensive research and developed case studies on how to choose a model based on individual circumstances when designing a new prediction system, and we have developed a new methodology for a prediction modeling system. The modeling system is based on a couple of assumptions. First, heavy computation requirements are not a great concern: we can sacrifice a little more computation time in exchange for more accurate forecasting.
This is analogous to sacrificing a little more hard disk space in exchange for the higher speed and stability of an operating system. Second, according to Table 5 in the appendix, we recommend that at least 60 observations (i.e., five seasons) be stored for monthly series, regardless of the forecasting method.

Because there is no single best forecasting method, the methods used in good systems are, in general, less important than their intelligent usage. Several very good methodologies have been executed very poorly in automated systems; a good implementation of a less effective method can outperform a poor implementation of the best method. From the analysis and comparison of the different forecasting models, we can classify the models into different patterns (see Table 3 in the appendix). Based on that table, in order to simplify the problem, we set up a rule-based pattern table for classification and selected the following forecasting methods for building our demo prediction modeling system: SES, Holt/Winters, ARIMA, and neural networks.

Now let us look at our prediction modeling system procedure (Fig. 3 in the appendix):

1. Pattern Identification Module (PIM)

This module uses statistical tools to analyze the data sequence first for pattern classification. We built the autocorrelation function (ACF) and partial autocorrelation function (PACF) to facilitate this objective and use t-values to test the coefficients. The ACF measures the direction and strength of the statistical relationship between ordered pairs of observations on two random variables; it measures how closely the matched pairs are related to each other. The PACF measures the correlation between ordered pairs separated by various time spans (k = 1, 2, 3, ...) with the effects of intervening observations accounted for. We use the total sample size divided by 4 as the number of lags.
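A minimal sketch of the sample ACF used by the PIM is shown below. The n/4 lag rule follows the text; the data, names, and the approximate t-value check are our illustrative assumptions.

```python
# Sample autocorrelation function (ACF) as used by the Pattern Identification Module.
# The n/4 lag rule follows the text; the data and the t-value approximation are illustrative.
import numpy as np

def acf(y, max_lag=None):
    y = np.asarray(y, dtype=float)
    n = len(y)
    if max_lag is None:
        max_lag = n // 4                      # number of lags = sample size / 4
    y = y - y.mean()
    denom = np.sum(y ** 2)
    return np.array([np.sum(y[k:] * y[:n - k]) / denom for k in range(1, max_lag + 1)])

y = [12, 14, 13, 15, 16, 15, 17, 18, 17, 19, 20, 19, 21, 22, 21, 23]
r = acf(y)
print(np.round(r, 2))
# Approximate t-values for testing each coefficient against zero: r_k * sqrt(n).
print(np.round(r * np.sqrt(len(y)), 2))
```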

The PIM is highly important because it is the milestone of the whole modeling system. We also calculate the mean, standard deviation, etc., for later use, and perform a RUNS test (if applicable) to measure the likelihood that the series is a random occurrence.

2. Pattern Comparison Module (PCM)

This module performs the pattern comparison using a back-propagation neural network. We use a neural network instead of human experts to learn from and train on the statistical results from the PIM, eliminate outliers, and output the trend of the statistical data sequence as added pattern information. This neural-network-based module can intelligently generate pattern information such as "no season", "with trend", "stationary", etc.

3. Model Selection Module (MSM)

This module selects the appropriate available model(s) from the pattern information generated by the PCM, based on the rule-based pattern table. For example, for a "no season, with trend, stationary" time series data sequence, Holt's two-parameter trend model may be the right choice. This module is a pure knowledge-based expert system.

4. Error Check Module (ECM)

This module runs residual statistical analysis to see whether the errors are within an acceptable range. If so, we can accept the forecasting results. If not, we have to repeat the whole procedure, comparing the following statistical criteria: residual standard error (RSE), mean square error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) (see the Formulas in the appendix), until the errors fall into the acceptable range.

3.5 Model Comparison Module

This module performs model comparison only when two or more model candidates have been selected by the MSM. It then runs the candidate models and compares their forecasting accuracy, again by residual analysis. A good forecasting model should not consistently over- or under-forecast; consequently, the mean error should not vary greatly from zero. The sum of the squared errors is used to calculate the residual standard error as in equation (22) in the Formulas in the appendix. It is used for calculating the standard deviation of the errors, commonly called the residual standard error or standard error of the estimate; we use the term residual standard error (RSE) to denote this concept. The k value in equation (22) is the usual degrees-of-freedom adjustment that makes the RSE a better estimate of the true population RSE. As shown, the RSE is different from the usual standard deviation: the RSE is a standard deviation that assumes the mean error is equal to zero, whether it is or not.

The selection of the best forecasting model is considerably more complex than choosing the model with the minimum RSE; there are a variety of criteria used to choose one forecasting model over another. The most important general objective in forecasting is to decrease the width of the confidence intervals used to make probability statements about future values. A probability statement for the next actual value is: Actual = Forecast ± Z · RSE. The probability that the actual value will lie in this interval is determined by the t- or Z-value. If Z = 1.96 is used and the errors are normally distributed, approximately 95% of the actual values for one-period-ahead forecasts will fall in this range, assuming the model of the past is accurate and the past repeats itself.
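The residual statistics compared by the ECM can be sketched as follows; k is the number of fitted parameters, and the function name and example values are our illustrative assumptions.

```python
# Residual statistics used by the Error Check Module: MSE, RMSE, MAPE, and the residual
# standard error RSE with the degrees-of-freedom adjustment (equation 22).
# Names and the example data are illustrative.
import numpy as np

def error_stats(actual, forecast, k=1):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    e = actual - forecast
    n = len(e)
    mse = np.mean(e ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(e / actual)) * 100
    rse = np.sqrt(np.sum(e ** 2) / (n - k))        # assumes the mean error is zero
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "RSE": rse}

stats = error_stats([100, 102, 105, 103], [98, 103, 104, 101], k=2)
print({name: round(v, 2) for name, v in stats.items()})
# A rough 95% interval for the next actual value: forecast +/- 1.96 * RSE.
```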
As discussed above, our new prediction modeling system combines an expert system (ES), a conventional program system (CPS), and a neural network (NN), together with extensive statistics and computation. Our effort is to build an automatic forecasting modeling system that can be applied to any data sequence under different situations and conditions.

CONCLUSION AND FURTHER WORK

As we mentioned before, there is no one method that is best for all series or all forecast horizons; some methods are better for short-term patterns, others at long horizons. Today's world is a digital world, and computer-based automatic prediction modeling systems are the trend. We have pointed out that each forecasting method has its own advantages and disadvantages, and we have also defined constraints for the different prediction models. However, we cannot rely solely on Fig. 4 and Fig. 5 in the appendix, the 24 models of the M-Competition, or even our extensive hands-on experience in building and testing different models, because average rankings can be misleading, and by choosing different models for different horizons one can dramatically improve accuracy. Thus, how to make such intelligence-based model selections and how to combine different models seamlessly to solve more complex problems may bring a breakthrough in the development of prediction modeling systems.

REFERENCES

[1] A. Bellacicco, 2000, "A new insight into the algebraic structure of the exponential smoothing algorithm of Brown", in: N. Ebecken and C. A. Brebbia (Eds.), Data Mining II, WIT Press, Southampton, UK.
[2] A. Dhond, A. Gupta, S. Vadhavkar, 2000, "Data Mining Techniques for Optimizing Inventories for Electronic Commerce", KDD, Boston, MA, USA.

[3] D. Rafiei and A. O. Mendelzon, 2000, "Querying Time Series Data Based on Similarity", IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 5.
[4] G. Antoniol, G. Casazza, G. Di Lucca, M. Di Penta, E. Merlo, 2001, "Predicting Web Site Access: an Application of Time Series", The 3rd International Workshop on Web Site Evolution (WSE'01), IEEE.
[5] G. Box, G. M. Jenkins and G. C. Reinsel, 1994, Time Series Analysis: Forecasting and Control, 3rd ed., Prentice Hall.
[6] G. Box and M. Jenkins, 1970, Time Series Analysis: Forecasting and Control, Holden Day, San Francisco, USA.
[7] Hillol Kargupta et al., 2001, "MobiMine: Monitoring the Stock Market from a PDA", SIGKDD Explorations, Vol. 3, Issue 2.
[8] J. Neves and P. Cortez, 1997, "An Artificial Neural Network Genetic Based Approach for Time Series Forecasting", The 4th Brazilian Symposium on Neural Networks (SBRN'97).
[9] J. Yao and C. L. Tan, 2000, "Time Dependent Directional Profit Model for Financial Time Series Forecasting", IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00).
[10] K. Bansal, S. Vadhavkar, A. Gupta, 1998, "Neural Networks Based Forecasting Techniques for Inventory Control Applications" (brief application description), Data Mining and Knowledge Discovery, Vol. 2.
[11] Makridakis, S., et al., 1982, "The Accuracy of Extrapolation (Time Series) Methods: Results of a Forecasting Competition", Journal of Forecasting, Vol. 1.
[12] Makridakis, S., S. C. Wheelwright, V. E. McGee, 1983, Forecasting Methods and Applications, 2nd ed., New York: John Wiley & Sons.
[13] Makridakis, S., and M. Hibon, 1979, "Accuracy of Forecasting: An Empirical Investigation (with discussion)", Journal of the Royal Statistical Society A, No. 14 (Part 2).
[14] Pankratz, Alan, 1983, Forecasting with Univariate Box-Jenkins Models.
[15] Paul Newbold, 1983, "ARIMA Model Building and the Time Series Analysis Approach to Forecasting", Journal of Forecasting, Vol. 2.
[16] Paul Newbold and J. Kenton Zumwalt, 1987, "Combining Forecasts to Improve Earnings Per Share Prediction", International Journal of Forecasting, Vol. 3.
[17] R. G. Brown, 1963, Smoothing, Forecasting and Prediction, Prentice Hall.
[18] Sharda, R., and R. Patil, 1990, "Neural Networks as Forecasting Experts: An Empirical Test", International Joint Conference on Neural Networks, Washington, D.C.: IEEE.
[19] Shiskin, J., A. H. Young, and J. C. Musgrave, 1967, "The X-11 Variant of the Census Method II Seasonal Adjustment Program", Technical Paper No. 15, U.S. Department of Commerce, Bureau of Economic Analysis.
[20] Stephen A. DeLurgio, 1998, Forecasting Principles and Applications, International Editions, McGraw-Hill.

FIGURES AND TABLES

Fig. 1 The Box-Jenkins modeling procedure
Table 1 Credit risk analysis for credit card companies
Table 2 Cross-tabulation table of independent vs. dependent columns

Fig. 2 Decision tree for credit risk analysis
Fig. 3 Prediction modeling system procedure
Fig. 4 Multiple regression modeling process [20]

Table 3 Model pattern classifications
Table 4 Comparisons of forecasting methods [20]
Table 5 Comparisons of forecasting methods (cont'd) [20]

FORMULAS

Frequently used general statistics: mean, variance, standard deviation, covariance, correlation, t-test, Z-normal, F-test, p-value, ACF, PACF, DW, VIF, RUNS test.

Residual / error statistics used in smoothing include the residual standard error with its degrees-of-freedom adjustment,

RSE = sqrt( Σ (Y_t - Ŷ_t)^2 / (n - k) )    (22)

In ARIMA: mean square error; residual autocorrelation function (RACF); chi-square test; RMSE (root mean squared error); MAPE (mean absolute percentage error); Akaike information criterion: AIC = n log(SSE) + 2k (28); Schwarz Bayesian information criterion: BIC = n log(SSE) + k log(n) (29).

In multiple regression: deviation DEV = [Y_i - E(Y)] (30); standard error of the regression coefficients; partial F-test for restricted and unrestricted models, F_calculated = (SSE_R - SSE_U) / m (35).

In neural networks: RSE (see equation 22 above).


More information

Forecasting: Methods and Applications

Forecasting: Methods and Applications Neapolis University HEPHAESTUS Repository School of Economic Sciences and Business http://hephaestus.nup.ac.cy Books 1998 Forecasting: Methods and Applications Makridakis, Spyros John Wiley & Sons, Inc.

More information

Econometric Forecasting

Econometric Forecasting Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 1, 2014 Outline Introduction Model-free extrapolation Univariate time-series models Trend

More information

TIME SERIES DATA PREDICTION OF NATURAL GAS CONSUMPTION USING ARIMA MODEL

TIME SERIES DATA PREDICTION OF NATURAL GAS CONSUMPTION USING ARIMA MODEL International Journal of Information Technology & Management Information System (IJITMIS) Volume 7, Issue 3, Sep-Dec-2016, pp. 01 07, Article ID: IJITMIS_07_03_001 Available online at http://www.iaeme.com/ijitmis/issues.asp?jtype=ijitmis&vtype=7&itype=3

More information

Chapter 8 - Forecasting

Chapter 8 - Forecasting Chapter 8 - Forecasting Operations Management by R. Dan Reid & Nada R. Sanders 4th Edition Wiley 2010 Wiley 2010 1 Learning Objectives Identify Principles of Forecasting Explain the steps in the forecasting

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

TRANSFER FUNCTION MODEL FOR GLOSS PREDICTION OF COATED ALUMINUM USING THE ARIMA PROCEDURE

TRANSFER FUNCTION MODEL FOR GLOSS PREDICTION OF COATED ALUMINUM USING THE ARIMA PROCEDURE TRANSFER FUNCTION MODEL FOR GLOSS PREDICTION OF COATED ALUMINUM USING THE ARIMA PROCEDURE Mozammel H. Khan Kuwait Institute for Scientific Research Introduction The objective of this work was to investigate

More information

Some Time-Series Models

Some Time-Series Models Some Time-Series Models Outline 1. Stochastic processes and their properties 2. Stationary processes 3. Some properties of the autocorrelation function 4. Some useful models Purely random processes, random

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION 4.1 Overview This chapter contains the description about the data that is used in this research. In this research time series data is used. A time

More information

Read Section 1.1, Examples of time series, on pages 1-8. These example introduce the book; you are not tested on them.

Read Section 1.1, Examples of time series, on pages 1-8. These example introduce the book; you are not tested on them. TS Module 1 Time series overview (The attached PDF file has better formatting.)! Model building! Time series plots Read Section 1.1, Examples of time series, on pages 1-8. These example introduce the book;

More information

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS Moving Averages and Smoothing Methods ECON 504 Chapter 7 Fall 2013 Dr. Mohammad Zainal 2 This chapter will describe three simple approaches to forecasting

More information

Development of Demand Forecasting Models for Improved Customer Service in Nigeria Soft Drink Industry_ Case of Coca-Cola Company Enugu

Development of Demand Forecasting Models for Improved Customer Service in Nigeria Soft Drink Industry_ Case of Coca-Cola Company Enugu International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 882 Volume 5, Issue 4, April 26 259 Development of Demand Forecasting Models for Improved Customer Service in Nigeria

More information

Empirical Approach to Modelling and Forecasting Inflation in Ghana

Empirical Approach to Modelling and Forecasting Inflation in Ghana Current Research Journal of Economic Theory 4(3): 83-87, 2012 ISSN: 2042-485X Maxwell Scientific Organization, 2012 Submitted: April 13, 2012 Accepted: May 06, 2012 Published: June 30, 2012 Empirical Approach

More information

Evaluation of Some Techniques for Forecasting of Electricity Demand in Sri Lanka

Evaluation of Some Techniques for Forecasting of Electricity Demand in Sri Lanka Appeared in Sri Lankan Journal of Applied Statistics (Volume 3) 00 Evaluation of Some echniques for Forecasting of Electricity Demand in Sri Lanka.M.J. A. Cooray and M.Indralingam Department of Mathematics

More information

Dynamic Time Series Regression: A Panacea for Spurious Correlations

Dynamic Time Series Regression: A Panacea for Spurious Correlations International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016 337 Dynamic Time Series Regression: A Panacea for Spurious Correlations Emmanuel Alphonsus Akpan *, Imoh

More information

Suan Sunandha Rajabhat University

Suan Sunandha Rajabhat University Forecasting Exchange Rate between Thai Baht and the US Dollar Using Time Series Analysis Kunya Bowornchockchai Suan Sunandha Rajabhat University INTRODUCTION The objective of this research is to forecast

More information

Cyclical Effect, and Measuring Irregular Effect

Cyclical Effect, and Measuring Irregular Effect Paper:15, Quantitative Techniques for Management Decisions Module- 37 Forecasting & Time series Analysis: Measuring- Seasonal Effect, Cyclical Effect, and Measuring Irregular Effect Principal Investigator

More information

An approach to make statistical forecasting of products with stationary/seasonal patterns

An approach to make statistical forecasting of products with stationary/seasonal patterns An approach to make statistical forecasting of products with stationary/seasonal patterns Carlos A. Castro-Zuluaga (ccastro@eafit.edu.co) Production Engineer Department, Universidad Eafit Medellin, Colombia

More information

FORECASTING OF COTTON PRODUCTION IN INDIA USING ARIMA MODEL

FORECASTING OF COTTON PRODUCTION IN INDIA USING ARIMA MODEL FORECASTING OF COTTON PRODUCTION IN INDIA USING ARIMA MODEL S.Poyyamozhi 1, Dr. A. Kachi Mohideen 2. 1 Assistant Professor and Head, Department of Statistics, Government Arts College (Autonomous), Kumbakonam

More information

STAT 115: Introductory Methods for Time Series Analysis and Forecasting. Concepts and Techniques

STAT 115: Introductory Methods for Time Series Analysis and Forecasting. Concepts and Techniques STAT 115: Introductory Methods for Time Series Analysis and Forecasting Concepts and Techniques School of Statistics University of the Philippines Diliman 1 FORECASTING Forecasting is an activity that

More information

Improved Holt Method for Irregular Time Series

Improved Holt Method for Irregular Time Series WDS'08 Proceedings of Contributed Papers, Part I, 62 67, 2008. ISBN 978-80-7378-065-4 MATFYZPRESS Improved Holt Method for Irregular Time Series T. Hanzák Charles University, Faculty of Mathematics and

More information

Decision 411: Class 9. HW#3 issues

Decision 411: Class 9. HW#3 issues Decision 411: Class 9 Presentation/discussion of HW#3 Introduction to ARIMA models Rules for fitting nonseasonal models Differencing and stationarity Reading the tea leaves : : ACF and PACF plots Unit

More information

Dennis Bricker Dept of Mechanical & Industrial Engineering The University of Iowa. Forecasting demand 02/06/03 page 1 of 34

Dennis Bricker Dept of Mechanical & Industrial Engineering The University of Iowa. Forecasting demand 02/06/03 page 1 of 34 demand -5-4 -3-2 -1 0 1 2 3 Dennis Bricker Dept of Mechanical & Industrial Engineering The University of Iowa Forecasting demand 02/06/03 page 1 of 34 Forecasting is very difficult. especially about the

More information

Forecasting the Prices of Indian Natural Rubber using ARIMA Model

Forecasting the Prices of Indian Natural Rubber using ARIMA Model Available online at www.ijpab.com Rani and Krishnan Int. J. Pure App. Biosci. 6 (2): 217-221 (2018) ISSN: 2320 7051 DOI: http://dx.doi.org/10.18782/2320-7051.5464 ISSN: 2320 7051 Int. J. Pure App. Biosci.

More information

Austrian Inflation Rate

Austrian Inflation Rate Austrian Inflation Rate Course of Econometric Forecasting Nadir Shahzad Virkun Tomas Sedliacik Goal and Data Selection Our goal is to find a relatively accurate procedure in order to forecast the Austrian

More information

Introduction to Forecasting

Introduction to Forecasting Introduction to Forecasting Introduction to Forecasting Predicting the future Not an exact science but instead consists of a set of statistical tools and techniques that are supported by human judgment

More information

Box-Jenkins ARIMA Advanced Time Series

Box-Jenkins ARIMA Advanced Time Series Box-Jenkins ARIMA Advanced Time Series www.realoptionsvaluation.com ROV Technical Papers Series: Volume 25 Theory In This Issue 1. Learn about Risk Simulator s ARIMA and Auto ARIMA modules. 2. Find out

More information

FORECASTING YIELD PER HECTARE OF RICE IN ANDHRA PRADESH

FORECASTING YIELD PER HECTARE OF RICE IN ANDHRA PRADESH International Journal of Mathematics and Computer Applications Research (IJMCAR) ISSN 49-6955 Vol. 3, Issue 1, Mar 013, 9-14 TJPRC Pvt. Ltd. FORECASTING YIELD PER HECTARE OF RICE IN ANDHRA PRADESH R. RAMAKRISHNA

More information

MGR-815. Notes for the MGR-815 course. 12 June School of Superior Technology. Professor Zbigniew Dziong

MGR-815. Notes for the MGR-815 course. 12 June School of Superior Technology. Professor Zbigniew Dziong Modeling, Estimation and Control, for Telecommunication Networks Notes for the MGR-815 course 12 June 2010 School of Superior Technology Professor Zbigniew Dziong 1 Table of Contents Preface 5 1. Example

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

The Identification of ARIMA Models

The Identification of ARIMA Models APPENDIX 4 The Identification of ARIMA Models As we have established in a previous lecture, there is a one-to-one correspondence between the parameters of an ARMA(p, q) model, including the variance of

More information

On the benefit of using time series features for choosing a forecasting method

On the benefit of using time series features for choosing a forecasting method On the benefit of using time series features for choosing a forecasting method Christiane Lemke and Bogdan Gabrys Bournemouth University - School of Design, Engineering and Computing Poole House, Talbot

More information

FE570 Financial Markets and Trading. Stevens Institute of Technology

FE570 Financial Markets and Trading. Stevens Institute of Technology FE570 Financial Markets and Trading Lecture 5. Linear Time Series Analysis and Its Applications (Ref. Joel Hasbrouck - Empirical Market Microstructure ) Steve Yang Stevens Institute of Technology 9/25/2012

More information

ANN and Statistical Theory Based Forecasting and Analysis of Power System Variables

ANN and Statistical Theory Based Forecasting and Analysis of Power System Variables ANN and Statistical Theory Based Forecasting and Analysis of Power System Variables Sruthi V. Nair 1, Poonam Kothari 2, Kushal Lodha 3 1,2,3 Lecturer, G. H. Raisoni Institute of Engineering & Technology,

More information

3 Time Series Regression

3 Time Series Regression 3 Time Series Regression 3.1 Modelling Trend Using Regression Random Walk 2 0 2 4 6 8 Random Walk 0 2 4 6 8 0 10 20 30 40 50 60 (a) Time 0 10 20 30 40 50 60 (b) Time Random Walk 8 6 4 2 0 Random Walk 0

More information

arxiv: v1 [stat.me] 5 Nov 2008

arxiv: v1 [stat.me] 5 Nov 2008 arxiv:0811.0659v1 [stat.me] 5 Nov 2008 Estimation of missing data by using the filtering process in a time series modeling Ahmad Mahir R. and Al-khazaleh A. M. H. School of Mathematical Sciences Faculty

More information

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle AUTOREGRESSIVE LINEAR MODELS AR(1) MODELS The zero-mean AR(1) model x t = x t,1 + t is a linear regression of the current value of the time series on the previous value. For > 0 it generates positively

More information

Univariate linear models

Univariate linear models Univariate linear models The specification process of an univariate ARIMA model is based on the theoretical properties of the different processes and it is also important the observation and interpretation

More information

Forecasting Area, Production and Yield of Cotton in India using ARIMA Model

Forecasting Area, Production and Yield of Cotton in India using ARIMA Model Forecasting Area, Production and Yield of Cotton in India using ARIMA Model M. K. Debnath 1, Kartic Bera 2 *, P. Mishra 1 1 Department of Agricultural Statistics, Bidhan Chanda Krishi Vishwavidyalaya,

More information

Decision 411: Class 3

Decision 411: Class 3 Decision 411: Class 3 Discussion of HW#1 Introduction to seasonal models Seasonal decomposition Seasonal adjustment on a spreadsheet Forecasting with seasonal adjustment Forecasting inflation Log transformation

More information

ISSN Original Article Statistical Models for Forecasting Road Accident Injuries in Ghana.

ISSN Original Article Statistical Models for Forecasting Road Accident Injuries in Ghana. Available online at http://www.urpjournals.com International Journal of Research in Environmental Science and Technology Universal Research Publications. All rights reserved ISSN 2249 9695 Original Article

More information

The ARIMA Procedure: The ARIMA Procedure

The ARIMA Procedure: The ARIMA Procedure Page 1 of 120 Overview: ARIMA Procedure Getting Started: ARIMA Procedure The Three Stages of ARIMA Modeling Identification Stage Estimation and Diagnostic Checking Stage Forecasting Stage Using ARIMA Procedure

More information

Decision 411: Class 3

Decision 411: Class 3 Decision 411: Class 3 Discussion of HW#1 Introduction to seasonal models Seasonal decomposition Seasonal adjustment on a spreadsheet Forecasting with seasonal adjustment Forecasting inflation Poor man

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis Introduction to Time Series Analysis 1 Contents: I. Basics of Time Series Analysis... 4 I.1 Stationarity... 5 I.2 Autocorrelation Function... 9 I.3 Partial Autocorrelation Function (PACF)... 14 I.4 Transformation

More information

Autoregressive Integrated Moving Average Model to Predict Graduate Unemployment in Indonesia

Autoregressive Integrated Moving Average Model to Predict Graduate Unemployment in Indonesia DOI 10.1515/ptse-2017-0005 PTSE 12 (1): 43-50 Autoregressive Integrated Moving Average Model to Predict Graduate Unemployment in Indonesia Umi MAHMUDAH u_mudah@yahoo.com (State Islamic University of Pekalongan,

More information

Available online at ScienceDirect. Procedia Computer Science 72 (2015 )

Available online at  ScienceDirect. Procedia Computer Science 72 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 72 (2015 ) 630 637 The Third Information Systems International Conference Performance Comparisons Between Arima and Arimax

More information

Estimation and application of best ARIMA model for forecasting the uranium price.

Estimation and application of best ARIMA model for forecasting the uranium price. Estimation and application of best ARIMA model for forecasting the uranium price. Medeu Amangeldi May 13, 2018 Capstone Project Superviser: Dongming Wei Second reader: Zhenisbek Assylbekov Abstract This

More information

Decision 411: Class 3

Decision 411: Class 3 Decision 411: Class 3 Discussion of HW#1 Introduction to seasonal models Seasonal decomposition Seasonal adjustment on a spreadsheet Forecasting with seasonal adjustment Forecasting inflation Poor man

More information

FORECASTING OF ECONOMIC QUANTITIES USING FUZZY AUTOREGRESSIVE MODEL AND FUZZY NEURAL NETWORK

FORECASTING OF ECONOMIC QUANTITIES USING FUZZY AUTOREGRESSIVE MODEL AND FUZZY NEURAL NETWORK FORECASTING OF ECONOMIC QUANTITIES USING FUZZY AUTOREGRESSIVE MODEL AND FUZZY NEURAL NETWORK Dusan Marcek Silesian University, Institute of Computer Science Opava Research Institute of the IT4Innovations

More information

FORECASTING FLUCTUATIONS OF ASPHALT CEMENT PRICE INDEX IN GEORGIA

FORECASTING FLUCTUATIONS OF ASPHALT CEMENT PRICE INDEX IN GEORGIA FORECASTING FLUCTUATIONS OF ASPHALT CEMENT PRICE INDEX IN GEORGIA Mohammad Ilbeigi, Baabak Ashuri, Ph.D., and Yang Hui Economics of the Sustainable Built Environment (ESBE) Lab, School of Building Construction

More information

Basics: Definitions and Notation. Stationarity. A More Formal Definition

Basics: Definitions and Notation. Stationarity. A More Formal Definition Basics: Definitions and Notation A Univariate is a sequence of measurements of the same variable collected over (usually regular intervals of) time. Usual assumption in many time series techniques is that

More information

Advances in promotional modelling and analytics

Advances in promotional modelling and analytics Advances in promotional modelling and analytics High School of Economics St. Petersburg 25 May 2016 Nikolaos Kourentzes n.kourentzes@lancaster.ac.uk O u t l i n e 1. What is forecasting? 2. Forecasting,

More information

Exponential smoothing in the telecommunications data

Exponential smoothing in the telecommunications data Available online at www.sciencedirect.com International Journal of Forecasting 24 (2008) 170 174 www.elsevier.com/locate/ijforecast Exponential smoothing in the telecommunications data Everette S. Gardner

More information

10. Time series regression and forecasting

10. Time series regression and forecasting 10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the

More information

Page No. (and line no. if applicable):

Page No. (and line no. if applicable): COALITION/IEC (DAYMARK LOAD) - 1 COALITION/IEC (DAYMARK LOAD) 1 Tab and Daymark Load Forecast Page No. Page 3 Appendix: Review (and line no. if applicable): Topic: Price elasticity Sub Topic: Issue: Accuracy

More information

On Consistency of Tests for Stationarity in Autoregressive and Moving Average Models of Different Orders

On Consistency of Tests for Stationarity in Autoregressive and Moving Average Models of Different Orders American Journal of Theoretical and Applied Statistics 2016; 5(3): 146-153 http://www.sciencepublishinggroup.com/j/ajtas doi: 10.11648/j.ajtas.20160503.20 ISSN: 2326-8999 (Print); ISSN: 2326-9006 (Online)

More information

Inflation Revisited: New Evidence from Modified Unit Root Tests

Inflation Revisited: New Evidence from Modified Unit Root Tests 1 Inflation Revisited: New Evidence from Modified Unit Root Tests Walter Enders and Yu Liu * University of Alabama in Tuscaloosa and University of Texas at El Paso Abstract: We propose a simple modification

More information

Introduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017

Introduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017 Introduction to Regression Analysis Dr. Devlina Chatterjee 11 th August, 2017 What is regression analysis? Regression analysis is a statistical technique for studying linear relationships. One dependent

More information

Lab: Box-Jenkins Methodology - US Wholesale Price Indicator

Lab: Box-Jenkins Methodology - US Wholesale Price Indicator Lab: Box-Jenkins Methodology - US Wholesale Price Indicator In this lab we explore the Box-Jenkins methodology by applying it to a time-series data set comprising quarterly observations of the US Wholesale

More information