Scenario 5: Internet Usage Solution

More information about the study would help us judge whether any findings can be generalized. For example: Does each data point consist of the total number of users that logged in during a one-minute period, or is it an instantaneous measurement at some point during the minute? Do users have to be active in order to count, or is being connected sufficient?

Recall that linear regression requires the observations to be independent. In this case, that assumption does not hold: consecutive observations are correlated with each other, which is called autocorrelation. We will therefore use a model that is very common in time series analysis, the ARIMA(p,d,q) (autoregressive integrated moving average) model, where p is the number of autoregressive terms, d is the number of nonseasonal differences, and q is the number of moving average terms. The ARIMA(p,d,q) model is given by

$$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d X_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) a_t$$

where $B$ is the backshift operator with $B^i X_t = X_{t-i}$, $\phi_i$ are the autoregressive parameters, $\theta_j$ are the moving average parameters, and $a_t$ are independent, normally distributed errors with mean zero and constant variance. (Some texts and software write the moving average terms with a minus sign instead.) The figure below shows a time plot of the dataset.

Figure: Time plot of the number of internet users (x-axis: Minutes)
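To make the ARIMA(p,d,q) notation concrete, the following sketch simulates a series from the model using only the Python standard library. It assumes that the moving average terms enter with a plus sign, that is, w_t = sum(phi_i * w_{t-i}) + a_t + sum(theta_j * a_{t-j}) for the d-times differenced series w_t, and that the errors have unit variance. The function name `simulate_arima` and the parameter values are illustrative and are not taken from the study.

```python
import random


def simulate_arima(phi, theta, d, n, seed=0):
    """Simulate n values of an ARIMA(p,d,q) series (illustrative sketch).

    phi   -- autoregressive parameters [phi_1, ..., phi_p]
    theta -- moving average parameters [theta_1, ..., theta_q]
    d     -- number of nonseasonal differences
    Returns (x, a): the simulated series and the errors that drove it.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed unit-variance errors

    # Stationary ARMA part:
    #   w_t = sum(phi_i * w_{t-i}) + a_t + sum(theta_j * a_{t-j})
    # Terms that would reach before the start of the series are treated as 0.
    w = []
    for t in range(n):
        ar = sum(p_i * w[t - i] for i, p_i in enumerate(phi, start=1) if t - i >= 0)
        ma = sum(q_j * a[t - j] for j, q_j in enumerate(theta, start=1) if t - j >= 0)
        w.append(ar + a[t] + ma)

    # Undo the d differences by cumulative summation ("integration").
    x = w
    for _ in range(d):
        total, integrated = 0.0, []
        for value in x:
            total += value
            integrated.append(total)
        x = integrated
    return x, a


# Hypothetical parameter values, chosen only for illustration.
series, errors = simulate_arima(phi=[0.6], theta=[0.5], d=1, n=100, seed=1)
print([round(v, 2) for v in series[:5]])
```

With d = 1 the simulated series wanders without a fixed mean, while its first difference is a stationary ARMA series.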

One assumption necessary for the ARIMA model is that the variance is stationary. Since the variation over time appears constant in the time plot, this assumption seems to be met. In order to identify the AR and MA components of the model, the mean also has to be constant over time. The time plot shows that this is clearly not the case, so we first determine d, the number of differences required to obtain a constant mean. To do so, we difference the series until the mean appears stationary. That is, we now work with the differenced values $w_t = X_t - X_{t-1}$ in place of the original values. The figure below shows the series after taking one difference. The mean now appears stable, so we will use d = 1. Without this differencing, the fitted model would not be valid.

Figure: Time plot after one difference (x-axis: Minutes; transform: difference(1))

Now we must determine the orders p and q. To determine q, the number of moving average terms, we consider the sample autocorrelation function (ACF) of the differenced observations. We look for q such that the ACF decays to 0 after lag q. To determine p, the number of autoregressive terms, we consider the sample partial autocorrelation function (PACF) of the differenced observations. We look for p such that the PACF decays to 0 after lag p. The ACF and PACF of the differenced series are shown in the following figure.
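The quantities used in this step can be computed by hand. The sketch below, using only the Python standard library, takes first differences and then computes the sample ACF and the sample PACF (the latter through the Durbin-Levinson recursion). The function names and the toy series are assumptions made for illustration. The toy values are not the internet usage data, and statistical software may use slightly different estimators.

```python
def difference(x, d=1):
    """Apply d first differences: w_t = x_t - x_{t-1}."""
    for _ in range(d):
        x = [x[t] - x[t - 1] for t in range(1, len(x))]
    return list(x)


def acf(x, nlags):
    """Sample autocorrelations r_1, ..., r_nlags."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, nlags + 1)]


def pacf(x, nlags):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(x, nlags)
    pac = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1} of the previous step
    for k in range(1, nlags + 1):
        if k == 1:
            phi_kk = r[0]
        else:
            num = r[k - 1] - sum(prev[j] * r[k - 2 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * r[j] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pac.append(phi_kk)
    return pac


# Toy series with a drifting mean (made-up values, for illustration only).
toy = [10, 12, 15, 19, 24, 30, 37, 45, 54, 64]
w = difference(toy)  # one difference, as with d = 1
print(w)
print([round(v, 3) for v in acf(w, 3)])
print([round(v, 3) for v in pacf(w, 3)])
```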

Figure: Sample ACF and PACF of the differenced series

Although the ACF is significant for the first six lags, it clearly starts to decay to 0 after the first lag. The first three lags are significant in the PACF, but there is a damped sine wave decay to 0 after the first lag. For this reason, I suspect that the best model will be an ARIMA(1,1,1). However, it is possible that more terms are necessary.

In order to assess the model fit, we will look for patterns in the residuals (ideally they show no pattern and are normally distributed). We will also consider the ACF and PACF of the residuals. If any lags lie outside the white noise boundaries, this suggests that we need to increase the order of the model. We will also look at the significance of each of the terms. Additionally, the AIC gives an indication of how well a model fits: smaller values of the AIC indicate a better model fit than larger values. We will start with the smaller order models.

ARIMA(1,1,0)

Both the ACF and PACF of the residuals (see the figure below) show two significant lags. This suggests that more terms should be included in the model. According to the table of parameter estimates, the autoregressive term is statistically significant. We also record the AIC of this model for comparison with the other candidates.

Figure: ACF and PACF of the residuals of an ARIMA(1,1,0) model (series: Error for usage from ARIMA(1,1,0))

Table: Parameter estimates of an ARIMA(1,1,0) model. Columns: Estimates, Std Error, t, Approx Sig. Rows (non-seasonal lags): AR, Constant. Melard's algorithm was used for estimation.

ARIMA(0,1,1)

The ACF and PACF of the residuals (see the figure below) are both significant through the fourth lag. The PACF shows decay after the first lag, suggesting that an autoregressive term is needed. According to the table of parameter estimates, the moving average term is statistically significant. Since the AIC is larger than for the ARIMA(1,1,0) model, this model is actually worse than the previous one.

Figure: ACF and PACF of the residuals of an ARIMA(0,1,1) model (series: Error for usage from ARIMA(0,1,1))

Table: Parameter estimates of an ARIMA(0,1,1) model. Columns: Estimates, Std Error, t, Approx Sig. Rows (non-seasonal lags): MA, Constant. Melard's algorithm was used for estimation.

ARIMA(1,1,1)

The ACF and PACF of the residuals, shown in the figure that follows, stay within the white noise lines. The diagnostics show no problems, and there is no pattern in the residuals. Both the autoregressive and moving average terms are statistically significant, and this model has the lowest AIC of the three.
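The two diagnostics used to compare the candidate models can be sketched as follows, using only the Python standard library. The sketch assumes approximate 95% white noise bounds of ±1.96/√n for the residual autocorrelations and a Gaussian AIC computed from the residual sum of squares. Statistical packages may use different standard errors, an exact likelihood, or a different parameter count, so the values will not necessarily match SPSS output. The function names and residual values are hypothetical.

```python
import math


def acf(x, nlags):
    """Sample autocorrelations r_1, ..., r_nlags."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, nlags + 1)]


def significant_lags(residuals, nlags, z=1.96):
    """Lags whose residual autocorrelation lies outside +/- z / sqrt(n)."""
    bound = z / math.sqrt(len(residuals))
    return [k for k, r in enumerate(acf(residuals, nlags), start=1)
            if abs(r) > bound]


def gaussian_aic(residuals, n_params):
    """AIC = -2 log L + 2k, with a Gaussian likelihood and sigma^2 = SSE / n."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    log_lik = -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)
    return -2.0 * log_lik + 2.0 * n_params


# Made-up residuals with an obvious alternating pattern (illustration only).
patterned = [1.0, -1.0] * 10
print(significant_lags(patterned, 4))       # lags that suggest a larger model
print(round(gaussian_aic(patterned, 2), 2))
```

A model whose residuals leave no lags outside the bounds and whose AIC is lowest among the candidates is preferred, which is the reasoning applied to the three models above.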

Figure: ACF and PACF of the residuals of an ARIMA(1,1,1) model (series: Error for usage from ARIMA(1,1,1))

Table: Parameter estimates of an ARIMA(1,1,1) model. Columns: Estimates, Std Error, t, Approx Sig. Rows (non-seasonal lags): AR, MA, Constant. Melard's algorithm was used for estimation.

Thus, the model that best fits this data is the ARIMA(1,1,1). Now that we have chosen an appropriate model, we want to predict the internet usage for the minutes following the observed series. The figure below shows the internet usage together with the predictions based on this model. The model clearly fits well over the period for which data were collected. The predictions appear to follow the pattern, but the confidence limits associated with them widen quickly. This is due partly to the short time series available and partly to the fact that our model depends on the previous values. Consequently, any error made in a prediction is carried forward, which leads to wide confidence limits.
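The widening of the confidence limits can be illustrated with the psi-weights of an ARIMA(1,1,1) model. Writing the model as (1 - phi B)(1 - B) X_t = (1 + theta B) a_t (the plus-sign convention for the moving average term is an assumption here), the h-step forecast error variance is sigma^2 times the sum of the first h squared psi-weights. The sketch below uses only the Python standard library. The parameter values are hypothetical and are not the estimates from the fitted model.

```python
import math


def psi_weights_arima111(phi, theta, n):
    """First n psi-weights of (1 - phi B)(1 - B) X_t = (1 + theta B) a_t."""
    psi = [1.0]
    if n > 1:
        psi.append(1.0 + phi + theta)
    for _ in range(2, n):
        # psi_j = (1 + phi) * psi_{j-1} - phi * psi_{j-2} for j >= 2
        psi.append((1.0 + phi) * psi[-1] - phi * psi[-2])
    return psi[:n]


def forecast_std_errors(phi, theta, sigma, horizon):
    """Standard errors of the 1-step to horizon-step ahead forecasts."""
    std_errors, total = [], 0.0
    for weight in psi_weights_arima111(phi, theta, horizon):
        total += weight * weight
        std_errors.append(sigma * math.sqrt(total))
    return std_errors


# Hypothetical values for phi, theta and the error standard deviation.
se = forecast_std_errors(phi=0.6, theta=0.5, sigma=3.0, horizon=10)
for h, s in enumerate(se, start=1):
    # Approximate 95% limits are forecast +/- 1.96 * s.
    print(h, round(1.96 * s, 1))
```

Because of the difference (d = 1), the psi-weights do not die out, so each additional step adds a non-vanishing term to the variance and the limits keep growing with the horizon.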

Figure: Internet users and prediction using an ARIMA(1,1,1) model (x-axis: Minutes; series: usage, Prediction, Lower Confidence limit, Upper Confidence limit)

Getting the Results in SPSS

Time Plot of Series:
1. Open the data file www.sav in SPSS.
2. On the toolbar, click on Graphs > Sequence. In the Sequence Charts box, select the usage variable, and input it into the Variables box.
3. Click OK.

Time Plot of Differenced Series:
1. On the toolbar, click on Graphs > Sequence. In the Sequence Charts box, select the usage variable, and input it into the Variables box.
2. Check the box to the left of Difference, and type 1 in the box to the right of Difference.
3. Click OK.

Plots of ACF and PACF:
1. On the toolbar, click on Graphs > Time Series > Autocorrelations. In the Autocorrelations box, select the usage variable, and input it into the Variables box.
2. Check the box to the left of Difference, and type 1 in the box to the right of Difference.
3. Click OK.

Fitting the ARIMA Model:
1. On the toolbar, click on Analyze > Time Series > ARIMA. In the ARIMA box, select the usage variable, and input it into the Dependent box.
2. Enter the values of p, d, and q in the appropriate boxes.
3. Click OK.

Plots of ACF and PACF of the residuals:
1. On the toolbar, click on Graphs > Time Series > Autocorrelations. In the Autocorrelations box, select the Err_ variable, and input it into the Variables box.
2. Make sure that the box to the left of Difference is NOT checked.
3. Click OK.

Predicting with the ARIMA Model:
1. On the toolbar, click on Analyze > Time Series > ARIMA. In the ARIMA box, select the usage variable, and input it into the Dependent box.
2. Enter the values of p, d, and q in the appropriate boxes.
3. Click on Save and under Predicted Cases select Predict through.
4. In the box next to Observations, enter the number of the last observation to be predicted and click Continue.
5. Click OK.
6. On the toolbar, click on Graphs > Sequence. In the Sequence Charts box, select the usage variable, FIT_, LCL_ and UCL_, and input them into the Variables box.
7. Click OK.