
Modeling, Estimation and Control for Telecommunication Networks

Notes for the MGR-815 course
12 June 2010
School of Superior Technology
Professor Zbigniew Dziong

Table of Contents

Preface
1. Example of the analytic model of a VoIP system
   1.1. The basics and the packet loss rate
   1.2. Binomial distribution
   1.3. Control of connections admission
   1.4. Sizing the link
2. Introduction to time series and forecasting
   2.1. Decomposition of a time series
   2.2. Types of forecasting
   2.3. Forecasting error
   2.4. Forecasting techniques
   2.5. Choosing a forecasting technique
3. Basic Concepts
   3.1. Probability
   3.2. Probability distribution
   3.3. Normal distribution and Student's t law
   3.4. Central limit theorem
   3.5. Confidence interval
      3.5.1. Example of calculation of the confidence interval
      3.5.2. Application of the confidence interval for simulation results
   3.6. Regression
      3.6.1. Model of simple linear regression
      3.6.2. Model of general (multiple) linear regression
      3.6.3. Model of quadratic regression
4. Time series as stochastic processes
   4.1. Definition of stochastic process
   4.2. Stationarity of the process
   4.3. Correlogram
      4.3.1. Interpretation of correlograms
   4.4. Trend
   4.5. Filtering the series
      4.5.1. Removal of local fluctuations (low pass filters)
      4.5.2. Removal of long-term fluctuations (high pass filters)
5. Useful stochastic processes for the analysis of time series
   5.1. White noise
   5.2. Random walk
   5.3. Moving average
   5.4. Autoregressive process
      5.4.1. First order autoregressive process (Markov process)
   5.5. ARMA model
   5.6. ARIMA model
6. Development of forecasting models
   6.1. Identification of the type of the sample's model
      6.1.1. Correlogram of the sample
      6.1.2. Interpretation of the correlogram
   6.2. Estimation of the parameters of the autoregressive model
      6.2.1. Least square estimate method - LSE
         AR(1) parameters
         AR(2) parameters
      6.2.2. Estimation of the order of the autoregressive model
      6.2.3. Maximum likelihood estimate method - MLE
   6.3. Estimation of the parameters of the moving average model
      6.3.1. Estimation of the order of the moving average model
      6.3.2. Estimation of the parameters of the moving average model
   6.4. Model verification
7. Forecasting
   7.1. Forecasting by extrapolation
   7.2. Exponential smoothing
      7.2.1. Choice of α
      7.2.2. Generalized exponential smoothing
   7.3. Kalman filter
      7.3.1. State space model
      7.3.2. Scalar example of Kalman filter
8. Non-linear models for estimation and forecasting
   8.1. Extensions of ARMA models
      8.1.1. Non-linear autoregressive model NLAR
      8.1.2. Threshold autoregressive model TAR
      8.1.3. Bilinear models
   8.2. Neural networks
      8.2.1. Neural network optimization
      8.2.2. Neural network for forecasting
   8.3. Chaos
      8.3.1. Characteristics of a chaotic system
      8.3.2. Fractal behavior and self-similarity
9. Poisson process and exponential distribution
   9.1. Derivation of the Poisson distribution
   9.2. Transition probabilities for Δt
   9.3. Standardization method
10. Markov decision process
   10.1. Average reward
   10.2. The relative values
   10.3. Transformation of the iterative algorithm on the values
References

Preface

These MGR-815 course notes explain many concepts and derivations that are important for understanding the course. Nevertheless, the purpose of these notes is not to cover all the material of the course, but to help in understanding the material included in the other references and the material available on the course's website. Chapter 1 provides an example of how to calculate the performance of a Voice over IP system using a simplified but effective model. Chapters 2-7 present the problems of estimation and forecasting and cover the majority of the material necessary for Project 1 and Test 1. Chapters 4-8 follow, in many aspects, the book [Chatfield, 2003] and partly the book [Dziong, 1997], where more details can be found. The purpose of Chapter 9 is to help in following the overhead slides associated with the book [Bose, 2001] and the selected chapters of [Dziong, 1997], which deal with Markov processes and queueing analysis. Finally, Chapter 10 offers some basic concepts and derivations which facilitate the understanding of the material associated with the Markov decision process presented in the selected chapters of [Dziong, 1997]. A more advanced study of this subject is available in the book [Tijms, 1994]. Throughout the notes, the parts marked with a star (*) are not obligatory and can be omitted without causing problems in understanding the other parts of the notes.

1. Example of the analytic model of a VoIP system

In this section, we will analyze the performance of a link in a network serving a certain number, M, of VoIP (Voice over IP) connections. The purpose of this analysis is to find a relationship between the maximum number of VoIP connections, M_max, which we can admit on the link, and the link capacity, L. At the same time, we require certain guarantees of the quality of service for these connections. This guarantee is expressed as a constraint, B_p^c, on the probability of packet loss, B_p, which should not be exceeded. The architecture of this system is illustrated in Figure 1.1.

Figure 1.1. The system's architecture: M connections admitted on a link of capacity L; packets are lost with probability B_p

Here are the assumptions used in the analysis:
1. VoIP packets have a fixed length (e.g. 200 bytes)
2. An active connection generates a packet every 20 ms
3. During an inactivity period, a connection does not generate packets
4. A connection is active with probability p_a (the numerical examples below use p_a = 1/3)
5. The link works synchronously with a 20 ms time interval
6. In each interval, the link can serve a maximum of L packets
7. The link has no buffer for the packets. Therefore, during a given interval, if the number of arriving packets, A, exceeds the capacity of the link, A > L, the link will reject A - L packets.

Attention! Unlike assumption 7, real links have a buffer for the packets which cannot be served during the given interval, and these packets can be served during the following intervals. In this case, the quality of service constraint can be

defined as a constraint on the probability that the packet delay exceeds a certain value (e.g. 150 ms). However, the non-buffered system can be used for an approximate analysis of the buffered system, because we can establish an approximate correspondence between the two quality of service constraints.

1.1. The basics and the packet loss rate

In general, in an interval j of 20 ms, there is a certain number of packets, A_j, which arrive to be served by the link. Let us consider this number as the result of the j-th trial (or experiment). For n trials, we can define the following concepts:

N_k(n) - number of trials with A_j = k packets
f_k(n) = N_k(n)/n - relative frequency of having a trial with k packets
p(k) = \lim_{n\to\infty} f_k(n) - probability of having a trial with k packets
\bar{A}_n = \frac{1}{n}\sum_{j=1}^{n} A_j = \sum_{k=0}^{M} k\, f_k(n) - sample average
E[A] = \sum_{k=0}^{M} k\, p(k) - mathematical expectation (average value)

Using these concepts, we can define the probability of packet loss as the ratio of the average number of lost packets to the average number of arriving packets per interval:

B_p = \lim_{n\to\infty} \frac{\frac{1}{n}\sum_{k=L+1}^{M} (k-L)\, N_k(n)}{\frac{1}{n}\sum_{k=0}^{M} k\, N_k(n)} = \frac{\sum_{k=L+1}^{M} (k-L)\, p(k)}{\sum_{k=0}^{M} k\, p(k)}   (1.1)

1.2. Binomial distribution

Consider a sequence of three independent coin-toss trials, where each trial gives heads (H) with probability p and tails (T) with probability 1 - p. Here are the probabilities of all possible combinations:

P[{HHH}] = p^3
P[{HHT}] = P[{HTH}] = P[{THH}] = p^2 (1-p)
P[{HTT}] = P[{THT}] = P[{TTH}] = p (1-p)^2
P[{TTT}] = (1-p)^3

By grouping the sequences with the same number of heads and tails, it is possible to define the probability of having k heads in the three trials:

P(k) = \binom{3}{k} p^k (1-p)^{3-k}

By analyzing the structure of these formulas, we can find the generalization for any number of trials, n, in the sequence:

P(k) = \binom{n}{k} p^k (1-p)^{n-k}   (1.2)

where

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}

expresses the number of combinations for the selection of k positions among n positions. The formula (1.2) defines the binomial distribution, which can also be calculated using the recurrence

P(k+1) = \frac{(n-k)\, p}{(k+1)(1-p)}\, P(k), \quad P(0) = (1-p)^n

1.3. Control of connections admission

Using (1.2) with p = p_a and n = M in (1.1), we reach

B_p(M, L) = \frac{\sum_{k=L+1}^{M} (k-L) \binom{M}{k} p_a^k (1-p_a)^{M-k}}{M\, p_a}   (1.3)

where the notation B_p(M, L) for the probability of packet loss stresses that it depends on the number of connections and on the capacity of the link, and where the denominator uses the fact that the average number of arriving packets is E[A] = M p_a. The formula (1.3) can be used to find the maximum number of VoIP connections, M_max, which we may admit on the link with the capacity L:

M_max = max { M : B_p(M, L) ≤ B_p^c }   (1.4)

Figure 1.2 shows an example of B_p(M, L) as a function of M for L = 20 and p_a = 1/3. Using this figure, we can find M_max for the required value of the constraint B_p^c. For example, for B_p^c = 5% we can accept a maximum of M_max = 56 connections.

Figure 1.2. Probability of packet loss B_p(M, L) as a function of M for L = 20

1.4. Sizing the link

Formula (1.3) may also be used for finding the minimum capacity, L_min, for a given number of connections, M:

L_min = min { L : B_p(M, L) ≤ B_p^c }   (1.5)

Figure 1.3 provides an example of B_p(M, L) as a function of L for M = 48 and p_a = 1/3. Using this figure, we can find L_min for the required value of the constraint B_p^c. For example, for B_p^c = 5% we need a minimum of L_min = 18 VoIP packets per interval.

Figure 1.3. Probability of packet loss B_p(M, L) as a function of L for M = 48
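The calculations of this chapter are easy to automate. Below is a minimal Python sketch (not part of the original notes) that evaluates the reconstructed formulas (1.3)-(1.5); the function names packet_loss, max_connections and min_capacity are illustrative.

```python
from math import comb

def packet_loss(M, L, p_a=1/3):
    """B_p(M, L): expected lost packets / expected arriving packets, eq. (1.3)."""
    lost = sum((k - L) * comb(M, k) * p_a**k * (1 - p_a)**(M - k)
               for k in range(L + 1, M + 1))
    return lost / (M * p_a)              # E[A] = M * p_a for the binomial

def max_connections(L, B_c, p_a=1/3):
    """Largest M with B_p(M, L) <= B_c (admission control, eq. (1.4))."""
    M = L                                # loss is zero while M <= L
    while packet_loss(M + 1, L, p_a) <= B_c:
        M += 1
    return M

def min_capacity(M, B_c, p_a=1/3):
    """Smallest L with B_p(M, L) <= B_c (link sizing, eq. (1.5))."""
    L = 1
    while packet_loss(M, L, p_a) > B_c:
        L += 1
    return L

print(max_connections(L=20, B_c=0.05))   # 56, matching Figure 1.2
print(min_capacity(M=48, B_c=0.05))      # 18, matching Figure 1.3
```

Running the sketch reproduces the two numerical examples above (M_max = 56 and L_min = 18 for B_p^c = 5%).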

2. Introduction to time series and forecasting

A time series may be defined by a variable X(t_i) that is a function of discrete time t_i: i = 1, 2, 3, ... Assume that we are at the moment of time t_n and that we know all the values of the series for i ≤ n. These values form the history of the series, and they are also called the samples of the series. One of the basic problems of time series analysis is the estimation of one (or several) future value(s) of X(t_i) for i > n, based on the full or partial history of the samples. This estimation is also called a forecast or a prediction. A simplified notation is often used for the index of discrete time: X_t: t = 1, 2, 3, ...

2.1. Decomposition of a time series

In general, a time series can be decomposed into four components:

X_t = T_t + C_t + S_t + R_t

where
T_t - trend (can be ascending or descending)
C_t - cycle (periodic repetition)
S_t - seasonal variation (a special case of a cycle)
R_t - random component

A series may have only a subset of these components. Often we want to analyze only one component, and in this case we must remove (filter) the other components.

2.2. Types of forecasting

In practice, a forecast is always associated with a forecast error. Accordingly, there are two types of forecasting that differ in the treatment of this error:
1. Point forecast: In this case, the result of forecasting is a particular value that should be equal to the future value of the series we seek. This type of forecasting gives no information on the size of the possible error of the forecast.
2. Confidence interval forecast: In this case, the result of forecasting is an interval that covers the real future value of the series we seek with a

specified probability (e.g. 0.9, 0.95 or 0.99). This interval gives confidence in the forecast because it indicates the probability of having an error of the specified size.

2.3. Forecasting error

In general, the forecast of a time series sample is not ideal, and we may define the forecasting error

e_t = x_t - \hat{x}_t

where x_t is the true value of the sample and \hat{x}_t is its forecast. We are often interested only in the absolute deviation of the error:

|e_t| = |x_t - \hat{x}_t|

To analyze the quality of forecasting properly, we need to use the errors over several samples. In this case we can apply one of two metrics (a small numerical sketch is given at the end of this chapter):

1. MAD (Mean Absolute Deviation) - the average absolute deviation of the error:

MAD = \frac{1}{n} \sum_{t=1}^{n} |e_t|

2. MSE (Mean Squared Error) - the average squared error:

MSE = \frac{1}{n} \sum_{t=1}^{n} e_t^2

The mean squared error is more popular because it gives more weight to the largest errors, which are the more "dangerous" ones.

2.4. Forecasting techniques

Here are some techniques that can be used for forecasting in time series:
1. Using the last sample as the forecast
2. Exponential smoothing
3. Time series regression
4. Autoregression of the time series (ARIMA)
5. Box-Jenkins methodology
6. Models based on the state space: Kalman filter

The techniques outlined in bold are covered in this course.

2.5. Choosing a forecasting technique

The choice of a forecasting technique depends on the sought characteristics, which can be divided into the following categories:
1. Type of forecast (point, confidence interval)
2. Horizon of the forecast (longer => more difficult)
3. Component of the series (T, C, S, R)
4. Cost of forecasting
   a. Development of the forecasting model
   b. Required memory
   c. Complexity
5. Accuracy of forecasting
6. Availability of data
7. Ease of operation and understanding.
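As announced in Section 2.3, here is a small Python sketch of the two error metrics; the sample values are invented for illustration.

```python
def mad(actual, forecast):
    """Mean Absolute Deviation: average of |e_t|."""
    return sum(abs(x - f) for x, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Mean Squared Error: average of e_t**2 (penalizes large errors more)."""
    return sum((x - f) ** 2 for x, f in zip(actual, forecast)) / len(actual)

x  = [10.0, 12.0, 11.0, 13.0]   # true values
xf = [11.0, 11.5, 12.0, 12.0]   # forecasts
print(mad(x, xf), mse(x, xf))   # 0.875 and 0.8125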

3. Basic Concepts

Suppose that {y_1, y_2, ..., y_n} is a randomly selected sample of a variable Y. For this sample we can define:

1. The sample average: \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
2. The sample variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
3. The sample standard deviation: s = \sqrt{s^2}

Question: Why do we divide by n - 1, and not by n, in the formula for the variance?

3.1. Probability

Probability is a measure of the likelihood, or chance, that a variable is equal to a given value. It can be defined as

P(A) = \lim_{n\to\infty} \frac{n_A}{n}

where A is a particular value of the variable, n_A is the number of occurrences of the value A in the sample, and n is the total number of values in the sample.

3.2. Probability distribution

Here we must distinguish between continuous and discrete variables. For discrete variables we can define the probability function

p(x) = P(X = x)

and the cumulative distribution function

F(x) = P(X ≤ x)

The cumulative distribution function is also valid for continuous variables. On the other hand, the probability function is not defined for continuous variables, because in this case it is always equal to zero due to the infinite number of possible values. Nevertheless, we can define an equivalent function, called the probability density function, which is determined by

f(x) = \frac{dF(x)}{dx}

Therefore, the cumulative distribution function can be presented as

F(x) = \sum_{x_i \le x} p(x_i)

for discrete variables, and as

F(x) = \int_{-\infty}^{x} f(u)\, du

for continuous variables.

3.3. Normal distribution and Student's t law

The normal (Gaussian) distribution is very important for many applications and especially for statistics. The normal probability density function is defined by

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

where µ is the average and σ is the standard deviation of the distribution. The normal distribution is sometimes denoted N(µ, σ). If we normalize the variable x,

z = \frac{x - \mu}{\sigma}

we reach the standard normal distribution, with average equal to zero and variance equal to one. The normal distribution is symmetrical, as shown in Figure 3.1. For the analysis of the confidence interval, shown later, it is useful to present the probabilities with which the variable falls within the following intervals:

P(\mu - \sigma < X < \mu + \sigma) \approx 0.68
P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.95   (3.3)
P(\mu - 3\sigma < X < \mu + 3\sigma) \approx 0.997

Student's t distribution is related to the normal distribution and is important for the calculation of the confidence interval discussed later. It is a symmetrical distribution with µ = 0 and σ > 1, which is characterized by a number of degrees of freedom, df. For large df, the t distribution approaches the normal distribution:

t(df) \to N(0, 1) \text{ as } df \to \infty

The density of the t law with df = 4 is compared with that of the standard normal law in Figure 3.1.

Figure 3.1. Density of the standard normal law (taller) and of the t law with df = 4 (wider)

3.4. Central limit theorem

Suppose that Y = X_1 + X_2 + ... + X_n is the sum of n iid random variables (independent with identical distribution) with average µ and finite variance σ². Let us consider the standardized random variable, with average zero and unit variance,

Z_n = \frac{Y - n\mu}{\sigma\sqrt{n}}

We can prove that

\lim_{n\to\infty} P(Z_n \le z) = \Phi(z)

where Φ is the standard normal cumulative distribution function. This means that the distribution of the sum of a large number of iid variables approaches the normal distribution. The importance of the central limit theorem

comes from the fact that these variables can have any distribution. The only requirement is that the average and the variance are finite.

3.5. Confidence interval

While estimating or forecasting a value of a series or a parameter W, we can use the notion of the confidence interval to indicate the precision of the estimation or forecast. This interval covers the true value of the parameter with probability 1 - α, where α is a parameter of the estimation or forecasting. Often this interval is presented as a deviation of the estimator (or of the forecast) from the point \hat{W}:

W \in [\hat{W} - \Delta, \hat{W} + \Delta]

3.5.1. Example of calculation of the confidence interval

Let us consider a sample of n independent values {y_1, ..., y_n} of a random variable Y having a distribution close to the normal distribution. The goal is the estimation of the confidence interval for the average µ of Y. Here are the steps:

1. Calculate the average \bar{y} and the standard deviation s of the sample {y_1, ..., y_n}. Let us note that \bar{y} is actually our estimator of µ.

2. Now, let us consider \bar{y} as a random variable. We can easily find the relations between the averages and the variances of \bar{y} and Y:

E[\bar{y}] = \mu, \quad Var(\bar{y}) = \frac{\sigma^2}{n}

which give the average and the standard deviation of the estimator \bar{y}:

\mu_{\bar{y}} = \mu, \quad \sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}} \text{ (estimated by } s/\sqrt{n})

3. Because we can prove that \bar{y} follows Student's t law with df = n - 1 degrees of freedom, the confidence interval for the average µ is given by

\mu \in \bar{y} \pm t_{df,\alpha/2} \frac{s}{\sqrt{n}}

where t_{df,α/2} is the value which gives

P(-t_{df,\alpha/2} < T < t_{df,\alpha/2}) = 1 - \alpha

and where T is a random variable that follows Student's t law. We can find the values t_{df,α/2} in mathematical tables. In practice, we can also use the normal distribution when n - 1 > 30 since, as mentioned before, the t law approaches the normal law for large df.

3.5.2. Application of the confidence interval for simulation results

Let us consider the estimation of the average µ of a random variable X while using a simulation. For example, X may be the number of connections (or packets) in the system. To obtain the statistics, the measured values x_j of X are recorded during the simulation from time to time. Note that, in order to avoid the influence of the initial conditions, it is good to start the measurements after an initial period, when the process becomes stationary. To obtain the confidence interval for the estimation of the average µ, we must have a sample of average values {\bar{x}_1, ..., \bar{x}_n}, where the \bar{x}_i = \frac{1}{m}\sum_{j=1}^{m} x_{ij} are independent. We can obtain it in two ways:

1. Realize n independent simulations while using different initial values of the random number generators in each simulation. Calculate \bar{x}_i = \frac{1}{m}\sum_{j=1}^{m} x_{ij} for each simulation.

2. Realize one simulation that we divide into n periods (of course, after the initial period). In each period, calculate \bar{x}_i = \frac{1}{m}\sum_{j=1}^{m} x_{ij}. The periods should be long enough for the \bar{x}_i to be independent.

Note that, regardless of the distribution of the x_j, the variables \bar{x}_i have a distribution close to the normal distribution for reasonably large m, based on the central limit theorem. So, once we have the sample of average values {\bar{x}_1, ..., \bar{x}_n}, the confidence interval is given by

\mu \in \bar{x} \pm t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}

where

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} \bar{x}_i, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\bar{x}_i - \bar{x})^2

The confidence parameter α is chosen according to the need. Typically, we use α = 0.1, 0.05, 0.01.
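A minimal Python sketch of the second (batch-means) procedure follows; the warm-up length, the number of batches, and the exponential stand-in data are illustrative assumptions, and the t value is taken from tables for df = 19.

```python
import random
from statistics import mean, stdev

random.seed(1)
# Stand-in for recorded simulation measurements (true mean is 10 here).
measurements = [random.expovariate(0.1) for _ in range(10_500)]

warmup, n_batches = 500, 20                   # assumed warm-up and batch count
batch = (len(measurements) - warmup) // n_batches
x_bars = [mean(measurements[warmup + i * batch : warmup + (i + 1) * batch])
          for i in range(n_batches)]          # one average per period

xbar, s = mean(x_bars), stdev(x_bars)         # mean and std of the batch means
t = 2.093                                     # t_{df=19, alpha/2} for alpha = 0.05
half = t * s / n_batches ** 0.5
print(f"mu in [{xbar - half:.2f}, {xbar + half:.2f}] with 95% confidence")
```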

3.6. Regression

Regression deals with the description of the nature of the relation between two or more variables. In particular, it is concerned with the problem of estimating the value of a dependent variable on the basis of one or more independent variables. Below is a description of some models of linear regression.

3.6.1. Model of simple linear regression

The model of simple linear regression is defined as

y = \beta_0 + \beta_1 x + \varepsilon, \quad \mu_{y|x} = \beta_0 + \beta_1 x

where
y - dependent variable
x - independent variable
µ_{y|x} - average value of the dependent variable for a given x
β_i - regression coefficients
ε - error term (disturbance, or noise).

Let us note that ε captures all the factors influencing the dependent variable y other than the regressor x. In statistical problems, we focus on the estimation of {β_i} using the least squares method, where we estimate {β_i} by finding the minimum of the sum of squared errors,

S(\{\beta_i\}) = \sum_{j} \left( y_j - R(\{\beta_i\}, \{x_i\}_j) \right)^2

where the error is defined as the difference between the value of an observation, y_j, and the value given by the regression model, R({β_i}, {x_i}_j), which in general is a function of the values of the regression coefficients {β_i} and of the values of the independent variables {x_i}_j. In the case of simple linear regression, the estimators of the regression coefficients which minimize the sum of squared errors are

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where {y_1, ..., y_n}, {x_1, ..., x_n} are the samples of the dependent and independent variables, and \bar{y}, \bar{x} are their averages, respectively. Based on this result, it is possible to make a point estimation with simple linear regression:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

3.6.2. Model of general (multiple) linear regression

The model of general linear regression (with multiple independent variables) is defined by

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon

It is worth mentioning that in the analysis of the different types of regression, we almost always use the following assumptions for ε:
1. It is a random variable with the normal distribution
2. Its average is µ = 0
3. Its variance σ² is constant (does not depend on the values of x_i)
4. Its values are statistically independent.

3.6.3. Model of quadratic regression

The model of quadratic regression is defined by

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon

Let us note that this regression is still linear, because y is a linear function of the regression coefficients β_i.
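A short Python sketch of the simple linear regression estimators above; the sample data are invented (roughly y = 2x plus noise).

```python
from statistics import mean

def simple_linear_regression(x, y):
    """Least-squares estimators beta_0 and beta_1 of Section 3.6.1."""
    xbar, ybar = mean(x), mean(y)
    beta1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = simple_linear_regression(x, y)
print(b0, b1)     # close to 0 and 2
```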

4. Time series as stochastic processes

Time series can be modeled and analyzed as stochastic processes. In this chapter we will describe the concepts and types of stochastic processes that are used in analyzing time series.

4.1. Definition of stochastic process

A stochastic process is a set of random variables that are time-ordered. Two types of stochastic processes are defined:
1. With continuous time: X(t)
2. With discrete time: X_t: t = 0, 1, 2, 3, ...

Obviously, to model and analyze time series, we use stochastic processes with discrete time. The basic problem in the analysis of stochastic processes is the estimation of the process's characteristics based on a sample of realizations of the process. It must be emphasized that a major difficulty in the analysis of time series is the fact that in most cases we have only one realization (observation) of the process available. A stochastic process can be defined by the multi-dimensional probability distribution for any set of times. Unfortunately, this approach is not practical because of the large size of the problem. The alternative approach is based on the moments. The exact version of this approach is also difficult, because we would need all the moments. However, in practice we can often limit ourselves to the first two moments: average, variance, and autocovariance. Let us note that normal processes are exactly defined by the first two moments.

4.2. Stationarity of the process

A process can be strictly stationary, or it may be characterized by weak stationarity (of second order). Here are the definitions:
1. Strict stationarity (two definitions)
   a. The multidimensional probability distributions are identical for any time sequences of the same length.

   b. All the moments are the same for any time sequences of the same length.
2. Weak stationarity (of second order)
   a. The average does not depend on time: E[X_t] = µ for all t
   b. The autocovariance depends only on the shift k: γ(t, t+k) = γ(k)

4.3. Correlogram

The correlogram is the autocorrelation of a process presented as a function of the shift. It is very useful for determining the type of a time series. Let us start with the definition of the autocovariance of a sample with n elements,

c_k' = \frac{1}{n-k} \sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})

and its biased version, which is more popular:

c_k = \frac{1}{n} \sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})

Based on this function, we can define the function (also known as the coefficient) of the sample's autocorrelation as the autocovariance normalized by the variance:

r_k = \frac{c_k}{c_0}

4.3.1. Interpretation of correlograms

In this section we will highlight the important features of correlograms for the different types of time series.

a. Random series

In this case the correlogram is characterized by values close to zero:

r_k \approx 0 \text{ for } k \ne 0

The values r_k are not exactly equal to zero because normally the number of values x_t in the sample, n, is limited, which causes a statistical error. We can demonstrate that this error has an approximately normal distribution N(0, 1/\sqrt{n}). Based on this characteristic and on (3.3), we can say that, on average, 19 out of 20 of the r_k values will be within the confidence interval

-\frac{2}{\sqrt{n}} < r_k < \frac{2}{\sqrt{n}}

An example of the correlogram of a random series, with n = 100 observations, is shown in Figure 4.1, together with the 95% confidence interval.

Figure 4.1. Random series correlogram with the confidence interval (taken from Chris Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall, 2003)

b. Short-term correlation series

In the case of a short-term correlation series, the correlogram has a significant value for k = 1, and one or two following coefficients also have significant values, smaller and progressively decreasing. An example of a short-term correlation series and its correlogram is shown in Figure 4.2.

Figure 4.2. A short-term correlation series and its correlogram (taken from Chris Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall, 2003)
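The correlogram and its ±2/√n band are easy to compute. Here is a short Python sketch, using the biased autocovariance c_k of Section 4.3 on a purely random series of n = 100 observations (compare with Figure 4.1); the random seed is arbitrary.

```python
import random
from statistics import mean

def correlogram(x, max_lag):
    """Sample autocorrelations r_k = c_k / c_0 with the biased c_k."""
    n, xbar = len(x), mean(x)
    c0 = sum((v - xbar) ** 2 for v in x) / n
    r = []
    for k in range(1, max_lag + 1):
        ck = sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k)) / n
        r.append(ck / c0)
    return r

random.seed(2)
x = [random.gauss(0, 1) for _ in range(100)]   # purely random series
bound = 2 / len(x) ** 0.5                      # 95% band: +/- 0.2 here
for k, rk in enumerate(correlogram(x, 10), start=1):
    mark = "outside" if abs(rk) > bound else "inside"
    print(f"r_{k} = {rk:+.3f} ({mark} the band)")
```

On average, about 19 out of 20 lags fall inside the band, as stated above.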

c. Alternating series

For an alternating series, the correlogram also has a tendency to alternate between positive and negative values, as shown in Figure 4.3.

Figure 4.3. Alternating series and its correlogram (taken from Chris Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall, 2003)

d. Non-stationary series

If we have a non-stationary series, its correlogram decreases slowly and reaches zero only for large shifts. In this case the trend component is dominant, and to analyze the other characteristics it must be removed. An example of a non-stationary series and its correlogram is shown in Figure 4.4.

Figure 4.4. Non-stationary series and its correlogram (taken from Chris Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall, 2003)

e. Series with seasonal fluctuations

Figure 4.5 (a) shows the correlogram of monthly temperature observations. This correlogram is dominated by the annual cycle. On the other hand, after the seasonal adjustment obtained by subtracting the (long-term) monthly average from each observation, we obtain the correlogram shown in Figure 4.5 (b), where the 95% confidence interval for zero values is also given. This correlogram indicates that the adjusted series has a short-term correlation. Based on this characteristic we can say that if a month is hotter, one or two of the following months will tend to be hotter too.

Figure 4.5. Correlogram of monthly temperature observations before (a) and after (b) seasonal adjustment (taken from Chris Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall, 2003)

4.4. Trend

As indicated in Section 2.1, a series may have a trend component. Here are some types of trend with a mathematical formulation:
a. No trend: T_t = β_0
b. Linear trend: T_t = β_0 + β_1 t + ε_t
c. Quadratic trend: T_t = β_0 + β_1 t + β_2 t² + ε_t
d. p-th order polynomial trend: T_t = β_0 + β_1 t + ... + β_p t^p + ε_t

where the β_i are constant coefficients and ε_t is a random error with average equal to zero. The least squares estimates of the coefficients β_i can be obtained by regression. Let us note that if the trend has a local character, the coefficients depend on time: β_i(t).

4.5. Filtering the series

In general, a time series may have several components associated with different time scales. We already mentioned the trend T_t, the cycle C_t, and the seasonal variation S_t, which are normally associated with a long time scale. On the other hand, the random component R_t is rather associated with a short or medium time scale. In addition, we may have a component ε_t associated with a short time scale. This component can be interpreted as a measurement error, a disturbance, or a noise. Filtering is used to remove certain components of the series, associated with a particular time scale, in order to obtain a series with only the components of interest. Below, we discuss the different types of filtering.

4.5.1. Removal of local fluctuations (low pass filters)

To remove local fluctuations, associated with a short time scale, we can use several filters. Here are the filters most often used:

a. General linear filter

The general linear filter transforms one series into another using the linear operation

Sm(x_t) = \sum_{r=-q}^{s} a_r x_{t+r}

which sums up the observations around x_t, in the window [t-q, t+s], weighted by the weights a_r. We often use \sum_r a_r = 1.

b. Symmetric linear filter

This is a special version of the general linear filter where the weights are identical for all the observations in a symmetric interval:

Sm(x_t) = \frac{1}{2q+1} \sum_{r=-q}^{q} x_{t+r}

c. Exponential smoothing

In this case, we sum up all the previous observations with weights which decrease exponentially (geometrically):

Sm(x_t) = \alpha \sum_{j=0}^{\infty} (1-\alpha)^j x_{t-j}

where α ∈ (0, 1) defines the weight given to the previous observations. The removal of local fluctuations, associated with a short time scale, is also treated as low pass filtering, since the resulting series has only the components associated with a long time scale.

4.5.2. Removal of long-term fluctuations (high pass filters)

In case we want to keep only the components associated with a short time scale, we must remove the long-term fluctuations. Here are some methods to do so.

a. General linear filter

After estimating the trend Sm(x_t) using the general linear filter, we can subtract it from the series:

Res(x_t) = x_t - Sm(x_t) = \sum_{r} b_r x_{t+r}

where \sum_r b_r = 0, b_0 = 1 - a_0, and b_r = -a_r for r ≠ 0, assuming that \sum_r a_r = 1. The residue Res(x_t) contains only the local fluctuations.

b. Differentiation to remove the trend

First order differentiation is a simple and very useful filter to remove the trend:

\nabla x_t = x_t - x_{t-1}

If the differentiated series is not yet stationary, we can apply a second order differentiation to reach a stationary series.

c. Differentiation to remove a seasonal variation

To remove seasonal variations, we can also use differentiation, but between the observations separated by the variation cycle m:

\nabla_m x_t = x_t - x_{t-m}

For example, for annual variations with monthly observations, we use m = 12. It is worth mentioning that, in general, we can use filters in series. For example, to obtain a series with the components associated with a medium time scale, we can use two filters in series: a low pass filter (to remove the noise, error or disturbance) and a high pass filter (to remove the trend or seasonal variations).
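A short Python sketch of filters in series, applied to a toy trend-plus-noise series (the slope and noise level are assumptions of the example): a symmetric moving average (low pass) followed by first order differentiation (high pass).

```python
import random

def moving_average(x, q):
    """Symmetric linear filter with equal weights over 2q+1 points."""
    return [sum(x[t - q : t + q + 1]) / (2 * q + 1)
            for t in range(q, len(x) - q)]

def difference(x):
    """First order differentiation: removes a linear trend."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

random.seed(3)
series = [0.5 * t + random.gauss(0, 1) for t in range(50)]  # trend + noise
smoothed = moving_average(series, q=2)   # local fluctuations removed
detrended = difference(smoothed)         # long-term trend removed
print(detrended[:5])                     # values near the slope 0.5
```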

5. Useful stochastic processes for the analysis of time series

In this chapter we present several types of stochastic processes that are useful in time series analysis.

5.1. White noise

The purely random process, often called white noise, is a sequence of random variables {Z_t} which are iid, which means:
a. independent (mutually)
b. identically distributed.

In this case, the autocovariance and the autocorrelation are equal to zero:

\gamma(k) = 0, \quad \rho(k) = 0 \text{ for } k \ne 0

In addition, we assume an average equal to zero, µ_z = 0, and a finite variance, σ_z² < ∞.

5.2. Random walk

The random walk is built from a sequence of iid random variables {Z_t} characterized by µ_z > 0 and a finite variance σ_z² < ∞. The relation is defined by

X_t = X_{t-1} + Z_t

Normally it is assumed that X_0 = 0, which gives

X_t = \sum_{i=1}^{t} Z_i

Consequently, the average and the variance of the random walk are given by

E[X_t] = t \mu_z, \quad Var(X_t) = t \sigma_z^2

indicating that the random walk is a non-stationary process. Let us note that if we apply differentiation to the random walk,

\nabla X_t = X_t - X_{t-1} = Z_t

we obtain the stationary series {Z_t}. A good example of a random walk is the price of a stock on the exchange market.

5.3. Moving average

Suppose that {Z_t} are iid with µ_z = 0 and σ_z² < ∞. The moving average MA(q) of order q is defined by

X_t = \beta_0 Z_t + \beta_1 Z_{t-1} + \dots + \beta_q Z_{t-q}

where the {β_i} are constant coefficients, with β_0 normally normalized to 1. The moments of MA(q) are given by

E[X_t] = 0, \quad \gamma(k) = \begin{cases} \sigma_z^2 \sum_{i=0}^{q-k} \beta_i \beta_{i+k} & k = 0, 1, \dots, q \\ 0 & k > q \end{cases}

Let us note that for MA(1) with β_0 = 1 we obtain

\rho(1) = \frac{\beta_1}{1 + \beta_1^2}   (5.5)

The formula for the moments indicates that MA(q) is stationary, independently of the values of the coefficients {β_i}.

5.4. Autoregressive process

Suppose that {Z_t} are iid with µ_z = 0 and σ_z² < ∞. The autoregressive process AR(p) of order p is defined by

X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \dots + \alpha_p X_{t-p} + Z_t

where the {α_i} are constant coefficients which define the autocorrelation of the process.

5.4.1. First order autoregressive process (Markov process)

The first order autoregressive process is important because it belongs to the class of Markov processes, which have many applications in the field of telecommunications. In this case we have p = 1, and the first order autoregressive process is described by

X_t = \alpha X_{t-1} + Z_t

Using successive substitutions, we reach the other form of the process,

X_t = Z_t + \alpha Z_{t-1} + \alpha^2 Z_{t-2} + \dots   (5.8)

which corresponds to a moving average of infinite order. Equation (5.8) may also be presented as

(1 - \alpha B) X_t = Z_t   (5.9)

where B is the backward operator: B X_t = X_{t-1}. Based on (5.8) or (5.9), we can easily find that for AR(1) we have E[X_t] = 0. It is important to note that we must have |α| < 1 for the variance to be finite. In this case, the second moments are given by

Var(X_t) = \frac{\sigma_z^2}{1 - \alpha^2}, \quad \rho(k) = \alpha^k

Because the moments do not depend on t, this shows the second order stationarity for |α| < 1. In Figure 5.1, we present the correlograms for three values of α, including 0.8 and 0.3.

Figure 5.1. Correlograms of the first order autoregressive process for three values of α (taken from Chris Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall, 2003)

5.5. ARMA model

The mixed autoregressive and moving average model, ARMA of order (p, q), is a composition of the autoregressive process and the moving average:

X_t = \alpha_1 X_{t-1} + \dots + \alpha_p X_{t-p} + Z_t + \beta_1 Z_{t-1} + \dots + \beta_q Z_{t-q}

The importance of ARMA is that a stationary series can be described by fewer coefficients using ARMA than when using MA or AR alone.

5.6. ARIMA model

The mixed integrated autoregressive and moving average model, ARIMA of order (p, d, q), is a composition of differentiation, the autoregressive process, and the moving average. It is used for non-stationary series. First, we remove the non-stationarity by differentiation of order d,

Y_t = \nabla^d X_t

and subsequently we apply the ARMA(p, q) model to the differentiated series Y_t. For example, the random walk can be modeled by ARIMA of order (0, 1, 0).
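To connect the AR(1) theory of Section 5.4.1 with the correlograms of Figure 5.1, here is a Python sketch that simulates X_t = αX_{t-1} + Z_t and compares the sample autocorrelations with α^k; the sample size, warm-up length and seed are arbitrary.

```python
import random

random.seed(4)
alpha, n = 0.8, 5000
x = [0.0]
for _ in range(n):
    x.append(alpha * x[-1] + random.gauss(0, 1))   # X_t = alpha*X_{t-1} + Z_t
x = x[500:]                                        # drop the warm-up samples

xbar = sum(x) / len(x)
c0 = sum((v - xbar) ** 2 for v in x) / len(x)
for k in (1, 2, 3):
    ck = sum((x[t] - xbar) * (x[t + k] - xbar)
             for t in range(len(x) - k)) / len(x)
    print(f"r_{k} = {ck / c0:+.3f}   vs   alpha**{k} = {alpha ** k:.3f}")
```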

6. Development of forecasting models

In this chapter we present a methodology for the development of forecasting models, illustrated by several examples. It can be broken down into four steps:

1) Identification of the type of model
   a) Application of the correlogram
2) Estimation of the model's parameters
   a) Least Square Estimate - LSE, or
   b) Maximum Likelihood Estimate - MLE
3) Model verification
   a) Analysis of the residue
4) Verification of the alternatives (if the model verification is not positive)
   a) Return to step 1

In the following sections, we illustrate this methodology while considering only the types of models belonging to the ARIMA family. Nevertheless, the methodology is also valid for other types of models.

6.1. Identification of the type of the sample's model

The correlogram is very useful for identifying the type of model. Therefore, it must first be estimated, based on the series' sample.

6.1.1. Correlogram of the sample

Suppose that n is the number of observations in the time series' sample. As already indicated in Section 4.3, we can consider two formulas for the estimation of the sample's autocovariance: the unbiased estimation

c_k' = \frac{1}{n-k} \sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})   (6.1)

and the biased version

c_k = \frac{1}{n} \sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})   (6.2)

where the bias has the order of 1/n. Nevertheless, the estimator (6.2) is asymptotically unbiased, because \lim_{n\to\infty} E(c_k) = \gamma(k). In addition, its finite Fourier transform is non-negative, which also gives a certain advantage in estimating the spectrum of

the series. On the other hand, even if the estimator (6.1) is unbiased, we can show that its variance Var(c_k') is higher than that of (6.2), so that (6.1) corresponds to the higher mean squared error of estimation. For these reasons, in practice we rather use the estimator (6.2). Once the autocovariance is estimated, the correlogram is defined by the values of the autocorrelation of the sample, given by

r_k = \frac{c_k}{c_0}   (6.3)

For most series, the autocorrelation ρ(k) reaches zero from a certain shift k'. The question is how we can identify k' from the sample if the estimator r_k has an error. To answer this question, we first can show that if {x_t} are observations of an iid variable with an arbitrary average, r_k has an asymptotically normal distribution with

E(r_k) \approx -\frac{1}{n}, \quad Var(r_k) \approx \frac{1}{n}

From these characteristics we can define the 95% confidence interval for the values r_k which correspond to ρ(k) = 0:

-\frac{1}{n} - \frac{2}{\sqrt{n}} < r_k < -\frac{1}{n} + \frac{2}{\sqrt{n}}

which is near ±2/\sqrt{n}, because normally 1/n ≪ 2/\sqrt{n}. An illustration of the use of the confidence interval to identify ρ(k) = 0 from the sample is given in Section 4.3.1 and in Figure 4.1. Note that the bias of the estimator (6.2) can be reduced using the jackknife technique, where the final estimator is given by

\tilde{c}_k = 2 c_k - \frac{c_k^{(1)} + c_k^{(2)}}{2}   (6.5)

where c_k, c_k^{(1)} and c_k^{(2)} are the estimations (6.2) based on all the observations, the first half of the observations, and the second half of the observations, respectively. We can show that (6.5) reduces the bias from the order of 1/n to 1/n². Similarly, we can reduce the bias of the estimator (6.3) using

\tilde{r}_k = 2 r_k - \frac{r_k^{(1)} + r_k^{(2)}}{2}   (6.6)

where r_k, r_k^{(1)} and r_k^{(2)} are the estimations (6.3) based on all the observations, the first half of the observations, and the second half of the observations, respectively.

6.1.2. Interpretation of the correlogram

The correlogram gives the possibility of a preliminary determination of the type of series. In this section we give certain rules for the series that can be modeled by a model of the ARIMA family. We assume that we are interested in the characteristics of the short-term fluctuations.

a) If the correlogram decreases slowly and reaches zero only for very large shifts, as shown in Figure 4.4, we assume that the series is not stationary. In this case, we must differentiate the series with the degree d which gives stationarity.

b) If the correlogram drops to zero immediately after a certain shift k, we assume that the series corresponds to a moving average of order MA(q = k).

c) If the correlogram is a mixture of damped exponentials and damped sinusoids, and it decreases (or alternates) slowly, we assume that the series corresponds to an autoregressive process. Unfortunately, it is difficult to find the order p of the AR model from the correlogram.

d) If the correlogram alternates and does not fall to zero, we can also assume that the series belongs to the ARMA category.

6.2. Estimation of the parameters of the autoregressive model

Assume that the analysis of a series' correlogram suggests the autoregressive model with average µ:

X_t - \mu = \alpha_1 (X_{t-1} - \mu) + \dots + \alpha_p (X_{t-p} - \mu) + Z_t   (6.7)

The objective is to estimate the values of the parameters µ, p and {α_i}. To facilitate the estimation, it is suitable to decompose the problem into two parts:
a) estimation of µ and {α_i} while assuming that the order p is known
b) estimation of the order p.

First, we will deal with the estimation of µ and {α_i}, while the estimation of the order p will be treated in Section 6.2.2.

Generally, to estimate µ and {α_i} we can use one of two methods:
a) the Least Square Estimate method - LSE
b) the Maximum Likelihood Estimate method - MLE

The two methods provide very similar or identical results. First, we focus on LSE, while MLE is discussed in Section 6.2.3.

6.2.1. Least Square Estimate method - LSE

Let us consider a generic model of the time series,

x_t = M(t, \theta) + \varepsilon_t

where θ represents the set of model parameters and ε_t is a random factor. In the least square method, we estimate θ by finding the minimum of the sum of squared errors,

S(\theta) = \sum_t \left( x_t - M(t, \theta) \right)^2   (6.9)

where the error is defined as the difference between the value of the observation x_t and the value given by the model M(t, θ), which may also be a function of the previous observations. Note that this technique is the same as the one used for the regression problem described in Section 3.6. Now, by applying the least square method (6.9) to estimate the parameters µ and {α_i}, while assuming that the order p is known in our autoregressive model (6.7), we reach

S(\mu, \{\alpha_i\}) = \sum_{t=p+1}^{n} \left( x_t - \mu - \sum_{i=1}^{p} \alpha_i (x_{t-i} - \mu) \right)^2   (6.10)

As already mentioned, this minimization is analogous to the least square method used in regression. An important difference is that, in regression, the variables x_i are independent, which is not the case for the x_t in the autoregressive model. Nevertheless, we can demonstrate that the basic results concerning the optimality of the solution for regression are also valid for the autoregressive model. In particular, the least square estimator has the minimal variance among the unbiased linear estimators.

AR(1) parameters

According to (6.7), the AR(1) model can be defined as

X_t - \mu = \alpha (X_{t-1} - \mu) + Z_t   (6.11)

For this model, the least square method (6.10) gives

S(\mu, \alpha) = \sum_{t=2}^{n} \left( x_t - \mu - \alpha (x_{t-1} - \mu) \right)^2

In this case we can analytically find the solution in the form

\hat{\mu} = \frac{\bar{x}_{(2)} - \hat{\alpha}\, \bar{x}_{(1)}}{1 - \hat{\alpha}}, \quad \hat{\alpha} = \frac{\sum_{t=2}^{n} (x_t - \bar{x}_{(2)})(x_{t-1} - \bar{x}_{(1)})}{\sum_{t=2}^{n} (x_{t-1} - \bar{x}_{(1)})^2}

where \bar{x}_{(1)}, \bar{x}_{(2)} are the means of the first and last n - 1 observations in the sample, respectively. Because we can assume that \bar{x}_{(1)} \approx \bar{x}_{(2)} \approx \bar{x}, we reach the solution in the form

\hat{\mu} = \bar{x}, \quad \hat{\alpha} = \frac{\sum_{t=2}^{n} (x_t - \bar{x})(x_{t-1} - \bar{x})}{\sum_{t=2}^{n} (x_{t-1} - \bar{x})^2} \approx r_1

Note that this solution is identical to the solution of AR(1) using ordinary regression. In this case, the element (X_{t-1} - µ) in the AR(1) equation (6.11) is treated as the independent variable. Note also that the coefficient \hat{\alpha}_1 is approximately equal to the autocorrelation coefficient r_1. A quick numerical check of the AR(1) solution is given below.

AR(2) parameters

According to (6.7), the AR(2) model can be defined as

X_t - \mu = \alpha_1 (X_{t-1} - \mu) + \alpha_2 (X_{t-2} - \mu) + Z_t

For this model, the least square method (6.10) gives

S(\mu, \alpha_1, \alpha_2) = \sum_{t=3}^{n} \left( x_t - \mu - \alpha_1 (x_{t-1} - \mu) - \alpha_2 (x_{t-2} - \mu) \right)^2

In this case we can analytically find a solution in the approximate form

\hat{\mu} = \bar{x}, \quad \hat{\alpha}_1 = \frac{r_1 (1 - r_2)}{1 - r_1^2}, \quad \hat{\alpha}_2 = \frac{r_2 - r_1^2}{1 - r_1^2}   (6.18)
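Here is the announced check: a Python sketch of the approximate AR(1) solution (µ̂ ≈ x̄, α̂ ≈ r_1), tested on simulated data; the true values µ = 10 and α = 0.6 are assumptions of the example.

```python
import random

random.seed(5)
alpha_true, mu_true = 0.6, 10.0
x = [mu_true]
for _ in range(5000):
    x.append(mu_true + alpha_true * (x[-1] - mu_true) + random.gauss(0, 1))

xbar = sum(x) / len(x)                                   # mu_hat ~ x_bar
num = sum((x[t] - xbar) * (x[t - 1] - xbar) for t in range(1, len(x)))
den = sum((x[t - 1] - xbar) ** 2 for t in range(1, len(x)))
print(f"mu_hat = {xbar:.2f}, alpha_hat = {num / den:.3f}")  # ~10 and ~0.6
```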

Returning to the AR(2) solution (6.18), the coefficient \hat{\alpha}_2 is also treated as the estimator of the partial autocorrelation of order p = 2, π_2. In general, the partial autocorrelation of order p, π_p = α_p, defines the excess autocorrelation compared with the order p - 1. This notion applies to any order p and will be used in the next section to find the order p of the AR model.

6.2.2. Estimation of the order of the autoregressive model

The determination of the order of an AR series, based on the sample, is not easy. The correlogram provides no specific information, as it does in the case of the moving average (see Section 6.3). We can only say that:
- for AR(p = 1) the correlogram decreases exponentially
- for AR(p > 1) the correlogram is a mixture of damped exponential functions or damped sinusoid functions.

Nevertheless, we can find the order of an AR series by analyzing the behavior of certain characteristics of the series, presented as a function of the order. Here are two approaches based on this idea (a code sketch of the first approach closes this chapter):

1) Analysis of the partial autocorrelations π_j:
   a. Estimate the partial autocorrelations π_j = \hat{\alpha}_j of the sample for a large enough, increasing sequence of orders j.
   b. Identify the order k from which we can assume that π_j = 0. Here, we must use the concept of the confidence interval, presented in Section 4.3.1, to identify the area where the correlogram is equal to zero. For example, for the 95% confidence interval we have |π_j| < 2/\sqrt{n}.
   c. The order of the AR series is estimated as the order of the last partial autocorrelation which has a significant value, |π_j| > 2/\sqrt{n}. In other words, p = k - 1.

2) Analysis of the minimum of the sum of squared errors S_min(j):
   a. Calculate the minimum of the sum of squared errors S_min(j) for the sample (see Section 6.2.1, equation (6.10)) for a large enough, increasing sequence of orders j.

   b. Identify the order k from which we can assume that the sum of squared errors S_min(j) of the sample is constant (and, of course, minimal).
   c. The order of the AR series is estimated as the minimal order which gives the minimal sum of squared errors S_min(j) for the sample. In other words, p = k.

Note that both methods are based on the same principle and involve the same complexity. Nevertheless, the method based on the analysis of the partial autocorrelations has the advantage of using the confidence interval to determine the order. On the other hand, in the method based on the analysis of the minimum of the sum of squared errors, the choice of the order is less precise with regard to stochastic variations.

6.2.3. Maximum Likelihood Estimate method - MLE*

Suppose that X_t: t = 1, ..., n is a sample of independent observations with a probability density function f(x_t). In this case the multi-dimensional probability density function is defined as

f(x_1, \dots, x_n) = \prod_{t=1}^{n} f(x_t)   (6.22)

Let us assume that the probability density function is a function of certain parameters θ: f(x_t, θ). In this case we can define the likelihood function

L(\theta) = \prod_{t=1}^{n} f(x_t, \theta)   (6.23)

The Maximum Likelihood Estimation (MLE) of the parameters θ is obtained by maximizing L(θ) over θ. Note that in the case of time series, the observations are not independent, so the equations (6.22, 6.23) should be adjusted by using the probability density functions f(x_t) and f(x_t, θ) conditioned on the previous values X^{t-1} = {X_1, X_2, ..., X_{t-1}}. In fact, the Kalman filter, discussed in the following chapters, produces such a conditional distribution for a general class of time series models.

We should emphasize that in the analysis of time series, the method of maximum likelihood is always used while assuming that the series is Gaussian (i.e. Z_t has a normal distribution).

Estimation of AR(2) parameters

In this section, we present an example of the application of the maximum likelihood method for the estimation of the parameters of an AR(2) series. In this case, the conditional probability density function is given by

f(x_t \mid x_{t-1}, x_{t-2}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left( x_t - \mu - \alpha_1 (x_{t-1} - \mu) - \alpha_2 (x_{t-2} - \mu) \right)^2}{2\sigma^2} \right)

and the likelihood function is given by

L = \prod_{t=3}^{n} f(x_t \mid x_{t-1}, x_{t-2})   (6.27)

Note that the product starts with t = 3, but if the number of observations is large, the effect of this approximation is minimal. To resolve (6.27) we apply the logarithmic operation, which gives

\ln L = -(n-2)\ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{t=3}^{n} \left( x_t - \mu - \alpha_1 (x_{t-1} - \mu) - \alpha_2 (x_{t-2} - \mu) \right)^2

This equation shows that maximizing (6.27) is equivalent to the minimization of

\sum_{t=3}^{n} \left( x_t - \mu - \alpha_1 (x_{t-1} - \mu) - \alpha_2 (x_{t-2} - \mu) \right)^2

So we obtain a solution identical to the least square solution (LSE) given by (6.18). Although this is not always the case, LSE generally gives a good approximation of MLE. If σ² is not known, it can be estimated using LSE.

6.3. Estimation of the parameters of the moving average model

Assume that the analysis of a series' correlogram suggests the moving average model, MA, with average µ:

X_t = \mu + \beta_0 Z_t + \beta_1 Z_{t-1} + \dots + \beta_q Z_{t-q}

The objective is to estimate the values of the parameters µ, q, and {β_i}. To facilitate the estimation, it is suitable to decompose the problem into two parts:
a. estimation of the order q
b. estimation of the parameters µ and {β_i}

Contrary to the autoregressive model, the order estimation for the moving average model is easy. On the other hand, the estimation of the parameters µ and {β_i} is more difficult.

6.3.1. Estimation of the order of the moving average model

The order of an MA series, based on the sample, can easily be estimated by a correlogram analysis, because the theoretical autocorrelation is cut to zero for the shifts larger than q. In practice it is necessary to apply the following steps:
a. Calculate the sample's correlogram.
b. Identify the order k from which we can assume that r_j = 0. Here we must use the concept of the confidence interval, presented in Section 4.3.1, to identify the area where the correlogram is equal to zero. For example, for the 95% confidence interval we have |r_j| < 2/\sqrt{n}.
c. The order of the MA series is estimated as the order of the last autocorrelation coefficient which has a significant value, |r_j| > 2/\sqrt{n}. In other words, q = k - 1.

6.3.2. Estimation of the parameters of the moving average model

In general, the estimation of the parameters of the moving average model is difficult because there are no explicit estimators. To solve the problem we can use a numerical iteration. In the next section we describe such an approach for MA(1).

Estimation of MA(1) parameters

Let us consider the moving average model of order q = 1 with β_0 normalized to 1:

X_t = \mu + Z_t + \beta_1 Z_{t-1}

In fact, we could be tempted to use the theoretical equation (5.5), which defines the autocorrelation, to calculate \hat{\beta}_1 from r_1 under the condition |\hat{\beta}_1| < 1:

r_1 = \frac{\hat{\beta}_1}{1 + \hat{\beta}_1^2} \quad\Rightarrow\quad \hat{\beta}_1 = \frac{1 - \sqrt{1 - 4 r_1^2}}{2 r_1}   (6.31)

Unfortunately, we can show that such an estimator is inefficient. So we are forced to use the following numerical procedure (sketched in code below):
a. Start with \hat{\mu} = \bar{x} and \hat{\beta}_1 from (6.31).
b. Calculate the sum of squared residues, \sum_{t=1}^{n} z_t^2, using the recurrence

z_t = x_t - \hat{\mu} - \hat{\beta}_1 z_{t-1}

starting with z_0 = 0.
c. Minimize \sum_{t=1}^{n} z_t^2 through the inspection of values close to \hat{\mu} and \hat{\beta}_1.
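Here is the announced Python sketch of the MA(1) procedure; the grid resolution of the inspection in step c and the simulated data (µ = 5, β_1 = 0.5) are illustrative assumptions.

```python
import random

def sum_sq_residuals(x, mu, b1):
    """Sum of squared residuals z_t = x_t - mu - b1*z_{t-1}, z_0 = 0 (step b)."""
    z, total = 0.0, 0.0
    for xt in x:
        z = xt - mu - b1 * z
        total += z * z
    return total

def fit_ma1(x, mu0, b10, span=0.2, steps=41):
    """Inspect a (mu, b1) grid around the starting point (step c)."""
    best = None
    for i in range(steps):
        for j in range(steps):
            mu = mu0 - span + 2 * span * i / (steps - 1)
            b1 = b10 - span + 2 * span * j / (steps - 1)
            s = sum_sq_residuals(x, mu, b1)
            if best is None or s < best[0]:
                best = (s, mu, b1)
    return best[1], best[2]

# Simulated MA(1) data with mu = 5 and beta_1 = 0.5.
random.seed(6)
z_prev, x = 0.0, []
for _ in range(2000):
    z_new = random.gauss(0, 1)
    x.append(5.0 + z_new + 0.5 * z_prev)
    z_prev = z_new

# Step a: starting values mu = x_bar and beta_1 from equation (6.31).
xbar = sum(x) / len(x)
c0 = sum((v - xbar) ** 2 for v in x)
c1 = sum((x[t] - xbar) * (x[t + 1] - xbar) for t in range(len(x) - 1))
r1 = c1 / c0
b1_start = (1 - (1 - 4 * r1 * r1) ** 0.5) / (2 * r1)
print(fit_ma1(x, mu0=xbar, b10=b1_start))   # close to (5.0, 0.5)
```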

6.4. Model Verification

Model verification may be done through an analysis of the residue. The residue is defined as the difference between the observation value x_t and the estimated value \hat{x}_t of this observation, which is calculated using the model and the previous observations. Therefore the residue is defined as

z_t = x_t - \hat{x}_t

For example, for AR(1) the residue is given by

z_t = x_t - \hat{\mu} - \hat{\alpha} (x_{t-1} - \hat{\mu})

If the model is exact, the residues should be random and independent. To verify these characteristics we can analyze the residues' correlogram. Also, we can assume that the distribution of the autocorrelation coefficients is normal, which gives the means to calculate the confidence interval to verify whether the correlogram indicates a dependency. For the 95% confidence we can say that the values which are outside the interval ±2/\sqrt{n} are significantly different from zero. If the analysis of the residue gives an indication of dependency, the model should be improved until a random and independent residue is reached.
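To close the chapter, here is the order-selection sketch announced in Section 6.2.2. The partial autocorrelations π_j are obtained from the autocorrelations by the Durbin-Levinson recursion (a standard method; the notes do not specify how π_j is computed), and the test uses theoretical AR(1) autocorrelations r_k = 0.6^k, for which only π_1 should be significant.

```python
def pacf(r, max_order):
    """Partial autocorrelations pi_1..pi_max_order from autocorrelations
    r[0..max_order] (r[0] = 1), via the Durbin-Levinson recursion."""
    phi_prev, pis = [], []
    for k in range(1, max_order + 1):
        if k == 1:
            phi_kk = r[1]
            phi = [phi_kk]
        else:
            num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                   for j in range(k - 1)]
            phi.append(phi_kk)
        pis.append(phi_kk)
        phi_prev = phi
    return pis

r = [0.6 ** k for k in range(6)]            # theoretical AR(1) autocorrelations
print([round(p, 3) for p in pacf(r, 5)])    # [0.6, 0.0, 0.0, 0.0, 0.0] -> p = 1
```

In practice the r_k would be the sample autocorrelations of Section 6.1.1, and the cutoff would be tested against the ±2/√n band.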

7. Forecasting

This chapter presents three approaches to forecasting: forecasting by extrapolation, exponential smoothing, and the Kalman filter. Forecasting by extrapolation uses a time series model that can be developed using the methodology presented in the previous chapter, but it is limited to stationary series. Exponential smoothing is a much simpler iterative method, because it does not require the creation of a model, and it can adapt to slow changes in the series' parameters; but it is an intuitive method that gives no confidence interval on the forecast. The Kalman filter is also an iterative and adaptive method, but it is based on a theory which also gives the confidence interval.

7.1. Forecasting by extrapolation

Consider a time series X_t at the moment T. Let us assume that at this time we have all the observations x_t of the series for t ≤ T. Let \hat{x}_T(k) be the forecast of the series for the time T + k, where k is the horizon of the forecast. This forecast may be presented as a conditional expectation:

\hat{x}_T(k) = E[X_{T+k} \mid x_T, x_{T-1}, \dots]

If we have an ARMA model of the series, the forecast \hat{x}_T(k) may be obtained by the substitutions:
a) future values of Z by zeros
b) future values of X by their conditional expectations
c) old values of X and Z by the values of their observations.

For example, assume that the model of our series is AR(1) with zero mean, X_t = α X_{t-1} + Z_t, and that we need the forecast \hat{x}_T(3). In this case the prediction is calculated iteratively:

\hat{x}_T(1) = \alpha x_T
\hat{x}_T(2) = \alpha \hat{x}_T(1) = \alpha^2 x_T
\hat{x}_T(3) = \alpha \hat{x}_T(2) = \alpha^3 x_T

The main advantage of forecasting by extrapolation is that this approach is based on a good model. On the other hand, there are also some disadvantages:
a) complex calculation of the model's parameters
b) the need to memorize the series
c) the assumption of the series' stationarity.

A short code sketch of the AR(1) example follows.
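A minimal Python sketch of the AR(1) extrapolation above (zero-mean case); the values of x_T and α are illustrative.

```python
def ar1_forecast(x_T, alpha, horizon):
    """Iterative AR(1) forecasts: x_hat_T(k) = alpha * x_hat_T(k-1)."""
    forecasts, x = [], x_T
    for _ in range(horizon):
        x = alpha * x                 # future Z replaced by zero
        forecasts.append(x)
    return forecasts

print(ar1_forecast(x_T=2.0, alpha=0.8, horizon=3))  # [1.6, 1.28, 1.024]
```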

We can avoid these disadvantages by using the exponential smoothing described in this section.

7.2. Exponential Smoothing

In exponential smoothing we build a new series using a weight 0 < α < 1:

S_T = \alpha x_T + (1 - \alpha) S_{T-1}

This operation can also be presented as a sum,

S_T = \alpha \sum_{j=0}^{\infty} (1 - \alpha)^j x_{T-j}

which clearly explains the name of the method: the weight of the previous observations decreases exponentially (or, more precisely, geometrically) with the elapsed time j. Note that exponential smoothing has already been presented in Section 4.5 as a low pass filter to remove local fluctuations. However, exponential smoothing can also be used for forecasting. In particular,

\hat{x}_T(1) = S_T = \alpha x_T + (1 - \alpha) \hat{x}_{T-1}(1)

This forecast can also be presented as a function of the forecasting error e_T = x_T - \hat{x}_{T-1}(1):

\hat{x}_T(1) = \hat{x}_{T-1}(1) + \alpha e_T

Note that for the horizon k > 1 we always have the same value of the forecast: \hat{x}_T(k) = \hat{x}_T(1).

There are several advantages to forecasting using exponential smoothing:
a) It adapts well to slow changes in the series' parameters; thus the assumption of the series' stationarity is not required.
b) There is no need to memorize the series.
c) The calculation is very simple (see the sketch below).

On the other hand, exponential smoothing also has some disadvantages:
a) It is not based on a theory or a statistical model.
b) We can say that the method is intuitive.
c) It gives no confidence interval.
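A Python sketch of the update form above; the initialization with the first observation and the sample values are assumptions of the example.

```python
def exp_smoothing_forecasts(x, a):
    """One-step forecasts x_hat_T(1) updated as x_hat + a * e_T."""
    forecast, out = x[0], []           # initialization assumption
    for obs in x[1:]:
        e = obs - forecast             # forecasting error e_T
        forecast = forecast + a * e    # same as a*obs + (1 - a)*forecast
        out.append(forecast)
    return out                         # out[-1] serves every horizon k >= 1

series = [10.0, 10.4, 10.2, 10.8, 10.6, 11.0]
print(exp_smoothing_forecasts(series, a=0.3))
```

Note that only the last forecast has to be memorized, which is exactly the advantage b) above.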

It should be mentioned that we can avoid these disadvantages by using the Kalman filter, which is described at the end of this chapter.

7.2.1. Choice of α

The choice of α for exponential smoothing is not easy, and there is no unique approach. First, we can consider a theoretical method that can be applied under the assumptions that the series is infinite and that its autocorrelation has the form given in (7.6). In this case we can show that if α is selected according to the corresponding rule, the sum of squared errors is minimized. On the other hand, if the condition (7.6) is not met, to find the optimal value of α we have to minimize the sum of squared errors Q(α) using a numerical method. In many cases the optimal value of α is found in the interval 0.01 < α < 0.3. However, if the optimal value is close to 0 or 1 and the function Q(α) is flat around the optimal value, it is better to choose a less extreme value.

7.2.2. Generalized Exponential Smoothing

Consider a time series with a linear trend, defined by

x_t = a + bt

In this case the exponential smoothing method gives the forecast

\hat{x}_T(1) = S_T = \alpha \sum_{j=0}^{\infty} (1 - \alpha)^j x_{T-j}

which can be expressed as

follows. For large T we can show that

S_T \approx a + bT - \frac{(1 - \alpha)\, b}{\alpha}

Using this formulation, and comparing with the future value x_{T+1} = a + b(T+1), we reach

x_{T+1} - \hat{x}_T(1) \approx \frac{b}{\alpha}

This result shows that the forecast has a delay (sometimes termed a shift) of b/α. To remove or minimize this delay we can apply double exponential smoothing with two parameters, α and β:

M_t = \alpha x_t + (1 - \alpha)(M_{t-1} + D_{t-1})
D_t = \beta (M_t - M_{t-1}) + (1 - \beta) D_{t-1}

where M_t corresponds to the estimator of the series and D_t corresponds to the estimator of the shift (delay) in the time interval t - (t-1). Thus, the forecast is given by

\hat{x}_t(k) = M_t + k D_t

To find the optimal values of α and β, we must minimize the sum of squared errors Q(α, β) using a numerical method.

7.3. Kalman filter

In this section we introduce the Kalman filter, and the related notions, in a simplified and abbreviated way, using a particular application. For more details, see the book [Dziong, 1997] and the references included in that book. The Kalman filter provides the optimal estimator according to certain criteria. It is an iterative filter that is based on the state space model, S-S (state-space model). It is important that this filter can also give the confidence interval of the estimation.

7.3.1. State space model

In general, an observation Z_t at the time t can be presented as a sum of a certain signal and of measurement noise. In the state space model, the observation is a vector that is defined by the measurement equation

Z_t = H_t W_t + \mu_t   (7.10)

where
W_t - vector of the system's state, which we cannot observe directly
H_t - observation matrix, which gives the part of the state that we can observe
µ_t - error vector (or noise) of the observation (or measurement). This vector is a white sequence with average zero and covariance matrix Y_t.

In the state space model, we also define the system's equation

W_t = \Phi_t W_{t-1} + e_t   (7.11)

where
Φ_t - transition matrix of the system's state
e_t - vector of the system's deviation. This vector is a white sequence with average zero and covariance matrix Q_t.

Figure 7.1 shows the dynamics of a system which corresponds to the system's equation and the measurement equation.

Figure 7.1. Dynamics of a system with a measurement (in the figure: modèle de système = system model; modèle de mesure = measurement model)

7.3.2. Scalar example of Kalman filter

Consider a link that serves VoIP (voice over IP) connections. For the purpose of call admission, it is important to know the average rate of served packets which, in general, is a function of the number of served connections and of the types of connections. Figure 7.2 shows an example of the trajectory of the average rate of packets, which changes at each admission or departure of a call. We assume that in each period t with the same number of calls we make a measurement, M_t^m, of the average rate. Of course, the real average rate M_t during the period t is different from the

measurement M_t^m, due to the measurement error µ_t, which is caused by the fact that the arrival process of the packets is random. Thus, we can formulate the measurement equation analogous to (7.10),

M_t^m = M_t + \mu_t   (7.12)

where in our case H_t = 1. Note that, based on the measurement method and the length of the epoch t, in general we can also estimate the variance of the measurement error, v_t^m, which corresponds to the covariance matrix Y_t.

Figure 7.2. Trajectory of the system's state (in the figure: époque = epoch; temps = time; départ = departure; arrivée = arrival)

We can also assume that the type of each incoming or outgoing call is known. This gives the possibility of estimating the average rate m_t of the incoming (or outgoing) call at the beginning (or at the end) of the epoch t, based on its type and its historical statistics. Therefore, we can define the system equation analogous to (7.11) by

M_t = \Phi_t M_{t-1} + e_t   (7.13)

where \Phi_t = (M_{t-1} + m_t)/M_{t-1} is the transition matrix of the system state and e_t is the error of the estimator m_t, which can be calculated based on the type of call and its historical statistics. Note that, based on the same historical statistics, we can also estimate the variance of the estimator's error, q_t, which corresponds to the covariance matrix Q_t. Finally, (7.13) can also be transformed to

M_t = M_{t-1} + m_t + e_t   (7.14)

The dynamics of the system and measurement models are illustrated in the upper part of Figure 7.3, while the lower part of this figure illustrates the dynamics of the Kalman filter.

Figure 7.3. Dynamics of the Kalman filter (in the figure: gain de Kalman = Kalman gain; modèle de système = system model; modèle de mesure = measurement model; filtre de Kalman = Kalman filter)

The objective of the Kalman filter is to calculate the optimal estimator \hat{M}_t of the average rate M_t. As illustrated in Figure 7.3, using (7.14) and the estimator \hat{M}_{t-1} from the previous iteration, we can find an extrapolation of the average rate:

\hat{M}_t^e = \hat{M}_{t-1} + m_t   (7.15)

So we have two sources of information for the estimation of the average rate: the extrapolation of the average rate, \hat{M}_t^e, and the measurement of the average rate, M_t^m. Looking at Figure 7.3, we can easily find that the optimal estimator of the Kalman filter uses both values, with the weight determined by the Kalman gain K_t:

\hat{M}_t = \hat{M}_t^e + K_t (M_t^m - H_t \hat{M}_t^e)   (7.16)

Because in our case H_t = 1, we obtain

\hat{M}_t = (1 - K_t) \hat{M}_t^e + K_t M_t^m   (7.17)

where K_t is the Kalman gain, calculated as shown below.

The optimality of the Kalman filter comes from the fact that it is calculated using the information about the extrapolation and measurement errors contained in v_t^e and v_t^m, respectively. Specifically,

K_t = P_t^e / (P_t^e + v_t^m)        (7.19)

where P_t^e is the extrapolation of the covariance matrix of the estimation error P_t, defined as

P_t^e = Φ_t P_{t-1} Φ_t^T + v_t^e        (7.20)

Equations (7.19, 7.20) show that if the measurement error v_t^m is dominant, the Kalman gain decreases (closer to zero) and the estimator gives a greater weight to the extrapolation M_t^e. On the other hand, if the extrapolation error is dominant, the Kalman gain increases (closer to 1) and the estimator gives a greater weight to the measurement. Using the Kalman gain, we can also update the covariance matrix of the estimation error:

P_t = (1 - K_t) P_t^e

Apart from being used in the Kalman gain calculation of the next iteration, the covariance matrix of the estimation error can be used to find the confidence interval of the estimator M̂_t, assuming a normal distribution of the error. For forecasting, we can use the extrapolation, assuming that we have an estimate Φ̂ of the transition matrix of the system's state. In our example we get

M_{t+1}^e = Φ̂_{t+1} M̂_t for k = 1, and M_{t+k}^e = Φ̂_{t+k} M_{t+k-1}^e for k > 1.
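The whole scalar recursion fits in a few lines of code. The following is a minimal sketch, assuming H_t = 1 and the scalar forms of (7.19) and (7.20); the function name and all numeric values are our own, chosen only for illustration.

import numpy as np

def scalar_kalman_step(m_hat, p, phi, meas, v_e, v_m):
    """One iteration of the scalar Kalman filter (H_t = 1).

    m_hat : previous estimate M̂_{t-1}
    p     : previous estimation-error variance P_{t-1}
    phi   : transition coefficient Φ_t
    meas  : current measurement M̃_t
    v_e   : variance of the system (extrapolation) error e_t
    v_m   : variance of the measurement error μ_t
    """
    m_e = phi * m_hat                  # extrapolation M_t^e = Φ_t M̂_{t-1}
    p_e = phi**2 * p + v_e             # extrapolated error variance (7.20)
    k = p_e / (p_e + v_m)              # Kalman gain (7.19)
    m_new = (1 - k) * m_e + k * meas   # optimal estimator M̂_t
    p_new = (1 - k) * p_e              # updated estimation-error variance P_t
    return m_new, p_new, k

# one step with illustrative numbers: prior estimate 1000 pkt/s, variance 50
m, p, gain = scalar_kalman_step(1000.0, 50.0, 1.04, 1080.0, 25.0, 400.0)

Running the step with a large v_m pushes the gain toward zero (trust the extrapolation), while a large v_e pushes it toward one (trust the measurement), exactly as described above.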

8. Non-linear models for estimation and forecasting

8.1. Extensions of ARMA models

In this section we present some examples of non-linear models that have the form of extensions of ARMA models.

Non-linear autoregressive model NLAR

A general form of a non-linear autoregressive model of order p is

X_t = f(X_{t-1}, ..., X_{t-p}) + Z_t

For example, for p = 1 we arrive at the model

X_t = φ(X_{t-1}) X_{t-1} + Z_t

where |φ(x_t)| < 1 for stability. In this class we can also define a model with a non-stationary parameter,

X_t = φ_t X_{t-1} + Z_t,   φ_t = β φ_{t-1} + ε_t

where ε_t is an i.i.d. variable independent of Z_t. For β = 0 we arrive at the model with a random coefficient.

Threshold autoregressive model TAR

Here is an example of a threshold autoregressive (TAR) model:

X_t = φ^(1) X_{t-1} + Z_t   if X_{t-1} < r
X_t = φ^(2) X_{t-1} + Z_t   if X_{t-1} ≥ r

In this case, the estimation of the coefficients φ^(1) and φ^(2) demands an iterative procedure. The forecast for horizon k = 1 is simple, but for wider horizons it becomes complicated.

Bilinear models

The bilinear models can be seen as a non-linear extension of ARMA models. Here is an example:

X_t = φ X_{t-1} + Z_t + β Z_{t-1} X_{t-1}

where β Z_{t-1} X_{t-1} is a "cross product term".
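As an illustration of regime switching, here is a minimal simulation sketch of a two-regime TAR(1) model of the kind shown above; the threshold r = 0 and the two coefficients are invented for the example, not taken from the notes.

import numpy as np

rng = np.random.default_rng(1)

# Simulate a two-regime TAR(1) series:
#   X_t = phi1 * X_{t-1} + Z_t  if X_{t-1} <  r
#   X_t = phi2 * X_{t-1} + Z_t  if X_{t-1} >= r
phi1, phi2, r = 0.9, -0.4, 0.0
x = np.zeros(500)
for t in range(1, len(x)):
    phi = phi1 if x[t - 1] < r else phi2   # pick the regime from the last value
    x[t] = phi * x[t - 1] + rng.normal()

Because the active coefficient depends on the previous value, a one-step forecast only needs to check which regime the current observation falls in, while multi-step forecasts must account for future regime switches; this is what makes wider horizons complicated.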

8.2. Neural networks

Figure 8.1 shows an example of a feed-forward neural network structure. In this example there is only one hidden layer of neurons, but in general there can be more.

Figure 8.1. Example of a neural network structure
NB: couche d'entrée = input layer; couche cachée = hidden layer; couche de sortie = output layer

In general, the structure of a neural network is determined by an analyst experienced in the field. Specifically, he should determine three elements:
a) the number of layers
b) the number of neurons in each layer
c) the input and output functions of each neuron

Normally the input function is linear,

v_j = Σ_i ω_ij x_i

where ω_ij is the weight for input i of neuron j. The output function, also called the activation function, defines the neuron's value, z_j = f(v_j). It can be linear,

z_j = v_j

but more often it is nonlinear. Typical examples of nonlinear activation functions are the logistic function f(v) = 1 / (1 + e^(-v)) and the hyperbolic tangent f(v) = tanh(v).

In the case of one hidden layer, the output of the neural network can be presented as

ŷ = f_o( ω_0 + Σ_j ω_j f_j( Σ_i ω_ij x_i ) )        (8.2)

where
f_o, f_j - activation functions in the output layer and the hidden layer, respectively
ω_j, ω_ij - weights in the output layer and the hidden layer, respectively
ω_0 - weight of the direct link between the constant input and the output layer.

The activation function in the output layer is often chosen as the identity function, which means that the output layer is linear. Note that the form of equation (8.2) indicates that neural networks are related to linear regression models.

Neural Network Optimization

To optimize a neural network, we can apply the least squares method, which has already been presented in the section on the LSE method. In particular, based on the series' sample, we minimize the sum of squared errors of the predictions at horizon k = 1:

S = Σ_t (x_t - x̂_t)²

In general the optimization is done by trial-and-error search and backpropagation. Here are some recommendations and practical observations:
- It is good practice to divide the sample into two parts: one for optimization and the other for verification of the model.
- We often need thousands of iterations for convergence.
- Local minima are possible, due to the nonlinearity and the very large number of parameters.
- To avoid a large number of parameters, a penalty term can be added.
- A large number of parameters may minimize the squared error on the sample very well, but that does not necessarily give better forecasts.

There is also software that optimizes the neural network automatically, but sometimes the results are ridiculous.

Neural network for forecasting

Neural networks may be useful for long series with evident nonlinear characteristics. At the same time, to have confidence in the model we should have hundreds, and preferably thousands, of observations.
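To make equation (8.2) and the least-squares optimization concrete, here is a toy sketch of a one-hidden-layer network fitted by backpropagation for one-step-ahead forecasting. The network sizes, learning rate, number of iterations and the sample series are all invented for illustration; this is a sketch, not a recommended implementation.

import numpy as np

rng = np.random.default_rng(2)

# Inputs are the p previous values; hidden layer uses tanh, output layer
# is linear (identity activation); weights fitted by least squares via
# plain gradient descent (backpropagation).
p, hidden, lr, epochs = 3, 4, 0.01, 2000
series = np.sin(0.3 * np.arange(200)) + 0.1 * rng.normal(size=200)

X = np.array([series[t - p:t] for t in range(p, len(series))])
y = series[p:]

W1 = rng.normal(scale=0.5, size=(p, hidden))   # hidden-layer weights ω_ij
w2 = rng.normal(scale=0.5, size=hidden)        # output-layer weights ω_j
b2 = 0.0                                       # constant-input weight ω_0

for _ in range(epochs):
    z = np.tanh(X @ W1)               # hidden-layer outputs z_j = f(v_j)
    pred = z @ w2 + b2                # linear output layer, as in (8.2)
    err = pred - y                    # one-step-ahead prediction errors
    # gradients of the mean squared error
    g2 = z.T @ err / len(y)
    gb = err.mean()
    g1 = X.T @ ((err[:, None] * w2) * (1 - z**2)) / len(y)
    W1 -= lr * g1
    w2 -= lr * g2
    b2 -= lr * gb

In line with the recommendations above, the squared error should be checked on a held-out part of the series, not only on the part used for the gradient descent.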

8.3. Chaos

A time series that appears to be random may be:
a) random,
b) chaotic, i.e. non-linear but deterministic,
c) a combination of a) and b).

The best-known example of a chaotic series (or chaos) is the logistic map, also called the quadratic map. It is a deterministic series defined as

x_{t+1} = k x_t (1 - x_t)        (8.3)

We can show that for 0 < k ≤ 4 we have x_t ∈ (0,1), and that the series has the following characteristics for different values of k:
a) k = 4: the series lies on the quadratic curve shown in figure 8.2 (a); in addition, it has the characteristics of white noise without correlation (UWN - uncorrelated white noise).
b) 0 < k < 1: the series converges to zero, x_t → 0.
c) 1 ≤ k ≤ 3: the series converges to the attractor x_t → (1 - 1/k), as shown in figure 8.2 (b).
d) 3 < k < 3.57: the series has cyclical behavior.

Figure 8.2. Illustration of the logistic map (the quadratic map)
NB: attracteur = attractor

Characteristics of chaotic system

There are two interesting characteristics associated with chaotic series:

Butterfly effect
The name of this feature comes from the anecdote that the flapping of a butterfly's wings can cause a tropical storm. In the case of a chaotic series, a small change in the initial value can cause a big change in the following values. We can verify this by calculating x_3 for x_0 = 0.1 and for x_0 = 0.11 in the series defined by (8.3) with k = 4.

Non-integral dimension
Note that in the series defined by (8.3) with k = 4, all observations lie on the quadratic curve. Therefore, we can say that the dimension of this series is smaller than the dimension corresponding to a random series, which can take any value in the 1-by-1 square. In fact, there is a definition of non-integral dimension, and in particular series with fractal behavior may have non-integral dimensions.

Fractal behavior and self-similarity

A fractal is a geometric shape that can be cut into pieces which are (at least approximately) copies of the original form. This property is also called self-similarity, and it can be visible at any scale. Fractals can be obtained from iterative series such as (8.3). Sometimes the field of attraction gives very interesting geometric forms, as shown in Figure 8.3.

Figure 8.3. Examples of fractals
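The butterfly effect mentioned above can be checked with a few lines of code, iterating (8.3) with k = 4 from the two nearby starting points x_0 = 0.1 and x_0 = 0.11:

def logistic(x0, k=4.0, n=3):
    # iterate the logistic map x_{t+1} = k * x_t * (1 - x_t), n times
    x = x0
    for _ in range(n):
        x = k * x * (1 - x)
    return x

print(logistic(0.10))   # x_3 starting from x_0 = 0.10
print(logistic(0.11))   # x_3 starting from x_0 = 0.11

After only three iterations the two trajectories already differ in the first decimal place, even though the starting points differed by only 0.01.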
