Stochastic Volatility Models with Auto-regressive Neural Networks


AUSTRALIAN NATIONAL UNIVERSITY
PROJECT REPORT

Stochastic Volatility Models with Auto-regressive Neural Networks

Author: Aditya KHAIRE
Supervisor: Adj/Prof. Hanna SUOMINEN
Co-Supervisor: Dr. Young LEE

A report submitted in fulfillment of the requirements for the subject Special Topics in Computing (COMP6470) in the Department of Computer Science

October 28, 2016


AUSTRALIAN NATIONAL UNIVERSITY

Abstract

Dr. Weifa Liang
Department of Computer Science

Stochastic Volatility Models with Auto-regressive Neural Networks
by Aditya KHAIRE

Financial time series data often exhibit high volatility, which makes them unpredictable and their future values very hard to forecast. Econometrics offers several models for the stochastic volatility of such data, but they focus mainly on describing the mean and variance of the time-dependent series; predicting the variance of the next time interval is not their main concern. The algorithm proposed in this project focuses on predicting the variance of a time-dependent data set.


Acknowledgements

I would like to express my gratitude to my supervisor, Adj/Prof. Hanna SUOMINEN, and my co-supervisor, Dr. Young LEE, who gave me the opportunity to work on the topic Stochastic Volatility Models with Auto-regressive Neural Networks; the project also led me to a great deal of research through which I learned many new things, and I am really thankful to them. The thesis was partly carried out at National Information and Communication Technology Australia (NICTA) and its successor Data61. I also express my gratitude to Mr. Kar Wai Lim, my advisor at NICTA/Data61 and the ANU. Finally, I would like to thank my parents and friends, who helped me a lot in finishing this project within the limited time frame.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Introduction
  1.2 Literature Review

2 Gaussian Process Volatility Model
  2.1 Auto-regressive Models
  2.2 Stationarity
  2.3 Stochastic Process
  2.4 Gaussian Processes
  2.5 The Stochastic Volatility Model
    2.5.1 The priors
  2.6 Heteroscedasticity
  2.7 Generalized Auto-regressive Conditionally Heteroscedastic (GARCH)
    2.7.1 ARCH(1) Processes
    2.7.2 GARCH(1,1) Process
  2.8 Model GARCH Process
  2.9 Results on GARCH

3 Auto Regressive Neural Network (AR-NN)
  3.1 Feed Forward Neural Network
  3.2 Design of Auto Regressive Neural Network
    3.2.1 Time Series Data Set
    3.2.2 Weight-space
    3.2.3 Activation Function
    3.2.4 Back Propagation Algorithm
    3.2.5 Practical Reducibility
  3.3 Function
  3.4 Results
    3.4.1 Alpha (when α = 1, |α| ≤ 1, α > 1)
    3.4.2 Discussion on Activation Function
  3.5 Prediction Results

4 Comparison of Models
  4.1 Comparison between SV, GARCH and AR-NN

5 Summary
  Conclusion
  Recommendation
  Future Work

Independent Study Contract

Bibliography

List of Figures

2.1 Prediction and Observation for the GARCH model
3.1 Auto Regressive Neural Network
3.2 AUD/USD daily returns
3.3 Error
3.4 Error Function with α = 1
3.5 Error Function with |α| ≤ 1
3.6 Error Function with α > 1
3.7 Logistic Activation Function
3.8 tanh Activation Function
3.9 AR-NN Prediction and Observed
3.10 Error on Train set
Contract
Contract Page
Contract Page


List of Tables

2.1 Values for parameters ω, α, β
2.2 Error on GARCH model
Number of iterations required for convergence


List of Abbreviations

MCMC     Markov Chain Monte Carlo
AR-NN    Auto-regressive Neural Network
GARCH    Generalized Auto-Regressive Conditional Heteroskedasticity
ARCH     Auto-Regressive Conditional Heteroskedasticity
SV       Stochastic Volatility
NN       Neural Network
AUD/USD  Australian Dollar / U.S. Dollar


Chapter 1

Introduction

1.1 Introduction

Financial time series data from sources such as the stock market or foreign exchange are time dependent and have the time index as one of their variables. It is often observed in financial time series that large changes tend to be followed by further large changes, while small changes are followed by small changes; this phenomenon is referred to as volatility clustering (Nicolas Chapados, 2012). Stochastic volatility models are used for modelling such financial data, with Gaussian noise included as an additive factor. When non-linear stochastic volatility is modelled with a linear state space, the model becomes complicated, because the Gaussian noise is transformed through the non-linearity and no longer maps to a Gaussian distribution.

The approach for predicting time series data differs from that for time-independent data, which can be transformed and is easy to manipulate; the same cannot be assumed when dealing with time-dependent data. Studies have shown that time series data are easy to model when they are linear with a uniform distribution over time, but when a non-linear time series has to be handled it is usually converted into a linear form and then modelled with a linear distribution. In the linear approach, Bayesian inference is the most suitable way of modelling the posterior mean and variance: the posterior mean corresponds to the time-varying standard deviation and can be modelled through the stochastic volatility model.

Neural models are non-linear in nature because of the activation functions used in the hidden layer of the network. Mapping non-linear financial data onto a neural model has its own advantages, since a neural network has the universal approximation property: with a proper selection of prior parameters the model can achieve a reasonably good predictor. The normal feed-forward neural network cannot be used directly for time series data, because it depends only on non-linear factors, which makes it difficult to map time-indexed data. A different neural model, the Auto-regressive Neural Network, is therefore used in this project for prediction analysis. The proposed Auto-regressive Neural model differs from the feed-forward model in that it combines a linear model with additive Gaussian noise when building the final model. The prediction from the neural model can be compared with the posterior mean from the linear Gaussian model, which in turn minimizes the mean squared error loss.

1.2 Literature Review

In recent years the financial market has become more volatile after the 2008 crisis and has remained unstable for a long time since then. There was no single specific reason for the crisis, but during this period financial returns were extremely high or low in comparison with normal periods. When building a portfolio on risk assessment, volatility and other factors always depend on time series analysis. These are major factors in the financial market and have

always been unpredictable, because statistical models are inefficient at predicting them. As pointed out in (Nicolas Chapados, 2012), large changes in the observations are followed by large changes, while small changes are followed by small changes; this is referred to as volatility clustering.

Time series analysis is a field in economics which tries to estimate various quantities by observing past values. Time series data are difficult to model because their mean and variance are non-constant. Broadly, there are two types of models in time series analysis: linear models and non-linear models. In linear models, shocks are uncorrelated but are not assumed to be identical and independent; in non-linear models, shocks are assumed to be identical and independent (Ruppert, 2001). Financial series are stochastic in nature with non-constant mean and variance, so the usual regression approach is not beneficial in this field. Approaches used for modelling stochastic volatility in the financial market include Bayesian inference (Nicolas Chapados, 2012) and stochastic volatility with Markov Chain Monte Carlo (MCMC) (Nicolas Chapados, 2012). These models are complex in nature and difficult to build. Hence the most common model has been Auto-Regressive Conditional Heteroskedasticity (ARCH), and a more recently used model is GARCH (Generalized Auto-Regressive Conditional Heteroskedasticity), an improved version of ARCH. These models are important in financial time series analysis because they exhibit the volatility clustering phenomenon, whereas models such as Bayesian inference and stochastic volatility with MCMC are too complicated and not very popular for real-world time series modelling. Relatively less complex models are preferred in the financial market because they are stable, capture some non-linearity and are easier to implement. Along with the advantages of modelling volatility with GARCH, it has a significant disadvantage: it is not able to model the hidden non-linearity in the data, because the GARCH model relies only on the previous forecast and variance to predict the current value. GARCH/ARCH models are used specifically for non-linearity in the data or for structural regression coefficients (Dietz, 2010).

The models discussed above were never developed as prediction models, but rather to calculate the posterior variance and mean of the time series data. Predicting the non-linear volatility in financial time series is complex, as there is no specific neural network designed for such prediction. However, the neural network has a property called universal approximation (Dietz, 2010), which is helpful in approximating any function and can thus be utilised in prediction analysis. The universal approximation property says that a network with a continuous, bounded and non-constant activation function can approximate any function. This property makes neural networks suitable for various applications and inspired us to explore them in the field of financial time series analysis. We do not use the normal feed-forward neural network, as it does not complement the linearity in the time series data; instead, the model proposed in (Dietz, 2010) serves as the foundation for our modelling.

Chapter 2

Gaussian Process Volatility Model

2.1 Auto-regressive Models

A time series is a sequence of variables measured over time at uniformly spaced intervals, for example monthly, daily or yearly. Time series data are Markov dependent with higher-order lags. In univariate state-space models a time series is represented by the auto-regressive model AR(p). The series {y_t} means y measured at time t with a uniform time spacing. The AR(1) model regresses y_t on the past value y_{t-1}, the previous value of the series:

y_t = φ y_{t-1} + ε_t    (2.1)

where ε_t ~ N(0, 1) is additive white noise with µ = 0 and σ² = 1, uncorrelated with the past values y_{t-1} of the AR series. ε_t represents the new contribution to y_t; these terms are known as the random shocks or innovations of the series. The equation is termed auto-regressive because it is actually a linear regression model for y_t in terms of y_{t-1}; that is, y_t is modelled as a regression on its own past y_{t-1}. The value of φ strongly affects the behaviour of the AR(1) process. If |φ| < 1, the weights given to shocks ε_t that occurred a long time ago are extremely small, which makes the series stationary, so that the mean and variance of the model remain constant as t grows. If |φ| > 1, the weights given to distant shocks are much greater than those given to more recent ones; the model is said to be explosive, as the series mean and variance tend to grow exponentially with t. Finally, if φ = 1, the model is neither stationary nor explosive and is called a random walk.

2.2 Stationarity

An AR(p) process is stationary if the following properties of the time series hold:

- E[y_t] = 0 for all t;
- Var(y_t) = Σ_{j=0}^{∞} σ² ψ_j² for all t, which can only hold if the weights ψ_j decay rapidly to 0 as j → ∞;
- Cov(y_t, y_{t-k}) = γ(k) is a function of the ψ_j weights and depends only on the lag k, not on t.

When the order p ≥ 3, the restrictions on the coefficients φ increase and the model becomes much more unstable when trying to predict; moreover, if the model over-fits the training data, the generalization error measured on the test data will increase.
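To make the role of φ concrete, the short simulation below (not part of the original report; the parameter choices are illustrative) generates AR(1) series following equation (2.1) for a stationary, a random-walk and an explosive value of φ.

```python
import numpy as np

def simulate_ar1(phi, n=500, sigma=1.0, seed=0):
    """Simulate y_t = phi * y_{t-1} + eps_t with eps_t ~ N(0, sigma^2), as in eq. (2.1)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    eps = rng.normal(0.0, sigma, size=n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + eps[t]
    return y

for phi in (0.5, 1.0, 1.05):          # stationary, random walk, explosive
    y = simulate_ar1(phi)
    print(f"phi={phi:5.2f}  mean={y.mean():10.3f}  var={y.var():14.3f}")
```

With |φ| < 1 the sample mean and variance stay bounded, while the explosive case grows rapidly, matching the discussion above.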

2.3 Stochastic Process

A stochastic process is a collection of random variables {X_t}, indexed by a parameter t in an index set T, where t represents time. A stochastic process that depends on time is a simple process which evolves at specific times according to specific probabilistic rules. Thus the state space assumed in the time-dependent analysis evolves as a stationary process in which the probabilistic rules of the transition matrix stay constant. If π is a stationary distribution and P is a transition probability,

P(X, Y) = p for all t ≥ 0.    (2.2)

A measure µ, with µ(x) = q_x π(x), that is stationary for X is also stationary for Y. For a discrete-time process, the random variable X_n depends on earlier values of the process X_{n-1}, X_{n-2}, ..., so the conditional distributions have the form

Pr(X_{t_k} | X_{t_{k-1}}, X_{t_{k-2}}, ..., X_{t_1})    (2.3)

for some set of times t_k > t_{k-1} > t_{k-2} > ... > t_1. Time-dependent stochastic processes that satisfy the Markov property obey

Pr(X_{t_k} | X_{t_{k-1}}, X_{t_{k-2}}, ..., X_{t_1}) = Pr(X_{t_k} | X_{t_{k-1}}).    (2.4)

Stochastic processes that satisfy the Markov property are easy to model; the stock exchange and the exchange rate, which are also time dependent, are examples.

2.4 Gaussian Processes

A Gaussian process (GP) is a generalization of the Gaussian distribution in which probability is distributed over functions with a specified mean and covariance; a GP extends the multivariate Gaussian distribution to infinite dimensionality. Definition: a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Let x ∈ R^D index into the real process f(x). We write

f(x) ~ GP(m(·), k(·, ·))    (2.5)

where the functions m(·) and k(·, ·) are, respectively, the mean and covariance functions:

m(x) = E[f(x)],   k(x_1, x_2) = E[(f(x_1) − m(x_1))(f(x_2) − m(x_2))].    (2.6)

Data received from different sources are never consistent and contain errors, so each observation y can be thought of as a function f(x) of the data x with an additive noise model,

y = f(x) + ε,   ε ~ N(0, σ_n²).    (2.7)

The volatility measurements obtained from the log-range are normally distributed around the true log-volatility, and hence this equation is assumed to hold with the function f(·) representing the log-volatility, where y is the observed value for x modelled on the time index. For modelling stochastic volatility with Gaussian processes, the problem is cast as a regression from the time indexes to the volatility measurements obtained from the log-range, and the data for estimation are the pairs D = {(t_i, y_i)}, where t_i is the time index and y_i would be

the value obtained from the formulation

y_t = f(t) + σ_n² ε_t.    (2.8)

2.5 The Stochastic Volatility Model

The observation at time t is given by

y_t = e^{h_t / 2} ε_t,    (2.9)

for t = 1, 2, ..., T, where ε_t ~ N(0, 1). Note that the state h_t is called the log-volatility. The states are assumed to evolve according to a stationary process

h_t = µ_h + φ_h (h_{t-1} − µ_h) + η_t    (2.10)

for t = 2, 3, ..., T, where η_t ~ N(0, σ_h²) and is independent of ε at all leads and lags. Hence the conditional variance of y_t is given by

Var(y_t | h_t) = (e^{h_t / 2})² Var(ε_t) = e^{h_t}.    (2.11)

We further assume that |φ_h| < 1, and that the states are initialized from

h_1 ~ N(µ_h, σ_h² / (1 − φ_h²)),    (2.12)

which is the stationary distribution of the process.

2.5.1 The priors

We assume independent prior distributions for µ_h, φ_h and σ_h², i.e.,

p(µ_h, φ_h, σ_h²) = p(µ_h) p(φ_h) p(σ_h²).    (2.13)

Specifically, we use the following independent prior distributions:

µ_h ~ N(µ_h0, V_µ),    (2.14)
φ_h ~ N(φ_h0, V_φ) 1{|φ_h| < 1},    (2.15)
σ_h² ~ IG(ν_h, S_h),    (2.16)

where 1{·} denotes the indicator function and IG the inverse-gamma distribution. The stationarity condition |φ_h| < 1 is imposed through the prior distribution of φ_h. To model stochastic volatility we have invoked the R package stochvol, which has straightforward functions for the auto-regressive AR(1) SV analysis. We use the svsample function to train the model on the AUD/USD daily returns data set, and then the prediction function predict to predict a number of days ahead equal to the number of test data points. The mean squared error (MSE) is used to calculate the prediction error between the test data points and the predicted data points. This model is compared with the other models in Chapter 4 during the model comparison (Kastner, 2016).
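As a concrete illustration of equations (2.9)-(2.10), the following sketch simulates returns from the stochastic volatility model. It is illustrative only: the parameter values and the function name are assumptions, not taken from the report or from the stochvol package.

```python
import numpy as np

def simulate_sv(T=1000, mu_h=-9.0, phi_h=0.95, sigma_h=0.2, seed=1):
    """Simulate y_t = exp(h_t / 2) * eps_t with an AR(1) log-volatility state h_t."""
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    # h_1 drawn from the stationary distribution N(mu_h, sigma_h^2 / (1 - phi_h^2)), eq. (2.12)
    h[0] = rng.normal(mu_h, sigma_h / np.sqrt(1.0 - phi_h**2))
    for t in range(1, T):
        h[t] = mu_h + phi_h * (h[t - 1] - mu_h) + rng.normal(0.0, sigma_h)  # eq. (2.10)
    y = np.exp(h / 2.0) * rng.normal(0.0, 1.0, size=T)                      # eq. (2.9)
    return y, h

y, h = simulate_sv()
print("sample variance of returns:", y.var())
print("mean conditional variance exp(h_t):", np.exp(h).mean())
```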

2.6 Heteroscedasticity

Consider a sequence of output random variables Y_t and a sequence of input random variables X_t for which the conditional variance Var(Y_t | X_t) is non-constant over time t; a model with constant variance σ² is then not adequate to capture this phenomenon. Heteroscedasticity arises in two forms, conditional and unconditional. Conditional heteroscedasticity is identified as non-constant volatility when future high and low returns cannot be identified; unconditional heteroscedasticity is identified as independent volatility where future returns can be identified.

2.7 Generalized Auto-regressive Conditionally Heteroscedastic (GARCH)

GARCH time series models are widely used in the financial market for capturing the random volatility of returns. GARCH calculates the forecast y_t based on the squared past forecast value y²_{t-1} and the past variance σ²_{t-1}. The model is not difficult to understand, but it is worth focusing on the reasons for taking the past forecast and variance values into consideration. The main purpose of GARCH is to model the variance of financial returns. To understand the GARCH model, we first go through a simpler approach, the ARCH model (D. Ruppert, 2011). ARCH(1) models the conditional variance by looking at past values, similar to the linear AR(1) model discussed in Section 2.1. The linear regression model with constant variance σ² and expectation equal to 0 is given as

Y_t = f(X_t) + ε_t.    (2.17)

The conditional variance is constant, Var(Y_t | X_t) = σ², and f is the conditional expectation of Y_t given X_t. Equation (2.17) can be modified to allow conditional heteroscedasticity in the model:

Y_t = f(X_t) + σ(X_t) ε_t,    (2.18)

where ε_t has conditional mean equal to 0 and variance equal to 1. σ(X_t) must be non-negative, since it is a standard deviation. If the function σ(·) is linear then its coefficients have to be constrained so that the standard deviation remains non-negative; since controlling the coefficients in this way would be difficult, a non-linear non-negativity approach is used instead. The same conditional-variance approach is also used for the GARCH method.

2.7.1 ARCH(1) Processes

We have to consider the Gaussian noise in the ARCH(1) model. When adding the noise into the model, we take the Gaussian noise to have constant mean and variance,

E(ε_t | ε_{t-1}) = 0    (2.19)

and

Var(ε_t | ε_{t-1}) = 1.    (2.20)

This property of the white Gaussian noise is called homoskedasticity. The process a_t in the ARCH(1) model is given as

a_t = √(ω + α_1 a²_{t-1}) ε_t.    (2.21)
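Before continuing the derivation, a minimal simulation of the ARCH(1) recursion (2.21) is sketched below; the parameter values and the function name are assumptions made for illustration.

```python
import numpy as np

def simulate_arch1(T=2000, omega=0.1, alpha1=0.6, seed=2):
    """Simulate a_t = sqrt(omega + alpha1 * a_{t-1}^2) * eps_t with eps_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    a = np.zeros(T)
    eps = rng.normal(0.0, 1.0, size=T)
    for t in range(1, T):
        sigma2_t = omega + alpha1 * a[t - 1] ** 2   # conditional variance, eq. (2.26)
        a[t] = np.sqrt(sigma2_t) * eps[t]
    return a

a = simulate_arch1()
# For a stationary ARCH(1) process the unconditional variance is omega / (1 - alpha1).
print("sample variance:", a.var(), " theoretical:", 0.1 / (1 - 0.6))
```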

From equation (2.21) the expectation of a_t is equal to zero and its conditional standard deviation is equal to √(ω + α_1 a²_{t-1}), where, to keep the standard deviation positive, the variance coefficients must satisfy ω > 0 and α_1 ≥ 0. For the model a_t to be stationary, α_1 < 1. Equation (2.21) can also be rewritten as

a²_t = (ω + α_1 a²_{t-1}) ε²_t.    (2.22)

Equation (2.22) is similar to an AR(1) but with squared terms and noise with mean 1. The conditional variance for ARCH(1) is

σ²_t = Var(a_t | a_{t-1}),    (2.23)

and as the noise is independent of the past values a_{t-1}, the expectation of equation (2.21) is zero and E(ε²_t) = Var(ε_t) = 1. The conditional variance is therefore

σ²_t = E{(ω + α_1 a²_{t-1}) ε²_t | a_{t-1}}    (2.24)
     = (ω + α_1 a²_{t-1}) E(ε²_t | a_{t-1})    (2.25)
     = ω + α_1 a²_{t-1}.    (2.26)

To understand the GARCH model, note that the variance derived in equation (2.26) is the same variance used by the model. If a_{t-1} has a large magnitude then σ_t will also be large; this tends to make a_t large as well, and makes the volatility propagate from a_t to a_{t+1}. Similarly, if a_{t-1} is small in magnitude then σ_t is small and a_t will also be small in magnitude: there is a proportional relation between a_t and σ_{t+1}. Thus, writing the ARCH(1) variance model in terms of y_t conditional on y_{t-1}, with variance at time t and series mean equal to 0,

Var(y_t | y_{t-1}) = σ²_t = ω + α_1 y²_{t-1},    (2.27)
σ_t = √(σ²_t) = √(ω + α_1 y²_{t-1}),    (2.28)
y_t = σ_t ε_t.    (2.29)

2.7.2 GARCH(1,1) Process

GARCH uses the same approach as the ARCH model of relying on the past value y_{t-1} to predict the new y_t, but with the addition of the past variance σ²_{t-1} of the forecast to the model:

σ²_t = ω + α_1 y²_{t-1} + β_1 σ²_{t-1}.    (2.30)

In equation (2.30), β_1 multiplies σ²_{t-1}, the past variance, which is added into the model to make the ARCH model a generalized model for any time series data. Since ε_t from the past values changes the magnitude of y_t, the past variance is added into the GARCH model.

2.8 Model GARCH Process

To model the variance σ_t from the GARCH model, the constraints on the variance should be satisfied, and for this the selection of the parameters ω, α, β needs to adhere to the constraints; care should be taken so that the variance never becomes negative. The ranges ω ≥ 0, 0 ≤ α_1 ≤ 1 and 0 ≤ β_1 ≤ 1 are required to maintain a non-negative variance. To train

and test the model, we use the AUD/USD daily returns data set. The data set is split into training and testing sets using a split function, and all of it is normalized to reduce the effect of outliers. An optimize function is used to run the GARCH model multiple times and find the mean squared error between the observed forecast and the predicted forecast on the training set. The optimize function returns the values for ω, α, β, and these values are then used for GARCH prediction. The GARCH prediction function predicts the variance for the specific times t corresponding to the test data set. The mean squared error (MSE) is used as the cost function between the predicted values and the test data. As GARCH is a simple model, it is not able to capture the hidden non-linearity in the financial data. We go through the results in Section 2.9.

2.9 Results on GARCH

The financial data set is split into 3/4 as training set and 1/4 as test set for the GARCH model. The parameters ω, α and β need to be selected to get the optimal prediction on the test set. We use an optimization routine in the Python code to run the GARCH model iteratively and measure the cost function, minimizing the error over the selected parameter values for ω, α and β. The initial value for all the parameters at the start of the optimization process is 1 for ω, α and β; y²_{t-1} is the square of the lagged value of y_t. The variance is calculated as

σ²_t = ω + α_1 y²_{t-1} + β_1 σ²_{t-1},    (2.31)

and y_t is then calculated by taking the square root of the variance σ²_t:

y_t = √(σ²_t).    (2.32)

In the actual GARCH model, to calculate y_t we need to multiply this by Gaussian noise ε_t ~ N(0, 1); in the training phase, however, we do not add the Gaussian noise, because otherwise the trained model never reaches the global minimum of the cost function, which would give less efficient values of ω, α and β. Table 2.1 gives the values of the parameters after training, and these values can now be used for the prediction model. The GARCH function for prediction is the same as the GARCH function for training; the only difference is that we no longer use the optimize function to run the training model repeatedly. The errors calculated on the prediction model and the training model are given in Table 2.2.

TABLE 2.1: Values for parameters ω, α, β (Observed and Predicted)

In Figure 2.1, the two plots are based on the test set. There are some outliers in the observation plot which the prediction model is not able to capture; the reason is that the hidden non-linearity in the data set is not properly captured by the GARCH model. The total prediction error is 0.33, which is comparatively small, since the model is fairly simple to implement with few factors to be considered. If the input lag y_{t-1} is increased to y_{t-i} then the model becomes complex and it is difficult to find the global

minimum of the cost function for the training model. This will have a larger impact on the test data set, as the generalization error will increase drastically.

TABLE 2.2: Error on GARCH model (Test Set error)

FIGURE 2.1: Prediction and Observation for the GARCH model
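The training procedure described in Sections 2.8-2.9 (splitting the returns 3/4 : 1/4 and choosing ω, α₁, β₁ by minimizing the mean squared error on the training split) can be sketched as follows. This is an illustrative reconstruction rather than the author's code: the use of scipy.optimize.minimize, the helper names, the placeholder data and the stationary starting point are assumptions (the report initializes all parameters at 1).

```python
import numpy as np
from scipy.optimize import minimize

def garch_filter(params, y):
    """GARCH(1,1) conditional variance recursion, eq. (2.31), for a return series y."""
    omega, alpha1, beta1 = params
    sigma2 = np.empty_like(y)
    sigma2[0] = y.var()                      # initialize with the sample variance
    for t in range(1, len(y)):
        sigma2[t] = omega + alpha1 * y[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return sigma2

def train_mse(params, y_train):
    """MSE between the observed series and sqrt(sigma_t^2), as in eqs. (2.31)-(2.32)."""
    omega, alpha1, beta1 = params
    if omega < 0 or alpha1 < 0 or beta1 < 0 or alpha1 + beta1 >= 1:
        return np.inf                        # keep the recursion non-negative and stationary
    y_hat = np.sqrt(garch_filter(params, y_train))
    return np.mean((y_train - y_hat) ** 2)

# y stands in for the normalized AUD/USD daily-return series (placeholder data here).
y = np.random.default_rng(3).normal(0, 1, 1000)
split = int(0.75 * len(y))                   # 3/4 train, 1/4 test
y_train, y_test = y[:split], y[split:]

res = minimize(train_mse, x0=[0.1, 0.1, 0.8], args=(y_train,), method="Nelder-Mead")
test_mse = np.mean((y_test - np.sqrt(garch_filter(res.x, y_test))) ** 2)
print("omega, alpha1, beta1:", res.x, " test MSE:", test_mse)
```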


Chapter 3

Auto Regressive Neural Network (AR-NN)

3.1 Feed Forward Neural Network

In linear time series analysis, the auto-regressive AR(p) model can be viewed as a network with two layers: the input layer, which contains the independent variables, and the output layer, which contains the dependent variable together with a constant term called the bias. A linear auto-regressive model with two lags is given by

y_t = α_1 y_{t-1} + α_2 y_{t-2}.    (3.1)

Since the linear AR model is not sufficient for prediction, a non-linear part has to be added to the linear model. The non-linear part is modelled in the hidden layer of the neural network, added between the input and output layers. AR-NN models are therefore those in which there is a direct connection between the input and output layers, as in the linear model, as well as a connection between the input and output layers through a hidden layer for the non-linear model. The non-linear function F(·) extending the linear model is given as

y_t = α_1 y_{t-1} + α_2 y_{t-2} + F(y_{t-1}, y_{t-2}).    (3.2)

Equation (3.2) can be modelled using a feed-forward neural network with an extension of the linear AR model to build an Auto-regressive Neural Network. We use a three-layer feed-forward neural network with the back-propagation algorithm as the method to update the weight vectors of the network. The main purpose of the neural network in this project is to predict the time series data set. We selected an AUD/USD data set to predict the daily returns, with the variance of the prediction to be compared against the observed values. A neural network is a non-linear model because of the activation functions used in the hidden layer. The hidden layer uses non-linear activation functions, and since they are non-linear the values from the hidden nodes are bounded within a range. We use the neural network as a regression model for predicting the time series vector. A NN has the universal approximation property, which means it can approximate the target as closely as possible from the input attributes, given an appropriate selection of hidden layer units. A normal feed-forward network has a set of input attributes and a target value, and using the input attributes we try to make the predicted output as close as possible to the target vector. We implement a special case of the neural network called the Auto-regressive Neural Network; the reason for building this network model is that there are no separate input attributes for a time series data set. As this is a time series data set, we use the same target vector, lagged, as our input attributes.

FIGURE 3.1: Auto Regressive Neural Network (AR-NN), with input nodes i_1, i_2 and a bias node, hidden nodes h_1, h_2, and output node o_1

The design of a neural network is something of an art, as there are various factors to be considered and no specific criteria or constraints when building the model. To implement the model, several components such as gradient descent, back-propagation and the cost function have to be considered in the design. In Figure 3.1, the input layer nodes i_t, t = 1, 2, ..., T, act as the input to the model, with connections from the nodes i_t to the hidden layer nodes h_t and the output layer node o_t. The network is called non-linear because of the activation functions used in the hidden nodes; for the linear part, connections are made directly between the input nodes and the output node. The figure is an approximation of the model used for prediction: weights are assigned to the edges connecting the nodes. These weight vectors are chosen in the initial state and updated after predicting the output vector, which is fed to the batch gradient algorithm to calculate the cost function. The aim of the model is to minimize the error and bring the prediction as close as possible to the target vector.

3.2 Design of Auto Regressive Neural Network

To design the AR-NN for regression output, we have to choose prior parameters for the design, and the model has three layers, consisting of the input layer, hidden layer and output layer, sometimes called the first, second and third layers respectively. The input layer consists of a number of nodes equal to the dimension (number of attributes) of the data set and acts as the input X_t for the model. The output layer can have a number of nodes that depends on the type of learning; since we use this model for regression, only one node y_t is required at the output layer. The decision on the number of hidden layer nodes is something of an art, as there is no specific criterion for it. The input vector enters the network through the input layer and is forwarded to the hidden layer after being combined with the weight parameters w_ij on the input side. At each hidden node there is a non-linear activation function f(·), which acts as the triggering function of the layer. The outputs from the hidden units are transformed through the activation function and are non-linear in nature.

These vectors are then multiplied by the output layer weight vector w_jk and transferred to the output layer. The equations for the input side of the network are as follows:

a_j = Σ_{i=1}^{D} w^(1)_{ji} x_i + w^(1)_{j0},    (3.3)

z_j = h(a_j),    (3.4)

where i = 1, ..., D indexes the dimensions of the first (input) layer. a_j represents the activation of a single node of the next layer, in our case the hidden layer. We add the bias value w_{j0} to the nodes to handle the output offset in this equation. These activations a_j are then transformed through the non-linear activation function given in equation (3.4). The above equations are on the input side of the model; the values are transformed again from the hidden layer to the output layer through the output-side weight vector and activation functions to predict the output vector. This yields the complete equation of the neural network model,

y_k(X, W) = σ( Σ_{j=1}^{M} w^(2)_{kj} h( Σ_{i=1}^{D} w^(1)_{ji} x_i + w^(1)_{j0} ) + w^(2)_{k0} ).    (3.5)

In the above equation, layer one and layer two, indicated by the superscripts (1) and (2), are combined to form the non-linear equation for predicting the value. This is a normal feed-forward network; to design an Auto-regressive Neural Network, the feed-forward network has to be extended by adding a linear term to the equation. The linear input term is used with a multiplicative parameter α and is connected directly to the output layer, so no non-linear factor is involved. This linear part is sometimes called the memory of the network, used for storing the previous output values and supplying them to the output node at the next time interval. The estimated equation for the auto-regressive model is

y_t = α_0 + Σ_{i=1}^{n} α_i y_{t-i} + σ( Σ_{j=1}^{M} w^(2)_{kj} h( Σ_{i=1}^{D} w^(1)_{ji} y_{t-i} + w^(1)_{j0} ) + w^(2)_{k0} ) + ε_t.    (3.6)

The input to the equation is y_{t-i}, a lag of the output y_t, and this lag can be increased to bring the predicted output closer to the target vector. Care should be taken with the lag input, as it might over-fit or under-fit the model and perform poorly on the test or unknown data set. To build the whole model, all the parameters need to be initialized and defined according to the specified criteria. In the following subsections there is a detailed discussion of the input time series data set, the weight-space, the back-propagation algorithm and the activation functions.

3.2.1 Time Series Data Set

The data set of interest is the financial returns series of daily asset returns in AUD/USD from January 2005 to December 2012. The data set has only one variable, representing the returns; as it is time-dependent time series data, it has a time-varying variance and a constant time spacing. The input to the model is the past values of the series, used to forecast the current value. Time series data require a different approach to prediction: since the prediction depends only on the time index, the data are difficult to model in the way other feed-forward models do. In time-dependent data the current value depends on the previous data point, which is similar to a Markov process. When one past series value y_{t-1}

is used for the prediction, the series is an auto-regressive AR(1) model, where the 1 indicates that one lag of the y_t vector is used. The linear AR(1) model can be written as

y_t = α_0 + α_1 y_{t-1},    (3.7)

where α_0 is the offset and α_1 the weight factor for y_{t-1}; this equation resembles a linear regression with y_{t-1} as the input for y_t.

FIGURE 3.2: AUD/USD daily returns from January 2005 to December 2012

From Figure 3.2, the variance of the daily returns is mostly within ±2%, and there are moves of more than ±8% around 2009. These drastic returns have to be captured by the model, as they signify an important event in the financial market: they occurred during the financial crisis caused by the recession in the US market. The event continues over the next two years, as the returns remain in a high variance range of ±4%.

Because of the bounded value range of the activation functions, the AUD/USD data set is scaled onto the range [0, 1]. This is done using the mean-variance method, where ȳ_t (or µ) is the mean of the data set y_t and σ_t is the square root of the variance of y_t with respect to ȳ_t, giving

y_t = (y_t − µ) / σ.    (3.8)

These two quantities are taken into account when scaling the data.
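The scaling step (3.8) and the construction of lagged inputs from the single returns series can be sketched as follows; the function names and the two-lag choice are assumptions made for illustration rather than the author's code.

```python
import numpy as np

def normalize_data(y):
    """Mean-variance scaling of the returns series, as in equation (3.8)."""
    return (y - y.mean()) / y.std()

def make_lagged_inputs(y, n_lags=2):
    """Build input rows [y_{t-1}, ..., y_{t-n_lags}] and targets y_t from one series."""
    X = np.column_stack([y[n_lags - k - 1:len(y) - k - 1] for k in range(n_lags)])
    t = y[n_lags:]
    return X, t

# y stands in for the AUD/USD daily returns (random placeholder data here).
y = normalize_data(np.random.default_rng(4).normal(0, 0.01, size=2000))
X, t = make_lagged_inputs(y, n_lags=2)
print(X.shape, t.shape)   # (1998, 2) and (1998,): the inputs are the lagged targets
```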

The model behaves better if all the variables are scaled: if the range of the observed values is much higher than the range of the activation function, only the linear values dominate the process. The initial weight parameter values do not depend on the observed values, and if the input vectors are not scaled and the initial weights are not sufficiently small, the output from the activation function may flip between its upper and lower bounds. Transforming the series data set might lose some information, but prediction of the series gets much better if both the range of the activation function and the series data lie in a bounded range. This is not a strict criterion to follow, but the cost function with a scaled series gives a good approximation of the observed values.

3.2.2 Weight-space

The weight matrix W consists of the weight vectors w from all the layers of the model. The initialization of the weight vectors is important: if the values are initialized randomly, the network model requires more iterations to reach a minimum cost. The ratio of the number of weight parameters w to hidden units h is about 2n, and with random weights this makes the computation costly for the neural model. The updating of the weight space depends on the type of gradient descent algorithm applied to the output prediction. We decided to use the batch gradient approach, in which the weights are updated after computing the gradient of the cost function with respect to the weight parameters W^(1) and W^(2) for the entire training set. The weight matrix W is updated as

W = W − η ∇_W J(W),    (3.9)

where η is the learning rate. In our model, weights are initialized from the Gaussian normal distribution N(0, 1) with µ = 0 and σ² = 1. Weights can be initialized from any random distribution, but Gaussian is the distribution most often preferred in the studies done so far on regression functions. If the weights are greater than one, it usually takes more iterations to reduce the cost function. Another design consideration with the weight matrix W is that every input unit is connected to every output unit with a weight value w_ij; there are strategies for connecting units that reduce the W matrix computations. The strategy used here is to connect all the units to each other and initialize the weights accordingly. We do not use any basis functions for the data set, as studies have shown that adding more hidden units is equivalent to adding basis functions. Concerning the selection of hidden units, the usual approach is trial and error with an arbitrary number of neurons, all other factors being fixed, while constantly monitoring the approximation error. A rule of thumb is to choose the number of hidden layer units equal to the median of the numbers of input and output variables,

h = (n + 1) / 2.    (3.10)

This method has no technical justification, but when we tried this approach our results were better. We started the hidden unit selection with more than 20 units, and the results from the hidden activation function always overshot the range [-1, 1], so the prediction at the output unit was all ones. After taking equation (3.10) into consideration, we chose the number of hidden units to be around 3 to 5, and the results from the output unit were better, though some changes were needed in the type of activation function chosen. The approach used here was to increase the number of units in the hidden layer step by step while monitoring the error function.
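The weight initialization, the hidden-unit rule of thumb (3.10) and the batch-gradient update (3.9) can be sketched as below; the learning-rate value and the helper names are assumptions, and the gradient is left as a placeholder.

```python
import numpy as np

def init_weights(n_inputs, n_hidden, seed=0):
    """Initialize all weights from N(0, 1), as described in Section 3.2.2."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(n_hidden, n_inputs))   # input -> hidden
    W2 = rng.normal(0.0, 1.0, size=n_hidden)                # hidden -> output
    alpha = rng.normal(0.0, 1.0, size=n_inputs)             # direct linear part
    return W1, W2, alpha

n_inputs = 2                            # two lagged inputs y_{t-1}, y_{t-2}
n_hidden = max(1, (n_inputs + 1) // 2)  # rule of thumb (3.10); the report settled on 3-5 units
W1, W2, alpha = init_weights(n_inputs, n_hidden)

# Batch gradient update, eq. (3.9): W <- W - eta * grad_J(W),
# where grad_J is the gradient of the cost over the whole training set.
eta = 0.01
grad_W1 = np.zeros_like(W1)             # placeholder gradient for this sketch
W1 = W1 - eta * grad_W1
print("hidden units:", n_hidden, " W1 shape:", W1.shape)
```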

Selecting the cost function for the neural model was difficult, as there is no specific cost function prescribed for a time-dependent non-linear data set. We chose the squared error and the mean squared error (MSE) to measure the performance of the prediction model. The mean squared error is more effective than the plain squared error, and the ease of taking the derivative of the MSE in back-propagation was the main criterion for the selection:

E = (1/2) Σ_k (y_k − t_k)².    (3.11)

To avoid over-fitting the model to the training data set, we added a regularization parameter λ to the cost function. The purpose of the regularization parameter is to make sure the model is penalized if it tries to over-fit the data, which indirectly improves the generalization error measured on the test data. The weight update after adding the regularization term becomes

W = W − η (∇_W J(W) + λ W).    (3.12)

The value of λ is fixed during the prior initialization, and the same fixed value is used for all iterations. The criterion used for selecting λ is trial and error, but if we increase the value of λ the model might under-fit, so care has to be taken when selecting it. As over-fitting is a major problem for a prediction model, this small change in the cost function improves the generalization error.

3.2.3 Activation Function

Choosing the activation function for the model is important in order to concretize the AR-NN function. The selection of the activation function depends on the universal approximation property, which says that any continuous, bounded and non-constant activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided the network has enough hidden units. The derivatives of the feed-forward network can also approximate the derivatives of the function well. The relevant notion of Borel measurability is that any continuous function on a closed and bounded subset of R^n is Borel measurable. Basis functions are not used in the AR-NN model, as they become more complicated with the non-linearity of the function. The bounded activation function generally used in NN models is the sigmoid, which is bounded in the range [-1, 1]. The logistic function is one sigmoid function,

σ(a) = (1 + exp(−a))^{-1},    (3.13)

with logistic: R → [0, 1]; another sigmoid function is the hyperbolic tangent (tanh),

σ(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)).    (3.14)

A linear activation function, called the identity function, is sometimes used at the output units of a regression-based prediction model. We tried this activation function, but there was no significant improvement in the cost function value. The two sigmoid activation functions can be used for the hidden units and the output units, possibly as a mixture, with the hidden units using the logistic function and the output units using tanh; the choice depends on how much the error is reduced by the selected activation functions. Sigmoid functions reduce the effect of outliers because they squash the vector into the range [-1, 1].
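The two activation functions (3.13)-(3.14) and the regularized squared-error cost can be written directly in code. The sketch below is illustrative; the function names are assumptions, and λ = 0.1 follows the value quoted later in Section 3.5.

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid, eq. (3.13): R -> [0, 1]."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Hyperbolic tangent, eq. (3.14): R -> [-1, 1] (equivalent to np.tanh)."""
    return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

def cost(y_pred, t, weights, lam=0.1):
    """Half squared error, eq. (3.11), plus an L2 penalty 0.5 * lam * ||W||^2
    whose gradient is the lam * W term appearing in eq. (3.12)."""
    data_term = 0.5 * np.sum((y_pred - t) ** 2)
    reg_term = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + reg_term

a = np.linspace(-3, 3, 7)
print(logistic(a))          # values in (0, 1)
print(tanh(a))              # values in (-1, 1)
```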

3.2.4 Back Propagation Algorithm

The back-propagation algorithm is a learning procedure for feed-forward neural networks by which the network can map a set of inputs to a set of outputs. The mapping is specified by giving the desired activation function on the units in the hidden and output layers. Learning is carried out iteratively by adjusting the coupling strengths in the network so as to minimize the difference between the actual output vector and the desired output vector. The learning process is repeated until the network responds to each input vector with an output vector that is sufficiently close to the desired one. The weights are initialized randomly from some prior distribution, and we need to update them after every iteration. The main purpose of implementing back-propagation is to minimize the cost function C with respect to the weights W and biases b, and this is achieved through the partial derivatives of the cost function. After calculating the partial derivatives of the cost function, the weight space and biases are updated so as to bring the predicted output closer to the target data.

Before discussing the algorithm, we need to fix the cost function used at the output unit. We consider the squared loss function for the model, which has the form

(1/2) Σ_{i=1}^{n} (t − y_t)²,    (3.15)

where y_t is the predicted vector and t is the target vector; we take the squared difference of the two, sum over all the data points, and multiply the whole expression by 1/2. The factor 1/2 is used because, when taking the partial derivative of the cost function during gradient descent, the factor of 2 from the squared term is cancelled by the fraction. We have initialized weight vectors on two sets of edges, one between the input layer and the hidden units and another between the hidden units and the output layer, so we have to compute the weight-space update twice with the same value of the error function. The partial derivative of the cost function is

Σ_{i=1}^{n} (t − y_t), sometimes written as Σ_{i=1}^{n} (y_t − t).    (3.16)

This expression gives the total error between the output and target vectors, and this difference is propagated into the immediately preceding layer. Since we use an activation function on each unit in a layer, we need to take the derivative of the activation function while doing the update. We use different activation functions in the model to compare the errors, but for the calculation here we use the sigmoid function σ(z) = 1/(1 + exp(−z)). Combining equation (3.16) with the sigmoid derivative, we can update the weight vector w^(2), as these weights belong to the layer immediately before the output. The equations for the weight-space vector between the output and hidden layers, and hence for δ_o, are

a^(2) = σ(z^(2)),    (3.17)
∂a^(2)/∂z^(2) = σ(z^(2)) (1 − σ(z^(2))),    (3.18)
∂E/∂z^(2) = −(t − y_t) σ(z^(2)) (1 − σ(z^(2))),    (3.19)
δ_o = (t − y_t) σ(z^(2)) (1 − σ(z^(2))).    (3.20)

We can then update all the weights at the different hidden units:

ΔW^(2) = δ_o a^(2).    (3.21)

With the above equations we have updated the weight vector at the hidden units; we now back-propagate the same error from the hidden layer to the previous, input layer. We use the same set of equations for updating the weight-space vector W^(1); the corresponding deltas can be written as

a^(1) = σ(z^(1)),    (3.22)
∂a^(1)/∂z^(1) = σ(z^(1)) (1 − σ(z^(1))),    (3.23)
∂E/∂z^(1) = −(t − y_t) σ(z^(1)) (1 − σ(z^(1))),    (3.24)
δ_h = (t − y_t) σ(z^(1)) (1 − σ(z^(1))),    (3.25)

and we can update all the weights at the different hidden units using δ_h and the input vector y_{t-i}:

ΔW^(1) = δ_h y_{t-i}.    (3.26)

Finally, all the weights on the edges W^(1) and W^(2) of the layers are updated with the batch gradient approach. After every back-propagation pass over the training samples the error should decrease slowly as the weights are updated; for this network model it requires more than about 10,000 iterations for the weights to stabilize near a local minimum, after which the error never goes below a certain level and remains in that range for a long time. There are two choices for tackling this problem: either make the algorithm run for a fixed number of iterations, or, if the error has not been decreasing for a long time, stop it at that particular step. The change made in the algorithm is to continuously track the error range and take the decision based on the cost function (Bishop, 2006).

3.2.5 Practical Reducibility

When the AR-NN is extended to include additive noise, the distribution of the noise term is positive everywhere on the range (−∞, ∞), and thus the AR-NN forms an irreducible and aperiodic Markov chain. The chain is aperiodic because it does not cycle between a set of values at specified multiples of t. It is irreducible because it is impossible to reduce the range of Y_t from the entire real line (−∞, ∞) to a smaller finite set: the noise is additive, it does not depend on Y_t, and ε_t can take any value in the range. Therefore, even if the model converges, the noise term ensures that Y_t is irreducible and aperiodic (Dietz, 2010).
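The back-propagation update derived in Section 3.2.4 can be written compactly in code. The sketch below is an illustrative reconstruction under assumptions (logistic activations everywhere, a single output unit, no bias terms, one training example per step) and is not the author's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(y_lags, target, W1, W2, eta=0.01):
    """One gradient step on the squared loss (3.15) for a single training example."""
    # Forward pass through the non-linear part of the network.
    z1 = W1 @ y_lags                 # hidden pre-activations
    a1 = sigmoid(z1)                 # hidden activations
    z2 = W2 @ a1                     # output pre-activation
    y_pred = sigmoid(z2)             # output activation

    # Output delta, eq. (3.20): (t - y) * sigma'(z2).
    delta_o = (target - y_pred) * y_pred * (1.0 - y_pred)
    # Hidden deltas, analogous to eq. (3.25), with the error passed back through W2.
    delta_h = (W2 * delta_o) * a1 * (1.0 - a1)

    # Updates following eqs. (3.21) and (3.26); adding eta * delta * activation
    # is equivalent to descending the gradient of the squared error.
    W2 = W2 + eta * delta_o * a1
    W1 = W1 + eta * np.outer(delta_h, y_lags)
    return W1, W2, y_pred

rng = np.random.default_rng(5)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=3)
W1, W2, y_pred = backprop_step(np.array([0.1, -0.2]), 0.05, W1, W2)
print("prediction after one step:", y_pred)
```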

3.3 Function

Now that the architecture of the model has been built in the sections above, we discuss the steps required to implement the model. The data used are the AUD/USD returns, with functions created for normalizing the data and calculating the loss functions. The equations needed to model the Auto-regressive Neural Network combine all the algorithms included in the design of the model to make a prediction model for the time-dependent non-linear data set. The linear equation for the model, with a linear weight vector and two lags of the observed forecast data, is

y_t = α_1 y_{t-1} + α_2 y_{t-2}.    (3.27)

This equation is needed for prediction at the output unit; before that, the non-linear part of the model equations has to be provided. With the weight matrix W consisting of the weight vectors w^(1) and w^(2), the input side, indicated by superscript (1), is

z^(1)_j = Σ_{i=1}^{D} w^(1)_{ji} y_{t-i} + w^(1)_{j0},    (3.28)

where i = 1, ..., D indexes the dimensions of the input matrix and j indexes the units of the hidden layer. This quantity is then transformed into non-linear form through the activation function,

a^(1)_j = σ(z^(1)_j).    (3.29)

We used the sigmoid function as the activation function of the hidden units to transform the linear value from the input vector into the range [−1, 1]. The transformed values from the hidden units are then combined with the next weight vector w^(2),

z^(2)_k = Σ_{j=1}^{M} w^(2)_{kj} a^(1)_j + w^(2)_{k0},    (3.30)

where k = 1, ..., M indexes the units of the output layer; as we have only one unit at the output layer, M = 1. The final output is passed through the activation function,

a^(2)_k = σ(z^(2)_k).    (3.31)

Equation (3.31) is the non-linear transformation of the input vector; it now needs to be combined with the linear equation (3.27) and the additive noise ε_t. At the output layer the linear and non-linear parts are combined through the activation function for prediction: equation (3.32) combines all the derived equations for predicting y_t. The data set is normalized through a function with mean µ and variance σ² to remove the outliers. The weight vectors w_ij and w_kj are initialized from the random Gaussian distribution N(0, 1), and α_i is also initialized from a Gaussian distribution. α_i could be initialized to fixed values, but as the data set is non-linear, fixing its value does not improve the error function. Two different constant values are used for the regularization parameter, and in this way two functions have been created for the predictions. There are two separate calculations for the linear and non-linear parts of the model: for the non-linear part, the input vector is fed into the hidden layer, combined with the weight values, and transformed by the activation function. After the prediction of the output vector, Algorithm 1 describes the model for predicting the time series data; the factors that need to be initialized are those of the equation considered for the prediction (Dietz, 2010):

y_t = α_0 + Σ_{i=1}^{n} α_i y_{t-i} + σ( Σ_{j=1}^{M} w^(2)_{kj} h( Σ_{i=1}^{D} w^(1)_{ji} y_{t-i} + w^(1)_{j0} ) + w^(2)_{k0} ) + ε_t.    (3.32)

For updating the weight matrix over the training samples, the back-propagation algorithm is invoked with gradient descent on the cost function; the same steps as in Section 3.2.4 are followed to update the weight matrices w^(1) and w^(2):

∂E/∂z^(1) = −(t − y_t) σ(z^(1)) (1 − σ(z^(1))),    (3.33)
δ_h = (t − y_t) σ(z^(1)) (1 − σ(z^(1))),    (3.34)
ΔW^(1) = δ_h y_{t-i}.    (3.35)

Algorithm 1: Training AR-NN
  Data: time series data of AUD/USD daily returns
  Result: predicting the time series and minimizing the error
  Data <- NormalizeData(Data)
  for all loop <= num_passes do
      hidden h <- combine input vector with weight vector W
      output Y <- combine h and W with the linear weight matrix and α
      invoke the back-propagation algorithm
      error e <- CalculateLossFunction(Y)
      if error in range then
          W <- min + delta
      else
          W <- max + delta
      end
  end
  return model

In Algorithm 1, we first initialize the prior parameters W, α, λ and η, select the number of units for the hidden layer, and choose the activation functions for the hidden and output units. The data set is passed through the function NormalizeData to obtain a normalized data set used for all further operations. Data sets with one and two lags, called y_{t-1} and y_{t-2}, are created and set as the inputs of the model. In the for loop two things are calculated: first the hidden units, by combining the input vector with the weight values, and then, at the output unit, the values from the hidden units together with the linear values from the input vector, combined with α and transformed by the activation function. At that point the back-propagation algorithm is invoked to compute the gradient descent step on the error function and back-propagate the error into the previous layers by differentiating through the weight vectors of each previous layer. The weights of the layer immediately before the error function get updated strongly, while the weights of the first layer of the network are updated by a very small amount; as the iterations increase, the weights start to stabilize and the error function stays within a specific range for a long time, toggling between values in that range. The total error calculated through CalculateLossFunction is used for updating the weight vectors for the next iteration. The model is run for more than 10,000 iterations, because with fewer iterations the weights do not stabilize, and because of the noise included when invoking the prediction and error functions the AR-NN model will not reach a tightly bounded local minimum anyway. It is therefore better to run the algorithm for more iterations to stabilize the weights and reach a steadily low error value for the network. From Figure 3.3, the prediction error fluctuates strongly over the range [0, 10,000] iterations; as the iterations increase, the error stabilizes in the range [1000, 1100] after 30,000 iterations, and since the neural model cannot have a global minimum, and because of the additive noise in the error function, the error fluctuates within that range and never settles at a local minimum.
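Putting the pieces together, a self-contained sketch of the training loop of Algorithm 1 is given below. It is illustrative only: the two-lag input, tanh activations, learning rate, regularization value, iteration count and placeholder data are assumptions consistent with the text, not the author's actual code.

```python
import numpy as np

def train_arnn(y, n_lags=2, n_hidden=3, eta=0.05, lam=0.1, num_passes=10000, seed=0):
    """Batch-gradient training of the AR-NN of eq. (3.32) on a returns series y."""
    rng = np.random.default_rng(seed)
    # Lagged inputs X[t] = [y_{t-1}, ..., y_{t-n_lags}] and targets t = y_t.
    X = np.column_stack([y[n_lags - k - 1:len(y) - k - 1] for k in range(n_lags)])
    t = y[n_lags:]

    W1 = rng.normal(size=(n_hidden, n_lags)); b1 = np.zeros(n_hidden)
    w2 = rng.normal(size=n_hidden);           b2 = 0.0
    alpha = rng.normal(size=n_lags);          alpha0 = 0.0

    for _ in range(num_passes):
        # Forward pass: linear (memory) part plus non-linear hidden-layer part.
        z1 = X @ W1.T + b1            # (N, H)
        a1 = np.tanh(z1)
        z2 = a1 @ w2 + b2             # (N,)
        a2 = np.tanh(z2)
        y_hat = alpha0 + X @ alpha + a2

        # Backward pass: batch gradients of the regularized squared error.
        e = (y_hat - t) / len(t)
        d2 = e * (1.0 - a2 ** 2)                      # delta at the output unit
        d1 = np.outer(d2, w2) * (1.0 - a1 ** 2)       # deltas at the hidden units

        w2 -= eta * (a1.T @ d2 + lam * w2)
        b2 -= eta * d2.sum()
        W1 -= eta * (d1.T @ X + lam * W1)
        b1 -= eta * d1.sum(axis=0)
        alpha -= eta * (X.T @ e)
        alpha0 -= eta * e.sum()
    return (W1, b1, w2, b2, alpha, alpha0), y_hat

# Placeholder series standing in for the normalized AUD/USD returns;
# the report runs the loop for more than 10,000 passes.
y = np.random.default_rng(1).normal(0, 1, 800)
params, fit = train_arnn(y, num_passes=2000)
print("train MSE:", np.mean((fit - y[2:]) ** 2))
```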

FIGURE 3.3: Error

3.4 Results

In the following section we discuss the results from the AR-NN model, considering the different parameters that affect them. The subsections which follow discuss each parameter in detail, focusing on the cost function and the posterior mean and variance, as these parameters define the performance of the network architecture. The neural network cannot achieve a global minimum, since the activation functions used in the units of all the layers act as non-linear functions, each with its own bounding range. Some parameters are predefined in the model, as the model performance is not measured against them and they are not important prior information to be considered in designing the model.

3.4.1 Alpha

As discussed for the linear auto-regressive model, the linear weight α changes y_t between being stationary, explosive or a random walk, so we tried to see the impact of α on the model. The mean of the observed data set, after normalizing it with the mean and variance, is of the order of 10⁻⁴, and σ² = 1.

When α = 1

At first, the α vector was kept constant at α = 1.

FIGURE 3.4: Error Function with α = 1

From Figure 3.4, after 20,000 iterations the error starts to stabilize in the range [2600, 2800]; because the linear weight vector is constant at 1, the contribution of the linear part is fixed across the iterations, so the error does not fluctuate much and remains in the same range for a long time. The mean of the predicted data set is µ = 1.018 and the variance σ² = 2.93; the predicted µ and σ² are far from the observed values, and this approach would not perform well on the test data set.

When |α| ≤ 1

We next consider |α| ≤ 1, for which the predicted set has µ = 0.23 and σ² = 1.35.

FIGURE 3.5: Error Function with |α| ≤ 1

In Figure 3.5, at the start of the iterations there is a high deviation of the cost function, which continues for 80,000 iterations, after which the error starts to be minimized with minimal deviation. The mean and variance of this run are very close to those of the observed data set. This model might be over-fitted, but it performed better than the previous run.

When α > 1

FIGURE 3.6: Error Function with α > 1

Finally, α is considered above 1, that is α > 1; for this condition there is no specific value to be considered, so we chose α = 2 as the fixed parameter value. As α grows, the mean and variance grow exponentially, and from Figure 3.6 the cost value never stabilizes: it starts at around 12,500 and after 80,000 iterations it still stays at around 10,000, never reducing below that range. Compared with the other two,

this model has µ = 2.01 and σ² = 8.87, higher than the other two predicted sets. So when implementing the neural model for prediction it is better to keep α in the range [-1, 1], and one approach for this is to use a Gaussian distribution for the selection of the prior.

3.4.2 Discussion on Activation Function

Having discussed the two types of activation functions in Section 3.2.3, we now explore the results from the model using these activation functions; the results for the two functions are quite different. First, we consider the logistic activation function on the hidden units and output units, with the formula

σ(a) = (1 + exp(−a))^{-1}.    (3.36)

The prediction plot for the logistic activation function in Figure 3.7 shows that it was able to capture the outliers in the forecast data, such as the high variance of daily returns during the financial crisis from 2008 onwards. The mean µ of the prediction is … and the variance σ² is …; the model is not able to predict the normal target points. Secondly, Figure 3.8 shows that the predicted output from the tanh activation is able to predict the vector close to the observed values, with µ = … and σ² = 1.30, while the mean of the observed vector is µ = 5.109 × 10⁻⁴. The tanh activation performs better than the logistic function, as it was able to capture the outliers of the observed data. The tanh activation function predicts better on the selected data set than the logistic function, but there is no specific criterion for the selection of the activation function, and trial and error is the best approach for selecting it.

3.5 Prediction Results

In the AR-NN model we choose the prediction configuration through prior parameters such as the initialization of the weight vectors W, the choice of activation function σ(·), the linear weight vector α and the regularization

3.5 Prediction Results

In the AR-NN model, prediction depends on prior choices such as the initialization of the weight vectors W, the activation function σ(·), the linear weight vector α, and the regularization parameter. The AR-NN model we chose uses a Gaussian distribution for the weight vector W and tanh as the activation function for the hidden and output layers of the network. The regularization parameter added to the cost function is 0.1; with this value the prediction results from the model are optimized. Figure 3.9 shows two plots, of predicted and observed values, with variance on the y-axis and time on the x-axis. There is high variance between data points 200 and 300 on the observed plot; this high volatility is of major concern for the financial market, and our model is able to capture this variance, although not with a magnitude as high as in the observed plot. Addressing this would require more hidden units to capture the non-linearity in the data. The neural network cannot reach the global minimum error because of the activation function, so we have to use trial and error to reduce the error consistently and reach a minimum range; at the same time the model should not over-fit, as over-fitting affects prediction and ultimately increases the prediction error. Figure 3.10 shows the error calculated as the Mean Squared Error (MSE); the main aim is to reduce the training-set error to its minimum range, which usually takes more than 10,000 iterations.
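To illustrate the quantity tracked in Figure 3.10, the following sketch computes a mean squared error with an L2 weight penalty of 0.1; the function name and the argument name lam are assumptions made for this illustration and are not taken from the report.

import numpy as np

def regularized_mse(predictions, targets, weights, lam=0.1):
    """Mean squared error plus an L2 penalty on the network weights."""
    mse = np.mean((predictions - targets) ** 2)
    penalty = lam * np.sum(weights ** 2)
    return mse + penalty

# Dummy values: the penalty discourages large weights, which is the
# mechanism used here against over-fitting on the training set.
preds = np.array([0.9, 1.2, -0.3])
targets = np.array([1.0, 1.0, 0.0])
weights = np.array([0.2, -0.5, 0.1])
print(regularized_mse(preds, targets, weights))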

FIGURE 3.7: Logistic Activation Function

FIGURE 3.8: tanh Activation Function

FIGURE 3.9: AR-NN Prediction and Observed

FIGURE 3.10: Error on Train set
