On Out-of-Sample Statistics for Financial Time-Series

On Out-of-Sample Statistics for Financial Time-Series

François Gingras, Yoshua Bengio, Claude Nadeau

CRM-2585, January 1999

Département de physique, Université de Montréal
Laboratoire d'informatique des systèmes adaptatifs, Département d'informatique et recherche opérationnelle, Université de Montréal
Centre interuniversitaire de recherche en analyse des organisations

Abstract

This paper studies an out-of-sample statistic for time-series prediction that is analogous to the widely used R² in-sample statistic. We propose and study methods to estimate the variance of this out-of-sample statistic. We suggest that the out-of-sample statistic is more robust to the distributional and asymptotic assumptions behind many tests for in-sample statistics. Furthermore, we argue that it may be more important in some cases to choose a model that generalizes as well as possible, rather than to choose the parameters that are closest to the true parameters. Comparative experiments are performed on artificial data as well as on a financial time-series (daily and monthly returns of the TSE300 index). The experiments are performed for varying prediction horizons, and we study the relation between predictability (out-of-sample R²), the variability of the out-of-sample R² statistic, and the prediction horizon. In particular, we find that very different conclusions would be obtained when testing against the null hypothesis of no dependency rather than testing against the null hypothesis that the proposed model does not generalize better than a naive forecast.

1 Introduction

The purpose of the analysis of time-series such as financial time-series is often to take decisions based on this analysis. The analyst is given a certain quantity of historical data $D_T = \{z_1, \ldots, z_T\}$ from which he will eventually come up with a decision. In this paper, we will focus on decisions which take the form of a prediction $\hat{y}_{T+h}$ of the future value of some variable, say $Y_{T+h}$, with $Z_t = (X_{t-h}, Y_t)$. (In this paper we will normally use upper case for random variables and lower case for their values.) The quality of the prediction will be judged a posteriori according to some loss function, such as the squared difference $(\hat{y}_{T+h} - Y_{T+h})^2$ between the prediction $\hat{y}_{T+h}$ and the realization $Y_{T+h}$ of the predicted variable. A common approach is to use the historical data $D_T$ to infer a function $f$ that takes as input the value of some summarizing information $X_t$ available at time $t$ and produces as output $\hat{y}_{t+h} = f(X_t)$, which in the case of the above quadratic loss function would be an estimate of the conditional expectation $E[Y_{t+h} | X_t]$. The hope is that if this function worked well on observed past pairs $(x_t, y_{t+h})$, it should work well on $(X_T, Y_{T+h})$.

How should we choose the function $f$? A classical approach is to assume a parametrized class of functions (e.g., the affine functions), estimate the value of these parameters (e.g., by maximum likelihood or least squares), and then, in order to validate the model, perform statistical tests to verify whether these parameters differ significantly from the value that would be consistent with a null hypothesis (e.g., the parameters of the regression are significantly different from zero, so that there is really a linear dependency between the X's and the Y's). In particular, these tests are important to know whether one should use the proposed model at all, or to decide among several models. In this paper we will consider alternative approaches to address the last question, i.e., how a model should be validated and how several models should be compared.

It is very satisfying to obtain a result on the true value of the parameters (e.g., to use an efficient estimator, which converges as fast as possible to the true value of the parameters). But in many applications of time-series analysis, the end-user of the analysis may be more interested in knowing whether the model is going to work well, i.e., to generalize well to future cases. In fact, we will argue that sometimes (especially when data is scarce), the two objectives (estimating the true parameters or choosing the model that generalizes better) may yield quite different results. Another fundamental justification for the approach that we are putting forward is that we may not be sure that the true distribution of the data has the form (e.g., linear, Gaussian, etc.) that has been assumed. Therefore it may not be meaningful to talk about the "true" value of the parameters in this case. What may be more appropriate is the question of generalization performance: will the model yield good predictions in the future? Here the notion of "good" can be used to compare two models.

To obtain answers to such questions, we will consider statistics that measure out-of-sample performance, i.e., performance measured on data that was not used to form the prediction function. In the machine learning community, it is common to use such measures of performance. For example, one can estimate out-of-sample error by first estimating the parameters on a large subset of the data (also called the training data) and testing the function on the rest of the data (also called the held-out or test data). With this method, when there is little historical data, the estimate of performance would be poor because it would be based on a small subset of the data.
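For concreteness, the held-out estimate just described can be sketched in a few lines (Python with numpy; the synthetic data, the affine model class used here, and all names below are our own illustrative assumptions, not part of the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for historical pairs (x_t, y_{t+h}).
    x = rng.normal(size=500)
    y = 0.3 * x + rng.normal(size=500)

    # Estimate the parameters on a large subset (the training data)...
    x_tr, y_tr = x[:400], y[:400]
    beta_hat, alpha_hat = np.polyfit(x_tr, y_tr, 1)  # least-squares affine fit

    # ...and test the function on the rest (the held-out or test data).
    x_te, y_te = x[400:], y[400:]
    held_out_error = np.mean((y_te - (alpha_hat + beta_hat * x_te)) ** 2)
    print(held_out_error)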
An alternative is the K-fold cross-validation method, in which K train/test partitions are created, a separate parameter estimation is performed on each of the training subsets, testing is performed for each of these K functions on the corresponding K test sets, and the average of these test errors provides a (slightly pessimistic) estimate of generalization error (Efron and Tibshirani, 1993). When the data is sequential (and may be non-stationary), the above methods may not be applicable (and in the non-stationary case may yield optimistic estimates). A more honest estimate can be obtained

with a sequential cross-validation procedure, described in this paper. This estimate essentially attempts to measure the predictability of the time-series when a particular class of models is used. In this context, what the analyst will try to choose is not just a function from $x_t$ to $E[Y_{t+h} | x_t]$, but a functional $F$ that maps historical data $D_t$ into such a function (and that will be applied at many consecutive time steps, as more historical data is gathered).

What do we mean by predictability? For practitioners of quantitative finance, predictability equates to beating the market. This goal is, usually and for a large part, from the domain of financial engineering. Depending on the market, and on things like the investment horizon and the nature of the investor, it requires the incorporation of transaction fees, liquidity of the market, risk management, tax and legal considerations, and so on. But among the details on which an applicable model depends, one which is crucial is that in order to obtain the decision or the prediction at time t, only information that is available at time t can be used. This includes not only the inputs of the function but also the data that is used to estimate the parameters of this function. In this paper, we define predictability of sequential data, such as market returns, as the possibility to identify, on the basis of information known at time t (that will be used as "input"), a function among a set of possible functions (from the input to a prediction) that outperforms (in terms of generalization performance) a naive model that does not use any input, but can be trained on the same (past) data.

One objective of this paper is to study whether the use of generalization error allows one to recover the statistical results obtained by traditional in-sample inference, but without making any assumption on the properties of the data and the residuals of the model. We apply the method to a simple linear model where we can, in the case of the in-sample statistic, construct an autocorrelation-consistent standard error. Here the results obtained with the out-of-sample statistic are consistent with those obtained with the in-sample approach, but the method has the advantage of being directly extensible to any model, linear or not (e.g., neural networks), for which the distributions of the in-sample statistics are unknown or rely on delicate hypotheses. Several empirical tests based on bootstrapping techniques are proposed to compare the model to a null hypothesis in which the inputs do not contain information about future outputs. Another major objective of this paper is to establish a distinction between two apparently close null hypotheses: (1) no relationship between the inputs and the outputs, and (2) no better predictive power of a given model with respect to a naive model. We propose a method to test against the 2nd null hypothesis, and we show that these two types of tests yield very different results on our financial returns data.

In section 2, we present the classical notions of generalization error, empirical risk minimization, and cross-validation, and we extend these notions to non-i.i.d. data. We also present the notion of a naive model used to establish a comparison benchmark (and null hypotheses) and establish the connection between it and the in-sample variance of the data. In section 3, we recall some aspects of ordinary least squares regression applied to sequential data. We recall the definitions of explained variance and the usual in-sample $R^2$, used in predictability tests (Campbell et al., 1997; Kaul, 1996). We define an out-of-sample analogue of $R^2$ that we denote $R_o^2$, and a related but unbiased statistic that we denote $D_o$.
The $R_o^2$ introduced in this paper is related to a definition of forecastability proposed by Granger and Newbold (1976). Section 4 describes the financial time-series data and presents some preliminary results. In section 5, we test the hypothesis of no relation between the inputs and the outputs. We generate bootstrap samples similar to the original series but for which we know that the null hypothesis is true, and we compare the test statistics observed on the original series to the distribution of the statistics obtained with this bootstrap process. Although this hypothesis is of no direct relevance

to this paper, it allows us to nicely introduce some difficult issues with the data at hand (such as the dependency induced by overlapping) and the type of methodologies used later on, including the bootstrap as mentioned above. It also allows us to make the distinction between the absence of a relationship between inputs and outputs and the inability of inputs to forecast outputs. In order to study some properties of the statistics used to test the hypothesis of no relation between the inputs and the outputs, we also generated stationary artificial data for which we know the nature of the relation between the inputs and the outputs. Using this data, we compare the power of in-sample and out-of-sample statistics when testing against a hypothesis of no linear dependency. Section 6 aims at assessing whether inputs may be used to produce forecasts that would outperform a naive forecast. Following section 3, we test whether $R_o^2 = 0$ against the alternative that it is positive. To do so, we use the statistic $\hat{R}_o^2$ and various bootstrap schemes. The experiments are performed for varying prediction horizons, and we study the relation between predictability (out-of-sample $R^2$, which we will write $R_o^2$), the variability of the out-of-sample $R_o^2$ statistic, and the prediction horizon. The results are compared to those obtained when trying to reject the null hypothesis of no dependency.

2 Expected Risk and Sequential Validation

This section reviews notions from the generalization theory of Vapnik (1995), and it presents an extension to non-i.i.d. data of the concepts of generalization error and cross-validation. We also define a naive model that will be used as a reference for the $R_o^2$ statistic.

First let us consider the usual i.i.d. case (Vapnik, 1995). Let $Z = (X, Y)$ be a random variable with an unknown density $P(Z)$, and let the training set $D_l$ be a set of $l$ examples $z_1, \ldots, z_l$ drawn independently from this distribution. In our case, we will suppose that $X \in R^n$ and $Y \in R$. Let $\mathcal{F}$ be a set of functions from $R^n$ to $R$. A measure of loss is defined which specifies how well a particular function $f \in \mathcal{F}$ performs the generalization task for a particular $Z$: $Q(f, Z)$ is a functional from $\mathcal{F} \times R^{n+1}$ to $R$. For example, in this paper we will use the quadratic error $Q(f, Z) = (Y - f(X))^2$. The objective is to find a function $f \in \mathcal{F}$ that minimizes the expectation of the loss $Q(f, Z)$, that is, the generalization error of $f$:

$$G(f) = E[Q(f, Z)] = \int Q(f, z) P(z) dz \quad (1)$$

Since the density $P(z)$ is unknown, we cannot measure, much less minimize, $G(f)$, but we can minimize the corresponding empirical error:

$$G_{emp}(f, D_l) = \frac{1}{l} \sum_{z_i \in D_l} Q(f, z_i) = \frac{1}{l} \sum_{i=1}^{l} Q(f, z_i) \quad (2)$$

where the $z_i$ are sampled i.i.d. from the unknown distribution $P(Z)$. When $f$ is chosen independently of $D_l$, this is an unbiased estimator of $G(f)$, since $E[G_{emp}(f, D_l)] = G(f)$. Empirical risk minimization (Vapnik, 1982, 1995) simply chooses

$$f^* = F(D_l) = \arg\min_{f \in \mathcal{F}} G_{emp}(f, D_l)$$

where we have noted $F(D_l)$ the functional that maps a data set into a decision function. Vapnik has shown various bounds on the maximum difference between $G(f)$ and $G_{emp}(f, D_l)$ for $f \in \mathcal{F}$, which depend on the so-called VC-dimension or capacity (Vapnik, 1982, 1995) of the set of functions $\mathcal{F}$.
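As a concrete illustration of equations (1)-(2) and of empirical risk minimization, here is a minimal sketch (Python/numpy; the quadratic loss and the affine class come from the text, while the function names and the synthetic data are our own assumptions):

    import numpy as np

    def Q(f, z):
        # Quadratic loss Q(f, Z) = (Y - f(X))^2 for one example z = (x, y).
        x, y = z
        return (y - f(x)) ** 2

    def G_emp(f, D):
        # Empirical error (2): average loss of f over the data set D.
        return np.mean([Q(f, z) for z in D])

    def erm_affine(D):
        # Empirical risk minimization over the affine class f(x) = a + b*x:
        # for the quadratic loss this is ordinary least squares.
        x = np.array([z[0] for z in D])
        y = np.array([z[1] for z in D])
        b, a = np.polyfit(x, y, 1)
        return lambda u: a + b * u

    rng = np.random.default_rng(1)
    D_l = [(x, 0.5 * x + rng.normal()) for x in rng.normal(size=200)]
    f_star = erm_affine(D_l)          # f* = F(D_l)
    print(G_emp(f_star, D_l))         # in-sample training error G_emp(F(D_l), D_l)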

Note that the capacity of an ordinary linear regression model is simply its number of free parameters. When the sample size $l$ is small relative to capacity, there will be a significant discrepancy between the generalization error (1) and the in-sample training error $G_{emp}(F(D_l), D_l)$, but it can be controlled through the capacity of the set of functions $\mathcal{F}$ (Vapnik, 1995), e.g., in the case of linear models, by using fewer inputs or by regularizing (for example by constraining the parameters to be small).

An empirical estimate of $G(F(D))$, the generalization error of a functional $F$ (from a data set $D$ to a function $f \in \mathcal{F}$), can be obtained by partitioning the data into two subsets: a training subset $D_1$ to pick $f = F(D_1) \in \mathcal{F}$ which minimizes the empirical error in $D_1$, and a held-out or test subset $D_2$ which gives an unbiased estimate of $G(F(D_1))$, the generalization error of $f$, and a slightly pessimistic estimate of $G(F(D))$, the generalization error associated to the functional $F$ when it is applied to $D = D_1 \cup D_2$. When there is not much data, it is preferable, but computationally more expensive, to use the K-fold cross-validation procedure described in Bishop (1995); Efron and Tibshirani (1993).

However, in the case where the data are not i.i.d., the results of learning theory are not directly applicable, nor are the procedures for estimating generalization error. Consider a sequence of points $z_1, z_2, \ldots$, with $z_t \in R^{n+1}$, generated by an unknown process such that the $z_t$'s may be dependent and have different distributions. Nevertheless, at each time step $t$, in order to make a prediction, we are allowed to choose a function $f_t$ from a set of functions $\mathcal{F}$ using the past observations $z_1^t = (z_1, z_2, \ldots, z_t)$, i.e., we choose $f_t = F(z_1^t)$. In our applications $z_t$ is a pair $(x_{t-h}, y_t)$ (the first observable $x$ is called $x_{1-h}$ rather than $x_1$), and the functions $f \in \mathcal{F}$ take an $x_t$ as input to take a decision that will be evaluated against $y_{t+h}$ through the loss function $Q(f_t, Z_{t+h})$. Here we call $h$ the horizon because it corresponds to the prediction horizon in the case of prediction problems. More generally, it is the number of time steps from a decision to the time when the quality of this decision can be evaluated. In this paper, we consider the quadratic loss $Q(f, Z_{t+h}) = Q(f, (X_t, Y_{t+h})) = (Y_{t+h} - f(X_t))^2$. We then define the expected generalization error $G_t$ for the decision at time $t$ as

$$G_t(f) = E[Q(f, Z_{t+h}) | Z_1^t] = \int Q(f, z_{t+h}) P_{t+h}(z_{t+h} | Z_1^t) dz_{t+h}. \quad (3)$$

The objective of learning is to find, on the basis of the empirical data $z_1^t$, the function $f \in \mathcal{F}$ which has the lowest expected generalization error $G_t(f)$. The process $Z_t$ may be non-stationary, but as long as the generalization errors made by a good model are rather stable in time, we can hope to use the data $z_1^t$ to pick a function which would have worked well in the past and will work well in the future. Therefore, we will extend the above empirical and generalization errors (equations 2 and 1). However, we consider not the error of a single function $f$ but the error associated with a functional $F$ (which maps a data set $D_t = z_1^t$ into a function $f \in \mathcal{F}$). Now let us first consider the empirical error, which is the analogue for non-i.i.d. data of the K-fold cross-validation procedure. We call it the sequential cross-validation procedure, and it measures the out-of-sample error of the functional $F$ as follows:

$$C_T(F, z_1^T) = \frac{1}{T - M + 1} \sum_{t=M}^{T} Q(F(z_1^{t-h}), z_t) \quad (4)$$

where $f_t = F(z_1^t)$ is the choice of the training algorithm using data $z_1^t$ (see equation 7 below), and $M - h > 0$ is the minimum number of training examples required for $F(z_1^{M-h})$ to provide meaningful results. We define the generalization error associated to a functional $F$ for decisions or predictions with a horizon $h$ as follows:

$$E_{Gen}(F) = E[C_T(F, z_1^T)] = \frac{1}{T - M + 1} \sum_{t=M}^{T} \int Q(F(z_1^{t-h}), z_t) P(z_1^T) dz_1^T = \frac{1}{T - M + 1} \sum_{t=M}^{T} E[G_{t-h}(F(Z_1^{t-h}))] \quad (5)$$

where $P(z_1^T)$ is the probability of the sequence $z_1^T$ under the generating process. In that case, we readily see that (4) is the empirical version of (5), that is, (4) estimates (5) by definition. In the case of the quadratic loss, we have

$$E_{Gen}(F) = \frac{1}{T - M + 1} \sum_{t=M}^{T} E\left[ Var[F(Z_1^{t-h})(X_{t-h}) - Y_t \,|\, X_{1-h}^{T-h}] + E^2[F(Z_1^{t-h})(X_{t-h}) - Y_t \,|\, X_{1-h}^{T-h}] \right] \quad (6)$$

To complete the picture, let us simply mention that the functional $F$ may be chosen as

$$F(z_1^t) = \arg\min_{f \in \mathcal{F}} R(f) + \sum_{s=1}^{t} Q(f, z_s) \quad (7)$$

where $R(f)$ might be used as a regularizer, to define a preference among the functions of $\mathcal{F}$, e.g., those that are smoother.

For example, consider a sequence of observations $z_t = (x_{t-h}, y_t)$. A simple class of functions $\mathcal{F}$ is the class of constant functions, which do not depend on the argument $x$, i.e., $f(x) = \mu$. Applying the principle of empirical risk minimization to this class of functions with the quadratic loss $Q(f, (x_t, y_{t+h})) = (y_{t+h} - f(x_t))^2$ yields

$$f_t^{naive} = F^{naive}(z_1^t) = \arg\min_{\mu} \sum_{s=1}^{t} (y_s - \mu)^2 = \bar{y}_t = \frac{1}{t} \sum_{s=1}^{t} y_s, \quad (8)$$

the historical average of the $y$'s up to the current time $t$. We call this unconditional predictor the naive model, and its average out-of-sample error $C_T(F^{naive}, z_1^T) = \frac{1}{T - M + 1} \sum_{t=M}^{T} (\bar{y}_{t-h} - y_t)^2$ is called the out-of-sample naive cost.
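The sequential cross-validation cost (4) and the naive model (8) can be sketched as follows (Python/numpy; the alignment convention, whereby x_al[t] stores x_{t-h} next to y[t], and all names are our own assumptions):

    import numpy as np

    def F_naive(x_past, y_past):
        # Naive functional (8): predict the historical mean of the y's.
        mu = np.mean(y_past)
        return lambda x: mu

    def F_lin(x_past, y_past):
        # Linear functional: least-squares affine fit on past pairs (x_{s-h}, y_s).
        b, a = np.polyfit(x_past, y_past, 1)
        return lambda x: a + b * x

    def C_T(F, x_al, y, h, M):
        # Out-of-sample cost (4): at each t >= M, train on z_1^{t-h} only,
        # i.e. on the pairs (x_al[s], y[s]) with s < t - h, then predict y[t].
        errors = []
        for t in range(M, len(y)):
            f_t = F(x_al[: t - h], y[: t - h])   # uses data known up to t-h only
            errors.append((y[t] - f_t(x_al[t])) ** 2)
        return np.mean(errors)

    # Synthetic example: y_t depends weakly on x_{t-h}.
    rng = np.random.default_rng(2)
    h, M, T = 5, 50, 500
    x_al = rng.normal(size=T)
    y = 0.1 * x_al + rng.normal(size=T)
    print(C_T(F_naive, x_al, y, h, M))   # out-of-sample naive cost
    print(C_T(F_lin, x_al, y, h, M))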

3 Out-of-Sample R²

To assess the generalization ability of a functional $F$ for a more interesting class of functions, which depend on their argument $x$, let us introduce the relative measure of performance

$$R_o^2(F) = 1 - \frac{E_{Gen}(F)}{E_{Gen}(F^{naive})} = 1 - \frac{E[C_T(F, z_1^T)]}{E[C_T(F^{naive}, z_1^T)]} \quad (9)$$

where $E_{Gen}(\cdot)$, $C_T(\cdot, \cdot)$ and the $F^{naive}$ functional were discussed in the previous section. $R_o^2$ will be negative, null or positive according to whether the functional $F$ generalizes worse than, as well as, or better than $F^{naive}$. Broadly speaking, when $R_o^2$ is positive it means that there is a dependency between the inputs and the outputs. In other words, when there is no dependency and we use a model with more capacity (e.g., degrees of freedom) than the naive model, then $R_o^2$ will be negative. The converse is not true: $R_o^2 < 0$ does not imply no dependency, but indicates that the dependency (if present) is not captured by the class of functions in the image of $F$. So in cases where the signal-to-noise ratio is small, it may be preferable not to try to capture the signal for making predictions. The empirical version, or estimator, of $R_o^2$, called the out-of-sample $R^2$, is defined as the statistic

$$\hat{R}_o^2(F) = 1 - \frac{C_T(F, z_1^T)}{C_T(F^{naive}, z_1^T)} = 1 - \frac{\sum_{t=M}^{T} (e_t^F)^2}{\sum_{t=M}^{T} (e_t^{naive})^2}, \quad (10)$$

where $e_t^F = y_t - F(z_1^{t-h})(x_{t-h})$ denotes the prevision error made on $y_t$ by the functional $F$. To ease notation, we let $e_t^{naive}$ stand for $e_t^{F^{naive}}$. This empirical $\hat{R}_o^2$ is a noisy estimate (due to the finite sample), and thus might be positive even when $R_o^2$ is negative (or vice versa). Furthermore, this estimate $\hat{R}_o^2$ may be biased because its expectation is 1 minus the expectation of a ratio of two random variables ($C_T(F, z_1^T)$ and $C_T(F^{naive}, z_1^T)$), which is different from $R_o^2$, which is 1 minus the ratio of the expectations of these same variables. However, unless there is some strange dependency between these two variables, we can expect that $\hat{R}_o^2$ underestimates $R_o^2$ (which is preferable to over-estimating it, meaning that a more conservative estimate is made). It is therefore important to analyze how noisy this estimate is in order to conclude on the dependency between the inputs and the outputs. This matter will be addressed in a later section, using a related statistic that is unbiased (i.e., for which the expectation of the empirical estimate is equal to the true value), that we denote $D_o$:

$$D_o(F) = E_{Gen}(F^{naive}) - E_{Gen}(F) = E[C_T(F^{naive}, z_1^T)] - E[C_T(F, z_1^T)] \quad (11)$$

with empirical estimate

$$\hat{D}_o(F) = C_T(F^{naive}, z_1^T) - C_T(F, z_1^T) = \frac{1}{T - M + 1}\left[ \sum_{t=M}^{T} (e_t^{naive})^2 - \sum_{t=M}^{T} (e_t^F)^2 \right]$$

To understand the name out-of-sample $R^2$, notice that $\hat{R}_o^2$ looks like the usual $R^2$, which is

$$\hat{R}^2 = 1 - \frac{\sum_{t=1}^{T} (\tilde{e}_t^F)^2}{\sum_{t=1}^{T} (\tilde{e}_t^{naive})^2} = 1 - \frac{\sum_{t=1}^{T} (y_t - F(z_1^T)(x_{t-h}))^2}{\sum_{t=1}^{T} (y_t - \bar{y}_T)^2}, \quad (12)$$

where $\tilde{e}_t^F = y_t - F(z_1^T)(x_{t-h})$ is the usual in-sample residual and $\tilde{e}_t^{naive} = \tilde{e}_t^{F^{naive}}$. However, note that $\hat{R}_o^2$, like $R_o^2$, may be negative, which contrasts with $R^2$, which is non-negative whenever the constant functions of $F^{naive}$ belong to $\mathcal{F}$. The terms "in" and "out of" sample underline the difference between $\tilde{e}_t^F$, which depends on the whole sample through $F(z_1^T)$, and $e_t^F$, which depends solely on $y_t$ and the sample $z_1^{t-h}$ up to time $t-h$. In other words, $e_t^F$ is a genuine forecast error and $\tilde{e}_t^F$ is not, as $F(z_1^T)$ is available only at time $T$, so that $F(z_1^T)(x_{t-h})$ cannot be used at time $t-h$.
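Given the sequential costs of the earlier sketch, the estimators (10) and (11) are immediate (again a sketch: C_T, F_lin and F_naive refer to the previous fragment, and the division by T - M + 1 built into C_T makes D_o a difference of average rather than summed squared errors):

    def R2_out(x_al, y, h, M):
        # Out-of-sample R^2, equation (10).
        return 1.0 - C_T(F_lin, x_al, y, h, M) / C_T(F_naive, x_al, y, h, M)

    def D_out(x_al, y, h, M):
        # Unbiased difference statistic, equation (11), as a difference of means.
        return C_T(F_naive, x_al, y, h, M) - C_T(F_lin, x_al, y, h, M)

    print(R2_out(x_al, y, h, M), D_out(x_al, y, h, M))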

An example may clarify all of the above. Take $n = 1$ and let $\mathcal{F}^{lin}$ be the set of affine functions, i.e., linear models $f(x) = \alpha + \beta x$. Sticking with the quadratic loss and no regularization, we have that $f_t^{lin}(x) = F^{lin}(z_1^t)(x) = \hat{\alpha}_t + \hat{\beta}_t x$, where $(\hat{\alpha}_t, \hat{\beta}_t)$, minimizing

$$\sum_{s=1}^{t} (y_s - \alpha - \beta x_{s-h})^2,$$

are the least squares estimates of the linear regression of $y_s$ on $x_{s-h}$, $s = 1, \ldots, t$, and rely only on data known up to time $t$, i.e., $z_1^t$. Then

$$e_t^{naive} = y_t - F^{naive}(z_1^{t-h})(x_{t-h}) = y_t - \bar{y}_{t-h}$$
$$e_t^{lin} = y_t - F^{lin}(z_1^{t-h})(x_{t-h}) = y_t - \hat{\alpha}_{t-h} - \hat{\beta}_{t-h} x_{t-h}$$
$$\tilde{e}_t^{naive} = y_t - F^{naive}(z_1^T)(x_{t-h}) = y_t - \bar{y}_T$$
$$\tilde{e}_t^{lin} = y_t - F^{lin}(z_1^T)(x_{t-h}) = y_t - \hat{\alpha}_T - \hat{\beta}_T x_{t-h}$$

If we assume that the $Z_t$'s are independent with expectation $E[Y_t | x_{t-h}] = \alpha + \beta x_{t-h}$ and variance $Var[Y_t | x_{t-h}] = \sigma^2$, then (6) yields

$$(T - M + 1) E_{Gen}(F^{naive}) = \sigma^2 \sum_{t=M}^{T} \left[ 1 + \frac{1}{t-h} \right] + \beta^2 \sum_{t=M}^{T} E[(X_{t-h} - \bar{X}_{t-2h})^2]$$

and

$$(T - M + 1) E_{Gen}(F^{lin}) = \sigma^2 \sum_{t=M}^{T} \left[ 1 + \frac{1}{t-h} \right] + \sigma^2 \sum_{t=M}^{T} E\left[ \frac{(X_{t-h} - \bar{X}_{t-2h})^2}{\sum_{s=1}^{t-h} (X_{s-h} - \bar{X}_{t-2h})^2} \right]$$

where $\bar{X}_{t-2h} = \frac{1}{t-h} \sum_{s=1}^{t-h} X_{s-h}$ is the mean of the X's up to $X_{t-2h}$. We then see that $R_o^2$ is negative, null or positive according to whether $\beta^2/\sigma^2$ is smaller than, equal to, or greater than

$$\frac{\sum_{t=M}^{T} E\left[ \frac{(X_{t-h} - \bar{X}_{t-2h})^2}{\sum_{s=1}^{t-h} (X_{s-h} - \bar{X}_{t-2h})^2} \right]}{\sum_{t=M}^{T} E[(X_{t-h} - \bar{X}_{t-2h})^2]}. \quad (13)$$

This illustrates the comment made earlier regarding the fact that $R_o^2 < 0$ means that the signal-to-noise ratio ($\beta^2/\sigma^2$ here) is too small for $F^{lin}$ to outperform $F^{naive}$. This result shows, in this particular case, that even if the true generating model has $\beta \neq 0$, a model trained from the class of models with $\beta = 0$ (the naive model) should be chosen for its better forecast generalization, rather than a model from the class $\beta \neq 0$.

Let us now consider a more complex case where the distribution is closer to the kind of data studied in this paper. If we assume that $E[Y_t | x_{1-h}^{T-h}] = \alpha + \beta x_{t-h}$ and $Var[Y_t | x_{1-h}^{T-h}] = \sigma^2$ with $Cov[Y_t, Y_{t+k} | x_{1-h}^{T-h}] = 0$ whenever $|k| \geq h$, then (6) yields

$$(T - M + 1) E_{Gen}(F^{naive}) = \sum_{t=M}^{T} \left( \sigma^2 + E[Var[\bar{Y}_{t-h} \,|\, X_{1-h}^{T-h}]] \right) + \beta^2 \sum_{t=M}^{T} E[(X_{t-h} - \bar{X}_{t-2h})^2]$$

and

$$(T - M + 1) E_{Gen}(F^{lin}) = \sum_{t=M}^{T} \left( \sigma^2 + E[Var[\bar{Y}_{t-h} + \hat{\beta}_{t-h}(x_{t-h} - \bar{x}_{t-2h}) \,|\, X_{1-h}^{T-h}]] \right)$$

where $\bar{X}_{t-2h} = \frac{1}{t-h} \sum_{s=1}^{t-h} X_{s-h}$ is, as before, the mean of the X's up to $X_{t-2h}$. We then see that $R_o^2$ is negative, null or positive according to whether $\beta^2/\sigma^2$ is smaller than, equal to, or greater than

$$\frac{\sigma^{-2} \sum_{t=M}^{T} E\left[ Var[\bar{Y}_{t-h} + \hat{\beta}_{t-h}(X_{t-h} - \bar{X}_{t-2h}) \,|\, X_{1-h}^{T-h}] - Var[\bar{Y}_{t-h} \,|\, X_{1-h}^{T-h}] \right]}{\sum_{t=M}^{T} E[(X_{t-h} - \bar{X}_{t-2h})^2]}. \quad (14)$$

Note that it can be shown that the above numerator is free of $\sigma$, as it involves only expectations of expressions in the $X_t$'s (like the denominator). This again illustrates the comment made earlier regarding the fact that $R_o^2 < 0$ means that the signal-to-noise ratio ($\beta^2/\sigma^2$ here) is too small for $F^{lin}$ to outperform $F^{naive}$. It also illustrates the point made in the introduction that when the amount of data is finite, choosing a model according to its expected generalization error may yield a different answer than choosing the model that is closest to the true generating model, and that for the purpose of making forecasts or taking decisions regarding yet unseen (i.e., out-of-sample) data, it reflects our true objective better. See also (Vapnik, 1982, section 8.6) for an example of the difference in out-of-sample generalization performance between the model obtained when looking for the true generating model versus choosing the model which has a better chance to generalize (in this case using bounds on generalization error, for polynomial regression).
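This effect is easy to reproduce numerically: with $\beta \neq 0$ but a small signal-to-noise ratio and a short series, the average $\hat{R}_o^2$ of the linear model is negative, i.e., the naive model generalizes better (a sketch under our own synthetic i.i.d. assumptions, reusing R2_out from the previous fragment):

    import numpy as np
    rng = np.random.default_rng(3)

    h, M, T = 1, 20, 60
    beta, sigma = 0.05, 1.0          # small signal-to-noise ratio beta/sigma
    r2_values = []
    for _ in range(500):             # Monte Carlo over independent data sets
        x_al = rng.normal(size=T)
        y = beta * x_al + rng.normal(scale=sigma, size=T)
        r2_values.append(R2_out(x_al, y, h, M))
    # Typically negative: F_naive outperforms F_lin even though beta != 0.
    print(np.mean(r2_values))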

4 The financial data and preliminary results

Experiments on the in-sample and out-of-sample statistics were performed on a financial time-series. The data is based on the daily total return, including capital gains as well as dividends, for the Toronto Stock Exchange TSE300 index, starting in January 1982 up to July 1998. The total return series $TR_t$, $t = 0, 1, \ldots, 4178$, can be described as the result at time $t$ of an initial investment of 1 dollar and the reinvestment of all dividends received. We construct, for different values of $h$, the log-return series on a horizon $h$:

$$r_t(h) = \log\left( \frac{TR_t}{TR_{t-h}} \right) = \log(TR_t) - \log(TR_{t-h}) \quad (15)$$

where $TR$ means total return and $t$ represents days. Thus $r_t(h)$ represents the logarithm of the total return at day $t$ over the past $h$ day(s). There are 4179 trading days in the sample. We consider that there are twenty-one trading days per month, or 252 trading days per year. The real number of trading days, on which trading activities can occur, can vary slightly from month to month, depending on holidays or exceptional events, but 21 is a good approximation if we want to work with a fixed number of trading days per month. A horizon of $H = N$ months will mean $h = 21N$ days.

Using and predicting returns on a horizon greater than the sampling period creates an overlapping effect. Indeed, upon defining the daily log-returns $r_t = r_t(1)$, $t = 1, \ldots, 4178$, we can write

$$r_t(h) = \log(TR_t) - \log(TR_{t-h}) = \sum_{s=t-h+1}^{t} (\log(TR_s) - \log(TR_{s-1})) = \sum_{s=t-h+1}^{t} r_s \quad (16)$$

as a moving sum of the $r_t$'s.
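In code, (15) and (16) are a vectorized difference of log levels (a sketch; the synthetic TR series below merely stands in for the TSE300 total-return series, which is not reproduced here):

    import numpy as np

    def log_return(TR, h):
        # r_t(h) = log(TR_t) - log(TR_{t-h}), for t = h, ..., len(TR) - 1.
        logTR = np.log(np.asarray(TR, dtype=float))
        return logTR[h:] - logTR[:-h]

    # Check the moving-sum identity (16) on a synthetic total-return series.
    rng = np.random.default_rng(4)
    TR = np.cumprod(1.0 + 0.0005 + 0.01 * rng.normal(size=300))
    r_daily = log_return(TR, 1)
    np.testing.assert_allclose(log_return(TR, 21)[0], r_daily[:21].sum())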

We will work on monthly returns, as it has been suggested from empirical evidence (Campbell et al., 1997; Fama and French, 1988) that they can be useful for forecasting, whereas such results are not documented for daily returns, except for non-profitable trading effects. So our horizon will be a multiple of 21 days. Data are slightly better behaved when we take monthly returns instead of daily ones. For instance, the daily return series is far from being normally distributed. It is known that stock index return distributions have more mass in their tails than the normal distribution (Campbell et al., 1997). But returns over longer horizons get closer to normality, thanks to equation (16) and the central limit theorem. For example, table 1 shows the sample skewness and kurtosis for the daily, monthly and quarterly returns. We readily notice that these higher moments are more in line with those of the normal distribution (skewness = 0, kurtosis = 3) when we consider longer-term returns instead of daily returns.

Figure 1: Left: daily logarithm of the TSE300 index from January 1982 to the end of July 1998. Right: daily log returns of the TSE300 for the same period.

Table 1: Sample skewness and sample kurtosis of TSE300 returns over horizons of 1 day, 1 month and 3 months. The statistics and their standard deviations (shown in parentheses) have been computed according to formulas described in Campbell et al. (1997).

  Horizon     skewness       kurtosis
  1 day       -1.22 (0.04)   33.17 (0.08)
  1 month     -1.13 (0.17)   10.63 (0.35)
  3 months    -0.40 (0.30)    3.93 (0.60)

Table 1 is the first illustration of the touchy problem of the overlapping effect. For instance, you will notice that the standard deviations are not the same for daily and monthly returns. This is because the daily return statistics are based on $r_1, \ldots, r_{4178}$, whereas their monthly counterparts are based on $r_{21}(21), r_{42}(21), \ldots, r_{21 \times 198}(21)$, that is, approximately 21 times fewer points than in the daily case. The reason for this is that we want independent monthly returns. If we assumed that the daily returns were independent, then monthly returns would have to be at least one month apart to also be independent. For instance, $r_{21}(21)$ and $r_{40}(21)$ would not be independent, as they share $r_{20}$ and $r_{21}$.
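The non-overlapping monthly sampling behind table 1 then amounts to keeping every 21st point (a sketch; scipy's skew and Pearson kurtosis are used, and TR is the synthetic series of the previous fragment, not the TSE300 data):

    import numpy as np
    from scipy.stats import skew, kurtosis

    def nonoverlapping_returns(TR, h):
        # Contiguous, non-overlapping h-day log returns: r_h(h), r_2h(h), ...
        logTR = np.log(np.asarray(TR, dtype=float))
        return np.diff(logTR[::h])

    monthly = nonoverlapping_returns(TR, 21)
    # fisher=False gives the Pearson kurtosis, equal to 3 for a normal law.
    print(skew(monthly), kurtosis(monthly, fisher=False))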

Figure 2: Evolution of the in-sample $R^2$ with the prediction horizon. The $R^2$ seems to indicate that the strongest input/output relation is at a horizon of around a year.

Figures 2 and 3 depict the values of $\hat{R}^2$ and $\hat{R}_o^2$ obtained on the TSE data when $H = 1, 2, \ldots, 24$. The first plot suggests that there appears to be little relationship between past and future returns except, perhaps, when we aggregate the returns over a period of about one year ($H = 12$). Figure 3 tells a similar story: at best, predictability of future returns seems possible only for yearly returns or so. But how can we decide (formally) whether there is a relationship between past and future returns, and whether such a relationship might be useful for forecasting? This is the goal of the next section.

5 Testing the hypothesis of no relation between Y and X

Consider testing the hypothesis that there is no relationship between successive returns of horizon $h$, i.e., $H_0: E[r_t(h) | r_{t-h}(h)] = \mu$. Note that $r_t(h)$ and $r_{t-h}(h)$ do not overlap but are contiguous $h$-day returns. To put it in section 2's notation, we have $y_t = x_t = r_{t+2h-1}(h)$, so that, for instance, $x_{1-h} = r_h(h)$ is the first observable $x$. We wish to test $E[Y_t | x_{t-h}] = \mu$. As mentioned in the introduction, this hypothesis is not what we are actually interested in, but what we do in this section proves to be useful in section 6, as it allows us to introduce the bootstrap, among other things.

To perform a test of hypothesis, one needs a statistic whose behavior depends on whether $H_0$ is true or false. We will mainly consider two statistics here.

Figure 3: Evolution of the out-of-sample $\hat{R}_o^2$ with the change of horizon.

First we have $\hat{R}_o^2$, which will take

smaller values under $H_0$ than otherwise. The other approach to testing $H_0$ is to notice that if $E[r_t(h) | r_{t-h}(h)]$ does not depend on $r_{t-h}(h)$, then the correlation between $r_{t-h}(h)$ and $r_t(h)$ is null: $\rho(r_t(h), r_{t-h}(h)) = 0$. Thus we will use $\hat{\rho}(r_t(h), r_{t-h}(h))$, an estimator of $\rho(r_t(h), r_{t-h}(h))$, to test $H_0$, as it will tend to be closer to 0 under $H_0$ than otherwise.

The second thing needed in a test of hypothesis is the distribution of the chosen statistic under $H_0$. This may be obtained from theoretical results or approximated from a bootstrap, as explained later. In the case of $\hat{\rho}(r_t(h), r_{t-h}(h))$, we do have such a theoretical result (Bartlett, 1946; Anderson, 1984; Box and Jenkins, 1970). First let us formally define

$$\hat{\rho}(r_t(h), r_{t-h}(h)) = \frac{\sum_{t=2h}^{T} (r_t(h) - \bar{r}(h))(r_{t-h}(h) - \bar{r}(h))}{\sum_{t=h}^{T} (r_t(h) - \bar{r}(h))^2}, \quad (17)$$

with $\bar{r}(h)$ being the sample mean of $r_h(h), \ldots, r_T(h)$. Assuming that the $r_t$'s are independent and identically distributed with finite variance, then

$$\sqrt{T - h + 1} \, \left( \hat{\rho}(r_t(h), r_{t-h}(h)) - \rho(r_t(h), r_{t-h}(h)) \right) \rightarrow N(0, W) \quad \text{with} \quad W = \sum_{v=1}^{\infty} (\rho_{v+h} + \rho_{v-h} - 2 \rho_h \rho_v)^2, \quad (18)$$

where $\rho_k$ stands for $\rho(r_{t+k}(h), r_t(h))$. If the $r_t$'s are independent and the $r_t(h)$ are running sums of $r_t$'s as shown in equation (16), then $\rho_k = (h - |k|)^+ / h$, where $u^+ = \max(u, 0)$. Therefore we have

$$W = \sum_{v=1}^{2h-1} \rho_{v-h}^2 = \sum_{v=1-h}^{h-1} \rho_v^2 = 1 + \frac{2}{h^2} \sum_{v=1}^{h-1} (h - v)^2 = 1 + \frac{(h-1)(2h-1)}{3h} \quad (19)$$

where the identity $1^2 + 2^2 + 3^2 + \ldots + N^2 = \frac{N(N+1)(2N+1)}{6}$ was used in the last equality. Large values of $\sqrt{\frac{T-h+1}{W}} \, \hat{\rho}(r_t(h), r_{t-h}(h))$ are unfavorable to $H_0$, and their significance is obtained from a $N(0,1)$ table.

In the case of the $\hat{R}_o^2$ statistic, its distribution is unknown. However, we may find an approximation of it by simulation (bootstrap). So we have to generate data satisfying the hypothesis $H_0: E[Y_t | x_{t-h}] = \mu$ (i.e., not depending on $x_{t-h}$). This can be done in at least four ways (scheme 1 is sketched in code below):

1. Generate a set of independent $r_t$'s and compute the $Y_t = r_t(h)$'s and the $x_{t-h} = r_{t-h}(h)$'s in the usual way.

2. Keep the $Y_t$ obtained from the actual data, but compute the $x_{t-h}$ as suggested in 1.

3. Keep the $x_{t-h}$ obtained from the actual data, but compute the $Y_t$ as suggested in 1.

4. Generate a set of independent $r_t$'s and compute the $Y_t = r_t(h)$'s. Then generate another set of $r_t$'s, independently of the first set, and compute the $x_{t-h} = r_{t-h}(h)$ on those.
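Scheme 1 and the test statistic (17)-(19) can be sketched as follows (Python/numpy; the function names are our own, and the z-statistic on the last line corresponds to the theoretical N(0,1) test of (18)):

    import numpy as np

    def rho_hat(r_h, h):
        # Sample autocorrelation (17) between r_t(h) and r_{t-h}(h).
        d = r_h - r_h.mean()
        return (d[h:] * d[:-h]).sum() / (d ** 2).sum()

    def W_overlap(h):
        # Asymptotic variance (19) under independent daily returns.
        return 1.0 + (h - 1) * (2 * h - 1) / (3.0 * h)

    def scheme1_sample(r_daily, h, rng):
        # Scheme 1: resample the daily returns with replacement (their
        # empirical distribution), then rebuild overlapping h-day returns.
        r_star = rng.choice(r_daily, size=len(r_daily), replace=True)
        csum = np.concatenate(([0.0], np.cumsum(r_star)))
        return csum[h:] - csum[:-h]       # moving sums of window h

    rng = np.random.default_rng(5)
    h = 21
    r_h_star = scheme1_sample(r_daily, h, rng)   # r_daily as in the earlier sketch
    # len(r_h_star) equals T - h + 1 in the paper's notation.
    z = rho_hat(r_h_star, h) * np.sqrt(len(r_h_star) / W_overlap(h))
    print(z)   # approximately N(0, 1) under H_0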

The generation of the $r_t$'s may come from the empirical distribution of the actual $r_t$'s (i.e., resampling with replacement) or another distribution deemed appropriate. We have considered both the empirical distribution and the $N(0, 1)$ distribution. (Since the out-of-sample $R_o^2$, just like the in-sample $R^2$, is location-scale invariant, we do not have to bother about matching the mean and variance of the actual series.) We believe generation scheme 1 to be the most appropriate here, since it looks more like the way the original data was treated: $Y_t$ and $x_{t-h}$ obtained from a single set of $r_t$'s.

Once we have chosen a simulation scheme, we may obtain as many samples as we want ($B$, say) and thus get $B$ independent realizations of the statistic $\hat{R}_o^2$. We then check whether the out-of-sample statistic takes values that are large even in this case, compared to the value observed on the original data series. Formally, compute p-value $= A/B$, where $A$ is the number of simulated $\hat{R}_o^2$ greater than or equal to the $\hat{R}_o^2$ computed on the actual data. This measures the plausibility of $H_0$; small p-values indicate that $H_0$ is not plausible in the light of the actual data observed. Another way to use the bootstrap values of $\hat{R}_o^2$ is to assume that the distribution of $\hat{R}_o^2$ under $H_0$ is $N(\hat{E}[\hat{R}_o^2], \hat{V}[\hat{R}_o^2])$, where $\hat{E}[\hat{R}_o^2]$ and $\hat{V}[\hat{R}_o^2]$ are the sample mean and the sample variance of the $B$ bootstrap values of $\hat{R}_o^2$. Comparing the actual $\hat{R}_o^2$ to this distribution yields the normalized bootstrap p-value. For the type 1 method we simply compute the p-value of the observed $\hat{R}_o^2$ under the null hypothesis of no relationship between the inputs and the outputs, using the empirical histogram of this statistic over the bootstrap replications. When the p-value is small, a more meaningful quantity might be the mean and the standard deviation of the statistic over the bootstrap replications, so as to provide a z-statistic. Of course, this bootstrap approach may be used even in the case where the (asymptotic) distribution of a statistic is known. Therefore, we will compute bootstrap p-values for the statistic $\hat{\rho}(r_t(h), r_{t-h}(h))$ as well as its theoretical p-value, for comparison purposes.

Finally, one may wonder why Fisher's test statistic $F = (T - h - 2) \frac{R^2}{1 - R^2}$, where $R^2$ is the standard in-sample $R^2$, was not used to test $H_0$. That is because the famous result $F \sim F_{1, T-h-2}$ (under $H_0$) holds in a rather strict framework where, among other things, the $Y$'s have to be independent (which is not the case here). The usual theoretical p-value would be terribly misleading. The actual distribution of $F$ not being known, an interesting exercise is to compute the bootstrap p-values on $F$ to compare with the wrong theoretical p-values.

5.1 Results on artificial data

There are some tricky points concerning the financial data and the appropriateness of the asymptotic and autocorrelation-consistent standard error of $\hat{\rho}(r_t(h), r_{t-h}(h))$. The results of Bartlett (1946) hold for Gaussian and stationary data. A slight violation of these assumptions can complicate the comparison of the in-sample and out-of-sample statistics. Hence we generated artificial data and tested the null hypothesis of no relationship on them. We chose an autoregressive process of order 1, $y_{t+1} = \beta y_t + \epsilon_t$, for which we vary the coefficient $\beta$ of autoregression over a range of values between zero and 1, and where $\epsilon_t$ is drawn from a normal distribution $N(0, \sigma_\epsilon)$, with here $\sigma_\epsilon = 0.2$. We conduct the tests of the null hypothesis for series of lengths in the set $N = \{200, 500, 1000, 2000, 4000, 8000, 16000\}$.
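Both flavors of bootstrap p-value can be wrapped in one helper (a sketch; the normal approximation uses the bootstrap mean and variance exactly as described above, and the function name is our own):

    import numpy as np
    from scipy.stats import norm

    def bootstrap_pvalues(stat_obs, stat_boot):
        # Pure bootstrap p-value: proportion of bootstrap statistics >= observed.
        stat_boot = np.asarray(stat_boot, dtype=float)
        pure = np.mean(stat_boot >= stat_obs)
        # Normalized bootstrap p-value: N(mean, var) fitted to the bootstrap values.
        normalized = norm.sf(stat_obs, loc=stat_boot.mean(),
                             scale=stat_boot.std(ddof=1))
        return pure, normalized

    # Example: B = 1000 values of R2_out computed on scheme-1 samples would be
    # passed as stat_boot, and the statistic on the actual series as stat_obs.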

Table 2: Empirical critical points of four statistics estimated on series of different lengths. We can observe the presence of a negative skew in the empirical distribution of the $\hat{\rho}$ for series of length 200 and 500. For the same series, the critical points of the out-of-sample statistics contain the value zero.

  N      rho                R^2       R_o^2               D_o
  200    [-0.121, 0.099]    0.012     [-0.026, 0.006]     [-0.16, 0.038]
  500    [-0.077, 0.066]    0.005     [-0.012, 0.001]     [-0.21, 0.026]
  1000   [-0.050, 0.050]    0.0026    [-0.006, 0.0002]    [-0.25, 0.007]
  2000   [-0.037, 0.035]    0.0013    [-0.0035, 0.0004]   [-0.27, 0.035]
  4000   [-0.026, 0.028]    0.0007    [-0.002, 0.0003]    [-0.30, 0.045]
  8000   [-0.020, 0.020]    0.00033   [-0.001, 0.002]     [-0.33, 0.077]
  16000  [-0.014, 0.014]    0.00017   [-0.0055, 0.0016]   [-0.35, 0.099]

We first generated, for each value of $n$ in $N$, one thousand series with $\beta = 0$. For each of these series we construct the empirical distribution of 4 statistics, namely the autocorrelation $\rho$ (equation 17), the in-sample $R^2$, the out-of-sample $R_o^2$, and the difference between the out-of-sample naive cost and the out-of-sample linear model's cost, named $D_o(F)$ and formally defined earlier (equation 11) as

$$D_o(F) = E_{Gen}(F^{naive}) - E_{Gen}(F) = E[C_T(F^{naive}, z_1^T)] - E[C_T(F, z_1^T)]$$

with the empirical estimate

$$\hat{D}_o(F) = C_T(F^{naive}, z_1^T) - C_T(F, z_1^T)$$

which does not suffer from the possible bias of $\hat{R}_o^2(F)$ discussed in section 3, because the expectation of a difference of two random variables is equal to the difference of the expectations of each random variable:

$$E[C_T(F^{naive}, z_1^T) - C_T(F, z_1^T)] = E[C_T(F^{naive}, z_1^T)] - E[C_T(F, z_1^T)]$$

From these empirical distributions, we estimated the critical points at 10%, $[L_{5\%}, H_{5\%}]$, except for the in-sample $R^2$, where we estimated the 10% critical point at the right of the empirical distribution. For the out-of-sample statistics, we chose $M = 50$ as the minimum number of training examples before generalization errors are recorded (see equation 4). The values of these critical points are presented in table 2.

After having established the critical points at 10%, we want to study the power of these tests, i.e., how useful each statistic is for rejecting the null hypothesis when the null hypothesis is false. To this end, we generated one thousand series for different values of $\beta$, some smaller than $\sigma_\epsilon$ and some larger. We estimated on these series the values of the four statistics considered in table 2, and computed, for the different values of $\beta/\sigma_\epsilon$, the number of times each of these statistics falls outside the interval delimited by the critical values, or above the critical value in the case of $R^2$. The results are presented in tables 3 through 8 (the estimation loop itself is sketched in code below).

We can observe from these tables that the power of the test of $H_0$ (no relationship between the inputs and the outputs) based on the out-of-sample statistics $R_o^2$ and $D_o$ is less than the power of the test based on the in-sample statistics. This seems particularly true for values $0.01 < \beta < 0.05$, corresponding to a signal-to-noise ratio in the range $0.05 < \beta/\sigma_\epsilon < 0.25$.
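The power estimates of tables 3-8 come from a loop of this shape (a sketch with our own names; `stat` would be one of the four statistics above and the critical points those of table 2):

    import numpy as np
    rng = np.random.default_rng(6)

    def ar1(n, beta, sigma_eps=0.2):
        # Artificial order-1 autoregressive series y_{t+1} = beta*y_t + eps_t.
        y = np.zeros(n)
        eps = rng.normal(scale=sigma_eps, size=n - 1)
        for t in range(n - 1):
            y[t + 1] = beta * y[t] + eps[t]
        return y

    def rejection_rate(stat, crit_lo, crit_hi, beta, n, n_rep=1000):
        # Fraction of replications where the statistic leaves [crit_lo, crit_hi].
        hits = 0
        for _ in range(n_rep):
            s = stat(ar1(n, beta))
            hits += (s < crit_lo) or (s > crit_hi)
        return hits / n_rep

    # Example: power of the lag-1 autocorrelation test on series of length 4000,
    # using the table 2 critical points for rho at N = 4000.
    lag1 = lambda y: np.corrcoef(y[1:], y[:-1])[0, 1]
    print(rejection_rate(lag1, -0.026, 0.028, beta=0.02, n=4000, n_rep=100))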

Table 3: Statistics on artificial data of length 200.

  stat \ beta/sigma   0       0.005   0.05    0.075   0.1     0.11    0.25    0.5
  rho                 10.0%   10.5%   11.4%   13.5%   15.4%   14.6%   25.3%   46.9%
  R^2                 10.2%   10.5%   10.9%   11.0%   13.4%   12.7%   21.1%   40.7%
  R_o^2               10.9%   11.8%   11.1%   12.8%   15.8%   15.3%   18.9%   33.6%
  D_o                 10.9%   11.4%   11.3%   12.1%   15.8%   14.8%   18.1%   32.3%

Table 4: Statistics on artificial data of length 1000.

  stat \ beta/sigma   0       0.005   0.05    0.075   0.1     0.11    0.25    0.5
  rho                 9.6%    9.5%    11.8%   13.2%   15.3%   16.4%   47.9%   93.7%
  R^2                 9.3%    9.6%    11.5%   11.7%   12.9%   13.2%   46.7%   93.3%
  R_o^2               9.7%    8.9%    12.0%   12.2%   11.1%   12.5%   38.3%   86.1%
  D_o                 9.6%    8.7%    11.7%   12.0%   13.3%   12.2%   37.7%   85.8%

Table 5: Statistics on artificial data of length 2000.

  stat \ beta/sigma   0       0.005   0.05    0.075   0.1     0.11    0.25    0.5
  rho                 10.6%   12.2%   14.3%   17.0%   25.4%   27.6%   75.1%   99.8%
  R^2                 10.3%   12.6%   14.0%   16.9%   24.8%   26.9%   74.0%   99.8%
  R_o^2               10.1%   14.1%   13.6%   15.2%   21.8%   20.8%   62.8%   99.1%
  D_o                 9.9%    13.9%   13.5%   14.5%   21.5%   20.4%   62.4%   99.0%

Table 6: Statistics on artificial data of length 4000.

  stat \ beta/sigma   0       0.005   0.05    0.075   0.1     0.11    0.25    0.5
  rho                 10.5%   10.7%   13.1%   21.1%   27.2%   35.1%   91.2%   100.0%
  R^2                 9.4%    9.9%    14.5%   23.1%   29.7%   37.5%   93.3%   100.0%
  R_o^2               10.0%   10.8%   13.0%   20.6%   23.1%   30.1%   85.6%   100.0%
  D_o                 10.1%   10.5%   13.2%   20.4%   23.4%   30.0%   85.3%   100.0%

Table 7: Statistics on artificial data of length 8000.

  stat \ beta/sigma   0       0.005   0.05    0.075   0.1     0.11    0.25    0.5
  rho                 0.10    0.11    0.21    0.37    0.58    0.60    0.99    1.00
  R^2                 0.10    0.11    0.21    0.36    0.57    0.59    0.99    1.00
  R_o^2               0.11    0.12    0.15    0.26    0.45    0.49    1.00    1.00
  D_o                 0.11    0.12    0.15    0.26    0.44    0.49    0.99    1.00

Table 8: Statistics on artificial data of length 16000.

  stat \ beta/sigma   0       0.005   0.05    0.075   0.1     0.11    0.25    0.5
  rho                 0.11    0.08    0.33    0.37    0.79    0.87    1.00    1.00
  R^2                 0.11    0.08    0.33    0.36    0.79    0.87    1.00    1.00
  R_o^2               0.12    0.11    0.25    0.26    0.69    0.76    1.00    1.00
  D_o                 0.12    0.11    0.24    0.26    0.68    0.76    1.00    1.00

Figure 4: Empirical distributions of the autocorrelation. The dots draw the empirical distribution of the $\hat{\rho}$ obtained with $\beta = 0$ in the autoregressive model used to generate the artificial data. The length of the series is 4000 and there are 1000 such series. The other points represent the empirical distribution of the $\hat{\rho}$ obtained with $\beta/\sigma_\epsilon = 0.02$. Looking at table 6, 25.5% of the values fall outside the critical range $[-0.026, 0.028]$ presented in table 2.

Figure 5: Empirical distributions of the out-of-sample $R^2$. The dots draw the empirical distribution of the $\hat{R}_o^2$ obtained with $\beta = 0$ in the autoregressive model used to generate the artificial data. The length of the series is 4000 and there are 1000 such series. The other points represent the empirical distribution of the $\hat{R}_o^2$ obtained with $\beta/\sigma_\epsilon = 0.02$. Looking at table 6, 25.5% of the values fall outside the critical range $[-0.30, 0.045]$ presented in table 2.

For small values of $\beta/\sigma_\epsilon$, such as 0 and 0.005, there is not a clear difference between the power of the tests based on in-sample statistics and on out-of-sample statistics (although the out-of-sample statistics seem slightly better). Estimates of these probabilities of observing the statistics outside the critical range, computed on other samples of 1000 series, can lead to slightly different probabilities, indicating that we must use a bigger sample in order to be able to observe a significant discrepancy between the tests for small values of the signal-to-noise ratio.

It would appear from these results that when we want to test against the null hypothesis of no dependency, the classical in-sample tests provide more power. However, there are two reasons why we may still be interested in looking at the out-of-sample statistics: first, we may care more about the out-of-sample performance (whether our model will generalize better than the naive model) than about the true value of $\beta$ (see the following section for a striking result concerning this point); second, the dependency may be non-linear or non-Gaussian.

5.2 Discussion of the results on financial data

In all cases, $B = 1000$ bootstrap replications were generated and the out-of-sample statistic was computed on each of them with $M = 50$, yielding distributions of $\hat{R}_o^2$ for the null hypothesis, which is that the true $R_o^2$ is negative. For $\hat{\rho}(r_t(h), r_{t-h}(h))$, the theoretical p-values disagree with the two others (see table 9), indicating that the asymptotic normality of $\hat{\rho}(r_t(h), r_{t-h}(h))$ does not hold in our sample. For Fisher's $F$, we see that the theoretical p-values are awfully wrong. Even the two bootstrap p-values do not agree well, indicating that the null distribution of $F$ is not normal. We observed an asymmetry in the empirical distributions of Fisher's $F$: most of the values are near 0, with a decay in the frequency of the $F$ for larger $F$. Typically, the skewness of the empirical distributions of the $F$ is positive and around 2. So here, only the (pure) bootstrap p-value can be trusted, as it is valid generally. Regarding $\hat{R}_o^2$, we see (table 9) that a similar pattern is observed for the positive values of $\hat{R}_o^2$. The pure bootstrap p-values seem to indicate a possible dependence of the near-one-year return on the past year's return. Also, in this case, the empirical distributions of the $\hat{R}_o^2$ are not normal: the observed skewness of these distributions is systematically negative, with values around -4. The theoretical p-values for this out-of-sample statistic are not known. Table 10 is presented to provide the correspondence between the $F$ statistic shown in table 9 and the value of the in-sample $\hat{R}^2$. Table 11 presents the results of the test of the null hypothesis of no relationship between inputs and outputs using the statistic $D_o$. This test allows one to reject the null hypothesis of no linear dependency even more strongly than the test based on $R_o^2$.

6 Test of $H_0: R_o^2 = 0$

Here we attack the problem we are actually interested in: assessing whether generalizations based on past returns are better than the naive generalizations. Here we consider linear forecasts, so that we want to know whether $F^{lin}$ generalizes better than $F^{naive}$. The statistic we will use to this end is $\hat{R}_o^2$, assuming that its bias, not occurring in $D_o$, is not a major concern here. Its distribution not being known, we will have to turn to the bootstrap method and simulate values of $\hat{R}_o^2$ computed on samples generated under $H_0: R_o^2 = 0$. We assume that

$$E[r_t(h) | r_{t-h}(h)] = \alpha + \beta r_{t-h}(h). \quad (20)$$

Table 9: Test of the hypothesis of no relationship between inputs and outputs. Three statistics are used, and for each the theoretical (tpv), pure bootstrap (pbpv) and normalized bootstrap (nbpv) p-values are computed.

  H   R_o^2   tpv  pbpv  nbpv  | rho     tpv   pbpv  nbpv  | F     tpv   pbpv  nbpv
  1   -0.03   NA   0.83  0.67  |  0.02   0.74  0.60  0.60  | 2     0.16  0.70  0.74
  2   -0.31   NA   0.99  0.99  |  0.02   0.81  0.67  0.67  | 2     0.19  0.77  0.75
  3   -0.08   NA   0.68  0.51  | -0.02   0.84  0.95  0.98  | 1.9   0.13  0.84  0.75
  4   -0.10   NA   0.64  0.46  | -0.05   0.65  0.87  0.87  | 9     0.00  0.68  0.72
  5   -0.07   NA   0.30  0.31  | -0.07   0.60  0.72  0.72  | 21    0.00  0.58  0.68
  6   -0.06   NA   0.24  0.29  | -0.06   0.68  0.87  0.89  | 18    0.00  0.65  0.72
  7   -0.11   NA   0.42  0.39  | -0.09   0.56  0.78  0.75  | 40    0.00  0.56  0.68
  8   -0.14   NA   0.49  0.39  | -0.15   0.37  0.55  0.51  | 109   0.00  0.34  0.46
  9   -0.15   NA   0.47  0.39  | -0.18   0.31  0.48  0.46  | 177   0.00  0.26  0.36
  10  -0.18   NA   0.52  0.43  | -0.22   0.24  0.38  0.38  | 174   0.00  0.19  0.21
  11  -0.14   NA   0.38  0.35  | -0.26   0.18  0.27  0.26  | 274   0.00  0.10  0.07
  12  -0.06   NA   0.15  0.25  | -0.32   0.12  0.21  0.20  | 433   0.00  0.08  0.01
  13   0.07   NA   0.02  0.14  | -0.32   0.14  0.20  0.21  | 705   0.00  0.06  0.01
  14   0.07   NA   0.04  0.14  | -0.28   0.21  0.31  0.31  | 549   0.00  0.11  0.07
  15  -0.01   NA   0.10  0.19  | -0.23   0.33  0.58  0.55  | 350   0.00  0.27  0.29
  16  -0.05   NA   0.13  0.24  | -0.17   0.48  0.75  0.71  | 189   0.00  0.39  0.49
  17  -0.11   NA   0.24  0.29  | -0.13   0.58  0.99  0.96  | 121   0.00  0.52  0.63
  18  -0.15   NA   0.30  0.31  | -0.09   0.73  0.82  0.86  | 56    0.00  0.69  0.73
  19  -0.16   NA   0.28  0.32  | -0.05   0.85  0.74  0.75  | 22    0.00  0.78  0.76
  20  -0.25   NA   0.44  0.39  | -0.04   0.88  0.70  0.69  | 16    0.00  0.83  0.78
  21  -0.21   NA   0.37  0.37  | -0.04   0.89  0.71  0.70  | 17    0.00  0.83  0.76
  22  -0.26   NA   0.45  0.38  | -0.02   0.94  0.54  0.55  | 8     0.00  0.89  0.80
  23  -0.16   NA   0.31  0.32  |  0.02   0.92  0.37  0.38  | 0.7   0.00  0.97  0.80
  24  -0.09   NA   0.19  0.28  |  0.08   0.79  0.28  0.28  | 18    0.36  0.84  0.78

Table 10: The in-sample $\hat{R}^2$ presented in figure 2 and related to the Fisher's $\hat{F}$ shown in table 9.

  H   R^2     H   R^2     H   R^2     H   R^2
  1   0.05%   7   1.0%    13  16.2%   19  0.65%
  2   0.05%   8   2.8%    14  13.2%   20  0.46%
  3   0.05%   9   4.5%    15   9.0%   21  0.51%
  4   0.22%   10  6.8%    16   5.1%   22  0.25%
  5   0.51%   11  10.4%   17   3.4%   23  0.02%
  6   0.44%   12  16.1%   18   1.6%   24  0.06%

Table 11: The test based on the $D_o$ statistic also gives strong evidence against $H_0$: no relation between inputs and outputs. The empirical version used to estimate $D_o$ does not suffer from a bias like the empirical version of $R_o^2$.

  H   D_o      p-value   H   D_o      p-value
  1   -0.23    0.95      13   4.57    0.01
  2   -5.10    0.99      14   4.54    0.03
  3   -1.86    0.84      15  -0.70    0.10
  4   -3.19    0.83      16  -3.43    0.14
  5   -2.97    0.62      17  -8.04    0.27
  6   -3.37    0.52      18  -11.02   0.31
  7   -6.31    0.70      19  -12.15   0.30
  8   -8.79    0.69      20  -20.01   0.44
  9   -8.51    0.61      21  -17.17   0.37
  10  -9.65    0.57      22  -20.67   0.45
  11  -7.63    0.39      23  -13.26   0.33
  12  -3.48    0.17      24  -7.41    0.22

This is the regression used by Fama and French (1988) to test the mean reversion hypothesis of a supposed stationary component of stock prices. We saw earlier that this amounts to $\beta^2/\sigma^2$ being equal to the ratio shown in (14). If we let the $Y_t$'s (given $x_{1-h}^{T-h}$) have the correlation structure assumed in section 3, namely $Cov[Y_t, Y_{t+k} | x_{1-h}^{T-h}] = 0$ for $|k| \geq h$, we have

$$E[Var[\bar{Y}_{t-h} \,|\, X_{1-h}^{T-h}]] = \frac{\sigma^2}{(t-h)^2 h} \left[ h(t-h) + 2 \sum_{s=1}^{h-1} (h-s)(t-h-s) \right] = \frac{\sigma^2}{(t-h)^2 h} \left[ h(t-h) + 2 \sum_{s=1}^{h-1} s(t-2h+s) \right]$$
$$= \frac{\sigma^2}{(t-h)^2 h} \left[ h(t-h) + h(h-1)(t-2h) + \frac{h(h-1)(2h-1)}{3} \right] = \frac{\sigma^2}{(t-h)^2} \left[ h(t-2h+1) + \frac{(h-1)(2h-1)}{3} \right] \quad (21)$$

and $E[Var[\bar{Y}_{t-h} + \hat{\beta}_{t-h}(X_{t-h} - \bar{X}_{t-2h}) \,|\, X_{1-h}^{T-h}]] = \sigma^2 E[c'Vc]$, where $V$ is a $(t-h) \times (t-h)$ matrix with $V_{ij} = \frac{(h - |i-j|)^+}{h}$, and $c$ is a $(t-h) \times 1$ vector with

$$c_i = \frac{1}{t-h} + \frac{(X_{t-h} - \bar{X}_{t-2h})(X_{i-h} - \bar{X}_{t-2h})}{\sum_{j=1}^{t-h} (X_{j-h} - \bar{X}_{t-2h})^2}, \quad i = 1, \ldots, t-h.$$

If we let $L$ be a $(t-1) \times (t-h)$ matrix with $L_{ij} = I[0 \leq i - j < h]/\sqrt{h}$, then we may write $c'Vc$ as $W'W$ where $W = Lc$. This representation is useful if we need to compute $Var[F^{lin}(Z_1^{t-h})(X_{t-h}) \,|\, X_{1-h}^{T-h}] = c'Vc$ for various values of $t$, as recursive relations may be worked out in $W$.

Due to the location-scale invariance of the $\hat{R}_o^2$ mentioned earlier, $\sigma^2$ and $\alpha$ may be chosen as one pleases (1 and 0, say). The expectations then depend, obviously, on the process generating the $X_t$'s. The simplest thing to do is to assume that $X_{1-h}^{T-h} \sim \delta_{x_{1-h}^{T-h}}$, that is, $X_{1-h}^{T-h}$ can only take the value $x_{1-h}^{T-h}$ observed. This makes the expectation easy to work out. Otherwise, these expectations can be worked out via simulations. Once $X_{1-h}^{T-h}$'s process, $\alpha$, $\beta$ and $\sigma^2$ have been chosen, we generate $Z_1^T = (X_{1-h}^{T-h}, Y_1^T)$ as follows (a code sketch is given at the end of this section).

1. Generate $X_{1-h}^{T-h}$.

2. Generate $\epsilon_1, \ldots, \epsilon_T$ so that the $\epsilon_t$'s are independent of $X_{1-h}^{T-h}$, with $Var[\epsilon_t] = \sigma^2$ and the covariance structure assumed above ($Cov[\epsilon_t, \epsilon_{t+k}] = 0$ for $|k| \geq h$). This may be done by generating independent variates with variance equal to $\sigma^2/h$ and taking their moving sums (with window size $h$).

3. Put $Y_t = \alpha + \beta X_{t-h} + \epsilon_t$.

The bootstrap test of $H_0: R_o^2 = 0$ could be performed by generating $B$ samples in the way explained above, yielding $B$ bootstrap values of $\hat{R}_o^2$. These would be used to compute either a pure bootstrap p-value or a normalized bootstrap p-value. Needless to say, generating data under $H_{01}: R_o^2 = 0$ is more tedious than generating data under $H_{02}$: no relationship between inputs and outputs. Furthermore, the above approach relies heavily on the distributional assumptions of linearity and the given form of covariance, and we would like to devise a procedure that can be extended to non-linear relationships, for example. To get the distribution of $\hat{R}_o^2$ under $H_{01}$, we propose to consider an approximation saying that the distribution of $\hat{R}_o^2 - R_o^2$ is the same under $H_{01}$ and $H_{02}$. We will call this hypothesis the shifted distribution hypothesis (note that this hypothesis can only be approximately true, because for extreme values of $\hat{R}_o^2$ near 1 it cannot be true). This means that we are assuming that the distribution of $\hat{R}_o^2$ under $R_o^2 = 0$ has the same shape as its distribution under $\beta = 0$, but shifted to the right (since it corresponds to a positive value of $\beta$). If that were the case, generating $\hat{R}_o^2$ under $H_{01}$ would be the same as simulating $\hat{R}_o^2 - R_o^2$ under $H_{02}$, which we have done previously without subtracting off $R_o^2$. This $R_o^2$ can be obtained either analytically or estimated from the bootstrap as

$$1 - \frac{\sum_{b=1}^{B} C_T(F^{lin}, Z_1^T(b))}{\sum_{b=1}^{B} C_T(F^{naive}, Z_1^T(b))}.$$

Note, to make the notation clear, that the bootstrap $\hat{R}_o^2$'s are simply $1 - \frac{C_T(F^{lin}, Z_1^T(b))}{C_T(F^{naive}, Z_1^T(b))}$, $b = 1, \ldots, B$. From these $\hat{R}_o^2 - R_o^2$'s, we obtain the bootstrap p-values and the normalized bootstrap p-values as usual. Note that the bootstrap p-values for $H_{01}$ and $H_{02}$ are the proportions of the $\hat{R}_o^2$'s (generated under $H_{02}$) that are greater than $\hat{R}_o^2(\text{observed}) + R_o^2$ and $\hat{R}_o^2(\text{observed})$, respectively. Since $R_o^2 < 0$ under $H_{02}$, we see that p-value($H_{02}$) $\leq$ p-value($H_{01}$).
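A sketch of the generation steps 1-3 (Python/numpy; the function names and the choice of Gaussian variates in step 2 are our own assumptions, with sigma = 1 and alpha = 0 as suggested by the location-scale invariance):

    import numpy as np
    rng = np.random.default_rng(7)

    def h_correlated_noise(T, h, sigma=1.0):
        # Step 2: Var[eps_t] = sigma^2 and Cov[eps_t, eps_{t+k}] = 0 for |k| >= h,
        # via moving sums (window h) of independent N(0, sigma^2/h) variates.
        e = rng.normal(scale=sigma / np.sqrt(h), size=T + h - 1)
        csum = np.concatenate(([0.0], np.cumsum(e)))
        return csum[h:] - csum[:-h]

    def generate_under_H01(x_al, alpha, beta, h):
        # Steps 1 and 3: x_al holds the conditioning x_{t-h}'s (step 1, here
        # kept fixed at their observed values, i.e. the delta assumption);
        # then Y_t = alpha + beta * x_{t-h} + eps_t.
        eps = h_correlated_noise(len(x_al), h, sigma=1.0)
        return alpha + beta * x_al + eps

    # Each of the B samples generated this way yields one bootstrap value of
    # R2_out, from which the pure and normalized p-values of the shifted
    # distribution test are computed as usual.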

and E[V ar[ȳt h ˆβ t h (X t h X t 2h ) X T h h1 ]] = σ2 E[c V c], where V is a (t h) (t h) matrix with V ij = (h i j ), and c is a (t h) 1 vectr with h c i = 1 t h (X t h X t 2h )(X i h X t 2h ) t h j=1 (X j h X, i = 1,..., t h. t 2h ) 2 If we let L be a (T 1) (T h) with L ij = I[0 i j < h]/ h, then we may write c V c as W W where W = Lc. This representatin is useful if we need t cmpute V ar[f lin (Z1 t h )(X t h ) X T h h1 ] = c V c fr varius values f t as recursive relatins may be wrked ut in W. Due the lcatin-scale invariance f the ˆR 2 mentined earlier, σ 2 and α may be chsen as ne pleases (1 and 0, say). The expectatins then depend bviusly n the prcess generating the X t s. The simplest thing t d is t assume that X T h 1 h δ x, that is X T h T h 1 h can nly take the value 1 h bserved. This makes the expectatin easy t wrk ut. Otherwise, these expectatins can be wrked ut via simulatins. Once X T h 1 h s prcess, α, β, σ2 have been chsen, we generate Z1 T = (X T h 1 h, Y 1 T ) as fllws. 1. Generate X T h 1 h. 2. Generate ɛ 1,..., ɛ T s that the ɛ t s are independent f X T h 1 h with V ar[ɛ t] = σ 2 and the cvariance structure shwn in (??). This may be dne by generating independent variates with variance equal t σ2 and take their mving sums (with windw size f h). h 3. Put Y t = α βx t h ɛ t. The btstrap test f H 0 : R 2 = 0 culd be perfrmed by generating B samples in the way explained abve, yielding B btstrap values f ˆR. 2 These wuld be used t cmpute either a pure btstrap p-value r a nrmalized btstrap p-value. Needless t say that generating data under H 01 : R 2 = 0 is mre tedius than generating data under H 02 : n relatinship between inputs and utputs. Furthermre the abve apprach relies heavily n the distributinal assumptins f linearity and the given frm f cvariance, and we wuld like t devise a prcedure that can be extended t nn-linear relatinships, fr example. T get the distributin f ˆR 2 under H 01, we prpse t cnsider an apprximatin saying that the distributin f ˆR2 R 2 is the same under H 01 and H 02. We will call this hypthesis the shifted distributin hypthesis (nte that this hypthesis can nly be apprximately true because fr extreme values f ˆR2 near 1, it cannt be true). This means that we are assuming that the distributin f ˆR 2 under R 2 = 0 has the same shape as its distributin under β = 0 but is shifted t the right (since it crrespnds t a psitive value f β). If that was the case, generating ˆR 2 0 under H 01 wuld be the same as simulating ˆR 2 R 2 under H 02, which we have dne previusly withut subtracting ff R. 2 This R 2 can be btained either analytically r estimated frm the btstrap as B b=1 1 C T (F lin, Z1 T (b)) B b=1 C T (F naive, Z1 T (b)). Nte, t make the ntatin clear, that the btstrap ˆR s 2 are simply 1 C T (F lin,z1 T (b)) C T (F naive,z1 T (b)), b = 1,..., B. Frm these ˆR 2 R s, 2 we btain the btstrap p-values and the nrmalized btstrap p-values as usual. Nte that the btstrap p-values fr H 01 and H 02 are the prprtin f the ˆR s 2 (generated under H 02 ) that are greater than ˆR (bserved) 2 R 2 and ˆR (bserved) 2 respectively. Since R 2 < 0 under H 02, we see that p-value(h 02 ) p-value(h 01 ). 21