
A Reference Guide for Statistical Tests Useful in Finance (Preliminary)

Market Pattern Research, Inc.

Modified from the book The Strategic Analysis of Financial Markets, Volume 1: Framework

Steven D. Moffitt, Ph.D.

April 14, 2018

Contents

1 A Reference Guide to Statistical Tests for Financial Analysis
2 A Note on Nonparametric Tests
3 Statistical Tests for Gaussianity
  3.1 The Anderson-Darling Test of Normality (AD Test)
  3.2 The Cramer-von Mises Test (CvM) of Normality
  3.3 The Shapiro-Francia Test of Normality
  3.4 Pearson's Chi-square Goodness-of-Fit Test of Normality
  3.5 The Lilliefors Test of Normality
  3.6 The Jarque-Bera Test of Normality
  3.7 The D'Agostino Test of Normality
  3.8 The Shapiro-Wilk Test of Normality
4 Testing for Randomness
  4.1 Nonparametric Tests for Randomness: Runs Tests
    4.1.1 The Wald-Wolfowitz Runs Test
    4.1.2 The Up and Down Test
    4.1.3 The Bartels Rank Test
  4.2 Nonparametric Tests for Randomness: Trend Tests
    4.2.1 The Cox-Stuart Test
    4.2.2 The Difference Sign Test
    4.2.3 The Mann Test for Trend
    4.2.4 The Dietz-Killeen Multivariate Test for Trend
5 Nonparametric Tests for Comparing Distributions
  5.1 A Comparison of Distributions Test: The two-sample test of Kolmogorov-Smirnov (KS test)
  5.2 Subset Location Tests: Wilcoxon Rank-Sum (k = 2) and Kruskal-Wallis (k >= 2) Location Tests
  5.3 The Ansari-Bradley Test of Equality of Variances
  5.4 The Fligner-Killeen Test of Scale
  5.5 The Mood Test of Scale
6 Time Series Tests
  6.1 General Tests
  6.2 Parametric Tests for Random Walks
    6.2.1 A Unit Root Test: The Augmented Dickey-Fuller Test (ADF-Test)
    6.2.2 A Unit Root Test: The Phillips-Perron Test (PP-Test)
  6.3 Variance-Ratio Tests
    6.3.1 The Lo-MacKinlay Test
    6.3.2 The Chen-Deo Test
    6.3.3 The Chow-Denning Test
    6.3.4 The Wald Test
    6.3.5 The Wright and Joint Wright Tests

This article was abstracted from material presented in the two-volume series The Strategic Analysis of Financial Markets by Steven D. Moffitt, Ph.D., published by World Scientific in May 2017. Those books analyze the strategies used by investors and traders, deconstruct the associated market games in ways that uncover a commonality of structure, and apply statistics, psychology and gambling logic to identify winning (and losing) strategies.

1 A Reference Guide to Statistical Tests for Financial Analysis

Though seldom acknowledged, the statistical analysis of price series is a problematic part of empirical finance. The reason is that prices arise from unknown processes that, on empirical grounds alone, are unlikely to be stationary, since they exhibit outliers and volatility intermittency. Yet the vast majority of statistical methods for time series require stationarity. Clearly, traditional statistics should be applied to price series with caution.

This paper presents methods that detect exploitable stochastic price behavior such as non-randomness, trends, the propensity to produce outliers, etc., behavior more akin to data mining and pattern recognition than to traditional statistical estimation and hypothesis testing.

The first Section presents tests of the null hypothesis that a distribution is univariate or multivariate Gaussian. Applying these tests to price series almost always results in rejection of Gaussianity, but the nature of that rejection can reveal things about statistical behavior and is useful for component decomposition methods like independent component analysis. The second Section presents tests of the Hypothesis of Randomness for time series, which includes nonparametric tests for runs and for monotonic (not necessarily linear) trends. The third Section presents nonparametric methods for comparing distributions, with the notable exception of bootstrap methods. In the fourth Section, various time series tests are presented: (1) for i.i.d. data, (2) for martingales and martingale difference series, (3) for serial correlation and (4) for random walks.

The presentation is semi-formal; each test has its null and alternative hypotheses stated formally, accompanied by an informal discussion of the test's strengths and weaknesses. References are included for those who want additional information or proofs. A number of tests were omitted, mostly because their statement required material deemed too specialized. Four major instances of such omission are (1) Monte Carlo tests, (2) tests involving wavelets, (3) spectral tests for independence or martingales, and (4) random matrix methods for covariance matrices. Of these omissions, (1) is the most serious, followed closely by (2).

2 A Note on Nonparametric Tests

Nonparametric or distribution-free statistical tests are ones that are valid for large families of distributions, e.g. for all continuous distributions (ones with a density function). Many such tests use the signs of observations or their ranks in a sample. Without going into the theory of these tests, we point out some obvious properties of tests involving signs or ranks.

For ranks, the data are replaced by their order in a sample. For example, the ranks for sample data $x_1 = 5$, $x_2 = 1$, $x_3 = 2$, and $x_4 = 10$ are $R(x_1) = 3$, $R(x_2) = 1$, $R(x_3) = 2$ and $R(x_4) = 4$. If a test is based on ranks, then it is invariant with respect to all strictly increasing transformations of the underlying data. Thus if one wants to test for a monotone trend in a time series, for example, rank tests are particularly appropriate.

Here is an example of a nonparametric test using signs. The statistic is the number of consecutive $+1$'s in the differences $\mathrm{sgn}(X_{t+1} - X_t)$, where $\mathrm{sgn}(x) = 1$ if $x > 0$, $0$ if $x = 0$ and $-1$ if $x < 0$ (the Up and Down Test, Section 4.1.2). This test is also invariant with respect to all strictly increasing transformations of the underlying data. As another example, $\mathrm{sgn}(x_i - \mathrm{med}(x))$ (where $\mathrm{med}(x)$ is the median of the sample) is invariant with respect to strictly increasing transformations. On the other hand, $\mathrm{sgn}(x_i - \mathrm{mean}(x))$ is not.

We also note an extremely important characteristic of sign and rank tests: they are robust against outliers. Thus if a few large returns in a long series are doubled, tests based on ranks will generally change little.

We prefer nonparametric tests because they might perform better on nonstationary data and because small sample sizes are typically not a problem. Parametric tests, on the other hand, have two things that can go wrong: (1) the chosen parametric family doesn't fit the data, e.g. the data aren't Gaussian but Gaussian models are assumed, and (2) the observations or their transformations don't form a random sample. Nonparametric tests suffer only from the second problem, which is the one of primary interest in trading systems development.
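As a quick illustration of these invariance properties, here is a minimal R sketch (R is used since R implementations of several of the tests below are discussed later in this guide); the data vector is hypothetical:

    x <- c(5, 1, 2, 10)
    rank(x)               # 3 1 2 4: the data are replaced by their order
    rank(exp(x))          # identical: ranks are invariant under increasing transforms
    sign(diff(x))         # signs of successive differences: +1 up, -1 down
    sign(x - median(x))   # also invariant under increasing transforms
    sign(x - mean(x))     # NOT invariant: the mean is not rank-based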

3 Statistical Tests for Gaussianity

Gaussian distributions occupy a central place in financial theory for several reasons. They are limiting distributions for random samples of distributions with finite variances, they are minimum information distributions among those that have finite variances, their affine transformations are again Gaussian, and they are infinitely divisible, meaning that for any Gaussian random variable $X$ and any positive integer $n$, there is a (Gaussian) distribution $Y$ whose $n$-fold convolution has the same distribution as $X$. Certainly, financial time series would be much simpler if all distributions were Gaussian, but the reality is far different. In general, tests based on Gaussian distributions should be avoided in trading system analysis, a notable exception being independent component analysis.

Eight tests of the null hypothesis that a sample or time series is distributed as i.i.d. Gaussians are presented in this Section. They are

N.1: The Anderson-Darling Test of Normality (AD Test)
N.2: The Cramer-von Mises Test (CvM) of Normality
N.3: The Shapiro-Francia Test of Normality
N.4: The Pearson Chi-square Goodness-of-Fit Test of Normality
N.5: The Lilliefors Test of Normality
N.6: The Jarque-Bera Test of Normality
N.7: The D'Agostino Test of Normality
N.8: The Shapiro-Wilk Test of Normality

3.1 The Anderson-Darling Test of Normality (AD Test)

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(x; \mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

Let $X_1, X_2, \ldots, X_n$ be a random sample from $F$ and let $\hat F$ be its empirical distribution function

$$\hat F(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$

where $I(c) = 1$ if $c$ is true and $I(c) = 0$ if $c$ is false. Then the Anderson-Darling test statistic is

$$n \int_{-\infty}^{\infty} \frac{(\hat F(x) - \Phi(x; \mu, \sigma^2))^2}{[\Phi(x; \mu, \sigma^2)][1 - \Phi(x; \mu, \sigma^2)]} \, d\Phi(x; \mu, \sigma^2). \quad (1)$$

A formula can be developed from (1), as follows. Assuming that the mean and variance of the $X_i$ are unknown, standardize the $X_i$'s, $Z_i = (X_i - \bar X)/S_X$, where $\bar X$ is the sample mean and $S_X$ is the sample standard deviation, and order the $Z_i$'s from lowest to highest, calling the results $Y_i$:

$$Y_1 < Y_2 < \cdots < Y_n.$$

Define $A^2$ as

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[ (2i - 1)\ln(\Phi(Y_i)) + (2(n - i) + 1)\ln(1 - \Phi(Y_i)) \right],$$

and let

$$A^{2*} = A^2 \left( 1 + 4/n - 25/n^2 \right) \quad (2)$$

be a correction to $A^2$ for small samples (see Shorack & Wellner [Shorack and Wellner, 2009]). The theoretical asymptotic distribution of $A^{2*}$ (that is, the distribution as $n \to \infty$) has been tabulated, and the null hypothesis of normality will be rejected if $A^{2*}$ exceeds the tabulated value.

Note that the test statistic (1) depends only on the empirical distribution function (EDF) (and, implicitly, the normal distribution), as do several other test criteria; tests of this kind are therefore nonparametric. Other tests in the EDF family are the Cramer-von Mises test and the Lilliefors test.

The rationale for (1) is as follows. If $U_i$ is a random sample from $N(0, 1)$, then $\Phi(U_i)$ will be a random sample from the uniform distribution on $(0, 1)$. Thus by standardizing the $X_i$ to $Y_i$, the distribution of $\Phi(Y_i)$ will asymptotically also be uniform under the null hypothesis. One can then perform a theoretical calculation of the distribution of (1) assuming that $\Phi$ is replaced by the cumulative distribution of a uniform distribution and $\hat F$ is the empirical cumulative distribution for a uniform distribution.

Simplified Explanation: Note that test statistic (1) will be large if there are too many outliers relative to a normal distribution, since the denominator $[\Phi(x)][1 - \Phi(x)]$ approaches zero quickly beyond $x = \pm 3$. Thus the AD test is an excellent choice if there are outliers in the data, as there usually are in financial returns series. But its distribution is asymptotic, that is, it holds exactly only in the limit as $n \to \infty$.

Reference: Anderson & Darling [Anderson and Darling, 1952, Anderson and Darling, 1954], Shorack & Wellner [Shorack and Wellner, 2009] and Stephens [Stephens, 1974].
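For illustration, a minimal R sketch of the AD test, assuming the CRAN package nortest (the package is not cited by the author; the heavy-tailed sample is hypothetical):

    # install.packages("nortest")   # if not already installed
    library(nortest)
    set.seed(1)
    x <- rt(500, df = 3)   # heavy-tailed sample, so normality should be rejected
    ad.test(x)             # Anderson-Darling test; a small p-value rejects H0: normality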

3.2 The Cramer-von Mises Test (CvM) of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

Let $X_1, X_2, \ldots, X_n$ be a random sample from $F$ and let $\hat F$ be its empirical distribution function

$$\hat F(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$

where $I(c) = 1$ if $c$ is true and $I(c) = 0$ if $c$ is false. Then the Cramer-von Mises test statistic is

$$n \int_{-\infty}^{\infty} (\hat F(x) - \Phi(x; \mu, \sigma^2))^2 \, d\Phi(x; \mu, \sigma^2). \quad (3)$$

A formula can be developed from (3), as follows, assuming that the mean and variance of the $X_i$ are unknown. Standardize the $X_i$'s, $Z_i = (X_i - \bar X)/s_X$, where $\bar X$ is the sample mean and $s_X$ is the sample standard deviation, and order the $Z_i$'s from least to greatest, giving $Z_{(1)} < \cdots < Z_{(n)}$. Define $W^2$ as

$$W^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[ \frac{2i - 1}{2n} - \Phi(Z_{(i)}) \right]^2. \quad (4)$$

The theoretical distribution of $W^2$ has been tabulated, and the null hypothesis of normality will be rejected if $W^2$ exceeds the tabulated value.

Note that the test statistic (3) depends only on the empirical distribution function (EDF) (and, implicitly, the normal distribution), as do the Anderson-Darling test and the Lilliefors test.

The rationale for (3) is as follows. If $U_i$ is a random sample from $N(0, 1)$, then $\Phi(U_i)$ will be a random sample from the uniform distribution on $(0, 1)$. Thus by standardizing and ordering the $X_i$, the distribution of $\Phi(Z_{(i)})$ will asymptotically also be uniform under the null hypothesis. One can then perform a theoretical calculation of the distribution of (3) assuming that $\Phi$ is replaced by the cumulative distribution of a uniform distribution and $\hat F$ is the empirical cumulative distribution for a uniform distribution.

Simplified Explanation: The test statistic (3) will be large if differences from a normal in the center or in the extremes receive enough weight. Relative to the AD test, the CvM is an excellent choice for testing differences in the center of the distribution.

Reference: Darling [Darling, 1957].
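A one-line R sketch, using the same assumed nortest package as above (the contaminated sample is hypothetical):

    library(nortest)
    set.seed(2)
    x <- c(rnorm(450), rnorm(50, mean = 0.5))   # mixture shifted in the center
    cvm.test(x)                                 # Cramer-von Mises test of normality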

3.3 The Shapiro-Francia Test of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

The Shapiro-Francia test statistic for a random sample $\{X_i\}_1^n$ uses the order statistics $\{X_{(i)}\}_1^n$ to form a test for normality. Order statistics for a sample have the same values as $\{X_i\}_1^n$, but ordered from lowest to highest. (The ordering of order statistics is unique only for continuous distributions, since the probability of ties is zero.) Thus $X_{(1)}$ is the smallest value in $\{X_i\}_1^n$, $X_{(2)}$ is the next smallest, $X_{(n)}$ being the largest.

With this notation, the Shapiro-Francia test statistic is

$$W' = \frac{\left( \sum_{i=1}^{n} b_i X_{(i)} \right)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2} \quad (5)$$

where $\bar X$ is the sample mean, with

$$(b_1, b_2, \ldots, b_n) = \frac{m'}{(m' m)^{1/2}}, \qquad m = \left( E[X_{(1)}], E[X_{(2)}], \ldots, E[X_{(n)}] \right)'.$$

Since a closed form of the $W'$-test's distribution is unknown, the percentiles have been estimated using Monte Carlo methods.

Simplified Explanation: The Shapiro-Francia test compares the order statistics from a random sample to those expected in a normal distribution having the same mean as the sample distribution, but with an assumption of independence. The test is almost the same as the Shapiro-Wilk W test; it differs only in the use of order statistic weights $b_i$ instead of $a_i$ (see the Shapiro-Wilk description in this Section). Compared with the Shapiro-Wilk, it is more sensitive to alternatives that are continuous and symmetric with high kurtosis, to ones that are near normal, and to ones that are discrete and skewed. But it is less sensitive than $W$ on alternatives that are continuous and skewed with high kurtosis, and on ones that are discrete and symmetric. Since market returns in general are right-skewed and have high kurtosis, the Shapiro-Wilk should be preferred over the Shapiro-Francia on this data.

Reference: Shapiro & Francia (1972) [Shapiro and Francia, 1972] and Sarkadi (1975) [Sarkadi, 1975].
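A minimal R sketch via the assumed nortest package (the skewed sample is hypothetical; the implementation accepts sample sizes of roughly 5 to 5000):

    library(nortest)
    set.seed(3)
    x <- rlnorm(1000)   # right-skewed sample
    sf.test(x)          # Shapiro-Francia test of normality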

3.4 Pearson's Chi-square Goodness-of-Fit Test of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

The Pearson $\chi^2$ test is a mainstay of beginning statistics; for a sample that is binned into $K$ classes, it has the form

$$\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \quad (6)$$
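A minimal R sketch via the assumed nortest package (the sample is hypothetical; see ?pearson.test for the binning and degrees-of-freedom options):

    library(nortest)
    set.seed(4)
    x <- rnorm(300, mean = 1, sd = 2)
    pearson.test(x)   # bins the data and applies statistic (6)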

where the $i$th term of the sum applies to all observations in the $i$th bin, $O_i$ is the observed count in bin $i$ and $E_i$ is the expected count (generally non-integer) for bin $i$. Under conditions of the null hypothesis, the statistic will have an asymptotic $\chi^2$ distribution with $K - 2$ degrees of freedom.

Simplified Explanation: The $\chi^2$ test assumes that data from a sample are put into discrete bins, each of which has a positive probability of occurrence. Using statistic (6), it compares the observed counts in the bins with those expected. The resulting test has $K - p$ degrees of freedom, where $p$ is the number of parameters that must be estimated to calculate expected values. For the normal, $p = 2$, since a mean and variance must be calculated to determine bin probabilities.

Reference: Spiegel (2000) [Spiegel et al., 2000].

3.5 The Lilliefors Test of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

For a random sample $\{X_i\}_1^n$ having sample mean $\bar X$ and sample standard deviation $s_X$, the Lilliefors test uses the statistic

$$L = \sup_x \left| \hat F(x) - \Phi(x; \bar X, s_X^2) \right|, \quad (7)$$

where $\hat F$ is the empirical distribution of the random sample and $\Phi(x; \mu, \sigma^2)$ is the cumulative distribution of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. This statistic is the same as the Kolmogorov-Smirnov one-sample test (KS test) except that the mean and variance are estimated in the L-test but assumed known in the KS test. This parameter estimation changes the distribution of $L$ as compared to KS. (An asymptotic closed form distribution is known for the one-sample KS test.) At this writing, the asymptotic $L$ distribution has been tabulated only using Monte Carlo methods.

Simplified Explanation: The Lilliefors statistic measures the maximum absolute difference between the empirical distribution function and a Gaussian having the same mean and standard deviation. The test will be quite sensitive to deviations from normality, but will be inferior to the Anderson-Darling test in the presence of outliers.

Reference: Lilliefors [Lilliefors, 1967].
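A minimal R sketch via the assumed nortest package (the non-Gaussian sample is hypothetical):

    library(nortest)
    set.seed(5)
    x <- runif(200)   # uniform data, so normality should be rejected
    lillie.test(x)    # Lilliefors: KS distance with estimated mean and sd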

3.6 The Jarque-Bera Test of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

The Jarque-Bera test examines whether a distribution's skewness and kurtosis match those of a Gaussian distribution. For a random sample $\{X_i\}_1^n$ having sample mean $\bar X$,

$$JB = \frac{n}{6} \left( U^2 + \frac{1}{4} (V - 3)^2 \right), \quad (8)$$

where

$$U = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2 \right)^{3/2}}$$

and

$$V = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^4}{\left( \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2 \right)^{2}}.$$

Under the null hypothesis of normality, the population counterparts of $U$ and $V$ are 0 and 3, respectively. For large samples, $n \ge 200$, the distribution of $JB$ is approximately $\chi^2$ with 2 degrees of freedom.

Simplified Explanation: The terms $U$ and $V$ are, respectively, estimates of the sample's skew and kurtosis. For a normal distribution, the kurtosis equals 3, thus the appearance of the term $V - 3$ in $JB$. The weights given $U$ and $V$ in the expression defining $JB$ ensure that it is asymptotically $\chi^2$ with 2 d.f. under the null hypothesis. I would not recommend using this test because it is not powerful against non-normal distributions with skew and kurtosis that match a normal.

Reference: Jarque & Bera [Jarque and Bera, 1987].
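A minimal R sketch, assuming the CRAN package tseries (not cited by the author; the fat-tailed "returns" are hypothetical):

    library(tseries)
    set.seed(6)
    r <- rt(1000, df = 5)    # fat tails, so excess kurtosis
    jarque.bera.test(r)      # approximately chi-squared with 2 df under H0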

3.7 The D'Agostino Test of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

The D'Agostino $K^2$-test statistic for a random sample $\{X_i\}_1^n$ uses Cornish-Fisher expansions to derive an approximate test based on the first four moments of the sample distribution. The test statistic has the form

$$K^2 = Z_1(g_1)^2 + Z_2(g_2)^2 \quad (9)$$

where $Z_1(g_1)^2$ and $Z_2(g_2)^2$ have lengthy algebraic expressions; see the references for definitions. Like the Jarque-Bera, $K^2$ tests for skew and kurtosis that match a normal.

Simplified Explanation: The $K^2$ test uses a Cornish-Fisher expansion to approximate a distribution based on the first four moments of a normal. Like the Jarque-Bera test, it is not recommended because any non-normal distribution that matches the first four moments of a normal will not be rejected.

Reference: D'Agostino [D'Agostino, 1971].

3.8 The Shapiro-Wilk Test of Normality

Let $F$ be an unknown continuous distribution function for which the mean $\mu$ and variance $\sigma^2$ exist, and let $\Phi(\mu, \sigma^2)$ be the cumulative normal distribution with mean $\mu$ and variance $\sigma^2$.

Null Hypothesis: $H_0: F = \Phi(\mu, \sigma^2)$.
Alt. Hypothesis: $H_a: F \ne \Phi(\mu, \sigma^2)$.

The Shapiro-Wilk test statistic for a random sample $\{X_i\}_1^n$ uses the order statistics $\{X_{(i)}\}_1^n$ to form a test for normality. Order statistics for a sample have the same values as $\{X_i\}_1^n$, but ordered from lowest to highest. (The ordering of order statistics is unique only for continuous distributions, since the probability of ties is zero.) Thus $X_{(1)}$ is the smallest value in $\{X_i\}_1^n$, $X_{(2)}$ is the next smallest, $X_{(n)}$ being the largest.

With this notation, the Shapiro-Wilk test statistic is

$$W = \frac{\left( \sum_{i=1}^{n} a_i X_{(i)} \right)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2} \quad (10)$$

where $\bar X$ is the sample mean, with

$$(a_1, a_2, \ldots, a_n) = \frac{m' V^{-1}}{(m' V^{-2} m)^{1/2}}, \qquad m = \left( E[X_{(1)}], E[X_{(2)}], \ldots, E[X_{(n)}] \right)',$$

and

$$V = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{n1} & u_{n2} & \cdots & u_{nn} \end{pmatrix}$$

where $u_{ij} = \mathrm{cov}(X_{(i)}, X_{(j)})$. Since a closed form of the $W$-test's distribution is unknown, the percentiles have been estimated using Monte Carlo methods.

Simplified Explanation: The Shapiro-Wilk test compares the order statistics from a random sample to those expected in a normal distribution having the same mean and variance as the sample distribution. The test may be interpreted as the Pearson correlation coefficient between the ordered observations and the weights $a_i$ used in the numerator, so that the test in effect measures the straightness of the line in a normal Q-Q plot. Among the several tests of normality presented above, the Shapiro-Wilk ranks favorably with the Anderson-Darling test and is generally better than the others.

Reference: Shapiro & Wilk [Shapiro and Wilk, 1965] and Royston [Royston, 1982, Royston, 1995].
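The Shapiro-Wilk test ships with base R; a minimal sketch on a hypothetical heavy-tailed sample (the implementation accepts sample sizes of 3 to 5000):

    set.seed(7)
    x <- rt(500, df = 4)
    shapiro.test(x)   # a small p-value rejects H0: normality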

4 Testing for Randomness

The difference between statistics for random samples and statistics for time series is the assumption in the latter of dependence in the variables. Time series models such as ARMA, ARCH and GARCH produce decompositions that have a deterministic part and a white noise part (assuming that those models fit, of course). Trading system development has the similar objective of transforming time series of stochastically dependent prices into time series of uncorrelated trades. Therefore it is important to have statistical tools that can test for randomness, that is, which test whether a time series of trades behaves like a random sample with positive expected value.

A hypothesis that goes a long way toward testing for random sample behavior is the Hypothesis of Randomness. Let $X_1, X_2, \ldots, X_T, \ldots$ be the variables of a time series and let $\mathcal{P}_T$ be the set of all permutations on integers $1, \ldots, T$. The Hypothesis of Randomness is met if all permutations of variables in the series have the same distribution:

$$H_0: (X_1, X_2, \ldots, X_T) \stackrel{D}{=} (X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(T)}) \quad (11)$$

for any permutation $\sigma \in \mathcal{P}_T$, where the $\stackrel{D}{=}$ notation means "having the same distribution."

In the following, we use the notation $\mathrm{sgn}(x)$ for

$$\mathrm{sgn}(x) = \begin{cases} +1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0. \end{cases}$$

4.1 Nonparametric Tests for Randomness: Runs Tests

In this Section, we present three nonparametric runs tests of the Hypothesis of Randomness for continuous data. Runs in this section are of two types: (1) runs against the median, and (2) ascending/descending runs.

Runs against the median (or another cutoff) are maximal numbers of consecutive observations that are above or below the cutoff. For example, with 4 as the cutoff for the 10 values

1, 2.3, 3, 3.5, 8, 9, 1, 1.5, 8, 2.7, there are 5 runs below and above the cutoff, of lengths 4, 2, 2, 1, 1. We note that a test statistic counting the number of runs above or below the median does not depend on the underlying (continuous) distribution and is therefore distribution-free or nonparametric. If there are too many or too few runs against the median, then the series is clearly not uncorrelated.

Ascending (descending) runs, on the other hand, are maximal stretches of consecutive increasing (decreasing) values. Under the Hypothesis of Randomness, a statistic consisting of the number of increasing runs is distribution-free and has a calculable distribution. Too many or too few ascending runs indicates dependence among variables. In the above data, there were ascending runs (using the signs of successive differences) of lengths 5 and 2 and descending runs of lengths 1 and 1.

4.1.1 The Wald-Wolfowitz Runs Test

$\{X_t\}_1^T$ are continuous random variables with joint density $f(X_1, X_2, \ldots, X_T)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness.
Alt. Hypothesis: $H_a$: Series $\{X_t\}$ has either too many or too few runs (one-sided) or either (two-sided).

The following procedure is used to calculate the test statistic:

1. The median or some other cutoff, $m_x$, of the sample $x_1, x_2, \ldots, x_T$ is calculated.
2. A new sample $y_1, y_2, \ldots, y_T$ is formed, where $y_i = +1$ if $x_i \ge m_x$ and $y_i = -1$ if $x_i < m_x$.
3. The number of runs $u$ in the sequence of $y$'s is counted, where a run is a maximal consecutive number of $+1$'s or $-1$'s.
4. The value $u$ is then compared to a statistical table and a significance level assigned.

It is also possible to conduct runs tests with percentiles other than the median; see the references for details.

Simplified Explanation: The runs test examines a series to see if there are too many or too few consecutive observations above or below the cutoff, i.e., too many or too few runs. If a series is slowly varying and stays above the cutoff for long stretches, or rapidly oscillates above and below the cutoff, then the null hypothesis will be rejected.

Reference: Gibbons & Chakraborti [Gibbons and Chakraborti, 2010], Siegel & Castellan [Siegel and Castellan, 1998].

4.1.2 The Up and Down Test

$\{X_t\}_1^T$ are continuous random variables with joint density $f(X_1, X_2, \ldots, X_T)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness.
Alt. Hypothesis: $H_a$: Series $\{\mathrm{sgn}(X_{t+1} - X_t)\}$ has either too many or too few runs of $+1$'s.

The test statistic is the number of runs of consecutive $+1$'s of the signed differences $Y_t = \mathrm{sgn}(X_{t+1} - X_t)$, $t = 1, 2, \ldots, T - 1$. Under the Hypothesis of Randomness, one can calculate finite sample distributions, as in Gibbons & Chakraborti [Gibbons and Chakraborti, 2010]. Letting $m$ be the number of non-zero $Y_t$, it can be shown that the test statistic is asymptotically normal with mean $(2m - 1)/3$ and variance $(16m - 29)/90$.

Simplified Explanation: The Up and Down Test determines if there are too many or too few consecutive ascending runs in the data. It is not a test for global trend, though: a sawtooth pattern that goes nowhere has no long-term trend, yet the Up and Down Test will reject the Hypothesis of Randomness for it. It is a local rather than a global test of runs for the entire series; in a long series, the Up and Down Test detects local trends.

Reference: Gibbons & Chakraborti [Gibbons and Chakraborti, 2010].
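A minimal R sketch of the runs test, assuming the CRAN package randtests (not cited by the author; the series is hypothetical):

    library(randtests)
    set.seed(8)
    x <- cumsum(rnorm(200))              # a random walk: long runs against the median
    runs.test(x, threshold = median(x))  # Wald-Wolfowitz runs test; expect rejection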

4.1.3 The Bartels Rank Test

$\{X_t\}_1^T$ are continuous random variables with joint density $f(X_1, X_2, \ldots, X_T)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness.
Alt. Hypothesis: $H_a$: The sequential variance of ranks is not equal to the usual variance of ranks.

The test statistic is based on a parametric test originally suggested by John von Neumann, but substitutes ranks for observations. The Bartels statistic $U$ is

$$U = \sum_{t=1}^{T-1} (R_t - R_{t+1})^2 \Big/ \sum_{t=1}^{T} \left( R_t - (T + 1)/2 \right)^2,$$

where $R_t = \mathrm{rank}(X_t)$, $t = 1, \ldots, T$. It is known that the distribution of $(U - 2)/\sigma$, where

$$\sigma^2 = \frac{4(T - 2)(5T^2 - 2T - 9)}{5T(T + 1)(T - 1)^2},$$

is asymptotically normal $N(0, 1)$ under the Hypothesis of Randomness. If, for example, the data are positively serially correlated, then $U$ will be small, and if negatively correlated, large compared to the average random arrangement.

Simplified Explanation: The Bartels test is based on a test originally suggested by von Neumann, in which the ratio of a variance determined by sequential differences is compared to the variance formed from deviations from the mean (ignoring a norming constant). In the von Neumann test, the asymptotic sampling distribution for a normal distribution can be calculated. The Bartels test substitutes ranks for the actual observations, and forms the ratio of the variance calculated from consecutive rank differences $R_t - R_{t+1}$ to that based on deviations from the mean, $R_t - (T + 1)/2$. The reference below shows that the rank version has an asymptotically normal distribution, with mean and variance indicated above. In simulations run by Bartels against serially correlated alternatives, this test was more sensitive than the runs up and down test.

Reference: Bartels [Bartels, 1982].
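A minimal R sketch via the assumed randtests package (the autocorrelated series is hypothetical):

    library(randtests)
    set.seed(9)
    x <- arima.sim(list(ar = 0.6), n = 300)   # positively autocorrelated series
    bartels.rank.test(x)                      # small U indicates positive serial correlation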

4.2 Nonparametric Tests for Randomness: Trend Tests

Another important deviation from randomness is a global trend. In the elementary theory of time series, trends are treated simply and naïvely, as either random walks with drift or trend stationary processes. In either case there is a constant push up or down augmented by noise. But what if one wants to test for trend without restrictive models of these types, as can occur when the drift is not constant or the process is nonlinear? This Section presents several nonparametric tests that can be used to detect such trends.

4.2.1 The Cox-Stuart Test

Let $f(x_1, \ldots, x_T)$ be the unknown density function for $(X_1, \ldots, X_T)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness.
Alt. Hypothesis: $H_a$: There is an excess number of positive (or negative) differences between the second half of the series and the first.

The data are split into two halves, with the middle observation dropped if the sample size is odd, giving equal numbers $m$ in each half. Calling $X_t$ and $Y_t$, $t = 1, 2, \ldots, m$, the respective observations in the two halves, the test statistic is the number of positive values among

$$\mathrm{sgn}(Y_t - X_t), \quad t = 1, \ldots, m.$$

Under the null hypothesis, this statistic (known as a sign test) has a binomial distribution $B(m, 1/2)$, i.e. with $p = 0.5$. When the value significantly exceeds $m/2$, there is evidence of an upward trend, and when it is significantly below, evidence of a downward trend.

Simplified Explanation: The Cox-Stuart Test is a simple but not very powerful test. It compares the two halves of a sample using only the signs of pairwise sequential differences. It is easy to calculate, but not appropriate if one wants to determine the degree of monotonicity in the data.

Reference: Cox & Stuart [D. R. Cox, 1955].
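A minimal R sketch via the assumed randtests package (the trending series is hypothetical):

    library(randtests)
    set.seed(10)
    x <- 0.02 * (1:150) + rnorm(150)   # noisy upward trend
    cox.stuart.test(x)                 # sign test comparing first and second halves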

4.2.2 The Difference Sign Test

Let $f(x_1, \ldots, x_T)$ be the unknown density function for $(X_1, \ldots, X_T)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness, i.e. $(X_1, \ldots, X_T) \stackrel{D}{=} (X_{\sigma(1)}, \ldots, X_{\sigma(T)})$ for all $\sigma \in \mathcal{P}_T$.
Alt. Hypothesis: $H_{aA}$: $P[X_t > X_{t-1}] > 1/2$ for all $t$, or $H_{aD}$: $P[X_t > X_{t-1}] < 1/2$ for all $t$.

This test counts the number of positive differences among sample data and performs a sign test. Under the null hypothesis, the number of positive differences $X_t - X_{t-1}$ will be distributed as a binomial $B(T - 1, 1/2)$.

Simplified Explanation: The test statistic is the number of positive differences, which under the null hypothesis has a binomial distribution $B(T - 1, 1/2)$. To be effective, it needs a large sample.

Reference: Moore & Wallis [Geoffrey H. Moore, 1943].

4.2.3 The Mann Test for Trend

The data have joint density $f(x_1, x_2, \ldots, x_T)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness.
Alt. Hypothesis: $H_a$: There is a monotone trend, i.e. $P[X_{t_2} > X_{t_1}] > 1/2$ for all $t_2 > t_1$, or $P[X_{t_2} > X_{t_1}] < 1/2$ for all $t_2 > t_1$.

The original Mann test statistic is

$$S = \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \mathrm{sgn}(X_s - X_t).$$

Under the null Hypothesis of Randomness, $E[S] = 0$ and the variance is

$$\mathrm{Var}(S) = \Big[ T(T - 1)(2T + 5) - \sum_{p=1}^{g} t_p (t_p - 1)(2t_p + 5) \Big] / 18,$$

where $g$ is the number of groups of tied observations and $t_p$ is the number of ties in group $p$. Asymptotically,

$$Z = \frac{S}{\sqrt{\mathrm{Var}(S)}}$$

has a $N(0, 1)$ distribution. In small samples of $T < 30$, the statistic

$$Z = \begin{cases} \dfrac{S - 1}{\sqrt{\mathrm{Var}(S)}} & S > 0 \\[1ex] 0 & S = 0 \\[1ex] \dfrac{S + 1}{\sqrt{\mathrm{Var}(S)}} & S < 0, \end{cases}$$

which applies a continuity correction, is recommended.

Simplified Explanation: When the Mann test statistic is positive and large, there is evidence of an upward trend; when highly negative, a downward trend. The two previous tests, the Cox-Stuart (4.2.1) and the Difference Sign Test (4.2.2), as well as this one, all test for monotone, not necessarily linear, trends. Of these three tests, the Mann is definitely the best. Moreover, as we see below in the Dietz-Killeen multivariate generalization of the Mann Test, short term correlations can be partially removed, allowing the detection of an underlying trend not due to those correlations.

Reference: Hirsh & Slack [Hirsh and Slack, 1984], Kendall [Kendall, 1975], Mann [Mann, 1945].
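Minimal R sketches for both tests, assuming the CRAN packages randtests and Kendall (neither is cited by the author; the trending series is hypothetical):

    library(randtests)
    library(Kendall)
    set.seed(11)
    x <- 0.01 * (1:200) + rnorm(200)   # noisy upward trend
    difference.sign.test(x)            # counts positive first differences
    MannKendall(x)                     # Mann's S with ties correction; 'sl' is the two-sided p-value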

4.2.4 The Dietz-Killeen Multivariate Test for Trend

The $p$-variate time series $X_t = (X_{t1}, X_{t2}, \ldots, X_{tp})$ has unknown joint density $f(x_1, x_2, \ldots, x_p)$.

Null Hypothesis: $H_0$: The Hypothesis of Randomness.
Alt. Hypothesis: $H_a$: There is a monotone trend in one or more of the $p$ variables.

The test statistic is complicated to describe; the interested reader is referred to the cited references. Under $H_0$, and provided a certain covariance matrix has full rank, the test statistic will be distributed as a $\chi^2(p)$ distribution.

Simplified Explanation: When the test statistic is large, one rejects the hypothesis of no trend in any variable. Directionality is not measured: some variables can have ascending trends, some descending, and rejection will occur anyway. This test is useful in trading systems testing, as follows. One forms a multivariate series by taking every two or three trades and calling that vector an observation. The Dietz-Killeen procedure then tests for monotone trends adjusted for two-fold or three-fold serial correlation. In tests of this method by Hirsh and Slack [Hirsh and Slack, 1984], it proved quite effective if the serial correlation does not exceed 0.6. Thus the Dietz-Killeen test can address a common question about trading systems: is serial correlation the cause of a trend or not?

Reference: Dietz & Killeen [E. Jacquelin Dietz, 1981], [Hirsh et al., 1982], Mann [Mann, 1945].

5 Nonparametric Tests for Comparing Distributions

Because returns distributions have been shown to have systematic departures from normality, it is more appropriate to test i.i.d., white noise and random walk hypotheses using nonparametric or semiparametric methods. Below are several of the more common tests of this type.

5.1 A Comparison of Distributions Test: The two-sample test of Kolmogorov-Smirnov (KS test)

One univariate random sample is drawn from continuous distribution $F_1(x)$, the other from continuous distribution $F_2(x)$.

Null Hypothesis: $H_0$: $F_1(x) = F_2(x)$ for all $x \in \mathbb{R}$.
Alt. Hypothesis: $H_a$: $|F_1(x) - F_2(x)| > 0$ for some $x \in \mathbb{R}$.

The test statistic uses the empirical distribution functions for the two samples. The empirical distribution $\hat F(x)$ for a univariate sample $\{X_i\}_{i=1}^{n}$ is defined as

$$\hat F(x) = \frac{\#\{X_i \text{ in the sample with } X_i \le x\}}{n}. \quad (12)$$

The KS test statistic is just the largest absolute difference between the two empirical distribution functions,

$$KS = \max_{x \in \mathbb{R}} \left| \hat F_1(x) - \hat F_2(x) \right|. \quad (13)$$

The strength of the KS test is that its asymptotic distribution does not depend on the hypothesized common distribution function. We omit discussion of the Cramer-von Mises [Anderson, 1962] and Anderson-Darling [Scholz and Stephens, 1987] two-sample tests, which are similar to the KS test but have power against different alternatives.

Simplified Explanation: This test detects a worst case difference between two distribution functions. When sample sizes are large, it is quite sensitive to small differences in two distributions. Like the Kruskal-Wallis test and other tests that compare distributions, however, it is questionable for time series because the null hypothesis ignores the sample's original order.

Reference: Hollander & Wolfe [Hollander and Wolfe, 1999].
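The two-sample KS test ships with base R; a minimal sketch on hypothetical samples:

    set.seed(12)
    x <- rnorm(300)        # sample 1
    y <- rt(300, df = 3)   # sample 2: heavier tails
    ks.test(x, y)          # two-sample KS; the statistic is (13)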

5.2 Subset Location Tests: Wilcoxon Rank-Sum (k = 2) and Kruskal-Wallis (k >= 2) Location Tests

Assume that there are $k$ groups of approximately the same size, where the $i$th group has indices $m_{i-1} + 1$ to $m_i$, with

$$0 = m_0 < m_1 < \cdots < m_k = T.$$

For each $i, j = 1, 2, \ldots, k$, define

$$A_{ij} = P[X_i > X_j] + \tfrac{1}{2} P[X_i = X_j],$$

and for each $i = 1, 2, \ldots, k$, let $n_i = m_i - m_{i-1}$ be the size of the $i$th group.

Null Hypothesis: $H_0$: For each $i = 1, 2, \ldots, k$, $\sum_{j=1}^{k} \frac{n_j}{T} A_{ij} = 0.5$.
Alt. Hypothesis: $H_a$: For at least one $i = 1, 2, \ldots, k$, $\sum_{j=1}^{k} \frac{n_j}{T} A_{ij} \ne 0.5$.

This rather convoluted statement of the null and alternative hypotheses can be reduced to the statement that the null hypothesis $H_0$ must be invariant under the conversion of returns to ranks. A weaker condition which ensures that this null hypothesis holds is the following shift model (the shift parameters are written $\Delta_i$ here):

Let $A_i$ be the indices of group $i$. For each $j \in A_1$, let $X_j$ have the form $X_j = \mu + e_j$, and for each $i > 1$ and each $j \in A_i$, let $X_j = \mu + \Delta_i + e_j$. Assume that $e_l$ for $l = 1, 2, \ldots, T$ are a random sample from a continuous distribution.
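Both tests ship with base R; a minimal sketch on hypothetical shifted groups (shown here after the full statement of the shift model below):

    set.seed(13)
    x <- rnorm(100); y <- rnorm(100, mean = 0.3)   # group 2 shifted upward
    wilcox.test(x, y)                              # Wilcoxon rank-sum (k = 2)
    kruskal.test(list(x, y, rnorm(100)))           # Kruskal-Wallis (k >= 2)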

Null Hypothesis: $H_0^*$: $\Delta_2 = \Delta_3 = \cdots = \Delta_k = 0$.
Alt. Hypothesis: $H_a^*$: At least one $\Delta_i \ne 0$ (a one-sided test is available for $k = 2$).

The model assumes that each group's distribution is merely shifted from the others; the null hypothesis $H_0^*$ is then true if all shifts are 0. Historically, the Wilcoxon test ($k = 2$) was proposed first and was generalized later to several groups. Thus

1. For $k = 2$, the test is known as the Wilcoxon Rank-Sum Test.
2. For $k > 2$, the test is known as the Kruskal-Wallis Test.

The Wilcoxon and Kruskal-Wallis tests convert the combined samples to ranks and then apply a small- or large-sample ANOVA to the groups of ranks.

Simplified Explanation: The Kruskal-Wallis test is identical to the Wilcoxon Rank-Sum test when $k = 2$. Most users of these tests consider that they compare the medians of the distributions of the $k$ groups, and the null hypothesis will be rejected when there is evidence that not all of these medians are equal. The Kruskal-Wallis test, however, is not powerful against most differences that exist in returns data, for two reasons: (1) it converts the data to ranks, which neglects their magnitudes, and (2) it ignores the time sequence of the data, so that serial dependencies are ignored.

Reference: Vargha & Delaney [Vargha and Delaney, 1998]; Hollander & Wolfe [Hollander and Wolfe, 1999].

NOTE: The following three tests, the Ansari-Bradley, Fligner-Killeen and Mood, all have the same null hypothesis for testing the scale of $k$ groups, but have different test statistics. However, the default R-language implementations of the Ansari-Bradley and Mood tests support $k = 2$ only, while the Fligner-Killeen test supports $k \ge 2$. For simplicity, all null hypotheses are stated for two groups only.

5.3 The Ansari-Bradley Test of Equality of Variances

Let $F$ be an unknown continuous distribution function, let $\{X_{1i}\}_1^{n_1}$ be a random sample with distribution function $F(x_1 - m)$ and let $\{X_{2i}\}_1^{n_2}$ be an independent random sample with distribution function $F((x_2 - m)/s)$, where $m$ is a common unknown location for the two distributions and $s$ is an unknown scale parameter.

Null Hypothesis: $H_0$: $s = 1$.
Alt. Hypothesis: $H_a$: $s \ne 1$, or the distributions are not representable as described in the model.

The test ranks the combined $N = n_1 + n_2$ observations from the $n_1$ of $X_1$ and the $n_2$ of $X_2$ and assigns a score of

$$a_i = \frac{N + 1}{2} - \left| i - \frac{N + 1}{2} \right| \quad (14)$$

to the $i$th observation in the ranking. Thus the first and last have the smallest scores of 1; the next smallest and next largest, scores of 2; and so on. Designating the indices in the ranking corresponding to the $X_1$ sample as $i_1, i_2, \ldots, i_{n_1}$, the test statistic is

$$Z = \sum_{j=1}^{n_1} a_{i_j}. \quad (15)$$

Under the null hypothesis, the distribution of $Z$ is described in [Hollander and Wolfe, 1999]; for small samples an exact formula is available, and for large samples a normal approximation is available. It is advisable to calculate medians for each sample separately and to subtract them from their sample prior to performing the test. Note that this test, unlike the parametric $F$-test, does not require that means or variances of either sample exist. Note also that the R-language implementation of the Ansari-Bradley test, ansari.test, does not median-correct the data; that correction should be performed prior to the test.

Simplified Explanation: The Ansari-Bradley test relies on the fact that under the null distribution, all permutations of the data are equally likely. On the other hand, if one sample has the same location but a different variance, then more of its scores will be near the extremes of the ranks, or more toward the middle of the ranks. In these cases, (15) will be either too small or too large, respectively, compared to a typical case under the null hypothesis. Before applying this test, each sample's median should be subtracted from its observations.

A word should be added contrasting this test with the Fligner-Killeen and Mood tests. The difference among these tests is the method of scoring (scores are the values assigned to the observations and then summed to form the test statistic; in the Ansari-Bradley they are the $a_i$ of formula (14)); in all other formal respects the tests are the same. Among the three scale tests, this one has the smallest weights for large observations, so that unlike the others, it is less sensitive to
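A minimal base-R sketch, median-centering each sample first as the text advises (the samples are hypothetical):

    set.seed(14)
    x <- rnorm(80, sd = 1); y <- rnorm(80, sd = 2)
    x0 <- x - median(x); y0 <- y - median(y)   # ansari.test does not median-correct
    ansari.test(x0, y0)                        # H0: equal scales (s = 1)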

very large observations, but more sensitive to very small observations. As such, it should not perform as well on market data that has outliers. Like the Kruskal-Wallis test and other distributional comparison tests, it is questionable for time series because the null hypothesis ignores the sample's original order.

Reference: Hollander & Wolfe [Hollander and Wolfe, 1999].

5.4 The Fligner-Killeen Test of Scale

Let $F$ be an unknown continuous distribution function, let random sample $X_1$ have distribution $F(x_1 - m)$ and independent random sample $X_2$ have distribution $F((x_2 - m)/s)$, where $m$ is a common unknown location for the distributions and $s$ is an unknown scale parameter.

Null Hypothesis: $H_0$: $s = 1$.
Alt. Hypothesis: $H_a$: $s \ne 1$, or the distributions are not representable as described in the model.

Let $N = n_1 + n_2$ be the total number of observations, where $n_1$ is the number in the $X_1$ sample and $n_2$ the number in the $X_2$ sample. The Fligner-Killeen test assigns scores $a_{ij}$ to observations $x_{ij}$ as follows: (1) observations in each group are centered separately using means or medians, (2) the centered data are combined into one sample and ranked from lowest to highest, yielding ranks $r_{ij}$ (for the handling of ties, see the references below), and (3) the $j$th observation in the $i$th group, $i = 1, 2$, is assigned the score

$$a_{ij} = \Phi^{-1}\left( \frac{1}{2} + \frac{r_{ij}}{2(N + 1)} \right) \quad (16)$$

where $\Phi$ is the standard normal distribution and $\Phi^{-1}$ is its inverse, i.e. it maps probabilities to quantiles. The test statistic is then

$$\chi_1^2 = \frac{n_1 (\bar A_1 - \bar a)^2 + n_2 (\bar A_2 - \bar a)^2}{V^2}, \quad (17)$$

where

$$\bar A_i = \frac{1}{n_i} \sum_j a_{ij}, \qquad \bar a = (n_1 \bar A_1 + n_2 \bar A_2)/N, \qquad V^2 = \frac{1}{N - 1} \sum_{i=1}^{2} \sum_j (a_{ij} - \bar a)^2.$$

As the notation indicates, $\chi_1^2$ is asymptotically distributed as a $\chi^2$ distribution with 1 degree of freedom.

Simplified Explanation: The Fligner-Killeen test relies on the fact that under the null distribution, all permutations of the data are equally likely. If one sample has the same location but a different variance, then more of its scores will be near the extremes of the ranks, or more toward the middle of the ranks. In either case the $\chi_1^2$ statistic will be too large compared to typical values under the null hypothesis. By default, the R-language implementation centers the data with medians.

A word should be added contrasting this test with the Ansari-Bradley and Mood tests. The difference among these tests is the method of scoring; in all other formal respects they are the same. This test has the relatively largest weights for extreme observations, so among the three it should be the most sensitive to extreme observations; as such, it should perform better than the others on a sample with more outliers. Like the Kruskal-Wallis test and other distributional comparison tests, however, it is questionable for time series because the null hypothesis ignores the sample's original order.

Reference: Conover, Johnson & Johnson [Conover et al., 1981].
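A minimal base-R sketch (the groups are hypothetical; unlike the other two scale tests, this implementation supports more than two groups):

    set.seed(15)
    x <- c(rnorm(80, sd = 1), rnorm(80, sd = 2))   # combined data
    g <- factor(rep(1:2, each = 80))               # group labels
    fligner.test(x, g)                             # median-centers by default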

5.5 The Mood Test of Scale

Let $F$ be an unknown continuous distribution function, let random sample $X_1$ have distribution $F(x_1 - m)$ and independent random sample $X_2$ have distribution $F((x_2 - m)/s)$, where $m$ is a common unknown location for the distributions and $s$ is an unknown scale parameter.

Null Hypothesis: $H_0$: $s = 1$.
Alt. Hypothesis: $H_a$: $s \ne 1$, or the distributions are not representable as described in the model.

The Mood test has the form (15), where $r_{ij}$ is the rank of the $(ij)$th observation in the combined sample, $i = 1, 2$, $j = 1, 2, \ldots, n_i$, with scores

$$a_{ij} = \left( r_{ij} - \frac{N + 1}{2} \right)^2, \quad (18)$$

the test statistic being the sum of the scores of the first sample, $\sum_{j=1}^{n_1} a_{1j}$. In other words, it differs from the Ansari-Bradley test only in the form of its scores $a_{ij}$. The critical values under the null hypothesis are given by a formula in small samples, and by a normal approximation in large samples.

Simplified Explanation: The test statistic is based on the fact that under the null distribution, all permutations of the data are equally likely. If one sample has the same location but a different variance, then more of its scores will be near the extremes of the ranks, or more toward the middle of the ranks. A word should be added contrasting this test with the Ansari-Bradley and Fligner-Killeen tests. The difference among these tests is the method of scoring; in all other formal respects they are the same. On large observations, this test has larger scores than the Ansari-Bradley, but smaller than the Fligner-Killeen, so it will have intermediate sensitivity to outliers. By default, the R-language implementation does not center the data, so this should be done before analysis. Like the Kruskal-Wallis test and other distributional comparison tests, however,
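A minimal base-R sketch, centering first as the text advises (the samples are hypothetical):

    set.seed(16)
    x <- rnorm(80, sd = 1); y <- rnorm(80, sd = 1.5)
    mood.test(x - median(x), y - median(y))   # H0: equal scales (s = 1)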

it is questionable for time series because the null hypothesis ignores the sample's original order.

Reference: Conover, Johnson & Johnson [Conover et al., 1981].

6 Time Series Tests

6.1 General Tests

(G.1) The Brock-Dechert-Scheinkman (BDS) i.i.d. Test

Arbitrary data $\{X_t\}$ (of infinite extent above and below).

Null Hypothesis: $H_0$: The $X_i$ are i.i.d. from a continuous distribution $F(x)$.
Alt. Hypothesis: $H_a$: The $X_i$ are from a continuous distribution but not i.i.d.

This test is complicated to describe, but the basic idea is that an i.i.d. series $X_1, X_2, \ldots, X_T$ has the property that

$$P[|X_t - X_s| < \epsilon,\ |X_{t-1} - X_{s-1}| < \epsilon] = P[|X_t - X_s| < \epsilon] \, P[|X_{t-1} - X_{s-1}| < \epsilon] = P[|X_t - X_s| < \epsilon]^2 \quad (19)$$

for all $s, t$ and $\epsilon > 0$. Thus lagged pairs $(X_t, X_{t-1})$ must satisfy this relationship, and by extending the above relationship to sequences of length $m$, one can obtain analogous relationships among $m$-length sequences. The test itself uses a statistic which is an estimate of the correlation dimension of the series; for an account of how the BDS test is developed, see references on chaos or the original articles cited below.

The BDS test is one of the few tests for i.i.d. behavior that is robust against nonlinear and chaotic time series. It is valid for the standardized residuals of all ARIMA, ARCH and GARCH-family models, provided the sample size is sufficient ($> 500$).

Simplified Explanation: The BDS test is one of the few i.i.d. tests that is robust against nonlinear and chaotic time series. In general, the BDS test is appropriate for ARIMA and ARCH standardized residuals, but not for small samples from a GARCH model.

Reference: Brock, Dechert & Scheinkman [Brock et al., 1996]; LeBaron [LeBaron, 1997], de Lima [de Lima, 1996].
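A minimal R sketch via the assumed tseries package (the dependent series is hypothetical):

    library(tseries)
    set.seed(17)
    x <- arima.sim(list(ar = 0.5), n = 1000)   # a dependent AR(1) series
    bds.test(x)                                # BDS should reject the i.i.d. null here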

(G.2) The KPSS Test of Stationarity

Model: $X_t = \beta t + Y_t + U_t$, where $X_t$ is the observed series, $\beta$ is a constant, $U_t$ is i.i.d. white noise, $Y_t = Y_{t-1} + V_t$, and the $V_t$ are i.i.d. with $V_t \sim N(0, \sigma_v^2)$.

Null Hypothesis: $H_0^1$: $\sigma_v^2 = 0$ for trend stationarity, and $H_0^2$: $\beta = 0$ for level stationarity.
Alt. Hypothesis: $H_a^1$: $\sigma_v^2 > 0$ (the series has a unit root). $H_a^2$: $\beta \ne 0$ for level stationarity.

Trend stationarity refers to a series with $\beta \ne 0$, and level stationarity to one with $\beta = 0$. The KPSS null hypothesis of stationarity stipulates (1) that there is no random walk term $Y_t = Y_{t-1} + V_t$ embedded in the observed series $X_t$, and (2) that the time series has the same probability generating mechanism at all time periods. When a random walk term is present, the series has a unit root and therefore is not stationary. This test is a one-sided, right-tailed test, so rejection occurs when the statistic is large. The level stationarity null hypothesis is satisfied when the long term trend of the series is zero. In this case, the test is two-sided, being positive when the series is increasing and negative when it is decreasing.

Simplified Explanation: The KPSS test is typically used to detect the existence of an embedded random walk in the series, which guarantees that ARMA models are not appropriate without a data transformation such as differencing. And if a series is not level stationary, then it is certainly not i.i.d. Thus rejection of the stationarity null hypothesis means that the series is not i.i.d. The problem with this test is that there are other deviations from i.i.d. besides an embedded random walk, and this test may not be powerful against such alternatives.

Reference: Kwiatkowski, Phillips and Schmidt [Kwiatkowski et al., 1991].

(G.3) The Terasvirta Neural Network Test For Neglected Nonlinearity

This test is too complicated to describe in detail here. The idea is to build a neural network with hidden processing units and to use these to test whether the data contain nonlinear structure that a linear model would miss. See the Terasvirta reference for details.

Simplified Explanation: The Terasvirta test pits the null hypothesis of i.i.d. data against the alternative that there is nonlinear dependence of $X_t$ on $X_{t-1}, X_{t-2}, \ldots$. If the null hypothesis is rejected, then there is evidence that the series not only fails to be i.i.d., but that the relationship is nonlinear.

Reference: Terasvirta, Lin & Granger [Tersvirta et al., 1993].

(G.4) The White Test for Neglected Nonlinearity