Econometrics II: Statistical Analysis


Prof. Dr. Alois Kneip
Statistische Abteilung, Institut für Finanzmarktökonomie und Statistik, Universität Bonn

Contents:
1. Empirical Distributions, Quantiles and Nonparametric Tests
2. Nonparametric Density Estimation
3. Nonparametric Regression
4. Bootstrap
5. Semiparametric Models

Some literature:

Gibbons, J.D. (1971): Nonparametric Statistical Inference; McGraw-Hill
Bowman, A.W. and Azzalini, A. (1997): Applied Smoothing Techniques for Data Analysis; Clarendon Press
Li, Q. and Racine, J.S. (2007): Nonparametric Econometrics; Princeton University Press
Greene, W.H. (2008): Econometric Analysis; Pearson Education
Silverman, B.W. (1986): Density Estimation for Statistics and Data Analysis; Chapman and Hall
Davison, A.C. and Hinkley, D.V. (2005): Bootstrap Methods and their Application; Cambridge University Press
Yatchew, A. (2003): Semiparametric Regression for the Applied Econometrician; Cambridge University Press
Hastie, T., Tibshirani, R. and Friedman, J. (2001): The Elements of Statistical Learning; Springer

1 Empirical distributions, quantiles and nonparametric tests

1.1 The empirical distribution function

The distribution of a real-valued random variable X can be completely described by its distribution function

   F(x) = P(X ≤ x) for all x ∈ ℝ.

It is well known that any distribution function possesses the following properties:

- F(x) is a monotonically increasing function of x.
- Any distribution function is right-continuous: for any x ∈ ℝ,
     lim_{δ↓0} F(x + δ) = F(x).
  Furthermore,
     lim_{δ↓0} F(x − δ) = F(x) − P(X = x).
- If F is continuous, then there exists a density f such that
     ∫_{−∞}^x f(t) dt = F(x) for all x ∈ ℝ.
  If f is continuous at x, then F′(x) = f(x).

Data: i.i.d. random sample X_1, ..., X_n.

For given data, the sample analogue of F is the so-called empirical distribution function, which is an important tool of statistical inference. Let I(·) denote the indicator function, i.e., I(x ≤ t) = 1 if x ≤ t, and I(x ≤ t) = 0 if x > t.

Empirical distribution function:

   F_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),

i.e., F_n(x) is the proportion of observations with X_i ≤ x.

Properties:
- 0 ≤ F_n(x) ≤ 1
- F_n(x) = 0 if x < X_(1), where X_(1) is the smallest observation
- F_n(x) = 1 if x ≥ X_(n), where X_(n) is the largest observation
- F_n is a monotonically increasing step function

Example:

   x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8
   5.20   4.80   5.40   4.60   6.10   5.40   5.80   5.50

Corresponding empirical distribution function: [figure]
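The definition of F_n translates directly into code. A minimal sketch (numpy assumed available; the helper name `ecdf` is illustrative), applied to the eight observations of the example:

```python
import numpy as np

def ecdf(sample):
    """Return the empirical distribution function F_n of a sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    def F_n(t):
        # proportion of observations X_i <= t
        return np.searchsorted(x, t, side="right") / n
    return F_n

# data from the example above
sample = [5.20, 4.80, 5.40, 4.60, 6.10, 5.40, 5.80, 5.50]
F_n = ecdf(sample)
print(F_n(5.40))  # 5 of 8 observations are <= 5.40, i.e. 0.625
```

Note that `side="right"` counts observations equal to t as well, which matches I(X_i ≤ t).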

For real-valued random variables the empirical distribution function is closely linked with the so-called order statistics. Given a sample X_1, ..., X_n, the corresponding order statistic is the n-tuple of ordered observations (X_(1), ..., X_(n)), where X_(1) ≤ X_(2) ≤ ... ≤ X_(n). For r = 1, ..., n, X_(r) is called the r-th order statistic.

Order statistics can only be determined for one-dimensional random variables, but an empirical distribution function can also be defined for random vectors. Let X be a d-dimensional random variable defined on ℝ^d, and let X_i = (X_i1, ..., X_id)^T denote an i.i.d. sample of random vectors from X. Then for any x = (x_1, ..., x_d)^T

   F(x) = P(X_1 ≤ x_1, ..., X_d ≤ x_d)   and   F_n(x) = (1/n) Σ_{i=1}^n I(X_i1 ≤ x_1, ..., X_id ≤ x_d).

We can also define the so-called empirical measure P_n. For any A ⊂ ℝ^d,

   P_n(A) = (1/n) Σ_{i=1}^n I(X_i ∈ A).

Note that P_n(A) simply quantifies the relative frequency of observations falling into A. As n → ∞,

   P_n(A) →_P P(A).

Note that P_n of course depends on the observations and thus is random. At the same time, however, it possesses all properties of a probability measure.

Knowing F_n, we can uniquely reconstruct all observed values {X_1, ..., X_n}. The only information lost is the exact succession of these values. For i.i.d. samples this information is completely irrelevant for all statistical purposes. All important statistics (and estimators) can thus be written as functions of F_n (or P_n). In particular, in the theoretical literature expectations and corresponding sample averages are often represented in the following form: for a continuous function g,

   E(g(X)) = ∫ g(x) dP = ∫ g(x) dF(x)   and   (1/n) Σ_{i=1}^n g(X_i) = ∫ g(x) dP_n = ∫ g(x) dF_n(x).

Here, ∫ g(x) dF(x) refers to the Stieltjes integral, a generalization of the well-known Riemann integral. Let d = 1, and consider a partition a = x_0 < x_1 < ... < x_m = b of an interval [a, b]. Then

   ∫_a^b g(x) dF(x) = lim_{m → ∞, sup_j |x_j − x_{j−1}| → 0}  Σ_{j=1}^m g(ξ_j) (F(x_j) − F(x_{j−1}))

if the limit exists and is independent of the specific choices of ξ_j ∈ [x_{j−1}, x_j]. It can be shown that for any continuous function g and any distribution function F the corresponding Stieltjes integral exists for any finite interval [a, b]. The integral ∫_{−∞}^{∞} g(x) dF(x) corresponds to the limit (if it exists) as a → −∞, b → ∞.

1.2 Theoretical properties of empirical distribution functions

In the following we will assume that X is a real-valued random variable (d = 1).

Theorem: For every x ∈ ℝ,

   n F_n(x) ~ B(n, F(x)),

i.e., n F_n(x) has a binomial distribution with parameters n and F(x). The probability distribution of F_n(x) is thus given by

   P(F_n(x) = m/n) = (n choose m) F(x)^m (1 − F(x))^{n−m},   m = 0, 1, ..., n.

Some consequences:
- E(F_n(x)) = F(x), i.e., F_n(x) is an unbiased estimator of F(x).
- Var(F_n(x)) = (1/n) F(x)(1 − F(x)), i.e., as n increases the variance of F_n(x) decreases.
- F_n(x) is a (weakly) consistent estimator of F(x).

Theorem of Glivenko-Cantelli:

   P( lim_{n→∞} sup_{x∈ℝ} |F_n(x) − F(x)| = 0 ) = 1
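The unbiasedness and the variance formula can be verified numerically from the binomial distribution of n F_n(x); a minimal sketch (the function name `pmf_Fn` is illustrative):

```python
from math import comb

def pmf_Fn(n, F_x):
    """P(F_n(x) = m/n) = C(n, m) F(x)^m (1 - F(x))^(n - m), m = 0, ..., n."""
    return [comb(n, m) * F_x**m * (1 - F_x)**(n - m) for m in range(n + 1)]

n, F_x = 10, 0.3
p = pmf_Fn(n, F_x)
# E(F_n(x)) should equal F(x), Var(F_n(x)) should equal F(x)(1 - F(x))/n
mean = sum((m / n) * p[m] for m in range(n + 1))
var = sum((m / n - mean) ** 2 * p[m] for m in range(n + 1))
```

Here `mean` reproduces F(x) = 0.3 and `var` reproduces 0.3 · 0.7 / 10 = 0.021 up to rounding error.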

The distribution of Y = F(X)

Note: there is an important difference between F(x) and F(X): for any fixed x ∈ ℝ the corresponding value F(x) = P(X ≤ x) is a fixed number, while F(X) is a random variable (F denoting the distribution function of X).

Theorem: Let X be a random variable with a continuous distribution function F. Then Y = F(X) has a (continuous) uniform distribution on the interval (0, 1), i.e.,

   F(X) ~ U(0, 1),   P(a ≤ F(X) ≤ b) = b − a for all 0 ≤ a < b ≤ 1.

Consequence: If F is continuous, then F(X_1), ..., F(X_n) can be interpreted as an i.i.d. random sample of observations from a U(0, 1) distribution, and (F(X_(1)), ..., F(X_(n))) is the corresponding order statistic.

1.3 Quantiles

Quantiles are an essential tool for statistical analysis. They provide important information for characterizing location and dispersion of a distribution. In statistical inference they play a central role in measuring risk.

Let X denote a real-valued random variable with distribution function F.

Quantiles: For 0 < τ < 1, any q_τ ∈ ℝ satisfying

   F(q_τ) = P(X ≤ q_τ) ≥ τ   and   P(X ≥ q_τ) ≥ 1 − τ

is called τth quantile (or simply τ-quantile) of X.

Note that quantiles are not necessarily unique: for given τ, there may exist an interval of possible values fulfilling the above conditions. But if X is a continuous random variable with density f, then q_τ is unique if f(q_τ) > 0 (then F(q_τ) = τ and F(q) < τ for all q < q_τ).

In the statistical literature most work on quantiles is based on the so-called quantile function, which is defined as an inverse distribution function. For 0 < τ < 1 the quantile function is defined by

   Q(τ) := inf{y | F(y) ≥ τ}.

For any 0 < τ < 1 the value q_τ = Q(τ) is a τ-quantile satisfying the above conditions. If there is an interval of possible values for q_τ, Q(τ) selects the smallest possible value. Like the distribution function, the quantile function provides a complete characterization of the random variable X. If the distribution function F is strictly monotonically increasing, then Q is the inverse of F, Q(τ) = F^{−1}(τ).
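For a finite sample, the infimum definition Q(τ) = inf{y | F(y) ≥ τ} reduces to picking the smallest order statistic X_(k) with k/n ≥ τ. A minimal sketch (numpy assumed; the name `sample_quantile` is illustrative):

```python
import numpy as np

def sample_quantile(sample, tau):
    """Q_n(tau) = inf{y : F_n(y) >= tau}: the smallest observation
    at which the empirical distribution function reaches tau."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    # F_n(X_(k)) = k/n, so take the smallest k with k/n >= tau
    k = int(np.ceil(n * tau))
    return x[k - 1]

data = [4.60, 4.80, 5.20, 5.40, 5.40, 5.50, 5.80, 6.10]
print(sample_quantile(data, 0.5))   # median: X_(4) = 5.40
print(sample_quantile(data, 0.25))  # lower quartile: X_(2) = 4.80
```

Note that library routines such as `numpy.quantile` interpolate between order statistics by default and may therefore return slightly different values than the inf-definition used here.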

Important quantiles:

- µ_med = Q(0.5) is the median of X (with probability at least 0.5 an observation is smaller than or equal to Q(0.5), and with probability at least 0.5 an observation is larger than or equal to Q(0.5)).
- Q(0.25) and Q(0.75) are called lower and upper quartile, respectively. Instead of the standard deviation, the interquartile range IQR = Q(0.75) − Q(0.25) (also called quartile coefficient of dispersion) is frequently used as a measure of statistical dispersion. Note that P(X ∈ [Q(0.25), Q(0.75)]) ≥ 0.5.
- Q(0.1), Q(0.2), ..., Q(0.9) are the deciles of X.
- Q(0.01), Q(0.02), ..., Q(0.99) are the percentiles of X.

The median is of particular interest. In classical nonparametric statistics it is often preferred to the mean µ = E(X) for localizing the center of a distribution:

- Different from the mean, the median is defined for any real-valued random variable X.
- The median is a robust measure; its value is not much affected by the tails of a distribution (empirically, outliers in the data do not play much of a role when estimating a median or quartiles).
- If a distribution is heavily skewed, then the median is more informative than the mean for localizing the center of the distribution.

If the distribution of X is symmetric, then µ_med = µ (provided that µ = E(X) exists). For skewed distributions µ ≠ µ_med. In general, µ_med < µ if the distribution is right-skewed, and µ_med > µ if the distribution is left-skewed.

For many important measures summarizing characteristics of a distribution there exist different versions, based either on moments or on quantiles. The quantile-based versions are necessarily more robust, since quantiles are well defined for any distribution, while the existence of moments already introduces some restriction.

Some summary measures:
- Location measures: mean µ = E(X), median µ_med
- Dispersion measures: standard deviation σ, IQR
- Skewness measures: γ_1 := E( ((X − µ)/σ)³ ) and the quantile-based
     γ(τ) = (Q(τ) + Q(1 − τ) − 2µ_med) / (Q(τ) − Q(1 − τ))   (for τ > 0.5)

In empirical analysis, sample quantiles are used to estimate the unknown true quantiles of X.

Data: i.i.d. random sample X_1, ..., X_n from X.

The sample quantile function Q_n(τ) is defined by using the empirical distribution function F_n instead of F.

The sample quantile function: For 0 < τ < 1 define

   Q_n(τ) := inf{y | F_n(y) ≥ τ}.

For a fixed τ ∈ (0, 1), Q_n(τ) is called the τth sample quantile.

A frequently used tool for descriptive data analysis is the so-called boxplot. The boxplot provides a graphical description of the empirical distribution of the observed data by using sample quantiles. It provides information about the median, lower and upper quartiles, as well as outliers.

Example: Order statistic (n = 10):

   0.1  0.1  0.2  0.4  0.5  0.7  0.9  1.2  1.4  1.9

[Histogram and boxplot of the data]
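The quantile-based summary measures for this example can be computed directly from the inf-definition of the sample quantile function; a minimal sketch (numpy assumed; the helper names are illustrative):

```python
import numpy as np

def Q_n(sample, tau):
    """Sample quantile function: inf{y : F_n(y) >= tau}."""
    x = np.sort(np.asarray(sample, dtype=float))
    return x[int(np.ceil(len(x) * tau)) - 1]

def summary(sample, tau=0.75):
    med = Q_n(sample, 0.5)
    iqr = Q_n(sample, 0.75) - Q_n(sample, 0.25)
    # quantile-based skewness gamma(tau), positive for right-skewed data
    gamma = (Q_n(sample, tau) + Q_n(sample, 1 - tau) - 2 * med) / (
        Q_n(sample, tau) - Q_n(sample, 1 - tau))
    return med, iqr, gamma

# the ten ordered observations from the example above
data = [0.1, 0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 1.2, 1.4, 1.9]
med, iqr, gamma = summary(data)
```

For this right-skewed sample one obtains the median 0.5, IQR 1.0 and a positive quantile skewness γ(0.75) = 0.4.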

[Figure: hourly wage (Stundenlohn), women (Frauen) vs. men (Maenner)]

1.4 Nonparametric tests: the Kolmogorov-Smirnov test

There exists an enormous variety of nonparametric tests for different statistical problems. Starting with the Kolmogorov-Smirnov one-sample test, we will introduce some important test procedures based on the use of empirical distribution functions and order statistics. There exist many further classical nonparametric tests based on various approaches; a reference is the book by Gibbons (1971). Although approaches and setups differ, there are some common characteristics shared by all of these tests:

- Generality: The null hypothesis of interest is formulated in a general way; no parametrization, no dependence on the existence and values of moments of specific distributions.
- Distribution-free tests: The distribution of the test statistic under H_0 does not depend on the underlying distribution of the variable of interest.
- Robustness: Test results should not be unduly affected by outliers or small departures from the model assumptions.

Goodness-of-fit tests: A number of nonparametric tests try to assess whether a given distribution is suited to a dataset. The aim is to verify whether an observed variable possesses a specified distribution, e.g. an exponential distribution with parameter λ = 1 or a normal distribution with mean 0 and variance 1. The most important test in this context is the Kolmogorov-Smirnov test.

Assumption: real-valued random variable X with continuous distribution function F.
Data: i.i.d. random sample X_1, ..., X_n from X.
Goal: test the null hypothesis H_0: F = F_0, where F_0 is a given distribution function.

Idea: F_n(x) is an unbiased and consistent estimator of F(x). Hence, if the null hypothesis is correct and F = F_0, the differences |F_n(x) − F_0(x)| should be sufficiently small.

Kolmogorov-Smirnov test:

   H_0: F(x) = F_0(x) for all x ∈ ℝ
   H_1: F(x) ≠ F_0(x) for some x ∈ ℝ

Test statistic:

   D_n = sup_{x∈ℝ} |F_n(x) − F_0(x)|

H_0 is rejected if D_n > d_{n,1−α}, where d_{n,1−α} is the (1 − α)-quantile of the distribution of D_n under H_0.

Problem: distribution of D_n under H_0?

a) Under H_0: F = F_0 the test statistic D_n is distribution-free. Its distribution coincides with the distribution of the random variable

   D_n* = sup_{y∈[0,1]} |F_n*(y) − y|.

Here, F_n* denotes the empirical distribution function of an i.i.d. sample Y_1, ..., Y_n from a U(0, 1) distribution.

b) Asymptotic distribution (n large): for every λ > 0 we obtain

   lim_{n→∞} P(D_n ≤ λ/√n) = 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²λ²}.

Result a) implies that the critical values of a Kolmogorov-Smirnov test can be approximated by Monte Carlo simulation:
- Using a random number generator, draw an i.i.d. sample Y_1, ..., Y_n from a U(0, 1) distribution and calculate the corresponding value D_{n,1}* = sup_{y∈[0,1]} |F_n*(y) − y|.
- Iterate k times (k large, e.g. k = 2000), yielding k values D_{n,1}*, D_{n,2}*, ..., D_{n,k}*.
- The (1 − α)-quantile of the empirical distribution of D_{n,1}*, ..., D_{n,k}* provides an approximation of d_{n,1−α} (the larger k, the more accurate the approximation).

There also exist tables providing critical values d_{n,1−α} for small n.

Example: A manufacturer of a certain SUV claims that, when driving at a constant speed of 100 km/h, fuel consumption of the SUV is normally distributed with mean µ = E(X) = 12 and standard deviation σ = 1. A random sample of 10 SUVs leads to the following observed fuel consumptions: [data not reproduced]. Calculating the K-S test statistic yields (n = 10) a value D_10, which is compared to the critical value d_{10,0.95} of the test for n = 10 and α = 0.05. Here H_0 is accepted, since D_10 < d_{10,0.95}.

Remark: In principle, the test may also be used for discrete

distributions. In this case the test is conservative, i.e., under H_0 the probability of a type I error is usually smaller than α.

Composite null hypotheses

It is common to speak of a composite null hypothesis if F_0(x) ≡ F_0(x, θ) is only specified up to an unknown parameter vector θ ∈ ℝ^m. An example is the normal distribution with unknown mean and variance, i.e., θ = (µ, σ²). In such a case the aim is simply to test whether the data are normally distributed (irrespective of the particular mean and variance).

Testing problem:

   H_0: F(x) = F_0(x, θ) for all x ∈ ℝ; θ unknown
   H_1: for all possible θ: F(x) ≠ F_0(x, θ) for some x ∈ ℝ

Test statistic:

   D_n = sup_{x∈ℝ} |F_n(x) − F_0(x, θ̂)|

Here, θ̂ denotes the maximum likelihood estimate of θ. Normal distribution: θ̂ = (X̄, σ̂²), σ̂² = (1/n) Σ_i (X_i − X̄)².

H_0 is rejected if D_n > d_{n,1−α}. In general one uses the same critical values as in the case of a simple null hypothesis (see above). This implies that the test is conservative, i.e., under H_0 the probability of a type I error is usually smaller than α. For the special case of a normal distribution, exact critical values have been determined by Lilliefors. The resulting Lilliefors test is implemented in many statistical program packages.
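The Monte Carlo approximation of Kolmogorov-Smirnov critical values described above can be sketched as follows (numpy assumed; the helper names and the fixed seed are illustrative, and the simulated value is only an approximation):

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_statistic_uniform(u):
    """D_n* = sup_{y in [0,1]} |F_n*(y) - y| for a sample u from U(0,1).
    The supremum is attained at a jump point of the empirical d.f."""
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return float(max(np.max(i / n - u), np.max(u - (i - 1) / n)))

def ks_critical_value(n, alpha=0.05, k=2000):
    """Monte Carlo approximation of the critical value d_{n,1-alpha}."""
    d = [ks_statistic_uniform(rng.uniform(size=n)) for _ in range(k)]
    return float(np.quantile(d, 1 - alpha))

d10 = ks_critical_value(10)  # tabulated value for n = 10, alpha = 0.05 is roughly 0.41
```

Increasing k narrows the Monte Carlo error of the simulated quantile.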

1.5 Nonparametric one-sample tests

Rank statistics

Many nonparametric tests are (implicitly or explicitly) based on ranks of observations. Ranks are easily determined from order statistics. Consider an i.i.d. random sample X_1, ..., X_n from a continuous random variable X. If X_i ≠ X_j for all i ≠ j, then the rank r(X_i) of observation X_i, i = 1, ..., n, is defined by

   r(X_i) := Σ_{j=1}^n I(X_j ≤ X_i).

This means that the smallest observation has rank 1, the largest observation has rank n, and

   r(X_(i)) = i,   i = 1, ..., n.

For an i.i.d. sample from a continuous random variable we have P(X_i = X_j for some i ≠ j) = 0. Consequently, with probability 1, (r(X_1), ..., r(X_n)) is a random permutation of all natural numbers between 1 and n.

   E(r(X_i)) = (n + 1)/2,   Var(r(X_i)) = (n² − 1)/12

In practice it can of course occur that there exist ties, i.e., different observations with equal values. In this case an average rank is assigned to all observations with identical value.

Examples (n = 5):

   X_i      0.3   1.5   0.1   0.8   1.0
   r(X_i)   2     5     1     3     4

   X_i      2.0   0.5   0.9   1.3   2.6
   r(X_i)   4     1     2     3     5

   X_i      1.09  2.17  2.17  2.17  3.02
   r(X_i)   1     3     3     3     5

   X_i      0.5   0.5   0.9   1.3   1.3
   r(X_i)   1.5   1.5   3     4.5   4.5

Note: If there are ties, then the empirical variance of r(X_i) is necessarily smaller than (n² − 1)/12.

Linear rank statistics (one sample)

Consider a random variable X with continuous distribution function F.
Data: i.i.d. random sample X_1, ..., X_n.

Nonparametric one-sample tests try to verify hypotheses concerning the location of the center of a distribution. More precisely, they aim to test whether the median µ_med is equal to a prespecified value µ_0. Recall that for a continuous random variable the median µ_med necessarily satisfies F(µ_med) = 0.5.

For simplicity, in the following we will only consider two-sided tests. One-sided tests are completely analogous.
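The tie-averaging rule can be sketched in code (the helper name `ranks_with_ties` is illustrative); the last two examples above serve as checks:

```python
def ranks_with_ties(sample):
    """Assign average ranks: tied observations share the average
    of the ranks they jointly occupy."""
    r = []
    for x in sample:
        less = sum(1 for y in sample if y < x)
        equal = sum(1 for y in sample if y == x)
        # ranks occupied: less+1, ..., less+equal; their average is less + (equal+1)/2
        r.append(less + (equal + 1) / 2)
    return r

print(ranks_with_ties([1.09, 2.17, 2.17, 2.17, 3.02]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
print(ranks_with_ties([0.5, 0.5, 0.9, 1.3, 1.3]))       # [1.5, 1.5, 3.0, 4.5, 4.5]
```

This is the same convention as `scipy.stats.rankdata` with its default "average" method.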

Formal testing problem:

   H_0: µ_med = µ_0   against   H_1: µ_med ≠ µ_0

Example: For studying the intelligence of PhD students at a certain university, n = 10 students were randomly selected and the corresponding IQ values were measured using an IQ test. This led to 10 observations X_i [data not reproduced]. Question: is the data compatible with the hypothesis H_0: µ_med = 110?

Linear rank statistics for the one-sample problem rely on the ranks of the absolute values of the differences D_i = X_i − µ_0:

   r(|D_i|) := rank of |D_i| = |X_i − µ_0| in the sample |D_1|, ..., |D_n| of absolute values.

Moreover, let

   V_i := 1 if X_i − µ_0 > 0,   V_i := 0 if X_i − µ_0 ≤ 0.

For a suitable weight function g, a linear rank statistic L_n^+ is then defined by

   L_n^+ = Σ_{i=1}^n g(r(|D_i|)) V_i

IQ example (µ_0 = 110): table of X_i, V_i, |D_i|, r(|D_i|) [not reproduced].

There exist some general theoretical results on the choice of a suitable weight function for constructing locally optimal rank tests. The term "locally optimal" refers to the assumption that the underlying F is close to some pre-specified parametric distribution (e.g. normal). In practice, the most frequently used linear rank tests are the sign test and the Wilcoxon test.

The sign test:

The sign test is the linear rank test with the simplest possible weight function: g(x) = 1 for all x. For testing H_0: µ_med = µ_0 the sign test thus relies on the test statistic

   V_n^+ = Σ_{i=1}^n V_i.

Under H_0 we obtain P(V_i = 1) = 1/2 and P(V_i = 0) = 1/2. This implies that the null distribution of V_n^+ is a binomial distribution with parameters n and 1/2,

   V_n^+ ~ B(n, 1/2).

For a given significance level α > 0, the sign test rejects H_0 if either P(B_{n,1/2} ≥ V_n^+) ≤ α/2 or P(B_{n,1/2} ≤ V_n^+) ≤ α/2.

For large n the binomial distribution may be approximated by a normal distribution. Under H_0 we have approximately

   (V_n^+ − n/2) / √(n/4) ~ AN(0, 1).

Remark: Since F is continuous we have P(X_i − µ_0 = 0) = 0. In practice, however, there may exist observations with X_i − µ_0 = 0. In this case it is common practice to eliminate these observations and to apply the sign test to the correspondingly reduced sample.

The Wilcoxon test:

The Wilcoxon test is a linear rank test based on the weight function g(x) = x for all x. It relies on the additional assumption that the underlying distribution is symmetric. The test statistic is

   W_n^+ = Σ_{i=1}^n r(|D_i|) V_i.

For a given significance level α > 0, the Wilcoxon test rejects H_0 if either W_n^+ ≤ w_{n,α/2} or W_n^+ ≥ w_{n,1−α/2}. Here, w_{n,α/2} and w_{n,1−α/2} are the corresponding quantiles of the distribution of W_n^+ under H_0.

If F is symmetric, then under H_0 the statistic W_n^+ is distribution-free: under H_0, V_1, ..., V_n are i.i.d. with P(V_i = 1) = P(V_i = 0) = 1/2, while symmetry of F implies that the random variables V_i and |D_i| are independent. Hence, all possible combinations of zeros and ones for the indicator variables V_1, ..., V_n are equally probable, while at the same time (r(|D_1|), ..., r(|D_n|)) is a purely random permutation of {1, ..., n}. Therefore, critical values can be obtained by straightforward combinatorial methods.

Asymptotic approximation (n large):

   (W_n^+ − n(n+1)/4) / √(Var(W_n^+)) ~ AN(0, 1),   where Var(W_n^+) = n(n+1)(2n+1)/24.

Note: The theoretical derivation of the null distribution relies on the assumption of a continuous random variable (probability of ties equal to zero). Ties may of course exist in practice. Then the above distributions are only approximately valid, and the accuracy of the approximation decreases with the number of ties. In the literature there exist formulas which provide corrected critical values in the presence of ties.

Application: paired-sample procedures

Paired samples: sample (X_1, Y_1), ..., (X_n, Y_n)
- X_1, ..., X_n i.i.d. with distribution function F_X
- Y_1, ..., Y_n i.i.d. with distribution function F_Y
- X_i and Y_i are not independent; e.g. (X_i, Y_i) are repeated measurements for the same statistical unit.

Example: advertising campaign. The following table represents the weekly sales (in Euro) of a trade chain before and after an advertising campaign.

   chain store            1     2     3     4     5     6
   before campaign (X)    18.5  15.6  20.1  17.2  21.1  19.3
   after campaign (Y)     20.2  16.6  19.8  19.3  21.9  19.0

x̄ = 18.63, ȳ = 19.47

Question: Has the advertising campaign been successful? Did the campaign (in tendency) lead to significantly higher sales?

Nonparametric approach: analysis of the resulting sample of differences

   Z_1 = X_1 − Y_1, Z_2 = X_2 − Y_2, ..., Z_n = X_n − Y_n.

The above problem can then be translated into the question: is the median of Z_1, ..., Z_n significantly different from zero?

Testing problem:

   H_0: µ_med;Z = 0   against   H_1: µ_med;Z ≠ 0

Application of the sign test (or Wilcoxon test) based on Z_1, ..., Z_n.

Power for detecting alternatives:
- Parametric alternative (assuming normality): Student t-test.
- The asymptotic relative efficiency of the sign test relative to the t-test is 2/π ≈ 0.64 if the underlying distribution is normal. The sign test can be much more efficient than the t-test if the underlying distribution is skewed or possesses heavy tails.
- For a symmetric distribution the Wilcoxon test is always more efficient than the sign test. The asymptotic relative efficiency of the Wilcoxon test relative to the t-test is 3/π ≈ 0.955 if the underlying distribution is normal.
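The paired-sample sign test for the campaign data can be sketched as follows (exact binomial p-value; variable names are illustrative):

```python
from math import comb

# weekly sales before (X) and after (Y) the campaign, from the table above
x = [18.5, 15.6, 20.1, 17.2, 21.1, 19.3]
y = [20.2, 16.6, 19.8, 19.3, 21.9, 19.0]
z = [xi - yi for xi, yi in zip(x, y)]

# sign test of H0: median of Z equals 0
# (observations with Z_i = 0 would be dropped; none occur here)
n = len(z)
v = sum(1 for zi in z if zi > 0)                        # V_n^+: number of positive signs
p_le = sum(comb(n, m) for m in range(0, v + 1)) / 2**n  # P(B(n,1/2) <= v)
p_ge = sum(comb(n, m) for m in range(v, n + 1)) / 2**n  # P(B(n,1/2) >= v)
p_value = min(1.0, 2 * min(p_le, p_ge))
```

Here v = 2 of the n = 6 differences are positive, giving a two-sided p-value of 2 · P(B(6, 1/2) ≤ 2) = 0.6875; with so few pairs the sign test cannot establish a significant effect of the campaign.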

1.6 Two-sample tests

In the following we consider two random variables X and Y with continuous distribution functions F_X and F_Y.

Data: i.i.d. random samples X_1, ..., X_m and Y_1, ..., Y_n from underlying populations with distribution functions F_X and F_Y; X_i is independent of Y_j for all i, j.

Problem: test the null hypothesis H_0: F_X = F_Y of equality of the underlying distributions.

Example: Coffee and the speed of typing on a keyboard. An experiment was conducted in order to measure the influence of caffeine on the speed of typing on a computer keyboard. 20 trained test persons were randomly divided into two groups of 10 persons. The first group did not receive any beverages, but each member of the second group had to drink a big cup of coffee (administering 200 mg caffeine). Every test person then had to type a text on a keyboard, and the average number of characters typed per minute was recorded: no caffeine (X), 200 mg caffeine (Y) [data not reproduced].

Question: Does there exist a difference between typing speeds with and without caffeine?

Formal testing problem:

   H_0: F_X = F_Y   against   H_1: F_X ≠ F_Y

For two-sample tests based on order statistics, the ranks of the observations X_i and Y_j in the combined sample of all m + n observations play a central role. If there are no ties, then r(X_i) is defined by

   r(X_i) := Σ_{j=1}^m I(X_j ≤ X_i) + Σ_{j=1}^n I(Y_j ≤ X_i),

and consequently

   r(X_(i)) = i + Σ_{j=1}^n I(Y_j ≤ X_(i))

for all i = 1, ..., m.

If H_0: F_X = F_Y is correct, then all ranks between 1 and m + n are equally probable: P(r(X_i) = j) = 1/(n + m) for all j ∈ {1, ..., m + n}. More precisely, under H_0, (r(X_1), ..., r(X_m)) can be interpreted as m numbers randomly drawn from the set {1, 2, ..., m + n}; all possible sequences of these m numbers are equally probable. This will not be true under the alternative.

1.6.1 The Kolmogorov-Smirnov two-sample test

Note: The empirical distribution functions F_{X,m} and F_{Y,n} are unbiased and consistent estimators of F_X and F_Y, respectively. If the null hypothesis H_0: F_X = F_Y is correct, all differences F_{X,m}(x) − F_{Y,n}(x) are purely random and should be sufficiently small. This motivates the two-sample test of Kolmogorov and Smirnov for testing H_0: F_X = F_Y.

Test statistic:

   D_{m,n} = sup_{x∈ℝ} |F_{X,m}(x) − F_{Y,n}(x)|

H_0 is rejected if D_{m,n} > d_{m,n,1−α}, where d_{m,n,1−α} is the (1 − α)-quantile of the distribution of D_{m,n} under the null hypothesis.

a) Under H_0: F_X = F_Y the test statistic D_{m,n} is distribution-free. Critical values can be obtained by straightforward combinatorics. Recall that ties do not play any role in the theoretical analysis, since they have probability 0. We obtain

   D_{m,n} = max{ max_{i=1,...,m} |F_{X,m}(X_i) − F_{Y,n}(X_i)|, max_{j=1,...,n} |F_{X,m}(Y_j) − F_{Y,n}(Y_j)| }
           = max{ max_{i=1,...,m} |i/m − (1/n) Σ_{j=1}^n I(Y_j ≤ X_(i))|, max_{i=1,...,n} |(1/m) Σ_{j=1}^m I(X_j ≤ Y_(i)) − i/n| }

The value of D_{m,n} thus only depends on the ranks of the X_i and Y_j in the combined sample of all m + n observations. Since under H_0 all rank configurations are equally probable, critical values are obtained by a simple counting procedure.

b) Asymptotic distribution (m, n large): for all λ > 0,

   lim P(D_{m,n} ≤ λ/√(mn/(m + n))) = 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²λ²}.

c) The Kolmogorov-Smirnov test is consistent against all alternatives.
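Since the supremum is attained at one of the observed values, D_{m,n} can be computed by a direct scan over all observations; a minimal sketch (the function name is illustrative):

```python
def ks_two_sample(x, y):
    """D_{m,n} = sup_t |F_{X,m}(t) - F_{Y,n}(t)|; the supremum is
    attained at one of the observed values, so it suffices to scan them."""
    m, n = len(x), len(y)
    d = 0.0
    for t in list(x) + list(y):
        fx = sum(1 for xi in x if xi <= t) / m   # F_{X,m}(t)
        fy = sum(1 for yi in y if yi <= t) / n   # F_{Y,n}(t)
        d = max(d, abs(fx - fy))
    return d

print(ks_two_sample([1, 2, 3], [4, 5, 6]))  # completely separated samples: 1.0
print(ks_two_sample([1, 3], [2, 4]))        # interleaved samples: 0.5
```

The quadratic-time scan is transparent but slow; production code (e.g. `scipy.stats.ks_2samp`) sorts the combined sample instead.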

1.6.2 Linear rank statistics

Rank tests are explicitly constructed on the basis of the ranks of X_i and Y_j in the combined sample of all N = m + n observations. Under H_0: F_X = F_Y the combined sample can be interpreted as an i.i.d. random sample of size N = m + n from a population with distribution function F_X = F_Y. If there are no ties, the ranks are random permutations of the natural numbers between 1 and N. Rank tests then aim to verify whether the distribution of ranks is indeed purely random, or whether there are systematic differences between the ranks of the X and Y variables indicating that F_X ≠ F_Y.

The most commonly used rank tests for the two-sample problem can be written as linear combinations of indicator variables for the combined (ordered) sample. Such statistics are often called linear rank statistics. For the following theoretical analysis we will assume that F_X and F_Y are continuous and that there are no ties in the samples. Let

   V_i := 1 if the i-th variable in the combined, ordered sample is an X-variable, and V_i := 0 otherwise.

Linear rank statistics can now generally be written in the form

   L_N = Σ_{i=1}^N a_i V_i,

where a_1, a_2, ... are pre-specified weights ("scores"). Different test procedures use different specifications of the scores a_i.

(V_1, V_2, ..., V_N) is a vector consisting of m ones and n zeros.
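Because each placement of the m ones among the N positions is equally likely under H_0, the exact null distribution of L_N can be enumerated for small N; a minimal sketch (the function name is illustrative):

```python
from itertools import combinations
from math import comb

def exact_null_distribution(a, m):
    """Exact H0-distribution of L_N = sum a_i V_i when m of the N positions
    carry X-variables, each of the C(N, m) placements equally likely."""
    N = len(a)
    counts = {}
    for pos in combinations(range(N), m):
        c = sum(a[i] for i in pos)
        counts[c] = counts.get(c, 0) + 1
    total = comb(N, m)
    return {c: q / total for c, q in counts.items()}

# Wilcoxon scores a_i = i for N = 4, m = 2
a = [1, 2, 3, 4]
dist = exact_null_distribution(a, 2)
mean = sum(c * p for c, p in dist.items())  # matches (m/N) * sum(a_i) = 5.0
```

For N = 4, m = 2 the possible values 3, 4, 5, 6, 7 occur with probabilities 1/6, 1/6, 2/6, 1/6, 1/6, and the mean reproduces the moment formula below.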

There are (N choose m) different possible arrangements of these m ones and n zeros, each of which has the same probability under H_0. Under H_0: F_X = F_Y the statistic L_N is distribution-free. Critical values can be determined by straightforward combinatorics:

   P(L_N = c | H_0) = q(c) / (N choose m),

where q(c) denotes the number of vectors (V_1, ..., V_N) satisfying L_N = Σ_{i=1}^N a_i V_i = c.

Moments under H_0:

   E(V_i) = m/N,   Var(V_i) = mn/N²,   Cov(V_i, V_j) = −mn/(N²(N − 1))

This implies

   E(L_N) = (m/N) Σ_{i=1}^N a_i,   Var(L_N) = mn/(N²(N − 1)) · (N Σ_{i=1}^N a_i² − (Σ_{i=1}^N a_i)²)

Asymptotic distribution (m, n large):

   Z_N = (L_N − E(L_N)) / √(Var(L_N)) ~ AN(0, 1).

Tests based on linear rank statistics are not consistent against all possible alternatives. However, they can be constructed in such a way that they are particularly powerful in detecting some important types of alternatives, for example shifts in location. The point is that in many practically relevant situations the

structure of the distributions F_X and F_Y is quite similar, but there exists a shift in the centers of these distributions (different medians, means). Mathematically this can be formalized by the concept of stochastic dominance.

Definition: A real random variable X (first-order) stochastically dominates a real random variable Y (written X ≽_FSD Y) if

   P(X > z) ≥ P(Y > z) for all z,   or equivalently   F_X(z) ≤ F_Y(z) for all z.

If X ≽_FSD Y, then µ_X,med ≥ µ_Y,med, where µ_X,med and µ_Y,med denote the medians of X and Y, respectively. Moreover, if E(X) exists, then E(X) ≥ E(Y).

Tests for the location problem are particularly powerful against alternatives of the form F_X(z) < F_Y(z) or F_X(z) > F_Y(z). Location tests based on linear rank statistics rely on specifying scores such that a_1 < a_2 < ... < a_N is a strictly monotonically increasing sequence. Note that the following tests may also be able to detect alternatives where stochastic dominance of one variable is not exactly satisfied. They will, however, not be consistent against alternatives where the centers of the distributions are equal and the only difference lies in the fact that one variable is more dispersed than the other.

The Wilcoxon-Mann-Whitney test (Mann-Whitney U-test):

The best known two-sample location test is the Wilcoxon-Mann-Whitney test. Its test statistic is a special linear rank statistic

with scores a_i = i, i = 1, ..., N:

   W_N = Σ_{i=1}^N i V_i = Σ_{j=1}^m r(X_j).

For α > 0 let ω_{N,α} denote the α-quantile of the distribution of W_N under H_0.

- Two-sided test (H_0: F_X = F_Y against H_1: F_X ≠ F_Y): H_0 is rejected if W_N ≤ ω_{N,α/2} or W_N ≥ ω_{N,1−α/2}.
- One-sided test (H_0: F_X = F_Y against H_1: F_X(z) < F_Y(z) for all z): H_0 is rejected if W_N ≥ ω_{N,1−α}.
- One-sided test (H_0: F_X = F_Y against H_1: F_X(z) > F_Y(z) for all z): H_0 is rejected if W_N ≤ ω_{N,α}.

Under H_0, W_N is distribution-free. Critical values can be obtained in a combinatorial way (see above).

   E(W_N) = m(N + 1)/2,   Var(W_N) = mn(N + 1)/12

Asymptotic approximation (m, n large): W_N is approximately normal with mean m(N + 1)/2 and variance mn(N + 1)/12.

Note: The theoretical derivation of the null distribution relies on the assumption of continuous random variables (probability of ties equal to zero). Ties may of course exist in practice. Then the above distributions are only approximately valid, and the accuracy of the approximation decreases with the number of ties. In the literature there exist formulas which provide corrected critical values in the presence of ties.
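The statistic W_N and its normal approximation can be sketched as follows (no ties assumed; the helper names are illustrative):

```python
from math import erf, sqrt

def wmw_statistic(x, y):
    """W_N = sum of the ranks of the X-observations in the combined sample
    (assumes no ties between observations)."""
    combined = sorted(list(x) + list(y))
    return sum(combined.index(xi) + 1 for xi in x)

def wmw_normal_approx(w, m, n):
    """Two-sided p-value for W_N via the normal approximation with
    E(W_N) = m(N+1)/2 and Var(W_N) = mn(N+1)/12."""
    N = m + n
    mean = m * (N + 1) / 2
    var = m * n * (N + 1) / 12
    z = (w - mean) / sqrt(var)
    Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))  # standard normal d.f.
    return 2 * min(Phi(z), 1 - Phi(z))

x, y = [1.2, 3.4], [0.5, 2.0, 5.1]
W = wmw_statistic(x, y)           # ranks of 1.2 and 3.4 are 2 and 4, so W = 6
p = wmw_normal_approx(W, len(x), len(y))
```

The normal approximation is only reliable for moderately large m and n; for small samples the exact combinatorial null distribution should be used.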

The test by van der Waerden

The van der Waerden test relies on a special linear rank statistic with scores a_i = Φ^{−1}(i/(N+1)), where Φ is the distribution function of the standard normal distribution. This leads to the test statistic

    VW_N = Σ_{i=1}^N Φ^{−1}(i/(N+1))·V_i = Σ_{j=1}^m Φ^{−1}(r(X_j)/(N+1))

Critical values can again be obtained by using the general results for linear rank statistics.

Both tests mentioned in this section possess considerable power for detecting shifts in location. Asymptotic relative efficiencies are calculated with respect to restricted, parametrized classes of alternatives

    H_1: F_X(t) = F_Y(t − δ) for some δ ∈ IR and all t ∈ IR.

Power for detecting alternatives:

- Parametric t-test: additional assumption: normal distributions with equal variances, X ~ N(µ_1, σ²) and Y ~ N(µ_2, σ²). Two-sample t-test with test statistic

      T = (X̄ − Ȳ) / (S·√(1/n + 1/m))

  Under H_0 the statistic T follows a Student t-distribution with N − 2 degrees of freedom (rejection of H_0 if |T| is too large).

- The asymptotic relative efficiency of the Wilcoxon-Mann-Whitney test relative to the t-test is 3/π ≈ 0.955 if the underlying distributions are normal. The Wilcoxon-Mann-Whitney test is more efficient than the t-test for strongly skewed or heavy-tailed distributions. A lower bound for the asymptotic relative efficiency is 0.864; an upper bound does not exist.

- Assuming normal distributions, the asymptotic relative efficiency of the van der Waerden test relative to the t-test is equal to 1. If the distributions have heavy tails, then the Wilcoxon-Mann-Whitney test is more powerful than the van der Waerden test.

Scale alternatives: There are also rank tests specialized to detect whether one random variable is more dispersed than the other (scale alternatives). Such tests rely on the assumption that the centers of the distributions are equal, i.e., µ_{X,med} = µ_{Y,med} (which may be tested using a location test). Test statistics are linear rank statistics which assign small scores a_i to very small and very large observations, and large scores a_i to observations in the center of the distribution. The best known test in this context is the Siegel-Tukey test. It is based on the test statistic

    S_N = Σ_{i=1}^N a_i·V_i,

where the scores a_i are calculated as follows:

    a_1 = 1, a_N = 2, a_{N−1} = 3, a_2 = 4, a_3 = 5, a_{N−2} = 6, a_{N−3} = 7, a_4 = 8, a_5 = 9, a_{N−4} = 10, ...

The critical values of the Siegel-Tukey test coincide with the critical values of the Wilcoxon-Mann-Whitney test.
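Both score systems of this section are easy to generate. The following standard-library sketch (function names my own) computes the van der Waerden scores Φ^{−1}(i/(N+1)) and the Siegel-Tukey zigzag scores:

```python
from statistics import NormalDist

def van_der_waerden_scores(N):
    """a_i = Phi^{-1}(i/(N+1)), i = 1, ..., N."""
    inv = NormalDist().inv_cdf
    return [inv(i / (N + 1)) for i in range(1, N + 1)]

def siegel_tukey_scores(N):
    """Zigzag scores: a_1=1, a_N=2, a_{N-1}=3, a_2=4, a_3=5, a_{N-2}=6, ...
    Extreme order positions get small scores, central positions large ones."""
    scores = [0] * N
    lo, hi = 0, N - 1
    v, from_left, batch = 1, True, 1   # first batch: one value from the left end
    while v <= N:
        for _ in range(batch):
            if v > N:
                break
            if from_left:
                scores[lo] = v
                lo += 1
            else:
                scores[hi] = v
                hi -= 1
            v += 1
        from_left, batch = not from_left, 2  # then alternate in batches of two
    return scores
```

For N = 8 the Siegel-Tukey scores, listed by position in the ordered pooled sample, are [1, 4, 5, 8, 7, 6, 3, 2]; the van der Waerden scores are antisymmetric around zero.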

1.7 Multiple comparisons

In statistics, the multiple comparisons, multiplicity, or multiple testing problem occurs when one considers a set of statistical inferences simultaneously, or infers a subset of parameters selected based on the observed values. Errors in inference, such as confidence intervals that fail to include their corresponding population parameters, or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole. This is an important, although largely ignored, problem in applied econometric work. Empirical studies often perform dozens or even hundreds of tests on the same data set. When searching for significant test results, one may come up with false discoveries.

Multiple tests: in some studies many different tests are performed simultaneously.

Example: m different, independent tests of significance level α > 0 (independence means that the test statistics used are mutually independent; this is usually not true in practice). Assume that the respective null hypothesis H_0 holds for each of the m tests. Then

    P(Type I error by at least one of the m tests) = 1 − (1 − α)^m =: α_m > α
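The formula α_m = 1 − (1 − α)^m is easy to tabulate; a minimal sketch:

```python
def family_wise_error(alpha, m):
    """alpha_m = P(at least one false rejection among m independent level-alpha tests)."""
    return 1 - (1 - alpha) ** m

# alpha_m grows rapidly with the number m of tests (here alpha = 0.05):
table = {m: round(family_wise_error(0.05, m), 3) for m in (1, 5, 10, 50, 100)}
# table = {1: 0.05, 5: 0.226, 10: 0.401, 50: 0.923, 100: 0.994}
```

Already for m = 10 independent tests at level 0.05, the probability of at least one false rejection exceeds 40%.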

For α = 0.05 the effect is dramatic: α_m ≈ 0.40 for m = 10, α_m ≈ 0.92 for m = 50, and α_m ≈ 0.99 for m = 100 (!). Interpretation of "significant" results?

Analogous problem: construction of m (1 − α)-confidence intervals:

    P(at least one of the m confidence intervals does not contain the true parameter value) = 1 − (1 − α)^m > α

This represents the general problem of multiple comparisons. In practice it will not be true that all test statistics used are mutually independent. This complicates the problem even further. The probability of at least one falsely significant result still increases with the number m of tests, but it will no longer be equal to 1 − (1 − α)^m.

A statistically rigorous solution of this problem consists in modifying the construction of tests or confidence intervals in order to arrive at simultaneous tests or simultaneous confidence intervals:

    P(Type I error by at least one of the m tests) ≤ α

or

    P(all confidence intervals simultaneously contain the true parameter values) ≥ 1 − α

For certain problems (e.g. analysis of variance) there exist specific procedures for constructing simultaneous confidence intervals. The only generally applicable procedure seems to be the Bonferroni correction. It is based on Boole's inequality.

Theorem (Boole): Let A_1, A_2, ..., A_m denote m different events. Then

    P(A_1 ∪ A_2 ∪ ... ∪ A_m) ≤ Σ_{i=1}^m P(A_i).

With Ā_i denoting the complementary event "not A_i", this inequality also implies that

    P(A_1 ∩ A_2 ∩ ... ∩ A_m) ≥ 1 − Σ_{i=1}^m P(Ā_i).

Application: Bonferroni adjustment

m different tests of level α* = α/m:

    P(Type I error by at least one of the m tests) ≤ Σ_{i=1}^m α/m = α

Analogously: construction of m (1 − α*)-confidence intervals with α* = α/m:

    P(all confidence intervals simultaneously contain the true parameter values) ≥ 1 − Σ_{i=1}^m α/m = 1 − α

Example: Regression analysis

For n = 40 US corporations a multiple regression model is used to model the observed return on capital Y in dependence of 12 explanatory variables. After eliminating two outliers, the regression output reported estimates, standard errors, t-values and p-values for the intercept and the regressors WCFTCL (**), WCFTDT, GEARRAT, LOGSALE (*), LOGASST (*), NFATAST, CAPINT (*), FATTOT, INVTAST, PAYOUT, QUIKRAT, CURRAT, with significance codes 0 '***' 0.001 '**' 0.01 '*' 0.05, based on 25 residual degrees of freedom and an F-statistic on 12 and 25 DF. (The numerical values of the table are not reproduced here.) Four coefficients appear individually significant at the 5% level, but with 13 simultaneous tests the Bonferroni-adjusted threshold is 0.05/13 ≈ 0.0038.
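The Bonferroni adjustment described above can be sketched as a one-liner applied to a vector of p-values (hypothetical numbers):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i iff p_i <= alpha/m; guarantees family-wise error <= alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values from m = 5 tests:
pvals = [0.004, 0.03, 0.2, 0.011, 0.6]
unadjusted = [p <= 0.05 for p in pvals]          # naive level-0.05 decisions
adjusted = bonferroni_reject(pvals, alpha=0.05)  # threshold 0.05/5 = 0.01
```

Three tests look significant at the naive 5% level, but only one survives the Bonferroni adjustment.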

1.8 Maxima of a finite sequence of random variables

The problem of multiple comparisons is closely connected with the problem of bounding max_{i=1,...,m} X_i for a sequence of random variables. When the probability distribution of each random variable X_i is known, Boole's inequality can be used to obtain (fairly rough) stochastic bounds for max_{i=1,...,m} X_i. More precise results can, however, be obtained for some practically important special cases.

1.8.1 Maximum of a sample of bounded random variables

Let X_1, ..., X_n be an i.i.d. sample, and assume that for some (unknown) θ ∈ IR the underlying distribution possesses a density f with the following properties:

- f(θ) > 0 and f(x) = 0 for all x > θ
- for some ϵ > 0, f is continuous in the interval [θ − ϵ, θ].

This implies that all possible values of X are bounded by θ: P(X > θ) = 0.

Problem: estimate θ.

A natural estimator of θ is given by

    θ̂_n := X_(n) = max_{i=1,...,n} X_i

Note that with probability 1 we have θ̂_n < θ; θ̂_n is therefore a biased estimator of θ. It is fairly straightforward to derive the asymptotic distribution of θ̂_n. For any c > 0 we obtain

    P(n(θ − θ̂_n) ≤ c) = P(X_i ∈ [θ − c/n, θ] for some i = 1, ..., n)
                       = 1 − P(Σ_{i=1}^n I(X_i ∈ [θ − c/n, θ]) = 0)

But Σ_{i=1}^n I(X_i ∈ [θ − c/n, θ]) has a binomial distribution with parameters n and P(X ∈ [θ − c/n, θ]). Therefore,

    P(n(θ − θ̂_n) ≤ c) = 1 − (1 − P(X ∈ [θ − c/n, θ]))^n

But as n → ∞, P(X ∈ [θ − c/n, θ]) = f(θ)·c/n + o(1/n). Furthermore, it is well known that for any λ > 0

    lim_{n→∞} (1 − λ/n)^n = exp(−λ),

and consequently

    lim_{n→∞} (1 − P(X ∈ [θ − c/n, θ]))^n = lim_{n→∞} (1 − f(θ)c/n)^n = exp(−f(θ)c).

We can conclude that as n → ∞ the asymptotic distribution of n(θ − θ̂_n) is an exponential distribution with parameter f(θ):

    n(θ − θ̂_n) →_D Exp(f(θ))

This type of problem is quite important in economics. The above setup represents a simple form of an extreme value problem; such problems are, for example, important in finance. The estimation of (conditional) maxima of observations is the subject of production frontier analysis. The setup of frontier analysis can be described as follows: in an industrial sector there are usually a large number of competing companies. Each firm produces a production output X on the basis of several production inputs z ∈ IR^p. For a given input vector z there is a maximal output g(z) which can be produced based on the current state of technology. The function g(z) is called the production function. A firm with input vector z is efficient if its output equals g(z), and it is (to some degree) inefficient if its output is smaller than g(z). For a sample of measured production outputs X_1, ..., X_n the basic model can then be written as

    X_i = g(Z_i) + u_i, i = 1, ..., n,

where u_i is a negative random variable, i.e. P(u_i ≤ 0) = 1, which measures the degree of inefficiency. This in turn implies that

    P(X_i > g(z) | Z_i = z) = 0.

The situation described above corresponds to the trivial case p = 0 with no inputs and g(z) ≡ θ a fixed constant. In practice there will of course always exist a number p > 0 of important input variables, which leads to the much more complicated problem of estimating conditional maxima. Different estimation methods (e.g. data envelopment analysis) have been developed in deterministic frontier analysis. Procedures of stochastic frontier analysis are based on a variant of the above model which adds a normally distributed measurement error ϵ_i, i.e. it is assumed that X_i = g(Z_i) + u_i + ϵ_i, i = 1, ..., n. For an overview see e.g.

- Cooper, Seiford and Tone (2006): Introduction to Data Envelopment Analysis and its Uses, Springer Verlag
- Kumbhakar and Lovell (2000): Stochastic Frontier Analysis, Cambridge University Press
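The exponential limit n(θ − θ̂_n) →_D Exp(f(θ)) derived above can be illustrated by simulation. A sketch for U(0, 1) data, where θ = 1 and f(θ) = 1, so the limit is Exp(1):

```python
import random

random.seed(1)

def scaled_gap(n):
    """One draw of n*(theta - max_i X_i) for X_1, ..., X_n i.i.d. U(0,1), theta = 1."""
    return n * (1 - max(random.random() for _ in range(n)))

reps, n = 3000, 200
draws = [scaled_gap(n) for _ in range(reps)]

mean_gap = sum(draws) / reps                       # Exp(1) has mean 1
tail_frac = sum(1 for d in draws if d > 2) / reps  # P(Exp(1) > 2) = exp(-2) ≈ 0.135
```

Both the simulated mean and the simulated tail probability should be close to their Exp(1) counterparts, in line with the θ̂_n being n-consistent (not merely √n-consistent).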

Example of stochastic frontier analysis (p = 1): [figure]

1.8.2 Maximum of normal variables

Let X_1, ..., X_m be a collection of standard normal random variables, i.e. X_i ~ N(0, 1). Note that it is not assumed that the variables are independent.

Problem: establish a bound for sup_{i=1,...,m} X_i which is valid for large m.

We first establish a simple tail bound for a standard normal variable X: for any c > 0,

    P(X ≥ c) = (1/√(2π)) ∫_c^∞ exp(−t²/2) dt
             ≤ (1/√(2π)) ∫_c^∞ (t/c) exp(−t²/2) dt
             = (1/(c√(2π))) [−exp(−t²/2)]_c^∞
             = (1/(c√(2π))) exp(−c²/2)

Let A be some constant with A > √2. Using Boole's inequality we can then infer from the above bound that

    sup_{i=1,...,m} X_i ≤ A·√(log m)

holds with probability at least

    1 − (1/(A·√(2π·log m)))·m^{1−A²/2}.

Note that as m → ∞ this probability converges to 1.

This bound is heavily used in wavelet regression and high-dimensional model selection procedures like the Lasso. For example, assume a standard linear regression model with normal errors and a very large number m ≤ n of explanatory variables. For the estimated regression coefficients β̂_j we have

    √n·(β̂_j − β_j)/(σ·√(q_jj)) ~ N(0, 1), j = 1, ..., m,

where σ² is the error variance, and q_jj is the jth diagonal element of the matrix (1/n · X^T X)^{−1}, where in this case X is the n × m dimensional matrix of regressors. Hence, whenever β_j = 0 we have

    √n·β̂_j/(σ·√(q_jj)) ~ N(0, 1),

and the above bound implies that

    |√n·β̂_j/(σ·√(q_jj))| ≤ A·√(log m) for all j ∈ {1, ..., m} with β_j = 0

holds with high probability if m is large.
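Both steps can be checked numerically: the tail bound dominates the exact normal tail (computable via the complementary error function), and the maximum of m simulated standard normals stays below the threshold A·√(log m). A standard-library sketch:

```python
import math
import random

def tail_bound(c):
    """Bound exp(-c^2/2) / (c*sqrt(2*pi)) >= P(X >= c) for X ~ N(0,1), c > 0."""
    return math.exp(-c * c / 2) / (c * math.sqrt(2 * math.pi))

def normal_tail(c):
    """Exact P(X >= c) via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2))

# The bound dominates the exact tail probability:
checks = [(c, normal_tail(c), tail_bound(c)) for c in (1.0, 2.0, 3.0)]

# Maximum of m standard normals versus the threshold A*sqrt(log m):
random.seed(0)
m, A = 10_000, 2.0
sup = max(random.gauss(0.0, 1.0) for _ in range(m))
threshold = A * math.sqrt(math.log(m))  # about 6.07 for m = 10000
```

For m = 10000 the simulated maximum typically falls around 4, comfortably below the threshold of about 6.07, illustrating that the bound is valid but conservative.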

1.9 More on quantiles

Quantiles and quantile regression are an important empirical tool in risk analysis. For non-normal data, quantile regression offers a robust alternative to the usual least squares methods.

1.9.1 The check function

It is well known that if E(X²) < ∞, the mean µ = E(X) is obtained by minimizing squared loss:

    µ = arg min_{c∈IR} E((X − c)²).

If E(|X|) < ∞, then the median is obtained by minimizing L_1-loss (absolute deviations):

    µ_med = arg min_{c∈IR} E(|X − c|).

The condition E(|X|) < ∞ can be avoided by rewriting the minimization problem in the (otherwise equivalent) form

    µ_med = arg min_{c∈IR} E(|X − c| − |X|).

Note that E(|X − c| − |X|) < ∞ for any real-valued random variable X and every c ∈ IR. In general, for every τ ∈ (0, 1) the τ-quantile Q(τ) can be obtained by minimizing expected loss with respect to an asymmetric linear loss function based on the check function

    ρ_τ(u) = (τ − I(u < 0))·u, u ∈ IR
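The check function can be put to work immediately: minimizing the empirical check loss over candidate values q reproduces sample quantiles, as formalized later in this section. A standard-library sketch (names my own; for the empirical loss a minimizer is always attained at one of the observations, since the objective is piecewise linear with kinks at the data points):

```python
def rho(tau, u):
    """Check function rho_tau(u) = (tau - I(u < 0)) * u."""
    return (tau - (1 if u < 0 else 0)) * u

def empirical_check_loss(tau, data, q):
    """Sum_i rho_tau(X_i - q)."""
    return sum(rho(tau, x - q) for x in data)

def quantile_by_minimization(tau, data):
    """Minimize the empirical check loss over the observations themselves."""
    return min(data, key=lambda q: empirical_check_loss(tau, data, q))

data = [3, 1, 4, 1, 5, 9, 2, 6]
q30 = quantile_by_minimization(0.30, data)  # tau = 0.3 gives q30 = 2
```

For τ = 0.5 this reduces to minimizing half the sum of absolute deviations, i.e. the sample median.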

Q(τ) then minimizes

    V_τ(q) := E(ρ_τ(X − q)) = τ·E(|X − q|·I(X > q)) + (1 − τ)·E(|X − q|·I(X < q))
            = τ·∫_{x>q} |x − q| dF(x) + (1 − τ)·∫_{x<q} |x − q| dF(x)

over all q ∈ IR. Note that if τ = 1/2, then E(ρ_τ(X − q)) = (1/2)·E(|X − q|). Also in the general case, moment conditions on X can be avoided by formally considering the (otherwise equivalent) problem of minimizing E(ρ_τ(X − q)) − E(ρ_τ(X)) with respect to q.

In order to verify that Q(τ) is indeed a minimizer of the above minimization problem, let us analyze the structure of V_τ(q). The following arguments also apply to the modified version Ṽ_τ(q) = E(ρ_τ(X − q)) − E(ρ_τ(X)) (to be used if E(|X|) does not exist). It is easily seen that V_τ(q) = E(ρ_τ(X − q)) is a continuous function of q. If F is continuous at q, then V_τ is differentiable at q. If P(X = q) > 0, then F(u) has a jump at u = q, while V_τ(u) has a kink (i.e. is not differentiable) at u = q. One can, however, always define directional derivatives, i.e. right and left derivatives, by considering the limits of V_τ(q + Δ) and V_τ(q − Δ) as Δ ↓ 0. More precisely, for the right derivative we have

    ∂V_τ(q)/∂q⁺ = lim_{Δ↓0} (V_τ(q + Δ) − V_τ(q))/Δ = −τ·P(X > q) + (1 − τ)·P(X ≤ q) = −τ + P(X ≤ q),

while the left derivative is given by

    ∂V_τ(q)/∂q⁻ = lim_{Δ↓0} (V_τ(q) − V_τ(q − Δ))/Δ = −τ·P(X ≥ q) + (1 − τ)·P(X < q) = −τ + P(X < q).

Now let Q(τ) denote a τ-quantile of X, so that F(Q(τ)) ≥ τ. Note that for any q ∈ IR with F(q) > F(Q(τ)) ≥ τ we necessarily have q > Q(τ), and hence also P(X < q) ≥ F(Q(τ)) ≥ τ. Therefore,

    ∂V_τ(q)/∂q⁺ > 0 and ∂V_τ(q)/∂q⁻ ≥ 0 for any q ∈ IR with F(q) > F(Q(τ)),

while

    ∂V_τ(q)/∂q⁺ < 0 and ∂V_τ(q)/∂q⁻ < 0 for any q ∈ IR with F(q) < τ.

This implies that Q(τ) minimizes V_τ(q). If X is a continuous random variable, then necessarily F(Q(τ)) = τ, and any solution hence satisfies the first order condition

    0 = −τ + P(X ≤ Q(τ)) = −τ + F(Q(τ)).

Recall from the definition of quantiles that the solution is not necessarily unique. If F has constant segments, there may exist an interval of possible values for Q(τ). But Q(τ) is necessarily unique if F is continuous and if the corresponding density f satisfies f(Q(τ)) > 0.

Let X_1, ..., X_n be an i.i.d. random sample from X. The above arguments also imply that sample quantiles Q_n(τ), τ ∈ (0, 1), can be obtained by minimizing ρ_τ with respect to the empirical

distribution function. Any possible value Q_n(τ) minimizes

    V_{τ,n}(q) := (1/n)·Σ_{i=1}^n ρ_τ(X_i − q)
                = τ·(1/n)·Σ_{i:X_i>q} |X_i − q| + (1 − τ)·(1/n)·Σ_{i:X_i<q} |X_i − q|
                = τ·∫_{x>q} |x − q| dF_n(x) + (1 − τ)·∫_{x<q} |x − q| dF_n(x)

Assume that the distribution of X possesses a density f with f(Q(τ)) > 0. Then Q(τ) is unique, and it is easy to show that Q_n(τ) is a consistent estimator of Q(τ). Furthermore,

    √n·(Q_n(τ) − Q(τ)) →_D N(0, τ(1 − τ)/f(Q(τ))²)

1.9.2 Quantile regression

Quantile regression plays an increasingly important role in econometrics. It opens a way to explore regression relationships in depth: much more information can be obtained than by traditional least squares regression, which only aims to quantify a conditional mean. A further crucial property is robustness. In particular, median regression is preferable to least squares regression when dealing with heavy-tailed distributions.

Assume an i.i.d. sample (Y_1, X_1), ..., (Y_n, X_n), where Y_i ∈ IR is a response variable of interest, while X_i ∈ IR^k is a vector of explanatory variables. We are now interested in determining quantiles of the conditional distribution of Y given X. For any vector x ∈ IR^k there is a conditional distribution function

    F_{Y|X=x}(y) = P(Y ≤ y | X = x)

and a corresponding conditional quantile function Q_{Y|X=x}(τ), τ ∈ (0, 1). In the following we will assume that all conditional distribution functions are continuous, which implies the existence of conditional densities f_{Y|X=x}(·).

Note that if Y and X are independent, then F_{Y|X=x} = F_Y and Q_{Y|X=x}(·) = Q_Y(·) for all x ∈ IR^k, where F_Y and Q_Y denote the (marginal) distribution and quantile functions of Y, respectively. Otherwise, F_{Y|X=x} and Q_{Y|X=x} will depend on the value X = x.

Standard quantile regression now rests upon the assumption that for a given τ ∈ (0, 1)

    Q_{Y|X=x}(τ) = x^T β_τ for some β_τ ∈ IR^k

If this assumption holds for all τ ∈ (0, 1), we arrive at the general model

    Y_i = X_i^T β(Z_i),

where the random variable Z_i ~ U(0, 1) is independent of X_i, and β: (0, 1) → IR^k is a measurable function such that β(τ) = β_τ.

Special cases:

1) Simple OLS model with X_i = (1, X_{i1}, ..., X_{i,k−1})^T:

    Y_i = β_1 + Σ_{j=2}^k β_j X_{ij} + ϵ_i, i = 1, ..., n,

where ϵ_1, ..., ϵ_n are i.i.d. errors with continuous, strictly monotonically increasing distribution function F_ϵ. Then

    Q_{Y|X=X_i}(τ) = β_1 + Σ_{j=2}^k β_j X_{ij} + F_ϵ^{−1}(τ) = (β_1 + F_ϵ^{−1}(τ)) + Σ_{j=2}^k β_j X_{ij},

with β_{τ,1} = β_1 + F_ϵ^{−1}(τ).

2) Heteroskedastic errors:

    Y_i = Σ_{j=1}^k β_j X_{ij} + (Σ_{j=1}^k γ_j X_{ij})·ϵ_i, i = 1, ..., n,

where ϵ_1, ..., ϵ_n are i.i.d. errors with continuous, strictly monotonically increasing distribution function F_ϵ, and β_j, γ_j ∈ IR, j = 1, ..., k. Then

    Q_{Y|X=X_i}(τ) = Σ_{j=1}^k β_j X_{ij} + (Σ_{j=1}^k γ_j X_{ij})·F_ϵ^{−1}(τ) = Σ_{j=1}^k β_{τ,j} X_{ij}

for β_{τ,j} = β_j + γ_j·F_ϵ^{−1}(τ), j = 1, ..., k.

A remarkable property of this approach is its equivariance to monotone transformations: for a nondecreasing function h we have

    Q_{h(Y)|X=X_i}(τ) = h(Q_{Y|X=X_i}(τ)).

For example, if α_τ + x^T β_τ is the τth conditional quantile of log Y, then exp(α_τ + x^T β_τ) is the τth conditional quantile of Y.

The coefficients β_τ ∈ IR^k can be estimated by using the check function approach. Estimates β̂_τ are determined by minimizing

    V_{τ,n}(β) := Σ_{i=1}^n ρ_τ(Y_i − X_i^T β) = τ·Σ_{i:Y_i>X_i^T β} |Y_i − X_i^T β| + (1 − τ)·Σ_{i:Y_i<X_i^T β} |Y_i − X_i^T β|

over all β ∈ IR^k. An important special case is median regression, i.e. τ = 0.5. In this case one also speaks of LAD regression (LAD: least absolute deviations). Obviously,

    V_{0.5,n}(β) = (1/2)·Σ_{i=1}^n |Y_i − X_i^T β|
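For the simple case k = 2 (intercept and slope), the piecewise-linear objective V_{0.5,n} always attains a minimum at a line passing through at least two observations, so for small n the LAD fit can be found by brute force over all pairs of data points. A sketch under that assumption (names my own):

```python
from itertools import combinations

def lad_loss(a, b, pts):
    """Sum of absolute residuals; equals 2 * V_{0.5,n} at beta = (a, b)."""
    return sum(abs(y - a - b * x) for x, y in pts)

def lad_fit(pts):
    """Brute-force median regression over all interpolating lines (O(n^3))."""
    best = None
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2:
            continue  # vertical pairs define no slope
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = lad_loss(a, b, pts)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]

# Six points exactly on y = 2 + 3x plus one gross outlier:
pts = [(0, 2), (1, 5), (2, 8), (3, 11), (4, 14), (5, 17), (2.5, 100)]
a, b = lad_fit(pts)  # (2.0, 3.0): the outlier does not move the LAD line
```

An OLS fit of the same data would be pulled strongly towards the outlier; the robustness of median regression is exactly the point made above.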


More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Chapter 2: Resampling Maarten Jansen

Chapter 2: Resampling Maarten Jansen Chapter 2: Resampling Maarten Jansen Randomization tests Randomized experiment random assignment of sample subjects to groups Example: medical experiment with control group n 1 subjects for true medicine,

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

Notes on MAS344 Computational Statistics School of Mathematical Sciences, QMUL. Steven Gilmour

Notes on MAS344 Computational Statistics School of Mathematical Sciences, QMUL. Steven Gilmour Notes on MAS344 Computational Statistics School of Mathematical Sciences, QMUL Steven Gilmour 2006 2007 ii Preface MAS344 Computational Statistics is a third year course given at the School of Mathematical

More information

large number of i.i.d. observations from P. For concreteness, suppose

large number of i.i.d. observations from P. For concreteness, suppose 1 Subsampling Suppose X i, i = 1,..., n is an i.i.d. sequence of random variables with distribution P. Let θ(p ) be some real-valued parameter of interest, and let ˆθ n = ˆθ n (X 1,..., X n ) be some estimate

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

NAG Library Chapter Introduction. G08 Nonparametric Statistics

NAG Library Chapter Introduction. G08 Nonparametric Statistics NAG Library Chapter Introduction G08 Nonparametric Statistics Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric Hypothesis Testing... 2 2.2 Types

More information

The Nonparametric Bootstrap

The Nonparametric Bootstrap The Nonparametric Bootstrap The nonparametric bootstrap may involve inferences about a parameter, but we use a nonparametric procedure in approximating the parametric distribution using the ECDF. We use

More information

Bootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University babu.

Bootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University  babu. Bootstrap G. Jogesh Babu Penn State University http://www.stat.psu.edu/ babu Director of Center for Astrostatistics http://astrostatistics.psu.edu Outline 1 Motivation 2 Simple statistical problem 3 Resampling

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Cramér-Type Moderate Deviation Theorems for Two-Sample Studentized (Self-normalized) U-Statistics. Wen-Xin Zhou

Cramér-Type Moderate Deviation Theorems for Two-Sample Studentized (Self-normalized) U-Statistics. Wen-Xin Zhou Cramér-Type Moderate Deviation Theorems for Two-Sample Studentized (Self-normalized) U-Statistics Wen-Xin Zhou Department of Mathematics and Statistics University of Melbourne Joint work with Prof. Qi-Man

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Dr. Maddah ENMG 617 EM Statistics 10/12/12. Nonparametric Statistics (Chapter 16, Hines)

Dr. Maddah ENMG 617 EM Statistics 10/12/12. Nonparametric Statistics (Chapter 16, Hines) Dr. Maddah ENMG 617 EM Statistics 10/12/12 Nonparametric Statistics (Chapter 16, Hines) Introduction Most of the hypothesis testing presented so far assumes normally distributed data. These approaches

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

the error term could vary over the observations, in ways that are related

the error term could vary over the observations, in ways that are related Heteroskedasticity We now consider the implications of relaxing the assumption that the conditional variance Var(u i x i ) = σ 2 is common to all observations i = 1,..., n In many applications, we may

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Comparison of Two Samples

Comparison of Two Samples 2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation

More information

11. Bootstrap Methods

11. Bootstrap Methods 11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

More information

Exam Empirical Methods VU University Amsterdam, Faculty of Exact Sciences h, February 12, 2015

Exam Empirical Methods VU University Amsterdam, Faculty of Exact Sciences h, February 12, 2015 Exam Empirical Methods VU University Amsterdam, Faculty of Exact Sciences 18.30 21.15h, February 12, 2015 Question 1 is on this page. Always motivate your answers. Write your answers in English. Only the

More information

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators Thilo Klein University of Cambridge Judge Business School Session 4: Linear regression,

More information

Statistics. Statistics

Statistics. Statistics The main aims of statistics 1 1 Choosing a model 2 Estimating its parameter(s) 1 point estimates 2 interval estimates 3 Testing hypotheses Distributions used in statistics: χ 2 n-distribution 2 Let X 1,

More information

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. Preface p. xi Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. 6 The Scientific Method and the Design of

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics Chapter 6 Order Statistics and Quantiles 61 Extreme Order Statistics Suppose we have a finite sample X 1,, X n Conditional on this sample, we define the values X 1),, X n) to be a permutation of X 1,,

More information

Resampling and the Bootstrap

Resampling and the Bootstrap Resampling and the Bootstrap Axel Benner Biostatistics, German Cancer Research Center INF 280, D-69120 Heidelberg benner@dkfz.de Resampling and the Bootstrap 2 Topics Estimation and Statistical Testing

More information

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

Section 3: Permutation Inference

Section 3: Permutation Inference Section 3: Permutation Inference Yotam Shem-Tov Fall 2015 Yotam Shem-Tov STAT 239/ PS 236A September 26, 2015 1 / 47 Introduction Throughout this slides we will focus only on randomized experiments, i.e

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

Introduction Robust regression Examples Conclusion. Robust regression. Jiří Franc

Introduction Robust regression Examples Conclusion. Robust regression. Jiří Franc Robust regression Robust estimation of regression coefficients in linear regression model Jiří Franc Czech Technical University Faculty of Nuclear Sciences and Physical Engineering Department of Mathematics

More information

A Practitioner s Guide to Cluster-Robust Inference

A Practitioner s Guide to Cluster-Robust Inference A Practitioner s Guide to Cluster-Robust Inference A. C. Cameron and D. L. Miller presented by Federico Curci March 4, 2015 Cameron Miller Cluster Clinic II March 4, 2015 1 / 20 In the previous episode

More information

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference 1 / 171 Bootstrap inference Francisco Cribari-Neto Departamento de Estatística Universidade Federal de Pernambuco Recife / PE, Brazil email: cribari@gmail.com October 2013 2 / 171 Unpaid advertisement

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown Nonparametric Statistics Leah Wright, Tyler Ross, Taylor Brown Before we get to nonparametric statistics, what are parametric statistics? These statistics estimate and test population means, while holding

More information

EXAMINERS REPORT & SOLUTIONS STATISTICS 1 (MATH 11400) May-June 2009

EXAMINERS REPORT & SOLUTIONS STATISTICS 1 (MATH 11400) May-June 2009 EAMINERS REPORT & SOLUTIONS STATISTICS (MATH 400) May-June 2009 Examiners Report A. Most plots were well done. Some candidates muddled hinges and quartiles and gave the wrong one. Generally candidates

More information

Design of the Fuzzy Rank Tests Package

Design of the Fuzzy Rank Tests Package Design of the Fuzzy Rank Tests Package Charles J. Geyer July 15, 2013 1 Introduction We do fuzzy P -values and confidence intervals following Geyer and Meeden (2005) and Thompson and Geyer (2007) for three

More information

TESTS BASED ON EMPIRICAL DISTRIBUTION FUNCTION. Submitted in partial fulfillment of the requirements for the award of the degree of

TESTS BASED ON EMPIRICAL DISTRIBUTION FUNCTION. Submitted in partial fulfillment of the requirements for the award of the degree of TESTS BASED ON EMPIRICAL DISTRIBUTION FUNCTION Submitted in partial fulfillment of the requirements for the award of the degree of MASTER OF SCIENCE IN MATHEMATICS AND COMPUTING Submitted by Gurpreet Kaur

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Problem 1 (20) Log-normal. f(x) Cauchy

Problem 1 (20) Log-normal. f(x) Cauchy ORF 245. Rigollet Date: 11/21/2008 Problem 1 (20) f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 4 2 0 2 4 Normal (with mean -1) 4 2 0 2 4 Negative-exponential x x f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.5

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information