One-Sample Numerical Data
quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html

One-sample numerical data

We assume we have data of the form X_1, ..., X_n, real-valued. Typically, we assume that these are sampled from the same population, which means the variables are iid, and that the underlying distribution is continuous.

Example. In 1882 Simon Newcomb conducted experiments to measure the speed of light. The light had to travel 3,721 meters and the time it took to do so was measured. This was repeated n = 66 times. The time measured on the i-th trial is X_i \cdot 10^{-3} + 24.8 in millionths of a second, where X_1, ..., X_n are displayed below:

28 26 33 24 34 -44 27 16 40 -2 29 22 24 21 25 30
23 29 31 19 24 20 36 32 36 28 25 21 28 29 37 25
28 26 30 32 36 26 30 22 36 23 27 27 28 27 31 27
26 33 26 32 32 24 39 28 24 25 32 25 29 27 28 29
16 23

Summary statistics. There are two main types of summary statistics:
Location: mean, median, quantiles/percentiles, etc.
Scale: standard deviation, median absolute deviation, etc.

Graphics. There are various ways of plotting these summary statistics and other relevant quantities. Popular options are:
A boxplot, a schematic view of the main quantiles.
A histogram, an approximation to the density (PDF).
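As an illustration (a minimal Python sketch, not part of the original slides), the location and scale summaries above can be computed on Newcomb's measurements with the standard library:

```python
import statistics

# Newcomb's 66 coded measurements (passage time = X/1000 + 24.8 millionths of a second)
newcomb = [28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29, 22, 24, 21, 25, 30,
           23, 29, 31, 19, 24, 20, 36, 32, 36, 28, 25, 21, 28, 29, 37, 25,
           28, 26, 30, 32, 36, 26, 30, 22, 36, 23, 27, 27, 28, 27, 31, 27,
           26, 33, 26, 32, 32, 24, 39, 28, 24, 25, 32, 25, 29, 27, 28, 29,
           16, 23]

sample_mean = statistics.mean(newcomb)
sample_median = statistics.median(newcomb)
sample_sd = statistics.stdev(newcomb)        # uses the 1/(n-1) convention

# median absolute deviation about the median
mad = statistics.median(abs(x - sample_median) for x in newcomb)
```

The two negative outliers (-44 and -2) pull the mean below the median and inflate the standard deviation, while the robust summaries (median, MAD) are much less affected.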
Location statistics

Suppose we have a sample X_1, ..., X_n in R.

The sample mean is defined as
\bar{X} = \text{mean}(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i

The sample median is defined as follows. Order the sample to get X_{(1)} \le \cdots \le X_{(n)}. (These are called the order statistics.) Then
\text{median}(X_1, \dots, X_n) = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{X_{(n/2)} + X_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}

The sample quantiles may be defined as follows. Let
p_i = \frac{i-1}{n-1}, \quad i = 1, \dots, n
For \alpha \in [0,1], let i be such that \alpha \in [p_i, p_{i+1}]. Then there is b \in [0,1] such that
\alpha = (1-b)\, p_i + b\, p_{i+1}
The sample \alpha-quantile is defined as
(1-b)\, X_{(i)} + b\, X_{(i+1)}
Examples: 1st quartile (\alpha = 0.25), median (\alpha = 0.5), 3rd quartile (\alpha = 0.75).

Scale statistics

The sample standard deviation is the square root of the sample variance, defined as
S^2 = \text{Var}(X_1, \dots, X_n) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2

The median absolute deviation (MAD) is defined as
\text{MAD}(X_1, \dots, X_n) = \text{median}(|X_1 - M|, \dots, |X_n - M|)
where M = median(X_1, ..., X_n).
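A direct Python transcription of these definitions (a sketch, not from the slides; the interpolation rule p_i = (i-1)/(n-1) coincides with R's default, type 7):

```python
def quantile(data, alpha):
    """Sample alpha-quantile via linear interpolation between order statistics,
    using the grid p_i = (i-1)/(n-1) as in the definition above."""
    x = sorted(data)
    n = len(x)
    if n == 1:
        return x[0]
    h = alpha * (n - 1)      # position of alpha on the grid p_1 = 0, ..., p_n = 1
    i = int(h)               # zero-based index of the lower order statistic
    b = h - i                # interpolation weight
    if i >= n - 1:
        return x[-1]
    return (1 - b) * x[i] + b * x[i + 1]

def mad(data):
    """Median absolute deviation about the sample median."""
    m = quantile(data, 0.5)
    return quantile([abs(x - m) for x in data], 0.5)
```

For data [0, 1, 2, 3, 4], the grid is p = (0, 0.25, 0.5, 0.75, 1), so the first quartile lands exactly on the second order statistic.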
Boxplot

A boxplot helps visualize how the data are spread out.
The box represents the inter-quartile range (IQR), containing the middle 50% of the data.
The upper edge (hinge) of the box indicates the 75th percentile; the lower hinge indicates the 25th percentile.
The line within the box indicates the median (50th percentile).
The top whisker is at the largest observation within 1.5 times the length of the IQR from the top of the box, and similarly for the bottom whisker. (The 1.5 factor is tailored to the normal distribution.)
The observations falling outside the whiskers are plotted as points and may be suspected of being outliers (at least if the underlying distribution is normal).

Histogram

A histogram is a piecewise-constant estimate of the population probability density function (PDF). It works as follows: the data are binned, and the histogram is the barplot of the bin counts.

Suppose the bins are the intervals I_s = (a_{s-1}, a_s], where
-\infty = a_0 < a_1 < a_2 < \cdots < a_{S-1} < a_S = \infty
The number of observations in the s-th bin is
N_s = \#\{i : X_i \in I_s\}
The histogram based on this choice of bins is the barplot of N_1, ..., N_S.

Student confidence interval for the mean

Suppose we have a sample X_1, ..., X_n in R. Suppose the underlying distribution has a well-defined mean \mu and that we want to compute a (1 - \alpha)-confidence interval for \mu.

First assume that the distribution is normal N(\mu, \sigma^2), with variance \sigma^2 unknown, as is often the case. The (two-sided) Student (1 - \alpha)-confidence interval for \mu is
\bar{X} \pm t_{n-1}^{(\alpha/2)} \frac{S}{\sqrt{n}}
where t_m^{(\alpha)} is the \alpha-quantile of the t-distribution with m degrees of freedom.
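The Student interval above can be sketched as follows (Python, not from the slides). The t critical value is hard-coded here; 2.262 is the standard tabulated 0.975-quantile for 9 degrees of freedom. In practice one would obtain it from a table or a library such as scipy.

```python
import math
import statistics

def student_ci(data, t_crit):
    """Two-sided Student confidence interval for the mean, given the upper
    critical value t_crit of the t-distribution with len(data)-1 df."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)           # sample sd with the 1/(n-1) convention
    half = t_crit * s / math.sqrt(n)
    return xbar - half, xbar + half

# illustrative (made-up) data; t_crit = 2.262 is the 0.975-quantile with 9 df
sample = [4.1, 5.2, 6.3, 4.8, 5.5, 5.0, 4.9, 5.7, 6.0, 5.3]
low, high = student_ci(sample, 2.262)
```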
This interval hinges on the fact that
T = \frac{\bar{X} - \mu}{S/\sqrt{n}}
has the t-distribution with n - 1 degrees of freedom when the sample is normal. Indeed, for any a < b,
P\left( \bar{X} + a \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + b \frac{S}{\sqrt{n}} \right) = P(-b \le T \le -a)

The confidence level is exact if the population is indeed normal. It is asymptotically correct if the population has a finite variance, because of the Central Limit Theorem (CLT). In practice, it is approximately correct if the sample is large enough and the underlying distribution is not too asymmetric or heavy-tailed.

The nonparametric bootstrap interval for the mean

This procedure is nonparametric: it does not assume a particular parametric model for the distribution of the data. The idea is to use resampling to estimate the distribution of the t-ratio.

Define the sample (aka empirical) distribution as the uniform distribution over the sample, denoted by \hat{F}. Generating an iid sample of size k from the empirical distribution is done by sampling with replacement k times from the data {X_1, ..., X_n}.

Note that even if all the observations X_1, ..., X_n are distinct, a sample from the empirical distribution may contain many repeats and may not include all the observations.
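Sampling from the empirical distribution is one line in Python (a sketch, not from the slides):

```python
import random

random.seed(0)
data = [28, 26, 33, 24, 34, 27, 16, 40, 29, 22]   # any observed sample

# an iid draw of size n from the empirical distribution F-hat
# = sampling with replacement n times from the data
boot_sample = random.choices(data, k=len(data))
```

Even though all ten observations above are distinct, `boot_sample` will typically contain repeats and omit some of them.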
Let B be a large integer.
1. For b = 1, ..., B, do the following:
(a) Generate X_1^{(b)}, \dots, X_n^{(b)} iid from \hat{F}.
(b) Compute the corresponding t-ratio
T_b = \frac{\bar{X}_b^* - \bar{X}}{S_b^*/\sqrt{n}}, \quad \text{where } \bar{X}_b^* = \frac{1}{n} \sum_{i=1}^n X_i^{(b)}, \quad S_b^{*2} = \frac{1}{n-1} \sum_{i=1}^n (X_i^{(b)} - \bar{X}_b^*)^2
2. Compute t_{\text{boot}}^{(\alpha)}, the \alpha-quantile of \{T_b : b = 1, \dots, B\}.

A bootstrap (1 - \alpha)-CI for \mu = mean(F) is
\left[ \bar{X} + t_{\text{boot}}^{(\alpha/2)} \frac{S}{\sqrt{n}},\; \bar{X} + t_{\text{boot}}^{(1-\alpha/2)} \frac{S}{\sqrt{n}} \right]
Note that the confidence level is not exact.

Confidence interval for the median

Suppose we want to compute a (1 - \alpha)-CI for the median, denoted \theta. (What we do here applies in the same way to any other quantile.) The sample median is asymptotically unbiased and asymptotically normal, but its asymptotic variance depends on the underlying density function, which is unknown.

Confidence interval for the median based on the sample quantiles

Suppose we have a sample X_1, ..., X_n in R. Assume that the underlying distribution is continuous. Define q_k = P(X_{(k)} \le \theta). The q_k are then independent of the underlying distribution. Indeed,
q_k = P(\#\{i : X_i \le \theta\} \ge k) = P(\text{Bin}(n, 1/2) \ge k)
since \theta is the median. This is interesting because, for k < l,
P(X_{(k)} \le \theta \le X_{(l)}) = P(X_{(k)} \le \theta) - P(X_{(l)} < \theta) = q_k - q_l
Choosing k largest such that q_k \ge 1 - \alpha/2 and l smallest such that q_l \le \alpha/2, we obtain a (1 - \alpha)-CI for \theta. The interval is conservative, meaning that the confidence level is at least 1 - \alpha.
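The order-statistic interval for the median can be computed exactly, since the binomial tail probabilities q_k are distribution-free (a Python sketch, not from the slides):

```python
from math import comb

def median_ci(data, alpha=0.05):
    """Conservative (1 - alpha)-CI for the median from order statistics,
    using q_k = P(Bin(n, 1/2) >= k), which is distribution-free."""
    x = sorted(data)
    n = len(x)
    def q(k):                       # P(Bin(n, 1/2) >= k)
        return sum(comb(n, j) for j in range(k, n + 1)) / 2**n
    k = max(k for k in range(1, n + 1) if q(k) >= 1 - alpha / 2)
    l = min(l for l in range(1, n + 1) if q(l) <= alpha / 2)
    return x[k - 1], x[l - 1], k, l   # CI endpoints X_(k), X_(l)

lo, hi, k, l = median_ci(list(range(1, 11)))   # data 1, ..., 10
```

For n = 10 and alpha = 0.05 this yields the interval [X_(2), X_(9)], a standard textbook value.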
The bootstrap variance estimate

Suppose we have a sample X_1, ..., X_n iid from F and want to estimate the variance of a statistic D = \Lambda(X_1, \dots, X_n). We have several options, depending on what information we have access to.

We can compute it by integration if F (or its density) is known in closed form.

We can compute it by Monte Carlo integration if we can simulate from F. Let B be a large integer.
1. For b = 1, ..., B:
(a) Sample X_1^{(b)}, \dots, X_n^{(b)} iid from F.
(b) Compute D_b = \Lambda(X_1^{(b)}, \dots, X_n^{(b)}).
2. Compute the sample mean and variance
\bar{D} = \frac{1}{B} \sum_{b=1}^B D_b, \quad \hat{se}^2_{\text{MC}} = \frac{1}{B-1} \sum_{b=1}^B (D_b - \bar{D})^2
(MC = Monte Carlo)

We can estimate it by the nonparametric bootstrap. The procedure is the same as above, except that we sample from \hat{F} (the sample distribution) instead of F (the population distribution). The nonparametric bootstrap acts as if the sample were the population. Let \hat{se}_{\text{boot}} denote the bootstrap variance estimate.

Note that the other two options do not require a sample.
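The bootstrap version of this recipe can be sketched as follows (Python, not from the slides), here estimating the standard error of the sample median:

```python
import random
import statistics

def bootstrap_se(data, stat, B=1000, seed=0):
    """Nonparametric bootstrap estimate of the standard error of stat(data):
    resample from the empirical distribution B times and take the sample sd
    of the replicated statistics."""
    rng = random.Random(seed)
    reps = [stat(rng.choices(data, k=len(data))) for _ in range(B)]
    return statistics.stdev(reps)

data = [28, 26, 33, 24, 34, 27, 16, 40, 29, 22, 25, 30, 23, 31, 19, 24]
se_median = bootstrap_se(data, statistics.median)
```

Replacing `rng.choices(data, ...)` with draws from a known F would give the Monte Carlo estimate instead.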
Bootstrap confidence intervals

Consider a functional A and let \theta = A(F). For example, A(F) = median(F) or A(F) = MAD(F), etc. Suppose we want a (1 - \alpha)-confidence interval for \theta. Define \hat{\theta} = A(\hat{F}), which is the plug-in estimate for \theta.

The bootstrap procedure is based on generating many bootstrap samples and computing the statistic of interest on each sample. Let B be a large integer. For b = 1, ..., B, do the following:
1. Generate X_1^{(b)}, \dots, X_n^{(b)} iid from \hat{F}.
2. Compute \hat{\theta}_b^* = A(\hat{F}_b^*), where \hat{F}_b^* is the sample distribution of X_1^{(b)}, \dots, X_n^{(b)}.
Let \hat{\theta}_{(b)}^* denote the b-th smallest bootstrap statistic, so that
\hat{\theta}_{(1)}^* \le \cdots \le \hat{\theta}_{(B)}^*

Bootstrap pivotal confidence interval

The bootstrap pivotal confidence interval is
\left( 2\hat{\theta} - \hat{\theta}_{(B(1-\alpha/2))}^*,\; 2\hat{\theta} - \hat{\theta}_{(B\alpha/2)}^* \right)
This is justified by considering the pivot Z = \hat{\theta} - \theta. If \Psi(z) = P(Z \le z) and z_\alpha = \Psi^{-1}(\alpha), then
P(z_{\alpha/2} \le \hat{\theta} - \theta \le z_{1-\alpha/2}) = 1 - \alpha
equivalently,
\theta \in [\hat{\theta} - z_{1-\alpha/2},\; \hat{\theta} - z_{\alpha/2}]
with probability 1 - \alpha. We estimate \Psi by the bootstrap:
\hat{\Psi}(z) = \frac{1}{B} \sum_{b=1}^B I\{Z_b^* \le z\}, \quad \text{where } Z_b^* = \hat{\theta}_b^* - \hat{\theta}
(In practice, we only need the desired sample quantiles of Z_1^*, ..., Z_B^*.)
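The pivotal interval can be sketched as follows (Python, not from the slides; the bootstrap quantiles are taken as order statistics at positions ceil(B * alpha/2) and ceil(B * (1 - alpha/2))):

```python
import math
import random
import statistics

def pivotal_ci(data, stat, alpha=0.05, B=2000, seed=1):
    """Bootstrap pivotal (1 - alpha)-CI: (2*theta_hat - upper, 2*theta_hat - lower),
    where lower/upper are the alpha/2 and 1 - alpha/2 bootstrap quantiles."""
    rng = random.Random(seed)
    theta_hat = stat(data)
    reps = sorted(stat(rng.choices(data, k=len(data))) for _ in range(B))
    lo = reps[math.ceil(B * alpha / 2) - 1]
    hi = reps[math.ceil(B * (1 - alpha / 2)) - 1]
    return 2 * theta_hat - hi, 2 * theta_hat - lo

data = [28, 26, 33, 24, 34, 27, 16, 40, 29, 22, 25, 30, 23, 31, 19, 24]
low, high = pivotal_ci(data, statistics.median)
```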
Bootstrap Studentized pivotal confidence interval

Let B and C be two large integers. For b = 1, ..., B, do the following:
1. Generate X_1^{(b)}, \dots, X_n^{(b)} from \hat{F}. Let \hat{F}_b^* denote the corresponding empirical distribution.
2. Compute \hat{\theta}_b^* = A(\hat{F}_b^*).
3. For c = 1, ..., C, do the following: (2nd bootstrap loop)
(a) Generate X_1^{(b,c)}, \dots, X_n^{(b,c)} from \hat{F}_b^*. Let \hat{F}^{(b,c)} denote the corresponding empirical distribution.
(b) Compute \hat{\theta}^{(b,c)} = A(\hat{F}^{(b,c)}).
4. Compute
\bar{\theta}_b^* = \frac{1}{C} \sum_{c=1}^C \hat{\theta}^{(b,c)}, \quad \hat{se}_b^2 = \frac{1}{C-1} \sum_{c=1}^C (\hat{\theta}^{(b,c)} - \bar{\theta}_b^*)^2
5. Compute the t-ratio
T_b = \frac{\hat{\theta}_b^* - \hat{\theta}}{\hat{se}_b}
Note that \bar{\theta}_b^* is different from \hat{\theta}_b^*.

The bootstrap Studentized pivotal confidence interval is
(\hat{\theta} - t_{1-\alpha/2}^* \hat{se}_{\text{boot}},\; \hat{\theta} - t_{\alpha/2}^* \hat{se}_{\text{boot}})
where t_\alpha^* = T_{(\lceil B\alpha \rceil)} and \hat{se}_{\text{boot}} denotes the bootstrap estimate of standard error, in this case the sample standard deviation of \{\hat{\theta}_b^* : b = 1, \dots, B\}.

The rationale is as in the bootstrap pivotal confidence interval, except that instead of Z we use the pivot
T = \frac{\hat{\theta} - \theta}{\hat{se}_{\text{boot}}}
The standard deviation bootstrap estimate requires a bootstrap loop, and this is carried out for each bootstrap sample, giving rise to a double loop!

Bootstrap P-values

Suppose we want to test H_0: \theta = \theta_0 versus H_1: \theta \ne \theta_0. We can simply build a confidence interval using one of the aforementioned methods. If \hat{I}_{1-\alpha} is a bootstrap (1 - \alpha)-confidence interval, then
\text{P-value} = \sup\{\alpha : \theta_0 \in \hat{I}_{1-\alpha}\}
(We can perform one-sided testing by considering appropriate one-sided confidence intervals.)
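The Studentized double loop above can be sketched as follows (Python, not from the slides; B and C are kept small here only so the example runs quickly, and degenerate inner resamples with zero spread are simply skipped):

```python
import math
import random
import statistics

def studentized_ci(data, stat, alpha=0.05, B=400, C=25, seed=2):
    """Bootstrap Studentized pivotal CI via a double bootstrap loop.
    The inner loop of size C estimates the se of each outer replicate."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)
    thetas, ts = [], []
    for _ in range(B):
        boot = rng.choices(data, k=n)
        theta_b = stat(boot)
        inner = [stat(rng.choices(boot, k=n)) for _ in range(C)]
        se_b = statistics.stdev(inner)
        if se_b > 0:                      # guard against degenerate resamples
            thetas.append(theta_b)
            ts.append((theta_b - theta_hat) / se_b)
    ts.sort()
    se_boot = statistics.stdev(thetas)    # bootstrap se of the statistic
    t_lo = ts[math.ceil(len(ts) * alpha / 2) - 1]
    t_hi = ts[math.ceil(len(ts) * (1 - alpha / 2)) - 1]
    return theta_hat - t_hi * se_boot, theta_hat - t_lo * se_boot

data = [28, 26, 33, 24, 34, 27, 16, 40, 29, 22, 25, 30, 23, 31, 19, 24]
low, high = studentized_ci(data, statistics.mean)
```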
Other tests

The sign test is a test for the median. (It is equivalent to testing via the exact confidence interval for the median.)

The (Wilcoxon) signed-rank test is a test for symmetry. Testing for symmetry is equivalent to testing for the median when the underlying distribution is assumed to be symmetric.

Both tests are distribution-free in the sense that, in each situation, the distribution of the test statistic does not depend on the underlying distribution, as long as it satisfies the null hypothesis.

Goodness-of-fit testing for a given null distribution

Beyond questions about specific parameters (mean, median, etc.), one may want to check whether the population comes from a hypothesized distribution, or family of distributions. This leads to goodness-of-fit (GOF) testing.

We observe an iid numerical sample X_1, ..., X_n with CDF F. Given a null distribution F_0, we test
H_0: F = F_0 \quad \text{versus} \quad H_1: F \ne F_0

Graphics. Besides comparing densities via histograms or comparing distribution functions visually, a quantile-quantile (Q-Q) plot is a popular option. It plots the sample quantiles versus the quantiles of F_0.

The chi-squared goodness-of-fit test

This test amounts to applying the chi-squared GOF test after binning the data. Suppose the bins are the intervals I_s = (a_{s-1}, a_s], where
-\infty = a_0 < a_1 < a_2 < \cdots < a_{S-1} < a_S = \infty
We consider the discrete variables
\xi_i = s \quad \text{if } X_i \in (a_{s-1}, a_s]
and apply the chi-squared GOF test to \xi_1, \dots, \xi_n. These variables are discrete, with values in {1, ..., S}.
Define the observed counts:
N_s = \#\{i : X_i \in I_s\}
The expected counts are
E_0(N_s) = n p_s, \quad \text{where } p_s = P_0(X_i \in I_s) = F_0(a_s) - F_0(a_{s-1})
We then reject for large values of (for example)
D = \sum_{s=1}^S \frac{(N_s - n p_s)^2}{n p_s}

Theory. Under the null, D has asymptotically the chi-square distribution with S - 1 degrees of freedom.
Simulation. We can compute the p-value by Monte Carlo simulation.

Choice of bins

A possible choice of bins is to define S = [n/c] and let a_s be the (s/S)-quantile of F_0, for s = 0, ..., S. This guarantees that the expected counts are all approximately equal to c.

Another option is to perform multiple tests, one for each bin size, running through a predetermined set of bin sizes.

Yet another option is to start with some (small) bins and merge bins until significance, or until all bins are merged.
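The statistic D can be computed directly from the bin edges and the null CDF (a Python sketch, not from the slides, shown with F_0 = Unif(0,1) and equiprobable bins):

```python
import random

def chisq_gof_stat(data, F0, breaks):
    """Chi-squared GOF statistic for a given null CDF F0 and finite bin
    edges breaks = [a_0, ..., a_S]; each bin is the interval (a_{s-1}, a_s]."""
    n = len(data)
    D = 0.0
    for a_prev, a in zip(breaks[:-1], breaks[1:]):
        N_s = sum(1 for x in data if a_prev < x <= a)   # observed count
        p_s = F0(a) - F0(a_prev)                        # null bin probability
        D += (N_s - n * p_s) ** 2 / (n * p_s)
    return D

# null F0 = Unif(0, 1); S = 5 equiprobable bins, expected count n/S each
F0 = lambda x: min(max(x, 0.0), 1.0)
rng = random.Random(3)
data = [rng.random() for _ in range(100)]
D = chisq_gof_stat(data, F0, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```

When each bin contains exactly its expected count, D vanishes; under the null, D is approximately chi-square with S - 1 = 4 degrees of freedom.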
The (two-sided) Kolmogorov-Smirnov test

Recall the sample (aka empirical) distribution function
\hat{F}(x) = \frac{1}{n} \sum_{i=1}^n I\{X_i \le x\}
The (two-sided) Kolmogorov-Smirnov (KS) test rejects for large values of
D = \sup_x |\hat{F}(x) - F_0(x)|
The null distribution of D does not depend on F_0 (as long as it is continuous, which we assume) and has been tabulated (for a range of sample sizes n).

Theory. Under the null, \sqrt{n} D has asymptotically (as n \to \infty) the distribution of the maximum in absolute value of a Brownian bridge.
Simulation. We can compute the p-value by Monte Carlo simulation. In fact, since the distribution of D under the null does not depend on F_0, this can be done once for each sample size, e.g., based on F_0 = Unif(0,1).

The Cramér-von Mises test

Many variations of the KS test exist. For example, the Cramér-von Mises test rejects for large values of
D^2 = \int (\hat{F}(x) - F_0(x))^2 f_0(x)\, dx
where f_0(x) = F_0'(x) is the null PDF. This has a simple closed-form expression not requiring the calculation of integrals:
n D^2 = \frac{1}{12n} + \sum_{i=1}^n \left[ \frac{2i-1}{2n} - F_0(X_{(i)}) \right]^2
where X_{(1)} \le \cdots \le X_{(n)} is the ordered sample (aka order statistics).

Again, the null distribution of D does not depend on F_0 and has been tabulated. The asymptotic null distribution is also known, but complicated. And one can resort to Monte Carlo simulation to compute a p-value.
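Both statistics reduce to a single pass over the order statistics: for KS, the supremum is attained just before or at an order statistic, and for Cramér-von Mises the closed form above applies directly (a Python sketch, not from the slides):

```python
def ks_stat(data, F0):
    """Two-sided KS statistic sup_x |F-hat(x) - F0(x)|.
    The sup is attained just before or at an order statistic."""
    x = sorted(data)
    n = len(x)
    d = 0.0
    for i, xi in enumerate(x, start=1):
        u = F0(xi)
        d = max(d, abs(i / n - u), abs((i - 1) / n - u))
    return d

def cvm_stat(data, F0):
    """n * D^2 for the Cramér-von Mises test, via the closed form."""
    x = sorted(data)
    n = len(x)
    return 1 / (12 * n) + sum(((2 * i - 1) / (2 * n) - F0(xi)) ** 2
                              for i, xi in enumerate(x, start=1))

# against F0 = Unif(0, 1)
F0 = lambda t: min(max(t, 0.0), 1.0)
D = ks_stat([0.1, 0.2, 0.5, 0.7, 0.9], F0)
```

For the sample [0.1, 0.3, 0.5, 0.7, 0.9], F_0(X_(i)) = (2i - 1)/(2n) exactly, so n D^2 attains its minimum value 1/(12n).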
Goodness-of-fit testing for a given null family of distributions

In some situations, we simply want to know whether the observations come from a distribution in a given family of distributions. Example: are the data normally distributed?

We observe an iid numerical sample X_1, ..., X_n with CDF F, and we are given a family of distributions G = \{G_\theta, \theta \in \Theta\}, where \Theta \subset R^l, meaning the parameter \theta = (\theta_1, \dots, \theta_l) is l-dimensional. We want to test whether the sample was generated by a distribution in the family G, namely
H_0: F \in G, meaning there exists \theta such that F = G_\theta;
H_1: F \notin G, meaning F \ne G_\theta for all \theta.

Example: Testing for normality corresponds to taking
\theta = (\mu, \sigma^2) \in \Theta = (-\infty, \infty) \times (0, \infty)
and letting G_\theta denote the CDF of N(\mu, \sigma^2).

GOF with plug-in, calibrated by parametric bootstrap

Take any test statistic for testing F = F_0 versus F \ne F_0. This statistic is necessarily of the form
\Lambda(X_1, \dots, X_n; F_0)
Suppose that we reject for large values of this statistic.

Suppose we have an estimator \hat{\theta} = \Gamma(X_1, \dots, X_n) for \theta, for example the MLE. The corresponding plug-in test statistic is
\Lambda(X_1, \dots, X_n; G) \stackrel{\text{def}}{=} \Lambda(X_1, \dots, X_n; G_{\Gamma(X_1, \dots, X_n)})
Large values of this statistic are indicative that F \notin G. But how large? In other words, how do we obtain a p-value? One option is to do so by parametric bootstrap.
Let B be a large integer. Let \hat{\theta}_0 = \Gamma(X_1, \dots, X_n).
1. For b = 1, ..., B, do the following:
(a) Generate X_1^{(b)}, \dots, X_n^{(b)} iid from G_{\hat{\theta}_0} (our estimated null).
(b) Compute D_b = \Lambda(X_1^{(b)}, \dots, X_n^{(b)}; G).
2. Let D_0 = \Lambda(X_1, \dots, X_n; G) (the observed statistic). The estimated p-value is
\frac{\#\{b : D_b \ge D_0\} + 1}{B + 1}

Note that we bootstrapped the statistic \Lambda(\cdot; G), meaning that in round b we computed
\Lambda(X_1^{(b)}, \dots, X_n^{(b)}; G) = \Lambda(X_1^{(b)}, \dots, X_n^{(b)}; G_{\hat{\theta}_b})
where \hat{\theta}_b = \Gamma(X_1^{(b)}, \dots, X_n^{(b)}). If instead one computes
\Lambda(X_1^{(b)}, \dots, X_n^{(b)}; G_{\hat{\theta}_0})
then the p-value that one obtains is for the situation where G_{\hat{\theta}_0} is given beforehand as a single null distribution... not the setting we consider here!

NOTE: In that case, the obtained p-value is biased upward. This is because having a whole family of distributions (G) to fit the data allows for a better fit than just a single distribution from that family.

We used the parametric bootstrap, in that we sampled from G_{\hat{\theta}_0}. If one were to use the nonparametric bootstrap, meaning sample from \hat{F} instead of G_{\hat{\theta}_0}, then the distribution of D_b would be approximately the same as that of the statistic D_0... regardless of whether the null hypothesis is true or not!
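The recipe above can be sketched for the normal family with the KS statistic (a Python sketch, not from the slides; note that the parameters are re-estimated on each bootstrap sample, as the slides require):

```python
import math
import random
import statistics

def norm_cdf(x, mu, sigma):
    """Normal CDF via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_plugin(data):
    """KS statistic against the normal family, with (mu, sigma) estimated
    from the same data: the plug-in statistic Lambda(X; G)."""
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    x = sorted(data)
    n = len(x)
    return max(max(abs(i / n - norm_cdf(xi, mu, sigma)),
                   abs((i - 1) / n - norm_cdf(xi, mu, sigma)))
               for i, xi in enumerate(x, start=1))

def parametric_bootstrap_pvalue(data, B=200, seed=4):
    rng = random.Random(seed)
    mu0, sigma0 = statistics.mean(data), statistics.stdev(data)  # theta_hat_0
    d0 = ks_plugin(data)                                         # observed D_0
    exceed = 0
    for _ in range(B):
        boot = [rng.gauss(mu0, sigma0) for _ in data]  # sample from G_theta_hat_0
        if ks_plugin(boot) >= d0:                      # re-estimate theta each round
            exceed += 1
    return (exceed + 1) / (B + 1)

data = [28, 26, 33, 24, 34, 27, 16, 40, 29, 22, 25, 30, 23, 31, 19, 24]
p = parametric_bootstrap_pvalue(data)
```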
The chi-squared goodness-of-fit test

The test works exactly as before, except that the expected counts have to be estimated by n \hat{p}_s, where
\hat{p}_s = G_{\hat{\theta}_0}(a_s) - G_{\hat{\theta}_0}(a_{s-1})
We then reject for large values of
\Lambda(X_1, \dots, X_n; G) = \sum_{s=1}^S \frac{(N_s - n \hat{p}_s)^2}{n \hat{p}_s}

Theory. Under the null, this statistic has asymptotically the chi-square distribution with S - 1 - l degrees of freedom.
Simulation. Of course, as we just saw, we can also obtain the p-value using the parametric bootstrap.

Kolmogorov-Smirnov test

The statistic is
\Lambda(X_1, \dots, X_n; G) = \sup_x |\hat{F}(x) - G_{\hat{\theta}}(x)|
The p-value is typically estimated via the parametric bootstrap. (Note that \hat{F} and \hat{\theta} are recomputed for each bootstrap sample.)

When G is the normal family of distributions
G = \{N(\mu, \sigma^2) : \mu \in R, \sigma > 0\}
the test is often called the Lilliefors normality test. In this case, the statistic can be calibrated under any distribution in the family, leading to Monte Carlo simulations. Because of that, the null distribution of this statistic has been tabulated.

NOTE: The same applies to any other location-scale family of distributions, meaning a family of the form
G = \{G_0((\cdot - a)/b) : a \in R, b > 0\}
where G_0 is some given distribution on R.