One-Sample Numerical Data
quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests

University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html

One-sample numerical data

We assume we have data of the form $X_1, \dots, X_n$ real-valued. Typically, we assume that these are sampled from the same population, which means these variables are iid, and that the underlying distribution is continuous.

Example. In 1882 Simon Newcomb conducted some experiments for measuring the speed of light. The light had to travel 3,721 meters, and the time it took to do that was measured. This was repeated $n = 66$ times. The time measured on the $i$-th trial is $X_i \times 10^{-3} + 24.8$ in millionths of a second, where $X_1, \dots, X_n$ are displayed below:

28  26  33  24  34 -44  27  16  40  -2  29
22  24  21  25  30  23  29  31  19  24  20
36  32  36  28  25  21  28  29  37  25  28
26  30  32  36  26  30  22  36  23  27  27
28  27  31  27  26  33  26  32  32  24  39
28  24  25  32  25  29  27  28  29  16  23

Summary statistics. There are two main types of summary statistics:
Location: mean, median, quantiles/percentiles, etc.
Scale: standard deviation, median absolute deviation, etc.

Graphics. There are various ways of plotting these summary statistics and other relevant quantities. Popular options are:
A boxplot: a schematic view of the main quantiles.
A histogram: an approximation to the density (PDF).
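As a quick illustration (not part of the original slides), here is a minimal Python sketch, assuming only NumPy, that computes the location and scale statistics just listed on Newcomb's measurements:

```python
# Summary statistics for Newcomb's n = 66 coded passage-time measurements.
import numpy as np

newcomb = np.array([
    28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29,
    22, 24, 21, 25, 30, 23, 29, 31, 19, 24, 20,
    36, 32, 36, 28, 25, 21, 28, 29, 37, 25, 28,
    26, 30, 32, 36, 26, 30, 22, 36, 23, 27, 27,
    28, 27, 31, 27, 26, 33, 26, 32, 32, 24, 39,
    28, 24, 25, 32, 25, 29, 27, 28, 29, 16, 23,
])

print("n      =", newcomb.size)                    # 66
print("mean   =", newcomb.mean())                  # pulled down by -44 and -2
print("median =", np.median(newcomb))              # robust to the outliers
print("sd     =", newcomb.std(ddof=1))             # sample SD, divides by n - 1
M = np.median(newcomb)
print("MAD    =", np.median(np.abs(newcomb - M)))  # median absolute deviation
```

Note how the two negative outliers drag the mean well below the median; this is the usual motivation for the robust location and scale statistics defined next.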

Location statistics

Suppose we have a sample $X_1, \dots, X_n \in \mathbb{R}$. The sample mean is defined as
$$\bar X = \mathrm{mean}(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$$

The sample median is defined as follows. Order the sample to get $X_{(1)} \le \cdots \le X_{(n)}$. (These are called the order statistics.) Then
$$\mathrm{median}(X_1, \dots, X_n) = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{X_{(n/2)} + X_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

The sample quantiles may be defined as follows. Let
$$p_i = \frac{i-1}{n-1}, \quad i = 1, \dots, n$$
For $\alpha \in [0,1]$, let $i$ be such that $\alpha \in [p_i, p_{i+1}]$. Then there is $b \in [0,1]$ such that
$$\alpha = (1-b) p_i + b \, p_{i+1}$$
The sample $\alpha$-quantile is defined as
$$(1-b) X_{(i)} + b \, X_{(i+1)}$$
Examples: 1st quartile ($\alpha = 0.25$), median ($\alpha = 0.5$), 3rd quartile ($\alpha = 0.75$).

Scale statistics

The sample standard deviation is the square root of the sample variance, defined as
$$S^2 = \mathrm{Var}(X_1, \dots, X_n) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2$$

The median absolute deviation (MAD) is defined as
$$\mathrm{MAD}(X_1, \dots, X_n) = \mathrm{median}(|X_1 - M|, \dots, |X_n - M|)$$
where $M = \mathrm{median}(X_1, \dots, X_n)$.
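A minimal sketch of the interpolation rule for sample quantiles given above (the function name sample_quantile is ours; this particular convention happens to coincide with NumPy's default in np.quantile, which the last lines use as a check):

```python
# Sample alpha-quantile by linear interpolation between order statistics
# at the grid p_i = (i-1)/(n-1), as defined on the slide.
import numpy as np

def sample_quantile(x, alpha):
    xs = np.sort(x)          # order statistics X_(1) <= ... <= X_(n)
    n = len(xs)
    h = alpha * (n - 1)      # solve alpha = (1-b) p_i + b p_{i+1} for the index
    i = int(np.floor(h))
    b = h - i                # interpolation weight b in [0, 1]
    if i >= n - 1:
        return xs[-1]
    return (1 - b) * xs[i] + b * xs[i + 1]

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
for a in (0.25, 0.5, 0.75):
    assert np.isclose(sample_quantile(x, a), np.quantile(x, a))
```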

Boxplot

A boxplot helps visualize how the data are spread out.
The box represents the inter-quartile range (IQR), containing 50% of the data.
The upper edge (hinge) of the box indicates the 75th percentile; the lower hinge indicates the 25th percentile.
The line within the box indicates the median (50th percentile).
The top whisker is at the largest observation within 1.5 times the length of the IQR from the top of the box, and similarly for the bottom whisker. (The 1.5 factor is tailored to the normal distribution.)
The observations falling outside the whiskers are plotted as points and may be suspected of being outliers. (At least if the underlying distribution is normal.)

Histogram

A histogram is a piecewise constant estimate of the population probability density function (PDF). It works as follows: the data are binned, and the histogram is the barplot of the bin counts. Suppose the bins are the intervals $I_s = (a_{s-1}, a_s]$, where
$$-\infty = a_0 < a_1 < a_2 < \cdots < a_{S-1} < a_S = \infty$$
The number of observations in the $s$-th bin is
$$N_s = \#\{i : X_i \in I_s\}$$
The histogram based on this choice of bins is the barplot of $N_1, \dots, N_S$.

Student confidence interval for the mean

Suppose we have a sample $X_1, \dots, X_n \in \mathbb{R}$. Suppose the underlying distribution has a well-defined mean $\mu$ and that we want to compute a $(1-\alpha)$-confidence interval for $\mu$. First assume that the distribution is normal $N(\mu, \sigma^2)$, with the variance $\sigma^2$ unknown, as is often the case. The (two-sided) Student $(1-\alpha)$-confidence interval for $\mu$ is
$$\bar X \pm t_{n-1}^{(\alpha/2)} \frac{S}{\sqrt{n}}$$
where $t_m^{(\alpha)}$ is the $\alpha$-quantile of the t-distribution with $m$ degrees of freedom.
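A minimal sketch of the Student interval, assuming SciPy for the t-quantile (the 95% level and the simulated data are just for illustration):

```python
# Two-sided Student (1-alpha)-CI for the mean:
# X-bar +/- t_{n-1}^{(alpha/2)} * S / sqrt(n).
import numpy as np
from scipy import stats

def student_ci(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    # (1 - alpha/2)-quantile of the t-distribution with n - 1 df
    tq = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half = tq * s / np.sqrt(n)
    return xbar - half, xbar + half

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=30)
print(student_ci(x))   # should cover 10 about 95% of the time
```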

This interval hinges on the fact that
$$T = \frac{\bar X - \mu}{S / \sqrt{n}}$$
has the t-distribution with $n-1$ degrees of freedom when the sample is normal. Indeed, for any $a < b$,
$$P\left(\bar X + a S/\sqrt{n} \le \mu \le \bar X + b S/\sqrt{n}\right) = P(-b \le T \le -a)$$

The confidence level is exact if the population is indeed normal. It is asymptotically correct if the population has finite variance, because of the Central Limit Theorem (CLT). In practice, it is approximately correct if the sample is large enough and the underlying distribution is not too asymmetric or heavy-tailed.

The nonparametric bootstrap interval for the mean

This procedure is nonparametric: it does not assume a particular parametric model for the distribution of the data. The idea is to use resampling to estimate the distribution of the t-ratio.

Define the sample (aka empirical) distribution as the uniform distribution over the sample, denoted by $\hat F$. Generating an iid sample of size $k$ from the empirical distribution is done by sampling with replacement $k$ times from the data $\{X_1, \dots, X_n\}$. Note that even if all the observations $X_1, \dots, X_n$ are distinct, a sample from the empirical distribution may contain many repeats and may not include all the observations.
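A two-line illustration of sampling from $\hat F$ (assuming NumPy), showing the repeats just mentioned:

```python
# An iid draw of size n from the empirical distribution F-hat is
# sampling with replacement from the data, so repeats are expected.
import numpy as np

rng = np.random.default_rng(1)
data = np.arange(1, 11)                        # ten distinct observations
boot = rng.choice(data, size=data.size, replace=True)
print(boot)                                    # typically has repeats ...
print(np.unique(boot).size, "distinct values") # ... and misses some observations
```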

Let $B$ be a large integer.

1. For $b = 1, \dots, B$, do the following:
(a) Generate $X_1^{(b)}, \dots, X_n^{(b)}$ iid from $\hat F$.
(b) Compute the corresponding t-ratio
$$T_b = \frac{\bar X_b - \bar X}{S_b / \sqrt{n}}, \quad \text{where} \quad \bar X_b = \frac{1}{n} \sum_{i=1}^{n} X_i^{(b)}, \quad S_b^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i^{(b)} - \bar X_b\right)^2$$
2. Compute $t_{\mathrm{boot}}^{(\alpha)}$, the $\alpha$-quantile of $\{T_b : b = 1, \dots, B\}$.

A bootstrap $(1-\alpha)$-CI for $\mu = \mathrm{mean}(F)$ is
$$\left[\bar X + t_{\mathrm{boot}}^{(\alpha/2)} \frac{S}{\sqrt{n}}, \; \bar X + t_{\mathrm{boot}}^{(1-\alpha/2)} \frac{S}{\sqrt{n}}\right]$$
Note that the confidence level is not exact. (A sketch implementing this procedure appears below.)

Confidence interval for the median

Suppose we want to compute a $(1-\alpha)$-CI for the median, denoted $\theta$. (What we do here applies in the same way to any other quantile.) The sample median is asymptotically unbiased and asymptotically normal, but its asymptotic variance depends on the underlying density function, which is unknown.

Confidence interval for the median based on the sample quantiles

Suppose we have a sample $X_1, \dots, X_n \in \mathbb{R}$. Assume that the underlying distribution is continuous. Define $q_k = P(X_{(k)} \le \theta)$. The $q_k$ are then independent of the underlying distribution. Indeed,
$$q_k = P(\#\{i : X_i \le \theta\} \ge k) = P(\mathrm{Bin}(n, 1/2) \ge k)$$
since $\theta$ is the median. This is interesting because, for $k < l$,
$$P(X_{(k)} \le \theta \le X_{(l)}) = P(X_{(k)} \le \theta) - P(X_{(l)} < \theta) = q_k - q_l$$
Choosing $k$ largest such that $q_k \ge 1 - \alpha/2$ and $l$ smallest such that $q_l \le \alpha/2$, we obtain $[X_{(k)}, X_{(l)}]$ as a $(1-\alpha)$-CI for $\theta$. The interval is conservative, meaning that the confidence level is at least $1-\alpha$.
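A minimal sketch of the bootstrap-t interval for the mean described in the previous subsection ($B = 10{,}000$ is an arbitrary choice; larger is better):

```python
# Bootstrap-t (1-alpha)-CI for the mean, following the slide's recipe:
# resample from F-hat, collect t-ratios, use their sample quantiles.
import numpy as np

def bootstrap_t_ci(x, alpha=0.05, B=10_000, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    T = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)   # X_1^(b), ..., X_n^(b) iid from F-hat
        T[b] = (xb.mean() - xbar) / (xb.std(ddof=1) / np.sqrt(n))
    t_lo = np.quantile(T, alpha / 2)               # t_boot^(alpha/2)
    t_hi = np.quantile(T, 1 - alpha / 2)           # t_boot^(1-alpha/2)
    # [X-bar + t_boot^(alpha/2) S/sqrt(n), X-bar + t_boot^(1-alpha/2) S/sqrt(n)]
    return xbar + t_lo * s / np.sqrt(n), xbar + t_hi * s / np.sqrt(n)

rng = np.random.default_rng(10)
x = rng.normal(loc=10, scale=2, size=30)
print(bootstrap_t_ci(x, rng=rng))
```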

The bootstrap variance estimate

Suppose we have a sample $X_1, \dots, X_n$ iid from $F$ and want to estimate the variance of a statistic $D = \Lambda(X_1, \dots, X_n)$. We have several options, depending on what information we have access to.

We can compute it by integration if $F$ (or its density) is known in closed form.

We can compute it by Monte Carlo integration if we can simulate from $F$. Let $B$ be a large integer.
1. For $b = 1, \dots, B$:
(a) Sample $X_1^b, \dots, X_n^b$ iid from $F$.
(b) Compute $D_b = \Lambda(X_1^b, \dots, X_n^b)$.
2. Compute the sample mean and variance
$$\bar D = \frac{1}{B} \sum_{b=1}^{B} D_b, \qquad \widehat{\mathrm{se}}^2_{\mathrm{MC}} = \frac{1}{B-1} \sum_{b=1}^{B} \left(D_b - \bar D\right)^2$$
(MC = Monte Carlo)

We can estimate it by the nonparametric bootstrap. The procedure is the same as above, except that we sample from $\hat F$ (the sample distribution) instead of $F$ (the population distribution). The nonparametric bootstrap acts as if the sample were the population. Let $\widehat{\mathrm{se}}_{\mathrm{boot}}$ denote the resulting bootstrap estimate of standard error.

Note that the other two options do not require a sample.
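A minimal sketch of the bootstrap standard-error estimate (the Monte Carlo version is identical except that one would draw from $F$ itself); the choice of the median as the statistic and $B = 2000$ are ours:

```python
# Bootstrap estimate of the standard error of a generic statistic Lambda:
# resample from F-hat, recompute the statistic, take the sample SD.
import numpy as np

def bootstrap_se(x, stat=np.median, B=2000, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    D = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return D.std(ddof=1)    # sample SD of the bootstrap replicates

rng = np.random.default_rng(2)
x = rng.exponential(size=100)
print("se-hat_boot(median) =", bootstrap_se(x, rng=rng))
```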

Bootstrap confidence intervals

Consider a functional $A$ and let $\theta = A(F)$. For example, $A(F) = \mathrm{median}(F)$ or $A(F) = \mathrm{MAD}(F)$, etc. Suppose we want a $(1-\alpha)$-confidence interval for $\theta$. Define $\hat\theta = A(\hat F)$, which is the plug-in estimate for $\theta$.

The bootstrap procedure is based on generating many bootstrap samples and computing the statistic of interest on each sample. Let $B$ be a large integer. For $b = 1, \dots, B$, do the following:
1. Generate $X_1^b, \dots, X_n^b$ iid from $\hat F$.
2. Compute $\hat\theta_b = A(\hat F_b)$, where $\hat F_b$ is the sample distribution of $X_1^b, \dots, X_n^b$.

Let $\hat\theta_{(b)}$ denote the $b$-th smallest bootstrap statistic, so that $\hat\theta_{(1)} \le \cdots \le \hat\theta_{(B)}$.

Bootstrap pivotal confidence interval

The bootstrap pivotal confidence interval is
$$\left(2\hat\theta - \hat\theta_{(\lceil B(1-\alpha/2) \rceil)}, \; 2\hat\theta - \hat\theta_{(\lceil B\alpha/2 \rceil)}\right)$$
This is justified by considering the pivot $Z = \hat\theta - \theta$. If $\Psi(z) = P(Z \le z)$ and $z_\alpha = \Psi^{-1}(\alpha)$, then
$$P\left(z_{\alpha/2} \le \hat\theta - \theta \le z_{1-\alpha/2}\right) = 1 - \alpha$$
or equivalently, $\theta \in [\hat\theta - z_{1-\alpha/2}, \; \hat\theta - z_{\alpha/2}]$ with probability $1-\alpha$. We estimate $\Psi$ by the bootstrap:
$$\hat\Psi(z) = \frac{1}{B} \sum_{b=1}^{B} I\{Z_b \le z\}, \quad \text{where} \quad Z_b = \hat\theta_b - \hat\theta$$
(In practice, we only need the desired sample quantiles of $Z_1, \dots, Z_B$.)
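A minimal sketch of the pivotal interval, with $A(F) = \mathrm{median}(F)$ as on the slide by way of example ($B = 5000$ is an arbitrary choice; np.quantile approximates the order statistics $\hat\theta_{(\lceil B\alpha \rceil)}$):

```python
# Bootstrap pivotal (1-alpha)-CI:
# (2*theta-hat - theta-hat_(B(1-alpha/2)), 2*theta-hat - theta-hat_(B alpha/2)).
import numpy as np

def pivotal_ci(x, stat=np.median, alpha=0.05, B=5000, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    theta_hat = stat(x)
    boot = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    hi = np.quantile(boot, 1 - alpha / 2)   # ~ theta-hat_(B(1-alpha/2))
    lo = np.quantile(boot, alpha / 2)       # ~ theta-hat_(B alpha/2)
    return 2 * theta_hat - hi, 2 * theta_hat - lo

rng = np.random.default_rng(3)
x = rng.normal(size=200)
print(pivotal_ci(x, rng=rng))   # should cover the median 0 about 95% of the time
```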

Bootstrap Studentized pivotal confidence interval

Let $B$ and $C$ be two large integers. For $b = 1, \dots, B$, do the following:
1. Generate $X_1^b, \dots, X_n^b$ from $\hat F$. Let $\hat F_b$ denote the corresponding empirical distribution.
2. Compute $\hat\theta_b = A(\hat F_b)$.
3. For $c = 1, \dots, C$, do the following: (2nd bootstrap loop)
(a) Generate $X_1^{(b,c)}, \dots, X_n^{(b,c)}$ from $\hat F_b$. Let $\hat F^{(b,c)}$ denote the corresponding empirical distribution.
(b) Compute $\hat\theta^{(b,c)} = A(\hat F^{(b,c)})$.
4. Compute
$$\bar\theta_b = \frac{1}{C} \sum_{c=1}^{C} \hat\theta^{(b,c)}, \qquad \widehat{\mathrm{se}}_b^2 = \frac{1}{C-1} \sum_{c=1}^{C} \left(\hat\theta^{(b,c)} - \bar\theta_b\right)^2$$
5. Compute the t-ratio
$$T_b = \frac{\hat\theta_b - \hat\theta}{\widehat{\mathrm{se}}_b}$$

Note that $\bar\theta_b$ is different from $\hat\theta_b$. The bootstrap Studentized pivotal confidence interval is
$$\left(\hat\theta - t_{1-\alpha/2} \, \widehat{\mathrm{se}}_{\mathrm{boot}}, \; \hat\theta - t_{\alpha/2} \, \widehat{\mathrm{se}}_{\mathrm{boot}}\right)$$
where $t_\alpha = T_{(\lceil B\alpha \rceil)}$ and $\widehat{\mathrm{se}}_{\mathrm{boot}}$ denotes the bootstrap estimate of standard error, in this case the sample standard deviation of $\{\hat\theta_b : b = 1, \dots, B\}$. The rationale is to proceed as with the bootstrap pivotal confidence interval, except that instead of $Z$ we use the pivot
$$T = \frac{\hat\theta - \theta}{\widehat{\mathrm{se}}_{\mathrm{boot}}}$$
The bootstrap estimate of the standard deviation requires a bootstrap loop, and this is carried out for each bootstrap sample, giving rise to a double loop!

Bootstrap P-values

Suppose we want to test $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$. We can simply build a confidence interval using one of the aforementioned methods. If $\hat I_{1-\alpha}$ is a bootstrap $(1-\alpha)$-confidence interval, then
$$\text{P-value} = \sup\{\alpha : \theta_0 \in \hat I_{1-\alpha}\}$$
(We can perform one-sided testing by considering appropriate one-sided confidence intervals.)
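A minimal sketch of the double-loop procedure above ($B = 1000$ and $C = 50$ are arbitrary; note the cost is $B \times C$ statistic evaluations):

```python
# Bootstrap Studentized pivotal (1-alpha)-CI with a double bootstrap loop:
# outer loop over B samples from F-hat, inner loop over C samples from F-hat_b.
import numpy as np

def studentized_ci(x, stat=np.median, alpha=0.05, B=1000, C=50, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    theta_hat = stat(x)
    theta_b = np.empty(B)
    T = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)          # draw from F-hat
        theta_b[b] = stat(xb)
        inner = np.array([stat(rng.choice(xb, size=n, replace=True))
                          for _ in range(C)])             # draw from F-hat_b
        T[b] = (theta_b[b] - theta_hat) / inner.std(ddof=1)  # se-hat_b in the denominator
    se_boot = theta_b.std(ddof=1)          # bootstrap estimate of standard error
    t_lo, t_hi = np.quantile(T, [alpha / 2, 1 - alpha / 2])
    return theta_hat - t_hi * se_boot, theta_hat - t_lo * se_boot

rng = np.random.default_rng(9)
x = rng.normal(size=100)
print(studentized_ci(x, rng=rng))
```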

Other tests

The sign test is a test for the median. (It is equivalent to testing via the exact confidence interval for the median.) The (Wilcoxon) signed-rank test is a test for symmetry. But testing for symmetry is equivalent to testing for the median when the underlying distribution is assumed to be symmetric.

Both tests are distribution-free in the sense that, in each situation, the distribution of the test statistic does not depend on the underlying distribution, as long as it satisfies the null hypothesis.
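A minimal sketch of the sign test just described, assuming SciPy (version 1.7 or later for scipy.stats.binomtest):

```python
# Sign test of H0: median = theta0. Under H0 (continuous distribution),
# the number of observations above theta0 is Bin(n, 1/2).
import numpy as np
from scipy.stats import binomtest

def sign_test(x, theta0):
    x = np.asarray(x)
    x = x[x != theta0]                  # discard ties with theta0
    k = int(np.sum(x > theta0))         # observations above the null median
    return binomtest(k, n=len(x), p=0.5).pvalue   # two-sided binomial test

rng = np.random.default_rng(4)
print(sign_test(rng.normal(loc=0.5, size=50), theta0=0.0))
```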

Goodness-of-fit testing for a given null distribution

Beyond questions on specific parameters (mean, median, etc.), one may want to check whether the population comes from a hypothesized distribution, or family of distributions. This leads to goodness-of-fit testing. We observe an iid numerical sample $X_1, \dots, X_n$ with CDF $F$. Given a null distribution $F_0$, we test
$$H_0: F = F_0 \quad \text{versus} \quad H_1: F \ne F_0$$

Graphics. Besides comparing densities via histograms or comparing distribution functions visually, a quantile-quantile (Q-Q) plot is a popular option. It plots the sample quantiles versus the quantiles of $F_0$.

The chi-squared goodness-of-fit test

This test amounts to applying the chi-squared GOF test after binning the data. Suppose the bins are the intervals $I_s = (a_{s-1}, a_s]$, where
$$-\infty = a_0 < a_1 < a_2 < \cdots < a_{S-1} < a_S = \infty$$
We consider the discrete variables
$$\xi_i = s \quad \text{if } X_i \in (a_{s-1}, a_s]$$
and apply the chi-squared GOF test to $\xi_1, \dots, \xi_n$. These variables are discrete, with values in $\{1, \dots, S\}$.

Define the observed counts $N_s = \#\{i : X_i \in I_s\}$. The expected counts are $E_0(N_s) = n p_s$, where
$$p_s = P_0(X_i \in I_s) = F_0(a_s) - F_0(a_{s-1})$$
We then reject for large values of (for example)
$$D = \sum_{s=1}^{S} \frac{(N_s - n p_s)^2}{n p_s}$$

Theory. Under the null, $D$ has asymptotically the chi-square distribution with $S-1$ degrees of freedom.

Simulation. We can compute the p-value by Monte Carlo simulation.

Choice of bins

A possible choice of bins is to define $S = [n/c]$ and let $a_s$ be the $(s/S)$-quantile of $F_0$, for $s = 0, \dots, S$. This guarantees that the expected counts are all approximately equal to $c$. Another option is to perform multiple tests, one for each bin size, running through a predetermined set of bin sizes. Yet another option is to start with some (small) bins and merge bins until significance or until all bins are merged.
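A minimal sketch of the chi-squared GOF test with equiprobable bins as just described, assuming SciPy ($S = 10$ and the standard-normal null are arbitrary choices):

```python
# Chi-squared GOF test of H0: F = F0 with bins at the (s/S)-quantiles of F0,
# so that every expected count equals n/S.
import numpy as np
from scipy import stats

def chi2_gof(x, F0=stats.norm(), S=10):
    x = np.asarray(x)
    n = len(x)
    inner = F0.ppf(np.arange(1, S) / S)       # a_1 < ... < a_{S-1}; a_0, a_S infinite
    # bin s = (a_{s-1}, a_s]; searchsorted assigns each X_i to its bin index
    N = np.bincount(np.searchsorted(inner, x), minlength=S)   # observed counts N_s
    p = np.full(S, 1.0 / S)                   # p_s = F0(a_s) - F0(a_{s-1})
    D = np.sum((N - n * p) ** 2 / (n * p))
    return D, stats.chi2.sf(D, df=S - 1)      # asymptotic chi-square p-value

rng = np.random.default_rng(5)
print(chi2_gof(rng.normal(size=500)))         # H0 true: p-value ~ Unif(0,1)
```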

The (two-sided) Kolmogorov-Smirnov test

Recall the sample (aka empirical) distribution function
$$\hat F(x) = \frac{1}{n} \sum_{i=1}^{n} I\{X_i \le x\}$$
The (two-sided) Kolmogorov-Smirnov test rejects for large values of
$$D = \sup_x |\hat F(x) - F_0(x)|$$
The null distribution of $D$ does not depend on $F_0$ (as long as it is continuous, which we assume) and has been tabulated (for a range of sample sizes $n$).

Theory. Under the null, $\sqrt{n}\, D$ has asymptotically (as $n \to \infty$) the distribution of the maximum in absolute value of a Brownian bridge.

Simulation. We can compute the p-value by Monte Carlo simulation. In fact, since the distribution of $D$ under the null does not depend on $F_0$, this can be done once for each sample size, e.g., based on $F_0 = \mathrm{Unif}(0,1)$.

The Cramér-von Mises test

Many variations of the KS test exist. For example, the Cramér-von Mises test rejects for large values of
$$D^2 = \int \left(\hat F(x) - F_0(x)\right)^2 f_0(x) \, dx$$
where $f_0(x) = F_0'(x)$ is the null PDF. This has a simple closed-form expression not requiring the calculation of integrals:
$$n D^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[\frac{2i-1}{2n} - F_0(X_{(i)})\right]^2$$
where $X_{(1)} \le \cdots \le X_{(n)}$ is the ordered sample (aka order statistics). Again, the null distribution of $D$ does not depend on $F_0$ and has been tabulated. The asymptotic null distribution is also known, but complicated. And one can resort to Monte Carlo simulations to compute a p-value.
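A minimal sketch, assuming SciPy: the KS test via scipy.stats.kstest, and the Cramér-von Mises statistic computed from its closed form above:

```python
# KS test against a fully specified F0, and the Cramer-von Mises statistic
# n*D^2 = 1/(12n) + sum_i [ (2i-1)/(2n) - F0(X_(i)) ]^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(size=200)

# Kolmogorov-Smirnov against F0 = N(0,1): returns the statistic D and a p-value
print(stats.kstest(x, stats.norm.cdf))

# Cramer-von Mises from the order statistics
n = len(x)
u = stats.norm.cdf(np.sort(x))               # F0(X_(1)), ..., F0(X_(n))
nD2 = 1 / (12 * n) + np.sum(((2 * np.arange(1, n + 1) - 1) / (2 * n) - u) ** 2)
print(nD2)   # should agree with stats.cramervonmises(x, 'norm').statistic
```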

Goodness-of-fit testing for a given null family of distributions

In some situations, we simply want to know whether the observations come from a distribution in a given family of distributions. Example: are the data normally distributed?

We observe an iid numerical sample $X_1, \dots, X_n$ with CDF $F$, and we are given a family of distributions $\mathcal{G} = \{G_\theta, \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^l$, meaning the parameter $\theta = (\theta_1, \dots, \theta_l)$ is $l$-dimensional. We want to test whether the sample was generated by a distribution in the family $\mathcal{G}$, namely
$$H_0: F \in \mathcal{G}, \text{ meaning there exists } \theta \text{ such that } F = G_\theta$$
$$H_1: F \notin \mathcal{G}, \text{ meaning } F \ne G_\theta \text{ for all } \theta$$

Example: Testing for normality corresponds to taking
$$\theta = (\mu, \sigma^2), \qquad \Theta = (-\infty, \infty) \times (0, \infty)$$
and letting $G_\theta$ denote the CDF of $N(\mu, \sigma^2)$.

GOF with plug-in, calibrated by parametric bootstrap

Take any test statistic for testing $F = F_0$ versus $F \ne F_0$. This statistic is necessarily of the form $\Lambda(X_1, \dots, X_n; F_0)$. Suppose that we reject for large values of $D = \Lambda(X_1, \dots, X_n; F_0)$.

Suppose we have an estimator $\hat\theta = \Gamma(X_1, \dots, X_n)$ for $\theta$, for example the MLE. The corresponding plug-in test statistic is
$$\Lambda(X_1, \dots, X_n; \mathcal{G}) \stackrel{\mathrm{def}}{=} \Lambda\left(X_1, \dots, X_n; G_{\Gamma(X_1, \dots, X_n)}\right)$$
Large values of this statistic are indicative that $F \notin \mathcal{G}$. But how large? In other words, how do we obtain a p-value? One option is to do so by parametric bootstrap.

Let $B$ be a large integer. Let $\hat\theta_0 = \Gamma(X_1, \dots, X_n)$.

1. For $b = 1, \dots, B$, do the following:
(a) Generate $X_1^{(b)}, \dots, X_n^{(b)}$ iid from $G_{\hat\theta_0}$ (our estimated null).
(b) Compute $D_b = \Lambda(X_1^{(b)}, \dots, X_n^{(b)}; \mathcal{G})$.
2. Let $D_0 = \Lambda(X_1, \dots, X_n; \mathcal{G})$ (the observed statistic). The estimated p-value is
$$\frac{\#\{b : D_b \ge D_0\} + 1}{B + 1}$$

Note that we bootstrapped the statistic $\Lambda(\,\cdot\,; \mathcal{G})$, meaning that in round $b$ we computed
$$\Lambda(X_1^{(b)}, \dots, X_n^{(b)}; \mathcal{G}) = \Lambda(X_1^{(b)}, \dots, X_n^{(b)}; G_{\hat\theta_b}), \quad \text{where } \hat\theta_b = \Gamma(X_1^{(b)}, \dots, X_n^{(b)})$$
If instead one computes
$$\Lambda(X_1^{(b)}, \dots, X_n^{(b)}; G_{\hat\theta_0})$$
then the p-value that one obtains is for the situation where $G_{\hat\theta_0}$ is given beforehand as a single null distribution... not the setting we consider here!

NOTE: In that case, the obtained p-value is biased upward. This is because having a whole family of distributions ($\mathcal{G}$) to fit the data allows for a better fit than just a single distribution from that family.

We used the parametric bootstrap, in that we sampled from $G_{\hat\theta_0}$. If one were to use the nonparametric bootstrap, meaning sample from $\hat F$ instead of $G_{\hat\theta_0}$, then the distribution of $D_b$ would be approximately the same as that of the statistic $D_0$... regardless of whether the null hypothesis is true or not!
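A minimal sketch of this parametric-bootstrap p-value for testing normality with the plug-in KS statistic (a Lilliefors-type test, previewed in the next section); the sample-moment fit stands in for $\Gamma$ (the slide allows any estimator, e.g., the MLE), and note that $\hat\theta$ is re-estimated on every bootstrap sample, as stressed above:

```python
# Parametric-bootstrap p-value for H0: F in G (normal family), using the
# plug-in KS statistic Lambda(X_1,...,X_n; G).
import numpy as np
from scipy import stats

def ks_plugin(x):
    # Lambda(x; G): KS distance between F-hat and the fitted normal G_{theta-hat}
    mu, sigma = x.mean(), x.std(ddof=1)        # Gamma(x): sample-moment fit
    return stats.kstest(x, stats.norm(mu, sigma).cdf).statistic

def parametric_bootstrap_pvalue(x, B=1000, rng=None):
    rng = rng or np.random.default_rng()
    n = len(x)
    mu0, sigma0 = x.mean(), x.std(ddof=1)      # theta-hat_0, the estimated null
    D0 = ks_plugin(x)                          # observed statistic
    # re-estimate theta on each bootstrap sample inside ks_plugin
    Db = np.array([ks_plugin(rng.normal(mu0, sigma0, size=n)) for _ in range(B)])
    return (np.sum(Db >= D0) + 1) / (B + 1)

rng = np.random.default_rng(7)
print(parametric_bootstrap_pvalue(rng.standard_t(df=3, size=100), rng=rng))
```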

The chi-squared goodness-of-fit test

The test works exactly as before, except that the expected counts have to be estimated by $n \hat p_s$, where
$$\hat p_s = G_{\hat\theta_0}(a_s) - G_{\hat\theta_0}(a_{s-1})$$
We then reject for large values of
$$\Lambda(X_1, \dots, X_n; \mathcal{G}) = \sum_{s=1}^{S} \frac{(N_s - n \hat p_s)^2}{n \hat p_s}$$

Theory. Under the null, this statistic has asymptotically the chi-square distribution with $S - 1 - l$ degrees of freedom.

Simulation. Of course, as we just saw, we can also obtain the p-value using the parametric bootstrap.

Kolmogorov-Smirnov test

The statistic is
$$\Lambda(X_1, \dots, X_n; \mathcal{G}) = \sup_x |\hat F(x) - G_{\hat\theta}(x)|$$
The p-value is typically estimated via the parametric bootstrap. (Note that $\hat F$ and $\hat\theta$ are recomputed for each bootstrap sample.)

When $\mathcal{G}$ is the normal family of distributions
$$\mathcal{G} = \{N(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma > 0\}$$
the test is often called the Lilliefors normality test. In this case, the null distribution of the statistic above is the same under every distribution in the family, so it can be calibrated by Monte Carlo simulation under any one of them. Because of that, the null distribution of this statistic has been tabulated.

NOTE: The same applies to any other location-scale family of distributions, meaning a family of the form
$$\mathcal{G} = \{G_0((\cdot - a)/b) : a \in \mathbb{R}, b > 0\}$$
where $G_0$ is some given distribution on $\mathbb{R}$.
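A minimal sketch of the chi-squared statistic with estimated parameters for the normal family ($l = 2$); the equiprobable bins are now taken under the fitted $G_{\hat\theta_0}$, and the asymptotic reference distribution follows the slide's $\chi^2_{S-1-l}$ approximation:

```python
# Chi-squared GOF statistic for H0: F in G (normal family), with expected
# counts n * p-hat_s estimated from the fitted distribution G_{theta-hat_0}.
import numpy as np
from scipy import stats

def chi2_gof_fitted_normal(x, S=10):
    x = np.asarray(x)
    n = len(x)
    mu, sigma = x.mean(), x.std(ddof=1)           # theta-hat_0 = (mu-hat, sigma-hat)
    G = stats.norm(mu, sigma)
    inner = G.ppf(np.arange(1, S) / S)            # equiprobable bins under G_{theta-hat_0}
    N = np.bincount(np.searchsorted(inner, x), minlength=S)   # observed counts N_s
    p_hat = np.full(S, 1.0 / S)                   # p-hat_s = G(a_s) - G(a_{s-1})
    Lam = np.sum((N - n * p_hat) ** 2 / (n * p_hat))
    return Lam, stats.chi2.sf(Lam, df=S - 1 - 2)  # chi-square with S-1-l df, l = 2

rng = np.random.default_rng(8)
print(chi2_gof_fitted_normal(rng.normal(3, 2, size=500)))
```

As the slide notes, the parametric bootstrap offers an alternative calibration that avoids relying on this asymptotic approximation.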