One-Sample Numerical Data
quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests
University of California, San Diego. Instructor: Ery Arias-Castro

One-sample numerical data

We assume we have data of the form $X_1, \dots, X_n$, real-valued. Typically, we assume that these are sampled from the same population, which means these variables are iid, and that the underlying distribution is continuous.

Example. In 1882 Simon Newcomb conducted some experiments for measuring the speed of light. The light had to travel 3,721 meters and the time it took to do that was measured. This was repeated $n = 66$ times. The time measured on the $i$-th trial is $X_i$, in millionths of a second. (The data table for $X_1, \dots, X_n$ is not reproduced in this transcription.)

Summary statistics. There are two main types of summary statistics:
- Location: mean, median, quantiles/percentiles, etc.
- Scale: standard deviation, median absolute deviation, etc.

Graphics. There are various ways of plotting these summary statistics and other relevant quantities. Popular options are:
- A boxplot: a schematic view of the main quantiles.
- A histogram: an approximation to the density (PDF).
Location statistics

Suppose we have a sample $X_1, \dots, X_n \in \mathbb{R}$.

The sample mean is defined as
$$\bar{X} = \mathrm{mean}(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i$$

The sample median is defined as follows. Order the sample to get $X_{(1)} \le \cdots \le X_{(n)}$. (These are called the order statistics.) Then
$$\mathrm{median}(X_1, \dots, X_n) = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{X_{(n/2)} + X_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

The sample quantiles may be defined as follows. Let
$$p_i = \frac{i-1}{n-1}, \quad i = 1, \dots, n$$
For $\alpha \in [0,1]$, let $i$ be such that $\alpha \in [p_i, p_{i+1}]$. Then there is $b \in [0,1]$ such that
$$\alpha = (1-b)\, p_i + b\, p_{i+1}$$
The sample $\alpha$-quantile is defined as
$$(1-b)\, X_{(i)} + b\, X_{(i+1)}$$
Examples: 1st quartile ($\alpha = 0.25$), median ($\alpha = 0.5$), 3rd quartile ($\alpha = 0.75$).

Scale statistics

The sample standard deviation is the square root of the sample variance, defined as
$$S^2 = \mathrm{Var}(X_1, \dots, X_n) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$$

The median absolute deviation (MAD) is defined as
$$\mathrm{MAD}(X_1, \dots, X_n) = \mathrm{median}(|X_1 - M|, \dots, |X_n - M|)$$
where $M = \mathrm{median}(X_1, \dots, X_n)$.
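The location and scale statistics above can be computed directly with numpy; this is an illustrative sketch (the toy data are mine), and note that numpy's default quantile interpolation uses exactly the $p_i = (i-1)/(n-1)$ scheme defined above.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

# Location: sample mean and sample median
xbar = x.mean()
med = np.median(x)

# Sample alpha-quantiles; numpy's default 'linear' method interpolates
# between order statistics with p_i = (i-1)/(n-1), as in the definition above
q1 = np.quantile(x, 0.25)
q3 = np.quantile(x, 0.75)

# Scale: sample standard deviation (n-1 denominator) and MAD
s = x.std(ddof=1)
mad = np.median(np.abs(x - np.median(x)))

print(xbar, med, q1, q3, s, mad)
```

For the eight values above, this gives a mean of 3.875, median 3.5, quartiles 1.75 and 5.25, and MAD 2.0.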
Boxplot

A boxplot helps visualize how the data are spread out.
- The box represents the inter-quartile range (IQR), containing 50% of the data.
- The upper edge (hinge) of the box indicates the 75th percentile; the lower hinge indicates the 25th percentile.
- The line within the box indicates the median (50th percentile).
- The top whisker is at the largest observation within 1.5 times the length of the IQR from the top of the box, and similarly for the bottom whisker. (The 1.5 factor is tailored to the normal distribution.)
- The observations falling outside the whiskers are plotted as points and may be suspected of being outliers. (At least if the underlying distribution is normal.)

Histogram

A histogram is a piecewise constant estimate of the population probability density function (PDF). It works as follows: the data are binned and the histogram is the barplot of the bin counts.

Suppose the bins are the intervals $I_s = (a_{s-1}, a_s]$, where
$$-\infty = a_0 < a_1 < a_2 < \cdots < a_{S-1} < a_S = \infty$$
The number of observations in the $s$-th bin is
$$N_s = \#\{i : X_i \in I_s\}$$
The histogram based on this choice of bins is the barplot of $N_1, \dots, N_S$.

Student confidence interval for the mean

Suppose we have a sample $X_1, \dots, X_n \in \mathbb{R}$. Suppose the underlying distribution has a well-defined mean $\mu$ and that we want to compute a $(1-\alpha)$-confidence interval for $\mu$.

First assume that the distribution is normal $N(\mu, \sigma^2)$, with variance $\sigma^2$ unknown, as is often the case. The (two-sided) Student $(1-\alpha)$-confidence interval for $\mu$ is
$$\bar{X} \pm t_{n-1}^{(\alpha/2)} \frac{S}{\sqrt{n}}$$
where $t_m^{(\alpha)}$ is the $\alpha$-quantile of the t-distribution with $m$ degrees of freedom.
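The Student interval can be sketched as follows (simulated data; by symmetry of the t-distribution, using the upper $1-\alpha/2$ quantile with $\pm$ is equivalent to the $\alpha/2$-quantile convention above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=50)  # simulated normal sample
n = len(x)
alpha = 0.05

xbar, s = x.mean(), x.std(ddof=1)
# upper (1 - alpha/2)-quantile of the t-distribution with n-1 df
tq = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci = (xbar - tq * s / np.sqrt(n), xbar + tq * s / np.sqrt(n))
print(ci)
```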
This interval hinges on the fact that
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has the t-distribution with $n-1$ degrees of freedom when the sample is normal. Indeed, for any $a < b$,
$$P\big(\bar{X} + a\, S/\sqrt{n} \le \mu \le \bar{X} + b\, S/\sqrt{n}\big) = P(-b \le T \le -a)$$

The confidence level is exact if the population is indeed normal. It is asymptotically correct if the population has finite variance, because of the Central Limit Theorem (CLT). In practice, it is approximately correct if the sample is large enough and the underlying distribution is not too asymmetric or heavy-tailed.

The nonparametric bootstrap interval for the mean

This procedure is nonparametric: it does not assume a particular parametric model for the distribution of the data. The idea is to use resampling to estimate the distribution of the t-ratio.

Define the sample (aka empirical) distribution, denoted $\hat{F}$, as the uniform distribution over the sample. Generating an iid sample of size $k$ from the empirical distribution is done by sampling with replacement $k$ times from the data $\{X_1, \dots, X_n\}$.

Note that even if all the observations $X_1, \dots, X_n$ are distinct, a sample from the empirical distribution may contain many repeats and may not include all the observations.
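Sampling from the empirical distribution $\hat{F}$ is just sampling with replacement, as a quick sketch shows (the data here are ten distinct integers for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(10)  # ten distinct observations

# An iid sample of size n from F-hat: draw n times with replacement.
# Repeats are likely, and some observations may be missing entirely.
xstar = rng.choice(x, size=len(x), replace=True)
print(sorted(xstar))
```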
Let $B$ be a large integer.

1. For $b = 1, \dots, B$, do the following:
   (a) Generate $X_1^{(b)}, \dots, X_n^{(b)}$ iid from $\hat{F}$.
   (b) Compute the corresponding t-ratio
   $$T_b = \frac{\bar{X}_b - \bar{X}}{S_b/\sqrt{n}}, \quad \text{where} \quad \bar{X}_b = \frac{1}{n} \sum_{i=1}^n X_i^{(b)}, \quad S_b^2 = \frac{1}{n-1} \sum_{i=1}^n \big(X_i^{(b)} - \bar{X}_b\big)^2$$
2. Compute $t_{\mathrm{boot}}^{(\alpha)}$, the $\alpha$-quantile of $\{T_b : b = 1, \dots, B\}$.

A bootstrap $(1-\alpha)$-CI for $\mu = \mathrm{mean}(F)$ is
$$\Big[\bar{X} + t_{\mathrm{boot}}^{(\alpha/2)} \frac{S}{\sqrt{n}},\ \bar{X} + t_{\mathrm{boot}}^{(1-\alpha/2)} \frac{S}{\sqrt{n}}\Big]$$
Note that the confidence level is not exact.

Confidence interval for the median

Suppose we want to compute a $(1-\alpha)$-CI for the median, denoted $\theta$. (What we do here applies in the same way to any other quantile.) The sample median is asymptotically unbiased and asymptotically normal, but its asymptotic variance depends on the underlying density function, which is unknown.

Confidence interval for the median based on the sample quantiles

Suppose we have a sample $X_1, \dots, X_n \in \mathbb{R}$. Assume that the underlying distribution is continuous. Define $q_k = P(X_{(k)} \le \theta)$. The $q_k$ are then independent of the underlying distribution. Indeed,
$$q_k = P(\#\{i : X_i \le \theta\} \ge k) = P(\mathrm{Bin}(n, 1/2) \ge k)$$
since $\theta$ is the median. This is interesting because, for $k < l$,
$$P(X_{(k)} \le \theta \le X_{(l)}) = P(X_{(k)} \le \theta) - P(X_{(l)} < \theta) = q_k - q_l$$
Choosing $k$ largest such that $q_k \ge 1 - \alpha/2$ and $l$ smallest such that $q_l \le \alpha/2$, we obtain a $(1-\alpha)$-CI $[X_{(k)}, X_{(l)}]$ for $\theta$. The interval is conservative, meaning that the confidence level is at least $1-\alpha$.
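The bootstrap-t interval for the mean can be sketched as follows, following the algorithm above (the function name `bootstrap_t_ci` is my own; skewed exponential data make it a natural use case):

```python
import numpy as np

def bootstrap_t_ci(x, B=2000, alpha=0.05, rng=None):
    """Bootstrap-t (1 - alpha)-CI for the mean, per the algorithm above."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    T = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)  # iid sample from F-hat
        T[b] = (xb.mean() - xbar) / (xb.std(ddof=1) / np.sqrt(n))
    # alpha/2 and (1 - alpha/2) quantiles of the bootstrap t-ratios
    t_lo, t_hi = np.quantile(T, [alpha / 2, 1 - alpha / 2])
    return (xbar + t_lo * s / np.sqrt(n), xbar + t_hi * s / np.sqrt(n))

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100)  # skewed, non-normal data
lo, hi = bootstrap_t_ci(x, rng=rng)
print(lo, hi)
```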
The bootstrap variance estimate

Suppose we have a sample $X_1, \dots, X_n$ iid from $F$ and want to estimate the variance of a statistic $D = \Lambda(X_1, \dots, X_n)$. We have several options, depending on what information we have access to.

We can compute it by integration if $F$ (or its density) is known in closed form.

We can compute it by Monte Carlo integration if we can simulate from $F$. Let $B$ be a large integer.
1. For $b = 1, \dots, B$:
   (a) Sample $X_1^b, \dots, X_n^b$ iid from $F$.
   (b) Compute $D_b = \Lambda(X_1^b, \dots, X_n^b)$.
2. Compute the sample mean and variance
$$\bar{D} = \frac{1}{B} \sum_{b=1}^B D_b, \quad \hat{se}^2_{\mathrm{MC}} = \frac{1}{B-1} \sum_{b=1}^B \big(D_b - \bar{D}\big)^2$$
(MC = Monte Carlo.)

We can estimate it by the nonparametric bootstrap. The procedure is the same as above except that we sample from $\hat{F}$ (the sample distribution) instead of $F$ (the population distribution). The nonparametric bootstrap acts as if the sample were the population. Let $\hat{se}_{\mathrm{boot}}$ denote the bootstrap variance estimate.

Note that the other two options do not require a sample.
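The bootstrap standard-error recipe can be sketched in a few lines: the Monte Carlo loop above with $F$ replaced by $\hat{F}$. The function name `bootstrap_se` is mine; the example estimates the standard error of the sample median.

```python
import numpy as np

def bootstrap_se(x, stat, B=1000, rng=None):
    """Nonparametric bootstrap estimate of the standard error of stat(X_1..X_n):
    the Monte Carlo recipe, sampling from the empirical F-hat instead of F."""
    if rng is None:
        rng = np.random.default_rng()
    D = np.array([stat(rng.choice(x, size=len(x), replace=True))
                  for _ in range(B)])
    return D.std(ddof=1)  # sample standard deviation of the D_b's

rng = np.random.default_rng(3)
x = rng.normal(size=200)
se_med = bootstrap_se(x, np.median, rng=rng)
print(se_med)
```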
Bootstrap confidence intervals

Consider a functional $A$ and let $\theta = A(F)$. For example, $A(F) = \mathrm{median}(F)$ or $A(F) = \mathrm{MAD}(F)$, etc. Suppose we want a $(1-\alpha)$-confidence interval for $\theta$. Define $\hat{\theta} = A(\hat{F})$, which is the plug-in estimate for $\theta$.

The bootstrap procedure is based on generating many bootstrap samples and computing the statistic of interest on each sample. Let $B$ be a large integer. For $b = 1, \dots, B$, do the following:
1. Generate $X_1^b, \dots, X_n^b$ iid from $\hat{F}$.
2. Compute $\hat{\theta}_b = A(\hat{F}_b)$, where $\hat{F}_b$ is the sample distribution of $X_1^b, \dots, X_n^b$.

Let $\hat{\theta}_{(b)}$ denote the $b$-th smallest bootstrap statistic, so that $\hat{\theta}_{(1)} \le \cdots \le \hat{\theta}_{(B)}$.

Bootstrap pivotal confidence interval

The bootstrap pivotal confidence interval is
$$\big(2\hat{\theta} - \hat{\theta}_{(B(1-\alpha/2))},\ 2\hat{\theta} - \hat{\theta}_{(B\alpha/2)}\big)$$
This is justified by considering the pivot $Z = \hat{\theta} - \theta$. If $\Psi(z) = P(Z \le z)$ and $z_\alpha = \Psi^{-1}(\alpha)$, then
$$P\big(z_{\alpha/2} \le \hat{\theta} - \theta \le z_{1-\alpha/2}\big) = 1 - \alpha$$
equivalently,
$$\theta \in \big[\hat{\theta} - z_{1-\alpha/2},\ \hat{\theta} - z_{\alpha/2}\big]$$
with probability $1-\alpha$. We estimate $\Psi$ by the bootstrap:
$$\hat{\Psi}(z) = \frac{1}{B} \sum_{b=1}^B I\{Z_b \le z\}, \quad \text{where } Z_b = \hat{\theta}_b - \hat{\theta}$$
(In practice, we only need the desired sample quantiles of $Z_1, \dots, Z_B$.)
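A sketch of the pivotal interval (function name `pivotal_ci` is mine), here applied to the median of a simulated sample:

```python
import numpy as np

def pivotal_ci(x, stat, B=2000, alpha=0.05, rng=None):
    """Bootstrap pivotal (1 - alpha)-CI:
    (2*theta_hat - upper bootstrap quantile, 2*theta_hat - lower)."""
    if rng is None:
        rng = np.random.default_rng()
    theta_hat = stat(x)
    thetas = np.array([stat(rng.choice(x, size=len(x), replace=True))
                       for _ in range(B)])
    hi_q, lo_q = np.quantile(thetas, [1 - alpha / 2, alpha / 2])
    return (2 * theta_hat - hi_q, 2 * theta_hat - lo_q)

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, size=300)
lo, hi = pivotal_ci(x, np.median, rng=rng)  # CI for the median
print(lo, hi)
```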
Bootstrap Studentized pivotal confidence interval

Let $B$ and $C$ be two large integers. For $b = 1, \dots, B$, do the following:
1. Generate $X_1^b, \dots, X_n^b$ from $\hat{F}$. Let $\hat{F}_b$ denote the corresponding empirical distribution.
2. Compute $\hat{\theta}_b = A(\hat{F}_b)$.
3. For $c = 1, \dots, C$, do the following (2nd bootstrap loop):
   (a) Generate $X_1^{(b,c)}, \dots, X_n^{(b,c)}$ from $\hat{F}_b$. Let $\hat{F}^{(b,c)}$ denote the corresponding empirical distribution.
   (b) Compute $\hat{\theta}^{(b,c)} = A(\hat{F}^{(b,c)})$.
4. Compute
$$\bar{\theta}_b = \frac{1}{C} \sum_{c=1}^C \hat{\theta}^{(b,c)}, \quad \hat{se}_b^2 = \frac{1}{C-1} \sum_{c=1}^C \big(\hat{\theta}^{(b,c)} - \bar{\theta}_b\big)^2$$
5. Compute the t-ratio
$$T_b = \frac{\hat{\theta}_b - \hat{\theta}}{\hat{se}_b}$$

Note that $\bar{\theta}_b$ is different from $\hat{\theta}_b$.

The bootstrap Studentized pivotal confidence interval is
$$\big(\hat{\theta} - t_{1-\alpha/2}\, \hat{se}_{\mathrm{boot}},\ \hat{\theta} - t_{\alpha/2}\, \hat{se}_{\mathrm{boot}}\big)$$
where $t_\alpha = T_{(B\alpha)}$ and $\hat{se}_{\mathrm{boot}}$ denotes the bootstrap estimate of standard error, in this case the sample standard deviation of $\{\hat{\theta}_b : b = 1, \dots, B\}$.

The rationale is to proceed as in the bootstrap pivotal confidence interval, except that instead of $Z$ we use as pivot
$$T = \frac{\hat{\theta} - \theta}{\hat{se}_{\mathrm{boot}}}$$
The bootstrap estimate of standard deviation requires a bootstrap loop, and this is carried out for each bootstrap sample, giving rise to a double loop!

Bootstrap P-values

Suppose we want to test $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$. We can simply build a confidence interval using one of the aforementioned methods. If $\hat{I}_{1-\alpha}$ is a bootstrap $(1-\alpha)$-confidence interval, then
$$\text{P-value} = \sup\{\alpha : \theta_0 \in \hat{I}_{1-\alpha}\}$$
(We can perform one-sided testing by considering appropriate one-sided confidence intervals.)
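The double loop above can be sketched as follows (function name `studentized_pivotal_ci` is mine; $B$ and $C$ are kept modest so the $B \times C$ inner resampling stays cheap):

```python
import numpy as np

def studentized_pivotal_ci(x, stat, B=200, C=100, alpha=0.05, rng=None):
    """Double-bootstrap Studentized pivotal CI, following the loop above:
    an inner loop of size C estimates se_b for each outer bootstrap sample."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    theta_hat = stat(x)
    theta_b = np.empty(B)
    T = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)          # sample from F-hat
        theta_b[b] = stat(xb)
        inner = np.array([stat(rng.choice(xb, size=n, replace=True))
                          for _ in range(C)])             # sample from F-hat_b
        T[b] = (theta_b[b] - theta_hat) / inner.std(ddof=1)
    se_boot = theta_b.std(ddof=1)  # sample SD of the theta_b's
    t_lo, t_hi = np.quantile(T, [alpha / 2, 1 - alpha / 2])
    return (theta_hat - t_hi * se_boot, theta_hat - t_lo * se_boot)

rng = np.random.default_rng(9)
x = rng.normal(loc=1.0, size=80)
lo, hi = studentized_pivotal_ci(x, np.mean, rng=rng)
print(lo, hi)
```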
Other tests

The sign test is a test for the median. (It is equivalent to testing via the exact confidence interval for the median.) The (Wilcoxon) signed-rank test is a test for symmetry. But testing for symmetry is equivalent to testing for the median when the underlying distribution is assumed to be symmetric.

Both tests are distribution-free in the sense that, in each situation, the distribution of the test statistic does not depend on the underlying distribution as long as it satisfies the null hypothesis.

Goodness-of-fit testing for a given null distribution

Beyond questions on specific parameters (mean, median, etc.), one may want to check whether the population comes from a hypothesized distribution, or family of distributions. This leads to goodness-of-fit (GOF) testing. We observe an iid numerical sample $X_1, \dots, X_n$ with CDF $F$. Given a null distribution $F_0$, we test
$$H_0: F = F_0 \quad \text{versus} \quad H_1: F \ne F_0$$

Graphics. Besides comparing densities via histograms or comparing distribution functions visually, a quantile-quantile (Q-Q) plot is a popular option. It plots the sample quantiles versus the quantiles of $F_0$.

The chi-squared goodness-of-fit test

This test amounts to applying the chi-squared GOF test after binning the data. Suppose the bins are the intervals $I_s = (a_{s-1}, a_s]$, where
$$-\infty = a_0 < a_1 < a_2 < \cdots < a_{S-1} < a_S = \infty$$
We consider the discrete variables
$$\xi_i = s \quad \text{if } X_i \in (a_{s-1}, a_s]$$
and apply the chi-squared GOF test to $\xi_1, \dots, \xi_n$. These variables are discrete, with values in $\{1, \dots, S\}$.
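The sign test mentioned above reduces to a binomial calculation: under the null, the number of observations above the hypothesized median is $\mathrm{Bin}(n, 1/2)$. A minimal sketch (function name `sign_test_pvalue` is mine; the two-sided p-value doubles the smaller tail):

```python
import numpy as np
from scipy import stats

def sign_test_pvalue(x, theta0):
    """Two-sided sign test of H0: median = theta0, via the Bin(n, 1/2) null."""
    x = np.asarray(x)
    x = x[x != theta0]           # discard ties (none, a.s., in the continuous model)
    n = len(x)
    k = int((x > theta0).sum())  # number of observations above theta0
    tail = min(stats.binom.cdf(k, n, 0.5), stats.binom.sf(k - 1, n, 0.5))
    return min(1.0, 2 * tail)

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, size=60)
p0 = sign_test_pvalue(x, 0.0)  # null true
p1 = sign_test_pvalue(x, 2.0)  # null badly false
print(p0, p1)
```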
Define the observed counts
$$N_s = \#\{i : X_i \in I_s\}$$
The expected counts are
$$E_0(N_s) = n p_s, \quad \text{where } p_s = P_0(X_i \in I_s) = F_0(a_s) - F_0(a_{s-1})$$
We then reject for large values of (for example)
$$D = \sum_{s=1}^S \frac{(N_s - n p_s)^2}{n p_s}$$

Theory. Under the null, $D$ asymptotically has the chi-squared distribution with $S-1$ degrees of freedom.

Simulation. We can compute the p-value by Monte Carlo simulation.

Choice of bins

A possible choice of bins is to define $S = [n/c]$ and let $a_s$ be the $(s/S)$-quantile of $F_0$, for $s = 0, \dots, S$. This guarantees that the expected counts are all approximately equal to $c$.

Another option is to perform multiple tests, one for each bin size, running through a predetermined set of bin sizes. Yet another option is to start with some (small) bins and merge bins until significance or until all bins are merged.
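The chi-squared GOF test with equiprobable bins under $F_0$ can be sketched as follows (function name `chisq_gof` is mine; the bins $I_s = (a_{s-1}, a_s]$ are formed from the $(s/S)$-quantiles of $F_0$, so each expected count is $n/S$):

```python
import numpy as np
from scipy import stats

def chisq_gof(x, F0_ppf, S):
    """Chi-squared GOF test of H0: F = F0, with S equiprobable bins under F0."""
    x = np.asarray(x)
    n = len(x)
    edges = F0_ppf(np.linspace(0, 1, S + 1))  # a_s = (s/S)-quantile of F0
    # observed counts N_s = #{i : a_{s-1} < X_i <= a_s}
    N = np.array([np.sum((x > edges[s]) & (x <= edges[s + 1]))
                  for s in range(S)])
    expected = np.full(S, n / S)              # equal expected counts n/S
    D = ((N - expected) ** 2 / expected).sum()
    return D, stats.chi2.sf(D, df=S - 1)      # asymptotic p-value, S-1 df

rng = np.random.default_rng(6)
x = rng.normal(size=500)
D, p = chisq_gof(x, stats.norm.ppf, S=10)  # null F0 = N(0,1) is true here
print(D, p)
```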
The (two-sided) Kolmogorov-Smirnov test

Recall the sample (aka empirical) distribution function
$$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^n I\{X_i \le x\}$$
The (two-sided) Kolmogorov-Smirnov (KS) test rejects for large values of
$$D = \sup_x |\hat{F}(x) - F_0(x)|$$
The null distribution of $D$ does not depend on $F_0$ (as long as it is continuous, which we assume) and has been tabulated (for a range of sample sizes $n$).

Theory. Under the null, $\sqrt{n}\, D$ asymptotically (as $n \to \infty$) has the distribution of the maximum in absolute value of a Brownian bridge.

Simulation. We can compute the p-value by Monte Carlo simulation. In fact, since the distribution of $D$ under the null does not depend on $F_0$, this can be done once for each sample size, e.g., based on $F_0 = \mathrm{Unif}(0,1)$.

The Cramér-von Mises test

Many variations of the KS test exist. For example, the Cramér-von Mises test rejects for large values of
$$D^2 = \int \big(\hat{F}(x) - F_0(x)\big)^2 f_0(x)\, dx$$
where $f_0(x) = F_0'(x)$ is the null PDF. This has a simple closed-form expression not requiring the calculation of integrals:
$$n D^2 = \frac{1}{12n} + \sum_{i=1}^n \Big[\frac{2i-1}{2n} - F_0(X_{(i)})\Big]^2$$
where $X_{(1)} \le \cdots \le X_{(n)}$ is the ordered sample (aka order statistics).

Again, the null distribution of $D$ does not depend on $F_0$ and has been tabulated. The asymptotic null distribution is also known, but complicated. One can also resort to Monte Carlo simulation to compute a p-value.
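Both statistics are easy to compute: scipy provides the KS test directly, and the Cramér-von Mises closed form above is a one-liner (the simulated normal data and the cross-check against `scipy.stats.cramervonmises` are my additions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.sort(rng.normal(size=200))  # sorted sample, i.e., the order statistics
n = len(x)

# KS statistic D = sup_x |F-hat(x) - F0(x)| against F0 = N(0,1)
D_ks, p_ks = stats.kstest(x, stats.norm.cdf)

# Cramer-von Mises via the closed form:
# n D^2 = 1/(12n) + sum_i ((2i-1)/(2n) - F0(X_(i)))^2
u = stats.norm.cdf(x)
i = np.arange(1, n + 1)
nD2 = 1 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - u) ** 2)

print(D_ks, p_ks, nD2)
```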
Goodness-of-fit testing for a given null family of distributions

In some situations, we simply want to know whether the observations come from a distribution in a given family of distributions. Example: are the data normally distributed?

We observe an iid numerical sample $X_1, \dots, X_n$ with CDF $F$, and we are given a family of distributions $\mathcal{G} = \{G_\theta, \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^l$, meaning the parameter $\theta = (\theta_1, \dots, \theta_l)$ is $l$-dimensional. We want to test whether the sample was generated by a distribution in the family $\mathcal{G}$, namely
$$H_0: F \in \mathcal{G}, \text{ meaning there exists } \theta \text{ such that } F = G_\theta$$
$$H_1: F \notin \mathcal{G}, \text{ meaning } F \ne G_\theta \text{ for all } \theta$$

Example: Testing for normality corresponds to taking
$$\theta = (\mu, \sigma^2) \in \Theta = (-\infty, \infty) \times (0, \infty)$$
and letting $G_\theta$ denote the CDF of $N(\mu, \sigma^2)$.

GOF with plug-in, calibrated by parametric bootstrap

Take any test statistic for testing $F = F_0$ versus $F \ne F_0$. This statistic is necessarily of the form
$$\Lambda(X_1, \dots, X_n; F_0)$$
Suppose that we reject for large values of $D$. Suppose we have an estimator $\hat{\theta} = \Gamma(X_1, \dots, X_n)$ for $\theta$, for example the MLE. The corresponding plug-in test statistic is
$$\Lambda(X_1, \dots, X_n; \mathcal{G}) \stackrel{\text{def}}{=} \Lambda\big(X_1, \dots, X_n; G_{\Gamma(X_1, \dots, X_n)}\big)$$
Large values of this statistic are indicative that $F \notin \mathcal{G}$. But how large? In other words, how do we obtain a p-value? One option is to do so by parametric bootstrap.
Let $B$ be a large integer. Let $\hat{\theta}_0 = \Gamma(X_1, \dots, X_n)$.
1. For $b = 1, \dots, B$, do the following:
   (a) Generate $X_1^{(b)}, \dots, X_n^{(b)}$ iid from $G_{\hat{\theta}_0}$ (our estimated null).
   (b) Compute $D_b = \Lambda(X_1^{(b)}, \dots, X_n^{(b)}; \mathcal{G})$.
2. Let $D_0 = \Lambda(X_1, \dots, X_n; \mathcal{G})$ (the observed statistic). The estimated p-value is
$$\frac{\#\{b : D_b \ge D_0\} + 1}{B + 1}$$

Note that we bootstrapped the statistic $\Lambda(\,\cdot\,; \mathcal{G})$, meaning that in round $b$ we computed
$$\Lambda(X_1^{(b)}, \dots, X_n^{(b)}; \mathcal{G}) = \Lambda\big(X_1^{(b)}, \dots, X_n^{(b)}; G_{\hat{\theta}_b}\big)$$
where $\hat{\theta}_b = \Gamma(X_1^{(b)}, \dots, X_n^{(b)})$. If instead one computes
$$\Lambda\big(X_1^{(b)}, \dots, X_n^{(b)}; G_{\hat{\theta}_0}\big)$$
then the p-value that one obtains is for the situation where $G_{\hat{\theta}_0}$ is given beforehand as a single null distribution, which is not the setting we consider here.

NOTE: In that case, the obtained p-value is biased upward. This is because having a whole family of distributions ($\mathcal{G}$) to fit the data allows for a better fit than just a single distribution from that family.

We used the parametric bootstrap, in that we sampled from $G_{\hat{\theta}_0}$. If one were to use the nonparametric bootstrap, meaning sample from $\hat{F}$ instead of $G_{\hat{\theta}_0}$, then the distribution of $D_b$ would be approximately the same as that of the statistic $D_0$, regardless of whether the null hypothesis is true or not!
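The parametric-bootstrap calibration above can be sketched for the normal family with the plug-in KS statistic, i.e., a Lilliefors-style test (function name `lilliefors_pvalue` is mine; note that $\hat{\theta}_b$ is re-estimated on each bootstrap sample, as the text requires):

```python
import numpy as np
from scipy import stats

def lilliefors_pvalue(x, B=500, rng=None):
    """Parametric-bootstrap p-value for H0: F in the normal family,
    using the plug-in KS statistic with theta re-estimated on each sample."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)

    def plug_in_ks(y):
        mu, sigma = y.mean(), y.std(ddof=1)  # Gamma(y): estimate (mu, sigma)
        return stats.kstest(y, stats.norm(mu, sigma).cdf).statistic

    D0 = plug_in_ks(x)                       # observed statistic
    mu0, sigma0 = x.mean(), x.std(ddof=1)    # theta-hat_0
    Db = np.array([plug_in_ks(rng.normal(mu0, sigma0, size=n))
                   for _ in range(B)])       # samples from G_{theta-hat_0}
    return (np.sum(Db >= D0) + 1) / (B + 1)

rng = np.random.default_rng(8)
p_norm = lilliefors_pvalue(rng.normal(2.0, 3.0, size=100), rng=rng)  # null true
p_exp = lilliefors_pvalue(rng.exponential(size=100), rng=rng)        # null false
print(p_norm, p_exp)
```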
The chi-squared goodness-of-fit test

The test works exactly as before, except that the expected counts have to be estimated by $n \hat{p}_s$, where
$$\hat{p}_s = G_{\hat{\theta}_0}(a_s) - G_{\hat{\theta}_0}(a_{s-1})$$
We then reject for large values of
$$\Lambda(X_1, \dots, X_n; \mathcal{G}) = \sum_{s=1}^S \frac{(N_s - n \hat{p}_s)^2}{n \hat{p}_s}$$

Theory. Under the null, this statistic asymptotically has the chi-squared distribution with $S - 1 - l$ degrees of freedom.

Simulation. Of course, as we just saw, we can also obtain the p-value using the parametric bootstrap.

Kolmogorov-Smirnov test

The statistic is
$$\Lambda(X_1, \dots, X_n; \mathcal{G}) = \sup_x |\hat{F}(x) - G_{\hat{\theta}}(x)|$$
The p-value is typically estimated via the parametric bootstrap. (Note that $\hat{F}$ and $\hat{\theta}$ are recomputed for each bootstrap sample.)

When $\mathcal{G}$ is the normal family of distributions
$$\mathcal{G} = \{N(\mu, \sigma^2) : \mu \in \mathbb{R},\, \sigma > 0\}$$
the test is often called the Lilliefors normality test. In this case, the statistic can be calibrated under any distribution in the family, leading to Monte Carlo simulations. Because of that, the null distribution of this statistic has been tabulated.

NOTE: The same applies to any other location-scale family of distributions, meaning a family of the form
$$\mathcal{G} = \{G_0((\cdot - a)/b) : a \in \mathbb{R},\, b > 0\}$$
where $G_0$ is some given distribution on $\mathbb{R}$.
ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES 1. Order statistics Let X 1,...,X n be n real-valued observations. One can always arrangetheminordertogettheorder statisticsx (1) X (2) X (n). SinceX (k)
More informationReview: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.
1 Review: Let X 1, X,..., X n denote n independent random variables sampled from some distribution might not be normal!) with mean µ) and standard deviation σ). Then X µ σ n In other words, X is approximately
More informationPerformance Evaluation and Comparison
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation
More informationBootstrap (Part 3) Christof Seiler. Stanford University, Spring 2016, Stats 205
Bootstrap (Part 3) Christof Seiler Stanford University, Spring 2016, Stats 205 Overview So far we used three different bootstraps: Nonparametric bootstrap on the rows (e.g. regression, PCA with random
More informationSolutions exercises of Chapter 7
Solutions exercises of Chapter 7 Exercise 1 a. These are paired samples: each pair of half plates will have about the same level of corrosion, so the result of polishing by the two brands of polish are
More informationMonte Carlo Studies. The response in a Monte Carlo study is a random variable.
Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating
More informationInvestigation of goodness-of-fit test statistic distributions by random censored samples
d samples Investigation of goodness-of-fit test statistic distributions by random censored samples Novosibirsk State Technical University November 22, 2010 d samples Outline 1 Nonparametric goodness-of-fit
More informationContinuous Random Variables. and Probability Distributions. Continuous Random Variables and Probability Distributions ( ) ( ) Chapter 4 4.
UCLA STAT 11 A Applied Probability & Statistics for Engineers Instructor: Ivo Dinov, Asst. Prof. In Statistics and Neurology Teaching Assistant: Christopher Barr University of California, Los Angeles,
More informationInference on distributions and quantiles using a finite-sample Dirichlet process
Dirichlet IDEAL Theory/methods Simulations Inference on distributions and quantiles using a finite-sample Dirichlet process David M. Kaplan University of Missouri Matt Goldman UC San Diego Midwest Econometrics
More informationTransition Passage to Descriptive Statistics 28
viii Preface xiv chapter 1 Introduction 1 Disciplines That Use Quantitative Data 5 What Do You Mean, Statistics? 6 Statistics: A Dynamic Discipline 8 Some Terminology 9 Problems and Answers 12 Scales of
More informationRandom Number Generation. CS1538: Introduction to simulations
Random Number Generation CS1538: Introduction to simulations Random Numbers Stochastic simulations require random data True random data cannot come from an algorithm We must obtain it from some process
More informationsimple if it completely specifies the density of x
3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random
More informationSTAT 512 sp 2018 Summary Sheet
STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationInference on distributional and quantile treatment effects
Inference on distributional and quantile treatment effects David M. Kaplan University of Missouri Matt Goldman UC San Diego NIU, 214 Dave Kaplan (Missouri), Matt Goldman (UCSD) Distributional and QTE inference
More information14.30 Introduction to Statistical Methods in Economics Spring 2009
MIT OpenCourseWare http://ocw.mit.edu 4.0 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationSection 3. Measures of Variation
Section 3 Measures of Variation Range Range = (maximum value) (minimum value) It is very sensitive to extreme values; therefore not as useful as other measures of variation. Sample Standard Deviation The
More informationEXAMINERS REPORT & SOLUTIONS STATISTICS 1 (MATH 11400) May-June 2009
EAMINERS REPORT & SOLUTIONS STATISTICS (MATH 400) May-June 2009 Examiners Report A. Most plots were well done. Some candidates muddled hinges and quartiles and gave the wrong one. Generally candidates
More informationNonparametric hypothesis tests and permutation tests
Nonparametric hypothesis tests and permutation tests 1.7 & 2.3. Probability Generating Functions 3.8.3. Wilcoxon Signed Rank Test 3.8.2. Mann-Whitney Test Prof. Tesler Math 283 Fall 2018 Prof. Tesler Wilcoxon
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2
MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and
More informationLecture 32: Asymptotic confidence sets and likelihoods
Lecture 32: Asymptotic confidence sets and likelihoods Asymptotic criterion In some problems, especially in nonparametric problems, it is difficult to find a reasonable confidence set with a given confidence
More informationMultiple Linear Regression
Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from
More informationContinuous Random Variables. and Probability Distributions. Continuous Random Variables and Probability Distributions ( ) ( )
UCLA STAT 35 Applied Computational and Interactive Probability Instructor: Ivo Dinov, Asst. Prof. In Statistics and Neurology Teaching Assistant: Chris Barr Continuous Random Variables and Probability
More informationSection 3 : Permutation Inference
Section 3 : Permutation Inference Fall 2014 1/39 Introduction Throughout this slides we will focus only on randomized experiments, i.e the treatment is assigned at random We will follow the notation of
More informationMeasures of center. The mean The mean of a distribution is the arithmetic average of the observations:
Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number
More informationEconomics 583: Econometric Theory I A Primer on Asymptotics
Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:
More informationBTRY 4090: Spring 2009 Theory of Statistics
BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)
More informationMathematics Qualifying Examination January 2015 STAT Mathematical Statistics
Mathematics Qualifying Examination January 2015 STAT 52800 - Mathematical Statistics NOTE: Answer all questions completely and justify your derivations and steps. A calculator and statistical tables (normal,
More informationBusiness Statistics. Lecture 10: Course Review
Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,
More informationExploratory data analysis: numerical summaries
16 Exploratory data analysis: numerical summaries The classical way to describe important features of a dataset is to give several numerical summaries We discuss numerical summaries for the center of a
More informationComposite Hypotheses and Generalized Likelihood Ratio Tests
Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationNon-parametric Inference and Resampling
Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend
More informationSummary of Chapters 7-9
Summary of Chapters 7-9 Chapter 7. Interval Estimation 7.2. Confidence Intervals for Difference of Two Means Let X 1,, X n and Y 1, Y 2,, Y m be two independent random samples of sizes n and m from two
More information1. Exploratory Data Analysis
1. Exploratory Data Analysis 1.1 Methods of Displaying Data A visual display aids understanding and can highlight features which may be worth exploring more formally. Displays should have impact and be
More informationDistribution Fitting (Censored Data)
Distribution Fitting (Censored Data) Summary... 1 Data Input... 2 Analysis Summary... 3 Analysis Options... 4 Goodness-of-Fit Tests... 6 Frequency Histogram... 8 Comparison of Alternative Distributions...
More informationII. The Normal Distribution
II. The Normal Distribution The normal distribution (a.k.a., a the Gaussian distribution or bell curve ) is the by far the best known random distribution. It s discovery has had such a far-reaching impact
More informationIEOR E4703: Monte-Carlo Simulation
IEOR E4703: Monte-Carlo Simulation Output Analysis for Monte-Carlo Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Output Analysis
More information