Dr. Maddah
ENMG 617 EM Statistics
10/15/12

Nonparametric Statistics (2)
(Goodness of fit tests)

Introduction

Probability models used in decision making (Operations Research) and other fields require fitting a probability distribution to raw data.

Nonparametric statistics offer useful goodness-of-fit tests toward this end.

These tests assume that a probability distribution has already been fit to the histogram of the data. E.g., in the case of continuous data, a probability density function is fully estimated (type of distribution and parameter values).

The tests check how good the fit is.
Steps in fitting a probability distribution to raw data

Fitting a probability distribution is usually done through three activities:
o Activity I: Hypothesizing families of distributions.
o Activity II: Estimation of parameters.
o Activity III: Determining how representative the fitted distributions are.

Activity I: Hypothesizing Families of Distributions

We need to decide what form or family to use: exponential, gamma, or something else?

Sometimes we can use our theoretical knowledge of the random variable to hypothesize a distribution. E.g.,
o Arrivals one at a time, at a constant rate, and independent: exponential interarrival times.
o Sum of many independent pieces: normal.
o Product of many independent pieces: lognormal.
o Service times: cannot be normal (a normal random variable can take negative values).
o Proportion defective: use a bounded distribution on (0,1).

The following empirical tools can be used to hypothesize a family of distributions.

Descriptive statistics. The descriptive statistics of the sample are compared with those of the hypothesized distribution.

For example, the coefficient of variation (CV = standard deviation/mean) is useful in distinguishing continuous distributions.
o CV > 1 suggests gamma or Weibull with $\alpha < 1$.
o CV $\approx$ 1 suggests exponential.
o CV < 1 suggests gamma or Weibull with $\alpha > 1$.

The Lexis ratio, $\tau$ = variance/mean, is useful in distinguishing discrete distributions.
o $\tau > 1$ suggests negative binomial or geometric.
o $\tau \approx 1$ suggests Poisson.
o $\tau < 1$ suggests binomial.

The skewness, $\nu = E[(X - \mu)^3] / \sigma^3$, where $\mu$ is the mean of X and $\sigma$ its standard deviation, is a measure of the symmetry of a distribution's density.
o $\nu > 0$ suggests right skewness (e.g., exponential).
o $\nu \approx 0$ suggests symmetry (e.g., normal).
o $\nu < 0$ suggests left skewness (e.g., right triangular).

Histograms are used to visually check the goodness of fit of the hypothesized distribution (via the probability density or mass function).

Box plots are used to visually inspect the skewness of the data.
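The three summary statistics above are easy to compute from a sample. The following is a minimal sketch, not a prescribed procedure: the function names and the `demand` data are hypothetical, the CV and Lexis ratio use the sample (n - 1) variance, and the skewness uses 1/n central moments, which is only one of several common estimators.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(data) / statistics.mean(data)

def lexis_ratio(data):
    """Sample variance divided by the sample mean (for count data)."""
    return statistics.variance(data) / statistics.mean(data)

def skewness(data):
    """Moment estimate of E[(X - mu)^3] / sigma^3 using 1/n central moments."""
    n = len(data)
    mu = statistics.mean(data)
    m2 = sum((x - mu) ** 2 for x in data) / n
    m3 = sum((x - mu) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical weekly demand counts, for illustration only.
demand = [0, 1, 1, 2, 6]
print("CV:", coefficient_of_variation(demand))
print("Lexis ratio:", lexis_ratio(demand))
print("Skewness:", skewness(demand))
```

For this made-up sample the Lexis ratio exceeds 1 and the skewness is positive, which under the guidelines above would point toward a negative binomial or geometric family.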
Hypothesizing a Family of Distributions: Example with Continuous Data

o Sample of n = 219 interarrival times of cars to a drive-up bank over a 90-minute peak-load period.
o The number of cars arriving in each of the six 15-minute periods was approximately equal, suggesting stationarity of the arrival rate.
o Sample mean = 0.399 (all times in minutes) > median = 0.270, and skewness = +1.458, all suggesting right skewness.
o CV = 0.953, close to 1, suggesting exponential.
o Histograms (for different choices of interval width b) suggest exponential.
o The box plot is consistent with exponential.
Hypothesizing a Family of Distributions: Example with Discrete Data

o Sample of n = 156 observations on the number of items demanded per week from an inventory over a three-year period.
o Range: 0 through 11.
o Sample mean = 1.891 > median = 1.00, and skewness = +1.655, all suggesting right skewness.
o Lexis ratio = 5.285/1.891 = 2.795 > 1, suggesting negative binomial or geometric (a special case of negative binomial).
o The histogram suggests geometric.
Activity II: Estimation of Parameters

With the hypothesized distribution(s) at hand, we need to estimate numerical values for the parameters of the distribution(s).

There are many methods for estimating parameters.
o Method of moments.
o Least squares.
o Maximum likelihood estimators (MLE).

MLE is the preferred method because (i) it has good statistical properties; (ii) it justifies using goodness-of-fit tests; and (iii) it is intuitive.

The MLE method operates on a set of observed values, $X_1, X_2, \ldots, X_n$.

The idea of the MLE is to choose the parameter(s) that maximize the probability that the random variable of interest takes on the values $X_1, X_2, \ldots, X_n$.

For example, for a discrete distribution having a single parameter $\theta$, the MLE estimator is

$\hat{\theta} = \arg\max_{\theta} L(\theta)$, with $L(\theta) = p_\theta(X_1)\, p_\theta(X_2) \cdots p_\theta(X_n)$,

where $p_\theta(X_i) = P\{X = X_i \mid \text{parameter} = \theta\}$ is the pmf of X.

For a continuous distribution the density function is used in place of the pmf.
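As an illustration of the idea, consider an exponential density $f_\beta(x) = (1/\beta) e^{-x/\beta}$. Its log-likelihood is $-n \ln \beta - \sum_i X_i / \beta$, which is maximized at the sample mean. The sketch below checks this numerically with a crude grid search; the data values and the names `exp_log_likelihood`, `beta_hat`, and `beta_grid` are hypothetical and chosen only for illustration.

```python
import math

def exp_log_likelihood(beta, data):
    """Log-likelihood of f(x) = (1/beta) * exp(-x/beta) for the sample."""
    return -len(data) * math.log(beta) - sum(data) / beta

# Hypothetical interarrival times (minutes), for illustration only.
data = [0.12, 0.45, 0.08, 0.95, 0.33, 0.27, 0.61]

# Closed-form MLE of the exponential mean: the sample mean.
beta_hat = sum(data) / len(data)

# Sanity check: maximize the log-likelihood over a grid of step 0.01.
grid = [0.01 * k for k in range(1, 301)]
beta_grid = max(grid, key=lambda b: exp_log_likelihood(b, data))

print("closed form:", beta_hat)
print("grid search:", beta_grid)
```

The grid maximizer should agree with the sample mean to within the grid step, since the log-likelihood has a single peak.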
Activity III: Determining How Representative the Fitted Distributions Are

Having hypothesized a family of distributions and estimated its parameters, the final activity is to determine whether the hypothesized distribution is a good fit.

The main question here is: Does the fitted distribution agree with the observed data?

There are two approaches to answering this question: heuristic approaches and formal statistical tests.

Heuristic approaches use visual tools such as the probability plot, which we utilized for checking normality.

There are two formal nonparametric tests that are often used: the $\chi^2$ test and the Kolmogorov-Smirnov test.

The $\chi^2$ test is based on Pearson's theorem, which we discuss next.

Pearson's Theorem

Consider k boxes $B_1, B_2, \ldots, B_k$, as in the following figure:

B_1   B_2   ...   B_k

Assume that we throw n balls into these boxes randomly and independently of each other.

Let $p_i$ be the probability that a ball is thrown into box i.

Let $O_i$ be the number of balls observed in box i.
Then, $O_i$ is binomially distributed with parameters n and $p_i$, so $E_i = E[O_i] = n p_i$.

Further, define the random variable $\chi^2$ as

$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$

Pearson's Theorem states that, for n large enough, $\chi^2$ has a $\chi^2$ distribution with k - 1 degrees of freedom.

The proof is based on the normal approximation to the binomial distribution, noting that the $O_i$ are dependent (they sum to n) and accounting for their correlation.

The $\chi^2$ goodness of fit test

Given n data points with a hypothesized distribution having cumulative distribution function $\hat{F}(x)$, the test works as follows.
o Divide the range of the data into k intervals, $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k)$.
o Count the number of observations that fall in interval $[a_{j-1}, a_j)$, $O_j$, j = 1, ..., k.
o Find the expected number of observations in each interval, $E_j = n p_j$, where $p_j = \hat{F}(a_j) - \hat{F}(a_{j-1})$.

The test is then performed as follows.
o $H_0$: the $X_i$'s are iid with distribution function $\hat{F}(x)$.
o $H_1$: the $X_i$'s are not iid with distribution function $\hat{F}(x)$.
o The test statistic is based on Pearson's theorem:

$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - n p_j)^2}{n p_j}.$

o Rejection region: At significance level $\alpha$, reject $H_0$ if $\chi^2 > \chi^2_{\alpha, k-1}$.

As a guideline, the intervals $[a_{j-1}, a_j)$ are selected based on an equiprobable approach, i.e., $p_1 = p_2 = \cdots = p_k = 1/k$, and such that $n p_j \geq 5$.

Example of using the $\chi^2$ test to check uniformity

Consider the following 100 numbers.

0.126 0.092 0.375 0.938 0.254 0.223 0.029 0.359 0.397 0.343
0.086 0.300 0.072 0.001 0.404 0.621 0.092 0.120 0.565 0.869
0.255 0.958 0.874 0.893 0.046 0.424 0.325 0.603 0.235 0.660
0.167 0.336 0.708 0.589 0.381 0.225 0.191 0.288 0.596 0.633
0.832 0.422 0.902 0.348 0.143 0.039 0.723 0.372 0.920 0.928
0.786 0.680 0.430 0.610 0.363 0.463 0.670 0.678 0.926 0.223
0.208 0.650 0.070 0.010 0.696 0.340 0.548 0.497 0.973 0.518
0.821 0.456 0.485 0.629 0.683 0.953 0.338 0.750 0.780 0.075
0.321 0.994 0.984 0.293 0.185 0.454 0.474 0.557 0.094 0.464
0.690 0.636 0.195 0.645 0.680 0.548 0.118 0.543 0.476 0.137

Use the $\chi^2$ test to check whether these data are uniformly distributed on (0,1).

Noting that for the U(0,1) distribution $\hat{F}(x) = x$ for $0 \leq x \leq 1$, so that $p_j = \hat{F}(a_j) - \hat{F}(a_{j-1}) = a_j - a_{j-1}$, it is appropriate to pick the $a_j$'s as equidistant points.
Given that there are 100 observations, utilizing 10 intervals, with $a_0 = 0, a_1 = 0.1, a_2 = 0.2, \ldots, a_{10} = 1$, is appropriate.

The test statistic (TS) is computed as follows.

 i    Interval     O_i    E_i    (O_i - E_i)^2 / E_i
 1    [0.0,0.1)    12     10     0.4
 2    [0.1,0.2)     9     10     0.1
 3    [0.2,0.3)    10     10     0
 4    [0.3,0.4)    13     10     0.9
 5    [0.4,0.5)    12     10     0.4
 6    [0.5,0.6)     8     10     0.4
 7    [0.6,0.7)    16     10     3.6
 8    [0.7,0.8)     5     10     2.5
 9    [0.8,0.9)     5     10     2.5
10    [0.9,1.0]    10     10     0
                                 Total: 10.8

For $\alpha = 0.05$, the critical value for the test is $\chi^2_{0.05, 9} = 16.92$.

Decision: Since 10.8 < 16.92, do not reject $H_0$. There is not enough evidence that the data are not uniformly distributed on (0,1).

Example of the $\chi^2$ test with the exponential distribution

An exponential distribution with $\hat{F}(x) = 1 - e^{-x/0.399}$ was fitted to 219 interarrival time observations.

To perform the $\chi^2$ test, k = 20 intervals are used with an equiprobable approach having $p_j = 1/20$.

Then, setting $a_0 = 0$ and $a_{20} = \infty$, the points $a_j$, j = 1, 2, ..., 19, are found such that $\hat{F}(a_j) = j/20$, which implies that $p_j = \hat{F}(a_j) - \hat{F}(a_{j-1}) = 1/20$.
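Both $\chi^2$ setups on this page can be checked with a few lines of code. The sketch below is illustrative only: it recomputes the uniform example's test statistic from the observed counts in the table, and then solves $\hat{F}(a_j) = 1 - e^{-a_j/0.399} = j/20$ for $a_j$ to obtain the 19 interior cut points of the exponential example. The function names are hypothetical, and the critical value is simply the tabulated 16.92 quoted above rather than being computed.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson statistic: sum of (O_j - E_j)^2 / E_j over the intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Uniform example: observed counts copied from the table, E_j = 100 * (1/10).
observed = [12, 9, 10, 13, 12, 8, 16, 5, 5, 10]
expected = [100 / 10] * 10
ts = chi_square_statistic(observed, expected)
critical = 16.92  # tabulated value quoted in the notes (alpha = 0.05, 9 d.f.)
print(round(ts, 1), "reject H0" if ts > critical else "do not reject H0")

def exponential_cut_points(mean, k):
    """Interior cut points a_1..a_{k-1} solving 1 - exp(-a/mean) = j/k."""
    return [-mean * math.log(1 - j / k) for j in range(1, k)]

# Exponential example: fitted mean 0.399, k = 20 equiprobable intervals.
cuts = exponential_cut_points(0.399, 20)
expected_per_interval = 219 / 20  # n * p_j with p_j = 1/20
print([round(a, 3) for a in cuts])
print("expected count per interval:", expected_per_interval)
```

With 219 observations and 20 equiprobable intervals, each expected count is 219/20 = 10.95, which satisfies the guideline $n p_j \geq 5$.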
Then, the $a_j$'s are found by inverting $\hat{F}(x)$, i.e., solving

$\hat{F}(a_j) = 1 - e^{-a_j/0.399} = j/20.$

This gives

$a_j = -0.399 \ln(1 - j/20).$

Once the $a_j$'s are determined, the test proceeds just as in the uniform distribution case above.

The Kolmogorov-Smirnov goodness of fit test

This can be seen as a formal comparison between the empirical and fitted distribution functions, $F_n(x)$ and $\hat{F}(x)$.

Compared with the $\chi^2$ test, it has the advantage of not requiring the data to be grouped into intervals and of being valid for any sample size. However, it is not as general as the $\chi^2$ test.

$H_0$ and $H_1$ for K-S are the same as for the $\chi^2$ test.

Assume that the data are arranged such that $X_1 \leq X_2 \leq \cdots \leq X_n$. Then,

$F_n(X_i) = i/n.$

The test statistic for K-S is

$D_n = \max_{i = 1, \ldots, n} \left| i/n - \hat{F}(X_i) \right|.$

$H_0$ is rejected (indicating a poor fit) if $D_n$ is too large.

Critical values for $D_n$ are tabulated below. In this table, $p = 1 - \alpha$, and the critical value for the two-sided test is used.
Example

Use the K-S test to check whether the following data are iid U(0,1). Use $\alpha = 0.05$.

0.05, 0.14, 0.44, 0.81, 0.93

In this case, $\hat{F}(X_i) = X_i$. The TS is found as follows.

i             1       2       3       4       5
X_i           0.05    0.14    0.44    0.81    0.93
i/n           0.2     0.4     0.6     0.8     1
i/n - X_i     0.15    0.26    0.16    -0.01   0.07

Then, $D_n = 0.26$, the largest absolute value in the last row.

Since $D_n < 0.563$, the critical value in the table, do not reject $H_0$. There is not enough evidence that the data are not uniformly distributed on (0,1).
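The example above can be reproduced in code. The sketch below is a minimal illustration with a hypothetical function name, `ks_statistic`. It computes the statistic as written in these notes, max |i/n - F(X_i)|, and, as an added check that is not part of the notes' formula, the common textbook form that also examines the gap F(X_i) - (i-1)/n just before each jump of the empirical distribution function. The critical value 0.563 is copied from the text, not computed.

```python
def ks_statistic(data, cdf):
    """Return (d_text, d_full) for a sample and a fitted CDF.

    d_text follows the formula in the notes: max |i/n - F(X_(i))|.
    d_full also checks F(X_(i)) - (i-1)/n, the gap before each jump.
    """
    xs = sorted(data)
    n = len(xs)
    d_text = max(abs(i / n - cdf(x)) for i, x in enumerate(xs, start=1))
    d_minus = max(cdf(x) - (i - 1) / n for i, x in enumerate(xs, start=1))
    return d_text, max(d_text, d_minus)

sample = [0.05, 0.14, 0.44, 0.81, 0.93]
d_text, d_full = ks_statistic(sample, lambda x: x)  # U(0,1): F(x) = x
critical = 0.563  # value quoted in the notes for n = 5, alpha = 0.05

print(round(d_text, 2), round(d_full, 2))
print("reject H0" if d_full > critical else "do not reject H0")
```

For this sample both forms give 0.26, so the decision is the same either way.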