Dr. Maddah ENMG 617 EM Statistics 10/15/12. Nonparametric Statistics (2) (Goodness of fit tests)


Introduction

Probability models used in decision making (Operations Research) and other fields require fitting a probability distribution to raw data. Nonparametric statistics offer useful goodness-of-fit tests toward this end. These tests assume that a probability distribution has already been fit to the histogram of the data. E.g., in the case of continuous data, a probability density function has been fully estimated (type of distribution and parameter values). The tests then check how good the fit is.

Steps in fitting a probability distribution to raw data

Fitting a probability distribution is usually done through three activities:
o Activity I: Hypothesizing families of distributions.
o Activity II: Estimation of parameters.
o Activity III: Determining how representative the fitted distributions are.

Activity I: Hypothesizing Families of Distributions

We need to decide what form or family to use: exponential, gamma, or something else? Sometimes we can use our theoretical knowledge of the random variable to hypothesize a distribution. E.g.,
o Arrivals one at a time, at a constant rate, independently: exponential interarrival times.
o Sum of many independent pieces: normal.
o Product of many independent pieces: lognormal.
o Service times: cannot be normal (a normal variable takes values < 0 with positive probability).
o Proportion defective: use a bounded distribution on (0,1).

The following empirical tools can be used to hypothesize a family of distributions.

Descriptive statistics. The descriptive statistics of the sample are compared with those of the hypothesized distribution. For example, the coefficient of variation (CV = standard deviation/mean) is useful in distinguishing continuous distributions.

o CV > 1 suggests gamma or Weibull with shape parameter $\alpha < 1$.
o CV $\approx$ 1 suggests exponential.
o CV < 1 suggests gamma or Weibull with shape parameter $\alpha > 1$.

The Lexis ratio, $\tau$ = variance/mean, is useful in distinguishing discrete distributions.
o $\tau > 1$ suggests negative binomial or geometric.
o $\tau \approx 1$ suggests Poisson.
o $\tau < 1$ suggests binomial.

The skewness, $\nu = E[(X - \mu)^3] / \sigma^3$, where $\mu$ is the mean of X and $\sigma$ its standard deviation, is a measure of symmetry of a distribution's density.
o $\nu > 0$ suggests right skewness (e.g., exponential).
o $\nu \approx 0$ suggests symmetry (e.g., normal).
o $\nu < 0$ suggests left skewness (e.g., right triangular).

Histograms are used to visually check the goodness of fit of the hypothesized distribution (via the probability density or mass function).

Box plots are used to visually inspect the skewness of the data.
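The three summary measures above can be computed directly from a sample. The following is a minimal sketch, not part of the original notes: the function name `summary_stats`, the sample values, and the choice of the n - 1 divisor for the variance are all assumptions made for illustration.

```python
import math

def summary_stats(data):
    """Return the mean, variance, CV, Lexis ratio, and skewness of a sample."""
    n = len(data)
    mean = sum(data) / n
    # Sample variance with the n - 1 divisor (an assumption; the notes do not
    # say which divisor is intended).
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    std = math.sqrt(var)
    # Skewness estimated as the third central sample moment over std**3.
    third = sum((x - mean) ** 3 for x in data) / n
    return {
        "mean": mean,
        "var": var,
        "cv": std / mean,      # coefficient of variation
        "lexis": var / mean,   # Lexis ratio
        "skew": third / std ** 3,
    }

# Hypothetical weekly demand counts, for illustration only.
sample = [0, 1, 1, 2, 0, 5, 3, 0, 1, 7]
stats = summary_stats(sample)
print(stats)
```

For this hypothetical sample the Lexis ratio exceeds 1 and the skewness is positive, which by the rules above would point toward a negative binomial or geometric family.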

Hypothesizing a Family of Distributions: Example with Continuous Data

o Sample of n = 219 interarrival times of cars to a drive-up bank over a 90-minute peak-load period.
o The number of cars arriving in each of the six 15-minute periods was approximately equal, suggesting stationarity of the arrival rate.
o Sample mean = 0.399 (all times in minutes) > median = 0.270, and skewness = +1.458, all suggesting right skewness.
o CV = 0.953, close to 1, suggesting exponential.
o Histograms (for different choices of interval width b) suggest exponential. [Figure: histograms of the interarrival times]
o Box plot is consistent with exponential. [Figure: box plot of the interarrival times]

Hypothesizing a Family of Distributions: Example with Discrete Data

o Sample of n = 156 observations on the number of items demanded per week from an inventory over a three-year period.
o Range: 0 through 11.
o Sample mean = 1.891 > median = 1.00, and skewness = +1.655, all suggesting right skewness.
o Lexis ratio = 5.285/1.891 = 2.795 > 1, suggesting negative binomial or geometric (a special case of negative binomial).
o Histogram suggests geometric. [Figure: histogram of weekly demand]

Activity II: Estimation of Parameters

With the hypothesized distribution(s) at hand, we need to estimate numerical values for the parameters of the distribution(s). There are many methods for estimating parameters:
o Method of moments.
o Least squares.
o Maximum likelihood estimation (MLE).

MLE is the preferred method because (i) it has good statistical properties; (ii) it justifies using goodness-of-fit tests; and (iii) it is intuitive.

The MLE method operates on a set of observed values, $X_1, X_2, \ldots, X_n$. The idea of MLE is to choose the parameter value(s) that maximize the probability that the random variable of interest takes on the values $X_1, X_2, \ldots, X_n$. For example, for a discrete distribution having a single parameter $\theta$, the maximum likelihood estimator is

$$\hat{\theta} = \arg\max_{\theta} L(\theta), \qquad L(\theta) = p_\theta(X_1)\, p_\theta(X_2) \cdots p_\theta(X_n),$$

where $p_\theta(X_i) = P\{X = X_i \mid \text{parameter} = \theta\}$ is the pmf of X. For a continuous distribution, the density function is used in place of the pmf.
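To make the "arg max" idea concrete, the sketch below maximizes the likelihood of an exponential density with mean $\beta$ over a coarse grid and compares the result with the known closed-form MLE for that family, the sample mean. The data values, the grid, and the function name `exp_log_likelihood` are hypothetical and chosen only for illustration; the log of the likelihood is maximized instead of the likelihood itself, which gives the same maximizer.

```python
import math

def exp_log_likelihood(beta, data):
    """Log-likelihood of the density f(x) = (1/beta) * exp(-x/beta), x >= 0."""
    return sum(-math.log(beta) - x / beta for x in data)

# Hypothetical interarrival times in minutes, for illustration only.
data = [0.12, 0.45, 0.08, 0.91, 0.27, 0.33, 0.64]

# Closed-form MLE of the exponential mean: the sample mean.
beta_closed = sum(data) / len(data)

# Crude grid search over candidate means 0.01, 0.02, ..., 3.00,
# to illustrate choosing the parameter that maximizes the likelihood.
grid = [0.01 * k for k in range(1, 301)]
beta_grid = max(grid, key=lambda b: exp_log_likelihood(b, data))

print(beta_closed, beta_grid)
```

The grid maximizer agrees with the sample mean up to the grid spacing, consistent with the fitted mean of 0.399 in the bank example being equal to the sample mean reported there.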

Activity III: Determining How Representative the Fitted Distributions Are

Having hypothesized a family of distributions and estimated its parameters, the final activity is to determine whether the hypothesized distribution is a good fit. The main question here is: Does the fitted distribution agree with the observed data?

There are two approaches to answering this question: heuristic approaches and formal statistical tests. Heuristic approaches use visual tools such as the probability plot, which we used for checking normality. Two formal nonparametric tests are often used: the $\chi^2$ test and the Kolmogorov-Smirnov test. The $\chi^2$ test is based on Pearson's theorem, which we discuss next.

Pearson's Theorem

Consider k boxes $B_1, B_2, \ldots, B_k$, as in the following figure:

[ B_1 | B_2 | ... | B_k ]

Assume that we throw n balls into these boxes randomly and independently of each other. Let $p_i$ be the probability that a ball lands in box i. Let $O_i$ be the number of balls observed in box i.

Then, $O_i$ is binomially distributed with $E_i = E[O_i] = n p_i$. Further, define the random variable $\chi^2$ as

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$

Pearson's Theorem states that, for n large enough, $\chi^2$ has a chi-square distribution with k - 1 degrees of freedom. The proof is based on the normal approximation to the binomial distribution, noting that the $O_i$ are dependent and accounting for their correlation.

The $\chi^2$ goodness of fit test

Given n data points and a hypothesized distribution with cumulative distribution function $\hat{F}(x)$, the test works as follows.
o Divide the range of the data into k intervals, $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k)$.
o Count the number of observations that fall in interval $[a_{j-1}, a_j)$, $O_j$, j = 1, ..., k.
o Find the expected number of observations in each interval, $E_j = n p_j$, where $p_j = \hat{F}(a_j) - \hat{F}(a_{j-1})$.

The test is then performed as follows.
o $H_0$: the $X_i$'s are iid with distribution function $\hat{F}(x)$.
o $H_1$: the $X_i$'s are not iid with distribution function $\hat{F}(x)$.

o The test statistic is based on Pearson's theorem:

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - n p_j)^2}{n p_j}.$$

o Rejection region: at significance level $\alpha$, reject $H_0$ if $\chi^2 > \chi^2_{\alpha, k-1}$.

As a guideline, the intervals $[a_{j-1}, a_j)$ are selected using an equiprobable approach, i.e., $p_1 = p_2 = \cdots = p_k = 1/k$, and such that $n p_j \ge 5$.

Example of using the $\chi^2$ test to check uniformity

Consider the following 100 numbers.

0.126 0.092 0.375 0.938 0.254 0.223 0.029 0.359 0.397 0.343
0.086 0.300 0.072 0.001 0.404 0.621 0.092 0.120 0.565 0.869
0.255 0.958 0.874 0.893 0.046 0.424 0.325 0.603 0.235 0.660
0.167 0.336 0.708 0.589 0.381 0.225 0.191 0.288 0.596 0.633
0.832 0.422 0.902 0.348 0.143 0.039 0.723 0.372 0.920 0.928
0.786 0.680 0.430 0.610 0.363 0.463 0.670 0.678 0.926 0.223
0.208 0.650 0.070 0.010 0.696 0.340 0.548 0.497 0.973 0.518
0.821 0.456 0.485 0.629 0.683 0.953 0.338 0.750 0.780 0.075
0.321 0.994 0.984 0.293 0.185 0.454 0.474 0.557 0.094 0.464
0.690 0.636 0.195 0.645 0.680 0.548 0.118 0.543 0.476 0.137

Use the $\chi^2$ test to test whether these data are uniformly distributed on (0,1). Noting that for the U(0,1) distribution $\hat{F}(x) = x$ for $0 \le x \le 1$, so that $p_j = \hat{F}(a_j) - \hat{F}(a_{j-1}) = a_j - a_{j-1}$, it is appropriate to pick the $a_j$'s as equidistant points.

Given that there are 100 observations, using 10 intervals with $a_0 = 0, a_1 = 0.1, a_2 = 0.2, \ldots, a_{10} = 1$ is appropriate. The test statistic (TS) is computed as follows.

 i    Interval     O_i   E_i   (O_i - E_i)^2 / E_i
 1    [0.0,0.1)    12    10    0.4
 2    [0.1,0.2)     9    10    0.1
 3    [0.2,0.3)    10    10    0
 4    [0.3,0.4)    13    10    0.9
 5    [0.4,0.5)    12    10    0.4
 6    [0.5,0.6)     8    10    0.4
 7    [0.6,0.7)    16    10    3.6
 8    [0.7,0.8)     5    10    2.5
 9    [0.8,0.9)     5    10    2.5
 10   [0.9,1.0]    10    10    0
                         Total 10.8

For $\alpha = 0.05$, the critical value for the test is $\chi^2_{0.05, 9} = 16.92$. Decision: Do not reject $H_0$. There is not enough evidence that the data are not uniformly distributed on (0,1).

Example of the $\chi^2$ test with the exponential distribution

An exponential distribution with $\hat{F}(x) = 1 - e^{-x/0.399}$ was fitted to 219 interarrival time observations. To perform the $\chi^2$ test, k = 20 intervals are used with an equiprobable approach having $p_j = 1/20$. Then, setting $a_0 = 0$ and $a_{20} = \infty$, the $a_j$, j = 1, 2, ..., 19, are found such that $\hat{F}(a_j) = j/20$, which implies that $p_j = \hat{F}(a_j) - \hat{F}(a_{j-1}) = 1/20$.
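Both examples can be checked numerically. The sketch below is an illustration under stated assumptions: the observed counts are copied from the uniformity table rather than recounted from the raw data, the critical value 16.92 is the one quoted in the example, and the helper name `exp_breakpoints` is hypothetical. The break points are obtained by solving $\hat{F}(a_j) = j/k$ for the exponential CDF.

```python
import math

# Observed counts copied from the uniformity example's table (k = 10 intervals).
observed = [12, 9, 10, 13, 12, 8, 16, 5, 5, 10]
n = sum(observed)
k = len(observed)
expected = n / k  # equiprobable intervals: E_j = n * (1/k)

# Pearson's chi-square statistic: sum of (O_j - E_j)^2 / E_j.
chi_sq = sum((o - expected) ** 2 / expected for o in observed)

# Critical value quoted in the example for alpha = 0.05 and k - 1 = 9
# degrees of freedom (taken from the notes, not computed here).
critical = 16.92
reject_h0 = chi_sq > critical

def exp_breakpoints(mean, k):
    """Interior break points a_1..a_{k-1} with F(a_j) = j/k, where
    F(x) = 1 - exp(-x/mean) is the exponential CDF."""
    return [-mean * math.log(1 - j / k) for j in range(1, k)]

# Equiprobable break points for the fitted exponential (mean 0.399, k = 20).
breaks = exp_breakpoints(0.399, 20)

print(round(chi_sq, 1), reject_h0)
print([round(a, 3) for a in breaks])
```

With the break points in hand, the exponential case would proceed by counting the observations in each interval and comparing each count with the expected value 219/20.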

Then, the $a_j$'s are found by inverting $\hat{F}(x)$, i.e., solving

$$\hat{F}(a_j) = 1 - e^{-a_j/0.399} = j/20.$$

This gives

$$a_j = -0.399 \ln(1 - j/20).$$

Once the $a_j$'s are determined, the test proceeds just as in the uniform distribution case above.

The Kolmogorov-Smirnov goodness of fit test

This test can be seen as a formal comparison between the empirical and fitted distribution functions, $F_n(x)$ and $\hat{F}(x)$. Compared with the $\chi^2$ test, it has the advantage of not requiring the data to be grouped into intervals and of being valid for any sample size. However, it is not as general as the $\chi^2$ test.

$H_0$ and $H_1$ for the K-S test are the same as for the $\chi^2$ test. Assume that the data are arranged such that $X_1 \le X_2 \le \cdots \le X_n$. Then, $F_n(X_i) = i/n$. The test statistic for K-S is

$$D_n = \max_{i = 1, \ldots, n} \left| \frac{i}{n} - \hat{F}(X_i) \right|.$$

$H_0$ is rejected (implying that the fitted distribution is not a good fit) if $D_n$ is too large. Critical values for $D_n$ are tabulated below. In this table, p = 1, and the critical value for the two-sided test is used.

[Table: critical values of $D_n$]

Example

Use the K-S test to check whether the following data are iid U(0,1). Use $\alpha = 0.05$.

0.05, 0.14, 0.44, 0.81, 0.93

In this case, $\hat{F}(X_i) = X_i$. The TS is found as follows.

 i              1      2      3      4      5
 X_i            0.05   0.14   0.44   0.81   0.93
 i/n            0.2    0.4    0.6    0.8    1
 |i/n - X_i|    0.15   0.26   0.16   0.01   0.07

Then, $D_n = 0.26$. Since $D_n < 0.563$, the critical value in the table, do not reject $H_0$. There is not enough evidence that the data are not uniformly distributed on (0,1).
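The K-S example can be reproduced with a few lines of code. This is a minimal sketch: the function names are hypothetical, the critical value 0.563 is the one quoted in the example, and the second function is a common textbook variant that also compares $\hat{F}(X_i)$ with $(i-1)/n$, which is an addition and not part of these notes.

```python
def ks_statistic(data, cdf):
    """K-S statistic as used in the example: max over i of |i/n - F(X_(i))|."""
    xs = sorted(data)
    n = len(xs)
    return max(abs(i / n - cdf(x)) for i, x in enumerate(xs, start=1))

def ks_statistic_two_sided(data, cdf):
    """Variant that also checks F(X_(i)) - (i-1)/n (an assumption, not from
    the notes); it can be larger than the statistic above for other data."""
    xs = sorted(data)
    n = len(xs)
    d_plus = max(i / n - cdf(x) for i, x in enumerate(xs, start=1))
    d_minus = max(cdf(x) - (i - 1) / n for i, x in enumerate(xs, start=1))
    return max(d_plus, d_minus)

def uniform_cdf(x):
    """CDF of the U(0,1) distribution."""
    return min(max(x, 0.0), 1.0)

data = [0.05, 0.14, 0.44, 0.81, 0.93]

d_n = ks_statistic(data, uniform_cdf)
d_two_sided = ks_statistic_two_sided(data, uniform_cdf)

critical = 0.563  # value quoted in the example for n = 5, alpha = 0.05
reject_h0 = d_n > critical

print(round(d_n, 2), round(d_two_sided, 2), reject_h0)
```

For these five observations both versions of the statistic coincide, so the decision not to reject $H_0$ does not depend on which one is used.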
