Chi-square goodness-of-fit test for vague data


Przemysław Grzegorzewski
Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland
and Faculty of Math. and Inform. Sci., Warsaw Institute of Technology, Plac Politechniki 1, 00-665 Warsaw, Poland
pgrzeg@ibspan.waw.pl

Anna Jędrej
Faculty of Math. and Inform. Sci., Warsaw Institute of Technology, Plac Politechniki 1, 00-665 Warsaw, Poland

Abstract

Testing goodness-of-fit plays a central role in data analysis. This problem seems to be much more complicated in the presence of vague data. In this paper we suggest how to generalize the well-known chi-square goodness-of-fit test for situations with fuzzy data.

Keywords: Chi-square test, fuzzy sets, goodness-of-fit tests, vague data.

1 Introduction

Most statistical procedures are based on fairly specific assumptions regarding the underlying population distribution, like normality, exponentiality, etc. It is therefore desirable to check whether these assumptions are reasonable. This is a crucial point, because if the assumptions regarding the distribution are not justified then the results of the statistical inference might not be appropriate. Statistical procedures for testing hypotheses about the underlying distribution are called goodness-of-fit tests.

These problems become much more complicated when the data are not necessarily precise. Many statistical tests for vague data have been proposed recently (for a review see, e.g., [7], [8]). But they were generally constructed either under specific assumptions on the underlying distribution ([4]) or as distribution-free tests (e.g. [3], [5], [6]).

In this paper we consider the problem of testing goodness-of-fit with vague data. We suggest how to modify the classical chi-square test for such data. Since the chi-square test is a general one and can be applied to both discrete and continuous distributions, it is a good candidate for this generalization.
2 Chi-square goodness-of-fit test

Suppose that a random sample $X_1, \ldots, X_n$ is drawn from a population with unknown cumulative distribution function $F$. We wish to test the null hypothesis

\[ H_0 : F(x) = F_0(x) \quad \text{for all } x, \tag{1} \]

that the population c.d.f. is $F_0$ (which is completely specified), against

\[ H_1 : F(x) \neq F_0(x) \quad \text{for some } x. \tag{2} \]

To verify $H_0$ a goodness-of-fit test should be used. One of the most popular goodness-of-fit tests is the chi-square test. To apply this test, the data must first be grouped into categories, and then the observed frequencies for these categories are compared with the frequencies expected under the null hypothesis. In the case of a discrete distribution these categories appear in a natural way and are relevant to the distribution under study. When the distribution $F_0$ is continuous we have to arrange classes which are counterparts of the above-mentioned categories. Let us group our data into $k$ mutually exclusive class intervals $(\xi_0, \xi_1], \ldots, (\xi_{k-1}, \xi_k]$, where $\xi_0 < \xi_1 < \ldots < \xi_k$ and $\xi_i \in \mathbb{R} \cup \{-\infty, +\infty\}$.
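The grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `observed_frequencies`, the class boundaries and the sample values are hypothetical and do not come from the paper. The one point worth noting is that the classes are right-closed, $(\xi_{i-1}, \xi_i]$, so an observation equal to a boundary falls into the class on its left.

```python
import bisect
import math

def observed_frequencies(sample, edges):
    """Count crisp observations in the right-closed classes (edges[i-1], edges[i]].

    `edges` is the increasing sequence xi_0 < xi_1 < ... < xi_k and may start
    with -inf and end with +inf. Hypothetical helper, not from the paper.
    """
    k = len(edges) - 1
    counts = [0] * k
    for x in sample:
        # bisect_left returns the first index j with x <= edges[j],
        # i.e. edges[j-1] < x <= edges[j], so j is the 1-based class number.
        j = bisect.bisect_left(edges, x)
        counts[j - 1] += 1
    return counts

# Illustrative boundaries and data (assumed values).
edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]
sample = [-1.5, -1.0, -0.2, 0.0, 0.3, 1.0, 2.4]
counts = observed_frequencies(sample, edges)
```

With boundaries that cover the whole real line, every observation lands in exactly one class, so the counts add up to the sample size.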

Let $n_i$ be the observed frequency in the class interval $\Xi_i = (\xi_{i-1}, \xi_i]$, where $i = 1, \ldots, k$. Of course, we have $n_1 + \ldots + n_k = n$. If the null hypothesis holds then the probability $p_i$ that a given observation belongs to the class interval $\Xi_i = (\xi_{i-1}, \xi_i]$ is given by

\[ p_i = P(\xi_{i-1} < X \le \xi_i \mid H_0) = F_0(\xi_i) - F_0(\xi_{i-1}). \tag{3} \]

Thus $np_i$ is the expected frequency for the $i$th class under $H_0$. The idea of the test is to compare observed and expected frequencies: if the differences between $n_i$ and $np_i$ are large, this indicates that $H_0$ is not correct. The test statistic is

\[ T = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i}. \tag{4} \]

For large samples $T$ is approximately chi-square distributed with $k - 1$ degrees of freedom. We reject $H_0$ in favor of $H_1$ if $T \ge \chi^2_{1-\alpha,k-1}$, where $\chi^2_{1-\alpha,k-1}$ is the quantile of order $1 - \alpha$ of the chi-square distribution with $k - 1$ degrees of freedom.

3 Vague data

It may happen that a sample used for making a decision consists of observations that are not necessarily crisp but may be vague as well. In order to describe the vagueness of data we use the notion of a fuzzy number, introduced by Dubois and Prade [2]. We say that a fuzzy subset $A$ of the real line $\mathbb{R}$, with the membership function $\mu_A : \mathbb{R} \to [0, 1]$, is a fuzzy number if and only if

a) $A$ is normal (i.e. there exists an element $x_0$ such that $\mu_A(x_0) = 1$),
b) $A$ is fuzzy convex (i.e. $\mu_A(\lambda x_1 + (1 - \lambda) x_2) \ge \mu_A(x_1) \wedge \mu_A(x_2)$ for all $x_1, x_2 \in \mathbb{R}$ and $\lambda \in [0, 1]$),
c) $\mu_A$ is upper semicontinuous,
d) $\operatorname{supp} A$ is bounded, where $\operatorname{supp} A = \operatorname{cl}(\{x \in \mathbb{R} : \mu_A(x) > 0\})$ and $\operatorname{cl}$ is the closure operator.

A useful notion for dealing with a fuzzy number is the set of its $\alpha$-cuts. The $\alpha$-cut of a fuzzy number $A$ is a nonfuzzy set defined as

\[ A_\alpha = \{x \in \mathbb{R} : \mu_A(x) \ge \alpha\}. \tag{5} \]

The family $\{A_\alpha : \alpha \in (0, 1]\}$ is a set representation of the fuzzy number $A$. It follows easily from the definition of a fuzzy number that every $\alpha$-cut of a fuzzy number is a closed interval.
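As an illustration of definition (5), the sketch below computes α-cuts for a triangular fuzzy number. The triangular shape, given by a triple `(a, b, c)` with `a < b < c`, is an assumption made only for this example; the paper allows any fuzzy number satisfying a) to d). The function names are hypothetical.

```python
def triangular_membership(x, a, b, c):
    """Membership function of a triangular fuzzy number with support [a, c] and peak b."""
    if a < x <= b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

def alpha_cut(alpha, a, b, c):
    """Closed interval {x : membership(x) >= alpha} for 0 < alpha <= 1."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    lower = a + alpha * (b - a)   # solves (x - a) / (b - a) = alpha
    upper = c - alpha * (c - b)   # solves (c - x) / (c - b) = alpha
    return (lower, upper)

# Illustrative fuzzy number "about 2" (assumed values).
cut_half = alpha_cut(0.5, 1.0, 2.0, 4.0)
cut_one = alpha_cut(1.0, 1.0, 2.0, 4.0)
```

For this shape the cuts shrink monotonically from the support at small α to the single peak at α = 1, and the membership value at each endpoint of a cut equals α.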
Consequently, each $\alpha$-cut can be written as $A_\alpha = [A_\alpha^L, A_\alpha^U]$, where

\[ A_\alpha^L = \inf\{x \in \mathbb{R} : \mu_A(x) \ge \alpha\}, \tag{6} \]

\[ A_\alpha^U = \sup\{x \in \mathbb{R} : \mu_A(x) \ge \alpha\}. \tag{7} \]

The space of all fuzzy numbers will be denoted by $FN(\mathbb{R})$.

The notion of a fuzzy random variable was introduced by Kwakernaak [12], [13]. Other definitions of fuzzy random variables are due to Kruse [10] and to Puri and Ralescu [14]. Our definition is similar to those of Kwakernaak and Kruse. Suppose that a random experiment is described as usual by a probability space $(\Omega, \mathcal{A}, P)$, where $\Omega$ is the set of all possible outcomes of the experiment, $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$ (the set of all possible events) and $P$ is a probability measure. Then a mapping $X : \Omega \to FN(\mathbb{R})$ is called a fuzzy random variable if $\{X_\alpha(\omega) : \alpha \in (0, 1]\}$ is a set representation of $X(\omega)$ for all $\omega \in \Omega$ and for each $\alpha \in (0, 1]$ both $X_\alpha^L = X_\alpha^L(\omega) = \inf X_\alpha(\omega)$ and $X_\alpha^U = X_\alpha^U(\omega) = \sup X_\alpha(\omega)$ are usual real-valued random variables on $(\Omega, \mathcal{A}, P)$ (see [11]).

A fuzzy random variable $X$ is considered as a perception of an unknown usual random variable $V : \Omega \to \mathbb{R}$, called an original of $X$ (if only vague data are available, it is of course impossible to show which of the possible originals is the true one). Similarly, an $n$-dimensional fuzzy random sample $X_1, \ldots, X_n$ may be treated as a fuzzy perception of the usual random sample $V_1, \ldots, V_n$ (where $V_1, \ldots, V_n$ are independent and identically distributed crisp random variables). For more information we refer the reader to [11].

A random variable is completely characterized by its probability distribution $P_\theta$. In statistical reasoning we assume that the probability distribution under study belongs to a family of distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. Very often we then identify the distribution with its parameter $\theta$ and restrict statistical inference to that parameter. However, if we deal with a fuzzy random variable we cannot observe the parameter $\theta$ directly but only its vague image. Using this reasoning together with Zadeh's extension principle, Kruse and Meyer [11] introduced the notion of a fuzzy parameter $\tilde{\theta}$ of a fuzzy random variable, which may be considered as a fuzzy perception of the unknown parameter $\theta$. It is defined as a fuzzy subset of the parameter space $\Theta$ with membership function $\mu_{\tilde{\theta}} : \Theta \to [0, 1]$. Of course, if our data are crisp, i.e. $X = V$, we get $\tilde{\theta} = \theta$.

4 Chi-square test for vague data

Let $\mu_{X_1}, \ldots, \mu_{X_n}$ denote the membership functions of the fuzzy numbers which are the observations of a fuzzy random sample $X_1, \ldots, X_n$. Suppose our sample comes from an unknown distribution $F$, and our goal is to test the null hypothesis

\[ H_0 : F = F_{\tilde{\theta}}, \tag{8} \]

against

\[ H_1 : F \neq F_{\tilde{\theta}}, \tag{9} \]

where the distribution $F_{\tilde{\theta}}$ is completely specified by a fuzzy parameter $\tilde{\theta}$ described by its membership function $\mu_{\tilde{\theta}}$.

In this section we propose how to apply the chi-square test to fuzzy data. As shown in Sec. 2, the data first have to be grouped into class intervals. However, a fuzzy observation, contrary to a crisp one, may belong to more than one class. Therefore, a natural question immediately arises: how to compute observed frequencies for class intervals using fuzzy data?

Let $\Xi_1, \ldots, \Xi_k$, where $\Xi_i = (\xi_{i-1}, \xi_i]$, $\xi_0 < \xi_1 < \ldots < \xi_k$ and $\xi_i \in \mathbb{R} \cup \{-\infty, +\infty\}$, denote, as before, mutually exclusive class intervals. Let

\[ w(X_j) = \int_{\mathbb{R}} \mu_{X_j}(x)\,dx \tag{10} \]

denote the width of a fuzzy observation $X_j$ (see [1]). Moreover, let

\[ w(\Xi_i \cap X_j) = \int_{\xi_{i-1}}^{\xi_i} \mu_{X_j}(x)\,dx \tag{11} \]

denote the width of the intersection of the fuzzy observation $X_j$ with the class interval $\Xi_i = (\xi_{i-1}, \xi_i]$.
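The two widths (10) and (11) have closed forms when the observation is a triangular fuzzy number. The sketch below assumes that shape, given by a triple `(a, b, c)` with `a < b < c`, purely for illustration; the paper does not restrict the shape of the observations, and all names and numbers here are hypothetical. The helper `tri_area_up_to` is the integral of the membership function from minus infinity to `x`, so each intersection width is a difference of two such areas.

```python
import math

def tri_area_up_to(x, a, b, c):
    """Integral of the triangular membership function (a, b, c) over (-inf, x]."""
    if x <= a:
        return 0.0
    if x <= b:
        return (x - a) ** 2 / (2.0 * (b - a))
    if x < c:
        return (c - a) / 2.0 - (c - x) ** 2 / (2.0 * (c - b))
    return (c - a) / 2.0

def width(a, b, c):
    """Eq. (10): area under the whole membership function."""
    return (c - a) / 2.0

def intersection_widths(edges, a, b, c):
    """Eq. (11) for every class (edges[i-1], edges[i]]."""
    return [
        tri_area_up_to(hi, a, b, c) - tri_area_up_to(lo, a, b, c)
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

# Illustrative observation and class boundaries (assumed values).
observation = (1.0, 2.0, 4.0)
edges = [-math.inf, 2.0, 3.0, math.inf]
parts = intersection_widths(edges, *observation)
total = width(*observation)
```

Because the classes partition the real line, the intersection widths of one observation add up to its total width; this is the property that later makes the class shares of each observation sum to one.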
Hence for each $i = 1, \ldots, k$ we can compute the coefficients

\[ \tilde{n}_i = \sum_{j=1}^{n} \frac{\int_{\xi_{i-1}}^{\xi_i} \mu_{X_j}(x)\,dx}{\int_{\mathbb{R}} \mu_{X_j}(x)\,dx} \tag{12} \]

which are counterparts of the frequencies corresponding to the classes $\Xi_i$. One can check that $\tilde{n}_1 + \ldots + \tilde{n}_k = n$, just as the sum of frequencies is equal to the sample size. However, since the coefficients $\tilde{n}_i$ can assume arbitrary nonnegative values (not necessarily integers), we call them quotas rather than frequencies.

As we remember, the main idea of the chi-square test is to compare the observed frequencies for class intervals with the frequencies expected under the null hypothesis. Thus, we have to solve another problem: how to determine the frequencies expected under the null hypothesis that the actual distribution is $F_{\tilde{\theta}}$?

Suppose $f_{\tilde{\theta}}$ denotes the density function corresponding to $F_{\tilde{\theta}}$, which may be considered as a fuzzy perception of the density $f_\theta$. In other words, a density $f_\theta$ belongs to $f_{\tilde{\theta}}$ with the same grade $\mu_{\tilde{\theta}}(\theta)$ as $\theta$ belongs to the fuzzy parameter $\tilde{\theta}$. For each $i = 1, \ldots, k$ we can determine the following values:

\[ \pi_i = \int_{\xi_{i-1}}^{\xi_i} \int_{\Theta} \mu_{\tilde{\theta}}(\theta) f_\theta(x)\,d\theta\,dx. \tag{13} \]

Let $C$ be a constant such that

\[ C = \sum_{i=1}^{k} \pi_i. \tag{14} \]

Thus, if the null hypothesis holds, the probability that a fuzzy observation belongs to the class interval $\Xi_i = (\xi_{i-1}, \xi_i]$ is given by the following formula:

\[ \tilde{p}_i = \frac{\pi_i}{C}. \tag{15} \]

One can easily check that $\tilde{p}_i \in [0, 1]$ for each $i = 1, \ldots, k$ and $\sum_{i=1}^{k} \tilde{p}_i = 1$. Hence $n\tilde{p}_i$ denotes the expected frequency for the $i$th class under $H_0$.

So now we are able to define a test statistic for testing $H_0$ against $H_1$ with fuzzy data. It is given by

\[ T = \sum_{i=1}^{k} \frac{(\tilde{n}_i - n\tilde{p}_i)^2}{n\tilde{p}_i}. \tag{16} \]

One can see that it looks like (4). In fact it has the same structure, and the only difference is that $\tilde{n}_i$ and $\tilde{p}_i$ in (16) are computed in a different way than $n_i$ and $p_i$ in (4). Thus it is not surprising that for large samples, as simulations show, statistic (16) is approximately chi-square distributed with $k - 1$ degrees of freedom. Therefore, we reject $H_0$ in favor of $H_1$ if $T \ge \chi^2_{1-\alpha,k-1}$, where $\chi^2_{1-\alpha,k-1}$ is the quantile of order $1 - \alpha$ of the chi-square distribution with $k - 1$ degrees of freedom.

We have examined the suggested chi-square goodness-of-fit test for fuzzy data using Monte Carlo simulations (see [9]). Our simulation study showed that the test works properly but is slightly liberal, i.e. it tends to reject a true null hypothesis somewhat more often than the nominal significance level.

So far we have considered the chi-square goodness-of-fit test for a simple null hypothesis. However, the more typical situation in practice is that the null hypothesis is composite, i.e. it states the form of the distribution but not all the relevant parameters. If the null hypothesis is composite then we have to estimate the expected frequencies from the data (generally using the method of maximum likelihood). Then we can apply the same test statistic (4), but now its asymptotic distribution is the chi-square distribution with $k - r - 1$ degrees of freedom, where $r$ is the number of parameters which had to be estimated in order to estimate the expected frequencies. Taking all these remarks into account, our goodness-of-fit test for fuzzy data could also be applied to composite hypotheses. In such a case the method of maximum likelihood and Zadeh's extension principle would be useful in estimating the expected frequencies.
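For the simple null hypothesis, the whole procedure from the quotas (12) to the statistic (16) can be sketched as follows. Everything specific in this sketch is an assumption made for illustration and is not taken from the paper: the observations are triangular fuzzy numbers `(a, b, c)`, the hypothesized model is normal with known standard deviation and a triangular fuzzy mean, the integral over θ in (13) is approximated by a midpoint rule after swapping the order of integration, and the data are generated artificially. The critical value 7.815 is the tabulated 0.95 quantile of the chi-square distribution with 3 degrees of freedom.

```python
import math
import random

def tri_membership(t, a, b, c):
    if a < t <= b:
        return (t - a) / (b - a)
    if b < t < c:
        return (c - t) / (c - b)
    return 0.0

def tri_area_up_to(x, a, b, c):
    """Integral of the triangular membership function over (-inf, x]."""
    if x <= a:
        return 0.0
    if x <= b:
        return (x - a) ** 2 / (2.0 * (b - a))
    if x < c:
        return (c - a) / 2.0 - (c - x) ** 2 / (2.0 * (c - b))
    return (c - a) / 2.0

def quotas(observations, edges):
    """Eq. (12): each observation spreads a total share of 1 over the classes."""
    result = [0.0] * (len(edges) - 1)
    for a, b, c in observations:
        w = (c - a) / 2.0  # eq. (10) for a triangular observation
        for i in range(len(result)):
            part = tri_area_up_to(edges[i + 1], a, b, c) - tri_area_up_to(edges[i], a, b, c)
            result[i] += part / w
    return result

def normal_cdf(x, mean, sd):
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def expected_probabilities(edges, theta, sd, steps=2000):
    """Eqs. (13)-(15), assuming a normal model with fuzzy mean theta = (a, b, c)."""
    a, b, c = theta
    h = (c - a) / steps
    pi = [0.0] * (len(edges) - 1)
    for s in range(steps):
        t = a + (s + 0.5) * h                    # midpoint of the s-th subinterval
        weight = tri_membership(t, a, b, c) * h  # mu(theta) d(theta)
        for i in range(len(pi)):
            pi[i] += weight * (normal_cdf(edges[i + 1], t, sd) - normal_cdf(edges[i], t, sd))
    total = sum(pi)                              # the constant C of eq. (14)
    return [p / total for p in pi]

def fuzzy_chi_square(n_tilde, p_tilde, n):
    """Eq. (16)."""
    return sum((q - n * p) ** 2 / (n * p) for q, p in zip(n_tilde, p_tilde))

# Artificial data: crisp normal draws blurred into triangular observations.
random.seed(0)
n = 200
observations = [(v - 0.2, v, v + 0.2) for v in (random.gauss(0.0, 1.0) for _ in range(n))]
edges = [-math.inf, -0.7, 0.0, 0.7, math.inf]

n_tilde = quotas(observations, edges)
p_tilde = expected_probabilities(edges, theta=(-0.1, 0.0, 0.1), sd=1.0)
T = fuzzy_chi_square(n_tilde, p_tilde, n)
reject = T >= 7.815  # tabulated chi-square quantile, order 0.95, k - 1 = 3 d.f.
```

Two properties from the text can be checked directly on the output: the quotas add up to the sample size and the probabilities add up to one. When the classes cover the whole real line, the constant C equals the width of the fuzzy parameter, since each density integrates to one.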
For a composite hypothesis, the final decision would then be based on the relationship between test statistic (16) and the appropriate quantile of the chi-square distribution with $k - r - 1$ degrees of freedom.

5 Conclusions

In the present paper we have considered the problem of goodness-of-fit testing in a fuzzy environment. We have proposed a generalization of the classical chi-square goodness-of-fit test for fuzzy data. Unfortunately, as in the classical case with crisp data, the chi-square test requires large samples. Since such large samples are not always available, other goodness-of-fit techniques for vague data are still required.

References

[1] S. Chanas, On the interval approximation of a fuzzy number, Fuzzy Sets and Systems 122 (2001), 353–356.
[2] D. Dubois, H. Prade, Operations on fuzzy numbers, Int. J. Syst. Sci. 9 (1978), 613–626.
[3] P. Grzegorzewski, Statistical inference about the median from vague data, Control and Cybernetics 27 (1998), 447–464.
[4] P. Grzegorzewski, Testing statistical hypotheses with vague data, Fuzzy Sets and Systems 112 (2000), 501–510.
[5] P. Grzegorzewski, Distribution-free tests for vague data, In: Soft Methodology and Random Information Systems, Lopez-Diaz M., Gil M.A., Grzegorzewski P., Hryniewicz O., Lawry J. (Eds.), Springer, Heidelberg, 2004, pp. 495–502.
[6] P. Grzegorzewski, Two-sample median test for vague data, In: Proceedings of the 4th Conference of the European Society for Fuzzy Logic and Technology - Eusflat 2005, Barcelona, September 7-9, 2005, pp. 621–626.
[7] P. Grzegorzewski, O. Hryniewicz, Testing hypotheses in fuzzy environment, Mathware and Soft Computing 4 (1997), 203–217.
[8] P. Grzegorzewski, O. Hryniewicz, Soft methods in hypotheses testing, In: Soft Computing for Risk Evaluation and Management, Ruan D., Kacprzyk J., Fedrizzi M. (Eds.), Springer Physica-Verlag, Heidelberg, 2001, pp. 55–72.
[9] A. Jędrej, Simulation of fuzzy random variables, M.Sc. Thesis, Warsaw Institute of Technology, 2006 (in Polish).
[10] R. Kruse, The strong law of large numbers for fuzzy random variables, Inform. Sci. 28 (1982), 233–241.
[11] R. Kruse, K.D. Meyer, Statistics with Vague Data, D. Reidel Publishing Company, 1987.
[12] H. Kwakernaak, Fuzzy random variables, part I: Definitions and theorems, Inform. Sci. 15 (1978), 1–15.
[13] H. Kwakernaak, Fuzzy random variables, part II: Algorithms and examples for the discrete case, Inform. Sci. 17 (1979), 253–278.
[14] M.L. Puri, D.A. Ralescu, Fuzzy random variables, J. Math. Anal. Appl. 114 (1986), 409–422.