CHAPTER 1 INTRODUCTION

Similar documents
Katz Family of Distributions and Processes

Mixtures and Random Sums

Approximating the Conway-Maxwell-Poisson normalizing constant

HANDBOOK OF APPLICABLE MATHEMATICS

Statistical Methods in HYDROLOGY CHARLES T. HAAN. The Iowa State University Press / Ames

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

On a simple construction of bivariate probability functions with fixed marginals 1

Stat 5101 Lecture Notes

Testing Statistical Hypotheses

Institute of Actuaries of India

Subject CS1 Actuarial Statistics 1 Core Principles

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation

EVALUATING THE BIVARIATE COMPOUND GENERALIZED POISSON DISTRIBUTION

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics

STATISTICS SYLLABUS UNIT I

Copula modeling for discrete data

Testing Statistical Hypotheses

GAUSSIAN PROCESS REGRESSION

Mathematical statistics

Chapter 5. Chapter 5 sections

HANDBOOK OF APPLICABLE MATHEMATICS

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Estimation of Operational Risk Capital Charge under Parameter Uncertainty

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80

Generalized Linear Models. Kurt Hornik

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Chapter 5 continued. Chapter 5 sections

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Probability for Statistics and Machine Learning

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

Multivariate Statistics

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45

Overview of Extreme Value Theory. Dr. Sawsan Hilal space

Key Words: Conway-Maxwell-Poisson (COM-Poisson) regression; mixture model; apparent dispersion; over-dispersion; under-dispersion

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

Simulating Realistic Ecological Count Data

STAT 461/561- Assignments, Year 2015

Research Article The Laplace Likelihood Ratio Test for Heteroscedasticity

Fundamentals of Applied Probability and Random Processes

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances

Recall the Basics of Hypothesis Testing

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

On Discrete Distributions Generated through Mittag-Leffler Function and their Properties

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

PRINCIPLES OF STATISTICAL INFERENCE

Extreme Value Analysis and Spatial Extremes

Estimation, Inference, and Hypothesis Testing

Unconditional Distributions Obtained from Conditional Specification Models with Applications in Risk Theory

Mathematical statistics

15 Discrete Distributions

BAYESIAN INFERENCE ON MIXTURE OF GEOMETRIC WITH DEGENERATE DISTRIBUTION: ZERO INFLATED GEOMETRIC DISTRIBUTION

Statistical Data Analysis Stat 3: p-values, parameter estimation

Hypothesis testing:power, test statistic CMS:

Contents 1. Contents

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

The LmB Conferences on Multivariate Count Analysis

MFM Practitioner Module: Quantitative Risk Management. John Dodson. September 23, 2015

Directional Statistics

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

Stat 451 Lecture Notes Numerical Integration

Lecture 2: Repetition of probability theory and statistics

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Ch. 5 Hypothesis Testing

Modelling Dependence with Copulas and Applications to Risk Management. Filip Lindskog, RiskLab, ETH Zürich

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Linear Methods for Prediction

Zero-Inflated Models in Statistical Process Control

Introduction to Maximum Likelihood Estimation

Introduction. Chapter 1

Using copulas to model time dependence in stochastic frontier models

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

Probability and Stochastic Processes

Stat 5102 Final Exam May 14, 2015

Bayesian Model Diagnostics and Checking

Probability Distributions Columns (a) through (d)

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER.

Introduction to Probability and Statistics (Continued)

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators

Multiple Random Variables

Three hours. To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER.

ON INTERVENED NEGATIVE BINOMIAL DISTRIBUTION AND SOME OF ITS PROPERTIES

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed.

Numerical Analysis for Statisticians

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

Primer on statistics:

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Statistical techniques for data analysis in Cosmology

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

General Regression Model

Spring 2012 Math 541B Exam 1

Information geometry for bivariate distribution control

Transcription:

CHAPTER 1 INTRODUCTION 1.0 Discrete distributions in statistical analysis Discrete models play an extremely important role in probability theory and statistics for modeling count data. The use of discrete models does not only enable the investigators to visualize and discuss the observed data but also to summarize and make predictions or estimate the properties of the samples easily. In addition, the behaviour and patterns observed in many types of phenomena can be described through discrete models. Therefore, it is not surprising to find that the study and application of discrete models has received much attention and became a major research area for decades. The discovery of new models with applications is ongoing. The Poisson, binomial and hypergeometric distributions are a few of the most often used discrete models. In short, probability models or distributions are very important in statistical data analysis as empirical models or as mathematical models for random variations. Families of univariate and multivariate distributions have been examined by many researchers. In particular for discrete distributions, the difference-equation families, power series distributions and Kemp families (see Johnson et al., 005) are well-known while the bivariate and multivariate distributions have been covered by a number of books (Mardia, 1970; Kocherlakota & Kocherlakota, 199; Johnson et al., 1997; Kotz et al., 000). The principle of complete randomness is not practical in all situations. The observed patterns tend to have variations and evidences were present in the fields of 1

biology, ecology, and medicine to indicate contagion in the distributions of the observed data. This has led to a rich class of contagious distributions; for example, the Neyman s Type A, B, C (1939), negative binomial (Fisher, 1941), discrete lognormal (Preston, 1948), Thomas (1949) distributions and so on. Archibald (1948) had shown that the contagious distributions of Neyman often give a good fit to plant distributions. Bliss and Fisher (1953) studied the efficiency of parameter estimation methods for the negative binomial distribution to biological data, and Pipes et al. (1977) indicated its suitability for fitting coliform counts (bacterial density in water). Clearly, enumerative studies on contagious distributions in various fields have been well-developed. 1.1 Literature review This section gives a brief review relevant to the development of this thesis and to provide motivation for research problems considered. To describe the behavior of different types of phenomena, the Poisson distribution is derived by assuming the events occur under the principle of complete randomness. The model depends upon a single parameter which is the mean as well as the variance. Based on the equality of mean and variance, the Poisson distribution is said to satisfy the equi dispersion property. However, in practice, most of the real data encountered are either over or under dispersed and therefore do not follow a Poisson model. Numerous researchers have tried to explain the variations in the observed data by various modifications and generalization of the Poisson distribution; see Lundberg, (1940), Haight (1967), Patil and Joshi (1968) and Johnson and Kotz (1969). Most of these distributions were developed to explain the unequal mean and variance in the observed data.

Consul and Jain (1970, 1973a, b) introduced a generalized Poisson distribution (GPD) with parameters θ and λ. The variance of this model can be less than, equal to or even greater than the mean depending on the parameter λ, meaning that the GPD can cater for under, equi and over dispersed data. The increase or decrease of parameter θ tends to increase or decrease both the mean and variance. But for under dispersion, the parameters have to be restricted in order for the GPD to be a proper distribution (Nelson, 1975); see also Scollnik (1998) for further discussion. Efron (1986) proposed the double Poisson distribution to deal with the issue of under, equi and over dispersion in the data. A disadvantage of the double Poisson distribution is that it is not a proper distribution as the probabilities do not sum to one. However, Efron showed that the multiplication constant, which makes the distribution proper, is very close to one virtually over the whole parameter space. Efron also suggested an approximation for this constant which was obtained by Edgeworth expansion. Another distribution that can model under, equi and over dispersion is the Conway-Maxwell-Poisson distribution (COM-Poisson) (Conway and Maxwell, 196; Shmueli et al., 005) which enjoyed a recent revival. The COM-Poisson distribution is a generalization of the Poisson distribution which contains the Bernoulli and geometric distributions as special cases. In addition, it is also a member of the exponential family. The flexibility of this distribution has prompted a fast growth of research in various fields as it possesses only two parameters and allows a wide range of under and over dispersion. The earliest use of the COM-Poisson distribution is found in the field of linguistics. Major development of the COM-Poisson distribution followed after it was re-introduced by Shmueli et al. (005) where the statistical properties of the COM- Poisson were studied in detail. However, the COM-Poisson distribution does not have 3

closed form expression for its moments and asymptotic approximations are required even for its mean and variance (Shmueli et al., 005, page 130). Shimizu and Yanagimoto, (1991) proposed the inverse trinomial (IT) distribution which arises as a random walk on the real line with three transition probabilities. The IT distribution has pmf as Pr( X ) λ x x / t λ p q x + λ pr = x = x λ t = 0 t, t + λ, x t q + for x = 0, 1,, where λ > 0, p r, p + q + r = 1 and x + λ ( x + λ)! t, t λ, x t = + t!( t + λ)!( x t)! Although, this model has a pmf in terms of the Gauss hypergeometric function, the existence of the recurrence formula simplifies its computation. Khang and Ong (007) have shown the IT distribution to be a Poisson-stopped (generalized Poisson) distribution. Aoyama et al. (008) proposed a generalization of the shifted inverse trinomial (GIT) distribution. The GIT distribution is formulated as a first-passage time distribution of a modified random walk on the half plane with five transition probabilities. It contains up to twenty-two possible distributions, including the binomial, negative binomial, shifted negative binomial, shifted inverse binomial and shifted inverse trinomial distributions. According to Aoyama et al. (008), one of the subclass of the GIT distribution, denoted by GIT distribution, also has the flexibility to model under, equi and over dispersion data. This interesting feature has motivated us to further study the GIT distribution. The study of discrete distributions has been widened to multivariate distributions. During the 1950 s the negative binomial distribution has received considerable attention, mainly in problems concerning accident proneness where it has 4

been extended to the bivariate case (see Arbous and Kerrich, 1951) and the multivariate case (see Bates and Neyman, 195). Edward and Gurland (1961) and Subrahmaniam (1966) worked independently on a version of the bivariate negative binomial distribution (BNBD). The joint probability mass function formulated by Subrahmaniam (1966) is given as Γ ( ν + x, x i) f ( x, x ) q p p p min( x1, x ) ν 1 r i s i i 1 = 1 3 i= 0 Γ( ν ) i!( r i)!( s i)! where q = 1 ( p1 + p + p3) and x1, x = 0, 1,,.... The disadvantage of this model is the correlation between the two random variables X1, X must be positive. A less restrictive model which allows negative correlation is introduced by Mitchell and Paulson (1981) but the shape parameter is limited to integer values. Based on the trivariate reduction method, Famoye and Consul (1995) developed a bivariate generalized Poisson distribution with X1 = Y1 + Y3 and X = Y + Y3 where Y1, Y, Y 3 are independent univariate generalized Poisson random variables. Unfortunately, the model only permits positive correlation. Lee (1999) defined a bivariate negative binomial distribution (BNBD) based on copula function which allows positive or negative correlation. This model suffers from two main drawbacks where the joint probability function is very complicated and it can only be used for over dispersed data. Lakshaminaraya, Pandit and Rao (1999) formulated a bivariate Poisson distribution as a product of Poisson marginals with a multiplicative factor. The joint probability function is defined as f ( x, x ) = θ θ e x1 x x1 x 1 1 x1 dθ1 x dθ ( e e )( e e ) 1+ λ x! x! 1 5

where d 1 = 1 e d i, e θ X is the expectation E( e i ) (i=1, ) and x1, x = 0, 1,,.... The d ( θ1+ θ ) correlation coefficient is given by ρ( x, x ) = λ θ θ d e. Clearly, the correlation 1 1 is flexible as it depends on the values of the multiplicative factor λ which can be negative, zero or positive. The work of Lakshaminaraya, Pandit and Rao (1999) is extended by Famoye (010) who introduced a new bivariate generalized Poisson distribution (BGPD) with joint probability function xi xi 1 θi (1 + αix ) i x1 x f ( x1, x) = exp [ θi (1 + αixi )] 1 + λ( e c1 )( e c) i= 1 xi! where c E( e X i ) exp [ θ ( s 1) ] = =, ln s α θ ( s 1) + 1 = 0 (i=1, ) and x1, x = 0, 1 i i i i i i i,.... The correlation is also dependent on the value of the multiplicative factor λ. Hence, it can be negative, zero or positive. Assumed models are almost never exactly true in reality; the goals of a practitioner are to estimate θ, the vector of parameters of a distribution, efficiently when the model is correct and robustly in the case where the true distribution is in the neighbourhood of the model. Various estimation methods such as the method of moments, maximum likelihood (ML) method and so on have been the source of considerable interest as seen in recent statistical literature. Traditionally, the maximum likelihood (ML) approach is a popular technique in parameter estimation due to its generality and asymptotic efficiency. Various aspects and applications of MLE have been discussed in many papers. However, difficulties arise when the probability function is complicated, not tractable or not bounded over the parameter space. Besides that, MLE does not work well in the presence of outliers. Therefore, minimum Hellinger (MH) estimation which was introduced by Beran (1977) 6

attracted great attention because of its ability to reconcile the properties of robustness and asymptotic efficiency. Tamura and Boos (1986) have also considered the minimum Hellinger distance estimator for continuous models. Simpson (1989) discussed the minimum Hellinger distance estimation for discrete models and studied the robust hypothesis testing problem for general models using the Hellinger distance. Unfortunately, computations in the MH estimation also depend on the probability function, similar to MLE, where a model with complicated probability function will severely slow down the parameter estimation process. This problem led researchers to investigate simpler inference procedures one of which is based on the probability generating function (pgf). The pgf based approach has been proposed for testing goodness-of-fit and parameter estimation. Kemp and Kemp (1988) suggested estimators based on the empirical probability generating function (epgf); the methods involve solving estimating equations obtained by equating functions of the epgf and probability generating function (pgf) on a fixed, finite set of values. Dowling and Nakamura (1997) considered the epgf for parameter estimation for families of discrete distributions with support on the nonnegative integers or a subset thereof. The pgf viewpoint leads to appealing graphical procedures and computational advantages. For the test of goodness-of-fit, much work has been focused on discrete distributions. In general, the Pearson s chi-square is most widely used where it can be applied to count data distribution with a finite number of classes. Kocherlakota and Kocherlakota (1986) proposed goodness-of-fit tests for discrete random variables based on the pgf ( ; θ ) X θ G t = E t and derived the necessary asymptotic results. In the proposed tests, the epgf has to be evaluated at a number of values of t, and therefore, the performance of the tests depends on the selected t. Marques and Pérez-Abreu (1989) 7

showed that the process ξ ( t, θ ) = N { G ( t) G( t; θ )} converges in the space C [ 0,1] N to a continuous Gaussian process centred at 0, where G ( ) given by N N t is the empirical pgf (epgf) N 1 xi G ( t) = t, t 1. N N i = 1 By using this result, Rueda et al. (1991) proposed a quadratic statistic for testing goodness-of-fit. Rueda et al. (1999) derived further results and corrected an error in the expression for the covariance formula. 1. Contributions of the Thesis A new distribution denoted by GIT is proposed to model under, equi and over dispersion. Inference, study of statistical power by Monte Carlo simulation and goodness-of-fit of this GIT model have been considered. A paper based upon this work has been submitted for publication. Two new bivariate GIT distributions have been derived. The properties and the characteristics of the distributions are presented. The performance of the proposed pgfbased MH type distance estimation is extended and applied to the parameter estimation for the bivariate case. A paper on this part of work is in progress. The COM-Poisson distribution is also studied. The limitation of computer accuracy for the computation of the infinite sum and the accuracy of the proposed asymptotic approximation are investigated. The study of statistical power by Monte Carlo simulation has been conducted. A paper on the COM-Poisson distribution has been prepared for publication. 8

A new estimation procedure based upon minimum disparity and probability generating function has been investigated and compared to well-known estimation methods like MLE and minimum Hellinger distance estimation. It is of interest to identify an efficient and simple method of parameter estimation for discrete distributions with complicated probability mass functions. The basic asymptotic theory for the proposed pgf-based MH type distance estimation is provided. An intensive simulation study has been done to investigate the robustness and consistency of the proposed statistic in estimation. This work has been published in Sim and Ong (010). 1.3 Organization of the Thesis A general introduction and brief literature survey regarding the thesis are given in Chapter 1. The contributions of the thesis are also stated. Chapter gives the preliminaries of the thesis by providing the terms, theorems and concepts needed for the following chapters. Our main findings are concentrated in Chapters 3 to 6. A paper is to be written based upon each chapter. In Chapter 3, we explored and reviewed the properties of the GIT distribution. A proposed pgf-based MH type distance estimation and goodness-of-fit test are considered for the GIT distribution. A test for equi- dispersion by Rao s score test and LR test is provided. Then results of a simulation study of the statistical power are given. The fitting of this distribution to real life data sets will be examined where the GIT distribution is compared to the GPD and COM-Poisson distributions. 9

In Chapter 4 two new bivariate GIT distributions have been derived. The properties and characteristics of the distributions are presented together with parameter estimation and application of the new bivariate distributions to real life data sets. Further results for the COM-Poisson distribution are given in Chapter 5. Some computational issues, parameter estimation, test of equi- dispersion and statistical power of the score and log-likelihood tests are examined. In Chapter 6, we proposed a pgf-based MH type distance estimation. A study of various techniques in parameter estimation and goodness-of-fit tests will be carried out. Comparative analysis of the performance of the proposed parameter estimation methods has been done. Simulated data will be used to validate the theoretical results concerning an efficient method for the proposed parameter estimation. Analyses of real life data sets are then provided. for the future. Finally, Chapter 7 concludes the finding of our research and we discussed works 10