Abdel H. El-Shaarawi National Water Research Institute, Burlington, Ontario, Canada L7R 4A6

APPLIED STATISTICS

Abdel H. El-Shaarawi
National Water Research Institute, Burlington, Ontario, Canada L7R 4A6

Keywords: Data acquisition, data reduction, exploratory analysis, graphs, tables, sample space, probability, random variable, estimation, models, inference, likelihood, Bayesian methods, experimental design, randomization, replication, control

Contents

1. Introduction
2. Foundations
3. Exploratory Data Analysis
4. Models
5. Statistical Inference
6. Design of Experiments
7. The Future of Applied Statistics
Glossary
Bibliography
Biographical Sketch

Summary

Statistics is a field that has been widely applied to almost all aspects of human endeavor, particularly those related to science and technology. This article provides an overview of probability as the foundation of statistical methods, which justifies their use in making inductive inferences based on observed data. In addition, statistical methods used for data exploration, for model building and assessment, and for data acquisition are briefly presented.

1. Introduction

Observations and experiments produce data that can be used to challenge existing beliefs, make decisions, predict the future and reveal new knowledge. It is not surprising, then, to see statistical methods applied to almost all types of human activity. National governments periodically collect, analyze and publish demographic, social, economic and environmental statistical data. These are used to document changes, identify causes and make decisions. Many papers published in scientific journals contain some application of statistical methods. Statistical thinking plays a pivotal role in shaping and guarding the objectivity of scientific methods. In some cases, the interaction of statistics with other disciplines has led to the creation of new branches of science with their own journals, societies and educational departments, as can be seen in the cases of biostatistics, chemometrics and environmetrics.
One may ask why statistical science is so widely applied. The answer rests on the fact that, although data are collected on a limited number of units (a sample), the objective of statistics is to extract and understand the information contained in the data and to generalize the results to other potential sampling units that are not included in the sample. It achieves this objective by devising methods for extracting the relevant information contained in the data and by the logical generalization of the results of the sample to unsampled units.

In this contribution, we begin by briefly discussing the probabilistic foundations of applied statistical methods. This is followed by a broad discussion of the statistical methods for exploratory data analysis, model building, inference, and the design of data collection. We conclude with some comments about the future practice of applied statistics in a rapidly advancing technological environment.

2. Foundations

The foundations of formal applied statistical methods reside in the theory of probability. This theory arose historically out of discussions of games of chance in the 17th century. From this beginning, it slowly trickled into other areas such as agriculture and official statistics, and it is now applied in many different fields. Experimental and observational data are subject to a measure of uncertainty. This uncertainty is modeled and analyzed by probabilistic methods. Such methods are mathematically precise and differ from the subjective notion of the word probability used to judge a situation or an event in everyday life.

The mathematical approach begins with the primitive concept of elementary events. All possible elementary events form a sample space, and a subset of these events represents a compound event. To each event is attached a mathematical measure that represents the probability of its realization. To clarify these concepts, let us consider a controlled experiment in which three indistinguishable organisms are individually exposed to an equivalent dose of a toxic substance for the same length of time. Let the objective of the experiment be to determine the effects of exposure upon the survival of the organism.
If S and D represent, respectively, the survival and death of an organism, then an elementary event corresponds to a particular combination of the outcomes (S or D) of the three organisms. For example, {S,S,D} stands for the survival of the first two organisms and the death of the third. The sample space $\mathcal{S}$ consists of the eight elementary events {S,S,S}, {S,S,D}, {S,D,S}, {S,D,D}, {D,S,S}, {D,S,D}, {D,D,S} and {D,D,D}.

The next step is to assign a probability to each of these events. For the purpose of illustration, let us assume that the toxic substance has no effect on the organism. Hence it is logical to view each of these events as equally likely. Thus, P(S) = P(D) = 1/2. Therefore, the probability of an elementary event is P{elementary event} = (1/2)(1/2)(1/2) = 1/8. From this probability measure, we can calculate the probability of any compound event by summing the probabilities of its elementary events. Let R be the number of survivors. Then R takes on the values 0, 1, 2, 3 with the following probabilities: P{R=0} = 1/8, P{R=1} = 3/8, P{R=2} = 3/8 and P{R=3} = 1/8. We note that the sum of the probabilities of R over its entire set of values is one. Each value of R defines a compound event, made up of one or more elementary events. For instance, the event R = 2 corresponds to the three elementary events {S,S,D}, {S,D,S} and {D,S,S}, so that P{R=2} = 3/8.

A variable defined analogously to R is called a random variable. It is a function defined on the sample space $\mathcal{S}$. Here R takes on a set of discrete values, and for every possible value j of R there exists a non-negative number called the probability of R = j (that is, P{R=j}) such that 0 ≤ P{R=j} ≤ 1. If, instead of three organisms, n organisms are used and, instead of assuming that the toxic substance has no effect, we assume that P{S} = θ and P{D} = 1 - θ, where 0 ≤ θ ≤ 1, then R follows a binomial distribution with

$$P\{R = j\} = \binom{n}{j}\, \theta^{j} (1-\theta)^{n-j}, \qquad j = 0, 1, \ldots, n.$$

This binomial distribution serves as an approximate model for many natural phenomena. There are other probability models for discrete data. These include the multinomial, Poisson, hypergeometric and negative binomial distributions (see Johnson, Kotz and Kemp (1992)). There are also multivariate versions of these distributions, which can be found in the book by Johnson, Kotz and Balakrishnan (1997).

Univariate and multivariate probability models are also available for continuous data. Here the sample space is continuous and the probability of realizing any specific value is zero. We speak instead of the probability that the random variable is realized within an interval. Examples of common continuous models include the normal, gamma, uniform, beta, Cauchy, Student t, chi-square and Weibull distributions. Most of these models are given in statistics books (see Evans, Hastings and Peacock (2000) or Kendall and Stuart (1976)). Let f(x, α) be the probability density of a continuous random variable X, where α is the parameter of the distribution.
Note that both X and α may be vectors.

One continuous distribution that deserves further discussion is the normal distribution. A random variable X is normally distributed if it has the probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right\},$$

where μ is the mean and σ > 0 the standard deviation. This distribution has been a cornerstone of statistical theory and applications for decades. Classical regression analysis and the associated analysis of variance methods are based upon this distribution or upon distributions derived from it. These include the Student t, the chi-square and the Fisher F distributions. Moreover, the normal distribution (univariate or multivariate) has been derived as the asymptotic limiting distribution of many frequently used test statistics (Van der Vaart (1998)).
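As a rough numerical sketch of the two kinds of model described above, the Python code below first enumerates the sample space of the organism experiment and sums elementary-event probabilities to obtain the distribution of R, which can then be compared with the binomial formula. It then computes the probability that a normal random variable falls within an interval. The function names are hypothetical, and the enumeration assumes that the organisms respond independently, as the product (1/2)(1/2)(1/2) in the text implies.

```python
from itertools import product
from math import comb, erf, sqrt


def survivor_distribution(n=3, theta=0.5):
    """Distribution of R (number of survivors) by enumerating the sample space.

    Assumption: organisms respond independently, each surviving with
    probability theta, so an elementary event's probability is a product.
    """
    probs = {j: 0.0 for j in range(n + 1)}
    for outcome in product("SD", repeat=n):  # e.g. ('S', 'S', 'D')
        r = outcome.count("S")
        # Probability of this single elementary event.
        p = theta ** r * (1 - theta) ** (n - r)
        # A compound event {R = r} sums its elementary events.
        probs[r] += p
    return probs


def binomial_pmf(j, n, theta):
    """Binomial probability P{R = j} for n trials with success probability theta."""
    return comb(n, j) * theta ** j * (1 - theta) ** (n - j)


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cumulative distribution function, written via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def normal_interval_prob(a, b, mu=0.0, sigma=1.0):
    """P{a <= X <= b} for a normal X; a single point (a == b) has probability 0."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


if __name__ == "__main__":
    enumerated = survivor_distribution(n=3, theta=0.5)
    for j, p in enumerated.items():
        print(j, p, binomial_pmf(j, 3, 0.5))
    print(normal_interval_prob(-1.96, 1.96))
```

Enumeration costs 2^n evaluations and is practical only for small n; it is used here solely to mirror the argument in the text, whereas the closed-form binomial expression is what one would use in practice.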

Some characteristics of the probability models are frequently of interest in applications. These include the mode, the median, the quantiles and the moments of the random variable. The mode is the value of the random variable that maximizes its probability density. The median is the value of the random variable that divides the total area under the probability curve into two equal parts. Finally, the kth-order moment of a random variable is

$$E(X^{k}) = \mu_{k} = \sum_{i \ge 0} x_{i}^{k}\, P(X = x_{i})$$

for a discrete random variable, and

$$E(X^{k}) = \mu_{k} = \int x^{k} f(x, \alpha)\, dx$$

for a continuous random variable. Of particular interest to most experimenters are the first four raw moments. From them, we can derive the expectation E(X) = μ_1, the variance Var(X), and the coefficients of skewness γ and kurtosis β:

$$\mathrm{Var}(X) = E(X - \mu_{1})^{2} = \mu_{2} - \mu_{1}^{2}, \qquad \gamma = \frac{E(X - \mu_{1})^{3}}{(\mathrm{Var}(X))^{3/2}}, \qquad \beta = \frac{E(X - \mu_{1})^{4}}{(\mathrm{Var}(X))^{2}} - 3.$$

The expectation (the first raw moment), the median and the mode are known as measures of location of the distribution. They coincide for symmetric unimodal distributions. The variance is a measure of the spread of the distribution, while the skewness and kurtosis measure its shape. The skewness coefficient is zero for a symmetric distribution. The kurtosis indicates whether the distribution has heavier or lighter tails than the normal distribution.

Bibliography

Anderson, E. (1960). A Semi-graphical Method for the Analysis of Complex Problems. Technometrics, 2, 387-392. [An article on graphical display of multivariate data]

Andrews, D. F. (1972). Plots of High Dimensional Data. Biometrics, 28, 125-136. [An article that suggests the use of curves to represent multivariate data graphically]

Atkinson, Anthony and Riani, Marco (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York. [A modern book on regression]

Box, G.E.P. and Jenkins, G.M. (1976). Time Series Analysis: Forecasting and Control, Revised edition, Holden-Day, San Francisco. [A classic basic reference on time series]

Brockwell, P.J. and Davis, R.A. (1988). Time Series: Theory and Methods, 2nd edition, Springer-Verlag, New York. [A basic reference on time series]

Chernoff, H. (1973). The Use of Faces to Represent Points in k-Dimensional Space Graphically. J. Amer. Statist. Assoc., 68, 361-368. [An article in which the author introduced the human face for the display of multivariate data]

Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey. [A basic reference]

Cleveland, W. S. (1994). The Elements of Graphing Data. Hobart Press, Summit, New Jersey. [A well written reference on approaches for the graphical display of data]

Cochran, W. G. and Cox, G. M. (1957). Experimental Designs. New York: Wiley. [A classic book on statistical approaches for the design and analysis of experimental data]

Cressie, N.A.C. (1993). Statistics for Spatial Data, Revised edition, New York: Wiley. [A comprehensive book on spatial statistical methods]

David, H. A. (1981). Order Statistics, 2nd edition. New York: John Wiley & Sons. [A basic book on order statistics]

Davison, A. C. and Hinkley, D. V. (1998). Bootstrap Methods and their Applications. Cambridge University Press. [An excellent book with many applications]

Draper, N.R. and Smith, H. (1981). Applied Regression Analysis, 2nd edition, New York: Wiley. [A basic book on regression]

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Ann. Statist., 7, 1-26. [Original research paper by the founder of bootstrapping as a method for statistical inference]

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. London: Chapman & Hall. [An elementary reference on bootstrapping. It presents the basic fundamentals]

Evans, Merran, Hastings, Nicholas, and Peacock, Brian (2000). Statistical Distributions, 3rd edition, New York: John Wiley & Sons. [An easy to use reference on commonly used frequency distributions]

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. John Wiley, New York. [An excellent applied book on multivariate analysis methods]

Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer-Verlag. [The theory of the bootstrap is presented in this book]

Härdle, Wolfgang (1990). Smoothing Techniques with Implementation in S, Springer-Verlag. [A good reference on kernel smoothing with software routines for their implementation]

Hoaglin, D.C., Mosteller, F., and Tukey, J. W. (1985). Exploring Data Tables, Trends and Shapes. New York: Wiley. [A fundamental book on exploratory data analysis methods]

Huber, P. (1981). Robust Statistics, New York: John Wiley & Sons. [A systematic theory for robust statistical inference is presented in this book]

Jeffreys, H. (1939). The Theory of Probability. Oxford: Oxford University Press. [A classic book on the development of the theory of the objective specification of non-informative prior probabilities]

Johnson, N. L., Kotz, S., and Balakrishnan, N. (1997). Discrete Multivariate Distributions, New York: John Wiley & Sons. [A basic book on multivariate discrete distributions]

Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Univariate Discrete Distributions, 2nd edition, New York: John Wiley & Sons. [Provides a summary of known distributions along with their main properties]

Kendall, M.G. and Stuart, A. (1976). The Advanced Theory of Statistics, Vol. 1, Griffin, London. [A classical book for common distributions and their characteristics]

Krzanowski, W.J. (1988). Principles of Multivariate Analysis. Oxford University Press. [A comprehensive reference for various aspects of multivariate analysis. It contains applications to real-life problems]

Plackett, R. L. (1949). A Historical Note on the Method of Least Squares, Biometrika, vol. 36, parts 3 and 4, p. 458. [It gives the history of the discovery of the celebrated method of least squares]

Tukey, J. W. (1960). A Survey of Sampling from Contaminated Distributions. In Contributions to Probability and Statistics, pp. 448-485, eds. Olkin, I. et al. Stanford University Press. [This paper shows the effects of minor errors upon the performance of classical methods]

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading. [A fundamental reference containing ideas about various ways to extract signals from messy data]

Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. [Derivation of asymptotic distributions of test statistics and their use in practical applications are the focus of the book. It is well written and not too difficult to study]

Venables, W.N. and Ripley, B.D. (1997). Modern Applied Statistics with S-PLUS (2nd ed.). New York: Springer-Verlag. [This book is directed mainly to the users of S-PLUS routines by providing examples of their use in applications]

Viertl, R. (1996). Statistical Methods for Non-Precise Data. CRC Press, Boca Raton, Florida. [A fundamental reference on the treatment of non-precise data]

Viertl, R. (1997). Statistical Inference for Non-Precise Data. Environmetrics, 8, 541-568. [A research article describing the steps involved in making inferences from non-precise data]

Biographical Sketch

Abdel El-Shaarawi was born in Egypt. He received his B.Sc. and M.Sc. degrees from Cairo University and his Ph.D. in Statistics in 1972 from the University of Waterloo. He is a research scientist at the Canada Centre for Inland Waters in Burlington, Ontario, a professor in the Department of Mathematics and Statistics at McMaster University, and currently a visiting professor at the University of Genova. He is founding Editor of the journal Environmetrics and founding President of The International Environmetrics Society. He is an elected member of the International Statistical Institute and a Fellow of the Royal Statistical Society (UK), the American Statistical Association and the Modelling and Simulation Society of Australia and New Zealand. He has received several awards, including the Distinguished Achievement Medal of the ASA Section on Statistics and the Environment and the Citation of Excellence Award from the Government of Canada.