Eruptions of the Old Faithful geyser

Topics covered: Bimodal distribution. Data summary. Examination of univariate data. Identifying subgroups. Prediction.

Key words: Boxplot. Histogram. Mean. Median. Mode. Scatter plot. Standard deviation. Side by side boxplots. Stem and leaf display.

Data Files: geyser1.dat, geyser2.dat

A geyser is a hot spring that occasionally becomes unstable and erupts hot water and steam into the air. The Old Faithful geyser at Yellowstone National Park in Wyoming is probably the most famous geyser in the world. Visitors to the park try to arrive at the geyser site to see it erupt without having to wait too long; the name of this geyser comes from the fact that eruptions follow a relatively stable pattern. The National Park Service erects a sign at the geyser site predicting when the next eruption will occur. Thus, it is of interest to understand and predict the interval time until the next eruption. The following analysis is based on a sample of 222 intereruption times taken during August 1978 and August 1979 (data source: Applied Linear Regression, 2nd ed., by S. Weisberg).

The first step in any data analysis is simply to look at the data. A histogram gives a good deal of information about the distribution of intereruption times, suggesting some interesting structure. Interval times fall between about 40 and 100 minutes (the observed minimum and maximum are 42 and 95 minutes), but there are apparently two subgroups in the data, centered at roughly 55 minutes and 80 minutes, respectively, with a gap in the middle.

[Figure: histogram of intereruption times (minutes); vertical axis: frequency.]

A stem and leaf display gives a similar impression:

    STEM AND LEAF PLOT OF INTERVAL

    LEAF DIGIT UNIT = 1
    1 2 REPRESENTS 12.

          STEM  LEAVES
        3   4*  234
       11   4.  55788999
       39   5*  0011111111111111222333334444
       54   5.  555566677778889
       67   6*  0000111112223
       78   6.  66677788999
      107   7*  00000111112222233333333344444
     (44)   7.  55555555555555566666666667777777788888889999
       71   8*  0000000000000111111111222222222233333333444444444
       22   8.  5666666788899
        9   9*  00011134
        1   9.  5

    222 CASES INCLUDED    0 MISSING CASES

The stem and leaf display can give additional information as well. For example, the first column of the plot above is a cumulative frequency column that starts at both ends and meets in the middle. The row that contains the median is marked with parentheses around the count of observations for that row.

Not all exploratory techniques are as effective for this type of data. A boxplot (sometimes called a box and whisker plot) of the interval times shows that interval times are in the general range of about 40 to 100 minutes, but the bimodal distribution is hidden by the form of the plot:

[Figure: box and whisker plot of intereruption times (minutes), 222 cases.]

Summary statistics, as given below, look informative, but that is somewhat misleading. For example, the mean intereruption time of about 71 minutes doesn't actually describe a typical result from either subgroup. A useful rule of thumb is that roughly 95% of the observations will lie within two standard deviations of the mean, or (here) 71 ± 25.6 = (45.4, 96.6). In fact, all but five of the 222 observations (about 98%) fall in this interval, which is more than we would expect. It is important to remember that summary measures are only trustworthy when data values come from a homogeneous population, which is not the case here.

    DESCRIPTIVE STATISTICS

                 INTERVAL
    N                 222
    MEAN           71.009
    SD             12.799
    MINIMUM        42.000
    MEDIAN         75.000
    MAXIMUM        95.000

How, then, can we help the tourists? We need more information. One readily available characteristic of the geyser is the duration of the previous eruption. We could think of our data as pairs of the form (duration of eruption, time until next eruption). Here is a scatter plot of those pairs:

[Figure: scatter plot of INTERVAL vs DURATION; horizontal axis: eruption duration time (minutes).]

We notice two dominant effects here: there are indeed two distinct subgroups, and a longer eruption tends to be followed by a longer time interval until the next eruption. The existence of two subgroups in this type of data is rare, but not unheard of; J.S. Rinehart, in a 1969 paper in the Journal of Geophysical Research, provides a mechanism for this pattern based on the temperature level of the water at the bottom of a geyser tube at the time the water at the top reaches the boiling temperature. That a shorter eruption would be followed by a shorter intereruption time (and a longer eruption would be followed by a longer intereruption time) is also consistent with Rinehart's model, since a short eruption is characterized by having more water at the bottom of the geyser being heated short of boiling temperature, and left in the tube. This water has been heated somewhat, however, so it takes less time for the next eruption to occur. A long eruption results in the tube being emptied, so the water must be heated from a colder temperature, which takes longer. A. Azzalini and A.W. Bowman provide further discussion of statistical analysis based on this model in a 1990 paper in Applied Statistics.

The properties of these two kinds of eruptions can be examined in more detail by separating the observations based on duration of the previous eruption. The following summary statistics are for intereruption times based on separating the eruptions by whether the duration is less than, or greater than, three minutes:

    DESCRIPTIVE STATISTICS

                 FOR ERUPTION            FOR ERUPTION
                 DURATION < 3 MINUTES    DURATION > 3 MINUTES
                 INTERVAL                INTERVAL
    N                  67                     155
    MEAN           54.463                  78.161
    SD             6.2989                  6.8911
    MINIMUM        42.000                  53.000
    MEDIAN         53.000                  78.000
    MAXIMUM        78.000                  95.000

Boxplots also can be stacked side by side on the same plot, separated by subgroups, to provide a graphical representation of this pattern:

[Figure: side by side box and whisker plots of intereruption time (minutes) for short duration (< 3 minutes) and long duration (> 3 minutes) eruptions, 222 cases.]

Based on these statistics and plots, a simple prediction rule would be that an eruption of duration less than 3 minutes will be followed by an intereruption interval of about 55 minutes, while an eruption of duration greater than 3 minutes will be followed by an intereruption interval of about 80 minutes. Further, the longer intereruption time would be expected to occur about two thirds of the time.
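The summary figures quoted in this section can be cross-checked with a little arithmetic. The following Python sketch uses only the values printed in the descriptive statistics tables; the variable names are illustrative and are not part of the original analysis.

```python
# Summary values copied from the descriptive statistics tables in the text.
n_all, mean_all, sd_all = 222, 71.009, 12.799
n_short, mean_short = 67, 54.463    # previous eruption shorter than 3 minutes
n_long, mean_long = 155, 78.161     # previous eruption longer than 3 minutes

# Rule-of-thumb interval: mean plus or minus two standard deviations.
low = mean_all - 2 * sd_all
high = mean_all + 2 * sd_all

# The overall mean is the size-weighted average of the two subgroup means.
pooled_mean = (n_short * mean_short + n_long * mean_long) / (n_short + n_long)

# Share of eruptions that fall in the long-duration subgroup.
share_long = n_long / n_all

print(f"two-SD interval: ({low:.1f}, {high:.1f})")
print(f"weighted average of subgroup means: {pooled_mean:.3f}")
print(f"share of long-duration eruptions: {share_long:.2f}")
```

The interval reproduces (45.4, 96.6), the weighted average recovers the overall mean of about 71.009, and the long-duration share of about 0.70 matches the "about two thirds" statement. The overall mean sits between the two subgroup means precisely because it averages two different populations.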

We can evaluate the effectiveness of this prediction rule by testing it on new data. Azzalini and Bowman, in their 1990 Applied Statistics article, gave intereruption and duration times for 296 eruptions in August 1985, which can be used for this purpose. (Of these, 66 eruptions occurred at night, and their durations were recorded as simply Short or Long, which we will treat as less than or greater than 3 minutes, respectively; 2 durations were listed as Medium, and will be ignored here.) Boxplots of the 1985 intereruption times, separated by eruption duration, show that the relationship between eruption duration and the following intereruption time had apparently not changed appreciably in the six intervening years (which is what we would expect), which suggests that the prediction rule will work well:

[Figure: side by side box and whisker plots of 1985 intereruption time (minutes) for short duration (< 3 minutes) and long duration (> 3 minutes) eruptions, 296 cases, 2 missing cases.]

We can examine the error in predicting intereruption time from the simple prediction rule using the August 1985 data. For example, the first eruption given lasted 4.0 minutes, and was followed by a 71 minute intereruption time. Since the duration was greater than 3 minutes, we would have predicted the intereruption time to be 80 minutes. Thus, the error made was 71 − 80 = −9 minutes. This operation is then repeated for all available eruptions. A histogram of the errors made in using the prediction rule shows that, except for three unusual eruptions, the rule predicts the intereruption time to within roughly ±20 minutes (and usually to within ±10 minutes).
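The error calculation just described can be written as a small Python sketch. The function names are hypothetical, and the handling of a duration of exactly 3 minutes is an assumption, since the text only covers durations less than or greater than 3 minutes.

```python
def predict_interval(duration, threshold=3.0, short_pred=55.0, long_pred=80.0):
    """Predict the next intereruption time (minutes) from the previous duration.

    `duration` is either a number of minutes or one of the night-time labels
    "Short" / "Long". Any other label (such as "Medium") returns None, which
    mirrors the text's decision to ignore those cases.
    """
    if isinstance(duration, str):
        return {"Short": short_pred, "Long": long_pred}.get(duration)
    # Assumption: a duration of exactly `threshold` is sent to the long group.
    return short_pred if duration < threshold else long_pred


def prediction_error(actual, predicted):
    """Error as used in the text: observed time minus predicted time."""
    return actual - predicted


# Worked example from the text: a 4.0 minute eruption followed by 71 minutes.
predicted = predict_interval(4.0)
error = prediction_error(71.0, predicted)
print(predicted, error)  # 80.0 -9.0

# Baseline that ignores duration: always predict the 1978-1979 median of 75.
baseline_error = prediction_error(71.0, 75.0)
print(baseline_error)  # -4.0
```

Applying `prediction_error` to every eruption in the 1985 data, once with `predict_interval` and once with the constant 75, would give the two sets of errors that the histograms compare.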

[Figure: histogram of the error from the prediction rule (minutes), using the duration time information; vertical axis: frequency.]

This performance can be compared with making a prediction based just on the distribution of the intereruption times. Recall that the overall median intereruption time for the 1978-1979 data was 75 minutes; a histogram of the errors from using that as a prediction rule for the 1985 data shows that the errors are bimodal, and dramatically larger than if the previous duration is used, even being off by as much as 30 minutes:

[Figure: histogram of the error from the prediction rule (minutes), ignoring the duration time information; vertical axis: frequency.]

Thus, by using a very simple rule, the National Park Service can try to ensure that visitors to Yellowstone Park will get to see an Old Faithful eruption without waiting too long.

Summary

Intereruption times of the Old Faithful geyser in Yellowstone National Park are apparently bimodal, with modes centered at around 55 minutes and 75-80 minutes, respectively. These times are directly related to the duration of the previous eruption, with longer eruptions followed by longer intereruption times (and shorter eruptions followed by shorter intereruption times). This pattern is consistent with previously described physical models of geyser eruption. A simple rule that can be used to predict intereruption time is to predict one or the other of the two modal values, based on whether the previous duration was less than, or greater than, three minutes. This rule can be shown to work well on new eruptions of the geyser, supporting its general use.

Technical terms

Boxplot: a graphical device in which a rectangle is used to summarize the distribution of a batch of data. The top and the bottom of the rectangle represent the third and first quartiles, respectively. The line inside the rectangle represents the median. The lines extending from the top and bottom of the rectangle represent either the actual limits of the data, or the limits of the bulk of the data (with unusual observations being represented by individual symbols ["flagged"] if they are further out). The boxplot is particularly useful for comparing the location and variability of several batches of data, as boxes can be plotted side by side on one plot.

Histogram: a graphical device to represent the distribution of a batch of data. The data values are usually grouped into mutually exclusive and exhaustive intervals of equal width, and the number of observations in each interval is determined and represented by a vertical bar. In some variations the widths of the intervals are varied, resulting in potentially different appearances in the plot.

Median: an estimate of location determined by ordering the observations in the sample, and then choosing the middle one. If the sample size is even, the median is taken to be the average of the middle two observations. The median has the desirable property of not being greatly influenced by an unusual observation (that is, it is a robust estimate).

Mean: an estimate of location determined by adding all of the observations and dividing by the number of observations. It is also known as the average, and is certainly the most commonly used location estimate. An undesirable property is that it can be heavily influenced by even one unusual observation. The usual notation for the sample mean is the use of an overbar, as in X̄.

Mode: a value, or range of values, characterized by having a greater likelihood of occurrence than values around it. A distribution might have one mode (being unimodal), two modes (bimodal), and so on. Another use of the term is as follows: the sample mode is the value that occurs most often in a sample.

Quartile: a value corresponding to a 25% point in a sample. The first quartile is, roughly speaking, the value such that one quarter of the observations are less than it, while three quarters of the observations are greater than it. The third quartile is defined similarly, reversing the quantities one quarter and three quarters. The second quartile corresponds to the median. Different statistical packages often use slightly different definitions of the quartiles for a given sample size. A measure of the variability in a set of data that is insensitive to a few unusual values is the difference between the third and first quartiles, called the interquartile range.

Scatter plot: a method that can be used to study the joint variation of two variables graphically. Each observation is represented by a point on the plot, indexed by the values on the axes. Each axis is used for a different variable. Besides showing how (and whether) two variables are related to each other, scatter plots also can indicate the existence of distinct subgroups in the data.

Standard deviation: an estimate of scale (or variability) determined by summing the squared differences between each observation and the sample mean, dividing that sum by one less than the sample size, and taking the square root of the result.

Stem and leaf display: a graphical device to represent the distribution of a batch of data. Very similar to a histogram, it is often accompanied by additional information about the data, such as cumulative frequencies and the position of the median. The plot represents the data values by their numerical values, providing additional information over the histogram, but the grouping intervals are usually chosen based on using round numbers, rather than in an attempt to provide the most effective plot.
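Several of the numerical definitions above can be illustrated with a short Python sketch. The sample used here is the eleven shortest intervals read off the first two rows of the stem and leaf display, chosen only to keep the example small; it is not representative of the full data set. The quartile method is the default of Python's `statistics.quantiles`, which is just one of the differing definitions mentioned under Quartile.

```python
import math
import statistics

# Eleven shortest intervals (minutes), read from the stem and leaf display.
sample = [42, 43, 44, 45, 45, 47, 48, 48, 49, 49, 49]


def sample_sd(values):
    """Standard deviation as defined in the glossary (divisor n - 1)."""
    mean = sum(values) / len(values)
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (len(values) - 1))


mean = statistics.mean(sample)
median = statistics.median(sample)    # middle value of the ordered sample
sd = sample_sd(sample)

# Three cut points that split the data into quarters; the exact values depend
# on the interpolation method, which varies between statistical packages.
q1, q2, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1                         # interquartile range

print(f"mean={mean:.2f} median={median} sd={sd:.2f}")
print(f"quartiles=({q1}, {q2}, {q3}) iqr={iqr}")
```

The hand-written `sample_sd` agrees with the library function `statistics.stdev`, which uses the same n − 1 divisor, and the second quartile coincides with the median, as the Quartile entry states.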