Statistics, continued

Visual Displays of Data

Since numbers often do not resonate with people, visual representations are often used to make data more meaningful. We will talk about a few ways to view data.

Histograms

A histogram, or bar chart, is a common way to represent numerical data. We illustrate a histogram with weather data for high temperatures on January 1 in San Francisco.

High Temperatures on January 1 in San Francisco

Year  Temperature    Year  Temperature
1977  54             1993  55
1978  55             1994  60
1979  56             1995  53
1980  61             1996  67
1981  54             1997  66
1982  52             1998  58
1983  50             1999  61
1984  60             2000  54
1985  57             2001  64
1986  60             2002  58
1987  57             2003  59
1988  49             2004  54
1989  50             2005  56
1990  56             2006  63
1991  55             2007  60
1992  58

By counting how many days had a given high temperature, we get the following chart.

Temperature  Number of Days    Temperature  Number of Days
49           1                 59           1
50           2                 60           4
51           0                 61           2
52           1                 62           0
53           1                 63           1
54           4                 64           1
55           3                 65           0
56           3                 66           1
57           2                 67           1
58           3
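The counting step can be sketched in a few lines of Python. This is only an illustration, not how the chart was originally produced; the variable names `sf_highs` and `counts` are made up for this example, and the list is the temperature column of the table above, in year order.

```python
from collections import Counter

# High temperatures on January 1 in San Francisco, 1977-2007, in year order
# (copied from the table of yearly temperatures).
sf_highs = [54, 55, 56, 61, 54, 52, 50, 60, 57, 60, 57, 49, 50, 56, 55, 58,
            55, 60, 53, 67, 66, 58, 61, 54, 64, 58, 59, 54, 56, 63, 60]

# Counter maps each temperature to the number of days on which it occurred.
counts = Counter(sf_highs)

# List every temperature from the lowest to the highest. A Counter returns 0
# for a temperature that never occurred, so gaps such as 51 still appear.
for temp in range(min(sf_highs), max(sf_highs) + 1):
    print(temp, counts[temp])
```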

Here is a histogram of the last chart. The numbers on the vertical axis are the number of days with the given temperature, and the values on the horizontal axis are the various temperatures. This was created with Excel.

[Figure: histogram of the January 1 high temperatures in San Francisco]

Here is the same chart with labels added to the vertical and horizontal axes.

Each value, in this case a temperature, is drawn as a vertical bar. The height of the bar represents how many times that value occurs. The values are listed on the horizontal axis in increasing order from left to right.
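As a rough sketch of the same idea without a spreadsheet, the counts can be printed as a text histogram. To keep the code short, the bars here run sideways rather than vertically: each row is one temperature, and the length of the bar is the number of days. The names `days_at_temp` and `text_histogram` are invented for this illustration.

```python
# Number of days at each high temperature, taken from the frequency chart.
days_at_temp = {49: 1, 50: 2, 51: 0, 52: 1, 53: 1, 54: 4, 55: 3, 56: 3,
                57: 2, 58: 3, 59: 1, 60: 4, 61: 2, 62: 0, 63: 1, 64: 1,
                65: 0, 66: 1, 67: 1}


def text_histogram(freq):
    """Return one line per value, with a bar of '#' as long as its count."""
    lines = []
    for value in sorted(freq):  # values in increasing order
        lines.append(f"{value:>3} | {'#' * freq[value]}")
    return lines


for line in text_histogram(days_at_temp):
    print(line)
```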

Let's look at some more weather data. We have the high temperatures on January 1 from 1977 to 2007 for both San Francisco and Las Cruces. We will also calculate the mean and median for both data sets.

High Temps on January 1 in San Francisco (SF) and Las Cruces (LC)

Year  SF  LC    Year  SF  LC
1977  54  61    1993  55  58
1978  55  61    1994  60  57
1979  56  52    1995  53  54
1980  61  56    1996  67  62
1981  54  66    1997  66  64
1982  52  62    1998  58  56
1983  50  33    1999  61  63
1984  60  56    2000  54  65
1985  57  61    2001  64  61
1986  60  66    2002  58  50
1987  57  58    2003  59  68
1988  49  49    2004  54  67
1989  50  54    2005  56  65
1990  56  50    2006  63  65
1991  55  57    2007  60  54
1992  58  57
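One way to calculate the mean and median of the two columns is with Python's standard `statistics` module. This is a sketch for illustration; the list names `sf` and `lc` are made up here, and the values are the SF and LC columns of the table in year order.

```python
from statistics import mean, median

# January 1 high temperatures, 1977-2007, in year order.
sf = [54, 55, 56, 61, 54, 52, 50, 60, 57, 60, 57, 49, 50, 56, 55, 58,
      55, 60, 53, 67, 66, 58, 61, 54, 64, 58, 59, 54, 56, 63, 60]
lc = [61, 61, 52, 56, 66, 62, 33, 56, 61, 66, 58, 49, 54, 50, 57, 57,
      58, 57, 54, 62, 64, 56, 63, 65, 61, 50, 68, 67, 65, 65, 54]

for city, temps in (("San Francisco", sf), ("Las Cruces", lc)):
    # mean: sum of the values divided by how many there are.
    # median: the middle value once the data are sorted.
    print(f"{city}: mean {mean(temps):.1f}, median {median(temps)}")
```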

One interesting point of this data is the following calculation of central tendency.

               Mean   Median
San Francisco  57.2   57
Las Cruces     58.3   58

The mean and median are nearly identical for the two cities; each differs by only about one degree. We will now plot the data in the same way as we did earlier.

San Francisco and Las Cruces Weather Data

[Figure: histogram of the January 1 high temperatures for both cities]

However, the graphical representation makes the two data sets look quite different. The data for Las Cruces are spread out much more than those for San Francisco. A calculation of the middle of the data presents only part of the story. The dispersion, or deviation, of the data is also important. While there are several measures of deviation, the most common one is called the standard deviation.

The most basic property of standard deviation is: the larger the standard deviation, the more spread out the data. That is, the larger the deviation, the farther the data tend to lie from the middle, or the average.

The point of measuring deviation is to give a sense of how far the data are from the middle, or the average. Standard deviation measures, approximately, the average distance of the data from the middle. This is not exactly true, but it is roughly true. We will say more about the standard deviation in a little while.
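To see both points with the weather data, the sketch below compares the standard deviation with the plain average distance from the mean for each city. It assumes the population form of the standard deviation (`statistics.pstdev`, which divides by the number of data points); the slides do not say which form is intended. The helper `mean_abs_deviation` and the list names are invented for this example.

```python
from statistics import mean, pstdev

# January 1 high temperatures, 1977-2007, in year order.
sf = [54, 55, 56, 61, 54, 52, 50, 60, 57, 60, 57, 49, 50, 56, 55, 58,
      55, 60, 53, 67, 66, 58, 61, 54, 64, 58, 59, 54, 56, 63, 60]
lc = [61, 61, 52, 56, 66, 62, 33, 56, 61, 66, 58, 49, 54, 50, 57, 57,
      58, 57, 54, 62, 64, 56, 63, 65, 61, 50, 68, 67, 65, 65, 54]


def mean_abs_deviation(data):
    """Average distance of the data points from their mean."""
    m = mean(data)
    return mean([abs(x - m) for x in data])


for city, temps in (("San Francisco", sf), ("Las Cruces", lc)):
    # pstdev is the population standard deviation (an assumption here).
    print(f"{city}: standard deviation {pstdev(temps):.2f}, "
          f"average distance from mean {mean_abs_deviation(temps):.2f}")
```

The two measures are close to each other but not equal, and both come out larger for Las Cruces, the city whose histogram is more spread out.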

Box and Whisker Plots

A box and whisker plot is another way to plot data. It focuses attention on different aspects of the data than a histogram does.

One of the main pieces of information this plot shows is the quartiles. The idea of quartiles is to divide the data into quarters. The median breaks the data into two halves. If we break each half into halves, we will have broken the data into four quarters.

The first quartile represents a point where 1/4 of the data is below and 3/4 is above. The second quartile, which is the median, represents a point where 2/4 of the data is below and 2/4 is above. The third quartile represents a point where 3/4 of the data is below and 1/4 is above.

We will illustrate constructing a box and whisker plot with the following data. Suppose your data set has the following 10 numbers:

60, 62, 64, 64, 65, 67, 70, 75, 80, 82

We first find the median; this is the average of 65 and 67, so it is 66.

We next find the first and third quartiles. To do this we split the data

60, 62, 64, 64, 65, 67, 70, 75, 80, 82

in half:

60, 62, 64, 64, 65        67, 70, 75, 80, 82

The first quartile is the median of the smaller half. That is 64. The third quartile is the median of the larger half. That is 75.

The median is the second quartile value. So, we have:

low: 60
first quartile: 64
median: 66
third quartile: 75
high: 82

We then make the following plot, marking it next to a number line starting and ending at the low and high, respectively.

The two boxes reflect the two middle quarters of the data: one goes from the first quartile to the median and the other from the median to the third quartile. Then we have the whiskers, which are lines to the extremes (high and low) of the data. The significance of the boxes is that half the data lies inside the two of them. The other half of the data is represented by the whiskers.
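The five values that the plot is built from (low, first quartile, median, third quartile, high) can be computed with a short function that follows the split-in-half method described above. This is a sketch: the name `five_number_summary` is made up, and for an odd number of data points it leaves the middle value out of both halves, which is one common convention that the slides do not address.

```python
from statistics import median


def five_number_summary(data):
    """Return (low, first quartile, median, third quartile, high)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]  # the smaller half
    # For an odd count, skip the middle value (an assumed convention).
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


scores = [60, 62, 64, 64, 65, 67, 70, 75, 80, 82]
print(five_number_summary(scores))
```

For the ten numbers in the example this gives 60, 64, 66, 75, and 82, matching the values found by hand.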

For a second example, let's use the Laker salary data:

0.32M, 0.43M, 0.77M, 1M, 1.76M, 2.17M, 2.2M, 2.7M, 4M, 4.4M, 5.6M, 13.5M, 13.7M, 19.5M

Since there are 14 data points, each half has 7 points. The median of each half has three numbers on each side, so the first quartile is 1M and the third quartile is 5.6M. As we saw earlier, the median is 2.45M. The low and high salaries are 0.32M and 19.5M.

Box and Whisker Plot for the Laker Salary Data

There are other sorts of charts to represent data. The pie chart is a common one. Its purpose is to show visually what percentage of the whole each item makes up.

The Normal Distribution

Let's look at the experiment of flipping a coin repeatedly. We will simulate this with a computer program. Each run of the program represents flipping 100 coins and counting the number of heads. It will repeat this as many times as we want.
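The program used in class is not reproduced here, but a simulation of this kind might look like the following sketch. The function name `simulate_heads`, its parameters, and the fixed random seed are all choices made for this illustration.

```python
import random
from collections import Counter


def simulate_heads(num_trials, coins_per_trial=100, seed=0):
    """Flip `coins_per_trial` fair coins `num_trials` times.

    Returns a Counter mapping a number of heads to how many trials
    produced exactly that many heads.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same result
    results = Counter()
    for _ in range(num_trials):
        # Each coin lands heads with probability 1/2.
        heads = sum(rng.random() < 0.5 for _ in range(coins_per_trial))
        results[heads] += 1
    return results


tally = simulate_heads(1000)
for heads in sorted(tally):
    # One row per head count; the bar length is the number of trials.
    print(f"{heads:>3} | {'#' * tally[heads]}")
```

Calling `simulate_heads` with a larger first argument, such as 10000 or 100000, repeats the experiment more times; for those sizes it is more practical to plot the counts than to print bars.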

Recall the coin flipping experiment we did early in the semester.

[Figure: Number of Students with a Given Number of Heads. Horizontal axis: Number of Heads (0 to 20). Vertical axis: Number of Students (0 to 14).]

[Figure: Simulation of flipping 100 coins 1,000 times]

[Figure: Simulation of flipping 100 coins 10,000 times]

[Figure: Simulation of flipping 100 coins 100,000 times]

[Figure: Simulation of flipping 100 coins 1,000,000 times]

As the number of repetitions gets larger and larger, the graph looks more and more regular. In fact, the shape looks more and more like a curve you may have seen before.

The Normal (Bell) Curve

The importance of the bell curve is that as the number of trials gets larger and larger, histograms generally look more and more like a bell curve. The particular shape of the bell curve reflects the standard deviation.

Bell Curves with Different Standard Deviations

[Figure: two bell curves, one with St. Dev. = 2 and one with St. Dev. = 5]

The higher the standard deviation, the wider the curve is. In terms of the normal curve, standard deviation can be interpreted as follows:

About 68% of all data is within 1 standard deviation of the average.
About 95% of all data is within 2 standard deviations of the average.
About 99.7% of all data is within 3 standard deviations of the average.
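These three percentages can be checked against the normal curve itself. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is made up for this example. Because the rule is stated in units of standard deviations, it is enough to use the curve with average 0 and standard deviation 1.

```python
from statistics import NormalDist

# The normal curve with average 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of the area under the curve within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```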

Sampling

The following quotes were taken from Gallup.com on April 27:

"The latest Gallup Poll tracking report shows that 86% of Americans say the U.S. economy is getting worse, while 44% rate the current economy as poor, and only 15% rate it as excellent or good."

"National Democratic voters' preferences for their party's nomination remain evenly split, with the latest Gallup Poll Daily tracking results showing Barack Obama and Hillary Clinton each receiving 47% support."

The following two quotes were taken from http://www.cnn.com/2004/allpolitics/04/19/bush.kerry.poll/index.html

December 21, 2004: "As for Bush, 49 percent of respondents said they approved of the job the president is doing.... The question had a margin of error of plus or minus 3 percentage points."

April 20, 2004: "A broader survey of registered voters gave the president a 50 percent to 46 percent lead over Kerry in a two-man race. And among all adults, Bush led Kerry 49 percent to 46 percent, with a margin of error of plus or minus 3 percentage points."

How are these statistics measured? Also, in the first quote, what is the meaning of the statement "The question had a margin of error of plus or minus 3 percentage points"? Does it mean Bush had an approval rating between 49 - 3 = 46 percent and 49 + 3 = 52 percent?

Polls are conducted by taking a sample of the population and asking the people in the sample their opinion. The CNN poll surveyed 1003 people. The reported percentage for Bush's approval rating is the percentage of those 1003 people who approved of his performance. Since only a small fraction of the population was surveyed, is the actual approval rating really in this range?

What is almost always missing from statements about polls is that, because not everybody is polled, the poll gives only an estimate of the actual value. In fact, statements like the CNN poll data, which list a percentage and a margin of error, are really probabilistic statements.

The statement "As for Bush, 49 percent of respondents said they approved of the job the president is doing.... The question had a margin of error of plus or minus 3 percentage points" really means that the actual percentage of Americans who approved of the president's job performance has a certain probability of being within 3 points of 49%.

Most polls calculate the margin of error based on a 95% probability that the actual value is within the margin of error. Unfortunately, a poll rarely, if ever, lists the actual probability that the true value is within the margin of error.
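The slides do not give a formula for the margin of error. As a rough illustration, the sketch below uses the standard textbook formula for a simple random sample, z * sqrt(p(1 - p) / n), with z = 1.96 for 95% probability. That formula is an assumption here, since real polling organizations may adjust their calculations, and the name `margin_of_error` is made up for this example.

```python
from math import sqrt


def margin_of_error(p, n, z=1.96):
    """Textbook margin of error for a simple random sample (an assumption).

    p: fraction of the sample giving a particular answer
    n: number of people polled
    z: 1.96 corresponds to 95% probability under the normal curve
    """
    return z * sqrt(p * (1 - p) / n)


# The CNN poll: 49% approval among 1003 people surveyed.
moe = margin_of_error(p=0.49, n=1003)
print(f"49% plus or minus {moe:.1%}")
```

For 1003 people this comes out to roughly 3 percentage points, in line with the margin of error reported in the quote.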

If it is important to be more than 95% sure of the accuracy of the results, one can achieve this by making the margin of error in the poll larger. However, if you said that candidate A has 51% support and candidate B has 49% support, and the margin of error was plus or minus 10%, then the 2% difference between the candidates is much less than the 10% margin of error. Thus, we cannot tell which candidate has the greater support. It is therefore necessary to have a fairly small margin of error.
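The tradeoff between confidence and margin of error can be made concrete with a small calculation. This sketch again assumes the textbook simple-random-sample formula, which the slides do not state, and uses the worst case p = 0.5; the function name `margin_for_confidence` is invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def margin_for_confidence(confidence, n, p=0.5):
    """Margin of error needed for a given confidence level (textbook formula).

    confidence: desired probability, for example 0.95
    n: number of people polled
    p: assumed true fraction; 0.5 gives the widest margin
    """
    # z is the number of standard deviations that captures `confidence`
    # of the area under the normal curve, centered on the average.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


for confidence in (0.90, 0.95, 0.99):
    margin = margin_for_confidence(confidence, n=1003)
    print(f"{confidence:.0%} confidence: plus or minus {margin:.1%}")
```

With the number of people polled held fixed, asking for more confidence always widens the margin.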

We want, therefore, to be confident of the data but have a fairly small margin of error. Over the years, people working with statistics have found 95% confidence in the data a good balance between being sure of the data and not having too large of a margin of error.

It turns out that the actual number of people polled, rather than the fraction of the population polled, is what matters in these calculations. For example, polling 1000 people is plenty to get good results, even when polling about a national issue.
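A simulation can make this claim plausible. The sketch below is an illustration under simplifying assumptions, not a model of how real polls are run: it invents two populations of very different sizes in which exactly 49% hold some opinion, polls 1000 random people many times, and reports how often the poll lands within 3 points of the truth. The function name, parameters, and seed are all hypothetical.

```python
import random


def share_of_polls_within(population_size, sample_size=1000,
                          true_support=0.49, tolerance=0.03,
                          num_polls=400, seed=1):
    """Fraction of simulated polls whose result is within `tolerance`
    of the true level of support."""
    rng = random.Random(seed)
    # People numbered below `supporters` hold the opinion being polled.
    supporters = round(true_support * population_size)
    expected = round(true_support * sample_size)   # 490 of 1000
    max_miss = round(tolerance * sample_size)      # 30 of 1000
    hits = 0
    for _ in range(num_polls):
        # Poll `sample_size` distinct people chosen at random.
        sample = rng.sample(range(population_size), sample_size)
        count = sum(person < supporters for person in sample)
        if abs(count - expected) <= max_miss:
            hits += 1
    return hits / num_polls


for size in (100_000, 10_000_000):
    print(f"population {size:>10,}: {share_of_polls_within(size):.0%} "
          "of polls within 3 points")
```

In the first population the poll reaches 1% of the people and in the second only 0.01%, yet the two results should come out close to each other.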