Statistics, continued


Visual Displays of Data

Since numbers often do not resonate with people, visual representations are often used to make data more meaningful. We will look at a few ways to display data.

Histograms

A histogram, a kind of bar chart, is a common way to represent numerical data. We illustrate a histogram with weather data for high temperatures on January 1 in San Francisco.

High Temperatures on January 1 in San Francisco

Year  Temperature    Year  Temperature
1977  54             1993  55
1978  55             1994  60
1979  56             1995  53
1980  61             1996  67
1981  54             1997  66
1982  52             1998  58
1983  50             1999  61
1984  60             2000  54
1985  57             2001  64
1986  60             2002  58
1987  57             2003  59
1988  49             2004  54
1989  50             2005  56
1990  56             2006  63
1991  55             2007  60
1992  58

By counting how many days had a given high temperature, we get the following chart.

Temperature  Number of Days    Temperature  Number of Days
49           1                 59           1
50           2                 60           4
51           0                 61           2
52           1                 62           0
53           1                 63           1
54           4                 64           1
55           3                 65           0
56           3                 66           1
57           2                 67           1
58           3
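The counting step can be sketched in a few lines of Python. This is only an illustration: the list `sf_highs` is typed in from the San Francisco table above, and the variable names are our own.

```python
from collections import Counter

# High temperatures on January 1 in San Francisco, 1977-2007,
# typed in from the table above in year order.
sf_highs = [54, 55, 56, 61, 54, 52, 50, 60, 57, 60, 57, 49, 50, 56, 55, 58,
            55, 60, 53, 67, 66, 58, 61, 54, 64, 58, 59, 54, 56, 63, 60]

counts = Counter(sf_highs)

# Walk every temperature from the lowest to the highest so that values
# that never occur (such as 51) appear with a count of 0.
frequency = {t: counts[t] for t in range(min(sf_highs), max(sf_highs) + 1)}

for temperature, days in frequency.items():
    print(temperature, days)
```

The printed pairs should agree with the frequency chart, including the zero counts for 51, 62 and 65.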

Here is a histogram of the last chart. The vertical axis shows the number of days with a given temperature, and the horizontal axis shows the temperatures. It was created with Excel.

[Figure: histogram of the number of days at each high temperature]

Here is the same chart with labels added to the vertical and horizontal axes.

Each value, in this case a temperature, is drawn as a vertical bar. The height of the bar represents how many times that value occurs. The values are listed on the horizontal axis in increasing order from left to right.
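For readers without a spreadsheet at hand, a rough text version of the histogram can be printed with Python. This is a sketch, not the Excel chart: the bars run sideways (one row per temperature) instead of vertically, and the function name `text_histogram` is our own.

```python
# Number of days at each high temperature, from the frequency chart above.
days_at_temperature = {
    49: 1, 50: 2, 51: 0, 52: 1, 53: 1, 54: 4, 55: 3, 56: 3, 57: 2, 58: 3,
    59: 1, 60: 4, 61: 2, 62: 0, 63: 1, 64: 1, 65: 0, 66: 1, 67: 1,
}

def text_histogram(freq):
    """Return one line per value, with one '#' for each occurrence."""
    lines = []
    for value in sorted(freq):
        lines.append(f"{value} | {'#' * freq[value]}")
    return lines

for line in text_histogram(days_at_temperature):
    print(line)
```

The length of each row of `#` marks plays the role of the bar height in the chart.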

Let's look at some more weather data. We have the high temperatures on January 1 from 1977 to 2007 for both San Francisco and Las Cruces. We will also calculate the mean and median for both data sets.

High Temps on January 1 in San Francisco and Las Cruces

Year  SF  LC    Year  SF  LC
1977  54  61    1993  55  58
1978  55  61    1994  60  57
1979  56  52    1995  53  54
1980  61  56    1996  67  62
1981  54  66    1997  66  64
1982  52  62    1998  58  56
1983  50  33    1999  61  63
1984  60  56    2000  54  65
1985  57  61    2001  64  61
1986  60  66    2002  58  50
1987  57  58    2003  59  68
1988  49  49    2004  54  67
1989  50  54    2005  56  65
1990  56  50    2006  63  65
1991  55  57    2007  60  54
1992  58  57

One interesting point about this data is the following calculation of central tendency.

               mean  median
San Francisco  57.2  57
Las Cruces     57.7  58

The mean and median are virtually identical for the two cities. We will now plot the data in the same way as we did earlier.
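As a quick check of the central tendency figures, here is a minimal Python sketch using the San Francisco column. The list is typed in from the table, and the variable names are our own. The same two calls work for any other column.

```python
import statistics

# San Francisco high temperatures on January 1, 1977-2007, from the table.
sf_highs = [54, 55, 56, 61, 54, 52, 50, 60, 57, 60, 57, 49, 50, 56, 55, 58,
            55, 60, 53, 67, 66, 58, 61, 54, 64, 58, 59, 54, 56, 63, 60]

sf_mean = statistics.mean(sf_highs)      # sum of the values divided by 31
sf_median = statistics.median(sf_highs)  # middle (16th) value once sorted

print(round(sf_mean, 1), sf_median)  # prints: 57.2 57
```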

San Francisco and Las Cruces Weather Data

[Figure: histogram of the January 1 high temperatures for both cities]

However, the graphical representation makes the two data sets look quite different. The data for Las Cruces is spread out much more than that for San Francisco. A calculation of the middle of the data presents only part of the story. The dispersion, or deviation, of the data is also important. While there are several measures of deviation, the most common one is called the standard deviation.

The most basic property of standard deviation is: the larger the standard deviation, the more spread out the data. That is, the larger the deviation, the farther the data lies from the middle, or the average.

The point of measuring deviation is to give a sense of how far the data is from the middle, or the average. The standard deviation approximately measures the average distance of the data from the middle. This is not exactly true, but it is roughly true. We will say more about the standard deviation in a little while.
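To make the comparison concrete, the sketch below computes the standard deviation of both columns, along with the plain average distance from the mean that the standard deviation roughly tracks. Two assumptions are ours: the lists are typed in from the table as transcribed, and we use the population standard deviation (`statistics.pstdev`), since the slides do not say which version is intended. The helper `mean_distance_from_mean` is a hypothetical name.

```python
import statistics

# January 1 high temperatures, 1977-2007, typed in from the table.
sf_highs = [54, 55, 56, 61, 54, 52, 50, 60, 57, 60, 57, 49, 50, 56, 55, 58,
            55, 60, 53, 67, 66, 58, 61, 54, 64, 58, 59, 54, 56, 63, 60]
lc_highs = [61, 61, 52, 56, 66, 62, 33, 56, 61, 66, 58, 49, 54, 50, 57, 57,
            58, 57, 54, 62, 64, 56, 63, 65, 61, 50, 68, 67, 65, 65, 54]

def mean_distance_from_mean(values):
    """Average of how far each value lies from the mean."""
    center = statistics.mean(values)
    return sum(abs(v - center) for v in values) / len(values)

for city, highs in (("San Francisco", sf_highs), ("Las Cruces", lc_highs)):
    print(f"{city}: standard deviation {statistics.pstdev(highs):.1f}, "
          f"average distance from mean {mean_distance_from_mean(highs):.1f}")
```

Las Cruces comes out with the larger standard deviation, matching the wider spread seen in the plot.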

Box and Whisker Plot

A box and whisker plot is another way to plot data. It focuses attention on different aspects of the data than a histogram does.

One of the main pieces of information this plot shows is the quartiles. The idea of quartiles is to divide the data into quarters. The median breaks the data into two halves. If we break each half into halves, we will have broken the data into four quarters, and the dividing points are the quartiles.

The first quartile represents a point where 1/4 of the data is below and 3/4 is above. The second quartile, which is the median, represents a point where 2/4 of the data is below and 2/4 is above. The third quartile represents a point where 3/4 of the data is below and 1/4 is above.

We will illustrate constructing a box and whisker plot with the following data. Suppose your data set has the following 10 numbers:

60, 62, 64, 64, 65, 67, 70, 75, 80, 82

We first find the median; this is the average of 65 and 67, so it is 66.

We next find the first and third quartiles. To do this we split the data in half:

60, 62, 64, 64, 65        67, 70, 75, 80, 82

The first quartile is the median of the small half. That is 64. The third quartile is the median of the big half. That is 75.

The median is the second quartile value. So, we have:

first quartile: 64
median: 66
third quartile: 75
low: 60
high: 82

We then make the following plot, marking it next to a number line starting and ending at the low and high, respectively.
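The five values needed for the plot can be computed with a short Python function that follows the median-of-each-half method above. The function name `five_number_summary` is our own. For a data set with an odd number of values it leaves the middle value out of both halves, which is an assumption, since the example only covers an even count.

```python
import statistics

def five_number_summary(data):
    """Return (low, first quartile, median, third quartile, high).

    The quartiles are the medians of the lower and upper halves of the
    sorted data. With an odd number of values, the middle value is left
    out of both halves (our assumption, not stated in the text).
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]                  # the "small half"
    upper = values[len(values) - half:]    # the "big half"
    return (values[0],
            statistics.median(lower),
            statistics.median(values),
            statistics.median(upper),
            values[-1])

print(five_number_summary([60, 62, 64, 64, 65, 67, 70, 75, 80, 82]))
# prints: (60, 64, 66.0, 75, 82)
```

Other software may use slightly different quartile conventions, so its results can differ a little from this method.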

The two boxes reflect the middle two quarters of the data: one goes from the first quartile to the median and the other from the median to the third quartile. Then we have the whiskers, which are lines to the extremes (high and low) of the data. The significance of the boxes is that half the data lies inside the two of them. The other half of the data is represented by the whiskers.

For a second example, let's use the Laker salary data:

0.32M, 0.43M, 0.77M, 1M, 1.76M, 2.17M, 2.2M, 2.7M, 4M, 4.4M, 5.6M, 13.5M, 13.7M, 19.5M

Since there are 14 data points, each half has 7 points. The median of each half has three numbers on each side. These medians, 1M and 5.6M, are the first and third quartiles. As we saw earlier, the median is 2.45M. The low and high salaries are 0.23M and 19.5M.

[Figure: box and whisker plot for the Laker salary data]

There are other sorts of charts for representing data. A pie chart is a common one. Its purpose is to show visually what share of the whole each item makes up.

The Normal Distribution

Let's look at the experiment of flipping a coin repeatedly. We will simulate this with a computer program. Each run represents flipping 100 coins and counting the number of heads. The program will repeat this as many times as we want.

Recall the coin flipping experiment we did early in the semester.

[Figure: Number of Students with a given number of Heads; horizontal axis: Number of Heads (0 to 20), vertical axis: Number of Students (0 to 14)]

[Figure: Simulation of flipping 100 coins 1000 times]

[Figure: Simulation of flipping 100 coins 10000 times]

[Figure: Simulation of flipping 100 coins 100000 times]

[Figure: Simulation of flipping 100 coins 1000000 times]

As the number of repetitions gets larger and larger, the graph looks more and more regular. In fact, it looks more and more like a curve you may have seen before.
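The program used for the slides is not shown, so the following is only a sketch of how such a simulation might look in Python. The function name `simulate_heads`, the fixed seed, and the text bars are our own choices. It flips 100 fair coins, counts the heads, and repeats.

```python
import random
from collections import Counter

def simulate_heads(repetitions, coins=100, seed=0):
    """Flip `coins` fair coins `repetitions` times; tally the head counts."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    tally = Counter()
    for _ in range(repetitions):
        # getrandbits(coins) gives `coins` random bits; each 1 bit is a head.
        heads = bin(rng.getrandbits(coins)).count("1")
        tally[heads] += 1
    return tally

for repetitions in (1000, 10000):
    tally = simulate_heads(repetitions)
    print(f"{repetitions} repetitions")
    for heads in range(40, 61, 2):
        # Scale each bar by the share of repetitions with that many heads.
        bar = "#" * round(200 * tally[heads] / repetitions)
        print(f"  {heads:3d} {bar}")
```

With more repetitions, the rows of `#` marks should trace out a smoother hump centered near 50 heads.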

[Figure: The Normal (Bell) Curve]

The importance of the bell curve is that as the number of trials gets larger and larger, histograms generally look more and more like a bell curve. The particular shape of the bell curve reflects the standard deviation.

[Figure: Bell Curves with Different Standard Deviations; St. Dev. = 2 and St. Dev. = 5]

The higher the standard deviation, the wider the curve is. In terms of the normal curve, standard deviation can be interpreted as follows:

About 68% of all data is within 1 standard deviation of the average.
About 95% of all data is within 2 standard deviations of the average.
About 99.7% of all data is within 3 standard deviations of the average.
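These three percentages can be checked against the normal curve itself. The sketch below relies on a standard fact not stated in the slides: the share of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)). The function name `share_within` is our own.

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations
    of the mean, using the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * share_within(k):.1f}%")
```

The printed values come out near 68.3%, 95.4% and 99.7%, which are usually rounded to 68%, 95% and 99.7%.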

Sampling

The following quotes were taken from Gallup.com on April 27:

"The latest Gallup Poll tracking report shows that 86% of Americans say the U.S. economy is getting worse, while 44% rate the current economy as poor, and only 15% rate it as excellent or good."

"National Democratic voters' preferences for their party's nomination remain evenly split, with the latest Gallup Poll Daily tracking results showing Barack Obama and Hillary Clinton each receiving 47% support."

The following two quotes were taken from http://www.cnn.com/2004/allpolitics/04/19/bush.kerry.poll/index.html

December 21, 2004: "As for Bush, 49 percent of respondents said they approved of the job the president is doing.... The question had a margin of error of plus or minus 3 percentage points."

April 20, 2004: "A broader survey of registered voters gave the president a 50 percent to 46 percent lead over Kerry in a two-man race. And among all adults, Bush led Kerry 49 percent to 46 percent, with a margin of error of plus or minus 3 percentage points."

How are these statistics measured? Also, in the first of these two quotes, what is the meaning of the statement "The question had a margin of error of plus or minus 3 percentage points"? Does it mean Bush had an approval rating between 49 - 3 = 46 percent and 49 + 3 = 52 percent?

Polls are conducted by taking a sample of the population and asking them their opinion. The CNN poll surveyed 1003 people. The reported approval rating is the percentage of those 1003 people who approved of Bush's performance. Since only a small fraction of the population was surveyed, is the actual approval rating really in this range?

What is almost always missing from statements about polls is that, because not everybody is polled, the poll gives only an estimate of the actual value. In fact, statements like the CNN poll data, which list a percentage and a margin of error, are really probabilistic statements.

The statement "As for Bush, 49 percent of respondents said they approved of the job the president is doing.... The question had a margin of error of plus or minus 3 percentage points." really means that the actual percentage of Americans who approved of the president's job performance has a certain probability of being within 3 points of 49%.

Most polls calculate the margin of error so that there is a 95% probability that the actual value is within the margin of error. Unfortunately, a poll rarely, if ever, lists the actual probability that the true value is within the margin of error.
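The slides do not give a formula for the margin of error. A common textbook approximation, assumed here, is z * sqrt(p(1 - p) / n) with z = 1.96 for 95% confidence, where p is the reported proportion and n is the number of people polled. Applied to the CNN poll (49% of 1003 people), it can be sketched as follows; the function name is our own.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p from n people.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); z = 1.96
    corresponds to 95% confidence. This formula is our assumption, since
    the poll's actual method is not described.
    """
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(p=0.49, n=1003)
print(f"{100 * moe:.1f} percentage points")  # prints: 3.1 percentage points
```

This is consistent with the reported margin of plus or minus 3 percentage points.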

If it is important to be more than 95% sure of the accuracy of the results, one can guarantee this by making the margin of error in the poll larger. However, suppose a poll said that candidate A has 51% support and candidate B has 49% support, with a margin of error of plus or minus 10%. The 2% difference between the candidates is then much less than the 10% margin of error, so we cannot have any feel for which candidate has the greater support. It is therefore necessary to have a fairly small margin of error.
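The trade-off between confidence and margin of error can be illustrated numerically. The sketch below assumes the usual normal approximation for a sample proportion (not given in the slides) and reuses the CNN poll's numbers of 49% and 1003 people. The function name `margin_for_confidence` is our own.

```python
import math
from statistics import NormalDist

def margin_for_confidence(p, n, confidence):
    """Margin of error for proportion p and sample size n at the given
    confidence level, under the normal approximation (our assumption)."""
    # Two-sided interval: leave (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)

for confidence in (0.90, 0.95, 0.99):
    moe = margin_for_confidence(0.49, 1003, confidence)
    print(f"{confidence:.0%} confidence: plus or minus {100 * moe:.1f} points")
```

For the same poll, asking for more confidence widens the margin, which is the trade-off described above.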

We want, therefore, to be confident of the data but have a fairly small margin of error. Over the years, people working with statistics have found 95% confidence to be a good balance between being sure of the data and not having too large a margin of error.

It turns out that the actual number of people polled, rather than the fraction of the population, is what matters in these calculations. For example, polling 1000 people is plenty to get good results, even when polling about a national issue.
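A rough numerical illustration of this point follows. It assumes the normal approximation for the margin of error together with the standard finite population correction factor sqrt((N - n) / (N - 1)); neither formula appears in the slides, and the population sizes are made-up examples. The function name `poll_margin` is our own.

```python
import math

def poll_margin(n, population=None, p=0.5, z=1.96):
    """95% margin of error for a poll of n people (normal approximation).

    If a population size is given, apply the finite population correction.
    p = 0.5 is the worst case, giving the widest margin.
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

# Same sample size, very different (hypothetical) population sizes.
for population in (100_000, 10_000_000, 300_000_000):
    moe = poll_margin(1000, population)
    print(f"population {population:>11,}: plus or minus {100 * moe:.2f} points")

# Same idea, different sample sizes.
for n in (100, 1000, 10000):
    print(f"sample of {n:>5}: plus or minus {100 * poll_margin(n):.1f} points")
```

Under these assumptions the margin barely moves as the population grows, while it shrinks noticeably as the sample size grows.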