Statistics and parameters

Similar documents
Unit 2: Numerical Descriptive Measures

Chapter 4. Displaying and Summarizing. Quantitative Data

MAT Mathematics in Today's World

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

Descriptive Statistics-I. Dr Mahmoud Alhussami

Describing Data: Numerical Measures

Measures of Central Tendency

Chapter 4.notebook. August 30, 2017

Describing Distributions With Numbers

4.12 Sampling Distributions 183

1 Measures of the Center of a Distribution

Units. Exploratory Data Analysis. Variables. Student Data

Describing Distributions

Describing Distributions with Numbers

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data

Descriptive statistics

are the objects described by a set of data. They may be people, animals or things.

3.1 Measures of Central Tendency: Mode, Median and Mean. Average a single number that is used to describe the entire sample or population

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

Chapter 6 Group Activity - SOLUTIONS

- a value calculated or derived from the data.

Essentials of Statistics and Probability

Algebra 2. Outliers. Measures of Central Tendency (Mean, Median, Mode) Standard Deviation Normal Distribution (Bell Curves)

MEASURES OF LOCATION AND SPREAD

Describing Data with Numerical Measures

Describing Data: Numerical Measures. Chapter 3

Statistics for Managers using Microsoft Excel 6 th Edition

Chapter 2: Tools for Exploring Univariate Data

3.1 Measure of Center

Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

Chapter 1: Exploring Data

Describing Distributions With Numbers Chapter 12

Σ x i. Sigma Notation

2.0 Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

additionalmathematicsstatisticsadditi onalmathematicsstatisticsadditionalm athematicsstatisticsadditionalmathem aticsstatisticsadditionalmathematicsst

A C E. Answers Investigation 4. Applications

- measures the center of our distribution. In the case of a sample, it s given by: y i. y = where n = sample size.

MALLOY PSYCH 3000 MEAN & VARIANCE PAGE 1 STATISTICS MEASURES OF CENTRAL TENDENCY. In an experiment, these are applied to the dependent variable (DV)

CIVL 7012/8012. Collection and Analysis of Information

1.0 Continuous Distributions. 5.0 Shapes of Distributions. 6.0 The Normal Curve. 7.0 Discrete Distributions. 8.0 Tolerances. 11.

KCP e-learning. test user - ability basic maths revision. During your training, we will need to cover some ground using statistics.

Descriptive Statistics (And a little bit on rounding and significant digits)

Chapter2 Description of samples and populations. 2.1 Introduction.

Statistical Concepts. Constructing a Trend Plot

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

Chapter 3. Measuring data

MATH 117 Statistical Methods for Management I Chapter Three

TOPIC: Descriptive Statistics Single Variable

CHAPTER 8 INTRODUCTION TO STATISTICAL ANALYSIS

Class 11 Maths Chapter 15. Statistics

Chapter. Numerically Summarizing Data. Copyright 2013, 2010 and 2007 Pearson Education, Inc.

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

1.3.1 Measuring Center: The Mean

Stats Review Chapter 3. Mary Stangler Center for Academic Success Revised 8/16

2011 Pearson Education, Inc

MODULE 9 NORMAL DISTRIBUTION

1. AN INTRODUCTION TO DESCRIPTIVE STATISTICS. No great deed, private or public, has ever been undertaken in a bliss of certainty.

MATH 10 INTRODUCTORY STATISTICS

Statistics 511 Additional Materials

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

Histograms allow a visual interpretation

Preliminary Statistics course. Lecture 1: Descriptive Statistics

Chapter 3. Data Description

SESSION 5 Descriptive Statistics

STAT 200 Chapter 1 Looking at Data - Distributions

MATH 1150 Chapter 2 Notation and Terminology

Unit 2. Describing Data: Numerical

Describing distributions with numbers

GRAPHS AND STATISTICS Central Tendency and Dispersion Common Core Standards

Probability Distributions

Resistant Measure - A statistic that is not affected very much by extreme observations.

Lecture 1 : Basic Statistical Measures

Math 223 Lecture Notes 3/15/04 From The Basic Practice of Statistics, bymoore

Numerical Measures of Central Tendency

UNIT 3 CONCEPT OF DISPERSION

Chapter 1:Descriptive statistics

The standard deviation as a descriptive statistic

CS 147: Computer Systems Performance Analysis

Describing Center: Mean and Median Section 5.4

200 participants [EUR] ( =60) 200 = 30% i.e. nearly a third of the phone bills are greater than 75 EUR

Chapter 3 Statistics for Describing, Exploring, and Comparing Data. Section 3-1: Overview. 3-2 Measures of Center. Definition. Key Concept.

A is one of the categories into which qualitative data can be classified.

Topic-1 Describing Data with Numerical Measures

Describing distributions with numbers

2.1 Measures of Location (P.9-11)

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

CHAPTER 2: Describing Distributions with Numbers

Measures of disease spread

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives

Shape, Outliers, Center, Spread Frequency and Relative Histograms Related to other types of graphical displays

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Chapter 5. Understanding and Comparing. Distributions

Week 1: Intro to R and EDA

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS QM 120. Spring 2008

Statistics I Chapter 2: Univariate data analysis

Chapter 5. Means and Variances

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

Transcription:

Statistics and parameters Tables, histograms and other charts are used to summarize large amounts of data. Often, an even more extreme summary is desirable. Statistics and parameters are numbers that characterize different aspects of the data Terminology: A number that summarizes population data is usually referred to as a parameter, while a number that summarizes sample data is called a statistic. Comment: There are two significant differences between population parameters and sample statistics. Population parameters are (more or less) constant, while sample statistics depend on the particular sample chosen i.e., sample statistics vary with the sample. Sample statistics are known because we can compute them from the (available) sample data, while population parameters are often unknown, because data for the entire population is often unavailable. 1

Measures of central tendency The center of a list of numbers can be characterized in several ways. The two most common measures of middle are the average (or mean) and the median. The mean of a set of numbers is the sum of all the values divided by the number of values in the set. The median of a set of number is the middle number, when the numbers are listed in ascending (or descending) order. The median splits the data into two equally sized sets 50% of the data lies below the median and 50% lies above. (If the number of numbers in the set is even, then the median is the average of the two middle values.) The mean and median are meant to describe the typical value in the data. Another statistic that is often used to describe the typical value is the mode this is the most frequently occurring value in the data. 2

Example. Find the mean, median and mode of the following set of numbers: The mean (average). {12, 5, 6, 8, 12, 17, 7, 6, 14, 6, 5, 16}. 12 + 5 + 6 + 8 + 12 + 17 + 7 + 6 + 14 + 6 + 5 + 16 12 = 114 12 = 9.5. The median. Arrange the data in ascending order, and find the average of the middle two values in this case, since there are an even number of values: 5, 5, 6, 6, 6, 7, 8, 12, 12, 14, 16, 17 median = 7 + 8 2 = 7.5. The mode is 6, because 6 occurs most frequently (three times). 3

The relative positions of the mean and median provides information about how the data is distributed... In the histogram below, the mean is bigger than the median, and the histogram has a longer tail on the right. We say that the data is skewed to the right. 50% 50% Median Mean 4

In the histogram below, the mean is smaller than the median, and the histogram has a longer tail on the left. We say that the data is skewed to the left. 50% 50% Mean Median 5

If the mean and median are (more or less) equal, then the tails of the distribution are (more or less) the same, and the data has a (more or less) symmetric distribution around the mean/median, as depicted below. 50% 50% Median Mean 6

Example: The histogram below describes the distribution of household income in the United States. The data comes from the Wikipedia s summary of the 2006 Economic Survey. 1.5% Percentage of US Households per Income (data from 2006 Economic Survey) 1.0% 0.5% $0 - $12,499 $75,000 - $99,999 $50,000 - $74,999 $25,000 - $49,499 $12,500 - $24,499 $100,000 - $149,999 $150,000 - $199,999 $250,000 and up (1.5%) $200,000 - $249,999 $200,000 - $249,9 7

The distribution of household income is skewed to the right. The median household income in the survey was about $44,400. The mean income was about $63,000 (my estimate). 1.5% Percentage of US Households per Income (data from 2006 Economic Survey) median ~ 44,400 % per $1000 1.0% mean ~ 63,000 0.5% $0 - $12,499 $25,000 - $49,499 $12,500 - $24,499 $75,000 - $99,999 $50,000 - $74,999 $100,000 - $149,999 $150,000 - $199,999 $200,000 - $249,999 $250,000 and up (1.5%) 8

Comments: The mean and median describe the middle of the data in somewhat different ways: The median divides the histogram into two halves of equal area. The mean is the balancing point of the histogram. The mean is more sensitive to outliers data values that are much bigger or much smaller than average. The median gives a more fair sense of middle when the data is skewed in one direction or the other. On the other hand, the mean is easier to use in mathematical formulas. Both the median and the mean leave out a lot of information. E.g., they tell us nothing about the spread of the data, where we might find peaks in the distribution, etc. 9

Notation. The population mean (a parameter) is denoted by the Greek letter µ ( mu ). If there are several variables being studied, we put a subscript on the µ to tell us which variable it pertains to. For example, if we have data for population height (h) and population weight (w), the mean height would be denoted by µ h and the mean weight by µ w. The mean of a set of sample data (a statistic) is denoted by putting a bar over the variable. E.g., if {h 1, h 2, h 3,..., h n } is a sample of heights, then the average of this sample would be denoted by h. We use summation notation to simplify the writing of (long) sums,: h 1 + h 2 + h 3 + + h n = n h j. So, for example we can write: h = 1 n n h j. 10

Measuring the spread of the data The mean and median describe the middle of the data distribution. To get a better sense of where the data lies, statisticians also use measures of dispersion. The range is the distance between the smallest and largest values in the data. The interquartile range is the distance between the value separating the bottom 25% of the data from the rest and the value separating the top 25% of the data from the rest. In other words, it is the range of the middle 50% of the data. The standard deviation is something like the average distance of the data to its mean, or the average spread. Example: In the histogram describing household income distribution, about 25% of all households have incomes below $22,400 and about 25% of all households have incomes above $78,000, so the interquartile range is 78, 000 22, 400 = 65, 600. 11

The standard deviation As mentioned above, the standard deviation of a set of numbers is something like the average distance of the numbers from their mean. Technically, it is a little bit more complicated than that. Suppose that x 1, x 2, x 3,..., x n are numbers and n x = 1 n x j is their mean. The standard deviation of these numbers is defined by SD x = 1 n (x j x) n 2. In words, the SD is the root of the mean of the squared deviations of the numbers from their mean. 12

If much of the data is spread far from the mean, then many of the (x j x) 2 terms will be quite large, so the mean of these terms will be large and the SD of the data will be large. In particular, outliers can make the SD bigger. (Outliers have an even bigger effect on the range of the data.) On the other hand, if the data is all clustered around the mean, then all of the (x j x) 2 terms will be fairly small, so their mean will be small and the SD will be small. Example: Find the SD of the set {x j } = {2, 4, 5, 8, 5, 11, 7}. Step 1. Find the mean: x = 1 7 7 x j = 6. Step 2. Find the mean of the squared deviations of the numbers from their mean: 1 7 7 (x j 6) 2 = 52 7. Step 3. SD x = 52/7 2.726. 13

Example: Find the SD of the set {y j } = {21, 39, 52, 78, 51, 112, 74}. Step 1. Find the mean: y = 1 7 7 y j = 61. Step 2. Find the mean of the squared deviations of the numbers from their mean: 1 7 (y j 61) 2 = 5324 7 7. Step 3. SD y = 5324/7 27.578. Observation: The mean and the standard deviation are both sensitive to scale. In more detail, if you change the scale of the data, the mean and the standard deviation will change in the same way. For example, if you have data for heights measured in centimeters, and the same set of heights measured in inches, then the mean and SD of the first set will be about 2.54 times bigger than the mean and SD of the second set. (Because there are about 2.54 centimeters in an inch.) 14

SD vs. SD + One of the most important uses of sample statistics is to estimate the corresponding population parameters. The mean of a representative sample is a good estimate of the mean of the population that the sample represents. The SD of a representative sample is a good estimate of the SD of the population that the sample represents, as long as the sample is large. If a sample is relatively small, statisticians use the SD + of the sample to estimate the SD of the population, where if n is the sample size. SD + = n n 1 SD The SD + is usually called the sample standard deviation. Many calculators with statistical functions (and software packages) compute the SD + instead of the SD because in practice, this is what most people need. 15