Description of Samples and Populations

Random Variables

Data are generated by some underlying random process or phenomenon. Any datum (data point) represents the outcome of a random variable. We represent random variables with capital letters, usually $X$, $Y$, and $Z$.

Example: Let $X$ = weight of a newborn baby. Let $Y$ = weight of the baby at one week old.

Types of Random Variables

Qualitative
  - Categorical (or nominal)
  - Ordinal

Quantitative
  - Discrete
  - Continuous

Definition: An observation is a recorded outcome of a variable from a random sample. We represent observations with lower case letters. For example, suppose we are measuring the random variable $X$ = weight of a newborn baby on 10 newborn babies. Our observations would be denoted by $x_1, x_2, \ldots, x_{10}$.

Notation: $x_1$ is the first observation.

Definition: The observational unit is the type of subject being sampled. It is the smallest unit upon which we make our observations. Observational units could be a baby, a moth, a Petri dish, etc.

Example: For the following setting, identify (i) the variable(s) in the study, (ii) the type of variable, (iii) the observational unit, and (iv) the sample size.

(From Example 1.1.4) In a study of schizophrenia, researchers measured the activity of the enzyme monoamine oxidase (MAO) in the blood platelets of 42 patients. The results were recorded as nmols benzylaldehyde product per $10^8$ platelets.

Definition: A frequency distribution is a summary display of the frequencies of occurrence of each value in a sample.

Definition: A relative frequency is a raw frequency divided by $n$ (the sample size).

Example 2.2.4: $Y$ = number of piglets surviving 21 days (litter size at 21 days). What is the sample size? What is the relative frequency for a litter size at 21 days of 10?

Graphical Displays

After we collect data, we hope that they tell us something. We look at the data in order to learn something about the process that generated them. One way to summarize (or describe) the data is through a graphical display.

A graphical display should always be as clear as possible. It should be well labeled, with a title, a key (if necessary), and labels on the display itself. The units and the sample size should be clear from the display. Do not over-label your graphical display!

Formatting a graphical display nicely can be more difficult than it sounds. We'll try one in a minute and you'll see that it requires some thought.
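The two definitions above can be sketched in R, the language these notes use later. This is a minimal illustration rather than the text's own solution: the litter sizes are entered as they appear later in these notes, and the names `Lsize`, `freq`, and `rel_freq` are illustrative choices.

```r
# Litter size at 21 days for each sow (Table 2.2.4, as entered later in these notes)
Lsize <- c(5, 7, 7, 8, 8, 8, 9, 9, 9, rep(10, 9), rep(11, 8), rep(12, 5),
           13, 13, 13, 14, 14)

n <- length(Lsize)      # sample size
freq <- table(Lsize)    # frequency of each observed litter size
rel_freq <- freq / n    # relative frequency = frequency / n

print(n)
print(freq)
print(round(rel_freq, 3))

# Relative frequency of a litter size of 10
rel_10 <- as.numeric(rel_freq["10"])
print(rel_10)
```

With these data the sample size is 36 sows, and 9 of the 36 litters have size 10, so the relative frequency is 0.25.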

A dot plot is a graphical display where dots indicate observed data in a sample. Figure 2.2.4 is an illustration of the frequency distribution of Example 2.2.4.

[Figure 2.2.4: dot plot, "Surviving Piglets at 21 Days" (n = 36 sows)]

A histogram is a graphical display where bars (or bins) replace the dots from a dot plot. Figure 2.2.5 is a histogram of the frequency distribution of Example 2.2.4.

[Figure 2.2.5: histogram, "Surviving Piglets at 21 Days" (n = 36 sows)]

Let's make a histogram in R.

First, a trick: rep(10, 9) means "repeat 10 nine times", that is, list nine 10s in a row. With it we can quickly enter the litter size data of Table 2.2.4:

    Lsize <- c(5, 7, 7, 8, 8, 8, 9, 9, 9, rep(10, 9), rep(11, 8), rep(12, 5), 13, 13, 13, 14, 14)

Look at what hist(Lsize) does. What is wrong with the display?

Turns out this is a glitch in R. Type

    ?hist

This gives all the options available for the histogram function. We'll use breaks to fix the histogram. Type

    hist(Lsize, breaks = 4:14)

The breaks argument in the histogram function tells R where the edges of the bars (bins) of the histogram are. The expression 4:14 is the same as saying "list the whole numbers from 4 up to 14".

Now, notice that this is not well labeled.

    hist(Lsize, breaks = 4:14, main = "insert title here", sub = "insert subtitle here")

puts a title and subtitle on the display.

Warning: A histogram can be a very misleading display. We'll see an example of this in a minute.

A stemplot, or stem-and-leaf plot, is a lot like a dot plot, usually turned on its side. We use a stemplot when we have more detailed information with which to replace the dots. The stems are the leading digits of the data values and the leaves are the last digits. In this class we'll put the leaves in numerical order. The resulting plot is called an ordered stemplot, but since we'll never use the unordered kind, we'll refer to ours simply as a stemplot. A stemplot should always include a key with units!

Example (data from a portion of Example 1.1.4): In a study of schizophrenia, researchers measured the activity of the enzyme monoamine oxidase (MAO) in the blood platelets of 18 patients. The results were recorded as nmols benzylaldehyde product per $10^8$ platelets.

    4.1  5.2  6.8  7.3  7.4  7.8  7.8  8.4  8.7
    9.7  9.9  10.6  10.7  11.9  12.7  14.2  14.5  18.8

Create a stemplot for these data using R. stem() is the stemplot function in R. Let's figure it out together.
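Putting the pieces above together, here is one possible sketch of a labeled histogram and a stemplot in R. The title is taken from the figure caption in these notes; the axis labels and the name `mao` are illustrative assumptions, not something the notes prescribe.

```r
# Litter sizes from Table 2.2.4
Lsize <- c(5, 7, 7, 8, 8, 8, 9, 9, 9, rep(10, 9), rep(11, 8), rep(12, 5),
           13, 13, 13, 14, 14)

# Bin edges at 4, 5, ..., 14, so each bar holds exactly one litter size
hist(Lsize, breaks = 4:14,
     main = "Surviving Piglets at 21 Days (n = 36 sows)",
     xlab = "Litter size at 21 days",   # assumed axis label
     ylab = "Number of sows")           # assumed axis label

# The same binning without drawing, to inspect the bar heights
h <- hist(Lsize, breaks = 4:14, plot = FALSE)
print(h$counts)

# MAO activity (nmol benzylaldehyde product per 10^8 platelets), 18 patients
mao <- c(4.1, 5.2, 6.8, 7.3, 7.4, 7.8, 7.8, 8.4, 8.7,
         9.7, 9.9, 10.6, 10.7, 11.9, 12.7, 14.2, 14.5, 18.8)

# Text stemplot; stem() prints a line saying where the decimal point sits,
# but the units still have to be added to the key by hand
stem(mao)
```

The `scale` argument of `stem()` stretches or compresses the plot if the default stems are too coarse or too fine.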

Describing the shape of a frequency distribution

We can see the shape of a frequency distribution by looking at an appropriate graphical display. The following are some basic words we use to describe the shape of a frequency distribution:

  - symmetric / asymmetric
  - bell shaped / skewed left / skewed right
  - unimodal / bimodal

Some examples:

Example (from Exercise 2.13 in the 3rd ed. of the text): Trypanosomes are parasites which cause disease in humans and animals. In an early study of trypanosome morphology, researchers measured the lengths (µm) of 500 individual trypanosomes taken from the blood of a rat. The results are summarized in the accompanying frequency distribution.

The following is the default histogram returned by a statistical software package (not well labeled yet!). Describe the shape of the frequency distribution.

[Figure: default histogram of trypanosome lengths]

This next histogram is returned by the same software package for the trypanosome data, with the bin width changed. How would you describe the shape of the distribution now? Discuss the changes.

[Figure: histogram of trypanosome lengths with a different bin width]

We learned to describe data sets graphically. We can also describe a data set numerically.

Definition: A statistic is a numerical quantity calculated from the sample data. It is used to estimate a population parameter.

Measures of Location

The sample median is the value of the data nearest their middle. We call the median $Q_2$ or $M$; your text uses its own symbol for it.

To find the median of a data set, put the data in order. The median is
  - the middle value if $n$ is odd
  - the average of the two middle values if $n$ is even

Example (Exercise 2.3.1): $Y$ = weight gain (lb) of lambs on a special diet, for $n = 6$ lambs. The ordered values are

    $y_{(1)} = 1$, $y_{(2)} = 2$, $y_{(3)} = 10$, $y_{(4)} = 11$, $y_{(5)} = 13$, $y_{(6)} = 19$

Find $Q_2$.

Notice the new notation. A lower case letter with parentheses in the subscript, $y_{(i)}$, denotes the $i$th observation in order from smallest to largest. So $y_{(1)}$ denotes the smallest observation (the minimum) and $y_{(n)}$ denotes the largest (the maximum).

Definition: The sample mean is the arithmetic average of the $n$ values. We denote the sample mean by

    $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{y_1 + y_2 + \cdots + y_n}{n}$

Example 2.3.3: $Y$ = weight gain (lb) of lambs on a special diet, for $n = 6$ lambs. Compute the sample mean for the resulting data set.

    11  13  19  2  10  1
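As a check on the two lamb examples, the following R sketch applies the median rule and the mean formula step by step and compares them with R's built-in functions. The variable names are illustrative.

```r
# Weight gain (lb) of n = 6 lambs, in the order recorded
y <- c(11, 13, 19, 2, 10, 1)
n <- length(y)

# Order statistics y_(1), ..., y_(n)
y_sorted <- sort(y)
print(y_sorted)

# n is even, so the median is the average of the two middle ordered values
q2 <- mean(y_sorted[c(n / 2, n / 2 + 1)])

# Sample mean: sum of the observations divided by n
ybar <- sum(y) / n

print(q2)
print(ybar)

# The built-ins give the same answers
print(median(y))
print(mean(y))
```

The two middle ordered values are 10 and 11, so $Q_2 = 10.5$ lb, and the mean is $56/6 \approx 9.33$ lb.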

The following two figures illustrate that the sample mean can be viewed as a balancing point in the data, whereas the sample median is the point which divides the data set into an upper half and a lower half.

Question: Which measure of location (or measure of center) do we report? Mean or median?

To answer this, explore what happens to the relation between the mean and the median on certain data sets. Consider two data sets. Find the mean and the median for both.

    1  2  3  4  5
    1  2  3  4  20

What happened and why?

So the comparison of the mean and median can indicate skewness.

  - Data skewed right: mean $>$ median
  - Data skewed left: mean $<$ median
  - Data symmetric: mean $\approx$ median

Measures of Dispersion (IQR, Range, and Standard Deviation)

The quartiles of a data set are points that separate the data into quarters (or fourths).

  - $Q_1$ separates the lower quarter (25%) from the upper three quarters (75%)
  - $Q_2$ separates the lower two quarters (50%) from the upper two quarters (50%)
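The two small data sets in the mean-versus-median question can be compared directly in R. This is only a sketch; the helper `center_summary` is a hypothetical name introduced here.

```r
# The two data sets from the mean-versus-median question
a <- c(1, 2, 3, 4, 5)
b <- c(1, 2, 3, 4, 20)   # same as a, except the largest value is pulled far right

# Hypothetical helper: report both measures of center side by side
center_summary <- function(y) c(mean = mean(y), median = median(y))

res_a <- center_summary(a)
res_b <- center_summary(b)

print(res_a)
print(res_b)
```

Changing the largest value from 5 to 20 moves the mean from 3 to 6 but leaves the median at 3, which is why a mean well above the median suggests right skew.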

  - $Q_3$ separates the lower three quarters (75%) from the upper quarter (25%)

Notice that the median is the second quartile.

One way to report the dispersion (or spread) of a data set is to report the interquartile range.

Definition: The interquartile range is $IQR = Q_3 - Q_1$.

Definition: The sample range is $y_{(n)} - y_{(1)} = \max - \min$.

Definition: The five number summary is $\{y_{(1)}, Q_1, Q_2, Q_3, y_{(n)}\}$.

Example (from Example 2.22): In a common biology experiment, radishes were grown in total darkness and the length (mm) of each radish shoot was measured at the end of three days. Find the five number summary for these data.

    8  10  11  15  15  15  20  20  22  25  29  30  33  35  37

Definition: An outlier is an observation that differs dramatically from the rest of the data. Formally, $y_i$ is an outlier if

    $y_i < Q_1 - 1.5 \times IQR$  or  $y_i > Q_3 + 1.5 \times IQR$

Example 2.4.5: $Y$ = radish growth in full light condition. The data are

    3  5  5  7  7  8  9  10  10  10  10  14  20  21

Find any outliers.
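The radish examples can be worked in R with a short sketch. Two assumptions are made here and should be checked against the text: quartiles are taken as the medians of the lower and upper halves of the ordered data (leaving the median out of both halves when $n$ is odd), and outliers are flagged with the common $1.5 \times IQR$ fences. The function names `five_number` and `find_outliers` are hypothetical.

```r
# Five number summary, with Q1 and Q3 as medians of the lower and upper halves
# (assumed convention: the median itself is excluded from both halves when n is odd)
five_number <- function(y) {
  y <- sort(y)
  n <- length(y)
  half <- floor(n / 2)
  lower <- y[seq_len(half)]
  upper <- y[(n - half + 1):n]
  c(min = y[1], Q1 = median(lower), Q2 = median(y), Q3 = median(upper), max = y[n])
}

# Observations outside the 1.5 * IQR fences (assumed outlier rule)
find_outliers <- function(y) {
  s <- five_number(y)
  iqr <- s[["Q3"]] - s[["Q1"]]
  y[y < s[["Q1"]] - 1.5 * iqr | y > s[["Q3"]] + 1.5 * iqr]
}

# Radish shoot length (mm) after three days in total darkness
dark <- c(8, 10, 11, 15, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37)
dark_summary <- five_number(dark)
print(dark_summary)

# Radish growth in full light
light <- c(3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21)
light_summary <- five_number(light)
light_out <- find_outliers(light)
print(light_summary)
print(light_out)
```

R's built-in `fivenum()` and `quantile()` use other quartile conventions, so their values for $Q_1$ and $Q_3$ can differ slightly from the ones computed this way.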

A boxplot (a.k.a. box-and-whisker plot) is a graphical display of the five number summary. The box spans the quartiles and the whiskers extend from the quartiles to the min and max (or, when there are outliers, to the most extreme observations that are not outliers).

[Figure 2.4.2: Dotplot and boxplot of radish height after three days in full light.]

Notice that the outliers are plotted beyond the whiskers.

Boxplots are often used for comparative purposes, as in Figure 2.5.3, below. Compare the three distributions on shape, center, and spread.

[Figure 2.5.3: boxplots, "Radish Length at Three Days Grown Under Three Conditions"]
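A boxplot like the one in Figure 2.4.2 can be sketched in R with `boxplot()`, using the full-light radish data from Example 2.4.5. The title and axis label (including the mm unit) are assumptions made for this illustration.

```r
# Radish growth in full light (Example 2.4.5)
light <- c(3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21)

# Horizontal boxplot; points beyond the whiskers are drawn individually
boxplot(light, horizontal = TRUE,
        main = "Radish Length at Three Days, Full Light (n = 14)",  # assumed title
        xlab = "Length (mm)")                                       # assumed unit

# The numbers behind the picture: whisker ends, hinges, median, and outliers
b <- boxplot.stats(light)
print(b$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)     # observations plotted beyond the whiskers
```

To compare groups side by side as in Figure 2.5.3, `boxplot()` also accepts several vectors at once, for example `boxplot(list(Dark = dark, Light = light))` with each group stored in its own vector.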

Definition: The sample variance is

    $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

Definition: The sample standard deviation is

    $S = \sqrt{S^2}$

Example 2.28: In an experiment on chrysanthemums, a botanist measured the stem elongation (mm in 7 days) of five plants grown on the same greenhouse bench.

    76  72  65  70  82

Find the sample standard deviation.

Empirical Rule

For unimodal, not too skewed data sets, the empirical rule states the following:

  - ~68% of the data lie between $\bar{Y} - s$ and $\bar{Y} + s$
  - ~95% of the data lie between $\bar{Y} - 2s$ and $\bar{Y} + 2s$
  - >99% of the data lie between $\bar{Y} - 3s$ and $\bar{Y} + 3s$

Example (from the 3rd ed.): Suppose $Y$ = pulse rate after 5 minutes of exercise. For $n = 28$ subjects, we find $\bar{y} = 98$ (beats/min) and $s = 13.4$ (beats/min), and the data are unimodal and not too skewed. Thus, from the empirical rule we expect ~95% of the data to lie between

    $98 - (2)(13.4) = 98 - 26.8 = 71.2$ beats/min

and

    $98 + (2)(13.4) = 98 + 26.8 = 124.8$ beats/min
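The chrysanthemum example and the empirical-rule interval can be verified with a few lines of R. The sketch follows the variance formula term by term before comparing with the built-in `var()` and `sd()`; the variable names are illustrative.

```r
# Stem elongation (mm in 7 days) of five chrysanthemum plants
y <- c(76, 72, 65, 70, 82)
n <- length(y)

ybar <- mean(y)                     # sample mean
dev <- y - ybar                     # deviations from the mean
s2 <- sum(dev^2) / (n - 1)          # sample variance: divide by n - 1, not n
s <- sqrt(s2)                       # sample standard deviation

print(ybar)
print(dev)
print(s2)
print(s)

# The built-ins use the same n - 1 divisor
print(var(y))
print(sd(y))

# Empirical rule for the pulse example: mean 98 and s = 13.4 beats/min
pulse_interval <- 98 + c(-2, 2) * 13.4
print(pulse_interval)
```

Here the mean is 73 mm, the squared deviations sum to 164, so the variance is $164/4 = 41$ and the standard deviation is $\sqrt{41} \approx 6.4$ mm.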

Figure 2.5.3, which compares the distribution of radish growth under the three lighting conditions, points to a broader idea: we often compare groups, and we like to analyze the relationship between variables.

Figure 2.5.5 illustrates a common way to explore the relationship between two quantitative variables: a scatterplot. This plot depicts tooth selenium concentration versus liver selenium concentration for 20 beluga whales.

[Figure 2.5.5: scatterplot of tooth selenium concentration versus liver selenium concentration for 20 beluga whales]

How would you describe the relationship between liver and tooth selenium concentrations? Does it look like there are any possible outliers, that is, any extreme observations that might qualify as outliers? (Deciding whether an extreme observation is an outlier when comparing two variables is totally different from our univariate definition.)

Where are we headed?

The goal of a statistical analysis is ultimately to describe a population.

Definition: The population is the larger group of subjects (organisms, plots, regions, ecosystems, etc.) about which we wish to draw inferences.

Definition: A parameter is a quantified population characteristic. It is usually unknown and often assumed to be a fixed value.

Definition: A statistic is a sample quantity used to estimate a population parameter. It is known and can vary from sample to sample.

Question: When we use a statistic to estimate a parameter, is that statistic going to hit the target? That is, is the statistic going to exactly equal the population parameter?

The process of using a statistic to estimate a parameter and then quantifying the uncertainty in our estimate (i.e., making a statement about how far off our estimate is, in general) is called statistical inference.

Definition: The population proportion is the proportion of subjects exhibiting a particular trait or outcome in the population. (It generalizes to the probability that any population element will exhibit the trait.)

NOTATION: $p$

Definition: The sample proportion is the number of sample elements exhibiting the trait, divided by the sample size, $n$.

NOTATION: $\hat{p}$
