Chapter 2. Description of samples and populations

2.1 Introduction.

Statistics is the science of analyzing data. The information collected (the data) is recorded in terms of variables: characteristics of a subject that can be assigned a numerical value or a nonnumerical category. The data themselves, and quantities computed from them, are also called statistics.

Types of variables:

1. Categorical variable: records the category a subject belongs to, such as Blood Type (O, A, B, AB) or Gender (Female, Male). Usually the categories have no meaningful order. Some categorical data are ordinal, meaning a natural order exists, for example response to treatment: none, partial, complete.

2. Quantitative (numeric) variable: records an amount of something or a count of something. It can be continuous, with values on a continuous scale (weight of a newborn, cholesterol content of a blood specimen), or discrete, with values that can be listed and are often integers (number of eggs in a nest, number of bacteria in a petri dish). The distinction between discrete and continuous variables is not rigid, since we often round measurements to the nearest integer.

Sample = a collection of persons or things on which we measure one or more variables. The same word is sometimes used in a different context (for example, a sample of blood taken from a subject). To avoid confusion we will say a specimen of blood in that case.

Some other vocabulary and notation:

Example. Twenty students reported their gender, blood type and weight to a researcher. The students are the observational units. The variables are Gender and Blood Type (both categorical) and Weight (numerical). The sample size is n = 20.

We will use capital letters such as X and Y for the names of variables and lowercase letters (x or y) for particular observations. For example, we may use Y = weight of a student and $y_1 = 150$ lb for the weight of one such student (John).

2.2. Frequency distributions.
When data are collected, it helps to summarize them in tables and/or graphs. We will use some example data sets to examine different ways data can be displayed.

Ex1: Sample of Blood Type for 21 people:

A O A AB O B AB A O A O AB O A O B A AB A O A

We can summarize these data using a frequency and relative frequency table.

Frequency = count in a particular class
Relative frequency = frequency / n
% frequency = relative frequency × 100%
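The three definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` is not from the notes, and the classes are simply listed in alphabetical order.

```python
from collections import Counter

# Blood types of the 21 people in Ex1
blood_types = "A O A AB O B AB A O A O AB O A O B A AB A O A".split()


def frequency_table(values):
    """Map each class to (frequency, relative frequency, % frequency)."""
    n = len(values)
    counts = Counter(values)  # frequency = count in each class
    return {cls: (f, f / n, 100 * f / n) for cls, f in sorted(counts.items())}


table = frequency_table(blood_types)
for cls, (freq, rel, pct) in table.items():
    print(f"{cls:<3} {freq:>3}  {rel:.7f}  {pct:5.1f}%")
```

The same function works unchanged for data grouped by single values, such as counts of children per family.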

Frequency table results for Blood Type:

Blood Type   Frequency   Relative Frequency
A            8           0.3809524
AB           4           0.1904762
B            2           0.0952381
O            7           0.3333333

A graphical display of these data is a bar chart. Notice that the classes do not have to be placed in any particular order.

Ex2: Number of children in each of 40 families (couples):

3 3 3 1 4 3 0 0 2 0 4 2 4 3 2 2 3 2 5 1 1 0 1 1 2 1 0 0 1 2 1 1 0 3 2 1 2 1 2 3

These data can be grouped by single values, since there are relatively few distinct data values. Our classes, in order, are 0, 1, 2, 3, 4, 5, and the frequencies are computed exactly as in Example 1.

Frequency table results for Number of children:

Number of children   Frequency   Relative Frequency
0                    7           0.175
1                    11          0.275
2                    10          0.25
3                    8           0.2
4                    3           0.075
5                    1           0.025

The graphical display of such data is called a histogram: a bar is raised over each class, with the class value at the middle of the bar. Another way to display such data is a dotplot: place a dot over each data value, and where values repeat, stack equally spaced dots above that value.

A grouped frequency distribution is appropriate for a data set with many different values, as in the following example.

Ex3: Age at onset of diabetes (35 people):

48 41 57 83 41 55 59 61 38 48 79 75 77 7 54 23 47 56 79 68 61 64 45 53 82 68 38 70 10 60 83 76 21 65 47

If we start at 0 and use groups of width 10, the classes are [0,10), [10,20), [20,30) and so on. Read the notation as interval notation: the left endpoint is included and the right endpoint is excluded. A histogram for these data can also be obtained, with a bar raised over each class. The vertical axis can represent either frequency or relative frequency.
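As a rough sketch of the grouping step for Example 3, the following Python assigns each age to its half-open class of width 10. The function name `grouped_frequencies` and its parameters `start` and `width` are illustrative choices, not part of the notes.

```python
# Age at onset of diabetes for the 35 people in Ex3
ages = [48, 41, 57, 83, 41, 55, 59, 61, 38, 48, 79, 75, 77, 7, 54, 23, 47, 56,
        79, 68, 61, 64, 45, 53, 82, 68, 38, 70, 10, 60, 83, 76, 21, 65, 47]


def grouped_frequencies(values, start=0, width=10):
    """Count values in the classes [start, start+width), [start+width, ...)."""
    counts = {}
    for v in values:
        # Lower endpoint of the class that contains v
        lower = start + ((v - start) // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    top = max(counts)
    # List every class up to the last occupied one, including empty classes
    return [((lo, lo + width), counts.get(lo, 0))
            for lo in range(start, top + width, width)]


classes = grouped_frequencies(ages)
n = len(ages)
for (lo, hi), freq in classes:
    print(f"[{lo},{hi})  {freq:>2}  {freq / n:.3f}")
```

Plotting these counts as bars over the classes gives the histogram; dividing by n gives the relative-frequency version of the same picture.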

We can also obtain a quick histogram, called a stem-and-leaf diagram (or stemplot). Each data point is split into a stem and a leaf; all possible stems are listed vertically and the leaves are added next to them in order. Our stemplot for Example 3 is given below; notice that the leaves are ordered.

0 | 7
1 | 0
2 | 1 3
3 | 8 8
4 | 1 1 5 7 7 8 8
5 | 3 4 5 6 7 9
6 | 0 1 1 4 5 8 8
7 | 0 5 6 7 9 9
8 | 2 3 3

Ex4: Radish growth (mm in 3 days). A: in the dark. B: 12 hours of light / 12 hours of dark.

A: 15 20 22 20 29 37 11 35 15 30 8 25 33 10
B: 10 11 15 15 20 4 22 21 10 25 27 20 9 20

Side-by-side stemplots (with each stem split into 2 lines) let us compare both sets. In both, stems are tens and leaves are ones.

      A | stem | B
        |  0   | 4
      8 |  0   | 9
    1 0 |  1   | 0 0 1
    5 5 |  1   | 5 5
  2 0 0 |  2   | 0 0 0 1 2
    9 5 |  2   | 5 7
    3 0 |  3   |
    7 5 |  3   |

Interpreting areas of the histogram: the area of each bar of the histogram is proportional to the corresponding frequency. In Example 3, the area between 10 and 30 (2 bars) equals 3/35 ≈ 8.6% of the total area of the histogram.

Ex5: The amounts of iron intake, in milligrams, during a 24-hour period for a sample of 30 females under the age of 51:

15.0 18.1 14.4 14.6 10.9 18.1 18.2 18.3 15.0 16.0 12.6 16.6 20.7 19.8 11.6 12.8 15.6 11.0 15.3 9.4 19.5 18.3 14.5 16.6 11.5 16.4 12.5 14.6 11.9 12.5

In that last example we may select groups of width 2, namely [9,11), [11,13), [13,15) and so on. We get 6 classes, an appropriate number for a data set of 30 observations.

Shapes of distributions: right-skewed distribution, left-skewed distribution, symmetric distribution.

2.3 Descriptive Measures of Center

Let Y be our numerical variable.

Median: $\tilde{y}$ = the middle of the ordered data. The position (location) of the median is $\frac{n+1}{2}$, where n = sample size.

Ex: Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19. Here 0.5(6+1) = 3.5 (the median is between observations #3 and #4), so $\tilde{y} = (10+11)/2 = 10.5$ lb.

If we add one more observation, 10 lb, the data become 1 2 10 10 11 13 19. Now 0.5(7+1) = 4 (the median is observation #4), so $\tilde{y} = 10$ lb.

The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

Mean (arithmetic mean): $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, where the $y_i$ are the observations in the sample. In our example (the original 6 lambs), $\bar{y} = 56/6 \approx 9.33$ lb.
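The position rule for the median and the formula for the mean can be checked on the lamb data with a short Python sketch. The helper names `median` and `mean` are illustrative; Python's standard `statistics` module offers equivalents, but the version below follows the (n+1)/2 position rule from the notes explicitly.

```python
def median(values):
    """Median located at position (n + 1) / 2 of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based location of the median
    if position.is_integer():
        return ordered[int(position) - 1]
    # Position falls between two observations: average them
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


def mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)


lambs = [1, 2, 10, 11, 13, 19]   # weight gain in lb
print(median(lambs))             # position 3.5, between observations #3 and #4
print(median(lambs + [10]))      # position 4 after adding one observation
print(round(mean(lambs), 2))     # 56 / 6
```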

The differences $(y_i - \bar{y})$ between each data point and the mean are called deviations from the mean. Their sum is $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ for any data set. In our example the sum of all deviations is −8.33 + (−7.33) + 0.67 + 1.67 + 3.67 + 9.67 = 0 (up to rounding).

The mean can be visualized as the balance point of a weightless seesaw with the data points (like children) sitting on it. Unlike the median, the mean is not robust: it is influenced by any change in the data, and very much by extremes. If the data have some extreme values, the median is a better measure of center for those data.

Mean vs. median: for symmetric distributions the mean and median are equal; if the distribution is left skewed, typically mean < median; if the distribution is right skewed, typically mean > median.

2.4 Boxplots.

Single-variable data may be summarized by 5 numbers: minimum, lower quartile, median, upper quartile and maximum, referred to as the five-number summary. These values are also used to make a boxplot. The lower quartile, denoted $Q_1$, is the median of the lower half of the data; the upper quartile, denoted $Q_3$, is the median of the upper half of the data.

Ex1: Systolic blood pressure (in mmHg) of 7 adult males: 151 124 132 170 146 124 113

We order the data first: 113 124 124 132 146 151 170

Min = 113, $Q_1$ = 124, Median = 132, $Q_3$ = 151, Max = 170. (The median is excluded when we compute the quartiles.)

The boxplot connects all 5 numbers in the following way; the box represents the middle half of the data.

[Figure: boxplot of systolic blood pressure, axis from 110 to 170 mmHg]
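A minimal Python sketch of the five-number summary follows, using the quartile convention stated above (quartiles are medians of the lower and upper halves, with the median itself left out when n is odd). The function names are illustrative. Other textbooks and software packages use slightly different quartile conventions, so library results may differ from these.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle observation (the median) itself
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0],
            median_of_sorted(lower_half),
            median_of_sorted(ordered),
            median_of_sorted(upper_half),
            ordered[-1])


blood_pressure = [151, 124, 132, 170, 146, 124, 113]  # mmHg, Ex1
print(five_number_summary(blood_pressure))
```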

Another measure we can compute is the interquartile range, $IQR = Q_3 - Q_1$. This measure gives the spread of the middle half of the data values. We can use it to find unusual data points (outliers). The procedure is as follows: compute lower fence $= Q_1 - 1.5 \cdot IQR$ and upper fence $= Q_3 + 1.5 \cdot IQR$. An outlier is a data point that falls outside the fences.

In our example: IQR = 151 − 124 = 27, 1.5(IQR) = 1.5 × 27 = 40.5, lower fence = 124 − 40.5 = 83.5, upper fence = 151 + 40.5 = 191.5. All observations are within the fences, so there are no outliers in our data set.

Ex2: Radish growth (in mm) in the light: 4 5 5 7 7 8 9 10 10 10 10 14 20 21

Min = 4, $Q_1$ = 7, Median = (9+10)/2 = 9.5, $Q_3$ = 10, Max = 21, IQR = 3, lower fence = 2.5, upper fence = 14.5, so 20 and 21 are outliers. A modified boxplot exposes the outliers.

[Figure: modified boxplot of radish growth with the two outliers marked by asterisks, axis from 5 to 25 mm]

2.5 Relationship between variables.

This section discusses various ways to compare two or more variables. Some methods include:

a) Two-way frequency and relative frequency tables, to examine the relationship between two categorical variables. They are useful for determining whether the variables are associated.

b) Scatter plots for numerical variables, to decide whether a linear trend is present, so that we can fit a regression line to the data.

c) Side-by-side boxplots, dotplots and stemplots, which are useful for seeing whether there are differences between two or more treatments.
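Returning to the outlier rule of Section 2.4, the fence computation can be sketched in Python as below. The function names are illustrative, and the quartiles are computed as medians of the lower and upper halves of the ordered data, as in the notes.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median_of_sorted(lower_half), median_of_sorted(upper_half)


def fences(values):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values):
    """Data points that fall outside the fences."""
    low, high = fences(values)
    return [v for v in values if v < low or v > high]


radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]  # mm, in the light
blood_pressure = [151, 124, 132, 170, 146, 124, 113]           # mmHg

print(fences(radishes), outliers(radishes))
print(fences(blood_pressure), outliers(blood_pressure))
```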

2.6 Measures of dispersion (variability)

Range = Maximum − Minimum. It gives the overall spread of the data and is easy to calculate, but it is very sensitive to extreme data values.

IQR, as stated before, gives the range of the middle half of the data and is a robust measure, not sensitive to extreme data values.

Sample standard deviation:

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$

It averages the squared deviations from the mean (dividing by n − 1). The square root is taken at the end, so the units of s are the same as the units of the data. $s \ge 0$, and s = 0 if all data points are the same. $s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in the formulas.

Ex. Experiment on chrysanthemums: a botanist measured stem elongation in 7 days (in mm): 76, 72, 65, 70, 82. Here n = 5 and $\bar{y} = 365/5 = 73$. The deviations from the mean are 3, −1, −8, −3, 9, and the squared deviations are 9, 1, 64, 9, 81, so

$s = \sqrt{(9+1+64+9+81)/4} = \sqrt{164/4} = \sqrt{41} \approx 6.40$ mm, and the variance is $s^2 = 41$ mm$^2$.

s gives the typical distance of the observations from the mean; a larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).

n − 1 = degrees of freedom of s. As an intuitive justification for using n − 1 rather than n, consider n = 1: the variability of a single observation cannot be computed, because one data point gives no information about variability.

The coefficient of variation is s expressed as a percentage of the mean:

coefficient of variation $= \frac{s}{\bar{y}} \cdot 100\%$

It has no units and can be used to compare data sets with different units, for example:

Ex. Weight and height are measured for girls at age 2. Which of the two measures has greater variability?

Weight: mean = 12.6 kg, SD = 1.4 kg
Height: mean = 86.6 cm, SD = 2.9 cm

The coefficient of variation is 11.1% for weight and 3.3% for height. We conclude that weight is more variable: its SD is a much larger percentage of the mean than for height.
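The standard deviation and coefficient of variation calculations above can be reproduced with a short Python sketch. The function names are illustrative; raising an error for n < 2 is a design choice that mirrors the remark that one data point gives no information about variability.

```python
import math


def sample_sd(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations to measure variability")
    ybar = sum(values) / n
    squared_deviations = [(y - ybar) ** 2 for y in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


def coefficient_of_variation(mean, sd):
    """SD expressed as a percentage of the mean (unit-free)."""
    return 100 * sd / mean


stems = [76, 72, 65, 70, 82]  # chrysanthemum stem elongation, mm
print(round(sample_sd(stems), 2))

# Girls at age 2: summary values taken from the example above
print(round(coefficient_of_variation(12.6, 1.4), 1))  # weight, kg
print(round(coefficient_of_variation(86.6, 2.9), 1))  # height, cm
```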

Typical Percentages: The Empirical Rule

For a nice distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:

about 68% of all data points within the interval $(\bar{y} - SD, \bar{y} + SD)$
about 95% of all data points within the interval $(\bar{y} - 2SD, \bar{y} + 2SD)$
more than 99% of all data points within the interval $(\bar{y} - 3SD, \bar{y} + 3SD)$

2.8. Statistical Inference

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample. For example, we can estimate the percentage of all people in England with blood type A as 44% (the sample proportion of people with that blood type). The sample must be considered a random sample from the entire population and must be representative of that population.

44% is a statistic (the sample proportion $\hat{p} = \frac{y}{n}$, "p hat") that estimates a parameter of the population (the population proportion p). There are also other statistics we can use to estimate a population proportion, namely $\tilde{p} = \frac{y+2}{n+4}$, "p tilde". In each case y = the number of people in the sample who have blood type A and n = the sample size. We will discuss these estimates in later chapters.

Other parameters of the population that we often estimate from samples are:

the population mean, μ, estimated by the sample mean, $\bar{y}$;
the population SD, σ, estimated by the sample SD, s.
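To make the two proportion estimates concrete, here is a small Python sketch. The notes do not give the counts behind the 44% figure, so the values y = 44 and n = 100 below are hypothetical, chosen only so that the sample proportion comes out to 44%.

```python
def p_hat(y, n):
    """Sample proportion: y successes out of n observations."""
    return y / n


def p_tilde(y, n):
    """Adjusted estimate from the notes: (y + 2) / (n + 4)."""
    return (y + 2) / (n + 4)


# Hypothetical counts (not from the notes): 44 of 100 people with blood type A
y, n = 44, 100
print(p_hat(y, n))
print(round(p_tilde(y, n), 4))
```

With these made-up counts the two estimates are close; the adjustment adds 2 to the count and 4 to the sample size, so it pulls the estimate slightly toward 0.5 and matters more when n is small.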