Statistics I Chapter 2: Univariate data analysis

Contents
- Graphical displays for categorical data (barchart, piechart)
- Graphical displays for numerical data (histogram, polygon, boxplot)
- Numerical measures to describe:
  - central tendency (mean, median, mode)
  - location (quartiles, percentiles)
  - variation (variance, standard deviation, quasi-variance and quasi-standard deviation, range, IQR, coefficient of variation)

Recommended reading
- Peña, D., Romo, J., Introducción a la Estadística para las Ciencias Sociales, Chapters 4, 5
- Newbold, P., Estadística para los Negocios y la Economía (2009), Chapter 2

Graphical presentation of data
Once we have a frequency distribution of the data, the following graphical displays can be obtained:
- Categorical data: piechart, barchart
- Numerical data: histogram, polygon, boxplot

Graphs for qualitative data: piechart
Example 1: The frequency table below shows the blood types reported for a sample of 40 individuals.

Class   Absolute frequency   Relative frequency
A       12                   0.300
B       11                   0.275
AB       8                   0.200
O        9                   0.225
Total   40                   1

Piechart
Example 1 cont.:
- Each slice is a fraction of the total size of the pie
- Many software packages order the slices alphabetically
- Although visually appealing, piecharts are harder to read than barcharts
- Avoid 3D piecharts: in those, an area in the background looks smaller than the same area in the foreground

[Figure: piechart of the blood types. A 30%, B 27.5%, O 22.5%, AB 20%]

Graphs for qualitative data: barchart
Example 2: The frequency table below shows the levels of satisfaction for 901 employees.

Class   Absolute    Relative    Cumulative absolute   Cumulative relative
        frequency   frequency   frequency             frequency
VU       62         0.07         62                   0.07
U       108         0.12        170                   0.19
S       319         0.35        489                   0.54
VS      412         0.46        901                   1
Total   901         1
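
The relative and cumulative columns of a frequency table follow mechanically from the absolute frequencies. The sketch below rebuilds the rows of Example 2; the function name `frequency_table` is an illustrative choice, not something from the slides.

```python
from itertools import accumulate

def frequency_table(counts):
    """Return rows (class, n_i, f_i, N_i, F_i) from a mapping class -> absolute frequency."""
    total = sum(counts.values())
    classes = list(counts)
    absolute = [counts[c] for c in classes]
    cumulative = list(accumulate(absolute))  # running totals N_i
    return [
        (c, n, n / total, N, N / total)
        for c, n, N in zip(classes, absolute, cumulative)
    ]

# Satisfaction levels from Example 2, kept in their natural order VU < U < S < VS.
satisfaction = {"VU": 62, "U": 108, "S": 319, "VS": 412}
rows = frequency_table(satisfaction)
for c, n, f, N, F in rows:
    print(f"{c:>2} {n:4d} {f:.2f} {N:4d} {F:.2f}")
```

Cumulative frequencies are only meaningful when the classes have a natural order, as the satisfaction levels do; for the blood types of Example 1 only the first two columns make sense.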

Barchart
Example 2 cont.:
- Bars have the same width and are equally spaced, with heights corresponding to the frequencies
- There are gaps between the bars
- Bars are labeled with the class names
- Many software packages order the bars alphabetically

[Figure: barchart of the satisfaction levels VU, U, S, VS; vertical axis FREQUENCY from 0 to 400]

Barchart
Barcharts can also be constructed for discrete data if there are not too many values.
This is a barchart for Example 3 of Ch. 1, where we looked at the number of leaves attacked by a pest for a sample of 50 plants.

[Figure: barchart of the number of attacked leaves (0 to 10); vertical axis FREQUENCY from 0 to 12]

Graphs for quantitative data: histogram and polygon
Example 4: The frequency distribution of the daily high temperature (in Fahrenheit) reported on 20 winter days is as follows ($n_i$, $f_i$: absolute and relative frequencies; $N_i$, $F_i$: cumulative absolute and relative frequencies):

Class interval   Midpoint   n_i   f_i    N_i   F_i
[10, 20)         15          3    0.15    3    0.15
[20, 30)         25          6    0.30    9    0.45
[30, 40)         35          5    0.25   14    0.70
[40, 50)         45          4    0.20   18    0.90
[50, 60)         55          2    0.10   20    1
Total                       20    1

Histogram and polygon
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies

[Figure: histogram of TEMP (F) with absolute frequencies on the vertical axis and the frequency polygon overlaid]

Histogram with area of 1 (on a density scale)
- Bin widths = widths of class intervals (not necessarily identical)
- Bin heights $= \dfrac{f_i}{l_i - l_{i-1}}$, where $l_{i-1}$ and $l_i$ are the boundaries of class $i$
- Bin areas $= f_i$
- TOTAL AREA = 1

[Figure: density-scale histogram of TEMP (F); vertical axis from 0.000 to 0.030]
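
As a rough check of the density scale, the sketch below divides each relative frequency of Example 4 by its class width and confirms that the bin areas add up to 1. It assumes that $l_{i-1}$ and $l_i$ denote the lower and upper boundaries of class $i$; the function name is illustrative.

```python
def density_heights(boundaries, rel_freqs):
    """Bin heights f_i / (l_i - l_{i-1}) for consecutive class boundaries."""
    widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
    return [f / w for f, w in zip(rel_freqs, widths)]

# Class boundaries and relative frequencies from the temperature example.
boundaries = [10, 20, 30, 40, 50, 60]
rel_freqs = [0.15, 0.30, 0.25, 0.20, 0.10]

heights = density_heights(boundaries, rel_freqs)
# Area of each bin is height * width; the areas sum to the total relative frequency.
total_area = sum(
    h * (hi - lo) for h, lo, hi in zip(heights, boundaries, boundaries[1:])
)
print(heights)
print(total_area)
```

The same function works for unequal class widths, which is the situation where the density scale matters most.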

Describing data numerically
- Center: mean, median, mode
- Location: quartiles, percentiles
- Variation: range, interquartile range, variance, standard deviation, coefficient of variation

New notation:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$$
($\sum$: sum, $i = 1$: the lower limit, $n$: the upper limit, $x_i$: example of a formula depending on $i$)

Example:
$$\sum_{i=-1}^{3} i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 15$$

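Central tendency: (arithmetic) mean
The most common measure of central tendency.

Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \dots + x_N}{N}$$

Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$

- If $a$, $b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $\bar{y} = a + b\bar{x}$
- Affected by extreme values (outliers)

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200
$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3, \qquad \bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6$$
A single outlier moves the mean from 3 to 42.6.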
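
The two properties of the mean stated above can be checked numerically. This is a small sketch using Python's standard `statistics` module; the constants `a = 10` and `b = 2` are arbitrary values chosen for illustration.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]   # same data with one extreme value

mean_x = mean(x)
mean_y = mean(y)
print(mean_x, mean_y)

# Linear transformation y = a + b*x (a and b are arbitrary illustrative constants).
a, b = 10, 2
transformed = [a + b * xi for xi in x]
mean_transformed = mean(transformed)
print(mean_transformed, a + b * mean_x)  # the two values agree
```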

Central tendency: median
In the ordered list, the median $M$ is the middle number:
$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd (the middle number)} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even (the average of the two middle numbers)} \end{cases}$$
($x_{(1)}, x_{(2)}, \dots, x_{(n)}$ means that the observations are ranked in increasing order, e.g. $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$)

- Not affected by outliers

Example: Given observations 3, 1, 5, 4, 2 ($n = 5$), first rank the data: 1, 2, 3, 4, 5. Then identify the middle number, the 3rd smallest:
$$M = x_{((5+1)/2)} = x_{(3)} = 3$$

Example: Given observations 3, 1, 5, 4, 2, 0 ($n = 6$), first rank the data: 0, 1, 2, 3, 4, 5. Then average the 3rd and 4th smallest:
$$M = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{2 + 3}{2} = 2.5$$

Central tendency: mode
- The value that occurs most often
- Not affected by outliers
- Used for either numerical or categorical data
- There may be no mode, and there may be several modes

Example: Given observations 3, 1, 5, 4, 2, there is no mode.
Example: Given observations 3, 1, 5, 4, 2, 1, the mode is 1.
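
A possible implementation of the mode that follows the convention used in the examples above: when every value occurs exactly once, it reports no mode rather than listing all values. The function name `modes` and the empty-list return value for "no mode" are choices made for this sketch.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        # Every value occurs once: no mode, following the examples' convention.
        return []
    return [v for v, c in counts.items() if c == top]

print(modes([3, 1, 5, 4, 2]))
print(modes([3, 1, 5, 4, 2, 1]))
print(modes(["A", "B", "A", "B", "O"]))  # categorical data, two modes
```

This differs from `statistics.multimode` in the standard library, which would return all five values in the first call.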

Shape: comparing mean and median
Three types of distributions:
- Skewed to the left: Mean < Median ($\bar{x} < M$)
- Symmetric: Mean = Median ($\bar{x} = M$)
- Skewed to the right: Median < Mean ($M < \bar{x}$)

[Figure: three density curves, labeled LEFT SKEWED, SYMMETRIC and RIGHT SKEWED]

Note: The distribution in the middle is known as bell-shaped or normal.

Quartiles and percentiles
Quartiles split the ranked data into four segments with an equal number of values per segment.
- The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
- The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$
- The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$

Example: Given observations 22, 18, 17, 16, 16, 13, 12, 21, 11 ($n = 9$), first rank the data: 11, 12, 13, 16, 16, 17, 18, 21, 22. Then identify the positions (a position ending in .5 is rounded up here):
$$Q_1 = x_{(2.5)} = x_{(3)} = 13, \qquad Q_2 = x_{(5)} = 16, \qquad Q_3 = x_{(7.5)} = x_{(8)} = 21$$

The $k$th percentile, $k = 1, 2, \dots, 99$, is $P_k = x_{(k(n+1)/100)}$.
Example cont.: 60th percentile $= x_{(60(9+1)/100)} = x_{(6)} = 17$
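
The position rule $k(n+1)/100$ can be sketched as follows. The treatment of fractional positions is an assumption: the example takes $x_{(7.5)} = x_{(8)}$, so this sketch rounds half-positions up; other textbooks and software packages interpolate between neighbours instead, and will give slightly different quartiles. The function names are illustrative.

```python
import math

def order_statistic(values, position):
    """Return x_(position) for a 1-based position, rounding halves up (assumed convention)."""
    ranked = sorted(values)
    k = math.floor(position + 0.5)       # round half up; Python's round() would round 2.5 to 2
    k = min(max(k, 1), len(ranked))      # keep the rank inside 1..n
    return ranked[k - 1]

def percentile(values, k):
    """k-th percentile using the position k(n+1)/100."""
    n = len(values)
    return order_statistic(values, k * (n + 1) / 100)

data = [22, 18, 17, 16, 16, 13, 12, 21, 11]
q1, q2, q3 = (percentile(data, k) for k in (25, 50, 75))
p60 = percentile(data, 60)
print(q1, q2, q3, p60)
```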

Variation: range and interquartile range (IQR)
Range is the simplest measure of variation:
$$R = x_{\max} - x_{\min}$$
- Ignores the way the data is distributed
- Sensitive to outliers

Example: Given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: Given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$

The interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50% of the data.
$$\text{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$

Variation: Interquartile range and boxplot
Outliers are observations that fall
- below the value of $Q_1 - 1.5 \times \text{IQR}$, or
- above the value of $Q_3 + 1.5 \times \text{IQR}$

For extreme outliers, replace 1.5 by 3 in the above definition.

[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) $= 31$, $Q_3 = 42$, $x_{\max} = 58$; each of the four segments holds 25% of the data; IQR = 18]
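
The outlier rule amounts to computing two fences from the quartiles. The sketch below uses the quartiles shown in the boxplot ($Q_1 = 24$, $Q_3 = 42$); the observation 75 in the usage lines is a made-up value added only to show a flagged point.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper outlier fences; use k=3 for extreme outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(values, q1, q3, k=1.5):
    """Observations falling outside the fences."""
    lower, upper = fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

print(fences(24, 42))         # ordinary outliers
print(fences(24, 42, k=3))    # extreme outliers
# 75 is a hypothetical extra observation, not part of the boxplot data.
print(outliers([12, 24, 31, 42, 58, 75], 24, 42))
```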

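Measure of variation: variance
Average of squared deviations of values from the mean.

Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample variance (divided by $n$):
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n}$$

Sample quasi-variance, or corrected sample variance (divided by $n - 1$):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1}$$

In both cases the second expression is faster to calculate.

They are related via
$$\hat{\sigma}^2 = \frac{n - 1}{n} s^2$$

If $a$, $b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$.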

Measure of variation: standard deviation (SD)
- The most commonly used measure of spread
- Shows variation about the mean
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are, respectively,
$$\sigma = \sqrt{\sigma^2}, \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2}, \qquad s = \sqrt{s^2}$$
- Has the same units as the original data, whilst variance is in units squared
- Variance and SD are both affected by outliers

Calculating variance and standard deviation
Example:
X: 11, 12, 13, 16, 16, 17, 18, 21
Y: 14, 15, 15, 15, 16, 16, 16, 17
Z: 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \bar{y} = \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$
$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$
$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429, \qquad s_x = 3.3381$$
$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571, \qquad s_y = 0.9258$$
$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571, \qquad s_z = 4.5670$$
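
The hand calculation above can be reproduced with the shortcut formula $(\sum x_i^2 - n\bar{x}^2)$ divided by $n$ or $n - 1$. This is a minimal sketch; the function names `sample_variance` and `quasi_variance` mirror the slides' terminology and are not library functions. The standard library's `statistics.pvariance` and `statistics.variance` divide by $n$ and $n - 1$ respectively and serve as a cross-check.

```python
import math
from statistics import pvariance, variance

def sample_variance(data):
    """Variance divided by n, via the shortcut formula."""
    n = len(data)
    m = sum(data) / n
    return (sum(x * x for x in data) - n * m * m) / n

def quasi_variance(data):
    """Corrected variance divided by n - 1, via the shortcut formula."""
    n = len(data)
    m = sum(data) / n
    return (sum(x * x for x in data) - n * m * m) / (n - 1)

X = [11, 12, 13, 16, 16, 17, 18, 21]
Y = [14, 15, 15, 15, 16, 16, 16, 17]
Z = [11, 11, 11, 12, 19, 20, 20, 20]

for name, data in (("X", X), ("Y", Y), ("Z", Z)):
    s2 = quasi_variance(data)
    print(name, round(s2, 4), round(math.sqrt(s2), 4))
```

The shortcut formula is convenient by hand, but in floating-point arithmetic it can lose precision when the values are large relative to their spread; the definition based on squared deviations is the safer choice in code.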

Comparing standard deviations
Example cont.:
X: 11, 12, 13, 16, 16, 17, 18, 21, with $\bar{x} = 15.5$ and $s_x = 3.3$
Y: 14, 15, 15, 15, 16, 16, 16, 17, with $\bar{y} = 15.5$ and $s_y = 0.9$
Z: 11, 11, 11, 12, 19, 20, 20, 20, with $\bar{z} = 15.5$ and $s_z = 4.6$

[Figure: three dot plots of X, Y and Z on a common axis from 11 to 21]

Numerical summaries and frequency tables. Standardization.
If the data is discrete, with $k$ distinct values $x_i$ and absolute frequencies $n_i$, then
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$
If the data is continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals.

To standardize a variable $x$ means to calculate
$$\frac{x - \bar{x}}{s}$$
If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed ones $z_1, \dots, z_n$, then the mean of the $z$'s is zero and their standard deviation is one.

Standardization = finding the z-score
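
Both ideas on this slide can be tried out in a few lines. The sketch below computes the mean and quasi-variance from a frequency table, using the temperature classes of Example 4 with their midpoints (the midpoint of [50, 60) is 55), and then standardizes the data set X from the variance example. Function names are illustrative.

```python
from statistics import mean, stdev

def grouped_mean(values, counts):
    """Mean from distinct values (or class midpoints) and their absolute frequencies."""
    n = sum(counts)
    return sum(x * c for x, c in zip(values, counts)) / n

def grouped_quasi_variance(values, counts):
    """Quasi-variance (divided by n - 1) from a frequency table."""
    n = sum(counts)
    m = grouped_mean(values, counts)
    return (sum(x * x * c for x, c in zip(values, counts)) - n * m * m) / (n - 1)

def standardize(data):
    """z-scores (x - mean) / s, with s the quasi-standard deviation."""
    m, s = mean(data), stdev(data)   # statistics.stdev divides by n - 1
    return [(x - m) / s for x in data]

midpoints = [15, 25, 35, 45, 55]
counts = [3, 6, 5, 4, 2]
temp_mean = grouped_mean(midpoints, counts)
temp_var = grouped_quasi_variance(midpoints, counts)
print(temp_mean, temp_var)

z = standardize([11, 12, 13, 16, 16, 17, 18, 21])
print(mean(z), stdev(z))   # approximately 0 and 1
```

For grouped continuous data the results are approximations, since every observation in a class is treated as if it sat at the midpoint.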

Empirical rule
If the data is bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:
- 68% of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95% of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7% of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: This rule is also known as the 68-95-99.7 rule.

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data is bell-shaped, give the limits of an interval that captures 95% of the observations.
95% of the $x_i$'s are in $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
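
A short sketch of the rule in code: the first function builds the interval $\bar{x} \pm ks$ from the example, and the second compares the rounded percentages 68, 95 and 99.7 with the coverage of an exact normal distribution, using `statistics.NormalDist` from the standard library (Python 3.8 or later). The function names are illustrative.

```python
from statistics import NormalDist

def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s)."""
    return mean - k * s, mean + k * s

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(empirical_interval(40, 5, 2))   # interval from the example
for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
```

The exact coverages are slightly different from the rounded figures in the rule, which is one reason to read the rule as an approximation.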

Measure of variation: coefficient of variation (CV)
Measures relative variation and is defined as
$$CV = \frac{s}{\bar{x}}$$
- Is a unitless number (sometimes given as a percentage)
- Shows variation relative to the mean

Example:
Stock A: Average price last year = 50, Standard deviation = 5
Stock B: Average price last year = 100, Standard deviation = 5
$$CV_A = \frac{5}{50} = 0.10, \qquad CV_B = \frac{5}{100} = 0.05$$
Both stocks have the same SD, but stock B is less variable relative to its mean price.
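
The stock comparison takes one line of arithmetic per stock; the sketch below wraps it in a small helper (the function name is illustrative) and guards against a zero mean, for which the ratio is undefined.

```python
def coefficient_of_variation(s, mean):
    """CV = s / mean; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return s / mean

cv_a = coefficient_of_variation(5, 50)    # stock A
cv_b = coefficient_of_variation(5, 100)   # stock B
print(cv_a, cv_b)
print(f"{cv_a:.0%} vs {cv_b:.0%}")        # the same values as percentages
```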