Stat 22000 Lecture Slides Exploring Numerical Data Yibi Huang Department of Statistics University of Chicago

Outline
These slides cover mostly Sections 1.2 and 1.6 of the text.
- Data and Types of Variables (1.2)
- Histograms (1.6.3)
- Mean and Median (1.6.2 & 1.6.6)
- Five-Number Summary and Box Plots (1.6.5)
- Standard Deviation (1.6.4 & 3.1.5)
- Scatterplots (1.6.1)
- Transforming Data (1.6.7)
Please skip 1.6.8 (Mapping data).

Data basics

Data: Cases & Variables
In a study, we collect information (data) from cases. Cases can be individuals, corporations, animals, or any objects of interest.
A variable is a characteristic of a case. A variable varies among cases, e.g., age, blood pressure, leaf length, first language.

Data matrix
Example: Data Set email50

      spam  num_char  line_breaks  format  number
1     no    21,705    551          html    small
2     no    7,011     183          html    big
3     yes   631       28           text    none
...
50    no    15,829    242          html    small

Each row corresponds to a case (an email), and each column contains the values of one variable for all cases.

Types of variables

all variables
- numerical
  - continuous
  - discrete
- categorical
  - nominal (unordered categorical)
  - ordinal (ordered categorical)

A variable is numerical when it can take a wide range of numerical values and it is sensible to perform arithmetic operations (addition, subtraction, averaging) on those values. Otherwise, it is categorical. For example, zip codes and area codes are NOT numerical variables.

A numerical variable is
- discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3, ...
- continuous if its possible values form an interval.

A categorical variable with ordered categories is ordinal.


Types of variables (cont.)
For the variables in the email50 data set:

spam ........... categorical
num_char ....... numerical
line_breaks .... numerical
format ......... categorical, nominal
number ......... categorical, ordinal

Sometimes an ordinal categorical variable can also be regarded as a numerical variable, e.g.,
- rating of a movie from 1 star to 5 stars
- stage of cancer from 0 to 4

Histograms

How to Make Histograms
Data: Infant mortality rates (number of deaths under one year of age per 1000 live births) of 201 countries/regions in 2010-2015 (Data file: infmort2015.txt):

      Country.Region  Continent  X2010.2015
1     Burundi         Africa     77.9
2     Comoros         Africa     58.1
3     Djibouti        Africa     55.3
...
200   Samoa           Oceania    19.7
201   Tonga           Oceania    20.4

Step 1: Divide the range of values into class intervals.
Step 2: Count the number of values in each class interval.

Interval  0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
Count     75    38     18     16     18     11     10     9      1      5

Source: https://en.wikipedia.org/wiki/List_of_countries_by_infant_mortality_rate
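Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the list `rates` is a small made-up sample (not the infmort2015 data), the helper name `bin_counts` is hypothetical, and the convention that a value on a boundary goes into the interval to its right is an assumption, since the slides do not say how boundary values are handled.

```python
# Made-up rates (deaths per 1000 live births); NOT the real infmort2015 data.
rates = [77.9, 58.1, 55.3, 19.7, 20.4, 3.2, 9.9, 10.0, 45.0]

def bin_counts(values, width, lo=0.0):
    """Count values in the intervals [lo, lo+width), [lo+width, lo+2*width), ...

    Assumption: a value equal to a boundary is counted in the interval
    that starts at that boundary (left-closed intervals).
    """
    counts = {}
    for v in values:
        k = int((v - lo) // width)      # index of the interval containing v
        left = lo + k * width           # left end of that interval
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

width = 10
counts = bin_counts(rates, width)
for left, c in counts.items():
    print(f"{left:.0f}-{left + width:.0f}: {c}")
```

Intervals that contain no values are simply absent from the result. Changing `width` is the quickest way to experiment with different bin widths.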

How to Make Histograms (Cont'd)

Interval  0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
Count     75    38     18     16     18     11     10     9      1      5

Step 3: Draw the histogram.
- No space between bars.
- Label the horizontal axis (with units)!

[Figure: histogram of infant mortality rates; horizontal axis: deaths per 1000 live births, vertical axis: count]

Histograms
- Histograms provide a view of the data density. Higher bars indicate regions with more observations.
- Histograms are especially convenient for describing the shape of the data distribution.
- The choice of bin width can alter the shape of the histogram.

[Figure: histogram of infant mortality rates; horizontal axis: deaths per 1000 live births, vertical axis: count]

Try Different Binwidth
Which one(s) of these histograms reveal too much about the data? Which hide too much?

[Figure: four histograms of deaths per 1000 births, with binwidth = 25, 10, 8, and 3]

Selection of Binwidth
It is an iterative process: try and try again.
What bin width should you use?
- Not so small that most bins have counts of 0 or 1
- Not so big that you lose the details within a bin
- (There may not be a unique perfect bin size.)
General rule: the more observations, the more bins.

What to Look for in a Histogram?
- Shape: symmetric or skewed (lopsided); number of modes (peaks)
- Outliers: Are there any observations that lie outside the overall pattern? They could be unusual observations, or they could be mistakes. Check them!
- Center: Where is the middle of the histogram? Typically represented by the mean and median.
- Spread: What is the range of the data? Typically represented by the SD and IQR (introduced soon).

Skewness of Histograms
[Figure: three histograms: right skewed, symmetric/bell shaped, left skewed]

Mode of Histograms (= Number of Peaks)
[Figure: four histograms: unimodal, bimodal, trimodal, multimodal]
A histogram with two or more modes may indicate that the data is a mixture of two or more distinct populations.

Example (Infant Mortality Rates)
In addition to the major peak near 0, there appears to be a secondary peak around 40-50.

[Figure: histogram of infant mortality rates; horizontal axis: deaths per 1000 births, vertical axis: count]

Side-by-Side Histograms: Infant Mortality Rate Data
If countries are grouped by continent and histograms are made separately on the same horizontal axis, we can compare the infant mortality rates of countries in the 5 continents by the location of the histograms. The rates were
- uniformly low in Europe
- much higher and with greater variability in Africa.
This explains why the histogram for the whole world is bimodal.

[Figure: histograms of infant mortality rates for Africa, Oceania, Asia, America, and Europe on a common horizontal axis; horizontal axis: deaths per 1000 live births, vertical axis: count]

Outliers
Another thing to look for in a histogram is whether there are any unusual observations or potential outliers.

[Figure: two histograms containing potential outliers]

Practice: Shapes of Distributions
Which graph is which, and why? Match the following variables with the histograms and bar graphs given below. Suppose the data represent STAT 220 students. [Hint: Think about how each variable should behave.]
(a) the height of students
(b) gender breakdown of students
(c) the time it took students to get to their first class of the day
(d) the number of hours of sleep students received last night
(e) whether or not students live off campus
(f) the number of piercings students have

[Figure: six histograms and bar graphs labeled (1) to (6)]

Mean and median

Mean
The mean of a numerical variable is computed as the sum of all of the observations divided by the number of observations:

$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} $$

where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.

Example. Suppose a variable has 5 observed values: 4, 8, 3, 5, 13. The mean of the variable is given by:

$$ \bar{x} = \frac{4 + 8 + 3 + 5 + 13}{5} = \frac{33}{5} = 6.6. $$
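As a quick check of the arithmetic, the same mean can be computed in Python. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from statistics import mean

values = [4, 8, 3, 5, 13]

# Sum of all observations divided by the number of observations.
x_bar = sum(values) / len(values)

# The standard library gives the same result.
x_bar_lib = mean(values)

print(x_bar, x_bar_lib)
```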

Median
The median of a numerical variable is a number such that half of the observed values are smaller than it and half are larger than it.

Ex 1: Suppose a variable has 5 observed values: 4, 8, 3, 5, 13.
  data:   4 8 3 5 13
  sorted: 3 4 5 8 13
The median is 5.

Ex 2: Suppose a variable has 6 observed values: 4, 8, 3, 5, 13, 12.
  data:   4 8 3 5 13 12
  sorted: 3 4 5 8 12 13
The median is thus $\frac{5 + 8}{2} = 6.5$.
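The two cases (odd and even number of observations) can be written as a short Python function. This is a sketch of the sort-and-take-the-middle procedure used in the two examples; the function name `median_of` is hypothetical.

```python
def median_of(values):
    """Median by sorting: the middle value if n is odd,
    the average of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

ex1 = median_of([4, 8, 3, 5, 13])        # 5 observations
ex2 = median_of([4, 8, 3, 5, 13, 12])    # 6 observations
print(ex1, ex2)
```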


Mean vs. Median
- In a symmetric distribution, mean ≈ median. If the distribution is exactly symmetric, then mean = median.
- In a skewed distribution, the mean is pulled toward the longer tail.
- The median is a measure of center that is resistant to skew and outliers. The mean is not.

[Figure: positions of the mean and median for a symmetric distribution, a left-skewed distribution, and a right-skewed distribution]

Example (Infant Mortality Rates)
Which of the 5 histograms are symmetric? Which are skewed? How do their means compare with their medians?

Continent  Mean   Median
Africa     53.22  53.30
Oceania    21.77  19.70
Asia       21.49  15.30
America    16.20  14.55
Europe     5.00   3.75

[Figure: histograms of infant mortality rates by continent; horizontal axis: deaths per 1000 live births, vertical axis: count; red solid line: mean, blue dashed line: median]

Robustness of the Median
Consider the data with six observations: 2, 1, 0, 0, 2, 4. If the number 2 in the data is wrongly recorded as 20,
- the mean is increased by $\frac{20 - 2}{6} = 3$;
- the median is unaffected.
The median is more resistant, i.e., less sensitive to extreme values or outliers, than the mean. We say the median is more robust.

Example: Housing sales price in Hyde Park

Period         Mean      Median
Jun-Aug 2011   $525,384  $227,000
Jun-Aug 2013   $423,528  $291,750
May-Aug 2017   $259,542  $226,750

Source: http://www.trulia.com/home_prices/illinois/chicago-heat_map/
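The effect of the recording error can be checked numerically. The sketch below uses the six observations exactly as printed on the slide and replaces the first 2 with 20; which of the two 2s is replaced is a choice made here and does not change the result.

```python
from statistics import mean, median

data = [2, 1, 0, 0, 2, 4]
wrong = [20, 1, 0, 0, 2, 4]     # first observation misrecorded as 20

mean_shift = mean(wrong) - mean(data)        # (20 - 2) / 6
median_shift = median(wrong) - median(data)

print("mean:  ", mean(data), "->", mean(wrong))
print("median:", median(data), "->", median(wrong))
```

The mean moves by 3 while the median stays where it was, which is what "robust" means here.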

Five Number Summary

Quartiles, IQR, Five-Number Summary
Quartiles divide the data into 4 even parts:
- first quartile $Q_1$ = 25th percentile: 25% of the data fall below it and 75% above it
- second quartile $Q_2$ = median = 50th percentile
- third quartile $Q_3$ = 75th percentile: 75% of the data fall below it and 25% above it
Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$
Five-number summary: min, $Q_1$, median, $Q_3$, max


Example 1
For the 9 numbers: 43, 35, 43, 33, 38, 53, 64, 27, 34

  sorted: 27 33 34 35 | 38 | 43 43 53 64

- overall median: $Q_2 = 38$
- median of the lower half (27, 33, 34, 35): $Q_1 = \frac{33 + 34}{2} = 33.5$
- median of the upper half (43, 43, 53, 64): $Q_3 = \frac{43 + 53}{2} = 48$

$\mathrm{IQR} = Q_3 - Q_1 = 48 - 33.5 = 14.5$
Five-number summary: 27, 33.5, 38, 48, 64


Example 2
For the 10 numbers: 43, 35, 43, 33, 38, 53, 64, 27, 34, 27

  sorted: 27 27 33 34 35 | 38 43 43 53 64

- overall median: $Q_2 = \frac{35 + 38}{2} = 36.5$
- median of the lower half (27, 27, 33, 34, 35): $Q_1 = 33$
- median of the upper half (38, 43, 43, 53, 64): $Q_3 = 43$

$\mathrm{IQR} = Q_3 - Q_1 = 43 - 33 = 10$
Five-number summary: 27, 33, 36.5, 43, 64
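The hand computation in Examples 1 and 2 (split the sorted data at the median, then take the median of each half) can be sketched in Python as follows. The function names are hypothetical, and leaving the middle value out of both halves when the number of observations is odd follows what Example 1 does; other conventions exist.

```python
def median_sorted(s):
    """Median of an already sorted list."""
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def five_number(values):
    """min, Q1, median, Q3, max using the median-of-halves method.
    For odd n, the middle value is excluded from both halves."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 == 1 else s[half:]
    return (s[0], median_sorted(lower), median_sorted(s),
            median_sorted(upper), s[-1])

ex1 = five_number([43, 35, 43, 33, 38, 53, 64, 27, 34])
ex2 = five_number([43, 35, 43, 33, 38, 53, 64, 27, 34, 27])
iqr1 = ex1[3] - ex1[1]
iqr2 = ex2[3] - ex2[1]
print(ex1, iqr1)
print(ex2, iqr2)
```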

Calculation of Quartiles (1)
In fact, statisticians don't have a consensus on the calculation of quartiles. There are several formulas for quartiles, varying from book to book and from software to software.
E.g., for the 9 numbers in Example 1, our computation gives $Q_1 = 33.5$, $Q_3 = 48$, but R gives $Q_1 = 34$, $Q_3 = 43$.

  > x = c(43,35,43,33,38,53,64,27,34)
  > summary(x)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    27.00   34.00   38.00   41.11   43.00   64.00
  > fivenum(x)
  [1] 27 34 38 43 64
  > IQR(x)
  [1] 9

Calculation of Quartiles (2)
Sometimes even different commands in R give different quartiles. E.g., for the 10 numbers in Example 2,

  > y = c(43,35,43,33,38,53,64,27,34,27)
  > summary(y)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    27.00   33.25   36.50   39.70   43.00   64.00
  > fivenum(y)
  [1] 27.0 33.0 36.5 43.0 64.0
  > IQR(y)
  [1] 9.75

Don't worry about the formula. Just keep in mind that quartiles divide the data into 4 even parts. In HWs, just report whatever values your software gives.
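The same disagreement between conventions shows up outside R. As an illustration only, Python's standard library function `statistics.quantiles` (Python 3.8 or later) offers two methods, `"inclusive"` and `"exclusive"`. For the data of Example 1, the inclusive method agrees with the values R's `summary` reports above, and the exclusive method happens to agree with the hand computation; this coincidence is specific to this data set and should not be read as a general equivalence.

```python
from statistics import quantiles

x = [43, 35, 43, 33, 38, 53, 64, 27, 34]          # Example 1
y = [43, 35, 43, 33, 38, 53, 64, 27, 34, 27]      # Example 2

# n=4 asks for the three cut points Q1, Q2, Q3.
x_incl = quantiles(x, n=4, method="inclusive")
x_excl = quantiles(x, n=4, method="exclusive")
y_incl = quantiles(y, n=4, method="inclusive")

print("x inclusive:", x_incl)
print("x exclusive:", x_excl)
print("y inclusive:", y_incl)
```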

Boxplots

1.5 × IQR Rule for Identifying Potential Outliers
The 1.5 × IQR rule tags an observation as a potential outlier if it lies more than 1.5 × IQR below $Q_1$ or above $Q_3$.
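A minimal Python sketch of the rule, taking the quartiles as given inputs. The function names are hypothetical; the quartiles 33.5 and 48 are the hand-computed values from Example 1, and the list `extra` is made up purely to show values on both sides of the cutoffs.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def potential_outliers(values, q1, q3):
    """Values lying more than 1.5*IQR below Q1 or above Q3."""
    lo, hi = iqr_fences(q1, q3)
    return [v for v in values if v < lo or v > hi]

x = [43, 35, 43, 33, 38, 53, 64, 27, 34]   # Example 1
fences = iqr_fences(33.5, 48)              # quartiles from Example 1
flagged = potential_outliers(x, 33.5, 48)

extra = [5, 40, 100]                       # made-up values for illustration
flagged_extra = potential_outliers(extra, 33.5, 48)

print(fences, flagged, flagged_extra)
```

None of the 9 numbers in Example 1 is tagged, while the made-up values 5 and 100 fall outside the cutoffs.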

Box-and-Whiskers Plot (also called Boxplot)
[Figure: box-and-whiskers plot]

Think About It...
- What does a boxplot look like if the distribution is symmetric?
- Ditto, if right-skewed?
- Can you tell from a boxplot whether the distribution is unimodal or bimodal?

Side by Side Boxplots
Just like histograms, boxplots of related distributions are often placed side-by-side for comparison.

[Figure: side-by-side boxplots and dotplots of infant mortality rates by continent (Africa, Oceania, Asia, America, Europe); vertical axis: deaths per 1000 live births]

Standard Deviation

Standard Deviation
Another common way to describe how spread out the observations are is the standard deviation (SD).
To understand how the SD works, let's use a small data set {1, 2, 2, 7} as an example.
Each of these numbers deviates from the mean $\frac{1+2+2+7}{4} = 3$ by some amount:

$$ 1 - 3 = -2, \quad 2 - 3 = -1, \quad 2 - 3 = -1, \quad 7 - 3 = 4. $$

How should we measure the overall size of these deviations?
Taking their mean isn't going to tell us anything (why not?)

Standard Deviation (Cont'd)
One sensible way is to take the average of their absolute values:

$$ \frac{|-2| + |-1| + |-1| + |4|}{4} = \frac{2 + 1 + 1 + 4}{4} = 2 $$

This is called the mean absolute deviation (MAD), not the SD.
But for a variety of reasons, statisticians prefer using the root-mean-square as a measure of overall size:

$$ \sqrt{\frac{(-2)^2 + (-1)^2 + (-1)^2 + 4^2}{4}} \approx 2.35 $$

but this is still not the (sample) SD.

Standard Deviation (Cont'd)
The formula for the (sample) standard deviation (SD) is

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

Wait a minute; why divide by $n - 1$? Not $n$?
The reason (which we will discuss further in a few weeks) is that dividing by $n$ turns out to underestimate the true (population) standard deviation. Dividing by $n - 1$ instead of $n$ corrects some of that bias.
The standard deviation of {1, 2, 2, 7} is

$$ \sqrt{\frac{(-2)^2 + (-1)^2 + (-1)^2 + 4^2}{4 - 1}} \approx 2.71 $$

(recall we get 2.35 when dividing by $n = 4$).
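The three measures computed for {1, 2, 2, 7} (mean absolute deviation, root-mean-square deviation, and sample SD) can be reproduced with a short Python sketch. The variable names are chosen here for illustration; `statistics.stdev` is the standard library's sample SD (divisor n − 1) and is included only as a cross-check.

```python
from math import sqrt
from statistics import stdev

data = [1, 2, 2, 7]
n = len(data)
x_bar = sum(data) / n                         # mean = 3
devs = [x - x_bar for x in data]              # -2, -1, -1, 4

mad = sum(abs(d) for d in devs) / n           # mean absolute deviation
rms = sqrt(sum(d * d for d in devs) / n)      # divide by n
sd = sqrt(sum(d * d for d in devs) / (n - 1)) # divide by n - 1 (sample SD)

print(sum(devs))          # the deviations themselves always sum to 0
print(mad, round(rms, 2), round(sd, 2))
print(round(stdev(data), 2))
```

The first printed line also answers the "why not?" question: the plain mean of the deviations is always 0.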

Variance
The square of the (sample) standard deviation is called the (sample) variance, denoted as

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

which is roughly the average squared deviation from the mean.

Meaning of the Standard Deviation
The standard deviation (SD) describes how far away numbers in a list are from their average.
The SD is often used as a plus-or-minus number, as in "Adult women tend to be about 5'4", plus or minus 3 inches."

The 68% and 95% Rule (Section 3.1.5 in Text)
- Roughly 68% of the observations will be within 1 SD of the mean.
- Roughly 95% will be within 2 SDs of the mean.

[Figure: bell-shaped curve with 68% of the area between Ave − one SD and Ave + one SD, and 95% between Ave − two SDs and Ave + two SDs]

The 68% and 95% rules work very well for bell-shaped data, and reasonably well for unimodal and not seriously skewed data, but not for all data.

The 68% and 95% Rule (Section 3.1.5 in Text)

Continent  Mean  SD    Proportion within 1 SD of the mean
Africa     53.2  23.4  39/57 ≈ 68%
Oceania    21.8  15.2  8/13 ≈ 62%
Asia       21.5  17.7  35/51 ≈ 69%
America    16.2  9.7   29/40 ≈ 72%
Europe     5.0   3.0   31/40 ≈ 78%

[Figure: histograms of infant mortality rates by continent; horizontal axis: deaths per 1000 live births, vertical axis: count]

The 68% and 95% Rule (Section 3.1.5 in Text)

Continent  Mean  SD    Proportion within 2 SDs of the mean
Africa     53.2  23.4  55/57 ≈ 96%
Oceania    21.8  15.2  13/13 = 100%
Asia       21.5  17.7  49/51 ≈ 96%
America    16.2  9.7   38/40 = 95%
Europe     5.0   3.0   39/40 = 97.5%

[Figure: histograms of infant mortality rates by continent; horizontal axis: deaths per 1000 live births, vertical axis: count]
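The proportions in these tables come from counting how many observations lie within k SDs of the mean. The sketch below shows that counting on the toy data set {1, 2, 2, 7} from the standard deviation slides, not on the infant mortality data. The function name `proportion_within` is hypothetical, and with only four observations the 68% and 95% figures are not expected to hold.

```python
from statistics import mean, stdev

def proportion_within(values, k):
    """Fraction of observations within k sample SDs of the mean."""
    m = mean(values)
    s = stdev(values)                 # sample SD (divisor n - 1)
    inside = [v for v in values if abs(v - m) <= k * s]
    return len(inside) / len(values)

data = [1, 2, 2, 7]                   # mean 3, SD about 2.71
p1 = proportion_within(data, 1)       # 7 is more than 1 SD above the mean
p2 = proportion_within(data, 2)
print(p1, p2)
```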

Properties of Standard Deviation (SD)
- The SD measures spread about the mean and should be used only when the mean is the measure of center.
- When SD = 0, what do the observations look like? What if IQR = 0? What if SD < 0?
- The SD is very sensitive to outliers.
- The SD has the same units of measurement as the original observations, while the variance is in the square of these units.


Recap: Common Numerical Summaries of Numerical Variables
When comparing histograms, we often compare their center and spread.
Common measures of center:
- Mean
- Median
Common measures of spread:
- Range: max − min
- Standard deviation (SD)
- Interquartile range (IQR)
Measures of center and spread are important summaries of a distribution. But they don't tell us about modality, skewness, or whether there are outliers. Always check the histogram!

[Figure: histograms of infant mortality rates by continent; horizontal axis: deaths per 1000 live births, vertical axis: count]

Recap: Common Graphical Displays for Numerical Variables
- Histograms: A histogram breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class.
- Dotplots: See Section 1.6.2.
  Pros: Shows each individual data point.
  Cons: Hard to read even for moderately big data.
- Box-plots

Scatterplots

Example (Diamonds Data)
Diamonds Data: records of 53940 diamonds, collected from http://www.diamondse.info/ in 2008.

       carat  cut        color  clarity  depth  table  price  x     y     z
1      0.23   Ideal      E      SI2      61.5   55     326    3.95  3.98  2.43
2      0.21   Premium    E      SI1      59.8   61     326    3.89  3.84  2.31
3      0.23   Good       E      VS1      56.9   65     327    4.05  4.07  2.31
4      0.29   Premium    I      VS2      62.4   58     334    4.20  4.23  2.63
5      0.31   Good       J      SI2      63.3   58     335    4.34  4.35  2.75
6      0.24   Very Good  J      VVS2     62.8   57     336    3.94  3.96  2.48
...
53940  0.75   Ideal      D      SI2      62.2   55     2757   5.83  5.87  3.64

See Lab 02 for R code: http://www.stat.uchicago.edu/yibi/s220/labs/lab02.html

Example (Diamonds Data)
The diamonds data contains 10 variables, including
- price: price in US dollars
- carat: weight of the diamond
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond color, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

Scatter Plot of Prices and Carats of Diamonds
What patterns do you observe from the scatter plot?

[Figure: scatter plot of diamonds; horizontal axis: carat, vertical axis: price in US dollars]

Forms of Relationship
[Figure: three scatter plots: linear relationship, nonlinear relationship, no relationship]

Positive and Negative Relation
- Positive association: High values of one variable tend to occur together with high values of the other variable.
- Negative association: High values of one variable tend to occur together with low values of the other variable.

[Figure: three scatter plots: positive linear, negative linear, positive nonlinear]

Strength of the Relationship
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

[Figure: two scatter plots of Y against X, each comparing the range of Y when X is fixed with the range of Y when X is unknown. Weak association: large spread of Y when X is known. Strong association: small spread of Y when X is known.]

Clusters & Outliers
- Sometimes points in a scatter plot form clusters, which may indicate that the data comprise several distinct kinds of individuals.
- Scatter plots can also be used to spot outliers.


Recap: What to Look For in a Scatterplot?
- What is the form of the relationship? (linear, curved, clustered, ...)
- What is the direction of the relationship? (positive, negative)
- What is the strength of the relationship? (strong, weak, ...)
- Are the points clustered?
- Are there any deviations from the overall pattern, i.e., outliers?
In fact, the list above is not exhaustive. Any unusual pattern is worth investigating.

Example: Scatter Plot of Prices and Carats of Diamonds
What do you observe from the scatter plot below?

[Figure: scatter plot of diamonds; horizontal axis: carat, vertical axis: price in US dollars]

Weights of diamonds cluster at some benchmark carats (0.3, 0.5, 0.7, 0.9, 1, 1.2, 1.5, 2, ...), e.g.,
- 1981 diamonds weigh 0.7 carat, 1294 weigh 0.71 carat, but only 26 weigh 0.69 carat
- 1558 diamonds weigh 1 carat, 2242 weigh 1.01 carat, but only 23 weigh 0.99 carat
- 793 diamonds weigh 1.5 carat, 807 weigh 1.51 carat, but only 11 weigh 1.49 carat

[Figure: histogram of carat; horizontal axis: carat, vertical axis: count. The histogram is truncated at 2.2 carat; the x axis in fact extends to 5 carat.]

Making Scatterplots More Informative
The scatterplot shows the relationship between two variables, weight and price, but there are many other attributes in the data. E.g., we might want to see how the quality of the cut, the color, or the clarity affects the price.
We can make the scatterplot more informative by using the color of the points to represent, e.g., the clarity of the diamonds.

[Figure: scatter plot of price in US dollars against carat, with points colored by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]

For diamonds of the same weight, the price increases with the clarity of the diamond.

Another way to communicate information about a 3rd (categorical) variable in a scatter plot is to divide the plot up into multiple plots, one for each level, letting you view them separately. Such a plot is shown in the next slide.

[Figure: scatter plots of price in US dollars against carat, one panel per clarity level (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), with points colored by cut (Fair, Good, Very Good, Premium, Ideal)]

Transforming Data

Scatterplot with Both X-Y Variables Transformed
Scatterplot when both carat and price are log-transformed.

[Figure: scatter plot of price in US dollars against carat, both axes on a log scale]

Scatter Plot of Prices and Carats of Diamonds
What's the advantage of transformation (for the diamond data)?
- After log-transformation, the relationship between log(price) and log(carat) becomes a simple linear relationship, and the variability of log(price) stays about the same across carat.
- Before transformation, points are too tightly clustered near 0 to reveal the structure of the distribution there.
- After transformation, the scatterplot shows a clear gap in price around $1500.
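One way to see why taking logs of both variables can straighten a curved relationship: if price were exactly proportional to a power of carat, then log(price) would be an exact linear function of log(carat). The Python sketch below illustrates this with made-up numbers. The constant 3000 and the exponent 1.7 are assumptions chosen for illustration, not estimates from the diamonds data, and the helper `least_squares` is a hypothetical name for an ordinary least-squares line fit.

```python
from math import log

# Hypothetical power-law data: price = 3000 * carat**1.7 (NOT fitted values).
carat = [0.25, 0.5, 1.0, 2.0, 4.0]
price = [3000 * c ** 1.7 for c in carat]

def least_squares(xs, ys):
    """Slope and intercept of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

log_carat = [log(c) for c in carat]
log_price = [log(p) for p in price]

slope, intercept = least_squares(log_carat, log_price)
print(slope, intercept)   # slope is the exponent, intercept is log(3000)
```

On the log scale the points fall on a straight line whose slope is the exponent of the power law. Real data such as the diamonds prices will scatter around such a line rather than lie on it.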

Histogram of Log-Transformed Data
The prices for diamonds are heavily skewed, but on a log scale seem much better behaved. In fact, we can see some evidence of bimodality on the log scale and the gap at $1500.

[Figure: two histograms of price: one on the original scale, one on a log scale; vertical axis: count]

Transformation
- Sometimes, only one variable in the scatterplot is transformed, instead of both.
- Other commonly used transformations besides log: square root, cube root, inverse, ...
Why transform data?
- Highly skewed data are easier to model when they are transformed, because outliers tend to become far less prominent after an appropriate transformation.
- Variables may exhibit a simpler relationship after transformation.
- Sometimes, in a scatter plot or a histogram, when observations are too tightly clustered near 0 to look at the pattern there, transformation might help.
However, the transformed variable might be meaningless, making the result of the analysis hard to interpret.