Descriptive statistics

Similar documents
The normal distribution

The normal distribution

Stat Lecture Slides Exploring Numerical Data. Yibi Huang Department of Statistics University of Chicago

STAT 200 Chapter 1 Looking at Data - Distributions

One-sample categorical data: approximate inference

Chapter 1: Exploring Data

Chapter 4. Displaying and Summarizing. Quantitative Data

Statistics and parameters

Chapter 2 Solutions Page 15 of 28

Units. Exploratory Data Analysis. Variables. Student Data

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22

Describing distributions with numbers

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

Elementary Statistics

Clinical Research Module: Biostatistics

MATH 1150 Chapter 2 Notation and Terminology

Chapter 2 Class Notes Sample & Population Descriptions Classifying variables

Chapter 2: Tools for Exploring Univariate Data

Chapter 1. Looking at Data

Chapter 3. Measuring data

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Describing Distributions

The t-distribution. Patrick Breheny. October 13. z tests The χ 2 -distribution The t-distribution Summary

Describing Distributions With Numbers Chapter 12

The Central Limit Theorem

The Central Limit Theorem

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

A is one of the categories into which qualitative data can be classified.

Chapter 3: The Normal Distributions

Describing distributions with numbers

CHAPTER 1. Introduction

Describing Distributions with Numbers

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions

STT 315 This lecture is based on Chapter 2 of the textbook.

Topic 3: Introduction to Statistics. Algebra 1. Collecting Data. Table of Contents. Categorical or Quantitative? What is the Study of Statistics?!

Two-sample Categorical data: Testing

Week 1: Intro to R and EDA

CHAPTER 2 Description of Samples and Populations

CHAPTER 2: Describing Distributions with Numbers

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

Instrumentation (cont.) Statistics vs. Parameters. Descriptive Statistics. Types of Numerical Data

Math 140 Introductory Statistics

Math 140 Introductory Statistics

M 140 Test 1 B Name (1 point) SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 75

SUMMARIZING MEASURED DATA. Gaia Maselli

BIOS 2041: Introduction to Statistical Methods

Chapter 6. The Standard Deviation as a Ruler and the Normal Model 1 /67

Chapter2 Description of samples and populations. 2.1 Introduction.

Chapter 1: Introduction. Material from Devore s book (Ed 8), and Cengagebrain.com

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Chapter 6 The Standard Deviation as a Ruler and the Normal Model

Essentials of Statistics and Probability

Lecture Notes 2: Variables and graphics

Continuous random variables

Describing Distributions With Numbers

Descriptive Statistics-I. Dr Mahmoud Alhussami

Chapter 7: Statistics Describing Data. Chapter 7: Statistics Describing Data 1 / 27

Chapter. Numerically Summarizing Data. Copyright 2013, 2010 and 2007 Pearson Education, Inc.

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

Summarising numerical data

Full file at

8/4/2009. Describing Data with Graphs

Francine s bone density is 1.45 standard deviations below the mean hip bone density for 25-year-old women of 956 grams/cm 2.

Descriptive Univariate Statistics and Bivariate Correlation

P8130: Biostatistical Methods I

Lecture 1 : Basic Statistical Measures

Chapter 4.notebook. August 30, 2017

Study and research skills 2009 Duncan Golicher. and Adrian Newton. Last draft 11/24/2008

200 participants [EUR] ( =60) 200 = 30% i.e. nearly a third of the phone bills are greater than 75 EUR

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data

BNG 495 Capstone Design. Descriptive Statistics

Finding Quartiles. . Q1 is the median of the lower half of the data. Q3 is the median of the upper half of the data

Biostatistics for biomedical profession. BIMM34 Karin Källen & Linda Hartman November-December 2015

Section 5.4. Ken Ueda

CHAPTER 4 VARIABILITY ANALYSES. Chapter 3 introduced the mode, median, and mean as tools for summarizing the

Performance of fourth-grade students on an agility test

ACMS Statistics for Life Sciences. Chapter 13: Sampling Distributions

Statistics 528: Homework 2 Solutions

Module 1. Identify parts of an expression using vocabulary such as term, equation, inequality

Descriptive Statistics

Descriptive Data Summarization

Statistics in medicine

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

11. The Normal distributions

Lecture 1: Descriptive Statistics

Analytical Graphing. lets start with the best graph ever made

In this investigation you will use the statistics skills that you learned the to display and analyze a cup of peanut M&Ms.

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

- a value calculated or derived from the data.

a table or a graph or an equation.

The empirical ( ) rule

Resistant Measure - A statistic that is not affected very much by extreme observations.

Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

MATH 10 INTRODUCTORY STATISTICS

Martin L. Lesser, PhD Biostatistics Unit Feinstein Institute for Medical Research North Shore-LIJ Health System

Lecture 6: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

Analytical Graphing. lets start with the best graph ever made

Unit 1 Summarizing Data

Transcription:

Patrick Breheny February 6 Patrick Breheny to Biostatistics (171:161) 1/25

Tables and figures Human beings are not good at sifting through large streams of data; we understand data much better when it is summarized for us We often display summary statistics in one of two ways: tables and figures Tables of summary statistics are very common (we have already seen several in this course) nearly all published studies in medicine and public health contain a table of basic summary statistics describing their sample However, figures are usually better than tables in terms of distilling clear trends from large amounts of information Patrick Breheny to Biostatistics (171:161) 2/25

Types of data The best way to summarize and present data depends on the type of data There are two main types of data: Categorical data: Data that takes on distinct values (i.e., it falls into categories), such as sex (male/female), alive/dead, blood type (A/B/AB/O), stages of cancer Continuous data: Data that takes on a spectrum of fractional values, such as time, age, temperature, cholesterol levels The distinction between categorical (also called discrete) and continuous data is fundamental and we will return to it throughout the course Patrick Breheny to Biostatistics (171:161) 3/25

Categorical data Summarizing categorical data is pretty straightforward you just count how many times each category occurs Instead of counts, we are often interested in percents A percent is a special type of rate, a rate per hundred Counts (also called frequencies), percents, and rates are the three basic summary statistics for categorical data, and are often displayed in tables or bar charts, as we saw in lab Patrick Breheny to Biostatistics (171:161) 4/25

Continuous data For continuous data, instead of a finite number of categories, observations can take on a potentially infinite number of values Summarizing continuous data is therefore much less straightforward To introduce concepts for describing and summarizing continuous data, we will look at data on infant mortality rates for 111 nations on three continents: Africa, Asia, and Europe Patrick Breheny to Biostatistics (171:161) 5/25

One very useful way of looking at continuous data is with histograms To make a histogram, we divide a continuous axis into equally spaced intervals, then count and plot the number of observations that fall into each interval This allows us to see how our data points are distributed Patrick Breheny to Biostatistics (171:161) 6/25

Histogram of European infant mortality rates Europe Count 25 2 15 1 5 1 8 6 4 2 12 1 8 6 4 2 Asia Africa 5 1 15 2 Deaths per 1, births Patrick Breheny to Biostatistics (171:161) 7/25

Summarizing continuous data As we can see, continuous data comes in a variety of shapes Nothing can replace seeing the picture, but if we had to summarize our data using just one or two numbers, how should we go about doing it? The aspect of the histogram we are usually most interested in is, Where is its center? This is typically represented by the average Patrick Breheny to Biostatistics (171:161) 8/25

The average and the histogram The average represents the center of mass of the histogram: Europe Count 25 2 15 1 5 1 8 6 4 2 12 1 8 6 4 2 Asia Africa 5 1 15 2 Deaths per 1, births Patrick Breheny to Biostatistics (171:161) 9/25

Spread The second most important bit of information from the histogram to summarize is, How spread out are the observations around the center? This is most typically represented by the standard deviation To understand how standard deviation works, let s return to our small example with the numbers {4, 5, 1, 9} Each of these numbers deviates from the mean by some amount: 4 4.75 =.75 5 4.75 =.25 1 4.75 = 3.75 9 4.75 = 4.25 How should we measure the overall size of these deviations? Patrick Breheny to Biostatistics (171:161) 1/25

Root-mean-square Taking their mean isn t going to tell us anything (why not?) We could take the average of their absolute values:.75 +.25 + 3.75 + 4.25 4 = 2.25 But it turns out that for a variety of reasons, the root-mean-square works better as a measure of overall size: (.75) 2 + (.25) 2 + ( 3.75) 2 + (4.25) 2 2.86 4 Patrick Breheny to Biostatistics (171:161) 11/25

The standard deviation The formula for the standard deviation is n i=1 s = (x i x) 2 n 1 Wait a minute; why n 1? The reason (which we will discuss further in a few weeks) is that dividing by n turns out to underestimate the true standard deviation Dividing by n 1 instead of n corrects some of that bias The standard deviation of {4, 5, 1, 9} is 3.3 (recall that we got 2.86 if we divide by n) Patrick Breheny to Biostatistics (171:161) 12/25

Meaning of the standard deviation The standard deviation (SD) describes how far away numbers in a list are from their average The SD is often used as a plus or minus number, as in adult women tend to be about 5 4, plus or minus 3 inches Most numbers (roughly 68%) will be within 1 SD away from the average Very few entries (roughly 5%) will be more than 2 SD away from the average This rule of thumb works very well for a wide variety of data; we ll discuss where these numbers come from in a few weeks Patrick Breheny to Biostatistics (171:161) 13/25

Standard deviation and the histogram Background areas within 1 SD of the mean are shaded: 25 2 15 1 5 Europe Count 1 8 6 4 2 5 1 15 2 Asia 12 1 8 6 4 2 5 1 15 2 Africa 5 1 15 2 Deaths per 1, births Patrick Breheny to Biostatistics (171:161) 14/25

The 68%/95% rule in action % of observations within Continent One SD Two SDs Europe 78 97 Asia 67 97 Africa 63 95 Patrick Breheny to Biostatistics (171:161) 15/25

Summaries can be misleading! All of the following have the same mean and standard deviation: 4 2 2 4 4 2 2 4 4 2 2 4 4 2 2 4 Patrick Breheny to Biostatistics (171:161) 16/25

The average and standard deviation are not the only ways to summarize continuous data Another type of summary is the percentile A number is the 25th percentile of a list of numbers if it is bigger than 25% of the numbers in the list The 5th percentile is given a special name: the median The median, like the mean, can be used to answer the question, Where is the center of the histogram? Patrick Breheny to Biostatistics (171:161) 17/25

Median vs. mean The dotted line is the median, the solid line is the mean: 15 Europe 1 5 Count 6 4 2 1 2 3 4 Asia 12 1 8 6 4 2 5 1 15 Africa 5 1 15 2 Deaths per 1, births Patrick Breheny to Biostatistics (171:161) 18/25

Skew Note that the histogram for Europe is not symmetric: the tail of the distribution extends further to the right than it does to the left Such distributions are called skewed The distribution of infant mortality rates in Europe is said to be right skewed or skewed to the right For asymmetric/skewed data, the mean and the median will be different Patrick Breheny to Biostatistics (171:161) 19/25

Hypothetical example Azerbaijan had the highest infant mortality rate in Europe at 37 What if, instead of 37, it was 2? Mean Median Real 14.1 11 Hypothetical 19.2 11 The mean is now higher than 72% of the countries Note that the average is sensitive to extreme values, while the median is not; statisticians say that the median is robust to the presence of outlying observations Patrick Breheny to Biostatistics (171:161) 2/25

Five number summary The mean and standard deviation are a common way of providing a two-number summary of a distribution of continuous values Another approach, based on quantiles, is to provide a five-number summary consisting of: (1) the minimum, (2) the first quartile, (3) the median, (4) the third quartile, and (5) the maximum Europe Asia Africa Min 5 4 2 First quartile 7 32 83 Median 11 52.5 1 Third quartile 2 83 123 Max 37 165 191 Patrick Breheny to Biostatistics (171:161) 21/25

Box plots Quantiles are used in a type of graphical summary called a box plot Box plots are constructed as follows: Calculate the three quartiles (the 25th, 5th, and 75th) Draw a box bounded by the first and third quartiles and with a line in the middle for the median Call any observation that is extremely far from the box an outlier and plot the observations using a special symbol (this is somewhat arbitrary and different rules exist for defining outliers) Draw a line from the top of the box to the highest observation that is not an outlier; likewise for the lowest non-outlier Patrick Breheny to Biostatistics (171:161) 22/25

Box plots of the infant mortality rate data Deaths per 1, births 15 1 5 Africa Asia Europe Patrick Breheny to Biostatistics (171:161) 23/25

Box plots and bar charts In lab, we saw that bar charts provide an effective way of comparing two (or more) categorical variables (e.g., survival and sex) Box plots provide a way to examine the relationship between a continuous variable and a categorical variable (e.g., infant mortality and continent) Next week, we will discuss how to summarize and illustrate the relationship between two continuous variables Patrick Breheny to Biostatistics (171:161) 24/25

Raw data is complex and needs to be summarized; typically, these summaries are displayed in tables and figures Tables are useful for looking up information, but figures tend to be superior for illustrating trends in the data measures for categorical variables: counts, percents, rates measures for continuous variables: mean, standard deviation, quantiles Ways to display continuous data: histogram, box plot Patrick Breheny to Biostatistics (171:161) 25/25