A Little Stats Won t Hurt You

Similar documents
unadjusted model for baseline cholesterol 22:31 Monday, April 19,

Descriptive Statistics-I. Dr Mahmoud Alhussami

Descriptive Data Summarization

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

Chapter 3. Measuring data

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

Handout 1: Predicting GPA from SAT

Chapter 8 (More on Assumptions for the Simple Linear Regression)

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Time Series Forecasting Methods

EXST7015: Estimating tree weights from other morphometric variables Raw data print

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

MATH 1150 Chapter 2 Notation and Terminology

Chapter 1: Exploring Data

Describing distributions with numbers

Comparison of a Population Means

STATISTICS 479 Exam II (100 points)

Chapter 3. Data Description

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

Chapter 2: Tools for Exploring Univariate Data

Lecture 2. Quantitative variables. There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data:

Performance of fourth-grade students on an agility test

Getting Correct Results from PROC REG

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

- a value calculated or derived from the data.

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Statistics I Chapter 2: Univariate data analysis

- measures the center of our distribution. In the case of a sample, it s given by: y i. y = where n = sample size.

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Describing distributions with numbers

Describing Distributions with Numbers

SESSION 5 Descriptive Statistics

CHAPTER 1. Introduction

Booklet of Code and Output for STAC32 Final Exam

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

Statistics I Chapter 2: Univariate data analysis

CHAPTER 2: Describing Distributions with Numbers

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

1. Exploratory Data Analysis

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- #

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Chapter 16. Nonparametric Tests

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3

An Introduction to the Analysis of Rare Events

Business Statistics. Lecture 10: Course Review

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives

Assessing Model Adequacy

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

Resistant Measure - A statistic that is not affected very much by extreme observations.

Week 7.1--IES 612-STA STA doc

In many situations, there is a non-parametric test that corresponds to the standard test, as described below:

Review of Statistics 101

Chapter 11. Analysis of Variance (One-Way)

20 Hypothesis Testing, Part I

Data Analysis and Statistical Methods Statistics 651

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3

Histograms allow a visual interpretation

SAS Commands. General Plan. Output. Construct scatterplot / interaction plot. Run full model

1.3.1 Measuring Center: The Mean

Section 2.3: One Quantitative Variable: Measures of Spread

Week 1: Intro to R and EDA

STAT 3A03 Applied Regression Analysis With SAS Fall 2017

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Topic 23: Diagnostics and Remedies

Chapter 3: Displaying and summarizing quantitative data p52 The pattern of variation of a variable is called its distribution.

3.1 Measure of Center

STAT 350. Assignment 4

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

Single Factor Experiments

Overview Scatter Plot Example

Chapter 13 Correlation

Exploratory data analysis: numerical summaries

Chapter 4. Displaying and Summarizing. Quantitative Data

5.3 Three-Stage Nested Design Example

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Notes 6. Basic Stats Procedures part II

Glossary for the Triola Statistics Series

Chapter 2: Descriptive Analysis and Presentation of Single- Variable Data

Power Analysis for One-Way ANOVA

= n 1. n 1. Measures of Variability. Sample Variance. Range. Sample Standard Deviation ( ) 2. Chapter 2 Slides. Maurice Geraghty

Transition Passage to Descriptive Statistics 28

TOPIC: Descriptive Statistics Single Variable

Chapter 1 - Lecture 3 Measures of Location

Elementary Statistics

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

Lecture 11. Data Description Estimation

Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size. Arrange the values from smallest to largest:

Topic 20: Single Factor Analysis of Variance

STATISTICS 1 REVISION NOTES

Introduction to Linear regression analysis. Part 2. Model comparisons

Chapter 6 Multiple Regression

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

Outline. Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I

Stat 101 Exam 1 Important Formulas and Concepts 1

Transcription:

A Little Stats Won t Hurt You Nate Derby Statis Pro Data Analytics Seattle, WA, USA Edmonton SAS Users Group, 11/13/09 Nate Derby A Little Stats Won t Hurt You 1 / 71

Outline Introduction 1 Introduction What Can Statistical Methods Tell Us? Exploratory Data Analysis (EDA) 2 3 PROC UNIVARIATE Decimal Places Hypothesis Tests Nate Derby A Little Stats Won t Hurt You 2 / 71

What Can Statistical Methods Tell Us? Exploratory Data Analysis (EDA) What Can Statistical Methods Tell Us? Statistics can describe data, extract information from them: Test hypotheses ( Is X correlated with Y? ). Extrapolate trends for forecasts ( What will X be in n weeks? ). Quantify what happened ( What is effect of X on Y? ). First step: Look at the data, look for interesting features. Meaning of interesting depends on context: Preliminary steps: Should this data point be included? More advanced stages: If we change the data, how does that change the results? If it looks interesting, it s probably wrong. Investigate to determine if really interesting or just wrong. Nate Derby A Little Stats Won t Hurt You 3 / 71

Exploratory Data Analysis (EDA) What Can Statistical Methods Tell Us? Exploratory Data Analysis (EDA) EDA = Looking at data to see what the data are telling us. Tells us which variables to look at, what models to use, etc. Prerequisites for more advanced methods: ANOVA (PROC ANOVA, PROC GLM) Linear/logistic regression (PROC REG, PROC LOGISTIC) ARIMA (PROC ARIMA) We will focus on univariate methods. Look at one variable at a time. How is that variable distributed (spread out) within our data set? Nate Derby A Little Stats Won t Hurt You 4 / 71

Two Methods Introduction What Can Statistical Methods Tell Us? Exploratory Data Analysis (EDA) Graphical techniques to quickly/easily see general trends. Keep it simple! (Like/unlike a dashboard) Statistical measures to summarize characteristics of distribution. Keep it short! Nate Derby A Little Stats Won t Hurt You 5 / 71

Example Data Sets Introduction What Can Statistical Methods Tell Us? Exploratory Data Analysis (EDA) Six sets of 50 measurements of systolic blood pressure (mmhg): 113 122 106 131 130 112 132 122 114 117 108 103 117 120 117 126 116 128 124 123 126 143 118 110 103 119 136 109 113 116 127 97 144 108 121 128 115 115 124 115 120 98 115 107 131 126 112 118 121 126 Named data1 - data6. Variables bp, group (values 1-6). Difficult to discern distribution information from the raw data! Too much information! Nate Derby A Little Stats Won t Hurt You 6 / 71

Scatterplot I Introduction First step: Look at one-dimensional scatterplots Basic Scatterplot (page 2) PROC GPLOT data=data1; PLOT group*bp; RUN; Nate Derby A Little Stats Won t Hurt You 7 / 71

group 1 90 100 110 120 130 140 150 bp Nate Derby A Little Stats Won t Hurt You 8 / 71

Scatterplot II Introduction Let s clean that up a bit: Custom-Formatted Scatterplot (page 2) PROC FREQ data=data1 noprint; TABLES group*bp / out=data1stats ( keep = group bp count ); RUN; goptions ftitle='times/bold' ftext='times'; * Use Times New Roman font; symbol1 c=red; * Plot the data points in red; axis1 label=( 'mmhg' ) order=( 90 to 150 by 10 ) minor=( number=4 ) value=( height=1.2 ); * Make the x-axis more readable; axis2 label=( '' ) value=( height=1.2 ); * Take off the label on the y-axis; title 'Systolic Blood Pressure'; * Give this graph a title; PROC GPLOT data=data1stats; PLOT group*bp=count / haxis=axis1 vaxis=axis2; RUN; Nate Derby A Little Stats Won t Hurt You 9 / 71

Systolic Blood Pressure 1 90 100 110 120 130 140 150 mmhg Frequency Count 1 2 3 4 Nate Derby A Little Stats Won t Hurt You 10 / 71

Bubble Plots Introduction Let s try one variation: Custom-Formatted Scatterplot (page 2) PROC FREQ data=data1 noprint; TABLES group*bp / out=data1stats ( keep = group bp count ); RUN; goptions ftitle='times/bold' ftext='times'; * Use Times New Roman font; symbol1 c=red; * Plot the data points in red; axis1 label=( 'mmhg' ) order=( 90 to 150 by 10 ) minor=( number=4 ) value=( height=1.2 ); * Make the x-axis more readable; axis2 label=( '' ) value=( height=1.2 ); * Take off the label on the y-axis; title 'Systolic Blood Pressure'; * Give this graph a title; PROC GPLOT data=data1stats; BUBBLE group*bp=count / haxis=axis1 vaxis=axis2 bcolor=red; RUN; Nate Derby A Little Stats Won t Hurt You 11 / 71

Systolic Blood Pressure 1 90 100 110 120 130 140 150 mmhg Nate Derby A Little Stats Won t Hurt You 12 / 71

: Conclusions Scatterplots: Gives us a effective look at the range, distribution, outliers. Difficult to differentiate multiple points in the same space. Bubble Plots: Represents multiple points in the same space by a bubble, size relational to # points. More effective way to look at range, distribution, outliers. Other than that, not very useful. Nate Derby A Little Stats Won t Hurt You 13 / 71

Histogram I Introduction Histogram: Chart of frequency/percentage counts for different ranges. Basic Histogram (page 3) PROC GPLOT data=data1; VBAR bp; RUN; Nate Derby A Little Stats Won t Hurt You 14 / 71

FREQUENCY 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 96 104 112 120 128 136 144 bp MIDPOINT Nate Derby A Little Stats Won t Hurt You 15 / 71

Histogram II Introduction Again, clean it up: Custom-Formatted Histogram (page 3) goptions ftitle='times/bold' ftext='times'; axis1 label=( 'Interval Midpoint (mmhg)' height=1.2 ) offset=( 4, 4 ) value=( height=1.2 ); axis2 label=( angle=90 height=1.2 'Frequency' ) order=( 0 to 20 by 5 ) minor=( number=4 ) value=( height=1.2 ); title 'Systolic Blood Pressure'; PROC GCHART DATA=data1; VBAR bp / maxis=axis1 raxis=axis2 width=4 space=2; RUN; Nate Derby A Little Stats Won t Hurt You 16 / 71

Systolic Blood Pressure 20 15 Frequency 10 5 0 96 104 112 120 128 136 144 Interval Midpoint (mmhg) Nate Derby A Little Stats Won t Hurt You 17 / 71

Histogram III Introduction Still not quite right want to make explicit ranges Custom-Formatted Histogram, Customized Ranges (page 4) axis1 label=( 'mmhg' ) value=( height=1.2 '92-100' '100-108' '108-116' '116-124' '124-132' '132-140' '140-148' ) offset=( 4, 4 ); axis2 label=( angle=90 height=1.2 'Frequency' ) order=( 0 to 20 by 5 ) minor=( number=4 ) value=( height=1.2 ); PROC GCHART DATA=data1; VBAR bp / maxis=axis1 raxis=axis2 width=4 space=2 midpoints = 96 to 144 by 8; RUN; Nate Derby A Little Stats Won t Hurt You 18 / 71

Systolic Blood Pressure 20 15 Frequency 10 5 0 92-100 100-108 108-116 116-124 124-132 132-140 140-148 mmhg Nate Derby A Little Stats Won t Hurt You 19 / 71

What Can We Deduce from Our Histogram? Data centered around 120 mmhg (easier with odd number of intervals), Bulk of data within 8-10 mmhg of 120 mmhg. Answers to questions: What is the central value of the data? How spread out are the data? Nate Derby A Little Stats Won t Hurt You 20 / 71

Central Value of Data What is the central value of the data? Mean: Arithmetic average, or x 1 + x 2 + + x n. n Median: Middle Value : 50% of data below this value. Mode: Value/category that is a maximum ( high point ) in the distribution. Relationship between median and mode can determine skewness: How lopsided the distribution is. Nate Derby A Little Stats Won t Hurt You 21 / 71

Skewness Introduction Distribution Distribution x left-skewed (left-tailed) mean < median x right-skewed (right-tailed) mean > median Nate Derby A Little Stats Won t Hurt You 22 / 71

Spread of Data Introduction How spread out are the data? Standard Deviation: Average distance from the mean. Minimum: Smallest value 25 th percentile: 25% of data lie below this value. 75 th percentile: 75% of data lie below this value. Interquartile Range: Difference between 75 th and 25 th percentiles. Maximum: Largest value. Note: Median = 50 th percentile. Nate Derby A Little Stats Won t Hurt You 23 / 71

Number of Intervals Introduction Beware of too many/few intervals! Histogram with xx intervals (page 6) axis1 label=( 'Interval Midpoint (mmhg)' height=1.2 ) value=( height=1.2 axis2 label=( angle=90 'Frequency' height=1.2 ) value=( height=1.2 ); title 'Systolic Blood Pressure, xx Intervals'; PROC GCHART DATA=data1; VBAR bp / maxis=axis1 raxis=axis2 levels = xx; RUN; Nate Derby A Little Stats Won t Hurt You 24 / 71

Systolic Blood Pressure, 2 Intervals 30 20 Frequency 10 0 105 135 Interval Midpoint (mmhg) Nate Derby A Little Stats Won t Hurt You 25 / 71

Systolic Blood Pressure, 15 Intervals 10 9 8 7 6 Frequency 5 4 3 2 1 0 92 96 100 104 108 112 116 120 124 128 132 136 140 144 148 Interval Midpoint (mmhg) Nate Derby A Little Stats Won t Hurt You 26 / 71

Number of Intervals Introduction Rule of Thumb: Number of Data Points Number of Intervals Under 50 5 to 7 50 to 100 6 to 10 100 to 250 7 to 12 over 250 10 to 20 Nate Derby A Little Stats Won t Hurt You 27 / 71

Histogram: Conclusions Gives us a rough estimate of distribution. Can be misleading from too few/many intervals! (Can be misleading even from shifting intervals!) Nate Derby A Little Stats Won t Hurt You 28 / 71

Box Plot I Introduction Box Plot: Graphical representation of six summary statistics: minimum, 25 th percentile, mean, 75 th percentile, maximum, mean. Nate Derby A Little Stats Won t Hurt You 29 / 71

Box Plot II Introduction Code: Similar to vertical-aligned PROC GPLOT Vertical-Aligned Scatterplot (page 8) goptions ftitle='times/bold' ftext='times'; symbol1 c=red; axis1 label=( angle=90 'mmhg' height=1.2 ) order=( 90 to 150 by 10 ) minor=( number=3 ) value=( height=1.2 ); axis2 label=( '' ) value=( height=1.2 ); title 'Systolic Blood Pressure'; PROC GPLOT data=data1; BUBBLE bp*group=count / vaxis=axis1 haxis=axis2; RUN; Nate Derby A Little Stats Won t Hurt You 30 / 71

Systolic Blood Pressure 150 140 130 mmhg 120 110 100 90 1 Nate Derby A Little Stats Won t Hurt You 31 / 71

Box Plot III Introduction Use PROC BOXPLOT instead of PROC GPLOT: Basic Box Plot (page 9) axis1 label=( 'mmhg' height=1.2 ) order=( 90 to 150 by 10 ) minor=( number=3 ) value=( height=1.2 ); symbol1; PROC BOXPLOT data=data1; PLOT bp*group / vaxis=axis1 haxis=axis2; RUN; Nate Derby A Little Stats Won t Hurt You 32 / 71

Systolic Blood Pressure 150 140 130 mmhg 120 110 100 90 1 Nate Derby A Little Stats Won t Hurt You 33 / 71

What Does the Box Plot Tell Us? Overall: Bulk of data between 108 mmhg and 132 mmhg, as in scatterplot. Now easier to see; 25 th and 75 th percentiles very clear! Mean > Median, so definitely right-skewed. Box plots more reliable than histograms for percentiles or skewness. Why? Because they incorporate summary statistics. Nate Derby A Little Stats Won t Hurt You 34 / 71

Box Plot IV Introduction Adding Summary Statistics to Box Plot (page 9) PROC UNIVARIATE noprint data=data1; VAR bp; BY group; OUTPUT min=min mean=mean q1=q1 median=med q3=q3 max=max out=stats; RUN; DATA anno1; SET stats; FORMAT function $8. text $50.; RETAIN when 'a'; function = 'label'; text = '' trim( left( put( min, 5 ) ) ) ', ' trim( left( put( q1, 5. ) ) ) ', ' trim( left( put( med, 5. ) ) ) ', ' trim( left( put( q3, 5. ) ) ) ', ' trim( left( put( max, 5. ) ) ) ''; position = '2';... OUTPUT; RUN; PROC BOXPLOT data=data1; PLOT bp*group / vaxis=axis1 haxis=axis2 annotate=anno1; RUN; Nate Derby A Little Stats Won t Hurt You 35 / 71

Systolic Blood Pressure 150 {97, 113, 118, 126, 144} 140 130 mmhg 120 110 100 90 1 Nate Derby A Little Stats Won t Hurt You 36 / 71

Comparing Multiple Data Sets Box plots can be used to compare multiple data sets. Comparing Multiple Data Sets with (page 10) DATA data123; SET data1 data2a data3; RUN; PROC UNIVARIATE data=data123 noprint; VAR bp; BY group; OUTPUT min=min mean=mean q1=q1 median=med q3=q3 max = max out=stats; RUN; DATA anno123; SET stats; FORMAT function $8. text $50.;... RUN; axis1 label=( 'mmhg' ) minor=( number=4 ) value=( height=1.2 ); axis2 label=( '' ) order=( 1 to 3 by 1 ) value=( height=1.2 'Group A' 'Group B' 'Group C' ) minor=none; PROC BOXPLOT data=data123; PLOT bp*group / vaxis=axis1 haxis=axis2 annotate=anno123; RUN; Nate Derby A Little Stats Won t Hurt You 37 / 71

Systolic Blood Pressure 200 {97, 113, 118, 126, 144} {0, 75, 111, 126, 153} {114, 118, 120, 121, 125} 150 mmhg 100 50 0 Group A Group B Group C Nate Derby A Little Stats Won t Hurt You 38 / 71

Exploratory Data Analysis Obvious irregularity in Group B! Minimum value is zero (?!) Data heavily skewed toward that as well (mean median!). Q: Does this make sense? (No!) Data error: 11 data points of zero. Re-run code after correcting this error. Nate Derby A Little Stats Won t Hurt You 39 / 71

Systolic Blood Pressure (Data Error Removed) 175 {97, 113, 118, 126, 144} {73, 109, 119, 129, 163} {114, 118, 120, 121, 125} 150 125 mmhg 100 75 50 Group A Group B Group C Nate Derby A Little Stats Won t Hurt You 40 / 71

Exploratory Data Analysis With corrected data, what can we see about the data for these three groups? All have about the same central values (median and mean). All have (very) different spreads: Highest for Group B. Lowest for Group C. Do these observations make sense? B composed of very diverse people? C composed of very similar people? Answering these questions leads to deeper analysis. Nate Derby A Little Stats Won t Hurt You 41 / 71

Different Introduction Try boxstyle=schematic option: with boxstyle=schematic (page 12) PROC BOXPLOT data=data456a; PLOT bp*group / vaxis=axis1 haxis=axis2 annotate=anno456a boxstyle=schematic; RUN; Boxes now denote outliers: Points below the 25 th percentile - 1.5 IQR. Points above the 75 th percentile + 1.5 IQR. Nate Derby A Little Stats Won t Hurt You 42 / 71

Systolic Blood Pressure 200 {97, 113, 118, 126, 144} {0, 75, 111, 126, 153} {114, 118, 120, 121, 125} 150 mmhg 100 50 0 Group A Group B Group C Nate Derby A Little Stats Won t Hurt You 43 / 71

Systolic Blood Pressure 200 {87, 103, 107, 113, 124} {106, 122, 137, 144, 192} {132, 143, 151, 159, 171} 175 150 mmhg 125 100 75 Group D Group E Group F Nate Derby A Little Stats Won t Hurt You 44 / 71

Exploratory Data Analysis What s happening in Group E? Data heavily skewed upward (mean > median). Q: Does this make sense? (Maybe) Six values at/above 180 mmhg. Assume rational explanation for them (e.g., six people have extreme risk of heart attack). Not data errors; they are outliers. Do we keep them or throw them out? (Open question!) What do we get if we throw them out? Nate Derby A Little Stats Won t Hurt You 45 / 71

Systolic Blood Pressure (Outliers Removed) 180 {87, 103, 107, 113, 124} {106, 120, 132, 138, 149} {132, 143, 151, 159, 171} 160 140 mmhg 120 100 80 Group D Group E Group F Nate Derby A Little Stats Won t Hurt You 46 / 71

Exploratory Data Analysis (without outliers) All three groups have about the same spreads. Three groups have different central values. Do these observations make sense? Outside realm of this paper. Nate Derby A Little Stats Won t Hurt You 47 / 71

With or Without Outliers? Do we cut outliers out of the analysis? Yes: Including them distorts the general tendencies we are looking for. No: They are valid data points, and thus part of those general tendencies. No simple answer, except that we can t just discard them. If taken out of the analysis, Mention them in sentence or footnote. Don t bury them in an end note or appendix. Show them in PROC BOXPLOT as isolated points (boxstyle=schematic). Nate Derby A Little Stats Won t Hurt You 48 / 71

Conclusions Introduction show tendencies more effectively/robustly than histograms. Can be used to easily compare data sets. Nate Derby A Little Stats Won t Hurt You 49 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Demystifying PROC UNIVARIATE What do all these numbers mean? Nate Derby A Little Stats Won t Hurt You 50 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Moments N 50 Sum Weights 50 Mean 118.84 Sum Observations 5942 Std Deviation 10.14861 Variance 102.994286 Skewness 0.19322593 Kurtosis 0.26180882 Uncorrected SS 711194 Corrected SS 5046.72 Coeff Variation 8.53972571 Std Error Mean 1.4352302 Basic Statistical Measures Location Variability Mean 118.8400 Std Deviation 10.14861 Median 118.0000 Variance 102.99429 Mode 115.0000 Range 47.00000 Interquartile Range 13.00000 NOTE: The mode displayed is the smallest of 2 modes with a count of 4. Nate Derby A Little Stats Won t Hurt You 51 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 82.80205 Pr > t <.0001 Sign M 25 Pr >= M <.0001 Signed Rank S 637.5 Pr >= S <.0001 Quantiles (Definition 5) Quantile Estimate 100% Max 144.0 99% 144.0 95% 136.0 90% 131.0 75% Q3 126.0 50% Median 118.0 25% Q1 113.0 10% 106.5 5% 103.0 1% 97.0 0% Min 97.0 Nate Derby A Little Stats Won t Hurt You 52 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Extreme Observations -----Lowest----- -----Highest---- Value Obs Value Obs 97 32 131 45 98 42 132 7 103 25 136 27 103 12 143 22 106 3 144 33 Nate Derby A Little Stats Won t Hurt You 53 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Demystifying PROC UNIVARIATE PROC UNIVARIATE provides summary statistics. Summary statistics are supposed to describe a data set with just a few numbers (data reduction). PROC UNIVARIATE is using 46 numbers to summarize 50 numbers. PROC UNIVARIATE is not an effective data summary... but it s not supposed to be! Nate Derby A Little Stats Won t Hurt You 54 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Demystifying PROC UNIVARIATE PROC UNIVARIATE has more information than we usually need. Not meant to be a data summary per se. Designed to be one-stop shop for anything we would ever want. Just because a statistic is listed does not mean we need to use it! For completeness, we ll go through everything here (starting from the bottom). Nate Derby A Little Stats Won t Hurt You 55 / 71

Output of PROC UNIVARIATE PROC UNIVARIATE Decimal Places Hypothesis Tests Extreme Observations = five lowest/highest observations, with observation numbers. Quantiles = percentiles. Definition 5 = define percentiles a certain way (not very important). Nate Derby A Little Stats Won t Hurt You 56 / 71

Output of PROC UNIVARIATE PROC UNIVARIATE Decimal Places Hypothesis Tests N = total number of data points. Sum Weights = sum of data weights (almost always N). Mean = sample mean. Sum Observations = sum of data observations: Equal to N Mean. N x i. Std Deviation = sample standard deviation (mean distance from mean): 1 N (x i x) 2. N 1 i=1 i=1 Nate Derby A Little Stats Won t Hurt You 57 / 71

Output of PROC UNIVARIATE PROC UNIVARIATE Decimal Places Hypothesis Tests Variance = sample variance (mean squared distance from mean): 1 N (x i x) 2. N 1 i=1 Equal to (Std Deviation) 2. Skewness = measure of how skewed our data distribution is (negative = left, positive = right). Kurtosis = measure of how peaked our data distribution is at the mode. Nate Derby A Little Stats Won t Hurt You 58 / 71

Output of PROC UNIVARIATE PROC UNIVARIATE Decimal Places Hypothesis Tests Fun Fact: N (x i x) 2 = i=1 N i=1 x 2 i N x 2. Uncorrected SS = uncorrected sum of squares: Used in Variance, Std Deviation. Corrected SS = corrected sum of squares: Used in Variance, Std Deviation. N i=1 x 2 i. N (x i x) 2. i=1 Nate Derby A Little Stats Won t Hurt You 59 / 71

Output of PROC UNIVARIATE PROC UNIVARIATE Decimal Places Hypothesis Tests Coeff Variation = coefficient of variation = scaled version of the spread: 100 Std Deviation. Mean Std Error Mean = standard error of the mean = standard deviation of the distribution of the mean x: Std Deviation N. Are we done? Nate Derby A Little Stats Won t Hurt You 60 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Moments N 50 Sum Weights 50 Mean 118.84 Sum Observations 5942 Std Deviation 10.14861 Variance 102.994286 Skewness 0.19322593 Kurtosis 0.26180882 Uncorrected SS 711194 Corrected SS 5046.72 Coeff Variation 8.53972571 Std Error Mean 1.4352302 Basic Statistical Measures Location Variability Mean 118.8400 Std Deviation 10.14861 Median 118.0000 Variance 102.99429 Mode 115.0000 Range 47.00000 Interquartile Range 13.00000 NOTE: The mode displayed is the smallest of 2 modes with a count of 4. Nate Derby A Little Stats Won t Hurt You 61 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 82.80205 Pr > t <.0001 Sign M 25 Pr >= M <.0001 Signed Rank S 637.5 Pr >= S <.0001 Quantiles (Definition 5) Quantile Estimate 100% Max 144.0 99% 144.0 95% 136.0 90% 131.0 75% Q3 126.0 50% Median 118.0 25% Q1 113.0 10% 106.5 5% 103.0 1% 97.0 0% Min 97.0 Nate Derby A Little Stats Won t Hurt You 62 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Extreme Observations -----Lowest----- -----Highest---- Value Obs Value Obs 97 32 131 45 98 42 132 7 103 25 136 27 103 12 143 22 106 3 144 33 Nate Derby A Little Stats Won t Hurt You 63 / 71

Output of PROC UNIVARIATE PROC UNIVARIATE Decimal Places Hypothesis Tests Are all those decimal places necessary? Std Deviation 10.14861 Variance 102.994286 Skewness 0.19322593 Kurtosis 0.26180882 Coeff Variation 8.53972571 Std Error Mean 1.4352302 Yes, if used as intermediate value. No, if final value. St Dev = 10.14861 St Dev = 10.1 Misleading accuracy, confusing! Proper accuracy, intuitive. Nate Derby A Little Stats Won t Hurt You 64 / 71

Population vs Sample PROC UNIVARIATE Decimal Places Hypothesis Tests Assumption: Data is a representative sample from a much larger population. Statistics from PROC UNIVARIATE are estimates of (unknown) population parameters. Ex: Sample mean of 118.84 = estimate of population mean. Unbiased sample good estimates. Biased sample lousy estimates (1936, 1948 election polls). Nate Derby A Little Stats Won t Hurt You 65 / 71

Hypothesis Tests Introduction PROC UNIVARIATE Decimal Places Hypothesis Tests A hypothesis test tests if population parameter significantly different from some value (usually 0). Ex: H 0 : Population mean = 0. Significantly different = account for spread of data. More spread = more volatile data = less reliable estimates. Is our estimate reliable enough? We try to reject H 0. Either reject or don t reject H 0. Do we have enough evidence to reject H 0? Never accept H 0. Nate Derby A Little Stats Won t Hurt You 66 / 71

Hypothesis Tests Introduction PROC UNIVARIATE Decimal Places Hypothesis Tests How do we reject H 0? 1 Calculate a test statistic from the data. 2 Compare test statistic to an assumed distribution. 3 Calculate p-value probability of test statistic, assuming H 0. 4 Reject H 0 if p-value really small (less than 0.05 = 5%). Assuming H 0, the probability of our result is really small. We should have gotten that result less than 5% of the time. Since we got that result, our assumption is probability wrong! Nate Derby A Little Stats Won t Hurt You 67 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Hypothesis Tests in PROC UNIVARIATE H 0 : Population Mean = 0 Three different tests for H 0, using different test statistics: Student's t= student s t test. Sign = sign test. Signed Rank = Wilcoxon signed rank test. Each test has different data assumptions. Which one to use? Often not important! Reject H 0 if all three p-values < 0.05. Nate Derby A Little Stats Won t Hurt You 68 / 71

PROC UNIVARIATE Decimal Places Hypothesis Tests Hypothesis Tests in PROC UNIVARIATE H 0 : Population Mean = 0 Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 82.80205 Pr > t <.0001 Sign M 25 Pr >= M <.0001 Signed Rank S 637.5 Pr >= S <.0001 Reject H 0. (Trivial, since all values are > 0) Nate Derby A Little Stats Won t Hurt You 69 / 71

Conclusions Conclusions Conclusions Further Resources This presentation shows most commonly used tools of exploratory data analysis. Methods: Data visualization and summarization. Simple, yet effective. They help us find data irregularities and interesting features. They give us guidance for further analysis, set stage for more complex methods. Nate Derby A Little Stats Won t Hurt You 70 / 71

Further Resources Conclusions Conclusions Further Resources Robert Adams, in SAS: UNIVARIATE, BOXPLOT or GPLOT? Proceedings of the 21st NESUG Conference, np16, 2008. Perry Watts, Using SAS Software to Generate Textbook Style, Proceedings of the 21st NESUG Conference, np03, 2008. Nate Derby: http://nderby.org nderby@sprodata.com Nate Derby A Little Stats Won t Hurt You 71 / 71