Chapter 4 - Summarizing Numerical Data

15.075 Cynthia Rudin

Here are some ways we can summarize data numerically.

Sample mean:
$$\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Note: in this class we will work with both the population mean $\mu$ and the sample mean $\bar{x}$. Do not confuse them! Remember, $\bar{x}$ is the mean of a sample taken from the population and $\mu$ is the mean of the whole population.

Sample median: order the data values $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, so then
$$\text{median} := \tilde{x} := \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right] & n \text{ even} \end{cases}$$

Mean and median can be very different: 1, 2, 3, 4, 500 (here 500 is an outlier). The median is more robust to outliers.

Quantiles/Percentiles: Order the sample, then find $\tilde{x}_p$ so that it divides the data into two parts where:
a fraction $p$ of the data values are less than or equal to $\tilde{x}_p$, and
the remaining fraction $(1 - p)$ are greater than $\tilde{x}_p$.
That value $\tilde{x}_p$ is the $p$th quantile, or $100p$th percentile.

5-number summary: $\{x_{\min}, Q_1, Q_2, Q_3, x_{\max}\}$, where $Q_1 = \theta_{.25}$, $Q_2 = \theta_{.5}$, $Q_3 = \theta_{.75}$, with $\theta_p$ denoting the $p$th quantile.

Range: $x_{\max} - x_{\min}$, measures dispersion.

Interquartile Range: $IQR := Q_3 - Q_1$, a range resistant to outliers.
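As a small illustration of the mean, median, and 5-number summary above, here is a minimal Python sketch applied to the notes' data 1, 2, 3, 4, 500. The function names are hypothetical. The quartiles use the "inclusive" interpolation rule of Python's `statistics.quantiles`, which is only one of several common conventions and is an assumption here, not necessarily the one used in the course.

```python
import statistics


def sample_mean(xs):
    """Sample mean: sum of the values divided by n."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Sample median using the odd/even rule on the ordered values."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # x_((n+1)/2) in 1-based notation
    return (s[mid - 1] + s[mid]) / 2   # average of x_(n/2) and x_(n/2+1)


def five_number_summary(xs):
    """(x_min, Q1, Q2, Q3, x_max).

    Assumption: quartiles follow the 'inclusive' interpolation convention.
    """
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return (min(xs), q1, q2, q3, max(xs))


data = [1, 2, 3, 4, 500]
mean = sample_mean(data)              # pulled far upward by the outlier 500
median = sample_median(data)          # stays in the middle of the bulk of the data
summary = five_number_summary(data)
data_range = max(data) - min(data)    # sensitive to the outlier
iqr = summary[3] - summary[1]         # resistant to the outlier

print(mean, median, summary, data_range, iqr)
```

The single outlier moves the mean and the range a great deal, while the median and the IQR barely notice it.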

Sample variance $s^2$ and sample standard deviation $s$:
$$s^2 := \underbrace{\frac{1}{n-1}}_{\text{see why later}} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Remember, for a large sample from a normal distribution, approximately 95% of the sample falls in $[\bar{x} - 2s, \bar{x} + 2s]$.
Do not confuse $s^2$ with $\sigma^2$, which is the variance of the population.

Coefficient of variation:
$$CV := \frac{s}{\bar{x}},$$
dispersion relative to the size of the mean.

z-score:
$$z_i := \frac{x_i - \bar{x}}{s}.$$
It tells you where a data point lies in the distribution, that is, how many standard deviations above/below the mean. E.g. $z_i = 3$ where the distribution is $N(0, 1)$.
It allows you to compute percentiles easily using the z-scores table, or a command on the computer.

Now some graphical techniques for describing data.

Bar chart/pie chart - good for summarizing data within categories

Pareto chart - a bar chart where the bars are sorted
Histogram
Boxplot and normplot
Scatterplot for bivariate data
Q-Q plot for 2 independent samples
Hans Rosling
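Returning to the numerical measures of spread defined earlier (sample variance, standard deviation, coefficient of variation, and z-scores), the following is a minimal Python sketch. The data values and function names are made up for illustration and are not from the course.

```python
import math
import statistics


def sample_variance(xs):
    """s^2 with the 1/(n-1) factor."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def z_scores(xs):
    """z_i = (x_i - xbar) / s for every data point."""
    xbar = sum(xs) / len(xs)
    s = math.sqrt(sample_variance(xs))
    return [(x - xbar) / s for x in xs]


# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)            # 32 / 7, since the squared deviations sum to 32
s = math.sqrt(s2)
cv = s / (sum(data) / len(data))      # dispersion relative to the mean
z = z_scores(data)

print(s2, s, cv, z)
```

The hand-written variance agrees with `statistics.variance`, which also uses the $n - 1$ divisor.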

Chapter 4.4: Summarizing bivariate data

Two Way Table

Here's an example:

                     Respiratory Problem?
                     yes     no      row total
    smokers          25      25      50
    non-smokers       5      45      50
    column total     30      70      100

Question: If this example is from a study with 50 smokers and 50 non-smokers, is it meaningful to conclude that in the general population:
a) 25/30 = 83% of people with respiratory problems are smokers?
b) 25/50 = 50% of smokers have respiratory problems?

Simpson's Paradox

Deals with aggregating smaller datasets into larger ones.
Simpson's paradox is when conclusions drawn from the smaller datasets are the opposite of conclusions drawn from the larger dataset.
Occurs when there is a lurking variable and uneven-sized groups being combined.

E.g. Kidney stone treatment (Source: Wikipedia)

Which treatment is more effective?

    Treatment A        Treatment B
    78% (273/350)      83% (289/350)

Including information about stone size, now which treatment is more effective?

                    Treatment A                 Treatment B
    small stones    group 1: 93% (81/87)        group 2: 87% (234/270)
    large stones    group 3: 73% (192/263)      group 4: 69% (55/80)
    both            78% (273/350)               83% (289/350)

What happened!?
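The kidney stone counts above can be checked directly. This is a minimal Python sketch (the dictionary layout and function name are illustrative choices) that computes the success rate within each stone-size group and after pooling the groups.

```python
# (successes, patients) for each treatment and stone size, taken from the table above.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

# Success rate inside each (treatment, stone size) group.
group_rate = {key: succ / total for key, (succ, total) in counts.items()}


def pooled_rate(treatment):
    """Success rate for one treatment after combining both stone sizes."""
    succ = sum(s for (t, _), (s, _n) in counts.items() if t == treatment)
    total = sum(n for (t, _), (_s, n) in counts.items() if t == treatment)
    return succ / total


for size in ("small", "large"):
    print(size, round(group_rate[("A", size)], 3), round(group_rate[("B", size)], 3))
print("both", round(pooled_rate("A"), 3), round(pooled_rate("B"), 3))
```

Treatment A has the higher rate within each stone size, yet Treatment B has the higher pooled rate. Note how unevenly the patients are split: most of A's patients had large stones, while most of B's had small stones.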

Continuing with bivariate data:

Correlation Coefficient - measures the strength of a linear relationship between two variables:
$$\text{sample correlation coefficient} = r := \frac{S_{xy}}{S_x S_y},$$
where
$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
This is also called the Pearson Correlation Coefficient.

If we rewrite
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \cdot \frac{(y_i - \bar{y})}{S_y},$$
you can see that $\frac{(x_i - \bar{x})}{S_x}$ and $\frac{(y_i - \bar{y})}{S_y}$ are the z-scores of $x_i$ and $y_i$.

$r \in [-1, 1]$ and is $\pm 1$ only when the data fall along a straight line.
$\text{sign}(r)$ indicates the slope of the line (do the $y_i$'s increase as the $x_i$'s increase?).
Always plot the data before computing $r$ to ensure it is meaningful.
Correlation does not imply causation, it only implies association (there may be lurking variables that are not recognized or controlled).
For example: There is a correlation between declining health and increasing wealth.

Linear regression (in Ch 10):
$$\frac{y - \bar{y}}{S_y} = r \, \frac{x - \bar{x}}{S_x}.$$
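To make the z-score form of $r$ concrete, here is a minimal Python sketch that computes $r$ as the sum of products of z-scores divided by $n - 1$, and then reads the slope and intercept off the regression relation above. The data and function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Sample standard deviation with the 1/(n-1) factor."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def pearson_r(xs, ys):
    """r = 1/(n-1) * sum of (z-score of x_i) * (z-score of y_i)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def regression_line(xs, ys):
    """Slope and intercept implied by (y - ybar)/S_y = r (x - xbar)/S_x."""
    slope = pearson_r(xs, ys) * sample_sd(ys) / sample_sd(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r = pearson_r(xs, ys)
slope, intercept = regression_line(xs, ys)
print(r, slope, intercept)
```

For data lying exactly on a line, such as `ys = [2 * x + 1 for x in xs]`, the same functions give $r = 1$ up to rounding and recover the line's slope and intercept.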

Chapter 4.5: Summarizing time-series data

Moving averages. Calculate the average over a window of previous timepoints:
$$MA_t = \frac{x_{t-w+1} + \cdots + x_t}{w},$$
where $w$ is the size of the window. Note that we make the window $w$ smaller at the beginning of the time series when $t < w$.

To use moving averages for forecasting, given $x_1, \ldots, x_{t-1}$, let the predicted value at time $t$ be $\hat{x}_t = MA_{t-1}$. Then the forecast error is:
$$e_t = x_t - \hat{x}_t = x_t - MA_{t-1}.$$
The Mean Absolute Percent Error (MAPE) is:
$$MAPE = \frac{1}{T-1}\sum_{t=2}^{T} \left|\frac{e_t}{x_t}\right| \cdot 100\%.$$
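The moving-average forecast and its MAPE can be sketched in a few lines of Python. This is an illustration under stated assumptions: the series is made up, the function names are hypothetical, and the list is 0-indexed, so index `t` in the code corresponds to time $t + 1$ in the formulas.

```python
def moving_average(xs, w):
    """MA at every time point; the window shrinks at the start of the series."""
    out = []
    for t in range(len(xs)):
        window = xs[max(0, t - w + 1): t + 1]   # at most w most recent values
        out.append(sum(window) / len(window))
    return out


def ma_forecast_mape(xs, w):
    """MAPE (in percent) when each value is forecast by the previous MA."""
    ma = moving_average(xs, w)
    # The first value has no forecast, so there are T - 1 error terms.
    ratios = [abs((xs[t] - ma[t - 1]) / xs[t]) for t in range(1, len(xs))]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical series; every value must be nonzero for MAPE to be defined.
series = [10, 12, 14, 16]
ma = moving_average(series, w=2)
mape = ma_forecast_mape(series, w=2)
print(ma, mape)
```

On a steadily rising series like this one the moving average always lags behind, so every forecast error has the same sign.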

The MAPE looks at the forecast error $e_t$ as a fraction of the measurement value $x_t$. Sometimes as measurement values grow, errors grow too; the MAPE helps to even this out. For MAPE, $x_t$ can't be 0.

Exponentially Weighted Moving Averages (EWMA). It doesn't completely drop old values.
$$EWMA_t = \omega x_t + (1 - \omega) EWMA_{t-1},$$
where $EWMA_0 = x_0$ and $0 < \omega < 1$ is a smoothing constant.

Here $\omega$ controls the balance of recent data to old data.
It is called "exponentially" weighted because of the recursive formula:
$$EWMA_t = \omega\left[x_t + (1 - \omega)x_{t-1} + (1 - \omega)^2 x_{t-2} + \cdots\right] + (1 - \omega)^t EWMA_0.$$
The forecast error is thus:
$$e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1}.$$
HW? Compare MAPE for MA vs EWMA.

Autocorrelation coefficient. Measures correlation between the time series and a lagged version of itself. The $k$th order autocorrelation coefficient is:
$$r_k := \frac{\sum_{t=k+1}^{T} (x_{t-k} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}.$$
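The EWMA recursion and the autocorrelation coefficient translate almost line for line into Python. The following is a minimal sketch with hypothetical function names and made-up data. The first list element plays the role of $x_0$ for the EWMA, and the autocorrelation sum is shifted to 0-based indexing.

```python
def ewma(xs, omega):
    """EWMA_t = omega * x_t + (1 - omega) * EWMA_{t-1}, starting from EWMA_0 = x_0."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(omega * x + (1 - omega) * out[-1])
    return out


def autocorrelation(xs, k):
    """k-th order autocorrelation coefficient r_k of the series xs."""
    T = len(xs)
    xbar = sum(xs) / T
    # 1-based sum over t = k+1..T becomes 0-based t = k..T-1.
    num = sum((xs[t - k] - xbar) * (xs[t] - xbar) for t in range(k, T))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den


smoothed = ewma([1, 2, 3], omega=0.5)
r1 = autocorrelation([1, 2, 3, 4, 5], k=1)
print(smoothed, r1)
```

The last smoothed value matches the expanded formula: $0.5\,[3 + 0.5 \cdot 2] + 0.5^2 \cdot 1 = 2.25$. A larger $\omega$ puts more weight on recent data, and a smaller $\omega$ keeps more memory of old values.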

MIT OpenCourseWare
http://ocw.mit.edu

15.075J / ESD.07J Statistical Thinking and Data Analysis
Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.