Chapter 4 - Summarizing Numerical Data

15.075, Cynthia Rudin

Here are some ways we can summarize data numerically.

Sample mean:

\[ \bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i. \]

Note: in this class we will work with both the population mean $\mu$ and the sample mean $\bar{x}$. Do not confuse them! Remember, $\bar{x}$ is the mean of a sample taken from the population and $\mu$ is the mean of the whole population.

Sample median: order the data values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, so then

\[ \text{median} := \tilde{x} := \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right] & n \text{ even.} \end{cases} \]

Mean and median can be very different: 1, 2, 3, 4, 500, where 500 is an outlier. The median is more robust to outliers.

Quantiles/Percentiles: Order the sample, then find $\tilde{x}_p$ so that it divides the data into two parts where:
a fraction $p$ of the data values are less than or equal to $\tilde{x}_p$ and
the remaining fraction $(1 - p)$ are greater than $\tilde{x}_p$.
That value $\tilde{x}_p$ is the $p$-th quantile, or $100p$-th percentile.

5-number summary: $\{x_{\min}, Q_1, Q_2, Q_3, x_{\max}\}$, where $Q_1 = \theta_{.25}$, $Q_2 = \theta_{.5}$, $Q_3 = \theta_{.75}$ (the 0.25, 0.5 and 0.75 quantiles).

Range: $x_{\max} - x_{\min}$, measures dispersion.

Interquartile Range: $IQR := Q_3 - Q_1$, a range resistant to outliers.
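The summaries above can be tried out on the 1, 2, 3, 4, 500 example with a short Python sketch. This is only an illustration: the helper name `sample_median` is made up here, and the quartiles use the standard library's `statistics.quantiles` with `method="inclusive"`, which is one of several interpolation conventions and may give slightly different values from the quantile definition in the notes on other data sets.

```python
import statistics

# Example data from the notes; 500 is the outlier.
data = [1, 2, 3, 4, 500]


def sample_median(values):
    """Median from the order statistics x_(1) <= ... <= x_(n)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]  # x_((n+1)/2) in 1-based notation
    return 0.5 * (xs[mid - 1] + xs[mid])  # average of x_(n/2) and x_(n/2+1)


mean = statistics.mean(data)      # pulled far upward by the outlier
median = sample_median(data)      # barely affected by the outlier

# Quartiles under the "inclusive" convention (an assumption, see lead-in).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, q2, q3, max(data))
data_range = max(data) - min(data)  # sensitive to the outlier
iqr = q3 - q1                       # resistant to the outlier
```

On this data the range is dominated by the single outlier while the IQR is not, which is the point of the "resistant to outliers" remark.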

Sample Variance $s^2$ and Sample Standard Deviation $s$:

\[ s^2 := \underbrace{\frac{1}{n-1}}_{\text{see why later}} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

Remember, for a large sample from a normal distribution, about 95% of the sample falls in $[\bar{x} - 2s, \bar{x} + 2s]$.

Do not confuse $s^2$ with $\sigma^2$, which is the variance of the population.

Coefficient of variation: $CV := s / \bar{x}$, dispersion relative to the size of the mean.

z-score:

\[ z_i := \frac{x_i - \bar{x}}{s}. \]

It tells you where a data point lies in the distribution, that is, how many standard deviations above/below the mean. E.g. $z_i = 3$ where the distribution is $N(0, 1)$.
It allows you to compute percentiles easily using the z-scores table, or a command on the computer.

Now some graphical techniques for describing data.

Bar chart/pie chart - good for summarizing data within categories
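As a rough sketch of these formulas in Python: the data values and function names below are made up for illustration, and `statistics.NormalDist` from the standard library stands in for the "command on the computer" that turns a z-score into a percentile under $N(0, 1)$.

```python
import math
import statistics


def sample_variance(values):
    """s^2 with the 1/(n-1) factor used in the notes."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def z_scores(values):
    """z_i = (x_i - xbar) / s for every data point."""
    xbar = sum(values) / len(values)
    s = math.sqrt(sample_variance(values))
    return [(x - xbar) / s for x in values]


# Hypothetical sample, chosen only to keep the arithmetic simple.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)
s = math.sqrt(s2)
cv = s / (sum(data) / len(data))  # coefficient of variation
z = z_scores(data)

# Percentile of a z-score under the standard normal distribution.
std_normal = statistics.NormalDist(mu=0.0, sigma=1.0)
percentile_at_z2 = std_normal.cdf(2.0)

# Probability mass within two standard deviations of the mean.
within_two_sd = std_normal.cdf(2.0) - std_normal.cdf(-2.0)
```

The value of `within_two_sd` is close to 0.95, which is where the rule of thumb for the interval $[\bar{x} - 2s, \bar{x} + 2s]$ comes from.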

Pareto chart - a bar chart where the bars are sorted.
Histogram
Boxplot and normplot
Scatterplot for bivariate data
Q-Q Plot for 2 independent samples
Hans Rosling

Chapter 4.4: Summarizing bivariate data

Two Way Table

Here's an example:

                    Respiratory Problem?
                    yes     no      row total
    smokers         25      25      50
    non-smokers     5       45      50
    column total    30      70      100

Question: If this example is from a study with 50 smokers and 50 non-smokers, is it meaningful to conclude that in the general population:
a) 25/30 = 83% of people with respiratory problems are smokers?
b) 25/50 = 50% of smokers have respiratory problems?

Simpson's Paradox

Deals with aggregating smaller datasets into larger ones.
Simpson's paradox is when conclusions drawn from the smaller datasets are the opposite of conclusions drawn from the larger dataset.
Occurs when there is a lurking variable and uneven-sized groups being combined.

E.g. Kidney stone treatment (Source: Wikipedia)

Which treatment is more effective?

    Treatment A         Treatment B
    78% (273/350)       83% (289/350)

Including information about stone size, now which treatment is more effective?

                    Treatment A                 Treatment B
    small stones    group 1: 93% (81/87)        group 2: 87% (234/270)
    large stones    group 3: 73% (192/263)      group 4: 69% (55/80)
    both            78% (273/350)               83% (289/350)

What happened!?
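The kidney stone numbers can be checked directly. The sketch below only recomputes the success rates from the counts in the table; the dictionary layout and function names are illustrative choices. One thing the counts make visible is that Treatment A was used mostly on large stones (263 of 350 patients) while Treatment B was used mostly on small stones (270 of 350), so the two pooled rates average over very different mixes of cases.

```python
# (successes, patients) for each treatment and stone size, from the table.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(successes, patients):
    """Fraction of patients treated successfully."""
    return successes / patients


def pooled_rate(treatment):
    """Success rate after combining small and large stones."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return successes / patients


# Success rate within each stone-size group.
by_size = {
    treatment: {size: success_rate(*cell) for size, cell in groups.items()}
    for treatment, groups in counts.items()
}

# Success rate ignoring stone size.
pooled = {treatment: pooled_rate(treatment) for treatment in counts}

# A wins within each stone size, yet B wins after pooling.
a_wins_small = by_size["A"]["small"] > by_size["B"]["small"]
a_wins_large = by_size["A"]["large"] > by_size["B"]["large"]
b_wins_pooled = pooled["B"] > pooled["A"]
```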

Continuing with bivariate data:

Correlation Coefficient - measures the strength of a linear relationship between two variables:

\[ \text{sample correlation coefficient} = r := \frac{S_{xy}}{S_x S_y}, \]

where

\[ S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad S_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \]

and $S_y^2$ is defined in the same way for the $y_i$. This is also called the Pearson Correlation Coefficient.

If we rewrite

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \frac{(y_i - \bar{y})}{S_y}, \]

you can see that $\frac{(x_i - \bar{x})}{S_x}$ and $\frac{(y_i - \bar{y})}{S_y}$ are the z-scores of $x_i$ and $y_i$.

$r \in [-1, 1]$ and is $\pm 1$ only when data fall along a straight line
$\mathrm{sign}(r)$ indicates the slope of the line (do $y_i$'s increase as $x_i$'s increase?)
always plot the data before computing $r$ to ensure it is meaningful
Correlation does not imply causation, it only implies association (there may be lurking variables that are not recognized or controlled)
For example: There is a correlation between declining health and increasing wealth.

Linear regression (in Ch 10):

\[ \frac{y - \bar{y}}{S_y} = r \, \frac{x - \bar{x}}{S_x}. \]
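A minimal Python sketch of the two equivalent forms of $r$ follows. The function names and the small data sets are invented for illustration; the code simply mirrors the definitions of $S_{xy}$, $S_x$ and $S_y$ given above.

```python
import math


def sample_cov(xs, ys):
    """S_xy with the 1/(n-1) factor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)


def sample_sd(xs):
    """S_x, the square root of S_x^2 = S_xx."""
    return math.sqrt(sample_cov(xs, xs))


def pearson_r(xs, ys):
    """r = S_xy / (S_x * S_y)."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))


def pearson_r_from_z(xs, ys):
    """Same r, written as an average of products of z-scores."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: an exact line, its mirror image, and a noisy trend.
xs = [1, 2, 3, 4, 5]
ys_line = [2, 4, 6, 8, 10]
ys_reversed = [10, 8, 6, 4, 2]
ys_noisy = [2, 1, 4, 3, 5]

r_line = pearson_r(xs, ys_line)
r_reversed = pearson_r(xs, ys_reversed)
r_noisy = pearson_r(xs, ys_noisy)
```

The exact line and its mirror image give $r$ at the two ends of $[-1, 1]$, and the noisy data land strictly inside the interval.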

Chapter 4.5: Summarizing time-series data

Moving averages. Calculate average over a window of previous timepoints:

\[ MA_t = \frac{x_{t-w+1} + \dots + x_t}{w}, \]

where $w$ is the size of the window. Note that we make the window $w$ smaller at the beginning of the time series, when $t < w$.

Example: [figure]

To use moving averages for forecasting, given $x_1, \dots, x_{t-1}$, let the predicted value at time $t$ be $\hat{x}_t = MA_{t-1}$. Then the forecast error is:

\[ e_t = x_t - \hat{x}_t = x_t - MA_{t-1}. \]

The Mean Absolute Percent Error (MAPE) is:

\[ MAPE = \frac{1}{T-1} \sum_{t=2}^{T} \left| \frac{e_t}{x_t} \right| \cdot 100\%. \]
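The moving-average forecast and its MAPE can be sketched in a few lines of Python. The function names and the four-point series are hypothetical; the window is shortened at the start of the series as described above, and Python lists are 0-based, so `ma[t - 1]` holds $MA_t$.

```python
def moving_average(x, w):
    """Return [MA_1, ..., MA_T]; the window shrinks while t < w."""
    ma = []
    for t in range(1, len(x) + 1):
        window = x[max(0, t - w):t]  # the last min(t, w) observations
        ma.append(sum(window) / len(window))
    return ma


def ma_forecasts(x, w):
    """Forecast for time t is MA_{t-1}; there is no forecast for t = 1."""
    ma = moving_average(x, w)
    return [None] + ma[:-1]


def mape(x, forecasts):
    """Mean absolute percent error over t = 2..T (requires x_t != 0)."""
    T = len(x)
    total = sum(abs((x[i] - forecasts[i]) / x[i]) for i in range(1, T))
    return 100.0 * total / (T - 1)


# Hypothetical series with a steady upward trend.
series = [10, 20, 30, 40]
ma = moving_average(series, w=2)
forecasts = ma_forecasts(series, w=2)
error_pct = mape(series, forecasts)
```

Because the series trends upward, every forecast lags behind the observed value, so each $e_t$ is positive here.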

The MAPE looks at the forecast error $e_t$ as a fraction of the measurement value $x_t$. Sometimes as measurement values grow, errors grow too; the MAPE helps to even this out. For MAPE, $x_t$ can't be 0.

Exponentially Weighted Moving Averages (EWMA). It doesn't completely drop old values.

\[ EWMA_t = \omega x_t + (1 - \omega) EWMA_{t-1}, \]

where $EWMA_0 = x_0$ and $0 < \omega < 1$ is a smoothing constant.

Example: [figure]

here $\omega$ controls balance of recent data to old data
called "exponentially" from recursive formula:

\[ EWMA_t = \omega \left[ x_t + (1 - \omega) x_{t-1} + (1 - \omega)^2 x_{t-2} + \dots \right] + (1 - \omega)^t EWMA_0 \]

the forecast error is thus:

\[ e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1} \]

HW? Compare MAPE for MA vs EWMA

Autocorrelation coefficient. Measures correlation between the time series and a lagged version of itself. The $k$-th order autocorrelation coefficient is:

\[ r_k := \frac{\sum_{t=k+1}^{T} (x_{t-k} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2} \]

Example: [figure]
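Both the EWMA recursion and the autocorrelation coefficient translate almost line for line into Python. This is a sketch with invented function names and toy data. The EWMA list is indexed from 0 so that `out[0]` is $EWMA_0 = x_0$; the autocorrelation treats the list as $x_1, \dots, x_T$ stored 0-based.

```python
def ewma(x, omega):
    """Return [EWMA_0, EWMA_1, ...] with EWMA_0 = x_0."""
    if not 0 < omega < 1:
        raise ValueError("omega must satisfy 0 < omega < 1")
    out = [x[0]]
    for value in x[1:]:
        # New value gets weight omega, the running average gets 1 - omega.
        out.append(omega * value + (1 - omega) * out[-1])
    return out


def autocorrelation(x, k):
    """k-th order autocorrelation coefficient r_k of the series x."""
    T = len(x)
    xbar = sum(x) / T
    # 0-based t runs over k..T-1, matching t = k+1..T in the formula.
    numerator = sum((x[t - k] - xbar) * (x[t] - xbar) for t in range(k, T))
    denominator = sum((value - xbar) ** 2 for value in x)
    return numerator / denominator


# Hypothetical series for illustration.
smoothed = ewma([10, 20, 30], omega=0.5)
r1 = autocorrelation([1, 2, 3, 4, 5], k=1)
r0 = autocorrelation([1, 2, 3, 4, 5], k=0)
```

With lag 0 the numerator and denominator coincide, so $r_0 = 1$ for any non-constant series.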

MIT OpenCourseWare
http://ocw.mit.edu

15.075J / ESD.07J Statistical Thinking and Data Analysis
Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.