2.3 Summary Statistics: Measures of Center and Spread


Ismor Fischer, 8/[?]/2008, Stat 541 / 2-8

POPULATION: distribution of a numerical random variable X (discrete or continuous)
  True center = ???  True spread = ???
  Parameters ("population characteristics"): unknown, fixed numerical values, usually denoted by Greek letters, e.g., $\theta$ ("theta").

Statistical Inference connects the population to the sample.

SAMPLE, size n
  Measures of center: median, mode, mean
  Measures of spread: range, variance, standard deviation
  Statistics ("sample characteristics"): known (or computable) numerical values obtained from sample data. They are estimators of parameters, e.g., $\hat{\theta}$, and are usually denoted by the corresponding Roman letters.

Measures of Center

For a given numerical random variable X, assume that a random sample $\{x_1, x_2, \dots, x_n\}$ has been selected and sorted from lowest to highest values, i.e., $x_1 \le x_2 \le \dots \le x_{n-1} \le x_n$.

sample median = the numerical "middle" value, in the sense that half the data values are smaller and half are larger (50% on each side).
  If n is odd, take the value in position # $(n+1)/2$.
  If n is even, take the average of the two closest neighboring data values, left (position # $n/2$) and right (position # $n/2 + 1$).

Comments:
  The sample median is robust (insensitive) with respect to the presence of outliers.
  More generally, one can also define quartiles ($Q_1$ = 25% cutoff, $Q_2$ = 50% cutoff = median, $Q_3$ = 75% cutoff), or percentiles (a.k.a. quantiles), which divide the data values into any given p% vs. (100 − p)% split. Example: SAT scores

sample mode = the data value with the largest frequency ($f_{\max}$)

Comment: The sample mode is robust to outliers.

If present, repeated sample data values can be neatly consolidated in a frequency table, vis-à-vis the corresponding dotplot. (If a value $x_i$ is not repeated, then its $f_i = 1$.)

  k distinct data values of X    absolute frequency of $x_i$    relative frequency of $x_i$
  $x_i$                          $f_i$                          $f(x_i) = f_i / n$
  $x_1$                          $f_1$                          $f(x_1)$
  $x_2$                          $f_2$                          $f(x_2)$
  ...                            ...                            ...
  $x_k$                          $f_k$                          $f(x_k)$
  Total                          $n$                            1

[Dotplot: frequencies $f_1, f_2, \dots, f_k$ stacked above the values $x_1, x_2, \dots, x_k$; the tallest stack ($f_{\max}$) marks the mode, and the mean is marked on the axis.]
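The position rule for the median and the frequency-table definition of the mode can be sketched in a few lines of Python. This is only an illustration: the function names and the small `scores` sample are hypothetical and do not come from the notes.

```python
from collections import Counter


def sample_median(values):
    """Median by the position rule: the middle value if n is odd,
    the average of the two middle values if n is even."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # 1-based position (n + 1) / 2
    return (xs[mid - 1] + xs[mid]) / 2    # 1-based positions n/2 and n/2 + 1


def frequency_table(values):
    """Rows (x_i, f_i, f_i / n) for the distinct values, in increasing order."""
    counts = Counter(values)
    n = len(values)
    return [(x, f, f / n) for x, f in sorted(counts.items())]


def sample_modes(values):
    """All values whose absolute frequency equals f_max."""
    counts = Counter(values)
    f_max = max(counts.values())
    return sorted(x for x, f in counts.items() if f == f_max)


scores = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical sample, n = 8 (even)
print(sample_median(scores), sample_modes(scores))
```

Returning a list from `sample_modes` is a design choice: a sample may have several values tied at the largest frequency.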

Example: n = 12 random sample values of X = Body Temperature (°F):
{98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

  $x_i$     $f_i$     $f(x_i)$
  98.5      1         1/12
  98.6      5         5/12
  98.9      3         3/12
  99.1      2         2/12
  99.2      1         1/12
  Total     n = 12    1

[Dotplot of the frequencies 1, 5, 3, 2, 1 over an axis running from 98.5 to 99.2.]

sample median = $\dfrac{98.6 + 98.9}{2}$ = 98.75 °F (six data values on either side)

sample mode = 98.6 °F

sample mean = $\frac{1}{12}\,[(98.5)(1) + (98.6)(5) + (98.9)(3) + (99.1)(2) + (99.2)(1)]$
or, = $(98.5)\frac{1}{12} + (98.6)\frac{5}{12} + (98.9)\frac{3}{12} + (99.1)\frac{2}{12} + (99.2)\frac{1}{12}$ = 98.8 °F

sample mean = the weighted average of all the data values

  $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{k} x_i f_i$, where $f_i$ is the absolute frequency of $x_i$

  $\bar{x} = \sum_{i=1}^{k} x_i\, f(x_i)$, where $f(x_i) = f_i / n$ is the relative frequency of $x_i$

Comments:
  The sample mean is the center of mass, or balance point, of the data values.
  The sample mean is sensitive to outliers. One common remedy for this is the trimmed mean: compute the sample mean after deleting a predetermined number or percentage of outliers from each end of the data set, e.g., a "10% trimmed mean." It is robust to outliers by construction.
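A minimal sketch of the frequency-weighted mean and a trimmed mean follows. It assumes the twelve body temperature readings are 98.5, five 98.6s, three 98.9s, two 99.1s and one 99.2, which is the data set consistent with the stated median (98.75) and mean (98.8). The trimming rule used here, dropping `int(n * fraction)` values from each end, is one common convention and an assumption, not something the notes specify.

```python
from collections import Counter

# Body temperature sample (deg F), assumed as described above.
temps = [98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
         98.9, 98.9, 98.9, 99.1, 99.1, 99.2]


def weighted_mean(freq):
    """x-bar = (1/n) * sum of x_i * f_i over the distinct values x_i."""
    n = sum(freq.values())
    return sum(x * f for x, f in freq.items()) / n


def trimmed_mean(values, fraction):
    """Mean after dropping int(n * fraction) values from EACH end (assumed rule)."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * fraction)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


mean_temp = weighted_mean(Counter(temps))   # close to 98.8
trimmed = trimmed_mean(temps, 0.10)         # drops one value from each end
print(round(mean_temp, 2), round(trimmed, 2))
```

With n = 12 and a 10% fraction, exactly one reading is removed from each end, so the trimmed mean averages the middle ten values.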

Grouped Data

Suppose the original values had been "lumped" into categories.

Example: Recall the grouped "Memorial Union age" data set.

  $x_i$    Class Interval    Frequency $f_i$    Relative Frequency $f_i/n$    Density (Rel Freq ÷ Class Width)
  15       [10, 20)          4                  0.20                          0.02
  25       [20, 30)          8                  0.40                          0.04
  45       [30, 60)          8                  0.40                          0.013
  Total                      n = 20             1.00

group mean: Same formula as above, with $x_i$ = midpoint of the i-th class interval.

  $\bar{x}_{\text{group}} = \frac{1}{20}\,[(15)(4) + (25)(8) + (45)(8)]$ = 31.0 years

Exercise: Compare this value with the ungrouped sample mean $\bar{x}$ = 29.2 years.

group median (& other quantiles):

[Density histogram: the rectangle over [20, 30) has area 0.40, split by the median into 0.30 on the left and 0.10 on the right; the rectangle over [10, 20) has area 0.20 and the rectangle over [30, 60) has area 0.40.]

By definition, the median $Q_2$ divides the data set into equal halves, i.e., 0.50 above and below. In this example, it must therefore lie in the class interval [20, 30), and divide the 0.40 area of the corresponding class rectangle into 0.30 on the left and 0.10 on the right. Since the 0.10 strip is ¼ of that area, it proportionally follows that $Q_2$ must lie at ¼ of the class width 30 − 20 = 10, or 2.5, from the right endpoint of 30. That is, $Q_2$ = 30 − 2.5, or $Q_2$ = 27.5 years. (Check that the ungrouped median = 25 years.)
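The grouped-mean calculation can be sketched as follows, assuming the Memorial Union age classes [10, 20), [20, 30), [30, 60) with frequencies 4, 8, 8. The function name and the dictionary layout are illustrative choices only.

```python
# Each entry: ((left endpoint, right endpoint), frequency)
classes = [((10, 20), 4), ((20, 30), 8), ((30, 60), 8)]


def grouped_summary(classes):
    """Return (group mean, rows), where each row holds the class midpoint,
    relative frequency, and density = relative frequency / class width."""
    n = sum(f for _, f in classes)
    rows = []
    weighted_total = 0.0
    for (a, b), f in classes:
        midpoint = (a + b) / 2
        rel_freq = f / n
        rows.append({
            "midpoint": midpoint,
            "rel_freq": rel_freq,
            "density": rel_freq / (b - a),
        })
        weighted_total += midpoint * f      # x_i * f_i with x_i = midpoint
    return weighted_total / n, rows


group_mean, rows = grouped_summary(classes)
print(group_mean)                                   # 31.0
print([round(r["density"], 3) for r in rows])       # [0.02, 0.04, 0.013]
```

The densities are what a density histogram plots, so the area of each rectangle equals the relative frequency of its class.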

Formal approach ~

[Figure: a class rectangle of height Density over the interval [a, b), split at Q into a left area A and a right area B.]

First, identify which class interval [a, b) contains the desired quantile Q (e.g., median, quartile, etc.), and determine the respective left and right areas A and B into which it divides the corresponding class rectangle. Equating proportions for Density = $\dfrac{A + B}{b - a}$, we obtain

  Density = $\dfrac{A}{Q - a} = \dfrac{B}{b - Q}$,

from which it follows that

  $Q = a + \dfrac{A}{\text{Density}}$  or  $Q = b - \dfrac{B}{\text{Density}}$  or  $Q = \dfrac{Ab + Ba}{A + B}$.

For example, in the grouped Memorial Union age data, we have a = 20, b = 30, and A = 0.30, B = 0.10. Substituting these values into any of the equivalent formulas above yields the median $Q_2$ = 27.5.

Exercise: Now that $Q_2$ is found, use the formula again to find the first and third quartiles $Q_1$ and $Q_3$, respectively.

Note also that, from above, we obtain the useful formulas

  $A = (Q - a) \times \text{Density}$
  $B = (b - Q) \times \text{Density}$

for calculating the areas A and B when a value of Q is given! This can be used when finding the area between two quantiles Q and Q′. (An alternative approach follows.)
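As a quick check of the interpolation formula $Q = (Ab + Ba)/(A + B)$ and of its inverse, here is a short sketch. It assumes the median example of the Memorial Union age data (a = 20, b = 30, A = 0.30, B = 0.10); the function names are hypothetical.

```python
def quantile_from_areas(a, b, A, B):
    """Q = (A*b + B*a) / (A + B): the point of [a, b) that splits the class
    rectangle into left area A and right area B."""
    if A < 0 or B < 0 or A + B <= 0:
        raise ValueError("areas must be non-negative with a positive total")
    return (A * b + B * a) / (A + B)


def areas_from_quantile(a, b, Q, density):
    """Inverse direction: A = (Q - a) * density and B = (b - Q) * density."""
    if not a <= Q <= b:
        raise ValueError("Q must lie in the class interval")
    return (Q - a) * density, (b - Q) * density


median = quantile_from_areas(20, 30, 0.30, 0.10)
left, right = areas_from_quantile(20, 30, 27.5, density=0.04)
print(round(median, 2), round(left, 2), round(right, 2))   # 27.5 0.3 0.1
```

The same two functions cover the quartile exercise once the appropriate class interval and areas have been identified.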

Alternative approach ~

  Class Interval         Frequency    Relative Frequency    Cumulative Relative Frequency
                         $f_i$        $f_i / n$             $F_i = f_1/n + f_2/n + \dots + f_i/n$
  $I_1$                  $f_1$        $f_1/n$               $F_1$
  $I_2$                  $f_2$        $f_2/n$               $F_2$
  ...                    ...          ...                   ...
  $I_i$                  $f_i$        $f_i/n$               $F_i = F_{\text{low}} < 0.5$
  $I_{i+1}$ = [a, b)     $f_{i+1}$    $f_{i+1}/n$           $F_{i+1} = F_{\text{high}} > 0.5$    (Q = ? lies in this class)
  ...                    ...          ...                   ...
  $I_k$                  $f_k$        $f_k/n$               1
  Total                  $n$          1

Then

  $Q = a + \dfrac{0.5 - F_{\text{low}}}{F_{\text{high}} - F_{\text{low}}}\,(b - a)$  or  $Q = b - \dfrac{F_{\text{high}} - 0.5}{F_{\text{high}} - F_{\text{low}}}\,(b - a)$.

Again, in the grouped Memorial Union age data, we have a = 20, b = 30, $F_{\text{low}}$ = 0.2, and $F_{\text{high}}$ = 0.6 (why?). Substituting these values into either formula yields the median $Q_2$ = 27.5.

To find $Q_1$, replace the 0.5 in the formula by 0.25; to find $Q_3$, replace the 0.5 in the formula by 0.75, etc.

Conversely, if a quantile Q in [a, b) is given, then we can solve for the cumulative relative frequency up to that value:

  $F(Q) = F_{\text{low}} + \dfrac{Q - a}{b - a}\,(F_{\text{high}} - F_{\text{low}})$.

It follows that the relative frequency (i.e., area) between two quantiles Q < Q′ is equal to the difference between their cumulative relative frequencies: $F(Q') - F(Q)$.
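The cumulative-frequency method lends itself to a small routine. The sketch below assumes the Memorial Union classes [10, 20), [20, 30), [30, 60) with frequencies 4, 8, 8. It works with cumulative counts rather than cumulative relative frequencies, which is the same formula with numerator and denominator multiplied by n. Function names are illustrative.

```python
classes = [((10, 20), 4), ((20, 30), 8), ((30, 60), 8)]


def grouped_quantile(classes, p):
    """Linear interpolation inside the class where the cumulative relative
    frequency first reaches p (0 < p < 1)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    n = sum(f for _, f in classes)
    target = p * n                  # p expressed as a cumulative count
    cum = 0
    for (a, b), f in classes:
        if f > 0 and cum <= target <= cum + f:
            # equals a + (p - F_low) / (F_high - F_low) * (b - a)
            return a + (target - cum) / f * (b - a)
        cum += f
    raise ValueError("no class contains the requested quantile")


def cumulative_rel_freq(classes, q):
    """F(q): relative frequency at or below q, interpolating within a class."""
    n = sum(f for _, f in classes)
    cum = 0.0
    for (a, b), f in classes:
        if q >= b:
            cum += f                        # whole class lies below q
        elif q > a:
            cum += f * (q - a) / (b - a)    # partial class
            break
        else:
            break
    return cum / n


median = grouped_quantile(classes, 0.5)
area = cumulative_rel_freq(classes, 27.5) - cumulative_rel_freq(classes, 20)
print(median, round(area, 2))   # 27.5 0.3
```

Passing 0.25 or 0.75 as `p` gives the grouped quartiles in the same way.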

Shapes of Distributions

Symmetric distributions correspond to values that are spread equally about a center, so that mean = median.

Examples (drawn for "smoothed histograms" of a random variable X): uniform, triangular, bell-shaped.

Note: An important special case of the bell-shaped curve is the normal distribution, a.k.a. Gaussian distribution. Example: X = IQ score

Otherwise, if more outliers of X occur on one side of the median than the other, the corresponding distribution will be skewed in that direction, forming a tail.

  skewed to the left (negatively skewed): mean < median. Example: X = calcium level (mg)
  skewed to the right (positively skewed): median < mean. Example: X = serum cholesterol level (mg/dL)

[Sketches: in each skewed curve the median splits the area into 0.5 on each side, and the mean lies toward the tail.]

Furthermore, distributions can also be classified according to the number of "peaks": unimodal, bimodal, multimodal.
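The ordering of mean and median under skewness can be illustrated numerically. The two small samples below are made up for the purpose (they are not from the notes): one has a long right tail and the other is its mirror image.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 4, 20]        # hypothetical: one large outlier
left_skewed = [-x for x in right_skewed]     # mirror image: long left tail

for name, sample in [("right", right_skewed), ("left", left_skewed)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(name, "mean =", mean, "median =", median)
```

In the right-skewed sample the outlier pulls the mean above the median, and the mirrored sample reverses the inequality. This is typical of skewed data, not a theorem for every possible sample.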

Measures of Spread

Again assume that a numerical random sample $\{x_1, x_2, \dots, x_n\}$ has been selected and sorted from lowest to highest values, i.e., $x_1 \le x_2 \le \dots \le x_{n-1} \le x_n$.

sample range = $x_n - x_1$ (highest value − lowest value)

Comments:
  Uses only the two most extreme values. Very crude estimator of spread.
  The sample range is extremely sensitive to outliers. One common remedy is the interquartile range (IQR) = $Q_3 - Q_1$, which is robust to outliers by construction. [Figure: $Q_1$, $Q_2$, $Q_3$ split the sorted data into four parts of 25% each.]
  If the original data are grouped into k class intervals $[a_1, a_2), [a_2, a_3), \dots, [a_k, a_{k+1})$, then the group range = $a_{k+1} - a_1$. A similar calculation holds for the group IQR.

Example: The "Body Temperature" data set has a sample range = 99.2 − 98.5 = 0.7 °F.

{98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

  $x_i$     $f_i$
  98.5      1
  98.6      5
  98.9      3
  99.1      2
  99.2      1
  Total     n = 12
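A brief sketch of the range and IQR calculations, assuming the twelve body temperature readings 98.5, five 98.6s, three 98.9s, two 99.1s and one 99.2. Quartile conventions differ between textbooks and software; the call below uses the interpolation rule of Python's `statistics.quantiles` with `method="exclusive"`, which is an assumption and not necessarily the convention the notes intend. The class edges are those of the grouped Memorial Union age example.

```python
import statistics

temps = [98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
         98.9, 98.9, 98.9, 99.1, 99.1, 99.2]

# Sample range: highest value minus lowest value.
sample_range = max(temps) - min(temps)

# Quartiles under the library's "exclusive" interpolation rule (assumed convention).
q1, q2, q3 = statistics.quantiles(temps, n=4, method="exclusive")
iqr = q3 - q1

# Group range: right endpoint of the last class minus left endpoint of the first.
class_edges = [10, 20, 30, 60]
group_range = class_edges[-1] - class_edges[0]

print(round(sample_range, 2), round(iqr, 2), group_range)
```

Because the IQR ignores the lowest and highest quarters of the data, replacing 99.2 by a much larger reading would change the range but leave the IQR of this sample unchanged.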

For a much less crude measure of spread that uses all the data, first consider the following.

Definition: $x_i - \bar{x}$ = individual deviation of the i-th sample data value from the sample mean

  $x_i$     $x_i - \bar{x}$ (with $\bar{x}$ = 98.8)    $f_i$
  98.5      −0.3                                       1
  98.6      −0.2                                       5
  98.9      +0.1                                       3
  99.1      +0.3                                       2
  99.2      +0.4                                       1
  Total                                                n = 12

Naively, an estimate of the spread of the data values might be calculated as the average of these n = 12 individual deviations from the mean. However, this will always yield zero!

FACT: $\sum_{i=1}^{k} (x_i - \bar{x})\, f_i = 0$, i.e., the sum of the deviations is always zero.

Check: In this example, the sum = (−0.3)(1) + (−0.2)(5) + (0.1)(3) + (0.3)(2) + (0.4)(1) = 0.

Exercise: Prove this general fact algebraically.

Interpretation: The sample mean is the center of mass, or balance point, of the data values.

[Dotplot of the frequencies 1, 5, 3, 2, 1 over an axis running from 98.5 to 99.2, balanced at the mean 98.8.]
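The zero-sum property can be verified exactly with rational arithmetic, which avoids floating-point round-off. This sketch assumes the body temperature frequency table with values 98.5, 98.6, 98.9, 99.1, 99.2 and frequencies 1, 5, 3, 2, 1. It is a numerical check for this one data set, not the algebraic proof requested in the exercise.

```python
from fractions import Fraction

# Frequency table: value -> absolute frequency (values held as exact fractions).
freq = {
    Fraction("98.5"): 1,
    Fraction("98.6"): 5,
    Fraction("98.9"): 3,
    Fraction("99.1"): 2,
    Fraction("99.2"): 1,
}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n

# Frequency-weighted sum of the deviations from the mean.
deviation_sum = sum((x - mean) * f for x, f in freq.items())

print(n, float(mean), deviation_sum)   # 12 98.8 0
```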

Best remedy: To make them non-negative, square the deviations before summing.

sample variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{k} (x_i - \bar{x})^2 f_i$  ($s^2$ is not on the same scale as the data values!)

sample standard deviation: $s = +\sqrt{s^2}$  ($s$ is on the same scale as the data values.)

Example:

  $x_i$     $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $f_i$
  98.5      −0.3               +0.09                  1
  98.6      −0.2               +0.04                  5
  98.9      +0.1               +0.01                  3
  99.1      +0.3               +0.09                  2
  99.2      +0.4               +0.16                  1
  Total                                               n = 12

Then $s^2 = \frac{1}{11}\,[(0.09)(1) + (0.04)(5) + (0.01)(3) + (0.09)(2) + (0.16)(1)]$ = 0.06 (°F)², so that $s = \sqrt{0.06}$ = 0.245 °F. Body Temp has a small amount of variance.

Comments:
  $s^2 = \dfrac{\sum (x_i - \bar{x})^2 f_i}{n - 1}$ has the important, frequently recurring form $\dfrac{SS}{df}$, where SS = Sum of Squares (sometimes also denoted $S_{xx}$) and df = degrees of freedom = n − 1, since the n individual deviations have a single constraint. (Namely, their sum must equal zero.)
  The same formulas are used for grouped data, with $\bar{x}_{\text{group}}$ and $x_i$ = class interval midpoint. Exercise: Compute s for the grouped and ungrouped Memorial Union age data.
  A related measure of spread is the absolute deviation, defined as $\dfrac{1}{n} \sum |x_i - \bar{x}|\, f_i$, but its statistical properties are not as well-behaved as those of the standard deviation.
  Also, see Appendix > Geometric Viewpoint > Mean and Variance, for a way to understand the sum of squares formula via the Pythagorean Theorem (!), as well as a useful alternate computational formula for the sample variance.
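The variance, standard deviation, and absolute deviation formulas can be sketched from a frequency table as follows. The body temperature frequencies (1, 5, 3, 2, 1 for 98.5, 98.6, 98.9, 99.1, 99.2) are assumed, and the function name and returned tuple are illustrative choices.

```python
import math

freq = {98.5: 1, 98.6: 5, 98.9: 3, 99.1: 2, 99.2: 1}   # value -> frequency


def spread_from_frequencies(freq):
    """Return (variance, standard deviation, absolute deviation), where
    variance = SS / df with SS = sum of (x_i - mean)^2 * f_i and df = n - 1."""
    n = sum(freq.values())
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(x * f for x, f in freq.items()) / n
    ss = sum((x - mean) ** 2 * f for x, f in freq.items())    # Sum of Squares
    variance = ss / (n - 1)                                   # divide by df
    std_dev = math.sqrt(variance)
    abs_dev = sum(abs(x - mean) * f for x, f in freq.items()) / n
    return variance, std_dev, abs_dev


variance, std_dev, abs_dev = spread_from_frequencies(freq)
print(round(variance, 3), round(std_dev, 3), round(abs_dev, 3))   # 0.06 0.245 0.217
```

The same function works for grouped data if the keys are class midpoints, which is one way to approach the exercise on the Memorial Union age data.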