Lecture 1 Review of Fundamental Statistical Concepts

Measures of Central Tendency and Dispersion

A word about notation for this class: Individuals in a population are designated $Y_i$, where the index $i$ ranges from 1 to $N$, and $N$ is the total number of individuals in the population. Individuals in a random sample taken from a population are also denoted $Y_i$, but in this case the index $i$ ranges from 1 to $n$, where $n$ is the total number of individuals in the sample. Greek letters will be used for population parameters (e.g. $\mu$ = population mean; $\sigma^2$ = population variance), while Roman letters will be used for estimates of population parameters, based on random sampling (e.g. $\bar{Y}$ = sample mean ≈ population mean = $\mu$; $s^2$ = sample variance ≈ population variance = $\sigma^2$).

Basic formulas:

Mean or average (a measure of central tendency)

Population mean: $\mu = \frac{\sum_{i=1}^{N} Y_i}{N}$

Sample mean: $\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$

Variance (a measure of dispersion of individuals about the mean)

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$

The quantities $(Y_i - \bar{Y})$ are called the deviations.
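The mean and variance formulas can be checked with a few lines of Python. This is a minimal sketch: the function names and the eight data values are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import statistics


def population_variance(values):
    """Mean squared deviation from the population mean (divisor N)."""
    mu = sum(values) / len(values)
    return sum((y - mu) ** 2 for y in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    y_bar = sum(values) / len(values)
    return sum((y - y_bar) ** 2 for y in values) / (len(values) - 1)


# Hypothetical data, not taken from the lecture.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

y_bar = sum(data) / len(data)            # mean = 40 / 8 = 5.0
deviations = [y - y_bar for y in data]   # deviations always sum to zero

print(y_bar, population_variance(data), sample_variance(data))
```

The standard library functions `statistics.pvariance` and `statistics.variance` use the same two divisors ($N$ and $n - 1$), so they can serve as a cross-check.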

Standard deviation (a measure of dispersion in the original units of observation)

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$

Coefficient of variation

In some situations, it is useful to express the standard deviation in units of the population mean. For this purpose, we have a quantity called the coefficient of variation:

Population coefficient of variation: $CV = \frac{\sigma}{\mu}$

Sample coefficient of variation: $CV = \frac{s}{\bar{Y}}$

Measures of dispersion of sample means

Another important population parameter we will work with in this class is the sample variance of the mean ($\sigma^2_{\bar{Y}}$). If you repeatedly sample a population by taking samples of size $n$, the variance of those sample means is what we call the sample variance of the mean. It relates very simply to the population variance, in this way:

Sample variance of the mean: $\sigma^2_{\bar{Y}(n)} = \frac{\sigma^2}{n}$

Suppose we take $r$ independent, random samples of size $n$ from a population. The variance of the distribution of those sample means will be an estimate of $\sigma^2 / n$ for that population. In other words, if $\bar{Y}_{i(n)}$ is the mean of the $i$th sample and $\bar{\bar{Y}}_{(n)}$ is the overall mean for all $r$ samples, then what we find is:

$s^2_{\bar{Y}(n)} = \frac{\sum_{i=1}^{r} \left( \bar{Y}_{i(n)} - \bar{\bar{Y}}_{(n)} \right)^2}{r - 1} \approx \frac{\sigma^2}{n}$

The square root of $s^2_{\bar{Y}(n)}$ is called the standard deviation of a mean, or more often the standard error.

Standard error: $s_{\bar{Y}(n)} = \sqrt{s^2_{\bar{Y}(n)}}$, which is estimated from a single sample of size $n$ as $\frac{s}{\sqrt{n}}$

As with the standard deviation, this is a quantity in the original units of observation. As you will see, the standard error is extremely useful due to the role it plays in determining confidence intervals and the powers of tests.
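The relationship between the variance of sample means and $\sigma^2 / n$ can be illustrated by simulation. The sketch below is not part of the lecture: the function names, the seed, and the population parameters ($\mu = 50$, $\sigma = 6$, $n = 20$, $r = 5000$) are all assumptions made for illustration.

```python
import math
import random
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation expressed in units of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample)


def standard_error(sample):
    """Standard error of the mean estimated from one sample: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def variance_of_sample_means(mu, sigma, n, r, seed=0):
    """Draw r random samples of size n from N(mu, sigma^2) and return
    the variance of the r sample means (divisor r - 1)."""
    rng = random.Random(seed)
    means = [
        statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(r)
    ]
    return statistics.variance(means)


# Hypothetical population: mu = 50, sigma^2 = 36, sampled with n = 20.
simulated = variance_of_sample_means(mu=50.0, sigma=6.0, n=20, r=5000)
theoretical = 6.0 ** 2 / 20  # sigma^2 / n = 1.8
print(f"simulated {simulated:.3f} vs theoretical {theoretical:.3f}")
```

With several thousand samples the simulated value should land close to 1.8; the exact figure depends on the seed.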

The Normal Distribution

If you measure a quantitative trait on a population of meaningfully related individuals, what you often find is that most of the measurements will cluster near the population mean ($\mu$). And as you consider values further and further from $\mu$, individuals exhibiting those values become rarer. Graphically, such a situation can be visualized in terms of a frequency distribution, as shown below:

[Figure: bell-shaped frequency distribution. x-axis: observed value, centered at $\mu$; y-axis: frequency of observation.]

Some basic characteristics of this kind of distribution are:

1) The maximum value occurs at $\mu$ (i.e. the most probable value of an individual pulled randomly from the population is $\mu$; another way of saying this is that the expected value of an individual pulled randomly from this population is $\mu$);
2) The dispersion is symmetric about $\mu$ (i.e. the mean, median, and mode of the population are equal); and
3) The tails asymptotically approach zero.

A distribution which meets these basic criteria is known as a normal distribution. The following conditions tend to result in a normal distribution of a quantitative trait:

1) There are many factors which contribute to the observed value of the trait;
2) These many factors act independently of one another; and
3) The individual effects of these factors are additive and of comparable magnitude.

As it turns out, a great many variables of interest are approximately normally distributed. Indeed, the normal distribution is observed for characters in complex systems of all kinds: biological, socioeconomic, industrial, etc. The bell-shaped normal distribution is also known as a Gaussian curve, named after Friedrich Gauss who figured out the formal mathematics underlying functions of this type. Specifically, a normal probability density function of mean $\mu$ and variance $\sigma^2$ is described by the expression:

$f(Y) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(Y - \mu)^2}{2\sigma^2}}$

where $f(Y)$ is the height of the curve at a given observed value $Y$. Notice that the location and shape of a normal probability density function are uniquely determined by only two parameters, $\mu$ and $\sigma^2$. By varying the value of $\mu$, one can center $f(Y)$ anywhere on the x-axis. By varying $\sigma^2$, one can freely adjust the width of the central hump. All of the statistical techniques we will discuss in this class are based on the idea that many systems we study in the real world can be modeled by this theoretical function $f(Y)$. Such techniques fall into the broad category of parametric statistics, because the ultimate objectives of these techniques are to estimate and compare the theoretical parameters (in this case, $\mu$ and $\sigma^2$) which best explain our observations.

If we set $\mu = 0$ and $\sigma^2 = 1$, we obtain an especially useful normal probability density function known as the standard normal curve [N(0,1)]:

$f(Y) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{Y^2}{2}} \quad [\mu = 0,\ \sigma^2 = 1]$

A word about notation: Rather than $Y$, it is traditional to use the letter $Z$ to represent a random variable drawn from the standard normal distribution:

$f(Z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{Z^2}{2}}$

[Figure: the standard normal curve N(0,1). x-axis: $Z$, centered at $Z = 0$; y-axis: frequency of observation; area under curve = 1.]

As with all probability density functions, the total area under the curve equals 1. On page 6 of your book (Appendix A4), you will find a table of the properties of this curve. For any given positive value of $Z$, the table reports the area under the curve to the right of $Z$. This is useful because the area to the right of $Z$ is the theoretical probability of randomly picking an individual from N(0,1) whose value is greater than $Z$. How does this help us in the real world? It helps us because ANY normal distribution can be standardized (i.e. any normal distribution can be converted to N(0,1)). The way this is done is quite simple:

$Z_i = \frac{Y_i - \mu}{\sigma}$

Subtracting $\mu$ from each observation shifts the mean of the distribution to 0. Dividing by $\sigma$ changes the scale of the x-axis from the original units of observation to units of standard deviation and thus makes the standard deviation (and the variance) of the distribution equal to 1. What this means is that for any unique individual $Y_i$ from a normal distribution of mean $\mu$ and variance $\sigma^2$, there is a corresponding unique value $Z_i$ (i.e. a normal score) in the standard normal curve. And since we know the theoretical probability of picking an individual of a certain value at random from N(0,1), we now have a way of determining the probability of picking an individual of a certain value at random from any normally distributed population. Some examples:

Question 1: From a normally distributed population of finches with mean weight ($\mu$) = 17.2 g and variance ($\sigma^2$) = 36 g², what is the probability of randomly selecting an individual finch weighing more than 22 g?

Solution: To answer this, first convert the value 22 g to its corresponding normal score:

$Z = \frac{Y - \mu}{\sigma} = \frac{22\ \mathrm{g} - 17.2\ \mathrm{g}}{6\ \mathrm{g}} = 0.8$

From Table A4, we see that 21.19% of the area under N(0,1) lies to the right of $Z = 0.8$. Therefore, there is a 21.19% chance of randomly selecting an individual finch weighing more than 22 g from this population. In other words, 22 g is not an unusual weight for a finch in this population. It helps to visualize these problems graphically, especially as they get more complicated:

[Figure: normal curve of finch weights with 17.2 and 22.0 marked on the $Y$ axis, corresponding to 0 and 0.8 on the $Z$ axis, and the area to the right of 22.0 shaded. Question: What is this area, i.e. $P(Y \geq 22)$? Answer: $P(Y \geq 22) = P(Z \geq 0.8) = 0.2119$.]

Incidentally, we also see, simply by symmetry, that we have a 21.19% chance of randomly selecting an individual finch weighing less than 12.4 g. Do you see why?

Question 2: From a normally distributed population of finches with mean weight ($\mu$) = 17.2 g and variance ($\sigma^2$) = 36 g², what is the probability of randomly selecting a sample of 20 finches with an average weight of more than 22 g?

Solution: The difference between this question and the previous one is that the previous question was asking the probability of selecting an individual of a certain value at random while this question is asking for the probability of selecting a sample of a certain average value at random. For individuals, the appropriate distribution to consider is the normal distribution of the population of individuals ($\mu$ = 17.2 g and $\sigma^2$ = 36 g²). But for samples of size $n = 20$, the appropriate distribution to consider is the normal distribution of sample means for sample size $n = 20$ ($\mu$ = 17.2 g and $\sigma^2_{\bar{Y}(n=20)} = \frac{\sigma^2}{n} = \frac{36\ \mathrm{g}^2}{20} = 1.8\ \mathrm{g}^2$).
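The table lookups in these finch questions can be reproduced with Python's standard library, which provides the normal cumulative distribution function through `statistics.NormalDist`. This is a minimal sketch rather than a replacement for Table A4; the helper names `z_score` and `upper_tail` are hypothetical, and the inputs are the finch values used above ($\mu$ = 17.2 g, $\sigma^2$ = 36 g², threshold 22 g).

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_score(value, mu, sigma):
    """Standardize a value: subtract the mean, divide by the standard deviation."""
    return (value - mu) / sigma


def upper_tail(z):
    """Area under N(0, 1) to the right of z."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


mu, variance, threshold = 17.2, 36.0, 22.0

# Individual finch: use the population standard deviation.
z_individual = z_score(threshold, mu, math.sqrt(variance))   # about 0.8
p_individual = upper_tail(z_individual)                      # about 0.2119

# By symmetry, the same area lies to the left of the mirrored weight.
lower_weight = mu - (threshold - mu)                         # about 12.4 g
p_lower = STANDARD_NORMAL.cdf(-z_individual)

# Mean of n = 20 finches: use the variance of the mean, sigma^2 / n.
n = 20
variance_of_mean = variance / n                              # 1.8 g^2
z_mean = z_score(threshold, mu, math.sqrt(variance_of_mean))
p_mean = upper_tail(z_mean)

print(f"individual: Z = {z_individual:.2f}, P = {p_individual:.4f}")
print(f"mean of {n}: Z = {z_mean:.2f}, P = {p_mean:.5f}")
```

The only difference between the two calculations is the denominator of the normal score: the standard deviation of individuals in the first case and the standard deviation of sample means in the second.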

With this in mind, we proceed as before:

$Z = \frac{\bar{Y} - \mu}{\sigma_{\bar{Y}(n=20)}} = \frac{22\ \mathrm{g} - 17.2\ \mathrm{g}}{\sqrt{1.8\ \mathrm{g}^2}} = 3.58$

From Table A4, we see that only 0.02% (approximately) of the area under N(0,1) lies to the right of $Z = 3.58$. Therefore, there is a mere 0.02% (2 in 10,000) chance of randomly selecting a sample of 20 finches with an average weight of more than 22 g from this population. In other words, 22 g is an extremely unusual mean weight for a sample of twenty finches in this population.

So, with a simple transformation of location and scale, any normal distribution, whether of individuals or of sample means, can be transformed to N(0,1), thereby allowing us to determine how unusual a given individual or sample is. Recall that the x-axis of the standard normal curve is in units of standard deviations. Our minds are not used to thinking in terms of units of dispersion, but it is an incredibly powerful way to think. To give you an intuitive feeling for such units, consider the following:

In a normal frequency distribution,
$\mu \pm 1\sigma$ contains 68.27% of the items
$\mu \pm 2\sigma$ contains 95.45% of the items
$\mu \pm 3\sigma$ contains 99.73% of the items

Thought of another way,
50% of the items fall between $\mu \pm 0.674\sigma$
95% of the items fall between $\mu \pm 1.960\sigma$
99% of the items fall between $\mu \pm 2.576\sigma$

[Figure: standard normal curve from -3 to 3 standard deviations, with the central regions containing 68.27%, 95.45%, and 99.73% of the area marked.]

With these basic benchmarks in place, the results from the two examples above make a lot of sense. A 22 g finch is not unusual because it is less than one standard deviation from the mean. But a sample of 20 finches with a mean weight of 22 g is highly unusual because this sample mean is over three standard errors from the mean.

One final word about the importance and wide applicability of the normal distribution: The central limit theorem states that, as sample size increases, the distribution of sample means drawn from a population of any distribution will approach a normal distribution with mean $\mu$ and variance $\sigma^2 / n$.

Testing for Normality [ST&D pp. 566-567]

One is justified in using Table A4 (and, as you will see, t-tests and ANOVAs) if and only if the population or sample under consideration is normal. Such statistical tables and techniques are said to assume normality of the data. Do not be misled by this use of the word.
As a user of these tables and techniques, you do not simply assume that your data are normal; you test for it. Normality is spoken of as an assumption, but in fact it is a criterion which must be met for the analysis to be valid. In this class, we will be using the Shapiro-Wilk test for assessing normality. See pages 566-567 in your text for a good description of this technique. Below is Figure 4. from your book (page 566), with some supplemental annotation and discussion to help in understanding the test:

[Figure 4. A normal probability plot (a.k.a. quantile-quantile or Q-Q plot). Annotations on the plot: "Expected" observed value of the highest quantile: $Y_{14} = \sigma Z_{14} + \mu = 1.227(1.80) + 75.943 = 78.15$. Actual observed value of the highest quantile: $Y_{14} = 77.7$; the gap between the two is the deviation from "normal". Observed normal score of the highest quantile: $Z_{14} = 1.43$. "Expected" normal score of the highest quantile: $Z_{14} \approx 1.80$ (see Table A.4). W = 0.9484, (Pr<W) = 0.509 > 0.05: fail to reject $H_0$; sample is "normal".]

This is a graphic tool for visualizing deviation from normality. The Shapiro-Wilk test assigns a probability to such deviation, providing an "objective" determination of normality.

Since the dataset consists of 14 values ($n = 14$) [77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4, 76.0, 76.0, 73.9, 77.4, 76.6, and 77.3], the area under N(0,1) is divided into 14 equal portions. This means that each portion, like the one indicated in orange above, has an area of 1/14 = 0.071 square units. The normal score ($Z$) which splits a portion in half (by area) is considered the "expected" value for that portion. In the figure above, this means that the yellow area and the blue area are equal (1/28 = 0.036 square units each) and the "expected" value of the second portion is -1.24. The 14 "expected" normal scores are then transformed to the original units of observation ($Y_i = \sigma Z_i + \mu$), thereby generating a perfectly straight line with slope $\sigma$ and intercept $\mu$.

So, each data point in the sample has a corresponding "expected" normal score (e.g. while the actual normal score for 77.7 is $Z = \frac{77.7 - 75.943}{1.227} = 1.43$, its expected normal score is $Z_{14} \approx 1.80$), and a normal probability plot is essentially just a scatterplot of these paired values.

You can think of the Shapiro-Wilk test as essentially a test for correlation to the normal line. If the sample is perfectly normal, the scatterplot of observed values vs. expected normal scores will fall exactly on the normal line and the Shapiro-Wilk test statistic W (similar to a correlation coefficient) will equal 1. Complete lack of correlation (i.e. a completely non-normal distribution) will yield a test statistic W equal to 0.

How much deviation is too much? For this class, to reject the null hypothesis of the test ($H_0$: The sample is from a normally distributed population), the probability of obtaining, by chance, a value of W from a truly normal distribution that is less than the observed value of W must be less than 5%. In this case, W = 0.9484 and Pr<W = 0.509 > 0.05; so we fail to reject $H_0$. There is no evidence, at this chosen level of significance, that the sample is from a non-normal population.
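The "expected" normal scores described above can be sketched in Python by taking, for each of the $n$ equal-area portions of N(0,1), the $Z$ value that splits the portion in half. The code below is an illustration under stated assumptions: the function names are hypothetical, and the correlation it reports is only an informal stand-in for how closely the points follow the normal line. It is not the Shapiro-Wilk W statistic, which requires the test's own coefficients (available, for example, through SciPy's `scipy.stats.shapiro`).

```python
import math
from statistics import NormalDist, mean, stdev

data = [77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2,
        75.4, 76.0, 76.0, 73.9, 77.4, 76.6, 77.3]


def expected_normal_scores(n):
    """Z value that splits each of n equal-area portions of N(0, 1) in half."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


observed = sorted(data)
scores = expected_normal_scores(len(data))

y_bar, s = mean(data), stdev(data)                    # about 75.943 and 1.227
expected_values = [s * z + y_bar for z in scores]     # points on the normal line
observed_top_score = (observed[-1] - y_bar) / s       # about 1.43

# Informal measure of agreement with the normal line (NOT Shapiro-Wilk W).
r = pearson_r(observed, scores)

print(f"expected score of highest quantile: {scores[-1]:.2f}")
print(f"expected value of highest quantile: {expected_values[-1]:.2f}")
print(f"correlation with expected scores:   {r:.3f}")
```

Plotting `observed` against `scores` gives the scatterplot of the normal probability plot, and `expected_values` against `scores` gives the straight normal line.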