Chapter 06.01 Statistics Background of Regression Analysis


After reading this chapter, you should be able to:
1. review the statistics background needed for learning regression, and
2. know a brief history of regression.

Review of Statistical Terminologies

Although the language of statistics may be used at an elementary and descriptive level in this chapter, it makes an integral part of our everyday discussions. When two friends talk about the weather (whether it will rain or not - probability), or the time it takes to drive from point A to point B (speed - mean or average), or baseball facts (all-time career RBI or home runs of a sportsman - sorting, range), or about class grades (lowest and highest score - range and sorting), they are invariably using statistical tools. From the foregoing, it is imperative then that we review some of the statistical terminologies that we may encounter in studying the topic of regression. Some key terms we need to review are sample, arithmetic mean (average), error or deviation, standard deviation, variance, coefficient of variation, probability, Gaussian or normal distribution, degrees of freedom, and hypothesis.

Elementary Statistics

A statistical sample is a fraction or a portion of the whole (population) that is studied. This is a concept that may be confusing to many and is best illustrated with examples. Consider that a chemical engineer is interested in understanding the relationship between the rate of a reaction and temperature. It is impractical for the engineer to test all possible and measurable temperatures. Apart from the fact that the instruments for temperature measurement have limited temperature ranges in which they can function, the sheer number of hours required to measure every possible temperature makes it impractical. What the engineer does is choose a temperature range (based on his/her knowledge of the chemistry of the system) in which to study. Within the chosen temperature range, the engineer further chooses specific temperatures that span the range at which to conduct the experiments. These chosen temperatures for study constitute the sample while all possible temperatures are the population.
In statistics, the sample is the fraction of the population chosen for study.

The location of the center of a distribution - the mean or average - is an item of interest in our everyday lives. We use the concept when we talk about the average income, the class average for a test, the average height of some persons, or about one being overweight or not (based on the average weight expected of an individual with similar characteristics). The arithmetic mean of a sample is a measure of its central tendency and is evaluated by dividing the sum of the individual data points by the number of points.

Consider Table 1, which gives 14 measurements of the concentration of sodium chlorate produced in a chemical reactor operated at a pH of 7.0.

Table 1 Chlorate ion concentration in mmol/cm^3.
12.0   15.0   14.1   15.9   11.5   14.8   11.2
13.7   15.9   12.6   14.3   12.6   12.1   14.8

The arithmetic mean $\bar{y}$ is mathematically defined as
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \tag{1}$$
which is the sum of the individual data points $y_i$ divided by the number of data points $n$.

One of the measures of the spread of the data is the range of the data. The range $R$ is defined as the difference between the maximum and minimum values of the data as
$$R = y_{\max} - y_{\min} \tag{2}$$
where
$y_{\max}$ is the maximum of the values of $y_i$, $i = 1, 2, \ldots, n$,
$y_{\min}$ is the minimum of the values of $y_i$, $i = 1, 2, \ldots, n$.

However, the range may not give a good idea of the spread of the data, as some data points may be far away from most other data points (such data points are called outliers). That is why the deviation from the average or arithmetic mean is regarded as a better way to measure the spread. The residual between a data point and the mean is defined as
$$e_i = y_i - \bar{y} \tag{3}$$
The difference of each data point from the mean can be negative or positive depending on which side of the mean the data point lies (recall the mean is centrally located). Hence, if one calculates the sum of such differences to find the overall spread, the differences cancel each other; their sum is in fact always zero. That is why the sum of the squares of the differences is considered a better measure. The sum of the squares of the differences, also called the summed squared error (SSE), $S_t$, is given by
$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{4}$$
Since the magnitude of the summed squared error depends on the number of data points, an average value of the summed squared error is defined as the variance, $\sigma^2$
$$\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = \frac{S_t}{n-1} \tag{5}$$
The variance $\sigma^2$ is sometimes written in two different, convenient forms as

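$$\sigma^2 = \frac{\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}{n-1} \tag{6}$$
or
$$\sigma^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}{n-1} \tag{7}$$
However, why is the variance divided by $(n-1)$ and not $n$, as we have $n$ data points? This is because with the use of the mean in calculating the variance, we lose the independence of one of the data points. That is, if you know the mean of $n$ data points, then the value of one of the data points can be calculated by knowing the other $(n-1)$ data points.

To bring the variation back to the same level of units as the original data, a new term called the standard deviation, $\sigma$, is defined as
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{S_t}{n-1}} \tag{8}$$
Furthermore, the ratio of the standard deviation to the mean, known as the coefficient of variation $c.v$, is also used to normalize the spread of a sample.
$$c.v = \frac{\sigma}{\bar{y}} \times 100 \tag{9}$$

Example 1

Use the data in Table 1 to calculate the
a) mean chlorate concentration,
b) range of data,
c) residual of each data point,
d) sum of the square of the residuals,
e) sample standard deviation,
f) variance, and
g) coefficient of variation.

Solution

Set up a table (see Table 2) containing the data, the residual for each data point and the square of the residuals.

Table 2 Data and data summations for statistical calculations.
 i      y_i      y_i^2     y_i - ybar   (y_i - ybar)^2
 1     12.0     144.00     -1.6071      2.5829
 2     15.0     225.00      1.3929      1.9401
 3     14.1     198.81      0.4929      0.24291
 4     15.9     252.81      2.2929      5.2572
 5     11.5     132.25     -2.1071      4.4401
 6     14.8     219.04      1.1929      1.4229
 7     11.2     125.44     -2.4071      5.7943
 8     13.7     187.69      0.0929      0.0086224
 9     15.9     252.81      2.2929      5.2572
10     12.6     158.76     -1.0071      1.0143
11     14.3     204.49      0.6929      0.48005
12     12.6     158.76     -1.0071      1.0143
13     12.1     146.41     -1.5071      2.2715
14     14.8     219.04      1.1929      1.4229
Sum   190.50   2625.31      0.0000     33.149

a) Mean chlorate concentration as from Equation (1)
$$\bar{y} = \frac{\sum_{i=1}^{14} y_i}{14} = \frac{190.5}{14} = 13.607$$
b) The range of data as per Equation (2) is
$$R = y_{\max} - y_{\min} = 15.9 - 11.2 = 4.7$$
c) The residual at each point is shown in Table 2. For example, at the first data point as per Equation (3)
$$e_1 = y_1 - \bar{y} = 12.0 - 13.607 = -1.607$$
d) The sum of the square of the residuals as from Equation (4) is
$$S_t = \sum_{i=1}^{14} (y_i - \bar{y})^2 = 33.149 \quad \text{(see Table 2)}$$
e) The standard deviation as per Equation (8) is
$$\sigma = \sqrt{\frac{\sum_{i=1}^{14} (y_i - \bar{y})^2}{14-1}} = \sqrt{\frac{33.149}{13}} = 1.5969$$
f) The variance is calculated as from Equation (5), and is the square of the standard deviation from part e)
$$\sigma^2 = \frac{S_t}{n-1} = \frac{33.149}{14-1} = 2.5499$$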
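As a cross-check on parts a) through f), the same quantities can be computed directly from Equations (1) to (5) and (8). The Python sketch below is not part of the original notes; the function name `summarize` and its dictionary keys are illustrative assumptions, and only the standard library is used.

```python
import math

# Table 1 data: chlorate ion concentration in mmol/cm^3
y = [12.0, 15.0, 14.1, 15.9, 11.5, 14.8, 11.2,
     13.7, 15.9, 12.6, 14.3, 12.6, 12.1, 14.8]


def summarize(data):
    """Return basic sample statistics following Equations (1)-(5) and (8)."""
    n = len(data)
    mean = sum(data) / n                      # Equation (1)
    data_range = max(data) - min(data)        # Equation (2)
    residuals = [v - mean for v in data]      # Equation (3)
    s_t = sum(e * e for e in residuals)       # Equation (4)
    variance = s_t / (n - 1)                  # Equation (5), divides by n - 1
    std_dev = math.sqrt(variance)             # Equation (8)
    return {
        "mean": mean,
        "range": data_range,
        "residuals": residuals,
        "S_t": s_t,
        "variance": variance,
        "std_dev": std_dev,
    }


stats = summarize(y)
for key in ("mean", "range", "S_t", "std_dev", "variance"):
    print(f"{key:>8s} = {stats[key]:.4f}")
```

The residuals are kept as a list so they can be compared row by row with Table 2. Up to floating-point rounding, they sum to zero, which is the cancellation that motivates squaring them.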

The variance can also be calculated using Equation (6)
$$\sigma^2 = \frac{\sum_{i=1}^{14} y_i^2 - \dfrac{\left(\sum_{i=1}^{14} y_i\right)^2}{14}}{14-1} = \frac{2625.31 - \dfrac{(190.5)^2}{14}}{14-1} = 2.5499$$
or by using Equation (7)
$$\sigma^2 = \frac{\sum_{i=1}^{14} y_i^2 - 14\,\bar{y}^2}{14-1} = \frac{2625.31 - 14(13.607)^2}{14-1} = 2.5499$$
In Equation (7) the mean must be carried at full precision ($\bar{y} = 190.5/14$). Rounding it to 13.607 before squaring gives 2.5541 instead, because the numerator is a small difference of two large numbers.

g) The coefficient of variation, $c.v$, as from Equation (9) is
$$c.v = \frac{\sigma}{\bar{y}} \times 100 = \frac{1.5969}{13.607} \times 100 = 11.735\%$$

Figure 1 Chlorate concentration data points. (Vertical axis: chlorate concentration in mmol/cm^3; horizontal axis: data point; reference lines at $\bar{y}$, $\bar{y} \pm \sigma$ and $\bar{y} \pm 2\sigma$.)

A Brief History of Regression

Anyone who is familiar with the Pearson Product Moment Correlation (PPMC) will no doubt associate regression principles with the name of Pearson. Although this association may be right, the concept of linear regression was largely due to the work of Galton, a cousin of Charles Darwin of evolution theory fame. Sir Francis Galton's work on inherited characteristics of sweet peas led to the initial conception of linear regression. His treatment of regression was not mathematically rigorous. The mathematical rigor and subsequent development of multiple regression were due largely to the contributions of his assistant and co-worker, Karl Pearson.

It is however instructive to note, for historical accuracy, that the development of regression could be attributed to the attempt at answering the question of heredity: how and what characteristics offspring acquire from their progenitors. Sweet peas were used by Galton in his observations of characteristics of the next generations of a given species. Despite his poor choice of descriptive statistics and limited mathematical rigor, Galton was able to generalize his work over a variety of hereditary problems. He further arrived at the idea that the differences in regression slopes were due to differences in variability between different sets of measurements. In today's appreciation of this, one can say that Galton recognized that the ratio of variability of two measures was a key factor in determining the slope of the regression line.

The first rigorous treatment of correlation and regression was the work of Pearson in 1896. In the paper in the Philosophical Transactions of the Royal Society of London, Pearson showed that the optimum values of both the regression slope and the correlation coefficient for a straight line could be evaluated from the product-moment, $\sum (x - \bar{x})(y - \bar{y})/n$, where $\bar{x}$ and $\bar{y}$ are the means of the observed $x$ and $y$ values, respectively. In the 1896 paper, Pearson attributed the initial mathematical formula for correlation to Auguste Bravais's work fifty years earlier. Pearson stated that although Bravais did demonstrate the use of the product-moment for calculating the correlation coefficient, he did not show that it provided the best fit for the data.

REGRESSION
Topic: Statistics Background for Regression
Summary: Textbook notes for the background of regression
Major: All engineering majors
Authors: Egwu Kalu, Autar Kaw
Date: October, 2008
Web Site: http://numericalmethods.eng.usf.edu