Chapter 06.01 Statistics Background of Regression Analysis

After reading this chapter, you should be able to:
1. review the statistics background needed for learning regression, and
2. know a brief history of regression.

Review of Statistical Terminologies

Although the language of statistics may be used at an elementary and descriptive level in this chapter, it forms an integral part of our everyday discussions. When two friends talk about the weather (whether it will rain or not - probability), or the time it takes to drive from point A to point B (speed - mean or average), or baseball facts (all-time career RBI or home runs of a sportsman - sorting, range), or about class grades (lowest and highest score - range and sorting), they are invariably using statistical tools.

From the foregoing, it is imperative that we review some of the statistical terminologies that we may encounter in studying the topic of regression. Some key terms we need to review are sample, arithmetic mean (average), error or deviation, standard deviation, variance, coefficient of variation, probability, Gaussian or normal distribution, degrees of freedom, and hypothesis.

Elementary Statistics

A statistical sample is a fraction or a portion of the whole (population) that is studied. This is a concept that may be confusing to many and is best illustrated with examples. Consider that a chemical engineer is interested in understanding the relationship between the rate of a reaction and temperature. It is impractical for the engineer to test all possible and measurable temperatures. Apart from the fact that the instruments for temperature measurement have limited temperature ranges in which they can function, the sheer number of hours required to measure every possible temperature makes it impractical. What the engineer does is choose a temperature range (based on his/her knowledge of the chemistry of the system) in which to study. Within the chosen temperature range, the engineer further chooses specific temperatures that span the range at which to conduct the experiments. These chosen temperatures for study constitute the sample, while all possible temperatures are the population.
In statistics, the sample is the fraction of the population chosen for study.

The location of the center of a distribution - the mean or average - is an item of interest in our everyday lives. We use the concept when we talk about the average income, the class average for a test, the average height of some persons, or about one being overweight or not (based on the average weight expected of an individual with similar characteristics). The arithmetic mean of a sample is a measure of its central tendency and is evaluated by dividing the sum of individual data points by the number of points.

Consider Table 1, which lists 14 measurements of the concentration of sodium chlorate produced in a chemical reactor operated at a pH of 7.0.

Table 1 Chlorate ion concentration in mmol/cm^3.

12.0   15.0   14.1   15.9   11.5   14.8   11.2
13.7   15.9   12.6   14.3   12.6   12.1   14.8

The arithmetic mean $\bar{y}$ is mathematically defined as

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \tag{1}$$

which is the sum of the individual data points $y_i$ divided by the number of data points $n$.

One of the measures of the spread of the data is the range of the data. The range $R$ is defined as the difference between the maximum and minimum value of the data as

$$R = y_{\max} - y_{\min} \tag{2}$$

where
$y_{\max}$ is the maximum of the values of $y_i$, $i = 1, 2, \ldots, n$,
$y_{\min}$ is the minimum of the values of $y_i$, $i = 1, 2, \ldots, n$.

However, the range may not give a good idea of the spread of the data, as some data points may be far away from most other data points (such data points are called outliers). That is why the deviation from the average or arithmetic mean is looked at as a better way to measure the spread. The residual between the data point and the mean is defined as

$$e_i = y_i - \bar{y} \tag{3}$$

The difference of each data point from the mean can be negative or positive depending on which side of the mean the data point lies (recall the mean is centrally located), and hence if one calculates the sum of such differences to find the overall spread, the differences may simply cancel each other. That is why the sum of the squares of the differences is considered a better measure. The sum of the squares of the differences, also called the summed squared error (SSE), $S_t$, is given by

$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{4}$$

Since the magnitude of the summed squared error is dependent on the number of data points, an average value of the summed squared error is defined as the variance, $\sigma^2$

$$\sigma^2 = \frac{S_t}{n-1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} \tag{5}$$

The variance, $\sigma^2$, is sometimes written in two different convenient formulas as

$$\sigma^2 = \frac{\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}{n-1} \tag{6}$$

or

$$\sigma^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}{n-1} \tag{7}$$

However, why is the variance divided by $(n-1)$ and not $n$, as we have $n$ data points? This is because with the use of the mean in calculating the variance, we lose the independence of one of the data points. That is, if you know the mean of $n$ data points, then the value of one of the data points can be calculated by knowing the other $(n-1)$ data points.

To bring the variation back to the same level of units as the original data, a new term called the standard deviation, $\sigma$, is defined as

$$\sigma = \sqrt{\frac{S_t}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \tag{8}$$

Furthermore, the ratio of the standard deviation to the mean, known as the coefficient of variation, $c.v.$, is also used to normalize the spread of a sample.

$$c.v. = \frac{\sigma}{\bar{y}} \times 100 \tag{9}$$

Example 1

Use the data in Table 1 to calculate the
a) mean chlorate concentration,
b) range of data,
c) residual of each data point,
d) sum of the square of the residuals,
e) sample standard deviation,
f) variance, and
g) coefficient of variation.

Solution

Set up a table (see Table 2) containing the data, the residual for each data point, and the square of the residuals.

Table 2 Data and data summations for statistical calculations.

  i      y_i      y_i^2     y_i - ybar   (y_i - ybar)^2
  1     12.0     144        -1.6071      2.5829
  2     15.0     225         1.3929      1.9401
  3     14.1     198.81      0.4929      0.24291
  4     15.9     252.81      2.2929      5.2572
  5     11.5     132.25     -2.1071      4.4401
  6     14.8     219.04      1.1929      1.4229
  7     11.2     125.44     -2.4071      5.7943
  8     13.7     187.69      0.0929      0.0086224
  9     15.9     252.81      2.2929      5.2572
 10     12.6     158.76     -1.0071      1.0143
 11     14.3     204.49      0.6929      0.48005
 12     12.6     158.76     -1.0071      1.0143
 13     12.1     146.41     -1.5071      2.2715
 14     14.8     219.04      1.1929      1.4229
 Sum   190.50   2625.31      0.0000     33.149

a) Mean chlorate concentration, from Equation (1), is

$$\bar{y} = \frac{\sum_{i=1}^{14} y_i}{n} = \frac{190.5}{14} = 13.607$$

b) The range of data, as per Equation (2), is

$$R = y_{\max} - y_{\min} = 15.9 - 11.2 = 4.7$$

c) The residual at each point is shown in Table 2. For example, at the first data point, as per Equation (3),

$$e_1 = y_1 - \bar{y} = 12.0 - 13.607 = -1.6071$$

d) The sum of the square of the residuals, from Equation (4), is

$$S_t = \sum_{i=1}^{14} (y_i - \bar{y})^2 = 33.149 \quad \text{(see Table 2)}$$

e) The standard deviation, as per Equation (8), is

$$\sigma = \sqrt{\frac{\sum_{i=1}^{14} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{33.149}{14-1}} = 1.5969$$

f) The variance, from Equation (5), is

$$\sigma^2 = \frac{S_t}{n-1} = \frac{33.149}{14-1} = 2.5499$$

which is also the square of the standard deviation found in part e).
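As a cross-check on the hand calculations in this example, the following is a minimal Python sketch that evaluates Equations (1) through (9) for the Table 1 data. The function name `summarize` and the dictionary keys are hypothetical names chosen for illustration; only the Python standard library is assumed.

```python
import math

# Chlorate ion concentrations (mmol/cm^3) from Table 1.
y = [12.0, 15.0, 14.1, 15.9, 11.5, 14.8, 11.2,
     13.7, 15.9, 12.6, 14.3, 12.6, 12.1, 14.8]


def summarize(data):
    """Return the sample statistics of Equations (1)-(9) as a dict."""
    n = len(data)
    mean = sum(data) / n                              # Equation (1)
    data_range = max(data) - min(data)                # Equation (2)
    residuals = [value - mean for value in data]      # Equation (3)
    s_t = sum(e ** 2 for e in residuals)              # Equation (4)
    variance = s_t / (n - 1)                          # Equation (5)
    sum_of_squares = sum(value ** 2 for value in data)
    # Equations (6) and (7): algebraically equivalent forms of Equation (5).
    variance_eq6 = (sum_of_squares - sum(data) ** 2 / n) / (n - 1)
    variance_eq7 = (sum_of_squares - n * mean ** 2) / (n - 1)
    std_dev = math.sqrt(variance)                     # Equation (8)
    coeff_var = std_dev / mean * 100                  # Equation (9), percent
    return {
        "mean": mean,
        "range": data_range,
        "residuals": residuals,
        "s_t": s_t,
        "variance": variance,
        "variance_eq6": variance_eq6,
        "variance_eq7": variance_eq7,
        "std_dev": std_dev,
        "coeff_var": coeff_var,
    }


stats = summarize(y)
for name, value in stats.items():
    if name != "residuals":          # the residual list is printed separately
        print(f"{name}: {value:.4f}")
print("residuals:", [round(e, 4) for e in stats["residuals"]])
```

Because the code keeps the mean at full floating-point precision, the three variance expressions agree closely; in a hand calculation, Equation (7) is the most sensitive to rounding of the mean.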

The variance can also be calculated using Equation (6)

$$\sigma^2 = \frac{\sum_{i=1}^{14} y_i^2 - \dfrac{\left(\sum_{i=1}^{14} y_i\right)^2}{n}}{n-1} = \frac{2625.31 - \dfrac{(190.5)^2}{14}}{14-1} = 2.5499$$

or by using Equation (7)

$$\sigma^2 = \frac{\sum_{i=1}^{14} y_i^2 - n\bar{y}^2}{n-1} = \frac{2625.31 - 14 \times (13.607)^2}{14-1} = 2.5499$$

The result of Equation (7) is sensitive to rounding of the mean: the value 2.5499 is obtained with the unrounded mean $\bar{y} = 190.5/14$, whereas substituting the rounded value 13.607 literally gives about 2.554.

g) The coefficient of variation, $c.v.$, from Equation (9), is

$$c.v. = \frac{\sigma}{\bar{y}} \times 100 = \frac{1.5969}{13.607} \times 100 = 11.735\%$$

[Figure 1: chlorate concentration (mmol/cm^3) plotted against data point number, with reference lines at $\bar{y}$, $\bar{y} \pm \sigma$, and $\bar{y} \pm 2\sigma$.]

Figure 1 Chlorate concentration data points.

A Brief History of Regression

Anyone who is familiar with the Pearson Product Moment Correlation (PPMC) will no doubt associate regression principles with the name of Pearson. Although this association may be right, the concept of linear regression was largely due to the work of Galton, a cousin of Charles Darwin of the evolution theory fame. Sir Galton's work on inherited characteristics of sweet peas led to the initial conception of linear regression. His treatment of regression was not mathematically rigorous. The mathematical rigor and subsequent development of multiple regression were due largely to the contributions of his assistant and co-worker, Karl Pearson.

It is, however, instructive to note for historical accuracy that the development of regression could be attributed to the attempt at answering the question of heredity: how and what characteristics offspring acquire from their progenitors. Sweet peas were used by Galton in his observations of characteristics of the next generations of a given species. Despite his poor choice of descriptive statistics and limited mathematical rigor, Galton was able to generalize his work over a variety of hereditary problems. He further arrived at the idea that the differences in regression slopes were due to differences in variability between different sets of measurements. In today's appreciation of this, one can say that Galton recognized that the ratio of variability of two measures was a key factor in determining the slope of the regression line.

The first rigorous treatment of correlation and regression was the work of Pearson in 1896. In the paper in the Philosophical Transactions of the Royal Society of London, Pearson showed that the optimum values of both the regression slope and the correlation coefficient for a straight line could be evaluated from the product-moment, $\sum (x - \bar{x})(y - \bar{y})/n$, where $\bar{x}$ and $\bar{y}$ are the means of the observed $x$ and $y$ values, respectively. In the 1896 paper, Pearson attributed the initial mathematical formula for correlation to Auguste Bravais' work fifty years earlier. Pearson stated that although Bravais did demonstrate the use of the product-moment for calculating the correlation coefficient, he did not show that it provided the best fit for the data.

REGRESSION
Topic: Statistics Background for Regression
Summary: Textbook notes for the background of regression
Major: All engineering majors
Authors: Egwu Kalu, Autar Kaw
Date: October 2008
Web Site: http://numericalmethods.eng.usf.edu