Chapter 3: Linear Regression and Correlation
Descriptive Analysis & Presentation of Two Quantitative Data

Chapter Objectives
- To be able to present two-variable data in tabular and graphic form
- Display the relationship between two quantitative variables graphically using a scatter diagram
- Calculate and interpret the linear correlation coefficient
- Discuss the basic idea of fitting the scatter diagram with a best-fitted line, called a linear regression line
- Create and interpret the linear regression line

Terminology
- Data for a single variable is univariate data
- Many or most real-world models have more than one variable: multivariate data
- In this chapter we will study the relations between two variables: bivariate data

Bivariate Data
- In many studies, we measure more than one variable for each individual
- Some examples are:
  - Rainfall amounts and plant growth
  - Exercise and cholesterol levels for a group of people
  - Height and weight for a group of people

Types of Relations
When we have two variables, they could be related in one of several different ways:
- They could be unrelated
- One variable (the input, explanatory, or predictor variable) could be used to explain the other (the output, response, or dependent variable)
- One variable could be thought of as causing the other variable to change

Lurking Variable
- Sometimes it is not clear which variable is the explanatory variable and which is the response variable
- Sometimes the two variables are related without either one being an explanatory variable
- Sometimes the two variables are both affected by a third variable, a lurking variable, that has not been included in the study
- Note: When two variables are related to each other, one variable may not cause the change of the other variable. Relation does not always mean causation.
Example 1
An example of a lurking variable:
- A researcher studies a group of elementary school children
- Y = the student's height
- X = the student's shoe size
- It is not reasonable to claim that shoe size causes height to change
- The lurking variable of age affects both of these two variables

More Examples
- Rainfall amounts and plant growth
  - Explanatory variable: rainfall
  - Response variable: plant growth
  - Possible lurking variable: amount of sunlight
- Exercise and cholesterol levels
  - Explanatory variable: amount of exercise
  - Response variable: cholesterol level
  - Possible lurking variable: diet

Types of Bivariate Data
Three combinations of variable types:
1. Both variables are qualitative (attribute)
2. One variable is qualitative (attribute) and the other is quantitative (numerical)
3. Both variables are quantitative (both numerical)

Two Qualitative Variables
When bivariate data results from two qualitative (attribute or categorical) variables, the data is often arranged on a cross-tabulation or contingency table.

Example: A survey was conducted to investigate the relationship between preferences for television, radio, or newspaper (NP) for national news, and gender. The results are given in the table below:

             TV    Radio    NP
Male        280     175    305
Female      115     275    170

Marginal Totals
This table may be extended to display the marginal totals (or marginals). The total of the marginal totals is the grand total:

              TV    Radio    NP    Row Totals
Male         280     175    305       760
Female       115     275    170       560
Col. Totals  395     450    475      1320

Note: Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.

Percentages Based on the Grand Total (Entire Sample)
The previous contingency table may be converted to percentages of the grand total by dividing each frequency by the grand total and multiplying by 100. For example, 175 becomes 13.3%:

$\frac{175}{1320} \times 100 \approx 13.3$

               TV    Radio     NP    Row Totals
Male          21.2    13.3    23.1      57.6
Female         8.7    20.8    12.9      42.4
Col. Totals   29.9    34.1    36.0     100.0
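The conversion to percentages of the grand total can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary layout and the names `counts` and `percent_of_grand_total` are assumptions made here, not something the slides prescribe.

```python
# Survey counts from the news-preference example
# (rows: gender, columns: medium; NP = newspaper).
counts = {
    "Male":   {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}

# Grand total: the sum of every cell in the table.
grand_total = sum(sum(row.values()) for row in counts.values())


def percent_of_grand_total(frequency, total=grand_total):
    """Convert a cell frequency to a percentage of the grand total."""
    return round(100 * frequency / total, 1)


# Apply the conversion to every cell of the table.
percentages = {
    gender: {medium: percent_of_grand_total(f) for medium, f in row.items()}
    for gender, row in counts.items()
}

for gender, row in percentages.items():
    print(gender, row)
```

Dividing by a row total or a column total instead of `grand_total` gives the row-based or column-based percentages in the same way.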
Illustration
These same statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph:

[Figure: side-by-side bar graph, "Percentages Based on Grand Total"; horizontal axis: Media (TV, Radio, NP); vertical axis: Percent; separate bars for Male and Female]

Percentages Based on Row (Column) Totals
The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row's (column's) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals:

               TV     Radio     NP     Row Totals
Male          70.9    38.9     64.2      57.6
Female        29.1    61.1     35.8      42.4
Col. Totals  100.0   100.0    100.0     100.0

Note: These statistics may also be displayed in a side-by-side bar graph.

One Qualitative & One Quantitative Variable
1. When bivariate data results from one qualitative and one quantitative variable, the quantitative values are viewed as separate samples
2. Each set is identified by levels of the qualitative variable
3. Each sample is described using summary statistics, and the results are displayed for side-by-side comparison
4. Statistics for comparison: measures of central tendency, measures of variation, 5-number summary
5. Graphs for comparison: side-by-side stemplot and boxplot

Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded. The data is given in the table below:

Northeast: 3.75 40.50 33.65 31.5 4.55 50.60 37.70 31.55 38.85 1.5
Midwest:   34.38 34.35 39.15 37.1 36.71 34.39 35.1 35.80 37.4 40.01
West:      54.54 65.60 59.78 45.1 60.35 61.53 5.79 47.37 59.64 37.40

The part of the country is a qualitative variable with three levels of response. The electric bill is a quantitative variable. The electric bills may be compared with numerical and graphical techniques.

Comparison Using Box-and-Whisker Plots

[Figure: side-by-side box-and-whisker plots, "The Monthly Electric Bill"; vertical axis: Electric Bill; groups: Northeast, Midwest, West]

The electric bills in the Northeast tend to be more spread out than those in the Midwest.
The bills in the West tend to be higher than those in both the Northeast and the Midwest.

Descriptive Statistics for Two Quantitative Variables
Scatter Diagrams and Correlation Coefficient
Two Quantitative Variables
- The most useful graph to show the relationship between two quantitative variables is the scatter diagram
- Each individual is represented by a point in the diagram
- The explanatory (X) variable is plotted on the horizontal scale
- The response (Y) variable is plotted on the vertical scale

Example
In a study involving children's fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below:

Age (x):  8 9 9 11 9 8 9 8 11
CMFS (y): 31 5 40 7 35 9 5 34 44 19

Age (x):  7 6 6 8 9 1 15 13
CMFS (y): 8 47 4 37 35 16 1 3 6 36

Construct a scatter diagram for this data.

Solution
age = input variable, CMFS = output variable

[Figure: scatter diagram, "Child Medical Fear Scale"; horizontal axis: Age; vertical axis: CMFS]

Note: the vertical scale is truncated to show the relation in more detail.

Another Example
An example of a scatter diagram:

[Figure: scatter diagram]

Types of Relations
There are several different types of relations between two variables:
- A relationship is linear when, plotted on a scatter diagram, the points follow the general pattern of a line
- A relationship is nonlinear when, plotted on a scatter diagram, the points follow a general pattern, but it is not a line
- A relationship has no correlation when, plotted on a scatter diagram, the points do not show any pattern

Linear Correlations
- Linear relations or linear correlations have points that cluster around a line
- Linear relations can be either positive (the points slant upward to the right) or negative (the points slant downward to the right)
Positive Correlations
For positive (linear) correlation:
- Above-average values of one variable are associated with above-average values of the other (above/above, the points trend right and upwards)
- Below-average values of one variable are associated with below-average values of the other (below/below, the points trend left and downwards)

Example: Positive Correlation
As x increases, y also increases:

[Figure: scatter diagram with an upward trend; horizontal axis: Input; vertical axis: Output]

Negative Correlations
For negative (linear) correlation:
- Above-average values of one variable are associated with below-average values of the other (above/below, the points trend right and downwards)
- Below-average values of one variable are associated with above-average values of the other (below/above, the points trend left and upwards)

Example: Negative Correlation
As x increases, y decreases:

[Figure: scatter diagram with a downward trend; horizontal axis: Input; vertical axis: Output]

Nonlinear Correlations
- Nonlinear relations have points that have a trend, but not around a line
- The trend has some bend in it

No Correlations
When two variables are not related:
- There is no linear trend
- There is no nonlinear trend
- Changes in values for one variable do not seem to have any relation with changes in the other
Example: No Correlation
As x increases, there is no definite shift in y:

[Figure: scatter diagram with no visible pattern; horizontal axis: Input; vertical axis: Output]

Distinction between Nonlinear & No Correlation
- Nonlinear relations and no relations are very different
- Nonlinear relations are definitely patterns, just not patterns that look like lines
- No relations are when no patterns appear at all

Example
Examples of nonlinear relations:
- Age and height for people (including both children and adults)
- Temperature and comfort level for people
Examples of no relations:
- Temperature and closing price of the Dow Jones Industrials Index (probably)
- Age and last digit of telephone number for adults

Please Note
- Perfect positive correlation: all the points lie along a line with positive slope
- Perfect negative correlation: all the points lie along a line with negative slope
- If the points lie along a horizontal or vertical line: no correlation
- If the points exhibit some other nonlinear pattern: nonlinear relationship
- We need some way to measure the strength of correlation

Measure of Linear Correlation

Linear Correlation Coefficient
- The linear correlation coefficient is a measure of the strength of the linear relation between two quantitative variables
- The sample correlation coefficient r is

$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$

Note: $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations, of the two variables X and Y.
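The definition of r can be followed step by step in code. The sketch below is a plain-Python illustration under stated assumptions: the function names and the five data pairs are hypothetical and chosen only to keep the arithmetic small.

```python
import math


def sample_mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = sample_mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Linear correlation coefficient r, computed from the definition."""
    n = len(xs)
    x_bar, y_bar = sample_mean(xs), sample_mean(ys)
    # Numerator: sum of the products of the deviations from the means.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sample_std(xs) * sample_std(ys))


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 2))
```

Points that lie exactly on a line with positive slope, such as (1, 2), (2, 4), (3, 6), give r = 1 with this function, matching the description of perfect positive correlation.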
Properties of Linear Correlation Coefficients
Some properties of the linear correlation coefficient:
- r is a unitless measure (so that r would be the same for a data set whether x and y are measured in feet, inches, meters, etc.)
- r is always between −1 and +1
  - r = −1: perfect negative correlation
  - r = +1: perfect positive correlation
- Positive values of r correspond to positive relations
- Negative values of r correspond to negative relations

Various Expressions for r
There are other equivalent expressions for the linear correlation coefficient r, as shown below:

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$

However, it is much easier to compute r using the short-cut formula.

Short-Cut Formula for r

$r = \frac{SS(xy)}{\sqrt{SS(x)\, SS(y)}}$

where

$SS(x) = \text{sum of squares for } x = \sum x^2 - \frac{(\sum x)^2}{n}$

$SS(y) = \text{sum of squares for } y = \sum y^2 - \frac{(\sum y)^2}{n}$

$SS(xy) = \text{sum of squares for } xy = \sum xy - \frac{\sum x \sum y}{n}$

Example
The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient:

        x       y      x^2      y^2       xy
       2.5     40     6.25     1600     100.0
       3.0     43     9.00     1849     129.0
       4.0     30    16.00      900     120.0
       3.5     35    12.25     1225     122.5
       2.7     42     7.29     1764     113.4
       4.5     19    20.25      361      85.5
       3.8     32    14.44     1024     121.6
       2.9     39     8.41     1521     113.1
       5.0     15    25.00      225      75.0
       2.2     14     4.84      196      30.8
Sum   34.1    309   123.73    10665    1010.9

Completing the Calculation for r

$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$

$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$

$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$

$r = \frac{SS(xy)}{\sqrt{SS(x)\, SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$

Please Note
- r is usually rounded to the nearest hundredth
- r close to 0: little or no linear correlation
- As the magnitude of r increases, towards −1 or +1, there is an increasingly stronger linear correlation between the two variables
- We'll also learn to obtain the linear correlation coefficient from the graphing calculator
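The short-cut calculation for the ten automobiles can be checked with a short script. This is a minimal sketch: the helper name `ss` and the variable names are illustrative assumptions, and the single helper relies on the fact that SS(x) is the same expression as SS(xy) with y replaced by x.

```python
import math

# Weight (thousands of pounds) and gasoline mileage (miles per gallon)
# for the ten automobiles in the example.
weights = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]
mileages = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]


def ss(xs, ys):
    """Short-cut sum of squares: sum(xy) - sum(x) * sum(y) / n.

    Calling ss(xs, xs) gives SS(x); ss(xs, ys) gives SS(xy).
    """
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


ss_x = ss(weights, weights)
ss_y = ss(mileages, mileages)
ss_xy = ss(weights, mileages)

# Short-cut formula for the linear correlation coefficient.
r = ss_xy / math.sqrt(ss_x * ss_y)

print(round(ss_x, 3), round(ss_y, 1), round(ss_xy, 2), round(r, 2))
```

The negative sign of r agrees with what one would expect from the data: heavier cars tend to have lower mileage.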
Positive Correlation Coefficients
Examples of positive correlation:

[Figure: three scatter diagrams: Strong Positive (r = .8), Moderate Positive (r = .5), Very Weak (r = .1)]

Negative Correlation Coefficients
Examples of negative correlation:

[Figure: three scatter diagrams: Strong Negative (r = −.8), Moderate Negative (r = −.5), Very Weak (r = −.1)]

In general, if the correlation is visible to the eye, then it is likely to be strong.

Nonlinear versus No Correlation
Nonlinear correlation and no correlation:

[Figure: two scatter diagrams: Nonlinear Relation and No Relation]

Both sets of variables have r = 0.1, but the difference is that the nonlinear relation shows a clear pattern.

Interpret the Linear Correlation Coefficients
- Correlation is not causation!
- Just because two variables are correlated does not mean that one causes the other to change
- There is a strong correlation between shoe sizes and vocabulary sizes for grade school children
  - Clearly larger shoe sizes do not cause larger vocabularies
  - Clearly larger vocabularies do not cause larger shoe sizes
- Often lurking variables result in confounding

How to Determine a Linear Correlation?
- How large does the correlation coefficient have to be before we can say that there is a relation?
- We're not quite ready to answer that question

Summary
- Correlation between two variables can be described with both visual and numeric methods
- Visual methods
  - Scatter diagrams
  - Analogous to histograms for single variables
- Numeric methods
  - Linear correlation coefficient
  - Analogous to mean and variance for single variables
- Care should be taken in the interpretation of linear correlation (nonlinearity and causation)
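The caution about nonlinearity can be made concrete. In the sketch below, which uses hypothetical data chosen for illustration, y is completely determined by x through y = x², yet the linear correlation coefficient is zero because the pattern is symmetric and not a line.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r via sums of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_x * ss_y)


# Hypothetical data: a perfect nonlinear (parabolic) relation.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(r)
```

A value of r near 0 therefore says only that there is little linear relation; a scatter diagram is still needed to tell a nonlinear relation from no relation.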
Linear Regression Line

Learning Objectives
- Find the regression line to fit the data and use the line to make predictions
- Interpret the slope and the y-intercept of the regression line
- Compute the sum of squared residuals

Regression Analysis
- Regression analysis finds the equation of the line that best describes the relationship between two variables
- One use of this equation: to make predictions

Best Fitted Line
- If we have two variables X and Y which tend to be linearly correlated, we often would like to model the relation with a line that best fits the data
- Draw a line through the scatter diagram
- We want to find the line that best describes the linear relationship: the regression line

Residuals
- One difference between math and stat is that statistics assumes that the measurements are not exact, that there is an error or residual
- The formula for the residual is always
  Residual = Observed − Predicted
- This relationship is not just for this chapter; it is the general way of defining error in statistics

What is a Residual?
Here is a residual shown on the scatter diagram:

[Figure: scatter diagram marking the regression line, the observed value $y$, the predicted value $\hat{y}$, the x value of interest, and the residual]
Example
- For example, say that we want to predict a value of y for a specific value of x
- Assume that we are using $\hat{y} = 10x + 25$ as our model
- To predict the value of y when x = 3, the model gives us $\hat{y} = 10 \cdot 3 + 25 = 55$, or a predicted value of 55
- Assume the actual value of y for x = 3 is equal to 50
- The actual value is 50, the predicted value is 55, so the residual (or error) is 50 − 55 = −5

Method of Least Squares
We want to minimize the prediction errors or residuals, but we need to define what this means. We use the method of least squares, which involves the following 3 steps:
1. We consider a possible linear model to fit the data
2. We calculate the residual for each point
3. We add up the squares of the residuals
(We square all of the residuals to avoid the cancellation of positive and negative residuals, since some observed values are underpredicted and some are overpredicted by the proposed linear model.)

The line that has the smallest overall residuals (i.e., the smallest sum of the squares of the residuals) is called the least-squares regression line, or simply the regression line, which is the best-fitted line to the data.

Method of Least Squares
Assume the equation of the best-fitting line:

$\hat{y} = b_0 + b_1 x$

where $\hat{y}$ (called "y hat") denotes the predicted value of y.

Least squares method: find the constants $b_0$ and $b_1$ such that the sum of the overall squared prediction errors

$\sum (y - \hat{y})^2 = \sum \left( y - (b_0 + b_1 x) \right)^2$

is as small as possible.

Illustration
Observed and predicted values of y:

[Figure: the line $\hat{y} = b_0 + b_1 x$ with an observed point $(x, y)$, the corresponding predicted point $(x, \hat{y})$, and the vertical distance $y - \hat{y}$ between them]

Linear Regression Line
The equation for the regression line is given by

$\hat{y} = b_0 + b_1 x$

- $\hat{y}$ denotes the predicted value for the response variable
- $b_1$ is the slope of the least-squares regression line
- $b_0$ is the y-intercept of the least-squares regression line
Note: Different textbooks may use different notations for the slope and the intercept.
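The three least-squares steps (propose a line, compute each residual, add up the squared residuals) can be sketched as follows. The four data points and both candidate lines are hypothetical; for this particular data set the second candidate, $\hat{y} = 0.5 + 1.4x$, happens to be the least-squares line, so its sum of squared residuals is the smaller of the two.

```python
def residuals(b0, b1, xs, ys):
    """Residual = observed - predicted, for each data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(b0, b1, xs, ys):
    """Least-squares criterion for the candidate line y_hat = b0 + b1 * x."""
    return sum(e ** 2 for e in residuals(b0, b1, xs, ys))


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

# Candidate A: y_hat = 1 + 1.0x (fits the first two points exactly).
sse_a = sum_squared_residuals(1.0, 1.0, xs, ys)

# Candidate B: y_hat = 0.5 + 1.4x (the least-squares line for this data).
sse_b = sum_squared_residuals(0.5, 1.4, xs, ys)

print(round(sse_a, 2), round(sse_b, 2))
```

Candidate A has two zero residuals but still loses overall, which shows why the criterion looks at the total of the squared residuals and not at how many points the line passes through.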
Find the Equation of a Linear Regression Line
The equation is determined by:
- $b_0$: y-intercept
- $b_1$: slope

Values that satisfy the least squares criterion:

$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS(xy)}{SS(x)}$

$b_0 = \frac{\sum y - b_1 \sum x}{n} = \bar{y} - b_1 \bar{x}$
Example
A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals:

x:  31  33  22  24  35  29  23  37
y:  17  20  13  15  18  17  12  21

1) Draw a scatter diagram for this data
2) Find the equation of the line of best fit (i.e., the regression line)

Finding b1 & b0
Preliminary calculations needed to find $b_1$ and $b_0$:

        x      y     x^2     xy
       23     12     529    276
       31     17     961    527
       33     20    1089    660
       22     13     484    286
       24     15     576    360
       35     18    1225    630
       29     17     841    493
       37     21    1369    777
Sum   234    133    7074   4009

Linear Regression Line

$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 7074 - \frac{(234)^2}{8} = 229.5$

$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 4009 - \frac{(234)(133)}{8} = 118.75$

$b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174$

$b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$ (carrying the unrounded value of $b_1$)

Solution 1)

[Figure: scatter diagram, "Job Satisfaction Survey"; horizontal axis: Salary; vertical axis: Job Satisfaction]

Solution 2)
Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$

Please Note
- Keep at least three extra decimal places while doing the calculations to ensure an accurate answer
- When rounding off the calculated values of $b_0$ and $b_1$, always keep at least two significant digits in the final answer
- The slope $b_1$ represents the predicted change in y per unit increase in x
- The y-intercept is the value of y where the line of best fit intersects the y-axis. That is, it is the predicted value of y when x is zero
- The line of best fit will always pass through the point $(\bar{x}, \bar{y})$

Please Note
- Finding the values of $b_1$ and $b_0$ is a very tedious process
- We should also know how to use a graphing calculator for this
- Finding the coefficients $b_1$ and $b_0$ is only the first step of a regression analysis
  - We need to interpret the slope $b_1$
  - We need to interpret the y-intercept $b_0$
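The tedious part of this example can be delegated to a short script. The sketch below applies the $SS(xy)/SS(x)$ formulas to the job satisfaction data; the function name `fit_line` and the variable names are illustrative assumptions, not part of the slides.

```python
# Salaries (x) and job satisfaction scores (y) from the example,
# listed in the same order as the calculation table.
salaries = [23, 31, 33, 22, 24, 35, 29, 37]
satisfaction = [12, 17, 20, 13, 15, 18, 17, 21]


def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b1 = ss_xy / ss_x                # slope
    b0 = (sum_y - b1 * sum_x) / n    # y-intercept
    return b0, b1


b0, b1 = fit_line(salaries, satisfaction)
print(round(b0, 4), round(b1, 4))

# The fitted line passes through the point of means (x_bar, y_bar).
x_bar = sum(salaries) / len(salaries)
y_bar = sum(satisfaction) / len(satisfaction)
print(round(b0 + b1 * x_bar, 4), round(y_bar, 4))
```

Because the script carries full precision for the slope before computing the intercept, it follows the advice about keeping extra decimal places during the calculation.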
Making Predictions
1. One of the main purposes for obtaining a regression equation is for making predictions
2. For a given value of x, we can predict a value of $\hat{y}$
3. The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval
4. Use current data. A sample taken in 1987 should not be used to make predictions in 1999

Interpret the Slope
Interpreting the slope $b_1$:
- The slope is sometimes defined as $\frac{\text{Rise}}{\text{Run}}$
- The slope is also sometimes defined as $\frac{\text{Change in } y}{\text{Change in } x}$
- The slope relates changes in y to changes in x

Interpret the Slope
- For example, if $b_1 = 4$
  - If x increases by 1, then y will increase by 4
  - If x decreases by 1, then y will decrease by 4
  - A positive linear relationship
- For example, if $b_1 = -7$
  - If x increases by 1, then y will decrease by 7
  - If x decreases by 1, then y will increase by 7
  - A negative linear relationship

Example
- For example, say that a researcher studies the population in a town (which is the y or response variable) in each year (which is the x or predictor variable)
- To simplify the calculations, years are measured from 1900 (i.e., x = 55 is the year 1955)
- The model used is y = 300 x + 1,000
- A slope of 300 means that the model predicts that, on average, the population increases by 300 per year
- An intercept of 1,000 means that the model predicts that the town had a population of 1,000 in the year 1900 (i.e., when x = 0)

Interpret the y-intercept
Interpreting the y-intercept $b_0$:
- Sometimes $b_0$ has an interpretation, and sometimes not
  - If 0 is a reasonable value for x, then $b_0$ can be interpreted as the value of y when x is 0
  - If 0 is not a reasonable value for x, then $b_0$ does not have an interpretation
- In general, we should not use the model for values of x that are much larger or much smaller than the observed values of x (that is, it may be invalid to predict y for x values lying outside the range of the observed x)
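The advice about staying inside the sample domain can be built into a prediction helper. The sketch below reuses the rounded job satisfaction line $\hat{y} = 1.49 + 0.517x$ and assumes the observed salaries ran from 22 to 37; the function name `predict` and its return convention are hypothetical choices made for this illustration.

```python
def predict(x, b0, b1, x_min, x_max):
    """Return (y_hat, inside_domain) for the line y_hat = b0 + b1 * x.

    inside_domain is False when x lies outside the observed x range,
    in which case the prediction is an extrapolation and needs caution.
    """
    y_hat = b0 + b1 * x
    inside_domain = x_min <= x <= x_max
    return y_hat, inside_domain


# Rounded coefficients of the job satisfaction line; the observed
# x range (22 to 37) is an assumption taken from that example's data.
B0, B1 = 1.49, 0.517
X_MIN, X_MAX = 22, 37

y_30, ok_30 = predict(30, B0, B1, X_MIN, X_MAX)   # inside the sample domain
y_60, ok_60 = predict(60, B0, B1, X_MIN, X_MAX)   # far outside it

print(round(y_30, 2), ok_30)
print(round(y_60, 2), ok_60)
```

Returning a flag instead of refusing to answer mirrors the wording of the guideline: values outside the domain can be estimated, but the caller is told to treat them with caution.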
Summary
- Summarize two quantitative data
  - Scatter diagrams
  - Correlation coefficients
- Linear models of correlation
  - Least-squares regression line
  - Prediction
Obtain the Linear Correlation Coefficient and Regression Line Equation from a TI Calculator
1. Turn on the diagnostic tool: CATALOG [2nd 0] DiagnosticOn ENTER ENTER
2. Enter the data: STAT EDIT. Enter the x-variable data into L1 and the corresponding y-variable data into L2
3. Obtain the regression line and the linear correlation coefficient r: STAT CALC 4:LinReg(ax+b) ENTER L1, L2, Y1. (Note: to enter Y1, use VARS Y-VARS 1:Function 1:Y1 ENTER.) The screen will also show $r^2$; just ignore it.
4. Display the scatter diagram and the fitted regression line: ZOOM 9:ZoomStat TRACE. Press the up or down arrow keys to move the cursor to the regression line. You can then trace the points along the line by pressing the right or left arrow keys. While the cursor is on the regression line, you can also enter a number, and the screen will show the predicted value of y for the x value you just entered.