Linear Regression

Outline: simple linear regression (the linear relationship between two quantitative variables); the regression line; facts about least-squares regression; residuals; influential observations; cautions about correlation and regression; correlation/regression using averages; lurking variables; association is not causation.

Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables. In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description. But which line best describes our data? Here we will ask Minitab to decide.
The linear regression line

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. In regression, the distinction between explanatory and response variables is important.

Equation of a straight line (review)

A statistical model: the simple linear regression model

Y = β₀ + β₁X + ε,  so that  E{Y} = β₀ + β₁X

Y = dependent variable
X = independent variable
β₀ = y-intercept
β₁ = slope of the line
ε = error variable, normally distributed about 0 (mean is zero) with constant standard deviation. The error variable accounts for all the variables, both measurable and unmeasurable, that are not part of the model.
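The model above can be sketched by simulation. A minimal sketch, with illustrative parameter values β₀ = 1, β₁ = 16 and σ = 5 that are purely hypothetical (chosen only to resemble the scale of the hours-and-marks example used later):

```python
import random

# Hypothetical parameters for the model Y = b0 + b1*X + eps (illustration only)
BETA0 = 1.0    # y-intercept (assumed)
BETA1 = 16.0   # slope (assumed)
SIGMA = 5.0    # constant standard deviation of the error variable (assumed)

def simulate(xs, seed=42):
    """One Y observation per x: the deterministic line plus a normal error."""
    rng = random.Random(seed)
    return [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]

xs = [1, 2, 3, 4, 5, 6]
ys = simulate(xs)
# E{Y} = b0 + b1*x is the line; the simulated ys scatter around it with SD sigma.
```

Re-running with different seeds shows the scatter that the error variable ε produces around the fixed line E{Y}.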
Estimating the regression line from data

The least-squares regression (minsta-kvadrat regression) line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. The distances between the observed points and the line are called the residuals of the model for the data. Residuals can be both positive and negative.

Estimation of β₀ and β₁

β₀ and β₁ are unknown and are estimated (by Minitab) from the data; the least-squares estimates are b₀ and b₁ respectively. We get the estimated model:

ŷ = b₀ + b₁x

That is, ŷ (y-hat) is the predicted response for any x; y is the observed value; b₁ is the slope; b₀ is the intercept.

An example

For 6 randomly selected students, the number of studying hours for the exam and the exam marks are recorded:

student  hours (x)  marks (y)
1        1          17
2        2          32
3        3          58
4        4          60
5        5          87
6        6          99
mean     3.5        58.83
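The estimates can be computed directly from the least-squares formulas b₁ = Sxy/Sxx and b₀ = ȳ − b₁x̄. A minimal sketch in Python, applied to the six-student data above:

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

hours = [1, 2, 3, 4, 5, 6]
marks = [17, 32, 58, 60, 87, 99]
b0, b1 = least_squares(hours, marks)
print(round(b0, 2), round(b1, 3))  # prints: 1.13 16.486
```

These are exactly the coefficients Minitab reports on the next slide.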
Example...

Minitab output:

Regression Analysis: Marks versus Hours

The regression equation is
Marks = 1,13 + 16,5 Hours

Predictor  Coef    SE Coef  T      P
Constant   1,13    5,156    0,22   0,837
Hours      16,486  1,324    12,45  0,000

S = 5,5386   R-Sq = 97,5%   R-Sq(adj) = 96,9%

Interpretations:
b₀ = 1.13: when a student spends no hours studying, he/she gets 1.13 marks.
b₁ = 16.5: for each additional hour of study, a student gets an increase of 16.5 in his/her mark.

Another example: ads and revenue

In order to see the relationship between the number of advertisements in local radio stations (ads) and revenue (in 1000s of SEK) during a month, the manager of a fast-food company recorded them for several months:

ads  revenue
…    327,67
3    376,68
4    392,52
34   443,4
93   342,62
4    476,6
5    324,74
5    338,98
…    …

Expected revenue = 4.852 + 2.473 ads

b₁ = 2.473: when the number of advertisements increases by one, the monthly revenue increases by 2.473 SEK (in 1000s).
b₀ = 4.852: when no advertising is done in local radio, the monthly revenue is 4.852 SEK (in 1000s).
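The S, R-Sq and R-Sq(adj) figures in the Minitab output can be reproduced from the residual and total sums of squares. A self-contained sketch using the hours-and-marks data:

```python
def fit_stats(xs, ys):
    """Return (s, r_sq, r_sq_adj) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total SS
    s = (sse / (n - 2)) ** 0.5                        # standard error S
    r_sq = 1 - sse / sst                              # R-Sq
    r_sq_adj = 1 - (sse / (n - 2)) / (sst / (n - 1))  # R-Sq(adj)
    return s, r_sq, r_sq_adj

hours = [1, 2, 3, 4, 5, 6]
marks = [17, 32, 58, 60, 87, 99]
s, r_sq, r_sq_adj = fit_stats(hours, marks)
# s ~ 5.5386, r_sq ~ 0.975, r_sq_adj ~ 0.969, matching the Minitab output above
```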
The least-squares slope and intercept can be written in terms of the correlation and the standard deviations of x and y:

b₁ = r (s_y / s_x),   b₀ = ȳ − b₁x̄

Coefficient of determination, r²

r² represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x. r², the coefficient of determination, is the square of the correlation coefficient.
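The identity b₁ = r (s_y / s_x) can be checked numerically on the hours-and-marks data; a short sketch:

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

hours = [1, 2, 3, 4, 5, 6]
marks = [17, 32, 58, 60, 87, 99]
n = len(hours)
r = pearson_r(hours, marks)
sx = (sum((x - 3.5) ** 2 for x in hours) / (n - 1)) ** 0.5          # SD of x
sy = (sum((y - sum(marks) / n) ** 2 for y in marks) / (n - 1)) ** 0.5  # SD of y
slope_via_r = r * sy / sx
# slope_via_r ~ 16.486: the same b1 as from the least-squares formula,
# and r**2 ~ 0.975 is the R-Sq value Minitab reports.
```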
r = 1 or r = −1, r² = 1: changes in x explain 100% of the variations in y. y can be entirely predicted for any given value of x.

r = 0, r² = 0: changes in x explain 0% of the variations in y. The value(s) y takes is (are) entirely independent of what value x takes.

r = 0.87, r² = 0.76: here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.

hours and marks example: coefficient of determination

S = 5,5386   R-Sq = 97,5%   R-Sq(adj) = 96,9%

97.5% of the variance of marks is explained by hours.

Facts about least-squares regression
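Note that r² = 0 means no linear predictability, not no relationship at all. A quick sketch: a perfect parabola (illustrative data, y = x²) has r = 0 even though y is completely determined by x:

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [-2, -1, 0, 1, 2]
ys = [4, 1, 0, 1, 4]   # y = x**2: a perfect, but nonlinear, relationship
r = pearson_r(xs, ys)
print(r)  # prints: 0.0
# Changes in x explain 0% of the *linear* variation in y, yet y is a function of x.
```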
Extrapolation

!!! Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be a very stupid thing to do, as seen here. !!!

The intercept

Sometimes the y-intercept is not biologically possible. Here the y-intercept gives a negative blood alcohol content, which makes no sense. But the negative value is appropriate for the equation of the regression line. There is a lot of scatter in the data, and the line is just an estimate.

Residuals

The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter. These distances are called residuals:

residual = observed y − predicted ŷ = y − ŷ

Points above the line have a positive residual; points below the line have a negative residual. The sum of the residuals is always 0.
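These facts (positive and negative residuals that always sum to 0) can be verified numerically on the hours-and-marks data; a minimal sketch:

```python
hours = [1, 2, 3, 4, 5, 6]
marks = [17, 32, 58, 60, 87, 99]
n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(marks) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks)) / \
     sum((x - x_bar) ** 2 for x in hours)
b0 = y_bar - b1 * x_bar

# residual = observed y - predicted y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, marks)]
total = sum(residuals)  # 0 up to floating-point rounding
```

Some residuals are positive (points above the line), some negative (points below), and they cancel exactly: this is a consequence of the least-squares formulas, not an accident of the data.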
Residual plots

Residuals that are randomly scattered: good! A curved pattern means the relationship you are looking at is not linear. A change in variability across the plot is a warning sign: you need to find out why it is there, and remember that predictions made in areas of larger variability will not be as good.

Regression diagnostics: ads and revenue
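The curved-pattern warning is easy to demonstrate: fit a straight line to data that are actually quadratic, and the residuals come out positive at both ends and negative in the middle. A sketch with made-up y = x² data:

```python
xs = list(range(1, 8))       # x = 1..7
ys = [x ** 2 for x in xs]    # a clearly nonlinear relationship
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(e, 1) for e in residuals])
# prints: [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
# Plotted against the fitted values, this U-shape is the curved pattern
# described above: the linear model is the wrong model for these data.
```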
Assessing the Model

The least squares method produces a regression line whether or not there is a linear relationship between X and Y. Consequently, it is important to assess how well the linear model fits the data.

Check if we have a reasonably high R-square value. Check if the regression assumptions are fulfilled by studying the residuals of the model:
1. Do the residuals have constant variation for all predicted values? Draw a scatter plot of residuals against predicted values to see if the spread is the same.
2. Are the residuals independent from each other? Draw a scatter plot of residuals against predicted values to see if they follow a regular pattern.
3. Are the residuals normal with mean about zero? Draw a histogram of the residuals and a normal P-P plot.

Always plot your data

A correlation coefficient and a regression line can be calculated for any relationship between two quantitative variables. However, outliers can greatly influence the results. Also, running a linear regression on a nonlinear association is not only meaningless but misleading.

(Table: x, y, and log₁₀ y values; y is transformed into log₁₀ y to get a linear relationship.)

So, make sure to always plot your data before you run a correlation or regression analysis. Making the scatterplots shows us that correlation/regression analysis is not appropriate for all data sets:
Moderate linear association; regression OK.
Obvious nonlinear relationship; regression not OK.
One point deviates from the highly linear pattern; this outlier must be examined closely before proceeding.
Just one very influential point; all other points have the same x value; a redesign is due here.
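A classic illustration of "always plot your data" is Anscombe's quartet: four data sets with (nearly) identical means, least-squares lines and r², but completely different scatterplots. A sketch using two of the four sets (values from Anscombe's well-known 1973 example):

```python
def fit(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b1 * x_bar, b1

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]  # linear
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]  # curved

(b0_1, b1_1), (b0_2, b1_2) = fit(x, y1), fit(x, y2)
# Both fits are approximately y-hat = 3.00 + 0.500x: identical regression
# output, yet a scatterplot shows only the first set is actually linear.
```

Regression software will happily fit both; only the plot reveals that the second fit is inappropriate.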
Lurking variables

A lurking variable is a variable not included in the study design that does have an effect on the variables studied. Lurking variables can falsely suggest a relationship.

What is the lurking variable in these examples? How could you answer if you didn't know anything about the topic?
Strong positive association between the number of firefighters at a fire site and the amount of damage a fire does.
Negative association between moderate amounts of wine drinking and death rates from heart disease in developed nations.

Simpson's paradox (pp. 46-47 in Moore). Example: which hospital is better? But what if we also know the patient's condition before the operation...
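The hospital example can be made concrete with a small numeric sketch. The counts below are hypothetical, invented only to show the mechanism: hospital A is better within each condition group, yet looks worse overall because it treats mostly poor-condition patients.

```python
# Hypothetical survival counts: {hospital: {condition: (survived, treated)}}
counts = {
    "A": {"good": (95, 100), "poor": (540, 900)},
    "B": {"good": (810, 900), "poor": (50, 100)},
}

def rate(hospital, condition=None):
    """Survival rate, per condition group or overall."""
    groups = counts[hospital]
    if condition is not None:
        survived, treated = groups[condition]
        return survived / treated
    survived = sum(g[0] for g in groups.values())
    treated = sum(g[1] for g in groups.values())
    return survived / treated

# A wins within each condition group...
assert rate("A", "good") > rate("B", "good")   # 0.95 > 0.90
assert rate("A", "poor") > rate("B", "poor")   # 0.60 > 0.50
# ...but loses overall: Simpson's paradox.
assert rate("A") < rate("B")                   # 0.635 < 0.86
```

The lurking variable (patient condition) reverses the comparison: aggregating over it gives a misleading answer.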
Vocabulary: lurking vs. confounding

A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables. Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables. But you often see the terms used interchangeably.

Association is not causation

An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y. Example: there is a high positive correlation between the number of television sets per person (x) and the average life expectancy (y) for the world's nations. Could we lengthen the lives of people in Rwanda by shipping them TV sets? The best way to get evidence that x causes y is to do an experiment in which we change x and keep lurking variables under control.

Caution before rushing into a correlation or a regression analysis

Do not use a regression on inappropriate data: watch for a pattern in the residuals, the presence of large outliers (use residual plots for help), and clumped data falsely appearing linear. Beware of lurking variables. Avoid extrapolating (going beyond interpolation). A relationship, however strong, does not by itself imply causation.
Linear regression example (multiple)