Regression analysis is a type of inferential statistics that tells us whether relationships between two or more variables exist. For example, with sales $ (y, the dependent variable) and advertising $ (x, the independent variable), regression asks whether there is a relationship between the independent variable x (advertising) and the dependent variable y (sales). Section 4 - Correlation (5)
Correlation tells us if there is a relationship between variables, while regression tells us what type of relationship we have: Is it positive or negative? Is it linear or nonlinear? Is it a simple or a multiple relationship? What is the relationship (the form of the equation)? How good is the relationship, and can we use it for predictive purposes?
Simple linear regression fits a straight line with one independent variable; it can be extended to nonlinear and/or multiple regression. When trying to determine whether there is a relationship between variables, start with a scatter plot (e.g., salary vs. age).
Which variable is independent and which is dependent? In general, the independent variable can be controlled or manipulated and the dependent variable cannot. Sometimes it is unclear which is which (e.g., tree diameter vs. volume). Ask: which variable depends on the other? Or, which variable are you trying to predict? Or, which variable is easier to measure? Or, choose arbitrarily.
First, determine what type of relationship it is; next, come up with the regression line, the best fit of the data. Recall that a line is described by y = mx + b, where m is the slope and b is the y-intercept.
In regression, the line is described by ŷ = a + bx, where ŷ is the predicted value, a is the y-intercept, and b is the slope. To describe each observed point y_i, we must incorporate an error term e_i: y_i = a + bx_i + e_i.
Given a set of data (age, salary pairs), how do you find the regression line of best fit, i.e., the y-intercept (a) and the slope (b) that best describe the observations in ŷ = a + bx?
It turns out that the best solution is the one that minimizes the sum of squares of the vertical distances between the line and each point, i.e., the one for which $\sum_{i=1}^{n} e_i^2$ is minimized. Rationale: the closer the line is to the points, the better the fit and the better the predictive power. This is the least squares solution.
The least squares solution can be found by partial derivatives, by matrices (linear algebra), or by easy-to-use equations: (1) make a table listing all of the x's and y's along with the columns xy, x², and y²:

x    y    xy     x²    y²
x₁   y₁   x₁y₁   x₁²   y₁²
x₂   y₂   x₂y₂   x₂²   y₂²
...
xₙ   yₙ   xₙyₙ   xₙ²   yₙ²
Σx   Σy   Σxy    Σx²   Σy²

(2) sum each column and plug the totals into the y-intercept and slope formulae.
y-intercept, a:

$a = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$

slope, b:

$b = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
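The two formulae above translate directly into code. A minimal sketch (the function name `least_squares` is just an illustrative choice, not from the slides):

```python
def least_squares(xs, ys):
    """Fit y-hat = a + b*x using the summation formulas above."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # y-intercept
    b = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a, b

# Sanity check: points that lie exactly on y = 1 + 2x recover a = 1, b = 2.
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
```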
e.g., the relationship between advertising budget and sales in a forest products firm (in $000s):

Ad (x)   Sales (y)
4.6      87.1
5.1      93.1
4.8      89.8
4.4      91.4
5.9      99.5
4.7      92.1
5.1      95.5
5.2      99.3
4.9      93.4
5.1      94.4

[Scatter plot: Sales (dollars) vs. Advertising Budget (dollars)]
e.g., the relationship between advertising budget and sales in a forest products firm (in $000s): adding the xy, x², and y² columns and summing gives the totals:

Σx = 49.8, Σy = 935.6, Σxy = 4671.10, Σx² = 249.54, Σy² = 87670.34
e.g., the relationship between advertising budget and sales in a forest products firm (in $000s): plugging the totals into the formulae, with n = 10:

$a = \dfrac{(935.6)(249.54) - (49.8)(4671.10)}{10(249.54) - (49.8)^2} = \dfrac{848.844}{15.36} \approx 55.26$

$b = \dfrac{10(4671.10) - (49.8)(935.6)}{10(249.54) - (49.8)^2} = \dfrac{118.12}{15.36} \approx 7.69$
e.g., the relationship between advertising budget and sales in a forest products firm (in $000s): regression line: ŷ = 55.26 + 7.69x. [Scatter plot with fitted regression line: Sales (dollars) vs. Advertising Budget (dollars)]
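The worked example can be checked numerically. A sketch using the ten (x, y) pairs from the table:

```python
# Advertising budget (x) and sales (y) from the example, both in $000s.
ad = [4.6, 5.1, 4.8, 4.4, 5.9, 4.7, 5.1, 5.2, 4.9, 5.1]
sales = [87.1, 93.1, 89.8, 91.4, 99.5, 92.1, 95.5, 99.3, 93.4, 94.4]

n = len(ad)
sum_x = sum(ad)                                  # about 49.8
sum_y = sum(sales)                               # about 935.6
sum_xy = sum(x * y for x, y in zip(ad, sales))   # about 4671.10
sum_x2 = sum(x * x for x in ad)                  # about 249.54
denom = n * sum_x2 - sum_x ** 2                  # about 15.36

a = (sum_y * sum_x2 - sum_x * sum_xy) / denom    # y-intercept
b = (n * sum_xy - sum_x * sum_y) / denom         # slope
```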
How good is our prediction? In other words, how well does our line fit the data? (Note that even the best-fitting line may still have a lot of variation.) Three methods: (1) correlation coefficient (r); (2) coefficient of determination (r²); (3) significance test.
(1) Correlation coefficient (r): check to see how well our variables are correlated; in other words, how strong is the relationship between the dependent and independent variables? There are many ways to measure this, but the most common is the Pearson Product Moment Correlation coefficient (PPMC), which measures the strength and direction of a linear relationship between variables, given by:

$r = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{\left[n(\sum x^2) - (\sum x)^2\right]\left[n(\sum y^2) - (\sum y)^2\right]}}$
(1) Correlation coefficient (r): the correlation coefficient will lie between −1 and +1. Values near −1 indicate a strong negative relationship, values near +1 a strong positive relationship, and values near 0 little or no linear relationship. From our example, r ≈ 0.82, a strong positive relationship.
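Applying the PPMC formula to the example data gives r ≈ 0.82; a minimal sketch:

```python
import math

# Advertising budget (x) and sales (y) from the example, both in $000s.
ad = [4.6, 5.1, 4.8, 4.4, 5.9, 4.7, 5.1, 5.2, 4.9, 5.1]
sales = [87.1, 93.1, 89.8, 91.4, 99.5, 92.1, 95.5, 99.3, 93.4, 94.4]
n = len(ad)

sx, sy = sum(ad), sum(sales)
sxy = sum(x * y for x, y in zip(ad, sales))
sx2 = sum(x * x for x in ad)
sy2 = sum(y * y for y in sales)

# Pearson Product Moment Correlation coefficient.
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))
```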
(2) Coefficient of determination (r²): in order to understand r², we must analyze the regression line. [Diagram: for each observation, the deviation of y from the mean ȳ splits into a part explained by the regression line (ŷ − ȳ) and an unexplained residual (y − ŷ).]
(2) Coefficient of determination (r²): this implies that total variation is made up of two types of variation: explained variation is the variation explained by the regression line, and unexplained variation is due to random error:

$\sum (y - \bar y)^2 = \sum (\hat y - \bar y)^2 + \sum (y - \hat y)^2$

with degrees of freedom (n − 1) = (1) + (n − 2), i.e., σ²_total = σ²_regression + σ²_residual.
(2) Coefficient of determination (r²): computed as follows:

$r^2 = \dfrac{SS_{explained}}{SS_{total}} \times 100\% = \dfrac{\sum (\hat y - \bar y)^2}{\sum (y - \bar y)^2} \times 100\%$

The coefficient of determination will always satisfy 0% ≤ r² ≤ 100%.
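The sum-of-squares decomposition can be verified directly on the example data; a sketch:

```python
# Advertising budget (x) and sales (y) from the example, both in $000s.
ad = [4.6, 5.1, 4.8, 4.4, 5.9, 4.7, 5.1, 5.2, 4.9, 5.1]
sales = [87.1, 93.1, 89.8, 91.4, 99.5, 92.1, 95.5, 99.3, 93.4, 94.4]
n = len(ad)

# Least-squares fit (summation formulas).
sx, sy = sum(ad), sum(sales)
sxy = sum(x * y for x, y in zip(ad, sales))
sx2 = sum(x * x for x in ad)
denom = n * sx2 - sx**2
a = (sy * sx2 - sx * sxy) / denom
b = (n * sxy - sx * sy) / denom

y_bar = sy / n
y_hat = [a + b * x for x in ad]                               # predicted values
ss_total = sum((y - y_bar) ** 2 for y in sales)               # total variation
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)         # regression
ss_residual = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))  # error

r2 = ss_explained / ss_total  # coefficient of determination
```

The decomposition holds numerically: ss_explained + ss_residual equals ss_total (up to floating-point error), and r² comes out near 0.67, matching the 66.9% quoted below.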
(2) Coefficient of determination (r²): an easier method is to simply square the PPMC correlation coefficient. Meaning? 66.9% of the variation of the dependent variable is explained by (or accounted for by) the independent variable through the regression line: x and y are correlated, so that as x increases so too does y, and 66.9% of the variation in y is accounted for by the regression line.
(2) Coefficient of determination (r²): the higher the r², the better the predictive power of the regression line.
(3) Significance test: we can also test the significance of the regression equation itself. A regression is significant only if σ²_regression > σ²_residual. This requires an F-test, in which the regression is significant only if we reject the null hypothesis H₀:

H₀: σ²_regression / σ²_residual = 1
H₁: σ²_regression / σ²_residual > 1
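A sketch of the F-test on the example data, using F = (SS_regression / 1) / (SS_residual / (n − 2)); the critical value F₀.₀₅(1, 8) = 5.32 is the standard table value:

```python
# Advertising budget (x) and sales (y) from the example, both in $000s.
ad = [4.6, 5.1, 4.8, 4.4, 5.9, 4.7, 5.1, 5.2, 4.9, 5.1]
sales = [87.1, 93.1, 89.8, 91.4, 99.5, 92.1, 95.5, 99.3, 93.4, 94.4]
n = len(ad)

# Least-squares fit (summation formulas).
sx, sy = sum(ad), sum(sales)
sxy = sum(x * y for x, y in zip(ad, sales))
sx2 = sum(x * x for x in ad)
denom = n * sx2 - sx**2
a = (sy * sx2 - sx * sxy) / denom
b = (n * sxy - sx * sy) / denom

y_bar = sy / n
ss_reg = sum((a + b * x - y_bar) ** 2 for x in ad)             # 1 df
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(ad, sales))  # n - 2 df

F = (ss_reg / 1) / (ss_res / (n - 2))  # test statistic

f_crit = 5.32  # F_0.05(1, 8) from a standard F table
significant = F > f_crit
```

Here F is about 16.2, well above 5.32, so we reject H₀: the regression is significant at the 5% level.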
Prediction: with regression, we are predicting y values from x values. We are essentially making inferences about our dependent variable population and can build confidence intervals for this purpose. To do so, we need to compute the standard error of estimate of the regression line:

$s_{est} = \sqrt{\dfrac{\sum (y - \hat y)^2}{n - 2}}$

working formula:

$s_{est} = \sqrt{\dfrac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$
Prediction: from our example, using the working formula with the unrounded a and b:

$s_{est} = \sqrt{\dfrac{87670.34 - a(935.6) - b(4671.10)}{10 - 2}} \approx 2.37$
Prediction: the standard error of estimate is a measure of the variation of the observations around the regression line (the standard deviation of the y values [data] around the predicted values [line]). From our example, s_est ≈ 2.37, i.e., about $2,370, since the data are in $000s.
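Both forms of the standard error of estimate give the same result on the example data; a sketch:

```python
import math

# Advertising budget (x) and sales (y) from the example, both in $000s.
ad = [4.6, 5.1, 4.8, 4.4, 5.9, 4.7, 5.1, 5.2, 4.9, 5.1]
sales = [87.1, 93.1, 89.8, 91.4, 99.5, 92.1, 95.5, 99.3, 93.4, 94.4]
n = len(ad)

# Least-squares fit (summation formulas).
sx, sy = sum(ad), sum(sales)
sxy = sum(x * y for x, y in zip(ad, sales))
sx2 = sum(x * x for x in ad)
sy2 = sum(y * y for y in sales)
denom = n * sx2 - sx**2
a = (sy * sx2 - sx * sxy) / denom
b = (n * sxy - sx * sy) / denom

# Working formula.
s_est = math.sqrt((sy2 - a * sy - b * sxy) / (n - 2))

# Definition, directly from the residuals; should agree.
s_est_resid = math.sqrt(
    sum((y - (a + b * x)) ** 2 for x, y in zip(ad, sales)) / (n - 2)
)
```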
Prediction: we use the standard error of estimate to build confidence intervals around the predictions. To do this we need to calculate the standard error of a predicted y value:

$s_{\hat y} = s_{est}\sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar x)^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}}$

and construct the confidence interval:

$P\left(\hat y - t_{\alpha/2}(n-2)\, s_{\hat y} < y < \hat y + t_{\alpha/2}(n-2)\, s_{\hat y}\right) = 1 - \alpha$
Prediction: e.g., what is the 95% confidence interval of sales with an advertising budget of $5,500 (x = 5.5)? We need x̄ = 4.98, SS_x = Σ(xᵢ − x̄)² = 1.54, and s_est ≈ 2.37.
Prediction: e.g., for an advertising budget of $5,500: ŷ = 55.26 + 7.69(5.5) ≈ 97.56; s_ŷ = s_est √(1 + 1/10 + (5.5 − 4.98)²/1.54) ≈ 2.67; with t₀.₀₂₅(8) = 2.306, the 95% confidence interval is 97.56 ± 2.306(2.67), i.e., roughly 91.4 to 103.7, or about $91,400 to $103,700 in sales.
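The whole prediction can be assembled end to end; a sketch, with t₀.₀₂₅(8) = 2.306 taken from a standard t table:

```python
import math

# Advertising budget (x) and sales (y) from the example, both in $000s.
ad = [4.6, 5.1, 4.8, 4.4, 5.9, 4.7, 5.1, 5.2, 4.9, 5.1]
sales = [87.1, 93.1, 89.8, 91.4, 99.5, 92.1, 95.5, 99.3, 93.4, 94.4]
n = len(ad)

# Least-squares fit (summation formulas).
sx, sy = sum(ad), sum(sales)
sxy = sum(x * y for x, y in zip(ad, sales))
sx2 = sum(x * x for x in ad)
sy2 = sum(y * y for y in sales)
denom = n * sx2 - sx**2
a = (sy * sx2 - sx * sxy) / denom
b = (n * sxy - sx * sy) / denom

# Standard error of estimate (working formula).
s_est = math.sqrt((sy2 - a * sy - b * sxy) / (n - 2))

# Standard error of the predicted y at x = 5.5 (i.e., $5,500).
x_new = 5.5
x_bar = sx / n
ss_x = sum((x - x_bar) ** 2 for x in ad)
s_pred = s_est * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / ss_x)

# 95% confidence interval; t_0.025(8) = 2.306 from a t table.
t = 2.306
y_hat = a + b * x_new
lower, upper = y_hat - t * s_pred, y_hat + t * s_pred
```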