Notes 6. Basic Stats Procedures part II


Statistics 5106, Fall 2007

Testing for Correlation between Two Variables

You have probably all heard about correlation. When two variables are correlated, they are dependent, which means that knowing the outcome of one variable tells you something about the probability of the various outcomes of the other variable. We can test for correlation by looking at scatterplots and by running a test in SAS.

The correlation coefficient r is a measure of the strength of the linear relationship between two quantitative variables. It has the following properties:

- $-1 \le r \le 1$.
- $r = 0$ indicates no linear relationship; $r > 0$ indicates a positive relationship and $r < 0$ indicates a negative relationship.
- $r = 1$ occurs only when the data fall perfectly on a line with positive slope; $r = -1$ occurs only when the data fall perfectly on a line with negative slope.

Computing the correlation coefficient

We calculate r for any two quantitative variables using the following formula:

$$ r = \frac{\sum \left(\frac{x - \bar{x}}{S_x}\right)\left(\frac{y - \bar{y}}{S_y}\right)}{n-1} = \frac{\sum Z_x Z_y}{n-1} \qquad (1) $$

This is sometimes called Pearson's r or Pearson's correlation to distinguish it from other measures of association; however, the phrase "correlation coefficient" in statistics refers specifically to r.

Question: Why is r a measure of the linear relationship between two variables? The z-scores measure how far each value lies from its mean, in standard deviations. If every $Z_x$ equals the corresponding $Z_y$ (that is, each x-value sits exactly as far from its mean as the paired y-value sits from its mean), then each product $Z_x Z_y$ is a positive square and r equals one. If the paired z-scores deviate from each other only slightly, the correlation is close to one; if they deviate greatly, the correlation is close to zero.
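Formula (1) can be sketched directly in SAS: standardize both variables, multiply the z-scores pairwise, and divide the sum by n - 1. This is for illustration only, since PROC CORR (used below) does it in one step; the dataset and variable names (cars, weight, MPG) are those of the example on the next page.

proc standard data=cars mean=0 std=1 out=zcars;
   var weight MPG;            /* replace each variable by its z-scores */
run;

data zprod;
   set zcars;
   zxzy = weight * MPG;       /* the product Z_x * Z_y for each row */
run;

proc means data=zprod sum n;  /* r = (sum of zxzy) / (n - 1) */
   var zxzy;
run;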

Other properties of correlation:

- It makes no difference which variable you call x and which you call y when computing correlation.
- The correlation is unchanged by changing the units of measurement for x or y.

A scatterplot is a graphical analog to the correlation. Remember that correlations should never be examined without also examining the scatterplots.

Let's look at the correlation between weight and MPG in the 81 cars data set. We saw the scatterplot earlier, and it looked like there was a correlation. We'll now do a formal test to see whether this relationship is statistically significant. Import the 81cars data set again and try the SAS code below. Remember to plot the scatterplot (you learned this earlier in Notes 3).

proc corr data=cars;
   var weight MPG;
run;

This code produces Pearson's coefficient by default.

Spearman's Coefficient

Spearman's rank correlation is calculated by converting each variable to its ranks. For example, in testing for correlation of blood pressure vs. body weight, the lightest person would get a rank of 1, the second-lightest a rank of 2, and so on; the lowest blood pressure would get a rank of 1, the second-lowest a rank of 2, and so on. When two or more observations are tied, the average rank is used: if two observations are tied for the second-lowest rank, each gets a rank of 2.5 (the average of 2 and 3). Once the two variables are converted to ranks, Pearson's correlation coefficient is calculated for the two columns of ranks, and its significance is tested. To get Spearman's coefficient we add the spearman option to the PROC CORR statement.

proc corr data=cars spearman;
   var weight MPG;
run;
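For a sketch of what the spearman option does behind the scenes, you can build the ranks yourself with PROC RANK (ties=mean assigns tied observations their average rank, as described above) and then compute the ordinary Pearson correlation on the rank columns; the result should match the Spearman coefficient.

proc rank data=cars out=ranked ties=mean;
   var weight MPG;
   ranks rweight rMPG;       /* rank versions of the two variables */
run;

proc corr data=ranked;       /* Pearson's r on the ranks...        */
   var rweight rMPG;         /* ...equals Spearman's coefficient   */
run;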

Simple Linear Regression

When we have a continuous response and a continuous predictor, we can get a scatterplot of the data. Often the response varies linearly with the predictor, so that it looks reasonable to draw a straight line through the scatterplot to represent the relationship between the variables. Simple linear regression means fitting a line to the data. It is appropriate when both the predictor (x) and response (y) variables are continuous, and the relationship between the variables is linear.

Technically, the term linear relationship means that the variables x and y vary together in such a way that the average increase in y associated with a change of one unit in x is constant over the range of x (i.e., the change in y given a unit change in x doesn't depend on the original value of x).

Intuitively, the best line should be closest to the data. To find this, we need a specific numerical measure of closeness. The least-squares criterion (which is not the only option) is the most common measure of closeness, and the only one discussed in this text. To describe the criterion and the best fit line, we first need to define a residual. Below is a plot of measurements of height (inches) and weight (pounds) for 22 kindergarteners.

[Figure: weight of kindergarteners as a function of height, with the least-squares line (slope = 0.661) superimposed.]

For each datapoint on the plot, draw a vertical line connecting the point to the line. The lengths of these little vertical lines are the (absolute values of) the residuals. The datapoints above the fitted line have positive residuals, and the residuals for the datapoints below the line are negative. The principle of least squares is: the best fit line minimizes the sum of the squared residuals.

Some notation and definitions:

- n is the number of observations. The observations are ordered pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x_i$ is the ith predictor variable observation and $y_i$ is the ith dependent variable observation.
- The equation for the best fit line is written $\hat{y} = b_0 + b_1 x$. This is called the estimated regression equation as well as the best fit line. $b_0$ and $b_1$ are called the coefficients of the best fit line.
- $b_0$ is the intercept of the best fit line. This is its y-coordinate at $x = 0$.
- $b_1$ is the slope of the best fit line. A one-unit increase in the x-coordinate is associated with an increase of $b_1$ units in the y-coordinate. (If $b_1$ is negative, the y value of the best fit line decreases as x increases.)
- $\hat{y}_i$ is the y-coordinate of the best fit line at $x = x_i$. It is the predicted or estimated value of the average dependent variable value at $x_i$.

Now we can write the sum of squared residuals as

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where SSE stands for sum of squared errors. (The words residual and error can be used interchangeably here.)

The formulas. We can derive the coefficients of the best fit line using calculus. If you know calculus, this is what you do:

1. Write the best fit criterion as a function of $b_0$ and $b_1$:
   $$ f(b_0, b_1) = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2. $$
2. Take the partial derivatives $\frac{\partial f}{\partial b_0}$ and $\frac{\partial f}{\partial b_1}$.
3. Set the partial derivatives to zero and simultaneously solve the pair of linear equations for $b_0$ and $b_1$.

If you don't know calculus, here are the answers:

$$ b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} $$

and $b_0 = \bar{y} - b_1 \bar{x}$.
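As an illustration only (in practice we let SAS do this), here is a small SAS/IML sketch that evaluates the two formulas on a made-up five-point dataset; the numbers are hypothetical, and IML is assumed to be available.

proc iml;
   /* hypothetical data, for illustration only */
   x = {1, 2, 3, 4, 5};
   y = {2.1, 3.9, 6.2, 7.8, 10.1};
   xbar = x[:];                                /* mean of x */
   ybar = y[:];                                /* mean of y */
   b1 = sum((x - xbar) # (y - ybar)) / sum((x - xbar) ## 2);
   b0 = ybar - b1 * xbar;                      /* intercept */
   print b1 b0;
quit;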

In practice we won't be using the formulas anyway: SAS can do that for us.

Goodness of fit measure: the $R^2$. We've already defined the sum-of-squared errors, or SSE. Let's look at a small dataset with the response variable y being the cholesterol level of the subject and the predictor variable x being the tofu consumption in pounds per week. The sum of the squares of the dotted lines shown connecting the data points to the line is the SSE.

[Figure: linear relationship between cholesterol level and soy consumption (gm/week), with dotted vertical lines from the data points to the fitted line.]

We can compute the total sum of squares and the regression sum of squares in a similar manner, as follows:

[Figure: two panels for the cholesterol level vs. soy consumption (gm/week) data, computing the total sum of squares (left) and the regression sum of squares (right).]

These sums of squares are measures of variation:

SST: The total sum of squares is the measure of the total variation of the y-coordinates.
SSR: The regression sum of squares is the variation in the y-coordinates which is explained by the regression line, or by the linear relationship with x. This is also called the explained variation.
SSE: The error sum of squares is the unexplained variation.

Not surprisingly, we have SST = SSR + SSE; that is, the total variation is the sum of the explained variation and the unexplained variation. Now we can define the coefficient of determination:

$$ R^2 = \frac{SSR}{SST}, $$

which is interpreted as the proportion of variation in y that is explained by the linear relationship with x. You can see that $0 \le R^2 \le 1$. Datasets for which the scatterplot shows points tightly clustered about a line have high $R^2$. If all the points are on a line, we get $R^2 = 1$. We get $R^2 = 0$ if the best fit line has slope zero, which would indicate that the predicted value of y does not depend on the value of x.
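As a quick worked example with made-up numbers: if SST = 100 and SSE = 40 for some dataset, then

$$ SSR = SST - SSE = 100 - 40 = 60, \qquad R^2 = \frac{SSR}{SST} = \frac{60}{100} = 0.60, $$

so the linear relationship with x would explain 60% of the variation in y.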

When we do a simple linear regression, we are making the assumption that there is a true underlying linear relationship between y and x, that is,

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, $$

where $\beta_0$ and $\beta_1$ are the real intercept and slope defining the relationship. The $\epsilon_i$ are random variables, called errors, that have mean zero. These unpredictable errors are why we can't get the exact parameter values from our sample. If we didn't have these errors, our data would lie exactly on a straight line, from which we could calculate the intercept $\beta_0$ and slope $\beta_1$.

We are estimating $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$. The $\beta_0$ and $\beta_1$ are parameters in the model, because they are a characteristic of the population. The estimates $b_0$ and $b_1$ are random variables, or statistics, because they are calculated from our sample: if we took a different sample of observations of the $y_i$ values, at the same $x_i$ values, we would calculate different $b_0$ and $b_1$.

SAS reports the standard errors (i.e., estimates of the standard deviation) of $b_0$ and $b_1$ along with their values. These are used to calculate t-statistics (for the slope, $t = b_1/\mathrm{SE}(b_1)$, compared to a t-distribution with $n - 2$ degrees of freedom) so we can do inference about the slope and the intercept. SAS reports the t-statistics and corresponding p-values to test the hypotheses

$$ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \ne 0 $$

and also

$$ H_0: \beta_0 = 0 \quad \text{vs.} \quad H_a: \beta_0 \ne 0. $$

Example: Salary Data, revisited

Recall the data in salary.txt, about salaries at XYZ corporation. We have plotted salary against seniority and concluded that there is a general trend: as seniority increases, salary increases too. To get the best linear fit of salary to seniority, we can use proc reg as follows:

proc reg;
   model salary=seniority;
run;

The default SAS output is:

The REG Procedure
Model: MODEL1
Dependent Variable: salary

[SAS output: the Analysis of Variance table (Source, DF, Sum of Squares, Mean Square, F Value, Pr > F, with rows Model, Error, and Corrected Total), the fit statistics (Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var), and the Parameter Estimates table (Parameter Estimate, Standard Error, t Value, Pr > |t| for Intercept and seniority). The numeric values were not recovered, apart from the intercept's p-value of <.0001.]

The top part of the table you will recognize as looking like the ANOVA table, with sources of variation. In fact, the p-value for the F-statistic is calculated in the exact same way. Confirm that the R-Square is the model sum of squares divided by the total sum of squares. (The R-Square is also reported for one-way ANOVA, and has the exact same interpretation, but we didn't discuss it then.)

The Parameter Estimates part of the table contains the parameter estimates, standard errors, t-statistics, and the p-values. Here the label for the slope is the predictor variable name. Note that the p-value for the F-statistic is the same as the p-value for the slope. The p-value for the intercept tells us that the starting salary for someone with seniority=0 (just starting out) is significantly different from zero. Well, it should be! Often we're not really interested in hypothesis tests about the intercept. The p-value for the slope tells us that salary increases significantly with seniority.

The $R^2$ tells us that about 7.4% of the variation in salary is explained by the linear relationship with seniority. The rest of the variation is called error variation but is due in part to things like employee type, and maybe education level, competence, etc. Maybe gender? That's what we'd like to know! We'll come back to this question again later when we discuss proc glm.

Example: Beta Carotene Data

Let's look again at the dataset betaplasma.txt. Plot blood beta carotene level against fiber consumption. What is the general trend?

Do a simple linear regression and report the least-squares line. If you want to look at the plot with the least-squares line superimposed, just include the plot statement within proc reg (illustrated below with the salary example). Try this and look at the graph. There is a lot of information included.

proc reg;
   model salary=seniority;
   plot salary*seniority;
run;

What proportion of the variation in blood beta carotene level is explained by the linear relationship with fiber?

Interpret the regression coefficients in the context of the problem.

Is there a significant linear relationship between blood beta carotene level and fiber (use α = 0.01)? Describe.

What is the estimated blood beta carotene level for a person with fiber intake of 25? What about for someone with fiber intake of 200? (Note that it's generally not a good idea to extrapolate too far out of the data range.)

Report a p-value. State the hypotheses corresponding to the p-value. State the conclusion in the context of the problem.

Now do the same analyses, but investigate the relationship between blood beta carotene level and cholesterol. What is the $R^2$ associated with the simple linear regression, and what does this mean in the context of the problem?

Interpret the regression coefficients in the context of the problem.

Report a p-value. State the hypotheses corresponding to the p-value. State the conclusion in the context of the problem.

Example: Body and Brain sizes

Let's look again at the dataset sleeptime.txt. The data were collected to study sleep and dreaming patterns in mammals, but we can use them to investigate the relationship between brain weight and body weight.

Plot brain weight against body weight (put brain weight on the y-axis). Describe what you see.

Now plot the logarithm of brain weight against the logarithm of body weight. Describe what you see. Which do you think is more appropriate for simple linear regression?

Do a simple linear regression and report the least-squares line. Report the $R^2$ and interpret it in the context of the problem.

Look at the scatterplot of the logarithm of brain weight against the logarithm of body weight, with the best fit line superimposed. Some points are above the line; others are below. Describe the points above the line, compared to those below, in the context of the problem.

Which mammal has the largest residual? Interpret in the context of the problem.

What is the predicted brain size for a mammal with a 60 kg body size, according to the regression model?

Example: Population and Profit

Let's look again at the dataset iceprof.txt. Recall that this file contains data concerning the net profit of ice cream stores in a chain around the country. The variables are store ID, region, population, and profit. We have investigated whether profits differ over regions. Now we want to ask whether profit varies with the population of the town. What do you guess about the relationship?

Plot profit against population. Describe what you see. Make a guess: is the relationship significant?

Let $y_i$ be the profit of the ith store, and $x_i$ be the population of the town of the ith store. Write down a simple linear regression model.

Fit the model and report the best fit line. Interpret the coefficients in the context of the problem.

Do the profits increase significantly with population? Explain.

What proportion of the variation in profit is explained by the linear relationship with population?

Predict profits for a store located in a town of population 500,000.
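For the prediction question, one common SAS idiom is sketched below: append an extra observation with the new population value and a missing profit. PROC REG excludes the incomplete row when fitting the line but still fills in its predicted value. The dataset and variable names (iceprof, population, profit) are taken from the description above.

data iceplus;
   set iceprof end=last;
   output;
   if last then do;          /* append one extra row after the real data */
      population = 500000;   /* the new town we want a prediction for    */
      profit = .;            /* missing response: excluded from the fit  */
      output;
   end;
run;

proc reg data=iceplus;
   model profit = population;
   output out=preds p=predicted;  /* last row of preds holds the answer */
run;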

Residual Analyses

A statistical model is always a simplified approximation to a real-world situation. The model assumptions may be enumerated, and given these assumptions, statistical tests and confidence intervals are derived. No one expects the model to be exactly correct, but each time a model is used, an attempt should be made to find obvious deviations from the model. If deviations are found, a new, more complicated model can be developed using the new information.

For the simple linear regression, we are assuming that there is a linear relationship between y and x, and that we have independent normal random errors that all have the same variance. Residual analysis is one way to check these assumptions. Using residual analysis, we can also find anomalies in the data, such as observations that are outliers or influential points, which should be investigated by the researcher. Every time we do a regression analysis, we ought to plot our residuals to check assumptions. In this section we show how to spot deviations from the regression assumptions, but unfortunately most of the fixes for these situations are beyond the scope of this text.

Recall the assumptions for a simple linear regression model:

1. We assume that the underlying structure of the data fits the model. That is, the y-coordinates of our data are varying linearly with the x-coordinates, plus some random error. Deviations from this assumption are called lack of fit. Maybe the true relationship could be better described with a function that is not linear, such as a parabola. Perhaps there is another predictor variable, related to both the response and the main predictor variable, that we should include in the model.

2. We assume that the random errors are normally distributed with mean zero. The two most common deviations from this assumption are skewed error distributions and heavy-tailed error distributions. Both cause outliers in the dataset: observations with unusually large y values, compared to others.

3. We assume that the random errors all have the same variance. This assumption is called homoskedasticity, and the deviation from the assumption is heteroskedasticity; these are impressive-sounding terms! Two types of heteroskedasticity are common and can be found with residual plots: error variances that increase or decrease with the value of the predictor variable, and error variances that increase or decrease with the expected value of the response variable.

4. We assume that the random errors are independent of each other. The most common deviation from this assumption is called serial correlation, or autoregressive errors. This phenomenon often occurs when the predictor variable values are consecutive in time or space; the value of one error can then depend on the value of the previous error. Positive autocorrelation implies that consecutive errors tend to be closer together: if one error is large and positive, the next error is more likely to be positive than negative. Negative autocorrelation means that large positive residuals are more likely to be followed by negative residuals, and vice versa. (A numerical check for this assumption is sketched after the next paragraph.)

There are several types of residual plots for simple linear regression. Each plot checks some of the model assumptions. We will enumerate the types of residual plots that should be generated, and discuss which assumptions can be checked with each.
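One side note on assumption 4 before we turn to the plots: when the observations are in time (or space) order, PROC REG can also print the Durbin-Watson statistic, a numerical check for serial correlation. A minimal sketch, using the generic dataset and variable names (a, y, x) from the plotting examples later in this handout:

proc reg data=a;
   model y=x / dw;   /* adds the Durbin-Watson statistic to the output */
run;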

1. Plot the residuals against the predictor values

We put the residuals on the vertical axis and the predictor variable on the horizontal axis. If the regression assumptions hold, the residual plot should look like a random horizontal band of points; it should have no patterns in it. Here are some specific patterns to look for, which indicate violations of the regression assumptions.

For example, we can look for lack of fit. Suppose we fit the equation of a straight line to data for which y is really quadratic in x; we might get the type of pattern illustrated in the plot on the left:

[Figure: two panels, a line fit to a curved scatterplot (left) and the corresponding residual plot (right).]

We can see some lack of fit in the scatterplot, but it is more noticeable in the residual plot (on the right). If we see this kind of lack of fit, we should consider fitting a function other than a straight-line model, perhaps a parabola (other types of regression functions are discussed in Chapter 7).

We can also look for non-constant error variance, or heteroskedasticity. Sometimes when the error variance is not constant, it is increasing or decreasing in the x-values. The next figure shows residuals for the former situation. Note that as x gets bigger, the absolute values of the residuals tend to increase, in a fanning-out pattern.

[Figure: two panels, the fit to the scatterplot (left) and the residual plot with a fanning-out pattern (right).]

An example of a dataset where this type of residual plot might be seen is as follows. Suppose a marketing researcher is looking at luxury spending (y) as a function of income (x). In this case, smaller incomes will have a smaller variation of luxury spending, since people with small incomes do not have the option of large spending on luxuries. On the other hand, there might be quite a bit of variation in luxury spending for people with large incomes.

The remedy for error variances that increase or decrease in x is weighted regression. A brief treatment of this subject will be presented in a later handout, but in broad brushstrokes, the idea is to weight the observations by the inverse of the x values if the residual variance seems to be increasing in x, and by the x values themselves if the residual variance seems to be decreasing in x. The observations with smaller variances get larger weights and hence count more in the analyses.

2. Plot the residuals against the predicted values

If all the assumptions are met, the plot of the residuals against the predicted values (i.e., the $\hat{y}_i$) should show a random-looking horizontal band. Again, we want to see no patterns in the residuals. One common pattern is the fanning out or funneling in that was discussed above concerning the plots of the residuals against the predictor values $x_i$. In the figure below, we see the fanning-out pattern in the plot of the residuals against the predicted values. This type of pattern is another example of heteroskedasticity. Sometimes a log-transformation of the response variable is helpful in correcting this problem.

[Figure: residuals plotted against the predicted values, fanning out as the predicted values increase.]

3. Histogram of the residuals

A histogram provides an approximate distribution of the residuals. Of course, we're looking for a nice symmetric bell shape. If the sample size is fairly large (say, $n \ge 50$), the histogram will indicate whether or not the residuals have a bell-shaped type of distribution.

The next figure shows three plots using a dataset with n = 80. The scatterplot of the response against the predictor is shown on the left. We can see the upward skew in the scatterplot: it looks as if the positive residuals might tend to be larger in absolute value than the negative residuals.

[Figure: three panels, the scatterplot of y against x, the residuals plotted against the predictor values, and a histogram of the residuals.]

In the center of the figure, the residuals are plotted against the predictor values. Here the upward skew is a little more obvious. Finally, a histogram of the residuals shows a definite skew to the right. Skewed residuals indicate a violation of the normality assumption, and so cast doubt on the validity of the regression results. When the skew is to the right, sometimes the handy log-transformation of the response values will fix the problem; that is, convert the model to one with no obvious violations of the assumptions.
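A sketch of one way to draw such a histogram in SAS, assuming the residuals have already been saved as a variable res in a dataset bdat (the output statement that does this appears at the end of this handout):

proc univariate data=bdat;
   var res;
   histogram res / normal;   /* overlay the fitted normal curve */
run;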

4. Normal probability plot of the residuals

Another way to visually assess the normality assumption is through a normal probability plot. As discussed in the proc univariate handout, the sorted residuals are plotted against quantiles of the normal distribution. If the resulting points fall roughly on a straight line, this is evidence for normality. If the line of points is curved in an obvious way, this is evidence of deviation from normality.

Plot (a) in the figure below shows the sorted residuals from the skewed-right residuals above, plotted against the normal quantiles. There is a clear curved pattern, indicating non-normality. Plot (b) shows the sorted residuals from a model in which the normal assumptions are met, plotted against the normal quantiles. Here the points lie roughly on a straight line, with some random jitter.

[Figure: panels (a) and (b), sorted residuals plotted against normal quantiles for the skewed-right model and the well-behaved model, respectively.]

There are also statistical tests for normality, which we learned in the proc univariate material.

Residuals may sometimes be heavy-tailed. This indicates that there are both unusually large and unusually small values of the response. The normal density has thin tails, meaning that it is very unlikely to see observations more than 2 or 3 standard deviations away from the mean. For some densities, such as the t-density with small degrees of freedom, it is not so surprising to see values more than 3 or 4 standard deviations away. We call these observations outliers. When residuals from a dataset with heavy-tailed errors are plotted against the normal quantiles, an upward skew is seen at the right and a downward skew is seen at the left. This pattern is seen in plot (a) of the figure below. The regression errors were generated from a t-density with two degrees of freedom, instead of a normal density. There is a pronounced indication of heavy tails. Note that the curve is concave-down on the left, then concave-up on the right.

[Figure: panels (a) and (b), sorted residuals plotted against normal quantiles for heavy-tailed and thin-tailed errors, respectively.]

In plot (b) of the above figure, we see the opposite phenomenon: residuals from a distribution that is thinner-tailed than the normal density. The curve is concave-up, then concave-down. This is a more unusual situation.

Getting residual plots using proc reg

We know how to get a scatterplot with the best fit line superimposed: we use a plot statement within the regression procedure:

proc reg data=a;
   model y=x;
   plot y*x;
run;

We can do the same for residual plots. The terms residual. and predicted. (actually, you can use r. and p. for short!) are variables that you can use in the plot statement, although they are not variables in the dataset. Try this:

proc reg data=a;
   model y=x;
   plot residual.*predicted.;
run;

To get a probability plot:

proc reg data=a;
   model y=x;
   plot residual.*npp.;
run;
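Alternatively, once the residuals are saved to a dataset (the output statement below creates bdat with the residuals in res), PROC UNIVARIATE can draw the normal probability plot directly; a sketch:

proc univariate data=bdat;
   var res;
   qqplot res / normal(mu=est sigma=est);  /* reference line from the fitted normal */
run;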

If you want to save the residuals and predicted values, the following will create a new dataset bdat with all the variables of adat, plus the residuals (res) and predicted values (pred).

proc reg data=adat;
   model y=x;
   output out=bdat r=res p=pred;
run;

Other useful things you can save are:

COOKD=name (Cook's D influence statistic)
STUDENT=name (studentized residuals, which are the residuals divided by their standard errors)
RSTUDENT=name (a studentized residual with the current observation deleted)

Outliers and Influential Points

After performing regression analysis to check the model assumptions, we should look for outliers and influential points as part of the residual analysis. In simple linear regression, we can often see these points in the scatterplot of y on x. An outlier is far from the best fit line, and will produce an unusually large residual. An observation is an influential point if removing it from the fit (deleting it from the analysis) changes the slope of the line in a substantial way.

Plot (a) in the figure below contains an example of a point that is an outlier but not an influential point. The solid line is the best fit with all the data, and the dotted line is the best fit with the circled point removed. If we draw a vertical line from the point to the solid line, we see that the residual for the circled point is considerably larger than the residuals for the other observations, but the presence of the point does not much alter the fit.

[Figure: panels (a), (b), and (c), scatterplots of y against x, each with one circled point: an outlier only, an influential point only, and a point that is both.]

In plot (b), the circled point is an influential point but not an outlier. When it is included in the regression, we get the solid line. Because the point is as near the line as any of the others, its residual is not large compared to the others, so it is not an outlier. However, the absence of the circled point drastically changes the fit to the remaining points, shown as a dotted line. The circled point greatly influences the fit to the data. The circled point in plot (c) is both an outlier and an influential point. It has a larger residual than the other observations, and it has the effect of pulling the regression line toward itself.

Outliers can be seen in the plots of the residuals against the predictor values, and of the residuals against the predicted values. Influential points can be harder to spot. One way to check whether an observation is influential is to remove the observation from the dataset, perform a regression analysis using the reduced dataset, and compare the results to the full-data results. This might take a long time if we want to check every point to see if it is influential. Fortunately, the computer can do this for us.

Cook's D Influence

Suppose we do a regression with the ith observation removed, and let $\hat{y}_{j(i)}$ be the predicted jth response, using the slope and intercept obtained with the reduced dataset. The notation (i) in the subscript stands for "with the ith observation removed." The idea is to compare the predictions $\hat{y}_{j(i)}$, $j = 1, 2, \ldots, n$, with the predictions $\hat{y}_j$ using the full dataset, to see if they are substantially different. The measure of influence of the ith observation is called Cook's D influence, and written

$$ D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{k \, MSE}, $$

where $\hat{y}_j$ is the predicted jth response for the regression on the entire dataset. The number k is two for simple linear regression; for the more complicated models of Chapter 7, it is the number of parameters in the model. The value of $D_i$ is large if the ith observation is influential, that is, if the predictions without the ith observation are far from the predictions using the entire dataset. It is calculated at each observation, and we can tell the software package to save these values in a new column. Then we plot the influence statistics against the predictor variable, for example, to look for observations with unusually high influence.

There is a rule for how large a value of $D_i$ should be before the observation is considered to be an influential point; the cut-off depends on n and the number of parameters k in the model. The values of $D_i$ are compared with the 95th percentile of an $F(k, n-k)$ density; if the value of $D_i$ is larger than the percentile, the ith observation is considered influential.
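Putting the pieces together, here is a sketch that saves the Cook's D values, computes the cut-off with SAS's FINV function, and plots the statistics against the predictor. The dataset and variable names (adat, y, x) and the sample size n = 25 are hypothetical.

proc reg data=adat;
   model y=x;
   output out=bdat cookd=cd;        /* save Cook's D for each observation */
run;

data bdat;
   set bdat;
   n = 25;                          /* hypothetical sample size           */
   k = 2;                           /* parameters in simple linear reg.   */
   cutoff = finv(0.95, k, n - k);   /* 95th percentile of F(k, n-k)       */
   flag = (cd > cutoff);            /* 1 = influential by the rule above  */
run;

proc gplot data=bdat;
   plot cd*x;                       /* look for unusually large values    */
run;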

Outliers and influential points should always be investigated. Sometimes they are the results of simple mistakes in entering the data, and should be corrected. Sometimes they are observations from an anomalous case. Sometimes these should be removed from the dataset, and other times they should be left in, depending on the context of the problem and the purpose of the analysis.

Exercises:

1. Look again at the betaplasma.txt dataset. When you plotted the beta carotene blood plasma level (BCBPL) against fiber consumption or BMI, you got an unsatisfying scatterplot, and the fit didn't look so good. Re-do your regression analyses, and get residual plots. Describe the distribution of the residuals.

2. Now do the same thing, only use log(BCBPL) as your response variable. Get residual plots. Did the transformation fix the problem?

3. Check the residual plots for the body temperature against pulse rate analysis. Are the model assumptions met?

4. Plot the Cook's influence statistics. Are there any data points with undue influence?

5. Interpret the results in the context of the problem.


Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variables In this lecture: We shall look at two quantitative variables.

More information

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variable In this lecture: We shall look at two quantitative variables.

More information

Multiple Regression and Regression Model Adequacy

Multiple Regression and Regression Model Adequacy Multiple Regression and Regression Model Adequacy Joseph J. Luczkovich, PhD February 14, 2014 Introduction Regression is a technique to mathematically model the linear association between two or more variables,

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

Regression Analysis IV... More MLR and Model Building

Regression Analysis IV... More MLR and Model Building Regression Analysis IV... More MLR and Model Building This session finishes up presenting the formal methods of inference based on the MLR model and then begins discussion of "model building" (use of regression

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

POL 681 Lecture Notes: Statistical Interactions

POL 681 Lecture Notes: Statistical Interactions POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

Overview. Overview. Overview. Specific Examples. General Examples. Bivariate Regression & Correlation

Overview. Overview. Overview. Specific Examples. General Examples. Bivariate Regression & Correlation Bivariate Regression & Correlation Overview The Scatter Diagram Two Examples: Education & Prestige Correlation Coefficient Bivariate Linear Regression Line SPSS Output Interpretation Covariance ou already

More information

Statistiek II. John Nerbonne. March 17, Dept of Information Science incl. important reworkings by Harmut Fitz

Statistiek II. John Nerbonne. March 17, Dept of Information Science incl. important reworkings by Harmut Fitz Dept of Information Science j.nerbonne@rug.nl incl. important reworkings by Harmut Fitz March 17, 2015 Review: regression compares result on two distinct tests, e.g., geographic and phonetic distance of

More information

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables Regression Analysis Regression: Methodology for studying the relationship among two or more variables Two major aims: Determine an appropriate model for the relationship between the variables Predict the

More information

Statistics for exp. medical researchers Regression and Correlation

Statistics for exp. medical researchers Regression and Correlation Faculty of Health Sciences Regression analysis Statistics for exp. medical researchers Regression and Correlation Lene Theil Skovgaard Sept. 28, 2015 Linear regression, Estimation and Testing Confidence

More information

STAB22 Statistics I. Lecture 7

STAB22 Statistics I. Lecture 7 STAB22 Statistics I Lecture 7 1 Example Newborn babies weight follows Normal distr. w/ mean 3500 grams & SD 500 grams. A baby is defined as high birth weight if it is in the top 2% of birth weights. What

More information