Correlation and Regression


1 Correlation and Regression Marc H. Mehlman University of New Haven "All models are wrong. Some models are useful." George Box "The statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world." George Box (University of New Haven) Correlation and Regression 1 / 64

2 Table of Contents 1 Bivariate Data 2 Correlation 3 Simple Regression 4 Variation 5 Multivariate Regression 6 Logistic Regression 7 Chapter #9 R Assignment (University of New Haven) Correlation and Regression 2 / 64

3 Bivariate Data Bivariate Data and Scatterplots Bivariate Data and Scatterplots (University of New Haven) Correlation and Regression 3 / 64

4 Bivariate Data Bivariate data comes from measuring two aspects of the same item/individual. For instance, (70, 178), (72, 192), (74, 184), (68, 181) is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds. Can one variable be used to predict the other? Do tall people tend to weigh more? Definition A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable. (University of New Haven) Correlation and Regression 4 / 64

5 Bivariate Data Bivariate data For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable? Here we have two quantitative variables recorded for each of 16 students: 1. how many beers they drank 2. their resulting blood alcohol content (BAC) [Table omitted: Student ID, Number of Beers, Blood Alcohol Content.] (University of New Haven) Correlation and Regression 5 / 64

6 Bivariate Data Scatterplots A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph. [Scatterplot omitted: Beers vs. BAC by student.] (University of New Haven) Correlation and Regression 6 / 64

7 Bivariate Data > plot(trees$Girth~trees$Height,main="girth vs height") [Scatterplot omitted: trees$Girth vs. trees$Height.] (University of New Haven) Correlation and Regression 7 / 64

8 Bivariate Data How to scale a scatterplot Same data in all four plots Both variables should be given a similar amount of space: Plot is roughly square Points should occupy all the plot space (no blank space) (University of New Haven) Correlation and Regression 8 / 64

9 Bivariate Data Interpreting scatterplots After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for Form: linear, curved, clusters, no pattern Direction: positive, negative, no direction Strength: how closely the points fit the form and clear deviations from that pattern Outliers of the relationship (University of New Haven) Correlation and Regression 9 / 64

10 Bivariate Data Form Linear No relationship Nonlinear (University of New Haven) Correlation and Regression 10 / 64

11 Bivariate Data Direction Positive association: High values of one variable tend to occur together with high values of the other variable. Negative association: High values of one variable tend to occur together with low values of the other variable. (University of New Haven) Correlation and Regression 11 / 64

12 Bivariate Data Strength The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. (University of New Haven) Correlation and Regression 12 / 64

13 Bivariate Data Outliers An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship. (University of New Haven) Correlation and Regression 13 / 64

14 Bivariate Data Adding categorical variables to scatterplots Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph. The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple). The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size. (University of New Haven) Correlation and Regression 14 / 64

15 Correlation Correlation Correlation (University of New Haven) Correlation and Regression 15 / 64

16 Correlation Definition Given the bivariate data, (x_1, y_1), ..., (x_n, y_n), the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is
r = \frac{1}{n-1} \sum_{j=1}^{n} \left( \frac{x_j - \bar{x}}{s_x} \right) \left( \frac{y_j - \bar{y}}{s_y} \right).
The population correlation coefficient is denoted
\rho = \frac{1}{N} \sum_{j=1}^{N} \left( \frac{x_j - \mu_X}{\sigma_X} \right) \left( \frac{y_j - \mu_Y}{\sigma_Y} \right),
where the sum is taken over the entire population of size N. One thinks of r as an estimator of \rho. (University of New Haven) Correlation and Regression 16 / 64
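The slides do all computation in R; as a language-neutral numeric sketch, the definitional formula for r and the equivalent computational form can be checked in Python on a small made-up sample (the data below are hypothetical, not from the slides):

```python
import math

# Hypothetical sample (illustration only; not data from the slides)
x = [70, 72, 74, 68]
y = [178, 192, 184, 181]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

# Definitional formula: (1/(n-1)) * sum of products of standardized values
r = sum(((xi - xbar) / s_x) * ((yi - ybar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

# Equivalent computational formula (no means or standard deviations needed)
num = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
                * (n * sum(yi ** 2 for yi in y) - sum(y) ** 2))
r_computational = num / den
```

Both forms give the same value of r, and that value always lies in [-1, 1].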

17 Correlation One can also use the formula
r = \frac{n \sum_{j=1}^{n} x_j y_j - \left( \sum_{j=1}^{n} x_j \right) \left( \sum_{j=1}^{n} y_j \right)}{\sqrt{\left[ n \sum_{j=1}^{n} x_j^2 - \left( \sum_{j=1}^{n} x_j \right)^2 \right] \left[ n \sum_{j=1}^{n} y_j^2 - \left( \sum_{j=1}^{n} y_j \right)^2 \right]}}
R command: > cor(trees$Girth,trees$Height) [1] (University of New Haven) Correlation and Regression 17 / 64


19 Correlation The correlation coefficient measures the strength of any linear relationship between X and Y. Properties of Correlation: cor(X, Y) = cor(Y, X). -1 <= r <= 1, and r is scale invariant. If r is positive, there is a positive linear relationship between the two variables. If r is negative, there is a negative linear relationship between the two variables. The closer |r| is to one, the stronger the linear relationship between the two variables. If |r| = 1 (i.e., r = 1 or -1), all the data points lie on a straight line. (University of New Haven) Correlation and Regression 18 / 64


25 Correlation r has no unit: r is computed from the standardized values of x (unitless) and the standardized values of y (unitless), so r itself is unitless. (University of New Haven) Correlation and Regression 19 / 64

26 Correlation r is not resistant to outliers Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers. Just moving one point away from the linear pattern here weakens the correlation from 0.91 to 0.75 (closer to zero). (University of New Haven) Correlation and Regression 20 / 64

27 Correlation [Figure omitted.] (University of New Haven) Correlation and Regression 21 / 64

28 Correlation Caution: Correlation is not Causation Definition When calculating correlation, a lurking variable is a third factor that explains the relationship between the two correlated variables. Example (Lurking Variables) There is a strong correlation between shoe size and reading skills among elementary school children. The lurking variable is ____. There is a strong correlation between the number of firefighters at a fire site and the amount of damage. The lurking variable is ____. Caution: Beware correlations based on averaged data. While there is a strong correlation between average age and average height among children, the correlation between age and height for individual children is much, much lower. (University of New Haven) Correlation and Regression 22 / 64

29 Correlation Definition Two variables are confounded when their effects on the response variable cannot be distinguished from each other. The confounded variables can be either explanatory or lurking variables (or only work in the presence of each other). The only way to distinguish between two confounded variables is to redesign the experiment. Example When I'm stressed, I get muscle cramps. However, when I'm stressed, I also drink lots of coffee and lose sleep. Are the cramps caused by stress, or coffee, or lack of sleep, or some combination of the above? Example A classic example of confounding: A study suggests that people who carry matches are more likely to develop lung cancer. Is it the matches, or is there confounding here with a lurking variable? (University of New Haven) Correlation and Regression 23 / 64


32 Correlation Establishing causation Establishing causation from an observed association can be done if: 1) The association is strong. 2) The association is consistent. 3) Higher doses are associated with stronger responses. 4) The alleged cause precedes the effect. 5) The alleged cause is plausible. Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer. (University of New Haven) Correlation and Regression 24 / 64

33 Correlation Theorem (Test for Correlation) Let H_0: \rho = 0 vs H_1: \rho \neq 0. The test statistic for H_0 is
t = r \sqrt{\frac{n-2}{1-r^2}} \sim t(n-2).
One can also use Table A-6 with test statistic r. R command: cor.test(X, Y) (one can also do one-sided tests with R). (University of New Haven) Correlation and Regression 25 / 64
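The test statistic in the theorem is simple arithmetic, so it can be sketched outside R; here is a small Python version (the numbers plugged in come from the sleeping/TV example later in the slides, where r = 0.12 and n = 63):

```python
import math

def cor_test_stat(r, n):
    """t statistic for H0: rho = 0 against H1: rho != 0, with df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# r = 0.12 from n = 63 paired observations, so df = 61
t = cor_test_stat(0.12, 63)
```

The p-value then comes from the t(n - 2) distribution (in R, 2*(1-pt(t, n-2)) for the two-sided test).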

34 Using R Correlation Example > cor.test(trees$Girth,trees$Height) Pearson's product-moment correlation data: trees$Girth and trees$Height t = , df = 29, p-value = alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: sample estimates: cor Note that one is assuming that the (trees$Height, trees$Girth) pairs are sampled from a bivariate normal distribution. (University of New Haven) Correlation and Regression 26 / 64

35 Correlation Example Each day, for the last 63 days, measurements of the time Joe spends sleeping and the time he spends watching TV are taken. Assume time spent sleeping and time spent watching TV form a bivariate normal random variable. A sample correlation of r = 0.12 is calculated. Find the p value of H_0: \rho = 0 versus H_A: \rho \neq 0. Solution: > tstat=0.12*sqrt((63-2)/(1-0.12^2)) > tstat [1] > 2*(1-pt(tstat,61)) [1] There is little evidence that the time Joe spends sleeping and the time Joe spends watching TV are correlated. (University of New Haven) Correlation and Regression 27 / 64


37 Simple Regression Simple Regression Simple Regression (University of New Haven) Correlation and Regression 28 / 64

38 Simple Regression Let X = the predictor or independent variable and Y = the response or dependent variable. Given a bivariate random variable, (X, Y), is there a linear (straight line) association between X and Y (plus some randomness)? And if so, what is it and how much randomness? Definition (Statistical Model of Simple Linear Regression) Given a predictor, x, the response, y, is
y = \beta_0 + \beta_1 x + \epsilon_x
where \beta_0 + \beta_1 x is the mean response for x. The noise terms, the \epsilon_x's, are assumed to be independent of each other and to be randomly sampled from N(0, \sigma). The parameters of the model are \beta_0, \beta_1 and \sigma. (University of New Haven) Correlation and Regression 29 / 64

39 Simple Regression Conditions for Regression Inference The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line \mu_y = \beta_0 + \beta_1 x. For each possible value of the explanatory variable x, the mean of the responses \mu(y|x) moves along this line. The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation \sigma, so the variability of y is the same for all values of x. The value of \sigma determines whether the points fall close to the population regression line (small \sigma) or are widely scattered (large \sigma). (University of New Haven) Correlation and Regression 30 / 64

40 Simple Regression [Four example scatterplots omitted:] Moderate linear association; regression OK. Obvious nonlinear relationship; regression inappropriate. One extreme outlier, requiring further examination. Only two values for x; a redesign is due here. (University of New Haven) Correlation and Regression 31 / 64

41 Simple Regression Given a bivariate random sample from the simple linear regression model, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), one wishes to estimate the parameters of the model, (\beta_0, \beta_1, \sigma). Given an arbitrary line, y = mx + b, define the sum of the squares of errors to be
\sum_{i=1}^{n} [y_i - (m x_i + b)]^2.
Using Calculus, one can find the least squares regression line, y = b_0 + b_1 x, that minimizes the sum of squares of errors. (University of New Haven) Correlation and Regression 32 / 64

42 Simple Regression Theorem (Estimating \beta_0 and \beta_1) Given the bivariate random sample, (x_1, y_1), ..., (x_n, y_n), the least squares regression line, y = b_0 + b_1 x, is obtained by letting
b_1 = r \left( \frac{s_y}{s_x} \right) and b_0 = \bar{y} - b_1 \bar{x},
where b_0 is an unbiased estimator of \beta_0 and b_1 is an unbiased estimator of \beta_1. Note: The point (\bar{x}, \bar{y}) will lie on the regression line, though there is no reason to believe that (\bar{x}, \bar{y}) is one of the data points. One can also calculate b_1 using
b_1 = \frac{n \sum_{j=1}^{n} x_j y_j - (\sum_{j=1}^{n} x_j)(\sum_{j=1}^{n} y_j)}{n \sum_{j=1}^{n} x_j^2 - (\sum_{j=1}^{n} x_j)^2}.
(University of New Haven) Correlation and Regression 33 / 64
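The two expressions for the slope agree, and the fitted line passes through (x-bar, y-bar); a short Python sketch with hypothetical data (the slides themselves use R's trees data frame) makes both facts concrete:

```python
import math

# Hypothetical data for illustration (not from the slides)
x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.0, 8.0, 12.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx           # slope of the least squares line
b0 = ybar - b1 * xbar    # intercept: forces the line through (xbar, ybar)

# The slope also equals r * (s_y / s_x), as in the theorem
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
```

Substituting x = xbar into b0 + b1 * x returns exactly ybar, which is the "passes through (x-bar, y-bar)" note above.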

43 Simple Regression Example > plot(trees$Girth~trees$Height,main="girth vs height") > abline(lm(Girth ~ Height, data=trees), col="red") [Scatterplot with fitted regression line omitted.] Since both variables come from trees, in order for the R command lm (linear model) to work, trees has to be in the R format, data.frame. > class(trees) # "trees" is in data.frame format - lm will work. [1] "data.frame" > g.lm=lm(Girth~Height,data=trees) > coef(g.lm) (Intercept) Height (University of New Haven) Correlation and Regression 34 / 64


45 Simple Regression Definition The predicted value of y at x_j is \hat{y}_j = b_0 + b_1 x_j. The predicted value, \hat{y}, is an unbiased estimator of the mean response, \mu_y. Example Using the R dataset trees, one wants the predicted girth of three trees, of heights 74, 83 and 91 respectively. One uses the regression model Girth ~ Height for our predictions. The work below is done in R. > g.lm=lm(Girth~Height,data=trees) > predict(g.lm,newdata=data.frame(Height=c(74,83,91))) (University of New Haven) Correlation and Regression 35 / 64

46 Simple Regression Never make forecasts, especially about the future. Samuel Goldwyn The regression line only has predictive value for y at x if 1 \rho \neq 0 (if there is no significant linear correlation, don't use the regression line for predictions; if \rho = 0, then \bar{y} is the best predictor of y at x) and 2 one only predicts y for x's within the range of the x_j's: one does not predict the girth of a tree with a height of 1000 feet. Interpolate, don't extrapolate. |r| (or r^2) is a measure of how well the regression equation fits the data: the bigger |r|, the better the data fit the regression line, and the better the prediction. (University of New Haven) Correlation and Regression 36 / 64

47 Simple Regression Outliers and influential points Outlier: An observation that lies outside the overall pattern. Influential individual: An observation that markedly changes the regression if removed. This is often an isolated point. Child 19 = outlier (large residual): Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically). Child 18 = potential influential individual: Child 18 is isolated from the rest of the points, and might be an influential point. (University of New Haven) Correlation and Regression 37 / 64

48 Simple Regression [Three fitted plots omitted: all data, without child 18, without child 19.] Child 18 changes the regression line substantially when it is removed. So, Child 18 is indeed an influential point. Child 19 is an outlier of the relationship, but it is not influential (the regression line changed very little by its removal). (University of New Haven) Correlation and Regression 38 / 64

49 Simple Regression Definition Given a data point, (x_j, y_j), the residual of that point is y_j - \hat{y}_j. Note: 1 Outliers are data points with large residuals. 2 The residuals should be approximately N(0, \sigma). 3 The regression equation gives the smallest possible \sum \text{residuals}^2 = \sum (y - \hat{y})^2. (University of New Haven) Correlation and Regression 39 / 64

50 Simple Regression R command for finding residuals: Example > g.lm=lm(Girth~Height,data=trees) > residuals(g.lm) (University of New Haven) Correlation and Regression 40 / 64

51 Simple Regression Definition Given bivariate data, (x_1, y_1), ..., (x_n, y_n), the residual plot is a plot of the residuals against the x_j's. If (X, Y) is bivariate normal, the residuals satisfy the Homoscedasticity Assumption: Definition (Homoscedasticity Assumption) The assumption that the variance around the regression line is the same for all values of the predictor variable X. In other words, the pattern of the spread of the residual points around the x axis does not change as one travels left to right on the x axis. There should not be discernible patterns in the residual plot. (University of New Haven) Correlation and Regression 41 / 64

52 Simple Regression R command for testing if the Linear Model applies (residuals approximately N(0, \sigma)). Example > g.lm=lm(Girth~Height,data=trees) > par(mfrow=c(2,2)) # visualize four graphs at once > plot(g.lm) > par(mfrow=c(1,1)) # reset the graphics defaults [Diagnostic plots omitted: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.] (University of New Haven) Correlation and Regression 42 / 64

53 Variation Variation Variation (University of New Haven) Correlation and Regression 43 / 64

54 Variation
\underbrace{y_j - \bar{y}}_{\text{total deviation}} = \underbrace{\hat{y}_j - \bar{y}}_{\text{explained deviation}} + \underbrace{y_j - \hat{y}_j}_{\text{unexplained deviation}}.
From here, using some math, one gets the following sums of squares (SS):
\underbrace{\sum_{j=1}^{n} (y_j - \bar{y})^2}_{SS_{TOT} = \text{total variation}} = \underbrace{\sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2}_{SS_A = \text{explained variation}} + \underbrace{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}_{SS_E = \text{unexplained variation}}.
(University of New Haven) Correlation and Regression 44 / 64
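The sum-of-squares decomposition holds exactly for a least squares fit; a Python sketch on a small hypothetical dataset (not from the slides) verifies it numerically:

```python
# Least squares fit on a small hypothetical dataset, then the
# decomposition SS_TOT = SS_A + SS_E
x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.0, 8.0, 12.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - ybar) ** 2 for yi in y)                # total variation
ss_a = sum((yh - ybar) ** 2 for yh in yhat)               # explained
ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))     # unexplained
```

The ratio ss_a / ss_tot is the coefficient of determination defined on the next slide, and it equals the squared correlation coefficient.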

55 Variation Definition The coefficient of determination is the portion of the variation in y explained by the regression equation:
r^2 = \frac{SS_A}{SS_{TOT}} = \frac{\sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2}.
Properties of the Coefficient of Determination: 1 r^2 = (r)^2 = (correlation coefficient)^2. 2 r^2 = proportion of variation of Y that is explained by the linear relationship between X and Y. Example Using R, since > (cor(trees$Girth,trees$Height))^2 [1] one concludes that approximately 27% of the variation in tree Girth is explained by tree Height and 73% by other factors. (University of New Haven) Correlation and Regression 45 / 64

56 Variation r = 0.3, r^2 = 0.09, or 9%: The regression model explains not even 10% of the variation in y. r = 0.7, r^2 = 0.49, or 49%: The regression model explains nearly half of the variation in y. r = 0.99, r^2 = 0.9801, or ~98%: The regression model explains almost all of the variation in y. (University of New Haven) Correlation and Regression 46 / 64

57 Variation Definition The variance of the observed y_j's about the predicted \hat{y}_j's is
s^2 = \frac{SS_E}{n-2} = \frac{\sum (y_j - \hat{y}_j)^2}{n-2} = \frac{\sum y_j^2 - b_0 \sum y_j - b_1 \sum x_j y_j}{n-2},
which is an unbiased estimator of \sigma^2. The standard error of estimate (also called the residual standard error) is s, an estimator of \sigma. Note: (b_0, b_1, s) is an estimator of the parameters of the simple linear regression model, (\beta_0, \beta_1, \sigma). Furthermore, b_0, b_1 and s^2 are unbiased estimators of \beta_0, \beta_1 and \sigma^2. (University of New Haven) Correlation and Regression 47 / 64
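Both expressions for s^2 (the residual form and the computational shortcut) should agree on any least squares fit; here is a Python check on a small hypothetical dataset (illustration only, not the slides' trees data):

```python
import math

# Hypothetical data for illustration
x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.0, 8.0, 12.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# Residual form: s^2 = SS_E / (n - 2)
ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = ss_e / (n - 2)
s = math.sqrt(s2)   # the standard error of estimate

# Computational shortcut: (sum y^2 - b0*sum y - b1*sum xy) / (n - 2)
shortcut = (sum(yi ** 2 for yi in y) - b0 * sum(y)
            - b1 * sum(xi * yi for xi, yi in zip(x, y))) / (n - 2)
```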

58 Variation Definition Let y be a future observation corresponding to x. A (1 - \alpha)100% Prediction Interval for y is a confidence interval where y will be in the confidence interval (1 - \alpha)100% of the time. A prediction interval is a confidence interval that not only has to contend with the variability of the response variable, but also the fact that \beta_0 and \beta_1 can only be approximated. Theorem ((1 - \alpha)100% Prediction Interval for y given x = x^*) A (1 - \alpha)100% Prediction Interval for y given x = x^* is \hat{y} \pm m where \hat{y} = b_0 + b_1 x^* and the margin of error is
m = t_{\alpha/2}(n-2) \underbrace{s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}}}_{SE_{\hat{y}}}.
(University of New Haven) Correlation and Regression 48 / 64
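The margin-of-error formula can be sketched directly; in this Python version the t critical value is supplied by the caller (e.g. from a t table), since the standard library has no t quantile function — that parameterization is a choice of this sketch, not something from the slides:

```python
import math

def prediction_margin(t_crit, s, n, x_star, xs):
    """m = t_crit * s * sqrt(1 + 1/n + (x* - xbar)^2 / sum((xj - xbar)^2)).

    t_crit is the t_{alpha/2}(n-2) critical value, supplied by the caller.
    """
    xbar = sum(xs) / len(xs)
    sxx = sum((xj - xbar) ** 2 for xj in xs)
    return t_crit * s * math.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)
```

Note that the margin is smallest when x* equals x-bar and grows as x* moves toward the edge of the observed x range, one more reason to interpolate rather than extrapolate.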

59 Variation A confidence interval for y: [Figure omitted.] (University of New Haven) Correlation and Regression 49 / 64

60 Variation R commands: Example > g.lm=lm(Girth~Height,data=trees) > predict(g.lm,newdata=data.frame(Height=c(74,83,91)),interval="prediction",level=.90) fit lwr upr > summary(g.lm) Call: lm(formula = Girth ~ Height, data = trees) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) Height ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 29 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 1 and 29 DF, p-value: (University of New Haven) Correlation and Regression 50 / 64

61 Multivariate Regression Multivariate Regression Multivariate Regression (University of New Haven) Correlation and Regression 51 / 64

62 Multivariate Regression Given multivariate data,
(x_1^{(1)}, x_2^{(1)}, ..., x_k^{(1)}, y_1), (x_1^{(2)}, x_2^{(2)}, ..., x_k^{(2)}, y_2), ..., (x_1^{(n)}, x_2^{(n)}, ..., x_k^{(n)}, y_n),
where (x_1^{(i)}, x_2^{(i)}, ..., x_k^{(i)}) is a predictor of the response y_i, one explores the following possible model. Definition (Statistical Model of Multivariate Linear Regression) Given a k dimensional multivariate predictor, (x_1^{(i)}, x_2^{(i)}, ..., x_k^{(i)}), the response, y_i, is
y_i = \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} + \epsilon_i
where \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_k x_k^{(i)} is the mean response. The noise terms, the \epsilon_i's, are assumed to be independent of each other and to be randomly sampled from N(0, \sigma). The parameters of the model are \beta_0, \beta_1, ..., \beta_k and \sigma. (University of New Haven) Correlation and Regression 52 / 64

63 Multivariate Regression Definition Given a multivariate normal sample,
(x_1^{(1)}, ..., x_k^{(1)}, y_1), ..., (x_1^{(n)}, ..., x_k^{(n)}, y_n),
the least squares multiple regression equation,
\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k,
is the linear equation that minimizes
\sum_{j=1}^{n} (\hat{y}_j - y_j)^2,
where \hat{y}_j = b_0 + b_1 x_1^{(j)} + \cdots + b_k x_k^{(j)}. (University of New Haven) Correlation and Regression 53 / 64

64 Multivariate Regression There must be at least k + 2 data points to obtain the estimators b_0, the b_j's and
s^2 = \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{n - k - 1}
of \beta_0, the \beta_j's and \sigma^2, where b_0, the y intercept, is the unbiased, least squares estimator of \beta_0; b_j, the coefficient of x_j, is the unbiased, least squares estimator of \beta_j; s^2 is an unbiased estimator of \sigma^2 and s is an estimator of \sigma. Due to computational intensity, computers are used to obtain b_0, the b_j's and s^2. (University of New Haven) Correlation and Regression 54 / 64

65 Multivariate Regression Example > g.lm=lm(mpg~disp+hp+wt+qsec, data=mtcars) > par(mfrow=c(2,2)) > plot(g.lm) > par(mfrow=c(1,1)) Does the linear model fit? [Diagnostic plots omitted: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.] (University of New Haven) Correlation and Regression 55 / 64

66 Multivariate Regression Example (cont.) > summary(g.lm) Call: lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) ** disp hp wt ** qsec --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 27 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 4 and 27 DF, p-value: 3.311e-10 (University of New Haven) Correlation and Regression 56 / 64

67 Multivariate Regression Inflation Problem: As k increases, R^2 increases, but the increase in predictability is illusory. Solution: it is best to use the adjusted coefficient of determination. Definition The adjusted coefficient of determination is
R^2_{adj} = 1 - \frac{n-1}{n-k-1} (1 - R^2).
The p value of a test of H_0: \beta_1 = \cdots = \beta_k = 0 versus H_A: not H_0 is associated with an F-statistic. The p value of a test of H_0: \beta_j = 0 versus H_A: \beta_j \neq 0 is associated with a t-statistic. (University of New Haven) Correlation and Regression 57 / 64
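The adjustment formula is a one-liner; a Python sketch shows the penalty at work — with R^2 held fixed, adding more predictors (larger k) drives the adjusted value down:

```python
def adjusted_r2(r2, n, k):
    """R^2_adj = 1 - (n - 1)/(n - k - 1) * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)
```

For example, with n = 11 observations, an unadjusted R^2 of 0.5 adjusts to about 0.44 for a one-predictor model but all the way to 0 for a five-predictor model, reflecting that the extra predictors bought no real fit.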

68 Multivariate Regression Factor Analysis: One strives for the best fit (largest R^2 and smallest p value associated with the F statistic) with the fewest number of independent variables. Independent variables that are mostly independent of the dependent variable, or highly correlated with another independent variable, can be discarded. It is an art. Doing this mechanically (on a machine) is called stepwise regression. (University of New Haven) Correlation and Regression 58 / 64

69 Logistic Regression Logistic Regression Logistic Regression (University of New Haven) Correlation and Regression 59 / 64

70 Logistic Regression A variable that takes on only the values 0 and 1 is a dummy variable, e.g., gender, infected, etc. If a dummy variable is an independent variable, use the methods of this chapter. If a dummy variable is the dependent variable, use logistic regression: let Y, the dependent variable, be the dummy variable and use
\tilde{Y} = \ln \left( \frac{p}{1-p} \right)
in place of Y. Here p is the probability that Y = 1. (University of New Haven) Correlation and Regression 60 / 64
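The log-odds transform and its inverse (the logistic function) are easy to sketch in Python; the function names here are just illustrative labels, not from the slides:

```python
import math

def logit(p):
    """Y~ = ln(p / (1 - p)): log-odds, mapping probabilities in (0, 1) onto the real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Logistic (inverse logit): recovers the probability p from log-odds z."""
    return 1 / (1 + math.exp(-z))
```

Because the log-odds range over the whole real line, a linear model in the predictors becomes sensible for \tilde{Y} even though Y itself only takes the values 0 and 1.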

71 Chapter #9 R Assignment Chapter #9 R Assignment Chapter #9 R Assignment (University of New Haven) Correlation and Regression 61 / 64

72 Chapter #9 R Assignment (from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition, Duxbury, 1990)) Fifteen alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded along with the snout vent length (in inches; this is the distance between the back of the head and the end of the nose). The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length: lnlength ~ lnweight. The authors analyzed the data on the log scale (natural logarithms) and we will follow their approach for consistency. > lnlength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76, , 3.58, 4.19, 3.78, 3.71, 3.73, 3.78) > lnweight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50, , 3.64, 5.90, 4.43, 4.38, 4.42, 4.25) (University of New Haven) Correlation and Regression 62 / 64

1. Create a scatterplot of lnlength ~ lnweight, complete with the regression line.
2. What are the slope and y-intercept of the regression line?
3. Predict lnlength when lnweight is five.
4. Use graphs to decide if lnlength ~ lnweight satisfies the requirements for being a linear model.
5. Find a 95% prediction interval for lnlength when lnweight is five.
6. What is the p-value of a test of H0: β1 = 0 versus HA: β1 ≠ 0?
7. What is the standard error of estimate?
8. What is the coefficient of determination, R²?
9. What are the explained variation, the unexplained variation and the total variation?
10. What is the F statistic of H0: β1 = 0 versus HA: β1 ≠ 0, and what are its degrees of freedom?
11. Using the correlation test, what is the p-value of a test of H0: ρ = 0 versus HA: ρ ≠ 0?

(University of New Haven) Correlation and Regression 63 / 64
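Each of the questions above maps onto a standard R command. A sketch, using small hypothetical vectors as a stand-in (substitute the assignment's lnlength and lnweight once entered):

```r
# Hypothetical stand-in data, roughly linear, for illustration only:
lnweight <- c(3.3, 3.9, 4.4, 4.5, 4.9, 5.9, 6.5)
lnlength <- c(3.4, 3.6, 3.8, 3.8, 3.9, 4.2, 4.3)

fit <- lm(lnlength ~ lnweight)

plot(lnweight, lnlength); abline(fit)             # 1: scatterplot + line
coef(fit)                                         # 2: intercept and slope
predict(fit, newdata = data.frame(lnweight = 5))  # 3: point prediction
par(mfrow = c(2, 2)); plot(fit)                   # 4: diagnostic plots
predict(fit, newdata = data.frame(lnweight = 5),
        interval = "prediction")                  # 5: 95% prediction interval
summary(fit)       # 6, 7, 8, 10: p-values, standard error of estimate, R^2, F
anova(fit)         # 9: explained/unexplained sums of squares
cor.test(lnlength, lnweight)                      # 11: test of rho = 0
```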

First enter into R:

> class(state.x77) # "lm" needs a data.frame, not a matrix
[1] "matrix"
> st = as.data.frame(state.x77) # make state.x77 a data.frame
> class(st) # "st" is a data.frame
[1] "data.frame"
> colnames(st)[4] = "Life.Exp" # no spaces in variable names
> colnames(st)[6] = "HS.Grad" # no spaces in variable names

1. Do a multivariate regression with Life.Exp as the response variable and Population, Income, Illiteracy, Murder, HS.Grad, Frost and Area as explanatory variables.
   (a) Show that the multivariate linear regression model fits this data.
   (b) What are R² and adjusted R²?
   (c) Which explanatory variables are significant at the 0.05 level?
2. Do another multivariate regression, but with only Murder and HS.Grad as explanatory variables. What are R² and adjusted R²?
3. Comparing the adjusted R² values in the above two problems, what do you conclude?

(University of New Haven) Correlation and Regression 64 / 64
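A sketch of the two fits, assuming st has been built exactly as above (state.x77 is a built-in R data set):

```r
# Rebuild st as on the slide:
st <- as.data.frame(state.x77)
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

# 1: all seven explanatory variables
full <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
             HS.Grad + Frost + Area, data = st)
summary(full)$adj.r.squared

# 2: Murder and HS.Grad only
small <- lm(Life.Exp ~ Murder + HS.Grad, data = st)
summary(small)$adj.r.squared
```

Comparing the two adjusted R² values is the point of question 3; adjusted R² penalizes extra predictors, so it can favor the smaller model even when plain R² does not.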


Warm-up Using the given data Create a scatterplot Find the regression line Time at the lunch table Caloric intake 21.4 472 30.8 498 37.7 335 32.8 423 39.5 437 22.8 508 34.1 431 33.9 479 43.8 454 42.4 450 43.1 410 29.2 504 31.3 437 28.6 489 32.9 436 30.6 480 35.1 439 33.0 444

More information

Objectives. 2.1 Scatterplots. Scatterplots Explanatory and response variables. Interpreting scatterplots Outliers

Objectives. 2.1 Scatterplots. Scatterplots Explanatory and response variables. Interpreting scatterplots Outliers Objectives 2.1 Scatterplots Scatterplots Explanatory and response variables Interpreting scatterplots Outliers Adapted from authors slides 2012 W.H. Freeman and Company Relationships A very important aspect

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring

More information

1. Use Scenario 3-1. In this study, the response variable is

1. Use Scenario 3-1. In this study, the response variable is Chapter 8 Bell Work Scenario 3-1 The height (in feet) and volume (in cubic feet) of usable lumber of 32 cherry trees are measured by a researcher. The goal is to determine if volume of usable lumber can

More information

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Prepared by: Prof Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Putra Malaysia Serdang M L Regression is an extension to

More information