13 Simple Linear Regression

13.1 An industrial example

A study was undertaken to determine the effect of stirring rate on the amount of impurity in paint produced by a chemical process. The study yielded the data shown in the following S+ output. The stirring rate is in revolutions per minute and the impurity is recorded as a percentage. It appears from the data that twelve stirring rates were chosen at intervals of 2 rpm and the resulting impurity levels recorded for each stirring rate. The subsequent plot shows that impurity increases approximately linearly with stirring rate.

> stirrate <- seq(20, 42, 2)
> impurity <- c(8.4, 9.5, 11.8, 10.4, 13.3, 14.8, 13.2, 14.7, 16.4, 16.5, 18.9, 18.5)
> paint.data <- data.frame(stirrate, impurity)
> paint.data
   stirrate impurity
1        20      8.4
2        22      9.5
3        24     11.8
4        26     10.4
5        28     13.3
6        30     14.8
7        32     13.2
8        34     14.7
9        36     16.4
10       38     16.5
11       40     18.9
12       42     18.5
> rm(stirrate, impurity)
> attach(paint.data)
> plot(stirrate, impurity)

[Figure 1: Plot of impurity versus stirrate]

13.2 The statistical model for simple linear regression

In general, suppose that we have observed n pairs of values, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y is regarded as the observed value of the response variable Y and x as the regressor variable (or predictor variable or explanatory variable), so that Y is the dependent variable, and we wish to investigate how the values of Y depend upon the values of x. The simplest model is a linear one. Given the set of values x_i, i = 1, ..., n, regarded as fixed and observed without error, consider the linear regression model

    Y_i = β_0 + β_1 x_i + ε_i,   i = 1, ..., n,   (1)

where β_0 and β_1 are unknown parameters. The random errors ε_i are assumed to be NID(0, σ²), with σ² unknown.

We are now looking at the relationship of the (observed) response variable y to a quantitative factor, which takes numerical values x. Previously we dealt with a qualitative factor, in the form of a treatment, the different levels of which did not necessarily represent different numerical levels of some variable, and even if they did, this was not taken into account in the underlying statistical model.

The line with equation y = β_0 + β_1 x is known as the regression line. The regression coefficient β_1 is the slope of the regression line and the regression coefficient (the constant) β_0 is the intercept of the line on the y-axis.
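Model (1) is easy to simulate, which can help to fix ideas. The following is a minimal sketch in R (whose lm and plotting functions closely resemble the S+ ones used in these notes); the values of β_0, β_1 and σ below are made-up illustrative numbers, not estimates from the paint data.

# Simulate one realization of model (1): Y_i = beta0 + beta1*x_i + eps_i,
# with errors NID(0, sigma^2); parameter values are illustrative only.
set.seed(1)
x     <- seq(20, 42, 2)                      # fixed design points, as in the paint example
beta0 <- -0.3; beta1 <- 0.46; sigma <- 0.9   # hypothetical "true" parameter values
y     <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
plot(x, y)                 # scatter plot of one simulated sample
abline(beta0, beta1)       # the true regression line y = beta0 + beta1*x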

13.3 The least squares estimates of the parameters

We shall use hatted Greek letters, β̂, for parameter estimators, and lower case Roman letters, b, for parameter estimates. Thus

                            coefficients    regression line
    model parameters        β_0, β_1        y = β_0 + β_1 x
    parameter estimators    β̂_0, β̂_1        y = β̂_0 + β̂_1 x
    parameter estimates     b_0, b_1        y = b_0 + b_1 x

Given estimated parameter values, for each x_i the corresponding observed fitted value ŷ_i is given by ŷ_i = b_0 + b_1 x_i, and e_i ≡ y_i - ŷ_i is the corresponding observed residual. According to the method of least squares, given the observed values (x_i, y_i), i = 1, ..., n, we choose our parameter estimates, b_0 and b_1, to be those values of β_0 and β_1 that minimize

    L = Σ_{i=1}^{n} (y_i - β_0 - β_1 x_i)².   (2)

In geometrical terms, given a scatter plot of the points (x_i, y_i), i = 1, ..., n, we choose our fitted regression line in such a way as to minimize the sum of squares of the vertical distances of the points from the line.

It is worth introducing some more notation at this stage. In what follows, all summations are from i = 1 to n. Denote the corrected sums of squares by

    S_xx = Σ (x_i - x̄)²   and   S_yy = Σ (Y_i - Ȳ)².

Note that S_yy is the total (corrected) sum of squares in the ANOVA. The corrected sum of products, S_xy, is defined by

    S_xy = Σ (x_i - x̄)(Y_i - Ȳ),

or, equivalently, by S_xy = Σ (x_i - x̄)Y_i. Note that, whereas S_xx and S_yy are necessarily non-negative, S_xy can take negative values. The observed values of S_yy and S_xy are denoted by s_yy and s_xy respectively. It turns out that the least squares estimates b_1 and b_0 of β_1 and β_0, respectively, are given by

    b_1 = s_xy / s_xx   (3)

and

    b_0 = ȳ - b_1 x̄.
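Equation (3) and the expression for b_0 are easily checked numerically. A short sketch in R (assuming the data frame paint.data of Section 13.1 is available) computes the estimates from first principles and compares them with the output of lm:

# Least squares estimates computed directly from the definitions
x   <- paint.data$stirrate
y   <- paint.data$impurity
sxx <- sum((x - mean(x))^2)     # s_xx, corrected sum of squares of x
sxy <- sum((x - mean(x)) * y)   # s_xy (the centred-y form gives the same value)
b1  <- sxy / sxx                # slope estimate, Equation (3)
b0  <- mean(y) - b1 * mean(x)   # intercept estimate, b_0 = ybar - b_1 xbar
c(b0 = b0, b1 = b1)
coef(lm(impurity ~ stirrate, data = paint.data))   # should agree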

The equation of the fitted regression line,

    y = b_0 + b_1 x,

can thus be written as

    y = ȳ + b_1 (x - x̄).

This is the equation of the line with slope b_1 = s_xy/s_xx passing through the point (x̄, ȳ).

13.4 The partition of the total sum of squares

It turns out that the total sum of squares SS_T ≡ S_yy may be partitioned as

    SS_T = SS_Reg + SS_R,   (4)

where SS_Reg is the regression sum of squares,

    SS_Reg = β̂_1 S_xy,   (5)

and the residual sum of squares SS_R corresponds to the minimized value of L in Equation (2). The regression sum of squares SS_Reg may be interpreted as that part of the total sum of squares which is accounted for by the estimated regression. Given SS_T, the larger the value of SS_Reg and the smaller the value of SS_R, the better the fit of the estimated regression line.

We may test the null hypothesis H_0: β_1 = 0 against the alternative H_1: β_1 ≠ 0, which is a test for the absence of a linear relationship between the x and y variables. If β_1 = 0 then the regression model (1) reduces to

    Y_i = β_0 + ε_i,   i = 1, ..., n,

so that the Y_i are assumed to be NID(β_0, σ²). In this case, the joint distribution of the Y_i does not depend upon the values of the x_i, so that the x_i have no predictive power.

It may be shown that the two terms on the right hand side of Equation (4), SS_Reg and SS_R, are independently distributed. SS_R/σ² has the χ²_{n-2} distribution and, under H_0, SS_Reg/σ² has the χ²_1 distribution. Hence, under H_0, the ratio

    F = MS_Reg / MS_R

has the F_{1,n-2} distribution. This statistic is used for a one-tail test of H_0. The calculations may be laid out in the form of the following ANOVA table. As in previous ANOVAs, a mean square (MS) is obtained by dividing the corresponding sum of squares by its degrees of freedom, and Ŝ² ≡ MS_R is an unbiased estimator of the error variance σ².

    ANOVA TABLE
    Source       DF      SS               MS
    Regression   1       β̂_1 S_xy         SS_Reg
    Error        n - 2   by subtraction   Ŝ² = SS_R/(n - 2)
    Total        n - 1   S_yy

13.5 Example (continued)

The regression analysis is carried out using the S+ function lm, where impurity is regressed against a constant (which is included by default) and stirrate, the data being drawn from the data frame paint.data. The functions summary and anova are then applied to the fitted model object paint.lm in order to obtain the corresponding parameter estimates and analysis of variance table.

> paint.lm <- lm(impurity ~ stirrate, data = paint.data)
> summary(paint.lm)

Call: lm(formula = impurity ~ stirrate, data = paint.data)

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.2893     1.2208 -0.2370   0.8174
   stirrate  0.4566     0.0384 11.8798   0.0000

Residual standard error: 0.9193 on 10 degrees of freedom
Multiple R-Squared: 0.9338
Adjusted R-squared: 0.9272
F-statistic: 141.1 on 1 and 10 degrees of freedom, the p-value is 3.2e-007

> anova(paint.lm)

Analysis of Variance Table
Response: impurity
Terms added sequentially (first to last)
          Df Sum of Sq Mean Sq F Value    Pr(F)
 stirrate  1   119.275 119.275   141.1 3.2e-007
Residuals 10     8.452   0.845

The output shows that the coefficients of the fitted regression line are b_0 = -0.289 and b_1 = 0.457. We shall discuss in a future section some of the details of the calculation of the associated standard errors and tests of significance.
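The partition (4) can likewise be verified numerically before turning to the S+ output. A sketch in R, again assuming paint.data is available:

# Verify SS_T = SS_Reg + SS_R and compute the F ratio by hand
x     <- paint.data$stirrate
y     <- paint.data$impurity
n     <- length(y)
sxx   <- sum((x - mean(x))^2)
sxy   <- sum((x - mean(x)) * y)
SST   <- sum((y - mean(y))^2)            # total corrected sum of squares, s_yy
SSReg <- sxy^2 / sxx                     # b_1 * s_xy, Equation (5)
SSR   <- SST - SSReg                     # residual sum of squares, by subtraction
Fstat <- (SSReg / 1) / (SSR / (n - 2))   # MS_Reg / MS_R
c(SST = SST, SSReg = SSReg, SSR = SSR, F = Fstat)   # compare with anova() below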

In the case of simple linear regression, the t-test for the coefficient β_1 (for stirrate) is equivalent to the F-test in the ANOVA. The p-value for both is 0.0000, so clearly there is a very significant linear relationship. The p-value for the constant β_0 is not significant, but we do keep the constant term in the regression equation. We may also verify that, correct to two decimal places, the observed value of Ŝ = √MS_R is ŝ = √0.845 = 0.92.

13.6 Correlation and the coefficient of determination

From Equations (3) and (5),

    SS_Reg = S²_xy / S_xx.

Hence

    SS_Reg / SS_T = S²_xy / (S_xx S_yy) = r²,   (6)

where r is the sample correlation coefficient,

    r = S_xy / √(S_xx S_yy).

r, which satisfies the inequalities -1 ≤ r ≤ 1, may be thought of as a measure of the strength of the linear relationship between the x_i and the y_i. The closer |r| is to the value 1, the stronger the relationship. But from Equation (6) it follows that r² may be characterized as the proportion of the total sum of squares accounted for by the regression. (r²_obs = 93.4% = 119.28/127.73 in our example.) It also follows from Equations (4) and (6) that

    1 - r² = SS_R / SS_T,   (7)

and it turns out that Equation (7) is the one that is the most appropriate for generalization to more general regression models and measures of fit. In general, define the coefficient of determination R² by

    R² = SS_Reg / SS_T = 1 - SS_R / SS_T.

This quantity, which like r² is the proportion of the total sum of squares accounted for by the regression, may be regarded as a measure of the goodness of fit of the regression model. An alternative measure, which is often preferred, is the adjusted coefficient of determination R̄² (adjusted for the number of regressor variables, one in the case of simple linear regression),

    R̄² = 1 - MS_R / MS_T,

where MS_T = SS_T/(n - 1). The significance of these quantities becomes apparent only when more complicated regression models are to be investigated. S+ outputs these two coefficients as Multiple R-Squared and Adjusted R-squared, respectively.
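These identities are easily confirmed numerically. A sketch in R, assuming the fitted object paint.lm from Section 13.5:

# r, r^2 and the adjusted coefficient of determination, by hand
x   <- paint.data$stirrate
y   <- paint.data$impurity
n   <- length(y)
r   <- cor(x, y)                 # sample correlation coefficient
SST <- sum((y - mean(y))^2)
SSR <- sum(resid(paint.lm)^2)    # residual sum of squares
c(r = r,
  R.squared = r^2,                                        # = SS_Reg/SS_T, Equation (6)
  adj.R.squared = 1 - (SSR / (n - 2)) / (SST / (n - 1)))  # 1 - MS_R/MS_T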

In comparing the use of the F-statistic and R², we may recall that the F-statistic is used to investigate whether there is evidence of a linear relationship between the variables x and y. The value of R² is an indicator of the strength of that relationship. It is readily checked that, in the case of simple linear regression, the values of F and R² are related by the formula

    R² = F / (n - 2 + F)

or, equivalently,

    F = (n - 2) R² / (1 - R²).

It is possible to have a highly significant value of F together with a relatively low value of R² (if n is large) or a relatively large value of R² with a non-significant value of F (if n is small).

13.7 A test and confidence interval for the slope parameter

Recall that the least squares estimator β̂_1 of β_1 is given by

    β̂_1 = S_xy / S_xx = Σ (x_i - x̄) Y_i / S_xx,   (8)

and that in the regression model the x_i are regarded as fixed. So, on the right hand side of Equation (8), only the Y_i are random variables, independently and normally distributed. Since β̂_1 is a linear combination of normally distributed r.v.s, it follows that β̂_1 is also normally distributed. It may be shown that β̂_1 is an unbiased estimator of β_1, that is,

    E[β̂_1] = β_1,

and that the variance of β̂_1 is given by

    var(β̂_1) = σ² / S_xx.

Hence β̂_1 has the N(β_1, σ²/S_xx) distribution. We estimate the unknown error variance σ² by using the estimator Ŝ² ≡ MS_R from the ANOVA table. (In the S+ output, the estimate ŝ of σ is given by Residual standard error.) Thus ŝ/√s_xx is the observed standard error of b_1, and the t-statistic for testing H_0: β_1 = 0 is

    T = β̂_1 √S_xx / Ŝ,

which under H_0 has the t_{n-2} distribution. We can verify from our S+ output that T_obs for β_1 is calculated as the ratio of the estimated coefficient to its standard error: 11.88 ≈ 0.4566/0.0384. The above t-statistic satisfies

    T² = β̂_1² S_xx / Ŝ² = MS_Reg / MS_R = F,

where F is the F-statistic calculated from the ANOVA.
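Both the t-statistic and the identity T² = F can be checked directly. A sketch in R, assuming paint.lm is available:

# t-statistic for H0: beta_1 = 0, computed from first principles
x    <- paint.data$stirrate
n    <- length(x)
sxx  <- sum((x - mean(x))^2)
b1   <- unname(coef(paint.lm)["stirrate"])
shat <- summary(paint.lm)$sigma            # residual standard error, s-hat
tstat <- b1 * sqrt(sxx) / shat             # T = b_1 sqrt(s_xx) / s-hat
c(t = tstat, t.squared = tstat^2)          # t^2 equals the ANOVA F statistic
2 * pt(-abs(tstat), df = n - 2)            # two-sided p-value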

The identity T² = F is a special feature of simple linear regression and does not hold for more general regression models. In our example, F_obs = 141.1 = (11.88)² = T²_obs. It follows from the definitions of the distributions that the square of a random variable with a t_ν distribution has the F_{1,ν} distribution. The p-values of the above t-statistic and F-statistic are identical.

Given the value of b_1, a 100(1 - α)% observed confidence interval for β_1 is given by

    b_1 ± t_{n-2,α/2} ŝ / √s_xx.

In our example we may calculate the 95% confidence interval for β_1 using S+.

## Direct calculation of the observed CI
# k  is the upper 2.5% percentage point of the t-distn with 10 d.o.f.
# k2 is the half-length of the interval
# k3 is the estimated value of the slope
> k <- qt(0.975, 10)
> k2 <- k * 0.0384
> k3 <- 0.4566
> CI <- c(k3 - k2, k3 + k2)
> CI
[1] 0.3710 0.5422

Thus the confidence interval for β_1 is (0.37, 0.54).

13.8 Fitted values and analysis of residuals

Previously, we found that the fitted equation was of the form

    y = -0.289 + 0.457x.

The observed fitted values may be obtained for each of the stir rates in the data set using the function fitted.

> fitted.values <- fitted(paint.lm)
> fitted.values
     1      2      3      4      5      6      7      8      9     10     11     12
 8.844  9.757 10.670 11.583 12.497 13.410 14.323 15.237 16.150 17.063 17.976 18.890

Recall that the residuals ε̂_i are defined by

    ε̂_i = Y_i - Ŷ_i = Y_i - β̂_0 - β̂_1 x_i,   i = 1, ..., n.

Given that β̂_1 is an unbiased estimator of β_1, it is easy to check that β̂_0 is an unbiased estimator of β_0. It follows that

    E[ε̂_i] = E[Y_i] - E[β̂_0] - E[β̂_1] x_i = (β_0 + β_1 x_i) - β_0 - β_1 x_i = 0.

A more detailed analysis shows that

    var(ε̂_i) = (1 - h_i) σ²,   i = 1, ..., n,

where h_i is the leverage of the i-th observation,

    h_i = 1/n + (x_i - x̄)² / s_xx,   i = 1, ..., n.   (9)

Hence the standardized residuals D_i are defined by

    D_i = ε̂_i / (Ŝ √(1 - h_i)),   i = 1, ..., n.

If the assumptions of the regression model are correct, the standardized residuals are approximately NID(0, 1). The leverage h_i of the i-th observation as defined in Equation (9) depends only on the value x_i of the predictor variable and not on the value y_i of the response variable. The leverage h_i may be regarded as a measure of the remoteness of the value x_i of the predictor variable for the i-th observation from the sample mean x̄ of all n observed values of the predictor variable. It is always the case for simple linear regression that

    1/n ≤ h_i ≤ 1,   i = 1, ..., n,   and   Σ h_i = 2,

so that h̄ = 2/n. If h_i is large then the corresponding observation may be highly influential in determining the estimated regression coefficients. There are situations in which removal of an observation with large leverage from the data set can result in drastic changes in the estimates of the regression coefficients. So observations with large leverage should be treated with caution.

We can obtain a list of the leverage values and the standardized residuals by using the commands lm.influence() and (upon invoking library(MASS) first) stdres(), respectively. As a benchmark, we might consider an h_i greater than, say, 3 times the average (or very close to 1), which equates to 0.5 in our example, as high (suggesting the corresponding predictor value is unusual), and a standardized residual d_i satisfying |d_i| > 2 as high (suggesting the corresponding response is unusual).

> leverages <- lm.influence(paint.lm)$hat
> library(MASS)
> std.residuals <- stdres(paint.lm)
> diagnostics <- data.frame(leverages, std.residuals)
> diagnostics
   leverages std.residuals
1     0.2949       -0.5746
2     0.2249       -0.3174
3     0.1690        1.3481
4     0.1270       -1.3778
5     0.0991        0.9206
6     0.0851        1.5807
7     0.0851       -1.2774
8     0.0991       -0.6149
9     0.1270        0.2912
10    0.1690       -0.6720
11    0.2249        1.1410
12    0.2949       -0.5049

Nothing untoward appears in the above output.
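Equation (9) and the definition of D_i can also be checked without MASS. A sketch in plain R (hatvalues and rstandard are the standard R counterparts of lm.influence()$hat and stdres):

# Leverages and standardized residuals computed from their definitions
x    <- paint.data$stirrate
n    <- length(x)
h    <- 1/n + (x - mean(x))^2 / sum((x - mean(x))^2)   # Equation (9)
shat <- summary(paint.lm)$sigma
d    <- resid(paint.lm) / (shat * sqrt(1 - h))         # standardized residuals
cbind(leverage = h, std.res = d)
# cross-checks: hatvalues(paint.lm) and rstandard(paint.lm)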

13.9 Prediction

One of the reasons for carrying out a linear regression analysis may be that, in future, given an x-value, we wish to be able to predict the corresponding y-value, using the fitted regression equation, so that

    Ŷ = β̂_0 + β̂_1 x.   (10)

Assuming the validity of the linear regression model, for the given x-value, the actual y-value will be given by

    Y = β_0 + β_1 x + ε,

where, as before, the error term ε is assumed to have the N(0, σ²) distribution. Hence

    E[Y] = β_0 + β_1 x

and

    Y = E[Y] + ε.   (11)

The Ŷ defined in Equation (10) may be regarded in two ways, either as an estimator of E[Y] (the long-term average of all y-values for the given x-value) or as a predictor of y (one particular y-value for the given x-value). In the latter case, there are two sources of error in accounting for the difference between an observed value of Y, i.e. y, and the predicted value ŷ: one due to using the estimators β̂_0 and β̂_1 instead of the actual parameter values β_0 and β_1, and the other due to the presence of the error term ε.

Since β̂_1 is an unbiased estimator of β_1 and β̂_0 is an unbiased estimator of β_0, from Equation (10),

    E[Ŷ] = E[β̂_0 + β̂_1 x] = β_0 + β_1 x = E[Y].

Thus Ŷ is an unbiased estimator of E[Y] and an unbiased predictor of Y. From Equation (10), var(Ŷ) = var(β̂_0 + β̂_1 x), which turns out to be given by

    var(Ŷ) = (1/n + (x - x̄)²/S_xx) σ².   (12)

Additionally, using Equation (11),

    var(Ŷ - Y) = var(Ŷ) + var(ε) = var(Ŷ) + σ²,   (13)

i.e.

    var(Ŷ - Y) = (1 + 1/n + (x - x̄)²/S_xx) σ².

As before, we estimate σ² by Ŝ² ≡ MS_R from the ANOVA table. A 100(1 - α)% observed confidence interval for E[Y] is given by

    b_0 + b_1 x ± t_{n-2,α/2} ŝ √(1/n + (x - x̄)²/s_xx).

A 100(1 - α)% observed prediction interval for the value of y is given by

    b_0 + b_1 x ± t_{n-2,α/2} ŝ √(1 + 1/n + (x - x̄)²/s_xx).

S+ refers to the quantity

    ŝ √(1/n + (x - x̄)²/s_xx)

as se.fit, the standard error of the fit. Note how the widths of the confidence and prediction intervals depend on the distance of x from x̄. The prediction interval is wider than the confidence interval. If the regression equation has been fitted using x-values in some interval A and appears to provide a good representation of the relationship between x and y in A, we should be wary of extrapolating this equation to make predictions for x-values outside A, as the linear relationship between x and y may not hold outside A.
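Both intervals can be computed directly from the formulas above. A sketch in R for a single new x-value x0 (we take x0 = 41, the value used in Section 13.10):

# Confidence interval for E[Y] and prediction interval for y at x = x0
x    <- paint.data$stirrate
n    <- length(x)
x0   <- 41
b    <- unname(coef(paint.lm))
shat <- summary(paint.lm)$sigma
sxx  <- sum((x - mean(x))^2)
yhat <- b[1] + b[2] * x0                                   # b_0 + b_1 x0
tm   <- qt(0.975, n - 2)
se.mean <- shat * sqrt(1/n + (x0 - mean(x))^2 / sxx)       # se.fit
se.pred <- shat * sqrt(1 + 1/n + (x0 - mean(x))^2 / sxx)   # for a single new y
rbind(confidence = yhat + c(-1, 1) * tm * se.mean,
      prediction = yhat + c(-1, 1) * tm * se.pred)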

13.10 Example (continued)

We use the function predict in S+ to obtain predicted values and their standard errors. We construct a data frame x whose variable name is that of the regressor variable, stirrate, and which contains the values of the regressor variable for which we wish to make predictions. In the present case, we shall use the single value of 41. The first argument of the predict function is the object paint.lm that corresponds to our model, and the second argument is the data frame x that contains the values of the regressor variable for which we wish to make predictions. The argument se.fit = TRUE is required so that we obtain standard errors for our predictions and so that, subsequently, we can use the function pointwise to produce confidence intervals. In the output, the term residual.scale refers to the value of ŝ. Given this and the value of the standard error of the fit, we may, if desired, calculate the prediction interval as defined above, in addition to the confidence interval produced by the function pointwise.

> x <- data.frame(stirrate = 41)
> predict.impurity <- predict(paint.lm, x, se.fit = TRUE)
> predict.impurity
$fit:
 18.433
$se.fit:
 0.4671
$residual.scale:
[1] 0.9193
$df:
[1] 10

> pointwise(predict.impurity, 0.95)
$upper:
 19.474
$fit:
 18.433
$lower:
 17.392
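The pointwise function is specific to S+. The prediction interval described above can be assembled from the pieces that predict returns; a sketch in R (where one could instead call predict(paint.lm, x, interval = "prediction")):

# Prediction interval built from se.fit and residual.scale
p  <- predict(paint.lm, data.frame(stirrate = 41), se.fit = TRUE)
tm <- qt(0.975, p$df)
se.pred <- sqrt(p$se.fit^2 + p$residual.scale^2)   # adds the error variance to var(Y-hat)
c(lower = p$fit - tm * se.pred, upper = p$fit + tm * se.pred)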
