Introduction to Simple Linear Regression

1 Introduction to Simple Linear Regression Yang Feng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68

2 About me Faculty in the Department of Statistics, Columbia University. PhD in Operations Research & Financial Engineering, Princeton University, 2010. Taught the course over 5 times. My research interests: high-dimensional statistical learning, network models, nonparametric and semiparametric methods, bioinformatics. See my website for more details: Yang Feng (Columbia University) Introduction to Simple Linear Regression 2 / 68

3 Course Description Theory and practice of regression analysis: simple and multiple regression, including testing, estimation, and confidence procedures; modeling; regression diagnostics and plots; polynomial regression; collinearity and confounding; model selection; geometry of least squares. Extensive use of the computer to analyze data. Course website: Required Text: Applied Linear Regression Models (4th Ed.) Authors: Kutner, Nachtsheim, Neter Yang Feng (Columbia University) Introduction to Simple Linear Regression 3 / 68

4 Course Outline First half of the course is single variable linear regression. Least squares Maximum likelihood, normal model Tests / inferences ANOVA Diagnostics Remedial Measures Yang Feng (Columbia University) Introduction to Simple Linear Regression 4 / 68

5 Course Outline (Continued) Second half of the course is multiple linear regression and other related topics. Multiple linear regression Linear algebra review Matrix approach to linear regression Multiple predictor variables Diagnostics Tests Model selection Other topics (if time permits) Principal Component Analysis Generalized Linear Models Introduction to Bayesian Inference Yang Feng (Columbia University) Introduction to Simple Linear Regression 5 / 68

6 Requirements Calculus Derivatives, gradients, convexity Linear algebra Matrix notation, inversion, eigenvectors, eigenvalues, rank Probability and Statistics Random variable Expectation, variance Estimation Bias/Variance Basic probability distributions Hypothesis Testing Confidence Interval Yang Feng (Columbia University) Introduction to Simple Linear Regression 6 / 68

8 Software R will be used throughout the course and is required in all homework. An R tutorial session will be given (bring your laptop with R and RStudio installed!). Reasons for R: Completely free software. Can be downloaded from RStudio is an easy-to-use IDE. Available on various systems: PC, Mac, Linux, and even iPhone and iPad! Advanced yet easy to use. An Introduction to R: R for Data Science: Yang Feng (Columbia University) Introduction to Simple Linear Regression 7 / 68

9 Why regression? Want to model a functional relationship between a predictor variable (input, independent variable, etc.) and a response variable (output, dependent variable, etc.) Examples? But the real world is noisy; there is no exact f = ma. Observation noise Process noise Two distinct goals: (Estimation) Understanding the relationship between the predictor variables and the response variable. (Prediction) Predicting the future response given newly observed predictors. Yang Feng (Columbia University) Introduction to Simple Linear Regression 8 / 68

10 History Sir Francis Galton, 19th century Studied the relation between heights of parents and children and noted that the children regressed to the population mean. Regression stuck as the term to describe statistical relations between variables. Yang Feng (Columbia University) Introduction to Simple Linear Regression 9 / 68

11 Example Applications Okun's law: the gap version states that for every 1% increase in the unemployment rate, a country's GDP will be roughly an additional 2% lower than its potential GDP. Yang Feng (Columbia University) Introduction to Simple Linear Regression 10 / 68

12 Others Epidemiology Relating lifespan to obesity, smoking habits, etc. Science and engineering Relating physical inputs to physical outputs in complex systems. Education and income Relating income to the length of education. Yang Feng (Columbia University) Introduction to Simple Linear Regression 11 / 68

14 Aims for the course Given something you would like to predict (the response) and the available explanatory variables. What kind of model should you use? Which variables should you include? Which transformations of variables and interaction terms should you use? Given a model and some data How do you fit the model to the data? How do you express confidence in the values of the model parameters? How do you regularize the model to avoid over-fitting and other related issues? Yang Feng (Columbia University) Introduction to Simple Linear Regression 12 / 68

17 Data for Regression Analysis Observational Data Example: relation between age of employee (X ) and number of days of illness last year (Y ). Cannot be controlled! Experimental Data Example: an insurance company wishes to study the relation between productivity of its analysts in processing claims (Y ) and length of training (X ). Treatment: the length of training. Experimental Units: the analysts included in the study. Completely Randomized Design: the most basic type of statistical design. Example: same example, but every experimental unit has an equal chance to receive any one of the treatments. Yang Feng (Columbia University) Introduction to Simple Linear Regression 13 / 68

18 Questions? Good time to ask now. Yang Feng (Columbia University) Introduction to Simple Linear Regression 14 / 68

19 Simple Linear Regression Want to find parameters for a function of the form Y i = β 0 + β 1 X i + ɛ i Distribution of error random variable not specified Yang Feng (Columbia University) Introduction to Simple Linear Regression 15 / 68

20 Quick Example - Scatter Plot (two scatter plots, panels (a) and (b), not reproduced in this transcription) Yang Feng (Columbia University) Introduction to Simple Linear Regression 16 / 68

21 Formal Statement of Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor variable in the i th trial ɛ i is a random error term with mean E(ɛ i ) = 0 and variance Var(ɛ i ) = σ 2 ɛ i and ɛ j are uncorrelated i = 1,..., n Yang Feng (Columbia University) Introduction to Simple Linear Regression 17 / 68
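To make the model statement concrete, here is a minimal simulation sketch of this data-generating process, in Python for illustration (the course itself uses R). The parameter values and the range of X below are made up, not taken from the slides.

```python
import numpy as np

# Simulate Y_i = beta0 + beta1 * X_i + eps_i with made-up parameter values.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 100

X = rng.uniform(0, 10, size=n)        # known constants X_i
eps = rng.normal(0, sigma, size=n)    # error terms: mean 0, variance sigma^2
Y = beta0 + beta1 * X + eps           # response in each of the n trials

# Since E(eps_i) = 0, the deviations Y_i - (beta0 + beta1 X_i) average near 0
print(np.mean(Y - (beta0 + beta1 * X)))
```

Rerunning with a different seed changes the errors but not the systematic part β 0 + β 1 X i, which is the point of the decomposition on the next slide.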

22 Properties The response Y i is the sum of two components Constant term β 0 + β 1 X i Random term ɛ i The expected response is E(Y i ) = E(β 0 + β 1 X i + ɛ i ) = β 0 + β 1 X i + E(ɛ i ) = β 0 + β 1 X i Yang Feng (Columbia University) Introduction to Simple Linear Regression 18 / 68

23 Expectation Review Suppose X has probability density function f (x). Definition: E(X ) = ∫ x f (x) dx. Linearity property: E(aX ) = a E(X ), E(aX + bY ) = a E(X ) + b E(Y ). Obvious from the definition. Yang Feng (Columbia University) Introduction to Simple Linear Regression 19 / 68

24 Example Expectation Derivation Suppose the p.d.f. of X is f (x) = 2x, 0 ≤ x ≤ 1. Expectation:
E(X ) = ∫ x f (x) dx = ∫_0^1 2x^2 dx = [2x^3/3]_0^1 = 2/3
Yang Feng (Columbia University) Introduction to Simple Linear Regression 20 / 68

25 Expectation of a Product of Random Variables If X, Y are random variables with joint density function f (x, y), then the expectation of the product is given by E(XY ) = ∫∫ xy f (x, y) dx dy. Yang Feng (Columbia University) Introduction to Simple Linear Regression 21 / 68

26 Expectation of a product of random variables What if X and Y are independent? If X and Y are independent with density functions f 1 and f 2 respectively, then
E(XY ) = ∫∫ xy f 1 (x) f 2 (y) dx dy
= ∫ x f 1 (x) [ ∫ y f 2 (y) dy ] dx
= ∫ x f 1 (x) E(Y ) dx
= E(X ) E(Y )
Yang Feng (Columbia University) Introduction to Simple Linear Regression 22 / 68

27 Regression Function The response Y i comes from a probability distribution with mean E(Y i ) = β 0 + β 1 X i. This means the regression function is E(Y ) = β 0 + β 1 X, since the regression function relates the means of the probability distributions of Y for a given X to the level of X. Yang Feng (Columbia University) Introduction to Simple Linear Regression 23 / 68

28 Error Terms The response Y i in the i th trial exceeds or falls short of the value of the regression function by the error term amount ɛ i The error terms ɛ i are assumed to have constant variance σ 2 Yang Feng (Columbia University) Introduction to Simple Linear Regression 24 / 68

29 Response Variance Responses Y i have the same constant variance Var(Y i ) = Var(β 0 + β 1 X i + ɛ i ) = Var(ɛ i ) = σ 2 Yang Feng (Columbia University) Introduction to Simple Linear Regression 25 / 68

30 Variance (2nd central moment) Review Continuous distribution: Var(X ) = E[(X − E(X ))^2] = ∫ (x − E(X ))^2 f (x) dx. Discrete distribution: Var(X ) = E[(X − E(X ))^2] = Σ i (X i − E(X ))^2 P(X = X i ). Alternative form for the variance: Var(X ) = E(X^2) − E(X )^2. Yang Feng (Columbia University) Introduction to Simple Linear Regression 26 / 68

31 Example Variance Derivation f (x) = 2x, 0 ≤ x ≤ 1.
Var(X ) = E(X^2) − E(X )^2 = ∫_0^1 x^2 · 2x dx − (2/3)^2 = 1/2 − 4/9 = 1/18
Yang Feng (Columbia University) Introduction to Simple Linear Regression 27 / 68
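Both derivations for the density f (x) = 2x on [0, 1] can be checked numerically. The sketch below (Python for illustration; the check is language-agnostic) approximates the moment integrals with a midpoint Riemann sum:

```python
import numpy as np

# Numeric check of E(X) = 2/3 and Var(X) = E(X^2) - E(X)^2 = 1/2 - 4/9 = 1/18
# for the density f(x) = 2x on [0, 1], via a midpoint Riemann sum.
dx = 1e-6
x = (np.arange(1_000_000) + 0.5) * dx   # midpoints of a fine grid on [0, 1]
f = 2 * x                               # density values

EX = np.sum(x * f) * dx                 # first moment
EX2 = np.sum(x**2 * f) * dx             # second moment
VarX = EX2 - EX**2

print(EX, VarX)                         # ~ 0.6667 and ~ 0.0556
```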

32 Variance Properties Var(aX ) = a^2 Var(X ). Var(aX + bY ) = a^2 Var(X ) + b^2 Var(Y ) if X ⊥ Y. Var(a + cX ) = c^2 Var(X ) if a and c are both constants. More generally, Var( Σ i a i X i ) = Σ i Σ j a i a j Cov(X i , X j ). Yang Feng (Columbia University) Introduction to Simple Linear Regression 28 / 68

33 Covariance The covariance between two real-valued random variables X and Y, with expected values E(X ) = µ and E(Y ) = ν, is defined as Cov(X, Y ) = E((X − µ)(Y − ν)) = E(XY ) − µν. If X ⊥ Y, then E(XY ) = E(X ) E(Y ) = µν, so Cov(X, Y ) = 0. Yang Feng (Columbia University) Introduction to Simple Linear Regression 29 / 68

34 Formal Statement of Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor variable in the i th trial ɛ i is a random error term with mean E(ɛ i ) = 0 and variance Var(ɛ i ) = σ 2 i = 1,..., n Yang Feng (Columbia University) Introduction to Simple Linear Regression 30 / 68

35 Example An experimenter gave three subjects a very difficult task. Data on the age of the subject (X ) and on the number of attempts to accomplish the task before giving up (Y ) follow: Table: Subject i, Age X i , Number of Attempts Y i (table values not reproduced in this transcription). Yang Feng (Columbia University) Introduction to Simple Linear Regression 31 / 68

38 Least Squares Linear Regression Goal: make Y i and b 0 + b 1 X i close for all i.
Proposal 1: minimize Σ_{i=1}^n [Y i − (b 0 + b 1 X i )].
Proposal 2: minimize Σ_{i=1}^n |Y i − (b 0 + b 1 X i )|.
Final Proposal: minimize Q(b 0 , b 1 ) = Σ_{i=1}^n [Y i − (b 0 + b 1 X i )]^2
Choose b 0 and b 1 as estimators for β 0 and β 1 . b 0 and b 1 will minimize the criterion Q for the given sample observations (X 1 , Y 1 ), (X 2 , Y 2 ),..., (X n , Y n ). Yang Feng (Columbia University) Introduction to Simple Linear Regression 32 / 68
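A tiny numeric illustration (made-up data, Python rather than the course's R) of why Proposal 1 is a poor criterion: positive and negative deviations cancel in a plain sum, so a clearly bad line can score zero, while the squared criterion Q penalizes it:

```python
import numpy as np

# Made-up data for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 2.5, 4.5, 5.0])

def Q(b0, b1):
    """Least squares criterion Q(b0, b1) = sum of squared deviations."""
    return np.sum((Y - (b0 + b1 * X))**2)

# Proposal 1 for a flat line through the mean: deviations cancel to exactly 0,
# even though the fit ignores the obvious upward trend.
bad = np.sum(Y - (Y.mean() + 0.0 * X))
print(bad)   # 0.0

# Q does penalize that flat line relative to a reasonable sloped one:
print(Q(Y.mean(), 0.0) > Q(1.0, 1.05))   # True
```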

39 Guess #1 Yang Feng (Columbia University) Introduction to Simple Linear Regression 33 / 68

40 Guess #2 Yang Feng (Columbia University) Introduction to Simple Linear Regression 34 / 68

41 Function maximization Important technique to remember! Take derivative Set result equal to zero and solve Test second derivative at that point Question: does this always give you the maximum? Going further: multiple variables, convex optimization Yang Feng (Columbia University) Introduction to Simple Linear Regression 35 / 68

43 Function Maximization Find the value of x that maximizes the function f (x) = −x^2 + log(x), x > 0.
1. Take the derivative and set it to zero:
d/dx (−x^2 + log(x)) = 0 (1)
−2x + 1/x = 0 (2)
2x^2 = 1 (3)
x^2 = 1/2 (4)
x = √2/2 (5)
Yang Feng (Columbia University) Introduction to Simple Linear Regression 36 / 68

44 2. Check the second-order derivative:
d^2/dx^2 (−x^2 + log(x)) = d/dx (−2x + 1/x) (6)
= −2 − 1/x^2 (7)
< 0 (8)
Then we have found the maximum: x* = √2/2, f (x*) = −(1/2)[1 + log(2)]. Yang Feng (Columbia University) Introduction to Simple Linear Regression 37 / 68
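The calculus result for f (x) = −x^2 + log(x) can be checked with a quick grid search (Python sketch for illustration):

```python
import numpy as np

# Grid-search check that f(x) = -x^2 + log(x), x > 0, is maximized at
# x* = sqrt(2)/2 with maximum value f(x*) = -(1 + log 2)/2.
x = np.linspace(0.01, 3, 300_000)
f = -x**2 + np.log(x)
i = np.argmax(f)

x_star = np.sqrt(2) / 2
print(x[i], x_star)                     # both ~ 0.7071
print(f[i], -(1 + np.log(2)) / 2)       # both ~ -0.8466
```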

45 Least Squares Minimization Function to minimize w.r.t. b 0 and b 1 (b 0 and b 1 are called point estimators of β 0 and β 1 respectively):
Q(b 0 , b 1 ) = Σ_{i=1}^n [Y i − (b 0 + b 1 X i )]^2
Find the partial derivatives and set both equal to zero:
∂Q/∂b 0 = 0, ∂Q/∂b 1 = 0
Yang Feng (Columbia University) Introduction to Simple Linear Regression 38 / 68

46 Normal Equations The results of this minimization step are called the normal equations. b 0 and b 1 are called point estimators of β 0 and β 1 respectively.
Σ Y i = n b 0 + b 1 Σ X i (9)
Σ X i Y i = b 0 Σ X i + b 1 Σ X i ^2 (10)
This is a system of two equations in two unknowns. Yang Feng (Columbia University) Introduction to Simple Linear Regression 39 / 68

47 Solution to Normal Equations After a lot of algebra one arrives at
b 1 = Σ (X i − X̄)(Y i − Ȳ) / Σ (X i − X̄)^2
b 0 = Ȳ − b 1 X̄
where X̄ = Σ X i / n and Ȳ = Σ Y i / n. Yang Feng (Columbia University) Introduction to Simple Linear Regression 40 / 68
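As a sanity check, the closed-form estimates can be computed directly and verified against both normal equations, (9) and (10). A Python sketch with made-up data (the course uses R, where `lm()` does this for you):

```python
import numpy as np

# Made-up data for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.8, 4.1, 5.9, 8.2, 9.8])

# Closed-form least squares solution.
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()

# Check normal equation (9): sum Y_i = n b0 + b1 sum X_i
print(np.isclose(np.sum(Y), len(X) * b0 + b1 * np.sum(X)))
# Check normal equation (10): sum X_i Y_i = b0 sum X_i + b1 sum X_i^2
print(np.isclose(np.sum(X * Y), b0 * np.sum(X) + b1 * np.sum(X**2)))
```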

48 Least Square Fit Yang Feng (Columbia University) Introduction to Simple Linear Regression 41 / 68

49 Properties of Solution The i th residual is defined to be e i = Y i − Ŷ i. The sum of the residuals is zero, from (9):
Σ e i = Σ (Y i − b 0 − b 1 X i ) = Σ Y i − n b 0 − b 1 Σ X i = 0
The sum of the observed values Y i equals the sum of the fitted values Ŷ i : Σ Ŷ i = Σ Y i. Yang Feng (Columbia University) Introduction to Simple Linear Regression 42 / 68

50 Properties of Solution The sum of the weighted residuals is zero when the residual in the i th trial is weighted by the level of the predictor variable in the i th trial, from (10):
Σ X i e i = Σ X i (Y i − b 0 − b 1 X i ) = Σ X i Y i − b 0 Σ X i − b 1 Σ X i ^2 = 0
Yang Feng (Columbia University) Introduction to Simple Linear Regression 43 / 68

51 Properties of Solution The sum of the weighted residuals is zero when the residual in the i th trial is weighted by the fitted value of the response variable in the i th trial:
Σ Ŷ i e i = Σ (b 0 + b 1 X i ) e i = b 0 Σ e i + b 1 Σ X i e i = 0
Yang Feng (Columbia University) Introduction to Simple Linear Regression 44 / 68

52 Properties of Solution The regression line always goes through the point (X̄, Ȳ). Using the alternative form of the linear regression model, Y i = β 0 * + β 1 (X i − X̄) + ɛ i, the least squares estimator b 1 for β 1 remains the same as before, and the least squares estimator for β 0 * = β 0 + β 1 X̄ becomes
b 0 * = b 0 + b 1 X̄ = (Ȳ − b 1 X̄) + b 1 X̄ = Ȳ
Hence the estimated regression function is Ŷ = Ȳ + b 1 (X − X̄). Apparently, the regression line always goes through the point (X̄, Ȳ). Yang Feng (Columbia University) Introduction to Simple Linear Regression 45 / 68
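All of these residual properties hold exactly for any least squares fit, so they make a good numeric self-test. A Python sketch with made-up data:

```python
import numpy as np

# Made-up data; fit by the closed-form least squares solution.
X = np.array([2.0, 4.0, 6.0, 8.0])
Y = np.array([3.1, 4.9, 7.2, 8.8])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

print(e.sum())                    # sum of residuals: ~ 0
print((X * e).sum())              # residuals weighted by X_i: ~ 0
print((Yhat * e).sum())           # residuals weighted by fitted values: ~ 0
print(b0 + b1 * X.mean(), Y.mean())  # the line passes through (Xbar, Ybar)
```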

53 Estimating Error Term Variance σ 2 Review estimation in non-regression setting. Show estimation results for regression setting. Yang Feng (Columbia University) Introduction to Simple Linear Regression 46 / 68

54 Estimation Review An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample, e.g., the sample mean Ȳ = (1/n) Σ_{i=1}^n Y i. Yang Feng (Columbia University) Introduction to Simple Linear Regression 47 / 68

55 Point Estimators and Bias Point estimator ˆθ = f ({Y 1,..., Y n }) Unknown quantity / parameter θ Definition: Bias of estimator Bias(ˆθ) = E(ˆθ) θ Yang Feng (Columbia University) Introduction to Simple Linear Regression 48 / 68

56 Example Estimate the population mean from samples Y i ∼ N(θ, σ 2 ), using ˆθ = Ȳ = (1/n) Σ_{i=1}^n Y i. Yang Feng (Columbia University) Introduction to Simple Linear Regression 49 / 68

57 Sampling Distribution of the Estimator First moment:
E(ˆθ) = E( (1/n) Σ_{i=1}^n Y i ) = (1/n) Σ_{i=1}^n E(Y i ) = nθ/n = θ
This is an example of an unbiased estimator: Bias(ˆθ) = E(ˆθ) − θ = 0. Yang Feng (Columbia University) Introduction to Simple Linear Regression 50 / 68

58 Variance of Estimator Definition: the variance of an estimator is Var(ˆθ) = E([ˆθ − E(ˆθ)]^2). Remember: Var(cY ) = c^2 Var(Y ), and Var( Σ_{i=1}^n Y i ) = Σ_{i=1}^n Var(Y i ) only if the Y i are independent with finite variance. Yang Feng (Columbia University) Introduction to Simple Linear Regression 51 / 68

59 Example Estimator Variance
Var(ˆθ) = Var( (1/n) Σ_{i=1}^n Y i ) = (1/n^2) Σ_{i=1}^n Var(Y i ) = nσ^2/n^2 = σ^2/n
Yang Feng (Columbia University) Introduction to Simple Linear Regression 52 / 68

60 Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + [Bias(ˆθ)] 2 Yang Feng (Columbia University) Introduction to Simple Linear Regression 53 / 68

61 MSE = Var + Bias^2 Proof:
MSE(ˆθ) = E((ˆθ − θ)^2)
= E(([ˆθ − E(ˆθ)] + [E(ˆθ) − θ])^2)
= E([ˆθ − E(ˆθ)]^2) + 2 E([E(ˆθ) − θ][ˆθ − E(ˆθ)]) + E([E(ˆθ) − θ]^2)
= Var(ˆθ) + 2 [E(ˆθ) − θ] E([ˆθ − E(ˆθ)]) + (Bias(ˆθ))^2
= Var(ˆθ) + 2 · 0 + (Bias(ˆθ))^2
= Var(ˆθ) + (Bias(ˆθ))^2
(the cross term vanishes because E(ˆθ) − θ is a constant and E[ˆθ − E(ˆθ)] = 0). Yang Feng (Columbia University) Introduction to Simple Linear Regression 54 / 68
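The decomposition can be demonstrated by Monte Carlo, comparing the unbiased sample mean with a deliberately biased (shrunken) estimator. A Python sketch with made-up values; the shrinkage factor 0.9 is arbitrary and chosen only to create bias:

```python
import numpy as np

# Monte Carlo check of MSE = Var + Bias^2 for two estimators of theta.
rng = np.random.default_rng(1)
theta, sigma, n, reps = 5.0, 2.0, 10, 200_000

Y = rng.normal(theta, sigma, size=(reps, n))
theta_hat = Y.mean(axis=1)       # unbiased: E = theta, Var = sigma^2 / n
shrunk = 0.9 * theta_hat         # deliberately biased estimator

for est in (theta_hat, shrunk):
    mse = np.mean((est - theta)**2)
    var = est.var()                      # empirical variance over replicates
    bias = est.mean() - theta            # empirical bias over replicates
    print(mse, var + bias**2)            # the two columns agree
```

Note the agreement here is an exact algebraic identity of the empirical moments, not just an approximation that improves with more replicates.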

62 Trade-off Think of variance as confidence and bias as correctness; these intuitions (largely) apply. Sometimes choosing a biased estimator can result in an overall lower MSE if it has much lower variance than the unbiased one. Yang Feng (Columbia University) Introduction to Simple Linear Regression 55 / 68

63 s^2 estimator for σ^2 for a Single Population Sum of Squares: Σ_{i=1}^n (Y i − Ȳ)^2. Sample Variance Estimator:
s^2 = Σ_{i=1}^n (Y i − Ȳ)^2 / (n − 1)
s^2 is an unbiased estimator of σ^2. The sum of squares has n − 1 degrees of freedom associated with it: one degree of freedom is lost by using Ȳ as an estimate of the unknown population mean µ. Yang Feng (Columbia University) Introduction to Simple Linear Regression 56 / 68

64 Estimating Error Term Variance σ 2 for Regression Model Regression model Variance of each observation Y i is σ 2 (the same as for the error term ɛ i ) Each Y i comes from a different probability distribution with different means that depend on the level X i The deviation of an observation Y i must be calculated around its own estimated mean. Yang Feng (Columbia University) Introduction to Simple Linear Regression 57 / 68

65 s^2 estimator for σ^2
s^2 = MSE = SSE/(n − 2) = Σ (Y i − Ŷ i )^2 / (n − 2) = Σ e i ^2 / (n − 2)
MSE is an unbiased estimator of σ^2: E(MSE) = σ^2. The sum of squares SSE has n − 2 degrees of freedom associated with it. Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them. Yang Feng (Columbia University) Introduction to Simple Linear Regression 58 / 68
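The unbiasedness E(MSE) = σ^2 can be checked by simulation: refit the regression on many replicated data sets and average SSE/(n − 2). A Python sketch; all parameter values below are made up for illustration:

```python
import numpy as np

# Simulation check that MSE = SSE/(n-2) averages to sigma^2 over replicates.
rng = np.random.default_rng(2)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 1.5, 20, 20_000
X = np.linspace(0, 10, n)
Xc = X - X.mean()                          # centered X (sum of Xc is 0)

Y = beta0 + beta1 * X + rng.normal(0, sigma, size=(reps, n))
b1 = (Y @ Xc) / (Xc @ Xc)                  # one slope estimate per replicate
b0 = Y.mean(axis=1) - b1 * X.mean()
resid = Y - (b0[:, None] + b1[:, None] * X)
mse = (resid**2).sum(axis=1) / (n - 2)     # SSE / (n - 2) per replicate

print(mse.mean(), sigma**2)                # both ~ 2.25
```

The slope formula uses the identity Σ (X i − X̄)(Y i − Ȳ) = Σ (X i − X̄) Y i, which lets one slope per replicate be computed with a single matrix-vector product.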

66 Normal Error Regression Model No matter how the error terms ɛ i are distributed, the least squares method provides unbiased point estimators of β 0 and β 1 that also have minimum variance among all unbiased linear estimators To set up interval estimates and make tests we need to specify the distribution of the ɛ i We will assume that the ɛ i are normally distributed. Yang Feng (Columbia University) Introduction to Simple Linear Regression 59 / 68

67 Normal Error Regression Model Y i = β 0 + β 1 X i + ɛ i. Y i is the value of the response variable in the i th trial; β 0 and β 1 are parameters; X i is a known constant, the value of the predictor variable in the i th trial; ɛ i ∼ iid N(0, σ 2 ) (note this is different: now we know the distribution); i = 1,..., n. Yang Feng (Columbia University) Introduction to Simple Linear Regression 60 / 68

68 Notational Convention When you see ɛ i ∼ iid N(0, σ 2 ), it is read as: ɛ i is independently and identically distributed according to a normal distribution with mean 0 and variance σ 2. Yang Feng (Columbia University) Introduction to Simple Linear Regression 61 / 68

69 Maximum Likelihood Principle The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data. Yang Feng (Columbia University) Introduction to Simple Linear Regression 62 / 68

70 Likelihood Function If Y i ∼ F (Θ), i = 1,..., n, then the likelihood function is
L({Y i } n i=1 , Θ) = ∏_{i=1}^n f (Y i ; Θ)
Yang Feng (Columbia University) Introduction to Simple Linear Regression 63 / 68

71 Maximum Likelihood Estimation The likelihood function L({Y i } n i=1 , Θ) = ∏_{i=1}^n f (Y i ; Θ) can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters as well. To do this, find solutions (analytically or by following the gradient) to
dL({Y i } n i=1 , Θ)/dΘ = 0
Yang Feng (Columbia University) Introduction to Simple Linear Regression 64 / 68

72 Important Trick (Almost) never maximize the likelihood function; maximize the log likelihood function instead:
log(L({Y i } n i=1 , Θ)) = log( ∏_{i=1}^n f (Y i ; Θ) ) = Σ_{i=1}^n log(f (Y i ; Θ))
Usually the log of the likelihood is easier to work with mathematically. Yang Feng (Columbia University) Introduction to Simple Linear Regression 65 / 68

73 ML Normal Regression Likelihood function:
L({Y i } n i=1 ; β 0 , β 1 , σ 2 ) = ∏_{i=1}^n (2πσ 2 )^{−1/2} exp( −(1/(2σ 2 )) (Y i − β 0 − β 1 X i )^2 )
= (2πσ 2 )^{−n/2} exp( −(1/(2σ 2 )) Σ_{i=1}^n (Y i − β 0 − β 1 X i )^2 )
We can maximize (how?) w.r.t. the parameters β 0 , β 1 , σ 2 , and get... Yang Feng (Columbia University) Introduction to Simple Linear Regression 66 / 68

74 Maximum Likelihood Estimator(s) For β 0 : b 0 , the same as in the least squares case. For β 1 : b 1 , the same as in the least squares case. For σ 2 :
ˆσ 2 = Σ i (Y i − Ŷ i )^2 / n
Note that the ML estimator is biased, since s 2 is unbiased and s 2 = MSE = (n/(n − 2)) ˆσ 2. Yang Feng (Columbia University) Introduction to Simple Linear Regression 67 / 68
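The relation s^2 = (n/(n − 2)) ˆσ^2 holds exactly for any data set, since both estimators divide the same SSE by different constants. A quick Python check on made-up data:

```python
import numpy as np

# Made-up data; both estimators of sigma^2 are ratios of the same SSE.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.9])
n = len(X)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
sse = np.sum((Y - b0 - b1 * X)**2)

sigma2_mle = sse / n          # biased ML estimator
s2 = sse / (n - 2)            # unbiased MSE
print(np.isclose(s2, n / (n - 2) * sigma2_mle))   # True for any data
print(s2 > sigma2_mle)                            # MLE always shrinks SSE more
```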

75 Comments Least squares minimizes the squared error between the prediction and the true output. The normal distribution is fully characterized by its first two moments (mean and variance). Yang Feng (Columbia University) Introduction to Simple Linear Regression 68 / 68


Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Nonparametric Regression and Bonferroni joint confidence intervals. Yang Feng

Nonparametric Regression and Bonferroni joint confidence intervals. Yang Feng Nonparametric Regression and Bonferroni joint confidence intervals Yang Feng Simultaneous Inferences In chapter 2, we know how to construct confidence interval for β 0 and β 1. If we want a confidence

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Supervised Learning: Regression I Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Some of the

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Statistics II Lesson 1. Inference on one population. Year 2009/10

Statistics II Lesson 1. Inference on one population. Year 2009/10 Statistics II Lesson 1. Inference on one population Year 2009/10 Lesson 1. Inference on one population Contents Introduction to inference Point estimators The estimation of the mean and variance Estimating

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

An Introduction to Bayesian Linear Regression

An Introduction to Bayesian Linear Regression An Introduction to Bayesian Linear Regression APPM 5720: Bayesian Computation Fall 2018 A SIMPLE LINEAR MODEL Suppose that we observe explanatory variables x 1, x 2,..., x n and dependent variables y 1,

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

Regression and Covariance

Regression and Covariance Regression and Covariance James K. Peterson Department of Biological ciences and Department of Mathematical ciences Clemson University April 16, 2014 Outline A Review of Regression Regression and Covariance

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation Merlise Clyde STA721 Linear Models Duke University August 31, 2017 Outline Topics Likelihood Function Projections Maximum Likelihood Estimates Readings: Christensen Chapter

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

PART I. (a) Describe all the assumptions for a normal error regression model with one predictor variable,

PART I. (a) Describe all the assumptions for a normal error regression model with one predictor variable, Concordia University Department of Mathematics and Statistics Course Number Section Statistics 360/2 01 Examination Date Time Pages Final December 2002 3 hours 6 Instructors Course Examiner Marks Y.P.

More information

Need for Several Predictor Variables

Need for Several Predictor Variables Multiple regression One of the most widely used tools in statistical analysis Matrix expressions for multiple regression are the same as for simple linear regression Need for Several Predictor Variables

More information

Regression Models - Introduction

Regression Models - Introduction Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Five Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Five Notes Spring 2011 1 / 37 Outline 1

More information

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Forecasting 1. Let X and Y be two random variables such that E(X 2 ) < and E(Y 2 )

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Applied Regression. Applied Regression. Chapter 2 Simple Linear Regression. Hongcheng Li. April, 6, 2013

Applied Regression. Applied Regression. Chapter 2 Simple Linear Regression. Hongcheng Li. April, 6, 2013 Applied Regression Chapter 2 Simple Linear Regression Hongcheng Li April, 6, 2013 Outline 1 Introduction of simple linear regression 2 Scatter plot 3 Simple linear regression model 4 Test of Hypothesis

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

The Simple Regression Model. Part II. The Simple Regression Model

The Simple Regression Model. Part II. The Simple Regression Model Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square

More information

Basic Probability Reference Sheet

Basic Probability Reference Sheet February 27, 2001 Basic Probability Reference Sheet 17.846, 2001 This is intended to be used in addition to, not as a substitute for, a textbook. X is a random variable. This means that X is a variable

More information

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation. Statistical Computation Math 475 Jimin Ding Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html October 10, 2013 Ridge Part IV October 10, 2013 1

More information

10. Linear Models and Maximum Likelihood Estimation

10. Linear Models and Maximum Likelihood Estimation 10. Linear Models and Maximum Likelihood Estimation ECE 830, Spring 2017 Rebecca Willett 1 / 34 Primary Goal General problem statement: We observe y i iid pθ, θ Θ and the goal is to determine the θ that

More information

Linear Models and Estimation by Least Squares

Linear Models and Estimation by Least Squares Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Linear Algebra Review

Linear Algebra Review Linear Algebra Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Linear Algebra Review 1 / 45 Definition of Matrix Rectangular array of elements arranged in rows and

More information

F & B Approaches to a simple model

F & B Approaches to a simple model A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 215 http://www.astro.cornell.edu/~cordes/a6523 Lecture 11 Applications: Model comparison Challenges in large-scale surveys

More information

Multiple Regression. Dr. Frank Wood. Frank Wood, Linear Regression Models Lecture 12, Slide 1

Multiple Regression. Dr. Frank Wood. Frank Wood, Linear Regression Models Lecture 12, Slide 1 Multiple Regression Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 12, Slide 1 Review: Matrix Regression Estimation We can solve this equation (if the inverse of X

More information

2.1 Linear regression with matrices

2.1 Linear regression with matrices 21 Linear regression with matrices The values of the independent variables are united into the matrix X (design matrix), the values of the outcome and the coefficient are represented by the vectors Y and

More information

STA 256: Statistics and Probability I

STA 256: Statistics and Probability I Al Nosedal. University of Toronto. Fall 2017 My momma always said: Life was like a box of chocolates. You never know what you re gonna get. Forrest Gump. There are situations where one might be interested

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Ph.D. Preliminary Examination Statistics June 2, 2014

Ph.D. Preliminary Examination Statistics June 2, 2014 Ph.D. Preliminary Examination Statistics June, 04 NOTES:. The exam is worth 00 points.. Partial credit may be given for partial answers if possible.. There are 5 pages in this exam paper. I have neither

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Problem Set 1. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 20

Problem Set 1. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 20 Problem Set MAS 6J/.6J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 0 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain a

More information

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36 Previous two lectures Linear and logistic

More information

Chapter 1 Linear Regression with One Predictor

Chapter 1 Linear Regression with One Predictor STAT 525 FALL 2018 Chapter 1 Linear Regression with One Predictor Professor Min Zhang Goals of Regression Analysis Serve three purposes Describes an association between X and Y In some applications, the

More information

Problem 1 (20) Log-normal. f(x) Cauchy

Problem 1 (20) Log-normal. f(x) Cauchy ORF 245. Rigollet Date: 11/21/2008 Problem 1 (20) f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 4 2 0 2 4 Normal (with mean -1) 4 2 0 2 4 Negative-exponential x x f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.5

More information

Applied Econometrics - QEM Theme 1: Introduction to Econometrics Chapter 1 + Probability Primer + Appendix B in PoE

Applied Econometrics - QEM Theme 1: Introduction to Econometrics Chapter 1 + Probability Primer + Appendix B in PoE Applied Econometrics - QEM Theme 1: Introduction to Econometrics Chapter 1 + Probability Primer + Appendix B in PoE Warsaw School of Economics Outline 1. Introduction to econometrics 2. Denition of econometrics

More information

f X, Y (x, y)dx (x), where f(x,y) is the joint pdf of X and Y. (x) dx

f X, Y (x, y)dx (x), where f(x,y) is the joint pdf of X and Y. (x) dx INDEPENDENCE, COVARIANCE AND CORRELATION Independence: Intuitive idea of "Y is independent of X": The distribution of Y doesn't depend on the value of X. In terms of the conditional pdf's: "f(y x doesn't

More information

STAT 4385 Topic 03: Simple Linear Regression

STAT 4385 Topic 03: Simple Linear Regression STAT 4385 Topic 03: Simple Linear Regression Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2017 Outline The Set-Up Exploratory Data Analysis

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables Chapter 2 Some Basic Probability Concepts 2.1 Experiments, Outcomes and Random Variables A random variable is a variable whose value is unknown until it is observed. The value of a random variable results

More information

Lecture 16 - Correlation and Regression

Lecture 16 - Correlation and Regression Lecture 16 - Correlation and Regression Statistics 102 Colin Rundel April 1, 2013 Modeling numerical variables Modeling numerical variables So far we have worked with single numerical and categorical variables,

More information

Regression Models - Introduction

Regression Models - Introduction Regression Models - Introduction In regression models, two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent variable,

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

1 Probability theory. 2 Random variables and probability theory.

1 Probability theory. 2 Random variables and probability theory. Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major

More information

Making sense of Econometrics: Basics

Making sense of Econometrics: Basics Making sense of Econometrics: Basics Lecture 2: Simple Regression Egypt Scholars Economic Society Happy Eid Eid present! enter classroom at http://b.socrative.com/login/student/ room name c28efb78 Outline

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression EdPsych 580 C.J. Anderson Fall 2005 Simple Linear Regression p. 1/80 Outline 1. What it is and why it s useful 2. How 3. Statistical Inference 4. Examining assumptions (diagnostics)

More information

ECO220Y Simple Regression: Testing the Slope

ECO220Y Simple Regression: Testing the Slope ECO220Y Simple Regression: Testing the Slope Readings: Chapter 18 (Sections 18.3-18.5) Winter 2012 Lecture 19 (Winter 2012) Simple Regression Lecture 19 1 / 32 Simple Regression Model y i = β 0 + β 1 x

More information

Econ 2120: Section 2

Econ 2120: Section 2 Econ 2120: Section 2 Part I - Linear Predictor Loose Ends Ashesh Rambachan Fall 2018 Outline Big Picture Matrix Version of the Linear Predictor and Least Squares Fit Linear Predictor Least Squares Omitted

More information

1. Regressions and Regression Models. 2. Model Example. EEP/IAS Introductory Applied Econometrics Fall Erin Kelley Section Handout 1

1. Regressions and Regression Models. 2. Model Example. EEP/IAS Introductory Applied Econometrics Fall Erin Kelley Section Handout 1 1. Regressions and Regression Models Simply put, economists use regression models to study the relationship between two variables. If Y and X are two variables, representing some population, we are interested

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information