Regression - Modeling a response

Regression - Modeling a response

We often wish to construct a model to:
- explain the association between two or more variables, or
- predict the outcome of a variable given values of other variables.

Regression is a technique that allows us to do so. A regression model has three components:
- an outcome variable y,
- predictor variables or covariates x_1, x_2, ..., x_k,
- a random error ɛ.

Stat 328 - Fall 2004

Regression - Salary example

Data on 20 mid-level managers in the insurance industry in 2003: number of employees supervised and annual salary in $1,000.

Employees:   28    31    38    38    43    47    48    49    53    56
Salary:     99.4 102.4 136.3 127.3 157.5 121.5 173.4 197.8 200.5 176.9

Employees:   56    56    60    60    60    60    70    72    78    81
Salary:    219.9 223.4 219.0 224.4 247.2 237.5 207.3 214.9 242.9 262.5
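As a quick numerical check of the positive association discussed in these notes, the 20 (employees, salary) pairs can be entered directly and their sample correlation computed. This is a minimal sketch using only Python's standard library; the variable names are my own, not from the notes:

```python
# Salary data for 20 mid-level insurance managers (salary in $1,000).
employees = [28, 31, 38, 38, 43, 47, 48, 49, 53, 56,
             56, 56, 60, 60, 60, 60, 70, 72, 78, 81]
salary = [99.4, 102.4, 136.3, 127.3, 157.5, 121.5, 173.4, 197.8, 200.5, 176.9,
          219.9, 223.4, 219.0, 224.4, 247.2, 237.5, 207.3, 214.9, 242.9, 262.5]

n = len(employees)
x_bar = sum(employees) / n
y_bar = sum(salary) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - x_bar) ** 2 for x in employees)
syy = sum((y - y_bar) ** 2 for y in salary)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(employees, salary))

r = sxy / (sxx * syy) ** 0.5  # sample correlation coefficient
print(f"r = {r:.3f}")  # strongly positive, consistent with the scatter plot
```

The correlation comes out strongly positive, matching the visual impression that salary rises with the number of supervisees.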

[Figure 1: Data on mid-level executives - scatter plot omitted]

Example (cont'd)

Data on the graph are salaries (in $1,000) and number of employees supervised by 20 mid-level managers in the insurance industry in 2003.

Questions of interest:
- Is annual compensation associated with the number of personnel supervised?
- If a manager's supervision responsibilities increase by 30 employees, can she predict what her new salary might be?

Example (cont'd)

A few things to note:
- Salary appears to increase with the number of employees supervised: there is a positive association between salary and supervision responsibilities.
- A straight line through the points provides a good summary of the relationship between salary and number of supervisees.
- Two managers with the same number of supervisees do not necessarily make exactly the same salary, so factors other than the number of supervisees also determine salary.

Regression - Formal introduction

Every regression model has the following form:

    y = E(y) + ɛ

E(y) is the systematic part of the model, which establishes the relationship between y and the predictors x_1, x_2, ...; ɛ is a random deviation or random error. Note that ɛ = y - E(y).

Regression - Formal intro (cont'd)

In a linear regression model, E(y) is a linear function of the predictors. In a simple linear regression model there is one predictor x, and the regression relation is just a straight line of the form

    E(y) = β_0 + β_1 x,

so that

    y = E(y) + ɛ = β_0 + β_1 x + ɛ.

Back to the salary example.

Regression - Salary example

In the example, managers supervising fewer than 40 employees earn between $99,430 and $136,325, while those who supervise more than 70 earn $207,306 to $262,515. Three managers supervise 56 employees and make $176,990, $219,895 and $223,390.

Salary appears to depend on the number of employees supervised: expected salary increases with the number of supervisees. A manager with no employees to supervise still earns some money. All of this can be summarized as:

    E(Salary) = β_0 + β_1 × employees

Regression - Definitions

From the earlier page: E(Salary) = β_0 + β_1 × employees.
- β_0 is the intercept of the regression equation: the salary (in $1,000) that can be expected by a manager with no employees to supervise.
- β_1 is the regression coefficient: the expected increase in salary (in $1,000) when the number of employees increases by 1.

If, for example, β_0 = 25 and β_1 = 3, then:
- A manager with no supervisees can expect to earn 25 + 3(0) = 25, i.e. $25,000.
- A manager with 30 supervisees can expect to earn 25 + 3(30) = 115, i.e. $115,000.
- A manager with 56 supervisees can expect to earn 25 + 3(56) = 193, i.e. $193,000.
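The arithmetic in the worked example above can be captured in a one-line prediction function. This is a sketch only; the coefficients 25 and 3 are the illustrative values from the notes, not estimates from the data:

```python
def expected_salary(n_supervisees, b0=25.0, b1=3.0):
    """Expected salary in $1,000 under E(Salary) = b0 + b1 * n_supervisees."""
    return b0 + b1 * n_supervisees

# Reproduce the three cases from the example.
for n_sup in (0, 30, 56):
    print(n_sup, expected_salary(n_sup))  # 25.0, 115.0, 193.0 (in $1,000)
```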

Regression - Definitions (cont'd)

- A positive β_1 indicates a positive association between y and x: as x increases, so does y.
- A negative β_1 indicates a negative association between y and x: as x increases, y decreases. An example might be median household salary and the proportion of household income spent on food.
- A β_1 close to zero indicates no association between y and x: y might increase or decrease when x increases. An example might be divorce rate and oil prices.

See graphical examples of each of the three possibilities. Errors ɛ are the deviations of the observations from the regression line (more soon).

Regression - Definitions (cont'd)

Simple linear regression: the outcome y is a linear function of one predictor x. For the ith unit:

    y_i = β_0 + β_1 x_i + ɛ_i.

Multiple linear regression: the outcome y is a linear function of two or more predictors x_1, x_2, ..., x_k:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ɛ_i.
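The multiple-regression equation generalizes the prediction function to k predictors. Here is a small sketch; the coefficient values below are made up purely for illustration, not taken from the notes:

```python
def predict(x, coefs):
    """E(y) = b0 + b1*x1 + ... + bk*xk for one observation x = [x1, ..., xk]."""
    b0, slopes = coefs[0], coefs[1:]
    assert len(x) == len(slopes), "need one slope coefficient per predictor"
    return b0 + sum(b * xj for b, xj in zip(slopes, x))

# Hypothetical model with k = 2 predictors, hence k + 1 = 3 coefficients.
coefs = [25.0, 3.0, -1.0]
print(predict([30, 10], coefs))  # 25 + 3*30 - 1*10 = 105.0
```

Setting every predictor to zero returns just the intercept coefs[0], matching the interpretation of β_0 given above.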

Regression - Definitions (cont'd)

In a model with k predictors there are k + 1 coefficients: β_0, β_1, ..., β_k. The intercept β_0 is the expected value of y when all predictors x_1, x_2, ..., x_k take on the value zero. The meaning of the multiple regression coefficients comes later in the course. See Figure 2.3 in the textbook for an example of a regression with two predictors.

Regression - The random error

The regression line (in simple linear regression) or the regression surface (in multiple linear regression) indicates the expected or average association between the outcome and the predictor(s). Actual, individual observations deviate up or down from the regression line.

Deviations of the observations from the regression relationship occur because the model is not perfect: there is variability in the y that is not explainable by the predictor(s). In the salary example, we saw that several managers who supervise the same number of employees have different salaries. This is because the number of supervisees is but one factor associated with salary.

Regression - Random errors (cont'd)

Even when we include all known explanatory factors in the model, unexplainable effects on the outcome variable that induce variability of the y around the regression relation will remain. Errors are defined as the deviations ɛ_i = y_i - E(y_i), after accounting for all known effects on y.

Typically, the errors are assumed to be normal: ɛ_i ~ N(0, σ²). The larger σ², the larger the deviations of the observations from their expectation. On the graph, the ɛ_i are the vertical distances of the observations from the regression line (or surface).
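To see the role of σ² concretely, one can simulate errors from N(0, σ²) at two values of σ and compare their spread. This is a sketch using Python's standard library; the sample size and σ values are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

def simulate_errors(sigma, n=10_000):
    """Draw n errors eps_i ~ N(0, sigma^2)."""
    return [random.gauss(0.0, sigma) for _ in range(n)]

small = simulate_errors(sigma=5.0)
large = simulate_errors(sigma=20.0)

# The larger sigma, the larger the typical vertical deviation from E(y).
print(statistics.stdev(small))  # close to 5
print(statistics.stdev(large))  # close to 20
```

With the larger σ, the simulated observations scatter roughly four times as far from their expectation, which is exactly what "larger σ², larger deviations" means on the graph.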

Regression - Population vs sample

As in Chapter 1, we talk about a true regression relation and an estimated regression relation. The true regression relation is knowable only if we can obtain values of y and the predictors for every unit in the entire population. The estimated regression relation is what we obtain from a sample: for each unit in our sample we measure y and the predictors, and from those measurements we obtain sample estimates of all parameters in the model.

Regression - Population vs sample

We use b_0, b_1, ..., b_k to denote the sample estimates of the population quantities β_0, β_1, ..., β_k, and ˆσ² or MSE (mean squared error) to denote the sample estimate of the population parameter σ², the variance of the errors.

We will see later that, as in the case of ȳ, the sample estimates of the intercept and regression coefficients are random and have their own sampling distributions. As before, the randomness arises from the fact that the sample from which we obtain the estimates is generated by a random sampling process.
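For the salary data, the sample estimates b_0, b_1 and the MSE can be computed with the standard least-squares formulas b_1 = S_xy / S_xx, b_0 = ȳ - b_1 x̄, and MSE = SSE / (n - 2). These formulas are derived later in the course; the following is a minimal Python sketch of the computation:

```python
# Salary data for 20 mid-level insurance managers (salary in $1,000).
employees = [28, 31, 38, 38, 43, 47, 48, 49, 53, 56,
             56, 56, 60, 60, 60, 60, 70, 72, 78, 81]
salary = [99.4, 102.4, 136.3, 127.3, 157.5, 121.5, 173.4, 197.8, 200.5, 176.9,
          219.9, 223.4, 219.0, 224.4, 247.2, 237.5, 207.3, 214.9, 242.9, 262.5]

n = len(employees)
x_bar = sum(employees) / n
y_bar = sum(salary) / n
sxx = sum((x - x_bar) ** 2 for x in employees)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(employees, salary))

b1 = sxy / sxx           # estimated regression coefficient
b0 = y_bar - b1 * x_bar  # estimated intercept

# Residual sum of squares; n - 2 degrees of freedom because
# two coefficients (b0, b1) were estimated from the data.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(employees, salary))
mse = sse / (n - 2)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, MSE = {mse:.1f}")
```

Reassuringly, the estimates land close to the illustrative values β_0 = 25 and β_1 = 3 used in the definitions slide.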

Regression - Steps to fit a model

We fit a regression model when we wish to determine whether there is an association between a response variable and one or more predictors. The steps are:

1. Hypothesize a relationship between y and one or more predictors x_1, x_2, ...
2. Draw a random sample of size n from the relevant population.
3. Collect data (y_i, x_i1, x_i2, ..., x_ik) on the sampled units: one outcome variable and k predictor variables on the ith unit, with i = 1, ..., n.
4. Fit the model: obtain sample estimates for the population parameters β_0, β_1, ..., β_k, σ².
5. Check the fit of the model and, if necessary, reformulate.
6. Test hypotheses and draw conclusions.

Regression - Collecting data

Two types of studies for obtaining a sample of size n:

1. Experimental: the values of x_1, x_2, ... are fixed in advance according to some experimental design. Example: x_1 is the amount of fertilizer, x_2 is the amount of irrigation, and y is the yield of corn. The experiment consists of applying pre-determined amounts of water and fertilizer to different plots and measuring the yields.
2. Observational: the values of the predictors are not set in advance. Example: sample 20 mid-level managers in the insurance industry to determine the association between number of supervisees and salary.

In business, economics and the social sciences, observational studies are more common than experimental studies.

Regression - Observational vs experimental studies

The mechanics of fitting a regression model are the same in both types of studies, but the inferences that can be drawn from the results are more limited in observational studies. In particular, we cannot establish cause-effect relationships, only associations.

How large a sample? As in Chapter 1, the sample size will depend on the desired reliability of the estimates b_0, b_1, ..., b_k. Determining sample size precisely in observational studies is not straightforward. A rule of thumb is n = 10k: ten times as many observations as there are regression coefficients to estimate.