Data Analysis 1 LINEAR REGRESSION. Chapter 03



Data Analysis 2 Outline
- The Linear Regression Model
- Least Squares Fit
- Measures of Fit
- Inference in Regression
- Other Considerations in the Regression Model
- Qualitative Predictors
- Interaction Terms
- Potential Fit Problems
- Linear vs. KNN Regression


Data Analysis 4 The Linear Regression Model

Yi = β0 + β1 Xi1 + β2 Xi2 + … + βp Xip + ε

The parameters in the linear regression model are very easy to interpret: β0 is the intercept (i.e. the average value of Y if all the X's are zero), and βj is the slope for the jth variable Xj, i.e. the average increase in Y when Xj is increased by one and all other X's are held constant.
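As a concrete illustration of this interpretation, here is a minimal sketch with made-up coefficient values (not estimates from any dataset): increasing one predictor by one unit while holding the others fixed changes the prediction by exactly that predictor's slope.

```python
# A toy linear model with assumed (made-up) parameters b0, b1, b2.
b0, b1, b2 = 2.0, 0.5, -1.5

def predict(x1, x2):
    """Prediction from the linear model Y = b0 + b1*X1 + b2*X2."""
    return b0 + b1 * x1 + b2 * x2

# Raising X1 by one unit with X2 held constant moves the prediction
# by exactly b1 -- the slope's "all else held constant" interpretation.
delta = predict(4.0, 7.0) - predict(3.0, 7.0)
print(delta)  # 0.5
```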

Data Analysis 5 Least Squares Fit

We estimate the parameters using least squares, i.e. we minimize

MSE = (1/n) Σi=1..n (Yi − Ŷi)² = (1/n) Σi=1..n (Yi − β̂0 − β̂1 Xi1 − … − β̂p Xip)²
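A minimal sketch of the least squares fit using NumPy on synthetic data with assumed true coefficients; `np.linalg.lstsq` minimizes the same sum of squared errors as the MSE criterion above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -3.0])  # assumed [b0, b1, b2] for the simulation
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones(n), X])

# Least squares: choose beta_hat to minimize sum((y - A @ beta)**2),
# which is the same as minimizing the MSE above.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With little noise and a couple hundred observations, the estimates land close to the assumed true coefficients.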

Data Analysis 6 Estimate [Figure]

Data Analysis 7 Relationship between population and least squares lines

Population line: Yi = β0 + β1 X1 + β2 X2 + … + βp Xp + ε
Least squares line: Ŷi = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂p Xp

We would like to know β0 through βp, i.e. the population line. Instead we know β̂0 through β̂p, i.e. the least squares line. Hence we use β̂0 through β̂p as guesses for β0 through βp, and Ŷi as a guess for Yi. The guesses will not be perfect, just as X̄ is not a perfect guess for μ.

Data Analysis 8 Measures of Fit: R²

Some of the variation in Y can be explained by variation in the X's and some cannot. R² tells you the fraction of variance that can be explained by X:

R² = 1 − RSS/TSS = 1 − Σi (Yi − Ŷi)² / Σi (Yi − Ȳ)²

R² is always between 0 and 1. Zero means no variance has been explained; one means it has all been explained (perfect fit to the data).
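Computed directly from its definition, with toy observed and fitted values made up for illustration:

```python
import numpy as np

# Toy observed values and fitted values from some model (made-up numbers).
y = np.array([3.0, 4.5, 6.1, 7.9, 10.2])
y_hat = np.array([3.2, 4.4, 6.0, 8.1, 10.0])

rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss                 # fraction of variance explained
```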

Data Analysis 9 Inference in Regression

The regression line from the sample is not the regression line from the population. What we want to do: assess how well the line describes the plot; guess the slope of the population line; guess what value Y would take for a given X value.

[Figure: scatterplot showing the estimated (least squares) line and the true (population) line, which is unobserved.]

Data Analysis 10 Some Relevant Questions

Is βj = 0 or not? We can use a hypothesis test to answer this question. If we can't be sure that βj ≠ 0, then there is no point in using Xj as one of our predictors.

Can we be sure that at least one of our X variables is a useful predictor, i.e. is it the case that β1 = β2 = … = βp = 0?

Data Analysis 11 1. Is βj = 0, i.e. is Xj an important variable?

We use a hypothesis test to answer this question: H0: βj = 0 vs Ha: βj ≠ 0. Calculate

t = β̂j / SE(β̂j)

i.e. the number of standard errors β̂j is away from zero. If t is large (equivalently, the p-value is small) we can be sure that βj ≠ 0 and that there is a relationship.

Regression coefficients:

Term      Coefficient  Std Err  t-value  p-value
Constant  7.0326       0.4578   15.3603  0.0000
TV        0.0475       0.0027   17.6676  0.0000

Here β̂1 is 17.67 standard errors from 0.
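The t-statistic is just the coefficient divided by its standard error; recomputing it from the (rounded) table values above roughly reproduces the reported 17.67:

```python
# TV row of the table above (rounded values from the slide).
beta_hat_tv = 0.0475
se_tv = 0.0027

t_stat = beta_hat_tv / se_tv
print(round(t_stat, 2))  # 17.59 -- close to the table's 17.67,
                         # which was computed from unrounded values
```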

Data Analysis 12 Testing Individual Variables

Is there a (statistically detectable) linear relationship between Newspaper and Sales after all the other variables have been accounted for?

Multiple regression:

Term       Coefficient  Std Err  t-value  p-value
Constant   2.9389       0.3119   9.4223   0.0000
TV         0.0458       0.0014   32.8086  0.0000
Radio      0.1885       0.0086   21.8935  0.0000
Newspaper  -0.0010      0.0059   -0.1767  0.8599

No: big p-value.

Simple regression on Newspaper alone:

Term       Coefficient  Std Err  t-value  p-value
Constant   12.3514      0.6214   19.8761  0.0000
Newspaper  0.0547       0.0166   3.2996   0.0011

Small p-value in the simple regression. Almost all the explaining that Newspaper could do in the simple regression has already been done by TV and Radio in the multiple regression!

Data Analysis 13 2. Is the whole regression explaining anything at all?

Test for H0: all slopes = 0 (β1 = β2 = … = βp = 0) vs Ha: at least one slope ≠ 0.

ANOVA Table:

Source       df   SS         MS         F         p-value
Explained    2    4860.2347  2430.1174  859.6177  0.0000
Unexplained  197  556.9140   2.8270

The answer comes from the F test in the ANOVA (ANalysis Of VAriance) table. The ANOVA table has many pieces of information; what we care about is the F ratio and the corresponding p-value.
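The F ratio can be reproduced from the table's own entries, since each mean square (MS) is just the sum of squares divided by its degrees of freedom:

```python
# Values from the ANOVA table above.
ms_explained = 4860.2347 / 2      # SS_explained / df_explained
ms_unexplained = 556.9140 / 197   # SS_unexplained / df_unexplained

f_ratio = ms_explained / ms_unexplained
print(round(f_ratio, 2))  # 859.62, matching the table's F of 859.6177
```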

Data Analysis 14 Outline The Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression Other Considerations in Regression Model Qualitative Predictors Interaction Terms Potential Fit Problems Linear vs. KNN Regression

Data Analysis 15 Qualitative Predictors How do you put men and women (categorical values) into a regression equation? Code them as indicator variables (dummy variables). For example, we can code Males = 0 and Females = 1.

Data Analysis 16 Interpretation

Suppose we want to include income and gender. Two genders (male and female). Let

Gender_i = 0 if male, 1 if female.

Then the regression equation is

Yi ≈ β0 + β1 Income_i + β2 Gender_i
   = β0 + β1 Income_i        if male
   = β0 + β1 Income_i + β2   if female

β2 is the average extra balance each month that females have for a given income level. Males are the baseline.

Regression coefficients:

Term           Coefficient  Std Err  t-value  p-value
Constant       233.7663     39.5322  5.9133   0.0000
Income         0.0061       0.0006   10.4372  0.0000
Gender_Female  24.3108      40.8470  0.5952   0.5521
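A sketch with simulated data (the numbers are made up and only echo the flavor of the table above, not the textbook's data): the least squares coefficient on the dummy recovers the average female/male gap.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
income = rng.uniform(20.0, 100.0, size=n)
female = rng.integers(0, 2, size=n).astype(float)  # dummy: 0 = male, 1 = female

# Assumed "true" model for the simulation: females carry an extra
# balance of 24 on average at any income level.
balance = 230.0 + 6.0 * income + 24.0 * female + rng.normal(scale=5.0, size=n)

A = np.column_stack([np.ones(n), income, female])
b0_hat, b1_hat, b2_hat = np.linalg.lstsq(A, balance, rcond=None)[0]
# b2_hat estimates the average extra balance for females;
# males are the baseline captured by b0_hat.
```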

Data Analysis 17 Other Coding Schemes

There are different ways to code categorical variables. Two genders (male and female). Let

Gender_i = −1 if male, +1 if female.

Then the regression equation is

Yi ≈ β0 + β1 Income_i + β2 Gender_i
   = β0 + β1 Income_i − β2   if male
   = β0 + β1 Income_i + β2   if female

β2 is the average amount that females are above the average, for any given income level. β2 is also the average amount that males are below the average, for any given income level.

Data Analysis 18 Other Issues Discussed
- Interaction terms
- Non-linear effects
- Multicollinearity
- Model Selection

Data Analysis 19 Interaction

When the effect on Y of increasing X1 depends on another variable X2.

Example: maybe the effect on Salary (Y) of increasing Position (X1) depends on gender (X2)? For example, maybe male salaries go up faster (or slower) than female salaries as they get promoted.

Advertising example: TV and radio advertising both increase sales. Perhaps spending money on both of them may increase sales more than spending the same amount on one alone.

Data Analysis 20 Interaction in advertising

Sales = β0 + β1·TV + β2·Radio + β3·TV·Radio (β3·TV·Radio is the interaction term)
      = β0 + (β1 + β3·Radio)·TV + β2·Radio

Spending $1 extra on TV increases average sales by 0.0191 + 0.0011·Radio.

Sales = β0 + (β2 + β3·TV)·Radio + β1·TV

Spending $1 extra on Radio increases average sales by 0.0289 + 0.0011·TV.

Parameter Estimates:

Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  6.7502202  0.247871   27.23    <.0001*
TV         0.0191011  0.001504   12.70    <.0001*
Radio      0.0288603  0.008905   3.24     0.0014*
TV*Radio   0.0010865  5.242e-5   20.73    <.0001*
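A sketch of fitting the interaction model on simulated data (the simulation coefficients below are assumed and merely echo the table's values; this is not the textbook's Advertising data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
tv = rng.uniform(0.0, 300.0, size=n)
radio = rng.uniform(0.0, 50.0, size=n)

# Simulate sales from an assumed interaction model plus noise.
sales = (6.75 + 0.0191 * tv + 0.0289 * radio
         + 0.0011 * tv * radio + rng.normal(scale=0.5, size=n))

# Fit Sales = b0 + b1*TV + b2*Radio + b3*TV*Radio by least squares.
A = np.column_stack([np.ones(n), tv, radio, tv * radio])
b = np.linalg.lstsq(A, sales, rcond=None)[0]

# Marginal effect of $1 of TV depends on the radio budget: b1 + b3*Radio.
tv_effect_at_radio_30 = b[1] + b[3] * 30.0
```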

Data Analysis 21 Parallel Regression Lines

Expanded Estimates (nominal factors expanded to all levels):

Term            Estimate   Std Error  t Ratio  Prob>|t|
Intercept       112.77039  1.454773   77.52    <.0001
Gender[female]  1.8600957  0.527424   3.53     0.0005
Gender[male]    -1.860096  0.527424   -3.53    0.0005
Position        6.0553559  0.280318   21.60    <.0001

Regression equations:
females: salary = 112.77 + 1.86 + 6.05·position
males:   salary = 112.77 − 1.86 + 6.05·position

Different intercepts, same slopes.

[Figure: salary vs. position, with the line for women parallel to and above the line for men.]

Parallel lines have the same slope. Dummy variables give the lines different intercepts, but their slopes are still the same.

Data Analysis 22 Interaction Effects Our model has forced the line for men and the line for women to be parallel. Parallel lines say that promotions have the same salary benefit for men as for women. If the lines aren't parallel, then promotions affect men's and women's salaries differently.

Data Analysis 23 Should the Lines be Parallel?

Expanded Estimates (nominal factors expanded to all levels):

Term                     Estimate   Std Error  t Ratio  Prob>|t|
Intercept                112.63081  1.484825   75.85    <.0001
Gender[female]           1.1792165  1.484825   0.79     0.4280
Gender[male]             -1.179216  1.484825   -0.79    0.4280
Position                 6.1021378  0.296554   20.58    <.0001
Gender[female]*Position  0.1455111  0.296554   0.49     0.6242
Gender[male]*Position    -0.145511  0.296554   -0.49    0.6242

The interaction between gender and position is not significant.

[Figure: salary vs. position with separate fitted lines for men and women.]


Data Analysis 25 Potential Fit Problems

There are a number of possible problems that one may encounter when fitting the linear regression model:
1. Non-linearity of the data
2. Dependence of the error terms
3. Non-constant variance of error terms
4. Outliers
5. High leverage points
6. Collinearity

See Section 3.3.3 for more details.


Data Analysis 27 KNN Regression

KNN regression is similar to the KNN classifier. To predict Y for a given value of X, consider the k closest points to X in the training data and take the average of their responses, i.e.

f̂(x) = (1/K) Σ_{xi ∈ N(x)} yi

If k is small, KNN is much more flexible than linear regression. Is that better?
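A minimal sketch of KNN regression in one dimension (toy training data made up for illustration; distance is plain absolute difference):

```python
import numpy as np

def knn_predict(x0, x_train, y_train, k):
    """Average the responses of the k training points closest to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

print(knn_predict(2.6, x_train, y_train, k=1))  # nearest point is x=3 -> 9.0
print(knn_predict(2.6, x_train, y_train, k=3))  # mean of y at x=2,3,4 -> 29/3
```

Small k tracks the training data closely (flexible, low bias, high variance); larger k averages over a wider neighborhood (smoother, higher bias, lower variance).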

Data Analysis 28 KNN Fits for k = 1 and k = 9 [Figure]

Data Analysis 29 KNN Fits in One Dimension (k = 1 and k = 9) [Figure]

Data Analysis 30 Linear Regression Fit [Figure]

Data Analysis 31 KNN vs. Linear Regression [Figure]

Data Analysis 32 Not So Good in High Dimensional Situations [Figure]