Bivariate data analysis

Similar documents
Lecture 16 - Correlation and Regression

Applied Regression. Applied Regression. Chapter 2 Simple Linear Regression. Hongcheng Li. April, 6, 2013

Stat 101 Exam 1 Important Formulas and Concepts 1

AMS 7 Correlation and Regression Lecture 8

Chi-square tests. Unit 6: Simple Linear Regression Lecture 1: Introduction to SLR. Statistics 101. Poverty vs. HS graduate rate

Regression. Marc H. Mehlman University of New Haven

1 A Review of Correlation and Regression

Announcements: You can turn in homework until 6pm, slot on wall across from 2202 Bren. Make sure you use the correct slot! (Stats 8, closest to wall)

8. Example: Predicting University of New Mexico Enrollment

Regression diagnostics

Introduction and Single Predictor Regression. Correlation

The Model Building Process Part I: Checking Model Assumptions Best Practice

Box-Cox Transformations

Statistical View of Least Squares

Analysis of Bivariate Data

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

STATISTICS 479 Exam II (100 points)

Inferences for Regression

Introduction to Linear regression analysis. Part 2. Model comparisons

Simple Linear Regression

Multivariate Analysis of Variance

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION

Correlation and Regression

STATISTICS 110/201 PRACTICE FINAL EXAM

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

For more information about how to cite these materials visit

Correlation & Simple Regression

Multiple Linear Regression

Multiple Regression Examples

Correlation and Regression Theory 1) Multivariate Statistics

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp

Estadística II Chapter 5. Regression analysis (second part)

Simple Linear Regression

Simple Linear Regression

CHAPTER 10. Regression and Correlation

Announcements. Unit 6: Simple Linear Regression Lecture : Introduction to SLR. Poverty vs. HS graduate rate. Modeling numerical variables

CHAPTER 5. Outlier Detection in Multivariate Data

Announcements. Lecture 18: Simple Linear Regression. Poverty vs. HS graduate rate

Chapter 16. Simple Linear Regression and dcorrelation

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Polynomial Regression

Chapter Goals. To understand the methods for displaying and describing relationship among variables. Formulate Theories.

Announcements. Lecture 10: Relationship between Measurement Variables. Poverty vs. HS graduate rate. Response vs. explanatory

Unit 6 - Introduction to linear regression

General Linear Model (Chapter 4)

Correlation and the Analysis of Variance Approach to Simple Linear Regression

Talking feet: Scatterplots and lines of best fit

Graphical Diagnosis. Paul E. Johnson 1 2. (Mostly QQ and Leverage Plots) 1 / Department of Political Science

7.0 Lesson Plan. Regression. Residuals

Unit 6 - Simple linear regression

2. Outliers and inference for regression

appstats8.notebook October 11, 2016

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

Lectures on Simple Linear Regression Stat 431, Summer 2012

Relationship Between Interval and/or Ratio Variables: Correlation & Regression. Sorana D. BOLBOACĂ

Assumptions in Regression Modeling

Regression Diagnostics Procedures

A course in statistical modelling. session 09: Modelling count variables

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

Ridge Regression. Summary. Sample StatFolio: ridge reg.sgp. STATGRAPHICS Rev. 10/1/2014

Business Statistics. Lecture 10: Correlation and Linear Regression

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

Chapter 16. Simple Linear Regression and Correlation

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

Statistical Modelling in Stata 5: Linear Models

Data Set 1A: Algal Photosynthesis vs. Salinity and Temperature

Example. Multiple Regression. Review of ANOVA & Simple Regression /749 Experimental Design for Behavioral and Social Sciences

Multiple Predictor Variables: ANOVA

Single and multiple linear regression analysis

STA 108 Applied Linear Models: Regression Analysis Spring Solution for Homework #6

Unit 10: Simple Linear Regression and Correlation

Review. Number of variables. Standard Scores. Anecdotal / Clinical. Bivariate relationships. Ch. 3: Correlation & Linear Regression

Activity #12: More regression topics: LOWESS; polynomial, nonlinear, robust, quantile; ANOVA as regression

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Checking model assumptions with regression diagnostics

Chapter 6 Multiple Regression

Business Statistics. Lecture 9: Simple Regression

Weighted Least Squares

Linear Regression. Anna Leontjeva

1 Introduction to Minitab

Project Report for STAT571 Statistical Methods Instructor: Dr. Ramon V. Leon. Wage Data Analysis. Yuanlei Zhang

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

Simple Linear Regression: Introduction to Diagnostics

Simple Linear Regression

Simple Linear Regression for the Advertising Data

THE PEARSON CORRELATION COEFFICIENT

6.0 Lesson Plan. Answer Questions. Regression. Transformation. Extrapolation. Residuals

Chapter 3 - Linear Regression

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

22S39: Class Notes / November 14, 2000 back to start 1

FREC 608 Guided Exercise 9

Section 5.4 Residuals

Can you tell the relationship between students SAT scores and their college grades?

Simple Linear Regression

A course in statistical modelling. session 06b: Modelling count data

Simple Linear Regression

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat).

H. Diagnostic plots of residuals

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression.

Transcription:

Bivariate data analysis

Categorical data - creating data set Upload the following data set to R Commander sex female male male male male female female male female female eye black black blue green green green black green blue blue Method 1: Type the table in the Notepad, save it and import to Rcmdr Method 2: Introduce directly in the Script Window eye = c(" black "," black "," blue "," green "," green ", " green "," black "," green "," blue "," blue ") sex = c(" female "," male "," male "," male "," male ", " female ", " female "," male "," female "," female ") DataSexEye = data.frame (sex, eye )

Categorical data - contingency table

Categorical data - contingency table cont. How many of the sampled people are female with black eyes? (2) What % of the sampled people are male with blue eyes? (10%) What % of the sampled people are male? (50%) What % of the sampled people have green eyes? (40%)

Categorical data - barchart Load the library lattice, then create barchart grouping the data by sex library ( lattice ) barchart ( DataSexEye, groups = DataSexEye $ sex )

Categorical data - barchart cont. Are there more females or males with blue eyes? (females) What is the most common eye color among males? (green) green male female black blue male female 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Freq

Numerical data - load anscombe data set from R library

Numerical data - scatterplot of y1 versus x1

Numerical data - scatterplot of y1 versus x1 cont. Uncheck all but the Least-squares line Plotting characters 20 corresponds to bullets Increase the Point size to 2.5

Numerical data - scatterplot of y1 versus x1 cont. y1 4 5 6 7 8 9 10 11 4 6 8 10 12 14 x1

Numerical data - scatterplot matrix (only x 1, x 2, y 1, y 2 )

Numerical data - scatterplot matrix (only x 1, x 2, y 1, y 2 ) cont. Check Least-squares line

Numerical data - scatterplot matrix (only x 1, x 2, y 1, y 2 ) cont. x1 4 6 8 12 3 5 7 9 4 6 8 12 4 6 8 12 x2 y1 4 6 8 10 4 6 8 12 3 5 7 9 4 6 8 10 y2

latticist environment You can create interactive graphics: data ( anscombe, package =" datasets ") library ( latticist ) latticist ( anscombe )

Numerical data - correlation matrix

Numerical data - correlation matrix (only x 1, x 2, y 1, y 2 ) cont. Matrix is symmetrical with values on the diagonal = 1 cor(x 1, y 1 ) = cor(y 1, x 1 ) = 0.8164205

Numerical data - covariance matrix (only x 1, x 2, y 1, y 2 ) Replace cor by cov in the last command in the Script Window cov(x 1, y 1 ) = 5.501 Matrix is symmetrical with values on the diagonal = variances, eg, cov(y 1, y 1 ) = var(y 1 ) = 4.127269

Simple linear regression - y1 versus x1

Simple linear regression - y1 versus x1 cont.

Simple linear regression - y1 versus x1 cont. Intercept estimate: a = 3.0001 Slope estimate: b = 0.5001 Residual standard deviation: s R = n i=1 (y i ŷ i ) 2 n 2 = 1.237 R-squared: R 2 = 0.6665 cor(x, y) = 0.6665

Regression Diagnostics: Tools for Checking the Validity of a Model (I) Determine whether the proposed regression model provides an adequate fit to the data: plots of standardized residuals. The plots assess visually whether the assumptions are being violated. Determine which (if any) of the data points have x values that have an unusually large effect on the estimated regression model (leverage points). Determine which (if any) of the data points are outliers: points which do not follow the pattern set by the bulk of the data.

Regression Diagnostics: Tools for Checking the Validity of a Model (II) If leverage points exist, determine whether each is a bad leverage point. If a bad leverage point exists we shall assess its influence on the fitted model. Examine whether the assumption of constant variance of the errors is reasonable. If not, we shall look at how to overcome this problem. If the data are collected over time, examine whether the data are correlated over time. If the sample size is small or prediction intervals are of interest, examine whether the assumption that the errors are normally distributed is reasonable.

Data Sources: Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, Connecticut: Graphics Press, 1983), pp. 14-15. F.J. Anscombe, "Graphs in Statistical Analysis," American Statistician, vol. 27 (Feb 1973), pp. 17-21. Anscombe's Data Observation x1 y1 x2 y2 x3 y3 x4 y4 1 10 8,04 10 9,14 10 7,46 8 6,58 2 8 6,95 8 8,14 8 6,77 8 5,76 3 13 7,58 13 8,74 13 12,74 8 7,71 4 9 8,81 9 8,77 9 7,11 8 8,84 5 11 8,33 11 9,26 11 7,81 8 8,47 6 14 9,96 14 8,1 14 8,84 8 7,04 7 6 7,24 6 6,13 6 6,08 8 5,25 8 4 4,26 4 3,1 4 5,39 19 12,5 9 12 10,84 12 9,13 12 8,15 8 5,56 10 7 4,82 7 7,26 7 6,42 8 7,91 11 5 5,68 5 4,74 5 5,73 8 6,89 Summary Statistics N 11 11 11 11 11 11 11 11 mean 9,00 7,50 9,00 7,50091 9,00 7,50 9,00 7,50 SD 3,16 1,94 3,16 1,94 3,16 1,94 3,16 1,94 r 0,82 0,82 0,82 0,82 Use the charts below to get the regression lines via Excel's Trendline feature.

Data 14 12 10 8 6 4 2 0 x1-y1 0 5 10 15 20 14 12 10 8 6 4 2 0 x2-y2 0 5 10 15 20 14 12 10 8 6 4 2 0 x3-y3 0 5 10 15 20 x4-y4 14 12 10 8 6 4 2 0 0 5 10 15 20

Data Regression Results LINEST OUTPUT x1-y1 x2-y2 x3-y3 x4-y4 slope intercept 0,50 3 0,50 3 0,50 3 0,50 3 SE SE 0,12 1,12 0,12 1,13 0,12 1,12 0,12 1,12 R 2 RMSE 0,67 1,24 0,67 1,24 0,67 1,24 0,67 1,24 F df 17,99 9 17,97 9 17,97 9 18,00 9 Reg SS SSR 27,51 13,76 27,50 13,78 27,47 13,76 27,49 13,74

Data 12 10 8 6 4 2 0 x1-y1 data y = 0,5001x + 3,0001 0 5 10 15 12 10 8 6 4 2 0 x2-y2 data y = 0,5x + 3,0009 0 5 10 15 14 x3-y3 data 12 10 8 6 4 y = 0,4997x + 3,0025 2 0 0 5 10 15 14 x4-y4 data 12 10 8 6 4 y = 0,4999x + 3,0017 2 0 0 5 10 15 20

Simple linear regression - residual plot (method 1)

Simple linear regression - residual plot (method 1) cont. Residuals versus fitted (top left plot) lm(y1 ~ x1) Residuals 2 1 0 1 2 Residuals vs Fitted 10 9 3 Standardized residuals 1 0 1 2 3 10 Normal Q Q 9 5 6 7 8 9 10 1.5 0.5 0.5 1.5 Fitted values Theoretical Quantiles Standardized residuals 0.0 0.4 0.8 1.2 Scale Location 10 3 9 Standardized residuals 2 1 0 1 2 Residuals vs Leverage 9 10 Cook's distance 3 1 0.5 0.5 1 5 6 7 8 9 10 0.00 0.10 0.20 0.30 Fitted values Leverage

Simple linear regression - residual plot (method 2) Append the fitted values, residuals, standardized residuals etc to the existing data set

Simple linear regression - residual plot (method 2 cont.) Append the fitted values, residuals, studentized residuals etc to the existing data set

Simple linear regression - residual plot (method 2 cont.) Now the data set has new columns on the right with ŷ, r, etc

Simple linear regression - residual plot (method 2 cont.) Use the scatterplot option in the Graphs menu to plot residuals versus fitted

fitted.regmodel.1 Simple linear regression - residual plot (method 2 cont.) Residuals versus fitted (cloud of points oscillates around the horizontal axis y = 0) There is no pattern, no heteroscedasticity regression model is appropriate residuals.regmodel.1 2 1 0 1 5 6 7 8 9 10

Simple linear regression - residual plot (method 2 cont.) Studentized Residuals ( r i s R ) versus x 1 (cloud of points oscillates around the horizontal axis y = 0) There is no pattern, no heteroscedasticity regression model is appropriate rstudent.regmodel.1 2 1 0 1 4 6 8 10 12 14