Regression_Model_Project Md Ahmed June 13th, 2017


Executive Summary

Motor Trend is a magazine about the automobile industry. It is interested in exploring the relationship between a set of variables and miles per gallon (MPG) as the outcome, particularly:

- Is an automatic or manual transmission better for MPG?
- Quantify the MPG difference between automatic and manual transmissions.

Project progression: This is a linear regression project. To answer these questions, I use standard statistical analyses to verify, quantify and justify my model selection. At each step I offer some statistical inference and an immediate conclusion about whether my model is a good fit. At the outset, I did some basic exploratory data analysis (EDA), that is, a little slicing and dicing of the mtcars dataset. The data manipulation converts the am variable into a factor with two levels (Auto, Manual), as per the project instructions. In the regression section, I analyze the model summary in as much detail as possible to justify that manual transmission holds a mileage advantage. I did a simple residual analysis to check the model. In addition, I fitted some multivariable models with variable adjustment and interaction to validate that none of the other models offers a better mileage gain than my model (fit01). Finally, I used the anova function to argue that my lm(mpg ~ am) model is the right answer to the project questions.

Project question criteria and report writing instructions

Load the mtcars data set and carry out some exploratory data analysis. Design a regression model and execute some detailed statistical analysis. The linear model analysis should adhere to these criteria:

- Interpret the coefficients and slopes correctly.
- Do some basic, relevant exploratory data analysis.
- Fit several multivariable linear models and explain the reasoning for model selection.
- Present a residual plot with some diagnostic analysis.
- Quantify the uncertainty in the models' inferential conclusions and/or perform an inference correctly.
- Answer the questions of interest, or detail why the question(s) is (are) not answerable.

The report should:

- Include an executive summary about the project design progression.
- Be written as a PDF printout compiled (using knitr) from an R Markdown document.
- Be concise: roughly the equivalent of 2 pages or less for the main text. Supporting figures in an appendix can be included, up to 5 total pages.

1. EDA: Exploratory Data Analysis

    # load the 'mtcars' data set
    data(mtcars)
    # a brief look at the data
    head(mtcars)

[Output: the first six rows (Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet Sportabout, Valiant) with columns mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb. The numeric values were lost in transcription.]

    # display the 'mtcars' data summary
    summary(mtcars)

Values recovered from the summary output (entries shown as ... were lost in transcription):

            Min.    1st Qu.  Median   Mean    3rd Qu.  Max.
    mpg     10.40   ...      19.20    20.09   ...      33.90
    cyl     4.000   ...      6.000    6.188   ...      8.000
    disp    71.1    ...      196.3    230.7   ...      472.0
    hp      ...     96.5     123.0    ...     180.0    335.0
    drat    2.760   ...      3.695    3.597   ...      4.930
    wt      1.513   ...      3.325    3.217   ...      5.424
    qsec    14.50   ...      17.71    17.85   ...      22.90
    vs      ...     ...      ...      ...     ...      ...
    am      ...     ...      ...      ...     ...      ...
    gear    3.000   ...      4.000    3.688   ...      5.000
    carb    ...     2.000    2.000    ...     4.000    8.000

    # data dimensions
    dim(mtcars)
    [1] 32 11

    # summarize 'mpg' by transmission type (am: 0 = automatic, 1 = manual)
    by(mtcars$mpg, INDICES = list(mtcars$am), summary)

[Output: Min., 1st Qu., Median, Mean, 3rd Qu. and Max. of mpg for am = 0 and for am = 1. The numeric values were lost in transcription.]
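The by() call above can also be written with tapply(), which returns the group statistics as a plain named vector that is easy to reuse later. This is a minimal sketch in base R, not part of the original report; the object names group_means and group_sizes are chosen here for illustration.

```r
# Sketch: group-wise mpg summaries by transmission, base R only.
data(mtcars)

# In the raw data set, am is coded 0 = automatic and 1 = manual.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)  # mean mpg per group
group_sizes <- table(mtcars$am)                     # number of cars per group

print(round(group_means, 2))
print(group_sizes)
```

The resulting vector is indexed by the level names "0" and "1", so group_means[["1"]] is the mean mileage of the manual cars.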

2. DM: Data manipulation with t-test

    library(dplyr)

    Warning: package 'dplyr' was built under R version [version lost in transcription]
    Attaching package: 'dplyr'
    The following objects are masked from 'package:stats': filter, lag
    The following objects are masked from 'package:base': intersect, setdiff, setequal, union

    # convert 'am' to a factor (original coding: 0 and 1)
    summary(mtcars$am <- factor(mtcars$am))
    # rename the two factor levels
    levels(mtcars$am) <- c("Auto", "Manual")
    # quick view of the relabelled data
    head(mtcars)

[Output: the same six rows as before, with am now shown as Manual for Mazda RX4, Mazda RX4 Wag and Datsun 710, and as Auto for Hornet 4 Drive, Hornet Sportabout and Valiant. The other numeric values were lost in transcription.]

    # subset of cars with automatic transmission
    Auto_Data <- mtcars[mtcars$am == "Auto", ]
    # subset of cars with manual transmission
    Manual_Data <- mtcars[mtcars$am == "Manual", ]
    # mean 'mpg' for the 'Auto' and 'Manual' levels
    summarise(group_by(mtcars, am), mn = mean(mpg))

[Output: a 2 x 2 tibble with columns am (factor) and mn (double), one row each for Auto and Manual. The mean values were lost in transcription.]

    # Welch two-sample t-test comparing the mean mpg of the two groups
    t.test(Auto_Data$mpg, Manual_Data$mpg)

    Welch Two Sample t-test

    data: Auto_Data$mpg and Manual_Data$mpg
    alternative hypothesis: true difference in means is not equal to 0

[Output: t statistic, degrees of freedom, p-value, 95 percent confidence interval and the two sample means (mean of x, mean of y). The numeric values were lost in transcription.]

The t-test compares the mean mpg of the Auto and Manual groups to check whether the difference between the two means is statistically significant.

t-test analysis: The 95% confidence interval for the difference (Auto mean minus Manual mean) does not contain zero; both limits are negative. The p-value is close to zero and below 0.05, so the result is statistically significant at the 0.05 level and we can reject the null hypothesis [ Auto_mean == Manual_mean ]. The sample estimates also satisfy Auto_mean < Manual_mean, so the direction of the difference favours manual transmission.

3. Regression analysis with linear model

My regression model addresses the project question: is an automatic or manual transmission better for MPG?

    # first linear model, using the relabelled 'am' factor, with its summary
    fit01 <- lm(mpg ~ am, mtcars)
    summary(fit01)

    Call: lm(formula = mpg ~ am, data = mtcars)

[Output: residual quantiles, then a coefficient table with rows (Intercept) and amManual and columns Estimate, Std. Error, t value and Pr(>|t|), followed by the residual standard error on 30 degrees of freedom, the multiple and adjusted R-squared, and the F-statistic on 1 and 30 DF. Apart from the exponent of the intercept p-value (e-15), the numeric values were lost in transcription.]
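The Welch test of Section 2 and the regression of Section 3 are closely related: with a single two-level factor as predictor, the t-test on the regression slope is the same test as a pooled-variance two-sample t-test, while the Welch version relaxes the equal-variance assumption. The sketch below illustrates this under stated assumptions. It rebuilds the factor on a copy of the data; the names d, fit, slope, pooled and welch are chosen here and do not appear in the original report.

```r
# Sketch: the slope t-test of lm(mpg ~ am) versus two-sample t-tests.
data(mtcars)
d <- mtcars                                   # work on a copy
d$am <- factor(d$am, labels = c("Auto", "Manual"))

fit   <- lm(mpg ~ am, data = d)
slope <- coef(summary(fit))["amManual", ]     # Estimate, Std. Error, t value, Pr(>|t|)

manual_mpg <- d$mpg[d$am == "Manual"]
auto_mpg   <- d$mpg[d$am == "Auto"]

# Pooled-variance test (Manual minus Auto): same assumption as the linear model
pooled <- t.test(manual_mpg, auto_mpg, var.equal = TRUE)

# Welch test, as used in Section 2: does not assume equal variances
welch <- t.test(manual_mpg, auto_mpg)

cat("lm slope t :", slope["t value"], "\n")
cat("pooled t   :", unname(pooled$statistic), "\n")
cat("Welch t    :", unname(welch$statistic), "\n")
```

The first two statistics agree, on 30 degrees of freedom. The Welch statistic differs because it estimates the two group variances separately and uses fractional degrees of freedom.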

Linear model summary-result analysis:

The intercept is the mean mileage for automatic transmission. The estimated mean for manual transmission is the intercept plus the slope.

Coefficient - Estimate: The intercept of this model is the expected mileage of a car with automatic transmission, while the slope (amManual) is the expected difference in mileage between manual and automatic cars. The positive amManual estimate means that average mileage is higher for manual cars by that many MPG, so manual transmission shows better MPG than automatic transmission in this model.

Coefficient - t value: Both t-statistics are far from zero, that is, each estimate is large relative to its standard error. For the slope this means we can reject the null hypothesis of no difference between the groups and conclude [ Auto != Manual ].

Coefficient - Pr(>|t|): The p-value of the intercept is 1.13e-15. The slope p-value is larger but still below 0.05. The slope p-value is the relevant one for the project question, because it tests whether the Manual minus Auto difference is zero.

Residual standard error: The residual standard error measures the quality of a linear regression fit. It is roughly the typical amount by which the response (mileage) deviates from the fitted regression line. For this model, it is the typical size of the difference between a car's actual mileage and the mean mileage of its transmission group, in miles per gallon. Comparing it with the mean mileage of the Auto group (the intercept) gives a sense of the relative size of that error. (The numeric values in this paragraph were lost in transcription.)

R-squared, Adjusted R-squared: The R-squared statistic measures how well the model fits the actual data. Here the multiple R^2 is 0.3598, so roughly 36% of the variance in the response variable (mpg) is explained by the predictor am (Auto/Manual). The adjusted R^2 is slightly lower. Any R^2 lies between 0 and 1; a value of about 0.36 indicates a moderate association between the two variables and leaves close to two thirds of the variance unexplained, so transmission type alone gives only a partial account of mileage.

Calculating confidence intervals for the intercept and slope of this model:

    # 95% confidence interval for the intercept (mean mpg of the Auto group)
    sumcoef <- summary(fit01)$coefficients
    sumcoef[1, 1] + c(-1, 1) * qt(0.975, df = fit01$df.residual) * sumcoef[1, 2]

    # 95% confidence interval for the 'amManual' slope
    sumcoef[2, 1] + c(-1, 1) * qt(0.975, df = fit01$df.residual) * sumcoef[2, 2]

[Output: one pair of interval endpoints per call. The numeric values were lost in transcription.]
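The interval arithmetic above (estimate plus or minus the t quantile times the standard error) can be cross-checked against confint(), and the slope itself can be checked against the raw difference in group means. This is a sketch for verification only; it refits the model on a copy of the data, and the names d, fit, est, tcrit and manual_ci are chosen here for illustration.

```r
# Sketch: manual 95% intervals versus confint(), and slope versus group means.
data(mtcars)
d <- mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = d)

est   <- coef(summary(fit))                  # coefficient table as a matrix
tcrit <- qt(0.975, df = fit$df.residual)     # two-sided 95% critical value, 30 df

# Row i: estimate_i - tcrit * se_i, estimate_i + tcrit * se_i
manual_ci <- est[, "Estimate"] + outer(est[, "Std. Error"], c(-1, 1)) * tcrit

print(manual_ci)
print(confint(fit, level = 0.95))

# With one two-level factor, the slope is the difference in group means
mean_diff <- diff(tapply(d$mpg, d$am, mean))  # Manual mean minus Auto mean
print(c(slope = unname(coef(fit)["amManual"]), mean_diff = unname(mean_diff)))
```

Both interval computations give the same endpoints, and the slope matches the difference between the two group means, which is why the intercept can be read as the Auto mean and the slope as the Manual minus Auto gap.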

Analysis: With 95% confidence, switching from automatic to manual transmission is associated with an increase in average mileage that lies within the slope interval above (the endpoints were lost in transcription).

Inference: Because the whole interval lies above zero, manual transmission is associated with better gas mileage than automatic transmission in this data set.

Residual analysis for model selection

    # resid() returns the residuals of the linear model (fit01)
    residual <- resid(fit01)
    # numeric summary of the residuals of model 'fit01'
    summary(residual)

[Output: Min., 1st Qu., Median, Mean, 3rd Qu. and Max. of the residuals. The numeric values were lost in transcription.]

Analysis: The negative residuals nearly balance the positive ones. Least-squares residuals from a model with an intercept must sum to zero, and here the sum is zero up to rounding. This is a property of the fitting method, so it confirms that the model was fitted correctly rather than measuring how well it fits.

    # residuals vs. fitted values
    par(mfrow = c(1, 2))
    # fitted(fit01) is supplied as x so that the plot matches its axis label
    plot(fitted(fit01), residual, xlab = "Fitted values", ylab = "Residuals")
    abline(0, 0)
    # normality of the residuals (errors)
    qqnorm(residual)
    qqline(residual)

Figure 1: Residual plot (panels: Residuals against Fitted values; Normal Q-Q Plot of Sample Quantiles against Theoretical Quantiles)

Residuals vs. fitted: The residual points are spread roughly symmetrically above and below the 0-line.

Residual Q-Q plot: The residuals of model fit01 fall roughly on a line in the normal Q-Q plot, which is consistent with approximately normal errors. These plots give no reason to doubt the design of model fit01.

Plots of the regression model

    # drawing plots
    library(ggplot2)

    Warning: package 'ggplot2' was built under R version [version lost in transcription]

    # dot plot of mpg by transmission
    ggplot(mtcars, aes(x = factor(am), y = mpg, color = factor(am), shape = factor(am))) +
      geom_point(size = 3)
    # the rest of this call (titles and axis labels) was cut off in the source

Figure 2: Dot plot of lm(mpg ~ am), titled "mileage distribution by 'transmission and cylinder'" (x-axis: Auto and Manual Transmission; y-axis: Mileage; legend: factor(am) with levels Auto and Manual)

The plot shows higher mileage for the Manual group.

4. Multivariable analysis with nested model testing

Omitting variables from a regression can bias the coefficients of interest, unless the included regressors are uncorrelated with the omitted ones. To avoid such bias, I first regress mpg on all the other variables of the mtcars dataset, regardless of their correlations.

    # mpg against all other variables in a new linear model
    fit02 <- lm(mpg ~ ., data = mtcars)
    summary(fit02)

    Call: lm(formula = mpg ~ ., data = mtcars)

[Output: residual quantiles, then a coefficient table with rows (Intercept), cyl, disp, hp, drat, wt, qsec, vs, amManual, gear and carb. The coefficient values were lost in transcription. Surviving values: residual standard error 2.65 on 21 degrees of freedom; multiple R-squared 0.869; F-statistic on 10 and 21 DF with p-value 3.793e-07.]

Nested models: Inspecting the coefficient estimates of fit02, the variables cyl, hp, wt and carb have negative coefficients: holding the other variables fixed, an increase in any of them is associated with lower mileage. The remaining variables have positive but small coefficients.

    # diagnostic plots of the fitted model (fit02)
    par(mfrow = c(2, 2))
    residual02 <- resid(fit02)
    plot(fit02)

Figure 3: Detailed residual plots for model fit02 (panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage with Cook's distance; labelled points include Chrysler Imperial, Fiat 128, Toyota Corolla, Ford Pantera L and a Merc model whose name is truncated in the source)

Analysis: The Residuals vs Fitted plot does not show an entirely even spread of residuals. The residual quantiles printed in the model summary do not add up to zero, but that is expected: only the full set of residuals sums to zero. The Normal Q-Q plot shows residuals that are close to normal. The Scale-Location plot shows a roughly linear trend in the spread of the residuals. Finally, the Residuals vs Leverage plot shows no outlying data point with substantial leverage.

Adjustment and interaction between multiple variables with am: Next I experiment with some nested linear models that add these variables to the factor regressor am. This is called adjustment (and, when product terms are included, interaction): adding regressors to the linear model to investigate the role of a third or fourth variable in the relationship between am and the outcome mpg. An added variable can distort, or confound, the relationship between outcome and regressor, and so offers a new perspective on which variables influence mileage.

    # adjustment for 'cyl'
    fit03 <- lm(mpg ~ am + cyl, data = mtcars)
    summary(fit03)$coef

[Output: coefficient table with rows (Intercept), amManual and cyl. Only the p-value exponents survive: e-14, e-02 and e-07 respectively.]

In this model the cyl coefficient has a t-value far from 0 (p-value of order 1e-07), so cyl is strongly related to mpg. After adjusting for cyl, the amManual p-value rises to the order of 1e-02, so the evidence for a transmission effect is weaker than in fit01.

    # adjustment for 'hp', including the cyl-by-hp interaction
    fit04 <- lm(mpg ~ am + cyl + hp + cyl * hp, data = mtcars)
    summary(fit04)$coef

[Output: coefficient table with rows (Intercept), amManual, cyl, hp and cyl:hp. Only the p-value exponents survive: e-08, e-03, e-03, e-03 and e-02 respectively.]

In fit04 the t-value for cyl is -3.08 (the t-value for hp was lost in transcription). The p-values of amManual, cyl and hp are of order 1e-03, so each of these terms shows a relationship with mileage after adjusting for the others; the evidence for the cyl:hp interaction (p-value of order 1e-02) is weaker.

    # adjustment for 'wt'
    fit05 <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
    summary(fit05)$coef

[Output: coefficient table with rows (Intercept), amManual, cyl, hp and wt. Only the p-value exponents survive: e-12, e-01, e-01, e-02 and e-03 respectively.]

This model adds the variable wt.
The slopes and t-values of the adjusting variables in fit05 stay negative, and with wt in the model the p-values of amManual and cyl are of order 1e-01. So the transmission effect on mileage is no longer significant at the 0.05 level under this adjustment.

    # model fit06 adds the predictor 'qsec' and the wt-by-qsec interaction
    fit06 <- lm(mpg ~ am + cyl + hp + wt + qsec + wt * qsec, data = mtcars)
    summary(fit06)$coef

[Output: coefficient table with rows (Intercept), amManual, cyl, hp, wt, qsec and wt:qsec. The numeric values were lost in transcription.]

None of these interactions produces t-values that are notably larger relative to their standard errors. The main change is that the p-values now move well away from significance, so the null hypothesis of no effect can no longer be rejected. This is the kind of shift that confounding produces; Simpson's paradox is the extreme case, in which the direction of an effect reverses.

Inference: Adding more variables to the linear model does not reveal any additional mileage gain or loss attributable to transmission beyond what the coefficient slope of fit01 shows.

5. ANOVA - test for multiple-model statistical significance

The ANOVA test is useful for comparing two or more models for statistical significance. For a sequence of nested models, each row of the table is an F-test of whether the terms added since the previous model reduce the residual sum of squares by more than chance would explain.

    anova(fit01, fit03, fit04, fit05, fit06)

    Analysis of Variance Table
    Model 1: mpg ~ am
    Model 2: mpg ~ am + cyl
    Model 3: mpg ~ am + cyl + hp + cyl * hp
    Model 4: mpg ~ am + cyl + hp + wt
    Model 5: mpg ~ am + cyl + hp + wt + qsec + wt * qsec

[Output: a table with columns Res.Df, RSS, Df, Sum of Sq, F and Pr(>F), one row per model. The numeric values were lost in transcription.]

Analysis: Comparing all five models with the anova function, Model 2 shows the second-highest level of significance (0.001). The later adjusted models show no obvious further mileage gain. The only multivariable model that stands out is therefore fit03 (am + cyl). Let's look at a box plot of this adjustment, Model 2 (fit03).

    boxplot(mpg ~ factor(am) + cyl, data = mtcars,
            col = c("salmon", "dodgerblue2"),
            xlab = "transmission-cylinder")
    # the remaining arguments of this call were cut off in the source

Figure 4: Box plot of (mpg ~ am + cyl), titled "mileage variation with (am+cyl)" (x-axis: Transmission Cylinder, with groups Auto.4, Manual.4, Auto.6, Manual.6, Auto.8 and Manual.8; y-axis: Mileage)

Manual transmission still shows higher mileage in the 4-cylinder group.

Conclusion: Throughout these statistical checks, manual transmission cars show a significant mileage gain in comparison to automatic cars. The t-test, confidence interval and residual analysis favour manual-transmission cars for mileage. In my reading, the multivariable analysis with variable adjustment and interaction is consistent with the same answer: a manual transmission car is better for MPG than an automatic one.
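As a supplementary check on the model comparison in Section 5, the simplest nested pair can be compared on its own. This sketch is not part of the original report. It assumes the same factor coding as above, and the names fit_am, fit_am_cyl and cmp are chosen here. It compares only two models because the F-test is designed for nested pairs, and in the five-model call of Section 5 the model fit05 drops the cyl:hp term that fit04 contains, so that consecutive pair is not strictly nested.

```r
# Sketch: F-test for a single nested pair, mpg ~ am versus mpg ~ am + cyl.
data(mtcars)
d <- mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))

fit_am     <- lm(mpg ~ am, data = d)         # smaller model
fit_am_cyl <- lm(mpg ~ am + cyl, data = d)   # adds one term: cyl

cmp <- anova(fit_am, fit_am_cyl)
print(cmp)

# When exactly one coefficient is added, the F-test p-value equals the
# p-value of that coefficient's t-test in the larger model.
p_f <- cmp[["Pr(>F)"]][2]
p_t <- coef(summary(fit_am_cyl))["cyl", "Pr(>|t|)"]
print(c(anova_p = p_f, t_test_p = p_t))
```

A small p-value in the second row indicates that cyl explains variation in mpg that am alone does not, which is the sense in which Model 2 is significant in the ANOVA table.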

Motor Trend Car Road Analysis

Motor Trend Car Road Analysis Motor Trend Car Road Analysis Zakia Sultana February 28, 2016 Executive Summary You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are

More information

Lab #5 - Predictive Regression I Econ 224 September 11th, 2018

Lab #5 - Predictive Regression I Econ 224 September 11th, 2018 Lab #5 - Predictive Regression I Econ 224 September 11th, 2018 Introduction This lab provides a crash course on least squares regression in R. In the interest of time we ll work with a very simple, but

More information

R package ggplot2 STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley

R package ggplot2 STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley R package ggplot2 STAT 133 Gaston Sanchez Department of Statistics, UC Berkeley gastonsanchez.com github.com/gastonstat/stat133 Course web: gastonsanchez.com/stat133 ggplot2 2 Scatterplot with "ggplot2"

More information

Regression and the 2-Sample t

Regression and the 2-Sample t Regression and the 2-Sample t James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Regression and the 2-Sample t 1 / 44 Regression

More information

Chapter 8 Conclusion

Chapter 8 Conclusion 1 Chapter 8 Conclusion Three questions about test scores (score) and student-teacher ratio (str): a) After controlling for differences in economic characteristics of different districts, does the effect

More information

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model Lab 3 A Quick Introduction to Multiple Linear Regression Psychology 310 Instructions.Work through the lab, saving the output as you go. You will be submitting your assignment as an R Markdown document.

More information

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression Introduction to Correlation and Regression The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested

More information

MODELS WITHOUT AN INTERCEPT

MODELS WITHOUT AN INTERCEPT Consider the balanced two factor design MODELS WITHOUT AN INTERCEPT Factor A 3 levels, indexed j 0, 1, 2; Factor B 5 levels, indexed l 0, 1, 2, 3, 4; n jl 4 replicate observations for each factor level

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

Statistics 203 Introduction to Regression Models and ANOVA Practice Exam

Statistics 203 Introduction to Regression Models and ANOVA Practice Exam Statistics 203 Introduction to Regression Models and ANOVA Practice Exam Prof. J. Taylor You may use your 4 single-sided pages of notes This exam is 7 pages long. There are 4 questions, first 3 worth 10

More information

Statistics - Lecture Three. Linear Models. Charlotte Wickham 1.

Statistics - Lecture Three. Linear Models. Charlotte Wickham   1. Statistics - Lecture Three Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Linear Models 1. The Theory 2. Practical Use 3. How to do it in R 4. An example 5. Extensions

More information

Chapter 3 - Linear Regression

Chapter 3 - Linear Regression Chapter 3 - Linear Regression Lab Solution 1 Problem 9 First we will read the Auto" data. Note that most datasets referred to in the text are in the R package the authors developed. So we just need to

More information

Analytics 512: Homework # 2 Tim Ahn February 9, 2016

Analytics 512: Homework # 2 Tim Ahn February 9, 2016 Analytics 512: Homework # 2 Tim Ahn February 9, 2016 Chapter 3 Problem 1 (# 3) Suppose we have a data set with five predictors, X 1 = GP A, X 2 = IQ, X 3 = Gender (1 for Female and 0 for Male), X 4 = Interaction

More information

Introduction and Single Predictor Regression. Correlation

Introduction and Single Predictor Regression. Correlation Introduction and Single Predictor Regression Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Correlation A correlation

More information

Generating OLS Results Manually via R

Generating OLS Results Manually via R Generating OLS Results Manually via R Sujan Bandyopadhyay Statistical softwares and packages have made it extremely easy for people to run regression analyses. Packages like lm in R or the reg command

More information

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

Multiple Regression: Example

Multiple Regression: Example Multiple Regression: Example Cobb-Douglas Production Function The Cobb-Douglas production function for observed economic data i = 1,..., n may be expressed as where O i is output l i is labour input c

More information

ANOVA and Multivariate Analysis

ANOVA and Multivariate Analysis ANOVA and Multivariate Analysis Introduction Many PhotosynQ users are interested in comparing the performance of different treatments, crop varieties, etc. A common approach to separate different groups

More information

1 Multiple Regression

1 Multiple Regression 1 Multiple Regression In this section, we extend the linear model to the case of several quantitative explanatory variables. There are many issues involved in this problem and this section serves only

More information

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA s:5 Applied Linear Regression Chapter 8: ANOVA Two-way ANOVA Used to compare populations means when the populations are classified by two factors (or categorical variables) For example sex and occupation

More information

Simple linear regression: estimation, diagnostics, prediction

Simple linear regression: estimation, diagnostics, prediction UPPSALA UNIVERSITY Department of Mathematics Mathematical statistics Regression and Analysis of Variance Autumn 2015 COMPUTER SESSION 1: Regression In the first computer exercise we will study the following

More information

Inference with Heteroskedasticity

Inference with Heteroskedasticity Inference with Heteroskedasticity Note on required packages: The following code requires the packages sandwich and lmtest to estimate regression error variance that may change with the explanatory variables.

More information

Linear Modelling: Simple Regression

Linear Modelling: Simple Regression Linear Modelling: Simple Regression 10 th of Ma 2018 R. Nicholls / D.-L. Couturier / M. Fernandes Introduction: ANOVA Used for testing hpotheses regarding differences between groups Considers the variation

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

Logistic Regression in R. by Kerry Machemer 12/04/2015

Logistic Regression in R. by Kerry Machemer 12/04/2015 Logistic Regression in R by Kerry Machemer 12/04/2015 Linear Regression {y i, x i1,, x ip } Linear Regression y i = dependent variable & x i = independent variable(s) y i = α + β 1 x i1 + + β p x ip +

More information

SCHOOL OF MATHEMATICS AND STATISTICS

SCHOOL OF MATHEMATICS AND STATISTICS RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: Statistics Tables by H.R. Neave MAS5052 SCHOOL OF MATHEMATICS AND STATISTICS Basic Statistics Spring Semester

More information

STAT 215 Confidence and Prediction Intervals in Regression

STAT 215 Confidence and Prediction Intervals in Regression STAT 215 Confidence and Prediction Intervals in Regression Colin Reimer Dawson Oberlin College 24 October 2016 Outline Regression Slope Inference Partitioning Variability Prediction Intervals Reminder:

More information

Introduction to Statistics and R

Introduction to Statistics and R Introduction to Statistics and R Mayo-Illinois Computational Genomics Workshop (2018) Ruoqing Zhu, Ph.D. Department of Statistics, UIUC rqzhu@illinois.edu June 18, 2018 Abstract This document is a supplimentary

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Recall that a measure of fit is the sum of squared residuals: where. The F-test statistic may be written as:

Recall that a measure of fit is the sum of squared residuals: where. The F-test statistic may be written as: 1 Joint hypotheses The null and alternative hypotheses can usually be interpreted as a restricted model ( ) and an model ( ). In our example: Note that if the model fits significantly better than the restricted

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Operators and the Formula Argument in lm

Operators and the Formula Argument in lm Operators and the Formula Argument in lm Recall that the first argument of lm (the formula argument) took the form y. or y x (recall that the term on the left of the told lm what the response variable

More information

Tests of Linear Restrictions

Tests of Linear Restrictions Tests of Linear Restrictions 1. Linear Restricted in Regression Models In this tutorial, we consider tests on general linear restrictions on regression coefficients. In other tutorials, we examine some

More information

Stat 412/512 TWO WAY ANOVA. Charlotte Wickham. stat512.cwick.co.nz. Feb

Stat 412/512 TWO WAY ANOVA. Charlotte Wickham. stat512.cwick.co.nz. Feb Stat 42/52 TWO WAY ANOVA Feb 6 25 Charlotte Wickham stat52.cwick.co.nz Roadmap DONE: Understand what a multiple regression model is. Know how to do inference on single and multiple parameters. Some extra

More information

Statistics. Introduction to R for Public Health Researchers. Processing math: 100%

Statistics. Introduction to R for Public Health Researchers. Processing math: 100% Statistics Introduction to R for Public Health Researchers Statistics Now we are going to cover how to perform a variety of basic statistical tests in R. Correlation T-tests/Rank-sum tests Linear Regression

More information

Variance Decomposition and Goodness of Fit

Variance Decomposition and Goodness of Fit Variance Decomposition and Goodness of Fit 1. Example: Monthly Earnings and Years of Education In this tutorial, we will focus on an example that explores the relationship between total monthly earnings

More information

Biostatistics 380 Multiple Regression 1. Multiple Regression

Biostatistics 380 Multiple Regression 1. Multiple Regression Biostatistics 0 Multiple Regression ORIGIN 0 Multiple Regression Multiple Regression is an extension of the technique of linear regression to describe the relationship between a single dependent (response)

More information

STAT 350: Summer Semester Midterm 1: Solutions

STAT 350: Summer Semester Midterm 1: Solutions Name: Student Number: STAT 350: Summer Semester 2008 Midterm 1: Solutions 9 June 2008 Instructor: Richard Lockhart Instructions: This is an open book test. You may use notes, text, other books and a calculator.

More information

Module 4: Regression Methods: Concepts and Applications
