Homework1 Yang Sun 2017/9/11
1. Describe data

According to the data description, the response variable is AmountSpent; the predictors are Age, Gender, OwnHome, Married, Location, Salary, Children, History and Catalogs.

2. Statistical and graphical data summary

2.0 Initial Setup

# Set workspace
setwd("d:/dropbox/pitt/fall 18/IS 2160 Data Mining/Homeworks/HW1")
# Import csv file here
DirectMarketing <- read.csv("directmarketing.csv", header=TRUE, stringsAsFactors=FALSE)
# Data Summary
summary(DirectMarketing)

     Age               Gender             OwnHome
 Length:1000        Length:1000        Length:1000
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character

   Married            Location            Salary        Children
 Length:1000        Length:1000        Min.   :       Min.   :0.000
 Class :character   Class :character   1st Qu.:       1st Qu.:0.000
 Mode  :character   Mode  :character   Median :       Median :1.000
                                       Mean   :       Mean   :
                                       3rd Qu.:       3rd Qu.:2.000
                                       Max.   :       Max.   :3.000

   History             Catalogs        AmountSpent
 Length:1000        Min.   : 6.00    Min.   :  38.0
 Class :character   1st Qu.:         1st Qu.:
 Mode  :character   Median :12.00    Median :
                    Mean   :14.68    Mean   :
                    3rd Qu.:         3rd Qu.:
                    Max.   :24.00    Max.   :

2.a Missing Values

# Check for missing values
table(is.na(DirectMarketing))
FALSE  TRUE

# Determine whether all missing values come from History
table(is.na(DirectMarketing$History))

FALSE  TRUE

# Make the N/A field into 'None' as one category
# Based on the data description, NA means that this customer has not yet purchased
# Hence we cannot simply delete NAs; instead convert them into "None"
DirectMarketing[is.na(DirectMarketing)] <- "None"
# Check again for missing values
table(is.na(DirectMarketing))

FALSE

No more missing values.

2.b Generate Summary

summary(DirectMarketing)

     Age               Gender             OwnHome
 Length:1000        Length:1000        Length:1000
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character

   Married            Location            Salary        Children
 Length:1000        Length:1000        Min.   :       Min.   :0.000
 Class :character   Class :character   1st Qu.:       1st Qu.:0.000
 Mode  :character   Mode  :character   Median :       Median :1.000
                                       Mean   :       Mean   :
                                       3rd Qu.:       3rd Qu.:2.000
                                       Max.   :       Max.   :3.000

   History             Catalogs        AmountSpent
 Length:1000        Min.   : 6.00    Min.   :  38.0
 Class :character   1st Qu.:         1st Qu.:
 Mode  :character   Median :12.00    Median :
                    Mean   :14.68    Mean   :
                    3rd Qu.:         3rd Qu.:
                    Max.   :24.00    Max.   :

# Standard Deviation for numerical values
sd(DirectMarketing$Salary)
[1]
# Standard Deviation for Salary
sd(DirectMarketing$Children)
[1]
# Standard Deviation for Children
sd(DirectMarketing$Catalogs)
[1]
# Standard Deviation for Catalogs
sd(DirectMarketing$AmountSpent)
[1]
# Standard Deviation for AmountSpent

# Convert all fields of char into factor
cols <- c("Age", "Gender", "OwnHome", "Married", "Location", "History")
DirectMarketing[cols] <- lapply(DirectMarketing[cols], factor)
# Do Summary again
summary(DirectMarketing)

     Age         Gender       OwnHome       Married       Location
 Middle:508   Female:506   Own :516    Married:502   Close:710
 Old   :205   Male  :494   Rent:484    Single :498   Far  :290
 Young :287

    Salary       Children         History        Catalogs
 Min.   :     Min.   :0.000   High  :255   Min.   :
 1st Qu.:     1st Qu.:0.000   Low   :230   1st Qu.: 6.00
 Median :     Median :1.000   Medium:212   Median :12.00
 Mean   :     Mean   :0.934   None  :303   Mean   :
 3rd Qu.:     3rd Qu.:                     3rd Qu.:18.00
 Max.   :     Max.   :3.000                Max.   :24.00

  AmountSpent
 Min.   :
 1st Qu.:
 Median :
 Mean   :
 3rd Qu.:
 Max.   :

2.c Kernel Density Plot

AmountSpent Density Distribution

library(ggplot2)  # load the plotting package
ggplot(DirectMarketing, aes(x=AmountSpent)) +
  geom_density() +
  labs(title = "AmountSpent Density Distribution")

[Figure: AmountSpent Density Distribution; x-axis AmountSpent, y-axis density]

The density distribution for AmountSpent is right-skewed.

Salary Density Distribution

# For Salary
ggplot(DirectMarketing, aes(x=Salary)) +
  geom_density() +
  labs(title = "Salary Density Distribution")
[Figure: Salary Density Distribution; x-axis Salary, y-axis density]

The density distribution for Salary is left-skewed, and it also has two peaks.

2.d Correlation and scatterplot for numerical predictors

Correlation

numpredictor <- as.data.frame(DirectMarketing[,c("Salary", "Children", "Catalogs")])
responsevariable <- as.data.frame(DirectMarketing$AmountSpent)
colnames(responsevariable)[1] <- "Amount Spent"
cor(responsevariable, numpredictor)

             Salary Children Catalogs
Amount Spent

Scatterplot

# Salary vs. AmountSpent
# use ggplot
theme_set(theme_bw())  # set default theme with a white background
ggplot(data=DirectMarketing, aes(x=Salary, y=AmountSpent)) +
  geom_point() +
  geom_smooth(method=lm,  # add linear regression line
              se=FALSE)   # (by default includes 95% confidence region)
[Figure: scatterplot of AmountSpent against Salary with linear regression line]

# Children vs. AmountSpent
ggplot(data=DirectMarketing, aes(x=Children, y=AmountSpent)) +
  geom_point() +
  geom_smooth(method=lm,  # add linear regression line
              se=FALSE)   # (by default includes 95% confidence region)

[Figure: scatterplot of AmountSpent against Children with linear regression line]

# Catalogs vs. AmountSpent
ggplot(data=DirectMarketing, aes(x=Catalogs, y=AmountSpent)) +
  geom_point() +
  geom_smooth(method=lm,  # add linear regression line
              se=FALSE)   # (by default includes 95% confidence region)
[Figure: scatterplot of AmountSpent against Catalogs with linear regression line]

2.e Conditional density plot for categorical predictors

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Age)) + geom_density()

[Figure: density of AmountSpent by Age (Middle, Old, Young)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Gender)) + geom_density()

[Figure: density of AmountSpent by Gender (Female, Male)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = OwnHome)) + geom_density()

[Figure: density of AmountSpent by OwnHome (Own, Rent)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Married)) + geom_density()

[Figure: density of AmountSpent by Married (Married, Single)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Location)) + geom_density()

[Figure: density of AmountSpent by Location (Close, Far)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = History)) + geom_density()
[Figure: density of AmountSpent by History (High, Low, Medium, None)]

2.f Compare significantly different means

#Age
a1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Age == "Young"])
a2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Age == "Middle"])
a3 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Age == "Old"])
AgeMean <- data.frame("MeanOfAmountSpent" = c(a1, a2, a3))
rownames(AgeMean) <- c("Age-Young", "Age-Middle", "Age-Old")
#Gender
g1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Gender == "Male"])
g2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Gender == "Female"])
GenderMean <- data.frame("MeanOfAmountSpent" = c(g1, g2))
rownames(GenderMean) <- c("Gender-Male", "Gender-Female")
#OwnHome
o1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$OwnHome == "Own"])
o2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$OwnHome == "Rent"])
OwnHomeMean <- data.frame("MeanOfAmountSpent" = c(o1, o2))
rownames(OwnHomeMean) <- c("OwnHome-Own", "OwnHome-Rent")
#Married
m1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Married == "Married"])
m2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Married == "Single"])
MarriedMean <- data.frame("MeanOfAmountSpent" = c(m1, m2))
rownames(MarriedMean) <- c("Married-Married", "Married-Single")
#Location
l1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Location == "Far"])
l2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Location == "Close"])
LocationMean <- data.frame("MeanOfAmountSpent" = c(l1, l2))
rownames(LocationMean) <- c("Location-Far", "Location-Close")
#History
h1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "None"])
h2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "Low"])
h3 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "Medium"])
h4 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "High"])
HistoryMean <- data.frame("MeanOfAmountSpent" = c(h1, h2, h3, h4))
rownames(HistoryMean) <- c("History-None", "History-Low", "History_Medium", "History-High")
#Overall
categorytable <- rbind(AgeMean, GenderMean, OwnHomeMean, MarriedMean, LocationMean, HistoryMean)
categorytable

                MeanOfAmountSpent
Age-Young
Age-Middle
Age-Old
Gender-Male
Gender-Female
OwnHome-Own
OwnHome-Rent
Married-Married
Married-Single
Location-Far
Location-Close
History-None
History-Low
History_Medium
History-High

Both the conditional density plots and the table of means show that, within the Age category, the Young group has a significantly different mean from the other two age groups. The same holds for OwnHome-Own vs OwnHome-Rent and for Married-Married vs Married-Single; the means for the History categories are all different.

3. Regression modeling and analysis

3.a Standard linear regression

# Standard linear regression model with all predictors
slr <- lm(AmountSpent~., data=DirectMarketing)
summary(slr)

Call:
lm(formula = AmountSpent ~ ., data = DirectMarketing)

Residuals:
 Min 1Q Median 3Q Max
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)                                          *
AgeOld
AgeYoung
GenderMale
OwnHomeRent
MarriedSingle
LocationFar                                 < 2e-16 ***
Salary                                      < 2e-16 ***
Children                                    < 2e-16 ***
HistoryLow                                     e-08 ***
HistoryMedium                                  e-14 ***
HistoryNone
Catalogs                                    < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 987 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 12 and 987 DF, p-value: < 2.2e-16

# RMSE
y = DirectMarketing$AmountSpent
mean.mse = mean((rep(mean(y),length(y)) - y)^2)
model.mse = mean(residuals(slr)^2)
rmse = sqrt(model.mse)
rmse
[1]

Summary for the standard linear regression model: R-squared is , adjusted R-squared is , RMSE is

3.b Different combination of predictors in linear and non-linear models

Out-of-Sample RMSE for standard linear regression

n = length(DirectMarketing$AmountSpent)
error = numeric(n)  # one leave-one-out prediction error per observation
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k]  # pick elements that are different from k
  slr = lm(AmountSpent ~ ., data=DirectMarketing[train,])
  pred = predict(slr, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
OSrmse = sqrt(mean(error^2))
OSrmse  # root mean square error (out-of-sample)
[1]
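The leave-one-out loop above refits the model once per observation. For an ordinary least squares fit there is a well-known identity that gives the same leave-one-out errors from a single fit: each residual divided by one minus its leverage. The sketch below checks that identity on synthetic data; the data frame `toy` and its columns are made up for illustration and are not the DirectMarketing data.

```r
# Synthetic stand-in data (an assumption, not the homework data set)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = runif(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n)

# Explicit leave-one-out loop, mirroring the structure used in the text
loo_error <- numeric(n)
for (k in seq_len(n)) {
  fit_k <- lm(y ~ ., data = toy[-k, ])
  loo_error[k] <- toy$y[k] - predict(fit_k, newdata = toy[k, ])
}
rmse_loop <- sqrt(mean(loo_error^2))

# Single-fit shortcut: leave-one-out error = residual / (1 - leverage)
fit <- lm(y ~ ., data = toy)
rmse_hat <- sqrt(mean((residuals(fit) / (1 - hatvalues(fit)))^2))

# The two values should agree up to floating-point error
all.equal(rmse_loop, rmse_hat)
```

The shortcut holds for least squares fits with `lm()`; for other model classes the explicit loop remains the safer choice.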
Backward Stepwise Selection

library(MASS)
slr <- lm(AmountSpent~., data=DirectMarketing)
stepAIC(slr, direction="backward")

Start: AIC=
AmountSpent ~ Age + Gender + OwnHome + Married + Location + Salary + Children + History + Catalogs

           Df Sum of Sq RSS AIC
- Age
- OwnHome
- Married
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Step: AIC=
AmountSpent ~ Gender + OwnHome + Married + Location + Salary + Children + History + Catalogs

           Df Sum of Sq RSS AIC
- Married
- OwnHome
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Step: AIC=
AmountSpent ~ Gender + OwnHome + Location + Salary + Children + History + Catalogs

           Df Sum of Sq RSS AIC
- OwnHome
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Step: AIC=
AmountSpent ~ Gender + Location + Salary + Children + History + Catalogs

           Df Sum of Sq RSS AIC
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Call:
lm(formula = AmountSpent ~ Gender + Location + Salary + Children + History + Catalogs, data = DirectMarketing)

Coefficients:
(Intercept) GenderMale LocationFar Salary Children HistoryLow HistoryMedium HistoryNone Catalogs

Use the new combination of Gender + Location + Salary + Children + History + Catalogs

newslr <- lm(AmountSpent~Gender + Location + Salary + Children + History + Catalogs, data=DirectMarketing)
summary(newslr)

Call:
lm(formula = AmountSpent ~ Gender + Location + Salary + Children + History + Catalogs, data = DirectMarketing)

Residuals:
 Min 1Q Median 3Q Max

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)
(Intercept)    e         e                            **
GenderMale     e         e
LocationFar    4.360e    e                  < 2e-16 ***
Salary         1.892e    e                  < 2e-16 ***
Children       e         e                  < 2e-16 ***
HistoryLow     e         e                     e-08 ***
HistoryMedium  e         e                     e-14 ***
HistoryNone    e         e
Catalogs       4.175e    e                  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 991 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 8 and 991 DF, p-value: < 2.2e-16
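As a rough illustration of what backward selection does, the sketch below runs it on synthetic data where one predictor carries signal and the others carry little or none. It uses base R's `step()` rather than `MASS::stepAIC()` so that no extra package is needed; that swap, the data frame `toy` and the column names `signal`, `weak` and `noise` are all assumptions made for the example.

```r
# Synthetic stand-in data (an assumption, not the homework data set)
set.seed(42)
n <- 300
toy <- data.frame(signal = rnorm(n), weak = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$signal + 0.1 * toy$weak + rnorm(n)

# Start from the full model and drop terms while AIC keeps decreasing
full <- lm(y ~ ., data = toy)
selected <- step(full, direction = "backward", trace = 0)

# Inspect which terms survived and compare AIC before and after
formula(selected)
c(full = AIC(full), selected = AIC(selected))
```

Backward elimination only removes a term when doing so lowers AIC, so the selected model's AIC is never higher than that of the starting model. A lower AIC is not a guarantee of better out-of-sample error, which is why the text checks the reduced model with a separate leave-one-out RMSE.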
Out-of-Sample RMSE for new linear regression

n = length(DirectMarketing$AmountSpent)
error = numeric(n)  # one leave-one-out prediction error per observation
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k]  # pick elements that are different from k
  slr = lm(AmountSpent ~ Gender + Location + Salary + Children + History + Catalogs, data=DirectMarketing[train,])
  pred = predict(slr, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
OSrmse = sqrt(mean(error^2))
OSrmse  # root mean square error (out-of-sample)
[1]

Nonlinear regression

2-degree on Salary, Children and Catalogs

nonlr <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) + History + poly(Catalogs,degree=2), data=DirectMarketing)
summary(nonlr)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 2) + poly(Children, degree = 2) + History + poly(Catalogs, degree = 2), data = DirectMarketing)

Residuals:
 Min 1Q Median 3Q Max

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                                               < 2e-16 ***
GenderMale
LocationFar                                               < 2e-16 ***
poly(Salary, degree = 2)1                                 < 2e-16 ***
poly(Salary, degree = 2)2
poly(Children, degree = 2)1                               < 2e-16 ***
poly(Children, degree = 2)2
HistoryLow                                                   e-07 ***
HistoryMedium                                                e-14 ***
HistoryNone
poly(Catalogs, degree = 2)1                               < 2e-16 ***
poly(Catalogs, degree = 2)2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 988 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 11 and 988 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)  # one leave-one-out prediction error per observation
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k]  # pick elements that are different from k
  poly1 = lm(AmountSpent~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) + History + poly(Catalogs,degree=2), data=DirectMarketing[train,])
  pred = predict(poly1, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1 = sqrt(mean(error^2))
nlrmse1  # root mean square error (out-of-sample)
[1]

2-degree on Salary and Children

nonlr1 <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) + History, data=DirectMarketing)
summary(nonlr1)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 2) + poly(Children, degree = 2) + History, data = DirectMarketing)

Residuals:
 Min 1Q Median 3Q Max

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                                               < 2e-16 ***
GenderMale
LocationFar                                               < 2e-16 ***
poly(Salary, degree = 2)1                                 < 2e-16 ***
poly(Salary, degree = 2)2
poly(Children, degree = 2)1                               < 2e-16 ***
poly(Children, degree = 2)2
HistoryLow                                                   e-13 ***
HistoryMedium                                             < 2e-16 ***
HistoryNone                                                        **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 990 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 9 and 990 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)  # one leave-one-out prediction error per observation
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k]  # pick elements that are different from k
  poly2 = lm(AmountSpent ~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) + History, data=DirectMarketing[train,])
  pred = predict(poly2, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1 = sqrt(mean(error^2))
nlrmse1  # root mean square error (out-of-sample)
[1]

3-degree on Salary, Children and Catalogs

nonlr <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) + History + poly(Catalogs,degree=3), data=DirectMarketing)
summary(nonlr)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 3) + poly(Children, degree = 3) + History + poly(Catalogs, degree = 3), data = DirectMarketing)

Residuals:
 Min 1Q Median 3Q Max

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                                               < 2e-16 ***
GenderMale
LocationFar                                               < 2e-16 ***
poly(Salary, degree = 3)1                                 < 2e-16 ***
poly(Salary, degree = 3)2
poly(Salary, degree = 3)3
poly(Children, degree = 3)1                               < 2e-16 ***
poly(Children, degree = 3)2
poly(Children, degree = 3)3
HistoryLow                                                   e-07 ***
HistoryMedium                                                e-13 ***
HistoryNone
poly(Catalogs, degree = 3)1                               < 2e-16 ***
poly(Catalogs, degree = 3)2
poly(Catalogs, degree = 3)3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 985 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 14 and 985 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)  # one leave-one-out prediction error per observation
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k]  # pick elements that are different from k
  poly3 = lm(AmountSpent~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) + History + poly(Catalogs,degree=3), data=DirectMarketing[train,])
  pred = predict(poly3, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1 = sqrt(mean(error^2))
nlrmse1  # root mean square error (out-of-sample)
[1]

3-degree on Salary and Children

nonlr1 <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) + History, data=DirectMarketing)
summary(nonlr1)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 3) + poly(Children, degree = 3) + History, data = DirectMarketing)

Residuals:
 Min 1Q Median 3Q Max

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                                               < 2e-16 ***
GenderMale
LocationFar                                               < 2e-16 ***
poly(Salary, degree = 3)1                                 < 2e-16 ***
poly(Salary, degree = 3)2
poly(Salary, degree = 3)3
poly(Children, degree = 3)1                               < 2e-16 ***
poly(Children, degree = 3)2
poly(Children, degree = 3)3
HistoryLow                                                   e-13 ***
HistoryMedium                                             < 2e-16 ***
HistoryNone                                                        **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 988 degrees of freedom
Multiple R-squared: 0.673, Adjusted R-squared:
F-statistic:  on 11 and 988 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)  # one leave-one-out prediction error per observation
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k]  # pick elements that are different from k
  poly4 = lm(AmountSpent ~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) + History, data=DirectMarketing[train,])
  pred = predict(poly4, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1 = sqrt(mean(error^2))
nlrmse1  # root mean square error (out-of-sample)
[1]

Since higher-degree polynomial terms did not improve the performance of the model compared with the standard linear regression, and too high a polynomial degree can cause overfitting, I decided to stop here.

3.c Best model and the most important predictor

The original standard linear regression model performs best, with an RMSE of 482. To determine the important predictors, we look at their p-values: the smaller the p-value, the more important the predictor. In this case, the important predictors are Location, Salary, Children, History and Catalogs.
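The model comparison above repeats the same leave-one-out loop for every candidate formula. One way to make such a comparison less error-prone is to wrap the loop in a helper that takes a formula and a data frame. The sketch below is a minimal version under stated assumptions: the function name `loocv_rmse`, the data frame `toy` and its columns are hypothetical, and the data are synthetic rather than the DirectMarketing data.

```r
# Hypothetical helper: leave-one-out RMSE for any lm() formula
loocv_rmse <- function(formula, data) {
  response <- all.vars(formula)[1]  # name of the response variable
  errors <- vapply(seq_len(nrow(data)), function(k) {
    fit <- lm(formula, data = data[-k, , drop = FALSE])
    pred <- predict(fit, newdata = data[k, , drop = FALSE])
    unname(data[[response]][k] - pred)
  }, numeric(1))
  sqrt(mean(errors^2))
}

# Synthetic stand-in data with a truly linear relationship (an assumption)
set.seed(7)
toy <- data.frame(x = runif(60, 0, 10))
toy$y <- 5 + 1.5 * toy$x + rnorm(60)

# Compare candidate models on the same out-of-sample criterion
candidates <- list(
  linear    = y ~ x,
  quadratic = y ~ poly(x, degree = 2),
  cubic     = y ~ poly(x, degree = 3)
)
results <- sapply(candidates, loocv_rmse, data = toy)
results
```

Collecting the results in one named vector keeps each RMSE attached to its model, which avoids the reuse of a single variable name for several different models.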
More informationA course in statistical modelling. session 09: Modelling count variables
A Course in Statistical Modelling SEED PGR methodology training December 08, 2015: 12 2pm session 09: Modelling count variables Graeme.Hutcheson@manchester.ac.uk blackboard: RSCH80000 SEED PGR Research
More informationBooklet of Code and Output for STAC32 Final Exam
Booklet of Code and Output for STAC32 Final Exam December 8, 2014 List of Figures in this document by page: List of Figures 1 Popcorn data............................. 2 2 MDs by city, with normal quantile
More informationStat 401B Exam 3 Fall 2016 (Corrected Version)
Stat 401B Exam 3 Fall 2016 (Corrected Version) I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied
More informationLinear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear.
Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. 1/48 Linear regression Linear regression is a simple approach
More informationRegression Methods for Survey Data
Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear
More informationR in Linguistic Analysis. Wassink 2012 University of Washington Week 6
R in Linguistic Analysis Wassink 2012 University of Washington Week 6 Overview R for phoneticians and lab phonologists Johnson 3 Reading Qs Equivalence of means (t-tests) Multiple Regression Principal
More informationSLR output RLS. Refer to slr (code) on the Lecture Page of the class website.
SLR output RLS Refer to slr (code) on the Lecture Page of the class website. Old Faithful at Yellowstone National Park, WY: Simple Linear Regression (SLR) Analysis SLR analysis explores the linear association
More informationLecture 6: Linear Regression
Lecture 6: Linear Regression Reading: Sections 3.1-3 STATS 202: Data mining and analysis Jonathan Taylor, 10/5 Slide credits: Sergio Bacallado 1 / 30 Simple linear regression Model: y i = β 0 + β 1 x i
More informationAnalysis of Covariance: Comparing Regression Lines
Chapter 7 nalysis of Covariance: Comparing Regression ines Suppose that you are interested in comparing the typical lifetime (hours) of two tool types ( and ). simple analysis of the data given below would
More informationStat 401B Exam 2 Fall 2016
Stat 40B Eam Fall 06 I have neither given nor received unauthorized assistance on this eam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will
More informationRegression. Marc H. Mehlman University of New Haven
Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven the statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and
More informationMATH 423/533 - ASSIGNMENT 4 SOLUTIONS
MATH 423/533 - ASSIGNMENT 4 SOLUTIONS INTRODUCTION This assignment concerns the use of factor predictors in linear regression modelling, and focusses on models with two factors X 1 and X 2 with M 1 and
More informationWe d like to know the equation of the line shown (the so called best fit or regression line).
Linear Regression in R. Example. Let s create a data frame. > exam1 = c(100,90,90,85,80,75,60) > exam2 = c(95,100,90,80,95,60,40) > students = c("asuka", "Rei", "Shinji", "Mari", "Hikari", "Toji", "Kensuke")
More informationFinal Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58
Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple
More informationLinear Regression. In this lecture we will study a particular type of regression model: the linear regression model
1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor
More informationLecture 4 Multiple linear regression
Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters
More informationFigure 1: The fitted line using the shipment route-number of ampules data. STAT5044: Regression and ANOVA The Solution of Homework #2 Inyoung Kim
0.0 1.0 1.5 2.0 2.5 3.0 8 10 12 14 16 18 20 22 y x Figure 1: The fitted line using the shipment route-number of ampules data STAT5044: Regression and ANOVA The Solution of Homework #2 Inyoung Kim Problem#
More informationInference for Regression
Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu
More informationData Analysis Using R ASC & OIR
Data Analysis Using R ASC & OIR Overview } What is Statistics and the process of study design } Correlation } Simple Linear Regression } Multiple Linear Regression 2 What is Statistics? Statistics is a
More informationSTK 2100 Oblig 1. Zhou Siyu. February 15, 2017
STK 200 Oblig Zhou Siyu February 5, 207 Question a) Make a scatter box plot for the data set. Answer:Here is the code I used to plot the scatter box in R. library ( MASS ) 2 pairs ( Boston ) Figure : Scatter
More informationWorkshop 7.4a: Single factor ANOVA
-1- Workshop 7.4a: Single factor ANOVA Murray Logan November 23, 2016 Table of contents 1 Revision 1 2 Anova Parameterization 2 3 Partitioning of variance (ANOVA) 10 4 Worked Examples 13 1. Revision 1.1.
More informationInferences on Linear Combinations of Coefficients
Inferences on Linear Combinations of Coefficients Note on required packages: The following code required the package multcomp to test hypotheses on linear combinations of regression coefficients. If you
More informationStat 5303 (Oehlert): Analysis of CR Designs; January
Stat 5303 (Oehlert): Analysis of CR Designs; January 2016 1 > resin
More informationLecture 19: Inference for SLR & Transformations
Lecture 19: Inference for SLR & Transformations Statistics 101 Mine Çetinkaya-Rundel April 3, 2012 Announcements Announcements HW 7 due Thursday. Correlation guessing game - ends on April 12 at noon. Winner
More informationThe Statistical Sleuth in R: Chapter 5
The Statistical Sleuth in R: Chapter 5 Linda Loi Kate Aloisio Ruobing Zhang Nicholas J. Horton January 21, 2013 Contents 1 Introduction 1 2 Diet and lifespan 2 2.1 Summary statistics and graphical display........................
More informationChapter 8: Correlation & Regression
Chapter 8: Correlation & Regression We can think of ANOVA and the two-sample t-test as applicable to situations where there is a response variable which is quantitative, and another variable that indicates
More informationDiagnostics and Transformations Part 2
Diagnostics and Transformations Part 2 Bivariate Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University Multilevel Regression Modeling, 2009 Diagnostics
More information7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis
Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox Review Homework 2 Overview Logistic regression model conceptually Logistic regression
More informationStat 8053, Fall 2013: Multinomial Logistic Models
Stat 8053, Fall 2013: Multinomial Logistic Models Here is the example on page 269 of Agresti on food preference of alligators: s is size class, g is sex of the alligator, l is name of the lake, and f is
More informationSTAT 3900/4950 MIDTERM TWO Name: Spring, 2015 (print: first last ) Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis
STAT 3900/4950 MIDTERM TWO Name: Spring, 205 (print: first last ) Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis Instructions: You may use your books, notes, and SPSS/SAS. NO
More informationLinear Regression Measurement & Evaluation of HCC Systems
Linear Regression Measurement & Evaluation of HCC Systems Linear Regression Today s goal: Evaluate the effect of multiple variables on an outcome variable (regression) Outline: - Basic theory - Simple
More informationSTAT 350: Summer Semester Midterm 1: Solutions
Name: Student Number: STAT 350: Summer Semester 2008 Midterm 1: Solutions 9 June 2008 Instructor: Richard Lockhart Instructions: This is an open book test. You may use notes, text, other books and a calculator.
More informationBooklet of Code and Output for STAD29/STA 1007 Midterm Exam
Booklet of Code and Output for STAD29/STA 1007 Midterm Exam List of Figures in this document by page: List of Figures 1 NBA attendance data........................ 2 2 Regression model for NBA attendances...............
More informationLinear Modelling in Stata Session 6: Further Topics in Linear Modelling
Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 14/11/2017 This Week Categorical Variables Categorical
More informationRegression in R I. Part I : Simple Linear Regression
UCLA Department of Statistics Statistical Consulting Center Regression in R Part I : Simple Linear Regression Denise Ferrari & Tiffany Head denise@stat.ucla.edu tiffany@stat.ucla.edu Feb 10, 2010 Objective
More informationStat 328 Final Exam (Regression) Summer 2002 Professor Vardeman
Stat Final Exam (Regression) Summer Professor Vardeman This exam concerns the analysis of 99 salary data for n = offensive backs in the NFL (This is a part of the larger data set that serves as the basis
More informationSTA220H1F Term Test Oct 26, Last Name: First Name: Student #: TA s Name: or Tutorial Room:
STA0HF Term Test Oct 6, 005 Last Name: First Name: Student #: TA s Name: or Tutorial Room: Time allowed: hour and 45 minutes. Aids: one sided handwritten aid sheet + non-programmable calculator Statistical
More informationLinear Regression Model. Badr Missaoui
Linear Regression Model Badr Missaoui Introduction What is this course about? It is a course on applied statistics. It comprises 2 hours lectures each week and 1 hour lab sessions/tutorials. We will focus
More informationUsing R in 200D Luke Sonnet
Using R in 200D Luke Sonnet Contents Working with data frames 1 Working with variables........................................... 1 Analyzing data............................................... 3 Random
More informationRegression and Models with Multiple Factors. Ch. 17, 18
Regression and Models with Multiple Factors Ch. 17, 18 Mass 15 20 25 Scatter Plot 70 75 80 Snout-Vent Length Mass 15 20 25 Linear Regression 70 75 80 Snout-Vent Length Least-squares The method of least
More informationOutline. 1 Preliminaries. 2 Introduction. 3 Multivariate Linear Regression. 4 Online Resources for R. 5 References. 6 Upcoming Mini-Courses
UCLA Department of Statistics Statistical Consulting Center Introduction to Regression in R Part II: Multivariate Linear Regression Denise Ferrari denise@stat.ucla.edu Outline 1 Preliminaries 2 Introduction
More informationholding all other predictors constant
Multiple Regression Numeric Response variable (y) p Numeric predictor variables (p < n) Model: Y = b 0 + b 1 x 1 + + b p x p + e Partial Regression Coefficients: b i effect (on the mean response) of increasing
More informationL21: Chapter 12: Linear regression
L21: Chapter 12: Linear regression Department of Statistics, University of South Carolina Stat 205: Elementary Statistics for the Biological and Life Sciences 1 / 37 So far... 12.1 Introduction One sample
More informationCASE STUDY: CYCLONES
CASE STUDY: CYCLONES DEFINITION Cyclones are defined as ``an atmospheric system in which the barometric pressure diminishes progressively to a minimum at the centre and toward which the winds blow spirally
More informationInference with Heteroskedasticity
Inference with Heteroskedasticity Note on required packages: The following code requires the packages sandwich and lmtest to estimate regression error variance that may change with the explanatory variables.
More informationSCHOOL OF MATHEMATICS AND STATISTICS
SHOOL OF MATHEMATIS AND STATISTIS Linear Models Autumn Semester 2015 16 2 hours Marks will be awarded for your best three answers. RESTRITED OPEN BOOK EXAMINATION andidates may bring to the examination
More information