Homework1 Yang Sun 2017/9/11


1. Describe data

According to the data description, the response variable is AmountSpent; the predictors are Age, Gender, OwnHome, Married, Location, Salary, Children, History, and Catalogs.

2. Statistical and graphical data summary

2.0 Initial Setup

# Set workspace
setwd("d:/dropbox/pitt/fall 18/IS 2160 Data Mining/Homeworks/HW1")
# Import csv file here
DirectMarketing <- read.csv("directmarketing.csv", header=TRUE, stringsAsFactors=FALSE)
# Data Summary
summary(DirectMarketing)

     Age               Gender            OwnHome
 Length:1000        Length:1000        Length:1000
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character

   Married            Location             Salary        Children
 Length:1000        Length:1000        Min.   :       Min.   :0.000
 Class :character   Class :character   1st Qu.:       1st Qu.:0.000
 Mode  :character   Mode  :character   Median :       Median :1.000
                                       Mean   :       Mean   :
                                       3rd Qu.:       3rd Qu.:2.000
                                       Max.   :       Max.   :3.000

   History             Catalogs       AmountSpent
 Length:1000        Min.   : 6.00   Min.   :  38.0
 Class :character   1st Qu.:        1st Qu.:
 Mode  :character   Median :12.00   Median :
                    Mean   :14.68   Mean   :
                    3rd Qu.:        3rd Qu.:
                    Max.   :24.00   Max.   :

2.a Missing Values

# Check for missing values
table(is.na(DirectMarketing))

FALSE  TRUE

# Determine whether all missing values come from History
table(is.na(DirectMarketing$History))

FALSE  TRUE

# Make the NA field into "None" as one category.
# Based on the data description, NA means that this customer has not yet purchased.
# Hence we cannot simply delete NAs; instead convert them into "None".
DirectMarketing[is.na(DirectMarketing)] <- "None"
# Check again for missing values
table(is.na(DirectMarketing))

FALSE

No more missing values.

2.b Generate Summary

summary(DirectMarketing)

     Age               Gender            OwnHome
 Length:1000        Length:1000        Length:1000
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character

   Married            Location             Salary        Children
 Length:1000        Length:1000        Min.   :       Min.   :0.000
 Class :character   Class :character   1st Qu.:       1st Qu.:0.000
 Mode  :character   Mode  :character   Median :       Median :1.000
                                       Mean   :       Mean   :
                                       3rd Qu.:       3rd Qu.:2.000
                                       Max.   :       Max.   :3.000

   History             Catalogs       AmountSpent
 Length:1000        Min.   : 6.00   Min.   :  38.0
 Class :character   1st Qu.:        1st Qu.:
 Mode  :character   Median :12.00   Median :
                    Mean   :14.68   Mean   :
                    3rd Qu.:        3rd Qu.:
                    Max.   :24.00   Max.   :

# Standard Deviation for numerical values
sd(DirectMarketing$Salary)

[1]
# Standard Deviation for Salary
sd(DirectMarketing$Children)
[1]
# Standard Deviation for Children
sd(DirectMarketing$Catalogs)
[1]
# Standard Deviation for Catalogs
sd(DirectMarketing$AmountSpent)
[1]
# Standard Deviation for AmountSpent

# Convert all fields of char into factor
cols <- c("Age", "Gender", "OwnHome", "Married", "Location", "History")
DirectMarketing[cols] <- lapply(DirectMarketing[cols], factor)
# Do Summary again
summary(DirectMarketing)

     Age         Gender       OwnHome       Married      Location
 Middle:508   Female:506   Own :516   Married:502   Close:710
 Old   :205   Male  :494   Rent:484   Single :498   Far  :290
 Young :287

     Salary        Children        History       Catalogs
 Min.   :       Min.   :0.000   High  :255   Min.   :
 1st Qu.:       1st Qu.:0.000   Low   :230   1st Qu.: 6.00
 Median :       Median :1.000   Medium:212   Median :12.00
 Mean   :       Mean   :0.934   None  :303   Mean   :
 3rd Qu.:       3rd Qu.:                     3rd Qu.:18.00
 Max.   :       Max.   :3.000                Max.   :24.00

  AmountSpent
 Min.   :
 1st Qu.:
 Median :
 Mean   :
 3rd Qu.:
 Max.   :

2.c Kernel Density Plot

AmountSpent Density Distribution

library(ggplot2) # load the plotting package
ggplot(DirectMarketing, aes(x=AmountSpent)) +
  geom_density() +
  labs(title = "AmountSpent Density Distribution")

[Figure: AmountSpent Density Distribution (x-axis: AmountSpent, y-axis: density)]

The density distribution for AmountSpent is right-skewed.

Salary Density Distribution

# For Salary
ggplot(DirectMarketing, aes(x=Salary)) +
  geom_density() +
  labs(title = "Salary Density Distribution")

[Figure: Salary Density Distribution (x-axis: Salary, y-axis: density)]

The density distribution for Salary is left-skewed, and it also has two peaks.

2.d Correlation and scatterplot for numerical predictors

Correlation

numpredictor <- as.data.frame(DirectMarketing[,c("Salary", "Children", "Catalogs")])
responsevariable <- as.data.frame(DirectMarketing$AmountSpent)
colnames(responsevariable)[1] <- "Amount Spent"
cor(responsevariable, numpredictor)

              Salary  Children  Catalogs
Amount Spent

Scatterplot

# Salary vs. AmountSpent
# use ggplot
theme_set(theme_bw()) # set default theme with a white background
ggplot(data=DirectMarketing, aes(x=Salary,y=AmountSpent)) +
  geom_point() +
  geom_smooth(method=lm, # add linear regression line
              se=FALSE)  # (by default includes 95% confidence region)

[Figure: scatterplot of AmountSpent (y-axis) against Salary (x-axis) with fitted regression line]

# Children vs. AmountSpent
ggplot(data=DirectMarketing, aes(x=Children,y=AmountSpent)) +
  geom_point() +
  geom_smooth(method=lm, # add linear regression line
              se=FALSE)  # (by default includes 95% confidence region)

[Figure: scatterplot of AmountSpent (y-axis) against Children (x-axis) with fitted regression line]

# Catalogs vs. AmountSpent
ggplot(data=DirectMarketing, aes(x=Catalogs,y=AmountSpent)) +
  geom_point() +
  geom_smooth(method=lm, # add linear regression line
              se=FALSE)  # (by default includes 95% confidence region)

[Figure: scatterplot of AmountSpent (y-axis) against Catalogs (x-axis) with fitted regression line]

2.e Conditional density plot for categorical predictor

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Age)) +
  geom_density()

[Figure: density of AmountSpent by Age (Middle, Old, Young)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Gender)) +
  geom_density()

[Figure: density of AmountSpent by Gender (Female, Male)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = OwnHome)) +
  geom_density()

[Figure: density of AmountSpent by OwnHome (Own, Rent)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Married)) +
  geom_density()

[Figure: density of AmountSpent by Married (Married, Single)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = Location)) +
  geom_density()

[Figure: density of AmountSpent by Location (Close, Far)]

ggplot(data=DirectMarketing, aes(AmountSpent, colour = History)) +
  geom_density()

[Figure: density of AmountSpent by History (High, Low, Medium, None)]

2.f Compare significantly different means

#Age
a1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Age == "Young"])
a2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Age == "Middle"])
a3 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Age == "Old"])
AgeMean <- data.frame("MeanOfAmountSpent" = c(a1, a2, a3))
rownames(AgeMean) <- c("Age-Young", "Age-Middle", "Age-Old")
#Gender
g1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Gender == "Male"])
g2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Gender == "Female"])
GenderMean <- data.frame("MeanOfAmountSpent" = c(g1, g2))
rownames(GenderMean) <- c("Gender-Male", "Gender-Female")
#OwnHome
o1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$OwnHome == "Own"])
o2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$OwnHome == "Rent"])
OwnHomeMean <- data.frame("MeanOfAmountSpent" = c(o1, o2))
rownames(OwnHomeMean) <- c("OwnHome-Own", "OwnHome-Rent")
#Married
m1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Married == "Married"])
m2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Married == "Single"])
MarriedMean <- data.frame("MeanOfAmountSpent" = c(m1, m2))
rownames(MarriedMean) <- c("Married-Married", "Married-Single")
#Location

l1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Location == "Far"])
l2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$Location == "Close"])
LocationMean <- data.frame("MeanOfAmountSpent" = c(l1, l2))
rownames(LocationMean) <- c("Location-Far", "Location-Close")
#History
h1 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "None"])
h2 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "Low"])
h3 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "Medium"])
h4 <- mean(DirectMarketing$AmountSpent[DirectMarketing$History == "High"])
HistoryMean <- data.frame("MeanOfAmountSpent" = c(h1, h2, h3, h4))
rownames(HistoryMean) <- c("History-None", "History-Low", "History_Medium", "History-High")
#Overall
categorytable <- rbind(AgeMean,GenderMean,OwnHomeMean,MarriedMean,LocationMean,HistoryMean)
categorytable

                 MeanOfAmountSpent
Age-Young
Age-Middle
Age-Old
Gender-Male
Gender-Female
OwnHome-Own
OwnHome-Rent
Married-Married
Married-Single
Location-Far
Location-Close
History-None
History-Low
History_Medium
History-High

Both the conditional density plots and the table of means show that, within the Age category, the Young group has a significantly different mean from the other two age groups. The same holds for OwnHome-Own vs. OwnHome-Rent and for Married-Married vs. Married-Single, and the means for the History categories all differ from one another.

3. Regression modeling and analysis

3.a Standard linear regression

# Standard linear regression model with all predictors
slr <- lm(AmountSpent~., data=DirectMarketing)
summary(slr)

Call:
lm(formula = AmountSpent ~ ., data = DirectMarketing)

Residuals:
 Min  1Q  Median  3Q  Max

Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                             *
AgeOld
AgeYoung
GenderMale
OwnHomeRent
MarriedSingle
LocationFar                                   < 2e-16   ***
Salary                                        < 2e-16   ***
Children                                      < 2e-16   ***
HistoryLow                                       e-08   ***
HistoryMedium                                    e-14   ***
HistoryNone
Catalogs                                      < 2e-16   ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 987 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 12 and 987 DF, p-value: < 2.2e-16

# RMSE
y = DirectMarketing$AmountSpent
mean.mse = mean((rep(mean(y),length(y)) - y)^2)
model.mse = mean(residuals(slr)^2)
rmse = sqrt(model.mse)
rmse
[1]

Summary for the standard linear regression model: R-squared is , adjusted R-squared is , RMSE is .

3.b Different combination of predictors in linear and non-linear models

Out-of-Sample RMSE for standard linear regression

n = length(DirectMarketing$AmountSpent)
error = numeric(n)
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k] # pick elements that are different from k
  slr = lm(AmountSpent ~., data=DirectMarketing[train,])
  pred = predict(slr, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
OSrmse=sqrt(mean(error^2))
OSrmse # root mean square error (out-of-sample)
[1]
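The leave-one-out loop above refits the model n times. For an ordinary least-squares fit from lm(), the same out-of-sample errors can be obtained from a single fit by dividing each residual by one minus its leverage. The sketch below checks this on R's built-in mtcars data, which is only a stand-in for DirectMarketing; the formula mpg ~ wt + hp and the names rmse_loop and rmse_shortcut are illustrative assumptions, not part of the homework.

```r
# Sketch only: mtcars stands in for DirectMarketing.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

# Explicit leave-one-out loop, same idea as the loop above
loo_error <- numeric(n)
for (k in seq_len(n)) {
  fit_k <- lm(mpg ~ wt + hp, data = mtcars[-k, ])            # fit without row k
  pred_k <- predict(fit_k, newdata = mtcars[k, , drop = FALSE])
  loo_error[k] <- mtcars$mpg[k] - pred_k                     # held-out error
}
rmse_loop <- sqrt(mean(loo_error^2))

# Shortcut from one fit: residual_i / (1 - h_ii), where h_ii is the leverage
rmse_shortcut <- sqrt(mean((residuals(fit) / (1 - hatvalues(fit)))^2))

print(c(loop = rmse_loop, shortcut = rmse_shortcut))
```

If the two numbers agree, the shortcut could replace the loop for the linear and polynomial lm() fits in this section, which would mostly save running time on larger data sets.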

Backward Stepwise Selection

library(MASS)
slr <- lm(AmountSpent~., data=DirectMarketing)
stepAIC(slr, direction="backward")

Start:  AIC=
AmountSpent ~ Age + Gender + OwnHome + Married + Location + Salary +
    Children + History + Catalogs

            Df  Sum of Sq  RSS  AIC
- Age
- OwnHome
- Married
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Step:  AIC=
AmountSpent ~ Gender + OwnHome + Married + Location + Salary +
    Children + History + Catalogs

            Df  Sum of Sq  RSS  AIC
- Married
- OwnHome
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Step:  AIC=
AmountSpent ~ Gender + OwnHome + Location + Salary + Children +
    History + Catalogs

            Df  Sum of Sq  RSS  AIC
- OwnHome
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Step:  AIC=
AmountSpent ~ Gender + Location + Salary + Children + History + Catalogs

            Df  Sum of Sq  RSS  AIC
<none>
- Gender
- Children
- History
- Location
- Catalogs
- Salary

Call:
lm(formula = AmountSpent ~ Gender + Location + Salary + Children +
    History + Catalogs, data = DirectMarketing)

Coefficients:
(Intercept)  GenderMale  LocationFar  Salary  Children  HistoryLow  HistoryMedium  HistoryNone  Catalogs

Use the new combination Gender + Location + Salary + Children + History + Catalogs:

newslr <- lm(AmountSpent~Gender + Location + Salary + Children + History + Catalogs, data=DirectMarketing)
summary(newslr)

Call:
lm(formula = AmountSpent ~ Gender + Location + Salary + Children +
    History + Catalogs, data = DirectMarketing)

Residuals:
 Min  1Q  Median  3Q  Max

Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          e           e                      **
GenderMale           e           e
LocationFar     4.360e           e            < 2e-16   ***
Salary          1.892e           e            < 2e-16   ***
Children             e           e            < 2e-16   ***
HistoryLow           e           e               e-08   ***
HistoryMedium        e           e               e-14   ***
HistoryNone          e           e
Catalogs        4.175e           e            < 2e-16   ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 991 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 8 and 991 DF, p-value: < 2.2e-16
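The AIC column in the backward stepwise trace is not the value returned by AIC(); for lm fits the stepwise routines use n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below reproduces that number by hand on the built-in mtcars data, which is a stand-in for DirectMarketing. It uses base R's step() rather than MASS::stepAIC so that it runs without extra packages; the formula and the variable names are illustrative assumptions.

```r
# Sketch only: mtcars stands in for DirectMarketing.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)      # residual sum of squares (the RSS column)
p   <- length(coef(fit))          # number of estimated coefficients

aic_by_hand <- n * log(rss / n) + 2 * p
aic_step    <- extractAIC(fit)[2] # value the stepwise search compares

print(c(by_hand = aic_by_hand, extractAIC = aic_step))

# Backward search from the full model; trace = 0 suppresses the step table
reduced <- step(fit, direction = "backward", trace = 0)
print(formula(reduced))
```

At each step the search drops the term whose removal gives the lowest AIC and stops when <none> has the lowest value, which is why the final table in the trace lists <none> first.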

Out-of-Sample RMSE for new linear regression

n = length(DirectMarketing$AmountSpent)
error = numeric(n)
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k] # pick elements that are different from k
  slr = lm(AmountSpent ~ Gender + Location + Salary + Children + History + Catalogs,
           data=DirectMarketing[train,])
  pred = predict(slr, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
OSrmse=sqrt(mean(error^2))
OSrmse # root mean square error (out-of-sample)
[1]

Nonlinear regression

2-degree on Salary, Children and Catalogs

nonlr <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) +
            History + poly(Catalogs,degree=2), data=DirectMarketing)
summary(nonlr)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 2) +
    poly(Children, degree = 2) + History + poly(Catalogs, degree = 2),
    data = DirectMarketing)

Residuals:
 Min  1Q  Median  3Q  Max

Coefficients:
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                                 < 2e-16   ***
GenderMale
LocationFar                                                 < 2e-16   ***
poly(Salary, degree = 2)1                                   < 2e-16   ***
poly(Salary, degree = 2)2
poly(Children, degree = 2)1                                 < 2e-16   ***
poly(Children, degree = 2)2
HistoryLow                                                     e-07   ***
HistoryMedium                                                  e-14   ***
HistoryNone
poly(Catalogs, degree = 2)1                                 < 2e-16   ***
poly(Catalogs, degree = 2)2

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 988 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 11 and 988 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k] # pick elements that are different from k
  poly1 = lm(AmountSpent~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) +
             History + poly(Catalogs,degree=2), data=DirectMarketing[train,])
  pred = predict(poly1, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1=sqrt(mean(error^2))
nlrmse1 # root mean square error (out-of-sample)
[1]

2-degree on Salary and Children

nonlr1 <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) +
             History, data=DirectMarketing)
summary(nonlr1)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 2) +
    poly(Children, degree = 2) + History, data = DirectMarketing)

Residuals:
 Min  1Q  Median  3Q  Max

Coefficients:
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                                 < 2e-16   ***
GenderMale
LocationFar                                                 < 2e-16   ***
poly(Salary, degree = 2)1                                   < 2e-16   ***
poly(Salary, degree = 2)2
poly(Children, degree = 2)1                                 < 2e-16   ***
poly(Children, degree = 2)2
HistoryLow                                                     e-13   ***
HistoryMedium                                               < 2e-16   ***
HistoryNone                                                           **

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 990 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 9 and 990 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k] # pick elements that are different from k
  poly2 = lm(AmountSpent ~ Gender + Location + poly(Salary,degree=2) + poly(Children,degree=2) +
             History, data=DirectMarketing[train,])
  pred = predict(poly2, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1=sqrt(mean(error^2))
nlrmse1 # root mean square error (out-of-sample)
[1]

3-degree on Salary, Children and Catalogs

nonlr <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) +
            History + poly(Catalogs,degree=3), data=DirectMarketing)
summary(nonlr)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 3) +
    poly(Children, degree = 3) + History + poly(Catalogs, degree = 3),
    data = DirectMarketing)

Residuals:
 Min  1Q  Median  3Q  Max

Coefficients:
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                                 < 2e-16   ***
GenderMale
LocationFar                                                 < 2e-16   ***
poly(Salary, degree = 3)1                                   < 2e-16   ***
poly(Salary, degree = 3)2
poly(Salary, degree = 3)3
poly(Children, degree = 3)1                                 < 2e-16   ***
poly(Children, degree = 3)2

poly(Children, degree = 3)3
HistoryLow                                                     e-07   ***
HistoryMedium                                                  e-13   ***
HistoryNone
poly(Catalogs, degree = 3)1                                 < 2e-16   ***
poly(Catalogs, degree = 3)2
poly(Catalogs, degree = 3)3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 985 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 14 and 985 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k] # pick elements that are different from k
  poly3 = lm(AmountSpent~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) +
             History + poly(Catalogs,degree=3), data=DirectMarketing[train,])
  pred = predict(poly3, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1=sqrt(mean(error^2))
nlrmse1 # root mean square error (out-of-sample)
[1]

3-degree on Salary and Children

nonlr1 <- lm(AmountSpent~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) +
             History, data=DirectMarketing)
summary(nonlr1)

Call:
lm(formula = AmountSpent ~ Gender + Location + poly(Salary, degree = 3) +
    poly(Children, degree = 3) + History, data = DirectMarketing)

Residuals:
 Min  1Q  Median  3Q  Max

Coefficients:
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                                 < 2e-16   ***
GenderMale

LocationFar                                                 < 2e-16   ***
poly(Salary, degree = 3)1                                   < 2e-16   ***
poly(Salary, degree = 3)2
poly(Salary, degree = 3)3
poly(Children, degree = 3)1                                 < 2e-16   ***
poly(Children, degree = 3)2
poly(Children, degree = 3)3
HistoryLow                                                     e-13   ***
HistoryMedium                                               < 2e-16   ***
HistoryNone                                                           **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 988 degrees of freedom
Multiple R-squared: 0.673, Adjusted R-squared:
F-statistic:  on 11 and 988 DF, p-value: < 2.2e-16

Out-of-Sample RMSE

n = length(DirectMarketing$AmountSpent)
error = numeric(n)
for (k in 1:n) {
  train1 = c(1:n)
  train = train1[train1!=k] # pick elements that are different from k
  poly4 = lm(AmountSpent ~ Gender + Location + poly(Salary,degree=3) + poly(Children,degree=3) +
             History, data=DirectMarketing[train,])
  pred = predict(poly4, newdata=DirectMarketing[-train,])
  obs = DirectMarketing$AmountSpent[-train]
  error[k] = obs-pred
}
nlrmse1=sqrt(mean(error^2))
nlrmse1 # root mean square error (out-of-sample)
[1]

Since higher-degree polynomial terms did not improve the performance of the model compared to the standard linear regression, and too high a polynomial degree can cause overfitting, I decided to stop here.

3.c Best model and the most important predictor

The original standard linear regression model is the best-performing one, with an RMSE of 482. To determine the important predictors, we look at each predictor's p-value: the smaller the p-value, the more important the predictor is considered to be. In this case, the important predictors are Location, Salary, Children, History, and Catalogs.
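The p-values in summary() are reported per coefficient, so a factor such as History appears as several dummy rows (HistoryLow, HistoryMedium, HistoryNone) with different p-values. One complementary way to judge a factor as a whole is a per-term F-test with drop1(). The sketch below shows the idea on the built-in mtcars data, with cyl converted to a factor as a stand-in for History; the data set, the formula, and the names cars_df and term_tests are illustrative assumptions and not part of the homework.

```r
# Sketch only: mtcars stands in for DirectMarketing, cyl for a factor like History.
cars_df <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + hp + cyl, data = cars_df)

# One F-test per model term: each row compares the full model
# with the model that omits that whole term.
term_tests <- drop1(fit, test = "F")
print(term_tests)
```

Each factor then gets a single row and a single p-value (with Df equal to the number of levels minus one), which is easier to compare against the numeric predictors than three separate dummy coefficients.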
