Topics on Statistics 2

Pejman Mahboubi

March 7

1 Regression vs Anova

In Anova, the groups are the predictors. When plotting, we can put the groups on the x axis in any order we wish, say in increasing or decreasing order of their means, or even in alphabetical order of the group names. There is no relation between the means μ_1, μ_2, ..., μ_g other than the ones specified by constraints. In regression, the predictors are numbers, say the average wintertime daily temperature, which has a natural order on the horizontal axis T. Let E be the corresponding values of the mean energy consumption. Furthermore, the regression assumption is that these means form a straight line.

> set.seed(1114)
> T<-sort(runif(15,20,50)) # independent or predictor variable
> E<-a+b*T+rnorm(15,0,.2)  # dependent or response variable; the numeric values of a and b were lost from this transcript
> (df<-data.frame(Temperature=round(T,1),Energy=round(E,2)))
   Temperature Energy
> plot(T,E,xlab="Temperature",ylab="Energy",cex.lab=.5,cex.axis=.4,pch=20,tcl=-0.1)

[Figure: scatter plot of Energy vs Temperature]

1. We assume that at every temperature T = t, the value E = e is sampled from an independent normal distribution assigned to that specific temperature t.

2. Therefore, each temperature corresponds to its own population.

3. We assume that the variances of these populations are all the same: σ².

4. We assume that the means of these populations lie on a straight line (the red line below).

What we are looking for is the equation of the straight line that goes through the points:

> plot(T,E,xlab="Temperature",ylab="Energy",cex.lab=.5,cex.axis=.4,pch=20,tcl=-0.1)
> abline(lm(E~T),col='red')

[Figure: the scatter plot of Energy vs Temperature with the fitted regression line in red]

2 Prediction

What is your predicted energy consumption on a day when the temperature is 30, or 43.44? The regression model claims that the average value of the energy consumption falls on the line

e := e(t) = β_0 + β_1 t.

R computes β_0 and β_1:

> fit<-lm(Energy~Temperature,df)
> summary(fit)

Call:
lm(formula = Energy ~ Temperature, data = df)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
Temperature                                  e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 13 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 13 DF, p-value: 4.541e-07

If you just want to see the coefficients, write

> coef(fit)
(Intercept) Temperature

Therefore, β_0 is the intercept estimate and β_1 is the Temperature estimate. R has a prediction function. It is very simple. Something like:

> predict_fn<-function(lmobj,t){
+   sum(coef(lmobj)*c(1,t))
+ }
> predict_fn(fit,44)

or we can use R's own function:

> newdata = data.frame(Temperature=44)
> predict(fit,newdata)
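predict() can also report the uncertainty around a prediction. The interval argument below is part of R's standard predict.lm interface; this is an added note, and since the exact numbers depend on the simulated data, none are shown here.

> predict(fit,newdata,interval="confidence")  # interval for the mean energy at Temperature = 44
> predict(fit,newdata,interval="prediction")  # wider interval for a single new day at 44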

3 Fitted values vs. Actual

For each temperature t, the fitted value is the energy consumption predicted by the model. The fitted value is the y value of the point on the red line at that specific t. There are multiple ways to compute the fitted values.

1. Use the predict function with the original data frame:

> predict(fit,df)

2. Use the object fit:

> fit$fitted.values

3. Use our own function:

> sapply(df$Temperature,function(x) predict_fn(fit,x))

3.1 Residuals

Just as in Anova, the distance between the fitted value and the actual value is called the residual:

residual at t = observed e − fitted value ê = e − ê.

We can easily compute all 15 residuals. There are multiple ways of doing this:

1. Use the predict function:

> df$Energy-predict(fit,df)

2. Use the resid() function:

> resid(fit)

Remember that the populations of energy corresponding to different temperatures are independent. Therefore, the residuals are independent normal. Residuals are sampled from a centered normal distribution with fixed variance. Let's take a look at the histogram of the residuals:

> hist(resid(fit),10)

[Figure: histogram of resid(fit)]

qqnorm() is another tool that allows us to compare any set of numbers with a normal distribution.

> qqnorm(resid(fit),ann=FALSE,cex.axis=.4,pch=20,tcl=-0.1)
> mtext(side=1,text="Theoretical Quantile",line=1,cex=.4)
> mtext(side=2,text="Sample Quantile",line=1,cex=.4)
> qqline(resid(fit))

[Figure: normal Q-Q plot of the residuals]
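Beyond these visual checks, a formal normality test can be run on the residuals. This is an addition to the notes; shapiro.test() is part of R's standard stats package, and with only 15 points the test has little power.

> shapiro.test(resid(fit))  # a large p-value would be consistent with normal residuals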

4 Explained, Unexplained variation and R-squared

If we consider every temperature as a separate group, then, similar to Anova, we can define the sum of squares of the residuals as the unexplained variation. Similarly, we can compute the grand mean and define the sum of squared distances of the fitted values from the grand mean as the explained variation. In our example we have the residuals, so the unexplained variation is

> (within.var<-sum((resid(fit))^2))

The explained variation is

> (between.var<-sum((fitted.values(fit)-mean(df$Energy))^2))

and the total variation is

> (total.var<-var(df$Energy)*(length(df$Energy)-1))

And, once again, we can check that total.var equals within.var + between.var:

> within.var+between.var

We can also compute the R-squared, which is the ratio of the explained variation to the total variation, or in code

> (r.sqrd<-between.var/total.var)

which matches the R-squared reported by the summary() function. The F-statistic is also mentioned in the summary() output. Remember that the F-statistic is the ratio of the average explained variation to the average unexplained variation. So the question is what the degrees of freedom are for the explained and unexplained variations. The unexplained variation comes from the residuals. There are 15 residuals r_1, ..., r_15. But they are not arbitrary numbers. Linear regression puts 2 constraints on them (we will soon see where these constraints come from):

Constraint 1. The sum of the residuals is zero:

r_1 + r_2 + ... + r_15 = 0.

Let's check this:

> sum(resid(fit))
[1] e-17

Constraint 2. The predictor Temperature is perpendicular to the residuals, i.e.,

Temperature · Residual = (t_1, ..., t_15) · (r_1, ..., r_15) = t_1 r_1 + t_2 r_2 + ... + t_15 r_15 = 0.

Let's check this:

> sum(df$Temperature*resid(fit))
[1] e-14

Therefore, with two constraints, the degrees of freedom of the residuals is 15 - 2 = 13. Next, look at the between variation, which comes from the distances of the 15 fitted values to the horizontal line Energy = mean(df$Energy), the grand mean. The only parameter we need for computing between.var is β_1 (though this is not so trivial). Therefore the degrees of freedom of the explained variation is 1, and the F-statistic is

> (between.var/1)/(within.var/13)

Definition 1 (Errors vs. Residuals). Let's draw 5 samples from N(10, 1):

> (x<-rnorm(5,10,1))

You can check that mean(x) is not exactly 10, and var(x) is not exactly one.

> mean(x)
> var(x)

The errors are x − 10:

> (errors=x-10) # distance to the mean of the population

You can check that mean(errors) != 0:

> mean(errors)

If we are not privy to the population, the errors remain unknown to us. But the residuals are x − mean(x):

> (residuals<-x-mean(x))

And the mean of the residuals is zero:

> mean(residuals)
[1] e-16
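Returning to the decomposition in this section: R tabulates the same quantities itself, which gives a quick cross-check of the hand computation. Both calls below are standard for lm objects; this check is an addition to the notes.

> anova(fit)               # the Sum Sq column holds between.var and within.var
> summary(fit)$fstatistic  # the same F value with 1 and 13 degrees of freedom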

4.1 Equation of the regression line

Assume we are given the points

> head(df)
  Temperature Energy

Likelihood Function

Cost Function

In linear regression, the cost function is the sum of squares of the residuals. Let f(x) = a + b·x denote the regression line. Then the residual corresponding to a point (T, E) is

E − f(T) = E − a − b·T.    (1)

Then the cost function is

cost(df, f) = Σ_{1}^{15} (E − f(T))²    (2)
            = Σ_{1}^{15} (E − a − b·T)².    (3)

4.2 Pearson's Correlation Coefficient

If x = x_1, ..., x_n and y = y_1, ..., y_n, then the correlation coefficient r between x and y is defined as

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² ),

or, equivalently,

r = covar(x, y) / (sd(x) · sd(y)),

where x̄ and ȳ denote the means of x and y respectively.

1. |r| ≤ 1. This follows from a mathematical inequality called the Schwarz inequality.

2. If r = −1, then there is a perfect negative linear relationship between x and y. If r = 1, then there is a perfect positive linear relationship between x and y.

3. If r = 0, then there is no linear relationship between x and y.

4. All other values of r tell us that the relationship between x and y is not perfect. The closer r is to 0, the weaker the linear relationship. The closer r is to −1, the stronger the negative linear relationship. And the closer r is to 1, the stronger the positive linear relationship.

The Schwarz inequality states that

(X_1 Y_1 + ... + X_n Y_n)² ≤ (X_1² + ... + X_n²)(Y_1² + ... + Y_n²).

Let's see an example:

> x<-rnorm(10);y<-runif(10,20,1000)
> #Schwarz inequality implies that
> (sum(x*y))^2<sum(x^2)*sum(y^2)
[1] TRUE

The dot product of two vectors x, y is a measure of the similarity between the two. To make it comparable, we can normalize the vectors by their lengths. For example, for x = (1, 0), let us see which one of the following vectors is most similar to x according to the normalized dot product: y1 = (1, .2), y2 = (1, 1), y3 = (0, 1).

> x<-c(1,0)
> y1<-c(1,.2);y2<-c(1,1);y3<-c(0,1)
> sapply(list(y1,y2,y3),function(z) sum(x*z)/sum(z^2))

Returning to energy vs. temperature, let us see what the correlation coefficient between the two is:

> with(df,cor(Temperature,Energy))

Its magnitude is very close to 1, but the slope of the regression line

> coef(fit)
(Intercept) Temperature

is very close to 0. So, the value of b doesn't say much about the true strength of the relation between x and y. This is because the variance on the y axis is much smaller than the variance on the x axis. If we multiply b by sd(Temperature)/sd(Energy), we recover r:

> unname(coef(fit)[2])*sd(df$Temperature)/sd(df$Energy)

By the way, this gives us the formula for computing b, i.e.,

b = r · sd(y)/sd(x) = covar(x, y)/(sd(x) sd(y)) · sd(y)/sd(x) = covar(x, y)/var(x) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)².
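Both of these formulas are easy to verify numerically on the energy data. The sketch below is an added check, not part of the original notes; cov(), var(), and cor() are standard R functions.

> xt <- df$Temperature; ye <- df$Energy
> sum((xt-mean(xt))*(ye-mean(ye))) /
+   sqrt(sum((xt-mean(xt))^2)*sum((ye-mean(ye))^2))  # r from its definition
> cor(xt,ye)             # agrees with the value above
> cov(xt,ye)/var(xt)     # b = covar(x,y)/var(x)
> unname(coef(fit)[2])   # agrees with lm's slope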

4.2.1 Connection to the R-squared

Both r and R² are indications of the goodness of the fit. For computing r we don't need to fit the data. To compute R² we need the explained and unexplained variation, which means that we need the fitted values in order to compute residuals. We know that 0 ≤ R² ≤ 1 and −1 ≤ r ≤ 1. However, r², just like R², is positive, and in simple linear regression the two are closely related. The relationship is

r² = R².

If R² = .9, then r² = .9, which means r = ±√.9 = ±0.9487. So either r = 0.9487 or r = −0.9487. How can we find out which? The answer is: if b > 0, then r > 0, and if b < 0, then we take the negative value of r.

Example 1. Using the fit object, calculate the Pearson r.

Solution. Since the model fit is linear, we can compute r using R². First we find R²:

> summary(fit)$r.squared

Therefore, the Pearson correlation is one of the two square roots of R²:

> (sqrt(summary(fit)$r.squared))
> #or
> -(sqrt(summary(fit)$r.squared))

Since b is negative:

> (b=coef(fit)[2])
Temperature

r is also negative, and r = −√R². We can compute r directly using the cor(x,y) function:

> cor(df$Energy,df$Temperature)

Remark. r measures the linear relation. R² measures the percentage of the variation explained by the model. So R² is not model independent; r is.
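Before moving on, the sign rule from Example 1 can be wrapped into a single line. This compact form is an addition to the notes; sign() is base R.

> sign(coef(fit)[2])*sqrt(summary(fit)$r.squared)  # reproduces cor(df$Energy,df$Temperature)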

Consider the following data:

> x<-sort(runif(20,-2,2))
> y<-x^2+rnorm(20,-3,.1)
> plot(x,y)

[Figure: scatter plot of y against x, showing a parabola-shaped cloud]

Here there is an almost perfect relation between x and y, i.e., y = x² − 3. Let me add the plot of y = x² − 3:

> x<-sort(runif(20,-2,2))
> y<-x^2+rnorm(20,-3,.1)
> plot(x,y)
> a<-seq(from = -2,to = 2,.5)

> b<-a^2-3
> lines(a,b)

[Figure: the scatter plot with the curve y = x² − 3 drawn through it]

If we didn't know that a line is not the best fit, we would do this:

> fit.1<-lm(y~x)

Let's look at the result:

> summary(fit.1)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      ***
x
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 18 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 18 DF, p-value:

Let's look at the plots of the predictions and the data:

> x<-sort(runif(20,-2,2))
> y<-x^2+rnorm(20,-3,.1)
> plot(x,y)
> abline(fit.1,col='red')

[Figure: the parabola-shaped data with the poorly fitting red regression line]
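A residual plot makes the lack of fit even more obvious than the fitted-line plot. This diagnostic is an addition to the notes; plot() and resid() are used exactly as before.

> plot(x,resid(fit.1))  # residuals of the straight-line fit
> abline(h=0,lty=2)     # a clear U shape: the line misses the curvature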

So, what should we do? We know that a quadratic term would create a better result. Here is how we proceed:

> fit.2<-lm(y~I(x^2)+I(x))
> summary(fit.2)

Call:
lm(formula = y ~ I(x^2) + I(x))

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               <2e-16 ***
I(x^2)                                    <2e-16 ***
I(x)
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 17 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 1389 on 2 and 17 DF, p-value: < 2.2e-16

Note how large R² is now. Let's check the correlation coefficient:

> cor(x,y)

How do we plot the data and the model?

> plot(x,y)
> lines(a,predict(object = fit.2,newdata = data.frame(x=a)),col='red')

[Figure: the data with the fitted quadratic curve in red]
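An equivalent way to specify the quadratic model is R's poly() helper. This alternative is an addition to the notes; with raw = TRUE it reproduces the I(x^2) + I(x) fit exactly, while the default orthogonal polynomials give the same fitted values with uncorrelated columns.

> fit.3<-lm(y~poly(x,2,raw=TRUE))  # same model as y ~ I(x^2) + I(x)
> all.equal(fitted(fit.2),fitted(fit.3))
[1] TRUE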

4.3 Optimization and Gradient Descent

Remember how we computed the coefficients a and b for the simple linear regression y = a + b·x. Assume the data are x = x_1, ..., x_n and y = y_1, ..., y_n. Then the fitted values are

ŷ_i = a + b·x_i    for i = 1, ..., n,

and the residuals are

res_i = y_i − ŷ_i = y_i − (a + b·x_i)    for i = 1, ..., n.

Therefore, the cost function is

cost(data, model) = Σ_{i=1}^{n} res_i² = Σ_{i=1}^{n} (y_i − (a + b·x_i))².

So cost is actually a function of a and b, and the goal is to determine the a and b that minimize cost(a, b). Last time we saw how to do this mathematically, i.e., we compute the roots of the partial derivatives by solving

∂cost/∂a = 0,    ∂cost/∂b = 0.

Computing software often doesn't solve these equations directly, because as the number of predictors increases, the complexity increases significantly; the solution involves inverting matrices, which is computationally very expensive.

Example 2. Compute the cost function for the energy data.

> #cost function
> cost<-function(a,b){
+   sum((df$Energy-a-b*df$Temperature)^2)
+ }
> (cost(5,-1))
> (cost(10,.02))

Since we solved this problem before, we already know the best values of a and b: they are the two entries of coef(fit), and the cost function is minimized at these values:

> cost(coef(fit)[1],coef(fit)[2])

The method of gradient descent says that, starting from any point, say a = 3, b = 0, we repeatedly step in the direction in which cost(a, b) decreases fastest, i.e., against the gradient.
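The original notes break off here, so the following loop is only a sketch of that idea, not the author's code. The gradient components come from differentiating cost(a, b) above; the step size and iteration count are illustrative choices, since plain gradient descent converges slowly on this problem.

> grad_descent<-function(a,b,rate=1e-5,iters=5e5){
+   for(i in 1:iters){
+     res<-df$Energy-a-b*df$Temperature
+     a<-a-rate*(-2*sum(res))                  # -2*sum(res) is d cost / d a
+     b<-b-rate*(-2*sum(res*df$Temperature))   # d cost / d b
+   }
+   c(a=a,b=b)
+ }
> grad_descent(3,0)  # should approach coef(fit)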
