Lecture 4
This week's lab: Exam 1! Review the lectures, practice labs 1 to 4, and homework 1 to 5. Need help? See me during my office hours, or go to the open lab or GS 211. Bring your picture ID and a simple calculator (note: computations could be done in Excel). Do not be late.
Q1: What is the interpretation of the number 4.1?
A: There were 4.1 million visits to the ER by people 85 and older.
Q2: What percent of people 65-74 visited the ER between 2009 and 2010?
A: About 40%.
Q3: What is the total number of ER visits for senior citizens between 2009 and 2010?
A: 19.6 million.
Q4: What do you think was the main purpose of this chart? What would be the conclusion?
A: Although percentage-wise the 85+ population visits the ER the most, their actual number of visits is the smallest.
Q: What % of women age 30-34 in 2009 were pregnant?
A: 13.8%.
Q: Which age group experienced the biggest drop from 1990 to 2009?
A: 18-19 years.
Q: Which age group had the highest rate of pregnancy in 1990?
A: 20-24 years.
Q: Which age group had the highest rate of pregnancy in 2009?
A: 25-29 years.
Q: Which age group experienced the biggest jump from 1990 to 2009?
A: 35-39 years.
Q: What do you think was the main purpose of this chart? What would be the conclusion?
A: There exists a reverse trend! As years go by, younger women are less often pregnant while older women are more often pregnant.
Q: For which period was the CIA estimate the highest?
A: 1928-1940.
Q: For the period 1960-65, what was the Official Soviet Income Growth?
A: About 7-8%.
Q: For which period do we see the highest discrepancy between Khanin's and the CIA's estimates?
A: 1928-1940.
Q1. Which month has the highest number of tornado-related deaths?
A: April.
Q2. Which two months have the highest number of tornadoes?
A: May and June.
Q3. Which month has the lowest number of tornado-related deaths?
A: July.
Q4. The blue chart (tornado deaths) is shifted to the left compared to the red chart. What could be the likely reason for this?
A: People tend to underestimate the danger and are unprepared at the beginning of tornado season.
Example
Compare the number of tornadoes in the months of January, March, May and July. Which of the three months March, May and July is the most correlated with January? Produce a scatter plot with the X axis representing January and the Y axis representing the month most correlated with January. What is the equation of the regression line, and what is the R^2? If you see 30 tornadoes in January, how many tornadoes would you expect to see in the month most correlated with January?
[Chart: "jan vs may" scatter plot with trend line y = 1.8034x + 139.13, R² = 0.0841]
Number of tornadoes in May: Y = 1.8034 * 30 + 139.13 = 193.23. Rounding to the nearest integer, one would expect to see about 193 tornadoes in May if 30 tornadoes occurred in January.
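The prediction above is just the trend-line equation evaluated at x = 30. A minimal Python sketch, assuming only the slope, intercept and R² read off the chart (the function name `predict_may` is made up for illustration):

```python
import math

# Trend line read off the January-vs-May chart: y = 1.8034x + 139.13
SLOPE = 1.8034
INTERCEPT = 139.13
R_SQUARED = 0.0841

def predict_may(january_tornadoes):
    """Expected number of May tornadoes for a given January count."""
    return SLOPE * january_tornadoes + INTERCEPT

prediction = predict_may(30)
print(round(prediction, 2))   # 193.23
print(round(prediction))      # 193

# With one predictor and a positive slope, the correlation is sqrt(R^2).
correlation = math.sqrt(R_SQUARED)
print(round(correlation, 2))  # 0.29
```

An R² of 0.0841 sits in the "poor" band of the course's R² breakdown, so the figure of 193 is best read as a loose estimate.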
A few more examples
Q1. Consider the chart and imagine that one randomly picks a month. What is the probability that this month has more than 4000 tornadoes in total?
A1. From the chart we can see that there are only three months for which this happened: April, May and June. Thus the probability is 3/12 = 1/4.
Q2. Consider the chart and imagine that one randomly picks two different months. What is the probability that each of them has more than 5000 tornadoes in total?
A2. We have already seen that one can form 66 different pairs of months (just remember the correlation table). But only one pair satisfies the requirement: (May, June). Thus only one out of 66 pairs works, and the probability is 1/66.
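Both answers can be checked by counting. The sketch below assumes, as the answers above state, that April, May and June exceed 4000 tornadoes and that only May and June exceed 5000; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Months named in the answers above as exceeding each threshold.
over_4000 = {"Apr", "May", "Jun"}
over_5000 = {"May", "Jun"}

# Q1: one month picked at random.
p_one_month = Fraction(len(over_4000), len(MONTHS))

# Q2: two different months picked at random; order does not matter.
pairs = list(combinations(MONTHS, 2))
good_pairs = [p for p in pairs if set(p) <= over_5000]
p_two_months = Fraction(len(good_pairs), len(pairs))

print(p_one_month)   # 1/4
print(len(pairs))    # 66
print(p_two_months)  # 1/66
```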
Q3. Out of the three statistical measurements we have learned (Median, Average and Standard Deviation), which one is NOT sensitive to outliers?
A3. Median.
Q4. Which of the three measurements (Median, Average and Standard Deviation) describe the center of the data?
A4. Median and Average.
Note: In the literature, Mean and Average are often used interchangeably.
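A quick way to see why the median is the robust one is to add a single extreme value to a small data set and recompute all three measurements. The numbers below are invented purely for illustration.

```python
import statistics

# Hypothetical data, made up for this illustration only.
data = [10, 12, 13, 15, 16]
with_outlier = data + [200]

for label, values in [("original", data), ("with outlier", with_outlier)]:
    print(label,
          "| median:", statistics.median(values),
          "| average:", round(statistics.mean(values), 2),
          "| st. dev.:", round(statistics.stdev(values), 2))
```

The median moves only from 13 to 14, while the average and the standard deviation both jump sharply.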
Multiple Regression
Week of July 11th: Exam 2! Do not miss Lab 5 and Lab 6! Exam 2 will cover material from lab 0 to lab 6 and, of course, all lectures. Review homework 2 to 8!
Warm up: The data in question are SAT.txt. So far we have learned how to work with individual charts based on two variables (an x variable and a y variable).
Q1. What is the interpretation of 0.5655 on the High School chart?
A: This is a tricky one. Students typically write: for each 1 point increase in High School GPA, one expects a 0.5655 point increase in College GPA. A better alternative: given two students whose High School GPAs differ by 1 point, one could expect their College GPAs to differ by 0.5655 points. This second interpretation is useful for college administrators.
Q2. The largest R square is on the High School vs College GPA chart. How would you comment on this information?
A: It looks like High School GPA has the closest relation to College GPA.
Warm up (continued)
Q4. What is the interpretation of the number 0.822 on the third chart?
A: The number 0.822 should not be interpreted. A person with zero High School GPA does not go to college.
Q5. What does each dot represent on the chart?
A: A student.
Breakdown of R^2
from 0 to 0.2      (poor)
from 0.2 to 0.4    (decent)
from 0.4 to 0.6    (good)
from 0.6 to 0.85   (very good)
from 0.85 to 1     (excellent)
So far so good. Clearly each of the three variables (high school GPA, SAT score and letters) had a positive influence on the college GPA. In other words, if one would like to predict the performance of an incoming freshman, each of these three predictors would be relevant.
So how do we compute a student's College GPA based on the letters, HS GPA and SAT? For example:
If student A has quality of letter = 8, then his predicted College GPA = 0.1754(8) + 1.0702 = 2.4734.
Now if his High School GPA = 3.4, then his predicted College GPA = 0.5655(3.4) + 0.822 = 2.7447.
And if his SAT score is 1050, then his predicted College GPA = 0.0018(1050) + 0.1519 = 2.0419.
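Each of the three one-variable charts gives its own prediction for the same student. A small sketch that evaluates the three trend lines quoted above, using their displayed (rounded) slopes and intercepts; the dictionary keys are illustrative names:

```python
# (slope, intercept) of each one-variable trend line from the SAT.txt charts.
LINES = {
    "letters": (0.1754, 1.0702),
    "hs_gpa": (0.5655, 0.822),
    "sat": (0.0018, 0.1519),
}
student_a = {"letters": 8, "hs_gpa": 3.4, "sat": 1050}

predictions = {
    name: slope * student_a[name] + intercept
    for name, (slope, intercept) in LINES.items()
}
for name, value in predictions.items():
    print(name, round(value, 4))
```

The three lines disagree with one another, which is what motivates combining the predictors into a single model.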
Natural question: If one really wants to predict a student's performance (and many admission officers do), it would make sense to combine these three variables and predict the student's performance based on all three factors at once. But how to do so? For this we need multiple regression.
Multiple regression (using the same SAT data). In Excel, click on Data, then on Data Analysis, then on Regression. In the dialog box, for the Input X Range highlight all three predictor columns, including the header row. As far as Excel is concerned, highlighting three columns (instead of just one) is the only difference between Regression and Multiple Regression. After highlighting, $B$1:$D$101 should appear in the box for Input X Range and $A$1:$A$101 should appear in the box for Input Y Range. Do not forget to check the Labels box, and then click OK.
Excel then produces the regression table.
We cannot visualize d dimensions! One of the most important problems with multiple regression is visualization. Unlike the one-dimensional case, where we can plot the chart and fit the line, here we have a 3-dimensional input and a 1-dimensional output, so a plot would need four dimensions and we cannot see what is going on. Instead we must depend on the regression table.
Predictions: Imagine a student with a good High School GPA = 3.5, SAT = 1300, and a letter quality of 9. How do we use the regression table to predict his College GPA?
Multiple Regression Analysis
Multiple regression analysis is a powerful technique for predicting the unknown value of a variable from the known values of two or more other variables, also called the predictors. More precisely, multiple regression analysis helps us to predict the value of Y for given values of X_1, X_2, ..., X_k.
By multiple regression we mean models with just one dependent variable (Y) and two or more independent (explanatory) variables (the Xs).
The variable whose value is to be predicted is known as the dependent variable, and the ones whose known values are used for prediction are known as independent (explanatory) variables.
The Multiple Regression Model
Column headers: coefficient | Standard Error | t stat | P-value | Lower 95% | Upper 95% | Lower 95% | Upper 95%
Intercept     0.345908   -0.37459   0.678905   -0.879605   0.48759    -0.87961   0.48759
Variable X1   0.115679   4.567089   0.009385   0.1456789   0.178938   0.145679   0.178938
Variable X2   0.00394    3.245097   0.657849   0.0000768   0.0013     7.68E-05   0.0013
....          0.059608   0.45689    0.034579   -0.09786    0.123456   -0.09786   0.123456
....          0.309457   5.345609   0.57689    0.0905949   0.58697    0.090595   0.58697
Variable Xk   0.009846   -3.40958   0.00346    0.056784    0.678591   0.056784   0.678591
In general, the multiple regression equation of Y on X_1, X_2, ..., X_k from the table above is given by:
Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k
Here b_0 is the intercept, and b_1, b_2, b_3, ..., b_k are analogous to the slope in the linear regression equation; they are also called regression coefficients. They can be interpreted the same way as a slope, with the other variables held fixed. Thus if b_i = 2.5, it indicates that Y will increase by 2.5 units if X_i increases by 1 unit.
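The equation and the slope-like reading of each coefficient can be tried out on a toy model. The sketch below uses invented coefficients (b_0 = 1.0 and b = 2.5, -0.4, 0.1), chosen only so that b_1 matches the 2.5 in the text; `predict` is a hypothetical helper, not part of Excel.

```python
def predict(intercept, coefficients, xs):
    """Evaluate Y = b_0 + b_1*X_1 + ... + b_k*X_k."""
    if len(coefficients) != len(xs):
        raise ValueError("need one X value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, xs))

# Hypothetical model with k = 3 predictors (values invented for illustration).
b0 = 1.0
b = [2.5, -0.4, 0.1]

base = predict(b0, b, [10, 5, 20])
bumped = predict(b0, b, [11, 5, 20])   # X_1 raised by 1, the others unchanged

print(base)           # 26.0
print(bumped - base)  # 2.5, i.e. exactly b_1
```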
The Multiple Regression Model
In general, the multiple regression equation of Y on X_1, X_2, ..., X_k is given by:
Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k
Use the regression table for the SAT data to predict an incoming student's College GPA if his High School GPA = 3.5, SAT = 1300, and Letter quality = 9.
From the table we can read that the predictive model is:
College GPA = intercept + (coefficient of GPA High)*(HS GPA) + (coefficient of SAT)*(SAT) + (coefficient of Letters)*(Letters)
College GPA = -0.15326 + 0.376351*3.5 + 0.001227*1300 + 0.022684*9 = 2.963
Thus the predicted College GPA is 2.963.
Important: WE USED ALL THREE VARIABLES!
How Good Is the Regression?
Once a multiple regression equation has been constructed, one can check how good it is (in terms of predictive ability) by examining the coefficient of determination (R-square, R^2). R-square always lies between 0 and 1. Regression software reports it whenever a regression procedure is run.
The closer R^2 is to 1, the better the model and its predictions.
In our case of the SAT data, the predictive model of College GPA based on High School GPA, SAT, and letters has R^2 = 0.3997, which is considered a decent model.
Breakdown of R^2
from 0 to 0.2      (poor)
from 0.2 to 0.4    (decent)
from 0.4 to 0.6    (good)
from 0.6 to 0.85   (very good)
from 0.85 to 1     (excellent)
Let us practice a bit:
In general, the multiple regression equation of Y on X_1, X_2, ..., X_k is given by:
Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k
High School GPA   SAT    Letters   Prediction (round to 2 decimal places)
2.5               1100   8         2.32
3.5               1100   8         2.70
4.0               900    9         2.66
2.5               1500   10        2.85
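The practice rows can be checked in a few lines. This sketch assumes the SAT.txt coefficients with the extra digits used in the worked example (-0.15326, 0.376351, 0.001227, 0.022684); the function name is illustrative.

```python
# Coefficients of the SAT.txt model: Intercept, GPA High, SAT, Letters.
INTERCEPT = -0.15326
B_GPA, B_SAT, B_LETTERS = 0.376351, 0.001227, 0.022684

def predict_college_gpa(hs_gpa, sat, letters):
    """College GPA = b_0 + b_1*(HS GPA) + b_2*(SAT) + b_3*(Letters)."""
    return INTERCEPT + B_GPA * hs_gpa + B_SAT * sat + B_LETTERS * letters

# (High School GPA, SAT, Letters) for each practice row.
students = [(2.5, 1100, 8), (3.5, 1100, 8), (4.0, 900, 9), (2.5, 1500, 10)]
predictions = [round(predict_college_gpa(*s), 2) for s in students]
print(predictions)  # [2.32, 2.7, 2.66, 2.85]
```

Using coefficients rounded to five decimals instead can shift the last digit (the fourth row would come out as 2.86), so it pays to keep the extra digits until the final rounding.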
We start with a review: the data in question are SAT.txt. Each of the three variables (high school GPA, SAT score and letters) had a positive influence on the college GPA. In other words, if one would like to predict the performance of an incoming freshman, each of these three predictors would be relevant. To combine all three predictors into a single prediction of College GPA, we use multiple regression.
Practice
Use the Cars04-1 data. Your TASK is to use engine size (litres), cylinders, horsepower, and weight (pounds) to predict the retail price of a car. Create the regression table.
Using the table, predict the retail price of a car if the car's engine size = 4 litres, and the car has 6 cylinders, 400 horsepower, and a weight of 3200 pounds.
Retail price = -31602.488 + (-6504.722)*engine size + 3547.161*cylinders + 162.451*horsepower + 8.508*weight
= -31602.488 + (-6504.722)*4 + 3547.161*6 + 162.451*400 + 8.508*3200
= 55,867.59, or about $55,868.
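The same arithmetic as a short script. It uses the coefficients exactly as displayed (three decimals), so the cents may differ slightly from a result computed with Excel's full-precision coefficients; the dictionary keys are illustrative names.

```python
# Coefficients read from the Cars04-1 regression table, as displayed.
INTERCEPT = -31602.488
COEFFICIENTS = {
    "engine_size": -6504.722,   # per litre
    "cylinders": 3547.161,
    "horsepower": 162.451,
    "weight": 8.508,            # per pound
}
car = {"engine_size": 4, "cylinders": 6, "horsepower": 400, "weight": 3200}

price = INTERCEPT + sum(COEFFICIENTS[k] * car[k] for k in COEFFICIENTS)
print(f"${price:,.2f}")   # $55,867.59
```

Note the negative coefficient on engine size: with cylinders, horsepower and weight held fixed, a larger engine lowers the predicted price in this model.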
Back to the SAT.txt data: after performing multiple regression with variable Y = College GPA we get the table below. Imagine a student with a good High School GPA = 3.5, SAT = 1300, and a letter quality of 9. The table allows us to predict his College GPA!

Regression Statistics
Multiple R          0.63225
R Square            0.39974
Adjusted R Square   0.38098
Standard Error      0.58948
Observations        100

ANOVA
             df   SS        MS        F         Significance F
Regression   3    22.2144   7.40479   21.3098   1.2E-10
Residual     96   33.3583   0.34748
Total        99   55.5727

            Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept   -0.15326      0.32294         -0.47459  0.63616  -0.79429   0.48776    -0.79429     0.48776
GPA High    0.37635       0.11426         3.29377   0.00139  0.14954    0.60316    0.14954      0.60316
SAT         0.00123       0.0003          4.04636   0.00011  0.00063    0.00183    0.00063      0.00183
Letters     0.02268       0.05098         0.44495   0.65736  -0.07851   0.12388    -0.07851     0.12388

From this table we can read that the predictive model is:
College GPA = -0.15326 + 0.37635*(GPA High) + 0.00123*(SAT) + 0.02268*(Letters)
In our case, keeping one more digit of each coefficient (0.376351, 0.001227, 0.022684), this becomes:
College GPA = -0.15326 + 0.376351*3.5 + 0.001227*1300 + 0.022684*9 = 2.963
Thus the predicted College GPA is 2.963.
Question: Clearly we cannot be 100% sure that an incoming student with these credentials will have a College GPA of exactly 2.96. The prediction 2.96 is only an approximation. But how accurate is this approximation?
The regression table comes to the rescue: the number we use is the Standard Error in the Regression Statistics block, 0.589. In other words, for these particular credentials we can expect that the student's College GPA will be 2.96 +/- 0.589. Another way to state this: the predicted GPA is in the interval [2.37, 3.55].
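Several entries of the regression output are tied together by simple formulas, which makes a handy sanity check when reading such a table. The sketch below recomputes R Square, Multiple R, F and the Standard Error from the ANOVA sums of squares, then forms the prediction +/- Standard Error interval used in this course; all inputs are copied from the SAT.txt output, and the variable names are illustrative.

```python
import math

# Numbers copied from the SAT.txt regression output.
ss_regression, ss_residual, ss_total = 22.2144, 33.3583, 55.5727
df_regression, df_residual = 3, 96

r_squared = ss_regression / ss_total            # R Square
multiple_r = math.sqrt(r_squared)               # Multiple R
ms_regression = ss_regression / df_regression   # MS (Regression)
ms_residual = ss_residual / df_residual         # MS (Residual)
f_stat = ms_regression / ms_residual            # F
standard_error = math.sqrt(ms_residual)         # Standard Error

print(round(r_squared, 4))       # 0.3997
print(round(multiple_r, 3))      # 0.632
print(round(f_stat, 2))          # 21.31
print(round(standard_error, 3))  # 0.589

# Interval for the student with HS GPA 3.5, SAT 1300, letters 9.
prediction = 2.963
low, high = prediction - standard_error, prediction + standard_error
print(round(low, 2), round(high, 2))  # 2.37 3.55
```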
A bit more practice (using the SAT.txt regression table above)
Q1. What would be the interpretation of the coefficient GPA High = 0.376?
A: For each point increase in High School GPA (with SAT and letters unchanged) we expect a 0.376 point increase in College GPA.
Q2. What is the interpretation of the number -0.153?
A: The intercept has no interpretation here (it is impossible to have a student with zero High School GPA and zero SAT going to college).
A bit more practice (using the SAT.txt regression table above)
Imagine the following scenario: Student A's High School GPA is 1 point higher than Student B's. On the other hand, Student B's SAT score is higher by 200 points. Their letters are of the same strength.
Q3. Which of the two students will have a higher predicted College GPA?
A: Student A will gain 0.37635 points due to his High School GPA, and Student B will gain 200*0.00123 = 0.246 points due to his superior SAT. Overall, Student A's predicted GPA will be higher by about 0.13 points (since 0.37635 - 0.246 = 0.13035).
Q4. What are the predicted GPAs for students A and B?
A: It is impossible to state the predictions for students A and B, since we do not know their actual credentials.
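Because the model is linear, the comparison only needs the differences between the two students, not their actual credentials. A minimal sketch using the table's coefficients to five decimals; a hand calculation with more heavily rounded coefficients can differ in the second decimal.

```python
B_GPA = 0.37635   # coefficient of GPA High
B_SAT = 0.00123   # coefficient of SAT

gain_a = B_GPA * 1      # Student A: High School GPA higher by 1 point
gain_b = B_SAT * 200    # Student B: SAT higher by 200 points
difference = gain_a - gain_b   # positive means Student A is predicted higher

print(round(gain_a, 3), round(gain_b, 3), round(difference, 3))
```

The intercept and the letters term cancel in the subtraction, which is why the difference can be computed even though the individual predictions cannot.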
Using the Sail boat data, we can make the following chart and regression table.

Regression Statistics
Multiple R          0.92355
R Square            0.85294
Adjusted R Square   0.84477
Standard Error      4.41382
Observations        20

ANOVA
             df   SS       MS       F       Significance F
Regression   1    2033.9   2033.9   104.4   6E-09
Residual     18   350.67   19.482
Total        19   2384.6

            Coefficients  Standard Error  t Stat   P-value  Lower 95%
Intercept   -18.016       4.1006          -4.394   0.0004   -26.63
Feet        1.01286       0.0991          10.218   6E-09    0.8046

Observe:
Chart: The line equation on the chart is Y = 1.0129X - 18.016, with R^2 = 0.8529. The chart offers visual information; we can actually see the dots (i.e. sail boats), the trend line and how well they fit. It does not extend beyond a 1-dimensional input X.
Table: The intercept is -18.016 and the coefficient next to Feet is 1.0129. Equation of the line: Y = 1.0129X - 18.016, with R Square = 0.8529. There is no visualization, but the table contains many more numbers, some of which we have already used (and some of which we will use soon). It easily extends to d-dimensional input.
More practice (using the Sail boat regression table and chart above)
Given the table and chart, answer the following questions:
Q1. What is the predicted weight for a sail boat that is 30 feet long?
A: About 12370 pounds. Y = 1.01286*30 - 18.016 = 12.3698, in thousands of pounds, so 12.3698 * 1000 = 12369.8, or about 12370 pounds.
Q2. This prediction comes with a certain error estimate. What is it? In other words, what is the interval prediction for this weight?
A: The error is 4.41 (thousand pounds), thus the interval is [12.3698 - 4.41, 12.3698 + 4.41] = [7.9598, 16.7798]. Remember that the units are thousands of pounds: [7.9598*1000, 16.7798*1000] = [7959.8, 16779.8], or about [7960, 16780] pounds after rounding.
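The two answers above, including the conversion from thousands of pounds to pounds, can be reproduced as follows. The constants come from the sail boat table; the function name is illustrative.

```python
SLOPE = 1.01286          # coefficient of Feet
INTERCEPT = -18.016
STANDARD_ERROR = 4.41    # regression Standard Error, rounded as in the answer
UNIT = 1000              # the model's Y is in thousands of pounds

def predict_weight(feet):
    """Predicted weight in thousands of pounds."""
    return SLOPE * feet + INTERCEPT

estimate = predict_weight(30)
low, high = estimate - STANDARD_ERROR, estimate + STANDARD_ERROR

print(round(estimate * UNIT))                 # 12370
print(round(low * UNIT), round(high * UNIT))  # 7960 16780
```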
More practice (using the Sail boat regression table and chart above)
Q3. What is the interpretation of the slope 1.01?
A: For each additional foot of length, the boat's predicted weight increases by about 1010 pounds.
Q4. What is the interpretation of the intercept -18.016?
A: The intercept has no real-life interpretation here (no sailboat is zero feet long).