Business Statistics 41000:


1 Business Statistics 41000: Plotting and Summarizing Bivariate Data Drew D. Creal University of Chicago, Booth School of Business Week 2: January 17 and 18,

2 Class information Drew D. Creal Office: 404 Harper Center Office hours: me for an appointment Office phone: Course homepage: 2

3 Course schedule
Week # 1: Plotting and summarizing univariate data
Week # 2: Plotting and summarizing bivariate data
Week # 3: Probability 1
Week # 4: Probability 2
Week # 5: Probability 3
Week # 6: In-class exam
Week # 7: Statistical inference 1
Week # 8: Statistical inference 2
Week # 9: Simple linear regression
Week # 10: Multiple linear regression

4 Outline of today's topics
I. Plotting bivariate data (AWZ p )
   The two-way table for categorical variables
   Scatter plots for numeric variables
II. Summarizing bivariate data
   Tables
III. Covariance and correlation (AWZ p )
   Sample covariance
   Sample correlation
   Properties of the correlation coefficient
   Graphical depictions of correlation
   Correlation matrix

5 Outline of today's topics
IV. Linearly related variables
   Linear functions
   Mean and variance of a linear function
   The sample mean and variance of a linear function
   Linear combinations
   Mean and variance of a linear combination
V. Linear regression (AWZ p )

6 Looking at Two Variables Last week, we discussed how to plot and analyze one variable. We found this to be a helpful way of understanding our data. In many practical situations, however, we need to look at two (or more) variables. 6

7 The Two-Way Table Let's say we would like to investigate the relationship (if any) between two categorical variables x and y. If x has two categories and y has two categories, then there are only four possible combinations of x and y. We can then count how often each of the 4 combinations occurs in our data. If instead x has \(N_x\) possible outcomes and y has \(N_y\) possible outcomes, then the total number of possible combinations of x and y together is \(N_x N_y\). We can often display these counts in a table.

8 The Two-Way Table
Consider the simp and cola variables from the British marketing data. Remember that 0 stands for "does not".

Counts:
             simp = 0   simp = 1   Total
cola = 0        387         35       422
cola = 1        432        146       578
Total           819        181      1000

Percentages of all respondents:
             simp = 0   simp = 1   Total
cola = 0       38.7%       3.5%     42.2%
cola = 1       43.2%      14.6%     57.8%
Total          81.9%      18.1%    100.0%

For example, 146 of the 1000 respondents (14.6%) drink cola and watch the Simpsons.

9 A conditional Two-Way Table

Counts:
             simp = 0   simp = 1   Total
cola = 0        387         35       422
cola = 1        432        146       578
Total           819        181      1000

Column percentages:
             simp = 0   simp = 1   Total
cola = 0      47.25%     19.34%    42.20%
cola = 1      52.75%     80.66%    57.80%
Total           100%       100%      100%

In the second table, we display the percentage of each column. Conditional on those respondents who do not watch the Simpsons (simp = 0), 387 out of 819 or 47.25% do not drink cola.

10 A conditional Two-Way Table

Counts:
             simp = 0   simp = 1   Total
cola = 0        387         35       422
cola = 1        432        146       578
Total           819        181      1000

Row percentages:
             simp = 0   simp = 1   Total
cola = 0      91.70%      8.30%     100%
cola = 1      74.74%     25.26%     100%
Total         81.90%     18.10%     100%

Alternatively, we can condition on each row. For example, conditional on drinking cola (cola = 1), we know that 146 out of 578 respondents or 25.26% watch the Simpsons.
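The joint, column, and row percentages above can be sketched in a few lines of Python. This is only an illustration: the cell counts 35 and 432 are not printed on the slides and are inferred here from the stated totals and percentages (3.5% of 1000, and 819 - 387), and the dictionary layout is an assumption of this sketch.

```python
# Cell counts keyed by (cola, simp). 387 and 146 are quoted on the slides;
# 35 and 432 are inferred from the totals (an assumption of this sketch).
counts = {
    (0, 0): 387,  # cola = 0, simp = 0
    (0, 1): 35,   # cola = 0, simp = 1
    (1, 0): 432,  # cola = 1, simp = 0
    (1, 1): 146,  # cola = 1, simp = 1
}

total = sum(counts.values())
row_totals = {c: counts[(c, 0)] + counts[(c, 1)] for c in (0, 1)}  # totals by cola
col_totals = {s: counts[(0, s)] + counts[(1, s)] for s in (0, 1)}  # totals by simp

# Joint percentages: each cell divided by the grand total.
joint_pct = {k: 100 * v / total for k, v in counts.items()}
# Column percentages: condition on simp (divide by the column total).
col_pct = {(c, s): 100 * counts[(c, s)] / col_totals[s] for (c, s) in counts}
# Row percentages: condition on cola (divide by the row total).
row_pct = {(c, s): 100 * counts[(c, s)] / row_totals[c] for (c, s) in counts}

print(f"cola = 0 given simp = 0: {col_pct[(0, 0)]:.2f}%")
print(f"simp = 1 given cola = 1: {row_pct[(1, 1)]:.2f}%")
```

The only difference between the three tables is the denominator: the grand total, the column total, or the row total.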

11 Social grades of Great Britain
The variable soc stands for socioeconomic status and there are six categories:
A: Higher managerial, administrative and professional
B: Intermediate managerial, administrative and professional
C1: Supervisory, clerical and junior managerial, administrative and professional
C2: Skilled manual workers
D: Semi-skilled and unskilled manual workers
E: State pensioners, casual and lowest grade workers, unemployed with state benefits only

12 The Two-Way Table Using cigs and soc from the British marketing data, how does social grade relate to cigarette use?
[Table: row percentages of cigs (0, 1) within each of the six social grades. In table order, the cigs = 1 column reads 25.00%, 19.87%, 25.81%, 29.36%, 37.82%, 35.83%, and each row sums to 100%. Overall, 71.20% have cigs = 0 and 28.80% have cigs = 1.]
Notice that this is a conditional Two-Way table and these are row percentages. Lower social grades seem to smoke more cigarettes.

13 The Scatter Plot For numeric variables, we can use a scatter plot. Each row of the data (see beer.xls) corresponds to an individual and their drinking ability. Each person has recorded the number of beers they can drink (nbeer) and their weight (weight). Do you think there is a relationship?

14 The Scatter Plot For numeric variables, we can use a scatter plot. In a scatter plot, each point corresponds to an observation. Weight is on the horizontal axis. The number of beers is on the vertical axis.
[Figure: "Anonymous survey of MBA drinking", number of beers versus weight, with one point marked "Outlier?"]

15 The Scatter Plot Are returns on a mutual fund related to the market? Each point corresponds to a monthly return.
[Figure: scatter plot of VWNDX versus GSPC]

16 Mutual funds
Consider the mutual fund data (mutualfundreturns.xls). We have monthly data on 12 different assets from July 1996 to Dec

Ticker Symbol   Fund name
TWEIX           American Century Equity Income (large value)
BRKA            Berkshire Hathaway Holding Company A shares
DGRIX           Dreyfus Growth & Income (large growth)
LBF             DWS Global High Income Fund (bonds)
FTRNX           Fidelity Trend Fund (large growth)
JAVLX           Janus Twenty (large growth)
OPGSX           Oppenheimer Gold & Special Minerals
PRTBX           Permanent Portfolio Treasury Bill (ultrashort bonds)
PTTRX           Pimco Funds Total Return (intermediate bonds)
PINCX           Putnam Income
GSPC            S&P 500 index
VWNDX           Vanguard Windsor (large value)

17 It is common in finance to take a series of assets and plot the sample mean versus the sample standard deviation.
[Figure: mean versus std. dev.; labeled points: DWS Global, Oppenheimer G&M, American Century, Berkshire Hathaway, Janus Twenty, PIMCO Bond Fund, Putnam Bond, Treasury Bill Fund, Vanguard Windsor, Fidelity, S&P500, Dreyfus]
If you were an asset manager, where would you want your fund to be located on this plot?

18 The Scatter Plot Let's compare the mean and std. dev. of portfolios from several countries (see conret.xls).
[Figure: mean versus std. dev.; labeled points: Hong Kong, Malaysia, The Netherlands, Sweden, Swiss, Denmark, USA, Belgium, France, Germany, Norway, Singapore, Australia, Ireland, Austria, Finland, Canada, United Kingdom, New Zealand, Spain, Italy, Japan]
Monthly returns from 1988 to

19 Comparing numeric and categorical variables How do you relate a numeric variable versus a categorical variable? This is not obvious at first glance. Our first option is to bin the numeric variable and transform it into a categorical variable. Then, we can use a Two-Way Table. 19

20 Comparing numeric and categorical variables Consider the age and cigs variables from the British marketing data. What is the relationship between age and cigarette usage?
[Table: row percentages of cigs (0, 1) within eight age bins. In table order, the cigs = 1 column reads 49.02%, 36.36%, 32.31%, 35.24%, 20.24%, 8.87%, 11.90%, 0.00%. Overall, 71.20% have cigs = 0 and 28.80% have cigs = 1.]

21 Summaries for bivariate data Two-way tables and scatter plots can help us understand the information contained in our data. However, they do not describe the strength of evidence. For categorical variables, we can use the information contained in our tables, but there is no single way to do this. For numeric variables, two important summaries are the sample covariance and sample correlation.

22 Covariance and correlation 22

23 Measuring strength of evidence for numeric data: Covariance and Correlation In the beer data, it looks like there is a relationship. The relationship looks linear in the sense that it looks like we could draw a straight line with positive slope through the plot.
[Figure: "Anonymous survey of MBA drinking", number of beers versus weight]

24 Covariance and Correlation Covariance and correlation summarize how strong a linear relationship there is between two variables. Consider any two numeric variables x and y. In the beer example, we could let x be the number of beers and y be each individual's weight.

25 The sample covariance The sample covariance between data x and y is defined as
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
In words, covariance is the average product of the deviations from the means. Remember in Lecture # 1 we demonstrated how to compute expressions like \((x_i - \bar{x})\) and \((y_i - \bar{y})\). What are the units of the covariance?

26 The sample correlation The sample correlation between data x and y is defined as
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
The sample correlation is simply the sample covariance divided by the product of the standard deviations of x and y. Can anyone tell me what the units of measurement are?

27 Properties of the sample correlation The sample correlation always lies between -1 and 1, i.e. \(-1 \le r_{xy} \le 1\). The closer \(r_{xy}\) is to 1, the stronger the linear relationship is with a positive slope. When one variable increases, the other tends to increase. The closer \(r_{xy}\) is to -1, the stronger the linear relationship is with a negative slope. When one variable increases, the other tends to decrease. A correlation of 1 is a straight line with positive slope.

28 Correlation Compare the mutual fund data and the beer data. Which appears to be more correlated?
[Figure: two scatter plots side by side: VWNDX versus GSPC, and "Anonymous survey of MBA drinking" (number of beers versus weight)]

29 Correlation
Sample correlation between VWNDX and S&P500 =
Sample correlation between nbeer and weight =
The larger correlation between VWNDX and S&P500 indicates that the linear relationship is STRONGER.
[Figure: two scatter plots side by side: VWNDX versus GSPC, and "Anonymous survey of MBA drinking" (number of beers versus weight)]

30 Correlation: more examples (100 simulated data points)
[Figure: scatter plots of Y1 versus X1 and Y2 versus X2]
Sample correlation between \(x_1\) and \(y_1\) is 0.1
Sample correlation between \(x_2\) and \(y_2\) is

31 Correlation: more examples (100 simulated data points)
[Figure: scatter plots of Y3 versus X3 and Y4 versus X4]
Sample correlation between \(x_3\) and \(y_3\) is
Sample correlation between \(x_4\) and \(y_4\) is

32 Correlation: more examples (100 simulated points)
[Figure: two scatter plots, each labeled with its sample correlation]

33 Correlation: be cautious IMPORTANT: Correlation only measures a LINEAR relationship. Clearly, the variables \(x_5\) and \(y_5\) are highly (nonlinearly) related.
[Figure: scatter plot of Y5 versus X5]
Correlation between \(x_5\) and \(y_5\) =
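A small sketch of the same warning, using made-up data rather than the simulated points from the slide: here y is an exact function of x (y = x squared), yet the sample correlation is zero because the relationship is not linear. The helper name `sample_corr` is introduced only for this illustration.

```python
import math

def sample_corr(x, y):
    """Sample correlation r_xy = s_xy / (s_x * s_y), using n - 1 denominators."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical data: x is symmetric around zero and y depends on x exactly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = sample_corr(x, y)
print(f"correlation between x and x^2: {abs(r):.3f}")
```

The positive and negative products of deviations cancel exactly, so a correlation near zero rules out a linear relationship but not a relationship in general.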

34 Correlation: more examples Data on monthly returns for different countries. We have a total of 22 countries. Correlation is 0.65 for Canada and the USA.
[Figure: scatter plot of Canada versus USA monthly returns]

35 The sample correlation matrix The sample correlation matrix is a table of all sample correlations between each pair of variables.
[Table: sample correlation matrix for australia, austria, belgium, canada, denmark]
Why are all the diagonal entries equal to one? Why are the upper and lower off-diagonals the same?

36 The sample covariance matrix The sample covariance matrix is a table of all sample covariances between each pair of variables.
[Table: sample covariance matrix for australia, austria, belgium, canada, denmark]
The diagonal elements of the sample covariance matrix are equal to the variances of each variable.

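37 Using the correlation and covariance formulas Let's return to our simple example from lecture # 1.
[Table: x, \((x - \bar{x})\), y, \((y - \bar{y})\)]
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
\[ = \frac{1}{3}\left[(0.02)(0.04) + (0.01)(-0.02) + (-0.01)(0.02) + (-0.02)(-0.04)\right] \]
\[ = \frac{1}{3}\left[0.0008 - 0.0002 - 0.0002 + 0.0008\right] \]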

38 Using the correlation and covariance formulas
\[ s_{xy} = \frac{1}{3}(0.0008 - 0.0002 - 0.0002 + 0.0008) = \frac{1}{3}(0.0012) = 0.0004 \]
Each of the four points contributes to the covariance. Notice what determines the sign (-, +) and magnitude of each contribution. Let's look at where those points are relative to the sample means \(\bar{x}\) and \(\bar{y}\).

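39 [Figure: the four data points plotted around the sample means \(\bar{x}\) and \(\bar{y}\), with the quadrants labeled (i) upper right, (ii) upper left, (iii) lower left, (iv) lower right]
\((x_1 - \bar{x})(y_1 - \bar{y}) = 0.02 \times 0.04 = 0.0008\) (quadrant i)
\((x_2 - \bar{x})(y_2 - \bar{y}) = 0.01 \times (-0.02) = -0.0002\) (quadrant iv)
\((x_3 - \bar{x})(y_3 - \bar{y}) = (-0.01) \times 0.02 = -0.0002\) (quadrant ii)
\((x_4 - \bar{x})(y_4 - \bar{y}) = (-0.02) \times (-0.04) = 0.0008\) (quadrant iii)
Points in quadrant (iii) have both x and y less than their means, so they make a positive contribution to the covariance. Points in quadrant (i) have both x and y larger than their means, so they also make a positive contribution to the covariance. In (ii) and (iv) one of x and y is less than its mean and the other is greater, so we get a negative contribution. The sign (-, +) of the covariance tells us in which quadrants our data lies on average.

40 There are lots of data points in quadrants (i) and (iii), which make positive contributions. There are only a few data points in quadrants (ii) and (iv), which make negative contributions.
[Figure: scatter plot of VWNDX versus SP500 with quadrants (i) to (iv) marked around the sample means]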
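The quadrant argument can be checked numerically. The sketch below works directly with the deviations from the means quoted on the slide (the raw x and y values are not available here), and the `quadrant` helper is a name made up for this illustration.

```python
# Deviations from the sample means for the four observations on the slide.
dx = [0.02, 0.01, -0.01, -0.02]   # x_i - xbar
dy = [0.04, -0.02, 0.02, -0.04]   # y_i - ybar

def quadrant(a, b):
    """Label the quadrant of a point from the signs of its deviations."""
    if a > 0 and b > 0:
        return "i"
    if a < 0 and b > 0:
        return "ii"
    if a < 0 and b < 0:
        return "iii"
    if a > 0 and b < 0:
        return "iv"
    return "on an axis"

# Each observation contributes the product of its two deviations.
products = [a * b for a, b in zip(dx, dy)]
n = len(dx)
s_xy = sum(products) / (n - 1)

for i, (a, b, p) in enumerate(zip(dx, dy, products), start=1):
    sign = "positive" if p > 0 else "negative"
    print(f"point {i}: quadrant ({quadrant(a, b)}), {sign} contribution")
print(f"sample covariance: {s_xy:.4f}")
```

Points in quadrants (i) and (iii) give positive products and points in (ii) and (iv) give negative ones, so the sign of the sum reflects where most of the data sits.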

41 How changes in units affect the covariance Suppose we have data on 4 individuals' education and income: years of school income in dollars 34, ,200 54,950 98,100 The sample covariance is What are the units of measurement?

42 How changes in units affect the covariance What if we measure income in thousands of dollars instead of dollars? years of school income in thous. of dollars The sample covariance changes to !! What are the units of measurement? 42

43 Key points to remember about covariance A positive covariance implies that when one variable is above (below) its mean, the other variable tends to be above (below) its mean. A negative covariance implies that when one variable is above (below) its mean, the other variable tends to be below (above) its mean. The units of the covariance are (typically) not meaningful. The magnitude of the covariance is not easy to interpret. Focus on the sign of the covariance. It tells us in which quadrants, relative to the means, we should expect to see our data.

44 Return to our numerical example
[Table: x, \((x - \bar{x})\), y, \((y - \bar{y})\)]
We computed the sample covariance as \(s_{xy} = 0.0004\). The sample correlation is
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{0.0004}{(0.0365)(0.0183)} = 0.6 \]
where we divide the covariance by the two standard deviations.

45 How changes in units affect the correlation Consider the same data on individuals' education and income as above: years of school income in dollars 34, ,200 54,950 98,100 The sample correlation is 0.975. What are the units of measurement?

46 How changes in units affect the correlation Again, what if we measure income in thousands of dollars? years of school income in thous. of dollars The sample correlation remains 0.975!! What are the units of measurement? 46
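To see the contrast between covariance and correlation under a change of units, here is a short sketch. The education and income numbers below are hypothetical (they are not the values from the slide), and `sample_cov` and `sample_corr` are helper names introduced for the illustration.

```python
import math

def sample_cov(x, y):
    """Sample covariance with an n - 1 denominator."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical data for four individuals (not the slide's values).
years = [12, 14, 16, 18]
income_dollars = [30000, 42000, 55000, 90000]
income_thousands = [v / 1000 for v in income_dollars]

cov_dollars = sample_cov(years, income_dollars)
cov_thousands = sample_cov(years, income_thousands)
r_dollars = sample_corr(years, income_dollars)
r_thousands = sample_corr(years, income_thousands)

print(f"covariance ratio (dollars / thousands): {cov_dollars / cov_thousands:.1f}")
print(f"correlation difference: {abs(r_dollars - r_thousands):.2e}")
```

Rescaling income by 1000 rescales the covariance by 1000, while the correlation is unchanged because the same factor appears in the standard deviation in the denominator.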

47 Key points to remember about correlation The correlation always has the same sign as the covariance because we are simply dividing by standard deviations, which are always positive. The correlation can be more informative than the covariance because it is easier to interpret as a measure of strength. Correlation is unit-less and always lies between -1 and 1. Interpretation: close to 1 means a strong positive relationship. Interpretation: close to -1 means a strong negative relationship.

48 Linear functions 48

49 Linear functions We have seen data sets which display some kind of relationship between variables (e.g. Vanguard vs. S&P500). An exact linear relationship or linear function between two univariate variables is defined as
\[ y = c_0 + c_1 x \]
In this formula, \(c_0\) is a number called the intercept and \(c_1\) is a number called the slope. (NOTE: There is nothing special about the notation \(y\), \(x\), \(c_0\), and \(c_1\).)

50 Linear functions There are many reasons why we are interested in linear functions. As a first example, suppose we observed a sample of the variable x and we knew its sample mean and sample variance. Using this information, could we determine the sample mean and variance of y if \(y = c_0 + c_1 x\)? YES!

51 Celsius to Fahrenheit Suppose we have a sample of temperatures measured in Celsius but we wanted to convert them to Fahrenheit.
[Table: cel and fahr values]
The relationship between these variables is \(fahr = 32 + \frac{9}{5}\,cel\).
Sample mean: \(\overline{cel} = 32.5\). Sample standard deviation: \(s_{cel} = 20\).

52 Celsius to Fahrenheit If we plot the data using a scatter plot, what do we get?
[Figure: scatter plot of Fahr versus Celsius; the points lie on a straight line]
Note: the correlation is 1.

53 Definition of a linear function A variable y is a linear function of the variable x if
\[ y = c_0 + c_1 x \]
\(c_0\): the intercept
\(c_1\): the slope
We think of \(c_0\) and \(c_1\) as constants (i.e. fixed numbers) while x and y are allowed to vary.

54 Sample mean and variance of a linear function Suppose y is a linear function of x, \(y = c_0 + c_1 x\). How are the sample mean and variance (std. dev.) of y related to those of x? In other words, given \(\bar{x}\) and \(s_x^2\), what separate effects do multiplying by \(c_1\) and adding \(c_0\) have?

55 Sample mean and variance of a linear function Let us look at the temperature example where \(fahr = 32 + \frac{9}{5}\,cel\). We can see both effects graphically.
[Table: mean and std. dev. of cel, \(\frac{9}{5}\,cel\), and fahr]
[Figure: plots of Cel, 9/5 Cel, and Fahr]

56 Mean and variance of a linear function When we multiply by a constant (in this case \(c_1 = \frac{9}{5}\)), we affect (in this case increase) both the mean and the standard deviation proportionally. If we add a constant (in this case \(c_0 = 32\)), we simply increase the mean (by the value of the constant \(c_0\)) but leave the overall dispersion unchanged.

57 Seeing the effects graphically Here, I have simulated 1000 data points labeled x with \(\bar{x} = 1\) and \(s_x = 1\). These are in the top histogram.
[Figure: three stacked histograms, with sample means 1, 3, and 2]

58 Sample mean and variance of a linear function Two important formulas. Suppose \(y = c_0 + c_1 x\). Then
\[ \bar{y} = c_0 + c_1 \bar{x} \]
\[ s_y^2 = c_1^2 s_x^2, \qquad s_y = |c_1|\, s_x \]

59 Celsius to Fahrenheit Let's return to our temperature example where \(c_0 = 32\) and \(c_1 = \frac{9}{5}\).
\[ \overline{fahr} = c_0 + c_1\, \overline{cel} = 32 + \frac{9}{5}(32.5) = 32 + 58.5 = 90.5 \]
\[ s_{fahr}^2 = c_1^2\, s_{cel}^2 = \left(\frac{9}{5}\right)^2 (400) = 1296 \qquad (s_{fahr} = 36) \]

60 Formal proof
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \quad \text{(by definition of } \bar{y}\text{)} \]
\[ = \frac{1}{n} \sum_{i=1}^{n} (c_0 + c_1 x_i) \quad \text{(because } y_i = c_0 + c_1 x_i\text{)} \]
\[ = \frac{1}{n} \sum_{i=1}^{n} c_0 + \frac{1}{n} \sum_{i=1}^{n} c_1 x_i \]
\[ = \frac{n}{n}\, c_0 + c_1 \frac{1}{n} \sum_{i=1}^{n} x_i \]
\[ = c_0 + c_1 \bar{x} \quad \text{(by definition of } \bar{x}\text{)} \]
NOTE: you do not need to know this for any exam.

61 Formal proof
\[ s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{(by definition of } s_y^2\text{)} \]
\[ = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - c_0 - c_1 \bar{x})^2 \quad \text{(because } \bar{y} = c_0 + c_1 \bar{x}\text{)} \]
\[ = \frac{1}{n-1} \sum_{i=1}^{n} (c_0 + c_1 x_i - c_0 - c_1 \bar{x})^2 \quad \text{(because } y_i = c_0 + c_1 x_i\text{)} \]
\[ = \frac{1}{n-1} \sum_{i=1}^{n} (c_1 x_i - c_1 \bar{x})^2 \]
\[ = \frac{1}{n-1} \sum_{i=1}^{n} c_1^2 (x_i - \bar{x})^2 = c_1^2\, \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = c_1^2 s_x^2 \]
NOTE: you do not need to know this for any exam.

62 We can get the sample standard deviation by using the formula \(s_y^2 = c_1^2 s_x^2\) and then just taking the square root. Or, we can use our other formula directly: \(s_y = |c_1|\, s_x\). This is because the sample standard deviation is always the square root of the sample variance.

63 Example Suppose x has sample mean 100 and sample standard deviation 10. What are the sample mean, sample variance, and sample std. dev. of y when
1. \(y = 2x\)?
2. \(y = 5 + x\)?
3. \(y = 5 - 2x\)?
NOTE: Answers are on the next slide.

64 Example Suppose x has sample mean 100 and sample standard deviation 10. What are the sample mean, sample variance, and sample std. dev. of y?
1. \(y = 2x\): \(\bar{y} = 200\), \(s_y^2 = 400\), \(s_y = 20\)
2. \(y = 5 + x\): \(\bar{y} = 105\), \(s_y^2 = 100\), \(s_y = 10\)
3. \(y = 5 - 2x\): \(\bar{y} = -195\), \(s_y^2 = 400\), \(s_y = 20\)
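These answers can be checked numerically. The sketch below uses a hypothetical three-point sample, chosen only because it has sample mean 100 and sample standard deviation 10, and compares the summaries of y computed directly with the ones given by the two formulas. The function names are made up for the illustration.

```python
import math

def mean(v):
    return sum(v) / len(v)

def variance(v):
    """Sample variance with an n - 1 denominator."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

# Hypothetical sample with mean 100 and standard deviation 10.
x = [90, 100, 110]

def summarize(c0, c1):
    """Return (mean, variance, std dev) of y = c0 + c1*x, directly and by formula."""
    y = [c0 + c1 * a for a in x]
    direct = (mean(y), variance(y), math.sqrt(variance(y)))
    formula = (
        c0 + c1 * mean(x),                 # ybar = c0 + c1 * xbar
        c1 ** 2 * variance(x),             # s_y^2 = c1^2 * s_x^2
        abs(c1) * math.sqrt(variance(x)),  # s_y = |c1| * s_x
    )
    return direct, formula

for c0, c1 in [(0, 2), (5, 1), (5, -2)]:
    direct, formula = summarize(c0, c1)
    print(f"y = {c0} + ({c1})x: direct {direct}, formula {formula}")
```

The third case shows why the standard deviation formula needs the absolute value of the slope: a negative slope flips the sign of the mean term but cannot make the spread negative.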

65 Linear combinations 65

66 Linear combinations We may want a variable y to be a function of more than one variable. Assume we have k different variables \(x_1, \ldots, x_k\). A variable y is a linear combination if it is related to several other variables \(x_1, \ldots, x_k\) by the formula
\[ y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k \]
\(c_0\): the intercept
\(c_i\): a coefficient

67 Linear combinations You may occasionally see the following notation where we double-index the variable \(x_{ij}\). When we have a sample of size n, the index \(i = 1, \ldots, n\) runs over observations and the index \(j = 1, \ldots, k\) runs over variables. The i-th observation \(y_i\) is a linear combination if it is related to several other variables \(x_{i1}, \ldots, x_{ik}\) by the formula
\[ y_i = c_0 + c_1 x_{i1} + c_2 x_{i2} + \cdots + c_k x_{ik} \]
\(c_0\): the intercept
\(c_j\): a coefficient

68 Example: Portfolios Suppose you have 100 dollars to invest. Let \(x_1\) be the return on asset 1. If \(x_1 = 0.1\) and you put all your money into asset 1, then you will have \(100(1 + 0.1) = 110\) dollars at the end of the period. Let \(x_2\) be the return on asset 2. If \(x_2 = 0.15\) and you put all your money into asset 2, then you will have \(100(1 + 0.15) = 115\) dollars at the end of the period. What happens if you put half your money in asset 1 and half your money in asset 2?

69 Example: Portfolios At the end of the period you will have
\[ (0.5)\,100\,(1 + 0.1) + (0.5)\,100\,(1 + 0.15) = 100\,[1 + (0.5)(0.1) + (0.5)(0.15)] = 112.50 \]
dollars. In other words, if we put \(\frac{1}{2}\) of our money in asset 1 and \(\frac{1}{2}\) in asset 2, the return on the portfolio is
\[ R_p = \frac{1}{2} x_1 + \frac{1}{2} x_2 \]
The return on the portfolio is a linear combination of the returns on the individual assets.

70 Example: Portfolios In general, suppose you have M dollars to invest in two assets with returns \(x_1\) and \(x_2\). Let \(w_1\) be the fraction of your wealth that you choose to put in asset 1 and \(w_2\) the fraction you put in asset 2. Assume that our portfolio weights sum to one, \(w_1 + w_2 = 1\). At the end of the period you will have
\[ w_1 M(1 + x_1) + w_2 M(1 + x_2) = M[w_1 + w_2 + (w_1 x_1) + (w_2 x_2)] = M[1 + w_1 x_1 + w_2 x_2] \]
The return on our portfolio is
\[ R_p = w_1 x_1 + w_2 x_2 \]
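As a quick check of this algebra, the sketch below computes the end-of-period wealth asset by asset and compares it with M(1 + R_p), using the numbers from the two-asset example (100 dollars, returns of 0.10 and 0.15, equal weights). The function names are made up for this illustration.

```python
def end_wealth(M, weights, returns):
    """Wealth at the end of the period, summed asset by asset."""
    return sum(w * M * (1 + x) for w, x in zip(weights, returns))

def portfolio_return(weights, returns):
    """Portfolio return as a linear combination of the asset returns."""
    return sum(w * x for w, x in zip(weights, returns))

M = 100                  # dollars invested
weights = [0.5, 0.5]     # fractions of wealth; they sum to one
returns = [0.10, 0.15]   # returns on asset 1 and asset 2

rp = portfolio_return(weights, returns)
wealth = end_wealth(M, weights, returns)

print(f"portfolio return: {rp:.3f}")
print(f"end-of-period wealth: {wealth:.2f} dollars")
print(f"M * (1 + R_p):        {M * (1 + rp):.2f} dollars")
```

The two wealth figures agree only because the weights sum to one; with weights that do not sum to one, the leading term would be M(w_1 + w_2) rather than M.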

71 Example: Portfolio
\[ R_p = w_1 x_1 + w_2 x_2 \]
In this linear combination, the coefficients are the portfolio weights, or the percentage of our wealth placed in each asset. When discussing portfolios, it is common to change notation and use \(w_i\) for the weights instead of \(c_i\). In this class, we will always assume that the portfolio weights sum to one. Can an asset's weight be negative?

72 Example: Portfolio Suppose we have m possible assets. Let \(x_i\) denote the return on the i-th asset. Let \(w_i\) denote the percentage of wealth invested in the i-th asset. Then, the return on the portfolio is:
\[ R_p = \sum_{i=1}^{m} w_i x_i \]
The return on the portfolio is a linear combination of individual asset returns, where the coefficients are equal to the fraction of wealth invested.

73 Mean and variance of a linear combination First, consider the case of two variables \(x_1\) and \(x_2\). Suppose
\[ y = c_0 + c_1 x_1 + c_2 x_2 \]
Then
\[ \bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 \]
\[ s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2} \]
Notice that when we have two variables we must take their covariance \(s_{x_1 x_2}\) into account.

74 Reminder about correlations and covariances Remember, we defined the sample correlation of two variables \(x_1\) and \(x_2\) as the sample covariance divided by the standard deviations
\[ r_{x_1 x_2} = \frac{s_{x_1 x_2}}{s_{x_1} s_{x_2}} \]
So, if we know the sample correlation and the sample standard deviations, we can determine the sample covariance.
\[ s_{x_1 x_2} = r_{x_1 x_2}\, s_{x_1} s_{x_2} \]
If we know the sample standard deviations, then all we need to know is either the sample correlation or the sample covariance.
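The two-variable formulas can be verified on any small data set. In the sketch below the return series are hypothetical and the helper names are introduced only for this illustration; it compares the mean and variance of y computed directly from the y values with the values given by the formulas, and rebuilds the covariance from the correlation.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance with an n - 1 denominator; cov(x, x) is the variance."""
    xbar, ybar = mean(x), mean(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical returns on two assets.
x1 = [0.02, -0.01, 0.03, 0.00, -0.02]
x2 = [0.01, 0.02, -0.01, 0.03, 0.00]
c0, c1, c2 = 0.0, 0.5, 0.5

y = [c0 + c1 * a + c2 * b for a, b in zip(x1, x2)]

# Mean and variance computed directly from y.
mean_direct = mean(y)
var_direct = cov(y, y)

# Mean and variance from the linear-combination formulas.
mean_formula = c0 + c1 * mean(x1) + c2 * mean(x2)
var_formula = c1 ** 2 * cov(x1, x1) + c2 ** 2 * cov(x2, x2) + 2 * c1 * c2 * cov(x1, x2)

# The covariance can be rebuilt from the correlation and standard deviations.
s1, s2 = math.sqrt(cov(x1, x1)), math.sqrt(cov(x2, x2))
r12 = cov(x1, x2) / (s1 * s2)
cov_from_r = r12 * s1 * s2

print(f"variance of y, direct:  {var_direct:.8f}")
print(f"variance of y, formula: {var_formula:.8f}")
```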

75 Example: country returns Consider building a portfolio using our monthly country returns data (conret.xls) and the two variables Hong Kong and USA. We place \(\frac{1}{2}\) of our wealth in each asset.
[Table: monthly values of honkong, usa, and port]
We obtain the portfolio return as
\[ R_p = \frac{1}{2}\, honkong + \frac{1}{2}\, usa \]
so that \(w_1 = 0.5\) and \(w_2 = 0.5\).

76 Example: country returns
[Table: monthly values of honkong, usa, and port]
The sample mean of the portfolio is
\[ \bar{R}_p = w_1\, \overline{honkong} + w_2\, \overline{usa} = 0.5\, \overline{honkong} + 0.5\, \overline{usa} \]

77 Example: country returns Let us compute the variance of our portfolio. The sample covariance matrix is:
[Table: sample covariance matrix of honkong and usa]
Remember that the sample variances are on the diagonal and the sample covariances are in the off-diagonal. Next, we apply the variance formula with \(w_1 = w_2 = 0.5\):
\[ s_{port}^2 = w_1^2 s_{honkong}^2 + w_2^2 s_{usa}^2 + 2 w_1 w_2 s_{honkong,usa} \]
which gives \(s_{port} = 0.046\).

78 Example: country returns What if we had put 25% into the US and 75% into Hong Kong? How is the variance of the portfolio affected? The covariance matrix does not change.
[Table: sample covariance matrix of honkong and usa]
Again, we apply the formula but with different weights, \(w_1 = 0.75\) and \(w_2 = 0.25\):
\[ s_{port}^2 = w_1^2 s_{honkong}^2 + w_2^2 s_{usa}^2 + 2 w_1 w_2 s_{honkong,usa} \]
which gives \(s_{port} = 0.058\).

79 Example: country returns Scatter plot of the mean and standard dev. for the portfolio with equal weights. It looks like the mean of the portfolio is half-way between the sample means of honkong and usa.
[Figure: mean versus standard dev. for honkong, port, and usa]

80 Example: country returns The sample standard dev. is less than half-way between \(s_{usa}\) and \(s_{honkong}\). What happened? Diversification!
[Figure: mean versus standard dev. for honkong, port, and usa]

81 Example: diversification Consider returns \(x_1\) and \(x_2\) and a portfolio \(y = \frac{1}{2} x_1 + \frac{1}{2} x_2\). At each point \((x_1, x_2)\), we plot the value y.
[Figure: scatter plot of \(x_2\) versus \(x_1\), with the portfolio value y shown at each point]
[Table: sample covariance matrix of \(x_1\) and \(x_2\)]
Using our formulas, the variance of y is: \(s_y^2 =\)
Why is the variance of y smaller than the variance of both \(x_1\) and \(x_2\)?

82 Example: diversification Consider returns \(x_1\) and \(x_2\) and a portfolio \(y = \frac{1}{2} x_1 + \frac{1}{2} x_2\). At each point \((x_1, x_2)\), we plot the value y.
[Figure: scatter plot of \(x_2\) versus \(x_1\), with the portfolio value y shown at each point]
[Table: sample covariance matrix of \(x_1\) and \(x_2\)]
Using our formulas, the variance of y is:
The covariance is now positive! Why is the variance of y similar to that of \(x_1\) and \(x_2\)?

83 Example: diversification Consider returns \(x_1\) and \(x_2\) and a portfolio \(y = \frac{1}{2} x_1 + \frac{1}{2} x_2\). At each point \((x_1, x_2)\), we plot the value y.
[Figure: scatter plot of \(x_2\) versus \(x_1\), with the portfolio value y shown at each point]
[Table: sample covariance matrix of \(x_1\) and \(x_2\)]
Using our formulas, the variance of y is:
Why is the variance of y smaller than the variance of both \(x_1\) and \(x_2\)?
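The questions on these three slides can be explored with the two-variable variance formula. The sketch below assumes, purely for illustration, that both assets have variance 1 and varies only their correlation; the numbers are hypothetical and are not the ones from the slides.

```python
def two_asset_variance(w1, w2, var1, var2, cov12):
    """Variance of y = w1*x1 + w2*x2 from the two-variable formula."""
    return w1 ** 2 * var1 + w2 ** 2 * var2 + 2 * w1 * w2 * cov12

# Hypothetical inputs: both assets have variance 1, equal weights.
var1 = var2 = 1.0
w1 = w2 = 0.5

results = {}
for r in (-0.8, 0.0, 0.8):
    # Covariance implied by the correlation: s_12 = r * s_1 * s_2.
    cov12 = r * (var1 ** 0.5) * (var2 ** 0.5)
    results[r] = two_asset_variance(w1, w2, var1, var2, cov12)
    print(f"correlation {r:+.1f}: portfolio variance {results[r]:.2f}")
```

With a strongly negative covariance the cross term cancels most of the individual variances, with zero covariance the variance is halved, and with a strongly positive covariance the portfolio is almost as variable as either asset on its own.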

84 Mean and variance of a linear combination Three right-hand side variables \(x_1\), \(x_2\), and \(x_3\). Suppose
\[ y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 \]
Then
\[ \bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 \]
\[ s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2\left[c_1 c_2 s_{x_1 x_2} + c_1 c_3 s_{x_1 x_3} + c_2 c_3 s_{x_2 x_3}\right] \]
Notice that the variance formula now has 3 covariance terms.

85 Example: mutual funds Vanguard Windsor (VWNDX), Pimco bond fund (PTTRX), DWS Global (LBF).
Portfolio: port = 0.1 VWNDX + 0.7 PTTRX + 0.2 LBF
[Table: sample covariance matrix of VWNDX, PTTRX, and LBF, together with the sample correlations \(r_{VWNDX,PTTRX}\), \(r_{VWNDX,LBF}\), and \(r_{PTTRX,LBF}\)]
We apply the variance formula with \(w_1 = 0.1\), \(w_2 = 0.7\), \(w_3 = 0.2\), and \(s_{VWNDX,PTTRX} = 6.6220\mathrm{e}{-5}\):
\[ s_{port}^2 = w_1^2 s_{VWNDX}^2 + w_2^2 s_{PTTRX}^2 + w_3^2 s_{LBF}^2 + 2 w_1 w_2 s_{VWNDX,PTTRX} + 2 w_1 w_3 s_{VWNDX,LBF} + 2 w_2 w_3 s_{PTTRX,LBF} \]

86 Example: mutual funds Scatter plot of the mean and standard dev. We created a new portfolio with a slightly higher return than PTTRX and also more risk as measured by the standard deviation.
[Figure: mean versus standard dev. for port, Pimco Bond Fund, Vanguard Windsor, and DWS global]

87 Further remarks on linear combinations For linear combinations with more than 3 right-hand side variables (say k variables), the mean and variance formulas can be generalized. The mean formula is:
\[ \bar{y} = c_0 + \sum_{j=1}^{k} c_j \bar{x}_j \]
The variance formulas take into account all pairwise combinations of the covariances. I will not ask you to calculate by hand any formulas with more than 3 right-hand side variables. If you take the portfolios class from the finance group, you will learn about building portfolios by taking linear combinations of assets. The goal is to choose the weights to build good portfolios that are on or close to the efficient frontier.
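One way to write the general variance formula is as a double sum over all pairs of variables, the sum over j and l of c_j c_l s_jl, where the diagonal entries s_jj are the variances. The sketch below is a minimal illustration with a hypothetical 3 by 3 covariance matrix (not the mutual fund values) and checks that the double sum agrees with the explicit three-variable formula. The function name `combo_variance` is made up for this example.

```python
def combo_variance(c, S):
    """Variance of y = c0 + sum_j c[j]*x_j, given the covariance matrix S of the x's.

    Every ordered pair (j, l) appears once, so each off-diagonal covariance
    is counted twice, matching the factor 2 in the hand formula.
    """
    k = len(c)
    return sum(c[j] * c[l] * S[j][l] for j in range(k) for l in range(k))

# Hypothetical covariance matrix (symmetric, variances on the diagonal).
S = [
    [4.0e-4, 1.0e-5, 2.0e-4],
    [1.0e-5, 1.0e-4, 3.0e-5],
    [2.0e-4, 3.0e-5, 9.0e-4],
]
w = [0.1, 0.7, 0.2]  # portfolio weights; they sum to one

general = combo_variance(w, S)

# Explicit three-variable formula for comparison.
explicit = (
    w[0] ** 2 * S[0][0] + w[1] ** 2 * S[1][1] + w[2] ** 2 * S[2][2]
    + 2 * (w[0] * w[1] * S[0][1] + w[0] * w[2] * S[0][2] + w[1] * w[2] * S[1][2])
)

print(f"double sum:       {general:.8f}")
print(f"explicit formula: {explicit:.8f}")
```

The same function works unchanged for any number of variables, which is why the double-sum form is convenient once k grows beyond 3.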

88 Linear regression 88

89 Linear regression This is data on 128 homes (MidCity.xls). It includes their sales price (in dollars) and interior size (in square feet).
[Figure: scatter plot of price versus size in square feet]

90 Linear regression Clearly, the data are correlated.
[Table: sample correlation matrix of price and size]
It looks like we could fit a line through the data. But which line? And what is the equation of that line? Why would we want to do this? Linear regression fits a line through data. (Univariate) linear regression helps us compute the coefficients \(c_0\) and \(c_1\) in the equation \(y = c_0 + c_1 x\).

91 Linear regression Let y be the house prices and x be the size of the house. When we run a regression, we obtain values for the intercept and slope coefficients given our data.
\[ y = \text{intercept} + \text{slope} \cdot x \]
[Table: estimated coefficients; the row "constant" gives the intercept and the row "sq. ft" gives the slope]

92 Linear regression Here is the data with the regression line drawn through it.
[Figure: scatter plot of price versus size in square feet, with the fitted regression line]

93 Linear regression formulas
\[ \text{slope} = \frac{s_{xy}}{s_x^2}, \qquad \text{intercept} = \bar{y} - \text{slope} \cdot \bar{x} \]
The formulas for the slope and intercept just use the sample mean, sample covariance, and sample variance. We will study this in more detail later in the class. The slope formula takes the covariance and standardizes it so that its units are (units of y)/(units of x). The intercept formula will make the regression line pass through the point \((\bar{x}, \bar{y})\).
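The slope and intercept formulas translate directly into code. The sketch below is only an illustration: the five size and price values are hypothetical (they are not taken from MidCity.xls), and `fit_line` is a name chosen for this example.

```python
def fit_line(x, y):
    """Return (intercept, slope) using slope = s_xy / s_x^2 and intercept = ybar - slope*xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - xbar) ** 2 for a in x) / (n - 1)
    slope = s_xy / s_xx
    intercept = ybar - slope * xbar
    return intercept, slope

# Hypothetical homes: size in square feet and price in dollars.
size = [1500, 1800, 2000, 2400, 2300]
price = [112000, 121000, 138000, 150000, 149000]

intercept, slope = fit_line(size, price)

# Prediction for a 2200 sq. ft home: plug the size into the fitted line.
predicted = intercept + slope * 2200
print(f"slope: {slope:.2f} dollars per square foot")
print(f"predicted price at 2200 sq. ft: {predicted:.0f} dollars")
```

Because the intercept is defined as the mean of y minus the slope times the mean of x, the fitted line always passes through the point of sample means, whatever the data.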

94 Regression and prediction You have a house on the market with size 2200 sq. ft. Can we predict at what price the house will sell? We could use the sample mean or median of prices, but this doesn't take size into account.
[Figure: histogram of house prices]

95 Regression and prediction Regression allows us to use information on size to form our prediction. We plug the size (2200 sq. ft) into our equation. Predicted price: *2200 = $ 144,
[Figure: scatter plot of price versus size in square feet]

96 Additional comments on regression Because we are using more information (in this case the information on the size of the home), the predictions we make are (hopefully!) better in some sense. Importantly, regression is based on the same concepts (sample means, sample covariances and variances) that we learned in today's lecture. It's simply an alternative way to use our information. There is nothing mysterious about it!

97 Limits of regression It is important to remember that regression has its limits. First, it matters which variable is the left-hand side variable y, i.e. it matters that we let y be house prices and not size of the house. You will get a different answer if you swap y and x. Second, correlation does not imply causation. Just because we regress y on x does not mean that changes in x cause changes in y.
