Statistics I
Exercises Lesson 3
Academic year 2015/16

1. The following table represents the joint (relative) frequency distribution of two variables: semester grade in the Estadística I course and # of hours per week that a student studied for that course.

   # of hours \ Grade   Suspenso   Aprobado   Notable   Sobresaliente
   2                    0.20       0.15       0.08      0.03
   3                    0.12       0.07       0.02      0.02
   4                    0.04       0.10       0.02      0.00
   5                    0.00       0.05       0.05      0.05

   a) Find the two marginal (relative) frequency distributions.
   b) Find the following conditional distributions: grade given each value of # of hours. The same for # of hours given each value of grade.
   c) Compute the average hours per week that a student has to study to obtain Notable. And to obtain Sobresaliente?

2. We have the following joint (relative) frequency distribution for bivariate observations (X, Y), where X = number of children and Y = monthly income (in EUR):

   # of children \ Income   0-1000   1000-2000   2000-3000   3000-5000
   0                        0.15     0.05        0.03        0.02
   1                        0.10     0.20        0.10        0.05
   2                        0.05     0.10        0.05        0.03
   3                        0.02     0.03        0.02        0.00

   a) Find the marginals.
   b) Find the conditional distribution of Y | X = 2. Compute the conditioned mean and variance.
   c) Find the conditional distribution of X | 1000 < Y < 2000. Compute the conditioned mean and variance.

3. The following table gives the values of the relative frequencies for the joint distribution of the variables X, representing the number of credit cards that a person owns, and Y, corresponding to the number of purchases made per week using these credit cards.

                # of purchases
   # cards   0      1      2      3      4
   1         0.08   0.13   0.09   0.06   0.03
   2         0.03   0.08   0.08   0.09   0.07
   3         0.01   0.03   0.06   0.08   0.08

   a) If we know that the study was conducted on 300 participants, compute the joint distribution of the absolute frequencies.
   b) Find the marginal distribution of Y. What is the mean value and the standard deviation of the number of weekly payments made using a credit card?
   c) Find the distribution of the number of credit cards owned by the persons in the study.
      Which is the most frequent number of credit cards owned by these persons?
   d) Compute the distribution of the number of weekly purchases made using a credit card for those persons having three cards. What is the mean of this distribution?

4. You have been given the following information on the sales of automobiles in 12 showrooms: the Comunidad Autónoma (CA) where the showroom is located, the average sales price for each car model in 2008 (in thousands of euros) and the percentage change in the sales volume between the years 2007 and 2008.

   CA                   Mean Price   Change
   Madrid               19.5         28.3
   Castilla y León      16.1         25.1
   Madrid               22.3         34.2
   Castilla-La Mancha   15.0         23.7
   Madrid               16.6         22.9
   Madrid               23.9         32.3
   Castilla-La Mancha   17.7         19.2
   Castilla-La Mancha   13.0         14.9
   Madrid               16.2         24.6
   Madrid               18.6         28.3
   Castilla y León      14.1         16.5
   Castilla-La Mancha   18.3         21.0

   a) Represent the contingency table for the variables CA and Change using four groups for the values of the second variable.
   b) Give the conditional frequencies for the variable Change when the variable CA takes the value Madrid.
   c) Now build the two-dimensional table corresponding to the variables Mean Price and Change. Use 3 groups for the values of the variable Mean Price, and provide values for the marginal frequencies of both variables (grouped in the suggested manner).
   d) Indicate the relative frequencies of the variable Mean Price conditional on the variable Change taking values between 19.5 and 29.5.
   e) Compute the average of the values of the variable Change conditional on the Mean Price taking values between 20 and 24.

5. A survey has been conducted on 80 married men. They have been asked the number of their siblings (X) and children (Y). The results are shown in the following table:

   X   Y   # of men
   0   1   4
   1   1   3
   1   3   4
   2   0   2
   2   2   9
   2   3   3
   3   1   6
   3   2   12
   3   3   5
   3   4   2
   4   0   2
   4   1   7
   4   2   15
   4   4   1
   5   2   5

   a) Construct a two-dimensional table for these two variables and represent their distribution using a scatterplot.
   b) From the previous graph, what values of the covariance between X and Y would you consider more likely? And for the (linear) correlation coefficient? Justify your answer.
   c) Compute the correlation coefficient.

6. The following data represent the number of years of experience (x) and the annual profits (y, in millions of euros) of several companies:

   years of experience   5   15   24   16   19   3   6   12   27   13
   profit                4   4    9    7    6    2   3   3    7    5

   a) Draw a scatterplot for these data.
   b) Find the correlation coefficient, r_xy.
   c) Is the (linear) relationship positive/negative, strong/weak?

7. In a certain city, data on family size (x) and the number of washing powder packages (y) used per month by each family have been collected:

   family size   5   2   4   3   2   4   5   3   6   3   4   5   6
   # packages    2   1   3   2   1   2   3   2   4   1   3   3   3

   a) Determine the correlation coefficient between x and y. Is the (linear) relationship positive/negative, strong/weak?
   b) Write a contingency table with the absolute and relative frequency distributions for these data.
   c) Compute the correlation coefficient again, using only the distribution of relative frequencies. Verify that you obtain the same result as in a).

8. The following data show the number of inventories per year and the percentage of sales for five companies:

   Company   # of inventories   % of sales
   A         3                  10
   B         4                  8
   C         5                  12
   D         6                  15
   E         7                  13

   Determine the correlation coefficient between the number of inventories and the percentage of sales. Is the (linear) relationship positive/negative, strong/weak?

9. The owner of a souvenir shop suspects that the weekly sales in his shop (y) are related to the (weekly) fluctuations of the Dow Jones index (x). To validate his suspicion, he has collected the following data:

   Dow Jones      58.3   62.9   46.3   48.2   58.2   65.8
   Weekly sales   2215   2518   1781   1823   2117   2703

   Dow Jones      36.7   32.3   52.7   39.3   58.7   39.3
   Weekly sales   1423   1532   1879   1713   2122   2346

   a) Draw a scatterplot of these data and compute the regression line for y on x.
   b) What can you say about the owner's claim?
   c) Does a significant relationship between x and y imply that x affects y? Explain.

10. (June 2010 Exam) The table below shows data about the age and length of service, both measured in years, of 75 employees of a certain company. Note: blank cells correspond to zero counts.

   Length of service (columns): 1  2  3  4  5  7  9  10  12  15  20
   Nonzero counts in each age row, listed left to right (the blank cells are not shown):
   [20,30)   3  6  4
   [30,40)   1  4  2  3
   [40,50)   1  4  13  2  3  1
   [50,60)   2  8  7  4
   [60,70]   2  5

   a) Indicate what kind of variables age and length of service are.
   b) Find the (absolute) marginal distribution of the variable length of service. What is the mean length of service for the company's employees?
   c) Taking into account that the units of the variables are different, compare the variability in age with that in length of service (coefficient of variation).
   d) Compute the covariance and the correlation coefficient between the variables age and length of service. Interpret the results.

11. (May 2012 Exam) A survey based on 1000 Spanish households led to a bivariate data set from (X, Y), where X = number of cars owned in 2011, with possible values 0, 1, 2, and Y = household income in 2011 (in thousands of euros). The following table shows the survey results:

                 Y
   X     [0, 50)   [50, 100)   [100, ∞)
   0     324       105         37
   1     112       234         6
   2     1         4           177

   (a) What kind of variables are X and Y?
   (b) Find the marginal absolute frequency distribution of X. Calculate the mean and the quasi-standard deviation of X.
   (c) For those households with incomes below 50 thousand euros, what were the mean and the most frequent number of cars owned?

   Here is some additional information on Y, grouped by the different values of X:

   Summary Statistics for Y
                         X = 0     X = 1    X = 2
   Count                 466       352      182
   Average               39.70     58.93    248.12
   Median                28.35     59.52    265.24
   Variance              1376.33   372.85   2672.49
   Standard dev.         37.10     19.31    51.70
   Coeff. of variation   93.44%    32.77%   20.84%
   Minimum               0.03      0.94     32.56
   Maximum               288.44    126.09   299.81
   Range                 288.41    125.15   267.26
   Lower quartile        12.39     46.16    235.33
   Upper quartile        57.07     71.96    285.62
   Interquartile range   44.67     25.80    50.29

   (d) Identify each of the boxplots with one of the three groups: X = 0, X = 1, X = 2. Justify your choice.
   (e) Identify each of the histograms I), II), III) with one of the boxplots a), b), c) from the previous part. Justify your choice.
   (f) We pick one household from each of the three groups and report their incomes (in thousands of euros). The values are: 51 for X = 0, 62 for X = 1 and 75 for X = 2. Compared to the rest of the incomes in their corresponding groups, which of these three households is the poorest? (Hint: standardize.)

12. For the following two-dimensional table, obtained from a sample of 20 values from two random variables, X and Y, select the correct answer:

             Y
   X     1   2   3
   1     1   2   5
   2     1   3   3
   3     5   0   0

   a) The sample mean of the variable Y conditional on X = 2 is 16/7.
   b) The relative marginal frequency corresponding to Y = 1 is 5/20.
   c) The relative frequency of Y = 1 conditional on X = 3 is 5/20.
   d) The sample mean of the variable X conditional on Y = 1 is 16/7.

13. From a sample of values of two random variables X and Y, we have that their covariance is 30 and the standard deviations of the two variables are 5 and 8 respectively. Indicate the correct answer:

   a) There is no significant linear relationship between X and Y.
   b) In general, larger values of X correspond to larger values of Y.
   c) In general, smaller values of X correspond to smaller values of Y.
   d) The linear correlation coefficient between Y and X is 0.75.

14. Given a sample of size 10 from two variables X and Y, if the marginal frequency of X = 2 is 3 and the marginal frequency of Y = 3 is 2, which of the following statements is (are) true?

   I. The pair (2, 3) has been observed 5 times.
   II. The pair (2, 3) has been observed at least 5 times.
   III. The pair (2, 3) has been observed twice at most.
   IV. The relative frequency of Y = 3 is lower than the relative frequency of X = 2.

   a) I only.
   b) III and IV only.
   c) IV only.
   d) II and IV only.

15. Which of the following expressions is (are) correct?

   I. $\sum_{i} \sum_{j} x_i y_j = \sum_{j} \sum_{i} x_i y_j$
   II. $\sum_{i} \sum_{j} x_i y_j = \sum_{j} y_j \sum_{i} x_i$
   III. $\sum_{i} \sum_{j} x_i y_j = \sum_{i} x_i \left( \sum_{j} y_j \right)$

   a) I, II and III.
   b) I and II only.
   c) I and III only.
   d) II and III only.
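
Most exercises on this sheet ask for the same few quantities: marginal and conditional distributions, conditional means, covariance and correlation, all computed from a joint relative-frequency table. The following is a minimal Python sketch of how such results could be checked numerically. The table in it is made up for illustration and is not taken from any exercise, and the function names are hypothetical. Variances here divide by n (frequencies are used as weights), so the quasi-variance asked for in exercise 11 would need an extra n/(n-1) factor.

```python
from math import sqrt

# Hypothetical joint relative-frequency table (NOT from the exercises).
# Rows correspond to the values of X, columns to the values of Y.
x_values = [0, 1]
y_values = [1, 2, 3]
joint = [
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
]


def marginal_x(joint):
    """Marginal relative frequencies of X: sum each row over Y."""
    return [sum(row) for row in joint]


def marginal_y(joint):
    """Marginal relative frequencies of Y: sum each column over X."""
    return [sum(col) for col in zip(*joint)]


def conditional_y_given_x(joint, i):
    """Distribution of Y given that X takes its i-th value."""
    row_total = sum(joint[i])
    return [f / row_total for f in joint[i]]


def mean(values, freqs):
    """Mean of a variable given its values and relative frequencies."""
    return sum(v * f for v, f in zip(values, freqs))


def variance(values, freqs):
    """Variance with relative frequencies as weights (divides by n)."""
    m = mean(values, freqs)
    return sum(f * (v - m) ** 2 for v, f in zip(values, freqs))


def covariance(x_values, y_values, joint):
    """Covariance of X and Y computed from the joint relative frequencies."""
    mx = mean(x_values, marginal_x(joint))
    my = mean(y_values, marginal_y(joint))
    return sum(
        joint[i][j] * (x - mx) * (y - my)
        for i, x in enumerate(x_values)
        for j, y in enumerate(y_values)
    )


def correlation(x_values, y_values, joint):
    """Linear correlation coefficient: covariance over the product of std. devs."""
    sx = sqrt(variance(x_values, marginal_x(joint)))
    sy = sqrt(variance(y_values, marginal_y(joint)))
    return covariance(x_values, y_values, joint) / (sx * sy)


if __name__ == "__main__":
    print("marginal of X:", marginal_x(joint))
    print("marginal of Y:", marginal_y(joint))
    print("Y | X = 0:", conditional_y_given_x(joint, 0))
    print("correlation:", correlation(x_values, y_values, joint))
```

Plain lists are used so that the sketch runs without any third-party library. A table of absolute frequencies can be handled by dividing every cell by the sample size first.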