SALES AND MARKETING Department MATHEMATICS. 2nd Semester. Bivariate statistics. SOLUTIONS of tutorials and exercises

Online document: http://jff-dut-tc.weebly.com, section DUT Maths S2.
IUT de Saint-Etienne, Département TC. J.F.Ferraris. Math S2 Stat2Var TExCorr Rev2018.

Exercise 1. (Tutorial for lesson page 5)
Are people's behaviour in relation to tobacco and people's gender related, at a 10% significance level? Here are the results of a survey carried out on a sample of 51 men and 66 women.
G: variable "gender" (Gm: men; Gw: women)
B: variable "behaviour in relation to tobacco" (Bn: never smoked; Bs: smoke; Bss: stopped smoking)

Observed frequencies:

|       | Gm | Gw | total |
|-------|----|----|-------|
| Bn    | 12 | 23 | 35    |
| Bs    | 31 | 26 | 57    |
| Bss   | 8  | 17 | 25    |
| total | 51 | 66 | 117   |

Theoretical frequencies according to H0:

|       | Gm    | Gw    | total |
|-------|-------|-------|-------|
| Bn    | 15.26 | 19.74 | 35    |
| Bs    | 24.85 | 32.15 | 57    |
| Bss   | 10.90 | 14.10 | 25    |
| total | 51    | 66    | 117   |

Detailed chi-squares and total:

|     | Gm      | Gw      |
|-----|---------|---------|
| Bn  | 0.69507 | 0.53710 |
| Bs  | 1.52417 | 1.17777 |
| Bss | 0.77038 | 0.59529 |

Total: χ²calc = 5.300

1) Place the subtotals and the general total in the first table, and identically in the second one.
2) Fill in the second table (6 central theoretical values) using proportional calculations: each theoretical value is (row total × column total) / general total.
3) Table #3: calculate the six chi-squares, (observed - theoretical)² / theoretical, then add them to get the value χ²calc.
4) Writing the test:
Null hypothesis H0: gender and tobacco behaviour are independent.
Observed χ² (value of the variable χ² between the observed and the theoretical samples): χ²calc = 5.3
Rejection area: significance level α = 10%; number of dof: (r-1)(k-1) = (3-1)(2-1) = 2; limit value of χ² before rejection: χ²lim = 4.61
Comparison and decision: as χ²calc > χ²lim, H0 can be rejected at a 10% significance level.
In other words, we can say, with less than a 10% risk of being wrong, that men and women behave differently with tobacco. However, we could not reject the null hypothesis at a 5% significance level: χ²lim is then 5.99, which χ²calc does not reach, so claiming dependence carries more than a 5% risk of being wrong.

Exercise 2.
Two candidates compete in a presidential election: NS and FH. In a small town, there are 500 voters: 100 are retired people, 50 are unemployed and 350 are employees.
There, the vote results are:

| voters     | FH  | NS  | blank/abstention |
|------------|-----|-----|------------------|
| unemployed | 24  | 16  | 10               |
| employees  | 122 | 148 | 80               |
| retired    | 36  | 27  | 37               |

1) Decide, at a 1% significance level, whether people's opinion depends on their social group or not.
* H0: "The type of vote is independent of the social group"

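* Let's perform the necessary calculations in order to get χ²calc:

Observations:

|            | FH  | NS  | blank/abst. | total |
|------------|-----|-----|-------------|-------|
| unemployed | 24  | 16  | 10          | 50    |
| employees  | 122 | 148 | 80          | 350   |
| retired    | 36  | 27  | 37          | 100   |
| total      | 182 | 191 | 127         | 500   |

In theory (independence):

|            | FH    | NS    | blank/abst. | total |
|------------|-------|-------|-------------|-------|
| unemployed | 18.2  | 19.1  | 12.7        | 50    |
| employees  | 127.4 | 133.7 | 88.9        | 350   |
| retired    | 36.4  | 38.2  | 25.4        | 100   |
| total      | 182   | 191   | 127         | 500   |

Chi-square:

|            | FH    | NS    | blank/abst. |
|------------|-------|-------|-------------|
| unemployed | 1.848 | 0.503 | 0.574       |
| employees  | 0.229 | 1.529 | 0.891       |
| retired    | 0.004 | 3.284 | 5.298       |

χ²calc = 14.16
* Rejection area: with α = 1% and 4 degrees of freedom, χ²lim = 13.28.
* Decision: as χ²calc > χ²lim, we can reject H0 (that is, claim that people's opinion depends on their social group) with less than a 1% risk of being wrong.

2) What can we say if we do not include blank votes and abstentions?
Let's redo the analysis, excluding blank votes and abstentions:
*
Observations:

|            | FH  | NS  | total |
|------------|-----|-----|-------|
| unemployed | 24  | 16  | 40    |
| employees  | 122 | 148 | 270   |
| retired    | 36  | 27  | 63    |
| total      | 182 | 191 | 373   |

In theory (independence):

|            | FH    | NS    | total |
|------------|-------|-------|-------|
| unemployed | 19.52 | 20.48 | 40    |
| employees  | 131.7 | 138.3 | 270   |
| retired    | 30.74 | 32.26 | 63    |
| total      | 182   | 191   | 373   |

Chi-square:

|            | FH   | NS    |
|------------|------|-------|
| unemployed | 1.03 | 0.981 |
| employees  | 0.72 | 0.687 |
| retired    | 0.9  | 0.858 |

χ²calc = 5.175
* With 2 dof: χ²lim = 5.991 with α = 5% and χ²lim = 4.605 with α = 10%. We can assert that people's opinion depends on their social group with a 10% risk of being wrong, but we could not assert it if we accepted only a 5% risk of being wrong.

Exercise 3.
The table shows attendance in two stores A and B: how many people made at least one purchase. These clients have been sorted by age group (10 to 15 years old, and so on).
1) Say, at a 5% significance level, whether the chosen store depends on the age of a client.

| age   | A  | B  |
|-------|----|----|
| 10-15 | 46 | 24 |
| 15-20 | 29 | 35 |
| 20-40 | 14 | 17 |
| > 40  | 12 | 18 |

*
Observations:

| age      | A   | B  | total |
|----------|-----|----|-------|
| 10 to 15 | 46  | 24 | 70    |
| 15 to 20 | 29  | 35 | 64    |
| 20 to 40 | 14  | 17 | 31    |
| 40 +     | 12  | 18 | 30    |
| total    | 101 | 94 | 195   |

Theoretical values:

| age      | A     | B     | total |
|----------|-------|-------|-------|
| 10 to 15 | 36.26 | 33.74 | 70    |
| 15 to 20 | 33.15 | 30.85 | 64    |
| 20 to 40 | 16.06 | 14.94 | 31    |
| 40 +     | 15.54 | 14.46 | 30    |
| total    | 101   | 94    | 195   |

χ²:

| age      | A      | B      | total  |
|----------|--------|--------|--------|
| 10 to 15 | 2.6185 | 2.8135 | 5.4320 |
| 15 to 20 | 0.5192 | 0.5579 | 1.0771 |
| 20 to 40 | 0.2634 | 0.2830 | 0.5464 |
| 40 +     | 0.8058 | 0.8658 | 1.6716 |
| total    | 4.2069 | 4.5202 | 8.727  |

* With 3 dof and a 5% level, the table gives χ²lim = 7.815.
* Thus, this limit value has been exceeded (χ²calc = 8.727).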
At a 5% significance level, we can reject the hypothesis that the choice of the store and the age group are independent.
2) Which age group contributes most to the previous result? Explain.
The age group "10 to 15 years old" contributes most to the total χ² (5.432 out of 8.727). People over 15 years old show roughly the same purchasing behaviour. On the contrary, the first age group shows a very different frequency distribution (see the first table) compared to the other customers.
3) Give the meaning of the 5% significance level in your first answer.
We assert the dependence between age and chosen store with at most a 5% risk of being wrong.
4) According to your χ² table, can you be more accurate about the risk taken in this statement (your first answer)?
To reach a 2% level, χ²calc would have to exceed 9.837, which is not the case. So the χ² table (formula sheet) only allows us to say that the risk is between 2% and 5%.
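The three-table procedure used in Exercises 1 to 3 (observed values, theoretical values, partial chi-squares) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_independence` is hypothetical and not part of the course material, and the limit value is still read from the printed χ² table rather than computed.

```python
def chi_square_independence(observed):
    """Return (chi2, dof, expected) for a table of observed counts (list of rows).

    Hypothetical helper written for this illustration.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Theoretical count under independence: row total * column total / general total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of the partial chi-squares (observed - theoretical)^2 / theoretical
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Exercise 3: rows = age groups, columns = stores A and B
observed = [[46, 24], [29, 35], [14, 17], [12, 18]]
chi2, dof, expected = chi_square_independence(observed)
chi2_lim = 7.815  # read from the chi-square table: 3 dof, alpha = 5 %
print(f"chi2_calc = {chi2:.3f}, dof = {dof}, reject H0: {chi2 > chi2_lim}")
```

Changing `observed` to the table of Exercise 1 or Exercise 2 reproduces the corresponding χ²calc; only `chi2_lim` has to be looked up again for the new number of degrees of freedom and significance level.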

Exercise 4.
In a survey, 100 people were asked about their age and their attendance at theatres (cinema). We name X the variable "age" and Y the variable "number of annual cinema shows". The survey result is the following table of counts (fr.: citations):

| Y \ X       | [15 ; 25[ | [25 ; 50[ | 50 and more |
|-------------|-----------|-----------|-------------|
| none        | 4         | 6         | 13          |
| 1 to 11     | 10        | 16        | 15          |
| 12 to 23    | 13        | 8         | 4           |
| 24 and more | 6         | 3         | 2           |

1) By a χ² independence test, at a 2% significance level, decide whether there is a link or not between age and the level of attendance at the cinema.

Observations:

| Y \ X       | [15 ; 25[ | [25 ; 50[ | 50 and more | total |
|-------------|-----------|-----------|-------------|-------|
| none        | 4         | 6         | 13          | 23    |
| 1 to 11     | 10        | 16        | 15          | 41    |
| 12 to 23    | 13        | 8         | 4           | 25    |
| 24 and more | 6         | 3         | 2           | 11    |
| total       | 33        | 33        | 34          | 100   |

Theoretical values:

| Y \ X       | [15 ; 25[ | [25 ; 50[ | 50 and more | total |
|-------------|-----------|-----------|-------------|-------|
| none        | 7.59      | 7.59      | 7.82        | 23    |
| 1 to 11     | 13.53     | 13.53     | 13.94       | 41    |
| 12 to 23    | 8.25      | 8.25      | 8.5         | 25    |
| 24 and more | 3.63      | 3.63      | 3.74        | 11    |
| total       | 33        | 33        | 34          | 100   |

χ²:

| Y \ X       | [15 ; 25[ | [25 ; 50[ | 50 and more | total |
|-------------|-----------|-----------|-------------|-------|
| none        | 1.698     | 0.333     | 3.431       | 5.462 |
| 1 to 11     | 0.921     | 0.451     | 0.081       | 1.453 |
| 12 to 23    | 2.735     | 0.008     | 2.382       | 5.125 |
| 24 and more | 1.547     | 0.109     | 0.81        | 2.466 |
| total       | 6.901     | 0.901     | 6.704       | 14.51 |

With 6 dof and α = 2%, the χ² table gives χ²lim = 15.03. Our χ²calc (14.51) doesn't exceed it. So, at a 2% significance level, we can't reject the idea that age and level of attendance at the cinema are independent.
2) Using your formula-sheet table, discuss the level of confidence you can assign to the assertion "they are dependent".
Our χ²calc (14.51) lies between the χ²lim values for the 5% level (12.59) and the 2% level (15.03). Thus, we can assert dependence with more than 95% confidence, but with less than 98% confidence.
3) Identify the largest partial chi-squares and give the meaning of these high values.
The biggest partial χ² (3.431) is obtained for people aged 50 and more whose attendance is zero: the observed frequency (13) is much higher than the expected one (7.82).
The partial χ² of people aged 50 and more whose attendance is between 12 and 23 times a year is big too (2.382): the observed frequency (4) is much lower than the theoretical one (8.5).
The partial χ² of the 15 to 25 year olds whose attendance is between 12 and 23 times a year is big too (2.735): the observed frequency (13) is much higher than the theoretical one (8.25).

Exercise 5.
(Tutorial for lesson page 6)
Let's have a close look at the evolution of a company's turnover through time (in M€, quarter by quarter).

| year | Q1 | Q2 | Q3 | Q4 |
|------|----|----|----|----|
| 2009 | 28 | 45 | 49 | 36 |
| 2010 | 30 | 44 | 48 | 40 |
| 2011 | 28 | 46 | 52 | 37 |
| 2012 | 31 | 42 | 54 | 39 |

Although there are big seasonal variations, due to the company's particular activity, is it possible to find a global trend over several years?

Let's decide to calculate and display the 5 by 5 moving means (do it as a group job: divide the set of calculations with your neighbours and share your results):

| quarters | X  | Y    |
|----------|----|------|
| 1-5      | 3  | 37.6 |
| 2-6      | 4  | 40.8 |
| 3-7      | 5  | 41.4 |
| 4-8      | 6  | 39.6 |
| 5-9      | 7  | 38   |
| 6-10     | 8  | 41.2 |
| 7-11     | 9  | 42.8 |
| 8-12     | 10 | 40.6 |
| 9-13     | 11 | 38.8 |
| 10-14    | 12 | 41.6 |
| 11-15    | 13 | 43.2 |
| 12-16    | 14 | 40.6 |

Calculations:
The values of X (on the graph) correspond to the number of quarters since the beginning: 1st quarter 2009, x = 1; 2nd quarter 2009, x = 2; and so on.
We deduce that the values of X to be entered in the table are the integers from 3 to 14: 1st value = mean of 1, 2, 3, 4, 5 = 3; 2nd value = mean of 2, 3, 4, 5, 6 = 4; and so on until the 12th value, which is the mean of 12, 13, 14, 15, 16, that is 14.
The values of Y in the table above are the average turnovers of the five quarters considered: 1st value of Y = mean of 28, 45, 49, 36, 30 = 37.6; 2nd value of Y = mean of 45, 49, 36, 30, 44 = 40.8; and so on.

Exercise 6. (Tutorial for lesson page 7)
Let's go back to one of the examples introduced on page 3 (lessons doc): effect of the amount of fertilizer on the harvested production.

| plot # | fertilizer X (kg·ha⁻¹) | harvest Y (q·ha⁻¹) |
|--------|------------------------|--------------------|
| 1      | 150                    | 46                 |
| 2      | 80                     | 37                 |
| 3      | 120                    | 46                 |
| 4      | 220                    | 51                 |
| 5      | 100                    | 43                 |

1) For each half-cloud, determine the coordinates of the mean point.
Half-clouds have to be defined: since there are 5 pairs of results, let's choose a cut with 3 points on the left and 2 points on the right (the opposite choice would have been allowed too), always separating them by the X values:
1st half-cloud: (80, 37), (100, 43), (120, 46); mean point: G1 (100, 42)
2nd half-cloud: (150, 46), (220, 51); mean point: G2 (185, 48.5)
2) Determine the expression of Mayer's line (G1G2).
Slope: $a = \frac{48.5 - 42}{185 - 100} = \frac{6.5}{85} \approx 0.07647$
y = 0.07647 x + b can be written with the coordinates of G1 (for instance): 42 = 0.07647 × 100 + b, which gives b = 34.35.
Expression of Mayer's line: y = 0.07647 x + 34.35
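The two computations on this page, the moving means of Exercise 5 and Mayer's line of Exercise 6, can be checked with a short Python sketch. The function names `moving_means` and `mayer_line` are hypothetical, chosen for this illustration only; the sketch assumes an odd window size and a split of the cloud given by the number of points kept on the left.

```python
def moving_means(values, window=5):
    """Centred moving means: one (x, mean) pair per full window, x counted from 1.

    Hypothetical helper; assumes an odd window so that x is an integer.
    """
    half = window // 2
    return [
        (start + half + 1, sum(values[start:start + window]) / window)
        for start in range(len(values) - window + 1)
    ]


def mayer_line(points, left_size):
    """Slope and intercept of the line through the mean points of two half-clouds."""
    pts = sorted(points)  # the split is always made along increasing x
    halves = (pts[:left_size], pts[left_size:])
    (x1, y1), (x2, y2) = [
        (sum(x for x, _ in h) / len(h), sum(y for _, y in h) / len(h))
        for h in halves
    ]
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1  # the intercept is obtained from G1


# Exercise 5: quarterly turnover, 2009 to 2012
turnover = [28, 45, 49, 36, 30, 44, 48, 40, 28, 46, 52, 37, 31, 42, 54, 39]
means = moving_means(turnover)
print(means[:2])

# Exercise 6: (fertilizer, harvest) pairs, 3 points on the left, 2 on the right
fertilizer = [(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)]
a, b = mayer_line(fertilizer, left_size=3)
print(f"y = {a:.5f} x + {b:.2f}")
```

Calling `mayer_line(fertilizer, left_size=2)` gives the other admissible cut (2 points on the left, 3 on the right), which leads to a slightly different line.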

3) On a graph, plot the initial table and draw this line.

Exercise 7.
Determine the expression of Mayer's line for the case given in exercise 5.
The 16 values are split into 8 for 2009 and 2010 on one side and 8 for 2011 and 2012 on the other.
$x_{G_1} = \frac{1 + 2 + \dots + 8}{8} = 4.5$ and $x_{G_2} = \frac{9 + 10 + \dots + 16}{8} = 12.5$
$y_{G_1} = \frac{28 + 45 + \dots + 40}{8} = 40$ and $y_{G_2} = \frac{28 + 46 + \dots + 39}{8} = 41.125$
Slope: $a = \frac{1.125}{8} = 0.140625$
y = 0.140625 x + b can be written with the coordinates of G1 (for instance): 40 = 0.140625 × 4.5 + b, which gives b = 39.367.
Expression of Mayer's line: y = 0.140625 x + 39.367

Exercise 8. (Tutorial for lesson page 8)
Calculate or display on your calculator: the means and standard deviations; the covariance.
1) Taking the data of exercise 6 (fertilizer/harvest):
$\bar{x} = 134$ kg·ha⁻¹ and $\bar{y} = 44.6$ q·ha⁻¹; σ(X) = 48.826 kg·ha⁻¹ and σ(Y) = 4.5869 q·ha⁻¹ (Stat mode).
$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \frac{30900}{5} - 134 \times 44.6 = 203.6$
2) Taking the data of exercise 4 (age/number of cinema shows): choose 60 as average age for the class "50 and more"; choose 36 as average number of shows for the class "24 and more".
$\bar{x} = 39.375$ years and $\bar{y} = 10.795$ shows; σ(X) = 16.422 years and σ(Y) = 10.833 shows (Stat mode).
$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \frac{36890}{100} - 39.375 \times 10.795 = -56.15$

Exercise 9. (Tutorial for lesson page 9)
Let's consider the following time series: a company's annual expenses in advertising.

| X: year | Y: expense (k€) |
|---------|-----------------|
| 2002    | 41              |
| 2003    | 60              |
| 2004    | 55              |
| 2005    | 66              |
| 2006    | 87              |
| 2007    | 61              |
| 2008    | 90              |
| 2009    | 95              |
| 2010    | 82              |
| 2011    | 120             |
| 2012    | 125             |
| 2013    | 118             |

The corresponding scatter plot is represented (figure not reproduced).

Determine the expression of the Y on X fitting line, following the least squares method; then, draw it.
(D): y = 7.0629 x + 37.42, where x is the rank of the year (x = 1 for 2002).

Exercise 10.
500 people who passed their driving licence exam are sorted in the table below. They are distributed with respect to the number X of times they took the exam before passing it and to the number Y of hours of driving lessons before their first attempt.

| X \ Y | [0 ; 15[ | [15 ; 25[ | [25 ; 40[ |
|-------|----------|-----------|-----------|
| 1     | 23       | 92        | 80        |
| 2     | 77       | 84        | 33        |
| 3     | 42       | 35        | 13        |
| 4     | 12       | 6         | 3         |

1) Define a margin frequency. Then, give an example from the table.
A margin frequency is the total number of individuals associated with one value of one of the variables. E.g.: 195 people (margin frequency) passed their exam on their first attempt (value: X = 1).
2) Describe briefly how to enter the data set in your calculator.
We usually enter the frequencies in List3, so 12 values here; List1 and List2 are used for entering the corresponding X and Y values.
3) Calculate the covariance of the pair (X, Y) and give a concrete comment about this value.
$\mathrm{Cov}(X, Y) = \frac{16815}{500} - 1.874 \times 19.375 = -2.679$, negative. Globally, the more hours of driving lessons one takes, the fewer attempts one needs to pass the exam.
4) Among those who took between 15 and 25 hours of driving lessons, what is the rate of those who passed their exam on the third attempt?
35/217 = 16.13%
5) Among those who passed their exam on the third attempt, what is the rate of those who took between 15 and 25 hours of driving lessons?
35/90 = 38.89%

Exercise 11.
A sales agent wishes to analyse his (or her) activity and efficiency. For each appointment with a prospect, the length of the presentation of the product (X, in minutes) and the quantity sold (Y) have been noted. The twelve values inside the table are the numbers of appointments that correspond to each pair (X, Y).

1) Give the meaning of the frequency "8" found inside the table.
During each of 8 appointments with prospects, the sales agent made a 10 to 20 minute presentation and then sold 2 units.
2) Calculate, manually, the average time spent per appointment.
Margin frequencies of the three values of X: 7, 19 and 21. The corresponding lengths are 5, 15 and 25 (in minutes). Total number of appointments: 47.
The average time is then (5 × 7 + 15 × 19 + 25 × 21)/47 = 17.98 minutes per appointment (about 18 minutes).
3) Give the covariance of the pair (X, Y).
$\mathrm{Cov}(X, Y) = \frac{1595}{47} - 17.9787 \times 1.80851 = 1.422$

Exercise 12.
The following table indicates the sales price (€) of a piece of equipment and the number of sold items, for 4 years.

| year rank           | 1   | 2   | 3   | 4   |
|---------------------|-----|-----|-----|-----|
| sales price (€) X   | 300 | 210 | 270 | 375 |
| # of sold items Y   | 198 | 240 | 222 | 160 |

1) Build the scatter plot in an orthogonal frame. The axes must intersect at the point (210, 160); scales: 1 cm for €15 on the x-axis, 1 cm for 10 items on the y-axis.
2) Determine the coordinates of G, mean point of the cloud.
G(288.75 ; 205)
3) a. Determine the expression of the Y on X fitting line, following the least squares method. The coefficients will be expressed with 6 significant figures.
y = -0.498274 x + 348.876
b. Draw this regression line on the graph.
4) Which year saw the highest turnover? For which amount?
The turnover is X × Y. Its four values, for years 1 to 4, are: 59400, 50400, 59940 and 60000. The highest was in year #4 (€60,000).
Going further:
5) Now, we assume that, each year, the number of sold items y and the sales price x are related this way: y = -0.498 x + 349. We denote by S(x) the turnover achieved by selling y items at €x each.
a. Express S(x) with respect to x.
S(x) = xy = -0.498 x² + 349 x
b. Find the variations of the function S defined on [210 ; 375].
S'(x) = -0.996 x + 349 > 0 iff x < 350.4. S is increasing on [210 ; 350.4] and decreasing on [350.4 ; 375].

c. Deduce the sales price we would have to set for a fifth year if we want a maximum turnover. How many items will be sold (round to one unit)? For what turnover?
We have to set the sales price at €350.4. Number of sold items: y = -0.498 × 350.4 + 349 = 174.5.
Considering 174 items at €350.4 per unit, the turnover is €60,969.6; considering 175 items at €350.4 per unit, the turnover is €61,320.

Exercise 13.
A survey compares people's spending on high-tech equipment with their income. Each column of the table T below represents, in a given French region, the average monthly income of people (X) and their average monthly expense on high-tech equipment (Y).

| region        | A    | B    | C    | D    | E    | F    |
|---------------|------|------|------|------|------|------|
| income X (€)  | 1550 | 1620 | 1770 | 1850 | 1930 | 2000 |
| expense Y (€) | 57   | 61   | 66   | 73   | 76   | 82   |

1) Calculate the covariance and then the linear correlation coefficient of the pair (X, Y). Give an interpretation of both parameters.
$\mathrm{Cov}(X, Y) = \frac{749720}{6} - 1786.66667 \times 69.1666667 \approx 1375.55556$, positive, showing a global upward trend of the expense as the income increases.
$r = \frac{1375.55556}{160.2775 \times 8.66827} \approx 0.9901$, very close to 1, hence an excellent linear correlation between X and Y.
2) a. Give, by means of your calculator, the expression of the Y on X regression line.
y = 0.05355 x - 26.50
b. Obtain the expression of Mayer's line of the series, from the table T.
Let's split the table into two groups: {A, B, C} and {D, E, F} (indeed, the values of X are already sorted in ascending order). The coordinates of both mean points are: G1 (1646.6667 ; 61.333333) and G2 (1926.66667 ; 77).
Mayer's fitting line, (G1G2), has an expression of the form y = ax + b, where
$a = \frac{y_{G_2} - y_{G_1}}{x_{G_2} - x_{G_1}} \approx 0.05595$ and $b = y_{G_1} - a\,x_{G_1} \approx -30.80$
$(D_M)$: y = 0.05595 x - 30.80
c. Both lines differ slightly. Find the income for which they both give the same expense. What makes this common point special, inside the point cloud?
Let's act as if we didn't already know this common point.
We can find it by equating both expressions: 0.05595 x - 30.80 = 0.05355 x - 26.50. That gives 0.0024 x = 4.3 and then x = 1791.67. We can deduce the value of y: 69.44. Both lines give an estimated average expense of €69.44 for an average income of €1791.67.
This common point is in fact the mean point G of the cloud: the actual average value of X in the table is 1786.67 and the actual average value of Y is 69.17 (the small differences come mostly from the rounded coefficients used above).
This property is general, as explained in the lessons of this chapter: a least squares fitting line, as well as a Mayer's fitting line, meets Mayer's criterion, which is equivalent to "the line passes through G"!
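The covariance, the correlation coefficient and the least squares line of Exercise 13 can be recomputed with the formulas of the lesson, as in the following sketch. The helper names (`mean`, `covariance`, `least_squares`) are hypothetical; the covariance is the population covariance used throughout this document (division by n, not n - 1).

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: mean of the products minus product of the means."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def least_squares(xs, ys):
    """Return (a, b, r): Y-on-X line y = a x + b and linear correlation r."""
    cov = covariance(xs, ys)
    var_x, var_y = covariance(xs, xs), covariance(ys, ys)
    a = cov / var_x
    b = mean(ys) - a * mean(xs)  # this choice of b makes the line pass through G
    return a, b, cov / sqrt(var_x * var_y)


# Exercise 13: income X and high-tech expense Y for regions A to F
income = [1550, 1620, 1770, 1850, 1930, 2000]
expense = [57, 61, 66, 73, 76, 82]

a, b, r = least_squares(income, expense)
print(f"Cov = {covariance(income, expense):.2f}, r = {r:.4f}")
print(f"y = {a:.5f} x + {b:.2f}")
# Value of the line at the mean income, to be compared with the mean expense
print(f"{a * mean(income) + b:.4f} vs {mean(expense):.4f}")
```

The last line illustrates the remark above: evaluated at the mean of X, the least squares line returns the mean of Y, so it passes through G.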

Exercise 14. (Tutorial for lesson page 12)
Data about the fuel consumption of a motorcycle have been collected (consumption: Y, in L/100 km; speed: X, in km/h):

| X | 10   | 20   | 30  | 40  | 50 | 60  | 70  | 80 | 90  |
|---|------|------|-----|-----|----|-----|-----|----|-----|
| Y | 15.2 | 11.6 | 9.3 | 7.8 | 7  | 6.6 | 6.9 | 8  | 9.6 |

The scatter plot clearly shows that a linear regression would be inappropriate to describe the evolution of the consumption with respect to the speed. Thus, we will propose a change of variable.
1) Let's define the variable T by: T = (X - 60)². Complete the following table:

| T | 2500 | 1600 | 900 | 400 | 100 | 0   | 100 | 400 | 900 |
|---|------|------|-----|-----|-----|-----|-----|-----|-----|
| Y | 15.2 | 11.6 | 9.3 | 7.8 | 7   | 6.6 | 6.9 | 8   | 9.6 |

2) Perform a linear regression of Y on T.
$\mathrm{Cov}(T, Y) = \frac{81280}{9} - 766.66667 \times 9.111111 = 2045.926$; $r = \frac{2045.926}{780.3133 \times 2.62782} = 0.997759$
r is very close to 1: a linear fitting is appropriate between T and Y.
Least squares regression line: y = 0.00336 t + 6.535
3) Thus, deduce the expression of the regression curve for the initial scatter plot.
Regression curve of the pair (X, Y): y = 0.00336 (x - 60)² + 6.535

Exercise 15. Quadratic fitting
A company took note of its profits Y with respect to X, the quantity produced and sold:

| X (tons) | 2   | 3  | 5  | 7  | 11  |
|----------|-----|----|----|----|-----|
| Y (k€)   | 38  | 55 | 72 | 69 | 24  |
| T        | -16 | -9 | -1 | -1 | -25 |

1) Using your calculator, give the linear correlation coefficient between X and Y. Comment.
$\mathrm{Cov}(X, Y) = \frac{1348}{5} - 5.6 \times 51.6 = -19.36$; $r = \frac{-19.36}{3.2 \times 18.315} = -0.3303$
This is far from -1: the linear correlation between X and Y is very poor.
2) Let's define the variable T = -(X - 6)².
a. Complete the table.
b. Calculate Cov(T, Y) and then the linear correlation coefficient between both variables.
$\mathrm{Cov}(T, Y) = \frac{-1844}{5} - (-10.4) \times 51.6 = 167.84$; $r = \frac{167.84}{9.2 \times 18.315} = 0.9961$
c. Is a linear fitting of Y on T appropriate?
r is very close to 1: a linear fitting is appropriate between T and Y.
d. Determine the expression of the Y on T fitting line, following the least squares method.
y = 1.983 t + 72.22
e. Deduce an expression of the regression of Y on X.
y = -1.983 (x - 6)² + 72.22
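The change of variable used in Exercise 15 can be illustrated as follows: the correlation is computed once on (X, Y) and once on (T, Y), then the line fitted on T is substituted back to obtain a curve in x. This is a sketch under the exercise's own assumption T = -(X - 6)²; `fit_line` and `model` are hypothetical names.

```python
from math import sqrt


def fit_line(ts, ys):
    """Least squares line y = a t + b and correlation r (hypothetical helper)."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    cov = sum(t * y for t, y in zip(ts, ys)) / n - t_bar * y_bar
    var_t = sum(t * t for t in ts) / n - t_bar ** 2
    var_y = sum(y * y for y in ys) / n - y_bar ** 2
    a = cov / var_t
    return a, y_bar - a * t_bar, cov / sqrt(var_t * var_y)


x = [2, 3, 5, 7, 11]      # produced and sold quantity (tons)
y = [38, 55, 72, 69, 24]  # profit

_, _, r_xy = fit_line(x, y)         # direct linear fit of Y on X: poor
t = [-(xi - 6) ** 2 for xi in x]    # change of variable T = -(X - 6)^2
a, b, r_ty = fit_line(t, y)         # linear fit of Y on T: much better


def model(xi):
    """Regression curve of Y on X, obtained by substituting T back."""
    return -a * (xi - 6) ** 2 + b


print(f"r(X, Y) = {r_xy:.4f}, r(T, Y) = {r_ty:.4f}")
print(f"y = {a:.3f} t + {b:.2f}")
print(f"modelled profit for x = 6 tons: {model(6):.2f}")
```

The same pattern applies to Exercise 14 by replacing the definition of `t` with `(xi - 60) ** 2` and the data with the speed and consumption lists.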

Exercise 16. Quadratic fitting
A market study was conducted on a new type of product. The table below gives, for several proposed sales prices, the number of people willing to pay that price.

| unit price (€) X   | 2  | 3  | 4  | 5  | 6  | 7  |
|--------------------|----|----|----|----|----|----|
| number of people Y | 66 | 47 | 34 | 25 | 18 | 14 |

Working table (Y' and CA' are the values given by the model of question 2):

| unit price X | number Y | T = X(X-20) | modelled number Y' | sales CA | modelled sales CA' |
|--------------|----------|-------------|--------------------|----------|--------------------|
| 2            | 66       | -36         | 62.97              | 132      | 125.9              |
| 3            | 47       | -51         | 48.88              | 141      | 146.6              |
| 4            | 34       | -64         | 36.66              | 136      | 146.7              |
| 5            | 25       | -75         | 26.33              | 125      | 131.7              |
| 6            | 18       | -84         | 17.88              | 108      | 107.3              |
| 7            | 14       | -91         | 11.3               | 98       | 79.13              |

1) Calculate the covariance of the variables X and Y, then comment on its sign.
$\mathrm{Cov}(X, Y) = \frac{740}{6} - 4.5 \times 34 = -29.67$, negative: Y values tend to increase as X decreases.
2) We set T = X(X - 20).
a. Calculate the linear correlation coefficient between both variables T and Y.
$\mathrm{Cov}(T, Y) = \frac{-11610}{6} - (-66.8333) \times 34 = 337.33$; $r = \frac{337.33}{18.95096 \times 17.93507} = 0.992487$
b. Comment on its value.
This coefficient (0.992487) is an excellent one.
c. Determine the expression of the Y on T fitting line, following the least squares method.
y = 0.9393 t + 96.78
d. Deduce an expanded expression of the regression of Y with respect to X.
y = 0.9393 (x² - 20x) + 96.78 = 0.9393 x² - 18.79 x + 96.78
3) Here we examine the expected turnover (unit selling price × number of sales), if the numbers of responses obtained in the survey are considered to be the numbers of units sold.
a. Calculate the turnovers that can be extracted from the initial table.
See the working table above (turnover = CA = XY).
b. Calculate, for the same values of X, the turnovers CA' that can be obtained from the formula of question 2)d.
See the working table above (turnover = CA' = XY').
c. What unit selling price should we fix, so that the best turnover would be reached?
According to the model, CA' seems to be maximum when X is between 3 and 4. Let's be a little more accurate:

| X   | 3     | 3.1   | 3.2   | 3.3   | 3.4   | 3.5   | 3.6   | 3.7   | 3.8 | 3.9   | 4     |
|-----|-------|-------|-------|-------|-------|-------|-------|-------|-----|-------|-------|
| CA' | 146.6 | 147.4 | 148.1 | 148.5 | 148.8 | 148.8 | 148.7 | 148.4 | 148 | 147.4 | 146.6 |

We will recommend a selling price of about €3.5 for an optimized turnover.
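The search for the best price in question 3)c of Exercise 16 is a simple scan of the modelled turnover CA'(x) = x · y'(x). The sketch below refines the step from 0.1 to 0.01; the function name `modelled_turnover` is hypothetical, and the coefficients are the rounded ones from question 2)c, so the result is only as precise as that rounding.

```python
def modelled_turnover(x, a=0.9393, b=96.78):
    """CA'(x) = x * y'(x), with y'(x) = a * x * (x - 20) + b.

    Hypothetical helper; a and b are the rounded coefficients of question 2)c.
    """
    return x * (a * x * (x - 20) + b)


# Candidate prices from 3.00 to 4.00 in steps of 0.01
prices = [3 + k / 100 for k in range(101)]
best_price = max(prices, key=modelled_turnover)

print(f"best price: {best_price:.2f}")
print(f"modelled turnover at that price: {modelled_turnover(best_price):.1f}")
```

A scan is used here because it mirrors the table of the solution; solving CA''s derivative for zero would give the same price without enumeration.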

Exercise 17. Inverse fitting
A perfumery, analysing its turnover, relates the quantities sold (Y) to the prices (X) of various perfume brands and models. The results are gathered in the following table:

| X, bottle's price (€) | 15  | 25  | 30  | 40 | 45 | 60 | 75 | 90 |
|-----------------------|-----|-----|-----|----|----|----|----|----|
| Y, # of sold bottles  | 202 | 117 | 107 | 82 | 78 | 60 | 55 | 48 |

Answer the questions beginning with "calculate" by using your calculator's results.
1) a. Calculate the covariance of X and Y; comment on its sign.
$\mathrm{Cov}(X, Y) = \frac{28000}{8} - 47.5 \times 93.625 = -947.2$, negative. Y is globally a decreasing function of X.
b. Calculate the linear correlation coefficient of X and Y; comment on its value.
$r_{XY} = \frac{-947.2}{24.109 \times 46.843} = -0.8387$, not very close to -1. The linear correlation between X and Y is not excellent (the point cloud may be noisy or may follow a curve).
2) In order to have a more precise idea of how X and Y are related, we make the change of variable $T = \frac{850}{X}$.
a. After having calculated the list of values of T, in a third list (calculator), justify that the linear correlation is excellent between T and Y.
T = 850/X gives: 56.67, 34, 28.33, 21.25, 18.89, 14.17, 11.33, 9.44. The calculations on the pair (T, Y) lead to r = 0.9971, very close to 1. Their linear correlation is excellent.
b. Give the expression of the Y on T regression line, according to the least squares method.
y = 3.215 t + 15.62
c. What is the least squares criterion?
The sum of the squared residuals must be minimum (which makes the fitting line unique).
d. Deduce from question 2)b a modelled expression of Y with respect to X.
$y = at + b = a \times \frac{850}{x} + b = \frac{2733}{x} + 15.62$
e. According to this model, how many bottles costing €150 would the perfumery expect to sell?
If x = 150, the estimate of y is $\frac{2733}{150} + 15.62 \approx 33.84 \approx 34$: it can expect to sell 34 bottles.
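The inverse fitting of Exercise 17 follows the same recipe as the quadratic ones: transform X, fit a line, substitute back. The sketch below uses the exercise's change of variable T = 850/X; `fit_line` and `expected_sales` are hypothetical names introduced for illustration.

```python
from math import sqrt


def fit_line(ts, ys):
    """Least squares line y = a t + b and correlation r (hypothetical helper)."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    cov = sum(t * y for t, y in zip(ts, ys)) / n - t_bar * y_bar
    var_t = sum(t * t for t in ts) / n - t_bar ** 2
    var_y = sum(y * y for y in ys) / n - y_bar ** 2
    a = cov / var_t
    return a, y_bar - a * t_bar, cov / sqrt(var_t * var_y)


price = [15, 25, 30, 40, 45, 60, 75, 90]       # X, bottle's price
sold = [202, 117, 107, 82, 78, 60, 55, 48]     # Y, number of sold bottles

t = [850 / x for x in price]                   # change of variable T = 850 / X
a, b, r = fit_line(t, sold)


def expected_sales(x):
    """Modelled Y for a price x: y = a * 850 / x + b."""
    return a * 850 / x + b


print(f"r(T, Y) = {r:.4f}")
print(f"y = {a:.3f} t + {b:.2f}  (so y = {a * 850:.0f}/x + {b:.2f})")
print(f"expected sales at a price of 150: {expected_sales(150):.2f}")
```

Note that 150 lies well outside the observed price range (15 to 90), so the last value is an extrapolation of the model rather than an interpolation.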

Exercise 18. (Tutorial for lesson page 13)
Calculate the point estimates, in the given situations.
1) Going back to exercise 9, give an estimate of the expense in 2015.
y = 7.0629 x + 37.42; x0 = 14; hence y0 = 136.3 k€
2) Going back to exercise 6, give an estimate of the quantity of fertilizer that would give a harvest of 60 q/ha.
y = 0.07647 x + 34.35; y0 = 60 q/ha; hence x0 = 335.4 kg/ha
3) Going back to exercise 14, give an estimate of the fuel consumption when the speed is 100 km/h.
y = 0.00336 (x - 60)² + 6.535; x0 = 100; hence y0 = 11.91 L/100 km

Exercise 19. (Tutorial for lesson page 13)
Let's go back to exercise 9. We want to estimate the expense, for the year 2015, by a 95% confidence interval.
1) a. Get the values of Y' from the values of X and the expression of the fitting line;
b. Get the values of Z by dividing Y by Y';
c. Then, give the mean and standard deviation of Z.
$\bar{z} = 1.000971$; σ(Z) = 0.125286
2) Give the point estimate of the expense in 2015.
See exercise 18, question 1: y0 = 136.3 k€
3) Give the coefficient u corresponding to the confidence level.
u = 1.96
4) Then, give the confidence interval.
[136.3 (1.000971 - 1.96 × 0.125286) ; 136.3 (1.000971 + 1.96 × 0.125286)] = [103.0 ; 169.9]

Exercise 20. (Tutorial for lesson page 13)
With exercise 6, estimate by a 99% confidence interval the harvest obtained with 300 kg/ha of fertilizer.
1) a. Get the values of Y' from the values of X and the expression of the fitting line;
b. Get the values of Z by dividing Y by Y';
c. Then, give the mean and standard deviation of Z.
$\bar{z} = 0.9991106$; σ(Z) = 0.0472554
2) Give a point estimate of the harvest.
y = 0.07647 x + 34.35; x0 = 300 kg/ha; hence y0 = 57.29 q/ha
3) Give the coefficient u corresponding to the confidence level.
u = 2.58
4) Then, give the confidence interval.
[57.29 (0.9991 - 2.58 × 0.047255) ; 57.29 (0.9991 + 2.58 × 0.047255)] = [50.25 ; 64.22]
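The four steps of Exercises 19 and 20 (fitted values Y', ratios Z = Y/Y', mean and standard deviation of Z, then the interval around the point estimate) can be sketched as below for Exercise 20. The function name `ratio_interval` is hypothetical; the line coefficients and u = 2.58 are taken from the solution above, and the standard deviation is the population one (Stat mode of the calculator).

```python
from math import sqrt


def ratio_interval(xs, ys, a, b, x0, u):
    """Interval y0 * (z_bar -/+ u * sigma_z), where z = observed y / fitted y.

    Hypothetical helper following the course's Z-ratio method.
    """
    z = [y / (a * x + b) for x, y in zip(xs, ys)]      # Z = Y / Y'
    z_bar = sum(z) / len(z)
    # Population standard deviation of Z (division by n)
    sigma_z = sqrt(sum(v * v for v in z) / len(z) - z_bar ** 2)
    y0 = a * x0 + b                                     # point estimate
    return y0 * (z_bar - u * sigma_z), y0 * (z_bar + u * sigma_z)


# Exercise 20: fertilizer (kg/ha) and harvest (q/ha), Mayer's line of exercise 6
fertilizer = [150, 80, 120, 220, 100]
harvest = [46, 37, 46, 51, 43]

low, high = ratio_interval(fertilizer, harvest, a=0.07647, b=34.35, x0=300, u=2.58)
print(f"99% interval for the harvest at 300 kg/ha: [{low:.2f} ; {high:.2f}]")
```

The same function covers Exercises 19, 21, 23, 24 and 26 once the data, the line coefficients, x0 and u are replaced; for tables with frequencies, each (x, y) pair would have to be repeated according to its count.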

Exercise 21. (Tutorial for lesson page 13)
For each person in a sample, a survey noted the age class (X) and the visual acuity (Y, where 1/10 = 0.1):

| Y \ X | [5 ; 35[ | [35 ; 45[ | [45 ; 55[ | [55 ; 65[ |
|-------|----------|-----------|-----------|-----------|
| 0.3   | 1        | 5         | 10        | 20        |
| 0.6   | 8        | 12        | 25        | 18        |
| 0.9   | 55       | 30        | 14        | 6         |

Estimate the visual acuity of an 80 year-old person by a 99% confidence interval.
Variable Y', variable Z, results on Z: $\bar{z} = 0.999266$; σ(Z) = 0.298378
Point estimate: y = -0.008422 x + 1.038; x0 = 80; hence y0 = 0.3642
Coefficient u: u = 2.58
Confidence interval: [0.3642 (0.999266 - 2.58 × 0.298378) ; 0.3642 (0.999266 + 2.58 × 0.298378)] = [0.08358 ; 0.6444]

Exercise 22.
In a country, two variables are compared: the consumer force index and the turnover of its car industry:

| consumer force (index) X      | 3.26 | 3.85 | 3.44 | 3.08 | 3.6  |
|-------------------------------|------|------|------|------|------|
| car industry turnover (G€) Y  | 9.3  | 9.56 | 9.36 | 9.24 | 9.47 |

1) Give the expression of the Y on X Mayer's line.
There are two ways to cut this data set as X increases (3 points then 2, or 2 points then 3).
Case #1: G1 (3.26 ; 9.3) and G2 (3.725 ; 9.515); y = 0.4624 x + 7.793
Case #2: G1 (3.17 ; 9.27) and G2 (3.63 ; 9.463); y = 0.4203 x + 7.938
2) By means of a point estimate, give a value of the consumer force that would correspond to a car industry turnover of 10 G€.
Case #1: y = 10 iff x = 4.773
Case #2: y = 10 iff x = 4.907
3) Is a strong correlation between two variables a sign of a cause and effect relationship between them?
Not necessarily. This numerical relationship may just be a coincidence.

Exercise 23. Least squares + confidence interval
Monthly revenues of a commercial website are listed below, from January to December 2015 (in k€):
3, 5, 4, 8, 10, 9, 13, 12, 17, 18, 18, 21
1) In a few words, describe the least squares method.
This method consists in finding the line that minimizes the sum of the squared residuals (vertical deviations between the points and the line).
2) Using the global trend of the evolution of the monthly revenue, give the 95% confidence interval of the predictable revenue in December 2016.
(number the months from 1 for January 2015)

| month, X | revenue, Y | Y'    | Z     |
|----------|------------|-------|-------|
| 1        | 3          | 2.5   | 1.2   |
| 2        | 5          | 4.136 | 1.209 |
| 3        | 4          | 5.773 | 0.693 |
| 4        | 8          | 7.409 | 1.08  |
| 5        | 10         | 9.045 | 1.106 |
| 6        | 9          | 10.68 | 0.843 |
| 7        | 13         | 12.32 | 1.055 |
| 8        | 12         | 13.95 | 0.86  |
| 9        | 17         | 15.59 | 1.09  |
| 10       | 18         | 17.23 | 1.045 |
| 11       | 18         | 18.86 | 0.954 |
| 12       | 21         | 20.5  | 1.024 |

Expression of the Y on X regression line: y = 1.636 x + 0.8636
Point estimate of the revenue in December 2016 (x = 24): y0 = 40.14 k€
Variable Z: $\bar{z} = 1.0132222$ and σ(Z) = 0.14538387
Coefficient u for a 95% confidence level: u = 1.96
Confidence interval: [29.23 ; 52.10]
3) Give the probability that, in December 2016, the revenue would be less than 29.23 k€.
There is a 95% chance that this revenue lies inside the interval. Moreover, the concept of confidence interval relies on a symmetric probability distribution (the normal law); thus, there is a 2.5% chance that the revenue is below the interval and a 2.5% chance that it is above it. Answer: 2.5%.
4) Build the scatter plot (scale: 2 cm for one month), draw the regression line and finally represent the confidence interval.
(Graph: X, month, on the horizontal axis; Y, revenue in k€, on the vertical axis.)
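The coefficients u = 1.96 and u = 2.58 used in these exercises, and the 2.5% tail of question 3, both come from the standard normal law. As a small illustration, Python's standard library can reproduce them; `two_sided_coefficient` is a hypothetical name, and the sketch assumes, as the solution does, that the normal law applies.

```python
from statistics import NormalDist


def two_sided_coefficient(confidence):
    """u such that P(-u < N(0, 1) < u) = confidence (hypothetical helper)."""
    # The two tails share 1 - confidence, so the upper bound sits at (1 + confidence) / 2
    return NormalDist().inv_cdf((1 + confidence) / 2)


for level in (0.95, 0.99):
    u = two_sided_coefficient(level)
    tail = (1 - level) / 2  # probability left on each side of the interval
    print(f"confidence {level:.0%}: u = {u:.2f}, each tail = {tail:.1%}")
```

For a 95% interval, each tail therefore carries half of the remaining 5%, which is the 2.5% given as the answer to question 3.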

Exercise 24. Mayer + confidence interval
The given table includes eight of the major cities of a country. The variable X gives, in thousands, the number of city residents; the variable Y gives, in thousands, the number of students in this city.

| city | X   | Y  |
|------|-----|----|
| A    | 850 | 58 |
| B    | 623 | 37 |
| C    | 587 | 38 |
| D    | 360 | 20 |
| E    | 312 | 16 |
| F    | 275 | 15 |
| G    | 262 | 12 |
| H    | 244 | 12 |

1) Build the scatter plot from this data series.
(Graph: X, number of residents in thousands, on the horizontal axis; Y, number of students in thousands, on the vertical axis.)
2) Give the coordinates of the mean point of the cloud.
G(439.1 ; 26)
3) a. Using Mayer's method, determine manually the expression of the Y on X regression line.
G1 (273.3 ; 13.75) and G2 (605 ; 38.25); slope: a = 0.07385
With G1: b = y - ax = -6.430; expression: y = 0.07385 x - 6.43
b. Draw this line. Does G belong to it?
G always belongs to it.
c. Give "Mayer's principle".
The sum of the residuals must be zero.
4) We will use here another fitting line, whose expression is: y' = 0.07 x - 6.
a. With this line, give the 95% confidence interval of the predictable number of students in a town that has two million inhabitants.

| X   | Y  | Y'    | Z     |
|-----|----|-------|-------|
| 850 | 58 | 53.5  | 1.084 |
| 623 | 37 | 37.61 | 0.984 |
| 587 | 38 | 35.09 | 1.083 |
| 360 | 20 | 19.2  | 1.042 |
| 312 | 16 | 15.84 | 1.01  |
| 275 | 15 | 13.25 | 1.132 |
| 262 | 12 | 12.34 | 0.972 |
| 244 | 12 | 11.08 | 1.083 |

Expression of the fitting line used: y' = 0.07 x - 6
Point estimate of the number of students (x = 2000): y0 = 134
Variable Z: $\bar{z} = 1.04877$ and σ(Z) = 0.052588
Coefficient u associated with a 95% confidence level: u = 1.96
Confidence interval: [126.7 ; 154.3]
b. What can we say about the chances that the number of students would exceed 155,000 in such a town?
There is a bit less than a 2.5% chance.

Exercise 25. Logarithmic fitting + confidence interval
The service life of some identical office equipment has been studied. In the following table, $t_i$ represents the duration of use, expressed in thousands of hours, and $R(t_i)$ the rate of equipment still in use at the time $t_i$ (e.g.: after 1,000 hours, $t_i = 1$, 90% of the equipment is still in use, $R(t_i) = 0.90$).

| $t_i$    | 1   | 2    | 3    | 4   | 5    | 6    | 7    | 8    | 9   |
|----------|-----|------|------|-----|------|------|------|------|-----|
| $R(t_i)$ | 0.9 | 0.66 | 0.53 | 0.4 | 0.32 | 0.25 | 0.19 | 0.14 | 0.1 |

1) We set $y_i = \ln[R(t_i)]$ where ln is the natural logarithm. Fill in the following table, then build the scatter plot, using the points $M_i(t_i, y_i)$, in an orthogonal frame.

| $t_i$ | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9      |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| $y_i$ | -0.105 | -0.416 | -0.635 | -0.916 | -1.139 | -1.386 | -1.661 | -1.966 | -2.303 |

2) May a linear fitting be relevant for the previous scatter plot? Calculate the linear correlation coefficient between T and Y.
These points are almost collinear; a linear fitting appears to be relevant.
3) Using the least squares method, determine an expression of the Y on T regression line. Deduce from this expression that there are two positive real numbers k and λ such that $R(t) = k\,e^{-\lambda t}$.
y = -0.26604 t + 0.1605.
y = ln R(t) implies $R(t) = e^{y} = e^{-0.26604\,t + 0.1605} = e^{0.1605}\,e^{-0.26604\,t} = 1.174\,e^{-0.26604\,t}$.
4) In this question, we'll take k = 1.174 and λ = 0.266.
a. Determine the predictable rate of equipment still in use after 10,000 hours.
After 10,000 hours, t = 10; hence $R(10) = 1.174\,e^{-2.66} \approx 0.0821$, that is 8.2% rounded.
b. After how long is there exactly 50% of the equipment still in use?
R(t) = 0.5 implies $1.174\,e^{-0.266\,t} = 0.5$ iff $e^{-0.266\,t} = 0.5/1.174$ iff $-0.266\,t = \ln(0.5/1.174)$ iff $t = \ln(0.5/1.174)/(-0.266) = 3.209$. Answer: after 3,209 hours.
5) Give a 99% confidence interval of the rate of equipment still in use after 10,000 hours of service.
| T | Y      | Y'     | Z     |
|---|--------|--------|-------|
| 1 | -0.105 | -0.106 | 0.998 |
| 2 | -0.416 | -0.372 | 1.118 |
| 3 | -0.635 | -0.638 | 0.996 |
| 4 | -0.916 | -0.904 | 1.014 |
| 5 | -1.139 | -1.170 | 0.974 |
| 6 | -1.386 | -1.436 | 0.966 |
| 7 | -1.661 | -1.702 | 0.976 |
| 8 | -1.966 | -1.968 | 0.999 |
| 9 | -2.303 | -2.234 | 1.031 |

Expression of the Y on T regression line: y = -0.26604 t + 0.1605
Point estimate (t = 10): y0 = -2.5
Variable Z: $\bar{z} = 1.007964$ and σ(Z) = 0.043476
Coefficient u associated with a 99% confidence level: u = 2.58
Confidence interval on y: [-2.8003 ; -2.2395], and the interval on R = e^y is: [0.0608 ; 0.1065].
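The logarithmic fitting of Exercise 25 can be retraced numerically: take the logarithm of the rates, fit a line, then convert the intercept and slope into k and λ. The sketch below is illustrative only; `fit_line` and `survival` are hypothetical names, and the coefficients are not rounded as in question 4, so the last digits may differ slightly from the hand calculation.

```python
from math import exp, log


def fit_line(ts, ys):
    """Least squares line y = a t + b (hypothetical helper)."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    cov = sum(t * y for t, y in zip(ts, ys)) / n - t_bar * y_bar
    var_t = sum(t * t for t in ts) / n - t_bar ** 2
    a = cov / var_t
    return a, y_bar - a * t_bar


hours = list(range(1, 10))  # duration of use, in thousands of hours
rate = [0.90, 0.66, 0.53, 0.40, 0.32, 0.25, 0.19, 0.14, 0.10]

y = [log(r) for r in rate]  # change of variable y = ln R(t)
a, b = fit_line(hours, y)
k, lam = exp(b), -a         # R(t) = k * exp(-lam * t)


def survival(t):
    """Modelled rate of equipment still in use after t thousand hours."""
    return k * exp(-lam * t)


half_time = log(k / 0.5) / lam  # solves k * exp(-lam * t) = 0.5

print(f"y = {a:.5f} t + {b:.4f}")
print(f"k = {k:.3f}, lambda = {lam:.3f}")
print(f"R(10) = {survival(10):.4f}, 50% left after {half_time:.3f} thousand hours")
```

Taking the exponential of the bounds of the interval on y, as done in question 5, is valid because the exponential function is increasing, so the order of the bounds is preserved.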

Exercise 26.
100 children have been classified by age (X, in years) and size (Y, in cm):

| X \ Y   | [95 ; 105[ | [105 ; 125[ | [125 ; 135[ |
|---------|------------|-------------|-------------|
| [3 ; 5[ | 15         | 10          | 0           |
| [5 ; 7[ | 8          | 32          | 5           |
| [7 ; 9[ | 2          | 13          | 15          |

1) Enter this table in your calculator.
2) Give the means and standard deviations of X and Y, and calculate their covariance.
$\bar{x} = 6.1$ years; $V(X) = \frac{3940}{100} - 6.1^2 = 2.19$; σ(X) = 1.480 years.
$\bar{y} = 114.25$ cm; $V(Y) = \frac{1315375}{100} - 114.25^2 = 100.6875$; σ(Y) = 10.03 cm.
$\mathrm{Cov}(X, Y) = \frac{70540}{100} - 6.1 \times 114.25 = 8.475$.
3) Calculate their linear correlation coefficient. Comment on this value.
$r = \frac{8.475}{1.480 \times 10.03} = 0.5709$, a rather weak linear correlation (the cloud may be noisy and curved).
4) Nevertheless, does the table allow us to see some trend?
We see that, from one age class to another, the sizes corresponding to the greatest number of individuals are not the same. But these largest frequencies do not represent, within their age class, an overwhelming majority, which reflects a high variability of sizes for children of the same age. Modelling the growth of a child by a straight line, or even by a well-defined curve, is therefore difficult.
5) Assuming that the relationship between age and size is linear until the age of 12, give the 95% confidence interval of the size of a 12 year-old child.

| X | Y   | n  | Y'     | Z       |
|---|-----|----|--------|---------|
| 4 | 100 | 15 | 106.12 | 0.94233 |
| 6 | 100 | 8  | 113.86 | 0.87827 |
| 8 | 100 | 2  | 121.6  | 0.82237 |
| 4 | 115 | 10 | 106.12 | 1.08368 |
| 6 | 115 | 32 | 113.86 | 1.01001 |
| 8 | 115 | 13 | 121.6  | 0.94572 |
| 4 | 130 | 0  | 106.12 | 1.22503 |
| 6 | 130 | 5  | 113.86 | 1.14175 |
| 8 | 130 | 15 | 121.6  | 1.06908 |

Expression of the Y on X regression line: y = 3.87 x + 90.64
Point estimate of the size of a 12 year-old child (x = 12): y0 = 137.08 cm
Variable Z: $\bar{z} = 1.013138$ and σ(Z) = 0.121881
Coefficient u corresponding to a 95% confidence level: u = 1.96
Confidence interval on y: [106.1 ; 171.6].
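When the data come as a two-way table of frequencies, as in Exercise 26, every sum in the formulas is weighted by the cell count. The following sketch recomputes the means, standard deviations, covariance and correlation from the table, using the class centres 4, 6, 8 and 100, 115, 130 of the solution; `weighted_moments` is a hypothetical name.

```python
from math import sqrt

ages = [4, 6, 8]          # class centres of X (years)
sizes = [100, 115, 130]   # class centres of Y (cm)
counts = [                # counts[i][j]: children of age ages[i] and size sizes[j]
    [15, 10, 0],
    [8, 32, 5],
    [2, 13, 15],
]


def weighted_moments(xs, ys, table):
    """Means, population standard deviations and covariance of a frequency table.

    Hypothetical helper written for this illustration.
    """
    n = sum(map(sum, table))
    cells = [(x, y, table[i][j]) for i, x in enumerate(xs) for j, y in enumerate(ys)]
    mx = sum(w * x for x, _, w in cells) / n
    my = sum(w * y for _, y, w in cells) / n
    vx = sum(w * x * x for x, _, w in cells) / n - mx ** 2
    vy = sum(w * y * y for _, y, w in cells) / n - my ** 2
    cov = sum(w * x * y for x, y, w in cells) / n - mx * my
    return mx, my, sqrt(vx), sqrt(vy), cov


mx, my, sx, sy, cov = weighted_moments(ages, sizes, counts)
r = cov / (sx * sy)
a = cov / sx ** 2          # slope of the Y on X least squares line
b = my - a * mx            # intercept, so that the line passes through G

print(f"means: {mx:.2f} years, {my:.2f} cm")
print(f"standard deviations: {sx:.3f} years, {sy:.2f} cm")
print(f"Cov = {cov:.3f}, r = {r:.4f}")
print(f"y = {a:.2f} x + {b:.2f}")
```

This corresponds to entering the class centres in List1 and List2 and the twelve (here nine) frequencies in List3 on the calculator, as described in Exercise 10.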