SALES AND MARKETING Department MATHEMATICS. 2nd Semester. Bivariate statistics LESSONS

Online document: http://jff-dut-tc.weebly.com section DUT Maths S2. IUT de Saint-Etienne Département TC J.F.Ferraris Math S StatVar Lessons Rev08

TABLE OF CONTENTS

LESSONS
1 Introduction, vocabulary
  1-1 Aims
  1-2 Formatting
  1-3 Scatter plot
2 Chi-square independence testing
3 Fitting: Mayer's method and moving means
  3-1 Moving means
  3-2 Purpose of linear fitting
  3-3 Mayer's method
4 Linear fitting: least square method
  4-1 Parameters of a bivariate series
  4-2 Least square method
  4-3 Linear correlation coefficient
5 Non-linear fitting: variable change
6 Statistical prediction
  6-1 Point estimate
  6-2 Confidence interval

LESSONS

1 Introduction, vocabulary

1.1 Aims
Two characters are studied simultaneously on each individual of a population of size n, creating two variables (lists of values) X and Y.
Aims:
* study the relationship between both characters: their correlation;
* model this correlation by a mathematical function: regression;
* use this model to perform a prediction, with an associated confidence level;
* test the hypothesis that X and Y are not related.

1.2 Formatting
For one individual (# i), an observation is written down as an ordered pair of values $(x_i ; y_i)$. There are two possible ways to display the data series, depending on the situation:
* data series given in lists
e.g.: relationship between the quantity of spread fertilizer and the harvested production
plot # / X: fertilizer (kg.ha^-1) / Y: harvest (q.ha^-1): 50 46 80 37 3 0 46 4 0 5 5 00 43
example of a time series: annual advertising expense of a company
X : year 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Y : expense 4 60 55 66 87 6 90 95 8 0 5 8
* series with frequencies: contingency table
e.g.: relationship between age and visual acuity, data collected from 00 people
X : age 0 40 50 60
Y : acuity 3/10: 5 0 0; 6/10: 8 5 8; 9/10: 55 6 4 6

1.3 Scatter plot
Every statistical data series can be displayed on a graph as a cloud of points, each variable being placed on its own axis.
* series in lists: a pair $(x_i ; y_i)$ corresponds to one individual and to one point.
[Figure: scatter plot of the time series example from the Formatting section (x-axis: year, starting in 2006)]
* series with contingency: a pair $(x_i ; y_j)$ mostly corresponds to more than one individual (its frequency) and to an object whose size is an increasing function of the associated frequency.
[Figure: plot of the contingency example from the Formatting section (x-axis: age, y-axis: acuity)]

2 Chi-square independence testing

The goal of a statistical test is to decide whether or not we can afford to reject a given hypothesis made on a population, by analysing a sample. First, this hypothesis is worded and named the "null hypothesis", H0. If the decision of the test is to reject H0, then there is automatically a risk of being wrong, whose probability is named the "significance level" of the test, noted α.

The special case of Chi-square independence testing:
Here, a survey crosses two variables (e.g. in the next tutorial: gender and behaviour towards tobacco), whose level of dependence has to be estimated in a population, by analysing only the distribution of people obtained from a sample. In case of independence, the ideal distribution is a proportional contingency table: people's answers are supposed to be distributed in proportion to the existing subtotals (marginal frequencies). The test compares the observed frequencies (obs) to these theoretical proportional frequencies (th) associated with perfect independence, extracts a value, "χ²" (pronounced Chi-square), that quantifies a "distance between observations and perfect independence" found in the studied sample, and finally decides whether this gap is large or not.

Methodology:
n observations are conducted: n individuals are evaluated on two variables X and Y. The variable X takes r different values, and Y takes k different values. The null hypothesis H0 is by convention: the variables are independent. The test compares reality to what perfect independence would have shown. We can reject H0 if the set of observations is "too far" from the theoretical distribution.

1. Calculation of the observed χ²
* table of observations on n individuals

           Y_1        Y_2        ...   Y_k        total
  X_1      obs_11     obs_12     ...   obs_1k     total X_1
  X_2      obs_21     obs_22     ...   obs_2k     total X_2
  ...
  X_r      obs_r1     obs_r2     ...   obs_rk     total X_r
  total    total Y_1  total Y_2  ...   total Y_k  n

* table of the theoretical distribution (independence)
This second table is built from the first one, keeping every subtotal, then calculating each frequency in proportion to these subtotals and to the general total: $th_{ij} = \dfrac{(\text{total } X_i) \times (\text{total } Y_j)}{n}$.
* calculation of χ²_calc (global difference between obs and th):
$\chi^2_{calc} = \sum_{table} \dfrac{(obs - th)^2}{th}$

2. Rejection area
The χ² variable expresses the infinity of possible values that could be obtained from any possible sample compared to independence. This variable follows a probability distribution of the same name, determined by its number of degrees of freedom (dof):
dof = (r - 1)(k - 1)
To each possible χ² value (in [0; +∞[) corresponds a probability α that a sample extracted from an ideally independent population would exceed it. In an exercise, when α is given, we then read the corresponding value χ²_lim.

3. Comparison and decision
If χ²_calc > χ²_lim, then we are allowed to reject H0 (the independence), with a risk α of being wrong.
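The three steps of the test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chi_square_independence` and the 2x2 table are hypothetical (they do not come from the lesson), and the critical value 3.841 is the usual table value for dof = 1 and α = 0.05.

```python
def chi_square_independence(observed):
    """Compute chi2_calc, dof and the theoretical table for a contingency table.

    `observed` is a list of rows (one row per value of X, one column per value of Y).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Theoretical frequencies under independence: proportional to the subtotals.
    theoretical = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    # chi2_calc = sum over the whole table of (obs - th)^2 / th
    chi2_calc = sum(
        (obs - th) ** 2 / th
        for obs_row, th_row in zip(observed, theoretical)
        for obs, th in zip(obs_row, th_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2_calc, dof, theoretical


# Hypothetical sample (not from the lesson): 100 people, two values per variable.
observed = [[30, 20],
            [20, 30]]
chi2_calc, dof, theoretical = chi_square_independence(observed)

# Critical value read from a chi-square table for dof = 1 and alpha = 0.05.
chi2_lim = 3.841
reject_h0 = chi2_calc > chi2_lim
print(chi2_calc, dof, reject_h0)
```

With this made-up table every theoretical frequency equals 25, so each cell contributes 1 to χ²_calc; the total exceeds the critical value and independence would be rejected at the 5 % level.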

3 Fitting: Mayer's method and moving means

3.1 Moving means
Moving means are mostly used with time series, the variable X representing time and the variable Y a value that evolves over time. When Y values show large oscillations through time, a global upward or downward trend is hard to detect. Moving means help us give an answer, by smoothing these oscillations.
Methodology:
* Group successive Y values into sets, always of the same size (e.g. take the values two by two, three by three, four by four, etc.);
* The next set consists of the previous one, in which the first value of Y is removed and the next one is added (sliding sets);
* The average value of Y is calculated in each set (providing a list of moving means), and likewise the average value of X (providing an average location in time for each set);
* The corresponding points may be plotted.
e.g.:
X (quarters) 1 2 3 4 5 6 7 8
Y (thousands of tourists) 58 3 36 60 9 4 33
Let's create the list of the 4 by 4 moving means:
X 2.5 3.5 4.5 5.5 6.5
Y 3.5 3.75 3.5 3.5 3.75
This new list of values (backed by its graph) suggests a very slight downward trend.
note:
* the first moving mean is the mean of the values # 1, 2, 3 and 4: (1+2+3+4)/4 = 2.5 for X, and the mean of the first four Y values for Y;
* the second moving mean is the mean of the values # 2, 3, 4 and 5: (2+3+4+5)/4 = 3.5 for X, and the mean of the corresponding four Y values for Y;
* and so on.
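The sliding-set procedure can be illustrated with a short Python sketch. The function name `moving_means` and the tourist figures below are hypothetical, chosen only so that the arithmetic is easy to follow; they are not the values of the lesson's example.

```python
def moving_means(xs, ys, size):
    """Return the list of (mean of X, mean of Y) over sliding sets of `size` values."""
    points = []
    for start in range(len(ys) - size + 1):
        window_x = xs[start:start + size]
        window_y = ys[start:start + size]
        points.append((sum(window_x) / size, sum(window_y) / size))
    return points


# Hypothetical quarterly series (thousands of tourists), not the lesson's data.
quarters = [1, 2, 3, 4, 5, 6, 7, 8]
tourists = [50, 20, 30, 40, 46, 18, 28, 36]

smoothed = moving_means(quarters, tourists, size=4)
for x_mean, y_mean in smoothed:
    print(x_mean, y_mean)
```

Eight values grouped four by four give five moving means, located at 2.5, 3.5, 4.5, 5.5 and 6.5 on the time axis. In this made-up series the raw values oscillate strongly while the smoothed list decreases steadily, which is the kind of trend the method is meant to reveal.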

3.2 Purpose of linear fitting
A point cloud may show a link between both variables if its points do not appear to be scattered at random. In some cases, the cloud's shape may be elongated, quite thin, with a fairly straight "directional axis" showing a tendency... Can we find an axis, a straight line which "follows" the whole cloud "as well as possible"?
Let's say this line has already been drawn: (D): $y = ax + b$. To a given value $x_i$ are associated the value $y_i$ (ordinate of the point $M_i$ in the cloud) and the value $\hat{y}_i = a x_i + b$ (on the line).
definition: we name residue the number $e_i = y_i - \hat{y}_i$
The residue of a point $M_i$ is then positive if this point is above the line and negative in the opposite situation.
[Figure: a point $M_i$ of abscissa $x_i$, its ordinate $y_i$ and the value $\hat{y}_i$ read on the line]
Hence, we aim to find the line that "best minimises" the residues, the line that passes through the cloud as close as possible to the points. This way, we perform a linear fitting, or linear regression. Once done, this object is called the fitting line, trend line or regression line of the series.

3.3 Mayer's method
Some residues are positive, some negative. Mayer's assumption is that the "best" line is the one that leads to a zero sum of residues (the negative residues offset the positive ones).
definition: we name Mayer's principle the goal $\sum e_i = 0$
mathematical analysis: $\sum e_i = \sum (y_i - a x_i - b) = \sum y_i - a \sum x_i - nb$
This sum is zero iff $\sum y_i - a \sum x_i - nb = 0$ iff $\bar{y} - a\bar{x} - b = 0$.
That is to say: to obtain a cancellation of the sum of residues, it is necessary and sufficient that the straight line contains the midpoint of the cloud, $G(\bar{x}, \bar{y})$. This property isn't sufficient in itself to make Mayer's line unique, since the only obligation is to pass through one given point. There is an infinite number of straight lines giving a zero sum of residues!
Mayer's method:
* Divide the cloud into two subclouds. Both subclouds must contain the same number of points: n/2 if n is even, or (n+1)/2 on one side and (n-1)/2 on the other side if n is odd. The abscissas $x_i$ in the first subcloud must all be less than the abscissas $x_i$ in the second one;
* Calculate the coordinates of $G_1$ and $G_2$, mean points (midpoints) of both subclouds;
* Determine (if asked) the expression of the line $(G_1 G_2)$, the Mayer's line that will be chosen; draw it.
note: It has been proved that the mean point of the whole cloud, G, belongs to the line $(G_1 G_2)$ in any case, and thus that the latter meets Mayer's principle.
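Mayer's method lends itself to a small Python sketch. The names `mean_point` and `mayer_line` and the six data points are hypothetical. The sketch assumes distinct abscissas, and when n is odd it arbitrarily gives the extra point to the first subcloud, a choice the lesson leaves open.

```python
def mean_point(points):
    """Mean point (midpoint) of a list of (x, y) pairs."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def mayer_line(points):
    """Return (a, b) such that y = a*x + b is the line through G1 and G2."""
    ordered = sorted(points)          # sort the cloud by increasing abscissa
    half = (len(ordered) + 1) // 2    # assumption: extra point goes to the first subcloud
    g1 = mean_point(ordered[:half])
    g2 = mean_point(ordered[half:])
    a = (g2[1] - g1[1]) / (g2[0] - g1[0])
    b = g1[1] - a * g1[0]
    return a, b


# Hypothetical cloud of six points (not from the lesson).
cloud = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (6, 7)]
a, b = mayer_line(cloud)

# Mayer's principle: the residues should cancel out.
residue_sum = sum(y - (a * x + b) for x, y in cloud)
print(a, b, residue_sum)
```

Here G1 = (2, 10/3) and G2 = (5, 17/3), which gives a = 7/9 and b = 16/9. The sum of residues is zero up to rounding, as Mayer's principle requires.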

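4 Linear fitting: least square method

4.1 Parameters of a bivariate series

4.1.1 The means of X and of Y are:
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$ without contingency (data series in lists, see the first two examples of the Formatting section);
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{r} n_{i\cdot}\, x_i$ and $\bar{y} = \dfrac{1}{n}\sum_{j=1}^{k} n_{\cdot j}\, y_j$ with contingency (frequencies gathered into a crossed table, see the third example of the Formatting section), where $n_{i\cdot}$ and $n_{\cdot j}$ denote the subtotals (marginal frequencies) of $x_i$ and $y_j$.
The special point $G(\bar{x}, \bar{y})$ is named mean point or midpoint of the cloud.

4.1.2 The variance of X and that of Y are easily accessible (manual calculations) through Koenig's theorem:
$V(X) = \dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$ and $V(Y) = \dfrac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2$ without contingency;
$V(X) = \dfrac{1}{n}\sum_{i=1}^{r} n_{i\cdot}\, x_i^2 - \bar{x}^2$ and $V(Y) = \dfrac{1}{n}\sum_{j=1}^{k} n_{\cdot j}\, y_j^2 - \bar{y}^2$ with contingency.
The standard deviations are still the square roots of the variances.

4.1.3 We name covariance of the pair (X,Y) the number:
$Cov(X,Y) = \dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
This is a "common variance" between both variables, which is necessary to analyse their correlation. Koenig's theorem gives an easier way to calculate the covariance:
$Cov(X,Y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$ (without contingency) and $Cov(X,Y) = \dfrac{1}{n}\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}\, x_i y_j - \bar{x}\,\bar{y}$ (with contingency).

4.1.4 Using the calculator:
The means and standard deviations are given directly, in Stat mode. Unfortunately, the calculator gives neither the variances nor the covariance.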
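Since the calculator does not provide the variances or the covariance, the Koenig formulas for a series given in lists can be checked with a short Python sketch. The function name `bivariate_parameters` and the five data pairs are hypothetical, introduced only for illustration.

```python
def bivariate_parameters(xs, ys):
    """Means, variances and covariance of a series in lists (Koenig's formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Koenig: mean of the squares minus square of the mean.
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    # Koenig: mean of the products minus product of the means.
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return {"mean_x": mean_x, "mean_y": mean_y,
            "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy}


# Hypothetical series of five individuals (not from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
params = bivariate_parameters(xs, ys)

# Cross-check the covariance with its definition (mean of centred products).
cov_by_definition = sum(
    (x - params["mean_x"]) * (y - params["mean_y"]) for x, y in zip(xs, ys)
) / len(xs)
print(params, cov_by_definition)
```

For this made-up series the mean point is G(3, 4), V(X) = 2, V(Y) = 1.2 and Cov(X,Y) = 1.2, and the definition and Koenig's shortcut agree.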

4.2 Least square method
The idea of this method is to square each residue, then to add these squares, and finally to say that the "best" line is the one that minimises this sum (the smallest possible sum, considering the infinite number of possible lines).
definition: We name least square principle the one that consists in finding a line such that $\sum e_i^2$ is minimum within the cloud (Gauss).
mathematical analysis: we set $P(a,b) = \sum (y_i - a x_i - b)^2$: a bivariate polynomial.
There are two different ways to expand it:
(1) $P(a,b) = \sum \big(b - (y_i - a x_i)\big)^2 = n b^2 - 2b \sum (y_i - a x_i) + \sum (y_i - a x_i)^2$: 2nd degree trinomial with respect to b;
(2) $P(a,b) = \sum \big(a x_i - (y_i - b)\big)^2 = a^2 \sum x_i^2 - 2a \sum (x_i y_i - b x_i) + \sum (y_i - b)^2$: 2nd degree trinomial with respect to a.
In this context, we can continue like this:
* consider a as a constant and b as a variable. P(a,b) in form (1) is minimum when its derivative with respect to b is zero (its leading coefficient, n, is positive), which leads to $b = \bar{y} - a\bar{x}$;
* consider this latest value of b, and a as a variable. P(a,b) in form (2) is minimum when its derivative with respect to a is zero, which leads to $a = \dfrac{\frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n}\sum x_i^2 - \bar{x}^2} = \dfrac{Cov(X,Y)}{V(X)}$.
Calculus enthusiasts may try to recover these results!
note: such a value of b implies that the regression line passes through the mean point of the cloud, G.
This method leads to a unique line and is therefore the most widely used.
least square method:
* Calculate the coefficients $a = \dfrac{Cov(X,Y)}{V(X)}$ and $b = \bar{y} - a\bar{x}$ (you can get them on your calculator!)
* Write the expression of the Y on X regression line $D_{Y/X}$: $y = ax + b$
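A minimal Python sketch of the least square method follows. The names `least_squares_line` and `sum_of_squares` and the data are hypothetical. The final comparison simply illustrates, on a few nearby lines, that the fitted coefficients give the smallest sum of squared residues.

```python
def least_squares_line(xs, ys):
    """Return (a, b) with a = Cov(X,Y)/V(X) and b = mean_y - a*mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def sum_of_squares(xs, ys, a, b):
    """P(a, b): sum of the squared residues for the line y = a*x + b."""
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))


# Hypothetical series (not from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
best = sum_of_squares(xs, ys, a, b)

# Any slightly different line should give a larger sum of squares.
others = [sum_of_squares(xs, ys, a + da, b + db)
          for da in (-0.1, 0.1) for db in (-0.1, 0.1)]
print(a, b, best, others)
```

With these made-up values, a = 1.2 / 2 = 0.6 and b = 4 - 0.6 × 3 = 2.2, and the line passes through the mean point G(3, 4).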

4.3 Linear correlation coefficient
A scatter plot shows a more or less strong link between two variables X and Y, sometimes displaying an elongated and almost straight cloud: in this case, a linear model is relevant. The purpose of the linear correlation coefficient is to evaluate the strength of a linear link with a number.
linear correlation coefficient between X and Y: $r = \dfrac{Cov(X,Y)}{\sigma(X)\,\sigma(Y)}$
It has been proved that, whatever the data series, $-1 \le r \le 1$ (the capital R or the Greek letter ρ, "rho", are sometimes used for this coefficient).
On the calculator: a calculator generally writes it r, if it mentions it at all (it depends on the model). Therefore, we will calculate it ourselves (which implies calculating the covariance first...).
Interpretation of its value:
The stronger the linear correlation (cloud looking like a straight line), the closer |r| is to 1.
"positive correlation": r is positive when Y overall increases with X
"negative correlation": r is negative when Y overall decreases as X increases
$0 \le |r| \le 0.5$: weak linear correlation, inappropriate linear model.
$0.5 \le |r| \le 0.75$: moderate linear correlation, linear model not appropriate.
$0.75 \le |r| \le 0.95$: tolerable linear correlation, the linear model may not be the best one.
$0.95 \le |r| \le 1$: strong linear correlation, the linear model is one of the most appropriate.
Comments:
* are X and Y really linked? If r is close to 1 (or -1), the points are close to being collinear. Nevertheless, that doesn't always mean that X and Y are actually related. E.g.: in France, from 1974 to the early 1980s, the wedding rate decreased and in the meantime the GDP (French: PIB) increased, so that the scatter plot using both data sets is quasi-linear (fourth graph below). The linear correlation is mathematically very strong, but facts and studies show there is no cause-and-effect relationship between both variables! (Beyond that period, the following points are no longer collinear with the previous ones at all.)
* linear correlation: r only shows a linear link. A correlation between X and Y may be very strong, but not in a linear way (curved). In that case, r is far from 1 and -1, and the study has to be expanded (see section 5, non-linear fitting).
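The coefficient can be computed by hand from the covariance and the standard deviations, as in the following Python sketch. The function name `linear_correlation` and both data sets are hypothetical. The second data set is a deliberately curved example, included to illustrate the comment that r only measures a linear link.

```python
from math import sqrt


def linear_correlation(xs, ys):
    """r = Cov(X,Y) / (sigma(X) * sigma(Y)), using Koenig's formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov_xy / (sqrt(var_x) * sqrt(var_y))


# Hypothetical roughly linear series (not from the lesson).
r_linear = linear_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Hypothetical perfectly curved link: y = x squared on a symmetric range.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curved = linear_correlation(curve_x, curve_y)

print(r_linear, r_curved)
```

For the first series r is about 0.77, a "tolerable" linear correlation on the scale above. For the second one r is 0 even though Y is entirely determined by X, which is why a low |r| calls for a closer look at the shape of the cloud before concluding that the variables are unrelated.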

Examples (four scatter plots):
* income vs. duration in a company: r = 0.8449
* success rate vs. % of disadvantaged SPC: r = -0.7457
* unit margin (per unit) vs. quantity (thousands): r = 0.6438
* wedding rate through time: r = -0.9875

5 Non-linear fitting: the variable change
A variable change may be performed if the points seem to follow a particular curve. The function to consider will always be specified in the instructions of an exercise. It may be:
* a logarithm or exponential function
* a polynomial function
* a trigonometric function
Method:
* One of the variables X or Y (or both!) has to be replaced by a new one, noted T for instance, following a given formula that allows its calculation starting from the former.
e.g.:
X 1 2 3 5 8
Y 9 3 8 70
As Y seems to vary as X squared, plus 5, we can define the variable change T = X². We have to build the following table, in which T replaces X:
T 1 4 9 25 64
Y 9 3 8 70
* We perform a linear regression of the pair (T, Y), respecting their order.
e.g.: Here, the question is to determine the expression of their fitting line, y = at + b. If we are told to use the least square method, the coefficients a and b will be given by the calculator: y =.056 t 3.856
* Finally, we can deduce the expression of a curve fitting the non-linear relationship between X and Y, just by writing the variable change again; we may draw this curve if we are told to.
e.g.: Since y =.056 t 3.856, we get: y =.056 x² 3.856 (expression of a parabola)
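The variable-change procedure can be sketched in Python as follows. The helper `least_squares_line`, the function `model` and the data are hypothetical: the Y values were generated from y = 2x² + 1 so that the expected coefficients are obvious, and they are not the values of the lesson's example.

```python
def least_squares_line(ts, ys):
    """Least square coefficients (a, b) of the line y = a*t + b."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    cov_ty = sum(t * y for t, y in zip(ts, ys)) / n - mean_t * mean_y
    var_t = sum(t * t for t in ts) / n - mean_t ** 2
    a = cov_ty / var_t
    return a, mean_y - a * mean_t


# Hypothetical data lying exactly on y = 2x^2 + 1 (not from the lesson).
xs = [1, 2, 3, 4]
ys = [3, 9, 19, 33]

# Step 1: variable change T = X squared.
ts = [x ** 2 for x in xs]

# Step 2: linear regression of the pair (T, Y).
a, b = least_squares_line(ts, ys)


# Step 3: write the variable change again to get the curve y = a*x^2 + b.
def model(x):
    return a * x ** 2 + b


print(a, b, model(5))
```

The regression is linear in T, so the usual formulas apply unchanged; only the last step turns the fitting line into a parabola in X.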

6 Statistical prediction

6.1 Point estimate
The fitting line (obtained with or without a variable change) makes it possible, through its expression, to estimate a value of the variable Y by choosing an unexplored value of the variable X (generally greater than those collected in the original series). In this case, if X represents time, it is possible to forecast the future.
e.g.: let the expression of a fitting line be y = 0.85x + 22.
a. Point estimate of y with $x_0 = 10$: $y_0 = 0.85 \times 10 + 22 = 30.5$.
b. Point estimate of x with $y_0 = 39$: $x_0 = (39 - 22)/0.85 = 20$.

6.2 Confidence interval
We ought to step back when considering the point estimate: depending on the noise (dispersion) of the point cloud, it is more or less trustworthy, i.e. it gives us a more or less precise prediction. Here, the new idea is to give an estimate as a range (interval) around the point estimate, rather than a single value, and to be able to associate a probability (confidence level) that the unknown reality lies inside such a range.
Rates method (uses a linear model, estimates y from x):
1. For each value $x_i$ of the initial data set:
* calculate the values $y'_i$ following the expression of the regression line
* calculate the rates $z_i = y_i / y'_i$
* calculate the mean and standard deviation of the variable Z.
2. Z is considered as following a normal distribution. Consequently:
95 % of Z values lie inside the interval $[\bar{z} - 1.96\,\sigma_Z \,;\, \bar{z} + 1.96\,\sigma_Z]$
99 % of Z values lie inside the interval $[\bar{z} - 2.58\,\sigma_Z \,;\, \bar{z} + 2.58\,\sigma_Z]$
3. Calculate the point estimate $y'_0$, associated with the new given value $x_0$, thanks to the fitting line. Now, we can predict the unexplored possible values $y_0$ by an interval, as follows:
There is a 95 % chance that $y_0$ is in $[\,y'_0(\bar{z} - 1.96\,\sigma_Z) \,;\, y'_0(\bar{z} + 1.96\,\sigma_Z)\,]$
There is a 99 % chance that $y_0$ is in $[\,y'_0(\bar{z} - 2.58\,\sigma_Z) \,;\, y'_0(\bar{z} + 2.58\,\sigma_Z)\,]$
comments:
* this method is efficient only for r > 0 (non-negative correlation)
* the probability (95 %, 99 %, etc.) is named the confidence level of the prediction. Its complement (5 %, 1 %, etc.) is named the significance level.
* The size of such an interval is related to the uncertainty of the answer. It increases when:
  - the confidence level increases,
  - r decreases,
  - the distance between $x_0$ and the abscissas $x_i$ of the point cloud increases.
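The rates method can be sketched in Python as below. Everything named here is an assumption made for illustration: the function `rates_interval`, the four data points and the coefficients of the line are hypothetical, and the standard deviation of Z is taken as the population standard deviation (`statistics.pstdev`), which matches the 1/n variance used in these lessons but is not stated explicitly for this method.

```python
from statistics import mean, pstdev


def rates_interval(xs, ys, a, b, x0, quantile=1.96):
    """Point estimate and prediction interval for y at x0 (rates method)."""
    # Step 1: fitted values y'_i and rates z_i = y_i / y'_i.
    fitted = [a * x + b for x in xs]
    rates = [y / y_fit for y, y_fit in zip(ys, fitted)]
    z_mean = mean(rates)
    z_sd = pstdev(rates)  # assumption: population standard deviation
    # Steps 2 and 3: scale the point estimate by the bounds of the Z interval.
    y0 = a * x0 + b
    low = y0 * (z_mean - quantile * z_sd)
    high = y0 * (z_mean + quantile * z_sd)
    return y0, (low, high)


# Hypothetical data scattered around a hypothetical line y = 0.85x + 22.
xs = [2, 4, 6, 8]
ys = [24.0, 25.0, 27.5, 28.5]
a, b = 0.85, 22

y0, interval_95 = rates_interval(xs, ys, a, b, x0=10, quantile=1.96)
_, interval_99 = rates_interval(xs, ys, a, b, x0=10, quantile=2.58)
print(y0, interval_95, interval_99)
```

The 99 % interval comes out wider than the 95 % one, in line with the comment that the size of the interval increases with the confidence level.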

IUT TC MATHEMATICS FORM FOR BIVARIATE STATISTICS