Société de Calcul Mathématique SA

Outils d'aide à la décision - Tools for decision help
www.scmsa.eu

Probabilistic Studies: Normalizing the Histograms

Bernard Beauzamy
December 2012

I. General construction of the histogram

Any probabilistic study usually starts with the construction of a histogram: one defines some classes and counts how many points fall into each class.

The most common situation is as follows: we have a sample of real values $x_i$, $i = 1, \dots, Itot$; let $m = \min_i x_i$ and $M = \max_i x_i$. We want to build a histogram with $K$ classes from this sample. What people generally do is divide the interval $[m, M]$ into $K$ classes of width $(M - m)/K$. But this approach has several drawbacks, of which people are often not conscious:

- The boundaries of the classes depend strongly on the values of $m$ and $M$, and would be modified if these values were changed, for instance if the sample grew bigger;
- These boundaries do not take into account the uncertainties which certainly exist upon the values of $m$ and $M$;
- All classes are of the form $a \le x < b$, except the last one, which is of the form $a \le x \le b$, since the value $M$ is necessarily met.

A histogram should be viewed as a measurement device, just like a thermometer. It gives information, with some accuracy. Therefore, the measurement device should be as independent as possible from the sample. Of course, it cannot be totally independent.
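The usual construction described above can be sketched in a few lines of Python (the note's own code is VBA; Python is used here only so that the sketch is easy to run). The function name `usual_histogram` and the toy sample are illustrative assumptions, not part of the note. The sketch assumes $M > m$ and shows two of the drawbacks listed: the value $M$ needs a special case, and adding a single point moves every boundary.

```python
def usual_histogram(sample, n_classes):
    """Conventional binning: [m, M] is cut into n_classes classes of equal width.

    Illustrative helper (not from the note); assumes max(sample) > min(sample).
    """
    m, M = min(sample), max(sample)
    width = (M - m) / n_classes
    counts = [0] * n_classes
    for x in sample:
        k = int((x - m) / width)       # class index for a <= x < b
        if k >= n_classes:             # x == M: the last class has to be closed on the right
            k = n_classes - 1
        counts[k] += 1
    edges = [m + k * width for k in range(n_classes + 1)]
    return edges, counts


sample = [0.0, 0.2, 0.5, 0.9, 1.0]
edges, counts = usual_histogram(sample, 4)
print(edges, counts)

# One extra point beyond the old maximum changes all the boundaries.
edges_bigger, _ = usual_histogram(sample + [2.0], 4)
print(edges_bigger)
```

The second call is the point of the sketch: none of the original observations changed, yet every class boundary did.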

The number of classes itself, namely $K$, is not arbitrary but should reflect the performances of the measurement device. Indeed, it is linked with the precision we expect. When we build a histogram, two points in the same class are considered as the same point. When we make a measurement, we consider that, if the value $x$ is read, it might as well be anything between $x - \varepsilon$ and $x + \varepsilon$, $\varepsilon$ being considered as the precision of the measurement device. So we have a rough link between both concepts:

$$\frac{M - m}{K} \approx 2\varepsilon, \qquad (1)$$

since $(M - m)/K$ is the width of each class, from the histogram point of view, and $2\varepsilon$ is the width of each class, from the precision point of view.

In order to answer the difficulties mentioned above, we will build $K$ classes such that the first one is centered at $m$ and the last one is centered at $M$. Therefore, the centers of the classes will be:

$$c_k = m + k\,\frac{M - m}{K - 1}, \qquad k = 0, \dots, K - 1 \qquad (2)$$

The half-width of a class is:

$$l = \frac{M - m}{2(K - 1)} \qquad (3)$$

A point $x$ belongs to the class $C_k$, with center $c_k$, if:

$$m + k\,\frac{M - m}{K - 1} - \frac{M - m}{2(K - 1)} \;\le\; x \;<\; m + k\,\frac{M - m}{K - 1} + \frac{M - m}{2(K - 1)} \qquad (4)$$

So, all our classes will here be of the form $a \le x < b$. Condition (4) may be written:

$$k \le \frac{(x - m)(K - 1)}{M - m} + \frac{1}{2} \qquad (4a)$$

and:

$$k > \frac{(x - m)(K - 1)}{M - m} - \frac{1}{2} \qquad (4b)$$

which means that $k$ is defined by:

$$k = \operatorname{int}\!\left(\frac{(x - m)(K - 1)}{M - m} + \frac{1}{2}\right), \qquad (5)$$

where $\operatorname{int}(x)$ is the integer part of $x$, that is the largest integer not exceeding $x$.

So, a VBA code may be written as follows; Itot is the total number of lines in the table, min_values and max_values are respectively the min and the max, and ktot is the number $K$ above:

    For i = 1 To Itot
        ' class index of x(i), equation (5); classes are numbered 0 .. ktot - 1
        k = Int((x(i) - min_values) * (ktot - 1) / (max_values - min_values) + 1 / 2)
        histo(k) = histo(k) + 1
    Next i

As it stands now, the method has a drawback: the extremities of each class are rational numbers, usually with many decimal digits, which looks unnatural with respect to the requirement for a given precision. For instance, a class might appear as 443.556-464.444. Its width is almost 21, which means that we do not want to distinguish between numbers with a difference of, say, 0.5. But still, we give 3 digits after the decimal point, which looks absurd. So, we have to study how to round the values.

II. Rounding the values

If we accept the idea that all values in our sample are subject to some measurement error, the simplest way of taking it into account is to round each value. Let $\varepsilon$ be the precision we accept, and let $\varepsilon = 10^{-k}$ for some integer $k \ge 0$. Then each value $x$ is replaced by $rx = \operatorname{round}(x, k)$, which is the number with $k$ decimal places closest to $x$.

Then of course $m$ and $M$ will also be rounded to $k$ decimal places. But even so, the centers of the other classes will not be rounded to the same number of decimal places, because they are offset from $m$ by multiples of $\frac{M - m}{K - 1}$.

It is important to keep the fact that all classes should have the same width: for instance, when we generate random numbers, the percentage of points in each class depends on the width of the class.

So, what we do is as follows: we do not try to replace all centers by approximate values; we keep the rational value. But still, in the Excel cells, we may present the result with a given number of digits. We will write for instance:

    For k = 0 To ktot - 1
        Sheets(3).Cells(k + 2, 1) = _
            Round(min_values + k / (ktot - 1) * (max_values - min_values) _
                  - (max_values - min_values) / (2 * (ktot - 1)), 2) _
            & "-" & _
            Round(min_values + k / (ktot - 1) * (max_values - min_values) _
                  + (max_values - min_values) / (2 * (ktot - 1)), 2)
        Sheets(3).Cells(k + 2, 2) = histo(k)
    Next k

The first column receives the interval of class k, with both endpoints displayed to 2 decimal places; the second column receives the count. This way, we will have a histogram of the following sort:

    interval      number of occurrences
    0-0,01        43
    0,01-0,02     90
    0,02-0,03     5
    0,03-0,04     98

The endpoints look simple, but still the classes have the same width. In this example, the value of $l$ was 0,0050493490863668, the value of the min was 0.0008948365740967, and the value of the max was 0.99996060329803.

III. Advantages of the method

This method answers the difficulties mentioned previously:

- If the sample grows bigger, the classes are not necessarily modified, as long as no value becomes smaller than $m - l$ or larger than $M + l$. Of course, if more points appear below $m$ or above $M$, the values $m$, $M$ will not be centers of classes anymore, but the definition of the classes will not be modified.
- The construction incorporates the uncertainties upon the values.
- All classes have the same form, namely $a \le x < b$.
- The method may be fully automated. All we need is $m$, $M$, $K$.

An interesting application, which we recently met, is that this method allows us to show that some variables have identical laws. Assume for instance that we have one random variable $X$ and another one $Y$ which turns out to be $Y = 100X$. If we build the histograms the usual way, by hand, we might not notice this. Assume for instance that the minimum value for $X$ is 0.04 and we want 100 classes. We would take as first interval for $X$ the interval 0 - 0.1. For $Y$, the smallest value is 4, and we would probably take as first interval 0 - 5. We would not see that this is the same variable, up to a multiplication by a constant factor 100.
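The centered construction of sections I and II, and the scaling remark just made, can be checked with a short Python sketch (Python rather than the note's VBA, so that it runs outside Excel). The names `centered_histogram` and `class_labels` and the toy sample are illustrative assumptions. The sketch assumes at least two classes and $M > m$; the class index follows equation (5), and rounding is applied only to the displayed labels, never to the classes themselves.

```python
import math


def centered_histogram(sample, n_classes):
    """Classes centered at m, ..., M; illustrative helper, needs n_classes >= 2 and M > m."""
    m, M = min(sample), max(sample)
    span = M - m
    half_width = span / (2 * (n_classes - 1))                      # equation (3)
    centers = [m + k * span / (n_classes - 1) for k in range(n_classes)]   # equation (2)
    counts = [0] * n_classes
    for x in sample:
        k = math.floor((x - m) * (n_classes - 1) / span + 0.5)    # equation (5)
        counts[k] += 1
    return centers, half_width, counts


def class_labels(centers, half_width, digits=2):
    """Rounded endpoints for display only; the classes keep their exact width."""
    return [f"{round(c - half_width, digits)}-{round(c + half_width, digits)}"
            for c in centers]


x_sample = [0.04, 0.1, 0.25, 0.5, 0.77, 1.0]
y_sample = [100 * x for x in x_sample]            # Y = 100 X

centers_x, l_x, counts_x = centered_histogram(x_sample, 5)
centers_y, l_y, counts_y = centered_histogram(y_sample, 5)

print(class_labels(centers_x, l_x), counts_x)
print(class_labels(centers_y, l_y), counts_y)
```

On this toy sample the two count vectors coincide and the centers of the second histogram are 100 times those of the first, which is the behaviour the remark describes.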

If the classes are defined in an automatic manner, as was previously explained, the link between $X$ and $Y$ is obvious. All endpoints are multiplied by 100, and the number of points in each class is the same.

IV. Loss of information

Quite clearly, when we build a histogram, some information is lost: all points belonging to the same class are identified together, and identified with the center of the class. Simply consider the extreme values $m$ and $M$: it is therefore better to have them as centers of classes, and no information will be lost upon them.

So the indicator "total loss of information when performing the histogram" is one more reason for the choice we indicate.

V. Refining the definition of the grid

In the work above, we decided that the extreme classes would be centered at the extreme values of the sample. We may wonder if there are better choices. We now investigate this question.

The set of classes will be called a "grid". As before, there are $K$ classes, denoted by $C_k$, and the number $K$ is fixed. The width of the classes is denoted by $2l$; it is the same for all classes, and it is fixed, since it results from the precision which is required.

We denote by $c_k$, $k = 1, \dots, K$, the center of the class $C_k$. Our question now is how to choose the position of the center $c_1$, since all other centers will follow.

If some points $x_i$ fall into the class $C_k$, they are identified with its center $c_k$; so there is a loss of information equal to $|x_i - c_k|$ for each of them. We are looking for the position of the grid, that is the position of $c_1$, which will minimize this loss of information.

As before, we set $m = \min_i x_i$ and $M = \max_i x_i$; we assume that the grid is large enough to cover the whole sample, with half a class on each side. This gives the inequality:

$$2Kl \ge M - m \qquad (1)$$

It is useless to have empty classes, either before $m$ or after $M$. So the first class will contain $m$ and the last one will contain $M$, and we get the conditions:

$$m \ge c_1 - l, \qquad M \le c_K + l.$$

Since $c_K = c_1 + (2K - 2)\,l$, this is compatible with condition (1), since:

$$M - m = (M - c_K) + (c_K - c_1) + (c_1 - m) \le 2l + (2K - 2)\,l = 2Kl.$$
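The loss of information defined above can be computed directly. The following Python sketch (not the note's VBA) is a minimal illustration under stated assumptions: the function name `total_loss` is hypothetical, the classes are half-open intervals $c_k - l \le x < c_k + l$, and the grid is allowed to extend as far as needed instead of being cut at $K$ classes. It compares, on a toy sample, the grid centered at $m$ and $M$ with the usual grid whose edges are at $m$ and $M$.

```python
import math


def total_loss(sample, first_center, half_width):
    """Sum of |x - c_k| over the sample, c_k being the center of the class containing x.

    Illustrative helper: classes are c_k - l <= x < c_k + l with
    c_k = first_center + 2 * k * l, k = 0, 1, 2, ... (0-based here).
    """
    loss = 0.0
    for x in sample:
        k = math.floor((x - first_center + half_width) / (2 * half_width))
        loss += abs(x - (first_center + 2 * k * half_width))
    return loss


sample = [0.0, 0.3, 0.5, 1.0]

# Grid of section I with 3 classes: centers 0, 0.5, 1 and half-width 0.25.
centered = total_loss(sample, first_center=0.0, half_width=0.25)

# Usual grid with 3 classes on [0, 1]: centers 1/6, 1/2, 5/6 and half-width 1/6.
usual = total_loss(sample, first_center=1 / 6, half_width=1 / 6)

print(centered, usual)
```

With the centered grid the extreme values 0 and 1 contribute nothing to the loss, as section IV states; on this particular sample the total is also smaller, but that comparison is only an example, not a general result.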

The total number of classes, taking (1) into account, is:

$$K = \operatorname{int}\!\left(\frac{M - m}{2l}\right) + 1 \qquad (2)$$

For instance, if $m = 0$, $M = 1$ and $l = 1/20$ (classes of width 1/10), we find $K = 11$.

So, the difference with the paragraphs above is that now $c_1$ may not be exactly at $m$. More precisely, we want to position $c_1$, under the constraint:

$$m - l \le c_1 \le m + l \qquad (3)$$

and we want to minimize the quantity:

$$Q = \sum_{k=1}^{K} \sum_{x_i \in C_k} |c_k - x_i| \qquad (4)$$

In the definition of this quantity, we consider that the total loss of information is simply the sum of all individual losses of information; we do not see any reason to take, for instance, a quadratic sum.

Since $c_k = c_1 + 2(k - 1)\,l$, $k = 1, \dots, K$, the quantity $Q$ may be written:

$$Q = \sum_{k=1}^{K} \sum_{x_i \in C_k} |c_1 + 2(k - 1)\,l - x_i| \qquad (5)$$

If we move $c_1$, but still keeping each $x_i$ in the same class, then obviously $Q$ is a linear function of $c_1$: the absolute value becomes a quantity $a - c_1$ or $c_1 - a$ and their sum is linear.

The function $Q(c_1)$ is continuous and piecewise linear. The discontinuities of the derivative appear for the values of $c_1$ such that, for some $i$ and some $k$:

$$c_1 + 2(k - 1)\,l = x_i + l \quad \text{or} \quad c_1 + 2(k - 1)\,l = x_i - l,$$

that is, when a point $x_i$ lies on the boundary of a class. So, these are points $c_1$ of the form:

$$c_1 = x_i - (2k - 3)\,l$$

or of the form:

$$c_1 = x_i - (2k - 1)\,l \qquad (6)$$

for $i = 1, \dots, N$ and $k = 1, \dots, K$. Both forms are equivalent, if we replace $k$ by $k + 1$.

So we have $NK$ points of discontinuity for the derivative, and all we have to do is to compute the values of the function at these points. The minimum value of $Q$ may be reached only at such points, since the function is linear in between.

A given point $x$ belongs to the class $C_k$ defined by:

$$k = \operatorname{int}\!\left(\frac{x - c_1}{2l} + \frac{1}{2}\right) + 1 \qquad (7)$$

So our program goes as follows. Here, we generated a random sample x(i) between 0 and 1, of size Itot = 10 000. We take lc = 1 / 20 (half-width of a class). We have ktot = 11. Let c(k) be the centers of the classes. The arrays x() and c() and the values Itot, ktot and lc are assumed to be declared and filled beforehand; the inner loop indices are written i1 and k1 here.

    Dim c1 As Double    'position of the first center
    Dim c0 As Double    'best position found so far
    Dim dist As Double  'total loss of information for the current grid
    Dim d_min As Double 'shortest distance
    d_min = 10000       'initialization with high value
    Dim i As Integer, k As Integer     'candidate point and class
    Dim i1 As Integer, k1 As Integer   'inner loop indices
    For i = 1 To Itot
        For k = 1 To ktot
            c1 = x(i) - (2 * k - 1) * lc   'enumeration of all possible first centers
            If c1 > -lc And c1 < lc Then
                For k1 = 1 To ktot
                    c(k1) = c1 + 2 * (k1 - 1) * lc   'enumeration of all centers, the first one being given
                Next k1
                dist = 0
                For i1 = 1 To Itot
                    k1 = Int((x(i1) - c1 + lc) / (2 * lc)) + 1   'the index of the center closest to x(i1)
                    dist = dist + Abs(c(k1) - x(i1))
                Next i1
                If dist < d_min Then
                    d_min = dist
                    c0 = c1
                End If 'If dist < d_min Then
            End If 'If c1 > -1 / 20 And c1 < 1 / 20 Then
        Next k
    Next i

The result is the value of c0. In the present case, we find $c_1 = 0.026$, which means that the value $c_1 = 0$ was not best: a slight shift of the grid to the right minimizes the loss of information.

The values of $c_1$ to be searched are of the form $x_i - l$, $x_i - 3l$, $x_i - 5l$, ... so, quite obviously, for a given $x_i$, only one of them may be in the interval $[m - l, m + l]$.
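The search can also be sketched in Python (not the note's VBA), using the closing remark: for each $x_i$ only one shift of the form $x_i - (2k - 1)\,l$ falls in the admissible interval, so it can be obtained with a modulo instead of a loop over $k$. The function names, the seed and the sample size are illustrative assumptions, and the grid is allowed to extend as far as needed. One point goes beyond the note and is an assumption of this sketch: each term $|c_k - x_i|$ also has a kink where $x_i$ coincides with a center, so the sketch can optionally test the shifts $x_i - 2(k - 1)\,l$ as well.

```python
import math
import random


def total_loss(sample, first_center, half_width):
    """Q of equation (4): sum of distances from each point to the center of its class."""
    loss = 0.0
    for x in sample:
        k = math.floor((x - first_center + half_width) / (2 * half_width))
        loss += abs(x - (first_center + 2 * k * half_width))
    return loss


def candidate_shifts(sample, half_width, include_center_kinks=True):
    """Candidate values of c_1 in [m - l, m + l).

    offset = l  gives the note's candidates, equation (6): x_i on a class boundary.
    offset = 0  gives x_i equal to a center (an addition of this sketch, not in the note).
    """
    low = min(sample) - half_width
    period = 2 * half_width
    offsets = (half_width, 0.0) if include_center_kinks else (half_width,)
    return {low + (x - offset - low) % period for x in sample for offset in offsets}


def best_first_center(sample, half_width, include_center_kinks=True):
    shifts = candidate_shifts(sample, half_width, include_center_kinks)
    return min(shifts, key=lambda c: total_loss(sample, c, half_width))


random.seed(0)                                   # arbitrary seed, for repeatability only
sample = [random.random() for _ in range(300)]   # smaller than the note's 10 000 points
lc = 1 / 20

c_note = best_first_center(sample, lc, include_center_kinks=False)
c_all = best_first_center(sample, lc)
print(c_note, total_loss(sample, c_note, lc))
print(c_all, total_loss(sample, c_all, lc))
```

The cost is quadratic in the sample size, as in the VBA program, which is why the sketch uses 300 points. Comparing the two printed losses shows whether the extra candidates change the result on a given sample.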