Société de Calcul Mathématique SA


Outils d'aide à la décision
Tools for decision help

Probabilistic Studies: Normalizing the Histograms

Bernard Beauzamy
December 2012

Head office: 111, Faubourg Saint Honoré, 75008 Paris. Tel: 01 42 89 10 89. Fax: 01 42 89 10 69. www.scmsa.eu
Société Anonyme with a capital of 56 200 euros. RCS: Paris B 399 991 041. SIRET: 399 991 041 00035. APE: 7219Z

I. General construction of the histogram

Any probabilistic study usually starts with the construction of a histogram: one defines some classes and counts how many points fall into each class.

The most common situation is as follows: we have a sample of real values $x_i$, $i = 1, \dots, \mathrm{Itot}$; let $m = \min_i x_i$ and $M = \max_i x_i$. We want to build a histogram with $K$ classes from this sample.

What people do in general is to divide the interval $[m, M]$ into $K$ classes of width $\frac{M - m}{K}$. But this approach has several drawbacks, and people are often not conscious of them:

- The boundaries of the classes depend strongly on the values of $m$ and $M$, and would be modified if these values changed, for instance if the sample grew bigger;
- These boundaries do not take into account the uncertainties which certainly exist upon the values of $m$ and $M$;
- All classes are of the form $a \le x < b$, except the last one, which is of the form $a \le x \le b$, since the value $M$ is necessarily met.

A histogram should be viewed as a measurement device, just like a thermometer. It gives information, with some accuracy. Therefore, the measurement device should be as independent as possible from the sample. Of course, it cannot be totally independent.
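The drawbacks of the usual construction can be seen on a minimal VBA sketch. The function name UsualClassIndex and the parameter names xMin, xMax are hypothetical, chosen for illustration; they are not part of the programs given later in this note.

```vba
' Hypothetical helper (illustration only): index, from 0 to K - 1, of the class
' of x when the interval [xMin, xMax] is cut into K classes of equal width.
Function UsualClassIndex(ByVal x As Double, ByVal xMin As Double, _
                         ByVal xMax As Double, ByVal K As Long) As Long
    Dim idx As Long
    idx = Int((x - xMin) * K / (xMax - xMin))
    ' x = xMax would give idx = K: the last class must be closed on the right,
    ' so it needs a special treatment that the other classes do not need.
    If idx > K - 1 Then idx = K - 1
    UsualClassIndex = idx
End Function
```

With xMin = 0 and K = 5, the value x = 4 falls into class 2 when xMax = 10, but into class 1 when xMax = 12: the class of a given point changes as soon as the sample maximum changes.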

The number of classes itself, namely $K$, is not arbitrary but should reflect the performance of the measurement device. Indeed, it is linked with the precision we expect. When we build a histogram, two points in the same class are considered as the same point. When we make a measurement, we consider that, if the value $x$ is read, it might as well be anything between $x - \varepsilon$ and $x + \varepsilon$, $\varepsilon$ being considered as the precision of the measurement device. So we have some rough link between both concepts:

$$\frac{M - m}{K} \approx 2\varepsilon, \qquad (1)$$

since $\frac{M - m}{K}$ is the width of each class from the histogram point of view, and $2\varepsilon$ is the width of each class from the precision point of view.

In order to answer the difficulties mentioned above, we will build $K$ classes such that the first one is centered at $m$ and the last one is centered at $M$. Therefore, the centers of the classes will be:

$$c_k = m + k\,\frac{M - m}{K - 1}, \qquad k = 0, \dots, K - 1. \qquad (2)$$

The half-width of a class is:

$$l = \frac{M - m}{2(K - 1)}. \qquad (3)$$

A point $x$ belongs to the class $C_k$, with center $c_k$, if:

$$m + k\,\frac{M - m}{K - 1} - \frac{M - m}{2(K - 1)} \;\le\; x \;<\; m + k\,\frac{M - m}{K - 1} + \frac{M - m}{2(K - 1)}. \qquad (4)$$

So all our classes will here be of the form $a \le x < b$. Condition (4) may be written:

$$k \le \frac{(x - m)(K - 1)}{M - m} + \frac{1}{2} \qquad (4a)$$

and:

$$k > \frac{(x - m)(K - 1)}{M - m} - \frac{1}{2}, \qquad (4b)$$

which means that $k$ is defined by:

$$k = \left[\frac{(x - m)(K - 1)}{M - m} + \frac{1}{2}\right], \qquad (5)$$

where $[x]$ is the integer part of $x$, that is, the largest integer not exceeding $x$.

So a VBA code may be written as follows; Itot is the total number of lines in the table, min_values and max_values are respectively the min and the max, and ktot is the number $K$ above:

    For i = 1 To Itot
        ' index of the class of x(i), formula (5)
        k = Int((x(i) - min_values) * (ktot - 1) / (max_values - min_values) + 1 / 2)
        histo(k) = histo(k) + 1
    Next

As it stands now, the method has a drawback: the extremities of each class are rational numbers, usually with many decimal digits, which looks unnatural with respect to the requirement for a given precision. For instance, a class might appear as 443.556-464.444. Its width is almost 21, which means that we do not want to distinguish between numbers with a difference of, say, 0.5. But still, we give 3 digits after the decimal point, which looks absurd. So we have to study how to round the values.

II. Rounding the values

If we accept the idea that all values in our sample are subject to some measurement error, the simplest way of taking it into account is to round each value. Let $\varepsilon$ be the precision we accept, and let $\varepsilon = 10^{-p}$ for some integer $p \ge 0$. Then each value $x$ is replaced by $rx = \mathrm{round}(x, p)$, which is the number with $p$ decimal places closest to $x$.

Then of course $m$ and $M$ will also be rounded to $p$ decimal places. But even so, the centers of the other classes will not be rounded to the same number of decimal places, because they differ from $m$ by multiples of $\frac{M - m}{K - 1}$.

It is important to keep the fact that all classes should have the same width: for instance, when we generate random numbers, the percentage of points in each class depends on the width of the class.

So what we do is as follows: we do not try to replace all centers by approximate values; we keep the rational value. But still, in the Excel cells, we may present the result with a given number of digits. We will write for instance:

    ' classes are indexed k = 0, ..., ktot - 1, as in (2)
    For k = 0 To ktot - 1
        ' column 1: rounded endpoints of class k; column 2: number of occurrences
        Sheets(3).Cells(k + 2, 1) = _
            Round(min_values + k / (ktot - 1) * (max_values - min_values) _
                  - (max_values - min_values) / (2 * (ktot - 1)), 2) _
            & "-" & _
            Round(min_values + k / (ktot - 1) * (max_values - min_values) _
                  + (max_values - min_values) / (2 * (ktot - 1)), 2)
        Sheets(3).Cells(k + 2, 2) = histo(k)
    Next

This way, we will have a histogram of the following sort:

    interval       number of occurrences
    0-0,01         43
    0,01-0,02      90
    0,02-0,03      5
    0,03-0,04      98

The endpoints look simple, but still the classes have the same width. In this example, the value of $l$ was 0.0050493490863668, the value of the min was 0.0008948365740967, and the value of the max was 0.99996060329803.

III. Advantages of the method

This method answers the difficulties mentioned previously:

- If the sample grows bigger, the classes are not necessarily modified, as long as no value becomes smaller than $m - l$ or larger than $M + l$. Of course, if more points appear below $m$ or above $M$, the values $m$, $M$ will not be centers of classes anymore, but the definition of the classes will not be modified.
- The construction incorporates the uncertainties upon the values.
- All classes have the same form, namely $a \le x < b$.
- The method may be fully automated. All we need is $m$, $M$, $K$.

An interesting application, which we recently met, is that this method allows us to show that some variables have identical laws. Assume for instance that we have one random variable $X$ and another one $Y$, which turns out to be $Y = 100X$. If we build the histograms the usual way, by hand, we might not notice this. Assume for instance that the minimum value for $X$ is 0.04 and we want 100 classes. We would take as first interval for $X$ the interval 0-0.1. For $Y$, the smallest value is 4, and we would probably take as first interval 0-5. We would not see that this is the same variable, up to a multiplication by the constant factor 100.
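The comparison between $X$ and $Y = 100X$ can be tried out with the automatic class index of formula (5). The following is a minimal VBA sketch under stated assumptions: the names CenteredClassIndex and CompareScaledSamples are hypothetical, and the four sample values are made up for the illustration.

```vba
' Hypothetical helper (illustration only): index k, from 0 to K - 1, of the
' class of x, for K classes whose first center is xMin and last center is xMax.
Function CenteredClassIndex(ByVal x As Double, ByVal xMin As Double, _
                            ByVal xMax As Double, ByVal K As Long) As Long
    CenteredClassIndex = Int((x - xMin) * (K - 1) / (xMax - xMin) + 0.5)
End Function

Sub CompareScaledSamples()
    Dim x As Variant, i As Long
    x = Array(0.04, 0.3, 0.55, 0.92)   ' made-up sample of X, min 0.04, max 0.92
    For i = LBound(x) To UBound(x)
        ' Y = 100 X: the value, the min and the max are all multiplied by 100
        Debug.Print CenteredClassIndex(x(i), 0.04, 0.92, 5), _
                    CenteredClassIndex(100 * x(i), 4, 92, 5)
    Next i
End Sub
```

For this sample the two columns printed in the Immediate window should agree line by line, since the factor 100 cancels in the ratio $(x - m)/(M - m)$. Agreement could fail only for a value lying exactly on a class boundary, where floating-point rounding decides the class.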

If the classes are defined in an automatic manner, as was previously explained, the link between $X$ and $Y$ is obvious. All endpoints are multiplied by 100, and the number of points in each class is the same.

IV. Loss of information

Quite clearly, when we build a histogram, some information is lost: all points belonging to the same class are identified together, and identified with the center of the class.

Simply consider the extreme values $m$ and $M$: it is better to have them as centers of classes, since no information will then be lost upon them. So the indicator "total loss of information when performing the histogram" is one more reason for the choice we indicate.

V. Refining the definition of the grid

In the work above, we decided that the extreme classes would be centered at the extreme values of the sample. We may wonder if there are better choices. We now investigate this question.

The set of classes will be called a "grid". As before, there are $K$ classes, denoted by $C_k$, and the number $K$ is fixed. The width of the classes is denoted by $2l$; it is the same for all classes, and it is fixed, since it results from the precision which is required. We denote by $c_k$, $k = 1, \dots, K$, the center of the class $C_k$. Our question now is how to choose the position of the center $c_1$, since all other centers will follow.

If some points $x_i$ fall into the class $C_k$, they are identified with its center $c_k$; so there is a loss of information equal to $|x_i - c_k|$ for each of them. We are looking for the position of the grid, that is the position of $c_1$, which will minimize this loss of information.

As before, we set $m = \min_i x_i$ and $M = \max_i x_i$; we admit the fact that the grid is large enough to cover all the sample, with half a class on each side. This gives the inequality:

$$2Kl \ge M - m. \qquad (1)$$

It is useless to have empty classes, either before $m$ or after $M$. So the first class will contain $m$ and the last one will contain $M$, and we get the conditions:

$$m \ge c_1 - l, \qquad M \le c_K + l.$$

Because $c_K - c_1 = (2K - 2)\,l$, this is compatible with condition (1):

$$M - m = (M - c_K) + (c_K - c_1) + (c_1 - m) \le 2l + (2K - 2)\,l = 2Kl.$$

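The total number of classes, taking (1) into account, is:

$$K = \mathrm{int}\!\left(\frac{M - m}{2l}\right) + 1. \qquad (2)$$

For instance, if $m = 0$, $M = 1$ and $l = 1/20$ (classes of width 1/10), we find $K = 11$.

So the difference with the paragraphs above is that now $c_1$ may not be exactly at $m$. More precisely, we want to position $c_1$ under the constraint:

$$m - l \le c_1 \le m + l \qquad (3)$$

and we want to minimize the quantity:

$$Q = \sum_{k=1}^{K} \sum_{x_i \in C_k} |c_k - x_i|. \qquad (4)$$

In the definition of this quantity, we consider that the total loss of information is simply the sum of all individual losses of information; we do not see any reason to take, for instance, a quadratic sum.

Since $c_k = c_1 + 2(k - 1)\,l$, $k = 1, \dots, K$, the quantity $Q$ may be written:

$$Q = \sum_{k=1}^{K} \sum_{x_i \in C_k} |c_1 + 2(k - 1)\,l - x_i|. \qquad (5)$$

If we move $c_1$ while keeping each $x_i$ in the same class and on the same side of its center, then $Q$ is a linear function of $c_1$: each absolute value becomes a quantity $a - c_1$ or $c_1 - a$, and their sum is linear. The function $Q(c_1)$ is continuous and piecewise linear.

The derivative of $Q$ is discontinuous in two situations. First, where a center meets a sample point, that is at $c_1 = x_i - 2(k - 1)\,l$. Second, where a point changes class, that is for the values of $c_1$ such that, for some $i$ and some $k$:

$$c_1 + 2(k - 1)\,l = x_i + l \quad \text{or} \quad c_1 + 2(k - 1)\,l = x_i - l.$$

So these are points $c_1$ of the form:

$$c_1 = x_i - (2k - 3)\,l$$

or of the form:

$$c_1 = x_i - (2k - 1)\,l \qquad (6)$$

for $i = 1, \dots, N$ and $k = 1, \dots, K$. Both forms are equivalent, up to a shift of the index $k$ by one.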
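As a check on the shape of $Q$, one can follow the contribution of a single sample point as the grid moves. The values below are assumed for illustration only: $l = 1/20$, $m = 0$, and one point $x_i = 0.03$; the notation $q_i$ for the contribution of that point is ours.

```latex
% Assumed illustration values: l = 1/20, m = 0, x_i = 0.03,
% with c_1 restricted to [m - l, m + l] = [-0.05, 0.05] and c_2 = c_1 + 0.1.
\[
q_i(c_1) = |c_k - x_i| =
\begin{cases}
c_1 + 0.07, & -0.05 \le c_1 \le -0.02 \quad (x_i \in C_2),\\[2pt]
0.03 - c_1, & -0.02 < c_1 \le 0.03 \quad (x_i \in C_1),\\[2pt]
c_1 - 0.03, & 0.03 < c_1 \le 0.05 \quad (x_i \in C_1).
\end{cases}
\]
% q_i(-0.05) = 0.02, \quad q_i(-0.02) = 0.05 = l, \quad q_i(0.03) = 0, \quad q_i(0.05) = 0.02.
```

The slope changes at $c_1 = -0.02 = x_i - l$, where the point sits on a class boundary and its contribution is largest, and at $c_1 = 0.03 = x_i$, where the first center meets the point and its contribution vanishes.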

So we have $N K$ points of the form (6), and all we have to do is to compute the values of the function at these points. Since $Q$ is linear between two consecutive breakpoints, its minimum can be reached only at a breakpoint. The program below examines the values (6); the points $c_1 = x_i - 2(k - 1)\,l$, where a center meets a sample point, are not enumerated.

A given point $x$ belongs to the class $C_k$ defined by:

$$k = \mathrm{int}\!\left(\frac{x - c_1}{2l} + \frac{1}{2}\right) + 1. \qquad (7)$$

So our program goes as follows. Here, we generated a random sample x(i) between 0 and 1, of size Itot = 10 000. We take lc = 1 / 20 (half width of a class). We have ktot = 11. Let c(k) be the centers of the classes.

    Dim c1 As Double      'position of the first center
    Dim c0 As Double      'best position found so far
    Dim dist As Double    'total loss for the current grid
    Dim d_min As Double   'shortest distance
    d_min = 10000         'initialization with high value
    Dim i As Integer, k As Integer      'outer loops: candidates
    Dim ii As Integer, kk As Integer    'inner loops: centers and points
    For i = 1 To Itot
        For k = 1 To ktot
            c1 = x(i) - (2 * k - 1) * lc   'enumeration of all possible first centers
            If c1 > -lc And c1 < lc Then   'constraint (3), the minimum m being close to 0 here
                For kk = 1 To ktot
                    c(kk) = c1 + 2 * (kk - 1) * lc   'enumeration of all centers, the first one being given
                Next
                For ii = 1 To Itot
                    kk = Int((x(ii) - c1 + lc) / (2 * lc)) + 1   'the index of the center closest to x(ii)
                    dist = dist + Abs(c(kk) - x(ii))
                Next
                If dist < d_min Then
                    d_min = dist
                    c0 = c1
                End If   'If dist < d_min Then
            End If   'If c1 > -lc And c1 < lc Then
            dist = 0
        Next
    Next

The result is the value of c0. In the present case, we find $c_1 = 0.026$, which means that the value $c_1 = 0$ was not the best: a slight shift of the grid to the right reduces the loss of information.

The values of $c_1$ to be searched are of the form $x_i - l$, $x_i - 3l$, $x_i - 5l$, ... so, quite obviously, for a given $x_i$, only one of them may be in the interval $(m - l, m + l)$.
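Since only one candidate per point can fall in the admissible interval, the loop over the classes can be replaced by a direct shift of each candidate into that interval. The following is a minimal VBA sketch of this variant, not the program above: the names GridLoss and BestFirstCenter are hypothetical, x() is assumed to be a 1-based array of Double, and the sketch also evaluates the points where a center meets a sample point.

```vba
' Hypothetical helper (illustration only): total loss Q for a grid whose
' first center is c1. x() is assumed to be indexed from 1 to Itot.
Function GridLoss(ByVal c1 As Double, x() As Double, _
                  ByVal Itot As Long, ByVal lc As Double) As Double
    Dim i As Long, k As Long, total As Double
    For i = 1 To Itot
        k = Int((x(i) - c1 + lc) / (2 * lc)) + 1            ' class index, formula (7)
        total = total + Abs(c1 + 2 * (k - 1) * lc - x(i))   ' distance to the center c_k
    Next i
    GridLoss = total
End Function

' Hypothetical search: best first center in [xMin - lc, xMin + lc).
Function BestFirstCenter(x() As Double, ByVal Itot As Long, _
                         ByVal xMin As Double, ByVal lc As Double) As Double
    Dim i As Long, j As Long
    Dim c1 As Double, q As Double, qMin As Double, best As Double
    best = xMin
    qMin = GridLoss(xMin, x, Itot, lc)   ' start from the grid centered at the minimum
    For i = 1 To Itot
        For j = 0 To 1
            ' j = 0: a center sits on x(i); j = 1: x(i) sits on a class boundary, as in (6)
            c1 = x(i) - j * lc
            ' shift the candidate by a multiple of 2 * lc into the admissible interval
            c1 = c1 - 2 * lc * Int((c1 - xMin + lc) / (2 * lc))
            q = GridLoss(c1, x, Itot, lc)
            If q < qMin Then
                qMin = q
                best = c1
            End If
        Next j
    Next i
    BestFirstCenter = best
End Function
```

With the variables of the program above, a call would read c0 = BestFirstCenter(x, Itot, min_values, lc), provided x is declared as an array of Double. The cost remains of the order of Itot squared evaluations of an absolute value, so the sketch is meant for samples of moderate size.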