
The memory centre

IMUJ PREPRINT 2012/03

P. Spurek
Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Kraków, Poland

J. Tabor
Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Kraków, Poland

Abstract

Let x ∈ R be given. As we know, the amount of bits needed to code x in binary with a given accuracy h ∈ R is approximately

m_h(x) ≈ log_2(max{1, |x|/h}).

We consider the problem of where we should translate the origin a so that the mean amount of bits needed to code a randomly chosen element from a realization of a random variable X is minimal. In other words, we want to find a ∈ R such that

R ∋ a → E(m_h(X − a))

attains its minimum. We show that, under reasonable assumptions, the choice of a asymptotically does not depend on h. Consequently, we reduce the problem to finding the minimum of the function

R ∋ a → ∫_R ln(|x − a|) f(x) dx,

where f is the density distribution of the random variable X. Moreover, we provide a constructive approach for determining a.

Keywords: memory compressing, IRLS, differential entropy, coding, kernel estimation

Email addresses: przemyslaw.spurek@ii.uj.edu.pl (P. Spurek), jacek.tabor@ii.uj.edu.pl (J. Tabor)

Preprint submitted to Information Sciences, April 2, 2012

1. Introduction

Data compression is usually achieved by assigning short descriptions (codes) to the most frequent outcomes of the data source and necessarily longer descriptions to the less frequent outcomes [1, 3, 7]. For the convenience of the reader, we shortly present the theoretical background of this approach.

Let p = (p_1, ..., p_n) be the probability distribution of a discrete random variable X. Assume that l_i is the length of the code of x_i for i = 1, ..., n. Then the expected number of bits is given by Σ_i p_i l_i. The set of possible codewords with uniquely decodable codes is limited by the Kraft inequality Σ_i 2^{−l_i} ≤ 1. It is easy to verify that the lengths which minimize Σ_i p_i l_i are given by l_i = −log_2 p_i. We obtain that the minimal amount of information per one element in lossless coding is the Shannon entropy [1], defined by

H(X) = −Σ_i p_i log_2 p_i.

Based on this approach, various types of lossless data compression have been constructed. An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman [5].

If we want to consider continuous random variables and code with a given maximal error h, we arrive at the notion of differential entropy [1]. Let f: R → R be a continuous density distribution of the random variable X, and let l: R → R be the function of the code length. We divide the domain of f into disjoint intervals of length h, and let X_h denote the corresponding discretization of X. The Shannon entropy of X_h can be rewritten as follows:

H(X_h) ≈ −Σ_i f(x_i) h log_2(f(x_i) h)
       = −Σ_i f(x_i) log_2(f(x_i)) h − log_2(h) Σ_i f(x_i) h
       ≈ −∫ f(x) log_2(f(x)) dx − log_2(h) ∫ f(x) dx
       = −∫ f(x) log_2(f(x)) dx − log_2(h).                      (1)

By taking the limit of H(X_h) + log_2(h) as h → 0, we obtain the definition of the differential entropy

H(f) := −∫ f(x) log_2(f(x)) dx.                                  (2)

Very often ln is used instead of log_2; in this article we also use this convention.
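For instance, the following short R sketch computes the Shannon entropy of an example distribution p (chosen arbitrarily here), the corresponding optimal code lengths l_i = −log_2 p_i, and checks the Kraft inequality for their integer roundings.

# Shannon entropy, optimal code lengths and the Kraft inequality (illustration only)
p <- c(0.5, 0.25, 0.125, 0.125)       # an arbitrary probability distribution
H <- -sum(p * log2(p))                # Shannon entropy H(X)
l <- ceiling(-log2(p))                # integer code lengths l_i = ceil(-log2 p_i)
c(entropy = H,
  expected_length = sum(p * l),       # within one bit of H(X)
  kraft_sum = sum(2^(-l)))            # Kraft inequality: sum <= 1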

In this paper, we follow a different approach. Instead of looking for the best type of coding for a given dataset, we use standard binary coding² and we search for the optimal centre a of the coordinate system, so that the mean amount of bits needed to code the dataset is minimal. The main advantage of this idea is that we do not have to fit the type of compression to a dataset. Moreover, the codes are given in a very simple way. This approach allows us to immediately encode and decode large datasets (we use only one type of code). Clearly, the classical binary code is far from being optimal, but it is simple and commonly used in practice.

The number of bits needed to code x ∈ Z by the classical binary code is given by

m(x) ≈ log_2(max{1, |x|}).

Consequently, the memory needed to code x ∈ R with a given accuracy h is approximately

m_h(x) = m(x/h) ≈ log_2(max{1, |x|/h}).

Our aim is to find the place where to put the origin a of the coordinate system so that the mean amount of bits needed to code a randomly chosen element from a sample from the probability distribution of X is minimal. In other words, we want to find a ∈ R such that

E(m_h(X − a)) = ∫_R m_h(x − a) f(x) dx

attains its minimum, where f is the density distribution of the random variable X.

Our paper is arranged as follows. In the next section we show that, under reasonable assumptions, the choice of a asymptotically does not depend on h. This reasoning is similar to the derivation of the differential entropy (1). In the third section, we consider the typical situation when the density distribution of a random variable X is not known. We use a standard kernel method to estimate the density f. Working with this estimation is possible but, from the numerical point of view, complicated and not effective. So in the next section we show a reasonable approximation which has better properties. In the fourth section, we present our main algorithm, and in Appendix B we give its full implementation. In the last section, we present how our method works on typical datasets.

² In the classical binary code we use one bit for the sign and then the standard binary representation. This code is not a prefix code, so we have to mark the ends of words. A similar coding is used in the decimal numeral system.
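To make the objective concrete, the following minimal R sketch evaluates m_h and the empirical counterpart of E(m_h(X − a)) (the mean over a sample) for a few candidate centres; the sample, the accuracy h and the candidate values of a are chosen arbitrarily.

# Bits needed to code x with accuracy h using the classical binary code
m_h <- function(x, h) log2(pmax(1, abs(x) / h))

# Empirical mean amount of bits when the origin is translated to a
mean_bits <- function(a, S, h) mean(m_h(S - a, h))

set.seed(1)
S <- rnorm(1000, mean = 5, sd = 1)    # an arbitrary sample
h <- 0.01                             # an arbitrary accuracy
sapply(c(0, 3, 5, 7), function(a) mean_bits(a, S, h))   # fewest bits near a = 5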

2. The kernel density estimation

As was mentioned in the previous section, our aim is to minimize, for a fixed h, the function a → E(m_h(X − a)). In this chapter, we show that the choice of a (asymptotically) does not depend on h. In Theorem 2.1, we use a similar reasoning as in the derivation of the differential entropy (1) and we show that it is enough to consider the function

M_f(a) := ∫_R ln(|x − a|) f(x) dx.

Theorem 2.1. We assume that the random variable X has a locally bounded density distribution f: R → R. If M_f(a) < ∞ for a ∈ R, then

lim_{h→0} [ E(m_h(X − a)) − M_f(a)/ln(2) + log_2(h) ] = 0   for a ∈ R.

Proof. Let a ∈ R be fixed. Then

E(m_h(X − a)) = ∫_R m_h(x − a) f(x) dx = ∫_R log_2(max{1, |x − a|/h}) f(x) dx
  = ∫_{R\(a−h,a+h)} log_2(|x − a|/h) f(x) dx
  = ∫_{R\(a−h,a+h)} log_2(|x − a|) f(x) dx − log_2(h) ∫_{R\(a−h,a+h)} f(x) dx
  = ∫_R log_2(|x − a|) f(x) dx − ∫_{(a−h,a+h)} log_2(|x − a|) f(x) dx
    − log_2(h) ∫_R f(x) dx + log_2(h) ∫_{(a−h,a+h)} f(x) dx.

Since the function f is a locally bounded density distribution,

lim_{h→0} [ ∫_{(a−h,a+h)} log_2(|x − a|) f(x) dx − log_2(h) ∫_{(a−h,a+h)} f(x) dx ] = 0.

Consequently,

lim_{h→0} [ E(m_h(X − a)) − M_f(a)/ln(2) + log_2(h) ] = 0   for a ∈ R.
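Theorem 2.1 can also be checked numerically: the sketch below approximates E(m_h(X − a)) by a Monte Carlo average for a standard normal X and a fixed a, and shows that E(m_h(X − a)) + log_2(h) stabilizes near M_f(a)/ln(2) as h decreases (the sample size, a and the values of h are arbitrary).

set.seed(1)
X <- rnorm(1e6)                        # sample from N(0,1)
a <- 0.7                               # an arbitrary centre
m_h <- function(x, h) log2(pmax(1, abs(x) / h))

# E(m_h(X - a)) + log2(h) for decreasing h
sapply(c(1e-1, 1e-2, 1e-3, 1e-4), function(h) mean(m_h(X - a, h)) + log2(h))

# The limit predicted by Theorem 2.1: M_f(a)/ln(2); tails beyond +/-10 are negligible
g <- function(x) log2(abs(x - a)) * dnorm(x)
integrate(g, -10, a)$value + integrate(g, a, 10)$value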

As we see, when we increase the accuracy of coding (h → 0), the shape of the function E(m_h(X − a)) stabilizes (modulo subtraction of log_2(h)).

Example 2.1. Let f be the uniform density distribution on the interval [−1/2, 1/2]. Then

M_f(a) = ∫_R ln(|x − a|) f(x) dx = ∫_{−1/2}^{1/2} ln(|x − a|) dx.

Moreover, since ∫ ln(x − a) dx = ln(x − a)x − ln(x − a)a − x + a + C, we have

M_f(a) = { (1/2 − a) ln(|1/2 − a|) + (1/2 + a) ln(|1/2 + a|) − 1   for a ≠ ±1/2,
           −1                                                      for a = ±1/2.

The function M_f(a) is presented in Fig. 1. It is easy to see that the minimum of M_f is attained at a = 0.

Figure 1: The function M_f(a) constructed for the uniform density on [−1/2, 1/2].
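The closed form in Example 2.1 is easy to check numerically; the following R sketch compares it with direct numerical integration for a few arbitrary values of a.

# M_f for the uniform density on [-1/2, 1/2]: closed form vs numerical integration
Mf_closed <- function(a) {
  if (abs(a) == 0.5) return(-1)
  (0.5 - a) * log(abs(0.5 - a)) + (0.5 + a) * log(abs(0.5 + a)) - 1
}
Mf_numeric <- function(a) {
  g <- function(x) log(abs(x - a))
  if (abs(a) < 0.5)                     # split at the (integrable) singularity
    integrate(g, -0.5, a)$value + integrate(g, a, 0.5)$value
  else
    integrate(g, -0.5, 0.5)$value
}
a_grid <- c(-1, -0.25, 0, 0.3, 2)
cbind(a = a_grid,
      closed  = sapply(a_grid, Mf_closed),
      numeric = sapply(a_grid, Mf_numeric))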

In a typical situation, we do not know the density distribution f of the random variable X. We only have a sample S = (x_1, ..., x_N) from X. To approximate f we use the kernel method [8]. For the convenience of the reader, and to establish the notation, we shortly present this method. A kernel K: R → R is simply a function satisfying the condition

∫_R K(x) dx = 1.

Usually, K is a symmetric probability density function. The kernel estimator of f with kernel K is defined by

f̂(x) := (1/(Nh)) Σ_{i=1}^{N} K((x − x_i)/h),

where h is the window width³. The asymptotically optimal (for N → ∞) choice of the kernel K in the class of symmetric and square-integrable functions is the Epanechnikov kernel⁴

K(x) = { (3/4)(1 − x²)   for |x| < 1,
         0               for |x| ≥ 1.

The asymptotically optimal choice of the window width (under the assumption that the density is Gaussian) is given by

h ≈ 2.35 s N^{−1/5},   where   s = sqrt( (1/N) Σ_{i=1}^{N} (x_i − m(S))² )

and m(S) denotes the mean (barycentre) of the sample S. Thus our aim is to minimize the function

a → M_S(a) := (1/(Nh)) Σ_{i=1}^{N} ∫_{x_i − h}^{x_i + h} ln(|x − a|) (3/4)(1 − ((x − x_i)/h)²) dx.

To compute M_S(a), we analyse the function L: R → R (see Fig. 2) given by

L: R ∋ a → (3/4) ∫_{−1}^{1} ln(|x − a|)(1 − x²) dx.

³ Also called the smoothing parameter or bandwidth by some authors.
⁴ Often a rescaled (normalized) Epanechnikov kernel is used.

Figure 2: The function L.

Lemma 2.1. We have

M_S(a) = ln(h) + (1/N) Σ_{i=1}^{N} L((x_i − a)/h).

Proof. By simple calculations, using the substitution y = (x − x_i)/h, dy = dx/h, x = hy + x_i, we obtain

M_S(a) = (1/(Nh)) Σ_{i=1}^{N} ∫_{x_i − h}^{x_i + h} ln(|x − a|) (3/4)(1 − ((x − x_i)/h)²) dx
       = (1/N) Σ_{i=1}^{N} (3/4) ∫_{−1}^{1} ln(|hy + x_i − a|)(1 − y²) dy
       = (1/N) Σ_{i=1}^{N} [ (3/4) ∫_{−1}^{1} ln(|y + (x_i − a)/h|)(1 − y²) dy + ln(h) ]
       = ln(h) + (1/N) Σ_{i=1}^{N} L((x_i − a)/h).
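Lemma 2.1 can be verified numerically; the sketch below compares M_S(a) computed directly from its definition with the right-hand side ln(h) + (1/N) Σ L((x_i − a)/h), where L itself is evaluated by numerical integration (the sample is arbitrary, and a is taken outside the sample range so that all integrands stay smooth).

set.seed(1)
S <- rnorm(50); N <- length(S)
h <- 2.35 * sd(S) * N^(-1/5)
a <- 10                                 # a point outside the sample range

# L(a) = 3/4 * integral over [-1,1] of ln|x - a| (1 - x^2) dx
L_num <- function(a)
  0.75 * integrate(function(x) log(abs(x - a)) * (1 - x^2), -1, 1)$value

# M_S(a) directly from its definition
M_S <- function(a)
  sum(sapply(S, function(xi)
    integrate(function(x) log(abs(x - a)) * 0.75 * (1 - ((x - xi) / h)^2),
              xi - h, xi + h)$value)) / (N * h)

c(direct = M_S(a),
  lemma  = log(h) + mean(sapply(S, function(xi) L_num((xi - a) / h))))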

3. Approximation of the function L

As was shown in the previous section, a crucial role in our investigation is played by the function L; therefore, in this chapter we study its basic properties. Let us begin with the exact formula for L.

Proposition 3.1. The function L is given by the following formula:

L(a) = { (1/2) ln(|1 − a²|) + (3a/4 − a³/4) ln(|1 + a|/|1 − a|) + a²/2 − 4/3   for a ≠ ±1,
         ln(2) − 5/6                                                           for a = ±1.

Moreover, L is even, L(0) = −4/3 and lim_{|a|→∞} (L(a) − ln(|a|)) = 0.

Proof. We consider the case a > 1 (by a similar calculation we can get this result for all a ∈ R):

∫ ln(|x − a|)(1 − x²) dx = (x − x³/3) ln(|x − a|) − ∫ (x − x³/3)/(x − a) dx
  = (x − x³/3 + (a³ − 3a)/3) ln(|x − a|) + x³/9 + a x²/6 + (a² − 3)x/3 + C.

Evaluating (3/4) of this antiderivative between −1 and 1, we consequently obtain

L(a) = { (1/2) ln(|1 − a²|) + (3a/4 − a³/4) ln(|1 + a|/|1 − a|) + a²/2 − 4/3   for a ≠ ±1,
         ln(2) − 5/6                                                           for a = ±1.

As a simple corollary, we obtain that L is even and L(0) = −4/3. To show the last property, we use the equality

(3/4) ∫_{−1}^{1} ln(|x − a|)(1 − x²) dx = (3/4) ∫_{−1}^{1} ln(|x + a|)(1 − x²) dx.

Then, for a > 1, we obtain

(3/4) ∫_{−1}^{1} ln(|x − a|)(1 − x²) dx = (3/8) ∫_{−1}^{1} (ln(|x − a|) + ln(|x + a|))(1 − x²) dx
  = (3/8) ∫_{−1}^{1} ln(a²)(1 − x²) dx + (3/8) ∫_{−1}^{1} ln(1 − x²/a²)(1 − x²) dx
  = (1/2) ln(a²) + (3/8) ∫_{−1}^{1} ln(1 − x²/a²)(1 − x²) dx.

Since

(3/8) ∫_{−1}^{1} ln(1 − x²/a²)(1 − x²) dx ∈ [ min_{x∈[−1,1]} (1/2) ln(1 − x²/a²), max_{x∈[−1,1]} (1/2) ln(1 − x²/a²) ] = [ (1/2) ln(1 − 1/a²), 0 ],

we get

0 ≥ lim_{a→∞} (L(a) − ln(a)) ≥ lim_{a→∞} (1/2) ln(1 − 1/a²) = 0.
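As a sanity check on Proposition 3.1, the closed form can be compared with the defining integral; the sketch below also verifies L(0) = −4/3 and the asymptotics L(a) − ln(|a|) → 0 (the evaluation points are arbitrary).

# Closed form of L from Proposition 3.1
L_exact <- function(a) {
  if (abs(a) == 1) return(log(2) - 5/6)
  0.5 * log(abs(1 - a^2)) +
    (3 * a / 4 - a^3 / 4) * log(abs(1 + a) / abs(1 - a)) +
    a^2 / 2 - 4/3
}
# Defining integral L(a) = 3/4 * integral over [-1,1] of ln|x - a| (1 - x^2) dx
L_int <- function(a)
  0.75 * integrate(function(x) log(abs(x - a)) * (1 - x^2), -1, 1)$value

sapply(c(-3, 2, 5), function(a) c(exact = L_exact(a), integral = L_int(a)))
L_exact(0) + 4/3                        # equals 0, i.e. L(0) = -4/3
L_exact(100) - log(100)                 # close to 0 for large |a|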

From the numerical point of view, the use of the function L (Fig. 2) is complicated and not effective. The main problem is connected with a possible numerical instability for a close to ±1. Moreover, in our algorithm we use the first derivative (more information in the next chapter and in Appendix A), that is, the derivative of the map R_+ ∋ a → L(√a), which is given by

(3/8)(1/√a − √a) ln((√a + 1)/|√a − 1|) + 3/4   for a ≠ 1,     3/4   for a = 1.

In this case, we have numerical instability for a close to 0 and 1. Thus, instead of L, we use the Cauchy M-estimator [10] L̄: R → R, given by

L̄(a) := (1/2) ln(e^{−8/3} + a²).

The errors caused by this approximation are reasonably small in relation to those connected with the kernel estimation⁵.

Observation 3.1. The function L̄ is analytic, even, L̄(0) = L(0) and lim_{|a|→∞} (L̄(a) − L(a)) = 0.

Consequently, the problem of finding the optimal (with respect to the needed memory) centre of the coordinate system can be well approximated by searching for the global minimum of the function

M̄_S(a) := (1/N) Σ_{i=1}^{N} L̄((a − x_i)/h)   for a ∈ R.

⁵ As was said, this method assumes that the dataset is a realization of a Gaussian random variable, while in practice it usually does not have to be.
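The quality of this approximation can be inspected directly; the following sketch evaluates the closed form of L and the Cauchy approximation L̄ at a few arbitrary points and shows that the difference vanishes as |a| grows.

# Compare L (closed form, Proposition 3.1) with the Cauchy approximation bar L
L_exact <- function(a)
  0.5 * log(abs(1 - a^2)) +
    (3 * a / 4 - a^3 / 4) * log(abs(1 + a) / abs(1 - a)) + a^2 / 2 - 4/3
L_bar <- function(a) 0.5 * log(exp(-8/3) + a^2)

a <- c(0, 0.5, 2, 10, 100)              # arbitrary evaluation points (a != 1)
round(rbind(L = sapply(a, L_exact), bar_L = L_bar(a)), 4)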

Figure 3: Comparison of the functions L and L̄.

4. Search for the minimum

In this section, we present a method of finding the minimum of the function M̄_S(a). We use the robust technique which is called the M-estimator [4].

Remark 4.1. For the convenience of the reader, we shortly present the standard use of the M-estimator method. Let {x_1, ..., x_n} be a given dataset. We are looking for the best representation of the points:

min_a Σ_i (x_i − a)².

Normally we choose the barycentre of the data, but elements which are far from the centre usually cause undesirable effects. M-estimators try to reduce the effect of outliers by replacing the squares by another function (in our case the Cauchy M-estimator [10]):

min_a Σ_i L(x_i − a),

where L is a symmetric, positive definite function with a unique minimum at zero, chosen to be less increasing than the square. Instead of solving this problem directly, one usually implements an Iteratively Reweighted Least Squares (IRLS) method [see Appendix A].

In our case we are looking for

min_a Σ_i L̄(x_i − a),

where L̄ is interpreted as a function which describes the memory needed to code x_i with respect to a.

Our approach, based on the Iteratively Reweighted Least Squares (IRLS) method, is similar to those presented in [9] and [6]. For the convenience of the reader, we include the basic theory connected with this method in Appendix A. In our investigations the crucial role is played by the following proposition.

Corollary IRLS (see Appendix A). Let f: R_+ → R_+, f(0) = 0 be a given concave and differentiable function and let S = {x_1, ..., x_N} be a given dataset. We consider the function

F(a) := Σ_i f(‖a − x_i‖²)   for a ∈ R.

Let ā ∈ R be arbitrarily fixed and let

w_i = f'(‖x_i − ā‖²)   for i = 1, ..., N   and   ā_w = (1/Σ_{i=1}^{N} w_i) Σ_{i=1}^{N} w_i x_i.

Then F(ā_w) ≤ F(ā).

Making use of the above corollary, by the substitution ā → ā_w in each step we come closer to a local minimum of the function F. It is easy to see that L̄, defined in the previous section, satisfies the assumptions of Corollary IRLS.

Let S = (x_1, ..., x_N) be a given realization of the random variable X and let

h = 2.35 N^{−1/5} sqrt( (1/N) Σ_{i=1}^{N} (x_i − m(S))² ).

The algorithm (based on IRLS) can be described as follows.

Algorithm 1.

initialization
    stop condition ε > 0
    initial condition j = 0, a_j = m(S)
repeat
    calculate w_i := (e^{−8/3} + (x_i − a_j)²/h²)^{−1}   for i = 1, ..., N
    j = j + 1
    a_j = (1/Σ_{i=1}^{N} w_i) Σ_{i=1}^{N} w_i x_i
until |a_j − a_{j−1}| < ε

Figure 4: Estimation of the density distribution for the Forex data.

The first initial point can be chosen in many different ways. We usually start from the barycentre of the dataset because, for equal weights, the barycentre is the best approximation of the sample (for the full code written in R, see Appendix B). Now we show how our method works in practice.

Example 4.1. Let S be the sample of the USD/EUR index from the Forex exchange [2]. The density obtained by the kernel method is presented in Fig. 4.

As a result of the algorithm we obtained alg_centre(S) = 238.474. We compare our result with the global minimum argmin(M_S) = 239.509 and the barycentre of the data m(S) = 22.8004:

min M_S = 4.0477,   M_S(alg_centre(S)) = 4.0478,   M_S(m(S)) = 4.0768.

As we see, the difference between min M_S and M_S(alg_centre(S)) is small. Moreover, the barycentre gives a good approximation of the memory centre but, as we will see in the next examples, the difference can be large for densities which are not unimodal.

In the next step we consider random variables of the form X := p_1 X_1 + p_2 X_2, where X_1, X_2 are two normal random variables⁶ or two uniform random variables⁷. In Table 1, we present a comparison of the result of our algorithm and the global minimum of the function M_S, where S is a realization of the random variable X of size 500, for different parameters.

Model p_1 X_1 + p_2 X_2       a_r      a_m      M_S(a_r) − M_S(a_m)   M_S(m(S)) − M_S(a_m)
0.6N(−1,1) + 0.4N(1,1)       −0.347   −0.379   0.0007                0.00658
0.4N(−6,1) + 0.6N(6,1)        5.803    5.676   0.00079               0.55374
0.4N(−1,1) + 0.6N(1,1)        0.483    0.46    0.00093               0.0353
0.3N(−6,0.5) + 0.7N(6,1)      5.979    5.863   0.00089               0.57296
0.2N(−2,0.5) + 0.8N(3,2)      2.72     2.770   0.0002                0.0529
0.6U[−3,−1] + 0.4U[0,1]      −1.805   −1.824   0.0004                0.9388
0.4U[−3,−1] + 0.6U[0,1]       0.364    0.403   0.0044                0.439
0.3U[−3,−1] + 0.7U[0,1]       0.433    0.447   0.00028               0.44633
0.2U[−2,−1] + 0.8U[1,2]       1.403    1.445   0.00265               0.38220
0.2U[−5,−2] + 0.8U[3,4]       3.464    3.439   0.0009                0.5087

Table 1: We use the following notation: a_r = alg_centre(S) and a_m = argmin(M_S).

As we see in the columns a_r and a_m, the algorithm, which uses the function L̄, gives a good approximation of the minimum of the function L. It means that the use of L̄ from the third section is reasonable and causes minimal errors. In our cases, we obtained a good approximation of the global minimum. Moreover, the difference between the minimum of the original function and the result of our algorithm is small (see the column M_S(a_r) − M_S(a_m)). Consequently, we see that the barycentre of the dataset is a good candidate for the initial point. Clearly (see the last column), the barycentre of the data is not a good approximation of the memory centre, especially in the situation of densities which are not unimodal.

⁶ The normal random variable with mean m and standard deviation s is denoted by N(m,s).
⁷ The uniform random variable on the interval [a, b] is denoted by U[a,b].

5. Appendix A

In the fourth section, we presented a method for finding the minimum of the function M̄_S(a).

Our approach is based on the Iteratively Reweighted Least Squares (IRLS) algorithm. The method can be applied in statistics and computer science. We have used it to minimize the function M̄_S(a). A similar approach is presented in [9] and [6]. For the convenience of the reader, we show the basic theoretical information about the IRLS algorithm. The main theorem related to this method can be formulated as follows:

Theorem IRLS. Let f_i: R_+ → R_+, f_i(0) = 0 be a set of concave and differentiable functions and let X = (x_i) be a given dataset. Let ā ∈ R be fixed. We consider the functions

F(a) := Σ_i f_i(‖a − x_i‖²)   for a ∈ R

and

H(a) := Σ_i { [f_i(‖ā − x_i‖²) − f_i'(‖ā − x_i‖²) ‖ā − x_i‖²] + f_i'(‖ā − x_i‖²) ‖a − x_i‖² }   for a ∈ R.

Then H(ā) = F(ā) and H ≥ F.

Proof. Let i be fixed. For simplicity, we denote f = f_i. Let us first observe that, without loss of generality, we may assume that x_i = 0 (we can make the obvious substitution a → a + x_i).

Then, since all the considered functions are radial, it is sufficient to consider the one-dimensional case when N = 1. Thus, from now on we assume that N = 1 and x_1 = 0. Since the functions are even, it is sufficient to consider the situation on R_+.

Concluding: we are given ā ∈ R_+ and consider the functions

g: R_+ ∋ a → f(a²)

and

h: R_+ ∋ a → [f(ā²) − f'(ā²) ā²] + f'(ā²) a².

Clearly, h(ā) = g(ā). We have to show that h ≥ g. We consider two cases.

Let us first discuss the situation on the interval [0, ā]. We show that g − h is increasing on this interval; since g and h coincide at ā, this completes the proof in this case. Clearly g and h are absolutely continuous functions (since f is concave). Thus, to prove that g − h is increasing on [0, ā], it is sufficient to show that g' ≥ h' a.e. on [0, ā]. But f is concave, and therefore f' is decreasing, which implies that

g'(a) = f'(a²) 2a ≥ f'(ā²) 2a = h'(a).

So let us consider the situation on the interval [ā, ∞). We will show that g − h is decreasing on [ā, ∞). Since (g − h)(ā) = 0, this is enough. Note that

(g − h)'(a) = f'(a²) 2a − f'(ā²) 2a = 2a (f'(a²) − f'(ā²)) ≤ 0.

Now, the assertion of the theorem is a simple consequence of the above property.

Now we can formulate the most important result:

Corollary IRLS. Let f: R_+ → R_+, f(0) = 0 be a given concave and differentiable function and let S = {x_i}_{i=1}^{N} be a given dataset. We consider the function

F(a) = Σ_i f(‖a − x_i‖²)   for a ∈ R.

Let ā ∈ R be arbitrarily fixed and let

w_i = f'(‖ā − x_i‖²)   for i = 1, ..., N   and   a_w = (1/Σ_{i=1}^{N} w_i) Σ_{i=1}^{N} w_i x_i.

Then F(a_w) ≤ F(ā).

Proof. Let H be defined as in Theorem IRLS:

H(a) = Σ_i { [f(‖ā − x_i‖²) − f'(‖ā − x_i‖²) ‖ā − x_i‖²] + f'(‖ā − x_i‖²) ‖a − x_i‖² }   for a ∈ R.

Moreover, by Theorem IRLS we have F(ā) = H(ā) and H ≥ F. The function H is quadratic in a, so its minimum is attained at

a_w = (1/Σ_{i=1}^{N} w_i) Σ_{i=1}^{N} w_i x_i,

and consequently

F(a_w) ≤ H(a_w) ≤ H(ā) = F(ā).

Making use of the above theorem, we obtain a simple method of finding a better approximation of the minimum. For a given a ∈ R, by taking the weighted average a_w (see Corollary IRLS) we find a point which reduces the value of the function. So, to find the minimum, we iteratively calculate the weighted barycentre of the dataset.

6. Appendix B

In this section we present the source code of our algorithm written in R.

# The derivative of the function bar L
# (the constant factor 1/2 is omitted; it cancels in the weighted mean)
dbarL = function(s) { (exp(-8/3) + s)^(-1) }

# The function which describes a single iteration
# S - data, p - starting point
iteration = function(p, S) {
  suma = 0
  weight = 0
  v = sd(S)                              # standard deviation
  h = 2.35 * v * (length(S))^(-1/5)
  for (s in S) {
    weights = dbarL((abs(s - p)^2) / (h^2))
    suma = suma + weights * s
    weight = weight + weights
  }
  suma / weight
}

# The function which determines a local minimum
# S - data
# e - epsilon
# N - maximal number of iterations
look_minimum = function(S, e, N) {
  ans = mean(S)
  dist = 1
  while (dist > e & N > 0) {
    ans_n = iteration(ans, S)
    N = N - 1
    dist = abs(ans_n - ans)
    ans = ans_n
  }
  ans
}

In our simulations we use ε = 0.001 and N = 50.
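For instance, the functions above can be used as follows (the seed, the sample size and the mixture parameters are chosen arbitrarily, mimicking one of the models of Table 1):

# Memory centre of a sample from the mixture 0.4 N(-6,1) + 0.6 N(6,1)
set.seed(1)
n <- 500
component <- rbinom(n, 1, 0.6)
S <- ifelse(component == 1, rnorm(n, 6, 1), rnorm(n, -6, 1))

c(alg_centre = look_minimum(S, 0.001, 50),   # memory centre returned by the algorithm
  barycentre = mean(S))                      # ordinary barycentre, for comparison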

[1] T. M. Cover, J. A. Thomas, Elements of information theory, Wiley-India, 1999.
[2] Forex Rate, http://www.forexrate.co.uk/forexhistoricaldata.php, number of points, time 1 min, date end 10/28/2011.
[3] D. R. Hankerson, G. A. Harris, P. D. Johnson, Introduction to information theory and data compression, Chapman & Hall/CRC, 2003.
[4] P. J. Huber, E. M. Ronchetti, Robust statistics, John Wiley & Sons, 2009.
[5] D. A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings of the IRE, 40, 1098-1101, 1952.
[6] J. Idier, Convex half-quadratic criteria and interacting auxiliary variables for image restoration, IEEE Trans. Image Process., 10, 1001-1009, 2001.
[7] D. Salomon, G. Motta, D. Bryant, Handbook of data compression, Springer-Verlag New York Inc., 2009.
[8] B. W. Silverman, Density estimation for statistics and data analysis, Chapman & Hall/CRC, 1986.
[9] R. Wolke, Iteratively reweighted least squares. A comparison of several single step algorithms for linear models, BIT Numerical Mathematics, 32, 506-524, 1992.
[10] Z. Zhang, Parameter estimation techniques: a tutorial with application to conic fitting, Image and Vision Computing, 15, 59-76, 1997.