CODE LENGTHS FOR MODEL CLASSES WITH CONTINUOUS UNIFORM DISTRIBUTIONS. Panu Luosto


University of Helsinki, Department of Computer Science, P.O. Box 68, FI-00014 University of Helsinki, Finland

ABSTRACT

Continuous uniform distributions are an important means for modelling, e.g., noise and unknown portions of heterogeneous data. Even though they are simplistic models, deriving the corresponding code lengths is sometimes non-trivial. One obvious problem is that the set in which a uniform density takes positive values is often not known in advance in practical applications. This paper treats uniform distributions in origin-centred balls, arbitrary balls and axis-aligned boxes. We derive normalized maximum likelihood (NML) densities for the cases when the maximum likelihood parameters of the data are bounded. From the NML densities we derive code length functions that depend on the prior densities of the parameters. We generalize Rissanen's prior for positive reals for this purpose. We also suggest methods for dealing with the problems that arise from the singularities in the final code length functions.

1. INTRODUCTION

As simplistic models, continuous uniform distributions are important in learning based on the minimum description length (MDL) principle [1, 2]. When the domain of the data is known, the uniform distribution gives the shortest worst-case code. In this paper, we assume that the domain is unknown, which makes the situation non-trivial. Our objective is to derive code length functions that are suitable for noise and entirely unknown data, or that can be used as baseline code length functions for determining the efficiency of more sophisticated models. At the same time, we avoid unnecessary assumptions about the domain. We have used our code length for the model class with uniform distributions in axis-aligned boxes in [3], where the objective is to find the best clustering with an unknown number of normally distributed clusters and one uniform cluster.
If the bounded set in which the uniform distribution takes positive values is not known in advance, even choosing its geometrical form can be a difficult design choice. We consider origin-centred balls in any dimensionality and arbitrary balls in one and two dimensions. Uniform distributions in axis-aligned boxes are simply product densities of uniform distributions in arbitrary one-dimensional balls. If the ranges of the parameters are bounded, calculating normalized maximum likelihoods (NML) [4] is straightforward for all the models mentioned above. If the parameters are unbounded, the NML density is not defined. We derive our code lengths with unbounded parameters in all the cases according to a similar idea.

To outline the method, we take as an example the simplest model, the uniform distribution in an origin-centred ball. The distribution has one parameter, the radius of the ball. Let $x^n \in (\mathbb{R}^d)^n$ be a data sequence. The maximum likelihood parameter $R(x^n)$ is equal to the distance of the farthest point in the sequence from the origin. If we restrict the data so that $R(x^n) \in [r_1, r_2]$, we can derive a normalized maximum likelihood density $f_{\mathrm{NML}}(x^n; r_1, r_2)$. But it can be difficult to give $r_1$ and $r_2$ any reasonable values before seeing the data. If we let $r_1$ and $r_2$ approach $R(x^n)$, the density grows unbounded, which means that renormalization is not possible. Also, if we fix $r_1$ and let $r_2 \to \infty$, or fix $r_2$ and let $r_1 \to 0$, the density approaches 0. Instead, we give the parameter $r_1$ a continuous prior density $p_r$ and let $t > 1$. Now, we get a mixture density by integrating $f_{\mathrm{NML}}(x^n; r_1, t r_1)$ over such values of $r_1$ that $R(x^n) \in [r_1, t r_1]$. That gives the density

$$f(x^n; p_r, t) = \int_{R(x^n)/t}^{R(x^n)} f_{\mathrm{NML}}(x^n; r, tr)\, p_r(r)\, dr.$$

To dispose of the parameter $t$, we consider the limit

$$\lim_{t \to 1+} f(x^n; p_r, t).$$

The limiting function is a density function and a universal model []. We still have the problem of choosing a suitable $p_r$.
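The behaviour of the bounded NML density described above can be illustrated numerically. The following is a minimal sketch, assuming the closed form $V_d(R(x^n))^{-n} / (nd \ln(r_2/r_1))$ that is derived in Section 3; the function names `ball_volume` and `nml_density_origin_ball` are illustrative and not part of the paper.

```python
import math


def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


def nml_density_origin_ball(points, r1, r2):
    """NML density when the farthest distance R is restricted to [r1, r2].

    Returns 0.0 for sequences outside the restricted set.
    """
    n, d = len(points), len(points[0])
    R = max(math.sqrt(sum(c * c for c in p)) for p in points)
    if not (r1 <= R <= r2):
        return 0.0
    return ball_volume(d, R) ** (-n) / (n * d * math.log(r2 / r1))


if __name__ == "__main__":
    pts = [(0.3, 0.4), (1.0, 0.0), (-0.5, 0.5)]  # farthest distance R = 1
    for r1, r2 in [(0.5, 2.0), (0.9, 1.1), (0.99, 1.01), (0.5, 1e9)]:
        print(r1, r2, nml_density_origin_ball(pts, r1, r2))
```

Shrinking the interval around $R(x^n)$ makes the value grow without bound, and widening it makes the value shrink towards zero, which is why neither choice gives a usable code length on its own.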
NML encoding minimizes the worst-case excess code length compared to the maximum likelihood code length (the latter being the optimal coding method, but only with hindsight). Similarly, we should choose a flat prior which diminishes asymptotically as slowly as possible in order to minimize the excess code length with all data. Section 2 introduces a generalization of Rissanen's prior for positive reals as a candidate for $p_r$.

With continuous distributions, it is a common practice to use the term code length as a synonym for the negative logarithm of the density, which corresponds to encoding of real numbers with infinite precision. We consider just densities in this paper; taking the logarithm is left to the reader. In practical situations, data values have a finite precision, and minimizing the negative logarithm of the density is not quite equal to finding the most effective way to encode the data. This does not usually cause problems, but if the density can grow unbounded in the neighbourhood of some point, the results may be surprising, especially when the data is represented with greater precision than we find trustworthy. Therefore it might be reasonable to fine-tune the model so that the density is bounded.

In the following sections, there are examples of densities having singularities in which they are not defined and in the neighbourhoods of which they take arbitrarily large values. These densities are problematic to use with some data sets. We might, for example, have a two-point cluster the density of which grows unbounded when the distance between the points approaches zero. The small cluster could thus totally dominate the code length of a potentially complex clustering, which might lead to very unintuitive results. We therefore give some ways to bound the densities and to extend them to the singularities.

We use the notation log for the logarithm to the base of two and ln for the natural logarithm. The elements of a data sequence are assumed to be independently and identically distributed in all the models. Before deriving the densities corresponding to the code length functions, we introduce a very flat density function that we use as a prior for parameters.

2. A PRIOR DENSITY FOR THE REAL NUMBERS

The main criterion for priors of the parameters in our case is that they should be as uninformative as possible. In [5], Rissanen gives a density function for the reals in the interval $[1, \infty[$.
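We generalize it without changing its asymptotic properties by adding a parameter that defines how strongly the probability mass is concentrated in the vicinity of the origin. We write $x^y$ as $x \uparrow y$ for typographical reasons. Let $x \uparrow\uparrow 0 = 1$ and let $x \uparrow\uparrow y = x \uparrow x \uparrow \cdots \uparrow x$ ($y$ copies of $x$) for $x > 0$, $y \in \mathbb{N}$. Now let $b = 2 \uparrow 2 \uparrow \cdots \uparrow 2 \uparrow \delta$ ($k - 1$ copies of 2s), where $k \in \mathbb{N}$ and $\delta \in [1, 2]$. For $x \in \mathbb{R}_+$, we define the density

$$f_{\mathbb{R}_+}(x; b) = \frac{1 - \ln 2}{(\ln 2)^k \left(1 - (1 - \ln 2) \log \delta\right)} \cdot \frac{1}{(x + b)\, h(x + b)} \qquad (1)$$

where

$$h(x) = \begin{cases} 1 & \text{if } \log x \le 1 \\ \log x \cdot h(\log x) & \text{otherwise.} \end{cases}$$

We verify next that $f_{\mathbb{R}_+}(\cdot\,; b)$ integrates to unity over the positive real line. Let $\log^{(k)} x = \log \log \cdots \log x$ ($k$ copies of log). Notice that $D_x (\ln 2)^k \log^{(k)} x = 1/(x\, h(x))$ if $1 \le \log^{(k-1)} x \le 2$, or equivalently $2 \uparrow\uparrow (k-1) \le x \le 2 \uparrow\uparrow k$. Thus

$$\int_{2 \uparrow\uparrow (k-1)}^{2 \uparrow\uparrow k} \frac{dx}{x\, h(x)} = (\ln 2)^k \quad \text{and} \quad \int_1^\infty \frac{dx}{x\, h(x)} = \sum_{k=1}^\infty (\ln 2)^k = \frac{\ln 2}{1 - \ln 2}.$$

Assuming that $b$ is defined as above,

$$\int_b^\infty \frac{dx}{x\, h(x)} = \int_b^{2 \uparrow\uparrow k} \frac{dx}{x\, h(x)} + \sum_{i=k+1}^\infty \int_{2 \uparrow\uparrow (i-1)}^{2 \uparrow\uparrow i} \frac{dx}{x\, h(x)} = (\ln 2)^k (1 - \log \delta) + \frac{(\ln 2)^{k+1}}{1 - \ln 2} = \frac{(\ln 2)^k \left(1 - (1 - \ln 2) \log \delta\right)}{1 - \ln 2}.$$

The proof is easily completed after a variable change $x = y + b$. The function $f_{\mathbb{R}_+}(\cdot\,; b)$ diminishes asymptotically only slightly faster than $1/(x \log x)$. When a prior for all the real numbers is needed, we simply use $(1/2) f_{\mathbb{R}_+}(|x|)$.

3. A CODE LENGTH ACCORDING TO UNIFORM DISTRIBUTIONS IN ORIGIN-CENTRED BALLS

In this section, we consider a model class consisting of uniform distributions in balls centred at the origin. Let

$$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d$$

denote the volume of a $d$-dimensional ball with the radius $r$, and let $A_d(r) = V_d(r)\, d/r$ denote the surface area of that ball. For the distance of the farthest point in the sequence $x^n = (x_1, x_2, \ldots, x_n) \in (\mathbb{R}^d)^n$ from the origin, we use the notation $R(x^n) = \max \{ \|x_i\| \mid i \in \{1, 2, \ldots, n\} \}$.

We consider first a special case where the radius of the smallest enclosing origin-centred ball belongs to a certain interval. Let $r_1, r_2 > 0$ and assume that $r_1 < r_2$. Let $A = \{x^n \in (\mathbb{R}^d)^n \mid R(x^n) \in [r_1, r_2]\}$. The maximum likelihood for $x^n \in A$ is $\hat f(x^n; r_1, r_2) = V_d(R(x^n))^{-n}$.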
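Before the derivation continues, the prior of Section 2 can be made concrete with a short sketch. It assumes the reading of the definitions used here: $b$ is a power tower with $k - 1$ copies of 2 topped by $\delta \in [1, 2]$, and the normalizing constant is $(\ln 2)^k (1/(1 - \ln 2) - \log \delta)$. The helper names (`h`, `tower`, `prior_constant`, `f_Rplus`, `simpson`) are hypothetical, and the quadrature routine is only there to check the segment integrals numerically.

```python
import math


def h(x):
    """h(x) = 1 if log2(x) <= 1, otherwise log2(x) * h(log2(x)); needs x > 0."""
    product = 1.0
    y = math.log2(x)
    while y > 1.0:
        product *= y
        y = math.log2(y)
    return product


def tower(k, delta):
    """b = 2^2^...^2^delta with k - 1 copies of 2 (assumed reading of the text)."""
    b = delta
    for _ in range(k - 1):
        b = 2.0 ** b
    return b


def prior_constant(k, delta):
    """Integral of 1 / (x h(x)) over [b, infinity) for b = tower(k, delta)."""
    ln2 = math.log(2)
    return ln2 ** k * (1.0 / (1.0 - ln2) - math.log2(delta))


def f_Rplus(x, k=1, delta=1.0):
    """Prior density on the positive reals, shifted by b = tower(k, delta)."""
    b = tower(k, delta)
    return 1.0 / (prior_constant(k, delta) * (x + b) * h(x + b))


def simpson(g, a, b, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    step = (b - a) / m
    total = g(a) + g(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * g(a + i * step)
    return total * step / 3


if __name__ == "__main__":
    # Each segment [2^^(k-1), 2^^k] should contribute (ln 2)^k.
    for k, (lo, hi) in enumerate([(1, 2), (2, 4), (4, 16)], start=1):
        print(k, simpson(lambda x: 1 / (x * h(x)), lo, hi), math.log(2) ** k)
```

The total mass cannot be checked by brute-force quadrature because the tail decays extremely slowly, so the sketch verifies the per-segment integrals instead, which is the step the proof relies on.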

The normalization integral for the NML density is

$$\begin{aligned}
\int_{x^n \in A} \hat f(x^n; r_1, r_2)\, dx^n
&= n \int_{x_1 \in B(0, r_2) \setminus B(0, r_1)} \int_{x_2, \ldots, x_n \in B(0, \|x_1\|)} V_d(\|x_1\|)^{-n}\, dx_n \cdots dx_2\, dx_1 \\
&= n \int_{x_1 \in B(0, r_2) \setminus B(0, r_1)} \frac{dx_1}{V_d(\|x_1\|)} \\
&= n \int_{r_1}^{r_2} \int_{y \in \partial B(0, r)} \frac{1}{V_d(r)}\, dy\, dr \\
&= n \int_{r_1}^{r_2} \frac{A_d(r)}{V_d(r)}\, dr = n \int_{r_1}^{r_2} \frac{d}{r}\, dr = nd \ln \frac{r_2}{r_1},
\end{aligned}$$

which yields the NML density function

$$f_{\mathrm{NML}}(x^n; r_1, r_2) = \frac{1}{V_d(R(x^n))^n\, nd \ln(r_2/r_1)} \qquad (2)$$

if $x^n \in A$. The parameters $r_1$ and $r_2$ are something we would like to get rid of, because we can seldom give them reasonable values before looking at the data. Setting the parameters to their maximum likelihood values $r_1 = r_2 = R(x^n)$ results in an infinite density, which implies that a renormalization scheme is not possible. Instead, we give $r_1$ a continuous prior density $p_r$, and make $r_2$ a function of $r_1$, $r_2(r_1) = t r_1$, where $t > 1$. Integrating the coefficient $1/\ln(r_2/r_1) = 1/\ln t$ over the values of $r_1$ such that $R(x^n) \in [r_1, r_2] = [r_1, t r_1]$ yields

$$\int_{R(x^n)/t}^{R(x^n)} \frac{1}{\ln t}\, p_r(r)\, dr.$$

When $t$ approaches 1 from above, let the limiting function be

$$u(x^n, p_r) = \lim_{t \to 1+} \int_{R(x^n)/t}^{R(x^n)} \frac{1}{\ln t}\, p_r(r)\, dr = \lim_{t \to 1+} \left( R(x^n) - \frac{R(x^n)}{t} \right) \frac{1}{\ln t}\, p_r(R(x^n)) = R(x^n)\, p_r(R(x^n)).$$

Replacing the coefficient $1/\ln(r_2/r_1)$ with $u(x^n, p_r)$ in (2) yields the function

$$f(x^n; p_r) = \frac{R(x^n)\, p_r(R(x^n))}{V_d(R(x^n))^n\, nd} \qquad (3)$$

if $x^n \in (\mathbb{R}^d)^n$ and $R(x^n) > 0$. It is easy to check that (3) is a valid density by integrating over $\{x^n \in (\mathbb{R}^d)^n \mid R(x^n) > 0\}$.

We compare the result briefly with a more straightforward solution. Assume that $R(x^n) \ge a > 0$ and let $p(r) = \epsilon a^\epsilon r^{-\epsilon - 1}$, where $\epsilon > 0$. Consider the mixture density

$$f'(x^n) = \int_{R(x^n)}^{\infty} V_d(r)^{-n}\, p(r)\, dr = V_d(R(x^n))^{-n}\, R(x^n)^{-\epsilon}\, \frac{\epsilon a^\epsilon}{nd + \epsilon}.$$

Let $p_r = f_{\mathbb{R}_+}(\cdot\,; b)$ as defined in (1). Then the ratio

$$\frac{f'(x^n)}{f(x^n; p_r)} \propto \frac{1}{R(x^n)^{1+\epsilon}\, p_r(R(x^n))}$$

approaches zero when $R(x^n) \to \infty$.

Depending on the choice of $p_r$, the density $f(x^n; p_r)$ can grow unbounded when $R(x^n)$ approaches zero. A simple solution to get a bounded density is to make $p_r$ a function of $n$ and $d$, and to let $p_r(R)$ grow relative to $R^{nd - 1}$ in an interval $[0, \epsilon]$, which keeps $f(x^n; p_r)$ constant when $R(x^n) \in [0, \epsilon]$.
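The limit $u(x^n, p_r) = R(x^n)\, p_r(R(x^n))$ and the resulting density (3) can be checked numerically. The sketch below is illustrative only: the exponential prior used in the example is an arbitrary stand-in chosen for simplicity (it is not the prior proposed in Section 2), and the function names are hypothetical.

```python
import math


def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


def farthest_distance(points):
    """R(x^n): distance of the farthest point from the origin."""
    return max(math.sqrt(sum(c * c for c in p)) for p in points)


def mixture_coefficient(R, p_r, t, m=2000):
    """(1 / ln t) times the integral of p_r over [R / t, R] (midpoint rule)."""
    a = R / t
    step = (R - a) / m
    integral = sum(p_r(a + (i + 0.5) * step) for i in range(m)) * step
    return integral / math.log(t)


def density_origin_ball(points, p_r):
    """Density (3): R p_r(R) / (V_d(R)^n n d), defined for R > 0."""
    n, d = len(points), len(points[0])
    R = farthest_distance(points)
    if R <= 0:
        raise ValueError("density (3) is not defined when R(x^n) = 0")
    return R * p_r(R) / (ball_volume(d, R) ** n * n * d)


if __name__ == "__main__":
    prior = lambda r: math.exp(-r)  # assumed example prior on the radius
    R = 2.0
    for t in (2.0, 1.1, 1.001):
        print(t, mixture_coefficient(R, prior, t), R * prior(R))
    print(-math.log2(density_origin_ball([(3.0, 4.0), (1.0, 0.0)], prior)))
```

For $d = 1$ and $n = 1$ the density reduces to $p_r(|x_1|)/2$, which integrates to one over the real line and gives a quick sanity check of (3).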
As a concrete example, let $b = 2 \uparrow\uparrow (k - 1)$, where $k \in \mathbb{N}$ and we have used the notation explained in Section 2. Let also $\epsilon = 2 \uparrow 2 \uparrow \cdots \uparrow 2 \uparrow \alpha - b$ ($k - 1$ copies of 2s), where $\alpha \in\ ]1, 2]$. A continuous density fulfilling the previous requirements is

$$p_r(R) = \begin{cases} c\, f_{\mathbb{R}_+}(\epsilon; b)\, \epsilon^{1 - nd} R^{nd - 1} & \text{if } R \in [0, \epsilon[ \\ c\, f_{\mathbb{R}_+}(R; b) & \text{if } R \ge \epsilon, \end{cases}$$

where $f_{\mathbb{R}_+}$ is the density defined in (1) and $c$ is a constant for normalization. Because

$$\int_0^\epsilon f_{\mathbb{R}_+}(R; b)\, dR = (1 - \ln 2) \log \alpha \quad \text{and} \quad \int_0^\epsilon f_{\mathbb{R}_+}(\epsilon; b)\, \epsilon^{1 - nd} R^{nd - 1}\, dR = \frac{f_{\mathbb{R}_+}(\epsilon; b)\, \epsilon}{nd},$$

we get

$$c = \left( 1 - (1 - \ln 2) \log \alpha + \frac{f_{\mathbb{R}_+}(\epsilon; b)\, \epsilon}{nd} \right)^{-1}.$$

4. A CODE LENGTH ACCORDING TO UNIFORM DISTRIBUTIONS IN ARBITRARY BALLS

We consider here modelling a data sequence according to a uniform distribution in an arbitrary ball, first in one and then in two dimensions. The one-dimensional case is important because a uniform distribution in an axis-aligned box is the product of the densities of the coordinates according to uniform distributions in one-dimensional balls. In the first subsection, we assume that the minimum and maximum values of the one-dimensional sequence are unequal. In the second subsection, we bound the density not by choosing a special prior but by altering the models slightly. The third and fourth subsections examine the two-dimensional case. Let $c(x^n)$ denote the centre of the smallest enclosing ball of $x^n \in (\mathbb{R}^d)^n$, and let $r(x^n)$ denote the radius of that ball.

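4.1. One dimension, $\min(x^n) \ne \max(x^n)$

We restrict the data with the maximum likelihood parameters first. Let $c_0 \in \mathbb{R}$ and let $\delta, r_1, r_2 > 0$. Assume that $r_1 < r_2$. Let the set of sequences to be considered be $A = \{x^n \in \mathbb{R}^n \mid c(x^n) \in [c_0 - \delta, c_0 + \delta],\ r(x^n) \in [r_1, r_2]\}$. The maximum likelihood of $x^n \in A$ is $\hat f(x^n) = (2 r(x^n))^{-n}$, and the corresponding normalizing integral is

$$\begin{aligned}
C(c_0, \delta, r_1, r_2) &= \int_{x^n \in A} \hat f(x^n)\, dx^n \qquad (4) \\
&= n(n-1) \int_{\substack{x_1, x_2 \in \mathbb{R}:\ (x_1 + x_2)/2 \in [c_0 - \delta, c_0 + \delta], \\ (x_2 - x_1)/2 \in [r_1, r_2]}} \int_{x_3, \ldots, x_n \in [x_1, x_2]} (x_2 - x_1)^{-n}\, dx_n \cdots dx_3\, dx_2\, dx_1 \\
&= n(n-1) \int_{\substack{x_1, x_2 \in \mathbb{R}:\ (x_1 + x_2)/2 \in [c_0 - \delta, c_0 + \delta], \\ (x_2 - x_1)/2 \in [r_1, r_2]}} \frac{dx_2\, dx_1}{(x_2 - x_1)^2} \\
&= n(n-1) \int_{c_0 - \delta}^{c_0 + \delta} \int_{r_1}^{r_2} \frac{2}{(2r)^2}\, dr\, dc \qquad (5) \\
&= n(n-1)\, \delta \left( \frac{1}{r_1} - \frac{1}{r_2} \right).
\end{aligned}$$

There was a coordinate change $(x_1, x_2) = (c - r, c + r)$ at (5) in the previous integration. Dividing the maximum likelihood by the normalizing integral yields the NML density function

$$f_{\mathrm{NML}}(x^n; c_0, \delta, r_1, r_2) = \frac{(2 r(x^n))^{-n}}{n(n-1)\, \delta} \cdot \frac{r_1 r_2}{r_2 - r_1} \qquad (6)$$

if $x^n \in A$. The next step is to replace $c_0$, $\delta$, $r_1$ and $r_2$ with more general parameters that allow us to define a non-zero density for all $x^n \in \mathbb{R}^n$ having $r(x^n) > 0$. Our solution is similar to the one in Section 3. We assume that the radius is independent of $\delta$ and $c_0$.

Consider the parameters $r_1$ and $r_2$ first. Let again $t > 1$ and $r_2(r_1) = t r_1$. Requiring that $r(x^n) \in [r_1, r_2] = [r_1, t r_1]$, we replace the coefficient $(r_1 r_2)/(r_2 - r_1) = (t r_1)/(t - 1)$ in (6) with the integral

$$\int_{r(x^n)/t}^{r(x^n)} \frac{t r}{t - 1}\, p_r(r)\, dr,$$

where $p_r$ is a continuous prior of the parameter $r_1$. Letting $t$ approach 1 from above, we get

$$\lim_{t \to 1+} \int_{r(x^n)/t}^{r(x^n)} \frac{t r}{t - 1}\, p_r(r)\, dr = \lim_{t \to 1+} \left( r(x^n) - \frac{r(x^n)}{t} \right) \frac{t\, r(x^n)}{t - 1}\, p_r(r(x^n)) = r(x^n)^2\, p_r(r(x^n)).$$

Next, we get rid of the coefficient $1/\delta$ and the dependence on $c_0$ in (6). Let $\delta > 0$ and let $p_c$ be a continuous prior density function of the parameter $c_0$. The integration goes over all such values of $c_0$ that $c(x^n) \in [c_0 - \delta, c_0 + \delta]$. In a similar fashion as above, we substitute $1/\delta$ with the limiting function

$$\lim_{\delta \to 0+} \int_{c(x^n) - \delta}^{c(x^n) + \delta} \frac{1}{\delta}\, p_c(c)\, dc = 2\, p_c(c(x^n)). \qquad (7)$$

The final density function is thus

$$f(x^n; p_c, p_r) = \frac{p_c(c(x^n))\, p_r(r(x^n))}{(2 r(x^n))^{n-2}\, 2n(n-1)} \qquad (8)$$

if $x^n \in \mathbb{R}^n$ and $r(x^n) > 0$. The sequences consisting of equal points are problematic singularities this time.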
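Density (8) is straightforward to evaluate once the priors are fixed. The following sketch is a minimal illustration; the standard normal prior for the centre and the exponential prior for the radius in the example are assumptions made purely for demonstration, and the function names are hypothetical.

```python
import math


def density_1d(xs, p_c, p_r):
    """Density (8) for a one-dimensional sequence with n >= 2 and r(x^n) > 0."""
    n = len(xs)
    if n < 2:
        raise ValueError("density (8) needs at least two points")
    lo, hi = min(xs), max(xs)
    centre, radius = (lo + hi) / 2, (hi - lo) / 2
    if radius <= 0:
        raise ValueError("density (8) is not defined when all points are equal")
    return p_c(centre) * p_r(radius) / ((2 * radius) ** (n - 2) * 2 * n * (n - 1))


def code_length_1d(xs, p_c, p_r):
    """Code length in bits, i.e. the negative base-2 logarithm of density (8)."""
    return -math.log2(density_1d(xs, p_c, p_r))


if __name__ == "__main__":
    # Assumed example priors, not taken from the paper.
    p_c = lambda c: math.exp(-c * c / 2) / math.sqrt(2 * math.pi)
    p_r = lambda r: math.exp(-r)
    print(code_length_1d([0.2, 1.7, 0.9, 1.1], p_c, p_r))
```

The density depends on the data only through the minimum, the maximum and the number of points, so any permutation of the sequence gives the same code length.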
We can bound the density and extend it to the singularities as at the end of Section 3, choosing a special prior $p_r$ that keeps $f(x^n; p_c, p_r)$ constant when $r(x^n) \in [0, \epsilon]$. For the case $n = 1$ we let $f((x_1); p_c) = p_c(x_1)$.

A naive solution for bounding the density is to add one extra point to the beginning of the sequence in order to ensure that the difference between the maximum and minimum values in the sequence is greater than some positive $\epsilon$. In decoding, this point is simply discarded. But then, if $r(x^n) > \epsilon$, the number of extra bits needed compared to (8) is not a constant any more but $\log r(x^n) + \log(n + 1) - \log(n - 1) + 1$. In the next subsection, we provide yet another way to bound the density.

4.2. One dimension, bounded maximum likelihood

We restrict the largest possible density by bounding the radius parameter of the models from below. Assume for the time being that $n \in \{2, 3, \ldots\}$. We shall see later that the result applies for $n = 1$ as well. Let $x^n \in \mathbb{R}^n$ and let the smallest radius that can be used for encoding be $\epsilon > 0$. The maximum likelihood is thus

$$\hat f(x^n; \epsilon) = \begin{cases} (2 r(x^n))^{-n} & \text{if } r(x^n) \ge \epsilon \\ (2\epsilon)^{-n} & \text{otherwise.} \end{cases}$$

Let $\delta, r_2 > 0$ and let $c_0 \in \mathbb{R}$. We calculate the normalizing integral in the bounded set $\{x^n \in \mathbb{R}^n \mid c(x^n) \in [c_0 - \delta, c_0 + \delta],\ r(x^n) \le r_2\}$ first. The integral consists of two parts:

$$C(c_0, \delta, r_2) = \int_{\substack{x^n \in \mathbb{R}^n:\ c(x^n) \in [c_0 - \delta, c_0 + \delta], \\ r(x^n) \in [0, \epsilon[}} (2\epsilon)^{-n}\, dx^n + \int_{\substack{x^n \in \mathbb{R}^n:\ c(x^n) \in [c_0 - \delta, c_0 + \delta], \\ r(x^n) \in [\epsilon, r_2]}} (2 r(x^n))^{-n}\, dx^n.$$

The second term is $C(c_0, \delta, \epsilon, r_2)$ as in (4). The first term is equal to

$$\begin{aligned}
n(n-1) \int_{\substack{x_1, x_2 \in \mathbb{R}:\ (x_1 + x_2)/2 \in [c_0 - \delta, c_0 + \delta], \\ (x_2 - x_1)/2 \in [0, \epsilon[}} \frac{(x_2 - x_1)^{n-2}}{(2\epsilon)^n}\, dx_2\, dx_1
&= n(n-1) \int_{c_0 - \delta}^{c_0 + \delta} \int_0^{\epsilon} \frac{2\, (2r)^{n-2}}{(2\epsilon)^n}\, dr\, dc \\
&= \frac{n \delta}{\epsilon}.
\end{aligned}$$

Putting these together yields

$$C(c_0, \delta, r_2) = n(n-1)\, \delta \left( \frac{1}{\epsilon} - \frac{1}{r_2} \right) + \frac{n \delta}{\epsilon}.$$

When $r_2$ approaches infinity, $C(c_0, \delta, r_2) \to (n^2 \delta)/\epsilon$. We normalize the maximum likelihood by this limit and use a prior for the parameter $c_0$ in a similar way as in (7), getting the density

$$f_\epsilon(x^n; p_c) = \frac{2\epsilon\, p_c(c(x^n))}{(2 r(x^n))^n\, n^2} \qquad (9)$$

if $r(x^n) \ge \epsilon$, and

$$f_\epsilon(x^n; p_c) = \frac{p_c(c(x^n))}{(2\epsilon)^{n-1}\, n^2} \qquad (10)$$

if $r(x^n) \in [0, \epsilon[$. Letting $n = 1$ in (10) yields $f_\epsilon((x_1); p_c) = p_c(x_1)$, which is a valid density.

4.3. Two dimensions, $r(x^n) > 0$

Next, we consider the arbitrary ball model in two dimensions. Assume first that $n \in \{3, 4, \ldots\}$. The final result is also valid for $n = 2$, which we shall see after the calculation of the normalizing integral at (11). Let $(C_1, C_2) \in \mathbb{R}^2$ and let $\delta, r_1, r_2 > 0$, $r_1 < r_2$. We assume first that the centre of the smallest enclosing ball of the point sequence belongs to the set $D = [C_1 - \delta, C_1 + \delta] \times [C_2 - \delta, C_2 + \delta]$ and that the radius of that ball is in the interval $[r_1, r_2]$. Let the set of sequences fulfilling these conditions be $A = \{x^n \in (\mathbb{R}^2)^n \mid c(x^n) \in D,\ r(x^n) \in [r_1, r_2]\}$. Denote the maximum likelihood in this set by $\hat f(x^n) = (\pi r(x^n)^2)^{-n}$.

There must be at least two different points $x_i, x_j$ in the sequence $x^n$ such that $x_i, x_j \in \partial B(c(x^n), r(x^n))$. If $x_i$ and $x_j$ are the only points in $x^n$ belonging to the border of the smallest enclosing ball, then $c(x^n) = (x_i + x_j)/2$. If at least three points of $x^n$ belong to $\partial B(c(x^n), r(x^n))$, there are three different indices $i, j, k \in \{1, 2, \ldots, n\}$ such that $x_i, x_j, x_k \in \partial B(c(x^n), r(x^n))$ and $c(x^n) \in \mathrm{conv}\{x_i, x_j, x_k\}$, where $\mathrm{conv}\{x_i, x_j, x_k\}$ is the convex hull of the set $\{x_i, x_j, x_k\}$.

We then derive the integral $\int_{x^n \in A} \hat f(x^n)\, dx^n$ by dividing the integration space into two parts whose intersection is a null set. First, consider a situation where the points $x_1$ and $x_2$ determine the minimal enclosing ball. Let $x_i^{(j)}$ denote the $j$th coordinate of $x_i$.
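Returning briefly to the bounded one-dimensional density of Subsection 4.2, equations (9) and (10) can be sketched as a single function. The name `bounded_density_1d` and the normal prior in the example are illustrative assumptions, not part of the paper.

```python
import math


def bounded_density_1d(xs, p_c, eps):
    """Bounded density f_eps of (9) and (10) for a one-dimensional sequence.

    eps > 0 is the smallest radius that may be used for encoding.
    """
    n = len(xs)
    lo, hi = min(xs), max(xs)
    centre, radius = (lo + hi) / 2, (hi - lo) / 2
    if radius >= eps:
        # Equation (9): the data radius is large enough to be used as such.
        return 2 * eps * p_c(centre) / ((2 * radius) ** n * n ** 2)
    # Equation (10): the radius is clamped to eps, so the density stays bounded.
    return p_c(centre) / ((2 * eps) ** (n - 1) * n ** 2)


if __name__ == "__main__":
    # Assumed example prior for the centre.
    p_c = lambda c: math.exp(-c * c / 2) / math.sqrt(2 * math.pi)
    for xs in ([1.0, 1.0], [1.0, 1.2], [0.0, 4.0], [0.5]):
        print(xs, bounded_density_1d(xs, p_c, eps=0.5))
```

The two branches agree at $r(x^n) = \epsilon$, so the density is continuous there, and sequences of identical points no longer cause a singularity.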
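We change the coordinate system using the function $w(c_1, c_2, r, \theta) = (c_1 + r \cos\theta,\ c_2 + r \sin\theta,\ c_1 - r \cos\theta,\ c_2 - r \sin\theta) = (x_1^{(1)}, x_1^{(2)}, x_2^{(1)}, x_2^{(2)})$. Hence, $|\det w'(c_1, c_2, r, \theta)| = 4r$. The integral with $x_1$ and $x_2$ as outermost points is

$$\begin{aligned}
I_2(D, r_1, r_2) &= \int_{\substack{x^n \in A:\ x_1, x_2 \in \partial B(c(x^n), r(x^n)), \\ c(x^n) = (x_1 + x_2)/2}} \hat f(x^n)\, dx^n \\
&= \int_{\substack{x_1, x_2 \in \mathbb{R}^2:\ (x_1 + x_2)/2 \in D, \\ r_1 < \|x_1 - x_2\|/2 < r_2}} \int_{x_3, \ldots, x_n \in B((x_1 + x_2)/2,\ \|x_1 - x_2\|/2)} \left( \pi \left( \frac{\|x_1 - x_2\|}{2} \right)^2 \right)^{-n} dx_n \cdots dx_3\, dx_2\, dx_1 \\
&= \int_{\substack{x_1, x_2 \in \mathbb{R}^2:\ (x_1 + x_2)/2 \in D, \\ r_1 < \|x_1 - x_2\|/2 < r_2}} \left( \pi \left( \frac{\|x_1 - x_2\|}{2} \right)^2 \right)^{-2} dx_2\, dx_1 \\
&= \int_{(c_1, c_2) \in D} \int_{r_1}^{r_2} \int_0^{2\pi} \frac{4r}{(\pi r^2)^2}\, d\theta\, dr\, dc_1\, dc_2 \\
&= (4\delta^2)(2\pi)\, \frac{4}{\pi^2} \int_{r_1}^{r_2} r^{-3}\, dr = \frac{16 \delta^2}{\pi} \left( \frac{1}{r_1^2} - \frac{1}{r_2^2} \right).
\end{aligned}$$

Next, consider the situation where the points $x_1$, $x_2$ and $x_3$ determine the smallest enclosing ball (Figure 1). The function for the change of coordinates is now $w(c_1, c_2, r, \alpha, \beta, \gamma) = (c_1 + r \cos\alpha,\ c_2 + r \sin\alpha,\ c_1 + r \cos\beta,\ c_2 + r \sin\beta,\ c_1 + r \cos\gamma,\ c_2 + r \sin\gamma)$, and $|\det w'(c_1, c_2, r, \alpha, \beta, \gamma)| = |\sin(\alpha - \beta) + \sin(\gamma - \alpha) + \sin(\beta - \gamma)|\, r^3$.

Figure 1. Three points and their smallest enclosing ball. If and only if the angular coordinate of $x_3$ is between $\alpha + \pi$ and $\beta + \pi$, the centre of the smallest enclosing ball is the point marked with a cross.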
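The two Jacobian determinants stated above can be verified numerically with central finite differences. This is only a sanity check under the stated parametrizations; the helper names (`det`, `jacobian_det`, `w2`, `w3`) are hypothetical.

```python
import math


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    result = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return result


def jacobian_det(fn, params, step=1e-6):
    """Determinant of the Jacobian of fn at params (central differences)."""
    size = len(params)
    columns = []
    for j in range(size):
        up = list(params)
        down = list(params)
        up[j] += step
        down[j] -= step
        f_up, f_down = fn(*up), fn(*down)
        columns.append([(u - v) / (2 * step) for u, v in zip(f_up, f_down)])
    matrix = [[columns[j][i] for j in range(size)] for i in range(size)]
    return det(matrix)


def w2(c1, c2, r, theta):
    """Two antipodal points on the circle with centre (c1, c2) and radius r."""
    return (c1 + r * math.cos(theta), c2 + r * math.sin(theta),
            c1 - r * math.cos(theta), c2 - r * math.sin(theta))


def w3(c1, c2, r, alpha, beta, gamma):
    """Three points on the circle at angles alpha, beta and gamma."""
    return (c1 + r * math.cos(alpha), c2 + r * math.sin(alpha),
            c1 + r * math.cos(beta), c2 + r * math.sin(beta),
            c1 + r * math.cos(gamma), c2 + r * math.sin(gamma))


if __name__ == "__main__":
    print(abs(jacobian_det(w2, [0.3, -1.2, 1.7, 0.9])), 4 * 1.7)
    a, b, g, r = 0.4, 2.0, 4.5, 1.7
    expected = abs(math.sin(a - b) + math.sin(g - a) + math.sin(b - g)) * r ** 3
    print(abs(jacobian_det(w3, [0.3, -1.2, r, a, b, g])), expected)
```

The trigonometric factor in the three-point case equals twice the area of the triangle spanned by the three unit direction vectors, which is why it vanishes when two of the angles coincide.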

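The integral with $x_1$, $x_2$ and $x_3$ as outermost points without any fixed ordering is

$$\begin{aligned}
I_3(D, r_1, r_2) &= \int_{\substack{x^n \in A:\ x_1, x_2, x_3 \in \partial B(c(x^n), r(x^n)), \\ c(x^n) \in \mathrm{conv}\{x_1, x_2, x_3\}}} \hat f(x^n)\, dx^n \\
&= \int_{\substack{x_1, x_2, x_3 \in \mathbb{R}^2:\ c((x_1, x_2, x_3)) \in D, \\ r_1 < r((x_1, x_2, x_3)) < r_2}} \int_{x_4, \ldots, x_n \in B(c((x_1, x_2, x_3)),\ r((x_1, x_2, x_3)))} \left( \pi r((x_1, x_2, x_3))^2 \right)^{-n} dx_n \cdots dx_4\, dx_3\, dx_2\, dx_1 \\
&= \int_{\substack{x_1, x_2, x_3 \in \mathbb{R}^2:\ c((x_1, x_2, x_3)) \in D, \\ r_1 < r((x_1, x_2, x_3)) < r_2}} \left( \pi r((x_1, x_2, x_3))^2 \right)^{-3} dx_3\, dx_2\, dx_1 \\
&= 2 \int_{(c_1, c_2) \in D} \int_{r_1}^{r_2} \int_0^{2\pi} \int_{\alpha}^{\alpha + \pi} \int_{\alpha + \pi}^{\beta + \pi} \frac{|\sin(\alpha - \beta) + \sin(\gamma - \alpha) + \sin(\beta - \gamma)|\, r^3}{(\pi r^2)^3}\, d\gamma\, d\beta\, d\alpha\, dr\, dc_1\, dc_2 \\
&= \frac{2}{\pi^3}\, (4\delta^2) \left( \frac{1}{2 r_1^2} - \frac{1}{2 r_2^2} \right) (2\pi)(3\pi) = \frac{24 \delta^2}{\pi} \left( \frac{1}{r_1^2} - \frac{1}{r_2^2} \right).
\end{aligned}$$

The factor 2 in front of the integral accounts, by symmetry, for the case in which $\beta$ lies in the other half circle $[\alpha - \pi, \alpha]$. Using symmetry among the points, we get the normalizing integral

$$\begin{aligned}
\int_{x^n \in A} \hat f(x^n)\, dx^n &= \binom{n}{2} I_2(D, r_1, r_2) + \binom{n}{3} I_3(D, r_1, r_2) \qquad (11) \\
&= \frac{n(n-1)}{2} \cdot \frac{16 \delta^2}{\pi} \left( \frac{1}{r_1^2} - \frac{1}{r_2^2} \right) + \frac{n(n-1)(n-2)}{6} \cdot \frac{24 \delta^2}{\pi} \left( \frac{1}{r_1^2} - \frac{1}{r_2^2} \right) \\
&= \frac{4 n^2 (n-1)\, \delta^2}{\pi} \left( \frac{1}{r_1^2} - \frac{1}{r_2^2} \right).
\end{aligned}$$

The calculation above is valid also for $n = 2$ if we define $\binom{2}{3} = 0$. The corresponding normalized density is

$$f_{\mathrm{NML}}(x^n; D, r_1, r_2) = \frac{\pi\, r_1^2 r_2^2}{(r_2^2 - r_1^2)\, (\pi r(x^n)^2)^n\, 4 n^2 (n-1)\, \delta^2}$$

if $x^n \in A$ and $n \in \{2, 3, \ldots\}$. When we give the centre of the smallest enclosing ball a prior density $p_c$ and the radius $r$ a prior $p_r$, we can derive the final density as before. Let $c_i(x^n)$ denote the $i$th coordinate of $c(x^n)$. The limits are

$$\lim_{\delta \to 0+} \int_{c_1(x^n) - \delta}^{c_1(x^n) + \delta} \int_{c_2(x^n) - \delta}^{c_2(x^n) + \delta} \frac{1}{\delta^2}\, p_c((c_1, c_2))\, dc_2\, dc_1 = 4\, p_c((c_1(x^n), c_2(x^n)))$$

and

$$\lim_{t \to 1+} \int_{r(x^n)/t}^{r(x^n)} \frac{r^2 (tr)^2}{(tr)^2 - r^2}\, p_r(r)\, dr = \lim_{t \to 1+} \left( r(x^n) - \frac{r(x^n)}{t} \right) \frac{t^2\, r(x^n)^2}{t^2 - 1}\, p_r(r(x^n)) = \frac{r(x^n)^3}{2}\, p_r(r(x^n)).$$

The final density is thus

$$f(x^n; p_c, p_r) = \frac{p_c(c(x^n))\, p_r(r(x^n))}{2 \pi^{n-1}\, r(x^n)^{2n-3}\, n^2 (n-1)} \qquad (12)$$

if $x^n \in \{y^n \in (\mathbb{R}^2)^n \mid r(y^n) > 0\}$, where $n \in \{2, 3, \ldots\}$.

4.4. Two dimensions, bounded maximum likelihood

We omit the calculations here, because they are essentially similar to those in the one-dimensional case in Subsection 4.2. Let $n \in \{1, 2, 3, \ldots\}$. The final density is

$$f_\epsilon(x^n; p_c) = \frac{\epsilon^2\, p_c(c(x^n))}{\pi^{n-1}\, r(x^n)^{2n}\, n^3}$$

if $r(x^n) \ge \epsilon$, and

$$f_\epsilon(x^n; p_c) = \frac{p_c(c(x^n))}{\pi^{n-1}\, \epsilon^{2n-2}\, n^3}$$

if $r(x^n) \in [0, \epsilon[$.

5. REFERENCES

[1] Jorma Rissanen, Information and Complexity in Statistical Modeling, Springer Verlag, New York, 2007.
[2] Peter D. Grünwald, The Minimum Description Length Principle, The MIT Press, 2007.
[3] Panu Luosto, Jyrki Kivinen, and Heikki Mannila, "Gaussian clusters and noise: an approach based on the minimum description length principle," in Discovery Science, 2010, to appear.
[4] Jorma Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40-47, January 1996.
[5] Jorma Rissanen, "A universal prior for integers and estimation by minimum description length," The Annals of Statistics, vol. 11, no. 2, pp. 416-431, June 1983.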
