A New Minimum Description Length


Soosan Beheshti, Munther A. Dahleh
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology

Abstract

The minimum description length (MDL) method is one of the pioneer methods of parametric order estimation with a wide range of applications. We investigate the definition of two-stage MDL for parametric linear model sets and exhibit some drawbacks of the theory behind the existing MDL. We introduce a new description length which is inspired by the Kolmogorov complexity principle.

1 Introduction

"One should not increase, beyond what is necessary, the number of entities required to explain anything."

Occam's razor is a logical principle attributed to the medieval philosopher William of Occam. Applying the principle to the statement above, the main message is "The simplest explanation is the best". The principle states that one should not make more assumptions than the minimum needed. A computer-science approach to this principle is manifested in Kolmogorov complexity. Let $y$ be a finite binary string and let $U$ be a universal computer. Let $l(y)$ denote the length of the string $y$. Let $U(pg)$ denote the output of the computer $U$ when presented with program $pg$. Then the Kolmogorov complexity $K_U(y)$ of a string $y$ with respect to a universal computer $U$ is defined as

$$K_U(y) = \min_{pg\,:\,U(pg)=y} l(pg) \tag{1}$$

The complexity of string $y$ is called the minimum description length of $y$. For any other computer $A$ we have

$$K_U(y) \le K_A(y) + c_A \tag{2}$$

where $c_A$ does not depend on $y$. This inequality is known as the universality of Kolmogorov complexity.

Kolmogorov complexity is a modern notion of randomness dealing with the quantity of information in individual objects; that is, pointwise randomness rather than average randomness produced by a random source. The two-stage MDL is one of the pioneer methods of computation of description length which is suggested based on this principle [9].
Here we address the drawbacks of the method of calculation of MDL and define a new MDL based on the same Kolmogorov principle. We focus on the order estimation problem when the data is generated by a linear model with additive noise. The new proposed order estimation method is comparable to other well known order estimation methods such as AIC [1], BIC [10] and other forms of existing MDL methods [2].

2 Problem Statement: Linear Model

Consider the class of parametric models for which the output, $y(n)$, is generated by

$$y(n) = \bar y(n) + w(n) \tag{3}$$

where $w(n)$ is additive white Gaussian noise with zero mean and variance $\sigma_w^2$, and $\bar y$ is the noiseless data. Data of length $N$, $y^N = [y(1), \ldots, y(N)]^T$, a member of the random variable set $Y^N$, is given. The noiseless data is described with the basis family $s_i$,

$$\bar y(n) = \sum_{i=1}^{M} s_i(n)\,\theta(i) = A_{S_M}(N)\,\theta_{S_M} \tag{4}$$

where the $s_i(n)$ are bounded numbers. Therefore, $A_{S_M}(N)$ is a matrix of dimension $N \times M$. The columns of $A_{S_M}(N)$, the $s_i$'s, are independent and we have

$$\frac{1}{N}\|s_i\|_2^2 \le c \tag{5}$$

where $c$ is a bounded number and $s_i(n)$ [...] $0$ for all $1 \le i \le M$ and $1 \le n \le N$. The value of $M$ can be a part of the prior knowledge of the model class. If it is unknown, the proper assumption of $M = N$ is considered. The parameter $\theta_{S_M} = [\theta(1), \ldots, \theta(M)]$ is a bounded $l_1$-norm vector in $R^M$.
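The data model in (3) and (4) can be sketched in a few lines of Python. This is only an illustration: the function name `simulate_linear_model`, the choice of a random ±1 basis, and the parameter values are assumptions made here, not part of the paper.

```python
import random


def simulate_linear_model(theta, n_samples, noise_std, seed=0):
    """Draw y = y_bar + w as in (3)-(4), for a hypothetical +/-1 basis."""
    rng = random.Random(seed)
    m = len(theta)
    # Row n holds s_1(n), ..., s_M(n); entries are bounded (+/-1 is an assumption).
    basis_matrix = [[rng.choice((-1.0, 1.0)) for _ in range(m)]
                    for _ in range(n_samples)]
    # Noiseless output: y_bar(n) = sum_i s_i(n) * theta(i).
    y_bar = [sum(s * t for s, t in zip(row, theta)) for row in basis_matrix]
    # Additive white Gaussian noise with zero mean and variance noise_std**2.
    y = [value + rng.gauss(0.0, noise_std) for value in y_bar]
    return basis_matrix, y_bar, y


A, y_bar, y = simulate_linear_model(theta=[1.0, -0.5, 0.25],
                                    n_samples=8, noise_std=0.1)
```

Only `y` and the basis would be observed in practice; `y_bar` is returned here because a simulation has access to the noiseless data.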

A subset $S_m$ of $Y^N$ is of the form

$$x^N = A_{S_m}(N)\,\theta_{S_m} \tag{6}$$

where the columns of $A_{S_m}$, a matrix of dimension $N \times m$, are the $s_i$'s with $s_i \in S_m$, and $\theta_{S_m}$ is a vector of length $m$ for which

$$\theta_{S_m} \in R^m. \tag{7}$$

Inspired by the Kolmogorov complexity and the notion of minimal description length for the string $y$, we want to search for the subspaces $S_m$ which can provide the minimum description length (DL) of the data. Let us first follow the principles of the existing two-stage MDL. In each subspace $S_m$, of order $m$, the description length of $y$ is described as the minimum codelength which can describe $y$ by an element of $S_m$. For the codelength in this probabilistic setting the Shannon coding method is used, therefore

$$DL_{S_m}(y) = \min_{g \in S_m} -\log f(y; g) \tag{8}$$

where the log is base 2 and $f(y; g)$ is the probability distribution function (pdf) of the random variable $Y$ when the mean is $g$ and the additive noise $w$ has the same characteristics that were defined for $w(n)$ in (3). Note that in this scenario the probability distribution defined by each $g$ in $S_m$ (and therefore by a $\theta_{S_m}$, with $g = A_{S_m}(N)\theta_{S_m}$) is a Gaussian distribution with output of the form

$$x = g + w. \tag{9}$$

The least-squares estimate of $y$ in each subspace, which provides the output DL in subspace $S_m$, is

$$\hat y_{S_m} = \arg\min_{g \in S_m} \|g - y\|_2^2. \tag{10}$$

The DL in each subspace is then defined as

$$DL(y; \hat y_{S_m}) = \log\left(\sqrt{2\pi\sigma_w^2}\right)^{N} + \frac{\|\hat y_{S_m} - y\|_2^2}{2\sigma_w^2}\log e \tag{11}$$

which is the description length of the noisy data with an element of $S_m$, $\hat y_{S_m}$.

The comparison of this description length for different subspaces always leads to the choice of the $S_m$ with the largest possible dimension, $S_N$, for which the output error is zero. This is not a desired outcome, especially if we know that the unknown true number of parameters which can define $\bar y$ is less than a given $M$. To avoid this unwanted outcome, two-stage MDL introduces a codelength which describes elements of $S_m$ as well [9].
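The claim that comparing (11) across nested subspaces always favors the largest order can be checked numerically. The following is a minimal sketch, assuming a cosine basis with $M = N$ and a true order of 3; the helper names and all numerical values are hypothetical choices for illustration.

```python
import math
import random


def orthonormalize(columns):
    """Gram-Schmidt; assumes the columns are linearly independent."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis


def residual_energy_by_order(columns, y):
    """Least-squares error ||y_hat_{S_m} - y||^2 for m = 0..M (nested subspaces)."""
    rss = sum(v * v for v in y)
    energies = [rss]
    for q in orthonormalize(columns):
        coeff = sum(a * b for a, b in zip(q, y))
        rss = max(rss - coeff * coeff, 0.0)  # clip tiny negative rounding errors
        energies.append(rss)
    return energies


def data_codelength(rss, n_samples, noise_var):
    """Equation (11), in bits."""
    return (n_samples * math.log2(math.sqrt(2 * math.pi * noise_var))
            + rss / (2 * noise_var) * math.log2(math.e))


# Hypothetical setup: cosine basis, N = M = 16, three active parameters.
N, noise_var = 16, 0.25
columns = [[math.cos(math.pi * i * (n + 0.5) / N) for n in range(N)]
           for i in range(N)]
rng = random.Random(1)
y = [2.0 * columns[0][n] + columns[1][n] - columns[2][n]
     + rng.gauss(0.0, math.sqrt(noise_var)) for n in range(N)]
codelengths = [data_codelength(r, N, noise_var)
               for r in residual_energy_by_order(columns, y)]
best_order = min(range(len(codelengths)), key=codelengths.__getitem__)
```

Because each added basis vector can only reduce the least-squares error, `codelengths` is non-increasing in the order, so its minimum is attained at the full order regardless of the true number of parameters.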
Here the assumption is that the length of the code describing any element of subspace $S_m$ is the same and is of order

$$\frac{m}{2}\log(N). \tag{12}$$

Therefore, the total codelength describing $y$ in subspace $S_m$ is defined as the codelength describing elements of $S_m$ in (12) plus the description length of the output given this estimate from (11):

$$DL(y; S_m) \triangleq \frac{m}{2}\log(N) + DL(y; \hat y_{S_m}). \tag{13}$$

However, choosing the description length for the $\theta_{S_m}$'s by codes of length $\frac{m}{2}\log(N)$ seems to be an ad hoc method. Partitioning the subspace of possible $\theta_{S_m}$'s can be done with any other discretization per dimension factor other than $\log(N)$. For this reason, it seems that the codelength for all $\theta_{S_m}$'s can be of any form $m\log(A_\delta)$ when each dimension has $A_\delta$ elements.

Another method of achieving the description length in (13) is given in [9]. Given a subspace $S_m$, assume that $M = m$, so that

$$\bar y \in S_m, \tag{14}$$

and also assume that the ML estimator in this subspace, $\hat y_{S_m}$, approaches the true noiseless data $\bar y$ as $N$ grows and that $\sqrt{N}(\hat y_{S_m} - \bar y)$ converges to a zero-mean normal distribution. Then for any prefix code (footnote 1) on $Y^N$, with codelength $L(Y^N)$, the following inequality holds for all prefix codes except a set of prefix codes described below:

$$\frac{1}{N}E\,L(Y^N) \ge \frac{1}{N}H(f_{\bar y}(Y^N)) + (1-\epsilon)\,\frac{m}{2}\,\frac{\log(N)}{N}, \tag{15}$$

where $E(X)$ denotes the expected value of the random variable $X$ and $H(f_{\bar y}(Y^N))$ is the differential entropy of $Y^N$. The excluded prefix codes are those generated by the pdfs (footnote 2) which are generated by a subset of $S_m$. The Lebesgue measure of this subset goes to zero as $N$ grows [9].

In [9] it is argued that the codelength in (13) is optimum since it can achieve the lower bound in the inequality in (15). However, in [4] it is shown that $m\log(N)$ in the inequality in (15) can be replaced by a family of functions of $N$, $\log\beta(N)$, for which

$$\lim_{N\to\infty}\beta(N) = \infty, \tag{16}$$

together with a second limit condition (17), stated in [4], under which a ratio involving a power of $\beta(N)$ (with an exponent depending on $\epsilon$) and $N^{m/2}$ tends to zero. Therefore, with the same approach as in [9], the codelength in (13) can be generalized to the form

$$DL(y; S_m) \triangleq \frac{m}{2}\beta(N) + DL(y; \hat y_{S_m}). \tag{18}$$

Footnote 1: A code maps the quantized version of elements of $Y^N$ to a set of binary codewords with length $L(y^N)$. If no codeword is the prefix of any other, then it is uniquely decodable and is called a prefix code.

Footnote 2: Associated with any pdf on $Y^N$ which is generated by $\bar y$ and is nonzero over $Y^N$, a prefix codelength is available which is proportional to $-\log f(y; \bar y)$.
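As an illustration of how the penalized criteria (13) and (18) select an order, the sketch below scores a list of least-squares residual energies, one per candidate order. The function name `two_stage_mdl`, the `beta` argument, and the sample residuals are assumptions introduced for this example.

```python
import math


def two_stage_mdl(rss_by_order, n_samples, noise_var, beta=math.log2):
    """Score each order m with (m/2) * beta(N) plus the data term (11).

    rss_by_order[m] is the least-squares error ||y_hat_{S_m} - y||^2.
    beta=math.log2 gives (13); any other callable gives the form (18).
    """
    scores = []
    for m, rss in enumerate(rss_by_order):
        data_term = (n_samples * math.log2(math.sqrt(2 * math.pi * noise_var))
                     + rss / (2 * noise_var) * math.log2(math.e))
        scores.append(0.5 * m * beta(n_samples) + data_term)
    best_order = min(range(len(scores)), key=scores.__getitem__)
    return best_order, scores


# Hypothetical residual energies for orders 0..4 with N = 64 and unit noise variance.
rss = [100.0, 40.0, 10.0, 9.9, 9.8]
order_mdl, _ = two_stage_mdl(rss, n_samples=64, noise_var=1.0)
order_no_penalty, _ = two_stage_mdl(rss, 64, 1.0, beta=lambda n: 0.0)
```

With the $\log(N)$ penalty the small gains beyond order 2 no longer pay for the extra parameters, while a zero penalty falls back to the behavior of (11) and picks the largest order. The selected order therefore depends on the choice of $\beta(N)$, which is the arbitrariness discussed above.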

Hence, not only $\log(N)$ but any $\beta(N)$ can be used to describe the codelength. Another important issue is that in practical problems $M$ is an upper bound for $m$, the true number of parameters which generate $\bar y$, and the challenge is to find the true $m$ using the observed data. However, this codelength is provided with the assumption that $\bar y$ is an element of a given $S_m$. In practice the same codelength is used for all the subspaces, even if $\bar y$ is not an element of $S_m$. Is this a valid codelength for such a subspace? The answer to this question is not known. The only important fact known about this description length is that it is a consistent criterion: as the length of the data grows, if $m$ is smaller than $M$, the method points to the correct $m$. We will discuss this property of MDL further in the following sections.

2.1 New Description Length

The comparison of the codelengths in (11) fails because of the following argument. Minimizing the description length in (8) is the same as

$$\arg\min_{g \in S_m} -\log f(y; g) = \arg\max_{g \in S_m} f(y; g) \tag{19}$$

which provides the ML estimate of $\bar y$ in each subspace. As we discussed previously, the ML estimation always points to a member of $S_N$, which has the highest possible order, as a perfect candidate. Therefore, comparison of the codelengths describing $y$ itself in each subspace is not a proper tool for comparison of the estimates $\hat y_{S_m}$. Here $y$ is not the string of data; $y$ is the data corrupted by an additive noise. Therefore the codelength of the ML estimates has to be compared with the pdf which is generated by the noiseless output, $\bar y$. To follow the Kolmogorov complexity is to compare the estimate of the codelength based on the true, and unknown, pdf. Therefore we consider the following description length in each subspace.

Definition 1 The new description length of data in subset $S_m$ is defined as

$$DL(y; S_m) \triangleq DL(\hat y_{S_m}; \bar y) \tag{20}$$

$$= -\log e^{-\frac{\|\hat y_{S_m} - \bar y\|^2}{2}}. \tag{21}$$

Therefore, the new minimum description length is obtained for

$$S_m^{*} = \arg\min_{S_m} DL(y; S_m). \tag{22}$$

Calculation of this DL is provided in the following section.
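To see how the new description length in (21) behaves across orders, it can be evaluated directly in a simulation where $\bar y$ is known. The sketch below makes two assumptions that are not in the paper: the basis is orthonormal, so that squared distances can be computed on coefficients, and the coefficient values are made up for illustration. In practice $\bar y$ is unknown, which is why bounds on this quantity are needed.

```python
import math


def new_description_length(true_coeffs, noisy_coeffs, order):
    """Equation (21) in bits: (log2(e) / 2) * ||y_hat_{S_m} - y_bar||^2.

    Assumes an orthonormal basis, so y_hat_{S_m} keeps the first `order`
    noisy coefficients and sets the remaining ones to zero.
    """
    kept_error = sum((c - t) ** 2
                     for c, t in zip(noisy_coeffs[:order], true_coeffs[:order]))
    dropped_energy = sum(t * t for t in true_coeffs[order:])
    return 0.5 * math.log2(math.e) * (kept_error + dropped_energy)


# Hypothetical coefficients of y_bar and of the noise in the same basis.
true_coeffs = [5.0, 3.0, 0.1, 0.0, 0.0]
noise_coeffs = [0.5, -0.5, 0.5, 0.5, -0.5]
noisy_coeffs = [t + w for t, w in zip(true_coeffs, noise_coeffs)]

dl_by_order = [new_description_length(true_coeffs, noisy_coeffs, m)
               for m in range(len(true_coeffs) + 1)]
best_order = min(range(len(dl_by_order)), key=dl_by_order.__getitem__)
```

In this example the minimum falls at order 2: the third coefficient is smaller than the noise it would bring in, so including it increases the reconstruction error. Unlike (11), the criterion does not decrease monotonically with the order.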
2.2 Calculation of the New Description Length

Calculation and comparison of the description length defined in (21) for different subspaces leads to comparison of the reconstruction error $\|\hat y_{S_m} - \bar y\|^2$. The available error in each subspace is $\|\hat y_{S_m} - y\|^2$. With this available data we can validate bounds on the description length of $\bar y_{S_m}$ probabilistically,

$$DL(\bar y_{S_m}; \bar y) \tag{23}$$

where

$$\bar y_{S_m} = E(\hat y_{S_m}). \tag{24}$$

Details of the validation step are in [5]. For each subspace $S_m$, the validated upper and lower bounds are functions of the order $m$ of $S_m$, the length of the data $N$, the noise variance $\sigma_w^2$, the data's power and the validation probability $P_1$. The next step is to provide probabilistic bounds on the desired DL, $DL(\hat y_{S_m}; \bar y)$. Probabilistic bounds are also provided in [5]. The bounds are functions of the confidence probability $P_2$, the length of the data $N$, the noise variance and the validated upper and lower bounds on $DL(\bar y_{S_m}; \bar y)$ from step one. For large $N$, with $P_1$ and $P_2$ approaching one, the upper and lower bounds on the desired description length approach each other and provide a tight estimate for subspaces of low order $m \ll N$ [5].

Note that in [4] the calculation of another method of order estimation, minimum description complexity (MDC), is provided. The closed-form criterion for MDC and the new MDL have the same structure for the considered linear model class, for which the data is generated by the structure given in (3).

3 Thresholding

The existing information theoretic methods attempt to determine the true parametric model with $m$ parameters. In most practical problems, $m$ is not finite and we are required to detect the optimum estimate for $m$, which represents the significant part of the noiseless data $\bar y$. Implementing the MDL method in this situation provides an estimate for $m$ which is very sensitive to the variation in signal-to-noise ratio (SNR, see footnote 3) and to the length of the output [7]. When the length of the true parameter is infinite, the consistent methods, such as MDL and BIC, point to a higher and higher order as $N$ and/or SNR grows.
Some related practical problems of these information theoretic methods are addressed in [11] and [6]. When the true $m$ is larger than the length of a very long data record, we propose implementing the new information theoretic method of order estimation. With this method we can avoid pointing to higher and higher orders by using a threshold for the description length.

Footnote 3: $\mathrm{SNR} = 10\log_{10}\dfrac{\|Y^N\|_2^2}{N\sigma_w^2}$.

If a threshold $\epsilon$ is used for the minimum acceptable DL,
then we choose the smallest $m$ for which the upper bound on DL is greater than or equal to $\epsilon$. An example of this approach is given in the simulation section.

3.1 MDL Thresholding

Can thresholding be used for the two-stage MDL? In order to make the description length in (13) a valid codelength, which corresponds to a prefix code and satisfies Kraft's inequality, it is suggested to add a normalizing constant $C_1(N)$ to the suggested description length:

$$DL_{S_m}(y) = \frac{m}{2}\log(N) - \log f_{S_m}(\hat y_{S_m}; y) + C_1(N). \tag{25}$$

In [9] it is argued that as $N$ grows, $C_1(N)/N \to 0$. However, note that as $N$ grows the factor $m\log(N)/N$ also goes to zero. For any fixed $N$, $C_1(N)$ might be comparable with $m\log(N)/N$. Calculation of this normalizing factor is not trivial and is not available. Because of the structure of the DL, in general $C_1(N)$ is a function of the noiseless output $\bar y$. Since $C_1(N)$ is a fixed number in the comparison of different subspaces, this term is ignored in the calculation of the MDL. However, since $C_1(N)$ changes for different order estimation settings, for example with the change of $\bar y$, implementation of a threshold is meaningless for this criterion.

Here we prove that the use of a threshold is meaningful for the new proposed MDL. Assume that for any problem setting the discretization in the output space $Y$ is the same. We prove that the new description length is a codelength of a prefix code. A necessary and sufficient condition for a code to be prefix is that it satisfies Kraft's inequality. The new DL satisfies Kraft's inequality by adding a normalizing factor. The normalizing factor is not a function of $y$ or $\bar y$, but a function of the order of the subset $S_m$ and the discretization factor of $Y$. This proves the consistency of the codelengths with the universality of Kolmogorov complexity in (2).

Theorem 1 The new description length, defined in (21), satisfies Kraft's inequality, i.e., corresponds to a prefix code, when a normalizing factor $C(N)$ is considered, where

$$DL_{S_m}(y) \triangleq DL(\hat y_{S_m}; \bar y) + C(N) = -\log f_{S_m}(\hat y_{S_m}; \bar y) + C(N) \tag{26}$$

$$C(N) = \frac{1}{\mathrm{Ln}\,2}\,\mathrm{Ln}\!\left(\delta^{m}\left(\sqrt{2\pi\sigma_w^2}\right)^{(N-m)}\right). \tag{27}$$

Note that although $C(N)$ is a function of $m$, $C(N)/N$ goes to zero much faster than the terms in the estimate of $\log f_{S_m}(\hat y_{S_m}; \bar y)$, and it can be ignored for large enough $N$.

Proof The discretized version of $y$ is $y_j$ and the discretized version of $\hat y_{S_m}$ is $\hat y_{S_m}(j)$. The index $j$ is used for the discretized elements in $Y^N$. Therefore, we have

$$DL(\hat y_{S_m}(j); \bar y) = -\log e^{-\frac{\|\bar y - \hat y_{S_m}(j)\|^2}{2}} = -\frac{1}{\mathrm{Ln}(2)}\,\mathrm{Ln}\, e^{-\frac{\|\bar y - \hat y_{S_m}(j)\|^2}{2}}. \tag{28}$$

To check Kraft's inequality for each codeword of length $DL(\hat y_{S_m}(j); \bar y)$ we have to show that

$$\sum_j D^{-DL(\hat y_{S_m}(j); \bar y)} \le 1 \tag{29}$$

where $D$ is the size of the alphabet resulting from discretizing the output space $Y$. Equivalently, we can check whether a normalizing factor can be added such that

$$\sum_j \left(e^{-DL(\hat y_{S_m}(j); \bar y)}\right)^{\mathrm{Ln}D} \le 1. \tag{30}$$

We know that

$$\left(e^{-DL(\hat y_{S_m}(j); \bar y)}\right)^{\mathrm{Ln}D} = \left(e^{-\frac{\|\bar y - \hat y_{S_m}(j)\|^2}{2}}\right)^{\frac{\mathrm{Ln}D}{\mathrm{Ln}2}}. \tag{31}$$

Note that

$$\delta^{m}\sum_j \left(\frac{1}{\sqrt{2\pi\sigma_w^2}}\right)^{m} e^{-\frac{\|\bar y - \hat y_{S_m}(j)\|^2}{2\sigma_w^2}} \approx \int \left(\frac{1}{\sqrt{2\pi\sigma_w^2}}\right)^{m} e^{-\frac{\|\bar y - \hat y_{S_m}\|^2}{2\sigma_w^2}}\,dy \tag{32}$$

where $\delta$ is the precision per dimension in the space $Y$, or equivalently in the space of the additive noise $W$. On the other hand, the error $\bar y_{S_m} - \hat y_{S_m}$ has a Gaussian distribution and we have

$$\int \left(\frac{1}{\sqrt{2\pi\sigma_w^2}}\right)^{m} e^{-\frac{\|\bar y_{S_m} - \hat y_{S_m}\|^2}{2\sigma_w^2}}\,dy = 1. \tag{33}$$

Therefore the error $\bar y - \hat y_{S_m}$ also has a Gaussian distribution with the same variance and a different mean. Hence, we have

$$\int \left(\frac{1}{\sqrt{2\pi\sigma_w^2}}\right)^{m} e^{-\frac{\|\bar y - \hat y_{S_m}\|^2}{2\sigma_w^2}}\,dy = 1. \tag{34}$$

Therefore the integral $\int e^{-\|\bar y - \hat y_{S_m}\|^2/(2\sigma_w^2)}\,dy$ is a fixed power of $\sqrt{2\pi\sigma_w^2}$ that depends only on $N$ and $m$ (35). Therefore

$$\sum_j \left(e^{-DL(\hat y_{S_m}(j); \bar y) - C(N)}\right)^{\mathrm{Ln}D} \le 1. \tag{36}$$

Hence, the normalizing factor is $C(N) = \frac{1}{\mathrm{Ln}\,2}\,\mathrm{Ln}\!\left(\delta^{m}\left(\sqrt{2\pi\sigma_w^2}\right)^{(N-m)}\right)$.

4 Linear Models and Simulation Results

Let us consider a linear time-invariant (LTI) system for which the matrix $A_{S_m}(N)$ in (4) is a Toeplitz matrix of the input. The input of the system is a binary sequence of $\pm 1$. The input is a sample of an independent, identically distributed Bernoulli random process. Note that in this case the basis vectors $s_i$ in (4) are asymptotically orthogonal. The input-output relationship is of the form

$$y(i) = \sum_{k=1}^{i}\theta(k)\,x_{i-k+1} + w(i). \tag{37}$$

We use the microwave radio channel, chan10.mat, which is available at [...].

[Figure 1: Real part of the first 60 taps of a microwave radio channel impulse response.]

[Figure 2: Description length versus $m$. Solid line: the description length for SNR = 10 dB and N = 300. Also shown: the probabilistic upper bound and lower bound of the new description length.]

[Figure 3: Description length versus $m$. Solid line: the description length for SNR = 90 dB and N = 800. Also shown: the probabilistic upper bound and lower bound of the new description length.]

Figure 1 shows the real part of the first 60 taps of the system impulse response. The simulation result for data of length $N = 300$ and SNR = 10 dB is shown in Figure 2. Here the optimum impulse response lengths for the different methods are $\hat m(\mathrm{AIC}) = 34$ and $\hat m(\mathrm{MDL}) = 32$. The new proposed criterion selects $\hat m = 33$.

Figure 3 shows the upper and lower bounds on the new DL for $N = 800$, SNR = 90 dB. The bounds on the DL are valid with probability 0.84 and validation probability of [...]. In this case all the methods select an impulse response length which is larger than 30. With higher SNR and/or a longer data sample, all the methods choose a larger and larger length for the impulse response estimate. However, if we choose a threshold for the DL to be 0, the new criterion selects $m = 37$. With this threshold, $m \le 37$ when SNR grows and/or the length of the data gets larger.
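The Toeplitz structure of the input matrix and the input-output relation (37) can be sketched as follows. The tap values, the data length and the function names are hypothetical and stand in for the channel data used in the paper, which is not reproduced here.

```python
import random


def toeplitz_input_matrix(x, order):
    """N x order lower-triangular Toeplitz matrix, A[i][k] = x[i - k].

    Entries before the start of the input are taken as zero (an assumption).
    """
    return [[x[i - k] if i >= k else 0.0 for k in range(order)]
            for i in range(len(x))]


def simulate_lti_output(x, theta, noise_std, rng):
    """Relation (37) with zero-based indices and a finite number of taps."""
    rows = toeplitz_input_matrix(x, len(theta))
    return [sum(a * t for a, t in zip(row, theta)) + rng.gauss(0.0, noise_std)
            for row in rows]


rng = random.Random(0)
N, m = 400, 4
# i.i.d. Bernoulli input taking values +1 and -1.
x = [rng.choice((-1.0, 1.0)) for _ in range(N)]
A = toeplitz_input_matrix(x, m)
# Normalized Gram matrix (1/N) A^T A; near the identity if columns are nearly orthogonal.
gram = [[sum(A[n][i] * A[n][j] for n in range(N)) / N for j in range(m)]
        for i in range(m)]
y = simulate_lti_output(x, [1.0, 0.5, -0.25, 0.1], 0.1, rng)
```

The diagonal of `gram` equals $(N-k)/N$ for column $k$, and the off-diagonal entries are averages of products of independent $\pm 1$ values, which is the sense in which the columns become orthogonal as $N$ grows.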
Accounting for the delay of the system, with the same threshold, the proposed method chooses the 10 taps of the impulse response estimate from 27 to 36 for optimum modelling of the system.

5 Conclusion

In this paper a new information theoretic method of parametric estimation is introduced. By using the available data, we are able to probabilistically estimate tight bounds on the new criterion, which is in the form of a data description length. It is shown that the new description length corresponds to a prefix code and is consistent with the universality of Kolmogorov complexity.

References

[1] H. Akaike. A New Look at the Statistical Model Identification. IEEE Trans. on Automatic Control, vol. AC-19, pp. [...], 1974.
[2] Y. Barron, J. Rissanen, Bin Yu. The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. on Information Theory, vol. 44, pp. [...], Oct. [...].
[3] S. Beheshti and M.A. Dahleh. On Model Quality Evaluation of Stable LTI Systems. Proceedings of the 39th IEEE Conference on Decision and Control, pp. [...].
[4] S. Beheshti. Minimum Description Complexity. Thesis, MIT, September [...].
[5] S. Beheshti and M.A. Dahleh. New Information Theoretic Approach to Order Estimation Problem. 13th IFAC Symposium on System Identification, August [...].
[6] W. Chen, K.M. Wong, and J. Reilly. Detection of the Number of Signals: A Predicted Eigen-Threshold Approach. IEEE Trans. on Signal Processing, vol. 39, pp. [...], May 1991.
[7] A.P. Liavas, P.A. Regalia, and J. Delmas. Blind Channel Approximation: Effective Channel Order Estimation. IEEE Trans. on Signal Processing, vol. 47, pp. [...], 1999.
[8] L. Ljung. System Identification: Theory for the User. NJ: Prentice-Hall, 1998.
[9] J. Rissanen. Universal Coding, Information, Prediction, and Estimation. IEEE Trans. on Information Theory, vol. IT-30, pp. [...], 1984.
[10] G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, vol. 6, pp. [...], 1978.
[11] K.M. Wong, Q.T. Zhang, J. Reilly and P.C. Yip. On Information Theoretic Criteria for Determining the Number of Signals in High Resolution Array Processing. IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, pp. [...], November 1990.
