MDLP approach to clustering

MDLP approach to clustering
Jacek Tabor (UJ), Jagiellonian University
Theoretical Foundations of Machine Learning (TFML) 2015

Ockham's razor

Among competing hypotheses, the one with the fewest assumptions should be selected. Other, more complicated solutions may ultimately prove correct, but in the absence of certainty, the fewer assumptions that are made, the better.

Figure: William of Ockham

Morse code

Morse code and the telegraph [1836, S. Morse, A. Vail]: Vail determined how frequently each letter is used in English by counting the movable type he found in the type cases of a local newspaper in Morristown. The code of a typical text should be as short as possible, i.e. the more common the letter, the shorter its code should be.

ANECDOTE: Ernest Hemingway [newspaper notes from the war in Spain]

READER (with admiration): How is it possible to learn to write notes so short, yet so wonderfully informative?

HEMINGWAY: Easy, you just have to pay one dollar for each word sent by telegraph.

Information = money

Entropy

C. Shannon [1948. "A Mathematical Theory of Communication". Bell System Technical Journal 27]

A precise formulation of Morse's idea leads to the formal definition of Shannon's entropy: in an optimal coding (one with the shortest expected code length), a signal that appears with probability $p$ should have code length $-\log_2 p$.

Thus, if the symbols $x_1, \dots, x_n$ appear with probabilities $p_1, \dots, p_n$:

$$\text{entropy} = \text{minimal expected code length} = -\left(p_1 \log_2 p_1 + \dots + p_n \log_2 p_n\right).$$

In practice this leads to Huffman coding (used, for example, in JPEG).
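The link between entropy and Huffman coding can be checked on a small example. The sketch below is illustrative only: the function names and the four-symbol distribution are assumptions, not taken from the slides. It builds Huffman code lengths with a heap and compares the mean code length with the entropy.

```python
import heapq
import math


def entropy_bits(probs):
    """Shannon entropy in bits: -sum p_i log2 p_i (zero probabilities contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_code_lengths(probs):
    """Return the Huffman code length (in bits) of each symbol."""
    if len(probs) == 1:
        return [1]
    # Heap items: (subtree probability, unique tie-breaker, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie_breaker = len(probs)
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol they contain.
        for i in symbols1 + symbols2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie_breaker, symbols1 + symbols2))
        tie_breaker += 1
    return lengths


# Hypothetical distribution with probabilities that are powers of 1/2.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
mean_length = sum(p * n for p, n in zip(probs, lengths))
print(lengths, mean_length, entropy_bits(probs))
```

For probabilities that are powers of 1/2, each code length equals $-\log_2 p_i$ exactly, so the mean code length coincides with the entropy. For other distributions the Huffman mean length lies between the entropy and the entropy plus one bit.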

Minimum description length (MDL) principle

Formulation of the MDL principle, Jorma Rissanen [1978. "Modeling by shortest data description". Automatica 14]: the best hypothesis for a given set of data is the one that leads to the best compression of the data.

P. Grünwald [1998]: "any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally."

Connected to the notion of Kolmogorov complexity [1963. "On Tables of Random Numbers". Sankhya Ser. A. 2]: the complexity of a string is the length of the shortest possible description of the string in some fixed universal description language.
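The claim that regularity enables compression can be illustrated with a general-purpose compressor. In this sketch `zlib` is only a stand-in for a "description language" (an assumption for illustration; the slides do not name a compressor), and the two byte strings are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data at maximum compression level."""
    return len(zlib.compress(data, 9))


# A highly regular message: the same two bytes repeated.
regular = b"ab" * 5000

# A message with no exploitable regularity: pseudo-random bytes (fixed seed).
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(10000))

regular_size = compressed_size(regular)
irregular_size = compressed_size(irregular)
print(len(regular), regular_size)
print(len(irregular), irregular_size)
```

Both messages are 10000 bytes long, but only the regular one shrinks substantially. The pseudo-random one stays at roughly its original size.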

How many clusters?

In most clustering methods, one has to specify the number of clusters in advance. This means that the procedure neither returns the right number of clusters nor removes unnecessary clusters on-line while clustering.

Advantages of using the MDL principle in clustering:
- automatically reduces the complexity of the model (the number of clusters)
- has high adaptability
- (often) places small requirements on the data: we do not require a vector or metric structure, only the existence of encoding or compression methods
- easily allows potential modifications

MDL principle in clustering

Requirements: the type of data we want to cluster and a family $W$ of available compression methods.

Determination of the overall memory cost: Assume that the message $X = (x_1, \dots, x_N)$ and the compression methods $w_1, \dots, w_K \in W$ are given. Denote by $w_{k_l}$, where $k_l \in \{1, \dots, K\}$, the algorithm that encodes the point $x_l$ (and thereby defines the cluster it belongs to). Then the memory cost of coding the message $X$ equals

$$\sum_{k=1}^{K} (\text{memory cost of } w_k) + \sum_{l=1}^{N} \big(\text{cost of identifying } k_l + \text{amount of memory } w_{k_l} \text{ uses to code } x_l\big).$$

Memory minimization step: We seek $K$, compression methods $w_1, \dots, w_K$ and indices $k_l$ such that the total cost needed to encode the message $X$ is minimal.

Construction of the clustering: Points coded by the same algorithm are assigned to the same cluster.
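The cost formula and the minimization step can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the coders (a fixed 16-bit model cost and a code length of $1 + 2|x - c|$ bits around a centre $c$), the uniform $\log_2 K$ cost for identifying $k_l$, and the brute-force search over subsets of candidate coders.

```python
import math
from itertools import combinations


def make_coder(center, model_cost_bits=16.0):
    """Hypothetical coder: (memory cost of the coder, code length of x in bits)."""
    return (model_cost_bits, lambda x: 1 + 2 * abs(x - center))


def total_cost(points, coders):
    """Total memory cost in bits, with each point sent to its cheapest coder."""
    K = len(coders)
    id_cost = math.log2(K) if K > 1 else 0.0  # uniform code for the index k_l
    cost = sum(model_cost for model_cost, _ in coders)
    assignment = []
    for x in points:
        lengths = [code_length(x) for _, code_length in coders]
        k = min(range(K), key=lengths.__getitem__)
        assignment.append(k)
        cost += id_cost + lengths[k]
    return cost, assignment


def best_subset(points, candidates):
    """Brute-force search over non-empty subsets of the candidate coders."""
    best = None
    for K in range(1, len(candidates) + 1):
        for subset in combinations(range(len(candidates)), K):
            cost, assignment = total_cost(points, [candidates[i] for i in subset])
            if best is None or cost < best[0]:
                best = (cost, subset, assignment)
    return best


points = [0, 1, 0, 2, 1, 4, 100, 101, 99, 100]
candidates = [make_coder(0), make_coder(4), make_coder(100)]
cost, chosen, assignment = best_subset(points, candidates)
print(cost, chosen, assignment)
```

In this toy setting the coder centred at 4 would serve only one point well, and the bits it saves there do not pay for its own memory cost plus the longer cluster index, so the minimum keeps two coders and hence two clusters. Exhaustive search is only feasible for a handful of candidates; it is used here to keep the sketch short.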

Observe that in the above approach using any encoding method takes memory (needed to specify it); as a consequence, we obtain an upper limit on the number of possible clusters. Moreover, in practice, when the number of elements encoded by a given algorithm is small, it is worthwhile to give up that algorithm altogether, which reduces the complexity of the constructed model (understood as the number of clusters).

Differential entropy, cross-entropy

Differential entropy: Consider a coder adapted to data generated by a continuous random variable with density $f$. Differential entropy is the limiting case of entropy as the bins become smaller and smaller:

$$h(f) := -\int f(x) \log_2 f(x)\,dx.$$

Cross-entropy:

$$H^{\times}(g \,\|\, f) := -\int g(x) \log_2 f(x)\,dx.$$

It gives the memory cost of compressing data with density $g$ by the coder optimally adapted to the density $f$.
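The "smaller and smaller bins" limit can be checked numerically. The sketch below assumes a standard normal density, a truncation of the real line to $[-10, 10]$, and a bin width of 0.01 (all illustrative choices). It uses the fact that the entropy of the binned variable plus $\log_2$ of the bin width approaches the differential entropy.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def binned_entropy_bits(pdf, lo, hi, width):
    """Shannon entropy (bits) of the variable discretized into bins of the given width."""
    n = round((hi - lo) / width)
    # Midpoint approximation of each bin's probability, then normalization.
    probs = [pdf(lo + (i + 0.5) * width) * width for i in range(n)]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


width = 0.01
approx = binned_entropy_bits(normal_pdf, -10.0, 10.0, width) + math.log2(width)
exact = 0.5 * math.log2(2 * math.pi * math.e)  # closed form for the standard normal
print(approx, exact)
```

The discrete entropy itself grows without bound as the bins shrink; only after adding $\log_2$ of the bin width does it converge, which is why differential entropy can be negative and measures code length relative to a chosen precision.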

Gaussian models

In practice, we can compute the above for Gaussian models! Only the mean and the covariance matrix of the data are needed.
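A one-dimensional sketch shows why the moments suffice. It is an illustration under stated assumptions: the sample is hypothetical, the covariance matrix reduces to a variance, and the function names are not from the slides. The closed form $\tfrac12 \ln(2\pi s^2) + \frac{v + (m - \mu)^2}{2 s^2}$ (in nats) gives the cross-entropy of data with mean $m$ and variance $v$ with respect to the Gaussian $N(\mu, s^2)$, and it is compared with the direct average of $-\log_2$ of the density over the sample.

```python
import math


def moments(xs):
    """Sample mean and (biased, 1/n) sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var


def gaussian_cross_entropy_bits(data_mean, data_var, model_mean, model_var):
    """Cross-entropy (bits) of data with the given moments w.r.t. N(model_mean, model_var)."""
    nats = (0.5 * math.log(2 * math.pi * model_var)
            + (data_var + (data_mean - model_mean) ** 2) / (2 * model_var))
    return nats / math.log(2)


def average_code_length_bits(xs, model_mean, model_var):
    """Direct average of -log2 of the Gaussian density over the sample."""
    total = 0.0
    for x in xs:
        log_density = (-0.5 * math.log(2 * math.pi * model_var)
                       - (x - model_mean) ** 2 / (2 * model_var))
        total -= log_density / math.log(2)
    return total / len(xs)


xs = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6]  # hypothetical sample
m, v = moments(xs)
fitted = gaussian_cross_entropy_bits(m, v, m, v)          # coder fitted to the data
mismatched = gaussian_cross_entropy_bits(m, v, 0.0, 1.0)  # coder for N(0, 1)
print(fitted, mismatched)
```

When the coder is fitted to the data, the value reduces to $\tfrac12 \log_2(2\pi e\, v)$; any other Gaussian coder costs at least as many bits on the same data.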

Papers

- TABOR, SPUREK: Cross-entropy clustering, Pattern Recognition 2014
- TABOR, MISZTAL: Detection of elliptical shapes via cross-entropy clustering (IbPRIA 2013)
- SPUREK, TABOR, ZAJAC: Detection of disk-like particles in electron microscopy images (CORES 2013)
- ŚMIEJA, TABOR: Image segmentation with use of cross-entropy clustering (CORES 2013)

EM vs CEC

EM [Geoffrey McLachlan, Thriyambakam Krishnan, The EM Algorithm and Extensions]: fit the data $X$ by

$$X \sim p_1 f_1 + \dots + p_k f_k,$$

where the $f_i$ belong to a fixed family of densities.

CEC:

$$X \sim \max(p_1 f_1, \dots, p_k f_k).$$
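One way to read the difference between the two models is through how a single point is attributed to components. The sketch below is an interpretation, not the authors' implementation: it assumes one-dimensional Gaussian components with hypothetical parameters, and contrasts the soft responsibilities that a mixture sum induces (as in EM) with the hard choice of the component attaining the maximum of $p_i f_i(x)$.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def weighted_densities(x, weights, means, variances):
    """The values p_i * f_i(x) for every component i."""
    return [p * gaussian_pdf(x, m, v) for p, m, v in zip(weights, means, variances)]


def em_responsibilities(x, weights, means, variances):
    """Soft assignment under the sum model: each component's share of the mixture density."""
    joint = weighted_densities(x, weights, means, variances)
    total = sum(joint)
    return [j / total for j in joint]


def cec_assignment(x, weights, means, variances):
    """Hard assignment under the max model: index of the largest p_i * f_i(x)."""
    joint = weighted_densities(x, weights, means, variances)
    return max(range(len(joint)), key=joint.__getitem__)


# Hypothetical two-component model.
weights, means, variances = [0.7, 0.3], [0.0, 4.0], [1.0, 1.0]
resp = em_responsibilities(1.5, weights, means, variances)
label = cec_assignment(1.5, weights, means, variances)
print(resp, label)
```

Under the max model every point belongs to exactly one component, so each component is a cluster in its own right and one that attracts too few points can be dropped.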

Figure: Cluster reduction.
