Universal Compression Bounds and Deep Learning


Université Pierre et Marie Curie
Master 2 in Fundamental Mathematics
Research thesis

Universal Compression Bounds and Deep Learning

Author: Léonard Blier
Supervisor: Yann Ollivier
September 2017

Abstract

Solomonoff's theory of induction, based on Kolmogorov complexity, is a general theory of inductive inference. It can be seen as a formalisation of the well-known Occam's razor principle, which states that we should always choose the simplest model that explains the data. In this work we present this theory and show how statistics and machine learning can be interpreted through it. More precisely, we show how deep neural networks can be seen as simple models, even when they have more parameters than there are data samples. We present different ways to compute universal compression bounds with deep learning, with variational methods and online methods, and we show that variational methods are outperformed by online methods for compression. Finally, we present the auto-switch, a way to improve compression bounds for models learned by gradient descent.

Contents

1 Introduction
  1.1 Deep Learning and Neural Networks
  1.2 Machine learning, overfitting and underfitting
  1.3 The issue of generalization
  1.4 Outline
2 Kolmogorov Complexity
  2.1 Inductive reasoning
  2.2 Definitions
  2.3 Trying to compute Kolmogorov complexity
  2.4 Generic bounds
  2.5 Compression bound and probabilistic models
  2.6 Incompressibility
  2.7 Solomonoff's theory of inductive inference
3 Kolmogorov complexity and Machine Learning
  3.1 Machine learning and its formalisation
  3.2 Minimum Description Length and two-part codes
  3.3 Optimal precision
  3.4 Minimum Description Length and Bayesian Statistics
  3.5 The switch distribution and the catch-up phenomenon
4 Compression bounds with Deep Learning
  4.1 No compression
  4.2 Model compression
  4.3 Variational methods for deep learning
  4.4 Online learning
  4.5 Online learning with the Switch
  4.6 Other attempts
  4.7 Summary of the experiments: Do deep nets compress?
5 Discussions
  5.1 Variational compression with a small training set
  5.2 Gap between variational learning and online learning
6 Conclusion

Acknowledgements

First of all, I would like to thank my supervisor Yann Ollivier for his guidance, his advice, and the fascinating discussions we have shared. This thesis is only a beginning, and I am happy and eager to continue this work with him. I also thank Stéphane Boucheron for accepting to be a member of the jury for this thesis. I thank the TAU team for its welcome, in particular Olga for her practical help, and Corentin for his technical help and our always very instructive discussions. Finally, I thank Jonathan for his remarks, and Emilie for her careful proofreading and her daily support.

1 Introduction

1.1 Deep Learning and Neural Networks

Deep learning is a subfield of machine learning. It consists in learning high-level representations of data with what we call neural networks. It was first developed in the 1980s but was not considered very promising by the machine learning community. However, thanks to the improved computing power of machines, the size and quality of new datasets, and some huge progress in optimisation procedures, deep learning has recently yielded outstanding results in many different areas [LeCun et al., 2015]. We might mention:

- Image classification: convolutional neural networks (CNN), popularized by AlexNet [Krizhevsky et al., 2012], which won the ImageNet competition in 2012 when almost no one expected these methods to work so well. The results were then successively improved by VGGNet [Simonyan and Zisserman, 2014] and Residual Networks [He et al., 2016].
- Generative models: auto-encoders [Kingma and Welling, 2013] and generative adversarial networks [Goodfellow et al., 2014].
- Speech recognition: one of the first major applications of deep learning in industry [Hinton et al., 2012].
- Natural language processing, especially question answering by looking for information in an external dataset [Bordes et al., 2014] or in a text [Hermann et al., 2015], or with machine reasoning on specific tasks [Sukhbaatar et al., 2015].
- Automatic translation, where deep learning is now used in the most modern systems instead of models founded on linguistic knowledge [Sutskever et al., 2014].
- Image captioning [Xu et al., 2015].
- Artificial intelligence in games, with the widely covered victory of AlphaGo over Lee Sedol [Silver et al., 2016].

A neural network is a function $f_\theta(x)$, where $\theta \in \Theta$ is a parameter and $\Theta$ is typically a linear space. In this work, we will only use feed-forward neural networks. A feed-forward neural network can be expressed as

$$f(x) = f_L \circ f_{L-1} \circ \dots \circ f_1(x) \qquad (1.1)$$

where

$$f_k(x) = \sigma_k(b_k + W_k x) \qquad (1.2)$$

and where $x$ is a vector, $b_k \in \mathbb{R}^{n_k}$ is a vector called the bias, $W_k \in \mathbb{R}^{n_k \times n_{k-1}}$ is a matrix called the linear part, and $\sigma_k$ is a pointwise non-linear function, called

the activation. Activations are mostly the hyperbolic tangent or the Rectified Linear Unit (ReLU), $\sigma(x) = \max(x, 0)$. Each $f_k$ is called a layer, and $n_k$ is its width. The depth of the model is $L$. We call $f_1$ the input layer, $f_L$ the output layer, and $f_2, \dots, f_{L-1}$ the hidden layers. Here, the parameter is $\theta = (W_1, b_1, W_2, b_2, \dots, W_L, b_L)$, and $\Theta = \mathbb{R}^{n_1 \times N} \times \mathbb{R}^{n_1} \times \mathbb{R}^{n_2 \times n_1} \times \dots \times \mathbb{R}^{n_L}$. The parameters are called the weights of the network. See Figure 1 for an example of a neural network. Very often, we choose to limit the possible values of $\theta$ to a smaller set of values, a subspace or a submanifold of $\Theta$, based on our knowledge of the structure of the data; we say that we choose the topology of the network. These networks are called deep because there can be many layers, from 2 to a few hundred.

Figure 1: A feed-forward neural network.

1.2 Machine learning, overfitting and underfitting

What are these networks used for? Let us consider the problem of supervised learning (classification or regression). Suppose we have some data $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^N$. Let $Y = \{y_1, \dots, y_n\}$ be the labels (for classification) or the regression values (for regression) of the data. We assume that $y$ is a function of $x$: $y = f(x)$, or a random function of $x$: $y = f(x, \omega)$ where $\omega$ is random. The problem is to find a function $\hat{f}$ such that $\hat{f}(x_i) = y_i$ on $X$, and which will generalize to new data.

Example 1. The particular case we will discuss in this work is image classification. The $x_i$ are pictures, seen as vectors of a linear space. If the picture is in greyscale, each pixel is one dimension; if the pictures are in color, each pixel is 3 dimensions. The $y_i$ are the labels of the pictures.

Machine learning's overall approach is to split the data into two subsets: the train set $(X_{\mathrm{TRAIN}}, Y_{\mathrm{TRAIN}})$ and the test set $(X_{\mathrm{TEST}}, Y_{\mathrm{TEST}})$. We learn a model $\hat{f}$ on the train set, and

see how it generalizes on the test set. We choose a loss function

$$L(\hat{f}) = L(f, \hat{f}) = \frac{1}{n} \sum_{i=1}^{n} l(y_i, \hat{f}(x_i)) \qquad (1.3)$$

where $l$ is a function such that:

1. $l$ is real-valued: $l(y, y') \in \mathbb{R}$
2. $l(y, y) = 0$
3. $l$ is non-negative: $l(y, y') \geq 0$

$l$ is not necessarily a distance. It is very often required to be differentiable, and sometimes convex. Then, we choose a parametric model $f_\theta$, $\theta \in \Theta$, and try to find the $\theta$ that minimizes $L(f_\theta)$.

There are two main issues with machine learning models: overfitting and underfitting. Underfitting is when the chosen model $f_\theta$ is not even good on the train set. It usually means that the parametric model we chose was not complex enough to capture the structure contained in the data, or that we did not find the best $\theta$. Overfitting is when the chosen model $f_\theta$ is very good on the train set but not very good on the test set. It means that the model fits the noise in the data because the parametric model was too complex. See Figure 2 for more details.

What is the specificity of deep neural networks in machine learning? On one hand, traditional machine learning methods try to extract some high-level features of the data with specifically designed methods, and then use learning algorithms on these features to achieve the task. The machine learning methods are usually chosen to be sufficiently complex to avoid underfitting, but not too complex, to avoid overfitting.

Remark 1. The high-level features can be of different kinds. In image processing, we often use the Scale-Invariant Feature Transform (SIFT) to detect patterns [Lowe, 1999], or the scattering transform [Mallat, 2012]. In sound classification, and especially speech recognition, Mel-frequency cepstral coefficients (MFCC) are widely used. If we build an artificial intelligence for a game, we can implement some features that players use for their own strategies.

Deep neural networks, on the other hand, do not use hand-designed high-level features. They learn to compute successive representations of the data that extract higher-level features at each layer. The entire learning procedure is done by an algorithm based on gradient descent. The gradient descent algorithm is the following: we start with a parameter $\theta_0$, and at each step we compute

$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta L(f_\theta)(\theta_t) \qquad (1.4)$$
$$= \theta_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta\, l(y_i, f_\theta(x_i))(\theta_t) \qquad (1.5)$$
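As a concrete illustration, here is a minimal sketch (our own, not code from the thesis; the architecture, data and hyperparameters are made up) of a small two-layer feed-forward network of the form (1.1)–(1.2), trained with the plain gradient-descent update (1.4)–(1.5) on a toy regression problem, in NumPy.

```python
# Minimal sketch: two-layer feed-forward net trained by full-batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y is a noisy function of x.
n, N = 200, 5
X = rng.normal(size=(n, N))
y = np.sin(X @ rng.normal(size=N)) + 0.1 * rng.normal(size=n)

# Two-layer network: f(x) = W2 @ tanh(W1 @ x + b1) + b2
n1 = 32
W1, b1 = 0.1 * rng.normal(size=(n1, N)), np.zeros(n1)
W2, b2 = 0.1 * rng.normal(size=(1, n1)), np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)               # hidden layer, sigma = tanh
    return (W2 @ h + b2)[0], h

eta = 0.05                                  # learning rate eta_t (kept constant here)
for step in range(500):
    # Accumulate the full-batch gradient of the squared loss l(y, f(x)) = (y - f(x))^2.
    gW1, gb1, gW2, gb2 = 0.0, 0.0, 0.0, 0.0
    for xi, yi in zip(X, y):
        pred, h = forward(xi)
        err = pred - yi                     # d l / d pred (up to a factor 2)
        gW2 += err * h[None, :] / n
        gb2 += err / n
        dh = (err * W2[0]) * (1 - h ** 2)   # backpropagation through tanh
        gW1 += np.outer(dh, xi) / n
        gb1 += dh / n
    # Gradient-descent update, equation (1.4).
    W1 -= eta * gW1; b1 -= eta * gb1
    W2 -= eta * gW2; b2 -= eta * gb2

train_loss = np.mean([(forward(xi)[0] - yi) ** 2 for xi, yi in zip(X, y)])
print(f"train loss after training: {train_loss:.4f}")
```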

Figure 2: Overfitting and underfitting. In the first row, the problem is regression. The data (blue dots) come from an underlying polynomial of degree 3 (green line) to which we added noise. We try to learn it with a polynomial function (in red) of degree 1 (left), 3 (center) and 9 (right). The approximation on the left is bad even on the train data, which means that it underfits, while the approximation on the right is perfect on the train data but does not approximate the underlying function at all because it is too complex, which means that it overfits. For the approximation in the center, the model is both good on the train data and close to the underlying function. In the second row, the problem is classification. The data are points in a two-dimensional space from two classes (red crosses and blue circles), and we represent the model's decision boundary: on one side of this line, points are classified as red crosses, and as blue circles on the other side. On the left, the model is linear and underfits. On the right, the model is far too complex and overfits. The model in the center seems to learn the underlying structure of the data.

Here $\eta_t$ is called the learning rate. In Stochastic Gradient Descent (SGD), we do not compute the gradient but an estimator of the gradient. It is easier to compute in practice, and it turns out to give higher performance. The estimator can be defined by computing the gradient only over a small subset of the data called a mini-batch. There are many different gradient descent algorithms (Adam, RMSprop, ...), and they are all founded on the same principle.

1.3 The issue of generalization

As we explained at the beginning, these methods have proved their efficiency in many different tasks. But their properties are still very poorly understood. A major topic is the issue of generalization. On one hand, it is clear when we train neural networks that they can generalize very well. On the other hand, we

know that neural networks can theoretically overfit on all the datasets we use in practice. It is easy to define a 2-layer neural network with $2n + N$ parameters ($n$ is the number of data samples and $N$ their dimension) that can learn to output exactly the labels (or regression values) of the train set, even if the data are totally random and thus contain no structure to learn [Zhang et al., 2016]. We can have the same property with an $L$-layer neural network with $O(n/L)$ weights per layer. It has even been observed that if we take an image classification dataset like ImageNet, take one of the networks known for its great performance on this task, and learn its weights again but with the labels shuffled randomly, the network is able to learn to classify the train set perfectly, even though there is nothing to learn. This means that the models are complex enough to overfit the data [Zhang et al., 2016]. Thus, the fact that they usually do not overfit on real data is all the more mysterious.

Moreover, it has been shown that networks can be misled by what are called adversarial examples: in the case of image classification, we can take an image that is well classified and add a small noise to it. This noise is almost invisible to a human, but the noisy image will not be well classified (see Figure 3) [Szegedy et al., 2013]. This has become a huge topic of interest lately, since deep neural networks are being used in the vision systems of self-driving cars. We need to trust these systems and be sure that they will see new situations the same way as human drivers. But adversarial examples suggest that, at least, the networks do not generalize in the same way as we do.

Figure 3: Adversarial examples. On the left, correctly predicted samples. In the center, the noise added to the original image, magnified by 10 so it can be seen. On the right, the adversarial examples. All images in the right column are predicted to be an ostrich. Source: [Szegedy et al., 2013]

The concept of generalization is hard to define. As we have seen, it is related to the notion of complexity. A model needs to be complex enough to describe the underlying structure of the data, but not too complex, to avoid overfitting.

It is highly related to inductive reasoning. Indeed, inductive reasoning is about trying to find general principles from particular cases. It is opposed to deductive reasoning, since the conclusions of inductive reasoning are not certain. We use inductive reasoning all the time. Not only is it the fundamental tool of all experimental science, it also pervades our everyday life: our model of the world is built by inductive reasoning over all of our observations, and the decisions we make are chosen according to the model we have of a situation and an estimation of the consequences.

Inductive reasoning has been a topic of interest in the history of science and philosophy. The first well-known principle is Occam's razor, named after William of Ockham, who said that "Entities should not be multiplied unnecessarily", which means that among different models, we should choose the one with the fewest assumptions. Isaac Newton stated a few centuries later: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." With these principles, we should always choose the model with the lowest complexity that still fits the observations. But if this is an insightful philosophical remark, how can it be used in practice? What does complexity mean?

1.4 Outline

In this work, we will first present Kolmogorov complexity theory and Solomonoff's theory of induction. Then, we will see the links between Kolmogorov complexity and machine learning. Finally, we will see how these tools can be useful to understand deep learning and to tackle the apparent paradox of the complexity of neural networks. Our contribution consists in the following:

- We introduce different ways to compute compression bounds with deep learning, a first necessary step before trying to do general inductive inference with deep neural networks.
- We compare different ways of getting universal compression bounds with deep learning.
- We show that the bounds obtained with the usual method, variational learning, are outperformed by simple bounds obtained with online learning, which is a strong argument against variational methods.
- We show empirically how we can use a Bayesian method for model selection that avoids well-known criticisms of Bayesian methods, with the switch distribution.

- We introduce the self-switch, an improvement of the switch method in the case of models learned with gradient descent methods.
- Finally, we resolve the apparent paradox of deep learning and compression. We show that even if the best neural networks have more weights than there are data samples, and thus seem to have a large complexity, they do compress the data.

In Section 2, we define Kolmogorov complexity, present some of its main properties and some complexity bounds, and finally present Solomonoff's theory of inductive inference. In Section 3, we study the connections between Kolmogorov complexity, machine learning and statistics. We will see how selecting models with Kolmogorov complexity leads to well-known statistical results, and how to improve Bayesian model selection with the switch distribution. In Section 4, we present different ways to compute compression bounds with deep learning, especially variational learning and online learning, and we will see that the models known to be the best in prediction are also the ones that provide the best compression bounds on the data. Finally, in Section 5, we discuss a few questions raised in the previous sections.

2 Kolmogorov Complexity

2.1 Inductive reasoning

Kolmogorov complexity is a very powerful tool that can provide foundations for a theory of inductive reasoning. Let us start with an example. Imagine you are in front of the sequence $0101010101\ldots01x$, where $x$ is a hidden digit, and you are asked to guess the value of $x$. What would be your guess? The first answer would probably be 0. But why? What is the reasoning path for this choice? A mathematician would say that there is no reason to answer 0 and that both answers are possible. But we are using different priors we have on the world. One of them is of course that if the question is asked, there is probably enough information in the data to predict $x$. Then, we look at the structure of the data. We look for what seems to be the simplest model that could have produced this sequence. This program seems to be:

    while True:
        print("01", end="")

But many models can explain these data, and maybe the true program that created them prints the observed prefix $0101\ldots01$ and then only 1s. Why are we rejecting this hypothesis? Because it seems uselessly complicated given the data. But what if you are given the next hundred digits of the sequence and they are all 1s? Then the first model is rejected because it does not fit the data anymore, and the second model seems very reasonable. This might look like a toy example, but it happens all the time in everyday life that the behavior of something changes, and that we adjust the model we have about it.

We could give as another example the sequence 2, 3, 5, 7, 11, 13, 17, 19, $x$. You will probably observe that this sequence starts exactly like the sequence of prime numbers. How much are you convinced by this hypothesis? The program representing this hypothesis is the following; it seems quite simple and explains the data very well:

    print(n) for n in N if all(n mod k != 0 for 1 < k < n)

What probability would you give to 23? To any other number?

We can ask the same question for random sequences. Suppose we have two binary sequences $x$ and $y$ of one thousand digits. Both sequences seem to be random. In the first one, the frequencies of 0 and 1 are both equal to 0.5. In the second one, the frequency of 1 is 0.8. What probability would you give to each possible next output of $x$ and of $y$? Probably 0.5 for 0 and 1 for $x$, and 0.8 for 1 and 0.2 for 0 for $y$. Why? Because the simplest model that fits the data is that the sequences are coming from two different Bernoulli distributions, and the parameter that best fits the data is the frequency of 1.

Kolmogorov complexity allows us to tackle these questions, and to address much more difficult tasks relating to inductive inference.

2.2 Definitions

This whole section is based on two main sources: [Bensadon, 2016] and [Li and Vitányi, 2008]. In the following, $A = \{0, 1\}$ will be the binary alphabet, $A^n = A \times \dots \times A$ the set of words of length $n$ on $A$, and $A^* = \bigcup_{n \in \mathbb{N}} A^n$ the set of all finite-length words on $A$. Let us define the Kolmogorov complexity. As explained in the introduction, the Kolmogorov complexity of a string $x$ is the length of the shortest program whose output is $x$. Formally:

Definition 1 (Kolmogorov complexity). The Kolmogorov complexity $K_U(x)$ of a string $x \in \{0,1\}^*$ on a universal Turing machine $U$ is the length of the shortest program $p$ on $U$ such that the output of $p$ is $x$. It is measured in bits.

We will sometimes use nats (natural units of information) instead of bits, which are bits in base $e$: $1 \text{ nat} = \frac{1}{\ln(2)} \text{ bits} \approx 1.44 \text{ bits}$.

We will use the following language: a program $p$ whose output is $x$ is called a code, and we say that it encodes $x$. If we have two programs $p_1$ and $p_2$, we use the notation $p = p_1 :: p_2$ for the concatenation of $p_1$ and $p_2$ on the Turing machine. We will write $|x|$ for the length of a string $x$, and similarly $|p|$ for the length of a program.

A problem with the definition we gave is that it depends on the Turing machine it is computed on. But we have the following theorem:

Theorem 1. If $U$ and $U'$ are two universal Turing machines and $x$ is a binary string, then $|K_U(x) - K_{U'}(x)| \leq C_{U,U'}$, where $C_{U,U'}$ is a constant that does not depend on $x$.

Proof. We present a sketch of the proof. If we have two different universal Turing machines $U$ and $U'$, then there is a program $I$ on $U$ that takes as input any program $p$ for $U'$ and outputs the same thing as $p$. We will not prove the existence of such a program, but it can be thought of as an interpreter between two programming languages. Then, if we have a program $p$ on $U'$ that encodes $x$, we can form the program $p' = I :: p$, composed of the interpreter followed by $p$: $p'$ runs $p$ through the interpreter and thus also outputs $x$. So $|p'| = |I| + |p|$, and we have $K_U(x) \leq |I| + K_{U'}(x)$. Since the argument is symmetric in $U$ and $U'$, we have our result.

Intuitively, the constant $C_{U,U'}$ is the length of a program that translates a program for $U'$ into a program for $U$. On real machines, this is typically the role of a compiler or an interpreter. Therefore, if we need values for the complexity of a given program, we can take its length in our favorite programming language. If needed, we will consider programs described in Python code.

Definition 2 (Conditional Kolmogorov complexity). If $x$ and $y$ are two strings, we define the conditional complexity $K(y \mid x)$ of $y$ given $x$ to be the length of the shortest program $p$ such that $p(x) = y$.

Since all the bounds we have on Kolmogorov complexity hold up to a constant, the literature uses the notation $K(x) \leq^{+} f(x)$, which means that $K(x) \leq f(x) + a$ for a constant $a$ that does not depend on $x$. But we will always write $K(x) \leq f(x)$ instead of $K(x) \leq^{+} f(x)$.

A problem with this definition of Kolmogorov complexity is compositionality. In order to compose programs, or even to pass several encoded inputs to a program, we need to be able to separate the code from the different inputs. As an example, if we have a program that computes the sum of two binary numbers and it receives a single concatenated input string, how can we know where the first number ends and the second one begins? So we need to restrict ourselves to codes that make this unambiguous: prefix-free codes.

Definition 3 (Prefix-free code). A set $S$ of strings is prefix-free if no string of $S$ is a prefix of another string in $S$. A code is prefix-free if no other code is a prefix of it.

We are going to state the Kraft inequality, which will be useful in the following. We will not give the proof of this result here.

Theorem 2 (Kraft's inequality). Let $E$ be a finite or countable prefix-free set of codewords over the alphabet $A = \{0, 1\}$. Then we have the following inequality:

$$\sum_{x \in E} 2^{-|x|} \leq 1 \qquad (2.1)$$

Conversely, for a given set of natural numbers $l_1, l_2, \dots, l_n$ satisfying the above inequality, there exists a uniquely decodable code over $A$ with those codeword lengths.

2.3 Trying to compute Kolmogorov complexity

Kolmogorov complexity seems to be a very powerful tool for data compression. It could be used to send large data files, by computing their complexity and then finding the shortest program that encodes them. Unfortunately, it is very hard to obtain any information on the complexity of a string in practice, since it is not computable. It is even impossible to prove that the complexity of a string is larger than a given bound: if we have a program $p$ that outputs $x$, it is impossible to know whether it is the shortest one. We have the following theorem:

Theorem 3 (Chaitin's incompleteness theorem). There exists a constant $L$ such that it is not possible to prove the statement "$K(x) > L$" for any $x$.

Bounds are known on the constant $L$, and it is not so large, around 1 Mb.

Proof. We present a sketch of the proof. Let $L$ be a constant, and suppose we have a program $p$ such that $p(x) = 1$ for some of the $x$ with $K(x) > L$ (not necessarily all of them), and $p(x) = 0$ otherwise. $p$ can be seen as a proof, for a certain set of strings, that their complexity is above $L$. We suppose that this program is not always 0, which means that there is at least one string such that we can prove that its complexity is above $L$. We define the program $q$ that enumerates all strings $x$ and computes $p(x)$ until $p(x) = 1$. We know by hypothesis that $p$ is not always 0, so $q$ terminates and outputs a string $x_0$. Then we have a code for $x_0$, namely $q$, and we deduce the bound $K(x_0) \leq |q| = |p| + C$, where $C$ is a constant representing the small amount of code needed for the enumeration over all strings $x$. If the constant $L$ is large enough, we have $L \geq |p| + C$, and then $K(x_0) \leq L$, which is a contradiction.

Corollary 1. Kolmogorov complexity is not computable.

Proof. It is clear that Kolmogorov complexity is not bounded: if it were, there would be only a finite number of programs, and thus only a finite set of computable strings. So there is an $x_0$ such that $K(x_0) > L$, where $L$ is the constant of Chaitin's theorem. If there were a program that could compute the Kolmogorov complexity of any string, then we could prove that $K(x_0) > L$, a contradiction.

Even if Kolmogorov complexity is not computable, it is semi-computable: if $x$ is a string, we can compute a decreasing sequence $k_0 \geq k_1 \geq \dots$ of integers such that $k_i \to K(x)$.

Definition 4 (Semi-computable function). $f : \mathbb{N} \to \mathbb{N}$ is a semi-computable function if there is $\varphi : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ such that:

- $\varphi$ is computable;
- $\forall x \in \mathbb{N}, \forall k \geq 0$, $\varphi(x, k+1) \leq \varphi(x, k)$;
- $\forall x \in \mathbb{N}$, $\varphi(x, k) \to_k f(x)$.

We have the following theorem:

Theorem 4. $K : \mathbb{N} \to \mathbb{N}$ is semi-computable.

Proof. Let $x$ be a string. We will see in Proposition 2 that we have a bound $K(x) \leq \psi(x)$ where $\psi$ is computable. We then define $\varphi(x, t)$ as follows: we run all programs of length lower than $\psi(x)$ for $t$ steps. Some of these programs might halt within $t$ steps and output $x$. If there are some, $\varphi(x, t)$ outputs the length of the shortest program that halts within $t$ steps and outputs $x$; otherwise, it outputs $\psi(x)$. Clearly, $\varphi$ is computable, non-increasing in $t$, and $\varphi(x, t) \to K(x)$.

2.4 Generic bounds

Even if the last results on the computability of Kolmogorov complexity seem discouraging, we will see how we can compute upper bounds on the Kolmogorov complexity, and that these bounds can be very useful.

Bounds for integers

What is $K(n)$, the complexity of an integer $n$? We have the following proposition:

Proposition 1. If $n$ is an integer, $b$ its binary expansion, and $|b|$ the length of $b$, we have the bound:

$$K(n) \leq \log_2 n + 2 \log_2 \log_2 n \qquad (2.2)$$

We are going to define a sequence of encodings of $n$. Let $b$ be the binary expansion of $n$; $b$ is a code for $n$, but it is not a prefix-free code, so we need to modify it.

Proof. We define $c_0(n) = \underbrace{1\cdots1}_{n}0$, that is, $n$ ones followed by a zero in order to mark the end of the sequence and to be prefix-free. It is the expression of $n$ in base 1. Then $K(n) \leq |c_0(n)| = n + 1$. We can do better by using the binary expansion $b$ of $n$, whose length is about $\log_2(n)$. But in order to have a prefix-free code, we need to specify the length of this encoding. We define $c_1(n) = c_0(|b|) :: b$. To decode $n$, we first decode $|b|$ by counting how many 1s there are at the beginning before the first 0, and then decode $b$. We have $K(n) \leq |c_0(|b|)| + |b| \approx 2 \log_2(n)$. We could then define recursively $c_{k+1}(n) = c_k(|b|) :: b$. The bound keeps improving, but remains a $O(\log_2(n))$. We will only use the inequality given by $c_2$:

$$K(n) \leq \log_2 n + 2 \log_2 \log_2 n \qquad (2.3)$$

Generic bounds for strings

We can use the generic bound for integers to get a generic bound for strings:

Proposition 2. Let $x$ be a binary string. We have the bound:

$$K(x) \leq |x| + \log_2 |x| + 2\log_2 \log_2 |x| \qquad (2.4)$$

Proof. For a generic string $x$, the program print(x) outputs $x$. In order to make it prefix-free, we have to specify the length of the string $x$ if we want the program to know where to stop. So we have:

$$K(x) \leq |x| + K(|x|) \qquad (2.5)$$

We use Proposition 1 and get our result.

Bounds for elements in an enumerable set

If we want to describe an element $x \in E$, where $E$ is a recursively enumerable set, then we can use a two-part code. First, we define a program $p$ such that $p : \mathbb{N} \to E$ is a bijection. Then, we encode the index $n = p^{-1}(x)$ of $x$ in $E$. We have the bound:

$$K(x) \leq K(p) + K(n) \qquad (2.6)$$
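The codes $c_0$ and $c_1$ above are easy to implement. Here is a minimal sketch (our own illustration, not code from the thesis) that can be used, for instance, to encode the index $n$ in the bound (2.6); the length of $c_1(n)$ is roughly $2\log_2 n$, as stated above.

```python
# Illustrative implementation of the prefix-free integer codes c0 and c1.

def c0(n: int) -> str:
    """Unary code: n ones followed by a zero. Length n + 1, prefix-free."""
    return "1" * n + "0"

def c1(n: int) -> str:
    """Length-prefixed binary code: c0(|b|) followed by b, where b = bin(n)."""
    b = bin(n)[2:]
    return c0(len(b)) + b

def decode_c1(code: str) -> int:
    """Inverse of c1: the number of leading 1s gives |b|, then read b."""
    length = code.index("0")
    return int(code[length + 1 : length + 1 + length], 2)

for n in (5, 1000, 10**6):
    code = c1(n)
    print(n, "->", code, f"({len(code)} bits, about 2*log2(n) = {2 * n.bit_length()})")
    assert decode_c1(code) == n
```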

Example 2. Suppose you need to encode the $n$-th prime number. You can do it this way: define a program able to enumerate all prime numbers (which is short), and then a code for the index $n$. This can be much shorter than the binary expansion of the prime itself.

2.5 Compression bound and probabilistic models

Definition 5 (Complexity of a probability distribution). Suppose we have a set $E$ and a probability distribution $\mu$ over $E$. We define $K(\mu)$ to be the length of the shortest program $p_\mu$ such that $p_\mu(x) = \mu(x)$.

The following theorem is the fundamental result of this work, and most of what follows is a consequence of it.

Theorem 5 (Probabilistic two-part codes). Suppose we have a set $E$ and a probability distribution $\mu$ over $E$. Then, for any $x \in E$, we have the following bound:

$$K(x) \leq K(\mu) - \log_2 \mu(x) \qquad (2.7)$$

This result will be proved later. In the following, we will say that $\mu$ is a model, and that we encode $x$ with this model.

2.5.1 Generic encoding bound over a finite set

We can use this theorem to improve the bound given by equation (2.6) in the special case of finite sets.

Proposition 3. Let $E$ be a finite set and $x \in E$. We define $K(E)$ to be the length of the shortest program that enumerates all the elements of $E$. We have the following bound:

$$K(x) \leq K(E) + \log_2 |E| \qquad (2.8)$$

where $|E|$ is the number of elements of $E$.

Proof. We define the uniform distribution $\mu$ over $E$: $\mu(x) = \frac{1}{|E|}$. Then, by Theorem 5, we have:

$$K(x) \leq K(\mu) - \log_2 \mu(x) \qquad (2.9)$$
$$= K(E) + \log_2 |E| \qquad (2.10)$$
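A small numeric illustration of Theorem 5 versus Proposition 3 (our own sketch, with made-up numbers): for a finite set $E$, the uniform model costs $\log_2 |E|$ bits per element, while a model $\mu$ that gives high probability to the observed element costs only $-\log_2 \mu(x)$ bits, plus the cost $K(\mu)$ of describing $\mu$ itself.

```python
import math

# E: all binary strings of length 20, so log2|E| = 20 (uniform cost, Prop. 3).
log2_E = 20

# A string with few 1s, encoded instead with an i.i.d. Bernoulli(0.05) model mu.
x = "00000100000000010000"
p1 = 0.05
model_cost = -sum(math.log2(p1 if c == "1" else 1 - p1) for c in x)

print(f"uniform code over E:   {log2_E:.1f} bits")
print(f"Bernoulli(0.05) model: {model_cost:.1f} bits (+ K(mu) bits for the model)")
```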

2.5.2 Information theory

Theorem 5 comes from information theory. The problem was formalized, and mostly solved, by Claude Shannon [Shannon, 1948]. The problem was to communicate successive elements from a set called an alphabet, for the telegraph. The alphabet was the usual Latin alphabet, but it could only be transmitted with binary digits. We suppose that we have a set $E$, a probability distribution $\mu$ on $E$, and a random variable $X$ with probability distribution $\mu$ over $E$. We define the entropy of the distribution:

Definition 6 (Entropy). The entropy of the probability distribution $\mu$ is:

$$H(\mu) = \mathbb{E}[-\log_2 \mu(X)] = -\sum_{x \in E} \mu(x) \log_2(\mu(x)) \qquad (2.11)$$

If we have a second probability distribution $\nu$ over $E$, we can define the Kullback-Leibler divergence between $\mu$ and $\nu$:

Definition 7 (Kullback-Leibler divergence). The Kullback-Leibler divergence between two probability distributions over a set $E$ is defined by:

$$D_{KL}(\mu \| \nu) = \mathbb{E}_\mu\left[\log_2\left(\frac{\mu(X)}{\nu(X)}\right)\right] \qquad (2.12)$$
$$= \sum_x \mu(x) \log_2 \frac{\mu(x)}{\nu(x)} \qquad (2.13)$$

The Kullback-Leibler divergence is not a distance because it is not symmetric. But we still have the following property: the Kullback-Leibler divergence is non-negative, $D_{KL}(\mu \| \nu) \geq 0$.

2.5.3 Symbol codes

Assume we have an alphabet $A$; here, $A = \{0, 1\}$. We use the notation $A^n = A \times \dots \times A$ and $A^* = \bigcup_n A^n$. We want to define a prefix-free encoding of the elements of $E$ over $A$. The elements of $E$ are called the symbols. This means that we want to find $S : E \to A^*$ such that:

1. $S$ is an injection: any codeword must define at most one element of $E$;
2. $S$ is a prefix code: for all $x$ and $y$ in $E$, $S(x)$ cannot be a prefix of $S(y)$.

We want to minimize the expected code length $\mathbb{E}[|S(X)|] = \sum_x \mu(x) |S(x)|$. One of the most important results of Shannon's work is the following theorem:

Theorem 6 (Shannon's source coding theorem). $\mathbb{E}[|S(X)|] \geq H(\mu)$

Proof. Let $S$ be a prefix code. We define the probability distribution $q$ given by

$$q(x) = \frac{2^{-|S(x)|}}{C} \qquad (2.14)$$

with $C$ such that $\sum_x q(x) = 1$. Then we have:

$$\mathbb{E}[|S(X)|] = \sum_x \mu(x) |S(x)| \qquad (2.15)$$
$$= \sum_x \mu(x)\big(-\log_2(C\, q(x))\big) \qquad (2.16)$$
$$= -\log_2 C + D_{KL}(\mu \| q) + H(\mu) \qquad (2.17)$$

Since the code is prefix-free, we know by Kraft's inequality that $\sum_x 2^{-|S(x)|} \leq 1$, so $C \leq 1$ and $-\log_2 C \geq 0$. Moreover, we saw that $D_{KL}(\mu \| q) \geq 0$. So we can conclude.

Can we achieve this bound? We have the following theorem:

Theorem 7 (Source coding theorem for symbol codes). [Mackay, 2003] There exists a prefix code $S$ such that:

$$H(\mu) \leq \mathbb{E}_\mu[|S(X)|] \leq H(\mu) + 1 \qquad (2.18)$$

Proof. We set the code lengths $l(x) = \lceil -\log_2 \mu(x) \rceil$. $l$ satisfies the Kraft inequality. Indeed:

$$\sum_x 2^{-l(x)} = \sum_x 2^{-\lceil -\log_2 \mu(x) \rceil} \qquad (2.19)$$
$$\leq \sum_x 2^{\log_2 \mu(x)} \qquad (2.20)$$
$$= 1 \qquad (2.21)$$

Then, by Theorem 2, we know that there is a prefix code over $E$ with code length $l(x)$ for each $x$ in $E$. Finally, we have:

$$\mathbb{E}_\mu[|S(X)|] = \sum_x \mu(x) l(x) \qquad (2.22)$$
$$= \sum_x \mu(x) \lceil -\log_2 \mu(x) \rceil \qquad (2.23)$$
$$\leq \sum_x \mu(x)(-\log_2 \mu(x) + 1) \qquad (2.24)$$
$$= H(\mu) + 1 \qquad (2.25)$$
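A small numeric check of these definitions and of the proof of Theorem 7 (our own sketch, not from the thesis): we compute $H(\mu)$, $D_{KL}(\mu\|\nu)$ and the code lengths $l(x) = \lceil -\log_2 \mu(x)\rceil$, and verify Kraft's inequality and the bound $H(\mu) \leq \mathbb{E}_\mu[|S(X)|] \leq H(\mu) + 1$.

```python
import math

# A toy distribution mu over four symbols, and a second model nu.
mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
nu = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

def entropy(p):
    """H(p) in bits, Definition 6."""
    return -sum(px * math.log2(px) for px in p.values())

def kl(p, q):
    """D_KL(p || q) in bits, Definition 7."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

# Code lengths from the proof of Theorem 7: l(x) = ceil(-log2 mu(x)).
lengths = {x: math.ceil(-math.log2(px)) for x, px in mu.items()}

kraft_sum = sum(2 ** -l for l in lengths.values())        # must be <= 1 (Theorem 2)
expected_len = sum(mu[x] * lengths[x] for x in mu)        # expected codeword length

print(f"H(mu) = {entropy(mu):.3f} bits, D_KL(mu||nu) = {kl(mu, nu):.3f} bits")
print(f"Kraft sum = {kraft_sum:.3f} (<= 1)")
print(f"H(mu) = {entropy(mu):.3f} <= E|S(X)| = {expected_len:.3f} <= H(mu) + 1")
```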

We are now going to define a simple algorithm that achieves this bound. We will not prove this result.

Definition 8 (Huffman coding). Suppose that we have a finite set $E$ with a probability distribution $\mu$. Huffman coding is the following algorithm:

1. Take the two least probable elements $x, y$ out of $E$. These two elements will be given the longest codes. They will have the same code length, and their codes will only differ by the last digit.
2. Combine these two elements into a new element $z = (x, y)$ and assign to $z$ the probability $\mu(x) + \mu(y)$. Then repeat until the set $E$ has a unique element. The code of $x$ will be the code of $z$ followed by a 0; the code of $y$ will be the code of $z$ followed by a 1.

The optimal bound $\mathbb{E}_\mu[|S(X)|] = H(\mu)$ can be achieved by arithmetic coding if we have many data samples to send, but we will not detail this method here.

2.5.4 Symbol codes and compression bounds

We can now prove Theorem 5.

Proof of Theorem 5. We know that if $S$ is a prefix-free encoding on a set $E$, we have the bound:

$$K(x) \leq K(S) + |S(x)| \qquad (2.26)$$

If we have an encoding such that $|S(x)| = \lceil -\log_2 \mu(x) \rceil$, as explained in the proof of Theorem 7, this gives the following bound:

$$K(x) \leq K(S) - \log_2 \mu(x) + 1 \qquad (2.27)$$

Since the encoding $S$ is completely determined by $\mu$, for example with the Huffman coding algorithm, we have $K(S) \leq K(\mu)$. Since the Kolmogorov complexity is defined up to a constant, we can remove the $+1$ in the bound, and we finally get:

$$K(x) \leq K(\mu) - \log_2 \mu(x) \qquad (2.28)$$

Suppose now that $X_1, \dots, X_n$ are iid random variables over $E$ with law $\mu$, and that $\mu$ is computable ($K(\mu) < \infty$). We want to encode $(x_1, \dots, x_n)$, an outcome of $(X_1, \dots, X_n)$. Then we have the following bound:

$$K(x_1, \dots, x_n) \leq K(\mu) - \sum_i \log_2 \mu(x_i) \qquad (2.29)$$
$$= K(\mu) + nH(\mu) + o(n) \qquad (2.30)$$

Equation (2.30) is given by the law of large numbers. Now, if we do not know $\mu$ exactly and try to approximate it with a distribution $\nu$, the encoding cost of the two-part code will be:

$$K(x_1, \dots, x_n) \leq K(\nu) - \sum_i \log_2 \nu(x_i) \qquad (2.31)$$
$$= K(\nu) + nH(\mu) + nD_{KL}(\mu \| \nu) + o(n) \qquad (2.32)$$

What we observe is that the difference between the two encodings is $K(\mu) - K(\nu) - nD_{KL}(\mu \| \nu)$. The Kullback-Leibler divergence is exactly the average number of bits per sample that the error between $\mu$ and $\nu$ costs when encoding data coming from the distribution $\mu$ with the distribution $\nu$. Moreover, since the Kullback-Leibler divergence is non-negative, we see that asymptotically, the best model for compression will always be the true model. But for a finite number of data samples this is not necessarily the case, since $K(\nu)$ can be lower than $K(\mu)$.

2.6 Incompressibility

At this stage, we can ask ourselves two questions about compressibility. Can every string be compressed? And if a string can be compressed, does it mean that it contains some structure, or is it only compressible by chance? The following incompressibility theorem will help to answer them:

Theorem 8 (Incompressibility theorem). Let $c$ be a positive integer and $E$ a finite set with $m$ elements. Then $E$ has at least $m(1 - 2^{-c}) + 1$ elements $x$ such that $K(x) \geq \log_2 m - c$.

We recall that with a two-part code we always have:

$$K(x) \leq K(E) + \log_2 m \qquad (2.33)$$

Thus, for at least $m(1 - 2^{-c}) + 1$ elements of $E$:

$$|K(x) - \log_2 m| \leq \max(K(E), c) \qquad (2.34)$$

Proof. The number of programs of length smaller than $\log_2 m - c$ is

$$\sum_{i=0}^{\lfloor \log_2 m - c \rfloor - 1} 2^i = 2^{\lfloor \log_2 m - c \rfloor} - 1 \leq m 2^{-c} - 1 \qquad (2.35)$$

Hence, there are at least $m - m2^{-c} + 1$ elements in $E$ that have no program of length smaller than $\log_2 m - c$, which means that $K(x) \geq \log_2 m - c$ for these elements.

Example 3. If $E$ is the set of binary strings of length $n$, so that $|E| = 2^n$, and we want to know how many strings can be compressed by half, the theorem implies that at least $2^n(1 - 2^{-n/2}) + 1$ elements cannot be compressed by half. If $n = 1{,}000$, it means that at most $2^{n/2} - 1$ elements can be compressed by half, which is a proportion of at most $2^{-500}$ of the set.
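A direct numeric illustration of this counting argument (our own sketch): for binary strings of length $n = 1000$ and $c = n/2$, we count how many programs are short enough to compress a string by half.

```python
# Counting argument of Theorem 8 for strings of length n compressed by c bits.
n, c = 1000, 500
m = 2 ** n                            # |E|: number of binary strings of length n
short_programs = 2 ** (n - c) - 1     # programs of length < log2(m) - c = n - c

# Each short program outputs at most one string, so at most this fraction of E
# can possibly satisfy K(x) < n - c:
print(f"fraction possibly compressible by {c} bits: {short_programs / m:.3e}")
# All remaining strings (all but about 3e-151 of them) satisfy K(x) >= n - c.
```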

With this result, we can answer the two questions we raised at the beginning of this section. First, most strings cannot be compressed. Second, this means that if we try to compress a sequence and succeed, it probably means that our compression algorithm found some intrinsic structure in the data.

2.7 Solomonoff's theory of inductive inference

In 1964, Ray Solomonoff introduced his theory of inductive inference [Solomonoff, 1964]. Let us go back to the problem of sequence prediction. We suppose we have a sequence $x_1, \dots, x_n$, and we want to predict $x_{n+1}$. What would be the strategy? We could try to find the best probability distribution $\mu$ that encodes the data with the bound:

$$K(x_1, \dots, x_n) \leq K(\mu) - \log_2 \mu(x_1, \dots, x_n) \qquad (2.36)$$

And then we could use the probability distribution $\mu$ to make the prediction with the posterior distribution:

$$p(x_{n+1}) = \mu(x_{n+1} \mid x_1, \dots, x_n) \qquad (2.37)$$
$$= \frac{\mu(x_1, \dots, x_n, x_{n+1})}{\mu(x_1, \dots, x_n)} \qquad (2.38)$$

If $\mu$ compressed the first $n$ data points well, we can hope that it will predict the $(n+1)$-th well. What we would really like to do is not to select the single best model, because this can lead to mistakes, but to average over all possible models. How can we do this? We define the universal probability distributions:

Definition 9 (Universal probability distributions). We define four probability distributions over $A^*$, the set of all finite strings on the alphabet $A = \{0, 1\}$:

$$P_1(x) \propto 2^{-K(x)} \qquad (2.39)$$
$$P_2(x) \propto \sum_{p \text{ deterministic program that outputs } x} 2^{-|p|} \qquad (2.40)$$
$$P_3(x) \propto \sum_{p \text{ random program}} 2^{-|p|}\, \mathbb{P}(p \text{ outputs } x) \qquad (2.41)$$
$$P_4(x) \propto \sum_{\mu \text{ probability distribution}} 2^{-K(\mu)} \mu(x) \qquad (2.42)$$

Thanks to the Kraft inequality, these probability distributions are well defined. These four probability distributions are different ways to define universal probability distributions over the set of finite strings. We will see that they are equivalent.

Definition 10. If $\mu$ and $\nu$ are two probability distributions on a set $E$, we say that they are equivalent if there are positive constants $C$ and $C'$ such that for all $x \in E$:

$$C\mu(x) \leq \nu(x) \leq C'\mu(x) \qquad (2.43)$$

We have the following theorem, proved in [Zvonkin and Levin, 1970]:

Theorem 9. $P_1$, $P_2$, $P_3$ and $P_4$ are equivalent.

Then, if we go back to our sequence prediction problem, we can now use these universal probabilities. We define:

$$p(x_{n+1}) = P(x_{n+1} \mid x_1, \dots, x_n) \qquad (2.44)$$
$$= \frac{P(x_1, \dots, x_{n+1})}{P(x_1, \dots, x_n)} \qquad (2.45)$$

where $P$ is one of the $P_i$. We can remark that the first method we proposed, taking only the best probability distribution $\mu$ for the compression of $x_1, \dots, x_n$, amounts to approximating $P_4$ by the largest term of the sum.

With Solomonoff's theory of inductive reasoning, we now have a framework for general induction. Moreover, it is the only general theory of induction, and it has the advantage of agreeing with the philosophical theory of induction and with our intuition. In the next section, we will study the links between Kolmogorov complexity theory and machine learning.
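To make equations (2.44)–(2.45) concrete before moving on, here is a toy sketch (ours, not from the thesis) of a $P_4$-style predictor restricted to a tiny family of Bernoulli models, with uniform weights standing in for the uncomputable weights $2^{-K(\mu)}$ of (2.42).

```python
# Toy stand-in for P4: a mixture over a small family of Bernoulli models.
thetas = [i / 10 for i in range(1, 10)]
weights = [1 / len(thetas)] * len(thetas)

def seq_prob(bits, theta):
    """mu_theta(x_1, ..., x_n) for an i.i.d. Bernoulli(theta) model."""
    ones = sum(bits)
    return theta ** ones * (1 - theta) ** (len(bits) - ones)

def mixture_prob(bits):
    """P(x_1, ..., x_n): weighted sum over the model family."""
    return sum(w * seq_prob(bits, t) for w, t in zip(weights, thetas))

# A sequence with about 80% ones; predictive probability of a 1, as in (2.45).
x = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
p_next_one = mixture_prob(x + [1]) / mixture_prob(x)
print(f"P(x_(n+1) = 1 | x_1, ..., x_n) = {p_next_one:.3f}")
```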

3 Kolmogorov complexity and Machine Learning

3.1 Machine learning and its formalisation

Supervised and unsupervised learning

The problem of machine learning is the problem of induction. The goal is indeed to infer, from a finite set of particular examples, a model that will generalize to new observations. Here, we will discuss two of the most important problems in machine learning: supervised and unsupervised learning.

In supervised learning, the data is $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$. The $x_i$ can be vectors of a linear space, elements of discrete sets, etc. In the following, we will always suppose that the data come from a linear space of finite dimension $N$. We define $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$. We want a model that predicts $y$ given $x$ on a validation set of data that has never been seen. The problem is called a classification problem if the $y_1, \dots, y_n$ are integers, called labels. It is called a regression problem if the $y_1, \dots, y_n$ are real numbers, or vectors in a linear space for a more general regression problem.

Unsupervised learning is not as well defined as supervised learning. It refers to learning methods for unlabeled data. In practice, unsupervised learning includes different tasks: clustering, anomaly detection, dimensionality reduction, generative models, etc. Here, we will only consider the following theoretical formulation of unsupervised learning: the data are $X = \{x_1, \dots, x_n\}$; we suppose that they come from the same source and share structure; we want to find a model (i.e. a probability distribution) $p$ such that $p(X)$ is large, and such that if $X^+$ is a test set of data coming from the same source, then $p(X^+)$ is also large. The metric used to test our model is the probability it gives to unseen data. It can be interpreted as a generative model.

Remark 2. With this restrictive definition, unsupervised learning is a special case of supervised learning where the regression values are the elements of $X$, and the data samples usually called $X$ contain no information (they can all be zero: $\forall i,\, x_i = 0$).

The true model hypothesis

In this section, we will only consider unsupervised learning. The hypothesis often made in machine learning is that the data $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^N$ are samples coming from a true distribution $\mu(x)$, and are independent. Many results even use stronger hypotheses, such as $\mu$ being continuous with respect to the Lebesgue measure, so that we can use a probability density function, or its support being a submanifold of $\mathbb{R}^N$. In this section, we will discuss the true model hypothesis.

Kullback-Leibler divergence and empirical measures

Suppose that the data come from a true distribution $\mu$. The goal is to find a model $p$ that approximates $\mu$. The most commonly used metric is the Kullback-Leibler divergence. But this starts with a technical issue:

Proposition 4. Let $\nu, \mu_\infty, \mu_1, \dots, \mu_t, \dots$ be probability distributions over a linear space $E$ such that $\mu_t \to \mu_\infty$ weakly. Then we do not necessarily have $D_{KL}(\nu \| \mu_t) \to_t D_{KL}(\nu \| \mu_\infty)$ or $D_{KL}(\mu_t \| \nu) \to_t D_{KL}(\mu_\infty \| \nu)$.

Proof. In order to prove this result, we only need to provide a counterexample. Suppose that $\mu_\theta(x, y)$ is the uniform distribution on the segment $\{\theta\} \times [0, 1]$ in $\mathbb{R}^2$, and $\nu = \mu_0$. Then $\mu_\theta \to \nu$ weakly as $\theta \to 0$, but $D_{KL}(\mu_\theta \| \nu) = D_{KL}(\nu \| \mu_\theta) = +\infty$ if $\theta \neq 0$, and $D_{KL}(\mu_\theta \| \nu) = D_{KL}(\nu \| \mu_\theta) = 0$ if $\theta = 0$.

Corollary 2. Let $\hat\mu_n(x) = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}(x)$ be the empirical probability distribution, and let $p$ be a model (probability distribution). We do not necessarily have $D_{KL}(\hat\mu_n \| p) \to_n D_{KL}(\mu \| p)$.

This last result is important. It means that it is impossible to evaluate the divergence between two distributions from a finite number of samples without stronger hypotheses, even if this number is arbitrarily large.

Density estimation and the curse of dimensionality

In order to avoid the negative result of Corollary 2, we assume that $\mu$ is continuous. If we do not have stronger hypotheses on $\mu$, we have to use nonparametric statistics, and the problem becomes a problem of density estimation. But real data live in high-dimensional spaces, and the issue becomes the curse of dimensionality. The curse of dimensionality refers to phenomena that only occur in high dimension. In this situation, the issue is that the number of data samples needed to estimate the probability density function grows exponentially with the dimension. But in practice, the number of samples is often of the same order of magnitude as the dimension, or even lower. Therefore, it is in practice impossible to get information about $\mu$ unless we have even stronger hypotheses. Thus, if we do not know for sure that $\mu$ belongs to a family of probability distributions with a few parameters that can be estimated, the hypothesis that the data come from a probability distribution is almost meaningless: the entire information about $\mu$ is contained in its density, and it is impossible to learn anything about it.

The independence hypothesis

Moreover, the independence hypothesis needs to be discussed too. In real machine learning datasets, the data are not independent at all. They have been carefully chosen to be as varied as possible, and to cover

most of the situations that will be encountered in the future. The same thing happens in real-life inductive learning: if someone shows you a few different examples of something you have never seen (an object, a flower, an animal, ...) and then asks you to recognize new examples of this thing, you will use the fact that the examples presented were probably very characteristic and represent all the variety of this new object.

Compressing the past instead of predicting the future

What we do, instead of predicting the future, is compress the past. Indeed, if we are able to compress the past, it means that we understand it well, that we understand its structure. And with this understanding, we can hope to also be good at prediction. This formalisation of the problem of machine learning is very well explained by Marcus Hutter in [Hutter, 2007]. Does it mean that we are rejecting all the machine learning methods that have been developed? Not at all. What we will show in the following is that many methods based on Kolmogorov complexity are in fact well-known machine learning or statistical methods. By changing the principles underlying the methods we are using, we do not change the methods so much, but we give a new interpretation of them, and this can be useful to develop new techniques in the future.

3.2 Minimum Description Length and two-part codes

What does it mean to compress the data in order to make predictions? Suppose we are in an unsupervised learning situation, and the data are $X = \{x_1, \dots, x_N\}$. The goal is to build a model $p$ such that $p$ gives high probability to the training data but also to unseen data. If this goal is achieved, we say that the model has generalized. Our approach is to try to compress $X$. We are going to use a two-part code, as in Section 2.5: we choose a probability distribution $p$, we encode this distribution in a program, and then we encode the data according to this model. The complexity bound given by this method is:

$$K(X) \leq K(p) - \log_2 p(X) \qquad (3.1)$$

In supervised learning, the data to be compressed is $Y = \{y_1, \dots, y_N\}$, where the $y_i$ come from the dataset $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$. What we want to compress is the dataset $Y$, knowing the dataset $X = \{x_1, \dots, x_N\}$. In a classification setting, the real task we want to achieve is to have a model that will be able to classify new data samples. We express this in terms of compression in the following way: the task is to transmit the labels of the data to someone who already has the data. If we succeed in doing this, it means that our model has compressed the labels well by using their mutual information with the data samples.

Thus, we can expect that it will generalize to new data. We could do the same thing for a regression problem. Similarly to unsupervised learning, we choose a probability distribution $p(y \mid X)$, and we have the following bound:

$$K(Y \mid X) \leq K(p) - \log_2 p(Y \mid X) \qquad (3.2)$$

Here, the second term will be the most familiar one for people used to machine learning: it describes how well the model fits the data. The first term describes how complicated the model is. In order to compress the data, we will minimize this bound by finding a good trade-off between these two terms. We will often restrict the set of possible models to a subset $P$ of the set of probability distributions, and try to find $p^* = \arg\min_{p \in P}\, [K(p(\cdot \mid X)) - \log_2 p(Y \mid X)]$ in order to have the best compression bound for the data. If we choose $P$ to be a set of parametric distributions, $P = \{p_\theta, \theta \in \Theta\}$, then we have the bound:

$$K(Y \mid X) \leq K(p_\theta \mid \theta) + K(\theta) - \log_2 p_\theta(Y \mid X) \qquad (3.3)$$

Trying to find the best bound here means finding the best parameter $\theta$ for this bound. If we only look at the last part of equation (3.3), we recognize the log-likelihood of the data for the parametric family $P$. In this case, we can think of the minimization of the Kolmogorov complexity bound as a maximization of the log-likelihood, with a regularizer given by the complexity of the model.

Example 4 (Two-class classification). Suppose that we have some data $x_1, \dots, x_n$ and their labels $y_1, \dots, y_n \in \{0, 1\}$. We want to build a model to predict the labels, which means that we want to find a probability distribution $p(y \mid x)$. A possible solution would be $p_0$ defined by $p_0(y \mid x_i) = \delta_{y_i}(y)$ for all $i$, and $p_0(y \mid x) = \delta_0(y)$ elsewhere. Then the second term of the bound is:

$$-\log_2 p_0(Y \mid X) = -\sum_i \log_2 p_0(y_i \mid x_i) = 0$$

But to encode this distribution, we essentially have to encode all the $x_i$ and all the $y_i$. Consequently, it will not give a good compression bound. This is an example of overfitting. We see that thanks to Kolmogorov complexity, we know that this model does not compress the data. Thus, we have no reason to think that it will generalize, and we will not try to make predictions with this model.

It might seem that there is a problem with this formalisation for continuous data: in order to encode a real number with infinite precision, we would need an infinite amount of bits. How can we address this question? Since we are working on Turing machines, we are only dealing with integers, and these integers can only represent

floats with a finite precision. First, we need to choose a required precision, which will be called $\varepsilon_0$. This means that we discretize the space $\mathbb{R}^N$ with a step size $\varepsilon_0$. The problem is now to encode $\hat x \in \varepsilon_0 \mathbb{Z}^N$, with $\|x - \hat x\|_\infty \leq \varepsilon_0$. We have the following proposition:

Proposition 5. Suppose that $x \in \mathbb{R}^N$, and we want to encode $x$ with finite precision $\varepsilon_0$, which means that we want to encode $\hat x$ such that $\|x - \hat x\|_\infty < \varepsilon_0$. We want to encode $x$ with the probability distribution $p$, which we assume to be continuous. Then we have the encoding bound:

$$K(\hat x \mid \varepsilon_0) \leq K(p) - \log_2 p(x) - N \log_2 \varepsilon_0 \qquad (3.4)$$

The additional cost given by the required precision, $-N \log_2 \varepsilon_0$, does not depend on the probability distribution $p$.

Proof. The cost of encoding $\hat x$ with a probability distribution becomes:

$$K(\hat x) \leq K(p) - \log_2 p\left(\left[\hat x_1 - \tfrac{\varepsilon_0}{2}, \hat x_1 + \tfrac{\varepsilon_0}{2}\right) \times \dots \times \left[\hat x_N - \tfrac{\varepsilon_0}{2}, \hat x_N + \tfrac{\varepsilon_0}{2}\right)\right) \qquad (3.5)$$

We can suppose that $\varepsilon_0$ is small with respect to the variations of $p$, otherwise this choice of $p$ would be useless for this given precision. Then we have:

$$p\left(\left[\hat x_1 - \tfrac{\varepsilon_0}{2}, \hat x_1 + \tfrac{\varepsilon_0}{2}\right) \times \dots \times \left[\hat x_N - \tfrac{\varepsilon_0}{2}, \hat x_N + \tfrac{\varepsilon_0}{2}\right)\right) \approx \varepsilon_0^N\, p(x) \qquad (3.6)$$

We get the compression bound:

$$K(\hat x) \leq K(p) - \log_2 p(x) - N \log_2 \varepsilon_0 \qquad (3.7)$$

The first two terms are well known. The third term of the bound describes the cost of the precision. What we see is that it does not depend on $x$ or on the model $p$. So when working with continuous data, we can forget about the problem of precision when choosing the model, and work with continuous probability distributions.

Example 5 (Linear regression). Suppose that we have data $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^N$, with regression values $Y = \{y_1, \dots, y_n\}$, and we want to compress $Y$ by using a linear regression. What is the model? The model we choose is that, for all $i$, $y = \langle a, x \rangle + b + z$, with $z \sim \mathcal{N}(0, 1)$. So the probability distribution $p_{a,b}(y \mid x)$ defined by this model is:

$$p_{a,b}(y \mid x) = \mathcal{N}(y - \langle a, x \rangle - b;\, 0, 1) \qquad (3.8)$$

Then, if we want to encode $Y$, we only have to encode $a$, $b$, and the errors $z_i = y_i - \langle a, x_i \rangle - b$.

Then, the bound given by this encoding would be:

$$K(Y \mid X) \leq K(p_{a,b} \mid a, b) + K(a, b) - \sum_i \log_2 \exp\left(-\frac{z_i^2}{2}\right) \qquad (3.9)$$

where $K(p_{a,b} \mid a, b)$ is the cost of the program that computes the probability distribution $p_{a,b}$ given $a$ and $b$, and $K(a, b)$ is the cost of encoding $a$ and $b$. From this formula, we get:

$$K(Y \mid X) \leq K(p_{a,b} \mid a, b) + K(a, b) + \frac{1}{2\ln(2)} \sum_i z_i^2 \qquad (3.10)$$
$$= K(p_{a,b} \mid a, b) + K(a, b) + \frac{1}{2\ln(2)} \sum_i (y_i - \langle a, x_i \rangle - b)^2 \qquad (3.11)$$
$$= K(p_{a,b} \mid a, b) + K(a, b) + \frac{1}{2\ln(2)} \|y - a^T X - b\mathbf{1}\|_2^2 \qquad (3.12)$$

What we observe is that the last term corresponds exactly to the $\ell_2$ norm we usually minimize when we fit a linear regression.

Remark 3. The compression bounds we are going to compute might not be useful in real situations where the amount of information sent is critical. These situations usually arise when the data is sent to a device with a bad connection, like a smartphone; but the smartphone also has limited computational power, and some of the bounds we are going to present imply re-training entire models, which is extremely expensive. What we are trying to do is not to design a useful compression algorithm, but to know the minimal quantity of information required to define the data, and then use it for prediction.

3.3 Optimal precision

Suppose we are trying to encode the data $X$ by using a model from the parametric family $P = \{p_\theta, \theta \in \mathbb{R}\}$. Here, we take $\theta \in [0, 1]$ for simplicity, but all the results work the same way in higher dimension. As we saw before, we have the bound:

$$K(X) \leq K(p_\theta \mid \theta) + K(\theta) - \log_2 p_\theta(X) \qquad (3.13)$$

where the first term is the complexity of computing $p_\theta(X)$ when given $\theta$ and $X$, $K(\theta)$ is the cost of encoding the parameter $\theta$, and the last term is the cost of encoding the data with this model. The goal is to compress $X$ by choosing the best model. Since we restrict ourselves to the parametric family $P$, the first term is a constant.
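Before deriving the optimal precision, here is a small numeric sketch (our own, with made-up numbers) of the trade-off in (3.13) for a Bernoulli model: encoding $\theta$ more coarsely costs fewer bits for the parameter but more bits for the data, and the total code length is minimized at an intermediate precision. Proposition 6 below identifies the optimal precision analytically.

```python
import math

# Toy data: n Bernoulli samples, 327 of which are 1s.
n, n_ones = 1000, 327
theta_star = n_ones / n                      # maximum-likelihood parameter

def data_bits(theta):
    """-log2 p_theta(X): cost of encoding the data with parameter theta."""
    return -(n_ones * math.log2(theta) + (n - n_ones) * math.log2(1 - theta))

# Two-part code: encode theta rounded to precision eps (about -log2(eps) bits,
# since theta lies in [0, 1]), then encode the data with the rounded theta.
for eps in (0.5, 0.1, 0.01, 0.001, 0.0001):
    theta_hat = round(theta_star / eps) * eps
    param_bits = -math.log2(eps)
    total = param_bits + data_bits(theta_hat)
    print(f"eps = {eps:6}: parameter {param_bits:5.1f} bits, "
          f"data {data_bits(theta_hat):7.1f} bits, total {total:7.1f} bits")
```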

We have already discussed that if we only minimize the last term, this is exactly the problem of maximum likelihood. We define $\theta^* = \arg\min_\theta\, [-\log_2 p_\theta(X)]$, the most likely parameter. We are now going to discuss the second term: what is the cost of encoding the parameter? Here, we suppose that $\theta$ is a real parameter, $\theta \in [0, 1]$. Since $\theta$ is a real number, we cannot encode it with a finite number of bits, so we have to approximate it. We can choose a precision $\varepsilon$ and encode a $\theta$ such that $|\theta - \theta^*| < \varepsilon$. The cost of encoding this approximation depends on the precision we ask for. The more precision we ask for on $\theta$, the closer we can be to $\theta^*$, and the smaller the term $-\log_2 p_\theta(X)$ will be. In return, the more precision we ask for, the more expensive the encoding of $\theta$ is. What is the trade-off? We have the following proposition:

Proposition 6. If $\theta$ is a real parameter, $\theta \in [0, 1]$, the optimal precision is

$$\varepsilon^* = \frac{1}{\sqrt{\mathcal{J}(\theta^*)}} \qquad (3.14)$$

and we have the compression bound:

$$K(X) \leq K(p) + \frac{1}{2}\log_2 \mathcal{J}(\theta^*) - \log_2 p_{\theta^*}(X) + \mathrm{cste} \qquad (3.15)$$

where $\mathcal{J}(\theta) = -\frac{\partial^2}{\partial\theta^2}[\ln p_\theta(X)]$ is the observed Fisher information.

Proof. If we choose a precision $\varepsilon$, we can choose a $\theta$ such that $|\theta - \theta^*| < \varepsilon$, and have the bound:

$$K(\theta) \leq -\log_2 \varepsilon \qquad (3.16)$$

We can deduce the bound:

$$K(X) \leq K(p) - \log_2 \varepsilon - \log_2 p_\theta(X) \qquad (3.17)$$

We want to find the optimal precision $\varepsilon$. We define:

$$l(\varepsilon) = -\log_2 \varepsilon - \log_2 p_\theta(X) \qquad (3.18)$$

Since $\theta$ is close to $\theta^*$, we can do a Taylor expansion of $\log_2 p_\theta(X)$ around $\theta^*$:

$$l(\varepsilon) = -\log_2 \varepsilon - \log_2 p_{\theta^*}(X) - \frac{\partial}{\partial\theta}\big[\log_2 p_\theta(X)\big]_{\theta^*} (\theta - \theta^*) - \frac{1}{2}\frac{\partial^2}{\partial\theta^2}\big[\log_2 p_\theta(X)\big]_{\theta^*} (\theta - \theta^*)^2 + o((\theta - \theta^*)^2) \qquad (3.19)$$

Since $\theta^*$ is a maximum of $\log_2 p_\theta(X)$, we know that $\frac{\partial}{\partial\theta}\log_2 p_\theta(X) = 0$ at $\theta = \theta^*$. We recognise $\mathcal{J}(\theta) = -\frac{\partial^2}{\partial\theta^2}[\ln p_\theta(X)]$, the observed Fisher information. And since $(\theta - \theta^*)^2 \leq \varepsilon^2$, we have:

$$l(\varepsilon) \leq -\log_2 \varepsilon - \log_2 p_{\theta^*}(X) + \frac{1}{2\ln(2)} \mathcal{J}(\theta^*)\, \varepsilon^2 \qquad (3.20)$$

We want to find the optimal precision, in order to get the optimal compression. We differentiate equation (3.20):

$$l'(\varepsilon) = \frac{1}{\ln 2}\left(-\frac{1}{\varepsilon} + \mathcal{J}(\theta^*)\, \varepsilon\right) \qquad (3.21)$$

So the optimal precision is

$$\varepsilon^* = \frac{1}{\sqrt{\mathcal{J}(\theta^*)}} \qquad (3.22)$$

and the compression bound we get is:

$$K(X) \leq K(p) - \log_2 \varepsilon^* - \log_2 p_{\theta^*}(X) + \frac{1}{2\ln(2)} \mathcal{J}(\theta^*)(\varepsilon^*)^2 \qquad (3.23)$$
$$= K(p) + \frac{1}{2}\log_2 \mathcal{J}(\theta^*) - \log_2 p_{\theta^*}(X) + \mathrm{cste} \qquad (3.24)$$

What is the interpretation of this optimal precision? Its value is linked to the Cramér-Rao bound. Suppose that we have one data sample $X \sim p_{\theta^*}$, where $\theta^*$ is unknown, and we design an estimator $\hat\theta$ of $\theta^*$ that is supposed to be unbiased ($\mathbb{E}[\hat\theta(X)] = \theta^*$). Then the Cramér-Rao bound tells us that if $\sigma$ is the standard deviation of $\hat\theta(X)$:

$$\sigma = \sqrt{\mathbb{E}[(\hat\theta - \theta^*)^2]} \qquad (3.25)$$

then we have:

$$\sigma \geq \frac{1}{\sqrt{I(\theta^*)}} \qquad (3.26)$$

This means that it is impossible to build an estimator more precise than $\frac{1}{\sqrt{I(\theta^*)}}$, and that asking for more precision would not be significant.

If we have iid samples coming from the same distribution, $X_1, \dots, X_n \sim p_{\theta^*}$, and an unbiased estimator $\hat\theta(X_1, \dots, X_n)$ of $\theta^*$, then if $\sigma$ is the standard deviation of $\hat\theta(X_1, \dots, X_n)$, we have:

$$\sigma \geq \frac{1}{\sqrt{I_n(\theta^*)}} \qquad (3.27)$$

where:

$$I_n(\theta) = -\frac{\partial^2}{\partial\theta^2}\, \mathbb{E}_{p_\theta}[\ln p_\theta(X_1, \dots, X_n)] \qquad (3.28)$$
$$= -\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\, \mathbb{E}_{p_\theta}[\ln p_\theta(X_i)] \qquad (3.29)$$
$$= n I(\theta) \qquad (3.30)$$

where $I(\theta) = -\frac{\partial^2}{\partial\theta^2}\,\mathbb{E}_{p_\theta}[\ln p_\theta(X)]$ is the Fisher information of a single sample. We can conclude that:

$\sigma \geq \frac{1}{\sqrt{n\,I(\theta^*)}}$   (3.31)

We recover the same kind of bound as the usual ones for confidence intervals, which decrease as $O(n^{-1/2})$ in the number of data samples. Finally, the Kolmogorov complexity bound gives the same precision on estimators as traditional statistics.

Remark 4. When we take into account, in the compression bound of the data, the compression cost of the parameters, we see that two parameters $\theta_1^*$ and $\theta_2^*$ can achieve the same value of the loss function while one compresses better than the other: the best one is the one with the lowest observed Fisher information $\mathcal{J}(\theta) = -\frac{\partial^2}{\partial\theta^2}[\ln p_\theta(X)]$. How can this be interpreted? $\mathcal{J}(\theta)$ is the second-order derivative of the loss function at the local minimum; it characterizes the flatness of the loss function at this point. So, for compression, flat minima are favored, and according to Kolmogorov complexity theory, parameters in flat minima should generalize better. This remark is interesting because it was recently claimed in the deep learning community that the generalization capacity of deep learning comes from the fact that the stochastic gradient descent algorithm naturally tends to converge to flat minima [Keskar et al., 2016] [Wu and Zhu, 2017]. This can be an insight to understand why flat minima seem to generalize better.

3.4 Minimum Description Length and Bayesian Statistics

In the following part, we suppose that we have data $X$, and want to encode it by using a parametric family $\mathcal{P} = \{p_\theta,\ \theta \in \Theta\}$.

3.4.1 Using a prior to encode the parameters

Suppose that we have a prior $\alpha$ on $\theta$. How can we use it in order to compress more? We have the following proposition:

Proposition 7. Suppose we want to encode the data $X$ with a parametric model $p_\theta$, and we have a prior $\alpha$ over $\theta$. We can encode $\theta$ with this prior. If $\theta$ is discrete, the complexity bound we get is:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) - \log_2 \alpha(\theta) - \log_2 p_\theta(X)$   (3.32)

The first term is the cost of the program describing the prior. The second one is, as usual, the cost of the program computing $p_\theta(X)$ given $\theta$ and $X$.

Remark 5. In order to use Proposition 7 with a continuous parameter $\theta$, we need to choose a precision $\varepsilon$ over $\theta$. If $\theta$ is a local maximum of the likelihood $\log_2 p_\theta(X)$, we can use the results on optimal precision and get the bound:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) - \log_2 \alpha(\theta) + \frac{1}{2}\log_2 \mathcal{J}(\theta) - \log_2 p_\theta(X)$   (3.33)

3.4.2 Encode the data with the marginal distribution

Since we have a parametric model $p_\theta(X)$ and a prior $\alpha$ over $\theta$, we can define the joint distribution $p(X, \theta) = \alpha(\theta)\,p_\theta(X)$. Then, we have the marginal distribution:

$p(X) = \int_\theta p(X, \theta)\,d\theta$   (3.34)
$= \int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta$   (3.35)

We can encode the data with this distribution, and we have the following bound given by the two-part code:

$K(X) \leq K(p) - \log_2 p(X)$   (3.36)

Moreover, since $p$ can be computed given $\alpha$ and $p_\theta$ by any algorithm that computes the integral, we have the following compression bound:

Theorem 10 (Bayesian compression bound). If $p_\theta$ is a parametric model and $\alpha$ is a prior over $\theta$, we have the following bound:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) - \log_2 p(X)$   (3.37)
$= K(\alpha) + K(p_\theta \mid \theta) - \log_2 \int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta$   (3.38)

The disadvantage of this bound is that it is very hard to compute in practice: the integral is easy to define but usually intractable. We are now going to see bounds that approximate it.

3.4.3 The variational compression bound

Another way to get a compression bound on the data is to encode a probability distribution $q(\theta \mid X)$ over $\theta$, which can depend on the data $X$ [Honkela and Valpola, 2004] [Hinton and Van Camp, 1993]. The intuition is the same as in the part about optimal precision: we can accept to lose some precision on the best values of $\theta$ if it allows compressing more. We first have to set a prior $\alpha$ over $\theta$.

Theorem 11 (Variational compression bound). Let $p_\theta$ be a parametric model and $\alpha$ a prior over $\theta$. Suppose that $q(\theta)$ is a probability distribution over $\theta$. Then we have the following compression bound:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) + D_{KL}(q \,\|\, \alpha) + \mathbb{E}_{\theta \sim q}\big[-\log_2 p_\theta(X)\big]$   (3.39)

Lemma 1. Let $p_\theta$ be a parametric model and $\alpha$ a prior over $\theta$. Suppose that $q(\theta)$ is a probability distribution over $\theta$. We define $p(X, \theta) = \alpha(\theta)\,p_\theta(X)$. Then, we have:

$\log_2 p(X) = D_{KL}\big(q \,\|\, p(\theta \mid X)\big) - D_{KL}(q \,\|\, \alpha) + \mathbb{E}_{\theta \sim q}\big[\log_2 p_\theta(X)\big]$   (3.40)

Moreover, since $D_{KL}(q \,\|\, p(\theta \mid X)) \geq 0$, we have:

$\log_2 p(X) \geq -D_{KL}(q \,\|\, \alpha) + \mathbb{E}_{\theta \sim q}\big[\log_2 p_\theta(X)\big]$   (3.41)

Proof.

$D_{KL}\big(q \,\|\, p(\theta \mid X)\big) - D_{KL}(q \,\|\, \alpha) + \mathbb{E}_{\theta \sim q}\big[\log_2 p_\theta(X)\big]$   (3.42)
$= \int_\theta q(\theta)\,\log_2\!\left(\frac{q(\theta)}{p(\theta \mid X)}\cdot\frac{\alpha(\theta)}{q(\theta)}\cdot p_\theta(X)\right) d\theta$   (3.43)
$= \int_\theta q(\theta)\,\log_2\!\left(\frac{p(\theta, X)}{p(\theta \mid X)}\right) d\theta$   (3.44)
$= \log_2 p(X)$   (3.45)

Proof of the theorem. With Theorem 10, we know that:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) - \log_2 p(X)$   (3.46)

With Lemma 1, we have the following bound:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) + D_{KL}(q \,\|\, \alpha) + \mathbb{E}_{\theta \sim q}\big[-\log_2 p_\theta(X)\big]$   (3.47)
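As a sanity check of Theorem 11, here is a minimal sketch on a toy model where everything is Gaussian and closed-form (the model, the prior and the data are illustrative assumptions): the data are i.i.d. $\mathcal{N}(\theta, 1)$, the prior is $\alpha = \mathcal{N}(0,1)$, and $q = \mathcal{N}(m, s^2)$. The right-hand side of (3.39), without the constant complexity terms, is minimised exactly when $q$ is the Bayesian posterior, where it equals $-\log_2 p(X)$.

```python
# Minimal sketch checking the variational bound of Theorem 11 on a fully Gaussian
# example: x_1..x_n ~ N(theta, 1), prior alpha = N(0, 1), variational family
# q = N(m, s^2). All quantities are in bits (log base 2).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 50
x = rng.normal(2.0, 1.0, size=n)
LOG2 = np.log(2)

def bound_bits(m, s):
    # D_KL(q || alpha) between two Gaussians, in bits
    kl = (np.log(1.0 / s) + (s**2 + m**2) / 2 - 0.5) / LOG2
    # E_{theta~q}[-log2 p_theta(X)] in closed form
    nll = (n / 2 * np.log(2 * np.pi) + 0.5 * (((x - m) ** 2).sum() + n * s**2)) / LOG2
    return kl + nll

# Exact -log2 p(X): marginally, X is N(0, I + ones*ones^T)
evidence_bits = -multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x) / LOG2

post_mean, post_std = x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))
print("exact -log2 p(X):           ", round(evidence_bits, 2))
print("bound at Bayesian posterior:", round(bound_bits(post_mean, post_std), 2))
print("bound at a worse q:         ", round(bound_bits(0.0, 1.0), 2))
```

Any other choice of $q$ only loosens the bound, by exactly $D_{KL}(q \,\|\, p(\theta \mid X))$ bits, as Lemma 1 shows.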

3.4.4 Practical variational bound: the bits-back argument

It is not obvious that this bound can be achieved by a practical algorithm. We are going to use the bits-back argument. The idea is that the sender has to send a lot of information, but much of it can be chosen arbitrarily. Thus, instead of paying an expensive cost to transmit random bits, these bits will be chosen carefully, so that they encode other information that can be used later, or in other tasks. Then, the amount of additional information that has been sent with this method can be deducted from the total.

The procedure for the sender is the following one:

1. Compute $q(\theta \mid X)$ if it needs pre-computing.
2. Choose $\hat\theta$ randomly with probability distribution $q(\theta \mid X)$. Choosing it truly at random would require $H(q(\theta \mid X))$ random bits. Instead of choosing these bits at random, we select a sequence $b$ containing additional data we would like to send, and which is also a code for a $\hat\theta$ encoded with $q(\theta \mid X)$. The average length of the sequence $b$ is $H(q(\theta \mid X))$.
3. Encode $\hat\theta$ with the prior $\alpha$, and send it to the receiver. It costs in expectation $\mathbb{E}_{\hat\theta \sim q(\theta\mid X)}[-\log_2 \alpha(\hat\theta)]$.
4. Encode the data $X$ with the distribution $p_{\hat\theta}$, and send it to the receiver. It costs in expectation $\mathbb{E}_{\hat\theta \sim q(\theta\mid X)}[-\log_2 p_{\hat\theta}(X)]$.

With this method, the cost of the transmission is

$\mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 \alpha(\hat\theta)\big] + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 p_{\hat\theta}(X)\big]$   (3.48)

But as we explained, since an additional amount of information was sent in this procedure, it can be deducted, so we get the cost:

$\mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 \alpha(\hat\theta)\big] + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 p_{\hat\theta}(X)\big] - \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 q(\hat\theta \mid X)\big]$   (3.49)
$= \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\!\left[\log_2 \frac{q(\hat\theta \mid X)}{\alpha(\hat\theta)}\right] + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 p_{\hat\theta}(X)\big]$   (3.50)
$= D_{KL}\big(q(\theta \mid X) \,\|\, \alpha\big) + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 p_{\hat\theta}(X)\big]$   (3.51)

So we finally get the bound of equation (3.39). We still have to show how the receiver can decode $b$. The procedure for the receiver is the following one:

1. Receive $\hat\theta$ and decode it with the prior $\alpha$.
2. Receive and decode the data $X$ with $p_{\hat\theta}$.

3. Compute $q(\theta \mid X)$ with whatever algorithm the sender used, typically a machine learning algorithm. Then, recover $b$ as the code of $\hat\theta$ under $q(\theta \mid X)$.

The receiver now has $\hat\theta$, $X$ and the sequence of bits $b$, so the obtained bound is correct. So we have the following bound:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) + K(q \mid X) + D_{KL}(q \,\|\, \alpha) + \mathbb{E}_{\theta \sim q}\big[-\log_2 p_\theta(X)\big]$   (3.52)

The only difference between the variational bound of Theorem 11 and the practical one is $K(q \mid X)$: in this algorithm, the sender needs to specify the distribution family $\mathcal{Q}$ and the learning procedure. It is interesting to observe that in the true variational bound of Theorem 11 there is no complexity term involving $q$; therefore, we can choose a family $\mathcal{Q}$ as large as we want.

Remark 6. When we discussed the issue of precision for the parameter $\theta$, with a required precision $\varepsilon$ around $\theta^*$, we were exactly transmitting a $\hat\theta$ drawn from a distribution $q$ which was the uniform distribution over the cube $[\theta^*_1 - \frac{\varepsilon}{2}, \theta^*_1 + \frac{\varepsilon}{2}] \times \dots \times [\theta^*_n - \frac{\varepsilon}{2}, \theta^*_n + \frac{\varepsilon}{2}]$.

3.4.5 Optimizing the variational bound

What is the optimal distribution $q$ to choose? With equation (3.51), we have:

$K(X) \leq \text{Cste} + D_{KL}\big(q(\theta \mid X) \,\|\, \alpha(\theta)\big) + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\big[-\log_2 p_{\hat\theta}(X)\big]$   (3.53)
$= \text{Cste} + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\!\left[\log_2 \frac{q(\hat\theta \mid X)}{\alpha(\hat\theta)\,p_{\hat\theta}(X)}\right]$   (3.54)

We define the joint distribution $p(X, \theta) = \alpha(\theta)\,p_\theta(X)$. Then, we have:

$K(X) \leq \text{Cste} + \mathbb{E}_{\hat\theta \sim q(\theta\mid X)}\!\left[\log_2 \frac{q(\hat\theta \mid X)}{p(X)\,p(\hat\theta \mid X)}\right]$   (3.55)
$= \text{Cste} + D_{KL}\big(q(\theta \mid X) \,\|\, p(\theta \mid X)\big) - \log_2 p(X)$   (3.56)

where $p(X)$ is the marginal likelihood:

$p(X) = \int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta$   (3.57)

and $p(\theta \mid X)$ is the posterior:

$p(\theta \mid X) = \frac{p_\theta(X)\,\alpha(\theta)}{p(X)} = \frac{p_\theta(X)\,\alpha(\theta)}{\int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta}$   (3.58)

Therefore, we see that the optimal distribution $q$ is the Bayesian posterior of the model $p$. If we choose $q(\theta \mid X) = p(\theta \mid X)$, the description length for the data $X$ is:

$K(X) \leq K(\alpha) + K(p_\theta \mid \theta) - \log_2 p(X)$   (3.59)
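The bits-back accounting of equations (3.48)–(3.51) can be checked numerically on the same toy Gaussian model as in the previous sketch; this is only a check of the bookkeeping (with illustrative assumptions), not an actual coder. Since $q$ is taken here to be the Bayesian posterior, the net cost also matches $-\log_2 p(X)$, as in equation (3.59).

```python
# Minimal sketch of the bits-back bookkeeping of equations (3.48)-(3.51): Monte
# Carlo estimate of the gross cost, the bits got back, and the resulting net cost,
# compared with KL(q || alpha) + E_q[-log2 p_theta(X)]. Toy Gaussian model; this
# only checks the accounting, it does not implement an entropy coder.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50
x = rng.normal(2.0, 1.0, size=n)
LOG2 = np.log(2)

m, s = x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))     # q(theta|X) = posterior N(m, s^2)
thetas = rng.normal(m, s, size=20_000)               # theta_hat ~ q(theta|X)

cost_prior = -norm(0, 1).logpdf(thetas) / LOG2                       # send theta_hat with alpha
cost_data = -norm(thetas[:, None], 1).logpdf(x).sum(axis=1) / LOG2   # send X with p_theta_hat
bits_back = -norm(m, s).logpdf(thetas) / LOG2                        # bits recovered from theta_hat

net = (cost_prior + cost_data - bits_back).mean()
kl = (np.log(1 / s) + (s**2 + m**2) / 2 - 0.5) / LOG2
exp_nll = (n / 2 * np.log(2 * np.pi) + 0.5 * (((x - m) ** 2).sum() + n * s**2)) / LOG2
print(f"net bits-back cost: {net:.2f} bits   KL + E[-log2 p]: {kl + exp_nll:.2f} bits")
```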

But now that we are in fact encoding the data $X$ with the marginal distribution $p(X)$, we could simplify the procedure and only use the two-part code:

$K(X) \leq K(p) - \log_2 p(X)$   (3.60)

In this way, we obtain Theorem 10 by another route. So if we choose $q$ to be the Bayesian posterior, $q(\theta) = p(\theta \mid X)$, the procedure we described is only a way to encode the data with the marginal distribution $p(X)$.

But in practice, the Bayesian posterior is often very hard to compute. What we do instead is to approximate it in Kullback-Leibler divergence. We choose a parametric family of distributions for $q$, $\mathcal{Q} = \{q_\varphi,\ \varphi \in \Phi\}$, and we try to optimize equation (3.56), which can be written:

$K(X) \leq \inf_{\varphi \in \Phi}\ \big\{\, K(\alpha) + K(p_\theta \mid \theta) + D_{KL}\big(q_\varphi \,\|\, p(\theta \mid X)\big) - \log_2 p(X) \,\big\}$   (3.61)

We only have to minimize $D_{KL}(q_\varphi \,\|\, p(\theta \mid X))$. This divergence is hard to compute directly, since we do not have a closed formula for the Bayesian posterior. But we can use equation (3.39) to rewrite the bound:

$K(X) \leq \inf_{\varphi \in \Phi}\ \big\{\, K(\alpha) + K(p_\theta \mid \theta) + D_{KL}(q_\varphi \,\|\, \alpha) + \mathbb{E}_{\theta \sim q_\varphi}\big[-\log_2 p_\theta(X)\big] \,\big\}$   (3.62)

Since the complexity terms do not depend on $\varphi$ but only on the prior, the model and the family of distributions chosen, we only consider the loss:

$\mathcal{L}(\varphi) = D_{KL}(q_\varphi \,\|\, \alpha) + \mathbb{E}_{\theta \sim q_\varphi}\big[-\log_2 p_\theta(X)\big]$   (3.63)

We can then compute $\varphi^* = \arg\min_\varphi \mathcal{L}(\varphi)$, and we have the bound $K(X) \leq K(\alpha) + K(p_\theta \mid \theta) + \mathcal{L}(\varphi^*)$. If the function $\varphi \mapsto q_\varphi(\theta)$ is differentiable, we can try to find a minimum of $\mathcal{L}(\varphi)$ with a gradient descent algorithm. The family of distributions $\mathcal{Q}$ can typically be the family of normal distributions with diagonal covariance. In Figure 4, we see the successive steps of the learning procedure of a variational approximation of the Bayesian posterior.

3.4.6 Bayesian model selection and Kolmogorov complexity

Let us go back to the interpretation of equation (3.60). What is the interpretation of the cost $K(X) \leq K(p) - \log_2 p(X)$? The marginal $p(X) = \int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta$ is called the evidence of the model $p$. It is used in Bayesian model selection [Wasserman, 2000]. Suppose we need to make a choice between a few models $M_1$,

Figure 4: Variational learning. Figures (a), (b), (c), (d), (e) and (f) are successive steps of a variational learning procedure. The probability density function of the true posterior is indicated by its level lines (solid lines). The variational approximation is learned, and its successive level lines are indicated by dashed lines. Here, the only constraint on the approximation is that it is separable in $\mu$ and $\sigma$. Source: [Mackay, 2003]

$\dots, M_k$, with $M_i = \{p^{M_i}_\theta,\ \theta \in \Phi_i\}$ and a prior $\alpha_i$ over $\theta$ for each $M_i$. The Bayesian model selection procedure is to compute the evidence of each model:

$p^{M_i}(X) = \int_\theta p^{M_i}_\theta(X)\,\alpha_i(\theta)\,d\theta$   (3.64)

Then, if we have no prior on the models, the chosen model is the one with the strongest evidence. If we have a prior $\beta(M_i)$ over the models, we can define the joint distribution:

$p(X, M) = \beta(M)\,p^{M}(X)$   (3.65)

Then, we can compute the posterior:

$p(M_i \mid X) = \frac{p^{M_i}(X)\,\beta(M_i)}{\sum_j p^{M_j}(X)\,\beta(M_j)}$   (3.66)

and we would select the model with the highest posterior. This means that, if we switch to log-probabilities, the goal is to select the model minimizing:

$-\log_2 p(M_i \mid X) = -\log_2 \beta(M_i) - \log_2 p^{M_i}(X) + \text{Cste}$   (3.67)

(the normalisation term is the same for all models). What we see is that Bayesian model selection is very close to model selection with Kolmogorov complexity. Indeed, the Kolmogorov compression bound for a model $M_i$ is exactly $-\log_2 p(M_i \mid X)$ with the Solomonoff prior: Solomonoff induction is simply a Bayesian model selection method with a universal prior founded on Kolmogorov complexity.

To conclude, we have seen how closely universal compression bounds are related to usual statistics and machine learning, and especially to Bayesian statistics and Bayesian model selection. As a consequence, Kolmogorov model selection seems to share a well-known shortcoming of Bayesian model selection: the latter is often considered too cautious, since in many cases the procedure selects a simple model that fits the data less well rather than a more complicated one that fits better. We will introduce the switch distribution in order to solve this problem.

3.5 The switch distribution and the catch-up phenomenon

In this section, we present the switch distribution introduced in [Van Erven et al., 2012], and how it can help in Bayesian model selection. We will use the switch in the next part in order to compute compression bounds with deep learning.

Until now, the models we used were parametric models $p_\theta$. In the last part, we saw that with this method the best compression cost we can hope for is:

$K(X) \leq C - \log_2 p(X)$   (3.68)
$= C - \log_2 \int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta$   (3.69)

As we have already explained, the term $-\log_2 \int_\theta p_\theta(X)\,\alpha(\theta)\,d\theta$ is called the evidence of the model. How can we interpret the evidence? The cost $-\log_2 p(X)$ can be interpreted as the cumulative cost of encoding $X$ with the model $p$. If $X = \{x_1, \dots, x_n\}$, we have indeed:

$-\log_2 p(X) = -\log_2 p(x_1, \dots, x_n)$   (3.70)
$= -\log_2\big( p(x_n \mid x_1, \dots, x_{n-1})\,p(x_{n-1} \mid x_1, \dots, x_{n-2}) \cdots p(x_1) \big)$   (3.71)
$= \sum_{i=1}^n -\log_2 p(x_i \mid x_1, \dots, x_{i-1})$   (3.72)

This simple rewriting of the encoding cost of the data with a Bayesian model might seem trivial, but it helps to highlight an issue with the evidence. Suppose that we try to learn a word prediction model and compare two models: a Markov model of order 1 and a Markov model of order 2. During the first part of the training, the order-2 model is worse than the order-1 model, since it needs considerably more data to train. At a certain point, it becomes better. But since the evidence is a cumulative score, the order-2 model first needs to fill a huge gap before being selected; thus, the best model is not selected for a long time. This is called the catch-up phenomenon [Van Erven et al., 2012].

What we would like to do instead is to select the order-2 model as soon as it becomes the best. If we decide to switch from the first model to the second after having seen $t$ samples, then the encoding cost of the entire dataset is the encoding cost of the first model for the first $t$ data points, plus the encoding cost of the second model for the rest of the data. See Figure 5.

The switch distribution is a way to tackle this issue. Suppose that we have $X = \{x_1, \dots, x_n\}$. We say that $p$ is a prediction strategy if it gives for all $k$ a probability measure $p(x_k \mid x_1, \dots, x_{k-1})$. We suppose that we have a countable set of models, $\mathcal{M} = \{p_i,\ i \in I\}$, where the $p_i$ are prediction strategies; typically, they can be marginal distributions of Bayesian models. Usually, we try to select the best possible model. Here, we introduce a way to compose them. We define:

$S = \big\{ \big((t_1, k_1), \dots, (t_l, k_l)\big)\ :\ 1 = t_1 < t_2 < \dots < t_l < n,\ k_1, \dots, k_l \in I,\ l \in \mathbb{N} \big\}$   (3.73)

Then, the set of strategies is $Q = \{q_s,\ s \in S\}$, where, if $s = ((t_1, k_1), \dots, (t_l, k_l))$:

$q_s(x_{i+1} \mid x_1, \dots, x_i) = p_{k_j}(x_{i+1} \mid x_1, \dots, x_i) \quad \text{where } t_j \leq i+1 < t_{j+1}$   (3.74)

If we choose a prior $\pi$ over $S$, we can define the switch distribution:

Definition 11 (Switch distribution). The switch distribution $p_{\mathrm{sw}}$ with respect to a prior $\pi$ is:

$p_{\mathrm{sw}}(x_1, \dots, x_n, s) = q_s(x_1, \dots, x_n)\,\pi(s)$   (3.75)

Then, we have the marginal distribution:

$p_{\mathrm{sw}}(x_1, \dots, x_n) = \sum_{s \in S} q_s(x_1, \dots, x_n)\,\pi(s)$   (3.76)

Figure 5: Switch between two models. The two models are Markov models of order 1 and 2. This figure represents the difference between the encoding costs of the first and of the second model as a function of the sample size. What we see is that model 2 becomes better after $t_1$ data samples. But since the loss is cumulative, there is at this point a huge gap (40,000 bits) between the encoding cost with the first model and the encoding cost with the second. Model 2 finishes filling this gap after more than 300,000 data samples, and is only then selected by Bayesian model selection (dashed line), whereas the best strategy would be to encode the data with model 1 until $t_1$, and then use model 2 (dotted line). As we can see in the figure, this allows saving 40,000 bits. (Source: [Van Erven et al., 2012])

What can the prior $\pi$ be? A natural choice is to choose a prior $\beta$ over the models and a prior $\kappa$ over the switching times $(t_1, \dots, t_l)$. Then, we can define:

$\pi(s) = \pi\big((t_1, k_1), \dots, (t_l, k_l)\big)$   (3.77)
$= \kappa(t_1, \dots, t_l)\ \beta(p_{k_1}) \cdots \beta(p_{k_l})$   (3.78)

Then, we could choose for $\kappa$ something close to the Solomonoff distribution over these switching strategies. Since we know that:

$K(t_1, \dots, t_l) \leq K(l) + \sum_{i=1}^l K(t_i)$   (3.79)
$\leq \log_2(l) + 2\log_2\log_2(l) + \sum_{i=1}^l \big( \log_2 t_i + 2\log_2\log_2 t_i \big)$   (3.80)

With this bound, we could have:

$\kappa(t_1, \dots, t_l) \propto \frac{1}{l\,\log_2^2 l}\cdot\frac{1}{t_1 \log_2^2 t_1} \cdots \frac{1}{t_l \log_2^2 t_l}$   (3.81)

This probability distribution is also well defined for infinite sequences. As we can see, the cost of the switch is very low. Suppose you have 1,000,000 samples and you want to switch between 5 models: the most you might pay is of the order of a few times $\log_2 1{,}000{,}000 \approx 20$ bits, which is of the order of $10^{-4}$ bits per sample.

What is the encoding cost of the data with this distribution? We need to encode all the models that are used in the strategy, the strategy itself, and finally the encoding costs of the different parts of the data. We have the bound:

Theorem 12 (Compression bound of the switch distribution).

$K(X) \leq K(\pi) - \log_2 \pi(s) + \sum_{i=1}^l \Big( K(p_{k_i}) - \log_2 p_{k_i}\big(x_{t_i}, x_{t_i+1}, \dots, x_{t_{i+1}-1} \mid x_1, \dots, x_{t_i - 1}\big) \Big)$   (3.82)

Remark 7. We could limit the encoding cost of the switch by restricting the times at which the model is allowed to switch. For example, we could decide that the model can only switch after $t$ data points if $t$ is a power of 2. This would greatly reduce the cost of the switch; but since this cost is already very low, we consider it insignificant.
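To make Theorem 12 and the catch-up discussion concrete, here is a minimal sketch on a toy example; the binary source, the two prediction strategies and the crude stand-in for $-\log_2 \kappa(t)$ are illustrative assumptions. It compares each strategy used alone with the best strategy that switches once from the first to the second, including the (tiny) cost of describing the switch time.

```python
# Minimal sketch of Theorem 12 on a toy example: binary data from an order-2
# Markov source, p1 = adaptive order-1 Markov model, p2 = adaptive order-2 Markov
# model, both with Laplace smoothing. We compare each strategy alone with the best
# single switch from p1 to p2, charging the switch time t with roughly
# -log2 kappa(t) bits (the few bits naming the two models are ignored).
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = np.zeros(n, dtype=int)
x[1] = 1
for i in range(2, n):                       # x_i repeats x_{i-2} with probability 0.8
    x[i] = x[i - 2] if rng.random() < 0.8 else 1 - x[i - 2]

def sequential_bits(order):
    counts = np.ones((2 ** order, 2))       # Laplace smoothing
    bits = np.zeros(n)
    for i in range(n):
        ctx = 0
        if i >= order:
            for j in range(1, order + 1):   # context = the `order` previous symbols
                ctx = 2 * ctx + x[i - j]
        p = counts[ctx] / counts[ctx].sum()
        bits[i] = -np.log2(p[x[i]])
        counts[ctx, x[i]] += 1
    return bits

c1, c2 = sequential_bits(1).cumsum(), sequential_bits(2).cumsum()
print(f"p1 alone: {c1[-1]:9.1f} bits    p2 alone: {c2[-1]:9.1f} bits")

# Best single switch: p1 encodes the first t symbols, p2 encodes the rest
# (still conditioning on the whole past), plus the cost of describing t.
costs = [c1[t - 1] + (c2[-1] - c2[t - 1]) + np.log2(t) + 2 * np.log2(np.log2(t) + 1)
         for t in range(1, n)]
best = int(np.argmin(costs)) + 1
print(f"switch after t={best}: {min(costs):9.1f} bits (switch description cost included)")
```

On such a small example the gain from switching may be modest; the point is the bookkeeping: the description cost of the switch time is a few dozen bits at most, negligible compared to the data terms.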

4 Compression bounds with Deep Learning

Deep learning is now recognised as one of the most effective tools in machine learning. But the neural networks used in practice are often huge, and have far more weights than there are data samples in the dataset. This seems to contradict Kolmogorov's compression bound theory, since the best models seem to be the biggest ones. It also contradicts statistical intuition, since there are more parameters than data samples. How can we solve this paradox?

Here, we will only discuss the problem of supervised learning, and more precisely image classification. We are using two standard supervised learning datasets:

MNIST (Modified National Institute of Standards and Technology database) is a very standard dataset of handwritten digits. There are 60,000 images in greyscale, of size 28x28 pixels, and the task is to recognize the digits (see Figure 6a). This task is considered very easy, due to the lack of variety in the dataset: the classification scores are 92.4% for a linear classifier and 95.0% for an L2 k-nearest-neighbours method [Lecun et al., 1998], and the best learning algorithms reach a 99.7% accuracy score [Cireşan et al., 2012]. We define K(MNIST) to be K(MNIST labels | MNIST pictures), the encoding cost of MNIST's labels given the pictures. For this dataset, the trivial compression bound is the bound given by encoding the data with the uniform distribution. Since the labels are equally distributed, the cost of encoding all these labels with the uniform distribution over {0, 1, ..., 9} is:

$K(\text{MNIST}) \leq 60{,}000\ H\big(\mathcal{U}(\{0, 1, \dots, 9\})\big)$   (4.1)
$= 60{,}000\ \ln(10)$   (4.2)
$\approx 138{,}155 \text{ nats}$   (4.3)

CIFAR-10 is a dataset of 50,000 small colour pictures (32x32) which have to be classified into 10 classes, either animals (bird, cat, horse, ...) or means of transportation (airplane, ship, automobile, ...); see Figure 6b. The classification task is much more difficult than the MNIST one; the best classification score is 96.5% [Graham, 2014]. As for MNIST, the trivial compression bound is:

$K(\text{CIFAR10}) \leq 50{,}000\ \ln(10)$   (4.4)
$\approx 115{,}129 \text{ nats}$   (4.5)

What is the Kolmogorov complexity of a dataset stored using a deep learning algorithm?
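These two trivial bounds can be reproduced directly (a one-line computation each):

```python
# Trivial compression bounds (4.1)-(4.5): encode every label with the uniform
# distribution over the 10 classes, in nats.
import math
print(f"K(MNIST)   <= {60_000 * math.log(10):,.0f} nats")   # ~138,155 nats
print(f"K(CIFAR10) <= {50_000 * math.log(10):,.0f} nats")   # ~115,129 nats
```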

(a) MNIST dataset   (b) CIFAR10 dataset

Figure 6: Image classification datasets

We are going to present different ways of compressing the same data. For each of these methods, we would have to consider the cost of all the software we use (Python interpreter, maths tools) and of the specific libraries for deep learning (TensorFlow, Theano, PyTorch, Keras, ...). We will not take this into account, since we suppose that all computer scientists doing deep learning, and hence both the sender and the receiver, already have them. Formally, this is absorbed in the constant term of the Kolmogorov complexity, which depends on the Turing machine.

For each of these methods, we will also have to take into account the cost of the code itself: typically a Python script of a few hundred lines, which weighs a few thousand bits. But this can be highly compressed: a feed-forward network with a finite set of layer types, each of them described by a few hyperparameters, can be described by these hyperparameters alone, plus information about the optimizer, and the program implementing the optimizer is included in the libraries. So the script describing the learning procedure can be described with a few hundred bits, which we will consider negligible.

4.1 No compression

The first possibility is to store all the model weights at their maximal precision. This is what is usually done in practice when sending a pre-trained model, but it is extremely expensive: if each weight is encoded on 32 bits, encoding an entire neural network can cost more than $10^8$ bits.

This can be interpreted in terms of transmitting the weights with a given prior and a required precision. Suppose that the weights are 32-bit floating-point numbers, and take a weight $x$. Since its value is stored as a single-precision floating-point number, we know

Figure 7: Storage of a single-precision floating-point number. (Reference to be added.)

that it can be written $x = \epsilon\,(1 + u)\,2^{v - 127}$, where $\epsilon$ is the sign, given by one bit, $v$ is the biased exponent, given with 8 bits, and $u$ is the fraction, given with 23 bits. Since every possible value is encoded using the same number of bits, this corresponds to an encoding of all the weights with a uniform prior over all representable values. The precision asked is $2^{-23}$ on $u$, which means a precision of $2^{v - 127 - 23}$ on $x$, i.e. a relative precision of about $2^{-23}$ on $x$. The corresponding prior has density

$p(x) \propto \frac{1}{|x|}$   (4.6)

over the representable range, and is plotted in Figure 8.

Figure 8: Probability density function of the prior on a weight implicitly given by the single-precision floating-point format: $p(x) \propto 1/|x|$.

The compression bound given by this method is substantially higher than the trivial bound given by the uniform distribution. This is exactly the paradox we highlighted: even if deep learning models generalize very well, they have too many

parameters and therefore, it seems that they do not compress the data at all. This is the reason why we have to find new ways to compute compression bounds with deep learning tools.

4.2 Model compression

Practical model compression is a well-developed research topic [Han et al., 2015a] [Han et al., 2015b]. The problem is to design neural networks that are lighter but still perform well, in order to send them to and use them on small devices like cell phones. The goal is not exactly to compress the data but the model itself, which of course also gives compression bounds for the data.

A first strategy is to train an ordinary deep learning model, which can have many parameters, and then to train a smaller model, called the student model, to imitate the first one. The first experiments chose student models with small depth [Ba and Caruana, 2014]; the training of the student model is easy because we can generate as many data pairs $(x, y)$ as we want. Other experiments were done with student models having exactly the same number of layers but thinner layers, so that the training could even be done layer by layer, each layer of the student model being trained to imitate the corresponding layer of the original model [Romero et al., 2014]. These methods reached compression factors of around 10x on image classification.

Another possibility is again to take an ordinary deep learning model, but then to make it sparse: all weights of the network that are under a given threshold are set to 0, and the non-zero weights are then fine-tuned. This is usually called pruning and can divide the number of parameters by a factor of 10. We can then compress even more by using weight sharing: if, in a layer, some weights are very close, we can replace them by a unique shared weight, using a clustering algorithm, and then fine-tune the shared weights. Finally, the remaining weights can be encoded with a Huffman coding. Altogether, these compression methods allow a compression rate of about 50 on big networks like VGG16 [Simonyan and Zisserman, 2014], reaching a total weight of 11.3 MB instead of 500 MB, without changing the accuracy too much.

This is a great practical result: it can be hard to transmit a 500 MB network to a mobile device, whereas it is easy with an 11 MB network. But these compressed sizes are still far above the trivial compression bound for the labels of an image classification dataset.

4.3 Variational methods for deep learning

As we explained in Section 3.4.3, we can use variational methods in order to compress data [Hinton and Van Camp, 1993].

Suppose we have a neural network with parameters $\theta \in \mathbb{R}^N$, defining the model $p_\theta$. Instead of learning a parameter $\theta$, we can learn a distribution $q_\varphi$ over $\theta$, from a parametric family $\mathcal{Q} = \{q_\varphi,\ \varphi \in \Phi\}$. We choose $\mathcal{Q} = \{\mathcal{N}(\mu, \Sigma),\ \mu \in \mathbb{R}^N,\ \Sigma \text{ diagonal}\}$, and learn a mean and a variance for each coefficient. For the prior $\alpha$ over the weights $\theta$, we try two possibilities:

A Gaussian prior with mean zero and given variance $\sigma_0^2$:

$\alpha(\theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left(-\frac{\theta_i^2}{2\sigma_0^2}\right)$

The variance can also be different for each layer.

A conjugate Gaussian prior over $\mu$ and $\Sigma$: the idea is to encode all the parameters as if they were coming from an unknown Gaussian $\mathcal{N}(\mu, \sigma)$ with a conjugate prior over $\mu$ and $\sigma$, the Normal-Gamma prior [Murphy, 2007]. Then, we use the marginal distribution over $\theta$ of this model.

With equations (3.62) and (3.63), we know that we have $K(Y \mid X) \leq \text{Cste} + \mathcal{L}(\varphi)$, where

$\mathcal{L}(\varphi) = D_{KL}(q_\varphi \,\|\, \alpha) + \mathbb{E}_{\theta \sim q_\varphi}\big[-\log_2 p_\theta(X)\big]$   (4.7)

The second term is estimated by sampling a $\theta \sim q_\varphi$ at each epoch. If the prior is the Gaussian prior, we can compute the first term $D_{KL}(q_\varphi \,\|\, \alpha)$ exactly: the Kullback-Leibler divergence between two Gaussian distributions $q = \mathcal{N}(\mu_q, \sigma_q^2)$ and $p = \mathcal{N}(\mu_p, \sigma_p^2)$ is

$D_{KL}(q \,\|\, p) = \log\frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2}$   (4.8)

If the prior is not the Gaussian prior, the Kullback-Leibler divergence cannot be computed exactly. We rewrite $\mathcal{L}$:

$\mathcal{L}(\varphi) = \mathbb{E}_{\theta \sim q_\varphi}\left[\log_2 \frac{q_\varphi(\theta)}{\alpha(\theta)}\right] + \mathbb{E}_{\theta \sim q_\varphi}\big[-\log_2 p_\theta(X)\big]$   (4.9)
$= -H(q_\varphi) + \mathbb{E}_{\theta \sim q_\varphi}\big[-\log_2 p_\theta(X) - \log_2 \alpha(\theta)\big]$   (4.10)

The second term can still be estimated by sampling a single $\theta \sim q_\varphi$ at each epoch. The first term can be computed exactly, since the entropy of a Gaussian distribution $q = \mathcal{N}(\mu, \Sigma)$ is:

$H(q) = \frac{1}{2}\log_2 \det(2\pi e \Sigma)$   (4.11)

The resulting estimate of $\mathcal{L}(\varphi)$ is then differentiable, and we can use a gradient descent algorithm.
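As an illustration, here is a minimal sketch of this variational objective for a small fully connected network in PyTorch. The architecture, the prior standard deviation, the initialisation, and the choice to treat only the weights (not the biases) variationally are simplifying assumptions; this is not the exact code used in the experiments.

```python
# Minimal sketch of the variational objective L(phi) = KL(q_phi || alpha)
# + E_{theta~q_phi}[-log p_theta(X)], with a diagonal Gaussian q_phi over the
# weights and a Gaussian prior alpha, for a small fully connected classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    def __init__(self, n_in, n_out, prior_std=0.1):
        super().__init__()
        self.mu = nn.Parameter(0.01 * torch.randn(n_out, n_in))
        self.rho = nn.Parameter(-5.0 * torch.ones(n_out, n_in))  # sigma = softplus(rho)
        self.bias = nn.Parameter(torch.zeros(n_out))              # bias kept deterministic
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.rho)
        # Reparameterisation trick: draw one weight sample theta ~ q_phi per pass.
        w = self.mu + sigma * torch.randn_like(sigma)
        return F.linear(x, w, self.bias)

    def kl(self):
        # Closed-form KL between N(mu, sigma^2) and the prior N(0, prior_std^2),
        # equation (4.8) applied coordinate-wise and summed over the weights.
        sigma = F.softplus(self.rho)
        return (torch.log(self.prior_std / sigma)
                + (sigma ** 2 + self.mu ** 2) / (2 * self.prior_std ** 2) - 0.5).sum()

def variational_loss(layers, x, y):
    # KL term plus the negative log-likelihood of the labels (in nats),
    # estimated with the single weight sample drawn in the forward pass.
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i < len(layers) - 1:
            h = torch.relu(h)
    nll = F.cross_entropy(h, y, reduction="sum")
    kl = sum(layer.kl() for layer in layers)
    return kl + nll

# Example (MNIST-like shapes): 784 -> 256 -> 256 -> 10
layers = nn.ModuleList([VariationalLinear(784, 256),
                        VariationalLinear(256, 256),
                        VariationalLinear(256, 10)])
```

Training then amounts to minimising `variational_loss` by gradient descent on the $\mu$ and $\rho$ parameters; each forward pass draws one weight sample, so the gradient is a stochastic estimate of the gradient of $\mathcal{L}(\varphi)$. The final value of $\mathcal{L}(\varphi^*)$ (here in nats, since `cross_entropy` uses natural logarithms), plus the small constant complexity terms, gives the variational compression bound.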

We observed that the neural networks that give the best results with variational learning are smaller than the ordinary ones. We experimented with different fully connected networks and convolutional networks, but the models that gave the best compression bounds were small fully connected networks with 3 layers, the 2 hidden layers being of dimension 256. On MNIST, we achieve a compression bound of 16.6 knats, and a classification score of 95.5% on a test set. The compression bound is very good, but the classification score is well below the best scores on this dataset. On CIFAR, we achieve a compression bound of 61.7 knats, and a classification score of 61.6% on a test set.

4.4 Online learning

We can compress the data by choosing a neural network structure $p_\theta$ and learning its parameters online. The idea is that the sender sends some labels to the receiver; the receiver can then use them to learn a first model, with a learning algorithm given by the sender. Since the sender knows exactly what data and what learning algorithm the receiver has, he knows the model the receiver learned, so he can use this model to transmit more data. Formally, we choose $0 = t_0 < t_1 < \dots < t_S = n$, where $n$ is the size of the dataset. Then, the algorithm is the following:

1. Send a parametric model $p_\theta$ and a training procedure (a program which can compute $\hat\theta$ given some data).
2. Send the data $y_{t_0}, y_{t_0+1}, \dots, y_{t_1 - 1}$ using the uniform distribution.
3. $s \leftarrow 1$.
4. Train a model with the data $y_{t_0}, y_{t_0+1}, \dots, y_{t_s - 1}$, which means finding $\theta_s$ with the training procedure given by the sender. The goal is to minimize $-\log_2 p_\theta(y_{t_0}, y_{t_0+1}, \dots, y_{t_s - 1} \mid x_{t_0}, x_{t_0+1}, \dots, x_{t_s - 1})$.
5. Send the data $y_{t_s}, y_{t_s+1}, \dots, y_{t_{s+1} - 1}$ using the model $p_{\theta_s}$, which costs:

$\sum_{i = t_s}^{t_{s+1} - 1} -\log_2 p_{\theta_s}(y_i \mid x_i)$   (4.12)

6. $s \leftarrow s + 1$.
7. If $s = S$, end. Else go back to step 4.
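For concreteness, here is a minimal sketch of this online procedure, with a multinomial logistic regression on synthetic data standing in for the neural networks of the experiments; the data, the model, the pack boundaries and the small uniform mixture (so that classes unseen in the first packs keep a non-zero probability) are illustrative assumptions.

```python
# Minimal sketch of the online compression bound: send the first pack with the
# uniform code, then encode each pack with a model trained on everything sent
# so far, and accumulate the total code length in nats.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n, d, n_classes = 4096, 20, 10
X = rng.randn(n, d)
y = (X @ rng.randn(d, n_classes) + rng.gumbel(size=(n, n_classes))).argmax(axis=1)

t = [8 * 2 ** k for k in range(10)]            # pack boundaries 8, 16, ..., 4096 = n

total_nats = t[0] * np.log(n_classes)          # first pack sent with the uniform code
for s in range(len(t) - 1):
    clf = LogisticRegression(max_iter=1000).fit(X[:t[s]], y[:t[s]])
    probs = np.zeros((t[s + 1] - t[s], n_classes))
    probs[:, clf.classes_] = clf.predict_proba(X[t[s]:t[s + 1]])
    probs = 0.99 * probs + 0.01 / n_classes    # guard against classes unseen so far
    # Cost of the next pack under the model trained on the data already sent,
    # i.e. sum_i -ln p_{theta_s}(y_i | x_i) in nats (step 5 of the procedure).
    total_nats += -np.log(probs[np.arange(len(probs)), y[t[s]:t[s + 1]]]).sum()

print(f"online bound: {total_nats / 1000:.1f} knats  "
      f"(uniform bound: {n * np.log(n_classes) / 1000:.1f} knats)")
```

The same loop applies verbatim to a neural network: only the call that fits the model on the data sent so far changes.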

We will call a set of data $y_{t_i}, y_{t_i+1}, \dots, y_{t_{i+1} - 2}, y_{t_{i+1} - 1}$ a pack of data. The success of this method relies on how fast the model generalizes: if the model overfits at the beginning of the online training, the cost of the first labels will be huge, much higher than the cost of encoding them with the uniform distribution over labels.

On MNIST, we used four different models:

- The uniform probability over the labels.
- A fully connected network, or multi-layer perceptron (MLP), with two hidden layers of dimension 256.
- A small convolutional network with two convolutional layers and two fully connected layers (CNN1).
- A bigger convolutional network (CNN2) with 8 convolutional layers with 32, 32, 64, 64, 128, 128, 256 and 256 filters respectively, and max-pooling operators every two convolutional layers.

For the three neural networks, we use Dropout between the fully connected layers, and optimize the network with the Adam algorithm. The successive timesteps $t_1, t_2, \dots, t_S$ are chosen to be the powers of two starting from 8: 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, and so on. The training results can be seen in Figure 9.

Figure 9: MNIST training results. For each timestep, we re-train the neural networks with all the data that has already been sent. We evaluate the loss and the accuracy on the next data pack, for which the model will be used for compression.

We observe that even with few data samples, the models are extremely good in prediction, since they all reach a 90% accuracy score with fewer than 1,500 data samples. With these results, we can compute the corresponding compression bounds (see Figure 10). In Table 1, we sum up the classification and compression results of these models. The best compression bound we get is 2.84 knats.

Model   | Compression bound | Classification score (test)
Uniform | 135 knats         | 10%
MLP     | 21.0 knats        | 97.9%
CNN1    | 4.19 knats        | 98.9%
CNN2    | 2.84 knats        | 99.3%

Table 1: MNIST compression and classification results

On CIFAR, we used four different models:

- The uniform probability over the labels.
- A fully connected network, or multi-layer perceptron (MLP), with two hidden layers of dimension 512.
- A convolutional network (CNN) with four convolutional layers with 32 filters and a max-pooling operator after every two convolutional layers, followed by two fully connected layers of dimension 256.
- A very deep convolutional network (VGG) with 13 convolutional layers (to be detailed).

For the three neural networks, we use Dropout after each layer and optimize the network with the Adam algorithm. For the last model, we also trained the same network with data augmentation and batch normalisation (to be detailed). The successive timesteps for CIFAR, $t_1, t_2, \dots, t_S$, are chosen to be $10 \cdot 2^s$ for $s$ going from 0 to 12: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960. The training results can be seen in Figure 11.

The task of image classification on the CIFAR dataset is obviously harder than the same task on MNIST, and we see it during learning: we need far more data to get good classification results. The best bound we achieve (without the switch, which will be presented in the next section) is 31.4 knats, with the regularized VGG model. We observe that using a regularizer improves the compression bound.

4.5 Online learning with the Switch

We could use the switch distribution in order to improve the compression bounds. Indeed, when we are doing online learning, a model can have very poor results at

Figure 10: MNIST compression results. In the first subfigure, we see the cost of encoding every label. The cost of encoding a label in a data pack with online learning is exactly the loss evaluated on this label with the last model that has been learned; we actually plot the mean cost of encoding a label within each pack, so the line is discontinuous each time we train a new network, and constant elsewhere. In the second subfigure, we plot the difference between the cumulative cost of encoding the data with the uniform distribution and the cumulative cost of encoding the labels with online learning. In the third subfigure, we plot the corresponding compression rate as a function of the quantity of data we want to encode (log scale).

Figure 11: CIFAR training results. For each timestep, we re-train the neural networks with all the data that has already been sent. We evaluate the loss and the accuracy on the next data pack, for which the model will be used for compression.

Model                      | Compression bound | Accuracy (test)
Uniform                    | 115 knats         | 10%
MLP                        | 136 knats         | 41%
CNN                        | 69.2 knats        | 70.6%
VGG                        | 104 knats         | 85%
VGG + reg., data augm.     | 31.4 knats        | 93.3%
VGG + reg., data augm., AS | 24.2 knats        | 93.3%
Switch + AS                | 23.9 knats        | 93.3%

Table 2: CIFAR compression and classification results

the beginning of the learning (it can overfit with too little data) but become very good later, while some simpler models can be good at the beginning but reach a plateau early. As an example, we see on the CIFAR dataset that the VGG model is worse than the CNN model with fewer than 20,000 data samples, and that the situation is then reversed. With the switch, the procedure becomes:

1. Send the data $y_{t_0}, y_{t_0+1}, \dots, y_{t_1 - 1}$ using the uniform distribution.
2. $s \leftarrow 1$.
3. Send a parametric model $p^s_\theta$ and a training procedure.
4. Train $p^s_\theta$ with the data $y_{t_0}, y_{t_0+1}, \dots, y_{t_s - 1}$.

5. Send the data $y_{t_s}, y_{t_s+1}, \dots, y_{t_{s+1} - 1}$ using the model $p^s_{\theta_s}$, which costs:

$\sum_{i = t_s}^{t_{s+1} - 1} -\log_2 p^s_{\theta_s}(y_i \mid x_i)$   (4.13)

6. $s \leftarrow s + 1$.
7. If $s = S$, end. Else go back to step 3.

In the experiments, we observe that the switch only uses two models: the uniform one, and the model that is selected at the end (CNN2 for MNIST and the regularized VGG for CIFAR). This allows saving 2 knats on CIFAR.

What we can do further is to switch between training procedures in addition to models. Indeed, after each data pack we can choose to adapt the learning rate, the optimisation algorithm, the data augmentation procedure, the number of epochs, ... Adding a switch to choose a strategy for all these hyperparameters is very expensive for the sender in computation time, since he has to try all the possible values and select the best one. The only experiment we ran in practice is the simplest one: adding a switch between the same model trained for different numbers of epochs. Indeed, we observe in Figure 12 that there is an optimal number of epochs, after which the model can overfit. Therefore, we only have to specify in our new training procedure the number of epochs after each data pack. We call this method the Auto-Switch (AS), since the model switches with itself at different timesteps. It allows saving 5 more knats on CIFAR.

Figure 12: Evolution of the loss during training with a small number of data samples (1,280). We observe that there is an optimal number of epochs in order to minimize the loss on test data, and consequently the compression cost. Thus, the number of epochs acts as a regularizer.

Remark 8. What would the compression bounds be if we re-trained the model after each new sample in online learning, instead of after each data pack? The bigger these packs are, the longer we have to wait before re-training a new model,

and thus the longer we keep models that are no longer optimal. The true compression bound would be too expensive to compute, but we can try to get an approximation, using a linear interpolation of the loss on new data as a function of the number of samples. This is not a rigorous compression bound, which is why we do not include it in the final table of compression results, but it gives an order of magnitude of what we can expect by reducing the size of the data packs. On CIFAR, we estimate that we could save at least 3.2 more knats and reach a compression bound of 20.7 knats, only by decreasing the size of the data packs.

4.6 Other attempts

Here, we present two additional attempts to improve the compression bounds, which have not given good results so far.

4.6.1 Unsupervised learning and specific high-level features

When doing online learning, the most expensive labels to encode are obviously the first ones, because we do not have much information yet: the deep neural networks have not been able to build their high-level representations of the data yet. We could add to our switch some models that are better than huge neural networks when there are few data samples. The first thing we could do is add models that extract high-level features, and then use simple learning algorithms (linear classification) on top of them. We could also use a hybrid approach, using smaller neural networks with these high-level features as inputs. These methods have reached great results, especially with very small training sets [Oyallon et al., 2017]. We tried to reproduce the experiments from [Oyallon et al., 2017], stacking a scattering transform and residual networks. With this architecture, we only reached the same results as the regularized VGG on CIFAR with small training sets, so it did not improve the compression bounds.

The second thing we could do is use unsupervised learning. Since we are supposing that the sender and the receiver both have all the images of the dataset, the receiver can start by training an unsupervised model on the entire dataset, in order to build good representations of the data. Then, he can train online a small supervised model that takes the unsupervised representations as inputs. We could use, for example, auto-encoders or GANs. These last methods have not been tested yet.

Figure 13: CIFAR compression results. In the first subfigure, we see the cost of encoding every label. The cost of encoding a label in a data pack with online learning is exactly the loss evaluated on this label with the last model that has been learned; we actually plot the mean cost of encoding a label within each pack, so the line is discontinuous each time we train a new network, and constant elsewhere. In the second subfigure, we plot the difference between the cumulative cost of encoding the data with the uniform distribution and the cumulative cost of encoding the labels with online learning. In the third subfigure, we plot the corresponding compression rate as a function of the quantity of data we want to encode (log scale).

4.7 Summary of the experiments: Do deep nets compress?

We have observed that online learning gives the best compression bounds on the data, well below the bounds given by variational learning. This is a very interesting observation, because in this case the models that were the best in prediction are now also selected by a Bayesian model selection method, even though these models seemed too complex at first sight.

Method               | K(MNIST)   | MNIST acc | K(CIFAR)   | CIFAR acc
Uniform encoding     | 135 knats  | 10%       | 115 knats  | 10%
No compression       | ~100 Mnats | 99.5%     | ~500 Mnats | 93%
Model compression    | ~2 Mnats   | 99%       | ~10 Mnats  | 92%
Variational learning | 16.6 knats | 95.5%     | 61.7 knats | 61.6%
Online               | 2.84 knats | 99.5%     | 31.4 knats | 93%
Online + Switch      | 2.81 knats | 99.5%     | 23.9 knats | 93%

Table 3: Summary of the compression bounds given by deep learning methods. K(DATASET) is the compression bound for the labels of this dataset given its pictures; "DATASET acc" is the accuracy of the final model evaluated on a test set. The values marked with "~" are not exact but orders of magnitude: since these bounds are anyway well above the uniform encoding, we only give approximate values.

5 Discussions

5.1 Variational compression with a small training set

We observed in our experiments that variational learning needs many data samples to learn (see Figure 14).

Figure 14: Variational compression rate and accuracy as a function of the number of samples (CIFAR). For several sizes of the training set, we trained a variational model. For each of these sizes, we show the accuracy of the trained model evaluated on a test set, and the compression rate. The compression rate is separated into two parts: the encoding cost of the labels and the encoding cost of the weights.

We see in Figure 14 that the model needs many data samples to learn how to compress, and that it is unable to generalize at all before 20,000 samples. On the contrary, ordinary neural networks start to generalize after 2,000 samples. This suggests that variational methods are far too cautious, and it is a strong argument against them.

5.2 Gap between variational learning and online learning

We have mainly compared two compression bounds, the variational bound and the online learning bound, and observed that the second one is well below the first one. Suppose that $D$ is a dataset. Let us call $B_1(D, p)$ the online compression bound given by the model $p$. Let us take $p_\theta(X)$ a parametric model and $\alpha$ a prior over $\theta$. We define the joint probability distribution $p(D, \theta) = \alpha(\theta)\,p_\theta(D)$. Then, if we


More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Chapter 6 The Structural Risk Minimization Principle

Chapter 6 The Structural Risk Minimization Principle Chapter 6 The Structural Risk Minimization Principle Junping Zhang jpzhang@fudan.edu.cn Intelligent Information Processing Laboratory, Fudan University March 23, 2004 Objectives Structural risk minimization

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Computability Theory

Computability Theory Computability Theory Cristian S. Calude May 2012 Computability Theory 1 / 1 Bibliography M. Sipser. Introduction to the Theory of Computation, PWS 1997. (textbook) Computability Theory 2 / 1 Supplementary

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information

attention mechanisms and generative models

attention mechanisms and generative models attention mechanisms and generative models Master's Deep Learning Sergey Nikolenko Harbour Space University, Barcelona, Spain November 20, 2017 attention in neural networks attention You re paying attention

More information

Predictive MDL & Bayes

Predictive MDL & Bayes Predictive MDL & Bayes Marcus Hutter Canberra, ACT, 0200, Australia http://www.hutter1.net/ ANU RSISE NICTA Marcus Hutter - 2 - Predictive MDL & Bayes Contents PART I: SETUP AND MAIN RESULTS PART II: FACTS,

More information

Kolmogorov complexity

Kolmogorov complexity Kolmogorov complexity In this section we study how we can define the amount of information in a bitstring. Consider the following strings: 00000000000000000000000000000000000 0000000000000000000000000000000000000000

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code Chapter 3 Source Coding 3. An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code 3. An Introduction to Source Coding Entropy (in bits per symbol) implies in average

More information

Nondeterministic finite automata

Nondeterministic finite automata Lecture 3 Nondeterministic finite automata This lecture is focused on the nondeterministic finite automata (NFA) model and its relationship to the DFA model. Nondeterminism is an important concept in the

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Kolmogorov complexity and its applications

Kolmogorov complexity and its applications CS860, Winter, 2010 Kolmogorov complexity and its applications Ming Li School of Computer Science University of Waterloo http://www.cs.uwaterloo.ca/~mli/cs860.html We live in an information society. Information

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY

FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY 15-453 FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY KOLMOGOROV-CHAITIN (descriptive) COMPLEXITY TUESDAY, MAR 18 CAN WE QUANTIFY HOW MUCH INFORMATION IS IN A STRING? A = 01010101010101010101010101010101

More information

3 Self-Delimiting Kolmogorov complexity

3 Self-Delimiting Kolmogorov complexity 3 Self-Delimiting Kolmogorov complexity 3. Prefix codes A set is called a prefix-free set (or a prefix set) if no string in the set is the proper prefix of another string in it. A prefix set cannot therefore

More information

Show that the following problems are NP-complete

Show that the following problems are NP-complete Show that the following problems are NP-complete April 7, 2018 Below is a list of 30 exercises in which you are asked to prove that some problem is NP-complete. The goal is to better understand the theory

More information

Fundamentals of Machine Learning

Fundamentals of Machine Learning Fundamentals of Machine Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://icapeople.epfl.ch/mekhan/ emtiyaz@gmail.com Jan 2, 27 Mohammad Emtiyaz Khan 27 Goals Understand (some) fundamentals of Machine

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

APC486/ELE486: Transmission and Compression of Information. Bounds on the Expected Length of Code Words

APC486/ELE486: Transmission and Compression of Information. Bounds on the Expected Length of Code Words APC486/ELE486: Transmission and Compression of Information Bounds on the Expected Length of Code Words Scribe: Kiran Vodrahalli September 8, 204 Notations In these notes, denotes a finite set, called the

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Introduction to Machine Learning

Introduction to Machine Learning Outline Contents Introduction to Machine Learning Concept Learning Varun Chandola February 2, 2018 1 Concept Learning 1 1.1 Example Finding Malignant Tumors............. 2 1.2 Notation..............................

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ). Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.

More information

Kolmogorov complexity and its applications

Kolmogorov complexity and its applications Spring, 2009 Kolmogorov complexity and its applications Paul Vitanyi Computer Science University of Amsterdam http://www.cwi.nl/~paulv/course-kc We live in an information society. Information science is

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information