Information, Learning and Falsification

1 Information, Learning and Falsification
David Balduzzi
December 17, 2011
Max Planck Institute for Intelligent Systems, Tübingen, Germany

6 Three main theories of information:
Algorithmic information. Description. The information embedded in a single string depends on its shortest description.
Shannon information. Transmission. The information transmitted by symbols depends on the transmission probabilities of other symbols in an ensemble.
Statistical learning theory. Prediction. The information about the world embedded in a classifier (its expected error) depends on the complexity of the learning algorithm.
Can these be related?
Effective information. Discrimination. The information produced by a physical process when it produces an output depends on how sharply it discriminates between inputs.

7 Effective information

8 Nature decomposes into specific, bounded physical systems, which we model as deterministic functions $f : X \to Y$ or, more generally, as Markov matrices $p_m(y \mid x)$, where $X$ and $Y$ are finite sets.

9 Physical processes discriminate between inputs.
[Figure: thermometer]

10 Definition. The discrimination given by Markov matrix $m$ outputting $y$ is
$$\hat{p}_m(x \mid y) := \frac{p_m(y \mid do(x)) \, p_{unif}(x)}{p_m(y)},$$
where $p_m(y) := \sum_x p_m(y \mid do(x)) \, p_{unif}(x)$ is the effective distribution.
Definition. Effective information is the Kullback-Leibler divergence
$$ei(m, y) := H\big[\hat{p}_m(X \mid y) \,\big\|\, p_{unif}(X)\big].$$
Balduzzi and Tononi, PLoS Computational Biology, 2008
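The two definitions above can be sketched numerically. The snippet below is a minimal illustration, not code from the talk: it assumes the Markov matrix is stored as a nested list `m[x][y]` holding $p_m(y \mid do(x))$, measures information in bits, and uses a made-up channel `noisy` purely as an example input.

```python
import math


def effective_information(m, y):
    """Return ei(m, y) in bits for a Markov matrix m[x][y] = p_m(y | do(x)).

    Inputs are assigned the uniform distribution, as in the definition.
    """
    n_inputs = len(m)
    p_unif = 1.0 / n_inputs
    # Effective distribution: p_m(y) = sum_x p_m(y | do(x)) * p_unif(x)
    p_y = sum(row[y] * p_unif for row in m)
    if p_y == 0:
        raise ValueError("output y never occurs under the effective distribution")
    # Discrimination, computed with Bayes' rule
    discrimination = [row[y] * p_unif / p_y for row in m]
    # Kullback-Leibler divergence of the discrimination from the uniform distribution
    return sum(q * math.log2(q / p_unif) for q in discrimination if q > 0)


# Hypothetical example data: 4 inputs, 2 outputs.
deterministic = [[1, 0], [1, 0], [0, 1], [0, 1]]
noisy = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

ei_deterministic = effective_information(deterministic, 0)
ei_noisy = effective_information(noisy, 0)
```

In the deterministic example, output 0 rules out half of the four inputs, so it generates one bit; the noisy channel discriminates less sharply and generates less.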

11 Special case: deterministic $f : X \to Y$
Definition. The discrimination given by $f$ outputting $y$ assigns equal probability to all elements of the pre-image $f^{-1}(y)$.
Definition. Effective information is
$$ei(f, y) := -\log \frac{|f^{-1}(y)|}{|X|}.$$
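For a deterministic function the computation reduces to counting the pre-image. The following is a small sketch under stated assumptions: the "thermometer" that rounds temperatures down to the nearest 10 is a hypothetical example, and the logarithm is taken base 2.

```python
import math


def ei_deterministic(f, inputs, y):
    """Return ei(f, y) = -log2(|f^{-1}(y)| / |X|) for a function on a finite set."""
    inputs = list(inputs)
    preimage_size = sum(1 for x in inputs if f(x) == y)
    if preimage_size == 0:
        raise ValueError("y is not in the image of f")
    return -math.log2(preimage_size / len(inputs))


def coarse_thermometer(t):
    """Hypothetical thermometer: reports the temperature rounded down to a multiple of 10."""
    return 10 * (t // 10)


def exact_thermometer(t):
    """Hypothetical thermometer that reports the temperature exactly."""
    return t


temperatures = range(100)
ei_coarse = ei_deterministic(coarse_thermometer, temperatures, 20)
ei_exact = ei_deterministic(exact_thermometer, temperatures, 20)
```

The coarse reading is compatible with 10 of the 100 inputs, the exact reading with only one, so the exact thermometer generates more effective information.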

12 [Figure: the discrimination is the set of thermometer inputs compatible with the thermometer's output; ei = -log(size of that set / size of the input set).]

13 Algorithmic information

14 Definition. Given a universal prefix Turing machine $T$, the Kolmogorov complexity of string $s$ is
$$K_T(s) := \min_{\{i \,:\, T(i) = s\}} \mathrm{len}(i),$$
the length of the shortest program that generates $s$.
For any universal Turing machine $U \neq T$, there exists a constant $c$ such that
$$K_U(s) - c \le K_T(s) \le K_U(s) + c$$
for all $s$.

15 Definition. Given $T$, the (unnormalized) Solomonoff prior probability of string $s$ is
$$p_T(s) := \sum_{\{i \,:\, T(i) = s\}} 2^{-\mathrm{len}(i)},$$
where the sum is over strings $i$ that cause $T$ to output $s$ as a prefix, and no proper prefix of $i$ outputs $s$.
The Turing machine discriminates between programs according to which strings they output; the Solomonoff prior counts how many programs are in each class (weighted by length).

16 Kolmogorov complexity = Algorithmic probability
Theorem (Levin). For all $s$,
$$-\log p_T(s) = K_T(s)$$
up to an additive constant $c$.
Upshot: for my purposes, Solomonoff's formulation of Kolmogorov complexity is the right one:
$$K_T(s) := -\log p_T(s).$$

17 Recall, the effective distribution was the denominator when computing discriminations using Bayes' rule:
$$\hat{p}_m(x \mid y) := \frac{p_m(y \mid do(x)) \, p_{unif}(x)}{p_m(y)}.$$

18 Solomonoff prior $\leftrightarrow$ Effective distribution
Proposition. The effective distribution on $Y$ induced by $f$ is
$$p_f(y) = \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)}.$$
Compare with the Solomonoff distribution:
$$p_T(s) := \sum_{\{i \,:\, T(i) = s\}} 2^{-\mathrm{len}(i)}.$$
Compute the effective distribution by replacing the universal Turing machine $T$ with $f : X \to Y$, and giving inputs length $\mathrm{len}(x) = \log |X|$ in the optimal code for the uniform distribution on $X$.

19 Kolmogorov complexity $\leftrightarrow$ Effective information
Proposition. For a function $f : X \to Y$, effective information equals
$$ei(f, y) = -\log p_f(y) = -\log \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)}.$$
Compare with Kolmogorov complexity:
$$K_T(s) = -\log p_T(s) = -\log \sum_{\{i \,:\, T(i) = s\}} 2^{-\mathrm{len}(i)}.$$
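The proposition follows in one line once every input is given the code length $\mathrm{len}(x) = \log |X|$ stated on the previous slide. The derivation below is a worked restatement under that assumption (logarithms base 2), not an additional result:

```latex
\begin{align*}
p_f(y) &= \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)}
        = \sum_{\{x \,:\, f(x) = y\}} 2^{-\log |X|}
        = \frac{|f^{-1}(y)|}{|X|}, \\
-\log p_f(y) &= -\log \frac{|f^{-1}(y)|}{|X|} = ei(f, y).
\end{align*}
```

For instance, if $|X| = 8$ and three inputs map to $y$, then $p_f(y) = 3/8$ and $ei(f, y) = \log(8/3) \approx 1.415$ bits.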

20 Statistical learning theory

21 Hypothesis space
Given unlabeled data $D = (x_1, \ldots, x_l) \in X^l$, let the hypothesis space $\Sigma_D = \{\sigma : D \to \pm 1\}$ be the set of all possible labelings.
[Figure: hypothesis space]

22 Setup
Suppose data $D = (x_1, \ldots, x_l)$ is drawn from an unknown probability distribution $P_X$ and labeled $y_i = \sigma(x_i)$ by an unknown supervisor $\sigma \in \Sigma_X$.
The learning problem: Find a classifier $\hat{f}$ guaranteed to perform well on future (unseen) data sampled via $P_X$ and labeled by $\sigma$.

25 Empirical risk minimization
Suppose we are given a class $F$ of functions to work with. A simple algorithm for tackling the learning problem is:
Algorithm: Given data labeled by $\sigma \in \Sigma_D$, find the classifier $\hat{f} \in F \subset \Sigma_D$ that minimizes empirical risk:
$$\hat{f} := \arg\min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} I_{f(x_i) \neq \sigma(x_i)}$$
Key step. Reformulate the algorithm as a function between finite sets:
Empirical risk minimization (from hypothesis space to empirical risk):
$$R_{F,D} : \Sigma_D \to \mathbb{R} : \sigma \mapsto \min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} I_{f(x_i) \neq \sigma(x_i)}$$
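The reformulation of ERM as a function between finite sets can be made concrete by brute force. This is an illustrative sketch only: each function in the class is represented by the tuple of its values on the data, and the class `thresholds` (threshold classifiers on three ordered points) is a hypothetical example, not one used in the talk.

```python
from fractions import Fraction
from itertools import product


def min_empirical_risk(function_class, labeling):
    """Smallest fraction of points on which any f in the class disagrees with the labeling."""
    l = len(labeling)
    return min(
        Fraction(sum(fx != sx for fx, sx in zip(f, labeling)), l)
        for f in function_class
    )


def erm_map(function_class, l):
    """Tabulate R_{F,D}: every labeling of l points -> minimal empirical risk."""
    return {
        sigma: min_empirical_risk(function_class, sigma)
        for sigma in product((-1, 1), repeat=l)
    }


# Hypothetical class: threshold classifiers restricted to l = 3 ordered points.
thresholds = [(-1, -1, -1), (-1, -1, 1), (-1, 1, 1), (1, 1, 1)]
R = erm_map(thresholds, 3)
fitted = [sigma for sigma, risk in R.items() if risk == 0]
```

Here `R` has one entry per labeling in the hypothesis space, and `fitted` lists the labelings the class reproduces with zero training error.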

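26 [Figure: two function classes. $F_1$ has low capacity and fits few hypotheses; $F_2$ has high capacity and fits many hypotheses. Each empirical risk minimizer, $R_1$ and $R_2$, maps hypothesis space to training errors $0, \varepsilon_1, \varepsilon_2, \varepsilon_3$.]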

27 Theorem (standard template for error bounds in SLT). With probability $1 - \delta$,
(expected error of learner) $\le$ (historical error) + (capacity of algorithm) + (confidence term)
[Slide annotations: UNDERFITTING, OVERFITTING]

28 Minimizing empirical risk $R_{F,D} : \Sigma_X \to \mathbb{R}$ is a physical process.
Questions:
Q1. What is the effective distribution ("Solomonoff prior") of the ERM?
Q2. What is the effective information ("Kolmogorov complexity") of its outputs?

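29 Effective distribution $\leftrightarrow$ Rademacher complexity
Proposition (Solomonoff $\leftrightarrow$ Rademacher). The expectation of the ERM over the effective distribution is empirical Rademacher complexity:
$$\sum_{\varepsilon \in \mathbb{R}} \varepsilon \cdot p_{R_{F,D}}(\varepsilon) = \frac{1}{2}\big(1 - \mathrm{Rademacher}(F, D)\big)$$
[Figure: ERM outputs $0, \varepsilon_1, \varepsilon_2, \varepsilon_3$]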
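The identity can be checked exactly on a small example. The sketch below makes two assumptions that are not stated on the slide: empirical Rademacher complexity is normalized as $\mathbb{E}_\sigma \sup_{f \in F} \frac{1}{l} \sum_i \sigma_i f(x_i)$ with $\sigma$ uniform on $\{\pm 1\}^l$, and the function class `thresholds` is a hypothetical one chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def min_empirical_risk(function_class, sigma):
    """Output of the ERM map on labeling sigma."""
    l = len(sigma)
    return min(
        Fraction(sum(fx != sx for fx, sx in zip(f, sigma)), l)
        for f in function_class
    )


def expected_erm_output(function_class, l):
    """Mean ERM output when labelings are drawn uniformly (the effective distribution)."""
    sigmas = list(product((-1, 1), repeat=l))
    return sum(min_empirical_risk(function_class, s) for s in sigmas) / len(sigmas)


def empirical_rademacher(function_class, l):
    """E_sigma sup_f (1/l) sum_i sigma_i f(x_i), computed by exhaustive enumeration."""
    sigmas = list(product((-1, 1), repeat=l))
    total = sum(
        max(Fraction(sum(s * fx for s, fx in zip(sigma, f)), l) for f in function_class)
        for sigma in sigmas
    )
    return total / len(sigmas)


# Hypothetical class: threshold classifiers restricted to l = 3 ordered points.
thresholds = [(-1, -1, -1), (-1, -1, 1), (-1, 1, 1), (1, 1, 1)]
lhs = expected_erm_output(thresholds, 3)
rhs = Fraction(1, 2) * (1 - empirical_rademacher(thresholds, 3))
```

Both sides are computed independently with exact rational arithmetic, so agreement is not an artifact of rounding.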

30 Effective information $\leftrightarrow$ VC-entropy
Proposition (Kolmogorov $\leftrightarrow$ Vapnik). The effective information generated by the ERM when it outputs 0 is empirical VC-entropy:
$$ei(R_{F,D}, 0) = -\log p_{R_{F,D}}(0) = l - \text{VC-entropy}(F, D),$$
where $l$ is the amount of training data.
[Figure: ERM outputs $0, \varepsilon_1, \varepsilon_2, \varepsilon_3$]
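A brute-force check of this proposition on a toy class is sketched below. It assumes that empirical VC-entropy means $\log_2$ of the number of distinct labelings of $D$ realized by $F$; the class `thresholds` is again a hypothetical example.

```python
import math
from itertools import product


def ei_zero_training_error(function_class, l):
    """ei(R_{F,D}, 0): -log2 of the fraction of labelings that the class fits exactly."""
    realized = {tuple(f) for f in function_class}
    sigmas = list(product((-1, 1), repeat=l))
    n_fitted = sum(1 for sigma in sigmas if sigma in realized)
    return -math.log2(n_fitted / len(sigmas))


def empirical_vc_entropy(function_class):
    """log2 of the number of distinct labelings of the data realized by the class."""
    return math.log2(len({tuple(f) for f in function_class}))


# Hypothetical class: threshold classifiers restricted to l = 3 ordered points.
thresholds = [(-1, -1, -1), (-1, -1, 1), (-1, 1, 1), (1, 1, 1)]
l = 3
ei_zero = ei_zero_training_error(thresholds, l)
vc_entropy = empirical_vc_entropy(thresholds)
```

With 4 of the 8 labelings fitted, the ERM generates 1 bit when it outputs zero error, which matches $l$ minus the VC-entropy of 2 bits.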

31 Corollary (reformulation of error bounds in SLT). With probability $1 - \delta$,
(expected output of ERM) $\le$ (historical output of ERM) + (discrimination of inputs by ERM) + (confidence term)
[Figure: the ERM maps hypotheses to errors $0, \varepsilon_1, \varepsilon_2, \varepsilon_3$. Annotations: how ERM discriminates inputs; what ERM outputs.]

32 Falsification

34 Karl Popper wanted to justify scientific knowledge. He was very impressed by Einstein's bold conjecture about the Sun's gravitational field bending starlight which, when proved correct, overthrew Newtonian physics despite an enormous body of evidence in favor of Newton.
Popper's big idea: Rely on theories that have been severely tested, rather than theories supported by lots of facts.
Unfortunately, Popper failed to justify his big idea.

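35 Counting falsified hypotheses. Rademacher complexity.
$$\sum_{\varepsilon \in \mathbb{R}} \varepsilon \cdot p_{R_{F,D}}(\varepsilon) = \frac{1}{2}\big(1 - \mathrm{Rademacher}(F, D)\big)$$
$$\sum_{\varepsilon \in \mathbb{R}} p_{R_{F,D}}(\varepsilon) \cdot \varepsilon = \sum_{\varepsilon \in \mathbb{R}} \varepsilon \cdot \big(\text{fraction of hypotheses ERM falsifies on fraction } \varepsilon \text{ of data}\big) = \big(\text{weighted count of falsified hypotheses}\big)$$
[Figure: ERM outputs $0, \varepsilon_1, \varepsilon_2, \varepsilon_3$]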

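36 Counting falsified hypotheses. VC-entropy.
$$ei(R_{F,D}, 0) = l - \text{VC-entropy}(F, D)$$
$$ei(R_{F,D}, 0) = \underbrace{\log |\Sigma_X|}_{\text{total \# hypotheses}} - \underbrace{\log |R_{F,D}^{-1}(0)|}_{\text{\# hypotheses ERM fits}} = \big(\text{logarithmic count of falsified hypotheses}\big)$$
[Figure: ERM outputs $0, \varepsilon_1, \varepsilon_2, \varepsilon_3$]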

38 Back to Popper and justifying scientific knowledge.
Minimal model of Popper's question: When can we trust generalizations based on training error?
Answer: If the empirical risk minimizer has small capacity.
ERM has small capacity $\Leftrightarrow$ ERM falsifies many hypotheses.

39 Conclusion

41 Philosophy
A major theme of 20th-century mathematics was the transition from set theory (a language for talking about points = elements) to category theory (a language for talking about arrows = functions).
This talk: substituted thinking about sets (e.g. function class $F \subset \Sigma_X$) with thinking about the structure of the arrow $\mathrm{ERM} : \Sigma_X \to \mathbb{R}$ from hypothesis space to training errors.
Immediate consequences:
1. SLT $\leftrightarrow$ algorithmic information theory
2. SLT $\leftrightarrow$ falsification

42 Conclusion
Physical processes discriminate between inputs.
Effective information is a non-universal analog of Kolmogorov complexity (universal Turing machine $\to$ finite function).
Information generated while minimizing empirical risk
1. controls error bounds (SLT), and
2. is expressed in terms of the number of falsified hypotheses.
Conjecture: effective information generated by optimizations other than ERM also controls future performance.

43 Thank you!
