An Introduction to No Free Lunch Theorems

Conrad Doyle

1 February 2, 2012

2 Table of Contents

3 Induction Learning without direct observation. Generalising from data. Modelling physical phenomena.

4 The Problem of Induction David Hume (1748) How do we know an induced hypothesis is correct/likely? Simplicity of explanation? Agreement with subsequent observations? Because it usually is? Induction often works, and, by induction, is a plausible epistemic process.

5 Falsifiability Karl Popper No hypothesis is unconditionally accepted. The possibility of falsification, and agreement with new observations, supports a hypothesis. [Diagram: Observations, Hypotheses, Falsification, Theory, Generate Hypotheses]

6 No Free Lunch Theorems Wolpert and Macready (1992) In search problems: No search algorithm yields consistently better optima than any other over all possible search landscapes. In learning problems: No learning algorithm achieves consistently better generalisation performance than any other over all possible target functions.

7 Preliminaries Target function $f : X \to Y$. Observed data $D = \{\langle x_1, y_1 \rangle, \ldots, \langle x_N, y_N \rangle\}$, where $x_i \in X$, $y_i = f(x_i)$. We want a hypothesis $h$ which is close to $f$ on unobserved data ($X \setminus D$).

8 Binary NFL Example $X = \{1, 2\}$, $Y = \{0, 1\}$. $f \in \{f_1, \ldots, f_4\}$. $D = \{\langle 1, 0 \rangle\}$. Once we see $D$, can we predict $f(2)$ any more accurately? [Table: values of $f_1(x), \ldots, f_4(x)$ for each $x \in X$]

9 Binary NFL Example What about using validation data? Is that like falsification? $X = \{1, 2, 3\}$, $Y = \{0, 1\}$. $f \in \{f_1, \ldots, f_8\}$. $D = \{\langle 1, 0 \rangle, \langle 3, 1 \rangle\}$. [Table: values of $f_1(x), \ldots, f_8(x)$ for each $x \in X$]
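The validation question can be checked by brute force. The following is a minimal sketch, not taken from the slides: it enumerates every function from {1, 2, 3} to {0, 1}, keeps those that agree with D = {(1, 0), (3, 1)}, and counts how the survivors vote at the unseen point x = 2. The helper names (`all_functions`, `consistent`) are illustrative.

```python
from itertools import product

X = [1, 2, 3]
D = {1: 0, 3: 1}  # observed (x, y) pairs from the slide


def all_functions(domain):
    """Yield every function domain -> {0, 1}, represented as a dict."""
    for values in product([0, 1], repeat=len(domain)):
        yield dict(zip(domain, values))


def consistent(f, data):
    """True if f reproduces every observed pair in data."""
    return all(f[x] == y for x, y in data.items())


# Hypotheses that survive "falsification" by the observed data.
survivors = [f for f in all_functions(X) if consistent(f, D)]

# How many survivors predict f(2) = 1?
votes_for_one = sum(f[2] for f in survivors)

print(len(survivors), votes_for_one)
```

Half of the surviving functions predict 0 at x = 2 and half predict 1, so holding out data does not by itself break the symmetry.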

10 Arbitrary Functions $X = \{\ldots\}^2$, $Y = \{0, 1\}$. There are … possible $f : X \to Y$; according to Matlab, … = Inf. rng(a, b) generates the a-th number from a Mersenne Twister with seed b. The function $f(x) = \delta[\mathrm{rng}(x_1, x_2\ 793) > 0.5]$ is shown below: [Figure: plot of $f$ over $X$]

12 Bayesian Inference Integrate over the space of plausible hypotheses, $H$:
$$P(f(x) = y \mid D) = \int_H \delta[h(x) = y]\, P(h \mid D)\, dh \tag{1}$$
This uses the prediction of $h$ on $x$, weighted by the posterior probability of $h$. Bayes' Theorem gives us
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
Usually, we know/assume a model, and integrate over its parameters.

13 Priors The prior on $h$, $P(h)$, describes a state of knowledge before observing $D$. Often picked pragmatically (conjugate to $P(D \mid h)$ so that $P(D \mid h)\, P(h)$ has a convenient form). Obtaining a useful prior from first principles is difficult (What do you know about the problem? How do you know it?)

14 Maximum Entropy We should use the least informative (most entropic) prior that agrees with our prior information. If we know nothing, MaxEnt gives a uniform prior, $P(h) \propto 1$. When we have some information, our prior should describe it without implying any further assumptions. No prior can give better expected performance than the MaxEnt prior.

15 Bayesian Inference/Maximum Entropy and NFL [Table: values of $f_1(x), \ldots, f_4(x)$ for each $x \in X$] $D = \{\langle 1, 0 \rangle\}$.
Maximum Entropy prior: $\forall i : P(h = f_i) = \frac{1}{4}$.
Likelihood: $P(D \mid h = f_1) = P(D \mid h = f_2) = \frac{1}{2}$, $P(D \mid h = f_3) = P(D \mid h = f_4) = 0$.
Marginal likelihood: $P(D) = \frac{1}{4}$.

16 Bayesian Inference/Maximum Entropy and NFL [Table: values of $f_1(x), \ldots, f_4(x)$ for each $x \in X$]
Posteriors: $P(h = f_1 \mid D) = P(h = f_2 \mid D) = \frac{1}{2}$, $P(h = f_3 \mid D) = P(h = f_4 \mid D) = 0$.
Prediction:
$$P(f(2) = y \mid D) = \sum_{i=1}^{4} \delta[f_i(2) = y]\, P(h = f_i \mid D) \tag{2}$$
$$P(f(2) = 0 \mid D) = P(f(2) = 1 \mid D) = \frac{1}{2}$$
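The arithmetic of this example can be reproduced with exact fractions. This is a sketch under two assumptions that the extracted slides do not state explicitly: the four functions are enumerated in the order (0,0), (0,1), (1,0), (1,1), and the likelihood of 1/2 arises because the observed input is drawn uniformly from X = {1, 2}.

```python
from fractions import Fraction
from itertools import product

X = [1, 2]
x_obs, y_obs = 1, 0  # D = {(1, 0)}

# All four functions {1, 2} -> {0, 1}; this ordering is an assumption.
hypotheses = [dict(zip(X, values)) for values in product([0, 1], repeat=len(X))]

# Maximum-entropy (uniform) prior over the four hypotheses.
prior = [Fraction(1, len(hypotheses))] * len(hypotheses)


def likelihood(h):
    """P(D | h), assuming x is uniform on X and y = h(x) without noise."""
    return Fraction(1, len(X)) if h[x_obs] == y_obs else Fraction(0)


joint = [likelihood(h) * p for h, p in zip(hypotheses, prior)]
evidence = sum(joint)                      # marginal likelihood P(D)
posterior = [j / evidence for j in joint]  # Bayes' theorem


def predict(x, y):
    """P(f(x) = y | D): posterior mass of hypotheses with h(x) = y."""
    return sum(p for h, p in zip(hypotheses, posterior) if h[x] == y)


print(evidence, posterior, predict(2, 0), predict(2, 1))
```

The posterior concentrates on the two functions consistent with D, yet the prediction at the unseen point x = 2 stays at one half for each label.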

17 Bayesian Inference and NFL Generalisation requires a non-uniform prior. Bayesian inference gives the correct result: no violation of NFL. Later: falsifiability and Bayesian priors.

18 Structural Risk Minimisation Vapnik and Chervonenkis Minimise empirical risk ($e_{tr}$), while considering model complexity. Central result: a bound on generalisation error ($e_{gen}$) based on training error and model complexity.

19 Confidence Intervals SRM is based on the convergence of empirical estimates to expected values. The Hoeffding bound tells us how quickly the empirical mean $\bar{X}$ converges to the expectation:
$$\forall \varepsilon : P(E[X] - \bar{X} > \varepsilon) \leq e^{-2\varepsilon^2 N} \tag{3}$$
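To get a feel for the rate of convergence, the bound can be evaluated numerically. The sketch below assumes each sample lies in [0, 1] (the usual condition for this form of Hoeffding's inequality, not stated on the slide); the function names are illustrative.

```python
import math


def hoeffding_bound(eps, n):
    """Upper bound exp(-2 * eps^2 * n) on P(E[X] - mean > eps), samples in [0, 1]."""
    return math.exp(-2.0 * eps ** 2 * n)


def samples_needed(eps, eta):
    """Smallest n for which the bound exp(-2 * eps^2 * n) is at most eta."""
    return math.ceil(math.log(1.0 / eta) / (2.0 * eps ** 2))


n = samples_needed(0.1, 0.05)
print(n, hoeffding_bound(0.1, n))
```

For a deviation of 0.1 and a failure probability of 0.05, the bound requires 150 samples; halving the deviation roughly quadruples that number.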

20 Simplified VC Bound Learning algorithms choose an $h$ from a hypothesis space $H$ (e.g. the space of linear models). Let $H = \{h\}$, $D \subset X$, and $h$ be independent of $D$. Hence, $h(x) : x \in D$ are i.i.d. samples from $h(x) : x \in X$, and Hoeffding's inequality applies. We can solve $e^{-2\varepsilon^2 N} = \eta$ to give, with probability at least $1 - \eta$:
$$e_{gen} = E[e_{tr}] \leq e_{tr} + \sqrt{\frac{-\log(\eta)}{2N}} \tag{4}$$
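The step from Hoeffding's inequality to the single-hypothesis bound can be written out explicitly. This is a worked sketch, assuming errors are bounded in [0, 1]: set the right-hand side of the Hoeffding bound equal to the allowed failure probability $\eta$ and solve for $\varepsilon$.

```latex
\begin{align*}
e^{-2\varepsilon^2 N} = \eta
  &\iff -2\varepsilon^2 N = \log \eta \\
  &\iff \varepsilon^2 = \frac{-\log \eta}{2N} \\
  &\iff \varepsilon = \sqrt{\frac{-\log \eta}{2N}}
\end{align*}
```

Since $\eta < 1$, $-\log \eta$ is positive and the square root is real. For instance, with $N = 1000$ and $\eta = 0.05$, $\varepsilon \approx 0.039$.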

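21 Finite-Hypothesis VC Bounds For a finite hypothesis space $H$, apply the union bound:
$$P\left(\sup_{h \in H}[e_{gen} - e_{tr}] > \varepsilon\right) \leq |H|\, e^{-2\varepsilon^2 N} \tag{5}$$
With probability at least $1 - \eta$:
$$e_{gen} \leq e_{tr} + \sqrt{\frac{\log |H| - \log \eta}{2N}} \tag{6}$$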
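The width of the finite-hypothesis confidence interval can be evaluated directly. A minimal sketch, with an illustrative function name; it computes only the square-root term, i.e. the gap added to the training error.

```python
import math


def finite_class_gap(num_hypotheses, n, eta):
    """Gap term sqrt((log|H| - log(eta)) / (2N)) of the finite-hypothesis bound."""
    return math.sqrt((math.log(num_hypotheses) - math.log(eta)) / (2.0 * n))


# With a single hypothesis the term reduces to the single-hypothesis bound.
single = finite_class_gap(1, 1000, 0.05)
# A larger hypothesis space widens the interval, but only logarithmically.
million = finite_class_gap(10 ** 6, 1000, 0.05)
print(single, million)
```

Because the size of the hypothesis space enters through its logarithm, going from one hypothesis to a million widens the interval far less than a million-fold.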

22 VC Dimension Usually, $|H| = \infty$. However, some $H$ are not infinitely expressive (cannot describe any possible function). The VC Dimension is a measure of model complexity. Choose $d$ points in $X$, and if $h$ can shatter those points, $VC(h) \geq d$. To shatter a dataset, $h$ must be able to correctly classify the points under any labelling.
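Shattering can be tested by enumeration for simple model classes. The sketch below uses one-dimensional threshold classifiers, h_t(x) = 1 if x >= t, as an illustrative class chosen for this example (it does not appear in the slides), and assumes the input points are distinct.

```python
from itertools import product


def threshold_labelings(points):
    """All labelings of distinct points achievable by h_t(x) = [x >= t]."""
    pts = sorted(points)
    # One threshold below all points, one between each neighbouring pair,
    # and one above all points cover every distinct behaviour of the class.
    cuts = [pts[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    cuts += [pts[-1] + 1.0]
    return {tuple(int(x >= t) for x in points) for t in cuts}


def is_shattered(points):
    """True if every 0/1 labelling of points is achieved by some threshold."""
    achievable = threshold_labelings(points)
    return all(lab in achievable for lab in product([0, 1], repeat=len(points)))


print(is_shattered([0.0]), is_shattered([0.0, 1.0]))
```

A single point can be shattered, but no pair can, because a threshold can never label the left point 1 and the right point 0. The VC dimension of this class is therefore 1.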

23 General VC Bound The VC dimension can be used analogously to hypothesis set cardinality to produce a general VC Bound for models of finite VC dimension $d$. With probability at least $1 - \eta$:
$$e_{gen} \leq e_{tr} + \sqrt{\frac{d(\log(2N/d) + 1) - \log(\eta/4)}{N}} \tag{7}$$
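The general bound has the same shape as the finite-hypothesis one, with the VC dimension taking the place of the log-cardinality. A minimal sketch of the gap term, assuming the standard form $\sqrt{(d(\log(2N/d) + 1) - \log(\eta/4))/N}$; the function name is illustrative.

```python
import math


def vc_gap(d, n, eta):
    """Gap term sqrt((d * (log(2N/d) + 1) - log(eta/4)) / N) of the VC bound."""
    return math.sqrt((d * (math.log(2.0 * n / d) + 1.0) - math.log(eta / 4.0)) / n)


small = vc_gap(10, 10000, 0.05)   # low-complexity model
large = vc_gap(100, 10000, 0.05)  # higher-complexity model, same data
print(small, large)
```

With 10,000 examples, a model of VC dimension 10 gets a gap of a little under 0.1, while a model of VC dimension 100 gets a gap of about 0.25.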

24 SRM and NFL VC Bounds are confidence intervals on the representativeness of the training performance. There is no fundamental difference between the single-hypothesis bound and the VC dimension bound. We show how the single-hypothesis bound relates to NFL; other bounds can be understood analogously.

25 SRM and NFL $X = \mathbb{N}$, $Y = \{0, 1\}$. Let $f$ be produced by independent infinite sequences of coin tosses, and $\forall x : h(x) = 1$. A priori, we know agreement between $h$ and $f$ is completely random. Now pick $N$ training examples. What does it mean if $\sum_{i=1}^{N} \delta[f(x_i) = h(x_i)]$ is large? If we know nothing useful about $P(f)$, then VC bounds only tell us how unlikely $D$ is.
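A finite version of this thought experiment can be checked exhaustively. The sketch below replaces the natural numbers with a six-point domain (an assumption made so that every target function can be enumerated), treats all target functions as equally likely (equivalent to fair coin tosses), and groups them by how well the constant hypothesis h(x) = 1 agrees with them on three training points.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

X = list(range(6))           # small stand-in for the natural numbers
train, test = X[:3], X[3:]   # training points and off-training-set points


def h(x):
    """The constant hypothesis from the slide: always predict 1."""
    return 1


# Map: number of training agreements -> off-training-set accuracies.
by_train_score = defaultdict(list)
for values in product([0, 1], repeat=len(X)):
    f = dict(zip(X, values))  # one equally likely target function
    train_hits = sum(f[x] == h(x) for x in train)
    test_acc = Fraction(sum(f[x] == h(x) for x in test), len(test))
    by_train_score[train_hits].append(test_acc)

# Average off-training-set accuracy for each level of training agreement.
mean_test_acc = {k: sum(v) / len(v) for k, v in by_train_score.items()}
print(mean_test_acc)
```

The average off-training-set accuracy is one half whether h matched none or all of the training labels, so a high training score here only says that an unlikely D was drawn.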

26 Vapnik, SRM & Falsifiability Vapnik is an Instrumentalist. General density estimation is ill-posed (no unique solution), while finding a discriminant is well-posed; hence Vapnik wants to predict rather than understand. He describes the VC dimension as quantifying the falsifiability of hypothesis classes, because complex algorithms are unfalsifiable (e.g. 1-NN). Vapnik attributes the success of falsifiability primarily to its empirical properties.

27 Deutsch, Bayesian Inference & Falsifiability The physicist David Deutsch describes falsifiability from a realist perspective. To Bayesians (I think) estimating $P(f(x) = y \mid D)$ is well-posed because a prior forces the solution to be unique. Deutsch claims that falsifiability favours hypotheses with explanatory power (because priors are generated using our understanding of physical phenomena). Deutsch attributes the success of falsifiability primarily to the way we create candidate theories (using prior information).

28 NFL & Falsifiability Whether falsifiability is justified by its empirical properties, or its reliance on priors, I'm not sure why it circumvents NFL issues. Empirical falsifiability (choose a hypothesis that exposes itself to falsification but still performs well) seems similar to cross-validation. Falsifiability which relies upon a good prior needs us to generate this prior; I suppose one should just hope to have been born with a good prior.

29 Summary 1 Induction and Falsifiability describe two ways of generalising from observations. 2 No Free Lunch theorems show that assumption-free learning is impossible. 3 Bayesian Inference suffers from NFL issues when we apply Maximum Entropy in the absence of prior knowledge. 4 SRM appears to give assumption-free bounds on generalisation, but actually bounds the likelihood of the data. 5 Both approaches can be considered as exploiting falsifiability, but in different ways.

30 Conclusions There are strong parallels between epistemic philosophy and machine learning. We should be skeptical of assumption-free theoretical results. However, if we truly refused to make any unjustified assumptions, inability to predict the generalisation performance of learning algorithms would not be our biggest problem!

31 Questions? Comments? Discussion?
