Is there an Elegant Universal Theory of Prediction?


Is there an Elegant Universal Theory of Prediction?
Shane Legg
Dalle Molle Institute for Artificial Intelligence, Manno-Lugano, Switzerland
17th International Conference on Algorithmic Learning Theory

Is there an Elegant Universal Theory of Prediction?

Solomonoff's incomputable model of induction rapidly learns to make optimal predictions for any computable sequence, including probabilistic ones. Lempel-Ziv, Context Tree Weighting and other computable predictors can predict some computable sequences.

Question: Does there exist an elegant computable predictor that is in some sense universal, or at least universal over large sets of simple sequences?

Basic notation

B := {0, 1}
B* := the set of finite binary strings
B^∞ := the set of infinite binary sequences
x_n := the n-th symbol of a string x ∈ B*
x_{i:j} := the substring x_i x_{i+1} ... x_{j-1} x_j
|x| := the length of the string x
f(x) <+ g(x) :⇔ f(x) < g(x) + c for some constant c independent of x

Kolmogorov complexity

In this work we will use Kolmogorov complexity to measure the complexity of both sequences and strings. For a sequence ω ∈ B^∞ and a universal Turing machine U,

K(ω) := min_{p ∈ B*} { |p| : U(p) = ω }.

In words: the Kolmogorov complexity of a sequence ω is the length of the shortest program that generates ω. For a string x ∈ B*, the definition is the same except that U(p) must halt with output x.

Basic definitions

Definition. A sequence ω ∈ B^∞ is a computable binary sequence if there exists a program q ∈ B* that writes ω to a one-way output tape when run on a monotone universal Turing machine U. We denote the set of all computable sequences by C.

Definition. A computable binary predictor is a program p ∈ B* that computes a total function B* → B. Ideally a predictor's output should be the next symbol in the sequence, that is, p(x_{1:n}) = x_{n+1}.
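To make these definitions concrete, the sketches accompanying the following slides use a toy Python model: a sequence is a Python generator writing to a one-way output stream, and a predictor is an ordinary total function from a prefix to a guessed next bit. This representation is purely an illustrative assumption, not part of the formal model.

```python
from typing import Callable, Iterator

# A computable binary sequence, modelled as a generator that writes
# its symbols to a one-way "output tape" forever.
def alternating() -> Iterator[int]:
    bit = 0
    while True:
        yield bit
        bit ^= 1  # produces 0, 1, 0, 1, ...

# A computable binary predictor: a total function mapping a finite
# prefix x_{1:n} to a guess at the next symbol x_{n+1}.
Predictor = Callable[[list[int]], int]

def predict_alternating(prefix: list[int]) -> int:
    # For the alternating sequence, the next symbol is 0 when the
    # prefix length is even and 1 when it is odd.
    return len(prefix) % 2
```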

Predictability in the limit

We will place no limits on the predictor's computation time or storage capacity. Furthermore, we will only consider predictability in the limit:

Definition. We say that a predictor p can learn to predict a sequence ω := x_1 x_2 ... ∈ B^∞ if there exists m ∈ N such that ∀n ≥ m : p(x_{1:n}) = x_{n+1}.

Sets of predictors

We will focus on the set of all predictors that are able to predict some specific sequence ω, or all sequences in some set of sequences S.

Definition. Let P(ω) be the set of all predictors able to learn to predict ω. For a set of sequences S, let P(S) := ⋂_{ω∈S} P(ω).

Every computable sequence can be predicted

Lemma. ∀ω ∈ C, ∃p ∈ P(ω) : K(p) <+ K(ω).

In words: every computable sequence can be predicted by at least one predictor, and this predictor need not be significantly more complex than the sequence.

Proof sketch: Take the program q that generates the sequence ω and convert it into a predictor p that, for any input x_{1:n}, always outputs the (n+1)-th symbol of ω. Clearly p is not significantly more complex than q, and p correctly predicts ω (and only ω!).
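A minimal sketch of this conversion in the toy Python model (the real construction operates on programs for the universal machine, so this translation is only illustrative):

```python
import itertools
from typing import Callable, Iterator

def predictor_from_generator(gen: Callable[[], Iterator[int]]
                             ) -> Callable[[list[int]], int]:
    # Turn a sequence generator q into a predictor p: on input x_{1:n},
    # ignore the input's content and output the (n+1)-th symbol of ω.
    def p(prefix: list[int]) -> int:
        n = len(prefix)
        return next(itertools.islice(gen(), n, n + 1))
    return p
```

Since p is just q plus a constant amount of wrapper code, K(p) <+ K(ω), as the lemma requires.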

Simple predictors for complex strings

Lemma. There exists a predictor p such that ∀n ∈ N, ∃ω ∈ C : p ∈ P(ω) and K(ω) > n.

In words: there exists a predictor that is able to learn to predict some sequences of arbitrarily high complexity.

Proof sketch: A predictor that always predicts 0 can predict any sequence ω of the form x0^∞, no matter how complex x ∈ B* is. In a sense, ω becomes a simple sequence once it has converged to 0. Such convergence is not necessary, however: consider a prediction program that detects when the input sequence is a repeating string and then predicts accordingly. Clearly, some sequences with arbitrarily high tail complexity can also be predicted by simple predictors.
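As a toy illustration of the second construction, here is a repetition-detecting predictor in the same Python model (an illustrative stand-in for the program described above, not the formal construction):

```python
def predict_repetition(prefix: list[int]) -> int:
    # Find the shortest period consistent with the entire prefix and
    # predict that the repetition continues. For a purely periodic
    # sequence, the detected period stabilises at the true period once
    # the prefix is long enough, so prediction succeeds in the limit.
    n = len(prefix)
    for period in range(1, n + 1):
        if all(prefix[i] == prefix[i % period] for i in range(n)):
            return prefix[n % period]
    return 0  # empty prefix: default guess
```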

There is no universal computable predictor

Lemma. For any predictor p there exists a computable sequence ω := x_1 x_2 ... ∈ C such that ∀n ∈ N : p(x_{1:n}) ≠ x_{n+1}, and K(ω) <+ K(p).

In words: for every computable predictor there exists a computable sequence which it cannot predict at all; furthermore, this sequence need not be significantly more complex than the predictor.

Proof sketch: For any prediction program p we can construct a sequence generation program q that always outputs the opposite of what p predicts given the sequence so far. Clearly p will always mispredict this sequence, and q is not much more complex than p.
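The diagonalization is easy to express in the same toy model (in the actual proof, q is constructed from p's program on the universal machine):

```python
from typing import Callable, Iterator

def adversarial_sequence(p: Callable[[list[int]], int]) -> Iterator[int]:
    # The generator q of the proof: at every step, emit the opposite of
    # whatever p guesses from the sequence generated so far, so that p
    # mispredicts every single symbol.
    history: list[int] = []
    while True:
        symbol = 1 - p(history)
        history.append(symbol)
        yield symbol
```

For example, `adversarial_sequence(predict_repetition)` produces a computable sequence that the repetition predictor from the previous sketch mispredicts at every step.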

Prediction of simple computable sequences

As there is no universal computable sequence predictor, a weaker goal is to be able to predict all simple computable sequences.

Definition. For n ∈ N, let C_n := {ω ∈ C : K(ω) ≤ n}.

Definition. Let P_n := P(C_n) be the set of predictors able to learn to predict all sequences in C_n.

A key question, then, is whether P_n ≠ ∅ for large n; that is, whether powerful predictors exist that can predict all sequences up to a high level of complexity.

Predictors for sets of bounded complexity sequences exist

Lemma. ∀n ∈ N, ∃p ∈ P_n : K(p) <+ n + O(log₂ n).

In words: prediction algorithms exist that can learn to predict all sequences up to a given complexity, and these predictors need not be significantly more complex than the sequences they can predict.

Proof sketch: Let h be the number of valid sequence generation programs of length up to n bits. Construct a predictor that, on input x_{1:k}, simulates all programs of length up to n bits until h of them have each produced an output of length k + 1. In the limit, only the h programs that are valid sequence generators will do this. Finally, predict according to the lexicographically first program that is consistent with the input string x_{1:k}. As h can be encoded in n + O(log₂ n) bits, the result follows.
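A sketch of this construction in the toy model. Here the finite list `programs` stands in for the enumeration of all programs of length up to n bits in lexicographic order, and h is supplied directly; we also assume each program either yields another symbol or halts, whereas genuine dovetailing would interleave single machine steps.

```python
from typing import Callable, Iterator, Sequence

def bounded_predictor(programs: Sequence[Callable[[], Iterator[int]]],
                      h: int) -> Callable[[list[int]], int]:
    def predict(prefix: list[int]) -> int:
        k = len(prefix)
        outputs: list[list[int]] = [[] for _ in programs]
        iters = [prog() for prog in programs]
        alive = set(range(len(programs)))
        # Dovetail: extend each still-running program one symbol at a
        # time until h of them have produced at least k+1 symbols.
        while sum(len(out) >= k + 1 for out in outputs) < h:
            for i in list(alive):
                if len(outputs[i]) >= k + 1:
                    continue
                try:
                    outputs[i].append(next(iters[i]))
                except StopIteration:
                    alive.discard(i)  # program halted: not a generator
        # Predict with the lexicographically first consistent program.
        for out in outputs:
            if len(out) >= k + 1 and out[:k] == prefix:
                return out[k]
        return 0  # unreachable when the true generator is enumerated
    return predict
```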

Can we do better?

Do there exist simple predictors that can predict all sequences up to a high level of complexity?

Theorem. ∀n ∈ N : p ∈ P_n ⇒ K(p) >+ n.

In words: if a predictor can predict all sequences up to a complexity of n, then the complexity of the predictor must be at least n.

Proof sketch: Follows immediately from the previous result that a simple predictor must fail on some equally simple sequence.

Note: This result holds for any measure of complexity for which the inversion of a single bit is an inexpensive operation.

Complexity of prediction

The previous results suggest the following definition.

Definition. K̇(ω) := min_{p ∈ B*} { |p| : p ∈ P(ω) }.

In words: the K̇ complexity of a sequence is the length of the shortest program able to learn to predict the sequence. It can easily be seen that K̇ has the same invariance to the choice of reference universal Turing machine as Kolmogorov complexity. We also generalise this definition to sets of sequences.

Previous results written in terms of K̇ complexity

We can now rewrite our previous results more succinctly: ∀ω ∈ C : 0 ≤ K̇(ω) <+ K(ω), and for sets of sequences of bounded complexity, ∀n ∈ N : n <+ K̇(C_n) <+ n + O(log₂ n).

In words: the simplest predictor capable of predicting all sequences up to a Kolmogorov complexity of n has itself a Kolmogorov complexity of roughly n.

Do some individual sequences demand complex predictors?

Or, more formally: does there exist ω such that K̇(ω) ≈ K(ω)?

Theorem. ∀n ∈ N, ∃ω ∈ C : n <+ K̇(ω) <+ K(ω) <+ n + O(log₂ n).

In words: for all degrees of complexity, there exist sequences where the simplest predictor able to learn to predict the sequence is about as complex as the sequence itself.

Proof sketch: Essentially, we create a meta-predictor that simulates all predictors of complexity less than n. We then show that there exists a sequence ω which the meta-predictor cannot predict, and therefore neither can any predictor of complexity less than n; that is, K̇(ω) >+ n. The remainder of the proof mostly follows from earlier results.

What properties do high K̇ sequences have?

If a program q generates ω, let t_q(n) be the number of computation steps performed by q before the n-th symbol of ω is output.

Lemma. ∀ω ∈ C, if ∃q : U(q) = ω and ∃r ∈ N, ∀n > r : t_q(n) < 2^n, then K̇(ω) =+ 0.

In words: if a sequence can be computed in a reasonable amount of time, then the sequence must have a low K̇ complexity.

Proof sketch: Construct a predictor that, on input x_{1:n}, simulates all programs of length n or less for 2^{n+1} steps each, then predicts according to the lexicographically first program whose output is consistent with x_{1:n}. So long as t_q(n) < 2^n for the true generator, in the limit this predictor must converge to the right model.
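A sketch of this predictor, assuming a hypothetical step-bounded interpreter `run(program, steps)` that returns whatever output the given bit-string program has produced within the given number of steps of the reference machine U (this interpreter is a placeholder, not a real API):

```python
import itertools
from typing import Callable, Sequence

def fast_sequence_predictor(run: Callable[[Sequence[int], int], list[int]]
                            ) -> Callable[[list[int]], int]:
    def predict(prefix: list[int]) -> int:
        n = len(prefix)
        # Enumerate all programs of length <= n bits in lexicographic
        # order, running each for 2^(n+1) steps.
        for length in range(n + 1):
            for program in itertools.product((0, 1), repeat=length):
                out = run(program, 2 ** (n + 1))
                if len(out) > n and out[:n] == prefix:
                    return out[n]  # first consistent program's next bit
        return 0  # no consistent fast program found yet
    return predict
```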

No elegant, powerful, constructive theory of prediction exists

A constructive theory of prediction T, expressed in some sufficiently rich formal system F, is in effect a description of a prediction program with respect to a universal Turing machine which implements the required parts of F. Thus we can re-express our previous results:

Corollary. For individual sequences with high K̇ complexity, and for the sets of all sequences of bounded Kolmogorov complexity, the predictive power of a constructive theory of prediction T is limited by K(T).

This is in marked contrast to Solomonoff's highly elegant but non-constructive universal theory of prediction.

Gödel incompleteness and prediction

Theorem. In any formal axiomatic system F that is sufficiently rich to express statements of the form "p ∈ P_n", there exists m ∈ N such that for all n > m and for all predictors p ∈ P_n, the true statement "p ∈ P_n" cannot be proven in F.

In words: even though we have proven that very powerful sequence prediction programs exist (∀n ∈ N : P_n ≠ ∅), beyond a certain complexity it is impossible to find any of these programs using mathematics. The proof has a similar structure to Chaitin's information-theoretic proof of Gödel incompleteness.

Prediction incompleteness proof

Proof sketch: Create a search program s that searches for a proof of a statement of the form "p ∈ P_n" and halts with output p if it finds such a proof. As s contains an encoding of n plus some constant-length code, we see that K(s) <+ O(log₂ n). Assume that s finds a proof and halts with output p. This means that s is an effective description of p, and thus K(p) <+ O(log₂ n). However, we have shown earlier that p ∈ P_n ⇒ K(p) >+ n. Thus for large n we have a contradiction! Therefore, for large n, the true statement "p ∈ P_n" cannot be proven.
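A sketch of the search program s, with both the proof enumerator and the proof checker left as hypothetical placeholders (any concrete formal system F would supply them):

```python
from typing import Callable, Iterator, Optional

def search_program(n: int,
                   enumerate_proofs: Callable[[], Iterator[str]],
                   extract_predictor: Callable[[str, int], Optional[str]]
                   ) -> Optional[str]:
    # Enumerate all proofs in F; if some proof establishes "p ∈ P_n",
    # halt with output p. The program itself only needs to encode n,
    # which is why K(s) <+ O(log₂ n).
    for proof in enumerate_proofs():
        p = extract_predictor(proof, n)  # p, if the proof shows p ∈ P_n
        if p is not None:
            return p  # s halts: s is now an effective description of p
    return None  # never reached if such a proof exists
```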

Summary